COSMEA(1)                    OGMP SEQUENCE UTILITIES                  COSMEA(1)

NAME
  cosmea - create a new masterfile from a phredPhrap dump and the previous
           masterfile in the project cycle

SYNOPSIS
   cosmea [-w num] [-t] [-n numloc] [-c context_length] [-o ocode] [-C file]
          [-l num] [-L] [-N] [-H] [-d dump] [-m | mf]

DESCRIPTION     
  The previous mf2stad and stad2mf system for annotation maintenance became
  too heavy as sequencing projects grew large. Cosmea replaces their
  functionality with a new system that stores the annotation positions of the
  masterfile present at a particular cycle of sequencing in an annotation
  database and reinserts them in the masterfile created after the next cycle.

  Note that cosmea has a different behaviour depending on the arguments passed
  on the command line:

    1. If no dump is passed on the command line (via -d, see OPTIONS), cosmea
       looks in the current directory for an ace file (*.ace.1) and creates a
       dump from it with the appropriate tool (consed2dump).

    2. If there is no masterfile on the command line, cosmea will look in the
       current directory for a masterfile whose name is listed in
       /share/supported/apps/ogmp/lib/genome_name.lst (see DEPENDENCIES). If
       one is found, it will assume that the user wants the new version of
       this particular masterfile. If more than one is found, cosmea will
       abort. If none is found, cosmea will assume that only the dump should
       be used to create the masterfile. To use only the dump even if a known
       masterfile exists in the current directory, use -m (see OPTIONS).

    3. If the organism code is not supplied, cosmea will try to guess it from
       any of these:
 
         - the masterfile passed on the command line, or the one found in step
           2 if none is supplied on the command line
         - the dump passed on the command line, or the one created in step 1 if
           none is passed on the command line

  Generally, if the masterfile on which cosmea is run is listed in
  /share/supported/apps/ogmp/lib/genome_name.lst, and if there is a known
  phredPhrap project in the current directory, the user is only required to
  type "cosmea" and all the necessary parameters will be guessed automatically
  by the program. 
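
  The masterfile lookup described in step 2 can be sketched as follows. This
  is only an illustration in Python, not cosmea's own code: the function name
  pick_masterfile is hypothetical, and known_names stands in for the names
  listed in genome_name.lst, whose format is not described here.

```python
import os

def pick_masterfile(directory, known_names):
    """Return the single known masterfile found in `directory`, or None.

    `known_names` is an assumed stand-in for the masterfile names listed in
    genome_name.lst; parsing that file is deliberately not shown.
    """
    found = sorted(name for name in os.listdir(directory) if name in known_names)
    if len(found) > 1:
        # More than one candidate: abort, as the description states.
        raise SystemExit("more than one known masterfile: " + ", ".join(found))
    # None means: no known masterfile, so only the dump is used.
    return found[0] if found else None
```

  With -m, this lookup would simply be skipped and the dump used alone.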

  Once the masterfile is generated, cosmea will look in the project directory
  (as listed in genome_name.lst) to see if there is already a version of the
  masterfile. For the slab, the project directory is assumed to be
  /research/Seqbanks. For langlab, the project directory is assumed to be
  <PATH>/edit_dir (where <PATH> is the project directory path listed
  in genome_name.lst). For any other lab, the project directory is assumed
  to be the project path listed in genome_name.lst. If there is a version of
  the masterfile in the project directory, it is updated with the one cosmea
  just created. A backup is also created in <PATH>/backup_mf (where <PATH> is
  the project directory path listed in
  /share/supported/apps/ogmp/lib/genome_name.lst). If the backup directory does
  not exist, cosmea will create it automatically.
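
  The directory rules above can be summarised in a short Python sketch. The
  function names and the lab argument are hypothetical; how cosmea actually
  determines the lab is not stated here, and project_path stands for the
  <PATH> value listed in genome_name.lst.

```python
import posixpath

def masterfile_directory(lab, project_path):
    """Directory in which an existing version of the masterfile is looked up."""
    if lab == "slab":
        return "/research/Seqbanks"
    if lab == "langlab":
        return posixpath.join(project_path, "edit_dir")
    # Any other lab: the path listed in genome_name.lst is used as-is.
    return project_path

def backup_directory(project_path):
    """Directory that receives the backup copy (created if it is missing)."""
    return posixpath.join(project_path, "backup_mf")
```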

  The user should run cleanmf on the input masterfile (if one is supplied) to
  be sure that its syntax is valid. Cosmea uses the readings as markers and
  records the position of every annotation relative to these markers. After
  the dump is read, it computes the new position of every annotation from the
  markers and their new locations. Cosmea then tries to match the annotation's
  previous context with the context the annotation would have if it were
  inserted at the new position. If the context does not match right away,
  cosmea tries inserting the annotation further up and further down (by at most
  5 bases in each direction by default; configurable via -w, see OPTIONS) to
  see if it can find a matching context. This comparison is case-insensitive,
  and the characters "n" and "N" count as wildcards (i.e. they match anything).
  If cosmea does not find a matching context and the annotation is not a track,
  it displays a warning that the annotation might have drifted and that the
  original context could not be found. This warning is also written to the
  mf.err file. If the original context is found, the annotation is inserted
  without warning at the position where the context matches. The newly created
  masterfile is always given the permissions -rw-r-----, if possible.
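
  The context search can be illustrated with the following Python sketch. It
  is not cosmea's implementation. All names are hypothetical, and three
  details are assumptions: offsets are tried nearest first (upstream before
  downstream at each distance), a wildcard on either side of the comparison
  matches anything, and the context is stored as separate left and right
  flanks.

```python
def bases_match(a, b):
    """Case-insensitive comparison; 'n'/'N' on either side matches anything."""
    if len(a) != len(b):
        return False
    for x, y in zip(a.lower(), b.lower()):
        if x != y and x != "n" and y != "n":
            return False
    return True

def context_at(seq, pos, length=20):
    """Return the (left, right) flanks of up to `length` bases around `pos`."""
    return seq[max(0, pos - length):pos], seq[pos:pos + length]

def find_insertion(seq, pos, old_left, old_right, window=5):
    """Return the position near `pos` whose context matches, or None."""
    offsets = [0]
    for distance in range(1, window + 1):
        offsets += [-distance, distance]  # assumed search order
    for offset in offsets:
        p = pos + offset
        if p < 0 or p > len(seq):
            continue
        left = seq[max(0, p - len(old_left)):p]
        right = seq[p:p + len(old_right)]
        if bases_match(left, old_left) and bases_match(right, old_right):
            return p
    return None  # caller would warn about a possible drift
```

  Here window plays the role of -w and length the role of -c. A None result
  corresponds to the case where a drift warning is issued.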

  Note: if a reading in the original masterfile is in a contig whose name has
  an extension (like "_mt" in "386gsz443_mt"), the extension will be appended 
  to the name of the contig in which the reading will go in the new masterfile.
  The "SPAM_DELETED" lines in the new masterfile will be updated accordingly.
  If there is a contig in the new masterfile that contains readings that 
  belonged to different contigs with different extensions, the extensions are
  concatenated and appended to the new contig name.
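
  As a rough illustration of the naming rule, the Python sketch below builds
  the new contig name. It rests on two assumptions that the text does not
  confirm: an extension starts at the first underscore of the contig name,
  and distinct extensions are concatenated in order of first appearance. The
  function names and the sample contig names other than "386gsz443_mt" are
  made up.

```python
def extension_of(contig_name):
    """Return the extension ('_mt' for '386gsz443_mt'), or '' if none."""
    _head, sep, tail = contig_name.partition("_")
    return sep + tail if sep else ""

def new_contig_name(base_name, old_contig_names):
    """Append every distinct extension of the old contigs to `base_name`."""
    extensions = []
    for name in old_contig_names:
        ext = extension_of(name)
        if ext and ext not in extensions:
            extensions.append(ext)
    return base_name + "".join(extensions)
```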


OPTIONS
  -c num
   Take num bases to the left (less if not enough) and num bases to the right
   (less if not enough) to record the context surrounding the annotation in the
   masterfile. The default is 20.

  -d dump
   Use an existing dumpfile 'dump' rather than creating a dumpfile from
   an existing Consed or Staden project.
   
  -o ocode
   Use ocode as the organism code for the masterfile. It is not necessary to
   specify this option if the masterfile passed on the command line is listed
   in the file /share/supported/apps/ogmp/lib/genome_name.lst. This option
   must be used if no masterfile is passed on the command line.

  -l num
   This value is the minimum length (in nucleotides) of continuous base-called
   sequence that is kept when singlets are cleaned. For example, a motif
   length of 50 would keep all stretches of sequence of at least 50
   nucleotides that are uninterrupted by an X or N. Sequences not fitting this
   criterion, and on the edge of the reading, will be trimmed. The default
   length is 20 (applied automatically when this option is not specified).

  -m
   This switch cannot be used if a masterfile is passed on the command line. It
   tells cosmea to ignore all masterfiles in the current directory and to use
   only the dump to create the masterfile (usually its first version).

  -n num
   Take num locators to the left (less if not enough) and num locators to the
   right (less if not enough) to record the position of an annotation in the
   masterfile. The default is 5.
   This parameter helps to reinsert annotations when moving between two
   different masterfiles. The $MIN_FAIL_PERC variable in cosmea is modified so
   that all annotations from a Staden masterfile can be reinserted into a
   phredPhrap masterfile: the threshold is decreased down to 1 (the default is
   33).

  -t
   Ordinarily, if a track's old context differs from the context obtained in the
   new masterfile, no warning is issued. This switch forces a warning to appear
   on the screen and in the masterfile, indicating a possible annotation drift
   due to a context change if this situation occurs.

  -w num
   When checking for a matching context, try num bases further up and num bases
   further down the annotation's position for a perfect match. The default is
   5.

  -C file
   This option tells cosmea to read an alternate consensus rule file, 'file'.
   If this option is not specified, cosmea checks for a file named
   .stad2mf.consensus in the current directory and uses it if found. If this
   file is not found, /share/supported/apps/ogmp/lib/stad2mf.consensus (the
   global default consensus file) is used. A description of the format of
   the rule file can be found in the global default consensus file.

  -L
   This option tells cosmea to perform one more operation during consensus
   calculation, after the rules from the consensus_rules file have been 
   applied. This operation is to convert to lowercase the consensus letter if
   the following two conditions are both met:
 
        - The consensus letter returned by the rule file is one of "G", "A",
          "T" or "C" (note: in UPPERCASE).
        - There is at least one of the following lowercase letters in the
          dump column: "g", "a", "t" or "c", OR there is at least one "N" in
          that dump column.
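
   The two conditions can be expressed as a small Python sketch. The function
   name is hypothetical, and column is assumed to be the string of characters
   found in one dump column. The sketch covers only the -L step, not the
   rule file that produces the consensus letter.

```python
def apply_lowercase_rule(consensus, column):
    """Lowercase the consensus letter when both -L conditions hold."""
    is_upper_base = consensus in {"G", "A", "T", "C"}
    # Any lowercase base, or any "N", in the dump column triggers the rule.
    has_doubt = any(ch in "gatcN" for ch in column)
    return consensus.lower() if is_upper_base and has_doubt else consensus
```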
 
  -N
   This option tells cosmea to modify its consensus calculation procedure as 
   follows: each "N" found in a dump column will count as 0.25 G, 0.25 A, 
   0.25 T and 0.25 C. The number of points awarded to "N"s will also be set to
   0 (zero). See the default consensus rule file for more information.
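
   The effect of -N on the per-column tally can be sketched in Python as
   follows. This shows only the counting step. The function name is
   hypothetical, folding lowercase bases into their uppercase counts is an
   assumption, and the points awarded by the rule file are not modelled.

```python
def base_counts(column, split_n=False):
    """Tally G/A/T/C in a dump column; with split_n, each "N" adds 0.25 to all."""
    counts = {"G": 0.0, "A": 0.0, "T": 0.0, "C": 0.0}
    for ch in column:
        if ch.upper() in counts:
            counts[ch.upper()] += 1.0  # assumed: case is ignored when counting
        elif split_n and ch == "N":
            for base in counts:
                counts[base] += 0.25
    return counts
```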

  -H
   When creating annotations for new gel readings, DO NOT use the
   headers file created by ogmp2stad(1), usually 
   /share/supported/apps/ogmp/lib/headers/ocode.headers, where ocode is the
   organism code (the headers file contains all the information in the header
   line of the gel reading).

FILES
  mf 
    The input masterfile
  mf.err
    Where the annotations that could not be inserted will go
  /share/supported/apps/ogmp/lib/stad2mf.consensus
    The default consensus file
  
DEPENDENCIES
  /share/supported/apps/ogmp/lib/perl/ogmp/ProjPack.pl
    To obtain a lot of information on the masterfile.
  /share/supported/apps/ogmp/bin/consed2dump
    To create the dump file from the ace file.
  /share/supported/apps/ogmp/lib/perl/ogmp/stadump2.pl
    To parse the dump.
  /share/supported/apps/ogmp/lib/perl/ogmp/stadcons2.pl
    To calculate the OGMPCONSENSUS from the dump.
  /share/supported/apps/ogmp/lib/genome_name.lst
    File containing information for some OGMP projects.
  /share/supported/apps/ogmp/lib/headers/ocode.header
    Reading headers file.

SEE ALSO
  stad2mf(1), mf2stad(1), consed2dump(1), addconsedread(1), assembly(1)

AUTHOR
  Specifications: Gertraud Burger and Franz Lang
  Program design: Pierre Rioux and Nicolas Brossard
  Programming   : Nicolas Brossard
                : Catherine Eng
  

Organelle Genome Megasequencing Project, FEB98/AUG04