COSMEA(1)                 OGMP SEQUENCE UTILITIES                 COSMEA(1)

NAME
       cosmea - create a new masterfile from a phredPhrap dump and the
       previous masterfile in the project cycle

SYNOPSIS
       cosmea [-w num] [-t] [-n numloc] [-c context_length] [-o ocode]
              [-C file] [-L] [-N] [-H] [-d dump] [-m | mf]

DESCRIPTION
       The previous mf2stad and stad2mf system for annotation maintenance
       became too cumbersome as sequencing projects grew large. Cosmea
       replaces their functionality with a new system that stores the
       annotation positions of a masterfile present at a particular cycle
       of sequencing in an annotation database and reinserts them in the
       masterfile created after the next cycle.

       Note that cosmea behaves differently depending on the arguments
       passed on the command line:

       1. If no dump is passed on the command line (via -d, see OPTIONS),
          cosmea looks in the current directory for an ace file *.ace.1
          and creates a dump with the appropriate tool (consed2dump).

       2. If there is no masterfile on the command line, cosmea looks in
          the current directory for a masterfile whose name is listed in
          /share/supported/apps/ogmp/lib/genome_name.lst (see
          DEPENDENCIES). If one is found, cosmea assumes that the user
          wants the new version of this particular masterfile. If more
          than one is found, cosmea aborts. If none is found, cosmea
          assumes that only the dump should be used to create the
          masterfile. To use only the dump even if a known masterfile
          exists in the current directory, use -m (see OPTIONS).

       3. If the organism code is not supplied, cosmea tries to guess it
          from any of these:

          - the masterfile passed on the command line, or the one found
            in step 2 if none is supplied on the command line

          - the dump passed on the command line, or the one created in
            step 1 if none is passed on the command line

       Generally, if the masterfile on which cosmea is run is listed in
       /share/supported/apps/ogmp/lib/genome_name.lst, and if there is a
       known phredPhrap project in the current directory, the user only
       needs to type "cosmea" and all the necessary parameters are
       guessed automatically by the program.

       Once the masterfile is generated, cosmea looks in the project
       directory (as listed in genome_name.lst) to see if there is
       already a version of the masterfile. For the slab, the project
       directory is assumed to be /research/Seqbanks. For langlab, the
       project directory is assumed to be the /edit_dir subdirectory of
       the project directory path listed in genome_name.lst. For any
       other lab, the project directory is assumed to be the project
       path listed in genome_name.lst.

       If there is a version of the masterfile in the project directory,
       it is updated with the one cosmea just created. A backup is also
       created in the /backup_mf subdirectory of the project directory
       path listed in /share/supported/apps/ogmp/lib/genome_name.lst. If
       the backup directory does not exist, cosmea creates it
       automatically.

       The user should run cleanmf on the input masterfile (if one is
       supplied) to be sure that its syntax is valid.

       Cosmea uses the readings as markers and records the position of
       all annotations relative to these markers. After the dump is
       read, it computes the new position of every annotation using the
       markers and their new locations. Cosmea then tries to match the
       annotation's previous context with the context the annotation
       would have if it were inserted at the new position.
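
       The context check can be illustrated with a minimal Python sketch
       that also covers the case-insensitive comparison, the n/N
       wildcards and the -w search window described in this section. It
       is an illustration only, not cosmea's own code: the function
       names, the 0-based coordinates, the convention that an annotation
       sits between two bases, and the order in which nearby positions
       are tried are all assumptions.

```python
from typing import Optional


def bases_match(a: str, b: str) -> bool:
    """Case-insensitive comparison; 'n'/'N' on either side matches anything."""
    a, b = a.upper(), b.upper()
    return a == "N" or b == "N" or a == b


def context_matches(old: str, new: str) -> bool:
    """True if both contexts have the same length and match base by base."""
    return len(old) == len(new) and all(
        bases_match(x, y) for x, y in zip(old, new)
    )


def find_insertion(seq: str, expected: int, old_left: str, old_right: str,
                   window: int = 5) -> Optional[int]:
    """Return a position whose surrounding context matches, or None.

    `expected` stands for the position computed from the markers. The
    annotation is assumed to sit between seq[pos - 1] and seq[pos].
    """
    # Assumed search order: 0, -1, +1, -2, +2, ... up to the window size.
    offsets = [0]
    for d in range(1, window + 1):
        offsets += [-d, d]
    for off in offsets:
        pos = expected + off
        if not 0 <= pos <= len(seq):
            continue
        left = seq[max(0, pos - len(old_left)):pos]
        right = seq[pos:pos + len(old_right)]
        if context_matches(old_left, left) and context_matches(old_right, right):
            return pos
    return None  # a caller would then warn about a possible drift


# Two bases were inserted upstream, so the annotation moved from 6 to 8.
new_seq = "TTAAACCnGGGTTT"
print(find_insertion(new_seq, 6, "CCC", "GGG"))            # 8
print(find_insertion(new_seq, 6, "CCC", "GGG", window=1))  # None
```

       With the default window of 5 the annotation is found two bases
       further along; with a window of 1 no matching context exists, which
       corresponds to the case in which cosmea reports a possible drift.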

       If the context does not match right away, cosmea tries to insert
       the annotation up to 5 bases further up and up to 5 bases further
       down (the distance is configurable via -w, see OPTIONS) to see if
       it can find a matching context. Note that this comparison is
       case-insensitive and that the characters "n" and "N" count as
       wildcards (i.e. they match anything). If cosmea does not find a
       matching context and the annotation is not a track, it displays a
       warning that the annotation might have drifted and indicates that
       it could not find the original context. This warning also goes in
       the mf.err file. If the original context can be found, the
       annotation is inserted without warning at the position where the
       context matches.

       The newly created masterfile is always given the permissions
       -rw-r-----, if possible.

       Note: if a reading in the original masterfile is in a contig whose
       name has an extension (like "_mt" in "386gsz443_mt"), the
       extension is appended to the name of the contig in which the
       reading goes in the new masterfile. The "SPAM_DELETED" lines in
       the new masterfile are updated accordingly. If a contig in the new
       masterfile contains readings that belonged to different contigs
       with different extensions, the extensions are concatenated and
       appended to the new contig name.

OPTIONS
       -c num
              Take num bases to the left (fewer if not enough) and num
              bases to the right (fewer if not enough) to record the
              context surrounding the annotation in the masterfile. The
              default is 20.

       -d dump
              Use an existing dumpfile 'dump' rather than creating a
              dumpfile from an existing Consed or Staden project.

       -o ocode
              Use ocode as the organism code for the masterfile. It is
              not necessary to specify this option if the masterfile
              passed on the command line is listed in the file
              /share/supported/apps/ogmp/lib/genome_name.lst. This option
              is mandatory if no masterfile is passed on the command
              line.

       -l num
              Minimum length (in nucleotides) of a continuous base-called
              sequence that is kept when singlets are cleaned. For
              example, a motif length of 50 keeps every stretch of at
              least 50 bases that is uninterrupted by an X or N.
              Stretches that do not meet this criterion and lie at the
              edge of the reading are trimmed. The default length is 20
              (applied automatically when this option is not specified).

       -m     This switch cannot be used if a masterfile is passed on the
              command line. It tells cosmea to ignore all masterfiles in
              the current directory and to use only the dump to create a
              (usually first version of the) masterfile.

       -n num
              Take num locators to the left (fewer if not enough) and num
              locators to the right (fewer if not enough) to record the
              position of an annotation in the masterfile. The default is
              5. This parameter helps to reinsert annotations between two
              different masterfiles. The $MIN_FAIL_PERC variable inside
              cosmea is modified so that all annotations from a Staden
              masterfile can be reinserted into a phredPhrap masterfile:
              the threshold is decreased down to 1 (the default is 33).

       -t     Ordinarily, if a track's old context differs from the
              context obtained in the new masterfile, no warning is
              issued. This switch forces a warning to appear on the
              screen and in the masterfile, indicating a possible
              annotation drift due to a context change, if this situation
              occurs.

       -w num
              When checking for a matching context, try up to num bases
              further up and num bases further down from the annotation's
              position for a perfect match. The default is 5.

       -C file
              Read file as an alternate consensus rule file. If this
              option is not specified, cosmea checks for a file
              .stad2mf.consensus in the current directory and uses it if
              it finds it. If this file is not found,
              /share/supported/apps/ogmp/lib/stad2mf.consensus (the
              global default consensus file) is used. A description of
              the format of the rule file can be found in the global
              default consensus file.

       -L     Perform one more operation during consensus calculation,
              after the rules from the consensus rule file have been
              applied. This operation converts the consensus letter to
              lowercase if both of the following conditions are met:

              - The consensus letter returned by the rule file is one of
                "G", "A", "T" or "C" (note: in UPPERCASE).

              - There is at least one of the following (lowercase)
                letters in the dump column: "g", "a", "t" or "c", OR
                there is at least one "N" in that dump column.

       -N     Modify the consensus calculation procedure as follows: each
              "N" found in a dump column counts as 0.25 G, 0.25 A, 0.25 T
              and 0.25 C. The number of points awarded to "N"s is also
              set to 0 (zero). See the default consensus rule file for
              more information.

       -H     When creating annotations for new gel readings, DO NOT use
              the headers file created by ogmp2stad(1), usually
              /share/supported/apps/ogmp/lib/headers/ocode.headers where
              ocode is the organism code (the headers file contains all
              the information in the header line of the gel reading).

FILES
       mf     The input masterfile.

       mf.err Where the annotations that could not be inserted go.

       /share/supported/apps/ogmp/lib/stad2mf.consensus
              The default consensus file.

DEPENDENCIES
       /share/supported/apps/ogmp/lib/perl/ogmp/ProjPack.pl
              To obtain a lot of information on the masterfile.

       /share/supported/apps/ogmp/bin/consed2dump
              To create the dump file from the ace file.

       /share/supported/apps/ogmp/lib/perl/ogmp/stadump2.pl
              To parse the dump.

       /share/supported/apps/ogmp/lib/perl/ogmp/stadcons2.pl
              To calculate the OGMPCONSENSUS from the dump.

       /share/supported/apps/ogmp/lib/genome_name.lst
              File containing information for some OGMP projects.

       /share/supported/apps/ogmp/lib/headers/ocode.header
              Reading headers file.
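
NOTES
       The two consensus adjustments described under -L and -N in OPTIONS
       can be illustrated with a short Python sketch. It is an
       illustration only: the function names are hypothetical, a dump
       column is modelled as a plain string of characters, and the rest
       of the consensus calculation, which depends on the rule file, is
       not reproduced here.

```python
def apply_lowercase_rule(consensus: str, column: str) -> str:
    """Sketch of -L: lowercase an uppercase G/A/T/C consensus letter when the
    dump column holds a lowercase g/a/t/c or an uppercase N."""
    is_plain_base = consensus in {"G", "A", "T", "C"}
    has_trigger = any(ch in "gatcN" for ch in column)
    return consensus.lower() if is_plain_base and has_trigger else consensus


def n_contribution(column: str) -> dict:
    """Sketch of -N: each "N" in the column adds 0.25 to each of G, A, T, C.

    Only the share contributed by the "N"s is computed; how the other
    letters are weighted is left to the rule file. Treating only the
    uppercase "N" this way is an assumption taken from the option text.
    """
    weights = {"G": 0.0, "A": 0.0, "T": 0.0, "C": 0.0}
    for ch in column:
        if ch == "N":
            for base in weights:
                weights[base] += 0.25
    return weights


print(apply_lowercase_rule("G", "GGg"))  # g
print(apply_lowercase_rule("G", "GGG"))  # G
print(n_contribution("GNNa"))
```

       In the first call the lowercase "g" in the column triggers the
       conversion; in the second the column holds only uppercase bases, so
       the consensus letter is left unchanged.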

SEE ALSO
       stad2mf(1), mf2stad(1), consed2dump(1), addconsedread(1),
       assembly(1)

AUTHOR
       Specifications: Gertraud Burger and Franz Lang
       Program design: Pierre Rioux and Nicolas Brossard
       Programming   : Nicolas Brossard
                     : Catherine Eng

Organelle Genome Megasequencing Project, FEB98/AUG04