CLEANMF(1)                OGMP SEQUENCE UTILITIES                CLEANMF(1)

NAME
       cleanmf - interactively clean the contents of a masterfile

SYNOPSIS
       cleanmf [-t distance] [-c clone_file] [-B] [-f] [-o ocode] [-Q] [-l]
               [-s nb_spaces] [-g gcode] [-G] [-d] masterfile

DESCRIPTION
       Cleanmf is a tool used to interactively validate a given masterfile.
       It provides a one-line editor with which the user can edit the
       incorrect lines of the masterfile detected by the program. For a
       complete description of the masterfile format rules, see mf(5).

       Cleanmf starts by executing:

              chkogmp -m masterfile

       (for a complete description of this function see chkogmp(5)). If an
       error is found, the process stops and the user is warned. Otherwise,
       the validation continues.

       Each time the program finds an incorrect line in the masterfile, the
       user is asked whether this particular line should be corrected, and
       cleanmf suggests a possible correction that the user may edit. The
       program supplies a one-line editor called editline. (See editline(5)
       for a complete description of editline's features, or type ^K in the
       one-line editor for a summary of the available commands.)

       Cleanmf's validation is divided into two parts. In the first part,
       the program identifies a defined set of errors for which it suggests
       corrections. The second part identifies more complex errors for which
       it does not offer corrections.

       Errors detected in part I:
       --------------------------

       1. General inconsistencies:

          * If there is a continuation character '\' at the end of a line
            that doesn't start with a ";", cleanmf will prompt the user to
            verify whether or not it can remove this illegal character.

          * Cleanmf will warn the user if, after a line that starts with
            ';' and ends with '\', it finds a line having a pattern common
            to gel, track, gene start, or gene end annotations. This
            indicates that the continuation character should probably not
            be there.
          * If there is a continuation character on a given line, and the
            line that follows doesn't start with ';', cleanmf will remove
            the '\' and inform the user that it did so.

          * Cleanmf reports incorrect annotations and therefore tries to
            determine the type of annotation (gene start, gene end, ...).
            For each invalid annotation, a valid annotation is suggested by
            cleanmf and the user will be prompted to edit or accept it.
            Invalid gene annotations are reported in part II.

       2. Contig inconsistencies:

          * All the masterfile's contig names have to be different. If
            cleanmf finds two identical contig names, it will prompt the
            user to change one of them and suggest a valid new name for the
            invalid one.

          * If there is an invalid character in one of the contig's
            sequences (i.e. not in the set [agctAGCTnN0-9!]), cleanmf will
            terminate, printing an error message.

          * If the last line of a contig starts with a ";" and ends with a
            continuation character, cleanmf will prompt the user to verify
            that it can remove this illegal character.

          * If cleanmf finds a contig that starts with a line whose first
            character is not ">", cleanmf ends immediately.

       3. Track inconsistencies:

          * A warning will be issued if cleanmf finds a mismatch in the
            directions of a track/reading pair.

          * If cleanmf finds a potential track that lacks the T-, the user
            will be prompted to add this prefix. Such potential tracks are
            identified by starting with a ";" and containing an arrow with
            one or more asterisks appended at the blunt end of the arrow.

          * In a track comment, the strings "upstream" and "downstream" are
            replaced by "from here". The user will be prompted in each case
            to accept the suggested change.

          * Cleanmf will also change the location of a given track if this
            is defined by its offset ("Starts xy bases from here") and the
            corresponding contig grows in the direction that the track
            covers.
            All track annotations are migrated as far as possible into
            newly assembled sequence and their coordinate information is
            updated (e.g. if a track annotation says "starts 50 bp from
            here" and subsequently these 50 bp are sequenced, then the
            annotation is migrated 50 bp in the direction of the arrow and
            the "starts 50 bp from here" is deleted from the comment line.
            If, on the other hand, only 20 new bases of sequence have been
            added, the comment is moved to the end of the contig and the
            comment line changes to read "starts 30 bp from here").

          * Cleanmf provides a "de-track" feature. When cleanmf finds a
            track annotation of a clone that has been sequenced completely,
            it will eliminate the track annotation and warn the user. There
            are some exceptions for which this deletion will not take
            place:

            i)  When the reading has been sequenced with the reverse
                primer.

            ii) When the track is too far away from the reading (see the
                "-t" command line switch).

       Errors detected in part II:
       ---------------------------

       1. Complete validation of gene annotations:

          * Cleanmf will verify that all gene annotations are valid
            according to the rules encoded in gene_name.pl. For a complete
            description of a gene annotation's expected format, consult:

            http://megasun.bch.umontreal.ca/People/lang/ogmp-mf/intro.html

       2. Drifted annotation warning removal:

          * Cleanmf will automatically detect the warnings associated with
            annotations that may have drifted (in the same contig), as in:

            26269 CTTGAAATG
            ;;WARNING: This annotation may have drifted.
            ; T-rs1643 (0.8) <==* i290
            ;;WARN:>NTTTGTGGTTAGGGGTGAAAGGTCAAACAAAATTGGTGATAGCTGGTTTCTTGCGAAAA
            26278 ATTTGTGGTTAGGGGTGAAAGGTCAAACAAAATTGGTG

            When such lines are found, cleanmf will ask the user whether
            to delete them.

       3. Check consistency of gene start and gene end annotations:

          i)   Check that for each gene start annotation there is a
               matching gene end annotation and vice-versa.
          ii)  Check that every gene start and gene end annotation is
               unique in a given contig.

          iii) Check that for each matching "gene start" and "gene end"
               annotation, the direction of the arrow is identical for
               both.

          iv)  Check that for each matching "gene start" and "gene end"
               annotation, the gene start is upstream from the gene end
               annotation.

          Note that the behavior of cleanmf changes depending on whether
          the masterfile is a finished project or an unfinished project.
          Cleanmf uses the "-f" switch to distinguish between the two (see
          OPTIONS). Test i) is not done if the project is finished. Also,
          if test iv) fails for an unfinished project, cleanmf reports an
          error, whereas if the test fails for a finished project, cleanmf
          only warns the user. This is to support the circular structure
          of a completely sequenced genome.

       4. Validation of clone sizes:

          Cleanmf will validate the sizes of the clones that appear in the
          masterfile. In order to do so, it uses the clonelist file
          supplied as argument (see the "-c" option). If no clonelist file
          is supplied on the command line, cleanmf will prompt the user for
          a valid clonelist filename (RETURN for no clonelist file).

          Cleanmf first verifies that all sizes indicated in the masterfile
          are syntactically valid (a valid size is a digit followed by a
          dot and from one to three digits, or a set of blanks). The
          following steps are then performed:

          i)   Cleanmf will add in the masterfile all the missing sizes,
               according to the sizes it reads in the clonelist file (if
               any, otherwise an empty set of braces is added).

          ii)  If a clonelist file was specified, cleanmf will issue a
               warning for each clone whose size in the clone file is
               different from the size in the masterfile.

          iii) If a clonelist file was specified, cleanmf will also issue
               a warning for each clone whose size is specified in the
               masterfile but not in the clone file.

       5. Check consistency of multiple reading annotations of a clone:

          If there is more than one reading annotation for a given clone,
          cleanmf will verify that the sizes indicated in both annotations
          agree. If they do, it will also verify that the start/end of the
          insert inferred from the indicated size do not differ by more
          than 20% of the size for both annotations. These tests rely
          heavily on /share/supported/apps/ogmp/lib/spam/spam.offsets (see
          FILES) to compute the start/end of an insert from a reading
          annotation's text and its position in the masterfile.

       6. Check the position of gene annotations relative to one another:

          Several rules have been implemented in cleanmf to verify the
          embedding of gene annotations:

          - If two annotations are not associated with the same gene, they
            are totally independent in the sense that there might be
            annotations of gene X right between the beginning and end of
            gene Y, or just after, or before, etc.

          - All the annotations relative to a given gene have to be in the
            same direction. This is checked only inside every contig, not
            globally.

          - There is no restriction imposed on the first annotation that
            can appear in a given contig. As long as it is syntactically
            valid, cleanmf accepts it.

          - Following an annotation "G-xxx-yyy-zzz ==> start", there can
            be:

            i)  The annotation "G-xxx-yyy-zzz ==> end"; or

            ii) A syntactically correct annotation of the form
                "G-xxx-yyy-zzz-www ==> start" where "www" is a valid
                subpart of "zzz" (note that these can be "Mot-", "Sig-").

          - Following an annotation "G-xxx-yyy-zzz ==> end", there can be:

            i)  A syntactically correct annotation of the form
                "G-xxx-yyy-www ==> start"

            ii) The annotation "G-xxx-yyy ==> end"

          (Note that there are corresponding rules for backward oriented
          annotations.)

       7. Exons, introns and fragments related tests:

          - The exon-intron-...-exon structure must be respected inside
            each contig.

          - The numbering of the exons and introns must be coherent (i.e.
            exon1-intron1-exon2-intron2-...).
            Cleanmf now supports the special exon and intron number "0"
            which is used when the exon or intron number was not known at
            the time the masterfile was produced (see part 8 though).

          - The numbering of the fragment parts must also be coherent.

          - A genetic element cannot have fragments and an exon-intron
            structure.

       8. Tests performed for finished projects:

          This section describes the tests that are performed only for
          finished projects (using the "-f" switch, see OPTIONS):

          - Exons and introns cannot be numbered "0".

          - Every gene start must have a matching gene end annotation and
            vice-versa.

          - The masterfile cannot contain "parts" annotations (-P).

          - The first exon in every exon-intron-...-exon structure must be
            exon "1".

       9. Qualifier checking:

          Cleanmf will validate all the qualifiers in every gene
          annotation. For every faulty annotation, an error message will
          be printed on screen and in the file masterfile.out. Cleanmf will
          not give suggestions on how to correct these annotations. For a
          complete description of the expected syntax of these qualifiers
          consult the web page:

          http://megasun.bch.umontreal.ca/People/lang/ogmp-mf/intro.html

       All the errors for which cleanmf doesn't suggest valid corrections
       (e.g. all the logic errors, tracks not in the same direction as
       readings, etc.) are saved in the file masterfile.errors. This file
       should always be checked after the completion of cleanmf. The user
       must correct these errors in the masterfile by using an appropriate
       text editor.

       Each time cleanmf prompts the user for a correction, the following
       keystrokes are recognized:

       - Ctrl-Q: quit the program and save the changes automatically.

       - Ctrl-C: quit the program, but only after confirmation by the
         user. All changes to the masterfile are discarded.

       There are other special keystrokes that cleanmf recognizes. Ctrl-K
       will list them all when the line editor is active.

       If the user quits and saves the changes, cleanmf will first verify
       that there were indeed some changes.
       Then:

       - It produces a file (having the same name as the masterfile but
         with the extension .cleanlog) that records everything that
         changed (in a format similar to diff(1)). If there was no change
         to the masterfile, this file is not produced.

       - The old version of the masterfile is stored in a file having the
         same name as the original masterfile but with the extension .orig
         appended at the end. The "-B" switch overrides this behavior (see
         OPTIONS).

       - The new version of the masterfile (updated with the user's
         changes) will overwrite the old one.

       - If there were no significant changes, the user is notified by the
         program and the log file (.cleanlog) name is changed to
         .oldcleanlog (if there was indeed a cleanlog file).

       The number of lines that the user deleted or added is reported by
       the program. Cleanmf will produce an error message if it cannot
       create the log file, the backup file, or the new version of the
       masterfile.

       Cleanmf will also process the entire (new) masterfile and update the
       line numbers that have to be changed because some sequence
       information has been removed or added. It will also indent the
       masterfile properly. These changes are done automatically at the
       end of the program.

       10. Trna checking:

          Syntactically valid pairs of matching trna annotations undergo an
          additional check. Cleanmf will verify that the anticodon
          indicated in the annotation text matches the one surrounded by
          the anticodon markers (!). A warning will be issued in all these
          cases:

          - There is an anticodon in the annotation text but no markers
            appear in the sequence associated with the trna.

          - There is an anticodon in the annotation text but the number of
            markers in the sequence associated with the trna is not 2.

          - There is no anticodon in the annotation text (e.g. G-trnX) and
            there are markers in the sequence associated with the trna.

          Cleanmf will also verify that the anticodon in the annotation's
          text (if any) matches the amino acid beside it.
          This check is done according to the genetic code associated with
          the project. To find out how the genetic code is set, see the -g
          option. There are exceptions to this test:

          - If the amino acid is X, the anticodon can be any of the
            possible anticodons, including those with four nucleotides.
            trnX can also have no anticodon (e.g. "; G-trnX ==> start").

          - If the anticodon found in the trna annotation is 'cau' then
            'M' and 'I' are the only valid amino acids accepted for this
            annotation, no matter the genetic code used.

          Finally, cleanmf will issue a warning when it finds anticodon
          markers (!) that are not contained within a matching pair of
          valid trna annotations.

OPTIONS
       -t distance
              Use distance as the maximum distance that can separate a
              reading and a track in order for the track to be eligible
              for deletion. That is, if a track is more than distance base
              pairs away from the reading, this track will not be offered
              for deletion. The default value of this parameter is 150. It
              is irrelevant (i.e. not used) if the reading is of a sequence
              produced with the reverse primer.

       -c clonefile
              Use clonefile as the name of the clone list file that cleanmf
              must read. If this parameter is not supplied, cleanmf will
              prompt the user for a valid name.

       -B     Do not produce the backup file .orig. By default, a backup of
              the masterfile is produced.

       -f     Assume the masterfile describes a finished project. No
              checking is done by cleanmf to verify that this is indeed
              true.

       -o ocode
              Use ocode as the organism code used in the masterfile. Only
              letters are allowed in the organism code and it can be of
              arbitrary length.

              i)  If this flag is not used, cleanmf will set the ocode
                  according to the information stored in
                  /share/supported/apps/ogmp/lib/genome_name.lst (this file
                  contains, for each masterfile, the corresponding ocode).
                  If there is no ocode specified on the command line, and
                  no ocode associated with the masterfile in the file
                  /share/supported/apps/ogmp/lib/genome_name.lst, cleanmf
                  produces an error message and exits.

              ii) If this switch is used in conjunction with the "-l"
                  switch, then cleanmf will not verify that the organism
                  code does indeed exist in the file
                  /share/supported/apps/ogmp/lib/genome_name.lst.

       -Q     Quiet mode. Don't print the header indicating the current
              contig during editing.

       -l     Loose checking. If this switch is used without the "-o"
              switch, only gene annotations are checked and so the organism
              code is not needed (cleanmf will not even check that the
              masterfile passed on the command line exists in
              /share/supported/apps/ogmp/lib/genome_name.lst). Every
              annotation other than a "G-" annotation is ignored. Note that
              when this option is used, the original masterfile will not be
              modified by cleanmf in any way. Errors are reported in the
              ".errors" file.

              If the "-o" switch is used in conjunction with "-l", then
              cleanmf behaves as usual (i.e. all annotations are checked),
              but will not check whether the organism code supplied exists
              in /share/supported/apps/ogmp/lib/genome_name.lst.

       -s nb_spaces
              Put nb_spaces spaces between the leading ';' and the start of
              the gene name of every gene annotation. For example, with
              -s5, the gene annotation "; G-cox1 ==> start" would be
              reformatted as ";     G-cox1 ==> start". Note that nb_spaces
              must be positive.

       -g gcode
              Use gcode as the genetic code associated with the masterfile
              passed on the command line. Gcode must be a number. To see
              all the valid genetic codes, run gcshow(1). If this switch is
              not used, genome_name.lst is consulted in order to find the
              genetic code associated with the project. If there is none,
              or if the masterfile passed on the command line is not listed
              in genome_name.lst, genetic code 20 is used. This switch has
              an impact on trna validation. It is used for codon
              translation.

       -G     Loosen up the gene name validation.
              Do not validate that the gene names in the gene annotations
              exist in the gene_names.lst. With this option, the following
              annotations would be legal:

                     ; G-foobar ==> start
                     ; G-orf ==> start
                     ; G-unknow_trna ==> end

       -d     Do not issue a warning if a gene is found in two different
              contigs. The default is to issue such warnings.

FILES
       /share/supported/apps/ogmp/lib/genome_name.lst
              A description of the OGMP projects; accessed mainly to find
              out and verify organism codes for the sequences in OGMP
              format.

       /share/supported/apps/ogmp/lib/gene_name.lst
              A description of known mitochondrial genes.

       /share/supported/apps/ogmp/lib/spam/spam.offsets
              Used to obtain the offset to apply to an insert's start/end.

       /share/supported/apps/ogmp/lib/perl/ogmp/ProjPack.pl
              A perl library to read and parse the genome_name.lst file.

       /share/supported/apps/ogmp/lib/perl/ogmp/tRNA.pl
              This file is used to convert tRNA names.

       /share/supported/apps/ogmp/lib/perl/ogmp/gene_names.pl
              A perl library to read and parse the gene_name.lst file.

       /share/supported/apps/ogmp/lib/perl/ogmp/annotate.pl
              Used to annotate a sequence file stored in an array.

       /share/supported/apps/ogmp/lib/perl/ogmp/naming.pl
              Library used to verify the format of reading names.

       /share/supported/apps/ogmp/lib/perl/ogmp/gene.pl
              Library used to parse all the gene annotations.

       /share/supported/apps/ogmp/lib/perl/ogmp/qualifs.pl
              Library used to parse all the qualifier parts of gene
              annotations.

       /share/supported/apps/ogmp/lib/perl5/ogmp/clnfilepath.dta
              File that contains the default clone list file for various
              masterfiles.

SEE ALSO
       ogmp(5), chkogmp(5), mf(5), tracks(1), gcshow(1)

AUTHOR
       Pierre Rioux and Nicolas Brossard (version 3.0 and later)
       Tim Littlejohn (project management version 2.0 and earlier)
       Organelle Genome Megasequencing Project.
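
EXAMPLES
       The following Python sketch is not part of cleanmf and does not
       reproduce its implementation. It only illustrates, under stated
       assumptions, three rules described in this page: the clone size
       syntax, the allowed sequence character set [agctAGCTnN0-9!], and the
       migration of a track that "starts N bp from here" when new bases are
       added. All function names are hypothetical, and the handling of
       whitespace and of blank sizes is an assumption.

```python
import re

# Assumption: a size is one digit, a dot, then one to three digits,
# or a run of one or more blanks (e.g. "0.8" or "   ").
SIZE_RE = re.compile(r"\d\.\d{1,3}| +")

# Character set quoted in the contig checks: [agctAGCTnN0-9!].
SEQ_CHARS = set("agctAGCTnN0123456789!")


def is_valid_size(text):
    """Return True if text looks like a syntactically valid clone size."""
    return SIZE_RE.fullmatch(text) is not None


def invalid_sequence_chars(line):
    """Return the sorted characters of line that are outside the allowed set.

    Assumption: whitespace (e.g. between line number and bases) is ignored.
    """
    return sorted({ch for ch in line if not ch.isspace() and ch not in SEQ_CHARS})


def migrate_track(offset, new_bases):
    """Return (bases_moved, remaining_offset) for a track 'offset' bp from here.

    The track moves by at most its offset; whatever is left stays in the
    "starts N bp from here" comment (0 means the comment can be dropped).
    """
    moved = min(offset, new_bases)
    return moved, offset - moved
```

       For instance, migrate_track(50, 20) returns (20, 30), which
       corresponds to the "starts 30 bp from here" case, and
       migrate_track(50, 50) returns (50, 0), the case where the offset
       text is deleted.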