CLEANMF(1)         OGMP SEQUENCE UTILITIES             CLEANMF(1)

NAME
    cleanmf - interactively clean the contents of a masterfile

SYNOPSIS
    cleanmf [-t distance] [-c clone_file] [-B] [-f] [-o ocode] [-Q] [-l]
	[-s nb_spaces] [-g gcode] [-G] [-d] masterfile

DESCRIPTION
    Cleanmf is a tool used to interactively validate a given
    masterfile. It provides a one-line editor with which the user
    can edit the incorrect lines of the masterfile detected by 
    the program. For a complete description of the masterfile
    format rules, see mf(5). 
 
    Cleanmf starts by executing:

        chkogmp -m masterfile
  
    (for a complete description of this function see chkogmp(5)).
    If an error has been found, the process stops and the user is
    warned. Otherwise, the validation continues.

    Each time the program finds an incorrect line in the
    masterfile, the user is asked if this particular line should be
    corrected or not and cleanmf suggests a possible correction that 
    the user may edit.
    The program supplies a one-line editor called editline. 
    (See editline(5) for a complete description of editline's features 
    or type ^K in the one-line editor to have a summary of the 
    available commands).

    Cleanmf's validation is divided into two parts. In the first part, the
    program identifies a defined set of errors for which it suggests
    corrections. The second part identifies more complex errors for which it
    does not offer corrections.


    Errors detected in part I:
    -------------------------

    1. General inconsistencies:

       * If there is a continuation character '\' at the end of a 
         line that doesn't start with a ";", cleanmf will prompt 
         the user to verify whether or not it can remove this  
         illegal character. 

       * Cleanmf will warn the user if, after a line that starts with
         ';' and ends with '\', it finds a line having a pattern common
         to gel, track, gene start, or gene end annotations.
         This indicates that the continuation character should 
         probably not be there.

       * If there is a continuation character on a given line, and the 
         line that follows doesn't start with ';', cleanmf will remove 
         the '\' and inform the user that it did so.

       * Cleanmf reports incorrect annotations and therefore tries to 
         determine the type of annotation (gene start, gene end, ...). 
         For each invalid annotation, a valid annotation is suggested 
         by cleanmf and the user will be prompted to edit or accept it. 
         Invalid gene annotations are reported in part II.
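
       The continuation-character rules above can be sketched as follows.
       This is an illustration in Python, not cleanmf's own code; the
       function name check_continuations and the message texts are
       hypothetical, and the sketch only reports findings where cleanmf
       would prompt the user or remove the character.

```python
def check_continuations(lines):
    """Return (line_index, message) pairs for suspicious '\\' characters."""
    findings = []
    for i, line in enumerate(lines):
        if not line.rstrip().endswith("\\"):
            continue
        if not line.startswith(";"):
            # Rule 1: '\' on a line that does not start with ';'.
            findings.append((i, "continuation on a non-comment line"))
        elif i + 1 >= len(lines) or not lines[i + 1].startswith(";"):
            # Rule 3: the line after a continued line must start with ';'.
            findings.append((i, "continued line is not followed by a ';' line"))
    return findings


sample = [
    "; a comment that continues \\",
    "; onto this line",
    "ACGT \\",
    "; dangling \\",
    "ACGT",
]
for index, message in check_continuations(sample):
    print(index, message)
```

       In the sample, lines 2 and 3 (counting from zero) are reported;
       line 0 is a legal continuation.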


    2. Contig inconsistencies:

       * All the masterfile's contig names have to be different. 
         If cleanmf finds two identical contig names, it will
         prompt the user to change one of them and suggest a valid
         new name for the invalid one. 

       * If there is an invalid character in one of the contig's sequences
         (i.e. not in the set [agctAGCTnN0-9!]), cleanmf will terminate,
         printing an error message.

       * If the last line of a contig starts with a ";" and ends with a
         continuation character, cleanmf will prompt the user to verify that
         it can remove this illegal character.

       * If cleanmf finds a contig that starts with a line whose first
         character is not ">", cleanmf ends immediately.
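
       Two of the contig checks above (duplicate names and invalid
       sequence characters) can be sketched in Python as follows. The
       function names are hypothetical, and stripping whitespace before
       testing the character set is an assumption of this sketch, not
       something stated above.

```python
import re
from collections import Counter


def invalid_characters(sequence_line):
    """Return the characters of a sequence line outside [agctAGCTnN0-9!]."""
    compact = "".join(sequence_line.split())  # assumption: blanks are ignored
    return sorted(set(re.sub(r"[agctAGCTnN0-9!]", "", compact)))


def duplicate_names(contig_names):
    """Return contig names that occur more than once, in first-seen order."""
    counts = Counter(contig_names)
    return [name for name in counts if counts[name] > 1]


print(invalid_characters("26269  CTTGAAATG"))
print(invalid_characters("12 ACGTX!R"))
print(duplicate_names(["contig1", "contig2", "contig1"]))
```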


   3. Track inconsistencies:

      * A warning will be issued if cleanmf finds a mismatch in the 
        directions of a track/reading pair.

      * If cleanmf finds a potential track that lacks the T-, the user 
        will be prompted to add this suffix. Such potential tracks
        are identified by starting with a ";" and containing an 
        arrow with one or more asterisks appended at the blunt end 
        of the arrow.

      * In a track comment, the strings "upstream" and "downstream" are 
        replaced by "from here". The user will be prompted in each case to 
        accept the suggested change.

      * Cleanmf will also change the location of a given track if this is 
        defined by its offset ("Starts xy bases from here") and the 
        corresponding contig grows in the direction that the track covers.
        All track annotations are migrated as far as possible into 
        newly assembled sequence and their coordinate
        information updated (e.g. if a track annotation says "starts 50 bp 
        from here" and subsequently these 50 bp are sequenced, then the
        annotation is migrated 50 bp in the direction of the arrow and
        the "starts 50 bp from here" deleted from the comment line. If, on
        the other hand, only 20 new bases of sequence have been added, the
        comment is moved to the end of the contig and the comment line
        changes to read "starts 30 bp from here"). 

      * Cleanmf provides a "de-track" feature. When cleanmf finds a track 
        annotation of a clone that has been sequenced completely, it
        will eliminate the track annotation and warn the user. There are some
        exceptions for which this deletion will not take place:

          i) When the reading has been sequenced with the reverse primer
         ii) When the track is too far away from the reading (see the "-t"
             command line switch).
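
      The offset arithmetic of the track migration rule above can be
      sketched as follows. This is an illustration in Python, not
      cleanmf's code; the names migrate_track and offset_comment are
      hypothetical.

```python
def migrate_track(offset, new_bases):
    """Return (bases_moved, remaining_offset) for a track 'offset' bp away."""
    if offset < 0 or new_bases < 0:
        raise ValueError("offset and new_bases must be non-negative")
    moved = min(offset, new_bases)  # never move past the track's real position
    return moved, offset - moved


def offset_comment(remaining):
    """Text kept in the comment line; empty once the track position is reached."""
    return f"starts {remaining} bp from here" if remaining else ""


for added in (50, 20):
    moved, remaining = migrate_track(50, added)
    print(added, moved, repr(offset_comment(remaining)))
```

      With 50 new bases the annotation moves 50 bp and the offset text
      disappears; with 20 new bases it moves 20 bp and the comment reads
      "starts 30 bp from here", as in the example above.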

   Errors detected in part II:
   ---------------------------

   1. Complete validation of gene annotations:

      * Cleanmf will verify that all gene annotations are valid according to
        the rules encoded in gene_name.pl. To have a complete description of
        a gene annotation's expected format, consult:

           http://megasun.bch.umontreal.ca/People/lang/ogmp-mf/intro.html

      
   2. Drifted annotation warning removal:  
       
      * Cleanmf will detect automatically the warnings associated with 
        annotations that may have drifted (in the same contig), as in:

          26269  CTTGAAATG
          ;;WARNING: This annotation may have drifted.
          ;        T-rs1643 (0.8) <==* i290
          ;;WARN:>NTTTGTGGTTAGGGGTGAAAGGTCAAACAAAATTGGTGATAGCTGGTTTCTTGCGAAAA
          26278  ATTTGTGGTTAGGGGTGAAAGGTCAAACAAAATTGGTG

        When such lines are found, cleanmf will ask the user whether to
        delete them.
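
        A minimal sketch of how such warning lines could be located,
        written in Python for illustration. The function name is
        hypothetical, and the sketch flags only the two ";;" warning
        lines shown in the example; whether cleanmf treats any other
        line as part of the warning is not specified here.

```python
def drift_warning_lines(lines):
    """Return indices of lines that look like drift warnings."""
    prefixes = (";;WARNING: This annotation may have drifted", ";;WARN:>")
    return [i for i, line in enumerate(lines) if line.lstrip().startswith(prefixes)]


sample = [
    "26269  CTTGAAATG",
    ";;WARNING: This annotation may have drifted.",
    ";        T-rs1643 (0.8) <==* i290",
    ";;WARN:>NTTTGTGGTTAGGGGTGAAAGGTCAAACAAAATTGGTGATAGCTGGTTTCTTGCGAAAA",
    "26278  ATTTGTGGTTAGGGGTGAAAGGTCAAACAAAATTGGTG",
]
print(drift_warning_lines(sample))
```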

   3. Check consistency of gene start and gene end annotations:

       i) Check that for each gene start annotation there is a matching gene
          end annotation and vice-versa.
    
      ii) Check that every gene start and gene end annotation is unique in a
          given contig.

     iii) Check that for each matching "gene start" and "gene end" annotation, 
          the direction of the arrow is identical for both.

      iv) Check that for each matching "gene start" and "gene end" annotation, 
          the gene start is upstream from the gene end annotation.

      Note that the behavior of cleanmf changes depending on whether the 
      masterfile is a finished project or an unfinished project. Cleanmf uses 
      the "-f" switch to distinguish between the two (see OPTIONS). Test i)
      is not done if the project is unfinished. Also, if test iv) fails for
      an unfinished project, cleanmf reports an error, whereas if the test
      fails for a finished project, cleanmf warns the user. This is to support
      the circular structure of a completely sequenced genome.
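
      Tests i) to iv) can be sketched in Python as follows. This is not
      cleanmf's implementation: the record layout (position, gene, kind,
      arrow), the function name, and the message texts are hypothetical.
      The sketch always runs test i) and leaves out its mode-dependent
      skipping, and it assumes that for "<==" annotations "upstream"
      means a larger position.

```python
from collections import defaultdict


def check_gene_pairs(annotations, finished=False):
    """annotations: list of (position, gene, kind, arrow) within one contig.

    kind is "start" or "end"; arrow is "==>" or "<==".
    Returns a list of (severity, message) tuples.
    """
    by_gene = defaultdict(lambda: {"start": [], "end": []})
    for position, gene, kind, arrow in annotations:
        by_gene[gene][kind].append((position, arrow))

    problems = []
    for gene, found in by_gene.items():
        starts, ends = found["start"], found["end"]
        # Test ii: each start and each end appears at most once per contig.
        if len(starts) > 1 or len(ends) > 1:
            problems.append(("error", f"{gene}: duplicate start or end"))
            continue
        # Test i: a start needs a matching end and vice versa.
        if not starts or not ends:
            problems.append(("error", f"{gene}: unmatched start or end"))
            continue
        (start_pos, start_arrow), (end_pos, end_arrow) = starts[0], ends[0]
        # Test iii: both arrows point the same way.
        if start_arrow != end_arrow:
            problems.append(("error", f"{gene}: arrow directions differ"))
            continue
        # Test iv: start upstream of end (reversed for "<==", an assumption).
        in_order = start_pos < end_pos if start_arrow == "==>" else start_pos > end_pos
        if not in_order:
            severity = "warning" if finished else "error"
            problems.append((severity, f"{gene}: start is not upstream of end"))
    return problems


wrapped = [(90, "cox1", "start", "==>"), (10, "cox1", "end", "==>")]
print(check_gene_pairs(wrapped))
print(check_gene_pairs(wrapped, finished=True))
```

      The second call shows the finished-project case: a gene that wraps
      around the end of a circular genome yields a warning rather than
      an error.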
     

   4. Validation of clone sizes:

      Cleanmf will validate the sizes of the clones that appear in the
      masterfile. In order to do so, it uses the clonelist file supplied as 
      argument (see the "-c" option). If no clonelist file is supplied on the
      command line, cleanmf will prompt the user for a valid clonelist filename
      (RETURN for no clonelist file).

      Cleanmf first verifies that all sizes indicated in the masterfile are 
      syntactically valid (a valid size is a digit followed by a dot and from
      one to three digits, or a set of blanks). The following steps are then 
      performed:

         i) Cleanmf will add in the masterfile all the missing sizes, according
            to the sizes it reads in the clonelist file (if any, otherwise an 
            empty set of braces is added).  

        ii) If a clonelist file was specified, cleanmf will issue a warning for
            each clone whose size in the clone file is different from the size
            in the masterfile.

       iii) If a clonelist file was specified, cleanmf will also issue a
            warning for each clone whose size is specified in the masterfile
            but not in the clone file.
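
      The size syntax and steps i) to iii) can be sketched in Python as
      follows. The names SIZE_PATTERN, is_valid_size and compare_sizes
      are hypothetical, as is the use of dictionaries mapping clone
      names to size strings; comparing sizes numerically (so that 2.0
      equals 2.00) is an assumption of this sketch.

```python
import re

# One digit, a dot, one to three digits; or nothing but blanks.
SIZE_PATTERN = re.compile(r"(?:\d\.\d{1,3}| *)")


def is_valid_size(text):
    """True if text is a syntactically valid clone size."""
    return SIZE_PATTERN.fullmatch(text) is not None


def compare_sizes(masterfile_sizes, clonelist_sizes=None):
    """Return (filled, warnings); clonelist_sizes is None without a clone file.

    Sizes are assumed to have passed is_valid_size() already.
    """
    filled, warnings = {}, []
    for clone, size in masterfile_sizes.items():
        listed = (clonelist_sizes or {}).get(clone, "").strip()
        if not size.strip():
            filled[clone] = listed  # step i: '' stands for the empty size
        elif clonelist_sizes is None:
            continue  # steps ii and iii need a clone file
        elif not listed:
            warnings.append(f"{clone}: size in masterfile but not in clone file")
        elif float(listed) != float(size):
            warnings.append(f"{clone}: masterfile {size}, clone file {listed}")
    return filled, warnings


mf = {"a": "0.8", "b": "", "c": "1.2", "d": "2.0"}
clones = {"a": "0.9", "b": "1.5", "d": "2.00"}
print(compare_sizes(mf, clones))
```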

   5. Check consistency of multiple reading annotations of a clone:
 
      If there is more than one reading annotation for a given clone, cleanmf
      will verify that the sizes indicated in both annotations agree. If they
      do, it will also verify that the start/end of the insert inferred from the
      indicated size do not differ by more than 20% of the size for both
      annotations. These tests rely heavily on
      /share/supported/apps/ogmp/lib/spam/spam.offsets (see FILES) to compute
      the start/end of an insert from a reading annotation's text and its
      position in the masterfile.
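
      The 20% tolerance can be illustrated with the following Python
      sketch. It assumes that the two inferred positions and the clone
      size are already expressed in base pairs; the function name and
      its parameters are hypothetical, and the offsets taken from
      spam.offsets are not modelled.

```python
def inferred_ends_agree(position_a, position_b, size_bp, tolerance_percent=20):
    """True if the two inferred insert ends differ by at most 20% of the size."""
    # Integer arithmetic avoids rounding surprises at the boundary.
    return abs(position_a - position_b) * 100 <= tolerance_percent * size_bp


print(inferred_ends_agree(1000, 1100, 800))
print(inferred_ends_agree(1000, 1200, 800))
```

      For an 800 bp insert the tolerance is 160 bp, so a 100 bp
      difference passes and a 200 bp difference does not.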

   6. Check the position of gene annotations relative to one another:

      Several rules have been implemented in cleanmf to verify the embedding of
      gene annotations:

         - If two annotations are not associated with the same gene, they are
           totally independent in the sense that there might be annotations of
           gene X right between the beginning and end of gene Y, or just after,
           or before, etc...

         - All the annotations relative to a given gene have to be in the same 
           direction. This is checked only inside every contig, not globally.

         - There is no restriction imposed on the first annotation that can
           appear in a given contig. As long as it is syntactically valid,
           cleanmf accepts it.

         - Following an annotation "G-xxx-yyy-zzz ==> start" , there can be:

              i) The annotation "G-xxx-yyy-zzz ==> end" ; or

             ii) A syntactically correct annotation of the form  
                 "G-xxx-yyy-zzz-www ==> start" where "www" is a valid subpart 
                 of "zzz" (note that these can be "Mot-<element_name>" or
                 "Sig-<element_name>").

         - Following an annotation "G-xxx-yyy-zzz ==> end", there can be:

              i) A syntactically correct annotation of the form 
                 "G-xxx-yyy-www ==> start"
 
             ii) The annotation "G-xxx-yyy ==> end"

        (Note that there are corresponding rules for backward oriented
        annotations)

   
   7. Exons, introns and fragments related tests:

       - The exon-intron-...-exon structure must be respected inside each
         contig.

       - The numbering of the exons and introns must be coherent (i.e. exon1-
         intron1-exon2-intron2-...). Cleanmf now supports the special exon and
         intron number "0" which is used when the exon or intron number was not
         known at the time the masterfile was produced (see part 8 though).

       - The numbering of the fragment parts must also be coherent.

       - A genetic element cannot have fragments and an exon-intron structure.


   8. Tests performed for finished projects:

       This section describes the tests that are performed only for finished 
       projects (using the "-f" switch, see OPTIONS):

       - Exons and introns cannot be numbered "0".

       - Every gene start must have a matching gene end annotation and vice-
         versa.

       - The masterfile cannot contain "parts" annotations (-P<num>).

       - The first exon in every exon-intron-...-exon structure must be exon
         "1".


   9. Qualifier checking

   Cleanmf will validate all the qualifiers in every gene annotation.
   For every faulty annotation, an error message will be printed on screen 
   and in the file masterfile.out. 
   Cleanmf will not give suggestions on how to correct these
   annotations. For a complete description of the expected syntax of these
   qualifiers consult the web page:

       http://megasun.bch.umontreal.ca/People/lang/ogmp-mf/intro.html

   All the errors for which cleanmf doesn't suggest valid corrections (e.g.
   all the logic errors, tracks not in the same direction as readings, etc...)
   are saved in the file masterfile.errors. This file should always be checked
   after the completion of cleanmf. The user must correct these errors
   in the masterfile by using an appropriate text editor.

   Each time cleanmf prompts the user for a correction, the following 
   keystrokes are recognized: 

      - Ctrl-Q: quit the program and save the changes automatically

      - Ctrl-C: quit the program, but only after confirmation by the user.
                All changes to the masterfile are discarded.

   There are other special keystrokes that cleanmf recognizes. Ctrl-K will
   list them all when the line editor is active.
   
   If the user quits and saves the changes, cleanmf will first verify
   that there were indeed some changes. Then:

       - It produces a file (having the same name as the masterfile 
         but with the extension .cleanlog) that records everything
         that changed (in a format similar to diff(1)). If there was no change
         to the masterfile, this file is not produced.

       - The old version of the masterfile is stored in a file having
         the same name as the original masterfile but with the 
         extension .orig appended at the end. The "-B" switch overrides this
         behavior (see OPTIONS).

       - The new version of the masterfile (updated with the user's
         changes) will overwrite the old one.

       - If there were no significant changes, the user is notified by the 
         program and the log file (<mf name>.cleanlog) name is changed to
         <mf name>.oldcleanlog (if there was indeed a cleanlog file).


       The number of lines that the user deleted or added is reported by 
       the program. Cleanmf will produce an error message if it cannot
       create the log file, the backup file, or the new version of the 
       masterfile. Cleanmf will also process the entire (new) masterfile and
       update the line numbers that have to be changed because some sequence
       information has been removed or added. It will also indent the 
       masterfile properly. These changes are done automatically at the end of
       the program.  

   10. Trna checking 
       
   Syntactically valid pairs of matching trna annotations undergo an additional
   check. Cleanmf will verify that the anticodon indicated in the annotation 
   text matches the one surrounded by the anticodon markers (!). A warning will
   be issued in all these cases:

      - There is an anticodon in the annotation text but no markers appear in
        the sequence associated to the trna.

      - There is an anticodon in the annotation text but the number of markers
        in the sequence associated to the trna is not 2.

      - There is no anticodon in the annotation text (e.g. G-trnX) and there
        are markers in the sequence associated to the trna
  
   Cleanmf will also verify that the anticodon in the annotation's text (if any)
   matches the amino acid beside it. This check is done according to the
   genetic code associated with the project. To find out how the genetic code is
   set, see the "-g" switch (OPTIONS). There are exceptions to this test:
  
      - If the amino acid is X, the anticodon can be any of the possible
        anticodons, including those with four nucleotides. trnX can also have
        no anticodon (e.g. ";   G-trnX ==> start")

      - If the anticodon found in the trna annotation is 'cau' then 'M' and 'I'
        are the only valid amino acids accepted for this annotation, no matter
        the genetic code used.

   Finally, cleanmf will issue a warning when it finds anticodon markers (!)
   that are not contained within a matching pair of valid trna annotations.
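
   The three marker-related warning cases listed above can be sketched in
   Python as follows. The function names and message texts are
   hypothetical. Comparing the marked bases with the anticodon in the
   annotation text depends on conventions (strand, "u" versus "t") that
   are not described here, so the sketch stops at extracting the marked
   text.

```python
def marker_warning(annotation_anticodon, sequence):
    """annotation_anticodon: e.g. "cau", or None for an annotation like G-trnX."""
    count = sequence.count("!")
    if annotation_anticodon and count == 0:
        return "anticodon in annotation but no markers in sequence"
    if annotation_anticodon and count != 2:
        return "anticodon in annotation but marker count is not 2"
    if not annotation_anticodon and count > 0:
        return "markers in sequence but no anticodon in annotation"
    return None


def marked_anticodon(sequence):
    """Return the text between the two '!' markers, or None if there are not two."""
    if sequence.count("!") != 2:
        return None
    first, second = sequence.find("!"), sequence.rfind("!")
    return sequence[first + 1:second]


print(marker_warning("cau", "AC!CAT!GT"))
print(marker_warning(None, "AC!CAT!GT"))
print(marked_anticodon("AC!CAT!GT"))
```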

    
OPTIONS
    -t distance
         Use distance as the maximum distance that can separate a reading and a
         track in order for the track to be eligible for deletion. That is, if
         a track is more than distance base pairs away from the reading, this 
         track will not be offered for deletion. The default value of this
         parameter is 150 and is irrelevant (i.e. not used) if the reading 
         is of a sequence produced with the reverse primer.
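
         The eligibility rule can be sketched in Python as follows. The
         function and parameter names are hypothetical; only the default
         of 150 and the two exceptions come from this page.

```python
DEFAULT_MAX_DISTANCE = 150  # default value of the -t switch


def track_offered_for_deletion(clone_complete, distance_bp, reverse_primer,
                               max_distance=DEFAULT_MAX_DISTANCE):
    """True if a track annotation would be offered for deletion."""
    if not clone_complete:
        return False  # de-track applies only to completely sequenced clones
    if reverse_primer:
        return False  # exception i): the distance is not even consulted
    return distance_bp <= max_distance  # exception ii): too far away


print(track_offered_for_deletion(True, 120, False))
print(track_offered_for_deletion(True, 151, False))
print(track_offered_for_deletion(True, 10, True))
```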

    -c clonefile
         Use clonefile as the name of the clone list file that cleanmf must
         read. If this parameter is not supplied, cleanmf will prompt the user
         for a valid name.
   
    -B   Do not produce the backup file <mf name>.orig. By default, a
         backup of the masterfile is produced.

    -f   Assume the masterfile describes a finished project. No checking is
         done by cleanmf to verify that this is indeed true. 

    -o ocode
         Use ocode as the organism code used in the masterfile. Only letters
         are allowed in the organism code and it can be of arbitrary length. 

           i) If this flag is not used, cleanmf will set the ocode according 
              to the information stored in 
              /share/supported/apps/ogmp/lib/genome_name.lst (this file contains,
              for each masterfile, the corresponding ocode). If there is no ocode
              specified on the command line, and no ocode associated to the
              masterfile in the file /share/supported/apps/ogmp/lib/genome_name.lst,
              cleanmf produces an error message and exits.

          ii) If this switch is used in conjunction with the "-l" switch, then
              cleanmf is not going to verify that the organism code does indeed
              exist in the file /share/supported/apps/ogmp/lib/genome_name.lst. 

    -Q   Quiet mode. Don't print the header indicating the current contig 
         during editing.

    -l   Loose checking. If this switch is used without the "-o" switch, only
         gene annotations are checked and so the organism code is useless
         (cleanmf will not even check to see that the masterfile passed on the
         command line exists in /share/supported/apps/ogmp/lib/genome_name.lst).
         Every annotation other than a "G-" annotation is ignored. 

         Note that when this option is used, the original masterfile will not
         be modified by cleanmf in any way. Errors are reported in the
         "<mf>.errors" file. If the "-o" switch is used in conjunction with
         "-l", then cleanmf behaves as usual (i.e. all annotations are checked),
         but is not going to check if the organism code supplied exists in
         /share/supported/apps/ogmp/lib/genome_name.lst.

    -s nb_spaces
         Put nb_spaces spaces between the leading ';' and the start of the gene
         name of every gene annotation. For example, with -s5, the gene
         annotation "; G-cox1 ==> start" would be reformatted as 
         ";     G-cox1 ==> start". Note that nb_spaces must be positive.

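         The reformatting done by "-s" can be sketched in Python as
         follows. The function name is hypothetical, and recognizing a
         gene annotation by a leading ';' followed by "G-" is a
         simplification made for this sketch.

```python
import re


def respace_gene_annotation(line, nb_spaces):
    """Put nb_spaces blanks between the leading ';' and the gene name."""
    if nb_spaces <= 0:
        raise ValueError("nb_spaces must be positive")
    # Only lines of the form ';<blanks>G-...' are touched.
    return re.sub(r"^;\s*(?=G-)", ";" + " " * nb_spaces, line)


print(respace_gene_annotation("; G-cox1 ==> start", 5))
print(respace_gene_annotation(";        T-rs1643 (0.8) <==* i290", 5))
```

         The first line is reformatted as in the example above; the
         track annotation on the second line is left unchanged.
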
    -g gcode
         Use gcode as the genetic code associated to the mf passed on the 
         command line. Gcode must be a number. To see all the valid genetic
         codes, run gcshow(1). If this switch is not used, genome_name.lst is
         consulted in order to find the genetic code associated to the project.
         If there is none, or if the masterfile passed on the command line is 
         not listed in genome_name.lst, genetic code 20 is used. This switch
         has an impact on trna validation. It is used for codon translation.
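
         The order in which the genetic code is resolved can be sketched
         in Python as follows. The function name is hypothetical, and
         the dictionary project_table merely stands in for the contents
         of genome_name.lst, whose actual format is not shown here.

```python
DEFAULT_GCODE = 20  # fallback stated above


def resolve_gcode(cli_gcode, masterfile, project_table):
    """Pick the genetic code: -g first, then the project table, then 20."""
    if cli_gcode is not None:
        return cli_gcode
    listed = project_table.get(masterfile)  # None if absent or no code set
    return listed if listed is not None else DEFAULT_GCODE


table = {"known.mf": 4, "nocode.mf": None}
print(resolve_gcode(1, "known.mf", table))
print(resolve_gcode(None, "known.mf", table))
print(resolve_gcode(None, "other.mf", table))
```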

    -G Loosen up the gene name validation. Do not verify that the gene names
       in the gene annotations exist in gene_name.lst (see FILES). With this
       option, the following annotations would be legal:

         ;   G-foobar ==> start
         ;   G-orf ==> start
         ;   G-unknow_trna ==> end

    -d Do not issue a warning if a gene is found in two different contigs.
       The default is to issue such warnings.


FILES 
    /share/supported/apps/ogmp/lib/genome_name.lst
        A description of the OGMP projects; accessed mainly to find out and
        verify organism codes for the sequences in OGMP format.
    /share/supported/apps/ogmp/lib/gene_name.lst
        A description of known mitochondrial genes.
    /share/supported/apps/ogmp/lib/spam/spam.offsets
        Used to obtain the offset to apply to an insert's start/end
    /share/supported/apps/ogmp/lib/perl/ogmp/ProjPack.pl
        A perl library to read and parse the genome_name.lst file.
    /share/supported/apps/ogmp/lib/perl/ogmp/tRNA.pl
        This file is used to convert tRNA names.
    /share/supported/apps/ogmp/lib/perl/ogmp/gene_names.pl
        A perl library to read and parse the gene_name.lst file.
    /share/supported/apps/ogmp/lib/perl/ogmp/annotate.pl
        Used to annotate a sequence file stored in an array
    /share/supported/apps/ogmp/lib/perl/ogmp/naming.pl
        Library used to verify the format of reading names
    /share/supported/apps/ogmp/lib/perl/ogmp/gene.pl
        Library used to parse all the gene annotations
    /share/supported/apps/ogmp/lib/perl/ogmp/qualifs.pl
        Library used to parse all the qualifier parts of gene annotations
    /share/supported/apps/ogmp/lib/perl5/ogmp/clnfilepath.dta
        File that contains the default clone list file for various masterfiles


SEE ALSO
    ogmp(5), chkogmp(5), mf(5), tracks(1), gcshow(1)

AUTHOR
    Pierre Rioux and Nicolas Brossard (version 3.0 and later)
    Tim Littlejohn (project management version 2.0 and earlier)

    Organelle Genome Megasequencing Project.