MF2STAD(1)                 OGMP ASSEMBLY UTILITIES             MF2STAD(1)

NAME
    mf2stad - extract information from an OGMP masterfile as a set of
              staden-formatted pseudo-sequences or sequence offset positions
              (coordinates).

SYNOPSIS
    mf2stad [-B] [-P] [-o ocode] [-F] [-S] [-L length]
            [-X regex1 [-I] [-C [-D regex2]]] ogmp_masterfile

DESCRIPTION
    mf2stad reads the file ogmp_masterfile and creates pseudo-sequences
    in staden format for each track, gene, or comment found in it.
    One staden file is created for each pseudo-sequence, in the current
    directory. Gene pseudo-sequences are recognized in the masterfile
    by identifiers starting with "G-"; tracks by "T-". The pseudo-sequence
    files have filenames starting with the same prefixes. Other annotations
    are considered comments, and their pseudo-sequences are stored in
    numbered files starting with "C-". The comments themselves are
    saved in a file called "{PROJECT}.comments", where PROJECT is the
    first word of the argument ogmp_masterfile. This file also
    contains the framing annotations used to identify genes and tracks.

    Filenames of gene, track, or comment pseudo-sequences can have
    an extension of the form "~d" or "+d", where d is a sequence of digits;
    this means that the actual position of the annotation is displaced
    relative to the pseudo-sequence itself, by d positions before or after
    its start.
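
    The extension convention above can be illustrated with a minimal
    sketch. The helper name split_displacement is hypothetical and is not
    part of mf2stad. The sketch assumes that the extension is appended
    directly to the filename, and that "~" stands for "before" and "+"
    for "after"; the text above does not spell out either point.

```python
import re

# Hypothetical helper, not part of mf2stad.
# Assumptions: the "~d" / "+d" extension is appended directly to the name,
# "~" means d positions before the start and "+" means d positions after.
_SUFFIX = re.compile(r"^(?P<base>.+?)(?P<sign>[~+])(?P<digits>\d+)$")


def split_displacement(filename):
    """Return (base_name, signed_displacement) for a pseudo-sequence filename."""
    m = _SUFFIX.match(filename)
    if m is None:
        # No displacement extension: the annotation sits at the start.
        return filename, 0
    d = int(m.group("digits"))
    return m.group("base"), (-d if m.group("sign") == "~" else d)
```

    Under these assumptions, a name without an extension is reported with
    a displacement of zero.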

    Missing end-of-gene annotations are recreated by the program on the
    fly, with a warning message.

    The organism code for the ogmp_masterfile being processed must be
    known to mf2stad; if it is not supplied with the -o option, then
    mf2stad will try to find it by matching the ogmp_masterfile argument
    against the list of masterfiles found in a system file (usually
    "/share/supported/apps/ogmp/lib/genome_name.lst"; see projinfo(1)).

OPTIONS
    -B Do not make a backup copy of previously existing files in the
       current directory that conflict with files created by the current
       invocation of mf2stad. The default is to add a ".bak" extension to
       the old filenames.

    -L length This parameter is the length of the created pseudo-sequences;
       if it is not specified, the default length is obtained by calling
       findrep(1). If findrep returns a number less than 100, then 100 is used.
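
       The default-length rule can be sketched as follows. The function
       name effective_length is hypothetical, findrep_value stands for
       whatever number findrep(1) reports, and the sketch assumes that an
       explicit -L value is used as given; the text above does not say
       whether the floor of 100 also applies to it.

```python
def effective_length(findrep_value, user_length=None):
    """Pick the pseudo-sequence length as described for the -L option.

    Assumption: an explicit -L value is taken as-is; only the value
    obtained from findrep is raised to the floor of 100.
    """
    if user_length is not None:
        return user_length
    return max(findrep_value, 100)
```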

    -o ocode The organism code related to the masterfile to be processed.

    -X regex1 If regex1 is supplied (it must be a legal perl regular
       expression), then only annotations matching regex1 are considered
       for pseudo-sequence creation. These annotations are saved in
       exactly the same format as comments, except that the created
       staden files have an "X-" prefix rather than a "C-" prefix. The
       special comment file is called "{PROJECT}.extracts" rather than
       "{PROJECT}.comments". All annotations that do not match regex1
       are completely ignored. Note that trying to extract gene
       annotations with, for example, the regular expression '^;\s+G-'
       will not extract gene annotations for which the START part is
       missing. Pattern matching is case-sensitive.

    -I This inverts the test performed with the -X option.
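
       The effect of -X and -I on a few annotation lines can be sketched
       as below. This is only an illustration in Python, whose re module
       accepts the pattern '^;\s+G-' with the same meaning as perl; the
       select function is hypothetical and does not reproduce how mf2stad
       itself applies the expression.

```python
import re

# Sample annotation lines in the style used elsewhere in this page.
annotations = [
    ";        G-cox1 ==>",
    ";        end ==> G-cox1",
    ";        T-od1055 <==**",
]

# The example expression from the -X description (case-sensitive).
pattern = re.compile(r"^;\s+G-")


def select(lines, regex, invert=False):
    """Keep lines matching regex, or the non-matching ones when invert is set."""
    return [ln for ln in lines if bool(regex.search(ln)) != invert]


matched = select(annotations, pattern)                 # behaves like -X alone
inverted = select(annotations, pattern, invert=True)   # behaves like -X with -I
```

       In this sketch the "end ==> G-cox1" line is not selected by the
       pattern, which is why a gene whose START annotation is missing
       cannot be picked up with this expression.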

    -S This removes the internal limits on the allowable sizes
       of the created pseudo-sequences. The minimum size is 100
       bases for most annotations and 50 for tRNA genes. The maximum
       pseudo-sequence length is 4096 bases. When -S is specified,
       all those limits disappear. As a side effect of using this
       option, range annotations (those with a START and END
       annotation) that are shorter than the standard size for
       pseudo-sequences will have an exact length of abs(END-START)
       rather than being extended with more sequence data to lengthen
       them to that standard size. Finally, the restriction that
       pseudo-sequence filenames must be less than 17 characters long is
       also lifted (this restriction was necessary for the pseudo-sequences
       to be parsed by the obap(1) program).
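
       The size limits can be summarised in a short sketch. The function
       name clamp_length is hypothetical, and the sketch assumes that
       out-of-range lengths are clamped to the nearest limit; the text
       above gives the limits but does not say how mf2stad enforces them.

```python
MIN_DEFAULT = 100   # minimum size for most annotations
MIN_TRNA = 50       # minimum size for tRNA genes
MAX_LEN = 4096      # maximum pseudo-sequence length


def clamp_length(requested, is_trna=False, no_limits=False):
    """Apply the documented size limits unless no_limits (the -S case) is set.

    Assumption: lengths outside the limits are clamped, not rejected.
    """
    if no_limits:
        return requested
    lower = MIN_TRNA if is_trna else MIN_DEFAULT
    return max(lower, min(requested, MAX_LEN))
```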

    -C Do not create any of the pseudo-sequence files, nor the .comments file,
       but rather create a single .coordinates file that contains the absolute
       coordinates of the annotations extracted with -X (therefore -C can
       only be used with -X). This coordinates file can be read by mfmain(1)
       or protfilt(1).

    -F Although this seems contrary to the main purpose of mf2stad, this
       option forces the creation of pseudo-sequences in FASTA format
       rather than STADEN format. Pseudo-sequences in fasta format have
       no maximum limit on their length.

    -D regex2 Annotations extracted with the -X and -C options that ALSO
       match regex2 will have two records written to the coordinates
       file: the original record, followed by a record with the same
       annotation(s) marked on the other strand. For range
       annotations like this one:

           >contig1
           100 200
           ;      G-gene ==>
           ---
           ;      end ==> G-gene

       the other record will look like this:

           >contig1
           -200 100
           ;      G-gene <==
           ---
           ;      end <== G-gene

       For point annotations like this one:

           >contig2
           500
           ;      T-track101 ==>

       the other record will look like this:

           >contig2
           -499
           ;      T-track101 <==

       See the documentation on the OGMP coordinates file format.

    -P This option tells mf2stad to treat ALL annotations as comments,
       and therefore it overrides (and disables) options -X, -F, -I, -C,
       and -D. This option is used when mf2stad acts as an annotation
       extractor for Feature Maintenance in the OGMP (see obap(1) and
       stad2mf(1)).

ANNOTATION RECOGNITION RULES
    Gene and track annotations are recognised as follows:

        GENE START annotations:
        ;        G-genename [anything] arrow [anything]

    Note that introns and exons can be described by using suffixes
    following the gene name, e.g.

        ;        G-genename-e1 [anything] arrow [anything]

    for exon number one of gene "genename"

        ;        G-genename-i1 [anything] arrow [anything]

    for intron number one of gene "genename"

        GENE END annotations:
        ;        end arrow G-genename [anything]

        TRACK annotations:
        ;        T-identifier arrow [anything]
        ;        T-identifier arrow [anything] Starts nnn bases downstream
        ;        T-identifier arrow [anything] Starts nnn bases upstream

    where genename and identifier do not contain any spaces, nnn (in 
    tracks only) is a positive integer, and arrow is one of "<==",
    "<==*", "<==**", "==>", "*==>" or "**==>".

    Examples of annotations:
        ;        G-cox1 ==>
        ;        end ==> G-cox1
        ;        T-od1055 <==**
        ;        G-cob-i1 start ==> position uncertain
        ;        T-mo312 () <==** Starts 191 bases from here
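
    The recognition rules above can be approximated with regular
    expressions, as in the sketch below. It is an illustration only and
    not the code used by mf2stad: the names ARROW, GENE_START, GENE_END,
    TRACK, SHIFT, classify and track_shift are all hypothetical. The track
    pattern is deliberately lenient about text between the identifier and
    the arrow, because the "T-mo312 ()" example suggests such text can
    occur.

```python
import re

# One of "<==", "<==*", "<==**", "==>", "*==>" or "**==>".
ARROW = r"(?P<arrow><==\*{0,2}|\*{0,2}==>)"

GENE_START = re.compile(r"^;\s+G-(?P<name>\S+)\s.*?" + ARROW)
GENE_END = re.compile(r"^;\s+end\s+" + ARROW + r"\s+G-(?P<name>\S+)")
TRACK = re.compile(r"^;\s+T-(?P<name>\S+)\s.*?" + ARROW)
SHIFT = re.compile(r"Starts\s+(?P<n>\d+)\s+bases\s+(?P<dir>downstream|upstream)")


def classify(line):
    """Return (kind, name, arrow); anything unrecognised is a comment."""
    for kind, rx in (("gene_start", GENE_START),
                     ("gene_end", GENE_END),
                     ("track", TRACK)):
        m = rx.match(line)
        if m:
            return kind, m.group("name"), m.group("arrow")
    return "comment", None, None


def track_shift(line):
    """Return (nnn, direction) for a 'Starts nnn bases ...' clause, else None."""
    m = SHIFT.search(line)
    return (int(m.group("n")), m.group("dir")) if m else None
```

    Lines that match none of the three patterns fall through to the
    "comment" case, mirroring the way other annotations are treated as
    comments in the DESCRIPTION section.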

FILES
    /share/supported/apps/ogmp/lib/genome_name.lst
        A description of the OGMP projects; accessed mainly to find
        the organism code for the current project if it wasn't
        supplied on the command line or in the annotation_file.
    /share/supported/apps/ogmp/lib/perl/ogmp/ProjPak.pl
        A perl library to read and parse the genome_name.lst file.
        $PERLLIB is the standard perl library path.
    /share/supported/apps/ogmp/lib/perl/ogmp/tRNA.pl
        A perl library with tRNA conversion tables between one- and
        three-letter codes. $PERLLIB is the standard perl library path.

SEE ALSO
    obap(1), stad2mf(1), trna(3), projpack(3), projinfo(1), mf(5), findrep(1).

AUTHORS
    Pierre Rioux,
    Tim Littlejohn (project management), OGMP,
    August 1994