MF2STAD(1)                OGMP ASSEMBLY UTILITIES                MF2STAD(1)

NAME
       mf2stad - extract information from an OGMP masterfile as a set of
       staden-formatted pseudo-sequences or sequence offset positions
       (coordinates).

SYNOPSIS
       mf2stad [-B] [-P] [-o ocode] [-F] [-S] [-L length]
               [-X regex1 [-I] [-C [-D regex2]]] ogmp_masterfile

DESCRIPTION
       mf2stad reads the file ogmp_masterfile and creates pseudo-sequences
       in staden format for each track, gene, or comment found in it. One
       staden file is created for each pseudo-sequence, in the current
       directory.

       Gene pseudo-sequences are recognized in the masterfile by
       identifiers starting with "G-"; tracks by "T-". The names of the
       pseudo-sequence files start with the same prefixes. Other
       annotations are considered comments, and their pseudo-sequences are
       stored in numbered files starting with "C-". The comments
       themselves are saved in a file called "{PROJECT}.comments", where
       PROJECT is the first word of the argument ogmp_masterfile. This
       file also contains the framing annotations used to identify genes
       and tracks.

       Filenames of gene, track, or comment pseudo-sequences can have an
       extension of the form "~d" or "+d", where d is a sequence of
       digits; this means that the actual position of the comment is
       displaced relative to the pseudo-sequence itself, by d positions
       before or after its start.

       Missing end-of-gene annotations are recreated by the program on the
       fly, with a warning message.

       mf2stad must know the organism code of the ogmp_masterfile being
       processed. If the code is not supplied with the -o option, mf2stad
       tries to find it by matching the ogmp_masterfile argument against
       the list of masterfiles found in a system file (usually
       "/share/supported/apps/ogmp/lib/genome_name.lst"; see projinfo(1)).

OPTIONS
       -B     Do not make a backup copy of previously existing files in
              the current directory that conflict with files created by
              the current invocation of mf2stad. The default is to add a
              ".bak" extension to the old filenames.
       -L length
              The length of the created pseudo-sequences. If this option
              is not specified, the default length is obtained by calling
              findrep(1). If findrep returns a number less than 100, then
              100 is used.

       -o ocode
              The organism code related to the masterfile to be processed.

       -X regex1
              If regex1 is supplied (it must be a legal perl regular
              expression), then only annotations matching regex1 are
              considered for pseudo-sequence creation. These annotations
              are saved in exactly the same format as comments, except
              that the created staden files have an "X-" prefix rather
              than a "C-" prefix. The special comment file is called
              "{PROJECT}.extracts" rather than "{PROJECT}.comments". All
              annotations that do not match regex1 are completely ignored.
              Note that trying to extract gene annotations with, for
              example, the regular expression '^;\s+G-' will not extract
              gene annotations for which the START part is missing.
              Pattern matching is case-sensitive.

       -I     Invert the test performed with the -X option.

       -S     Remove the internal limits on the allowable sizes of the
              created pseudo-sequences. The minimum size is 100 bases for
              most annotations and 50 for tRNA genes. The maximum
              pseudo-sequence length is 4096 bases. When -S is specified,
              all those limits disappear. As a side effect, range
              annotations (those with a START and END annotation) that are
              shorter than the standard size for pseudo-sequences will
              have an exact length of abs(END-START) rather than being
              padded with more sequence to reach that standard size.
              Finally, the restriction that pseudo-sequence filenames must
              be less than 17 characters long is also lifted (this
              restriction was necessary for the pseudo-sequences to be
              parsed by the obap(1) program).
       -C     Do not create any of the pseudo-sequence files, nor the
              .comments file, but rather create a single .coordinates file
              that contains the absolute coordinates of the annotations
              extracted with -X (therefore -C can only be used with -X).
              This coordinates file can be read by mfmain(1) or
              protfilt(1).

       -F     Although this seems contrary to the main purpose of mf2stad,
              this option forces the creation of pseudo-sequences in FASTA
              format rather than STADEN format. Pseudo-sequences in FASTA
              format have no maximum limit on their length.

       -D regex2
              Annotations extracted with the -X and -C options which ALSO
              match regex2 will have two records written to the
              coordinates file: the original record, followed by a record
              with the same annotation(s) marked on the other strand.

              For range annotations like this one:

                     >contig1 100 200
                     ; G-gene ==>
                     ---
                     ; end ==> G-gene

              the other record will look like this:

                     >contig1 -200 100
                     ; G-gene <==
                     ---
                     ; end <== G-gene

              For point annotations like this one:

                     >contig2 500
                     ; T-track101 ==>

              the other record will look like this:

                     >contig2 -499
                     ; T-track101 <==

              See the documentation on the OGMP coordinates file format.

       -P     Treat ALL annotations as comments. This option therefore
              overrides (and disables) options -X, -F, -I, -C, and -D. It
              is used when mf2stad acts as an annotation extractor for
              Feature Maintenance in the OGMP (see obap(1) and
              stad2mf(1)).

ANNOTATION RECOGNITION RULES
       Gene and track annotations are recognised as follows:

   GENE START annotations:
       ; G-genename [anything] arrow [anything]

       Note that introns and exons can be described by using suffixes
       following the gene name, e.g.
       ; G-genename-e1 [anything] arrow [anything]
              for exon number one of gene "genename"

       ; G-genename-i1 [anything] arrow [anything]
              for intron number one of gene "genename"

   GENE END annotations:
       ; end arrow G-genename [anything]

   TRACK annotations:
       ; T-identifier arrow [anything]
       ; T-identifier arrow [anything] Starts nnn bases downstream
       ; T-identifier arrow [anything] Starts nnn bases upstream

       where genename and identifier do not contain any spaces, nnn (in
       tracks only) is a positive integer, and arrow is one of "<==",
       "<==*", "<==**", "==>", "*==>" or "**==>".

   Examples of annotations:
       ; G-cox1 ==>
       ; end ==> G-cox1
       ; T-od1055 <==**
       ; G-cob-i1 start ==> position uncertain
       ; T-mo312 () <==** Starts 191 bases from here

FILES
       /share/supported/apps/ogmp/lib/genome_name.lst
              A description of the OGMP projects; accessed mainly to find
              the organism code for the current project if it wasn't
              supplied on the command line or in the annotation_file.

       /share/supported/apps/ogmp/lib/perl/ogmp/ProjPak.pl
              A perl library to read and parse the genome_name.lst file.
              $PERLLIB is the standard perl library path.

       /share/supported/apps/ogmp/lib/perl/ogmp/tRNA.pl.pl
              A perl library with tRNA conversion tables between 1- and
              3-letter codes. $PERLLIB is the standard perl library path.

SEE ALSO
       obap(1), stad2mf(1), trna(3), projpack(3), projinfo(1), mf(5),
       findrep(1).

AUTHORS
       Pierre Rioux, Tim Littlejohn (project management), OGMP, August
       1994
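
EXAMPLES
       The rules under ANNOTATION RECOGNITION RULES can be illustrated
       with a small classifier. This is only a sketch in Python, not the
       code mf2stad itself uses (mf2stad is built on perl libraries). The
       function name classify, the returned dictionary keys, and the
       "-eN"/"-iN" suffix pattern are assumptions made for illustration.
       The sketch also allows arbitrary text between a track identifier
       and its arrow, because the example "; T-mo312 () <==**" contains
       such text even though the TRACK rule does not list it.

```python
import re

# The six arrow forms listed in the rules: <==, <==*, <==**, ==>, *==>, **==>
ARROW = r"(?P<arrow><==\*{0,2}|\*{0,2}==>)"

GENE_END = re.compile(r"^;\s+end\s+" + ARROW + r"\s+G-(?P<name>\S+)")
GENE_START = re.compile(r"^;\s+G-(?P<name>\S+)\s+.*?" + ARROW)
# Assumption: free text is tolerated between the track identifier and the arrow.
TRACK = re.compile(r"^;\s+T-(?P<name>\S+)\s+.*?" + ARROW + r"(?P<rest>.*)$")
# Assumption: exon/intron suffixes have the form -eN or -iN at the end of the name.
PART = re.compile(r"^(?P<gene>.+)-(?P<type>[ei])(?P<num>\d+)$")
OFFSET = re.compile(r"Starts\s+(\d+)\s+bases\s+(upstream|downstream)")


def classify(line):
    """Return a dict describing a gene/track annotation, or None for a comment."""
    line = line.strip()

    m = GENE_END.match(line)
    if m:
        return {"kind": "gene_end", "name": m["name"],
                "arrow": m["arrow"], "direction": m["arrow"].strip("*")}

    m = GENE_START.match(line)
    if m:
        info = {"kind": "gene_start", "name": m["name"],
                "arrow": m["arrow"], "direction": m["arrow"].strip("*"),
                "part": None}
        p = PART.match(m["name"])
        if p:
            kind = "exon" if p["type"] == "e" else "intron"
            info["part"] = (p["gene"], kind, int(p["num"]))
        return info

    m = TRACK.match(line)
    if m:
        off = OFFSET.search(m["rest"])
        return {"kind": "track", "name": m["name"],
                "arrow": m["arrow"], "direction": m["arrow"].strip("*"),
                "offset": (int(off.group(1)), off.group(2)) if off else None}

    return None  # anything else is treated as a plain comment


if __name__ == "__main__":
    for text in ("; G-cox1 ==>",
                 "; end ==> G-cox1",
                 "; T-od1055 <==**",
                 "; G-cob-i1 start ==> position uncertain",
                 "; T-mo312 () <==** Starts 191 bases from here"):
        print(text, "->", classify(text))
```

       The gene-end pattern is tried first so that lines beginning with
       "; end" are never mistaken for comments. The asterisks on an arrow
       are kept in "arrow" but are not interpreted, since this page does
       not define their meaning.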