protfilt(1)              OGMP SEQUENCE UTILITY              protfilt(1)

NAME
       protfilt - filter protein sequences

SYNOPSIS
       protfilt [-c] [-m] [-min num1] [-max num2] [-s char]
                [-N coorfile -Fspec1 [-Fspec2...] [-S]] [filename]

       or

       protfilt [-c] [-min num1] [-max num2] [-m] [-s char]
                [-N coorfile -Fspec1 [-Fspec2...] [-S]] [filename]

DESCRIPTION
       protfilt processes a file containing protein sequences in FASTA
       format, with header lines as generated by flip ("prot.lst"
       format). The default input filename, if not specified, is
       "prot.lst".

       Two types of filtering are possible: "clipping" and "filtering".
       Clipping changes the sequences by marking out sub-sequences as
       comments. Filtering discards entire entries. The options that
       specify the kind of filtering are as follows:

   Clipping:
       -m     When the -m option is specified, sequences are realigned
              so that the first amino acid is an "M" (methionine).
              Amino acids preceding the first M are commented out with
              a ";". A comment line describing the new and old length
              of the sequence is added just after the header. Finally,
              the header line itself is adjusted, so that the ranges of
              nucleotides specified there accurately reflect the length
              of the new sequence aligned to the M. This assumes the
              header is in "prot.lst" format; see EXAMPLE below. If the
              sequence doesn't contain an "M", the header is left
              unchanged.

              The position of -m on the command line tells protfilt at
              what point to do the realignment. If it comes before the
              first -min or -max argument, the realignment is done
              before length-related stripping (-min and -max) starts;
              otherwise it is done afterwards.

       -s char
              If -s is specified, char must be a single letter; this
              letter will be used instead of "M" for adjusting
              sequences with the -m option.

   Filtering:
       -min num1
              When the -min option is used, sequences that are fewer
              than num1 amino acids long are commented out with ";".

       -max num2
              When the -max option is used, sequences that are more
              than num2 amino acids long are commented out with ";".
       -c     When the -c option is specified, the sequences affected
              by the -min and -max options are completely stripped out
              rather than simply commented out. With the -m option,
              sequences not containing an "M" amino acid are stripped
              out.

       -N coorfile -F spec -S
              When -N is used, at least one -F specification must also
              be supplied. coorfile is the name of a coordinates file
              as created by mf2stad(1). These options force the -c
              option. spec is an overlap specification; sequences in
              the prot.lst format file called "filename" which overlap
              at least one of the ranges defined in the coorfile
              according to spec are written out to a file called
              filename.overlap; the remainder are written to a file
              called filename.nonoverlap.

              Five different filtering specifications are supported.
              Let (A,B) be a pair of coordinates from the coordinates
              file that are on the same strand as a sequence that
              itself goes from nucleotide position X to Y.

              -Ffull
                     Specifies that the ORF must be fully covered by
                     the range of nucleotide positions from the
                     coorfile:

                     On the non-complementary strand:
                            A <= X and Y <= B
                     On the complementary strand:
                            B <= Y and X <= A

              -Fcover
                     Specifies that the ORF must completely cover the
                     range of nucleotide positions from the coorfile
                     and extend beyond this range:

                     On the non-complementary strand:
                            X <= A and B <= Y
                     On the complementary strand:
                            Y <= B and A <= X
                     And in both cases:
                            abs(B-A) < abs(Y-X)

              -FaNN-MM
                     Specifies that the ORF and the range of nucleotide
                     positions must share at least NN amino acid
                     positions (or 3*NN nucleotide positions) and no
                     more than MM amino acid positions. If NN is
                     omitted, it defaults to 1; if MM is omitted, it
                     defaults to infinity.

              -F%oNN-MM and -F%cNN-MM
                     Specifies that the overlap between the ORF
                     positions and the range of nucleotide positions
                     from the coorfile must account for a percentage of
                     the length of the ORF (for %o) or the length of
                     the range (for %c) that is at least NN% and no
                     more than MM%. If NN is omitted, it defaults to
                     "greater than 0%".
                     If MM is omitted, it defaults to "less than 100%".

              The -S option overrides the default strand-sensitiveness
              of the comparisons done with -F and -N (i.e. range
              overlaps are now reported regardless of the strandedness
              given by the coordinates file).

FILTERING PRECEDENCE
       protfilt processes each of the -m, -min and -max options only
       once. The priorities are as follows:

              1) -m
              2) -min
              3) -max
              4) -m

       Options will always be processed in that order. Therefore, the
       position of the options on the command line is not important,
       except for the -m option, which is executed either in position 1
       or in position 4 (but not both) according to the following
       criterion: if the first -m appears after a -min or a -max, then
       it is executed in position 4, otherwise in position 1.

       Filtering related to the -N and -F options is always done after
       the -m, -min and -max options. All -F filtering options are
       processed independently, in the order they are listed on the
       command line.

RESULT FILE NAMING
       After filtering, the results are stored in a file with the same
       name as the input file, but with extensions added to help
       identify the filtering that has been performed on its contents.
       The table below gives some examples of extensions created for
       different filtering operations.

       FILTERING OPTIONS            EXTENSION
       -----------------            ---------
       -m                           .m
       -min 50                      .min50
       -max 99                      .max99
       -min 50 -max 99              .50-99
       -min 50 -m                   .min50.m
       -max 99 -m                   .max99.m
       -min 50 -max 99 -m           .50-99.m
       -m -min 50 -max 99           .m.50-99
       [-m, -min and/or -max] -c    [extension]-c

       Aside from this file, two other files may be created if the -F
       and -N options are used: filename.overlap and
       filename.nonoverlap. See the description of the -F option for
       more details.

EXAMPLE
       Starting from this sequence,

           > A sample amino acid sequence; From orig. strand: 400 to 459
           ABCDEFGHIKLMNPQRSTVW

       the -m option alone produces:

           > A sample amino acid sequence; From orig. strand: 433 to 459
           ; Total length 9 aa, previously 20
           ;ABCDEFGHIKL
           MNPQRSTVW

LIMITATIONS
       The behavior of the program is not guaranteed if the header is
       not in proper prot.lst format (with integers specifying the
       range of nucleotides).

FILES
       filename.overlap
       filename.nonoverlap
              These two files are created alongside the input file and
              contain the sequences filtered by the -N and -F options.

SEE ALSO
       flip(1), mf2stad(1), coordinates(5)

AUTHOR
       Pierre Rioux, OGMP (Organelle Genome Sequencing Project), Dec.
       1993
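
The arithmetic behind the -m example in EXAMPLE can be checked with a short Python sketch. It is not protfilt's own code. The function name clip_to_start and the regular expression are hypothetical, and the sketch assumes a non-complementary (forward) strand header ending in "<start> to <end>", with three nucleotides per clipped amino acid. It also assumes that a sequence already starting with the chosen letter is left untouched, which the text above does not state.

```python
import re

# Assumption: the nucleotide range is the last thing on the header line.
RANGE_RE = re.compile(r"(\d+) to (\d+)\s*$")


def clip_to_start(header, sequence, start_char="M"):
    """Return the output lines for one entry after -m style clipping.

    start_char plays the role of the letter given with -s (default "M").
    Only the forward-strand case is handled; this is an assumption.
    """
    pos = sequence.find(start_char)
    match = RANGE_RE.search(header)
    if pos <= 0 or match is None:
        # No start letter (header left unchanged, as documented),
        # already aligned, or an unrecognized header: return as-is.
        return [header, sequence]

    start, end = int(match.group(1)), int(match.group(2))
    # Each clipped amino acid removes three nucleotides from the range.
    new_start = start + 3 * pos
    new_header = header[:match.start()] + f"{new_start} to {end}"

    kept = sequence[pos:]
    comment = f"; Total length {len(kept)} aa, previously {len(sequence)}"
    # Clipped residues are kept, but commented out with ";".
    return [new_header, comment, ";" + sequence[:pos], kept]


if __name__ == "__main__":
    entry = clip_to_start(
        "> A sample amino acid sequence; From orig. strand: 400 to 459",
        "ABCDEFGHIKLMNPQRSTVW",
    )
    print("\n".join(entry))
```

In the sample entry, eleven amino acids precede the "M", so the start of the range moves by 33 nucleotides, from 400 to 433, and nine amino acids remain.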