PEPPER(1)                  OGMP SEQUENCE UTILITIES                  PEPPER(1)

NAME
       pepper - extract proteins and intergenic portions from a masterfile

SYNOPSIS
       pepper [-c] [-x gene] [-X string] [-g gcode] [-i] [-f file] [-o out]
              [-s start] [-d 'deviation'] [-m | -t[iw]] [-p] [-r] [-n] [-a]
              [-F] [-C ctx_lg] [-P] masterfile

DESCRIPTION
       Pepper is a program used to extract selected regions of a masterfile
       and to report them, translated (i.e. amino acid sequence) or not
       translated (i.e. DNA sequence) (see OPTIONS).

       Selecting the type of genetic element to extract:
       ------------------------------------------------
       Pepper can extract various types of genetic elements:

       - all the genetic elements (coding or non-coding) in the masterfile
         (see OPTION -a).
       - protein-coding genes (see OPTION -p).
       - RNA-coding genes (see OPTION -r).
       - non-coding genetic elements (see OPTION -n).

       The file /share/supported/apps/ogmp/lib/gene_names.lst is consulted
       to determine the type of a genetic element. Multiple options can be
       used to extract more than one type of element. If none of the above
       switches are used, only the protein-coding regions are extracted
       (-p is thus the default).

       Imposing restrictions on the name of the genetic elements to extract:
       -------------------------------------------------------------------
       It is possible to specify name restrictions on the genetic elements
       that pepper will extract:

       - to extract genes with specific names, see OPTION -x.
       - to extract genetic elements whose names match a specific string
         pattern, see OPTION -X.
       - to extract genes whose names are contained in a specific file,
         see OPTION -f.

       Note that the name restrictions apply ON TOP of the type restrictions
       seen above. The name restrictions should not be considered to infer
       anything as to the type of genetic element to extract. Just because
       one uses "-X Sig" does not mean that -n (extract non-coding regions)
       is implied. To extract signals (the most probable intent of the last
       switch example), one must also use -n (or -a).
       One must always specify the type of the genetic element to extract
       along with any name restrictions (failure to do so will result in
       pepper only considering protein-coding regions, as -p is the default).

       To translate or not to translate
       --------------------------------
       Only the DNA sequence of the extracted region is reported as soon as
       the switch -t, -r, -n or -a is used. In ANY other case, pepper will
       attempt to translate the region it has extracted.

       If translation takes place, it is always done according to the input
       file's default genetic code. If the file passed on the command line
       does not have a genetic code in
       /share/supported/apps/ogmp/lib/genome_name.lst, genetic code 20 is
       used. It is also possible to specify the genetic code to use during
       translation with -g (see OPTIONS).

       Note that if pepper has to translate a coding region whose length (in
       nucleotides) is not a multiple of 3, it will print a warning message
       and ignore the region.

       To view all the possible genetic codes, one can use the program
       gcshow(1).

       Exons, introns and fragments
       ----------------------------
       If a genetic element has fragments, they will be reported in the
       result file. If the start annotation of the first fragment contains
       the qualifier "/join", the fragments are joined together and the
       resulting sequence is reported in the output file (but the output
       varies, depending on whether -F was used (see OPTIONS)). Otherwise,
       fragments are reported separately in the result file. This behaviour
       is irrespective of whether or not translation takes place.

       Pepper has a variety of ways to report genetic elements with an
       exon-intron structure:

       - If translation must take place, pepper joins together the DNA
         sequences of each exon and translates the resulting sequence
         according to the specified genetic code.
       - If translation does not take place:
         1) Pepper can unite the DNA sequence of the exons and report the
            result. To do this, use -t.
         2) Pepper can report the DNA sequence of each exon and intron
            separately. To do this, use -ti.
         3) Pepper can unite the DNA sequence of all the exon/intron(s) and
            report the resulting sequence. To do this, use -tw.

       Intergenic masterfile
       ---------------------
       Pepper can also extract the intergenic regions of a masterfile (see
       OPTION -i). The output file is a list of intergenic sequences, one
       contig for each. To determine if a "G-...." annotation describes a
       valid gene, pepper consults
       /share/supported/apps/ogmp/lib/gene_names.lst. Note that parts of
       genetic elements (whether or not in the gene_names file, like
       "G-whatever-P1") are recognised as valid "genes" while Sig, Mot, Mob
       and Var are not.

       The intergenic masterfile will contain all comment/annotation lines
       found in the masterfile's contig, except reading and track lines. Of
       the comment lines at the top of the masterfile, only those that
       describe the last cosmea(1) cycle are kept. Also, the line numbers of
       the original masterfile are maintained. To count nucleotides, one can
       run cn(1) or flip(1) on the file.

       Each non-coding region in the masterfile will be in a separate
       contig. Pepper will assign appropriate names to those contigs. This
       will facilitate fasta(1) searches in the non-coding regions
       afterwards.

       Prot.prm file
       -------------
       Pepper uses the file prot.prm (see flip's man-page for a complete
       description) if it exists in the current directory. It will be used
       only to determine the genetic code to use during translation, to
       define deviations to the current code or to redefine the start
       codons. If -s, -g or -d are passed on the command line, prot.prm is
       ignored.

       Pepper's result file
       --------------------
       Pepper produces on output a file which contains the genetic elements
       found. A global header will be printed in this file indicating useful
       information about the content of the file (genetic code used,
       deviations, etc.).
       Each protein/gene in the file has a maximum of 4 components:

       - A header, indicating the contig, organism and genome
         (mitochondrial, plastid, etc.) in which the protein/gene was found
         (if the masterfile is in
         /share/supported/apps/ogmp/lib/genome_name.lst). The header will
         also include the qualifiers that were found on the gene start
         annotation associated to the protein/gene, along with their values
         (note that only the following qualifiers will be printed if
         present: /synonym=, /OMEGA, /pseudo, /intronic, /rev_trans and
         /group=). Note that invalid qualifiers will be ignored by pepper.
         Finally, if -t, -r, -n and -a were not used, the length of the aa
         sequence is also printed (the ending stop codon is not included in
         this count).
       - A comment indicating the number of stop codons found if it is not 1
         (omitted if -t, -r, -n, or -a was used).
       - A comment indicating if an unusual start codon was found at the
         beginning of the protein (omitted if -t, -r, -n, or -a was used).
       - The protein or DNA (depending on whether -t, -r, -n, or -a were
         used) sequence itself, formatted to 60 characters per line.

OPTIONS
       -c
              Case-insensitive search. If this switch is used, all the genes
              specified via -x, -X, or -f are searched in a case-insensitive
              way (default is case-sensitive). If this switch is used without
              -x, -X or -f, it has no effect.

       -x gene
              Extract the region associated to those genes whose name
              (right-bound) is EXACTLY the given gene name. Note that gene
              can be a comma-separated list of genes to extract (e.g.
              "-x nad1,cox1" would extract nad1 and cox1, but not cox11 or
              nad11). This switch can be viewed as a stricter "-X". Pepper
              will verify that all the genes supplied to the -x switch are
              listed in gene_names.lst(1).

              The "-X" option (see below) can also be used in conjunction
              with "-x". In this case only the genes that match the
              "-X string" OR the genes that match "-x gene" are retained.
              So for example, "-X nad -x cox1" would tell pepper to extract
              only the nad gene series and cox1 (but, again, not cox11).
              Note that -x cannot be used with -f (see below).

              Note that -x applies ON TOP of the genetic element type
              restrictions (specified via -p, -r, -n, -a). The switch -x
              should not be considered to infer anything as to the type of
              genetic element to extract. Just because one uses "-x Sig"
              does not mean that -n (extract non-coding regions) is implied.
              To extract signals (the most probable intent of the last
              switch example), one must also use -n (or -a). One must always
              specify the type of the genetic elements to extract along with
              -x (failure to do so will result in pepper only considering
              protein-coding regions, as -p is the default).

              This option is case-sensitive in gene name pattern-matching.

              A "^" at the beginning of the argument specifies that whatever
              is following "^" should NOT be extracted. E.g.,

                  pepper -x '^cox1'

              extracts all (protein-coding) genes except cox1.

                  pepper -X 'co' -x '^cob,cox3'

              extracts all (protein-coding) genes that start with co, i.e.,
              cox1, cox2, ..., but not cob and cox3.

       -X string
              Extract the region associated to those genes whose name starts
              left-bound with the given string. To extract all orfs, specify
              the string 'orf'. This will apply to orfs embedded in other
              features ('; G-cox1-I1-orf234') and free-standing orfs. The
              string 'rp' will extract all genes starting with rp: rps#,
              rpl#, rpo[A-D] etc. Note that string can consist of a
              comma-separated list of strings (e.g. "-X nad,cox,rpl" would
              extract all the nad genes, the cox genes and all rpl genes).
              Pepper does not support any kind of wild-card character for
              the gene string specification. So every character is treated
              as is, except the comma.

              The "-X" option can also be used in conjunction with "-x" (see
              above). In this case only the genes that match the
              "-X string" OR the genes that match "-x gene" are retained.
              So for example, "-X nad -x cox1" would tell pepper to extract
              only the nad gene series and cox1 (but not cox11). Note that
              -X cannot be used in conjunction with -f (see below).

              Note that -X applies ON TOP of the genetic element type
              restrictions (specified via -p, -r, -n, -a). The switch -X
              should not be considered to infer anything as to the type of
              genetic element to extract. Just because one uses "-X Sig"
              does not mean that -n (extract non-coding regions) is implied.
              To extract signals (the most probable intent of the last
              switch example), one must also use -n (or -a). One must always
              specify the type of the genetic elements to extract along with
              -X (failure to do so will result in pepper only considering
              protein-coding regions, as -p is the default).

              This option is case-insensitive in pattern-matching.

              A "^" at the beginning of the argument specifies that genes
              starting with whatever is following "^" should NOT be
              extracted. See examples above for option -x.

       -g code
              Use the genetic code "code" to perform translation. "code" can
              be any valid NCBI genetic code number (e.g. 1, 19) or id (e.g.
              TAG-Leu). Refer to $NCBI/data/gc.prt (3) to see the valid
              identifiers.

              If -g is not supplied and there is a prot.prm file in the
              current directory, pepper will read it and fetch the genetic
              code from it. If prot.prm cannot be found in the current
              directory, pepper will consult
              /share/supported/apps/ogmp/lib/genome_name.lst (see FILES) in
              order to find the genetic code associated to the masterfile's
              basename. If there is no genetic code supplied either on the
              command line (with -g), in prot.prm or in the file
              /share/supported/apps/ogmp/lib/genome_name.lst, pepper will
              use the basic mitochondrial code (number 20 in
              $NCBI/data/gc.prt). Also note that if a supplied genetic code
              is invalid, the default code will be used instead. To view all
              the available genetic codes, use gcshow(1).

       -i
              Produce the intergenic regions file.
              This file will be named masterfile.noncod.process_nb. See
              section "Intergenic masterfile" for a description of this
              file.

       -f file
              Extract only the genes whose names are specified in file. File
              will contain the gene names to extract, one per line. Each
              gene name should not contain any whitespace. Pepper will
              verify that all the genes in file are listed in
              gene_names.lst. A typical gene file is shown below:

                  #
                  # Lines that start with '#' are comments
                  # Blank lines are ignored
                  #
                  cox1
                  nad5
                  rlp

              Note that the gene names must match EXACTLY (just like -x) in
              the sequence file to be extracted (so for example, with the
              above gene file, cox1 would be extracted but not cox11).
              Finally, -f cannot be used with -x or -X.

              Note that -f applies ON TOP of the genetic element type
              restrictions (specified via -p, -r, -n, -a). The switch -f
              should not be considered to infer anything as to the type of
              genetic element to extract. One must always specify the type
              of the genetic elements to extract along with -f (failure to
              do so will result in pepper only considering protein-coding
              regions, as -p is the default).

       -o out
              Write pepper's results in file out. Note that if out exists, a
              backup will be made.

       -s starts
              Use starts as the start codons, instead of the ones defined in
              the current genetic code. starts is a list of codons separated
              by commas (case-insensitive). Note that specifying new start
              codons doesn't alter the genetic code.

       -d file|'deviation'
              Apply the specified deviations to the current genetic code.
              The parameter of the "-d" switch can be a file name or a
              string of the form "atc=r,ttt=p,agt=*". It is a good habit to
              quote the entire deviation string, otherwise the shell might
              give an obscure error message if the deviation contains a
              shell meta-character (like '*'). You can also precede the '*'
              by a backslash in order to avoid shell interpretation.
              All the codon ids are case-insensitive, but the aa ids are not
              (that is, codon aTT and AtT refer to the same codon but amino
              acid "m" and "M" are different).

              The parameter of the "-d" switch can also be the name of a
              file containing deviations. A typical file is shown below:

                  # These lines are comments, blank lines ignored
                  # Every codon is case-insensitive, but aas are not
                  aTC = a
                  TTT = p
                  ATG = *

              Note that the star does not have to be quoted in this case.

       -m
              With this switch, pepper will translate the first codon of a
              protein by 'M' if the codon is a start codon. Remember that
              the start codons for each genetic code are defined in
              $NCBI/data/gc.prt, and by the arguments of the "-s" switch (if
              any). Default is to translate all start codons with the amino
              acid they would produce if they appeared in a position other
              than the first. This switch cannot be used with -t.

       -t[iw]
              Do not translate the extracted region. Switch -t can be used
              all by itself or can take one argument: 'i' or 'w'. See the
              section "Exons, introns and fragments" above for a complete
              description of this switch.

       -p
              Extract the protein-coding genes. If -r, -n and -a are not
              used, -p is assumed by default.

       -r
              Extract the RNA-coding genes. As soon as this switch is used,
              pepper will not attempt to translate any of the extracted
              regions.

       -n
              Extract the non-coding genetic elements. As soon as this
              switch is used, pepper will not attempt to translate any of
              the extracted regions.

       -a
              Extract all (synonym for "-p -r -n"). As soon as this switch
              is used, pepper will not attempt to translate any of the
              extracted regions.

       -F
              Indicate fragment boundaries, even if the fragments must be
              joined. If this option is used, a comment line of the form
              "; -- fragment boundary --" is inserted between each fragment
              of a gene, unless the fragments are reported separately (i.e.
              there is no qualifier "/join" in the fragment annotations).
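The effect of -F on joined fragments can be pictured with a short Python sketch. This is an illustration only, not pepper's implementation: the function names are hypothetical, and it assumes that each fragment restarts the 60-character line wrapping when boundaries are marked, which the text above does not specify.

```python
BOUNDARY = "; -- fragment boundary --"  # comment line quoted in the -F description


def wrap(sequence, width=60):
    """Split a sequence into lines of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def report_joined(fragments, mark_boundaries=False, width=60):
    """Return output lines for fragments that carry the /join qualifier.

    Without -F (mark_boundaries=False) the fragments are concatenated and
    wrapped as one sequence. With -F, a boundary comment is placed between
    consecutive fragments (assumption: wrapping restarts per fragment).
    """
    if not mark_boundaries:
        return wrap("".join(fragments), width)
    lines = []
    for index, fragment in enumerate(fragments):
        if index > 0:
            lines.append(BOUNDARY)
        lines.extend(wrap(fragment, width))
    return lines


if __name__ == "__main__":
    for line in report_joined(["ATGGCC", "TTTAAA"], mark_boundaries=True):
        print(line)
```

Fragments without "/join" are reported separately, so no boundary comment applies to them.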
       -C ctx_lg
              For each genetic element written in pepper's output file, give
              some context (ctx_lg nucleotides at most, where ctx_lg is a
              positive number) upstream and downstream. Note that for
              elements on the reverse strand, the context is also taken on
              the reverse strand.

       -P
              Print the position(s) in the masterfile for each extracted
              element. This option can only be used when translation does
              not take place. Positions are listed in the form

                  [start-end]

              For elements on the reverse strand, start will be greater than
              or equal to end. For elements on the original strand, start
              will be less than or equal to end. If an element contains
              subparts that were joined by pepper to obtain the sequence
              displayed (as would be the case if an extracted gene has exons
              and -t was used), then pepper will list the positions of all
              the subparts. This list will have the form:

                  [start1-end1,start2-end2,...]

              where each start/end pair corresponds to the position of a
              given subpart. Note that for elements on the complementary
              strand, the subparts positions are listed in the reverse order
              in which they appear in the masterfile. Within a given
              start/end pair, if the element was on the complementary
              strand, we will have start >= end, as noted previously.

FILES
       /share/supported/apps/ogmp/lib/genome_name.lst
              Current masterfile list, with their properties.

       /share/supported/apps/ogmp/lib/perl/ogmp/gene.pl
              Used to parse the gene annotations

       /share/supported/apps/ogmp/lib/perl/ogmp/gene_names.pl
              Library that identifies the known genes

       /share/supported/apps/ogmp/lib/perl/ogmp/gc.pl
              Used to fetch the genetic codes

       /share/supported/apps/ogmp/lib/perl/ogmp/prot.pl
              Used to read prot.prm and set deviations

       $NCBI/data/gc.prt
              All the genetic codes available for translation

       /share/supported/apps/ogmp/lib/gene_names.lst
              All the known genes.
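The position notation described for -P under OPTIONS can be sketched in Python as follows. The function name and the input convention (subparts given as (low, high) pairs in masterfile order) are assumptions made for the illustration; only the bracketed output form and the ordering rules come from the text.

```python
def format_positions(subparts, reverse_strand=False):
    """Format masterfile positions in the style described for -P.

    `subparts` is a list of (low, high) coordinate pairs in the order they
    appear in the masterfile (hypothetical input convention). On the
    reverse strand the subparts are listed in reverse order and each pair
    is written with start >= end.
    """
    if reverse_strand:
        pairs = [(high, low) for low, high in reversed(subparts)]
    else:
        pairs = list(subparts)
    return "[" + ",".join(f"{start}-{end}" for start, end in pairs) + "]"


if __name__ == "__main__":
    print(format_positions([(100, 250)]))
    print(format_positions([(100, 250), (400, 520)], reverse_strand=True))
```

For a two-exon gene at 100-250 and 400-520, the sketch gives [100-250,400-520] on the original strand and [520-400,250-100] on the reverse strand.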
DEPENDENCIES
       gene.pl, gc.pl, gene_names.pl, prot.pl

SEE ALSO
       flip(1), onip(1), stad2mf(1), cn(1), fasta(1), gcshow(1)

AUTHOR
       Gertraud Burger (Initial idea and specs)
       Nicolas Brossard (programming and design)
       Organelle Genome Megasequencing Project, JAN97
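As a reading aid for the -g option and the "Prot.prm file" section, the order in which the genetic code is chosen can be summarised in a short Python sketch. It is not pepper's code: the function and parameter names are hypothetical, the validity check on a supplied code is not modelled, and the fall-through when prot.prm exists but defines no code is an assumption.

```python
DEFAULT_CODE = 20  # basic mitochondrial code, per the -g description


def resolve_genetic_code(cli_code=None, prot_prm_code=None,
                         genome_list_code=None, ignore_prot_prm=False):
    """Return (code, source) following the documented precedence.

    cli_code          value given with -g, if any
    prot_prm_code     code read from ./prot.prm, if the file exists
    genome_list_code  code listed for the masterfile in genome_name.lst
    ignore_prot_prm   True when -s, -g or -d was passed, since prot.prm is
                      then ignored
    """
    if cli_code is not None:
        return cli_code, "-g"
    if prot_prm_code is not None and not ignore_prot_prm:
        return prot_prm_code, "prot.prm"
    if genome_list_code is not None:
        return genome_list_code, "genome_name.lst"
    return DEFAULT_CODE, "default"


if __name__ == "__main__":
    print(resolve_genetic_code(prot_prm_code=1, genome_list_code=3))
    print(resolve_genetic_code())
```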