SPAM(1)                   OGMP SEQUENCE UTILITIES                   SPAM(1)

NAME
       spam - sequencing project analysis and management

SYNOPSIS
       spam [-o ocode] [-O] [-d] cfg_file

DESCRIPTION
       Spam (Sequencing Project Analysis and Management) is a sequence
       quality reporter and a decision tool that assists the finishing
       phase of a sequencing project. Spam identifies sequence ambiguities
       and regions of insufficient coverage and suggests sequencing tasks
       to improve the situation.

       Spam takes as input a configuration file that contains most of the
       parameters used by the application. A typical configuration file
       follows:

           # Sample spam config file
           # These lines are comments. Blank lines ignored

           qf         = /users/brossard/tempo/qf
           mf         = /users/brossard/tempo/masterfile
           dump       = staden.dump
           value      = /users/brossard/tempo/value.nick
           consensus  = ./consensus

           # Note: the two following parameters are obsolete now...
           min_nb_imp = 200
           min_nb_new = 100

           def_size   = 1.1
           clone      = clonefile_path
           cost       = cost_file_path
           table      = table_file_path

       All lines whose first non-blank character is "#" are comments.
       Blank lines are ignored. All the other lines must have the format:

           <identifier> = <value>

       where the identifier can be one of the following:

       + qf: The filename of the quality file (either a quality dump file
         or a quality masterfile). This parameter is optional.
       + mf: The filename of the masterfile on which spam will be run.
         This parameter is mandatory, unless the -d switch is used (see
         OPTIONS).
       + dump: The filename of the staden dump associated with the
         masterfile. This parameter must also appear in the configuration
         file.
       + value: The path of the value file. This parameter is also
         compulsory.
       + consensus: The path of the consensus file (also required).
       + min_nb_imp: The minimum number of nucleotides an action has to
         "improve" in the existing sequence in order to be considered
         during spam's filtering phase. The value must be an integer.
         This parameter is optional (default value: 1) and is now
         deprecated; it should almost never be used.
       + min_nb_new: If a sequencing action provides new information,
         this is the minimum number of nucleotides it must provide in
         order not to be discarded during spam's filtering. The value
         must be an integer. This parameter is optional (default value:
         1) and is now deprecated; it should almost never be used.
       + def_size: The default size assigned to a clone when no size is
         indicated in the masterfile and spam did not find its size in
         the clone file (if a clone file was supplied). It must be
         expressed in kbp (as in 1.3, 2.0, ...). This parameter is
         optional; if it is not supplied, spam will prompt the user each
         time it cannot determine the size of a clone.
       + clone: The path of the clone file. This parameter is optional.
       + cost: The path of the cost file. This parameter is mandatory.
       + table: The path of the quality tables file to produce
         (optional).

       All the identifiers are case-insensitive, so mf, Mf and MF all
       refer to the masterfile's name. In the following text, each
       parameter is described in detail.

       The masterfile and dump are validated when read. All tracks and
       reading names are expected to be of the correct format. All the
       annotations of the masterfile that were not identified as tracks
       or readings will be written in ".spam.reject". This file should be
       consulted often in order to verify that spam reads the annotations
       as it should; spam issues a warning if this file is not empty. If
       spam finds in the masterfile a track and a reading of the same
       clone, the track is discarded. The comments "starts xx bases from
       here" are parsed by spam to infer the correct position of a
       tracked clone in the masterfile.

       For spam to work correctly, it is very important that the dump and
       the masterfile be in sync. Spam ignores in the dump all contigs
       that contain only pseudo-sequences. Spam also ignores columns that
       have been "deleted" by stad2mf(5).
       Stad2mf(5) indicates in the masterfile which columns of the dump
       have been deleted by writing a line "SPAM_DELETED = " after the
       contig identifier line (e.g. ">1143gsm33").

       The optional clone file must be in clonelist(1) format. Spam uses
       this file to find the size of clones whose sizes are missing in
       the masterfile. During this step, all leading zeros of clone
       numbers are ignored, so clone "0123" and clone "123" refer to the
       same clone. Spam will abort if it finds multiple size definitions
       for a given clone. Spam will prompt the user to enter the correct
       size if the size in the clone file doesn't match the one read in
       the masterfile, or if it finds a clone without a size indicated in
       either the masterfile or the clone file. This behaviour changes if
       a default size is supplied (def_size), in which case spam uses
       that size whenever it doesn't know the size of a clone.

       The consensus file used by spam resembles the one used by
       stad2mf(5). Each line in the consensus file indicates the
       conditions that have to be met in order for a given character to
       be put in the quality masterfile (qmf for short). Conditions can
       be imposed on 7 different sequence properties (all deduced from
       the staden dump):

       - The number of readings a sequence position is covered by.
       - The number of different clones a sequence position is covered
         by.
       - The number of forward readings a sequence position is covered
         by.
       - The number of backward readings a sequence position is covered
         by.
       - The number of readings of reaction type "Std" (manual GTP) a
         sequence position is covered by.
       - The number of readings of reaction type "Itp" (manual ITP) a
         sequence position is covered by.
       - The number of readings of reaction type "Licor" (automatic GTP)
         a sequence position is covered by.
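
       As an illustration only, the seven properties listed above could
       be tallied for one sequence position as in the following Python
       sketch. The Reading record, its field names and the function
       coverage_properties are hypothetical; they do not reflect the
       staden dump format or spam's internals.

```python
from collections import namedtuple

# Hypothetical record for one reading placed in a contig (not spam's own format).
# start and end are inclusive positions in the contig.
Reading = namedtuple("Reading", "clone direction reaction start end")


def coverage_properties(readings, position):
    """Return the seven counts for one position, in the order listed above."""
    covering = [r for r in readings if r.start <= position <= r.end]
    return (
        len(covering),                                    # readings
        len({r.clone for r in covering}),                 # different clones
        sum(r.direction == "forward" for r in covering),  # forward readings
        sum(r.direction == "backward" for r in covering), # backward readings
        sum(r.reaction == "Std" for r in covering),       # manual GTP
        sum(r.reaction == "Itp" for r in covering),       # manual ITP
        sum(r.reaction == "Licor" for r in covering),     # automatic GTP
    )


# Made-up readings, for demonstration only.
readings = [
    Reading("123", "forward", "Std", 1, 300),
    Reading("123", "backward", "Licor", 250, 600),
    Reading("456", "forward", "Itp", 280, 500),
]

if __name__ == "__main__":
    print(coverage_properties(readings, 290))  # (3, 2, 2, 1, 1, 1, 1)
```

       Each consensus line is then a set of restrictions on such a tuple.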

       A restriction is imposed on each sequence property via an
       arithmetic expression that can have one of these forms:

       - A digit: the property's value has to be equal to the specified
         digit.
       - >=digit (or <=digit): the property's value has to be greater
         than (less than) or equal to the specified digit.
       - >digit (or <digit) [...]

           [...]
           1 - 0 1 0 0 1 <
           1 - - - 0 1 0 i
           #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           # 2 readings
           #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           2 - - - - - - .

       The "Character" column indicates which character to put into the
       qmf when all properties meet the imposed restrictions for a
       particular nucleotide in the sequence. All restrictions should be
       mutually exclusive. The "x" character must not be used in the
       consensus file: it represents the absence of information in a
       contig and is present (although invisible) at the extremities of
       every contig. Apart from that, the user may put any ASCII
       character in the consensus file, except "N" and "q".

       When two restrictions overlap, spam issues a warning and the
       restriction that was read first in the consensus file applies.
       Also, be sure that the consensus file covers all cases that arise
       in the dump; otherwise spam will output an error message when it
       doesn't know which character to put in the qmf. A good way to
       avoid this is to use the special consensus line "otherwise", which
       tells spam which character to put when none of the other consensus
       lines apply. Note that the -O option cannot be used if the
       consensus file contains an "otherwise" line.

       The value file assigns a numerical value to each character of the
       consensus file. This value represents the degree of sequence
       quality. By default, "x" has a value of 0. Every character in the
       consensus sequence must have a value. The quality value is
       essential to the filtering and optimization phase.
       It must be defined such that if a character (say "i", representing
       coverage by a single ITP reaction) can be "upgraded" into another
       one (say ".", representing coverage by any two readings), then the
       value associated with "i" is less than or equal to the value
       associated with ".". A sample value file follows:

           # Sample value file
           # A line starting with "#" is ignored, unless it looks like
           # a line that assigns a value to the character "#" (see below)
           # Blank lines are ignored

           . 30
           # 14
           = 14
           > 14
           < 14
           i 18

       The cost file associates an artificial cost with each reaction
       type. This "cost" represents the effort (time, money, ...)
       required to accomplish a sequencing action using the given
       reaction type. The bigger the number, the more costly the action
       is to execute. A typical cost file is shown below:

           # Sample cost file for tracks analysis
           # Lines starting with "#" are comments. Blank lines ignored.
           # The costs are specified for reactions: Std, Itp, Lic.
           # A special cost is specified when the reverse primer is used.
           # This special cost will override the reaction's standard cost

           ITP 1324
           STD 1314
           LIC 3466
           REV 6536

       All ids are case-insensitive and each cost must be positive.

       The table file (if its path is specified in the config file)
       reports all the positions of all the quality characters in the
       dump. For each contig, a table is created: its columns give all
       the positions in which a particular quality character occurs. A
       typical table is given below:

           Contig: 0764gsb15
           i = .
           --------------------------------
           412-484 1-22 364-411 485-514 23-154 515-668 155-363 669-709 710-771 772-806 807-865 866-1006 1007-1008

       On output, spam produces a quality file (either a quality
       masterfile or a quality dump file, see the "-d" switch). If a
       quality file's path was specified in the configuration file
       (qf =...), the output goes into that file. Otherwise, the output
       goes to the screen.
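
       The value and cost file formats described above can be read with
       a few lines of code. The Python sketch below is an illustration,
       not spam's own parser. The function names are hypothetical, and
       two details are assumptions: values and costs are taken to be
       integers, and a "#" line counts as a value assignment only when
       it consists of exactly "#" followed by one integer.

```python
import re


def parse_value_file(text):
    """Map each quality character to its value; "x" defaults to 0."""
    values = {"x": 0}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # blank lines are ignored
        # Assumption: "<one character> <integer>" is an assignment, even for "#".
        if len(fields) == 2 and len(fields[0]) == 1 and re.fullmatch(r"\d+", fields[1]):
            values[fields[0]] = int(fields[1])
        elif fields[0].startswith("#"):
            continue  # any other "#" line is a comment
        else:
            raise ValueError(f"malformed value line: {line!r}")
    return values


def parse_cost_file(text):
    """Map each reaction id (upper-cased, ids are case-insensitive) to its cost."""
    costs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        ident, cost = line.split()
        cost = int(cost)
        if cost <= 0:
            raise ValueError(f"cost for {ident} must be positive")
        costs[ident.upper()] = cost
    return costs


def upgrade_is_consistent(values, lower, higher):
    """Check that a character never outranks the one it can be upgraded into."""
    return values[lower] <= values[higher]


VALUE_SAMPLE = """# Sample value file
# Blank lines are ignored

. 30
# 14
= 14
> 14
< 14
i 18
"""

COST_SAMPLE = """# Sample cost file
ITP 1324
std 1314
LIC 3466
REV 6536
"""

if __name__ == "__main__":
    values = parse_value_file(VALUE_SAMPLE)
    print(values["#"], values["i"], values["."])
    print(parse_cost_file(COST_SAMPLE))
    print(upgrade_is_consistent(values, "i", "."))
```

       With the sample value file, "i" (18) can be upgraded to "." (30),
       so the ordering requirement is satisfied.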

       Spam relies on a few parameters to determine the number of new
       nucleotides obtained from every reaction and the offset of the
       resulting information. The user can modify any or all of them
       with the program setoffset(1), which must be run separately
       before spam is executed.

       After producing the quality file, spam asks the user whether
       (s)he wants to try to improve the quality of the information in
       the masterfile. The user can define various regions in the
       masterfile (a contig, a set of quality characters to improve,
       etc.) that need improvement. Spam will then try to suggest
       actions that will improve the sequencing quality in that
       particular region. This phase is done in five steps:

       1) Building the exhaustive set: Spam builds an array of all the
          possible actions that can be taken (we'll call this set E). A
          particular action is defined as sequencing a given clone using
          a given reaction type. E is built from the information read in
          the masterfile.

       2) Restricting E: The user can restrict the set of reactions and
          clones considered (e.g. specify that only tracks or forward
          reactions should be considered). Spam adjusts E accordingly.

       3) Primary filtering: Spam eliminates from E all the particular
          actions that do not cover the regions of interest. For
          example, if the user wants to improve the quality of a given
          contig and clone 123 is not in this contig, then all the
          particular actions associated with that clone are removed from
          E.

       4) Secondary filtering: Spam filters the remaining actions in E,
          eliminating all the particular actions that cannot improve the
          quality of the masterfile. Note that the resulting set of
          actions is not the optimized one; it merely contains actions
          that are "good candidates" for the optimization phase. In this
          phase spam uses the parameters min_nb_imp and min_nb_new to
          eliminate unwanted actions (see the description of the
          configuration file's parameters above).

       5) Optimization: Spam optimizes the set E. It tries to obtain the
          set of actions that will generate the greatest increase in
          quality at the minimum possible "cost" (see the description of
          the cost file). The cost is defined as the difficulty of
          executing a particular action (for example, a licor reaction
          is more "difficult" to do than a standard one, so spam will
          tend to prefer standard reactions over licor reactions). All
          these "costs" have to be defined in a file supplied by the
          user. (To be continued once the optimization is completed.)

       During the optimization, spam forks itself numerous times. Since
       at this point spam's process will be quite big, make sure that
       the swap space available on the host on which spam is run is big
       enough. Spam solves the optimization problem by splitting it into
       several small pieces, which are sent across the network to be
       solved in parallel on different computers. The list of hosts to
       which spam sends such problems is in
       /share/supported/apps/ogmp/lib/spam/spam.hosts. Before running an
       optimization, make sure that all the indicated hosts are
       operational and that you can run rcp and rsh on each of them
       without supplying a password. The optimization engine used by
       spam is lp_solve, from the public domain (a C implementation of
       the classic branch and bound algorithm).

       Spam allows the user to save all the results of a query in a file
       for further examination.

OPTIONS
       -o ocode
              Specify the organism code for the project. This is not
              necessary for projects which are in the
              genome_name.lst(1). Although it is fully validated, this
              option is only useful in the layout of the optimization
              result file (it is appended to the clone number, for
              easier retrieval in the masterfile afterwards). This
              option must be specified if -d is used.

       -d     Use only the dump to generate the quality file. All the
              masterfile's information is ignored.

       -O     If this switch is used, spam will execute the optimization
              phase.
              Note that this option cannot be used if the consensus file
              contains an "otherwise" line (see the consensus file
              section) or if -d was used.

FILES
       /share/supported/apps/ogmp/lib/perl/ogmp/stadump.pl
              The library used to parse the staden dump.

       /share/supported/apps/ogmp/lib/genome_name.lst
              List of all known OGMP sequencing projects.

       /share/supported/apps/ogmp/lib/spam/spam.hosts
              Lists all the hosts which will assist the host on which
              spam is run in the optimization procedure.

       /share/supported/apps/ogmp/lib/spam/spam.offsets
              This file is used to determine the start/end points of all
              the actions contained in a masterfile. It should not be
              edited. The program setoffset(1) should be used to modify
              the information contained in that file.

SEE ALSO
       stad2mf(5), clonelist(1), seqedit(1), setoffset(1)

AUTHOR
       Nicolas Brossard (program design)
       Gertraud Burger (specs)
       Pierre Rioux (project management)
       Organelle Genome Megasequencing Project