SERAM(1)                 OGMP SEQUENCE UTILITIES                 SERAM(1)

NAME
     seram - a tool to retrieve sequences in batch and call
     applications (scripts) on them.

SYNOPSIS
     seram -s sequences_file -f scriptname -r|-R [-I] [-D] [-F] [-T info]

DESCRIPTION
     seram is a tool that allows a user or another program to request
     a list of sequences using ferret(1) and execute a custom script
     for each of these sequences, or only once for all of them.

OPTIONS
     -s sequences_file
          is the file containing sequence specifications (called the
          "seqspec" file). Each line in this file is either a path to
          a file containing a sequence or an NCBI format identifier
          for a sequence. Paths are prefixed by the characters "F-"
          and identifiers by "S-". Relative paths are transformed into
          absolute paths by searching for the file relative to the
          current directory from which seram was started; if this
          fails, the file is searched for relative to the $FERRETBANK
          directory (hardcoded in seram). Here is an example seqspec
          file:

               # Example seqspec file
               F-/usr/local/myseqbank/a_sequence
               F-another_sequence
               S-gb|access|locus other information

          The first entry is an absolute path to a_sequence. In the
          second entry, the path to another_sequence is a relative
          one. The third entry is an NCBI identifier for a GenBank
          sequence. Comments must start with a "#" and be on separate
          lines.

     -f scriptname
          is the file containing the template for the script to be
          executed. It can be any kind of script (Bourne shell, C
          shell, Perl, etc.) as long as the first line contains the
          correct "#!" sequence. The script doesn't need to have the
          executable bit set, since seram copies this file into a
          temporary working directory and sets its executable bit
          there. Unless the -I option is specified, all occurrences of
          the two characters "{}" will be replaced by the filename of
          a sequence before execution. The script will be started with
          its current directory set to $FERRETBANK.

     -r or -R
          The -r option tells seram that the script must be run for
          EACH sequence specified in the seqspec file; the -R option
          tells it that the script must be run only once, when ALL
          sequences are ready (i.e. have been received by ferret(1)).
          In the case of -R, all sequences are concatenated together
          in a temporary file before being supplied to the script.

     -I
          The presence or absence of this flag tells seram HOW to feed
          sequences to the script. When present, the sequences are fed
          to the script on standard input, and the script is executed
          exactly as supplied by the user of seram. When it is absent,
          all occurrences of the two characters "{}" in the script are
          replaced by the filename of the sequence.

     -D
          Tells seram to DELETE the two files scriptname and
          sequences_file after reading them. This can be useful when
          they are temporary files created by another application
          which doesn't want to wait until seram has completed its
          work (which can take very long if ferret(1) expects some
          sequences to be received by e-mail) before deleting them.
          Since seram makes its own copy of these files in a temporary
          directory that it manages, the original files are no longer
          needed once seram has been invoked; when the -D option is
          present, the two files will therefore be deleted soon after
          execution starts.

     -F
          Tells seram to ask ferret to retrieve the Full reports
          associated with the sequence specifications, rather than
          simply the sequences.

     -T info
          Supplies seram with some more information about the program
          that called it. info can be any descriptive text string.

EXECUTION
     seram does the following processing on its input:

     1- Validate command line arguments.

     2- Validate the seqspec file. This includes:
        - Making sure all paths on F- lines are accessible,
        - Making sure databases specified by S- lines in the seqspec
          file are supported by ferret(1),
        - Reporting errors and commenting out offending lines.

     3- Send a ferret(1) request for each S- line in the seqspec file
        (if any).
        Some support files are created when this step starts (see the
        section JOB-SPECIFIC SUPPORT FILES).

     4- If the -r option was specified, seram will:
        a- Execute the script for all sequences specified as F-,
        b- Access ferret(1)'s index file to find out which sequences
           specified as S- have been received, and process them too,
        c- Remove from the seqspec list the lines processed in 4a and
           4b,
        d- Wait 15 minutes and try again at step 4b, as long as there
           are still sequences being waited for.

     5- If the -R option was specified, seram will:
        a- Access ferret(1)'s index file to find out which sequences
           specified as S- are still being waited for,
        b- If any such sequence exists, wait 15 minutes and go back
           to 5a,
        c- Else, all sequences are available, and seram will
           concatenate them all into a temporary file and feed this
           file to the script.

     If seram receives a HANGUP (kill -1) signal during the 15-minute
     waiting period, it will stop waiting and immediately proceed to
     checking the arrival state of the expected sequences.

JOB-SPECIFIC SUPPORT FILES
     Starting at step 3 of the preceding execution breakdown, seram
     creates in a private working directory (usually $FERRETBANK/Jobs)
     four files that are a snapshot of the state of the current
     processing. These files can be used if the script is killed and
     the user wants to restart the processing.

     The "seram.restart.*" file contains the command(s) necessary to
     restart everything. The "seram.script.*" file contains a copy of
     the script template exactly as supplied by the -f command line
     argument. The "seram.data.*" file contains the sequence
     specifications that remain to be processed (which may differ from
     the sequences_file originally supplied as argument, since some
     sequences may have been processed already). The "seram.info.*"
     file contains relevant textual information about the job (this
     file is mostly used by badger(1)).

KILLING SERAM
     A seram job can be cleanly killed by removing the "seram.info.*"
     file related to that job.
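
     The file naming scheme listed under FILES and the two control
     mechanisms described above (removing the info file, sending a
     HANGUP) can be combined into small helper functions. What follows
     is only a sketch under stated assumptions: the function names and
     the JOBS_DIR variable are hypothetical and not part of seram, and
     the Jobs directory is assumed to be $FERRETBANK/Jobs, which this
     page describes only as the usual location.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names are part of seram.
# ASSUMPTION: the job directory is $FERRETBANK/Jobs (the "usual" place).
JOBS_DIR="${JOBS_DIR:-${FERRETBANK:-.}/Jobs}"

# Print the PIDs of a user's jobs, one per line, taken from the names
# of the seram.info.$USER.$PID files.
seram_jobs() {
    for f in "$JOBS_DIR"/seram.info."$1".*; do
        [ -e "$f" ] || continue          # the glob matched nothing
        printf '%s\n' "${f##*.}"         # text after the last dot is the PID
    done
}

# Cleanly cancel a job by removing its info file.
# Usage: seram_cancel user pid
seram_cancel() {
    rm -f "$JOBS_DIR/seram.info.$1.$2"
}

# Cut short the 15-minute wait of a running job (same as kill -1).
# Usage: seram_wake pid
seram_wake() {
    kill -HUP "$1"
}
```

     For example, "seram_jobs $USER" lists the PIDs of your own jobs
     and "seram_cancel $USER 1234" cancels the job whose PID is 1234.
     badger(1) remains the supported way to monitor and cancel jobs.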

REPORTING
     If sequences are still expected after 24 hours, seram will report
     that to the user by sending her/him an e-mail message. Also, any
     sequence specification which resulted in an error will be
     reported the same way. When the job completes, a final report
     with information regarding all correctly and incorrectly
     processed specifications will be sent. If a serious error occurs,
     then a FATAL error message will be sent both to the user who
     submitted the job and to the OPERATOR, as defined by the
     ferret(1) configuration file.

EXAMPLE
     Here is an example of a working seram invocation. Let scriptfile
     be this:

     ---- Begin scriptfile ----
     #!/bin/sh
     # This is a Bourne shell script which sends a sequence to a user.
     if [ "{}" = \{\} ] ; then
         cat | mail -s ScriptOutput username       # Seq via stdin
     else
         cat "{}" | mail -s ScriptOutput username  # Seq file replaces {}
     fi
     ---- End scriptfile ----

     and sequences_file contains this:

     ---- Begin sequences_file ----
     # A seqspec file
     F-/usr/local/ferretbank/human-complete-genome
     F-human-mitochondrial-genome
     S-emb|X54252|MTCE C. elegans complete mitochondrial genome
     ---- End sequences_file ----

     When seram is invoked as "seram -f scriptfile -s sequences_file
     -r", the script will be run three times, once for each of the
     three sequences, with the possibility that seram will have to
     wait for the third sequence to be retrieved by ferret(1).

     When seram is invoked as "seram -f scriptfile -s sequences_file
     -R", seram will wait until the third sequence is received by
     ferret(1) and THEN run the script once with the three sequence
     files concatenated.

     Note that this script is made to work both with and without the
     -I option of seram; when run, it detects whether the sequence is
     supplied via stdin or through a filename substituted in its body.

FILES
     scriptfile
          Supplied as argument.

     sequences_file
          Supplied as argument.

     $PERLLIB/ogmp/ferret.conf
          A configuration file where many absolute paths are defined.
          $PERLLIB is the standard Perl library path.

     $FERRETBANK/
          The directory where ferret(1) stores sequences.

     $FERRETBANK/Reports
          The directory where ferret(1) stores reports.

     $FERRETBANK/Support/INDEX
          Index of sequences/reports.

     $FERRETBANK/Jobs/
          A working directory where the seram.xxx.yyy.zz files are
          created:

          seram.info.$USER.$PID      Main info related to the job.
          seram.data.$USER.$PID      List of sequence specifications.
          seram.script.$USER.$PID    Script to be run for each/all
                                     sequences.
          seram.restart.$USER.$PID   Small script used to restart this
                                     job.

BUGS
     seram is dependent on ferret(1); if ferret doesn't properly
     update the INDEX file, seram may wait for sequences indefinitely.
     The tool badger(1) can be used to monitor (and cancel) seram jobs
     that have been running for too long, or to restart killed seram
     jobs.

SEE ALSO
     ferret(1), badger(1)

AUTHORS
     Pierre Rioux, Tim Littlejohn (Project Management)

Organelle Genome Megasequencing Program, March 1994, January 1995.
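
     The seqspec rules given under the -s option (the "F-" and "S-"
     prefixes, "#" comments on their own lines, and relative paths
     looked up first in the current directory and then in $FERRETBANK)
     can be sketched in Bourne shell as follows. This is not seram's
     own code: the function names are hypothetical, skipping blank
     lines is an assumption this page does not confirm, and
     $FERRETBANK is assumed to be set in the environment, whereas
     seram itself has it hardcoded.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the seqspec rules; not part of seram.

# Read a seqspec file on stdin and print one line per entry:
#   FILE <path>         for "F-" lines
#   ID <identifier>     for "S-" lines
#   BAD <line>          for anything else
# Comment lines are skipped; so are blank lines (an ASSUMPTION).
classify_seqspec() {
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            '#'*|'') ;;
            F-?*) printf 'FILE %s\n' "${line#F-}" ;;
            S-?*) printf 'ID %s\n' "${line#S-}" ;;
            *)    printf 'BAD %s\n' "$line" ;;
        esac
    done
}

# Turn the path of an F- entry into an absolute path: absolute paths
# are returned unchanged; relative ones are looked up in the current
# directory, then in $FERRETBANK. Returns 1 if the file is not found.
resolve_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;
        *)  if [ -e "$PWD/$1" ]; then
                printf '%s\n' "$PWD/$1"
            elif [ -n "${FERRETBANK:-}" ] && [ -e "$FERRETBANK/$1" ]; then
                printf '%s\n' "$FERRETBANK/$1"
            else
                return 1
            fi ;;
    esac
}
```

     A calling program could run "classify_seqspec < sequences_file"
     to catch malformed lines before submitting the job; seram
     performs its own validation in step 2 of EXECUTION regardless.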