FERRET(1)                  OGMP SEQUENCE UTILITIES                  FERRET(1)

NAME
       ferret - A sequence retrieval and reformatting tool

SYNOPSIS
       ferret <-d remote_database> <-a accession> | <-l locus> | <-g gim>
              [-t seq_type] [-i header_information] [-f filename]
              [-F tofile] [-D local_database_dir] [-M mailaddress]
              [-REPORT] [-STDIN]

       ferret -SERVER

DESCRIPTION
       ferret is a tool that allows a user or another application to
       request DNA or protein sequences from different data sources and
       store the sequences in Pearson/fasta format in a simple local
       database (henceforth called ferretbank). Current data sources are:

       - clever(1), a tool to access the Entrez database,
       - nclever, the network version of clever,
       - The NCBI e-mail retrieve server.

       The data sources are tried in the order listed. Reports obtained
       from the data sources (from which the sequences are extracted) are
       stored in a Reports subdirectory of ferretbank.

OPTIONS
       -d remote_database
              This mandatory argument specifies the remote database from
              which the requested sequence must come. Legal databases
              (and their synonyms) are: gb, genbank, pir, gp, genpept,
              sp, swissprot, emb, embl, kabat, kabatnuc, kabatpro,
              vector, tfd, epd, dbest.

       -a accession
              This argument is the accession number for the sequence in
              the database specified by the -d option. Some databases
              are not indexed by accession number and therefore this
              option can be left out or an empty string "" can be
              supplied.

       -l locus
              This argument is the locus name for the sequence in the
              database specified by the -d option. Some databases are
              not indexed by locus name and therefore this option can be
              left out or an empty string "" can be supplied.

       -g gim
              This argument is an NCBI gim number used to retrieve
              sequences indexed with those numbers, as in the Entrez
              database. As currently only the NCBI Entrez database uses
              gim numbers, this option can be left out or an empty
              string "" can be supplied.

       -t seq_type
              This argument should be the letter N (for NUCLEOTIDE) or
              the letter P (for PROTEIN).
              Most supported databases are of one or the other type, and
              therefore this argument is almost never necessary. The
              only exception is if the user supplies "kabat" as the
              database (with the -d option): then it is necessary to
              tell ferret whether a protein or nucleotide sequence is
              expected.

       -i header_information
              This argument allows the user to explicitly set the
              content of the header part of the sequence, in
              Pearson/fasta format. If the sequence was retrieved using
              farfetch, this information will be used only if it is
              longer than what was originally in the farfetch sequence
              header. If no header information is supplied with the -i
              option, then ferret will supply its own.

       -f filename
              This argument allows the user to explicitly force a name
              for the file that will keep a copy of the sequence in the
              ferretbank. If absent, ferret will supply its own
              filename.

       -F tofile
              This is another filename; it allows the user to make a
              copy of the sequence in another file called tofile. It is
              the main way to extract a sequence already in the
              ferretbank.

       -M mailaddress
              This option allows the user to send a copy of the sequence
              by email to the address mailaddress.

       -D local_database_dir
              This option is used when the user wants ferret to add its
              sequences to a directory other than the $FERRETBANK
              hardcoded in its configuration file.

       -REPORT
              This tells ferret that the reports (not the
              fasta-formatted sequences) are to be processed by the -M
              and -F options.

       -SERVER
              This is an alternate mode of execution of ferret. In this
              mode ferret will read its standard input, expecting a mail
              message in the format of the NCBI mail retrieve server. It
              will then parse the message, extract a sequence from it
              and store it in the ferretbank, PROVIDED that the mail
              request that generated the mail message had been sent by
              ferret itself.
              The -SERVER mode is usually used in sendmail's aliases
              file in an entry like this one:

                ferret-server:"|/share/supported/apps/ogmp/bin/ferret -SERVER"

              See the installation manual for more details on this
              option.

       -STDIN
              This option can be used to add a sequence to the database
              without requesting it from ferret's normal data sources;
              when specified, ferret expects to receive a GENBANK
              formatted report on standard input. It will extract
              accession numbers, locus names, and GI numbers from that
              report and put it in the database. If other command-line
              arguments supply accession numbers and such, ferret will
              make sure they are consistent with the content of the
              report.

EXECUTION
       When invoked in normal mode (that is, without the -SERVER or
       -CLEANUP options), ferret does the following things:

       - Validate command line arguments.
       - Check the ferretbank's INDEX to see if the sequence is already
         there. If it is, just perform the actions required by the -F
         and -M options if necessary and then exit.
       - If an accession number or a gim number was given, try to get a
         sequence from the Entrez database using clever(1). If this
         fails, try nclever.
       - If ferret still hasn't found a sequence, send a message to the
         NCBI retrieve server, and save the current info concerning this
         sequence in the ARRIVALS file.
       - Update the INDEX file, putting there the name under which the
         sequence was stored in the ferretbank or a flag to mark it as
         "waited for".
       - Perform the actions required by the -F and -M options if
         necessary.

       When invoked in SERVER mode, ferret does the following things:

       - Read the mail message into an internal buffer.
       - Read the ARRIVALS file and find the record related to the mail
         message.
       - Create a fasta-formatted sequence from the report found in the
         body of the message.
       - Store it in the ferretbank, update the INDEX, and perform the
         actions related to the -M and -F options as they were specified
         when ferret sent the request.

       In normal mode, errors are reported on standard output.
       In server mode, the normal errors or warnings are reported there
       too, but grave errors are reported by e-mail to the ferret
       administrator. In all cases, error messages and information on
       received sequences are logged in a LOGFILE.

       ferret uses a file locking mechanism to provide consistent access
       to the LOGFILE, INDEX and ARRIVALS files when many users access
       them simultaneously.

FILES
       $PERLLIB/ogmp/ferret.conf
              A configuration file containing many absolute path
              definitions. $PERLLIB is the standard perl library path.

       $FERRETBANK
              A directory, ferret's database of sequences.

       $FERRETBANK/Reports
              Where the reports from which the sequences were extracted
              are kept.

       $FERRETBANK/Support
              Where some support files for the managing of the sequences
              are found. Known as $FERRETSUPPORT.

       $FERRETSUPPORT/LOGFILE
              The logfile where information and error messages are
              stored.

       $FERRETSUPPORT/ARRIVALS
              This file contains records on what sequences are being
              expected by mail.

       $FERRETSUPPORT/INDEX
              This file contains records (one per line) on what
              sequences are in the database. Fields are separated by
              tabs. The fields are: database_name, accession, locus,
              gim, filename. A "@" in the filename field means that the
              sequence is expected by mail.

       $FERRETSUPPORT/FERRET.RULES
              A rule file loaded by the RulePack.pl perl library and
              used by ferret to make meaningful header lines and file
              names for sequences and reports.

BUGS
       Sometimes the Mail Retrieve Server will send TWO identical
       replies to one query. Ferret will process the first one
       correctly, but will forward the second message to the operator,
       with the message "Cannot identify type of report". This bug is
       not really related to ferret itself.

       The file locking mechanism (implemented in the "lock.pl" perl
       library) is buggy; it doesn't always prevent many processes from
       writing to the same file at the same time. This is never a
       problem unless many, many processes do intensive access to the
       database.

       The -D option, to make ferret use a completely separate directory
       hierarchy as its database, has not been tested exhaustively.

SEE ALSO
       clever(1), rulepack(3), ferretfasta(5), writegde(1)

AUTHOR
       Pierre Rioux, Organelle Genome Megasequencing Project, March 1994.

       Project Management: Tim Littlejohn, OGMP.

       Version 3.0, February 1995.
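
EXAMPLES
       The INDEX record layout described under FILES (five tab-separated
       fields, with "@" in the filename field marking a sequence still
       expected by mail) can be read with a few lines of code. The
       Python sketch below is illustrative only and is not part of
       ferret. The names IndexRecord, parse_index_line and find_record
       are hypothetical, and the matching rule (same database, any one
       supplied key equal) is an assumption about how a lookup might
       work, not a description of ferret's own logic.

```python
from typing import Iterable, NamedTuple, Optional

PENDING = "@"  # per FILES: "@" in the filename field = expected by mail


class IndexRecord(NamedTuple):
    """One line of $FERRETSUPPORT/INDEX, fields in documented order."""
    database_name: str
    accession: str
    locus: str
    gim: str
    filename: str

    @property
    def pending(self) -> bool:
        # True when the sequence has been requested but not yet received.
        return self.filename == PENDING


def parse_index_line(line: str) -> IndexRecord:
    """Split one tab-separated INDEX line into its five fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    return IndexRecord(*fields)


def find_record(records: Iterable[IndexRecord], database_name: str,
                accession: str = "", locus: str = "",
                gim: str = "") -> Optional[IndexRecord]:
    """Return the first record in database_name matching any supplied key.

    Assumption: empty keys are ignored, mirroring the man page's note
    that unused identifiers may be passed as an empty string.
    """
    for rec in records:
        if rec.database_name != database_name:
            continue
        if ((accession and rec.accession == accession)
                or (locus and rec.locus == locus)
                or (gim and rec.gim == gim)):
            return rec
    return None


if __name__ == "__main__":
    # Hypothetical sample lines; real accession and file names will differ.
    sample = [
        "gb\tX00001\tLOCUS1\t111\tseq1.fa\n",
        "sp\tP00002\tLOCUS2\t\t@\n",
    ]
    records = [parse_index_line(line) for line in sample]
    hit = find_record(records, "sp", locus="LOCUS2")
    print(hit, "pending" if hit and hit.pending else "stored")
```

       A script wrapping ferret could use such a lookup to tell whether
       a sequence is already stored or is still awaited by mail before
       issuing another request.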