FERRET(1)               OGMP SEQUENCE UTILITIES                 FERRET(1)

NAME
    ferret - A sequence retrieval and reformatting tool

SYNOPSIS
    ferret <-d remote_database> <-a accession> | <-l locus> | <-g gim>
           [-t seq_type] [-i header_information] [-f filename] [-F tofile]
           [-D local_database_dir] [-M mailaddress] [-REPORT] [-STDIN]
    ferret -SERVER

DESCRIPTION
    ferret is a tool that allows a user or another application to
    request DNA or protein sequences from different data sources
    and store the sequences in Pearson/fasta format in a simple local
    database (henceforth called ferretbank). Current data sources are:

        - clever(1), a tool to access the Entrez database,
        - nclever, the network version of clever,
        - The NCBI e-mail retrieve server.

    The data sources are tried in the order listed.

    Reports obtained from the data sources (from which the sequences
    are extracted) are stored in a Reports subdirectory of ferretbank.

OPTIONS
    -d remote_database This mandatory argument specifies the remote
    database from which the requested sequence must come. Legal
    databases (and their synonyms) are:

                gb, genbank,
                pir,
                gp, genpept,
                sp, swissprot,
                emb, embl,
                kabat, kabatnuc, kabatpro,
                vector,
                tfd,
                epd,
                dbest.
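
    As an illustration only, the table above can be written as a lookup
    from every legal name to one canonical name. This is a sketch, not
    ferret's own code: the function name is hypothetical, and treating
    the first name on each line as canonical is an assumption. The three
    kabat names are kept distinct because the -t option treats "kabat"
    as a separate case.

```python
# Table transcribed from the list of legal databases above.
# Assumption: the first name on each line is the canonical one.
DATABASE_SYNONYMS = {
    "gb": "gb", "genbank": "gb",
    "pir": "pir",
    "gp": "gp", "genpept": "gp",
    "sp": "sp", "swissprot": "sp",
    "emb": "emb", "embl": "emb",
    "kabat": "kabat", "kabatnuc": "kabatnuc", "kabatpro": "kabatpro",
    "vector": "vector",
    "tfd": "tfd",
    "epd": "epd",
    "dbest": "dbest",
}


def canonical_database(name):
    """Return the canonical name for a -d argument, or raise ValueError."""
    if name not in DATABASE_SYNONYMS:
        raise ValueError("unknown database: %r" % name)
    return DATABASE_SYNONYMS[name]
```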

    -a accession This argument is the accession number for the sequence
    in the database specified by the -d option. Some databases are not
    indexed by accession number and therefore this option can be left
    out or an empty string "" can be supplied.

    -l locus This argument is the locus name for the sequence in
    the database specified by the -d option. Some databases are not
    indexed by locus name and therefore this option can be left
    out or an empty string "" can be supplied.

    -g gim This argument is an NCBI gim number used to retrieve sequences
    indexed with those numbers, as in the Entrez database. As currently
    only the NCBI Entrez database uses gim numbers, this option can be left
    out or an empty string "" can be supplied.

    -t seq_type This argument should be the letter N (for NUCLEOTIDE)
    or the letter P (for PROTEIN). Most supported databases are
    of one or the other type, and therefore this argument is almost
    never necessary. The only exception is if the user supplies
    "kabat" as the database (with the -d option): then it is
    necessary to tell ferret whether a protein or nucleotide
    sequence is expected.

    -i header_information This argument allows the user to explicitly
    set the content of the header part of the sequence, in Pearson/fasta
    format. If the sequence was retrieved using farfetch, this information
    will be used only if it is longer than what was originally
    in the farfetch sequence header. If no header information is supplied
    with the -i option, then ferret will supply its own.

    -f filename This argument allows the user to explicitly force a name
    for the file that will keep a copy of the sequence in the ferretbank.
    If absent, ferret will supply its own filename.

    -F tofile This is another filename; it allows the user to make a
    copy of the sequence in another file called tofile. It is the
    main way to extract a sequence already in the ferretbank.

    -M mailaddress This option allows the user to send a copy
    of the sequence by email to the address mailaddress.

    -D local_database_dir This option is used when the user wants
    ferret to store its sequences in a directory other than the
    $FERRETBANK defined in its configuration file.

    -REPORT This tells ferret that the REPORTS (not the fasta-formatted
    sequences) are to be processed by the -M and -F options.

    -SERVER This is an alternate mode of execution of ferret. In
    this mode ferret will read its standard input, expecting a
    mail message in the format of the NCBI mail retrieve server.
    It will then parse the message, extract a sequence from it
    and store it in the ferretbank, PROVIDED that the mail request
    that generated the mail message had been sent by ferret itself.
    The -SERVER mode is usually used in sendmail's aliases file
    in an entry like this one:

        ferret-server:"|/share/supported/apps/ogmp/bin/ferret -SERVER"

    See the installation manual for more details on this option.

    -STDIN This option can be used to add a sequence to the database
    without requesting it from ferret's normal data sources; when
    specified, ferret expects to receive a GenBank-formatted report on
    standard input. It will extract accession numbers, locus names, and
    gim numbers from that report and store the sequence in the database. If other
    command-line arguments supply accession numbers and such, ferret
    will make sure they are consistent with the content of the report.
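
    The extraction and consistency check described for -STDIN can be
    sketched as follows. This is not ferret's parser. It assumes that
    the report carries "LOCUS" and "ACCESSION" lines in the usual
    GenBank flat-file layout and that the numeric identifier appears
    as a "GI:<digits>" token; reports that encode it differently would
    need another pattern. Both function names are hypothetical.

```python
import re


def report_identifiers(report):
    """Pull locus, accession and GI number out of a GenBank-style report."""
    ids = {"locus": None, "accession": None, "gim": None}
    for line in report.splitlines():
        fields = line.split()
        if line.startswith("LOCUS") and ids["locus"] is None:
            if len(fields) > 1:
                ids["locus"] = fields[1]
        elif line.startswith("ACCESSION") and ids["accession"] is None:
            if len(fields) > 1:
                ids["accession"] = fields[1]   # primary accession only
        elif ids["gim"] is None:
            # Assumption: the identifier is written as "GI:12345".
            match = re.search(r"\bGI:(\d+)", line)
            if match:
                ids["gim"] = match.group(1)
    return ids


def inconsistencies(ids, **supplied):
    """Name the command-line values that contradict the report.

    Empty or missing values are skipped, since the options may be
    left out or given as an empty string.
    """
    return [name for name, value in supplied.items()
            if value and ids.get(name) and ids[name] != value]
```

    A caller would reject the report when inconsistencies() returns a
    non-empty list.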

EXECUTION
    When invoked in normal mode (that is, without the -SERVER or -CLEANUP
    options), ferret does the following things:

        - Validate command line arguments.
        - Check the ferretbank's INDEX to see if the sequence is already
          there. If it is, just perform the actions required by the
          -F and -M options if necessary and then exit.
        - If an accession number or a gim number was given, try
          to get a sequence from the Entrez database using clever(1).
          If this fails, try nclever.
        - If ferret still hasn't found a sequence, send a message
          to the NCBI retrieve server, and save the current info
          concerning this sequence in the ARRIVALS file.
        - Update the INDEX file, putting there the name under which
          the sequence was stored in the ferretbank or a flag to
          mark it as "waited for".
        - Perform the actions required by the -F and -M options if necessary.
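
    The normal-mode steps above amount to a lookup with ordered
    fallbacks. The sketch below shows only that control flow. The
    function and its parameters are hypothetical: a dict stands in for
    the INDEX file and plain callables stand in for clever, nclever
    and the mail request. Argument validation and the -F and -M
    actions are omitted.

```python
def retrieve(key, index, sources, mail_request):
    """Return the INDEX filename entry for key, fetching it if needed.

    key          -- hashable request, e.g. (database, accession)
    index        -- dict standing in for the INDEX file
    sources      -- callables tried in order (stand-ins for clever and
                    nclever); each returns a filename, or None on failure
    mail_request -- callable that sends a request to the mail server
    """
    if key in index:                 # already stored, or already waited for
        return index[key]
    filename = None
    for fetch in sources:            # the first source that succeeds wins
        filename = fetch(key)
        if filename is not None:
            break
    if filename is None:             # fall back to the e-mail server
        mail_request(key)
        filename = "@"               # "@" marks a sequence expected by mail
    index[key] = filename
    return filename
```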

    When invoked in SERVER mode, ferret does the following things:

        - Read the mail message into an internal buffer.
        - Read the ARRIVALS file and find the record related to
          the mail message.
        - Create a fasta-formatted sequence from the report found
          in the body of the message.
        - Store it in the ferretbank, update the INDEX, and perform
          the actions related to the -M and -F options as they were
          specified when ferret sent the request.
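
    For readers unfamiliar with the Pearson/fasta layout mentioned
    above, a record is a header line starting with ">" followed by the
    residues on wrapped lines. The sketch below is a generic formatter,
    not ferret's own. The function name and the line width of 60 are
    assumptions; ferretfasta(5) describes the format ferret actually
    writes.

```python
def to_fasta(header, sequence, width=60):
    """Format one sequence as a Pearson/fasta record."""
    residues = "".join(sequence.split())      # drop spaces and newlines
    lines = [residues[i:i + width]
             for i in range(0, len(residues), width)]
    return ">" + header + "\n" + "".join(line + "\n" for line in lines)
```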

    In normal mode, errors are reported on standard output. In
    server mode, ordinary errors and warnings are reported there
    too, but grave errors are reported by e-mail to the ferret
    administrator. In all cases, error messages and information
    on received sequences are logged in the LOGFILE.

    ferret uses a file locking mechanism to provide consistent access
    to the LOGFILE, INDEX and ARRIVALS files when many users run it
    simultaneously.
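
    To illustrate what such locking is meant to guarantee, the sketch
    below appends a line to a shared file while holding an exclusive
    advisory lock. It uses the Unix flock(2) call through Python's
    fcntl module, which is a different technique from ferret's own
    lock.pl library; the function name is hypothetical.

```python
import fcntl


def append_locked(path, line):
    """Append one line to path while holding an exclusive advisory lock."""
    with open(path, "a") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)    # blocks while another process holds it
        try:
            handle.write(line.rstrip("\n") + "\n")
            handle.flush()                    # write out before releasing
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

    Advisory locks only protect against other processes that take the
    same lock, so every writer has to go through the same routine.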

FILES
    $PERLLIB/ogmp/ferret.conf
                             A configuration file containing many
                             absolute path definitions. $PERLLIB is the
                             standard perl library path.

    $FERRETBANK              A directory, ferret's database of sequences

    $FERRETBANK/Reports      Where the reports from which the sequences
                             were extracted are kept.

    $FERRETBANK/Support      Where some support files for managing
                             the sequences are found. Known as $FERRETSUPPORT.

    $FERRETSUPPORT/LOGFILE   The logfile where information and error
                             messages are stored.

    $FERRETSUPPORT/ARRIVALS  This file contains records on what sequences
                             are being expected by mail.

    $FERRETSUPPORT/INDEX     This file contains records (one per line) on
                             what sequences are in the database. Fields
                             are separated by tabs. The fields are:
                             database_name, accession, locus, gim, filename.
                             A "@" in the filename field means that the
                             sequence is expected by mail.
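
    The INDEX record layout described above can be read with a few
    lines of code. This is a sketch based only on that description:
    the names are hypothetical, and it assumes that a pending record
    holds exactly "@" in the filename field.

```python
from collections import namedtuple

# Field order as listed in the INDEX description.
IndexRecord = namedtuple(
    "IndexRecord", "database_name accession locus gim filename")


def parse_index_line(line):
    """Split one tab-separated INDEX line into its five fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(
            "expected 5 tab-separated fields, got %d" % len(fields))
    return IndexRecord(*fields)


def is_expected_by_mail(record):
    """True when the sequence has been requested but not yet received."""
    return record.filename == "@"
```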

    $FERRETSUPPORT/FERRET.RULES
                             A rule file loaded by the RulePack.pl perl
                             library and used by ferret to make meaningful
                             header lines and file names for sequences
                             and reports.

BUGS
    Sometimes the Mail Retrieve Server will send TWO identical replies
    to one query. Ferret will process the first one correctly, but
    will forward the second message to the operator, with the message
    "Cannot identify type of report". This bug is not really related to
    ferret itself.

    The file locking mechanism (implemented in the "lock.pl" perl library)
    is buggy; it doesn't always prevent several processes from writing to
    the same file at the same time. This is never a problem unless many,
    many processes do intensive access to the database.

    The -D option, to make ferret use a completely separate directory
    hierarchy as its database, has not been tested exhaustively.

SEE ALSO
    clever(1), rulepack(3), ferretfasta(5), writegde(1)

AUTHOR
    Pierre Rioux, Organelle Genome Megasequencing Project, March 1994.
    Project management: Tim Littlejohn, OGMP.
    Version 3.0 February 1995.