FINDREP(1)                   OGMP ANALYSIS TOOL                   FINDREP(1)

NAME
    findrep - find exact repeated subsequences in a masterfile

SYNOPSIS
    findrep [-v] [-s rep_size] [-h sub_size] [-C]
            [-S strand] [-B contigs] [-H] [-G] [-Q] masterfile

DESCRIPTION
    findrep analyses a masterfile (that is, a file in mf(5) format) and
    reports occurrences of identical repeated subsequences whose length
    is greater than or equal to some value (by default, 100 bases).

OPTIONS
    -s rep_size
       rep_size is the minimum size of the repeated subsequences to report;
       by default it is 100 bases.

    -h sub_size
       sub_size is an internal value specific to the length of the
       subsequences used during the hashing part of the processing.
       Its default value is the same as rep_size (as specified with
       the -s command line option). sub_size cannot have a value greater
       than rep_size. The memory used by the program increases roughly
       with the product sub_size * size_of_masterfile (where
       size_of_masterfile is the total number of bases in the masterfile).
       Therefore, for large genomes, it can be more memory efficient to
       reduce sub_size to less than rep_size, at the cost of a little
       more processing time when the hashing is done.
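
       The relationship between sub_size and rep_size can be illustrated
       with a minimal Python sketch. It is NOT findrep's implementation:
       it is an assumed seed-and-extend scheme, forward strand only, on a
       single sequence, and the name find_exact_repeats is hypothetical.
       It shows why sub_size may not exceed rep_size: every repeat of at
       least rep_size bases contains a window of sub_size bases that can
       serve as a hash key.

```python
from collections import defaultdict


def find_exact_repeats(seq, rep_size, sub_size=None):
    """Return sorted (start_a, start_b, length) tuples (0-based starts)
    for exact forward-strand repeats of at least rep_size bases.

    Illustrative sketch only; not the algorithm used by findrep.
    """
    if sub_size is None:
        sub_size = rep_size          # same default as described for -h
    if sub_size > rep_size:
        raise ValueError("sub_size cannot be greater than rep_size")

    # Hashing phase: index every window of sub_size bases by its content.
    index = defaultdict(list)
    for i in range(len(seq) - sub_size + 1):
        index[seq[i:i + sub_size]].append(i)

    repeats = set()
    for positions in index.values():
        for x in range(len(positions)):
            for y in range(x + 1, len(positions)):
                a, b = positions[x], positions[y]
                # Skip seeds that can be extended to the left; the same
                # repeat is reported from its leftmost seed instead.
                if a > 0 and seq[a - 1] == seq[b - 1]:
                    continue
                # Extension phase: grow the match to the right.
                length = sub_size
                while b + length < len(seq) and seq[a + length] == seq[b + length]:
                    length += 1
                if length >= rep_size:
                    repeats.add((a, b, length))
    return sorted(repeats)
```

       In this sketch a smaller sub_size gives shorter hash keys but more
       candidate seeds to extend, which mirrors the memory versus
       processing time trade-off described above.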

    -v Verbose option. Prints more information during the processing
       phase; use this when you want to see how quickly (or slowly)
       the analysis proceeds on a particular platform. For even more
       info, try -v2 (no space between the v and the 2).

    -C Coordinates file format. The output will be in coordinates(5)
       file format rather than FASTA.

    -S strand
       Restrict the strandedness of the search. The argument strand
       can be one of the following keywords:

           - "forward", "noncomp"          for the forward strand;
           - "reverse", "comp"             for the reverse strand;
           - "both", "any"                 for both strands.

       Actually, only the first letter of the keyword is looked at;
       thus the option "-S both" can be written as "-S b" or "-Sb".

       Restricting the strandedness to a single strand roughly halves the
       amount of memory required during processing. The default behavior
       is to search both strands.
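
       The first-letter rule can be sketched in Python as follows. The
       function name parse_strand and the returned labels are
       hypothetical, and the treatment of unknown or differently cased
       keywords is an assumption; findrep itself may behave differently.

```python
def parse_strand(keyword):
    """Map a -S keyword to 'forward', 'reverse' or 'both' using only
    its first letter, as described for the -S option.

    Hypothetical helper; error handling and case sensitivity are
    assumptions, not documented findrep behavior.
    """
    first_letter_to_strand = {
        "f": "forward",  # "forward"
        "n": "forward",  # "noncomp"
        "r": "reverse",  # "reverse"
        "c": "reverse",  # "comp"
        "b": "both",     # "both"
        "a": "both",     # "any"
    }
    first = keyword[:1]
    if first not in first_letter_to_strand:
        raise ValueError("unrecognized strand keyword: %r" % keyword)
    return first_letter_to_strand[first]
```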

    -B contigs
       Restrict the reporting of the repeats. You can ignore repeats
       that are in the same contig or ignore repeats that are in
       different contigs. The argument contigs can be one of the
       following keywords:

           - "any"              allow repeats between any contigs.
           - "same"             only allow repeats in the same contig.
           - "different"        only allow repeats in different contigs.

    -H Reports only hairpin repeats (subsequences like ATACTGCAGTAT which
       are their own repeat when reversed and complemented). This
       option forces the options "-S both" and "-B same".
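
       The hairpin property of the example subsequence can be checked
       with a short Python sketch. The helper names are hypothetical and
       the code assumes an uppercase A/C/G/T alphabet; it only tests the
       property and does not reproduce findrep's search.

```python
# Translation table for complementing uppercase DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_own_reverse_complement(seq):
    """True when seq reads the same after reversing and complementing."""
    return seq == reverse_complement(seq)


print(is_own_reverse_complement("ATACTGCAGTAT"))  # prints True
```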

    -G Graphical option. In the reports, a scaled-down representation of
       the contigs with the location of the repeats will be inserted.
       See the OUTPUT section for more details.

    -Q Quiet mode. When present, disables ALL output except for a
       single integer, which is the length of the longest repeat
       of length greater than or equal to rep_size. The program will
       output "0" (zero) if no repeat of at least rep_size was found.

OUTPUT
    The information displayed by findrep consists of a header section that
    identifies the parameters used during invocation, followed by zero or
    more repeat reports.

    A repeat report typically consists of 6 parts, numbered (1) to (6) in
    the sample report below:

(1) >Repeat 2: forward(304-406)  reverse(111-213)  (103)
(2) ;;> R.youthere unknown reading frame #1
(3) ;;> P.terpiper mitochondrial fragment
(4) ;; >>>>>>>>>>>>---->>>>>>>>>>>>>
(5) ;; <<<<----<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
(6) TTGATCAACGGAACAAGTTACCCTAGGGATAACAGCGCAATCCTATTCTAGAGTCCATAT
    CGACAATAGGGTTTACTGGATCAGGACATCCTAATGGTGCAGC

    Explanation of each part:

    (1) >Repeat 2: forward(304-406)  reverse(111-213)  (103)
        This is a FASTA header; repeats are numbered sequentially as they
        are printed out. The "forward(304-406)" string identifies the first
        subsequence match for this repeat; "forward" means that this
        match occurred in the contig as read by the program from the
        masterfile. The "reverse(111-213)" string identifies the second
        subsequence match for this repeat; "reverse" means that this match
        occurred in the contig on the strand opposite to the one given by
        the masterfile. ALL range positions (like "304-406" and "111-213")
        are relative to the beginning of the contig in the direction given
        by the masterfile, no matter whether the match is "forward" or
        "reverse". The last piece of information, "(103)", is the length
        of the repeat.
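
        A header line of this form could be parsed with a Python sketch
        like the one below. The pattern is inferred from the single
        sample header shown here, so other layouts produced by findrep
        may not match; the function name and dictionary keys are
        hypothetical. In the sample, both ranges are consistent with
        inclusive positions (406 - 304 + 1 = 103).

```python
import re

# Pattern inferred from the sample header only; an assumption, not a
# format specification.
_HEADER_RE = re.compile(
    r"^>Repeat (\d+): "
    r"(forward|reverse)\((\d+)-(\d+)\)\s+"
    r"(forward|reverse)\((\d+)-(\d+)\)\s+"
    r"\((\d+)\)$"
)


def parse_repeat_header(line):
    """Split a repeat header into its number, two matches and length."""
    m = _HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError("not a recognized repeat header: %r" % line)
    return {
        "number": int(m.group(1)),
        "first": (m.group(2), int(m.group(3)), int(m.group(4))),
        "second": (m.group(5), int(m.group(6)), int(m.group(7))),
        "length": int(m.group(8)),
    }


header = ">Repeat 2: forward(304-406)  reverse(111-213)  (103)"
info = parse_repeat_header(header)
strand, start, end = info["first"]
print(strand, end - start + 1 == info["length"])  # prints: forward True
```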

    (2) ;;> R.youthere unknown reading frame #1
    (3) ;;> P.terpiper mitochondrial fragment
        These lines identify the contigs in which the repeats were found.
        Line (2) is the name of the first contig (in our example, the
        one where the "forward(304-406)" match occurred) and line (3)
        is the name of the second contig (where the "reverse(111-213)"
        match occurred). If the two matches occurred in the same contig,
        then line (3) is not displayed in the output.

    (4) ;; >>>>>>>>>>>>---->>>>>>>>>>>>>
    (5) ;; <<<<----<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<...
        These lines appear only when the -G (graphics) option is specified
        on the command line. They show where the matches occurred
        (the dashed lines) inside each contig. The contigs are shown
        with ">" signs when the match was found on the forward strand and
        with "<" signs when the match was found on the reverse strand.
        The length of each contig's representation is scaled to reflect
        the contigs' relative sizes.

    (6) TTGATCAACGGAACAAGTTACCCTAGGGATAACAGCGCAATCCTATTCTAGAGTCCATAT...
        This is the match itself, ALWAYS shown on the same strand as
        in the masterfile, and extracted from the first contig.

SEE ALSO
    coordinates(5), mf(5)

AUTHORS
    Pierre Rioux,
    Tim Littlejohn (Project Management),
    Organelle Genome Megasequencing Project, Feb. 1995.