FINDREP(1)                    OGMP ANALYSIS TOOL                    FINDREP(1)

NAME
       findrep - find exact repeated subsequences in a masterfile

SYNOPSIS
       findrep [-v] [-s rep_size] [-h sub_size] [-C] [-S strand]
               [-B contigs] [-H] [-G] [-Q] masterfile

DESCRIPTION
       findrep analyses a masterfile (that is, a file in mf(5) format) and
       reports occurrences of identical repeated subsequences whose length
       is greater than or equal to a given value (by default, 100 bases).

OPTIONS
       -s rep_size
              rep_size is the minimum size of the repeated subsequences to
              report; by default it is 100 bases.

       -h sub_size
              sub_size is an internal value: the length of the subsequences
              used during the hashing part of the processing. Its default
              value is the same as rep_size (as specified with the -s
              command line option). sub_size cannot have a value greater
              than rep_size.

              Memory space used by the program increases roughly with the
              product sub_size * size_of_masterfile (where
              size_of_masterfile is the total number of bases in the
              masterfile). Therefore, for large genomes, it can be more
              memory efficient to reduce sub_size to less than rep_size, at
              the cost of slightly more processing time when the hashing is
              done.

       -v     Verbose option. Prints more information during the processing
              phase; use this when you want to see how quickly (or slowly)
              the analysis proceeds on a particular platform. For even more
              information, try -v2 (no space between the v and the 2).

       -C     Coordinates file format. The output will be in coordinates(5)
              file format rather than FASTA.

       -S strand
              Restrict the strandedness of the search. The argument strand
              can be one of the following keywords:

              - "forward", "noncomp" for the forward strand;
              - "reverse", "comp" for the reverse strand;
              - "both", "any" for both strands.

              Only the first letter of the keyword is actually looked at;
              thus the option "-S both" can be written as "-S b" or "-Sb".
              Restricting the search to a single strand roughly halves the
              amount of memory required during processing.
              The default behavior is to search both strands.

       -B contigs
              Restrict the reporting of the repeats. You can ignore repeats
              that are in the same contig or ignore repeats that are in
              different contigs. The argument contigs can be one of the
              following keywords:

              - "any"        allow repeats between any contigs;
              - "same"       only allow repeats in the same contig;
              - "different"  only allow repeats in different contigs.

       -H     Reports only hairpin repeats (subsequences like ATACTGCAGTAT
              which are their own repeat when reversed and complemented).
              This option forces the options "-S both" and "-B same".

       -G     Graphical option. A scaled-down representation of the contigs,
              showing the location of the repeats, will be inserted in the
              reports. See the OUTPUT section for more details.

       -Q     Quiet mode. When present, disables ALL output except for a
              single integer: the length of the longest repeat whose length
              is greater than or equal to rep_size. The program will output
              "0" (zero) if no repeat of at least rep_size was found.

OUTPUT
       The information displayed by findrep consists of a header section
       that identifies the parameters used during invocation, followed by
       zero, one or many repeat reports. A repeat report typically consists
       of 6 parts, numbered (1) to (6) in the sample report below:

       (1) >Repeat 2: forward(304-406) reverse(111-213) (103)
       (2) ;;> R.youthere unknown reading frame #1
       (3) ;;> P.terpiper mitochondrial fragment
       (4) ;; >>>>>>>>>>>>---->>>>>>>>>>>>>
       (5) ;; <<<<----<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
       (6) TTGATCAACGGAACAAGTTACCCTAGGGATAACAGCGCAATCCTATTCTAGAGTCCATAT
           CGACAATAGGGTTTACTGGATCAGGACATCCTAATGGTGCAGC

       Explanations for each part:

       (1) >Repeat 2: forward(304-406) reverse(111-213) (103)

              This is a FASTA header; repeats are numbered sequentially as
              they are printed out. The "forward(304-406)" string identifies
              the first subsequence match for this repeat; "forward" means
              that this match occurred in the contig as read by the program
              in the masterfile.
              The "reverse(111-213)" string identifies the second
              subsequence match for this repeat; "reverse" means that this
              match occurred in the contig on the strand opposite to the one
              given by the masterfile. ALL range positions (like "304-406"
              and "111-213") are relative to the beginning of the contig in
              the direction given by the masterfile, no matter whether the
              match is "forward" or "reverse". The last piece of
              information, "(103)", is the length of the repeat.

       (2) ;;> R.youthere unknown reading frame #1
       (3) ;;> P.terpiper mitochondrial fragment

              These lines identify the contigs in which the repeats were
              found. Line (2) is the name of the first contig (in our
              example, the one where the "forward(304-406)" match occurred)
              and line (3) is the name of the second contig (where the
              "reverse(111-213)" match occurred). If the two matches
              occurred in the same contig, then line (3) is not displayed in
              the output.

       (4) ;; >>>>>>>>>>>>---->>>>>>>>>>>>>
       (5) ;; <<<<----<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<...

              These lines appear only when the -G (graphics) option is
              specified on the command line. They show where the matches
              occurred (the dashed lines) inside each contig. A contig is
              drawn with ">" signs when the match was found on the forward
              strand and with "<" signs when the match was found on the
              reverse strand. The lengths of the contig representations are
              scaled to respect the relative sizes of the contigs.

       (6) TTGATCAACGGAACAAGTTACCCTAGGGATAACAGCGCAATCCTATTCTAGAGTCCATAT...

              This is the match itself, ALWAYS shown on the same strand as
              in the masterfile, and extracted from the first contig.

SEE ALSO
       coordinates(5), mf(5)

AUTHORS
       Pierre Rioux, Tim Littlejohn (Project Management), Organelle Genome
       Megasequencing Project, Feb. 1995.
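
EXAMPLES
       The following is a minimal Python sketch, not part of findrep, that
       may help when post-processing the FASTA-style reports described in
       OUTPUT. It assumes headers look exactly like the sample line (1),
       and it illustrates the hairpin property mentioned under -H using
       the example sequence given there. The names parse_repeat_header,
       reverse_complement and is_hairpin are hypothetical helpers chosen
       for this illustration.

```python
import re

# Pattern modelled on the sample header in OUTPUT, part (1):
#   >Repeat 2: forward(304-406) reverse(111-213) (103)
# Assumption: real reports follow this layout exactly.
HEADER_RE = re.compile(
    r"^>Repeat\s+(\d+):\s+"
    r"(forward|reverse)\((\d+)-(\d+)\)\s+"
    r"(forward|reverse)\((\d+)-(\d+)\)\s+"
    r"\((\d+)\)$"
)


def parse_repeat_header(line):
    """Split a repeat header into its number, two matches and length."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError("not a repeat header: %r" % line)
    return {
        "number": int(m.group(1)),
        # Each match is (strand, start, end); positions are relative to
        # the contig as given in the masterfile, for both strands.
        "first": (m.group(2), int(m.group(3)), int(m.group(4))),
        "second": (m.group(5), int(m.group(6)), int(m.group(7))),
        "length": int(m.group(8)),
    }


# Translation table for complementing upper-case DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_hairpin(seq):
    """True if seq is its own repeat when reversed and complemented."""
    return seq == reverse_complement(seq)


if __name__ == "__main__":
    info = parse_repeat_header(
        ">Repeat 2: forward(304-406) reverse(111-213) (103)"
    )
    strand, start, end = info["first"]
    # Ranges are inclusive, so the span covers end - start + 1 bases.
    print(info["number"], strand, end - start + 1, info["length"])
    print(is_hairpin("ATACTGCAGTAT"))
```

       In the sample header both ranges are inclusive: 406 - 304 + 1 and
       213 - 111 + 1 both equal the reported length of 103.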