WATSTATS(1)      OGMP SEQUENCING PROJECT MANAGEMENT UTILITY      WATSTATS(1)

NAME
     watstats - Calculate and print tables of statistics and expected
     values related to a shotgun sequencing project.

DESCRIPTION
     watstats is a program that asks for four parameters related to a
     shotgun sequencing project:

          G = The genome size in kbp,
          N = The number of readings,
          L = The average length of those readings in bp, and
          T = The minimum number of bp needed to recognise sequence
              overlaps,

     and calculates the following values, which are functions of those
     parameters:

          c     = The redundancy of coverage,
          E(Nc) = The expected number of contigs,
          E(Sc) = The expected number of readings per contig,
          E(Lc) = The expected length of each contig, and
          %c    = The percentage of coverage of the genome.

     Of the four parameters asked for interactively (G, N, L and T), at
     most one can be a "range specification". Such a specification is a
     list of three integers separated by anything else, like this:

          "200-500,100"
          "From 200 to 500 by steps of 100"
          "xyz200beatles500&!%100"

     A range specification is interpreted as in the second example: the
     first number is a START value, the second an END value and the third
     a STEP value. When a range specification is given for the G, N, L or
     T parameter, watstats prints a table of the c, E(Nc), E(Sc), E(Lc)
     and %c values with that parameter changing from START to END by
     increments of STEP.

OPTION
     -O   This is the "ocean option". When present, watstats asks for one
          more parameter besides G, N, L and T:

               O = A threshold ocean size in bp.

          An ocean is a gap between two contigs after assembly. The O
          parameter is used to compute a supplementary value:

               %o = The probability of having an ocean > O bp after
                    assembly.

FORMULAS
     All these formulas are from [Lander, Waterman], except for the %c
     calculation, which is a simple equation figured out by PR.

     Let
          GM    = G*1000 (the genome size in bp),
          theta = T/L,
          sigma = 1-theta.
     Then
          c     = L*N/GM,
          E(Nc) = N*exp(-c*sigma),
          E(Sc) = exp(c*sigma),
          E(Lc) = L*[((exp(c*sigma)-1)/c)+(1-sigma)],
          %c    = E(Nc)*E(Lc)/GM (converted to percentage),
          %o    = exp(-c*(O/L+theta)).

REFERENCE
     Lander, E. S. and Waterman, M. S., "Genomic mapping by fingerprinting
     random clones: a mathematical analysis", Genomics 2, 231-239 (1988).

AUTHOR
     Pierre Rioux, Tim Littlejohn (Project Management), Organelle Genome
     Megasequencing Project, Oct. 1994. Also see the REFERENCE section.
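
EXAMPLE
     The calculations described in this page can be sketched in a few
     lines of Python. This is an illustration only, not the watstats
     source: the function names parse_range and shotgun_stats are
     hypothetical, END is assumed to be included in the table, %c is
     computed as the total expected contig length, E(Nc)*E(Lc), divided
     by the genome size in bp, and %o is returned as the raw probability
     given by its formula, without scaling.

```python
import math
import re


def parse_range(spec):
    """Extract (START, END, STEP) from a range specification.

    Any run of non-digit characters counts as a separator, so
    "200-500,100" and "xyz200beatles500&!%100" parse the same way.
    """
    numbers = [int(tok) for tok in re.findall(r"\d+", spec)]
    if len(numbers) != 3:
        raise ValueError("a range specification needs exactly three integers")
    start, end, step = numbers
    if step <= 0:
        raise ValueError("STEP must be a positive integer")
    return start, end, step


def shotgun_stats(G, N, L, T, O=None):
    """Evaluate the FORMULAS section for one set of parameters.

    G is in kbp; L, T and O are in bp; N is a count of readings.
    """
    if G <= 0 or N <= 0 or L <= 0:
        raise ValueError("G, N and L must be positive")
    GM = G * 1000.0              # genome size in bp
    theta = T / L
    sigma = 1.0 - theta
    c = L * N / GM               # redundancy of coverage
    e_nc = N * math.exp(-c * sigma)
    e_sc = math.exp(c * sigma)
    e_lc = L * ((math.exp(c * sigma) - 1.0) / c + (1.0 - sigma))
    stats = {
        "c": c,
        "E(Nc)": e_nc,
        "E(Sc)": e_sc,
        "E(Lc)": e_lc,
        # Assumption: coverage = expected total contig length / genome length.
        "%c": 100.0 * e_nc * e_lc / GM,
    }
    if O is not None:
        # Raw probability of an ocean longer than O bp (not scaled to percent).
        stats["%o"] = math.exp(-c * (O / L + theta))
    return stats


if __name__ == "__main__":
    # Table with N as the ranged parameter; END is assumed inclusive.
    start, end, step = parse_range("From 200 to 500 by steps of 100")
    columns = ("c", "E(Nc)", "E(Sc)", "E(Lc)", "%c")
    print("N".rjust(6), " ".join(name.rjust(10) for name in columns))
    for n in range(start, end + 1, step):
        row = shotgun_stats(G=100, N=n, L=500, T=50)
        print(str(n).rjust(6), " ".join(f"{row[k]:10.3f}" for k in columns))
```

     With T = 0 (theta = 0) the %c expression in this sketch reduces to
     100*(1-exp(-c)), the usual coverage estimate for randomly placed
     readings.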