WATSTATS(1) OGMP SEQUENCING PROJECT MANAGEMENT UTILITY WATSTATS(1)
NAME
watstats - Calculate and print tables of statistics and expected
values related to a shotgun sequencing project.
DESCRIPTION
watstats is a program that asks for four parameters related to
a shotgun sequencing project:
G = The genome size in kbp,
N = The number of readings,
L = The average length of those readings in bp,
and T = The minimum number of bp needed to recognise sequence overlaps,
and calculates the following values, functions of those parameters:
c = The redundancy of coverage,
E(Nc) = The expected number of contigs,
E(Sc) = The expected number of readings per contig,
E(Lc) = The expected length of each contig,
and %c = The percentage of coverage of the genome.
Of the four parameters asked for interactively (G, N, L and T), at most
one can be a "range specification". Such a specification is
a list of three integers separated by any non-digit characters, like
this: "200-500,100" or "From 200 to 500 by steps of 100" or
"xyz200beatles500&!%100". Such a specification is always interpreted as
in the second example just mentioned: the first number is a START
value, the second an END value and the third a STEP
value. When a range specification is given for the G, N, L or T parameter,
watstats prints a table of the c, E(Nc), E(Sc), E(Lc)
and %c values with that parameter changing from START to END in
increments of STEP.
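The parsing rule for range specifications can be illustrated with a
short Python sketch. It is not taken from the watstats source; the
function names parse_range and range_values are hypothetical, and the
sketch assumes that END is included in the table when the steps land
on it, that only ascending ranges are meaningful, and that STEP must
be positive.

```python
import re


def parse_range(spec):
    """Return (start, end, step) from a range specification string.

    Every run of digits is taken as one integer; everything else is
    treated as a separator, so "-" is never read as a minus sign.
    """
    numbers = [int(tok) for tok in re.findall(r"\d+", spec)]
    if len(numbers) != 3:
        raise ValueError("expected exactly three integers, got %d" % len(numbers))
    start, end, step = numbers
    if step <= 0:
        # Assumption: a zero step would never reach END.
        raise ValueError("STEP must be a positive integer")
    return start, end, step


def range_values(spec):
    """List the parameter values from START to END by increments of STEP.

    Assumption: END itself is included when the steps land on it.
    """
    start, end, step = parse_range(spec)
    return list(range(start, end + 1, step))
```

Under these assumptions, the three example specifications given above
all parse to the same START, END and STEP values.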
OPTION
-O This is the "ocean option". When present, watstats will ask for
one more parameter besides G, N, L and T. This parameter is
O = A threshold ocean size in bp.
An ocean is a gap between two contigs, after assembly. The O parameter
will be used to compute a supplementary value:
%o = The probability of having an ocean > O bp after assembly.
FORMULAS
All these formulas are from [Lander, Waterman] (see REFERENCE), except
for the %c calculation, which is a simple equation worked out by PR
(Pierre Rioux). Let
GM = G*1000 (The genome size in bp),
theta = T/L,
sigma = 1-theta.
Then
c = L*N/GM,
E(Nc) = N*exp(-c*sigma),
E(Sc) = exp(c*sigma),
E(Lc) = L*[((exp(c*sigma)-1)/c)+(1-sigma)],
%c = E(Nc)*E(Lc)/GM (converted to percentage),
%o = exp(-c*(O/L+theta)).
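As an illustration only, the formulas for c, E(Nc), E(Sc), E(Lc) and,
optionally, %o can be transcribed into Python as follows. This is a
sketch and not the watstats implementation: the function name
watstats_values and the dictionary it returns are hypothetical, %o is
returned as a probability between 0 and 1, and the sketch assumes
N > 0 and L > 0 so that no division by zero occurs.

```python
import math


def watstats_values(G, N, L, T, O=None):
    """Evaluate the formulas of the FORMULAS section.

    G is in kbp; L, T and O are in bp; N is a count.
    Assumption: N > 0 and L > 0, so c and theta are well defined.
    """
    GM = G * 1000.0          # genome size in bp
    theta = T / L
    sigma = 1.0 - theta
    c = L * N / GM           # redundancy of coverage

    values = {
        "c": c,
        "E(Nc)": N * math.exp(-c * sigma),
        "E(Sc)": math.exp(c * sigma),
        "E(Lc)": L * ((math.exp(c * sigma) - 1.0) / c + (1.0 - sigma)),
    }
    if O is not None:
        # Only computed when an ocean threshold is supplied (-O option).
        values["%o"] = math.exp(-c * (O / L + theta))
    return values
```

For example, watstats_values(100, 1000, 500, 50) corresponds to a
100 kbp genome covered by 1000 readings of 500 bp with a 50 bp
minimum overlap; passing O=500 as well adds the "%o" entry.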
REFERENCE
Lander, E. S. and Waterman, M. S., "Genomic mapping by fingerprinting
random clones: a mathematical analysis", Genomics 2, 231-239 (1988).
AUTHOR
Pierre Rioux,
Tim Littlejohn (Project Management),
Organelle Genome Megasequencing Project, Oct. 1994.
Also see the REFERENCE section.