SCaFoS - Input format

SCaFoS needs aligned sequence files stored in a single directory. These files can contain more than one sequence per species, and not all species need to be present in each file. SCaFoS automatically recognizes files in FASTA, MUST, NEXUS or PHYLIP format with the most common extensions. All files must have the same format and contain the same type of sequences (protein or nucleic acid).
Other input files are created by SCaFoS itself or by MUST and are described below.

Aligned sequence files

Global features

Specific sequence name format

To take into account paralogous and xenologous genes as well as pseudogenes, SCaFoS can handle several copies of the same gene within one organism (species or strain). The sequence identifier is therefore composed as follows:

Organism@Gene

where Organism is the identifier used to describe the organism (see the detailed format below),
Gene is a string of characters (the accession number, for example) that is specific to the corresponding sequence,
and the at symbol (@) is reserved to separate the organism identifier from the gene identifier; it must not be used in the organism name.

It is important that sequence names respect this format; otherwise SCaFoS will treat the various sequences as belonging to different organisms and will be unable to select the best sequence among all the sequences of one organism.

SCaFoS can distinguish the species name from the strain name if the sequence name respects a second rule: neither the genus name nor the species name may contain an underscore (_), which is the character used to separate the genus name/species name from the strain name. This restriction makes it possible to handle several strains of the same species, and thus to perform small-scale phylogenetic analyses. Otherwise, the entire string preceding the gene identifier (if one is present) is kept as the organism name.

To accommodate all these naming constraints, SCaFoS handles sequence names with the following format:

GenusName SpeciesName_StrainName@AccessionNumber

Not all parts of the sequence name are required. SCaFoS looks for the first space character to determine the genus name, and then for the single underscore and the single at symbol, if present.

The other authorized characters in the sequence name are:

all alphabetic characters,
all numeric characters,
point (.), dash (-) and space.
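
The naming rules above can be illustrated with a short Python sketch. It is not SCaFoS code: the function name parse_sequence_name, the SequenceName fields and the order of the splits (at symbol, then underscore, then first space) are assumptions made for illustration, and "alphabetic characters" is taken to mean ASCII letters.

```python
import re
from typing import NamedTuple, Optional

# Characters allowed besides the '@' and '_' separators, per the list above:
# letters, digits, point, dash and space (ASCII letters are an assumption).
_ALLOWED = re.compile(r"[A-Za-z0-9.\- ]*")


class SequenceName(NamedTuple):
    genus: str
    species: Optional[str]
    strain: Optional[str]
    gene: Optional[str]


def parse_sequence_name(name: str) -> SequenceName:
    """Split 'Genus species_strain@gene' into its parts (hypothetical helper)."""
    # The at symbol separates the organism from the gene identifier.
    organism, _, gene = name.partition("@")
    if "@" in gene:
        raise ValueError("more than one at symbol in %r" % name)
    # A single underscore separates genus/species from the strain.
    species_part, _, strain = organism.partition("_")
    if "_" in strain:
        raise ValueError("more than one underscore in %r" % name)
    # The first space ends the genus name.
    genus, _, species = species_part.partition(" ")
    if not genus:
        raise ValueError("missing organism name in %r" % name)
    for part in (species_part, strain, gene):
        if not _ALLOWED.fullmatch(part):
            raise ValueError("unauthorized character in %r" % name)
    return SequenceName(genus, species or None, strain or None, gene or None)
```

With this sketch, "Homo sapiens_HeLa@P12345" is split into genus Homo, species sapiens, strain HeLa and gene P12345, while a bare "Homo" yields only a genus.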

Gaps and missing characters

For all available input formats, a star (*) or a dash (-) indicates a gap, and a question mark (?) or the letter X (X or x) indicates an unknown character. The MUST format also accepts a space as an unknown character.
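
As a minimal sketch of this convention, the following Python function maps every gap symbol to a dash and every unknown symbol to a question mark. The function name normalize_sequence, the must_format flag and the choice of "-" and "?" as output symbols are assumptions for illustration, not SCaFoS behavior.

```python
GAP_CHARS = {"*", "-"}
UNKNOWN_CHARS = {"?", "X", "x"}


def normalize_sequence(seq: str, must_format: bool = False) -> str:
    """Rewrite gaps as '-' and unknown characters as '?' (hypothetical helper)."""
    out = []
    for ch in seq:
        if ch in GAP_CHARS:
            out.append("-")
        elif ch in UNKNOWN_CHARS or (must_format and ch == " "):
            # A space counts as an unknown character only in the MUST format.
            out.append("?")
        else:
            out.append(ch)
    return "".join(out)
```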

Comment

All characters following a hash sign (#) are considered comments and are removed during processing.

FASTA format

In this format, the sequence name is on the first line, preceded by a 'greater than' character (>) at the beginning of the line, and the sequence follows on the next lines. The sequence may span several lines and may contain spaces between blocks of characters.
The separators between character blocks (spaces or ends of line) are removed during concatenation.

>Homo
ATGCAACGTTGACCTAGCATGAGA
>Mus
ATCCAACGTTGACCTAGCATAAGA
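
The reading rules for this format can be sketched in Python as follows. This is an illustration under stated assumptions, not the SCaFoS parser: the function name read_fasta is hypothetical, and comments are stripped with the hash sign rule described in the Comment section.

```python
from typing import Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (name, sequence) pairs from FASTA lines (hypothetical helper)."""
    records: List[Tuple[str, List[str]]] = []
    for raw in lines:
        # Drop comments: everything after a hash sign is ignored.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append((line[1:].strip(), []))
        elif not records:
            raise ValueError("sequence data found before the first name line")
        else:
            # Remove the spaces between character blocks before concatenation.
            records[-1][1].append("".join(line.split()))
    return [(name, "".join(parts)) for name, parts in records]
```

A list of pairs is returned rather than a dictionary so that the order of the file is preserved and no assumption is made about name uniqueness.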

MUST format

This is the format of the ED program from the MUST package, a tool to manage aligned sequences. It is very similar to the FASTA format, except that the file must begin with at least two lines beginning with a hash sign (#), and that the sequence for a given species should be on a single line.
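
Based only on the description above, a file of this kind could be produced from (name, sequence) pairs with a sketch like the one below. The function name to_must and the empty "#" header lines are assumptions; the real header written by ED may carry additional information.

```python
from typing import Iterable, Sequence, Tuple


def to_must(records: Iterable[Tuple[str, str]],
            header: Sequence[str] = ("#", "#")) -> str:
    """Write records as MUST-like text (hypothetical helper)."""
    # The file must begin with at least two lines starting with a hash sign.
    if len(header) < 2 or any(not h.startswith("#") for h in header):
        raise ValueError("header needs at least two lines starting with '#'")
    lines = list(header)
    for name, seq in records:
        lines.append(">" + name)
        # Keep each sequence on a single line; spaces are preserved because
        # the MUST format reads a space as an unknown character.
        lines.append(seq.replace("\r", "").replace("\n", ""))
    return "\n".join(lines) + "\n"
```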

NEXUS format

SCaFoS ignores all NEXUS options; only taxa and data blocks are processed. SCaFoS needs only a data block with the ntax and nchar values and the matrix. A NEXUS file must begin with the #NEXUS keyword; all comments and empty lines are ignored. Interleaved and sequential formats, as well as placeholder characters, are supported.

#NEXUS
begin data;
dimensions ntax=2 nchar=24;
format datatype=DNA gap='-';
matrix
Homo ATGCAACGTT GACCTAGCAT GAGA
Mus ATCCAACGTT GACCTAGCAT AAGA
;
end;

PHYLIP format

The number of sequences and the sequence length must be present on the first line.
The PHYLIP format has two variants: interleaved and sequential.
The latter is the simpler one: the sequence name occupies 10 characters, and the whole sequence is either on the same line or split over multiple lines before the next sequence begins.
In the interleaved format, sequences are cut into short fragments, and each block of fragments contains the homologous part of all sequences. The first block is preceded by the sequence names, the next block holds the second part of each sequence, and so on for all the fragments.
In both variants, sequences may be split into blocks separated by spaces.

2 24
Homo ATGCAACGTT GACCTAGCAT
Mus ATCCAACGTT GACCTAGCAT

GAGA
AAGA
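
The interleaved layout can be read with a sketch such as the following. It is an illustration, not the SCaFoS parser: the function name read_phylip_interleaved is hypothetical, and it assumes that names are padded to exactly 10 columns in the first block, as stated above for the sequential variant.

```python
from typing import List, Tuple


def read_phylip_interleaved(text: str,
                            name_width: int = 10) -> List[Tuple[str, str]]:
    """Return (name, sequence) pairs from interleaved PHYLIP text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # First line: number of sequences and sequence length.
    ntax, nchar = (int(tok) for tok in lines[0].split()[:2])
    body = lines[1:]
    if len(body) % ntax != 0:
        raise ValueError("incomplete block of fragments")
    names: List[str] = []
    chunks: List[List[str]] = [[] for _ in range(ntax)]
    for i, line in enumerate(body):
        if i < ntax:
            # Only the first block carries the sequence names.
            names.append(line[:name_width].strip())
            line = line[name_width:]
        # Fragment i belongs to sequence i modulo ntax; inner spaces are dropped.
        chunks[i % ntax].append("".join(line.split()))
    records = [(name, "".join(parts)) for name, parts in zip(names, chunks)]
    for name, seq in records:
        if len(seq) != nchar:
            raise ValueError("%s has %d characters, expected %d"
                             % (name, len(seq), nchar))
    return records
```

The final length check compares each rebuilt sequence with the length declared on the first line, which catches a missing or misplaced fragment.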

Other input files

Depending on the usage and the related options, other input files may be needed. These files are described below.

OTU file

<out>.otu

It is created the first time SCaFoS is used and is reused in the following steps. The line format of the OTU file is:

OTU name : species1 name (comment1), species2 name (comment2), ...

where the OTU name is used to label the sequences selected from this OTU,
and species1 name, ... are the names of all the related species that can be grouped in the OTU, each optionally followed by a comment that is ignored.

Because some phylogenetic inference programs do not support sequence names longer than 10 characters, the sequence names are truncated; make sure that the first ten characters of all OTU names differ to avoid subsequent problems.

In the SPECIES PRESENCE usage, SCaFoS creates an OTU file in which each OTU contains a single species and is named after that species; one OTU is created for each species present at least once in the aligned files:

species1 name : species1 name (percentage of genes present)
species2 name : species2 name (percentage of genes present)
...

Depending on their needs, users can create their own OTUs by listing all the species names that make up the new group:

new group name : species1 name, species2 name, ...
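
The OTU line format and the ten-character rule can be checked with a sketch like this one. The function names parse_otu_line and clashing_otu_names are hypothetical, and the sketch assumes that species names contain no colon, comma or parenthesis.

```python
import re
from typing import Dict, Iterable, List, Tuple


def parse_otu_line(line: str) -> Tuple[str, List[str]]:
    """Return (OTU name, species names) from one OTU file line."""
    otu_name, sep, members = line.partition(":")
    if not sep:
        raise ValueError("missing ':' in OTU line %r" % line)
    # Parenthesized comments are ignored, so drop them before splitting.
    members = re.sub(r"\([^)]*\)", "", members)
    species = [m.strip() for m in members.split(",") if m.strip()]
    return otu_name.strip(), species


def clashing_otu_names(names: Iterable[str],
                       width: int = 10) -> Dict[str, List[str]]:
    """Group OTU names that share the same first `width` characters."""
    by_prefix: Dict[str, List[str]] = {}
    for name in names:
        by_prefix.setdefault(name[:width], []).append(name)
    return {p: group for p, group in by_prefix.items() if len(group) > 1}
```

Running clashing_otu_names on the OTU names of a file before the selection step lists the names that would become identical after truncation.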

Systematic file

<taxa>.nom

It is a text file that gives the taxonomic group of each species. This file can be used to create an OTU file ordered by taxonomic group, which facilitates species selection and grouping. The line format of the systematic file is:

species name, taxa name
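
As an illustration of how such a file could be used to order species by taxonomic group, here is a minimal sketch. The function name group_by_taxon is hypothetical, and the sketch assumes one "species name, taxa name" pair per line with no comma inside the species name.

```python
from typing import Dict, Iterable, List


def group_by_taxon(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each taxonomic group to its species, in file order."""
    groups: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # The first comma separates the species name from the group name.
        species, sep, taxon = line.partition(",")
        if not sep:
            raise ValueError("missing ',' in systematic line %r" % line)
        groups.setdefault(taxon.strip(), []).append(species.strip())
    return groups
```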

Default sequence files

<prefix of aligned sequences file>.def

Automatic selection requires that SCaFoS be able to select exactly one sequence for an OTU in a given gene. If this is not possible, the user can specify the correct sequence interactively. The sequence previously chosen to represent an OTU is stored in the default files; each aligned sequence file has its own default file. These files are created in the output directory the first time the default files option is used, and they must be moved to the input directory containing the aligned sequence files to be reused in subsequent steps. The line format of a default file is:

selected sequence id #OTU : sequence1 id, sequence2 id, ...

It is possible to modify a default sequence file, but the user must take care to respect the line format; in particular, the list of sequence identifiers following the OTU name must include the selected sequence identifier.
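
Before reusing a hand-edited default file, each line can be checked against this rule with a sketch like the following. The function name parse_default_line is hypothetical, and the sketch assumes that sequence identifiers contain no hash sign, colon or comma.

```python
from typing import List, Tuple


def parse_default_line(line: str) -> Tuple[str, str, List[str]]:
    """Return (selected id, OTU name, candidate ids) from one default line."""
    selected, hash_sep, rest = line.partition("#")
    otu, colon, candidates = rest.partition(":")
    if not hash_sep or not colon:
        raise ValueError("line does not match 'id #OTU : id1, id2, ...'")
    selected = selected.strip()
    ids = [c.strip() for c in candidates.split(",") if c.strip()]
    # The selected identifier must appear in the list after the OTU name.
    if selected not in ids:
        raise ValueError("selected id %r is not in the list" % selected)
    return selected, otu.strip(), ids
```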

Tree file

<prefix of aligned sequences file>.ps

This kind of file is needed to display graphically the selection made by SCaFoS for each gene; choosing this option does not influence the selection itself.
A tree file is a phylogenetic tree in PostScript format generated by TREEPLOT, another program from the MUST package. If a tree file exists in the input directory, SCaFoS changes the color of the sequence names according to the sequence selection it has made:

green → chimera
blue → selected sequence
grey → discarded sequence