Selection, Concatenation and Fusion of Sequences
Aligned sequence files: Global features, Fasta, Must, Nexus, Phylip. Other input files: OTU file, Systematic file, Default files, Tree file.
SCaFoS needs aligned sequence files stored in a single directory. These files can contain
more than one sequence per species, and not every species needs to be present
in each file. SCaFoS automatically recognizes files in FASTA, MUST, NEXUS or
PHYLIP format with the most common extensions. All files must have the same format
and contain the same type of sequences (protein or nucleic acid). Other input files are created by SCaFoS itself or by MUST and are described below.
Aligned sequence files
Specific sequence name format
To take into account paralogous and xenologous genes as well as
pseudogenes, SCaFoS can handle several copies of the same gene
within one organism (species or strain). The sequence identifier
is therefore composed as follows:
Organism@Gene
where Organism is the identifier used to describe the organism
(see the detailed format below),
Gene is a string of characters (the accession number, for example) that is specific to the corresponding sequence, and the at symbol (@) is reserved to separate the organism identifier from the gene identifier; it must not be used in the organism name.
Sequence names must respect this format; otherwise SCaFoS will
consider that the various sequences correspond to different
organisms and will be unable to select the best sequence among
all the sequences of a given organism.
SCaFoS can distinguish between species name and strain name if the sequence name respects a second rule: neither the genus name nor the species name may contain an underscore (_), which is the usual character used to separate the genus and species names from the strain name. This restriction makes it possible to handle several strains of the same species, and thus to carry out phylogenetic analyses on a small scale; otherwise, the entire string preceding the gene identifier (if there is one) is kept as the organism name.
To accommodate these constraints, SCaFoS handles sequence names with the following format:
GenusName SpeciesName_StrainName@AccessionNumber
Not all parts of the sequence name are necessary.
SCaFoS looks for the first space character to determine the genus name,
and then for the single underscore and the single at symbol, if present.
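The splitting rule described above can be illustrated with a minimal Python sketch. It is not taken from SCaFoS itself; the names `SequenceName` and `parse_sequence_name` are hypothetical, and the sketch assumes that the name contains at most one at symbol and at most one underscore.

```python
from typing import NamedTuple, Optional


class SequenceName(NamedTuple):
    """Hypothetical container for the parts of a sequence name."""
    genus: str
    species: Optional[str]
    strain: Optional[str]
    gene: Optional[str]


def parse_sequence_name(name: str) -> SequenceName:
    """Split 'Genus species_strain@gene'; parts that are absent become None."""
    # The at symbol separates the organism identifier from the gene identifier.
    organism, at, gene = name.strip().partition("@")
    # The underscore separates the genus/species part from the strain name.
    binomial, underscore, strain = organism.partition("_")
    # The first space separates the genus name from the species name.
    genus, space, species = binomial.partition(" ")
    return SequenceName(
        genus=genus,
        species=species if space else None,
        strain=strain if underscore else None,
        gene=gene if at else None,
    )
```

For example, `parse_sequence_name("Escherichia coli_K12@P0A7G6")` yields the genus `Escherichia`, the species `coli`, the strain `K12` and the gene identifier `P0A7G6`, while a bare `Homo` yields only a genus.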
The other authorized characters in the sequence name are:
Gaps and missing characters
For all available input formats, a star (*) or a dash (-) indicates
a gap, and a question mark (?) or the character X (X or x)
indicates an unknown character. The MUST format also accepts a space as an unknown character.
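As an illustration of these conventions, the following Python sketch counts gaps and unknown characters in a sequence. The function name `count_gaps_and_unknowns` and the `must_format` flag are hypothetical, and the sketch assumes that the dash is the ASCII hyphen-minus.

```python
GAP_CHARS = {"*", "-"}          # star or dash (assumed to be ASCII '-')
UNKNOWN_CHARS = {"?", "X", "x"}  # question mark or character X


def count_gaps_and_unknowns(sequence, must_format=False):
    """Return (number of gaps, number of unknown characters)."""
    unknown = set(UNKNOWN_CHARS)
    if must_format:
        # The MUST format also accepts a space as an unknown character.
        unknown.add(" ")
    gaps = sum(1 for char in sequence if char in GAP_CHARS)
    unknowns = sum(1 for char in sequence if char in unknown)
    return gaps, unknowns
```

Note that for protein data X is also the conventional "any residue" code, which is consistent with treating it as unknown here.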
Comment
All characters following a hash sign (#)
are considered comments and are removed during the processing.
The FASTA format is defined by the sequence name on the first line, with a
'greater than' character (>)
at the beginning of the line, and the sequence on the following lines.
The sequence may be written on several lines, possibly with spaces
between character blocks.
The spacers between character blocks (space or end of line) are removed during the concatenation.
>Homo
ATGCAACGTTGACCTAGCATGAGA
>Mus
ATCCAACGTTGACCTAGCATAAGA
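A minimal Python sketch of a reader for this layout is given below. It is an illustration, not the SCaFoS parser; the function name `read_fasta` is hypothetical, and a repeated sequence name would simply overwrite the earlier entry.

```python
from typing import Dict, List, Optional


def read_fasta(text: str) -> Dict[str, str]:
    """Map each sequence name to its sequence, with spacers removed."""
    parts: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: the name follows the '>' character.
            current = line[1:].strip()
            parts[current] = []
        elif current is not None:
            # Remove the spaces between character blocks.
            parts[current].append("".join(line.split()))
    # Joining the pieces removes the end-of-line spacers as well.
    return {name: "".join(chunks) for name, chunks in parts.items()}
```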
The MUST format is the format of the ED program from the MUST package, a tool to
manage aligned sequences. It is very similar to the FASTA format,
except that the file must begin with at least two lines beginning with
a hash sign (#), and that the
sequence for a given species should be on a single line.
SCaFoS ignores all NEXUS options; only the taxa
and data blocks are processed. SCaFoS needs only a data block with the
ntax and nchar
values and the matrix.
A NEXUS file must begin with the #NEXUS
keyword; all comments and empty lines are ignored. Interleaved and
sequential formats and placeholder characters are also supported.
#NEXUS
begin data;
dimensions ntax=2 nchar=24;
format datatype=RNA gap='-';
matrix
Homo ATGCAACGTT GACCTAGCAT GAGA
Mus ATCCAACGTT GACCTAGCAT AAGA
;
end;
In the PHYLIP format, the number of sequences and the sequence length must be present on the
first line.
The PHYLIP format comes in two variants: interleaved or sequential. The latter is the simplest, with the sequence name on 10 characters followed by the whole sequence, either on the same line or split over multiple lines before the following sequence. In the interleaved format, sequences are cut into short fragments: a block of fragments contains the homologous sequence part for all sequences; the first block of fragments is preceded by the sequence names, the next block displays the second part of each sequence, and so on for all the fragments. Both interleaved and sequential formats may separate the sequence into blocks with spaces.
2 24
Homo      ATGCAACGTT GACCTAGCAT
Mus       ATCCAACGTT GACCTAGCAT

GAGA
AAGA
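The interleaved layout can be read with a short Python sketch such as the one below. It is only an illustration under stated assumptions: the function name `read_phylip_interleaved` is hypothetical, names occupy exactly the first 10 characters of the lines of the first block, and every later block lists the fragments in the same sequence order.

```python
from typing import Dict, List


def read_phylip_interleaved(text: str) -> Dict[str, str]:
    """Rebuild full sequences from an interleaved PHYLIP file."""
    lines = [line for line in text.splitlines() if line.strip()]
    # The first line holds the number of sequences and the sequence length.
    ntax, nchar = (int(value) for value in lines[0].split()[:2])
    names: List[str] = []
    chunks: List[List[str]] = []
    for index, line in enumerate(lines[1:]):
        if index < ntax:
            # First block: a 10-character name precedes the first fragment.
            names.append(line[:10].strip())
            chunks.append([])
            fragment = line[10:]
        else:
            fragment = line
        # Fragments of later blocks follow the order of the first block.
        chunks[index % ntax].append("".join(fragment.split()))
    sequences = {name: "".join(parts) for name, parts in zip(names, chunks)}
    for name, sequence in sequences.items():
        if len(sequence) != nchar:
            raise ValueError(f"{name}: expected {nchar} characters, got {len(sequence)}")
    return sequences
```

The length check against the header value is a cheap way to detect a misaligned or truncated block.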
Depending on the usage and the related options, some other input files may be
needed. These files are described below.
<out>.otu
The OTU file is created during the first use of SCaFoS and used in the next steps. The line format of the OTU file is:
OTU name : species1 name (comment1), species2 name(comment2), ...
where the OTU name is used to label the
selected sequences from this OTU,
and species1 name, ... are the names of all the related species that can be grouped in the OTU, each with an optional comment that is ignored.
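To make the line format concrete, here is a minimal Python sketch that splits one OTU line into the OTU name and its species list. The function name `parse_otu_line` is hypothetical, and the sketch assumes that neither the OTU name nor the species names contain a colon, a comma or parentheses.

```python
import re
from typing import List, Tuple


def parse_otu_line(line: str) -> Tuple[str, List[str]]:
    """Return (OTU name, list of species names) for one OTU file line."""
    otu_name, colon, members = line.partition(":")
    if not colon:
        raise ValueError(f"missing ':' in OTU line: {line!r}")
    # Comments in parentheses are not considered, so drop them first;
    # doing this before splitting tolerates commas inside a comment.
    members = re.sub(r"\([^)]*\)", "", members)
    species = [name.strip() for name in members.split(",") if name.strip()]
    return otu_name.strip(), species
```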
Because some programs of phylogenetic inference do not
support sequence names longer than 10 characters, the sequence names
are truncated; make sure that OTU names are all different for the first
ten characters to avoid subsequent problems.
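A quick way to check this before running an analysis is sketched below in Python. The helper `find_truncation_clashes` is hypothetical and is not part of SCaFoS; it simply groups OTU names by their first ten characters and reports the groups with more than one member.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_truncation_clashes(otu_names: Iterable[str], width: int = 10) -> Dict[str, List[str]]:
    """Return the truncated names shared by more than one OTU name."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in otu_names:
        # Names are compared on their first `width` characters only.
        groups[name[:width]].append(name)
    return {prefix: names for prefix, names in groups.items() if len(names) > 1}
```

An empty result means that all OTU names remain distinct after truncation.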
In the SPECIES PRESENCE usage, SCaFoS creates an OTU file in which each OTU contains one species and is named after that species; one OTU is created for each species present at least once in the aligned files:
species1 name : species1 name (percentage of genes present)
species2 name : species2 name (percentage of genes present)
...
Depending on their needs, users can create their own OTUs by
listing all the species names that constitute the new group:
new group name : species1 name, species2 name, ...
<taxa>.nom
The systematic file is a text file in which the taxonomic group is indicated for each species. This file can be used to create an OTU file ordered by taxonomic group, to facilitate species selection and grouping. The line format of the systematic file is:
species name, taxa name
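As an illustration of how such a file could be used to order species by taxonomic group, the Python sketch below reads the lines and groups the species per taxon. The function name `group_species_by_taxon` is hypothetical, and the sketch assumes that species names contain no comma.

```python
from collections import defaultdict
from typing import Dict, List


def group_species_by_taxon(text: str) -> Dict[str, List[str]]:
    """Map each taxon name to the species assigned to it, in file order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in text.splitlines():
        species, comma, taxon = line.partition(",")
        if not comma:
            continue  # skip blank or malformed lines
        groups[taxon.strip()].append(species.strip())
    return dict(groups)
```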
<prefix of aligned sequences file>.def
The automatic selection of a sequence requires that SCaFoS be able to select a single sequence for an OTU in a given gene. If this is not possible, the user can interactively specify the correct one. The sequence previously chosen to represent an OTU is stored in the default files; each aligned sequence file has its own default file. These files are created in the output directory the first time the default files option is used, and they must be moved to the input directory containing the aligned sequence files to be reused in subsequent steps. The line format of a default file is:
selected sequence id #OTU : sequence1 id, sequence2 id, ...
It is possible to modify a default file, but the user must
take care to respect the line format, in particular the list of all the
sequence identifiers following the OTU name, in
which the selected sequence identifier must be present.
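Before reusing an edited default file, a line can be checked with a small Python sketch like the following. It rests on an assumption about the line format shown above, namely that the hash sign introduces the OTU name and the colon introduces the list of sequence identifiers; the function name `parse_default_line` is hypothetical.

```python
from typing import List, Tuple


def parse_default_line(line: str) -> Tuple[str, str, List[str]]:
    """Return (selected id, OTU name, all sequence ids) for one line."""
    # Assumed layout: selected id, '#', OTU name, ':', comma-separated ids.
    selected, hash_sign, rest = line.partition("#")
    otu_name, colon, members = rest.partition(":")
    if not hash_sign or not colon:
        raise ValueError(f"unexpected default line format: {line!r}")
    selected = selected.strip()
    candidates = [item.strip() for item in members.split(",") if item.strip()]
    # The selected identifier must be present in the list.
    if selected not in candidates:
        raise ValueError(f"{selected!r} is not in the list for OTU {otu_name.strip()!r}")
    return selected, otu_name.strip(), candidates
```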
<prefix of aligned sequences file>.ps
This kind of file provides a graphical view of the selection made by SCaFoS for each gene; choosing this option does not influence the selection itself. A tree file is a phylogenetic tree in PostScript format generated by TREEPLOT, another program from the MUST package. If a tree file exists in the input directory, SCaFoS changes the color of the sequence names according to its selection:
green → chimera
blue → selected sequence
grey → discarded sequence