Selection, Concatenation and Fusion of Sequences

Graphical mode

To run SCaFoS

Under a Linux system, the program is loaded by typing:


or the following command in interpreted mode:

where XXX is the version number of SCaFoS

In the Windows environment, open a command prompt window and type one of the previous commands to run SCaFoS.
In a standard installation of Windows, to open a command prompt window:
  • click Start
  • point to All Programs
  • point to Accessories
  • click Command Prompt
Or, more quickly:
  • choose Run in the Start menu,
  • type cmd (XP or 2000) or command (98 or Me).

In the Mac OS X environment, open an X11 window and type one of the previous commands to run SCaFoS. X11 is accessible in the Applications>Utilities folder.

For all operating systems, make sure that the scafos environment variable is defined before running SCaFoS (see the installation instructions section).
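As a quick way to verify this before launching the program, the following minimal sketch tests whether a variable named scafos is present and non-empty. The function name is hypothetical and not part of SCaFoS; the value the variable should hold is described in the installation instructions and is not checked here.

```python
import os

def scafos_env_defined(environ=None):
    """Return True if a non-empty 'scafos' variable exists in the given mapping.

    Hypothetical helper: defaults to the environment of the current process.
    """
    env = os.environ if environ is None else environ
    return bool(env.get("scafos", "").strip())

# Report the state of the current environment.
if scafos_env_defined():
    print("scafos is defined:", os.environ["scafos"])
else:
    print("scafos is not defined; see the installation instructions")
```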

Input and output files are described in the Input files format and Output Files format sections, respectively.

To choose the usage

Three usages are available, listed in the order in which they are normally applied when working with SCaFoS:
  • SPECIES PRESENCE: to list all species in aligned sequence files and create an OTU file, 
  • FILE SELECTION: to extract sequences according to species included in the OTU file, 
  • DATASETS ASSEMBLING: to create datasets for super-matrix or super-tree approaches.
Depending on the need, the user clicks the corresponding radio button.

The options selected during the last run are automatically reloaded. Clicking the clear all button removes all options and reloads the default values.
Input and output directories are always required. If the output directory already exists, you must either confirm that the old one should be replaced or type a new name.
For each usage, some parameters are required and others are optional; only the parameters relevant to the chosen usage are activated, and they are described below. Once the parameters are chosen, a click on the RUN button executes the program. A new window appears while the application is running. Close the result window by clicking the CLOSE button; the user can then run another analysis with different options, or quit by clicking the QUIT button.
With the VERBOSE option, a detailed description of the run is displayed.

In all following explanations, out will be used to refer to the output directory.

To make an OTUs file

To create an OTU file, first choose the SPECIES PRESENCE radio button.
The following options are available:
  • To obtain an OTUs file in taxonomic order, you must specify the Systematic file, in which the correspondence between species and taxa is defined; the browse button opens a window in which the systematic file can be selected
  • You can reduce the number of species by setting a threshold, the Minimum frequency of species
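The effect of the frequency threshold can be illustrated with a minimal sketch. It assumes that the species names of each aligned file have already been read into memory, and that the frequency is the number of files in which a species occurs; both points are assumptions made for illustration, and the function name is hypothetical.

```python
from collections import Counter

def species_presence(files, min_frequency=1):
    """List the species that occur in at least `min_frequency` files.

    files: dict mapping a file name to the species names found in that file.
    Assumption: frequency = number of files containing the species.
    """
    counts = Counter()
    for names in files.values():
        counts.update(set(names))  # count each species once per file
    return sorted(sp for sp, n in counts.items() if n >= min_frequency)

# Hypothetical input: two files with their species content.
files = {
    "gene1": ["Homo sapiens", "Mus musculus", "Homo sapiens"],
    "gene2": ["Homo sapiens"],
}
print(species_presence(files, min_frequency=2))
```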

It is possible to continue with a concatenation based on the length of the sequences. The program prompts the user for this option: a concatenated sequence is built for every species found at least once in the aligned sequence files. The file resulting from this rough concatenation gives only a first draft of the phylogeny, because the dataset contains a lot of missing data and paralogous sequences have not been carefully sorted out.
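The idea of this rough concatenation can be sketched as follows, assuming that each aligned file is held as a dictionary from species name to its aligned sequences, that "-" and "?" mark gaps or missing data, and that a species absent from a file is padded with "?". These conventions and the function name are assumptions for illustration, not a description of the SCaFoS code.

```python
def rough_concatenation(alignments, missing="-?"):
    """Concatenate, file by file, the longest sequence of each species.

    alignments: non-empty dicts, species -> list of aligned sequences,
    all sequences of one file having the same aligned width.
    """
    species = sorted({sp for aln in alignments for sp in aln})
    parts = {sp: [] for sp in species}
    for aln in alignments:
        width = len(next(iter(aln.values()))[0])
        for sp in species:
            candidates = aln.get(sp)
            if candidates:
                # "Length" = number of characters that are not gap/missing.
                best = max(candidates,
                           key=lambda s: sum(c not in missing for c in s))
            else:
                best = "?" * width  # species absent from this file
            parts[sp].append(best)
    return {sp: "".join(chunks) for sp, chunks in parts.items()}

alignments = [{"A": ["AC-T", "ACGT"], "B": ["A--T"]}, {"A": ["GG"]}]
print(rough_concatenation(alignments))
```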

To select files with chosen species

Choose the SELECT FILES radio button.

Since SCaFoS selects sequences according to the OTUs file, this file must be specified. To select it easily, click the browse button to open a file selection window.

You can automatically eliminate sequences that are too short by entering a value for Minimal length to choose a sequence. By default, sequences shorter than 10 percent of the longest sequence are removed from the analysis and are not taken into account when calculating the evolutionary distance.
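The default rule can be written as a short filter. This is only a sketch: it assumes that sequence length means the number of characters other than "-" and "?", which is a guess at how length is measured, and the function name is hypothetical.

```python
def filter_short(seqs, min_percent=10.0, missing="-?"):
    """Drop sequences shorter than `min_percent` of the longest sequence.

    seqs: dict name -> aligned sequence.
    Assumption: length = number of non-gap, non-missing characters.
    """
    lengths = {name: sum(c not in missing for c in s)
               for name, s in seqs.items()}
    longest = max(lengths.values(), default=0)
    cutoff = longest * min_percent / 100.0
    return {name: s for name, s in seqs.items() if lengths[name] >= cutoff}

seqs = {
    "full": "ACGTACGTACGTACGTACGT",      # 20 characters
    "tiny": "A" + "-" * 19,              # 1 character, below the cutoff of 2
    "edge": "AC" + "-" * 18,             # 2 characters, exactly at the cutoff
}
print(sorted(filter_short(seqs)))
```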

Depending on the option checked, the selected files of aligned sequences are written to the output directory; three options are available:
  • writing new files ONLY with the sequences of species defined in the OTUs file
  • copying aligned files that contain AT LEAST ONE SPECIES defined in the OTUs file
  • copying files that contain at least one species for EACH OTU defined in the OTUs file

With the first option, unselected sequences are removed from the file; with the other two options, the selected aligned sequence files are identical to the initial files.
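The three behaviours can be compared with a minimal sketch working on sets of species names. The mode names, the function name and the decision to skip a file that keeps no sequence in the first mode are assumptions made for illustration.

```python
def select_file(file_species, otus, mode):
    """Return the species to write for one file, or None if it is not selected.

    file_species: set of species present in the aligned file.
    otus: dict OTU name -> set of species belonging to that OTU.
    mode: "only", "at_least_one" or "each_otu" (hypothetical names).
    """
    wanted = set().union(*otus.values())
    if mode == "only":
        kept = file_species & wanted       # unselected sequences are removed
        return kept or None
    if mode == "at_least_one":
        return set(file_species) if file_species & wanted else None
    if mode == "each_otu":
        complete = all(file_species & members for members in otus.values())
        return set(file_species) if complete else None
    raise ValueError("unknown mode: " + mode)

otus = {"Mammalia": {"Homo sapiens", "Mus musculus"}, "Aves": {"Gallus gallus"}}
print(select_file({"Homo sapiens", "Danio rerio"}, otus, "only"))
```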

To create datasets

This step is the core of SCaFoS: it generates a large amount of information that can be used to choose genes and species. Several options are therefore associated with this usage, which is selected by clicking the DATASETS ASSEMBLING radio button.
This usage requires an OTU file.

Selection criteria

When MAKING of CHIMERA is chosen, if no complete sequence exists for an OTU, a new sequence is created with portions from different species within the OTU (remember that the MINIMAL LENGTH option is applied to the created chimera). Several criteria are used to determine the best sequence within an OTU:
  • Minimal evolutionary distance: SCaFoS selects the sequence with the smallest evolutionary distance to all other sequences within an OTU (default selection option)
    • With gamma: the evolutionary distance is calculated according to a gamma distribution with 4 categories
    • The Threshold is the value used to eliminate OTUs that are too divergent, in order to reduce a cause of long-branch attraction (LBA) and the risk of including paralogous sequences (default value: 25)
  • Using DEFAULT sequence for an OTU: lets the user select his/her preferred sequence in ambiguous cases, unless a choice is already recorded in the corresponding default sequences file
    • With automatic choice by SCaFoS: when no previous choice is defined in the default file, let SCaFoS choose the best sequence, even in ambiguous cases
  • Remove file when at least one OTU is too divergent: this option is only available in the automatic mode
  • Longest sequence: a fast but rough selection
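To make the minimal-distance criterion concrete, here is a small sketch that picks the candidate with the smallest mean distance to a set of comparison sequences. It uses the uncorrected proportion of differing sites (p-distance) as a stand-in for the evolutionary distance computed by SCaFoS, leaves the choice of comparison sequences to the caller, and uses hypothetical function names throughout.

```python
def p_distance(a, b, missing="-?"):
    """Proportion of differing sites among positions present in both sequences.

    Stand-in measure, not the gamma-corrected distance mentioned above.
    Returns None when the two sequences share no comparable position.
    """
    pairs = [(x, y) for x, y in zip(a, b)
             if x not in missing and y not in missing]
    if not pairs:
        return None
    return sum(x != y for x, y in pairs) / len(pairs)

def pick_closest(candidates, others):
    """Return the name of the candidate with the smallest mean distance.

    candidates: dict name -> aligned sequence (the sequences of one OTU).
    others: list of aligned sequences used for comparison.
    """
    def mean_distance(seq):
        ds = [d for d in (p_distance(seq, o) for o in others) if d is not None]
        return sum(ds) / len(ds) if ds else float("inf")
    return min(candidates, key=lambda name: mean_distance(candidates[name]))

candidates = {"s1": "ACGT", "s2": "AAAA"}
print(pick_closest(candidates, ["ACGA", "ACGT"]))
```

A threshold on the resulting distance, as described for the Threshold option, could then be applied to the value returned for the selected sequence; its scale depends on the distance actually used, so it is not reproduced here.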

Selection of default sequences

In the semi-automatic mode (i.e. when the Using DEFAULT sequence for an OTU option is selected), when SCaFoS is unable to select a unique sequence for an OTU, the user must validate the choice of the sequence, guided by the evolutionary distance, the amount of missing positions and, where applicable, the deviation in composition. Remember that the list of sequences is written in red when at least one sequence is too divergent; otherwise the list is written in blue. Three options are available:
  • accept the first sequence, which has the smallest evolutionary distance (checked sequence), by clicking on the OK button
  • select another sequence and click OK
  • remove the OTU for this aligned sequence file by clicking on the No sequence button

If all sequences are identical, SCaFoS automatically selects the first sequence without asking the user.
Sequence selection in semi-automatic mode can be very time-consuming, but saving default sequences is a real gain when the same dataset is used for various purposes with minor variations in the definition of OTUs. After closing SCaFoS, copy the newly created .def files from the output directory to the input directory; on the next run, the previously selected sequences will be used and the user will not be asked again for these OTUs.
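The copy step can be scripted. The sketch below assumes that the .def files sit directly at the top level of the output directory, which should be checked against the Output Files section; the function name is hypothetical.

```python
import shutil
from pathlib import Path

def copy_def_files(out_dir, in_dir):
    """Copy every *.def file from out_dir into in_dir.

    Assumption: the .def files are at the top level of out_dir.
    Existing files with the same name in in_dir are overwritten.
    Returns the names of the copied files.
    """
    copied = []
    for path in sorted(Path(out_dir).glob("*.def")):
        shutil.copy2(path, Path(in_dir) / path.name)
        copied.append(path.name)
    return copied

# Usage with hypothetical directory names:
# copy_def_files("out", "input_directory")
```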

Creation of subdirectories

To minimize the frequency of missing characters in the final dataset, the user can create subdirectories containing files selected according to:
  • Maximum number of missing OTUs: create directories with files containing between 0 and <value> missing OTUs (default: 25 missing OTUs per file)
  • Maximum number of missing sites: create directories with files containing between 0 and <value> percent of missing sites, in steps of <value> (default: 20% of missing sites in steps of 5%)
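With the default values, the missing-sites option gives thresholds of 5, 10, 15 and 20 percent. The sketch below groups files under each threshold, assuming that the groups are cumulative (a file with 3% of missing sites appears under every threshold) and that the percentage of missing sites per file is already known; both the cumulative reading and the function name are assumptions.

```python
def bins_by_missing_sites(percent_missing, maximum=20, step=5):
    """Group files by successive thresholds of missing sites.

    percent_missing: dict file name -> percent of missing sites.
    maximum, step: integers, as in "20% in steps of 5%".
    Returns dict threshold -> sorted list of files at or below it.
    """
    thresholds = range(step, maximum + 1, step)
    return {t: sorted(f for f, p in percent_missing.items() if p <= t)
            for t in thresholds}

print(bins_by_missing_sites({"g1": 3.0, "g2": 12.5, "g3": 40.0}))
```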

Global options

The value given for the option Maximal percent of missing sites corresponding to complete sequence sets the threshold below which a sequence is treated as complete and therefore not included in a chimerical sequence. This option is used only if the MAKING of CHIMERA option is checked; otherwise all sequences are taken into account when calculating the evolutionary distance, regardless of their completeness.
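How the completeness threshold interacts with chimera making can be sketched as follows. The merging rule (take, at each position, the first available character from the sequences ordered by completeness), the treatment of both "-" and "?" as missing, the inclusive comparison and the default value of 10 are all assumptions chosen for illustration; they are not taken from SCaFoS.

```python
MISSING = "-?"

def percent_missing(seq):
    """Percentage of gap or missing characters in an aligned sequence."""
    return 100.0 * sum(c in MISSING for c in seq) / len(seq)

def make_chimera(seqs, max_missing_for_complete=10.0):
    """Merge the partial sequences of one OTU when none is complete.

    seqs: list of aligned sequences of equal length from one OTU.
    Returns None if at least one sequence counts as complete
    (assumed rule: percent missing <= threshold), else the chimera.
    """
    if any(percent_missing(s) <= max_missing_for_complete for s in seqs):
        return None
    ordered = sorted(seqs, key=percent_missing)  # most complete first
    chimera = []
    for column in zip(*ordered):
        # First non-missing character in the column, "?" if there is none.
        chimera.append(next((c for c in column if c not in MISSING), "?"))
    return "".join(chimera)

print(make_chimera(["AC------", "----GTAA"]))
```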

To obtain a phylogenetic tree in which sequence names are written in different colors according to their selection by SCaFoS, check the corresponding option: Tree in postscript format colored according to selected sequence.

In addition to computing evolutionary distances, it is possible to display how far the composition of each sequence departs from the average composition of all sequences in the file. When the option Determine variation in sequence composition is checked, this information is displayed during the interactive choice by the user and in various output files.
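One simple way to quantify such a departure is the sum of absolute differences between the character frequencies of a sequence and the mean frequencies over all sequences. The sketch below uses that measure purely as an illustration; the statistic actually reported by SCaFoS may differ, and the function names are hypothetical.

```python
from collections import Counter

def composition(seq, missing="-?"):
    """Relative frequency of each character, ignoring gaps and missing data."""
    counts = Counter(c for c in seq.upper() if c not in missing)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}

def composition_deviation(seqs):
    """Deviation of each sequence from the mean composition of the file.

    seqs: dict name -> sequence.
    Assumed measure: sum over characters of |frequency - mean frequency|.
    """
    comps = {name: composition(s) for name, s in seqs.items()}
    alphabet = sorted({c for comp in comps.values() for c in comp})
    mean = {c: sum(comp.get(c, 0.0) for comp in comps.values()) / len(comps)
            for c in alphabet}
    return {name: sum(abs(comp.get(c, 0.0) - mean[c]) for c in alphabet)
            for name, comp in comps.items()}

print(composition_deviation({"a": "AAAA", "b": "AACC"}))
```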

The concatenated file can be output in several formats: FASTA, PHYLIP in sequential format, MUST and NEXUS. Just check the desired options. Commands can be automatically added to the NEXUS file to obtain output files suitable for PAUP or MrBayes. The user can also modify these optional commands, or add his/her own commands to the NEXUS file, with the "for other" option: a text editor is displayed in which the appropriate commands can be typed. Be careful, because SCaFoS does not check the validity of these commands. For all formats, the order of the concatenated sequences can be chosen with the Output order option. These options are also applied to the concatenated files in subdirectories if needed.
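For readers unfamiliar with two of these formats, the following sketch writes a small alignment as FASTA and as strict sequential PHYLIP (names padded or truncated to 10 characters). It illustrates the formats only and is not the SCaFoS writer; the function names are hypothetical, and the sequence order is simply the insertion order of the dictionary.

```python
def to_fasta(seqs, width=60):
    """Return the sequences as FASTA text, wrapped at `width` characters."""
    lines = []
    for name, seq in seqs.items():
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"

def to_phylip_sequential(seqs):
    """Return the sequences as strict sequential PHYLIP text.

    Names are cut to 10 characters, so long names that share a prefix
    would collide; a real writer has to handle that case.
    """
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    lines = [f"{len(seqs)} {lengths.pop()}"]
    for name, seq in seqs.items():
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"

seqs = {"Homo": "ACGT", "Mus": "AC-T"}
print(to_fasta(seqs), end="")
print(to_phylip_sequential(seqs), end="")
```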

Checking the Add number of characters to the OTU name option appends the length of the concatenated sequence, i.e. the number of real characters, to the end of each OTU name.
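As an illustration, the sketch below counts the real characters of each concatenated sequence and appends the count to the name. It assumes that real characters are those other than "-" and "?", and the underscore separator is hypothetical; the exact name layout produced by SCaFoS is not specified here.

```python
def add_character_count(seqs, missing="-?", separator="_"):
    """Append the number of real characters to each OTU name.

    seqs: dict OTU name -> concatenated sequence.
    Assumption: real characters exclude "-" and "?".
    """
    return {
        f"{name}{separator}{sum(c not in missing for c in seq)}": seq
        for name, seq in seqs.items()
    }

print(add_character_count({"Homo": "AC-T??"}))
```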

To obtain the help

A short help text for each option is displayed when the mouse pointer hovers over that option. General help is displayed by clicking the HELP button. From the newly opened window, the full online help available on this web site can be displayed.


Hervé PHILIPPE's Lab