# Supporting Scripts for "Eukan: a fully automated nuclear genome annotation pipeline for less studied and divergent eukaryotes"

## Description

This repository provides supporting scripts, Docker configurations, and workflow documentation for the manuscript "Eukan: a fully automated nuclear genome annotation pipeline for less studied and divergent eukaryotes".

The workflow is as follows:
- Transcriptome assembly from reference RNA-seq data.
- Execution of annotation tools via containerized environments.
- Output formatting for consistency across pipelines.
- Data processing and statistical analysis using R (e.g., BUSCO completeness, sensitivity/specificity, gene/exon/intron metrics, and performance correlations).

Results (e.g., generated GTF/GFF files) are archived separately because of their size; see the manuscript's supplementary materials. This repository focuses on reproducing the analysis pipeline and data processing steps.

## Prerequisites

### Software and Dependencies
- **Docker**: For running containerized annotation pipelines.
- **Licenses**: Some tools require academic/research licenses (e.g., MAKER, GeneMark). Obtain as needed.
- **Shell/Bash**: For running scripts (default on Linux/macOS).
- **Storage**: ~150 MB for the repository; several GB or more for input data and outputs.
- **Environment**: Linux/Unix system with at least 64GB RAM and 20 CPU cores for full runs (based on pipeline benchmarks).
- **R**: For data processing scripts (version 4.0.4 was used to generate results).
    - dependencies:
      - tidyverse
      - magrittr
      - stringr
      - tidyr
      - readr
      - ggplot2
      - VennDiagram
      - RColorBrewer
      - broom
      - pastecs
      - twosamples
      - rstatix
      - ggpubr
      - lsmeans
      - asbio
      - reshape2
      - gplots
      - FSA

### Data Requirements
- Input genomes (FASTA) and transcriptome reads (FASTQ) from NCBI; see the manuscript's supplementary materials for reference IDs.
- Reference protein sequences bundled with genome downloads.

Data can be fetched using NCBI's `datasets`.
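
As a rough sketch, one genome package and one RNA-seq run could be fetched as shown below. The accession `GCF_000146045.2` (the RefSeq assembly for *S. cerevisiae* S288C) and the use of SRA Toolkit (`prefetch`, `fasterq-dump`) for reads are assumptions made for illustration; the run ID `SRR4368892` is taken from the Step 3 example. The manuscript's supplementary materials remain the authoritative list of IDs.

```bash
# Sketch only: prints the commands by default; set DRY_RUN=0 to execute them.
# Assumes NCBI `datasets` and SRA Toolkit (`prefetch`, `fasterq-dump`) are on PATH.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# fetch_genome ACCESSION: genome FASTA, proteins, and GFF3 as one zip archive.
fetch_genome() {
  run datasets download genome accession "$1" \
    --include genome,protein,gff3 --filename "$1.zip"
}

# fetch_reads SRR_ID: download a run, then convert it to paired FASTQ files.
fetch_reads() {
  run prefetch "$1"
  run fasterq-dump --split-files "$1"
}

fetch_genome GCF_000146045.2   # assumed accession (S. cerevisiae S288C)
fetch_reads SRR4368892         # run ID taken from the Step 3 example
```

Printing the commands first makes it easy to review a long accession list before committing to large downloads.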

### Setup Steps
1. Clone or download this repository.
2. Verify Docker: `docker --version` (and add user to `docker` group if needed: `sudo usermod -aG docker $USER`).
3. Fetch software binaries: download the specific versions from the manuscript's URL, `https://megasun.bch.umontreal.ca/matt/eukan-manuscript/dockerised-annotation-pipelines`, and place them in the `docker/` directory for the builds.
4. Build Docker images (see Step 1 below).
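
Before building, it can help to confirm that the command-line tools named in the prerequisites are installed. The following is a minimal sketch; the tool list passed on the last line is an assumption based on this README, not a check shipped with the repository.

```bash
# check_tools TOOL...: report which tools are missing from PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf 'ok       %s\n' "$tool"
    else
      printf 'MISSING  %s\n' "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# Assumed tool list: Docker, R's script runner, and NCBI datasets.
check_tools docker Rscript datasets || echo "Install the missing tools before continuing."
```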

## Repository Contents

- **`docker/`**: Dockerfiles and command references for containerized pipelines (e.g., BRAKER, GeMoMa, Eukan, MAKER). Includes timing benchmarks, shell scripts for annotation execution, and organization scripts (e.g., `maker_annotation.sh` for MAKER runs).
- **`busco-and-annot-output-formatting-scripts/`**: Bash scripts to reformat GFF/GTF outputs from annotation tools (e.g., `format_annot_output_Arabidopsis_thaliana.sh`) into consistent bed12 format for downstream analysis.
- **`annot-data-prep-scripts/`**: R scripts for ingesting bed12 annotation data and BUSCO results, in order to transform them into tidy datasets for statistics. One per species (e.g., `annot-data-prep-Arabidopsis_thaliana.R` computes gene/mRNA/exon metrics; `build-annot-summaries.R` combines results across species).
- **`analyses/`**: R script for statistical analysis and visualization of annotation results (`annot-analysis.R`). Processes metrics from prep scripts, performs BUSCO result comparisons, computes F1 scores, runs statistical tests (e.g., Kruskal-Wallis), generates plots (e.g., ECDFs of performance, Venn diagrams of overlaps), and correlates outcomes across pipelines and lineages (plants, animals, fungi, protists).
- **`docker/pipeline-commands.md`**: Comprehensive reference for reproducible Docker commands per species (e.g., transcriptome assembly times, annotation tool parameters, and runtimes).
- **Other files in `docker/`** (e.g., `alignAssembly.config`, `zff2augustus_gbk.pl`): Configuration files and utilities for specific tools like Augustus training.

**Note**: Generated outputs such as `annot-comparison-results/` (resulting annotations), the output of `build-annot-summaries.R` (combined results for all organisms, split into gene, mRNA, exon, and intron files), and `annot-comparison-analysis.tar.gz` (final results) are not included in this repository because of their size. They are generated by these scripts and analyzed as described in the manuscript.

## Workflow
1. [Docker Setup](#step-1-docker-setup)
2. [Download Reference Data](#step-2-download-reference-data)
3. [Transcriptome Assembly](#step-3-transcriptome-assembly)
4. [Annotation Tool Execution](#step-4-annotation-tool-execution)
5. [Output Formatting and BUSCO Analysis](#step-5-output-formatting-and-busco-analysis)
6. [Data Analysis and Statistics](#step-6-data-analysis-and-statistics)

## Step 1: Docker Setup
Prepare containerized versions of annotation tools (e.g., BRAKER, MAKER, GeMoMa, Eukan) to ensure reproducible runs with pinned tool versions. This step isolates dependencies and matches manuscript conditions.

- Fetch software binaries from `https://megasun.bch.umontreal.ca/matt/eukan-manuscript/dockerised-annotation-pipelines` (some tools require licenses; contact the authors if needed).
- Build images using Dockerfiles in `docker/`:
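  ```bash
  # Requires specific software versions, some of which need licenses.
  # The build context (docker/) is assumed to be where the binaries were placed.
  docker build -t base -f docker/Dockerfile-base docker/
  # Use a distinct tag per image so the base image is not overwritten.
  docker build -t braker -f docker/Dockerfile-braker docker/
  # Repeat for the others (eukan, gemoma, maker)
  ```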
- Use the `annot-docker` script in `docker/` as a wrapper: `./annot-docker braker braker.pl ...` (it mounts the current directory as the input/output volume).
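
The build-and-wrap pattern described above can be sketched as follows. This is not the repository's `annot-docker` script: `annot_docker_sketch` is a hypothetical stand-in, the `Dockerfile-<name>` naming for eukan, gemoma, and maker is inferred from the two Dockerfiles named above, and using `docker/` as the build context is an assumption. The `/root/tmp` mount target follows the Step 3 example.

```bash
# Sketch only: prints the commands by default; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Build one image per pipeline, each with its own tag.
# Dockerfile names other than -base and -braker are assumed.
build_images() {
  for name in base braker eukan gemoma maker; do
    run docker build -t "$name" -f "docker/Dockerfile-$name" docker/
  done
}

# Hypothetical stand-in for `annot-docker`: run a command inside an image
# with the current directory bind-mounted and used as the working directory.
annot_docker_sketch() {
  image="$1"; shift
  run docker run --rm \
    --mount "type=bind,source=$(pwd),target=/root/tmp" \
    --workdir /root/tmp "$image" "$@"
}

build_images
annot_docker_sketch braker braker.pl --help
```

Bind-mounting the working directory keeps inputs and outputs on the host, so containers can be removed after each run (`--rm`) without losing results.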

## Step 2: Download Reference Data

See the manuscript's supplementary materials for the complete list of reference IDs, then download them from NCBI using `datasets` or other tools.

## Step 3: Transcriptome Assembly
Assemble RNA-seq reads into de novo transcripts using Trinity (via Docker) to provide evidence for gene models. This generates hints for annotation tools, improving accuracy in divergent eukaryotes.

- Run for each species using commands like:
  ```bash
  # Example for Saccharomyces cerevisiae
  docker run --mount type=bind,source="$(pwd)",target=/root/tmp annot-pipelines-stack transcriptome_assembly.sh \
    -l /root/tmp/SRR4368892_1.fastq.gz,/root/tmp/SRR4368902_1.fastq.gz \
    -r /root/tmp/SRR4368892_2.fastq.gz,/root/tmp/SRR4368902_2.fastq.gz \
    -g /root/tmp/Saccharomyces_cerevisiae_S288C.fasta -M 5000 -S RF -A
  ```
- Parameters: `-M` for max intron length, `-S` for strand, etc. (adjust per species' sequencing library).
- Outputs: `nr_transcripts.fasta` (assembled transcripts) and `nr_transcripts.gff3` (hints).

**Tips**: When the library is stranded, passing the strand orientation (e.g., `-S RF`) generally improves transcript assembly. Benchmarks in `docker/pipeline-commands.md` show runtimes of roughly 45-80 min. Replace the file paths with your local data.
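
When a species has several RNA-seq runs, the comma-separated lists passed to `-l` and `-r` can be assembled with a small helper. This is a sketch: `read_list` is a hypothetical helper, and it assumes the `<run>_1.fastq.gz` / `<run>_2.fastq.gz` naming and the `/root/tmp` mount target used in the example above.

```bash
# read_list MATE RUN_ID...: join FASTQ paths for the given mate (1 or 2)
# into one comma-separated string, as expected by -l and -r.
read_list() {
  mate="$1"; shift
  list=""
  for run_id in "$@"; do
    # Prepend a comma only when the list already has an entry.
    list="${list:+$list,}/root/tmp/${run_id}_${mate}.fastq.gz"
  done
  printf '%s\n' "$list"
}

left="$(read_list 1 SRR4368892 SRR4368902)"
right="$(read_list 2 SRR4368892 SRR4368902)"
printf '%s\n' "-l $left -r $right"
```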

## Step 4: Annotation Tool Execution
Run multiple state-of-the-art annotation pipelines (BRAKER, GeMoMa, MAKER, Eukan) per species to generate gene predictions. Each tool uses RNA hints and homology references for comprehensive annotations.

- Tools supported include:
  - **BRAKER**: Ab initio prediction with RNA hints.
  - **GeMoMa**: Homology-based gene prediction from reference annotations.
  - **MAKER**: Comprehensive pipeline with Augustus training.
  - **Eukan**: Optimized for less studied and divergent eukaryotes, such as protists and fungi.
- Example commands (from `docker/pipeline-commands.md`; use these for reproducible runs):
  ```bash
  ./annot-docker braker braker.pl --species=Saccharomyces_cerevisiae_S288C \
    --genome=Saccharomyces_cerevisiae_S288C.fasta \
    --hints=hints_rnaseq.gff --prot_seq=protein_refs.faa --cores=20 --softmasking
  ./annot-docker gemoma java -jar /opt/gemoma/GeMoMa.jar CLI GeMoMaPipeline \
    t=genome.fasta s=own i=ref1 g=ref1_genomic.fna a=ref1_genomic.gff outdir=out
  ```
- Inputs: Genome FASTA, RNA hints (from Step 3), reference proteins (e.g., from NCBI).
- Outputs: GFF3/GTF files per tool and species.
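
A quick sanity check on each tool's output is to count records by feature type. The sketch below uses a hypothetical helper, `count_features`, and a tiny made-up GFF3 file for demonstration. Some tools emit transcripts without explicit `gene` records, in which case counting `mRNA` or `transcript` records is the closer equivalent.

```bash
# count_features FILE TYPE: number of non-comment records whose
# feature type (column 3) equals TYPE.
count_features() {
  awk -F '\t' -v type="$2" '!/^#/ && $3 == type { n++ } END { print n + 0 }' "$1"
}

# Tiny made-up GFF3 file, used only to demonstrate the function.
demo="$(mktemp)"
printf '##gff-version 3\n' > "$demo"
printf 'chrI\tdemo\tgene\t100\t900\t.\t+\t.\tID=g1\n' >> "$demo"
printf 'chrI\tdemo\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1\n' >> "$demo"
printf 'chrI\tdemo\tgene\t1500\t2400\t.\t-\t.\tID=g2\n' >> "$demo"
count_features "$demo" gene   # prints 2
rm -f "$demo"
```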

## Step 5: Output Formatting and BUSCO Analysis
Standardize GFF/GTF outputs from Step 4 and assess completeness using BUSCO (via Docker).

- Run formatting scripts in `busco-and-annot-output-formatting-scripts/`:
  ```bash
  # Per species/tool (e.g., run from the output directory); input.gff is a placeholder
  ./format_annot_output_Arabidopsis_thaliana.sh input.gff > output.gtf
  ```
- These scripts reformat to consistent BED12 or GTF, filter artifacts, and integrate BUSCO scores.
- BUSCO: Run on amino acid translations of predictions (e.g., via `busco_run.py` in containers) for completeness metrics.
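
To illustrate what a BED12 record contains, the sketch below collapses the `exon` rows of a GTF file into one BED12 line per transcript. It is not one of the repository's formatting scripts: `gtf_to_bed12` is a hypothetical helper, it assumes a `transcript_id "..."` attribute on every exon row, and it does not track CDS bounds, so `thickStart`/`thickEnd` are simply set to the transcript span.

```bash
# gtf_to_bed12 FILE: one BED12 line per transcript, built from exon rows.
gtf_to_bed12() {
  # 1) Emit: transcript_id, chrom, strand, 0-based start, end.
  awk -F '\t' -v OFS='\t' '$3 == "exon" {
      if (match($9, /transcript_id "[^"]+"/)) {
        id = substr($9, RSTART + 15, RLENGTH - 16)
        print id, $1, $7, $4 - 1, $5
      }
    }' "$1" |
  # 2) Group exons by transcript and order them by start coordinate.
  sort -t "$(printf '\t')" -k1,1 -k4,4n |
  # 3) Accumulate block sizes and block starts relative to the transcript start.
  awk -F '\t' -v OFS='\t' '
    function emit() {
      if (tx_id == "") return
      print chrom, tx_start, tx_end, tx_id, 0, strand, tx_start, tx_end, 0, n, sizes, starts
    }
    $1 != tx_id {
      emit()
      tx_id = $1; chrom = $2; strand = $3; tx_start = $4; tx_end = 0
      n = 0; sizes = ""; starts = ""
    }
    {
      n++
      if ($5 > tx_end) tx_end = $5
      sizes = sizes ($5 - $4) ","
      starts = starts ($4 - tx_start) ","
    }
    END { emit() }'
}

# Made-up two-exon transcript, listed out of order to exercise the sort.
demo="$(mktemp)"
printf 'chr1\tdemo\texon\t301\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n' > "$demo"
printf 'chr1\tdemo\texon\t101\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n' >> "$demo"
gtf_to_bed12 "$demo"
rm -f "$demo"
```

For the demo transcript, the record spans `chr1:100-400` (0-based, half-open) with two blocks of size 100 at offsets 0 and 200.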

**Outputs**: Cleaned annotations and BUSCO reports (`.txt` files with scores). The scripts assume standard Unix tools (e.g., `awk` for parsing). There is one script per species; run `ls *.sh` in the scripts directory to list them for the 17 tested eukaryotes.
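
The scores in a BUSCO report can be pulled into a comma-separated row for downstream tables. The sketch below assumes the one-line summary notation that BUSCO writes in its short summary files (`C:..%[S:..%,D:..%],F:..%,M:..%,n:..`); `parse_busco` is a hypothetical helper and the numbers in the demo line are made up.

```bash
# parse_busco: read a BUSCO short summary on stdin and print
# complete,single,duplicated,fragmented,missing,total as one CSV row.
parse_busco() {
  sed -n 's/^[[:space:]]*C:\([0-9.]*\)%\[S:\([0-9.]*\)%,D:\([0-9.]*\)%],F:\([0-9.]*\)%,M:\([0-9.]*\)%,n:\([0-9]*\).*/\1,\2,\3,\4,\5,\6/p'
}

# Made-up summary line for demonstration.
printf '%s\n' '	C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:2137' | parse_busco
```

With a real report, the call would look like `parse_busco < short_summary.txt`, where the file name is a placeholder for the report produced by your BUSCO run.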

## Step 6: Data Analysis and Statistics
Process formatted outputs (from Step 5) into summary metrics and visualizations using R. Compute sensitivity/specificity, gene/exon/intron stats, and comparisons discussed in the manuscript.
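
For orientation, the relationship between sensitivity, specificity, and F1 can be worked through on raw counts. The sketch assumes the definitions conventional in gene-prediction benchmarking, where sensitivity is TP / (TP + FN) and specificity is TP / (TP + FP); the manuscript's definitions take precedence if they differ. `f1_score` is a hypothetical helper and the counts are made up.

```bash
# f1_score TP FP FN: print sensitivity, specificity, and F1 (3 decimals).
f1_score() {
  awk -v tp="$1" -v fp="$2" -v fn="$3" 'BEGIN {
    if (tp == 0) { print "NA NA NA"; exit }   # avoid division by zero
    sn = tp / (tp + fn)                       # sensitivity
    sp = tp / (tp + fp)                       # specificity (TP / (TP + FP))
    printf "%.3f %.3f %.3f\n", sn, sp, 2 * sn * sp / (sn + sp)
  }'
}

f1_score 90 10 30   # prints: 0.750 0.900 0.818
```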

- Run annot-data-prep scripts in `annot-data-prep-scripts/`:
  ```r
  # In R, per species (loads data, transforms to tidy format)
  source('annot-data-prep-Arabidopsis_thaliana.R')
  # Outputs: Per-species datasets with metrics (e.g., gene counts, BUSCO scores)
  ```
- These R scripts ingest bed12-like files as structured tables, compute stats (e.g., overlap, accuracy), and prepare for aggregation.
- Run the final analysis in `analyses/annot-analysis.R`: load the prepared data, compute F1 scores, run statistical tests and correlations, generate plots (e.g., ECDFs, heatmaps), and write figures/tables (e.g., `ecdf-gene-busco-summary.tiff`).
- Combine across species (e.g., via `build-annot-summaries.R`, listed under `annot-data-prep-scripts/`): merge the per-species datasets, run tests (e.g., Kruskal-Wallis in R), and plot the results.
- **Final Outputs**: Tables/figures for gene-level, transcript/exon/intron metrics, and BUSCO overviews.
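
The per-species prep scripts and the final analysis could also be run non-interactively from the shell. This is a sketch under the assumption that the scripts work when launched with `Rscript` from the repository root; they may instead expect a particular working directory or an interactive session.

```bash
# Sketch only: prints the commands by default; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Run every per-species prep script found in annot-data-prep-scripts/.
run_prep_scripts() {
  for script in annot-data-prep-scripts/annot-data-prep-*.R; do
    [ -e "$script" ] || continue   # the glob matched nothing
    run Rscript "$script"
  done
}

run_prep_scripts
run Rscript analyses/annot-analysis.R
```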