###################################################################################################
######################## AnnCX_gene_annotation_pipeline (v1.0.0) ##################################
###################################################################################################

AnnCX is a gene annotation pipeline designed for the analysis of gene-rich complex genomic regions. It has been shown to outperform the widely used MAKER pipeline in our benchmarking on complex genomic regions (paper). AnnCX automates all intermediate steps to provide comprehensive results with minimal user input.

Key novel features include:
- Iterative extraction of target regions from whole-genome sequences to focus on complex genomic areas (Step 1: extract ROI = OPTIONAL)
- Incorporation of a more diverse array of individual annotation tools than other pipelines, with an emphasis on the accurate identification of exon-intron boundaries.
- Detection and annotation of ambiguous nucleotide regions (assembly gaps) to provide a clear representation of the problematic genome assemblies that commonly affect complex regions.
- Support for manual curation through an automatic visualization of the raw annotation results once the pipeline has completed its execution.

Custom supplementary tools to assess quality of assembly:

(1) identify_pred2ref: gene identification process

This feature performs a BLAST search between predicted and reference FASTA sequences, followed by the automatic generation of heatmaps to display the results of the gene identification process. This feature addresses a critical challenge that arises when interpreting annotation results predicted by any gene annotation tool or annotation pipeline. The consensus annotation results produced by an annotation pipeline do not specify which query reference sequence was used to generate each predicted gene annotation.
This circumstance arises because gene annotation tools often make it difficult to trace which specific query sequence led to a given gene annotation. In fact, in tools that do not produce gene annotation models, such as BLAST, a single locus might be annotated based on multiple query sequences. This issue becomes particularly pronounced in genomic regions which contain recently duplicated genes, where high sequence similarity further complicates the attribution of annotations to their original queries. The identify_pred2ref strategy is more robust than phylogenetic trees, which can be distorted by artificial rearrangements that obscure evolutionary relationships between the genes. The plots generated by this feature also serve to illustrate the spatial organization of genes within the region of study, aiding the study of gene order conservation and neighboring genes across species to identify orthologs.

(2) identify_rearrangements: identify exon-level rearrangements

The identify_rearrangements feature uses Exonerate to conduct an exon-by-exon comparison between predicted and reference genes to identify mismatches in the exon sequences. The exon-level rearrangements detected by this method can result from either biological processes or assembly artifacts, the latter commonly occurring in the assemblies of complex genomic regions when sequencing depth is insufficient to resolve the repetitive DNA in these areas (Peel et al., 2022). Detecting rearrangements of biological origin can advance our understanding of the genes under study, while identifying artificial rearrangements is crucial, as they can significantly affect the interpretation of gene content within the annotated region. This feature provides an exon-level assessment of the quality of the assembly which, combined with AnnCX’s annotation of ambiguous nucleotides, can help guide targeted sequencing efforts towards regions with a high number of artificial rearrangements, thereby reducing overall costs.
Moreover, the number and sequence composition of exons found within different genes can serve as an additional indicator of potential gene homology.

Other custom supplementary tools

(3) annotation2fasta: convert annotation GFF3 files into FASTA sequences

This feature, which automatically converts the raw consensus annotation output into FASTA sequences as part of the execution of AnnCX, can also be invoked as a stand-alone feature. This enables users to generate FASTA sequences from the manually refined annotations or other annotation data to facilitate downstream analyses that require sequence data rather than annotation coordinates as input, such as phylogenetic analyses. AnnCX’s annotation2fasta allows users to generate a broader range of sequence types than other annotation pipelines, which typically output only transcript and protein sequences. AnnCX also supplies sequences for genes, exons and CDS (both as individual elements and as concatenated sequences per gene), as well as intron sequences. Providing intron sequences allows the user to streamline the analysis of these regions, which are often overlooked in conventional annotation outputs, yet are increasingly recognized for their roles in gene regulation, alternative splicing, and the evolution of gene structure (Jo and Choi, 2015).

---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
Installation
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------

1. Clone this repository:

    git clone https://github.com/laurahum/AnnCX_gene_annotation_pipeline.git   # Link to repository
    cd path/to/pipeline/folder                                                 # Directory of the repository folder

2. Run the installation script:

    chmod +x install_AnnCX.sh   # Give rights to the installation script
    bash install_AnnCX.sh       # Run installation script

---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
Usage for:
---------------------------------------------------------------------------------------------------
- Genome annotation
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------

1. Activate the conda environment where the pipeline is installed:

    conda activate AnnCX

2. Run the pipeline:

*IMPORTANT:* Please ensure that Artemis is closed before running the pipeline. If Artemis is open, the genome files will be annotated, but the new annotations may not appear in the Artemis interface.

*Step 1: extract ROI = OPTIONAL - If the user provides flanking regions, AnnCX checks how many genome files contain both flanking genes and, by default, asks whether to continue with ROI extraction and annotation on those genomes. Use --skip-prompt to proceed automatically without asking for confirmation.*

*EVM is run using the recommended evidence weights described by the authors (see: https://doi.org/10.1186/gb-2008-9-1-r7). If the user needs to adjust these parameters, the weights file is located at: ../weights/weights_recommended.txt*

usage: AnnCX.py [-h] --genome GENOME --namegenes NAMEGENES --querycDNA QUERYCDNA
                --queryprot QUERYPROT --queryexon QUERYEXON --spsrepeatmasker SPSREPEATMASKER
                --spsaugustus SPSAUGUSTUS --outdir OUTDIR [--flanking FLANKING]
                [--maxintron MAXINTRON] [--threads THREADS] [--overlapsEVM OVERLAPSEVM]
                [--skip-prompt]

AnnCX gene annotation pipeline

options:
  -h, --help
                        show this help message and exit
  --genome GENOME
                        Directory where the genomic FASTA files are located.
                        Every FASTA file in the directory will be processed.
                        No special characters are allowed in the FASTA files.
                        (allowed extensions: FASTA/FA/FNA/FFN/FAA/FRN)
                        Example: --genome /path/to/genomes
  --namegenes NAMEGENES
                        Name of the genes to be annotated.
                        Example: --namegenes NKG2
  --querycDNA QUERYCDNA
                        FASTA file with entries for cDNA sequences to use as query.
                        Example: --querycDNA /path/to/query/querycDNA.fasta
  --queryprot QUERYPROT
                        FASTA file with entries for protein sequences to use as query.
                        Example: --queryprot /path/to/query/queryprot.fasta
  --queryexon QUERYEXON
                        FASTA file with entries for exon sequences to use as query.
                        Each exon must be a different entry in the file.
                        (e.g header: > NKG2_exon_1).
                        Example: --queryexon /path/to/query/queryexon.fasta
  --spsrepeatmasker SPSREPEATMASKER
                        Name of the phylogenetic group to select a repeat database in RepeatMasker.
                        Select the phylogenetic group that suits best your project.
                        Taken from RepeatMasker documentation (https://www.repeatmasker.org/):
                            -species <query species>
                                Specify the species or clade of the input sequence. The species name
                                must be a valid NCBI Taxonomy Database species name and be contained
                                in the RepeatMasker repeat database. Some examples are:
                                    -species human
                                    -species mouse
                                    -species rattus
                                    -species "ciona savignyi"
                                    -species arabidopsis
                                Other commonly used species:
                                mammal, carnivore, rodentia, rat, cow, pig, cat, dog, chicken, fugu,
                                danio, "ciona intestinalis" drosophila, anopheles, worm, diatoaea,
                                artiodactyl, arabidopsis, rice, wheat, and maize
                        Example: --spsrepeatmasker primates
  --spsaugustus SPSAUGUSTUS
                        Name of the species to run AUGUSTUS on pre-trained parameters.
                        Select the species that suits best your project.
                        Taken from AUGUSTUS documentation (https://bioinf.uni-greifswald.de/augustus/):
                            AUGUSTUS has currently been trained on species specific training sets to
                            predict genes in the following species. Note that for closely related
                            species usually only one version is necessary.
                            For example, the human version is good for all mammals
                        Example: --spsaugustus human
  --outdir OUTDIR
                        Directory to save the output files.
                        AnnCX creates a folder within the output directory specified where the
                        output files will be stored named 'AnnCX_namegenes'
                        Example: --outdir /path/to/output
  --flanking FLANKING
                        (OPTIONAL) FASTA file with entries for gene sequences to use as flanking
                        genes and extract a genomic region of interest from a larger genome file
                        to annotate. The sequences can be sequences of gene, cDNA or similar.
                        If this argument is not used the pipeline does not extract a region of
                        interest and starts with the annotation process directly.
                        Example: --flanking /path/to/flanking/geneflanking.fasta
  --maxintron MAXINTRON
                        (OPTIONAL) Sets maximum intron size in basepairs
                        (Default = 7000, based on available eukaryotic studies (Long and Deutsch, 1999)).
                        Example: --maxintron 6000
  --threads THREADS
                        (OPTIONAL) Number of threads that can be used to run some tools.
                        Example on how to calculate the number of threads in your computer:
                            $ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
                            CPU(s):              8   ## your number of threads
                            Thread(s) per core:  2
                            Core(s) per socket:  4
                            Socket(s):           1
                        Example: --threads 8
  --overlapsEVM OVERLAPSEVM
                        (OPTIONAL) Filter for the EVM implementation within AnnCX.
                        Number of overlapping gene annotation tools to support the consensus
                        annotation results (0-7, 0 = no filter, Default = 4).
                        Example: --overlapsEVM 2
  --skip-prompt
                        (OPTIONAL) If the argument is used, skips the prompt after finding the
                        flanking genes that asks the user whether to continue with the annotation
                        process.
                        Example: --skip-prompt
  --version
                        Version of AnnCX

Example using the example data provided in the repository:

    (AnnCX) user@computer:~/path/to/pipeline/folder$ ./AnnCX.py \
        --genome examples/genome \
        --namegenes NKG2 \
        --querycDNA examples/Genome_annotation/cDNA_sequences.fasta \
        --queryprot examples/Genome_annotation/protein_sequences.fasta \
        --queryexon examples/Genome_annotation/exon_sequences.fasta \
        --spsrepeatmasker primates \
        --spsaugustus human \
        --outdir /path/to/output/folder/ \
        --flanking examples/Genome_annotation/flanking_regions.fasta \
        --threads your_threads

Artemis is opened automatically for the visualization of the annotation output produced by AnnCX.

3. Open results in Artemis again

After the pipeline has finished running and you have closed Artemis, you can reopen Artemis within the conda environment to visualize the annotation results again:

Example:

    (AnnCX) user@computer: art

Consult the Artemis manual for a detailed specification of the features (https://ftp.sanger.ac.uk/pub/resources/software/artemis/archive/v4/artemis_manual_v4_b1-old/c201.htm)

For the best view of the annotations, the recommended features are:
- One Line Per Entry
- All Features On Frame Lines
- Zoom to selection

For manual editing, the user is recommended to:
(1) adjust feature boundaries by dragging annotation boundaries in the 'DNA view' window if the change is small, or in Artemis Gene Builder (click on annotation -> ctrl+E) if the change is substantial,
(2) modify CDS boundaries while inspecting the amino acid sequence (click on annotation -> View -> aminoacids of selection as Fasta) to prevent introducing erroneous stop codons,
(3) update gene IDs and names in Artemis Gene Builder upon completion of edits, and
(4) save all Artemis entries (Files -> Save all entries).

4. Interpretation of output files from running AnnCX:

    user@computer:~/path/to/output/folder/AnnCX_namegenes$ tree
    .
    ├── 1_extract_ROI  # Result of Step 1: Extract ROI
    │   ├── 3_found_flanking_genes  # Genomes in which both flanking genes were found
    │   │   ├── txt_output_no.txt  # Both flanking genes
    │   │   └── txt_output_yes.txt  # One flanking gene or none
    │   │
    │   ├── 4_find_single_contig_genomes  # Genomes with both flanking genes
    │   │   ├── Non_single_contig_list.txt  # Not in the same assembly unit (chr/contig) → discarded
    │   │   └── Single_contig_list.txt  # In the same assembly unit → continue annotation
    │   │
    │   └── 7_extracted_roi_raw  # Extracted genomic regions of interest between flanking genes
    │       └── NameGenome_NameGenes_AssemblyUnit_ROI.fasta  → imported to Artemis
    │
    ├── 2_annotate_N_gaps  # Result of Step 2: Annotate gaps in ROI → imported to Artemis
    │   └── NameGenome_annotate_N_stretches.gff3  # Annotation of gaps in assembly
    │
    ├── 3_repeatmasker  # Result of Step 3: Hard-masking ROI
    │   ├── repeat_annotations  # Annotation of repeats → imported to Artemis
    │   │   └── NameGenome_NameGenes_AssemblyUnit_ROI.fasta.out.gff
    │   └── roi_hardmasked  # Hard-masked genomic region of interest
    │       └── NameGenome_NameGenes_AssemblyUnit_ROI.fasta.masked
    │
    ├── 4_gene_annotation_tools  # Result of Step 4: Run gene annotation tools
    │   ├── AUGUSTUS
    │   │   └── raw  # Annotation AUGUSTUS (with protein profile) → imported to Artemis
    │   │       └── NameGenome_ROI_NameGenes_protprof_augustus
    │   ├── BLAST
    │   │   ├── BLASTN
    │   │   │   └── filtered  # Annotation BLASTN (evidence: cDNA) → imported to Artemis
    │   │   │       └── NameGenome_ROI_NameGenes_cDNA_blastn_formatted_FILTERED.gff
    │   │   └── TBLASTN
    │   │       └── filtered  # Annotation TBLASTN (evidence: protein) → imported to Artemis
    │   │           └── NameGenome_ROI_NameGenes_protein_tblastn_formatted_FILTERED.gff
    │   ├── Exonerate
    │   │   └── filtered  # Annotation Exonerate (evidence: cDNA) → imported to Artemis
    │   │       └── NameGenome_ROI_NameGenes_cDNA_exonerate_FORMATTED_FILTERED.gff
    │   ├── GMAP
    │   │   ├── GMAP_cDNA
    │   │   │   └── filtered  # Annotation GMAP (evidence: cDNA) → imported to Artemis
    │   │   │       └── NameGenome_NameGenes_cDNA_ROI_gmap_FILTERED.gff3
    │   │   └── GMAP_exon
    │   │       └── filtered  # Annotation GMAP (evidence: exon) → imported to Artemis
    │   │           └── NameGenome_NameGenes_exon_ROI_gmap_FILTERED.gff3
    │   └── GeneWise
    │       └── filtered  # Annotation GeneWise (evidence: protein) → imported to Artemis
    │           └── NameGenome_ROI_NameGenes_prot_genewise_FORMATTED_FILTERED.gff
    │
    ├── 7_consensus_EVM  # Result of Step 7
    │   └── 3_filter
    │       └── filter_3  # Annotation consensus (number = overlapsEVM filter) → imported to Artemis
    │           └── NameGenome_consensus_filter_OverlapsEVM.gff3
    │
    ├── output_raw_FASTA_annotations  # Raw AnnCX output annotations converted to FASTA
    │   ├── NameGenome_NameGenes_CDS_all_sequence.fasta  # concatenated CDS (~cDNA)
    │   ├── NameGenome_NameGenes_CDS_sequence.fasta  # each CDS numbered
    │   ├── NameGenome_NameGenes_exon_all_sequence.fasta  # concatenated exons
    │   ├── NameGenome_NameGenes_exon_sequence.fasta  # each exon numbered
    │   ├── NameGenome_NameGenes_gene_sequence.fasta  # gene (exon + intron)
    │   ├── NameGenome_NameGenes_intron_sequence.fasta  # each intron numbered
    │   ├── NameGenome_NameGenes_mRNA_sequence.fasta  # mRNA (~gene)
    │   └── NameGenome_NameGenes_protein_sequence.fasta  # protein (CDS_all→prot)
    │
    ├── errors  # Errors found in the run
    │   ├── format_genomes_input  # Issue with the formatting of the genome names
    │   └── missed_annotations  # Annotation files from an individual tool missing
    │
    └── report
        └── pipeline_report.txt  # Full run report of running AnnCX

---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
- Identify predicted genes: identify_pred2ref feature
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------

This feature produces a heatmap plot (SVG) for the set of genes predicted in an
annotated genome. The plot displays predicted genes (x-axis) against reference genes (y-axis). Columns show alignment results for each predicted gene. Individual cells represent the match quality for each pair-wise alignment and indicate BLAST’s percentage identity as an intuitive metric of the alignment quality. Each column is colored based on a composite match score that combines percentage identity, coverage and bit-score, penalized by gap openings. This score is used to generate a gradient (from lowest to highest match: red→white→blue) for a comprehensive overall assessment of the alignment match. Cells highlighted in black indicate the best composite match score for each predicted gene, showing which reference gene is most similar.

1. Activate the conda environment:

    conda activate AnnCX

2. Run the tool:

usage: identify_pred2ref.py [-h] --subject SUBJECT --query QUERY --typeseq_query TYPESEQ_QUERY
                            --typeseq_subject TYPESEQ_SUBJECT --namegenome NAMEGENOME
                            --outdir OUTDIR

AnnCX feature identify_pred2ref to identify predicted genes

options:
  -h, --help
                        show this help message and exit
  --subject SUBJECT
                        File (FASTA) with reference gene sequence.
                        These sequences correspond to previously published and curated sequences
                        of phylogenetically related genes to those to be annotated.
                        They can be those used as evidence to run the pipeline or others.
                        (e.g. gene, exon_all, CDS_all, cDNA).
                        Example: --subject /path/to/reference/reference.fasta
  --query QUERY
                        File (FASTA) with predicted gene sequences.
                        These sequences are those predicted by the pipeline, either the raw output
                        or the sequences after the process of manual curation in Artemis.
                        The raw predicted sequences can be found in the folder:
                        AnnCX_genenames/output_raw_FASTA_annotations
                        (e.g. gene, exon_all, CDS_all, cDNA).
                        Example: --query /path/to/query/query.fasta
  --typeseq_query TYPESEQ_QUERY
                        Name of the type of query fasta sequences.
                        Used to label the axis of the plot and the output files.
                        (e.g. gene, exon_all, CDS_all, cDNA).
                        Example: --typeseq_query cDNA
  --typeseq_subject TYPESEQ_SUBJECT
                        Name of the type of subject fasta sequences.
                        Used to label the axis of the plot and the output files.
                        (e.g. gene, exon_all, CDS_all, cDNA).
                        Example: --typeseq_subject cDNA
  --namegenome NAMEGENOME
                        Name of the genome in which the genes were predicted.
                        Used as the title for the plot and to label the output files.
                        Example: --namegenome Homo_sapiens
  --outdir OUTDIR
                        Directory to save the output files.
                        AnnCX creates a folder within the output directory specified where the
                        output files will be stored named 'Identify_pred2ref'
                        Example: --outdir /path/to/output
  --gapopen GAPOPEN
                        (OPTIONAL) Number to run --gapopen open_penalty argument in BLASTN (Default = 1)
                        Taken from BLASTN documentation
                        (https://www.ncbi.nlm.nih.gov/books/NBK279684/table/appendices.T.blastn_application_options/)
                            -gapopen <Integer>
                                Cost to open a gap
                        Example: --gapopen 2
  --gapextend GAPEXTEND
                        (OPTIONAL) Number to run --gapextend extend_penalty argument in BLASTN (Default = 1)
                        Taken from BLASTN documentation
                        (https://www.ncbi.nlm.nih.gov/books/NBK279684/table/appendices.T.blastn_application_options/)
                            -gapextend <Integer>
                                Cost to extend a gap
                        Example: --gapextend 2

Example using the example data provided in the repository:

    (AnnCX) user@computer:~/path/to/pipeline/folder$ ./src/identify_pred2ref.py \
        --subject examples/Genome_annotation/cDNA_sequences.fasta \
        --query examples/Genome_annotation/cDNA_sequences.fasta \
        --typeseq_query cDNA \
        --typeseq_subject cDNA \
        --namegenome Macaca_mulatta \
        --outdir /path/to/output/folder

3. Interpretation of output files from running identify_pred2ref:

    user@computer:~/path/to/output/folder/Identify_pred2ref$ tree
    .
    ├── BLASTN_output  # Results from running BLASTN
    │   └── Identify_pred2ref_NameGenome_TypeSeqQueryvsTypeSeqSubject
    └── Heatmaps  # Plot for the genome of study
        └── NameGenome_identify_pred2ref.svg

---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
- Identify rearrangements: identify_rearrangements feature
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------

This feature produces one heatmap plot (SVG) per gene predicted in an annotated genome. The plot displays the exons of one of the predicted genes (x-axis) against exons of reference genes (y-axis). Columns represent the alignment results of each predicted exon and are colored using Exonerate’s percentage of identity to generate a gradient (from lowest to highest identity: red→white→blue). Exons from each reference gene are separated by horizontal lines. Each cell displays the pair-wise Exonerate identity score, and the cells with the highest identity per exon are highlighted with a black square.

1. Activate the conda environment:

    conda activate AnnCX

2. Run the tool:

usage: identify_rearrangements.py [-h] --subject SUBJECT --query QUERY --namegenes NAMEGENES
                                  --namegenome NAMEGENOME --outdir OUTDIR [--threads THREADS]

AnnCX feature identify_rearrangements to identify exon-level rearrangements

options:
  -h, --help
                        show this help message and exit
  --subject SUBJECT
                        File (FASTA) with reference exon sequences.
                        These sequences correspond to previously published and curated sequences
                        of phylogenetically related genes to those to be annotated.
                        They can be those used as evidence to run the pipeline or others.
                        Each exon must be a different entry in the file and labeled as:
                        'exon' '_' 'positional number' (e.g header: > NKG2_exon_1).
                        Example: --subject /path/to/reference/reference.fasta
  --query QUERY
                        File (FASTA) with predicted exon sequences.
                        These sequences are those predicted by the pipeline, either the raw output
                        or the sequences after the process of manual curation in Artemis.
                        Each exon must be a different entry in the file and labeled as:
                        'exon' '_' 'positional number' (e.g header: > NKG2_exon_1).
                        The raw predicted sequences can be found in the folder:
                        AnnCX_genenames/output_raw_FASTA_annotations
                        Example: --query /path/to/query/query.fasta
  --namegenes NAMEGENES
                        File (TXT) with the names of the predicted genes.
                        These names must be contained in the headers of the FASTA file for
                        predicted exon sequences passed as --query.
                        Example: --namegenes /path/to/genes/genes.txt
  --namegenome NAMEGENOME
                        Name of the genome in which the genes were predicted.
                        Used as the title for the plot and to label the output files.
                        Example: --namegenome Homo_sapiens
  --outdir OUTDIR
                        Directory to save the output files.
                        AnnCX creates a folder within the output directory specified where the
                        output files will be stored named 'Identify_rearrangements'
                        Example: --outdir /path/to/output
  --threads THREADS
                        (OPTIONAL) Number of threads that can be used to run this feature.
                        Example on how to calculate the number of threads in your computer:
                            $ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
                            CPU(s):              8   ## your number of threads
                            Thread(s) per core:  2
                            Core(s) per socket:  4
                            Socket(s):           1
                        Example: --threads 8

Example using the example data provided in the repository:

    (AnnCX) user@computer:~/path/to/pipeline/folder$ ./src/identify_rearrangements.py \
        --subject examples/Genome_annotation/exon_sequences.fasta \
        --query examples/Genome_annotation/exon_sequences.fasta \
        --namegenes examples/Identify_rearrangements/gene_names.txt \
        --namegenome Macaca_mulatta \
        --outdir /path/to/output/folder \
        --threads your_threads

3. Interpretation of output files from running identify_rearrangements:

    user@computer:~/path/to/output/folder/Identify_rearrangements$ tree
    .
    ├── Exonerate_output  # Results from running Exonerate
    │   └── Exonerate_identify_artificial_rearrangements_NameGenome
    └── Heatmaps  # Plots for each gene of study
        ├── NameGenome_Gene1_identify_rearrangements.svg
        ├── NameGenome_Gene2_identify_rearrangements.svg
        ├── NameGenome_Gene3_identify_rearrangements.svg
        ├── NameGenome_Gene4_identify_rearrangements.svg
        └── NameGenome_Gene5_identify_rearrangements.svg

---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
- Convert annotation to fasta: annotation2fasta feature
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------

1. Activate the conda environment:

    conda activate AnnCX

2. Run the tool:

*IMPORTANT:* Input annotation files (--annotation) must be in GFF3 format with annotation features: gene, mRNA, exon, CDS

*IMPORTANT:* Input FASTA file with the genomic region (--genome) that was annotated. If the annotations are produced by AnnCX using flanking genes, give the extracted ROI, not the whole genome FASTA (/path/to/output/AnnCX_GeneNames/1_extract_ROI/7_extracted_roi_raw)

usage: annotation2fasta.py [-h] --annotation ANNOTATION --genome GENOME --namegenome NAMEGENOME
                           --nameproject NAMEPROJECT --outdir OUTDIR

AnnCX feature annotate2fasta to convert annotations (GFF3) to genomic sequences (FASTA)

options:
  -h, --help
                        show this help message and exit
  --annotation ANNOTATION
                        Directory where the annotation file (GFF3) to be converted to FASTA is located.
                        This file should contain the features: gene, mRNA, exon, CDS
                        Example: --annotation /path/to/annotation
  --genome GENOME
                        Directory where the genome or genomic region of interest file (FASTA) that
                        was annotated is located. This file must correspond to the FASTA file used
                        to generate the GFF3 annotation file.
                        Example: --genome /path/to/genome
  --namegenome NAMEGENOME
                        Name of the genome or genomic region of interest annotated.
                        This name (e.g. Macaca_mulatta) must be contained in the name of the
                        genome (e.g. Macaca_mulatta_genome.fasta) and annotation
                        (e.g. Macaca_mulatta_annotation.gff3) files.
                        Example: --namegenome Macaca_mulatta
  --nameproject NAMEPROJECT
                        Name of the project. Used to label the output files.
                        Example: --nameproject extract_annotation_NKG2
  --outdir OUTDIR
                        Directory to save the output files.
                        AnnCX creates a folder within the output directory specified where the
                        output files will be stored named 'annotate2fasta'
                        Example: --outdir /path/to/output

Example using the example data provided in the repository:

    (AnnCX) user@computer:~/path/to/pipeline/folder$ ./src/annotation2fasta.py \
        --annotation examples/annotate2fasta \
        --genome examples/genome \
        --namegenome Macaca_mulatta \
        --nameproject extract_annotation_NKG2 \
        --outdir /path/to/output/folder

3. Interpretation of output files from running annotate2fasta:

    user@computer:~/path/to/output/folder/annotate2fasta$ tree
    .
    ├── NameGenome_NameProject_CDS_all_sequence.fasta  # concatenated CDS (~cDNA)
    ├── NameGenome_NameProject_CDS_sequence.fasta  # each CDS numbered
    ├── NameGenome_NameProject_exon_all_sequence.fasta  # concatenated exons
    ├── NameGenome_NameProject_exon_sequence.fasta  # each exon numbered
    ├── NameGenome_NameProject_gene_sequence.fasta  # gene (exon + intron)
    ├── NameGenome_NameProject_intron_sequence.fasta  # each intron numbered
    ├── NameGenome_NameProject_mRNA_sequence.fasta  # mRNA (~gene)
    └── NameGenome_NameProject_protein_sequence.fasta  # protein (CDS_all->prot)

---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
Requirements
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------

- Conda (Miniconda or Anaconda)
- Linux operating system

AnnCX can be executed without the need for high-performance computing resources, as demonstrated by successful testing on a Linux Ubuntu system (v22.04.3 LTS) equipped with an Intel Core i5 CPU, a 500 GB HDD and 6 GB of RAM. Testing included the annotation of genomic regions up to 5 Mb in length.
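
The two requirements listed above (Conda and a Linux operating system) can be checked before installation, together with the CPU count used to choose a value for --threads. The following is a minimal sketch and is not part of AnnCX: the function name preflight_report is hypothetical, and the check relies only on the Python standard library.

```python
import os
import platform
import shutil


def preflight_report():
    """Return a dict describing whether this machine meets the listed requirements.

    Hypothetical helper for illustration only; AnnCX does not ship this function.
    """
    return {
        # AnnCX lists a Linux operating system as a requirement.
        "is_linux": platform.system() == "Linux",
        # Conda (Miniconda or Anaconda) must be reachable on PATH.
        "conda_found": shutil.which("conda") is not None,
        # Logical CPUs seen by Python; a reasonable upper bound for --threads.
        "logical_cpus": os.cpu_count() or 1,
    }


if __name__ == "__main__":
    for name, value in preflight_report().items():
        print(f"{name}: {value}")
```

The logical_cpus value should normally match the CPU(s) line reported by lscpu, so it can be passed directly to --threads.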