###################################################################################################
######################## AnnCX_gene_annotation_pipeline (v1.0.0) ##################################
###################################################################################################

AnnCX is a gene annotation pipeline designed for the analysis of gene-rich complex genomic regions. 
It outperformed the widely used MAKER pipeline in our benchmarking on complex genomic regions 
(paper). AnnCX automates all intermediate steps to provide comprehensive results with 
minimal user input. Key novel features include:

-    Iterative extraction of target regions from whole-genome sequences to focus on complex 
genomic areas (Step 1: extract ROI = OPTIONAL)
-    Incorporation of a more diverse array of individual annotation tools than other pipelines, 
with an emphasis on the accurate identification of exon-intron boundaries.
-    Detection and annotation of ambiguous nucleotide regions (assembly gaps) to provide a clear 
representation of the assembly problems that commonly affect complex regions.
-    Support for manual curation by implementing an automatic visualization of the raw annotation 
results once the pipeline has completed its execution.
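
The gap annotation listed above can be pictured with a short Python sketch. It is only an 
illustration, not AnnCX's own code: the function name, the 'sketch' source column and the use of 
the 'gap' feature type are assumptions made for this example. It scans a sequence for runs of N 
and reports them as GFF3 lines with 1-based, inclusive coordinates.

```python
import re

def find_n_stretches(seq_id, sequence, min_len=1):
    """Return GFF3 lines for runs of N/n in a sequence (1-based, inclusive)."""
    lines = []
    for match in re.finditer(r"[Nn]+", sequence):
        if match.end() - match.start() < min_len:
            continue
        start = match.start() + 1  # GFF3 coordinates are 1-based
        end = match.end()          # exclusive 0-based end equals the inclusive 1-based end
        # Attribute names and the 'sketch' source are hypothetical, chosen for this example
        attrs = f"ID=gap_{len(lines) + 1};length={end - start + 1}"
        lines.append("\t".join(
            [seq_id, "sketch", "gap", str(start), str(end), ".", ".", ".", attrs]
        ))
    return lines

gff_lines = find_n_stretches("contig1", "ACGTNNNNACGTnnACGT")
print("##gff-version 3")
print("\n".join(gff_lines))
```

In this toy sequence the two stretches are reported at positions 5-8 and 13-14.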


Custom supplementary tools to assess quality of assembly:

(1)    identify_pred2ref: gene identification process
This feature performs a BLAST search between predicted and reference FASTA sequences, followed by 
the automatic generation of heatmaps to display the results of the gene identification process. 
This feature addresses a critical challenge that arises when interpreting annotation results 
predicted by any gene annotation tool or annotation pipeline. The consensus annotation results 
produced by annotation pipelines do not specify which query reference sequence was used to 
generate each predicted gene annotation. This arises because gene annotation tools often make it 
difficult to trace which specific query sequence led to a given gene annotation. In 
fact, in tools that do not produce gene annotation models, such as BLAST, a single locus might be 
annotated based on multiple query sequences. This issue becomes particularly pronounced in genomic 
regions which contain recently duplicated genes, where high sequence similarity further 
complicates the attribution of annotations to their original queries. The identify_pred2ref 
strategy is more robust than phylogenetic trees, which can be distorted by artificial 
rearrangements that obscure evolutionary relationships between the genes. The plots generated by 
this feature also serve to illustrate the spatial organization of genes within the region of 
study, aiding the study of gene order conservation and neighboring genes across species to 
identify orthologs. 

(2)    identify_rearrangements: identify exon-level rearrangements
The identify_rearrangements feature uses Exonerate to conduct an exon-by-exon comparison between 
predicted and reference genes to identify mismatches in the exon sequences. The exon-level 
rearrangements detected by this method can result from either biological processes or assembly 
artifacts, the latter commonly occurring in the assemblies of complex genomic regions when 
sequencing depth is insufficient to resolve the repetitive DNA in these areas (Peel et al., 2022). 
Detecting rearrangements of biological origin can advance our understanding of the genes under 
study, while identifying artificial rearrangements is crucial, as they can significantly affect 
the interpretation of gene content within the annotated region. This feature provides an exon-
level assessment of the quality of the assembly which, combined with AnnCX’s annotation of 
ambiguous nucleotides, can help guide targeted sequencing efforts towards regions with a high 
number of artificial rearrangements, thereby reducing overall costs. Moreover, the number and 
sequence composition of exons found within different genes can serve as an additional indicator of 
potential gene homology.

Other custom supplementary tools

(3)    annotation2fasta: convert annotation GFF3 files into FASTA sequences
This feature, which automatically converts the raw consensus annotation output into FASTA 
sequences as part of the execution of AnnCX, can also be invoked as a stand-alone feature. This 
enables users to generate FASTA sequences from the manually refined annotations or other 
annotation data to facilitate downstream analyses that require sequence data rather than 
annotation coordinates as input, such as phylogenetic analyses. AnnCX’s annotation2fasta allows 
users to generate a broader range of sequence types than other annotation pipelines, which 
typically output only transcript and protein sequences. AnnCX also supplies sequences for genes, 
exon and CDS (both as individual elements and as concatenated sequences per gene), as well as 
intron sequences. Providing intron sequences allows the user to streamline the analysis of these 
regions, which are often overlooked in conventional annotation outputs, yet are increasingly 
recognized for their roles in gene regulation, alternative splicing, and the evolution of gene 
structure (Jo and Choi, 2015).
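
As an illustration of how intron sequences follow from exon coordinates, the Python sketch below 
derives introns as the spaces between consecutive exons of one gene. It is a simplified example 
under stated assumptions (forward strand only, 1-based inclusive coordinates as in GFF3, 
hypothetical function names), not the implementation used by annotation2fasta.

```python
def introns_from_exons(exons):
    """Given (start, end) exon coordinates of one gene, return the intron coordinates."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start - prev_end > 1:  # at least one base lies between the two exons
            introns.append((prev_end + 1, next_start - 1))
    return introns

def extract(sequence, start, end):
    """Slice a 1-based, inclusive interval out of a Python string."""
    return sequence[start - 1:end]

region = "AAAACCCCGGGGTTTTAAAA"
exons = [(1, 4), (9, 12), (17, 20)]
introns = introns_from_exons(exons)

exon_all = "".join(extract(region, s, e) for s, e in exons)  # concatenated exons
intron_seqs = [extract(region, s, e) for s, e in introns]    # each intron separately
print(introns, exon_all, intron_seqs)
```

Genes on the reverse strand would additionally need reverse-complementing, which this sketch 
leaves out.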


---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
Installation
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
    1. Clone this repository:
    
    git clone https://github.com/laurahum/AnnCX_gene_annotation_pipeline.git	# Link to repository
    cd path/to/pipeline/folder							# Directory of the repository folder

    2. Installation
    
    chmod +x install_AnnCX.sh 	# Give rights to the installation script
    bash install_AnnCX.sh 	# Run installation script


---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
Usage for:
---------------------------------------------------------------------------------------------------
	- Genome annotation
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
   1. Activate the conda environment where the pipeline is installed:
   
   conda activate AnnCX
   
   
   2. Run the pipeline:

*IMPORTANT:*
Please ensure that Artemis is closed before running the pipeline. If Artemis is open, the genome 
files will be annotated, but the new annotations may not appear in the Artemis interface.

*Step 1: extract ROI = OPTIONAL - If the user provides flanking regions (--flanking), AnnCX checks 
how many genome files contain both flanking genes and, by default, asks whether to continue with 
ROI extraction and annotation on those genomes. Use --skip-prompt to proceed automatically without 
asking for confirmation.*
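
The logic of Step 1 can be sketched in Python as follows. This is a hedged illustration only: the 
Hit tuple, the function name and the choice to include the flanking genes themselves in the ROI 
are assumptions of this example, not a description of AnnCX internals. A genome qualifies when 
both flanking genes are found on the same assembly unit, and the ROI then spans from the 
outermost start to the outermost end of the two hits.

```python
from collections import namedtuple

# Hypothetical container for the location of one flanking gene hit
Hit = namedtuple("Hit", ["contig", "start", "end"])

def roi_between(hit_a, hit_b):
    """Return (contig, start, end) spanning both hits, or None if no ROI can be defined."""
    if hit_a is None or hit_b is None:
        return None  # one flanking gene or none was found
    if hit_a.contig != hit_b.contig:
        return None  # flanking genes lie on different assembly units
    coords = [hit_a.start, hit_a.end, hit_b.start, hit_b.end]
    return hit_a.contig, min(coords), max(coords)

same_unit = roi_between(Hit("chr12", 1000, 2000), Hit("chr12", 9000, 8000))
split_units = roi_between(Hit("contig_7", 1000, 2000), Hit("contig_9", 300, 900))
print(same_unit, split_units)
```

Taking the minimum and maximum over all four coordinates keeps the result independent of hit 
orientation.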

*EVM is run using the recommended evidence weights described by the authors 
(see: https://doi.org/10.1186/gb-2008-9-1-r7)
If the user needs to adjust these parameters, the weights file is located at: 
../weights/weights_recommended.txt*


usage: AnnCX.py [-h] --genome GENOME --namegenes NAMEGENES --querycDNA QUERYCDNA --queryprot
                QUERYPROT --queryexon QUERYEXON --spsrepeatmasker SPSREPEATMASKER --spsaugustus
                SPSAUGUSTUS --outdir OUTDIR [--flanking FLANKING] [--maxintron MAXINTRON]
                [--threads THREADS] [--overlapsEVM OVERLAPSEVM] [--skip-prompt]

AnnCX gene annotation pipeline

options:
  -h, --help            show this help message and exit
  
  --genome GENOME       
  			Directory where the genomic FASTA files are located. 
  			Every FASTA file in the directory will be processed.
  			No special characters are allowed in the FASTA files.
  			(allowed extensions: FASTA/FA/FNA/FFN/FAA/FRN)
  			
  			Example: --genome /path/to/genomes
  			
  --namegenes NAMEGENES
                        Name of the genes to be annotated. 
                        
                        Example: --namegenes NKG2
                        
  --querycDNA QUERYCDNA
                        FASTA file with entries for cDNA sequences to use as query. 
                        
                        Example: --querycDNA /path/to/query/querycDNA.fasta
                        
  --queryprot QUERYPROT
                        FASTA file with entries for protein sequences to use as query. 
                        
                        Example: --queryprot /path/to/query/queryprot.fasta
                        
  --queryexon QUERYEXON
                        FASTA file with entries for exon sequences to use as query. 
                        Each exon must be a different entry in the file.
                        (e.g header: > NKG2_exon_1).
                        
                        Example: --queryexon /path/to/query/queryexon.fasta
                        
  --spsrepeatmasker SPSREPEATMASKER
                        Name of the phylogenetic group to select a repeat database in
                        RepeatMasker. Select the phylogenetic group that best suits your project. 
                        Taken from RepeatMasker documentation (https://www.repeatmasker.org/):
                            -species <query species>
        			Specify the species or clade of the input sequence. The species 
        			name must be a valid NCBI Taxonomy Database species name and be 
        			contained in the RepeatMasker repeat database. Some examples are:
					-species human
					-species mouse
					-species rattus
					-species "ciona savignyi"
					-species arabidopsis

        			Other commonly used species:
				mammal, carnivore, rodentia, rat, cow, pig, cat, dog, chicken, 
				fugu, danio, "ciona intestinalis" drosophila, anopheles, worm, 
				diatoaea, artiodactyl, arabidopsis, rice, wheat, and maize

                        Example: --spsrepeatmasker primates
                        
  --spsaugustus SPSAUGUSTUS
                        Name of the species to run AUGUSTUS on pre-trained parameters. 
                        Select the species that best suits your project. 
                        Taken from AUGUSTUS documentation (https://bioinf.uni-greifswald.de/augustus/):
                        	AUGUSTUS has currently been trained on species specific training 
                        	sets to predict genes in the following species. Note that for 
                        	closely related species usually only one version is necessary. For 
                        	example, the human version is good for all mammals
                        
                        Example: --spsaugustus human
                        
  --outdir OUTDIR       
  			Directory to save the output files. 
  			AnnCX creates a folder within the output directory specified where the
  			output files will be stored named 'AnnCX_namegenes'
  			
  			Example: --outdir /path/to/output
  			
  --flanking FLANKING   
  			(OPTIONAL) FASTA file with entries for gene sequences to use as flanking
                        genes and extract a genomic region of interest from a larger genome file 
                        to annotate. The sequences can be sequences of gene, cDNA or similar.
                        If this argument is not used the pipeline does not extract a region of
                        interest and starts with the annotation process directly.
                        
                        Example: --flanking /path/to/flanking/geneflanking.fasta
  			
  --maxintron MAXINTRON
                        (OPTIONAL) Sets maximum intron size in basepairs (Default = 7000, based on 
                        available eukaryotic studies (Long and Deutsch, 1999)). 
                        
                        Example: --maxintron 6000
                        
  --threads THREADS     
  			(OPTIONAL) Number of threads that can be used to run some tools. 
  			Example on how to calculate the number of threads in your computer:
  			
  			$ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
				CPU(s):              8   ## your number of threads
				Thread(s) per core:  2
				Core(s) per socket:  4
				Socket(s):           1

  			Example: --threads 8
                        
  --overlapsEVM OVERLAPSEVM
                        (OPTIONAL) Filter for the EVM implementation within AnnCX.
                        Number of overlapping gene annotation tools to support the consensus 
                        annotation results (0-7, 0 = no filter, Default = 4). 
                        
                        Example: --overlapsEVM 2
                        
  --skip-prompt         
			(OPTIONAL) If the argument is used, skips the prompt after finding the
                        flanking genes that asks the user whether to continue with the annotation
                        process. 
                        
                        Example: --skip-prompt
                        
  --version		Version of AnnCX
                        
                        
   
   Example using the example data provided in the repository:
   
   (AnnCX) user@computer:~/path/to/pipeline/folder$ ./AnnCX.py \
   --genome examples/genome \
   --namegenes NKG2 \
   --querycDNA examples/Genome_annotation/cDNA_sequences.fasta \
   --queryprot examples/Genome_annotation/protein_sequences.fasta \
   --queryexon examples/Genome_annotation/exon_sequences.fasta \
   --spsrepeatmasker primates \
   --spsaugustus human \
   --outdir /path/to/output/folder/ \
   --flanking examples/Genome_annotation/flanking_regions.fasta \
   --threads your_threads
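
   For scripted runs, the same call can be assembled in Python. The sketch below is a minimal 
   example, not part of AnnCX: the helper name is hypothetical, the paths are those of the 
   example above, and the thread count falls back to os.cpu_count(), which reports logical CPUs 
   (the 'CPU(s)' value shown by lscpu).

```python
import os
import shlex

def build_anncx_command(outdir, threads=None):
    """Assemble the example AnnCX.py call as an argument list (hypothetical helper)."""
    if threads is None:
        threads = os.cpu_count() or 1  # logical CPUs; 1 if the count is unavailable
    return [
        "./AnnCX.py",
        "--genome", "examples/genome",
        "--namegenes", "NKG2",
        "--querycDNA", "examples/Genome_annotation/cDNA_sequences.fasta",
        "--queryprot", "examples/Genome_annotation/protein_sequences.fasta",
        "--queryexon", "examples/Genome_annotation/exon_sequences.fasta",
        "--spsrepeatmasker", "primates",
        "--spsaugustus", "human",
        "--outdir", outdir,
        "--flanking", "examples/Genome_annotation/flanking_regions.fasta",
        "--threads", str(threads),
    ]

cmd = build_anncx_command("/path/to/output/folder/")
print(shlex.join(cmd))
# To launch it from the repository folder, inside the AnnCX conda environment:
# import subprocess; subprocess.run(cmd, check=True)
```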

   
   Artemis is opened automatically for the visualization of the annotation output produced 
   by AnnCX


   3. Open results in Artemis again 
   
After the pipeline has finished running and you have closed Artemis, you can reopen Artemis 
within the conda environment to visualize the annotation results again:

Example:   (AnnCX) user@computer: art


Consult the Artemis manual for detailed specification of the features 
(https://ftp.sanger.ac.uk/pub/resources/software/artemis/archive/v4/artemis_manual_v4_b1-old/c201.htm)

For best view of the annotations the recommended features are:
- One Line Per Entry
- All Features On Frame Lines
- Zoom to selection

For manual editing, the user is recommended to: 
(1) adjust feature boundaries by dragging annotation boundaries on the 'DNA view' window if the 
change is small, or in Artemis Gene Builder if the change is substantial (click on annotation -> ctrl+E), 
(2) modify CDS boundaries while inspecting the amino acid sequence 
(click on annotation -> View -> aminoacids of selection as Fasta) to prevent introducing erroneous stop codons, 
(3) update gene IDs and names in Artemis Gene Builder upon completion of edits, and 
(4) save all Artemis entries (Files -> Save all entries). 



   4. Interpretation of output files from running AnnCX:

user@computer:~/path/to/output/folder/AnnCX_namegenes$ tree
.
├── 1_extract_ROI	# Result of Step 1: Extract ROI
│   ├── 3_found_flanking_genes	# Genomes in which both flanking genes were found
│   │   ├── txt_output_no.txt	# One flanking gene or none
│   │   └── txt_output_yes.txt	# Both flanking genes
│   │
│   ├── 4_find_single_contig_genomes	# Genomes with both flanking genes
│   │   ├── Non_single_contig_list.txt	# Not in the same assembly unit (chr/contig) → discarded 
│   │   └── Single_contig_list.txt	# In the same assembly unit → continue annotation
│   │
│   └── 7_extracted_roi_raw	# Extracted genomic regions of interest between flanking genes
│       └── NameGenome_NameGenes_AssemblyUnit_ROI.fasta  → imported to Artemis
│   
├── 2_annotate_N_gaps	# Result of Step 2: Annotate gaps in ROI  → imported to Artemis
│   └── NameGenome_annotate_N_stretches.gff3	# Annotation of gaps in assembly
│   
├── 3_repeatmasker	# Result of Step 3: Hard-masking ROI
│   ├── repeat_annotations	# Annotation of repeats  → imported to Artemis
│   │   └── NameGenome_NameGenes_AssemblyUnit_ROI.fasta.out.gff
│   └── roi_hardmasked		# Hard-masked genomic region of interest
│       └── NameGenome_NameGenes_AssemblyUnit_ROI.fasta.masked
│   
├── 4_gene_annotation_tools	# Result of Step 4: Run gene annotation tools
│   ├── AUGUSTUS
│   │   └── raw	  # Annotation AUGUSTUS (with protein profile) → imported to Artemis
│   │       └── NameGenome_ROI_NameGenes_protprof_augustus 
│   ├── BLAST
│   │   ├── BLASTN
│   │   │   └── filtered	# Annotation BLASTN (evidence: cDNA) → imported to Artemis
│   │   │       └── NameGenome_ROI_NameGenes_cDNA_blastn_formatted_FILTERED.gff
│   │   └── TBLASTN
│   │       └── filtered	# Annotation TBLASTN (evidence: protein) → imported to Artemis
│   │           └── NameGenome_ROI_NameGenes_protein_tblastn_formatted_FILTERED.gff
│   ├── Exonerate
│   │   └── filtered	# Annotation Exonerate (evidence: cDNA) → imported to Artemis
│   │       └── NameGenome_ROI_NameGenes_cDNA_exonerate_FORMATTED_FILTERED.gff
│   ├── GMAP
│   │   ├── GMAP_cDNA
│   │   │   └── filtered	# Annotation GMAP (evidence: cDNA) → imported to Artemis
│   │   │       └── NameGenome_NameGenes_cDNA_ROI_gmap_FILTERED.gff3
│   │   └── GMAP_exon
│   │       └── filtered	# Annotation GMAP (evidence: exon) → imported to Artemis
│   │           └── NameGenome_NameGenes_exon_ROI_gmap_FILTERED.gff3
│   └── GeneWise
│       └── filtered	# Annotation GeneWise (evidence: protein) → imported to Artemis
│           └── NameGenome_ROI_NameGenes_prot_genewise_FORMATTED_FILTERED.gff
│   
├── 7_consensus_EVM	# Result of Step 7
│   └── 3_filter
│       └── filter_3   # Annotation consensus (number = overlapsEVM filter) → imported to Artemis
│           └── NameGenome_consensus_filter_OverlapsEVM.gff3
│   
├── output_raw_FASTA_annotations	# Raw AnnCX output annotations converted to FASTA
│   ├── NameGenome_NameGenes_CDS_all_sequence.fasta	# concatenated CDS (~cDNA)
│   ├── NameGenome_NameGenes_CDS_sequence.fasta		# each CDS numbered
│   ├── NameGenome_NameGenes_exon_all_sequence.fasta	# concatenated exons
│   ├── NameGenome_NameGenes_exon_sequence.fasta	# each exon numbered
│   ├── NameGenome_NameGenes_gene_sequence.fasta	# gene (exon + intron)
│   ├── NameGenome_NameGenes_intron_sequence.fasta	# each intron numbered
│   ├── NameGenome_NameGenes_mRNA_sequence.fasta	# mRNA (~gene)
│   └── NameGenome_NameGenes_protein_sequence.fasta	# protein (CDS_all→prot)
│   
├── errors	# Errors found in the run
│   ├── format_genomes_input	# Issue with the formatting of the genome names
│   └── missed_annotations	# Annotation files from an individual tool missing
│   
└── report
    └── pipeline_report.txt	# Full run report of running AnnCX



---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
	- Identify predicted genes:		identify_pred2ref feature
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
This feature produces a heatmap plot (SVG) for the set of genes predicted in an annotated genome. 
The plot displays predicted genes (x-axis) against reference genes (y-axis). Each column shows the 
alignment results for one predicted gene. Each cell corresponds to one pair-wise alignment and 
reports BLAST’s percentage identity as an intuitive metric of alignment quality. Within each 
column, cells are colored by a composite match score that combines percentage identity, coverage 
and bit-score, penalized by gap openings. The score is mapped to a gradient (from lowest to 
highest match: red→white→blue) to give an overall assessment of each alignment. Cells highlighted 
in black mark the best composite match score for each predicted gene, showing which reference 
gene is most similar.
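
The exact formula of the composite match score is not given in this README. The Python sketch 
below is therefore only an illustration of the idea, with an invented weighting: identity, 
coverage and relative bit-score are multiplied, and a fixed penalty is subtracted per gap 
opening. All names, weights and input values are assumptions made for this example.

```python
def composite_score(pident, coverage, bitscore, gapopen, max_bitscore, gap_penalty=1.0):
    """Illustrative composite score (not the AnnCX formula): 0-100 scale minus gap penalty."""
    base = (pident / 100.0) * (coverage / 100.0) * (bitscore / max_bitscore)
    return base * 100.0 - gap_penalty * gapopen

# Hypothetical BLAST results of one predicted gene against two reference genes
hits = {
    "ref_A": dict(pident=99.0, coverage=100.0, bitscore=1800.0, gapopen=0),
    "ref_B": dict(pident=95.0, coverage=90.0, bitscore=1500.0, gapopen=3),
}

max_bits = max(h["bitscore"] for h in hits.values())
scores = {name: composite_score(max_bitscore=max_bits, **h) for name, h in hits.items()}
best = max(scores, key=scores.get)  # the cell that would be highlighted in black
print(scores, best)
```

With these invented numbers ref_A scores 99.0 and ref_B 68.25, so ref_A would be the highlighted 
match for this predicted gene.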


   1. Activate the conda environment:

   conda activate AnnCX


   2. Run the tool:

usage: identify_pred2ref.py [-h] --subject SUBJECT --query QUERY --typeseq_query TYPESEQ_QUERY
                            --typeseq_subject TYPESEQ_SUBJECT --namegenome NAMEGENOME --outdir
                            OUTDIR

AnnCX feature identify_pred2ref to identify predicted genes

options:
  -h, --help            show this help message and exit
  
  --subject SUBJECT     
  			File (FASTA) with reference gene sequence.
  			These sequences correspond to previously published and curated sequences 
  			of phylogenetically related genes to those to be annotated. They can be 
  			those used as evidence to run the pipeline or others.
  			(e.g. gene, exon_all, CDS_all, cDNA). 
  			
                        Example: --subject /path/to/reference/reference.fasta
                        
  --query QUERY         
  			File (FASTA) with predicted gene sequences.
  			These sequences are those predicted by the pipeline, either the raw output 
  			or the sequences after the process of manual curation in Artemis.
  			
  			The raw predicted sequences can be found in the folder:
  			AnnCX_namegenes/output_raw_FASTA_annotations
  			(e.g. gene, exon_all, CDS_all, cDNA). 
  			
                        Example: --query /path/to/query/query.fasta
                        
  --typeseq_query TYPESEQ_QUERY
                        Name of the type of query fasta sequences. 
                        Used to label the axis of the plot and the output files.
  			(e.g. gene, exon_all, CDS_all, cDNA). 
  			
                        Example: --typeseq_query cDNA

  --typeseq_subject TYPESEQ_SUBJECT
                        Name of the type of subject fasta sequences. 
                        Used to label the axis of the plot and the output files.
  			(e.g. gene, exon_all, CDS_all, cDNA). 
  			
                        Example: --typeseq_subject cDNA

  --namegenome NAMEGENOME
                        Name of the genome in which the genes were predicted. 
                        Used as the title for the plot and to label the output files.
                        
                        Example: --namegenome Homo_sapiens
                        
  --outdir OUTDIR       
  			Directory to save the output files. 
  			AnnCX creates a folder within the output directory specified where the
  			output files will be stored named 'Identify_pred2ref'

  			Example: --outdir /path/to/output
  			
  --gapopen GAPOPEN
  			(OPTIONAL) Value passed to the -gapopen (open_penalty)
                        argument in BLASTN (Default = 1)
                        Taken from BLASTN documentation (https://www.ncbi.nlm.nih.gov/books/NBK279684/table/appendices.T.blastn_application_options/)
                         	-gapopen <Integer>
   				Cost to open a gap
  
  			Example: --gapopen 2	
  
  --gapextend GAPEXTEND
  			(OPTIONAL) Value passed to the -gapextend (extend_penalty)
                        argument in BLASTN (Default = 1)
                        Taken from BLASTN documentation (https://www.ncbi.nlm.nih.gov/books/NBK279684/table/appendices.T.blastn_application_options/)
                         	-gapextend <Integer>
                        	Cost to extend a gap
                        
  			Example: --gapextend 2
  			

   
   Example using the example data provided in the repository:
   
   (AnnCX) user@computer:~/path/to/pipeline/folder$ ./src/identify_pred2ref.py \
   --subject examples/Genome_annotation/cDNA_sequences.fasta \
   --query examples/Genome_annotation/cDNA_sequences.fasta \
   --typeseq_query cDNA \
   --typeseq_subject cDNA \
   --namegenome Macaca_mulatta \
   --outdir /path/to/output/folder


   3. Interpretation of output files from running identify_pred2ref:
   
user@computer:~/path/to/output/folder/Identify_pred2ref$ tree
.
├── BLASTN_output  # Results from running BLASTN
│   └── Identify_pred2ref_NameGenome_TypeSeqQueryvsTypeSeqSubject
└── Heatmaps	   # Plot for the genome of study
    └── NameGenome_identify_pred2ref.svg



---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
	- Identify rearrangements:		identify_rearrangements feature
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
This feature produces one heatmap plot (SVG) per gene predicted in an annotated genome.
The plot displays the exons of one of the predicted genes (x-axis) against exons of reference 
genes (y-axis). Columns represent the alignment results of each predicted exon and are colored 
using Exonerate’s percentage of identity to generate a gradient (from lowest to highest identity: 
red→white→blue). Exons from each reference gene are separated by horizontal lines. Each cell 
displays the pair-wise Exonerate identity score, and the cell with the highest identity for each 
predicted exon is highlighted with a black square.
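
One possible way to read such a heatmap programmatically is sketched below in Python. It is an 
assumption-laden example, not the tool's own logic: the identity values and header names are 
invented, and flagging a predicted exon whose best match has a different positional number is 
just one heuristic for spotting candidate rearrangements.

```python
def exon_number(header):
    """Positional number from a header such as 'NKG2_exon_1'."""
    return int(header.rsplit("_", 1)[1])

def best_matches(identity):
    """For each predicted exon, the reference exon with the highest identity."""
    return {pred: max(refs, key=refs.get) for pred, refs in identity.items()}

def position_shifts(best):
    """Predicted exons whose best reference exon sits at a different position."""
    return {pred: ref for pred, ref in best.items()
            if exon_number(pred) != exon_number(ref)}

# Hypothetical identity scores (%) of three predicted exons against one reference gene
identity = {
    "pred1_exon_1": {"refA_exon_1": 98.5, "refA_exon_2": 40.0, "refA_exon_3": 38.0},
    "pred1_exon_2": {"refA_exon_1": 35.0, "refA_exon_2": 97.2, "refA_exon_3": 41.0},
    "pred1_exon_3": {"refA_exon_1": 37.0, "refA_exon_2": 96.0, "refA_exon_3": 60.0},
}

best = best_matches(identity)
shifted = position_shifts(best)
print(best, shifted)
```

Here the third predicted exon matches the second reference exon best, which could point to an 
exon duplication or to an assembly artifact and would merit manual inspection.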


   1. Activate the conda environment:

   conda activate AnnCX


   2. Run the tool:

usage: identify_rearrangements.py [-h] --subject SUBJECT --query QUERY --namegenes NAMEGENES
                                  --namegenome NAMEGENOME --outdir OUTDIR [--threads THREADS]

AnnCX feature identify_rearrangements to identify exon-level rearrangements

options:
  -h, --help            show this help message and exit
  
  --subject SUBJECT     
  			File (FASTA) with reference exon sequences. 
  			These sequences correspond to previously published and curated sequences 
  			of phylogenetically related genes to those to be annotated. They can be 
  			those used as evidence to run the pipeline or others.
  			
                        Each exon must be a different entry in the file and labeled as:
                        'exon' '_' 'positional number' (e.g header: > NKG2_exon_1).
                        
                        Example: --subject /path/to/reference/reference.fasta
                        
  --query QUERY         
  			File (FASTA) with predicted exon sequences.
  			These sequences are those predicted by the pipeline, either the raw output 
  			or the sequences after the process of manual curation in Artemis.
  			
  			Each exon must be a different entry in the file and labeled as:
                        'exon' '_' 'positional number' (e.g header: > NKG2_exon_1).
                        
  			The raw predicted sequences can be found in the folder:
  			AnnCX_namegenes/output_raw_FASTA_annotations
  			
                        Example: --query /path/to/query/query.fasta
                        
  --namegenes NAMEGENES
                        File (TXT) with the names of the predicted genes. 
                        These names must be contained in the headers of the FASTA file for 
                        predicted exon sequences passed as --query.
                        
                        Example: --namegenes /path/to/genes/genes.txt
                        
  --namegenome NAMEGENOME
                        Name of the genome in which the genes were predicted. 
                        Used as the title for the plot and to label the output files.
                        
                        Example: --namegenome Homo_sapiens
                        
  --outdir OUTDIR       
  			Directory to save the output files. 
  			AnnCX creates a folder within the output directory specified where the
  			output files will be stored named 'Identify_rearrangements'  			
  			
  			Example: --outdir /path/to/output
  			
  --threads THREADS     
  			(OPTIONAL) Number of threads that can be used to run this feature. 
  			Example on how to calculate the number of threads in your computer:
  			
  			$ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
				CPU(s):              8   ## your number of threads
				Thread(s) per core:  2
				Core(s) per socket:  4
				Socket(s):           1

  			Example: --threads 8
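
The exon header convention required for --subject and --query above can be checked before 
running the tool. The Python sketch below is a minimal example with a hypothetical function 
name. It assumes that a valid header contains 'exon_' followed by the positional number, and 
returns the headers that do not.

```python
import re

# Assumption: a valid header contains 'exon_<number>', e.g. '> NKG2_exon_1'
EXON_LABEL = re.compile(r"exon_\d+\b")

def invalid_exon_headers(fasta_text):
    """Return FASTA headers that lack an 'exon_<number>' label."""
    bad = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            if not EXON_LABEL.search(header):
                bad.append(header)
    return bad

example = (
    ">NKG2A_exon_1\nACGT\n"
    "> NKG2A_exon_2\nGGCC\n"
    ">NKG2A_first_exon\nTTAA\n"
)
bad_headers = invalid_exon_headers(example)
print(bad_headers)
```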


   Example using the example data provided in the repository:
   
   (AnnCX) user@computer:~/path/to/pipeline/folder$ ./src/identify_rearrangements.py \
   --subject examples/Genome_annotation/exon_sequences.fasta \
   --query examples/Genome_annotation/exon_sequences.fasta \
   --namegenes examples/Identify_rearrangements/gene_names.txt \
   --namegenome Macaca_mulatta \
   --outdir /path/to/output/folder \
   --threads your_threads


   3. Interpretation of output files from running identify_rearrangements:
   
user@computer:~/path/to/output/folder/Identify_rearrangements$ tree
.
├── Exonerate_output	# Results from running Exonerate
│   └── Exonerate_identify_artificial_rearrangements_NameGenome
└── Heatmaps	# Plots for each gene of study
    ├── NameGenome_Gene1_identify_rearrangements.svg
    ├── NameGenome_Gene2_identify_rearrangements.svg
    ├── NameGenome_Gene3_identify_rearrangements.svg
    ├── NameGenome_Gene4_identify_rearrangements.svg
    └── NameGenome_Gene5_identify_rearrangements.svg



---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
	- Convert annotation to fasta:	annotation2fasta feature
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------

   1. Activate the conda environment:
   
   conda activate AnnCX


   2. Run the tool:
   
*IMPORTANT:*
Input annotation files (--annotation) must be in GFF3 format with annotation features: gene, mRNA, 
exon, CDS

*IMPORTANT:*
The input FASTA file (--genome) must contain the genomic region that was annotated. If the 
annotations were produced by AnnCX using flanking genes, provide the extracted ROI, not the whole 
genome FASTA (/path/to/output/AnnCX_namegenes/1_extract_ROI/7_extracted_roi_raw).


usage: annotation2fasta.py [-h] --annotation ANNOTATION --genome GENOME --namegenome NAMEGENOME
                           --nameproject NAMEPROJECT --outdir OUTDIR

AnnCX feature annotate2fasta to convert annotations (GFF3) to genomic sequences (FASTA)

options:
  -h, --help            show this help message and exit
  
  --annotation ANNOTATION
                        Directory where the annotation file (GFF3) to be converted to FASTA is
                        located.
                        
                        This file should contain the features: gene, mRNA, exon, CDS
                        
                        Example: --annotation /path/to/annotation
                        
  --genome GENOME       
                        Directory where the genome or genomic region of interest file (FASTA) that
                        was annotated is located. 
                        
                        This file must correspond to the FASTA file used to generate the GFF3 
                        annotation file.
                        
                        Example: --genome /path/to/genome
                        
  --namegenome NAMEGENOME
                        Name of the genome or genomic region of interest annotated. 
                        This name (e.g. Macaca_mulatta) must be contained in the name of the 
                        genome (e.g. Macaca_mulatta_genome.fasta) and annotation (e.g. 
                        Macaca_mulatta_annotation.gff3) files.
                        
                        Example: --namegenome Macaca_mulatta
                        
  --nameproject NAMEPROJECT
                        Name of the project. 
                        Used to label the output files.
                        
                        Example: --nameproject extract_annotation_NKG2
                        
  --outdir OUTDIR       
  			Directory to save the output files. 
  			AnnCX creates a folder within the output directory specified where the
  			output files will be stored named 'annotate2fasta'  			

  			Example: --outdir /path/to/output
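
The naming rule for --namegenome described above can be verified with a short Python sketch 
before running the tool. This is an illustrative check, not part of AnnCX: the function name is 
hypothetical, and the accepted extensions (those listed for the main pipeline's --genome, and 
'.gff3' for annotations) are assumptions of this example.

```python
import tempfile
from pathlib import Path

# Assumed extensions for this example
GENOME_EXTENSIONS = {".fasta", ".fa", ".fna", ".ffn", ".faa", ".frn"}
ANNOTATION_EXTENSIONS = {".gff3"}

def files_for_genome(directory, namegenome, extensions):
    """Names of files in a directory that contain namegenome and have an accepted extension."""
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and namegenome in path.name and path.suffix.lower() in extensions
    )

# Demonstration on a temporary directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Macaca_mulatta_genome.fasta",
                 "Macaca_mulatta_annotation.gff3",
                 "Homo_sapiens_genome.fasta"):
        (Path(tmp) / name).touch()
    genomes = files_for_genome(tmp, "Macaca_mulatta", GENOME_EXTENSIONS)
    annotations = files_for_genome(tmp, "Macaca_mulatta", ANNOTATION_EXTENSIONS)

print(genomes, annotations)
```

An empty list for either call would indicate that --namegenome does not match the file names in 
that directory.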



   Example using the example data provided in the repository:

   (AnnCX) user@computer:~/path/to/pipeline/folder$ ./src/annotation2fasta.py \
   --annotation examples/annotate2fasta \
   --genome examples/genome \
   --namegenome Macaca_mulatta \
   --nameproject extract_annotation_NKG2 \
   --outdir /path/to/output/folder


   3. Interpretation of output files from running annotation2fasta:
   
user@computer:~/path/to/output/folder/annotate2fasta$ tree
.
  ├── NameGenome_NameProject_CDS_all_sequence.fasta	# concatenated CDS (~cDNA)
  ├── NameGenome_NameProject_CDS_sequence.fasta		# each CDS numbered
  ├── NameGenome_NameProject_exon_all_sequence.fasta	# concatenated exons
  ├── NameGenome_NameProject_exon_sequence.fasta	# each exon numbered
  ├── NameGenome_NameProject_gene_sequence.fasta	# gene (exon + intron)
  ├── NameGenome_NameProject_intron_sequence.fasta	# each intron numbered
  ├── NameGenome_NameProject_mRNA_sequence.fasta	# mRNA (~gene)
  └── NameGenome_NameProject_protein_sequence.fasta	# protein (CDS_all->prot)




---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
Requirements
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------
- Conda (Miniconda or Anaconda)
- Linux operating system


AnnCX can be executed without the need for high-performance computing resources, as demonstrated 
by successful testing on a Linux Ubuntu system (v22.04.3 LTS) equipped with an Intel Core i5 CPU, 
500 GB HDD and 6 GB RAM. Testing included the annotation of genomic regions up to 5 Mb in length.