Reference genomes
Many nf-core pipelines use reference genomes for alignment, annotation, and similar tasks. This page describes available approaches for managing reference genomes.
There are three main ways to use reference genomes with nf-core pipelines:
- Local copies of genomes: user downloaded and self-managed
- AWS iGenomes: Illumina-hosted pre-built reference genomes and indices
- Refgenie: programmatic genome asset management tool
Local copies of genomes
Most genomics nf-core pipelines can start from just a FASTA and GTF file and create downstream reference assets (genome indices, interval files, etc.) as part of pipeline execution.
Using GRCh38 as an example:
- Download the latest files:

      #!/bin/bash
      VERSION=108
      wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
      wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/homo_sapiens/Homo_sapiens.GRCh38.$VERSION.gtf.gz

- Run pipeline with --save_reference to generate indices:

      nextflow run \
          nf-core/rnaseq \
          --input samplesheet.csv \
          --fasta Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz \
          --gtf Homo_sapiens.GRCh38.108.gtf.gz \
          --save_reference

  Note: The pipeline will generate and save reference assets. For example, the STAR index will be stored in <results-dir>/genome/index/star.

- Move generated assets to a central, persistent storage location for re-use in future runs.

- Use pre-generated indices in future runs:

      nextflow run \
          nf-core/rnaseq \
          --input samplesheet.csv \
          --fasta Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz \
          --gtf Homo_sapiens.GRCh38.108.gtf.gz \
          --star_index </path/to/moved/star/directory/> \
          --gene_bed </path/to/moved/genes.bed>
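The step of moving generated assets to persistent storage could be sketched as a small shell helper. This is only an illustration: the function name stash_reference_assets and both directory arguments are hypothetical, and the sketch assumes the saved assets sit under <results-dir>/genome as in the STAR index example above.

```shell
# Copy the assets saved by --save_reference into a shared reference store.
# Both arguments are hypothetical paths chosen for illustration:
#   $1: pipeline results directory (expected to contain genome/)
#   $2: destination directory on persistent storage
stash_reference_assets() {
    local results_dir="$1"
    local ref_store="$2"
    local src="${results_dir}/genome"

    # Refuse to continue if the pipeline did not write a genome/ directory.
    if [ ! -d "$src" ]; then
        echo "No genome directory found at ${src}" >&2
        return 1
    fi

    mkdir -p "$ref_store"
    # The trailing "/." copies the contents of genome/ rather than the
    # directory itself, so index/star ends up directly under the store.
    cp -R "${src}/." "${ref_store}/"
}
```

After a call such as stash_reference_assets results /data/references/GRCh38 (an example path), later runs could pass the copied index/star directory under that store to --star_index.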
AWS iGenomes
AWS iGenomes is Illumina’s centralized resource that organizes commonly used reference genome and pre-built index files in a consistent structure for multiple genomes. It provides the following benefits:
- Hosted on AWS S3 through the Registry of Open Data
- Free to access and download
- Maintained by nf-core as a copy of the original Illumina resource
See the AWS iGenomes documentation for more information.
Transcriptome and GTF files in iGenomes are significantly outdated. For example, human annotations are from Ensembl release 75, while the current release is 108 or later. Consider using custom genomes for current annotations.
GRCh38 in iGenomes comes from NCBI rather than Ensembl, so it is not the masked Ensembl assembly. This can cause pipeline issues in some cases. See nf-core/rnaseq issue #460 for details. For GRCh38 with the masked Ensembl assembly, use Custom genomes.
Use remote AWS iGenomes
To use remote AWS iGenomes in supported nf-core pipelines, supply the --genome flag to your pipeline (e.g., --genome GRCh37).
On execution the pipeline will then:
- Automatically download required reference files.
- Auto-populate reference genome parameters from conf/igenomes.config.
  - Parameters like FASTA, GTF, and index paths are set automatically.
- Download only what it requires for that specific workflow.
Downloading reference genome files takes time and bandwidth. We recommend using a local copy when possible. See Use local AWS iGenomes for more information.
Use local AWS iGenomes
To use local AWS iGenomes:
- Download the iGenomes reference files you need to a local directory.

- Set --igenomes_base to your local iGenomes directory path.

  Warning: This directory structure must reflect the structure defined in conf/igenomes.config.

- The pipeline will then use local files instead of downloading from AWS.
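One way to keep the local directory structure aligned with the bucket is to reuse the same relative path on both sides of the download. The helper below is a minimal sketch, not an official nf-core tool: the function names and arguments are hypothetical, the bucket prefix is taken from the README example on this page, and it assumes the AWS CLI is installed. Depending on your AWS CLI configuration you may also need to pass a region.

```shell
# Bucket prefix as it appears in the README example on this page.
IGENOMES_S3="s3://ngi-igenomes/igenomes"

# Build the relative path species/source/build,
# for example Homo_sapiens/Ensembl/GRCh37.
igenomes_rel_path() {
    printf '%s/%s/%s' "$1" "$2" "$3"
}

# Download one genome build into a local base directory, mirroring the
# bucket layout. Arguments (illustrative): base dir, species, source, build.
igenomes_sync() {
    local base="${1%/}"   # drop one trailing slash, if present
    local rel
    rel="$(igenomes_rel_path "$2" "$3" "$4")"

    mkdir -p "${base}/${rel}"
    # --no-sign-request allows anonymous access to the public bucket.
    aws s3 sync --no-sign-request "${IGENOMES_S3}/${rel}/" "${base}/${rel}/"
}
```

For example, igenomes_sync /data/igenomes Homo_sapiens Ensembl GRCh37 (an example path) would populate /data/igenomes/Homo_sapiens/Ensembl/GRCh37/, after which the pipeline could be run with --genome GRCh37 --igenomes_base /data/igenomes. A full build can be large, so it may be worth syncing only the subdirectories a given pipeline needs.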
Check annotation versions
To check the version of annotations used by AWS iGenomes:
- Download the README file from the iGenomes S3 bucket to the current directory using AWS CLI:

      aws s3 cp --no-sign-request s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Annotation/README.txt .

- View the README to see annotation details:

      cat README.txt

  Example output:

      The contents of the annotation directories were downloaded from Ensembl on: July 17, 2015.
      Gene annotation files were downloaded from Ensembl release 75.
      SmallRNA annotation files were downloaded from miRBase release 21.

  This confirms the annotations are from Ensembl release 75 (downloaded in July 2015), which is significantly outdated.
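If you need to check several genomes, the release number can be pulled out of the README text automatically. The following is a rough sketch that assumes the README uses the phrase "Ensembl release N" as in the example output; the function name ensembl_release is hypothetical.

```shell
# Print the first Ensembl release number mentioned in an iGenomes README.
# Reads the file given as $1, or standard input when no file is given.
ensembl_release() {
    # -n suppresses default output; the trailing p prints only lines where
    # the substitution matched, reduced to the captured release number.
    sed -n 's/.*Ensembl release \([0-9][0-9]*\).*/\1/p' "$@" | head -n 1
}
```

Running ensembl_release README.txt on the README shown in the example output prints 75. A README with different wording would produce no output, so an empty result should be treated as "unknown" rather than as an error in the annotation.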