How to build reference genome?
Preparing the files required to build a reference genome
When using Seeksoultools rna or fast module, the species' reference genome sequence and the corresponding gtf annotation file are required. Please try to use reference genomes from Ensembl and UCSC. If NCBI reference genomes are used, the gene-peak linkages association relationship cannot be calculated. The relevant file formats are as follows:
Genome sequence file
The genome sequence files requires a fa format file. The chromosome id must be consistent with the seqname of the gtf file, and the gtf seqname must to be a subset of the chromosome id of the fa file. Note that the file cannot have blank lines.
gtf file
The gtf file format is described as follows:
seqname: sequence name, usually chromosome or contig ID
source: the source of the annotation, which can be the name of a database, such as from the RefSeq database, or the name of a software, such as predicted by GeneScan software. Of course, it can also be empty and filled with a dot.
Feature: represents the feature type corresponding to the interval. In GTF, common types include: gene, transcript, CDS, exon, start_codon, stop_codon, etc.
start: the starting position of the feature
end: the end position of the feature
score: indicates the confidence of feature existence and coordinates, which can be a floating point number or an integer. “.” means it is empty, which means it is not needed.
Strand: The feature is located on the positive strand (+) or negative strand (-) of the reference genome
frame: 0 means the first complete codon of the reading frame is located at the 5’ end, 1 means there is an extra base before the first complete codon, and 2 means there are two extra bases before the first complete codon. Note that frame is not the remainder of the CDS length divided by 3. If the strand is ‘-’, the first base value of the region is ‘end’, because the corresponding coding region will be from end to start on the antisense strand.
attribute: The format should be attributes_name ‘attributes_values’; each attribute must end with a semicolon and be separated from the next attribute by a space, and the attribute value is surrounded by double quotes. Generally, the following three attributes are required
attribute |
means |
---|---|
gene_id “value”; |
The unique ID of the locus of the transcript on the genome. gene_id and value are separated by a space. If the value is empty, it means there is no corresponding gene. |
transcript_id “value”; |
The unique ID of the predicted transcript. transcript_id and value are separated by a space, and an empty value indicates no transcript. |
gene_type “value”; |
Biological type of gene, protein coding, lncRNA, etc. |
Notes on gtf file preparation:
The feature column of each gene in the gtf file must contain three pieces of information: gene, transcript, and exon;
When the feature column is ‘gene’, the attributes column needs to contain gene_id, gene_type, and if there is no gene_name, the value of gene_id is used as genename; when the feature column is ‘transcript’, the attributes column needs to contain transcript_id;When the feature column is ‘exon’, the attributes column needs to include ‘exon_id’; otherwise, it will affect the processing when reads are annotated to multiple genes.
GTF files should also not have blank lines.
In the GTF file, the gene_name of mitochondrial genes needs to start with “Mt-” or “mt-”; otherwise, the mito section in the report will all be 0.
Scenario 1: Building a reference genome that is compatible with single-cell data from different platforms.
If you have both single-cell data from 10X Genomics and SeekArc products, it is recommended to use 10X Cellranger-arc to build the reference genome. SeekArcTools is compatible with the reference genome built by Cellranger-arc.
Please configure the config.json file in the following format:
{
"organism": "GRCh38",
"genome": ["GRCh38"],
"input_fasta": ["/path/to/GRCh38.fa"],
"input_gtf": ["/path/to/Homo_sapiens.GRCh38.ensembl.filtered.gtf"]
}
The code for processing gene annotation files (GTF files) is as follows:
/path/to/cellranger mkgtf Homo_sapiens.GRCh38.ensembl.gtf Homo_sapiens.GRCh38.ensembl.filtered.gtf \
--attribute=gene_biotype:protein_coding \
--attribute=gene_biotype:lncRNA \
--attribute=gene_biotype:antisense \
--attribute=gene_biotype:IG_LV_gene \
--attribute=gene_biotype:IG_V_gene \
--attribute=gene_biotype:IG_V_pseudogene \
--attribute=gene_biotype:IG_D_gene \
--attribute=gene_biotype:IG_J_gene \
--attribute=gene_biotype:IG_J_pseudogene \
--attribute=gene_biotype:IG_C_gene \
--attribute=gene_biotype:IG_C_pseudogene \
--attribute=gene_biotype:TR_V_gene \
--attribute=gene_biotype:TR_V_pseudogene \
--attribute=gene_biotype:TR_D_gene \
--attribute=gene_biotype:TR_J_gene \
--attribute=gene_biotype:TR_J_pseudogene \
--attribute=gene_biotype:TR_C_gene
cellranger-arc mkref --config=config.json
cd GRCh38/genes
gunzip -dc genes.gtf.gz > genes.gtf
Note
If the reference genome built by Cellranger-arc is not compatible with the STAR version of SeekArcTools, you can specify the STAR path of CellRanger-arc for SeekArcTools with
--star_path /path/to/cellranger-arc-2.0.2/lib/bin/STAR
.The chromosome names in fasta files must match the chromosome names in the gtf file. For example, if the name of chromosome 1 in fasta files is
chr1
, then the name of chromosome 1 in the gtf file must also bechr1
.
Scenario 2: if you only have SeekArc products, there is no need to consider platform compatibility.
The code for building genome index using STAR is as follows:
/demo/seekarctools/bin/STAR \
--runMode genomeGenerate \
--runThreadN 16 \
--genomeDir /path/to/star \
--genomeFastaFiles /path/to/genome.fa \
--sjdbGTFfile /path/to/genome.gtf \
--sjdbOverhang 149 \
--limitGenomeGenerateRAM 17179869184
cd /path/to/fasta
bwa index genome.fa
Note
The chromosome names in fasta files must match the chromosome names in the gtf file. For example, if the name of chromosome 1 in fasta files is
chr1
, then the name of chromosome 1 in the gtf file must also bechr1
.