如何构建参考基因组？

准备构建基因组所需的文件

使用SeekArctools软件进行多组学分析时，需要准备物种的参考基因组序列和对应的gtf注释文件。请尽量使用Ensembl和UCSC来源的参考基因组，如使用NCBI的参考基因组，将无法计算基因与 Peak 的连接关系。请保持参考基因组的染色体数目不超过65536条。相关文件格式说明如下：

基因组序列

基因组序列要求fa格式文件，染色体id要与gtf文件第一列seqname一致，且gtf的seqname需要是fa文件的chrom id的子集。注意文件不可以有空行。

gtf文件

gtf文件格式说明如下：

seqname:序列名称，通常为染色体或Contig ID
source:注释来源,可以是数据库的名称，比如来自RefSeq数据库，也可以是软件的名称，比如用GeneScan软件预测得到，当然，也可以为空，用.点号填充
feature: 代表区间对应的特征类型, 在GTF中，常见类型：gene、transcript、CDS、exon、start_codon、stop_codon等
start: feature 的起始位置
end: feature 的终止位置
score: 表示feature存在和坐标的置信度，可以是一个浮点数或整数，"."表示为空，就是不需要
strand: 该feature 位于参考基因组的正链(+) 或者负链(-)
frame: 0表示阅读框的第一个完整密码子位于最5'端，1表示第一个完整密码子之前有一个额外的碱基，2表示第一个完整密码子之前有两个额外的碱基。注意frame 不是CDS长度除以3的余数，如果链是'-',则该区域的第一个碱基值为'end',因为对应的编码区将在反义链从end 到start
attribute: 应具有的格式是 attributes_name "attributes_values"; 每个属性必须以分号结尾并且与下一个属性之间以空格分隔，并且属性的值用双引号包围。包含以下三种属性：

attribute	含义
gene_id "value";	表示转录本在基因组上的基因座的唯一ID。 gene_id与value值用空格分开，如果值为空，则表示没有对应的基因。
transcript_id "value";	预测转录本的唯一ID。transcipt_id 与value值用空格分开，空表示没有转录本
gene_type "value";	gene的生物学类型，protein coding；lncRNA…

gtf文件准备的注意事项：

gtf文件中每个基因的feature列须包含gene、transcript、exon三个信息；
当feature列为'gene'时，attributes列需要包含gene_id，gene_type，如果没有gene_name，则将gene_id的值当作genename;当feature列为' transcript'时，attributes列需要包含transcript_id；当feature列为'exon'时，attributes列需要包含exon_id,否则会影响read注释到多个基因时的处理。
gtf文件同样不要有空行。
gtf文件中线粒体基因的gene_name需要以“Mt-”或者“mt-”开头，否则在报告中mito部分均为0.

场景一：构建兼容不同平台单细胞数据的参考基因组

如果您既有10X Genomics单细胞数据，又有SeekArc 产品的单细胞数据，推荐使用10X Cellranger-arc来构建参考基因组，SeekArcTools可以兼容Cellranger-arc构建的参考基因组。

请按以下格式配置 config.json 文件：

{
    "organism": "GRCh38",
    "genome": ["GRCh38"],
    "input_fasta": ["/path/to/GRCh38.fa"],
    "input_gtf": ["/path/to/Homo_sapiens.GRCh38.ensembl.filtered.gtf"]
}

处理基因注释文件（GTF 文件）的代码如下：

/path/to/cellranger mkgtf Homo_sapiens.GRCh38.ensembl.gtf Homo_sapiens.GRCh38.ensembl.filtered.gtf \
    --attribute=gene_biotype:protein_coding \
    --attribute=gene_biotype:lncRNA \
    --attribute=gene_biotype:antisense \
    --attribute=gene_biotype:IG_LV_gene \
    --attribute=gene_biotype:IG_V_gene \
    --attribute=gene_biotype:IG_V_pseudogene \
    --attribute=gene_biotype:IG_D_gene \
    --attribute=gene_biotype:IG_J_gene \
    --attribute=gene_biotype:IG_J_pseudogene \
    --attribute=gene_biotype:IG_C_gene \
    --attribute=gene_biotype:IG_C_pseudogene \
    --attribute=gene_biotype:TR_V_gene \
    --attribute=gene_biotype:TR_V_pseudogene \
    --attribute=gene_biotype:TR_D_gene \
    --attribute=gene_biotype:TR_J_gene \
    --attribute=gene_biotype:TR_J_pseudogene \
    --attribute=gene_biotype:TR_C_gene
cellranger-arc mkref --config=config.json
cd GRCh38/genes
gunzip -dc genes.gtf.gz > genes.gtf

注：

当Cellranger-arc构建的参考基因组与SeekArcTools的STAR版本不兼容时，可以将cellranger-arc的STAR路径指定给SeekArcTools，例如：--star_path /path/to/cellranger-arc-2.0.2/lib/bin/STAR。
fasta文件中染色体名称必须与gtf文件中的染色体名称相匹配。例如，fasta的1号染色体名称为chr1，那么gtf文件中1号染色体名称也必须为chr1。

场景二：仅有SeekArc产品，不需要考虑平台兼容性

构建STAR索引的代码如下：

/demo/seekarctools/bin/STAR \
  --runMode genomeGenerate \
  --runThreadN 16 \                        
  --genomeDir /path/to/star \             
  --genomeFastaFiles /path/to/genome.fa \  
  --sjdbGTFfile /path/to/genome.gtf \      
  --sjdbOverhang 149 \                     
  --limitGenomeGenerateRAM 17179869184   
cd /path/to/fasta
bwa index genome.fa

注：

fasta文件中染色体名称必须与gtf文件中的染色体名称相匹配。例如，fasta的1号染色体名称为chr1，那么gtf文件中1号染色体名称也必须为chr1。