0基础学习基因组三代测序组装（4）——初步组装二代数据

经过前面的全基因组特征调查（survey）后，我们发现这是一个复杂基因组，杂合度较高，可以以二代+三代测序技术相结合的策略进行全基因组组装，还可以以Hi-C（高通量染色体捕获技术，High-through chromosome conformation capture）技术进行辅助组装。这里我用华大开发的二代测序组装工具SOAPdenovo，用二代测序数据对进行初步基因组组装。

1 安装SOAPdenovo 2.0

github上这个软件的版本是2.0，网址点击这里

软件下载安装过程非常顺利，如果有报错无法解决的话可以在Issue里向作者反馈。

# 集群如果无法登录github，下载源码包，通过xftp传到集群
tar -zxvf SOAPdenovo2
cd SOAPdenovo2-r242
make

编译之后可以看到有如下几个文件

SOAPdenovo-127mer和SOAPdenovo-63mer是用于组装的两个程序，分别代表支持的最大k-mer为127和63，用法上是完全相同的。

example.config是配置文件，组装之前我们要设置其中的参数内容；README.md是帮助文件，详细记录了各项参数的作用和设置方法。这个后面会讲到。

2 kmergenie计算最佳k值

现在组装基因组的算法主要有三种：De Bruijn graph，Overlap-Layout-Consensus和String Graph。SOAPdenovo软件组装基因组用的是De Bruijn graph算法，简单理解是通过将reads打断成k-mer后，利用k-mer之间的重复部分构建图，得到最优化路径从而拼接contig。要具体了解什么是De Bruijn graph，可以参考这一篇博文。

不同k-mer值构建的De Bruijn graph不一样，会导致组装质量的差异，因此我们需要选择一个最佳的组装k-mer大小（尽管可以用默认值23直接组装，但是效果不一定是最好的）。

kmergenie软件和之前的Jellyfish类似，都可以用于统计k-mer数量，kmergenie最大优点是可以对预设的多个k-mer进行分析，找到最佳的k-mer。点击这里进入Kmergenie官网，下载最新版本的软件。

注意下这个软件安装需要python > 2.7，并且需要安装R和zlib。

tar -zxvf kmergenie-1.7051.tar.gz
cd kmergenie-1.7051
make
python setup.py install --user  # 安装到用户环境中，不报错说明可以使用

vim file.txt    # 将两个fq文件路径写进去，一行一个

/public/home/wlxie/biosoft/kmergenie-1.7051/kmergenie file.txt -o ./kmergenie_res -l 15 -k 65 -s 5 -t 30 --diploid      # 运行kmergenie

-o # 输出文件位置和名称
-l # 设定的最小k值
-k # 设定的最大k值
-s # 最小k值到最大k值，每次增加的间隔（根据需要设定间隔大小）
-t # 运行的线程（CPU核）数
--diploid # 二倍体模式，前面我们已经用jellyfish确认过这是个复杂基因组。默认是单倍体模式

其原理就是设置不同k值进行基因组大小预估，将组装的基因组最大的k值作为最佳k值。

最终会给出kmergenie_res为前缀的一系列报告，生成的.histo文件还可以用来上一篇笔记中的GenomeScope分析，这里我们只需要看总结的html文件。

确定组装的最佳k值为51

3 SOAPdenovo2组装contigs/scaffolds

复制一份example.config配置文件，重命名为run_config，修改部分参数

#maximal read length
#全局配置参数，只要高于这个参数的序列都会被截取到这个长度
max_rd_len=150
#文库配置以[LIB]开头
[LIB]
#average insert size
#文库插入片段的平均长度，在实际设置时，可以参考文库size分布图，取峰值（默认200）
avg_ins=200
#if sequence needs to be reversed
#是否需要将序列反向互补，对于pair-end数据，不需要反向互补，设置为0；对于mate-pair数据，需要反向互补，设置为1
reverse_seq=0
#in which part(s) the reads are used
#1表示只组装contig，2表示只组装scaffold，3表示同时组装contig和scaffold，4表示只补gap
asm_flags=3
#use only first 100 bps of each read
#序列长度阈值，作用和max_rd_len相同，大于该长度的序列会被切除到该长度
rd_len_cutoff=150
#in which order the reads are used while scaffolding
#设置不同文库数据的优先级顺序，取值范围为整数，rank值相同的多个文库，在组装scaffold时，会同时使用。
rank=1
#cutoff of pair number for a reliable connection (at least 3 for short insert size)
#contig或者scaffold之前的最小overlap个数，对于pair-end数据，默认值为3；对于mate-paird数据，默认值为5
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
#比对长度的最小阈值，对于pair-end数据，默认值为32；对于mate-pair数据，默认值为35
map_len=32
#a pair of fastq file, read 1 file should always be followed by read 2 file
#过滤后的双端测序数据文件路径，q为fastq格式，f为fasta格式，b为bam格式
q1=/public/home/wlxie/luobuma/luobuma/sample_1_rawdata/Second-generation_sequencing/20211106-BaiYiHuiNeng01/01.rawFq/00.mergeRawFq/1/clean_data/1_r    aw_1_val_1.fq
q2=/public/home/wlxie/luobuma/luobuma/sample_1_rawdata/Second-generation_sequencing/20211106-BaiYiHuiNeng01/01.rawFq/00.mergeRawFq/1/clean_data/1_r    aw_2_val_2.fq

SOAPdenovo有6个子命令pregraph、sparse_pregraph、contig、map、scaff和all，前5个命令对应5个组装步骤，第一和第二是两种不同构图方式，all命令一次执行所有步骤，用all命令比较省事儿。

SOAPdenovo命令还有一些参数用于调整，参数参考

-s  # 配置文件
-o  # 输出文件的前缀
-K  # 输入的K-mer值大小，默认值23
-p  # 程序运行时设定的线程数，默认值8
-R  # 利用read鉴别短的重复序列，默认值不进行此操作
-d  # 去除频数不大于该值的k-mer，默认值为0
-D  # 去除频数不大于该值的由k-mer连接的边，默认值为1，即该边上每个点的频数都小于等于1时才去除
-M  # 连接contig时合并相似序列的等级，默认值为1，最大值3。
-F  # 利用read对scaffold中的gap进行填补，默认不执行
-u  # 构建scaffold前不屏蔽高覆盖度的contig，这里高频率覆盖度指平均contig覆盖深度的2倍。默认屏蔽
-G  # 估计gap的大小和实际补gap的大小的差异，默认值为50bp。
-L  # 用于构建scaffold的contig的最短长度，默认为：Kmer参数值 ×2
-k  # map步骤中kmer的大小，默认是和K一样的kmer大小
-N  # 基因组大小
-V  # 输出可视化的组装信息

运行组装命令

1	/public/home/wlxie/biosoft/SOAPdenovo2-r242/SOAPdenovo-63mer all -s /public/home/wlxie/biosoft/SOAPdenovo2-r242/run_config -K 51 -R -V -o A_venetum -p 30

程序运行了3个小时，结束后生成了以下文件

4 组装结果解读

组装结果文件其实只有两个，分别以**.contig结尾和.scafseq结尾**。因为我是在集群上运行的，slurm-11168.out是集群的输出日志文件，记录了详细的组装过程和结果。

最终得到935861个contigs，总长度295302126 bp，平均长度315 bp，最长的长度38673 bp，contig N50是532 bp，contig N90是103 bp；scaffold个数77918，总长度192992858 bp，平均长度2476 bp，最长的长度108587 bp，scaffold N50是3385 bp，scaffold N90是130 bp。（可以做一个统计表）

从组装的contig覆盖深度和数量还可以做一个柱状图，理论上来说是和前面k-mer分布图呈现一样的趋势，也就是一个主峰和一个杂峰，两个图相互印证目标基因组是个复杂基因组。

其他的结果文件在github上有解释，我先直接复制过来，以后用到再翻译翻译……

1. Output files from the command "pregraph"
a. *.kmerFreq
      Each row shows the number of Kmers with a frequency equals the row number. Note that those peaks of frequencies which are the integral multiple of 63 are due to the data structure.
b. *.edge
      Each record gives the information of an edge in the pre-graph: length, Kmers on both ends, average kmer coverage, whether it's reverse-complementarily identical and the sequence.
c. *.markOnEdge & *.path
      These two files are for using reads to solve small repeats.
e. *.preArc
      Connections between edges which are established by the read paths.
f. *.vertex
      Kmers at the ends of edges.
g. *.preGraphBasic
      Some basic information about the pre-graph: number of vertex, K value, number of edges, maximum read length etc.
      
2. Output files from the command "contig"
a. *.contig
      Contig information: corresponding edge index, length, kmer coverage, whether it's tip and the sequence. Either a contig or its reverse complementry counterpart is included. Each reverse complementary contig index is indicated in the *.ContigIndex file.
b. *.Arc
      Arcs coming out of each edge and their corresponding coverage by reads
c. *.updated.edge
      Some information for each edge in graph: length, Kmers at both ends, index difference between the reverse-complementary edge and this one.
d. *.ContigIndex
      Each record gives information about each contig in the *.contig: it's edge index, length, the index difference between its reverse-complementary counterpart and itself.
      
3. Output files from the command "map"
a. *.peGrads
      Information for each clone library: insert-size, read index upper bound, rank and pair number cutoff for a reliable link. This file can be revised manually for scaffolding tuning.
b. *.readOnContig
      Reads' locations on contigs. Here contigs are referred by their edge index. Howerver about half of them are not listed in the *.contig file for their reverse-complementary counterparts are included already.
c. *.readInGap
      This file includes reads that could be located in gaps between contigs. This information will be used to close gaps in scaffolds if "-F" is set.
      
4. Output files from the command "scaff"
a. *.newContigIndex
      Contigs are sorted according their length before scaffolding. Their new index are listed in this file.  This is useful if one wants to corresponds contigs in *.contig with those in *.links.
b. *.links
      Links between contigs which are established by read pairs. New index are used.
c. *.scaf_gap
      Contigs in gaps found by contig graph outputted by the contiging procedure. Here new index are used.
d. *.scaf
      Contigs for each scaffold: contig index (concordant to index in *.contig),  approximate start position on scaffold, orientation, contig length, and its links to others contigs.
e. *.gapSeq
      Gap sequences between contigs.
f. *.scafSeq
      Sequences of each scaffolds.
g. *.contigPosInscaff
      Contigs' positions in each scaffold.
h. *.bubbleInScaff
      Contigs that form bubble structures in scaffolds. Every two contigs form a bubble and the contig with higher coverage will be kept in scaffold.
i. *.scafStatistics
      Statistic information of final scaffold and contig.