A self-error correction algorithm for third-generation sequencing using FM-index

Abstract

Third-generation sequencing technologies are becoming a popular choice for de novo assembly projects because they produce long reads with less sequencing bias and more uniform coverage. These advantages come at the cost of much higher error rates, so error correction is usually performed before assembly. Current error correction methods can be divided into alignment-based and alignment-free approaches. Alignment-based methods are more time-consuming but can correct reads in repetitive and low-coverage regions. Alignment-free methods, on the other hand, are much faster but less sensitive. In this thesis, we develop a novel alignment-free algorithm that reduces the correction problem to a path-searching problem via FM-index extension. To correct reads in low-coverage and repetitive regions, we develop an adaptive seeding algorithm that uses k-mers of multiple sizes. Experimental results indicate that our method is faster than existing alignment-based and alignment-free methods on the E. coli and S. cerevisiae datasets. On the large nematode genome dataset, our method is slower than alignment-based methods but still faster than existing alignment-free methods.
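As a rough illustration of the two ideas named in the abstract, the sketch below builds a toy FM-index, counts k-mer occurrences by backward search, and picks the longest k-mer that is still frequent enough to serve as a seed. It is not the thesis implementation: the function names (`build_fm_index`, `count`, `adaptive_seed`), the candidate k-mer sizes, and the `min_count` threshold are all assumptions made for this example, and the naive suffix sorting is only suitable for very short strings.

```python
def build_fm_index(text):
    """Build a toy FM-index (C table and Occ table) for a short string."""
    text += "$"  # sentinel, sorts before A, C, G, T
    # Naive suffix array; fine for a demo, far too slow for real read sets.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of characters in text that are strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return C, occ, len(bwt)


def count(fm, pattern):
    """Count occurrences of pattern using FM-index backward search."""
    C, occ, n = fm
    lo, hi = 0, n  # half-open suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


def adaptive_seed(fm, read, start, sizes=(5, 4, 3), min_count=2):
    """Hypothetical seeding rule: return the longest k-mer at `start`
    whose frequency in the index is at least `min_count`, else None."""
    for k in sorted(sizes, reverse=True):
        kmer = read[start:start + k]
        if len(kmer) == k and count(fm, kmer) >= min_count:
            return kmer
    return None


fm = build_fm_index("ACGTACGTTACG")
print(count(fm, "ACG"))
print(adaptive_seed(fm, "ACGTAC", 0))
```

Falling back to shorter k-mers trades specificity for sensitivity: a long k-mer is less likely to match by chance in a repeat, while a shorter one is more likely to be found at all when coverage is low.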
