HiMMe, A Next-Generation Sequencing Quality Assessment and Correction Tool Based on Hidden Markov Models

Abstract

Both deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) play a crucial role in the existence and proper development of all living organisms. In addition, it is through these molecules that genetic information is passed from parent to offspring. It is no surprise that, over the last decades, considerable effort has been put into developing technologies that help us better understand their underlying mechanisms. Much as computers work using only ones and zeros, DNA and RNA need only four different characters to encode all genetic information. Thanks to advances in sequencing technology, it is now possible to sequence these molecules relatively quickly and inexpensively. However, as in any measurement, noise is involved, and it must be addressed if one is to reach conclusions based on this kind of data. The hidden Markov model (HMM) is well suited to this case. Through a Markov chain, the model can capture genetic patterns, while the emission probabilities account for the noise involved in the process. In addition, prior knowledge can be incorporated by training the model to fit, for instance, a given organism or sequencing technology. In this thesis, HMM theory is applied for two purposes: (1) to assess the reliability of sequencing data, and (2) to correct potential errors in the observed sequences. The results show that the HMM is capable of identifying genetic patterns in the sequence and of repairing potential errors, thus improving the reliability of the data before any downstream analysis is performed. For these purposes, HiMMe has been developed and is publicly available at https://github.com/jordiabante/HiMMe
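To make the idea concrete, the following is a minimal sketch of how an HMM can correct a noisy read. It is not HiMMe's implementation. It assumes a first-order Markov chain over the four bases, a single uniform substitution error rate, and a toy transition table (a repeating A, C, G, T pattern) chosen only for illustration. The names make_emission, viterbi and NEXT are hypothetical. The Viterbi algorithm is used here to recover the most probable true sequence given the observed one.

```python
import math

BASES = "ACGT"

def make_emission(error_rate):
    """P(observed | true): correct with prob. 1 - error_rate,
    otherwise uniform over the three other bases (an assumption)."""
    return {t: {o: (1 - error_rate) if o == t else error_rate / 3
                for o in BASES} for t in BASES}

def viterbi(observed, start, trans, emit):
    """Return the most probable hidden (true) sequence for an observed read."""
    # Log space avoids numerical underflow on long reads.
    score = {s: math.log(start[s]) + math.log(emit[s][observed[0]])
             for s in BASES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for o in observed[1:]:
        prev, score, ptr = score, {}, {}
        for s in BASES:
            best = max(BASES, key=lambda p: prev[p] + math.log(trans[p][s]))
            score[s] = (prev[best] + math.log(trans[best][s])
                        + math.log(emit[s][o]))
            ptr[s] = best
        back.append(ptr)
    # Trace back from the best final state.
    state = max(BASES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

# Toy "genetic pattern" (assumption): each base is usually followed by the
# next one in the cycle A -> C -> G -> T -> A.
NEXT = {"A": "C", "C": "G", "G": "T", "T": "A"}
trans = {p: {s: 0.97 if NEXT[p] == s else 0.01 for s in BASES} for p in BASES}
start = {s: 0.25 for s in BASES}
emit = make_emission(0.05)

read = "ACGTACTTACGT"          # seventh base is a sequencing error (G read as T)
corrected = viterbi(read, start, trans, emit)
print(corrected)               # ACGTACGTACGT
```

In this toy setting the transition probabilities play the role of the learned genetic pattern and the emission probabilities play the role of the sequencing noise model. In practice both would be estimated from training data for a given organism or sequencing technology rather than set by hand.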
