Whole-genome sequence analysis for pathogen detection and diagnostics
This dissertation focuses on computational methods for improving the accuracy of commonly used nucleic acid tests for pathogen detection and diagnostics. Three specific biomolecular techniques are addressed: polymerase chain reaction, microarray comparative genomic hybridization, and whole-genome sequencing. These methods are potentially the future of diagnostics, but each requires sophisticated computational design or analysis to operate effectively. This dissertation presents novel computational methods that unlock the potential of these diagnostics by efficiently analyzing whole-genome DNA sequences. Improvements in the accuracy and resolution of each of these diagnostic tests promise more effective diagnosis of illness and rapid detection of pathogens in the environment.
For designing real-time detection assays, an efficient data structure and search algorithm are presented to identify the most distinguishing sequences of a pathogen that are absent from all other sequenced genomes. Results are presented that show these "signature" sequences can be used to detect pathogens in complex samples and differentiate them from their non-pathogenic, phylogenetic near neighbors. For microarrays, novel pan-genomic design and analysis methods are presented for the characterization of unknown microbial isolates. To demonstrate the effectiveness of these methods, pan-genomic arrays are applied to the study of multiple strains of the foodborne pathogen Listeria monocytogenes, revealing new insights into the diversity and evolution of the species. Finally, multiple methods are presented for the validation of whole-genome sequence assemblies, which are capable of identifying assembly errors in even finished genomes. These validated assemblies provide the ultimate nucleic acid diagnostic, revealing the entire sequence of a genome.
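The abstract does not describe the data structure itself, but the core idea, k-length substrings present in the target genome and absent from every background genome, can be sketched with plain set operations (the function names and the choice of k are illustrative, not the dissertation's actual algorithm):

```python
def kmers(seq, k):
    """Yield every k-length substring (k-mer) of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def signatures(target, backgrounds, k=20):
    """Return k-mers present in the target genome but absent from
    all background genomes: candidate "signature" sequences."""
    target_kmers = set(kmers(target, k))
    for bg in backgrounds:
        target_kmers -= set(kmers(bg, k))
    return target_kmers
```

A production tool would index the background genomes once (e.g., in a compressed suffix structure) rather than re-scanning them per query, but the set-difference view captures what a signature is.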
A bioinformatic filter for improved base-call accuracy and polymorphism detection using the Affymetrix GeneChip® whole-genome resequencing platform
DNA resequencing arrays enable rapid acquisition of high-quality sequence data. This technology represents a promising platform for rapid high-resolution genotyping of microorganisms. Traditional array-based resequencing methods have relied on the use of specific PCR-amplified fragments from the query samples as hybridization targets. While this specificity in the target DNA population reduces the potential for artifacts caused by cross-hybridization, the subsampling of the query genome limits the sequence coverage that can be obtained and therefore reduces the technique's resolution as a genotyping method. We have developed and validated an Affymetrix Inc. GeneChip® array-based, whole-genome resequencing platform for Francisella tularensis, the causative agent of tularemia. We developed a set of bioinformatic filters that target systematic base-calling errors caused by cross-hybridization between the whole-genome sample and the array probes, and by deletions in the sample DNA relative to the chip reference sequence. Our approach eliminated 91% of the false-positive single-nucleotide polymorphism calls identified in the SCHU S4 query sample, at the cost of 10.7% of the true positives, yielding a total base-calling accuracy of 99.992%.
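As an illustration of this kind of filter (the published rules and thresholds are not reproduced here), one simple heuristic discards SNP calls that fall in dense clusters, since clustered calls on a resequencing array often reflect deletions or cross-hybridization rather than independent true polymorphisms. The window and count parameters below are invented for the example:

```python
def filter_clustered_calls(snp_positions, window=10, max_neighbors=1):
    """Keep only SNP calls with at most `max_neighbors` other calls
    within `window` bases; dense clusters are treated as likely
    artifacts (illustrative rule, not the published filter)."""
    snp_positions = sorted(snp_positions)
    kept = []
    for pos in snp_positions:
        neighbors = sum(1 for q in snp_positions
                        if q != pos and abs(q - pos) <= window)
        if neighbors <= max_neighbors:
            kept.append(pos)
    return kept
```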
Acute Myeloid Leukemia
Acute myeloid leukemia (AML) is the most common type of leukemia. The Cancer Genome Atlas Research Network has demonstrated the increasing genomic complexity of acute myeloid leukemia (AML). In addition, the network has facilitated our understanding of the molecular events leading to this deadly form of malignancy, for which the prognosis has not improved over past decades. AML is a highly heterogeneous disease, and cytogenetics and molecular analysis of the various chromosome aberrations, including deletions, duplications, aneuploidy, balanced reciprocal translocations, and fusions of transcription factor genes and tyrosine kinases, have led to better understanding and identification of subgroups of AML with different prognoses. Furthermore, molecular classification based on mRNA expression profiling has facilitated identification of novel subclasses and defined high- and poor-risk AML based on specific molecular signatures. However, despite increased understanding of AML genetics, the outcome for AML patients, whose number is likely to rise as the population ages, has not changed significantly. Until it does, further investigation of the genomic complexity of the disease and advances in drug development are needed. In this review, leading AML clinicians and research investigators provide an up-to-date understanding of the molecular biology of the disease, addressing advances in diagnosis, classification, prognostication, and therapeutic strategies that may have significant promise and impact on overall patient survival.
THE EFFECT OF STRUCTURE IN SHORT REGIONS OF DNA ON MEASUREMENTS ON SHORT OLIGONUCLEOTIDE MICROARRAY AND ION TORRENT PGM SEQUENCING PLATFORMS
Single-stranded DNA in solution has been studied by biophysicists for many years, as complex structures, both stable and dynamic, form under normal experimental conditions. Stable intra-strand formations affect enzymatic technical processes such as PCR and biological processes such as gene regulation. In the research described here we examined the effect of such structures on two high-throughput genomic assay platforms and whether we could predict the influence of those effects to improve the interpretation of genomic sequencing results.
Helical structures in DNA can be composed of interactions across strands or within a strand. Exclusion of the aqueous solvent provides an entropic advantage to more compact structures. Our first experiments tested whether internal helical regions in one of the two binding partners in a microarray experiment would influence the stability of the complex. Our results are novel and show, from molecular simulations and hybridization experiments, that stable secondary structures on the boundary, when not impinging on the ability of targets to access the probes, stabilize the probe-target hybridization.
High-throughput sequencing (HTS) platforms use short single-stranded DNA fragments as templates. We tested the influence of template secondary structure on the fidelity of reads generated using the Ion Torrent PGM platform. For targets whose hairpin structures are quite long (~20 bp), a high level of mis-calling clearly occurs, particularly of deletions, and some of these deletions are 20-30 bases long. These deletions are not associated with homopolymers, which are known to cause base mis-calls on the PGM, and the effect of structure on the sequencing reaction itself, rather than on the PCR preparative steps, has not been previously published.
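The hairpin-forming potential discussed above can be approximated without a thermodynamic model by searching a fragment for inverted repeats: a subsequence whose reverse complement occurs downstream, separated by enough bases to form a loop. The sketch below is a crude proxy of this kind (perfect stems only, no energy calculation), not the structure prediction actually used in the study:

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def longest_hairpin_stem(seq, min_loop=3):
    """Length of the longest perfect stem formed by an inverted
    repeat separated by at least `min_loop` unpaired bases."""
    n = len(seq)
    for stem in range(n // 2, 0, -1):          # try long stems first
        for i in range(n - 2 * stem - min_loop + 1):
            left = seq[i:i + stem]
            # right arm must start after the left arm plus the loop
            if revcomp(left) in seq[i + stem + min_loop:]:
                return stem
    return 0
```

On the sequencing data, fragments scoring a long stem (around 20 bp) would be the ones flagged as at risk of deletion mis-calls.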
As HTS technologies bring the cost of sequencing whole genomes down, a number of unexpected observations have arisen. An example that caught our attention is the prevalence of far more short deletions than had been detected using Sanger methods. The prevalence is particularly high in the Korean genome. Since we showed that helical structures could disrupt the fidelity of base calls on the Ion Torrent, we looked at the context of the apparent deletions to determine whether any sequence or structure pattern discriminated them. Starting with the genome provided by Kim et al. (1), we selected deletions > 2 bases long from chromosome 1 of a Korean genome. We created 70-nucleotide fragments centered on the deletion. We simulated the secondary structures using OMP software and then modeled using the Random Forest algorithm in the WEKA modeling package to characterize the relations between the deletions and secondary structures in or around them. After training the model on chromosome 1 deletions, we tested it using chromosome 20 deletions. We show that sequence information alone is not able to predict whether a deletion will occur, while the addition of structural information improves the prediction rates. Classification rates are not yet high: additional data and a more precise structural description are likely needed to train a robust model. We are unable to state which of the structures affect in vitro platforms and which occur in vivo. A comparative genomics approach using 38 genomes recently made available for the CAMDA 2013 competition should provide the necessary information to train separate models if the important features are different in the two cases.
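The modeling step above hinges on combining sequence-only features with structure-derived ones for each candidate deletion. A minimal sketch of such a feature extractor is shown below; the feature names are illustrative (the study derived its structural features from OMP predictions and trained a Random Forest in WEKA, neither of which is reproduced here):

```python
import re

def deletion_features(flank, delta_g, in_stem):
    """Feature vector for a candidate deletion site: GC content and
    longest homopolymer run of the flanking sequence (sequence-only),
    plus a predicted folding energy and a stem-membership flag
    (structure-derived). Illustrative feature set only."""
    gc = sum(b in "GC" for b in flank) / len(flank)
    longest_run = max(len(r) for r in re.findall(r"A+|C+|G+|T+", flank))
    return [gc, longest_run, delta_g, float(in_stem)]
```

Vectors like these, labeled by whether a deletion was called, are what a classifier such as a Random Forest consumes; dropping the last two entries yields the sequence-only baseline that the abstract reports as insufficient.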
INVESTIGATING INVASION IN DUCTAL CARCINOMA IN SITU WITH TOPOGRAPHICAL SINGLE CELL GENOME SEQUENCING
Synchronous ductal carcinoma in situ (DCIS-IDC) is an early-stage breast cancer in which it is possible to delineate genomic evolution during invasion because of the presence of both in situ and invasive regions within the same sample. While laser capture microdissection studies of DCIS-IDC examined the relationship between the paired in situ (DCIS) and invasive (IDC) regions, these studies were either confounded by bulk tissue or limited to a small set of genes or markers. To overcome these challenges, we developed Topographic Single Cell Sequencing (TSCS), which combines laser-catapulting with single cell DNA sequencing to measure genomic copy number profiles from single tumor cells while preserving their spatial context. We applied TSCS to sequence 1,293 single cells from 10 synchronous DCIS patients. We also applied deep-exome sequencing to the in situ, invasive, and normal tissues for the DCIS-IDC patients. Previous bulk tissue studies had produced several conflicting models of tumor evolution. Our data support a multiclonal invasion model, in which genome evolution occurs within the ducts and gives rise to multiple subclones that escape the ducts into the adjacent tissues to establish the invasive carcinomas. In summary, we have developed a novel method for single cell DNA sequencing, which preserves spatial context, and applied this method to understand clonal evolution during the transition from carcinoma in situ to invasive ductal carcinoma.
On the role of metaheuristic optimization in bioinformatics
Metaheuristic algorithms are employed to solve complex and large-scale optimization problems in many different fields, from transportation and smart cities to finance. This paper discusses how metaheuristic algorithms are being applied to solve different optimization problems in the area of bioinformatics. While the text provides references to many optimization problems in the area, it focuses on those that have attracted more interest from the optimization community. Among the problems analyzed, the paper discusses in more detail the molecular docking problem, protein structure prediction, phylogenetic inference, and different string problems. In addition, references to other relevant optimization problems are also given, including those related to medical imaging or gene selection for classification. From the previous analysis, the paper generates insights on research opportunities for the Operations Research and Computer Science communities in the field of bioinformatics.
The mapping task and its various applications in next-generation sequencing
The aim of this thesis is the development and benchmarking of
computational methods for the analysis of high-throughput data from
tiling arrays and next-generation sequencing. Tiling arrays have been
a mainstay of genome-wide transcriptomics, e.g., in the identification
of functional elements in the human genome. Due to limitations of
existing methods for analyzing such data, a novel
statistical approach is presented that identifies expressed segments
as significant differences from the background distribution and thus
avoids dataset-specific parameters. This method detects differentially
expressed segments in biological data with significantly lower false
discovery rates and equivalent sensitivities compared to commonly used
methods. In addition, it is also clearly superior in the recovery of
exon-intron structures. Moreover, the search for local accumulations
of expressed segments in tiling array data has led to the
identification of very large expressed regions that may constitute a
new class of macroRNAs.
This thesis proceeds with next-generation sequencing for which various
protocols have been devised to study genomic, transcriptomic, and
epigenomic features. One of the first crucial steps in most NGS data
analyses is the mapping of sequencing reads to a reference
genome. This work introduces algorithmic methods to solve the mapping
tasks for three major NGS protocols: DNA-seq, RNA-seq, and
MethylC-seq. All methods have been thoroughly benchmarked and
integrated into the segemehl mapping suite.
First, mapping of DNA-seq data is facilitated by the core mapping
algorithm of segemehl. Since the initial publication, it has been
continuously updated and expanded. Here, extensive and reproducible
benchmarks are presented that compare segemehl to state-of-the-art
read aligners on various data sets. The results indicate that it is
not only more sensitive in finding the optimal alignment with respect
to the unit edit distance but also very specific compared to most
commonly used alternative read mappers. These advantages are
observable for both real and simulated reads, are largely independent
of the read length and sequencing technology, but come at the cost of
higher running time and memory consumption.
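The unit edit distance used as the optimality criterion above is the standard Levenshtein distance with unit costs for mismatches, insertions, and deletions. A minimal dynamic-programming implementation makes the criterion concrete (this is the metric, not segemehl's index-based search itself):

```python
def unit_edit_distance(a, b):
    """Levenshtein distance between two strings with unit costs;
    an alignment is optimal if it attains this distance."""
    prev = list(range(len(b) + 1))        # distances for empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # (mis)match
        prev = curr
    return prev[-1]
```

A mapper is then "sensitive" in the sense of the benchmark if, for a read, it reports an alignment whose distance to the reference equals this minimum.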
Second, the split-read extension of segemehl, presented by Hoffmann,
enables the mapping of RNA-seq data, a computationally more difficult
form of the mapping task due to the occurrence of splicing. Here, the
novel tool lack is presented, which aims to recover missed RNA-seq
read alignments using de novo splice junction information. It
performs very well in benchmarks and may thus be a beneficial
extension to RNA-seq analysis pipelines.
Third, a novel method is introduced that facilitates the mapping of
bisulfite-treated sequencing data. This protocol is considered the
gold standard in genome-wide studies of DNA methylation, one of the
major epigenetic modifications in animals and plants. The treatment of
DNA with sodium bisulfite selectively converts unmethylated cytosines
to uracils, while methylated ones remain unchanged. The bisulfite
extension developed here performs seed searches on a collapsed
alphabet followed by bisulfite-sensitive dynamic programming
alignments. Thus, it is insensitive to bisulfite-related mismatches
and does not rely on post-processing, in contrast to other methods. In
comparison to state-of-the-art tools, this method achieves
significantly higher sensitivities and performs time-competitive in
mapping millions of sequencing reads to vertebrate
genomes. Remarkably, the increase in sensitivity does not come at the
cost of decreased specificity and thus may finally result in a better
performance in calling the methylation rate.
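The collapsed-alphabet idea can be sketched in a few lines: every C is mapped to T for seeding, so bisulfite conversion can no longer cause seed mismatches, and the verification step treats a T in the read over a C in the reference as a legal conversion rather than a mismatch. This is a simplified model of the approach (ignoring strand handling and the full dynamic-programming alignment):

```python
def collapse(seq):
    """C/T-collapsed alphabet used for seed searches."""
    return seq.replace("C", "T")

def bisulfite_mismatches(read, ref):
    """Count mismatches in an ungapped alignment, where a read T
    over a reference C counts as a bisulfite conversion, not an
    error (simplified bisulfite-sensitive verification)."""
    return sum(1 for r, g in zip(read, ref)
               if r != g and not (g == "C" and r == "T"))
```

Seeding on `collapse(read)` against `collapse(ref)` and verifying with `bisulfite_mismatches` is why the method needs no mismatch-tolerant post-processing: conversions are never scored as errors in the first place.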
Lastly, the potential of mapping strategies for de novo genome
assemblies is demonstrated with the introduction of a new guided
assembly procedure. It incorporates mapping as a major component
uses the additional information (e.g., annotation) as guide. With this
method, the complete mitochondrial genome of Eulimnogammarus verrucosus has been
successfully assembled even though the sequencing library has been
heavily dominated by nuclear DNA.
In summary, this thesis introduces algorithmic methods that
significantly improve the analysis of tiling array, DNA-seq, RNA-seq,
and MethylC-seq data, and proposes standards for benchmarking NGS read
aligners. Moreover, it presents a new guided assembly procedure that
has been successfully applied in the de novo assembly of a
crustacean mitogenome.
New constructive heuristics for DNA sequencing by hybridization
Deoxyribonucleic acid (DNA) is a molecule that consists of two complementary sequences of nucleotides. Reading these sequences is an important task in biology, called DNA sequencing. However, large DNA molecules cannot be read in one piece. Therefore, existing techniques first break the given DNA molecules up into small fragments which can be read. One of these techniques is called the hybridization experiment. The reconstruction of the original DNA molecule from these fragments is a challenging problem from the computational point of view. In recent years the specific problem of DNA sequencing by hybridization has attracted quite a lot of interest in the optimization community. While most researchers focused on the development of metaheuristic approaches, work on simple constructive heuristics has hardly received any attention. This is despite the fact that well-working constructive heuristics are often an essential component of successful metaheuristics. It is exactly this lack of constructive heuristics that motivated the work presented in this paper. The results of our best constructive heuristic are comparable to the results of the best existing metaheuristics, while using less computational time.
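A minimal constructive heuristic for this problem (an illustration of the general greedy approach, not the heuristic proposed in the paper) starts from one probe of the hybridization spectrum and repeatedly appends the unused probe with the longest suffix-prefix overlap until the known target length is reached:

```python
def greedy_sbh(spectrum, length):
    """Greedily reconstruct a sequence of the given length from its
    hybridization spectrum by maximizing suffix-prefix overlaps
    (simple illustrative heuristic; errors and negative probes in
    real spectra require more care)."""
    probes = list(spectrum)
    seq = probes.pop(0)
    while probes and len(seq) < length:
        best, best_ov = None, -1
        for p in probes:
            # longest prefix of p that matches a suffix of seq
            for ov in range(len(p), -1, -1):
                if seq.endswith(p[:ov]):
                    break
            if ov > best_ov:
                best, best_ov = p, ov
        seq += best[best_ov:]
        probes.remove(best)
    return seq
```

For an ideal, error-free spectrum the greedy choice often recovers the original sequence, e.g. the 3-mer spectrum of "ACGTAC"; real instances with false positives and negatives are what make the problem hard.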