Non-Unique oligonucleotide probe selection heuristics
The non-unique probe selection problem consists of selecting both unique and non-unique oligonucleotide probes for oligonucleotide microarrays, which are widely used tools to identify viruses or bacteria in biological samples. Non-unique probes, designed to hybridize to at least one target, are used as alternatives when the design of unique probes is particularly difficult for closely related target genes. The goal of the non-unique probe selection problem is to determine a smallest set of probes able to identify all targets present in a biological sample. This problem is known to be NP-hard. In this thesis, several novel heuristics based on a greedy strategy, genetic algorithms, and evolution strategies are presented for the minimization problem arising from non-unique probe selection using the best-known ILP formulation. Experimental results show that our methods are capable of reducing the number of probes required compared to state-of-the-art methods.
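At its core, the minimization above is a set-cover-style problem: each candidate probe covers the set of targets it hybridizes to, and a smallest covering subset is sought. A minimal sketch of the greedy strategy on a hypothetical toy instance (probe and target names invented; the thesis heuristics additionally handle the separation constraints needed to tell targets apart):

```python
def greedy_probe_selection(targets, hybridizes):
    """Pick probes until every target is covered (classic greedy set cover).

    hybridizes: dict mapping probe -> set of targets it hybridizes to.
    Greedy rule: repeatedly take the probe covering the most
    still-uncovered targets.  Approximate, not optimal; the full
    non-unique probe selection problem must also separate targets.
    """
    uncovered = set(targets)
    chosen = []
    while uncovered:
        best = max(hybridizes, key=lambda p: len(hybridizes[p] & uncovered))
        if not hybridizes[best] & uncovered:
            raise ValueError("some targets cannot be covered")
        chosen.append(best)
        uncovered -= hybridizes[best]
    return chosen

# Hypothetical toy instance: 4 targets, 4 candidate probes.
probes = {
    "p1": {"t1", "t2"},
    "p2": {"t2", "t3"},
    "p3": {"t3", "t4"},
    "p4": {"t1"},
}
print(greedy_probe_selection(["t1", "t2", "t3", "t4"], probes))  # ['p1', 'p3']
```

Greedy set cover guarantees only a logarithmic approximation factor, which is one reason metaheuristics such as genetic algorithms can further reduce the probe count.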
Improving the efficiency of Bayesian Network Based EDAs and their application in Bioinformatics
Estimation of distribution algorithms (EDAs) are a relatively new class of stochastic optimizers that have received a lot of attention during the last decade. In each generation, EDAs build probabilistic models of promising solutions of an optimization problem to guide the search process. New sets of solutions are obtained by sampling the corresponding probability distributions. Using this approach, EDAs provide the user with a set of models that reveals the dependencies between the variables of the optimization problem while solving it. In order to solve a complex problem, it is necessary to use a probabilistic model that is able to capture those dependencies. Bayesian networks are usually used for modeling multiple dependencies between variables. Learning Bayesian networks, especially for large problems with a high degree of dependency among their variables, is computationally very expensive and is therefore the bottleneck of EDAs. Introducing efficient Bayesian network learning algorithms into EDAs thus seems necessary in order to apply them to large problems. In this dissertation, after comparing several Bayesian network learning algorithms, we propose an algorithm, called CMSS-BOA, which uses a recently introduced heuristic called max-min parent children (MMPC) to constrain the model search space. This algorithm does not assume a fixed and small upper bound on the order of interaction between variables and is able to solve problems with large numbers of variables efficiently. We compare the efficiency of CMSS-BOA with the standard Bayesian network based EDA on several benchmark problems, and finally we use it to build a predictor for glycation sites in mammalian proteins.
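The generational loop described above can be illustrated with the simplest EDA, UMDA, which models each variable independently; Bayesian-network EDAs such as BOA and CMSS-BOA keep the same loop but replace the independent marginals with a learned dependency model. A sketch on the standard OneMax toy problem (all parameter values are illustrative):

```python
import random

def umda_onemax(n_bits=20, pop=60, select=30, gens=40, seed=1):
    """Univariate EDA (UMDA) maximizing OneMax (number of 1-bits).

    Each generation: rank the population, keep the best half, estimate
    per-bit marginal probabilities from it, and sample a new population
    from those marginals.  Bayesian-network EDAs use the same loop but
    learn a dependency model instead of independent marginals.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=sum, reverse=True)
        elite = population[:select]
        # Estimate the marginal P(bit_i = 1) from the selected individuals.
        p = [sum(ind[i] for ind in elite) / select for i in range(n_bits)]
        population = [[1 if rng.random() < p[i] else 0 for i in range(n_bits)]
                      for _ in range(pop)]
    return max(population, key=sum)

best = umda_onemax()
print(sum(best))  # should be at or near the optimum of n_bits
```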
Bayesian Optimization Algorithm for Non-unique Oligonucleotide Probe Selection
One important application of DNA microarrays is measuring the expression levels of genes. The quality of the microarray design, which includes selecting the short oligonucleotide sequences (probes) to be affixed on the surface of the microarray, is a major issue. A good design contains the minimum possible number of probes while retaining an acceptable ability to identify the targets present in the sample. We focus on the problem of computing the minimal set of probes able to identify each target in a sample, referred to as non-unique oligonucleotide probe selection. We present the application of an estimation of distribution algorithm, the Bayesian Optimization Algorithm (BOA), to this problem, and consider the integration of BOA with a simple heuristic. We also present the application of our method, integrated with a decoding approach, in a multiobjective optimization framework for solving the problem when multiple targets are present in the sample.
Greene SCPrimer: a rapid comprehensive tool for designing degenerate primers from multiple sequence alignments
Polymerase chain reaction (PCR) is widely applied in clinical and environmental microbiology. Primer design is key to the development of successful assays and is often performed manually using multiple nucleic acid alignments. Few public software tools exist that allow comprehensive design of degenerate primers for large groups of related targets based on complex multiple sequence alignments. Here we present a method for designing such primers based on tree building followed by application of a set covering algorithm, and demonstrate its utility in compiling multiplex PCR primer panels for detection and differentiation of viral pathogens.
The mapping task and its various applications in next-generation sequencing
The aim of this thesis is the development and benchmarking of
computational methods for the analysis of high-throughput data from
tiling arrays and next-generation sequencing. Tiling arrays have been
a mainstay of genome-wide transcriptomics, e.g., in the identification
of functional elements in the human genome. Due to limitations of
existing analysis methods for such data, a novel
statistical approach is presented that identifies expressed segments
as significant differences from the background distribution and thus
avoids dataset-specific parameters. This method detects differentially
expressed segments in biological data with significantly lower false
discovery rates and equivalent sensitivities compared to commonly used
methods. In addition, it is also clearly superior in the recovery of
exon-intron structures. Moreover, the search for local accumulations
of expressed segments in tiling array data has led to the
identification of very large expressed regions that may constitute a
new class of macroRNAs.
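As an illustration of the background-distribution idea, the following toy caller flags runs of tiling-array probes that rise significantly above a robust background estimate. This is an assumed sketch only: the actual method tests segments against the empirical background distribution and controls the false discovery rate, rather than using the fixed z-cutoff shown here.

```python
import statistics

def expressed_segments(signal, z=3.0, min_len=3):
    """Toy segment caller: report runs of probes whose signal exceeds
    the background by z robust standard deviations.  The background is
    estimated from the median, so expressed probes barely affect it."""
    med = statistics.median(signal)
    # Robust spread estimate: median absolute deviation scaled to sigma.
    mad = statistics.median(abs(x - med) for x in signal) * 1.4826
    cutoff = med + z * mad
    segments, start = [], None
    for i, x in enumerate(signal + [float("-inf")]):  # sentinel ends last run
        if x > cutoff and start is None:
            start = i
        elif x <= cutoff and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    return segments

probes = [0, 1, 0, 1, 9, 10, 11, 9, 1, 0, 12, 0]
print(expressed_segments(probes))  # [(4, 8)]
```

The single high probe at index 10 is rejected by the minimum-length filter, mimicking how short spurious signals are kept out of the expressed segments.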
This thesis proceeds with next-generation sequencing for which various
protocols have been devised to study genomic, transcriptomic, and
epigenomic features. One of the first crucial steps in most NGS data
analyses is the mapping of sequencing reads to a reference
genome. This work introduces algorithmic methods to solve the mapping
tasks for three major NGS protocols: DNA-seq, RNA-seq, and
MethylC-seq. All methods have been thoroughly benchmarked and
integrated into the segemehl mapping suite.
First, mapping of DNA-seq data is facilitated by the core mapping
algorithm of segemehl. Since the initial publication, it has been
continuously updated and expanded. Here, extensive and reproducible
benchmarks are presented that compare segemehl to state-of-the-art
read aligners on various data sets. The results indicate that it is
not only more sensitive in finding the optimal alignment with respect
to the unit edit distance but also very specific compared to most
commonly used alternative read mappers. These advantages are
observable for both real and simulated reads, are largely independent
of the read length and sequencing technology, but come at the cost of
higher running time and memory consumption.
Second, the split-read extension of segemehl, presented by Hoffmann,
enables the mapping of RNA-seq data, a computationally more difficult
form of the mapping task due to the occurrence of splicing. Here, the
novel tool lack is presented, which aims to recover missed RNA-seq
read alignments using de novo splice junction information. It
performs very well in benchmarks and may thus be a beneficial
extension to RNA-seq analysis pipelines.
Third, a novel method is introduced that facilitates the mapping of
bisulfite-treated sequencing data. This protocol is considered the
gold standard in genome-wide studies of DNA methylation, one of the
major epigenetic modifications in animals and plants. The treatment of
DNA with sodium bisulfite selectively converts unmethylated cytosines
to uracils, while methylated ones remain unchanged. The bisulfite
extension developed here performs seed searches on a collapsed
alphabet followed by bisulfite-sensitive dynamic programming
alignments. Thus, it is insensitive to bisulfite-related mismatches
and does not rely on post-processing, in contrast to other methods. In
comparison to state-of-the-art tools, this method achieves
significantly higher sensitivities and is time-competitive in
mapping millions of sequencing reads to vertebrate
genomes. Remarkably, the increase in sensitivity does not come at the
cost of decreased specificity and thus may finally result in a better
performance in calling the methylation rate.
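The collapsed-alphabet seeding can be sketched as follows: replacing every C by T in both read and reference makes bisulfite conversions invisible during seed matching, after which each hit would be verified by a bisulfite-sensitive dynamic-programming alignment. This toy code illustrates the idea only and is not segemehl's implementation:

```python
def collapse(seq):
    """Map C -> T so bisulfite conversion cannot cause seed mismatches."""
    return seq.replace("C", "T")

def seed_hits(read, reference, seed_len=8):
    """Return reference positions where the collapsed read seed matches
    the collapsed reference exactly.  A real mapper would verify each
    hit with a bisulfite-sensitive dynamic-programming alignment."""
    seed = collapse(read[:seed_len])
    ref = collapse(reference)
    hits, pos = [], ref.find(seed)
    while pos != -1:
        hits.append(pos)
        pos = ref.find(seed, pos + 1)
    return hits

reference = "AACGTTCGATCGGA"
read = "AATGTTTGAT"  # bisulfite-converted copy of positions 0-9 (C -> T)
print(seed_hits(read, reference))  # [0]
```

Without the collapsed alphabet, the two converted cytosines in the read would count as mismatches and the seed would be missed.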
Lastly, the potential of mapping strategies for de novo genome
assemblies is demonstrated with the introduction of a new guided
assembly procedure. It incorporates mapping as a major component and
uses additional information (e.g., annotation) as a guide. With this
method, the complete mitochondrial genome of Eulimnogammarus verrucosus has been
successfully assembled even though the sequencing library has been
heavily dominated by nuclear DNA.
In summary, this thesis introduces algorithmic methods that
significantly improve the analysis of tiling array, DNA-seq, RNA-seq,
and MethylC-seq data, and proposes standards for benchmarking NGS read
aligners. Moreover, it presents a new guided assembly procedure that
has been successfully applied in the de novo assembly of a
crustacean mitogenome.
Data Mining Using the Crossing Minimization Paradigm
Our ability and capacity to generate, record and store multi-dimensional, apparently
unstructured data is increasing rapidly, while the cost of data storage is going down. The data recorded is not perfect, as noise gets introduced in it from different sources. Some of the basic forms of noise are incorrect recording of values and missing values. The formal study of discovering useful hidden information in the data is called Data Mining.
Because of the size and complexity of the problem, practical data mining problems are best attempted using automatic means.
Data Mining can be categorized into two types: supervised learning (classification) and unsupervised learning (clustering). Clustering only the records in a database (or data matrix) gives a global view of the data and is called one-way clustering. For a detailed analysis, or a local view, biclustering (also called co-clustering or two-way clustering) is required, involving the simultaneous clustering of the records and the attributes.
In this dissertation, a novel, fast, and white-noise-tolerant data mining solution is proposed based on the Crossing Minimization (CM) paradigm; the solution works for one-way as well as two-way clustering and discovers overlapping biclusters. For decades the CM paradigm has been used in graph drawing and VLSI (Very Large Scale Integration) circuit design to reduce wire length and congestion. The utility of the proposed technique is demonstrated by comparing it with other biclustering techniques on simulated noisy data as well as real data from agriculture, biology, and other domains.
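The classical workhorse of the CM paradigm is the barycenter heuristic: alternately reorder one layer of a bipartite graph by the average position of each vertex's neighbours in the other layer. A minimal sketch of that heuristic (it illustrates the paradigm, not the dissertation's clustering algorithm):

```python
def barycenter_sweeps(edges, n_left, n_right, sweeps=4):
    """One-sided barycenter heuristic for bipartite crossing minimization.

    edges: list of (left_index, right_index) pairs.  Alternately fix one
    layer and sort the other by the mean position of its neighbours.
    Heuristic only; crossing minimization itself is NP-hard.
    """
    left, right = list(range(n_left)), list(range(n_right))
    for s in range(sweeps):
        fixed, free = (left, right) if s % 2 == 0 else (right, left)
        pos = {v: i for i, v in enumerate(fixed)}
        nbrs = {v: [] for v in free}
        for a, b in edges:
            u, w = (a, b) if s % 2 == 0 else (b, a)
            nbrs[w].append(pos[u])
        # Sort the free layer in place by neighbour barycenter.
        free.sort(key=lambda v: sum(nbrs[v]) / len(nbrs[v]) if nbrs[v] else -1)
    return left, right

# Crossing-heavy toy graph: edges (0,2),(1,3),(2,0),(3,1).
print(barycenter_sweeps([(0, 2), (1, 3), (2, 0), (3, 1)], 4, 4))
```

In the clustering setting, the two layers correspond to the rows and columns of the data matrix, so that after a few sweeps the rows and columns belonging to one bicluster become adjacent.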
Two other interesting and hard problems also addressed in this dissertation are (i) the Minimum Attribute Subset Selection (MASS) problem and (ii) the Bandwidth Minimization (BWM) problem for sparse matrices. The proposed CM technique is demonstrated to provide very convincing results on these problems using real public-domain data.
Pakistan is the fourth-largest supplier of cotton in the world. An apparent anomaly was observed between cotton yield and pesticide consumption in Pakistan during 1989-97, showing unexpected periods of negative correlation. By applying the CM technique for one-way clustering to real Agro-Met data (2001-2002), a possible explanation of the anomaly is presented in this thesis.
Novel graph based algorithms for transcriptome sequence analysis
RNA-sequencing (RNA-seq) is one of the most widely used techniques in molecular biology. A key bioinformatics task in any RNA-seq workflow is assembling the reads. As the size of transcriptomics data sets is constantly increasing, scalable and accurate assembly approaches have to be developed. Here, we propose several approaches to improve the assembly of RNA-seq data generated by second-generation sequencing technologies. We demonstrate that the systematic removal of irrelevant reads from a high-coverage dataset prior to assembly reduces runtime and improves the quality of the assembly. Further, we propose a novel RNA-seq assembly workflow comprising read error correction, normalization, assembly with informed parameter selection, and transcript-level expression computation.
In recent years, the popularity of third-generation sequencing technologies has increased, as long reads allow for accurate isoform quantification and gene-fusion detection, which is essential for biomedical research. We present a sequence-to-graph alignment method to detect and quantify transcripts in third-generation sequencing data. We also propose the first gene-fusion prediction tool specifically tailored to long-read data, which hence achieves accurate expression estimation even on complex data sets. Moreover, our method predicted experimentally verified fusion events along with some novel events, which can be validated in the future.
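The effect of removing redundant reads before assembly can be illustrated with a digital-normalization-style filter, which discards a read once the median count of its k-mers among the reads kept so far already reaches the target coverage. This is a generic sketch (tiny k and thresholds for readability), not the workflow's actual normalization tool:

```python
from collections import Counter

def normalize_reads(reads, k=4, target=2):
    """Keep a read only while the median coverage of its k-mers is
    below the target; high-coverage (redundant) reads are dropped.
    Illustrative of coverage-based read removal before assembly; real
    tools use approximate counting structures to save memory."""
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        med = sorted(counts[km] for km in kmers)[len(kmers) // 2]
        if med < target:
            kept.append(read)
            counts.update(kmers)
    return kept

reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
print(normalize_reads(reads))  # ['ACGTACGT', 'ACGTACGT', 'TTTTGGGG']
```

Note how the rare read survives untouched while three of the five identical high-coverage copies are discarded, shrinking the assembly input without losing information.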
Evolutionary framework for DNA Microarray Cluster Analysis
This research proposes an evolutionary framework that combines a hierarchical clustering method based on an evolutionary model, a set of cluster validation measures, and a clustering visualization tool. The goal is to create an appropriate framework for knowledge extraction from DNA-microarray data. On the one hand, the evolutionary clustering model of our framework is a novel alternative that attempts to solve some of the problems present in existing clustering methods. On the other hand, our clustering visualization alternative, materialized in a tool, incorporates new properties and new visualization components, which make it possible to validate and analyze the results of the clustering task. In this way, the integration of the evolutionary clustering model with the visual clustering model turns our evolutionary framework into a novel data mining application compared to conventional methods.
Genealogy Reconstruction: Methods and applications in cancer and wild populations
Genealogy reconstruction is widely used in biology when relationships among entities are studied. Phylogenies, or evolutionary trees, show the differences between species. They are of profound importance because they help to obtain a better understanding of evolutionary processes. Pedigrees, or family trees, on the other hand, visualize the relatedness between individuals in a population. The reconstruction of pedigrees and the inference of parentage in general is now a cornerstone in molecular ecology. Applications include the direct inference of gene flow, estimation of the effective population size, and parameters describing the population's mating behaviour, such as rates of inbreeding.
In the first part of this thesis, we construct genealogies of various types of cancer. Histopathological classification of human tumors relies in part on the degree of differentiation of the tumor sample. To date, there is no objective systematic method to categorize tumor subtypes by maturation. We introduce a novel algorithm to rank tumor subtypes according to the dissimilarity of their gene expression from that of stem cells and fully differentiated tissue, and thereby construct a phylogenetic tree of cancer. We validate our methodology with expression data of leukemia and liposarcoma subtypes and then apply it to a broader group of sarcomas and of breast cancer subtypes. The ranking of tumor subtypes resulting from the application of our methodology allows the identification of genes correlated with differentiation and may help to identify novel therapeutic targets. Our algorithm represents the first phylogeny-based tool to analyze the differentiation status of human tumors.
In contrast to asexually reproducing cancer cell populations, pedigrees of sexually reproducing populations cannot be represented by phylogenetic trees. Pedigrees are directed acyclic graphs (DAGs) and therefore more closely resemble phylogenetic networks, where reticulate events are indicated by vertices with two incoming arcs. In the second part of the thesis, we present a software package for pedigree reconstruction in natural populations using co-dominant genomic markers such as microsatellites and single nucleotide polymorphisms (SNPs). If available, the algorithm makes use of prior information such as known relationships (sub-pedigrees) or the age and sex of individuals. Statistical confidence is estimated by Markov chain Monte Carlo (MCMC) sampling. The accuracy of the algorithm is demonstrated for simulated data as well as an empirical data set with a known pedigree. The parentage inference is robust even in the presence of genotyping errors. We further demonstrate the accuracy of the algorithm on simulated clonal populations. We show that the joint estimation of parameters of interest, such as the rate of self-fertilization or clonality, is possible with high accuracy even with marker panels of moderate power. Classical methods can only assign a very limited number of statistically significant parentages in this case and would therefore fail. The method is implemented in fast and easy-to-use open source software that scales to large datasets with many thousands of individuals.
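Parentage likelihoods of this kind rest on single-locus Mendelian transmission probabilities compared against the probability of the offspring genotype under population allele frequencies. A toy sketch of that building block (illustrative only, not the package's actual likelihood model):

```python
from math import log10

def transmission_prob(child, parent1, parent2):
    """P(child genotype | both parent genotypes) at one autosomal locus,
    assuming Mendelian segregation: each parent transmits one of its
    two alleles with probability 1/2."""
    p = 0.0
    for a in parent1:
        for b in parent2:
            if sorted((a, b)) == sorted(child):
                p += 0.25
    return p

def parentage_lod(child, parent1, parent2, freq):
    """log10 of the transmission probability over the probability of the
    child's genotype under random draws from the population allele
    frequencies (Hardy-Weinberg background)."""
    a, b = child
    background = freq[a] * freq[b] * (2 if a != b else 1)
    return log10(transmission_prob(child, parent1, parent2) / background)

freq = {"A": 0.5, "B": 0.5}
print(parentage_lod(("A", "B"), ("A", "A"), ("B", "B"), freq))  # log10(2)
```

Summing such per-locus log-odds over a marker panel is what lets even moderate-power panels discriminate true parents from unrelated candidates; genotyping errors are handled in real models by mixing in an error term rather than letting a single impossible locus veto a parentage.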