Distance Measures in Bioinformatics
Many bioinformatics applications rely on the computation of similarities between objects. Distance and similarity measures applied to vectors of characteristics are essential to problems such as classification, clustering and information retrieval. This study explores the usefulness of distance and similarity measures in several bioinformatics applications, which fall into two categories. (1) Estimation of the adverse-reaction severity of unknown pharmaceutical treatments, based on the severity of known treatments, in order to guide the testing of the unknown treatments in clinical trials. (2) Classification of cancer tissue types and estimation of cancer stages, based on high-dimensional microarray data, in order to support clinical decision making. To address the first category, we studied several clustering and classification approaches for binary severity estimation of Cytokine Release Syndrome (CRS). We developed a Severity Estimation using Distance Metric Learning (SE-DML) approach to obtain graded severity estimates. With binary estimation we were able to identify the treatments that caused the most severe responses and then built prediction models for CRS. Using the SE-DML approach, we evaluated four known data sets and showed that SE-DML outperformed other widely used methods on these data sets. For the second category, we presented Kernelized Information-Theoretic Metric Learning (KITML) algorithms that optimize distance metrics and effectively handle high-dimensional data. The metric learned by KITML is used to improve the performance of k-nearest neighbor classification for cancer tissue microarray data. We evaluated our approach on fourteen cancer microarray data sets and compared our results with other state-of-the-art approaches, achieving the best overall performance for the classification task.
In addition, we tested the KITML algorithm in estimating the severity stages of cancer samples, with accurate results.
Ph.D., Electrical Engineering -- Drexel University, 201
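The metric-learning-plus-k-NN pipeline described above can be sketched in a few lines. This is a minimal illustration, not the KITML algorithm itself: the positive semi-definite matrix M that a metric learner would produce is simply supplied by the caller (here the identity, i.e. plain Euclidean distance), and the data are toy values.

```python
import numpy as np

def mahalanobis_knn_predict(X_train, y_train, x, M, k=3):
    """Classify x by majority vote among its k nearest training points
    under the Mahalanobis metric d(a, b) = sqrt((a-b)^T M (a-b)),
    where M is a learned positive semi-definite matrix."""
    diffs = X_train - x                               # (n, d) differences
    d2 = np.einsum('ij,jk,ik->i', diffs, M, diffs)    # squared distances
    nearest = np.argsort(d2)[:k]                      # indices of k nearest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # majority vote

# Toy data: two well-separated classes in 2-D.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
y = np.array([0, 0, 1, 1])
M = np.eye(2)  # identity = Euclidean; a learned metric would replace this
print(mahalanobis_knn_predict(X, y, np.array([0.2, 0.1]), M))  # -> 0
```

In the thesis's setting, M would come from the KITML optimization rather than being fixed; the classification step itself is unchanged.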
Probabilistic analysis of the human transcriptome with side information
Understanding functional organization of genetic information is a major
challenge in modern biology. Following the initial publication of the human
genome sequence in 2001, advances in high-throughput measurement technologies
and efficient sharing of research material through community databases have
opened up new views to the study of living organisms and the structure of life.
In this thesis, novel computational strategies have been developed to
investigate a key functional layer of genetic information, the human
transcriptome, which regulates the function of living cells through protein
synthesis. The key contributions of the thesis are general exploratory tools
for high-throughput data analysis that have provided new insights into
cell-biological networks, cancer mechanisms and other aspects of genome
function.
A central challenge in functional genomics is that high-dimensional genomic
observations are associated with high levels of complex and largely unknown
sources of variation. By combining statistical evidence across multiple
measurement sources and the wealth of background information in genomic data
repositories it has been possible to resolve some of the uncertainties associated
with individual observations and to identify functional mechanisms that could
not be detected based on individual measurement sources. Statistical learning
and probabilistic models provide a natural framework for such modeling tasks.
Open source implementations of the key methodological contributions have been
released to facilitate further adoption of the developed methods by the
research community.
Comment: Doctoral thesis, 103 pages, 11 figures.
Cell Type-specific Analysis of Human Interactome and Transcriptome
Cells are the fundamental building blocks of complex tissues in higher-order organisms. These cells take different forms and shapes to perform a broad range of functions. What makes a cell uniquely eligible to perform a task, however, is not well understood; neither is the defining characteristic that groups similar cells together to constitute a cell type. Even for known cell types, the underlying pathways that mediate cell type-specific functionality are not readily available. These functions, in turn, contribute to cell type-specific susceptibility in various disorders.
Computational Approaches to Drug Profiling and Drug-Protein Interactions
Despite substantial increases in R&D spending within the pharmaceutical industry, de novo drug design has become a time-consuming endeavour. High attrition rates led to a
long period of stagnation in drug approvals. Due to the extreme costs associated with
introducing a drug to the market, locating and understanding the reasons for clinical failure
is key to future productivity. As part of this PhD, three main contributions were made in
this respect. First, the web platform, LigNFam enables users to interactively explore
similarity relationships between ‘drug like’ molecules and the proteins they bind. Secondly,
two deep-learning-based binding site comparison tools were developed, competing with
the state-of-the-art over benchmark datasets. The models can predict off-target interactions and potential candidates for target-based drug repurposing. Finally, the
open-source ScaffoldGraph software was presented for the analysis of hierarchical scaffold
relationships and has already been used in multiple projects, including integration into a
virtual screening pipeline to increase the tractability of ultra-large screening experiments.
Together, and with existing tools, the contributions made will aid in the understanding of
drug-protein relationships, particularly in the fields of off-target prediction and drug
repurposing, helping to design better drugs faster.
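Similarity relationships between drug-like molecules, as explored in tools like LigNFam, are commonly computed as Tanimoto coefficients over binary fingerprints. The sketch below is a generic illustration under that assumption, not code from the thesis; the fingerprints and molecule names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two binary fingerprints,
    represented here as sets of 'on' bit positions:
    |A intersect B| / |A union B|."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

# Hypothetical fingerprints for three molecules (illustrative bit sets).
mol_a = {1, 4, 9, 16, 25}
mol_b = {1, 4, 9, 36, 49}
mol_c = {2, 3, 5, 7, 11}
print(tanimoto(mol_a, mol_b))  # 3 shared bits / 7 total = 3/7
print(tanimoto(mol_a, mol_c))  # no shared bits -> 0.0
```

Real pipelines would derive the fingerprints from molecular structure (e.g. circular fingerprints) rather than hand-coding them, but the similarity computation is the same.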
Scalable Feature Selection Applications for Genome-Wide Association Studies of Complex Diseases
Personalized medicine will revolutionize our capabilities to combat disease. Working toward this goal, a fundamental task is the deciphering of genetic variants that are predictive of complex diseases. Modern studies, in the form of genome-wide association studies (GWAS), have afforded researchers the opportunity to reveal new genotype-phenotype relationships through the extensive scanning of genetic variants. These studies typically contain over half a million genetic features for thousands of individuals. Examining these with methods other than univariate statistics is a challenging task requiring advanced algorithms that are scalable to the genome-wide level. In the future, next-generation sequencing (NGS) studies will contain an even larger number of common and rare variants.
Machine learning-based feature selection algorithms have been shown to effectively create predictive models for various genotype-phenotype relationships. This work explores the problem of selecting genetic variant subsets that are the most predictive of complex disease phenotypes through various feature selection methodologies, including filter, wrapper and embedded algorithms. The examined machine learning algorithms were demonstrated not only to be effective at predicting the disease phenotypes, but also to do so efficiently through the use of computational shortcuts. While much of the work could be run on high-end desktops, some of it was extended so that it could be implemented on parallel computers, helping to ensure that the methods will also scale to NGS data sets.
Further, these studies analyzed the relationships between various feature selection methods and demonstrated the need for careful testing when selecting an algorithm. It was shown that there is no universally optimal algorithm for variant selection in GWAS; rather, methodologies need to be selected based on the desired outcome, such as the number of features to be included in the prediction model. It was also demonstrated that without proper model validation, for example using nested cross-validation, models can yield overly optimistic prediction accuracies and decreased generalization ability. It is through the implementation and application of machine learning methods that one can extract predictive genotype-phenotype relationships and biological insights from genetic data sets.
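The validation caveat above is worth making concrete. The sketch below is a self-contained, hypothetical illustration (synthetic data, a simple correlation filter, and a nearest-centroid classifier, none of which are the thesis's actual methods): the key point is that feature selection is redone inside every training fold, so the outer accuracy estimate is not inflated by information leakage.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_top_features(X, y, k):
    """Filter method: rank features by absolute correlation with the
    label and keep the k strongest. Crucially, this is only ever
    called on a training fold, never on the full data set."""
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

def nearest_centroid_predict(X_tr, y_tr, X_te):
    """Deliberately simple classifier: assign each test sample to the
    class whose training centroid is closest."""
    centroids = {c: X_tr[y_tr == c].mean(axis=0) for c in np.unique(y_tr)}
    return np.array([min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))
                     for x in X_te])

# Synthetic data: 2 informative features out of 50, 60 samples.
n, p = 60, 50
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[:, 0] += 2 * y        # informative feature
X[:, 1] -= 2 * y        # informative feature

# Outer CV loop estimates generalization; selection happens per fold.
folds = np.array_split(rng.permutation(n), 5)
accs = []
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    feats = select_top_features(X[train_idx], y[train_idx], k=2)
    preds = nearest_centroid_predict(X[train_idx][:, feats], y[train_idx],
                                     X[test_idx][:, feats])
    accs.append((preds == y[test_idx]).mean())
print(round(float(np.mean(accs)), 2))
```

Selecting features on the full data set before cross-validating, by contrast, leaks test-set information into the selection step and is exactly the source of the overly optimistic accuracies the thesis warns about.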
Ranking to Learn and Learning to Rank: On the Role of Ranking in Pattern Recognition Applications
The last decade has seen a revolution in the theory and application of
machine learning and pattern recognition. Through these advancements, variable
ranking has emerged as an active and growing research area and it is now
beginning to be applied to many new problems. The rationale behind this fact is
that many pattern recognition problems are by nature ranking problems. The main
objective of a ranking algorithm is to sort objects according to some criterion,
so that the most relevant items appear early in the produced result list.
Ranking methods can be analyzed from two different methodological perspectives:
ranking to learn and learning to rank. The former aims at studying methods and
techniques to sort objects for improving the accuracy of a machine learning
model. Enhancing a model's performance can be challenging. For example,
in pattern classification tasks, different data representations can complicate
and hide the different explanatory factors of variation behind the data. In
particular, hand-crafted features contain many cues that are either redundant
or irrelevant, which turn out to reduce the overall accuracy of the classifier.
In such cases feature selection is used: by producing ranked lists of
features, it helps to filter out the unwanted information. Moreover, in real-time
systems (e.g., visual trackers) ranking approaches are used as optimization
procedures which improve the robustness of the system that deals with the high
variability of image streams that change over time. Conversely,
learning to rank is necessary in the construction of ranking models for
information retrieval, biometric authentication, re-identification, and
recommender systems. In this context, the ranking model's purpose is to sort
objects according to their degrees of relevance, importance, or preference as
defined in the specific application.
Comment: European PhD Thesis. arXiv admin note: text overlap with arXiv:1601.06615, arXiv:1505.06821, arXiv:1704.02665 by other authors.
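The core contract of a ranking model, sorting objects so the most relevant appear first, can be shown in miniature. Below is a minimal pointwise sketch, assuming relevance scores already produced by some upstream model; the document names and score values are hypothetical.

```python
def rank_items(items, relevance):
    """Given a relevance score per item, return the items sorted so
    that the most relevant appear first in the result list."""
    return [item for item, _ in sorted(zip(items, relevance),
                                       key=lambda pair: pair[1],
                                       reverse=True)]

docs = ["doc_a", "doc_b", "doc_c"]
scores = [0.2, 0.9, 0.5]   # hypothetical model outputs
print(rank_items(docs, scores))  # -> ['doc_b', 'doc_c', 'doc_a']
```

Learning to rank concerns how the scores themselves are learned (pointwise, pairwise, or listwise objectives); the final sort-by-relevance step is common to all of them.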