827 research outputs found

    Evaluation of a prior-incorporated statistical model and established classifiers for externally visible characteristics prediction

    Human identification through DNA has played an important role in forensic science and in the criminal justice system for decades. It refers to the association of genetic data with a particular human being and has facilitated police investigations in cases such as the identification of suspected perpetrators from biological traces found at crime scenes, missing persons, or victims of mass disasters [1]. Two main methods are currently in use: genotyping through short tandem repeats (STR profiling) and forensic DNA phenotyping (FDP). Although both methods aim to identify a person through their genetic material, their approaches and consequences are completely different. STR profiling compares allele repeats at specific loci in DNA and aims at a match with DNA profiles already known to the police authorities, while FDP, which is the focus of the current study, aims to predict the appearance traits of an individual [2, 3]. In contrast to STR profiling, information arising from FDP cannot be used as sole evidence in court [4]. Externally visible characteristics (EVCs) predicted from DNA can serve as ‘biological witnesses’ that only provide leads for the investigative authorities and subsequently narrow down a possibly large set of potential suspects. The use of FDP begins a new era of ‘DNA intelligence’ and holds great promise, especially in cases where individuals cannot be identified with the conventional method of STR profiling and in cases where there is no additional knowledge about the sample donor. So far in FDP, traits such as eye, hair and skin color can be predicted with high accuracy, and predictive models have already been forensically validated [5-7].
Regarding other appearance traits, the current lack of knowledge on the genetic markers responsible for their phenotypic variation and the lower predictability, especially of intermediate categories, have prevented FDP from being routinely implemented in forensic science. The majority of the predictive models developed for appearance trait prediction are based on multinomial logistic regression (MLR), while only a few use other methods such as decision trees and neural networks. Machine learning (ML) approaches have become a widely used tool for classification problems in several fields, and they are known for their potential to boost model performance and their ability to handle different and complex types of data [8]. However, within the context of predicting EVCs, a systematic comparison of different ML approaches that could identify methods outperforming the standard MLR has not been conducted so far. In addition, the incorporation of priors in EVC prediction models, which may improve the existing approaches, has not yet been investigated in the context of forensics. These priors represent the trait category prevalence values among biogeographic ancestry groups, and their use would allow us to leverage Bayesian statistics in order to build more powerful prediction models. In our case, such priors could reflect the additional information from all yet unknown causal genetic factors and act as proxies for them in the prediction model. I therefore pursued both approaches throughout my PhD project, with the main aim of improving the existing approaches to FDP. In the first study, I aimed to collect a comprehensive data set from previously published sources on the spatial distribution of different appearance traits.
I conducted a literature review in order to assemble this information, which could later be incorporated as priors in the EVC prediction models. Due to the lack of available and reliable sources, the resulting data set contained only eye and hair color for mostly European countries. More specifically, I collected data on eye color from 16 European and Central Asian countries, while for hair color I collected data from seven European countries. For countries outside of Europe, where the variation is low, it was not possible to assemble trustworthy and population-representative data. I then calculated the association between those two traits and found it to be moderate. Interpolation techniques were applied in order to infer trait prevalence values in at least neighboring countries. Resulting prevalences and interpolated values were presented in spatial maps. The subject of the second study was to incorporate the trait prevalence values as priors in the prediction model. However, due to the lack of reliable data observed in the first study, incorporating actual priors, which would have revealed their real impact on EVC prediction, was not feasible with current knowledge and the available data. Therefore, I assessed the impact of priors across a grid that contained all possible values that priors can take, for a set of appearance traits including eye, hair, and skin color, hair structure, and freckles. In this way, I aimed to assess potential pitfalls caused by misspecification of priors. Results were compared with and evaluated against the corresponding previously established ‘prior-free’ prediction models. The effect of priors was demonstrated using standard performance measures, including area under the curve (AUC) and overall accuracy. I found that a proportion of all possible prior values shows potential to improve the prediction accuracy.
However, misspecification of priors can significantly diminish the overall accuracy. On that basis, I emphasize the importance of accurate prior values in prediction modelling in order to identify their actual impact. Consequently, the use of prior-informed models in forensics is currently infeasible, and more studies are necessary to extend the current knowledge on spatial trait prevalence. Finally, the third study explored and compared the performance of methodologies beyond MLR. MLR is considered the standard method for predicting EVCs, since the majority of the predictive models developed are based on it. Because there is still potential for improving MLR models, especially for traits such as skin color or hair structure, I applied different ML methods to determine whether any classifier outperforms the conventional MLR. I therefore conducted a systematic comparison between MLR and three alternative ML classifiers, namely support vector machines (SVM), random forests (RF) and artificial neural networks (ANN). The traits I focused on here were eye, hair, and skin color. All models were based on the genetic markers previously established in IrisPlex, HIrisPlex and HIrisPlex-S [5-7]. Overall, I observed that all four classifiers performed almost equally well, especially for eye color. Only non-substantial differences were obtained across the different traits and across trait categories. Given this outcome, none of the ML methods applied here performed better than MLR, at least for the three traits of eye, hair, and skin color. Ultimately, due to the easier interpretability of MLR, I suggest that, at least for now and for the currently known marker sets, MLR is the most appropriate method for predicting appearance traits from DNA.
Throughout my PhD project, it became apparent that the available knowledge on spatial trait prevalence values was quite restricted, not only to certain appearance traits but also to certain continental groups. More specifically, most available and reliable data focused on European populations, and the traits covered were mostly eye and hair color. For other traits, such as skin color, hair structure, and freckles, the data were either extremely scarce or nonexistent. This was a significant obstacle throughout the project, since it prevented me from applying accurate trait prevalence values as priors in EVC prediction and testing their actual impact. However, the lack of data presented an opportunity to perform in-depth theoretical research, in particular testing the impact of priors within a spatial grid that included their possible values. I found that a proportion of priors showed potential to improve EVC prediction. However, caution is advised regarding misspecification of priors, which can significantly deteriorate the models' performance. Furthermore, the application of different ML approaches did not show any significant improvement in prediction performance over the standard MLR. This could be due to the nature of the traits, since some of them are multifactorial and affected by various external independent factors, or due to possible limitations of the currently known predictive markers. Given the knowledge available so far, this study emphasizes that, for the time being, priors should not be incorporated in EVC prediction models, while among the classifiers applied, MLR is considered the most appropriate method for EVC prediction due to its easier interpretability.
In addition, the presented study highlights the importance of reference data on externally visible traits and of identifying more genetic markers that contribute to certain traits. I hope that the present work will motivate the emergence of such data collections, which may improve the current EVC prediction models.
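
The prior-grid assessment described in this abstract can be illustrated with a minimal sketch. It assumes one common way of combining a prior with a prior-free classifier, namely rescaling the predicted category probabilities by the ratio of the new prior to the training prevalence and renormalizing. The abstract does not state that this is the method used in the thesis, and the function names, the grid resolution, and the three-category eye color numbers below are all hypothetical.

```python
from itertools import product


def reweight_with_prior(probs, train_prior, new_prior):
    """Rescale category probabilities by new_prior / train_prior, then renormalize.

    Assumes every entry of train_prior is positive and that at least one
    reweighted value is positive (otherwise the normalization divides by zero).
    """
    weighted = [p * n / t for p, t, n in zip(probs, train_prior, new_prior)]
    total = sum(weighted)
    return [w / total for w in weighted]


def prior_grid(n_classes, steps):
    """Enumerate all prior vectors on a regular grid over the probability simplex."""
    grid = []
    for counts in product(range(steps + 1), repeat=n_classes - 1):
        if sum(counts) <= steps:
            last = steps - sum(counts)
            grid.append([c / steps for c in counts] + [last / steps])
    return grid


# Hypothetical prior-free prediction for (blue, intermediate, brown) eye color.
model_probs = [0.5, 0.3, 0.2]
train_prior = [1 / 3, 1 / 3, 1 / 3]

# A prior that strongly favors brown changes the most probable category.
adjusted = reweight_with_prior(model_probs, train_prior, [0.2, 0.1, 0.7])
print(adjusted.index(max(adjusted)))

# Scan a coarse grid of priors and count how many change the predicted category.
baseline = model_probs.index(max(model_probs))
changed = 0
for prior in prior_grid(3, 4):
    if prior[baseline] == 0 and sum(prior) == prior[baseline]:
        continue  # defensive guard; cannot trigger for valid grid points
    post = reweight_with_prior(model_probs, train_prior, prior)
    if post.index(max(post)) != baseline:
        changed += 1
print(changed, "of", len(prior_grid(3, 4)), "grid priors change the call")
```

In a full evaluation, each grid point would be scored against known phenotypes (for example with AUC and overall accuracy) rather than only checked for a changed call, which is how a misspecified prior would show up as a loss in accuracy.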

    Accurate modeling of confounding variation in eQTL studies leads to a great increase in power to detect trans-regulatory effects

    Expression quantitative trait loci (eQTL) studies are an integral tool to investigate the genetic component of gene expression variation. A major challenge in the analysis of such studies is the presence of hidden confounding factors, such as unobserved covariates or unknown environmental influences. These factors can induce a pronounced artifactual correlation structure in the expression profiles, which may create spurious associations or mask real genetic association signals.

Here, we report PANAMA (Probabilistic ANAlysis of genoMic dAta), a novel probabilistic model to account for confounding factors within an eQTL analysis. In contrast to previous methods, PANAMA learns hidden factors jointly with the effect of prominent genetic regulators. As a result, PANAMA can more accurately distinguish between true genetic association signals and confounding variation.

We applied our model and compared it to existing methods on a variety of datasets and biological systems. PANAMA consistently performs better than alternative methods and, in particular, finds substantially more trans regulators. Importantly, PANAMA not only identifies a greater number of associations, but also yields hits that are biologically more plausible and can be better reproduced between independent studies.

    Incorporating Biological Pathways via a Markov Random Field Model in Genome-Wide Association Studies

    Genome-wide association studies (GWAS) examine a large number of markers across the genome to identify associations between genetic variants and disease. Most published studies examine only single markers, which may be less informative than considering multiple markers and multiple genes jointly because genes may interact with each other to affect disease risk. Much knowledge has been accumulated in the literature on biological pathways and interactions. It is conceivable that appropriate incorporation of such prior knowledge may improve the likelihood of making genuine discoveries. Although a number of methods have been developed recently to prioritize genes using prior biological knowledge, such as pathways, most methods treat genes in a specific pathway as an exchangeable set without considering the topological structure of the pathway. However, how genes are related to each other in a pathway may be very informative for identifying association signals. To make use of the connectivity information among genes in a pathway in GWAS analysis, we propose a Markov Random Field (MRF) model to incorporate pathway topology for association analysis. We show that the conditional distribution of our MRF model takes on a simple logistic regression form, and we propose an iterated conditional modes algorithm as well as a decision theoretic approach for statistical inference of each gene's association with disease. Simulation studies show that our proposed framework is more effective at identifying genes associated with disease than a single gene–based method. We also illustrate the usefulness of our approach through its application to a real data example.
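
The logistic form of the MRF conditional distribution and the iterated conditional modes update mentioned in this abstract can be illustrated with a generic autologistic model. This is a sketch under stated assumptions and not necessarily the parameterization used in the paper: $S_i \in \{0, 1\}$ marks whether gene $i$ is associated with disease, $N_i$ is the set of its neighbors in the pathway, $h$ and $\tau$ are hypothetical field and interaction parameters, and $f_1$, $f_0$ are the densities of a gene-level statistic $y_i$ with and without association, with the $y_i$ assumed conditionally independent given the labels.

```latex
\begin{align}
P(S) &\propto \exp\Big( h \sum_i S_i + \tau \sum_{i \sim j} S_i S_j \Big), \\
\operatorname{logit} P(S_i = 1 \mid S_{N_i}) &= h + \tau \sum_{j \in N_i} S_j, \\
\operatorname{logit} P(S_i = 1 \mid y, S_{N_i}) &= h + \tau \sum_{j \in N_i} S_j
  + \log \frac{f_1(y_i)}{f_0(y_i)}.
\end{align}
```

Under these assumptions, an iterated conditional modes pass would cycle through the genes, setting $S_i = 1$ whenever the last expression is positive and $S_i = 0$ otherwise, and repeat until no label changes.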

    Representation Learning for Chemical Activity Predictions

    Computational prediction of a phenotypic response to the chemical perturbation of a biological system plays an important role in drug discovery and many other applications. Chemical fingerprints derived from chemical structures are widely used features for building machine learning models. However, the fingerprints ignore the biological context and therefore suffer from several problems, such as the activity cliff and the curse of dimensionality. Fundamentally, the chemical modulation of biological activities is a multi-scale process. It is the genome-wide chemical-target interactions that modulate chemical phenotypic responses. Thus, the genome-scale chemical-target interaction profile will correlate more directly with in vitro and in vivo activities than the chemical structure. Nevertheless, the scope of direct application of the chemical-target interaction profile is limited due to the severe incompleteness, bias, and noisiness of bioassay data. To address these problems, we developed two new chemical and protein representation methods in this thesis. The first is the Latent Target Interaction Profile (LTIP). LTIP embeds chemicals into a low-dimensional continuous latent space that represents genome-scale chemical-target interactions. Subsequently, LTIP can be used as a feature to build machine learning models. Using the drug sensitivity of cancer cell lines as a benchmark, we have shown that LTIP robustly outperforms chemical fingerprints regardless of the machine learning algorithm. Moreover, LTIP is complementary to the chemical fingerprints. We can combine LTIP with other fingerprints to further improve the performance of bioactivity prediction. We also developed a new protein sequence embedding method, Distilled Sequence Alignment Embedding (DISAE), to represent proteins. We compared CGKronRLS to other machine learning algorithms including Random Forest and XGBoost for predicting drug-target interactions.
We show how the resultant deep protein representations can be used to predict novel drug-protein pair interactions, which can improve drug safety and open many avenues for drug repurposing. Our results demonstrate the potential of LTIP in particular, and multi-scale modeling in general, for predictive modeling of the chemical modulation of biological activities. They also show the predictive power of DISAE, which can be further improved through deep learning models.