8 research outputs found

    Evaluation of a prior-incorporated statistical model and established classifiers for externally visible characteristics prediction

    Human identification through DNA has played an important role in forensic science and in the criminal justice system for decades. It refers to the association of genetic data with a particular human being and has facilitated police investigations in cases such as the identification of suspected perpetrators from biological traces found at crime scenes, missing persons, or victims of mass disasters [1]. Two main methods are currently available: genotyping through short tandem repeats (STR profiling) and forensic DNA phenotyping (FDP). Although both methods aim to identify a person through their genetic material, their approaches and consequences are completely different. STR profiling compares allele repeats at specific loci in DNA and aims at a match with DNA profiles already known to the police authorities, while FDP, which is the focus of the current study, aims to predict the appearance traits of an individual [2, 3]. In contrast to STR profiling, information arising from FDP cannot be used as sole evidence in court [4]. Instead, externally visible characteristics (EVCs) predicted from DNA can serve as ‘biological witnesses’ that only provide leads for the investigating authorities and subsequently narrow down a possibly large set of potential suspects. The use of FDP begins a new era of ‘DNA intelligence’ and holds great promise, especially in cases where individuals cannot be identified with the conventional method of STR profiling and in cases where there is no additional knowledge on the sample donor. So far in FDP, traits such as eye, hair and skin color can be predicted reliably with high accuracy, and predictive models have already been forensically validated [5-7]. 
Regarding other appearance traits, the current lack of knowledge on the genetic markers responsible for their phenotypic variation and the lower predictability, especially of intermediate categories, have prevented FDP from being routinely implemented in forensic science. The majority of the predictive models developed for appearance trait prediction are based on multinomial logistic regression (MLR), while only a few use other methods such as decision trees and neural networks. Machine learning (ML) approaches have become a widely used tool for classification problems in several fields, and they are known for their potential to boost model performance and their ability to handle different and complex types of data [8]. However, within the context of predicting EVCs, a systematic comparison of different ML approaches that could identify methods outperforming the standard MLR has not been conducted so far. In addition, the incorporation of priors in EVC prediction models, which may have the potential to improve existing approaches, has not yet been investigated in the context of forensics. These priors represent trait category prevalence values among biogeographic ancestry groups, and their use would allow us to leverage Bayesian statistics in order to build more powerful prediction models. In our case, such priors could reflect the additional information from all as yet unknown causal genetic factors and act as proxies for them in the prediction model. Therefore, both approaches were pursued throughout my PhD project in order to improve the existing approaches of FDP, which was the main aim of my study. In the first study, I aimed to collect a comprehensive data set from previously published sources on the spatial distribution of different appearance traits. 
I conducted a literature review in order to assemble this information, which could later be incorporated as priors in EVC prediction models. Due to the lack of available and reliable sources, the resulting data set contained only eye and hair color, mostly for European countries. More specifically, I collected data on eye color from 16 European and Central Asian countries, while for hair color I collected data from seven European countries. For countries outside of Europe, where the variation is low, it was not possible to assemble trustworthy and population-representative data. I then quantified the association between the two traits and found it to be moderate. Interpolation techniques were applied in order to infer trait prevalence values at least for neighboring countries. The resulting prevalences and interpolated values were presented in spatial maps. The subject of the second study was to incorporate the trait prevalence values as priors in the prediction model. However, due to the lack of reliable data observed in the first study, incorporating actual priors, which would have revealed their real impact on EVC prediction, was not feasible with the existing knowledge and the available data. Therefore, I assessed the impact of priors across a grid that contained all possible values that priors can take, for a set of appearance traits including eye, hair, and skin color, hair structure, and freckles. In this way, I aimed to assess potential pitfalls caused by misspecification of priors. Results were compared with the corresponding previously established prior-free prediction models. The effect of priors was assessed using standard performance measures, including area under the curve (AUC) and overall accuracy. I found that a proportion of all possible prior values shows potential for improving the prediction accuracy. 
However, misspecification of priors can significantly diminish the overall accuracy. Based on that, I emphasize the importance of accurate prior values in prediction modelling in order to identify their actual impact. As a consequence, the use of prior-informed models in forensics is currently infeasible, and more studies on the topic are necessary in order to extend the current knowledge on spatial trait prevalence. Finally, the focus of the third study was to explore and compare the performance of methodologies beyond MLR. MLR is considered the standard method for predicting EVCs, since the majority of the predictive models developed are based on it. Because there is still potential for improvement of MLR models, especially for traits such as skin color or hair structure, I applied different ML methods in order to identify whether any classifier outperforms the conventional MLR. I therefore conducted a systematic comparison between MLR and three alternative ML classifiers, namely support vector machines (SVM), random forests (RF) and artificial neural networks (ANN). The traits that I focused on here were eye, hair, and skin color. All models were based on the genetic markers that were previously established in IrisPlex, HIrisPlex and HIrisPlex-S [5-7]. Overall, I observed that all four classifiers performed almost equally well, especially for eye color. Only minor differences were observed across the different traits and across trait categories. Given this outcome, none of the ML methods applied here performed better than MLR, at least for the three traits of eye, hair, and skin color. Ultimately, due to the easier interpretability of MLR, it is suggested, at least for now and for the currently known marker sets, that MLR is the most appropriate method for predicting appearance traits from DNA. 
Throughout my PhD project, it became apparent that the available knowledge on spatial trait prevalence values was quite restricted, not only to certain appearance traits but also to certain continental groups. More specifically, most available and reliable data focused on European populations, and the traits covered were mostly eye and hair color. For other traits, such as skin color, hair structure, and freckles, the data were either extremely sparse or nonexistent. This was a significant obstacle throughout the project, since it prevented me from testing the actual impact of accurate trait prevalence values as priors in EVC prediction. However, the lack of data presented an opportunity to perform in-depth theoretical research, in particular testing the impact of priors across a grid that included their possible values. I found that a proportion of priors showed potential to improve EVC prediction. However, caution is advised, because misspecification of priors can significantly deteriorate the models' performance. Furthermore, the application of different ML approaches did not show any significant improvement in prediction performance over the standard MLR. This could be due to the nature of the traits, since some of them are multifactorial and affected by various external independent factors, or due to possible limitations of the currently known predictive markers. Given the knowledge available so far, this study recommends that, for the time being, priors not be incorporated in EVC prediction models, and that, among the classifiers applied, MLR be considered the most appropriate method for EVC prediction due to its easier interpretability. 
In addition, the present study highlights the importance of reference data on externally visible traits and of identifying more genetic markers that contribute to certain traits. I hope that this work will motivate the collection of such data, which may improve the current EVC prediction models.
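
The grid of candidate priors described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the grid consists of all probability vectors whose entries are multiples of a fixed step and sum to one; the function name `prior_grid` and the step size are hypothetical and not taken from the thesis.

```python
from itertools import product


def prior_grid(n_categories: int, steps: int = 10) -> list[tuple[float, ...]]:
    """Enumerate prior vectors whose entries are multiples of 1/steps and sum to 1.

    Assumption: an evenly spaced grid on the probability simplex; the actual
    grid used in the study may have been constructed differently.
    """
    grid = []
    # Choose integer counts for the first n-1 categories; the last one is the remainder.
    for counts in product(range(steps + 1), repeat=n_categories - 1):
        remainder = steps - sum(counts)
        if remainder >= 0:
            grid.append(tuple(c / steps for c in (*counts, remainder)))
    return grid


# Three trait categories (e.g. three eye color classes) in steps of 0.1.
grid = prior_grid(3, steps=10)
print(len(grid))  # 66 candidate prior vectors
```

Each vector in the grid could then be supplied as a prior to the prediction model, and the resulting performance compared with the prior-free model.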

    Development and evaluations of the ancestry informative markers of the VISAGE Enhanced Tool for Appearance and Ancestry

    The VISAGE Enhanced Tool for Appearance and Ancestry (ET) has been designed to combine markers for the prediction of bio-geographical ancestry plus a range of externally visible characteristics into a single massively parallel sequencing (MPS) assay. We describe the development of the ancestry panel markers used in ET, and the enhanced analyses they provide compared to previous MPS-based forensic ancestry assays. As well as established autosomal single nucleotide polymorphisms (SNPs) that differentiate sub-Saharan African, European, East Asian, South Asian, Native American, and Oceanian populations, ET includes autosomal SNPs able to efficiently differentiate populations from Middle East regions [...]
    The study was supported by the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No. 740580 within the framework of the VISible Attributes through GEnomics (VISAGE) Project and Consortium. M.d.l.P. is supported by a post-doctorate grant funded by the Consellería de Cultura, Educación e Ordenación Universitaria e da Consellería de Economía, Emprego e Industria from Xunta de Galicia, Spain (ED481D-2021–008). J.R. is supported by the “Programa de axudas á etapa predoutoral” funded by the Consellería de Cultura, Educación e Ordenación Universitaria e da Consellería de Economía, Emprego e Industria from Xunta de Galicia, Spain (ED481A-2020/039). C.P., A.F.A., A.M.M., M.d.l.P., M.V.L. and the work to compile ancestry informative tri-allelic SNPs and microhaplotypes are supported by MAPA, ‘Multiple Allele Polymorphism Analysis’ (BIO2016–78525-R), a research project funded by the Spanish Research State Agency (AEI) and co-financed with ERDF funds. The population studies by S.O. at University of Santiago de Compostela were financed by the Fundação de Apoio a Pesquisa do Distrito Federal (FAPDF), Brazil

    True colors: A literature review on the spatial distribution of eye and hair pigmentation

    DNA-based prediction of externally visible characteristics has become an established approach in forensic genetics, with the aim of tracing individuals who are potentially unknown to the investigating authorities but without using this prediction as evidence in court. While a number of prediction models have been proposed, use of prior probabilities in those models has largely been absent. Here, we aim at compiling information on the spatial distribution of eye and hair coloration in order to use this as prior knowledge to improve prediction accuracy. To this end, we conducted a detailed literature review and created maps showing the eye and hair pigmentation prevalence both by countries with available information and by interpolation in order to obtain prior estimates for populations without available data. Furthermore, we assessed the association between these two traits in a very large data set. A strong limitation was the quite low amount of available data, especially outside Europe. We hope that our results will facilitate the improvement of already existing and of novel prediction methods for pigmentation traits and induce further studies on the spatial distribution of these traits
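
The abstract does not state which interpolation technique was used. As one possible illustration, the following Python sketch applies inverse distance weighting (IDW), a common spatial interpolation method, to estimate a trait prevalence at a location without data. The function name `idw_prevalence`, the planar coordinates, and the prevalence values are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def idw_prevalence(known, target, power=2.0):
    """Estimate a prevalence at `target` by inverse distance weighting.

    known:  list of ((x, y), prevalence) pairs for locations with data
    target: (x, y) coordinates of the location without data
    Assumption: planar Euclidean distance; real maps would need geographic distance.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), prevalence in known:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0:
            return prevalence  # target coincides with a known location
        weight = 1.0 / distance ** power
        weighted_sum += weight * prevalence
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical example: two locations with known prevalence of one trait category.
known_points = [((0.0, 0.0), 0.2), ((2.0, 0.0), 0.6)]
print(idw_prevalence(known_points, (1.0, 0.0)))  # midpoint, so the two values are averaged
```

Locations closer to the target contribute more strongly, which matches the intuition that neighboring countries are the most informative.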

    Evaluation of supervised machine-learning methods for predicting appearance traits from DNA

    The prediction of human externally visible characteristics (EVCs) based solely on DNA information has become an established approach in forensic and anthropological genetics in recent years. While for a large set of EVCs, predictive models have already been established using multinomial logistic regression (MLR), the prediction performances of other possible classification methods have not been thoroughly investigated thus far. Motivated by the question to identify a potential classifier that outperforms these specific trait models, we conducted a systematic comparison between the widely used MLR and three popular machine learning (ML) classifiers, namely support vector machines (SVM), random forest (RF) and artificial neural networks (ANN), that have shown good performance outside EVC prediction. As examples, we used eye, hair and skin color categories as phenotypes and genotypes based on the previously established IrisPlex, HIrisPlex, and HIrisPlex-S DNA markers. We compared and assessed the performances of each of the four methods, complemented by detailed hyperparameter tuning that was applied to some of the methods in order to maximize their performance. Overall, we observed that all four classification methods showed rather similar performance, with no method being substantially superior to the others for any of the traits, although performances varied slightly across the different traits and more so across the trait categories. Hence, based on our findings, none of the ML methods applied here provide any advantage on appearance prediction, at least when it comes to the categorical pigmentation traits and the selected DNA markers used here
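
Comparisons of this kind are typically based on measures such as overall accuracy and a per-category area under the curve (AUC). The following Python sketch shows how both could be computed from a classifier's output, whichever of the four methods produced it. The function names and the small example data are hypothetical and not taken from the study.

```python
def overall_accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted category equals the true category."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def one_vs_rest_auc(true_labels, scores, positive):
    """AUC for one category against all others.

    Computed as the probability that a random sample of the positive category
    receives a higher score than a random sample of any other category
    (ties count as one half).
    """
    pos = [s for t, s in zip(true_labels, scores) if t == positive]
    neg = [s for t, s in zip(true_labels, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: five samples, predicted probabilities for the category "blue".
truth = ["blue", "brown", "blue", "intermediate", "brown"]
predicted = ["blue", "brown", "blue", "blue", "brown"]
blue_scores = [0.9, 0.2, 0.6, 0.7, 0.1]

print(overall_accuracy(truth, predicted))            # 4 of 5 correct
print(one_vs_rest_auc(truth, blue_scores, "blue"))   # 5 of 6 positive/negative pairs ranked correctly
```

Applying the same two functions to the outputs of each classifier gives directly comparable numbers per trait and per trait category.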

    Testing the impact of trait prevalence priors in Bayesian-based genetic prediction modeling of human appearance traits

    The prediction of appearance traits by use of solely genetic information has become an established approach and a number of statistical prediction models have already been developed for this purpose. However, given limited knowledge on appearance genetics, currently available models are incomplete and do not include all causal genetic variants as predictors. Therefore such prediction models may benefit from the inclusion of additional information that acts as a proxy for this unknown genetic background. Use of priors, possibly informed by trait category prevalence values in biogeographic ancestry groups, in a Bayesian framework may thus improve the prediction accuracy of previously predicted externally visible characteristics, but has not been investigated as of yet. In this study, we assessed the impact of using trait prevalence-informed priors on the prediction performance in Bayesian models for eye, hair and skin color as well as hair structure and freckles in comparison to the respective prior-free models. Those prior-free models were either defined in the same way as the already established ones or closely approximated them by using a reduced predictive marker set. However, these differences in the number of predictive markers should not significantly affect our main outcomes. We observed that such priors often had a strong effect on the prediction performance, but to varying degrees between different traits and also different trait categories, with some categories barely showing an effect. While we found potential for improving the prediction accuracy of many of the appearance trait categories tested by using priors, our analyses also showed that misspecification of those prior values often severely diminished the accuracy compared to the respective prior-free approach. This emphasizes the importance of accurate specification of prevalence-informed priors in Bayesian prediction modeling of appearance traits. 
However, existing knowledge in the literature on spatial prevalence is sparse for most appearance traits, including those investigated here. Due to these limitations in appearance trait prevalence knowledge, our results render the use of trait prevalence-informed priors in DNA-based appearance trait prediction currently infeasible.
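
To make the role of the prior concrete, the following Python sketch applies Bayes' rule in its simplest form: the category probabilities of a prior-free model are treated as a normalized likelihood, multiplied by a prevalence prior, and renormalized. This is a simplifying assumption for illustration and not necessarily the model formulation used in the study; the function name `apply_prior` and all numbers are hypothetical.

```python
def apply_prior(model_probs, prior):
    """Combine prior-free category probabilities with a prevalence prior.

    posterior_k is proportional to model_probs[k] * prior[k], renormalized to sum to 1.
    Assumption: the prior-free output acts as a likelihood with an implicit flat prior.
    """
    weighted = [p * q for p, q in zip(model_probs, prior)]
    total = sum(weighted)
    if total == 0:
        raise ValueError("prior assigns zero probability to every supported category")
    return [w / total for w in weighted]


def most_likely(probs):
    """Index of the category with the highest probability."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical prior-free output for three categories: blue, intermediate, brown.
prior_free = [0.5, 0.3, 0.2]

# A prior that favours brown shifts the call from blue (index 0) to brown (index 2).
brown_heavy = apply_prior(prior_free, [0.2, 0.1, 0.7])
# A prior that favours blue reinforces the prior-free call.
blue_heavy = apply_prior(prior_free, [0.7, 0.1, 0.2])

print(most_likely(prior_free), most_likely(brown_heavy), most_likely(blue_heavy))  # 0 2 0
```

The example shows why misspecification matters: the same genetic evidence leads to different category calls depending on the prior, so a prior that does not match the true population prevalence can turn a correct call into a wrong one.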