919 research outputs found

    Study of large and highly stratified population datasets by combining iterative pruning principal component analysis and structure

    Background. The ever-increasing sizes of population genetic datasets pose great challenges for population structure analysis. The Tracy-Widom (TW) statistical test is widely used for detecting structure. However, it has not been adequately investigated whether the TW statistic is susceptible to type I error, especially in large, complex datasets. Non-parametric, Principal Component Analysis (PCA)-based methods for resolving structure have been developed which rely on the TW test. Although PCA-based methods can resolve structure, they cannot infer ancestry. Model-based methods are still needed for ancestry analysis, but they are not suitable for large datasets. We propose a new structure analysis framework for large datasets. This includes a new heuristic for detecting structure and incorporation of the structure patterns inferred by a PCA method to complement STRUCTURE analysis. Results. A new heuristic called EigenDev for detecting population structure is presented. When tested on simulated data, this heuristic is robust to sample size. In contrast, the TW statistic was found to be susceptible to type I error, especially for large population samples. EigenDev is thus better suited for analysis of large datasets containing many individuals, in which spurious patterns are likely to exist and could be incorrectly interpreted as population stratification. EigenDev was applied to the iterative pruning PCA (ipPCA) method, which resolves the underlying subpopulations. This subpopulation information was used to supervise STRUCTURE analysis to infer patterns of ancestry at an unprecedented level of resolution. To validate the new approach, a bovine and a large human genetic dataset (3945 individuals) were analyzed. We found new ancestry patterns consistent with the subpopulations resolved by ipPCA. Conclusions. The EigenDev heuristic is robust to sampling and is thus superior for detecting structure in large datasets. The application of EigenDev to the ipPCA algorithm improves the estimation of the number of subpopulations and the individual assignment accuracy, especially for very large and complex datasets. Furthermore, we have demonstrated that the structure resolved by this approach complements parametric analysis, allowing a much more comprehensive account of population structure. The new version of the ipPCA software with EigenDev incorporated can be downloaded from http://www4a.biotec.or.th/GI/tools/ippca.
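    For illustration, the sketch below shows the PCA step that both the TW test and an eigenvalue-deviation heuristic operate on: compute the leading eigenvalues of a standardised genotype covariance matrix, then score how far the top eigenvalue departs from the bulk. The deviation score here is a hypothetical stand-in, not the published EigenDev statistic, and the simulated two-subpopulation data are purely illustrative.

```python
import numpy as np

def leading_eigenvalues(genotypes, k=10):
    """Top eigenvalues of the sample covariance of an (individuals x SNPs)
    genotype matrix of 0/1/2 allele counts."""
    g = np.asarray(genotypes, dtype=float)
    # Column-standardise each SNP (mean 0, unit variance), as is usual before PCA.
    g -= g.mean(axis=0)
    sd = g.std(axis=0)
    sd[sd == 0] = 1.0
    g /= sd
    # Individual-by-individual covariance; its eigenvalue spectrum drives the TW test.
    cov = g @ g.T / g.shape[1]
    return np.linalg.eigvalsh(cov)[::-1][:k]  # descending order

def eigen_deviation(evals):
    """Hypothetical deviation score: how far the top eigenvalue sits above the
    bulk, in units of the bulk's spread. NOT the published EigenDev statistic."""
    bulk = evals[1:]
    return (evals[0] - bulk.mean()) / bulk.std()

# Toy example: two subpopulations with shifted allele frequencies.
rng = np.random.default_rng(0)
n, m = 200, 1000
freqs_a = rng.uniform(0.1, 0.9, m)
freqs_b = np.clip(freqs_a + rng.normal(0, 0.1, m), 0.05, 0.95)
pop_a = rng.binomial(2, freqs_a, size=(n // 2, m))
pop_b = rng.binomial(2, freqs_b, size=(n // 2, m))
evals = leading_eigenvalues(np.vstack([pop_a, pop_b]))
print("deviation score:", round(eigen_deviation(evals), 2))
```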

    Identification of gene-gene interactions for Alzheimer's disease using co-operative game theory

    Thesis (Ph.D.)--Boston University. The multifactorial nature of Alzheimer's Disease (AD) suggests that complex gene-gene interactions are present in AD pathways. Contemporary approaches to detect such interactions in genome-wide data are mathematically and computationally challenging. We investigated gene-gene interactions for AD using a novel algorithm based on cooperative game theory in 15 genome-wide association study (GWAS) datasets comprising a total of 11,840 AD cases and 10,931 cognitively normal elderly controls from the Alzheimer Disease Genetics Consortium (ADGC). We adapted this approach, which was developed originally for solving multi-dimensional problems in economics and social sciences, to compute a Shapley value statistic that identifies the genetic markers contributing most to coalitions of SNPs in predicting AD risk. Treating each GWAS dataset as an independent discovery set, markers were ranked according to their contribution to coalitions formed with other markers. Using a backward elimination strategy, markers with low Shapley values were eliminated and the statistic was recalculated iteratively. We tested all two-way interactions between top Shapley markers in regression models that included the two SNPs (main effects) and a term for their interaction. Models yielding a p-value < 0.05 for the interaction term were evaluated in each of the other datasets, and the results from all datasets were combined by meta-analysis. Statistically significant interactions were observed with multiple marker combinations in the APOE region. Our analyses also revealed statistically strong interactions between markers in six regions: CTNNA3-ATP11A (p=4.1E-07), CSMD1-PRKCQ (p=3.5E-08), DCC-UNC5CL (p=5.9E-08), CNTNAP2-RFC3 (p=1.16E-07), AACS-TSHZ3 (p=2.64E-07) and CAMK4-MMD (p=3.3E-07). The Shapley value algorithm outperformed chi-square and ReliefF in detecting known interactions between APOE and GAB2 in a previously published GWAS dataset. It was also more accurate than competing filtering methods in identifying simulated epistatic SNPs that are additive in nature, but its accuracy was low in identifying non-linear interactions. The game theory algorithm revealed strong interactions between markers in novel genes with weak main effects, which would have been overlooked if only markers with strong marginal association with AD were tested. This method will be a valuable tool for identifying gene-gene interactions for complex diseases and other traits.
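    A minimal sketch of the core idea, under loose assumptions: Shapley values for SNP features are approximated by Monte Carlo sampling of feature orderings, with cross-validated logistic-regression accuracy as the coalition payoff. The payoff choice, the toy genotype data, and all parameter values are illustrative; the thesis's actual algorithm and backward-elimination schedule are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def coalition_value(X, y, features):
    """Payoff of a coalition: cross-validated accuracy of a logistic model
    using only the given SNP columns (illustrative choice of payoff)."""
    if not features:
        return max(np.mean(y), 1 - np.mean(y))  # majority-class baseline
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, list(features)], y, cv=3).mean()

def shapley_values(X, y, n_permutations=50, rng=None):
    """Monte Carlo Shapley estimate: average marginal contribution of each SNP
    over random orderings of the features."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_features = X.shape[1]
    phi = np.zeros(n_features)
    for _ in range(n_permutations):
        order = rng.permutation(n_features)
        coalition, prev = [], coalition_value(X, y, [])
        for j in order:
            coalition.append(j)
            val = coalition_value(X, y, coalition)
            phi[j] += val - prev
            prev = val
    return phi / n_permutations

# Toy data: 300 subjects, 8 SNPs coded 0/1/2, two of which interact.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(300, 8)).astype(float)
logit = 0.8 * (X[:, 0] * X[:, 1]) - 1.5
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
print(np.round(shapley_values(X, y), 3))
```

    Markers with persistently low Shapley estimates would then be dropped and the statistic recomputed, mirroring the backward elimination strategy described in the abstract.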

    A Comprehensive Survey on Rare Event Prediction

    Rare event prediction involves identifying and forecasting events with a low probability using machine learning and data analysis. Because of imbalanced data distributions, in which the frequency of common events vastly outweighs that of rare events, it requires specialized methods at each step of the machine learning pipeline, from data processing to algorithms to evaluation protocols. Predicting the occurrence of rare events is important for real-world applications, such as Industry 4.0, and is an active research area in statistics and machine learning. This paper comprehensively reviews current approaches for rare event prediction along four dimensions: rare event data, data processing, algorithmic approaches, and evaluation approaches. Specifically, we consider 73 datasets from different modalities (i.e., numerical, image, text, and audio), four major categories of data processing, five major algorithmic groupings, and two broader evaluation approaches. This paper aims to identify gaps in the current literature, highlight the challenges of predicting rare events, and suggest potential research directions that can help guide practitioners and researchers.
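    As a concrete illustration of two points this kind of survey covers, the sketch below applies class weighting during training and evaluates with rare-class-sensitive metrics (average precision and F1) rather than raw accuracy. The synthetic data, model choice, and class ratio are assumptions for demonstration only, not taken from the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic rare-event data: roughly 2% positives.
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.98],
                           flip_y=0.01, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Class weighting is one of several imbalance strategies (others include
# resampling, cost-sensitive losses, and anomaly-detection framings).
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0).fit(X_tr, y_tr)

# Evaluate with metrics that reflect the rare class; plain accuracy would be
# near 98% even for a model that never predicts the event.
scores = clf.predict_proba(X_te)[:, 1]
print("average precision:", average_precision_score(y_te, scores))
print("F1 at 0.5 threshold:", f1_score(y_te, clf.predict(X_te)))
```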

    Towards Precision Psychiatry: Gray Matter Development and Cognition in Adolescence

    Precision Psychiatry promises a new era of optimized psychiatric diagnosis and treatment through comprehensive, data-driven patient stratification. Among the core requirements towards that goal are: 1) neurobiology-guided preprocessing and analysis of brain imaging data for noninvasive characterization of brain structure and function, and 2) integration of imaging, genomic, cognitive, and clinical data in accurate and interpretable predictive models for diagnosis, treatment choice, and monitoring. In this thesis, we touch on specific aspects that fall under these two broad points. First, we investigate normal gray matter development around adolescence, a critical period for the development of psychopathology. For years, the common narrative in human developmental neuroimaging has been that gray matter declines in adolescence. We demonstrate that different MRI-derived gray matter measures exhibit distinct age and sex effects and should not be considered equivalent, as has often been done in the past, but complementary. We show for the first time that gray matter density increases from childhood to young adulthood, in contrast with gray matter volume and cortical thickness, and that females, who are known to have lower gray matter volume than males, have higher density throughout the brain. A custom preprocessing pipeline and a novel high-resolution gray matter parcellation were created to analyze brain scans of 1189 youths collected as part of the Philadelphia Neurodevelopmental Cohort. This work emphasizes the need for future studies combining quantitative histology and neuroimaging to fully understand the biological basis of MRI contrasts and their derived measures. Second, we use the same gray matter measures to assess how well they can predict cognitive performance. We train mass-univariate and multivariate models to show that gray matter volume and density are complementary in their ability to predict performance. We suggest that parcellation resolution plays a substantial role in prediction accuracy and that it should be tuned separately for each modality, both for a fair comparison among modalities and for optimal prediction when combining all modalities. Lastly, we introduce rtemis, an R package for machine learning and visualization aimed at making advanced data analytics more accessible. Adoption of accurate and interpretable machine learning methods in basic research and medical practice will help advance biomedical science and make precision medicine a reality.
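    The multivariate prediction step can be sketched roughly as follows. The thesis uses the rtemis R package; this Python stand-in uses ridge regression with cross-validation on simulated parcel-wise volume and density features, so all data shapes, feature names, and model choices are illustrative assumptions rather than the thesis pipeline.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-ins for parcel-wise gray matter measures (subjects x parcels) and a
# cognitive score; real inputs would come from the imaging pipeline.
n_subjects, n_parcels = 500, 1000
volume = rng.normal(size=(n_subjects, n_parcels))
density = rng.normal(size=(n_subjects, n_parcels))
cognition = (volume[:, :20].sum(axis=1) + density[:, 20:40].sum(axis=1)
             + rng.normal(scale=5, size=n_subjects))

def cv_r(features, target):
    """Cross-validated correlation between predicted and observed scores."""
    model = make_pipeline(StandardScaler(),
                          RidgeCV(alphas=np.logspace(-2, 4, 20)))
    pred = cross_val_predict(model, features, target, cv=5)
    return np.corrcoef(pred, target)[0, 1]

# Complementarity check: each modality alone versus both combined.
print("volume only  :", cv_r(volume, cognition))
print("density only :", cv_r(density, cognition))
print("combined     :", cv_r(np.hstack([volume, density]), cognition))
```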

    Quantifying the genetic component of the metabolic syndrome using a novel proposal score and SNP-based heritability

    Introduction. Metabolic syndrome (MetS) is a complex, multifactorial disease that poses a major public health problem. MetS increases the risk of coronary heart disease (CHD), atherosclerotic cardiovascular disease (ASCVD), type 2 diabetes mellitus (T2DM), and all-cause mortality. Currently, many different criteria define MetS, but its pathophysiology is not completely understood in terms of either clinical progression or genetic contribution. Aims. The present work characterizes the MetS components (obesity, hypertension, glucose, etc.) as one continuous phenotype, and the genetic components of the proposed MetS score were estimated using both family-based and population-based samples. Methods. In the first step, Confirmatory Factor Analysis (CFA) was used to select the model with the best fit. After selection of the best factor structure and development of an algorithm to calculate the score, heritability was estimated using both pedigree data and SNP/marker data. For the first sample, SOLAR (Sequential Oligogenic Linkage Analysis Routines) software was used to obtain the estimates. For the second sample, genetic variance components were calculated by fitting a linear mixed model (LMM) using two types of genetic relatedness matrices (Identity-By-Descent, IBD, and Genome-Wide Complex Trait Analysis, GCTA), different levels of Linkage Disequilibrium (LD) pruning (0.20 – 0.80 and no LD pruning), and suggestive Genome-Wide Association Study (GWAS) SNPs. Results. According to the analyses, the best CFA model was the bifactor model; the estimated coefficients were used to calculate the MetS score. The score showed good performance and good agreement with the International Diabetes Federation (IDF) criteria, the gold standard used for clinical diagnosis. With regard to the estimation of genetic variance, heritability was significant and ranged from 0.1 to 0.4 across all samples and models. The heterogeneity of the results was due to the different samples and the different types of matrices entered into the LMMs. Heritability obtained using the GCTA matrix was significantly higher than that obtained using the IBD matrix. No significant differences between family data and marker data were observed in the Sardinian samples using an LD threshold of 0.80 and with no pruning. Conclusions. These analyses provide evidence of complex interactions in metabolic syndrome and of significant genetic contributions. Increased knowledge of the environmental and genetic components could allow better assessment and identification of patients with this syndrome.
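    A minimal sketch of the SNP-based side of such an analysis, assuming standard formulas: a GCTA-style genetic relatedness matrix is built from standardised allele counts, and heritability is estimated with a simple Haseman-Elston regression as a stand-in for the REML variance-component fits produced by GCTA and SOLAR. The toy data and the assumed true heritability of 0.4 are illustrative only.

```python
import numpy as np

def gcta_grm(genotypes):
    """GCTA-style genetic relatedness matrix from 0/1/2 allele counts
    (individuals x SNPs): standardise each SNP, then average cross-products."""
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0                       # sample allele frequencies
    z = (g - 2 * p) / np.sqrt(2 * p * (1 - p))
    return z @ z.T / g.shape[1]

def haseman_elston_h2(grm, phenotype):
    """Haseman-Elston regression: regress products of standardised phenotypes
    on off-diagonal GRM entries; the slope estimates SNP heritability."""
    y = (phenotype - phenotype.mean()) / phenotype.std()
    iu = np.triu_indices_from(grm, k=1)
    return np.polyfit(grm[iu], np.outer(y, y)[iu], 1)[0]

# Toy example: 500 individuals, 2000 SNPs, simulated true h2 of about 0.4.
rng = np.random.default_rng(0)
n, m = 500, 2000
freqs = rng.uniform(0.05, 0.95, m)
geno = rng.binomial(2, freqs, size=(n, m))
z = (geno - 2 * freqs) / np.sqrt(2 * freqs * (1 - freqs))
betas = rng.normal(scale=np.sqrt(0.4 / m), size=m)
pheno = z @ betas + rng.normal(scale=np.sqrt(0.6), size=n)
print("HE heritability estimate:",
      round(haseman_elston_h2(gcta_grm(geno), pheno), 3))
```

    An IBD-based analysis would replace the marker-derived matrix with a pedigree relationship matrix, which is one source of the heterogeneity in estimates noted above.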

    Using machine learning to support better and intelligent visualisation for genomic data

    Massive amounts of genomic data are being created with the advent of Next Generation Sequencing technologies. Great technological advances in methods of characterising human diseases, including their genetic and environmental factors, provide an opportunity to better understand these diseases and to find new diagnoses and treatments. Translational medical data are becoming ever richer and more challenging to interpret. Visualisation can greatly aid the processing and integration of complex data. Genomic data visual analytics is rapidly evolving alongside advances in technologies such as Artificial Intelligence (AI) and Virtual Reality (VR). Personalised medicine requires new genomic visualisation tools that can efficiently extract knowledge from genomic data and speed up expert decisions about the best treatment for an individual patient's needs. However, meaningful visual analysis of such large genomic data remains a serious challenge. Visualising these complex genomic data requires not only plotting the data but should also lead to better decisions. Machine learning has the ability to make predictions and aid decision-making. Machine learning and visualisation are both effective ways to deal with big data, but they serve different purposes. Machine learning applies statistical learning techniques to automatically identify patterns in data and make highly accurate predictions, while visualisation leverages the human perceptual system to interpret and uncover hidden patterns in big data. Clinicians, experts, and researchers intend to use both visualisation and machine learning to analyse their complex genomic data, but it is a serious challenge for them to understand and trust machine learning models in the medical domain. The main goal of this thesis is to study the feasibility of intelligent, interactive visualisation combined with machine learning algorithms for medical data analysis. A prototype has also been developed to illustrate the concept that visualising genomic data from childhood cancers in meaningful and dynamic ways could lead to better decisions. Machine learning algorithms are used and illustrated while visualising the cancer genomic data in order to provide highly accurate predictions. This research could open a new and exciting path to discovery for disease diagnostics and therapies.
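    One way the combination described above can look in practice is sketched below, under illustrative assumptions: a classifier supplies per-sample predictions on a synthetic genomic-style feature matrix, and a 2D PCA embedding of the same samples is coloured by those predictions so a viewer can inspect where the model is confident. This is not the thesis prototype, which targets childhood cancer data and interactive displays.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Stand-in for a genomic feature matrix (samples x features) with two classes,
# e.g. tumour subtypes; real inputs would come from sequencing pipelines.
X, y = make_classification(n_samples=400, n_features=50, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Machine learning side: a classifier providing per-sample risk predictions.
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
risk = clf.predict_proba(X_te)[:, 1]

# Visualisation side: a 2D embedding of the same samples, coloured by the
# model's prediction, so the viewer sees where the model is confident.
embedding = PCA(n_components=2).fit_transform(X_te)
plt.scatter(embedding[:, 0], embedding[:, 1], c=risk, cmap="viridis")
plt.colorbar(label="predicted probability")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Samples embedded in 2D, coloured by model prediction")
plt.show()
```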

    Combining classification algorithms

    Doctoral dissertation in Computer Science presented to the Faculty of Sciences of the University of Porto. The ability of a learning algorithm to induce a good generalization for a given problem depends on the representation language used to generalize the examples. Because different algorithms use different representation languages and search strategies, different spaces are explored and different results are obtained. Finding the representation best suited to the problem at hand is a very active research area. In this dissertation, instead of looking for methods that fit the data using a single representation language, we present a family of algorithms, under the generic designation of Cascade Generalization, in which the search space contains models that use different representation languages. The basic idea of the method is to apply the learning algorithms in sequence. Each iteration is a two-step process. In the first step, a classifier builds a model. In the second step, the space defined by the attributes is extended by inserting new attributes generated using this model. This attribute-construction process builds attributes in the representation language of the classifier used to build the model. If, later in the sequence, a classifier uses one of these new attributes to build its model, its representational power has been extended. In this way, the constraints of the representation language of the classifiers used at higher levels of the sequence are relaxed by incorporating terms from the representation language of the base classifiers. This is the basic methodology underlying the Ltree system and the Cascade Generalization architecture. The method is presented from two perspectives. In the first part, it is presented as a strategy for building multivariate decision trees. The Ltree system, which uses a linear discriminant as the attribute-construction operator, is presented. ...
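    The two-step iteration described above can be sketched as follows: a base classifier is trained, its class-probability outputs are appended to the attribute space as constructed attributes, and the next classifier in the sequence is trained on the extended space. The pairing of scikit-learn's linear discriminant and decision tree is an illustrative choice in the spirit of Ltree, not the Ltree implementation itself.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: a base classifier builds a model on the original attribute space.
base = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

# Step 2: extend the attribute space with new attributes expressed in the base
# classifier's representation language (here, its class-membership probabilities).
X_tr_ext = np.hstack([X_tr, base.predict_proba(X_tr)])
X_te_ext = np.hstack([X_te, base.predict_proba(X_te)])

# The next classifier in the cascade now searches a space that includes the
# constructed attributes, relaxing the constraints of its own representation.
cascade = DecisionTreeClassifier(random_state=0).fit(X_tr_ext, y_tr)
print("cascade accuracy:", accuracy_score(y_te, cascade.predict(X_te_ext)))

# Baseline for comparison: the same tree restricted to the original attributes.
plain = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("plain tree accuracy:", accuracy_score(y_te, plain.predict(X_te)))
```

    Longer cascades simply repeat the two steps, with each classifier free to use attributes constructed by any of its predecessors.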