209 research outputs found

    Scalable Feature Selection Applications for Genome-Wide Association Studies of Complex Diseases

    Get PDF
    Personalized medicine will revolutionize our capabilities to combat disease. Working toward this goal, a fundamental task is the deciphering of genetic variants that are predictive of complex diseases. Modern studies, in the form of genome-wide association studies (GWAS), have afforded researchers the opportunity to reveal new genotype-phenotype relationships through the extensive scanning of genetic variants. These studies typically contain over half a million genetic features for thousands of individuals. Examining these data with methods other than univariate statistics is a challenging task requiring advanced algorithms that scale to the genome-wide level. In the future, next-generation sequencing (NGS) studies will contain an even larger number of common and rare variants. Machine learning-based feature selection algorithms have been shown to effectively create predictive models for various genotype-phenotype relationships. This work explores the problem of selecting genetic variant subsets that are the most predictive of complex disease phenotypes through various feature selection methodologies, including filter, wrapper and embedded algorithms. The examined machine learning algorithms were demonstrated not only to be effective at predicting the disease phenotypes, but also to do so efficiently through the use of computational shortcuts. While much of the work could be run on high-end desktops, some of it was further extended for implementation on parallel computers, helping to ensure that the methods will also scale to NGS data sets. Further, these studies analyzed the relationships between various feature selection methods and demonstrated the need for careful testing when selecting an algorithm. It was shown that there is no universally optimal algorithm for variant selection in GWAS; rather, methodologies need to be selected based on the desired outcome, such as the number of features to be included in the prediction model. It was also demonstrated that without proper model validation, for example using nested cross-validation, models can yield overly-optimistic prediction accuracies and decreased generalization ability. It is through the implementation and application of machine learning methods that one can extract predictive genotype-phenotype relationships and biological insights from genetic data sets.
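
    The validation point above is worth making concrete. Below is a minimal sketch of nested cross-validation for SNP selection, assuming scikit-learn and a toy genotype matrix; all names and parameter values are illustrative, not taken from the thesis. The key detail is that feature selection happens inside each outer training fold, which is what prevents the overly-optimistic accuracies the author warns about.

        # Minimal sketch of nested cross-validation for SNP feature selection,
        # assuming scikit-learn; X is an (n_samples, n_snps) genotype matrix
        # coded 0/1/2 and y is a binary phenotype. All values are toy data.
        import numpy as np
        from sklearn.feature_selection import SelectKBest, chi2
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import GridSearchCV, cross_val_score
        from sklearn.pipeline import Pipeline

        rng = np.random.default_rng(0)
        X = rng.integers(0, 3, size=(200, 1000)).astype(float)  # toy genotypes
        y = rng.integers(0, 2, size=200)                        # toy phenotype

        # Feature selection lives INSIDE the pipeline, so each outer fold
        # re-selects SNPs on its own training split; selecting on the full
        # data first is what yields overly-optimistic accuracies.
        pipe = Pipeline([
            ("select", SelectKBest(chi2)),
            ("clf", LogisticRegression(max_iter=1000)),
        ])
        inner = GridSearchCV(pipe, {"select__k": [10, 50, 100]}, cv=3)
        outer_scores = cross_val_score(inner, X, y, cv=5)  # nested CV estimate
        print(outer_scores.mean())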

    Neural Networks beyond explainability: Selective inference for sequence motifs

    Full text link
    Over the past decade, neural networks have been successful at making predictions from biological sequences, especially in the context of regulatory genomics. As in other fields of deep learning, tools have been devised to extract features, such as sequence motifs, that can explain the predictions made by a trained network. Here we intend to go beyond explainable machine learning and introduce SEISM, a selective inference procedure to test the association between these extracted features and the predicted phenotype. In particular, we discuss how training a one-layer convolutional network is formally equivalent to selecting motifs maximizing some association score. We adapt existing sampling-based selective inference procedures by quantizing this selection over an infinite set to a large but finite grid. Finally, we show that sampling under a specific choice of parameters is sufficient to characterize the composite null hypothesis typically used for selective inference, a result that goes well beyond our particular framework. We illustrate the behavior of our method in terms of calibration, power and speed, and discuss its power/speed trade-off with a simpler data-split strategy. SEISM paves the way to an easier analysis of neural networks used in regulatory genomics, and to more powerful methods for genome-wide association studies (GWAS).
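
    To make the CNN/motif equivalence concrete, here is an illustrative sketch (not the SEISM implementation) of how a single convolutional filter over one-hot-encoded DNA acts as a motif whose max-pooled activation is a per-sequence association score; the filter values and sequence are made up.

        # Illustrative sketch (not the SEISM implementation): a one-layer
        # convolution over one-hot DNA, where each filter acts as a motif
        # whose max activation is a per-sequence score.
        import numpy as np

        ALPHABET = "ACGT"

        def one_hot(seq):
            # (4, len(seq)) one-hot encoding of a DNA string
            return np.array([[float(c == a) for c in seq] for a in ALPHABET])

        def motif_score(seq, filt):
            # filt: (4, k) filter/motif; slide it along the sequence and
            # take the maximal inner product, i.e. a max-pooled convolution
            x, k = one_hot(seq), filt.shape[1]
            return max(np.sum(x[:, i:i + k] * filt)
                       for i in range(x.shape[1] - k + 1))

        rng = np.random.default_rng(0)
        filt = rng.normal(size=(4, 6))            # one learned "motif" filter
        print(motif_score("ACGTACGTGGCA", filt))  # per-sequence feature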

    Features Ranking Techniques for Single Nucleotide Polymorphism Data

    Get PDF
    Identifying biomarkers such as single nucleotide polymorphisms (SNPs) is an important topic in biomedical applications. Such SNPs can be associated with an individual’s metabolism of drugs, which makes these SNPs targets for drug therapy and useful in personalized medicine applications. Yet another important application is that SNPs can be associated with an individual’s genetic predisposition to develop a disease. Identifying these associations allows proactive steps to be taken to hinder, delay or eliminate the disease. However, the problem is challenging: data are high dimensional and incomplete, and features (SNPs) are correlated. The goal of this thesis is to propose feature ranking methods that reduce the number of selected features and the computational cost required to select them in a binary classification task. The main idea of the hypothesis is that specific values within a feature might be useful in predicting specific classes, while other values are not. In this context, three heuristic methods are applied to select the best features. The methods are applied to the Wellcome Trust Case Control Consortium (WTCCC1) dataset and evaluated on Texas A&M University Qatar’s High Performance Computing platform. The results show that the classification accuracy achieved by the proposed methods is comparable to the baseline, while one of the proposed methods reduced the execution time of feature selection and the number of features required to achieve similar accuracy by 40% and 47%, respectively.
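
    The thesis’s three heuristics are not spelled out in this abstract, so the following is only a plausible sketch of the core idea, that individual genotype values (0/1/2) of a SNP can be scored separately: each SNP is ranked by how strongly its most informative value shifts the case rate. The scoring function and data are hypothetical.

        # Hedged sketch of one plausible value-specific ranking heuristic
        # (the thesis's actual heuristics are not reproduced here): score
        # each SNP by the single genotype value (0/1/2) that most shifts
        # the case/control rate relative to the overall case rate.
        import numpy as np

        def value_specific_score(snp, y):
            # snp: genotypes 0/1/2 per individual; y: 0/1 labels
            base = y.mean()  # overall case rate
            scores = []
            for g in (0, 1, 2):
                mask = snp == g
                if mask.sum() > 0:
                    # deviation of case rate among carriers of value g
                    scores.append(abs(y[mask].mean() - base))
            return max(scores)

        rng = np.random.default_rng(0)
        X = rng.integers(0, 3, size=(500, 100))  # toy SNP matrix
        y = rng.integers(0, 2, size=500)
        ranking = np.argsort([-value_specific_score(X[:, j], y)
                              for j in range(X.shape[1])])
        print(ranking[:10])  # top-ranked SNP indices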

    High Dimensional Analysis of Genetic Data for the Classification of Type 2 Diabetes Using Advanced Machine Learning Algorithms

    Get PDF
    The prevalence of type 2 diabetes (T2D) has increased steadily over the last thirty years and has now reached epidemic proportions. The secondary complications associated with T2D have significant health and economic impacts worldwide, and the condition is now regarded as the seventh leading cause of mortality. Therefore, understanding the underlying causes of T2D is high on government agendas. The condition is a multifactorial disorder with a complex aetiology: T2D emerges from the convergence of genetics, environment, diet, and lifestyle choices. The genetic determinants remain largely elusive, with only a handful of identified candidate genes. Genome-wide association studies (GWAS) have enhanced our understanding of genetic determinants in common complex human diseases. To date, 120 single nucleotide polymorphisms (SNPs) for T2D have been identified using GWAS. Standard statistical tests for single- and multi-locus analysis, such as logistic regression, have contributed little to understanding the genetic architecture of complex human diseases. Logistic regression can capture linear interactions between SNPs and traits; however, it neglects the non-linear epistatic interactions that are often present within genetic data. Complex human diseases are caused by the contributions of many interacting genetic variants, yet detecting epistatic interactions and understanding the underlying pathogenic architecture of complex human disorders remain significant challenges. This thesis presents a novel framework based on deep learning to reduce the high-dimensional space in GWAS and learn non-linear epistatic interactions in T2D genetic data for binary classification tasks. This framework includes traditional GWAS quality control, association analysis, deep learning stacked autoencoders, and a multilayer perceptron for classification. Quality control procedures are conducted to exclude genetic variants and individuals that do not meet pre-specified criteria. Logistic association analysis under an additive genetic model, adjusted for the genomic control inflation factor, is also conducted. SNPs passing a p-value threshold of 10^-2 are retained, resulting in 6609 SNPs (features); this removes statistically improbable SNPs and helps minimise the computational requirements needed to process all SNPs. The 6609 SNPs are used for epistatic analysis through progressively smaller hidden layer units. Latent representations are extracted using stacked autoencoders to initialise a multilayer feedforward network for binary classification. The classifier is fine-tuned to discriminate between cases and controls using T2D genetic data. The performance of the deep learning stacked autoencoder model is evaluated and benchmarked against a multilayer perceptron and a random forest learning algorithm. The findings show that the best results were obtained using 2500 compressed hidden units (AUC = 94.25%), while the classification accuracy using 300 compressed neurons remains reasonable (AUC = 80.78%). The results are promising: using deep learning stacked autoencoders, it is possible to reduce high-dimensional features in T2D GWAS data and learn non-linear epistatic interactions between SNPs while enhancing overall model performance for binary classification purposes.
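
    A minimal PyTorch sketch of the stacked-autoencoder approach described above: each layer is pretrained to reconstruct its input through progressively smaller hidden units (here 6609 -> 2500 -> 300, matching the abstract), and the pretrained encoders then initialise a feedforward classifier. Training details, epochs, and the toy data are illustrative, not the thesis's.

        # Layer-wise pretraining of stacked autoencoders, then reuse of the
        # encoders to initialise a binary classifier. Sizes match the
        # abstract; everything else (data, optimiser, epochs) is a toy.
        import torch
        import torch.nn as nn

        dims = [6609, 2500, 300]  # progressively smaller hidden units

        def pretrain_layer(x, d_in, d_out, epochs=5):
            # Train one autoencoder layer to reconstruct representation x
            enc, dec = nn.Linear(d_in, d_out), nn.Linear(d_out, d_in)
            opt = torch.optim.Adam(
                list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
            for _ in range(epochs):
                opt.zero_grad()
                loss = nn.functional.mse_loss(dec(torch.relu(enc(x))), x)
                loss.backward()
                opt.step()
            # return the encoder and the compressed representation
            return enc, torch.relu(enc(x)).detach()

        x = torch.rand(64, dims[0])  # toy genotype batch
        encoders = []
        for d_in, d_out in zip(dims, dims[1:]):
            enc, x = pretrain_layer(x, d_in, d_out)
            encoders.append(enc)

        # Stack pretrained encoders with a sigmoid output for case/control;
        # the whole network would then be fine-tuned end-to-end with BCE loss.
        classifier = nn.Sequential(
            encoders[0], nn.ReLU(), encoders[1], nn.ReLU(),
            nn.Linear(dims[-1], 1), nn.Sigmoid())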

    Statistical and Computational Methods for Analyzing and Visualizing Large-Scale Genomic Datasets

    Full text link
    Advances in large-scale genomic data production have led to a need for better methods to process, interpret, and organize these data. Starting with raw sequencing data, generating results requires many complex data processing steps, from quality control, alignment, and variant calling to genome-wide association studies (GWAS) and characterization of expression quantitative trait loci (eQTL). In this dissertation, I present methods to address issues faced when working with large-scale genomic datasets. In Chapter 2, I present an analysis of 4,787 whole genomes sequenced for the study of age-related macular degeneration (AMD) as a follow-up fine-mapping study to previous work from the International AMD Genomics Consortium (IAMDGC). Through whole genome sequencing, we comprehensively characterized genetic variants associated with AMD in known loci to provide additional insights into the variants potentially responsible for the disease by leveraging 60,706 additional controls. Our study improved the understanding of loci associated with AMD and demonstrated the advantages and disadvantages of different approaches for fine-mapping studies with sequence-based genotypes. In Chapter 3, I describe a novel method and a software tool to perform Hardy-Weinberg equilibrium (HWE) tests for structured populations. In sequence-based genetic studies, HWE test statistics are important quality metrics for distinguishing true genetic variants from artifactual ones, but they become much less informative when applied to a heterogeneous and/or structured population. As next-generation sequencing studies contain samples from increasingly diverse ancestries, we developed a new HWE test that addresses both the statistical and computational challenges of modern large-scale sequencing data and implemented the method in a publicly available software tool. Moreover, we extensively evaluated our proposed method against alternative methods for testing HWE in both simulated and real datasets. Our method has been successfully applied to the latest variant calling QC pipeline in the TOPMed project. In Chapter 4, I describe PheGET, a web application to interactively visualize expression quantitative trait loci (eQTLs) across tissues, genes, and regions to aid functional interpretation of regulatory variants. Tissue-specific expression has become increasingly important for understanding the links between genetic variation and disease. To address this need, the Genotype-Tissue Expression (GTEx) project collected and analyzed a treasure trove of expression data. However, effectively navigating this wealth of data to find signals relevant to researchers has become a major challenge. I demonstrate the functionalities of PheGET using the newest GTEx data on our eQTL browser website at https://eqtl.pheweb.org/, allowing the user to 1) view all cis-eQTLs for a single variant; 2) view and compare single-tissue, single-gene associations within any genomic region; 3) find the best eQTL signal in any given genomic region or gene; and 4) customize the plotted data in real time. PheGET is designed to handle and display the kind of complex multidimensional data often seen in our post-GWAS era, such as multi-tissue expression data, in an intuitive and convenient interface, giving researchers an additional tool to better understand the links between genetics and disease. PhD thesis, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/162918/1/amkwong_1.pd
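
    For context on the Chapter 3 problem, here is the baseline pooled-sample HWE chi-square test that the dissertation improves upon; the structured-population method itself is more involved and is not reproduced here. In a structured population, the Wahlund effect inflates this pooled statistic even for true variants, which is exactly why a less naive test is needed.

        # Baseline pooled-sample Hardy-Weinberg chi-square test (not the
        # structured-population method of Chapter 3): observed genotype
        # counts are compared with expectations under p^2 : 2pq : q^2.
        from scipy.stats import chi2

        def hwe_chi2(n_aa, n_ab, n_bb):
            n = n_aa + n_ab + n_bb
            p = (2 * n_aa + n_ab) / (2 * n)  # allele frequency of A
            exp = [n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2]
            obs = [n_aa, n_ab, n_bb]
            stat = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
            # 1 df: 3 genotype classes - 1 - 1 estimated allele frequency
            return stat, chi2.sf(stat, df=1)

        print(hwe_chi2(500, 400, 100))  # (statistic, p-value) on toy counts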

    Discovering Higher-order SNP Interactions in High-dimensional Genomic Data

    Get PDF
    In this thesis, a method based on multifactor dimensionality reduction and associative classification is employed to identify higher-order SNP interactions, enhancing the understanding of the genetic architecture of complex diseases. Further, this thesis explores the application of deep learning techniques, providing new clues for interaction analysis. The performance of the deep learning method is maximized by unifying deep neural networks with a random forest to reliably identify interactions in the presence of noise.
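
    As a rough illustration of the multifactor dimensionality reduction (MDR) step mentioned above (not the thesis's full method), the sketch below labels each two-locus genotype combination high- or low-risk by its case/control ratio, collapsing a 3x3 genotype table into a single binary interaction feature; the data and threshold rule are toy assumptions.

        # Illustrative two-locus MDR step: each of the 3x3 genotype cells
        # is labelled high-risk if its case/control ratio exceeds the
        # overall ratio, yielding one binary feature per SNP pair.
        import numpy as np
        from itertools import combinations

        def mdr_feature(snp_a, snp_b, y):
            risk = {}
            ratio = y.mean() / (1 - y.mean())  # overall case/control odds
            for ga in (0, 1, 2):
                for gb in (0, 1, 2):
                    cell = (snp_a == ga) & (snp_b == gb)
                    cases, ctrls = y[cell].sum(), (1 - y[cell]).sum()
                    # high-risk if the cell beats the overall ratio
                    risk[(ga, gb)] = cases > ratio * ctrls
            return np.array([risk[(a, b)] for a, b in zip(snp_a, snp_b)],
                            dtype=int)

        rng = np.random.default_rng(0)
        X = rng.integers(0, 3, size=(300, 5))  # toy SNP matrix
        y = rng.integers(0, 2, size=300)
        for i, j in combinations(range(X.shape[1]), 2):  # exhaustive pair scan
            f = mdr_feature(X[:, i], X[:, j], y)  # candidate interaction feature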

    Bioinformatics

    Get PDF
    This book is divided into different research areas relevant to Bioinformatics, such as biological networks, next-generation sequencing, high-performance computing, molecular modeling, structural bioinformatics, and intelligent data analysis. Each book section introduces the basic concepts and then explains their application to problems of great relevance, so both novice and expert readers can benefit from the information and research works presented here.

    Massively-Parallel Feature Selection for Big Data

    Full text link
    We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data settings (high dimensionality and/or sample size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix both in terms of rows (samples, training examples) and columns (features). By employing p-values of conditional independence tests and meta-analysis techniques, PFBP manages to rely only on computations local to a partition while minimizing communication costs. It then employs powerful and safe (asymptotically sound) heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, and Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Our empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores, while dominating other competitive algorithms in its class.
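
    A hedged sketch of the meta-analysis idea at PFBP's core, with heavy simplification: each data partition contributes a local p-value for a feature, the local p-values are combined (here via Fisher's method from SciPy), and features with weak combined evidence are Dropped Early from all subsequent iterations. The per-partition p-values below are invented for illustration.

        # Combining partition-local p-values and Early Dropping, heavily
        # simplified relative to PFBP itself; the local p-values are toys.
        from scipy.stats import combine_pvalues

        def combined_pvalue(local_pvalues):
            # Fisher's method: -2 * sum(log p_i) ~ chi2 with 2k df
            return combine_pvalues(local_pvalues, method="fisher")[1]

        alive = set(range(5))  # features still in play
        local = {              # toy per-partition p-values per feature
            0: [0.001, 0.003, 0.002],
            1: [0.40, 0.70, 0.55],
            2: [0.01, 0.05, 0.02],
            3: [0.90, 0.80, 0.95],
            4: [0.03, 0.20, 0.04],
        }
        for f in list(alive):
            if combined_pvalue(local[f]) > 0.05:  # Early Dropping threshold
                alive.discard(f)                  # never reconsidered
        print(sorted(alive))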