61 research outputs found

    A Genetic Programming Model for Association Studies to Detect Epistasis in Low Heritability Data

    Genome-wide association studies (GWAS) aim to identify the markers most strongly associated with phenotype values. One of the substantial challenges is finding a non-linear mapping between genotype and phenotype, known as epistasis, which usually makes the process of searching for and identifying functional SNPs more complex. Some diseases, such as cervical cancer, leukemia and type 2 diabetes, have low heritability. The heritability of a sample is directly related to how much of the phenotype is explained by the genotype, so the lower the heritability, the greater the influence of environmental factors and the smaller the genotypic explanation. In this work, an algorithm capable of identifying epistatic associations at different levels of heritability is proposed. The model is an application of genetic programming with a specialized initialization of the initial population based on a random forest strategy. The initialization process ranks the most important SNPs, increasing the probability of their insertion into the initial population of the genetic programming model. The model is intended to recover the causal markers robustly across heritability levels. The simulated experiments are of the case-control type with heritability levels of 0.4, 0.3, 0.2 and 0.1, considering scenarios with 100 and 1,000 markers. Our approach was compared with the GPAS software and with a genetic programming algorithm without the initialization step. The results show that an efficient population initialization method based on a ranking strategy is very promising compared with the other models.
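
    As a rough illustration of the initialization idea described above (a minimal sketch, not the authors' implementation; the function and parameter names ranked_initial_population, pop_size and tree_size are assumptions), a random forest can rank the SNPs and its feature importances can serve as sampling weights when the initial genetic programming population is drawn:

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        def ranked_initial_population(genotypes, phenotype, pop_size=200, tree_size=3, seed=0):
            # Rank SNPs by random forest importance (the specialized initialization step).
            rf = RandomForestClassifier(n_estimators=500, random_state=seed)
            rf.fit(genotypes, phenotype)
            weights = rf.feature_importances_ + 1e-9      # avoid zero-probability SNPs
            probs = weights / weights.sum()
            rng = np.random.default_rng(seed)
            population = []
            for _ in range(pop_size):
                # Important SNPs are more likely to appear in the initial individuals.
                snps = rng.choice(len(probs), size=tree_size, replace=False, p=probs)
                population.append(tuple(snps))
            return population

    Individuals here are reduced to small SNP index sets; in the actual model they would be full genetic programming trees built over those SNPs.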

    ATHENA: A knowledge-based hybrid backpropagation-grammatical evolution neural network algorithm for discovering epistasis among quantitative trait Loci

    Background: Growing interest and burgeoning technology for discovering genetic mechanisms that influence disease processes have ushered in a flood of genetic association studies over the last decade, yet little heritability in highly studied complex traits has been explained by genetic variation. Non-additive gene-gene interactions, which are not often explored, are thought to be one source of this "missing" heritability. Methods: Stochastic methods employing evolutionary algorithms have demonstrated promise in being able to detect and model gene-gene and gene-environment interactions that influence human traits. Here we demonstrate modifications to a neural network algorithm in ATHENA (the Analysis Tool for Heritable and Environmental Network Associations) resulting in clear performance improvements for discovering gene-gene interactions that influence human traits. We employed an alternative tree-based crossover, backpropagation for locally fitting neural network weights, and incorporation of domain knowledge obtainable from publicly accessible biological databases for initializing the search for gene-gene interactions. We tested these modifications in silico using simulated datasets. Results: We show that the alternative tree-based crossover modification resulted in a modest increase in the sensitivity of the ATHENA algorithm for discovering gene-gene interactions. The performance increase was highly statistically significant when backpropagation was used to locally fit NN weights. We also demonstrate that using domain knowledge to initialize the search for gene-gene interactions results in a large performance increase, especially when the search space is larger than the search coverage. Conclusions: We show that a hybrid optimization procedure, alternative crossover strategies, and incorporation of domain knowledge from publicly available biological databases can result in marked increases in sensitivity and performance of the ATHENA algorithm for detecting and modelling gene-gene interactions that influence a complex human trait.
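
    The hybrid optimization step can be pictured with a small numpy sketch (an assumed form for illustration, not ATHENA's code): after evolution proposes a one-hidden-layer network, a few backpropagation steps locally refine its weights before the individual's fitness is evaluated.

        import numpy as np

        def backprop_refine(W1, W2, X, y, steps=50, lr=0.01):
            # Locally fit the weights of y_hat = sigmoid(X @ W1) @ W2 by gradient
            # descent on 0.5 * MSE, as a Lamarckian-style local search step.
            n = len(y)
            for _ in range(steps):
                H = 1.0 / (1.0 + np.exp(-X @ W1))                    # hidden activations
                err = H @ W2 - y                                     # residual for a quantitative trait
                grad_W2 = H.T @ err / n
                grad_W1 = X.T @ (np.outer(err, W2) * H * (1.0 - H)) / n
                W1 -= lr * grad_W1
                W2 -= lr * grad_W2
            H = 1.0 / (1.0 + np.exp(-X @ W1))
            return W1, W2, float(np.mean((H @ W2 - y) ** 2))         # refined weights and fitness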

    Bioinformatics challenges for genome-wide association studies

    Motivation: The sequencing of the human genome has made it possible to identify an informative set of >1 million single nucleotide polymorphisms (SNPs) across the genome that can be used to carry out genome-wide association studies (GWASs). The availability of massive amounts of GWAS data has necessitated the development of new biostatistical methods for quality control, imputation and analysis issues including multiple testing. This work has been successful and has enabled the discovery of new associations that have been replicated in multiple studies. However, it is now recognized that most SNPs discovered via GWAS have small effects on disease susceptibility and thus may not be suitable for improving health care through genetic testing. One likely explanation for the mixed results of GWAS is that the current biostatistical analysis paradigm is by design agnostic or unbiased in that it ignores all prior knowledge about disease pathobiology. Further, the linear modeling framework that is employed in GWAS often considers only one SNP at a time, thus ignoring their genomic and environmental context. There is now a shift away from the biostatistical approach toward a more holistic approach that recognizes the complexity of the genotype–phenotype relationship, which is characterized by significant heterogeneity and gene–gene and gene–environment interaction. We argue here that bioinformatics has an important role to play in addressing the complexity of the underlying genetic basis of common human diseases. The goal of this review is to identify and discuss those GWAS challenges that will require computational methods.
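
    The one-SNP-at-a-time paradigm the review critiques is easy to make concrete (illustrative code only; geno and case_status are hypothetical placeholder data, not from any study): a conventional analysis fits each SNP separately, whereas modelling context requires at least a joint fit with an interaction term.

        import numpy as np
        import statsmodels.api as sm

        rng = np.random.default_rng(1)
        geno = rng.integers(0, 3, size=(1000, 2))       # two SNPs coded 0/1/2 (placeholder data)
        case_status = rng.integers(0, 2, size=1000)     # placeholder case/control labels

        # Conventional GWAS-style test: one SNP at a time.
        single = sm.Logit(case_status, sm.add_constant(geno[:, [0]])).fit(disp=0)

        # Joint model with a SNP x SNP interaction term.
        X = np.column_stack([geno, geno[:, 0] * geno[:, 1]])
        joint = sm.Logit(case_status, sm.add_constant(X)).fit(disp=0)
        print(single.pvalues, joint.pvalues)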

    Accelerating epistasis analysis in human genetics with consumer graphics hardware

    BACKGROUND: Human geneticists are now capable of measuring more than one million DNA sequence variations from across the human genome. The new challenge is to develop computationally feasible methods capable of analyzing these data for associations with common human disease, particularly in the context of epistasis. Epistasis describes the situation where multiple genes interact in a complex non-linear manner to determine an individual's disease risk and is thought to be ubiquitous for common diseases. Multifactor Dimensionality Reduction (MDR) is an algorithm capable of detecting epistasis. An exhaustive analysis with MDR is often computationally expensive, particularly for high order interactions. This challenge has previously been met with parallel computation and expensive hardware. The option we examine here exploits commodity hardware designed for computer graphics. In modern computers Graphics Processing Units (GPUs) have more memory bandwidth and computational capability than Central Processing Units (CPUs) and are well suited to this problem. Advances in the video game industry have led to an economy of scale creating a situation where these powerful components are readily available at very low cost. Here we implement and evaluate the performance of the MDR algorithm on GPUs. Of primary interest are the time required for an epistasis analysis and the price to performance ratio of available solutions. FINDINGS: We found that using MDR on GPUs consistently increased performance per machine over both a feature rich Java software package and a C++ cluster implementation. The performance of a GPU workstation running a GPU implementation reduces computation time by a factor of 160 compared to an 8-core workstation running the Java implementation on CPUs. This GPU workstation performs similarly to 150 cores running an optimized C++ implementation on a Beowulf cluster. Furthermore this GPU system provides extremely cost effective performance while leaving the CPU available for other tasks. The GPU workstation containing three GPUs costs $2,000 while obtaining similar performance on a Beowulf cluster requires 150 CPU cores which, including the added infrastructure and support cost of the cluster system, cost approximately $82,500. CONCLUSION: Graphics hardware based computing provides a cost effective means to perform genetic analysis of epistasis using MDR on large datasets without the infrastructure of a computing cluster.
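
    The core MDR scoring that the paper accelerates can be sketched in a few lines of CPU code (a simplified sketch without cross-validation or balanced accuracy, not the GPU implementation): each two-locus genotype cell is labelled high- or low-risk by its case/control ratio, and the resulting rule is scored on the data.

        import numpy as np
        from itertools import combinations

        def mdr_score(geno_pair, status):
            # geno_pair: (n, 2) array of 0/1/2 genotypes; status: (n,) with 0=control, 1=case.
            cells = geno_pair[:, 0] * 3 + geno_pair[:, 1]                    # nine two-locus cells
            overall_ratio = status.sum() / max(1, (status == 0).sum())
            predictions = np.zeros_like(status)
            for c in np.unique(cells):
                mask = cells == c
                cases, controls = status[mask].sum(), (status[mask] == 0).sum()
                predictions[mask] = int(cases > overall_ratio * controls)    # high-risk cell?
            return (predictions == status).mean()

        def best_pair(genotypes, status):
            # The exhaustive pairwise scan that the paper offloads to the GPU.
            pairs = combinations(range(genotypes.shape[1]), 2)
            return max(pairs, key=lambda p: mdr_score(genotypes[:, list(p)], status))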

    Discovering Higher-order SNP Interactions in High-dimensional Genomic Data

    In this thesis, a multifactor dimensionality reduction method based on associative classification is employed to identify higher-order SNP interactions and thereby enhance understanding of the genetic architecture of complex diseases. Further, the thesis explores the application of deep learning techniques, providing new clues for interaction analysis. The performance of the deep learning method is maximized by unifying deep neural networks with a random forest to achieve reliable interaction detection in the presence of noise.
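
    One plausible reading of unifying a deep neural network with a random forest (an assumption for illustration, not necessarily the thesis' exact architecture) is a soft-voting ensemble that averages the two models' predicted probabilities to stabilise interaction calls under noise:

        from sklearn.ensemble import RandomForestClassifier, VotingClassifier
        from sklearn.neural_network import MLPClassifier

        model = VotingClassifier(
            estimators=[("mlp", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)),
                        ("rf", RandomForestClassifier(n_estimators=300))],
            voting="soft",                        # average predicted class probabilities
        )
        # model.fit(snp_matrix, labels)           # snp_matrix and labels are placeholders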

    A New Evolutionary Algorithm For Mining Noisy, Epistatic, Geospatial Survey Data Associated With Chagas Disease

    The scientific community is just beginning to understand some of the profound effects that feature interactions and heterogeneity have on natural systems. Despite the belief that these nonlinear and heterogeneous interactions exist across numerous real-world systems (e.g., from the development of personalized drug therapies to market predictions of consumer behaviors), the tools for analysis have not kept pace. This research was motivated by the desire to mine data from large socioeconomic surveys aimed at identifying the drivers of household infestation by a Triatomine insect that transmits the life-threatening Chagas disease. To decrease the risk of transmission, our colleagues at the laboratory of applied entomology and parasitology have implemented mitigation strategies (known as Ecohealth interventions); however, limited resources necessitate the search for better risk models. Mining these complex Chagas survey data for potential predictive features is challenging due to imbalanced class outcomes, missing data, heterogeneity, and the non-independence of some features. We develop an evolutionary algorithm (EA) to identify feature interactions in Big Datasets with desired categorical outcomes (e.g., disease or infestation). The method is non-parametric and uses the hypergeometric PMF as a fitness function to tackle challenges associated with using p-values in Big Data (e.g., p-values decrease inversely with the size of the dataset). To demonstrate the EA's effectiveness, we first test the algorithm on three benchmark datasets. These include two classic Boolean classifier problems: (1) the majority-on problem and (2) the multiplexer problem, as well as (3) a simulated single nucleotide polymorphism (SNP) disease dataset. Next, we apply the EA to real-world Chagas disease survey data and successfully identify numerous high-order feature interactions associated with infestation that would not have been discovered using traditional statistics. These feature interactions are also explored using network analysis. The spatial autocorrelation of the genetic data (SNPs of Triatoma dimidiata) was captured using geostatistics. Specifically, a modified semivariogram analysis was performed to characterize the SNP data and help elucidate the movement of the vector within two villages. For both villages, the SNP information showed strong spatial autocorrelation, albeit with different geostatistical characteristics (sills, ranges, and nuggets). These metrics were leveraged to create risk maps that suggest the more forested village had a sylvatic source of infestation, while the other village had a domestic/peridomestic source. This initial exploration into using Big Data to analyze disease risk shows that novel statistical tools and modifications of existing ones can improve the assessment of risk at a fine scale.
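
    The hypergeometric fitness idea mentioned above (preferred because p-values shrink with dataset size) can be sketched as follows; the function name and the boolean masks are assumptions for illustration, not the authors' code:

        from scipy.stats import hypergeom

        def hypergeom_fitness(selected_mask, case_mask):
            # Lower PMF values mean a more surprising enrichment of cases among the
            # records matched by a candidate feature combination.
            M = len(case_mask)                           # total records
            n = int(case_mask.sum())                     # total cases
            N = int(selected_mask.sum())                 # records matching the combination
            k = int((selected_mask & case_mask).sum())   # cases among the matches
            return hypergeom.pmf(k, M, n, N)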

    Mitmekesiste bioloogiliste andmete ühendamine ja analüüs (Integration and analysis of diverse biological data)

    The electronic version of this thesis does not include the publications. A fast advance in biotechnological innovation and decreasing production costs have led to an explosion of experimental data being produced in laboratories around the world. Individual experiments allow us to understand biological processes, e.g. diseases, from different angles. However, in order to get a systematic view of a disease it is necessary to combine these heterogeneous data. The large amounts of diverse data require building machine learning models that can help, for example, to identify which genes are related to disease. Additionally, there is a need to compose reliable integrated data sets that researchers can effectively work with. In this thesis we demonstrate how to combine and analyze different types of biological data using the example of three biological domains: Alzheimer's disease, immunology, and toxicology. More specifically, we combine data sets related to Alzheimer's disease into a novel heterogeneous network-based data set for Alzheimer's disease (HENA). We then apply graph convolutional networks, a state-of-the-art deep learning method, to the node classification task in HENA to find genes that are potentially associated with the disease. Combining patients' data related to immune disease helps to uncover its pathological mechanisms and to find better treatments in the future. We analyse laboratory data from patients' skin and blood samples by combining them with clinical information. Subsequently, we bring together the results of the individual analyses using available domain knowledge to form a more systematic view of the disease pathogenesis. Toxicity testing is the process of defining the harmful effects of substances for living organisms. One of its applications is the safety assessment of drugs or other chemicals for the human organism. In this work we identify groups of toxicants that have similar mechanisms of action. Additionally, we develop a classification model that allows the toxicity of unknown compounds to be assessed. https://www.ester.ee/record=b523255
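
    The node-classification step on HENA uses graph convolutional networks; a single propagation layer in the Kipf-and-Welling style looks roughly like the sketch below (a generic illustration, not the thesis' actual architecture or features):

        import numpy as np

        def gcn_layer(adjacency, features, weights):
            # One propagation step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).
            a_hat = adjacency + np.eye(adjacency.shape[0])           # add self-loops
            d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
            propagated = d_inv_sqrt @ a_hat @ d_inv_sqrt @ features
            return np.maximum(0.0, propagated @ weights)             # ReLU activation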

    The era of big data: Genome-scale modelling meets machine learning

    With omics data being generated at an unprecedented rate, genome-scale modelling has become pivotal in its organisation and analysis. However, machine learning methods have been gaining ground in cases where knowledge is insufficient to represent the mechanisms underlying such data, or as a means of data curation prior to attempting mechanistic modelling. We discuss the latest advances in genome-scale modelling and the development of optimisation algorithms for network and error reduction, intracellular constraining and applications to strain design. We further review applications of supervised and unsupervised machine learning methods to omics datasets from microbial and mammalian cell systems and present efforts to harness the potential of both modelling approaches through hybrid modelling.
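
    To ground the genome-scale modelling side of the discussion, flux balance analysis reduces to a linear programme (maximise a target flux subject to steady-state mass balance S v = 0 and flux bounds); the three-reaction network below is invented purely for illustration:

        import numpy as np
        from scipy.optimize import linprog

        S = np.array([[1, -1,  0],            # metabolite A: produced by uptake, consumed by r2
                      [0,  1, -1]])           # metabolite B: produced by r2, consumed by biomass
        bounds = [(0, 10), (0, 10), (0, 10)]  # flux bounds for the three reactions
        c = np.array([0.0, 0.0, -1.0])        # minimise -biomass, i.e. maximise the biomass flux
        result = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds)
        print("optimal biomass flux:", -result.fun)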