Search CORE

1,907 research outputs found

Interpretable machine learning for genomics

Author: Watson DS
Publication venue
Publication date: 20/10/2021
Field of study

High-throughput technologies such as next-generation sequencing allow biologists to observe cell function with unprecedented resolution, but the resulting datasets are too large and complicated for humans to understand without the aid of advanced statistical methods. Machine learning (ML) algorithms, which are designed to automatically find patterns in data, are well suited to this task. Yet these models are often so complex as to be opaque, leaving researchers with few clues about underlying mechanisms. Interpretable machine learning (iML) is a burgeoning subdiscipline of computational statistics devoted to making the predictions of ML models more intelligible to end users. This article is a gentle and critical introduction to iML, with an emphasis on genomic applications. I define relevant concepts, motivate leading methodologies, and provide a simple typology of existing approaches. I survey recent examples of iML in genomics, demonstrating how such techniques are increasingly integrated into research workflows. I argue that iML solutions are required to realize the promise of precision medicine. However, several open challenges remain. I examine the limitations of current state-of-the-art tools and propose a number of directions for future research. While the horizon for iML in genomics is wide and bright, continued progress requires close collaboration across disciplines

UCL Discovery

Protein Networks as Logic Functions in Development and Cancer

Author: A Bureau
A Chariot
A Ransick
AS Sultan
B Alberts
BJ Frey
BME Moret
C Kingsford
C Lefebvre
CL Smith
CW Roberts
D Hanahan
D Lang
D Opitz
DR Rhodes
E Lee
E Segal
E Segal
EH Davidson
EH Davidson
EV Prochownik
F Rapaport
FJ Muller
H Aizawa
HJ Cordell
HS Phillips
HY Chuang
I Ulitsky
I Ulitsky
I Ulitsky
IA Stasinopoulos
IW Taylor
J Lessard
JA Blake
Janusz Dutkowski
KG Becker
L Bei
L Breiman
L Breiman
L Ein-Dor
L Ein-Dor
L Ho
L Ho
L Meng
LH Hartwell
LM Bundy
M Ashburner
M Dramiński
M Kang
M Wozniak
ME Higgins
MJ van de Vijver
MQ Hassan
MS Carro
MS Cline
R Ren
RK Nibbe
Russ B. Altman
S Efroni
SA Chowdhury
SC Materna
SH Li
T Hwang
T Ideker
T Ideker
T Ravasi
TM Williams
Trey Ideker
W Huang da
X Yang
Y Kwon
Y Ono
Y Wang
Publication venue: Public Library of Science
Publication date: 01/09/2011
Field of study

Many biological and clinical outcomes are based not on single proteins, but on modules of proteins embedded in protein networks. A fundamental question is how the proteins within each module contribute to the overall module activity. Here, we study the modules underlying three representative biological programs related to tissue development, breast cancer metastasis, or progression of brain cancer, respectively. For each case we apply a new method, called Network-Guided Forests, to identify predictive modules together with logic functions which tie the activity of each module to the activity of its component genes. The resulting modules implement a diverse repertoire of decision logic which cannot be captured using the simple approximations suggested in previous work such as gene summation or subtraction. We show that in cancer, certain combinations of oncogenes and tumor suppressors exert competing forces on the system, suggesting that medical genetics should move beyond cataloguing individual cancer genes to cataloguing their combinatorial logic

Crossref

Directory of Open Access Journals

PubMed Central

Predicting gene expression in the human malaria parasite Plasmodium falciparum using histone modification, nucleosome positioning, and 3D localization features.

Author: Cook Kate
Le Roch Karine G
Lu Yang Y
Noble William Stafford
Read David F
Publication venue: eScholarship, University of California
Publication date: 01/09/2019
Field of study

Empirical evidence suggests that the malaria parasite Plasmodium falciparum employs a broad range of mechanisms to regulate gene transcription throughout the organism's complex life cycle. To better understand this regulatory machinery, we assembled a rich collection of genomic and epigenomic data sets, including information about transcription factor (TF) binding motifs, patterns of covalent histone modifications, nucleosome occupancy, GC content, and global 3D genome architecture. We used these data to train machine learning models to discriminate between high-expression and low-expression genes, focusing on three distinct stages of the red blood cell phase of the Plasmodium life cycle. Our results highlight the importance of histone modifications and 3D chromatin architecture in Plasmodium transcriptional regulation and suggest that AP2 transcription factors may play a limited regulatory role, perhaps operating in conjunction with epigenetic factors

Directory of Open Access Journals

eScholarship - University of California

Automated Feature Engineering for Deep Neural Networks with Genetic Programming

Author: Heaton Jeff T.
Publication venue: NSUWorks
Publication date: 01/01/2017
Field of study

Feature engineering is a process that augments the feature vector of a machine learning model with calculated values that are designed to enhance the accuracy of a model’s predictions. Research has shown that the accuracy of models such as deep neural networks, support vector machines, and tree/forest-based algorithms sometimes benefit from feature engineering. Expressions that combine one or more of the original features usually create these engineered features. The choice of the exact structure of an engineered feature is dependent on the type of machine learning model in use. Previous research demonstrated that various model families benefit from different types of engineered feature. Random forests, gradient-boosting machines, or other tree-based models might not see the same accuracy gain that an engineered feature allowed neural networks, generalized linear models, or other dot-product based models to achieve on the same data set. This dissertation presents a genetic programming-based algorithm that automatically engineers features that increase the accuracy of deep neural networks for some data sets. For a genetic programming algorithm to be effective, it must prioritize the search space and efficiently evaluate what it finds. This dissertation algorithm faced a potential search space composed of all possible mathematical combinations of the original feature vector. Five experiments were designed to guide the search process to efficiently evolve good engineered features. The result of this dissertation is an automated feature engineering (AFE) algorithm that is computationally efficient, even though a neural network is used to evaluate each candidate feature. This approach gave the algorithm a greater opportunity to specifically target deep neural networks in its search for engineered features that improve accuracy. Finally, a sixth experiment empirically demonstrated the degree to which this algorithm improved the accuracy of neural networks on data sets augmented by the algorithm’s engineered features

NSU Works

An Interpretable Stroke Prediction Model using Rules and Bayesian Analysis

Author: Letham Benjamin
Madigan David
McCormick Tyler H.
Rudin Cynthia
Publication venue
Publication date: 10/10/2013
Field of study

We aim to produce predictive models that are not only accurate, but are also interpretable to human experts. Our models are decision lists, which consist of a series of if...then... statements (for example, if high blood pressure, then stroke) that discretize a high-dimensional, multivariate feature space into a series of simple, readily inter- pretable decision statements. We introduce a generative model called the Bayesian List Machine which yields a posterior distribution over possible decision lists. It employs a novel prior structure to encourage sparsity. Our experiments show that the Bayesian List Machine has predictive accuracy on par with the current top algorithms for prediction in machine learning. Our method is motivated by recent developments in personalized medicine, and can be used to produce highly accurate and interpretable medical scoring systems. We demonstrate this by producing an alternative to the CHADS2 score, actively used in clinical practice for estimating the risk of stroke in patients that have atrial brillation. Our model is as interpretable as CHADS2, but more accurate

CiteSeerX

DSpace@MIT

Machine learning for network based intrusion detection: an investigation into discrepancies in findings with the KDD cup '99 data set and multi-objective evolution of neural network classifier ensembles from imbalanced data.

Author: Engen Vegard
Publication venue
Publication date
Field of study

For the last decade it has become commonplace to evaluate machine learning techniques for network based intrusion detection on the KDD Cup '99 data set. This data set has served well to demonstrate that machine learning can be useful in intrusion detection. However, it has undergone some criticism in the literature, and it is out of date. Therefore, some researchers question the validity of the findings reported based on this data set. Furthermore, as identified in this thesis, there are also discrepancies in the findings reported in the literature. In some cases the results are contradictory. Consequently, it is difficult to analyse the current body of research to determine the value in the findings. This thesis reports on an empirical investigation to determine the underlying causes of the discrepancies. Several methodological factors, such as choice of data subset, validation method and data preprocessing, are identified and are found to affect the results significantly. These findings have also enabled a better interpretation of the current body of research. Furthermore, the criticisms in the literature are addressed and future use of the data set is discussed, which is important since researchers continue to use it due to a lack of better publicly available alternatives. Due to the nature of the intrusion detection domain, there is an extreme imbalance among the classes in the KDD Cup '99 data set, which poses a significant challenge to machine learning. In other domains, researchers have demonstrated that well known techniques such as Artificial Neural Networks (ANNs) and Decision Trees (DTs) often fail to learn the minor class(es) due to class imbalance. However, this has not been recognized as an issue in intrusion detection previously. This thesis reports on an empirical investigation that demonstrates that it is the class imbalance that causes the poor detection of some classes of intrusion reported in the literature. An alternative approach to training ANNs is proposed in this thesis, using Genetic Algorithms (GAs) to evolve the weights of the ANNs, referred to as an Evolutionary Neural Network (ENN). When employing evaluation functions that calculate the fitness proportionally to the instances of each class, thereby avoiding a bias towards the major class(es) in the data set, significantly improved true positive rates are obtained whilst maintaining a low false positive rate. These findings demonstrate that the issues of learning from imbalanced data are not due to limitations of the ANNs; rather the training algorithm. Moreover, the ENN is capable of detecting a class of intrusion that has been reported in the literature to be undetectable by ANNs. One limitation of the ENN is a lack of control of the classification trade-off the ANNs obtain. This is identified as a general issue with current approaches to creating classifiers. Striving to create a single best classifier that obtains the highest accuracy may give an unfruitful classification trade-off, which is demonstrated clearly in this thesis. Therefore, an extension of the ENN is proposed, using a Multi-Objective GA (MOGA), which treats the classification rate on each class as a separate objective. This approach produces a Pareto front of non-dominated solutions that exhibit different classification trade-offs, from which the user can select one with the desired properties. The multi-objective approach is also utilised to evolve classifier ensembles, which yields an improved Pareto front of solutions. Furthermore, the selection of classifier members for the ensembles is investigated, demonstrating how this affects the performance of the resultant ensembles. This is a key to explaining why some classifier combinations fail to give fruitful solutions

Bournemouth University Research Online

Assessing atypical brain functional connectivity development : an approach based on generative adversarial networks

Author: Biazoli Junior Claudinei Eduardo
Gadelha Ary
Mendes Sérgio Leonardo
Miguel Eurípedes Constantino
Rohde Luis Augusto Paim
Salum Junior Giovanni Abrahão
Santos Pedro Machado Nery dos
Sato João Ricardo
Publication venue
Publication date: 01/01/2023
Field of study

Generative Adversarial Networks (GANs) are promising analytical tools in machine learning applications. Characterizing atypical neurodevelopmental processes might be useful in establishing diagnostic and prognostic biomarkers of psychiatric disorders. In this article, we investigate the potential of GANs models combined with functional connectivity (FC) measures to build a predictive neurotypicality score 3-years after scanning. We used a ROI-to-ROI analysis of resting-state functional magnetic resonance imaging (fMRI) data from a community-based cohort of children and adolescents (377 neurotypical and 126 atypical participants). Models were trained on data from neurotypical participants, capturing their sample variability of FC. The discriminator subnetwork of each GAN model discriminated between the learned neurotypical functional connectivity pattern and atypical or unrelated patterns. Discriminator models were combined in ensembles, improving discrimination performance. Explanations for the model’s predictions are provided using the LIME (Local Interpretable Model-Agnostic) algorithm and local hubs are identified in light of these explanations. Our findings suggest this approach is a promising strategy to build potential biomarkers based on functional connectivity

Lume 5.8

PubMed Central

Contributions on distance-based algorithms, multi-classifier construction and pairwise classification

Author: Mendialdua Beitia Iñigo
Publication venue
Publication date: 01/01/2015
Field of study

179 p.Aurkezten den ikerketa lan honetan saikapen atazak landu dira, non helburua,sailkapen gainbegiratuaren artearen-egoera aberastea izan den. Sailkapengainbegiratuaren zenbait estrategi analizatu dira, beraien ezaugarri etaahuleziak aztertuz. Beraz, ezaugarri positiboak mantenduz, ahuleziak hobetzekosaiakera egin da. Hau burutu ahal izateko, sailkapen gainbegiratuarenzenbait estrategi konbinatzeaz gain, zenbait bilaketa heuristiko ere erabili dira.Sailkapen gainbegiratuko 3 ikerketa lerro desberdinetan burutu dira ekarpenak.Aurkezten diren lehenengo proposamenak, K-NN algoritmoan zentratzendira, honen zenbait bertsio aurkezten direlarik. Ondoren sailkatzaileen konbinaketarekinerlazionatutako beste lan bat aurkezten da. Eta azkenik, binakakosailkapenaren zenbait estrategi berritzaile proposatzen dira. Ekarpenhauek aldizkari edo konferentzi internazionaletan publikatuak edo bidaliakizan dira.Buruturiko experimentuetan, proposatutako algoritmoak artearen-estatuanaurkituriko zenbait algoritmorekin konparatu dira, emaitza interesgarriak lortuaz.Honetaz gain, emaitza hauetatik ondorio esanguratsuak eskuratzeko asmoz,test estatistikoen erabilera ere burutu da

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Archivo Digital para la Docencia y la Investigación