39,138 research outputs found
Recommended from our members
VarSight: prioritizing clinically reported variants with binary classification algorithms.
BackgroundWhen applying genomic medicine to a rare disease patient, the primary goal is to identify one or more genomic variants that may explain the patient's phenotypes. Typically, this is done through annotation, filtering, and then prioritization of variants for manual curation. However, prioritization of variants in rare disease patients remains a challenging task due to the high degree of variability in phenotype presentation and molecular source of disease. Thus, methods that can identify and/or prioritize variants to be clinically reported in the presence of such variability are of critical importance.MethodsWe tested the application of classification algorithms that ingest variant annotations along with phenotype information for predicting whether a variant will ultimately be clinically reported and returned to a patient. To test the classifiers, we performed a retrospective study on variants that were clinically reported to 237 patients in the Undiagnosed Diseases Network.ResultsWe treated the classifiers as variant prioritization systems and compared them to four variant prioritization algorithms and two single-measure controls. We showed that the trained classifiers outperformed all other tested methods with the best classifiers ranking 72% of all reported variants and 94% of reported pathogenic variants in the top 20.ConclusionsWe demonstrated how freely available binary classification algorithms can be used to prioritize variants even in the presence of real-world variability. Furthermore, these classifiers outperformed all other tested methods, suggesting that they may be well suited for working with real rare disease patient datasets
PASS: a simple classifier system for data analysis
Let x be a vector of predictors and y a scalar response associated with it. Consider the regression problem of inferring the relantionship between predictors and response on the basis of a sample of observed pairs (x,y). This is a familiar problem for which a variety of methods are available. This paper describes a new method based on the classifier system approach to problem solving. Classifier systems provide a rich framework for learning and induction, and they have been suc:cessfully applied in the artificial intelligence literature for some time. The present method emiches the simplest classifier system architecture with some new heuristic and explores its potential in a purely inferential context. A prototype called PASS (Predictive Adaptative Sequential System) has been built to test these ideas empirically. Preliminary Monte Carlo experiments indicate that PASS is able to discover the structure imposed on the data in a wide array of cases
Presymptomatic risk assessment for chronic non-communicable diseases
The prevalence of common chronic non-communicable diseases (CNCDs) far
overshadows the prevalence of both monogenic and infectious diseases combined.
All CNCDs, also called complex genetic diseases, have a heritable genetic
component that can be used for pre-symptomatic risk assessment. Common single
nucleotide polymorphisms (SNPs) that tag risk haplotypes across the genome
currently account for a non-trivial portion of the germ-line genetic risk and
we will likely continue to identify the remaining missing heritability in the
form of rare variants, copy number variants and epigenetic modifications. Here,
we describe a novel measure for calculating the lifetime risk of a disease,
called the genetic composite index (GCI), and demonstrate its predictive value
as a clinical classifier. The GCI only considers summary statistics of the
effects of genetic variation and hence does not require the results of
large-scale studies simultaneously assessing multiple risk factors. Combining
GCI scores with environmental risk information provides an additional tool for
clinical decision-making. The GCI can be populated with heritable risk
information of any type, and thus represents a framework for CNCD
pre-symptomatic risk assessment that can be populated as additional risk
information is identified through next-generation technologies.Comment: Plos ONE paper. Previous version was withdrawn to be updated by the
journal's pdf versio
Hyperspectral classification of Cyperus esculentus clones and morphologically similar weeds
Cyperus esculentus (yellow nutsedge) is one of the world's worst weeds as it can cause great damage to crops and crop production. To eradicate C. esculentus, early detection is key-a challenging task as it is often confused with other Cyperaceae and displays wide genetic variability. In this study, the objective was to classify C. esculentus clones and morphologically similar weeds. Hyperspectral reflectance between 500 and 800 nm was tested as a measure to discriminate between (I) C. esculentus and morphologically similar Cyperaceae weeds, and between (II) different clonal populations of C. esculentus using three classification models: random forest (RF), regularized logistic regression (RLR) and partial least squares-discriminant analysis (PLS-DA). RLR performed better than RF and PLS-DA, and was able to adequately classify the samples. The possibility of creating an affordable multispectral sensing tool, for precise in-field recognition of C. esculentus plants based on fewer spectral bands, was tested. Results of this study were compared against simulated results from a commercially available multispectral camera with four spectral bands. The model created with customized bands performed almost equally well as the original PLS-DA or RLR model, and much better than the model describing multispectral image data from a commercially available camera. These results open up the opportunity to develop a dedicated robust tool for C. esculentus recognition based on four spectral bands and an appropriate classification model
Automating biomedical data science through tree-based pipeline optimization
Over the past decade, data science and machine learning has grown from a
mysterious art form to a staple tool across a variety of fields in academia,
business, and government. In this paper, we introduce the concept of tree-based
pipeline optimization for automating one of the most tedious parts of machine
learning---pipeline design. We implement a Tree-based Pipeline Optimization
Tool (TPOT) and demonstrate its effectiveness on a series of simulated and
real-world genetic data sets. In particular, we show that TPOT can build
machine learning pipelines that achieve competitive classification accuracy and
discover novel pipeline operators---such as synthetic feature
constructors---that significantly improve classification accuracy on these data
sets. We also highlight the current challenges to pipeline optimization, such
as the tendency to produce pipelines that overfit the data, and suggest future
research paths to overcome these challenges. As such, this work represents an
early step toward fully automating machine learning pipeline design.Comment: 16 pages, 5 figures, to appear in EvoBIO 2016 proceeding
Inference in classifier systems
Classifier systems (Css) provide a rich framework for learning and induction, and they have beenı successfully applied in the artificial intelligence literature for some time. In this paper, both theı architecture and the inferential mechanisms in general CSs are reviewed, and a number of limitations and extensions of the basic approach are summarized. A system based on the CS approach that is capable of quantitative data analysis is outlined and some of its peculiarities discussed
Genetic Classification of Populations using Supervised Learning
There are many instances in genetics in which we wish to determine whether
two candidate populations are distinguishable on the basis of their genetic
structure. Examples include populations which are geographically separated,
case--control studies and quality control (when participants in a study have
been genotyped at different laboratories). This latter application is of
particular importance in the era of large scale genome wide association
studies, when collections of individuals genotyped at different locations are
being merged to provide increased power. The traditional method for detecting
structure within a population is some form of exploratory technique such as
principal components analysis. Such methods, which do not utilise our prior
knowledge of the membership of the candidate populations. are termed
\emph{unsupervised}. Supervised methods, on the other hand are able to utilise
this prior knowledge when it is available.
In this paper we demonstrate that in such cases modern supervised approaches
are a more appropriate tool for detecting genetic differences between
populations. We apply two such methods, (neural networks and support vector
machines) to the classification of three populations (two from Scotland and one
from Bulgaria). The sensitivity exhibited by both these methods is considerably
higher than that attained by principal components analysis and in fact
comfortably exceeds a recently conjectured theoretical limit on the sensitivity
of unsupervised methods. In particular, our methods can distinguish between the
two Scottish populations, where principal components analysis cannot. We
suggest, on the basis of our results that a supervised learning approach should
be the method of choice when classifying individuals into pre-defined
populations, particularly in quality control for large scale genome wide
association studies.Comment: Accepted PLOS On
Classification systems offer a microcosm of issues in conceptual processing: A commentary on Kemmerer (2016)
This is a commentary on Kemmerer (2016), Categories of Object Concepts Across Languages and Brains: The Relevance of Nominal Classification Systems to Cognitive Neuroscience, DOI: 10.1080/23273798.2016.1198819
- …