How does predicate invention affect human comprehensibility?
During the 1980s Michie defined Machine Learning in terms of two orthogonal axes of performance: predictive accuracy and comprehensibility of generated hypotheses. Since predictive accuracy was readily measurable and comprehensibility not so, later definitions in the 1990s, such as that of Mitchell, tended to use a one-dimensional approach to Machine Learning based solely on predictive accuracy, ultimately favouring statistical over symbolic Machine Learning approaches. In this paper we provide a definition of comprehensibility of hypotheses which can be estimated using human participant trials. We present the results of experiments testing human comprehensibility of logic programs learned with and without predicate invention. Results indicate that comprehensibility is affected not only by the complexity of the presented program but also by the existence of anonymous predicate symbols.
Deeply sequenced metagenome and metatranscriptome of a biogas-producing microbial community from an agricultural production-scale biogas plant
Bremges A, Maus I, Belmann P, et al. Deeply sequenced metagenome and metatranscriptome of a biogas-producing microbial community from an agricultural production-scale biogas plant. GigaScience. 2015;4(1):33.
Background
The production of biogas takes place under anaerobic conditions and involves microbial decomposition of organic matter. Most of the participating microbes are still unknown and non-cultivable. Accordingly, shotgun metagenome sequencing is currently the method of choice to obtain insights into community composition and the genetic repertoire.
Findings
Here, we report on the deeply sequenced metagenome and metatranscriptome of a complex biogas-producing microbial community from an agricultural production-scale biogas plant. We assembled the metagenome and, as an example application, show that we reconstructed most genes involved in the methane metabolism, a key pathway involving methanogenesis performed by methanogenic Archaea. This result indicates that there is sufficient sequencing coverage for most downstream analyses.
Conclusions
Sequenced at least one order of magnitude deeper than previous studies, our metagenome data will enable new insights into community composition and the genetic potential of important community members. Moreover, mapping of transcripts to reconstructed genome sequences will enable the identification of active metabolic pathways in target organisms.
Using Bayesian Networks and Machine Learning to Predict Computer Science Success
Bayesian Networks and Machine Learning techniques were evaluated and compared for predicting academic performance of Computer Science students at the University of Cape Town. Bayesian Networks performed similarly to other classification models. The causal links inherent in Bayesian Networks allow for understanding of the contributing factors for academic success in this field. The most effective indicators of success in first-year ‘core’ courses in Computer Science included the student’s scores for Mathematics and Physics as well as their aptitude for learning and their work ethos. It was found that unsuccessful students could be identified with ≈91% accuracy. This could help to increase throughput as well as student wellbeing at university.
Machine Learning Classification of Females Susceptibility to Visceral Fat Associated Diseases
The problem of classifying subjects into risk categories is a common challenge in medical research. Machine Learning (ML) methods are widely used in the areas of risk prediction and classification. The primary objective of these algorithms is to predict dichotomous responses (e.g. healthy/at risk) based on several features. Like statistical inference models, ML models are subject to the common problem of class imbalance: they are biased towards the majority class, which increases the false-negative rate.
In this paper, we built and evaluated eighteen ML models classifying approximately 4300 female participants from the UK Biobank into three categorical risk statuses based on responses for the discretised visceral adipose tissue values from magnetic resonance imaging. We also examined the effect of sampling techniques on classification modelling when dealing with class imbalance.
Results showed that the use of sampling techniques had a significant impact. They not only drove an improvement in predicting patients' risk status but also facilitated an increase in the information contained within each variable. Based on domain experts' criteria, the three best models for classification were finally identified.
These encouraging results will guide further developments of classification models for predicting visceral adipose tissue without the need for a costly scan.
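The abstract does not name the sampling techniques that were examined. As a minimal sketch of one common option, the Python snippet below applies random oversampling, which duplicates minority-class rows until all classes are the same size. The function name, the toy rows and the three labels are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy data: three risk categories with an imbalanced split.
rows = [[i] for i in range(10)]
labels = ["low"] * 6 + ["medium"] * 3 + ["high"] * 1
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Oversampling of this kind should be applied to the training split only; resampling before the train/test split leaks duplicated rows into the evaluation data.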
Cancer cells exploit an orphan RNA to drive metastatic progression.
Here we performed a systematic search to identify breast-cancer-specific small noncoding RNAs, which we have collectively termed orphan noncoding RNAs (oncRNAs). We subsequently discovered that one of these oncRNAs, which originates from the 3' end of TERC, acts as a regulator of gene expression and is a robust promoter of breast cancer metastasis. This oncRNA, which we have named T3p, exerts its prometastatic effects by acting as an inhibitor of RISC complex activity and increasing the expression of the prometastatic genes NUPR1 and PANX2. Furthermore, we have shown that oncRNAs are present in cancer-cell-derived extracellular vesicles, raising the possibility that these circulating oncRNAs may also have a role in non-cell autonomous disease pathogenesis. Additionally, these circulating oncRNAs present a novel avenue for cancer fingerprinting using liquid biopsies.
Ultra-Strong Machine Learning: comprehensibility of programs learned with ILP
During the 1980s Michie defined Machine Learning in terms of two orthogonal axes of performance: predictive accuracy and comprehensibility of generated hypotheses. Since predictive accuracy was readily measurable and comprehensibility not so, later definitions in the 1990s, such as Mitchell’s, tended to use a one-dimensional approach to Machine Learning based solely on predictive accuracy, ultimately favouring statistical over symbolic Machine Learning approaches. In this paper we provide a definition of comprehensibility of hypotheses which can be estimated using human participant trials. We present two sets of experiments testing human comprehensibility of logic programs. In the first experiment we test human comprehensibility with and without predicate invention. Results indicate comprehensibility is affected not only by the complexity of the presented program but also by the existence of anonymous predicate symbols. In the second experiment we directly test whether any state-of-the-art ILP systems are ultra-strong learners in Michie’s sense, and select the Metagol system for use in human trials. Results show participants were not able to learn the relational concept on their own from a set of examples but they were able to apply the relational definition provided by the ILP system correctly. This implies the existence of a class of relational concepts which are hard to acquire for humans, though easy to understand given an abstract explanation. We believe improved understanding of this class could have potential relevance to contexts involving human learning, teaching and verbal interaction.
Multiple Imputation Ensembles (MIE) for dealing with missing data
Missing data is a significant issue in many real-world datasets, yet there are no robust methods for dealing with it appropriately. In this paper, we propose a robust approach to dealing with missing data in classification problems: Multiple Imputation Ensembles (MIE). Our method integrates two approaches, multiple imputation and ensemble methods, and compares two types of ensembles: bagging and stacking. We also propose a robust experimental set-up using 20 benchmark datasets from the UCI machine learning repository. For each dataset, we introduce increasing amounts of data Missing Completely at Random. First, we use a number of single/multiple imputation methods to recover the missing values and then ensemble a number of different classifiers built on the imputed data. We assess the quality of the imputation by using dissimilarity measures. We also evaluate the MIE performance by comparing classification accuracy on the complete and imputed data. Furthermore, we use the accuracy of simple imputation as a benchmark for comparison. We find that our proposed approach combining multiple imputation with ensemble techniques outperforms others, particularly as the amount of missing data increases.
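The pipeline described above (remove values completely at random, impute several times, train one classifier per imputed copy, combine the classifiers) can be sketched roughly as follows. This is a simplified illustration and not the authors' implementation: the hot-deck style imputation, the nearest-centroid classifier, the majority vote and all function names are assumptions chosen to keep the example dependency-free, and it does not reproduce the bagging or stacking ensembles compared in the paper.

```python
import random
from collections import Counter


def make_mcar(rows, rate, rng):
    """Blank out each value independently with probability `rate` (MCAR)."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def impute_once(rows, rng):
    """One stochastic imputation: fill each gap with a randomly drawn
    observed value from the same column (0.0 if the column is empty)."""
    n_cols = len(rows[0])
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_cols)]
    return [
        [(rng.choice(observed[j]) if observed[j] else 0.0) if v is None else v
         for j, v in enumerate(row)]
        for row in rows
    ]


def fit_centroids(rows, labels):
    """Nearest-centroid classifier: store the mean vector of each class."""
    sums, counts = {}, Counter(labels)
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], row)))


def ensemble_predict(models, row):
    """Majority vote over the models trained on the imputed copies."""
    votes = Counter(predict(m, row) for m in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
rng = random.Random(0)
rows = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(30)]
        + [[rng.gauss(10, 1), rng.gauss(10, 1)] for _ in range(30)])
labels = ["a"] * 30 + ["b"] * 30

incomplete = make_mcar(rows, rate=0.2, rng=rng)
imputed_copies = [impute_once(incomplete, rng) for _ in range(5)]
models = [fit_centroids(copy, labels) for copy in imputed_copies]
print(ensemble_predict(models, [0.0, 0.0]), ensemble_predict(models, [10.0, 10.0]))
```

Training one model per imputed copy lets the ensemble average over the uncertainty in the filled-in values, which is the intuition behind combining multiple imputation with ensembles.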
Predicting cell types and genetic variations contributing to disease by combining GWAS and epigenetic data
Genome-wide association studies (GWASs) identify single nucleotide polymorphisms (SNPs) that are enriched in individuals suffering from a given disease. Most disease-associated SNPs fall into non-coding regions, so that it is not straightforward to infer phenotype or function; moreover, many SNPs are in tight genetic linkage, so that a SNP identified as associated with a particular disease may not itself be causal, but rather signify the presence of a linked SNP that is functionally relevant to disease pathogenesis. Here, we present an analysis method that takes advantage of the recent rapid accumulation of epigenomics data to address these problems for some SNPs. Using asthma as a prototypic example, we show that non-coding disease-associated SNPs are enriched in genomic regions that function as regulators of transcription, such as enhancers and promoters. Identifying enhancers based on the presence of histone modification marks such as H3K4me1 in different cell types, we show that the location of enhancers is highly cell-type specific. We use these findings to predict which SNPs are likely to be directly contributing to disease based on their presence in regulatory regions, and in which cell types their effect is expected to be detectable. Moreover, we can also predict which cell types contribute to a disease based on overlap of the disease-associated SNPs with the locations of enhancers present in a given cell type. Finally, we suggest that it will be possible to re-analyze GWAS studies with much higher power by limiting the SNPs considered to those in coding or regulatory regions of cell types relevant to a given disease.
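The overlap step described above, checking which disease-associated SNPs fall inside enhancers of each cell type, can be illustrated with a small sketch. It assumes SNPs are given as (name, chromosome, position) tuples and enhancers as half-open (chromosome, start, end) intervals. The coordinates, cell-type names and function names are hypothetical, and the raw counts computed here are not the enrichment analysis used in the paper.

```python
def snps_in_regions(snps, regions):
    """Return the names of SNPs whose position lies inside any region.
    Regions are half-open intervals: start <= pos < end."""
    hits = []
    for name, chrom, pos in snps:
        if any(chrom == c and start <= pos < end for c, start, end in regions):
            hits.append(name)
    return hits


def rank_cell_types(snps, enhancers_by_cell_type):
    """Rank cell types by how many SNPs overlap their enhancers."""
    overlap = {
        cell: snps_in_regions(snps, regions)
        for cell, regions in enhancers_by_cell_type.items()
    }
    return sorted(overlap.items(), key=lambda item: len(item[1]), reverse=True)


# Hypothetical coordinates, for illustration only.
snps = [("snp1", "chr1", 150), ("snp2", "chr1", 900), ("snp3", "chr2", 50)]
enhancers_by_cell_type = {
    "cell_type_A": [("chr1", 100, 200), ("chr2", 0, 100)],
    "cell_type_B": [("chr1", 850, 950)],
}

ranking = rank_cell_types(snps, enhancers_by_cell_type)
for cell, hits in ranking:
    print(cell, hits)
```

A real analysis would also need a background model (for example, matched control SNPs) to decide whether an observed overlap is larger than expected by chance.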