Simultaneous Matrix Diagonalization for Structural Brain Networks Classification
This paper considers the problem of brain disease classification based on
connectome data. A connectome is a network representation of a human brain. The
typical connectome classification problem is very challenging because of the
small sample size and high dimensionality of the data. We propose to use
simultaneous approximate diagonalization of adjacency matrices in order to
compute their eigenstructures in a more stable way. The obtained approximate
eigenvalues are further used as features for classification. The proposed
approach is demonstrated to be efficient for detection of Alzheimer's disease,
outperforming simple baselines and competing with state-of-the-art approaches
to brain disease classification
Using Bayesian Networks and Machine Learning to Predict Computer Science Success
Bayesian Networks and Machine Learning techniques were
evaluated and compared for predicting academic performance of Computer
Science students at the University of Cape Town. Bayesian Networks
performed similarly to other classification models. The causal links
inherent in Bayesian Networks allow for understanding of the contributing
factors for academic success in this field. The most effective indicators
of success in first-year ‘core’ courses in Computer Science included the
student’s scores for Mathematics and Physics as well as their aptitude for
learning and their work ethos. It was found that unsuccessful students
could be identified with ≈91% accuracy. This could help to increase
throughput as well as student wellbeing at university
A Machine Learning Trainable Model to Assess the Accuracy of Probabilistic Record Linkage
Record linkage (RL) is the process of identifying and linking data that relates to the same physical entity across multiple heterogeneous data sources. Deterministic linkage methods rely on the presence of common uniquely identifying attributes across all sources, while probabilistic approaches use non-unique attributes and calculate similarity indexes for pairwise comparisons. A key component of record linkage is accuracy assessment, the process of manually verifying and validating matched pairs to further refine linkage parameters and increase its overall effectiveness. This process, however, is time-consuming and impractical when applied to large administrative data sources where millions of records must be linked. Additionally, it is potentially biased, as the gold standard used is often the reviewer’s intuition. In this paper, we present an approach for assessing and refining the accuracy of probabilistic linkage based on different supervised machine learning methods (decision trees, naïve Bayes, logistic regression, random forest, linear support vector machines and gradient boosted trees). We used data sets extracted from huge Brazilian socioeconomic and public health care data sources. These models were evaluated using receiver operating characteristic plots, sensitivity, specificity and positive predictive values collected from a 10-fold cross-validation method. Results show that logistic regression outperforms other classifiers and enables the creation of a generalized, very accurate model to validate linkage results
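As an illustration of the evaluation measures named in the abstract above, the following minimal sketch computes sensitivity, specificity and positive predictive value from the confusion counts of a set of reviewed candidate pairs. The function name `linkage_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
def linkage_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard definitions of the three measures, from confusion counts.

    tp: true matches classified as matches
    fp: non-matches classified as matches
    tn: non-matches classified as non-matches
    fn: true matches classified as non-matches
    """
    return {
        # share of true matches that the classifier recovers
        "sensitivity": tp / (tp + fn),
        # share of true non-matches that the classifier rejects
        "specificity": tn / (tn + fp),
        # share of predicted matches that are true matches
        "ppv": tp / (tp + fp),
    }


# Hypothetical counts for one cross-validation fold
metrics = linkage_metrics(tp=90, fp=20, tn=80, fn=10)
```

In a 10-fold setting, these values would typically be computed per fold and then averaged.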
Investigating the Effect of Emoji in Opinion Classification of Uzbek Movie Review Comments
Opinion mining on social media posts has become more and more popular. Users
often express their opinion on a topic not only with words but they also use
image symbols such as emoticons and emoji. In this paper, we investigate the
effect of emoji-based features in opinion classification of Uzbek texts, and
more specifically movie review comments from YouTube. Several classification
algorithms are tested, and feature ranking is performed to evaluate the
discriminative ability of the emoji-based features.
A feature selection method for classification within functional genomics experiments based on the proportional overlapping score
Background: Microarray technology, as well as other functional genomics experiments, allows simultaneous measurements of thousands of genes within each sample. Both the prediction accuracy and interpretability of a classifier could be enhanced by performing the classification based only on selected discriminative genes. We propose a statistical method for selecting genes based on overlapping analysis of expression data across classes. This method results in a novel measure, called proportional overlapping score (POS), of a feature's relevance to a classification task. Results: We apply POS, along with four widely used gene selection methods, to several benchmark gene expression datasets. The experimental results of classification error rates computed using the Random Forest, k Nearest Neighbor and Support Vector Machine classifiers show that POS achieves a better performance. Conclusions: A novel gene selection method, POS, is proposed. POS analyzes the expression overlap across classes, taking into account the proportions of overlapping samples. It robustly defines a mask for each gene that allows it to minimize the effect of expression outliers. The constructed masks, along with a novel gene score, are exploited to produce the selected subset of genes
Predicting volume of distribution with decision tree-based regression methods using predicted tissue:plasma partition coefficients
Background: Volume of distribution is an important pharmacokinetic property that indicates the extent of a drug's distribution in the body tissues. This paper addresses the problem of how to estimate the apparent volume of distribution at steady state (Vss) of chemical compounds in the human body using decision tree-based regression methods from the area of data mining (or machine learning). Hence, the pros and cons of several different types of decision tree-based regression methods have been discussed. The regression methods predict Vss using, as predictive features, both the compounds' molecular descriptors and the compounds' tissue:plasma partition coefficients (Kt:p), often used in physiologically-based pharmacokinetics. Therefore, this work has assessed whether the data mining-based prediction of Vss can be made more accurate by using as input not only the compounds' molecular descriptors but also (a subset of) their predicted Kt:p values. Results: Comparison of the models that used only molecular descriptors, in particular, the Bagging decision tree (mean fold error of 2.33), with those employing predicted Kt:p values in addition to the molecular descriptors, such as the Bagging decision tree using adipose Kt:p (mean fold error of 2.29), indicated that the use of predicted Kt:p values as descriptors may be beneficial for accurate prediction of Vss using decision trees if prior feature selection is applied. Conclusions: Decision tree-based models presented in this work have an accuracy that is reasonable and similar to the accuracy of reported Vss inter-species extrapolations in the literature. The estimation of Vss for new compounds in drug discovery will benefit from methods that are able to integrate large and varied sources of data and flexible non-linear data mining methods such as decision trees, which can produce interpretable models. © 2015 Freitas et al.; licensee Springer
Human Communication Dynamics in Digital Footsteps: A Study of the Agreement between Self-Reported Ties and Email Networks
Digital communication data has created opportunities to advance the knowledge of human dynamics in many areas, including national security, behavioral health, and consumerism. While digital data uniquely captures the totality of a person's communication, past research consistently shows that a subset of contacts makes up a person's “social network” of unique resource providers. To address this gap, we analyzed the correspondence between self-reported social network data and email communication data with the objective of identifying the dynamics in e-communication that correlate with a person's perception of a significant network tie. First, we examined the predictive utility of three popular methods to derive social network data from email data based on volume and reciprocity of bilateral email exchanges. Second, we observed differences in the response dynamics along self-reported ties, allowing us to introduce and test a new method that incorporates time-resolved exchange data. Using a range of robustness checks for measurement and misreporting errors in self-report and email data, we find that the methods have similar predictive utility. Although e-communication has lowered communication costs with large numbers of persons, and potentially extended our number of, and reach to, contacts, our case results suggest that underlying behavioral patterns indicative of friendship or professional contacts continue to operate in a classical fashion in email interactions
Pruning of Error Correcting Output Codes by optimization of accuracy–diversity trade off
Ensemble learning is a method of combining learners to obtain more reliable and accurate predictions in supervised and unsupervised learning. However, the ensemble sizes are sometimes unnecessarily large, which leads to additional memory usage, computational overhead and decreased effectiveness. To overcome such side effects, pruning algorithms have been developed; since this is a combinatorial problem, finding the exact subset of ensembles is computationally infeasible. Different types of heuristic algorithms have been developed to obtain an approximate solution, but they lack a theoretical guarantee. Error Correcting Output Code (ECOC) is one of the well-known ensemble techniques for multiclass classification, which combines the outputs of binary base learners to predict the classes for multiclass data. In this paper, we propose a novel approach for pruning the ECOC matrix by utilizing accuracy and diversity information simultaneously. All existing pruning methods need the size of the ensemble as a parameter, so the performance of the pruning methods depends on the size of the ensemble. Our unparametrized pruning method is novel in being independent of the size of the ensemble. Experimental results show that our pruning method is mostly better than other existing approaches
Identification of early changes in specific symptoms that predict longer-term response to atypical antipsychotics in the treatment of patients with schizophrenia
Background: To identify a simple decision tree using early symptom change to predict response to atypical antipsychotic therapy in patients with (Diagnostic and Statistical Manual, Fourth Edition, Text Revised) chronic schizophrenia. Methods: Data were pooled from moderately to severely ill patients (n = 1494) from 6 randomized, double-blind trials (N = 2543). Response was defined as a ≥30% reduction in Positive and Negative Syndrome Scale (PANSS) Total score by Week 8 of treatment. Analyzed predictors were change in individual PANSS items at Weeks 1 and 2. A decision tree was constructed using classification and regression tree (CART) analysis to identify predictors that most effectively differentiated responders from non-responders. Results: A 2-branch, 6-item decision tree was created, producing 3 distinct groups. First branch criterion was a 2-point score decrease in at least 2 of 5 PANSS positive items (Week 2). Second branch criterion was a 2-point score decrease in the PANSS excitement item (Week 2). "Likely responders" met the first branch criteria; "likely non-responders" did not meet first or second criterion; "not predictable" patients did not meet the first but did meet the second criterion. Using this approach, response to treatment could be predicted in most patients (92%) with high positive predictive value (79%) and high negative predictive value (75%). Predictive findings were confirmed through analysis of data from 2 independent trials. Conclusions: Using a data-driven approach, we identified decision rules using early change in the scores of selected PANSS items to accurately predict longer-term treatment response or non-response to atypical antipsychotic therapy. This could lead to development of a simple quantitative evaluation tool to help guide early treatment decisions. Trial Registration: This is a retrospective, non-intervention study in which pooled results from 6 previously published reports were analyzed; thus, clinical trial registration is not required.
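The two-branch rule described in the Results of the abstract above can be sketched as a small function. This is an illustrative reading only, not the authors' implementation: the function and parameter names are hypothetical, the abstract does not say which five PANSS positive items were used (so they are passed in by the caller), and "a 2-point score decrease" is assumed here to mean a decrease of at least 2 points.

```python
from typing import Sequence


def classify_early_response(
    positive_item_decreases: Sequence[int],
    excitement_decrease: int,
    threshold: int = 2,   # assumed: "2-point decrease" means >= 2 points
    min_items: int = 2,   # first branch needs at least 2 of the 5 items
) -> str:
    """Assign one of the three groups from Week 2 score decreases.

    positive_item_decreases: baseline-minus-Week-2 decrease for each of the
        five PANSS positive items (larger value = more improvement).
    excitement_decrease: the same quantity for the PANSS excitement item.
    """
    # First branch: enough positive items improved by the threshold.
    improved = sum(1 for d in positive_item_decreases if d >= threshold)
    if improved >= min_items:
        return "likely responder"
    # Second branch: first criterion failed, excitement item improved.
    if excitement_decrease >= threshold:
        return "not predictable"
    # Neither criterion met.
    return "likely non-responder"


# Hypothetical patient: two positive items improved by 2 or more points
group = classify_early_response([2, 3, 0, 0, 1], excitement_decrease=0)
```

Keeping the thresholds as parameters makes the assumed reading of the criteria explicit and easy to change.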
- …