261 research outputs found

    Simultaneous Matrix Diagonalization for Structural Brain Networks Classification

    This paper considers the problem of brain disease classification based on connectome data. A connectome is a network representation of a human brain. Connectome classification is typically very challenging because of the small sample size and high dimensionality of the data. We propose to use simultaneous approximate diagonalization of adjacency matrices in order to compute their eigenstructures in a more stable way. The resulting approximate eigenvalues are then used as features for classification. The proposed approach is demonstrated to be efficient for detection of Alzheimer's disease, outperforming simple baselines and competing with state-of-the-art approaches to brain disease classification.

    Using Bayesian Networks and Machine Learning to Predict Computer Science Success

    Bayesian Networks and Machine Learning techniques were evaluated and compared for predicting academic performance of Computer Science students at the University of Cape Town. Bayesian Networks performed similarly to other classification models. The causal links inherent in Bayesian Networks allow for understanding of the contributing factors for academic success in this field. The most effective indicators of success in first-year ‘core’ courses in Computer Science included the student’s scores for Mathematics and Physics as well as their aptitude for learning and their work ethos. It was found that unsuccessful students could be identified with ≈91% accuracy. This could help to increase throughput as well as student wellbeing at university.

    A Machine Learning Trainable Model to Assess the Accuracy of Probabilistic Record Linkage

    Record linkage (RL) is the process of identifying and linking data that relates to the same physical entity across multiple heterogeneous data sources. Deterministic linkage methods rely on the presence of common uniquely identifying attributes across all sources, while probabilistic approaches use non-unique attributes and calculate similarity indexes for pairwise comparisons. A key component of record linkage is accuracy assessment, the process of manually verifying and validating matched pairs to further refine linkage parameters and increase its overall effectiveness. This process, however, is time-consuming and impractical when applied to large administrative data sources where millions of records must be linked. Additionally, it is potentially biased, as the gold standard used is often the reviewer’s intuition. In this paper, we present an approach for assessing and refining the accuracy of probabilistic linkage based on different supervised machine learning methods (decision trees, naïve Bayes, logistic regression, random forest, linear support vector machines and gradient boosted trees). We used data sets extracted from huge Brazilian socioeconomic and public health care data sources. These models were evaluated using receiver operating characteristic plots, sensitivity, specificity and positive predictive values collected from a 10-fold cross-validation method. Results show that logistic regression outperforms other classifiers and enables the creation of a generalized, very accurate model to validate linkage results.
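
    The evaluation measures named in this abstract (sensitivity, specificity and positive predictive value) can be illustrated with a minimal sketch. The function name `linkage_metrics` and the counts are hypothetical and not taken from the paper; the formulas are the standard confusion-matrix definitions, where a "positive" is a record pair judged to be a true match.

```python
def linkage_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard confusion-matrix metrics for matched record pairs.

    tp/fp: pairs predicted as matches that are / are not true matches.
    tn/fn: pairs predicted as non-matches that are / are not true non-matches.
    Assumes every denominator below is non-zero.
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of true matches that were found
        "specificity": tn / (tn + fp),  # share of true non-matches rejected
        "ppv": tp / (tp + fp),          # share of predicted matches that are correct
    }


# Hypothetical counts, for illustration only.
metrics = linkage_metrics(tp=90, fp=5, tn=95, fn=10)
```

    In a 10-fold cross-validation such as the one the abstract mentions, these values would be computed per fold and then summarized; how the authors aggregate them is not stated here.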

    Investigating the Effect of Emoji in Opinion Classification of Uzbek Movie Review Comments

    Opinion mining on social media posts has become more and more popular. Users often express their opinion on a topic not only with words but also with image symbols such as emoticons and emoji. In this paper, we investigate the effect of emoji-based features in opinion classification of Uzbek texts, and more specifically movie review comments from YouTube. Several classification algorithms are tested, and feature ranking is performed to evaluate the discriminative ability of the emoji-based features. Comment: 10 pages, 1 figure, 3 tables

    A feature selection method for classification within functional genomics experiments based on the proportional overlapping score

    Background: Microarray technology, as well as other functional genomics experiments, allows simultaneous measurements of thousands of genes within each sample. Both the prediction accuracy and interpretability of a classifier could be enhanced by performing the classification based only on selected discriminative genes. We propose a statistical method for selecting genes based on overlapping analysis of expression data across classes. This method results in a novel measure, called proportional overlapping score (POS), of a feature's relevance to a classification task. Results: We apply POS, along with four widely used gene selection methods, to several benchmark gene expression datasets. The experimental results of classification error rates computed using the Random Forest, k Nearest Neighbor and Support Vector Machine classifiers show that POS achieves better performance. Conclusions: A novel gene selection method, POS, is proposed. POS analyzes the expression overlap across classes, taking into account the proportions of overlapping samples. It robustly defines a mask for each gene that allows it to minimize the effect of expression outliers. The constructed masks, along with a novel gene score, are exploited to produce the selected subset of genes.

    Predicting volume of distribution with decision tree-based regression methods using predicted tissue:plasma partition coefficients

    Background: Volume of distribution is an important pharmacokinetic property that indicates the extent of a drug's distribution in the body tissues. This paper addresses the problem of how to estimate the apparent volume of distribution at steady state (Vss) of chemical compounds in the human body using decision tree-based regression methods from the area of data mining (or machine learning). Hence, the pros and cons of several different types of decision tree-based regression methods have been discussed. The regression methods predict Vss using, as predictive features, both the compounds' molecular descriptors and the compounds' tissue:plasma partition coefficients (Kt:p), which are often used in physiologically-based pharmacokinetics. Therefore, this work has assessed whether the data mining-based prediction of Vss can be made more accurate by using as input not only the compounds' molecular descriptors but also (a subset of) their predicted Kt:p values. Results: Comparison of the models that used only molecular descriptors, in particular, the Bagging decision tree (mean fold error of 2.33), with those employing predicted Kt:p values in addition to the molecular descriptors, such as the Bagging decision tree using adipose Kt:p (mean fold error of 2.29), indicated that the use of predicted Kt:p values as descriptors may be beneficial for accurate prediction of Vss using decision trees if prior feature selection is applied. Conclusions: Decision tree-based models presented in this work have an accuracy that is reasonable and similar to the accuracy of reported Vss inter-species extrapolations in the literature. The estimation of Vss for new compounds in drug discovery will benefit from methods that are able to integrate large and varied sources of data and flexible non-linear data mining methods such as decision trees, which can produce interpretable models. © 2015 Freitas et al.; licensee Springer
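
    The abstract reports results as a "mean fold error" but does not state its formula. The sketch below assumes one definition that is common in the pharmacokinetics literature, the geometric mean fold error (ten raised to the mean absolute log10 ratio of predicted to observed values); the paper itself may define the measure differently. The function name and the sample values are hypothetical.

```python
import math


def mean_fold_error(predicted, observed):
    """Geometric mean fold error: 10 ** mean(|log10(predicted / observed)|).

    ASSUMPTION: this is one common definition, not necessarily the one
    used in the paper. All values must be strictly positive.
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    log_errors = [abs(math.log10(p / o)) for p, o in zip(predicted, observed)]
    return 10 ** (sum(log_errors) / len(log_errors))


# Hypothetical values: one 2-fold over-prediction, one 2-fold under-prediction.
mfe = mean_fold_error([2.0, 0.5], [1.0, 1.0])
```

    Under this definition a value of 1 means perfect prediction, and over- and under-predictions by the same factor are penalized equally, so the hypothetical example above yields a fold error of about 2.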

    Human Communication Dynamics in Digital Footsteps: A Study of the Agreement between Self-Reported Ties and Email Networks

    Digital communication data has created opportunities to advance the knowledge of human dynamics in many areas, including national security, behavioral health, and consumerism. While digital data uniquely captures the totality of a person's communication, past research consistently shows that a subset of contacts makes up a person's “social network” of unique resource providers. To address this gap, we analyzed the correspondence between self-reported social network data and email communication data with the objective of identifying the dynamics in e-communication that correlate with a person's perception of a significant network tie. First, we examined the predictive utility of three popular methods to derive social network data from email data based on volume and reciprocity of bilateral email exchanges. Second, we observed differences in the response dynamics along self-reported ties, allowing us to introduce and test a new method that incorporates time-resolved exchange data. Using a range of robustness checks for measurement and misreporting errors in self-report and email data, we find that the methods have similar predictive utility. Although e-communication has lowered the cost of communicating with large numbers of persons, and potentially extended the number of, and our reach to, contacts, our case results suggest that underlying behavioral patterns indicative of friendship or professional contacts continue to operate in a classical fashion in email interactions.

    Identification of early changes in specific symptoms that predict longer-term response to atypical antipsychotics in the treatment of patients with schizophrenia

    Background: To identify a simple decision tree using early symptom change to predict response to atypical antipsychotic therapy in patients with (Diagnostic and Statistical Manual, Fourth Edition, Text Revised) chronic schizophrenia. Methods: Data were pooled from moderately to severely ill patients (n = 1494) from 6 randomized, double-blind trials (N = 2543). Response was defined as a ≥30% reduction in Positive and Negative Syndrome Scale (PANSS) Total score by Week 8 of treatment. Analyzed predictors were change in individual PANSS items at Weeks 1 and 2. A decision tree was constructed using classification and regression tree (CART) analysis to identify predictors that most effectively differentiated responders from non-responders. Results: A 2-branch, 6-item decision tree was created, producing 3 distinct groups. First branch criterion was a 2-point score decrease in at least 2 of 5 PANSS positive items (Week 2). Second branch criterion was a 2-point score decrease in the PANSS excitement item (Week 2). "Likely responders" met the first branch criteria; "likely non-responders" did not meet first or second criterion; "not predictable" patients did not meet the first but did meet the second criterion. Using this approach, response to treatment could be predicted in most patients (92%) with high positive predictive value (79%) and high negative predictive value (75%). Predictive findings were confirmed through analysis of data from 2 independent trials. Conclusions: Using a data-driven approach, we identified decision rules using early change in the scores of selected PANSS items to accurately predict longer-term treatment response or non-response to atypical antipsychotic therapy. This could lead to development of a simple quantitative evaluation tool to help guide early treatment decisions. Trial Registration: This is a retrospective, non-intervention study in which pooled results from 6 previously published reports were analyzed; thus, clinical trial registration is not required.
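
    The two-branch rule described in the Results can be written out as a short sketch. It is an illustration of the rule as worded in the abstract, not a validated clinical tool. Two assumptions are made: a "2-point score decrease" is read as a decrease of at least 2 points, and the five PANSS positive items are passed in as a plain list because the abstract does not name them. The function name `classify_patient` is hypothetical.

```python
def classify_patient(positive_item_decreases, excitement_decrease, threshold=2):
    """Assign one of the three groups described in the abstract.

    positive_item_decreases: Week 2 score decreases for the 5 PANSS positive
        items used by the tree (which items is not stated in the abstract).
    excitement_decrease: Week 2 score decrease in the PANSS excitement item.
    threshold: ASSUMPTION, "2-point decrease" is treated as ">= 2 points".
    """
    items_improved = sum(1 for d in positive_item_decreases if d >= threshold)

    # First branch: at least 2 of the 5 positive items improved enough.
    if items_improved >= 2:
        return "likely responder"
    # Second branch: first criterion not met, but excitement item improved.
    if excitement_decrease >= threshold:
        return "not predictable"
    # Neither criterion met.
    return "likely non-responder"


# Hypothetical patients, for illustration only.
group_a = classify_patient([2, 3, 0, 0, 1], excitement_decrease=0)
group_b = classify_patient([2, 0, 0, 0, 0], excitement_decrease=2)
group_c = classify_patient([0, 1, 0, 0, 0], excitement_decrease=1)
```

    Only the first two groups receive a prediction, which is consistent with the abstract's statement that response could be predicted in most, but not all, patients.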