143 research outputs found
Investigating the Effect of Emoji in Opinion Classification of Uzbek Movie Review Comments
Opinion mining on social media posts has become more and more popular. Users
often express their opinion on a topic not only with words but they also use
image symbols such as emoticons and emoji. In this paper, we investigate the
effect of emoji-based features in opinion classification of Uzbek texts, and
more specifically movie review comments from YouTube. Several classification
algorithms are tested, and feature ranking is performed to evaluate the
discriminative ability of the emoji-based features.Comment: 10 pages, 1 figure, 3 table
A general approach to simultaneous model fitting and variable elimination in response models for biological data with many more variables than observations
<p>Abstract</p> <p>Background</p> <p>With the advent of high throughput biotechnology data acquisition platforms such as micro arrays, SNP chips and mass spectrometers, data sets with many more variables than observations are now routinely being collected. Finding relationships between response variables of interest and variables in such data sets is an important problem akin to finding needles in a haystack. Whilst methods for a number of response types have been developed a general approach has been lacking.</p> <p>Results</p> <p>The major contribution of this paper is to present a unified methodology which allows many common (statistical) response models to be fitted to such data sets. The class of models includes virtually any model with a linear predictor in it, for example (but not limited to), multiclass logistic regression (classification), generalised linear models (regression) and survival models. A fast algorithm for finding sparse well fitting models is presented. The ideas are illustrated on real data sets with numbers of variables ranging from thousands to millions. R code implementing the ideas is available for download.</p> <p>Conclusion</p> <p>The method described in this paper enables existing work on response models when there are less variables than observations to be leveraged to the situation when there are many more variables than observations. It is a powerful approach to finding parsimonious models for such datasets. The method is capable of handling problems with millions of variables and a large variety of response types within the one framework. The method compares favourably to existing methods such as support vector machines and random forests, but has the advantage of not requiring separate variable selection steps. It is also works for data types which these methods were not designed to handle. The method usually produces very sparse models which make biological interpretation simpler and more focused.</p
Feature engineering and a proposed decision-support system for systematic reviewers of medical evidence
Objectives: Evidence-based medicine depends on the timely synthesis of research findings. An important source of synthesized evidence resides in systematic reviews. However, a bottleneck in review production involves dual screening of citations with titles and abstracts to find eligible studies. For this research, we tested the effect of various kinds of textual information (features) on performance of a machine learning classifier. Based on our findings, we propose an automated system to reduce screeing burden, as well as offer quality assurance. Methods: We built a database of citations from 5 systematic reviews that varied with respect to domain, topic, and sponsor. Consensus judgments regarding eligibility were inferred from published reports. We extracted 5 feature sets from citations: alphabetic, alphanumeric +, indexing, features mapped to concepts in systematic reviews, and topic models. To simulate a two-person team, we divided the data into random halves. We optimized the parameters of a Bayesian classifier, then trained and tested models on alternate data halves. Overall, we conducted 50 independent tests. Results: All tests of summary performance (mean F3) surpassed the corresponding baseline, P<0.0001. The ranks for mean F3, precision, and classification error were statistically different across feature sets averaged over reviews; P-values for Friedman's test were .045, .002, and .002, respectively. Differences in ranks for mean recall were not statistically significant. Alphanumeric+ features were associated with best performance; mean reduction in screening burden for this feature type ranged from 88% to 98% for the second pass through citations and from 38% to 48% overall. Conclusions: A computer-assisted, decision support system based on our methods could substantially reduce the burden of screening citations for systematic review teams and solo reviewers. Additionally, such a system could deliver quality assurance both by confirming concordant decisions and by naming studies associated with discordant decisions for further consideration. © 2014 Bekhuis et al
Support Vector Machine Implementations for Classification & Clustering
BACKGROUND: We describe Support Vector Machine (SVM) applications to classification and clustering of channel current data. SVMs are variational-calculus based methods that are constrained to have structural risk minimization (SRM), i.e., they provide noise tolerant solutions for pattern recognition. The SVM approach encapsulates a significant amount of model-fitting information in the choice of its kernel. In work thus far, novel, information-theoretic, kernels have been successfully employed for notably better performance over standard kernels. Currently there are two approaches for implementing multiclass SVMs. One is called external multi-class that arranges several binary classifiers as a decision tree such that they perform a single-class decision making function, with each leaf corresponding to a unique class. The second approach, namely internal-multiclass, involves solving a single optimization problem corresponding to the entire data set (with multiple hyperplanes). RESULTS: Each SVM approach encapsulates a significant amount of model-fitting information in its choice of kernel. In work thus far, novel, information-theoretic, kernels were successfully employed for notably better performance over standard kernels. Two SVM approaches to multiclass discrimination are described: (1) internal multiclass (with a single optimization), and (2) external multiclass (using an optimized decision tree). We describe benefits of the internal-SVM approach, along with further refinements to the internal-multiclass SVM algorithms that offer significant improvement in training time without sacrificing accuracy. In situations where the data isn't clearly separable, making for poor discrimination, signal clustering is used to provide robust and useful information – to this end, novel, SVM-based clustering methods are also described. As with the classification, there are Internal and External SVM Clustering algorithms, both of which are briefly described
Using Expression and Genotype to Predict Drug Response in Yeast
Personalized, or genomic, medicine entails tailoring pharmacological therapies according to individual genetic variation at genomic loci encoding proteins in drug-response pathways. It has been previously shown that steady-state mRNA expression can be used to predict the drug response (i.e., sensitivity or resistance) of non-genotyped mammalian cancer cell lines to chemotherapeutic agents. In a real-world setting, clinicians would have access to both steady-state expression levels of patient tissue(s) and a patient's genotypic profile, and yet the predictive power of transcripts versus markers is not well understood. We have previously shown that a collection of genotyped and expression-profiled yeast strains can provide a model for personalized medicine. Here we compare the predictive power of 6,229 steady-state mRNA transcript levels and 2,894 genotyped markers using a pattern recognition algorithm. We were able to predict with over 70% accuracy the drug sensitivity of 104 individual genotyped yeast strains derived from a cross between a laboratory strain and a wild isolate. We observe that, independently of drug mechanism of action, both transcripts and markers can accurately predict drug response. Marker-based prediction is usually more accurate than transcript-based prediction, likely reflecting the genetic determination of gene expression in this cross
Online Learning for 3D LiDAR-based Human Detection: Experimental Analysis of Point Cloud Clustering and Classification Methods
This paper presents a system for online learning of human classifiers by mobile service robots using 3D~LiDAR sensors, and its experimental evaluation in a large indoor public space. The learning framework requires a minimal set of labelled samples (e.g. one or several samples) to initialise a classifier. The classifier is then retrained iteratively during operation of the robot. New training samples are generated automatically using multi-target tracking and a pair of "experts" to estimate false negatives and false positives. Both classification and tracking utilise an efficient real-time clustering algorithm for segmentation of 3D point cloud data. We also introduce a new feature to improve human classification in sparse, long-range point clouds. We provide an extensive evaluation of our the framework using a 3D LiDAR dataset of people moving in a large indoor public space, which is made available to the research community. The experiments demonstrate the influence of the system components and improved classification of humans compared to the state-of-the-art
Combined artificial bee colony algorithm and machine learning techniques for prediction of online consumer repurchase intention
A novel paradigm in the service sector i.e. services through the web is a progressive mechanism for rendering offerings over diverse environments. Internet provides huge opportunities for companies to provide personalized online services to their customers. But prompt novel web services introduction may unfavorably affect the quality and user gratification. Subsequently, prediction of the consumer intention is of supreme importance in selecting the web services for an application. The aim of study is to predict online consumer repurchase intention and to achieve this objective a hybrid approach which a combination of machine learning techniques and Artificial Bee Colony (ABC) algorithm has been used. The study is divided into three phases. Initially, shopping mall and consumer characteristic’s for repurchase intention has been identified through extensive literature review. Secondly, ABC has been used to determine the feature selection of consumers’ characteristics and shopping malls’ attributes (with > 0.1 threshold value) for the prediction model. Finally, validation using K-fold cross has been employed to measure the best classification model robustness. The classification models viz., Decision Trees (C5.0), AdaBoost, Random Forest (RF), Support Vector Machine (SVM) and Neural Network (NN), are utilized for prediction of consumer purchase intention. Performance evaluation of identified models on training-testing partitions (70-30%) of the data set, shows that AdaBoost method outperforms other classification models with sensitivity and accuracy of 0.95 and 97.58% respectively, on testing data set. This study is a revolutionary attempt that considers both, shopping mall and consumer characteristics in examine the consumer purchase intention.N/
Classification and Analysis of Regulatory Pathways Using Graph Property, Biochemical and Physicochemical Property, and Functional Property
Given a regulatory pathway system consisting of a set of proteins, can we predict which pathway class it belongs to? Such a problem is closely related to the biological function of the pathway in cells and hence is quite fundamental and essential in systems biology and proteomics. This is also an extremely difficult and challenging problem due to its complexity. To address this problem, a novel approach was developed that can be used to predict query pathways among the following six functional categories: (i) “Metabolism”, (ii) “Genetic Information Processing”, (iii) “Environmental Information Processing”, (iv) “Cellular Processes”, (v) “Organismal Systems”, and (vi) “Human Diseases”. The prediction method was established trough the following procedures: (i) according to the general form of pseudo amino acid composition (PseAAC), each of the pathways concerned is formulated as a 5570-D (dimensional) vector; (ii) each of components in the 5570-D vector was derived by a series of feature extractions from the pathway system according to its graphic property, biochemical and physicochemical property, as well as functional property; (iii) the minimum redundancy maximum relevance (mRMR) method was adopted to operate the prediction. A cross-validation by the jackknife test on a benchmark dataset consisting of 146 regulatory pathways indicated that an overall success rate of 78.8% was achieved by our method in identifying query pathways among the above six classes, indicating the outcome is quite promising and encouraging. To the best of our knowledge, the current study represents the first effort in attempting to identity the type of a pathway system or its biological function. It is anticipated that our report may stimulate a series of follow-up investigations in this new and challenging area
- …