    Complexities of convex combinations and bounding the generalization error in classification

    We introduce and study several measures of complexity of functions from the convex hull of a given base class. These complexity measures take into account the sparsity of the weights of a convex combination as well as certain clustering properties of the base functions involved in it. We prove new upper confidence bounds on the generalization error of ensemble (voting) classification algorithms that utilize the new complexity measures along with the empirical distributions of classification margins, providing a better explanation of generalization performance of large margin classification methods.
    Comment: Published at http://dx.doi.org/10.1214/009053605000000228 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    iDTI-ESBoost: Identification of Drug Target Interaction Using Evolutionary and Structural Features with Boosting

    Prediction of new drug-target interactions is extremely important, as it can lead researchers to find new uses for old drugs and to understand their therapeutic profiles or side effects. However, experimental prediction of drug-target interactions is expensive and time-consuming. As a result, computational methods for prediction of new drug-target interactions have gained much interest in recent times. We present iDTI-ESBoost, a prediction model for identification of drug-target interactions using evolutionary and structural features. Our proposed method uses a novel balancing technique and a boosting technique for the binary classification problem of drug-target interaction. On four benchmark datasets taken from gold standard data, iDTI-ESBoost outperforms the state-of-the-art methods in terms of area under the receiver operating characteristic curve (auROC). iDTI-ESBoost also outperforms the latest and best-performing method in the literature to date in terms of area under the precision-recall curve (auPR). This is significant, as auPR curves are argued to be a more appropriate comparison metric for imbalanced datasets, like the one studied in this research. In the sequel, our experiments establish the effectiveness of the classifier, balancing methods and the novel features incorporated in iDTI-ESBoost. iDTI-ESBoost is a novel prediction method that has for the first time exploited the structural features along with the evolutionary features to predict drug-protein interactions. We believe the excellent performance of iDTI-ESBoost in terms of both auROC and auPR will motivate researchers and practitioners to use it to predict drug-target interactions. To facilitate that, iDTI-ESBoost is readily available for use at: http://farshidrayhan.pythonanywhere.com/iDTI-ESBoost/
    Comment: pre-print

    Evaluation of Three Vision Based Object Perception Methods for a Mobile Robot

    This paper addresses object perception applied to mobile robotics. Being able to perceive semantically meaningful objects in unstructured environments is a key capability for making robots suitable to perform high-level tasks in home environments. However, finding a solution for this task is daunting: it requires the ability to handle the variability in image formation in a moving camera with tight time constraints. The paper brings to attention some of the issues with applying three state-of-the-art object recognition and detection methods in a mobile robotics scenario, and proposes methods to deal with windowing/segmentation. Thus, this work aims at evaluating the state of the art in object perception in an attempt to develop a lightweight solution for mobile robotics use/research in typical indoor settings.
    Comment: 37 pages, 11 figures

    Really? Well. Apparently Bootstrapping Improves the Performance of Sarcasm and Nastiness Classifiers for Online Dialogue

    More and more of the information on the web is dialogic, from Facebook newsfeeds, to forum conversations, to comment threads on news articles. In contrast to traditional, monologic Natural Language Processing resources such as news, highly social dialogue is frequent in social media, making it a challenging context for NLP. This paper tests a bootstrapping method, originally proposed in a monologic domain, to train classifiers to identify two different types of subjective language in dialogue: sarcasm and nastiness. We explore two methods of developing linguistic indicators to be used in a first-level classifier aimed at maximizing precision at the expense of recall. The best-performing classifier for the first phase achieves 54% precision and 38% recall for sarcastic utterances. We then use general syntactic patterns from previous work to create more general sarcasm indicators, improving precision to 62% and recall to 52%. To further test the generality of the method, we then apply it to bootstrapping a classifier for nastiness dialogic acts. Our first phase, using crowdsourced nasty indicators, achieves 58% precision and 49% recall, which increases to 75% precision and 62% recall when we bootstrap over the first level with generalized syntactic patterns.
    Comment: Workshop on Language Analysis in Social Media (LASM 2013), at the North American Chapter of the Association for Computational Linguistics (NAACL 2013)

    Empirical margin distributions and bounding the generalization error of combined classifiers

    We prove new probabilistic upper bounds on the generalization error of complex classifiers that are combinations of simple classifiers. Such combinations could be implemented by neural networks or by voting methods of combining the classifiers, such as boosting and bagging. The bounds are in terms of the empirical distribution of the margin of the combined classifier. They are based on the methods of the theory of Gaussian and empirical processes (comparison inequalities, symmetrization method, concentration inequalities) and they improve previous results of Bartlett (1998) on bounding the generalization error of neural networks in terms of l_1-norms of the weights of neurons and of Schapire, Freund, Bartlett and Lee (1998) on bounding the generalization error of boosting. We also obtain rates of convergence in Levy distance of the empirical margin distribution to the true margin distribution uniformly over the classes of classifiers and prove the optimality of these rates.
    Comment: 35 pages, 1 figure
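As a rough illustration of the quantity these bounds are stated in, the sketch below computes the empirical margin distribution of a weighted voting classifier. It is not taken from the paper: the function names `margins` and `empirical_margin_cdf`, the {-1, +1} label encoding, and the toy votes and weights are all assumptions made for this example.

```python
from typing import List, Sequence


def margins(votes: Sequence[Sequence[int]],
            weights: Sequence[float],
            labels: Sequence[int]) -> List[float]:
    """Margin y_i * f(x_i) of a convex combination of base classifiers.

    votes[j][i] in {-1, +1} is base classifier j's prediction on example i
    (an assumed encoding). Weights are normalized to sum to 1, so every
    margin lies in [-1, 1]; a positive margin means a correct vote.
    """
    total = sum(weights)
    result = []
    for i, y in enumerate(labels):
        f = sum(w * h[i] for w, h in zip(weights, votes)) / total
        result.append(y * f)
    return result


def empirical_margin_cdf(ms: Sequence[float], delta: float) -> float:
    """Fraction of training examples whose margin is at most delta."""
    return sum(1 for m in ms if m <= delta) / len(ms)


# Toy data (hypothetical): three base classifiers, four examples.
votes = [[1, 1, -1, 1],
         [1, -1, -1, 1],
         [-1, 1, -1, -1]]
weights = [0.5, 0.3, 0.2]
labels = [1, 1, -1, -1]

ms = margins(votes, weights, labels)
training_error = empirical_margin_cdf(ms, 0.0)   # margin <= 0
small_margin_rate = empirical_margin_cdf(ms, 0.5)
```

Evaluating the empirical distribution at 0 gives the training error of the vote, while evaluating it at a positive threshold also counts correctly classified examples with small margins, which is the kind of term margin-based bounds typically use.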

    Using Multi-Label Classification for Improved Question Answering

    A plethora of diverse approaches for question answering over RDF data have been developed in recent years. While the accuracy of these systems has increased significantly over time, most systems still focus on particular types of questions or particular challenges in question answering. What is a curse for single systems is a blessing for the combination of these systems. We show in this paper how machine learning techniques can be applied to create a more accurate question answering metasystem by reusing existing systems. In particular, we develop a multi-label classification-based metasystem for question answering over 6 existing systems using an innovative set of 14 question features. The metasystem outperforms the best single system by 14% F-measure on the recent QALD-6 benchmark. Furthermore, we analyze the influence and correlation of the underlying features on the metasystem quality.
    Comment: 15 pages, 4 tables, 3 figures

    Robust and Efficient Boosting Method using the Conditional Risk

    Although well known for its simplicity and effectiveness in classification, AdaBoost suffers from overfitting when class-conditional distributions have significant overlap. Moreover, it is very sensitive to noise that appears in the labels. This article tackles the above limitations simultaneously by optimizing a modified loss function (i.e., the conditional risk). The proposed approach has the following two advantages. (1) It is able to directly take into account label uncertainty with an associated label confidence. (2) It introduces a "trustworthiness" measure on training samples via the Bayesian risk rule, and hence the resulting classifier tends to have finite sample performance that is superior to that of the original AdaBoost when there is a large overlap between class-conditional distributions. Theoretical properties of the proposed method are investigated. Extensive experimental results using synthetic data and real-world data sets from the UCI machine learning repository are provided. The empirical study shows the high competitiveness of the proposed method in prediction accuracy and robustness when compared with the original AdaBoost and several existing robust AdaBoost algorithms.
    Comment: 14 pages, 2 figures and 5 tables

    An empirical evaluation of imbalanced data strategies from a practitioner's point of view

    This research tested the following well-known strategies to deal with binary imbalanced data on 82 different real-life data sets (sampled to imbalance rates of 5%, 3%, 1%, and 0.1%): class weight, SMOTE, Underbagging, and a baseline (just the base classifier). As base classifiers we used SVM with RBF kernel, random forests, and gradient boosting machines, and we measured the quality of the resulting classifier using 6 different metrics (area under the curve, accuracy, F-measure, G-mean, Matthews correlation coefficient and balanced accuracy). The best strategy strongly depends on the metric used to measure the quality of the classifier. For AUC and accuracy, class weight and the baseline perform better; for F-measure and MCC, SMOTE performs better; and for G-mean and balanced accuracy, underbagging.
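To make the dependence on the metric concrete, here is a minimal sketch that computes five of the six metrics named above from a binary confusion matrix (AUC is omitted because it needs ranked scores rather than hard predictions). The function name `imbalance_metrics` and both confusion matrices are hypothetical and are not taken from the study; undefined ratios are set to 0 by convention.

```python
import math
from typing import Dict


def imbalance_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Threshold-based metrics for a binary problem with a rare positive class."""
    tpr = tp / (tp + fn) if tp + fn else 0.0   # recall on the positive class
    tnr = tn / (tn + fp) if tn + fp else 0.0   # recall on the negative class
    f1_den = 2 * tp + fp + fn
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f_measure": 2 * tp / f1_den if f1_den else 0.0,
        "g_mean": math.sqrt(tpr * tnr),
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
        "balanced_accuracy": (tpr + tnr) / 2,
    }


# Hypothetical data set: 1000 examples, 1% positives.
# "always_negative" never predicts the rare class; "detector" finds half
# of the positives at the cost of ten false alarms.
always_negative = imbalance_metrics(tp=0, fp=0, tn=990, fn=10)
detector = imbalance_metrics(tp=5, fp=10, tn=980, fn=5)

for name in always_negative:
    print(f"{name:18s} {always_negative[name]:.3f} {detector[name]:.3f}")
```

In this toy case accuracy ranks the classifier that ignores the rare class first, while F-measure, G-mean, MCC and balanced accuracy all rank the detector first, which illustrates how the choice of metric can change which strategy looks best.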

    MonoStream: A Minimal-Hardware High Accuracy Device-free WLAN Localization System

    Device-free (DF) localization is an emerging technology that allows the detection and tracking of entities that neither carry any devices nor participate actively in the localization process. Typically, DF systems require a large number of transmitters and receivers to achieve acceptable accuracy, which is not available in many scenarios such as homes and small businesses. In this paper, we introduce MonoStream as an accurate single-stream DF localization system that leverages the rich Channel State Information (CSI) as well as MIMO information from the physical layer to provide accurate DF localization with only one stream. To boost its accuracy and attain low computational requirements, MonoStream models the DF localization problem as an object recognition problem and uses a novel set of CSI-context features and techniques with proven accuracy and efficiency. Experimental evaluation in two typical testbeds, with a side-by-side comparison with the state of the art, shows that MonoStream can achieve an accuracy of 0.95 m with at least 26% enhancement in median distance error using a single stream only. This enhancement in accuracy comes with an efficient execution of less than 23 ms per location update on a typical laptop. This highlights the potential of MonoStream usage for real-time DF tracking applications.

    Detecting Table Region in PDF Documents Using Distant Supervision

    Superior to state-of-the-art approaches which compete in table recognition with 67 annotated government reports in PDF format released by the ICDAR 2013 Table Competition, this paper contributes a novel paradigm leveraging large-scale unlabeled PDF documents for open-domain table detection. We integrate the paradigm into our latest developed system (PdfExtra) to detect the region of tables by means of 9,466 academic articles from the entire repository of the ACL Anthology, where almost all papers are archived in PDF format without annotation for tables. The paradigm first designs heuristics to automatically construct weakly labeled data. It then feeds diverse evidence, such as layouts of documents and linguistic features, which are extracted by Apache PDFBox and processed by the Stanford NLP toolkit, into different canonical classifiers. We finally use these classifiers, i.e. Naive Bayes, Logistic Regression and Support Vector Machine, to collaboratively vote on the region of tables. Experimental results show that PdfExtra achieves a great leap forward compared with the state-of-the-art approach. Moreover, we discuss the factors of different features, learning models and even domains of documents that may impact the performance. Extensive evaluations demonstrate that our paradigm is compatible enough to leverage various features and learning models for open-domain table region detection within PDF files.