11,961 research outputs found
Complexities of convex combinations and bounding the generalization error in classification
We introduce and study several measures of complexity of functions from the
convex hull of a given base class. These complexity measures take into account
the sparsity of the weights of a convex combination as well as certain
clustering properties of the base functions involved in it. We prove new upper
confidence bounds on the generalization error of ensemble (voting)
classification algorithms that utilize the new complexity measures along with
the empirical distributions of classification margins, providing a better
explanation of generalization performance of large margin classification
methods.
Comment: Published at http://dx.doi.org/10.1214/009053605000000228 in the
Annals of Statistics (http://www.imstat.org/aos/) by the Institute of
Mathematical Statistics (http://www.imstat.org
iDTI-ESBoost: Identification of Drug Target Interaction Using Evolutionary and Structural Features with Boosting
Prediction of new drug-target interactions is extremely important as it can
lead the researchers to find new uses for old drugs and to realize the
therapeutic profiles or side effects thereof. However, experimental prediction
of drug-target interactions is expensive and time-consuming. As a result,
computational methods for prediction of new drug-target interactions have
gained much interest in recent times. We present iDTI-ESBoost, a prediction
model for identification of drug-target interactions using evolutionary and
structural features. Our proposed method uses a novel balancing technique and a
boosting technique for the binary classification problem of drug-target
interaction. On four benchmark datasets taken from gold-standard data,
iDTI-ESBoost outperforms the state-of-the-art methods in terms of area under
the receiver operating characteristic (auROC) curve. iDTI-ESBoost also outperforms
the latest and best-performing method in the literature to date in terms of
area under the precision-recall (auPR) curve. This is significant as auPR curves
are argued to be more appropriate as a metric for comparison for imbalanced
datasets, like the one studied in this research. In the sequel, our experiments
establish the effectiveness of the classifier, balancing methods and the novel
features incorporated in iDTI-ESBoost. iDTI-ESBoost is a novel prediction
method that has for the first time exploited the structural features along with
the evolutionary features to predict drug-protein interactions. We believe the
excellent performance of iDTI-ESBoost both in terms of auROC and auPR would
motivate the researchers and practitioners to use it to predict drug-target
interactions. To facilitate that, iDTI-ESBoost is readily available for use at:
http://farshidrayhan.pythonanywhere.com/iDTI-ESBoost/
Comment: pre-prin
Evaluation of Three Vision Based Object Perception Methods for a Mobile Robot
This paper addresses object perception applied to mobile robotics. Being able
to perceive semantically meaningful objects in unstructured environments is a
key capability in order to make robots suitable to perform high-level tasks in
home environments. However, finding a solution for this task is daunting: it
requires the ability to handle the variability in image formation in a moving
camera with tight time constraints. The paper brings to attention some of the
issues with applying three state-of-the-art object recognition and detection
methods in a mobile robotics scenario, and proposes methods to deal with
windowing/segmentation. Thus, this work aims at evaluating the state-of-the-art
in object perception in an attempt to develop a lightweight solution for mobile
robotics use/research in typical indoor settings.
Comment: 37 pages, 11 figure
Really? Well. Apparently Bootstrapping Improves the Performance of Sarcasm and Nastiness Classifiers for Online Dialogue
More and more of the information on the web is dialogic, from Facebook
newsfeeds, to forum conversations, to comment threads on news articles. In
contrast to traditional, monologic Natural Language Processing resources such
as news, highly social dialogue is frequent in social media, making it a
challenging context for NLP. This paper tests a bootstrapping method,
originally proposed in a monologic domain, to train classifiers to identify two
different types of subjective language in dialogue: sarcasm and nastiness. We
explore two methods of developing linguistic indicators to be used in a first
level classifier aimed at maximizing precision at the expense of recall. The
best performing classifier for the first phase achieves 54% precision and 38%
recall for sarcastic utterances. We then use general syntactic patterns from
previous work to create more general sarcasm indicators, improving precision to
62% and recall to 52%. To further test the generality of the method, we then
apply it to bootstrapping a classifier for nastiness dialogic acts. Our first
phase, using crowdsourced nasty indicators, achieves 58% precision and 49%
recall, which increases to 75% precision and 62% recall when we bootstrap over
the first level with generalized syntactic patterns.
Comment: Workshop on Language Analysis in Social Media (LASM 2013), at the
North American Chapter of the Association for Computational Linguistics
(NAACL 2013
Empirical margin distributions and bounding the generalization error of combined classifiers
We prove new probabilistic upper bounds on generalization error of complex
classifiers that are combinations of simple classifiers. Such combinations
could be implemented by neural networks or by voting methods of combining the
classifiers, such as boosting and bagging. The bounds are in terms of the
empirical distribution of the margin of the combined classifier. They are based
on the methods of the theory of Gaussian and empirical processes (comparison
inequalities, symmetrization method, concentration inequalities) and they
improve previous results of Bartlett (1998) on bounding the generalization
error of neural networks in terms of l_1-norms of the weights of neurons and of
Schapire, Freund, Bartlett and Lee (1998) on bounding the generalization error
of boosting. We also obtain rates of convergence in Levy distance of empirical
margin distribution to the true margin distribution uniformly over the classes
of classifiers and prove the optimality of these rates.
Comment: 35 pages, 1 figur
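The central quantity in the abstract above, the empirical distribution of the margin of a combined classifier, can be illustrated with a minimal sketch. It assumes binary labels in {-1, +1} and a weighted vote normalized so the weights sum to one; the helper names (`stump`, `margins`, `empirical_margin_cdf`) and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence


def stump(threshold: float) -> Callable[[float], int]:
    """Hypothetical base classifier: +1 if x > threshold, else -1."""
    return lambda x: 1 if x > threshold else -1


def margins(base: Sequence[Callable[[float], int]],
            weights: Sequence[float],
            xs: Sequence[float],
            ys: Sequence[int]) -> List[float]:
    """Margin y * f(x) of the normalized weighted vote f for each example."""
    total = sum(weights)
    result = []
    for x, y in zip(xs, ys):
        f = sum(w * h(x) for w, h in zip(weights, base)) / total
        result.append(y * f)
    return result


def empirical_margin_cdf(ms: Sequence[float], theta: float) -> float:
    """Fraction of training examples whose margin is at most theta."""
    return sum(m <= theta for m in ms) / len(ms)


# Toy data (assumed for illustration only).
base = [stump(0.0), stump(1.0), stump(-1.0)]
weights = [0.5, 0.25, 0.25]
xs = [-2.0, -0.5, 0.5, 2.0]
ys = [-1, -1, 1, 1]

ms = margins(base, weights, xs, ys)
print(ms)                             # margins per example
print(empirical_margin_cdf(ms, 0.0))  # training error of the vote
print(empirical_margin_cdf(ms, 0.5))  # mass at or below margin 0.5
```

Margin-based bounds of the kind described compare the true error with `empirical_margin_cdf(ms, theta)` for some `theta > 0`, plus a complexity term that shrinks as `theta` grows; the sketch only computes the empirical part.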
Using Multi-Label Classification for Improved Question Answering
A plethora of diverse approaches for question answering over RDF data have
been developed in recent years. While the accuracy of these systems has
increased significantly over time, most systems still focus on particular types
of questions or particular challenges in question answering. What is a curse
for single systems is a blessing for the combination of these systems. We show
in this paper how machine learning techniques can be applied to create a more
accurate question answering metasystem by reusing existing systems. In
particular, we develop a multi-label classification-based metasystem for
question answering over 6 existing systems using an innovative set of 14
question features. The metasystem outperforms the best single system by 14%
F-measure on the recent QALD-6 benchmark. Furthermore, we analyzed the
influence and correlation of the underlying features on the metasystem quality.
Comment: 15 pages, 4 Tables, 3 Figue
Robust and Efficient Boosting Method using the Conditional Risk
Well-known for its simplicity and effectiveness in classification, AdaBoost,
however, suffers from overfitting when class-conditional distributions have
significant overlap. Moreover, it is very sensitive to noise that appears in
the labels. This article tackles the above limitations simultaneously via
optimizing a modified loss function (i.e., the conditional risk). The proposed
approach has the following two advantages. (1) It is able to directly take into
account label uncertainty with an associated label confidence. (2) It
introduces a "trustworthiness" measure on training samples via the Bayesian
risk rule, and hence the resulting classifier tends to have finite sample
performance that is superior to that of the original AdaBoost when there is a
large overlap between class conditional distributions. Theoretical properties
of the proposed method are investigated. Extensive experimental results using
synthetic data and real-world data sets from UCI machine learning repository
are provided. The empirical study shows the high competitiveness of the
proposed method in prediction accuracy and robustness when compared with the
original AdaBoost and several existing robust AdaBoost algorithms.
Comment: 14 Pages, 2 figures and 5 table
An empirical evaluation of imbalanced data strategies from a practitioner's point of view
This research tested the following well known strategies to deal with binary
imbalanced data on 82 different real life data sets (sampled to imbalance rates
of 5%, 3%, 1%, and 0.1%): class weight, SMOTE, Underbagging, and a baseline
(just the base classifier). As base classifiers we used SVM with RBF kernel,
random forests, and gradient boosting machines and we measured the quality of
the resulting classifier using 6 different metrics (area under the curve,
accuracy, F-measure, G-mean, Matthews correlation coefficient, and balanced
accuracy). The best strategy strongly depends on the metric used to measure the
quality of the classifier. For AUC and accuracy, class weight and the baseline
perform better; for F-measure and MCC, SMOTE performs better; and for G-mean
and balanced accuracy, underbagging
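To see why the choice of metric matters on imbalanced data, here is a minimal sketch that computes five of the six metrics named in the abstract above from confusion-matrix counts. The function name `imbalance_metrics` and the example counts are assumptions made for illustration, not values from the study. AUC is omitted because it needs ranked scores rather than hard predictions.

```python
import math
from typing import Dict


def imbalance_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Metrics from confusion counts; assumes every denominator is nonzero."""
    recall = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    precision = tp / (tp + fp)
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "g_mean": math.sqrt(recall * specificity),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Hypothetical classifier on 1,000 samples with a 1% positive rate.
m = imbalance_metrics(tp=5, fp=10, tn=980, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case accuracy is dominated by the majority class, while F-measure, G-mean, MCC, and balanced accuracy all penalize the missed positives, which is consistent with the abstract's point that the best strategy depends on the metric.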
MonoStream: A Minimal-Hardware High Accuracy Device-free WLAN Localization System
Device-free (DF) localization is an emerging technology that allows the
detection and tracking of entities that do not carry any devices nor
participate actively in the localization process. Typically, DF systems require
a large number of transmitters and receivers to achieve acceptable accuracy,
which is not available in many scenarios such as homes and small businesses. In
this paper, we introduce MonoStream as an accurate single-stream DF
localization system that leverages the rich Channel State Information (CSI) as
well as MIMO information from the physical layer to provide accurate DF
localization with only one stream. To boost its accuracy and attain low
computational requirements, MonoStream models the DF localization problem as an
object recognition problem and uses a novel set of CSI-context features and
techniques with proven accuracy and efficiency. Experimental evaluation in two
typical testbeds, with a side-by-side comparison with the state-of-the-art,
shows that MonoStream can achieve an accuracy of 0.95m with at least 26%
enhancement in median distance error using a single stream only. This
enhancement in accuracy comes with an efficient execution of less than 23ms per
location update on a typical laptop. This highlights the potential of
MonoStream usage for real-time DF tracking applications
Detecting Table Region in PDF Documents Using Distant Supervision
Superior to state-of-the-art approaches which compete in table recognition
with 67 annotated government reports in PDF format released by the ICDAR 2013
Table Competition, this paper contributes a novel paradigm leveraging
large-scale unlabeled PDF documents for open-domain table detection. We
integrate the paradigm into our latest developed system (PdfExtra) to
detect the region of tables by means of 9,466 academic articles from the entire
repository of the ACL Anthology, where almost all papers are archived in PDF
format without annotation for tables. The paradigm first designs heuristics to
automatically construct weakly labeled data. It then feeds diverse evidence,
such as layouts of documents and linguistic features, which are extracted by
Apache PDFBox and processed by the Stanford NLP toolkit, into different
canonical classifiers. We finally use these classifiers, i.e. Naive
Bayes, Logistic Regression and Support Vector Machine, to
collaboratively vote on the region of tables. Experimental results show that
PdfExtra achieves a great leap forward, compared with the
state-of-the-art approach. Moreover, we discuss the factors of different
features, learning models and even domains of documents that may impact the
performance. Extensive evaluations demonstrate that our paradigm is compatible
enough to leverage various features and learning models for open-domain table
region detection within PDF files
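The final step described above, several classifiers voting on each candidate region, can be sketched as a simple majority vote. The abstract does not say how the votes are combined, so plain majority voting is an assumption here, and the function name and the per-classifier predictions are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-classifier label lists (all the same length) by majority.

    With an odd number of binary classifiers there are no ties.
    """
    voted = []
    for labels in zip(*predictions):
        # most_common(1) returns the label with the highest count.
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Hypothetical outputs for four candidate regions (1 = table, 0 = not table).
naive_bayes = [1, 0, 1, 0]
logistic_regression = [1, 1, 0, 0]
svm = [0, 1, 1, 0]

combined = majority_vote([naive_bayes, logistic_regression, svm])
print(combined)
```

A weighted vote or a vote over predicted probabilities would be equally plausible readings of "collaboratively vote"; the sketch shows only the simplest variant.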