353 research outputs found
Learning Multi-label Alternating Decision Trees from Texts and Data
International audienceMulti-label decision procedures are the target of the supervised learning algorithm we propose in this paper. Multi-label decision procedures map examples to a finite set of labels. Our learning algorithm extends Schapire and Singer?s Adaboost.MH and produces sets of rules that can be viewed as trees like Alternating Decision Trees (invented by Freund and Mason). Experiments show that we take advantage of both performance and readability using boosting techniques as well as tree representations of large set of rules. Moreover, a key feature of our algorithm is the ability to handle heterogenous input data: discrete and continuous values and text data. Keywords boosting - alternating decision trees - text mining - multi-label problem
PAC-Bayesian Bounds for Randomized Empirical Risk Minimizers
The aim of this paper is to generalize the PAC-Bayesian theorems proved by
Catoni in the classification setting to more general problems of statistical
inference. We show how to control the deviations of the risk of randomized
estimators. A particular attention is paid to randomized estimators drawn in a
small neighborhood of classical estimators, whose study leads to control the
risk of the latter. These results allow to bound the risk of very general
estimation procedures, as well as to perform model selection
A survey of cost-sensitive decision tree induction algorithms
The past decade has seen a significant interest on the problem of inducing decision trees that take account of costs of misclassification and costs of acquiring the features used for decision making.  This survey identifies over 50 algorithms including approaches that are direct adaptations of accuracy based methods, use genetic algorithms, use anytime methods and utilize boosting and bagging.  The survey brings together these different studies and novel approaches to cost-sensitive decision tree learning, provides a useful taxonomy, a historical timeline of how the field has developed and should provide a useful reference point for future research in this field
Learning to Order Things
There are many applications in which it is desirable to order rather than
classify instances. Here we consider the problem of learning how to order
instances given feedback in the form of preference judgments, i.e., statements
to the effect that one instance should be ranked ahead of another. We outline a
two-stage approach in which one first learns by conventional means a binary
preference function indicating whether it is advisable to rank one instance
before another. Here we consider an on-line algorithm for learning preference
functions that is based on Freund and Schapire's 'Hedge' algorithm. In the
second stage, new instances are ordered so as to maximize agreement with the
learned preference function. We show that the problem of finding the ordering
that agrees best with a learned preference function is NP-complete.
Nevertheless, we describe simple greedy algorithms that are guaranteed to find
a good approximation. Finally, we show how metasearch can be formulated as an
ordering problem, and present experimental results on learning a combination of
'search experts', each of which is a domain-specific query expansion strategy
for a web search engine
Study of B0(s)→K0Sh+h′− decays with first observation of B0s→K0SK±π∓ and B0s→K0Sπ+π−
A search for charmless three-body decays of B 0 and B0s mesons with a K0S meson in the final state is performed using the pp collision data, corresponding to an integrated luminosity of 1.0 fb−1, collected at a centre-of-mass energy of 7 TeV recorded by the LHCb experiment. Branching fractions of the B0(s)→K0Sh+h′− decay modes (h (′) = π, K), relative to the well measured B0→K0Sπ+π− decay, are obtained. First observation of the decay modes B0s→K0SK±π∓ and B0s→K0Sπ+π− and confirmation of the decay B0→K0SK±π∓ are reported. The following relative branching fraction measurements or limits are obtained B(B0→K0SK±π∓)B(B0→K0Sπ+π−)=0.128±0.017(stat.)±0.009(syst.), B(B0→K0SK+K−)B(B0→K0Sπ+π−)=0.385±0.031(stat.)±0.023(syst.), B(B0s→K0Sπ+π−)B(B0→K0Sπ+π−)=0.29±0.06(stat.)±0.03(syst.)±0.02(fs/fd), B(B0s→K0SK±π∓)B(B0→K0Sπ+π−)=1.48±0.12(stat.)±0.08(syst.)±0.12(fs/fd)B(B0s→K0SK+K−)B(B0→K0Sπ+π−)∈[0.004;0.068]at90%CL
BoostingTree: parallel selection of weak learners in boosting, with application to ranking
Boosting algorithms have been found successful in many areas of machine learning and, in particular, in ranking. For typical classes of weak learners used in boosting (such as decision stumps or trees), a large feature space can slow down the training, while a long sequence of weak hypotheses combined by boosting can result in a computationally expensive model. In this paper we propose a strategy that builds several sequences of weak hypotheses in parallel, and extends the ones that are likely to yield a good model. The weak hypothesis sequences are arranged in a boosting tree, and new weak hypotheses are added to promising nodes (both leaves and inner nodes) of the tree using some randomized method. Theoretical results show that the proposed algorithm asymptotically achieves the performance of the base boosting algorithm applied. Experiments are provided in ranking web documents and move ordering in chess, and the results indicate that the new strategy yields better performance when the length of the sequence is limited, and converges to similar performance as the original boosting algorithms otherwise. © 2013 The Author(s)
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting in the manual
definition of a classifier by domain experts) are a very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.Comment: Accepted for publication on ACM Computing Survey
Information retrieval and text mining technologies for chemistry
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.A.V. and M.K. acknowledge funding from the European
Community’s Horizon 2020 Program (project reference:
654021 - OpenMinted). M.K. additionally acknowledges the
Encomienda MINETAD-CNIO as part of the Plan for the
Advancement of Language Technology. O.R. and J.O. thank
the Foundation for Applied Medical Research (FIMA),
University of Navarra (Pamplona, Spain). This work was
partially funded by Consellería
de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic
funding of UID/BIO/04469/2013 unit and COMPETE 2020
(POCI-01-0145-FEDER-006684). We thank Iñigo Garciá -Yoldi
for useful feedback and discussions during the preparation of
the manuscript.info:eu-repo/semantics/publishedVersio
Aneuploidy prediction and tumor classification with heterogeneous hidden conditional random fields
Motivation: The heterogeneity of cancer cannot always be recognized by tumor morphology, but may be reflected by the underlying genetic aberrations. Array comparative genome hybridization (array-CGH) methods provide high-throughput data on genetic copy numbers, but determining the clinically relevant copy number changes remains a challenge. Conventional classification methods for linking recurrent alterations to clinical outcome ignore sequential correlations in selecting relevant features. Conversely, existing sequence classification methods can only model overall copy number instability, without regard to any particular position in the genome
- …
