Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting in the manual
definition of a classifier by domain experts) are a very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.
Comment: Accepted for publication in ACM Computing Surveys
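As an illustration of the inductive process described above (learning the characteristics of categories from a set of preclassified documents), here is a minimal multinomial Naive Bayes sketch. This is illustrative only: the survey covers many learner families, and the function names and toy corpus below are invented for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Learn per-category word statistics from preclassified documents.

    docs: list of (text, category) pairs.
    Returns the model state used by classify() below.
    """
    word_counts = defaultdict(Counter)  # category -> word frequencies
    cat_counts = Counter()              # category -> number of documents
    vocab = set()
    for text, cat in docs:
        words = text.lower().split()
        word_counts[cat].update(words)
        cat_counts[cat] += 1
        vocab.update(words)
    return word_counts, cat_counts, vocab

def classify(text, model):
    """Pick the category with the highest log-posterior (Laplace smoothing)."""
    word_counts, cat_counts, vocab = model
    total_docs = sum(cat_counts.values())
    best_cat, best_score = None, float("-inf")
    for cat in cat_counts:
        score = math.log(cat_counts[cat] / total_docs)  # category prior
        total_words = sum(word_counts[cat].values())
        for w in text.lower().split():
            # word likelihood with add-one smoothing over the vocabulary
            score += math.log((word_counts[cat][w] + 1) /
                              (total_words + len(vocab)))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

model = train_nb([
    ("stock market shares trading", "finance"),
    ("bank interest rates loans", "finance"),
    ("match goal team player", "sports"),
    ("season league score win", "sports"),
])
print(classify("trading shares on the stock market", model))  # finance
```

The "preclassified documents" of the abstract are the `(text, category)` pairs; the learned "characteristics of the categories" are simply the per-category word statistics.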
CEDR: Contextualized Embeddings for Document Ranking
Although considerable attention has been given to neural ranking
architectures recently, far less attention has been paid to the term
representations that are used as input to these models. In this work, we
investigate how two pretrained contextualized language models (ELMo and BERT)
can be utilized for ad-hoc document ranking. Through experiments on TREC
benchmarks, we find that several existing neural ranking architectures can
benefit from the additional context provided by contextualized language models.
Furthermore, we propose a joint approach that incorporates BERT's
classification vector into existing neural models and show that it outperforms
state-of-the-art ad-hoc ranking baselines. We call this joint approach CEDR
(Contextualized Embeddings for Document Ranking). We also address practical
challenges in using these models for ranking, including the maximum input
length imposed by BERT and runtime performance impacts of contextualized
language models.
Comment: Appeared in SIGIR 2019, 4 pages
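One of the practical challenges mentioned above, BERT's maximum input length (512 tokens for the original models), is commonly handled by splitting a long document into overlapping windows and aggregating per-window scores. A minimal sketch of that common workaround, not necessarily the exact strategy used in the paper; `score_with_bert` in the usage comment is a hypothetical per-window scorer:

```python
def chunk_tokens(tokens, max_len=512, stride=256):
    """Split a token sequence into overlapping windows of at most max_len
    tokens each, so every window fits the model's input limit.
    The overlap (max_len - stride) preserves context across boundaries."""
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return chunks

# Usage (score_with_bert is a hypothetical per-window scorer):
# window_scores = [score_with_bert(query, w) for w in chunk_tokens(doc_tokens)]
# doc_score = max(window_scores)  # aggregate, e.g. by taking the best window
```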
Building a Test Collection for Significant-Event Detection in Arabic Tweets
With the increasing popularity of microblogging services like Twitter, researchers discovered a rich medium for tackling real-life problems like event detection. However, event detection in Twitter is often obstructed by the lack of public evaluation mechanisms such as test collections (sets of tweets, labels, and queries used to measure the effectiveness of an information retrieval system). The problem is more evident when non-English languages, e.g., Arabic, are concerned. With the recent surge of significant events in the Arab world, news agencies and decision makers rely on Twitter's microblogging service to obtain recent information on events. In this thesis, we address the problem of building a test collection of Arabic tweets (named EveTAR) for the task of event detection.
To build EveTAR, we first adopted an adequate definition of an event: a significant occurrence that takes place at a certain time. An occurrence is significant if there are news articles about it. We collected Arabic tweets using Twitter's streaming API. Then, we identified a set of events from the Arabic data collection using Wikipedia's current events portal. Corresponding tweets were extracted by querying the Arabic data collection with a set of manually-constructed queries. To obtain relevance judgments for those tweets, we leveraged CrowdFlower's crowdsourcing platform.
Over a period of 4 weeks, we crawled over 590M tweets, from which we identified 66 events that cover 8 different categories, and we gathered more than 134k relevance judgments. Each event contains an average of 779 relevant tweets. Over all events, we obtained an average Kappa of 0.6, which indicates substantial inter-annotator agreement. EveTAR was used to evaluate three state-of-the-art event detection algorithms. The best-performing algorithm achieved 0.60 in F1 measure and 0.80 in both precision and recall. We plan to make our test collection available for research, including event descriptions, manually-crafted queries to extract potentially-relevant tweets, and all judgments per tweet. EveTAR is the first Arabic test collection built from scratch for the task of event detection. Additionally, we show in our experiments that it supports other tasks, such as ad-hoc search.
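The Kappa figure above refers to Cohen's Kappa, a chance-corrected measure of inter-annotator agreement; values around 0.6 fall in the range conventionally labeled "substantial" agreement. A self-contained sketch of the computation for two annotators (the function name is ours, not from the thesis):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa: agreement between two annotators, corrected for the
    agreement expected by chance given each annotator's label distribution."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # observed proportion of items on which the annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement from the marginal label frequencies
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators judging 4 tweets as relevant (1) / non-relevant (0):
print(cohens_kappa([1, 1, 1, 0], [1, 1, 0, 0]))  # 0.5
```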
Mixed Graph of Terms: Beyond the bags of words representation of a text
The main purpose of text mining techniques is to identify common patterns through the observation of vectors of features and then to use such patterns to make predictions. These feature vectors are usually made up of weighted words, like those used in the text retrieval field, obtained under the assumption that a document is a "bag of words". In this paper, however, we demonstrate that more accurate analysis and revelation of common patterns can be achieved by observing more complex features than simple weighted words. The proposed vector of features is based on a hierarchical structure, named a mixed Graph of Terms, composed of a directed and an undirected sub-graph of words, which can be automatically constructed from a small set of documents through a probabilistic Topic Model. The graph has demonstrated its efficiency in a classic "ad-hoc" text retrieval problem. Here we consider expanding the initial query with this new structured vector of features.
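To make the contrast with the bag-of-words representation concrete, the sketch below builds a simple undirected term co-occurrence graph. Note that this is only a stand-in illustration: the paper's mixed Graph of Terms is derived from a probabilistic Topic Model, not from raw co-occurrence counts, and `build_term_graph` and its window parameter are our invention.

```python
from collections import defaultdict

def build_term_graph(docs, window=3):
    """Undirected term graph: edge weight counts how often two distinct
    terms occur within `window` tokens of each other. Unlike a bag of
    words, the edges capture relations between terms, not just weights
    of isolated terms."""
    edges = defaultdict(int)
    for doc in docs:
        words = doc.lower().split()
        for i, w in enumerate(words):
            for v in words[i + 1:i + window]:
                if w != v:
                    edges[tuple(sorted((w, v)))] += 1
    return edges

graph = build_term_graph(["the cat sat on the mat"])
print(graph[("sat", "the")])  # 2: "sat" neighbors "the" on both sides
```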