107 research outputs found
Subset Labeled LDA for Large-Scale Multi-Label Classification
Labeled Latent Dirichlet Allocation (LLDA) is an extension of the standard
unsupervised Latent Dirichlet Allocation (LDA) algorithm that addresses
multi-label learning tasks. Previous work has shown it to perform on par with
other state-of-the-art multi-label methods. Nonetheless, as label set sizes
grow, LLDA encounters scalability issues. In this work, we introduce
Subset LLDA, a simple variant of the standard LLDA algorithm, that not only can
effectively scale up to problems with hundreds of thousands of labels but also
improves over the LLDA state-of-the-art. We conduct extensive experiments on
eight data sets, with label set sizes ranging from hundreds to hundreds of
thousands, comparing our proposed algorithm with the previously proposed LLDA
algorithms (Prior-LDA, Dep-LDA), as well as the state of the art in extreme
multi-label classification. The results show a steady advantage of our method
over the other LLDA algorithms and competitive results compared to the extreme
multi-label classification algorithms.
A Divide-and-Conquer Approach to the Summarization of Long Documents
We present a novel divide-and-conquer method for the neural summarization of
long documents. Our method exploits the discourse structure of the document and
uses sentence similarity to split the problem into an ensemble of smaller
summarization problems. In particular, we break a long document and its summary
into multiple source-target pairs, which are used for training a model that
learns to summarize each part of the document separately. These partial
summaries are then combined in order to produce a final complete summary. With
this approach we can decompose the problem of long document summarization into
smaller and simpler problems, reducing computational complexity and creating
more training examples, which at the same time contain less noise in the target
summaries compared to the standard approach. We demonstrate that this approach
paired with different summarization models, including sequence-to-sequence RNNs
and Transformers, can lead to improved summarization performance. Our best
models achieve results that are on par with the state of the art on two
publicly available datasets of academic articles.
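The splitting step described in this abstract can be illustrated with a minimal sketch. It assumes the document is already divided into named sections and uses word-overlap (Jaccard) similarity as a crude stand-in for whatever sentence similarity measure the authors use; the function names and the toy data are hypothetical and not taken from the paper.

```python
import re

def tokens(sentence):
    """Lowercased word set of a sentence (a crude stand-in for a real similarity model)."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))

def jaccard(a, b):
    """Word-overlap similarity between two sentences, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def split_into_pairs(sections, summary_sentences):
    """Assign each summary sentence to the section holding its most similar sentence.

    sections: dict mapping section name -> list of sentences.
    Returns a dict mapping section name -> (source_text, partial_target_text),
    keeping only sections that received at least one summary sentence.
    """
    targets = {name: [] for name in sections}
    for summ in summary_sentences:
        best = max(
            sections,
            key=lambda name: max((jaccard(summ, s) for s in sections[name]), default=0.0),
        )
        targets[best].append(summ)
    return {
        name: (" ".join(sections[name]), " ".join(targets[name]))
        for name in sections
        if targets[name]
    }

# Toy document (hypothetical data, for illustration only).
sections = {
    "introduction": ["Long documents are hard to summarize with neural models."],
    "methods": ["We split each document into sections and train on the parts."],
    "results": ["The split models reach higher ROUGE scores on two datasets."],
}
summary = [
    "Neural models struggle to summarize long documents.",
    "Training on document sections gives higher ROUGE scores.",
]

pairs = split_into_pairs(sections, summary)
for name, (source, target) in pairs.items():
    print(name, "->", target)
```

Each resulting (source, target) pair would serve as one training example for a summarizer, and the partial summaries produced at prediction time would be concatenated in section order to form the final summary.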
Making Classifier Chains Resilient to Class Imbalance
Class imbalance is an intrinsic characteristic of multi-label data. Most of
the labels in multi-label data sets are associated with a small number of
training examples, far fewer than the size of the data set. Class
imbalance poses a key challenge that plagues most multi-label learning methods.
Ensemble of Classifier Chains (ECC), one of the most prominent multi-label
learning methods, is no exception to this rule, as each of the binary models it
builds is trained from all positive and negative examples of a label. To make
ECC resilient to class imbalance, we first couple it with random undersampling.
We then present two extensions of this basic approach, where we build a varying
number of binary models per label and construct chains of different sizes, in
order to improve the exploitation of majority examples with approximately the
same computational budget. Experimental results on 16 multi-label datasets
demonstrate the effectiveness of the proposed approaches in a variety of
evaluation metrics.
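As a rough illustration of coupling a classifier chain with random undersampling, the sketch below builds the training set for each binary model in a single chain. It assumes a 1:1 positive-to-negative ratio and binary 0/1 labels; the ratio, the function names, and the toy data are assumptions made for this example rather than details from the paper, and the two proposed extensions are not shown.

```python
import random

def undersample(labels, rng):
    """Return indices of all positives plus an equally sized random sample of negatives."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep_neg = rng.sample(neg, min(len(pos), len(neg)))
    return sorted(pos + keep_neg)

def chain_training_sets(X, Y, order, rng):
    """Build the training set of each binary model in one classifier chain.

    The model for label j sees the original features plus the true values of
    the labels that precede j in the chain; only undersampled rows are kept.
    Returns a list of (label_index, feature_rows, targets) tuples.
    """
    sets = []
    for position, j in enumerate(order):
        labels_j = [row[j] for row in Y]
        idx = undersample(labels_j, rng)
        previous = order[:position]
        feats = [X[i] + [Y[i][p] for p in previous] for i in idx]
        sets.append((j, feats, [labels_j[i] for i in idx]))
    return sets

# Toy data (hypothetical): 8 instances, 2 features, 2 imbalanced labels.
X = [[0.1, 1.0], [0.4, 0.2], [0.9, 0.3], [0.5, 0.5],
     [0.7, 0.8], [0.2, 0.6], [0.3, 0.9], [0.8, 0.1]]
Y = [[1, 0], [0, 0], [0, 1], [0, 0],
     [1, 1], [0, 0], [0, 1], [0, 0]]

training_sets = chain_training_sets(X, Y, order=[1, 0], rng=random.Random(0))
for label, feats, targets in training_sets:
    print("label", label, "rows", len(feats), "positives", sum(targets))
```

An ensemble would repeat this with different random chain orders and different random draws of negatives, so that more of the majority examples are used across the ensemble as a whole.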
Unsupervised Keyphrase Extraction from Scientific Publications
We propose a novel unsupervised keyphrase extraction approach that filters
candidate keywords using outlier detection. It starts by training word
embeddings on the target document to capture semantic regularities among the
words. It then uses the minimum covariance determinant estimator to model the
distribution of non-keyphrase word vectors, under the assumption that these
vectors come from the same distribution, indicative of their irrelevance to the
semantics expressed by the dimensions of the learned vector representation.
Candidate keyphrases consist only of words that are detected as outliers of
this dominant distribution. Empirical results show that our approach
outperforms state-of-the-art and recent unsupervised keyphrase extraction
methods.
Comment: author pre-print version
Local Word Vectors Guiding Keyphrase Extraction
Automated keyphrase extraction is a fundamental textual information
processing task concerned with the selection of representative phrases from a
document that summarize its content. This work presents a novel unsupervised
method for keyphrase extraction, whose main innovation is the use of local word
embeddings (in particular GloVe vectors), i.e., embeddings trained from the
single document under consideration. We argue that such local representations
of words and keyphrases can accurately capture their semantics in the
context of the document they are part of, and can therefore help improve
keyphrase extraction quality. Empirical results offer evidence that local
representations indeed lead to better keyphrase extraction results, compared
both to embeddings trained on very large third-party corpora or on larger
corpora consisting of several documents of the same scientific field, and to
other state-of-the-art unsupervised keyphrase extraction methods.
Comment: author pre-print version
Structured Summarization of Academic Publications
We propose SUSIE, a novel summarization method that can work with
state-of-the-art summarization models in order to produce structured scientific
summaries for academic articles. We also created PMC-SA, a new dataset of
academic publications, suitable for the task of structured summarization with
neural networks. We apply SUSIE combined with three different summarization
models on the new PMC-SA dataset and we show that the proposed method improves
the performance of all models by as much as 4 ROUGE points.
Web Robot Detection in Academic Publishing
Recent industry reports confirm the rise of web robots, which account for more
than half of total web traffic. They not only threaten the security,
privacy and efficiency of the web but also distort analytics and metrics,
casting doubt on the veracity of the information being promoted. In the
academic publishing domain, this can cause articles to be falsely presented as
prominent and influential. In this paper, we present our approach to detecting
web robots on academic publishing websites. We use different supervised
learning algorithms with a variety of features derived from both the log files
of the server and the content served by the website. Our approach relies on the
assumption that human users will be interested in specific domains or articles,
while web robots crawl a web library incoherently. We experiment with features
adopted in previous studies, together with novel semantic features derived
from a semantic analysis using the Latent Dirichlet Allocation (LDA)
algorithm. Our real-world case study shows promising results, highlighting
the significance of semantic features in the web robot detection
problem.
Discovering and Exploiting Entailment Relationships in Multi-Label Learning
This work presents a sound probabilistic method for enforcing adherence of
the marginal probabilities of a multi-label model to automatically discovered
deterministic relationships among labels. In particular we focus on discovering
two kinds of relationships among the labels. The first one concerns pairwise
positive entailment: pairs of labels, where the presence of one implies the
presence of the other in all instances of a dataset. The second concerns
exclusion: sets of labels that do not coexist in the same instances of the
dataset. These relationships are represented with a Bayesian network. Marginal
probabilities are entered as soft evidence in the network and adjusted through
probabilistic inference. Our approach offers robust improvements in mean
average precision compared to the standard binary relevance approach across all
12 datasets involved in our experiments. The discovery process helps
interesting implicit knowledge to emerge, which could be useful in itself.
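The two kinds of relationships defined in this abstract can be found by simple co-occurrence counting, as in the sketch below. It is a simplification under stated assumptions: exclusion is checked only between pairs of labels (the abstract speaks of sets of labels), the `min_support` parameter is a hypothetical addition, and the Bayesian network step that adjusts the marginal probabilities is not shown.

```python
from itertools import combinations, permutations

def discover_relationships(label_sets, min_support=1):
    """Find pairwise entailments and pairwise exclusions in a labeled dataset.

    label_sets: list of sets, one set of labels per instance.
    Returns (entailments, exclusions), where
      - (a, b) in entailments means every instance with label a also has b,
      - (a, b) in exclusions means a and b never occur in the same instance.
    min_support (hypothetical parameter) is the minimum number of instances
    with label a required before an entailment a -> b is reported.
    """
    labels = sorted(set().union(*label_sets))
    count = {a: sum(a in s for s in label_sets) for a in labels}
    co = {
        (a, b): sum(a in s and b in s for s in label_sets)
        for a, b in combinations(labels, 2)
    }

    def both(a, b):
        # Co-occurrence counts are stored once per unordered pair.
        return co[(a, b)] if (a, b) in co else co[(b, a)]

    entailments = [
        (a, b)
        for a, b in permutations(labels, 2)
        if count[a] >= min_support and both(a, b) == count[a]
    ]
    exclusions = [(a, b) for a, b in combinations(labels, 2) if both(a, b) == 0]
    return entailments, exclusions

# Toy dataset (hypothetical labels, for illustration only).
instances = [
    {"dog", "animal"},
    {"cat", "animal"},
    {"animal"},
    {"plant"},
    {"dog", "animal"},
]

entailments, exclusions = discover_relationships(instances)
print("entailments:", entailments)
print("exclusions:", exclusions)
```

On real data, rare labels produce many accidental relationships, which is why a support threshold of some kind is usually worth having; with `min_support=2` the toy example above keeps only the entailment from "dog" to "animal".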
Multi-Target Regression via Input Space Expansion: Treating Targets as Inputs
In many practical applications of supervised learning the task involves the
prediction of multiple target variables from a common set of input variables.
When the prediction targets are binary the task is called multi-label
classification, while when the targets are continuous the task is called
multi-target regression. In both tasks, target variables often exhibit
statistical dependencies and exploiting them in order to improve predictive
accuracy is a core challenge. A family of multi-label classification methods
addresses this challenge by building a separate model for each target on an
expanded input space where other targets are treated as additional input
variables. Despite the success of these methods in the multi-label
classification domain, their applicability and effectiveness in multi-target
regression have not been studied until now. In this paper, we introduce two new
methods for multi-target regression, called Stacked Single-Target and Ensemble
of Regressor Chains, by adapting two popular multi-label classification methods
of this family. Furthermore, we highlight an inherent problem of these methods
- a discrepancy of the values of the additional input variables between
training and prediction - and develop extensions that use out-of-sample
estimates of the target variables during training in order to tackle this
problem. The results of an extensive experimental evaluation carried out on a
large and diverse collection of datasets show that, when the discrepancy is
appropriately mitigated, the proposed methods attain consistent improvements
over the independent regressions baseline. Moreover, two versions of Ensemble
of Regressor Chains perform significantly better than four state-of-the-art
methods including regularization-based multi-task learning methods and a
multi-objective random forest approach.
Comment: Accepted for publication in Machine Learning journal. This
replacement contains major improvements compared to the previous version,
including a deeper theoretical and experimental analysis and an extended
discussion of related work
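The discrepancy between training and prediction mentioned in this abstract can be made concrete with a small sketch. It uses a 1-nearest-neighbor regressor as a stand-in base learner (an assumption chosen because it makes the effect obvious, not the learner used in the paper), and all function names and data are hypothetical. In-sample estimates of a target reproduce the true values exactly, whereas at prediction time the model only ever sees imperfect estimates; cross-validated (out-of-sample) estimates give the training inputs the same imperfect character.

```python
def nearest_neighbor_predict(train_X, train_y, x):
    """Predict with 1-nearest-neighbor under squared Euclidean distance."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]

def in_sample_estimates(X, y):
    """Estimate each target with a model that was trained on that same row."""
    return [nearest_neighbor_predict(X, y, x) for x in X]

def out_of_sample_estimates(X, y, n_folds=3):
    """Estimate each target with a model that never saw that row (k-fold)."""
    estimates = [None] * len(X)
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != fold]
        held_out = [i for i in range(len(X)) if i % n_folds == fold]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        for i in held_out:
            estimates[i] = nearest_neighbor_predict(train_X, train_y, X[i])
    return estimates

def expand_inputs(X, extra):
    """Append one estimated target to every input row."""
    return [x + [e] for x, e in zip(X, extra)]

# Toy data (hypothetical): one input variable, first target y1.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y1 = [0.1, 0.9, 2.2, 2.8, 4.1, 5.3]

inside = in_sample_estimates(X, y1)
outside = out_of_sample_estimates(X, y1)

# Inputs on which a model for a second target would be trained.
expanded = expand_inputs(X, outside)
print("in-sample:    ", inside)
print("out-of-sample:", outside)
```

A model for a second target trained on `expanded` sees estimates of the first target that are noisy in the same way as the estimates it will receive at prediction time, which is the motivation the abstract gives for the out-of-sample extensions.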
Dense Distributions from Sparse Samples: Improved Gibbs Sampling Parameter Estimators for LDA
We introduce a novel approach for estimating Latent Dirichlet Allocation
(LDA) parameters from collapsed Gibbs samples (CGS), by leveraging the full
conditional distributions over the latent variable assignments to efficiently
average over multiple samples, for little more computational cost than drawing
a single additional collapsed Gibbs sample. Our approach can be understood as
adapting the soft clustering methodology of Collapsed Variational Bayes (CVB0)
to CGS parameter estimation, in order to get the best of both techniques. Our
estimators can straightforwardly be applied to the output of any existing
implementation of CGS, including modern accelerated variants. We perform
extensive empirical comparisons of our estimators with those of standard
collapsed inference algorithms on real-world data for both unsupervised LDA and
Prior-LDA, a supervised variant of LDA for multi-label classification. Our
results show a consistent advantage of our approach over traditional CGS under
all experimental conditions, and over CVB0 inference in the majority of
conditions. More broadly, our results highlight the importance of averaging
over multiple samples in LDA parameter estimation, and the use of efficient
computational techniques to do so.