Relabelling Algorithms for Large Dataset Mixture Models
Mixture models are flexible tools in density estimation and classification
problems. Bayesian estimation of such models typically relies on sampling from
the posterior distribution using Markov chain Monte Carlo. Label switching
arises because the posterior is invariant to permutations of the component
parameters. Methods for dealing with label switching have been studied fairly
extensively in the literature, with the most popular approaches being those
based on loss functions. However, many of these algorithms turn out to be too
slow in practice, and can be infeasible as the size and dimension of the data
grow. In this article, we review earlier solutions that can scale up well to
large datasets and compare their performance on simulated and real data.
In addition, we propose a new, computationally efficient algorithm based on
a loss-function interpretation and show that it scales well to larger
problems. We conclude with a discussion and recommendations covering all the
methods studied.
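For concreteness, below is a minimal sketch of one loss-function-style relabelling step; it is not the algorithm proposed in the article, just the common squared-error matching of each MCMC draw to a reference labelling via the Hungarian algorithm. The function name, the array layout, and the choice of the first draw as initial reference are all illustrative assumptions.

```python
# Minimal sketch of loss-function-based relabelling (illustrative, not the
# article's algorithm): each MCMC draw of the K component parameters is
# permuted to best match a running reference under a squared-error loss.
import numpy as np
from scipy.optimize import linear_sum_assignment

def relabel_draws(draws):
    """draws: array of shape (n_iter, K, d) holding component parameters."""
    reference = draws[0].copy()          # first draw as initial reference (assumption)
    relabelled = np.empty_like(draws)
    for t, theta in enumerate(draws):
        # cost[j, k] = squared distance between component j of this draw
        # and component k of the reference labelling
        cost = ((theta[:, None, :] - reference[None, :, :]) ** 2).sum(axis=2)
        rows, cols = linear_sum_assignment(cost)      # optimal permutation
        relabelled[t, cols] = theta[rows]             # apply the permutation
        reference = relabelled[: t + 1].mean(axis=0)  # update the running reference
    return relabelled
```

Because each iteration only solves one K-by-K assignment problem, this kind of matching stays cheap as the number of MCMC draws grows, which is the scaling concern the abstract raises.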
RACE: Large-scale ReAding Comprehension Dataset From Examinations
We present RACE, a new dataset for benchmark evaluation of methods for the
reading comprehension task. Collected from English exams for middle and
high school Chinese students aged 12 to 18, RACE consists of
nearly 28,000 passages and nearly 100,000 questions generated by human experts
(English instructors), and covers a variety of topics carefully
designed to evaluate students' ability in understanding and reasoning.
In particular, the proportion of questions that require reasoning is much
larger in RACE than in other benchmark datasets for reading comprehension,
and there is a significant gap between the performance of state-of-the-art
models (43%) and the ceiling human performance (95%). We hope this new dataset
can serve as a valuable resource for research and evaluation in machine
comprehension. The dataset is freely available at
http://www.cs.cmu.edu/~glai1/data/race/ and the code is available at
https://github.com/qizhex/RACE_AR_baselines. Comment: EMNLP 2017.
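As a hedged illustration of how one might iterate over the released data, the sketch below assumes the per-passage JSON layout commonly described for RACE (article, questions, options, and answers keys with letter-coded answers); the download linked above should be consulted for the exact schema.

```python
# Sketch of reading one RACE passage file; the article/questions/options/
# answers keys and the A-D letter coding are assumptions about the released
# JSON layout, not guaranteed by this abstract.
import json

def iter_race_examples(path):
    with open(path, encoding="utf-8") as f:
        ex = json.load(f)
    passage = ex["article"]
    for question, options, answer in zip(ex["questions"], ex["options"], ex["answers"]):
        gold = options[ord(answer) - ord("A")]   # answers assumed to be letters A-D
        yield passage, question, options, gold
```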
LCSTS: A Large Scale Chinese Short Text Summarization Dataset
Automatic text summarization is widely regarded as a highly difficult
problem, partly because of the lack of large text summarization datasets.
Given the great challenge of constructing large-scale summaries for full
text, in this paper we introduce a large corpus for Chinese short text
summarization, constructed from the Chinese microblogging website Sina
Weibo and released to the public
(http://icrc.hitsz.edu.cn/Article/show/139.html). This corpus consists of over
2 million real Chinese short texts with short summaries given by the author of
each text. We also manually tagged the relevance of 10,666 short summaries with
their corresponding short texts. Based on the corpus, we introduce recurrent
neural networks for summary generation and achieve promising results, which
not only show the usefulness of the proposed corpus for short text
summarization research but also provide a baseline for further research on
this topic. Comment: Recently, we received feedback from Yuya Taguchi (NAIST,
Japan) and Qian Chen (USTC, China) that the results in the EMNLP 2015 version
seemed to be underrated. We carefully checked our results and found that we
had made a mistake while using the standard ROUGE. We then re-evaluated all
methods in the paper; the corrected results are listed in Table 2 of this
version.
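Since the correction above turns on how standard ROUGE was applied, here is a hedged sketch of averaging ROUGE F1 over a test set using the third-party rouge-score package (not the paper's evaluation code); for Chinese text the reference and candidate strings are assumed to be pre-segmented, e.g. space-separated characters.

```python
# Sketch of averaging ROUGE-1/2/L F1 with the third-party rouge-score package.
# For Chinese, inputs are assumed to be pre-segmented (e.g. space-separated
# characters); this is not the paper's evaluation script.
from rouge_score import rouge_scorer

def average_rouge(references, candidates):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for ref, cand in zip(references, candidates):
        scores = scorer.score(ref, cand)          # dict of Score namedtuples
        for name in totals:
            totals[name] += scores[name].fmeasure
    return {name: total / len(candidates) for name, total in totals.items()}
```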
A large multilingual and multi-domain dataset for recommender systems
This paper presents a multi-domain interest dataset for training and testing recommender systems, and the methodology used to create the dataset
from Twitter messages in English and Italian. The English dataset includes an average of 90 preferences per user on music, books,
movies, celebrities, sport, politics and much more, for about half a million users. Preferences are either extracted from messages of
users who use Spotify, Goodreads and other similar content-sharing platforms, or induced from their "topical" friends, i.e., followees
representing an interest rather than a social relation between peers. In addition, preferred items are matched with the Wikipedia articles
describing them. This unique feature of our dataset provides a means to derive a semantic categorization of the preferred items, exploiting
available semantic resources linked to Wikipedia such as the Wikipedia Category Graph, DBpedia, BabelNet and others.
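Purely as a hypothetical illustration of the kind of record such a dataset yields, one user's preferences and their Wikipedia anchors might be represented as below; the field names and values are illustrative only and are not the paper's actual schema.

```python
# Hypothetical record structure for one user's preferences; field names and
# example values are illustrative, not the dataset's actual schema.
from dataclasses import dataclass

@dataclass
class Preference:
    domain: str           # e.g. "music", "books", "movies"
    item: str             # item mentioned in or induced from tweets
    wikipedia_title: str  # matched Wikipedia article, enabling semantic categorization
    source: str           # "message" (e.g. Spotify/Goodreads share) or "topical_friend"

user_record = {
    "user_id": "anonymized_user_123",
    "language": "en",
    "preferences": [
        Preference("music", "Radiohead", "Radiohead", "message"),
        Preference("books", "The Hobbit", "The Hobbit", "topical_friend"),
    ],
}
```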
Who did What: A Large-Scale Person-Centered Cloze Dataset
We have constructed a new "Who-did-What" dataset of over 200,000
fill-in-the-gap (cloze) multiple-choice reading comprehension problems drawn
from the LDC English Gigaword newswire corpus. The WDW dataset has
a variety of novel features. First, in contrast with the CNN and Daily Mail
datasets (Hermann et al., 2015), we avoid using article summaries for question
formation. Instead, each problem is formed from two independent articles: an
article given as the passage to be read and a separate article on the same
events used to form the question. Second, we avoid anonymization: each
choice is a person named entity. Third, the problems have been filtered to
remove a fraction that are easily solved by simple baselines, while remaining
84% solvable by humans. We report performance benchmarks of standard systems
and propose the WDW dataset as a challenge task for the community. Comment: To appear at EMNLP 2016. Our dataset is available at
tticnlp.github.io/who_did_what.
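As a simplified illustration of the person-centred cloze construction described above (not the paper's actual construction or baseline-filtering pipeline; the function name, placeholder token, and example names are illustrative), a question can be formed by blanking one person named entity in a sentence from the second article and offering person entities as the answer choices.

```python
# Simplified illustration of forming a person-centred cloze problem: blank one
# person named entity in the question sentence and offer person entities as
# the answer choices. Not the paper's actual construction/filtering pipeline.
def make_cloze(question_sentence, candidate_persons, answer, blank="XXX"):
    """candidate_persons: person named entities used as choices (must include answer)."""
    assert answer in candidate_persons and answer in question_sentence
    question = question_sentence.replace(answer, blank, 1)
    return {
        "question": question,
        "choices": sorted(set(candidate_persons)),
        "answer": answer,
    }

example = make_cloze(
    "Angela Merkel visited the flood-hit region on Tuesday.",
    ["Angela Merkel", "Barack Obama", "François Hollande"],
    "Angela Merkel",
)
# example["question"] == "XXX visited the flood-hit region on Tuesday."
```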
