Ranking ideas for diversity and quality
When selecting ideas or trying to find inspiration, designers often must sift
through hundreds or thousands of ideas. This paper provides an algorithm to
rank design ideas such that the ranked list simultaneously maximizes the
quality and diversity of recommended designs. To do so, we first define and
compare two diversity measures using Determinantal Point Processes (DPP) and
additive sub-modular functions. We show that DPPs are more suitable for items
expressed as text and that a greedy algorithm diversifies rankings with both
theoretical guarantees and empirical performance on what is otherwise an
NP-Hard problem. To produce such rankings, this paper contributes a novel way
to extend quality and diversity metrics from sets to permutations of ranked
lists.
These rank metrics open up the use of multi-objective optimization to
describe trade-offs between diversity and quality in ranked lists. We use such
trade-off fronts to help designers select rankings using indifference curves.
However, we also show that rankings on the trade-off front share a number of
top-ranked items; this means reviewing items (for a given depth like the top
10) from across the entire diversity-to-quality front incurs only a marginal
increase in the number of designs considered. While the proposed techniques are
general purpose enough to be used across domains, we demonstrate concrete
performance on selecting items in an online design community (OpenIDEO), where
our approach reduces the time required to review diverse, high-quality ideas
from around 25 hours to 90 minutes. This makes evaluation of crowd-generated
ideas tractable for a single designer. Our code is publicly accessible for
further research.
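The greedy diversification the abstract describes can be sketched with standard DPP machinery: build a quality-scaled kernel and greedily add the item that most increases the log-determinant. This is a minimal illustration, not the authors' exact implementation; the kernel construction and toy inputs are assumptions.

```python
import numpy as np

def greedy_dpp_rank(quality, similarity, k):
    """Greedy MAP-style ranking under a DPP kernel.

    quality    : (n,) nonnegative quality scores
    similarity : (n, n) positive semidefinite similarity matrix,
                 e.g. cosine similarity of text embeddings
    k          : length of the ranked list to return
    """
    quality = np.asarray(quality, dtype=float)
    # Quality-scaled kernel: L[i, j] = q_i * S[i, j] * q_j
    L = quality[:, None] * np.asarray(similarity, dtype=float) * quality[None, :]
    selected, remaining = [], list(range(len(quality)))
    for _ in range(k):
        best, best_logdet = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            # log det of the kernel restricted to the candidate prefix
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:  # every remaining candidate is fully redundant
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```

On a toy instance where items 0 and 1 are near-duplicates and item 2 is distinct but lower-quality, the greedy ranking defers the duplicate rather than returning items purely by quality.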
An end-to-end Neural Network Framework for Text Clustering
Unsupervised text clustering is one of the major tasks in natural
language processing (NLP) and remains a difficult and complex problem.
Conventional methods generally treat this task in separate steps,
including text representation learning and clustering the representations. As
an improvement, neural methods have also been introduced for continuous
representation learning to address the sparsity problem. However, the
multi-step process still deviates from a unified optimization target. In
particular, the second clustering step is generally performed with
conventional methods such as k-means. We propose a pure neural framework for
text clustering
in an end-to-end manner. It jointly learns the text representation and the
clustering model. Our model works well when the context can be obtained, which
is nearly always the case in the field of NLP. We evaluate our method on two
widely used benchmarks: IMDB movie reviews for sentiment classification and
20-Newsgroup for topic categorization. Despite
its simplicity, experiments show the model outperforms previous clustering
methods by a large margin. Furthermore, the model is also verified on the
English Wikipedia dataset as a large corpus.
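The conventional two-step pipeline this abstract contrasts with (learn representations, then cluster them with k-means) can be sketched as follows. The k-means step is the standard algorithm; the input representations are assumed to be precomputed, which is exactly the separation the paper argues against.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: the conventional second step, applied to fixed,
    separately learned text representations."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # initialize centers from k distinct data points
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # recompute centers; keep the old center if a cluster empties
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```

Because the representation is frozen, no clustering error can flow back into it; an end-to-end model instead optimizes both jointly.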
A new class of metrics for learning on real-valued and structured data
We propose a new class of metrics on sets, vectors, and functions that can be
used in various stages of data mining, including exploratory data analysis,
learning, and result interpretation. These new distance functions unify and
generalize some of the popular metrics, such as the Jaccard and bag distances
on sets, Manhattan distance on vector spaces, and Marczewski-Steinhaus distance
on integrable functions. We prove that the new metrics are complete and show
useful relationships with f-divergences for probability distributions. To
further extend our approach to structured objects such as concept hierarchies
and ontologies, we introduce information-theoretic metrics on directed acyclic
graphs drawn according to a fixed probability distribution. We conduct an
empirical investigation to demonstrate the intuitive interpretation of the new
metrics and their effectiveness on real-valued, high-dimensional, and
structured data. Extensive comparative evaluation demonstrates that the new
metrics outperformed multiple similarity and dissimilarity functions
traditionally used in data mining, including the Minkowski family, the
fractional family, two f-divergences, cosine distance, and two
correlation coefficients. Finally, we argue that the new class of metrics is
particularly appropriate for rapid processing of high-dimensional and
structured data in distance-based learning.
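Two of the metrics the abstract says are unified can be written down directly. The vector form below (sum of coordinate-wise differences over sum of coordinate-wise maxima) reduces to the Jaccard distance on 0/1 indicator vectors; it is a minimal illustration of the unification, not the authors' generalized metric itself.

```python
def jaccard_distance(a, b):
    """Jaccard distance on finite sets: |A symmetric-diff B| / |A union B|."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def soergel_distance(x, y):
    """Marczewski-Steinhaus-style distance on nonnegative vectors:
    sum_i |x_i - y_i| / sum_i max(x_i, y_i).
    On 0/1 indicator vectors this equals the Jaccard distance."""
    num = sum(abs(xi - yi) for xi, yi in zip(x, y))
    den = sum(max(xi, yi) for xi, yi in zip(x, y))
    return num / den if den else 0.0
```

For example, the sets {1, 2} and {2, 3} and their indicator vectors [1, 1, 0] and [0, 1, 1] both yield a distance of 2/3.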
Real-Time Web Scale Event Summarization Using Sequential Decision Making
We present a system based on sequential decision making for the online
summarization of massive document streams, such as those found on the web.
Given an event of interest (e.g. "Boston marathon bombing"), our system is able
to filter the stream for relevance and produce a series of short text updates
describing the event as it unfolds over time. Unlike previous work, our
approach is able to jointly model the relevance, comprehensiveness, novelty,
and timeliness required by time-sensitive queries. We demonstrate a 28.3%
improvement in summary F1 and a 43.8% improvement in time-sensitive F1 metrics.
Comment: in Proceedings of the 25th International Joint Conference on
Artificial Intelligence 2016
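A greatly simplified version of such a filtering policy thresholds relevance to the query and novelty against updates already emitted. Word-overlap scores stand in here for the paper's learned models, and the thresholds are illustrative assumptions.

```python
def stream_summarize(docs, query, rel_thresh=0.3, nov_thresh=0.5):
    """Emit a short update for each incoming document that is both
    relevant to the query and novel w.r.t. updates emitted so far."""
    query_words = set(query.lower().split())
    updates = []
    for doc in docs:
        words = set(doc.lower().split())
        # relevance: fraction of query words covered by the document
        relevance = len(words & query_words) / len(query_words)
        # novelty: 1 - max Jaccard similarity to any previous update
        novelty = 1.0
        for prev in updates:
            prev_words = set(prev.lower().split())
            sim = len(words & prev_words) / len(words | prev_words)
            novelty = min(novelty, 1.0 - sim)
        if relevance >= rel_thresh and novelty >= nov_thresh:
            updates.append(doc)
    return updates
```

The sequential-decision framing in the paper replaces these fixed thresholds with a learned policy that also accounts for comprehensiveness and timeliness.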
Evaluating the Complementarity of Taxonomic Relation Extraction Methods Across Different Languages
Modern information systems are changing the idea of "data processing" to the
idea of "concept processing", meaning that instead of processing words, such
systems process semantic concepts which carry meaning and share contexts with
other concepts. Ontology is commonly used as a structure that captures the
knowledge about a certain area via providing concepts and relations between
them. Traditionally, concept hierarchies have been built manually by knowledge
engineers or domain experts. However, the manual construction of a concept
hierarchy suffers from several limitations such as its coverage and the
enormous costs of its extension and maintenance. Ontology learning, which
usually refers to (semi-)automatic support in ontology development, is
typically divided into steps, going from concept identification, through
hierarchical and non-hierarchical relation detection and, seldom, axiom
extraction.
It is reasonable to say that among these steps the current frontier is in the
establishment of concept hierarchies, since this is the backbone of ontologies
and, therefore, a good concept hierarchy is already a valuable resource for
many ontology applications. The automatic construction of concept hierarchies
from texts is a complex task, and much work has proposed approaches to better
extract relations between concepts. These different proposals have never
been contrasted against each other on the same set of data and across different
languages. Such comparison is important to see whether they are complementary
or incremental. Also, we can see whether they present different tendencies
towards recall and precision. This paper evaluates these different methods on
the basis of hierarchy metrics such as density and depth, and evaluation
metrics such as Recall and Precision. Results shed light on the comprehensive
set of methods according to the literature in the area.
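The evaluation quantities this comparison relies on are straightforward to compute over extracted (hyponym, hypernym) pairs; the toy taxonomy below is hypothetical.

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted taxonomic (hyponym, hypernym) pairs."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)  # pairs confirmed by the gold taxonomy
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

def depth(children, node):
    """Depth of a concept hierarchy: longest root-to-leaf path, in edges."""
    kids = children.get(node, [])
    return 0 if not kids else 1 + max(depth(children, c) for c in kids)
```

Comparing methods on both axes at once (pair-level Precision/Recall plus structural metrics like depth) is what reveals whether two extractors are complementary or merely redundant.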
Extractive Multi-document Summarization Using Multilayer Networks
Huge volumes of textual information are produced every single day. In
order to organize and understand such large datasets, summarization techniques
have become popular in recent years. These techniques aim at finding
relevant, concise, and non-redundant content in such big data. While network
methods have been adopted to model texts in some scenarios, a systematic
evaluation of multilayer network models in the multi-document summarization
task has been limited to a few studies. Here, we evaluate the performance of a
multilayer-based method to select the most relevant sentences in the context of
an extractive multi-document summarization (MDS) task. In the adopted model,
nodes represent sentences and edges are created based on the number of shared
words between sentences. Differently from previous studies in multi-document
summarization, we make a distinction between edges linking sentences from
different documents (inter-layer) and those connecting sentences from the same
document (intra-layer). As a proof of principle, our results reveal that such a
discrimination between intra- and inter-layer in a multilayered representation
is able to improve the quality of the generated summaries. This piece of
information could be used to improve current statistical methods and related
textual models.
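The multilayer construction described above, with sentence nodes, shared-word edges, and inter-document links treated differently from intra-document ones, can be sketched as follows. The specific weights and the strength-based sentence ranking are illustrative simplifications, not the paper's exact method.

```python
from itertools import combinations

def sentence_strengths(documents, inter_weight=2.0, intra_weight=1.0):
    """documents: list of documents, each a list of sentence strings.
    Returns per-sentence strength, weighting inter-document (inter-layer)
    shared-word links differently from intra-document (intra-layer) ones."""
    sents = [(d, i, set(s.lower().split()))
             for d, doc in enumerate(documents) for i, s in enumerate(doc)]
    strength = {(d, i): 0.0 for d, i, _ in sents}
    for (d1, i1, w1), (d2, i2, w2) in combinations(sents, 2):
        shared = len(w1 & w2)
        if shared:
            w = shared * (inter_weight if d1 != d2 else intra_weight)
            strength[(d1, i1)] += w
            strength[(d2, i2)] += w
    return strength

def top_sentences(documents, k):
    """Extractive summary: the k sentences with highest strength,
    identified as (document index, sentence index) pairs."""
    strength = sentence_strengths(documents)
    return sorted(strength, key=strength.get, reverse=True)[:k]
```

Collapsing `inter_weight` and `intra_weight` to the same value recovers the single-layer baseline, which is exactly the distinction the study evaluates.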
Wisdom of Crowds cluster ensemble
The Wisdom of Crowds is a phenomenon described in social science that
suggests four criteria applicable to groups of people. It is claimed that, if
these criteria are satisfied, then the aggregate decisions made by a group will
often be better than those of its individual members. Inspired by this concept,
we present a novel feedback framework for the cluster ensemble problem, which
we call Wisdom of Crowds Cluster Ensemble (WOCCE). Although many conventional
cluster ensemble methods focusing on diversity have recently been proposed,
WOCCE analyzes the conditions necessary for a crowd to exhibit this collective
wisdom. These include decentralization criteria for generating primary results,
independence criteria for the base algorithms, and diversity criteria for the
ensemble members. We suggest appropriate procedures for evaluating these
measures, and propose a new measure to assess the diversity. We evaluate the
performance of WOCCE against some other traditional base algorithms as well as
state-of-the-art ensemble methods. The results demonstrate the efficiency of
WOCCE's aggregate decision-making compared to other algorithms.
Comment: Intelligent Data Analysis (IDA), IOS Press
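One simple way to quantify the diversity criterion among ensemble members, not necessarily the measure WOCCE itself proposes, is pairwise disagreement over co-membership of item pairs:

```python
from itertools import combinations

def comembership_diversity(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings disagree about
    whether the pair belongs to the same cluster."""
    disagree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        total += 1
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a != same_b:
            disagree += 1
    return disagree / total if total else 0.0
```

Averaging this quantity over all pairs of base clusterings gives one scalar diversity score for the whole ensemble; identical members score 0.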
Nonnegative Multi-level Network Factorization for Latent Factor Analysis
Nonnegative Matrix Factorization (NMF) aims to factorize a matrix into two
optimized nonnegative matrices and has been widely used for unsupervised
learning tasks such as product recommendation based on a rating matrix.
However, although networks between nodes with the same nature exist, standard
NMF overlooks them, e.g., the social network between users. This problem leads
to comparatively low recommendation accuracy because these networks are also
reflections of the nature of the nodes, such as the preferences of users in a
social network. Also, social networks, as complex networks, have many different
structures. Each structure is a composition of links between nodes and reflects
the nature of nodes, so retaining the different network structures will lead to
differences in recommendation performance. To investigate the impact of these
network structures on the factorization, this paper proposes four multi-level
network factorization algorithms based on standard NMF, which integrate
the vertical network (e.g., the rating matrix) with the structures of the
horizontal network (e.g., the user social network). These algorithms are
carefully designed
with corresponding convergence proofs to retain four desired network
structures. Experiments on synthetic data show that the proposed algorithms are
able to preserve the desired network structures as designed. Experiments on
real-world data show that considering the horizontal networks improves the
accuracy of document clustering and recommendation with standard NMF, and
various structures show their differences in performance on these two tasks.
These results can be directly used in document clustering and recommendation
systems.
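Standard NMF, the baseline these algorithms extend, factorizes a nonnegative matrix V into W and H via the classic Lee-Seung multiplicative updates. The network-regularized variants add extra terms to these updates, which are omitted in this sketch.

```python
import numpy as np

def nmf(V, k, iters=1000, seed=0, eps=1e-9):
    """Lee-Seung multiplicative updates minimizing ||V - W @ H||_F^2,
    with W (n x k) and H (k x m) kept elementwise nonnegative."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(iters):
        # multiplicative updates preserve nonnegativity by construction
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Because the updates are purely multiplicative, any structure imposed on the initial factors (or added as regularization terms in the numerators and denominators) can be retained, which is the hook the four proposed algorithms exploit.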
Short Text Topic Modeling Techniques, Applications, and Performance: A Survey
Inferring discriminative and coherent latent topics from short texts is a
critical and fundamental task, since many real-world applications require
semantic understanding of short texts. Traditional long-text topic modeling
algorithms (e.g., PLSA and LDA) based on word co-occurrences cannot solve this
problem very well since only very limited word co-occurrence information is
available in short texts. Therefore, short text topic modeling has already
attracted much attention from the machine learning research community in recent
years, which aims at overcoming the problem of sparseness in short texts. In
this survey, we conduct a comprehensive review of various short text topic
modeling techniques proposed in the literature. We present three categories of
methods based on Dirichlet multinomial mixture, global word co-occurrences, and
self-aggregation, with examples of representative approaches in each category
and analysis of their performance on various tasks. We develop the first
comprehensive open-source Java library, called STTM, which integrates all
surveyed algorithms within a unified interface, along with benchmark datasets,
to facilitate the development of new methods in this research field. Finally, we
evaluate these state-of-the-art methods on many real-world datasets and compare
their performance against one another and against long-text topic modeling
algorithms.
Comment: arXiv admin note: text overlap with arXiv:1808.02215 by other authors
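The "global word co-occurrences" category the survey describes (biterm-style models, for instance) starts from unordered word-pair counts pooled over the whole corpus rather than per document; extracting these counts is simple:

```python
from collections import Counter
from itertools import combinations

def biterm_counts(short_texts):
    """Count unordered word pairs (biterms) pooled across all short texts.
    Pooling over the corpus is what compensates for the sparse
    co-occurrence signal inside any single short text."""
    counts = Counter()
    for text in short_texts:
        words = text.lower().split()
        for w1, w2 in combinations(words, 2):
            if w1 != w2:
                counts[tuple(sorted((w1, w2)))] += 1
    return counts
```

A topic model in this category then fits topics to the pooled biterm distribution instead of to per-document word counts.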
State of the Art, Evaluation and Recommendations regarding "Document Processing and Visualization Techniques"
Several Networks of Excellence have been set up in the framework of the
European FP5 research program. Among these Networks of Excellence, the NEMIS
project focuses on the field of Text Mining.
Within this field, document processing and visualization was identified as
one of the key topics and the WG1 working group was created in the NEMIS
project, to carry out a detailed survey of techniques associated with the text
mining process and to identify the relevant research topics in related research
areas.
In this document we present the results of this comprehensive survey. The
report includes a description of the current state-of-the-art and practice, a
roadmap for follow-up research in the identified areas, and recommendations for
anticipated technological development in the domain of text mining.
Comment: 54 pages, Report of Working Group 1 for the European Network of
Excellence (NoE) in Text Mining and its Applications in Statistics (NEMIS
- …