32 research outputs found
Synsets improve short text clustering for search support: combining LDA and WordNet
In this study, we propose a short text clustering approach that uses WordNet as an external resource to cluster documents from corpus.byu.edu. Experimental results show that our approach substantially improves clustering performance. The factors that influence the performance of the topic model are the total number of documents, the distribution of Synsets among topics, and the word overlap between the query's Synsets; performance is also affected by Synsets missing from WordNet. Finally, we present an approach that uses clustering to generate ranked query suggestions for disambiguating the query. Combined with the query's Synsets, text document clustering provides an effective way to disambiguate a user's search query by organizing a large set of search results into a small number of groups labeled with Synsets from WordNet.
Master of Science in Information Science
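The Synset-enrichment idea can be sketched in a few lines: expand each token of a snippet with its Synset members, then cluster the enriched representations. In this minimal sketch the tiny `SYNSETS` table is a hypothetical stand-in for WordNet lookups, and greedy cosine-threshold grouping stands in for the LDA-based topic model; neither is the study's actual implementation.

```python
from collections import Counter
import math

# Hypothetical stand-in for WordNet: term -> members of its Synset.
SYNSETS = {
    "car": ["car", "auto", "automobile"],
    "auto": ["car", "auto", "automobile"],
    "bank": ["bank", "depository"],
}

def enrich(text):
    """Expand each token with its Synset members (the token itself if none)."""
    expanded = []
    for token in text.lower().split():
        expanded.extend(SYNSETS.get(token, [token]))
    return Counter(expanded)

def cosine(a, b):
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(texts, threshold=0.3):
    """Greedy clustering: attach each snippet to the first cluster whose
    seed vector is similar enough, otherwise start a new cluster."""
    clusters = []  # list of (seed vector, member snippets)
    for text in texts:
        vec = enrich(text)
        for seed, members in clusters:
            if cosine(vec, seed) >= threshold:
                members.append(text)
                break
        else:
            clusters.append((vec, [text]))
    return [members for _, members in clusters]

docs = ["car dealer prices", "auto repair shop", "bank interest rates"]
groups = cluster(docs)
# "car ..." and "auto ..." share Synset members, so they land in one group.
```

Without the Synset expansion, the first two snippets share no surface tokens at all; the enrichment is what makes them clusterable.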
Facilitating Creativity in Collaborative Work with Computational Intelligence Software
The use of computational intelligence to leverage social creativity is a relatively new approach that allows organizations to find creative solutions to complex problems in which interaction between stakeholders is crucial. The creative solutions that come from joint thinking, from the combined knowledge and abilities of people with diverse perspectives, contrast with traditional views of creativity that focus primarily on the individual as the main source of creativity. To support social creativity in organizations, in this paper we present computational intelligence software tools for that purpose and an architecture for creating software mashups based on the concept of an affinity space. The affinity space defines a digital setting that facilitates specific scenarios in collaborative business environments. The solution presented includes a set of free and open source software tools, ranging from newly developed brainstorming applications to an expertise recommender, for enhancing social creativity in the enterprise. The paper addresses software design issues and presents reflections on the research work undertaken in the COLLAGE project between 2012 and 2015.
Neural Methods for Effective, Efficient, and Exposure-Aware Information Retrieval
Neural networks with deep architectures have demonstrated significant
performance improvements in computer vision, speech recognition, and natural
language processing. The challenges in information retrieval (IR), however, are
different from these other application areas. A common form of IR involves
ranking of documents--or short passages--in response to keyword-based queries.
Effective IR systems must deal with the query-document vocabulary mismatch problem
by modeling relationships between different query and document terms and how
they indicate relevance. Models should also consider lexical matches when the
query contains rare terms--such as a person's name or a product model
number--not seen during training, to avoid retrieving semantically related
but irrelevant results. In many real-life IR tasks, the retrieval involves
extremely large collections--such as the document index of a commercial Web
search engine--containing billions of documents. Efficient IR methods should
take advantage of specialized IR data structures, such as the inverted index, to
retrieve efficiently from large collections. Given an information need, the IR
system also mediates how much exposure an information artifact receives by
deciding whether it should be displayed, and where it should be positioned,
among other results. Exposure-aware IR systems may optimize for additional
objectives, besides relevance, such as parity of exposure for retrieved items
and content publishers. In this thesis, we present novel neural architectures
and methods motivated by the specific needs and challenges of IR tasks.
Comment: PhD thesis, University College London (2020).
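The inverted index mentioned above, the core data structure for efficient retrieval from large collections, can be illustrated with a toy sketch (real engines add compression, skip pointers, and ranking on top of this):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search(index, query):
    """Conjunctive keyword search: intersect the posting lists."""
    lists = [set(index.get(term, ())) for term in query.lower().split()]
    if not lists:
        return []
    return sorted(set.intersection(*lists))

docs = [
    "neural ranking models",      # doc 0
    "neural speech recognition",  # doc 1
    "ranking web documents",      # doc 2
]
index = build_index(docs)
```

For example, `search(index, "neural ranking")` intersects the posting lists of both terms and returns only document 0; only the documents containing some query term are ever touched, which is what makes retrieval from billions of documents feasible.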
Entity-Oriented Search
This open access book covers all facets of entity-oriented search—where “search” can be interpreted in the broadest sense of information access—from a unified point of view, and provides a coherent and comprehensive overview of the state of the art. It represents the first synthesis of research in this broad and rapidly developing area. Selected topics are discussed in depth, the goal being to establish fundamental techniques and methods as a basis for future research and development. Additional topics are treated at a survey level only, with numerous pointers to the relevant literature. A roadmap for future research, based on open issues and challenges identified along the way, rounds out the book.
The book is divided into three main parts, sandwiched between introductory and concluding chapters. The first two chapters introduce readers to the basic concepts, provide an overview of entity-oriented search tasks, and present the various types and sources of data that will be used throughout the book. Part I deals with the core task of entity ranking: given a textual query, possibly enriched with additional elements or structural hints, return a ranked list of entities. This core task is examined in a number of different variants, using both structured and unstructured data collections, and numerous query formulations. In turn, Part II is devoted to the role of entities in bridging unstructured and structured data. Part III explores how entities can enable search engines to understand the concepts, meaning, and intent behind the query that the user enters into the search box, and how they can provide rich and focused responses (as opposed to merely a list of documents)—a process known as semantic search. The final chapter concludes the book by discussing the limitations of current approaches, and suggesting directions for future research.
Researchers and graduate students are the primary target audience of this book.
A general background in information retrieval is sufficient to follow the material, along with an understanding of basic probability and statistics and a basic knowledge of machine learning concepts and supervised learning algorithms.
Computer Science & Technology Series : XXI Argentine Congress of Computer Science. Selected papers
CACIC’15 was the 21st Congress in the CACIC series. It was organized by the School of Technology at the UNNOBA (North-West of Buenos Aires National University) in Junín, Buenos Aires.
The Congress included 13 Workshops with 131 accepted papers, 4 Conferences, 2 invited tutorials, various meetings related to Computer Science Education (professors, PhD students, curricula), and an International School with 6 courses.
CACIC 2015 was organized following the traditional Congress format, with 13 Workshops covering a diversity of dimensions of Computer Science research. Each topic was supervised by a committee of 3-5 chairs from different universities.
The call for papers attracted a total of 202 submissions. An average of 2.5 review reports were collected for each paper, for a grand total of 495 review reports involving about 191 different reviewers.
A total of 131 full papers, involving 404 authors and 75 universities, were accepted, and 24 of them were selected for this book.
Red de Universidades con Carreras en Informática (RedUNCI)
Knowledge graph exploration for natural language understanding in web information retrieval
In this thesis, we study methods to leverage information from fully-structured knowledge bases
(KBs), in particular the encyclopedic knowledge graph (KG) DBpedia, for different text-related
tasks from the area of information retrieval (IR) and natural language processing (NLP). The
key idea is to apply entity linking (EL) methods that identify mentions of KB entities in text,
and then exploit the structured information within KGs. Developing entity-centric methods for
text understanding using KG exploration is the focus of this work.
We aim to show that structured background knowledge is a means for improving performance in
different IR and NLP tasks that traditionally only make use of the unstructured text input itself.
Thereby, the KB entities mentioned in text act as connection between the unstructured text and
the structured KG. We focus in particular on how to best leverage the knowledge as contained in
such fully-structured (RDF) KGs like DBpedia with their labeled edges/predicates – in
contrast to the previous Wikipedia-based work we build upon, which typically relies
on unlabeled graphs only. The contribution of this thesis can be structured along its three parts:
In Part I, we apply EL and semantify short text snippets with KB entities. While only retrieving
types and categories from DBpedia for each entity, we are able to leverage this information
to create semantically coherent clusters of text snippets. This pipeline of connecting text to
background knowledge via the mentioned entities will be reused in all following chapters.
In Part II, we focus on semantic similarity and extend the idea of semantifying text with entities
by proposing in Chapter 5 a model that represents whole documents by their entities. In this
model, comparing documents semantically with each other is viewed as the task of comparing
the semantic relatedness of the respective entities, which we address in Chapter 4. We propose
an unsupervised graph weighting schema and show that weighting the DBpedia KG leads to
better results on an existing entity ranking dataset. The exploration of weighted KG paths also
turns out to be useful when trying to disambiguate the entities from an open information
extraction (OIE) system in Chapter 6. With this weighting schema, the integration of KG information
for computing semantic document similarity in Chapter 5 becomes the task of comparing the two
KG subgraphs with each other, which we address by an approximate subgraph matching. Based
on a well-established evaluation dataset for semantic document similarity, we show that our
unsupervised method achieves performance competitive with other state-of-the-art methods.
Our results from this part indicate that KGs can contain helpful background knowledge, in particular
when exploring KG paths, but that selecting the relevant parts of the graph is an important
yet difficult challenge.
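The entity-based document model of Part II can be sketched as follows: each document is reduced to its linked entities, and document similarity becomes an aggregate of pairwise entity relatedness. The `RELATEDNESS` scores below are hypothetical placeholders for the weighted-KG-path measure of Chapter 4, and the symmetric best-match aggregation is one simple choice, not the thesis's exact subgraph-matching method.

```python
# Hypothetical pairwise entity relatedness scores, standing in for the
# weighted DBpedia path-based measure described in Chapter 4.
RELATEDNESS = {
    frozenset({"Berlin", "Germany"}): 0.9,
    frozenset({"Berlin", "Paris"}): 0.5,
    frozenset({"Germany", "France"}): 0.6,
    frozenset({"Paris", "France"}): 0.9,
}

def relatedness(e1, e2):
    """Relatedness of two KB entities; 1.0 for identical entities."""
    if e1 == e2:
        return 1.0
    return RELATEDNESS.get(frozenset({e1, e2}), 0.0)

def doc_similarity(entities_a, entities_b):
    """Symmetrized best-match aggregation: each entity contributes the
    relatedness of its best counterpart in the other document."""
    def directed(src, dst):
        return sum(max(relatedness(e, f) for f in dst) for e in src) / len(src)
    return 0.5 * (directed(entities_a, entities_b) + directed(entities_b, entities_a))

doc_berlin = ["Berlin", "Germany"]
doc_paris = ["Paris", "France"]
similarity = doc_similarity(doc_berlin, doc_paris)
```

The two documents share no entities, yet receive a non-zero similarity purely through the KG relatedness of their entities, which is the effect the entity-centric model is after.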
In Part III, we shift to the task of relevance ranking and first study in Chapter 7 how to best
retrieve KB entities for a given keyword query. Combining again text with KB information, we
extract entities from the top-k retrieved, query-specific documents and then link the documents
to two different KBs, namely Wikipedia and DBpedia. In a learning-to-rank setting, we study
extensively which features from the text, the Wikipedia KB, and the DBpedia KG can be helpful
for ranking entities with respect to the query. Experimental results on two datasets, which build
upon existing TREC document retrieval collections, indicate that the document-based mention
frequency of an entity and the Wikipedia-based query-to-entity similarity are both important
features for ranking. The KG paths in contrast play only a minor role in this setting, even when
integrated with a semantic kernel extension. In Chapter 8, we further extend the integration of
query-specific text documents and KG information, by extracting not only entities, but also relations
from text. In this exploratory study based on a self-created relevance dataset, we find that
not all extracted relations are relevant with respect to the query, but that they often carry
information not contained in the DBpedia KG. The main insight from the research presented in
this part is that in a query-specific setting, established IR methods for document retrieval provide
an important source of information even for entity-centric tasks, and that a close integration of
relevant text document and background knowledge is promising.
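A learning-to-rank setting like the one in Chapter 7 scores each candidate entity by combining per-entity features, such as document-based mention frequency and query-to-entity similarity. A minimal linear sketch (the feature values and weights are hypothetical; a real system would learn the weights from training data):

```python
def rank_entities(candidates, weights):
    """Score each candidate entity as a weighted sum of its features and
    return the candidates sorted by descending score."""
    def score(features):
        return sum(weights[name] * value for name, value in features.items())
    return sorted(candidates, key=lambda c: score(c["features"]), reverse=True)

# Hypothetical feature values for three candidate entities.
candidates = [
    {"entity": "Barack_Obama",
     "features": {"mention_freq": 0.8, "query_sim": 0.9}},
    {"entity": "White_House",
     "features": {"mention_freq": 0.5, "query_sim": 0.4}},
    {"entity": "Chicago",
     "features": {"mention_freq": 0.2, "query_sim": 0.3}},
]

# Weights that a learning-to-rank method would normally fit to training data.
weights = {"mention_freq": 0.6, "query_sim": 0.4}

ranking = [c["entity"] for c in rank_entities(candidates, weights)]
```

The thesis finding quoted above corresponds to such weights being large for mention frequency and query-to-entity similarity, and small for KG-path features.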
Finally, in the concluding chapter we argue that future research should further address the integration
of KG information with entities and relations extracted from (specific) text documents,
as their potential does not yet seem fully explored. The same also holds true for better KG
exploration, which has gained some scientific interest in recent years. It seems to us that both
aspects will remain interesting problems in the coming years, also because of the growing
importance of KGs for web search and knowledge modeling in industry and academia.
Disambiguated query suggestions and personalized content-similarity and novelty ranking of clustered results to optimize web searches
In this paper, we address the so-called “ranked list problem” of Web searches, which occurs when users submit short requests to search engines. Generally, as a consequence of term ambiguity and polysemy, users engage in long cycles of query reformulation in an attempt to capture relevant information among the top-ranked results.
The overall objective of the proposal is to support the user in optimizing Web searches by reducing the need for long search iterations. Specifically, in this paper we describe an iterative query disambiguation mechanism that follows three main phases. (1) The results of a Web search performed by the user (by submitting a query to a search engine) are clustered. (2) The clusters are ranked, based on a personalized balance of their content-similarity to the query and their novelty. (3) From each cluster, a disambiguated query that highlights the main contents of the cluster is generated, such that the new query can potentially retrieve new documents not previously retrieved; the disambiguated queries serve as suggestions for new, more focused searches.
The paper describes the proposal and illustrates a sample application of the mechanism. Finally, the paper presents a user evaluation experiment of the proposed approach, comparing it with common practice based on the direct use of search engines.
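Phase (2), ranking clusters by a personalized balance of content-similarity and novelty, can be sketched as a convex combination score. The per-cluster scores and the balance parameter `alpha` below are illustrative assumptions, not the paper's exact formulation:

```python
def rank_clusters(clusters, alpha=0.5):
    """Rank clusters by alpha * similarity + (1 - alpha) * novelty,
    where alpha encodes the user's personalized balance."""
    def score(c):
        return alpha * c["similarity"] + (1 - alpha) * c["novelty"]
    return sorted(clusters, key=score, reverse=True)

# Hypothetical per-cluster scores: content-similarity to the query and
# novelty (e.g. the share of documents not retrieved in earlier iterations).
clusters = [
    {"label": "jaguar car",    "similarity": 0.9, "novelty": 0.2},
    {"label": "jaguar animal", "similarity": 0.6, "novelty": 0.9},
    {"label": "jaguar os",     "similarity": 0.3, "novelty": 0.5},
]

ranked = [c["label"] for c in rank_clusters(clusters, alpha=0.4)]
```

With a novelty-leaning `alpha` of 0.4, the less query-similar but more novel "jaguar animal" cluster outranks the obvious "jaguar car" interpretation, which is exactly the trade-off a personalized balance is meant to expose.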