33 research outputs found
Improving Natural Language Inference Using External Knowledge in the Science Questions Domain
Natural Language Inference (NLI) is fundamental to many Natural Language
Processing (NLP) applications including semantic search and question answering.
The NLI problem has gained significant attention thanks to the release of large
scale, challenging datasets. Present approaches to the problem largely focus on
learning-based methods that use only textual information in order to classify
whether a given premise entails, contradicts, or is neutral with respect to a
given hypothesis. Surprisingly, the use of methods based on structured
knowledge -- a central topic in artificial intelligence -- has not received
much attention vis-a-vis the NLI problem. While there are many open knowledge
bases that contain various types of reasoning information, their use for NLI
has not been well explored. To address this, we present a combination of
techniques that harness knowledge graphs to improve performance on the NLI
problem in the science questions domain. We present the results of applying our
techniques on text, graph, and text-to-graph based models, and discuss
implications for the use of external knowledge in solving the NLI problem. Our
model achieves the new state-of-the-art performance on the NLI problem over the
SciTail science questions dataset.Comment: 9 pages, 3 figures, 5 table
Diversified query expansion
La diversification des résultats de recherche (DRR) vise à sélectionner divers documents à partir des résultats de recherche afin de couvrir autant d’intentions que possible. Dans les approches existantes, on suppose que les résultats initiaux sont suffisamment diversifiés et couvrent bien les aspects de la requête. Or, on observe souvent que les résultats initiaux n’arrivent pas à couvrir certains aspects.
Dans cette thèse, nous proposons une nouvelle approche de DRR qui consiste à diversifier l’expansion de requête (DER) afin d’avoir une meilleure couverture des aspects. Les termes d’expansion sont sélectionnés à partir d’une ou de plusieurs ressource(s) suivant le principe de pertinence marginale maximale. Dans notre première contribution, nous proposons une méthode pour DER au niveau des termes où la similarité entre les termes est mesurée superficiellement à l’aide des ressources. Quand plusieurs ressources sont utilisées pour DER, elles ont été uniformément combinées dans la littérature, ce qui permet d’ignorer la contribution individuelle de chaque ressource par rapport à la requête. Dans la seconde contribution de cette thèse, nous proposons une nouvelle méthode de pondération de ressources selon la requête. Notre méthode utilise un ensemble de caractéristiques
qui sont intégrées à un modèle de régression linéaire, et génère à partir de chaque ressource un nombre de termes d’expansion proportionnellement au poids de cette ressource.
Les méthodes proposées pour DER se concentrent sur l’élimination de la redondance entre les termes d’expansion sans se soucier si les termes sélectionnés couvrent effectivement les différents aspects de la requête. Pour pallier à cet inconvénient, nous introduisons dans la troisième contribution de cette thèse une nouvelle méthode pour DER au niveau des aspects. Notre méthode est entraînée de façon supervisée selon le principe que les termes reliés doivent correspondre au même aspect. Cette méthode permet de sélectionner des termes d’expansion à un niveau sémantique latent afin de couvrir autant que possible différents aspects de la requête. De plus, cette méthode autorise l’intégration de plusieurs ressources afin de suggérer des termes d’expansion, et supporte l’intégration de plusieurs contraintes telles que la contrainte de dispersion.
Nous évaluons nos méthodes à l’aide des données de ClueWeb09B et de trois collections de requêtes de TRECWeb track et montrons l’utilité de nos approches par rapport aux méthodes existantes.Search Result Diversification (SRD) aims to select diverse documents from the search results in order to cover as many search intents as possible. For the existing approaches, a prerequisite is that the initial retrieval results contain diverse documents and ensure a good coverage of the query aspects.
In this thesis, we investigate a new approach to SRD by diversifying the query, namely diversified query expansion (DQE). Expansion terms are selected either from a single resource or from multiple resources following the Maximal Marginal Relevance principle. In the first contribution, we propose a new term-level DQE method in which word similarity is determined at the surface (term) level based on the resources.
When different resources are used for the purpose of DQE, they are combined in a uniform way, thus totally ignoring the contribution differences among resources. In practice the usefulness of a resource greatly changes depending on the query. In the second contribution, we propose a new method of query level resource weighting for DQE. Our method is based on a set of features which are integrated into a linear regression model and generates for a resource a number of expansion candidates that is proportional to the weight of that resource.
Existing DQE methods focus on removing the redundancy among selected expansion terms and no attention has been paid on how well the selected expansion terms can indeed cover the query aspects. Consequently, it is not clear how we can cope with the semantic relations between terms. To overcome this drawback, our third contribution in this thesis aims to introduce a novel method for aspect-level DQE which relies on an explicit modeling of query aspects based on embedding. Our method (called latent semantic aspect embedding) is trained in a supervised manner according to the principle that related terms should correspond to the same aspects. This method allows us to select expansion terms at a latent semantic level in order to cover as much as possible the aspects of a given query. In addition, this method also incorporates several different external resources to suggest potential expansion terms, and supports several constraints, such as the sparsity constraint.
We evaluate our methods using ClueWeb09B dataset and three query sets from TRECWeb tracks, and show the usefulness of our proposed approaches compared to the state-of-the-art approaches
Spatial Search Strategies for Open Government Data: A Systematic Comparison
The increasing availability of open government datasets on the Web calls for
ways to enable their efficient access and searching. There is however an
overall lack of understanding regarding spatial search strategies which would
perform best in this context. To address this gap, this work has assessed the
impact of different spatial search strategies on performance and user relevance
judgment. We harvested machine-readable spatial datasets and their metadata
from three English-based open government data portals, performed metadata
enhancement, developed a prototype and performed both a theoretical and
user-based evaluation. The results highlight that (i) switching between area of
overlap and Hausdorff distance for spatial similarity computation does not have
any substantial impact on performance; and (ii) the use of Hausdorff distance
induces slightly better user relevance ratings than the use of area of overlap.
The data collected and the insights gleaned may serve as a baseline against
which future work can compare.Comment: Paper accepted to GIR'19: 13th Workshop on Geographic Information
Retrieval (Lyon, France
A Survey of Volunteered Open Geo-Knowledge Bases in the Semantic Web
Over the past decade, rapid advances in web technologies, coupled with
innovative models of spatial data collection and consumption, have generated a
robust growth in geo-referenced information, resulting in spatial information
overload. Increasing 'geographic intelligence' in traditional text-based
information retrieval has become a prominent approach to respond to this issue
and to fulfill users' spatial information needs. Numerous efforts in the
Semantic Geospatial Web, Volunteered Geographic Information (VGI), and the
Linking Open Data initiative have converged in a constellation of open
knowledge bases, freely available online. In this article, we survey these open
knowledge bases, focusing on their geospatial dimension. Particular attention
is devoted to the crucial issue of the quality of geo-knowledge bases, as well
as of crowdsourced data. A new knowledge base, the OpenStreetMap Semantic
Network, is outlined as our contribution to this area. Research directions in
information integration and Geographic Information Retrieval (GIR) are then
reviewed, with a critical discussion of their current limitations and future
prospects
Leveraging semantic resources in diversified query expansion
A search query, being a very concise grounding of user intent, could potentially have many possible interpretations. Search engines hedge their bets by diversifying top results to cover multiple such possibilities so that the user is likely to be satisfied, whatever be her intended interpretation. Diversified Query Expansion is the problem of diversifying query expansion suggestions, so that the user can specialize the query to better suit her intent, even before perusing search results. In this paper, we consider the usage of semantic resources and tools to arrive at improved methods for diversified query expansion. In particular, we develop two methods, those that leverage Wikipedia and pre-learnt distributional word embeddings respectively. Both the approaches operate on a common three-phase framework; that of first taking a set of informative terms from the search results of the initial query, then building a graph, following by using a diversity-conscious node ranking to prioritize candidate terms for diversified query expansion. Our methods differ in the second phase, with the first method Select-Link-Rank (SLR) linking terms with Wikipedia entities to accomplish graph construction; on the other hand, our second method, Select-Embed-Rank (SER), constructs the graph using similarities between distributional word embeddings. Through an empirical analysis and user study, we show that SLR ourperforms state-of-the-art diversified query expansion methods, thus establishing that Wikipedia is an effective resource to aid diversified query expansion. Our empirical analysis also illustrates that SER outperforms the baselines convincingly, asserting that it is the best available method for those cases where SLR is not applicable; these include narrow-focus search systems where a relevant knowledge base is unavailable. Our SLR method is also seen to outperform a state-of-the-art method in the task of diversified entity ranking. <br/
An Anthological Review of Research Utilizing MontyLingua: a Python-Based End-to-End Text Processor
MontyLingua, an integral part of ConceptNet which is currently the largest commonsense knowledge base, is an English text processor developed using Python programming language in MIT Media Lab. The main feature of MontyLingua is the coverage for all aspects of English text processing from raw input text to semantic meanings and summary generation, yet each component in MontyLingua is loosely-coupled to each other at the architectural and code level, which enabled individual components to be used independently or substituted. However, there has been no review exploring the role of MontyLingua in recent research work utilizing it. This paper aims to review the use of and roles played by MontyLingua and its components in research work published in 19 articles between October 2004 and August 2006. We had observed a diversified use of MontyLingua in many different areas, both generic and domain-specific. Although the use of text summarizing component had not been observe, we are optimistic that it will have a crucial role in managing the current trend of information overload in future research
A systematic comparison of spatial search strategies for open government datasets
Dissertation submitted in partial fulfilment of the requirements for the Degree of Master of Science in Geospatial TechnologiesDatasets produced or collected by governments are being made publicly available
for re-use. Open government data portals help realize such reuse by providing list
of datasets and links to access those datasets. This ensures that users can search,
inspect and use the data easily.
With the rapidly increasing size of datasets in open government data portals,
just like it is the case with the web, nding relevant datasets with a query of few
keywords is a challenge. Furthermore, those data portals not only consist of textual
information but also georeferenced data that needs to be searched properly. Currently,
most popular open government data portals like the data.gov.uk and data.gov.ie lack
the support for simultaneous thematic and spatial search. Moreover, the use of query
expansion hasn't also been studied in open government datasets.
In this study we have assessed di erent spatial search strategies and query expansions'
performance and impact on user relevance judgment. To evaluate those
strategies we harvested machine readable spatial datasets and their metadata from
three English based open government data portals, performed metadata enhancement,
developed a prototype and performed theoretical and user evaluation.
According to the results from the evaluations keyword based search strategy returned
limited number of results but the highest relevance rating. In the other hand
aggregated spatial and thematic search improved the number of results of the baseline
keyword based strategy with a 1 second increase in response time and but decreased
relevance rating. Moreover, strategies based on WordNet Synonyms query expansion
exhibited the highest relevance rated rst seven results than all other strategies except
the keyword based baseline strategy in three out of the four query terms.
Regarding the use of Hausdor distance and area of overlap, since documents
were returned as results only if they overlap with the query, the number of results
returned were the same in both spatial similarities. But strategies using Hausdor
distance were of higher relevance rating and average mean than area of overlap based
strategies in three of the four queries.
In conclusion, while the spatial search strategies assessed in this study can be
used to improve the existing keyword based OGDs search approaches, we recommend
OGD developers to also consider using WordNet Synonyms based query expansion
and hausdor distance as a way of improving relevant spatial data discovery in open
government datasets using few keywords and tolerable response time