Information Extraction for Event Ranking
Search engines are evolving towards richer and stronger semantic approaches, focusing on entity-oriented tasks where knowledge bases have become fundamental. In order to support semantic search, search engines are increasingly reliant on robust information extraction systems. In fact, most modern search engines are already highly dependent on a well-curated knowledge base. Nevertheless, they still lack the ability to effectively and automatically take advantage of multiple heterogeneous data sources. Central tasks include harnessing the information locked within textual content by linking mentioned entities to a knowledge base, or integrating multiple knowledge bases to answer natural language questions. Combining text and knowledge bases is frequently used to improve search results, but it can also be used for the query-independent ranking of entities like events. In this work, we present a complete information extraction pipeline for the Portuguese language, covering all stages from data acquisition to knowledge base population. We also describe a practical application of the automatically extracted information, to support the ranking of upcoming events displayed on the landing page of an institutional search engine, where space is limited to only three relevant events. We manually annotate a dataset of news, covering event announcements from multiple faculties and organic units of the institution. We then use it to train and evaluate the named entity recognition module of the pipeline. We rank events by taking advantage of identified entities, as well as partOf relations, in order to compute an entity popularity score, as well as an entity click score based on implicit feedback from clicks in the institutional search engine. We then combine these two scores with the number of days to the event, obtaining a final ranking for the three most relevant upcoming events.
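The final ranking step described in this abstract can be illustrated with a minimal sketch. The abstract does not state how the two scores and the number of days are combined, so the linear weighting, the weights, the `1 / (1 + days)` time term, and all names below (`Event`, `score`, `top_events`) are assumptions made for illustration only, not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Event:
    title: str
    popularity: float  # entity popularity score, assumed normalised to [0, 1]
    clicks: float      # entity click score, assumed normalised to [0, 1]
    days_until: int    # number of days from today to the event


def score(event: Event, w_pop: float = 0.4, w_click: float = 0.4,
          w_time: float = 0.2) -> float:
    """Hypothetical linear combination of the three signals."""
    # Assumed time term: events happening sooner contribute a larger value.
    time_term = 1.0 / (1.0 + event.days_until)
    return w_pop * event.popularity + w_click * event.clicks + w_time * time_term


def top_events(events: list[Event], k: int = 3) -> list[Event]:
    """Return the k highest-scoring events, best first."""
    return sorted(events, key=score, reverse=True)[:k]


events = [
    Event("A", popularity=0.9, clicks=0.8, days_until=10),
    Event("B", popularity=0.2, clicks=0.1, days_until=0),
    Event("C", popularity=0.5, clicks=0.5, days_until=1),
    Event("D", popularity=0.1, clicks=0.1, days_until=30),
]
print([e.title for e in top_events(events)])
```

With `k=3` the function mirrors the three-slot constraint of the landing page; in practice the weights would need tuning against real click data.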
Improving Natural Language Inference Using External Knowledge in the Science Questions Domain
Natural Language Inference (NLI) is fundamental to many Natural Language
Processing (NLP) applications including semantic search and question answering.
The NLI problem has gained significant attention thanks to the release of large
scale, challenging datasets. Present approaches to the problem largely focus on
learning-based methods that use only textual information in order to classify
whether a given premise entails, contradicts, or is neutral with respect to a
given hypothesis. Surprisingly, the use of methods based on structured
knowledge -- a central topic in artificial intelligence -- has not received
much attention vis-a-vis the NLI problem. While there are many open knowledge
bases that contain various types of reasoning information, their use for NLI
has not been well explored. To address this, we present a combination of
techniques that harness knowledge graphs to improve performance on the NLI
problem in the science questions domain. We present the results of applying our
techniques on text, graph, and text-to-graph based models, and discuss
implications for the use of external knowledge in solving the NLI problem. Our
model achieves new state-of-the-art performance on the NLI problem over the
SciTail science questions dataset. Comment: 9 pages, 3 figures, 5 tables
The Short Text Matching Model Enhanced with Knowledge via Contrastive Learning
In recent years, short text matching tasks have been widely applied in the
fields of advertising search and recommendation. The difficulty lies in the lack
of semantic information and word ambiguity caused by the short length of the
text. Previous works have introduced complement sentences or knowledge bases to
provide additional feature information. However, these methods do not fully
model the interaction between the original sentence and the complement sentence,
and do not consider the noise that may arise from introducing external
knowledge bases. Therefore, this paper proposes a short text matching model
that combines contrastive learning and external knowledge. The model uses a
generative model to generate corresponding complement sentences and uses the
contrastive learning method to guide the model to obtain more semantically
meaningful encoding of the original sentence. In addition, to avoid noise, we
use keywords as the main semantics of the original sentence to retrieve
corresponding knowledge words in the knowledge base, and construct a knowledge
graph. The graph encoding model is used to integrate the knowledge base
information into the model. Our designed model achieves state-of-the-art
performance on two publicly available Chinese text matching datasets,
demonstrating the effectiveness of our model. Comment: 11 pages, 2 figures
Using TREC for cross-comparison between classic IR and ontology-based search models at a Web scale
The construction of standard datasets and benchmarks to evaluate ontology-based search approaches and to compare them against baseline IR models is a major open problem in the semantic technologies community. In this paper we propose a novel evaluation benchmark for ontology-based IR models based on an adaptation of the well-known Cranfield paradigm (Cleverdon, 1967) traditionally used by the IR community. The proposed benchmark comprises: 1) a text document collection, 2) a set of queries and their corresponding document relevance judgments and 3) a set of ontologies and Knowledge Bases covering the query topics. The document collection and the set of queries and judgments are taken from one of the most widely used datasets in the IR community, the TREC Web track. As a use case example we apply the proposed benchmark to compare a real ontology-based search model (Fernandez et al., 2008) against the best IR systems of the TREC 9 and TREC 2001 competitions. A deep analysis of the strengths and weaknesses of this benchmark and a discussion of how it can be used to evaluate other ontology-based search systems is also included at the end of the paper.
Semantic Annotation and Search: Bridging the Gap between Text, Knowledge and Language
In recent years, the ever-increasing quantities of entities in large knowledge bases on the Web, such as DBpedia, Freebase and YAGO, pose new challenges but at the same time open up new opportunities for intelligent information access. These knowledge bases (KBs) have become valuable resources in many research areas, such as natural language processing (NLP) and information retrieval (IR). Recently, almost every major commercial Web search engine has incorporated entities into their search process, including Google’s Knowledge Graph, Yahoo!’s Web of Objects and Microsoft’s Satori Graph/Bing Snapshots. The goal is to bridge the semantic gap between natural language text and formalized knowledge.
Within the context of globalization, multilingual and cross-lingual access to information has emerged as an issue of major interest. Nowadays, more and more people from different countries are connecting to the Internet, in particular the Web, and many users can understand more than one language. While the diversity of languages on the Web has been growing, for most people there is still very little content in their native language. As a consequence of the ability to understand more than one language, users are also interested in Web content in other languages than their mother tongue. There is an impending need for technologies that can help in overcoming the language barrier for multilingual and cross-lingual information access. In this thesis, we face the overall research question of how to allow for semantic-aware and cross-lingual processing of Web documents and user queries by leveraging knowledge bases.
With the goal of addressing this complex problem, we provide the following solutions: (1) semantic annotation for addressing the semantic gap between Web documents and knowledge; (2) semantic search for coping with the semantic gap between keyword queries and knowledge; (3) the exploitation of cross-lingual semantics for overcoming the language barrier between natural language expressions (i.e., keyword queries and Web documents) and knowledge for enabling cross-lingual semantic annotation and search. We evaluated these solutions and the results showed advances beyond the state-of-the-art. In addition, we implemented a framework of cross-lingual semantic annotation and search, which has been widely used for cross-lingual processing of media content in the context of our research projects.
Semantic modelling of video annotations – the TIB AV-Portal's metadata structure
The TIB AV-Portal (https://av.tib.eu) is an online platform for sharing scientific videos operated by the German National Library of Science and Technology (TIB). Besides the allocation of Digital Object Identifiers (DOI) and Media Fragment Identifiers (MFID) for video citation, long-term preservation of all material and open licenses like Creative Commons, the core features of the TIB AV-Portal are its various methods of automated metadata extraction to fundamentally improve search functionalities (e.g. fine-grained search and faceting). These comprise automated chaptering, extraction of superimposed text, speech to text recognition, and the detection of predefined visual concepts. In addition, extracted metadata are consequently mapped against authority files like the German “Gemeinsame Normdatei” and knowledge bases like DBpedia and Library of Congress Subject Headings via a process of automated named entity linking (NEL) to enable semantic and cross-lingual search.
The results of this process are expressed as temporal and/or spatial video annotations, linking extracted metadata to certain key frames and video segments. In order to structure the data, express relations between single entities, and link to external information resources, several common vocabularies, ontologies and knowledge bases are being used. These include amongst others the Open Annotation Data Model, the NLP Interchange Format (NIF), BIBFRAME, the Friend of a Friend vocabulary (FOAF), and Schema.org. Furthermore, all data is stored adhering to the Resource Description Framework (RDF) data model and published as linked open data. This provides third parties with an interoperable and easy to reuse RDF graph representation of the AV-Portal’s metadata.
On our poster we illustrate the general structure of the TIB AV-Portal’s comprehensive metadata, both authoritative and automatically extracted. Here, the main focus is on the underlying video annotation graph model and on semantic interoperability and reusability of the data. In particular we visualize how the use of vocabularies, ontologies and knowledge bases allows for rich semantic descriptions of video materials as well as for easy metadata publication, interlinking, and opportunities for reuse by third parties (e.g. for information retrieval and enrichment). In doing so, we present the AV-Portal’s metadata structure as an illustrative example for the complexity of modelling temporal and spatial video metadata and as a set of best practices in the field of audio-visual resources.
Representation Learning for Words and Entities
This thesis presents new methods for unsupervised learning of distributed
representations of words and entities from text and knowledge bases. The first
algorithm presented in the thesis is a multi-view algorithm for learning
representations of words called Multiview Latent Semantic Analysis (MVLSA). By
incorporating up to 46 different types of co-occurrence statistics for the same
vocabulary of English words, I show that MVLSA outperforms other
state-of-the-art word embedding models. Next, I focus on learning entity
representations for search and recommendation and present the second method of
this thesis, Neural Variational Set Expansion (NVSE). NVSE is also an
unsupervised learning method, but it is based on the Variational Autoencoder
framework. Evaluations with human annotators show that NVSE can facilitate
better search and recommendation of information gathered from noisy, automatic
annotation of unstructured natural language corpora. Finally, I move from
unstructured data and focus on structured knowledge graphs. I present novel
approaches for learning embeddings of vertices and edges in a knowledge graph
that obey logical constraints. Comment: PhD thesis, Machine Learning, Natural Language Processing,
Representation Learning, Knowledge Graphs, Entities, Word Embeddings, Entity
Embeddings
Knowledge will Propel Machine Understanding of Content: Extrapolating from Current Examples
Machine Learning has been a big success story during the AI resurgence. One
particular standout success relates to learning from a massive amount of data.
In spite of early assertions of the unreasonable effectiveness of data, there
is increasing recognition for utilizing knowledge whenever it is available or
can be created purposefully. In this paper, we discuss the indispensable role
of knowledge for deeper understanding of content where (i) large amounts of
training data are unavailable, (ii) the objects to be recognized are complex,
(e.g., implicit entities and highly subjective content), and (iii) applications
need to use complementary or related data in multiple modalities/media. What
brings us to the cusp of rapid progress is our ability to (a) create relevant
and reliable knowledge and (b) carefully exploit knowledge to enhance ML/NLP
techniques. Using diverse examples, we seek to foretell unprecedented progress
in our ability for deeper understanding and exploitation of multimodal data and
continued incorporation of knowledge in learning techniques. Comment: Pre-print of the paper accepted at 2017 IEEE/WIC/ACM International
Conference on Web Intelligence (WI). arXiv admin note: substantial text
overlap with arXiv:1610.0770
A Survey of Volunteered Open Geo-Knowledge Bases in the Semantic Web
Over the past decade, rapid advances in web technologies, coupled with
innovative models of spatial data collection and consumption, have generated a
robust growth in geo-referenced information, resulting in spatial information
overload. Increasing 'geographic intelligence' in traditional text-based
information retrieval has become a prominent approach to respond to this issue
and to fulfill users' spatial information needs. Numerous efforts in the
Semantic Geospatial Web, Volunteered Geographic Information (VGI), and the
Linking Open Data initiative have converged in a constellation of open
knowledge bases, freely available online. In this article, we survey these open
knowledge bases, focusing on their geospatial dimension. Particular attention
is devoted to the crucial issue of the quality of geo-knowledge bases, as well
as of crowdsourced data. A new knowledge base, the OpenStreetMap Semantic
Network, is outlined as our contribution to this area. Research directions in
information integration and Geographic Information Retrieval (GIR) are then
reviewed, with a critical discussion of their current limitations and future
prospects.