Information Retrieval: Recent Advances and Beyond
In this paper, we provide a detailed overview of the models used for
information retrieval in the first and second stages of the typical processing
chain. We discuss the current state-of-the-art models, including term-based,
semantic retrieval, and neural methods. Additionally, we delve into the key
topics related to the learning process of these models. In this way, the survey
offers a comprehensive understanding of the field and is of interest to
researchers and practitioners entering or working in the information retrieval
domain.
Knowledge will Propel Machine Understanding of Content: Extrapolating from Current Examples
Machine Learning has been a big success story during the AI resurgence. One
particular standout success relates to learning from a massive amount of data.
In spite of early assertions of the unreasonable effectiveness of data, there
is increasing recognition for utilizing knowledge whenever it is available or
can be created purposefully. In this paper, we discuss the indispensable role
of knowledge for deeper understanding of content where (i) large amounts of
training data are unavailable, (ii) the objects to be recognized are complex,
(e.g., implicit entities and highly subjective content), and (iii) applications
need to use complementary or related data in multiple modalities/media. What
brings us to the cusp of rapid progress is our ability to (a) create relevant
and reliable knowledge and (b) carefully exploit knowledge to enhance ML/NLP
techniques. Using diverse examples, we seek to foretell unprecedented progress
in our ability for deeper understanding and exploitation of multimodal data and
continued incorporation of knowledge in learning techniques.
Comment: Pre-print of the paper accepted at 2017 IEEE/WIC/ACM International
Conference on Web Intelligence (WI). arXiv admin note: substantial text
overlap with arXiv:1610.0770
Semantic interpretation of events in lifelogging
The topic of this thesis is lifelogging, the automatic, passive recording of a person's daily activities, and in particular the semantic analysis and enrichment of lifelogged data. Our work centers on visual lifelogged data, such as that taken from wearable cameras. Such cameras generate an archive of a person's day from a first-person viewpoint, but one problem is the sheer volume of information that can be generated. To make this potentially very large volume of information more manageable, our analysis is based on segmenting each day's lifelog data into discrete, non-overlapping events corresponding to activities in the wearer's day. To manage lifelog data at an event level, we define a set of concepts using an ontology appropriate to the wearer, apply automatic concept detection to these events, and then semantically enrich each detected lifelog event, making them an index into the events. Once this enrichment is complete, we can use the lifelog to support semantic search for everyday media management, as a memory aid, or as part of medical analysis of the activities of daily living (ADL), among other uses. In the thesis, we address the problem of how to select the concepts used for indexing events, and we propose a semantic, density-based algorithm to cope with concept selection issues for lifelogging. We then apply activity detection to classify everyday activities, employing the selected concepts as high-level semantic features. Finally, the activity is modeled by multi-context representations and enriched by Semantic Web technologies. The thesis includes an experimental evaluation using real data from users and shows the performance of our algorithms in capturing the semantics of everyday concepts and their efficacy in activity recognition and semantic enrichment.
Opinion mining: Reviewed from word to document level
Opinion mining is one of the most challenging tasks in the field of information retrieval. The research community has published a number of articles on this topic, and a significant increase in interest has been observed during the past decade, especially after the launch of several online social networks. In this paper, we provide a very detailed overview of the related work on opinion mining. The following features distinguish our review from works of a similar kind: (1) it presents a different perspective on the opinion mining field by discussing the work at different granularity levels (word, sentence, and document levels), which is much needed; (2) it discusses the related work in terms of the challenges of the field of opinion mining; (3) its document-level discussion of the related work gives an overview of the opinion mining task in the blogosphere, one of the most popular online social networks; and (4) it highlights the importance of online social networks for the opinion mining task and other related sub-tasks.
Effective distributed representations for academic expert search
Expert search aims to find and rank experts based on a user's query. In
academia, retrieving experts is an efficient way to navigate through a large
amount of academic knowledge. Here, we study how different distributed
representations of academic papers (i.e. embeddings) impact academic expert
retrieval. We use the Microsoft Academic Graph dataset and experiment with
different configurations of a document-centric voting model for retrieval. In
particular, we explore the impact of the use of contextualized embeddings on
search performance. We also present results for paper embeddings that
incorporate citation information through retrofitting. Additionally,
experiments are conducted using different techniques for assigning author
weights based on author order. We observe that using contextual embeddings
produced by a transformer model trained for sentence similarity tasks produces
the most effective paper representations for document-centric expert retrieval.
However, retrofitting the paper embeddings and using elaborate author
contribution weighting strategies did not improve retrieval performance.
Comment: To be published in the Scholarly Document Processing 2020 Workshop @
EMNLP 2020 proceedings
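A document-centric voting model of the kind described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy two-dimensional embeddings, and the reciprocal author-position weight are all assumptions made for illustration. Each of the top-ranked papers "votes" for its authors with its query similarity, scaled by author order.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_experts(query_vec, papers, top_k=10):
    """Rank authors by summing votes from the top_k most similar papers.

    The 1 / (position + 1) author weight is a hypothetical choice,
    not a scheme taken from the paper.
    """
    scored = [(cosine(query_vec, p["embedding"]), p) for p in papers]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = defaultdict(float)
    for score, paper in scored[:top_k]:
        for position, author in enumerate(paper["authors"]):
            votes[author] += score / (position + 1)
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: embeddings would normally come from a sentence encoder.
papers = [
    {"embedding": [1.0, 0.0], "authors": ["alice", "bob"]},
    {"embedding": [0.0, 1.0], "authors": ["carol"]},
    {"embedding": [0.8, 0.6], "authors": ["bob"]},
]
ranking = rank_experts([1.0, 0.0], papers, top_k=2)
```

Replacing the position weight with a constant 1.0 gives the uniform-weight variant, which is a natural baseline given the abstract's finding that elaborate weighting did not help.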
LexMAE: Lexicon-Bottlenecked Pretraining for Large-Scale Retrieval
In large-scale retrieval, the lexicon-weighting paradigm, learning weighted
sparse representations in vocabulary space, has shown promising results with
high quality and low latency. Despite it deeply exploiting the
lexicon-representing capability of pre-trained language models, a crucial gap
remains between language modeling and lexicon-weighting retrieval -- the former
preferring certain or low-entropy words whereas the latter favoring pivot or
high-entropy words -- becoming the main barrier to lexicon-weighting
performance for large-scale retrieval. To bridge this gap, we propose a
brand-new pre-training framework, lexicon-bottlenecked masked autoencoder
(LexMAE), to learn importance-aware lexicon representations. Essentially, we
present a lexicon-bottlenecked module between a normal language modeling
encoder and a weakened decoder, where a continuous bag-of-words bottleneck is
constructed to learn a lexicon-importance distribution in an unsupervised
fashion. The pre-trained LexMAE is readily transferred to the lexicon-weighting
retrieval via fine-tuning. On the ad-hoc retrieval benchmark, MS-Marco, it
achieves 42.6% MRR@10 with 45.8 QPS for the passage dataset and 44.4% MRR@100
with 134.8 QPS for the document dataset, on a CPU machine. LexMAE also shows
state-of-the-art zero-shot transfer capability on the BEIR benchmark with 12
datasets.
Comment: Appeared at ICLR 202
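For readers unfamiliar with the MRR@k metric reported in this abstract, the sketch below shows how it is commonly computed: for each query, take the reciprocal rank of the first relevant result within the top k, then average over queries. The function name and the toy document identifiers are illustrative assumptions, not part of any official evaluation script.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=10):
    """Mean reciprocal rank of the first relevant hit within the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists) if ranked_lists else 0.0


# Three toy queries: first hit at rank 2, at rank 1, and no hit at all.
runs = [["d3", "d1", "d7"], ["d2", "d9"], ["d5", "d6"]]
qrels = [{"d1"}, {"d2"}, {"d0"}]
score = mrr_at_k(runs, qrels, k=10)  # (1/2 + 1 + 0) / 3 = 0.5
```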
Sentiment Analysis in Social Streams
In this chapter, we review and discuss the state of the art on sentiment
analysis in social streams, such as web forums, microblogging systems, and social
networks, aiming to clarify how user opinions, affective states, and intended
emotional effects are extracted from user generated content, how they are modeled, and
how they could be finally exploited. We explain why sentiment analysis tasks are more
difficult for social streams than for other textual sources, and entail going beyond
classic text-based opinion mining techniques. We show, for example, that social
streams may use vocabularies and expressions that exist outside the mainstream of
standard, formal languages, and may reflect complex dynamics in the opinions and
sentiments expressed by individuals and communities.
Knowledge-Rich Self-Supervision for Biomedical Entity Linking
Entity linking faces significant challenges such as prolific variations and
prevalent ambiguities, especially in high-value domains with myriad entities.
Standard classification approaches suffer from the annotation bottleneck and
cannot effectively handle unseen entities. Zero-shot entity linking has emerged
as a promising direction for generalizing to new entities, but it still
requires example gold entity mentions during training and canonical
descriptions for all entities, both of which are rarely available outside of
Wikipedia. In this paper, we explore Knowledge-RIch Self-Supervision (KRISS) for biomedical entity linking, by leveraging readily available domain
knowledge. In training, it generates self-supervised mention examples on
unlabeled text using a domain ontology and trains a contextual encoder using
contrastive learning. For inference, it samples self-supervised mentions as
prototypes for each entity and conducts linking by mapping the test mention to
the most similar prototype. Our approach can easily incorporate entity
descriptions and gold mention labels if available. We conducted extensive
experiments on seven standard datasets spanning biomedical literature and
clinical notes. Without using any labeled information, our method produces KRISSBERT, a universal entity linker for four million UMLS entities that
attains new state of the art, outperforming prior self-supervised methods by as
much as 20 absolute points in accuracy.
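The inference step described in this abstract, mapping a test mention to its most similar prototype, can be sketched as a nearest-neighbor lookup. This is a minimal illustration under stated assumptions: the function names, the placeholder entity identifiers, and the toy two-dimensional vectors are hypothetical, and in practice the vectors would come from the trained contextual encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_mention(mention_vec, prototypes):
    """Return the entity whose prototype is most similar to the mention.

    `prototypes` maps an entity identifier to a list of prototype vectors
    (encodings of sampled self-supervised mentions of that entity).
    """
    best_entity, best_score = None, float("-inf")
    for entity_id, vectors in prototypes.items():
        for vec in vectors:
            score = cosine(mention_vec, vec)
            if score > best_score:
                best_entity, best_score = entity_id, score
    return best_entity, best_score


# Placeholder identifiers and vectors, for illustration only.
prototypes = {
    "ENTITY_A": [[1.0, 0.0], [0.9, 0.1]],
    "ENTITY_B": [[0.0, 1.0]],
}
entity, similarity = link_mention([0.2, 0.9], prototypes)
```

At the scale of millions of entities, an exhaustive loop like this would normally be replaced by an approximate nearest-neighbor index; the loop is kept here only to make the matching rule explicit.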