31 research outputs found
Modeling Tag Prediction based on Question Tagging Behavior Analysis of CommunityQA Platform Users
In community question-answering platforms, tags play essential roles in
effective information organization and retrieval, better question routing,
faster response to questions, and assessment of topic popularity. Hence,
automatic assistance for predicting and suggesting tags for posts is of high
utility to users of such platforms. To develop better tag prediction across
diverse communities and domains, we performed a thorough analysis of users'
tagging behavior in 17 StackExchange communities. We found various common
inherent properties of this behavior in those diverse domains. We used the
findings to develop a flexible neural tag prediction architecture, which
predicts both popular tags and more granular tags for each question. Our
extensive experiments and obtained performance show the effectiveness of our
modelComment: 20 page
Knowledge-Augmented Large Language Models for Personalized Contextual Query Suggestion
Large Language Models (LLMs) excel at tackling various natural language
tasks. However, due to the significant costs involved in re-training or
fine-tuning them, they remain largely static and difficult to personalize.
Nevertheless, a variety of applications could benefit from generations that are
tailored to users' preferences, goals, and knowledge. Among them is web search,
where knowing what a user is trying to accomplish, what they care about, and
what they know can lead to improved search experiences. In this work, we
propose a novel and general approach that augments an LLM with relevant context
from users' interaction histories with a search engine in order to personalize
its outputs. Specifically, we construct an entity-centric knowledge store for
each user based on their search and browsing activities on the web, which is
then leveraged to provide contextually relevant LLM prompt augmentations. This
knowledge store is light-weight, since it only produces user-specific aggregate
projections of interests and knowledge onto public knowledge graphs, and
leverages existing search log infrastructure, thereby mitigating the privacy,
compliance, and scalability concerns associated with building deep user
profiles for personalization. We then validate our approach on the task of
contextual query suggestion, which requires understanding not only the user's
current search context but also what they historically know and care about.
Through a number of experiments based on human evaluation, we show that our
approach is significantly better than several other LLM-powered baselines,
generating query suggestions that are contextually more relevant, personalized,
and useful
MAG: A Multilingual, Knowledge-base Agnostic and Deterministic Entity Linking Approach
Entity linking has recently been the subject of a significant body of
research. Currently, the best performing approaches rely on trained
mono-lingual models. Porting these approaches to other languages is
consequently a difficult endeavor as it requires corresponding training data
and retraining of the models. We address this drawback by presenting a novel
multilingual, knowledge-based agnostic and deterministic approach to entity
linking, dubbed MAG. MAG is based on a combination of context-based retrieval
on structured knowledge bases and graph algorithms. We evaluate MAG on 23 data
sets and in 7 languages. Our results show that the best approach trained on
English datasets (PBOH) achieves a micro F-measure that is up to 4 times worse
on datasets in other languages. MAG, on the other hand, achieves
state-of-the-art performance on English datasets and reaches a micro F-measure
that is up to 0.6 higher than that of PBOH on non-English languages.Comment: Accepted in K-CAP 2017: Knowledge Capture Conferenc
Large-scale named entity disambiguation based on Wikipedia data
This paper presents a large-scale system for the recognition and semantic disambiguation of named entities based on information extracted from a large encyclopedic collection and Web search results. It describes in detail the disambiguation paradigm employed and the information extraction process from Wikipedia. Through a process of maximizing the agreement between the contextual information extracted from Wikipedia and the context of a document, as well as the agreement among the category tags associated with the candidate entities, the implemented system shows high disambiguation accuracy on both news stories and Wikipedia articles. 1 Introduction and Related Wor
Language Independent, Minimally Supervised Induction of Lexical Probabilities
A central problem in part-of-speech tagging, especially for new languages for which limited annotated resources are available, is estimating the distribution of lexical probabilities for unknown words. This paper introduces a new paradigmatic similarity measure and presents a minimally supervised learning approach combining effective selection and weighting methods based on paradigmatic and contextual similarity measures populated from large quantities of inexpensive raw text data. This approach is highly language independent and requires no modification to the algorithm or implementation to shift between languages such as French and English
Bootstrapping a Multilingual Part-of-speech Tagger in One Person-day
This paper presents a method for bootstrapping a fine-grained, broad-coverage part-of-speech (POS) tagger in a new language using only one personday of data acquisition effort. It requires only three resources, which are currently readily available in 60-100 world languages: (1) an online or hard-copy pocket-sized bilingual dictionary, (2) a basic library reference grammar, and (3) access to an existing monolingual text corpus in the language. The algorithm begins by inducing initial lexical POS distributions from English translations in a bilingual dictionary without POS tags. It handles irregular, regular and semi-regular morphology through a robust generative model using weighted Levenshtein alignments. Unsupervised induction of grammatical gender is performed via global modeling of contextwindow feature agreement. Using a combination of these and other evidence sources, interactive training of context and lexical prior models are accomplished for fine-grained POS tag spaces. Experiments show high accuracy, fine-grained tag resolution with minimal new human effort
Augmented Mixture Models for Lexical Disambiguation
This paper investigates several augmented mixture models that are competitive alternatives to standard Bayesian models and prove to be very suitable to word sense disambiguation and related classification tasks. We present a new classification correction technique that successfully addresses the problem of under-estimation of infrequent classes in the training data. We show that the mixture models are boosting-friendly and that both Adaboost and our original correction technique can improve the results of the raw model significantly, achieving stateof -the-art performance on several standard test sets in four languages. With substantially different output to Nave Bayes and other statistical methods, the investigated models are also shown to be effective participants in classifier combination
Predicting Accuracy of Extracting Information from Unstructured Text Collections
Exploiting lexical and semantic relationships in large unstructured text collections can significantly enhance managing, integrating, and querying information locked in unstructured text. Most notably, named entities and relations between entities are crucial for effective question answering and other information retrieval and knowledge management tasks. Unfortunately, the success in extracting these relationships can vary for different domains, languages, and document collections. Predicting extraction performance is an important step towards scalable and intelligent knowledge management, information retrieval and information integration. We present a general language modeling method for quantifying the difficulty of information extraction tasks. We demonstrate the viability of our approach by predicting performance of real world information extraction tasks, Named Entity recognition and Relation Extraction