31 research outputs found

Modeling Tag Prediction based on Question Tagging Behavior Analysis of CommunityQA Platform Users

Author: Chandrasekaran Nirupama
Cucerzan Silviu
Gamon Michael
Pal Kuntal Kumar
Publication date: 03/07/2023

In community question-answering platforms, tags play essential roles in effective information organization and retrieval, better question routing, faster response to questions, and assessment of topic popularity. Hence, automatic assistance for predicting and suggesting tags for posts is of high utility to users of such platforms. To develop better tag prediction across diverse communities and domains, we performed a thorough analysis of users' tagging behavior in 17 StackExchange communities. We found various common inherent properties of this behavior in those diverse domains. We used the findings to develop a flexible neural tag prediction architecture, which predicts both popular tags and more granular tags for each question. Our extensive experiments and obtained performance show the effectiveness of our model. Comment: 20 pages

arXiv.org e-Print Archive

Knowledge-Augmented Large Language Models for Personalized Contextual Query Suggestion

Author: Baek Jinheon
Chandrasekaran Nirupama
Cucerzan Silviu
Herring Allen
Jauhar Sujay Kumar
Publication date: 09/11/2023

Large Language Models (LLMs) excel at tackling various natural language tasks. However, due to the significant costs involved in re-training or fine-tuning them, they remain largely static and difficult to personalize. Nevertheless, a variety of applications could benefit from generations that are tailored to users' preferences, goals, and knowledge. Among them is web search, where knowing what a user is trying to accomplish, what they care about, and what they know can lead to improved search experiences. In this work, we propose a novel and general approach that augments an LLM with relevant context from users' interaction histories with a search engine in order to personalize its outputs. Specifically, we construct an entity-centric knowledge store for each user based on their search and browsing activities on the web, which is then leveraged to provide contextually relevant LLM prompt augmentations. This knowledge store is lightweight, since it only produces user-specific aggregate projections of interests and knowledge onto public knowledge graphs, and leverages existing search log infrastructure, thereby mitigating the privacy, compliance, and scalability concerns associated with building deep user profiles for personalization. We then validate our approach on the task of contextual query suggestion, which requires understanding not only the user's current search context but also what they historically know and care about. Through a number of experiments based on human evaluation, we show that our approach is significantly better than several other LLM-powered baselines, generating query suggestions that are contextually more relevant, personalized, and useful.

arXiv.org e-Print Archive

MAG: A Multilingual, Knowledge-base Agnostic and Deterministic Entity Linking Approach

Author: Bryl Volha
Brümmer Martin
Consoli Sergio
Cucerzan Silviu
Devi Pooja
Ferreira Thiago Castro
Hoffart Johannes
Juan
Luo Gang
Nuzzolese Andrea-Giovanni
Röder Michael
Steinmetz Nadine
van Erp Marieke
Zhang Lei
Publication venue: Association for Computing Machinery (ACM)
Publication date: 17/10/2017

Entity linking has recently been the subject of a significant body of research. Currently, the best performing approaches rely on trained mono-lingual models. Porting these approaches to other languages is consequently a difficult endeavor as it requires corresponding training data and retraining of the models. We address this drawback by presenting a novel multilingual, knowledge-base agnostic and deterministic approach to entity linking, dubbed MAG. MAG is based on a combination of context-based retrieval on structured knowledge bases and graph algorithms. We evaluate MAG on 23 data sets and in 7 languages. Our results show that the best approach trained on English datasets (PBOH) achieves a micro F-measure that is up to 4 times worse on datasets in other languages. MAG, on the other hand, achieves state-of-the-art performance on English datasets and reaches a micro F-measure that is up to 0.6 higher than that of PBOH on non-English languages. Comment: Accepted in K-CAP 2017: Knowledge Capture Conference

arXiv.org e-Print Archive

Crossref

Large-scale named entity disambiguation based on Wikipedia data

Author: Silviu Cucerzan

This paper presents a large-scale system for the recognition and semantic disambiguation of named entities based on information extracted from a large encyclopedic collection and Web search results. It describes in detail the disambiguation paradigm employed and the information extraction process from Wikipedia. Through a process of maximizing the agreement between the contextual information extracted from Wikipedia and the context of a document, as well as the agreement among the category tags associated with the candidate entities, the implemented system shows high disambiguation accuracy on both news stories and Wikipedia articles.

CiteSeerX

Language Independent, Minimally Supervised Induction of Lexical Probabilities

Author: David Yarowsky
Silviu Cucerzan
Publication date: 01/01/2000

A central problem in part-of-speech tagging, especially for new languages for which limited annotated resources are available, is estimating the distribution of lexical probabilities for unknown words. This paper introduces a new paradigmatic similarity measure and presents a minimally supervised learning approach combining effective selection and weighting methods based on paradigmatic and contextual similarity measures populated from large quantities of inexpensive raw text data. This approach is highly language independent and requires no modification to the algorithm or implementation to shift between languages such as French and English.

CiteSeerX

Crossref

Bootstrapping a Multilingual Part-of-speech Tagger in One Person-day

Author: David Yarowsky
Silviu Cucerzan
Publication date: 01/01/2002

This paper presents a method for bootstrapping a fine-grained, broad-coverage part-of-speech (POS) tagger in a new language using only one person-day of data acquisition effort. It requires only three resources, which are currently readily available in 60-100 world languages: (1) an online or hard-copy pocket-sized bilingual dictionary, (2) a basic library reference grammar, and (3) access to an existing monolingual text corpus in the language. The algorithm begins by inducing initial lexical POS distributions from English translations in a bilingual dictionary without POS tags. It handles irregular, regular and semi-regular morphology through a robust generative model using weighted Levenshtein alignments. Unsupervised induction of grammatical gender is performed via global modeling of context-window feature agreement. Using a combination of these and other evidence sources, interactive training of context and lexical prior models is accomplished for fine-grained POS tag spaces. Experiments show high accuracy, fine-grained tag resolution with minimal new human effort.

CiteSeerX

Crossref

Augmented Mixture Models for Lexical Disambiguation

Author: David Yarowsky
Silviu Cucerzan
Publication date: 01/01/2002

This paper investigates several augmented mixture models that are competitive alternatives to standard Bayesian models and prove to be very suitable for word sense disambiguation and related classification tasks. We present a new classification correction technique that successfully addresses the problem of under-estimation of infrequent classes in the training data. We show that the mixture models are boosting-friendly and that both AdaBoost and our original correction technique can improve the results of the raw model significantly, achieving state-of-the-art performance on several standard test sets in four languages. With substantially different output to Naïve Bayes and other statistical methods, the investigated models are also shown to be effective participants in classifier combination.

CiteSeerX

Crossref

Predicting Accuracy of Extracting Information from Unstructured Text Collections

Author: Eugene Agichtein
Silviu Cucerzan
Publication date: 01/01/2005

Exploiting lexical and semantic relationships in large unstructured text collections can significantly enhance managing, integrating, and querying information locked in unstructured text. Most notably, named entities and relations between entities are crucial for effective question answering and other information retrieval and knowledge management tasks. Unfortunately, the success in extracting these relationships can vary for different domains, languages, and document collections. Predicting extraction performance is an important step towards scalable and intelligent knowledge management, information retrieval and information integration. We present a general language modeling method for quantifying the difficulty of information extraction tasks. We demonstrate the viability of our approach by predicting the performance of two real-world information extraction tasks: named entity recognition and relation extraction.

CiteSeerX