PersoNER: Persian named-entity recognition
© 1963-2018 ACL. Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent difficulty of training an effective NER pipeline. To bridge this gap, in this paper we target the Persian language, which is spoken by a population of over a hundred million people worldwide. We first present and provide ArmanPersoNERCorpus, the first manually-annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNLL scores while outperforming two alternatives based on a CRF and a recurrent neural network.
Context Aware Query Rewriting for Text Rankers using LLM
Query rewriting refers to an established family of approaches that are
applied to underspecified and ambiguous queries to overcome the vocabulary
mismatch problem in document ranking. Queries are typically rewritten during
query processing time for better query modelling for the downstream ranker.
With the advent of large-language models (LLMs), there have been initial
investigations into using generative approaches to generate pseudo documents to
tackle this inherent vocabulary gap. In this work, we analyze the utility of
LLMs for improved query rewriting for text ranking tasks. We find that there
are two inherent limitations of using LLMs as query re-writers -- concept drift
when using only queries as prompts and large inference costs during query
processing. We adopt a simple, yet surprisingly effective, approach called
context aware query rewriting (CAR) to leverage the benefits of LLMs for query
understanding. First, we rewrite ambiguous training queries by context-aware
prompting of LLMs, where we use only relevant documents as context. Unlike
existing approaches, we use LLM-based query rewriting only during the training
phase. Finally, a ranker is fine-tuned on the rewritten queries instead of
the original queries during training. In our extensive experiments, we find
that fine-tuning a ranker using re-written queries offers a significant
improvement of up to 33% on the passage ranking task and up to 28% on the
document ranking task when compared to the baseline performance of using
original queries.
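The training-time rewriting step described in this abstract could be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the prompt wording, the function names `build_car_prompt` and `rewrite_training_queries`, the `max_doc_chars` cut-off, and the `llm` callable (any function mapping a prompt string to a completion string) are all assumptions made for the example.

```python
from typing import Callable, List, Tuple


def build_car_prompt(query: str, relevant_doc: str, max_doc_chars: int = 1000) -> str:
    """Compose a context-aware rewriting prompt (the wording is hypothetical)."""
    context = relevant_doc[:max_doc_chars]  # keep the prompt short
    return (
        "Rewrite the search query so that it is specific and unambiguous, "
        "using only the context document.\n"
        f"Context: {context}\n"
        f"Query: {query}\n"
        "Rewritten query:"
    )


def rewrite_training_queries(
    train_pairs: List[Tuple[str, str]],
    llm: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Rewrite each (query, relevant document) training pair with the LLM.

    The LLM is called only here, at training-data preparation time; the
    ranker would then be fine-tuned on the returned pairs, and no LLM call
    is needed when queries are processed at inference time.
    """
    rewritten = []
    for query, doc in train_pairs:
        new_query = llm(build_car_prompt(query, doc)).strip()
        # Fall back to the original query if the LLM returns nothing usable.
        rewritten.append((new_query or query, doc))
    return rewritten
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular model API; a real setup would wrap whichever client is in use.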
Leveraging Semantic Annotations for Event-focused Search & Summarization
In today's Big Data era, overwhelming amounts of highly redundant textual information spread across different sources have made it hard for a consumer to look back on past events. A plausible solution is to link semantically similar information contained across the different sources to enforce a structure, thereby providing multiple access paths to relevant information. Keeping this larger goal in view, this work uses Wikipedia and online news articles as two prominent yet disparate information sources to address the following three problems:
• We address a linking problem to connect Wikipedia excerpts to news articles by casting it into an IR task. Our novel approach integrates time, geolocations, and entities with text to identify relevant documents that can be linked to a given excerpt.
• We address an unsupervised extractive multi-document summarization task to generate a fixed-length event digest that facilitates efficient consumption of information contained within a large set of documents. Our novel approach proposes an ILP for global inference across text, time, geolocations, and entities associated with the event.
• To estimate the temporal focus of short event descriptions, we present a semi-supervised approach that leverages redundancy within a longitudinal news collection to estimate accurate probabilistic time models.
Extensive experimental evaluations demonstrate the effectiveness and viability of our proposed approaches towards achieving the larger goal.
Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval
Dense retrieval models have predominantly been studied for English, where
models have shown great success, due to the availability of human-labeled
training pairs. However, there has been limited success for multilingual
retrieval so far, as training data is uneven or scarcely available across
multiple languages. Synthetic training data generation is promising (e.g.,
InPars or Promptagator), but has been investigated only for English. Therefore,
to study model capabilities across both cross-lingual and monolingual retrieval
tasks, we develop SWIM-IR, a synthetic retrieval training dataset containing 33
(high to very-low resource) languages for training multilingual dense retrieval
models without requiring any human supervision. To construct SWIM-IR, we
propose SAP (summarize-then-ask prompting), where the large language model
(LLM) generates a textual summary prior to the query generation step. SAP
assists the LLM in generating informative queries in the target language. Using
SWIM-IR, we explore synthetic fine-tuning of multilingual dense retrieval
models and evaluate them robustly on three retrieval benchmarks: XOR-Retrieve
(cross-lingual), XTREME-UP (cross-lingual) and MIRACL (monolingual). Our
models, called SWIM-X, are competitive with human-supervised dense retrieval
models, e.g., mContriever, indicating that SWIM-IR can cheaply substitute for
expensive human-labeled retrieval training data. Comment: Data released at https://github.com/google-research-datasets/swim-i
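The summarize-then-ask idea described in this abstract can be illustrated with a two-stage prompt chain. The sketch below is only an assumption-laden illustration, not the SWIM-IR pipeline: the prompt texts, the function name `sap_generate_query`, and the `llm` callable (prompt string in, completion string out) are hypothetical.

```python
from typing import Callable, Dict


def sap_generate_query(
    passage: str,
    target_language: str,
    llm: Callable[[str], str],
) -> Dict[str, str]:
    """Generate one synthetic (query, passage) training pair in two steps."""
    # Step 1: ask for a textual summary of the passage.
    summary_prompt = (
        "Summarize the following passage in two sentences.\n"
        f"Passage: {passage}\n"
        "Summary:"
    )
    summary = llm(summary_prompt).strip()

    # Step 2: ask for a query, conditioning on both passage and summary.
    query_prompt = (
        f"Passage: {passage}\n"
        f"Summary: {summary}\n"
        f"Write one informative search query in {target_language} "
        "that this passage answers.\n"
        "Query:"
    )
    query = llm(query_prompt).strip()
    return {"query": query, "passage": passage, "summary": summary}
```

Because the summary is produced before the query, the second prompt can carry it as extra context, which is the ordering the abstract describes.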
Query Understanding in the Age of Large Language Models
Querying, conversing, and controlling search and information-seeking
interfaces using natural language are fast becoming ubiquitous with the rise
and adoption of large-language models (LLMs). In this position paper, we
describe a generic framework for interactive query-rewriting using LLMs. Our
proposal aims to unfold new opportunities for improved and transparent intent
understanding while building high-performance retrieval systems using LLMs. A
key aspect of our framework is the ability of the rewriter to fully specify the
search engine's machine intent in natural language, so that it can be further
refined, controlled, and edited before the final retrieval phase. The ability
to present, interact with, and reason over the underlying machine intent in natural
language has profound implications for transparency and ranking performance, and
marks a departure from the traditional way in which supervised signals were collected
for understanding intents. We detail the concept, backed by initial
experiments, along with open questions for this interactive query understanding
framework. Comment: Accepted to GENIR (SIGIR'23)
Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources
The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system mostly depends on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and the results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction in out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model. JRC.G.2-Global security and crisis management
Explainable Information Retrieval: A Survey
Explainable information retrieval is an emerging research area aiming to make
information retrieval systems transparent and trustworthy. Given the increasing
use of complex machine learning models in search systems, explainability is
essential in building and auditing responsible information retrieval models.
This survey fills a vital gap in the otherwise topically diverse literature of
explainable information retrieval. It categorizes and discusses recent
explainability methods developed for different application domains in
information retrieval, providing a common framework and unifying perspectives.
In addition, it reflects on the common concern of evaluating explanations and
highlights open challenges and opportunities. Comment: 35 pages, 10 figures. Under review
Enhancing Inter-Document Similarity Using Sub Max
Document similarity, a core theme in Information Retrieval (IR), is a machine learning (ML) task associated with natural language processing (NLP). It is a measure of the distance between two documents given a set of rules. For the purpose of this thesis, two documents are similar if they are semantically alike and describe similar concepts. While document similarity can be applied to multiple tasks, we focus our work on the accuracy of models in detecting referenced papers as similar documents using their sub max similarity. Multiple approaches have been used to determine the similarity of documents with regard to literature reviews. Some of these approaches use the number of shared citations, the similarity between the bodies of text, and the figures present in those documents. This researcher hypothesized that documents with sections of high similarity (sub max) but a low global similarity are prone to being overlooked by existing models, because the global score of the documents is used to measure similarity. In this study, we aim to detect, measure, and show the similarity of documents based on the maximum similarity of their subsections. The sub max of any two given documents is the pair of subsections, one from each document, with the highest similarity. By comparing subsections of the documents in our corpus and using the sub max, we were able to improve the performance of some models by over 100%.
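To make the contrast between a global score and the sub max concrete, here is a small sketch. It assumes each document is already split into subsections and uses a plain bag-of-words cosine as the similarity function; that choice, and the names `bow`, `cosine`, `global_similarity`, and `sub_max`, are illustrative assumptions and not the models evaluated in the thesis.

```python
import math
import re
from collections import Counter
from itertools import product
from typing import List, Tuple


def bow(text: str) -> Counter:
    """Lower-cased bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def global_similarity(doc_a: List[str], doc_b: List[str]) -> float:
    """Similarity of the two documents taken as wholes."""
    return cosine(bow(" ".join(doc_a)), bow(" ".join(doc_b)))


def sub_max(doc_a: List[str], doc_b: List[str]) -> Tuple[float, Tuple[str, str]]:
    """Highest similarity over all cross-document subsection pairs."""
    if not doc_a or not doc_b:
        return 0.0, ("", "")
    scored = [(cosine(bow(a), bow(b)), (a, b)) for a, b in product(doc_a, doc_b)]
    return max(scored, key=lambda item: item[0])
```

Two documents that share one near-identical subsection but otherwise cover unrelated topics will get a modest `global_similarity` and a high `sub_max` score, which is the situation the hypothesis above is concerned with.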
Entity centric neural models for natural language processing
This thesis explores how to enhance natural language understanding by incorporating entity information into neural network models. It tackles three key questions:
1. Leveraging entities for understanding tasks: This work introduces Entity-GCN, a model that performs multi-step reasoning on a graph where nodes represent entity mentions and edges represent relationships. This method achieved state-of-the-art results on a multi-document question-answering dataset.
2. Identifying and disambiguating entities using large language models: This research proposes a novel system that retrieves entities by generating their names token-by-token, overcoming limitations of traditional methods and significantly reducing memory footprint. This approach is also extended to a multilingual setting and further optimized for efficiency.
3. Interpreting and controlling entity knowledge within models: This thesis presents a post-hoc interpretation technique to analyze how decisions are made across layers in neural models, allowing for visualization and analysis of knowledge representation. Additionally, a method for editing factual knowledge about entities is proposed, enabling correction of model predictions without costly retraining.