Natural language processing
Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST Chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems: text summarization, information extraction, information retrieval, etc., including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of the WWW and digital libraries; and (iv) evaluation of NLP systems.
A Factoid Question Answering System for Vietnamese
In this paper, we describe the development of an end-to-end factoid question
answering system for the Vietnamese language. This system combines both
statistical models and ontology-based methods in a chain of processing modules
to provide high-quality mappings from natural language text to entities. We
present the challenges in the development of such an intelligent user interface
for an isolating language like Vietnamese and show that techniques developed
for inflectional languages cannot be applied "as is". Our question answering
system can answer a wide range of general knowledge questions with promising
accuracy on a test set.
Comment: In the proceedings of the HQA'18 workshop, The Web Conference Companion, Lyon, France
Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
Mixed-Language Arabic-English Information Retrieval
This thesis addresses the problem of mixed querying in cross-language information retrieval (CLIR). It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve the most relevant documents, regardless of their languages. To achieve this goal, it is essential first to suppress the problems caused by the mixed-language feature in both queries and documents, which bias the final ranked list. Therefore, a cross-lingual re-weighting model was developed. In this model, the term frequency, document frequency and document length components in mixed queries are estimated and adjusted regardless of language, while the model also considers the unique mixed-language features of queries and documents, such as co-occurring terms in two different languages. Furthermore, in mixed queries, non-technical terms (mostly those in the non-English language) are likely to outweigh technical terms (mostly those in English) and skew their impact, because the technical terms have high document frequencies (and thus low weights) in their corresponding collection (mostly the English collection). This phenomenon is caused by the dominance of the English language in scientific domains. Accordingly, this thesis also proposes a re-weighted Inverse Document Frequency (IDF) to moderate the effect of overweighted terms in mixed queries.
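The skew described in this abstract can be illustrated with a minimal sketch. It uses the textbook IDF formula, log(N / df), computed separately per collection; it does not reproduce the thesis's re-weighting model, which the abstract does not specify. The collection sizes, document frequencies, and the two query terms ("deadlock" as an English technical term, "mafhum" as a transliterated Arabic non-technical term) are invented for illustration.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Textbook inverse document frequency: log(N / df)."""
    return math.log(num_docs / doc_freq)


# Hypothetical per-language collection statistics (not from the thesis).
COLLECTIONS = {
    # The technical term is frequent in the large English collection.
    "en": {"num_docs": 100_000, "df": {"deadlock": 20_000}},
    # The non-technical term is comparatively rare in the small Arabic collection.
    "ar": {"num_docs": 10_000, "df": {"mafhum": 200}},
}


def query_weights(query):
    """Weight each (term, language) pair by IDF in its own collection."""
    weights = {}
    for term, lang in query:
        stats = COLLECTIONS[lang]
        weights[term] = idf(stats["num_docs"], stats["df"][term])
    return weights


# A mixed query: one Arabic non-technical term, one English technical term.
weights = query_weights([("mafhum", "ar"), ("deadlock", "en")])
for term, weight in weights.items():
    print(f"{term}: {weight:.2f}")
```

With these assumed statistics the non-technical term receives the larger weight, even though the technical term carries the topic of the query. This is the imbalance that a re-weighted IDF is meant to moderate.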
A Controllable Model of Grounded Response Generation
Current end-to-end neural conversation models inherently lack the flexibility
to impose semantic control in the response generation process, often resulting
in uninteresting responses. Attempts to boost informativeness alone come at the
expense of factual accuracy, as attested by pretrained language models'
propensity to "hallucinate" facts. While this may be mitigated by access to
background knowledge, there is scant guarantee of relevance and informativeness
in generated responses. We propose a framework that we call controllable
grounded response generation (CGRG), in which lexical control phrases are
either provided by a user or automatically extracted by a control phrase
predictor from dialogue context and grounding knowledge. Quantitative and
qualitative results show that, using this framework, a transformer-based model
with a novel inductive attention mechanism, trained on a conversation-like
Reddit dataset, outperforms strong generation baselines.
Comment: AAAI 202
Answering Complex Questions by Joining Multi-Document Evidence with Quasi Knowledge Graphs
Direct answering of questions that involve multiple entities and relations is a challenge for text-based QA. This problem is most pronounced when answers can be found only by joining evidence from multiple documents. Curated knowledge graphs (KGs) may yield good answers, but are limited by their inherent incompleteness and potential staleness. This paper presents QUEST, a method that can answer complex questions directly from textual sources on the fly, by computing similarity joins over partial results from different documents. Our method is completely unsupervised, which avoids training-data bottlenecks and lets it cope with rapidly evolving ad hoc topics and formulation styles in user questions. QUEST builds a noisy quasi KG with node and edge weights, consisting of dynamically retrieved entity names and relational phrases. It augments this graph with types and semantic alignments, and computes the best answers by an algorithm for Group Steiner Trees. We evaluate QUEST on benchmarks of complex questions, and show that it substantially outperforms state-of-the-art baselines.