Discriminative Reranking for Spelling Correction
PACLIC 20 / Wuhan, China / 1-3 November, 200
This paper proposes a novel approach to spelling correction that reranks the output of an existing spelling corrector, Aspell. A discriminative model (Ranking SVM) is employed to improve upon the initial ranking, using additional features as evidence. These features are derived from state-of-the-art techniques in spelling correction, including edit distance, letter-based n-grams, phonetic similarity and the noisy channel model. The paper also presents a new method to automatically extract training samples from query log chains. The system substantially outperforms the Aspell baseline, as well as previous models and several off-the-shelf spelling correction systems (e.g. Microsoft Word 2003). Results on query chain pairs are comparable to those based on manually annotated pairs, with 32.2% and 32.6% reductions in error rate, respectively.
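The reranking idea described in the abstract can be sketched as follows. This is an illustrative toy, not the paper's model: the candidate list, the feature set (here only Levenshtein distance and a letter-bigram overlap) and the hand-set weights are all assumptions; the paper learns the weights with a Ranking SVM over a richer feature set.

```python
# Hedged sketch of discriminative reranking of spelling candidates:
# score each candidate with a weighted feature combination, then sort.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def ngram_overlap(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap of character n-grams (a letter-based n-gram feature)."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def rerank(query: str, candidates: list[str]) -> list[str]:
    # Illustrative hand-set weights; a Ranking SVM would learn these.
    def score(c: str) -> float:
        return -1.0 * edit_distance(query, c) + 2.0 * ngram_overlap(query, c)
    return sorted(candidates, key=score, reverse=True)

print(rerank("spelng", ["spewing", "sapling", "spelling"]))
# → ['spelling', 'spewing', 'sapling']
```

In the full system, each candidate from Aspell would contribute one feature vector, and the learned model orders candidates so the intended correction rises to the top.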
A study on creating a custom South Sotho spellchecking and correcting software desktop application
Thesis (B. Tech.) - Central University of Technology, Free State, 200
Virtual assistant with natural language processing capabilities
This thesis presents an end-to-end solution: a chatbot that, using natural language processing techniques, can answer very complex questions, often requiring even more complex answers, within a well-defined domain.
Rich Linguistic Structure from Large-Scale Web Data
The past two decades have shown an unexpected effectiveness of Web-scale data in natural language processing. Even the simplest models, when paired with unprecedented amounts of unstructured and unlabeled Web data, have been shown to outperform sophisticated ones. It has been argued that the effectiveness of Web-scale data has undermined the necessity of sophisticated modeling or laborious data set curation. In this thesis, we argue for and illustrate an alternative view: that Web-scale data not only improves the performance of simple models, but also allows the use of qualitatively more sophisticated models that would not otherwise be deployable, leading to even further performance gains.
Scalable and Quality-Aware Training Data Acquisition for Conversational Cognitive Services
Dialog systems (or simply bots) have recently become a popular human-computer interface for performing users' tasks by invoking the appropriate back-end APIs (Application Programming Interfaces) based on a user's request in natural language. Building task-oriented bots, which aim to perform real-world tasks (e.g., booking flights), has become feasible with continuous advances in Natural Language Processing (NLP) and Artificial Intelligence (AI), and with the countless devices that allow third-party software systems to invoke their back-end APIs.
Nonetheless, bot development technologies are still in their preliminary stages, with several unsolved theoretical and technical challenges stemming from the ambiguous nature of human language. Given the richness of natural language, supervised models require a large number of user utterances paired with their corresponding tasks, called intents.
To build a bot, developers need to manually translate APIs into utterances (called canonical utterances) and paraphrase them to obtain a diverse set of utterances. Crowdsourcing has been widely used to obtain such datasets by having crowd workers paraphrase the initial utterances generated by the bot developers for each task. However, several issues remain unsolved. First, generating canonical utterances requires manual effort, making bot development both expensive and hard to scale. Second, since crowd workers may be anonymous and are asked to provide open-ended text (paraphrases), crowdsourced paraphrases may be noisy or incorrect (not conveying the same intent as the given task).
This thesis first surveys state-of-the-art approaches for collecting large sets of training utterances for task-oriented bots. Next, we conduct an empirical study to identify quality issues in crowdsourced utterances (e.g., grammatical errors, semantic incompleteness). We then propose novel approaches for identifying unqualified crowd workers and eliminating malicious workers from crowdsourcing tasks. In particular, we propose a technique that promotes the diversity of crowdsourced paraphrases by dynamically generating word suggestions while crowd workers paraphrase a given utterance, and another that automatically translates APIs into canonical utterances. Finally, we present our platform for automatically generating bots from API specifications, and we conduct thorough experiments to validate the proposed techniques and models.
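One simple quality check of the kind the abstract alludes to can be illustrated as follows. This is a hedged sketch, not the thesis's actual models: the token-Jaccard similarity measure, the thresholds, and the three-way triage are all assumptions chosen for illustration. It flags crowd paraphrases that are near-duplicates of the canonical utterance (too little diversity) or that share almost no vocabulary with it (possibly not conveying the same intent).

```python
# Triage crowdsourced paraphrases against a canonical utterance using
# token-set Jaccard similarity. Thresholds are illustrative assumptions.

def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of lowercase token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def triage(canonical: str, paraphrases: list[str],
           low: float = 0.1, high: float = 0.8) -> dict[str, list[str]]:
    out = {"keep": [], "near_duplicate": [], "suspect_intent": []}
    for p in paraphrases:
        sim = token_jaccard(canonical, p)
        if sim >= high:
            out["near_duplicate"].append(p)   # adds little diversity
        elif sim <= low:
            out["suspect_intent"].append(p)   # may not convey the intent
        else:
            out["keep"].append(p)
    return out

result = triage("book a flight from Sydney to Paris",
                ["book a flight from Sydney to Paris please",
                 "I need plane tickets, Sydney to Paris",
                 "what is the weather today"])
```

In practice a check like this would only be a first filter; borderline cases go to human review, and semantic (rather than lexical) similarity models catch paraphrases that reword the intent entirely.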
An empirically-based debugging system for novice programmers
The research described here concerns the design and construction of an empirically based debugging aid for first-time computer users, integrated into the Open University's SOLO programming environment. Its basis is an account of the processes involved as human experts debug faulty code, an account later supported by empirical tests on human experts. The account implies that an understanding of the programmer's intentions is not essential to successful debugging of a certain class of programs: programs written in a database-dependent language by users who are initially completely computer-naive and who, during their course, become competent to write simple programs embodying one or more basic AI techniques such as recursive inference. The debugging system, called AURAC, incorporates an explicit model of the debugging strategies used by human experts. Its understanding, therefore, is of programming in general and of the SOLO environment in particular. In the process we present a broad taxonomy of naive users' errors, showing that they can be divided into types, each requiring a different debugging approach and indicating a different degree of expertise on the part of the perpetrator. SOLO is a conveniently delimited though nonetheless rich problem domain.
Also described is a new version of SOLO itself (MacSOLO) which incorporates a large number of traps for the simple errors which plague novices, thus enabling AURAC to concentrate on the more interesting programming mistakes. AURAC is intended to operate after the event rather than whilst a program is actually being written, and is able via analysis of programming cliches and of data flows to isolate errors in the user's code. Where AURAC cannot analyse, or where its analysis yields nothing useful, it describes the corresponding section of code instead, so that the user receives a coherent output.
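The data-flow analysis mentioned above can be illustrated generically. This sketch is not AURAC (which operates on SOLO programs); it is a minimal, assumed representation of straight-line code that flags one common class of novice error, reading a variable before any assignment to it.

```python
# Illustrative data-flow check: each statement is (op, target, sources).
# A source variable read before being defined is reported with its line.

def undefined_reads(statements: list[tuple[str, str, list[str]]]) -> list[tuple[int, str]]:
    """Return (line, name) pairs where a variable is used before definition."""
    defined: set[str] = set()
    errors = []
    for lineno, (op, target, sources) in enumerate(statements, 1):
        for src in sources:
            if src not in defined:
                errors.append((lineno, src))
        if target:                 # the statement defines its target
            defined.add(target)
    return errors

program = [
    ("assign", "x", []),        # x := const
    ("assign", "y", ["x"]),     # y := f(x)
    ("print",  "",  ["z"]),     # print z  -- z is never assigned
]
print(undefined_reads(program))   # → [(3, 'z')]
```

A debugger in AURAC's spirit would pair such flow checks with pattern ("cliche") matching, and fall back to describing the code when neither analysis yields anything useful.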
MacSOLO and AURAC together form a unified system, based upon the principles of Simplicity, Consistency and Transparency. We show how these principles were applied during the design and construction phases.