Adapting Machine Translation Models toward Misrecognized Speech with Text-to-Speech Pronunciation Rules and Acoustic Confusability
In the spoken language translation pipeline, machine translation systems that are trained solely on written bitexts are often unable to recover from speech recognition errors due to the mismatch in training data. We propose a novel technique to simulate the errors generated by an ASR system, using the ASR system's pronunciation dictionary and language model. Lexical entries in the pronunciation dictionary are converted into phoneme sequences using a text-to-speech (TTS) analyzer and stored in a phoneme-to-word translation model. The translation model and ASR language model are combined into a phoneme-to-word MT system that "damages" clean texts to look like ASR outputs based on acoustic confusions. Training texts are TTS-converted and damaged into synthetic ASR data for use as adaptation data for training a speech translation system. Our proposed technique yields consistent improvements in translation quality on English-French lectures.
Ruiz, Nicholas; Gao, Qin; Lewis, Will; Federico, Marcello
http://interspeech2015.org
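The idea of "damaging" clean text through acoustic confusions can be illustrated with a much simpler stand-in than the phoneme-to-word MT system the abstract describes. The sketch below swaps each word for its nearest neighbour by phoneme edit distance. The lexicon `TOY_LEXICON`, its ARPAbet-style transcriptions, and the functions `confusable` and `damage` are all hypothetical and exist only for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


# Toy pronunciation dictionary (assumed, not taken from any real ASR system).
TOY_LEXICON = {
    "their": ["DH", "EH", "R"],
    "there": ["DH", "EH", "R"],
    "write": ["R", "AY", "T"],
    "right": ["R", "AY", "T"],
    "light": ["L", "AY", "T"],
    "model": ["M", "AA", "D", "AH", "L"],
}


def confusable(word, lexicon, max_dist=1):
    """Other words whose pronunciation is within max_dist edits, closest first."""
    target = lexicon[word]
    scored = [(edit_distance(target, phones), other)
              for other, phones in lexicon.items() if other != word]
    return sorted((d, w) for d, w in scored if d <= max_dist)


def damage(sentence, lexicon, max_dist=1):
    """Replace each in-lexicon word with its most confusable neighbour, if any."""
    out = []
    for word in sentence.split():
        candidates = confusable(word, lexicon, max_dist) if word in lexicon else []
        out.append(candidates[0][1] if candidates else word)
    return " ".join(out)


damaged = damage("write their model", TOY_LEXICON)
```

Here "write" and "their" are replaced by homophones, while "model" has no close neighbour in the toy lexicon and is left unchanged. A real system would weight substitutions with a language model rather than always taking the closest word.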
Modularity and Neural Integration in Large-Vocabulary Continuous Speech Recognition
This thesis tackles the problems of modularity in Large-Vocabulary Continuous Speech Recognition with the use of Neural Network
A parallel corpus of Python functions and documentation strings for automated code documentation and code generation
Automated documentation of programming source code and automated code
generation from natural language are challenging tasks of both practical and
scientific interest. Progress in these areas has been limited by the low
availability of parallel corpora of code and natural language descriptions,
which tend to be small and constrained to specific domains.
In this work we introduce a large and diverse parallel corpus of a hundred
thousand Python functions with their documentation strings ("docstrings")
generated by scraping open source repositories on GitHub. We describe baseline
results for the code documentation and code generation tasks obtained by neural
machine translation. We also experiment with data augmentation techniques to
further increase the amount of training data.
We release our datasets and processing scripts in order to stimulate research
in these areas.
Comment: 5 pages, 1 figure, 3 tables
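As a rough illustration of how function and docstring pairs might be collected from source files, the sketch below uses Python's standard `ast` module. It is not the authors' released processing script; the function name `extract_pairs` and the sample input `SAMPLE` are assumptions made for this example.

```python
import ast


def extract_pairs(source):
    """Return (name, docstring, code) triples for every documented function."""
    pairs = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)  # None when there is no docstring
            if doc:
                # Source text of the whole function, docstring included.
                code = ast.get_source_segment(source, node)
                pairs.append((node.name, doc, code))
    return pairs


# Hypothetical input file contents.
SAMPLE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def undocumented(x):
    return x
'''

pairs = extract_pairs(SAMPLE)
```

Only `add` is kept, because `undocumented` has no docstring. A corpus builder would typically also strip the docstring from the code side and filter out duplicates and trivial functions.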
GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution
Augmenting large language models (LLMs) to use external tools enhances their
performance across a variety of tasks. However, prior works over-rely on
task-specific demonstrations of tool use, which limits their generalizability
and raises their computational cost because they make many calls to large-scale
LLMs. We introduce
GEAR, a computationally efficient query-tool grounding algorithm that is
generalizable to various tasks that require tool use while not relying on
task-specific demonstrations. GEAR achieves better efficiency by delegating
tool grounding and execution to small language models (SLMs) and LLMs,
respectively, while leveraging semantic and pattern-based evaluation at both
question and answer levels for generalizable tool grounding. We evaluate GEAR
on 14 datasets across 6 downstream tasks, demonstrating its strong
generalizability to novel tasks, tools and different SLMs. Despite offering
more efficiency, GEAR achieves higher precision in tool grounding compared to
prior strategies using LLM prompting, thus improving downstream accuracy at a
reduced computational cost. For example, we demonstrate that GEAR-augmented
GPT-J and GPT-3 outperform counterpart tool-augmented baselines because of
better tool use
Changeset-based Retrieval of Source Code Artifacts for Bug Localization
Modern software development is extremely collaborative and agile, with unprecedented speed and scale of activity. Popular trends like continuous delivery and continuous deployment aim at building, fixing, and releasing software with greater speed and frequency. Bug localization, which aims to automatically localize bug reports to relevant software artifacts, has the potential to improve software developer efficiency by reducing the time spent on debugging and examining code. To date, this problem has been primarily addressed by applying information retrieval techniques based on static code elements, which are intrinsically unable to reflect how software evolves over time. Furthermore, as prior approaches frequently rely on exact term matching to measure relatedness between a bug report and a software artifact, they are prone to be affected by the lexical gap that exists between natural and programming language.
This thesis explores using software changes (i.e., changesets), instead of static code elements, as the primary data unit to construct an information retrieval model for bug localization. Changesets, which represent the differences between two consecutive versions of the source code, provide a natural representation of a software change and make it possible to capture both the semantics of the source code and the semantics of the code modification. To bridge the lexical gap between source code and natural language, this thesis investigates topic modeling and deep learning architectures that create semantically rich data representations, with the goal of identifying latent connections between bug reports and source code. To show the feasibility of the proposed approaches, this thesis also investigates practical aspects of using a bug localization tool, such as retrieval delay and training data availability.
The results indicate that the proposed techniques effectively leverage historical data about bugs and their related source code components to improve retrieval accuracy, especially for bug reports that are expressed in natural language with little to no explicit code references. Accuracy improves further when the size of the training dataset is increased through the data augmentation and data balancing strategies proposed in this thesis, although the magnitude of the improvement varies with the model architecture. In terms of retrieval delay, the results indicate that the proposed deep learning architecture significantly outperforms prior work and scales up with respect to search space size.
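To make the retrieval setting concrete, the sketch below ranks changesets against a bug report with plain TF-IDF and cosine similarity. This is the kind of exact-term-matching baseline the abstract contrasts itself with, not the topic-model or deep learning approach of the thesis. The names `tokenize`, `rank_changesets` and `CHANGESETS`, along with the sample data, are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase terms, breaking camelCase identifiers apart."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", text)
    return [p.lower() for p in parts]


def rank_changesets(bug_report, changesets):
    """Return (changeset id, cosine score) pairs, best match first."""
    docs = {cid: Counter(tokenize(text)) for cid, text in changesets.items()}
    n = len(docs)
    # Document frequency: in how many changesets each term appears.
    df = Counter(term for counts in docs.values() for term in counts)
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}

    def vectorize(counts):
        # Terms unseen in any changeset carry no weight.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    query = vectorize(Counter(tokenize(bug_report)))
    scores = {cid: cosine(query, vectorize(c)) for cid, c in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical changesets: diff text plus commit message.
CHANGESETS = {
    "c1": "- return cache.get(key)\n+ if key in cache: return cache.get(key)\n"
          "fix KeyError in cache lookup",
    "c2": "+ def renderButton(label): pass\nadjust button padding in toolbar",
}

ranked = rank_changesets("KeyError raised during cache lookup", CHANGESETS)
```

The report shares the terms "key", "error", "cache" and "lookup" with `c1` and nothing with `c2`, so `c1` ranks first. A report that described the same fault in different words would score zero here, which is the lexical gap the thesis sets out to bridge.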