Overcoming Data Challenges in Machine Translation
Data-driven machine translation paradigms—which use machine learning to create translation models that can automatically translate from one language to another—have the potential to enable seamless communication across language barriers, and improve global information access. For this to become a reality, machine translation must be available for all languages and styles of text. However, the translation quality of these models is sensitive to the quality and quantity of the data the models are trained on. In this dissertation we address and analyze challenges arising from this sensitivity; we present methods that improve translation quality in difficult data settings, and analyze the effect of data quality on machine translation quality.
Machine translation models are typically trained on parallel corpora, but limited quantities of such data are available for most language pairs, leading to a low-resource problem. We present a method for transfer learning from a paraphraser to overcome data sparsity in low-resource settings. Even when training data is available in the desired language pair, it is frequently of a different style or genre than the text we would like to translate, leading to a domain mismatch. We present a method for improving translation quality under domain adaptation.
A seemingly obvious approach when faced with a lack of data is to acquire more data. However, it is not always feasible to produce additional human translations. In such a case, an option may be to crawl the web for additional training data. However, as we demonstrate, such data can be very noisy and harm machine translation quality. Our analysis motivated subsequent work on data filtering and cleaning by the broader community.
The contributions in this dissertation not only improve translation quality in difficult data settings, but also serve as a reminder to carefully consider the impact of the data when training machine learning models.
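The kind of data cleaning this analysis motivated can be illustrated with a simple heuristic filter for crawled parallel data. This is a generic sketch, not the dissertation's own method; the thresholds and example sentence pairs are hypothetical.

```python
# Illustrative noise filter for crawled parallel data: drop sentence pairs
# that are empty, identical on both sides (likely untranslated copies), or
# whose token-length ratio is implausible for a true translation.
def is_clean_pair(src: str, tgt: str, max_ratio: float = 2.5) -> bool:
    src_tokens, tgt_tokens = src.split(), tgt.split()
    if not src_tokens or not tgt_tokens:
        return False          # one side is empty
    if src.strip() == tgt.strip():
        return False          # copied, untranslated text
    ratio = len(src_tokens) / len(tgt_tokens)
    return 1.0 / max_ratio <= ratio <= max_ratio

pairs = [
    ("the cat sat on the mat", "le chat était assis sur le tapis"),
    ("click here", "click here"),                    # copy noise
    ("hello", "a b c d e f g h i j k l m n o p"),    # length mismatch
]
clean = [p for p in pairs if is_clean_pair(*p)]
```

Real filtering pipelines layer language identification and translation-model scoring on top of surface heuristics like these.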
Neural NLP models under low-supervision scenarios
Neural models have been shown to work well for natural language processing tasks when one has large amounts of labeled data, but problems arise when this is not the case. In this thesis we investigate several 'low-supervision' scenarios in which we do not have sufficient training data, and we propose methods to improve performance in these scenarios.
First, we consider the scenario where we can use other types of resources in addition to the limited training labels. For instance, we can ask human annotators to provide rationales supporting their labels (annotations) for training examples. To capitalize on such supervision, we develop a neural model that can train on both instance labels and associated rationales. We also investigate how to incorporate existing ontologies into neural models. Specifically, we develop a novel training algorithm that enforces weight sharing among similar words in the ontologies, thus inductively biasing the neural model training.
In addition to incorporating other types of resources beyond instance labels, we also use transfer learning techniques, which are general means of learning in low-supervision settings. We study how to use multiple sets of pre-trained word embeddings as inputs to neural models, and how to fine-tune them to the task at hand more intelligently than by simply concatenating them. We also develop a novel model for text generation, in which the model is able to generate text from a new domain (unseen in the training data). Rather than simply fine-tuning the model on the target domain, the model fully uses the domain information in the training set, allowing it to generate domain-specific text.
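The embedding-combination idea can be sketched minimally. This is an illustration, not the thesis's model: it assumes the embedding sets have already been projected to a common dimension, and the per-set weights (fixed logits here) would in practice be trained jointly with the downstream task.

```python
import numpy as np

# Combine several pre-trained embedding sets with softmax-normalized per-set
# weights instead of concatenating them. In a real model the logits would be
# learnable parameters updated during task training.
def combine_embeddings(embedding_sets: list, logits: np.ndarray) -> np.ndarray:
    weights = np.exp(logits) / np.exp(logits).sum()   # softmax over sets
    return sum(w * e for w, e in zip(weights, embedding_sets))

rng = np.random.default_rng(0)
glove_like = rng.normal(size=(5, 8))   # hypothetical: 5 words, dimension 8
w2v_like = rng.normal(size=(5, 8))
mixed = combine_embeddings([glove_like, w2v_like], np.array([0.0, 0.0]))
# Equal logits yield an unweighted average of the two embedding sets.
```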
Lastly, we consider how to collect data under a limited budget more efficiently than by randomly selecting unlabeled data for annotation. We develop new active learning (AL) methods to collect more informative examples to be annotated specifically for neural models, so that better models and more discriminative text representations can be learned with fewer labels. Following this, we further develop new AL approaches for when we have richly annotated data from a relevant domain; that is, we combine AL and transfer learning to leverage the advantages of both methods. We also investigate how to use the pre-trained deep bidirectional transformer (BERT) to actively select examples for labeling.
Computer Science
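As a concrete baseline for the active-learning setting described above, the following sketches pool-based uncertainty sampling. The thesis develops more sophisticated AL methods; this is only a generic illustration with made-up prediction probabilities.

```python
import numpy as np

# Pool-based active learning by uncertainty sampling: rank unlabeled examples
# by the entropy of the model's predicted class distribution and send the
# most uncertain ones to annotators first.
def select_for_annotation(probs: np.ndarray, budget: int) -> np.ndarray:
    """probs: (n_examples, n_classes) predicted probabilities."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(-entropy)[:budget]   # indices of most uncertain examples

pool_probs = np.array([
    [0.98, 0.02],   # confident prediction -> low annotation priority
    [0.55, 0.45],   # uncertain -> high priority
    [0.50, 0.50],   # maximally uncertain
])
picked = select_for_annotation(pool_probs, budget=2)
```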
TOWARDS BUILDING AN INTELLIGENT REVISION ASSISTANT FOR ARGUMENTATIVE WRITINGS
Current intelligent writing assistance tools (e.g., Grammarly, Turnitin) typically work by locating problems in users' essays (grammar, spelling, argument structure, etc.) and suggesting possible solutions. These tools focus on providing feedback on a single draft, while ignoring feedback on an author's changes between drafts (revision). This thesis argues that it is also important to provide feedback on authors' revisions, as such information can improve not only the quality of the writing but also the rewriting skill of the authors. Thus, it is desirable to build an intelligent assistant that focuses on providing feedback on revisions.
This thesis presents work from two perspectives towards building such an assistant: 1) a study of revision's impact on writing, which includes the development of a sentence-level revision schema, the annotation of corpora based on the schema, and data analysis of the created corpora; a prototype revision assistant was built to provide revision feedback based on the schema, and a user study was conducted to investigate whether the assistant could influence users' rewriting behaviors; 2) the development of algorithms for automatic revision identification, which includes the automatic extraction of revised content and the automatic classification of revision types; we first investigated the two problems separately in a pipeline manner and then explored a joint approach that solves both problems at the same time.
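The extraction step, finding revised content between two drafts, can be illustrated with a sentence-level diff. This is a generic sketch using Python's difflib, not the thesis's actual alignment algorithm, and classifying each revision's type would be a separate step; the draft sentences are invented.

```python
import difflib

# Align the sentences of two drafts and report which sentences were added,
# deleted, or modified between them.
def extract_revisions(draft1: list, draft2: list):
    matcher = difflib.SequenceMatcher(a=draft1, b=draft2)
    revisions = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            revisions.append((op, draft1[i1:i2], draft2[j1:j2]))
    return revisions

old = ["Cats are great.", "They sleep a lot.", "The end."]
new = ["Cats are wonderful.", "They sleep a lot.", "They also purr.", "The end."]
revs = extract_revisions(old, new)
# A 'replace' for the reworded first sentence, an 'insert' for the new one.
```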
Extracting Insights from Differences: Analyzing Node-aligned Social Graphs
Social media and network research often focus on the agreement between different entities to infer connections, recommend actions and subscriptions, and even improve algorithms via ensemble methods. However, studying differences instead of similarities can yield useful insights in all these cases. We can infer and understand inter-community interactions (including ideological and user-based community conflicts and hierarchical community relations) and improve community detection algorithms via insights gained from differences among entities such as communities, users, and algorithms. When the entities are communities or user groups, we often study the differences via node-aligned networks: networks with the same set of nodes but different sets of edges. The edges define implicit connections that we can infer via similarities or differences between two nodes.
We perform a set of studies to identify and understand differences among user groups using Reddit, where the subreddit structure provides us with pre-defined user groups. Studying the difference between author overlap and textual similarity among different subreddits, we find misaligned edges and networks which expose subreddits at ideological 'war', community fragmentation, asymmetry of interactions involving subreddits based on marginalized social groups, and more. Differences in perceived user behavior across different subreddits allow us to identify subreddit conflicts and features that can indicate communal misbehavior. We show that these features can be used to identify some subreddits banned by Reddit. Applying the idea of differences to community detection algorithms helps us identify problematic community assignments where we can ask for human help in categorizing a node into a specific community. It also gives us an idea of the overall performance of a particular community detection algorithm on a particular network input. In general, these improve ensemble community detection techniques. We demonstrate this via CommunityDiff (a community detection and visualization tool), which compares and contrasts different algorithms and incorporates user knowledge in community detection output. We believe the idea of gaining insights from differences can be applied to several other problems and help us understand and improve social media interactions and research.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/149801/1/srayand_1.pd
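The node-aligned comparison described above can be sketched minimally: given two edge sets over the same nodes, the misaligned edges are those present in one network but not the other. This is a toy illustration of the idea, not the dissertation's analysis pipeline, and the subreddit names are hypothetical.

```python
# Two node-aligned networks: same nodes, different edge sets. Edges are
# represented as frozensets so they are undirected and hashable.
def misaligned_edges(edges_a: set, edges_b: set):
    only_a = edges_a - edges_b   # e.g. author overlap without textual similarity
    only_b = edges_b - edges_a   # e.g. textual similarity without author overlap
    return only_a, only_b

e = frozenset
author_overlap = {e({"A", "B"}), e({"A", "C"})}
text_similarity = {e({"A", "C"}), e({"B", "C"})}
only_author, only_text = misaligned_edges(author_overlap, text_similarity)
# {A, B} share authors but not language; {B, C} share language but not authors.
```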