A Corpus of Sentence-level Revisions in Academic Writing: A Step towards Understanding Statement Strength in Communication
The strength with which a statement is made can have a significant impact on
the audience. For example, international relations can be strained by how the
media in one country describes an event in another; and papers can be rejected
because they overstate or understate their findings. It is thus important to
understand the effects of statement strength. A first step is to be able to
distinguish between strong and weak statements. However, even this problem is
understudied, partly due to a lack of data. Since strength is inherently
relative, revisions of texts that make claims are a natural source of data on
strength differences. In this paper, we introduce a corpus of sentence-level
revisions from academic writing. We also describe insights gained from our
annotation efforts for this task. Comment: 6 pages, to appear in Proceedings of ACL 2014 (short paper)
Second language learning from a multilingual perspective
Thesis: Ph.D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018. How do people learn a second language? In this thesis, we study this question through an examination of cross-linguistic transfer: the role of a speaker's native language in the acquisition, representation, usage, and processing of a second language. We present a computational framework that enables studying transfer in a unified fashion across language production and language comprehension. Our framework supports bidirectional inference between linguistic characteristics of speakers' native languages and the way they use and process a new language. We leverage this inference ability to demonstrate the systematic nature of cross-linguistic transfer, and to uncover some of its key linguistic and cognitive manifestations. We instantiate our framework in language production by relating syntactic usage patterns and grammatical errors in English as a Second Language (ESL) to typological properties of the native language, showing its utility for automated typology learning and prediction of second language grammatical errors. We then introduce eye tracking during reading as a methodology for studying cross-linguistic transfer in second language comprehension. Using this methodology, we demonstrate that learners' native language can be predicted from their eye movements while reading free-form second language text. Further, we show that language processing during second language comprehension is intimately related to linguistic characteristics of the reader's first language. Finally, we introduce the Treebank of Learner English (TLE), the first syntactically annotated corpus of learner English.
The TLE is annotated with Universal Dependencies (UD), a framework geared towards multilingual language analysis, and will support linguistic and computational research on learner language. Taken together, our results highlight the importance of multilingual approaches to the scientific study of second language acquisition, and to Natural Language Processing (NLP) applications for non-native language. By Yevgeni Berzak, Ph.D.
Grammatical Error Correction: A Survey of the State of the Art
Grammatical Error Correction (GEC) is the task of automatically detecting and
correcting errors in text. The task not only includes the correction of
grammatical errors, such as missing prepositions and mismatched subject-verb
agreement, but also orthographic and semantic errors, such as misspellings and
word choice errors, respectively. The field has seen significant progress in the
last decade, motivated in part by a series of five shared tasks, which drove
the development of rule-based methods, statistical classifiers, statistical
machine translation, and finally neural machine translation systems which
represent the current dominant state of the art. In this survey paper, we
condense the field into a single article and first outline some of the
linguistic challenges of the task, introduce the most popular datasets that are
available to researchers (for both English and other languages), and summarise
the various methods and techniques that have been developed with a particular
focus on artificial error generation. We next describe the many different
approaches to evaluation as well as concerns surrounding metric reliability,
especially in relation to subjective human judgements, before concluding with
an overview of recent progress and suggestions for future work and remaining
challenges. We hope that this survey will serve as a comprehensive resource for
researchers who are new to the field or who want to be kept apprised of recent
developments.
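One technique the survey singles out, artificial error generation, typically injects plausible errors into clean text to create synthetic training pairs for GEC systems. A minimal sketch of the idea (the single error type, the preposition list, and the drop probability below are illustrative choices of ours, not taken from the survey):

```python
import random

random.seed(7)
PREPOSITIONS = {"in", "on", "at", "of", "for", "to", "with"}

def corrupt(sentence, p_drop=0.5):
    """Inject one grammatical error by deleting a preposition with
    probability p_drop -- a toy flavour of artificial error generation."""
    tokens = sentence.split()
    preps = [i for i, t in enumerate(tokens) if t.lower() in PREPOSITIONS]
    if preps and random.random() < p_drop:
        del tokens[random.choice(preps)]
    return " ".join(tokens)

clean = "She is interested in the results of the study"
print(corrupt(clean, p_drop=1.0))  # e.g. "She is interested the results of the study"
```

The corrupted/clean pair can then serve as a synthetic (source, target) training example; real systems use richer error taxonomies (agreement, spelling, word choice) and corpus-derived error statistics.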
MUSIED: A Benchmark for Event Detection from Multi-Source Heterogeneous Informal Texts
Event detection (ED) identifies and classifies event triggers from
unstructured texts, serving as a fundamental task for information extraction.
Despite the remarkable progress achieved in the past several years, most
research efforts focus on detecting events from formal texts (e.g., news
articles, Wikipedia documents, financial announcements). Moreover, the texts in
each dataset are either from a single source or multiple yet relatively
homogeneous sources. With massive amounts of user-generated text accumulating
on the Web and inside enterprises, identifying meaningful events in these
informal texts, usually from multiple heterogeneous sources, has become a
problem of significant practical value. As a pioneering exploration that
expands event detection to scenarios involving informal and heterogeneous
texts, we propose a new large-scale Chinese event detection dataset based on
user reviews, text conversations, and phone conversations on a leading
e-commerce platform for food service. We carefully investigate the proposed
dataset's textual informality and multi-source heterogeneity characteristics by
inspecting data samples quantitatively and qualitatively. Extensive experiments
with state-of-the-art event detection methods verify the unique challenges
posed by these characteristics, indicating that multi-source informal event
detection remains an open problem and requires further efforts. Our benchmark
and code are released at \url{https://github.com/myeclipse/MUSIED}. Comment: Accepted at EMNLP 202
A Frustratingly Easy Plug-and-Play Detection-and-Reasoning Module for Chinese Spelling Check
In recent years, Chinese Spelling Check (CSC) has been greatly improved by
designing task-specific pre-training methods or introducing auxiliary tasks,
which mostly solve this task in an end-to-end fashion. In this paper, we
propose to decompose the CSC workflow into detection, reasoning, and searching
subtasks so that the rich external knowledge about the Chinese language can be
leveraged more directly and efficiently. Specifically, we design a
plug-and-play detection-and-reasoning module that is compatible with existing
SOTA non-autoregressive CSC models to further boost their performance. We find
that the detection-and-reasoning module trained for one model can also benefit
other models. We also examine the interpretability afforded by the task
decomposition. Extensive experiments and detailed analyses demonstrate the
effectiveness and competitiveness of the proposed module. Comment: Accepted for publication in Findings of EMNLP 202
Affect Lexicon Induction For the Github Subculture Using Distributed Word Representations
Sentiments and emotions play essential roles in small group interactions, especially in self-organized collaborative groups. Many people view sentiments as universal constructs; however, cultural differences exist in some aspects of sentiment. Understanding the features of the sentiment space in small group cultures provides essential insights into the dynamics of self-organized collaboration. However, due to the scarcity of carefully human-annotated data, it is hard to characterize sentiment divergences across cultures.
In this thesis, we present a new approach to inspecting cultural differences at the level of sentiment, comparing a subculture with the general social environment. We use Github, a collaborative software development network, as an example of a self-organized subculture. First, we train word embeddings on large corpora and align the embeddings using a linear transformation. Then we model finer-grained human sentiment in the Evaluation-Potency-Activity (EPA) space and extend the subculture's EPA lexicon with a neural network of two dense layers. Finally, we apply a Long Short-Term Memory (LSTM) network to analyze the identities' sentiments triggered by event-based sentences. We evaluate the predicted EPA lexicon for the Github community on a recently collected dataset, and the results show that our approach can capture subtle changes in affective dimensions. Moreover, our induced sentiment lexicon shows that individuals from the two environments have different understandings of sentiment-related words and phrases but agree on nouns and adjectives. The sentiment features of "Github culture" suggest that people in self-organized groups tend to reduce personal sentiment to improve group collaboration.
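The embedding-alignment step described above is commonly formulated as an orthogonal Procrustes problem: learn a linear map that rotates one embedding space onto another using vectors for shared anchor words. A minimal sketch, with random toy vectors standing in for the trained Github and general-corpus embeddings (the dimensions and anchor setup are illustrative, not from the thesis):

```python
import numpy as np

def align_embeddings(X, Y):
    """Learn an orthogonal map W minimizing ||XW - Y||_F
    (orthogonal Procrustes, solved via SVD of X^T Y)."""
    # X: source-space vectors, Y: target-space vectors for shared anchor words
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy anchors: the source space is a hidden rotation of the target space.
rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 8))                        # target-space anchor vectors
R_true = np.linalg.qr(rng.normal(size=(8, 8)))[0]   # hidden orthogonal rotation
X = Y @ R_true.T                                    # rotated source-space vectors

W = align_embeddings(X, Y)
print(np.allclose(X @ W, Y, atol=1e-6))  # True: the rotation is recovered
```

With real embeddings, X and Y would be the vectors of words shared between the two corpora, and the learned W maps the remaining subculture vocabulary into the general-domain space so the spaces can be compared directly.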
Learning Explicit and Implicit Arabic Discourse Relations
We propose in this paper a supervised learning approach to identifying discourse relations in Arabic texts. To our knowledge, this work represents the first attempt to cover both explicit and implicit relations that link adjacent as well as non-adjacent Elementary Discourse Units (EDUs) within Segmented Discourse Representation Theory (SDRT). We use the Discourse Arabic Treebank corpus (D-ATB), which is composed of newspaper documents extracted from part 3 of the syntactically annotated Arabic Treebank v3.2, where each document is associated with a complete discourse graph following the cognitive principles of SDRT. Our inventory of discourse relations is a three-level hierarchy of 24 relations grouped into 4 top-level classes. To learn them automatically, we use state-of-the-art features whose efficiency has been empirically proved, and we investigate how each feature contributes to the learning process. We report experiments on identifying fine-grained discourse relations, mid-level classes, and top-level classes, and we compare our approach with three baselines based on the most frequent relation, discourse connectives, and the features used by Al-Saif and Markert (2011). Our results are encouraging and outperform all three baselines, with an F-score of 78.1% and an accuracy of 80.6%.
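The feature-based supervised setup described above, per-EDU-pair features fed to a classifier over a relation inventory, can be sketched generically. The feature names, toy examples, and relation labels below are hypothetical stand-ins, not the actual D-ATB feature set:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical features for an EDU pair (connective, adjacency, length
# difference) -- illustrating the shape of a feature-based approach only.
train_feats = [
    {"connective": "because", "adjacent": True,  "len_diff": 2},
    {"connective": "but",     "adjacent": True,  "len_diff": 0},
    {"connective": "NONE",    "adjacent": False, "len_diff": 5},
    {"connective": "because", "adjacent": True,  "len_diff": 1},
]
train_labels = ["Explanation", "Contrast", "Elaboration", "Explanation"]

# DictVectorizer one-hot encodes the categorical features; a linear
# classifier then predicts the discourse relation for a new EDU pair.
clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_feats, train_labels)
print(clf.predict([{"connective": "but", "adjacent": True, "len_diff": 1}])[0])
```

Implicit relations correspond to pairs with no connective (here the "NONE" value), which is exactly where lexical and syntactic features beyond the connective carry the predictive load.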