Sentences and Documents in Native Language Identification
Starting from a wide set of linguistic features, we present the first in-depth feature analysis in two different Native Language Identification (NLI) scenarios. We compare the results obtained in a traditional NLI document classification task and in a newly introduced sentence classification task, investigating the different roles played by the considered features. Finally, we study the impact of a set of selected features extracted from the sentence classifier on document classification.
Stacked Sentence-Document Classifier Approach for Improving Native Language Identification
In this paper, we describe the approach of the ItaliaNLP Lab team to native language identification and discuss the results we submitted as participants to the essay track of NLI Shared Task 2017. We introduce for the first time a 2-stacked sentence-document architecture for native language identification that is able to exploit both local sentence information and a wide set of general-purpose features qualifying the lexical and grammatical structure of the whole document. When evaluated on the official test set, our sentence-document stacked architecture obtained the best result among all the participants of the essay track with an F1 score of 0.8818.
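The 2-stacked sentence-document idea described above can be illustrated with a minimal sketch (not the authors' implementation; `toy_clf`, the label set, and the feature layout are all hypothetical): a first-stage classifier labels each sentence, and the distribution of its predictions becomes extra document-level features for the second stage.

```python
from collections import Counter

def document_features(sentences, sentence_classifier, labels):
    """Stage 2 features: the distribution of first-stage sentence
    predictions over the candidate native languages."""
    preds = [sentence_classifier(s) for s in sentences]  # stage 1: per-sentence predictions
    counts = Counter(preds)
    total = len(preds) or 1
    return [counts.get(lab, 0) / total for lab in labels]

# Toy stand-in for a trained sentence-level classifier.
toy_clf = lambda s: "IT" if "ciao" in s else "DE"
feats = document_features(["ciao mondo", "hallo welt", "ciao"], toy_clf, ["IT", "DE"])
```

In a full system these proportions would be concatenated with the document's own lexical and grammatical features before training the document classifier.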
Constrained multi-task learning for automated essay scoring
Supervised machine learning models for automated essay scoring (AES) usually require substantial task-specific training data in order to make accurate predictions for a particular writing task. This limitation hinders their utility, and consequently their deployment, in real-world settings. In this paper, we overcome this shortcoming using a constrained multi-task pairwise-preference learning approach that enables the data from multiple tasks to be combined effectively. Furthermore, contrary to some recent research, we show that high-performance AES systems can be built with little or no task-specific training data. We perform a detailed study of our approach on a publicly available dataset in scenarios where we have varying amounts of task-specific training data and in scenarios where the number of tasks increases.

This is the author accepted manuscript. The final version is available from the Association for Computational Linguistics at http://acl2016.org/index.php?article_id=71
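Pairwise-preference learning of the kind described above can be sketched as follows (a simplified illustration, not the paper's model; the features and the perceptron-style update are stand-ins): the scorer is trained on pairs of essays so that the preferred essay in each pair receives the higher score.

```python
def features(essay):
    # Hypothetical essay features: token count and vocabulary size.
    words = essay.split()
    return [len(words), len(set(words))]

def score(essay, weights):
    # Linear scoring function over the features.
    return sum(w * f for w, f in zip(weights, features(essay)))

def perceptron_rank_update(pairs, weights, lr=0.1, epochs=20):
    """Perceptron-style pairwise update: whenever the preferred essay
    does not outscore the other, move the weights toward its features."""
    for _ in range(epochs):
        for better, worse in pairs:
            if score(better, weights) <= score(worse, weights):
                fb, fw = features(better), features(worse)
                weights = [w + lr * (b - x) for w, b, x in zip(weights, fb, fw)]
    return weights

pairs = [("a b c d e f", "a a a"), ("the quick brown fox", "the the")]
w = perceptron_rank_update(pairs, [0.0, 0.0])
```

Because training pairs are only compared within the same prompt, pairs from multiple writing tasks can be pooled without forcing their absolute score scales to agree, which is the intuition behind combining data across tasks.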
Towards Orthographic and Grammatical Clinical Text Correction: a First Approach
Grammatical Error Correction (GEC) is a subfield of Natural Language Processing that aims to automatically correct texts that include errors related to spelling, punctuation or grammar. So far, it has mainly focused on texts produced by second-language learners, mostly in English. This Master's Thesis describes a first approach to Grammatical Error Correction for Spanish health records. This specific field has not been explored much until now, neither for Spanish in a general sense nor for the clinical domain specifically. For this purpose, the corpus IMEC (Informes Médicos en Español Corregidos), a manually corrected parallel collection of Electronic Health Records, is introduced. This corpus has been automatically annotated using the toolkit ERRANT, which is specialized in the automatic annotation of GEC parallel corpora and was adapted to Spanish for this task. Furthermore, some experiments using neural networks and data augmentation are presented and compared with a baseline system also created specifically for this task.
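The ERRANT-style annotation mentioned above works by aligning each original sentence with its correction and extracting the edit spans. A simplified token-level sketch (ERRANT additionally classifies each edit by error type, which is omitted here; the example sentence is an invented placeholder, not corpus data):

```python
import difflib

def extract_edits(original, corrected):
    """Align original and corrected token sequences and emit edit spans,
    in the spirit of ERRANT annotation (simplified: no error typing)."""
    orig_toks, corr_toks = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=orig_toks, b=corr_toks)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            edits.append((op, " ".join(orig_toks[i1:i2]), " ".join(corr_toks[j1:j2])))
    return edits

edits = extract_edits("el paciente tiene fiebre altas",
                      "el paciente tiene fiebres altas")
```

Running the extractor on the toy pair yields a single replacement edit, which an ERRANT-like tool would then label with a language-specific error category.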
Analyzing Text Complexity and Text Simplification: Connecting Linguistics, Processing and Educational Applications
Reading plays an important role in the process of learning and knowledge acquisition for both children and adults. However, not all texts are accessible to every prospective reader. Reading difficulties can arise when there is a mismatch between a reader’s language proficiency and the linguistic complexity of the text they read. In such cases, simplifying the text in its linguistic form while retaining all the content could aid reader comprehension. In this thesis, we study text complexity and simplification from a computational linguistic perspective. We propose a new approach to automatically predict text complexity using a wide range of word-level and syntactic features of the text. We show that this approach results in accurate, generalizable models of text readability that work across multiple corpora, genres and reading scales. Moving from documents to sentences, we show that our text complexity features also accurately distinguish different versions of the same sentence in terms of the degree of simplification performed. This is useful in evaluating the quality of simplification performed by a human expert or a machine-generated output and for choosing targets to simplify in a difficult text. We also experimentally show the effect of text complexity on readers’ performance outcomes and cognitive processing through an eye-tracking experiment.

Turning from analyzing text complexity and identifying sentential simplifications to generating simplified text, one can view automatic text simplification as a process of translation from English to simple English. In this thesis, we propose a statistical machine translation based approach for text simplification, exploring the role of focused training data and language models in the process. Exploring the linguistic complexity analysis further, we show that our text complexity features can be useful in assessing the language proficiency of English learners. Finally, we analyze German school textbooks in terms of their linguistic complexity, across various grade levels, school types and among different publishers, by applying a pre-existing set of text complexity features developed for German.
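Word-level features of the kind used in the readability models above can be sketched with a toy extractor (illustrative only; real systems add many more lexical features and syntactic features derived from a parser):

```python
import re

def complexity_features(text):
    """Toy word-level readability features: average word length,
    type-token ratio, and average sentence length (in words)."""
    sentences = [s for s in re.split(r"[.!?]", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)  # rough tokenization
    return {
        "avg_word_len": sum(map(len, words)) / len(words),
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
        "avg_sent_len": len(words) / len(sentences),
    }

feats = complexity_features("The cat sat. The cat ran fast.")
```

In practice such feature vectors are fed to a regression or classification model trained against a reading scale, which is what makes the approach portable across corpora and genres.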
Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018 : 10-12 December 2018, Torino
On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall “Cavallerizza Reale”. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.
When Automated Assessment Meets Automated Content Generation: Examining Text Quality in the Era of GPTs
The use of machine learning (ML) models to assess and score textual data has become increasingly pervasive in an array of contexts including natural language processing, information retrieval, search and recommendation, and credibility assessment of online content. A significant disruption at the intersection of ML and text is the emergence of text-generating large language models such as generative pre-trained transformers (GPTs). We empirically assess the differences in how ML-based scoring models trained on human content assess the quality of content generated by humans versus GPTs. To do so, we propose an analysis framework that encompasses essay-scoring ML models, human- and ML-generated essays, and a statistical model that parsimoniously considers the impact of respondent type, prompt genre, and the ML model used for assessment. A rich testbed is utilized that encompasses 18,460 human-generated and GPT-based essays. Results of our benchmark analysis reveal that transformer pretrained language models (PLMs) score human essay quality more accurately than CNN/RNN and feature-based ML methods. Interestingly, we find that the transformer PLMs tend to score GPT-generated text 10-15% higher on average, relative to human-authored documents. Conversely, traditional deep learning and feature-based ML models score human text considerably higher. Further analysis reveals that although the transformer PLMs are exclusively fine-tuned on human text, they more prominently attend to certain tokens appearing only in GPT-generated text, possibly due to familiarity/overlap in pre-training. Our framework and results have implications for text classification settings where automated scoring of text is likely to be disrupted by generative AI.

Comment: Data available at: https://github.com/nd-hal/automated-ML-scoring-versus-generatio
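The human-versus-GPT scoring comparison at the heart of the paper reduces to a simple quantity per scoring model: the gap between its mean score on GPT-generated essays and its mean score on human-written ones. A toy sketch (the scores and labels below are fabricated placeholders, not the paper's data):

```python
from statistics import mean

def score_gap(scores, authors):
    """Mean score a model assigns to GPT-generated essays minus the mean
    it assigns to human-written ones (positive = GPT scored higher)."""
    gpt = [s for s, a in zip(scores, authors) if a == "gpt"]
    human = [s for s, a in zip(scores, authors) if a == "human"]
    return mean(gpt) - mean(human)

gap = score_gap([0.8, 0.9, 0.7, 0.6], ["gpt", "gpt", "human", "human"])
```

Computing this gap separately per scoring model and prompt genre is one way to surface the asymmetry the paper reports, where transformer PLMs favor GPT text while feature-based models favor human text.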