281 research outputs found

    Finding the online cry for help: automatic text classification for suicide prevention

    Successful prevention of suicide, a serious public health concern worldwide, hinges on the adequate detection of suicide risk. While online platforms are increasingly used for expressing suicidal thoughts, manually monitoring for such signals of distress is practically infeasible, given the information overload suicide prevention workers are confronted with. In this thesis, the automatic detection of suicide-related messages is studied. It presents the first classification-based approach to online suicidality detection, and focuses on Dutch user-generated content. In order to evaluate the viability of such a machine learning approach, we developed a gold standard corpus, consisting of message board and blog posts. These were manually labeled according to a newly developed annotation scheme, grounded in suicide prevention practice. The scheme provides for the annotation of a post's relevance to suicide, and the subject and severity of a suicide threat, if any. This allowed us to derive two tasks: the detection of suicide-related posts, and of severe, high-risk content. In a series of experiments, we sought to determine how well these tasks can be carried out automatically, and which information sources and techniques contribute to classification performance. The experimental results show that both types of messages can be detected with high precision. Therefore, the amount of noise generated by the system is minimal, even on very large datasets, making it usable in a real-world prevention setting. Recall is high for the relevance task, but at around 60%, it is considerably lower for severity. This is mainly attributable to implicit references to suicide, which often go undetected. We found a variety of information sources to be informative for both tasks, including token and character ngram bags-of-words, features based on LSA topic models, polarity lexicons and named entity recognition, and suicide-related terms extracted from a background corpus. 
To improve classification performance, the models were optimized using feature selection, hyperparameter optimization, or a combination of both. A distributed genetic algorithm approach proved successful in finding good solutions for this complex search problem, and resulted in more robust models. Experiments with cascaded classification of the severity task did not reveal performance benefits over direct classification (in terms of F1-score), but its structure allows the use of slower, memory-based learning algorithms that considerably improved recall. At the end of this thesis, we address a problem typical of user-generated content: noise in the form of misspellings, phonetic transcriptions and other deviations from the linguistic norm. We developed an automatic text normalization system, using a cascaded statistical machine translation approach, and applied it to normalize the data for the suicidality detection tasks. Subsequent experiments revealed that, compared to the original data, normalized data resulted in fewer and more informative features, and improved classification performance. This extrinsic evaluation demonstrates the utility of automatic normalization for suicidality detection and, more generally, for text classification on user-generated content.

    Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution

    Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard WORD2VEC when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embedding.

    Ti plasmids


    Legal translation trainees’ performance in from-scratch translation and post-editing: A product analysis

    This study explores the practice of adopting MT tools in the area of legal translation didactics to assess and compare the translation quality of from-scratch vs post-edited translations through an error-based revision. Error analysis highlights both common and unique patterns in the frequency, type and severity of translation errors to possibly determine if and to what extent errors are influenced by the presence of a pre-translated text and which procedure led to higher-quality translations. The study also points out the areas of strength of Machine Translation applied to legal translation didactics alongside its limitations as inferable from the final product.

    Postediting machine translation output and its revision: subject-matter experts versus professional translators

    The present research compares engineers’ and professional translators’ postediting of a technical text in terms of speed, documentation and changes. It also compares the postedited texts with regard to quality. Further, we explore which of the following workflows is faster and produces outputs of higher quality: postediting of MT output by engineers followed by revision of the postedited text by professional translators, or vice versa. The findings suggest that expertise and experience in the subject matter are the main factors determining postediting quality. When recurrent errors are penalized, the engineers’ postediting of technical texts is of significantly higher quality than the translators’. The translators’ and the engineers’ postediting and revision speed did not differ significantly. For technical texts, the quality improvement brought about by engineer revision of translator postediting is greater than that brought about by the reverse arrangement. Further, the quality of the postedited texts and their revised versions (whether produced by professional translators or engineers) changes significantly depending on whether recurrent errors are penalized.