
    To Normalize, or Not to Normalize: The Impact of Normalization on Part-of-Speech Tagging

    Does normalization help Part-of-Speech (POS) tagging accuracy on noisy, non-canonical data? To the best of our knowledge, little is known about the actual impact of normalization in a real-world scenario, where gold error detection is not available. We investigate the effect of automatic normalization on POS tagging of tweets. We also compare normalization to strategies that leverage large amounts of unlabeled data kept in its raw form. Our results show that normalization helps, but does not consistently add anything beyond initialization of the word-embedding layer alone. The latter approach yields a tagging model that is competitive with a state-of-the-art Twitter tagger.
    Comment: In WNUT 201
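
    A minimal sketch of the embedding-initialization strategy this abstract refers to: embeddings are trained on raw, un-normalized tweets and copied into the tagger's word-embedding matrix before supervised training. The corpus and vocabulary below are hypothetical placeholders, and gensim's Word2Vec merely stands in for whichever embedding method the authors used.

```python
# Sketch, under stated assumptions: initialize a tagger's word-embedding layer
# from embeddings trained on raw (un-normalized) tweets.
import numpy as np
from gensim.models import Word2Vec

# Raw, un-normalized tweets kept as-is (toy corpus).
raw_tweets = [
    ["ikr", "smh", "he", "asked", "fir", "yo", "name"],
    ["lol", "u", "gonna", "be", "late", "again"],
]

w2v = Word2Vec(raw_tweets, vector_size=100, window=5, min_count=1, epochs=10)

# Tagger vocabulary (hypothetical); row 0 is reserved for unknown words.
vocab = {"<unk>": 0, "u": 1, "gonna": 2, "name": 3}
emb = np.random.normal(scale=0.1, size=(len(vocab), 100)).astype("float32")
for word, idx in vocab.items():
    if word in w2v.wv:          # copy the pretrained vector when one exists
        emb[idx] = w2v.wv[word]

# 'emb' would then initialize the tagger's embedding layer before supervised training.
```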

    Computational Sociolinguistics: A Survey

    Language is a social phenomenon and variation is inherent to its social nature. Recently, there has been a surge of interest within the computational linguistics (CL) community in the social dimension of language. In this article we present a survey of the emerging field of "Computational Sociolinguistics" that reflects this increased interest. We aim to provide a comprehensive overview of CL research on sociolinguistic themes, featuring topics such as the relation between language and social identity, language use in social interaction and multilingual communication. Moreover, we demonstrate the potential for synergy between the research communities involved, by showing how the large-scale data-driven methods that are widely used in CL can complement existing sociolinguistic studies, and how sociolinguistics can inform and challenge the methods and assumptions employed in CL studies. We hope to convey the possible benefits of a closer collaboration between the two communities and conclude with a discussion of open challenges.
    Comment: To appear in Computational Linguistics. Accepted for publication: 18th February, 201

    bot.zen @ EmpiriST 2015 - A minimally-deep learning PoS-tagger (trained for German CMC and Web data)

    This article describes the system that participated in the Part-of-Speech tagging subtask of the EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication / social media. The system combines a small assortment of trending techniques, implementing mature methods from NLP and ML, to achieve competitive results on PoS tagging of German CMC and Web corpus data; in particular, it uses word embeddings and character-level representations of word beginnings and endings in an LSTM RNN architecture. Labelled data (Tiger v2.2 and EmpiriST) and unlabelled data (German Wikipedia) were used for training. The system is available under the APLv2 open-source license.
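
    A minimal sketch (not the authors' code) of the architecture type this abstract describes: word embeddings combined with character-level representations of word beginnings and endings, fed to a bidirectional LSTM. All sizes, including the prefix/suffix length k and the tagset size, are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PrefixSuffixTagger(nn.Module):
    def __init__(self, n_words, n_chars, n_tags,
                 w_dim=100, c_dim=16, k=3, hidden=128):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, w_dim)
        self.char_emb = nn.Embedding(n_chars, c_dim)
        self.k = k  # number of leading/trailing characters used per word
        self.lstm = nn.LSTM(w_dim + 2 * k * c_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, words, prefixes, suffixes):
        # words: (batch, seq) word ids; prefixes/suffixes: (batch, seq, k) ids
        # of each word's first/last k characters (padded where words are shorter).
        b, s = words.shape
        w = self.word_emb(words)
        p = self.char_emb(prefixes).view(b, s, -1)
        q = self.char_emb(suffixes).view(b, s, -1)
        h, _ = self.lstm(torch.cat([w, p, q], dim=-1))
        return self.out(h)  # per-token tag scores

tagger = PrefixSuffixTagger(n_words=10000, n_chars=200, n_tags=54)
scores = tagger(torch.zeros(2, 7, dtype=torch.long),
                torch.zeros(2, 7, 3, dtype=torch.long),
                torch.zeros(2, 7, 3, dtype=torch.long))
print(scores.shape)  # torch.Size([2, 7, 54])
```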

    Robust part-of-speech tagging of social media text

    Part-of-Speech (PoS) taggers are an important processing component in many Natural Language Processing (NLP) applications, which has led to a variety of taggers for tackling this task. Recent work in this field showed that tagging accuracy on informal text domains is poor in comparison to formal text domains. In particular, social media text, which is inherently different from formal standard text, leads to a drastically increased error rate. These challenges originate in a lack of robustness of taggers towards domain transfers, and the increased error rate has an impact on NLP applications that depend on PoS information. The main contribution of this thesis is the exploration of the concept of robustness under the following three aspects: (i) domain robustness, (ii) language robustness and (iii) long-tail robustness.

    Regarding (i), we start with an analysis of the phenomena found in informal text that make tagging this kind of text challenging. Furthermore, we conduct a comprehensive robustness comparison of many commonly used taggers for English and German by evaluating them on text from several domains. We find that the tagging of informal text is poorly supported by available taggers. A review and analysis of currently used methods to adapt taggers to informal text shows that these methods improve tagging accuracy but offer no satisfactory solution. We propose an alternative approach that reaches increased multi-domain tagging robustness: tagging in two steps, where the first step tags on a coarse-grained level and the second step refines the tags to the fine-grained level (see the sketch below).

    Regarding (ii), we investigate whether each language requires a language-tailored PoS tagger or whether the construction of a competitive language-independent tagger is feasible. We explore the technical details that contribute to a tagger's language robustness by comparing taggers based on different algorithms, learning models for 21 languages. We find that language robustness is a less severe issue and that the choice of tagger depends more on the granularity of the tagset to be learned than on the language.

    Regarding (iii), we investigate methods to improve the tagging of infrequent phenomena for which no sufficient amount of annotated training data is available, a common challenge in the social media domain. We propose a new method to overcome this lack of data that offers an inexpensive way of producing more training data; a field study shows that the quality of the produced data suffices to train tagger models that can recognize these under-represented phenomena.

    Finally, we present two software tools, FlexTag and DeepTC, which we developed in the course of this thesis. These tools provide the necessary flexibility and reproducibility for the experiments in this thesis.
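
    An illustrative sketch of the two-step tagging idea described above: a coarse-grained pass followed by a refinement pass that maps each coarse tag to a fine-grained one. The toy stand-in models are hypothetical; the thesis implementation itself is not shown here.

```python
# Illustrative sketch (an assumption, not the thesis code) of two-step tagging:
# step 1 assigns coarse-grained tags, step 2 refines them to fine-grained tags.
from typing import Callable, List, Tuple

CoarseTagger = Callable[[List[str]], List[str]]
FineRefiner = Callable[[List[str], List[str]], List[str]]

def tag_two_step(tokens: List[str],
                 coarse: CoarseTagger,
                 refine: FineRefiner) -> List[Tuple[str, str, str]]:
    """Return (token, coarse_tag, fine_tag) triples."""
    coarse_tags = coarse(tokens)             # step 1: robust coarse-grained pass
    fine_tags = refine(tokens, coarse_tags)  # step 2: refine within coarse classes
    return list(zip(tokens, coarse_tags, fine_tags))

# Toy stand-ins for the two models:
def toy_coarse(tokens: List[str]) -> List[str]:
    return ["VERB" if t.endswith("s") else "NOUN" for t in tokens]

def toy_refine(tokens: List[str], coarse_tags: List[str]) -> List[str]:
    # A real refiner would condition on the token and its context; here we
    # simply mark a dummy fine-grained subclass within each coarse class.
    return [f"{ct}-FIN" for ct in coarse_tags]

print(tag_two_step(["she", "runs"], toy_coarse, toy_refine))
# -> [('she', 'NOUN', 'NOUN-FIN'), ('runs', 'VERB', 'VERB-FIN')]
```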

    Adverse Drug Event Detection, Causality Inference, Patient Communication and Translational Research

    Adverse drug events (ADEs) are injuries resulting from a medical intervention related to a drug. ADEs are responsible for nearly 20% of all adverse events that occur in hospitalized patients, and they have been shown to increase the cost of health care and the length of hospital stays. Detecting and preventing ADEs for pharmacovigilance is therefore an important task that can improve the quality of health care and reduce costs in a hospital setting. In this dissertation, we focus on the development of ADEtector, a system that identifies ADEs and medication information from electronic medical records and FDA Adverse Event Reporting System reports. ADEtector employs novel natural language processing approaches for ADE detection and provides a user interface to display ADE information. It uses machine learning techniques to automatically process narrative text and identify the adverse event (AE) and medication entities that appear in it. The system then analyzes the recognized entities to infer the causal relation between AEs and medications by automating the elements of the Naranjo score using knowledge- and rule-based approaches. The Naranjo Adverse Drug Reaction Probability Scale is a validated tool for assessing the causality of a drug-induced adverse event or ADE; it calculates the likelihood that an adverse event is related to a drug based on a list of weighted questions. ADEtector also presents the user with evidence for ADEs by extracting figures that contain ADE-related information from the biomedical literature, generating a brief summary for each extracted figure to help users comprehend it. In addition, ADEtector helps patients understand narrative text by recognizing complex medical jargon and abbreviations and providing definitions and explanations for them from external knowledge resources. The system could help clinicians and researchers discover novel ADEs and drug relations and hypothesize new research questions within the ADE domain.
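
    A sketch of the scoring logic behind the Naranjo scale that the system automates: each of the ten questions contributes a weighted score depending on whether the answer is yes, no or unknown, and the total maps to a causality category. Question texts are abbreviated here; deriving the answers from clinical narrative is the part that ADEtector's knowledge- and rule-based components handle, and it is not shown.

```python
# Naranjo Adverse Drug Reaction Probability Scale: weighted questions summed
# into a total score, then mapped to a causality category.
NARANJO = [  # (question, score if yes, score if no); "unknown" scores 0
    ("Previous conclusive reports on this reaction?",        1,  0),
    ("Did the AE appear after the drug was given?",          2, -1),
    ("Did the AE improve when the drug was stopped?",        1,  0),
    ("Did the AE reappear on re-administration?",            2, -1),
    ("Alternative causes that could explain the AE?",       -1,  2),
    ("Did the reaction reappear with a placebo?",           -1,  1),
    ("Drug detected in blood at toxic concentration?",       1,  0),
    ("Was the reaction dose-dependent?",                     1,  0),
    ("Similar reaction to the same/similar drug before?",    1,  0),
    ("Was the AE confirmed by objective evidence?",          1,  0),
]

def naranjo(answers):
    """answers: list of 'yes' / 'no' / 'unknown', one per question."""
    total = sum(y if a == "yes" else n if a == "no" else 0
                for (_, y, n), a in zip(NARANJO, answers))
    if total >= 9:
        category = "definite"
    elif total >= 5:
        category = "probable"
    elif total >= 1:
        category = "possible"
    else:
        category = "doubtful"
    return total, category

print(naranjo(["yes", "yes", "yes", "unknown", "no",
               "unknown", "unknown", "unknown", "yes", "yes"]))
# -> (8, 'probable')
```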

    Natural Language Processing for Under-resourced Languages: Developing a Welsh Natural Language Toolkit

    Language technology is becoming increasingly important across a variety of application domains that are already commonplace for large, well-resourced languages. However, there is a danger that small, under-resourced languages are being increasingly pushed to the technological margins. Under-resourced languages face significant challenges in delivering the underlying language resources necessary to support such applications. This paper describes the development of a natural language processing toolkit for an under-resourced language, Cymraeg (Welsh). Rather than creating the Welsh Natural Language Toolkit (WNLT) from scratch, the approach involved adapting and enhancing the language processing functionality provided for other languages within an existing framework and making use of external language resources where available. This paper begins by introducing the GATE NLP framework, which was used as the development platform for the WNLT. It then describes each of the core modules of the WNLT in turn, detailing the extensions and adaptations required for Welsh language processing. An evaluation of the WNLT is then reported. Following this, two demonstration applications are presented. The first is a simple text mining application that analyses wedding announcements. The second describes the development of a Twitter NLP application, which extends the core WNLT pipeline. As a relatively small-scale project, the WNLT makes use of existing external language resources where possible, rather than creating new resources. This approach of adaptation and reuse can provide a practical and achievable route to developing language resources for under-resourced languages.
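
    A hypothetical sketch of the adapt-and-reuse idea: a generic pipeline whose stages are parameterized by external, language-specific resources (here a toy Welsh lexicon) rather than rewritten from scratch. All names and data are illustrative assumptions; the WNLT itself is built on the Java-based GATE framework, not this Python stand-in.

```python
# Generic pipeline stages adapted to a language by swapping in its resources.
from typing import Callable, Dict, List

Stage = Callable[[List[dict]], List[dict]]

def tokenize(text: str) -> List[dict]:
    # Whitespace tokenization as a stand-in for a reusable tokenizer module.
    return [{"form": t} for t in text.split()]

def make_lexicon_tagger(lexicon: Dict[str, str]) -> Stage:
    # The tagger stage is generic; the language is "plugged in" via the lexicon.
    def tag(tokens: List[dict]) -> List[dict]:
        for tok in tokens:
            tok["pos"] = lexicon.get(tok["form"].lower(), "UNK")
        return tokens
    return tag

# Swapping in a Welsh resource adapts the generic pipeline to the language.
welsh_lexicon = {"mae": "VERB", "y": "DET", "gath": "NOUN"}
pipeline: List[Stage] = [make_lexicon_tagger(welsh_lexicon)]

tokens = tokenize("Mae y gath")
for stage in pipeline:
    tokens = stage(tokens)
print(tokens)
```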

    Book Review: E. Martín-Monje, I. Elorza and B. García Riaza (eds) (2016). Technology-Enhanced Language Learning for Specialized Domains. Practical applications and mobility. New York: Routledge, pp. 286, ISBN: 978-1-315-65172-9.

    Computers have had a significant presence in language teaching since the 1960s, while the emergence of "educational technology" as a field can be placed in the early 1980s. By then, the term had begun to gain significant popularity, as instructional media started to have a wider impact on educational practices. Since then, terminology has shifted significantly, from the initial Computer-Assisted Language Learning (CALL) to Technology-Enhanced Language Learning (TELL), subtly acknowledging that present-day computers have become less obvious "on the surface" while, at the same time, remaining completely necessary. Computers now deliver other kinds of technology, such as audio, video and the World Wide Web, so that the current focus is on the communication facilitated by the computer rather than on the machine itself.

    The targeted use of the informal register on a social networking site by foreign-language learners evaluated through linguistic analysis and perceived-context appropriateness

    In today’s society, complex issues relating to socio-cultural integration are a key concern for policy makers, with far-reaching implications for domestic and foreign-language policies. In an increasingly globalized world, English continues to be used by many people from diverse linguistic, cultural and ethnic backgrounds who need to communicate daily. The use of the informal register is crucial for developing successful professional and personal relationships, yet it has not received sufficient attention from foreign-language teachers, researchers and policy makers. This exploratory study addressed that research gap through a multi-layered design focussing on the instruction and perception of the informal register. It is the product of a research project spanning almost five years, which employed a one-group pretest-posttest intervention. In the intervention study, referred to as Stage One, 15 advanced foreign-language learners completed study materials comprising listening, reading, writing and ‘speaking’ activities over a period of five weeks. The ‘speaking’ activities were undertaken using asynchronous chat on a social networking site. In addition to a linguistic assessment of the intervention, a practical evaluation was undertaken in Stage Two by speakers of English, who rated Stage One posts on their context appropriateness. The results indicate that students not only used the informal register more frequently, but also utilized a wider variety of register features and used them with greater appropriateness. Students considered instruction in informal-register features to be beneficial. Analysis of the findings showed that perceived context appropriateness is linked to characteristics of the English-speaking raters, such as personal preference and knowledge of Spanish, and not to the linguistic features identified in the posts. The implications of this study for practice, theory, policy and methodology are extensive, ranging from the need to reassess the effectiveness of traditional e-learning models for interaction to the introduction of new policies that provide pedagogically focused teacher training to exploit the affordances of the educational use of social media. The study’s primary, original contribution to knowledge is that it informs the debate about the teaching of informal language by introducing dedicated instruction in the informal register, to adult learners of English, using a social networking site.
    • 

    corecore