Search CORE

582 research outputs found

Tweeting and Being Ironic in the Debate about a Political Reform: the French Annotated Corpus TWitter-MariagePourTous.

Author: Bosco Cristina
Lai Mirko
Patti Viviana
Virone Daniela
Publication venue: elra
Publication date: 01/01/2016
Field of study

Institutional Research Information System University of Turin

A Survey of Corpora for Germanic Low-Resource Languages and Dialects

Author: Alumäe Tanel
Blaschke Verena
Fishel Mark
Plank Barbara
Schütze Hinrich
Publication venue
Publication date: 01/05/2023
Field of study

Despite much progress in recent years, the vast majority of work in natural language processing (NLP) is on standard languages with many speakers. In this work, we instead focus on low-resource languages and in particular non-standardized low-resource languages. Even within branches of major language families, often considered well-researched, little is known about the extent and type of available resources and what the major NLP challenges are for these language varieties. The first step to address this situation is a systematic survey of available corpora (most importantly, annotated corpora, which are particularly valuable for NLP research). Focusing on Germanic low-resource language varieties, we provide such a survey in this paper. Except for geolocation (origin of speaker or document), we find that manually annotated linguistic resources are sparse and, if they exist, mostly cover morphosyntax. Despite this lack of resources, we observe that interest in this area is increasing: there is active development and a growing research community. To facilitate research, we make our overview of over 80 corpora publicly available

Open Access LMU

A Survey of Corpora for Germanic Low-Resource Languages and Dialects

Author: Blaschke Verena
Plank Barbara
Schütze Hinrich
Publication venue
Publication date: 19/04/2023
Field of study

arXiv.org e-Print Archive

Computer-based tracking, analysis, and visualization of linguistically significant nonmanual events in American Sign Language (ASL)

Author: Liu Bo
Liu Jingjing
Metaxas Dimitris
Neidle Carol
Peng Xi
Vogler Christian
Publication venue: EUROPEAN LANGUAGE RESOURCES ASSOC-ELRA
Publication date: 01/01/2014
Field of study

Our linguistically annotated American Sign Language (ASL) corpora have formed a basis for research to automate detection by computer of essential linguistic information conveyed through facial expressions and head movements. We have tracked head position and facial deformations, and used computational learning to discern specific grammatical markings. Our ability to detect, identify, and temporally localize the occurrence of such markings in ASL videos has recently been improved by incorporation of (1) new techniques for deformable model-based 3D tracking of head position and facial expressions, which provide significantly better tracking accuracy and recover quickly from temporary loss of track due to occlusion; and (2) a computational learning approach incorporating 2-level Conditional Random Fields (CRFs), suited to the multi-scale spatio-temporal characteristics of the data, which analyses not only low-level appearance characteristics, but also the patterns that enable identification of significant gestural components, such as periodic head movements and raised or lowered eyebrows. Here we summarize our linguistically motivated computational approach and the results for detection and recognition of nonmanual grammatical markings; demonstrate our data visualizations, and discuss the relevance for linguistic research; and describe work underway to enable such visualizations to be produced over large corpora and shared publicly on the Web

Boston University Institutional Repository (OpenBU)

Vulnerability in Acquisition, Language Impairments in Dutch: Creating a VALID Data Archive

Author: A Mariani
Anne Baker
B Loftsson
Baker A., de Jong, J., .. P
Eric Sanders
Fikkert H
Frank Wijnen
H Declerck
Henk Van Den Heuvel
J Klatter
J Maegaard
J Moreno
Jan De Jong
Jetske Klatter
Odijk
Paul Trilsbeek
Paula Fikkert
R Van Den Heuvel
Roeland Van Hout
T Choukri
Van Hout
Publication venue
Publication date: 06/03/2020
Field of study

Vulnerability in Acquisition, Language Impairments in Dutch: Creating a VALID Data Archive Klatter, J.; van Hout, R.; van den Heuvel, H.; Fikkert, P.; Baker, A.E.; de Jong, J.; Wijnen, F.; Sanders, E.; Trilsbeek, P. General rights It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons). Disclaimer/Complaints regulations If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library: https://uba.uva.nl/en/contact, or a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible. Abstract The VALID Data Archive is an open multimedia data archive (under construction) with data from speakers suffering from language impairments. We report on a pilot project in the CLARIN-NL framework in which five data resources were curated. For all data sets concerned, written informed consent from the participants or their caretakers has been obtained. All materials were anonymized. The audio files were converted into wav (linear PCM) files and the transcriptions into CHAT or ELAN format. Research data that consisted of test, SPSS and Excel files were documented and converted into CSV files. All data sets obtained appropriate CMDI metadata files. A new CMDI metadata profile for this type of data resources was established and care was taken that ISOcat metadata categories were used to optimize interoperability. After curation all data are deposited at the Max Planck Institute for Psycholinguistics Nijmegen where persistent identifiers are linked to all resources. The content of the transcriptions in CHAT and plain text format can be searched with the TROVA search engine

CiteSeerX

Comparing Czech and English AMRs

Author: Bojar Ondřej
Hajič Jan
Urešová Zdeňka
Publication venue
Publication date: 01/01/2014
Field of study

This paper compares Czech and English annotation using Abstract Meaning Represantation formalism

Crossref

Biblio at Institute of Formal and Applied Linguistics

An Analysis of Older Users' Interactions with Spoken Dialogue Systems

Author: Bost Jamie
Moore Johanna
Publication venue
Publication date: 01/05/2014
Field of study

Edinburgh Research Explorer

CamemBERT: a Tasty French Language Model

Author: de la Clergerie Éric Villemonte
Dupont Yoann
Martin Louis
Muller Benjamin
Romary Laurent
Sagot Benoît
Seddah Djamé
Suárez Pedro Javier Ortiz
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2020
Field of study

Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models --in all languages except English-- very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best performing model CamemBERT reaches or improves the state of the art in all four downstream tasks.Comment: ACL 2020 long paper. Web site: https://camembert-model.f

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

О истакнутим асоцијативним истраживањима на Западу у 21. веку и могућностима њихове примене у (домаћој) лингвистици

Author: Ана Петровић Дакић
Publication venue: University of Banja Luka, Faculty of Philology
Publication date: 01/06/2020
Field of study

удући да су домаћа асоцијативна истраживања досада била окрену-та Истоку и руској традицији, циљ је овог прегледа да прикаже нашој јавности мање позната, а изразито занимљива западна истраживања, која би могла по-служити овдашњим лингвистима у будућности. Асоцијације се примењују у више грана друштвених наука, али овде ћемо се усредсредити на области најближе домаћим истраживачима и на њихова достигнућа: удруживање асоцијативног метода с корпусима и његово коришћење како би се олакшала претрага речника; примена овог метода како би се изградили модели менталног лексикона; асоција-ције у примењеној линвгистици, у анализама везаним за усвајање страног језика

Directory of Open Access Journals