40 research outputs found

UD_Japanese-CEJC: Dependency Relation Annotation on Corpus of Everyday Japanese Conversation

Author: ASAHARA Masayuki
MATSUDA Hiroshi
OMURA Mai
WAKASA Aya
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2023

Conference name: the 24th Meeting of the Special Interest Group on Discourse and Dialogue
Conference place: Prague, Czechia
Session period: 2023/09/11-15
Organizer: Association for Computational Linguistics
Affiliations: National Institute for Japanese Language and Linguistics; Tohoku University; Megagon Labs, Tokyo, Recruit Co., Ltd; National Institute for Japanese Language and Linguistics

In this study, we have developed Universal Dependencies (UD) resources for spoken Japanese in the Corpus of Everyday Japanese Conversation (CEJC). The CEJC is a large corpus of spoken language that encompasses various everyday conversations in Japanese, and includes word delimitation and part-of-speech annotation. We have newly annotated Long Word Unit delimitation and Bunsetsu (Japanese phrase)-based dependencies, including Bunsetsu boundaries, for CEJC. The UD of Japanese resources was constructed in accordance with hand-maintained conversion rules from the CEJC with two types of word delimitation, part-of-speech tags and Bunsetsu-based syntactic dependency relations. Furthermore, we examined various issues pertaining to the construction of UD in the CEJC by comparing it with the written Japanese corpus and evaluating UD parsing accuracy.

Academic Repository of the National Institute for Japanese Language and Linguistics / 国立国語研究所学術情報リポジトリ

Genre as Weak Supervision for Cross-lingual Dependency Parsing

Author: Müller-Eberstein Maximilian
Plank Barbara
van der Goot Rob
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/11/2021

The IT University of Copenhagen's Repository

Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations

Author: Bosco Cristina
Cignarella ALESSANDRA TERESA
Sanguinetti Manuela
Publication date: 01/01/2022

Institutional Research Information System University of Turin

Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning in NLP

Author: Plank Barbara
Ramponi Alan
Sharaf Ibrahim
van der Goot Rob
Üstün Ahmet
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2021

Transfer learning, particularly approaches that combine multi-task learning with pre-trained contextualized embeddings and fine-tuning, has advanced the field of Natural Language Processing tremendously in recent years. In this paper we present MaChAmp, a toolkit for easy fine-tuning of contextualized embeddings in multi-task settings. The benefits of MaChAmp are its flexible configuration options, and the support of a variety of natural language processing tasks in a uniform toolkit, from text classification and sequence labeling to dependency parsing, masked language modeling, and text generation.

Comment: https://machamp-nlp.github.io

arXiv.org e-Print Archive

Archivio della ricerca - Fondazione Bruno Kessler

The IT University of Copenhagen's Repository

Corpus-Based Research on Chinese Language and Linguistics

Author: Basciano Bianca
Gatti Franco
Morbiato Anna
Publication venue: 'Edizioni Ca Foscari'
Publication date: 01/01/2020

This volume collects papers presenting corpus-based research on Chinese language and linguistics, from both a synchronic and a diachronic perspective. The contributions cover different fields of linguistics, including syntax and pragmatics, semantics, morphology and the lexicon, sociolinguistics, and corpus building. There is now considerable emphasis on the reliability of linguistic data: the studies presented here are all grounded in the tenet that corpora, intended as collections of naturally occurring texts produced by a variety of speakers/writers, provide a more robust, statistically significant foundation for linguistic analysis. The volume explores not only the potential of using corpora as tools allowing access to authentic language material, but also the challenges involved in corpus interrogation, analysis, and building.

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Detecting syntactic differences automatically using the Minimum Description Length principle

Author: Barbiers L.C.J.
Kroon M.S.
Odijk J.
van der Pas S.L.
Publication date: 12/12/2020

In this paper we present a systematic approach to detect and rank hypotheses about possible syntactic differences for further investigation by leveraging parallel data and using the Minimum Description Length (MDL) principle. We deploy the SQS-algorithm (‘Summarising event seQuenceS’; Tatti and Vreeken 2012), an MDL-based algorithm, to mine ‘typical’ sequences of Part of Speech (POS) tags for each language under investigation. We create a shortlist of potential syntactic differences based on the number of parallel sentences with a mismatch in pattern occurrence. We applied our method to parallel corpora of English, Dutch and Czech sentences from the Europarl v7 corpus (Koehn 2005). The approach proved useful both in retrieving POS building blocks of a language and in pointing to meaningful syntactic differences between languages. Despite a clear sensitivity to tagging accuracy, our results and approach are promising.

Leiden University Scholarly Publications

Development of linguistic linked open data resources for collaborative data-intensive research in the language sciences

Author: Blume Maria
Chiarcos Christian
Lust Barbara C.
Pareja-Lora Antonio
Publication venue: 'MIT Press - Journals'
Publication date: 27/04/2023

Making diverse data in linguistics and the language sciences open, distributed, and accessible: perspectives from language/language acquisition researchers and technical LOD (linked open data) researchers. This volume examines the challenges inherent in making diverse data in linguistics and the language sciences open, distributed, integrated, and accessible, thus fostering wide data sharing and collaboration. It is unique in integrating the perspectives of language researchers and technical LOD (linked open data) researchers. Reporting on both active research needs in the field of language acquisition and technical advances in the development of data interoperability, the book demonstrates the advantages of an international infrastructure for scholarship in the field of language sciences. With contributions by researchers who produce complex data content and scholars involved in both the technology and the conceptual foundations of LLOD (linguistics linked open data), the book focuses on the area of language acquisition because it involves complex and diverse data sets, cross-linguistic analyses, and urgent collaborative research. The contributors discuss a variety of research methods, resources, and infrastructures.

Contributors: Isabelle Barrière, Nan Bernstein Ratner, Steven Bird, Maria Blume, Ted Caldwell, Christian Chiarcos, Cristina Dye, Suzanne Flynn, Claire Foley, Nancy Ide, Carissa Kang, D. Terence Langendoen, Barbara Lust, Brian MacWhinney, Jonathan Masci, Steven Moran, Antonio Pareja-Lora, Jim Reidy, Oya Y. Rieger, Gary F. Simons, Thorsten Trippel, Kara Warburton, Sue Ellen Wright, Claus Zin

OPUS Augsburg

Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

Publication venue: The Association for Computational Linguistics
Publication date: 01/01/2018

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Development of Linguistic Linked Open Data Resources for Collaborative Data-Intensive Research in the Language Sciences

This book is the product of an international workshop dedicated to addressing data accessibility in the linguistics field. It is therefore vital to the book’s mission that its content be open access. Linguistics as a field remains behind many others as far as data management and accessibility strategies. The problem is particularly acute in the subfield of language acquisition, where international linguistic sound files are needed for reference. Linguists' concerns are very much tied to the amount of information accumulated by individual researchers over the years that remains fragmented and inaccessible to the larger community. These concerns are shared by other fields, but linguistics to date has seen few efforts at addressing them. This collection, undertaken by a range of leading experts in the field, represents a big step forward. Its international scope and interdisciplinary combination of scholars/librarians/data consultants will provide an important contribution to the field.

OAPEN Library

Challenges for the development of linked open data for research in multilingualism

Author: Barriere I
Blume M
Dye CD
Kang C
Publication venue: 'MIT Press - Journals'

Newcastle University E-Prints