Machine Assisted Analysis of Vowel Length Contrasts in Wolof
Growing digital archives and improving algorithms for automatic analysis of
text and speech create new research opportunities for fundamental research in
phonetics. Such empirical approaches allow statistical evaluation of a much
larger set of hypotheses about phonetic variation and its conditioning factors
(among them geographical / dialectal variants). This paper illustrates this
vision and proposes to challenge automatic methods with the analysis of a
phenomenon that is not easily observable: vowel length contrast. We focus on Wolof, an
under-resourced language from Sub-Saharan Africa. In particular, we propose
multiple features for a fine-grained evaluation of the degree of length contrast
under different factors, such as read vs. semi-spontaneous speech and standard vs.
dialectal Wolof. Our measurements, made fully automatically on more than 20k vowel
tokens, show that the proposed features can highlight different degrees of
contrast for each vowel considered. We notably show that contrast is weaker in
semi-spontaneous speech and in a non-standard semi-spontaneous dialect.
Comment: Accepted to Interspeech 201
UD_Japanese-CEJC: Dependency Relation Annotation on Corpus of Everyday Japanese Conversation
Conference name: the 24th Meeting of the Special Interest Group on Discourse and Dialogue, Conference place: Prague, Czechia, Session period: 2023/09/11-15, Organizer: Association for Computational Linguistics
National Institute for Japanese Language and Linguistics; Tohoku University; Megagon Labs, Tokyo, Recruit Co., Ltd; National Institute for Japanese Language and Linguistics
In this study, we have developed Universal Dependencies (UD) resources for spoken Japanese in the Corpus of Everyday Japanese Conversation (CEJC). The CEJC is a large corpus of spoken language that encompasses various everyday conversations in Japanese and includes word delimitation and part-of-speech annotation. We have newly annotated Long Word Unit delimitation and Bunsetsu (Japanese phrase)-based dependencies, including Bunsetsu boundaries, for the CEJC. The Japanese UD resources were constructed from the CEJC in accordance with hand-maintained conversion rules, using two types of word delimitation, part-of-speech tags, and Bunsetsu-based syntactic dependency relations. Furthermore, we examined various issues pertaining to the construction of UD in the CEJC by comparing it with a written Japanese corpus and evaluating UD parsing accuracy.
conference pape
ON MONITORING LANGUAGE CHANGE WITH THE SUPPORT OF CORPUS PROCESSING
One of the fundamental characteristics of language is that it can change over time. One
method of monitoring this change is to observe its corpora, that is, structured language
documentation. Recent developments in technology, especially in the field of Natural
Language Processing, allow robust linguistic processing, which supports the description of
diverse historical changes in the corpora. The intervention of human linguists is inevitable, as
they determine the gold standard, but computer assistance provides considerable support by
incorporating computational approaches in exploring the corpora, especially historical
corpora. This paper proposes a model for corpus development in which corpora are annotated
to support further computational operations such as lexicogrammatical pattern matching,
automatic retrieval, and extraction. The corpus processing operations are performed by
local-grammar-based corpus processing software on a contemporary Indonesian corpus. This
paper concludes that data collection and data processing in a corpus are of equally crucial
importance for monitoring language change, and neither can be set aside.