Proceedings
Proceedings of the NODALIDA 2009 workshop
Nordic Perspectives on the CLARIN Infrastructure of Language Resources.
Editors: Rickard Domeij, Kimmo Koskenniemi, Steven Krauwer, Bente Maegaard,
Eiríkur Rögnvaldsson and Koenraad de Smedt.
NEALT Proceedings Series, Vol. 5 (2009), v+45 pp.
© 2009 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt .
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/9207
Icelandic Language Resources and Technology: Status and Prospects
Proceedings of the NODALIDA 2009 workshop
Nordic Perspectives on the CLARIN Infrastructure of Language Resources.
Editors: Rickard Domeij, Kimmo Koskenniemi, Steven Krauwer, Bente Maegaard,
Eiríkur Rögnvaldsson and Koenraad de Smedt.
NEALT Proceedings Series, Vol. 5 (2009), 27-32.
Segmental Durations of Speech
This dissertation considers the segmental durations of speech from the viewpoint of speech technology, especially speech synthesis. The premise is that better models of segmental durations lead to higher naturalness and better intelligibility, which are key factors for the usability and generality of synthesized speech. Even though the studies are based on Finnish corpus data, the approaches apply to other languages as well. This is possible because most of the studies included in this dissertation concern universal effects that take place at utterance boundaries. The methods developed and used here are also suitable for studies of other languages.
This study is based on two corpora of news reading speech and sentences read aloud. One corpus is read aloud by a 39-year-old male, while the other consists of several speakers in various situations. The purpose of using two corpora is twofold: it allows a comparison of the corpora and gives a broader view of the matters of interest.
The dissertation begins with an overview of the phonemes and the quantity system of the Finnish language. In particular, we cover the intrinsic durations of phonemes and phoneme categories, as well as the difference in duration between short and long phonemes. The phoneme categories are introduced to help manage the variability of speech segments.
We then cover the boundary-adjacent effects on segmental durations. In utterance-initial position we find that there seems to be initial shortening in Finnish, but the result depends on the level of detail and on the individual phoneme. On the phoneme level, the shortening or lengthening only affects the very first phonemes of an utterance. On the word level, however, the effect on average seems to shorten the whole first word.
We establish the effect of final lengthening in Finnish. Whether the effect exists in Finnish has long been an open question, and Finnish has been the last missing piece for final lengthening to count as a universal phenomenon. Final lengthening is studied from various angles, and it is also shown that it is not merely an effect of prominence or an artifact of a speech corpus with high inter- and intra-speaker variation. The effect seems to extend from the final to the penultimate word. On the phoneme level it reaches a much wider area than the initial effect.
We also present a normalization method suitable for corpus studies on segmental durations. The method uses an utterance-level normalization approach to capture the pattern of segmental durations within each utterance. This reduces the impact of problematic variation within the corpora. The normalization is used in a study on final lengthening to show that the results on the effect are not caused by variation in the material.
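The abstract does not give the normalization formula, so the following is only a minimal sketch of what an utterance-level normalization could look like. It assumes a z-score of each segment duration against the mean and standard deviation of its own utterance; the function name `normalize_utterance` and the sample durations are hypothetical and are not taken from the dissertation.

```python
from statistics import mean, pstdev


def normalize_utterance(durations_ms):
    """Z-score each segment duration against its own utterance.

    Assumption: the dissertation's actual method may differ; this only
    illustrates the idea of normalizing within an utterance so that
    overall tempo differences between utterances drop out.
    """
    mu = mean(durations_ms)
    sigma = pstdev(durations_ms)
    if sigma == 0:
        # All segments equally long: no pattern to capture.
        return [0.0 for _ in durations_ms]
    return [(d - mu) / sigma for d in durations_ms]


# Hypothetical data: the same durational pattern at two speaking rates.
slow = [100.0, 120.0, 200.0]
fast = [50.0, 60.0, 100.0]

slow_norm = normalize_utterance(slow)
fast_norm = normalize_utterance(fast)
```

Under this assumption the slow and the fast utterance yield the same normalized pattern, with the final segment standing out as relatively long in both, which is the kind of within-utterance pattern the paragraph above describes.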
The dissertation presents an implementation of speech synthesis on a mobile platform and assesses its performance. We find that the rule-based method of speech synthesis can run in real time as a software solution, but the signal generation process slows the system down so that it no longer runs in real time. Future aspects of speech synthesis on limited platforms are discussed.
The dissertation also considers ethical issues in the development of speech technology. The main focus is on the development of speech synthesis with high naturalness, but the problems and solutions are applicable to other speech technology approaches as well.
Syntactic Nuclei in Dependency Parsing -- A Multilingual Exploration
Standard models for syntactic dependency parsing take words to be the
elementary units that enter into dependency relations. In this paper, we
investigate whether there are any benefits from enriching these models with the
more abstract notion of nucleus proposed by Tesnière. We do this by showing
how the concept of nucleus can be defined in the framework of Universal
Dependencies and how we can use composition functions to make a
transition-based dependency parser aware of this concept. Experiments on 12
languages show that nucleus composition gives small but significant
improvements in parsing accuracy. Further analysis reveals that the improvement
mainly concerns a small number of dependency relations, including nominal
modifiers, relations of coordination, main predicates, and direct objects.
Comment: Accepted at EACL-202
New technologies for Old Germanic: resources and research on parallel bibles in Older Continental Western Germanic
We provide an overview of on-going efforts to facilitate the study of older Germanic languages currently pursued at the Goethe-University Frankfurt, Germany.
We describe the resources created, such as a parallel corpus of Germanic Bibles, a morphosyntactically annotated corpus of Old High German (OHG) and Old Saxon, a lexicon of OHG in XML, and a multilingual etymological database. We discuss NLP algorithms operating on this data and their relevance for research in the Humanities.
RDF and Linked Data represent new and promising aspects of our research, currently applied to establish cross-references between etymological dictionaries, to infer new information from their symmetric closure, and to formalize linguistic annotations in a corpus and grammatical categories in a lexicon in an interoperable way.
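As a rough illustration of inferring information from a symmetric closure, the sketch below treats cross-references as directed (source, target) pairs and adds the reverse of every link. The entry identifiers and function names are hypothetical; the abstract does not describe the project's actual RDF vocabulary or tooling, and plain Python sets stand in for an RDF store here.

```python
def symmetric_closure(links):
    """Return the links plus the reverse of every link."""
    closed = set(links)
    closed.update((target, source) for source, target in links)
    return closed


def inferred_links(links):
    """Return only the links that the closure adds to the stated ones."""
    return symmetric_closure(links) - set(links)


# Hypothetical cross-references between two etymological dictionaries.
stated = {
    ("dictA:entry1", "dictB:entry7"),
    ("dictA:entry2", "dictB:entry9"),
    ("dictB:entry9", "dictA:entry2"),  # already stated in both directions
}

new_links = inferred_links(stated)
```

Only the back-reference from `dictB:entry7` to `dictA:entry1` is new in this example, since the second pair was already stated in both directions.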
Combined Use of Knowledge-Based and Corpus-Based Methods in Machine Translation (Zināšanās bāzētu un korpusā bāzētu metožu kombinētā izmantošanas mašīntulkošanā)
ABSTRACT.
Machine Translation (MT) systems are built using different methods (knowledge-based and corpus-based). Knowledge-based MT translates text using human-written rules. Corpus-based MT uses models that are built automatically from translation examples. Both methods have advantages and disadvantages. This work seeks a combined method that improves MT quality by combining both approaches.
The applicability of the methods to Latvian (a small, morphologically rich, under-resourced language) is investigated. Existing MT methods are analyzed and several combined methods are proposed. The methods have been implemented and evaluated using both automatic and human evaluation. Factored statistical MT with a rule-based morphological analyzer is proposed as the most promising. The practical application of the method is also described.
Keywords: Machine Translation (MT), Rule-based MT, Statistical MT, Combined approach