
    Digital Presentation of Bulgarian Lexical Heritage. Towards an Electronic Historical Dictionary

    The article presents the results of the project “ICT Tools for Historical Linguistic Studies”, funded by the European Social Fund, OP Human Resources. The main goal of the project was to develop electronic tools for creating a Historical Dictionary of Diachronic Type, presenting the history of Bulgarian words from their first written occurrence until today. By the end of the project the team (the Faculty of Slavic Studies at Sofia University, the Institute for Bulgarian Language at BAS, and PAM Publishing Company, Sofia) had at its disposal a set of Old Bulgarian Unicode fonts intended for publishing medieval texts, as well as a converter that transforms non-Unicode documents into the new standard. The converter allowed the participants to build, in a relatively short time, a diachronic text corpus of Bulgarian medieval texts that already contains more than 90 texts dated from the 10th to the 18th century. The corpus software supports editing of the texts and has turned out to be an excellent tool for preparing electronic editions of Old Bulgarian (OCS) manuscripts. In addition to the corpus, an electronic dictionary of Old Bulgarian is available, containing the digitized version of Старобългарски речник produced by IBL. Both tools are accessible on the project website at histdict.uni-sofia.bg. The standard of the Historical Dictionary took shape over the course of the project, and the corresponding software for elaborating new dictionary entries was designed and tested. The article also includes screenshots that demonstrate the functionalities of both the corpus and the dictionary software.
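
    As an illustration of the kind of conversion involved, the sketch below shows a minimal legacy-encoding-to-Unicode converter in Python; the mapping table is purely hypothetical, since the actual legacy font encodings handled by the project's converter are not described in the abstract.

```python
# Minimal sketch of a legacy-font-to-Unicode converter; the mapping below is
# hypothetical, since the real legacy encodings are not specified in the abstract.
LEGACY_TO_UNICODE = {
    "\u00e0": "\u0430",  # hypothetical legacy code point -> Cyrillic small letter a
    "\u00fa": "\u044a",  # hypothetical legacy code point -> Cyrillic hard sign
    "\u0123": "\u0463",  # hypothetical legacy code point -> Cyrillic small letter yat
}

def convert_to_unicode(text: str) -> str:
    """Map each legacy code point to its Unicode equivalent; leave unknown characters unchanged."""
    return "".join(LEGACY_TO_UNICODE.get(ch, ch) for ch in text)

if __name__ == "__main__":
    print(convert_to_unicode("\u00e0\u00fa"))  # -> "аъ" under the hypothetical mapping
```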

    A Multivariate Study of T/V Forms in European Languages Based on a Parallel Corpus of Film Subtitles

    The present study investigates cross-linguistic differences in the use of so-called T/V forms (e.g. French tu and vous, German du and Sie, Russian ty and vy) in ten European languages from different language families and genera. The constraints on their use are an elusive object of investigation because they depend on a large number of subtle contextual features and social distinctions, which have to be matched cross-linguistically. Film subtitles in different languages offer a convenient solution because the situations of communication between film characters can serve as comparative concepts. I selected more than two hundred contexts that contain the pronouns you and yourself in the original English versions and coded them for fifteen contextual variables describing the Speaker and the Hearer, their relationship, and various situational properties. The creators of subtitles in the other languages have to choose between T and V when translating from English, where the T/V distinction is not expressed grammatically. On the basis of these situations translated into ten languages, I perform multivariate analyses using the method of conditional inference trees in order to identify the most relevant contextual variables that constrain the T/V variation in each language.
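
    The analysis described above can be sketched in code. Conditional inference trees are typically fitted with R's partykit::ctree; the Python sketch below uses a plain scikit-learn decision tree as a stand-in for the general idea of predicting a translator's T/V choice from coded contextual variables. All column names and values are invented for illustration.

```python
# Illustrative stand-in only: conditional inference trees are usually fitted with R's
# partykit::ctree; a plain decision tree is used here to convey the idea of predicting
# a T/V choice from coded contextual variables. The data are invented.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical coding of film contexts: one row per situation, one column per variable.
data = pd.DataFrame({
    "speaker_age":   ["adult", "adult", "child", "adult"],
    "hearer_status": ["superior", "equal", "equal", "stranger"],
    "setting":       ["formal", "informal", "informal", "formal"],
    "tv_form":       ["V", "T", "T", "V"],  # translator's choice in one target language
})

X = OrdinalEncoder().fit_transform(data[["speaker_age", "hearer_status", "setting"]])
y = data["tv_form"]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(tree.predict(X))  # predicted T/V choices for the same coded contexts
```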

    VP-fronting in Czech and Polish: a case study in corpus-oriented grammar research

    Fronting of a non-finite VP across a finite main verb, akin to German "VP-topicalization", can also be found in Czech and Polish. The paper discusses evidence from large corpora for this process and some of its properties, both syntactic and information-structural. Based on this case study, criteria for more user-friendly searching and retrieval of corpus data in syntactic research are developed.

    A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-of-Speech Tagging

    In this paper, we propose a new approach to constructing a system of transformation rules for the Part-of-Speech (POS) tagging task. Our approach is based on an incremental knowledge acquisition method where rules are stored in an exception structure and new rules are only added to correct the errors of existing rules, thus allowing systematic control of the interaction between the rules. Experimental results on 13 languages show that our approach is fast in terms of training time and tagging speed. Furthermore, our approach obtains very competitive accuracy in comparison to state-of-the-art POS and morphological taggers.
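
    The exception structure described above can be sketched as a small data structure: a rule fires when its condition holds, and exception rules added later to correct its errors are consulted before its conclusion is returned. The rule format and conditions below are simplified assumptions, not the paper's actual rule templates.

```python
# Simplified sketch of a ripple-down-rule node: a rule fires when its condition holds,
# and exception rules added later are allowed to override its conclusion.
class RDRNode:
    def __init__(self, condition, tag):
        self.condition = condition  # function(context) -> bool
        self.tag = tag              # tag concluded when the rule fires
        self.exceptions = []        # child rules that correct this rule's errors

    def apply(self, context, default_tag):
        if not self.condition(context):
            return default_tag
        # The rule fires; any exception that also fires overrides its conclusion.
        for exc in self.exceptions:
            corrected = exc.apply(context, self.tag)
            if corrected != self.tag:
                return corrected
        return self.tag

# Hypothetical rules: retag a word as NN after a determiner, except words ending in -ing.
root = RDRNode(lambda ctx: ctx["prev_tag"] == "DT", "NN")
root.exceptions.append(RDRNode(lambda ctx: ctx["word"].endswith("ing"), "VBG"))

print(root.apply({"prev_tag": "DT", "word": "dog"}, "VB"))      # -> NN
print(root.apply({"prev_tag": "DT", "word": "running"}, "VB"))  # -> VBG
```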

    The strategic impact of META-NET on the regional, national and international level

    This article provides an overview of the dissemination work carried out in META-NET from 2010 until 2015; we describe its impact on the regional, national and international level, mainly with regard to politics and the funding situation for LT topics. The article documents the initiative's work throughout Europe in order to boost progress and innovation in our field.

    Mimicking Word Embeddings using Subword RNNs

    Word embeddings improve generalization over lexical features by placing each word in a lower-dimensional space, using distributional information obtained from unlabeled data. However, the effectiveness of word embeddings for downstream NLP tasks is limited by out-of-vocabulary (OOV) words, for which embeddings do not exist. In this paper, we present MIMICK, an approach to generating OOV word embeddings compositionally, by learning a function from spellings to distributional embeddings. Unlike prior work, MIMICK does not require re-training on the original word embedding corpus; instead, learning is performed at the type level. Intrinsic and extrinsic evaluations demonstrate the power of this simple approach. On 23 languages, MIMICK improves performance over a word-based baseline for tagging part-of-speech and morphosyntactic attributes. It is competitive with (and complementary to) a supervised character-based model in low-resource settings.
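
    The spelling-to-embedding idea can be sketched roughly as follows: a character-level BiLSTM reads a word's spelling and is trained to regress onto the word's pre-trained embedding, so that OOV words can later be embedded from their spelling alone. The dimensions, loss, and training step below are illustrative assumptions rather than the paper's exact configuration.

```python
# Rough PyTorch sketch of the spelling-to-embedding idea; dimensions and the training
# step are illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class CharToEmbedding(nn.Module):
    def __init__(self, n_chars, char_dim=20, hidden=50, emb_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, emb_dim)

    def forward(self, char_ids):  # char_ids: (batch, max_word_length)
        _, (h_n, _) = self.lstm(self.char_emb(char_ids))
        summary = torch.cat([h_n[0], h_n[1]], dim=-1)  # final forward and backward states
        return self.proj(summary)

# One training step (sketch): regress onto the known embeddings of in-vocabulary words.
model = CharToEmbedding(n_chars=128)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
chars = torch.randint(0, 128, (32, 12))  # dummy batch of character-encoded spellings
target = torch.randn(32, 100)            # their pre-trained word embeddings
opt.zero_grad()
loss = nn.functional.mse_loss(model(chars), target)
loss.backward()
opt.step()
```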

    External Lexical Information for Multilingual Part-of-Speech Tagging

    Morphosyntactic lexicons and word vector representations have both proven useful for improving the accuracy of statistical part-of-speech taggers. Here we compare the performance of four systems on datasets covering 16 languages, two of them feature-based (MEMMs and CRFs) and two neural (bi-LSTMs). We show that, on average, all four approaches perform similarly and reach state-of-the-art results. Yet our feature-based models obtain better results on lexically richer datasets (e.g. for morphologically rich languages), whereas the neural models score higher on datasets with less lexical variability (e.g. for English). These conclusions hold in particular for the MEMM models relying on our system MElt, which benefited from newly designed features. This shows that, under certain conditions, feature-based approaches enriched with morphosyntactic lexicons remain competitive with neural methods.
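
    One common way to inject external lexical information into a feature-based tagger is to add, for each token, features recording which tags an external morphosyntactic lexicon licenses for that word form. The sketch below illustrates this general idea; the lexicon contents and feature names are invented and do not reproduce MElt's actual feature set.

```python
# Sketch of lexicon-enriched token features for a feature-based tagger; the lexicon
# and feature names are invented and do not reproduce MElt's actual feature set.
LEXICON = {            # hypothetical external morphosyntactic lexicon
    "the": {"DET"},
    "walks": {"NOUN", "VERB"},
}

def token_features(sentence, i):
    word = sentence[i]
    feats = {
        "word.lower": word.lower(),
        "suffix3": word[-3:],
        "prev.lower": sentence[i - 1].lower() if i > 0 else "<BOS>",
    }
    # External lexicon features: one indicator per tag the lexicon licenses for the word.
    for tag in LEXICON.get(word.lower(), {"UNKNOWN"}):
        feats["lex=" + tag] = True
    return feats

sentence = ["The", "cat", "walks"]
print([token_features(sentence, i) for i in range(len(sentence))])
```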

    Multilingual Twitter Sentiment Classification: The Role of Human Annotators

    What are the limits of automated Twitter sentiment classification? We analyze a large set of manually labeled tweets in different languages, use them as training data, and construct automated classification models. It turns out that the quality of classification models depends much more on the quality and size of the training data than on the type of model trained. Experimental results indicate that there is no statistically significant difference between the performance of the top classification models. We quantify the quality of training data by applying various annotator agreement measures and identify the weakest points of different datasets. We show that the model performance approaches the inter-annotator agreement when the training set is sufficiently large. However, it is crucial to regularly monitor the self- and inter-annotator agreement, since doing so improves the training datasets and consequently the model performance. Finally, we find strong evidence that humans perceive the sentiment classes (negative, neutral, and positive) as ordered.
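
    As a small illustration of quantifying annotator agreement on ordered sentiment labels, the sketch below computes plain and linearly weighted Cohen's kappa; the paper's exact agreement measures may differ, and the label sequences are dummies.

```python
# Sketch of agreement measures on ordered sentiment labels; the label sequences are
# dummies and the paper's exact agreement measures may differ.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["neg", "neu", "pos", "pos", "neu", "neg"]
annotator_b = ["neg", "pos", "pos", "neu", "neu", "neg"]

# Plain kappa treats the three classes as unordered categories.
print(cohen_kappa_score(annotator_a, annotator_b))

# Linearly weighted kappa respects the negative < neutral < positive ordering,
# penalising neg/pos disagreements more than neg/neu ones.
order = {"neg": 0, "neu": 1, "pos": 2}
a = [order[label] for label in annotator_a]
b = [order[label] for label in annotator_b]
print(cohen_kappa_score(a, b, weights="linear"))
```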

    Bilingual Corpus - Digital Repository for Preservation of Language Heritage

    The article briefly reviews a bilingual Slovak-Bulgarian/Bulgarian-Slovak parallel and aligned corpus. The corpus has been collected and developed as a result of collaboration within a joint research project between the Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, and the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. Multilingual corpora are large repositories of language data with an important role in preserving and supporting the world's cultural heritage, because natural language is an essential part of human cultural values and collective memory, and a bridge between cultures. This bilingual corpus will be widely applicable to contrastive studies of the two Slavic languages and will also be a useful resource for language engineering research and development, especially in machine translation.
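
    As a minimal illustration of how such an aligned corpus can be consumed programmatically, the sketch below assumes the common convention of two line-aligned plain-text files (one sentence per line); the actual storage format of the Slovak-Bulgarian corpus is not described in the abstract, and the file names are invented.

```python
# Minimal sketch of reading a sentence-aligned parallel corpus, assuming two
# line-aligned plain-text files; the corpus's real storage format is not stated here.
from itertools import islice

def read_aligned(src_path: str, tgt_path: str):
    """Yield (source_sentence, target_sentence) pairs from two line-aligned files."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for s, t in zip(src, tgt):
            yield s.strip(), t.strip()

# Hypothetical usage with invented file names:
# for pair in islice(read_aligned("corpus.sk", "corpus.bg"), 3):
#     print(pair)
```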