Digital Presentation of Bulgarian Lexical Heritage. Towards an Electronic Historical Dictionary
The article presents the results of the project “ICT Tools for Historical Linguistic Studies”,
funded by the European Social Fund, OP Human Resources. The main project goal was to elaborate
electronic tools for creating a Historical Dictionary of Diachronic Type that should present the history
of the Bulgarian words from their first written occurrence until today. By the end of the project
the team (Faculty of Slavic Studies at Sofia University, Institute for Bulgarian Language, BAS and PAM Publishing Company, Sofia) had at their disposal a set of Old Bulgarian Unicode fonts meant for
publishing medieval texts, and a converter that converts non-Unicode documents into the new standard.
The converter allowed the participants to create, in a relatively short time, a diachronic text corpus
of Bulgarian medieval texts, which already contains more than 90 texts dated from the 10th to the 18th century.
The corpus software enables editing the texts and turned out to be an excellent tool for preparing
electronic editions of the Old Bulgarian (OCS) manuscripts. In addition to the corpus an electronic
dictionary of Old Bulgarian is available, which contains the digitized version of Старобългарски речник,
produced by IBL. Both tools are accessible on the project website at the address histdict.uni-sofia.bg.
The Standard of the Historical Dictionary took shape during the project course and respective software
for elaborating new dictionary entries was designed and tested. The article also displays screenshots
that demonstrate the functionalities of both the corpus and dictionary software.
A Multivariate Study of T/V Forms in European Languages Based on a Parallel Corpus of Film Subtitles
The present study investigates the cross-linguistic differences in the use of so-called T/V forms (e.g. French tu and vous, German du and Sie, Russian ty and vy) in ten European languages from different language families and genera. The constraints on their use represent an elusive object of investigation because they depend on a large number of subtle contextual features and social distinctions, which should be cross-linguistically matched. Film subtitles in different languages offer a convenient solution because the situations of communication between film characters can serve as comparative concepts. I selected more than two hundred contexts that contain the pronouns you and yourself in the original English versions and coded them for fifteen contextual variables that describe the Speaker and the Hearer, their relationships and different situational properties. The creators of subtitles in the other languages have to choose between T and V when translating from English, where the T/V distinction is not expressed grammatically. On the basis of these situations translated into ten languages, I perform multivariate analyses using the method of conditional inference trees in order to identify the most relevant contextual variables that constrain the T/V variation in each language.
VP-fronting in Czech and Polish : a case study in corpus-oriented grammar research
Fronting of a non-finite VP across a finite main verb - akin to German "VP-topicalization" - can also be found in Czech and Polish. The paper discusses evidence from large corpora for this process and some of its properties, both syntactic and information-structural. Based on this case, criteria for more user-friendly searching and retrieval of corpus data in syntactic research are being developed.
A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-of-Speech Tagging
In this paper, we propose a new approach to construct a system of
transformation rules for the Part-of-Speech (POS) tagging task. Our approach is
based on an incremental knowledge acquisition method where rules are stored in
an exception structure and new rules are only added to correct the errors of
existing rules; thus allowing systematic control of the interaction between the
rules. Experimental results on 13 languages show that our approach is fast in
terms of training time and tagging speed. Furthermore, our approach obtains
very competitive accuracy in comparison to state-of-the-art POS and
morphological taggers.
Comment: Version 1: 13 pages. Version 2: Submitted to AI Communications - the
European Journal on Artificial Intelligence. Version 3: Resubmitted after
major revisions. Version 4: Resubmitted after minor revisions. Version 5: to
appear in AI Communications (accepted for publication on 3/12/2015).
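The exception structure described in this abstract can be illustrated with a minimal sketch of a single-classification ripple-down rule tree. This is not the authors' implementation: the `Rule` class, the feature names (`word`, `prev_tag`, `tag`) and the toy rules are assumptions made for illustration only. Each new rule is attached as an exception at the point where the existing rules last fired, so it can only change the cases that reach that point.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    """One node of a hypothetical ripple-down rule tree."""
    condition: Callable[[dict], bool]
    conclusion: Optional[str]            # None means "keep the initial tag"
    except_child: Optional["Rule"] = None  # tried when this rule fires
    else_child: Optional["Rule"] = None    # tried when this rule does not fire


def last_fired(root: Rule, ctx: dict) -> Optional[Rule]:
    """Walk the tree and return the last rule whose condition held."""
    fired, node = None, root
    while node is not None:
        if node.condition(ctx):
            fired, node = node, node.except_child
        else:
            node = node.else_child
    return fired


def classify(root: Rule, ctx: dict) -> str:
    """Return the tag concluded by the last fired rule, or the initial tag."""
    fired = last_fired(root, ctx)
    if fired is None or fired.conclusion is None:
        return ctx["tag"]
    return fired.conclusion


def add_exception(root: Rule, ctx: dict, condition, conclusion: str) -> None:
    """Attach a correcting rule under the rule that last fired for ctx."""
    parent = last_fired(root, ctx)
    new_rule = Rule(condition, conclusion)
    if parent.except_child is None:
        parent.except_child = new_rule
        return
    node = parent.except_child
    while node.else_child is not None:   # append to the end of the else chain
        node = node.else_child
    node.else_child = new_rule


# The default rule always fires and keeps the initial tag.
root = Rule(lambda c: True, None)

# Toy cases (invented for illustration): "tag" is the initial tagger's guess.
the_run = {"word": "run", "prev_tag": "DT", "tag": "VB"}
to_run = {"word": "run", "prev_tag": "TO", "tag": "NN"}
the_running = {"word": "running", "prev_tag": "DT", "tag": "VB"}
we_walk = {"word": "walk", "prev_tag": "PRP", "tag": "VB"}

add_exception(root, the_run,
              lambda c: c["tag"] == "VB" and c["prev_tag"] == "DT", "NN")
add_exception(root, to_run,
              lambda c: c["tag"] == "NN" and c["prev_tag"] == "TO", "VB")
# An exception to the first rule: it is only consulted when that rule fires.
add_exception(root, the_running,
              lambda c: c["word"].endswith("ing"), "VBG")

for case in (the_run, to_run, the_running, we_walk):
    print(case["word"], case["prev_tag"], "->", classify(root, case))
```

Because the third rule hangs under the first one, it cannot affect contexts such as `to_run` or `we_walk`, which is the kind of controlled rule interaction the abstract refers to.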
The strategic impact of META-NET on the regional, national and international level
This article provides an overview of the dissemination work carried out in META-NET from 2010 until 2015; we describe its impact on the regional, national and international level, mainly with regard to politics and the funding situation for LT topics. The article documents the initiative's work throughout Europe in order to boost progress and innovation in our field.
Peer reviewed. Postprint (author's final draft).
Mimicking Word Embeddings using Subword RNNs
Word embeddings improve generalization over lexical features by placing each
word in a lower-dimensional space, using distributional information obtained
from unlabeled data. However, the effectiveness of word embeddings for
downstream NLP tasks is limited by out-of-vocabulary (OOV) words, for which
embeddings do not exist. In this paper, we present MIMICK, an approach to
generating OOV word embeddings compositionally, by learning a function from
spellings to distributional embeddings. Unlike prior work, MIMICK does not
require re-training on the original word embedding corpus; instead, learning is
performed at the type level. Intrinsic and extrinsic evaluations demonstrate
the power of this simple approach. On 23 languages, MIMICK improves performance
over a word-based baseline for tagging part-of-speech and morphosyntactic
attributes. It is competitive with (and complementary to) a supervised
character-based model in low-resource settings.
Comment: EMNLP 201
External Lexical Information for Multilingual Part-of-Speech Tagging
Morphosyntactic lexicons and word vector representations have both proven
useful for improving the accuracy of statistical part-of-speech taggers. Here
we compare the performances of four systems on datasets covering 16 languages,
two of these systems being feature-based (MEMMs and CRFs) and two of them being
neural-based (bi-LSTMs). We show that, on average, all four approaches perform
similarly and reach state-of-the-art results. Yet better performances are
obtained with our feature-based models on lexically richer datasets (e.g. for
morphologically rich languages), whereas neural-based results are higher on
datasets with less lexical variability (e.g. for English). These conclusions
hold in particular for the MEMM models relying on our system MElt, which
benefited from newly designed features. This shows that, under certain
conditions, feature-based approaches enriched with morphosyntactic lexicons are
competitive with respect to neural methods
Multilingual Twitter Sentiment Classification: The Role of Human Annotators
What are the limits of automated Twitter sentiment classification? We analyze
a large set of manually labeled tweets in different languages, use them as
training data, and construct automated classification models. It turns out that
the quality of classification models depends much more on the quality and size
of training data than on the type of the model trained. Experimental results
indicate that there is no statistically significant difference between the
performance of the top classification models. We quantify the quality of
training data by applying various annotator agreement measures, and identify
the weakest points of different datasets. We show that the model performance
approaches the inter-annotator agreement when the size of the training set is
sufficiently large. However, it is crucial to regularly monitor the self- and
inter-annotator agreements since this improves the training datasets and
consequently the model performance. Finally, we show that there is strong
evidence that humans perceive the sentiment classes (negative, neutral, and
positive) as ordered
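As an illustration of what an annotator agreement measure computes, the sketch below implements Cohen's kappa for two annotators. The abstract does not say which measures were applied, so the choice of kappa, the function name `cohen_kappa` and the toy labels are assumptions, not the authors' setup.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (nominal labels)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labelled independently,
    # each following their own label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy annotations of six tweets (invented for illustration).
annotator_1 = ["neg", "neg", "neu", "pos", "pos", "pos"]
annotator_2 = ["neg", "neu", "neu", "pos", "pos", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa treats the three sentiment classes as unordered. Since the abstract reports that the classes are perceived as ordered, a measure that penalizes negative/positive confusions more than negative/neutral ones may be more appropriate in practice.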
Bilingual Corpus - Digital Repository for Preservation of Language Heritage
The article briefly reviews a bilingual Slovak-Bulgarian/Bulgarian-Slovak parallel and aligned corpus. The corpus was collected and developed as a result of collaboration within the framework of a joint research project between the Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, and the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. Multilingual corpora are large repositories of language data with an important role in preserving and supporting the world's cultural heritage, because natural language is an outstanding part of human cultural values and collective memory, and a bridge between cultures. This bilingual corpus will be widely applicable to contrastive studies of the two Slavic languages and will also be a useful resource for language engineering research and development, especially in machine translation.