BEA – A multifunctional Hungarian spoken language database
In diverse areas of linguistics, the demand for studying actual language use is on
the increase. The aim of developing a phonetically-based multi-purpose database of
Hungarian spontaneous speech, dubbed BEA, is to accumulate a large amount of
spontaneous speech of various types together with sentence repetition and reading.
Presently, the recorded material of BEA amounts to 260 hours produced by 280
present-day Budapest speakers (ages between 20 and 90, 168 females and 112
males), also providing annotated materials for various types of research and
practical applications.
Media monitoring and information extraction for the highly inflected agglutinative language Hungarian
The Europe Media Monitor (EMM) is a fully-automatic system that analyses written online news by gathering articles in over 70 languages and by applying text analysis software for currently 21 languages, without using linguistic tools such as parsers, part-of-speech taggers or morphological analysers. In this paper, we describe the effort of adding to EMM Hungarian text mining tools for news gathering; document categorisation; named entity recognition and classification for persons, organisations and locations; name lemmatisation; quotation recognition; and cross-lingual linking of related news clusters. The major challenge of dealing with the Hungarian language is its high degree of inflection and agglutination. We present several experiments where we apply linguistically lightweight methods to deal with inflection, and we propose a method to overcome these challenges. We also present detailed frequency lists of Hungarian person and location name suffixes, as found in real-life news texts. This empirical data can be used to draw further conclusions and to improve existing Named Entity Recognition software. Within EMM, the solutions described here will also be applied to other morphologically complex languages such as those of the Slavic language family. The media monitoring and analysis system EMM is freely accessible online via the web page.
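The "linguistically lightweight" name lemmatisation the abstract describes can be sketched roughly as suffix stripping validated against a gazetteer. This is a hypothetical illustration, not EMM's actual implementation; the suffix list below is a small illustrative sample of Hungarian case endings, not the paper's frequency-derived resource.

```python
# Hypothetical sketch of lightweight Hungarian name lemmatisation:
# strip a known case suffix only if the remaining stem is a known name.
# The suffix list is an illustrative sample, not EMM's real resource.

CASE_SUFFIXES = [
    "nak", "nek",   # dative
    "ban", "ben",   # inessive
    "val", "vel",   # instrumental
    "ból", "ből",   # elative
    "t",            # accusative (often with a linking vowel in practice)
]

def lemmatise_name(token: str, known_names: set[str]) -> str:
    """Try suffixes longest-first; strip one only if the stem is attested."""
    for suffix in sorted(CASE_SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix):
            stem = token[: -len(suffix)]
            if stem in known_names:
                return stem
    return token

gazetteer = {"Budapest"}
print(lemmatise_name("Budapestnek", gazetteer))  # dative form -> "Budapest"
print(lemmatise_name("London", gazetteer))       # no attested stem, unchanged
```

Validating against a gazetteer is what keeps such a shallow method from over-stripping: a bare suffix match alone would mangle names that merely happen to end in a case-like string.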
A Large-Scale Comparison of Historical Text Normalization Systems
There is no consensus on the state-of-the-art approach to historical text
normalization. Many techniques have been proposed, including rule-based
methods, distance metrics, character-based statistical machine translation, and
neural encoder--decoder models, but studies have used different datasets,
different evaluation methods, and have come to different conclusions. This
paper presents the largest study of historical text normalization done so far.
We critically survey the existing literature and report experiments on eight
languages, comparing systems spanning all categories of proposed normalization
techniques, analysing the effect of training data quantity, and using different
evaluation methods. The datasets and scripts are made publicly available.
Comment: Accepted at NAACL 201
An Algorithm For Building Language Superfamilies Using Swadesh Lists
The main contributions of this thesis are the following: i. developing an algorithm that generates language families and superfamilies, given for each input language a Swadesh list represented in International Phonetic Alphabet (IPA) notation; ii. the algorithm's novel use of the Levenshtein distance metric on the IPA representations, and its way of measuring the overall distance between pairs of Swadesh lists; iii. building a Swadesh list for the author's native Kinyarwanda language, because no such list could be found even after an extensive search.
Adviser: Peter Reves
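The core distance computation the thesis describes can be sketched as follows, under assumptions: each language is represented as a Swadesh list of IPA strings aligned by concept, and the list-to-list distance is taken here as the mean length-normalised Levenshtein distance (the thesis's exact aggregation may differ). The word lists below are tiny illustrative stand-ins, not real Swadesh data.

```python
# Sketch of Levenshtein-based distance between concept-aligned Swadesh
# lists; the aggregation (mean normalised distance) is an assumption.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def swadesh_distance(list_a: list[str], list_b: list[str]) -> float:
    """Mean normalised edit distance over concept-aligned IPA entries."""
    total = 0.0
    for wa, wb in zip(list_a, list_b):
        total += levenshtein(wa, wb) / max(len(wa), len(wb), 1)
    return total / len(list_a)

# Illustrative (invented) IPA-like entries for two hypothetical languages:
lang_a = ["ˈwɑtɛr", "ˈfajɑ", "sun"]
lang_b = ["ˈwasɛr", "ˈføjɑ", "zon"]
print(swadesh_distance(lang_a, lang_b))
```

Pairwise distances produced this way can then feed any standard hierarchical clustering step to group languages into families and superfamilies.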
Artificial Sequences and Complexity Measures
In this paper we exploit concepts of information theory to address the
fundamental problem of identifying and defining the most suitable tools to
extract, in an automatic and agnostic way, information from a generic string of
characters. We introduce in particular a class of methods which use in a
crucial way data compression techniques in order to define a measure of
remoteness and distance between pairs of sequences of characters (e.g. texts)
based on their relative information content. We also discuss in detail how
specific features of data compression techniques could be used to introduce the
notion of dictionary of a given sequence and of Artificial Text and we show how
these new tools can be used for information extraction purposes. We point out
the versatility and generality of our method that applies to any kind of
corpora of character strings independently of the type of coding behind them.
We consider as a case study linguistically motivated problems and we present
results for automatic language recognition, authorship attribution and
self-consistent classification.
Comment: Revised version, with major changes, of previous "Data Compression
approach to Information Extraction and Classification" by A. Baronchelli and
V. Loreto. 15 pages; 5 figures
Language Trees and Zipping
In this letter we present a very general method to extract information from a
generic string of characters, e.g. a text, a DNA sequence or a time series.
Based on data-compression techniques, its key point is the computation of a
suitable measure of the remoteness of two bodies of knowledge. We present the
implementation of the method on linguistically motivated problems, featuring
highly accurate results for language recognition, authorship attribution and
language classification.
Comment: 5 pages, RevTeX4, 1 eps figure. In press in Phys. Rev. Lett. (January
2002)
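The compression-based "remoteness" measure both of the preceding abstracts build on can be sketched with a standard off-the-shelf compressor. This is a minimal illustration in the spirit of the method, here realised as the normalised compression distance (NCD) using zlib, not the authors' exact implementation: NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed length of s.

```python
# Sketch of a compression-based distance between character sequences,
# using zlib as the compressor; an illustration of the general idea,
# not the papers' exact procedure. The sample texts are invented.

import zlib

def clen(data: bytes) -> int:
    """Compressed length of a byte string at maximum compression level."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance: small when x and y share structure."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

text_a = b"the quick brown fox jumps over the lazy dog " * 20
text_b = b"a lazy dog sleeps while the quick fox runs off " * 20
other  = b"gyorsan fut a roka es alszik a lusta kutya kint " * 20

# A text is very close to itself; cross-text distances are larger.
print(ncd(text_a, text_a), ncd(text_a, text_b), ncd(text_a, other))
```

Appending one sequence to another and measuring how much extra the compressor pays is exactly the "zipping" trick: shared regularities between the two sequences let the concatenation compress better, which the distance rewards.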
Exploring the Mental Lexicon of the Multilingual: Vocabulary Size, Cognate Recognition and Lexical Access in the L1, L2 and L3
Recent empirical findings in the field of Multilingualism have shown that the mental lexicon of a language learner does not consist of separate entities, but rather of an intertwined system where languages can interact with each other (e.g. Cenoz, 2013; Szubko-Sitarek, 2015). Accordingly, multilingual language learners have been treated differently from second language learners in a growing number of studies; however, studies on the variation in learners' vocabulary size in both the L2 and L3, and on the effect of cognates on the target languages, have been relatively scarce. This paper, therefore, investigates the impact of prior lexical knowledge on additional language learning in the case of Hungarian native speakers, who use Romanian (a Romance language) as a second language (L2) and learn English as an L3. The study employs an adapted version of the widely used Vocabulary Size Test (Nation & Beglar, 2007), the Romanian Vocabulary Size Test (based on the Romanian Frequency List; Szabo, 2015) and a Hungarian test (based on a Hungarian frequency list; Varadi, 2002) in order to measure vocabulary sizes, cognate knowledge and response times in these languages. The findings, complemented by a self-rating language background questionnaire, indicate a strong link between Romanian and English lexical proficiency.