BEA – A multifunctional Hungarian spoken language database
In diverse areas of linguistics, the demand for studying actual language use is on
the increase. The aim of developing a phonetically-based multi-purpose database of
Hungarian spontaneous speech, dubbed BEA, is to accumulate a large amount of
spontaneous speech of various types together with sentence repetition and reading.
Presently, the recorded material of BEA amounts to 260 hours produced by 280
present-day Budapest speakers (ages between 20 and 90, 168 females and 112
males), also providing annotated materials for various types of research and
practical applications.
Media monitoring and information extraction for the highly inflected agglutinative language Hungarian
The Europe Media Monitor (EMM) is a fully-automatic system that analyses written online news by gathering articles in over 70 languages and by applying text analysis software for currently 21 languages, without using linguistic tools such as parsers, part-of-speech taggers or morphological analysers. In this paper, we describe the effort of adding to EMM Hungarian text mining tools for news gathering; document categorisation; named entity recognition and classification for persons, organisations and locations; name lemmatisation; quotation recognition; and cross-lingual linking of related news clusters. The major challenge of dealing with the Hungarian language is its high degree of inflection and agglutination. We present several experiments where we apply linguistically lightweight methods to deal with inflection, and we propose a method to overcome these challenges. We also present detailed frequency lists of Hungarian person and location name suffixes, as found in real-life news texts. This empirical data can be used to draw further conclusions and to improve existing Named Entity Recognition software. Within EMM, the solutions described here will also be applied to other morphologically complex languages such as those of the Slavic language family. The media monitoring and analysis system EMM is freely accessible online via the web page.
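The "linguistically lightweight" name lemmatisation the abstract describes can be sketched roughly as suffix stripping validated against a gazetteer. This is a hypothetical illustration, not EMM's actual implementation; the suffix list below is a small illustrative sample of Hungarian case endings, not the paper's frequency-derived resource.

```python
# Hypothetical sketch of lightweight Hungarian name lemmatisation:
# strip a known case suffix only if the remaining stem is a known name.
# The suffix list is an illustrative sample, not EMM's real resource.

CASE_SUFFIXES = [
    "nak", "nek",   # dative
    "ban", "ben",   # inessive
    "val", "vel",   # instrumental
    "ból", "ből",   # elative
    "t",            # accusative (often with a linking vowel in practice)
]

def lemmatise_name(token: str, known_names: set[str]) -> str:
    """Try suffixes longest-first; strip one only if the stem is attested."""
    for suffix in sorted(CASE_SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix):
            stem = token[: -len(suffix)]
            if stem in known_names:
                return stem
    return token

gazetteer = {"Budapest"}
print(lemmatise_name("Budapestnek", gazetteer))  # dative form -> "Budapest"
print(lemmatise_name("London", gazetteer))       # no attested stem, unchanged
```

Validating against a gazetteer is what keeps such a shallow method from over-stripping: a bare suffix match alone would mangle names that merely happen to end in a case-like string.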
A Large-Scale Comparison of Historical Text Normalization Systems
There is no consensus on the state-of-the-art approach to historical text
normalization. Many techniques have been proposed, including rule-based
methods, distance metrics, character-based statistical machine translation, and
neural encoder--decoder models, but studies have used different datasets,
different evaluation methods, and have come to different conclusions. This
paper presents the largest study of historical text normalization done so far.
We critically survey the existing literature and report experiments on eight
languages, comparing systems spanning all categories of proposed normalization
techniques, analysing the effect of training data quantity, and using different
evaluation methods. The datasets and scripts are made publicly available.
Comment: Accepted at NAACL 201
An Algorithm For Building Language Superfamilies Using Swadesh Lists
The main contributions of this thesis are the following: i. developing an algorithm that generates language families and superfamilies, given for each input language a Swadesh list represented in International Phonetic Alphabet (IPA) notation; ii. the algorithm's novel use of the Levenshtein distance metric on the IPA representations, and its way of measuring the overall distance between pairs of Swadesh lists; iii. building a Swadesh list for the author's native Kinyarwanda language, because no such list could be found even after an extensive search.
Adviser: Peter Reves
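The core distance computation the thesis describes can be sketched as follows, under assumptions: each language is represented as a Swadesh list of IPA strings aligned by concept, and the list-to-list distance is taken here as the mean length-normalised Levenshtein distance (the thesis's exact aggregation may differ). The word lists below are tiny illustrative stand-ins, not real Swadesh data.

```python
# Sketch of Levenshtein-based distance between concept-aligned Swadesh
# lists; the aggregation (mean normalised distance) is an assumption.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def swadesh_distance(list_a: list[str], list_b: list[str]) -> float:
    """Mean normalised edit distance over concept-aligned IPA entries."""
    total = 0.0
    for wa, wb in zip(list_a, list_b):
        total += levenshtein(wa, wb) / max(len(wa), len(wb), 1)
    return total / len(list_a)

# Illustrative (invented) IPA-like entries for two hypothetical languages:
lang_a = ["ˈwɑtɛr", "ˈfajɑ", "sun"]
lang_b = ["ˈwasɛr", "ˈføjɑ", "zon"]
print(swadesh_distance(lang_a, lang_b))
```

Pairwise distances produced this way can then feed any standard hierarchical clustering step to group languages into families and superfamilies.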
Artificial Sequences and Complexity Measures
In this paper we exploit concepts of information theory to address the
fundamental problem of identifying and defining the most suitable tools to
extract, in an automatic and agnostic way, information from a generic string of
characters. We introduce in particular a class of methods which use in a
crucial way data compression techniques in order to define a measure of
remoteness and distance between pairs of sequences of characters (e.g. texts)
based on their relative information content. We also discuss in detail how
specific features of data compression techniques could be used to introduce the
notion of dictionary of a given sequence and of Artificial Text and we show how
these new tools can be used for information extraction purposes. We point out
the versatility and generality of our method that applies to any kind of
corpora of character strings independently of the type of coding behind them.
We consider as a case study linguistically motivated problems and we present
results for automatic language recognition, authorship attribution and
self-consistent classification.
Comment: Revised version, with major changes, of previous "Data Compression
approach to Information Extraction and Classification" by A. Baronchelli and
V. Loreto. 15 pages; 5 figures
Language Trees and Zipping
In this letter we present a very general method to extract information from a
generic string of characters, e.g. a text, a DNA sequence or a time series.
Based on data-compression techniques, its key point is the computation of a
suitable measure of the remoteness of two bodies of knowledge. We present the
implementation of the method on linguistically motivated problems, featuring
highly accurate results for language recognition, authorship attribution and
language classification.
Comment: 5 pages, RevTeX4, 1 eps figure. In press in Phys. Rev. Lett. (January
2002)
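The compression-based "remoteness" measure both of the preceding abstracts build on can be sketched with a standard off-the-shelf compressor. This is a minimal illustration in the spirit of the method, here realised as the normalised compression distance (NCD) using zlib, not the authors' exact implementation: NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed length of s.

```python
# Sketch of a compression-based distance between character sequences,
# using zlib as the compressor; an illustration of the general idea,
# not the papers' exact procedure. The sample texts are invented.

import zlib

def clen(data: bytes) -> int:
    """Compressed length of a byte string at maximum compression level."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance: small when x and y share structure."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

text_a = b"the quick brown fox jumps over the lazy dog " * 20
text_b = b"a lazy dog sleeps while the quick fox runs off " * 20
other  = b"gyorsan fut a roka es alszik a lusta kutya kint " * 20

# A text is very close to itself; cross-text distances are larger.
print(ncd(text_a, text_a), ncd(text_a, text_b), ncd(text_a, other))
```

Appending one sequence to another and measuring how much extra the compressor pays is exactly the "zipping" trick: shared regularities between the two sequences let the concatenation compress better, which the distance rewards.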
Exploring the Mental Lexicon of the Multilingual: Vocabulary Size, Cognate Recognition and Lexical Access in the L1, L2 and L3
Recent empirical findings in the field of Multilingualism have shown that the mental lexicon of a language learner does not consist of separate entities, but rather of an intertwined system where languages can interact with each other (e.g. Cenoz, 2013; Szubko-Sitarek, 2015). Accordingly, multilingual language learners have been treated differently from second language learners in a growing number of studies; however, studies on the variation in learners' vocabulary size in both the L2 and L3, and on the effect of cognates on the target languages, have been relatively scarce. This paper, therefore, investigates the impact of prior lexical knowledge on additional language learning in the case of Hungarian native speakers, who use Romanian (a Romance language) as a second language (L2) and learn English as an L3. The study employs an adapted version of the widely used Vocabulary Size Test (Nation & Beglar, 2007), the Romanian Vocabulary Size Test (based on the Romanian Frequency List; Szabo, 2015) and a Hungarian test (based on a Hungarian frequency list; Varadi, 2002) in order to measure vocabulary sizes, cognate knowledge and response times in these languages. The findings, complemented by a self-rating language background questionnaire, indicate a strong link between Romanian and English lexical proficiency.