An automatic part-of-speech tagger for Middle Low German

Author: Breitbarth Anne
Desmet Bart
Farasyn Melissa
Hoste Veronique
Koleva Mariya
Publication venue: 'John Benjamins Publishing Company'
Publication date: 01/01/2017

Syntactically annotated corpora are highly important for enabling large-scale diachronic and diatopic language research. Such corpora have recently been developed for a variety of historical languages, or are still under development. One of those under development is the fully tagged and parsed Corpus of Historical Low German (CHLG), which is aimed at facilitating research into the highly under-researched diachronic syntax of Low German. The present paper reports on a crucial step in creating the corpus, viz. the creation of a part-of-speech tagger for Middle Low German (MLG). Having been transmitted in several non-standardised written varieties, MLG poses a challenge to standard POS taggers, which usually rely on normalized spelling. We outline the major issues faced in the creation of the tagger and present our solutions to them.

Crossref

Ghent University Academic Bibliography

A Note on the Creation Formula in Zechariah 12:1–8; Isaiah 42:5–6; and Old Persian Inscriptions

Author: Christine Mitchell
Publication venue: 'Modern Language Association'
Publication date: 01/01/2014

This note explores whether the influence of the Old Persian creation formula as well as its underlying theology can be seen in biblical texts. The particular focus is on Zech 12:1–8 and Isa 42:5–6. While both of these texts use creation language found elsewhere in the Hebrew Bible corpus, the particular content and structure of these texts have strong resonances with the Old Persian texts.

Humanities Commons

Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling

Author: Blunsom Phil
Dyer Chris
Kawakami Kazuya
Publication date: 01/01/2017

Fixed-vocabulary language models fail to account for one of the most characteristic statistical facts of natural language: the frequent creation and reuse of new word types. Although character-level language models offer a partial solution in that they can create word types not attested in the training corpus, they do not capture the "bursty" distribution of such words. In this paper, we augment a hierarchical LSTM language model that generates sequences of word tokens character by character with a caching mechanism that learns to reuse previously generated words. To validate our model we construct a new open-vocabulary language modeling corpus (the Multilingual Wikipedia Corpus, MWC) from comparable Wikipedia articles in 7 typologically diverse languages and demonstrate the effectiveness of our model across this range of languages. Comment: ACL 201

arXiv.org e-Print Archive

Crossref

OBOME - Ontology based opinion mining in UBIPOL

Author: Husani M
Ko A
Kocyigit A
Lee H
Tapucu D
Publication venue: Brunel University
Publication date: 01/01/2012

Ontologies have a special role in the UBIPOL system: they help to structure the policy-related context, provide a conceptualization of the policy domain, and are used in the opinion mining process. In this work we present a system called Ontology Based Opinion Mining Engine (OBOME) for analyzing a domain-specific opinion corpus by first assisting the user with the creation of a domain ontology from the corpus. We then determine the polarity of opinion on the various domain aspects. In the former step, the policy domain aspects are identified (namely, which policy category is represented by the concept). This identification is supported by the policy modelling ontology, which describes the most important policy-related classes and structure. Then the most informative documents are extracted from the corpus, and the user is asked to create a set of aspects and related keywords using these documents. In the latter step, we use the corpus-specific ontology to model the domain and extract aspect-polarity associations using grammatical dependencies between words. Later, summarized results are shown to the user to analyze and store. Finally, in an offline process, the policy modelling ontology is updated.

Brunel University Research Archive

Coping with noise in a real-world weblog crawler and retrieval system

Author: Ferguson Paul
Lanagan James
O'Hare Neil
Smeaton Alan F.
Publication date: 01/05/2010

In this paper we examine the effects of noise when creating a real-world weblog corpus for information retrieval. We focus on the DiffPost (Lee et al. 2008) approach to noise removal from blog pages, examining the difficulties encountered when crawling the blogosphere during the creation of a real-world corpus of blog pages. We introduce and evaluate a number of enhancements to the original DiffPost approach in order to increase the robustness of the algorithm. We then extend DiffPost by looking at the anchor-text to text ratio, and discover that the time-interval between crawls is more important to the successful application of noise-removal algorithms within the blog context than any additional improvements to the removal algorithm itself.

Irish Universities

DCU Online Research Access Service

Effects of Sole Gift of Proceeds with No Disposition of Corpus

Author: Marshall Phillip H
Publication venue: EngagedScholarship@CSU
Publication date: 01/01/1956

What is the effect of an absolute devise of proceeds or income of the corpus of an estate where, under a general devise, words clearly importing the creation of a fee simple estate as to the corpus are lacking and there is no express or implied power of sale? More particularly we shall consider the above in the following circumstances: (1) When in the will there is contained an absolute gift of proceeds of the corpus; (2) A gift of proceeds of the corpus followed by limitation upon the corpus; and (3) Where under a bequest of proceeds along with creation of a testamentary trust and limitations over upon corpus of the estate.

Cleveland-Marshall College of Law

Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus

Author: De Clercq Orphée
Heuvel Henk van den
Jong Franciska de
Oostdijk Nelleke
Reynaert Martin
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2010

In the Low Countries, a major reference corpus for written Dutch is being built. We discuss the interplay between data acquisition and data processing during the creation of the SoNaR Corpus. Based on developments in traditional corpus compiling and new web harvesting approaches, SoNaR is designed to contain 500 million words, balanced over 36 text types including both traditional and new media texts. Beside its balanced design, every text sample included in SoNaR will have its IPR issues settled to the largest extent possible. This data collection task presents many challenges because every decision taken on the level of text acquisition has ramifications for the level of processing and the general usability of the corpus. As far as the traditional text types are concerned, each text brings its own processing requirements and issues. For new media texts - SMS, chat - the problem is even more complex: issues such as anonymity, recognizability and citation right all present problems that have to be tackled. The solutions actually lead to the creation of two corpora: a gigaword SoNaR, IPR-cleared for research purposes, and the smaller - of commissioned size - more privacy compliant SoNaR, IPR-cleared for commercial purposes as well.

CiteSeerX

Ghent University Academic Bibliography

Radboud Repository

University of Twente Research Information

Tilburg University Repository