
    Natural language processing

    Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems, such as text summarization, information extraction, and information retrieval, including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of the World Wide Web and digital libraries; and (iv) evaluation of NLP systems.

    An Evaluation of POS Taggers for the CHILDES Corpus

    This project evaluates four mainstream part-of-speech taggers on a representative collection of child-adult dialogues from the Child Language Data Exchange System (CHILDES). Nine children's files from the Valian corpus and part of the Eve corpus were manually labeled and rewritten with the LARC tagset; they served as the gold-standard corpora for training and testing. Four taggers (the CLAN MOR tagger, the ACOPOST trigram tagger, the Stanford parser, and version 1.14 of the Brill tagger) were tested with 10-fold cross-validation. By analyzing which assumptions about category assignment led the taggers to fail, we identify several problematic tagging cases. By comparing the average error rate of each tagger, we found that both the size of the training set and the length of utterances affect tagging accuracy.
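    The evaluation protocol described above is ordinary 10-fold cross-validation of token-level tagging accuracy. Below is a minimal sketch of that protocol in Python using NLTK; the NLTK trigram tagger and the bundled Penn Treebank sample stand in for the paper's four taggers and its LARC-tagged CHILDES data, both of which are assumptions for illustration only.

    ```python
    # Minimal 10-fold cross-validation sketch for a POS tagger (illustrative;
    # not the paper's taggers or data). NLTK's Penn Treebank sample serves as
    # placeholder gold-standard sentences, since the LARC-tagged CHILDES files
    # used in the paper are not assumed to be available here.
    import nltk
    from nltk.corpus import treebank
    from nltk.tag import UnigramTagger, TrigramTagger

    nltk.download("treebank", quiet=True)
    sents = list(treebank.tagged_sents())  # gold-standard tagged sentences

    k = 10
    fold_size = len(sents) // k
    accuracies = []
    for i in range(k):
        # Hold out the i-th fold for testing; train on the remaining nine.
        test = sents[i * fold_size:(i + 1) * fold_size]
        train = sents[:i * fold_size] + sents[(i + 1) * fold_size:]
        backoff = UnigramTagger(train)  # unigram backoff for unseen trigrams
        tagger = TrigramTagger(train, backoff=backoff)
        # Token-level accuracy (NLTK >= 3.6; use .evaluate() on older versions).
        accuracies.append(tagger.accuracy(test))

    print(f"mean accuracy over {k} folds: {sum(accuracies) / len(accuracies):.3f}")
    ```

    Comparing the per-fold error rates of several taggers trained this way is what lets the authors attribute accuracy differences to training-set size and utterance length rather than to a lucky train/test split.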

    Proceedings

    Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora (AEPC 2010). Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk. NEALT Proceedings Series, Vol. 10 (2010), 98 pages. © 2010 The editors and contributors. Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt. Electronically published at Tartu University Library (Estonia): http://hdl.handle.net/10062/15893

    Penn Korean Treebank: Development and Evaluation


    Constructing a Large-Scale English-Persian Parallel Corpus

    In recent years the exploitation of large text corpora to solve various kinds of linguistic problems, including those of translation, has become commonplace. Yet a large-scale English-Persian corpus is still unavailable, because of certain difficulties and the amount of work required to overcome them. The project reported here is an attempt to build an English-Persian parallel corpus composed of digital texts and Web documents containing little or no noise. The Internet is useful because translations of existing texts are often published on the Web. The task is to find parallel pages in English and Persian, to judge their translation quality, and to download and align them. The corpus so created is open; that is, more material can be added as the need arises. One of the main activities associated with building such a corpus is developing software for parallel concordancing, in which a user can enter a search string in one language and see all the citations for that string in that language together with the corresponding sentences in the target language, as sketched below. Our intention is to use the present English-Persian parallel corpus to construct general translation-memory software. An aligned bilingual corpus is also useful in many other settings, among them machine translation, word-sense disambiguation, cross-language information retrieval, lexicography, and language learning.
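    A minimal sketch of the parallel concordancing idea follows: given sentence-aligned English-Persian text, look up a query string on the English side and return each matching citation with its aligned Persian sentence. The line-aligned file format and the file names are illustrative assumptions, not the project's actual data structures.

    ```python
    # Illustrative parallel concordancer over sentence-aligned bilingual text.
    # Assumes two line-aligned files: line i of each file is a translation pair.
    from typing import List, Tuple

    AlignedPair = Tuple[str, str]  # (English sentence, Persian sentence)

    def load_aligned(en_path: str, fa_path: str) -> List[AlignedPair]:
        """Read two line-aligned files into a list of sentence pairs."""
        with open(en_path, encoding="utf-8") as en, open(fa_path, encoding="utf-8") as fa:
            return [(e.strip(), f.strip()) for e, f in zip(en, fa)]

    def concordance(pairs: List[AlignedPair], query: str) -> List[AlignedPair]:
        """Case-insensitive search on the English side; returns aligned citations."""
        q = query.lower()
        return [(en, fa) for en, fa in pairs if q in en.lower()]

    if __name__ == "__main__":
        # Hypothetical file names for the aligned corpus halves.
        pairs = load_aligned("corpus.en", "corpus.fa")
        for en, fa in concordance(pairs, "translation"):
            print(en)
            print(fa)
            print("-" * 40)
    ```

    A translation-memory tool of the kind the authors envisage would build on the same lookup, replacing the substring match with fuzzy matching against whole source sentences.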

    A Survey on Awesome Korean NLP Datasets

    English-based datasets are commonly available from Kaggle, GitHub, or recently published papers. Although benchmark tests with English datasets suffice to show off the performance of new models and methods, a researcher still needs to train and validate models on Korean datasets to produce a technology or product suitable for Korean-language processing. This paper introduces 15 popular Korean NLP datasets with summarized details such as volume, license, and repositories, along with other research results inspired by the datasets. I also provide detailed guidance with samples or statistics for each dataset. The main characteristics of the datasets are presented in a single table to give researchers a rapid overview. Comment: 11 pages, 1 horizontal page for large table
    • 

    corecore