Natural language processing
Beginning with the basic issues of NLP, this chapter charts the major research activities in this area since the last ARIST chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems (text summarization, information extraction, information retrieval, etc.), including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of the Web and digital libraries; and (iv) evaluation of NLP systems.
An Evaluation of POS Taggers for the CHILDES Corpus
This project evaluates four mainstream taggers on a representative collection of child-adult dialogues from the Child Language Data Exchange System (CHILDES). The nine children's files from the Valian corpus and part of the Eve corpus were manually labeled and re-annotated with the LARC tagset; they served as the gold-standard corpora for training and testing. Four taggers (the CLAN MOR tagger, the ACOPOST trigram tagger, the Stanford parser, and version 1.14 of the Brill tagger) were tested by 10-fold cross-validation. By analyzing which assumptions about category assignment led each tagger to fail, we identify several problematic cases of tagging. By comparing the average error rate of each tagger, we found that both the size of the training set and the length of the utterance affect tagging accuracy.
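The evaluation procedure the abstract describes can be sketched in miniature: partition a gold-standard tagged corpus into k folds, train on k-1 folds, test on the held-out fold, and report per-fold error rates. Everything below is an illustrative assumption: the toy corpus stands in for the Valian/Eve data, and a most-frequent-tag baseline stands in for the four real taggers.

```python
from collections import Counter, defaultdict

# Toy gold-standard corpus of (word, tag) utterances. These rows are
# illustrative stand-ins for the CHILDES/LARC data, not the real corpus.
CORPUS = [
    [("you", "pro"), ("want", "v"), ("milk", "n")],
    [("the", "det"), ("dog", "n"), ("runs", "v")],
    [("I", "pro"), ("want", "v"), ("the", "det"), ("ball", "n")],
    [("the", "det"), ("cat", "n"), ("runs", "v")],
    [("you", "pro"), ("see", "v"), ("the", "det"), ("dog", "n")],
    [("I", "pro"), ("see", "v"), ("milk", "n")],
]

def train_baseline(utterances):
    """Most-frequent-tag baseline: map each word to its commonest gold tag."""
    counts = defaultdict(Counter)
    for utt in utterances:
        for word, tag in utt:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def error_rate(model, utterances, default_tag="n"):
    """Fraction of tokens mistagged; unseen words get a default tag."""
    errors = total = 0
    for utt in utterances:
        for word, gold in utt:
            total += 1
            if model.get(word, default_tag) != gold:
                errors += 1
    return errors / total if total else 0.0

def cross_validate(corpus, k=10):
    """k-fold cross-validation over utterances; returns per-fold error rates."""
    folds = [corpus[i::k] for i in range(k)]
    rates = []
    for i, test_fold in enumerate(folds):
        train = [u for j, fold in enumerate(folds) if j != i for u in fold]
        rates.append(error_rate(train_baseline(train), test_fold))
    return rates

rates = cross_validate(CORPUS, k=3)  # k=3 only because the toy corpus is tiny
print(sum(rates) / len(rates))       # average error rate across folds
```

Comparing such average error rates across taggers, and across training sets of different sizes, is the shape of the comparison the study reports.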
Proceedings
Proceedings of the Workshop on Annotation and
Exploitation of Parallel Corpora AEPC 2010.
Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk.
NEALT Proceedings Series, Vol. 10 (2010), 98 pages.
© 2010 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt.
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/15893
Constructing a Large-Scale English-Persian Parallel Corpus
In recent years the exploitation of large text corpora in solving various kinds of linguistic problems, including those of translation, has become commonplace. Yet a large-scale English-Persian corpus is still unavailable, because of certain difficulties and the amount of work required to overcome them. The project reported here is an attempt to construct an English-Persian parallel corpus composed of digital texts and Web documents containing little or no noise. The Internet is useful because translations of existing texts are often published on the Web. The task is to find parallel pages in English and Persian, to judge their translation quality, and to download and align them. The corpus so created is of course open; that is, more material can be added as the need arises. One of the main activities associated with building such a corpus is to develop software for parallel concordancing, in which a user can enter a search string in one language and see all the citations for that string together with the corresponding sentences in the target language. Our intention is to construct general translation-memory software using the present English-Persian parallel corpus. Such an aligned bilingual corpus also proves useful in many other applications, among them machine translation, word-sense disambiguation, cross-language information retrieval, lexicography, and language learning.
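The parallel-concordancing idea described above can be sketched as follows: given a sentence-aligned bilingual corpus, return every source-language citation containing a query string together with its aligned target-language sentence. The sentence pairs and the `concordance` function below are illustrative stand-ins under that assumption, not data or code from the project itself.

```python
# A tiny sentence-aligned English-Persian corpus: (source, target) pairs.
# These pairs are illustrative examples, not material from the corpus
# the abstract describes.
ALIGNED_PAIRS = [
    ("The book is on the table.", "کتاب روی میز است."),
    ("I read a new book.", "من یک کتاب جدید خواندم."),
    ("The weather is cold today.", "امروز هوا سرد است."),
]

def concordance(query, pairs):
    """Case-insensitive search on the source side; returns matching pairs."""
    q = query.lower()
    return [(src, tgt) for src, tgt in pairs if q in src.lower()]

# Show every English citation of "book" with its Persian counterpart.
for src, tgt in concordance("book", ALIGNED_PAIRS):
    print(f"{src}  |||  {tgt}")
```

A real concordancer would add tokenization, indexing for speed, and keyword-in-context display, but the lookup over aligned pairs is the core operation.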
A Survey on Awesome Korean NLP Datasets
English-based datasets are commonly available from Kaggle, GitHub, or
recently published papers. Although benchmark tests on English datasets are
sufficient to demonstrate the performance of new models and methods, a
researcher still needs to train and validate the models on Korean datasets to
produce a technology or product suitable for Korean-language processing. This
paper introduces 15 popular Korean NLP datasets with summarized details such
as volume, license, repositories, and other research results inspired by the
datasets. I also provide detailed descriptions of each dataset, with samples
or statistics. The main characteristics of the datasets are presented in a
single table to give researchers a rapid overview. Comment: 11 pages, 1 horizontal page for large table
- …