992 research outputs found

    Discourse Structure in Machine Translation Evaluation

    Full text link
    In this article, we explore the potential of using sentence-level discourse structure for machine translation evaluation. We first design discourse-aware similarity measures, which use all-subtree kernels to compare discourse parse trees in accordance with the Rhetorical Structure Theory (RST). Then, we show that a simple linear combination with these measures can help improve various existing machine translation evaluation metrics regarding correlation with human judgments both at the segment- and at the system-level. This suggests that discourse information is complementary to the information used by many of the existing evaluation metrics, and thus it could be taken into account when developing richer evaluation metrics, such as the WMT-14 winning combined metric DiscoTKparty. We also provide a detailed analysis of the relevance of various discourse elements and relations from the RST parse trees for machine translation evaluation. In particular we show that: (i) all aspects of the RST tree are relevant, (ii) nuclearity is more useful than relation type, and (iii) the similarity of the translation RST tree to the reference tree is positively correlated with translation quality.Comment: machine translation, machine translation evaluation, discourse analysis. Computational Linguistics, 201

    Computational linguistics in the Netherlands 1996 : papers from the 7th CLIN meeting, November 15, 1996, Eindhoven

    Get PDF

    ANNOTATION MODEL FOR LOANWORDS IN INDONESIAN CORPUS: A LOCAL GRAMMAR FRAMEWORK

    Get PDF
    There is a considerable number for loanwords in Indonesian language as it has been, or even continuously, in contact with other languages. The contact takes place via different media; one of them is via machine readable medium. As the information in different languages can be obtained by a mouse click these days, the contact becomes more and more intense. This paper aims at proposing an annotation model and lexical resource for loanwords in Indonesian. The lexical resource is applied to a corpus by a corpus processing software called UNITEX. This software works under local grammar framewor

    Computational linguistics in the Netherlands 1996 : papers from the 7th CLIN meeting, November 15, 1996, Eindhoven

    Get PDF

    Measuring Sentences Similarity Based on Discourse Representation Structure

    Get PDF
    The problem of measuring similarity between sentences is crucial for many applications in Natural Language Processing (NLP). Most of the proposed approaches depend on similarity of words in sentences. This research considers semantic relations between words in calculating sentence similarity. This paper uses Discourse Representation Structure (DRS) of natural language sentences to measure similarity. DRS captures the structure and semantic information of sentences. Moreover, the estimation of similarity between sentences depends on semantic coverage of relations of the first sentence in the other sentence. Experiments show that exploiting structural information achieves better results than traditional word-to-word approaches. Moreover, the proposed method outperforms similar approaches on a standard benchmark dataset

    LiDom builder: Automatising the construction of multilingual domain modules

    Get PDF
    136 p.Laburpena Lan honetan LiDOM Builder tresnaren analisi, diseinu eta ebaluazioa aurkezten dira. Teknologian oinarritutako hezkuntzarako tresnen Domeinu Modulu Eleaniztunak testuliburu elektronikoetatik era automatikoan erauztea ahalbidetzen du LiDOM Builderek. Ezagutza eskuratzeko, Hizkuntzaren Prozesamendurako eta Ikaste Automatikorako teknikekin batera, hainbat baliabide eleaniztun erabiltzen ditu, besteak beste, Wikipedia eta WordNet.Domeinu Modulu Elebakarretik Domeinu Modulu Eleaniztunerako bidean, LiDOM Builder tresna DOM-Sortze ingurunearen (Larrañaga, 2012; Larrañaga et al., 2014) bilakaera dela esan genezake. Horretarako, LiDOM Builderek domeinua ikuspegi eleaniztun batetik adieraztea ahalbidetzen duen mekanismoa dakar. Domeinu Modulu Eleaniztunak bi maila ezberdinetako ezagutza jasotzen du: Ikaste Domeinuaren Ontologia (IDO), non hizkuntza ezberdinetan etiketatutako topikoak eta hauen arteko erlazio pedagogikoak jasotzen baitira, eta Ikaste Objektuak (IO), hau da, metadatuekin etiketatutako baliabide didaktikoen bilduma, hizkuntza horietan. LiDOM Builderek onartutako hizkuntza guztietan domeinuaren topikoak adierazteko aukera ematen du. Topiko bakoitza lotuta dago dagokion hizkuntzako bere etiketa baliokidearekin. Gainera, IOak deskribatzeko metadatu aberastuak erabiltzen ditu hizkuntza desberdinetan parekideak diren baliabide didaktikoak lotzeko.LiDOM Builderen, hasiera batean, domeinu-modulua hizkuntza jakin batean idatzitako dokumentu batetik erauziko da eta, baliabide eleaniztunak erabiliko dira, gerora, bai topikoak bai IOak beste hizkuntzetan ere lortzeko. Lan honetan, Ingelesez idatzitako liburuek osatuko dute informazio-iturri nagusia bai doitze-prozesuan bai ebaluazio-prozesuan. Zehazki, honako testuliburu hauek erabili dira: Principles of Object Oriented Programming (Wong and Nguyen, 2010), Introduction to Astronomy (Morison, 2008) eta Introduction to Molecular Biology (Raineri, 2010). Baliabide eleaniztunei dagokienez, Wikipedia, WordNet eta Wikipediatik erauzitako beste hainbat ezagutza-base erabili dira. Testuliburuetatik Domeinu Modulu Eleaniztunak eraikitzeko, LiDOM Builder hiru modulu nagusitan oinarritzen da: LiTeWi eta LiReWi moduluak IDO eleaniztuna eraikitzeaz arduratuko dira eta LiLoWi, aldiz, IO eleaniztunak eraikitzeaz. Jarraian, aipatutako modulu bakoitza xehetasun gehiagorekin azaltzen da.¿ LiTeWi (Conde et al., 2015) moduluak, edozein ikaste-domeinutako testuliburu batetik abiatuta, Hezkuntzarako Ontologia bati dagozkion hainbat termino eleaniztun identifikatuko ditu, hala nola TF-IDF, KP-Miner, CValue eta Shallow Parsing Grammar. Hori lortzeko, gainbegiratu gabeko datu-erauzketa teknikez eta Wikipediaz baliatzen da. Ontologiako topikoak erauzteak LiTeWi-n hiru urrats ditu: lehenik hautagai diren terminoen erauzketa; bigarrenik, lortutako terminoen konbinatzea eta fintzea azken termino zerrenda osatuz; eta azkenik, zerrendako terminoak beste hizkuntzetara mapatzea Wikipedia baliatuz.¿ LiReWi (Conde et al., onartzeko) moduluak Hezkuntzarako Ontologia erlazio pedagogikoez aberastuko du, beti ere testuliburua abiapuntu gisa erabilita. Lau motatako erlazio pedagogikoak erauziko ditu (isA, partOf, prerequisite eta pedagogicallyClose) hainbat teknika eta ezagutza-base konbinatuz. Ezagutza-baseen artean Wikipedia, WordNet, WikiTaxonomy, WibiTaxonomy eta WikiRelations daude. LiReWi-k ere hiru urrats emango ditu erlazioak lortzeko: hasteko, ontologiako topikoak erlazioak erauzteko erabiliko diren ezagutza-base desberdinekin mapatuko ditu; gero, hainbat erlazio-erauzle, bakoitza teknika desberdin batean oinarritzen dena, exekutatuko ditu konkurrenteki erlazio hautagaiak erauzteko; eta, bukatzeko, lortutako emaitza guztiak konbinatu eta iragaziko ditu erlazio pedagogikoen azken multzoa lortuz. Gainera, DOM-Sortzetik LiDOM Buildererako trantsizioan, tesi honetan hobetu egin dira dokumentuen indizeetatik erauzitako isA eta partOf erlazioak, Wikipedia baliabide gehigarri bezala erabilita (Conde et al., 2014).¿ LiLoWi moduluak IOak -batzuk eleaniztunak- erauziko ditu, abiapuntuko testuliburutik ez ezik Wikipedia edo WordNet bezalako ezagutza-baseetatik ere. IDO ontologiako topiko bakoitza Wikipedia eta WordNet-ekin mapatu ostean, LiLoWi-k baliabide didaktikoak erauziko ditu hainbat IO erauzlez baliatuz.IO erauzketa-prozesuan, DOM-Sortzetik LiDOM Buildereko bidean, eta Wikipedia eta WordNet erabili aurretik, ingelesa hizkuntza ere gehitu eta ebaluatu da (Conde et al., 2012).LiDOM Builderen ebaluaziori dagokionez, modulu bakoitza bere aldetik testatua eta ebaluatua izan da bai Gold-standard teknika bai aditu-ebaluazioa baliatuz. Gainera, Wikipedia eta WordNet ezagutza-baseen integrazioak IOen erauzketari ekarri dion hobekuntza ere ebaluatu da. Esan genezake kasu guztietan lortu diren emaitzak oso onak direla.Bukatzeko, eta laburpen gisa, lau dira LiDOM Builderek Domeinu Modulu Eleaniztunaren arloari egin dizkion ekarpen nagusiak:¿ Domeinu Modulu Eleaniztunak adierazteko mekanismo egokia.¿ LiTeWiren garapena. Testuliburuetatik Hezkuntzarako Ontologietarako terminologia eleaniztuna erauztea ahalbidetzen du modulu honek. Ingelesa eta Gaztelera hizkuntzentzako termino-erauzlea eskura dago https://github.com/Neuw84/LiTe URLan.¿ LiReWiren garapena. Testuliburuetatik Hezkuntzarako Ontologietarako erlazio pedagogikoak erauztea ahalbidetzen du modulu honek. Erabiltzen duen Wikipedia/WordNet mapatzailea eskura dago https://github.com/Neuw84/Wikipedia2WordNet URLan.¿ LiLoWiren garapena. Testuliburua eta Wikipedia eta WordNet ezagutza-baseak erabilita IO eleaniztunak erauztea ahalbidetzen du modulu honek

    Character-based Neural Semantic Parsing

    Get PDF
    Humans and computers do not speak the same language. A lot of day-to-day tasks would be vastly more efficient if we could communicate with computers using natural language instead of relying on an interface. It is necessary, then, that the computer does not see a sentence as a collection of individual words, but instead can understand the deeper, compositional meaning of the sentence. A way to tackle this problem is to automatically assign a formal, structured meaning representation to each sentence, which are easy for computers to interpret. There have been quite a few attempts at this before, but these approaches were usually heavily reliant on predefined rules, word lists or representations of the syntax of the text. This made the general usage of these methods quite complicated. In this thesis we employ an algorithm that can learn to automatically assign meaning representations to texts, without using any such external resource. Specifically, we use a type of artificial neural network called a sequence-to-sequence model, in a process that is often referred to as deep learning. The devil is in the details, but we find that this type of algorithm can produce high quality meaning representations, with better performance than the more traditional methods. Moreover, a main finding of the thesis is that, counter intuitively, it is often better to represent the text as a sequence of individual characters, and not words. This is likely the case because it helps the model in dealing with spelling errors, unknown words and inflections

    CLiFF Notes: Research In Natural Language Processing at the University of Pennsylvania

    Get PDF
    CLIFF is the Computational Linguists\u27 Feedback Forum. We are a group of students and faculty who gather once a week to hear a presentation and discuss work currently in progress. The \u27feedback\u27 in the group\u27s name is important: we are interested in sharing ideas, in discussing ongoing research, and in bringing together work done by the students and faculty in Computer Science and other departments. However, there are only so many presentations which we can have in a year. We felt that it would be beneficial to have a report which would have, in one place, short descriptions of the work in Natural Language Processing at the University of Pennsylvania. This report then, is a collection of abstracts from both faculty and graduate students, in Computer Science, Psychology and Linguistics. We want to stress the close ties between these groups, as one of the things that we pride ourselves on here at Penn is the communication among different departments and the inter-departmental work. Rather than try to summarize the varied work currently underway at Penn, we suggest reading the abstracts to see how the students and faculty themselves describe their work. The report illustrates the diversity of interests among the researchers here, as well as explaining the areas of common interest. In addition, since it was our intent to put together a document that would be useful both inside and outside of the university, we hope that this report will explain to everyone some of what we are about
    corecore