458 research outputs found

    Idiom treatment experiments in machine translation

    Idiomatic expressions pose a particular challenge for today's machine translation systems, because they must be translated according to their sense rather than literally. This dissertation shows how, with the help of a corpus and morphosyntactic rules, such idiomatic expressions can be recognized and ultimately translated correctly. The first chapter introduces the reader to the field of Machine Translation in general and then focuses on the special field of Example-based Machine Translation. A substantial part of the dissertation is then devoted to the theory of idiomatic expressions. The practical part describes how the hybrid Example-based Machine Translation system METIS-II was enabled, with the help of morphosyntactic rules, to process certain idiomatic expressions correctly and to translate them. The following chapter deals with the transfer system CAT2 and its handling of idiomatic expressions. The last part of the thesis evaluates three commercial systems, namely SYSTRAN, T1 Langenscheidt, and Power Translator Pro, with respect to their handling of continuous and discontinuous idiomatic expressions. For this purpose, both small corpora and parts of the extensive Europarl corpus and of the Digital Lexicon of the German Language in the 20th Century were processed, first manually and then automatically. The dissertation concludes with the results of this evaluation.

    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
A.V. and M.K. acknowledge funding from the European Community's Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.

    Meaning refinement to improve cross-lingual information retrieval

    Magdeburg, Univ., Faculty of Computer Science, dissertation, 2012, by Farag Ahme

    Proceedings of the 17th Annual Conference of the European Association for Machine Translation

    Proceedings of the 17th Annual Conference of the European Association for Machine Translation (EAMT

    Criteria for the validation of specialized verb equivalents : application in bilingual terminography

    Multilingual terminological resources do not always include valid equivalents of legal terms, for two main reasons. Firstly, legal systems can differ from one language community to another and even from one country to another, because each has its own history and traditions. As a result, the non-isomorphism between legal and linguistic systems may render the identification of equivalents a particularly challenging task. Secondly, by focusing primarily on the definition of equivalence, a notion widely discussed in translation studies but not sufficiently in terminology, the literature does not offer solid and systematic methodologies for assigning terminological equivalents. As a result, there is a lack of criteria to guide both terminologists and translators in the search for and validation of equivalent terms. This problem is even more evident in the case of predicative units, such as verbs. Although some terminologists (L'Homme 1998; Lerat 2002; Lorente 2007) have worked on specialized verbs, terminological equivalence between units that belong to this part of speech would benefit from a thorough study. By proposing a novel methodology to assign the equivalents of specialized verbs, this research aims to define validation criteria for this kind of predicative unit, so as to contribute to a better understanding of the phenomenon of terminological equivalence as well as to the development of multilingual terminography in general, and of legal terminography in particular.
The study uses a Portuguese-English comparable corpus that consists of a single genre of texts, i.e. Supreme Court judgments, from which 100 Portuguese and 100 English specialized verbs were selected. The description of the verbs is based on the theory of Frame Semantics (Fillmore 1976, 1977, 1982, 1985; Fillmore and Atkins 1992), on the FrameNet methodology (Ruppenhofer et al. 2010), as well as on the methodology for compiling specialized lexical resources, such as DiCoInfo (L'Homme 2008), developed in the Observatoire de linguistique Sens-Texte at the Université de Montréal. The research reviews contributions that have adopted the same theoretical and methodological framework for the compilation of lexical resources and proposes adaptations to the specific objectives of the project. In contrast to the top-down approach adopted by FrameNet lexicographers, the approach described here is bottom-up, i.e. verbs are first analyzed and then grouped into frames for each language separately. Specialized verbs are said to evoke a semantic frame, a sort of conceptual scenario in which a number of mandatory elements (core Frame Elements) play specific roles (e.g. ARGUER, JUDGE, LAW), but specialized verbs are often accompanied by other optional information (non-core Frame Elements), such as the criteria and reasons used by the judge to reach a decision (statutes, codes, previous decisions). The information concerning the semantic frame that each verb evokes was encoded in an XML editor, and about twenty contexts illustrating the specific way each specialized verb evokes a given frame were semantically and syntactically annotated. The labels attributed to each semantic frame (e.g. [Compliance], [Verdict]) were used to group together certain synonyms, antonyms, and equivalent terms.
The research identified 165 pairs of candidate equivalents among the 200 Portuguese and English terms, which were grouped together into 76 frames. 71% of the pairs were considered full equivalents because not only do the verbs evoke the same conceptual scenario, but their actantial structures, the linguistic realizations of the actants, and their syntactic patterns were also similar. 29% of the pairs did not entirely meet these criteria and were considered partial equivalents. Reasons for partial equivalence are provided along with illustrative examples. Finally, the study describes the semasiological and onomasiological entry points that JuriDiCo, the bilingual lexical resource compiled during the project, offers to future users.

    Visualización del lenguaje a través de corpus

    Digital version of the print publication, published in A Coruña: Universidade da Coruña, Servizo de Publicacións, 2010 (ISBN 978-84-9749-401-4). This book contains the papers presented at the Second International Conference on Corpus Linguistics, held at the University of A Coruña in 2010 and organised by the MuStE group. The essays deal with different aspects of corpus linguistics, both as a methodology and as a branch of Linguistics.
[Abstract] The collection of essays presented here is a mere sample of the interest that topics relating to Corpus Linguistics have aroused everywhere. Topics as different as those related to Computational Linguistics, found in "Obtaining computational resources for languages with scarce resources from closely related computationally-developed languages. The Galician and Portuguese case", or those belonging to the field of Corpus and Literary Studies, such as "Corpus-Based Modelling of Lexical Changes in Manic Depression Disorders: The Case of Edgar Allan Poe", can be found in the ensuing pages. Almost all research areas can nowadays be investigated using Corpus Linguistics as a valid methodology. This is the reason why Language Windowing through Corpora gathers papers dealing with discourse, variation and change, grammatical studies, lexicology and lexicography, corpus design, contrastive analyses, language acquisition and learning, and translation. The title of this work aims to reflect not only the great variety of topics gathered in it but also the worldwide interest awakened by the computer processing of language. In fact, researchers from many different institutions all over the world have contributed to this book. Apart from the twenty-two Spanish universities, researchers from higher education institutions in other countries have authored and co-authored the essays contained here, namely Russia, Venezuela, Brazil, the UK, Finland, Portugal, Poland, Austria, Mexico, Thailand, Iran, the Netherlands, Belgium, Japan, Turkey, China, Italy, Malaysia, Romania, and Sweden. All the essays are arranged alphabetically by author name in two parts: Part 1 contains the papers by authors from A to K, and Part 2 those by authors from L to Z.

    Tune your brown clustering, please

    Brown clustering, an unsupervised hierarchical clustering technique based on n-gram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration, and the appropriateness of this configuration has gone largely unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parameter tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the interplay between the input corpus size, the chosen number of classes, and the quality of the resulting clusters, which has implications for any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal.