108 research outputs found

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on parallel data, and phrases that are not present in the training data are not translated correctly. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and the results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction in out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model. JRC.G.2-Global security and crisis managemen

    Liage de données RDF : évaluation d'approches interlingues (RDF data interlinking: evaluation of cross-lingual approaches)

    The Semantic Web extends the Web by publishing structured and interlinked data using RDF. An RDF data set is a graph where resources are nodes labelled in natural languages. One of the key challenges of linked data is to be able to discover links across RDF data sets. Given two data sets, equivalent resources should be identified and linked by owl:sameAs links. This problem is particularly difficult when resources are described in different natural languages. This thesis investigates the effectiveness of linguistic resources for interlinking RDF data sets. For this purpose, we introduce a general framework in which each RDF resource is represented as a virtual document containing text information of neighboring nodes. The context of a resource consists of the labels of the neighboring nodes. Once virtual documents are created, they are projected into the same space in order to be compared. This can be achieved by using machine translation or multilingual lexical resources. Once documents are in the same space, similarity measures are applied to find identical resources. Similarity between elements of this space is taken as similarity between RDF resources. We evaluated cross-lingual techniques within the proposed framework, experimentally comparing different methods for linking RDF data. In particular, two strategies are explored: applying machine translation or using references to multilingual resources. Overall, evaluation shows the effectiveness of cross-lingual string-based approaches for linking RDF resources expressed in different languages. The methods have been evaluated on resources in English, Chinese, French and German. The best performance (over 0.90 F-measure) was obtained by the machine translation approach. This shows that the similarity-based method can be successfully applied to RDF resources independently of their type (named entities or thesauri concepts). 
The best experimental results involving just a pair of languages demonstrated the usefulness of such techniques for interlinking RDF resources cross-lingually.
    The Web of Data extends the Web by publishing structured and linked data in RDF. An RDF data set is a directed graph in which resources can be nodes labelled in natural languages. One of the main challenges is to discover links between RDF data sets. Given two data sets, this consists of finding equivalent resources and linking them with owl:sameAs links. This problem is particularly difficult when the resources are described in different natural languages. This thesis studies the effectiveness of linguistic resources for interlinking data expressed in different languages. Each RDF resource is represented as a virtual document containing the textual information of neighboring nodes. The labels of the neighboring nodes constitute the context of a resource. Once the documents are created, they are projected into the same space in order to be compared. This can be achieved using machine translation or multilingual lexical resources. Once the documents are in the same space, similarity measures are applied to find identical resources. The similarity between documents is taken as the similarity between RDF resources. We experimentally evaluate different methods for linking RDF data. In particular, two strategies are explored: applying machine translation and using multilingual terminological and lexical databases. Overall, the evaluation shows the effectiveness of this type of approach. The methods were evaluated on resources in English, Chinese, French, and German. The best results (F-measure > 0.90) were obtained with machine translation. 
The evaluation shows that the similarity-based method can be successfully applied to RDF resources independently of their type (named entities or dictionary concepts)
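
The pipeline this abstract describes (build a virtual document per resource from neighboring labels, project both sides into one language, then compare) might be sketched as follows. This is only an illustration under stated assumptions: the toy graph, the word-by-word `lexicon` standing in for machine translation, the bag-of-words cosine measure, and the 0.5 threshold are all hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter

def virtual_document(resource, labels, neighbors):
    """Concatenate the label of a resource with the labels of its neighboring nodes."""
    nodes = [resource] + neighbors.get(resource, [])
    return " ".join(labels[n] for n in nodes if n in labels)

def translate(text, lexicon):
    """Toy word-by-word projection; a real system would call an MT service here."""
    return " ".join(lexicon.get(tok, tok) for tok in text.lower().split())

def cosine(a, b):
    """Cosine similarity between bag-of-words vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def link(source_docs, target_docs, threshold=0.5):
    """Propose one candidate owl:sameAs link per source resource (best match above threshold)."""
    links = []
    if not target_docs:
        return links
    for s, sdoc in source_docs.items():
        best, score = max(((t, cosine(sdoc, tdoc)) for t, tdoc in target_docs.items()),
                          key=lambda pair: pair[1])
        if score >= threshold:
            links.append((s, best, score))
    return links

# Hypothetical toy data: two tiny graphs, one labelled in French, one in English.
fr_labels = {"fr:Paris": "Paris", "fr:France": "France", "fr:Capitale": "capitale",
             "fr:Berlin": "Berlin", "fr:Allemagne": "Allemagne"}
fr_neighbors = {"fr:Paris": ["fr:France", "fr:Capitale"],
                "fr:Berlin": ["fr:Allemagne", "fr:Capitale"]}
en_labels = {"en:Paris": "Paris", "en:France": "France", "en:Capital": "capital",
             "en:Berlin": "Berlin", "en:Germany": "Germany"}
en_neighbors = {"en:Paris": ["en:France", "en:Capital"],
                "en:Berlin": ["en:Germany", "en:Capital"]}
lexicon = {"capitale": "capital", "allemagne": "germany"}  # assumed bilingual lexicon

source_docs = {r: translate(virtual_document(r, fr_labels, fr_neighbors), lexicon)
               for r in fr_neighbors}
target_docs = {r: virtual_document(r, en_labels, en_neighbors) for r in en_neighbors}
links = link(source_docs, target_docs)
for s, t, score in links:
    print(f"{s} owl:sameAs {t} (similarity {score:.2f})")
```

Taking only the single best match per source resource is a simplification; other matching strategies or similarity measures could be swapped in without changing the overall shape of the pipeline.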

    Theory and Applications for Advanced Text Mining

    Thanks to the growth of computer and web technologies, we can easily collect and store large amounts of text data, and it is reasonable to believe that these data contain useful knowledge. Text mining techniques have been studied intensively since the late 1990s in order to extract that knowledge. Although many important techniques have been developed, the text mining research field continues to expand to meet the needs arising from various application fields. This book is composed of 9 chapters introducing advanced text mining techniques. They cover a range of topics, from relation extraction to under-resourced or less-resourced languages. I believe that this book will contribute new knowledge to the text mining field and help many readers open up new research fields

    Linked Open Data - Creating Knowledge Out of Interlinked Data: Results of the LOD2 Project

    Database Management; Artificial Intelligence (incl. Robotics); Information Systems and Communication Servic

    A comparative study of Arthur John Arberry’s and Desmond O’Grady’s translations of the seven Mu‘allaqāt

    This study investigates the politicisation of Arthur John Arberry’s and Desmond O’Grady’s translations of the seven Mu‘allaqāt, drawing on Pierre Bourdieu’s sociological theory. It presents a sociology of translation based on five of the conceptual tools that Bourdieu employs to understand social reality, and uses them to study the influence of social norms on the two translators’ decisions. The study foregrounds the fact that Arberry’s and O’Grady’s translations were similarly produced in highly politicised societies due to the British and later the American involvement in the Middle East, and it argues that British and American propaganda respectively formed the doxa about Arabs at the times the translations were produced and influenced the representation of Arabs in each translation. The study aims to advance the understanding of the influence of the socio-political context on poetry translation, which has rarely been studied. A review of extant English translations of the Mu‘allaqāt defines the boundaries of the field, specifies its key players and the factors that shaped their habitus, highlights the major types of capital over which these players struggle, and thus helps to situate Arberry’s and O’Grady’s translations in the field. The theoretical framework of this study draws on Bourdieu’s sociology in order to establish the link between politics and Anglophone literary fields during the time the translations were produced. It thus tests Bourdieu’s sociology in the study of poetry translation. The theoretical framework employs Skopostheorie to explain the different approaches that the two translators adopt to the translation; it also draws on the domestication/foreignisation model. 
The study analyses and compares the two translators’ choices of methodologies which ultimately result in characterising their representations of the Arab reality described in the Mu‘allaqāt by essentialism, absence, and otherness that have been the three characteristics of Orientalist representation of the non-West since the eighteenth century. The analysis reveals how the decisions of both translators result in problems such as distorting or altering Arab reality, or in obstructing the message of the original qaṣīdas. The study concludes that the socio-political context had its impact on Arberry’s and O’Grady’s translation choices in spite of the different purposes of their translations. It also concludes that the socio-political context seems to have influenced O’Grady’s choices relating to style. Furthermore, it sheds light on the problems that result from the influence of the socio-political circumstances on the translators’ decisions, and offers suggestions for avoiding such problems

    Translating The Tale of Khun Chang Khun Phaen: representations of culture, gender and Buddhism

    A recent major work on Thai-English poetry translation is The Tale of Khun Chang Khun Phaen (2010/2012), the only complete translation into any language of the Thai-language epic poem Sepha rueang Khun Chang Khun Phaen (KCKP). Chris Baker and Pasuk Phongpaichit, the translators, mainly render their translation of the epic verse into prose. Their translation is an English version of the standard accounts as edited by Prince Damrong Rajanubhap in 1917–1918 with a slight revision in 1925 and older manuscripts, notably the Wat Ko edition. Baker and Pasuk’s intervention manifests itself at textual level for they restored a great number of passages excised by Damrong. The reinstated segments include censored female sexuality, monk clowning and the less violent account of the creation of Goldchild (กุมารทอง). In the standard edition, Damrong did not allow Siamese women to be sexually expressive and Buddhist monks to be clowns in the national literature he helped shape. The violent account of the Goldchild creation Damrong chose for his standard edition vilifies the leading male character, Khun Phaen. To identify approaches to translating a Thai epic poem into English, twenty-four segments rendered into verse passages, twenty key culturally specific items (CSIs) and four paratextual elements, which also represent the text as a whole, are analysed. This interdisciplinary study takes into account the socio-cultural contexts and aesthetic norms prevalent in the periods in which the source texts were written. The sociological approach in which the method of interview is employed is also adopted in this study. The translated text, paratext and responses from the interviews are analysed to identify translation strategies and procedures and whether the translators conformed to the ‘textual system’ of their time so that their translation of an unrecognised national literature would be admitted to the fellowship of world literature

    "The Tale of the Tribe": The Twentieth-Century Alliterative Revival

    This thesis studies the revival of Old English- and Norse-inspired alliterative versification in twentieth-century English poetry and poetics. It is organised as a chronological sequence of three case-studies: three authors, heirs to Romantic Nationalism, writing at twentieth-century intersections between Modernism, Postmodernism, and Medievalism. It indicates why this form attracted revival; which medieval models were emulated, with what success, in which modern works: the technique and mystique of alliterative verse as a modern mode. It differs from previous scholarship by advocating Kipling and Tolkien, by foregrounding the primacy of language, historical linguistics, especially the philological reconstruction of Germanic metre; and by, accordingly, methodological emphasis on formal scansion, taking account of audio recordings of Pound and Tolkien performing their poetry. It proposes the revived form as archaising, epic, mythopoeic, constructed by its exponents as an authentic poetic speech symbolising an archetypical Englishness—‘The Tale of the Tribe’. A trope emerges of revival of the culturally-‘buried’ native and innate, an ancestral lexico-metrical heritage conjured back to life. A substantial Introduction offers a primer of Old English metre and style: how it works, and what it means, according to Eduard Sievers’ (1850-1932) reconstruction. Chapter I promotes Rudyard Kipling (1865-1936) as pioneering alliterative poet, his engagement with Old-Northernism, runes, and retelling of the myth of Weland. Chapter II assesses the impact of Anglo-Saxon on and through Ezra Pound (1885-1972). Scansions of his ‘Seafarer’ and Cantos testify to the influence of Saxonising versification in the development of Pound’s Modernist language and free verse. Chapter III exhibits the alliterative oeuvre of J. R. R. Tolkien (1892-1973), featuring close readings of verse from Lord of the Rings. 
The Conclusion contends that twentieth-century English poetry should be recognised as evincing an ambitious alliterative revival, impossible before, and that this ancient metre is likely to endure into the future

    Tune your brown clustering, please

    Brown clustering, an unsupervised hierarchical clustering technique based on n-gram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone largely unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parameter tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, the chosen number of classes, and the quality of the resulting clusters, which has implications for any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal
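
To make the "mutual information" in this abstract concrete, the sketch below computes the average mutual information between adjacent word classes for a fixed word-to-class assignment. It assumes the common bigram formulation of the Brown clustering objective; the function name, the toy corpus, and the two class assignments are hypothetical, and this is not the paper's model of clustering utility.

```python
import math
from collections import Counter

def class_bigram_mutual_information(tokens, word_to_class):
    """Average mutual information (in bits) between the classes of adjacent tokens."""
    classes = [word_to_class[w] for w in tokens]
    bigrams = Counter(zip(classes, classes[1:]))
    total = sum(bigrams.values())
    if total == 0:
        return 0.0
    left, right = Counter(), Counter()
    for (c1, c2), n in bigrams.items():
        left[c1] += n    # how often c1 occurs as the first class of a bigram
        right[c2] += n   # how often c2 occurs as the second class of a bigram
    mi = 0.0
    for (c1, c2), n in bigrams.items():
        p_joint = n / total
        mi += p_joint * math.log2(p_joint / ((left[c1] / total) * (right[c2] / total)))
    return mi

# Hypothetical toy corpus and two candidate clusterings with different numbers of classes.
tokens = "the cat sat the dog sat the cat ran".split()
one_class = {w: 0 for w in tokens}
three_classes = {"the": 0, "cat": 1, "dog": 1, "sat": 2, "ran": 2}

mi_one = class_bigram_mutual_information(tokens, one_class)
mi_three = class_bigram_mutual_information(tokens, three_classes)
print(f"1 class: {mi_one:.3f} bits, 3 classes: {mi_three:.3f} bits")
```

Collapsing every word into one class leaves no information about the next class, while a finer assignment retains more; this trade-off is what the number-of-classes setting controls.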