
    A language-independent method for the alignment of parallel corpora

    PACLIC 20 / Wuhan, China / 1-3 November 2006

    Comparability measurement for terminology extraction

    Proceedings of the Workshop CHAT 2011: Creation, Harmonization and Application of Terminology Resources. Editors: Tatiana Gornostay and Andrejs Vasiļjevs. NEALT Proceedings Series, Vol. 12 (2011), 3-10. © 2011 The editors and contributors. Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt. Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/16956.

    A study of the equivalence of terms automatically extracted from a parallel corpus: a contribution to bilingual terminology extraction

    Thesis digitized by the Division de la gestion de documents et des archives, Université de Montréal

    Cross-Lingual Annotation Projection for Unsupervised Semantic Tagging

    This work focuses on the development of linguistic analysis tools for resource-poor languages. In a previous study, we proposed a method that automatically induces a morpho-syntactic analyser by cross-lingual projection of linguistic annotations from parallel corpora, based on Recurrent Neural Networks (RNNs). In this paper, we present an improvement of that neural model: we investigate the inclusion of external information (POS tags) in the network to train a multilingual SuperSense tagger. We demonstrate the validity and genericity of the method using parallel corpora obtained by manual or automatic translation. Our experiments cover cross-lingual annotation projection from English to French and Italian.
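    The core projection idea the abstract describes can be sketched in a few lines: tags on source-language tokens are copied to their aligned target-language tokens via word alignments from a parallel corpus. This is a toy illustration with made-up data, not the paper's actual pipeline (which trains an RNN tagger on such projected annotations).

    ```python
    def project_annotations(src_tags, alignments, tgt_len, default="UNK"):
        """Copy a tag from each aligned source token to its target token.

        src_tags   -- list of tags for the source sentence
        alignments -- list of (src_index, tgt_index) word-alignment pairs
        tgt_len    -- number of tokens in the target sentence
        """
        tgt_tags = [default] * tgt_len
        for s, t in alignments:
            tgt_tags[t] = src_tags[s]
        return tgt_tags

    # English -> French toy example; the alignment pairs are invented
    # purely for illustration.
    en_tags = ["PRON", "VERB", "NOUN"]      # "I eat apples"
    align = [(0, 0), (1, 1), (2, 3)]        # "je mange des pommes"
    print(project_annotations(en_tags, align, 4))
    # ['PRON', 'VERB', 'UNK', 'NOUN']
    ```

    Unaligned target tokens keep a default label; in practice such gaps are one of the main sources of noise that projection-based taggers must learn to tolerate.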

    French-English Terminology Extraction from Comparable Corpora


    Proceedings

    Proceedings of the Workshop CHAT 2011: Creation, Harmonization and Application of Terminology Resources. Editors: Tatiana Gornostay and Andrejs Vasiļjevs. NEALT Proceedings Series, Vol. 12 (2011).

    Enhancing BERT for Lexical Normalization

    Language-model-based pre-trained representations have become ubiquitous in natural language processing and have been shown to significantly improve the performance of neural models on a great variety of tasks. However, it remains unclear how useful those general models are for handling non-canonical text. In this article, focusing on User Generated Content (UGC) in a resource-scarce scenario, we study the ability of BERT (Devlin et al., 2018) to perform lexical normalisation. Our contribution is simple: by framing lexical normalisation as a token prediction task, by enhancing its architecture and by carefully fine-tuning it, we show that BERT can be a competitive lexical normalisation model without the need for any UGC resources aside from 3,000 training sentences. To the best of our knowledge, this is the first work adapting and analysing the ability of this model to handle noisy UGC data.
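    "Framing lexical normalisation as a token prediction task" means the model predicts one normalised output token per noisy input token, exactly like a tagger. A minimal sketch of that framing, with a toy lookup table standing in for the fine-tuned BERT classifier (all entries are invented examples, not from the paper):

    ```python
    # Illustrative noisy-token -> normalised-token entries (hypothetical).
    NORMALISATION_TABLE = {
        "u": "you",
        "gr8": "great",
        "pls": "please",
    }

    def normalise(tokens):
        # One prediction per input token; unseen tokens map to themselves,
        # the "copy" prediction that most normalisers rely on, since the
        # majority of UGC tokens need no change.
        return [NORMALISATION_TABLE.get(t.lower(), t) for t in tokens]

    print(normalise(["u", "r", "gr8"]))
    # ['you', 'r', 'great']
    ```

    The one-output-per-input framing is what lets a token-classification head sit directly on top of BERT's per-token representations, which is why so few in-domain sentences suffice for fine-tuning.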

    Towards the automatic processing of Yongning Na (Sino-Tibetan): developing a 'light' acoustic model of the target language and testing 'heavyweight' models from five national languages

    Automatic speech processing technologies hold great potential to facilitate the urgent task of documenting the world's languages. The present research explores the application of speech recognition tools to a little-documented language, with a view to facilitating annotation, transcription and linguistic analysis. The target language is Yongning Na (a.k.a. Mosuo), an unwritten Sino-Tibetan language with fewer than 50,000 speakers. An acoustic model of Na was built using CMU Sphinx. In addition to this 'light' model, trained on a small data set (only 4 hours of speech from a single speaker), 'heavyweight' models from five national languages (English, French, Chinese, Vietnamese and Khmer) were applied to the same data. Preliminary results are reported, and perspectives for the long road ahead are outlined.