
    Distinguishing Noise and Main Text Content from Web-Sourced Plain Text Documents Using Sequential Neural Networks

    Boilerplate removal and the identification of the actual textual content is a crucial step in web corpus creation. However, existing methods do not always filter out the noise reliably and are often not applicable to plain text corpora. In this thesis, I will develop machine learning methods to identify the main textual content in plain text documents. I will utilize transfer learning and pretrained language models as a base for training monolingual models with French and Swedish data as well as a multilingual model with French, Swedish, English, Finnish, German and Spanish data. I will compare two machine learning architectures based on the XLM-RoBERTa language model: first a classification model built on top of the pretrained XLM-RoBERTa model, and second a model using an additional Long Short-Term Memory (LSTM) network layer. I will show that the LSTM layer improves the classification of the XLM-RoBERTa model and that the multilingual model performs well even with data in unseen languages. I will perform a further analysis of the results and show that the boilerplate detection results of the trained models differ across text varieties. Certain types of text documents, such as lyrical texts or discussion forum texts, pose challenges for boilerplate detection, and it would be beneficial for future research to focus on gathering data that has been difficult to clean.
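
    A minimal sketch of the two compared architectures is given below: XLM-RoBERTa with a plain token-classification head versus the same encoder followed by an LSTM layer before classification. The model name, label set and layer sizes are illustrative assumptions, not the exact thesis configuration.

        # Sketch in PyTorch: per-token content/noise classification with an optional
        # bidirectional LSTM layer on top of XLM-RoBERTa (hyperparameters assumed).
        import torch.nn as nn
        from transformers import AutoModel

        class XLMRBoilerplateTagger(nn.Module):
            def __init__(self, model_name="xlm-roberta-base", num_labels=2, use_lstm=True):
                super().__init__()
                self.encoder = AutoModel.from_pretrained(model_name)
                hidden = self.encoder.config.hidden_size
                self.use_lstm = use_lstm
                if use_lstm:
                    # The LSTM adds sequential context across tokens before the
                    # per-token decision between main content and boilerplate.
                    self.lstm = nn.LSTM(hidden, hidden // 2, batch_first=True,
                                        bidirectional=True)
                self.classifier = nn.Linear(hidden, num_labels)

            def forward(self, input_ids, attention_mask):
                states = self.encoder(input_ids=input_ids,
                                      attention_mask=attention_mask).last_hidden_state
                if self.use_lstm:
                    states, _ = self.lstm(states)
                return self.classifier(states)  # per-token logits: content vs. noise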

    Proceedings of the 12th Web as Corpus Workshop

    The web presents unprecedented opportunities for large-scale collection of text in many languages. However, two critical steps in the development of web corpora remain challenging: the identification of clean text from source HTML and the assignment of genre or register information to the documents. In this paper, we evaluate a multilingual approach to this end. Our starting points are the Swedish and French Common Crawl datasets gathered for the 2017 CoNLL shared task, particularly the URLs. We 1) fetch HTML pages based on the URLs and run boilerplate removal, 2) train a classifier to further clean out undesired text fragments, and 3) annotate text registers. We compare boilerplate removal against the CoNLL texts and find an improvement. For the further cleaning of undesired material, the best results are achieved using Multilingual BERT with monolingual fine-tuning. However, our results are also promising in a cross-lingual setting, without fine-tuning on the target language. Finally, the register annotations show that most of the documents belong to a relatively small set of registers, which are similar in the two languages. A number of additional flags in the annotation are, however, necessary to reflect the wide range of linguistic variation associated with the documents.
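
    As an illustration of steps 1) and 2), the sketch below fetches a page by URL, strips boilerplate, and passes the remaining fragments to a further cleaning step. trafilatura is used here only as a stand-in boilerplate remover, and keep_fragment() is a placeholder for the fine-tuned Multilingual BERT cleaner; neither name reflects the paper's actual setup.

        # Hypothetical pipeline sketch: fetch HTML for a CoNLL 2017 URL, remove
        # boilerplate, then filter the remaining fragments with a learned cleaner.
        import trafilatura

        def keep_fragment(fragment):
            # Placeholder for the fine-tuned Multilingual BERT classifier that
            # separates clean text from undesired material (step 2).
            return len(fragment.split()) > 3

        def fetch_and_clean(url):
            html = trafilatura.fetch_url(url)      # step 1: fetch the HTML page
            if html is None:
                return []
            text = trafilatura.extract(html)       # step 1: rule-based boilerplate removal
            if not text:
                return []
            fragments = [line for line in text.split("\n") if line.strip()]
            return [f for f in fragments if keep_fragment(f)]  # step 2: further cleaning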

    Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

    We explore cross-lingual transfer of register classification for web documents. Registers, that is, text varieties such as blogs or news, are one of the primary predictors of linguistic variation and thus affect the automatic processing of language. We introduce two new register-annotated corpora, FreCORE and SweCORE, for French and Swedish. We demonstrate that deep pre-trained language models perform strongly in these languages and outperform the previous state of the art in English and Finnish. Specifically, we show 1) that zero-shot cross-lingual transfer from the large English CORE corpus can match or surpass previously published monolingual models, and 2) that lightweight monolingual classification requiring very little training data can reach or surpass our zero-shot performance. We further analyse the classification results, finding that certain registers continue to pose challenges, in particular for cross-lingual transfer.
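
    A minimal sketch of the zero-shot transfer setting is given below: a multilingual encoder is fine-tuned on English CORE register labels and then applied unchanged to French or Swedish documents. The model name and the number of register labels are assumptions for illustration, not the paper's reported configuration.

        # Zero-shot cross-lingual register classification sketch (assumed model
        # name and label count; fine-tuning on English CORE is elided).
        import torch
        from transformers import AutoTokenizer, AutoModelForSequenceClassification

        model_name = "xlm-roberta-base"
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=8)

        # ... fine-tune `model` on English CORE documents and register labels here ...

        def predict_register(document):
            # The same fine-tuned weights are applied to a French or Swedish
            # document without any target-language training data (zero-shot).
            inputs = tokenizer(document, truncation=True, return_tensors="pt")
            with torch.no_grad():
                logits = model(**inputs).logits
            return int(logits.argmax(dim=-1))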

    D7.1. Criteria for evaluation of resources, technology and integration.

    This deliverable defines how evaluation is carried out at each integration cycle in the PANACEA project. As PANACEA aims at producing large-scale resources, evaluation becomes a critical and challenging issue. Critical, because it is important to assess the quality of the results that will be delivered to users. Challenging, because we explore rather new areas, and do so through a technical platform: some new methodologies will have to be developed, or old ones adapted.

    Construction de corpus généraux et spécialisés à partir du Web

    At the beginning of the first chapter, the interdisciplinary setting between linguistics, corpus linguistics, and computational linguistics is introduced. Then, the notion of corpus is put into focus. Existing corpus and text definitions are discussed. Several milestones of corpus design are presented, from pre-digital corpora at the end of the 1950s to web corpora in the 2000s and 2010s. The continuities and changes between the linguistic tradition and web-native corpora are laid out. In the second chapter, methodological insights on automated text scrutiny in computer science, computational linguistics and natural language processing are presented. The state of the art on text quality assessment and web text filtering exemplifies current interdisciplinary research trends on web texts. Readability studies and automated text classification serve as exemplary methods for finding salient features that capture text characteristics. Text visualization exemplifies corpus processing in the digital humanities framework. In conclusion, guiding principles for research practice are listed, and reasons are given for finding a balance between quantitative analysis and corpus linguistics, in an environment shaped by technological innovation and artificial intelligence techniques. Third, current research on web corpora is summarized. I distinguish two main approaches to web document retrieval: restricted retrieval and web crawling. The notion of web corpus preprocessing is introduced and its salient steps are discussed. The impact of the preprocessing phase on research results is assessed. I explain why the importance of preprocessing should not be underestimated and why it is important for linguists to learn new skills in order to handle the whole data gathering and preprocessing phase. I present my work on web corpus construction in the fourth chapter. My analyses concern two main aspects: first, the question of corpus sources (or prequalification), and second, the problem of including valid, desirable documents in a corpus (or document qualification). Last, I present work on corpus visualization, which consists of extracting certain corpus characteristics in order to give indications of corpus contents and quality.
    The first chapter opens with a description of the interdisciplinary context. The concept of corpus is then presented with respect to the state of the art. The need for evidence that is linguistic in nature yet spans several disciplines is illustrated by a number of research scenarios. Several key stages of corpus construction are retraced, from pre-digital corpora at the end of the 1950s to the web corpora of the 2000s and 2010s. The continuities and changes between the linguistic tradition and corpora drawn from the web are laid out. The second chapter gathers methodological considerations. The state of the art in text quality assessment is described. The methods used in readability studies and in automatic text classification are then summarized, and common denominators are isolated. Finally, text visualization demonstrates the value of corpus analysis for the digital humanities. The reasons for finding a balance between quantitative analysis and corpus linguistics are addressed. The third chapter summarizes the contribution of the thesis to research on corpora drawn from the internet. The question of data collection is examined with particular attention, especially the case of source URLs. The notion of web corpus preprocessing is introduced and its major steps are outlined. The impact of preprocessing on the results is evaluated. The question of the simplicity and reproducibility of corpus construction is brought to the fore. The fourth part describes the contribution of the thesis with regard to corpus construction proper, through the question of sources and the problem of invalid or undesirable documents. An approach using a light scout to prepare the web crawl is presented. Work on the selection of documents just before their inclusion in a corpus is then summarized: the findings of readability studies as well as machine learning techniques can be put to use during corpus construction. A set of textual features tested on annotated samples assesses the effectiveness of the procedure. Finally, work on corpus visualization is addressed: the extraction of characteristics at corpus scale in order to give indications about its composition and quality.
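
    As an illustration of the document qualification step, the sketch below computes a few readability-style features per candidate document and trains a simple classifier on annotated samples to decide inclusion. The features and the classifier are illustrative stand-ins, not the feature set used in the thesis.

        # Hypothetical document qualification sketch: readability-style features
        # plus a classifier trained on annotated samples (not the thesis setup).
        from sklearn.linear_model import LogisticRegression

        def text_features(text):
            words = text.split()
            sentences = [s for s in text.split(".") if s.strip()]
            avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
            avg_sent_len = len(words) / max(len(sentences), 1)
            digit_ratio = sum(c.isdigit() for c in text) / max(len(text), 1)
            return [avg_word_len, avg_sent_len, digit_ratio]

        def train_qualifier(samples):
            # samples: list of (document text, label), with label 1 = desirable document
            X = [text_features(text) for text, _ in samples]
            y = [label for _, label in samples]
            return LogisticRegression(max_iter=1000).fit(X, y)

        def qualify(model, text):
            # True if the document should be included in the corpus
            return bool(model.predict([text_features(text)])[0])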

    Adaptation of machine translation for multilingual information retrieval in the medical domain

    Objective. We investigate machine translation (MT) of user search queries in the context of cross-lingual information retrieval (IR) in the medical domain. The main focus is on techniques to adapt MT to increase translation quality; however, we also explore MT adaptation to improve the effectiveness of cross-lingual IR. Methods and Data. Our MT system is Moses, a state-of-the-art phrase-based statistical machine translation system. The IR system is based on the BM25 retrieval model implemented in the Lucene search engine. The MT techniques employed in this work include in-domain training and tuning, intelligent training data selection, optimization of phrase table configuration, compound splitting, and exploiting synonyms as translation variants. The IR methods include morphological normalization and using multiple translation variants for query expansion. The experiments are performed and thoroughly evaluated on three language pairs: Czech–English, German–English, and French–English. MT quality is evaluated on data sets created within the Khresmoi project, and IR effectiveness is tested on the CLEF eHealth 2013 data sets. Results. The search query translation results achieved in our experiments are outstanding: our systems outperform not only our strong baselines but also Google Translate and Microsoft Bing Translator in a direct comparison carried out on all the language pairs. The baseline BLEU scores increased from 26.59 to 41.45 for Czech–English, from 23.03 to 40.82 for German–English, and from 32.67 to 40.82 for French–English. This is a 55% improvement on average. In terms of IR performance on this particular test collection, a significant improvement over the baseline is achieved only for French–English. For Czech–English and German–English, the increased MT quality does not lead to better IR results. Conclusions. Most of the MT techniques employed in our experiments improve MT of medical search queries. The intelligent training data selection in particular proves very successful for domain adaptation of MT. Certain improvements are also obtained from German compound splitting on the source-language side. Translation quality, however, does not appear to correlate with IR performance: better translation does not necessarily yield better retrieval. We discuss in detail the contribution of the individual techniques and state-of-the-art features, and provide future research directions.
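
    The sketch below illustrates the query-expansion idea on the IR side: several translation variants of one query are merged into a single expanded BM25 query. rank_bm25 stands in for the Lucene BM25 implementation used in the paper, and the documents and variants are toy examples, not Khresmoi or CLEF eHealth data.

        # BM25 retrieval with multiple translation variants used for query
        # expansion (rank_bm25 as a stand-in for Lucene's BM25; toy data).
        from rank_bm25 import BM25Okapi

        documents = [
            "chronic kidney disease treatment options",
            "symptoms of seasonal influenza in adults",
        ]
        bm25 = BM25Okapi([doc.split() for doc in documents])

        # Several MT outputs for the same source-language query.
        translation_variants = ["kidney disease therapy", "renal disease treatment"]
        expanded_query = [tok for variant in translation_variants for tok in variant.split()]

        scores = bm25.get_scores(expanded_query)   # one BM25 score per document
        best = max(range(len(documents)), key=lambda i: scores[i])
        print(documents[best], scores[best])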