
    Efficient High-precision Boilerplate Detection Using Multilayer Perceptrons (Extended abstract of unpublished paper)

    Abstract: Removal of boilerplate is among the essential tasks in web corpus construction and web indexing. In this paper, we present an improved machine learning approach to general-purpose boilerplate detection for languages based on (extended) Latin alphabets (easily adaptable to other scripts). We keep it highly efficient (around 320 documents per second on a single CPU core) by using an optimized Multilayer Perceptron implementation, while achieving around 95% correct classifications (Precision, Recall, and F1 score all over 0.95) by extracting suitable text-block-internal features. We finally compare the performance of the Multilayer Perceptron to that of other classifiers such as Support Vector Machines.
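    The abstract does not spell out the feature set or the optimized implementation, so the following is only a minimal sketch of the general idea, assuming scikit-learn's MLPClassifier, a handful of invented block-internal features, and a toy annotated sample in place of real training data.

        # Minimal sketch (not the paper's implementation): an MLP over simple
        # text-block-internal features, trained with scikit-learn on toy data.
        from sklearn.neural_network import MLPClassifier
        from sklearn.metrics import precision_recall_fscore_support

        def block_features(block):
            """A few block-internal features; the paper's actual feature set differs."""
            tokens = block.split() or [""]
            chars = max(len(block), 1)
            return [
                len(tokens),                                   # block length in tokens
                sum(len(t) for t in tokens) / len(tokens),     # mean token length
                sum(c.isupper() for c in block) / chars,       # uppercase ratio
                sum(c.isdigit() for c in block) / chars,       # digit ratio
                block.count(",") / len(tokens),                # comma density
                1.0 if block.rstrip().endswith((".", "!", "?")) else 0.0,  # sentence-final
            ]

        # Toy annotated blocks: 1 = main content, 0 = boilerplate (menus, footers, ...).
        train_blocks = [
            ("The committee approved the proposal after a long debate.", 1),
            ("Researchers collected over two million documents from the open web.", 1),
            ("The results suggest a clear improvement over the baseline system.", 1),
            ("Classification accuracy rose steadily as more features were added.", 1),
            ("Home | About | Contact | Imprint", 0),
            ("Copyright 2014 All rights reserved", 0),
            ("Click here to subscribe to our newsletter", 0),
            ("Share on Facebook Share on Twitter", 0),
        ]
        X = [block_features(b) for b, _ in train_blocks]
        y = [label for _, label in train_blocks]

        clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
        clf.fit(X, y)

        # Evaluated on the training data only, purely for illustration.
        p, r, f1, _ = precision_recall_fscore_support(y, clf.predict(X), average="binary")
        print(f"Precision={p:.2f} Recall={r:.2f} F1={f1:.2f}")

    In practice, the speed of such a classifier depends mostly on the cost of feature extraction per block, which is why the paper's choice of cheap block-internal features matters as much as the classifier itself.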

    Influence of geographic biases on geolocation prediction in Twitter

    Geolocating Twitter users, the task of identifying their home locations, serves a wide range of community and business applications such as managing natural crises, journalism, and public health. While users can record their location on their profiles, more than 34% record fake or sarcastic locations. Twitter allows users to attach GPS coordinates to their content, but fewer than 1% of tweets are geotagged. Inferring user location has therefore been an active field of investigation since 2010. This thesis investigates two of the most important factors that can affect the quality of inferring user location: (i) the influence of tweet language, and (ii) the effectiveness of the evaluation process.

    Previous research observed that Twitter users writing in some languages appeared to be easier to locate than those writing in others. It speculated that the geographic coverage of a language (language bias), represented by the number of locations from which tweets in that language originate, played an important role in determining location accuracy, so important that accuracy might be largely predictable from language alone. In this thesis, I investigate the influence of language bias on the accuracy of geolocating Twitter users. The analysis, using a large corpus of tweets written in thirteen languages and a re-implementation of a geolocation model that was state of the art at the time, provides a new understanding of the reasons behind the reported performance disparities between languages. The results show that data imbalance in the distribution of Twitter users over locations (population bias) has a greater impact on accuracy than language bias. A comparison between micro and macro averaging demonstrates that existing evaluation approaches are less appropriate than previously thought; the results suggest that both averaging approaches should be used to evaluate geolocation effectively.

    Many approaches have been proposed for automatically geolocating users, and various evaluation metrics have been proposed to measure their effectiveness, making it challenging to determine which metric is most suitable for the task. In this thesis, I provide a standardized evaluation framework for geolocation systems. The framework is used to analyze fifteen Twitter user geolocation models and two baselines in a controlled experimental setting. The models comprise the re-implemented model and a variation of it, two locally retrained open-source models, and eleven models submitted to a shared task. Models are evaluated using ten metrics, out of fourteen employed in previous research, over four geographic granularities. Rank correlations and thorough statistical analysis are used to assess the effectiveness of these metrics. The results demonstrate that the choice of effectiveness metric can have a substantial impact on the conclusions drawn from a geolocation experiment, potentially leading experimenters to contradictory conclusions about relative effectiveness. For general evaluations, a range of performance metrics should be reported to convey a complete picture of system effectiveness. Although many complex geolocation algorithms have been applied in recent years, a majority-class baseline remains competitive at coarse geographic granularity. A suite of statistical tests, chosen according to the metric employed, is proposed to ensure that the results are not coincidental.
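    As a concrete illustration of the micro versus macro averaging point above, the sketch below computes both averages for city-level predictions. The city names, prediction pairs, and the accuracy measure are purely hypothetical and are not taken from the thesis or its evaluation framework.

        # Illustrative micro- vs. macro-averaged accuracy for city-level
        # geolocation predictions; the data and metric are hypothetical.
        from collections import defaultdict

        # (gold_city, predicted_city) pairs; one dominant city mimics the
        # population bias discussed in the abstract.
        pairs = [
            ("jakarta", "jakarta"), ("jakarta", "jakarta"), ("jakarta", "jakarta"),
            ("jakarta", "jakarta"), ("jakarta", "jakarta"), ("jakarta", "jakarta"),
            ("london", "jakarta"), ("london", "london"),
            ("paris", "jakarta"), ("nairobi", "jakarta"),
        ]

        # Micro average: every user counts equally, so large cities dominate.
        micro = sum(gold == pred for gold, pred in pairs) / len(pairs)

        # Macro average: accuracy per gold city, then averaged over cities,
        # so small cities weigh as much as large ones.
        per_city = defaultdict(list)
        for gold, pred in pairs:
            per_city[gold].append(gold == pred)
        macro = sum(sum(v) / len(v) for v in per_city.values()) / len(per_city)

        print(f"micro accuracy = {micro:.2f}")  # 0.70, inflated by the majority city
        print(f"macro accuracy = {macro:.2f}")  # 0.38, exposes failure on rare cities

    A system that always predicts the majority city scores well under micro averaging but poorly under macro averaging, which is why the thesis recommends reporting both.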

    Workshop Proceedings of the 12th edition of the KONVENS conference

    The 2014 edition of KONVENS is even more of a forum for exchange: its main topic is the interaction between Computational Linguistics and Information Science, and the synergies that such interaction, cooperation, and integrated views can produce. This topic sits at the crossroads of research traditions that treat natural language as a container of knowledge and develop methods to extract and manage linguistically represented knowledge. It is close to the heart of many researchers at the Institut für Informationswissenschaft und Sprachtechnologie of Universität Hildesheim: it has long been one of the institute's research topics and has received even more attention over the last few years.

    Construction de corpus généraux et spécialisés à partir du Web

    At the beginning of the first chapter, the interdisciplinary setting between linguistics, corpus linguistics, and computational linguistics is introduced. Then the notion of corpus is put into focus. Existing corpus and text definitions are discussed. Several milestones of corpus design are presented, from pre-digital corpora at the end of the 1950s to web corpora in the 2000s and 2010s. The continuities and changes between the linguistic tradition and web-native corpora are exposed.

    In the second chapter, methodological insights on automated text scrutiny in computer science, computational linguistics, and natural language processing are presented. The state of the art on text quality assessment and web text filtering exemplifies current interdisciplinary research trends on web texts. Readability studies and automated text classification are used as exemplars of methods for finding salient features in order to grasp text characteristics. Text visualization exemplifies corpus processing in the digital humanities framework. As a conclusion, guiding principles for research practice are listed, and reasons are given to find a balance between quantitative analysis and corpus linguistics, in an environment shaped by technological innovation and artificial intelligence techniques.

    Third, current research on web corpora is summarized. I distinguish two main approaches to web document retrieval: restricted retrieval and web crawling. The notion of web corpus preprocessing is introduced and its salient steps are discussed. The impact of the preprocessing phase on research results is assessed. I explain why the importance of preprocessing should not be underestimated, and why it is important for linguists to learn new skills in order to confront the whole data gathering and preprocessing phase.

    I present my work on web corpus construction in the fourth chapter. My analyses concern two main aspects: first, the question of corpus sources (or prequalification), and second, the problem of including valid, desirable documents in a corpus (or document qualification). Last, I present work on corpus visualization, consisting of extracting certain corpus characteristics in order to give indications about corpus contents and quality.

    [French abstract, translated:] The first chapter opens with a description of the interdisciplinary context. The concept of corpus is then presented with respect to the state of the art. The need for evidence that is linguistic in nature yet spans several disciplines is illustrated by several research scenarios. Several key stages of corpus construction are traced, from pre-digital corpora at the end of the 1950s to the web corpora of the 2000s and 2010s. The continuities and changes between the linguistic tradition and corpora drawn from the web are exposed.

    The second chapter gathers methodological considerations. The state of the art on assessing text quality is described. Then the methods used in readability studies and in automated text classification are summarized, and common denominators are isolated. Finally, text visualization demonstrates the value of corpus analysis for the digital humanities. The reasons for finding a balance between quantitative analysis and corpus linguistics are addressed.

    The third chapter summarizes the contribution of the thesis to research on corpora drawn from the web. The question of data collection is examined with particular attention, especially the case of source URLs. The notion of web corpus preprocessing is introduced and its major steps are sketched. The impact of preprocessing on the results is evaluated. The question of the simplicity and reproducibility of corpus construction is put forward.

    The fourth part describes the contribution of the thesis to corpus construction proper, through the question of sources and the problem of invalid or undesirable documents. An approach using a light scout to prepare the web crawl is presented. Then, work on selecting documents just before their inclusion in a corpus is summarized: it is possible to draw on readability studies as well as machine learning techniques during corpus construction. A set of textual features tested on annotated samples assesses the effectiveness of the procedure. Finally, work on corpus visualization is addressed: extraction of characteristics at the corpus level in order to give indications about its composition and quality.
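    As a rough illustration of document qualification with readability-style textual features, the sketch below filters downloaded pages before corpus inclusion. The feature set, thresholds, and helper names are assumptions for illustration, not the procedure evaluated in the thesis.

        # Hypothetical document-qualification filter using a few readability-style
        # features; feature names and thresholds are illustrative assumptions.
        import re

        def text_features(text):
            """Simple document-level features for corpus inclusion decisions."""
            tokens = re.findall(r"\w+", text)
            sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
            n_tokens = max(len(tokens), 1)
            return {
                "n_tokens": len(tokens),
                "mean_sentence_len": n_tokens / max(len(sentences), 1),
                "type_token_ratio": len({t.lower() for t in tokens}) / n_tokens,
                "digit_ratio": sum(t.isdigit() for t in tokens) / n_tokens,
            }

        def qualifies(text):
            """Keep pages that look like running text, drop list- or spam-like ones."""
            f = text_features(text)
            return (
                f["n_tokens"] >= 100                     # long enough to be useful
                and 5 <= f["mean_sentence_len"] <= 60    # sentence-like structure
                and f["type_token_ratio"] >= 0.2         # not endlessly repetitive
                and f["digit_ratio"] <= 0.2              # not a price list or table
            )

        # Usage: filter downloaded pages before adding them to the corpus.
        pages = ["First candidate page text ...", "1 2 3 4 5 buy now buy now buy now"]
        corpus = [page for page in pages if qualifies(page)]

    A learned classifier over the same kind of features, trained on annotated samples as the abstract describes, would replace the hand-set thresholds used here.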