961 research outputs found

    Quootstrap: Scalable Unsupervised Extraction of Quotation-Speaker Pairs from Large News Corpora via Bootstrapping

    We propose Quootstrap, a method for extracting quotations, as well as the names of the speakers who uttered them, from large news corpora. Whereas prior work has addressed this problem primarily with supervised machine learning, our approach follows a fully unsupervised bootstrapping paradigm. It leverages the redundancy present in large news corpora, more precisely, the fact that the same quotation often appears across multiple news articles in slightly different contexts. Starting from a few seed patterns, such as ["Q", said S.], our method extracts a set of quotation-speaker pairs (Q, S), which are in turn used for discovering new patterns expressing the same quotations; the process is then repeated with the larger pattern set. Our algorithm is highly scalable, which we demonstrate by running it on the large ICWSM 2011 Spinn3r corpus. Validating our results against a crowdsourced ground truth, we obtain 90% precision at 40% recall using a single seed pattern, with significantly higher recall values for more frequently reported (and thus likely more interesting) quotations. Finally, we showcase the usefulness of our algorithm's output for computational social science by analyzing the sentiment expressed in our extracted quotations. Comment: Accepted at the 12th International Conference on Web and Social Media (ICWSM), 201
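The bootstrapping loop described in this abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `$Q`/`$S` placeholder syntax, the function names, the capitalized-words heuristic for speaker names, and the three example sentences are all assumptions made for illustration, and the sketch omits the scoring and filtering a real system would need to keep noisy patterns out.

```python
import re

Q_TOKEN, S_TOKEN = "$Q", "$S"  # hypothetical placeholder syntax


def pattern_to_regex(pattern):
    """Compile a template such as '"$Q", said $S.' into a regex."""
    escaped = re.escape(pattern)
    # A quotation is any run of characters without a double quote.
    escaped = escaped.replace(re.escape(Q_TOKEN), r'(?P<Q>[^"]+)')
    # Assumption: a speaker is one or more capitalized words.
    escaped = escaped.replace(
        re.escape(S_TOKEN), r"(?P<S>[A-Z]\w*(?: [A-Z]\w*)*)"
    )
    return re.compile("^" + escaped + "$")


def extract_pairs(sentences, patterns):
    """Apply every pattern to every sentence; collect (Q, S) pairs."""
    pairs = set()
    for pattern in patterns:
        regex = pattern_to_regex(pattern)
        for sentence in sentences:
            match = regex.match(sentence)
            if match:
                pairs.add((match.group("Q"), match.group("S")))
    return pairs


def discover_patterns(sentences, pairs):
    """Turn sentences containing a known pair into new templates."""
    found = set()
    for quotation, speaker in pairs:
        for sentence in sentences:
            if quotation not in sentence:
                continue
            candidate = sentence.replace(quotation, Q_TOKEN, 1)
            if speaker in candidate:
                found.add(candidate.replace(speaker, S_TOKEN, 1))
    return found


def bootstrap(sentences, seed_patterns, iterations=2):
    """Alternate between extracting pairs and discovering patterns."""
    patterns, pairs = set(seed_patterns), set()
    for _ in range(iterations):
        pairs |= extract_pairs(sentences, patterns)
        patterns |= discover_patterns(sentences, pairs)
    return pairs, patterns


# Invented example corpus: the first quotation appears in two contexts.
corpus = [
    '"We will win", said Jane Doe.',
    'Jane Doe told reporters: "We will win".',
    'John Roe told reporters: "Taxes are too high".',
]
pairs, patterns = bootstrap(corpus, ['"$Q", said $S.'])
```

In this toy corpus the seed pattern only matches the first sentence, but the repeated quotation yields a second template, which in turn picks up the John Roe quotation on the next iteration.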

    Automatic text filtering using limited supervision learning for epidemic intelligence

    [no abstract]

    Global Patterns of Synchronization in Human Communications

    Social media are transforming global communication and coordination. The data derived from social media can reveal patterns of human behavior at all levels and scales of society. Using geolocated Twitter data, we have quantified collective behaviors across multiple scales, ranging from the commutes of individuals, to the daily pulse of 50 major urban areas and global patterns of human coordination. Human activity and mobility patterns manifest the synchrony required for contingency of actions between individuals. Urban areas show regular cycles of contraction and expansion that resemble heartbeats linked primarily to social rather than natural cycles. Business hours and circadian rhythms influence daily cycles of work, recreation, and sleep. Different urban areas have characteristic signatures of daily collective activities. The differences are consistent with a new emergent global synchrony that couples behavior in distant regions across the world. We observe a globally synchronized peak that includes the exchange of ideas and information across Europe, Africa, Asia and Australasia. We propose a dynamical model to explain the emergence of global synchrony in the context of increasing global communication and reproduce the observed behavior. The collective patterns we observe show how social interactions lead to interdependence of behavior, manifest in the synchronization of communication. The creation and maintenance of temporally sensitive social relationships results in the emergence of complexity in the larger-scale behavior of the social system. Comment: 20 pages, 12 figures. arXiv admin note: substantial text overlap with arXiv:1602.0621

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.

    'I See Something You Don't See'. A Computational Analysis of the Digital Services Act and the Digital Markets Act

    In its latest proposals, the Digital Markets Act (DMA) and Digital Services Act (DSA), the European Commission puts forward several new obligations for online intermediaries, especially large online platforms and “gatekeepers.” Both are expected to serve as a blueprint for regulation in the United States, where lawmakers have also been investigating competition on digital platforms and new antitrust laws passed the House Judiciary Committee as of June 11, 2021. This Article investigates whether all stakeholder groups share the same understanding and use of the relevant terms and concepts of the DSA and DMA. Leveraging the power of computational text analysis, we find significant differences in the employment of terms like “gatekeepers,” “self-preferencing,” “collusion,” and others in the position papers of the consultation process that informed the drafting of the two latest Commission proposals. In addition, sentiment analysis shows that in some cases these differences also come with dissimilar attitudes. While this may not be surprising for new concepts such as gatekeepers or self-preferencing, the same is not true for other terms, like “self-regulatory,” which not only is used differently by stakeholders but is also viewed more favorably by medium and big companies and organizations than by small ones. We conclude by sketching out how different computational text analysis tools could be combined to provide many helpful insights for both rulemakers and legal scholars. Di Porto, Fabiana; Grote, Tatjana; Volpi, Gabriele; Invernizzi, Riccardo

    Physical activity measurement and indoor location for the assessment of daily routines in older adults

    This paper presents a proposal for monitoring older adults in order to infer activities of daily living (ADLs) and identify deviations in their routines that might need some kind of intervention. This monitoring is achieved by analysing the time spent in each room of their place of residence, which can be estimated with beacons based on BLE (Bluetooth Low Energy) technology. The BLE receiving beacons deployed in the environment detect the signal of the transmitting device carried by the user. The person is located using fingerprinting methods that process the received signal strength. Grado en Ingeniería en Electrónica y Automática Industria
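As a rough illustration of how received signal strength can be turned into time spent per room, the sketch below uses k-nearest-neighbour fingerprinting. The abstract only says that "some fingerprinting methods" are used, so k-NN is an assumed choice here, and the function names, the three-receiver layout, the RSSI values, and the sampling interval are all hypothetical.

```python
import math
from collections import Counter


def locate(fingerprints, reading, k=3):
    """Return the room whose stored fingerprints are closest to `reading`.

    fingerprints: list of (room, rssi_vector) pairs collected beforehand,
                  one RSSI value (dBm) per receiving beacon.
    reading:      RSSI vector observed for the device worn by the user.
    """
    ranked = sorted(fingerprints, key=lambda fp: math.dist(fp[1], reading))
    votes = Counter(room for room, _ in ranked[:k])
    return votes.most_common(1)[0][0]


def time_per_room(fingerprints, readings, interval_s, k=3):
    """Estimate seconds spent in each room from periodic readings."""
    samples = Counter(locate(fingerprints, r, k) for r in readings)
    return {room: count * interval_s for room, count in samples.items()}


# Invented calibration data for three receivers.
calibration = [
    ("kitchen", (-50, -80, -90)),
    ("kitchen", (-52, -78, -88)),
    ("bedroom", (-85, -55, -75)),
    ("bedroom", (-83, -57, -77)),
    ("living room", (-90, -80, -50)),
    ("living room", (-88, -82, -52)),
]
observed = [(-51, -79, -89), (-84, -56, -76), (-50, -81, -90)]
durations = time_per_room(calibration, observed, interval_s=10)
```

Comparing such per-room durations across days is one way a deviation from a usual routine could be flagged; the thresholds for doing so are outside the scope of this sketch.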

    Weakly-supervised Learning Approaches for Event Knowledge Acquisition and Event Detection

    The ability to detect events and to recognize temporal, subevent, or causality relations among events can facilitate many applications in natural language understanding. However, the supervised learning approaches that previous research mainly uses have two problems. First, due to the limited size of annotated data, supervised systems cannot sufficiently capture diverse contexts to distill universal event knowledge. Second, under certain application circumstances such as event recognition during emergent natural disasters, it is infeasible to spend days or weeks annotating enough data to train a system. My research aims to use weakly-supervised learning to address these problems and to achieve automatic event knowledge acquisition and event recognition. In this dissertation, I first introduce three weakly-supervised learning approaches that have been shown to be effective in acquiring event relational knowledge. Firstly, I explore the observation that regular event pairs show a consistent temporal relation despite their various contexts, and these rich contexts can be used to train a contextual temporal relation classifier to further recognize new temporal relation knowledge. Secondly, inspired by the double temporality characteristic of narrative texts, I propose a weakly supervised approach that identifies 287k narrative paragraphs using narratology principles and then extracts rich temporal event knowledge from the identified narratives. Lastly, I develop a subevent knowledge acquisition approach by exploiting two observations: 1) subevents are temporally contained by the parent event and 2) the definitions of the parent event can be used to guide the identification of subevents. I collect rich weak supervision to train a contextual BERT classifier and apply the classifier to identify new subevent knowledge. Recognizing texts that describe specific categories of events is also challenging due to language ambiguity and diverse descriptions of events. I therefore also propose a novel method to rapidly build a fine-grained event recognition system on social media texts for disaster management. My method creates high-quality weak supervision based on clustering-assisted word sense disambiguation and enriches tweet message representations using preceding context tweets and reply tweets in building event recognition classifiers.

    Data-efficient methods for information extraction

    Get PDF
    Structured knowledge representation systems such as knowledge bases and knowledge graphs provide insights into entities and the relationships among these entities in the real world. Such systems can be employed in various natural language processing applications such as semantic search, question answering and text summarization. It is infeasible and inefficient to populate these knowledge representation systems manually. In this work, we develop methods to automatically extract named entities and the relationships among them from plain text; our methods can therefore be used either to complete existing incomplete knowledge representation systems or to create a new structured knowledge representation system from scratch. Unlike mainstream supervised methods for information extraction, our methods focus on the low-data scenario and do not require a large amount of annotated data. In the first part of the thesis, we focus on the problem of named entity recognition. We participated in the Bacteria Biotope 2019 shared task, which consists of recognizing and normalizing biomedical entity mentions. Our linguistically informed named entity recognition system consists of a deep-learning-based model that can extract both nested and flat entities; the model employs several linguistic features and auxiliary training objectives to enable efficient learning in data-scarce scenarios. Our entity normalization system employs string match, fuzzy search and semantic search to link the extracted named entities to the biomedical databases. Our named entity recognition and entity normalization system achieved the lowest slot error rate of 0.715 and ranked first in the shared task. We also participated in two further shared tasks, Adverse Drug Effect Span Detection (English) and Profession Span Detection (Spanish); both tasks collect data from the social media platform Twitter. We developed a named entity recognition model that improves its input representation by stacking heterogeneous embeddings from diverse domains; our empirical results demonstrate complementary learning from these heterogeneous embeddings. Our submission ranked 3rd in both shared tasks. In the second part of the thesis, we explore synthetic data augmentation strategies to address low-resource information extraction in specialized domains. Specifically, we adapt backtranslation to the token-level task of named entity recognition and the sentence-level task of relation extraction. We demonstrate that backtranslation can generate linguistically diverse and grammatically coherent synthetic sentences and serves as a competitive augmentation strategy for named entity recognition and relation extraction. In most real-world relation extraction tasks, annotated data is not available, but a large unannotated text corpus often is. Bootstrapping methods for relation extraction can operate on such a corpus because they require only a handful of seed instances. However, bootstrapping methods tend to accumulate noise over time (known as semantic drift), and this phenomenon has a drastic negative impact on the final precision of the extractions. We develop two methods that constrain the bootstrapping process to minimise semantic drift in relation extraction; our methods leverage graph theory and pre-trained language models to explicitly identify and remove noisy extraction patterns. We report experimental results on the TACRED dataset for four relations. In the last part of the thesis, we demonstrate the application of domain adaptation to the challenging task of multilingual acronym extraction. Our experiments demonstrate that domain adaptation can improve acronym extraction within scientific and legal domains in 6 languages, including low-resource languages such as Persian and Vietnamese.
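The entity normalization step mentioned in this abstract (string match, then fuzzy search) can be sketched roughly as follows. This is only an illustrative cascade, not the thesis's system: the vocabulary, the placeholder identifiers, the similarity cutoff, and the function name are hypothetical, and the semantic search stage is left out because it would need an embedding model.

```python
import difflib


def normalize_entity(mention, vocabulary, cutoff=0.8):
    """Link a mention to a vocabulary identifier, or return None.

    vocabulary: dict mapping lowercase canonical names to identifiers.
    cutoff:     minimum similarity ratio for the fuzzy fallback
                (an assumed value, not taken from the thesis).
    """
    key = mention.strip().lower()
    # Stage 1: exact string match on the lowercased mention.
    if key in vocabulary:
        return vocabulary[key]
    # Stage 2: fuzzy search over the canonical names.
    close = difflib.get_close_matches(key, list(vocabulary), n=1, cutoff=cutoff)
    if close:
        return vocabulary[close[0]]
    # A semantic search stage would go here in a fuller system.
    return None


# Hypothetical vocabulary with placeholder identifiers.
toy_vocabulary = {
    "escherichia coli": "ID:0001",
    "bacillus subtilis": "ID:0002",
    "soil": "ID:0003",
}
exact_hit = normalize_entity("Escherichia coli", toy_vocabulary)
fuzzy_hit = normalize_entity("Eschericia coli", toy_vocabulary)  # misspelled
```

Trying the cheap exact match first keeps the fuzzy search for mentions that really need it, which is one plausible reason to arrange the stages as a cascade.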

    Theory and Applications for Advanced Text Mining

    Due to the growth of computer and web technologies, we can easily collect and store large amounts of text data, and it is reasonable to believe that the data contain useful knowledge. Text mining techniques have been studied intensively since the late 1990s in order to extract that knowledge. Although many important techniques have been developed, the text mining research field continues to expand to meet the needs arising from various application fields. This book is composed of 9 chapters introducing advanced text mining techniques, which range from relation extraction to techniques for under-resourced or less-resourced languages. I believe that this book will contribute new knowledge to the text mining field and help many readers open up new research fields.