
    Extracting keywords from tweets

    In recent years, an enormous amount of information has become available on the Internet. Social networks are among the biggest contributors to this growth in data volume. Twitter, in particular, has paved the way, as a social platform, for people and organisations to interact with one another, generating large volumes of data from which useful information can be extracted. Such a quantity of data can prove important, for example, if and when several individuals report symptoms of illness at the same time and in the same place. Automatically processing such a volume of information and deriving useful knowledge from it is, however, an impossible task for any human being. Keyword extractors emerge in this context as a valuable tool that aims to ease this work by giving quick access to a set of terms that characterise a document. In this work, we try to contribute to a better understanding of this problem by evaluating the effectiveness of YAKE! (an unsupervised keyword extraction algorithm) on a collection of tweets, a type of text characterised not only by its short length but also by its unstructured nature. Although keyword extractors have been widely applied to generic texts such as reports and articles, their application to tweets is scarce, and no dataset had been formally made available so far. To overcome this problem, in this work we chose to develop and release a new data collection, an important contribution towards enabling the scientific community to propose new solutions in this domain. KWTweet was annotated by 15 annotators and comprises 7736 annotated tweets. Based on this information, we then evaluated the effectiveness of YAKE! against 9 unsupervised keyword extraction baselines (TextRank, KP-Miner, SingleRank, PositionRank, TopicPageRank, MultipartiteRank, TopicRank, RAKE and TF.IDF). The results show that YAKE! outperforms its competitors, demonstrating its effectiveness on this type of text. Finally, we provide a demo that illustrates how YAKE! works: on this web platform, users can search by user or hashtag and obtain the most relevant keywords as a word cloud.
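    For reference, the sketch below shows how an unsupervised extractor such as YAKE! can be applied to a single tweet using the open-source `yake` Python package; the example tweet and parameter values are illustrative assumptions, not the configuration evaluated in this work.

```python
# Minimal sketch of unsupervised keyword extraction with the `yake` package
# (pip install yake). The tweet and parameters below are illustrative only.
import yake

tweet = ("Feeling feverish and short of breath again today, "
         "third day in a row #flu #Lisbon")

# n = maximum n-gram length of candidate keywords, top = number of keywords kept.
extractor = yake.KeywordExtractor(lan="en", n=2, top=5)

# extract_keywords returns (keyword, score) pairs; lower scores are better.
for keyword, score in extractor.extract_keywords(tweet):
    print(f"{score:.4f}\t{keyword}")
```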

    Entities with quantities : extraction, search, and ranking

    Quantities are more than numeric values. They denote measures of the world’s entities such as heights of buildings, running times of athletes, energy efficiency of car models or energy production of power plants, all expressed in numbers with associated units. Entity-centric search and question answering (QA) are well supported by modern search engines. However, they do not work well when the queries involve quantity filters, such as searching for athletes who ran 200m under 20 seconds or companies with quarterly revenue above $2 billion. State-of-the-art systems fail to understand the quantities, including the condition (less than, above, etc.), the unit of interest (seconds, dollars, etc.), and the context of the quantity (200m race, quarterly revenue, etc.). QA systems based on structured knowledge bases (KBs) also fail, as quantities are poorly covered by state-of-the-art KBs. In this dissertation, we developed new methods to advance the state of the art in quantity knowledge extraction and search. Our main contributions are the following:
    • First, we present Qsearch [Ho et al., 2019, Ho et al., 2020], a system that can answer advanced queries with quantity filters by using cues present both in the question and in the text sources. Qsearch comprises two main contributions: a deep neural network model designed to extract quantity-centric tuples from text sources, and a novel query-matching model for finding and ranking matching tuples.
    • Second, to incorporate heterogeneous tables into the process, we present QuTE [Ho et al., 2021a, Ho et al., 2021b], a system for extracting quantity information from web sources, in particular ad-hoc web tables in HTML pages. QuTE contributes a method for linking quantity and entity columns that exploits external text sources. For question answering, we contextualise the extracted entity-quantity pairs with informative cues from the table and present a new method for consolidating and re-ranking answer candidates based on inter-fact consistency.
    • Third, we present QL [Ho et al., 2022], a recall-oriented method for enriching knowledge bases (KBs) with quantity facts. Modern KBs such as Wikidata or YAGO cover many entities and their relevant information but often miss important quantity properties. QL is query-driven and based on iterative learning, with two main contributions towards improving KB coverage: a method for expanding queries in order to capture a larger pool of candidate facts, and a self-consistency technique that takes the value distributions of quantities into account.
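    To make the notion of a quantity-centric tuple concrete, the sketch below shows one illustrative way to represent and filter such tuples in Python; the data class, field names and example records are assumptions for illustration, not the actual Qsearch or QuTE data model.

```python
# Illustrative sketch (not the Qsearch implementation): a quantity fact couples
# an entity with a numeric value, a unit and contextual cues, so that filters
# such as "under 20 seconds" can be evaluated.
from dataclasses import dataclass, field

@dataclass
class QuantityFact:
    entity: str                 # e.g. an athlete or a company
    value: float                # numeric value in the base unit
    unit: str                   # e.g. "second", "USD"
    context: list[str] = field(default_factory=list)  # e.g. "200m", "quarterly revenue"

facts = [
    QuantityFact("Usain Bolt", 19.19, "second", ["200m", "world record"]),
    QuantityFact("Runner B", 20.45, "second", ["200m"]),             # made-up record
    QuantityFact("Company C", 2.3e9, "USD", ["quarterly revenue"]),  # made-up record
]

# Query: athletes who ran 200m under 20 seconds.
hits = [f for f in facts
        if f.unit == "second" and f.value < 20 and "200m" in f.context]
for f in hits:
    print(f.entity, f.value, f.unit)
```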

    XMG : eXtensible MetaGrammar

    In this article, we introduce eXtensible MetaGrammar (XMG), a framework for specifying tree-based grammars such as Feature-Based Lexicalised Tree-Adjoining Grammars (FB-LTAG) and Interaction Grammars (IG). We argue that XMG displays three features which facilitate both grammar writing and fast prototyping of tree-based grammars. Firstly, XMG is fully declarative. For instance, it permits a declarative treatment of diathesis that markedly departs from the procedural lexical rules often used to specify tree-based grammars. Secondly, the XMG language has high notational expressivity in that it supports multiple linguistic dimensions, inheritance and a sophisticated treatment of identifiers. Thirdly, XMG is extensible in that its computational architecture facilitates the extension to other linguistic formalisms. We explain how this architecture naturally supports the design of three linguistic formalisms, namely FB-LTAG, IG, and Multi-Component Tree-Adjoining Grammar (MC-TAG). We further show how it permits a straightforward integration of additional mechanisms such as linguistic and formal principles. To further illustrate the declarativity, notational expressivity and extensibility of XMG, we describe the methodology used to specify an FB-LTAG for French augmented with a unification-based compositional semantics. This illustrates how XMG facilitates the modelling both of the tree fragment hierarchies required to specify tree-based grammars and of a syntax/semantics interface between semantic representations and syntactic trees. Finally, we briefly report on several grammars for French, English and German that were implemented using XMG, and compare XMG to other existing grammar specification frameworks for tree-based grammars.
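    As a rough illustration of the metagrammar idea described above, the sketch below uses plain Python (not XMG's concrete syntax) to state reusable tree fragments once and derive elementary trees by combining them; the fragment names and toy tree encoding are assumptions made for illustration.

```python
# Illustrative sketch only: shared tree fragments are declared once and the
# elementary trees of a toy grammar are obtained by combining them.
# Fragments are encoded as nested tuples: (node label, children...).
SUBJECT = ("S", ("NP", "subject"), ("VP",))
INTRANSITIVE_FRAME = ("VP", ("V", "anchor"))
TRANSITIVE_FRAME = ("VP", ("V", "anchor"), ("NP", "object"))

# Disjunction over verb frames, conjunction with the shared subject fragment.
frames = {"intransitive": INTRANSITIVE_FRAME, "transitive": TRANSITIVE_FRAME}
elementary_trees = {name: (SUBJECT, frame) for name, frame in frames.items()}

for family, description in elementary_trees.items():
    print(family, description)
```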

    Error propagation


    Event extraction from biomedical texts using trimmed dependency graphs

    This thesis explores the automatic extraction of information from biomedical publications. Such techniques are urgently needed because the biosciences are publishing continually increasing numbers of texts. The focus of this work is on events. Information about events is currently curated manually from the literature by biocurators. Biocuration, however, is time-consuming and costly, so automatic methods are needed for information extraction from the literature. This thesis is dedicated to modeling, implementing and evaluating an advanced event extraction approach based on the analysis of syntactic dependency graphs. This work presents the proposed event extraction approach and its implementation, the JReX (Jena Relation eXtraction) system. This system was used by the University of Jena (JULIE Lab) team in the "BioNLP 2009 Shared Task on Event Extraction" competition and was ranked second among 24 competing teams. Thereafter, JReX was the highest scorer on the worldwide shared U-Compare event extraction server, outperforming the competing systems from the challenge. This success was made possible, among other things, by extensive research on event extraction solutions carried out during this thesis, e.g., exploring the effects of syntactic and semantic processing procedures on solving the event extraction task. The evaluations on standard, community-accepted competition data were complemented by a real-life evaluation of large-scale biomedical database reconstruction. This work showed that considerable parts of manually curated databases can be automatically re-created with the help of the event extraction approach developed here. Successful re-creation was possible for parts of RegulonDB, the world's largest database for E. coli. In summary, the event extraction approach justified, developed and implemented in this thesis meets the needs of a large community of human curators and thus helps in the acquisition of new knowledge in the biosciences.
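    As a rough illustration of the kind of dependency-graph analysis described above, the sketch below trims a toy dependency graph to the shortest path between an event trigger and a candidate argument using networkx; the sentence, parse edges and trimming rule are illustrative assumptions, not the JReX pipeline itself.

```python
# Illustrative sketch (not JReX): trim a toy dependency graph of the sentence
# "IL-2 expression is induced by NF-kappaB" to the shortest path between an
# event trigger ("induced") and a candidate argument ("NF-kappaB").
import networkx as nx

# Dependency edges as (head, dependent) pairs from a hand-made toy parse.
edges = [
    ("induced", "expression"),   # passive subject
    ("expression", "IL-2"),      # noun modifier
    ("induced", "is"),           # auxiliary
    ("induced", "by"),           # agent marker
    ("by", "NF-kappaB"),         # agent
]
graph = nx.Graph(edges)  # undirected, so paths may run against edge direction

trigger, argument = "induced", "NF-kappaB"
path = nx.shortest_path(graph, source=trigger, target=argument)
print("trimmed path:", " -> ".join(path))  # induced -> by -> NF-kappaB
```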

    Problemes lingüístics de la traducció automàtica entre l'anglès i el japonès

    Japanese and English are two typologically very distant languages, which poses a significant problem for machine translation. This work studies translation from Japanese into English using four systems (Google, Bing, SYSTRAN and Weblio) and considering three of the main differences between the two languages: constituent order, future tense marking and number inflection. Through automatic evaluation of the translation outputs with the BLEU, NIST and METEOR metrics, we attempt to determine to what extent translation from Japanese into English, for the problems studied, is reliable in the selected translators.
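    As a reference point for the automatic evaluation mentioned above, the sketch below computes sentence-level BLEU with NLTK on made-up reference and candidate sentences; NIST and METEOR would be computed analogously with their own scorers, and the sentences shown are assumptions, not data from this study.

```python
# Minimal sketch of automatic MT evaluation with sentence-level BLEU (NLTK).
# The reference and hypothesis sentences below are made-up examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "meeting", "will", "be", "held", "tomorrow", "morning"]
hypothesis = ["the", "meeting", "is", "held", "tomorrow", "morning"]

# Smoothing avoids zero scores when some higher-order n-gram has no match.
score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")
```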

    Empirical Evaluation Methodology for Target Dependent Sentiment Analysis

    The area of sentiment analysis has been around for at least 20 years in one form or another. In that time it has had many and varied applications, ranging from predicting film successes to social media analytics, and it has gained widespread use as a tool sold through application programming interfaces. The focus of this thesis is not on the application side but rather on novel evaluation methodology for the most fine-grained form of sentiment analysis, target dependent sentiment analysis (TDSA). TDSA has seen a recent upsurge, but to date most research only evaluates on very similar datasets, which limits the conclusions that can be drawn. Further, most research only marginally improves results, chasing the State Of The Art (SOTA), and these prior works cannot empirically show where their improvements come from beyond overall metrics and small qualitative examples. Through an extensive literature review covering the different granularities of sentiment analysis, from coarse (document level) to fine-grained, a new and extended definition of fine-grained sentiment analysis, the hextuple, is created, which removes ambiguities that can arise from the context. In addition, examples from the literature are provided where studies could be neither replicated nor reproduced. This thesis includes the largest empirical analysis on six English datasets across multiple existing neural and non-neural methods, allowing the methods to be tested for generalisability. These experiments show that factors such as dataset size and sentiment class distribution determine whether neural or non-neural approaches are best, and further that no method is generalisable. By formalising, analysing, and testing prior TDSA error splits, newly created error splits, and a new TDSA-specific metric, a new empirical evaluation methodology is created for TDSA. This evaluation methodology is then applied to multiple case studies to empirically justify improvements, such as position encoding, and to show how contextualised word representations improve TDSA methods. From the first reproduction study in TDSA, it is believed that random seeds significantly affecting the neural methods explain the difficulty in reproducing or replicating the original study results, highlighting empirically, for the first time in TDSA, the need to report results over multiple runs for neural methods, allowing better reporting and improved evaluation. This thesis is fully reproducible through the referenced codebases and Jupyter notebooks, making it an executable thesis.
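    To illustrate the multi-run reporting practice argued for above, the sketch below runs a placeholder training routine under several random seeds and reports the mean and standard deviation of the resulting scores; the seeds, metric values and `train_and_evaluate` function are hypothetical, not code or results from the thesis.

```python
# Illustrative sketch: report neural TDSA results over multiple random seeds
# (mean ± standard deviation) rather than from a single run.
import random
import statistics

def train_and_evaluate(seed: int) -> float:
    """Hypothetical placeholder: train a model with this seed, return macro-F1."""
    random.seed(seed)
    return 0.65 + random.uniform(-0.02, 0.02)  # simulated run-to-run variance

seeds = [13, 42, 77, 101, 2023]
scores = [train_and_evaluate(s) for s in seeds]

print(f"macro-F1 over {len(seeds)} seeds: "
      f"{statistics.mean(scores):.3f} ± {statistics.stdev(scores):.3f}")
```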