6 research outputs found

    Few-shot entity linking of food names

    Entity linking (EL), the task of automatically matching mentions in text to concepts in a target knowledge base, remains under-explored in the food domain, despite its many potential applications, e.g., looking up the nutritional value of ingredients in databases. In this paper, we describe the creation of new resources supporting the development of EL methods for the food domain: the E.Care Knowledge Base (E.Care KB), which contains 664 food concepts, and the E.Care dataset, a corpus of 468 cooking recipes in which ingredient names have been manually linked to corresponding concepts in the E.Care KB. We developed and evaluated several EL methods: deep learning-based approaches underpinned by Siamese networks trained in a few-shot learning setting, traditional machine learning-based approaches underpinned by support vector machines (SVMs), and unsupervised approaches based on string matching algorithms. Combining the strengths of these approaches, we built a hybrid model for food EL that balances the trade-off between performance and inference speed. Specifically, our hybrid model obtains 89.40% accuracy and links mentions at an average speed of 0.24 seconds per mention, whereas our best deep learning-based model, SVM model and unsupervised model obtain accuracies of 86.99%, 87.19% and 87.43% at inference speeds of 0.007, 0.66 and 0.02 seconds per mention, respectively.
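The unsupervised string-matching component of such a pipeline can be sketched in a few lines. The toy KB entries, the similarity threshold, and the fallback behaviour below are all illustrative assumptions, not the paper's actual configuration:

```python
from difflib import SequenceMatcher

# Toy knowledge base of food concepts (hypothetical entries and IDs;
# the actual E.Care KB contains 664 concepts).
KB = {"tomato": "ecare:Tomato", "olive oil": "ecare:OliveOil", "basil": "ecare:Basil"}

def link_mention(mention, threshold=0.85):
    """Link an ingredient mention to a KB concept by fuzzy string matching.

    Returns the concept ID of the best-scoring KB entry, or None if no
    entry scores above the threshold (a hybrid system could then fall
    back to a slower learned model).
    """
    best_id, best_score = None, 0.0
    for name, concept_id in KB.items():
        score = SequenceMatcher(None, mention.lower(), name).ratio()
        if score > best_score:
            best_id, best_score = concept_id, score
    return best_id if best_score >= threshold else None

print(link_mention("Tomatoes"))   # close match to "tomato"
print(link_mention("soy sauce"))  # no close match -> None
```

Routing easy mentions through cheap string matching and reserving a learned model for the unmatched remainder is one plausible way to obtain the speed/accuracy balance the abstract describes.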

    Uso de Técnicas e Ferramentas de Embedding de Conhecimento para Desambiguação de Anotações Segundo Contextos Semânticos

    TCC (undergraduate thesis) - Universidade Federal de Santa Catarina, Centro Tecnológico, Ciências da Computação. Semantic annotations link unstructured data (e.g., relevant mentions in documents) to resources with well-defined semantics in knowledge bases (e.g., DBpedia, WordNet, BabelNet, Google Knowledge Graph) that help explain what the annotated data refer to. Such annotations enable better exploitation of the annotated data in a myriad of domains and applications, including commerce, marketing, tourism, and public security, among many others. However, when trying to capture the semantics of data such as social media posts, semantic enrichment applications run into problems such as the use of slang and regionalisms in informal conversation and, above all, the occurrence of ambiguity. Many techniques developed over the years attempt to disambiguate mentions of real-world entities in documents in order to produce correct semantic annotations. This work is a study of the state of the art of the word ambiguity problem, with the goal of mastering the techniques and tools currently used to solve it, such as word and knowledge embeddings. Building on this, the aim is to disambiguate mentions present in social media posts using embeddings together with neural networks. To that end, existing implementations of embeddings are identified and selected, and semantic contexts are captured, represented, and explored in these embeddings for annotation disambiguation. The proposed approach is evaluated on its improvement of annotation precision over sets of tweets, for example the Microposts dataset, which provides annual sets of tweets.
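The core disambiguation idea, comparing an embedding of the mention's context against embeddings of the candidate entities, can be sketched as follows. The 3-dimensional vectors and the DBpedia identifiers are illustrative placeholders, not the thesis's actual models:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical 3-dimensional entity embeddings (real word/knowledge
# embeddings have hundreds of dimensions and are learned from corpora).
candidates = {
    "dbpedia:Apple_Inc.":    [0.9, 0.1, 0.0],
    "dbpedia:Apple_(fruit)": [0.1, 0.9, 0.2],
}
# e.g., the averaged embeddings of the other words in the tweet
context_embedding = [0.8, 0.2, 0.1]

best = max(candidates, key=lambda c: cosine(context_embedding, candidates[c]))
print(best)  # the candidate whose embedding best matches the context
```

In the thesis setting, a neural network would refine this raw similarity signal rather than picking the nearest candidate directly.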

    Uso de classificadores binários em embeddings de palavras e conhecimento na tarefa de ligação de entidades

    TCC (undergraduate thesis) - Universidade Federal de Santa Catarina, Centro Tecnológico, Sistemas de Informação. Semantically annotating data whose semantics are not well-defined or machine-processable, such as the large amounts of semi-structured and unstructured data currently available on the Web, has the potential to leverage applications that can take advantage of automatic interpretations of such data. However, current automatic semantic annotation processes fail to deliver results of good quality. One way to improve the quality of the results generated by current annotation processes is to consider the broader context in which the data are found. Semantic contexts constructed from a few reliable annotations can help in the disambiguation of new annotations. They can be represented as embeddings, a category of natural language processing models that mathematically map words to numerical vectors. This representation makes it much easier and faster to determine similar words (vectors) or vector clusters. This work proposes algorithms for the creation and use of semantic contexts, using machine learning techniques on graphs, to improve the disambiguation of new annotations. The annotations produced by the proposed model are compared with those of state-of-the-art tools for semantic annotation of textual data.
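The binary-classifier framing from the title can be sketched as scoring (mention context, candidate entity) pairs and deciding match/no-match. The features and the hand-set weights below are purely illustrative; in the thesis setting the classifier would be trained on reliably annotated examples:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def features(mention_vec, candidate_vec):
    """Feature vector for a (mention context, candidate entity) pair.

    Here just the dot product and the squared-norm gap; a trained system
    would derive richer features from the embeddings.
    """
    dot = sum(a * b for a, b in zip(mention_vec, candidate_vec))
    norm_gap = abs(sum(a * a for a in mention_vec) - sum(b * b for b in candidate_vec))
    return [dot, norm_gap]

# Hand-set weights for illustration only (an assumption, not learned values).
WEIGHTS, BIAS = [4.0, -2.0], -1.0

def is_match(mention_vec, candidate_vec, threshold=0.5):
    """Binary decision: does the candidate entity fit the mention's context?"""
    f = features(mention_vec, candidate_vec)
    score = sigmoid(sum(w * x for w, x in zip(WEIGHTS, f)) + BIAS)
    return score >= threshold

print(is_match([0.9, 0.1], [0.8, 0.2]))  # similar vectors -> True
print(is_match([0.9, 0.1], [0.0, 1.0]))  # dissimilar vectors -> False
```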

    Entities with quantities: extraction, search, and ranking

    Quantities are more than numeric values. They denote measures of the world's entities such as heights of buildings, running times of athletes, energy efficiency of car models or energy production of power plants, all expressed in numbers with associated units. Entity-centric search and question answering (QA) are well supported by modern search engines. However, they do not work well when the queries involve quantity filters, such as searching for athletes who ran 200m under 20 seconds or companies with quarterly revenue above $2 billion. State-of-the-art systems fail to understand the quantities, including the condition (less than, above, etc.), the unit of interest (seconds, dollars, etc.), and the context of the quantity (200m race, quarterly revenue, etc.). QA systems based on structured knowledge bases (KBs) also fail, as quantities are poorly covered by state-of-the-art KBs. In this dissertation, we developed new methods to advance the state of the art in quantity knowledge extraction and search. Our main contributions are the following:
    • First, we present Qsearch [Ho et al., 2019, Ho et al., 2020] – a system that can handle advanced queries with quantity filters by using cues present in both the query and the text sources. Qsearch comprises two main contributions: a deep neural network model designed to extract quantity-centric tuples from text sources, and a novel query-matching model for finding and ranking matching tuples.
    • Second, to incorporate heterogeneous tables into the process, we introduce QuTE [Ho et al., 2021a, Ho et al., 2021b] – a system for extracting quantity information from web sources, in particular ad-hoc web tables in HTML pages. QuTE's contributions include a method for linking quantity and entity columns that leverages external text sources. For question answering, we contextualize the extracted entity-quantity pairs with informative cues from the table and present a new method for consolidating and improving the ranking of answer candidates via inter-fact consistency.
    • Third, we present QL [Ho et al., 2022] – a recall-oriented method for enriching knowledge bases (KBs) with quantity facts. Modern KBs such as Wikidata or YAGO cover many entities and their relevant information, but often miss important quantity properties. QL is query-driven and based on iterative learning, with two main contributions toward improving KB coverage: a method for query expansion to capture a larger pool of fact candidates, and a technique for self-consistency that takes the value distributions of quantities into account.
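Evaluating a quantity filter such as "athletes who ran 200m under 20 seconds" requires understanding the comparator, the value, and the unit. A minimal sketch, with an illustrative unit-conversion table and toy facts (not the dissertation's actual extraction output):

```python
# Conversion table to a base unit (seconds); illustrative only —
# a real system like Qsearch handles far richer unit vocabularies.
UNIT_TO_SECONDS = {"s": 1.0, "seconds": 1.0, "min": 60.0, "minutes": 60.0}

def normalize(value, unit):
    """Convert a time quantity to the base unit (seconds)."""
    return value * UNIT_TO_SECONDS[unit]

def matches(entity_quantity, condition):
    """Check a (value, unit) fact against a (comparator, value, unit) filter."""
    value, unit = entity_quantity
    cmp, ref_value, ref_unit = condition
    v, ref = normalize(value, unit), normalize(ref_value, ref_unit)
    return v < ref if cmp == "under" else v > ref

# Toy entity-quantity facts (hypothetical data for illustration).
facts = {"Usain Bolt": (19.19, "seconds"), "Casual Runner": (0.5, "minutes")}
condition = ("under", 20, "seconds")
print([name for name, q in facts.items() if matches(q, condition)])
```

Normalizing both sides of the comparison to a common unit is what lets the filter treat "0.5 minutes" and "30 seconds" as the same quantity.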