2,361 research outputs found

    Normalized Web Distance and Word Similarity

    Get PDF
    There is a great deal of work in cognitive psychology, linguistics, and computer science, about using word (or phrase) frequencies in context in text corpora to develop measures for word similarity or word association, going back to at least the 1960s. The goal of this chapter is to introduce the normalizedis a general way to tap the amorphous low-grade knowledge available for free on the Internet, typed in by local users aiming at personal gratification of diverse objectives, and yet globally achieving what is effectively the largest semantic electronic database in the world. Moreover, this database is available for all by using any search engine that can return aggregate page-count estimates for a large range of search-queries. In the paper introducing the NWD it was called `normalized Google distance (NGD),' but since Google doesn't allow computer searches anymore, we opt for the more neutral and descriptive NWD. web distance (NWD) method to determine similarity between words and phrases. ItComment: Latex, 20 pages, 7 figures, to appear in: Handbook of Natural Language Processing, Second Edition, Nitin Indurkhya and Fred J. Damerau Eds., CRC Press, Taylor and Francis Group, Boca Raton, FL, 2010, ISBN 978-142008592

    Biomedical ontology alignment: An approach based on representation learning

    Get PDF
    While representation learning techniques have shown great promise in application to a number of different NLP tasks, they have had little impact on the problem of ontology matching. Unlike past work that has focused on feature engineering, we present a novel representation learning approach that is tailored to the ontology matching task. Our approach is based on embedding ontological terms in a high-dimensional Euclidean space. This embedding is derived on the basis of a novel phrase retrofitting strategy through which semantic similarity information becomes inscribed onto fields of pre-trained word vectors. The resulting framework also incorporates a novel outlier detection mechanism based on a denoising autoencoder that is shown to improve performance. An ontology matching system derived using the proposed framework achieved an F-score of 94% on an alignment scenario involving the Adult Mouse Anatomical Dictionary and the Foundational Model of Anatomy ontology (FMA) as targets. This compares favorably with the best performing systems on the Ontology Alignment Evaluation Initiative anatomy challenge. We performed additional experiments on aligning FMA to NCI Thesaurus and to SNOMED CT based on a reference alignment extracted from the UMLS Metathesaurus. Our system obtained overall F-scores of 93.2% and 89.2% for these experiments, thus achieving state-of-the-art results

    Normalized Information Distance

    Get PDF
    The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to approximate the Kolmogorov complexity if the objects have a string representation. Second, for names and abstract concepts, page count statistics from the World Wide Web can be used. These practical realizations of the normalized information distance can then be applied to machine learning tasks, expecially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations. It presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine translation.Comment: 33 pages, 12 figures, pdf, in: Normalized information distance, in: Information Theory and Statistical Learning, Eds. M. Dehmer, F. Emmert-Streib, Springer-Verlag, New-York, To appea

    Data-Driven Shape Analysis and Processing

    Full text link
    Data-driven methods play an increasingly important role in discovering geometric, structural, and semantic relationships between 3D shapes in collections, and applying this analysis to support intelligent modeling, editing, and visualization of geometric data. In contrast to traditional approaches, a key feature of data-driven approaches is that they aggregate information from a collection of shapes to improve the analysis and processing of individual shapes. In addition, they are able to learn models that reason about properties and relationships of shapes without relying on hard-coded rules or explicitly programmed instructions. We provide an overview of the main concepts and components of these techniques, and discuss their application to shape classification, segmentation, matching, reconstruction, modeling and exploration, as well as scene analysis and synthesis, through reviewing the literature and relating the existing works with both qualitative and numerical comparisons. We conclude our report with ideas that can inspire future research in data-driven shape analysis and processing.Comment: 10 pages, 19 figure

    Vermeidung von Repräsentationsheterogenitäten in realweltlichen Wissensgraphen

    Get PDF
    Knowledge graphs are repositories providing factual knowledge about entities. They are a great source of knowledge to support modern AI applications for Web search, question answering, digital assistants, and online shopping. The advantages of machine learning techniques and the Web's growth have led to colossal knowledge graphs with billions of facts about hundreds of millions of entities collected from a large variety of sources. While integrating independent knowledge sources promises rich information, it inherently leads to heterogeneities in representation due to a large variety of different conceptualizations. Thus, real-world knowledge graphs are threatened in their overall utility. Due to their sheer size, they are hardly manually curatable anymore. Automatic and semi-automatic methods are needed to cope with these vast knowledge repositories. We first address the general topic of representation heterogeneity by surveying the problem throughout various data-intensive fields: databases, ontologies, and knowledge graphs. Different techniques for automatically resolving heterogeneity issues are presented and discussed, while several open problems are identified. Next, we focus on entity heterogeneity. We show that automatic matching techniques may run into quality problems when working in a multi-knowledge graph scenario due to incorrect transitive identity links. We present four techniques that can be used to improve the quality of arbitrary entity matching tools significantly. Concerning relation heterogeneity, we show that synonymous relations in knowledge graphs pose several difficulties in querying. Therefore, we resolve these heterogeneities with knowledge graph embeddings and by Horn rule mining. All methods detect synonymous relations in knowledge graphs with high quality. Furthermore, we present a novel technique for avoiding heterogeneity issues at query time using implicit knowledge storage. We show that large neural language models are a valuable source of knowledge that is queried similarly to knowledge graphs already solving several heterogeneity issues internally.Wissensgraphen sind eine wichtige Datenquelle von Entitätswissen. Sie unterstützen viele moderne KI-Anwendungen. Dazu gehören unter anderem Websuche, die automatische Beantwortung von Fragen, digitale Assistenten und Online-Shopping. Neue Errungenschaften im maschinellen Lernen und das außerordentliche Wachstum des Internets haben zu riesigen Wissensgraphen geführt. Diese umfassen häufig Milliarden von Fakten über Hunderte von Millionen von Entitäten; häufig aus vielen verschiedenen Quellen. Während die Integration unabhängiger Wissensquellen zu einer großen Informationsvielfalt führen kann, führt sie inhärent zu Heterogenitäten in der Wissensrepräsentation. Diese Heterogenität in den Daten gefährdet den praktischen Nutzen der Wissensgraphen. Durch ihre Größe lassen sich die Wissensgraphen allerdings nicht mehr manuell bereinigen. Dafür werden heutzutage häufig automatische und halbautomatische Methoden benötigt. In dieser Arbeit befassen wir uns mit dem Thema Repräsentationsheterogenität. Wir klassifizieren Heterogenität entlang verschiedener Dimensionen und erläutern Heterogenitätsprobleme in Datenbanken, Ontologien und Wissensgraphen. Weiterhin geben wir einen knappen Überblick über verschiedene Techniken zur automatischen Lösung von Heterogenitätsproblemen. Im nächsten Kapitel beschäftigen wir uns mit Entitätsheterogenität. Wir zeigen Probleme auf, die in einem Multi-Wissensgraphen-Szenario aufgrund von fehlerhaften transitiven Links entstehen. Um diese Probleme zu lösen stellen wir vier Techniken vor, mit denen sich die Qualität beliebiger Entity-Alignment-Tools deutlich verbessern lässt. Wir zeigen, dass Relationsheterogenität in Wissensgraphen zu Problemen bei der Anfragenbeantwortung führen kann. Daher entwickeln wir verschiedene Methoden um synonyme Relationen zu finden. Eine der Methoden arbeitet mit hochdimensionalen Wissensgrapheinbettungen, die andere mit einem Rule Mining Ansatz. Beide Methoden können synonyme Relationen in Wissensgraphen mit hoher Qualität erkennen. Darüber hinaus stellen wir eine neuartige Technik zur Vermeidung von Heterogenitätsproblemen vor, bei der wir eine implizite Wissensrepräsentation verwenden. Wir zeigen, dass große neuronale Sprachmodelle eine wertvolle Wissensquelle sind, die ähnlich wie Wissensgraphen angefragt werden können. Im Sprachmodell selbst werden bereits viele der Heterogenitätsprobleme aufgelöst, so dass eine Anfrage heterogener Wissensgraphen möglich wird

    Cross-view Semantic Alignment for Livestreaming Product Recognition

    Full text link
    Live commerce is the act of selling products online through live streaming. The customer's diverse demands for online products introduce more challenges to Livestreaming Product Recognition. Previous works have primarily focused on fashion clothing data or utilize single-modal input, which does not reflect the real-world scenario where multimodal data from various categories are present. In this paper, we present LPR4M, a large-scale multimodal dataset that covers 34 categories, comprises 3 modalities (image, video, and text), and is 50x larger than the largest publicly available dataset. LPR4M contains diverse videos and noise modality pairs while exhibiting a long-tailed distribution, resembling real-world problems. Moreover, a cRoss-vIew semantiC alignmEnt (RICE) model is proposed to learn discriminative instance features from the image and video views of the products. This is achieved through instance-level contrastive learning and cross-view patch-level feature propagation. A novel Patch Feature Reconstruction loss is proposed to penalize the semantic misalignment between cross-view patches. Extensive experiments demonstrate the effectiveness of RICE and provide insights into the importance of dataset diversity and expressivity. The dataset and code are available at https://github.com/adxcreative/RICEComment: Accepted to ICCV202

    Getting Past the Language Gap: Innovations in Machine Translation

    Get PDF
    In this chapter, we will be reviewing state of the art machine translation systems, and will discuss innovative methods for machine translation, highlighting the most promising techniques and applications. Machine translation (MT) has benefited from a revitalization in the last 10 years or so, after a period of relatively slow activity. In 2005 the field received a jumpstart when a powerful complete experimental package for building MT systems from scratch became freely available as a result of the unified efforts of the MOSES international consortium. Around the same time, hierarchical methods had been introduced by Chinese researchers, which allowed the introduction and use of syntactic information in translation modeling. Furthermore, the advances in the related field of computational linguistics, making off-the-shelf taggers and parsers readily available, helped give MT an additional boost. Yet there is still more progress to be made. For example, MT will be enhanced greatly when both syntax and semantics are on board: this still presents a major challenge though many advanced research groups are currently pursuing ways to meet this challenge head-on. The next generation of MT will consist of a collection of hybrid systems. It also augurs well for the mobile environment, as we look forward to more advanced and improved technologies that enable the working of Speech-To-Speech machine translation on hand-held devices, i.e. speech recognition and speech synthesis. We review all of these developments and point out in the final section some of the most promising research avenues for the future of MT

    Composing Measures for Computing Text Similarity

    Get PDF
    We present a comprehensive study of computing similarity between texts. We start from the observation that while the concept of similarity is well grounded in psychology, text similarity is much less well-defined in the natural language processing community. We thus define the notion of text similarity and distinguish it from related tasks such as textual entailment and near-duplicate detection. We then identify multiple text dimensions, i.e. characteristics inherent to texts that can be used to judge text similarity, for which we provide empirical evidence. We discuss state-of-the-art text similarity measures previously proposed in the literature, before continuing with a thorough discussion of common evaluation metrics and datasets. Based on the analysis, we devise an architecture which combines text similarity measures in a unified classification framework. We apply our system in two evaluation settings, for which it consistently outperforms prior work and competing systems: (a) an intrinsic evaluation in the context of the Semantic Textual Similarity Task as part of the Semantic Evaluation (SemEval) exercises, and (b) an extrinsic evaluation for the detection of text reuse. As a basis for future work, we introduce DKPro Similarity, an open source software package which streamlines the development of text similarity measures and complete experimental setups
    • …
    corecore