Composing Measures for Computing Text Similarity

Author: Bär Daniel
Gurevych Iryna
Zesch Torsten
Publication date: 26/01/2015

We present a comprehensive study of computing similarity between texts. We start from the observation that while the concept of similarity is well grounded in psychology, text similarity is much less well defined in the natural language processing community. We thus define the notion of text similarity and distinguish it from related tasks such as textual entailment and near-duplicate detection. We then identify multiple text dimensions, i.e., characteristics inherent to texts that can be used to judge text similarity, for which we provide empirical evidence. We discuss state-of-the-art text similarity measures previously proposed in the literature, before continuing with a thorough discussion of common evaluation metrics and datasets. Based on this analysis, we devise an architecture which combines text similarity measures in a unified classification framework. We apply our system in two evaluation settings, for which it consistently outperforms prior work and competing systems: (a) an intrinsic evaluation in the context of the Semantic Textual Similarity Task as part of the Semantic Evaluation (SemEval) exercises, and (b) an extrinsic evaluation for the detection of text reuse. As a basis for future work, we introduce DKPro Similarity, an open source software package which streamlines the development of text similarity measures and complete experimental setups.

TUbiblio

tuprints

HESML: A scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset

Author: Juan J. Lastra-Díaz
Ana García-Serrano
Montserrat Batet
Miriam Fernández
Fernando Chirigati
Publication venue: Elsevier BV
Publication date: 01/01/2017

This work is a detailed companion reproducibility paper for the methods and experiments proposed by Lastra-Díaz and García-Serrano (2015, 2016) [56–58]. It introduces the following contributions: (1) a new and efficient representation model for taxonomies, called PosetHERep, which is an adaptation of the half-edge data structure commonly used to represent discrete manifolds and planar graphs; (2) a new Java software library called the Half-Edge Semantic Measures Library (HESML) based on PosetHERep, which implements most ontology-based semantic similarity measures and Information Content (IC) models reported in the literature; (3) a set of reproducible experiments on word similarity based on HESML and ReproZip with the aim of exactly reproducing the experimental surveys in the three aforementioned works; (4) a replication framework and dataset, called WNSimRep v1, whose aim is to assist the exact replication of most methods reported in the literature; and finally, (5) a set of scalability and performance benchmarks for semantic measures libraries. PosetHERep and HESML are motivated by several drawbacks of the current semantic measures libraries, especially their performance and scalability, as well as the difficulty of evaluating new methods and replicating most previous methods. The reproducible experiments introduced herein are motivated by the lack of a set of large, self-contained and easily reproducible experiments for replicating and confirming previously reported results. Likewise, the WNSimRep v1 dataset is motivated by the discovery of several contradictory results and difficulties in reproducing previously reported methods and experiments.
PosetHERep proposes a memory-efficient representation for taxonomies which scales linearly with the size of the taxonomy and provides an efficient implementation of most taxonomy-based algorithms used by the semantic measures and IC models, whilst HESML provides an open framework to aid research in the area by offering a simpler and more efficient software architecture than the current software libraries. Finally, we show that HESML outperforms the state-of-the-art libraries, and that their performance and scalability could be significantly improved without caching by using PosetHERep.

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Open Research Online (The Open University)

The Oberta in open access

Comparing and Benchmarking Semantic Measures Using SMComp

Author: Costa Teresa
Publication venue: OASIcs - OpenAccess Series in Informatics. 5th Symposium on Languages, Applications and Technologies (SLATE'16)
Publication date: 01/01/2016

The goal of semantic measures is to compare pairs of concepts, words, sentences or named entities. Their categorization depends on what they measure. If a measure considers only taxonomic relationships, it is a similarity measure; if it considers all types of relationships, it is a relatedness measure. The evaluation of these measures usually relies on semantic gold standards. These datasets, consisting of word pairs with ratings assigned by people, are used to assess how well a semantic measure performs. There are a few frameworks that provide tools to compute and analyze several well-known measures. This paper presents a novel tool, SMComp, a testbed designed for path-based semantic measures. In its current state, it is a domain-specific tool using three different versions of WordNet. SMComp has two views: one to compute semantic measures for a pair of words and another to assess a semantic measure using a dataset. In the first view, it offers several measures described in the literature as well as the possibility of creating a new measure by entering Java code snippets in the GUI. The other view offers a large set of semantic benchmarks to use in the assessment process. It also offers the possibility of uploading a custom dataset to be used in the assessment.
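
The assessment idea described in this abstract (score word pairs with a path-based measure, then compare the scores with human ratings) can be sketched roughly as follows. This is a minimal illustration and not SMComp's code: the taxonomy, the word pairs, the ratings, and the function names are all hypothetical, and the 1 / (1 + path length) formula is just one common path-based choice assumed here.

```python
import math

# Hypothetical toy is-a taxonomy (NOT WordNet): child -> parent.
PARENT = {
    "dog": "canine", "wolf": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal", "mammal": "animal",
    "sparrow": "bird", "bird": "animal",
}

def ancestors(node):
    """Return [node, parent, grandparent, ...] up to the root."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def path_length(a, b):
    """Number of is-a edges on the shortest path between a and b."""
    depth_in_b = {n: i for i, n in enumerate(ancestors(b))}
    for i, n in enumerate(ancestors(a)):
        if n in depth_in_b:  # first shared ancestor is the lowest one
            return i + depth_in_b[n]
    raise ValueError("no common ancestor")

def path_similarity(a, b):
    """Assumed path-based measure: 1 / (1 + shortest path length)."""
    return 1.0 / (1.0 + path_length(a, b))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical gold standard: (word1, word2, made-up human rating).
GOLD = [
    ("dog", "wolf", 3.5), ("dog", "cat", 3.0), ("dog", "sparrow", 1.0),
    ("cat", "sparrow", 1.2), ("wolf", "canine", 3.8),
]

scores = [path_similarity(a, b) for a, b, _ in GOLD]
ratings = [r for _, _, r in GOLD]
r = pearson(scores, ratings)
print(f"Pearson correlation with the toy gold standard: {r:.3f}")
```

A higher correlation would suggest that the measure ranks pairs more like the human raters do; rank-based coefficients such as Spearman's are another option for the same comparison.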

Dagstuhl Research Online Publication Server

Learning Single-Image Depth from Videos using Quality Assessment Networks

Author: Chen Weifeng
Deng Jia
Qian Shengyi
Publication date: 01/01/2019

Depth estimation from a single image in the wild remains a challenging problem. One main obstacle is the lack of high-quality training data for images in the wild. In this paper we propose a method to automatically generate such data through Structure-from-Motion (SfM) on Internet videos. The core of this method is a Quality Assessment Network that identifies high-quality reconstructions obtained from SfM. Using this method, we collect single-view depth training data from a large number of YouTube videos and construct a new dataset called YouTube3D. Experiments show that YouTube3D is useful in training depth estimation networks and advances the state of the art of single-view depth estimation in the wild.

arXiv.org e-Print Archive

Princeton University Open Access Repository