Joint Upper & Lower Bound Normalization for IR Evaluation
In this paper, we present a novel perspective on IR evaluation by
proposing a new family of evaluation metrics in which existing popular metrics
(e.g., nDCG, MAP) are customized by introducing a query-specific lower-bound
(LB) normalization term. While the original nDCG, MAP, etc. are normalized
with respect to their upper bounds, based on an ideal ranked list, a corresponding
LB normalization has not yet been studied. Specifically, we introduce
two variants of the proposed LB normalization, where the lower bound
is estimated from a randomized ranking of the documents present
in the evaluation set. We then conduct two case studies by instantiating the
new framework for two popular IR evaluation metrics (two variants each:
DCG_UL_V1,2 and MSP_UL_V1,2) and comparing against the traditional metrics
without the proposed LB normalization. Experiments on two different data sets
with eight Learning-to-Rank (LETOR) methods demonstrate the following
properties of the new LB-normalized metrics: 1) statistically significant
differences (between two methods) in terms of the original metric no longer remain
significant in terms of the Upper-Lower (UL) bound normalized version,
and vice versa, especially for uninformative query sets; 2) compared
with the original metrics, the proposed UL-normalized metrics demonstrate
higher discriminatory power and better consistency across data sets.
These findings suggest that the IR community should take UL normalization
seriously when computing nDCG and MAP, and that a more in-depth study of UL
normalization for general IR evaluation is warranted.
Comment: 26 pages, 3 figures
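The abstract does not spell out the normalization formula. As a hedged sketch, one natural reading of joint upper/lower-bound normalization for DCG is (DCG − LB) / (UB − LB), where UB is the ideal list's DCG and LB is the expected DCG of a random permutation, estimated by Monte-Carlo sampling. The function names, the sampling scheme, and the tolerance check below are illustrative assumptions, not the paper's actual definitions (which may differ between its V1 and V2 variants):

```python
import math
import random

def dcg(relevances):
    """Standard DCG with a log2 position discount."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ul_normalized_dcg(ranked_relevances, trials=1000, seed=0):
    """Sketch of joint upper/lower-bound normalization:
    (DCG - LB) / (UB - LB), where UB comes from the ideal ranking
    and LB is the mean DCG over random shufflings."""
    rng = random.Random(seed)
    ub = dcg(sorted(ranked_relevances, reverse=True))  # ideal ranked list
    pool = list(ranked_relevances)
    lb = 0.0
    for _ in range(trials):                            # randomized lower bound
        rng.shuffle(pool)
        lb += dcg(pool)
    lb /= trials
    if abs(ub - lb) < 1e-9:                            # uninformative query: every
        return 0.0                                     # ranking scores the same
    return (ub := ub, dcg(ranked_relevances) - lb)[1] / (ub - lb)
```

Under this reading, an ideal ranking scores 1, a random ranking scores about 0, and a worse-than-random ranking can go negative, which is what separates it from plain upper-bound-only normalization.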
Search and information extraction in handwritten tables
[EN] Currently, archives around the world are digitising large
collections of manuscripts, aiming to preserve and facilitate their
dissemination to researchers and general users. This fact is
motivating a fast evolution in handwritten text recognition (HTR)
techniques, which allow accessing to the textual contents of digital
images by means of plain-text queries, in the same way as with books
and other digital documents.
Among the huge set of manuscripts without transcription, more than half of the documents contain structured text. This is the case for
birth records, navigation logs, logbooks, etc. The information contained in
these documents is often needed for legal matters, demographic
studies, climate evolution studies, etc.
The purpose of this work is to develop new methods that allow
searches to be performed according to the "attribute-value" model over these
documents, where the "attributes" are, for example, the column or row
headers of tables and the "values" are the corresponding table
cells. For this purpose, we rely on the so-called
probabilistic indexing framework (which is, in a certain sense, related
to the field known as "keyword spotting"). In this framework, each
element of an image that can be interpreted as a word is detected and
stored, along with its position within the image and the
corresponding relevance probability.
This way, by using the geometric information available in the probabilistic indices
together with Gaussian distributions, we aim to enable this type of search from
a completely probabilistic perspective. Following this approach, in
addition to information search, we study how to extract
specific textual contents of the digital images into standard formats
compatible with conventional databases. In both tasks, the results obtained surpass the proposed baseline.
Andrés Moreno, J. (2021). Search and information extraction in handwritten tables. Universitat Politècnica de València. http://hdl.handle.net/10251/172740
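The abstract describes combining each spotted word's relevance probability with its geometric position via Gaussian distributions. The sketch below is a hypothetical, simplified reading of that idea, not the thesis's actual model: each probabilistic-index entry is assumed to be a `(word, x_position, relevance_probability)` tuple, candidate "values" of a table column are scored by the index posterior times a Gaussian alignment score against the header's horizontal position, and `sigma` is an assumed column-width parameter:

```python
import math

def gaussian_weight(x, mu, sigma):
    """Unnormalized Gaussian score for horizontal alignment with a column."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def attribute_value_scores(index_entries, header_x, sigma=20.0):
    """Sketch: rank each spotted word as a candidate value of the column
    whose header sits at horizontal position header_x, combining the
    index's relevance probability with geometric alignment."""
    scored = [(word, p_rel * gaussian_weight(x, header_x, sigma))
              for word, x, p_rel in index_entries]
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

With this rule, a confidently spotted word directly under the header dominates, while a word from a distant column is suppressed by the Gaussian term even if its recognition posterior is high.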
Be4SeD: Benchmarking for the Evaluation of Service Discovery Techniques
Currently, the growing number of business processes and offered services is the source of countless research projects aimed at producing discovery mechanisms, resulting in a multitude of algorithms for retrieving services. However, these projects do not use a common basis for evaluating their search techniques, which prevents objective evaluation. A public tool is therefore needed that provides a common reference for comparing and assessing the results of the different algorithms used in service matching, in order to improve the quality of the proposed discovery techniques. This article presents a public application that implements a benchmarking methodology for evaluating the retrieval quality of service-matching techniques. The benchmark comprises an intuitive evaluation mechanism, a module for entering the data of the algorithm under evaluation, and a component that delivers statistical results: recall, precision, overall, k-precision and p-precision. Its functionality is offered as a web service to ease integration with the implementations of the algorithms to be evaluated. Finally, a matching algorithm is evaluated, illustrating the use of the Be4SeD platform in this context.
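Two of the statistics the benchmark reports, precision and recall, have standard definitions; the sketch below also includes one common reading of k-precision (precision over the top-k ranked results). The function names are illustrative, and the exact Be4SeD definitions of overall, k-precision and p-precision may differ:

```python
def precision_recall(retrieved, relevant):
    """Standard set-based retrieval-quality measures over service IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def precision_at_k(ranked_retrieved, relevant, k):
    """Precision over the top-k ranked results (one plausible reading
    of the benchmark's k-precision)."""
    relevant = set(relevant)
    return sum(1 for s in ranked_retrieved[:k] if s in relevant) / k
```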
Realising context-oriented information filtering.
The notion of information overload is an increasing factor in modern information service environments where information is 'pushed' to the user. As increasing volumes of information are presented to computing users in the form of email, web sites, instant messaging and news feeds, there is a growing need to filter and prioritise the importance of this information. 'Information management' needs to be undertaken in a manner that not only prioritises the information we do need, but also disposes of information that is sent to us and is of no (or little) use.
The development of a model to aid information filtering in a context-aware way is an objective of this thesis. A key concern in the conceptualisation of a single concept is understanding the context under which that concept exists (or can exist). An example of a concept is a concrete object, for instance a book. This contextual understanding should provide clear conceptual identification of a concept, including implicit situational information and detail of surrounding concepts.
Existing solutions to filtering information suffer from their own unique flaws: text-based filtering suffers from problems of inaccuracy; ontology-based solutions suffer from scalability challenges; taxonomies suffer from problems with collaboration. A major objective of this thesis is to explore the use of an evolving, community-maintained knowledge-base (that of Wikipedia) in order to populate the context model with concepts that are semantically relevant to the user's interest space. Wikipedia can be classified as a weak knowledge-base due to its simple TBox schema and implicit predicates; part of this objective is therefore to validate the claim that a weak knowledge-base is fit for this purpose.
The proposed and developed solution therefore provides the benefits of high-recall filtering with low fallout and a dependency on a scalable and collaborative knowledge-base. A simple web feed aggregator has been built using the Java programming language, which we call DAVe's Rss Organisation System (DAVROS-2), as a testbed environment for the specific tests used within this investigation. The motivation behind the experiments is to demonstrate that the concept framework, instantiated through Wikipedia, can aid in concept comparison and can therefore be used in a news filtering scenario as an example of information overload. To evaluate the effectiveness of the method, well-understood measures of information retrieval are used. This thesis demonstrates that the developed contextual concept expansion framework (instantiated using Wikipedia) improved the quality of concept filtering over a baseline based on string matching. This has been demonstrated through the analysis of recall and fallout measures.
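The evaluation above rests on recall and fallout, which can be stated precisely: recall is the fraction of relevant items the filter keeps, and fallout is the fraction of non-relevant items it wrongly keeps. A minimal sketch (function and argument names are illustrative):

```python
def recall_and_fallout(filtered_in, relevant, collection):
    """Recall: share of relevant items kept by the filter.
    Fallout: share of non-relevant items (wrongly) kept."""
    filtered_in, relevant, collection = set(filtered_in), set(relevant), set(collection)
    non_relevant = collection - relevant
    recall = len(filtered_in & relevant) / len(relevant) if relevant else 0.0
    fallout = len(filtered_in & non_relevant) / len(non_relevant) if non_relevant else 0.0
    return recall, fallout
```

A good filter pushes recall toward 1 while keeping fallout near 0, which is exactly the trade-off the thesis's string-matching baseline comparison measures.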
HMM word graph based keyword spotting in handwritten document images
[EN] Line-level keyword spotting (KWS) is presented on the basis of frame-level word posterior
probabilities. These posteriors are obtained using word graphs derived from the recognition
process of a full-fledged handwritten text recognizer based on hidden Markov models
and N-gram language models. This approach has several advantages. First, since it uses
a holistic, segmentation-free technology, it does not require any kind of word or character
segmentation. Second, the use of language models allows the context of each spotted
word to be taken into account, thereby considerably increasing KWS accuracy. And third,
the proposed KWS scores are based on true posterior probabilities, taking into account
all (or most) possible word segmentations of the input image. These scores are properly
bounded and normalized. This mathematically clean formulation lends itself to smooth,
threshold-based keyword queries which, in turn, permit comfortable trade-offs between
search precision and recall. Experiments are carried out on several historic collections of
handwritten text images, as well as a well-known data set of modern English handwritten
text. According to the empirical results, the proposed approach achieves KWS results
comparable to those obtained with the recently introduced "BLSTM neural networks KWS"
approach and clearly outperforms the popular, state-of-the-art "Filler HMM" KWS method.
Overall, the results clearly support all the above-claimed advantages of the proposed approach.
This work has been partially supported by the Generalitat Valenciana under the Prometeo/2009/014 project grant ALMA-MATER, and through the EU projects: HIMANIS (JPICH programme, Spanish grant Ref. PCIN-2015-068) and READ (Horizon 2020 programme, grant Ref. 674943).
Toselli, AH.; Vidal, E.; Romero, V.; Frinken, V. (2016). HMM word graph based keyword spotting in handwritten document images. Information Sciences. 370:497-518. https://doi.org/10.1016/j.ins.2016.07.063
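The threshold-based querying that the abstract highlights is simple to picture once the word-graph decoder has produced bounded, normalized posteriors. The sketch below assumes (purely for illustration; the data structures are not from the paper) that each line image has already been reduced to a `{word: posterior}` map:

```python
def spot_keyword(line_scores, keyword, threshold):
    """Threshold-based KWS sketch: line_scores maps a line-image ID to
    {word: posterior probability}, as might be produced by a word-graph
    decoder. A line is retrieved when the keyword's posterior meets the
    threshold; lowering the threshold trades precision for recall."""
    hits = [(line_id, words.get(keyword, 0.0))
            for line_id, words in line_scores.items()]
    hits = [(line_id, p) for line_id, p in hits if p >= threshold]
    return sorted(hits, key=lambda t: t[1], reverse=True)
```

Because the scores are true posteriors in [0, 1], a single global threshold is meaningful across queries, which is what makes the precision/recall trade-off "comfortable" to tune.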
Deriving Technology Intelligence from Patents: Preposition-based Semantic Analysis
Patents are one of the most reliable sources of technology intelligence, and the true value of patent analysis stems from its capability to describe the content of a technology based on the relationships between keywords. To date, a number of techniques for analyzing the information contained in patent documents that focus on the relationships between keywords have been suggested. However, a drawback of the existing keyword approaches is that they cannot yet determine the types of relationships between keywords. This study proposes a novel approach based on a preposition-based semantic analysis network, which overcomes the limitations of existing keyword-based network analysis, and demonstrates its potential through an application. A preposition is a word that defines the relationship between two neighboring words and, in the case of patents, prepositions help reveal the relationships between keywords related to technologies. To demonstrate the approach, patents regarding an electric vehicle were employed. Thirteen prepositions were identified which could be used to define five relationships between neighboring technological terms: "inclusion (utilization)," "objective (purpose)," "effect," "process," and "likeness." The proposed approach is expected to improve the usability of keyword-based patent analyses and support more elaborate studies of patent documents.
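The core mechanism, typing the edge between two neighboring terms by the preposition that joins them, can be sketched as a triple extractor. The preposition-to-relation mapping below is an assumed, illustrative subset; the study's actual 13 prepositions and their assignments to the five relation types are defined in the paper:

```python
import re

# Illustrative (assumed) mapping from prepositions to relation types;
# the paper's actual 13-preposition mapping may differ.
PREP_RELATION = {
    "for": "objective", "of": "inclusion", "with": "inclusion",
    "by": "process", "like": "likeness",
}

def extract_triples(sentence):
    """Sketch: emit (left term, relation, right term) triples around
    each known preposition, using one word of context per side."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    triples = []
    for i, tok in enumerate(tokens):
        if tok in PREP_RELATION and 0 < i < len(tokens) - 1:
            triples.append((tokens[i - 1], PREP_RELATION[tok], tokens[i + 1]))
    return triples
```

A real system would anchor the terms on noun phrases rather than single adjacent words, but the one-word window is enough to show how a typed network, rather than an untyped keyword co-occurrence network, is obtained.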
Exploiting user signals and stochastic models to improve information retrieval systems and evaluation
The leitmotiv throughout this thesis is IR evaluation. We discuss different issues related to effectiveness measures and propose novel solutions to address these challenges. We start by providing a formal definition of utility-oriented measurement of retrieval effectiveness, based on the representational theory of measurement. The proposed theoretical framework contributes to a better understanding of the problem's complexities, separating those due to the inherent difficulty of comparing systems from those due to the expected numerical properties of measures. We then propose AWARE, a probabilistic framework for dealing with the noise and inconsistencies introduced when relevance labels are gathered from multiple crowd assessors. By modeling relevance judgements and crowd assessors as sources of uncertainty, we directly combine the performance measures computed on the ground truth generated by each crowd assessor, instead of adopting a classification technique to merge the labels at pool level. Finally, we investigate evaluation measures able to account for user signals. We propose a new user model based on Markov chains that allows the user to scan the result list with many degrees of freedom. We exploit this Markovian model to inject user models into precision, defining a new family of evaluation measures, and we embed this model as the objective function of a Learning-to-Rank (LtR) algorithm to improve system performance.
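The abstract only names the ingredients of the Markovian measures: a Markov chain over result-list positions whose visit probabilities weight relevance. The sketch below is one hypothetical instantiation of that idea, not the thesis's actual definition: it uses the chain's stationary distribution, computed by power iteration, to weight binary relevance at each rank:

```python
def stationary_distribution(P, iters=1000):
    """Power iteration for the stationary distribution of a
    row-stochastic transition matrix P over result-list positions."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def markov_precision(P, relevance):
    """Sketch of a Markovian precision: weight each rank's (binary)
    relevance by how often the modelled user visits that rank."""
    pi = stationary_distribution(P)
    return sum(p * r for p, r in zip(pi, relevance))
```

Different transition matrices encode different browsing behaviours (strictly top-down, skipping, backtracking), which is how one model yields a whole family of precision-like measures.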
Cloud-Based Benchmarking of Medical Image Analysis
Medical imagin