Joint Upper & Lower Bound Normalization for IR Evaluation
In this paper, we present a novel perspective on IR evaluation by
proposing a new family of evaluation metrics in which existing popular metrics
(e.g., nDCG, MAP) are customized by introducing a query-specific lower-bound
(LB) normalization term. While the original nDCG, MAP, etc. are normalized
with respect to their upper bounds, based on an ideal ranked list, a corresponding
LB normalization has not yet been studied. Specifically, we introduce
two variants of the proposed LB normalization, where the lower bound
is estimated from a randomized ranking of the documents present
in the evaluation set. We then conduct two case studies by instantiating the
new framework for two popular IR evaluation metrics (two variants each:
DCG_UL_V1,2 and MSP_UL_V1,2) and comparing against the traditional metrics
without the proposed LB normalization. Experiments on two different data sets
with eight Learning-to-Rank (LETOR) methods demonstrate the following
properties of the new LB-normalized metrics: 1) statistically significant
differences (between two methods) in terms of the original metric no longer remain
significant in terms of the Upper-Lower (UL) bound normalized version,
and vice versa, especially for uninformative query sets; 2) compared
with the original metrics, the proposed UL-normalized metrics demonstrate
higher discriminatory power and better consistency across data sets.
These findings suggest that the IR community should take UL normalization
seriously when computing nDCG and MAP, and that a more in-depth study of UL
normalization for general IR evaluation is warranted.
Comment: 26 pages, 3 figures
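The abstract does not spell out the normalization formula. As a hedged sketch, one natural reading of joint upper/lower-bound normalization for DCG is (DCG − LB) / (UB − LB), where UB is the ideal list's DCG and LB is the expected DCG of a random permutation, estimated by Monte-Carlo sampling. The function names, the sampling scheme, and the tolerance check below are illustrative assumptions, not the paper's actual definitions (which may differ between its V1 and V2 variants):

```python
import math
import random

def dcg(relevances):
    """Standard DCG with a log2 position discount."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ul_normalized_dcg(ranked_relevances, trials=1000, seed=0):
    """Sketch of joint upper/lower-bound normalization:
    (DCG - LB) / (UB - LB), where UB comes from the ideal ranking
    and LB is the mean DCG over random shufflings."""
    rng = random.Random(seed)
    ub = dcg(sorted(ranked_relevances, reverse=True))  # ideal ranked list
    pool = list(ranked_relevances)
    lb = 0.0
    for _ in range(trials):                            # randomized lower bound
        rng.shuffle(pool)
        lb += dcg(pool)
    lb /= trials
    if abs(ub - lb) < 1e-9:                            # uninformative query: every
        return 0.0                                     # ranking scores the same
    return (ub := ub, dcg(ranked_relevances) - lb)[1] / (ub - lb)
```

Under this reading, an ideal ranking scores 1, a random ranking scores about 0, and a worse-than-random ranking can go negative, which is what separates it from plain upper-bound-only normalization.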
Search and information extraction in handwritten tables
[EN] Currently, archives around the world are digitising large
collections of manuscripts, aiming to preserve and facilitate their
dissemination to researchers and general users. This fact is
motivating a fast evolution in handwritten text recognition (HTR)
techniques, which allow accessing to the textual contents of digital
images by means of plain-text queries, in the same way as with books
and other digital documents.
Among the huge set of manuscripts without transcription, more than half of the documents contain structured text. This is the case for
birth records, navigation logs, logbooks, etc. The information contained in
these documents is often needed for legal matters, demographic
studies, climate evolution studies, etc.
The purpose of this work is to develop new methods that allow
searches to be performed according to the "attribute-value" model over these
documents, where the "attributes" are, for example, the column or row
headers of tables and the "values" are the corresponding table
cells. For this purpose, we rely on the so-called
probabilistic indexing framework (which is, in a certain sense, related
to the field known as "keyword spotting"). In this framework, each
element of an image that can be interpreted as a word is detected and
stored, along with its position within the image and the
corresponding relevance probability.
This way, by using the geometric information available in the probabilistic indices
together with Gaussian distributions, we aim to enable this type of search from
a completely probabilistic perspective. Following this approach, in
addition to information search, we study how to extract
specific textual contents of the digital images into standard formats
compatible with conventional databases. In both tasks, the results obtained surpass the proposed baseline.
Andrés Moreno, J. (2021). Search and information extraction in handwritten tables. Universitat Politècnica de València. http://hdl.handle.net/10251/172740
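The abstract describes combining each spotted word's relevance probability with its geometric position via Gaussian distributions. The sketch below is a hypothetical, simplified reading of that idea, not the thesis's actual model: each probabilistic-index entry is assumed to be a `(word, x_position, relevance_probability)` tuple, candidate "values" of a table column are scored by the index posterior times a Gaussian alignment score against the header's horizontal position, and `sigma` is an assumed column-width parameter:

```python
import math

def gaussian_weight(x, mu, sigma):
    """Unnormalized Gaussian score for horizontal alignment with a column."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def attribute_value_scores(index_entries, header_x, sigma=20.0):
    """Sketch: rank each spotted word as a candidate value of the column
    whose header sits at horizontal position header_x, combining the
    index's relevance probability with geometric alignment."""
    scored = [(word, p_rel * gaussian_weight(x, header_x, sigma))
              for word, x, p_rel in index_entries]
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

With this rule, a confidently spotted word directly under the header dominates, while a word from a distant column is suppressed by the Gaussian term even if its recognition posterior is high.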
Be4SeD: Benchmarking for the Evaluation of Service Discovery Techniques
Currently, the growing number of business processes and offered services is the source of countless research projects aimed at producing discovery mechanisms, resulting in a multitude of algorithms for retrieving services. However, these projects do not use a common basis for evaluating their search techniques, which prevents objective evaluation. A public tool is therefore needed that provides a common reference for comparing and assessing the results of the different algorithms used in service matching, in order to improve the quality of the proposed discovery techniques. This article presents a public application that implements a benchmarking methodology for evaluating the retrieval quality of service-matching techniques. The benchmark comprises an intuitive evaluation mechanism, a module for entering the data of the algorithm under evaluation, and a component that delivers statistical results: recall, precision, overall, k-precision and p-precision. Its functionality is offered as a web service to ease integration with the implementations of the algorithms to be evaluated. Finally, a matching algorithm is evaluated, illustrating the use of the Be4SeD platform in this context.
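Two of the statistics the benchmark reports, precision and recall, have standard definitions; the sketch below also includes one common reading of k-precision (precision over the top-k ranked results). The function names are illustrative, and the exact Be4SeD definitions of overall, k-precision and p-precision may differ:

```python
def precision_recall(retrieved, relevant):
    """Standard set-based retrieval-quality measures over service IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def precision_at_k(ranked_retrieved, relevant, k):
    """Precision over the top-k ranked results (one plausible reading
    of the benchmark's k-precision)."""
    relevant = set(relevant)
    return sum(1 for s in ranked_retrieved[:k] if s in relevant) / k
```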
Realising context-oriented information filtering.
The notion of information overload is an increasing factor in modern information service environments where information is 'pushed' to the user. As increasing volumes of information are presented to computing users in the form of email, web sites, instant messaging and news feeds, there is a growing need to filter and prioritise the importance of this information. 'Information management' needs to be undertaken in a manner that not only prioritises the information we do need, but also disposes of information that is sent to us and is of no (or little) use.
The development of a model to aid information filtering in a context-aware way is an objective of this thesis. A key concern in the conceptualisation of a single concept is understanding the context under which that concept exists (or can exist). An example of a concept is a concrete object, for instance a book. This contextual understanding should provide clear conceptual identification of a concept, including implicit situational information and detail of surrounding concepts.
Existing solutions to filtering information suffer from their own unique flaws: text-based filtering suffers from problems of inaccuracy; ontology-based solutions suffer from scalability challenges; taxonomies suffer from problems with collaboration. A major objective of this thesis is to explore the use of an evolving, community-maintained knowledge-base (that of Wikipedia) in order to populate the context model with concepts that are semantically relevant to the user's interest space. Wikipedia can be classified as a weak knowledge-base due to its simple TBox schema and implicit predicates; part of this objective is therefore to validate the claim that a weak knowledge-base is fit for this purpose.
The proposed and developed solution therefore provides the benefits of high-recall filtering with low fallout and a dependency on a scalable and collaborative knowledge-base. A simple web feed aggregator has been built using the Java programming language, which we call DAVe's Rss Organisation System (DAVROS-2), as a testbed environment for the specific tests used within this investigation. The motivation behind the experiments is to demonstrate that the concept framework, instantiated through Wikipedia, can aid in concept comparison and can therefore be used in a news filtering scenario as an example of information overload. To evaluate the effectiveness of the method, well-understood measures of information retrieval are used. This thesis demonstrates that the developed contextual concept expansion framework (instantiated using Wikipedia) improved the quality of concept filtering over a baseline based on string matching. This has been demonstrated through the analysis of recall and fallout measures.
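The evaluation above rests on recall and fallout, which can be stated precisely: recall is the fraction of relevant items the filter keeps, and fallout is the fraction of non-relevant items it wrongly keeps. A minimal sketch (function and argument names are illustrative):

```python
def recall_and_fallout(filtered_in, relevant, collection):
    """Recall: share of relevant items kept by the filter.
    Fallout: share of non-relevant items (wrongly) kept."""
    filtered_in, relevant, collection = set(filtered_in), set(relevant), set(collection)
    non_relevant = collection - relevant
    recall = len(filtered_in & relevant) / len(relevant) if relevant else 0.0
    fallout = len(filtered_in & non_relevant) / len(non_relevant) if non_relevant else 0.0
    return recall, fallout
```

A good filter pushes recall toward 1 while keeping fallout near 0, which is exactly the trade-off the thesis's string-matching baseline comparison measures.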
HMM word graph based keyword spotting in handwritten document images
[EN] Line-level keyword spotting (KWS) is presented on the basis of frame-level word posterior
probabilities. These posteriors are obtained using word graphs derived from the recognition
process of a full-fledged handwritten text recognizer based on hidden Markov models
and N-gram language models. This approach has several advantages. First, since it uses
a holistic, segmentation-free technology, it does not require any kind of word or character
segmentation. Second, the use of language models allows the context of each spotted
word to be taken into account, thereby considerably increasing KWS accuracy. And third,
the proposed KWS scores are based on true posterior probabilities, taking into account
all (or most) possible word segmentations of the input image. These scores are properly
bounded and normalized. This mathematically clean formulation lends itself to smooth,
threshold-based keyword queries which, in turn, permit comfortable trade-offs between
search precision and recall. Experiments are carried out on several historic collections of
handwritten text images, as well as a well-known data set of modern English handwritten
text. According to the empirical results, the proposed approach achieves KWS results
comparable to those obtained with the recently introduced "BLSTM neural networks KWS"
approach and clearly outperforms the popular, state-of-the-art "Filler HMM" KWS method.
Overall, the results clearly support all the above-claimed advantages of the proposed approach.
This work has been partially supported by the Generalitat Valenciana under the Prometeo/2009/014 project grant ALMA-MATER, and through the EU projects: HIMANIS (JPICH programme, Spanish grant Ref. PCIN-2015-068) and READ (Horizon 2020 programme, grant Ref. 674943).
Toselli, AH.; Vidal, E.; Romero, V.; Frinken, V. (2016). HMM word graph based keyword spotting in handwritten document images. Information Sciences. 370:497-518. https://doi.org/10.1016/j.ins.2016.07.063
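The threshold-based querying that the abstract highlights is simple to picture once the word-graph decoder has produced bounded, normalized posteriors. The sketch below assumes (purely for illustration; the data structures are not from the paper) that each line image has already been reduced to a `{word: posterior}` map:

```python
def spot_keyword(line_scores, keyword, threshold):
    """Threshold-based KWS sketch: line_scores maps a line-image ID to
    {word: posterior probability}, as might be produced by a word-graph
    decoder. A line is retrieved when the keyword's posterior meets the
    threshold; lowering the threshold trades precision for recall."""
    hits = [(line_id, words.get(keyword, 0.0))
            for line_id, words in line_scores.items()]
    hits = [(line_id, p) for line_id, p in hits if p >= threshold]
    return sorted(hits, key=lambda t: t[1], reverse=True)
```

Because the scores are true posteriors in [0, 1], a single global threshold is meaningful across queries, which is what makes the precision/recall trade-off "comfortable" to tune.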
Deriving Technology Intelligence from Patents: Preposition-based Semantic Analysis
Patents are one of the most reliable sources of technology intelligence, and the true value of patent analysis stems from its capability to describe the content of a technology based on the relationships between keywords. To date, a number of techniques for analyzing the information contained in patent documents that focus on the relationships between keywords have been suggested. However, a drawback of the existing keyword approaches is that they cannot yet determine the types of relationships between keywords. This study proposes a novel approach based on a preposition-based semantic analysis network, which overcomes the limitations of existing keyword-based network analysis, and demonstrates its potential through an application. A preposition is a word that defines the relationship between two neighboring words and, in the case of patents, prepositions help reveal the relationships between keywords related to technologies. To demonstrate the approach, patents regarding an electric vehicle were employed. Thirteen prepositions were identified which could be used to define five relationships between neighboring technological terms: "inclusion (utilization)," "objective (purpose)," "effect," "process," and "likeness." The proposed approach is expected to improve the usability of keyword-based patent analyses and support more elaborate studies of patent documents.
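The core mechanism, typing the edge between two neighboring terms by the preposition that joins them, can be sketched as a triple extractor. The preposition-to-relation mapping below is an assumed, illustrative subset; the study's actual 13 prepositions and their assignments to the five relation types are defined in the paper:

```python
import re

# Illustrative (assumed) mapping from prepositions to relation types;
# the paper's actual 13-preposition mapping may differ.
PREP_RELATION = {
    "for": "objective", "of": "inclusion", "with": "inclusion",
    "by": "process", "like": "likeness",
}

def extract_triples(sentence):
    """Sketch: emit (left term, relation, right term) triples around
    each known preposition, using one word of context per side."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    triples = []
    for i, tok in enumerate(tokens):
        if tok in PREP_RELATION and 0 < i < len(tokens) - 1:
            triples.append((tokens[i - 1], PREP_RELATION[tok], tokens[i + 1]))
    return triples
```

A real system would anchor the terms on noun phrases rather than single adjacent words, but the one-word window is enough to show how a typed network, rather than an untyped keyword co-occurrence network, is obtained.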
Exploiting user signals and stochastic models to improve information retrieval systems and evaluation
The leitmotiv throughout this thesis is IR evaluation. We discuss different issues related to effectiveness measures and propose novel solutions to address these challenges. We start by providing a formal definition of utility-oriented measurement of retrieval effectiveness, based on the representational theory of measurement. The proposed theoretical framework contributes to a better understanding of the problem's complexities, separating those due to the inherent difficulty of comparing systems from those due to the expected numerical properties of measures. We then propose AWARE, a probabilistic framework for dealing with the noise and inconsistencies introduced when relevance labels are gathered from multiple crowd assessors. By modeling relevance judgements and crowd assessors as sources of uncertainty, we directly combine the performance measures computed on the ground truth generated by each crowd assessor, instead of adopting a classification technique to merge the labels at pool level. Finally, we investigate evaluation measures able to account for user signals. We propose a new user model based on Markov chains that allows the user to scan the result list with many degrees of freedom. We exploit this Markovian model to inject user models into precision, defining a new family of evaluation measures, and we embed this model as the objective function of a Learning-to-Rank (LtR) algorithm to improve system performance.
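The abstract only names the ingredients of the Markovian measures: a Markov chain over result-list positions whose visit probabilities weight relevance. The sketch below is one hypothetical instantiation of that idea, not the thesis's actual definition: it uses the chain's stationary distribution, computed by power iteration, to weight binary relevance at each rank:

```python
def stationary_distribution(P, iters=1000):
    """Power iteration for the stationary distribution of a
    row-stochastic transition matrix P over result-list positions."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def markov_precision(P, relevance):
    """Sketch of a Markovian precision: weight each rank's (binary)
    relevance by how often the modelled user visits that rank."""
    pi = stationary_distribution(P)
    return sum(p * r for p, r in zip(pi, relevance))
```

Different transition matrices encode different browsing behaviours (strictly top-down, skipping, backtracking), which is how one model yields a whole family of precision-like measures.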
Cloud-Based Benchmarking of Medical Image Analysis
Medical imagin