Fusion de systèmes et analyse des caractéristiques linguistiques des requêtes : vers un processus de RI adaptatif (System fusion and analysis of the linguistic features of queries: towards an adaptive IR process)
Today, accessing huge volumes of information is a reality. Information retrieval (IR) techniques are used by an ever-growing number of Internet users to retrieve relevant information (data, video, pictures, etc.). In this work we are interested in textual IR. Three elements are necessary in an IR process: an information need (most often a query of a few words), an IR system, and a set of documents. The query is submitted to the system, which tries to return relevant documents from the document set as an answer to the user's inquiry. Variability in the expression of the query leads to variation in the performance of the systems (Buckley et al., 2004). For instance, system A can be very effective for a given query and very poor for another one, whereas system B obtains the opposite results. Our thesis is situated in this context of variability. The main objective of our work is to propose retrieval techniques that can adapt to different contexts. We consider, for example, that the linguistic features of queries, the performance of the systems, and their characteristics are contextual elements of the retrieval process. Several contributions are made in this thesis. Queries are clustered according to their linguistic features (Mothe and Tanguy, 2005) with techniques such as agglomerative hierarchical clustering and k-means. Queries are then analysed through the linguistic profile of the cluster they belong to. The underlying hypothesis is that some IR systems are more suitable than others for different clusters of queries. We analyse the performance of the systems on each of the resulting clusters of queries (query contexts).
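The clustering step can be sketched with a plain k-means over per-query feature vectors. The feature set below (query length in words, named-entity count, verb count) is a hypothetical stand-in for the linguistic profile of Mothe and Tanguy (2005), and the queries are invented for illustration:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: returns a list of cluster labels, one per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return labels

# Hypothetical linguistic profile per query:
# (length in words, named-entity count, verb count).
queries = {
    "endangered species mammals": (3, 0, 0),
    "Falkland petroleum exploration": (3, 1, 0),
    "why do tropical storms form and intensify": (7, 0, 2),
    "how is air quality measured in large cities": (8, 0, 1),
}
labels = kmeans(list(queries.values()), k=2)
```

With these toy features the two short keyword queries end up in one cluster and the two long natural-language questions in the other, which is the kind of linguistic grouping the thesis then evaluates systems against.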
Four fusion methods are proposed to combine, for a given query, the results returned by different IR systems, and a series of experiments validates our propositions. This work is carried out in the context of the TREC evaluation campaigns.
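The abstract does not name the four fusion methods, but CombSUM is a standard baseline for this kind of result merging: each system's scores are normalized, then summed per document. A minimal sketch with made-up scores:

```python
from collections import defaultdict

def comb_sum(ranked_lists):
    """CombSUM: sum each document's normalized scores across systems."""
    fused = defaultdict(float)
    for scores in ranked_lists:
        top = max(scores.values())  # normalize by each system's top score
        for doc, s in scores.items():
            fused[doc] += s / top
    # Rank documents by their fused score, best first.
    return sorted(fused, key=fused.get, reverse=True)

system_a = {"d1": 9.0, "d2": 3.0, "d3": 1.0}
system_b = {"d2": 0.8, "d3": 0.7, "d4": 0.1}
ranking = comb_sum([system_a, system_b])  # ['d2', 'd1', 'd3', 'd4']
```

Note how d2, ranked second by system A, overtakes d1 once system B's evidence is added; this cross-system corroboration is what fusion methods exploit.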
Information Extraction on Para-Relational Data.
Para-relational data (such as spreadsheets and diagrams) refers to a type of nearly
relational data that shares the important qualities of relational data but does not
present itself in a relational format. Para-relational data often conveys highly valuable
information and is widely used in many different areas. If we can convert para-relational
data into the relational format, many existing tools can be leveraged for a
variety of interesting applications, such as data analysis with relational query systems
and data integration applications.
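As a toy illustration of that payoff (not Senbazuru's actual algorithm), the sketch below flattens a small spreadsheet region with an indented hierarchical left header into relational tuples, then queries them with a relational query system (SQLite):

```python
import sqlite3

# A toy spreadsheet region: indentation in the left header encodes
# the parent category of each data row.
sheet = [
    ("Revenue", None),        # level-0 header row, no value
    ("  Hardware", 120),      # level-1 data rows under "Revenue"
    ("  Software", 80),
    ("Expenses", None),
    ("  Salaries", 150),
]

def to_relational(rows):
    """Flatten the hierarchy: each data cell becomes (category, item, value)."""
    tuples, parent = [], None
    for label, value in rows:
        if not label.startswith("  "):
            parent = label            # top-level header opens a new group
        else:
            tuples.append((parent, label.strip(), value))
    return tuples

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sheet (category TEXT, item TEXT, value REAL)")
conn.executemany("INSERT INTO sheet VALUES (?, ?, ?)", to_relational(sheet))
total = conn.execute(
    "SELECT SUM(value) FROM sheet WHERE category = 'Revenue'").fetchone()[0]
```

Once the data is in relational form, ordinary SQL aggregation (here, total revenue of 200) works directly, which is exactly the leverage the paragraph above describes.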
This dissertation aims to convert para-relational data into a high-quality relational
form with little user assistance. We have developed four standalone systems, each
addressing a specific type of para-relational data. Senbazuru is a prototype spreadsheet
database management system that extracts relational information from a large
number of spreadsheets. Anthias is an extension of the Senbazuru system to convert
a broader range of spreadsheets into a relational format. Lyretail is an extraction
system to detect long-tail dictionary entities on webpages. Finally, DiagramFlyer is
a web-based search system that obtains a large number of diagrams automatically
extracted from web-crawled PDFs. Together, these four systems demonstrate that
converting para-relational data into the relational format is possible today, and also
suggest directions for future systems.
PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/120853/1/chenzhe_1.pd
Automated synthesis of data extraction and transformation programs
Due to the abundance of data in today's data-rich world, end-users increasingly need to perform various data extraction and transformation tasks. While many of these tedious tasks can be performed in a programmatic way, most end-users lack the required programming expertise to automate them and end up spending their valuable time manually performing various data-related tasks. The field of program synthesis aims to overcome this problem by automatically generating programs from informal specifications, such as input-output examples or natural language.
This dissertation focuses on the design and implementation of new systems for automating important classes of data transformation and extraction tasks. It introduces solutions for automating data manipulation tasks on fully-structured data formats like relational tables, or on semi-structured formats such as XML and JSON documents.
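The examples-to-program idea can be illustrated with a deliberately tiny synthesizer that searches a fixed space of candidate string programs for one consistent with every input-output pair; the candidate set is invented for this sketch and far smaller than any real synthesis space:

```python
# Candidate programs: a tiny, hand-picked space of string transformations.
CANDIDATES = {
    "upper": str.upper,
    "lower": str.lower,
    "first_word": lambda s: s.split()[0],
    "last_word": lambda s: s.split()[-1],
    "initials": lambda s: "".join(w[0] for w in s.split()),
}

def synthesize(examples):
    """Return the name of the first candidate consistent with all examples."""
    for name, prog in CANDIDATES.items():
        if all(prog(inp) == out for inp, out in examples):
            return name
    return None  # no program in the space explains the examples

prog = synthesize([("Ada Lovelace", "AL"), ("Alan Turing", "AT")])  # "initials"
```

Real synthesizers replace this brute-force loop with pruning, ranking, and compositional search, but the specification style (input-output examples) is the same.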
First, we describe a novel algorithm for synthesizing hierarchical data transformations from input-output examples. A key novelty of our approach is that it reduces the synthesis of tree transformations to the simpler problem of synthesizing transformations over the paths of the tree. We also describe a new and effective algorithm for learning path transformations that combines logical SMT-based reasoning with machine learning techniques based on decision trees.
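A minimal sketch of the path decomposition, assuming trees are nested dicts: decompose into root-to-leaf paths, transform each path independently, then rebuild. The rename transformation is a made-up example, and the SMT/decision-tree learning of path transformations is not shown:

```python
def to_paths(tree, prefix=()):
    """Decompose a nested dict into (path, leaf) pairs."""
    if not isinstance(tree, dict):
        return [(prefix, tree)]
    pairs = []
    for k, v in tree.items():
        pairs.extend(to_paths(v, prefix + (k,)))
    return pairs

def from_paths(pairs):
    """Rebuild a nested dict from (path, leaf) pairs."""
    tree = {}
    for path, leaf in pairs:
        node = tree
        for k in path[:-1]:
            node = node.setdefault(k, {})
        node[path[-1]] = leaf
    return tree

# A path-level transformation (here: renaming one segment) is applied to
# each path independently, instead of reasoning about whole trees at once.
rename = lambda path: tuple("fullname" if seg == "name" else seg
                            for seg in path)

doc = {"person": {"name": "Ada", "born": 1815}}
out = from_paths([(rename(p), v) for p, v in to_paths(doc)])
```

The payoff of the reduction is visible even here: the learner only has to explain flat sequences of path segments, not arbitrary tree-to-tree mappings.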
Next, we present a new methodology for learning programs that migrate tree-structured documents to relational table representations from input-output examples. Our approach achieves this goal by decomposing the synthesis task into two subproblems: (A) learning the column extraction logic, and (B) learning the row extraction logic. We propose a technique for learning column extraction programs using deterministic finite automata, and a new algorithm for predicate learning which combines integer linear programming and logic minimization.
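The column extraction idea can be sketched with a DFA over per-cell type tags. The hand-written automaton below, accepting "one text cell followed by one or more numbers", is a hypothetical stand-in for a learned one; the dissertation learns such automata from examples:

```python
# Transition function of a DFA over cell tags: 'T' (text) or 'N' (number).
# It accepts rows shaped as one text cell followed by one or more numbers,
# a plausible shape for "label + measurements" data rows.
DELTA = {
    (0, "T"): 1,     # first cell must be text
    (1, "N"): 2,     # then at least one number
    (2, "N"): 2,     # ...and any further numbers
}
ACCEPT = {2}

def tag(cell):
    return "N" if isinstance(cell, (int, float)) else "T"

def accepts(row):
    state = 0
    for cell in row:
        state = DELTA.get((state, tag(cell)))
        if state is None:          # no transition: reject the row
            return False
    return state in ACCEPT

rows = [["Alice", 3.9, 4.0], ["header", "header2"], ["Bob", 2.7]]
data_rows = [r for r in rows if accepts(r)]
```

Rows matching the automaton are kept as data, while the all-text header row is filtered out, which is the column/row separation the paragraph above describes.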
Finally, we address the problem of automating data extraction tasks from natural language. Specifically, we focus on data retrieval from relational databases and describe a novel approach for learning SQL queries from English descriptions. The method is fully automatic and database-agnostic
(i.e., it does not require customization for each database). Our method combines semantic parsing techniques from the NLP community with novel programming-language ideas involving probabilistic type inhabitation and automated sketch repair.
Computer Science
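As a deliberately naive illustration of the target artifact (not the semantic-parsing method the dissertation describes), the toy translator below handles a single English question shape and executes the generated SQL against an in-memory SQLite database:

```python
import re
import sqlite3

def nl_to_sql(question, table, numeric_col):
    """Toy translation: handles only 'count ... above/below N' questions."""
    m = re.search(r"(above|below)\s+([\d.]+)", question.lower())
    op = ">" if m.group(1) == "above" else "<"
    return f"SELECT COUNT(*) FROM {table} WHERE {numeric_col} {op} {m.group(2)}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, gpa REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [("Ann", 3.8), ("Ben", 2.9), ("Cal", 3.5)])

sql = nl_to_sql("how many students have a gpa above 3.2", "students", "gpa")
count = conn.execute(sql).fetchone()[0]  # 2 students above 3.2
```

A real system must resolve which table and column the question refers to and cover far richer language, which is exactly where the semantic parsing, type inhabitation, and sketch repair mentioned above come in.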