Fusion de systèmes et analyse des caractéristiques linguistiques des requêtes : vers un processus de RI adaptatif (System fusion and analysis of the linguistic features of queries: towards an adaptive IR process)
Today, accessing huge volumes of information is a reality. Information retrieval (IR) techniques are used by an ever-growing number of Internet users to retrieve relevant information (data, video, pictures, etc.). In this work we are interested in textual IR. Three elements are necessary in an IR process: an information need (most often a query of a few words), an IR system, and a set of documents. The query is submitted to the system, which tries to return relevant documents from the document set as an answer to the user's inquiry. Variability in the expression of the query leads to variation in the performance of the systems (Buckley et al., 2004). For instance, system A can be very effective for a given query and very poor for another one, whereas system B obtains the opposite results. Our thesis is situated in this context of variability. The main objective of our work is to propose retrieval techniques that can adapt to different contexts. We consider, for example, that the linguistic features of queries, the performance of the systems, and their characteristics are contextual elements of the retrieval process. Several contributions are made in this thesis. Queries are clustered according to their linguistic features (Mothe and Tanguy, 2005) with techniques such as agglomerative hierarchical clustering and k-means. Queries are then analysed through the linguistic profile of the cluster they belong to. The underlying hypothesis is that some IR systems are more suitable than others for different clusters of queries. We analyse the performance of the systems on each of the resulting clusters of queries (query contexts).
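The clustering step can be sketched with a plain k-means over per-query feature vectors. The feature set below (query length in words, named-entity count, verb count) is a hypothetical stand-in for the linguistic profile of Mothe and Tanguy (2005), and the queries are invented for illustration:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: returns a list of cluster labels, one per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return labels

# Hypothetical linguistic profile per query:
# (length in words, named-entity count, verb count).
queries = {
    "endangered species mammals": (3, 0, 0),
    "Falkland petroleum exploration": (3, 1, 0),
    "why do tropical storms form and intensify": (7, 0, 2),
    "how is air quality measured in large cities": (8, 0, 1),
}
labels = kmeans(list(queries.values()), k=2)
```

With these toy features the two short keyword queries end up in one cluster and the two long natural-language questions in the other, which is the kind of linguistic grouping the thesis then evaluates systems against.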
Four fusion methods are proposed to combine, for a given query, the results returned by different IR systems, and a series of experiments validates our propositions. This work is carried out in the context of the TREC evaluation campaigns.
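The abstract does not name the four fusion methods, but CombSUM is a standard baseline for this kind of result merging: each system's scores are normalized, then summed per document. A minimal sketch with made-up scores:

```python
from collections import defaultdict

def comb_sum(ranked_lists):
    """CombSUM: sum each document's normalized scores across systems."""
    fused = defaultdict(float)
    for scores in ranked_lists:
        top = max(scores.values())  # normalize by each system's top score
        for doc, s in scores.items():
            fused[doc] += s / top
    # Rank documents by their fused score, best first.
    return sorted(fused, key=fused.get, reverse=True)

system_a = {"d1": 9.0, "d2": 3.0, "d3": 1.0}
system_b = {"d2": 0.8, "d3": 0.7, "d4": 0.1}
ranking = comb_sum([system_a, system_b])  # ['d2', 'd1', 'd3', 'd4']
```

Note how d2, ranked second by system A, overtakes d1 once system B's evidence is added; this cross-system corroboration is what fusion methods exploit.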
Information Extraction on Para-Relational Data.
Para-relational data (such as spreadsheets and diagrams) refers to a type of nearly
relational data that shares the important qualities of relational data but does not
present itself in a relational format. Para-relational data often conveys highly valuable
information and is widely used in many different areas. If we can convert para-relational
data into the relational format, many existing tools can be leveraged for a
variety of interesting applications, such as data analysis with relational query systems
and data integration applications.
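As a toy illustration of that payoff (not Senbazuru's actual algorithm), the sketch below flattens a small spreadsheet region with an indented hierarchical left header into relational tuples, then queries them with a relational query system (SQLite):

```python
import sqlite3

# A toy spreadsheet region: indentation in the left header encodes
# the parent category of each data row.
sheet = [
    ("Revenue", None),        # level-0 header row, no value
    ("  Hardware", 120),      # level-1 data rows under "Revenue"
    ("  Software", 80),
    ("Expenses", None),
    ("  Salaries", 150),
]

def to_relational(rows):
    """Flatten the hierarchy: each data cell becomes (category, item, value)."""
    tuples, parent = [], None
    for label, value in rows:
        if not label.startswith("  "):
            parent = label            # top-level header opens a new group
        else:
            tuples.append((parent, label.strip(), value))
    return tuples

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sheet (category TEXT, item TEXT, value REAL)")
conn.executemany("INSERT INTO sheet VALUES (?, ?, ?)", to_relational(sheet))
total = conn.execute(
    "SELECT SUM(value) FROM sheet WHERE category = 'Revenue'").fetchone()[0]
```

Once the data is in relational form, ordinary SQL aggregation (here, total revenue of 200) works directly, which is exactly the leverage the paragraph above describes.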
This dissertation aims to convert para-relational data into a high-quality relational
form with little user assistance. We have developed four standalone systems, each
addressing a specific type of para-relational data. Senbazuru is a prototype spreadsheet
database management system that extracts relational information from a large
number of spreadsheets. Anthias is an extension of the Senbazuru system to convert
a broader range of spreadsheets into a relational format. Lyretail is an extraction
system to detect long-tail dictionary entities on webpages. Finally, DiagramFlyer is
a web-based search system that obtains a large number of diagrams automatically
extracted from web-crawled PDFs. Together, these four systems demonstrate that
converting para-relational data into the relational format is possible today, and also
suggest directions for future systems.
PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/120853/1/chenzhe_1.pd
Automated synthesis of data extraction and transformation programs
Due to the abundance of data in today's data-rich world, end-users increasingly need to perform various data extraction and transformation tasks. While many of these tedious tasks can be performed in a programmatic way, most end-users lack the required programming expertise to automate them and end up spending their valuable time manually performing various data-related tasks. The field of program synthesis aims to overcome this problem by automatically generating programs from informal specifications, such as input-output examples or natural language.
This dissertation focuses on the design and implementation of new systems for automating important classes of data transformation and extraction tasks. It introduces solutions for automating data manipulation tasks on fully-structured data formats like relational tables, or on semi-structured formats such as XML and JSON documents.
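The examples-to-program idea can be illustrated with a deliberately tiny synthesizer that searches a fixed space of candidate string programs for one consistent with every input-output pair; the candidate set is invented for this sketch and far smaller than any real synthesis space:

```python
# Candidate programs: a tiny, hand-picked space of string transformations.
CANDIDATES = {
    "upper": str.upper,
    "lower": str.lower,
    "first_word": lambda s: s.split()[0],
    "last_word": lambda s: s.split()[-1],
    "initials": lambda s: "".join(w[0] for w in s.split()),
}

def synthesize(examples):
    """Return the name of the first candidate consistent with all examples."""
    for name, prog in CANDIDATES.items():
        if all(prog(inp) == out for inp, out in examples):
            return name
    return None  # no program in the space explains the examples

prog = synthesize([("Ada Lovelace", "AL"), ("Alan Turing", "AT")])  # "initials"
```

Real synthesizers replace this brute-force loop with pruning, ranking, and compositional search, but the specification style (input-output examples) is the same.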
First, we describe a novel algorithm for synthesizing hierarchical data transformations from input-output examples. A key novelty of our approach is that it reduces the synthesis of tree transformations to the simpler problem of synthesizing transformations over the paths of the tree. We also describe a new and effective algorithm for learning path transformations that combines logical SMT-based reasoning with machine learning techniques based on decision trees.
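A minimal sketch of the path decomposition, assuming trees are nested dicts: decompose into root-to-leaf paths, transform each path independently, then rebuild. The rename transformation is a made-up example, and the SMT/decision-tree learning of path transformations is not shown:

```python
def to_paths(tree, prefix=()):
    """Decompose a nested dict into (path, leaf) pairs."""
    if not isinstance(tree, dict):
        return [(prefix, tree)]
    pairs = []
    for k, v in tree.items():
        pairs.extend(to_paths(v, prefix + (k,)))
    return pairs

def from_paths(pairs):
    """Rebuild a nested dict from (path, leaf) pairs."""
    tree = {}
    for path, leaf in pairs:
        node = tree
        for k in path[:-1]:
            node = node.setdefault(k, {})
        node[path[-1]] = leaf
    return tree

# A path-level transformation (here: renaming one segment) is applied to
# each path independently, instead of reasoning about whole trees at once.
rename = lambda path: tuple("fullname" if seg == "name" else seg
                            for seg in path)

doc = {"person": {"name": "Ada", "born": 1815}}
out = from_paths([(rename(p), v) for p, v in to_paths(doc)])
```

The payoff of the reduction is visible even here: the learner only has to explain flat sequences of path segments, not arbitrary tree-to-tree mappings.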
Next, we present a new methodology for learning programs that migrate tree-structured documents to relational table representations from input-output examples. Our approach achieves this goal by decomposing the synthesis task into two subproblems: (A) learning the column extraction logic, and (B) learning the row extraction logic. We propose a technique for learning column extraction programs using deterministic finite automata, and a new algorithm for predicate learning which combines integer linear programming and logic minimization.
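The column extraction idea can be sketched with a DFA over per-cell type tags. The hand-written automaton below, accepting "one text cell followed by one or more numbers", is a hypothetical stand-in for a learned one; the dissertation learns such automata from examples:

```python
# Transition function of a DFA over cell tags: 'T' (text) or 'N' (number).
# It accepts rows shaped as one text cell followed by one or more numbers,
# a plausible shape for "label + measurements" data rows.
DELTA = {
    (0, "T"): 1,     # first cell must be text
    (1, "N"): 2,     # then at least one number
    (2, "N"): 2,     # ...and any further numbers
}
ACCEPT = {2}

def tag(cell):
    return "N" if isinstance(cell, (int, float)) else "T"

def accepts(row):
    state = 0
    for cell in row:
        state = DELTA.get((state, tag(cell)))
        if state is None:          # no transition: reject the row
            return False
    return state in ACCEPT

rows = [["Alice", 3.9, 4.0], ["header", "header2"], ["Bob", 2.7]]
data_rows = [r for r in rows if accepts(r)]
```

Rows matching the automaton are kept as data, while the all-text header row is filtered out, which is the column/row separation the paragraph above describes.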
Finally, we address the problem of automating data extraction tasks from natural language. Specifically, we focus on data retrieval from relational databases and describe a novel approach for learning SQL queries from English descriptions. The method is fully automatic and database-agnostic
(i.e., it does not require customization for each database). Our method combines semantic parsing techniques from the NLP community with novel programming-language ideas involving probabilistic type inhabitation and automated sketch repair.
Computer Science
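As a deliberately naive illustration of the target artifact (not the semantic-parsing method the dissertation describes), the toy translator below handles a single English question shape and executes the generated SQL against an in-memory SQLite database:

```python
import re
import sqlite3

def nl_to_sql(question, table, numeric_col):
    """Toy translation: handles only 'count ... above/below N' questions."""
    m = re.search(r"(above|below)\s+([\d.]+)", question.lower())
    op = ">" if m.group(1) == "above" else "<"
    return f"SELECT COUNT(*) FROM {table} WHERE {numeric_col} {op} {m.group(2)}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, gpa REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [("Ann", 3.8), ("Ben", 2.9), ("Cal", 3.5)])

sql = nl_to_sql("how many students have a gpa above 3.2", "students", "gpa")
count = conn.execute(sql).fetchone()[0]  # 2 students above 3.2
```

A real system must resolve which table and column the question refers to and cover far richer language, which is exactly where the semantic parsing, type inhabitation, and sketch repair mentioned above come in.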