3 research outputs found

    Fusion de systèmes et analyse des caractéristiques linguistiques des requêtes: vers un processus de RI adaptatif

    Today, access to large volumes of information is a reality. Information retrieval (IR) techniques are used by a growing number of Internet users to retrieve relevant information (data, video, pictures, etc.). In this work we are interested in textual IR. Three elements are necessary in an IR process: an information need (most often a query of a few words), an IR system, and a set of documents. The query is submitted to the system, which tries to return relevant documents from the set as an answer to the user's inquiry. Variability in the expression of the query leads to variation in system performance (Buckley et al., 2004). For instance, system A can be very effective for a given query and very poor for another, whereas system B gets the opposite results. Our thesis is set in this context of variability. The main objective of our work is to propose retrieval techniques that can adapt to different contexts. We consider, for example, that the linguistic features of queries, the performance of the systems, and their characteristics are contextual elements of the retrieval process. Several proposals are made in this thesis. Queries are clustered according to their linguistic features (Mothe and Tanguy, 2005) with techniques such as agglomerative clustering and k-means. Queries are then analysed through the linguistic profile of the cluster they belong to. The underlying hypothesis is that some IR systems are more suitable than others for different clusters of queries. We analyse the performance of the systems for each cluster of queries (query context).
Four fusion methods are proposed and tested in a set of experiments. This work is carried out in the context of the TREC campaigns.

Information retrieval (IR) is an increasingly visible research field, especially given the profusion of data (text, images, video, etc.) on the Internet. In this thesis we are interested in IR from unstructured textual documents. Three elements are essential in an IR process: an information need (generally expressed as a query), an information retrieval system (IRS), and a collection of documents. The query is submitted to the IRS, which searches the collection for the documents most relevant to the query. Variability in the expression of the query, in the relationship between the query and the documents, and in the characteristics of the IRSs used leads to variability in the responses obtained (Buckley et al., 2004). Thus, system A may perform very well for a given query and very poorly for another, whereas system B yields the opposite results. Our thesis is set in this context. Our objective is to propose retrieval methods that can be integrated into a retrieval model capable of adapting to different contexts. We consider, for example, that the linguistic features (LF) of queries, the local performance of the systems, and their characteristics are elements that define different contexts. We propose several processes to achieve this objective. First, we use a linguistic profile of queries (Mothe and Tanguy, 2005), which lets us establish a classification of queries based on their LF. For this purpose we use statistical data analysis techniques such as hierarchical agglomerative clustering (HAC) and k-means.
Queries are then no longer considered in isolation but are seen as groups with similar LF. Our underlying hypothesis is that there are contexts in which some IRSs are better suited than others. We then study the performance of the systems on the resulting classes of queries (contexts). We propose four fusion methods to combine the results obtained for a given query by different IRSs. A series of experiments validates our proposals. All of this work relies on evaluation through the TREC evaluation campaigns.
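
The idea of studying system performance per query cluster can be illustrated with a minimal sketch. It assumes the queries have already been assigned to clusters (for instance by k-means on their linguistic features) and that a per-query effectiveness score such as average precision is available for each system. The function name, the data layout, and the toy numbers are hypothetical and are not taken from the thesis; this is also not one of its four fusion methods, only a simple "best system per cluster" selection.

```python
from collections import defaultdict
from statistics import mean


def best_system_per_cluster(cluster_of, scores):
    """Return {cluster: system}, choosing the system with the highest mean score.

    cluster_of: {query_id: cluster_label}
    scores:     {system_name: {query_id: effectiveness score}}
    """
    # Group each system's per-query scores by the cluster of the query.
    per_cluster = defaultdict(lambda: defaultdict(list))
    for system, by_query in scores.items():
        for query, score in by_query.items():
            per_cluster[cluster_of[query]][system].append(score)
    # For every cluster, keep the system whose mean score is highest.
    return {
        cluster: max(by_system, key=lambda s: mean(by_system[s]))
        for cluster, by_system in per_cluster.items()
    }


# Toy, made-up values purely for illustration (not results from the thesis).
cluster_of = {"q1": 0, "q2": 0, "q3": 1, "q4": 1}
scores = {
    "A": {"q1": 0.60, "q2": 0.50, "q3": 0.10, "q4": 0.20},
    "B": {"q1": 0.20, "q2": 0.30, "q3": 0.40, "q4": 0.50},
}
selection = best_system_per_cluster(cluster_of, scores)
```

With these toy values, system A is selected for cluster 0 and system B for cluster 1, mirroring the "system A is good where system B is poor" situation described in the abstract.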

    Information Extraction on Para-Relational Data.

    Para-relational data (such as spreadsheets and diagrams) refers to a type of nearly relational data that shares the important qualities of relational data but does not present itself in a relational format. Para-relational data often conveys highly valuable information and is widely used in many different areas. If we can convert para-relational data into the relational format, many existing tools can be leveraged for a variety of interesting applications, such as data analysis with relational query systems and data integration applications. This dissertation aims to convert para-relational data into a high-quality relational form with little user assistance. We have developed four standalone systems, each addressing a specific type of para-relational data. Senbazuru is a prototype spreadsheet database management system that extracts relational information from a large number of spreadsheets. Anthias is an extension of the Senbazuru system to convert a broader range of spreadsheets into a relational format. Lyretail is an extraction system to detect long-tail dictionary entities on webpages. Finally, DiagramFlyer is a web-based search system that obtains a large number of diagrams automatically extracted from web-crawled PDFs. Together, these four systems demonstrate that converting para-relational data into the relational format is possible today, and also suggest directions for future systems.
    PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/120853/1/chenzhe_1.pd