Focused Retrieval
Traditional information retrieval applications, such as Web search, return atomic units of retrieval, generically called "documents". Depending on the application, a document may be a Web page, an email message, a journal article, or any similar object. In contrast to this traditional approach, focused retrieval helps users pinpoint their exact information needs by returning results at the sub-document level. These results may consist of predefined document components (such as pages, sections, and paragraphs) or of arbitrary passages comprising any sub-string of a document. If a document is marked up with XML, a focused retrieval system might return individual XML elements or ranges of elements. This thesis proposes and evaluates a number of approaches to focused retrieval, including methods based on XML markup and methods based on arbitrary passages. It considers the best unit of retrieval, explores methods for efficient sub-document retrieval, and evaluates formulae for sub-document scoring. Focused retrieval is also considered in the specific context of Wikipedia, where methods for automatic vandalism detection and automatic link generation are developed and evaluated.
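To make the notion of sub-document retrieval units concrete, the following sketch enumerates the XML elements of a toy document that could serve as focused retrieval units. The tag names and the length threshold are illustrative assumptions, not taken from the thesis:

```python
import xml.etree.ElementTree as ET

# A toy XML document with hypothetical tag names.
DOC = """<article>
  <sec><title>Intro</title><p>Focused retrieval returns parts of documents.</p></sec>
  <sec><p>Each element is a candidate retrieval unit.</p></sec>
</article>"""

def candidate_units(xml_text, min_chars=10):
    """Yield (tag, text) for every element whose text content is long
    enough to be a plausible retrieval unit (threshold is illustrative)."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        text = "".join(elem.itertext()).strip()
        if len(text) >= min_chars:
            yield elem.tag, text

units = list(candidate_units(DOC))
```

A real system would then score each such unit against the query instead of scoring only whole documents.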
Mining cross-document relationships from text
The paper argues that automatic link generation and typing methods are needed to find and maintain cross-document links in large and growing textual collections. Such links are important for organising information and for supporting search and navigation. We present an experimental study on mining cross-document links from a collection of 5000 documents. We identify a set of link types and show that semantic similarity is a good distinguishing indicator.
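Semantic similarity between documents, the distinguishing indicator mentioned above, is commonly approximated by cosine similarity over term vectors. A minimal sketch under that assumption, using raw term counts rather than the paper's exact weighting:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words term vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

# Candidate link endpoints with overlapping vocabulary score higher.
s = cosine("xml element retrieval", "retrieval of xml element content")
```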
KMI, The Open University at NTCIR-9 CrossLink: Cross-Lingual Link Discovery in Wikipedia using explicit semantic analysis
This paper describes the methods used in the submission of the Knowledge Media Institute (KMI), The Open University, to the NTCIR-9 Cross-Lingual Link Discovery (CLLD) task, entitled CrossLink. KMI submitted four runs for link discovery from English to Chinese; however, the developed methods, which utilise Explicit Semantic Analysis (ESA), are also applicable to other language combinations. Three of the runs are based on exploiting the existing cross-lingual mapping between different language versions of Wikipedia articles. In the fourth run, we assume that information about the mapping is not available. Our methods achieved encouraging results, and we describe in detail how their performance can be further improved. Finally, we discuss two important issues in link discovery: the evaluation methodology and the applicability of the developed methods across different textual collections.
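Explicit Semantic Analysis represents a text as a vector of association strengths with Wikipedia concepts; for cross-lingual link discovery, the concept dimensions can be mapped between language versions. A toy sketch under those assumptions (the concept profiles and the title mapping below are invented for illustration):

```python
def esa_vector(text, concept_index):
    """Score a text against each concept's keyword profile (toy ESA:
    real ESA uses TF-IDF weights over full Wikipedia articles)."""
    words = set(text.lower().split())
    return {c: len(words & kw) for c, kw in concept_index.items() if words & kw}

# Invented English concept profiles keyed by Wikipedia article title.
EN_CONCEPTS = {"Information_retrieval": {"retrieval", "search", "query"},
               "XML": {"xml", "markup", "element"}}
# Assumed cross-lingual mapping from English to Chinese article titles.
EN_TO_ZH = {"Information_retrieval": "信息检箢", "XML": "XML"}

vec = esa_vector("search query over xml element trees", EN_CONCEPTS)
# Re-express the vector in the target language's concept space.
zh_vec = {EN_TO_ZH[c]: w for c, w in vec.items()}
```

Texts in different languages can then be compared through their (mapped) concept vectors, avoiding machine translation.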
Information retrieval in XML documents: taking links into account when selecting relevant elements
Our work is situated in the context of information retrieval (IR), more specifically information retrieval in semi-structured documents of the XML type. Exploiting the available XML documents effectively requires taking the structural dimension into account. This dimension has led to the emergence of new challenges in the field of IR. Unlike classical IR approaches, which focus on searching unstructured content, XML IR combines textual and structural information to perform different retrieval tasks. Several approaches exploiting these types of evidence have been proposed; they are mainly based on classical IR models adapted to XML documents. The XML structure has been used to provide focused access to documents, returning document components (e.g. sections, paragraphs, etc.) instead of whole documents in response to a user query.
In traditional IR, the similarity measure is generally based on textual information. It ranks documents according to their degree of relevance, using measures such as "term similarity" or "term probability". However, other sources of evidence can be considered when searching for relevant information in documents. For example, hyperlinks have been widely exploited in Web IR. Despite their popularity in the Web context, few approaches exploiting this source of evidence have been proposed in the context of XML IR.
The goal of our work is to propose approaches for using links as a source of evidence in XML information retrieval. This thesis aims to answer the following research questions:
1. Can links be considered a source of evidence in the context of XML IR?
2. Does the use of certain link analysis algorithms in the context of XML IR improve the quality of the results, in particular for the Wikipedia collection?
3. Which types of links are most useful for improving the relevance of search results?
4. How should the link score of the different elements returned as search results be computed? Should we consider document-to-document links or, more precisely, element-to-element links? What weight should navigational links carry compared to hierarchical links?
5. What is the impact of using links in a global versus a local context?
6. How should the link score be integrated into the final score of the returned XML elements?
7. What is the impact of the quality of the initial results on the behaviour of the proposed formulae?
To answer these questions, we conducted a statistical study on the search results returned by the "DALIAN" information retrieval system, using the test collection provided by INEX; it clearly showed that links are an indicator of element relevance in the context of XML IR. We also implemented three link analysis algorithms (PageRank, HITS and SALSA), which allowed us to carry out a comparative study showing that query-dependent approaches outperform global-context approaches. In this thesis we proposed three formulae for computing the link score: the first, called "Topical PageRank"; the second, a distance-based formula; and the third, a weighted-links-based formula. We also proposed three combination formulae: a linear formula, a Dempster-Shafer formula, and a fuzzy-based formula. Finally, we ran a series of experiments. All of these experiments showed that the proposed approaches improved the relevance of the results for the different configurations tested; that query-dependent approaches outperform global-context approaches; that approaches exploiting element-to-element links obtained good results; and that combination formulae based on uncertainty for computing the final scores of XML elements achieve good performance.
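Two of the ingredients above, link analysis and linear score combination, can be sketched as follows; the graph, damping factor and interpolation weight are illustrative, not values from the thesis:

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank over a {node: [out-neighbours]} graph
    (dangling mass is ignored in this toy version)."""
    nodes = set(links) | {v for outs in links.values() for v in outs}
    pr = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for n, outs in links.items():
            for v in outs:
                new[v] += d * pr[n] / len(outs)
        pr = new
    return pr

def combined_score(content, link, lam=0.7):
    """Linear interpolation of content and link evidence for an element."""
    return lam * content + (1 - lam) * link

# Toy element-to-element link graph: which elements link to which.
pr = pagerank({"e1": ["e2"], "e2": ["e1"], "e3": ["e1"]})
```

Elements that receive many links (here e1) accumulate link evidence, which the linear formula then blends with the content score.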
BM25t: a BM25 extension for focused information retrieval
This paper addresses the integration of XML tags into a term-weighting function for focused XML Information Retrieval (IR). Our model allows us to take into account a certain kind of structural information: tags that represent a logical structure (e.g. title, section, paragraph, etc.) as well as other tags (e.g. bold, italic, center, etc.). We account for the influence of a tag by estimating the probability that the tag distinguishes relevant terms from the others. These weights are then integrated into a term-weighting function. Experiments on a large collection from the INEX 2008 XML IR evaluation campaign showed improvements on focused XML retrieval.
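The integration can be sketched as scaling a term's frequency by its tag's weight before applying BM25. The tag weight below is an illustrative constant, whereas the paper estimates it from the probability that the tag marks relevant terms:

```python
import math

def bm25t(tf, df, N, dl, avgdl, tag_weight=1.0, k1=1.2, b=0.75):
    """BM25 score for one term, with the term frequency scaled by the
    weight of the XML tag enclosing its occurrences (illustrative)."""
    wtf = tf * tag_weight  # tag-boosted term frequency
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
    return idf * wtf * (k1 + 1) / (wtf + k1 * (1 - b + b * dl / avgdl))

# The same occurrence counts score higher inside a highly weighted tag.
plain = bm25t(tf=2, df=10, N=1000, dl=120, avgdl=100)
titled = bm25t(tf=2, df=10, N=1000, dl=120, avgdl=100, tag_weight=2.0)
```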
XML retrieval using pruned element-index files
An element-index is a crucial mechanism for supporting content-only (CO) queries over XML collections. A full element-index, which indexes each element along with the content of its descendants, involves high redundancy and reduces query processing efficiency. A direct index, on the other hand, indexes only the content that is directly under each element and disregards the descendants. This results in a smaller index, but possibly at the cost of some reduction in system effectiveness. In this paper, we propose using static index pruning techniques to obtain more compact index files that can still yield retrieval performance comparable to that of a full index. We also compare the retrieval performance of these pruning-based approaches to some other strategies that make use of a direct element-index. Our experiments, conducted along the lines of the INEX evaluation framework, reveal that pruned index files yield retrieval performance comparable to, or even better than, the full index and direct index for several tasks in the ad hoc track. © 2010 Springer-Verlag Berlin Heidelberg.
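Term-centric static pruning, one member of the family of techniques evaluated here, keeps only the highest-scoring postings of each term; a simplified sketch with an illustrative score and cutoff:

```python
def prune_index(index, keep_frac=0.5):
    """Keep only the top fraction of postings per term, ranked by score.
    index maps term -> [(element_id, score), ...] (scores illustrative)."""
    pruned = {}
    for term, postings in index.items():
        k = max(1, int(len(postings) * keep_frac))
        pruned[term] = sorted(postings, key=lambda p: p[1], reverse=True)[:k]
    return pruned

full = {"xml": [("e1", 0.9), ("e2", 0.2), ("e3", 0.5), ("e4", 0.1)]}
small = prune_index(full, keep_frac=0.5)
```

The pruned index is smaller, and because low-scoring postings rarely enter top results, effectiveness can remain close to that of the full index.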
Applying Wikipedia to Interactive Information Retrieval
There are many opportunities to improve the interactivity of information retrieval systems beyond the ubiquitous search box. One idea is to use knowledge bases (e.g. controlled vocabularies, classification schemes, thesauri and ontologies) to organize, describe and navigate the information space. These resources are popular in libraries and specialist collections, but have proven too expensive and narrow to be applied to everyday web-scale search. Wikipedia has the potential to bring structured knowledge into more widespread use. This online, collaboratively generated encyclopaedia is one of the largest and most consulted reference works in existence. It is broader, deeper and more agile than the knowledge bases put forward to assist retrieval in the past. Rendering this resource machine-readable is a challenging task that has captured the interest of many researchers. Many see it as a key step required to break the knowledge acquisition bottleneck that crippled previous efforts. This thesis claims that the roadblock can be sidestepped: Wikipedia can be applied effectively to open-domain information retrieval with minimal natural language processing or information extraction. The key is to focus on gathering and applying human-readable rather than machine-readable knowledge. To demonstrate this claim, the thesis tackles three separate problems: extracting knowledge from Wikipedia; connecting it to textual documents; and applying it to the retrieval process. First, we demonstrate that a large thesaurus-like structure can be obtained directly from Wikipedia, and that accurate measures of semantic relatedness can be efficiently mined from it. Second, we show that Wikipedia provides the necessary features and training data for existing data mining techniques to accurately detect and disambiguate topics when they are mentioned in plain text.
Third, we provide two systems and user studies that demonstrate the utility of the Wikipedia-derived knowledge base for interactive information retrieval.
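A well-known way to mine semantic relatedness from Wikipedia's link graph compares the in-link sets of two articles (in the style of Milne and Witten's measure); the link sets below are invented for illustration:

```python
import math

def link_relatedness(a_in, b_in, n_articles):
    """Link-based semantic relatedness between two Wikipedia articles,
    computed from the sets of articles that link to each of them."""
    a, b = set(a_in), set(b_in)
    common = a & b
    if not common:
        return 0.0
    dist = ((math.log(max(len(a), len(b))) - math.log(len(common)))
            / (math.log(n_articles) - math.log(min(len(a), len(b)))))
    return max(0.0, 1.0 - dist)

# Two articles sharing most of their (invented) in-links are closely related.
r = link_relatedness({"p1", "p2", "p3"}, {"p2", "p3", "p4", "p5"}, n_articles=1000)
```

The measure needs only the link graph, no text analysis, which is what makes it cheap enough to compute over all of Wikipedia.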
Using Explicit Semantic Analysis to Link in Multi-Lingual Document Collections
Keeping links in quickly growing document collections up-to-date is problematic, which is exacerbated by their multi-linguality. We utilize Explicit Semantic Analysis to help identify relevant documents and links across languages without machine translation. We designed and implemented several approaches as a part of our link discovery system. Evaluation was conducted on Chinese, Czech, English and Spanish Wikipedia. Also, we discuss the evaluation methodology for such systems and assess the agreement between links on different versions of Wikipedia. In addition, we evaluate properties of Explicit Semantic Analysis which are important for its practical use.
The Role of Context in Matching and Evaluation of XML Information Retrieval
This dissertation addresses focused retrieval, especially its sub-concept XML (eXtensible Mark-up Language) information retrieval (XML IR). In XML IR, the retrievable units are either individual elements, or sets of elements grouped together, typically by a document. These units are ranked according to their estimated relevance by an XML IR system. In traditional information retrieval, the retrievable unit is an atomic document. Due to this atomicity, many core characteristics of the document retrieval paradigm are not appropriate for XML IR. Of these characteristics, this dissertation explores element indexing, scoring and evaluation methods, which form two main themes:
1. Element indexing, scoring, and contextualization
2. Focused retrieval evaluation
To investigate the first theme, an XML IR system based on structural indices, TRIX (Tampere Retrieval and Indexing for XML), is constructed. The structural indices offer analyzing power for studying element hierarchies. The main finding of the system development is the utilization of surrounding elements as supplementary evidence in element scoring. This method is called contextualization, for which we distinguish three models: vertical, horizontal and ad hoc contextualization.
The models are tested with the tools provided by (or derived from) the Initiative for the Evaluation of XML retrieval (INEX). The results indicate that the evidence from element surroundings improves the scoring effectiveness of XML retrieval.
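Vertical contextualization can be sketched as blending an element's own score with the scores of its ancestors; the weights and scores below are illustrative, not the dissertation's tuned values:

```python
def contextualize(scores, parent, elem, weights=(0.6, 0.3, 0.1)):
    """Blend an element's score with its parent's and the root's
    (vertical contextualization; weights are illustrative)."""
    chain = [elem]
    while chain[-1] in parent:          # walk up to the root element
        chain.append(parent[chain[-1]])
    own, par, root = chain[0], chain[min(1, len(chain) - 1)], chain[-1]
    w_own, w_par, w_root = weights
    return w_own * scores[own] + w_par * scores[par] + w_root * scores[root]

# A paragraph's sparse text evidence is compensated by its ancestors.
scores = {"article": 0.2, "sec": 0.5, "p": 0.9}
parent = {"p": "sec", "sec": "article"}
s = contextualize(scores, parent, "p")
```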
The second theme entails a task where the retrievable elements are grouped by a document. The aim of this theme is to create methods measuring XML IR effectiveness in a credible fashion in a laboratory environment. The credibility is pursued by assuming the chronological reading order of a user together with a point where the user becomes frustrated after reading a certain amount of non-relevant material. Novel metrics are created based on these assumptions.
The relative rankings of systems measured with the metrics differ from those delivered by contemporary metrics. In addition, focused retrieval strategies benefit from the novel metrics over traditional full-document retrieval. One of the metrics developed in this work, T2I(300), has been adopted as an official metric in INEX (Initiative for the Evaluation of XML Retrieval), the XML retrieval research forum founded in 2002.
Linking Textual Resources to Support Information Discovery
A vast amount of information is today stored in the form of textual documents, many of which are available online. These documents come from different sources and are of different types. They include newspaper articles, books, corporate reports, encyclopedia entries and research papers. At a semantic level, these documents contain knowledge, which was created by explicitly connecting information and expressing it in natural language. However, a significant amount of knowledge is not explicitly stated in a single document, yet can be derived or discovered by researching, i.e. accessing, comparing, contrasting and analysing, information from multiple documents. Carrying out this work using traditional search interfaces is tedious due to information overload and the difficulty of formulating queries that would help us to discover information we are not aware of.
In order to support this exploratory process, we need to be able to effectively navigate between related pieces of information across documents. While information can be connected using manually curated cross-document links, this approach not only does not scale, but cannot systematically assist us in the discovery of sometimes non-obvious (hidden) relationships. Consequently, there is a need for automatic approaches to link discovery.
This work studies how people link content, investigates the properties of different link types, presents new methods for automatic link discovery, and designs a system that applies link discovery to a collection of millions of documents to improve access to public knowledge.