465 research outputs found

    Evidential-Link-based Approach for Re-ranking XML Retrieval Results

    In this paper, we propose a new evidential link-based approach for re-ranking XML retrieval results. The approach, based on Dempster-Shafer theory of evidence, combines, for each retrieved XML element, content relevance evidence, and computed link evidence (score and rank). The use of the Dempster–Shafer theory is motivated by the need to improve retrieval accuracy by incorporating the uncertain nature of both bodies of evidence (content and link relevance). The link score is computed according to a new link analysis algorithm based on weighted links, where relevance is propagated through the two types of links, i.e., hierarchical and navigational. The propagation, i.e. the amount of relevance score received by each retrieved XML element, depends on link weight which is defined according to two parameters: link type and link length. To evaluate our proposal we carried out a set of experiments based on INEX data collectio
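    As an illustration of the combination step described in this abstract, the sketch below applies Dempster's rule of combination to two bodies of evidence about a single XML element. It is a minimal sketch, not the authors' implementation: the frame of discernment (`relevant` / `non_relevant`), the function name `combine`, and all mass values are assumptions chosen for illustration, and the abstract does not say how the paper derives masses from content and link scores.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to a mass;
    the masses of one function are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Mass assigned to contradictory hypotheses counts as conflict.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("the two bodies of evidence are in total conflict")
    # Normalise by the non-conflicting mass.
    return {h: mass / (1.0 - conflict) for h, mass in combined.items()}


# Hypothetical frame of discernment for one retrieved XML element.
REL = frozenset({"relevant"})
NON = frozenset({"non_relevant"})
ANY = REL | NON  # mass on the whole frame expresses uncertainty

# Hypothetical masses; real values would come from content and link scores.
content_evidence = {REL: 0.6, NON: 0.1, ANY: 0.3}
link_evidence = {REL: 0.5, NON: 0.2, ANY: 0.3}

fused = combine(content_evidence, link_evidence)
```

    With these made-up masses, the fused belief in `relevant` comes out at roughly 0.76, higher than either source alone, because both sources lean the same way. Re-ranking would then sort elements by a fused score of this kind.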

    A document management methodology based on similarity contents

    The advent of the WWW and distributed information systems has made it possible to share documents between different users and organisations. However, this has created many problems related to the security, accessibility, rights and, most importantly, consistency of documents. It is important that the people involved in the document management process have access to the most up-to-date versions of documents, retrieve the correct documents, and are able to update the document repository in such a way that their documents are known to others. In this paper we propose a method for organising, storing and retrieving documents based on content similarity. The method uses techniques based on information retrieval, document indexing, and term extraction and indexing. This methodology is developed for the E-Cognos project, which aims at developing tools for the management and sharing of documents in the construction domain

    Indexing, learning and content-based retrieval for special purpose image databases

    This chapter deals with content-based image retrieval in special purpose image databases. As image data is amassed ever more effortlessly, building efficient systems for searching and browsing image databases becomes increasingly urgent. We provide an overview of the current state of the art by taking a tour along the entir

    Indexing Heterogeneous XML for Full-Text Search

    XML documents are becoming more and more common in various environments. In particular, enterprise-scale document management is commonly centred around XML, and desktop applications as well as online document collections are soon to follow. The growing number of XML documents increases the importance of appropriate indexing methods and search tools in keeping the information accessible. Therefore, we focus on content that is stored in XML format as we develop such indexing methods. Because XML is used for different kinds of content ranging all the way from records of data fields to narrative full-texts, the methods for Information Retrieval are facing a new challenge in identifying which content is subject to data queries and which should be indexed for full-text search. In response to this challenge, we analyse the relation of character content and XML tags in XML documents in order to separate the full-text from data. As a result, we are able to both reduce the size of the index by 5-6% and improve the retrieval precision as we select the XML fragments to be indexed. Besides being challenging, XML comes with many unexplored opportunities which are not paid much attention in the literature. For example, authors often tag the content they want to emphasise by using a typeface that stands out. The tagged content constitutes phrases that are descriptive of the content and useful for full-text search. They are simple to detect in XML documents, but also possible to confuse with other inline-level text. Nonetheless, the search results seem to improve when the detected phrases are given additional weight in the index. Similar improvements are reported when related content is associated with the indexed full-text including titles, captions, and references. Experimental results show that for certain types of document collections, at least, the proposed methods help us find the relevant answers. 
Even when we know nothing about the document structure but the XML syntax, we are able to take advantage of the XML structure when the content is indexed for full-text search.
XML has become common as a format for text documents in many environments. In particular, enterprise-level document management is based on XML, but XML is also a common storage format for both text and data on home computers and on the WWW. The strong growth in the number of documents underlines the importance of indexing and search methods, because the amount of information the documents contain cannot be managed without an information retrieval system. We therefore focus on indexing content stored in XML format for text search. As a document format, XML in no way restricts the nature of the stored content: XML documents contain everything from raw computer data to literary prose. It is therefore important to identify the nature of the content before indexing it. One method for separating data from text is to analyse the internal structure of XML documents: data requires a strictly regular, formally defined structure, whereas the XML structure of text documents varies widely. When data is left out of the index, the index is about 5-6% smaller and the search results are more precise. XML documents also have other properties that have not previously been exploited in text indexing methods. Content that the author wants to emphasise, for example with a different typeface, is marked separately in the XML code. Emphasised content is therefore easy to locate. Giving it more weight in the index than non-emphasised content steers the search results in a better direction. Analysing and weighting titles, captions, and references has the same effect. According to preliminary test results, the presented indexing methods help in finding relevant information in XML documents

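    Recherche d'information dans les documents XML : prise en compte des liens pour la sélection d'éléments pertinents [Information retrieval in XML documents: taking links into account when selecting relevant elements]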

    156 p. : ill. ; 30 cm. Our work lies in the field of information retrieval (IR), and more specifically information retrieval in semi-structured XML documents. Exploiting the available XML documents effectively requires taking their structural dimension into account. This dimension has led to new challenges in IR. Unlike classical IR approaches, which focus on searching unstructured content, XML IR combines textual and structural information to carry out different retrieval tasks. Several approaches exploiting these types of evidence have been proposed; they are mainly based on classical IR models adapted to XML documents. The XML structure has been used to provide focused access to documents by returning document components (for example sections, paragraphs, etc.) rather than a whole document in response to a user query. In traditional IR, the similarity measure is generally based on textual information. It allows documents to be ranked by their degree of relevance using measures such as "term similarity" or "term probability". However, other sources of evidence can be considered when searching for relevant information in documents. For example, hypertext links have been widely exploited in Web IR. Despite their popularity in the Web context, few approaches exploiting this source of evidence have been proposed for XML IR. The aim of our work is to propose approaches for using links as a source of evidence in XML information retrieval. This thesis aims to answer the following research questions: 1. Can links be considered a source of evidence in XML IR? 2. Does using certain link analysis algorithms in XML IR improve the quality of results, in particular for the Wikipedia collection? 3. Which types of links best improve the relevance of search results? 4. How should the link score of the elements returned as search results be computed? Should "document-document" links be considered, or more precisely "element-element" links? What is the weight of navigational links relative to hierarchical links? 5. What is the impact of using links in a global or a local context? 6. How should the link score be integrated into the final score of the returned XML elements? 7. What is the impact of the quality of the initial results on the behaviour of the proposed formulas? To answer these questions, we carried out a statistical study of the search results returned by the "DALIAN" information retrieval system, using the test collection provided by INEX, which clearly showed that links are a sign of element relevance in XML IR. We also implemented three link analysis algorithms (PageRank, HITS and SALSA), which allowed us to carry out a comparative study showing that "query-dependent" approaches outperform "global context" approaches. In this thesis we proposed three formulas for computing the link score: the first is called "Topical Pagerank"; the second is the "distance-based" formula; and the third is "weighted links based". We also proposed three combination formulas, namely the linear formula, the Dempster-Shafer formula and the fuzzy-based formula. Finally, we carried out a series of experiments. All of these experiments showed that: the proposed approaches improved the relevance of results for the different configurations tested; "query-dependent" approaches outperform "global context" approaches; approaches exploiting "element-element" links obtained good results; and the combination formulas that rely on the principle of uncertainty to compute the final scores of XML elements achieve good performance

    Automatic bilingual text document summarization.

    Lo Sau-Han Silvia. Thesis (M.Phil.), Chinese University of Hong Kong, 2002. Includes bibliographical references (leaves 137-143). Abstracts in English and Chinese.
    Chapter 1. Introduction
        1.1 Definition of a summary
        1.2 Definition of text summarization
        1.3 Previous work
            1.3.1 Extract-based text summarization
            1.3.2 Abstract-based text summarization
            1.3.3 Sophisticated text summarization
        1.4 Summarization evaluation methods
            1.4.1 Intrinsic evaluation
            1.4.2 Extrinsic evaluation
            1.4.3 The TIPSTER SUMMAC text summarization evaluation
            1.4.4 Text Summarization Challenge (TSC)
        1.5 Research contributions
            1.5.1 Text summarization based on thematic term approach
            1.5.2 Bilingual news summarization based on an event-driven approach
        1.6 Thesis organization
    Chapter 2. Text Summarization based on a Thematic Term Approach
        2.1 System overview
        2.2 Document preprocessor
            2.2.1 English corpus
            2.2.2 English corpus preprocessor
            2.2.3 Chinese corpus
            2.2.4 Chinese corpus preprocessor
        2.3 Corpus thematic term extractor
        2.4 Article thematic term extractor
        2.5 Sentence score generator
        2.6 Chapter summary
    Chapter 3. Evaluation for Summarization using the Thematic Term Approach
        3.1 Content-based similarity measure
        3.2 Experiments using content-based similarity measure
            3.2.1 English corpus and parameter training
            3.2.2 Experimental results using content-based similarity measure
        3.3 Average inverse rank (AIR) method
        3.4 Experiments using average inverse rank method
            3.4.1 Corpora and parameter training
            3.4.2 Experimental results using AIR method
        3.5 Comparison between the content-based similarity measure and the average inverse rank method
        3.6 Chapter summary
    Chapter 4. Bilingual Event-Driven News Summarization
        4.1 Corpora
        4.2 Topic and event definitions
        4.3 Architecture of bilingual event-driven news summarization system
        4.4 Bilingual event-driven approach summarization
            4.4.1 Dictionary-based term translation applying on English news articles
            4.4.2 Preprocessing for Chinese news articles
            4.4.3 Event clusters generation
            4.4.4 Cluster selection and summary generation
        4.5 Evaluation for summarization based on event-driven approach
        4.6 Experimental results on event-driven summarization
            4.6.1 Experimental settings
            4.6.2 Results and analysis
        4.7 Chapter summary
    Chapter 5. Applying Event-Driven Summarization to a Parallel Corpus
        5.1 Parallel corpus
        5.2 Parallel documents preparation
        5.3 Evaluation methods for the event-driven summaries generated from the parallel corpus
        5.4 Experimental results and analysis
            5.4.1 Experimental settings
            5.4.2 Results and analysis
        5.5 Chapter summary
    Chapter 6. Conclusions and Future Work
        6.1 Conclusions
        6.2 Future work
    Bibliography
    Appendix A. English Stop Word List
    Appendix B. Chinese Stop Word List
    Appendix C. Event List Items on the Corpora
        C.1 Event list items for the topic "Upcoming Philippine election"
        C.2 Event list items for the topic "German train derail"
        C.3 Event list items for the topic "Electronic service delivery (ESD) scheme"
    Appendix D. The sample of an English article (9505001.xml)

    Creating a Phrase Similarity Graph From Wikipedia

    The paper addresses the problem of modeling the relationship between phrases in English using a similarity graph. The mathematical model stores data about the strength of the relationship between phrases expressed as a decimal number. Both structured data from Wikipedia (for example, that the Wikipedia page titled “Dog” belongs to the Wikipedia category “Domesticated animals”) and textual descriptions (for example, that the Wikipedia page titled “Dog” contains the word “wolf” thirty-one times) are used in creating the graph. The quality of the graph data is validated by comparing the similarity of pairs of phrases computed by our graph-based software with the results of studies that were performed with human subjects. To the best of our knowledge, our software produces better correlation with the results of both the Miller and Charles study and the WordSimilarity-353 study than any other published research

    Intelligent information retrieval and fault diagnosis for the asset management of power substations

    This thesis mainly presents two intelligent approaches to the Asset Management (AM) of power substations, which include an Evidential Reasoning (ER)-based document ranking approach to an Ontology-based Document Search Engine (ODSE) for the Information Retrieval (IR) of power substations and an Association Rule Mining (ARM)-based Dissolved Gas Analysis (DGA) approach to the Fault Diagnosis (FD) of power transformers

    Temporally Biased Search Result Snippets

    Search engine result snippets are an important source of information that gives the user quick insight into the corresponding result documents. When the search terms are too general, like a person's name or a company's name, creating an appropriate snippet that effectively summarizes the document's content can be challenging, because the search term occurs many times in the top-ranked documents and there is no simple means of selecting a subset of the sentences containing it to form the result snippet. In web pages classified as narratives and news articles, multiple references to explicit, implicit and relative temporal expressions can be found. Based on these expressions, the sentences can be ordered on a timeline. In this thesis, we propose generating an alternate search result snippet by exploiting these temporal expressions embedded within the pages, using a timeline map. Our method of snippet generation is mainly targeted at general search terms. At present, when the search terms are too general, existing systems generate static snippets for the resulting pages, such as displaying the first line. In our approach, we introduce an alternate method of extracting and selecting temporal data from these pages to adapt a snippet into a more effective summary. Specifically, it selects and blends temporally interesting sentences. Using the weighted kappa measure, we evaluate our approach by comparing snippets generated for multiple search terms by existing systems with snippets generated by our approach
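    The weighted kappa measure mentioned in this abstract can be sketched as follows. This is a minimal sketch of Cohen's weighted kappa with linear disagreement weights; the abstract does not state which weighting scheme or rating scale the thesis uses, so the linear weights, the function name `weighted_kappa`, and the two-judge setup are assumptions made for illustration.

```python
def weighted_kappa(ratings_a, ratings_b, num_categories):
    """Cohen's weighted kappa with linear disagreement weights.

    ratings_a, ratings_b: equal-length lists of integer category labels
    in range(num_categories), one pair per judged snippet.
    num_categories must be at least 2.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    k = num_categories

    # Contingency table: observed[i][j] counts items rated i by A and j by B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            # Linear weight (an assumption): larger rating gaps cost more.
            weight = abs(i - j) / (k - 1)
            observed_disagreement += weight * observed[i][j] / n
            expected_disagreement += weight * row_totals[i] * col_totals[j] / (n * n)

    if expected_disagreement == 0.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - observed_disagreement / expected_disagreement
```

    Under these assumptions, a value of 1 means the two sets of ratings agree perfectly and a value near 0 means agreement is no better than chance.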