23 research outputs found

    BM25t: a BM25 extension for focused information retrieval

    25 pages. International audience. This paper addresses the integration of XML tags into a term-weighting function for focused XML Information Retrieval (IR). Our model allows us to consider a certain kind of structural information: tags that represent a logical structure (e.g. title, section, paragraph, etc.) as well as other tags (e.g. bold, italic, center, etc.). We take into account the influence of a tag by estimating the probability for this tag to distinguish relevant terms from the others. Then, these weights are integrated in a term-weighting function. Experiments on a large collection from the INEX 2008 XML IR evaluation campaign showed improvements on focused XML retrieval.
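
A minimal sketch of how per-tag weights might enter a BM25-style term score, for readers unfamiliar with the idea. It is not the formula from the paper: the abstract does not give one, so the sketch simply assumes that each occurrence of a term counts in proportion to the weight of its enclosing tag. The function names, the weight values and the collection statistics are all hypothetical.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Standard BM25 score of one term in one document (or XML element)."""
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)

def tag_weighted_tf(occurrences, tag_weights):
    """Sum one weighted count per occurrence of the term.

    occurrences: list of tag names, one per occurrence of the term
    tag_weights: hypothetical per-tag weights; unlisted tags count as 1.0
    """
    return sum(tag_weights.get(tag, 1.0) for tag in occurrences)

# Hypothetical weights; in the paper they are estimated from relevance data.
tag_weights = {"title": 2.0, "b": 1.5, "p": 1.0}

# One term occurring once in a title and twice in plain paragraphs.
occurrences = ["title", "p", "p"]

# Made-up collection statistics, identical for both scores.
plain = bm25_term_score(len(occurrences), 100, 120, 1000, 50)
weighted = bm25_term_score(tag_weighted_tf(occurrences, tag_weights), 100, 120, 1000, 50)
print(plain, weighted)
```

Because the BM25 score grows with term frequency, the occurrence inside the title raises the weighted score above the plain one.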

    Mining XML Documents

    XML documents are becoming ubiquitous because of their rich and flexible format, which can be used for a variety of applications. Given the increasing size of XML collections as information sources, mining techniques that traditionally exist for text collections or databases need to be adapted, and new methods need to be invented, to exploit the particular structure of XML documents. XML documents can be seen as trees, which are well known to be complex structures. This chapter describes various ways of using and simplifying this tree structure to model documents and support efficient mining algorithms. We focus on three mining tasks: classification and clustering, which are standard for text collections, and the discovery of frequent tree structures, which is especially important for heterogeneous collections. This chapter presents some recent approaches and algorithms to support these tasks, together with experimental evaluation on a variety of large XML collections.

    Report on the XML Mining Track at INEX 2005 and INEX 2006, Categorization and Clustering of XML Documents

    International audience. This article reports on the two years of the XML Mining track at INEX (2005 and 2006). We focus here on the classification and clustering of XML documents. We detail these two tasks and the corpus used for this challenge, and then summarise the different methods proposed by the participants. Finally, we compare the results obtained during the two years of the track.

    Peut-on évaluer les outils d'acquisition de connaissances à partir de textes ?

    National audience. Despite years of hindsight and accumulated experience, it is difficult to form a clear picture of the state of research on knowledge acquisition from texts. The lack of evaluation protocols does not make it easier to compare results. In this article, we discuss the question of evaluating terminology and ontology acquisition tools, highlighting the main difficulties and describing our first proposals in this area.

    Advanced Document Description, a Sequential Approach

    To perform efficient document processing, information systems need to use simple models of documents that can be processed in a smaller number of operations. This problem of document representation is not trivial. For decades, researchers have tried to combine relevant document representations with efficient processing. Documents are commonly represented by vectors in which each dimension corresponds to a word of the document. This approach is termed “bag of words”, as it entirely ignores the relative positions of words. One natural improvement over this representation is the extraction and use of cohesive word sequences. In this dissertation, we consider the problem of the extraction, selection and exploitation of word sequences, with a particular focus on the applicability of our work to domain-independent document collections written in any language.
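
The contrast between a bag of words and word sequences can be illustrated with a short sketch. It uses contiguous word pairs as a stand-in for the "cohesive word sequences" of the abstract, which is a simplifying assumption: the dissertation's own extraction method is not described here. The function names and example sentences are hypothetical.

```python
from collections import Counter

def bag_of_words(tokens):
    """Count each word, ignoring where it occurs."""
    return Counter(tokens)

def word_sequences(tokens, n=2):
    """Count contiguous sequences of n words (n-grams), which keep local order."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Two made-up sentences with the same words in a different order.
a = "information retrieval needs document models".split()
b = "document retrieval needs information models".split()

print(bag_of_words(a) == bag_of_words(b))      # same word counts
print(word_sequences(a) == word_sequences(b))  # different word pairs
```

The two sentences are indistinguishable as bags of words, while their word pairs differ, which is the positional information the bag-of-words model discards.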

    Indexing Heterogeneous XML for Full-Text Search

    XML documents are becoming more and more common in various environments. In particular, enterprise-scale document management is commonly centred around XML, and desktop applications as well as online document collections are soon to follow. The growing number of XML documents increases the importance of appropriate indexing methods and search tools in keeping the information accessible. Therefore, we focus on content that is stored in XML format as we develop such indexing methods. Because XML is used for different kinds of content, ranging all the way from records of data fields to narrative full-texts, the methods for Information Retrieval face a new challenge in identifying which content is subject to data queries and which should be indexed for full-text search. In response to this challenge, we analyse the relation of character content and XML tags in XML documents in order to separate the full-text from data. As a result, we are able to both reduce the size of the index by 5-6% and improve the retrieval precision as we select the XML fragments to be indexed. Besides being challenging, XML comes with many unexplored opportunities that have received little attention in the literature. For example, authors often tag the content they want to emphasise by using a typeface that stands out. The tagged content constitutes phrases that are descriptive of the content and useful for full-text search. They are simple to detect in XML documents, but can also be confused with other inline-level text. Nonetheless, the search results seem to improve when the detected phrases are given additional weight in the index. Similar improvements are reported when related content, including titles, captions, and references, is associated with the indexed full-text. Experimental results show that, at least for certain types of document collections, the proposed methods help us find the relevant answers. Even when we know nothing about the document structure but the XML syntax, we are able to take advantage of the XML structure when the content is indexed for full-text search.
    XML has become a common format for text documents in many environments. Enterprise-level document management in particular is based on XML, but XML is also a common storage format for both text and data on home computers and on the Web. The rapid growth in the number of documents underlines the importance of indexing and search methods, because the amount of information contained in the documents cannot be managed without an information retrieval system. We therefore focus on indexing content stored in XML format for text search. As a document format, XML in no way restricts the nature of the stored content: XML documents contain everything from raw computer data to literary prose. It is therefore important to identify the nature of the content before indexing it. One method for separating data from text is to analyse the internal structure of XML documents: data requires a strictly regular, fixed-form structure, whereas the XML structure of text documents varies considerably. Leaving the data unindexed yields an index about 5-6% smaller as well as more precise search results. XML documents also have other properties that text indexing methods have not previously exploited. Content that the author wants to emphasise, for example with a different typeface, is explicitly marked in the XML code. Emphasised content is therefore easy to locate. Giving it more weight in the index than unemphasised content steers the search results in a better direction. Analysing and weighting titles, captions and references has the same effect. According to preliminary test results, the presented indexing methods help in finding relevant information in XML documents.
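
The idea of giving emphasised content additional weight in the index can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's indexing method: the set of emphasis tags, the boost value and the function name are hypothetical, and the sketch relies solely on Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET
from collections import Counter

EMPHASIS_TAGS = {"b", "i", "em"}  # hypothetical set of inline emphasis tags
EMPHASIS_BOOST = 2.0              # hypothetical weight for emphasised words

def weighted_terms(element, boost=1.0, weights=None):
    """Accumulate a weight per word, boosting words inside emphasis tags."""
    if weights is None:
        weights = Counter()
    if element.tag in EMPHASIS_TAGS:
        boost = max(boost, EMPHASIS_BOOST)
    # Text directly inside this element, before its first child.
    for word in (element.text or "").lower().split():
        weights[word] += boost
    for child in element:
        weighted_terms(child, boost, weights)
        # Tail text follows the child but belongs to this element,
        # so it receives this element's boost, not the child's.
        for word in (child.tail or "").lower().split():
            weights[word] += boost
    return weights

doc = ET.fromstring("<p>XML retrieval uses <b>tag weights</b> for retrieval</p>")
index_weights = weighted_terms(doc)
print(index_weights["tag"], index_weights["xml"])
```

In this made-up fragment the words inside `<b>` receive the boosted weight, while the surrounding text keeps the default weight of 1.0 per occurrence.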

    Der Lehrstuhl Datenbank- und Informationssysteme der Universität Rostock

    In 2014, the Chair of Database and Information Systems (Lehrstuhl Datenbank- und Informationssysteme, LS DBIS) at the Universität Rostock celebrated its twentieth anniversary. For the anniversary event with former and current students, staff, colleagues and cooperation partners, a variety of material from 20 years was prepared. Drawing on that material, this contribution gives a retrospective on 20 years of research and teaching in the area of database and information systems, as well as an insight into and outlook on current research.

    Seventh Biennial Report : June 2003 - March 2005
