    Investigation into Indexing XML Data Techniques

    The rapid development of XML technology has improved the WWW: XML data has many advantages and has become a common technology for transferring data across the internet. The objective of this research is therefore to investigate XML indexing techniques in terms of their structures. The main goal of this investigation is to identify the main limitations of these techniques and any other open issues. Furthermore, this research considers the most common XML indexing techniques, compares them, and argues from that comparison to identify their limitations. To conclude, the main problem of all the XML indexing techniques is the trade-off between the size and the efficiency of the indexes: all the indexes become large in order to perform well, and none of them is suitable for all users' requirements. However, each of these techniques has some advantages.

    Enhancing Content-And-Structure Information Retrieval using a Native XML Database

    Three approaches to content-and-structure XML retrieval are analysed in this paper: first by using Zettair, a full-text information retrieval system; second by using eXist, a native XML database; and third by using a hybrid XML retrieval system that uses eXist to produce the final answers from likely relevant articles retrieved by Zettair. INEX 2003 content-and-structure topics can be classified in two categories: the first retrieving full articles as final answers, and the second retrieving more specific elements within articles as final answers. We show that for both topic categories our initial hybrid system improves the retrieval effectiveness of a native XML database. For ranking the final answer elements, we propose and evaluate a novel retrieval model that utilises the structural relationships between the answer elements of a native XML database and retrieves Coherent Retrieval Elements. The final results of our experiments show that when the XML retrieval task focusses on highly relevant elements our hybrid XML retrieval system with the Coherent Retrieval Elements module is 1.8 times more effective than Zettair and 3 times more effective than eXist, and yields effective content-and-structure XML retrieval.
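
The two-stage idea in this abstract (a full-text engine selects likely relevant articles, then an XML-aware step extracts elements from them) can be sketched roughly as follows. This is only an illustration under stated assumptions: it calls neither Zettair nor eXist, the term-count ranking and the `sec` element name are stand-ins, and every function name and the toy `ARTICLES` data are hypothetical. It does not implement the Coherent Retrieval Elements module, which the abstract does not specify.

```python
import xml.etree.ElementTree as ET

# Toy collection (hypothetical data, not from the paper).
ARTICLES = {
    "a1": "<article><sec>XML indexing techniques</sec><sec>Related work</sec></article>",
    "a2": "<article><sec>Astronomy abstracts</sec></article>",
    "a3": "<article><sec>Native XML database</sec><sec>XML retrieval</sec></article>",
}


def rank_articles(articles, query_terms, k=2):
    """Stage 1 (stand-in for a full-text engine): rank whole articles by
    how often the query terms occur in their text, keep the top k."""
    scored = []
    for name, xml_text in articles.items():
        words = " ".join(ET.fromstring(xml_text).itertext()).lower().split()
        score = sum(words.count(t) for t in query_terms)
        if score > 0:
            scored.append((score, name))
    # Highest score first; ties broken by article name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


def extract_elements(xml_text, query_terms, tag="sec"):
    """Stage 2 (stand-in for an XML database query): return the text of each
    <tag> element that mentions at least one query term."""
    hits = []
    for elem in ET.fromstring(xml_text).iter(tag):
        words = " ".join(elem.itertext()).lower().split()
        if any(t in words for t in query_terms):
            hits.append(" ".join(words))
    return hits


def hybrid_retrieve(articles, query_terms, k=2):
    """Run stage 2 only on the articles selected by stage 1."""
    results = []
    for name in rank_articles(articles, query_terms, k):
        for text in extract_elements(articles[name], query_terms):
            results.append((name, text))
    return results
```

Calling `hybrid_retrieve(ARTICLES, ["xml"])` returns element-level answers only from the articles that stage 1 ranked highest, which is the division of labour the abstract describes.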

    Hybrid XML Retrieval: Combining Information Retrieval and a Native XML Database

    This paper investigates the impact of three approaches to XML retrieval: using Zettair, a full-text information retrieval system; using eXist, a native XML database; and using a hybrid system that takes full article answers from Zettair and uses eXist to extract elements from those articles. For the content-only topics, we undertake a preliminary analysis of the INEX 2003 relevance assessments in order to identify the types of highly relevant document components. Further analysis identifies two complementary sub-cases of relevance assessments ("General" and "Specific") and two categories of topics ("Broad" and "Narrow"). We develop a novel retrieval module that for a content-only topic utilises the information from the resulting answer list of a native XML database and dynamically determines the preferable units of retrieval, which we call "Coherent Retrieval Elements". The results of our experiments show that -- when each of the three systems is evaluated against different retrieval scenarios (such as different cases of relevance assessments, different topic categories and different choices of evaluation metrics) -- the XML retrieval systems exhibit varying behaviour and the best performance can be reached for different values of the retrieval parameters. In the case of INEX 2003 relevance assessments for the content-only topics, our newly developed hybrid XML retrieval system is substantially more effective than either Zettair or eXist, and yields robust and very effective XML retrieval. Comment: Postprint version. The editor version can be accessed through the DOI.

    Information extraction from multimedia web documents: an open-source platform and testbed

    The LivingKnowledge project aimed to enhance the current state of the art in search, retrieval and knowledge management on the web by advancing the use of sentiment and opinion analysis within multimedia applications. To achieve this aim, a diverse set of novel and complementary analysis techniques has been integrated into a single but extensible software platform on which such applications can be built. The platform combines state-of-the-art techniques for extracting facts, opinions and sentiment from multimedia documents, and unlike earlier platforms, it exploits both visual and textual techniques to support multimedia information retrieval. Foreseeing the usefulness of this software in the wider community, the platform has been made generally available as an open-source project. This paper describes the platform design, gives an overview of the analysis algorithms integrated into the system and describes two applications that utilise the system for multimedia information retrieval.

    The NASA Astrophysics Data System: Data Holdings

    Since its inception in 1993, the ADS Abstract Service has become an indispensable research tool for astronomers and astrophysicists worldwide. In those seven years, much effort has been directed toward improving both the quantity and the quality of references in the database. From the original database of approximately 160,000 astronomy abstracts, our dataset has grown almost tenfold to approximately 1.5 million references covering astronomy, astrophysics, planetary sciences, physics, optics, and engineering. We collect and standardize data from approximately 200 journals and present the resulting information in a uniform, coherent manner. With the cooperation of journal publishers worldwide, we have been able to place scans of full journal articles on-line back to the first volumes of many astronomical journals, and we are able to link to current versions of articles, abstracts, and datasets for essentially all of the current astronomy literature. The trend toward electronic publishing in the field, the use of electronic submission of abstracts for journal articles and conference proceedings, and the increasingly prominent use of the World Wide Web to disseminate information have enabled the ADS to build a database unparalleled in other disciplines. The ADS can be accessed at http://adswww.harvard.edu. Comment: 24 pages, 1 figure, 6 tables, 3 appendices.

    DCU and ISI@INEX 2010: Ad-hoc and data-centric tracks

    We describe the participation of Dublin City University (DCU) and the Indian Statistical Institute (ISI) in INEX 2010. The main contributions of this paper are: i) a simplified version of the Hierarchical Language Model (HLM), which scores an XML element with a combined probability of generating the given query from the element itself and from the top-level article node, is shown to outperform the baselines of Language Model (LM) and Vector Space Model (VSM) scoring of XML elements; ii) the Expectation Maximization (EM) feedback in LM is shown to be the most effective on the domain-specific collection of IMDB; iii) automated removal of sentences indicating aspects of irrelevance from the narratives of INEX ad-hoc topics is shown to improve retrieval effectiveness.
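
The scoring idea in contribution (i) can be illustrated with a minimal sketch. It assumes a linear (Jelinek-Mercer style) mixture of unigram models for the element and for its top-level article, plus a collection model added here purely as a smoothing term; the abstract does not give the exact formula, so the mixture form, the weights `lam_elem` and `lam_art`, and all function names are hypothetical.

```python
import math
from collections import Counter


def unigram_prob(term, counts, total):
    """Maximum-likelihood probability of a term under a bag-of-words model."""
    return counts[term] / total if total else 0.0


def hlm_score(query, element, article, collection, lam_elem=0.5, lam_art=0.3):
    """Log-probability of generating the query from a mixture of the element,
    its top-level article, and the collection (assumed smoothing term).
    The weights are illustrative assumptions, not values from the paper."""
    e, a, c = Counter(element), Counter(article), Counter(collection)
    lam_coll = 1.0 - lam_elem - lam_art
    score = 0.0
    for term in query:
        p = (lam_elem * unigram_prob(term, e, len(element))
             + lam_art * unigram_prob(term, a, len(article))
             + lam_coll * unigram_prob(term, c, len(collection)))
        if p == 0.0:
            # Term unseen everywhere: the query cannot be generated.
            return float("-inf")
        score += math.log(p)
    return score


# Toy data (hypothetical): one article with two candidate elements.
article = "xml retrieval with language models for xml elements".split()
sec1 = "xml retrieval".split()
sec2 = "language models".split()
collection = article + "other words".split()
query = ["xml", "retrieval"]

score_sec1 = hlm_score(query, sec1, article, collection)
score_sec2 = hlm_score(query, sec2, article, collection)
```

With this toy data `sec1` scores higher than `sec2`, since it matches the query directly, while `sec2` still receives a finite score through the article-level and collection-level terms.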

    Fast, linked, and open – the future of taxonomic publishing for plants: launching the journal PhytoKeys

    The paper describes the focus, scope and rationale of PhytoKeys, a newly established, peer-reviewed, open-access journal in plant systematics. PhytoKeys is launched to respond to four main challenges of our time: (1) Appearance of electronic publications as amendments or even alternatives to paper publications; (2) Open Access (OA) as a new publishing model; (3) Linkage of electronic registers, indices and aggregators that summarize information on biological species through taxonomic names or their persistent identifiers (Globally Unique Identifiers or GUIDs; currently Life Science Identifiers or LSIDs); (4) Web 2.0 technologies that permit the semantic markup of, and semantic enhancements to, published biological texts. The journal will pursue cutting-edge technologies in publication and dissemination of biodiversity information while strictly following the requirements of the current International Code of Botanical Nomenclature (ICBN).