5 research outputs found

    A keyquery-based classification system for CORE

    We apply keyquery-based taxonomy composition to compute a classification system for the CORE dataset, a shared crawl of about 850,000 scientific papers. Keyquery-based taxonomy composition can be understood as a two-phase hierarchical document clustering technique that uses search queries as cluster labels. In the first phase, the document collection is indexed by a reference search engine, and the documents are tagged with the search queries for which they are relevant, their so-called keyqueries. In the second phase, a hierarchical clustering is formed from the keyqueries in an iterative process. We use the explicit topic model ESA as the document retrieval model to index the CORE dataset in the reference search engine. Under the ESA retrieval model, documents are represented as vectors of similarities to Wikipedia articles, a methodology proven to be advantageous for text categorization tasks. Our paper presents the generated taxonomy and reports on quantitative properties such as document coverage and processing requirements.
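
The first phase described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: two toy concept texts stand in for Wikipedia articles, plain term-frequency cosine similarity stands in for the weighting ESA would use, and a query counts as a keyquery for a document when the document ranks within the top `top_k` results for it. The names `esa_vector`, `tag_keyqueries`, and `top_k` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberate simplification)."""
    return text.lower().split()


def bag_cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def vec_cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def esa_vector(text, concepts):
    """Represent a text by its similarity to each concept article."""
    bag = Counter(tokenize(text))
    return [bag_cosine(bag, Counter(tokenize(c))) for c in concepts]


def tag_keyqueries(docs, queries, concepts, top_k=1):
    """Tag each document with the queries that retrieve it in the top_k results.

    Assumption: top_k membership is the relevance criterion; the abstract
    only says documents are tagged with queries they are relevant for.
    """
    doc_vecs = {d: esa_vector(t, concepts) for d, t in docs.items()}
    tags = {d: [] for d in docs}
    for q in queries:
        qv = esa_vector(q, concepts)
        scores = {d: vec_cosine(qv, dv) for d, dv in doc_vecs.items()}
        ranked = sorted(docs, key=lambda d: scores[d], reverse=True)
        for d in ranked[:top_k]:
            if scores[d] > 0:  # ignore documents with no concept overlap
                tags[d].append(q)
    return tags


# Toy stand-ins for Wikipedia concept articles (assumption, not real data).
concepts = [
    "search engine index query retrieval ranking",
    "map location place geographic coordinates",
]
docs = {
    "d1": "index structures for query retrieval",
    "d2": "extracting place names and geographic location from text",
}
queries = ["query retrieval", "geographic place"]

tags = tag_keyqueries(docs, queries, concepts)
```

The second phase, which builds a hierarchy from the resulting keyqueries, is left out because the abstract does not describe its iteration in enough detail to sketch it faithfully.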

    A survey on big data indexing strategies

    The operations of the Internet have led to a significant growth and accumulation of data known as Big Data. Individuals and organizations that utilize this data neither anticipated nor were prepared for this data explosion. Hence, the available solutions cannot meet the processing needs of the growing heterogeneous data. This results in inefficient information retrieval or search query results. The design of indexing strategies that can support this need is required. A survey of various indexing strategies and how they are used to solve Big Data management issues can serve as a guide for choosing the strategy best suited to a problem, and as a basis for the design of more efficient indexing strategies. The aim of the study is to explore the characteristics of the indexing strategies used in Big Data manageability by covering some of the weaknesses and strengths of B-tree and R-tree, to name but a few. This paper covers some popular indexing strategies used for Big Data management. It exposes the potential of each by carefully exploring their properties in ways that are related to problem solving.

    Geographic information extraction from texts

    A large volume of unstructured texts, containing valuable geographic information, is available online. This information – provided implicitly or explicitly – is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although considerable progress has been achieved in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data, to applications and privacy. Therefore, this workshop will provide a timely opportunity to discuss recent advances, new ideas, and concepts, and also to identify research gaps in geographic information extraction.

    Dynamic taxonomy composition via keyqueries
