8,666 research outputs found

    Classifying document types to enhance search and recommendations in digital libraries

    In this paper, we address the problem of classifying documents available from the global network of (open access) repositories according to their type. We show that the metadata provided by repositories enabling us to distinguish research papers, theses and slides are missing in over 60% of cases. While these metadata describing document types are useful in a variety of scenarios ranging from research analytics to improving search and recommender (SR) systems, this problem has not yet been sufficiently addressed in the context of the repositories infrastructure. We have developed a new approach for classifying document types using supervised machine learning based exclusively on text-specific features. We achieve a 0.96 F1-score using the random forest and AdaBoost classifiers, which are the best-performing models on our data. By analysing the SR system logs of the CORE [1] digital library aggregator, we show that users are an order of magnitude more likely to click on research papers and theses than on slides. This suggests that using document types as a feature for ranking/filtering SR results in digital libraries has the potential to improve user experience. Comment: 12 pages, 21st International Conference on Theory and Practice of Digital Libraries (TPDL), 2017, Thessaloniki, Greece

    Institutional Repositories in India: A Case Study of National Aerospace Laboratories

    This paper traces the history and developments in Open Archives Initiatives, including open access journals, e-print archives and institutional repositories. The setting up of NAL’s Institutional Repository using OSS GNU Eprints, document types with statistical analysis, country-wise statistics of full-text downloads, levels of accessibility and technologies used in building the Institutional Repository are discussed at length.

    What makes papers visible on social media? An analysis of various document characteristics

    In this study we have investigated the relationship between different document characteristics and the number of Mendeley readership counts, tweets, Facebook posts, mentions in blogs and mainstream media for 1.3 million papers published in journals covered by the Web of Science (WoS). It aims to demonstrate how factors affecting various social media-based indicators differ from those influencing citations and which document types are more popular across different platforms. Our results highlight the heterogeneous nature of altmetrics, which encompasses different types of uses and user groups engaging with research on social media. Comment: Presented at the 21st International Conference in Science & Technology Indicators (STI), 13-16 September 2016, Valencia, Spain

    Rewrite based Verification of XML Updates

    We consider problems of access control for updates of XML documents. In the context of XML programming, types can be viewed as hedge automata, and static type checking amounts to verifying that a program always converts valid source documents into valid output documents. Given a set of update operations, we are particularly interested in checking safety properties such as preservation of document types along any sequence of updates. We are also interested in the related policy consistency problem, that is, detecting whether a sequence of authorized operations can simulate a forbidden one. We reduce these questions to type checking problems, solved by computing variants of hedge automata characterizing the set of ancestors and descendants of the initial document type for the closure of parameterized rewrite rules.

    Worldwide Research Trends on Wheat and Barley: A Bibliometric Comparative Analysis

    Grain cereals such as wheat, barley, rice, and maize are the nutritional basis for humans and animals worldwide. Thus, these crop plants are essential in terms of global food security. We conducted a bibliometric assessment of scientific documents and patents related to wheat and barley through the Scopus database. The number of documents published per year, their affiliation and corresponding scientific areas, the publishing journals, document types and languages were metricized. The main keywords included in research publications concerning these crops were also analysed globally and clustered in thematic groups. In the case of keywords related to agronomy or genetics and molecular biology, we considered documents dated up to 1999, and from 2000 to 2018, separately. Comparison of the results obtained for wheat and barley revealed some remarkably different trends, for which the underlying reasons are further discussed.

    Unsupervised learning of document image types

    In a system where medical paper document images have been converted to a digital format by a scanning operation, understanding the document types that exist in this system could provide for vital data indexing and retrieval. In a system where millions of document images have been scanned, it is infeasible to expect a supervised algorithm or a tedious (human-based) effort to discover the document types. The most sensible and practical way to do that is an unsupervised algorithm. Many clustering techniques have been developed for unsupervised classification. Many rely on all data being presented at once, the number of clusters being known, or both. Presented in this thesis is a clustering scheme that is a two-threshold-based technique relying on a hierarchical decomposition of the features. On a subset of document images, it discovers document types at an acceptable level and confidently classifies unknown document images.

    SWI-Prolog and the Web

    Where Prolog is commonly seen as a component in a Web application that is either embedded or communicates using a proprietary protocol, we propose an architecture where Prolog communicates with other components in a Web application using the standard HTTP protocol. By avoiding embedding in external Web servers, development and deployment become much easier. To support this architecture, in addition to the transfer protocol, we must also support parsing, representing and generating the key Web document types such as HTML, XML and RDF. This paper motivates the design decisions in the libraries and extensions to Prolog for handling Web documents and protocols. The design has been guided by the requirement to handle large documents efficiently. The described libraries support a wide range of Web applications ranging from HTML and XML documents to Semantic Web RDF processing. To appear in Theory and Practice of Logic Programming (TPLP). Comment: 31 pages, 24 figures and 2 tables.

    DocumentNet: Bridging the Data Gap in Document Pre-Training

    Document understanding tasks, in particular Visually-rich Document Entity Retrieval (VDER), have gained significant attention in recent years thanks to their broad applications in enterprise AI. However, publicly available data have been scarce for these tasks due to strict privacy constraints and high annotation costs. To make matters worse, the non-overlapping entity spaces from different datasets hinder the knowledge transfer between document types. In this paper, we propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models. The collected dataset, named DocumentNet, does not depend on specific document types or entity sets, making it universally applicable to all VDER tasks. The current DocumentNet consists of 30M documents spanning nearly 400 document types organized in a four-level ontology. Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training for both classic and few-shot learning settings. With the recent emergence of large language models (LLMs), DocumentNet provides a large data source to extend their multi-modal capabilities for VDER. Comment: EMNLP 202

    An Analysis of Current Grey Literature Document Typology

    This analysis is based on the classification of the international systems GreyNet (the Grey Literature Network Service), OpenSIGLE (the System for Information on Grey Literature in Europe), and the Registry of Open Access Repositories (ROAR), and also focuses on national schemata in the Czech Republic, namely ASEP (Register of Publication Activity of the AS CR), NRGL (National Repository of Grey Literature), and RIV (Information Register of R & D Results). During the analysis of the lists of document types, we discovered that these typologies contain, besides “real” document types (reports, theses, etc.), other aspects, such as events (arrangement, organization), types of events (conferences, speeches), producers (universities, institutes), processes (translations, output), content (political documents, legal texts), location (domestic, foreign), and format (e-texts, numeric data). However, this approach is not systematic. Therefore, we have decided to create a classification scheme for document types only, and classify other aspects into various groups in order to define them more precisely. The scheme will be processed in a text version as well as schematically in mind maps.

    Analyzing readerships of International Iranian publications in Mendeley: an altmetrics study

    In this study, the presence and distribution of both Mendeley readerships and Web of Science citations for the publications published in the 43 Iranian international journals indexed in Journal Citation Reports have been investigated. The aim was to determine the impact, visibility and use of the publications published by the Iranian international journals in Mendeley compared to their citation impact and, furthermore, to explore whether there is any relation between these two impact indicators (Mendeley readership counts and WoS citation counts) for these publications. The DOIs of the 1,884 publications were used to extract the readership data from the Mendeley REST API in February 2014, and citation data until the end of 2013 were calculated using the CWTS in-house WoS database. SPSS (version 21) was used to analyze the relationship between the readerships and citations for those publications. The Mendeley usage distribution both at the publication level (across publication years, fields and document types) and at the user level (across users' disciplines, academic status and countries) has been investigated. This information will help to understand the visibility and usage vs. citation patterns and impact of Iranian scientific outputs. Comment: in Persian