8,666 research outputs found
Classifying document types to enhance search and recommendations in digital libraries
In this paper, we address the problem of classifying documents available from
the global network of (open access) repositories according to their type. We
show that the metadata provided by repositories enabling us to distinguish
research papers, theses and slides are missing in over 60% of cases. While
these metadata describing document types are useful in a variety of scenarios
ranging from research analytics to improving search and recommender (SR)
systems, this problem has not yet been sufficiently addressed in the context of
the repositories infrastructure. We have developed a new approach for
classifying document types using supervised machine learning based exclusively
on text-specific features. We achieve an F1-score of 0.96 using the random forest and
AdaBoost classifiers, which are the best-performing models on our data. By
analysing the SR system logs of the CORE [1] digital library aggregator, we
show that users are an order of magnitude more likely to click on research
papers and theses than on slides. This suggests that using document types as a
feature for ranking/filtering SR results in digital libraries has the potential
to improve user experience.
Comment: 12 pages, 21st International Conference on Theory and Practice of
Digital Libraries (TPDL), 2017, Thessaloniki, Greece
Institutional Repositories in India: A Case Study of National Aerospace Laboratories
This paper traces the history and developments in Open Archives Initiatives including open access journals, e-print archives and institutional repositories. The setting up of NAL’s Institutional Repository using OSS GNU Eprints, document types with statistical analysis, country-wise statistics of full text download, levels of accessibility and technologies used in building the Institutional Repository have been discussed at length.
What makes papers visible on social media? An analysis of various document characteristics
In this study we have investigated the relationship between different
document characteristics and the number of Mendeley readership counts, tweets,
Facebook posts, mentions in blogs and mainstream media for 1.3 million papers
published in journals covered by the Web of Science (WoS). It aims to
demonstrate how factors affecting various social media-based indicators
differ from those influencing citations and which document types are more
popular across different platforms. Our results highlight the heterogeneous
nature of altmetrics, which encompasses different types of uses and user groups
engaging with research on social media.
Comment: Presented at the 21st International Conference in Science &
Technology Indicators (STI), 13-16 September 2016, Valencia, Spain
Rewrite based Verification of XML Updates
We consider problems of access control for update of XML documents. In the
context of XML programming, types can be viewed as hedge automata, and static
type checking amounts to verifying that a program always converts valid source
documents into valid output documents. Given a set of update operations, we
are particularly interested in checking safety properties such as preservation
of document types along any sequence of updates. We are also interested in the
related policy consistency problem, that is detecting whether a sequence of
authorized operations can simulate a forbidden one. We reduce these questions
to type checking problems, solved by computing variants of hedge automata
characterizing the set of ancestors and descendants of the initial document
type for the closure of parameterized rewrite rules.
Worldwide Research Trends on Wheat and Barley: A Bibliometric Comparative Analysis
Grain cereals such as wheat, barley, rice, and maize are the nutritional basis for humans and animals worldwide. Thus, these crop plants are essential in terms of global food security. We conducted a bibliometric assessment of scientific documents and patents related to wheat and barley through the Scopus database. The number of documents published per year, their affiliation and corresponding scientific areas, the publishing journals, document types and languages were metricized. The main keywords included in research publications concerning these crops were also analysed globally and clustered in thematic groups. In the case of keywords related to agronomy or genetics and molecular biology, we considered documents dated up to 1999, and from 2000 to 2018, separately. Comparison of the results obtained for wheat and barley revealed some remarkably different trends, for which the underlying reasons are further discussed.
Unsupervised learning of document image types
In a system where medical paper document images have been converted to a digital format by a scanning operation, understanding the document types that exist in this system could provide vital support for data indexing and retrieval. In a system where millions of document images have been scanned, it is infeasible to expect a supervised algorithm or a tedious (human-based) effort to discover the document types. The most sensible and practical way to do that is an unsupervised algorithm. Many clustering techniques have been developed for unsupervised classification. Many rely on all data being presented at once, the number of clusters being known, or both. Presented in this thesis is a clustering scheme that is a two-threshold-based technique relying on a hierarchical decomposition of the features. On a subset of document images, it discovers document types at an acceptable level and confidently classifies unknown document images.
SWI-Prolog and the Web
Where Prolog is commonly seen as a component in a Web application that is
either embedded or communicates using a proprietary protocol, we propose an
architecture where Prolog communicates with other components in a Web application
using the standard HTTP protocol. By avoiding embedding in external Web servers,
development and deployment become much easier. To support this architecture, in
addition to the transfer protocol, we must also support parsing, representing
and generating the key Web document types such as HTML, XML and RDF.
This paper motivates the design decisions in the libraries and extensions to
Prolog for handling Web documents and protocols. The design has been guided by
the requirement to handle large documents efficiently. The described libraries
support a wide range of Web applications ranging from HTML and XML documents to
Semantic Web RDF processing.
To appear in Theory and Practice of Logic Programming (TPLP).
Comment: 31 pages, 24 figures and 2 tables.
DocumentNet: Bridging the Data Gap in Document Pre-Training
Document understanding tasks, in particular, Visually-rich Document Entity
Retrieval (VDER), have gained significant attention in recent years thanks to
their broad applications in enterprise AI. However, publicly available data
have been scarce for these tasks due to strict privacy constraints and high
annotation costs. To make things worse, the non-overlapping entity spaces from
different datasets hinder the knowledge transfer between document types. In
this paper, we propose a method to collect massive-scale and weakly labeled
data from the web to benefit the training of VDER models. The collected
dataset, named DocumentNet, does not depend on specific document types or
entity sets, making it universally applicable to all VDER tasks. The current
DocumentNet consists of 30M documents spanning nearly 400 document types
organized in a four-level ontology. Experiments on a set of broadly adopted
VDER tasks show significant improvements when DocumentNet is incorporated into
the pre-training for both classic and few-shot learning settings. With the
recent emergence of large language models (LLMs), DocumentNet provides a large
data source to extend their multi-modal capabilities for VDER.
Comment: EMNLP 202
An Analysis of Current Grey Literature Document Typology
This analysis is based on the classification of the international systems GreyNet (the Grey Literature Network Service), OpenSIGLE (the System for Information on Grey Literature in Europe), and the Registry of Open Access Repositories (ROAR), as well as focusing on national schemata in the Czech Republic, namely ASEP (Register of Publication Activity of the AS CR), NRGL (National Repository of Grey Literature), and RIV (Information Register of R & D Results). During the analysis of the lists of document types, we have discovered that these typologies contain, besides “real” document types (reports, theses, etc.), other aspects, such as events (arrangement, organization), types of events (conferences, speeches), producers (universities, institutes), processes (translations, output), content (political documents, legal texts), location (domestic, foreign), and format (e-texts, numeric data). However, this approach is not systematic. Therefore, we have decided to create a classification scheme for document types only, and classify other aspects into various groups in order to define them more precisely. The scheme will be processed in a text version as well as schematically in mind maps.
Analyzing readerships of International Iranian publications in Mendeley: an altmetrics study
In this study, the presence and distribution of both Mendeley readerships and
Web of Science citations for the publications published in the 43 Iranian
international journals indexed in Journal Citation Reports have been
investigated. The aim was to determine the impact, visibility and use of the
publications published by the Iranian international journals in Mendeley
compared to their citation impact; furthermore, to explore if there is any
relation between these two impact indicators (Mendeley readership counts and
WoS citation counts) for these publications. The DOIs of the 1,884 publications
were used to extract the readership data from the Mendeley REST API in February
2014, and citation data until the end of 2013 were calculated using the CWTS
in-house WoS database. SPSS (version 21) was used to analyze the relationship
between the readerships and citations for those publications. The Mendeley usage
distribution both at the publication level (across publication years, fields
and document types) and at the user level (across users' disciplines, academic
status and countries) has been investigated. This information will help to
understand the visibility and usage vs citation pattern and impact of Iranian
scientific outputs.
Comment: in Persian