A Survey of Scholarly Data: From Big Data Perspective
Recently, organizations and governments have shifted their focus towards digitizing academic and technical documents, adding a new facet to the concept of digital libraries. The volume, variety and velocity of the generated data satisfy the definition of big data, and this scholarly reserve is therefore popularly referred to as big scholarly data. To facilitate analytics on big scholarly data, suitable architectures and services need to be developed. The evolving nature of research problems has made them essentially interdisciplinary. As a result, there is a growing demand for scholarly applications such as collaborator discovery, expert finding and research recommendation systems, among others. This paper investigates current trends and identifies existing challenges in the development of a big scholarly data platform, with a specific focus on directions for future research, and maps them to the different phases of the big data lifecycle.
Extracting specific text from documents using machine learning algorithms
The increasing use of Portable Document Format (PDF) files has promoted research
into analyzing file layout for the purpose of text extraction. It is therefore important
to have a system in place that analyzes these documents and extracts the required
text. This research addresses that need by extracting specific text from
PDF documents while considering the document layout. This approach is used to
extract learning outcomes from academic course outlines. Our algorithm combines
a supervised learning algorithm with white space analysis. The supervised algorithm
first locates the relevant text; white space analysis then determines the document
layout before extraction. The supervised learning approach used for detecting relevant
text does so by looking for relevant headings, which mimics the approach used
by humans while going through a document.
The data set used for this research consists of 500 course outlines randomly sampled
from the internet. To show the capability of our text detection algorithm to
work with documents other than course outlines, it is also tested on 25 reports and
articles sampled from the internet. The implemented system shows promising
results, with an accuracy of 81.8%, and addresses a limitation of the current
literature by supporting documents of unknown format. The algorithm has a wide
scope of applications and takes a step towards automating the task of text extraction
from PDF documents.
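
The two-stage idea in the abstract above (a supervised step that finds relevant headings, then white space analysis to delimit the text to extract) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the PDF has already been converted to a list of plain-text lines, uses a tiny naive Bayes classifier as a stand-in for the unspecified supervised algorithm, and reduces white space analysis to treating blank lines as block boundaries. All function names, the training headings, the sample page and the `MAX_HEADING_TOKENS` cutoff are hypothetical.

```python
import math
from collections import Counter

MAX_HEADING_TOKENS = 6  # assumption: headings are short lines


def train(labeled_headings):
    """Fit a tiny multinomial naive Bayes model on (heading, is_relevant) pairs.

    Assumes both labels occur at least once in the training data.
    """
    counts = {True: Counter(), False: Counter()}
    docs = Counter()
    for text, label in labeled_headings:
        counts[label].update(text.lower().split())
        docs[label] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, docs, vocab


def is_relevant_heading(model, line):
    """Return True if a short line looks like a relevant heading."""
    counts, docs, vocab = model
    tokens = [t for t in line.lower().split() if t in vocab]
    # Lines with no known tokens, or too long to be a heading, are rejected.
    if not tokens or len(line.split()) > MAX_HEADING_TOKENS:
        return False
    scores = {}
    for label in (True, False):
        total = sum(counts[label].values())
        score = math.log(docs[label] / sum(docs.values()))
        for tok in tokens:
            # Laplace smoothing keeps unseen (label, token) pairs finite.
            score += math.log((counts[label][tok] + 1) / (total + len(vocab)))
        scores[label] = score
    return scores[True] > scores[False]


def extract_section(lines, model):
    """Return the text block that follows the first relevant heading.

    Blank lines play the role of vertical white space: leading blanks after
    the heading are skipped, and the first blank after the body ends it.
    """
    for i, line in enumerate(lines):
        if is_relevant_heading(model, line):
            body = []
            for text in lines[i + 1:]:
                if text.strip():
                    body.append(text.strip())
                elif body:
                    break
            return body
    return []


# Hypothetical training headings and course outline, for illustration only.
training = [
    ("Learning Outcomes", True),
    ("Course Learning Objectives", True),
    ("Intended Learning Outcomes", True),
    ("Grading Policy", False),
    ("Required Textbooks", False),
    ("Course Schedule", False),
]
page = [
    "Course Schedule",
    "Week 1: Introduction",
    "",
    "Learning Outcomes",
    "",
    "Explain core concepts.",
    "Apply methods to data.",
    "",
    "Grading Policy",
    "Exams: 60%",
]
model = train(training)
outcomes = extract_section(page, model)
```

A real system would work on positioned text from a PDF parser, where white space analysis can also separate columns and margins; the blank-line rule here only captures the vertical case.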