
    Sentence boundary extraction from scientific literature of electric double layer capacitor domain: Tools and techniques

    Given the growth of scientific literature on the web, particularly in materials science, extracting data accurately from the literature has become increasingly important. Material information systems, or chemical information systems, play an essential role in discovering data, materials, or synthesis processes from the existing scientific literature. Processing and understanding the natural language of scientific literature is the backbone of these systems, which depend heavily on appropriate textual content, that is, complete, meaningful sentences drawn from a large chunk of text. The process of detecting the beginning and end of a sentence and extracting it as a correct sentence is called sentence boundary extraction. Accurate extraction of sentence boundaries from PDF documents is essential for readability and for natural language processing. This study therefore provides a comparative analysis of different tools for extracting text from PDF documents that are available as Python libraries or packages and are widely used by the research community. The main objective is to find the technique that most reliably extracts correct sentences from PDF files as text. The performance of the evaluated techniques PyPDF2, pdfminer.six, PyMuPDF, pdftotext, Tika, and GROBID is reported in terms of precision, recall, F1 score, run time, and memory consumption. The natural language processing (NLP) tools NLTK, spaCy, and Gensim are used to identify sentence boundaries. Of all the techniques studied, the GROBID PDF extraction package combined with the NLP tool spaCy achieved the highest F1 score of 93% and consumed the least memory at 46.13 megabytes.
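
    The best-performing combination in the study, GROBID with spaCy, requires a running GROBID server, so a self-contained sketch of the same pipeline shape is shown here using PyMuPDF (one of the compared extractors) for PDF-to-text and spaCy for sentence boundary detection. This is a minimal illustration, not the authors' exact setup; "paper.pdf" is a placeholder path, and the snippet assumes the pymupdf and spacy packages plus the en_core_web_sm model are installed.

        import fitz  # PyMuPDF
        import spacy


        def extract_text(pdf_path):
            """Concatenate the plain text of every page in the PDF."""
            with fitz.open(pdf_path) as doc:
                return "\n".join(page.get_text() for page in doc)


        def extract_sentences(pdf_path):
            """Segment the extracted text into sentences with spaCy."""
            nlp = spacy.load("en_core_web_sm")
            doc = nlp(extract_text(pdf_path))
            # Drop whitespace-only fragments left over from page breaks.
            return [sent.text.strip() for sent in doc.sents if sent.text.strip()]


        if __name__ == "__main__":
            for sentence in extract_sentences("paper.pdf"):  # placeholder file name
                print(sentence)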

    A Text Extraction Software Benchmark Based on a Synthesized Dataset

    Text extraction plays an important role in data processing workflows in digital libraries. For example, it is a crucial prerequisite for evaluating the quality of migrated textual documents. Complex file formats make the extraction process error-prone and make it very challenging to verify the correctness of extraction components. Based on digital preservation and information retrieval scenarios, three quality requirements for the effectiveness of text extraction tools are identified: 1) is a certain text snippet correctly extracted from a document, 2) does the extracted text appear in the right order relative to other elements, and 3) is the structure of the text preserved. A number of text extraction tools are available that fulfill these three quality requirements to varying degrees. However, systematic benchmarks to evaluate those tools are still missing, mainly due to the lack of datasets with accompanying ground truth. The contribution of this paper is two-fold. First, we describe a dataset generation method based on model-driven engineering principles and use it to synthesize a dataset and its ground truth directly from a model. Second, we define a benchmark for text extraction tools and conduct an experiment to calculate performance measures for several tools covering the three quality requirements. The results demonstrate the benefits of the approach in terms of scalability and effectiveness in generating ground truth for the content and structure of text elements. Part of this work was supported by the Vienna Science and Technology Fund (WWTF) through the project BenchmarkDP (ICT12-046) and by NSERC through RGPIN-2016-06640.
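
    To illustrate the kind of effectiveness measure such a benchmark can compute, the sketch below scores an extractor's output against ground-truth text with token-level precision, recall, and F1. The multiset comparison is an assumed, simplified metric covering only the first quality requirement (content correctness), not the paper's exact method and not the order or structure requirements.

        from collections import Counter


        def token_prf(extracted, ground_truth):
            """Token-level precision/recall/F1 of extracted text vs. ground truth."""
            ext = Counter(extracted.split())
            ref = Counter(ground_truth.split())
            overlap = sum((ext & ref).values())  # shared tokens, with multiplicity
            precision = overlap / max(sum(ext.values()), 1)
            recall = overlap / max(sum(ref.values()), 1)
            f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
            return precision, recall, f1


        if __name__ == "__main__":
            p, r, f = token_prf(
                "Text extraction plays an important role",
                "Text extraction plays an important role for data processing",
            )
            print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")

    A perfect extractor scores 1.0 on all three values; missing text lowers recall, while hallucinated or duplicated text lowers precision.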