
    zbMATH Open: API Solutions and Research Challenges

    Get PDF
    We present zbMATH Open, the most comprehensive collection of reviews and bibliographic metadata of scholarly literature in mathematics. Besides our website https://zbMATH.org, which has been openly accessible since the beginning of this year, we provide API endpoints to offer our data. The API improves interoperability with other services, e.g., digital libraries, and allows our data to be used for research purposes. In this article, we (1) give an overview of the current and planned services offered by zbMATH; (2) present the initial version of the zbMATH links API; (3) analyze the potential and limitations of the links API using the example of the NIST Digital Library of Mathematical Functions; and (4) present the zbMATH Open dataset as a research resource and discuss connected open research problems.
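    As a rough illustration of how such an API can be consumed, the sketch below issues a standard OAI-PMH ListRecords request with Python's requests library. The base URL and metadata prefix are assumptions and should be checked against the current zbMATH Open documentation.

        # Minimal sketch of harvesting zbMATH Open metadata via OAI-PMH.
        # The endpoint URL and metadataPrefix below are assumptions, not
        # values confirmed by this abstract; consult the official docs.
        import requests
        import xml.etree.ElementTree as ET

        OAI_ENDPOINT = "https://oai.zbmath.org/v1/"  # assumed endpoint

        def list_records(metadata_prefix="oai_dc"):
            """Fetch one page of records using the standard OAI-PMH protocol."""
            params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
            response = requests.get(OAI_ENDPOINT, params=params, timeout=30)
            response.raise_for_status()
            return ET.fromstring(response.text)

        root = list_records()
        # OAI-PMH responses live in a fixed namespace; print each record identifier.
        ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
        for header in root.iterfind(".//oai:header", ns):
            print(header.findtext("oai:identifier", namespaces=ns))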

    Financial sentiment analysis of quarterly reports and stock performance

    Get PDF
    This thesis examines the use of financial sentiment analysis for quarterly reports published by companies listed on the Oslo Stock Exchange (OSE). Additionally, the study uses methods from computer science to transform financial reports from raw PDF format into financial sentiment scores. Furthermore, this thesis discusses the relationship between predicted financial sentiment and stock performance for chosen companies and industries. The thesis applies FinBERT, a widely used and recently developed language model for financial sentiment analysis built upon the more general language model BERT. The motivation for the study is the increasing interest in machine learning and Natural Language Processing (NLP) for financial applications. Modern modeling techniques allow investors to make more informed decisions, and the rise of language modeling has made it possible to derive insight into people's opinions through news and social networks. However, only a minority of studies investigate the language of quarterly reports. Methodologically, quarterly reports from the first quarter of 2019 to the fourth quarter of 2021 are downloaded from the investor relations pages of the selected companies. The downloaded reports are the input of a data pipeline that extracts the text and predicts the financial sentiment using Python tools such as PDFMiner and the Transformers library. The predicted sentiment is then loaded into a pipeline for visualization and stock performance comparisons based on stock data downloaded with the yfinance open-source tool. The thesis concludes that extracting text from financial PDF files is feasible. Furthermore, the FinBERT model predicts financial sentiment with higher accuracy than the more general BERT model. However, the relationship between predicted sentiment and stock performance is not strong, despite individual differences, and is stronger with past stock performance. Overall, this thesis demonstrates the value of domain-specific NLP for applications in the financial industry.
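    A condensed sketch of the pipeline described above, using the tools the abstract names. The ProsusAI/finbert checkpoint, the OSE ticker, and the report filename are illustrative assumptions, not necessarily the thesis's exact choices.

        # Sketch of the report-to-sentiment pipeline: PDF text extraction,
        # FinBERT sentiment prediction, and stock data download.
        # "ProsusAI/finbert", "EQNR.OL", and the filename are illustrative.
        from pdfminer.high_level import extract_text
        from transformers import pipeline
        import yfinance as yf

        # 1. Extract raw text from a quarterly report PDF (placeholder path).
        text = extract_text("q4_2021_report.pdf")

        # 2. Predict financial sentiment per text chunk with FinBERT.
        finbert = pipeline("sentiment-analysis", model="ProsusAI/finbert")
        chunks = [c.strip() for c in text.split("\n\n") if len(c.strip()) > 40]
        scores = finbert(chunks[:50], truncation=True)  # cap for brevity

        # 3. Download stock data for the same period for comparison.
        prices = yf.download("EQNR.OL", start="2019-01-01", end="2021-12-31")
        print(scores[:3], prices["Close"].tail())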

    Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations

    Full text link
    Identifying academic plagiarism is a pressing task for educational and research institutions, publishers, and funding agencies. Current plagiarism detection systems reliably find instances of copied and moderately reworded text. However, reliably detecting concealed plagiarism, such as strong paraphrases, translations, and the reuse of nontextual content and ideas, is an open research problem. In this paper, we extend our prior research on analyzing mathematical content and academic citations. Both are promising approaches for improving the detection of concealed academic plagiarism, primarily in Science, Technology, Engineering and Mathematics (STEM). We make the following contributions: i) we present a two-stage detection process that combines similarity assessments of mathematical content, academic citations, and text; ii) we introduce new similarity measures that consider the order of mathematical features and outperform the measures in our prior research; iii) we compare the effectiveness of the math-based, citation-based, and text-based detection approaches using confirmed cases of academic plagiarism; iv) we demonstrate that the combined analysis of math-based and citation-based content features allows identifying potentially suspicious cases in a collection of 102K STEM documents. Overall, we show that analyzing the similarity of mathematical content and academic citations is a striking supplement to conventional text-based detection approaches for academic literature in the STEM disciplines.
    Comment: Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL) 2019. The data and code of our study are openly available at https://purl.org/hybridP
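    The paper's exact similarity measures are not reproduced here; as a minimal sketch of the underlying idea, an order-sensitive measure over sequences of mathematical identifiers can be built on the longest common subsequence:

        # Sketch of an order-sensitive similarity over math feature sequences
        # (e.g., identifiers extracted from formulas in reading order). This
        # is an LCS-based illustration, not the paper's exact measure.
        def lcs_length(a, b):
            """Classic dynamic-programming longest common subsequence."""
            dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
            for i, x in enumerate(a):
                for j, y in enumerate(b):
                    dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
            return dp[-1][-1]

        def order_similarity(a, b):
            """Normalize LCS by the longer sequence; 1.0 = identical order and content."""
            return lcs_length(a, b) / max(len(a), len(b), 1)

        doc1 = ["x", "y", "\\alpha", "x", "\\sum"]   # identifiers from document 1
        doc2 = ["x", "\\alpha", "x", "\\sum", "z"]   # identifiers from document 2
        print(order_similarity(doc1, doc2))          # 0.8

    Unlike a bag-of-features comparison, transposing two formulas changes this score, which is the property the abstract's order-aware measures target.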

    unarXive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata

    Get PDF
    In recent years, scholarly data sets have been used for various purposes, such as paper recommendation, citation recommendation, citation context analysis, and citation context-based document summarization. The evaluation of approaches to such tasks and their applicability in real-world scenarios depend heavily on the data set used. However, existing scholarly data sets are limited in several regards. Here, we propose a new data set based on all publications from all scientific disciplines available on arXiv.org. Apart from providing the papers' plain text, in-text citations are annotated via global identifiers. Furthermore, citing and cited publications are linked to the Microsoft Academic Graph, providing access to rich metadata. Our data set consists of over one million documents and 29.2 million citation contexts. The data set, which is made freely available for research purposes, not only can enhance the future evaluation of research-paper-based and citation-context-based approaches but can also serve as a basis for new ways to analyze in-text citations. See https://github.com/IllDepence/unarXive for the source code used to create the data set. For citing our data set and for further information, we refer readers to our journal article: Tarek Saier, Michael Färber: "unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata", Scientometrics, 2020, http://dx.doi.org/10.1007/s11192-020-03382-z
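    A sketch of scanning one unarXive-style full-text file for annotated in-text citations. The {{cite:<id>}} marker format and the plain-text file layout are assumptions based on the project's documentation; verify against the repository before relying on them.

        # Sketch of extracting citation contexts from a unarXive-style paper.
        # The {{cite:<id>}} marker format is an assumption; check the repo.
        import re
        from pathlib import Path

        CITE_MARKER = re.compile(r"\{\{cite:([0-9a-f-]+)\}\}")

        def citation_contexts(path, window=200):
            """Yield (citation_id, surrounding_text) pairs from one paper."""
            text = Path(path).read_text(encoding="utf-8")
            for match in CITE_MARKER.finditer(text):
                start = max(0, match.start() - window)
                end = min(len(text), match.end() + window)
                yield match.group(1), text[start:end]

        for cite_id, context in citation_contexts("paper.txt"):  # placeholder file
            print(cite_id, context[:80])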

    Automatic generation of ISO 19650 compliant templates based on standard construction contracts using a microservices approach.

    Get PDF
    This study aims to establish a framework for automatically generating evidence for ISO 19650 certification. The study starts with an investigation of the challenges organisations face in complying with the BIM standard ISO 19650; the key area of interest identified is an organisation's ability to understand what its information requirements are. Once requirements have been identified, they are translated into a format that is both machine- and human-readable. Extraction of text from existing project documentation is also investigated, and a microservice-based solution is proposed which formats and produces documents that meet the standards for information management requirements.
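    As a rough sketch of the microservices style described, a single text-extraction service could be exposed over HTTP. Flask, the route name, and the use of pdfminer.six are illustrative choices, not the study's actual stack.

        # Illustrative microservice: accepts a project document upload and
        # returns the extracted text for downstream requirement templating.
        # Flask, the /extract route, and pdfminer.six are assumptions.
        from flask import Flask, request, jsonify
        from pdfminer.high_level import extract_text

        app = Flask(__name__)

        @app.route("/extract", methods=["POST"])
        def extract():
            """Extract text from an uploaded PDF so later services can parse
            information requirements into a machine- and human-readable template."""
            uploaded = request.files["document"]
            text = extract_text(uploaded.stream)
            return jsonify({"text": text})

        if __name__ == "__main__":
            app.run(port=5000)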

    Document Layout Analysis and Recognition Systems

    Get PDF
    Automatic extraction of knowledge relevant to domain-specific questions from Optical Character Recognition (OCR) documents is critical for developing intelligent systems, such as document search engines, sentiment analysis, and information retrieval, since hands-on knowledge extraction by a domain expert from a large volume of documents is labor-intensive, unscalable, and time-consuming. A number of studies have automatically extracted relevant knowledge from OCR documents, such as ABBYY and Stanford Natural Language Processing (NLP). Despite the progress, there are still limitations yet to be solved. For instance, NLP often fails to analyze large documents. In this thesis, we propose a knowledge extraction framework which takes domain-specific questions as input and provides the most relevant sentence/paragraph for the given questions in the document. Overall, our proposed framework has two phases. First, an OCR document is reconstructed into a semi-structured document (a document with a hierarchical structure of (sub)sections and paragraphs). Then, the relevant sentence/paragraph for a given question is identified from the reconstructed semi-structured document. Specifically, we propose (1) a method that converts an OCR document into a semi-structured document using text attributes such as font size, font height, and boldface (Chapter 2), (2) an image-based machine learning method that extracts the Table of Contents (TOC) to provide an overall structure of the document (Chapter 3), (3) a document-texture-based deep learning method (DoT-Net) that classifies types of blocks such as text, image, and table (Chapter 4), and (4) a Question & Answer (Q&A) system that retrieves the most relevant sentence/paragraph for a domain-specific question. A large number of document intelligence systems can benefit from our proposed automatic knowledge extraction system to construct a Q&A system for OCR documents. Our Q&A system has been applied to extract domain-specific information from business contracts at GE Power.
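    A minimal sketch of the Chapter 2 idea, labeling lines as headings or body text from font attributes. pdfminer.six serves here as an illustrative extraction backend, and the size threshold is arbitrary rather than taken from the thesis.

        # Sketch of reconstructing a semi-structured document from font
        # attributes (size and boldface), in the spirit of Chapter 2.
        # pdfminer.six and the 1.2x size threshold are assumptions.
        from pdfminer.high_level import extract_pages
        from pdfminer.layout import LTTextContainer, LTChar

        def classify_lines(pdf_path, body_size=10.0):
            """Label each text line as 'heading' or 'body' via size and boldness."""
            for page in extract_pages(pdf_path):
                for element in page:
                    if not isinstance(element, LTTextContainer):
                        continue
                    for line in element:
                        chars = [c for c in line if isinstance(c, LTChar)]
                        if not chars:
                            continue
                        size = sum(c.size for c in chars) / len(chars)
                        bold = any("Bold" in c.fontname for c in chars)
                        label = "heading" if size > 1.2 * body_size or bold else "body"
                        yield label, line.get_text().strip()

        for label, text in classify_lines("report.pdf"):  # placeholder path
            print(f"[{label}] {text}")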

    Sentence boundary extraction from scientific literature of electric double layer capacitor domain: Tools and techniques

    Get PDF
    Given the growth of scientific literature on the web, particularly in materials science, acquiring data precisely from the literature has become more significant. Material information systems, or chemical information systems, play an essential role in discovering data, materials, or synthesis processes using the existing scientific literature. Processing and understanding the natural language of scientific literature is the backbone of these systems, which depend heavily on appropriate textual content. Appropriate textual content means complete, meaningful sentences drawn from a large chunk of textual content. The process of detecting the beginning and end of a sentence and extracting them as correct sentences is called sentence boundary extraction. The accurate extraction of sentence boundaries from PDF documents is essential for readability and natural language processing. Therefore, this study provides a comparative analysis of different tools for extracting text from PDF documents which are available as Python libraries or packages and are widely used by the research community. The main objective is to find the most suitable among the available techniques that can correctly extract sentences from PDF files as text. The performance of the techniques used, PyPDF2, pdfminer.six, PyMuPDF, pdftotext, Tika, and GROBID, is presented in terms of precision, recall, F1 score, run time, and memory consumption. The NLTK, spaCy, and Gensim Natural Language Processing (NLP) tools are used to identify sentence boundaries. Of all the techniques studied, the GROBID PDF extraction package combined with the NLP tool spaCy achieved the highest F1 score of 93% and consumed the least amount of memory, at 46.13 megabytes.
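    A sketch of the extraction-then-segmentation pipeline the study evaluates. Its best combination was GROBID plus spaCy, but since GROBID runs as a separate service, pdfminer.six stands in here for the extraction step; the file path is a placeholder.

        # Sketch: extract text from a PDF, then detect sentence boundaries
        # with spaCy. pdfminer.six substitutes for GROBID for self-containment.
        from pdfminer.high_level import extract_text
        import spacy

        nlp = spacy.load("en_core_web_sm")  # assumes the model is installed

        def extract_sentences(pdf_path):
            """Extract raw text, then let spaCy detect sentence boundaries."""
            text = extract_text(pdf_path)
            doc = nlp(text)
            return [s.text.strip() for s in doc.sents if s.text.strip()]

        for sentence in extract_sentences("edlc_paper.pdf")[:5]:  # placeholder path
            print(sentence)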

    Citation Recommendation: Approaches and Datasets

    Get PDF
    Citation recommendation describes the task of recommending citations for a given text. Due to the overload of scientific works published in recent years on the one hand, and the need to cite the most appropriate publications when writing scientific texts on the other, citation recommendation has emerged as an important research topic. In recent years, several approaches and evaluation data sets have been presented. However, to the best of our knowledge, no literature survey has been conducted explicitly on citation recommendation. In this article, we give a thorough introduction to automatic citation recommendation research. We then present an overview of the approaches and data sets for citation recommendation and identify differences and commonalities along various dimensions. Last but not least, we shed light on the evaluation methods and outline general challenges in the evaluation and how to meet them. We restrict ourselves to citation recommendation for scientific publications, as this document type has been studied the most in this area. However, many of the observations and discussions included in this survey are also applicable to other types of text, such as news articles and encyclopedic articles.
    Comment: to be published in the International Journal on Digital Libraries
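    As a minimal illustration of the task itself (a toy baseline, not any specific approach from the survey), candidate papers can be ranked against a citation context by TF-IDF cosine similarity; all texts below are made up:

        # Toy citation recommendation: rank candidate papers' abstracts
        # against a citation context. Baseline sketch with invented data.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        candidates = {
            "paper_a": "Neural networks for citation context analysis ...",
            "paper_b": "Graph-based ranking of scholarly documents ...",
            "paper_c": "Survey of recommender systems for news articles ...",
        }
        context = "We build on graph-based ranking methods for scholarly data"

        vectorizer = TfidfVectorizer()
        matrix = vectorizer.fit_transform(list(candidates.values()) + [context])
        scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

        # Recommend candidates in descending similarity order.
        for pid, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
            print(f"{pid}: {score:.3f}")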

    Data Mining Methods for the Plagiarism Analysis of Student Theses

    Get PDF
    Existing approaches to automated plagiarism analysis rely on extensive, maintenance-intensive reference corpora or draw exclusively on the information contained in the document under examination. Using external data generally leads to better analysis results (cf. [Tschuggnall 2014, 8]). In this thesis, an extrinsic method for the plagiarism analysis of student theses was developed and evaluated which uses a limited training data set as its reference corpus. The method draws on document type classification and stylometry: if a section of the input document does not match the average writing style of a student thesis, it is flagged as a potential plagiarism. Several evaluation steps showed that the method is in principle suitable for the plagiarism analysis of student theses. In the simulated application context, 71.03% of the segments from bachelor's and master's theses and 53.62% of the segments from textbooks, journal articles, and Wikipedia articles were classified correctly. The F1 score achieved matches the performance of intrinsic methods, while the recall achieved is considerably higher. The features extracted from the training corpora were made available as ARFF files.
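    A minimal sketch of the stylometric idea (the thesis's concrete features and classifier are not reproduced): compute simple style features per segment and flag segments that deviate strongly from the corpus average.

        # Sketch of style-based flagging: segments whose simple stylometric
        # features deviate strongly from the reference average are marked
        # as potential plagiarism. Features and threshold are illustrative.
        import statistics

        def style_features(segment):
            """Average word length and type-token ratio of a text segment."""
            words = segment.lower().split()
            if not words:
                return (0.0, 0.0)
            avg_len = sum(len(w) for w in words) / len(words)
            ttr = len(set(words)) / len(words)
            return (avg_len, ttr)

        def flag_outliers(segments, z_threshold=2.0):
            """Flag segments more than z_threshold standard deviations from the mean."""
            feats = [style_features(s) for s in segments]
            flags = []
            for dim in range(2):
                values = [f[dim] for f in feats]
                mean = statistics.mean(values)
                stdev = statistics.pstdev(values) or 1.0  # guard zero spread
                flags.append([abs(v - mean) / stdev > z_threshold for v in values])
            return [a or b for a, b in zip(*flags)]

        segments = ["First section of the thesis ...", "An oddly formal excerpt ..."]
        print(flag_outliers(segments))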