
    Content-Based Quality Estimation for Automatic Subject Indexing of Short Texts under Precision and Recall Constraints

    Semantic annotations have to satisfy quality constraints to be useful for digital libraries, which is particularly challenging on large and diverse datasets. Confidence scores of multi-label classification methods typically refer only to the relevance of particular subjects, disregarding indicators of insufficient content representation at the document level. Therefore, we propose a novel approach that detects documents, rather than concepts, for which quality criteria are met. Our approach uses a deep, multi-layered regression architecture, which comprises a variety of content-based indicators. We evaluated multiple configurations using text collections from law and economics, where the available content is restricted to very short texts. Notably, we demonstrate that the proposed quality estimation technique can determine subsets of the previously unseen data where considerable gains in document-level recall can be achieved, while upholding precision at the same time. Hence, the approach effectively performs a filtering that ensures high data quality standards in operative information retrieval systems. (Comment: authors' manuscript, submitted to the TPDL 2018 conference, 12 pages.)
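
The abstract does not spell out the regression architecture, so the following is only a minimal sketch of the general idea under stated assumptions: a regressor over content-based, document-level indicators (text length, lexical diversity, and aggregated per-subject confidences are assumed stand-ins) predicts document-level recall, and a threshold on that prediction selects the subset of unseen documents expected to meet the quality constraint.

```python
# Minimal sketch (not the authors' implementation): estimate document-level
# recall from content-based indicators and keep only documents that clear a
# quality threshold. The feature choices and the MLPRegressor stand-in are assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor

def indicator_features(text, label_confidences):
    """Toy content-based indicators for one document (illustrative only)."""
    tokens = text.split()
    conf = np.asarray(label_confidences) if len(label_confidences) else np.zeros(1)
    return np.array([
        len(tokens),                              # amount of available content
        len(set(tokens)) / max(len(tokens), 1),   # lexical diversity
        conf.mean(),                              # mean subject-level confidence
        conf.max(),                               # strongest single suggestion
    ])

def fit_quality_estimator(X_train, y_train):
    """X_train: indicator vectors; y_train: observed document-level recall on held-out data."""
    model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)
    model.fit(X_train, y_train)
    return model

def filter_documents(model, X_new, min_predicted_recall=0.8):
    """Return indices of unseen documents predicted to satisfy the recall constraint."""
    predicted = model.predict(X_new)
    return np.flatnonzero(predicted >= min_predicted_recall)
```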

    Aufbau eines produktiven Dienstes für die automatisierte Inhaltserschließung an der ZBW: Ein Status- und Erfahrungsbericht (Building a Productive Service for Automated Subject Indexing at ZBW: A Status and Experience Report)

    Since 2016, ZBW – Leibniz Information Centre for Economics has been conducting its own applied research in the area of machine learning with the goal of developing viable solutions for automated or machine-assisted subject indexing in-house. In 2020, a team at ZBW started designing and implementing a suitable software architecture in order to transfer these prototypical solutions into a productive service and to integrate it into the existing metadata systems and workflows. Both the applied research and the software development necessary for this endeavour (dubbed “AutoSE”) are carried out by an organizational unit of the library department of ZBW, are continually pushed forward following the state of the art, and benefit from close communication with the staff responsible for intellectual subject indexing. This article reports on the milestones that the AutoSE team has reached over the last two years with respect to the implementation and integration of the software, and outlines those that are yet to be delivered by the end of the pilot phase (2024). The architecture is based on open source software, and its machine-learning components are developed in close collaboration with the National Library of Finland (NLF) and, where possible, adapted for reuse in NLF's open source toolkit Annif. The operating model of the AutoSE service includes periodic reviews of individual components and of the productive workflow in its entirety, and allows continuous improvement of the architecture. One of the results to be delivered by the end of the pilot phase is a documentation of the requirements for running the productive service on a permanent basis, so that the necessary resources can be secured. This practical example shows which conditions have to be met by an institution in order to successfully use machine learning solutions such as the ones offered in Annif for subject indexing.
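
For illustration, a hedged sketch of how such a productive service might be queried programmatically: Annif exposes a REST API for subject suggestions, and the snippet below assumes a locally running instance and a hypothetical project ID; endpoint details may vary between Annif versions and say nothing about the actual AutoSE deployment.

```python
# Hedged sketch: request subject suggestions from a locally running Annif
# instance. The base URL, port, and project ID ("econ-example") are assumptions,
# not details of the AutoSE deployment described above.
import requests

def suggest_subjects(text, project_id="econ-example",
                     base_url="http://localhost:5000/v1", limit=10):
    response = requests.post(
        f"{base_url}/projects/{project_id}/suggest",
        data={"text": text, "limit": limit},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("results", [])

# Example usage with an arbitrary economics-related snippet:
for hit in suggest_subjects("Monetary policy responses to inflation in the euro area"):
    print(hit.get("uri"), hit.get("label"), hit.get("score"))
```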

    Diabetes Diagnosis through Machine Learning: An Analysis of Classification Algorithms

    Diabetes is a serious and chronic disease characterized by high levels of sugar in the blood. If left untreated, it can lead to numerous complications. In the past, diagnosing diabetes required a visit to a diagnostic center and consultation with a doctor. However, machine learning can help to identify the disease earlier and more accurately. This study aimed to create a model that can accurately predict the likelihood of diabetes in patients using three machine learning classification algorithms: Logistic Regression (LR), Decision Tree (DT), and Naive Bayes (NB). The model was tested on the Pima Indians Diabetes Database (PIDD) from the UCI machine learning repository, and the performance of the algorithms was evaluated using metrics such as accuracy, precision, F-measure, and recall. The results showed that Logistic Regression had the highest accuracy at 71.39%, outperforming the other algorithms.
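
For illustration, the sketch below reproduces the described comparison with scikit-learn on a local copy of PIDD; the file name, column layout (eight features plus an "Outcome" label), and 70/30 split are assumptions rather than details taken from the paper.

```python
# Hedged sketch of the reported comparison: train LR, DT, and NB on the Pima
# Indians Diabetes data and report accuracy, precision, recall, and F-measure.
# The CSV path, column names, and split ratio are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

data = pd.read_csv("pima-indians-diabetes.csv")   # hypothetical local copy of PIDD
X, y = data.drop(columns="Outcome"), data["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: acc={accuracy_score(y_test, pred):.3f} "
          f"prec={precision_score(y_test, pred):.3f} "
          f"rec={recall_score(y_test, pred):.3f} "
          f"f1={f1_score(y_test, pred):.3f}")
```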

    Diagnose Eyes Diseases Using Various Features Extraction Approaches and Machine Learning Algorithms

    Ophthalmic diseases like glaucoma, diabetic retinopathy, and cataracts are the main causes of visual impairment worldwide. Even with fundus images, it can be difficult for a clinician to detect eye diseases early enough. Moreover, the diagnosis of eye diseases is error-prone, challenging, and labor-intensive. Thus, an automated, computer-assisted system for identifying various eye problems from fundus images is needed. Such a system is feasible thanks to the advanced image-classification capabilities of machine learning (ML) algorithms; machine learning is an essential area of artificial intelligence (AI). Owing to the general capacity of machine learning to automatically identify, locate, and grade pathological features in ocular disorders, ophthalmologists will soon be able to deliver accurate diagnoses and support individualized healthcare. This work presents an ML-based method for ocular disease detection. The Ocular Disease Intelligent Recognition (ODIR) dataset, which includes 5,000 images across 8 fundus classes representing various ocular diseases, was classified using machine learning methods. In this study, the dataset was divided into 70% training data and 30% test data, and preprocessing was performed on all images: conversion from color to grayscale, histogram equalization, blurring, and resizing. Feature extraction represents the next phase of the study; two algorithms were applied: SIFT (scale-invariant feature transform) and GLCM (gray-level co-occurrence matrix). The ODIR dataset was then subjected to the classification techniques Naïve Bayes, Decision Tree, Random Forest, and K-Nearest Neighbor. For binary classification (normal vs. abnormal), the accuracies achieved were 75% (NB), 62% (RF), 53% (KNN), and 51% (DT); for multiclass classification (types of eye diseases), they were 88% (RF), 61% (KNN), 42% (NB), and 39% (DT).
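
A hedged sketch of the preprocessing and GLCM branch of this pipeline is shown below; the image paths, GLCM parameters, and the OpenCV/scikit-image function choices are assumptions rather than the authors' exact implementation, and the SIFT branch is omitted for brevity.

```python
# Hedged sketch of one branch of the described pipeline: grayscale conversion,
# histogram equalization, blurring, resizing, then GLCM texture features for a
# classifier such as Random Forest. Parameters and paths are assumptions;
# scikit-image >= 0.19 is assumed for graycomatrix/graycoprops.
import cv2
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.ensemble import RandomForestClassifier

def glcm_features(image_path, size=(256, 256)):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # color -> grayscale
    gray = cv2.equalizeHist(gray)                  # histogram equalization
    gray = cv2.GaussianBlur(gray, (5, 5), 0)       # blur to suppress noise
    gray = cv2.resize(gray, size)                  # resize to a common shape
    glcm = graycomatrix(gray, distances=[1], angles=[0], levels=256,
                        symmetric=True, normed=True)
    props = ["contrast", "dissimilarity", "homogeneity", "energy", "correlation"]
    return np.array([graycoprops(glcm, p)[0, 0] for p in props])

# image_paths and labels are hypothetical placeholders for the ODIR images:
# X = np.stack([glcm_features(p) for p in image_paths])
# clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
```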

    Automating subject indexing at ZBW

    Subject indexing, i.e., the enrichment of metadata records for textual resources with descriptors from a controlled vocabulary, is one of the core activities of libraries. Due to the proliferation of digital documents, it is no longer possible to annotate every single document intellectually, which is why we need to explore the potential of automation on every level. At ZBW, efforts to partially or completely automate the subject indexing process started as early as 2000 with experiments involving external partners and commercial software. The conclusion of that first exploratory period was that commercial, supposedly shelf-ready solutions would not suffice to cover the requirements of the library. In 2014 the decision was made to start doing the necessary applied research in-house, which was successfully implemented by establishing a PhD position. However, the prototypical machine learning solutions developed over the following years were yet to be integrated into productive operations at the library. Therefore, in 2020 an additional position for a software engineer was established and a pilot phase was initiated (planned to last until 2024), with the goal of completing the transfer of our solutions into practice by building a suitable software architecture that allows for real-time subject indexing with our trained models and its integration into the other metadata workflows at ZBW. In this paper we address the question of how to transfer results from applied research into a productive service, and we report on the milestones we have reached so far and on those that are yet to be reached on an operational level. We also discuss the challenges we faced on a strategic level, the measures and resources (computing power, software, personnel) that were needed in order to effect the transfer, and those that will be necessary to subsequently ensure the continued availability of the architecture and to enable continuous development during running operations. We conclude that there are still no shelf-ready open source systems for the automation of subject indexing – existing software has to be adapted and maintained continuously, which requires various forms of expertise. However, the task of automation is here to stay, and librarians are witnessing the dawn of a new era in which subject indexing is done at least in part by machines, and the respective roles of machines and human experts may shift even further and more rapidly in a not-so-distant future. We argue that, in general, the format of “project” and the mindset that goes with it may not suffice to secure the commitment that an institution, its decision-makers, and the library community as a whole will have to bring to the table in order to face the monumental task of digital transformation and automation in the long run. We also highlight the importance of all parties – applied researchers, software engineers, stakeholders – staying involved and continuously communicating requirements and issues back and forth in order to successfully create and establish a productive service that is suitable and equipped for operation.
