
Cross-Domain information extraction from scientific articles for research knowledge graphs

Author: Brack Arthur
Publication venue: Hannover : Institutionelles Repositorium der Leibniz Universität Hannover
Publication date: 01/01/2022

Today's scholarly communication is a document-centred process and, as such, rather inefficient. Fundamental contents of research papers are not accessible to computers since they are only present in unstructured PDF files. Therefore, current research infrastructures are not able to assist scientists appropriately in their core research tasks. This thesis addresses this issue and proposes methods to automatically extract relevant information from scientific articles for Research Knowledge Graphs (RKGs), which represent scholarly knowledge in a structured and interlinked form. First, this thesis conducts a requirements analysis for an Open Research Knowledge Graph (ORKG). We present literature-related use cases of researchers that should be supported by an ORKG-based system, together with their specific requirements for the underlying ontology and instance data. Based on this analysis, the identified use cases are categorised into two groups. The first group needs manual or semi-automatic approaches for knowledge graph (KG) construction since its use cases require high correctness of the instance data. The second group requires high completeness and can tolerate noisy instance data, so it needs automatic approaches for KG population. This thesis focuses on the second group of use cases and provides contributions for machine learning tasks that aim to support them. To assess the relevance of a research paper, scientists usually skim through titles, abstracts, introductions, and conclusions. An organised presentation of the articles' essential information would make this process more time-efficient. The task of sequential sentence classification addresses this issue by classifying the sentences of an article into categories such as research problem, used methods, or obtained results. For this task, we propose a novel unified cross-domain multi-task deep learning approach that makes use of datasets from different scientific domains (e.g. biomedicine and computer graphics) and varying structures (e.g. datasets covering either only abstracts or full papers). Our approach significantly outperforms the state of the art on full-paper datasets while being competitive on datasets consisting of abstracts. Moreover, our approach enables the categorisation of sentences in a domain-independent manner. Furthermore, we present the novel task of domain-independent information extraction, which extracts scientific concepts from research papers regardless of their domain. This task aims to support the use cases "find related work" and "get recommended articles". For this purpose, we introduce a set of generic scientific concepts that are relevant across ten domains in Science, Technology, and Medicine (STM) and release an annotated dataset of 110 abstracts from these domains. Since the annotation of scientific text is costly, we suggest an active learning strategy based on a state-of-the-art deep learning approach. The proposed method enables us to nearly halve the amount of required training data. Then, we extend this domain-independent information extraction approach with the task of coreference resolution, which aims to identify mentions that refer to the same concept or entity. Baseline results on our corpus showed that current state-of-the-art approaches for coreference resolution perform poorly on scientific text. Therefore, we propose a sequential transfer learning approach that exploits annotated datasets from non-academic domains. Our experimental results demonstrate that our approach noticeably outperforms the state-of-the-art baselines. Additionally, we investigate the impact of coreference resolution on KG population. We demonstrate that coreference resolution has a small impact on the number of resulting concepts in the KG but improves its quality significantly. Using our domain-independent information extraction approach, we then populate an RKG from 55,485 abstracts of the ten investigated STM domains. We show that every domain mainly uses its own terminology and that the populated RKG contains useful concepts. Moreover, we propose a novel approach for the task of citation recommendation. This task can help researchers improve the quality of their work by finding or recommending relevant related work. Our approach exploits RKGs that interlink research papers based on mentioned scientific concepts. Using our automatically populated RKG, we demonstrate that combining information from RKGs with existing state-of-the-art approaches is beneficial. Finally, we conclude the thesis and sketch possible directions of future work.

Institutionelles Repositorium der Leibniz Universität Hannover

ABridges: Scalable, self-configuring Ethernet campus networks

Author: Azcorra Saloña Arturo; García Martínez Alberto; Ibáñez Fernández Guillermo Agustín; Soto Campos Ignacio
Publication venue: Elsevier BV
Publication date: 01/01/2008

This article describes the ABridges architecture, a scalable, self-configuring architecture for campus networks. It is a two-tiered hierarchy of layer-two switches in which network islands running independent Rapid Spanning Tree Protocol (RSTP) instances communicate through a core formed by island root bridges (ABridges). ABridges use AMSTP, a simplified and self-configuring version of the MSTP protocol, to establish shortest paths in the core using multiple spanning tree instances, one instance rooted at each core edge ABridge. The architecture is very efficient in terms of network usage and path length because AMSTP provides optimum paths in the core mesh, while RSTP efficiently aggregates traffic in the island networks, where sparsely connected, tree-like topologies are frequent and recommended. Convergence is as fast as that of the existing Rapid Spanning Tree and Multiple Spanning Tree Protocols.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Universidad Carlos III de Madrid e-Archivo

Sentence Classification with Hierarchical Neural Networks for Rhetorical Sections Extraction

Author: Ntargaras Andreas (Νταργαράς Ανδρέας)
Publication date: 01/01/2021

Background: Millions of scholarly articles and scientific papers are published each year, making the search for relevant literature harder with each passing day. Clear and informative abstracts have therefore become an essential medium for researchers to locate their desired information in a timely and efficient manner. Many abstracts, however, still lack common rhetorical structural elements that would improve their communicative purposes within the context of academic discourse. Objective: In the present thesis we aim to examine the efficacy of sentence classification models for extracting rhetorical sections from abstracts of different domains and structures, and to create a tool that automates this process. Method: The sentence classification models used here were based on a hierarchical neural network (HNN) trained on three different datasets. Result: Our results show that our models confirm their state-of-the-art (SOTA) performance on abstracts from the same scientific field as those they were trained on, but their cross-domain accuracy drops significantly, especially when applied to unconventionally structured abstracts. Conclusion: An accurate tool for obtaining the rhetorical sections of abstracts can become the basis for a larger framework that could summarize information, helping tremendously to speed up the process of literature research.

Pergamos : Unified Institutional Repository / Digital Library Platform of the National and Kapodistrian University of Athens

Improving the performance of parallel scientific applications using cache injection

Author: Leon Borja Edgar
Publication venue: UNM Digital Repository
Publication date: 01/05/2009

Cache injection is a viable technique to improve the performance of data-intensive parallel applications. This dissertation characterizes cache injection of incoming network data in terms of parallel application performance. My results show that the benefit of this technique is dependent on: the ratio of processor speed to memory speed, the cache injection policy, and the application's communication characteristics. Cache injection addresses the memory wall for I/O by writing data into a processor's cache directly from the I/O bus. This technique, unlike data prefetching, reduces the number of reads served by the memory unit. This reduction is significant for data-intensive applications whose performance is dominated by compulsory cache misses and cannot be alleviated by traditional caching systems. Unlike previous work on cache injection which focused on reducing host network stack overhead incurred by memory copies, I show that applications can directly benefit from this technique based on their temporal and spatial locality in accessing incoming network data. I also show that the performance of cache injection is directly proportional to the ratio of processor speed to memory speed. In other words, systems with a memory wall can provide significantly better performance with cache injection and an appropriate injection policy. This result implies that multi-core and many-core architectures would benefit from this technique. Finally, my results show that the application's communication characteristics are key to cache injection performance. For example, cache injection can improve the performance of certain collective communication operations by up to 20% as a function of message size.

Technology for large space systems: A bibliography with indexes (supplement 19)


This bibliography lists 526 reports, articles, and other documents introduced into the NASA scientific and technical information system between January 1, 1988 and June 30, 1988. Its purpose is to provide helpful information to the researcher, manager, and designer in technology development and mission design according to system, interactive analysis and design, structural and thermal analysis and design, structural concepts and control systems, electronics, advanced materials, assembly concepts, propulsion, and solar power satellite systems

NASA Technical Reports Server