
    Semi-automated ontology generation within OBO-Edit

    Motivation: Ontologies and taxonomies have proven highly beneficial for biocuration. The Open Biomedical Ontologies (OBO) Foundry alone lists over 90 ontologies, mainly built with OBO-Edit. Creating and maintaining such ontologies is a labour-intensive, difficult, manual process. Automating parts of it is of great importance for the further development of ontologies and for biocuration.

    Context Aware Computing for The Internet of Things: A Survey

    As we move towards the Internet of Things (IoT), the number of sensors deployed around the world is growing at a rapid pace. Market research has shown significant growth in sensor deployments over the past decade and predicts that the growth rate will increase further in the future. These sensors continuously generate enormous amounts of data. However, in order to add value to raw sensor data, we need to understand it. The collection, modelling, reasoning, and distribution of context in relation to sensor data play a critical role in this challenge. Context-aware computing has proven successful in understanding sensor data. In this paper, we survey context awareness from an IoT perspective. We first present the necessary background by introducing the IoT paradigm and context-aware fundamentals. Then we provide an in-depth analysis of the context life cycle. We evaluate a subset of 50 projects, representing the majority of research and commercial solutions proposed in the field of context-aware computing over the last decade (2001-2011), based on our own taxonomy. Finally, based on our evaluation, we highlight the lessons to be learnt from the past and some possible directions for future research. The survey addresses a broad range of techniques, methods, models, functionalities, systems, applications, and middleware solutions related to context awareness and the IoT. Our goal is not only to analyse, compare, and consolidate past research work but also to appreciate its findings and discuss its applicability to the IoT.
    Comment: IEEE Communications Surveys & Tutorials Journal, 201
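    The context life cycle the survey analyses can be pictured as a small pipeline: acquire raw readings, model them as context, reason over them, and distribute the result. The Python sketch below is purely illustrative; all names, values, and thresholds are hypothetical, not taken from the paper.

        # Toy sketch of the context life-cycle phases discussed in the
        # survey: acquisition, modelling, reasoning, and distribution.
        from dataclasses import dataclass

        @dataclass
        class Context:
            # Modelling: a raw reading represented as structured context.
            sensor_id: str
            attribute: str   # e.g. "temperature"
            value: float
            unit: str

        def acquire():
            # Acquisition: pull raw readings from (here, simulated) sensors.
            return [Context("s1", "temperature", 31.5, "C"),
                    Context("s2", "temperature", 18.0, "C")]

        def reason(ctx):
            # Reasoning: derive higher-level context from a raw reading.
            if ctx.value > 30.0:
                return "hot"
            return "comfortable" if ctx.value > 15.0 else "cold"

        def distribute(ctx, label):
            # Distribution: hand the derived context to interested consumers.
            print(f"{ctx.sensor_id}: {ctx.attribute}={ctx.value}{ctx.unit} -> {label}")

        for reading in acquire():
            distribute(reading, reason(reading))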

    Barry Smith an sich

    Festschrift in honor of Barry Smith on the occasion of his 65th birthday. Published as issue 4:4 of the journal Cosmos + Taxis: Studies in Emergent Order and Organization. Includes contributions by Wolfgang Grassl, Nicola Guarino, John T. Kearns, Rudolf LĂŒthe, Luc Schneider, Peter Simons, Wojciech Ć»eƂaniec, and Jan WoleƄski.

    Biomedical semantics in the Semantic Web

    The Semantic Web offers an ideal platform for representing and linking biomedical information, which is a prerequisite for the development and application of analytical tools to address problems in data-intensive areas such as systems biology and translational medicine. As for any new paradigm, the adoption of the Semantic Web offers opportunities and poses questions and challenges to the life sciences community: which technologies in the Semantic Web stack will be most beneficial for the life sciences? Is biomedical information too complex to benefit from simple interlinked representations? What are the implications of adopting a new paradigm for knowledge representation? What are the incentives for the adoption of the Semantic Web, and who are the facilitators? Is there going to be a Semantic Web revolution in the life sciences?
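    To make the idea of interlinked representations concrete, here is a minimal RDF sketch using the Python rdflib library. The vocabulary and the gene-disease link are invented for illustration; real biomedical datasets would reuse identifiers from sources such as UniProt or the Gene Ontology.

        # Minimal illustration of interlinked biomedical data as RDF triples.
        # The example vocabulary (ex:) is invented for this sketch.
        from rdflib import Graph, Literal, Namespace, RDFS

        EX = Namespace("http://example.org/bio/")
        g = Graph()
        g.bind("ex", EX)

        # A gene, a disease, and an association between them as three triples.
        g.add((EX.BRCA1, RDFS.label, Literal("BRCA1")))
        g.add((EX.BreastCancer, RDFS.label, Literal("breast cancer")))
        g.add((EX.BRCA1, EX.associatedWith, EX.BreastCancer))

        print(g.serialize(format="turtle"))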

    Cross-Domain information extraction from scientific articles for research knowledge graphs

    Today’s scholarly communication is a document-centred process and, as such, rather inefficient. Fundamental contents of research papers are not accessible to computers, since they are only present in unstructured PDF files. Therefore, current research infrastructures are not able to assist scientists appropriately in their core research tasks. This thesis addresses this issue and proposes methods to automatically extract relevant information from scientific articles for Research Knowledge Graphs (RKGs) that represent scholarly knowledge in a structured and interlinked form.

    First, this thesis conducts a requirements analysis for an Open Research Knowledge Graph (ORKG). We present literature-related use cases of researchers that should be supported by an ORKG-based system, together with their specific requirements for the underlying ontology and instance data. Based on this analysis, the identified use cases are categorised into two groups: the first group needs manual or semi-automatic approaches for knowledge graph (KG) construction, since it requires high correctness of the instance data; the second group requires high completeness and can tolerate noisy instance data, and thus needs automatic approaches for KG population. This thesis focuses on the second group of use cases and contributes machine learning approaches that aim to support them.

    To assess the relevance of a research paper, scientists usually skim through titles, abstracts, introductions, and conclusions. An organised presentation of an article's essential information would make this process more time-efficient. The task of sequential sentence classification addresses this issue by classifying the sentences of an article into categories such as research problem, used methods, or obtained results. To address this problem, we propose a novel unified cross-domain multi-task deep learning approach that makes use of datasets from different scientific domains (e.g. biomedicine and computer graphics) with varying structures (e.g. datasets covering either only abstracts or full papers). Our approach significantly outperforms the state of the art on full-paper datasets while being competitive on datasets consisting of abstracts. Moreover, it enables the categorisation of sentences in a domain-independent manner.

    Furthermore, we present the novel task of domain-independent information extraction, which extracts scientific concepts from research papers regardless of domain. This task aims to support the use cases of finding related work and getting recommended articles. For this purpose, we introduce a set of generic scientific concepts that are relevant across ten domains in Science, Technology, and Medicine (STM) and release an annotated dataset of 110 abstracts from these domains. Since the annotation of scientific text is costly, we suggest an active learning strategy based on a state-of-the-art deep learning approach; the proposed method enables us to nearly halve the amount of required training data.

    Then, we extend this domain-independent information extraction approach with the task of coreference resolution, which aims to identify mentions that refer to the same concept or entity. Baseline experiments on our corpus showed that current state-of-the-art approaches for coreference resolution perform poorly on scientific text. Therefore, we propose a sequential transfer learning approach that exploits annotated datasets from non-academic domains.
    Our experimental results demonstrate that our approach noticeably outperforms the state-of-the-art baselines. Additionally, we investigate the impact of coreference resolution on KG population and show that it has only a small effect on the number of resulting concepts in the KG but improves the KG's quality significantly. Consequently, using our domain-independent information extraction approach, we populate an RKG from 55,485 abstracts of the ten investigated STM domains. We show that every domain mainly uses its own terminology and that the populated RKG contains useful concepts.

    Moreover, we propose a novel approach for the task of citation recommendation, which can help researchers improve the quality of their work by finding or recommending relevant related work. Our approach exploits RKGs that interlink research papers based on the scientific concepts they mention. Using our automatically populated RKG, we demonstrate that combining information from RKGs with existing state-of-the-art approaches is beneficial. Finally, we conclude the thesis and sketch possible directions for future work.
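    The active learning result (nearly halving the required training data) rests on the standard pool-based loop: train a model, score the unlabelled pool by model uncertainty, have annotators label the most uncertain examples, and retrain. The sketch below illustrates that loop generically with scikit-learn on toy data; it is a simplified stand-in, not the thesis's actual deep learning setup.

        # Generic pool-based active learning with uncertainty sampling.
        # Classifier, data, and budget are placeholders for illustration.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        X_pool = rng.normal(size=(1000, 20))       # unlabelled pool (toy data)
        y_pool = (X_pool[:, 0] > 0).astype(int)    # oracle labels (simulated annotator)

        labelled = list(rng.choice(len(X_pool), 10, replace=False))  # small seed set
        for _ in range(5):                          # five annotation rounds
            clf = LogisticRegression().fit(X_pool[labelled], y_pool[labelled])
            proba = clf.predict_proba(X_pool)[:, 1]
            uncertainty = -np.abs(proba - 0.5)      # closest to 0.5 = least certain
            uncertainty[labelled] = -np.inf         # never re-query labelled items
            query = np.argsort(uncertainty)[-20:]   # the 20 most uncertain examples
            labelled.extend(query.tolist())         # "annotate" them and retrain

        print(f"labelled {len(labelled)} of {len(X_pool)} examples")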

    Computer Science Named Entity Recognition in the Open Research Knowledge Graph

    Domain-specific named entity recognition (NER) on Computer Science (CS) scholarly articles is an information extraction task that is arguably more challenging than NER in the general domain, owing to the varied annotation aims that can hamper it, and it has been studied less. Given the significant progress that has been made on NER in general, we anticipate that scholarly domain-specific NER will receive increasing attention in the years to come. Currently, progress on CS NER, the focus of this work, is hampered in part by its recency and by the lack of standardized annotation aims for scientific entities and terms. Directly addressing these issues, this work proposes a standardized task by defining a set of seven contribution-centric scholarly entity types for CS NER, viz. research problem, solution, resource, language, tool, method, and dataset.
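    The proposed label set is small enough to show in full. The Python snippet below lists the seven entity types and tags an invented example sentence with character-offset spans; the sentence and its annotations are illustrative only, not drawn from the paper's corpus.

        # The seven contribution-centric entity types proposed for CS NER,
        # shown on an invented example sentence (character-offset spans).
        CS_NER_TYPES = ["research problem", "solution", "resource",
                        "language", "tool", "method", "dataset"]

        sentence = "We address question answering over SQuAD using BERT fine-tuning."
        spans = [
            (11, 29, "research problem"),  # "question answering"
            (35, 40, "dataset"),           # "SQuAD"
            (47, 63, "method"),            # "BERT fine-tuning"
        ]
        for start, end, label in spans:
            assert label in CS_NER_TYPES
            print(f"{sentence[start:end]!r} -> {label}")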

    Modeling sample variables with an Experimental Factor Ontology

    Motivation: Describing biological sample variables with ontologies is complex due to the cross-domain nature of experiments. Ontologies provide annotation solutions; however, for cross-domain investigations, multiple ontologies are needed to represent the data. These are subject to rapid change, are often not interoperable, and present complexities that are a barrier to biological resource users.

    The BioPrompt-box: an ontology-based clustering tool for searching in biological databases

    Background: High-throughput molecular biology provides new data at an incredible rate, so biological databanks are growing enormously and very rapidly. This scenario creates severe problems not only at indexing time, where suitable algorithmic techniques for data indexing and retrieval are required, but also at query time, since a user query may produce such a large set of results that browsing and understanding them becomes humanly impractical. This problem is well known to the Web community, where a new generation of Web search engines, such as Vivisimo, is being developed. These tools organise the results of a user query on the fly into a hierarchy of labeled folders that eases browsing and knowledge extraction. We investigate this approach on biological data and propose the BioPrompt-box software system, which deploys ontology-driven clustering strategies to make biologists' searches more efficient and effective.

    Results: The BioPrompt-box (Bpb) defines a document as a biological sequence plus its associated meta-data taken from the underlying databank, such as references to ontologies or to external databanks, and plain text in the form of researchers' comments and the titles, abstracts, or even bodies of papers. Bpb offers several tools to customize the search and the clustering process over its indexed documents. The user can search for a set of keywords within a specific field of the document schema, or can execute Blast to find documents related to homologous sequences. In both cases the search returns a set of documents (hits) which constitute the answer to the user query. Since the number of hits may be large, Bpb clusters them into groups of homogeneous content, organized as a hierarchy of labeled clusters. The user can choose among several ontology-based hierarchical clustering strategies, each offering a different view of the returned hits. Bpb computes these views by exploiting the meta-data present in the retrieved documents, such as references to the Gene Ontology, the taxonomy lineage, the organism, and the keywords. The approach is flexible enough to leave room for future additions of other meta-information. The ultimate goal of the clustering process is to provide the user with several different readings of the (possibly numerous) query results and to expose hidden correlations among them, thus improving their browsing and understanding.

    Conclusion: Bpb is a powerful search engine that makes it easy to perform complex queries over the indexed databanks (currently only UNIPROT is considered). The ontology-based clustering approach is efficient and effective, and could thus be applied successfully to larger databanks, such as GenBank or EMBL.
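    The core clustering idea, folding query hits into a hierarchy of labelled folders according to their ontology metadata, can be sketched independently of Bpb itself. In the Python sketch below, the hit records and their GO-like lineages are invented placeholders.

        # Sketch of ontology-driven grouping of search hits: each hit carries
        # an ontology lineage in its metadata, and hits are folded into a
        # hierarchy of labelled clusters. All records here are invented.
        from collections import defaultdict

        def tree():
            return defaultdict(tree)

        hits = [  # (hit id, ontology lineage from root to leaf)
            ("P1", ["biological_process", "metabolic process", "lipid metabolism"]),
            ("P2", ["biological_process", "metabolic process", "glycolysis"]),
            ("P3", ["biological_process", "signal transduction"]),
        ]

        root = tree()
        for hit_id, lineage in hits:
            node = root
            for label in lineage:          # descend, creating folders as needed
                node = node[label]
            node.setdefault("_hits", []).append(hit_id)

        def show(node, depth=0):
            # Print the hierarchy of labelled folders with their hits.
            for label, child in node.items():
                if label == "_hits":
                    print("  " * depth + f"hits: {child}")
                else:
                    print("  " * depth + label)
                    show(child, depth + 1)

        show(root)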

    Technological taxonomies for hypernym and hyponym retrieval in patent texts

    This paper presents an automatic approach to creating taxonomies of technical terms based on the Cooperative Patent Classification (CPC). The resulting taxonomy contains about 170k nodes in 9 separate technological branches and is freely available. We also show that a Text-to-Text Transfer Transformer (T5) model can be fine-tuned to generate hypernyms and hyponyms with relatively high precision, confirming the manually assessed quality of the resource. The T5 model opens the taxonomy to any new technological term for which a hypernym can be generated, thus making the resource updateable with new terms, an essential feature for the constantly evolving field of technological terminology.
    Comment: ToTh 2022 - Terminology & Ontology: Theories and applications, Jun 2022, ChambĂ©ry, France
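    A T5 model fine-tuned for this task is an ordinary sequence-to-sequence model: a term is fed in with a task prefix and the hypernym is decoded. Below is a minimal inference sketch with the Hugging Face transformers library; the "t5-small" checkpoint and the "hypernym:" prompt format are assumptions for illustration, not the paper's published settings.

        # Minimal seq2seq inference sketch for hypernym generation with T5.
        # "t5-small" and the "hypernym:" prefix are placeholders; the paper's
        # actual fine-tuned checkpoint and prompt format may differ.
        from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("t5-small")
        model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

        inputs = tokenizer("hypernym: lithium-ion battery", return_tensors="pt")
        outputs = model.generate(**inputs, num_beams=5, max_new_tokens=10)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))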
    • 

    corecore