894 research outputs found
Improving Cross-Lingual Transfer Learning for Event Detection
The widespread adoption of applications powered by Artificial Intelligence (AI) backbones has unquestionably changed the way we interact with the world around us. Applications such as automated personal assistants, automatic question answering, and machine translation systems have become mainstays of modern culture thanks to recent considerable advances in Natural Language Processing (NLP) research. Nonetheless, with over 7000 spoken languages in the world, a considerable number of marginalized communities remain unable to benefit from these technological advancements, largely because of the language they speak. Cross-Lingual Learning (CLL) looks to address this issue by transferring the knowledge acquired from a popular, high-resource source language (e.g., English, Chinese, or Spanish) to a less favored, lower-resourced target language (e.g., Urdu or Swahili). This dissertation leverages the Event Detection (ED) sub-task of Information Extraction (IE) as a testbed and presents three novel approaches that improve cross-lingual transfer learning from distinct perspectives: (1) direct knowledge transfer, (2) hybrid knowledge transfer, and (3) few-shot learning.
Unifying context with labeled property graph: A pipeline-based system for comprehensive text representation in NLP
Extracting valuable insights from vast amounts of unstructured digital text presents significant challenges across diverse domains. This research addresses this challenge by proposing a novel pipeline-based system that generates domain-agnostic and task-agnostic text representations. The proposed approach leverages labeled property graphs (LPG) to encode contextual information, facilitating the integration of diverse linguistic elements into a unified representation. The proposed system enables efficient graph-based querying and manipulation by addressing the crucial aspect of comprehensive context modeling and fine-grained semantics. The effectiveness of the proposed system is demonstrated through the implementation of NLP components that operate on LPG-based representations. Additionally, the proposed approach introduces specialized patterns and algorithms to enhance specific NLP tasks, including nominal mention detection, named entity disambiguation, event enrichments, event participant detection, and temporal link detection. The evaluation of the proposed approach, using the MEANTIME corpus comprising manually annotated documents, provides encouraging results and valuable insights into the system's strengths. The proposed pipeline-based framework serves as a solid foundation for future research, aiming to refine and optimize LPG-based graph structures to generate comprehensive and semantically rich text representations, addressing the challenges associated with efficient information extraction and analysis in NLP.
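The abstract stays at a conceptual level; as a rough, hypothetical sketch (the class, node labels, and relation names below are invented for illustration, not taken from the paper), a labeled property graph that lets linguistic layers such as tokens, entities, and events coexist in one structure might look like this:

```python
# Minimal labeled-property-graph sketch: every node and edge carries a
# label plus a free-form property map, so different linguistic layers
# can be stored and queried uniformly.

class LPG:
    def __init__(self):
        self.nodes = {}   # id -> {"label": str, "props": dict}
        self.edges = []   # (src, relation, dst, props)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def add_edge(self, src, rel, dst, **props):
        self.edges.append((src, rel, dst, props))

    def match(self, rel):
        """Return (src, dst) pairs connected by the given edge label."""
        return [(s, d) for s, r, d, _ in self.edges if r == rel]

g = LPG()
g.add_node("t1", "Token", text="Rome", pos="PROPN")
g.add_node("e1", "Entity", kind="LOC", canonical="Rome")
g.add_node("ev1", "Event", trigger="visited")
g.add_edge("t1", "MENTIONS", "e1")
g.add_edge("ev1", "HAS_PARTICIPANT", "e1", role="destination")

print(g.match("HAS_PARTICIPANT"))  # [('ev1', 'e1')]
```

In a production setting this role would typically be filled by a graph database with a property-graph query language rather than in-memory dictionaries.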
Computational reproducibility of Jupyter notebooks from biomedical publications
Jupyter notebooks facilitate the bundling of executable code with its
documentation and output in one interactive environment, and they represent a
popular mechanism to document and share computational workflows. The
reproducibility of computational aspects of research is a key component of
scientific reproducibility but has not yet been assessed at scale for Jupyter
notebooks associated with biomedical publications. We address computational
reproducibility at two levels: First, using fully automated workflows, we
analyzed the computational reproducibility of Jupyter notebooks related to
publications indexed in PubMed Central. We identified such notebooks by mining
the articles' full text, locating them on GitHub, and re-running them in an
environment as close to the original as possible. We documented reproduction
success and exceptions and explored relationships between notebook
reproducibility and variables related to the notebooks or publications. Second,
this study represents a reproducibility attempt in and of itself, using
essentially the same methodology twice on PubMed Central over two years. Out of
27271 notebooks from 2660 GitHub repositories associated with 3467 articles,
22578 notebooks were written in Python, including 15817 that had their
dependencies declared in standard requirement files and that we attempted to
re-run automatically. For 10388 of these, all declared dependencies could be
installed successfully, and we re-ran them to assess reproducibility. Of these,
1203 notebooks ran through without any errors, including 879 that produced
results identical to those reported in the original notebook and 324 for which
our results differed from the originally reported ones. Running the other
notebooks resulted in exceptions. We zoom in on common problems, highlight
trends and discuss potential improvements to Jupyter-related workflows
associated with biomedical publications.
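The counts reported above form a funnel, and the stage-to-stage yields can be checked directly from the abstract's own numbers:

```python
# Reproducibility funnel built from the counts stated in the abstract.
funnel = [
    ("notebooks found", 27271),
    ("written in Python", 22578),
    ("dependencies declared", 15817),
    ("dependencies installable", 10388),
    ("ran without errors", 1203),
    ("results identical", 879),
]

# Print each stage's yield relative to the previous stage.
for (prev_name, prev), (name, count) in zip(funnel, funnel[1:]):
    print(f"{name}: {count} ({100 * count / prev:.1f}% of '{prev_name}')")

# End-to-end yield: notebooks reproducing identical results.
print(f"overall: {100 * 879 / 27271:.1f}%")
```

The 1203 error-free runs split exactly into 879 identical and 324 differing results, and only about 3% of all discovered notebooks reproduced their original output end to end.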
Method versatility in analysing human attitudes towards technology
Various research domains are facing new challenges brought about by growing volumes of data. To make optimal use of them, and to increase the reproducibility of research findings, method versatility is required. Method versatility is the ability to flexibly apply widely varying data analytic methods depending on the study goal and the dataset characteristics.
Method versatility is an essential characteristic of data science, but in other areas of research, such as educational science or psychology, its importance is yet to be fully accepted. Versatile methods can enrich the repertoire of specialists who validate psychometric instruments, conduct data analysis of large-scale educational surveys, and communicate their findings to the academic community, which corresponds to three stages of the research cycle: measurement, research per se, and communication. In this thesis, studies related to these stages have a common theme of human attitudes towards technology, as this topic becomes vitally important in our age of ever-increasing digitization.
The thesis is based on four studies, in which method versatility is introduced in four different ways: the consecutive use of methods, the toolbox choice, the simultaneous use, and the range extension. In the first study, different methods of psychometric analysis are used consecutively to reassess psychometric properties of a recently developed scale measuring affinity for technology interaction. In the second, the random forest algorithm and hierarchical linear modeling, as tools from the machine learning and statistical toolboxes, are applied to data analysis of a large-scale educational survey related to students’ attitudes to information and communication technology. In the third, the challenge of selecting the number of clusters in model-based clustering is addressed by the simultaneous use of model-fit, cluster-separation, and partition-stability criteria, so that generalizable, separable clusters can be selected in the data related to teachers’ attitudes towards technology. The fourth reports the development and evaluation of a scholarly knowledge graph-powered dashboard aimed at extending the range of scholarly communication means.
The findings of the thesis can be helpful for increasing method versatility in various research areas. They can also facilitate methodological advancement of academic training in data analysis and aid further development of scholarly communication in accordance with open science principles.
A Historical Interaction between Artificial Intelligence and Philosophy
This paper delves into the historical and philosophical dimensions of AI development while highlighting the symbiotic relationship between philosophy and AI from a technological perspective: philosophy furnishes foundational concepts, and AI supplies practical tools. The paper posits neurosymbolic AI as a solution to present challenges, sparking discussions encompassing both technical and philosophical considerations. Advocating a multidisciplinary approach, it calls for merging empirical AI insights with philosophy and cognitive science to enrich our comprehension of intelligence and propel AI forward.
The Knowledge Graph Construction in the Educational Domain: Take an Australian School Science Course as an Example
The evolution of Internet technology and artificial intelligence has changed the ways we gain knowledge, which has expanded to every aspect of our lives. In recent years, Knowledge Graph technology, as one of the techniques of artificial intelligence, has been widely used in the educational domain. However, few studies have been dedicated to constructing knowledge graphs for K-10 education in Australia; most existing studies focus only on the theory level, and little research shows practical pipeline steps for completing the complex flow of constructing an educational knowledge graph. Apart from that, most studies focused on concept entities and their relations but ignored the features of concept entities and the relations between learning knowledge points and required learning outcomes. To overcome these shortcomings and provide the data foundation for the development of downstream research and applications in this educational domain, the construction process of building a knowledge graph for Australian K-10 education was analyzed at the theory level and implemented in a practical way in this research. We took the Year 9 science course as a typical data-source example fed to the proposed method, called K10EDU-RCF-KG, to construct this educational knowledge graph and to enrich the features of its entities. In the construction pipeline, a variety of techniques were employed to complete the building process. Firstly, the POI and OCR techniques were applied to convert Word- and PDF-format files into text, followed by the development of an educational resources management platform in which the machine-readable text could be stored in a relational database management system. Secondly, we designed an architecture framework to guide the construction pipeline.
According to this architecture, the educational ontology was first designed, and a backend microservice was developed to perform entity extraction and relation extraction with NLP-NER and probabilistic association rule mining algorithms, respectively. We also adopted the NLP-POS technique to identify the adjectives adjacent to entities in order to enrich the features of these concept entities. In addition, a subject dictionary was introduced during the refinement of the knowledge graph, which reduced the noise rate of the knowledge graph entities. Furthermore, learning outcome entities and topic knowledge point entities were directly connected, providing a clear and efficient way to identify which learning objectives relate to a given learning unit. Finally, a set of REST APIs for querying this educational knowledge graph was developed.
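The abstract names probabilistic association rule mining for relation extraction without algorithmic detail; the toy sketch below (invented lesson data, textbook support/confidence definitions, not the authors' implementation) illustrates the kind of co-occurrence statistic such a step relies on:

```python
# Toy association-rule pass: treat each lesson's set of mentioned
# concepts as a transaction and score candidate relations between
# concept pairs by support and confidence (standard definitions).
from itertools import permutations

lessons = [
    {"atom", "element", "periodic table"},
    {"atom", "element"},
    {"force", "motion"},
    {"atom", "periodic table"},
]

def support(a, b, transactions):
    """Fraction of transactions containing both concepts."""
    return sum(1 for t in transactions if a in t and b in t) / len(transactions)

def confidence(a, b, transactions):
    """Of the transactions containing a, the fraction that also contain b."""
    with_a = [t for t in transactions if a in t]
    return sum(1 for t in with_a if b in t) / len(with_a)

concepts = set().union(*lessons)
rules = [
    (a, b, support(a, b, lessons), confidence(a, b, lessons))
    for a, b in permutations(concepts, 2)
    if support(a, b, lessons) >= 0.5
]
for a, b, s, c in sorted(rules, key=lambda r: -r[3]):
    print(f"{a} -> {b}: support={s:.2f}, confidence={c:.2f}")
```

High-confidence pairs (e.g., every lesson mentioning "element" also mentions "atom") would then be promoted to candidate relations between concept entities in the graph.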
RECAP-KG: Mining Knowledge Graphs from Raw GP Notes for Remote COVID-19 Assessment in Primary Care
Clinical decision-making is a fundamental stage in delivering appropriate
care to patients. In recent years several decision-making systems designed to
aid the clinician in this process have been developed. However, technical
solutions currently in use are based on simple regression models and are only
able to take into account simple pre-defined multiple-choice features, such as
patient age, pre-existing conditions, smoker status, etc. One particular source
of patient data that available decision-making systems are incapable of
processing is the collection of patient consultation GP notes. These contain
crucial signs and symptoms - the information used by clinicians in order to
make a final decision and direct the patient to the appropriate care.
Extracting information from GP notes is a technically challenging problem, as
they tend to include abbreviations, typos, and incomplete sentences.
This paper addresses this open challenge. We present a framework that
performs knowledge graph construction from raw GP medical notes written during
or after patient consultations. By relying on support phrases mined from the
SNOMED ontology, as well as predefined supported facts from values used in the
RECAP (REmote COVID-19 Assessment in Primary Care) patient risk prediction
tool, our graph generative framework is able to extract structured knowledge
graphs from the highly unstructured and inconsistent format that consultation
notes are written in. Our knowledge graphs include information about existing
patient symptoms, their duration, and their severity.
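The framework itself is not reproduced here; as a loose illustration of the general idea, matching a raw note against a small hand-made phrase list and emitting triples might be sketched as follows (the phrase dictionary, relation names, and note are invented for illustration, not drawn from SNOMED or the RECAP tool):

```python
import re

# Hypothetical support phrases mapping surface forms (including an
# abbreviation) to canonical symptom concepts; the real framework
# mines such phrases from the SNOMED ontology.
SYMPTOM_PHRASES = {
    "sob": "shortness of breath",
    "shortness of breath": "shortness of breath",
    "cough": "cough",
    "fever": "fever",
}
DURATION = re.compile(r"(\d+)\s*(?:d|days?)")

def note_to_triples(patient_id, note):
    """Extract (patient, relation, value) triples from a raw note."""
    text = note.lower()
    triples = []
    for phrase, concept in SYMPTOM_PHRASES.items():
        if phrase in text:
            triples.append((patient_id, "HAS_SYMPTOM", concept))
    m = DURATION.search(text)
    if m:
        triples.append((patient_id, "DURATION_DAYS", int(m.group(1))))
    return sorted(set(triples), key=str)

print(note_to_triples("p1", "Pt c/o cough + sob x 3 days"))
```

A real system must also handle negation ("no fever"), typos, and incomplete sentences, which is precisely why the abstract calls raw GP notes a technically challenging source.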
We apply our framework to consultation notes of COVID-19 patients in the UK
COVID-19 Clinical Assessment Service (CCAS) patient dataset. We provide a
quantitative evaluation of the performance of our framework, demonstrating that
our approach has better accuracy than traditional NLP methods when answering
questions about patients.
Mutually-paced Knowledge Distillation for Cross-lingual Temporal Knowledge Graph Reasoning
This paper investigates the cross-lingual temporal knowledge graph reasoning
problem, which aims to facilitate reasoning on Temporal Knowledge Graphs (TKGs)
in low-resource languages by transferring knowledge from TKGs in high-resource
ones. The cross-lingual distillation ability across TKGs becomes increasingly
crucial, in light of the unsatisfactory performance of existing reasoning methods
on those severely incomplete TKGs, especially in low-resource languages.
However, it poses tremendous challenges in two aspects. First, the
cross-lingual alignments, which serve as bridges for knowledge transfer, are
usually too scarce to transfer sufficient knowledge between two TKGs. Second,
temporal knowledge discrepancy of the aligned entities, especially when
alignments are unreliable, can mislead the knowledge distillation process. We
correspondingly propose a mutually-paced knowledge distillation model MP-KD,
where a teacher network trained on a source TKG can guide the training of a
student network on target TKGs with an alignment module. Concretely, to deal
with the scarcity issue, MP-KD generates pseudo alignments between TKGs based
on the temporal information extracted by our representation module. To maximize
the efficacy of knowledge transfer and control the noise caused by the temporal
knowledge discrepancy, we enhance MP-KD with a temporal cross-lingual attention
mechanism to dynamically estimate the alignment strength. The two procedures
are mutually paced along with model training. Extensive experiments on twelve
cross-lingual TKG transfer tasks in the EventKG benchmark demonstrate the
effectiveness of the proposed MP-KD method.
Comment: This paper is accepted by The Web Conference 202
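The pseudo-alignment step is described only at a high level; the sketch below illustrates one plausible reading of the idea, generating alignments as mutual nearest neighbours above a cosine-similarity threshold (toy entity vectors, not the paper's representation module or attention mechanism):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def pseudo_alignments(src, tgt, threshold=0.9):
    """Mutual-nearest-neighbour entity pairs whose similarity clears the threshold."""
    pairs = []
    for s_id, s_vec in src.items():
        best_t = max(tgt, key=lambda t: cosine(s_vec, tgt[t]))
        best_s = max(src, key=lambda s: cosine(src[s], tgt[best_t]))
        if best_s == s_id and cosine(s_vec, tgt[best_t]) >= threshold:
            pairs.append((s_id, best_t))
    return pairs

# Toy embeddings for a high-resource and a low-resource TKG.
src = {"en:Paris": [1.0, 0.1], "en:Tokyo": [0.1, 1.0]}
tgt = {"fr:Paris": [0.9, 0.2], "fr:Tokyo": [0.2, 0.9]}
print(pseudo_alignments(src, tgt))  # [('en:Paris', 'fr:Paris'), ('en:Tokyo', 'fr:Tokyo')]
```

In MP-KD the analogous pairs are additionally weighted by a temporal cross-lingual attention mechanism, so unreliable alignments contribute less to the distillation signal.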