295 research outputs found
Automated Deductive Content Analysis of Text: A Deep Contrastive and Active Learning Based Approach
Content analysis traditionally involves human coders manually combing through text documents to search for relevant concepts and categories. However, this approach is time-intensive and does not scale, particularly for secondary data such as social media content, news articles, or corporate reports. To address this problem, the paper presents an automated framework called Automated Deductive Content Analysis of Text (ADCAT) that combines deep learning-based semantic techniques, an ontology of validated construct measures, large language models, human-in-the-loop disambiguation, and a novel augmentation-based weighted contrastive learning approach for improved language representations into a scalable approach to deductive content analysis. We demonstrate the effectiveness of the proposed approach by identifying firm innovation strategies from 10-K reports, obtaining inferences reasonably close to human coding.
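The augmentation-based weighted contrastive idea can be sketched as a generic weighted InfoNCE-style loss in NumPy. This is not ADCAT's exact formulation; the embeddings, weights, and temperature below are illustrative assumptions.

```python
import numpy as np

def weighted_contrastive_loss(anchors, positives, weights, temperature=0.1):
    # L2-normalise embeddings so dot products become cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sim = a @ p.T / temperature                 # (batch, batch) similarities
    # row-wise log-softmax; the diagonal holds each anchor's true positive
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # weight each pair's negative log-likelihood, e.g. by augmentation quality
    return float(-(weights * np.diag(log_prob)).mean())

rng = np.random.default_rng(0)
texts = rng.normal(size=(4, 8))                     # stand-in text embeddings
augmented = texts + 0.01 * rng.normal(size=(4, 8))  # lightly augmented views
loss = weighted_contrastive_loss(texts, augmented, weights=np.ones(4))
```

The loss pulls each text embedding toward its augmented view and pushes it away from other texts in the batch, with per-pair weights allowing some augmentations to count more than others.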
Hybrid intelligent framework for automated medical learning
This paper investigates automated medical learning and proposes a hybrid intelligent framework called Hybrid Automated Medical Learning (HAML). The goal is to combine several intelligent components efficiently in order to learn from medical data automatically. A multi-agent system is proposed that uses distributed deep learning and a knowledge graph for learning from medical data: distributed deep learning enables efficient learning across the different agents in the system, while the knowledge graph handles heterogeneous medical data. To demonstrate the usefulness and accuracy of the HAML framework, intensive simulations on medical data were conducted, with a wide range of experiments verifying the efficiency of the proposed system. Three case studies are discussed in this research. The first concerns process mining, and more precisely the ability of HAML to detect relevant patterns in medical event data. The second concerns smart buildings and the ability of HAML to recognize the different activities of patients. The third concerns medical image retrieval and the ability of HAML to find the most relevant medical images for an image query. The results show that HAML achieves good performance compared to the most up-to-date medical learning models regarding both computational cost and the quality of returned solutions.
Query-Time Data Integration
Today, data is collected in ever increasing scale and variety, opening up enormous potential for new insights and data-centric products. However, in many cases the volume and heterogeneity of new data sources precludes up-front integration using traditional ETL processes and data warehouses. In some cases, it is even unclear if and in what context the collected data will be utilized. Therefore, there is a need for agile methods that defer the effort of integration until the usage context is established.
This thesis introduces Query-Time Data Integration as an alternative concept to traditional up-front integration. It aims at enabling users to issue ad-hoc queries on their own data as if all potential other data sources were already integrated, without declaring specific sources and mappings to use. Automated data search and integration methods are then coupled directly with query processing on the available data. The ambiguity and uncertainty introduced through fully automated retrieval and mapping methods is compensated by answering those queries with ranked lists of alternative results. Each result is then based on different data sources or query interpretations, allowing users to pick the result most suitable to their information need.
To this end, this thesis makes three main contributions. Firstly, we introduce a novel method for Top-k Entity Augmentation, which constructs a top-k list of consistent integration results from a large corpus of heterogeneous data sources. It improves on the state of the art by producing a set of individually consistent but mutually diverse alternative solutions, while minimizing the number of data sources used. Secondly, based on this novel augmentation method, we introduce the DrillBeyond system, which can process Open World SQL queries, i.e., queries referencing arbitrary attributes not defined in the queried database. The original database is then augmented at query time with Web data sources providing those attributes. Its hybrid augmentation/relational query processing enables the use of ad-hoc data search and integration in data analysis queries, and improves both performance and quality compared to using separate systems for the two tasks. Finally, we study the management of large-scale dataset corpora such as data lakes or Open Data platforms, which serve as data sources for our augmentation methods. We introduce Publish-time Data Integration as a new technique for data curation systems managing such corpora, which aims at improving the individual reusability of datasets without requiring up-front global integration. This is achieved by automatically generating metadata and format recommendations, allowing publishers to enhance their datasets with minimal effort.
Collectively, these three contributions form the foundation of a Query-time Data Integration architecture that enables ad-hoc data search and integration queries over large heterogeneous dataset collections.
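As a rough illustration of the Top-k Entity Augmentation idea (not the thesis's actual algorithm), the sketch below builds k alternative results that each cover all queried entities with as few sources as possible, penalising reuse of sources across results to keep them mutually diverse. All source names and values are hypothetical.

```python
from itertools import combinations

def topk_augmentations(entities, candidates, k=3):
    """Greedy sketch: `candidates` maps source -> {entity: value}.
    Each result covers every entity; successive results avoid repeating
    source combinations and prefer sources not used before."""
    results, used_sources, seen = [], set(), set()
    entities = set(entities)
    for _ in range(k):
        best = None
        # try source subsets of increasing size (tiny-scale set cover)
        for size in range(1, len(candidates) + 1):
            for combo in combinations(sorted(candidates), size):
                if combo in seen:
                    continue
                covered = set().union(*(candidates[s] for s in combo))
                if not entities <= covered:
                    continue
                overlap = len(set(combo) & used_sources)  # penalise reuse
                key = (size, overlap, combo)
                if best is None or key < best:
                    best = key
            if best is not None:
                break  # fewer sources beats diversity
        if best is None:
            break
        combo = best[2]
        seen.add(combo)
        used_sources |= set(combo)
        row = {e: next(candidates[s][e] for s in combo if e in candidates[s])
               for e in entities}
        results.append((combo, row))
    return results

# Hypothetical Web sources giving a "population density" attribute
cands = {"A": {"Berlin": 3.6, "Paris": 2.1},
         "B": {"Berlin": 3.7},
         "C": {"Berlin": 3.65, "Paris": 2.2}}
res = topk_augmentations(["Berlin", "Paris"], cands, k=2)
```

Each returned result records which sources it drew from, so a user can pick the alternative whose provenance best matches their information need.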
An End-to-end Neural Natural Language Interface for Databases
The ability to extract insights from new data sets is critical for decision
making. Visual interactive tools play an important role in data exploration
since they provide non-technical users with an effective way to visually
compose queries and comprehend the results. Natural language has recently
gained traction as an alternative query interface to databases with the
potential to enable non-expert users to formulate complex questions and
information needs efficiently and effectively. However, understanding natural
language questions and translating them accurately to SQL is a challenging
task, and thus Natural Language Interfaces for Databases (NLIDBs) have not yet
made their way into practical tools and commercial products.
In this paper, we present DBPal, a novel data exploration tool with a natural
language interface. DBPal leverages recent advances in deep models to make
query understanding more robust in the following ways: First, DBPal uses a deep
model to translate natural language statements to SQL, making the translation
process more robust to paraphrasing and other linguistic variations. Second, to
support the users in phrasing questions without knowing the database schema and
the query features, DBPal provides a learned auto-completion model that
suggests partial query extensions to users during query formulation and thus
helps them write complex queries.
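DBPal's translator is a learned sequence-to-sequence model; purely to illustrate the NL-to-SQL task itself, here is a toy rule-based stand-in (the patterns and schema names are invented for illustration, not DBPal's approach or schema):

```python
import re

# Each pattern maps a question shape to a SQL template
PATTERNS = [
    (re.compile(r"how many (\w+)", re.I),
     lambda m: f"SELECT COUNT(*) FROM {m.group(1)}"),
    (re.compile(r"show all (\w+) where (\w+) is (\w+)", re.I),
     lambda m: f"SELECT * FROM {m.group(1)} WHERE {m.group(2)} = '{m.group(3)}'"),
]

def to_sql(question):
    """Return the first matching SQL translation, or None."""
    for pattern, build in PATTERNS:
        m = pattern.search(question)
        if m:
            return build(m)
    return None

sql = to_sql("How many patients are there?")
```

A learned translator replaces the brittle pattern table with a model that tolerates paraphrasing, which is exactly the robustness gap the paper targets.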
Knowledge-augmented Graph Machine Learning for Drug Discovery: A Survey from Precision to Interpretability
The integration of Artificial Intelligence (AI) into the field of drug
discovery has been a growing area of interdisciplinary scientific research.
However, conventional AI models are heavily limited in handling complex
biomedical structures (such as 2D or 3D protein and molecule structures) and
providing interpretations for outputs, which hinders their practical
application. As of late, Graph Machine Learning (GML) has gained considerable
attention for its exceptional ability to model graph-structured biomedical data
and investigate their properties and functional relationships. Despite
extensive efforts, GML methods still suffer from several deficiencies, such as
the limited ability to handle supervision sparsity and provide interpretability
in learning and inference processes, and their ineffectiveness in utilising
relevant domain knowledge. In response, recent studies have proposed
integrating external biomedical knowledge into the GML pipeline to realise more
precise and interpretable drug discovery with limited training instances.
However, a systematic definition for this burgeoning research direction is yet
to be established. This survey presents a comprehensive overview of
long-standing drug discovery principles, provides the foundational concepts and
cutting-edge techniques for graph-structured data and knowledge databases, and
formally summarises Knowledge-augmented Graph Machine Learning (KaGML) for drug
discovery. A thorough review of related KaGML works, collected following a
carefully designed search methodology, is organised into four categories
following a newly defined taxonomy. To facilitate research in this rapidly
emerging field, we also share collected practical resources that are valuable
for intelligent drug discovery and provide an in-depth discussion of the
potential avenues for future advancements.
Active Learning for Reducing Labeling Effort in Text Classification Tasks
Labeling data can be an expensive task as it is usually performed manually by
domain experts. This is cumbersome for deep learning, as it is dependent on
large labeled datasets. Active learning (AL) is a paradigm that aims to reduce
labeling effort by only using the data which the used model deems most
informative. Little research has been done on AL in a text classification
setting and next to none has involved the more recent, state-of-the-art Natural
Language Processing (NLP) models. Here, we present an empirical study that
compares different uncertainty-based algorithms with BERT as the used
classifier. We evaluate the algorithms on two NLP classification datasets:
Stanford Sentiment Treebank and KvK-Frontpages. Additionally, we explore
heuristics that aim to solve presupposed problems of uncertainty-based AL;
namely, that it is unscalable and that it is prone to selecting outliers.
Furthermore, we explore the influence of the query-pool size on the performance
of AL. While the proposed heuristics did not improve the performance of AL,
our results show that using uncertainty-based AL with BERT outperforms random
sampling of data, although this difference in performance can decrease as the
query-pool size gets larger.
Comment: Accepted as a conference paper at the joint 33rd Benelux Conference
on Artificial Intelligence and the 30th Belgian Dutch Conference on Machine
Learning (BNAIC/BENELEARN 2021). This camera-ready version submitted to
BNAIC/BENELEARN adds several improvements, including a more thorough
discussion of related work and an extended discussion section. 28 pages
including references and appendices.
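The core step of the uncertainty-based AL the paper evaluates can be sketched as follows, with a fixed table of class probabilities standing in for a fine-tuned BERT classifier (the texts and probabilities are invented):

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_batch(unlabeled, predict_proba, batch_size):
    # Rank unlabeled texts by predictive entropy, most uncertain first,
    # and send the top batch to the human annotators
    ranked = sorted(unlabeled, key=lambda t: entropy(predict_proba(t)),
                    reverse=True)
    return ranked[:batch_size]

# Hypothetical model outputs standing in for a trained classifier
pool = {
    "great movie, loved it": [0.92, 0.08],
    "it was fine, i guess": [0.55, 0.45],   # least confident prediction
    "absolutely terrible": [0.97, 0.03],
}
picked = select_batch(list(pool), lambda text: pool[text], batch_size=1)
```

In a full AL loop this selection, labeling, and retraining cycle repeats until the labeling budget is exhausted; the paper's query-pool size corresponds to how many candidates are scored per round.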
Dwelling on ontology - semantic reasoning over topographic maps
The thesis builds upon the hypothesis that the spatial arrangement of topographic
features, such as buildings, roads and other land cover parcels, indicates how land is
used. The aim is to make this kind of high-level semantic information explicit within
topographic data. There is an increasing need to share and use data for a wider range of
purposes, and to make data more definitive, intelligent and accessible. Unfortunately,
we still encounter a gap between low-level data representations and high-level concepts
that typify human qualitative spatial reasoning. The thesis adopts an ontological
approach to bridge this gap and to derive functional information by using standard
reasoning mechanisms offered by logic-based knowledge representation formalisms. It
formulates a framework for the processes involved in interpreting land use information
from topographic maps. Land use is a high-level abstract concept, but it is also an
observable fact intimately tied to geography. By decomposing this relationship, the
thesis establishes a one-to-one mapping between high-level conceptualisations
established from human knowledge and real-world entities represented in the data.
Based on a middle-out approach, it develops a conceptual model that incrementally
links different levels of detail, and thereby derives coarser, more meaningful
descriptions from more detailed ones. The thesis verifies its proposed ideas by
implementing an ontology describing the land use ‘residential area’ in the ontology
editor Protégé. By asserting knowledge about high-level concepts such as types of
dwellings, urban blocks and residential districts as well as individuals that link directly
to topographic features stored in the database, the reasoner successfully infers instances
of the defined classes. Despite current technological limitations, ontologies are a
promising way forward in the manner we handle and integrate geographic data,
especially with respect to how humans conceptualise geographic space.
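A minimal sketch of the kind of inference the ontology performs, with plain Python standing in for the OWL reasoner (the building types and threshold are illustrative assumptions, not the thesis's class definitions):

```python
# Asserted knowledge: which topographic feature types count as dwellings
DWELLING_TYPES = {"detached", "semi-detached", "terraced", "flat"}

def is_residential_block(buildings, threshold=0.5):
    """An urban block is inferred residential when more than `threshold`
    of its buildings are dwellings -- a defined class the reasoner would
    derive from the asserted building types."""
    dwellings = sum(b in DWELLING_TYPES for b in buildings)
    return dwellings / len(buildings) > threshold

# Features linked to database records for one urban block
block = ["detached", "terraced", "shop", "flat"]
residential = is_residential_block(block)
```

The OWL version expresses the same rule declaratively as a class definition, so the reasoner, rather than hand-written code, classifies each block and, by composition, each residential district.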
Form-ing institutional order: the scaffolding of lists and identifiers
This paper examines the central place of the list and the associated concept of an identifier within the scaffolding of contemporary institutional order. These terms are deliberately chosen to make strange and help unpack the constitutive capacity of information systems and information technology within and between contemporary organisations. We draw upon the substantial body of work by John Searle to help understand the place of lists and identifiers in the constitution of institutional order. To ground our discussion of the potential and problems associated with lists, we describe a number of significant instances of list-making, situated particularly around the use of identifiers to refer to people, places and products. The theorisation developed allows us to better explain not only the significance imbued within lists and identifiers but also the key part they play in form-ing the institutional order. We also hint at the role such symbolic artefacts play in breakdowns in institutional order.