10 research outputs found

    Synthetic data generation in finance: requirements, challenges and applicability

    Get PDF
    Financial datasets possess susceptible, private and identifiable details about clients. The usage and distribution of such data for research outside a financial institution are strictly constrained due to privacy laws. One option to deal with this restriction is creating artificial data. The generation of fake data protects the confidentiality of customers' information. Data privacy is a prime concern in public opinion. This research study reviews various requirements and challenges for data generative techniques and handling synthetic data in finance

    Two-Stage Fine-Tuning: A Novel Strategy for Learning Class-Imbalanced Data

    Full text link
    Classification on long-tailed distributed data is a challenging problem, which suffers from serious class-imbalance and hence poor performance on tail classes with only a few samples. Owing to this paucity of samples, learning on the tail classes is especially challenging for the fine-tuning when transferring a pretrained model to a downstream task. In this work, we present a simple modification of standard fine-tuning to cope with these challenges. Specifically, we propose a two-stage fine-tuning: we first fine-tune the final layer of the pretrained model with class-balanced reweighting loss, and then we perform the standard fine-tuning. Our modification has several benefits: (1) it leverages pretrained representations by only fine-tuning a small portion of the model parameters while keeping the rest untouched; (2) it allows the model to learn an initial representation of the specific task; and importantly (3) it protects the learning of tail classes from being at a disadvantage during the model updating. We conduct extensive experiments on synthetic datasets of both two-class and multi-class tasks of text classification as well as a real-world application to ADME (i.e., absorption, distribution, metabolism, and excretion) semantic labeling. The experimental results show that the proposed two-stage fine-tuning outperforms both fine-tuning with conventional loss and fine-tuning with a reweighting loss on the above datasets.Comment: 20 pages, 6 figure

    Semi-supervised Optimal Transport with Self-paced Ensemble for Cross-hospital Sepsis Early Detection

    Full text link
    The utilization of computer technology to solve problems in medical scenarios has attracted considerable attention in recent years, which still has great potential and space for exploration. Among them, machine learning has been widely used in the prediction, diagnosis and even treatment of Sepsis. However, state-of-the-art methods require large amounts of labeled medical data for supervised learning. In real-world applications, the lack of labeled data will cause enormous obstacles if one hospital wants to deploy a new Sepsis detection system. Different from the supervised learning setting, we need to use known information (e.g., from another hospital with rich labeled data) to help build a model with acceptable performance, i.e., transfer learning. In this paper, we propose a semi-supervised optimal transport with self-paced ensemble framework for Sepsis early detection, called SPSSOT, to transfer knowledge from the other that has rich labeled data. In SPSSOT, we first extract the same clinical indicators from the source domain (e.g., hospital with rich labeled data) and the target domain (e.g., hospital with little labeled data), then we combine the semi-supervised domain adaptation based on optimal transport theory with self-paced under-sampling to avoid a negative transfer possibly caused by covariate shift and class imbalance. On the whole, SPSSOT is an end-to-end transfer learning method for Sepsis early detection which can automatically select suitable samples from two domains respectively according to the number of iterations and align feature space of two domains. Extensive experiments on two open clinical datasets demonstrate that comparing with other methods, our proposed SPSSOT, can significantly improve the AUC values with only 1% labeled data in the target domain in two transfer learning scenarios, MIMIC rightarrowrightarrow Challenge and Challenge rightarrowrightarrow MIMIC.Comment: 14 pages, 9 figure

    LCASPMDA: a computational model for predicting potential microbe-drug associations based on learnable graph convolutional attention networks and self-paced iterative sampling ensemble

    Get PDF
    IntroductionNumerous studies show that microbes in the human body are very closely linked to the human host and can affect the human host by modulating the efficacy and toxicity of drugs. However, discovering potential microbe-drug associations through traditional wet labs is expensive and time-consuming, hence, it is important and necessary to develop effective computational models to detect possible microbe-drug associations.MethodsIn this manuscript, we proposed a new prediction model named LCASPMDA by combining the learnable graph convolutional attention network and the self-paced iterative sampling ensemble strategy to infer latent microbe-drug associations. In LCASPMDA, we first constructed a heterogeneous network based on newly downloaded known microbe-drug associations. Then, we adopted the learnable graph convolutional attention network to learn the hidden features of nodes in the heterogeneous network. After that, we utilized the self-paced iterative sampling ensemble strategy to select the most informative negative samples to train the Multi-Layer Perceptron classifier and put the newly-extracted hidden features into the trained MLP classifier to infer possible microbe-drug associations.Results and discussionIntensive experimental results on two different public databases including the MDAD and the aBiofilm showed that LCASPMDA could achieve better performance than state-of-the-art baseline methods in microbe-drug association prediction

    Survey on highly imbalanced multi-class data

    Get PDF
    Machine learning technology has a massive impact on society because it offers solutions to solve many complicated problems like classification, clustering analysis, and predictions, especially during the COVID-19 pandemic. Data distribution in machine learning has been an essential aspect in providing unbiased solutions. From the earliest literatures published on highly imbalanced data until recently, machine learning research has focused mostly on binary classification data problems. Research on highly imbalanced multi-class data is still greatly unexplored when the need for better analysis and predictions in handling Big Data is required. This study focuses on reviews related to the models or techniques in handling highly imbalanced multi-class data, along with their strengths and weaknesses and related domains. Furthermore, the paper uses the statistical method to explore a case study with a severely imbalanced dataset. This article aims to (1) understand the trend of highly imbalanced multi-class data through analysis of related literatures; (2) analyze the previous and current methods of handling highly imbalanced multi-class data; (3) construct a framework of highly imbalanced multi-class data. The chosen highly imbalanced multi-class dataset analysis will also be performed and adapted to the current methods or techniques in machine learning, followed by discussions on open challenges and the future direction of highly imbalanced multi-class data. Finally, for highly imbalanced multi-class data, this paper presents a novel framework. We hope this research can provide insights on the potential development of better methods or techniques to handle and manipulate highly imbalanced multi-class data

    A Data-driven Approach to Large Knowledge Graph Matching

    Get PDF
    In the last decade, a remarkable number of open Knowledge Graphs (KGs) were developed, such as DBpedia, NELL, and YAGO. While some of such KGs are curated via crowdsourcing platforms, others are semi-automatically constructed. This has resulted in a significant degree of semantic heterogeneity and overlapping facts. KGs are highly complementary; thus, mapping them can benefit intelligent applications that require integrating different KGs such as recommendation systems, query answering, and semantic web navigation. Although the problem of ontology matching has been investigated and a significant number of systems have been developed, the challenges of mapping large-scale KGs remain significant. KG matching has been a topic of interest in the Semantic Web community since it has been introduced to the Ontology Alignment Evaluation Initiative (OAEI) in 2018. Nonetheless, a major limitation of the current benchmarks is their lack of representation of real-world KGs. This work also highlights a number of limitations with current matching methods, such as: (i) they are highly dependent on string-based similarity measures, and (ii) they are primarily built to handle well-formed ontologies. These features make them unsuitable for large, (semi/fully) automatically constructed KGs with hundreds of classes and millions of instances. Another limitation of current work is the lack of benchmark datasets that represent the challenging task of matching real-world KGs. This work addresses the limitation of the current datasets by first introducing two gold standard datasets for matching the schema of large, automatically constructed, less-well-structured KGs based on common KGs such as NELL, DBpedia, and Wikidata. We believe that the datasets which we make public in this work make the largest domain-independent benchmarks for matching KG classes. As many state-of-the-art methods are not suitable for matching large-scale and cross-domain KGs that often suffer from highly imbalanced class distribution, recent studies have revisited instance-based matching techniques in addressing this task. This is because such large KGs often lack a well-defined structure and descriptive metadata about their classes, but contain numerous class instances. Therefore, inspired by the role of instances in KGs, we propose a hybrid matching approach. Our method composes an instance-based matcher that casts the schema-matching process as a text classification task by exploiting instances of KG classes, and a string-based matcher. Our method is domain-independent and is able to handle KG classes with imbalanced populations. Further, we show that incorporating an instance-based approach with the appropriate data balancing strategy results in significant results in matching large and common KG classes
    corecore