4,957 research outputs found
Including Physics in Deep Learning -- An example from 4D seismic pressure saturation inversion
Geoscience data often have to rely on strong priors in the face of
uncertainty. Additionally, we often try to detect or model anomalous sparse
data that can appear as an outlier in machine learning models. These are
classic examples of imbalanced learning. Approaching these problems can benefit
from including prior information from physics models or transforming data to a
beneficial domain. We show an example of including physical information in the
architecture of a neural network as prior information. We go on to present
noise injection at training time to successfully transfer the network from
synthetic data to field data.
Comment: 5 pages, 5 figures, workshop, extended abstract, EAGE 2019 Workshop
Programme, European Association of Geoscientists and Engineers
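As a rough illustration of noise injection at training time, the sketch below adds zero-mean Gaussian noise to each input batch before it would be passed to a network. The function name `inject_gaussian_noise`, the noise level `sigma`, and the toy batch are assumptions made for this example; the abstract does not state which noise model or amplitude the authors used.

```python
import random


def inject_gaussian_noise(batch, sigma, rng):
    """Return a copy of `batch` with zero-mean Gaussian noise added to every value.

    `batch` is a list of rows (lists of floats). The input is left unmodified,
    so the clean synthetic data can be reused with fresh noise every epoch.
    """
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in batch]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
clean = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]  # stand-in for synthetic training data

# Hypothetical noise level; in practice it would be tuned to the field data.
noisy = inject_gaussian_noise(clean, sigma=0.1, rng=rng)
print(noisy)
```

Drawing fresh noise for every batch, rather than perturbing the data set once, means the network never sees exactly the same input twice, which is the usual motivation for this kind of regularization.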
An empirical evaluation of imbalanced data strategies from a practitioner's point of view
This research tested the following well known strategies to deal with binary
imbalanced data on 82 different real life data sets (sampled to imbalance rates
of 5%, 3%, 1%, and 0.1%): class weight, SMOTE, Underbagging, and a baseline
(just the base classifier). As base classifiers we used SVM with RBF kernel,
random forests, and gradient boosting machines and we measured the quality of
the resulting classifier using 6 different metrics (Area under the curve,
Accuracy, F-measure, G-mean, Matthew's correlation coefficient and Balanced
accuracy). The best strategy strongly depends on the metric used to measure the
quality of the classifier. For AUC and accuracy class weight and the baseline
perform better; for F-measure and MCC, SMOTE performs better; and for G-mean
and balanced accuracy, underbagging
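To make the dependence on the metric concrete, the sketch below computes five of the six listed metrics from a single binary confusion matrix (AUC is omitted because it needs ranked scores rather than counts). The counts are invented for illustration and are not taken from the study; G-mean is taken here as the geometric mean of sensitivity and specificity, which is an assumption about the definition the authors used.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from binary confusion-matrix counts.

    Assumes every denominator is non-zero (at least one predicted positive,
    one actual positive, and one actual negative).
    """
    recall = tp / (tp + fn)        # sensitivity, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f_measure": 2 * precision * recall / (precision + recall),
        "g_mean": math.sqrt(recall * specificity),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Hypothetical classifier on 1,000 samples with a 1% positive rate.
scores = imbalance_metrics(tp=5, fp=10, tn=980, fn=5)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy is high because the majority class dominates, while the other metrics are noticeably lower; this is one way the choice of metric can change which strategy looks best.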
Learning to Auto Weight: Entirely Data-driven and Highly Efficient Weighting Framework
Example weighting is an effective solution to the training bias problem.
However, most previous methods rely on human knowledge and require laborious
tuning of hyperparameters. In this paper, we
propose a novel example weighting framework called Learning to Auto Weight
(LAW). The proposed framework finds step-dependent weighting policies
adaptively, and can be jointly trained with target networks without any
assumptions or prior knowledge about the dataset. It consists of three key
components: Stage-based Searching Strategy (3SM) is adopted to shrink the huge
searching space in a complete training process; Duplicate Network Reward (DNR)
gives more accurate supervision by removing randomness during the searching
process; Full Data Update (FDU) further improves the updating efficiency.
Experimental results demonstrate the superiority of the weighting policy
explored by LAW over the standard training pipeline. Compared with baselines,
LAW can find a better weighting schedule that achieves superior accuracy on
both biased CIFAR and ImageNet.
Comment: Accepted by AAAI 202
OhioState at SemEval-2018 Task 7: Exploiting Data Augmentation for Relation Classification in Scientific Papers using Piecewise Convolutional Neural Networks
We describe our system for SemEval-2018 Shared Task on Semantic Relation
Extraction and Classification in Scientific Papers where we focus on the
Classification task. Our simple piecewise convolution neural encoder performs
decently in an end to end manner. A simple inter-task data augmentation
significantly boosts the performance of the model. Our best-performing
systems ranked 8th out of 20 teams on the classification task on noisy data and
12th out of 28 teams on the classification task on clean data.
Comment: To appear in Proceedings of International Workshop on Semantic
Evaluation (SemEval-2018)
Towards Data-centric Graph Machine Learning: Review and Outlook
Data-centric AI, with its primary focus on the collection, management, and
utilization of data to drive AI models and applications, has attracted
increasing attention in recent years. In this article, we conduct an in-depth
and comprehensive review, offering a forward-looking outlook on the current
efforts in data-centric AI pertaining to graph data, the fundamental data
structure for representing and capturing intricate dependencies among massive
and diverse real-life entities. We introduce a systematic framework,
Data-centric Graph Machine Learning (DC-GML), that encompasses all stages of
the graph data lifecycle, including graph data collection, exploration,
improvement, exploitation, and maintenance. A thorough taxonomy of each stage
is presented to answer three critical graph-centric questions: (1) how to
enhance graph data availability and quality; (2) how to learn from graph data
of limited availability and low quality; (3) how to build graph MLOps systems
from the graph data-centric view. Lastly, we pinpoint the future prospects of
the DC-GML domain, providing insights to navigate its advancements and
applications.
Comment: 42 pages, 9 figures
Tackling Diverse Minorities in Imbalanced Classification
Imbalanced datasets are commonly observed in various real-world applications,
presenting significant challenges in training classifiers. When working with
large datasets, the imbalance issue can be further exacerbated, making it
exceptionally difficult to train classifiers effectively. To address the
problem, over-sampling techniques have been developed that linearly interpolate
data instances between minorities and their neighbors. However, in many
real-world scenarios such as anomaly detection, minority instances are often
dispersed diversely in the feature space rather than clustered together.
Inspired by domain-agnostic data mix-up, we propose generating synthetic
samples iteratively by mixing data samples from both minority and majority
classes. Developing such a framework is non-trivial; the challenges include
source sample selection, mix-up strategy selection, and the coordination
between the underlying model and mix-up strategies. To tackle these challenges,
we formulate the problem of iterative data mix-up as a Markov decision process
(MDP) that maps data attributes onto an augmentation strategy. To solve the
MDP, we employ an actor-critic framework to adapt the discrete-continuous
decision space. This framework is utilized to train a data augmentation policy
and design a reward signal that explores classifier uncertainty and encourages
performance improvement, irrespective of the classifier's convergence. We
demonstrate the effectiveness of our proposed framework through extensive
experiments conducted on seven publicly available benchmark datasets using
three different types of classifiers. The results of these experiments showcase
the potential and promise of our framework in addressing imbalanced datasets
with diverse minorities
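The difference between neighbor-based over-sampling and mixing across classes can be sketched with one interpolation helper. This is a minimal illustration only: the sample vectors and the mixing coefficient `lam` are hypothetical, and the coefficient is fixed by hand here, whereas the abstract describes learning the mix-up strategy with an actor-critic policy.

```python
import random


def interpolate(a, b, lam):
    """Return the convex combination lam * a + (1 - lam) * b of two feature vectors."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


rng = random.Random(0)
minority = [1.0, 2.0]           # hypothetical minority-class sample
minority_neighbor = [2.0, 4.0]  # hypothetical nearby minority sample
majority = [5.0, 0.0]           # hypothetical majority-class sample

# Neighbor-based over-sampling: a random point on the segment between a
# minority sample and one of its minority neighbors.
oversampled = interpolate(minority, minority_neighbor, rng.random())

# Mix-up across classes: blend a minority sample with a majority sample.
# lam = 0.8 keeps the synthetic point closer to the minority sample.
lam = 0.8
mixed = interpolate(minority, majority, lam)

print(oversampled)
print(mixed)
```

When minority samples are scattered rather than clustered, the first variant can only generate points between existing minority samples, while the second can place synthetic points in regions between the classes.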
Beyond Volume: The Impact of Complex Healthcare Data on the Machine Learning Pipeline
From medical charts to national census, healthcare has traditionally operated
under a paper-based paradigm. However, the past decade has marked a long and
arduous transformation bringing healthcare into the digital age. Ranging from
electronic health records, to digitized imaging and laboratory reports, to
public health datasets, healthcare now generates an incredible amount of
digital information. Such a wealth of data presents an exciting opportunity for
integrated machine learning solutions to address problems across multiple
facets of healthcare practice and administration. Unfortunately, the ability to
derive accurate and informative insights requires more than the ability to
execute machine learning models. Rather, a deeper understanding of the data on
which the models are run is imperative for their success. While a significant
effort has been undertaken to develop models able to process the volume of data
obtained during the analysis of millions of digitalized patient records, it is
important to remember that volume represents only one aspect of the data. In
fact, drawing on data from an increasingly diverse set of sources, healthcare
data presents an incredibly complex set of attributes that must be accounted
for throughout the machine learning pipeline. This chapter focuses on
highlighting such challenges, and is broken down into three distinct
components, each representing a phase of the pipeline. We begin with attributes
of the data accounted for during preprocessing, then move to considerations
during model building, and end with challenges to the interpretation of model
output. For each component, we present a discussion around data as it relates
to the healthcare domain and offer insight into the challenges each may impose
on the efficiency of machine learning techniques.
Comment: Healthcare Informatics, Machine Learning, Knowledge Discovery: 20
Pages, 1 Figure