Reconstruction of three-dimensional porous media using generative adversarial neural networks
To evaluate the variability of multi-phase flow properties of porous media at
the pore scale, it is necessary to acquire a number of representative samples
of the void-solid structure. While modern X-ray computed tomography has made it
possible to extract three-dimensional images of the pore space, assessment of
the variability in the inherent material properties is often experimentally not
feasible. We present a novel method to reconstruct the solid-void structure of
porous media by applying a generative neural network that allows an implicit
description of the probability distribution represented by three-dimensional
image datasets. We show, by using an adversarial learning approach for neural
networks, that this method of unsupervised learning is able to generate
representative samples of porous media that honor their statistics. We
successfully compare measures of pore morphology, such as the Euler
characteristic, two-point statistics and directional single-phase permeability
of synthetic realizations with the calculated properties of a bead pack, Berea
sandstone, and Ketton limestone. Results show that GANs can be used to
reconstruct high-resolution three-dimensional images of porous media at
different scales that are representative of the morphology of the images used
to train the neural network. The fully convolutional nature of the trained
neural network allows the generation of large samples while maintaining
computational efficiency. Compared to classical stochastic methods of image
reconstruction, the implicit representation of the learned data distribution
can be stored and reused to generate multiple realizations of the pore
structure very rapidly.
Comment: 21 pages, 20 figures
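The two-point statistics named in the abstract above can be illustrated with a minimal sketch. It assumes a binary 2D image stored as nested lists (1 = pore, 0 = solid) and estimates the two-point probability function only along the row direction; the function name `two_point_probability` and the toy image are hypothetical and are not taken from the paper, which works with three-dimensional images.

```python
def two_point_probability(image, max_lag):
    """Estimate S2(r) along rows: the probability that two pixels
    separated by r columns are both pore (value 1)."""
    s2 = []
    for r in range(max_lag + 1):
        hits = 0
        pairs = 0
        for row in image:
            # Slide a pair of points (x, x + r) across the row.
            for x in range(len(row) - r):
                pairs += 1
                hits += row[x] * row[x + r]
        s2.append(hits / pairs)
    return s2


# Toy striped image (assumed data): S2(0) equals the porosity.
toy_image = [[1, 0, 1, 0],
             [1, 0, 1, 0]]
print(two_point_probability(toy_image, max_lag=2))
```

Comparing such curves for training images and synthetic realizations is one way to check whether generated samples honor the statistics of the originals.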
Granular computing based approach of rule learning for binary classification
Rule learning is one of the most popular types of machine-learning approaches and typically follows one of two main strategies: ‘divide and conquer’ and ‘separate and conquer’. The former strategy induces rules in the form of a decision tree, whereas the latter induces if–then rules directly. Because the divide and conquer strategy can result in the replicated sub-tree problem, which not only leads to overfitting but also increases the computational complexity of classifying unseen instances, researchers have been motivated to develop rule learning approaches based on the separate and conquer strategy. In this paper, we focus on the Prism algorithm, since it is a representative algorithm that follows the separate and conquer strategy and is aimed at learning a set of rules for each class in the setting of granular computing, where each class (referred to as the target class) is viewed as a granule. The Prism algorithm shows performance highly comparable to that of the most popular algorithms, such as ID3 and C4.5, which follow the divide and conquer strategy. However, due to the need to learn a rule set for each class, Prism usually produces very complex rule-based classifiers. In real applications, many problems involve only one target class, so it is not necessary to learn a rule set for each class; only a set of rules for the target class needs to be learned, and a default rule is used to indicate the case of non-target classes. To address the above issues of Prism, we propose a new version of the algorithm referred to as PrismSTC, where ‘STC’ stands for ‘single target class’. Our experimental results show that PrismSTC produces simpler rule-based classifiers without loss of accuracy in comparison with Prism. PrismSTC also demonstrates sufficiently good performance compared with C4.5.
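The separate and conquer idea for a single target class can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' PrismSTC implementation: attributes are categorical, every row has the same attributes, each rule is grown greedily by adding the attribute-value term with the highest target precision, and the names `learn_target_rules`, `matches`, and `predict` are hypothetical.

```python
def matches(rule, row):
    """A rule is a dict of attribute -> required value."""
    return all(row[attr] == value for attr, value in rule.items())


def learn_target_rules(rows, labels, target):
    """Learn if-then rules for the target class only (separate and conquer)."""
    remaining = list(zip(rows, labels))
    rules = []
    # Separate: keep learning rules while target instances remain uncovered.
    while any(y == target for _, y in remaining):
        rule, covered = {}, remaining
        # Conquer: specialize the rule until it covers only target instances.
        while any(y != target for _, y in covered):
            best = None  # (precision, positives, attribute, value)
            for row, _ in covered:
                for attr, value in row.items():
                    if attr in rule:
                        continue
                    subset = [y for r, y in covered if r[attr] == value]
                    positives = sum(1 for y in subset if y == target)
                    candidate = (positives / len(subset), positives, attr, value)
                    if best is None or candidate[:2] > best[:2]:
                        best = candidate
            if best is None:
                break  # attributes exhausted; the rule stays impure
            _, _, attr, value = best
            rule[attr] = value
            covered = [(r, y) for r, y in covered if r[attr] == value]
        rules.append(rule)
        remaining = [(r, y) for r, y in remaining if not matches(rule, r)]
    return rules


def predict(rules, row, target, default="other"):
    """Default rule: anything not matched by a target rule is non-target."""
    return target if any(matches(rule, row) for rule in rules) else default
```

Because rules are learned for one class only, the resulting classifier is a short rule list plus a default, which is the source of the simplification described above.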
On the Relation Between Mobile Encounters and Web Traffic Patterns: A Data-driven Study
Mobility and network traffic have been traditionally studied separately.
Their interaction is vital for future generations of mobile services and
effective caching, but has not been studied in depth with real-world big data.
In this paper, we characterize mobility encounters and study the correlation
between encounters and web traffic profiles using large-scale datasets (30TB in
size) of WiFi and NetFlow traces. The analysis quantifies these correlations
for the first time, across spatio-temporal dimensions, for device types grouped
into on-the-go Flutes and sit-to-use Cellos. The results consistently show a
clear relation between mobility encounters and traffic across different
buildings over multiple days, with encountered pairs showing higher traffic
similarity than non-encountered pairs, and long encounters being associated
with the highest similarity. We also investigate the feasibility of learning
encounters through web traffic profiles, with implications for dissemination
protocols and contact tracing. This provides a compelling case to integrate
both mobility and web traffic dimensions in future models, not only at an
individual level, but also at pairwise and collective levels. We have released
samples of code and data used in this study on GitHub, to support
reproducibility and encourage further research
(https://github.com/BabakAp/encounter-traffic).
Comment: Technical report with details for conference paper at MSWiM 2018, v3
adds GitHub link
Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization
Protecting vast quantities of data poses a daunting challenge for the growing
number of organizations that collect, stockpile, and monetize it. The ability
to distinguish data that is actually needed from data collected "just in case"
would help these organizations to limit the latter's exposure to attack. A
natural approach might be to monitor data use and retain only the working-set
of in-use data in accessible storage; unused data can be evicted to a highly
protected store. However, many of today's big data applications rely on machine
learning (ML) workloads that are periodically retrained by accessing, and thus
exposing to attack, the entire data store. Training set minimization methods,
such as count featurization, are often used to limit the data needed to train
ML workloads to improve performance or scalability. We present Pyramid, a
limited-exposure data management system that builds upon count featurization to
enhance data protection. As such, Pyramid uniquely introduces both the idea and
proof-of-concept for leveraging training set minimization methods to instill
rigor and selectivity into big data management. We integrated Pyramid into
Spark Velox, a framework for ML-based targeting and personalization. We
evaluate it on three applications and show that Pyramid approaches
state-of-the-art models while training on less than 1% of the raw data.
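Count featurization, on which the abstract above says Pyramid builds, can be sketched in a few lines. This is a generic illustration assuming binary labels (0 or 1) and a single categorical feature; it does not reproduce Pyramid's actual design, and the names `build_count_table` and `featurize` as well as the smoothing parameters `prior` and `strength` are hypothetical.

```python
from collections import defaultdict


def build_count_table(values, labels):
    """Map each feature value to [negative count, positive count]."""
    table = defaultdict(lambda: [0, 0])
    for value, label in zip(values, labels):
        table[value][label] += 1
    return table


def featurize(value, table, prior=0.5, strength=1.0):
    """Replace a raw value with a smoothed estimate of P(label = 1 | value).
    Unseen values fall back to the prior."""
    negatives, positives = table.get(value, (0, 0))
    return (positives + prior * strength) / (negatives + positives + strength)


# Toy usage (assumed data): the count table summarizes the history, so a
# model can be trained on featurized values instead of the raw records.
table = build_count_table(["a", "a", "b", "a"], [1, 1, 0, 0])
print(featurize("a", table), featurize("b", table), featurize("c", table))
```

The point of the sketch is that the count table is a compact aggregate: once it exists, training touches the table and a small recent window rather than the full raw store.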
Semi-supervised Learning with Deterministic Labeling and Large Margin Projection
The centrality and diversity of the labeled data strongly influence the
performance of semi-supervised learning (SSL), but most SSL models select the
labeled data randomly. This study first constructs a leading forest that forms a
partially ordered topological space in an unsupervised way, and then selects a
group of the most representative samples to label in one shot (which differs
essentially from active learning) using the property of homeomorphism. Then a
kernelized large margin metric is efficiently learned from the selected data to
classify the remaining unlabeled samples. The optimal leading forest (OLF) has
been observed to have the advantage of revealing the difference evolution along
a path within a subtree. Therefore, we formulate an optimization problem based
on OLF to select the samples. OLF also facilitates the learning of multiple
local metrics to address multi-modal and mix-modal problems in SSL, especially
when the number of classes is large. Owing to this novel design, the stability
and accuracy of the performance are significantly improved compared with
state-of-the-art graph SSL methods. Extensive experimental studies have
shown that the proposed method achieves encouraging accuracy and efficiency.
Code has been made available at https://github.com/alanxuji/DeLaLA.
Comment: 12 pages, ready to submit to a journal
Aggregation of classifiers: a justifiable information granularity approach.
In this paper, we introduce a new approach to combining multiple classifiers in a heterogeneous ensemble system. Instead of using numerical membership values when combining, we construct interval membership values for each class prediction from the meta-data of each observation by using the concept of an information granule. In the proposed method, the uncertainty (diversity) of the predictions produced by the base classifiers is quantified by the interval-based information granules. The decision model is then generated by considering both the bounds and the lengths of the intervals. Extensive experimentation using the UCI datasets has demonstrated the superior performance of our algorithm over other algorithms, including six fixed combining methods, one trainable combining method, AdaBoost, bagging, and random subspace.
Combining heterogeneous classifiers via granular prototypes.
In this study, a novel framework to combine multiple classifiers in an ensemble system is introduced. Here we exploit the concept of an information granule to construct granular prototypes for each class on the outputs of an ensemble of base classifiers. In the proposed method, uncertainty in the outputs of the base classifiers on training observations is captured by an interval-based representation. To predict the class label for a new observation, we first determine the distances between the output of the base classifiers for this observation and the class prototypes; the predicted class label is then obtained by choosing the label associated with the shortest distance. In the experimental study, we combine several learning algorithms to build the ensemble system and conduct experiments on the UCI, colon cancer, and selected CLEF2009 datasets. The experimental results demonstrate that the proposed framework outperforms several benchmarked algorithms, including two trainable combining methods (Decision Template and Two Stages Ensemble System), AdaBoost, Random Forest, L2-loss Linear Support Vector Machine, and Decision Tree.
Nearest Labelset Using Double Distances for Multi-label Classification
Multi-label classification is a type of supervised learning where an instance
may be associated with multiple labels simultaneously. Predicting each label
independently has been criticized for not exploiting any correlation between
labels. In this paper we propose a novel approach, Nearest Labelset using
Double Distances (NLDD), that predicts the labelset observed in the training
data that minimizes a weighted sum of the distances in both the feature space
and the label space to the new instance. The weights specify the relative
tradeoff between the two distances. The weights are estimated from a binomial
regression of the number of misclassified labels as a function of the two
distances. Model parameters are estimated by maximum likelihood. NLDD only
considers labelsets observed in the training data, thus implicitly taking into
account label dependencies. Experiments on benchmark multi-label data sets show
that the proposed method on average outperforms other well-known approaches in
terms of Hamming loss, 0/1 loss, and multi-label accuracy, and ranks second
after ECC on the F-measure.
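The selection rule described in the abstract above, choosing the training labelset that minimizes a weighted sum of feature-space and label-space distances, can be sketched as follows. Several things here are assumptions rather than details from the abstract: Euclidean distance is used in both spaces, the new instance comes with a vector of per-label scores (for example from independently trained binary classifiers), and the weights `w_feature` and `w_label` are passed in as fixed numbers instead of being estimated by binomial regression. The function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def nearest_labelset(x_new, label_scores, train_x, train_y,
                     w_feature=1.0, w_label=1.0):
    """Return the training labelset with the smallest weighted sum of
    feature-space and label-space distances to the new instance."""
    best_labels, best_cost = None, math.inf
    for x_i, y_i in zip(train_x, train_y):
        cost = (w_feature * euclidean(x_new, x_i)
                + w_label * euclidean(label_scores, y_i))
        if cost < best_cost:
            best_labels, best_cost = y_i, cost
    return best_labels


# Toy usage (assumed data): three training instances with three labels each.
train_x = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
train_y = [(1, 0, 0), (1, 1, 0), (0, 0, 1)]
print(nearest_labelset((0.9, 0.9), (0.9, 0.8, 0.1), train_x, train_y))
```

Since only labelsets seen in training can be returned, label combinations that never co-occur are never predicted, which is how the approach implicitly accounts for label dependencies.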