28,617 research outputs found
Hellinger Distance Trees for Imbalanced Streams
Classifiers trained on data sets possessing an imbalanced class distribution
are known to exhibit poor generalisation performance. This is known as the
imbalanced learning problem. The problem becomes particularly acute when we
consider incremental classifiers operating on imbalanced data streams,
especially when the learning objective is rare class identification. As
accuracy may provide a misleading impression of performance on imbalanced data,
existing stream classifiers based on accuracy can suffer poor minority class
performance on imbalanced streams, with the result being low minority class
recall rates. In this paper we address this deficiency by proposing the use of
the Hellinger distance measure, as a very fast decision tree split criterion.
We demonstrate that by using Hellinger a statistically significant improvement
in recall rates on imbalanced data streams can be achieved, with an acceptable
increase in the false positive rate.Comment: 6 Pages, 2 figures, to be published in Proceedings 22nd International
Conference on Pattern Recognition (ICPR) 201
Towards automatic pulmonary nodule management in lung cancer screening with deep learning
The introduction of lung cancer screening programs will produce an
unprecedented amount of chest CT scans in the near future, which radiologists
will have to read in order to decide on a patient follow-up strategy. According
to the current guidelines, the workup of screen-detected nodules strongly
relies on nodule size and nodule type. In this paper, we present a deep
learning system based on multi-stream multi-scale convolutional networks, which
automatically classifies all nodule types relevant for nodule workup. The
system processes raw CT data containing a nodule without the need for any
additional information such as nodule segmentation or nodule size and learns a
representation of 3D data by analyzing an arbitrary number of 2D views of a
given nodule. The deep learning system was trained with data from the Italian
MILD screening trial and validated on an independent set of data from the
Danish DLCST screening trial. We analyze the advantage of processing nodules at
multiple scales with a multi-stream convolutional network architecture, and we
show that the proposed deep learning system achieves performance at classifying
nodule type that surpasses the one of classical machine learning approaches and
is within the inter-observer variability among four experienced human
observers.Comment: Published on Scientific Report
On Practical Algorithms for Entropy Estimation and the Improved Sample Complexity of Compressed Counting
Estimating the p-th frequency moment of data stream is a very heavily studied
problem. The problem is actually trivial when p = 1, assuming the strict
Turnstile model. The sample complexity of our proposed algorithm is essentially
O(1) near p=1. This is a very large improvement over the previously believed
O(1/eps^2) bound. The proposed algorithm makes the long-standing problem of
entropy estimation an easy task, as verified by the experiments included in the
appendix
Estimating Entropy of Data Streams Using Compressed Counting
The Shannon entropy is a widely used summary statistic, for example, network
traffic measurement, anomaly detection, neural computations, spike trains, etc.
This study focuses on estimating Shannon entropy of data streams. It is known
that Shannon entropy can be approximated by Reenyi entropy or Tsallis entropy,
which are both functions of the p-th frequency moments and approach Shannon
entropy as p->1.
Compressed Counting (CC) is a new method for approximating the p-th frequency
moments of data streams. Our contributions include:
1) We prove that Renyi entropy is (much) better than Tsallis entropy for
approximating Shannon entropy.
2) We propose the optimal quantile estimator for CC, which considerably
improves the previous estimators.
3) Our experiments demonstrate that CC is indeed highly effective
approximating the moments and entropies. We also demonstrate the crucial
importance of utilizing the variance-bias trade-off
When Two Choices Are not Enough: Balancing at Scale in Distributed Stream Processing
Carefully balancing load in distributed stream processing systems has a
fundamental impact on execution latency and throughput. Load balancing is
challenging because real-world workloads are skewed: some tuples in the stream
are associated to keys which are significantly more frequent than others. Skew
is remarkably more problematic in large deployments: more workers implies fewer
keys per worker, so it becomes harder to "average out" the cost of hot keys
with cold keys.
We propose a novel load balancing technique that uses a heaving hitter
algorithm to efficiently identify the hottest keys in the stream. These hot
keys are assigned to choices to ensure a balanced load, where is
tuned automatically to minimize the memory and computation cost of operator
replication. The technique works online and does not require the use of routing
tables. Our extensive evaluation shows that our technique can balance
real-world workloads on large deployments, and improve throughput and latency
by and respectively over the previous
state-of-the-art when deployed on Apache Storm.Comment: 12 pages, 14 Figures, this paper is accepted and will be published at
ICDE 201
Geoadditive Regression Modeling of Stream Biological Condition
Indices of biotic integrity (IBI) have become an established tool to quantify the condition of small non-tidal streams and their watersheds. To investigate the effects of watershed characteristics on stream biological condition, we present a new technique for regressing IBIs on watershed-specific explanatory variables. Since IBIs are typically evaluated on anordinal scale, our method is based on the proportional odds model for ordinal outcomes. To avoid overfitting, we do not use classical maximum likelihood estimation but a component-wise functional gradient boosting approach. Because component-wise gradient boosting has an intrinsic mechanism for variable selection and model choice, determinants of biotic integrity can be identified. In addition, the method offers a relatively simple way to account for spatial correlation in ecological data. An analysis of the Maryland Biological Streams Survey shows that nonlinear effects of predictor variables on stream condition can be quantified while, in addition, accurate predictions of biological condition at unsurveyed locations are obtained
- …