Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization
Protecting vast quantities of data poses a daunting challenge for the growing
number of organizations that collect, stockpile, and monetize it. The ability
to distinguish data that is actually needed from data collected "just in case"
would help these organizations to limit the latter's exposure to attack. A
natural approach might be to monitor data use and retain only the working-set
of in-use data in accessible storage; unused data can be evicted to a highly
protected store. However, many of today's big data applications rely on machine
learning (ML) workloads that are periodically retrained by accessing, and thus
exposing to attack, the entire data store. Training set minimization methods,
such as count featurization, are often used to limit the data needed to train
ML workloads to improve performance or scalability. We present Pyramid, a
limited-exposure data management system that builds upon count featurization to
enhance data protection. As such, Pyramid uniquely introduces both the idea and
proof-of-concept for leveraging training set minimization methods to instill
rigor and selectivity into big data management. We integrated Pyramid into
Spark Velox, a framework for ML-based targeting and personalization. We
evaluate it on three applications and show that Pyramid approaches
state-of-the-art models while training on less than 1% of the raw data.
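To illustrate the count featurization idea mentioned in the abstract above, here is a minimal sketch in Python. It is not Pyramid's implementation: the function names `build_count_table` and `featurize`, the binary-label setting, and the Laplace smoothing are all assumptions made for illustration. The sketch replaces each categorical feature value with a smoothed estimate of how often that value co-occurred with the positive label, so a model can be trained from the aggregated count table plus a small sample rather than from every raw record.

```python
from collections import defaultdict


def build_count_table(rows, labels):
    """Aggregate per-label counts for every (feature index, value) pair.

    Hypothetical helper: the table layout is an assumption, not Pyramid's format.
    """
    table = defaultdict(lambda: defaultdict(int))
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            table[(i, value)][label] += 1
    return table


def featurize(row, table, positive_label=1):
    """Replace each raw value with a smoothed P(positive label | value)."""
    features = []
    for i, value in enumerate(row):
        counts = table.get((i, value), {})
        total = sum(counts.values())
        positives = counts.get(positive_label, 0)
        # Laplace smoothing (an assumption) keeps unseen values at 0.5.
        features.append((positives + 1) / (total + 2))
    return features


# Toy usage: two categorical features, binary label.
rows = [("a", "x"), ("a", "y"), ("b", "x")]
labels = [1, 0, 1]
table = build_count_table(rows, labels)
print(featurize(("a", "x"), table))
```

The design point is that the count table is an aggregate: once it is built, the featurized representation of a new record depends only on the counts, not on the individual historical rows.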
Support matrix machine: A review
The support vector machine (SVM) is one of the most studied paradigms in the
realm of machine learning for classification and regression problems. It relies
on vectorized input data. However, a significant portion of the real-world data
exists in matrix format, which is given as input to SVM by reshaping the
matrices into vectors. The process of reshaping disrupts the spatial
correlations inherent in the matrix data. Also, converting matrices into
vectors results in input data with a high dimensionality, which introduces
significant computational complexity. To overcome these issues in classifying
matrix input data, the support matrix machine (SMM) was proposed. It is one
of the emerging methodologies tailored to matrix input data. The SMM
method preserves the structural information of the matrix data by using the
spectral elastic net property, which combines the nuclear norm and the
Frobenius norm. This article provides the first in-depth analysis of the
development of the SMM model, which can be used as a thorough summary by both
novices and experts. We discuss numerous SMM variants, such as robust, sparse,
class imbalance, and multi-class classification models. We also analyze the
applications of the SMM model and conclude the article by outlining potential
future research avenues and possibilities that may motivate academics to
advance the SMM algorithm.
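The combination of nuclear norm and Frobenius norm described in the abstract above can be sketched numerically. The snippet below is only an illustration: the function name `spectral_elastic_net`, the 1/2 factor on the squared Frobenius term, and the trade-off parameter `tau` are assumptions following one common way of writing such a penalty, and the abstract itself does not fix the weighting. It evaluates the penalty on a weight matrix `W` kept in matrix form, which is what lets the nuclear norm (the sum of singular values) reflect the matrix's low-rank structure.

```python
import numpy as np


def spectral_elastic_net(W, tau=1.0):
    """Return 0.5 * ||W||_F^2 + tau * ||W||_* for a 2-D weight matrix W.

    The weighting (0.5 and tau) is an assumed convention for illustration.
    """
    frobenius_sq = np.sum(W ** 2)              # squared Frobenius norm
    nuclear = np.linalg.norm(W, ord="nuc")     # sum of singular values
    return 0.5 * frobenius_sq + tau * nuclear


# Toy usage: a diagonal matrix with singular values 3 and 4.
W = np.diag([3.0, 4.0])
print(spectral_elastic_net(W, tau=1.0))
```

Both terms are functions of the singular values of `W`: the squared Frobenius norm is the sum of their squares and the nuclear norm is their sum, which is the sense in which the penalty resembles an elastic net applied to the spectrum.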
Distributional Random Forests: Heterogeneity Adjustment and Multivariate Distributional Regression
Random Forests (Breiman, 2001) is a successful and widely used regression and
classification algorithm. Part of its appeal and reason for its versatility is
its (implicit) construction of a kernel-type weighting function on training
data, which can also be used for targets other than the original mean
estimation. We propose a novel forest construction for multivariate responses
based on their joint conditional distribution, independent of the estimation
target and the data model. It uses a new splitting criterion based on the MMD
distributional metric, which is suitable for detecting heterogeneity in
multivariate distributions. The induced weights define an estimate of the full
conditional distribution, which in turn can be used for arbitrary and
potentially complicated targets of interest. The method is very versatile and
convenient to use, as we illustrate on a wide range of examples. The code is
available as the Python and R packages drf.
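As a rough illustration of an MMD-based comparison like the one the abstract above uses for splitting, the sketch below computes a biased estimate of the squared maximum mean discrepancy between the responses falling on the two sides of a candidate split. This is not the drf package's API or its exact criterion: the function names, the Gaussian kernel, and the fixed `bandwidth` are assumptions for illustration, and the actual method may use approximations or additional scaling.

```python
import numpy as np


def gaussian_kernel(A, B, bandwidth=1.0):
    """Pairwise Gaussian kernel values between rows of A and rows of B."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))


def mmd_squared(Y_left, Y_right, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    k_ll = gaussian_kernel(Y_left, Y_left, bandwidth).mean()
    k_rr = gaussian_kernel(Y_right, Y_right, bandwidth).mean()
    k_lr = gaussian_kernel(Y_left, Y_right, bandwidth).mean()
    return k_ll + k_rr - 2.0 * k_lr


# Toy usage: bivariate responses in the two children of a candidate split.
rng = np.random.default_rng(0)
Y_left = rng.normal(size=(50, 2))
Y_right = Y_left + 3.0  # shifted copy, so the distributions clearly differ
print(mmd_squared(Y_left, Y_right))
```

A split whose children have very different multivariate response distributions yields a larger value, while identical samples yield zero, which is why a statistic of this kind can detect distributional heterogeneity beyond a difference in means.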