Symmetric Sparse Boolean Matrix Factorization and Applications
In this work, we study a variant of nonnegative matrix factorization where we
wish to find a symmetric factorization of a given input matrix into a sparse,
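Boolean matrix. Formally speaking, given $\mathbf{M} \in \mathbb{Z}^{m \times m}$,
we want to find $\mathbf{W} \in \{0,1\}^{m \times r}$ such that $\|\mathbf{M} - \mathbf{W}\mathbf{W}^\top\|_0$ is minimized among all $\mathbf{W}$ for which
each row is $k$-sparse. This question turns out to be closely related to a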
number of questions like recovering a hypergraph from its line graph, as well
as reconstruction attacks for private neural network training.
As this problem is hard in the worst-case, we study a natural average-case
variant that arises in the context of these reconstruction attacks: $\mathbf{M} = \mathbf{W}\mathbf{W}^\top$ for $\mathbf{W}$ a random Boolean matrix with
$k$-sparse rows, and the goal is to recover $\mathbf{W}$ up to column
permutation. Equivalently, this can be thought of as recovering a uniformly
random $k$-uniform hypergraph from its line graph.
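To make the average-case setup concrete, here is a minimal sketch in plain Python. It writes W for the random Boolean matrix and M = W W^T for the observed matrix. The function names, the parameter values (m=6, r=8, k=3), and the reading of "k-sparse" as "exactly k ones per row" are assumptions made for illustration, not details taken from the paper. Each row of W is treated as a hyperedge on r vertices, so entry M[i][j] counts the vertices shared by hyperedges i and j, which is the overlap information a (weighted) line graph would carry.

```python
import random


def random_sparse_boolean_matrix(m, r, k, seed=0):
    """Return m rows, each a 0/1 list of length r with exactly k ones.

    Assumption: "k-sparse" is read as "exactly k ones", matching a
    k-uniform hypergraph with m hyperedges on r vertices.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(m):
        support = set(rng.sample(range(r), k))
        rows.append([1 if j in support else 0 for j in range(r)])
    return rows


def gram(W):
    """Compute M = W W^T over the integers."""
    return [[sum(a * b for a, b in zip(wi, wj)) for wj in W] for wi in W]


# Hypothetical small instance.
W = random_sparse_boolean_matrix(m=6, r=8, k=3)
M = gram(W)

# Hyperedge view: row i of W is the indicator vector of a vertex set S_i.
supports = [{j for j, x in enumerate(row) if x} for row in W]
overlaps = [[len(si & sj) for sj in supports] for si in supports]

# M[i][j] equals |S_i ∩ S_j|, and every diagonal entry equals k.
print(M == overlaps)
```

Recovering W from M alone is the task described above. Because M is unchanged when the columns of W are permuted, recovery is only possible up to column permutation.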
Our main result is a polynomial-time algorithm for this problem based on
bootstrapping higher-order information about $\mathbf{W}$ and then decomposing
an appropriate tensor. The key ingredient in our analysis, which may be of
independent interest, is to show that such a matrix $\mathbf{W}$ has full
column rank with high probability as soon as $m = \widetilde{\Omega}(r)$, which
we do using tools from Littlewood-Offord theory and estimates for binary
Krawtchouk polynomials.
Comment: 33 pages, to appear in Innovations in Theoretical Computer Science
(ITCS 2022), v2: updated ref
PROV-IO+: A Cross-Platform Provenance Framework for Scientific Data on HPC Systems
Data provenance, or data lineage, describes the life cycle of data. In
scientific workflows on HPC systems, scientists often seek diverse provenance
(e.g., origins of data products, usage patterns of datasets). Unfortunately,
existing provenance solutions cannot address the challenges due to their
incompatible provenance models and/or system implementations. In this paper, we
analyze four representative scientific workflows in collaboration with the
domain scientists to identify concrete provenance needs. Based on the
first-hand analysis, we propose a provenance framework called PROV-IO+, which
includes an I/O-centric provenance model for describing scientific data and the
associated I/O operations and environments precisely. Moreover, we build a
prototype of PROV-IO+ to enable end-to-end provenance support on real HPC
systems with little manual effort. The PROV-IO+ framework can support both
containerized and non-containerized workflows on different HPC platforms with
flexibility in selecting various classes of provenance. Our experiments with
realistic workflows show that PROV-IO+ can address the provenance needs of the
domain scientists effectively with reasonable performance (e.g., less than 3.5%
tracking overhead for most experiments). Moreover, PROV-IO+ outperforms a
state-of-the-art system (i.e., ProvLake) in our experiments.
WOMD-LiDAR: Raw Sensor Dataset Benchmark for Motion Forecasting
Widely adopted motion forecasting datasets substitute the observed sensory
inputs with higher-level abstractions such as 3D boxes and polylines. These
sparse shapes are inferred through annotating the original scenes with
perception systems' predictions. Such intermediate representations tie the
quality of the motion forecasting models to the performance of computer vision
models. Moreover, the human-designed explicit interfaces between perception and
motion forecasting typically pass only a subset of the semantic information
present in the original sensory input. To study the effect of these modular
approaches, design new paradigms that mitigate these limitations, and
accelerate the development of end-to-end motion forecasting models, we augment
the Waymo Open Motion Dataset (WOMD) with large-scale, high-quality, diverse
LiDAR data for the motion forecasting task.
The new augmented dataset, WOMD-LiDAR, consists of over 100,000 scenes that
each span 20 seconds, with well-synchronized and calibrated high-quality
LiDAR point clouds captured across a range of urban and suburban
geographies (https://waymo.com/open/data/motion/). Compared to the Waymo Open
Dataset (WOD), the WOMD-LiDAR dataset contains 100x more scenes. Furthermore, we
integrate the LiDAR data into the motion forecasting model training and provide
a strong baseline. Experiments show that the LiDAR data brings improvement in
the motion forecasting task. We hope that WOMD-LiDAR will provide new
opportunities for boosting end-to-end motion forecasting models.
Comment: Dataset website: https://waymo.com/open/data/motion