24 research outputs found

    Symmetric Sparse Boolean Matrix Factorization and Applications

    In this work, we study a variant of nonnegative matrix factorization where we wish to find a symmetric factorization of a given input matrix into a sparse, Boolean matrix. Formally speaking, given $\mathbf{M}\in\mathbb{Z}^{m\times m}$, we want to find $\mathbf{W}\in\{0,1\}^{m\times r}$ such that $\| \mathbf{M} - \mathbf{W}\mathbf{W}^\top \|_0$ is minimized among all $\mathbf{W}$ for which each row is $k$-sparse. This question turns out to be closely related to a number of questions like recovering a hypergraph from its line graph, as well as reconstruction attacks for private neural network training. As this problem is hard in the worst case, we study a natural average-case variant that arises in the context of these reconstruction attacks: $\mathbf{M} = \mathbf{W}\mathbf{W}^{\top}$ for $\mathbf{W}$ a random Boolean matrix with $k$-sparse rows, and the goal is to recover $\mathbf{W}$ up to column permutation. Equivalently, this can be thought of as recovering a uniformly random $k$-uniform hypergraph from its line graph. Our main result is a polynomial-time algorithm for this problem based on bootstrapping higher-order information about $\mathbf{W}$ and then decomposing an appropriate tensor. The key ingredient in our analysis, which may be of independent interest, is to show that such a matrix $\mathbf{W}$ has full column rank with high probability as soon as $m = \widetilde{\Omega}(r)$, which we do using tools from Littlewood-Offord theory and estimates for binary Krawtchouk polynomials.
    Comment: 33 pages, to appear in Innovations in Theoretical Computer Science (ITCS 2022), v2: updated ref
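    The average-case setup in this abstract can be made concrete with a small sketch. The code below is only an illustration of the problem statement, not the paper's recovery algorithm: it draws a random Boolean W whose rows have exactly k ones (an assumption; the abstract says only "k-sparse"), forms M = W W^T, evaluates the L0 objective, and checks that permuting the columns of W leaves M unchanged, which is why recovery is only possible up to column permutation. All function names are hypothetical.

```python
import random


def random_sparse_boolean(m, r, k, seed=0):
    """Return m rows, each a 0/1 list of length r with exactly k ones.

    Assumption: 'k-sparse' is read as 'exactly k ones per row', matching
    the k-uniform hypergraph view (rows = hyperedges, columns = vertices).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(m):
        support = set(rng.sample(range(r), k))
        rows.append([1 if j in support else 0 for j in range(r)])
    return rows


def gram(W):
    """Compute M = W W^T; entry (i, j) counts the shared ones of rows i and j."""
    return [[sum(a * b for a, b in zip(wi, wj)) for wj in W] for wi in W]


def l0_error(M, W):
    """Count the entries where M and W W^T differ (the objective being minimized)."""
    G = gram(W)
    n = len(M)
    return sum(1 for i in range(n) for j in range(n) if M[i][j] != G[i][j])


def permute_columns(W, perm):
    """Reorder the columns of W according to perm."""
    return [[row[p] for p in perm] for row in W]


m, r, k = 8, 5, 2
W = random_sparse_boolean(m, r, k)
M = gram(W)

# A column permutation of W yields the same M, so W is identifiable
# only up to such a permutation.
perm = list(range(r))
random.Random(1).shuffle(perm)
W_perm = permute_columns(W, perm)

print("L0 error of the planted W:", l0_error(M, W))
print("L0 error of a column-permuted W:", l0_error(M, W_perm))
```

    In the hypergraph reading, each row of W is a hyperedge on r vertices, so the off-diagonal entries of M are pairwise intersection sizes of hyperedges and the diagonal entries all equal k.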

    PROV-IO+: A Cross-Platform Provenance Framework for Scientific Data on HPC Systems

    Data provenance, or data lineage, describes the life cycle of data. In scientific workflows on HPC systems, scientists often seek diverse provenance (e.g., origins of data products, usage patterns of datasets). Unfortunately, existing provenance solutions cannot meet these needs because of their incompatible provenance models and/or system implementations. In this paper, we analyze four representative scientific workflows in collaboration with the domain scientists to identify concrete provenance needs. Based on this first-hand analysis, we propose a provenance framework called PROV-IO+, which includes an I/O-centric provenance model for precisely describing scientific data and the associated I/O operations and environments. Moreover, we build a prototype of PROV-IO+ to enable end-to-end provenance support on real HPC systems with little manual effort. The PROV-IO+ framework can support both containerized and non-containerized workflows on different HPC platforms, with flexibility in selecting various classes of provenance. Our experiments with realistic workflows show that PROV-IO+ can address the provenance needs of the domain scientists effectively with reasonable performance (e.g., less than 3.5% tracking overhead for most experiments). Moreover, PROV-IO+ outperforms a state-of-the-art system (i.e., ProvLake) in our experiments.

    WOMD-LiDAR: Raw Sensor Dataset Benchmark for Motion Forecasting

    Widely adopted motion forecasting datasets substitute the observed sensory inputs with higher-level abstractions such as 3D boxes and polylines. These sparse shapes are inferred through annotating the original scenes with perception systems' predictions. Such intermediate representations tie the quality of the motion forecasting models to the performance of computer vision models. Moreover, the human-designed explicit interfaces between perception and motion forecasting typically pass only a subset of the semantic information present in the original sensory input. To study the effect of these modular approaches, design new paradigms that mitigate these limitations, and accelerate the development of end-to-end motion forecasting models, we augment the Waymo Open Motion Dataset (WOMD) with large-scale, high-quality, diverse LiDAR data for the motion forecasting task. The new augmented dataset, WOMD-LiDAR, consists of over 100,000 scenes, each spanning 20 seconds, with well-synchronized and calibrated high-quality LiDAR point clouds captured across a range of urban and suburban geographies (https://waymo.com/open/data/motion/). Compared to the Waymo Open Dataset (WOD), the WOMD-LiDAR dataset contains 100x more scenes. Furthermore, we integrate the LiDAR data into the motion forecasting model training and provide a strong baseline. Experiments show that the LiDAR data improves performance on the motion forecasting task. We hope that WOMD-LiDAR will provide new opportunities for boosting end-to-end motion forecasting models.
    Comment: Dataset website: https://waymo.com/open/data/motion