
Symmetric Sparse Boolean Matrix Factorization and Applications

Author: Chen Sitan
Song Zhao
Tao Runzhou
Zhang Ruizhe
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 13th Innovations in Theoretical Computer Science Conference (ITCS 2022)
Publication date: 01/01/2022

In this work, we study a variant of nonnegative matrix factorization where we wish to find a symmetric factorization of a given input matrix into a sparse, Boolean matrix. Formally speaking, given $\mathbf{M}\in\mathbb{Z}^{m\times m}$, we want to find $\mathbf{W}\in\{0,1\}^{m\times r}$ such that $\| \mathbf{M} - \mathbf{W}\mathbf{W}^\top \|_0$ is minimized among all $\mathbf{W}$ for which each row is $k$-sparse. This question turns out to be closely related to a number of questions like recovering a hypergraph from its line graph, as well as reconstruction attacks for private neural network training. As this problem is hard in the worst-case, we study a natural average-case variant that arises in the context of these reconstruction attacks: $\mathbf{M} = \mathbf{W}\mathbf{W}^{\top}$ for $\mathbf{W}$ a random Boolean matrix with $k$-sparse rows, and the goal is to recover $\mathbf{W}$ up to column permutation. Equivalently, this can be thought of as recovering a uniformly random $k$-uniform hypergraph from its line graph. Our main result is a polynomial-time algorithm for this problem based on bootstrapping higher-order information about $\mathbf{W}$ and then decomposing an appropriate tensor. The key ingredient in our analysis, which may be of independent interest, is to show that such a matrix $\mathbf{W}$ has full column rank with high probability as soon as $m = \widetilde{\Omega}(r)$, which we do using tools from Littlewood-Offord theory and estimates for binary Krawtchouk polynomials.

Comment: 33 pages, to appear in Innovations in Theoretical Computer Science (ITCS 2022), v2: updated ref

arXiv.org e-Print Archive

DROPS Dagstuhl Research Online Publication Server
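
The average-case setup described in the abstract of "Symmetric Sparse Boolean Matrix Factorization and Applications" can be illustrated with a small sketch. This is not the paper's recovery algorithm. It only builds an instance and shows why recovery is possible only up to column permutation. The function names, the seed, and the choice of exactly k ones per row drawn uniformly at random are assumptions made for illustration.

```python
import random


def random_sparse_boolean_matrix(m, r, k, seed=0):
    """Return an m x r 0/1 matrix (list of rows) with exactly k ones per row.

    Assumption: each row's support is a uniformly random k-subset of the
    r columns, i.e. each row is one hyperedge of a k-uniform hypergraph.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(m):
        support = set(rng.sample(range(r), k))
        rows.append([1 if j in support else 0 for j in range(r)])
    return rows


def gram(W):
    """Compute M = W W^T; M[i][j] is the overlap of the supports of rows i and j."""
    return [[sum(a * b for a, b in zip(wi, wj)) for wj in W] for wi in W]


def l0_error(M, W):
    """Count the entries where M and W W^T differ (the ||.||_0 objective)."""
    G = gram(W)
    n = len(M)
    return sum(1 for i in range(n) for j in range(n) if M[i][j] != G[i][j])


def permute_columns(W, perm):
    """Reorder the columns of W so that new column j is old column perm[j]."""
    return [[row[p] for p in perm] for row in W]


# Small hypothetical instance: m = 8 rows, r = 5 columns, k = 2 ones per row.
W = random_sparse_boolean_matrix(m=8, r=5, k=2, seed=1)
M = gram(W)

# Any column permutation of W explains M equally well.
W_perm = permute_columns(W, [4, 2, 0, 3, 1])
print(l0_error(M, W_perm))  # prints 0
```

Because $\mathbf{W}\mathbf{W}^\top$ sums over columns, reordering the columns leaves $\mathbf{M}$ unchanged, so the observed matrix cannot distinguish $\mathbf{W}$ from any of its column permutations.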

PROV-IO+: A Cross-Platform Provenance Framework for Scientific Data on HPC Systems

Author: Byna Suren
Chen Yong
Dai Dong
Dong Bin
Han Runzhou
Hassoun Joseph
Kim Dongkyun
Tang Houjun
Thorsley David
Wolf Matthew
Zheng Mai
Publication date: 01/08/2023

Data provenance, or data lineage, describes the life cycle of data. In scientific workflows on HPC systems, scientists often seek diverse provenance (e.g., origins of data products, usage patterns of datasets). Unfortunately, existing provenance solutions cannot address the challenges due to their incompatible provenance models and/or system implementations. In this paper, we analyze four representative scientific workflows in collaboration with the domain scientists to identify concrete provenance needs. Based on the first-hand analysis, we propose a provenance framework called PROV-IO+, which includes an I/O-centric provenance model for describing scientific data and the associated I/O operations and environments precisely. Moreover, we build a prototype of PROV-IO+ to enable end-to-end provenance support on real HPC systems with little manual effort. The PROV-IO+ framework can support both containerized and non-containerized workflows on different HPC platforms with flexibility in selecting various classes of provenance. Our experiments with realistic workflows show that PROV-IO+ can address the provenance needs of the domain scientists effectively with reasonable performance (e.g., less than 3.5% tracking overhead for most experiments). Moreover, PROV-IO+ outperforms a state-of-the-art system (i.e., ProvLake) in our experiments.

arXiv.org e-Print Archive

WOMD-LiDAR: Raw Sensor Dataset Benchmark for Motion Forecasting

Author: Al-Rfou Rami
Anguelov Dragomir
Bogun Ivan
Chen Kan
Ettinger Scott
Ge Runzhou
Leng Zhaoqi
Mustafa Mustafa
Qi Charles R.
Qiu Hang
Sun Pei
Tan Mingxing
Wang Weiyue
Yang Zoey
Zhou Xuanyu
Publication date: 07/04/2023

Widely adopted motion forecasting datasets substitute the observed sensory inputs with higher-level abstractions such as 3D boxes and polylines. These sparse shapes are inferred through annotating the original scenes with perception systems' predictions. Such intermediate representations tie the quality of the motion forecasting models to the performance of computer vision models. Moreover, the human-designed explicit interfaces between perception and motion forecasting typically pass only a subset of the semantic information present in the original sensory input. To study the effect of these modular approaches, design new paradigms that mitigate these limitations, and accelerate the development of end-to-end motion forecasting models, we augment the Waymo Open Motion Dataset (WOMD) with large-scale, high-quality, diverse LiDAR data for the motion forecasting task. The new augmented dataset WOMD-LiDAR consists of over 100,000 scenes that each span 20 seconds, consisting of well-synchronized and calibrated high-quality LiDAR point clouds captured across a range of urban and suburban geographies (https://waymo.com/open/data/motion/). Compared to the Waymo Open Dataset (WOD), the WOMD-LiDAR dataset contains 100x more scenes. Furthermore, we integrate the LiDAR data into the motion forecasting model training and provide a strong baseline. Experiments show that the LiDAR data brings improvement in the motion forecasting task. We hope that WOMD-LiDAR will provide new opportunities for boosting end-to-end motion forecasting models.

Comment: Dataset website: https://waymo.com/open/data/motion

arXiv.org e-Print Archive