Controlling false positives (Type I errors) through statistical hypothesis
testing is a foundation of modern scientific data analysis. Existing causal
structure discovery algorithms either do not provide Type I error control or
cannot scale to the size of modern scientific datasets. We consider a variant
of the causal discovery problem with two sets of nodes, where the only edges of
interest form a bipartite causal subgraph between the sets. We develop Scalable
Causal Structure Learning (SCSL), a method for causal structure discovery on
bipartite subgraphs that provides Type I error control. SCSL recasts the
discovery problem as a simultaneous hypothesis testing problem and uses
discrete optimization over the set of possible confounders to obtain an upper
bound on the test statistic for each edge. Semi-synthetic simulations
demonstrate that SCSL scales to handle graphs with hundreds of nodes while
maintaining error control and good power. We demonstrate the method's practical
applicability on a cancer dataset, revealing
connections between somatic gene mutations and metastases to different tissues.

Comment: 10 figures, 24 pages
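
The idea of treating each candidate edge as one hypothesis and taking a worst case over possible confounder sets can be illustrated with a minimal sketch. It is not the SCSL algorithm: the abstract does not specify the test, the optimizer, or the multiple-testing correction. The sketch assumes brute-force enumeration in place of discrete optimization, bounds p-values rather than test statistics, and uses a Bonferroni correction as a stand-in. All names (`candidate_sets`, `worst_case_p`, `bonferroni`, `toy_p_value`) and all numbers are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

Edge = Tuple[str, str]  # (source node, target node) across the two node sets
PValueFn = Callable[[Edge, FrozenSet[str]], float]


def candidate_sets(confounders: Sequence[str], max_size: int) -> List[FrozenSet[str]]:
    """Enumerate every subset of `confounders` with at most `max_size` elements."""
    return [
        frozenset(subset)
        for size in range(max_size + 1)
        for subset in combinations(confounders, size)
    ]


def worst_case_p(edge: Edge, confounders: Sequence[str],
                 p_value: PValueFn, max_size: int = 1) -> float:
    """Largest p-value for `edge` over all candidate conditioning sets.

    Brute force is an assumption made for illustration; it is only
    feasible for small confounder pools.
    """
    return max(p_value(edge, s) for s in candidate_sets(confounders, max_size))


def bonferroni(p_values: Dict[Edge, float], alpha: float = 0.05) -> List[Edge]:
    """Return the edges whose p-value passes the Bonferroni threshold alpha / m."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return sorted(e for e, p in p_values.items() if p <= threshold)


# Made-up p-values standing in for a real conditional independence test.
TOY_P = {
    (("a1", "b1"), frozenset()): 0.001,
    (("a1", "b1"), frozenset({"c1"})): 0.004,
    (("a2", "b1"), frozenset()): 0.010,
    (("a2", "b1"), frozenset({"c1"})): 0.200,
}


def toy_p_value(edge: Edge, conditioning_set: FrozenSet[str]) -> float:
    """Look up the hypothetical p-value for an edge given a conditioning set."""
    return TOY_P[(edge, conditioning_set)]


edges = [("a1", "b1"), ("a2", "b1")]
bounds = {e: worst_case_p(e, ["c1"], toy_p_value) for e in edges}
discoveries = bonferroni(bounds, alpha=0.05)
print(bounds)
print(discoveries)
```

In this toy input, the edge from `a2` to `b1` looks significant unconditionally but not after conditioning on `c1`, so its worst-case p-value is large and it is not reported. Taking the maximum trades power for conservatism, which is why the size of the confounder search matters in practice.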