20,000 research outputs found
Algorithmic and Statistical Perspectives on Large-Scale Data Analysis
In recent years, ideas from statistics and scientific computing have begun to
interact in increasingly sophisticated and fruitful ways with ideas from
computer science and the theory of algorithms to aid in the development of
improved worst-case algorithms that are useful for large-scale scientific and
Internet data analysis problems. In this chapter, I will describe two recent
examples---one having to do with selecting good columns or features from a (DNA
Single Nucleotide Polymorphism) data matrix, and the other having to do with
selecting good clusters or communities from a data graph (representing a social
or information network)---that drew on ideas from both areas and that may serve
as a model for exploiting complementary algorithmic and statistical
perspectives in order to solve applied large-scale data analysis problems.Comment: 33 pages. To appear in Uwe Naumann and Olaf Schenk, editors,
"Combinatorial Scientific Computing," Chapman and Hall/CRC Press, 201
Recent Advances in Graph Partitioning
We survey recent trends in practical algorithms for balanced graph
partitioning together with applications and future research directions
Visual and computational analysis of structure-activity relationships in high-throughput screening data
Novel analytic methods are required to assimilate the large volumes of structural and bioassay data generated by combinatorial chemistry and high-throughput screening programmes in the pharmaceutical and agrochemical industries. This paper reviews recent work in visualisation and data mining that can be used to develop structure-activity relationships from such chemical/biological datasets
Supporting Regularized Logistic Regression Privately and Efficiently
As one of the most popular statistical and machine learning models, logistic
regression with regularization has found wide adoption in biomedicine, social
sciences, information technology, and so on. These domains often involve data
of human subjects that are contingent upon strict privacy regulations.
Increasing concerns over data privacy make it more and more difficult to
coordinate and conduct large-scale collaborative studies, which typically rely
on cross-institution data sharing and joint analysis. Our work here focuses on
safeguarding regularized logistic regression, a widely-used machine learning
model in various disciplines while at the same time has not been investigated
from a data security and privacy perspective. We consider a common use scenario
of multi-institution collaborative studies, such as in the form of research
consortia or networks as widely seen in genetics, epidemiology, social
sciences, etc. To make our privacy-enhancing solution practical, we demonstrate
a non-conventional and computationally efficient method leveraging distributing
computing and strong cryptography to provide comprehensive protection over
individual-level and summary data. Extensive empirical evaluation on several
studies validated the privacy guarantees, efficiency and scalability of our
proposal. We also discuss the practical implications of our solution for
large-scale studies and applications from various disciplines, including
genetic and biomedical studies, smart grid, network analysis, etc
- …