Calibrated Multivariate Regression with Application to Neural Semantic Basis Discovery
We propose a calibrated multivariate regression method named CMR for fitting
high dimensional multivariate regression models. Compared with existing
methods, CMR calibrates regularization for each regression task with respect to
its noise level so that it simultaneously attains improved finite-sample
performance and tuning insensitiveness. Theoretically, we provide sufficient
conditions under which CMR achieves the optimal rate of convergence in
parameter estimation. Computationally, we propose an efficient smoothed
proximal gradient algorithm with a worst-case numerical rate of convergence
\mathcal{O}(1/\epsilon), where \epsilon is a pre-specified accuracy of the
objective function value. We conduct thorough numerical simulations to
illustrate that CMR consistently outperforms other high dimensional
multivariate regression methods. We also apply CMR to solve a brain activity
prediction problem and find that it is as competitive as a handcrafted model
created by human experts. The R package \texttt{camel} implementing the
proposed method is available on the Comprehensive R Archive Network
\url{http://cran.r-project.org/web/packages/camel/}.
Comment: Journal of Machine Learning Research, 201
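As a rough illustration of the kind of iteration a proximal gradient method performs, here is a minimal ISTA loop for an l1-penalised least-squares problem in plain Python. The data, penalty, and step size are illustrative assumptions; this is not the paper's smoothed algorithm or its \texttt{camel} implementation, which additionally smooths the nonsmooth loss before taking gradient steps.

```python
# Hedged sketch: a generic proximal-gradient (ISTA) loop for
# 0.5*||Xw - y||^2 + lam*||w||_1, showing the gradient-step /
# shrinkage-step alternation that smoothed proximal gradient
# methods build on. Data and constants below are toy choices.

def soft_threshold(v, t):
    """Proximal operator of t*|v| (elementwise soft-thresholding)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def ista(X, y, lam, step, iters=500):
    """Minimise 0.5*||Xw - y||^2 + lam*||w||_1 by proximal gradient."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(iters):
        # residual r = Xw - y
        r = [sum(X[i][j] * w[j] for j in range(p)) - y[i] for i in range(n)]
        # gradient of the smooth part: X^T r
        g = [sum(X[i][j] * r[i] for i in range(n)) for j in range(p)]
        # gradient step followed by the proximal (shrinkage) step
        w = [soft_threshold(w[j] - step * g[j], step * lam) for j in range(p)]
    return w

# Toy data: y depends only on the first feature, so the l1 penalty
# should drive the second coefficient to zero.
X = [[1.0, 0.1], [2.0, -0.2], [3.0, 0.3], [4.0, -0.1]]
y = [2.0, 4.0, 6.0, 8.0]
w = ista(X, y, lam=0.1, step=0.01)
```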
Distributed Primal-Dual Optimization for Online Multi-Task Learning
Conventional online multi-task learning algorithms suffer from two critical
limitations: 1) heavy communication caused by delivering high-velocity
sequential data to a central machine; 2) expensive runtime complexity for
modeling task relatedness. To address these issues, in this paper we consider a
setting where multiple tasks are geographically located in different places,
where one task can synchronize data with others to leverage knowledge of
related tasks. Specifically, we propose an adaptive primal-dual algorithm,
which not only captures task-specific noise in adversarial learning but also
carries out a projection-free update with runtime efficiency. Moreover, our
model is well-suited to decentralized periodic-connected tasks as it allows the
energy-starved or bandwidth-constrained tasks to postpone their updates.
Theoretical results demonstrate the convergence guarantee of our distributed
algorithm with an optimal regret bound. Empirical results confirm that the
proposed model is highly effective on various real-world datasets.
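To make the postponed-update idea concrete, here is a toy sketch of several distributed task streams where a periodically connected task skips rounds. The loss, learning rate, and activity schedule are assumptions for illustration; this is not the paper's adaptive primal-dual formulation.

```python
# Hedged sketch (details not from the paper): online updates for
# several distributed tasks in which a task may skip (postpone)
# rounds, e.g. when bandwidth- or energy-constrained. Each task
# runs online gradient descent on a squared loss over its stream.

def online_round(w, x, y, lr):
    """One online gradient step for squared loss 0.5*(w*x - y)^2."""
    grad = (w * x - y) * x
    return w - lr * grad

def run_tasks(streams, active, lr=0.1, rounds=100):
    """streams: per-task list of (x, y) pairs; active(t, k) says
    whether task k synchronizes at round t (False = postponed)."""
    ws = [0.0] * len(streams)
    for t in range(rounds):
        for k, stream in enumerate(streams):
            if active(t, k):
                x, y = stream[t % len(stream)]
                ws[k] = online_round(ws[k], x, y, lr)
    return ws

# Two tasks with true slopes 2.0 and -1.0; task 1 only updates on
# even rounds, mimicking a periodically connected device.
streams = [[(1.0, 2.0), (2.0, 4.0)], [(1.0, -1.0), (3.0, -3.0)]]
ws = run_tasks(streams, active=lambda t, k: k == 0 or t % 2 == 0)
```

Despite updating only half as often, the postponing task still converges toward its own solution; it simply does so more slowly.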
MTFuzz: Fuzzing with a Multi-Task Neural Network
Fuzzing is a widely used technique for detecting software bugs and
vulnerabilities. Most popular fuzzers generate new inputs using an evolutionary
search to maximize code coverage. Essentially, these fuzzers start with a set
of seed inputs, mutate them to generate new inputs, and identify the promising
inputs using an evolutionary fitness function for further mutation. Despite
their success, evolutionary fuzzers tend to get stuck in long sequences of
unproductive mutations. In recent years, machine learning (ML) based mutation
strategies have reported promising results. However, the existing ML-based
fuzzers are limited by the lack of quality and diversity of the training data.
As the input space of the target programs is high dimensional and sparse, it is
prohibitively expensive to collect many diverse samples demonstrating
successful and unsuccessful mutations to train the model. In this paper, we
address these issues by using a Multi-Task Neural Network that can learn a
compact embedding of the input space based on diverse training samples for
multiple related tasks (i.e., predicting for different types of coverage). The
compact embedding can guide the mutation process by focusing most of the
mutations on the parts of the embedding where the gradient is high. MTFuzz
uncovers previously unseen bugs and achieves more edge coverage on average
compared with 5 state-of-the-art fuzzers on 10 real-world programs.
Comment: ACM Joint European Software Engineering Conference and Symposium on
the Foundations of Software Engineering (ESEC/FSE) 202
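The gradient-guided mutation idea above can be illustrated with a toy example: rank input bytes by the sensitivity of a coverage predictor and focus mutations on the most sensitive positions. The predictor below is a hypothetical stand-in, not the paper's multi-task network.

```python
# Hedged sketch of gradient-guided mutation, the general idea
# behind ML-guided fuzzers like the one described above. The
# "predictor" is a toy linear stand-in, not a trained model.

def coverage_score(inp):
    """Toy stand-in for a learned coverage predictor: only bytes
    0 and 3 of the input influence the predicted coverage."""
    return 2.0 * inp[0] - 1.5 * inp[3]

def byte_gradients(inp):
    """Finite-difference sensitivity of the score to each byte."""
    base = coverage_score(inp)
    grads = []
    for i in range(len(inp)):
        bumped = list(inp)
        bumped[i] += 1
        grads.append(abs(coverage_score(bumped) - base))
    return grads

def pick_hot_bytes(inp, k):
    """Indices of the k bytes with the largest gradient, i.e. the
    positions most worth mutating."""
    grads = byte_gradients(inp)
    return sorted(range(len(inp)), key=lambda i: -grads[i])[:k]

seed = [10, 20, 30, 40, 50]
hot = pick_hot_bytes(seed, 2)   # bytes 0 and 3 dominate the score
```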
Explainable tensor multi-task ensemble learning based on brain structure variation for Alzheimer's disease dynamic prediction
Objective: Machine learning approaches for predicting Alzheimer's disease (AD) progression can substantially assist researchers and clinicians in developing effective AD preventive and treatment strategies.
Methods: This study proposes a novel machine learning algorithm to predict AD progression utilising a multi-task ensemble learning approach. Specifically, we present a novel tensor multi-task learning (MTL) algorithm based on similarity measurement of the spatio-temporal variability of brain biomarkers to model AD progression. In this model, the prediction of each patient sample in the tensor is set as one task, where all tasks share a set of latent factors obtained through tensor decomposition. Furthermore, as subjects have continuous records of brain biomarker testing, the model is extended to ensemble the subjects' temporally continuous prediction results utilising a gradient boosting kernel to find more accurate predictions.
Results: We have conducted extensive experiments utilising data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) to evaluate the performance of the proposed algorithm and model. Results demonstrate that the proposed model has superior accuracy and stability in predicting AD progression compared to benchmarks and state-of-the-art multi-task regression methods in terms of the Mini-Mental State Examination (MMSE) and the Alzheimer's Disease Assessment Scale-Cognitive Subscale (ADAS-Cog) cognitive scores.
Conclusion: Brain biomarker correlation information can be utilised to identify variations in individual brain structures, and the model can effectively predict the progression of AD from magnetic resonance imaging (MRI) data and cognitive scores of AD patients at different stages.
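As a minimal illustration of the boosting-style ensembling mentioned in the Methods, the sketch below repeatedly fits a (deliberately trivial) weak learner to the residual of the current ensemble. This is an assumption-laden toy, not the paper's gradient boosting kernel.

```python
# Hedged sketch (not the paper's algorithm): additive boosting on
# squared loss, where each round fits a weak learner to the
# current residuals and adds a shrunken copy to the ensemble.

def fit_constant(residuals):
    """Weakest possible learner: predict the mean residual."""
    return sum(residuals) / len(residuals)

def boost(targets, rounds=20, lr=0.5):
    """Build an additive ensemble F = sum(lr * c_m) on squared loss."""
    pred = [0.0] * len(targets)
    for _ in range(rounds):
        residuals = [t - p for t, p in zip(targets, pred)]
        c = fit_constant(residuals)        # weak learner on residuals
        pred = [p + lr * c for p in pred]  # shrunken additive update
    return pred

# Cognitive-score-like targets for one subject across visits
# (illustrative numbers only).
scores = [28.0, 26.5, 25.0, 23.0]
preds = boost(scores)
```

With a constant weak learner the ensemble converges to the target mean; a real boosting ensemble uses expressive weak learners so the per-visit residual structure is captured as well.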
Improved Multi-Task Learning Based on Local Rademacher Analysis
Considering a single prediction task at a time is the most common paradigm in machine learning practice. This methodology, however, ignores the potentially relevant information that might be available in other related tasks in the same domain. This becomes even more critical when a prediction task for an individual subject lacks a sufficient amount of data, which may lead to deteriorated generalization performance. In such cases, learning multiple related tasks together might offer better performance by allowing tasks to leverage information from each other. Multi-Task Learning (MTL) is a machine learning framework that learns multiple related tasks simultaneously to overcome the data scarcity limitations of Single Task Learning (STL), and therefore results in improved performance. Although MTL has been actively investigated by the machine learning community, there are only a few studies examining the theoretical justification of this learning framework. The focus of previous studies is on providing learning guarantees in the form of generalization error bounds. The study of generalization bounds is considered an important problem in machine learning and, more specifically, in statistical learning theory. This importance is twofold: (1) generalization bounds provide an upper-tail confidence interval for the true risk of a learning algorithm, which cannot be calculated precisely due to its dependence on the unknown distribution P from which the data are drawn; (2) such bounds can also be employed as model selection tools, leading to more accurate learning models. The generalization error bounds are typically expressed in terms of the empirical risk of the learning hypothesis along with a complexity measure of that hypothesis. 
Although different complexity measures can be used in deriving error bounds, Rademacher complexity has received considerable attention in recent years because it can lead to tighter error bounds than those obtained with other complexity measures. However, one shortcoming of the general notion of Rademacher complexity is that it provides a global complexity estimate of the learning hypothesis space, which does not take into account the fact that learning algorithms, by design, select functions from a more favorable subset of this space and therefore yield better-performing models than the worst case. To overcome this limitation of global Rademacher complexity, a more nuanced notion, the so-called local Rademacher complexity, has been considered; it leads to sharper learning bounds and, compared to its global counterpart, guarantees faster convergence rates in terms of the number of samples. Also, since locally derived bounds are expected to be tighter than globally derived ones, they can motivate better (more accurate) model selection algorithms. While previous MTL studies provide generalization bounds based on other complexity measures, in this dissertation we prove excess risk bounds for some popular kernel-based MTL hypothesis spaces based on the Local Rademacher Complexity (LRC) of those spaces. We show that these local bounds have faster convergence rates than the previous Global Rademacher Complexity (GRC)-based bounds. We then use our LRC-based MTL bounds to design a new kernel-based MTL model, which enjoys strong learning guarantees. Moreover, we develop an optimization algorithm to solve our new MTL formulation. 
Finally, we run simulations on experimental data that compare our MTL model to some classical Multi-Task Multiple Kernel Learning (MT-MKL) models designed based on GRCs. Since the local Rademacher complexities are expected to be tighter than the global ones, our new model is also expected to exhibit better performance compared to the GRC-based models.
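For reference, the two complexity notions contrasted above are commonly written as follows (standard formulation and notation, not taken from the dissertation):

```latex
% Global Rademacher complexity of a function class F over a sample
% x_1,...,x_n, with i.i.d. Rademacher variables sigma_i in {-1,+1}:
\mathcal{R}_n(\mathcal{F}) \;=\;
  \mathbb{E}_{\sigma}\!\left[\sup_{f \in \mathcal{F}}
  \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(x_i)\right]

% The local Rademacher complexity restricts the supremum to
% functions of small variance (radius r > 0), which is what yields
% the sharper, faster-converging bounds:
\mathcal{R}_n(\mathcal{F}; r) \;=\;
  \mathbb{E}_{\sigma}\!\left[\sup_{\substack{f \in \mathcal{F} \\ P f^2 \le r}}
  \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(x_i)\right]
```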
Adaptive and Effective Fuzzing: a Data-Driven Approach
Security vulnerabilities have a large real-world impact, from ransomware attacks costing billions of dollars every year to sensitive data breaches in government, military and industry. Fuzzing is a popular technique to discover these vulnerabilities in an automated fashion. Companies have poured substantial resources into building large-scale fuzzing factories (e.g., Google's ClusterFuzz and Microsoft's OneFuzz) to test their products and make them more secure. Despite the wide application of fuzzing in industry, many issues still constrain its performance. One fundamental limitation is the rule-based design of fuzzers. Rule-based fuzzers heavily rely on a set of static rules or heuristics. These fixed rules are distilled from human experience and hence fail to generalize across a diverse set of programs.
In this dissertation, we present an adaptive and effective fuzzing framework built on a data-driven approach. A data-driven fuzzer makes decisions based on the analysis and reasoning of data rather than static rules. Hence it is more adaptive, effective, and flexible than a typical rule-based fuzzer. More interestingly, the data-driven approach connects fuzzing to various data-centric domains (e.g., machine learning, optimization and social networks), enabling sophisticated designs in the fuzzing framework.
A general fuzzing framework consists of two major components: seed scheduling and seed mutation. The seed scheduling module selects a seed from a seed corpus that contains multiple test cases. The seed mutation module then perturbs the selected seed to generate a new test case. First, we present Neuzz, the first machine learning (ML) based general-purpose fuzzer, which applies ML to seed mutation and greatly improves fuzzing performance. Then we present MTFuzz, a follow-up to Neuzz that feeds diverse data into the ML model to generate effective seed mutations.
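The two-component loop described above can be sketched end to end. The coverage function and mutation operator below are toy stand-ins, not Neuzz, MTFuzz, or any real instrumentation.

```python
# Hedged sketch of a coverage-guided fuzzing loop: a scheduler
# picks a seed from the corpus, a mutator perturbs it, and the
# corpus grows when the mutant reaches new coverage. The "program"
# here is a toy coverage function, not a real target.

import random

def toy_coverage(data):
    """Toy stand-in for instrumented execution: the set of 'edges' hit."""
    edges = {b % 8 for b in data}       # pretend each byte maps to an edge
    if 3 in edges and 5 in edges:
        edges.add(99)                   # a 'deep' edge needing both bytes
    return edges

def mutate(data, rng):
    """Flip one random byte."""
    out = list(data)
    i = rng.randrange(len(out))
    out[i] = rng.randrange(256)
    return out

def fuzz(seed, rounds=2000, rng=None):
    rng = rng or random.Random(0)
    corpus, seen = [seed], toy_coverage(seed)
    for _ in range(rounds):
        parent = rng.choice(corpus)      # trivial seed scheduling
        child = mutate(parent, rng)      # seed mutation
        cov = toy_coverage(child)
        if not cov <= seen:              # keep mutants with new coverage
            seen |= cov
            corpus.append(child)
    return corpus, seen

corpus, seen = fuzz([0, 0, 0, 0])
```

Both dissertation contributions slot into this skeleton: a learned mutator replaces `mutate`, and a smarter scheduler replaces the uniform `rng.choice`.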
In the end, we present K-Scheduler, a data-driven, fuzzer-agnostic seed scheduling algorithm. K-Scheduler leverages graph data (i.e., the inter-procedural control-flow graph) and dynamic coverage data (i.e., the code-coverage bitmap) to construct a dynamic graph and schedules seeds by their centrality scores on that graph. It significantly improves fuzzing performance over state-of-the-art seed schedulers on various fuzzers widely used in industry.
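As an illustration of ranking by graph centrality, here is a Katz-style centrality computed by fixed-point iteration on a toy directed graph; K-Scheduler's actual graph construction and scoring are more involved and not reproduced here.

```python
# Hedged sketch (toy graph, not K-Scheduler's construction):
# scoring nodes of a directed graph by Katz-style centrality,
# the kind of score a centrality-based scheduler can rank
# seeds with.

def katz_centrality(adj, alpha=0.1, iters=100):
    """adj[i] lists successors of node i. Iterates the fixed
    point x = alpha * A^T x + 1, so a node's score grows with
    the (discounted) number of paths leading into it."""
    n = len(adj)
    x = [1.0] * n
    for _ in range(iters):
        new = [1.0] * n
        for u, succs in enumerate(adj):
            for v in succs:
                new[v] += alpha * x[u]   # mass flows along each edge
        x = new
    return x

# Node 2 is reachable from both 0 and 1, so it scores highest;
# node 3 inherits a discounted share of node 2's score.
adj = [[2], [2], [3], []]
scores = katz_centrality(adj)
```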