Information criterion-based clustering with order-restricted candidate profiles in short time-course microarray experiments
Background: Time-course microarray experiments produce vector gene expression profiles across a series of time points. Clustering genes based on these profiles is important for discovering functionally related and co-regulated genes. Early clustering algorithms do not take advantage of the temporal ordering in a time-course study, explicit use of which should allow more sensitive detection of genes that display a consistent pattern over time. Peddada et al. [1] proposed a clustering algorithm that can incorporate the temporal ordering using order-restricted statistical inference. This algorithm is, however, very time-consuming and hence inapplicable to most microarray experiments, which contain a large number of genes. Its computational burden also makes it difficult to assess clustering reliability, an important measure when clustering noisy microarray data.

Results: We propose a computationally efficient information criterion-based clustering algorithm, called ORICC, that also accounts for the ordering in time-course microarray experiments by embedding order-restricted inference into a model selection framework. Genes are assigned to the profile they best match, as determined by a newly proposed information criterion for order-restricted inference. In addition, we developed a bootstrap procedure to assess ORICC's clustering reliability for every gene. Simulation studies show that the ORICC method is robust, always gives better clustering accuracy than Peddada's method, and is hundreds of times faster. Under some scenarios, its accuracy is also better than that of other existing clustering methods for short time-course microarray data, such as STEM [2] and Wang et al. [3]. It is also computationally much faster than Wang et al. [3].

Conclusion: Our ORICC algorithm, which takes advantage of the temporal ordering in time-course microarray experiments, provides good clustering accuracy while being much faster than Peddada's method. Moreover, the clustering reliability for each gene can be assessed, which is unavailable in Peddada's method. In a real data example, the ORICC algorithm identifies new and interesting genes that previous analyses failed to reveal.
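As a rough illustration of the idea behind order-restricted, information-criterion-based clustering (not the authors' ORICC implementation), the sketch below fits monotone candidate profiles by isotonic regression and assigns each gene to the profile minimizing a generic BIC-style score; the profile set and penalty are stand-ins for the paper's criterion.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_profile(y, increasing):
    # Best fit to the time course under a monotonicity constraint
    # (isotonic regression respects the temporal ordering).
    t = np.arange(len(y))
    return IsotonicRegression(increasing=increasing).fit_transform(t, y)

def criterion(y, fit):
    # Generic BIC-style score: fit quality plus a complexity penalty on
    # the number of distinct fitted levels (a stand-in for the paper's
    # order-restricted information criterion).
    n = len(y)
    rss = float(np.sum((y - fit) ** 2))
    k = len(np.unique(fit))
    return n * np.log(rss / n + 1e-12) + k * np.log(n)

def assign_gene(y):
    # Assign the gene to the candidate profile with the lowest score;
    # a real profile set would also include up-down/down-up shapes.
    scores = {name: criterion(y, fit_profile(y, inc))
              for name, inc in [("increasing", True), ("decreasing", False)]}
    return min(scores, key=scores.get)

print(assign_gene(np.array([0.1, 0.4, 0.9, 1.3, 1.2])))  # -> "increasing"
```

A bootstrap over resampled expression vectors, rerunning the assignment each time, would then estimate per-gene clustering reliability.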
SaaS: A situational awareness and analysis system for massive android malware detection
A large number of mobile applications (apps) are uploaded, distributed and updated every day in various Android markets, e.g., Google Play and Huawei AppGallery. One of the ongoing challenges in the daily security management of Android app markets is to detect malicious apps (malware) among those massive newcomers accurately and efficiently. Customers rely on those detection results when selecting apps to download, and undetected malware may result in great damage. In this paper, we propose a cloud-based malware detection system called SaaS that leverages and marries multiple approaches from diverse domains such as natural language processing (n-grams), image processing (GLCM), cryptography (fuzzy hashing), machine learning (random forests) and complex networks. We first extract n-gram features and GLCM features from an app's smali code and DEX file, respectively. We then feed those features into a training set to build a machine learning detection model. The model is further enhanced by fuzzy hashing to detect whether an inspected app is repackaged. Extensive experiments (involving 1495 samples) demonstrate a detection accuracy of more than 98.5% and show that the system supports large-scale detection and monitoring. Moreover, the proposed system can be deployed as a service in the cloud, and customers can access cloud services on demand.
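To make the pipeline concrete, here is a minimal sketch of the n-gram-plus-random-forest stage, assuming opcode sequences have already been parsed from smali code; the toy data and feature choices are illustrative, not the SaaS system's actual configuration.

```python
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

def ngram_features(opcodes, vocab, n=3):
    # Count each length-n opcode window; 'vocab' fixes the feature
    # order so every app maps to a vector of the same length.
    grams = Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))
    return [grams.get(g, 0) for g in vocab]

# Toy stand-ins for opcode sequences parsed from disassembled APKs
# (label 1 = malware, 0 = benign).
apps = [
    ["invoke-virtual", "move-result", "const-string", "invoke-virtual"],
    ["const/4", "if-eqz", "return-void", "const/4"],
]
labels = [1, 0]

n = 3
vocab = sorted({tuple(a[i:i + n]) for a in apps for i in range(len(a) - n + 1)})
X = [ngram_features(a, vocab, n) for a in apps]

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
print(clf.predict(X))
```

In the described system, GLCM features from the DEX file would be concatenated alongside these vectors, and a fuzzy-hash comparison would flag repackaged apps separately.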
Privacy Intelligence: A Survey on Image Sharing on Online Social Networks
Image sharing on online social networks (OSNs) has become an indispensable
part of daily social activities, but it has also led to an increased risk of
privacy invasion. The recent image leaks from popular OSN services and the
abuse of personal photos using advanced algorithms (e.g. DeepFake) have
prompted the public to rethink individual privacy needs when sharing images on
OSNs. However, OSN image sharing itself is relatively complicated, and systems
currently in place to manage privacy in practice are labor-intensive yet fail
to provide personalized, accurate and flexible privacy protection. As a result,
a more intelligent environment for privacy-friendly OSN image sharing is in
demand. To fill the gap, we contribute a systematic survey of 'privacy
intelligence' solutions that target modern privacy issues related to OSN image
sharing. Specifically, we present a high-level analysis framework based on the
entire lifecycle of OSN image sharing to address the various privacy issues and
solutions facing this interdisciplinary field. The framework is divided into
three main stages: local management, online management and social experience.
At each stage, we identify typical sharing-related user behaviors and the privacy
issues they generate, and we review representative intelligent
solutions. The resulting analysis describes an intelligent privacy-enhancing
chain for closed-loop privacy management. We also discuss the challenges and
future directions existing at each stage, as well as in publicly available
datasets.

Comment: 32 pages, 9 figures. Under review.
Combining conditional and unconditional moment restrictions with missing responses
Many statistical models, e.g. regression models, can be viewed as conditional moment restrictions when no distributional assumptions are placed on the error term. For such models, several estimators that achieve the semiparametric efficiency bound have been proposed. However, in many studies, auxiliary information is available in the form of unconditional moment restrictions. We also consider the presence of missing responses. We propose the combined empirical likelihood (CEL) estimator to incorporate such auxiliary information and improve the estimation efficiency of conditional moment restriction models. We show that, when responses are assumed to be strongly ignorable missing at random, the CEL estimator achieves better efficiency than previous estimators by utilizing the auxiliary information. Based on the asymptotic properties of the CEL estimator, we also develop Wilks-type tests and corresponding confidence regions for the model parameter and the mean response. Since kernel smoothing is used, the CEL method may have difficulty with high-dimensional covariates. In such situations, we propose an instrumental variable-based empirical likelihood (IVEL) method to handle this problem. The merits of the CEL and IVEL methods are further illustrated through simulation studies.
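For orientation, a minimal sketch of a combined empirical likelihood in illustrative notation (not the paper's exact formulation) looks as follows.

```latex
% Model: a conditional moment restriction with parameter \theta,
%   E\{ g(Y, X; \theta) \mid X \} = 0,
% plus auxiliary information as an unconditional restriction,
%   E\{ h(Y, X; \theta) \} = 0.
% A combined empirical likelihood profiles probabilities p_i over the
% observed data subject to both sets of estimated constraints:
\begin{aligned}
\hat\theta_{\mathrm{CEL}} = \arg\max_{\theta}\;
  \max_{p_1,\dots,p_n} \sum_{i=1}^{n} \log p_i
\quad \text{s.t.} \quad
 & p_i \ge 0, \qquad \sum_{i=1}^{n} p_i = 1, \\
 & \sum_{i=1}^{n} p_i\, \hat g_i(\theta) = 0, \qquad
   \sum_{i=1}^{n} p_i\, h(Y_i, X_i; \theta) = 0,
\end{aligned}
% where \hat g_i(\theta) is a kernel-smoothed estimate of the conditional
% moment at X_i, and missing responses are handled through the complete
% cases under the missing-at-random assumption. Wilks-type tests compare
% the empirical log-likelihood ratio to its chi-squared limit.
```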
A Consistent Cosmic Shear Analysis in Harmonic and Real Space
Recent cosmic shear analyses have exhibited inconsistencies of up to
between the inferred cosmological parameters when analyzing summary
statistics in real space versus harmonic space. In this paper, we demonstrate
the consistent measurement and analysis of cosmic shear two-point functions in
harmonic and real space using the MASTER algorithm. This algorithm
provides a consistent prescription to model the survey window effects and scale
cuts in both real space (due to observational systematics) and harmonic space
(due to model limitations), resulting in a consistent estimation of the cosmic
shear power spectrum from both harmonic and real space estimators. We show that
the MASTER algorithm gives consistent results using measurements
from the HSC Y1 mock shape catalogs in both real and harmonic space, resulting
in consistent inferences of . This method
provides an unbiased estimate of the cosmic shear power spectrum, and
inference that has a correlation coefficient of 0.997 between analyses using
measurements in real space and harmonic space. We observe the mean difference
between the two inferred values to be 0.0004, far below the observed
difference of 0.042 for the published HSC Y1 analyses and well below the
statistical uncertainties. While the notation employed in this paper is
specific to photometric galaxy surveys, the methods are equally applicable and
can be extended to spectroscopic galaxy surveys, intensity mapping, and CMB
surveys.

Comment: 17 pages, 6 figures. For submission to MNRAS.
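For reference, the standard relations underlying such an analysis (in our own notation, not necessarily the paper's) connect the mask-convolved pseudo power spectrum to the true spectrum through a mode-coupling matrix, and express the real-space two-point functions as transforms of the same spectra:

```latex
% Pseudo-C_ell (MASTER) mode coupling induced by the survey window:
\tilde{C}_\ell \;=\; \sum_{\ell'} M_{\ell\ell'}\, C_{\ell'} ,
% and the real-space shear two-point functions as transforms of the
% E/B-mode power spectra (d^\ell_{2,\pm 2} are Wigner d-functions):
\xi_\pm(\theta) \;=\; \sum_{\ell} \frac{2\ell+1}{4\pi}\,
  d^\ell_{2,\pm 2}(\theta)\, \bigl( C^{EE}_\ell \pm C^{BB}_\ell \bigr).
```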
Photometric Redshift Uncertainties in Weak Gravitational Lensing Shear Analysis: Models and Marginalization
Recovering credible cosmological parameter constraints in a weak lensing
shear analysis requires an accurate model that can be used to marginalize over
nuisance parameters describing potential sources of systematic uncertainty,
such as the uncertainties on the sample redshift distribution n(z). Due to
the challenge of running Markov chain Monte Carlo (MCMC) in the
high-dimensional parameter spaces in which the n(z) uncertainties may be
parameterized, it is common practice to simplify the parameterization or to
combine MCMC chains that each have a fixed n(z) resampled from the
uncertainties. In this work, we propose a statistically principled Bayesian
resampling approach for marginalizing over the n(z) uncertainty using
multiple MCMC chains. We self-consistently compare the new method to existing
ones from the literature in the context of a forecasted cosmic shear analysis
for the HSC three-year shape catalog, and find that these methods recover
similar cosmological parameter constraints, implying that using the most
computationally efficient of the approaches is appropriate. However, we find
that for datasets with the constraining power of the full HSC survey dataset
(and, by implication, those upcoming surveys with even tighter constraints),
the choice of method for marginalizing over the n(z) uncertainty among the
several methods from the literature may significantly impact the statistical
uncertainties on cosmological parameters, and a careful model selection is
needed to ensure credible parameter intervals.

Comment: 15 pages, 8 figures, submitted to MNRAS.
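As a toy illustration of the chain-combination idea (with hypothetical draw_nz and run_mcmc placeholders standing in for a real survey pipeline, and not the paper's Bayesian resampling scheme itself), pooling chains run at fixed resampled n(z) realizations approximates the marginal posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_nz():
    # One realization from the n(z) uncertainty model, reduced here to
    # a single shift parameter for illustration.
    return rng.normal(0.8, 0.05)

def run_mcmc(nz_shift, n_samples=1000):
    # Pretend posterior: a parameter whose value depends on the fixed
    # n(z) realization used in that chain.
    return rng.normal(0.78 + 0.1 * (nz_shift - 0.8), 0.02, size=n_samples)

# Marginalize over n(z) by pooling chains, each run at a fixed draw:
chains = [run_mcmc(draw_nz()) for _ in range(50)]
marginal = np.concatenate(chains)
print(marginal.mean(), marginal.std())
```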
A Differentiable Perturbation-based Weak Lensing Shear Estimator
Upcoming imaging surveys will use weak gravitational lensing to study the
large-scale structure of the Universe, demanding sub-percent accuracy for
precise cosmic shear measurements. We present a new differentiable
implementation of our perturbation-based shear estimator (FPFS), using JAX,
which is publicly available as part of a new suite of analytic shear algorithms
called AnaCal. This code can analytically calibrate the shear response of any
nonlinear observable constructed with the FPFS shapelets and detection modes
utilizing auto-differentiation (AD), generalizing the formalism to include a
family of shear estimators with corrections for detection and selection biases.
Using the AD capability of JAX, it calculates the full Hessian matrix of the
non-linear observables, which improves the previously presented second-order
noise bias correction in the shear estimation. As an illustration of the power
of the new AnaCal framework, we optimize the effective galaxy number density in
the space of the generalized shear estimators using an LSST-like galaxy image
simulation for the ten-year LSST. For the generic shear estimator, the
magnitude of the multiplicative bias is below (99.7%
confidence interval), and the effective galaxy number density is improved by
5%. We also discuss some planned future additions to the AnaCal software suite
to extend its applicability beyond the FPFS measurements.

Comment: 9 pages, 7 figures, submitted to MNRAS.
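To illustrate the kind of auto-differentiation machinery involved (a generic sketch, not AnaCal's actual code), jax.hessian supplies the full Hessian of a nonlinear observable, from which a second-order noise bias correction of the form 0.5 * Tr(Sigma H) follows:

```python
import jax
import jax.numpy as jnp

# Toy nonlinear observable built from linear image moments x (a
# stand-in for an FPFS-like ellipticity, not AnaCal's estimator).
def observable(x):
    return x[0] / (1.0 + x[1])

x = jnp.array([0.3, 2.0])         # measured (noisy) moments
noise_cov = 0.01 * jnp.eye(2)     # moment covariance from pixel noise

# Auto-differentiation gives the full Hessian of the observable, from
# which the second-order noise bias E[f(x+n)] - f(x) ~ 0.5*Tr(Sigma H)
# can be subtracted.
H = jax.hessian(observable)(x)
bias = 0.5 * jnp.trace(noise_cov @ H)
print(observable(x) - bias)
```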
Machine Unlearning: A Survey
Machine learning has attracted widespread attention and evolved into an
enabling technology for a wide range of highly successful applications, such as
intelligent computer vision, speech recognition, medical diagnosis, and more.
Yet a special need has arisen: due to privacy, usability, and/or the
right to be forgotten, information about specific samples sometimes needs to be
removed from a trained model, a problem known as machine unlearning. This emerging technology has
drawn significant interest from both academics and industry due to its
innovation and practicality. At the same time, this ambitious problem has led
to numerous research efforts aimed at confronting its challenges. To the best
of our knowledge, no study has analyzed this complex topic or compared the
feasibility of existing unlearning solutions in different kinds of scenarios.
Accordingly, with this survey, we aim to capture the key concepts of unlearning
techniques. The existing solutions are classified and summarized based on their
characteristics within an up-to-date and comprehensive review of each
category's advantages and limitations. The survey concludes by highlighting
some of the outstanding issues with unlearning techniques, along with some
feasible directions for new research opportunities.
How Does a Deep Learning Model Architecture Impact Its Privacy? A Comprehensive Study of Privacy Attacks on CNNs and Transformers
As a booming research area in the past decade, deep learning technologies
have been driven by big data collected and processed on an unprecedented scale.
However, privacy concerns arise due to the potential leakage of sensitive
information from the training data. Recent research has revealed that deep
learning models are vulnerable to various privacy attacks, including membership
inference attacks, attribute inference attacks, and gradient inversion attacks.
Notably, the efficacy of these attacks varies from model to model. In this
paper, we answer a fundamental question: Does model architecture affect model
privacy? By investigating representative model architectures from CNNs to
Transformers, we demonstrate that Transformers generally exhibit higher
vulnerability to privacy attacks compared to CNNs. Additionally, we identify
the micro design of activation layers, stem layers, and LN layers as major
factors contributing to the resilience of CNNs against privacy attacks, while
the presence of attention modules is another main factor that exacerbates the
privacy vulnerability of Transformers. Our discovery reveals valuable insights
for deep learning models to defend against privacy attacks and inspires the
research community to develop privacy-friendly model architectures.

Comment: Under review.
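For context, membership inference (one of the attack classes studied) can be illustrated with a minimal loss-thresholding baseline; the predict function below is a hypothetical stand-in for an attacked classifier, and real attacks are considerably more sophisticated:

```python
import numpy as np

def per_example_loss(predict, x, y):
    # Cross-entropy of the attacked model's prediction for the true label;
    # training members are typically fit more confidently (lower loss).
    p = predict(x)
    return -np.log(p[y] + 1e-12)

def infer_membership(predict, samples, threshold):
    # Guess "member" when the loss falls below the threshold.
    return [per_example_loss(predict, x, y) < threshold for x, y in samples]

# Dummy stand-in for a trained two-class model (for illustration only).
def predict(x):
    logits = np.array([x.sum(), -x.sum()])
    e = np.exp(logits - logits.max())
    return e / e.sum()

samples = [(np.array([0.9, 0.4]), 0), (np.array([-0.2, 0.1]), 1)]
print(infer_membership(predict, samples, threshold=0.5))
```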
On-the-fly Denoising for Data Augmentation in Natural Language Understanding
Data Augmentation (DA) is frequently used to automatically provide additional
training data without extra human annotation. However, data augmentation may
introduce noisy data that impairs training. To guarantee the quality of
augmented data, existing methods either assume no noise exists in the augmented
data and adopt consistency training or use simple heuristics such as training
loss and diversity constraints to filter out "noisy" data. However, those
filtered examples may still contain useful information, and dropping them
completely causes a loss of supervision signals. In this paper, based on the
assumption that the original dataset is cleaner than the augmented data, we
propose an on-the-fly denoising technique for data augmentation that learns
from soft augmented labels provided by an organic teacher model trained on the
cleaner original data. To further prevent overfitting on noisy labels, a simple
self-regularization module is applied to force the model prediction to be
consistent across two distinct dropouts. Our method can be applied to general
augmentation techniques and consistently improve the performance on both text
classification and question-answering tasks.

Comment: Findings of EACL 202
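A minimal sketch of the two ingredients described above, assuming PyTorch and hypothetical student/teacher modules with dropout enabled (not the authors' released code):

```python
import torch
import torch.nn.functional as F

def denoised_da_loss(student, teacher, x_aug, alpha=1.0, beta=0.5):
    # Soft labels from the "organic" teacher trained on the cleaner
    # original data; treated as fixed targets (no gradient).
    with torch.no_grad():
        soft = F.softmax(teacher(x_aug), dim=-1)

    # Two forward passes in train mode differ only in their dropout masks.
    logits1, logits2 = student(x_aug), student(x_aug)

    # Learn from the teacher's soft augmented labels instead of the
    # possibly noisy hard labels.
    distill = F.kl_div(F.log_softmax(logits1, dim=-1), soft,
                       reduction="batchmean")

    # Self-regularization: predictions must agree across the two dropouts.
    p1, p2 = F.log_softmax(logits1, dim=-1), F.log_softmax(logits2, dim=-1)
    consist = 0.5 * (F.kl_div(p1, p2.exp(), reduction="batchmean")
                     + F.kl_div(p2, p1.exp(), reduction="batchmean"))

    return alpha * distill + beta * consist
```

Call this inside the training loop with student.train() so that dropout is active; the weights alpha and beta are illustrative hyperparameters.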