
    Information criterion-based clustering with order-restricted candidate profiles in short time-course microarray experiments

    Background: Time-course microarray experiments produce vector gene expression profiles across a series of time points. Clustering genes based on these profiles is important for discovering functionally related and co-regulated genes. Early clustering algorithms do not exploit the ordering inherent in a time-course study, although explicit use of it should allow more sensitive detection of genes that display a consistent pattern over time. Peddada et al. [1] proposed a clustering algorithm that incorporates the temporal ordering using order-restricted statistical inference. That algorithm is, however, very time-consuming and hence inapplicable to most microarray experiments, which contain a large number of genes. Its computational burden also makes it difficult to assess clustering reliability, an important measure when clustering noisy microarray data. Results: We propose a computationally efficient information criterion-based clustering algorithm, called ORICC, that also accounts for the ordering in time-course microarray experiments by embedding order-restricted inference in a model selection framework. Each gene is assigned to its best-matching profile, as determined by a newly proposed information criterion for order-restricted inference. In addition, we developed a bootstrap procedure to assess ORICC's clustering reliability for every gene. Simulation studies show that ORICC is robust, always achieves better clustering accuracy than Peddada's method, and is hundreds of times faster. Under some scenarios, its accuracy also exceeds that of other existing clustering methods for short time-course microarray data, such as STEM [2] and Wang et al. [3]. It is also computationally much faster than Wang et al. [3]. Conclusion: Our ORICC algorithm, which takes advantage of the temporal ordering in time-course microarray experiments, provides good clustering accuracy and is much faster than Peddada's method. Moreover, the clustering reliability of each gene can be assessed, which is not possible with Peddada's method. In a real data example, the ORICC algorithm identifies new and interesting genes that previous analyses failed to reveal.
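The profile-assignment idea can be sketched in a few lines. The toy below (with a made-up BIC-like score, not the published ORICC criterion) fits each gene's expression vector to order-restricted candidate profiles via the pool-adjacent-violators algorithm and assigns the gene to the best-scoring profile; all function names are hypothetical.

```python
# Illustrative sketch of order-restricted profile assignment (hypothetical
# scoring, not the published ORICC criterion).
import math

def pava_increasing(y):
    """Least-squares monotone non-decreasing fit (pool adjacent violators)."""
    vals = [[v, 1] for v in y]           # [block mean, block size]
    i = 0
    while i < len(vals) - 1:
        if vals[i][0] > vals[i + 1][0]:  # violation: merge adjacent blocks
            m = (vals[i][0] * vals[i][1] + vals[i + 1][0] * vals[i + 1][1]) \
                / (vals[i][1] + vals[i + 1][1])
            vals[i] = [m, vals[i][1] + vals[i + 1][1]]
            del vals[i + 1]
            i = max(i - 1, 0)            # re-check the previous block
        else:
            i += 1
    fit = []
    for m, n in vals:
        fit.extend([m] * n)
    return fit

def assign_profile(y):
    """Pick 'increasing', 'decreasing', or 'flat' by a BIC-like score."""
    n = len(y)
    candidates = {
        "increasing": pava_increasing(y),
        "decreasing": [-v for v in pava_increasing([-v for v in y])],
        "flat": [sum(y) / n] * n,
    }
    def score(fit, k):
        rss = sum((a - b) ** 2 for a, b in zip(y, fit)) + 1e-12
        return n * math.log(rss / n) + k * math.log(n)
    # crude complexity k: number of distinct fitted levels
    return min(candidates, key=lambda c: score(candidates[c], len(set(candidates[c]))))

print(assign_profile([0.1, 0.5, 0.9, 1.4]))   # a clearly rising profile
```

Real order-restricted inference uses richer candidate shapes (up-down, down-up, cyclic) and a criterion derived for constrained estimation; this sketch only conveys the fit-then-select structure.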

    SaaS: A situational awareness and analysis system for massive android malware detection

    A large number of mobile applications (Apps) are uploaded, distributed and updated every day in various Android markets, e.g., Google Play and Huawei AppGallery. One of the ongoing challenges is to detect malicious Apps (also known as malware) among these massive newcomers accurately and efficiently in the daily security management of Android App markets. Customers rely on those detection results when selecting Apps to download, and undetected malware may result in great damage. In this paper, we propose a cloud-based malware detection system called SaaS by leveraging and marrying multiple approaches from diverse domains such as natural language processing (n-gram), image processing (GLCM), cryptography (fuzzy hash), machine learning (random forest) and complex networks. We first extract n-gram features and GLCM features from an App's smali code and DEX file, respectively. We then feed those features into a training data set to create a machine learning detection model. The model is further enhanced by fuzzy hashing to detect whether an inspected App is repackaged. Extensive experiments (involving 1495 samples) demonstrate that the detection accuracy exceeds 98.5% and that the system supports large-scale detection and monitoring. Moreover, our proposed system can be deployed as a service in clouds, and customers can access cloud services on demand.
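The n-gram feature step mentioned above can be illustrated with a short sketch; in a real pipeline the opcode sequence would come from disassembled smali code, and the function name and sample opcodes here are only illustrative.

```python
# Sketch of n-gram feature extraction over an opcode sequence (illustrative:
# real pipelines disassemble the App's smali/DEX to obtain the opcodes).
from collections import Counter

def ngram_features(opcodes, n=3):
    """Count sliding n-grams over an opcode sequence."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))

seq = ["invoke-virtual", "move-result", "if-eqz", "invoke-virtual", "move-result"]
feats = ngram_features(seq, n=2)
print(feats[("invoke-virtual", "move-result")])  # this bigram occurs twice
```

Counts like these become a sparse feature vector per App, which is what a random forest (or any tabular classifier) consumes.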

    Privacy Intelligence: A Survey on Image Sharing on Online Social Networks

    Image sharing on online social networks (OSNs) has become an indispensable part of daily social activities, but it has also led to an increased risk of privacy invasion. The recent image leaks from popular OSN services and the abuse of personal photos using advanced algorithms (e.g. DeepFake) have prompted the public to rethink individual privacy needs when sharing images on OSNs. However, OSN image sharing itself is relatively complicated, and the systems currently in place to manage privacy in practice are labor-intensive yet fail to provide personalized, accurate and flexible privacy protection. As a result, a more intelligent environment for privacy-friendly OSN image sharing is in demand. To fill the gap, we contribute a systematic survey of 'privacy intelligence' solutions that target modern privacy issues related to OSN image sharing. Specifically, we present a high-level analysis framework based on the entire lifecycle of OSN image sharing to address the various privacy issues and solutions facing this interdisciplinary field. The framework is divided into three main stages: local management, online management and social experience. At each stage, we identify typical sharing-related user behaviors and the privacy issues generated by those behaviors, and review representative intelligent solutions. The resulting analysis describes an intelligent privacy-enhancing chain for closed-loop privacy management. We also discuss the challenges and future directions existing at each stage, as well as in publicly available datasets. Comment: 32 pages, 9 figures. Under review.

    Combining conditional and unconditional moment restrictions with missing responses

    Many statistical models, e.g. regression models, can be viewed as conditional moment restrictions when distributional assumptions on the error term are not imposed. For such models, several estimators that achieve the semiparametric efficiency bound have been proposed. However, in many studies, auxiliary information is available in the form of unconditional moment restrictions. We also consider the presence of missing responses. We propose the combined empirical likelihood (CEL) estimator to incorporate such auxiliary information and improve the estimation efficiency of conditional moment restriction models. We show that, when responses are assumed strongly ignorable missing at random, the CEL estimator achieves better efficiency than previous estimators owing to its use of the auxiliary information. Based on the asymptotic properties of the CEL estimator, we also develop Wilks'-type tests and corresponding confidence regions for the model parameter and the mean response. Since kernel smoothing is used, the CEL method may have difficulty with high-dimensional covariates. In such situations, we propose an instrumental variable-based empirical likelihood (IVEL) method to handle the problem. The merits of the CEL and IVEL methods are further illustrated through simulation studies.
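To make the empirical-likelihood machinery concrete, here is a minimal sketch for a single unconditional moment restriction E[g(X)] = 0. The actual CEL/IVEL estimators also combine conditional restrictions and handle missing responses, which this toy omits; the data and function names are invented.

```python
# Minimal empirical likelihood for one moment restriction E[g(X)] = 0:
# maximize prod(p_i) subject to sum(p_i) = 1 and sum(p_i * g_i) = 0,
# which yields p_i = 1 / (n * (1 + lam * g_i)) with lam solving
# sum_i g_i / (1 + lam * g_i) = 0.
def el_weights(g):
    """Solve for lam by bisection (the objective is decreasing in lam),
    then return the empirical likelihood weights p_i."""
    n = len(g)
    # keep 1 + lam * g_i > 0 for all i (requires mixed-sign g)
    lo, hi = -1.0 / max(g) + 1e-6, -1.0 / min(g) - 1e-6
    for _ in range(200):
        lam = 0.5 * (lo + hi)
        s = sum(gi / (1.0 + lam * gi) for gi in g)
        if s > 0:
            lo = lam
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    return [1.0 / (n * (1.0 + lam * gi)) for gi in g]

g = [x - 1.2 for x in [0.5, 1.0, 1.5, 2.0]]   # test moment: mean equals 1.2
p = el_weights(g)
# weights sum to 1 and satisfy the moment restriction
print(round(sum(p), 6), round(sum(pi * gi for pi, gi in zip(p, g)), 6))
```

Observations consistent with the restriction keep near-uniform weight 1/n; discordant ones are down-weighted, which is how the auxiliary information sharpens estimation.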

    A Consistent Cosmic Shear Analysis in Harmonic and Real Space

    Recent cosmic shear analyses have exhibited inconsistencies of up to 1σ between the inferred cosmological parameters when analyzing summary statistics in real space versus harmonic space. In this paper, we demonstrate the consistent measurement and analysis of cosmic shear two-point functions in harmonic and real space using the MASTER algorithm. This algorithm provides a consistent prescription to model the survey window effects and scale cuts in both real space (due to observational systematics) and harmonic space (due to model limitations), resulting in a consistent estimation of the cosmic shear power spectrum from both harmonic and real space estimators. We show that the MASTER algorithm gives consistent results using measurements from the HSC Y1 mock shape catalogs in both real and harmonic space, resulting in consistent inferences of S_8 = σ_8(Ω_m/0.3)^0.5. This method provides an unbiased estimate of the cosmic shear power spectrum, and an S_8 inference that has a correlation coefficient of 0.997 between analyses using measurements in real space and harmonic space. We observe the mean difference between the two inferred S_8 values to be 0.0004, far below the observed difference of 0.042 for the published HSC Y1 analyses and well below the statistical uncertainties. While the notation employed in this paper is specific to photometric galaxy surveys, the methods are equally applicable and can be extended to spectroscopic galaxy surveys, intensity mapping, and CMB surveys. Comment: 17 pages, 6 figures. For submission to MNRAS.
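The MASTER-style forward model can be illustrated with a toy example: the observed (pseudo) spectrum is the true spectrum multiplied by a mode-coupling matrix induced by the survey window, and the true spectrum is recovered by inverting that matrix. All numbers below are made up; real analyses compute the coupling matrix from the window's power spectrum over many multipole bins.

```python
# Toy MASTER-style deconvolution: C_obs = M @ C_true, so C_true = M^{-1} @ C_obs.
def matmul_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def solve_2x2(M, b):
    """Solve the 2x2 linear system M x = b by direct inversion."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * b[0] - M[0][1] * b[1]) / det,
            (M[0][0] * b[1] - M[1][0] * b[0]) / det]

M = [[0.9, 0.1],                 # hypothetical mode-coupling matrix
     [0.2, 0.8]]                 # (rows: observed bins, cols: true bins)
cl_true = [2.0e-9, 1.0e-9]       # made-up band powers
cl_obs = matmul_vec(M, cl_true)  # pseudo-spectrum under the survey window
cl_rec = solve_2x2(M, cl_obs)    # deconvolved spectrum matches cl_true
print([round(c / 1e-9, 6) for c in cl_rec])
```

Because the same M encodes the window and scale cuts for both the harmonic-space estimator and (after transformation) the real-space correlation functions, the two analyses see a consistently defined power spectrum, which is the consistency property the paper exploits.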

    Photometric Redshift Uncertainties in Weak Gravitational Lensing Shear Analysis: Models and Marginalization

    Recovering credible cosmological parameter constraints in a weak lensing shear analysis requires an accurate model that can be used to marginalize over nuisance parameters describing potential sources of systematic uncertainty, such as the uncertainties on the sample redshift distribution n(z). Due to the challenge of running Markov Chain Monte Carlo (MCMC) in the high-dimensional parameter spaces in which the n(z) uncertainties may be parameterized, it is common practice to simplify the n(z) parameterization or combine MCMC chains that each have a fixed n(z) resampled from the n(z) uncertainties. In this work, we propose a statistically principled Bayesian resampling approach for marginalizing over the n(z) uncertainty using multiple MCMC chains. We self-consistently compare the new method to existing ones from the literature in the context of a forecasted cosmic shear analysis for the HSC three-year shape catalog, and find that these methods recover similar cosmological parameter constraints, implying that using the most computationally efficient of the approaches is appropriate. However, we find that for datasets with the constraining power of the full HSC survey dataset (and, by implication, those upcoming surveys with even tighter constraints), the choice of method for marginalizing over n(z) uncertainty among the several methods from the literature may significantly impact the statistical uncertainties on cosmological parameters, and careful model selection is needed to ensure credible parameter intervals. Comment: 15 pages, 8 figures, submitted to MNRAS.
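The chain-combination practice described above can be sketched as follows: each chain is run with one fixed n(z) draw, and marginalization is approximated by pooling the chains' samples. This equal-weight pooling is a simplification of the paper's statistically principled resampling, and the scalar "shift" parameterization of the n(z) uncertainty is invented purely for illustration.

```python
# Sketch of marginalizing over n(z) uncertainty by combining MCMC chains.
# Each chain conditions on one fixed n(z) realization; pooling the chains'
# samples approximates the n(z)-marginal posterior (equal weights here).
import random

random.seed(0)

def run_chain(nz_shift, n_samples=5000):
    """Stand-in for an MCMC chain: an S8 posterior whose mean depends on
    the assumed n(z) (a hypothetical scalar shift)."""
    return [random.gauss(0.80 + 0.05 * nz_shift, 0.02) for _ in range(n_samples)]

nz_draws = [random.gauss(0.0, 1.0) for _ in range(10)]   # n(z) uncertainty draws
pooled = [s for shift in nz_draws for s in run_chain(shift)]
mean = sum(pooled) / len(pooled)
var = sum((s - mean) ** 2 for s in pooled) / len(pooled)
print(round(mean, 3), round(var ** 0.5, 3))  # marginal mean and width
```

The pooled width exceeds any single chain's width because it folds in the scatter of the per-chain means, which is exactly the n(z) contribution to the error budget that the marginalization is meant to capture.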

    A Differentiable Perturbation-based Weak Lensing Shear Estimator

    Upcoming imaging surveys will use weak gravitational lensing to study the large-scale structure of the Universe, demanding sub-percent accuracy for precise cosmic shear measurements. We present a new differentiable implementation of our perturbation-based shear estimator (FPFS), using JAX, which is publicly available as part of a new suite of analytic shear algorithms called AnaCal. This code can analytically calibrate the shear response of any nonlinear observable constructed with the FPFS shapelets and detection modes utilizing auto-differentiation (AD), generalizing the formalism to include a family of shear estimators with corrections for detection and selection biases. Using the AD capability of JAX, it calculates the full Hessian matrix of the non-linear observables, which improves the previously presented second-order noise bias correction in the shear estimation. As an illustration of the power of the new AnaCal framework, we optimize the effective galaxy number density in the space of the generalized shear estimators using an LSST-like galaxy image simulation for the ten-year LSST. For the generic shear estimator, the magnitude of the multiplicative bias |m| is below 3×10⁻³ (99.7% confidence interval), and the effective galaxy number density is improved by 5%. We also discuss some planned future additions to the AnaCal software suite to extend its applicability beyond the FPFS measurements. Comment: 9 pages, 7 figures, submitted to MNRAS.
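The role of automatic differentiation in calibrating a shear response can be shown without JAX using forward-mode dual numbers: the derivative of a nonlinear observable with respect to shear g is computed exactly rather than by finite differences. The observable e(g) below is invented and is not the FPFS estimator; the real code differentiates shapelet and detection modes.

```python
# Toy forward-mode autodiff (dual numbers): carries (value, derivative)
# through arithmetic, giving the exact shear response d(e)/d(g).
class Dual:
    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.eps + o.eps)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.val * o.eps + self.eps * o.val)
    def __truediv__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val / o.val,
                    (self.eps * o.val - self.val * o.eps) / (o.val * o.val))

def observable(g):
    return g / (1 + g * g)      # hypothetical nonlinear ellipticity estimator

g = Dual(0.02, 1.0)             # seed the derivative d/dg
e = observable(g)
print(round(e.val, 6), round(e.eps, 6))  # value and exact shear response
```

Dividing the measured ensemble average of e by the response e.eps is the calibration step; AD makes that response available for arbitrarily composed nonlinear observables, which is what enables the "family of shear estimators" the abstract mentions.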

    Machine Unlearning: A Survey

    Machine learning has attracted widespread attention and evolved into an enabling technology for a wide range of highly successful applications, such as intelligent computer vision, speech recognition, medical diagnosis, and more. Yet a special need has arisen: due to privacy, usability, and/or the right to be forgotten, information about specific samples sometimes needs to be removed from a trained model, a task known as machine unlearning. This emerging technology has drawn significant interest from both academia and industry due to its innovation and practicality. At the same time, this ambitious problem has led to numerous research efforts aimed at confronting its challenges. To the best of our knowledge, no study has analyzed this complex topic or compared the feasibility of existing unlearning solutions in different kinds of scenarios. Accordingly, with this survey, we aim to capture the key concepts of unlearning techniques. The existing solutions are classified and summarized based on their characteristics within an up-to-date and comprehensive review of each category's advantages and limitations. The survey concludes by highlighting some of the outstanding issues with unlearning techniques, along with feasible directions for new research opportunities.
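One concrete exact-unlearning strategy in this literature is SISA-style sharding (Bourtoule et al.): split the training data into shards, train one sub-model per shard, aggregate predictions, and on a deletion request retrain only the shard that held the sample. The sketch below uses a trivial mean predictor as a stand-in model; everything except the sharding idea itself is invented for illustration.

```python
# SISA-style sketch: deleting a sample retrains only its own shard,
# so the deleted sample provably no longer influences the model.
def train_shard(shard):
    return sum(shard) / len(shard) if shard else 0.0   # toy sub-model

def predict(models):
    return sum(models) / len(models)                   # aggregate by averaging

shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
models = [train_shard(s) for s in shards]

# Unlearn the sample 4.0: locate its shard, remove it, retrain ONLY that shard.
target = 4.0
for i, s in enumerate(shards):
    if target in s:
        s.remove(target)
        models[i] = train_shard(s)     # the other shards are untouched
        break

print(round(predict(models), 4))
```

The appeal is that the post-deletion model is identical to one trained from scratch without the sample, at a fraction of full retraining cost; the trade-off is the accuracy and storage overhead of maintaining per-shard models.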

    How Does a Deep Learning Model Architecture Impact Its Privacy? A Comprehensive Study of Privacy Attacks on CNNs and Transformers

    As a booming research area in the past decade, deep learning technologies have been driven by big data collected and processed on an unprecedented scale. However, privacy concerns arise due to the potential leakage of sensitive information from the training data. Recent research has revealed that deep learning models are vulnerable to various privacy attacks, including membership inference attacks, attribute inference attacks, and gradient inversion attacks. Notably, the efficacy of these attacks varies from model to model. In this paper, we answer a fundamental question: does model architecture affect model privacy? By investigating representative model architectures from CNNs to Transformers, we demonstrate that Transformers generally exhibit higher vulnerability to privacy attacks than CNNs. Additionally, we identify the micro design of activation layers, stem layers, and LN layers as major factors contributing to the resilience of CNNs against privacy attacks, while the presence of attention modules is another main factor that exacerbates the privacy vulnerability of Transformers. Our discovery reveals valuable insights for deep learning models to defend against privacy attacks and inspires the research community to develop privacy-friendly model architectures. Comment: Under review.
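Of the attack families listed above, membership inference has a particularly simple baseline: guess that a sample was a training member if the model's loss on it falls below a threshold, since overfit models tend to have lower loss on members. The losses and threshold below are made up for illustration; real attacks calibrate the threshold with shadow models.

```python
# Minimal loss-threshold membership inference attack on made-up losses.
def membership_guess(loss, threshold=0.5):
    return loss < threshold     # True -> predicted training member

member_losses = [0.05, 0.10, 0.30, 0.20]       # hypothetical train-set losses
nonmember_losses = [0.90, 0.70, 0.40, 1.20]    # hypothetical test-set losses

tp = sum(membership_guess(l) for l in member_losses)
tn = sum(not membership_guess(l) for l in nonmember_losses)
acc = (tp + tn) / (len(member_losses) + len(nonmember_losses))
print(acc)   # attack accuracy on this toy data
```

An architecture that generalizes better narrows the member/non-member loss gap, pushing this accuracy toward the 0.5 chance level, which is one mechanism behind the CNN-vs-Transformer differences the paper studies.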

    On-the-fly Denoising for Data Augmentation in Natural Language Understanding

    Data Augmentation (DA) is frequently used to automatically provide additional training data without extra human annotation. However, data augmentation may introduce noisy data that impairs training. To guarantee the quality of augmented data, existing methods either assume no noise exists in the augmented data and adopt consistency training, or use simple heuristics such as training loss and diversity constraints to filter out "noisy" data. However, those filtered examples may still contain useful information, and dropping them completely causes a loss of supervision signals. In this paper, based on the assumption that the original dataset is cleaner than the augmented data, we propose an on-the-fly denoising technique for data augmentation that learns from soft augmented labels provided by an organic teacher model trained on the cleaner original data. To further prevent overfitting on noisy labels, a simple self-regularization module is applied to force the model prediction to be consistent across two distinct dropouts. Our method can be applied to general augmentation techniques and consistently improves performance on both text classification and question-answering tasks. Comment: Findings of EACL 202
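The self-regularization module (consistency across two distinct dropouts) can be sketched as follows: the same input is passed through the model under two independent dropout masks, and a consistency penalty discourages disagreement between the two predictions. The "model" here is just a masked weighted sum, purely for illustration; a real implementation applies the two masks inside a neural network and uses a divergence such as KL.

```python
# Sketch of dropout-consistency self-regularization: two stochastic forward
# passes on the same input, penalized for disagreeing.
import random

random.seed(42)

def forward_with_dropout(x, w, p=0.3):
    """Toy forward pass: randomly zero activations, rescale survivors."""
    keep = 1.0 - p
    return sum(wi * xi * (random.random() > p) / keep for wi, xi in zip(w, x))

x = [0.5, -1.0, 2.0, 0.3]      # made-up input features
w = [0.1, 0.4, -0.2, 0.9]      # made-up model weights

y1 = forward_with_dropout(x, w)    # first dropout mask
y2 = forward_with_dropout(x, w)    # second, independent mask
consistency_loss = (y1 - y2) ** 2  # added to the task loss during training
print(round(consistency_loss, 6))
```

Minimizing this term makes the prediction stable under perturbation, which is what keeps the model from memorizing noisy augmented labels.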