
    LASSO risk and phase transition under dependence

    We consider the problem of recovering a $k$-sparse signal $\beta_0 \in \mathbb{R}^p$ from noisy observations $y = X\beta_0 + w \in \mathbb{R}^n$. One of the most popular approaches is $\ell_1$-regularized least squares, also known as the LASSO. We analyze the mean square error of the LASSO in the case of random designs in which each row of $X$ is drawn from the distribution $N(0, \Sigma)$ with general $\Sigma$. We first derive the asymptotic risk of the LASSO in the limit $n, p \rightarrow \infty$ with $n/p \rightarrow \delta$. We then examine conditions on $n$, $p$, and $k$ for the LASSO to exactly reconstruct $\beta_0$ in the noiseless case $w = 0$. A phase boundary $\delta_c = \delta(\epsilon)$ is precisely established in the phase space defined by $0 \le \delta, \epsilon \le 1$, where $\epsilon = k/p$. Above this boundary, the LASSO perfectly recovers $\beta_0$ with high probability; below it, the LASSO fails to recover $\beta_0$ with high probability. While the values of the nonzero elements of $\beta_0$ have no effect on the phase transition curve, our analysis shows that $\delta_c$ does depend on the sign pattern of the nonzero values of $\beta_0$ for general $\Sigma \ne I_p$. This is in sharp contrast to previous phase transition results derived in the i.i.d. case with $\Sigma = I_p$, where $\delta_c$ is completely determined by $\epsilon$ regardless of the distribution of $\beta_0$. Underlying our formalism is the recently developed approximate message passing (AMP) algorithm. We generalize the state evolution of AMP from the i.i.d. case to the general case with $\Sigma \ne I_p$. Extensive computational experiments confirm that our theoretical predictions are consistent with simulation results on moderate-size systems. Comment: 40 pages, 7 figures
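    A minimal simulation sketch of the setting above (not the paper's AMP implementation): it draws rows of $X$ from $N(0, \Sigma)$ with an assumed AR(1) correlation, generates a $k$-sparse $\beta_0$ with random signs, and checks whether a small-penalty LASSO recovers it in the noiseless case. All sizes, the correlation $\rho$, and the use of scikit-learn's Lasso are illustrative assumptions.

```python
# Sketch only: empirically probing the noiseless LASSO phase transition under a
# correlated Gaussian design. Sizes and the AR(1) correlation rho are assumptions.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
p, delta, eps, rho = 400, 0.6, 0.2, 0.5        # n/p = delta, k/p = eps
n, k = int(delta * p), int(eps * p)

# Correlated design: rows of X ~ N(0, Sigma) with Sigma_ij = rho^|i-j|
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T

# k-sparse signal with random signs (the sign pattern matters when Sigma != I_p)
beta0 = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
beta0[support] = rng.choice([-1.0, 1.0], size=k)

y = X @ beta0                                   # noiseless observations, w = 0

# A very small penalty approximates the basis-pursuit (lambda -> 0) limit
beta_hat = Lasso(alpha=1e-4, fit_intercept=False, max_iter=50_000).fit(X, y).coef_
print("relative error:", np.linalg.norm(beta_hat - beta0) / np.linalg.norm(beta0))
```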

    Age-adjusted nonparametric detection of differential DNA methylation with case–control designs

    Background: DNA methylation profiles differ among disease types and, therefore, can be used in disease diagnosis. In addition, large-scale whole-genome DNA methylation data offer tremendous potential for understanding the role of DNA methylation in normal development and function. However, due to the unique features of methylation data, powerful and robust statistical methods are very limited in this area. Results: In this paper, we proposed and examined a new statistical method for detecting differentially methylated loci in case–control designs; the method is fully nonparametric and does not depend on any assumptions about the underlying distribution of the data. Moreover, the proposed method adjusts for the age effect, which has been shown to be highly correlated with DNA methylation profiles. Using simulation studies and a real data application, we have demonstrated the advantages of our method over existing commonly used methods. Conclusions: Compared to existing methods, our method improved the detection power for differentially methylated loci in case–control designs and controlled the type I error well. Its applications are not limited to methylation data; it can be extended to many other case–control studies.
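    The abstract does not spell out the test statistic, so the sketch below is not the proposed method; it only illustrates one simple age-adjusted nonparametric baseline of the same flavor: remove a fitted linear age trend from the methylation values at a locus, then compare case and control residuals with a rank test. The simulated data, the linear age adjustment, and the Mann-Whitney test are all assumptions for illustration.

```python
# Illustrative baseline only (not the paper's proposed test): regress out age,
# then rank-test the residuals between cases and controls at one locus.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
n_case, n_ctrl = 50, 50
age = rng.uniform(30, 80, n_case + n_ctrl)
group = np.r_[np.ones(n_case), np.zeros(n_ctrl)]            # 1 = case, 0 = control

# Simulated methylation values at one locus: age trend plus a small case effect
meth = 0.3 + 0.004 * age + 0.05 * group + rng.normal(0, 0.05, age.size)

# Remove the age trend (least-squares fit), then compare residuals by rank test
slope, intercept = np.polyfit(age, meth, 1)
resid = meth - (intercept + slope * age)
stat, pval = mannwhitneyu(resid[group == 1], resid[group == 0])
print(f"Mann-Whitney p-value at this locus: {pval:.4f}")
```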

    A Survey of Document-Level Information Extraction

    Document-level information extraction (IE) is a crucial task in natural language processing (NLP). This paper conducts a systematic review of recent document-level IE literature. In addition, we conduct a thorough error analysis of current state-of-the-art algorithms and identify their limitations as well as the remaining challenges for the task of document-level IE. According to our findings, labeling noise, entity coreference resolution, and lack of reasoning severely affect the performance of document-level IE. The objective of this survey is to provide more insights and help NLP researchers further enhance document-level IE performance.

    Some contributions to high dimensional statistical learning

    This dissertation consists of two major contributions to high-dimensional statistical learning. The focus is on classification, one of the central research topics in the field of statistical learning. This research covers both binary and multiclass learning. For binary classification, we propose the Bi-Directional Discrimination (BDD) method, which generalizes linear classifiers from one hyperplane to two or more hyperplanes. BDD combines the strengths of linear and general nonlinear methods. Linear classifiers are very popular but can suffer serious limitations when the classes have distinct subpopulations. General nonlinear classifiers can give improved classification error rates, but they do not give a clear interpretation of the results and present great challenges in terms of overfitting in high dimensions. BDD gives much of the flexibility of a general nonlinear classifier while maintaining the interpretability, and the lesser tendency towards overfitting, of linear classifiers. While the idea is generally applicable, we focus our discussion on the generalization of the Support Vector Machine (SVM) and Distance Weighted Discrimination (DWD) methods. The performance and usefulness of the proposed method are assessed using asymptotics and demonstrated through analysis of simulated and real data. For multiclass learning, the DWD method is generalized from the binary case to the multiclass case. DWD is a powerful tool for solving binary classification problems that has been shown to improve upon the SVM in high-dimensional situations. We extend binary DWD to multiclass DWD. In addition to some well-known extensions that simply combine several binary DWD classifiers, we propose a global multiclass DWD (MDWD), which finds a single classifier that simultaneously considers all classes. Our theoretical results show that MDWD is Fisher consistent, even in the particularly challenging case when there is no dominating class (i.e., the maximal class-conditional probability is less than 1/2). The performance of the different multiclass DWD methods is assessed through simulation and real data studies.
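    The sketch below is not the BDD fitting procedure from the dissertation; it only illustrates, on simulated data, the kind of decision rule BDD generalizes to: combining two hyperplanes when one class has distinct subpopulations and a single linear boundary cannot separate the classes. The clustering step and the "minimum of two linear scores" rule are assumptions made purely for illustration.

```python
# Illustration only (not BDD itself): two linear boundaries handle a class with
# two subpopulations that defeats any single hyperplane.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
# Class +1 sits between two subpopulations of class -1
X_pos = rng.normal([0.0, 0.0], 0.5, (100, 2))
X_neg = np.vstack([rng.normal([-3.0, 0.0], 0.5, (50, 2)),
                   rng.normal([3.0, 0.0], 0.5, (50, 2))])
X = np.vstack([X_pos, X_neg])
y = np.r_[np.ones(100), -np.ones(100)]

# Split the multimodal class into clusters and fit one linear classifier per cluster
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_neg)
clfs = [LinearSVC(C=1.0).fit(np.vstack([X_pos, X_neg[labels == c]]),
                             np.r_[np.ones(100), -np.ones((labels == c).sum())])
        for c in (0, 1)]

# Predict +1 only if every hyperplane places the point on its positive side
scores = np.min([clf.decision_function(X) for clf in clfs], axis=0)
print("training accuracy with two hyperplanes:", np.mean(np.sign(scores) == y))
```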

    Sequential Recommendation with Diffusion Models

    Generative models, such as the Variational Auto-Encoder (VAE) and the Generative Adversarial Network (GAN), have been successfully applied to sequential recommendation. These methods require sampling from probability distributions and adopt auxiliary loss functions to optimize the model, which can capture the uncertainty of user behaviors and alleviate exposure bias. However, existing generative models still suffer from the posterior collapse problem or the model collapse problem, limiting their applications in sequential recommendation. To tackle these challenges, we leverage a new paradigm of generative models, diffusion models, and present sequential recommendation with diffusion models (DiffRec), which avoids the issues of VAE- and GAN-based models and shows better performance. While diffusion models were originally proposed to process continuous image data, we design an additional transition in the forward process, together with a transition in the reverse process, to enable the processing of discrete recommendation data. We also design a different noising strategy that noises only the target item instead of the whole sequence, which is more suitable for sequential recommendation. Based on the modified diffusion process, we derive the objective function of our framework using a simplification technique and design a denoising sequential recommender to fulfill the objective. As the lengthened diffusion steps substantially increase the time complexity, we propose an efficient training strategy and an efficient inference strategy to reduce training and inference costs and improve recommendation diversity. Extensive experimental results on three public benchmark datasets verify the effectiveness of our approach and show that DiffRec outperforms state-of-the-art sequential recommendation models.
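    A minimal sketch of the noising idea described above: Gaussian noise is applied only to the target item's embedding, while the interaction history is left untouched. This is not the authors' DiffRec code, and it uses a standard DDPM-style Gaussian forward step rather than the paper's modified transitions; all names, shapes, and the variance schedule are illustrative assumptions.

```python
# Sketch only: forward diffusion that noises the target item, not the sequence.
import torch

def forward_noise_target(seq_emb, target_emb, t, betas):
    """Apply q(x_t | x_0) to the target item embedding only.

    seq_emb:    (B, L, D) embeddings of the interaction history (left untouched)
    target_emb: (B, D)    embedding of the target item, treated as x_0
    t:          (B,)      diffusion step for each example
    betas:      (T,)      variance schedule
    """
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)             # cumulative alpha_bar
    a_bar = alphas_cumprod[t].unsqueeze(-1)                        # (B, 1)
    eps = torch.randn_like(target_emb)
    x_t = a_bar.sqrt() * target_emb + (1.0 - a_bar).sqrt() * eps   # standard Gaussian forward step
    return seq_emb, x_t, eps                                       # unchanged history, noised target, noise

# Toy usage
B, L, D, T = 4, 20, 64, 1000
betas = torch.linspace(1e-4, 0.02, T)
seq_emb, x_t, eps = forward_noise_target(torch.randn(B, L, D), torch.randn(B, D),
                                         torch.randint(0, T, (B,)), betas)
```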

    Pollution level and risk assessment of heavy metals in sewage sludge from eight wastewater treatment plants in Wuhu City, China

    Aim of study: To investigate the content, contamination levels and potential sources of five heavy metals (Hg, Pb, Cd, Cr, As) in sewage sludge from eight wastewater treatment plants (W1 to W8).
    Area of study: Wuhu, located in southeastern Anhui Province, southeastern China.
    Material and methods: The sewage sludge pollution assessment employed the single-factor pollution index, Nemerow's synthetic pollution index, the monomial potential ecological risk coefficient and the potential ecological risk index. The potential sources of the five heavy metals were determined using Pearson's correlation analysis and principal component analysis (PCA).
    Main results: The mean concentrations of the heavy metals were 0.27 mg/kg (Hg), 70.78 mg/kg (Pb), 3.48 mg/kg (Cd), 143.65 mg/kg (Cr) and 22.17 mg/kg (As). The W1, W5 and W6 sewage sludge samples showed the highest levels of heavy metal contamination, and cadmium had the highest contamination level in the study area. Pearson's correlation analysis and PCA revealed that Pb and Cd derived mainly from traffic emissions and the manufacturing industry, and that As and Cr originated from agricultural discharges.
    Research highlights: Cadmium pollution in Wuhu should be controlled preferentially. The heavy metal pollution at the W1, W5 and W6 treatment plants is relatively high, so they should be key prevention targets.
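    The sketch below applies the indices named above to the reported mean concentrations, using their standard formulas; the reference/background values are placeholders (assumptions) that would need to be replaced by the standards actually used in the study, and the toxic response factors are the commonly used Hakanson values.

```python
# Sketch: standard pollution/risk indices applied to the reported mean
# concentrations. Reference values below are placeholders, not the study's standards.
import math

conc = {"Hg": 0.27, "Pb": 70.78, "Cd": 3.48, "Cr": 143.65, "As": 22.17}  # mg/kg (from the abstract)
reference = {"Hg": 0.1, "Pb": 30.0, "Cd": 0.2, "Cr": 65.0, "As": 10.0}   # placeholder standards (assumption)
toxic_factor = {"Hg": 40, "Pb": 5, "Cd": 30, "Cr": 2, "As": 10}          # Hakanson toxic response factors

# Single-factor pollution index: P_i = C_i / S_i
P = {m: conc[m] / reference[m] for m in conc}

# Nemerow synthetic pollution index: P_N = sqrt((mean(P)^2 + max(P)^2) / 2)
P_mean, P_max = sum(P.values()) / len(P), max(P.values())
P_N = math.sqrt((P_mean ** 2 + P_max ** 2) / 2)

# Monomial ecological risk E_r^i = T_r^i * C_i / C_n^i and risk index RI = sum(E_r^i)
E_r = {m: toxic_factor[m] * conc[m] / reference[m] for m in conc}
RI = sum(E_r.values())

print("single-factor indices:", {m: round(v, 2) for m, v in P.items()})
print("Nemerow P_N:", round(P_N, 2), " potential ecological risk RI:", round(RI, 1))
```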