LASSO risk and phase transition under dependence
We consider the problem of recovering a $k$-sparse signal
$\boldsymbol{\beta}_0 \in \mathbb{R}^p$ from noisy observations
$\mathbf{y} = \mathbf{X}\boldsymbol{\beta}_0 + \mathbf{w} \in \mathbb{R}^n$.
One of the most popular approaches is $\ell_1$-regularized least squares,
also known as LASSO. We analyze the mean square error of LASSO in the case of
random designs in which each row of $\mathbf{X}$ is drawn from the
distribution $N(0, \boldsymbol{\Sigma})$ with general $\boldsymbol{\Sigma}$.
We first derive the asymptotic risk of LASSO in the limit of
$n, p \to \infty$ with $n/p \to \delta$. We then examine conditions on $n$,
$p$, and $k$ for LASSO to exactly reconstruct $\boldsymbol{\beta}_0$ in the
noiseless case $\mathbf{w} = 0$. A phase boundary
$\delta_c = \delta(\epsilon)$ is precisely established in the phase space
defined by $0 \le \delta, \epsilon \le 1$, where $\epsilon = k/p$. Above this
boundary, LASSO perfectly recovers $\boldsymbol{\beta}_0$ with high
probability. Below this boundary, LASSO fails to recover
$\boldsymbol{\beta}_0$ with high probability. While the values of the
non-zero elements of $\boldsymbol{\beta}_0$ do not have any effect on the
phase transition curve, our analysis shows that $\delta_c$ does depend on the
sign pattern of the nonzero values of $\boldsymbol{\beta}_0$ for general
$\boldsymbol{\Sigma} \ne \mathbf{I}_p$. This is in sharp contrast to the
previous phase transition results derived in the i.i.d. case with
$\boldsymbol{\Sigma} = \mathbf{I}_p$, where $\delta_c$ is completely
determined by $\epsilon$ regardless of the distribution of
$\boldsymbol{\beta}_0$. Underlying our formalism is a recently developed
efficient algorithm called approximate message passing (AMP). We generalize
the state evolution of AMP from the i.i.d. case to the general case with
$\boldsymbol{\Sigma} \ne \mathbf{I}_p$. Extensive computational experiments
confirm that our theoretical predictions are consistent with simulation
results on moderate-size systems. Comment: 40 pages, 7 figures
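As a rough illustration of the estimator discussed in this abstract, the sketch below solves the $\ell_1$-regularized least squares problem by plain coordinate descent. It is not the AMP algorithm or the state evolution from the paper. The function names (`soft_threshold`, `lasso_cd`, `demo`), the objective scaling, the penalty value and the i.i.d. Gaussian design in `demo` are all assumptions made for illustration.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimise 0.5 * ||y - X b||^2 + lam * ||b||_1 by coordinate descent.

    X is a list of n rows with p entries each; y is a list of length n.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual once beta_j is removed.
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def demo(n=30, p=50, k=3, lam=0.05, seed=0):
    """Noiseless recovery with an i.i.d. N(0, 1) design (Sigma = I, assumed)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    beta0 = [0.0] * p
    for j in rng.sample(range(p), k):
        beta0[j] = rng.choice([-1.0, 1.0])
    y = [sum(X[i][j] * beta0[j] for j in range(p)) for i in range(n)]
    beta_hat = lasso_cd(X, y, lam)
    # Mean square error between the LASSO estimate and the true signal.
    return sum((a - b) ** 2 for a, b in zip(beta_hat, beta0)) / p
```

Calling `demo()` with different `n`, `p` and `k` gives an empirical feel for how the error changes with the ratios `n/p` and `k/p`. A correlated design would need rows drawn from $N(0, \boldsymbol{\Sigma})$ instead.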
Age-adjusted nonparametric detection of differential DNA methylation with case–control designs
Background: DNA methylation profiles differ among disease types and, therefore, can be used in disease diagnosis. In addition, large-scale whole genome DNA methylation data offer tremendous potential in understanding the role of DNA methylation in normal development and function. However, due to the unique features of methylation data, powerful and robust statistical methods are very limited in this area. Results: In this paper, we proposed and examined a new statistical method to detect differentially methylated loci for case–control designs that is fully nonparametric and does not depend on any assumption for the underlying distribution of the data. Moreover, the proposed method adjusts for the age effect, which has been shown to be highly correlated with DNA methylation profiles. Using simulation studies and a real data application, we have demonstrated the advantages of our method over existing commonly used methods. Conclusions: Compared to existing methods, our method improved the detection power for differentially methylated loci for case–control designs and controlled the type I error well. Its applications are not limited to methylation data; it can be extended to many other case–control studies.
A Survey of Document-Level Information Extraction
Document-level information extraction (IE) is a crucial task in natural
language processing (NLP). This paper conducts a systematic review of recent
document-level IE literature. In addition, we conduct a thorough error analysis
with current state-of-the-art algorithms and identify their limitations as well
as the remaining challenges for the task of document-level IE. According to our
findings, labeling noise, entity coreference resolution, and lack of
reasoning severely affect the performance of document-level IE. The objective
of this survey paper is to provide more insights and help NLP researchers
further enhance document-level IE performance.
Some contributions to high dimensional statistical learning
This dissertation consists of two major contributions to high dimensional statistical learning. The focus is on classification which is one of the central research topics in the field of statistical learning. This research is on both binary and multiclass learning. For binary classification, we propose the Bi-Directional Discrimination (BDD) method which generalizes linear classifiers from one hyperplane to two or more hyperplanes. BDD combines the strengths of linear and general nonlinear methods. Linear classifiers are very popular, but can suffer some serious limitations when the classes have distinct subpopulations. General nonlinear classifiers can give improved classification error rates, but do not give clear interpretation of the results and present great challenges in terms of overfitting in high dimensions. BDD gives much of the flexibility of a general nonlinear classifier while maintaining the interpretability, and less tendency towards overfitting, of linear classifiers. While the idea is generally applicable, we focus our discussion on the generalization of the Support Vector Machine (SVM) and Distance Weighted Discrimination (DWD) methods. The performance and usefulness of the proposed method are assessed using asymptotics, and demonstrated through analysis of simulated and real data. For multiclass learning, the DWD method is generalized from the binary case to the multiclass case. DWD is a powerful tool for solving binary classification problems which has been shown to improve upon SVM in high dimensional situations. We extend the binary DWD to the multiclass DWD. In addition to some well known extensions which simply combine several binary DWD classifiers, we propose a global multiclass DWD (MDWD) which finds a single classifier that simultaneously considers all classes. 
Our theoretical results show that MDWD is Fisher consistent, even in the particularly challenging case when there is no dominating class (i.e., the maximal class-conditional probability is less than 1/2). The performances of different multiclass DWD methods are assessed through simulation and real data studies.
Sequential Recommendation with Diffusion Models
Generative models, such as Variational Auto-Encoder (VAE) and Generative
Adversarial Network (GAN), have been successfully applied in sequential
recommendation. These methods require sampling from probability distributions
and adopt auxiliary loss functions to optimize the model, which can capture the
uncertainty of user behaviors and alleviate exposure bias. However, existing
generative models still suffer from the posterior collapse problem or the model
collapse problem, thus limiting their applications in sequential
recommendation. To tackle the challenges mentioned above, we leverage a new
paradigm of the generative models, i.e., diffusion models, and present
sequential recommendation with diffusion models (DiffRec), which can avoid the
issues of VAE- and GAN-based models and show better performance. While
diffusion models were originally proposed to process continuous image data, we
design an additional transition in the forward process together with a
transition in the reverse process to enable the processing of the discrete
recommendation data. We also design a different noising strategy that only
noises the target item instead of the whole sequence, which is more suitable
for sequential recommendation. Based on the modified diffusion process, we
derive the objective function of our framework using a simplification technique
and design a denoise sequential recommender to fulfill the objective function.
As the lengthened diffusion steps substantially increase the time complexity,
we propose an efficient training strategy and an efficient inference strategy
to reduce training and inference cost and improve recommendation diversity.
Extensive experimental results on three public benchmark datasets verify the
effectiveness of our approach and show that DiffRec outperforms
state-of-the-art sequential recommendation models.
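The noising strategy described in this abstract (noise only the target item, not the whole sequence) can be sketched as follows. The abstract does not give the formulas, so this assumes the standard Gaussian forward process $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ applied to the last item's embedding. The names `alpha_bar` and `noise_target_only`, the linear schedule and the toy embeddings are hypothetical and are not taken from DiffRec.

```python
import math
import random


def alpha_bar(t, betas):
    """Cumulative product of (1 - beta_s) for s = 1..t; equals 1.0 when t = 0."""
    prod = 1.0
    for b in betas[:t]:
        prod *= 1.0 - b
    return prod


def noise_target_only(seq_emb, t, betas, rng):
    """Apply the Gaussian forward step to the last (target) item only.

    seq_emb is a list of item embeddings (lists of floats); the history
    items are copied unchanged and only the final embedding is noised.
    """
    a = alpha_bar(t, betas)
    history = [list(e) for e in seq_emb[:-1]]
    target = seq_emb[-1]
    noisy = [
        math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for x in target
    ]
    return history + [noisy]


# Assumed linear noise schedule with 100 steps (illustrative values only).
T = 100
betas = [1e-4 + (0.02 - 1e-4) * s / (T - 1) for s in range(T)]

# Toy sequence of three 2-dimensional item embeddings; the last is the target.
sequence = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]
noised = noise_target_only(sequence, T, betas, random.Random(0))
```

Because the history is left clean, a denoising recommender conditioned on it only has to reconstruct one embedding per training example.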
Pollution level and risk assessment of heavy metals in sewage sludge from eight wastewater treatment plants in Wuhu City, China
Aim of study: To investigate the content, contamination levels and potential sources of five heavy metals (Hg, Pb, Cd, Cr, As) in sewage sludge from eight wastewater treatment plants (W1 to W8). Area of study: Wuhu, located in southeastern Anhui Province, southeastern China. Material and methods: The sewage sludge pollution assessment employed the single-factor pollution index, Nemerow’s synthetic pollution index, monomial potential ecological risk coefficient and potential ecological risk index. The potential sources of the five heavy metals were determined using Pearson’s correlation analysis and principal component analysis (PCA). Main results: The mean concentrations of the heavy metals were 0.27 mg/kg (Hg), 70.78 mg/kg (Pb), 3.48 mg/kg (Cd), 143.65 mg/kg (Cr) and 22.17 mg/kg (As). W1, W5 and W6 sewage sludge samples showed the highest levels of heavy metal contamination, and cadmium had the highest contamination level in the study area. Pearson’s correlation analysis and PCA revealed that Pb and Cd mainly derived from traffic emissions and the manufacturing industry and that As and Cr originated from agricultural discharges. Research highlights: Cadmium pollution in Wuhu should be controlled as a priority. Heavy metal pollution at the W1, W5 and W6 sewage treatment plants is relatively high, so these plants should be key prevention targets.
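The indices named in this abstract have widely used textbook forms, sketched below: the single-factor index $P_i = C_i / S_i$, the Nemerow index $\sqrt{(\bar{P}^2 + P_{\max}^2)/2}$, and the potential ecological risk index $RI = \sum_i T_i \, C_i / B_i$. The concentrations are the means reported in the abstract. The reference values in `REFERENCE` are placeholders invented for illustration and are not the standards used in the study. The toxic response factors are the commonly cited Hakanson values, assumed here rather than confirmed by the abstract.

```python
import math

# Mean concentrations reported in the abstract (mg/kg).
CONCENTRATION = {"Hg": 0.27, "Pb": 70.78, "Cd": 3.48, "Cr": 143.65, "As": 22.17}

# Placeholder reference/background values (mg/kg): hypothetical, not from the study.
REFERENCE = {"Hg": 0.1, "Pb": 30.0, "Cd": 0.1, "Cr": 70.0, "As": 10.0}

# Commonly cited Hakanson toxic response factors (assumed here).
TOXIC_FACTOR = {"Hg": 40, "Cd": 30, "As": 10, "Pb": 5, "Cr": 2}


def single_factor_index(conc, ref):
    """Single-factor pollution index P_i = C_i / S_i."""
    return conc / ref


def nemerow_index(indices):
    """Nemerow synthetic index: sqrt((mean(P)^2 + max(P)^2) / 2)."""
    vals = list(indices)
    mean = sum(vals) / len(vals)
    return math.sqrt((mean ** 2 + max(vals) ** 2) / 2.0)


def risk_index(conc, background, toxic):
    """Potential ecological risk index RI = sum_i T_i * C_i / B_i."""
    return sum(toxic[m] * conc[m] / background[m] for m in conc)


single = {m: single_factor_index(CONCENTRATION[m], REFERENCE[m]) for m in CONCENTRATION}
nemerow = nemerow_index(single.values())
ri = risk_index(CONCENTRATION, REFERENCE, TOXIC_FACTOR)
```

With real regulatory or background values substituted for `REFERENCE`, `single` shows which metal dominates and `nemerow` summarises the overall contamination of a sample.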