343 research outputs found

    Adaptive Huber Regression

    Get PDF
    Big data can easily be contaminated by outliers or contain variables with heavy-tailed distributions, which makes many conventional methods inadequate. To address this challenge, we propose the adaptive Huber regression for robust estimation and inference. The key observation is that the robustification parameter should adapt to the sample size, dimension and moments for an optimal tradeoff between bias and robustness. Our theoretical framework deals with heavy-tailed distributions with bounded (1+δ)-th moment for any δ > 0. We establish a sharp phase transition for robust estimation of regression parameters in both low and high dimensions: when δ ≥ 1, the estimator admits a sub-Gaussian-type deviation bound without sub-Gaussian assumptions on the data, while only a slower rate is available in the regime 0 < δ < 1. Furthermore, this transition is smooth and optimal. In addition, we extend the methodology to allow both heavy-tailed predictors and observation noise. Simulation studies lend further support to the theory. In a genetic study of cancer cell lines that exhibit heavy-tailedness, the proposed methods are shown to be more robust and predictive. Comment: final version
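The abstract's key idea, a robustification parameter that adapts to the sample size rather than staying fixed, can be sketched in a toy one-dimensional setting. Everything below (the pilot least-squares fit, the MAD-based scale estimate, the sqrt(n / log n) scaling, and the gradient-descent solver) is an illustrative simplification, not the paper's actual procedure:

```python
import math

def huber_grad(r, tau):
    """Derivative of the Huber loss at residual r: linear inside [-tau, tau], capped outside."""
    return r if abs(r) <= tau else math.copysign(tau, r)

def median(v):
    s = sorted(v)
    n = len(s)
    return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

def adaptive_huber_fit(x, y, lr=0.1, iters=5000):
    """Fit y ~ a + b*x by Huber regression with a sample-size-adaptive tau."""
    n = len(x)
    # pilot least-squares fit, used only to get a crude residual scale
    mx, my = sum(x) / n, sum(y) / n
    var = sum((xi - mx) ** 2 for xi in x) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    b = cov / var
    a = my - b * mx
    res = [yi - a - b * xi for xi, yi in zip(x, y)]
    mad = median([abs(r - median(res)) for r in res])
    sigma_hat = mad / 0.6745                      # robust scale estimate
    tau = sigma_hat * math.sqrt(n / math.log(n))  # adaptive robustification parameter
    # gradient descent on the Huber loss
    for _ in range(iters):
        ga = gb = 0.0
        for xi, yi in zip(x, y):
            g = huber_grad(yi - a - b * xi, tau)
            ga -= g / n
            gb -= g * xi / n
        a -= lr * ga
        b -= lr * gb
    return a, b, tau
```

On data with one gross outlier, the capped gradient keeps both coefficients near the truth, whereas the pilot least-squares fit is pulled far off.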

    A Unified Framework for Testing High Dimensional Parameters: A Data-Adaptive Approach

    Full text link
    High dimensional hypothesis testing deals with models in which the number of parameters is significantly larger than the sample size. The existing literature develops a variety of individual tests: some are sensitive to dense, small disturbances, while others are sensitive to sparse, large disturbances. Hence, the power of these tests depends on the assumed alternative scenario. This paper provides a unified framework for developing new tests that are adaptive to a large variety of alternative scenarios in high dimensions. In particular, our framework includes arbitrary hypotheses that can be tested using high dimensional U-statistic based vectors. Under this framework, we first develop a broad family of tests based on a novel variant of the L_p-norm with p ∈ {1, …, ∞}. We then combine these tests to construct a data-adaptive test that is simultaneously powerful under various alternative scenarios. To obtain the asymptotic distributions of these tests, we utilize the multiplier bootstrap for U-statistics. In addition, we consider the computational aspect of the bootstrap method and propose a novel low-cost scheme. We prove the optimality of the proposed tests. Thorough numerical results on simulated and real datasets are provided to support our theory.
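The construction can be mimicked in miniature for the simplest case, testing whether a mean vector is zero: compute several L_p-norm statistics of the scaled sample mean, calibrate each with a Gaussian multiplier bootstrap, and combine them through the minimum p-value. This sketch deliberately simplifies the paper's framework (a plain mean vector instead of a U-statistic vector, and no recalibration of the combined statistic):

```python
import math, random

def lp_norm(v, p):
    """L_p norm of a vector, including p = infinity (the sup-norm)."""
    if p == math.inf:
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def adaptive_mean_test(X, ps=(1.0, 2.0, math.inf), B=500, seed=0):
    """Test H0: E[X] = 0 via several L_p norms, multiplier-bootstrap calibrated."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    scaled = [math.sqrt(n) * m for m in mean]
    t_obs = {p: lp_norm(scaled, p) for p in ps}
    centered = [[row[j] - mean[j] for j in range(d)] for row in X]
    exceed = {p: 0 for p in ps}
    for _ in range(B):
        w = [rng.gauss(0.0, 1.0) for _ in range(n)]  # multiplier weights
        bm = [math.sqrt(n) * sum(w[i] * centered[i][j] for i in range(n)) / n
              for j in range(d)]
        for p in ps:
            if lp_norm(bm, p) >= t_obs[p]:
                exceed[p] += 1
    pvals = {p: exceed[p] / B for p in ps}
    # adaptive combination: take the smallest individual bootstrap p-value
    return pvals, min(pvals.values())
```

The sup-norm statistic favors sparse, large signals and the L_1 / L_2 statistics favor dense, small ones; taking the minimum p-value inherits power from whichever regime holds.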

    Deconstructing Student Perceptions of Generative AI (GenAI) through an Expectancy Value Theory (EVT)-based Instrument

    Full text link
    This study examines the relationship between student perceptions and their intention to use generative AI in higher education. Drawing on Expectancy-Value Theory (EVT), a questionnaire was developed to measure students' knowledge of generative AI, perceived value, and perceived cost. A sample of 405 students participated in the study, and confirmatory factor analysis was used to validate the constructs. The results indicate a strong positive correlation between perceived value and intention to use generative AI, and a weak negative correlation between perceived cost and intention to use. As we continue to explore the implications of generative AI in education and other domains, it is crucial to carefully consider the potential long-term consequences and the ethical dilemmas that may arise from widespread adoption.

    Distributed Adaptive Huber Regression

    Full text link
    Distributed data naturally arise in scenarios involving multiple sources of observations, each stored at a different location. Directly pooling all the data together is often prohibited due to limited bandwidth and storage, or due to privacy protocols. This paper introduces a new robust distributed algorithm for fitting linear regressions when data are subject to heavy-tailed and/or asymmetric errors with finite second moments. The algorithm communicates only gradient information at each iteration and is therefore communication-efficient. Statistically, the resulting estimator achieves the centralized nonasymptotic error bound as if all the data were pooled together and came from a distribution with sub-Gaussian tails. Under a finite (2+δ)-th moment condition, we derive a Berry-Esseen bound for the distributed estimator, based on which we construct robust confidence intervals. Numerical studies further confirm that, compared with extant distributed methods, the proposed methods achieve near-optimal accuracy with low variability and better coverage with tighter confidence intervals. Comment: 29 pages
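The communication pattern described here, each machine sending only the gradient of its local loss per round, can be sketched for a toy one-dimensional Huber regression. The fixed tau = 1.0 and the plain gradient-descent update are illustrative simplifications of the paper's algorithm:

```python
import math

def huber_grad(r, tau):
    """Derivative of the Huber loss at residual r."""
    return r if abs(r) <= tau else math.copysign(tau, r)

def local_gradient(shard, a, b, tau):
    """Gradient of one machine's local Huber loss at (a, b); this pair of
    numbers is all the machine needs to communicate per round."""
    ga = gb = 0.0
    for xi, yi in shard:
        g = huber_grad(yi - a - b * xi, tau)
        ga -= g
        gb -= g * xi
    return ga / len(shard), gb / len(shard)

def distributed_huber(shards, tau, lr=0.1, rounds=3000):
    """Server loop: average the communicated local gradients, take a step."""
    a = b = 0.0
    n = sum(len(s) for s in shards)
    for _ in range(rounds):
        grads = [local_gradient(s, a, b, tau) for s in shards]
        ga = sum(len(s) * g[0] for s, g in zip(shards, grads)) / n
        gb = sum(len(s) * g[1] for s, g in zip(shards, grads)) / n
        a -= lr * ga
        b -= lr * gb
    return a, b
```

Even when one machine's shard is contaminated with gross outliers, the capped Huber gradient bounds that machine's influence on the aggregate update.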

    Risks in Political Interpreting: A Case Study of Interpreters during the First Opium War (1839-1842) between Britain and China

    Get PDF
    Waseda University diploma number: Shin 9203; Doctor of Philosophy (International Communication Studies); Waseda University

    How Early Participation Determines Long-Term Sustained Activity in GitHub Projects?

    Full text link
    Although the open source model bears many advantages in software development, open source projects are often hard to sustain. Previous research on open source sustainability mainly focuses on projects that have already reached a certain level of maturity (e.g., with communities, releases, and downstream projects). However, limited attention has been paid to the development of (sustainable) open source projects in their infancy, and we believe an understanding of early sustainability determinants is crucial for project initiators, incubators, newcomers, and users. In this paper, we aim to explore the relationship between early participation factors and long-term project sustainability. We leverage a novel methodology combining the Blumberg model of performance and machine learning to predict the sustainability of 290,255 GitHub projects. Specifically, we train an XGBoost model based on early participation (first three months of activity) in these projects and interpret the model using LIME. We quantitatively show that early participants have a positive effect on a project's future sustained activity if they have prior experience in OSS project incubation and demonstrate concentrated focus and steady commitment. Participation from non-code contributors and detailed contribution documentation also promote a project's sustained activity. Compared with individual projects, building a community that consists of more experienced core developers and more active peripheral developers is important for organizational projects. This study provides unique insights into the incubation and recognition of sustainable open source projects, and our interpretable prediction approach can also offer guidance to open source project initiators and newcomers. Comment: The 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023)

    Personalized First Issue Recommender for Newcomers in Open Source Projects

    Full text link
    Many open source projects provide good first issues (GFIs) to attract and retain newcomers. Although several automated GFI recommenders have been proposed, existing recommenders are limited to recommending generic GFIs without considering differences between individual newcomers. However, we observe mismatches between generic GFIs and the diverse background of newcomers, resulting in failed attempts, discouraged onboarding, and delayed issue resolution. To address this problem, we assume that personalized first issues (PFIs) for newcomers could help reduce the mismatches. To justify the assumption, we empirically analyze 37 newcomers and their first issues resolved across multiple projects. We find that the first issues resolved by the same newcomer share similarities in task type, programming language, and project domain. These findings underscore the need for a PFI recommender to improve over state-of-the-art approaches. For that purpose, we identify features that influence newcomers' personalized selection of first issues by analyzing the relationship between possible features of the newcomers and the characteristics of the newcomers' chosen first issues. We find that the expertise preference, OSS experience, activeness, and sentiment of newcomers drive their personalized choice of the first issues. Based on these findings, we propose a Personalized First Issue Recommender (PFIRec), which employs LambdaMART to rank candidate issues for a given newcomer by leveraging the identified influential features. We evaluate PFIRec using a dataset of 68,858 issues from 100 GitHub projects.
The evaluation results show that PFIRec outperforms existing first issue recommenders, potentially doubling the probability that the top recommended issue is suitable for a specific newcomer and, in the median, reducing a newcomer's unsuccessful attempts to identify suitable first issues by one-third. Comment: The 38th IEEE/ACM International Conference on Automated Software Engineering (ASE 2023)
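As a toy illustration of the ranking step, the sketch below scores candidate issues against a newcomer's profile and sorts them. The three match features and their weights are hypothetical; a fixed linear score stands in for the LambdaMART model that PFIRec actually trains:

```python
def rank_first_issues(newcomer, issues):
    """Rank candidate first issues for a newcomer by profile match (toy sketch).

    PFIRec learns its ranking function (LambdaMART) over influential features
    such as expertise preference, OSS experience, activeness, and sentiment;
    here fixed weights over three hypothetical match features stand in for it.
    """
    weights = {"language": 2.0, "domain": 1.5, "task_type": 1.0}
    def score(issue):
        # award each feature's weight when the issue matches the newcomer
        return sum(w for f, w in weights.items()
                   if issue.get(f) == newcomer.get(f))
    return sorted(issues, key=score, reverse=True)
```

A learned ranker replaces the hand-set weights with ones fit to historical newcomer-issue outcomes, but the interface, a per-(newcomer, issue) score followed by a sort, is the same.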