
32 research outputs found

The asymptotic distribution and Berry--Esseen bound of a new test for independence in high dimension with an application to stochastic optimization

Author: Lin Zhengyan
Liu Wei-Dong
Shao Qi-Man
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2008

Let $\mathbf{X}_1,\ldots,\mathbf{X}_n$ be a random sample from a $p$-dimensional population distribution. Assume that $c_1n^{\alpha}\leq p\leq c_2n^{\alpha}$ for some positive constants $c_1,c_2$ and $\alpha$. In this paper we introduce a new statistic for testing independence of the $p$-variates of the population and prove that the limiting distribution is the extreme distribution of type I with a rate of convergence $O((\log n)^{5/2}/\sqrt{n})$. This is much faster than $O(1/\log n)$, a typical convergence rate for this type of extreme distribution. A simulation study and application to stochastic optimization are discussed. Comment: Published in at http://dx.doi.org/10.1214/08-AAP527 the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org
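The two convergence rates quoted in the abstract can be compared numerically. The sketch below is only an illustration of growth order: it ignores the constants hidden in the O-notation, and the function names `new_rate` and `typical_rate` are hypothetical labels, not taken from the paper. It suggests that the advantage of the first rate is asymptotic and shows up only for large n.

```python
import math


def new_rate(n: float) -> float:
    """Order of the rate quoted for the new statistic: (log n)^(5/2) / sqrt(n)."""
    return math.log(n) ** 2.5 / math.sqrt(n)


def typical_rate(n: float) -> float:
    """Order of the typical rate for this type of extreme distribution: 1 / log n."""
    return 1.0 / math.log(n)


# Constants inside O(.) are ignored, so only the trend across n is meaningful.
for n in (1e3, 1e6, 1e9, 1e12, 1e15):
    print(f"n={n:.0e}  new={new_rate(n):.3e}  typical={typical_rate(n):.3e}")
```

With constants set to 1, the first expression is still the larger one at n = 1e3 and becomes the smaller one by n = 1e12.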

arXiv.org e-Print Archive

Crossref

Hong Kong University of Science and Technology Institutional Repository

Sub-Character Tokenization for Chinese Pretrained Language Models

Author: Chen Yingfa
Liu Qun
Liu Zhiyuan
Qi Fanchao
Si Chenglei
Sun Maosong
Wang Xiaozhi
Wang Yasheng
Zhang Zhengyan
Publication date: 22/12/2021

Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore the unique feature of the Chinese writing system where additional linguistic information exists below the character level, i.e., at the sub-character level. To utilize such information, we propose sub-character (SubChar for short) tokenization. Specifically, we first encode the input text by converting each Chinese character into a short sequence based on its glyph or pronunciation, and then construct the vocabulary based on the encoded text with sub-word tokenization. Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) They can tokenize inputs into much shorter sequences, thus improving the computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to all homophone typos. At the same time, models trained with SubChar tokenizers perform competitively on downstream tasks. We release our code at https://github.com/thunlp/SubCharTokenization to facilitate future work. Comment: This draft supersedes the previous version named "SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining
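The first step described in the abstract, converting each character into a pronunciation-based sequence, can be sketched as follows. This is a minimal illustration, not the released implementation: the four-entry table `TOY_PINYIN`, the separator `SEP`, and the function name `encode_pronunciation` are all assumptions made for the example, and the second step (building a sub-word vocabulary over the encoded text) is not shown.

```python
# Toy pronunciation table (an assumption for illustration; a real system
# would rely on a full character-to-pinyin dictionary).
TOY_PINYIN = {
    "他": "ta1",
    "她": "ta1",
    "是": "shi4",
    "事": "shi4",
}

SEP = "#"  # hypothetical marker for character boundaries


def encode_pronunciation(text: str) -> str:
    """Replace each known character by its transliteration plus a separator.

    Characters missing from the toy table are passed through unchanged.
    """
    return "".join(TOY_PINYIN.get(ch, ch) + SEP for ch in text)


# Two inputs that differ only by homophones map to the same encoded string,
# so any tokenizer applied afterwards sees identical input.
encoded_a = encode_pronunciation("他是")
encoded_b = encode_pronunciation("她事")
print(encoded_a == encoded_b)
```

Because the homophones collapse before the vocabulary is built, the robustness to homophone typos mentioned in the abstract follows from the encoding step alone in this toy setting.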

arXiv.org e-Print Archive

Directory of Open Access Journals

Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-Level Backdoor Attacks

Author: Jiang Xin
Li Yongwei
Liu Zhiyuan
Lv Tian
Qi Fanchao
Sun Maosong
Wang Yasheng
Xiao Guangxuan
Zhang Zhengyan
Publication date: 13/06/2021

Pre-trained models (PTMs) have been widely used in various downstream tasks. The parameters of PTMs are distributed on the Internet and may suffer backdoor attacks. In this work, we demonstrate the universal vulnerability of PTMs, where fine-tuned PTMs can be easily controlled by backdoor attacks in arbitrary downstream tasks. Specifically, attackers can add a simple pre-training task, which restricts the output representations of trigger instances to pre-defined vectors, namely neuron-level backdoor attack (NeuBA). If the backdoor functionality is not eliminated during fine-tuning, the triggers can make the fine-tuned model predict fixed labels by pre-defined vectors. In the experiments of both natural language processing (NLP) and computer vision (CV), we show that NeuBA absolutely controls the predictions for trigger instances without any knowledge of downstream tasks. Finally, we apply several defense methods to NeuBA and find that model pruning is a promising direction to resist NeuBA by excluding backdoored neurons. Our findings sound a red alarm for the wide use of PTMs. Our source code and models are available at https://github.com/thunlp/NeuBA

arXiv.org e-Print Archive

Ureteral Obstruction and Stent Thrombosis After Endovascular Treatment of Iliac Artery Aneurysm

Author: Lin Qi
Long Wang
Zhengyan Tang
Publication venue: Urology and Nephrology Research Center, Shahid Beheshti University of Medical Sciences
Publication date: 01/09/2011

Directory of Open Access Journals

Validity of Intraepithelial Lymphocyte Count in the Diagnosis of Celiac Disease: A Histopathological Study

Author: Eranga Himalee Siriweera
Jim L C Yong
Zhengyan Qi
Publication date: 01/01/2015

The gold standard for diagnosis of Celiac disease (CD) is histological evidence of a small intestinal biopsy together with positive serology. Modified Marsh classification utilizes the histological parameter of intraepithelial lymphocyte (IEL) count in the diagnosis of CD. The reported upper limit of normal IEL in the duodenum varies from 20-40 per 100 epithelial cells (EC). The objectives of the study are to determine the normal upper limit of IEL in the duodenum and assess the diagnostic accuracy of existing criteria of IEL counts to diagnose CD. A retrospective analysis of histopathological records of duodenal biopsies reported as normal (control group, n=38) and consistent with CD (n=37) formed the basis of the study. IEL were counted in an uninterrupted length of surface epithelium and villous tips from formalin fixed, paraffin embedded, Haematoxylin and Eosin stained biopsies under the light microscope (X400 magnification). In the control group the mean IEL/100EC was 3.81 (upper limit of normal = 7.78) and the mean IEL/villous tip was 0.96 (upper limit of normal = 3.5). In CD the mean IEL/100EC and mean IEL/villous tip were 20.93 and 6.83 respectively. The upper limit of normal IEL/100EC, mean IEL/100EC in CD and the villous tip IEL count in both the control and CD groups were considerably lower than those reported in other studies. The ethnicity, country of origin and environmental factors may be partly responsible for this observation. If high cut-off values of IEL/100EC are taken to diagnose CD, many cases may be underdiagnosed, particularly when the upper limit of the normal IEL count is lower for that population and region, highlighting the importance of a multidisciplinary approach in the diagnosis of CD

CiteSeerX

Validity of Intraepithelial Lymphocyte Count in the Diagnosis of Celiac Disease: A Histopathological Study

Author: Eranga Himalee Siriweera
Jim L C Yong
Zhengyan Qi
Publication date: 01/05/2020

CiteSeerX

A note on weak laws of large numbers for arrays of rowwise negatively quadrant dependent random variables

Author: Lehmann E.L.
Matula P.
Qi Y.
Tianxiao Pang
Zhengyan Lin
Publication venue: 'Informa UK Limited'

Crossref