32 research outputs found

    The asymptotic distribution and Berry--Esseen bound of a new test for independence in high dimension with an application to stochastic optimization

    Full text link
    Let $\mathbf{X}_1,\ldots,\mathbf{X}_n$ be a random sample from a $p$-dimensional population distribution. Assume that $c_1 n^{\alpha} \leq p \leq c_2 n^{\alpha}$ for some positive constants $c_1$, $c_2$ and $\alpha$. In this paper we introduce a new statistic for testing independence of the $p$-variates of the population and prove that the limiting distribution is the extreme distribution of type I with a rate of convergence $O((\log n)^{5/2}/\sqrt{n})$. This is much faster than $O(1/\log n)$, a typical convergence rate for this type of extreme distribution. A simulation study and application to stochastic optimization are discussed. Comment: Published in at http://dx.doi.org/10.1214/08-AAP527 the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Sub-Character Tokenization for Chinese Pretrained Language Models

    Full text link
    Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore the unique feature of the Chinese writing system where additional linguistic information exists below the character level, i.e., at the sub-character level. To utilize such information, we propose sub-character (SubChar for short) tokenization. Specifically, we first encode the input text by converting each Chinese character into a short sequence based on its glyph or pronunciation, and then construct the vocabulary based on the encoded text with sub-word tokenization. Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) They can tokenize inputs into much shorter sequences, thus improving the computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to all homophone typos. At the same time, models trained with SubChar tokenizers perform competitively on downstream tasks. We release our code at https://github.com/thunlp/SubCharTokenization to facilitate future work. Comment: This draft supersedes the previous version named "SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining

    Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-Level Backdoor Attacks

    Full text link
    Pre-trained models (PTMs) have been widely used in various downstream tasks. The parameters of PTMs are distributed on the Internet and may suffer backdoor attacks. In this work, we demonstrate the universal vulnerability of PTMs, where fine-tuned PTMs can be easily controlled by backdoor attacks in arbitrary downstream tasks. Specifically, attackers can add a simple pre-training task, which restricts the output representations of trigger instances to pre-defined vectors, namely neuron-level backdoor attack (NeuBA). If the backdoor functionality is not eliminated during fine-tuning, the triggers can make the fine-tuned model predict fixed labels by pre-defined vectors. In the experiments of both natural language processing (NLP) and computer vision (CV), we show that NeuBA absolutely controls the predictions for trigger instances without any knowledge of downstream tasks. Finally, we apply several defense methods to NeuBA and find that model pruning is a promising direction to resist NeuBA by excluding backdoored neurons. Our findings sound a red alarm for the wide use of PTMs. Our source code and models are available at https://github.com/thunlp/NeuBA

    Ureteral Obstruction and Stent Thrombosis After Endovascular Treatment of Iliac Artery Aneurysm

    No full text

    Validity of Intraepithelial Lymphocyte Count in the Diagnosis of Celiac Disease: A Histopathological Study

    No full text
    The gold standard for diagnosis of Celiac disease (CD) is histological evidence of a small intestinal biopsy together with positive serology. Modified Marsh classification utilizes the histological parameter of intraepithelial lymphocyte (IEL) count in the diagnosis of CD. The reported upper limit of normal IEL in the duodenum varies from 20 to 40 per 100 epithelial cells (EC). The objectives of the study are to determine the normal upper limit of IEL in the duodenum and assess the diagnostic accuracy of existing criteria of IEL counts to diagnose CD. A retrospective analysis of histopathological records of duodenal biopsies reported as normal (control group, n=38) and consistent with CD (n=37) formed the basis of the study. IEL were counted in an uninterrupted length of surface epithelium and villous tips from formalin-fixed, paraffin-embedded, Haematoxylin and Eosin stained biopsies under the light microscope (X400 magnification). In the control group the mean IEL/100EC was 3.81 (upper limit of normal = 7.78) and the mean IEL/villous tip was 0.96 (upper limit of normal = 3.5). In CD the mean IEL/100EC and mean IEL/villous tip were 20.93 and 6.83 respectively. The upper limit of normal IEL/100EC, mean IEL/100EC in CD and the villous tip IEL count in both the control and CD groups were considerably lower than those reported in other studies. The ethnicity, country of origin and environmental factors may be partly responsible for this observation. If high cut-off values of IEL/100EC are used to diagnose CD, many cases may be underdiagnosed, particularly when the upper limit of the normal IEL count is lower for that population and region, highlighting the importance of a multidisciplinary approach in the diagnosis of CD.