The asymptotic distribution and Berry–Esseen bound of a new test for independence in high dimension with an application to stochastic optimization
Let be a random sample from a -dimensional
population distribution. Assume that
for some positive constants and . In this paper we introduce
a new statistic for testing independence of the -variates of the population
and prove that the limiting distribution is the extreme distribution of type I
with a rate of convergence . This is much faster
than , a typical convergence rate for this type of extreme
distribution. A simulation study and application to stochastic optimization are
discussed. Comment: Published in the Annals of Applied Probability
(http://www.imstat.org/aap/) by the Institute of Mathematical Statistics
(http://www.imstat.org), at http://dx.doi.org/10.1214/08-AAP527.
Sub-Character Tokenization for Chinese Pretrained Language Models
Tokenization is fundamental to pretrained language models (PLMs). Existing
tokenization methods for Chinese PLMs typically treat each character as an
indivisible token. However, they ignore the unique feature of the Chinese
writing system where additional linguistic information exists below the
character level, i.e., at the sub-character level. To utilize such information,
we propose sub-character (SubChar for short) tokenization. Specifically, we
first encode the input text by converting each Chinese character into a short
sequence based on its glyph or pronunciation, and then construct the vocabulary
based on the encoded text with sub-word tokenization. Experimental results show
that SubChar tokenizers have two main advantages over existing tokenizers: 1)
They can tokenize inputs into much shorter sequences, thus improving the
computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode
Chinese homophones into the same transliteration sequences and produce the same
tokenization output, hence being robust to all homophone typos. At the same
time, models trained with SubChar tokenizers perform competitively on
downstream tasks. We release our code at
https://github.com/thunlp/SubCharTokenization to facilitate future work. Comment: This draft supersedes the previous version named "SHUOWEN-JIEZI:
Linguistically Informed Tokenizers For Chinese Language Model Pretraining".
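The homophone robustness described in this abstract can be illustrated with a minimal sketch of the pronunciation-based encoding step. The lookup table, the separator symbol, and the function name below are hypothetical and are not taken from the released code; a real tokenizer would use a full pronunciation lexicon and would then build a sub-word vocabulary over the encoded text, a step this sketch leaves out.

```python
# Toy pronunciation table (hypothetical, for illustration only).
# Each character maps to its pinyin with a tone digit.
PINYIN = {
    "他": "ta1",
    "她": "ta1",
    "在": "zai4",
    "再": "zai4",
    "见": "jian4",
}

SEP = "#"  # assumed marker that closes each character's transliteration


def encode(text: str) -> str:
    """Replace every known character with its transliteration plus SEP.

    Characters missing from the table are passed through unchanged.
    """
    return "".join(PINYIN[ch] + SEP if ch in PINYIN else ch for ch in text)


correct = encode("再见")  # intended word
typo = encode("在见")     # homophone typo: 在 written instead of 再

# Both inputs share one encoded sequence, so any tokenizer applied
# afterwards produces the same tokens for the typo and the correct text.
print(correct == typo)
```

Because the two inputs are identical after encoding, whatever sub-word segmentation is applied next cannot tell them apart, which is the property the abstract attributes to pronunciation-based SubChar tokenizers.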
Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-Level Backdoor Attacks
Pre-trained models (PTMs) have been widely used in various downstream tasks.
The parameters of PTMs are distributed on the Internet and may suffer backdoor
attacks. In this work, we demonstrate the universal vulnerability of PTMs,
where fine-tuned PTMs can be easily controlled by backdoor attacks in arbitrary
downstream tasks. Specifically, attackers can add a simple pre-training task,
which restricts the output representations of trigger instances to pre-defined
vectors, namely neuron-level backdoor attack (NeuBA). If the backdoor
functionality is not eliminated during fine-tuning, the triggers can make the
fine-tuned model predict fixed labels by pre-defined vectors. In the
experiments of both natural language processing (NLP) and computer vision (CV),
we show that NeuBA absolutely controls the predictions for trigger instances
without any knowledge of downstream tasks. Finally, we apply several defense
methods to NeuBA and find that model pruning is a promising direction to resist
NeuBA by excluding backdoored neurons. Our findings sound a red alarm for the
wide use of PTMs. Our source code and models are available at
https://github.com/thunlp/NeuBA.
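The added pre-training task in this abstract, which ties the output representations of trigger instances to pre-defined vectors, can be sketched as a simple penalty term. The squared-error form and the function name are assumptions made for illustration; the abstract does not state which distance the authors use, and this is not their implementation.

```python
def backdoor_loss(outputs, targets):
    """Mean squared distance between trigger-instance outputs and targets.

    outputs: list of representation vectors produced for trigger instances
    targets: list of the pre-defined vectors assigned to those triggers
    Both are lists of equal-length lists of floats (assumed layout).
    """
    total = 0.0
    for out, tgt in zip(outputs, targets):
        # Average squared difference over the dimensions of one vector.
        total += sum((o - t) ** 2 for o, t in zip(out, tgt)) / len(tgt)
    # Average over all trigger instances in the batch.
    return total / len(outputs)


# The penalty vanishes when the representation equals its pre-defined vector
# and grows as the representation moves away from it.
print(backdoor_loss([[1.0, -1.0]], [[1.0, -1.0]]))
print(backdoor_loss([[0.0, 0.0]], [[1.0, -1.0]]))
```

Minimizing a term of this kind alongside the ordinary pre-training objective is one way to read "restricts the output representations of trigger instances to pre-defined vectors".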
Ureteral Obstruction and Stent Thrombosis After Endovascular Treatment of Iliac Artery Aneurysm
Validity of Intraepithelial Lymphocyte Count in the Diagnosis of Celiac Disease: A Histopathological Study
Abstract: The gold standard for diagnosis of celiac disease (CD) is histological evidence from a small intestinal biopsy together with positive serology. The modified Marsh classification uses the histological parameter of intraepithelial lymphocyte (IEL) count in the diagnosis of CD. The reported upper limit of normal IEL in the duodenum varies from 20-40 per 100 epithelial cells (EC). The objectives of the study are to determine the normal upper limit of IEL in the duodenum and to assess the diagnostic accuracy of existing IEL count criteria for diagnosing CD. A retrospective analysis of histopathological records of duodenal biopsies reported as normal (control group, n=38) and consistent with CD (n=37) formed the basis of the study. IEL were counted in an uninterrupted length of surface epithelium and villous tips from formalin-fixed, paraffin-embedded, haematoxylin and eosin stained biopsies under the light microscope (×400 magnification). In the control group the mean IEL/100EC was 3.81 (upper limit of normal = 7.78) and the mean IEL/villous tip was 0.96 (upper limit of normal = 3.5). In CD the mean IEL/100EC and mean IEL/villous tip were 20.93 and 6.83 respectively. The upper limit of normal IEL/100EC, the mean IEL/100EC in CD, and the villous tip IEL count in both the control and CD groups were considerably lower than those reported in other studies. Ethnicity, country of origin and environmental factors may be partly responsible for this observation. If high cut-off values of IEL/100EC are used to diagnose CD, many cases may be underdiagnosed, particularly when the upper limit of the normal IEL count is lower for that population and region, highlighting the importance of a multidisciplinary approach in the diagnosis of CD.