105 research outputs found
READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises
For many real-world applications, the user-generated inputs usually contain
various noises due to speech recognition errors caused by linguistic
variations or typographical errors (typos). Thus, it is crucial to test model
performance on data with realistic input noises to ensure robustness and
fairness. However, few studies have constructed such benchmarks for
Chinese, where various language-specific input noises occur in the real world.
In order to fill this important gap, we construct READIN: a Chinese multi-task
benchmark with REalistic And Diverse Input Noises. READIN contains four diverse
tasks and requests annotators to re-enter the original test data with two
commonly used Chinese input methods: Pinyin input and speech input. We designed
our annotation pipeline to maximize diversity, for example by instructing the
annotators to use diverse input method editors (IMEs) for keyboard noises and
recruiting speakers from diverse dialectal groups for speech noises. We
experiment with a series of strong pretrained language models as well as robust
training methods and find that these models often suffer significant
performance drops on READIN even with robustness methods like data
augmentation. As the first large-scale attempt to create a benchmark with
noises geared towards user-generated inputs, we believe that READIN serves as
an important complement to existing Chinese NLP benchmarks. The source code and
dataset can be obtained from https://github.com/thunlp/READIN.
Comment: Preprint
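The evaluation idea in the abstract above, scoring the same model on clean test data and on re-entered noisy test data, can be sketched as follows. This is a minimal illustration, not the READIN evaluation code: the function names, the toy keyword classifier, and the two-example datasets are all hypothetical.

```python
def accuracy(predict, examples):
    """Fraction of (text, label) pairs that the model labels correctly."""
    return sum(predict(text) == label for text, label in examples) / len(examples)


def robustness_report(predict, clean, noisy):
    """Compare accuracy on clean inputs with accuracy on their noisy re-entries."""
    clean_acc = accuracy(predict, clean)
    noisy_acc = accuracy(predict, noisy)
    return {"clean": clean_acc, "noisy": noisy_acc, "drop": clean_acc - noisy_acc}


# Hypothetical toy model: a keyword rule that breaks on a homophone typo.
def predict(text):
    return "pos" if "好" in text else "neg"


clean = [("很好", "pos"), ("不行", "neg")]
noisy = [("很号", "pos"), ("不行", "neg")]  # "好" mistyped as the near-homophone "号"

report = robustness_report(predict, clean, noisy)
print(report)
```

The noisy set keeps the original labels, so any change in the score is attributable to the input noise alone.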
CT-based Subchondral Bone Microstructural Analysis in Knee Osteoarthritis via MR-Guided Distillation Learning
Background: MR-based subchondral bone effectively predicts knee
osteoarthritis. However, its clinical application is limited by the cost and
time of MR. Purpose: We aim to develop a novel distillation-learning-based
method named SRRD for subchondral bone microstructural analysis using
easily-acquired CT images, which leverages paired MR images to enhance the
CT-based analysis model during training. Materials and Methods: Knee joint
images of both CT and MR modalities were collected from October 2020 to May
2021. First, we developed a GAN-based generative model to transform MR images
into CT images, which was used to establish the anatomical correspondence
between the two modalities. Next, we obtained numerous patches of subchondral
bone regions of MR images, together with their trabecular parameters (BV/TV,
Tb.Th, Tb.Sp, Tb.N), which were regressed from the corresponding CT image patches.
The distillation-learning technique was used to train the regression model and
transfer MR structural information to the CT-based model. The regressed
trabecular parameters were further used for knee osteoarthritis classification.
Results: A total of 80 participants were evaluated. CT-based regression results
of trabecular parameters achieved intra-class correlation coefficients (ICCs)
of 0.804, 0.773, 0.711, and 0.622 for BV/TV, Tb.Th, Tb.Sp, and Tb.N,
respectively. The use of distillation learning significantly improved the
performance of the CT-based knee osteoarthritis classification method using the
CNN approach, yielding an AUC score of 0.767 (95% CI, 0.681-0.853) instead of
0.658 (95% CI, 0.574-0.742) (p<.001). Conclusions: The proposed SRRD method
showed high reliability and validity in MR-CT registration, regression, and
knee osteoarthritis classification, indicating the feasibility of subchondral
bone microstructural analysis based on CT images.
Comment: 5 figures, 4 tables
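The abstract above does not state the form of the distillation objective. One common shape, assumed here purely for illustration, adds a feature-matching term between the MR-based teacher and the CT-based student to the regression loss on the trabecular parameters. All names and the weight `alpha` are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_pred, target, student_feat, teacher_feat, alpha=0.5):
    """Regression loss plus a weighted feature-matching (distillation) term.

    student_pred: trabecular parameters predicted from a CT patch
    target:       reference trabecular parameters for that patch
    student_feat: intermediate features of the CT-based student
    teacher_feat: intermediate features of the MR-based teacher
    alpha:        assumed weight of the distillation term
    """
    task_term = mse(student_pred, target)
    distill_term = mse(student_feat, teacher_feat)
    return task_term + alpha * distill_term


# Toy numbers only: when the student's features match the teacher's,
# the loss reduces to the plain regression error.
loss_matched = distillation_loss([0.2, 0.1], [0.2, 0.3], [1.0, 0.0], [1.0, 0.0])
loss_mismatched = distillation_loss([0.2, 0.1], [0.2, 0.3], [1.0, 0.0], [0.0, 0.0])
print(loss_matched, loss_mismatched)
```

Because the teacher needs MR input only to compute `teacher_feat`, a model trained this way can run on CT alone at inference time, which matches the motivation given in the abstract.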
Sub-Character Tokenization for Chinese Pretrained Language Models
Tokenization is fundamental to pretrained language models (PLMs). Existing
tokenization methods for Chinese PLMs typically treat each character as an
indivisible token. However, they ignore the unique feature of the Chinese
writing system where additional linguistic information exists below the
character level, i.e., at the sub-character level. To utilize such information,
we propose sub-character (SubChar for short) tokenization. Specifically, we
first encode the input text by converting each Chinese character into a short
sequence based on its glyph or pronunciation, and then construct the vocabulary
based on the encoded text with sub-word tokenization. Experimental results show
that SubChar tokenizers have two main advantages over existing tokenizers: 1)
They can tokenize inputs into much shorter sequences, thus improving the
computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode
Chinese homophones into the same transliteration sequences and produce the same
tokenization output, hence being robust to all homophone typos. At the same
time, models trained with SubChar tokenizers perform competitively on
downstream tasks. We release our code at
https://github.com/thunlp/SubCharTokenization to facilitate future work.
Comment: This draft supersedes the previous version named "SHUOWEN-JIEZI:
Linguistically Informed Tokenizers For Chinese Language Model Pretraining"
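The two-step procedure described above (encode each character as a short pronunciation sequence, then apply sub-word tokenization to the encoded text) can be sketched with toy data. This is not the released SubChar implementation: the pronunciation table, the separator symbol, and the vocabulary are hypothetical, and a greedy longest-match segmenter stands in for a learned sub-word tokenizer.

```python
# Hypothetical toy pronunciation table (pinyin with tone digits).
PINYIN = {
    "他": "ta1", "她": "ta1", "它": "ta1",
    "是": "shi4", "事": "shi4",
    "人": "ren2",
}
SEP = "#"  # assumed marker for character boundaries


def encode_pronunciation(text):
    """Replace each known character with its transliteration plus a separator."""
    return "".join(PINYIN[ch] + SEP if ch in PINYIN else ch for ch in text)


# Hypothetical sub-word vocabulary; in practice it would be learned from a corpus.
VOCAB = {"ta1#shi4#", "ta1#", "shi4#", "ren2#"}


def greedy_tokenize(encoded, vocab):
    """Greedy longest-match segmentation, falling back to single symbols."""
    tokens, i = [], 0
    while i < len(encoded):
        for j in range(len(encoded), i, -1):
            if encoded[i:j] in vocab or j == i + 1:
                tokens.append(encoded[i:j])
                i = j
                break
    return tokens


clean = encode_pronunciation("他是人")
typo = encode_pronunciation("她事人")  # homophone typos in the first two characters
tokens = greedy_tokenize(clean, VOCAB)
print(clean == typo, tokens)
```

In this toy case the clean and mistyped inputs encode to the same string, so they receive the same tokens, and three characters are covered by two tokens because one vocabulary entry spans a character boundary.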
Simulation of tumor ablation in hyperthermia cancer treatment: A parametric study
A holistic simulation framework for magnetic hyperthermia is established to
model the treatment of a tumor surrounded by a block of healthy tissue. The
simulation accounts for the interstitial tissue fluid, the magnetic
nanoparticle (MNP) distribution, the temperature profile, and the nanofluids.
The study evaluates treatment efficacy by cumulative equivalent minutes at
43 °C (CEM43), a widely accepted thermal dose derived from the cell death
curve. Results are separated into conditions with and without the gravity
effect in the computational domain, and two baseline cases are investigated
and compared. In the baseline case without gravity, the optimal treatment time
is 46.55 min. With the gravity effect the situation deteriorates: the time
needed to kill all tumor cells increases by 36.11%, and 21.32% of the healthy
tissue is ablated. For the cases without gravity, a parametric study of the
Lewis number and the heat source number is conducted, and the variation of the
optimal treatment time with each parameter is fitted by an inverse function.
For the cases with gravity, the buoyancy ratio and the Darcy ratio are
investigated, and their influence on the time needed to kill all tumor cells
and on the injury to healthy tissue is fitted by parabolic functions. The
results help predict outcomes under various conditions and provide useful
guidance for magnetic hyperthermia treatment.
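The CEM43 thermal dose mentioned above is commonly computed by summing time intervals weighted by R^(43 - T), with R = 0.5 at or above 43 °C and R = 0.25 below it. The sketch below uses that standard form. The abstract does not state the constants or the time discretization used in the study, so both are assumptions here, as is the function name.

```python
def cem43(temps_c, dt_min):
    """Cumulative equivalent minutes at 43 degrees C.

    temps_c: tissue temperatures in degrees C, one sample per time step
    dt_min:  duration of each time step in minutes
    """
    total = 0.0
    for t in temps_c:
        # Commonly used constants (assumed): 0.5 at or above 43 C, 0.25 below.
        r = 0.5 if t >= 43.0 else 0.25
        total += dt_min * r ** (43.0 - t)
    return total


# Ten one-minute steps held at 44 C count twice as much as ten steps at 43 C.
dose_at_43 = cem43([43.0] * 10, 1.0)
dose_at_44 = cem43([44.0] * 10, 1.0)
print(dose_at_43, dose_at_44)
```

Each degree above 43 °C halves the time needed to reach the same dose, which is why the optimal treatment time is so sensitive to the temperature field.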