60 research outputs found
READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises
For many real-world applications, the user-generated inputs usually contain
various noises due to speech recognition errors caused by linguistic
variations or typographical errors (typos). Thus, it is crucial to test model
performance on data with realistic input noises to ensure robustness and
fairness. However, little work has been done to construct such benchmarks for
Chinese, where various language-specific input noises occur in the real world.
In order to fill this important gap, we construct READIN: a Chinese multi-task
benchmark with REalistic And Diverse Input Noises. READIN contains four diverse
tasks and requests annotators to re-enter the original test data with two
commonly used Chinese input methods: Pinyin input and speech input. We designed
our annotation pipeline to maximize diversity, for example by instructing the
annotators to use diverse input method editors (IMEs) for keyboard noises and
recruiting speakers from diverse dialectal groups for speech noises. We
experiment with a series of strong pretrained language models as well as robust
training methods and find that these models often suffer significant
performance drops on READIN, even with robustness methods like data
augmentation. As the first large-scale attempt at creating a benchmark with
noises geared towards user-generated inputs, we believe that READIN serves as
an important complement to existing Chinese NLP benchmarks. The source code and
dataset can be obtained from https://github.com/thunlp/READIN.
Comment: Preprint
Sub-Character Tokenization for Chinese Pretrained Language Models
Tokenization is fundamental to pretrained language models (PLMs). Existing
tokenization methods for Chinese PLMs typically treat each character as an
indivisible token. However, they ignore the unique feature of the Chinese
writing system where additional linguistic information exists below the
character level, i.e., at the sub-character level. To utilize such information,
we propose sub-character (SubChar for short) tokenization. Specifically, we
first encode the input text by converting each Chinese character into a short
sequence based on its glyph or pronunciation, and then construct the vocabulary
based on the encoded text with sub-word tokenization. Experimental results show
that SubChar tokenizers have two main advantages over existing tokenizers: 1)
They can tokenize inputs into much shorter sequences, thus improving the
computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode
Chinese homophones into the same transliteration sequences and produce the same
tokenization output, hence being robust to all homophone typos. At the same
time, models trained with SubChar tokenizers perform competitively on
downstream tasks. We release our code at
https://github.com/thunlp/SubCharTokenization to facilitate future work.
Comment: This draft supersedes the previous version named "SHUOWEN-JIEZI:
Linguistically Informed Tokenizers For Chinese Language Model Pretraining"
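The pronunciation-based encoding step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the pinyin table `PINYIN`, the boundary marker `SEP`, and the function name `encode_pronunciation` are hypothetical and hand-written for this example, and the subsequent sub-word vocabulary construction is omitted. The sketch only shows why homophone typos collapse to the same encoded sequence.

```python
# Toy pronunciation table (hypothetical, hand-written for this sketch).
# Each character maps to its pinyin spelling followed by a tone digit.
PINYIN = {
    "他": "ta1", "她": "ta1", "它": "ta1",  # three homophones
    "在": "zai4", "再": "zai4",             # two homophones
    "家": "jia1",
}
SEP = "#"  # assumed marker for character boundaries


def encode_pronunciation(text: str) -> str:
    """Replace each known character by its pinyin code plus a separator.

    Characters that are missing from the toy table are kept unchanged.
    """
    return "".join(PINYIN[ch] + SEP if ch in PINYIN else ch for ch in text)


clean = "他在家"
typo = "她再家"  # same pronunciation, two wrong characters

# Both strings map to the same transliteration, so any sub-word tokenizer
# applied to the encoded text would produce identical tokens for them.
print(encode_pronunciation(clean))  # ta1#zai4#jia1#
print(encode_pronunciation(clean) == encode_pronunciation(typo))  # True
```

In a full pipeline, a sub-word vocabulary would then be learned over the encoded text, as the abstract describes.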
Numerical analysis of heat transfer and fluid flow in multilayer deposition of PAW-based wire and arc additive manufacturing
A three-dimensional numerical model has been developed to investigate fluid flow and heat transfer in multilayer deposition of plasma arc welding (PAW) based wire and arc additive manufacturing (WAAM). The volume of fluid (VOF) and porosity enthalpy methods are employed to track the molten pool free surface and the solidification front, respectively. A modified double ellipsoidal heat source model is used to keep the arc heat input constant in the calculation while the molten pool surface changes dynamically. Transient simulations were conducted for the 1st, 2nd and 21st layer depositions. The shape and size of the deposited bead and weld pool were predicted and compared with experimental results. The results show that, for each layer of deposition, the Marangoni force plays the most important role in affecting fluid flow, and conduction is the dominant mode of heat dissipation compared with convection and radiation to the air. As the layer number increases, the length and width of the molten pool and the width of the deposited bead increase, whilst the layer height decreases. However, these dimensions remain constant once the deposited part is sufficiently high. In high layer deposition, where side support is absent, the depth of the molten pool at the rear part is almost flat in the Y direction. The profile of the deposited bead is mainly determined by the static pressure caused by gravity and by the surface tension pressure; therefore, the bead profile is nearly circular. The simulated profiles and dimensions of the deposited bead and molten pool were validated against experimental weld appearance, cross-sectional images and process camera images, and are in good agreement with the experimental results.
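For orientation, the standard (unmodified) double ellipsoidal heat source can be sketched as below. The abstract does not specify the paper's modification, so this sketch does not reproduce it. The function name `double_ellipsoid_heat_density`, the parameter names, and the default choice of the front and rear fractions are assumptions made for this example.

```python
import math


def double_ellipsoid_heat_density(x, y, z, Q, a_f, a_r, b, c, f_f=None, f_r=None):
    """Heat density of the standard double ellipsoidal source.

    x is measured along the travel direction from the source centre
    (x >= 0 is the front half). Q is the heat input; a_f, a_r, b and c
    are the front, rear, width and depth semi-axes. With Q in W and
    lengths in m, the result is in W/m^3.
    """
    if f_f is None or f_r is None:
        # Assumed default: fractions that sum to 2 and make the density
        # continuous at x = 0 (f_f / a_f == f_r / a_r).
        f_f = 2.0 * a_f / (a_f + a_r)
        f_r = 2.0 - f_f
    # Pick the front or rear ellipsoid depending on the sign of x.
    a, f = (a_f, f_f) if x >= 0 else (a_r, f_r)
    coeff = 6.0 * math.sqrt(3.0) * f * Q / (a * b * c * math.pi * math.sqrt(math.pi))
    return coeff * math.exp(-3.0 * ((x / a) ** 2 + (y / b) ** 2 + (z / c) ** 2))


# Illustrative values only (not taken from the paper).
q_centre = double_ellipsoid_heat_density(0.0, 0.0, 0.0, Q=1000.0,
                                         a_f=0.002, a_r=0.004, b=0.002, c=0.002)
q_front_edge = double_ellipsoid_heat_density(0.002, 0.0, 0.0, Q=1000.0,
                                             a_f=0.002, a_r=0.004, b=0.002, c=0.002)
print(q_front_edge / q_centre)  # density ratio at one front semi-axis, about exp(-3)
```

The density falls to exp(-3), roughly 5% of its peak, at a distance of one semi-axis from the centre, which is the usual way the semi-axes are related to the molten pool dimensions.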
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Large language models (LLMs) have been shown to be able to perform new tasks
based on a few demonstrations or natural language instructions. While these
capabilities have led to widespread adoption, most LLMs are developed by
resource-rich organizations and are frequently kept from the public. As a step
towards democratizing this powerful technology, we present BLOOM, a
176B-parameter open-access language model designed and built thanks to a
collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer
language model that was trained on the ROOTS corpus, a dataset comprising
hundreds of sources in 46 natural and 13 programming languages (59 in total).
We find that BLOOM achieves competitive performance on a wide variety of
benchmarks, with stronger results after undergoing multitask prompted
finetuning. To facilitate future research and applications using LLMs, we
publicly release our models and code under the Responsible AI License.
Research on the Development Characteristics of Green Energy Industry in Main Developed Countries
Green energy is regarded as the breakthrough of the fourth technological revolution and has drawn worldwide attention. By analysing the strategies that major developed countries use to promote the green energy industry, this paper constructs a theoretical framework covering four aspects (government policy, green consumption, technology and capital) to summarize the typical characteristics of green energy industry development. The study finds that government policy and technology are the main driving forces behind the development of the green energy industry in major developed countries. The resource-rich United States leads industrial development through policy, the European Union advances new energy development through policy and technological innovation together, and Japan relies on continuous innovation and technological leadership to overcome the limitations of its natural conditions. The conclusions can help countries with similar resource bases, policy environments and consumption concepts to clarify their approach to green energy development, and provide a reference for formulating effective development strategies.