60 research outputs found

    READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises

    For many real-world applications, user-generated inputs usually contain various noises due to speech recognition errors caused by linguistic variations or typographical errors (typos). Thus, it is crucial to test model performance on data with realistic input noises to ensure robustness and fairness. However, little work has been done to construct such benchmarks for Chinese, where various language-specific input noises occur in the real world. To fill this important gap, we construct READIN: a Chinese multi-task benchmark with REalistic And Diverse Input Noises. READIN contains four diverse tasks and requests annotators to re-enter the original test data with two commonly used Chinese input methods: Pinyin input and speech input. We designed our annotation pipeline to maximize diversity, for example by instructing the annotators to use diverse input method editors (IMEs) for keyboard noises and recruiting speakers from diverse dialectal groups for speech noises. We experiment with a series of strong pretrained language models as well as robust training methods and find that these models often suffer significant performance drops on READIN, even with robustness methods like data augmentation. As the first large-scale attempt at creating a benchmark with noises geared towards user-generated inputs, we believe that READIN serves as an important complement to existing Chinese NLP benchmarks. The source code and dataset can be obtained from https://github.com/thunlp/READIN. Comment: Preprint

    Sub-Character Tokenization for Chinese Pretrained Language Models

    Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore a unique feature of the Chinese writing system: additional linguistic information exists below the character level, i.e., at the sub-character level. To utilize such information, we propose sub-character (SubChar for short) tokenization. Specifically, we first encode the input text by converting each Chinese character into a short sequence based on its glyph or pronunciation, and then construct the vocabulary based on the encoded text with sub-word tokenization. Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) They can tokenize inputs into much shorter sequences, thus improving computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to all homophone typos. At the same time, models trained with SubChar tokenizers perform competitively on downstream tasks. We release our code at https://github.com/thunlp/SubCharTokenization to facilitate future work. Comment: This draft supersedes the previous version named "SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining"
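
The two-step idea in this abstract (encode each character by pronunciation, then segment the encoded text into sub-word units) can be illustrated with a toy sketch. This is not the released SubCharTokenization code: the pronunciation table `TOY_PINYIN`, the separator `SEP`, and the greedy longest-match segmenter are all assumptions made for illustration, with the greedy matcher standing in for a learned sub-word model.

```python
from typing import List, Set

# Hypothetical toy pronunciation table (an assumption for illustration);
# a real system would use a full character-to-pinyin dictionary.
TOY_PINYIN = {
    "他": "ta1",
    "她": "ta1",
    "它": "ta1",
    "是": "shi4",
    "事": "shi4",
    "人": "ren2",
}

SEP = "#"  # hypothetical marker appended after every encoded character


def encode_pronunciation(text: str) -> str:
    """Replace each known character with its transliteration plus a separator.

    Characters missing from the table are passed through unchanged.
    """
    return "".join(TOY_PINYIN.get(ch, ch) + SEP for ch in text)


def toy_subword_tokenize(encoded: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match segmentation over the encoded string.

    Falls back to a single symbol when no vocabulary entry matches.
    This is a stand-in for a learned sub-word tokenizer, not a replacement.
    """
    tokens: List[str] = []
    i = 0
    while i < len(encoded):
        for j in range(len(encoded), i, -1):
            if encoded[i:j] in vocab or j == i + 1:
                tokens.append(encoded[i:j])
                i = j
                break
    return tokens


if __name__ == "__main__":
    # Hypothetical vocabulary; one entry spans two characters.
    vocab = {"ta1#shi4#", "ren2#"}
    for sentence in ("他是人", "她事人"):
        print(sentence, toy_subword_tokenize(encode_pronunciation(sentence), vocab))
```

Because homophones share one transliteration in the toy table, both sentences in the usage example yield the same token sequence, and a vocabulary entry that spans two characters shortens the output relative to one token per character.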

    Numerical analysis of heat transfer and fluid flow in multilayer deposition of PAW-based wire and arc additive manufacturing

    A three-dimensional numerical model has been developed to investigate the fluid flow and heat transfer behaviors in multilayer deposition of plasma arc welding (PAW) based wire and arc additive manufacture (WAAM). The volume of fluid (VOF) and porosity enthalpy methods are employed to track the molten pool free surface and solidification front, respectively. A modified double ellipsoidal heat source model is utilized to keep the arc heat input constant in the calculation while the molten pool surface changes dynamically. Transient simulations were conducted for the 1st, 2nd and 21st layer depositions. The shape and size of the deposited bead and weld pool were predicted and compared with experimental results. The results show that, for each layer of deposition, the Marangoni force plays the most important role in affecting fluid flow, and conduction is the dominant method of heat dissipation compared to convection and radiation to the air. As the layer number increases, the length and width of the molten pool and the width of the deposited bead increase, whilst the layer height decreases. However, these dimensions remain constant when the deposited part is sufficiently high. In high layer deposition, where side support is absent, the depth of the molten pool at the rear part is almost flat in the Y direction. The profile of the deposited bead is mainly determined by static pressure caused by gravity and surface tension pressure; therefore, the bead profile is nearly circular. The simulated profiles and size dimensions of the deposited bead and molten pool were validated with experimental weld appearance, cross-sectional images and process camera images. The simulated results are in good agreement with experimental results.
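
For orientation, the standard (unmodified) double ellipsoidal volumetric heat source of Goldak can be sketched as below. The abstract does not describe the authors' modification, so this sketch does not represent it; the function name, the parameter names, and the default choice of the front fraction are assumptions for illustration only.

```python
import math


def goldak_heat_flux(x, y, z, Q, a_f, a_r, b, c, f_f=None):
    """Power density (W/m^3) of the standard double ellipsoidal heat source.

    x is measured along the travel direction from the source centre
    (x >= 0 is the front half), y across the bead, z into the workpiece.
    Q is the effective heat input; a_f, a_r, b, c are the semi-axes.
    The fractions satisfy f_f + f_r = 2. The default
    f_f = 2 * a_f / (a_f + a_r) is an assumed, commonly used choice that
    makes the two halves meet continuously at x = 0.
    """
    if f_f is None:
        f_f = 2.0 * a_f / (a_f + a_r)
    f_r = 2.0 - f_f
    # Select the front or rear half-ellipsoid parameters.
    a, f = (a_f, f_f) if x >= 0 else (a_r, f_r)
    coeff = 6.0 * math.sqrt(3.0) * f * Q / (a * b * c * math.pi * math.sqrt(math.pi))
    return coeff * math.exp(-3.0 * (x / a) ** 2 - 3.0 * (y / b) ** 2 - 3.0 * (z / c) ** 2)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to exercise the function.
    params = dict(Q=1000.0, a_f=2e-3, a_r=4e-3, b=2e-3, c=2e-3)
    print(goldak_heat_flux(0.0, 0.0, 0.0, **params))
    print(goldak_heat_flux(1e-3, 0.0, 0.0, **params))
```

The density peaks at the source centre and decays as a Gaussian along each axis, with a longer tail behind the arc whenever `a_r` exceeds `a_f`.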

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.

    Research on the Development Characteristics of Green Energy Industry in Main Developed Countries

    Green energy is regarded as the breakthrough of the fourth technological revolution of mankind and has attracted attention worldwide. By analysing the strategies that major developed countries use to promote the green energy industry, this paper constructs a theoretical framework covering four aspects (government policy, green consumption, technology and capital) in order to summarize the typical characteristics of the development of the green energy industry. The study found that government policy and technology are the main driving forces for the development of the green energy industry in major developed countries. The resource-rich United States leads industrial development with policy, the European Union advances new energy development through policy and technological innovation at the same time, and Japan continues to innovate and take the lead in technology to overcome the limitations of its innate conditions. The conclusions of the study will help countries with a similar resource base, policy environment and consumption concept to clarify their ideas for developing green energy, and provide a reference for the formulation of effective development strategies.