16 research outputs found

    ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with Effective Evaluation Model

    During the development of large language models (LLMs), the scale and quality of the pre-training data play a crucial role in shaping LLMs' capabilities. To accelerate LLM research, several large-scale datasets, such as C4 [1], Pile [2], RefinedWeb [3] and WanJuan [4], have been released to the public. However, most of the released corpora focus mainly on English, and there is still a lack of a complete tool-chain for extracting clean texts from web data. Furthermore, fine-grained information about the corpus, e.g. the quality of each text, is missing. To address these challenges, we propose in this paper a new complete tool-chain, EvalWeb, to extract clean Chinese texts from noisy web data. First, similar to previous work, manually crafted rules are employed to discard explicitly noisy texts from the raw crawled web contents. Second, a well-designed evaluation model is leveraged to assess the remaining relatively clean data, and each text is assigned a specific quality score. Finally, an appropriate threshold can be used to select the high-quality pre-training data for Chinese. Using our proposed approach, we release the largest and latest large-scale high-quality Chinese web text, ChineseWebText, which consists of 1.42 TB of data in which each text is associated with a quality score, allowing LLM researchers to choose data according to the desired quality thresholds. We also release a much cleaner subset of 600 GB of Chinese data with quality exceeding 90%.
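The threshold-based selection step described in this abstract can be sketched as follows. This is a minimal illustration, not the EvalWeb implementation: the `ScoredText` record, its `quality` field (assumed to lie in [0, 1]), and the function name `select_high_quality` are hypothetical, and the sample records are toy values.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class ScoredText:
    """Hypothetical record: one text plus its quality score (assumed in [0, 1])."""
    text: str
    quality: float


def select_high_quality(
    records: Iterable[ScoredText], threshold: float
) -> Iterator[ScoredText]:
    """Yield only the records whose quality score exceeds the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    for record in records:
        if record.quality > threshold:
            yield record


# Toy records for illustration only; not taken from the released corpus.
sample = [
    ScoredText("text a", 0.95),
    ScoredText("text b", 0.40),
    ScoredText("text c", 0.91),
]
kept = [r.text for r in select_high_quality(sample, threshold=0.9)]
```

Because the selection is a generator, it could be applied to a stream of records without loading a full corpus into memory; raising or lowering `threshold` trades corpus size against quality.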

    CoLLiE: Collaborative Training of Large Language Models in an Efficient Way

    Large language models (LLMs) are increasingly pivotal in a wide range of natural language processing tasks. Access to pre-trained models, courtesy of the open-source community, has made it possible to adapt these models to specific applications for enhanced performance. However, the substantial resources required for training these models necessitate efficient solutions. This paper introduces CoLLiE, an efficient library that facilitates collaborative training of large language models using 3D parallelism, parameter-efficient fine-tuning (PEFT) methods, and optimizers such as Lion, Adan, Sophia, LOMO and AdaLomo. With its modular design and comprehensive functionality, CoLLiE offers a balanced blend of efficiency, ease of use, and customization. CoLLiE has demonstrated superior training efficiency in comparison with prevalent solutions in pre-training and fine-tuning scenarios. Furthermore, we provide an empirical evaluation of the correlation between model size and GPU memory consumption under different optimization methods, as well as an analysis of the throughput. Lastly, we carry out a comprehensive comparison of various optimizers and PEFT methods within the instruction-tuning context. CoLLiE is available at https://github.com/OpenLMLab/collie. Comment: To appear at EMNLP 2023 Demo.

    Experience reverses the red effect among Chinese stockbrokers.

    Recent research has shown that the color red influences psychological functioning. Red is hypothesized to be linked to aggression and danger in evolution, and these links are enhanced by culture-specific uses of red. Thus, color meanings are thought to be grounded in biologically based proclivities and learned associations. However, to date, there has been no direct evidence for the influence of experience on the red effect. This study focused on whether experience could change the psychological effects of the color red. In the context of the Chinese stock market, contrary to the meaning generally associated with red as negative and green as positive, red represents a rise in stock price and green stands for a decrease. An experiment using a 2×2 between-subjects factorial design demonstrated that red (compared with green) impaired Chinese college students' performance on an IQ test (in accordance with the red effect), but the opposite effect was found among stockbrokers. These results provide direct evidence of learned color meanings, in support of the general model of color effect.

    The effect of color on performance on Raven’s Standard Progressive Matrices. The college students in the red group (n = 12) performed worse than did those in the green group (n = 12).

    Conversely, the stockbrokers in the red group (n = 12) performed better than did those in the green group (n = 12). Error bars indicate standard error of test scores.

    Red color in flags: A signal for competition

    The color-in-context theory and ecological valence theory suggest that color preference depends on the context and ecological object that define the psychological meanings of colors. The present study was conducted to identify the preference for the color red in national flags across the world. We examined 192 national flags and found that red was the most frequently used color. Through a systematic examination of the symbolic meanings behind the use of red in flags, it was also found that red was often associated with an aggressive connotation. In contrast, the flags of the selected international collaborative organizations did not appear to prefer red. These results support the hypothesis of red flag preference in real-world competitive contexts. Limitations and future research directions are also discussed.

    Urban Older Adults Becoming Unhealthier in Modern China: A Cross-Temporal Meta-Analysis

    This study investigated patterns of change in the health status of urban older adults in China from 2001 to 2013. A cross-temporal meta-analysis was applied to 111 selected studies in which the SF-36 had been administered to urban older adults in China. Scores from a total of 72,441 participants were analyzed. Correlations between the SF-36 scores and sampling years were examined. The self-reported health status of urban older adults in China has declined significantly in the past 13 years. The observed decline in the health status of older adults suggests that economic progress and a rapidly aging population have had more negative than positive effects on the health of this population.
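The correlation between study-level scores and sampling years mentioned in this abstract can be illustrated with a sample-size-weighted Pearson correlation. This is a sketch under assumptions: the abstract does not state how studies were weighted, so the weighting by sample size, the function name `weighted_pearson`, and the toy numbers below are all hypothetical and are not the study's data.

```python
import math
from typing import Sequence


def weighted_pearson(
    x: Sequence[float], y: Sequence[float], w: Sequence[float]
) -> float:
    """Pearson correlation of x and y, with each pair weighted by w."""
    if not (len(x) == len(y) == len(w)) or len(x) < 2:
        raise ValueError("x, y and w must have equal length of at least 2")
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y)) / total
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)) / total
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y)) / total
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only (hypothetical, not from the meta-analysis):
years = [2001, 2005, 2009, 2013]        # sampling year of each study
mean_scores = [80.0, 78.0, 75.0, 71.0]  # study-level mean score
sample_sizes = [100, 200, 150, 250]     # participants per study

r = weighted_pearson(years, mean_scores, sample_sizes)
```

A negative `r` on such data would correspond to scores declining over the sampling years; weighting by sample size lets larger studies contribute more to the estimate.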

    Influence of Ink Properties on the Morphology of Long-Wave Infrared HgSe Quantum Dot Films

    The filter film array is the core device of the miniature quantum dot (QD) spectrometer, so control of its morphology cannot be ignored. We eliminated strong interference from additives on the spectrum of a long-wave infrared (LWIR) QD filter film by selecting volatile additives. This work is significant for detecting targets by spectroscopic methods. In this work, a filter film with characteristic spectral bands located in the LWIR was obtained by the natural evaporation of QD ink, which was prepared by mixing various volatile organic solvents with HgSe QD–toluene solution. The factors affecting the morphology of HgSe LWIR films, including ink surface tension, particle size, and solute volume fraction, were the main focus of the analysis. The experimental results suggested that the film slipped in the evaporation process, and the multilayer annular deposition formed when the surface tension of the ink was no more than 24.86 mN/m. The "coffee ring" and the multilayer annular deposition essentially disappeared when the solute particles were larger than 188.11 nm. QDs in the film accumulated, and a "gully" morphology appeared when the solute volume fraction was greater than 0.1. In addition, both the increase rate of the film height and the decrease rate of the transmission slowed down. The relationship between film height and transmission was obtained by fitting, and the curve conformed to the Lambert–Beer law. Therefore, a uniform and flat film without "coffee rings" can be prepared by adjusting the surface tension, particle size, and volume fraction. This work could provide an empirical method for the preparation of LWIR QD filter film arrays.
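For reference, the Lambert–Beer relation between film height and transmission mentioned in this abstract has the following standard form. The symbols are assumptions for illustration and are not taken from the paper: $T$ is the transmission, $I_0$ and $I$ are the incident and transmitted intensities, $h$ is the film height, and $\alpha$ is an effective attenuation coefficient of the film.

```latex
T(h) = \frac{I}{I_0} = e^{-\alpha h}
\qquad\Longleftrightarrow\qquad
\ln T(h) = -\alpha h
```

Under this form, a plot of $\ln T$ against $h$ is a straight line through the origin with slope $-\alpha$, which is one way a fitted curve could be checked against the law.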

    Synthesis and Application of Polymer SXFA in the Detection of Organophosphine Agents with a SAW Sensor

    The effective detection of isopropyl methylfluorophosphonate (GB, sarin), a type of organophosphine poisoning agent, is an urgent issue to address to maintain public safety. In this research, a gas-sensitive film material, poly (4-hydroxy-4,4-bis trifluoromethyl)-butyl-1-enyl)-siloxane (SXFA), with a structure of hexafluoroisopropyl (HFIP) functional group was synthesized by using methyl vinylpropyl dichlorosilane and hexafluoroacetone trihydrate as initial materials. The synthesis process products were characterized using FTIR. SXFA was prepared on a 200 MHz shear surface wave delay line using the spin-coating method for GB detection. A detection limit of 3 was achieved through conditional experiments. Meanwhile, we also obtained a maximum response of 2.168 mV at a 0.1 mg/m³ concentration, indicating the much lower detection limit of the SAW-SXFA sensor. Additionally, a maximum response standard deviation of 0.11 mV with a coefficient of variation of 0.01 and a maximum recovery standard deviation of 0.22 mV with a coefficient of variation of 0.02 were also obtained through five repeated experiments. The results show that the SAW-SXFA sensor has strong selectivity and reproducibility, good selectivity, positive detection ability, high sensitivity, and fast alarm performance for sarin detection.