10 research outputs found

    Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback

    A key technology for the development of large language models (LLMs) is instruction tuning, which helps align the models' responses with human expectations to realize impressive learning abilities. The two major approaches to instruction tuning are supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), which are currently applied to produce the best commercial LLMs (e.g., ChatGPT). To improve the accessibility of LLMs for research and development efforts, various instruction-tuned open-source LLMs have also been introduced recently, e.g., Alpaca and Vicuna. However, existing open-source LLMs have only been instruction-tuned for English and a few popular languages, limiting their impact and accessibility for many other languages in the world. Among the few very recent works exploring instruction tuning for LLMs in multiple languages, SFT has been the only approach used. This has left a significant gap for fine-tuned LLMs based on RLHF in diverse languages and raised important questions about how RLHF can boost the performance of multilingual instruction tuning. To address this gap, we present Okapi, the first system with instruction-tuned LLMs based on RLHF for multiple languages. Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate the experiments and development of future multilingual LLM research. We also present benchmark datasets to enable the evaluation of generative LLMs in multiple languages. Our experiments demonstrate the advantages of RLHF for multilingual instruction tuning over SFT for different base models and datasets. Our framework and resources are released at https://github.com/nlp-uoregon/Okapi

    CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages

    The driving factors behind the development of large language models (LLMs) with impressive learning capabilities are their colossal model sizes and extensive training datasets. Along with the progress in natural language processing, LLMs have frequently been made accessible to the public to foster deeper investigation and applications. However, the training datasets for these LLMs, especially the recent state-of-the-art models, are often not fully disclosed. Creating training data for high-performing LLMs involves extensive cleaning and deduplication to ensure the necessary level of quality. The lack of transparency about training data has thus hampered research on attributing and addressing hallucination and bias issues in LLMs, hindering replication efforts and further advancements in the community. These challenges become even more pronounced in multilingual learning scenarios, where the available multilingual text datasets are often inadequately collected and cleaned. Consequently, there is a lack of open-source and readily usable datasets to effectively train LLMs in multiple languages. To overcome this issue, we present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for LLM development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous multi-stage pipeline to achieve the best quality for model training, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. CulturaX is fully released to the public on HuggingFace to facilitate research and advancements in multilingual LLMs: https://huggingface.co/datasets/uonlp/CulturaX

    Viral Metagenomic Analysis of Cerebrospinal Fluid from Patients with Acute Central Nervous System Infections of Unknown Origin, Vietnam.

    Central nervous system (CNS) infection is a serious neurologic condition, although the etiology remains unknown in >50% of patients. We used metagenomic next-generation sequencing to detect viruses in 204 cerebrospinal fluid (CSF) samples from patients with acute CNS infection who were enrolled from Vietnam hospitals during 2012-2016. We detected 8 viral species in 107/204 (52.4%) of CSF samples. After virus-specific PCR confirmation, the detection rate was lowered to 30/204 (14.7%). Enteroviruses were the most common viruses detected (n = 23), followed by hepatitis B virus (3), HIV (2), molluscum contagiosum virus (1), and gemycircularvirus (1). Analysis of enterovirus sequences revealed the predominance of echovirus 30 (9). Phylogenetically, the echovirus 30 strains belonged to genogroups V and VIIb. Our results expand knowledge about the clinical burden of enterovirus in Vietnam and underscore the challenges of identifying a plausible viral pathogen in CSF of patients with CNS infections

    Improved performance for antenna based on a combination of fractal geometry with CSRR

    In this paper, a design method for an antenna operating at 3.5 GHz for 5G systems is presented to improve antenna performance. The antenna is designed using fractal geometry combined with an imperfectly structured ground plane. The radiating surface has the form of a Minkowski island fractal geometry, and the removed part of the ground is a complementary split ring resonator unit cell. The substrate material is FR4-epoxy microwave laminate with dielectric constant ϵ = 4.4, loss tangent (tan δ) of 0.02, and thickness h = 1.6 mm. HFSS software is used in the simulation, with a microstrip line as the feeding method. The proposed antenna shows a significant performance increase compared to the original microstrip antenna, including a reduction of about 56% in total size, a 207% bandwidth enhancement, an increase in peak gain to 4.66 dB, and an improvement in radiation efficiency to 89.3%. The physical model of the antenna has been fabricated and measured to verify the correctness of the design

    Effect of foliar application of oligochitosan with different molecular weight on growth promotion and fruit yield enhancement of chili plant

    Oligochitosan (OC) is an effective biostimulant for promoting plant growth and eliciting resistance to disease infection. However, the range of OC molecular weight that exhibits the most effective activity is not fully understood and requires further investigation. In this study, OCs with different weight-average molecular weights (Mw) were prepared by gamma Co-60 irradiation degradation of chitosan in solution, and the effect of foliar application of OCs, particularly those with Mw of 7.8, 5.0, and 2.5 kDa, on growth promotion and fruit yield enhancement of chili plants (Capsicum frutescens L.) was investigated. Chili plants cultivated in a greenhouse were sprayed three times with OC at a concentration of 50 mg/L. Results indicated that, among the treatments, OC with Mw of 2.5 kDa proved to be the best, increasing the shoot fresh weight by 71.5%, shoot dry weight by 184%, total chlorophyll content by 12%, and fruit fresh weight by 49.8% relative to the control. Thus, OC with low Mw (2.5 kDa), which can be suitably produced on a large scale by gamma Co-60 irradiation degradation of chitosan solution, is a promising biostimulant for significantly enhancing chili fruit yield

    Aetiology and Potential Animal Exposure in Central Nervous System Infections in Vietnam.

    An estimated 73% of emerging infections are zoonotic in origin, with animal contact and encroachment on their habitats increasing the risk of spill-over events. In Vietnam, close exposure to a wide range of animals and animal products can lead to acquisition of zoonotic pathogens, a number of which cause central nervous system (CNS) infections. However, studies show the aetiology of CNS infections remains unknown in around half of cases. We used samples and data from hospitalised patients with CNS infections, enrolled into the Vietnam Initiative on Zoonotic Infections multicentre study, to determine the association between aetiology and animal contact, including in those in whom the cause was unknown. Among 933 patients, a pathogen or an antibody response to it was identified in 291 (31.2%, 95% CI 28.3-34.3%). The most common pathogens were Streptococcus suis (n = 91 (9.8%, 8.0-11.9%)) and Japanese encephalitis virus (JEV) (n = 72 (7.7%, 6.1-9.7%)). Commonly reported animal contact included keeping, raising or handling (n = 364 (39.0%, 35.9-42.2%)) and handling, cooking or consuming raw meat, blood or viscera in the 2 weeks prior to symptom onset (n = 371 (39.8%, 36.6-43.0%)), with the latter most commonly from pigs (n = 343 (36.9%, 33.8-40.1%)). There was no association between an unknown aetiology and exposure to animals in a multivariate logistic regression. Further testing for unknown or undetected pathogens may increase diagnostic yield; however, given the high proportion of zoonotic pathogens and the presence of risk factors, increasing public awareness about zoonoses and preventive measures can be considered

    Multimodal analysis of methylomics and fragmentomics in plasma cell-free DNA for multi-cancer early detection and localization

    Despite their promise, circulating tumor DNA (ctDNA)-based assays for multi-cancer early detection face challenges in test performance, due mostly to the limited abundance of ctDNA and its inherent variability. To address these challenges, assays published to date have required very high-depth sequencing, resulting in a high test cost. Herein, we developed a multimodal assay called SPOT-MAS (screening for the presence of tumor by methylation and size) to simultaneously profile methylomics, fragmentomics, copy number, and end motifs in a single workflow using targeted and shallow genome-wide sequencing (~0.55×) of cell-free DNA. We applied SPOT-MAS to 738 non-metastatic patients with breast, colorectal, gastric, lung, and liver cancer, and 1550 healthy controls. We then employed machine learning to extract multiple cancer and tissue-specific signatures for detecting and locating cancer. SPOT-MAS successfully detected the five cancer types with a sensitivity of 72.4% at 97.0% specificity. The sensitivities for detecting early-stage cancers were 73.9% and 62.3% for stages I and II, respectively, increasing to 88.3% for non-metastatic stage IIIA. For tumor-of-origin, our assay achieved an accuracy of 0.7. Our study demonstrates comparable performance to other ctDNA-based assays while requiring significantly lower sequencing depth, making it economically feasible for population-wide screening