
    Strong scaling of general-purpose molecular dynamics simulations on GPUs

    We describe a highly optimized implementation of MPI domain decomposition in a GPU-enabled, general-purpose molecular dynamics code, HOOMD-blue (Anderson and Glotzer, arXiv:1308.5587). Our approach is inspired by a traditional CPU-based code, LAMMPS (Plimpton, J. Comp. Phys. 117, 1995), but is implemented within a code that was designed for execution on GPUs from the start (Anderson et al., J. Comp. Phys. 227, 2008). The software supports short-ranged pair and bond force fields and achieves optimal GPU performance using an autotuning algorithm. We demonstrate equivalent or superior scaling on up to 3,375 GPUs in Lennard-Jones and dissipative particle dynamics (DPD) simulations of up to 108 million particles. GPUDirect RDMA capabilities in recent GPU generations provide better performance in full double precision calculations. For a representative polymer physics application, HOOMD-blue 1.0 provides an effective GPU vs. CPU node speed-up of 12.5x. Comment: 30 pages, 14 figures
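
    The abstract mentions an autotuning algorithm for choosing GPU kernel launch parameters. As a rough illustration of that general idea (not HOOMD-blue's actual implementation), the Python sketch below times a kernel-like callable over a set of candidate block sizes and keeps the fastest; the run_kernel callable and the candidate sizes are hypothetical.

        # Minimal sketch of launch-parameter autotuning, assuming a hypothetical
        # run_kernel(block_size) callable; this is not HOOMD-blue's actual API.
        import time

        def autotune(run_kernel, candidate_block_sizes=(32, 64, 128, 256), n_samples=5):
            """Time each candidate configuration and return the fastest block size."""
            best_size, best_time = None, float("inf")
            for block_size in candidate_block_sizes:
                run_kernel(block_size)                  # warm-up call
                start = time.perf_counter()
                for _ in range(n_samples):
                    run_kernel(block_size)
                elapsed = (time.perf_counter() - start) / n_samples
                if elapsed < best_time:
                    best_size, best_time = block_size, elapsed
            return best_size

    In a production code this selection would typically be repeated per kernel and cached, since the best configuration depends on system size and hardware.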

    Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback

    A key technology for the development of large language models (LLMs) is instruction tuning, which helps align the models' responses with human expectations to realize impressive learning abilities. The two major approaches to instruction tuning are supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), which are currently applied to produce the best commercial LLMs (e.g., ChatGPT). To improve the accessibility of LLMs for research and development efforts, various instruction-tuned open-source LLMs have also been introduced recently, e.g., Alpaca and Vicuna, to name a few. However, existing open-source LLMs have only been instruction-tuned for English and a few popular languages, which limits their impact on and accessibility to many other languages in the world. Among the few very recent works that explore instruction tuning for LLMs in multiple languages, SFT has been the only approach used to instruction-tune LLMs for multiple languages. This has left a significant gap for RLHF-based fine-tuned LLMs in diverse languages and raised important questions about how RLHF can boost the performance of multilingual instruction tuning. To overcome this issue, we present Okapi, the first system with instruction-tuned LLMs based on RLHF for multiple languages. Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate experiments and the development of future multilingual LLM research. We also present benchmark datasets to enable the evaluation of generative LLMs in multiple languages. Our experiments demonstrate the advantages of RLHF over SFT for multilingual instruction tuning across different base models and datasets. Our framework and resources are released at https://github.com/nlp-uoregon/Okapi
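
    RLHF rests on a reward model trained from response-ranked data such as the rankings Okapi releases. As a generic illustration of that training step (not Okapi's released code), the sketch below computes the standard pairwise ranking loss between the scores of a chosen and a rejected response; the reward_model module and the token-id tensors are hypothetical placeholders.

        # Generic pairwise ranking loss for training a reward model on
        # response-ranked data; reward_model is a hypothetical module that
        # maps a batch of token-id sequences to one scalar score each.
        import torch
        import torch.nn.functional as F

        def reward_ranking_loss(reward_model, chosen_ids, rejected_ids):
            """Bradley-Terry style objective: score the chosen response above the rejected one."""
            score_chosen = reward_model(chosen_ids)      # shape: (batch,)
            score_rejected = reward_model(rejected_ids)  # shape: (batch,)
            return -F.logsigmoid(score_chosen - score_rejected).mean()

    The fitted reward model then provides the scalar signal that the RL stage (e.g., PPO) optimizes against the SFT policy.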

    CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages

    The driving factors behind the development of large language models (LLMs) with impressive learning capabilities are their colossal model sizes and extensive training datasets. Along with the progress in natural language processing, LLMs have frequently been made accessible to the public to foster deeper investigation and applications. However, the training datasets for these LLMs, especially for the recent state-of-the-art models, are often not fully disclosed. Creating training data for high-performing LLMs involves extensive cleaning and deduplication to ensure the necessary level of quality. This lack of transparency around training data has thus hampered research on attributing and addressing hallucination and bias issues in LLMs, hindering replication efforts and further advancements in the community. These challenges become even more pronounced in multilingual learning scenarios, where the available multilingual text datasets are often inadequately collected and cleaned. Consequently, there is a lack of open-source, readily usable datasets to effectively train LLMs in multiple languages. To overcome this issue, we present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for LLM development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous multi-stage pipeline to achieve the best quality for model training, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. CulturaX is fully released to the public on HuggingFace to facilitate research and advancements in multilingual LLMs: https://huggingface.co/datasets/uonlp/CulturaX. Comment: Ongoing work
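
    CulturaX is distributed through the HuggingFace hub at the URL above, so it can be consumed with the datasets library. The sketch below streams one language subset and applies a crude length-based filter as a stand-in for the metric-based cleaning stage; the "en" configuration name and the "text" field are assumptions to verify against the dataset card.

        # Sketch: stream a CulturaX language subset and apply a simple
        # metric-based filter; the "en" config and "text" field are assumed.
        from datasets import load_dataset

        stream = load_dataset("uonlp/CulturaX", "en", split="train", streaming=True)

        def keep(example, min_chars=200):
            """Drop very short documents as a crude quality heuristic."""
            return len(example["text"]) >= min_chars

        for i, example in enumerate(ex for ex in stream if keep(ex)):
            print(example["text"][:80])
            if i == 2:
                break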

    Rigid body constraints realized in massively-parallel molecular dynamics on graphics processing units

    Molecular dynamics (MD) methods compute the trajectory of a system of point particles in response to a potential function by numerically integrating Newton's equations of motion. Extending these basic methods with rigid body constraints enables composite particles with complex shapes, such as anisotropic nanoparticles, grains, molecules, and rigid proteins, to be modeled. Rigid body constraints are added to the GPU-accelerated MD package HOOMD-blue, version 0.10.0. The software can now simulate systems of particles, rigid bodies, or mixed systems in the microcanonical (NVE), canonical (NVT), and isothermal-isobaric (NPT) ensembles. It can also apply the FIRE energy minimization technique to these systems. In this paper, we detail the massively parallel scheme that implements these algorithms and discuss how our design is tuned for the maximum possible performance. Two case studies, patchy spheres and tethered nanorods, are included to demonstrate the performance attained. In typical cases, HOOMD-blue on a single GTX 480 executes 2.5-3.6 times faster than LAMMPS executing the same simulation on any number of CPU cores in parallel. Simulations with rigid bodies may now be run with larger systems and for longer time scales on a single workstation than was previously possible even on large clusters.
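
    A rigid body constraint keeps each constituent particle at a fixed displacement from the body's center of mass, so particle positions are reconstructed each step from the center of mass and an orientation quaternion. The NumPy sketch below illustrates that reconstruction for one body; it is a schematic of the general technique, not HOOMD-blue's GPU implementation.

        # Sketch: rebuild constituent-particle positions of a rigid body from its
        # center of mass and orientation quaternion (w, x, y, z).
        import numpy as np

        def quat_to_matrix(q):
            """Rotation matrix for a unit quaternion (w, x, y, z)."""
            w, x, y, z = q
            return np.array([
                [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
                [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
                [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
            ])

        def constituent_positions(com, quat, body_frame_displacements):
            """Rotate body-frame displacements and translate to the center of mass."""
            R = quat_to_matrix(quat)
            return com + body_frame_displacements @ R.T

        # Example: a linear three-site body rotated 90 degrees about z.
        com = np.array([1.0, 0.0, 0.0])
        quat = np.array([np.cos(np.pi/4), 0.0, 0.0, np.sin(np.pi/4)])
        sites = np.array([[-0.5, 0, 0], [0.0, 0, 0], [0.5, 0, 0]])
        print(constituent_positions(com, quat, sites))

    Forces and torques on the constituent particles are then summed back onto the body, whose center of mass and quaternion are the quantities actually integrated.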

    Multiplexing siRNAs to compress RNAi-based screen size in human cells

    Here we describe a novel strategy that uses multiplexes of synthetic small interfering RNAs (siRNAs) corresponding to multiple gene targets in order to compress RNA interference (RNAi) screen size. Before investigating the practical use of this strategy, we first characterized the gene-specific RNAi induced by a large subset (258 siRNAs, 129 genes) of the entire siRNA library used in this study (∼800 siRNAs, ∼400 genes). We next demonstrated that multiplexed siRNAs could silence at least six genes to the same degree as when the genes were targeted individually. The entire library was then used in a screen in which randomly multiplexed siRNAs were assayed for their effect on cell viability. Using this strategy, several gene targets that influenced the viability of a breast cancer cell line were identified. This study suggests that the screening of randomly multiplexed siRNAs may provide an important avenue towards the identification of candidate gene targets for downstream functional analyses and may also be useful for the rapid identification of positive controls for use in novel assay systems. This approach is likely to be especially applicable where assay costs or platform limitations are prohibitive.
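
    The compression comes from a simple combinatorial step: siRNAs against different genes are assigned to random pools, the pools are assayed together, and hits are traced back to candidate genes for individual follow-up. The sketch below illustrates only that random pooling step, with a hypothetical library layout, and is not the authors' protocol.

        # Sketch: assign one siRNA per gene to random multiplexed pools so that
        # N genes are screened in roughly N / pool_size wells.
        import random

        def make_pools(gene_to_sirnas, pool_size=6, seed=0):
            """Randomly group genes, picking one siRNA per gene for each pool."""
            rng = random.Random(seed)
            genes = list(gene_to_sirnas)
            rng.shuffle(genes)
            pools = []
            for i in range(0, len(genes), pool_size):
                chunk = genes[i:i + pool_size]
                pools.append({g: rng.choice(gene_to_sirnas[g]) for g in chunk})
            return pools

        # Hypothetical library: ~400 genes with two siRNAs each.
        library = {f"GENE_{n}": [f"siRNA_{n}a", f"siRNA_{n}b"] for n in range(400)}
        print(len(make_pools(library)))   # about 400 / 6, i.e. ~67 wells instead of 400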

    A hidden HIV epidemic among women in Vietnam

    Background: The HIV epidemic in Vietnam is still concentrated among high-risk populations, including injecting drug users (IDU) and female sex workers (FSW). The response of the government has focused on the recognized high-risk populations, mainly young male drug users. This concentration on one high-risk population may leave other populations under-protected or unprepared for the risk and the consequences of HIV infection. In particular, women's risks of exposure and needs for care may not receive sufficient attention as long as the perception persists that the epidemic is predominantly among young males. Without more knowledge of the epidemic among women, policy makers and planners cannot ensure that programs will also serve women's needs.
    Methods: More than 300 documents appearing in the period 1990 to 2005 were gathered and reviewed to build an understanding of HIV infection and related risk behaviors among women and of the changes over time that may suggest needed policy changes.
    Results: It appears that the risk of HIV transmission among women in Vietnam has been underestimated; the reported data may represent as little as 16% of the real number. Although modeling predicted that there would be 98,500 HIV-infected women in 2005, only 15,633 were accounted for in reports from the health system. That could mean that in 2005, up to 83,000 women infected with HIV had not been detected by the health care system, for a number of possible reasons. For both detection and prevention, these women can be divided into sub-groups with different risk characteristics. They can be infected by sharing needles and syringes with IDU partners, or by having unsafe sex with clients, husbands or lovers. However, most new infections among women can be traced to sexual relations with young male injecting drug users engaged in extramarital sex. Each of these groups may need different interventions to increase the detection rate and thus ensure that the women receive the care they need.
    Conclusion: Women in Vietnam are increasingly at risk of HIV transmission, but that risk is under-reported and under-recognized. The reasons are that women are not getting tested, are not aware of risks, do not protect themselves and are not being protected by men. Based on this information, policy-makers and planners can develop better prevention and care programs that not only address women's needs but also reduce further spread of the infection among the general population.
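
    The 16% figure follows directly from the two numbers quoted in the Results; the short calculation below simply makes the arithmetic explicit.

        # Detection rate implied by the figures quoted in the abstract (2005).
        modelled_cases = 98_500      # women estimated to be living with HIV
        reported_cases = 15_633      # cases recorded by the health system

        detection_rate = reported_cases / modelled_cases
        undetected = modelled_cases - reported_cases
        print(f"{detection_rate:.0%} detected, about {undetected:,} undetected")
        # -> 16% detected, about 82,867 (~83,000) undetected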

    One health, une seule santé

    One Health ("Une seule santé") is a global strategy that aims to develop interdisciplinary collaborations for human, animal, and environmental health. It promotes an integrated, systemic, and unified approach to health at the local, national, and global scales in order to better confront emerging diseases with pandemic potential, as well as to adapt to present and future environmental impacts. Although this movement is spreading, the literature in French remains scarce. Translated from English, coordinated by eminent epidemiologists, and drawing on a broad range of scientific approaches rarely brought together around health, this book retraces the origins of the concept and presents practical content on methodological tools, data collection, surveillance techniques, and study designs. It combines research and practice in a single volume and constitutes a unique reference work for global health.

    Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking

    The potential of the diverse chemistries present in natural products (NP) for biotechnology and medicine remains untapped because NP databases are not searchable with raw data and the NP community has no way to share data other than in published papers. Although mass spectrometry techniques are well suited to high-throughput characterization of natural products, there is a pressing need for an infrastructure to enable sharing and curation of data. We present Global Natural Products Social molecular networking (GNPS, http://gnps.ucsd.edu), an open-access knowledge base for community-wide organization and sharing of raw, processed or identified tandem mass spectrometry (MS/MS) data. In GNPS, crowdsourced curation of freely available, community-wide reference MS libraries will underpin improved annotations. Data-driven social networking should facilitate the identification of spectra and foster collaborations. We also introduce the concept of 'living data' through continuous reanalysis of deposited data.
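
    Molecular networking connects MS/MS spectra whose fragmentation patterns are similar, typically via a cosine-type score between peak lists. The sketch below shows a simplified cosine similarity on binned spectra as a rough illustration of the idea; GNPS's actual scoring is a modified cosine with peak matching and precursor-mass shifts.

        # Simplified spectral similarity: bin each peak list by m/z and take the
        # cosine of the resulting intensity vectors.
        import numpy as np

        def binned_vector(peaks, bin_width=1.0, max_mz=2000.0):
            """peaks: list of (mz, intensity) pairs -> fixed-length intensity vector."""
            vec = np.zeros(int(max_mz / bin_width))
            for mz, intensity in peaks:
                if mz < max_mz:
                    vec[int(mz / bin_width)] += intensity
            return vec

        def cosine_score(peaks_a, peaks_b):
            a, b = binned_vector(peaks_a), binned_vector(peaks_b)
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            return float(a @ b / denom) if denom else 0.0

        spec1 = [(105.1, 20.0), (231.2, 100.0), (345.3, 55.0)]
        spec2 = [(105.1, 18.0), (231.2, 90.0), (360.2, 40.0)]
        print(cosine_score(spec1, spec2))   # edge added to the network if above a cutoff

    Spectrum pairs scoring above a chosen cutoff become edges, and the connected components of the resulting graph group putatively related molecules.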

    Situation et perspectives mondiales du riz (deuxième partie)

    Nguyen Dac Simone A. Situation et perspectives mondiales du riz (deuxième partie). In: L'information géographique, volume 59, n°2, 1995. pp. 75-79

    Situation et perspectives mondiales du riz (première partie)

    Nguyen Dac Simone A. Situation et perspectives mondiales du riz (première partie). In: L'information géographique, volume 59, n°2, 1995. pp. 57-61