Strong scaling of general-purpose molecular dynamics simulations on GPUs
We describe a highly optimized implementation of MPI domain decomposition in
a GPU-enabled, general-purpose molecular dynamics code, HOOMD-blue (Anderson
and Glotzer, arXiv:1308.5587). Our approach is inspired by a traditional
CPU-based code, LAMMPS (Plimpton, J. Comp. Phys. 117, 1995), but is implemented
within a code that was designed for execution on GPUs from the start (Anderson
et al., J. Comp. Phys. 227, 2008). The software supports short-ranged pair
force and bond force fields and achieves optimal GPU performance using an
autotuning algorithm. We are able to demonstrate equivalent or superior scaling
on up to 3,375 GPUs in Lennard-Jones and dissipative particle dynamics (DPD)
simulations of up to 108 million particles. GPUDirect RDMA capabilities in
recent GPU generations provide better performance in full double precision
calculations. For a representative polymer physics application, HOOMD-blue 1.0
provides an effective GPU vs. CPU node speed-up of 12.5x. Comment: 30 pages, 14 figures
Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback
A key technology for the development of large language models (LLMs) involves
instruction tuning that helps align the models' responses with human
expectations to realize impressive learning abilities. Two major approaches for
instruction tuning are supervised fine-tuning (SFT) and reinforcement
learning from human feedback (RLHF), which are currently applied to produce the
best commercial LLMs (e.g., ChatGPT). To improve the accessibility of LLMs for
research and development efforts, various instruction-tuned open-source LLMs
have also been introduced recently, e.g., Alpaca, Vicuna, to name a few.
However, existing open-source LLMs have only been instruction-tuned for English
and a few popular languages, thus hindering their impact on and accessibility to
many other languages in the world. Among the few very recent works that explore
instruction tuning for LLMs in multiple languages, SFT has been used as the
only approach to instruction-tune LLMs for multiple languages. This has left a
significant gap for fine-tuned LLMs based on RLHF in diverse languages and
raised important questions on how RLHF can boost the performance of
multilingual instruction tuning. To overcome this issue, we present Okapi, the
first system with instruction-tuned LLMs based on RLHF for multiple languages.
Okapi introduces instruction and response-ranked data in 26 diverse languages
to facilitate the experiments and development of future multilingual LLM
research. We also present benchmark datasets to enable the evaluation of
generative LLMs in multiple languages. Our experiments demonstrate the
advantages of RLHF for multilingual instruction over SFT for different base
models and datasets. Our framework and resources are released at
https://github.com/nlp-uoregon/Okapi
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
The driving factors behind the development of large language models (LLMs)
with impressive learning capabilities are their colossal model sizes and
extensive training datasets. Along with the progress in natural language
processing, LLMs have been frequently made accessible to the public to foster
deeper investigation and applications. However, when it comes to training
datasets for these LLMs, especially the recent state-of-the-art models, they
are often not fully disclosed. Creating training data for high-performing LLMs
involves extensive cleaning and deduplication to ensure the necessary level of
quality. The lack of transparency for training data has thus hampered research
on attributing and addressing hallucination and bias issues in LLMs, hindering
replication efforts and further advancements in the community. These challenges
become even more pronounced in multilingual learning scenarios, where the
available multilingual text datasets are often inadequately collected and
cleaned. Consequently, there is a lack of open-source and readily usable
datasets to effectively train LLMs in multiple languages. To overcome this
issue, we present CulturaX, a substantial multilingual dataset with 6.3
trillion tokens in 167 languages, tailored for LLM development. Our dataset
undergoes meticulous cleaning and deduplication through a rigorous pipeline of
multiple stages to accomplish the best quality for model training, including
language identification, URL-based filtering, metric-based cleaning, document
refinement, and data deduplication. CulturaX is fully released to the public on
HuggingFace to facilitate research and advancements in multilingual LLMs:
https://huggingface.co/datasets/uonlp/CulturaX. Comment: Ongoing Work
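The CulturaX abstract above lists a multi-stage cleaning pipeline. The following is a minimal sketch of three of those stages (URL-based filtering, metric-based cleaning, and deduplication), not the authors' implementation. The blocklist, the word-count threshold, the document layout (`url` and `text` keys), and all function names are hypothetical; language identification and document refinement are omitted, and only exact duplicates after whitespace and case normalization are detected.

```python
import hashlib
from urllib.parse import urlparse

# Hypothetical settings; the real pipeline's lists and thresholds are not given here.
BLOCKED_DOMAINS = {"spam.example"}
MIN_WORDS = 5


def url_allowed(url):
    """URL-based filtering: drop documents hosted on a blocked domain."""
    host = urlparse(url).hostname or ""
    return host not in BLOCKED_DOMAINS


def passes_metrics(text):
    """Metric-based cleaning: here, a single minimum word-count check."""
    return len(text.split()) >= MIN_WORDS


def fingerprint(text):
    """Hash of the lowercased, whitespace-normalized text for exact dedup."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def clean(docs):
    """Apply the three stages in order and keep the first copy of each text."""
    seen = set()
    kept = []
    for doc in docs:
        if not url_allowed(doc["url"]):
            continue
        if not passes_metrics(doc["text"]):
            continue
        fp = fingerprint(doc["text"])
        if fp in seen:
            continue
        seen.add(fp)
        kept.append(doc)
    return kept
```

Running the cheap URL check before hashing keeps the per-document cost low; a production pipeline would likely use near-duplicate detection rather than exact hashes.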
Rigid body constraints realized in massively-parallel molecular dynamics on graphics processing units
Molecular dynamics (MD) methods compute the trajectory of a system of point particles in response to a potential function by numerically integrating Newton's equations of motion. Extending these basic methods with rigid body constraints enables composite particles with complex shapes such as anisotropic nanoparticles, grains, molecules, and rigid proteins to be modeled. Rigid body constraints are added to the GPU-accelerated MD package, HOOMD-blue, version 0.10.0. The software can now simulate systems of particles, rigid bodies, or mixed systems in microcanonical (NVE), canonical (NVT), and isothermal-isobaric (NPT) ensembles. It can also apply the FIRE energy minimization technique to these systems. In this paper, we detail the massively parallel scheme that implements these algorithms and discuss how our design is tuned for the maximum possible performance. Two different case studies are included to demonstrate the performance attained: patchy spheres and tethered nanorods. In typical cases, HOOMD-blue on a single GTX 480 executes 2.5-3.6 times faster than LAMMPS executing the same simulation on any number of CPU cores in parallel. Simulations with rigid bodies may now be run with larger systems and for longer time scales on a single workstation than was previously even possible on large clusters.
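The abstract above opens by describing MD as numerical integration of Newton's equations of motion. A minimal sketch of one common scheme, velocity Verlet, for a single particle in one dimension is given here. The choice of integrator, the function name `velocity_verlet`, and the harmonic test force are illustrative assumptions, not details taken from the paper or from HOOMD-blue.

```python
def velocity_verlet(x, v, force, mass, dt, steps):
    """Advance position x and velocity v by `steps` time steps of size dt.

    `force` is a callable returning the force at a given position.
    """
    a = force(x) / mass
    for _ in range(steps):
        # Position update uses the current velocity and acceleration.
        x = x + v * dt + 0.5 * a * dt * dt
        # Velocity update averages the old and new accelerations.
        a_new = force(x) / mass
        v = v + 0.5 * (a + a_new) * dt
        a = a_new
    return x, v


# Usage: a unit-mass harmonic oscillator (force = -x), started at rest at x = 1.
x, v = velocity_verlet(1.0, 0.0, lambda pos: -pos, 1.0, 0.01, 628)
energy = 0.5 * v * v + 0.5 * x * x  # stays close to the initial value 0.5
```

The total energy is a convenient sanity check: for a small enough time step it should stay near its initial value instead of drifting.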
Multiplexing siRNAs to compress RNAi-based screen size in human cells
Here we describe a novel strategy using multiplexes of synthetic small interfering RNAs (siRNAs) corresponding to multiple gene targets in order to compress RNA interference (RNAi) screen size. Before investigating the practical use of this strategy, we first characterized the gene-specific RNAi induced by a large subset (258 siRNAs, 129 genes) of the entire siRNA library used in this study (∼800 siRNAs, ∼400 genes). We next demonstrated that multiplexed siRNAs could silence at least six genes to the same degree as when the genes were targeted individually. The entire library was then used in a screen in which randomly multiplexed siRNAs were assayed for their effect on cell viability. Using this strategy, several gene targets that influenced the viability of a breast cancer cell line were identified. This study suggests that the screening of randomly multiplexed siRNAs may provide an important avenue towards the identification of candidate gene targets for downstream functional analyses and may also be useful for the rapid identification of positive controls for use in novel assay systems. This approach is likely to be especially applicable where assay costs or platform limitations are prohibitive.
A hidden HIV epidemic among women in Vietnam
Background: The HIV epidemic in Vietnam is still concentrated among high-risk populations, including injecting drug users (IDU) and female sex workers (FSW). The response of the government has focused on the recognized high-risk populations, mainly young male drug users. This concentration on one high-risk population may leave other populations under-protected or unprepared for the risk and the consequences of HIV infection. In particular, women's risks of exposure and needs for care may not receive sufficient attention as long as the perception persists that the epidemic is predominantly among young males. Without more knowledge of the epidemic among women, policy makers and planners cannot ensure that programs will also serve women's needs.
Methods: More than 300 documents appearing in the period 1990 to 2005 were gathered and reviewed to build an understanding of HIV infection and related risk behaviors among women and of the changes over time that may suggest needed policy changes.
Results: It appears that the risk of HIV transmission among women in Vietnam has been underestimated; the reported data may represent as little as 16% of the real number. Although modeling predicted that there would be 98,500 cases of HIV-infected women in 2005, only 15,633 were accounted for in reports from the health system. That could mean that in 2005, up to 83,000 women infected with HIV had not been detected by the health care system, for a number of possible reasons. For both detection and prevention, these women can be divided into sub-groups with different risk characteristics. They can be infected by sharing needles and syringes with IDU partners, or by having unsafe sex with clients, husbands or lovers. However, most new infections among women can be traced to sexual relations with young male injecting drug users engaged in extramarital sex. Each of these groups may need different interventions to increase the detection rate and thus ensure that the women receive the care they need.
Conclusion: Women in Vietnam are increasingly at risk of HIV transmission, but that risk is under-reported and under-recognized. The reasons are that women are not getting tested, are not aware of risks, do not protect themselves and are not being protected by men. Based on this information, policy-makers and planners can develop better prevention and care programs that not only address women's needs but also reduce further spread of the infection among the general population.
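The two headline figures in the Results above (about 16% detected, up to 83,000 undetected) follow from the modeled and reported case counts. A short arithmetic check, using only the numbers quoted in the abstract (the variable names are illustrative):

```python
modeled_cases = 98_500   # modeled HIV-infected women in 2005 (from the abstract)
reported_cases = 15_633  # cases accounted for in health-system reports

detected_share = reported_cases / modeled_cases
undetected = modeled_cases - reported_cases

print(f"{detected_share:.1%}")  # 15.9%
print(undetected)               # 82867
```

Both values round to the figures stated in the abstract: roughly 16% and roughly 83,000.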
One Health, a single health (One health, une seule santé)
One Health ("Une seule santé") is a global strategy aimed at developing interdisciplinary collaborations for human, animal, and environmental health. It promotes an integrated, systemic, and unified approach to health at the local, national, and global scales, in order to better confront emerging diseases with pandemic risk and to adapt to present and future environmental impacts. Although this movement is growing, the French-language literature remains scarce. Translated from English, coordinated by eminent epidemiologists, and drawing on a wide range of scientific approaches rarely brought together around health, this book traces the origins of the concept and presents practical content on methodological tools, data collection, surveillance techniques, and study designs. It combines research and practice in a single volume and constitutes a unique reference work for global health.
Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking
The potential of the diverse chemistries present in natural products (NP) for biotechnology and medicine remains untapped because NP databases are not searchable with raw data and the NP community has no way to share data other than in published papers. Although mass spectrometry techniques are well-suited to high-throughput characterization of natural products, there is a pressing need for an infrastructure to enable sharing and curation of data. We present Global Natural Products Social Molecular Networking (GNPS, http://gnps.ucsd.edu), an open-access knowledge base for community-wide organization and sharing of raw, processed or identified tandem mass (MS/MS) spectrometry data. In GNPS, crowdsourced curation of freely available community-wide reference MS libraries will underpin improved annotations. Data-driven social networking should facilitate identification of spectra and foster collaborations. We also introduce the concept of 'living data' through continuous reanalysis of deposited data.
World rice situation and outlook, part two (Situation et perspectives mondiales du riz, deuxième partie)
Nguyen Dac Simone A. Situation et perspectives mondiales du riz (deuxième partie). In: L'information géographique, volume 59, n°2, 1995. pp. 75-79
World rice situation and outlook, part one (Situation et perspectives mondiales du riz, première partie)
Nguyen Dac Simone A. Situation et perspectives mondiales du riz (première partie). In: L'information géographique, volume 59, n°2, 1995. pp. 57-61