
    Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

    The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged moderate-sized large language models (LLMs) highlights the potential of building smaller yet powerful LLMs. Nevertheless, the cost of training such models from scratch on trillions of tokens remains high. In this work, we study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains. We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, and OpenLLaMA models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch. This work provides compelling evidence that leveraging existing LLMs with structured pruning is a far more cost-effective approach for building smaller LLMs. Comment: The code and models are available at https://github.com/princeton-nlp/LLM-Shearin
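
    As a rough illustration of the dynamic batch loading idea, the sketch below reweights domain sampling probabilities by how far each domain's current loss exceeds a reference loss. The exponentiated update, the lr step size, and the function name are simplifications assumed for illustration, not the paper's exact rule.

        import numpy as np

        def update_domain_weights(weights, current_losses, reference_losses, lr=1.0):
            # Up-weight domains whose loss is still far above a reference value
            # (e.g., the source model's loss on that domain), so the next batch
            # samples more data from lagging domains. Hypothetical simplification.
            weights = np.asarray(weights, dtype=float)
            excess = np.maximum(np.asarray(current_losses, dtype=float)
                                - np.asarray(reference_losses, dtype=float), 0.0)
            new_w = weights * np.exp(lr * excess)
            return new_w / new_w.sum()  # renormalize into a sampling distribution

    Each training step would then draw the next batch's domain composition from the returned distribution.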

    Arena: A Learning-based Synchronization Scheme for Hierarchical Federated Learning--Technical Report

    Federated learning (FL) enables collaborative model training among distributed devices without data sharing, but existing FL suffers from poor scalability because of global model synchronization. To address this issue, hierarchical federated learning (HFL) has recently been proposed to let edge servers aggregate models of nearby devices while synchronizing with the cloud periodically. However, a critical open challenge remains unsolved: how to design a good synchronization scheme, i.e., when devices and edges should be synchronized. Devices are heterogeneous in computing and communication capability, and their data can be non-IID. No existing work can effectively synchronize the various roles (e.g., devices and edges) in HFL so as to guarantee high learning efficiency and accuracy. In this paper, we propose a learning-based synchronization scheme for HFL systems. By collecting data such as edge models, CPU usage, and communication time, we design a deep reinforcement learning-based approach to decide the frequencies of cloud aggregation and edge aggregation, respectively. The proposed scheme accounts for device heterogeneity, non-IID data, and device mobility, aiming to maximize training accuracy while minimizing energy overhead. We also analyze the convergence bound of the proposed synchronization scheme, build an HFL testbed, and conduct experiments with real data obtained from Raspberry Pi and Alibaba Cloud. Extensive experiments under various settings confirm the effectiveness of Arena.
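
    To make the decision problem concrete, the toy sketch below stands in for the paper's deep reinforcement learning agent with a simple epsilon-greedy bandit that picks how many local rounds to run per edge aggregation and how many edge rounds per cloud aggregation; the state-free formulation, action set, and reward shaping are illustrative assumptions, not the Arena design.

        import random

        # Candidate (local rounds per edge aggregation, edge rounds per cloud aggregation)
        ACTIONS = [(e, c) for e in (1, 2, 4) for c in (1, 2, 4)]
        q_values = {a: 0.0 for a in ACTIONS}

        def choose_sync_frequencies(eps=0.1):
            # Explore occasionally; otherwise pick the best-known frequencies.
            if random.random() < eps:
                return random.choice(ACTIONS)
            return max(q_values, key=q_values.get)

        def record_outcome(action, reward, lr=0.1):
            # The reward could combine accuracy improvement with energy/time cost,
            # estimated from collected metrics (edge models, CPU usage, comm. time).
            q_values[action] += lr * (reward - q_values[action])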

    Systematic Analysis of Impact of Sampling Regions and Storage Methods on Fecal Gut Microbiome and Metabolome Profiles.

    The contribution of human gastrointestinal (GI) microbiota and metabolites to host health has recently become much clearer. However, many confounding factors can influence the accuracy of gut microbiome and metabolome studies, resulting in inconsistencies in published results. In this study, we systematically investigated the effects of fecal sampling regions and storage and retrieval conditions on gut microbiome and metabolite profiles from three healthy children. Our analysis indicated that compared to homogenized and snap-frozen samples (standard control [SC]), different sampling regions did not affect microbial community alpha diversity, while a total of 22 of 176 identified metabolites varied significantly across different sampling regions. In contrast, storage conditions significantly influenced the microbiome and metabolome. Short-term room temperature storage had a minimal effect on the microbiome and metabolome profiles. Sample storage in RNALater showed a significant level of variation in both microbiome and metabolome profiles, independent of the storage or retrieval conditions. The effect of RNALater on the metabolome was stronger than the effect on the microbiome, and individual variability between study participants outweighed the effect of RNALater on the microbiome. We conclude that homogenizing stool samples was critical for metabolomic analysis but not necessary for microbiome analysis. Short-term room temperature storage had a minimal effect on the microbiome and metabolome profiles and is recommended for short-term fecal sample storage. In addition, our study indicates that the use of RNALater as a storage medium of stool samples for microbial and metabolomic analyses is not recommended. IMPORTANCE: The gastrointestinal microbiome and metabolome can provide a new angle to understand the development of health and disease. Stool samples are most frequently used for large-scale cohort studies. Standardized procedures for stool sample handling and storage can be a determining factor for performing microbiome or metabolome studies. In this study, we focused on the effects of stool sampling regions and stool sample storage conditions on variations in the gut microbiome composition and metabolome profile.

    Differentially Private Learning with Per-Sample Adaptive Clipping

    Privacy in AI has remained a topic that draws attention from researchers and the general public in recent years. As one way to implement privacy-preserving AI, differentially private learning is a framework that enables AI models to use differential privacy (DP). To achieve DP in the learning process, existing algorithms typically limit the magnitude of gradients with a constant clipping threshold, which requires careful tuning due to its significant impact on model performance. To address this issue, recent works such as NSGD and Auto-S propose using normalization instead of clipping to avoid hyperparameter tuning. However, normalization-based approaches like NSGD and Auto-S rely on a monotonic weight function, which imposes excessive weight on small-gradient samples and introduces extra deviation into the update. In this paper, we propose a Differentially Private Per-Sample Adaptive Clipping (DP-PSAC) algorithm based on a non-monotonic adaptive weight function, which guarantees privacy without the typical hyperparameter tuning required by constant clipping, while significantly reducing the deviation between the update and the true batch-averaged gradient. We provide a rigorous theoretical convergence analysis and show that, at the same order of convergence rate, the proposed algorithm achieves a lower non-vanishing bound, which is maintained over training iterations, compared with NSGD/Auto-S. In addition, through extensive experimental evaluation, we show that DP-PSAC outperforms or matches state-of-the-art methods on multiple mainstream vision and language tasks. Comment: To appear in AAAI 2023; revised acknowledgments and citation.
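
    The sketch below illustrates the general shape of a per-sample adaptive weighting step in a DP-SGD-style update; the specific non-monotonic weight function, the r constant, and the function name are illustrative assumptions rather than the exact DP-PSAC formulation.

        import numpy as np

        def adaptive_weighted_dp_update(per_sample_grads, noise_multiplier, scale=1.0, r=0.01, rng=None):
            rng = np.random.default_rng() if rng is None else rng
            norms = np.linalg.norm(per_sample_grads, axis=1)
            # Non-monotonic weight (assumed form): behaves like scale/||g|| for large
            # gradients but stays bounded for tiny ones, so small-gradient samples
            # are not over-weighted the way pure normalization would be.
            weights = scale / (norms + r / (norms + r))
            weighted_sum = (per_sample_grads * weights[:, None]).sum(axis=0)
            # Each weighted per-sample gradient has norm at most `scale`, so Gaussian
            # noise calibrated to that sensitivity preserves the DP guarantee.
            noise = rng.normal(0.0, noise_multiplier * scale, size=weighted_sum.shape)
            return (weighted_sum + noise) / len(per_sample_grads)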

    Bi-Drop: Enhancing Fine-tuning Generalization via Synchronous sub-net Estimation and Optimization

    Pretrained language models have achieved remarkable success in natural language understanding. However, fine-tuning pretrained models on limited training data tends to overfit, which diminishes performance. This paper presents Bi-Drop, a fine-tuning strategy that selectively updates model parameters using gradients from various sub-nets dynamically generated by dropout. The sub-net estimation of Bi-Drop is performed in an in-batch manner, so it overcomes the hysteresis in sub-net updating that affects previous methods relying on asynchronous sub-net estimation. Moreover, Bi-Drop needs only one mini-batch to estimate the sub-net, so it makes better use of the training data. Experiments on the GLUE benchmark demonstrate that Bi-Drop consistently outperforms previous fine-tuning methods. Furthermore, empirical results also show that Bi-Drop exhibits excellent generalization ability and robustness in domain transfer, data imbalance, and low-resource scenarios. Comment: EMNLP 2023 Findings. Camera-ready version. Co-first authors with equal contribution.
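
    A minimal sketch of the in-batch sub-net idea follows: the same mini-batch is forwarded through several dropout-induced sub-nets, and only parameters whose gradients look consistently important are updated. The importance score, the keep ratio, and the helper name are illustrative assumptions, not the authors' exact Bi-Drop procedure.

        import torch
        import torch.nn.functional as F

        def subnet_masked_update(model, batch_x, batch_y, optimizer, n_subnets=2, keep_ratio=0.5):
            model.train()  # dropout must be active so each pass samples a different sub-net
            grads = []
            for _ in range(n_subnets):
                optimizer.zero_grad()
                loss = F.cross_entropy(model(batch_x), batch_y)
                loss.backward()
                # Assumes every parameter receives a gradient in this pass.
                grads.append([p.grad.detach().clone() for p in model.parameters()])
            optimizer.zero_grad()
            for p, *gs in zip(model.parameters(), *grads):
                g = torch.stack(gs).mean(dim=0)              # gradient averaged over sub-nets
                score = torch.stack(gs).abs().mean(dim=0)    # simple importance score (assumed)
                k = max(1, int(keep_ratio * score.numel()))
                thresh = score.flatten().kthvalue(score.numel() - k + 1).values
                p.grad = g * (score >= thresh).float()       # zero out low-importance coordinates
            optimizer.step()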

    The association between retina thinning and hippocampal atrophy in Alzheimer’s disease and mild cognitive impairment: a meta-analysis and systematic review

    Introduction: The retina is the "window" of the central nervous system. Previous studies have found that retinal thickness degenerates along the pathological process of the Alzheimer's disease (AD) continuum. Hippocampal atrophy is one of the typical clinical features and diagnostic criteria of AD. Earlier studies have described retinal thinning in normally aging subjects and AD patients, yet the association between retinal thickness and hippocampal atrophy in AD is unclear. Optical coherence tomography (OCT) provides non-invasive access to retinal images, and magnetic resonance imaging can delineate the volume of the hippocampus. We therefore aim to quantify the correlation between these two parameters to identify whether the retina can serve as a new biomarker for early AD detection. Methods: We systematically searched the PubMed, Embase, and Web of Science databases from inception to May 2023 for studies investigating the correlation between retinal thickness and hippocampal volume. The Newcastle-Ottawa Quality Assessment Scale (NOS) was used to assess study quality. Pooled correlation coefficients (r values) were combined after Fisher's Z transformation. Moderator effects were examined through subgroup analysis and meta-regression. Results: Of the 1,596 citations initially identified, we excluded 1,062 studies after screening the titles and abstracts (animal models, n = 99; irrelevant literature, n = 963). Twelve studies met the inclusion criteria, of which three were excluded due to unextractable data, leaving nine studies eligible for this meta-analysis. A moderate positive correlation between retinal thickness and hippocampal volume was found across all participants with AD, mild cognitive impairment (MCI), and normal controls (NC) (r = 0.3469, 95% CI: 0.2490–0.4377, I2 = 5.0%), which was significantly higher than that of the AD group alone (r = 0.1209, 95% CI: 0.0905–0.1510, I2 = 0.0%) (p < 0.05). Among the individual layers, the peripapillary retinal nerve fiber layer (pRNFL) showed a moderate positive correlation with hippocampal volume (r = 0.1209, 95% CI: 0.0905–0.1510, I2 = 0.0%). The retinal pigmented epithelium (RPE) was also positively correlated (r = 0.1421, 95% CI: −0.0447–0.3192, I2 = 84.1%). Retinal layer and participant group were the main sources of overall heterogeneity. Correlations did not differ significantly between the two hemispheres. Conclusion: The correlation between RNFL thickness and hippocampal volume is more pronounced in both the NC and AD groups than for other layers. Whole retinal thickness is positively correlated with hippocampal volume not only in the AD continuum, especially in MCI, but also in NC. Systematic review registration: https://www.crd.york.ac.uk/PROSPERO/, CRD42022328088.
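
    For reference, pooling correlation coefficients after Fisher's Z transformation typically proceeds as in the sketch below, a generic fixed-effect version with inverse-variance weights; the random-effects details of the actual meta-analysis are omitted, and the function name is illustrative.

        import numpy as np

        def pool_correlations(r_values, sample_sizes):
            r = np.asarray(r_values, dtype=float)
            n = np.asarray(sample_sizes, dtype=float)
            z = np.arctanh(r)            # Fisher's Z: z = 0.5 * ln((1 + r) / (1 - r))
            w = n - 3.0                  # variance of z is 1 / (n - 3), so weight = n - 3
            z_pooled = np.sum(w * z) / np.sum(w)
            se = 1.0 / np.sqrt(np.sum(w))
            ci_low, ci_high = z_pooled - 1.96 * se, z_pooled + 1.96 * se
            # Back-transform the pooled estimate and its CI to the correlation scale.
            return np.tanh(z_pooled), (np.tanh(ci_low), np.tanh(ci_high))

        # Example usage: pooled_r, ci = pool_correlations([0.35, 0.28, 0.41], [60, 45, 80])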