39 research outputs found

    Toward efficient online scheduling for large-scale distributed machine learning system

    Get PDF
    Thanks to the rise of machine learning (ML) and its vast applications, recent years have witnessed a rapid growth of large-scale distributed ML frameworks, which exploit the massive parallelism of computing clusters to expedite ML training jobs. However, the proliferation of large-scale distributed ML frameworks also introduces many unique technical challenges in computing system design and optimization. In a networked computing cluster that supports a large number of training jobs, a central question is how to design efficient scheduling algorithms to allocate workers and parameter servers across different machines to minimize the overall training time. Toward this end, in this paper, we develop an online scheduling algorithm that jointly optimizes resource allocation and locality decisions. Our main contributions are three-fold: i) We develop a new analytical model that considers both resource allocation and locality; ii) Based on an equivalent reformulation and close observations on the worker-parameter server locality configurations, we transform the problem into a mixed cover/packing integer program, which enables approximation algorithm design; iii) We propose a meticulously designed randomized rounding approximation algorithm and rigorously prove its performance. Collectively, our results contribute to a comprehensive and fundamental understanding of distributed ML system optimization and algorithm design.
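
    As background for the abstract above, a generic mixed packing/covering integer program (an illustrative form, not the paper's exact formulation) can be written as follows, where the packing constraints model machine resource capacities and the covering constraints model job workload requirements:

        \min_{x \in \mathbb{Z}_{\ge 0}^{n}} \; c^{\top} x
        \quad \text{s.t.} \quad A x \le a \;\; \text{(packing: resource capacities)},
        \qquad B x \ge b \;\; \text{(covering: workload requirements)},
        \qquad \text{with } A, B, a, b, c \ge 0.

    Randomized rounding then solves the LP relaxation (dropping integrality) and rounds the fractional solution to integers at random, trading a bounded loss in objective value for an integral solution.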

    Trends of hypercholesterolemia change in Shenzhen, China during 1997-2018

    Get PDF
    This study aimed to demonstrate the trends of hypercholesterolemia change in Shenzhen, China from 1997 to 2018. Participants were residents aged 18 to 69 years in Shenzhen, China, recruited using multi-stage cluster sampling. All participants were surveyed about their socio-demographics, lifestyle, occupation, mental health, and social support. Physical measurements and blood samples for subsequent measurements were collected according to a standardized protocol. A total of 26,621 individuals participated in the three surveys: 8,266 in 1997, 8,599 in 2009, and 9,756 in 2018. In both women and men, there was a significant downward linear trend in age-adjusted mean high-density lipoprotein-cholesterol (HDL-C) from 1997 to 2018 (women: 0.17 ± 0.06, p = 0.008 vs. men: 0.21 ± 0.04, p < 0.001). In contrast, age-adjusted total triglycerides and total cholesterol in both sexes showed an increasing trend over the past two decades. However, no significant changes in age-adjusted low-density lipoprotein-cholesterol (LDL-C) were found in either men or women between 2009 and 2018 (women: 0.00 ± 0.02, p = 0.85 vs. men: 0.02 ± 0.03, p = 0.34). The age-adjusted prevalence of hypercholesterolemia rose rapidly from 1997 to 2009 and appeared to stabilize by 2018, a trend similar to that of the prevalence of high total triglycerides in women. Changes in trends varied by lipid trait. Over the observed decades, there was a clear increasing trend in the prevalence of low HDL-C (<1.04 mmol/L) in both sexes (women: 8.8% in 1997, doubling to 17.5% in 2018; men: 22.1% in 1997, rising to 39.1% in 2018), particularly among younger age groups. Hence, a bespoke public health strategy aligned with the characteristics of the lipid epidemic by sex and age group needs to be developed and implemented.
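
    The age adjustment mentioned above is typically done by direct standardization: age-group-specific prevalences are weighted by a fixed standard population so that estimates from different survey years are comparable. A minimal Python sketch follows; the age bands, weights, and prevalences are made-up illustrative numbers, not the study's data:

        import numpy as np

        # Standard population weights for age bands 18-29, 30-44, 45-59, 60-69 (illustrative).
        std_weights = np.array([0.30, 0.35, 0.25, 0.10])

        # Crude prevalence of low HDL-C within each age band for one survey year (illustrative).
        crude_prev = np.array([0.12, 0.17, 0.22, 0.25])

        # Direct standardization: weighted average of the age-specific prevalences.
        age_adjusted = np.sum(std_weights * crude_prev)
        print(f"Age-adjusted prevalence: {age_adjusted:.1%}")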

    Who is the main caregiver of the mother during the doing-the-month: is there an association with postpartum depression?

    Get PDF
    Background: To examine the relationship between the main caregiver during the “doing-the-month” (a traditional Chinese practice in which a mother is confined at home for 1 month after giving birth) and the risk of postpartum depression (PPD) in postnatal women. Methods: Participants were postnatal women who stayed in hospital and women who attended the hospital for postpartum examination, at 14–60 days after delivery, from November 1, 2013 to December 30, 2013. Postpartum depression status was assessed using the Edinburgh Postnatal Depression Scale. Univariate and multivariable logistic regressions were used to identify the associations between the main caregiver during “doing-the-month” and the risk of PPD in postnatal women. Results: One thousand three hundred twenty-five postnatal women with a mean (SD) age of 28 (4.58) years were included in the analyses. The median (IQR) PPD score was 6.0 (2, 10) and the prevalence of PPD was 27%. Of these postnatal women, 44.5% were cared for by their mother-in-law in the first month after delivery, 36.3% by their own mother, 11.1% by a “yuesao” (maternity matron), and 8.1% by other relatives. No association was found between the main caregivers and the risk of PPD after multiple adjustments. Conclusions: Although no association between the main caregiver and the risk of PPD during doing-the-month was identified, considering the increasing prevalence of PPD in Chinese women and the contradictions between traditional culture and the latest scientific evidence for some of the doing-the-month practices, public health interventions aimed at increasing the awareness of PPD among caregivers and family members are warranted.
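
    As an illustration of the analysis described above (not the study's actual data or covariate set), a multivariable logistic regression of PPD status on the main caregiver, adjusted for age, could be fit with statsmodels as below; all variable names and values are hypothetical:

        import numpy as np
        import pandas as pd
        import statsmodels.formula.api as smf

        rng = np.random.default_rng(1)
        n = 1325
        df = pd.DataFrame({
            "ppd": rng.binomial(1, 0.27, n),            # 1 = EPDS score above the cutoff
            "caregiver": rng.choice(["mother_in_law", "own_mother", "yuesao", "other"],
                                    size=n, p=[0.445, 0.363, 0.111, 0.081]),
            "age": rng.normal(28, 4.58, n),
        })

        # Multivariable logistic regression: caregiver as a categorical exposure, adjusted for age.
        model = smf.logit("ppd ~ C(caregiver, Treatment('own_mother')) + age", data=df).fit(disp=False)
        odds_ratios = np.exp(model.params)              # odds ratios vs. the own-mother reference group
        print(pd.concat([odds_ratios, np.exp(model.conf_int())], axis=1)
                .set_axis(["OR", "2.5%", "97.5%"], axis=1))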

    Low-latency computing network resource scheduling and allocation algorithms for deep learning jobs

    No full text
    This dissertation focuses on modeling and designing efficient resource scheduling and allocation algorithms for deep learning jobs in distributed machine learning systems or computing clusters based on mainstream frameworks (e.g., TensorFlow and PyTorch). Due to the rapid growth of training dataset size and model complexity, it has become prevalent to leverage data parallelism to expedite the training process. However, data communication between computing devices (e.g., GPUs) typically becomes the bottleneck to scaling the system. Thus, how to alleviate the communication bottleneck when scheduling deep learning jobs in distributed systems has recently attracted increasing attention in both academia and industry. However, designing such resource allocation and scheduling algorithms is highly non-trivial. Specifically, the problem typically has packing-type constraints (due to resource capacity limits), covering-type constraints (due to job workload requirements), and non-convex constraints (due to topology, contention, etc.), and is NP-hard in general. Moreover, requiring integer variables adds another layer of difficulty to solving the problem. To overcome these challenges, we need to design a suite of provable algorithms to schedule the jobs efficiently. In this thesis, we start with resource allocation algorithm design for computing clusters, where we focus on resource allocation without considering placement for DNN jobs. We then extend this work to distributed machine learning systems and computing clusters by jointly optimizing placement and resource scheduling for DNN jobs. We design schedulers for deep learning jobs with various objectives (e.g., minimizing the overall training completion time, minimizing the makespan, and maximizing the overall job utility). We first design efficient scheduling algorithms under simplifying assumptions, such as reserved bandwidth for each job and a complete-graph underlying network. We then extend this work by taking practical concerns (e.g., topology mapping and contention among multiple jobs) into consideration when developing schedulers for distributed machine learning systems and computing clusters.
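
    To make the packing/covering structure and the rounding step described above concrete, the following is a minimal, self-contained Python sketch. It is illustrative only: the job/machine sizes, the cost model, and the simple rounding scheme are assumptions for the example, not the thesis's actual algorithm. It solves the LP relaxation of a toy worker-placement problem with scipy and then applies textbook randomized rounding:

        import numpy as np
        from scipy.optimize import linprog

        # Toy instance: x[j, m] = number of workers of job j placed on machine m.
        # Packing constraints: per-machine GPU capacity.
        # Covering constraints: per-job worker demand.
        rng = np.random.default_rng(0)
        num_jobs, num_machines = 3, 4
        cap = np.array([4, 4, 4, 4])                              # GPUs per machine
        demand = np.array([3, 5, 2])                              # workers each job needs
        cost = rng.uniform(1, 2, size=(num_jobs, num_machines))   # e.g., a communication cost

        n = num_jobs * num_machines
        # Packing: sum_j x[j, m] <= cap[m]
        A_pack = np.zeros((num_machines, n))
        for m in range(num_machines):
            A_pack[m, m::num_machines] = 1.0
        # Covering: sum_m x[j, m] >= demand[j], written as -sum_m x[j, m] <= -demand[j]
        A_cover = np.zeros((num_jobs, n))
        for j in range(num_jobs):
            A_cover[j, j * num_machines:(j + 1) * num_machines] = -1.0

        res = linprog(cost.ravel(),
                      A_ub=np.vstack([A_pack, A_cover]),
                      b_ub=np.concatenate([cap, -demand]),
                      bounds=(0, None), method="highs")
        x_frac = res.x.reshape(num_jobs, num_machines)

        # Textbook randomized rounding: round each entry up with probability
        # equal to its fractional part.
        frac = x_frac - np.floor(x_frac)
        x_int = (np.floor(x_frac) + (rng.random(x_frac.shape) < frac)).astype(int)
        print("LP (fractional) allocation:\n", np.round(x_frac, 2))
        print("Rounded integral allocation:\n", x_int)

    Note that a single rounding pass may violate a capacity slightly or leave a demand unmet; provable schedulers of the kind developed in the thesis bound or repair such violations, which is where the analytical difficulty lies.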

    Toward Efficient Online Scheduling for Distributed Machine Learning Systems

    Full text link
    Recent years have witnessed a rapid growth of distributed machine learning (ML) frameworks, which exploit the massive parallelism of computing clusters to expedite ML training. However, the proliferation of distributed ML frameworks also introduces many unique technical challenges in computing system design and optimization. In a networked computing cluster that supports a large number of training jobs, a key question is how to design efficient scheduling algorithms to allocate workers and parameter servers across different machines to minimize the overall training time. Toward this end, in this paper, we develop an online scheduling algorithm that jointly optimizes resource allocation and locality decisions. Our main contributions are three-fold: i) We develop a new analytical model that considers both resource allocation and locality; ii) Based on an equivalent reformulation and observations on the worker-parameter server locality configurations, we transform the problem into a mixed packing and covering integer program, which enables approximation algorithm design; iii) We propose a meticulously designed approximation algorithm based on randomized rounding and rigorously analyze its performance. Collectively, our results contribute to the state of the art of distributed ML system optimization and algorithm design. Comment: IEEE Transactions on Network Science and Engineering (TNSE), accepted in July 2021, to appear.
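
    As a complement to the abstract above, here is a minimal sketch of the online setting it studies: jobs arrive over time and the scheduler must commit to a placement on arrival, without knowledge of future jobs. The greedy "pack onto as few machines as possible" rule is only a stand-in for the paper's locality decision, and all names and numbers are illustrative assumptions:

        class Cluster:
            def __init__(self, gpus_per_machine):
                self.free = list(gpus_per_machine)        # free GPU slots per machine

            def place(self, demand):
                """Greedily pack a job onto as few machines as possible."""
                plan, remaining = {}, demand
                for m in sorted(range(len(self.free)), key=lambda i: -self.free[i]):
                    take = min(self.free[m], remaining)
                    if take:
                        plan[m] = take
                        remaining -= take
                    if remaining == 0:
                        break
                if remaining:                             # not enough capacity right now
                    return None
                for m, take in plan.items():
                    self.free[m] -= take
                return plan

        cluster = Cluster([4, 4, 4])
        arrivals = [(0.0, "job-a", 5), (1.5, "job-b", 3), (2.0, "job-c", 6)]  # (time, id, workers)
        for t, job, workers in sorted(arrivals):
            plan = cluster.place(workers)
            print(f"t={t:.1f}: {job} -> {plan if plan else 'queued: insufficient capacity'}")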

    Dose-response analysis between hemoglobin A1c and risk of atrial fibrillation in patients with and without known diabetes.

    No full text
    BACKGROUND: The relationship between serum hemoglobin A1c (HbA1c) and atrial fibrillation (AF) or postoperative AF (POAF) in coronary artery bypass grafting (CABG) patients is still under debate. It is also unclear whether there is a dose-response relationship between circulating HbA1c and the risk of AF or POAF. METHODS AND RESULTS: The Cochrane Library, PubMed, and EMBASE databases were searched. A robust-error meta-regression method was used to summarize the shape of the dose-response relationship. The RRs and 95% CIs were pooled using a random-effects model. In total, 14 studies were included, totaling 17,914 AF cases among 352,325 participants. The summary RR per 1% increase in HbA1c was 1.16 (95% CI: 1.07-1.27). In the subgroup analysis, the summary RR was 1.13 (95% CI: 1.08-1.19) for patients with diabetes and 1.12 (95% CI: 1.05-1.20) for patients without known diabetes. The nonlinear analysis showed a nonlinear (P for nonlinearity = 0.04) relationship between HbA1c and AF, with a significantly increased risk of AF if HbA1c was over 6.3%. However, HbA1c (per 1% increase) was not associated with POAF in patients with diabetes (RR: 1.13, P = 0.34) or without known diabetes (RR: 0.91, P = 0.37) among patients undergoing CABG. CONCLUSION: Our results suggest that higher HbA1c was associated with an increased risk of AF, both in patients with diabetes and in those without known diabetes. However, no association was found between HbA1c and POAF in patients undergoing CABG. Further prospective studies with larger population sizes are needed to explore the association between serum HbA1c level and the risk of POAF.
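
    For readers unfamiliar with the pooling step mentioned above, the following is a minimal Python sketch of DerSimonian-Laird random-effects pooling of per-study relative risks. This is a standard method, not the paper's robust-error meta-regression for the dose-response shape, and the per-study RRs below are made-up illustrative numbers:

        import numpy as np

        def dersimonian_laird(log_rr, se):
            """Pool log relative risks with a DerSimonian-Laird random-effects model."""
            w = 1.0 / se**2                               # fixed-effect (inverse-variance) weights
            fixed = np.sum(w * log_rr) / np.sum(w)
            q = np.sum(w * (log_rr - fixed)**2)           # Cochran's Q
            c = np.sum(w) - np.sum(w**2) / np.sum(w)
            tau2 = max(0.0, (q - (len(log_rr) - 1)) / c)  # between-study variance
            w_star = 1.0 / (se**2 + tau2)                 # random-effects weights
            pooled = np.sum(w_star * log_rr) / np.sum(w_star)
            se_pooled = np.sqrt(1.0 / np.sum(w_star))
            lo, hi = pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled
            return np.exp(pooled), (np.exp(lo), np.exp(hi))

        # Hypothetical per-study RRs per 1% HbA1c increase (illustrative numbers only).
        rr = np.array([1.10, 1.22, 1.08, 1.18])
        upper_ci = np.array([1.25, 1.40, 1.20, 1.35])
        se = (np.log(upper_ci) - np.log(rr)) / 1.96       # back out the SE from the upper CI bound
        print(dersimonian_laird(np.log(rr), se))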