OpTree: An Efficient Algorithm for All-gather Operation in Optical Interconnect Systems
All-gather collective communication is one of the most important
communication primitives in parallel and distributed computation, which plays
an essential role in many HPC applications such as distributed Deep Learning
(DL) with model and hybrid parallelism. To relieve the communication bottleneck
of All-gather, optical interconnection networks can provide unprecedentedly high
bandwidth and reliability for data transfer among distributed nodes.
However, most traditional All-gather algorithms are designed for electrical
interconnects and are a poor fit for optical interconnect systems,
resulting in poor performance. This paper proposes an efficient scheme, called
OpTree, for All-gather operation on optical interconnect systems. OpTree
derives an optimal -ary tree corresponding to the optimal number of
communication stages, achieving minimum communication time. We further analyze
and compare the communication steps of OpTree with existing All-gather
algorithms. Theoretical results show that OpTree requires far fewer
communication steps than existing All-gather algorithms on optical
interconnect systems. Simulation results show that OpTree can reduce
communication time by 72.21%, 94.30%, and 88.58%, respectively, compared with
three existing All-gather schemes, WRHT, Ring, and NE.
Comment: This paper is under review at a conferenc
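To make the primitive concrete, here is a minimal Python sketch of All-gather semantics using the Ring baseline named in the abstract above. It is not OpTree, and the function name `ring_all_gather` is a hypothetical one chosen for illustration. The sketch assumes the usual ring formulation, in which each of the N nodes forwards one chunk to its right neighbour per step, so the operation completes in N - 1 steps.

```python
def ring_all_gather(chunks):
    """Simulate Ring All-gather: node i starts with chunks[i] only."""
    n = len(chunks)
    # buffers[i][j] holds the chunk that originated at node j,
    # once node i has received it (None until then).
    buffers = [[None] * n for _ in range(n)]
    for i in range(n):
        buffers[i][i] = chunks[i]

    steps = 0
    for s in range(n - 1):
        # In step s, node i forwards the chunk that originated at
        # node (i - s) mod n to its right neighbour (i + 1) mod n.
        # All sends of a step are collected first, since they occur in parallel.
        sends = [((i + 1) % n, (i - s) % n, buffers[i][(i - s) % n])
                 for i in range(n)]
        for dst, origin, data in sends:
            buffers[dst][origin] = data
        steps += 1
    return buffers, steps


if __name__ == "__main__":
    result, steps = ring_all_gather(["a", "b", "c", "d"])
    print(steps)      # number of communication steps used
    print(result[0])  # contents of node 0 after the operation
```

The linear step count of this baseline is the kind of cost that a tree-based scheme aims to reduce; the sketch does not model wavelengths or any other optical property.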
Accelerating Fully Connected Neural Network on Optical Network-on-Chip (ONoC)
The Fully Connected Neural Network (FCNN) is a class of Artificial Neural
Networks widely used in computer science and engineering, but its training
process can take a long time with large datasets on existing many-core systems.
Optical Network-on-Chip (ONoC), an emerging chip-scale optical interconnection
technology, has great potential to accelerate the training of FCNN with low
transmission delay, low power consumption, and high throughput. However,
existing methods based on Electrical Network-on-Chip (ENoC) are not suited to ONoC
because of its unique properties. In this paper, we propose a
fine-grained parallel computing model for accelerating FCNN training on ONoC
and derive the optimal number of cores for each execution stage with the
objective of minimizing the total amount of time to complete one epoch of FCNN
training. To allocate the optimal number of cores for each execution stage, we
present three mapping strategies and compare their advantages and disadvantages
in terms of hotspot level, memory requirement, and state transitions.
Simulation results show that the average prediction error for the optimal
number of cores in NN benchmarks is within 2.3%. We further carry out extensive
simulations which demonstrate that FCNN training time can be reduced by 22.28%
and 4.91% on average using our proposed scheme, compared with traditional
parallel computing methods that either allocate a fixed number of cores or
allocate as many cores as possible, respectively. Compared with ENoC,
simulation results show that under batch sizes of 64 and 128, ONoC reduces
training time by 21.02% and 12.95% on average and saves 47.85% and 39.27%
of energy, respectively.
Comment: 14 pages, 10 figures. This paper is under the second review of IEEE
Transactions of Computer
Exploring the Impact of Demographic Characteristics of Top Management Team on Earning Management: The Case of Chinese-Listed Manufacturing Companies
This study examines the relationship between the demographic characteristics of the top management team (TMT) and earnings management using 1307 Chinese-listed manufacturing companies from the China Stock Market and Accounting Research database, covering the period 2014-2018. The demographic characteristics of the TMT are measured by four factors: gender, age, financial work experience and educational level. Abnormal cash flow from operations, abnormal production cost, abnormal discretionary cost and a comprehensive proxy formed by the sum of these three are used to estimate real earnings management, which serves as a proxy for earnings management. The control variables are leverage, company size, book-to-market ratio, return on equity and sales growth. The findings show that the gender, age and financial experience of top managers have no significant impact on earnings management. However, the average educational level of top managers is significantly negatively correlated with earnings management, meaning that top managers with higher education may participate less in earnings management activities. These results are partially consistent with the prediction of the Upper Echelons Theory that the TMT’s demographic characteristics serve as a measure of the potential cognition and behaviour of individuals and teams, thus influencing the earnings management decisions of the company. These results have implications for various stakeholders in corporate financial reporting and also provide insights for those who select and train top managers.
WRHT: Efficient All-reduce for Distributed DNN Training in Optical Interconnect System
Communication efficiency plays an important role in accelerating the
distributed training of Deep Neural Networks (DNN). All-reduce is the key
communication primitive to reduce model parameters in distributed DNN training.
Most existing all-reduce algorithms are designed for traditional electrical
interconnect systems, which cannot meet the communication requirements for
distributed training of large DNNs. One promising alternative to
electrical interconnect is optical interconnect, which can provide high
bandwidth, low transmission delay, and low power cost. We propose an efficient
scheme called WRHT (Wavelength Reused Hierarchical Tree) for implementing the
all-reduce operation in optical interconnect systems, which can take advantage
of WDM (Wavelength Division Multiplexing) to reduce the communication time of
distributed data-parallel DNN training. We further derive the minimum number of
communication steps and communication time to realize the all-reduce using
WRHT. Simulation results show that WRHT reduces communication time by
75.59%, 49.25%, and 70.1%, respectively, compared with three traditional
all-reduce algorithms simulated in an optical interconnect system. Simulation
results also show that WRHT reduces the communication time of the all-reduce
operation by 86.69% and 84.71% in comparison with two existing all-reduce
algorithms in an electrical interconnect system.
Comment: This paper is under the submission of GLOBECOM 202
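As a rough illustration of what an all-reduce computes, the following Python sketch sums one value per node over a plain binary tree: partial sums are reduced towards node 0 and the total is then broadcast back. This is a generic textbook pattern, not WRHT itself; it ignores wavelengths and WDM entirely, and the name `tree_all_reduce` is a hypothetical one used only here. Under these assumptions the step count is 2 * ceil(log2 N) for N nodes.

```python
def tree_all_reduce(values):
    """Sum all-reduce over a binary tree: reduce to node 0, then broadcast."""
    n = len(values)
    acc = list(values)  # acc[i] is the value currently held by node i
    steps = 0
    stride = 1

    # Reduce phase: in each step, node i + stride sends its partial
    # sum to node i, halving the number of active nodes.
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            acc[i] += acc[i + stride]
        stride *= 2
        steps += 1

    # Broadcast phase: replay the same pairs in reverse order,
    # copying the total from node i back to node i + stride.
    while stride > 1:
        stride //= 2
        for i in range(0, n - stride, 2 * stride):
            acc[i + stride] = acc[i]
        steps += 1

    return acc, steps


if __name__ == "__main__":
    totals, steps = tree_all_reduce([1, 2, 3, 4, 5])
    print(totals, steps)
```

In distributed data-parallel training, each value would stand for a node's local gradient tensor, and every node would end with the same reduced result.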