Computing Platforms for Big Biological Data Analytics: Perspectives and Challenges.
The last decade has witnessed an explosion in the amount of available biological sequence data, driven by the rapid progress of high-throughput sequencing projects. However, the volume of biological data has grown so large that traditional data analysis platforms and methods can no longer meet the need to perform analysis tasks in the life sciences rapidly. As a result, both biologists and computer scientists face the challenge of gaining deep insight into biological function from big biological data, which in turn requires massive computational resources. Therefore, high-performance computing (HPC) platforms are urgently needed, as are efficient and scalable algorithms that can exploit them. In this paper, we survey the state-of-the-art HPC platforms for big biological data analytics. We first list the characteristics of big biological data and popular computing platforms. We then provide a taxonomy of biological data analysis applications and survey how they have been mapped onto various computing platforms. After that, we present a case study comparing the efficiency of different computing platforms on the classical biological sequence alignment problem. Finally, we discuss the open issues in big biological data analytics.
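To make the case-study topic concrete, the following is a minimal Needleman-Wunsch global alignment scorer in Python, an illustrative sketch of the classical sequence alignment recurrence rather than any benchmark code from the paper; the scoring parameters are assumptions.

    # Minimal Needleman-Wunsch global alignment score (illustrative sketch only;
    # match/mismatch/gap values are assumptions, not the paper's settings).
    def nw_score(a, b, match=1, mismatch=-1, gap=-2):
        n, m = len(a), len(b)
        # dp[i][j] = best score aligning a[:i] with b[:j]
        dp = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            dp[i][0] = i * gap
        for j in range(1, m + 1):
            dp[0][j] = j * gap
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                dp[i][j] = max(dp[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                               dp[i - 1][j] + gap,     # gap in b
                               dp[i][j - 1] + gap)     # gap in a
        return dp[n][m]

    print(nw_score("GATTACA", "GCATGCU"))  # score for two short example strings

The diagonal data dependency of this dynamic program is precisely what makes the problem a common target for wavefront-style parallelization on HPC platforms.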
A Survey on Design Methodologies for Accelerating Deep Learning on Heterogeneous Architectures
In recent years, the field of Deep Learning has seen many disruptive and
impactful advances. Given the increasing complexity of deep neural networks,
the need for efficient hardware accelerators in heterogeneous HPC platforms
has become ever more pressing. The design of Deep Learning
accelerators requires a multidisciplinary approach, combining expertise from
several areas, spanning from computer architecture to approximate computing,
computational models, and machine learning algorithms. Several methodologies
and tools have been proposed to design accelerators for Deep Learning,
including hardware-software co-design approaches, high-level synthesis methods,
specific customized compilers, and methodologies for design space exploration,
modeling, and simulation. These methodologies aim to maximize the exploitable
parallelism and minimize data movement to achieve high performance and energy
efficiency. This survey provides a holistic review of the most influential
design methodologies and EDA tools proposed in recent years to implement Deep
Learning accelerators, offering the reader a wide perspective in this rapidly
evolving field. In particular, this work complements the previous survey
proposed by the same authors in [203], which focuses on Deep Learning hardware
accelerators for heterogeneous HPC platforms.
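As a hedged illustration of the design-space-exploration methodologies such surveys cover (not code from the paper), the sketch below enumerates hypothetical loop-tiling factors for a matrix-multiply accelerator, keeping only configurations that fit an assumed on-chip buffer and picking the one with the lowest estimated off-chip data movement; the buffer size and traffic model are assumptions.

    # Toy design-space exploration for tiling an N x N matrix multiply on an
    # accelerator with a fixed on-chip buffer. All cost models are illustrative
    # assumptions, not figures from the survey.
    def dse(N=1024, buffer_words=64 * 1024):
        best = None
        for ti in (16, 32, 64, 128):
            for tj in (16, 32, 64, 128):
                for tk in (16, 32, 64, 128):
                    footprint = ti * tk + tk * tj + ti * tj   # A, B, C tiles on chip
                    if footprint > buffer_words:
                        continue  # violates the on-chip buffer budget
                    # Simple off-chip traffic model (output-stationary dataflow):
                    # A is re-read once per tj tile, B once per ti tile, C written once.
                    traffic = N * N * (N // tj) + N * N * (N // ti) + N * N
                    if best is None or traffic < best[0]:
                        best = (traffic, ti, tj, tk)
        return best

    print(dse())  # (estimated words moved, ti, tj, tk)

Real flows replace the exhaustive loop with analytical models, simulation, or learned cost predictors, but the feasibility-check-then-cost-ranking structure is the same.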
Graphics Processing Unit-Based Computer-Aided Design Algorithms for Electronic Design Automation
Electronic design automation (EDA) tools are a specialized set of software that play an important role in modern integrated circuit (IC) design, automating the design process across its various stages. Among these stages, two important EDA design tools are the focus of this research: floorplanning and global routing. Specifically, the goal of this study is to parallelize these two tools so that their execution time can be significantly shortened on modern multi-core and graphics processing unit (GPU) architectures. The GPU is a massively parallel architecture that enables thousands of independent threads to execute concurrently. Although a small set of EDA tools can benefit from GPUs to accelerate their execution, most algorithms in this field are designed with the single-core paradigm in mind. The floorplanning and global routing algorithms are among the latter, and are difficult to speed up on the GPU due to their inherently sequential nature.
This work parallelizes the floorplanning and global routing algorithms through a novel approach and achieves significant speedups for both tools implemented on GPU hardware. Specifically, with a complete overhaul of the solution space and design space exploration, the GPU-based floorplanning algorithm delivers 4-166X speedup while achieving similar or better solutions than the sequential algorithm. The GPU-based global routing algorithm is shown to achieve significant speedup over existing state-of-the-art routers while delivering competitive solution quality. Importantly, this parallel model for global routing produces a stable solution that is independent of the level of parallelism. In summary, this research shows that, through a design paradigm overhaul, sequential algorithms can also benefit from massively parallel architectures. The findings of this study have a positive impact on the efficiency and design quality of the modern EDA design flow.
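A hedged sketch of the kind of design-paradigm overhaul described above (my illustration, not the dissertation's algorithm): rather than applying one local move at a time, a whole batch of candidate swaps on a toy row-placement problem is scored independently and the best one is committed, the style of restructuring that maps onto thousands of GPU threads; the problem setup and cost model are assumptions.

    import random

    # Toy stand-in for GPU floorplanning: score a batch of candidate swap moves
    # independently and commit the best one. On a GPU each candidate would be
    # evaluated by its own thread; here a plain loop plays that role.
    def wirelength(order, nets):
        pos = {blk: i for i, blk in enumerate(order)}   # unit-width blocks in a row
        return sum(abs(pos[a] - pos[b]) for a, b in nets)

    def parallel_swap_step(order, nets, batch=256, rng=random.Random(0)):
        best = (wirelength(order, nets), None)
        for _ in range(batch):                          # "one thread per candidate"
            i, j = rng.randrange(len(order)), rng.randrange(len(order))
            cand = order[:]
            cand[i], cand[j] = cand[j], cand[i]
            c = wirelength(cand, nets)
            if c < best[0]:
                best = (c, cand)
        return best[1] or order                         # keep the best improving move

    blocks = list(range(16))
    nets = [(random.randrange(16), random.randrange(16)) for _ in range(40)]
    order = blocks[:]
    for _ in range(50):
        order = parallel_swap_step(order, nets)
    print(wirelength(order, nets))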
Multi-node Acceleration for Large-scale GCNs
Limited by memory capacity and compute power, single-node graph
convolutional neural network (GCN) accelerators cannot complete the execution
of GCNs on today's explosively large graphs within a reasonable amount of
time. Thus, large-scale GCNs call for a multi-node acceleration system
(MultiAccSys) like TPU-Pod for large-scale neural networks. In this work, we
aim to scale up single-node GCN accelerators to accelerate GCNs on large-scale
graphs. We first identify the communication pattern and challenges of
multi-node acceleration for GCNs on large-scale graphs. We observe that (1)
coarse-grained communication patterns exist in the execution of GCNs in
MultiAccSys, which introduce a massive amount of redundant network transmissions
and off-chip memory accesses; (2) overall, the acceleration of GCNs in
MultiAccSys is bandwidth-bound and latency-tolerant. Guided by these two
observations, we then propose MultiGCN, the first MultiAccSys for large-scale
GCNs that trades network latency for network bandwidth. Specifically, by
leveraging the network latency tolerance, we first propose a topology-aware
multicast mechanism with a one-put-per-multicast message-passing model to
reduce transmissions and alleviate network bandwidth requirements. Second, we
introduce a scatter-based round execution mechanism which cooperates with the
multicast mechanism and reduces redundant off-chip memory accesses. Compared to
the baseline MultiAccSys, MultiGCN achieves 4~12x speedup using only 28%~68%
of the energy, while reducing network transmissions by 32% and off-chip memory
accesses by 73% on average. It not only achieves 2.5~8x speedup over the
state-of-the-art multi-GPU solution, but also scales to large-scale graphs, in
contrast to single-node GCN accelerators.
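To illustrate why a one-put-per-multicast model reduces transmissions (a toy model I constructed, not MultiGCN's implementation), the sketch below counts network sends for a hypothetical vertex placement under a naive per-edge unicast push versus a single multicast put per source feature; the graph, placement, and counting rules are assumptions.

    from collections import defaultdict

    # Illustrative transmission count when pushing a vertex feature to the remote
    # nodes that consume it: a naive per-edge unicast vs. a single multicast put
    # per source feature. A toy model of the idea, not MultiGCN's implementation.
    def transmissions(feature_consumers, placement):
        unicast = multicast = 0
        for vertex, dests in feature_consumers.items():
            src = placement[vertex]
            remote_nodes = {placement[d] for d in dests} - {src}
            unicast += sum(1 for d in dests if placement[d] != src)  # one send per remote consumer
            multicast += 1 if remote_nodes else 0                    # one put serves all remote nodes
        return unicast, multicast

    # Hypothetical graph: 8 vertices placed round-robin over 4 accelerator nodes.
    placement = {v: v % 4 for v in range(8)}
    edges = [(0, 1), (0, 2), (0, 3), (1, 4), (2, 5), (3, 6), (3, 7), (5, 6)]
    consumers = defaultdict(set)
    for u, v in edges:
        consumers[u].add(v)
        consumers[v].add(u)
    print(transmissions(consumers, placement))   # (unicast sends, multicast puts)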
Co-design Hardware and Algorithm for Vector Search
Vector search has emerged as the foundation for large-scale information
retrieval and machine learning systems, with search engines like Google and
Bing processing tens of thousands of queries per second on petabyte-scale
document datasets by evaluating vector similarities between encoded query texts
and web documents. As performance demands for vector search systems surge,
accelerated hardware offers a promising solution in the post-Moore's Law era.
We introduce FANNS, an end-to-end and scalable vector search framework
on FPGAs. Given a user-provided recall requirement on a dataset and a hardware
resource budget, FANNS automatically co-designs hardware and
algorithm, subsequently generating the corresponding accelerator. The framework
also supports scale-out by incorporating a hardware TCP/IP stack in the
accelerator. FANNS attains up to 23.0x and 37.2x speedup compared to FPGA and
CPU baselines, respectively, and demonstrates superior scalability to GPUs,
achieving 5.5x and 7.6x speedup in median and 95th percentile (P95) latency
within an eight-accelerator configuration. The remarkable performance of FANNS
lays a robust groundwork for future FPGA integration in data centers and AI
supercomputers.
Identifying the level of knowledge of final-year Bachelor of Engineering students at KUiTTHO in entrepreneurship from the aspect of capital management
Malaysia is a developing country. In this development process, the nation's
aspiration to produce successful entrepreneurs cannot be taken lightly.
Knowledge of entrepreneurship must therefore be given due attention, and one
of its key aspects is capital: inefficient capital management is a major cause
of entrepreneurial failure. Recognising this, a study on capital management
was conducted on 100 final-year Engineering students at KUiTTHO. This sample
was chosen because these students are about to enter working life, where they
may choose entrepreneurship as a career. Although they are not business
students, they have the skills to design products that can be commercialised.
The findings show that these students are interested in entrepreneurship but
still lack knowledge of capital management, particularly in determining
start-up capital, managing working capital, and determining financing using
the daily-sales method. A capital management guideline was therefore developed
to provide them with this exposure.
A Comprehensive Survey on Distributed Training of Graph Neural Networks
Graph neural networks (GNNs) have been demonstrated to be a powerful
algorithmic model in broad application fields for their effectiveness in
learning over graphs. To scale GNN training up for large-scale and ever-growing
graphs, the most promising solution is distributed training which distributes
the workload of training across multiple computing nodes. At present, the
volume of related research on distributed GNN training is exceptionally vast,
accompanied by an extraordinarily rapid pace of publication. Moreover, the
approaches reported in these studies exhibit significant divergence. This
situation poses a considerable challenge for newcomers, hindering their ability
to grasp a comprehensive understanding of the workflows, computational
patterns, communication strategies, and optimization techniques employed in
distributed GNN training. As a result, there is a pressing need for a survey
that provides a clear overview, analysis, and comparison of this field. In this
paper, we provide a comprehensive survey of distributed GNN training by
investigating various optimization techniques used in distributed GNN training.
First, distributed GNN training is classified into several categories according
to workflow. In addition, the computational and communication patterns of each
category, as well as the optimization techniques proposed in recent work, are
introduced. Second, the software frameworks and hardware platforms of
distributed GNN training are also introduced for a deeper understanding. Third,
distributed GNN training is compared with distributed training of deep neural
networks, emphasizing the uniqueness of distributed GNN training. Finally,
interesting issues and opportunities in this field are discussed.
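To make the surveyed workflows concrete, here is a minimal, single-machine simulation of data-parallel mini-batch GNN training (an illustrative sketch with assumed shapes and a one-layer model, not a framework from the survey): training vertices are partitioned across workers, each worker aggregates neighbour features for its partition, and gradients are averaged as an all-reduce would do, which is where the communication patterns the survey analyses arise.

    import numpy as np

    # Minimal data-parallel sketch of distributed mini-batch GNN training,
    # simulated on one machine. Illustrative assumptions throughout.
    rng = np.random.default_rng(0)
    num_nodes, feat, classes, workers = 200, 16, 4, 4
    X = rng.normal(size=(num_nodes, feat)).astype(np.float32)
    A = (rng.random((num_nodes, num_nodes)) < 0.05).astype(np.float32)   # random adjacency
    A = np.maximum(A, A.T)
    y = rng.integers(0, classes, num_nodes)
    W = rng.normal(scale=0.1, size=(feat, classes)).astype(np.float32)

    parts = np.array_split(rng.permutation(num_nodes), workers)          # vertex partition

    for epoch in range(5):
        grads = []
        for part in parts:                  # each loop body = one worker's local step
            H = A[part] @ X                 # aggregate neighbour features (possibly remote)
            logits = H @ W
            p = np.exp(logits - logits.max(1, keepdims=True))
            p /= p.sum(1, keepdims=True)
            p[np.arange(len(part)), y[part]] -= 1                        # softmax cross-entropy gradient
            grads.append(H.T @ p / len(part))
        W -= 0.1 * np.mean(grads, axis=0)   # "all-reduce": average gradients, update the shared weights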