PUMA: Efficient and Low-Cost Memory Allocation and Alignment Support for Processing-Using-Memory Architectures
Processing-using-DRAM (PUD) architectures impose a restrictive data layout
and alignment for their operands, where source and destination operands (i)
must reside in the same DRAM subarray (i.e., a group of DRAM rows sharing the
same row buffer and row decoder) and (ii) are aligned to the boundaries of a
DRAM row. However, standard memory allocation routines (i.e., malloc,
posix_memalign, and huge pages-based memory allocation) fail to meet the data
layout and alignment requirements for PUD architectures to operate
successfully. To allow the memory allocation API to influence the OS memory
allocator and ensure that memory objects are placed within specific DRAM
subarrays, we propose a new lazy data allocation routine (in the kernel) for
PUD memory objects called PUMA. The key idea of PUMA is to use the internal
DRAM mapping information together with huge pages and then split huge pages
into finer-grained allocation units that are (i) aligned to the page address
and size and (ii) virtually contiguous.
We implement PUMA as a kernel module using QEMU and emulate a RISC-V machine
running Fedora 33 with v5.9.0 Linux Kernel. We emulate the implementation of a
PUD system capable of executing row copy operations (as in RowClone) and
Boolean AND/OR/NOT operations (as in Ambit). In our experiments, such an
operation is performed in the host CPU if a given operation cannot be executed
in our PUD substrate (due to data misalignment). PUMA significantly outperforms
the baseline memory allocators for all evaluated microbenchmarks and allocation
sizes.
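The two layout constraints described in the PUMA abstract (operands in the same subarray, each aligned to a DRAM row boundary) can be illustrated with a small check. This is only a sketch under stated assumptions, not PUMA's kernel code: the row size, the rows-per-subarray count, and the linear address-to-subarray mapping are hypothetical placeholders (real DRAM address mappings are interleaved and device-specific), and all function names are invented for illustration.

```python
# Hypothetical geometry, chosen only for illustration.
ROW_BYTES = 8192          # assumed size of one DRAM row
ROWS_PER_SUBARRAY = 512   # assumed number of rows sharing a row buffer


def subarray_of(phys_addr):
    """Return the subarray index under an assumed linear address mapping."""
    return (phys_addr // ROW_BYTES) // ROWS_PER_SUBARRAY


def is_row_aligned(phys_addr):
    """True if the address sits exactly on a DRAM row boundary."""
    return phys_addr % ROW_BYTES == 0


def pud_eligible(operand_addrs):
    """Check both constraints: row alignment and a single shared subarray."""
    addrs = list(operand_addrs)
    if not addrs:
        return False
    same_subarray = len({subarray_of(a) for a in addrs}) == 1
    return same_subarray and all(is_row_aligned(a) for a in addrs)


def row_copy_target(src, dst):
    """Pick where a row copy runs, falling back to the CPU on misalignment."""
    return "pud" if pud_eligible([src, dst]) else "cpu"
```

Under these assumptions, `row_copy_target(0, ROW_BYTES)` selects the PUD substrate, while a misaligned source or a destination in another subarray falls back to the host CPU, mirroring the fallback behaviour the abstract describes.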
In vivo and in vitro heat shock proteins gene expression in cattle.
The main purpose of this study was to quantify the expression of the heat shock proteins HSPA1A and HSP90AA1 in cow lymphocytes subjected to heat stress either directly (in vivo) or indirectly (in vitro). The aim was to identify differences between HSP expression in vitro and in vivo. The experiment was conducted in the Biometeorology and Ethology Laboratory of FZEA-USP. Three female Holstein Friesian cows were subjected to heat stress by sun exposure. Blood samples were collected after sun exposure, at a temperature of 40 ± 2 °C, over three days. For the in vitro tests, blood from the same animals was collected and placed in a water bath at 40 °C for 4 hours, thus simulating the thermal stress. Total RNA was extracted from lymphocytes, treated with DNase I, and submitted to cDNA synthesis for quantification of HSPA1A and HSP90AA1 gene expression by real-time PCR (qRT-PCR). The data were tested for normality with the Kolmogorov-Smirnov test and for homoscedasticity with the Levene test. Data were analyzed with a general linear model procedure with two fixed factors, treatment and gene expression. Means were compared post hoc with the LSD test and regarded as significantly different when P<0.05. The results showed no significant differences between the in vitro and in vivo treatments.
Chronic daily headache: concepts and treatments
Chronic daily headache associated with analgesic abuse is found in about 3% of the general population and is a not uncommon reason for medical consultation. The International Headache Society Classification does not fully address this entity, and there is currently no consensus on therapeutic modalities in the international literature. This article reviews the concepts and updates the therapy of this entity, with the aim of supporting the generalist training of medical students and residents. The text begins with a definition of the condition and then covers its epidemiology, historical aspects, pathophysiology, clinical findings, and diagnostic criteria, with special attention to treatment.
Accelerating Neural Network Inference with Processing-in-DRAM: From the Edge to the Cloud
Neural networks (NNs) are growing in importance and complexity. A neural
network's performance (and energy efficiency) can be bound either by
computation or memory resources. The processing-in-memory (PIM) paradigm, where
computation is placed near or within memory arrays, is a viable solution to
accelerate memory-bound NNs. However, PIM architectures vary in form, where
different PIM approaches lead to different trade-offs. Our goal is to analyze,
discuss, and contrast DRAM-based PIM architectures for NN performance and
energy efficiency. To do so, we analyze three state-of-the-art PIM
architectures: (1) UPMEM, which integrates processors and DRAM arrays into a
single 2D chip; (2) Mensa, a 3D-stack-based PIM architecture tailored for edge
devices; and (3) SIMDRAM, which uses the analog principles of DRAM to execute
bit-serial operations. Our analysis reveals that PIM greatly benefits
memory-bound NNs: (1) UPMEM provides 23x the performance of a high-end GPU when
the GPU requires memory oversubscription for a general matrix-vector
multiplication kernel; (2) Mensa improves energy efficiency and throughput by
3.0x and 3.1x over the Google Edge TPU for 24 Google edge NN models; and (3)
SIMDRAM outperforms a CPU/GPU by 16.7x/1.4x for three binary NNs. We conclude
that the ideal PIM architecture for NN models depends on a model's distinct
attributes, due to the inherent architectural design choices.
Comment: This is an extended and updated version of a paper published in IEEE
Micro, pp. 1-14, 29 Aug. 2022. arXiv admin note: text overlap with
arXiv:2109.1432
TransPimLib: A Library for Efficient Transcendental Functions on Processing-in-Memory Systems
Processing-in-memory (PIM) promises to alleviate the data movement bottleneck
in modern computing systems. However, current real-world PIM systems have the
inherent disadvantage that their hardware is more constrained than in
conventional processors (CPU, GPU), due to the difficulty and cost of building
processing elements near or inside the memory. As a result, general-purpose PIM
architectures support fairly limited instruction sets and struggle to execute
complex operations such as transcendental functions and other hard-to-calculate
operations (e.g., square root). These operations are particularly important for
some modern workloads, e.g., activation functions in machine learning
applications.
In order to provide support for transcendental (and other hard-to-calculate)
functions in general-purpose PIM systems, we present TransPimLib, a
library that provides CORDIC-based and LUT-based methods for trigonometric
functions, hyperbolic functions, exponentiation, logarithm, square root, etc.
We develop an implementation of TransPimLib for the UPMEM PIM architecture and
perform a thorough evaluation of TransPimLib's methods in terms of performance
and accuracy, using microbenchmarks and three full workloads (Blackscholes,
Sigmoid, Softmax). We open-source all our code and datasets
at https://github.com/CMU-SAFARI/transpimlib.
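To make the CORDIC-based approach mentioned in the TransPimLib abstract concrete, here is a minimal rotation-mode CORDIC sketch for sine and cosine. It is not TransPimLib's implementation or API: the function name and iteration count are assumptions, and it uses Python floats for readability, whereas a constrained PIM core would more plausibly use fixed-point values with shifts and a precomputed angle table.

```python
import math


def cordic_sin_cos(theta, iterations=32):
    """Approximate (sin(theta), cos(theta)) for |theta| <= pi/2.

    Rotation-mode CORDIC: rotate the vector (gain, 0) towards angle theta
    using a fixed sequence of micro-rotations by atan(2**-i).
    """
    # Table of elementary rotation angles (would be a precomputed LUT in hardware).
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]

    # Each micro-rotation stretches the vector; pre-scale to compensate.
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))

    x, y, z = gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate towards the residual angle
        factor = 2.0 ** -i                   # a right shift in fixed-point form
        x, y = x - d * y * factor, y + d * x * factor
        z -= d * angles[i]
    return y, x


s, c = cordic_sin_cos(0.5)
print(abs(s - math.sin(0.5)) < 1e-6, abs(c - math.cos(0.5)) < 1e-6)
```

The loop needs only additions, sign tests, and multiplications by powers of two, which is why CORDIC suits processing elements with limited instruction sets; a LUT-based method would instead trade memory for fewer arithmetic steps.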
Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture
Many modern workloads, such as neural networks, databases, and graph
processing, are fundamentally memory-bound. For such workloads, the data
movement between main memory and CPU cores imposes a significant overhead in
terms of both latency and energy. A major reason is that this communication
happens through a narrow bus with high latency and limited bandwidth, and the
low data reuse in memory-bound workloads is insufficient to amortize the cost
of main memory access. Fundamentally addressing this data movement bottleneck
requires a paradigm where the memory system assumes an active role in computing
by integrating processing capabilities. This paradigm is known as
processing-in-memory (PIM).
Recent research explores different forms of PIM architectures, motivated by
the emergence of new 3D-stacked memory technologies that integrate memory with
a logic layer where processing elements can be easily placed. Past works
evaluate these architectures in simulation or, at best, with simplified
hardware prototypes. In contrast, the UPMEM company has designed and
manufactured the first publicly-available real-world PIM architecture.
This paper provides the first comprehensive analysis of the first
publicly-available real-world PIM architecture. We make two key contributions.
First, we conduct an experimental characterization of the UPMEM-based PIM
system using microbenchmarks to assess various architecture limits such as
compute throughput and memory bandwidth, yielding new insights. Second, we
present PrIM, a benchmark suite of 16 workloads from different application
domains (e.g., linear algebra, databases, graph processing, neural networks,
bioinformatics).
Comment: Our open source software is available at
https://github.com/CMU-SAFARI/prim-benchmark
DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks
Data movement between the CPU and main memory is a first-order obstacle
against improving performance, scalability, and energy efficiency in modern
systems. Computer systems employ a range of techniques to reduce overheads tied
to data movement, spanning from traditional mechanisms (e.g., deep multi-level
cache hierarchies, aggressive hardware prefetchers) to emerging techniques such
as Near-Data Processing (NDP), where some computation is moved close to memory.
Our goal is to methodically identify potential sources of data movement over a
broad set of applications and to comprehensively compare traditional
compute-centric data movement mitigation techniques to more memory-centric
techniques, thereby developing a rigorous understanding of the best techniques
to mitigate each source of data movement.
With this goal in mind, we perform the first large-scale characterization of
a wide variety of applications, across a wide range of application domains, to
identify fundamental program properties that lead to data movement to/from main
memory. We develop the first systematic methodology to classify applications
based on the sources contributing to data movement bottlenecks. From our
large-scale characterization of 77K functions across 345 applications, we
select 144 functions to form the first open-source benchmark suite (DAMOV) for
main memory data movement studies. We select a diverse range of functions that
(1) represent different types of data movement bottlenecks, and (2) come from a
wide range of application domains. Using NDP as a case study, we identify new
insights about the different data movement bottlenecks and use these insights
to determine the most suitable data movement mitigation mechanism for a
particular application. We open-source DAMOV and the complete source code for
our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.
An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System
Training machine learning (ML) algorithms is a computationally intensive
process, which is frequently memory-bound due to repeatedly accessing large
training datasets. As a result, processor-centric systems (e.g., CPU, GPU)
suffer from costly data movement between memory units and processing units,
which consumes large amounts of energy and execution cycles. Memory-centric
computing systems, i.e., with processing-in-memory (PIM) capabilities, can
alleviate this data movement bottleneck.
Our goal is to understand the potential of modern general-purpose PIM
architectures to accelerate ML training. To do so, we (1) implement several
representative classic ML algorithms (namely, linear regression, logistic
regression, decision tree, K-Means clustering) on a real-world general-purpose
PIM architecture, (2) rigorously evaluate and characterize them in terms of
accuracy, performance and scaling, and (3) compare to their counterpart
implementations on CPU and GPU. Our evaluation on a real memory-centric
computing system with more than 2500 PIM cores shows that general-purpose PIM
architectures can greatly accelerate memory-bound ML workloads, when the
necessary operations and datatypes are natively supported by PIM hardware. For
example, our PIM implementation of decision tree is faster than a
state-of-the-art CPU version on an 8-core Intel Xeon, and faster
than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering
on PIM is also faster than state-of-the-art CPU and GPU versions.
To our knowledge, our work is the first one to evaluate ML training on a
real-world PIM architecture. We conclude with key observations, takeaways, and
recommendations that can inspire users of ML workloads, programmers of PIM
architectures, and hardware designers & architects of future memory-centric
computing systems.