570 research outputs found

    POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging

    Fine-tuning models on edge devices like mobile phones would enable privacy-preserving personalization over sensitive data. However, edge training has historically been limited to relatively small models with simple architectures, because training is both memory- and energy-intensive. We present POET, an algorithm that enables training large neural networks on memory-scarce, battery-operated edge devices. POET jointly optimizes the integrated search spaces of rematerialization and paging, two algorithms that reduce the memory consumption of backpropagation. Given a memory budget and a run-time constraint, we formulate a mixed-integer linear program (MILP) for energy-optimal training. Our approach enables training significantly larger models on embedded devices while reducing energy consumption, without modifying the mathematical correctness of backpropagation. We demonstrate that it is possible to fine-tune both ResNet-18 and BERT within the memory constraints of a Cortex-M class embedded device while outperforming current edge training methods in energy efficiency. POET is an open-source project available at https://github.com/ShishirPatil/poet. Comment: Proceedings of the 39th International Conference on Machine Learning 2022 (ICML 2022).
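    Rematerialization, one of the two techniques the POET abstract mentions, can be illustrated with a minimal sketch (the layer functions and checkpoint interval below are illustrative, not from the paper): instead of storing every intermediate activation for the backward pass, only every k-th one is kept, and the others are recomputed from the nearest checkpoint when needed, trading compute for memory.

```python
# Toy sketch of rematerialization (gradient checkpointing).
# Peak activation memory drops from O(n) to roughly O(n/k) saved
# activations, at the cost of replaying up to k-1 layers per recovery.

def forward(x, layers, k):
    """Run layers, keeping only the input and every k-th activation."""
    saved = {0: x}
    for i, f in enumerate(layers):
        x = f(x)
        if (i + 1) % k == 0:
            saved[i + 1] = x
    return x, saved

def recompute(saved, layers, i):
    """Recover activation i by replaying from the nearest saved checkpoint."""
    j = max(s for s in saved if s <= i)
    x = saved[j]
    for f in layers[j:i]:
        x = f(x)
    return x
```

    POET's contribution is choosing, via the MILP, which activations to rematerialize and which to page out to secondary storage so that the combined schedule meets the memory budget at minimal energy.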

    Three main pain points from today's Smartphones

    The massive usage of smartphones in recent years has been a boon for the telecommunications industry. Smartphones have blossomed and are widely used in our daily life, learning, and work. The services, applications, and connectivity that the industry provides are something that consumers talk, blog, and tweet about with passion. The average smartphone consumes only 10% of the data traffic that a laptop does, but smartphones, largely driven by the applications that consumers love, can connect to the network in the background hundreds or even thousands of times daily, pushing signaling traffic higher than network planners ever imagined. Unfortunately, with the widespread introduction of smartphones, mobile network operators are confronted with new challenges: congested network resources, worsening network KPIs, and increasing complaints from end users. What users want: the ingredients for a good quality of experience can be summarized in four words: convenience, immediacy, simplicity, and reliability. Users want a service that is efficient, predictable, and easy to use. Factors contributing to a poor quality of experience include lengthy loading times, no access, crashing, slowing down, poor battery life, or anything potentially complicated or confusing. So the task before us all is to understand how handsets, their batteries, networks, and applications work together, and to determine the proper performance balance for each, so that both end-user experience and network efficiency can be optimized. Keywords: mobile apps, battery, signaling, smartphone

    C-RAM: Breaking Mobile Device Memory Barriers Using the Cloud

    Mobile applications are constrained by the available memory of mobile devices. We present C-RAM, a system that uses cloud-based memory to extend the memory of mobile devices. It splits application state and its associated computation between a mobile device and a cloud node to allow applications to consume more memory, while minimising the performance impact. C-RAM thus enables developers to realise new applications or port legacy desktop applications with a large memory footprint to mobile platforms without explicitly designing them to account for memory limitations. To handle network failures with partitioned application state, C-RAM uses a new snapshot-based fault tolerance mechanism in which changes to remote memory objects are periodically backed up to the device. After failure, or when network usage exceeds a given limit, the device rolls back execution to continue from the last snapshot. C-RAM supports local execution with an application state that exceeds the available device memory through a user-level virtual memory: objects are loaded on demand from snapshots in flash memory. Our C-RAM prototype supports Objective-C applications on the unmodified iOS platform. With C-RAM, applications can consume 10× more memory than the device capacity, with a negligible impact on application performance. In some cases, C-RAM even achieves a significant speed-up in execution time (up to 9.7×).
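    The snapshot-and-rollback idea behind C-RAM's fault tolerance can be sketched in a few lines (class and method names here are illustrative; the real system checkpoints Objective-C object state at a much finer granularity): state is periodically deep-copied to a local backup, and on a network failure execution resumes from that copy.

```python
# Hedged sketch of periodic-snapshot fault tolerance: checkpoint()
# models the periodic backup of remote-object changes to the device,
# rollback() models recovery after a network failure.
import copy

class SnapshotStore:
    def __init__(self, state):
        self.state = state
        self.snapshot = copy.deepcopy(state)

    def checkpoint(self):
        """Periodically back up current state to local storage."""
        self.snapshot = copy.deepcopy(self.state)

    def rollback(self):
        """On failure, discard un-checkpointed changes and resume."""
        self.state = copy.deepcopy(self.snapshot)
```

    The trade-off this models is the one the abstract describes: more frequent checkpoints cost bandwidth and energy, but bound how much work is lost when the device must roll back.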

    Swap System Optimization through Memory Swap Pattern Analysis

    Master's thesis -- Seoul National University Graduate School, Department of Computer Science and Engineering, February 2021. Advisor: Heon Y. Yeom. The use of memory is one of the key parts of modern computer architecture (the von Neumann architecture), but when memory is limited, it can be the most lethal part at the same time. Advances in hardware and software are making rapid strides in areas such as Big Data, HPC, and machine learning, and the use of memory increases along with those advances. In the server environment, various programs share resources, which leads to a shortage of resources. Memory is one of those resources and needs to be managed. When the system is out of memory, the operating system evicts some pages to storage and then loads the requested pages into memory. Given that storage performance is slower than memory, swap-induced delay is one of the critical issues behind overall performance degradation. Therefore, we designed and implemented swpTracer to visualize and trace swap-in/out movement. To check the generality of the tool, we used mlock to optimize 429.mcf of SPEC CPU 2006 based on hints from swpTracer. The optimized program executes 2 to 3 times faster than the original in a memory-scarce environment. The performance improvement obtained with existing system calls shrinks as the memory shortfall grows. To sustain the improvement, we built a swap-prefetch mechanism to read swapped-out pages ahead of time. The application optimized with swpTracer and swap-prefetch consistently exceeds the performance of the original code by 1.5x. The Korean abstract adds that swap-prefetch was also evaluated on a program traversing an array three times while varying the array size: it averaged a 1.5x improvement over the original code and over madvise, and was on average 1.25x faster than mlock.
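    The kind of swap-in/out trace the thesis tool visualizes can be mimicked with a toy simulator (all names and sizes below are illustrative, not from the thesis): pages are managed with LRU eviction under a fixed frame budget, and pinning a hot page, as an mlock-based hint would, removes its re-fetches from the trace.

```python
# Illustrative sketch, not swpTracer itself: count swap-ins for an
# access sequence under LRU eviction; `pinned` models mlock'd pages
# that may never be chosen as eviction victims.
from collections import OrderedDict

def simulate(accesses, frames, pinned=frozenset()):
    mem, swap_ins = OrderedDict(), 0
    for page in accesses:
        if page in mem:
            mem.move_to_end(page)        # refresh LRU position
            continue
        swap_ins += 1                    # page fault: fetch from swap
        if len(mem) >= frames:
            for victim in mem:           # evict LRU, skipping pinned
                if victim not in pinned:
                    del mem[victim]
                    break
        mem[page] = True
    return swap_ins
```

    Running the same access pattern with and without a pinned page shows the effect the thesis measures: pinning the page that thrashes most cuts the swap-in count, which is exactly the hint swpTracer is designed to surface.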

    Scrolling vs Paging: Reading Performance and Preference of Reading Modes in Long-form Online News

    This study explores the impact of scrolling and dynamic pagination in long-form online documents on reader performance and reader experience. Previous research has produced mixed results, indicating either no difference between modes or a positive effect favouring scrolling. Recent advances in web standards have enabled simpler, dynamic, performant methods of pagination that tailor content responsively to any screen, meriting renewed study in this area. This paper uses one such method to load subsequent online news pages instantly, without buffering. In an online browser experiment with 38 participants, a statistically significant increase in reading speed was found in the scrolling mode. This follows previous research suggesting that while a scrolling presentation style places extra demands on working memory capacity (WMC), many current web users have developed compensatory strategies and cognitive flexibility for navigating scrolling web documents.

    VIRTUAL MEMORY ON A MANY-CORE NOC

    Many-core devices are likely to become increasingly common in real-time and embedded systems as computational demands grow and as expectations for higher performance can generally only be met by increasing core counts rather than relying on higher clock speeds. Network-on-chip devices, where multiple cores share a single slice of silicon and employ packetised communications, are a widely-deployed many-core option for system designers. As NoCs are expected to run larger and more complex programs, the small amount of fast, on-chip memory available to each core is unlikely to be sufficient for all but the simplest of tasks, and it is necessary to find an efficient, effective, and time-bounded means of accessing resources stored in off-chip memory, such as DRAM or Flash storage. The abstraction of paged virtual memory is a familiar technique for managing similar tasks in general computing but has often been shunned by real-time developers because of concern about time predictability. We show it can be a poor choice for a many-core NoC system as, unmodified, it typically uses page sizes optimised for interaction with spinning disks rather than solid-state media, and transports significant volumes of subsequently unused data across already congested links. In this work we outline and simulate an efficient partial paging algorithm where only those memory resources that are locally accessed are transported between global and local storage. We further show that smaller page sizes add to efficiency. We examine the factors that lead to timing delays in such systems, and show we can predict worst-case execution times, even at safety-critical thresholds, by using statistical methods from extreme value theory. We also show these results are applicable to systems with a variety of connections to memory.
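    The partial-paging argument can be made concrete with a small sketch (function name, page size, and block size are illustrative, not from the paper): fetching a whole page moves every byte of it over the congested NoC links, while fetching only the sub-blocks that are actually accessed moves far less, and smaller transfer granules amplify the saving.

```python
# Toy model of NoC traffic per page fault: whole-page transfer vs.
# partial paging at sub-block granularity.

def bytes_transferred(accessed_offsets, page_size, block_size=None):
    """Bytes moved to service the given byte offsets within one page.

    block_size=None models conventional whole-page transfer; otherwise
    only the distinct block_size-sized sub-blocks touched are fetched.
    """
    if block_size is None:
        return page_size
    blocks = {off // block_size for off in accessed_offsets}
    return len(blocks) * block_size
```

    For a sparse access pattern touching two cache lines of a 4 KiB page, whole-page transfer moves 4096 bytes where 64-byte partial paging moves only 128, which is the kind of link-traffic reduction the simulated algorithm targets.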