198 research outputs found

    Sound of Violent Images / Violence of Sound Images: Pulling apart Tom and Jerry

    Get PDF
    Violence permeates Tom and Jerry in the repetitive, physically violent gags and scenes of humiliation and mocking, yet unarguably, there is comedic value in the onscreen violence.The musical scoring of Tom and Jerry in the early William Hanna and Joseph Barbera period of production (pre-1958) by Scott Bradley played a key role in conveying the comedic impact of violent gags due to the close synchronisation of music and sound with visual action and is typified by a form of sound design characteristic of zip crash animation as described by Paul Taberham (2012), in which sound actively participates in the humour and directly influences the viewer’s interpretation of the visual action. This research investigates the sound-image relationships in Tom and Jerry through practice, by exploring how processes of decontextualisation and desynchronisation of sound and image elements of violent gags unmask the underlying violent subtext of Tom and Jerry’s slapstick comedy. This research addresses an undertheorised area in animation related to the role of sound-image synchronisation and presents new knowledge derived from the novel application of audiovisual analysis of Tom and Jerry source material and the production of audiovisual artworks. The findings of this research are discussed from a pan theoretical perspective drawing on theorisation of film sound and cognitivist approaches to film music. This investigation through practice, supports the notion that intrinsic and covert processes of sound-image synchronisation as theorised by Kevin Donnelly (2014), play a key role in the reading of slapstick violence as comedic. Therefore, this practice-based research can be viewed as a case study that demonstrates the potential of a sampling-based creative practice to enable new readings to emerge from sampled source material. Novel artefacts were created in the form of audiovisual works that embody specific knowledge of factors related to the reconfiguration of sound-image relations and their impact in altering viewers’ readings of violence contained within Tom and Jerry. Critically, differences emerged between the artworks in terms of the extent to which they unmasked underlying themes of violence and potential mediating factors are discussed related to the influence of asynchrony on comical framing, the role of the unseen voice, perceived musicality and perceptions of interiority in the audiovisual artworks. The research findings yielded new knowledge regarding a potential gender-based bias in the perception of the human voice in the animated artworks produced. This research also highlights the role of intra-animation dimensions pertaining to the use of the single frame, the use of blank spaces and the relationship of sound-image synchronisation to the notion of the acousmatic imaginary. The PhD includes a portfolio of experimental audiovisual artworks produced during the testing and experimental phases of the research on which the textual dissertation critically reflects

    Vector-Processing for Mobile Devices: Benchmark and Analysis

    Full text link
    Vector processing has become commonplace in today's CPU microarchitectures. Vector instructions improve performance and energy which is crucial for resource-constraint mobile devices. The research community currently lacks a comprehensive benchmark suite to study the benefits of vector processing for mobile devices. This paper presents Swan-an extensive vector processing benchmark suite for mobile applications. Swan consists of a diverse set of data-parallel workloads from four commonly used mobile applications: operating system, web browser, audio/video messaging application, and PDF rendering engine. Using Swan benchmark suite, we conduct a detailed analysis of the performance, power, and energy consumption of vectorized workloads, and show that: (a) Vectorized kernels increase the pressure on cache hierarchy due to the higher rate of memory requests. (b) Vector processing is more beneficial for workloads with lower precision operations and higher cache hit rates. (c) Limited Instruction-Level Parallelism and strided memory accesses to multi-dimensional data structures prevent vector processing benefits from scaling with more SIMD functional units and wider registers. (d) Despite lower computation throughput than domain-specific accelerators, such as GPU, vector processing outperforms these accelerators for kernels with lower operation counts. Finally, we show five common computation patterns in mobile data-parallel workloads that dominate the execution time.Comment: 2023 IEEE International Symposium on Workload Characterization (IISWC

    The FastLanes Compression Layout: Decoding >100 billion integers per second with scalar code

    Get PDF
    The open-source FastLanes project aims to improve big data formats, such as Parquet, ORC and columnar database formats, in multiple ways. In this paper, we significantly accelerate decoding of all common Light-Weight Compression (LWC) schemes: DICT, FOR, DELTA and RLE through better data-parallelism. We do so by re-designing the compression layout using two main ideas: (i) generalizing the value interleaving technique in the basic operation of bit-(un)packing by targeting a virtual 1024-bits SIMD register, (ii) reordering the tuples in all columns of a table in the same Unified Transposed Layout that puts tuple chunks in a common “04261537” order (explained in the paper); allowing for maximum independent work for all possible basic SIMD lane widths: 8, 16, 32, and 64 bits. We address the software development, maintenance and futureproofness challenges of increasing hardware diversity, by defining a virtual 1024-bits instruction set that consists of simple operators supported by all SIMD dialects; and also, importantly, by scalar code. The interleaved and tuple-reordered layout actually makes scalar decoding faster, extracting more data-parallelism from today’s wide-issue CPUs. Importantly, the scalar version can be fully auto-vectorized by modern compilers, eliminating technical debt in software caused by platform-specific SIMD intrinsics. Micro-benchmarks on Intel, AMD, Apple and AWS CPUs show that FastLanes accelerates decoding by factors (decoding >40 values per CPU cycle). FastLanes can make queries faster, as compressing the data reduces bandwidth needs, while decoding is almost free

    Accelerating 128-bit Floating-Point Matrix Multiplication on FPGAs

    Full text link
    General Matrix Multiplication (GEMM) is a fundamental operation widely used in scientific computations. Its performance and accuracy significantly impact the performance and accuracy of applications that depend on it. One such application is semidefinite programming (SDP), and it often requires binary128 or higher precision arithmetic to solve problems involving SDP stably. However, only some processors support binary128 arithmetic, which makes SDP solvers generally slow. In this study, we focused on accelerating GEMM with binary128 arithmetic on field-programmable gate arrays (FPGAs) to enable the flexible design of accelerators for the desired computations. Our binary128 GEMM designs on a recent high-performance FPGA achieved approximately 90GFlops, 147x faster than the computation executed on a recent CPU with 20 threads for large matrices. Using our binary128 GEMM design on the FPGA, we successfully accelerated two numerical applications: LU decomposition and SDP problems, for the first time.Comment: 12 pages, 8 figure

    Morpheus unleashed: Fast cross-platform SpMV on emerging architectures

    Full text link
    Sparse matrices and linear algebra are at the heart of scientific simulations. Over the years, more than 70 sparse matrix storage formats have been developed, targeting a wide range of hardware architectures and matrix types, each of which exploit the particular strengths of an architecture, or the specific sparsity patterns of the matrices. In this work, we explore the suitability of storage formats such as COO, CSR and DIA for emerging architectures such as AArch64 CPUs and FPGAs. In addition, we detail hardware-specific optimisations to these targets and evaluate the potential of each contribution to be integrated into Morpheus, a modern library that provides an abstraction of sparse matrices (currently) across x86 CPUs and NVIDIA/AMD GPUs. Finally, we validate our work by comparing the performance of the Morpheus-enabled HPCG benchmark against vendor-optimised implementations

    The FastLanes compression layout: Decoding >100 billion integers per second with scalar code

    Get PDF
    The open-source Fast Lanes project aims to improve big data formats, such as Parquet, ORC and columnar database formats, in multiple ways. In this paper, we significantly accelerate decoding of all common Light-Weight Compression (LWC) schemes: DICT, FOR, DELTA and RLE through better data-parallelism. We do so by re-designing the compression layout using two main ideas: (i) generalizing the value interleaving technique in the basic operation of bit-(un)packing by targeting a virtual 1024-bits SIMD register, (ii) reordering the tuples in all columns of a table in the same Unified Transposed Layout that puts tuple chunks in a common "104261537" order (explained in the paper); allowing for maximum independent work for all possible basic SIMD lane widths: 8, 16, 32, and 64 bits. We address the software development, maintenance and future proofness challenges of increasing hardware diversity, by defining a virtual 1024-bits instruction set that consists of simple operators supported by all SIMD dialects; and also, importantly, by scalar code. The interleaved and tuple-reordered layout actually makes scalar decoding faster, extracting more data-parallelism from today’s wide-issue CPUs. Importantly, the scalar version can be fully auto-vectorized by modern compilers, eliminating technical debt in software caused by platform-specific SIMD intrinsics. Micro-benchmarks on Intel, AMD, Apple and AWS CPUs show that Fast Lanes accelerates decoding by factors (decoding > 40 values per CPU cycle). Fast Lanes can make queries faster, as compressing the data reduces bandwidth needs, while decoding is almost free

    GPU-based Architecture Modeling and Instruction Set Extension for Signal Processing Applications

    Get PDF
    The modeling of embedded systems attempts to estimate the performance and costs prior to the implementation. The early stage predictions for performance and power dissipation reduces the more costly late stage design modifications. Workload modeling is an approach where an abstract application is evaluated against an abstract architecture. The challenge in modeling is the balance between fidelity and simplicity, where fidelity refers to the correctness of the predictions and the simplicity relates to the simulation time of the model and its ease of comprehension for the developer. A model named GSLA for performance and power modeling is presented, which extends existing architecture modeling by including GPUs as parallel processing elements. The performance model showed an average fidelity of 93% and the power model demonstrated an average fidelity of 84% between the models and several application measurements. The GSLA model is very simple: only 2 parameters that can be obtained by automated scripts. Besides the modeling, this thesis addresses lower level signal processing system improvements by proposing Instruction Set Architecture (ISA) extensions for RISC-V processors. A vehicle classifier neural network model was used as a case study, in which the benefit of Bit Manipulation Instructions (BMI) is shown. The result is a new PopCount instruction extension that is verified in ETISS simulator. The PopCount extension of RISC-V ISA showed a performance improvement of more than double for the vehicle classifier application. In addition, the design flow for adding a new instruction extension for a re-configurable platform is presented. The GPU modeling and the RISC-V ISA extension added new features to the state of the art. They improve the modeling features as well as reduce the execution costs in signal processing platforms

    Compiler-centric across-stack deep learning acceleration

    Get PDF
    Optimizing the deployment of Deep Neural Networks (DNNs) is hard. Despite deep learning approaches increasingly providing state-of-the-art solutions to a variety of difficult problems, such as computer vision and natural language processing, DNNs can be prohibitively expensive, for example, in terms of inference time or memory usage. Effective exploration of the design space requires a holistic approach, including a range of topics from machine learning, systems, and hardware. The rapid proliferation of deep learning applications has raised demand for efficient exploration and acceleration of deep learning based solutions. However, managing the range of optimization techniques, as well as how they interact with each other across the stack is a non-trivial task. A family of emerging specialized compilers for deep learning, tensor compilers, appear to be a strong candidate to help manage the complexity of across-stack optimization choices, and enable new approaches. This thesis presents new techniques and explorations of the Deep Learning Acceleration Stack (DLAS), with the perspective that the tensor compiler will increasingly be the center of this stack. First, we motivate the challenges in exploring DLAS, by describing the experience of running a perturbation study varying parameters at every layer of the stack. The core of the study is implemented using a tensor compiler, which reduces the complexity of evaluating the wide range of variants, although still requires a significant engineering effort to realize. Next, we develop a new algorithm for grouped convolution, a model optimization technique for which existing solutions provided poor inference time scaling. We implement and optimize our algorithm using a tensor compiler, outperforming existing approaches by 5.1Ă— on average (arithmetic mean). Finally, we propose a technique, transfer-tuning, to reduce the search time required for automatic tensor compiler code optimization, reducing the search time required by 6.5Ă— on average. The techniques and contributions of this thesis across these interconnected domains demonstrate the exciting potential of tensor compilers to simplify and improve design space exploration for DNNs, and their deployment. The outcomes of this thesis enable new lines of research to enable machine learning developers to keep up with the rapidly evolving landscape of neural architectures and hardware
    • …
    corecore