    Evolutionary optimization of all-dielectric magnetic nanoantennas

    Magnetic light and matter interactions are generally too weak to be detected, studied and applied technologically. However, if one can increase the magnetic power density of light by several orders of magnitude, the coupling between magnetic light and matter could become of the same order of magnitude as the coupling with its electric counterpart. For that purpose, photonic nanoantennas have been proposed, and in particular dielectric nanostructures, to engineer strong local magnetic fields and therefore increase the probability of magnetic interactions. Unfortunately, dielectric designs suffer from physical limitations that confine the magnetic hot spot in the core of the material itself, preventing experimental and technological implementations. Here, we demonstrate that evolutionary algorithms can overcome such limitations by designing new dielectric photonic nanoantennas, able to increase and extract the optical magnetic field from high refractive index materials. We also demonstrate that the magnetic power density in an evolutionary optimized dielectric nanostructure can be increased by a factor of 5 compared to state-of-the-art dielectric nanoantennas. In addition, we show that the fine details of the nanostructure are not critical in reaching these features, as long as the general shape of the motif is maintained. This supports the feasibility of nanofabricating the optimized antennas experimentally and their subsequent application. By designing all-dielectric magnetic antennas that feature local magnetic hot spots outside of high refractive index materials, this work highlights the potential of evolutionary methods to fill the gap between electric and magnetic light-matter interactions, opening up new possibilities in many research fields. Comment: 13 pages, 4 figures

    Effective data parallel computing on multicore processors

    The rise of chip multiprocessing, or the integration of multiple general purpose processing cores on a single chip (multicores), has impacted all computing platforms including high performance, servers, desktops, mobile, and embedded processors. Programmers can no longer expect continued increases in software performance without developing parallel, memory hierarchy friendly software that can effectively exploit the chip level multiprocessing paradigm of multicores. The goal of this dissertation is to demonstrate a design process for data parallel problems that starts with a sequential algorithm and ends with a high performance implementation on a multicore platform. Our design process combines theoretical algorithm analysis with practical optimization techniques. Our target multicores are quad-core processors from Intel and the eight-SPE IBM Cell B.E. Target applications include Matrix Multiplications (MM), Finite Difference Time Domain (FDTD), LU Decomposition (LUD), and Power Flow Solver based on Gauss-Seidel (PFS-GS) algorithms. These applications are popular computation methods in science and engineering problems and are characterized by unit-stride (MM, LUD, and PFS-GS) or 2-point stencil (FDTD) memory access patterns. The main contributions of this dissertation include a cache- and space-efficient algorithm model, integrated data pre-fetching and caching strategies, and in-core optimization techniques. Our multicore efficient implementations of the above described applications outperform naïve parallel implementations by at least 2x and scale well with problem size and with the number of processing cores.

    Resource Optimized Scheduling For Enhanced Power Efficiency And Throughput On Chip Multi Processor Platforms

    The parallel nature of process execution on Chip Multi-Processors (CMPs) has boosted levels of application performance far beyond the capabilities of erstwhile single-core designs. Generally, CMPs offer improved performance by integrating multiple simpler cores onto a single die that share certain computing resources among them such as last-level caches, data buses, and main memory. This ensures architectural simplicity while also boosting performance for multi-threaded applications. However, a major trade-off associated with this approach is that concurrently executing applications incur performance degradation if their collective resource requirements exceed the total amount of resources available to the system. If dynamic resource allocation is not carefully considered, the potential performance gain from having multiple cores may be outweighed by the losses due to contention for allocation of shared resources. Additionally, CMPs with inbuilt dynamic voltage-frequency scaling (DVFS) mechanisms may try to compensate for the performance bottleneck by scaling to higher clock frequencies. For performance degradation due to shared-resource contention, this does not necessarily improve performance but does incur a significant penalty on power consumption due to the quadratic relation of electrical power and voltage (P_dynamic ∝ V^2 * f). This dissertation presents novel methodologies for balancing the competing requirements of high performance, fairness of execution, and enforcement of priority, while also ensuring overall power efficiency of CMPs.
Specifically, we (1) analyze the problem of resource interference during concurrent process execution and propose two fine-grained scheduling methodologies for improving overall performance and fairness, (2) develop an approach for enforcement of priority (i.e., minimum performance) for specific processes while avoiding resource starvation for others, and (3) present a machine-learning approach for maximizing the power efficiency (performance-per-Watt) of CMPs through estimation of a workload's performance and power consumption limits at different clock frequencies. As modern computing workloads become increasingly dynamic, and computers themselves become increasingly ubiquitous, the problem of finding the ideal balance between performance and power consumption of CMPs is of particular relevance today, especially given the unprecedented proliferation of embedded devices for use in Internet-of-Things, edge computing, smart wearables, and even exotic experiments such as space probes composed entirely of a CMP, sensors, and an antenna ("space chips"). Additionally, reducing power consumption while maintaining constant performance can contribute to addressing the growing problem of dark silicon.

    Viability of Numerical Full-Wave Techniques in Telecommunication Channel Modelling

    In telecommunication channel modelling the wavelength is small compared to the physical features of interest, therefore deterministic ray tracing techniques provide solutions that are more efficient and faster than current numerical full-wave techniques while still staying within time constraints. Solving the fundamental Maxwell's equations is at the core of computational electrodynamics and is best suited for modelling electrical field interactions with physical objects where the characteristic dimensions of a computing domain are on the order of a few wavelengths in size. However, extreme communication speeds, wireless access points closer to the user and smaller pico and femto cells will require increased accuracy in predicting and planning wireless signals, testing the accuracy limits of the ray tracing methods. The increased computing capabilities and the demand for better characterization of communication channels that span smaller geographical areas make numerical full-wave techniques an attractive alternative even for larger problems. The paper surveys ways of overcoming the excessive time requirements of numerical full-wave techniques while providing acceptable channel modelling accuracy for the smallest radio cells and possibly wider. We identify several research paths that could lead to improved channel modelling, including numerical algorithm adaptations for large-scale problems, alternative finite-difference approaches, such as meshless methods, and dedicated parallel hardware, possibly as a realization of a dataflow machine.

    COMET: A Cross-Layer Optimized Optical Phase Change Main Memory Architecture

    Traditional DRAM-based main memory systems face several challenges with memory refresh overhead, high latency, and low throughput as the industry moves towards smaller DRAM cells. These issues have been exacerbated by the emergence of data-intensive applications in recent years. Memories based on phase change materials (PCMs) offer promising solutions to these challenges. PCMs store data in the material's phase, which can shift between amorphous and crystalline states when external thermal energy is supplied. This is often achieved using electrical pulses. Alternatively, using laser pulses and integration with silicon photonics offers a unique opportunity to realize high-bandwidth and low-latency photonic memories. Such a memory system may in turn open the possibility of realizing fully photonic computing systems. But to realize photonic memories, several challenges that are unique to the photonic domain such as crosstalk, optical loss management, and laser power overhead have to be addressed. In this work, we present COMET, the first cross-layer optimized optical main memory architecture that uses PCMs. In architecting COMET, we explore how to use silicon photonics and PCMs together to design a large-scale main memory system while addressing associated challenges. We explore challenges and propose solutions at the PCM cell, photonic memory circuit, and memory architecture levels. Based on our evaluations, COMET offers 7.1x better bandwidth, 15.1x lower EPB, and 3x lower latencies than the best-known prior work on photonic main memory architecture design

    Intra-cluster coalescing and distributed-block scheduling to reduce GPU NoC pressure

    GPUs continue to boost the number of streaming multiprocessors (SMs) to provide increasingly higher compute capabilities. To construct a scalable crossbar network-on-chip (NoC) that connects the SMs to the memory controllers, a cluster structure is introduced in modern GPUs in which several SMs are grouped together to share a network port. Because of network port sharing, clustered GPUs face severe NoC congestion, which creates a critical performance bottleneck. In this paper, we target redundant network traffic to mitigate GPU NoC congestion. In particular, we observe that in many GPU-compute applications, different SMs in a cluster access shared data. Sending redundant requests to access the same memory location wastes valuable NoC bandwidth; we find on average 19 percent (and up to 48 percent) of the requests to be redundant. To remove redundant NoC traffic, we propose distributed-block scheduling, intra-cluster coalescing (ICC) and the coalesced cache (CC) to coalesce L1 cache misses within and across SMs in a cluster, respectively. Our evaluation results show that distributed-block scheduling, ICC and CC are complementary and improve both performance and energy consumption. We report an average performance improvement of 15 percent (and up to 67 percent) while at the same time reducing system energy by 6 percent (and up to 19 percent) and improving the energy-delay product (EDP) by 19 percent on average (and up to 53 percent), compared to state-of-the-art distributed CTA scheduling.