72 research outputs found

    The JM-Filter to detect specific frequency in monitored signal

    The Discrete Fourier Transform (DFT) is a mathematical procedure that stands at the center of the processing inside a digital signal processor. It is widely known and argued in the relevant literature that the Fast Fourier Transform (FFT) is useless for detecting specific frequencies in a monitored signal of length N because most of the computed results are ignored. In this paper, we present an efficient FFT-based method to detect specific frequencies in a monitored signal, which is then compared with the most frequently used method, the recursive Goertzel algorithm, which detects and analyses one selectable frequency component of a discrete signal. The proposed JM-Filter algorithm reduces the number of iterations compared to the first and second order Goertzel algorithm by a factor of r, where r represents the radix of the JM-Filter. The obtained results are significant in terms of computational reduction and accuracy in fixed-point implementation. Gains of 15 dB and 19 dB in signal-to-quantization-noise ratio (SQNR) were respectively observed for the proposed first and second order radix-8 JM-Filter in comparison to the Goertzel algorithm.
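
For readers unfamiliar with the baseline named in this abstract, the following is a minimal sketch of the textbook second-order Goertzel recursion for a single DFT bin of a real-valued signal. It is not the JM-Filter and is not taken from the paper; the function names `goertzel_power` and `dft_power` are hypothetical and chosen only for this illustration.

```python
import cmath
import math


def goertzel_power(samples, k):
    """Return |X[k]|^2 of the N-point DFT of real `samples` (second-order Goertzel)."""
    n = len(samples)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    # One multiply and two additions per input sample.
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    # Squared magnitude from the last two state values.
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2


def dft_power(samples, k):
    """Reference value: |X[k]|^2 computed directly from the DFT definition."""
    n = len(samples)
    acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
    return abs(acc) ** 2
```

The recursion touches each sample once and produces a single bin, which is why it is the usual point of comparison when only a few frequencies matter.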

    More Bang for Your Buck: Improved use of GPU Nodes for GROMACS 2018

    We identify hardware that is optimal to produce molecular dynamics trajectories on Linux compute clusters with the GROMACS 2018 simulation package. To this end, we benchmark the GROMACS performance on a diverse set of compute nodes and relate it to the costs of the nodes, which may include their lifetime costs for energy and cooling. In agreement with our earlier investigation using GROMACS 4.6 on hardware of 2014, the performance-to-price ratio of consumer GPU nodes is considerably higher than that of CPU nodes. However, with GROMACS 2018, the optimal CPU to GPU processing power balance has shifted even more towards the GPU. Hence, nodes optimized for GROMACS 2018 and later versions enable a significantly higher performance-to-price ratio than nodes optimized for older GROMACS versions. Moreover, the shift towards GPU processing allows old nodes to be cheaply upgraded with recent GPUs, yielding essentially the same performance as comparable brand-new hardware.
    Comment: 41 pages, 13 figures, 4 tables. This updated version includes the following improvements: most notably, added benchmarks for two coarse grain MARTINI systems VES and BIG, resulting in a new Figure 13; fixed typos; made text clearer in some places; added two more benchmarks for MEM and RIB systems (E3-1240v6 + RTX 2080 / 2080Ti

    Low-Complexity Multicarrier Waveform Processing Schemes for Future Wireless Communications

    Wireless communication systems deliver an enormous variety of services and applications. Nowadays, wireless communications play a key role in many fields, such as industry, social life, education, and home automation. The growing demand for wireless services and applications has motivated the development of the next generation cellular radio access technology called fifth-generation new radio (5G-NR). The future networks are required to raise the delivered user data rates to gigabits per second, reduce the communication latency below 1 ms, and enable communications for a massive number of simple devices. Those main features of the future networks come with new demands for the wireless communication systems, such as enhancing the efficiency of radio spectrum use in frequency bands below 6 GHz, while supporting various services with quite different requirements for the waveform-related key parameters. The current wireless systems lack the capabilities to handle those requirements. For example, long-term evolution (LTE) employs the cyclic-prefix orthogonal frequency-division multiplexing (CP-OFDM) waveform, which has critical drawbacks in the 5G-NR context. The basic drawback of the CP-OFDM waveform is the lack of spectral localization. Therefore, spectrally enhanced variants of CP-OFDM or other multicarrier waveforms with well-localized spectrum should be considered. This thesis investigates spectrally enhanced CP-OFDM (E-OFDM) schemes to suppress the out-of-band (OOB) emissions, which are normally produced by CP-OFDM. Commonly, the weighted overlap-and-add (WOLA) scheme applies a smooth time-domain window to the CP-OFDM waveform, providing spectrally enhanced subcarriers and reducing the OOB emissions with very low additional computational complexity. Nevertheless, the suppression performance of WOLA-OFDM is not sufficient near the active subband. Another technique is based on filtering the CP-OFDM waveform, which is referred to as F-OFDM.
F-OFDM is able to provide a well-localized spectrum, but with a significant increase in computational complexity in the basic scheme with time-domain filters. Filter-bank multicarrier (FBMC) waveforms are also included in this study. FBMC has been widely studied as a potential post-OFDM scheme with nearly ideal subcarrier spectrum localization. However, this scheme has quite high computational complexity while being limited to uniformly distributed subbands. Nevertheless, filter-bank based waveform processing is one of the main topics of this work. Instead of traditional polyphase network (PPN) based uniform filter banks, the focus is on fast-convolution filter banks (FC-FBs), which utilize fast Fourier transform (FFT) domain processing to effectively realize filter banks with high flexibility in terms of subcarrier bandwidths and center frequencies. FC-FBs are applied for both FBMC and F-OFDM waveform generation and processing with greatly increased flexibility and significantly reduced computational complexity. This study proposes novel structures for FC-FB processing based on decomposition of the FC-FB structure consisting of forward and inverse discrete Fourier transforms (DFT and IDFT). The decomposition of multirate FC provides means of reducing the computational complexity in some important specific scenarios. A generic FC decomposition model is proposed and analyzed. This scheme is mathematically equivalent to the corresponding direct FC implementation, with exactly the same performance. The benefits of the optimized decomposition structure appear mainly in communication scenarios with a relatively narrow active transmission band, resulting in significantly reduced computational complexity compared to the direct FC structure. The narrowband scenarios find their places in the recent 3GPP specification of the cellular low-power wide-area (LPWA) access technology called narrowband internet-of-things (NB-IoT).
NB-IoT aims at introducing the IoT to LTE and GSM frequency bands in coexistence with those technologies. NB-IoT uses CP-OFDM based waveforms with parameters compatible with LTE. However, additional means are also needed for NB-IoT transmitters to improve the spectrum localization. For NB-IoT user devices, it is important to consider ultra-low complexity solutions, and a look-up table (LUT) based approach is proposed to implement NB-IoT uplink transmitters with filtered waveforms. This approach provides completely multiplication-free digital baseband implementations, and the addition rates are similar to or smaller than those in the basic NB-IoT waveform generation without the elements needed for spectrum enhancement. The basic idea includes storing full or partial waveforms for all possible data symbol combinations. Then the transmitted waveform is composed through summation of the needed stored partial waveforms and trivial phase rotations. The LUT based scheme is developed with different variants tackling practical implementation issues of NB-IoT device transmitters, considering also the effects of a nonlinear power amplifier. Moreover, a completely multiplication- and addition-free LUT variant is proposed and found to be feasible for very narrowband transmission, with up to 3 subcarriers. The finite-wordlength performance of the LUT variants is evaluated through simulations.
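
The LUT idea described in this abstract (stored partial waveforms combined by summation and trivial phase rotations) can be illustrated with a small sketch. Everything in it is an assumption made for illustration and not the thesis design: the block length `N`, the subcarrier indices, the use of unfiltered complex exponentials as the stored waveforms, and symbols restricted to the four powers of j. The point is only that a rotation by a multiple of 90 degrees needs no multiplier.

```python
import cmath
import math

N = 16                   # samples per symbol (hypothetical)
SUBCARRIERS = [0, 1, 2]  # three active subcarriers (hypothetical indices)

# Look-up table: one stored waveform per subcarrier for the unit symbol 1+0j.
# A real design would store filtered waveforms; plain exponentials stand in here.
LUT = {k: [cmath.exp(2j * math.pi * k * n / N) for n in range(N)] for k in SUBCARRIERS}


def rotate(z, m):
    """Return z * j**m using only component swaps and sign changes."""
    m %= 4
    if m == 0:
        return z
    if m == 1:
        return complex(-z.imag, z.real)
    if m == 2:
        return complex(-z.real, -z.imag)
    return complex(z.imag, -z.real)


def lut_waveform(quadrants):
    """Build one symbol; quadrants[i] = m means subcarrier i carries j**m."""
    out = [0j] * N
    for k, m in zip(SUBCARRIERS, quadrants):
        for n in range(N):
            out[n] += rotate(LUT[k][n], m)  # additions only at run time
    return out


def direct_waveform(quadrants):
    """Reference: the same symbol synthesized with explicit multiplications."""
    return [
        sum((1j ** m) * cmath.exp(2j * math.pi * k * n / N)
            for k, m in zip(SUBCARRIERS, quadrants))
        for n in range(N)
    ]
```

At run time `lut_waveform` performs table reads, swaps, sign changes, and additions; the exponentials are evaluated once when the table is built.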

    A language and a system for program optimization

    Hardware complexity has increased over time, and as architectures evolve and new ones are adopted, programs must often be altered by numerous optimizations to attain maximum computing power on each target environment. As a result, the code becomes unrecognizable over time, hard to maintain, and challenging to modify. Furthermore, as the code evolves, it is hard to keep the optimizations up to date. The need to develop and maintain separate versions of the application for each target platform is an immense undertaking, especially for the large and long-lived applications commonly found in the high-performance computing (HPC) community. This dissertation presents Locus, a new system and language for optimizing complex, long-lived applications for different platforms. We describe the requirements that we believe are necessary for making automatic performance tuning widely adopted. We present the design and implementation of a system that fulfills these requirements. It includes a domain-specific language that can represent complex collections of transformations, an interface to integrate external modules, and a database to manage platform-specific efficient code. The database allows the system’s users to access optimized code without having to install the code generation toolset. The Locus language allows the definition of a search space combined with the programming of optimization sequences separated from the application’s reference code. Overall, we present an approach for performance portability. Our thesis is that we can ameliorate the difficulty of optimizing applications using a methodology based on optimization programming and automated empirical search. Our system automatically selects, generates, and executes candidate implementations to find the one with the best performance. We present examples to illustrate the power and simplicity of the language.
The experimental evaluation shows that exploring the space of candidate implementations typically leads to better performing codes than those produced by conventional compiler optimizations that are based solely on heuristics. Locus was able to generate a matrix-matrix multiplication code that outperformed the IBM XLC internal hand-optimized version by 2× on the Power 9 processors. On Intel E5, Locus generates code with performance comparable to Intel MKL’s. We also improve performance by up to 4× relative to the reference implementation on stencil computations. Locus’s ability to integrate complex search spaces with optimization sequences can result in very complicated optimization programs. The Locus compiler applies optimizations to remove unnecessary search statements from the optimization sequences, making the exploration for faster implementations more accessible. We optimize matrix transpose, matrix-matrix multiplication, fast Fourier transform, symmetric eigenproblem, and sparse matrix-vector multiplication through divide and conquer. We implement three strategies using the Locus language to create search spaces to find the best shapes of the base case and the best ways of subdividing the problem. The search space representation for the divide-and-conquer strategy uses a combination of recursion and OR blocks. The Locus compiler automatically expands the recursion and ensures that the search space is correctly represented. The results showed that the empirical search was important to improve performance by generating faster base cases and finding the best splitting. We also use Locus to optimize large, complex applications. We match the performance of hand-optimized kernels of the Kripke transport code for different input data layouts. The Plascom2 multi-physics application is optimized to find the best way to use a multi-core CPU and GPU.
The use of Tangram, Hydra, and OpenMP provided an interesting search space that improved performance by approximately 4.3× on ZAXPY and ZXDOTY kernels. Lastly, in a similar fashion to how a compiler works, we applied a search space representing a collection of optimization sequences to 856 loops extracted from 16 benchmarks, which resulted in good performance improvements.

    Frequency-Multiplexed Array Digitization for MIMO Receivers: 4-Antennas/ADC at 28 GHz on Xilinx ZCU-1285 RF SoC

    Communications at mm-wave frequencies and above rely heavily on beamforming antenna arrays. Typically, hundreds, if not thousands, of independent antenna channels are used to achieve high SNR for throughput and increased capacity. Using a dedicated ADC per antenna receiver is preferable, but it is not practical for very large arrays due to unreasonable cost and complexity. Frequency division multiplexing (FDM) is a well-known technique for combining multiple signals into a single wideband channel. In first-of-its-kind measurements, this paper explores FDM for combining multiple antenna outputs at IF into a single wideband signal that can be sampled and digitized using a high-speed wideband ADC. The sampled signals are sub-band filtered and digitally down-converted to obtain individual antenna channels. A prototype receiver was realized with a uniform linear array consisting of 4 elements with 250 MHz bandwidth per channel at 28 GHz carrier frequency. Each of the receiver chains was frequency-multiplexed at an intermediate frequency of 1 GHz to avoid the requirement for multiple, precise local oscillators (LOs). Combined narrowband receiver outputs were sampled using a single ADC with digital front-end operating on a Xilinx ZCU-1285 RF SoC FPGA to synthesize 4 digital beams. The approach allows an M-fold increase in spatial degrees of freedom per ADC, for temporal oversampling by a factor of M.
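
The multiplex-then-separate chain described in this abstract can be sketched in a few lines. This is a toy model under stated assumptions, not the paper's receiver: the sample rate `FS`, block length `N`, and the four IF offsets are hypothetical values chosen so every carrier completes a whole number of cycles in the block; each antenna output is reduced to a constant complex gain; and a plain average stands in for the sub-band filter.

```python
import cmath
import math

FS = 64.0                     # sample rate in arbitrary units (hypothetical)
N = 256                       # samples per block (hypothetical)
IFS = [4.0, 8.0, 12.0, 16.0]  # one IF offset per antenna channel (hypothetical)


def multiplex(gains):
    """Sum four channels, each a complex gain on its own IF carrier, into one real signal."""
    return [
        sum((g * cmath.exp(2j * math.pi * f * n / FS)).real for g, f in zip(gains, IFS))
        for n in range(N)
    ]


def down_convert(signal, f):
    """Mix with exp(-j*2*pi*f*n/FS) and average (a crude low-pass filter).

    The factor 2 restores the complex amplitude, since a real carrier splits
    its energy between the positive and negative frequency.
    """
    acc = sum(x * cmath.exp(-2j * math.pi * f * n / FS) for n, x in enumerate(signal))
    return 2.0 * acc / len(signal)


def demultiplex(signal):
    """Recover one complex gain per channel from the single digitized stream."""
    return [down_convert(signal, f) for f in IFS]
```

With these parameters the carriers are mutually orthogonal over the block, so the averaging step separates the channels cleanly; a practical receiver would need proper sub-band filters for signals with real bandwidth.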

    Datacenter Design for Future Cloud Radio Access Network.

    Cloud radio access network (C-RAN), an emerging cloud service that combines the traditional radio access network (RAN) with cloud computing technology, has been proposed as a solution to handle the growing energy consumption and cost of the traditional RAN. By aggregating baseband units (BBUs) in a centralized cloud datacenter, C-RAN reduces energy and cost, and improves wireless throughput and quality of service. However, designing a datacenter for C-RAN has not yet been studied. In this dissertation, I investigate how a datacenter for C-RAN BBUs should be built on commodity servers. I first design WiBench, an open-source benchmark suite containing the key signal processing kernels of many mainstream wireless protocols, and study its characteristics. The characterization study shows that there is abundant data level parallelism (DLP) and thread level parallelism (TLP). Based on this result, I then develop high performance software implementations of C-RAN BBU kernels in C++ and CUDA for both CPUs and GPUs. In addition, I generalize the GPU parallelization techniques of the Turbo decoder to the trellis algorithms, an important family of algorithms that are widely used in data compression and channel coding. Then I evaluate the performance of commodity CPU servers and GPU servers. The study shows that the datacenter with GPU servers can meet the LTE standard throughput with 4× to 16× fewer machines than with CPU servers. A further energy and cost analysis shows that GPU servers can save on average 13× more energy and 6× more cost. Thus, I propose that the C-RAN datacenter be built using GPUs as a server platform. Next I study resource management techniques to handle the temporal and spatial traffic imbalance in a C-RAN datacenter. I propose a “hill-climbing” power management that combines powering-off GPUs and DVFS to match the temporal C-RAN traffic pattern.
Under a practical traffic model, this technique saves 40% of the BBU energy in a GPU-based C-RAN datacenter. For spatial traffic imbalance, I propose three workload distribution techniques to improve load balance and throughput. Among all three techniques, pipelining packets yields the largest throughput improvement, 10% and 16% for balanced and unbalanced loads, respectively.
    PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/120825/1/qizheng_1.pd