Hybrid Machine Translation with Multi-Source Encoder-Decoder Long Short-Term Memory in English-Malay Translation
Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) are the state-of-the-art approaches in machine translation (MT). SMT produces translations based on the statistical analysis of text corpora, while NMT uses a deep neural network to model and generate translations. SMT and NMT each have their strengths and weaknesses. With a small parallel text corpus, SMT may produce better translations than NMT; when a large amount of parallel text is available, however, NMT often produces higher-quality translations than SMT. Studies have also shown that SMT outperforms NMT when there is a domain mismatch between the training and test data, and that SMT has an advantage on long sentences. In addition, when an NMT system produces a wrong translation, it is very difficult to locate the cause of the error. In this paper, we investigate a hybrid approach that combines SMT and NMT to perform English-to-Malay translation. The motivation for hybrid machine translation is to combine the strengths of both approaches to produce more accurate translations. Our approach uses a multi-source encoder-decoder long short-term memory (LSTM) architecture with two encoders: one embeds the sentence to be translated, and the other embeds the initial translation produced by SMT. The SMT output can thus be viewed as a “suggestion translation” for the neural MT. Our experiments show that the hybrid MT increases the BLEU scores of our best baseline machine translation system in the computer science and news domains from 21.21 and 48.35 to 35.97 and 61.81, respectively.
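The abstract does not specify how the outputs of the two encoders are merged before decoding. One common choice in multi-source encoder-decoder models is to concatenate the final hidden states of both encoders and pass them through a linear layer with a tanh nonlinearity to initialise the decoder. The pure-Python sketch below illustrates only that assumed fusion step: the weights are random, biases, attention and the decoder itself are omitted, random vectors stand in for real word embeddings, and the names `LSTMEncoder` and `combine_states` are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class LSTMEncoder:
    """Minimal single-layer LSTM (no biases) that returns its final state."""

    def __init__(self, input_dim, hidden_dim, seed):
        rng = random.Random(seed)
        cols = input_dim + hidden_dim
        # One random weight matrix per gate: input (i), forget (f), output (o), candidate (c).
        self.W = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(hidden_dim)]
            for gate in "ifoc"
        }
        self.hidden_dim = hidden_dim

    def encode(self, embeddings):
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in embeddings:
            xh = list(x) + h  # gate input is [x_t; h_{t-1}]
            i = [sigmoid(z) for z in matvec(self.W["i"], xh)]
            f = [sigmoid(z) for z in matvec(self.W["f"], xh)]
            o = [sigmoid(z) for z in matvec(self.W["o"], xh)]
            g = [math.tanh(z) for z in matvec(self.W["c"], xh)]
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return h, c


def combine_states(h_src, h_smt, W_comb):
    """Assumed fusion: tanh(W_comb [h_src; h_smt]) gives the initial decoder state."""
    return [math.tanh(z) for z in matvec(W_comb, h_src + h_smt)]


# Toy usage: random vectors stand in for embeddings of the English source
# sentence and of the SMT "suggestion translation" in Malay.
rng = random.Random(0)
EMB_DIM, HIDDEN = 4, 3
source_sentence = [[rng.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(5)]
smt_suggestion = [[rng.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(6)]

h_src, _ = LSTMEncoder(EMB_DIM, HIDDEN, seed=1).encode(source_sentence)
h_smt, _ = LSTMEncoder(EMB_DIM, HIDDEN, seed=2).encode(smt_suggestion)
W_comb = [[rng.uniform(-0.1, 0.1) for _ in range(2 * HIDDEN)] for _ in range(HIDDEN)]
decoder_init = combine_states(h_src, h_smt, W_comb)
print(decoder_init)
```

Because each source has its own encoder, the SMT suggestion may be shorter or longer than the input sentence (here 6 versus 5 tokens); only the fixed-size final states are fused.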
A Flexible BCH decoder for Flash Memory Systems using Cascaded BCH codes
NAND flash memories are widely used in consumer electronics such as tablets, personal computers, smartphones, and gaming systems. However, unlike other standard storage devices, these flash memories suffer from various random errors. To address these reliability issues, various error correction codes (ECC) are employed. The Bose-Chaudhuri-Hocquenghem (BCH) code is the most common ECC used to correct errors in modern flash memories. Because BCH codes are difficult to realize when more extensive error correction is required, modern flash memory devices use low-density parity-check (LDPC) codes as their error correction scheme. LDPC decoders are more complex to realize than BCH decoders, so these ECC decoders are implemented within the flash memory device. This thesis analyzes the limitations imposed by state-of-the-art implementations of BCH decoders and proposes a cascaded BCH code to address these limitations.
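A standard BCH decoder first computes the syndromes S_j = r(α^j), j = 1..2t, of the received word r(x); only if some syndrome is non-zero does it continue to the costlier Berlekamp-Massey and Chien search stages, so an error-free block can be accepted after syndrome computation alone. The sketch below assumes a toy binary BCH(15, 7) code with t = 2 over GF(2^4) built from the primitive polynomial x^4 + x + 1; the ECC blocks discussed in the thesis need much larger fields, and the function names are illustrative.

```python
# GF(2^4) arithmetic via an exponent table, primitive polynomial x^4 + x + 1.
M = 4
N = (1 << M) - 1          # code length n = 15
PRIM_POLY = 0b10011       # x^4 + x + 1

EXP = [0] * N             # EXP[i] = alpha^i as a 4-bit integer
value = 1
for i in range(N):
    EXP[i] = value
    value <<= 1
    if value & (1 << M):  # reduce modulo the primitive polynomial
        value ^= PRIM_POLY


def syndromes(received_bits, t):
    """Return S_j = r(alpha^j) for j = 1..2t; received_bits[i] is the coefficient of x^i."""
    result = []
    for j in range(1, 2 * t + 1):
        s = 0
        for i, bit in enumerate(received_bits):
            if bit:
                s ^= EXP[(i * j) % N]  # addition in GF(2^m) is XOR
        result.append(s)
    return result


def no_error_detected(received_bits, t):
    """Fast path: all-zero syndromes mean the word is already a codeword."""
    return all(s == 0 for s in syndromes(received_bits, t))


# Generator polynomial of the binary BCH(15, 7) code with t = 2:
# g(x) = 1 + x^4 + x^6 + x^7 + x^8, which is itself a valid codeword.
codeword = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(syndromes(codeword, t=2))           # [0, 0, 0, 0]

corrupted = codeword[:]
corrupted[2] ^= 1                         # single bit error at position 2
print(syndromes(corrupted, t=2))          # [4, 3, 12, 5], i.e. alpha^2, alpha^4, alpha^6, alpha^8
print(no_error_detected(corrupted, t=2))  # False
```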
To support a variety of flash memory devices, a BCH decoder must address three main challenges. First, the decoder latency in the no-error case should be less than 100 µs. Second, the decoder should flexibly support different ECC block sizes; specifically, it should support ECC blocks of 256, 512, 1024, and 2048 bytes. Third, it should flexibly support different numbers of correctable bit errors.
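The block-size and bit-error requirements interact through the field size: a binary BCH code over GF(2^m) has length at most 2^m - 1 and needs at most m·t parity bits to correct t errors. The sketch below computes the smallest m, and the resulting parity bound, for each of the four ECC block sizes, assuming shortened codes and an illustrative t = 8 that is not a value from the thesis. The helper names are hypothetical.

```python
def bch_field_degree(data_bits, t):
    """Smallest m such that a (shortened) binary BCH code over GF(2^m),
    whose length is at most 2^m - 1, fits data_bits data bits plus the
    worst-case m*t parity bits."""
    m = 2
    while (1 << m) - 1 < data_bits + m * t:
        m += 1
    return m


def parity_bits_bound(data_bits, t):
    """Upper bound on BCH parity length: n - k <= m * t."""
    return bch_field_degree(data_bits, t) * t


T = 8  # illustrative correction capability (assumption, not from the thesis)
for block_bytes in (256, 512, 1024, 2048):
    k = 8 * block_bytes
    print(block_bytes, bch_field_degree(k, T), parity_bits_bound(k, T))
# 256 12 96
# 512 13 104
# 1024 14 112
# 2048 15 120
```

Each doubling of the block size moves the code into a larger field, so a decoder serving all four sizes must either handle several field sizes or use the largest field with shortened codes at extra parity cost.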
Recent developments in Graphics Processing Units (GPUs) have led many researchers to use GPUs for non-graphics workloads. GPUs are found in many consumer electronics as part of a system-on-chip (SoC) configuration. In this thesis, we study the limitations imposed by different implementations (VLSI, GPU, and CPU) of BCH decoders, and we propose a cascaded BCH code implemented using a hybrid approach to overcome these limitations. By splitting the implementation across VLSI and GPUs, we show that this method provides flexibility in both the block size and the number of bit errors to be corrected.
Unified turbo/LDPC code decoder architecture for deep-space communications
Deep-space communications are characterized by extremely critical conditions; current standards foresee the use of both turbo and low-density parity-check (LDPC) codes to ensure recovery from received errors, but each of them has significant drawbacks. Code concatenation is widely used in all kinds of communication systems to boost the error correction capability of single codes; serial concatenation of turbo and LDPC codes has recently been proven effective for deep-space communications, as it overcomes the shortcomings of both code types. This work extends the performance analysis of this scheme and proposes a novel hardware decoder architecture for concatenated turbo and LDPC codes in which both codes are decoded with the same algorithm. This choice leads to a high degree of datapath and memory sharing; post-layout implementation results obtained in 90 nm complementary metal-oxide-semiconductor (CMOS) technology show small area occupation (0.98 mm²) and very low power consumption (2.1 mW).
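Serial concatenation feeds the output of an outer encoder into an inner encoder, and the receiver decodes in the reverse order; the overall code rate is the product of the component rates. The sketch below shows only this structure. A single-parity-check outer code and a rate-1/3 repetition inner code are placeholders standing in for the turbo and LDPC component codes, which it does not implement; the component ordering and all function names are assumptions for illustration.

```python
def spc_encode(bits):
    """Placeholder outer code: single parity check, appends the XOR of the bits."""
    parity = 0
    for b in bits:
        parity ^= b
    return bits + [parity]


def spc_check(bits):
    """True if the outer single-parity-check constraint is satisfied."""
    parity = 0
    for b in bits:
        parity ^= b
    return parity == 0


def rep3_encode(bits):
    """Placeholder inner code: repeat each bit three times (rate 1/3)."""
    return [b for b in bits for _ in range(3)]


def rep3_decode(bits):
    """Majority vote over each group of three received bits."""
    return [1 if sum(bits[i:i + 3]) >= 2 else 0 for i in range(0, len(bits), 3)]


def concat_encode(data):
    # Serial concatenation: the inner encoder protects the outer codeword.
    return rep3_encode(spc_encode(data))


def concat_decode(received):
    # Decode in reverse order: inner decoder first, then the outer check.
    outer = rep3_decode(received)
    return outer[:-1], spc_check(outer)


data = [1, 0, 1, 1]
sent = concat_encode(data)        # 4 data bits -> 5 outer bits -> 15 channel bits
print(len(data) / len(sent))      # overall rate = 4/15, i.e. (4/5) * (1/3)
received = sent[:]
received[4] ^= 1                  # one channel bit flipped
print(concat_decode(received))    # ([1, 0, 1, 1], True)
```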
NengoFPGA: an FPGA Backend for the Nengo Neural Simulator
Low-power, high-speed neural networks are critical for providing deployable embedded AI
applications at the edge. We describe a Xilinx FPGA implementation of Neural Engineering
Framework (NEF) networks with online learning that outperforms mobile Nvidia GPU
implementations by an order of magnitude or more. Specifically, we provide an embedded
Python-capable PYNQ FPGA implementation supported with a Xilinx Vivado High-Level
Synthesis (HLS) workflow that allows sub-millisecond implementation of adaptive neural
networks with low-latency, direct I/O access to the physical world. The outcome of this
work is NengoFPGA, a seamless and user-friendly extension to the neural compiler Python
package Nengo. To reduce memory requirements and improve performance, we tune the
precision of the different intermediate variables in the code to achieve competitive absolute
accuracy against slower and larger floating-point reference designs. The online learning
component of the neural network exploits immediate feedback to adjust the network weights
to best support a given arithmetic precision. As the space of possible design configurations
of such quantized networks is vast and is subject to a target accuracy constraint, we use
the Hyperopt hyperparameter tuning tool instead of manual search to find Pareto-optimal
designs. In particular, we generate the optimized designs in under 500 short
iterations of Vivado HLS C synthesis and only then run the complete Vivado place-and-route
phase on the resulting subset of designs, a much longer process not conducive to rapid exploration. For neural
network populations of 64–4096 neurons and 1–8 representational dimensions our optimized
FPGA implementation generated by Hyperopt has a speedup of 10–484× over a competing
cuBLAS implementation on the Jetson TX1 GPU while using 2.4–9.5× less power. Our
speedups are a result of HLS-specific reformulation (15× improvement), precision adaptation
(3× improvement), and low-latency direct I/O access (1000× improvement).
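Precision tuning and Pareto-optimal design selection can be illustrated together. The sketch below quantizes a set of hypothetical intermediate values to signed fixed point, measures the worst-case error for each candidate (integer bits, fractional bits) pair, keeps the designs that are Pareto-optimal in (cost, error), and picks the cheapest one meeting an accuracy target. It substitutes exhaustive enumeration for Hyperopt's search and uses word length as a crude stand-in for HLS resource and latency costs; the values, the target, and the function names are all illustrative assumptions.

```python
def quantize(x, frac_bits, total_bits):
    """Round x to signed fixed point (total_bits wide, frac_bits fractional),
    saturating at the representable range. Python's round() is round-half-to-even."""
    scale = 1 << frac_bits
    max_int = (1 << (total_bits - 1)) - 1
    min_int = -(1 << (total_bits - 1))
    return max(min_int, min(max_int, round(x * scale))) / scale


def max_abs_error(values, frac_bits, total_bits):
    return max(abs(v - quantize(v, frac_bits, total_bits)) for v in values)


def pareto_front(designs):
    """Keep designs that no other design beats on both cost and error."""
    def dominates(a, b):
        return (a["cost"] <= b["cost"] and a["error"] <= b["error"]
                and (a["cost"] < b["cost"] or a["error"] < b["error"]))
    return [d for d in designs if not any(dominates(o, d) for o in designs)]


# Hypothetical intermediate values of one network variable (not data from the paper).
values = [-2.7, -1.3, -0.41, 0.05, 0.9, 1.77, 2.95]

designs = []
for int_bits in (1, 2, 3, 4):          # integer bits, including the sign bit
    for frac_bits in (4, 8, 12):
        total = int_bits + frac_bits   # word length as a crude cost proxy (assumption)
        designs.append({"int_bits": int_bits, "frac_bits": frac_bits, "cost": total,
                        "error": max_abs_error(values, frac_bits, total)})

front = pareto_front(designs)
TARGET_ERROR = 0.01                    # hypothetical accuracy constraint
best = min((d for d in front if d["error"] <= TARGET_ERROR), key=lambda d: d["cost"])
print(best["int_bits"], best["frac_bits"])  # 3 8
```

Designs with 4 integer bits never reach the front: they match the error of the 3-integer-bit designs at one extra bit of cost, which is the kind of wasted precision that tuning variable widths aims to remove.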