104 research outputs found
Heap-based Algorithms to Accelerate Fingerprint Matching on Parallel Platforms
Fingerprints are currently the most widely used biometric trait for identifying individuals. State-of-the-art algorithms in this area are very accurate, but they need to be accelerated when the database contains millions of identities. Among these algorithms, Minutia Cylinder-Code (MCC) stands out for its accuracy; however, it is computationally expensive. In this work, we propose using two different parallel platforms to accelerate the MCC fingerprint matching process: (1) a multi-core server, and (2) a Xeon Phi coprocessor. Our proposal uses heaps as an auxiliary structure to compute the global similarity of MCC. Because heap-based algorithms are exhaustive (all the elements are accessed), we also explored the use of an indexing algorithm to avoid comparing the query against every fingerprint in the database. Experimental results show a speed-up of up to 97.15x, which is competitive with other state-of-the-art algorithms on GPU and FPGA. To the best of our knowledge, this is the first work on fingerprint identification using a Xeon Phi coprocessor. Instituto de Investigación en Informátic
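The role a heap can play in a global similarity computation can be illustrated with a minimal sketch. It assumes the global score is the mean of the k highest local similarity values, which is an assumption made here for illustration; the abstract does not state the consolidation rule. The function name `top_k_mean` and the parameter `k` are hypothetical.

```python
import heapq


def top_k_mean(scores, k):
    """Mean of the k largest values in `scores`, using a bounded min-heap.

    Assumption (not from the abstract): the global similarity is the
    average of the k best local similarities.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    heap = []  # min-heap holding the k largest scores seen so far
    for s in scores:
        if len(heap) < k:
            heapq.heappush(heap, s)
        elif s > heap[0]:
            # s beats the smallest retained score: swap it in
            heapq.heapreplace(heap, s)
    if not heap:
        return 0.0
    return sum(heap) / len(heap)


if __name__ == "__main__":
    local_similarities = [0.25, 1.0, 0.5, 0.75]
    print(top_k_mean(local_similarities, k=2))
```

Every element is visited once, which matches the abstract's remark that heap-based algorithms are exhaustive; the heap only bounds the memory and the per-element cost to O(log k).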
Co-design Hardware and Algorithm for Vector Search
Vector search has emerged as the foundation for large-scale information
retrieval and machine learning systems, with search engines like Google and
Bing processing tens of thousands of queries per second on petabyte-scale
document datasets by evaluating vector similarities between encoded query texts
and web documents. As performance demands for vector search systems surge,
accelerated hardware offers a promising solution in the post-Moore's Law era.
We introduce FANNS, an end-to-end and scalable vector search framework
on FPGAs. Given a user-provided recall requirement on a dataset and a hardware
resource budget, FANNS automatically co-designs hardware and
algorithm, subsequently generating the corresponding accelerator. The framework
also supports scale-out by incorporating a hardware TCP/IP stack in the
accelerator. FANNS attains up to 23.0× and 37.2× speedup
compared to FPGA and CPU baselines, respectively, and demonstrates superior
scalability to GPUs, achieving 5.5× and 7.6× speedup in median
and 95th percentile (P95) latency within an eight-accelerator
configuration. The remarkable performance of FANNS lays a robust
groundwork for future FPGA integration in data centers and AI supercomputers.
Comment: 11 page
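The "recall requirement" mentioned in this abstract can be made concrete with a small sketch: recall at k compares an approximate result list against the exact nearest neighbours found by brute force. This is a generic illustration and not part of FANNS; the function names `exact_top_k` and `recall_at_k`, and the choice of squared L2 distance, are assumptions.

```python
import heapq


def exact_top_k(query, vectors, k):
    """Indices of the k vectors closest to `query` by squared L2 distance."""
    def sq_dist(v):
        return sum((a - b) ** 2 for a, b in zip(query, v))

    # nsmallest returns indices ordered from nearest to farthest
    return heapq.nsmallest(k, range(len(vectors)), key=lambda i: sq_dist(vectors[i]))


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate result recovered."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


if __name__ == "__main__":
    database = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]]
    truth = exact_top_k([0.0, 0.0], database, k=2)
    # Pretend an approximate index returned ids 0 and 1.
    print(truth, recall_at_k([0, 1], truth))
```

An approximate index trades some of this recall for speed, which is why a framework of this kind would take the target recall as an input.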
Exploiting multiple levels of parallelism of Convergent Cross Mapping
Identifying causal relationships between variables remains an essential problem across various scientific fields. Such identification is particularly important but challenging in complex systems, such as those involving human behaviour, sociotechnical contexts, and natural ecosystems. By exploiting state space reconstruction via lagged embeddings of time series, convergent cross mapping (CCM) serves as an important method for addressing this problem. While powerful, CCM is computationally costly; moreover, CCM results are highly sensitive to several parameter values. Current best practice involves performing a systematic search over a range of parameters, but this results in a high computational burden, which raises barriers to practical use. In light of both these challenges and the growing size of commonly encountered datasets from complex systems, inferring causality with confidence using CCM in a reasonable time becomes a major challenge.
In this thesis, I investigate the performance associated with a variety of parallel techniques (including CUDA, Thrust, OpenMP, MPI, and Spark) to accelerate convergent cross mapping. The performance of each method was collected and compared across multiple experiments to further evaluate potential bottlenecks. Moreover, the work deployed and tested combinations of these techniques to more thoroughly exploit available computation resources. The results obtained from these experiments indicate that GPUs can only accelerate the CCM algorithm under certain circumstances and requirements. Otherwise, the overhead of data transfer and communication can become the limiting bottleneck. On the other hand, in cluster computing, the MPI/OpenMP framework outperforms the Spark framework by more than one order of magnitude in terms of processing speed and provides more consistent performance for distributed computing. This also reflects the large size of the output from the CCM algorithm. However, Spark shows better cluster infrastructure management, ease of software engineering, and readier handling of other aspects, such as node failure and data replication. Furthermore, combinations of GPU and cluster frameworks are deployed and compared in GPU/CPU clusters. An apparent speedup can be achieved in the Spark framework, while extra time cost is incurred in the MPI/OpenMP framework. The underlying reason is that the code complexity imposed by GPU utilization cannot be readily offset in the MPI/OpenMP framework. Overall, the experimental results on parallelized solutions have demonstrated a capacity for over an order of magnitude performance improvement when compared with the widely used rEDM library. Such economies in computation time can speed learning and robust identification of causal drivers in complex systems.
I conclude that these parallel techniques can achieve significant improvements. However, the performance gain varies among techniques and frameworks. Although GPUs can accelerate the application, there are still constraints that must be taken into consideration, especially with regard to the input data scale. Without proper usage, GPUs can even increase the overall execution time. Convergent cross mapping achieves its maximum speedup with the MPI/OpenMP framework, which is well suited to computation-intensive algorithms. By contrast, the Spark framework with integrated GPU accelerators still offers a lower execution cost than the pure Spark version, and it mainly fits data-intensive problems.
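The lagged-embedding step that CCM relies on can be sketched in a few lines. The code below is a simplified, single-threaded illustration and not the thesis's implementation: it builds delay vectors from one series, finds the E + 1 nearest neighbours of each point, and uses exponentially weighted neighbour values to estimate the other series. The function names, the default `tau`, and the weighting details are assumptions, and the convergence check over increasing library sizes is omitted.

```python
import math


def lagged_embedding(series, E, tau=1):
    """Delay vectors [x(t), x(t - tau), ..., x(t - (E - 1) * tau)] with their times."""
    start = (E - 1) * tau
    times = list(range(start, len(series)))
    vectors = [[series[t - j * tau] for j in range(E)] for t in times]
    return times, vectors


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def cross_map(source, target, E, tau=1):
    """Estimate `target` from the shadow manifold of `source`; return the skill (rho)."""
    times, vectors = lagged_embedding(source, E, tau)
    estimates, actual = [], []
    for i, v in enumerate(vectors):
        # All-pairs distance search: every other manifold point is a candidate.
        dists = sorted(
            (math.dist(v, w), times[j]) for j, w in enumerate(vectors) if j != i
        )[: E + 1]
        d_min = max(dists[0][0], 1e-12)  # guard against division by zero
        weights = [math.exp(-d / d_min) for d, _ in dists]
        total = sum(weights)
        estimates.append(sum(w * target[t] for w, (_, t) in zip(weights, dists)) / total)
        actual.append(target[times[i]])
    return pearson(estimates, actual)


if __name__ == "__main__":
    x = [math.sin(0.3 * t) for t in range(60)]
    y = [2.0 * v + 1.0 for v in x]
    print(cross_map(x, y, E=2))
```

The nearest-neighbour search is quadratic in the series length and has to be repeated for each choice of `E` and `tau`, which gives a sense of why a systematic parameter search becomes expensive.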
Efficient Learning in Heterogeneous Internet of Things Ecosystems
The Internet of Things (IoT) is a growing network of heterogeneous devices, combining various sensing and computing nodes at different scales, which creates a large volume of data. Many IoT applications use machine learning (ML) algorithms to analyze the data. The high computational complexity of ML workloads poses significant challenges to IoT computing platforms, which tend to be less powerful, resource-constrained devices. Transmitting such large volumes of data to the cloud also raises issues such as scalability, security, and privacy. In this dissertation, we propose efficient solutions to perform ML tasks while decreasing power consumption and improving performance. We first leverage the heterogeneous and interconnected nature of IoT systems, where IoT applications run on many different architectures (e.g., x86 server or ARM-based edge device) while communicating with each other. We present a cross-platform power and performance prediction technique for intelligent task allocation. The proposed technique estimates the time-variant energy consumption with only 7% error across completely different architectures, enabling intelligent task allocation that reduces energy consumption by 16.5% for state-of-the-art ML workloads. We next show how to further advance the learning procedures towards real-time and online processing by distributing such learning tasks onto the hierarchy of IoT devices. Our solution leverages brain-inspired high-dimensional (HD) computing to derive a new class of learning algorithms that can easily run on IoT devices, while providing high accuracy comparable to the state of the art. We show that the HD-based learning algorithms can cover various real-world problems, from conventional classification to other cognitive tasks beyond classical ML, such as DNA pattern matching.
We demonstrate that HD-based learning can enable secure, collaborative learning by efficiently distributing a large volume of learning tasks across heterogeneous computing nodes. We have implemented the proposed learning solution on various platforms while offering superior computing efficiency. For example, our solution achieves 486× and 7× performance improvements for the training and inference phases, respectively, on a low-power ARM processor, as compared to state-of-the-art deep learning
- …
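The general idea behind HD-based classification can be sketched as follows. This is a minimal illustration and not the dissertation's algorithm: each feature position and each quantized feature value gets a random bipolar hypervector, a sample is encoded by binding (elementwise product) and bundling (summation), and a class prototype is the sum of its encoded training samples. The dimensionality `D`, the class name `HDClassifier`, and the use of independent random level vectors are all assumptions.

```python
import math
import random

D = 2000  # hypervector dimensionality (assumed value for illustration)


def random_hv(rng):
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(D)]


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


class HDClassifier:
    def __init__(self, n_features, n_levels, seed=0):
        rng = random.Random(seed)
        self.ids = [random_hv(rng) for _ in range(n_features)]   # one per feature position
        self.levels = [random_hv(rng) for _ in range(n_levels)]  # one per quantized value
        self.prototypes = {}  # label -> accumulated class hypervector

    def encode(self, sample):
        """Bind each position ID with its level hypervector, then bundle by summation."""
        acc = [0] * D
        for pos, level in enumerate(sample):
            id_hv, lvl_hv = self.ids[pos], self.levels[level]
            for d in range(D):
                acc[d] += id_hv[d] * lvl_hv[d]
        return acc

    def train(self, sample, label):
        """Add the encoded sample to its class prototype (single-pass training)."""
        enc = self.encode(sample)
        proto = self.prototypes.setdefault(label, [0] * D)
        for d in range(D):
            proto[d] += enc[d]

    def predict(self, sample):
        """Return the label whose prototype is most similar to the encoded sample."""
        enc = self.encode(sample)
        return max(self.prototypes, key=lambda c: cosine(enc, self.prototypes[c]))


if __name__ == "__main__":
    clf = HDClassifier(n_features=4, n_levels=3)
    clf.train([0, 0, 1, 1], "a")
    clf.train([2, 2, 0, 0], "b")
    print(clf.predict([0, 0, 1, 2]))
```

Training and inference use only additions, sign flips, and a similarity comparison, which is one reason this style of learning is often described as a fit for low-power devices.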