82 research outputs found

    Bio-inspired computation for big data fusion, storage, processing, learning and visualization: state of the art and future directions

    This overview centers on research achievements that have recently emerged from the confluence of Big Data technologies and bio-inspired computation. Several reasons can be identified for the profitable synergy between these two paradigms, all rooted in the adaptability, intelligence and robustness that biologically inspired principles can provide to technologies aimed at managing, retrieving, fusing and processing Big Data efficiently. We delve into this research field by first analyzing the existing literature in depth, with a focus on advances reported in the last few years. This literature analysis is complemented by an identification of the new trends and open challenges in Big Data that remain unsolved to date and that can be effectively addressed by bio-inspired algorithms. As a second contribution, this work elaborates on how bio-inspired algorithms need to be adapted for use in a Big Data context, in which data fusion becomes crucial as a preliminary step that allows several, potentially heterogeneous data sources to be processed and mined. This analysis allows exploring and comparing the scope and efficiency of existing approaches across different problems and domains, with the purpose of identifying new potential applications and research niches. Finally, this survey highlights open issues that remain unsolved to date in this research avenue, alongside a set of recommendations for future research. This work has received funding support from the Basque Government (Eusko Jaurlaritza) through the Consolidated Research Group MATHMODE (IT1294-19), EMAITEK and ELKARTEK programs. D. Camacho also acknowledges support from the Spanish Ministry of Science and Education under PID2020-117263GB-100 grant (FightDIS), the Comunidad Autonoma de Madrid under S2018/TCS-4566 grant (CYNAMON), and the CHIST ERA 2017 BDSI PACMEL Project (PCI2019-103623, Spain).

    Easily parallelizable statistical computing methods and their application to modern high-performance computing environments

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, Department of Statistics, 2020. 8. Joong-Ho Won.
    Technological advances in the past decade, hardware and software alike, have made access to high-performance computing (HPC) easier than ever. In this dissertation, easily parallelizable, inversion-free, and variable-separated algorithms and their implementation in statistical computing are discussed. The first part considers statistical estimation problems under structured sparsity posed as minimization of a sum of two or three convex functions, one of which is a composition of non-smooth and linear functions. Examples include graph-guided sparse fused lasso and overlapping group lasso. Two classes of inversion-free primal-dual algorithms are considered and unified from the perspective of monotone operator theory. From this unification, a continuum of preconditioned forward-backward operator splitting algorithms amenable to parallel and distributed computing is proposed. The unification is further exploited to introduce a continuum of accelerated algorithms for which the theoretically optimal asymptotic rate of convergence is obtained. For the second part, easy-to-use distributed matrix data structures in PyTorch and Julia are presented. They enable users to write code once and run it anywhere from a laptop to a workstation with multiple graphics processing units (GPUs) or a supercomputer in a cloud. With these data structures, various parallelizable statistical applications, including nonnegative matrix factorization, positron emission tomography, multidimensional scaling, and ℓ1-regularized Cox regression, are demonstrated. The examples scale up to an 8-GPU workstation and a 720-CPU-core cluster in a cloud. As a case in point, the onset of type-2 diabetes from the UK Biobank with 400,000 subjects and about 500,000 single nucleotide polymorphisms is analyzed using the HPC ℓ1-regularized Cox regression. Fitting a half-million-variate model took about 50 minutes, reconfirming known associations. To my knowledge, the feasibility of a joint genome-wide association analysis of survival outcomes at this scale is demonstrated here for the first time.
    Contents:
    Chapter 1 Prologue: 1.1 Introduction; 1.2 Accessible High-Performance Computing Systems (1.2.1 Preliminaries; 1.2.2 Multiple CPU nodes: clusters, supercomputers, and clouds; 1.2.3 Multi-GPU node); 1.3 Highly Parallelizable Algorithms (1.3.1 MM algorithms; 1.3.2 Proximal gradient descent; 1.3.3 Proximal distance algorithm; 1.3.4 Primal-dual methods)
    Chapter 2 Easily Parallelizable and Distributable Class of Algorithms for Structured Sparsity, with Optimal Acceleration: 2.1 Introduction; 2.2 Unification of Algorithms LV and CV (g ≡ 0) (2.2.1 Relation between Algorithms LV and CV; 2.2.2 Unified algorithm class; 2.2.3 Convergence analysis); 2.3 Optimal acceleration (2.3.1 Algorithms; 2.3.2 Convergence analysis); 2.4 Stochastic optimal acceleration (2.4.1 Algorithm; 2.4.2 Convergence analysis); 2.5 Numerical experiments (2.5.1 Model problems; 2.5.2 Convergence behavior; 2.5.3 Scalability); 2.6 Discussion
    Chapter 3 Towards Unified Programming for High-Performance Statistical Computing Environments: 3.1 Introduction; 3.2 Related Software (3.2.1 Message-passing interface and distributed array interfaces; 3.2.2 Unified array interfaces for CPU and GPU); 3.3 Easy-to-use Software Libraries for HPC (3.3.1 Deep learning libraries and HPC; 3.3.2 Case study: PyTorch versus TensorFlow; 3.3.3 A brief introduction to PyTorch; 3.3.4 A brief introduction to Julia; 3.3.5 Methods and multiple dispatch; 3.3.6 Multidimensional arrays; 3.3.7 Matrix multiplication; 3.3.8 Dot syntax for vectorization); 3.4 Distributed matrix data structure (3.4.1 Distributed matrices in PyTorch: distmat; 3.4.2 Distributed arrays in Julia: MPIArray); 3.5 Examples (3.5.1 Nonnegative matrix factorization; 3.5.2 Positron emission tomography; 3.5.3 Multidimensional scaling; 3.5.4 L1-regularized Cox regression; 3.5.5 Genome-wide survival analysis of the UK Biobank dataset); 3.6 Discussion
    Chapter 4 Conclusion
    Appendix A Monotone Operator Theory
    Appendix B Proofs for Chapter II: B.1 Preconditioned forward-backward splitting; B.2 Optimal acceleration; B.3 Optimal stochastic acceleration
    Appendix C AWS EC2 and ParallelCluster: C.1 Overview; C.2 Glossary; C.3 Prerequisites; C.4 Installation; C.5 Configuration; C.6 Creating, accessing, and destroying the cluster; C.7 Installation of libraries; C.8 Running a job; C.9 Miscellaneous
    Appendix D Code for memory-efficient L1-regularized Cox proportional hazards model
    Appendix E Details of SNPs selected in L1-regularized Cox regression
    Bibliography
    Abstract in Korean
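The abstract above refers to forward-backward operator splitting for ℓ1-regularized problems. As a rough illustration only, the sketch below applies the plain (unpreconditioned, unaccelerated) forward-backward iteration to a small lasso problem in pure Python. It is not the dissertation's primal-dual algorithm or its distributed implementation; the function names `soft_threshold` and `lasso_forward_backward`, the step size, and the toy data are all assumptions made for this example.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.| applied to a scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def lasso_forward_backward(A, b, lam, step, iters=500):
    """Minimise 0.5 * ||A x - b||^2 + lam * ||x||_1 (hypothetical helper).

    `step` is assumed to be at most 1 / (largest eigenvalue of A^T A).
    """
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    for _ in range(iters):
        # Residual r = A x - b.
        r = [sum(A[i][j] * x[j] for j in range(n_cols)) - b[i]
             for i in range(n_rows)]
        # Forward (gradient) step on the smooth term: grad = A^T r.
        grad = [sum(A[i][j] * r[i] for i in range(n_rows))
                for j in range(n_cols)]
        # Backward (proximal) step on the l1 term. Every coordinate is
        # updated independently, which is what makes the step easy to
        # parallelise and free of matrix inversions.
        x = [soft_threshold(x[j] - step * grad[j], step * lam)
             for j in range(n_cols)]
    return x


# Toy problem with an identity design matrix (illustrative data only).
A = [[1.0, 0.0], [0.0, 1.0]]
b = [3.0, 0.5]
x_hat = lasso_forward_backward(A, b, lam=1.0, step=1.0)
```

With an identity design matrix the minimiser is simply the soft-thresholded response, so the small second coefficient is set exactly to zero, which is the sparsity effect the ℓ1 penalty is meant to produce.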

    MapReduce network enabled algorithms for classification based on association rules

    There is growing evidence that integrating classification and association rule mining can produce more efficient and accurate classifiers than traditional techniques. This thesis introduces a new MapReduce-based association rule miner for extracting strong rules from large datasets. The miner is later used to develop a new large-scale classifier. A new MapReduce simulator was also developed to evaluate the scalability of the proposed algorithms on MapReduce clusters. The association rule miner inherits MapReduce's scalability to huge datasets and to thousands of processing nodes. To find frequent itemsets, it uses a hybrid of miners that use counting methods on horizontal datasets and miners that use set intersections on datasets in vertical format. The new miner generates the same rules that Apriori-like algorithms usually generate, because it uses the same definitions of the confidence and support thresholds. In the last few years, a number of associative classification algorithms have been proposed, e.g. CPAR, CMAR, MCAR and MMAC. This thesis also introduces a new MapReduce classifier based on MapReduce association rule mining. The algorithm employs different approaches to rule discovery, rule ranking, rule pruning, rule prediction and rule evaluation. The new classifier works on multi-class datasets and can produce multi-label predictions with a probability for each predicted label. To evaluate the classifier, 20 different datasets from the UCI data collection were used. Results show that the proposed approach is an accurate and effective classification technique that is highly competitive and scalable compared with other traditional and associative classification approaches. The MapReduce simulator measures the scalability of MapReduce-based applications easily and quickly and captures the behaviour of algorithms in cluster environments. It also allows the configuration of MapReduce clusters to be optimized for better execution times and hardware utilization.
    EThOS - Electronic Theses Online Service, GB, United Kingdo
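The abstract above relies on the standard support and confidence thresholds for association rules. The following single-process sketch shows one way those two quantities could be computed with a map step and a reduce step. It only counts itemsets on horizontal (transaction-per-row) data and is not the thesis's hybrid miner; the function names `map_phase`, `reduce_phase` and `rule_metrics` and the shopping-basket data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def map_phase(transaction, max_size=2):
    """Emit (itemset, 1) for every itemset of up to max_size items."""
    items = sorted(set(transaction))
    for size in range(1, max_size + 1):
        for itemset in combinations(items, size):
            yield itemset, 1


def reduce_phase(pairs):
    """Sum the counts emitted for each itemset."""
    counts = Counter()
    for itemset, one in pairs:
        counts[itemset] += one
    return counts


def rule_metrics(counts, n_transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = tuple(sorted(set(antecedent) | set(consequent)))
    support = counts[both] / n_transactions
    confidence = counts[both] / counts[tuple(sorted(antecedent))]
    return support, confidence


# Illustrative data only; on a real cluster each mapper would receive
# a separate shard of the transactions.
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
pairs = (pair for t in transactions for pair in map_phase(t))
counts = reduce_phase(pairs)
support, confidence = rule_metrics(
    counts, len(transactions), ("bread",), ("milk",)
)
```

Here the rule bread -> milk holds in 2 of the 4 transactions (support 0.5) and in 2 of the 3 transactions containing bread (confidence 2/3). A rule would be kept as "strong" only if both values exceed user-chosen thresholds.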

    Functional programming languages in computing clouds: practical and theoretical explorations

    Cloud platforms must integrate three pillars: messaging, coordination of workers, and data. This research investigates whether functional programming languages have any special merit for implementing cloud computing platforms. This thesis presents the lightweight message queue CMQ and the DSL CWMWL for the coordination of workers, which we use as artefacts to prove or disprove the special merit of functional programming languages in computing clouds. We detail their design and implementation with the broad aim of matching the notions and requirements of computing clouds. We assess these aims against evaluation criteria derived from a series of comprehensive rationales and specifics that allow the functional programming language (FPL) Haskell to be thoroughly analysed. We find that Haskell is excellent for use cases that do not require the application to be distributed across the boundaries of (physical or virtual) systems, but is not appropriate as a whole for developing distributed cloud-based workloads that require communication with the far side and coordination of decoupled workloads. However, Haskell may qualify as a suitable vehicle in the future, given further development of formal mechanisms that embrace non-determinism in the underlying distributed environments, leading to applications that are anti-fragile rather than applications that insist on strict determinism, which can only be guaranteed on the local system or via slow blocking communication mechanisms.

    CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap

    After addressing the state of the art during the first year of Chorus and establishing the existing landscape in multimedia search engines, we identified and analyzed gaps within the European research effort during our second year. In this period we focused on three directions: technological issues, user-centred issues and use cases, and socio-economic and legal aspects. These were assessed by two central studies: first, a concerted vision of the functional breakdown of a generic multimedia search engine, and second, representative use-case descriptions with the related discussion of requirements for technological challenges. Both studies were carried out in cooperation and consultation with the community at large through EC concertation meetings (multimedia search engines cluster), several meetings with our Think-Tank, presentations at international conferences, and surveys addressed to coordinators of EU projects as well as coordinators of national initiatives. Based on the feedback obtained, we identified two types of gaps: core technological gaps that involve research challenges, and "enablers", which are not necessarily technical research challenges but have an impact on innovation progress. New socio-economic trends are presented, as well as emerging legal challenges.

    European Language Grid

    This open access book provides an in-depth description of the EU project European Language Grid (ELG). Its motivation lies in the fact that Europe is a multilingual society with 24 official European Union Member State languages and dozens of additional languages including regional and minority languages. The only meaningful way to enable multilingualism and to benefit from this rich linguistic heritage is through Language Technologies (LT) including Natural Language Processing (NLP), Natural Language Understanding (NLU), Speech Technologies and language-centric Artificial Intelligence (AI) applications. The European Language Grid provides a single umbrella platform for the European LT community, including research and industry, effectively functioning as a virtual home, marketplace, showroom, and deployment centre for all services, tools, resources, products and organisations active in the field. Today the ELG cloud platform already offers access to more than 13,000 language processing tools and language resources. It enables all stakeholders to deposit, upload and deploy their technologies and datasets. The platform also supports the long-term objective of establishing digital language equality in Europe by 2030, creating a situation in which all European languages enjoy equal technological support. This is the very first book dedicated to Language Technology and NLP platforms. Cloud technology has only recently matured enough to make the development of a platform like ELG feasible on a larger scale. The book comprehensively describes the results of the ELG project. Following an introduction, the content is divided into four main parts: (I) ELG Cloud Platform; (II) ELG Inventory of Technologies and Resources; (III) ELG Community and Initiative; and (IV) ELG Open Calls and Pilot Projects.

    Building models from multiple point sets with kernel density estimation

    One of the fundamental problems in computer vision is point set registration. Point set registration finds use in many important applications and in particular can be considered one of the crucial stages involved in the reconstruction of models of physical objects and environments from depth sensor data. The problem of globally aligning multiple point sets, representing spatial shape measurements from varying sensor viewpoints, into a common frame of reference is a complex task that is imperative due to the large number of critical functions that accurate and reliable model reconstructions contribute to. In this thesis we focus on improving the quality and feasibility of model and environment reconstruction through the enhancement of multi-view point set registration techniques. The thesis makes the following contributions: First, we demonstrate that employing kernel density estimation to reason about the unknown generating surfaces that range sensors measure allows us to express measurement variability and uncertainty, and also to separate the problems of model design and viewpoint alignment optimisation. Our surface estimates define novel view alignment objective functions that inform the registration process. Our surfaces can be estimated from point clouds in a data-driven fashion. Through experiments on a variety of datasets we demonstrate that we have developed a novel and effective solution to the simultaneous multi-view registration problem. We then focus on constructing a distributed computation framework capable of solving generic high-throughput computational problems. We present a novel task-farming model that we call Semi-Synchronised Task Farming (SSTF), capable of modelling and subsequently solving computationally distributable problems that benefit from both independent and dependent distributed components and a level of communication between process elements.
We demonstrate that this framework is a novel schema for parallel computer vision algorithms and evaluate the performance to establish computational gains over serial implementations. We couple this framework with an accurate computation-time prediction model to contribute a novel structure appropriate for addressing expensive real-world algorithms with substantial parallel performance and predictable time savings. Finally, we focus on a timely instance of the multi-view registration problem: modern range sensors provide large numbers of viewpoint samples that result in an abundance of depth data information. The ability to utilise this abundance of depth data in a feasible and principled fashion is of importance to many emerging application areas making use of spatial information. We develop novel methodology for the registration of depth measurements acquired from many viewpoints capturing physical object surfaces. By defining registration and alignment quality metrics based on our density estimation framework we construct an optimisation methodology that implicitly considers all viewpoints simultaneously. We use a non-parametric data-driven approach to consider varying object complexity and guide large view-set spatial transform optimisations. By aligning large numbers of partial, arbitrary-pose views we evaluate this strategy quantitatively on large view-set range sensor data where we find that we can improve registration accuracy over existing methods and contribute increased registration robustness to the magnitude of coarse seed alignment. This allows large-scale registration on problem instances exhibiting varying object complexity with the added advantage of massive parallel efficiency
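The abstract above builds registration objectives on kernel density estimates. As a deliberately simplified illustration, the sketch below uses a one-dimensional Gaussian kernel density estimate of one view to score candidate translations of a second view. The thesis works with 3-D range data, rigid transforms and many views at once; the names `gaussian_kde` and `alignment_score`, the bandwidth, and the toy points here are assumptions, not the author's method.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of 1-D samples."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of Gaussian kernels centred on each sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def alignment_score(reference, moving, shift, bandwidth):
    """Sum of the reference density at the moving points after shifting."""
    density = gaussian_kde(reference, bandwidth)
    return sum(density(p - shift) for p in moving)


# Two toy "views" of the same shape; view_b is view_a offset by 0.5.
view_a = [0.0, 1.0, 2.0]
view_b = [0.5, 1.5, 2.5]
candidate_shifts = [0.0, 0.25, 0.5, 0.75, 1.0]
best_shift = max(
    candidate_shifts,
    key=lambda s: alignment_score(view_a, view_b, s, bandwidth=0.3),
)
```

Among the candidates, the score is highest at the shift that places the second view exactly on the first. Each candidate is scored independently of the others, so the evaluations could be farmed out to separate workers, which is the kind of parallelism the abstract has in mind.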