85 research outputs found

    Review of the Use of Cloud and Virtualization Technologies in Grid Infrastructures

    This document describes the efforts of the StratusLab project to better understand its target communities, to gauge their experience with cloud technologies, to validate the defined use cases, and to extract relevant requirements from the communities. In parallel, the exercise was used as a dissemination tool to inform people about existing software packages, to increase awareness of StratusLab, and to expand our contacts within our target communities. The project created, distributed, and analyzed two surveys to achieve these goals. They validate the defined use cases and provide detailed requirements. One critical issue identified relates to system administrators' reluctance to allow users to run their own virtual machines on the infrastructure. The project must define the criteria for trusting such images and provide sufficient sandboxing to avoid threats to other machines and services.

    Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries

    Graph processing has become an important part of multiple areas of computer science, such as machine learning, computational sciences, medical applications, social network analysis, and many others. Numerous graphs such as web or social networks may contain up to trillions of edges. Often, these graphs are also dynamic (their structure changes over time) and have domain-specific rich data associated with vertices and edges. Graph database systems such as Neo4j enable storing, processing, and analyzing such large, evolving, and rich datasets. Due to the sheer size of such datasets, combined with the irregular nature of graph processing, these systems face unique design challenges. To facilitate the understanding of this emerging domain, we present the first survey and taxonomy of graph database systems. We focus on identifying and analyzing fundamental categories of these systems (e.g., triple stores, tuple stores, native graph database systems, or object-oriented systems), the associated graph models (e.g., RDF or Labeled Property Graph), data organization techniques (e.g., storing graph data in indexing structures or dividing data into records), and different aspects of data distribution and query execution (e.g., support for sharding and ACID). 51 graph database systems are presented and compared, including Neo4j, OrientDB, or Virtuoso. We outline graph database queries and relationships with associated domains (NoSQL stores, graph streaming, and dynamic graph algorithms). Finally, we describe research and engineering challenges to outline the future of graph databases.

    Adaptive Asynchronous Control and Consistency in Distributed Data Exploration Systems

    Advances in machine learning and streaming systems provide a backbone to transform vast arrays of raw data into valuable information. Leveraging distributed execution, analysis engines can process this information effectively within an iterative data exploration workflow to solve problems at unprecedented rates. However, with increased input dimensionality, a desire to simultaneously share and isolate information, as well as overlapping and dependent tasks, this process is becoming increasingly difficult to maintain. User interaction derails exploratory progress due to manual oversight on lower level tasks such as tuning parameters, adjusting filters, and monitoring queries. We identify human-in-the-loop management of data generation and distributed analysis as an inhibiting problem precluding efficient online, iterative data exploration, which causes delays in knowledge discovery and decision making. The flexible and scalable systems implementing the exploration workflow require semi-autonomous methods integrated as architectural support to reduce human involvement. We thus argue that an abstraction layer providing adaptive asynchronous control and consistency management over a series of individual tasks coordinated to achieve a global objective can significantly improve data exploration effectiveness and efficiency. This thesis introduces methodologies which autonomously coordinate distributed execution at a lower level in order to synchronize multiple efforts as part of a common goal. We demonstrate the impact on data exploration through serverless simulation ensemble management and multi-model machine learning by showing improved performance and reduced resource utilization, enabling a more productive semi-autonomous exploration workflow. We focus on molecular dynamics and personalized healthcare; however, the contributions are applicable to a wide variety of domains.

    Introducing distributed dynamic data-intensive (D3) science: Understanding applications and infrastructure

    A common feature across many science and engineering applications is the amount and diversity of data and computation that must be integrated to yield insights. Data sets are growing larger and becoming distributed, and their location, availability, and properties are often time-dependent. Collectively, these characteristics give rise to dynamic distributed data-intensive applications. While "static" data applications have received significant attention, the characteristics, requirements, and software systems for the analysis of large volumes of dynamic, distributed data, and data-intensive applications have received relatively less attention. This paper surveys several representative dynamic distributed data-intensive application scenarios, provides a common conceptual framework to understand them, and examines the infrastructure used in support of applications. Comment: 38 pages, 2 figures

    Design of Progressive Visualization Systems for Exploring Large-Scale Data

    Doctoral thesis, Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, February 2020. Jinwook Seo.

    Understanding data through interactive visualization, also known as visual analytics, is a common and necessary practice in modern data science. However, as data sizes have increased at unprecedented rates, the computation latency of visualization systems has become a significant hurdle to visual analytics. The goal of this dissertation is to design a series of systems for progressive visual analytics (PVA), a visual analytics paradigm that can provide intermediate results during computation and allow visual exploration of these results, to address the scalability hurdle. To support the interactive exploration of data with billions of records, we first introduce SwiftTuna, an interactive visualization system with scalable visualization and computation components. Our performance benchmark demonstrates that it can handle data with four billion records, giving responsive feedback every few seconds without precomputation. Second, we present PANENE, a progressive algorithm for the Approximate k-Nearest Neighbor (AKNN) problem. PANENE brings useful machine learning methods into visual analytics, which has been challenging due to their long initial latency resulting from AKNN computation. In particular, we accelerate t-Distributed Stochastic Neighbor Embedding (t-SNE), a popular non-linear dimensionality reduction technique, which enables the responsive visualization of data with a few hundred columns. These two contributions address the scalability issues stemming from a large number of rows or columns in data, respectively. Third, from the users' perspective, we focus on improving the trustworthiness of intermediate knowledge gained from uncertain results in PVA. We propose a novel PVA concept, Progressive Visual Analytics with Safeguards, and introduce PVA-Guards, safeguards people can leave on uncertain intermediate knowledge that needs to be verified. We also present a proof-of-concept system, ProReveal, designed and developed to integrate seven safeguards into progressive data exploration. Our user study demonstrates that people not only successfully created PVA-Guards on ProReveal but also voluntarily used PVA-Guards to manage the uncertainty of their knowledge. Finally, summarizing the three studies, we discuss design challenges for progressive systems as well as future research agendas for PVA.

    Contents:
    Chapter 1. Introduction
        1.1 Background and Motivation
        1.2 Thesis Statement and Research Questions
        1.3 Thesis Contributions
            1.3.1 Responsive and Incremental Visual Exploration of Large-scale Multidimensional Data
            1.3.2 Progressive Computation of Approximate k-Nearest Neighbors and Responsive t-SNE
            1.3.3 Progressive Visual Analytics with Safeguards
        1.4 Structure of Dissertation
    Chapter 2. Related Work
        2.1 Progressive Visual Analytics
            2.1.1 Definitions
            2.1.2 System Latency and Human Factors
            2.1.3 Users, Tasks, and Models
            2.1.4 Techniques, Algorithms, and Systems
            2.1.5 Uncertainty Visualization
        2.2 Approaches for Scalable Visualization Systems
        2.3 The k-Nearest Neighbor (KNN) Problem
        2.4 t-Distributed Stochastic Neighbor Embedding
    Chapter 3. SwiftTuna: Responsive and Incremental Visual Exploration of Large-scale Multidimensional Data
        3.1 The SwiftTuna Design
            3.1.1 Design Considerations
            3.1.2 System Overview
            3.1.3 Scalable Visualization Components
            3.1.4 Visualization Cards
            3.1.5 User Interface and Interaction
        3.2 Responsive Querying
            3.2.1 Querying Pipeline
            3.2.2 Prompt Responses
            3.2.3 Incremental Processing
        3.3 Evaluation: Performance Benchmark
            3.3.1 Study Design
            3.3.2 Results and Discussion
        3.4 Implementation
        3.5 Summary
    Chapter 4. PANENE: A Progressive Algorithm for Indexing and Querying Approximate k-Nearest Neighbors
        4.1 Approximate k-Nearest Neighbor
            4.1.1 A Sequential Algorithm
            4.1.2 An Online Algorithm
            4.1.3 A Progressive Algorithm
            4.1.4 Filtered AKNN Search
        4.2 k-Nearest Neighbor Lookup Table
        4.3 Benchmark
            4.3.1 Online and Progressive k-d Trees
            4.3.2 k-Nearest Neighbor Lookup Tables
        4.4 Applications
            4.4.1 Progressive Regression and Density Estimation
            4.4.2 Responsive t-SNE
        4.5 Implementation
        4.6 Discussion
        4.7 Summary
    Chapter 5. ProReveal: Progressive Visual Analytics with Safeguards
        5.1 Progressive Visual Analytics with Safeguards
            5.1.1 Definition
            5.1.2 Examples
            5.1.3 Design Considerations
        5.2 ProReveal
        5.3 Evaluation
        5.4 Discussion
        5.5 Summary
    Chapter 6. Discussion
        6.1 Lessons Learned
        6.2 Limitations
    Chapter 7. Conclusion
        7.1 Thesis Contributions Revisited
        7.2 Future Research Agenda
        7.3 Final Remarks
    Abstract (Korean)
    Acknowledgments (Korean)

    Data Spaces

    This open access book aims to help data space designers understand what is required to create a successful data space. It explores cutting-edge theory, technologies, methodologies, and best practices for data spaces for both industrial and personal data and provides the reader with a basis for understanding the design, deployment, and future directions of data spaces. The book captures the early lessons and experience in creating data spaces. It arranges these contributions into three parts covering design, deployment, and future directions respectively. The first part explores the design space of data spaces. The individual chapters detail the organisational design for data spaces, data platforms, data governance, federated learning, personal data sharing, data marketplaces, and hybrid artificial intelligence for data spaces. The second part describes the use of data spaces within real-world deployments. Its chapters are co-authored with industry experts and include case studies of data spaces in sectors including industry 4.0, food safety, FinTech, health care, and energy. The third and final part details future directions for data spaces, including challenges and opportunities for common European data spaces and privacy-preserving techniques for trustworthy data sharing. The book is of interest to two primary audiences: first, researchers interested in data management and data sharing, and second, practitioners and industry experts engaged in data-driven systems where the sharing and exchange of data within an ecosystem are critical.

    Literature Survey of Big Data

    Mention the topic of big data, and a person is bound to experience information overload. Indeed, it is so complex, with so many terms and details, that people want to run away from it. When used right, big data (BD) will help people access the data they need in real time and help managers make better decisions. The purpose of this paper is to evaluate methods, procedures, and architectures for the storage and retrieval of all Federal Aviation Administration (FAA) research, engineering, and development (RE&D) data sets, to leverage the technology innovation and advancement opportunities in the field of BD analytics. The paper also discusses all relevant Executive Orders (EOs), laws, and Office of Management and Budget (OMB) memorandums that were written to address what federal agencies under the OMB's jurisdiction must do to comply with various aspects of BD.