
    Big Geospatial Data processing in the IQmulus Cloud

    Remote sensing instruments are continuously evolving in terms of spatial, spectral and temporal resolution and hence provide exponentially increasing amounts of raw data. These volumes increase significantly faster than computing speeds. All these techniques record large amounts of data, yet in different data models and representations; the resulting datasets therefore require harmonization and integration before meaningful information can be derived from them. All in all, huge datasets are available, but raw data is of almost no value if it is not processed, semantically enriched and quality checked. The derived information needs to be transferred and published to all levels of users, from decision makers to citizens. Up to now, there are only limited automatic procedures for this; thus, a wealth of information remains latent in many datasets. This paper presents the first achievements of the IQmulus EU FP7 research and development project with respect to processing and analysis of big geospatial data in the context of flood and waterlogging detection.

    Sideloading - Ingestion Of large point clouds into the apache spark big data engine

    In the geospatial domain we have now reached the point where the data volumes we handle have clearly grown beyond the capacity of most desktop computers. This is particularly true in the area of point cloud processing. It is therefore natural to explore established big data frameworks for big geospatial data. The very first hurdle is the import of geospatial data into big data frameworks, commonly referred to as data ingestion. Geospatial data is typically encoded in specialised binary file formats, which are not natively supported by existing big data frameworks. Instead, such file formats are supported by software libraries that are restricted to single-CPU execution. We present an approach that allows the use of existing point cloud file format libraries on the Apache Spark big data framework. We demonstrate the ingestion of large volumes of point cloud data into a compute cluster. The approach uses a map function to distribute the data ingestion across the nodes of a cluster. We test the capabilities of the proposed method to load billions of points into a commodity hardware compute cluster and we discuss the implications for scalability and performance. The performance is benchmarked against an existing native Apache Spark data import implementation.
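
    The following is a minimal sketch of the general idea described above: driving an existing single-node point cloud reader from a Spark map function so that ingestion is distributed across the cluster. It is an illustration only, not the paper's implementation; the use of laspy as the reader library, the tile path pattern and the coordinate extraction are assumptions.

```python
# Hypothetical sketch: distributing point cloud ingestion with a Spark map.
# The laspy reader library and the tile path pattern are assumptions, not
# the implementation benchmarked in the paper.
from pyspark.sql import SparkSession
import laspy  # conventional single-CPU LAS/LAZ reader library

def read_points(path):
    """Run the existing file-format library on one worker for one file."""
    las = laspy.read(path)
    return list(zip(las.x, las.y, las.z))

spark = SparkSession.builder.appName("point-cloud-ingestion").getOrCreate()

# One RDD element per input file; flatMap distributes the ingestion work
# across the nodes of the cluster.
paths = ["/data/tiles/tile_%04d.las" % i for i in range(1024)]
points = (spark.sparkContext
          .parallelize(paths, numSlices=len(paths))
          .flatMap(read_points))

print("ingested points:", points.count())
```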

    Modelling of Spatial Big Data Analysis and Visualization

    Today’s advanced survey tools open new approaches and opportunities for geoscience researchers to create new models, systems and frameworks to support the lifecycle of spatial big data. Mobile Mapping Systems use LIDAR technology to provide an efficient and accurate way to collect geographic features and their attributes in the field, which helps city planning departments and surveyors design and update city GIS maps with high accuracy. It is not only about the heterogeneous increase in the volume of point cloud data, but also about several other characteristics such as its velocity and variety. However, the vast amount of point cloud data gathered by Mobile Mapping Systems leads to new challenges for research, innovation and business development: addressing its five characteristics, Volume, Velocity, Variety and Veracity, and then achieving the Value of Spatial Big Data (SBD). Cloud Computing has provided a new paradigm to publish and consume spatial models as a service, together with big data utilities and services that can be used to overcome point cloud data analysis and visualization challenges. This paper presents a model with cloud-based spatial big data services, using spatial join services to relate analysis results to their locations on a map, describes how Cloud Computing supports the visualization and analysis of spatial big data, and reviews examples of related scientific models.
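
    As a minimal sketch of the spatial join idea mentioned above, relating analysis results to their locations on a map, the following uses GeoPandas; the library choice, file names and attribute columns are assumptions for illustration only, not the services described in the paper.

```python
# Minimal sketch of a spatial join that attaches analysis results to map
# locations. GeoPandas, the file names and the column names are assumptions.
import geopandas as gpd

# Point-based analysis results (e.g. derived from point cloud processing)
results = gpd.read_file("analysis_results.geojson")
# Map layer the results should be related to, e.g. city districts
districts = gpd.read_file("districts.geojson")

# Spatial join: each result point inherits the district it falls within
joined = gpd.sjoin(results, districts, how="inner", predicate="within")

# Aggregate per district, e.g. mean of a hypothetical 'elevation' attribute
summary = joined.groupby("district_name")["elevation"].mean()
print(summary)
```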

    Achieving Sustainability Through Geodata: An Empirical Study of Challenges and Barriers

    Master's thesis in Information Systems (IS501). Research within data management is often based on the elements of the data lifecycle. Organizations and businesses are also becoming more interested in data lifecycle management to leverage their data streams, compounded by an interest in geographical attributes within the data, referred to as geodata. Geodata provides a richer basis for analysis and is increasingly important within urban planning. Furthermore, the pressure to achieve sustainability goals calls for improving the data lifecycle. The challenge remains as to what can be improved within the data lifecycle, with geodata as an important input, to achieve the sustainability dimensions. Our main contribution through this study is shedding light on challenges with geodata from an Information Systems (IS) and sustainability perspective. Additionally, the identified challenges also provide feedback to data management research and the data lifecycle.

    Programming Languages for Data-Intensive HPC Applications: a Systematic Mapping Study

    A major challenge in modelling and simulation is the need to combine expertise in both software technologies and a given scientific domain. When High-Performance Computing (HPC) is required to solve a scientific problem, software development becomes a problematic issue. Considering the complexity of software for HPC, it is useful to identify programming languages that can be used to alleviate this issue. Because the existing literature on the topic of HPC is very dispersed, we performed a Systematic Mapping Study (SMS) in the context of the European COST Action cHiPSet. This literature study maps characteristics of various programming languages for data-intensive HPC applications, including category, typical user profiles, effectiveness, and type of articles. We organised the SMS in two phases. In the first phase, relevant articles were identified using an automated keyword-based search in eight digital libraries. This led to an initial sample of 420 papers, which was then narrowed down in a second phase by human inspection of article abstracts, titles and keywords to 152 relevant articles published in the period 2006–2018. The analysis of these articles enabled us to identify 26 programming languages referred to in 33 of the relevant articles. We compared the outcome of the mapping study with the results of our questionnaire-based survey that involved 57 HPC experts. The mapping study and the survey revealed that the desired features of programming languages for data-intensive HPC applications are portability, performance and usability. Furthermore, we observed that the majority of the programming languages used in the context of data-intensive HPC applications are text-based general-purpose programming languages. Typically these have a steep learning curve, which makes them difficult to adopt. We believe that the outcome of this study will inspire future research and development in programming languages for data-intensive HPC applications. Additional co-authors: Sabri Pllana, Ana Respício, José Simão, Luís Veiga, Ari Vis

    A Modular Parallel Pipeline Architecture for GWAS Applications in a Cluster Environment

    A Genome Wide Association Study (GWAS) is an important bioinformatics method to associate variants with traits, identify causes of diseases and increase plant and crop production. There are several optimizations for improving GWAS performance, including running applications in parallel. However, it can be difficult for researchers to utilize different data types and workflows using existing approaches. A potential solution for this problem is to model GWAS algorithms as a set of modular tasks. In this thesis, a modular pipeline architecture for GWAS applications is proposed that can leverage a parallel computing environment as well as store and retrieve data using a shared data cache. To show that the proposed architecture increases the performance of GWAS applications, two case studies are conducted in which the proposed architecture is implemented on a bioinformatics pipeline package called TASSEL and a GWAS application called FaST-LMM, using both Apache Spark and Dask as the parallel processing framework and Redis as the shared data cache. The case studies implement parallel processing modules and shared data cache modules according to the specifications of the proposed architecture. Based on the case studies, a number of experiments are conducted that compare the performance of the implemented architecture in a cluster environment with the original programs. The experiments reveal that the modified applications indeed perform faster than the original sequential programs. However, the modified applications do not scale with cluster resources, as the sequential part of the operations prevents the parallelization from having linear scalability. Finally, an evaluation of the architecture was conducted based on feedback from software developers and bioinformaticians. The evaluation reveals that the domain experts find the architecture useful; the implementations have sufficient performance improvement and they are also easy to use, although a GUI-based implementation would be preferable.
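
    The following is an illustrative sketch of one such modular task, using Dask for parallel execution and Redis as the shared data cache, in the spirit of the architecture described above. The key names, connection settings and the toy association score are assumptions, not the thesis code.

```python
# Illustrative sketch of one modular pipeline task: parallel execution on a
# Dask cluster with intermediate results shared through a Redis cache.
# Key names, connection settings and the toy score are assumptions.
import pickle
import numpy as np
import redis
from dask.distributed import Client

def association_task(chunk_id, genotypes, phenotypes):
    """One pipeline module: score variants in a chunk, caching the result
    in Redis so that later modules (or re-runs) can retrieve it."""
    cache = redis.Redis(host="localhost", port=6379)
    key = f"gwas:assoc:{chunk_id}"
    cached = cache.get(key)
    if cached is not None:
        return pickle.loads(cached)
    # Toy score: correlation of each variant with the phenotype vector
    scores = np.array([np.corrcoef(g, phenotypes)[0, 1] for g in genotypes])
    cache.set(key, pickle.dumps(scores))
    return scores

if __name__ == "__main__":
    client = Client()  # local Dask cluster by default
    rng = np.random.default_rng(0)
    phenotypes = rng.normal(size=200)
    futures = [client.submit(association_task, i,
                             rng.integers(0, 3, size=(50, 200)), phenotypes)
               for i in range(8)]
    results = client.gather(futures)
    print("chunks scored:", len(results))
```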

    Towards a Reference Architecture with Modular Design for Large-scale Genotyping and Phenotyping Data Analysis: A Case Study with Image Data

    With the rapid advancement of computing technologies, various scientific research communities have been extensively using cloud-based software tools or applications. Cloud-based applications allow users to access software applications from web browsers while relieving them from installing any software in their desktop environment. For example, Galaxy, GenAP, and iPlant Collaborative are popular cloud-based systems for scientific workflow analysis in the domain of plant Genotyping and Phenotyping. These systems are being used for conducting research, devising new techniques, and sharing computer-assisted analysis results among collaborators. Researchers need to integrate their new workflows/pipelines, tools or techniques with the base system over time. Moreover, large-scale data need to be processed within the timeline for more effective analysis. Recently, Big Data technologies have been emerging to facilitate large-scale data processing with commodity hardware. Among the above-mentioned systems, GenAP utilizes Big Data technologies for specific cases only. The structure of such a cloud-based system is highly variable and complex in nature. Software architects and developers need to consider quite different properties and challenges during the development and maintenance phases compared to traditional business/service-oriented systems. Recent studies report that software engineers and data engineers confront challenges in developing analytic tools to support large-scale and heterogeneous data analysis. Unfortunately, software researchers have given little focus to devising a well-defined methodology and frameworks for the flexible design of cloud systems for the Genotyping and Phenotyping domain. To that end, more effective design methodologies and frameworks are an urgent need for cloud-based Genotyping and Phenotyping analysis system development that also supports large-scale data processing. In our thesis, we conduct a few studies in order to devise a stable reference architecture and modularity model for software developers and data engineers in the domain of Genotyping and Phenotyping. In the first study, we analyze the architectural changes of existing candidate systems to identify stability issues. Then, we extract architectural patterns of the candidate systems and propose a conceptual reference architectural model. Finally, we present a case study on the modularity of computation-intensive tasks as an extension of the data-centric development. We show that the data-centric modularity model is at the core of the flexible development of a Genotyping and Phenotyping analysis system. Our proposed model and case study with thousands of images provide a useful knowledge base for software researchers, developers, and data engineers working on cloud-based Genotyping and Phenotyping analysis system development.

    A modular software architecture for processing of big geospatial data in the cloud

    No full text
    In this paper we propose a software architecture that allows for processing of large geospatial data sets in the cloud. Our system is modular and flexible and supports multiple algorithm design paradigms such as MapReduce, in-memory computing or agent-based programming. It contains a web-based user interface where domain experts (e.g. GIS analysts or urban planners) can define high-level processing workflows using a domain-specific language (DSL). The workflows are passed through a number of components including a parser, an interpreter, and a service called the job manager. These components use declarative and procedural knowledge encoded in rules to generate a processing chain specifying the execution of the workflows on a given cloud infrastructure according to the constraints defined by the user. The job manager evaluates this chain, spawns processing services in the cloud and monitors them. The services communicate with each other through a distributed file system that is scalable and fault-tolerant. Compared to previous work describing cloud infrastructures and architectures, we focus on the processing of big heterogeneous geospatial data. In addition, we do not rely on a single programming model or a specific cloud infrastructure but support several. Combined with the possibility to control the processing through DSL-based workflows, this makes our architecture very flexible and configurable. We see the cloud not only as a means to store and distribute large data sets but also as a way to harness the processing power of distributed computing environments for large-volume geospatial data sets. The proposed architecture design has been developed for the IQmulus research project funded by the European Commission. The paper concludes with the evaluation results from applying our solution to two example workflows from this project.
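
    To illustrate the idea of DSL-defined workflows being turned into an ordered processing chain that a job manager could dispatch as services, here is a hypothetical sketch; the DSL syntax, step names and parameters are invented for illustration and are not the IQmulus DSL.

```python
# Hypothetical illustration: a tiny DSL-like workflow text is parsed into an
# ordered processing chain. Syntax and step names are invented, not IQmulus'.
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    params: dict

def parse_workflow(text):
    """Parse lines of the form 'step_name key=value ...' into Step objects."""
    chain = []
    for line in text.strip().splitlines():
        tokens = line.split()
        params = dict(tok.split("=", 1) for tok in tokens[1:])
        chain.append(Step(tokens[0], params))
    return chain

workflow = """
load_point_cloud source=hdfs:///tiles/
filter_outliers k=8 stddev=2.0
detect_water_level resolution=1m
publish_result target=webmap
"""

for step in parse_workflow(workflow):
    # A job manager would spawn a processing service per step and monitor it
    print(f"dispatching service '{step.name}' with {step.params}")
```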

    Cloud Computing Adoption in Afghanistan: A Quantitative Study Based on the Technology Acceptance Model

    Cloud computing emerged as an alternative to traditional in-house data centers that businesses can leverage to increase operational agility and employees' productivity. IT solution architects are tasked with presenting to IT managers analyses of the critical barriers and challenges of cloud computing adoption. This quantitative correlational study established an enhanced technology acceptance model (TAM) with four external variables: perceived security (PeS), perceived privacy (PeP), perceived connectedness (PeN), and perceived complexity (PeC) as antecedents of perceived usefulness (PU) and perceived ease of use (PEoU) in a cloud computing context. Data were collected from 125 participants who responded to the invitation through an online survey focusing on Afghanistan's main cities: Kabul, Mazar, and Herat. The analysis showed that PEoU was a predictor of the behavioral intention of cloud computing adoption, which is consistent with the TAM; PEoU with an R² = .15 had a stronger influence than PU with an R² = .023 on the behavioral intention to adopt and use cloud computing. PeN, PeS, and PeP significantly influenced the behavioral intentions of IT architects to adopt and use the technology. This study showed that PeC was not a significant barrier to cloud computing adoption in Afghanistan. By adopting cloud services, employees can have access to various tools that can help increase business productivity and contribute to improving the work environment. Cloud services, as an alternative to in-house data centers, can help businesses reduce power consumption and consequently decrease carbon dioxide emissions due to lower power demand.
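
    As a sketch of how per-predictor R² values like those reported above could be computed, the following fits simple regressions of behavioral intention on each predictor; the survey data here is synthetic and the use of scikit-learn is an assumption for illustration only.

```python
# Sketch of per-predictor R^2 for a TAM-style analysis. Data is synthetic;
# the simple-regression setup and scikit-learn are assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 125  # same sample size as the study
peou = rng.normal(size=(n, 1))  # perceived ease of use
pu = rng.normal(size=(n, 1))    # perceived usefulness
# Behavioral intention loosely driven by PEoU (synthetic relationship)
bi = 0.4 * peou[:, 0] + 0.1 * pu[:, 0] + rng.normal(size=n)

for name, x in [("PEoU", peou), ("PU", pu)]:
    model = LinearRegression().fit(x, bi)
    print(f"{name}: R^2 = {model.score(x, bi):.3f}")
```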