
    A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures

    Scientific problems that depend on processing large amounts of data require overcoming challenges in multiple areas: managing large-scale data distribution, co-placing and scheduling data with compute resources, and storing and transferring large volumes of data. We analyze the ecosystems of the two prominent paradigms for data-intensive applications, hereafter referred to as the high-performance computing and the Apache-Hadoop paradigms. We propose a common terminology and a set of functional factors upon which to analyze the two paradigms. We discuss the concept of "Big Data Ogres" and their facets as a means of understanding and characterizing the most common application workloads found across the two paradigms. We then discuss the salient features of the two paradigms, and compare and contrast the two approaches. Specifically, we examine common implementations of these paradigms, shed light upon the reasons for their current "architecture", and discuss some typical workloads that utilize them. In spite of the significant software distinctions, we believe there is architectural similarity. We discuss the potential integration of different implementations across the different levels and components. Our comparison progresses from a fully qualitative examination of the two paradigms to a semi-quantitative methodology. We use a simple and broadly used Ogre (K-means clustering) and characterize its performance on a range of representative platforms, covering several implementations from both paradigms. Our experiments provide insight into the relative strengths of the two paradigms. We propose that the set of Ogres will serve as a benchmark to evaluate the two paradigms along different dimensions.
    Comment: 8 pages, 2 figures
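    The K-means Ogre used in this comparison can be sketched in a few lines. The NumPy version below is an illustrative stand-in for Lloyd's algorithm, not the paper's actual benchmark code; the iteration count and seeding strategy are simplifying assumptions.

    ```python
    import numpy as np

    def kmeans(points, k, iters=20, seed=0):
        """Lloyd's algorithm: alternate nearest-centroid assignment
        and centroid recomputation."""
        rng = np.random.default_rng(seed)
        # Initialize centroids from k distinct input points.
        centroids = points[rng.choice(len(points), k, replace=False)]
        for _ in range(iters):
            # Assign each point to its nearest centroid.
            dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each centroid as the mean of its assigned points.
            for j in range(k):
                if (labels == j).any():
                    centroids[j] = points[labels == j].mean(axis=0)
        return centroids, labels
    ```

    As an Ogre, the interesting part is not these few lines but how the assignment step (embarrassingly parallel) and the update step (a global reduction) map onto HPC and Hadoop runtimes respectively.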

    SAR Image Denoising Using Non-Local Means on MapReduce

    Satellite systems designed for exploratory surface scanning face the problem of noise in images acquired electromagnetically, i.e. by means of radar. A solution to this inherent problem has been sought in the area of noise-reduction filters applied after the raw data are collected. The Non-Local Means filtering algorithm has been shown to give good refinement results. However, the method is known to be computationally expensive, which poses a problem for processing large datasets. In this work, a parallel computing approach to this task was implemented on the distributed processing framework Apache Hadoop. It is shown that the Non-Local Means approach to noise reduction can be successfully adapted for execution in the distributed fashion of the MapReduce model. Benchmark experiments were carried out on a test image to evaluate the scalability of the approach. The tests confirmed high parallelization efficiency (a 16-executor setup yielded a speedup of 13.14x) and showed the positive potential of Hadoop as a platform for massive image processing.
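    The decomposition that makes Non-Local Means fit MapReduce can be sketched minimally: each map task denoises one image row by similarity-weighted averaging, and the reduce step reassembles the rows. This is an illustration, not the thesis's implementation; a thread pool stands in for Hadoop mappers, and the fixed 3x3 patch, exhaustive search window, and parameter `h` are simplifying assumptions.

    ```python
    import numpy as np
    from concurrent.futures import ThreadPoolExecutor  # stand-in for Hadoop mappers

    def nlm_row(img, y, h):
        """'Map' task: denoise row y. Each output pixel is a weighted
        average over all pixels, with weights from 3x3 patch differences."""
        pad = np.pad(img.astype(float), 1, mode="reflect")
        H, W = img.shape
        out = np.empty(W)
        for x in range(W):
            ref = pad[y:y+3, x:x+3]                    # reference patch
            w = np.empty((H, W))
            for yy in range(H):
                for xx in range(W):
                    patch = pad[yy:yy+3, xx:xx+3]
                    w[yy, xx] = np.exp(-((ref - patch) ** 2).sum() / h ** 2)
            out[x] = (w * img).sum() / w.sum()
        return y, out

    def nlm_parallel(img, h=0.5, workers=4):
        """Fan rows out to workers, then 'reduce' by reassembling
        them in order."""
        with ThreadPoolExecutor(workers) as ex:
            rows = list(ex.map(lambda y: nlm_row(img, y, h), range(img.shape[0])))
        out = np.empty(img.shape, dtype=float)
        for y, row in rows:
            out[y] = row
        return out
    ```

    In a real Hadoop deployment the rows (or larger bands with halo regions) would arrive as key-value records, but the row-level independence shown here is exactly what makes the algorithm amenable to the MapReduce model.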

    Efficient parallel algorithms for synthetic aperture radar data processing using large-scale distributed frameworks

    Processing radar satellite images is a considerable computing task due to large image sizes. Distributed computing can often be leveraged to speed up algorithms that are too time-consuming on a single machine. It is, however, unclear which radar image processing algorithms can be efficiently migrated to parallel environments and what the proper way to implement them is. Previous works have concentrated on parallel image processing as a general computing task, but either the unique properties of radar images or newer distributed computing frameworks are not considered, or only some specific algorithms have been examined. This thesis proposes a classification of radar image processing algorithms that can potentially be parallelized. Each class of algorithms is studied based on the properties of currently popular distributed computing frameworks and file systems. Algorithms that best represent their respective classes are implemented using concrete distributed computing frameworks. The classification simplifies gauging potential algorithms in terms of parallel speedup and provides general implementation steps, thus easing the task of leveraging distributed computing for radar image processing.
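    Gauging an algorithm "in terms of parallel speedup" can be made concrete with Amdahl's law and parallel efficiency. The helpers below are generic illustrations of those two standard measures, not code from the thesis.

    ```python
    def amdahl_speedup(parallel_fraction, workers):
        """Amdahl's law: upper bound on speedup when a fraction p of
        the work parallelizes perfectly across n workers."""
        return 1.0 / ((1 - parallel_fraction) + parallel_fraction / workers)

    def parallel_efficiency(speedup, workers):
        """Fraction of ideal linear speedup actually achieved."""
        return speedup / workers
    ```

    For example, the 13.14x speedup on 16 executors reported in the preceding Non-Local Means study corresponds to a parallel efficiency of about 0.82, i.e. 82% of ideal linear scaling.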

    Big data and hydroinformatics


    Measuring and Managing Answer Quality for Online Data-Intensive Services

    Online data-intensive services parallelize query execution across distributed software components. Interactive response time is a priority, so online query executions return answers without waiting for slow-running components to finish. However, data from these slow components could lead to better answers. We propose Ubora, an approach to measuring the effect of slow-running components on the quality of answers. Ubora randomly samples online queries and executes them twice. The first execution elides data from slow components and provides fast online answers; the second execution waits for all components to complete. Ubora uses memoization to speed up mature executions by replaying network messages exchanged between components. Our systems-level implementation works for a wide range of platforms, including Hadoop/YARN, Apache Lucene, the EasyRec Recommendation Engine, and the OpenEphyra question answering system. Ubora computes answer quality much faster than competing approaches that do not use memoization. With Ubora, we show that answer quality can and should be used to guide online admission control. Our adaptive controller processed 37% more queries than a competing controller guided by the rate of timeouts.
    Comment: Technical Report
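    The core idea of executing a sampled query twice, once with a deadline that elides slow components and once to completion, can be sketched as follows. This is a hypothetical illustration of the mechanism, not Ubora's implementation: `query_component` and the delay values are made up, and a thread pool stands in for distributed components.

    ```python
    import time
    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    def query_component(delay, value):
        """A hypothetical backend component; `delay` stands in for
        its network and compute latency."""
        time.sleep(delay)
        return value

    def execute(components, timeout=None):
        """Fan a query out to all components; elide any that miss the
        deadline (timeout=None waits for everything)."""
        ex = ThreadPoolExecutor(len(components))
        futures = [ex.submit(query_component, d, v) for d, v in components]
        results = []
        for f in futures:
            try:
                results.append(f.result(timeout=timeout))
            except TimeoutError:
                pass  # slow component elided from the online answer
        ex.shutdown(wait=False)  # don't block the online path on stragglers
        return results

    def answer_quality(online, full):
        """Fraction of the complete answer recovered by the online answer."""
        return len(set(online) & set(full)) / len(set(full))
    ```

    Sampling only a small fraction of queries for the second, complete execution keeps the measurement overhead low; Ubora additionally memoizes the first execution's network messages so the mature execution need not recompute them.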

    Towards intelligent geo-database support for earth system observation: Improving the preparation and analysis of big spatio-temporal raster data

    Get PDF
    The European COPERNICUS program provides an unprecedented breakthrough in the broad use and application of satellite remote sensing data. Maintained on a sustainable basis, the COPERNICUS system operates under a free and open data policy. Its guaranteed long-term availability attracts a broader community to remote sensing applications. In general, the increasing amount of satellite remote sensing data opens the door to diverse and advanced analyses of this data for earth system science. However, preparing the data for dedicated processing is still inefficient, as it requires time-consuming operator interaction based on advanced technical skills. Thus, the scientists involved have to spend significant parts of the available project budget on data preparation rather than on science. In addition, analyzing the rich content of remote sensing data requires new concepts for better extraction of promising structures and signals as an effective basis for further analysis. In this paper we propose approaches to improve the preparation of satellite remote sensing data using a geo-database, minimizing the time needed and the errors possibly introduced by human interaction. In addition, we recommend improving data quality and data analysis by incorporating Artificial Intelligence methods. A use case for data preparation and analysis is presented for earth surface deformation analysis in the Upper Rhine Valley, Germany, based on Persistent Scatterer Interferometric Synthetic Aperture Radar data. Finally, we give an outlook on our future research.