6 research outputs found

    Hadoop Image Processing Framework

    With the rapid growth of social media, the number of images being uploaded to the internet is exploding. Massive quantities of images are shared through multi-platform services such as Snapchat, Instagram, Facebook and WhatsApp; recent studies estimate that over 1.8 billion photos are uploaded every day. However, for the most part, applications that make use of this vast data have yet to emerge. Most current image processing applications, designed for small-scale, local computation, do not scale well to web-sized problems and their large demands on computational resources and storage. The emergence of processing frameworks such as the Hadoop MapReduce\cite{dean2008} platform addresses the problem of providing a system for computationally intensive data processing and distributed storage. However, learning the technical complexities of developing useful applications with Hadoop requires a large investment of time and experience on the part of the developer. As such, the pool of researchers and programmers with the varied skills to develop applications that can use large sets of images has been limited. To address this, we have developed the Hadoop Image Processing Framework, which provides a Hadoop-based library to support large-scale image processing. The main aim of the framework is to allow developers of image processing applications to leverage the Hadoop MapReduce framework without having to master its technical details or introduce an additional source of complexity and error into their programs.
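
    The framework's own API is not shown in the abstract; as a hedged illustration of the kind of MapReduce boilerplate such a library hides, the sketch below is a plain Hadoop Streaming mapper in Python that reads one image path per input record and emits a per-image statistic. The path-per-line input format, the Pillow dependency and the brightness metric are assumptions for illustration only.

```python
#!/usr/bin/env python3
# Hypothetical Hadoop Streaming mapper: each input record is assumed to be a
# readable path to one image; the mapper emits "<path>\t<mean brightness>" so
# a reducer can aggregate per-image statistics. The Pillow dependency and the
# brightness metric are illustrative, not the framework's actual API.
import sys

from PIL import Image, ImageStat


def mean_brightness(path: str) -> float:
    """Open one image and return the mean of its grayscale pixel values."""
    with Image.open(path) as img:
        return ImageStat.Stat(img.convert("L")).mean[0]


def main() -> None:
    for line in sys.stdin:
        path = line.strip()
        if not path:
            continue
        try:
            print(f"{path}\t{mean_brightness(path):.2f}")
        except OSError:
            # Skip unreadable or corrupt images rather than failing the whole task.
            continue


if __name__ == "__main__":
    main()
```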

    Solving Big GIS Projects on Desktop Computers

    We are witnessing great developments in digital information technologies. These developments extend to spatial data, which contain both attribute and localization features that determine their position within an obligatory coordinate system. The changes have resulted in the rapid growth of digital data, driven largely by technical advances in the devices that produce them. While the technology for acquiring spatial data advances, methods and software for big data processing are falling behind. Paradoxically, only about 2% of the total volume of data is actually used (Čerba 2017). Big data processing often requires high-performance hardware and software, and only a few users possess the appropriate information infrastructure. The proportion of processed data would improve if big data could be processed by ordinary users. In geographic information systems (GIS), these problems arise in projects that cover extensive territory or involve considerable secondary complexity and therefore require big data processing. This paper focuses on the creation and verification of methods by which extensive GIS projects can be processed effectively on desktop hardware and software. It presents new, fast methods for the functional reduction of data volume, optimization of processing, edge detection in 3D, and automated vectorization.
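
    One of the listed ideas, functional reduction of data volume, can be illustrated with a simple block-wise aggregation that lets a desktop machine avoid holding a full-resolution raster in memory; the tile size, block size and use of numpy are assumptions, not the paper's actual method.

```python
# Illustrative block-wise reduction of a raster tile; numpy and the block size
# are assumptions, and the paper's actual reduction, 3D edge-detection and
# vectorization methods are not shown.
import numpy as np


def blockwise_reduce(tile: np.ndarray, block: int = 8) -> np.ndarray:
    """Aggregate a 2D raster tile by averaging non-overlapping block x block windows."""
    rows = (tile.shape[0] // block) * block
    cols = (tile.shape[1] // block) * block
    trimmed = tile[:rows, :cols]
    return trimmed.reshape(rows // block, block, cols // block, block).mean(axis=(1, 3))


if __name__ == "__main__":
    # Stand-in for one tile streamed from a much larger elevation raster.
    tile = np.random.default_rng(0).random((1024, 1024)).astype(np.float32)
    reduced = blockwise_reduce(tile, block=8)
    print(tile.shape, "->", reduced.shape)  # (1024, 1024) -> (128, 128)
```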

    A new algorithm to split and merge ultra-high resolution 3D images

    Splitting and merging ultra-high resolution 3D images is a requirement for parallel or distributed processing operations. Naive algorithms to split and merge 3D blocks from ultra-high resolution images perform very poorly, due to the number of seeks required to reconstruct spatially-adjacent blocks from linear data organizations on disk. The current solution to deal with this problem is to use file formats that preserve spatial proximity on disk, but this comes with additional complexity. We introduce a new algorithm, called Multiple reads/writes, to split and merge ultra-high resolution 3D images efficiently from simple file formats. Multiple reads/writes only access contiguous bytes in the reconstructed image, which leads to substantial performance improvements compared to existing algorithms. We parallelize our algorithm using multi-threading, which further improves the performance for data stored on a Hadoop cluster. We also show that on-the-fly lossless compression with the lz4 algorithm reduces the split and merge time further.
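
    A minimal sketch of where the seek cost comes from when splitting a volume stored as a flat, C-ordered raw file: each disk access below reads one contiguous X-run of the block, which is the longest contiguous span a naive block copy can obtain. Shapes, dtype and the raw-file layout are illustrative assumptions; the paper's Multiple reads/writes algorithm instead batches much larger contiguous spans of the reconstructed image.

```python
# Illustrative only: shapes, dtype and the flat raw-file layout are assumptions.
import numpy as np

DEPTH, HEIGHT, WIDTH = 64, 64, 64      # full volume (Z, Y, X), C order on disk
BZ, BY, BX = 16, 16, 16                # block shape
DTYPE = np.uint8


def read_block(raw_path: str, z0: int, y0: int, x0: int) -> np.ndarray:
    """Read one block, issuing one contiguous read per (z, y) row of the block."""
    block = np.empty((BZ, BY, BX), dtype=DTYPE)
    itemsize = np.dtype(DTYPE).itemsize
    with open(raw_path, "rb") as f:
        for dz in range(BZ):
            for dy in range(BY):
                # Byte offset of element (z0+dz, y0+dy, x0) in the C-ordered file.
                offset = (((z0 + dz) * HEIGHT + (y0 + dy)) * WIDTH + x0) * itemsize
                f.seek(offset)
                block[dz, dy, :] = np.frombuffer(f.read(BX * itemsize), dtype=DTYPE)
    return block


if __name__ == "__main__":
    # Write a small synthetic volume to a flat raw file, then read one block back.
    rng = np.random.default_rng(0)
    volume = rng.integers(0, 255, size=(DEPTH, HEIGHT, WIDTH), dtype=DTYPE)
    volume.tofile("volume.raw")
    blk = read_block("volume.raw", 16, 32, 0)
    assert np.array_equal(blk, volume[16:32, 32:48, 0:16])
    print("block shape:", blk.shape, "reads issued:", BZ * BY)
```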

    Performance Evaluation of Job Scheduling and Resource Allocation in Apache Spark

    Advancements in data acquisition techniques and devices are revolutionizing the way image data are collected, managed and processed. Devices such as time-lapse cameras and multispectral cameras generate large amounts of image data daily. Therefore, there is a clear need for many organizations and researchers to deal with large volumes of image data efficiently. At the same time, Big Data processing on distributed systems such as Apache Spark has been gaining popularity in recent years. Apache Spark is a widely used in-memory framework for distributed processing of large datasets on a cluster of inexpensive computers. This thesis proposes using Spark for distributed processing of large amounts of image data in a time-efficient manner. However, to share cluster resources efficiently, multiple image processing applications submitted to the cluster must be appropriately scheduled by Spark cluster managers to take advantage of all the compute power and storage capacity of the cluster. Spark can run on three cluster managers, Standalone, Mesos and YARN, and provides several configuration parameters that control how resources are allocated and scheduled. Using default settings for these parameters is not enough to share cluster resources efficiently between multiple applications running concurrently. This leads to performance issues and resource underutilization because cluster administrators and users do not know which Spark cluster manager is the right fit for their applications, or how the scheduling behaviour and parameter settings of these cluster managers affect the performance of their applications in terms of resource utilization and response times. This thesis parallelized a set of heterogeneous image processing applications, including Image Registration, Flower Counter and Image Clustering, and presents extensive comparisons and analyses of running these applications on a large server and on a Spark cluster using three different cluster managers for resource allocation: Standalone, Apache Mesos and Hadoop YARN. In addition, the thesis examined the two job scheduling and resource allocation modes available in Spark: static and dynamic allocation. Furthermore, the thesis explored the various configurations available in both modes that control speculative execution of tasks, resource size and the number of parallel tasks per job, and explained their impact on image processing applications. The thesis aims to show that using optimal values for these parameters reduces job makespan, maximizes cluster utilization, and ensures each application is allocated a fair share of cluster resources in a timely manner.
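
    The kinds of knobs the thesis evaluates can be illustrated with a small PySpark session. The values below are placeholders rather than the thesis's recommended settings, and the master URL, application name and manifest path are assumptions.

```python
# Hedged sketch of the settings the thesis studies: dynamic executor
# allocation, speculative task execution, executor sizing and per-job
# parallelism. Values are placeholders, not recommendations.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("image-processing-eval")                    # hypothetical application name
    .master("yarn")                                      # assumes a configured YARN cluster; use "local[*]" to test locally
    .config("spark.dynamicAllocation.enabled", "true")   # dynamic allocation mode
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "16")
    .config("spark.shuffle.service.enabled", "true")     # required for dynamic allocation
    .config("spark.speculation", "true")                 # re-launch straggler tasks
    .config("spark.executor.memory", "4g")               # resource size per executor
    .config("spark.executor.cores", "2")
    .config("spark.default.parallelism", "64")           # parallel tasks per job
    .getOrCreate()
)

# Trivial stand-in job: count the image files listed in a text manifest on HDFS.
paths = spark.sparkContext.textFile("hdfs:///data/image_manifest.txt")
print("images listed:", paths.count())
spark.stop()
```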

    Energy Efficient Big Data Networks

    The continuous increase in the number and types of big data applications creates new challenges that should be tackled by the green ICT community. Data scientists classify big data into four main categories (4Vs): Volume (with direct implications for power needs), Velocity (with impact on delay requirements), Variety (with varying CPU requirements and reduction ratios after processing) and Veracity (with cleansing and backup constraints). Each V poses many challenges that confront the energy efficiency of the underlying networks carrying big data traffic. In this work, we investigated the impact of the big data 4Vs on energy efficient bypass IP over WDM networks. The investigation is carried out by developing Mixed Integer Linear Programming (MILP) models that encapsulate the distinctive features of each V. In our analyses, the big data network is greened by progressively processing big data raw traffic at strategic locations, dubbed processing nodes (PNs), built into the network along the path from big data sources to the data centres. At each PN, raw data is processed and lower-rate useful information is extracted progressively, eventually reducing the network power consumption. For each V, we conducted an in-depth analysis and evaluated the network power saving that can be achieved by the energy efficient big data network compared to the classical approach. Along the volume dimension of big data, the work dealt with optimally handling and processing an enormous amount of big data Chunks and extracting the corresponding knowledge carried by those Chunks, transmitting knowledge instead of data, thus reducing the data volume and saving power. Variety means that there are different types of big data, such as CPU-intensive, memory-intensive, Input/Output (IO)-intensive, CPU-memory-intensive, CPU/IO-intensive, and memory-IO-intensive applications. Each type requires a different amount of processing, memory, storage, and networking resources. The processing of different varieties of big data was optimised with the goal of minimising power consumption. In the velocity dimension, we classified the processing velocity of big data into two modes: expedited-data processing mode and relaxed-data processing mode. Expedited data demanded a higher amount of computational resources to reduce the execution time compared to relaxed data. The big data processing and transmission were optimised given the velocity dimension to reduce power consumption. Veracity specifies trustworthiness, data protection, data backup, and data cleansing constraints. We considered the implementation of data cleansing and backup operations prior to big data processing so that big data is cleansed and readied for entering the big data analytics stage. The analysis was carried out through dedicated scenarios considering the influence of each V's characteristic parameters. For the set of network parameters we considered, our results revealed that under the volume, variety, velocity and veracity scenarios, network power savings of up to 52%, 47%, 60% and 58%, respectively, can be achieved by the energy efficient big data networks approach compared to the classical approach.
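
    A toy MILP in the spirit of this work, sketched with the PuLP package: choose the node along the source-to-data-centre path at which the raw stream is processed so that total processing plus transport power is minimised. The topology, power coefficients and reduction ratio are made-up illustrative numbers, not the thesis's models.

```python
# Toy processing-node placement MILP; all parameters are illustrative.
import pulp

nodes = ["PN1", "PN2", "PN3"]                       # candidate processing nodes along the path
hops = ["src-PN1", "PN1-PN2", "PN2-PN3", "PN3-DC"]  # links from the source to the data centre
upstream = {                                         # PNs located before each hop
    "src-PN1": [],
    "PN1-PN2": ["PN1"],
    "PN2-PN3": ["PN1", "PN2"],
    "PN3-DC": ["PN1", "PN2", "PN3"],
}
volume = 100.0                                       # raw traffic volume (Gb)
reduction = 0.2                                      # extracted knowledge = 20% of raw volume
e_net = 1.0                                          # transport power per Gb per hop (W)
e_proc = {"PN1": 2.0, "PN2": 1.5, "PN3": 1.0, "DC": 0.8}  # processing power per Gb (W)

prob = pulp.LpProblem("pn_placement", pulp.LpMinimize)
y = {n: pulp.LpVariable(f"process_at_{n}", cat="Binary") for n in nodes}
at_dc = 1 - pulp.lpSum(y.values())                   # fall back to processing at the data centre

# Transport power: each hop carries the full volume until the chosen PN, reduced volume after it.
transport = pulp.lpSum(
    e_net * volume * (1 - (1 - reduction) * pulp.lpSum(y[n] for n in upstream[h]))
    for h in hops
)
processing = (
    pulp.lpSum(e_proc[n] * volume * y[n] for n in nodes)
    + e_proc["DC"] * volume * at_dc
)
prob += transport + processing                       # objective: total power (W)
prob += pulp.lpSum(y.values()) <= 1                  # process the stream at most once before the DC

prob.solve(pulp.PULP_CBC_CMD(msg=False))
chosen = [n for n in nodes if y[n].value() > 0.5]
print("process at:", chosen[0] if chosen else "DC",
      "| total power:", pulp.value(prob.objective))
```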

    Invariant and multi-view object detection and categorization in digital images using bio-inspired computer vision

    This thesis is positioned in the field of automatic image annotation within the research area of Computer Vision. The main objective of this field is to generate textual labels for an image that describe the objects present in it, without human intervention. The thesis builds on the nearest-neighbour model for automatic image annotation. Its novelty lies in proposing a new implementation of the two main steps of that model. In the first step, the thesis proposes using MPEG-7 features to describe the similarity between images and introduces a new texture-feature model based on the primary cortex of a primate. The proposed algorithm is shown to be more effective than the implementation proposed by the standard and also more accurate than other cortex models in the neuroscience literature. In the second step of the model, the thesis presents a new algorithm for selecting the candidate labels of an image given its visually similar images. The main advantage introduced by this algorithm is the combination of textual information from the labels and visual information from the images. Additionally, the thesis also proposes a new training algorithm that has the benefit of being fast and adapted to the particular annotation task, so it can be applied at annotation time.
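
    A minimal nearest-neighbour label-transfer sketch of the model the thesis builds on: random vectors stand in for the MPEG-7 and cortex-inspired texture descriptors, and the similarity-weighted voting is a simple assumption rather than the thesis's combined textual/visual selection algorithm.

```python
# Illustrative k-NN label transfer; descriptors and label sets are synthetic.
import numpy as np

rng = np.random.default_rng(42)
train_feats = rng.random((6, 16))                     # stand-in image descriptors
train_labels = [{"sky", "beach"}, {"sky", "cloud"}, {"forest"},
                {"beach", "sea"}, {"city"}, {"sea", "cloud"}]
query = rng.random(16)                                # descriptor of the image to annotate


def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of a matrix."""
    return (b @ a) / (np.linalg.norm(b, axis=1) * np.linalg.norm(a))


k = 3
sims = cosine_sim(query, train_feats)
neighbours = np.argsort(sims)[::-1][:k]               # indices of the k most similar images

# Each neighbour votes for its labels with a weight equal to its similarity.
scores: dict[str, float] = {}
for idx in neighbours:
    for label in train_labels[idx]:
        scores[label] = scores.get(label, 0.0) + float(sims[idx])

ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print("predicted labels:", [label for label, _ in ranked[:3]])
```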