6 research outputs found

    Distributed multi-label learning on Apache Spark

    This thesis proposes a series of multi-label learning algorithms for classification and feature selection implemented on the Apache Spark distributed computing model. Five approaches for determining the optimal architecture to speed up multi-label learning methods are presented, ranging from local parallelization using threads to distributed computing using independent or shared memory spaces. It is shown that the optimal approach performs hundreds of times faster than the baseline method. Three distributed multi-label k-nearest-neighbors methods built on top of the Spark architecture are proposed: an exact iterative method that computes pair-wise distances, an approximate tree-based method that indexes the instances across multiple nodes, and an approximate locality-sensitive hashing method that builds multiple hash tables to index the data. The results indicate that the predictions of the tree-based method are on par with those of the exact method while reducing execution times in all scenarios. This method is then used to evaluate the quality of a selected feature subset. The optimal adaptation of a multi-label feature selection criterion is discussed, and two distributed feature selection methods for multi-label problems are proposed: one that selects the feature subset maximizing the Euclidean norm of the individual information measures, and one that selects the subset maximizing their geometric mean. The results indicate that each method excels in different scenarios depending on the type of features and the number of labels. Rigorous experimental studies and statistical analyses over many multi-label metrics and datasets confirm that the proposals achieve better performance and scale better to bigger data than state-of-the-art methods.
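    The contrast between the two selection criteria can be made concrete with a small sketch. The Scala snippet below is illustrative only (the score matrix and all names are invented for the example; this is not the thesis's implementation): it ranks features by the Euclidean norm versus the geometric mean of their per-label information measures. The norm rewards features that are highly informative for a few labels, while the geometric mean rewards features that are moderately informative across all labels.

```scala
// Hypothetical sketch: ranking features by aggregating per-label information
// scores, contrasting the Euclidean-norm and geometric-mean criteria.
object FeatureRanking {
  // scores(f)(l) = information measure between feature f and label l (invented data)
  type Scores = Array[Array[Double]]

  def euclideanNorm(perLabel: Array[Double]): Double =
    math.sqrt(perLabel.map(s => s * s).sum)

  def geometricMean(perLabel: Array[Double]): Double =
    // clamp at a tiny positive value so a single zero score does not blow up the log
    math.exp(perLabel.map(s => math.log(s max 1e-12)).sum / perLabel.length)

  // Return the indices of the k best features under the given aggregation.
  def topK(scores: Scores, k: Int, agg: Array[Double] => Double): Seq[Int] =
    scores.zipWithIndex
      .map { case (perLabel, f) => (f, agg(perLabel)) }
      .sortBy(p => -p._2)
      .take(k)
      .map(_._1)
      .toSeq

  def main(args: Array[String]): Unit = {
    val scores: Scores = Array(
      Array(0.9, 0.1, 0.2), // feature 0: peaked on one label
      Array(0.4, 0.4, 0.4), // feature 1: balanced across labels
      Array(0.0, 0.8, 0.7)  // feature 2: strong on two labels, zero on one
    )
    println(topK(scores, 2, euclideanNorm))  // favours peaked features
    println(topK(scores, 2, geometricMean))  // favours balanced features
  }
}
```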

    Efficient Online Processing for Advanced Analytics

    With the advent of emerging technologies and the Internet of Things, the importance of online data analytics has become more pronounced. Businesses and companies are adopting approaches that provide responsive analytics to stay competitive in the global marketplace. Online analytics allow data analysts to react promptly to patterns or to gain preliminary insights from early results that aid in research, decision making, and effective strategy planning. The growth of data velocity in a variety of domains, including high-frequency trading, social networks, infrastructure monitoring, and advertising, requires adopting online engines that can efficiently process continuous streams of data. This thesis presents foundations, techniques, and system designs that extend the state of the art in online query processing to efficiently support relational joins with arbitrary join predicates (beyond traditional equi-joins), and to support other data models (beyond relational) that target machine learning and graph computations. The thesis is divided into two parts.

    In the first part, we present a brief overview of Squall, our open-source online query processing engine that supports SQL-like queries on top of streams, and then focus on extending Squall to support efficient theta-join processing. Scalable distributed join processing requires a partitioning policy that evenly distributes the processing load while minimizing the size of maintained state and the number of duplicated messages. Efficient load balancing demands a priori statistics, which are not available in the online setting. We propose a novel operator that continuously adjusts itself to the data dynamics through adaptive dataflow routing and state repartitioning. It is also resilient to data skew, maintains high throughput rates, avoids blocking during state repartitioning, and behaves as a black-box dataflow operator with provable performance guarantees. Our evaluation demonstrates that the proposed operator outperforms state-of-the-art static partitioning schemes in resource utilization, throughput, and execution time by up to 7x.

    In the second part, we present a novel framework that supports the incremental view maintenance (IVM) of workloads expressed as linear algebra programs. Linear algebra is a concrete substrate for advanced analytical tasks, including machine learning, scientific computation, and graph algorithms. Previous work on relational calculus IVM is not applicable to matrix algebra workloads, because a single entry change to an input matrix results in changes all over the intermediate views, rendering IVM useless in comparison to re-evaluation. We present Lago, a unified modular compiler framework that supports the IVM of a broad class of linear algebra programs. Lago automatically derives and optimizes incremental trigger programs of analytical computations, while freeing the user from erroneous manual derivations, low-level implementation details, and performance tuning. We present a novel technique that captures Δ changes as low-rank matrices. Low-rank matrices are representable in a compressed factored form that enables cheaper computations; Lago automatically propagates the factored representation across program statements to derive an efficient trigger program. Moreover, Lago extends its support to other domains that use different semi-ring configurations, e.g., graph applications. Our evaluation results demonstrate orders of magnitude (10x-1
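    The low-rank idea behind the deltas admits a compact illustration. In the minimal sketch below (plain Scala with invented names, not Lago's actual API), a single-entry update to A is the rank-1 change ΔA = u vᵀ, so a view C = A·B can be maintained with the trigger ΔC = u (vᵀB), which costs O(n²) instead of the O(n³) a full re-evaluation would for n×n matrices.

```scala
// Illustrative sketch of low-rank incremental view maintenance for C = A * B.
object LowRankIvm {
  type Vec = Array[Double]
  type Mat = Array[Array[Double]]

  // w = vᵀ B (a row vector of length |columns of B|)
  def matVecT(v: Vec, b: Mat): Vec =
    Array.tabulate(b(0).length)(j => v.indices.map(i => v(i) * b(i)(j)).sum)

  // C += u wᵀ, in place (adding a rank-1 outer product)
  def outerAdd(c: Mat, u: Vec, w: Vec): Unit =
    for (i <- c.indices; j <- c(i).indices) c(i)(j) += u(i) * w(j)

  def main(args: Array[String]): Unit = {
    val b: Mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
    val c: Mat = Array(Array(0.0, 0.0), Array(0.0, 0.0)) // C = A*B with A = 0
    // A single-entry update A(0)(1) += 5 is the rank-1 change u vᵀ:
    val u: Vec = Array(5.0, 0.0)
    val v: Vec = Array(0.0, 1.0)
    outerAdd(c, u, matVecT(v, b)) // incremental trigger: C += u (vᵀ B)
    c.foreach(row => println(row.mkString(" ")))
  }
}
```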

    Aprendizaje multi-etiqueta distribuido en Apache Spark (Distributed multi-label learning on Apache Spark)

    This thesis proposes a series of multi-label learning algorithms for classification and feature selection implemented on the Apache Spark distributed computing model. Five approaches for determining the optimal architecture to speed up multi-label learning methods are presented, ranging from local parallelization using threads to distributed computing using independent or shared memory spaces. It is shown that the optimal approach performs hundreds of times faster than the baseline method. Three distributed multi-label k-nearest-neighbors methods built on top of the selected Spark architecture are proposed: an exact iterative method that computes pair-wise distances, an approximate tree-based method that indexes the instances across multiple nodes, and an approximate locality-sensitive hashing method that builds multiple hash tables to index the data. The results indicate that the predictions of the tree-based method are on par with those of the exact method while reducing execution times in all scenarios. This method is then used to evaluate the quality of a selected feature subset. The criterion for selecting features in multi-label problems is discussed, and two distributed feature selection methods are proposed: one that selects the feature subset whose individual information measures have the largest Euclidean norm, and one that selects the subset of features with the largest geometric mean. The results indicate that each method excels in different scenarios depending on the type of features and the number of labels. Rigorous experimental studies and statistical analyses over many multi-label metrics and datasets confirm that the proposals achieve better performance and scale better to bigger data than state-of-the-art methods.

    Antares: a scalable, efficient platform for stream, historic, combined and geospatial querying

    Traditional methods for storing and analysing data are proving inadequate for processing "Big Data", due to its volume and the rate at which it is being generated. The limitations of current technologies are further exacerbated by the increased demand for applications which allow users to access and interact with data as soon as it is generated. Near real-time analysis such as this can be partially supported by stream processing systems; however, they currently lack the ability to store data for efficient historic processing, and many applications require a combination of near real-time and historic data analysis. This thesis investigates this problem, and describes and evaluates a novel approach for addressing it. Antares is a layered framework that has been designed to exploit and extend the scalability of NoSQL databases to support low-latency querying and high throughput rates for both stream and historic data analysis simultaneously. Antares began as a company-funded project sponsored by Red Hat; the motivation was to identify a new technology which could provide scalable analysis of both stream and historic data, and to explore new methods for supporting scale and efficiency, for example a layered approach that exploits the scale of historic stores and the speed of in-memory processing. New technologies were investigated to identify current mechanisms and to suggest a means of improvement. The platform was designed to provide scalable, low-latency querying of Twitter data for other researchers to help automate analysis; Antares needed to provide temporal and spatial analysis of Twitter data using the timestamp and geotag. The approach used Twitter as a use case and derived requirements from social scientists for a broader research project called Tweet My Street. Many data streaming applications have a location-based aspect, using geospatial data to enhance the functionality they provide. However, geospatial data is inherently difficult to process at scale due to its multidimensional nature. To address these difficulties, this thesis proposes Antares as a new solution providing scalable and efficient mechanisms for querying geospatial data. The thesis describes the design of Antares and evaluates its performance on a range of scenarios taken from a real social media analytics application. The results show significant performance gains when compared to existing approaches for particular types of analysis. The approach is evaluated by executing experiments across Antares and similar systems to show the improved results. Antares demonstrates that a layered approach can be used to improve performance for inserts and searches, as well as increasing the ingestion rate of the system.
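    The abstract does not spell out how Antares lays out geospatial data, but a common way to make multidimensional location data scan-friendly in a scale-out NoSQL store is a geohash: interleaving latitude and longitude bits so that nearby points share key prefixes and spatial queries become lexicographic range scans. The Scala sketch below is a standard geohash encoder offered purely as background; it is hypothetical with respect to Antares's actual design.

```scala
// Standard geohash encoder: interleaves lon/lat bits, emitting base-32 digits.
object GeoKey {
  private val Base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

  def geohash(lat: Double, lon: Double, precision: Int = 9): String = {
    var (latLo, latHi) = (-90.0, 90.0)
    var (lonLo, lonHi) = (-180.0, 180.0)
    val sb = new StringBuilder
    var bits = 0; var ch = 0; var evenBit = true
    while (sb.length < precision) {
      if (evenBit) { // longitude bit: halve the longitude interval
        val mid = (lonLo + lonHi) / 2
        if (lon >= mid) { ch = (ch << 1) | 1; lonLo = mid }
        else { ch = ch << 1; lonHi = mid }
      } else { // latitude bit: halve the latitude interval
        val mid = (latLo + latHi) / 2
        if (lat >= mid) { ch = (ch << 1) | 1; latLo = mid }
        else { ch = ch << 1; latHi = mid }
      }
      evenBit = !evenBit
      bits += 1
      if (bits == 5) { sb += Base32(ch); bits = 0; ch = 0 } // 5 bits per digit
    }
    sb.toString
  }

  def main(args: Array[String]): Unit = {
    // Nearby tweets share a key prefix, enabling NoSQL range scans.
    println(geohash(54.978, -1.617))
    println(geohash(54.979, -1.618))
  }
}
```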

    Language Support for Distributed Functional Programming

    Software development has taken a fundamental turn. Software today has gone from simple, closed programs running on a single machine to massively open programs, patching together user experiences by way of responses received via hundreds of network requests spanning multiple machines. At the same time, as data continues to stockpile, systems for big data analytics are on the rise. Yet despite this trend towards distributing computation, issues at the level of the language and runtime abound. Serialization is still a costly runtime affair, crashing running systems and confounding developers. Function closures are being added to APIs for big data processing for use by end users, without it being reliably possible to transmit them over the network. And many of the frameworks developed for handling multiple concurrent requests by way of asynchronous programming facilities rely on blocking threads, causing serious scalability issues. This thesis describes a number of extensions and libraries for the Scala programming language that aim to address these issues and to provide a more reliable foundation on which to build distributed systems.

    First, this thesis presents a new approach to serialization called pickling, based on the idea of generating and composing functional pickler combinators statically. The approach shifts the burden of serialization to compile time as much as possible, enabling users to catch serialization errors at compile time rather than at runtime. Further, by virtue of serialization code being generated at compile time, our framework is shown to be significantly more performant than other state-of-the-art serialization frameworks. We also generalize our technique for generating serialization code to generic functions other than pickling.

    Second, in light of the trend of distributed data-parallel frameworks being designed around functional patterns, where closures are transmitted across cluster nodes to large-scale persistent datasets, this thesis introduces a new closure-like abstraction and type system, called spores, that can guarantee closures to be serializable, thread-safe, or even to have custom user-defined properties. Crucially, our system is based on the principle of encoding type information corresponding to captured variables in the type of a spore. We prove our type system sound, implement our approach for Scala, evaluate its practicality through a small empirical study, and show the power of these guarantees through a case analysis of real-world distributed and concurrent frameworks that this safe foundation for closures facilitates.

    Finally, we bring together the above building blocks, pickling and spores, to form the basis of a new programming model called function-passing. Function-passing is based on the idea of a distributed persistent data structure which stores in its nodes transformations to data rather than the distributed data itself, simplifying fault recovery by design. Lazy evaluation is also central to our model; by incorporating laziness into our design only at the point of initiating network communication, our model remains easy to reason about while staying efficient in time and memory. We formalize our programming model in the form of a small-step operational semantics which includes a precise specification of the semantics of functional fault recovery, and we provide an open-source implementation of our model in and for Scala.
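    The pickler-combinator idea can be sketched in a few lines: small picklers for primitive types are composed into picklers for compound types, so the structure of the serializer mirrors the structure of the data and can in principle be assembled at compile time. The toy Scala code below illustrates the combinator style only; it is not the thesis's actual framework.

```scala
// Toy functional pickler combinators: compose serializers from smaller ones.
trait Pickler[A] {
  def pickle(a: A): String
  def unpickle(s: String): A
}

object Picklers {
  val intP: Pickler[Int] = new Pickler[Int] {
    def pickle(a: Int) = a.toString
    def unpickle(s: String) = s.toInt
  }
  val stringP: Pickler[String] = new Pickler[String] {
    def pickle(a: String) = a
    def unpickle(s: String) = s
  }

  // Combinator: build a pickler for pairs out of picklers for the parts.
  def pair[A, B](pa: Pickler[A], pb: Pickler[B]): Pickler[(A, B)] =
    new Pickler[(A, B)] {
      def pickle(ab: (A, B)) = {
        val l = pa.pickle(ab._1)
        s"${l.length}:$l${pb.pickle(ab._2)}" // length-prefix the first field
      }
      def unpickle(s: String) = {
        val i = s.indexOf(':')
        val n = s.take(i).toInt
        val (l, r) = s.drop(i + 1).splitAt(n)
        (pa.unpickle(l), pb.unpickle(r))
      }
    }

  def main(args: Array[String]): Unit = {
    val p = pair(stringP, intP)
    val wire = p.pickle(("answer", 42))
    println(wire)             // 6:answer42
    println(p.unpickle(wire)) // (answer,42)
  }
}
```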

    Cryptographic solutions for protecting the organization's memory from the standpoint of knowledge management

    Modern companies face the problem of being overloaded with large amounts of information and data every day, which hinders their operations and the making of efficient business decisions. Finding new substance in the application of knowledge management, in the sense of efficient use of the organization's memory (its knowledge), is a growing need of companies seeking to improve their business. Likewise, companies are paying ever more attention to protecting the way the organization's memory (the company's knowledge) is accessed, exchanged, and managed. Applying the concept of business intelligence to the management of the organization's memory is becoming an indispensable element of the strategy of successful companies. Integrated automated management of the organization's memory (a company's knowledge), although very complex, represents a solution for the interaction of knowledge management and information technology. This makes it possible to fully explain the decision-making process in a company, as well as the flow of documents, information, and data. Through integrated automated management of the organization's memory, a company gains detailed data that facilitate concrete business decision making.

    This also raises the requirement to protect such an integrated automated process. In accordance with established and adopted international standards (ISO 27001), management of a system such as the organization's memory should ensure efficient implementation, monitoring, and improvement of the system for handling the security of the organization's memory. Protection and security of the organization's memory through cryptographic solutions should strike a balance between user requirements, functionality within the organization's memory, and the need to protect sensitive data and preserve their integrity. Such an integrated automated process for managing the organization's memory could find use both in the field of learning intelligent systems and in existing systems for modern business decision making.

    One of the proposed ways of protecting the integrated system for managing the organization's memory in this thesis is the possibility of storing data in encrypted form, whereby the data become accessible only through the company's information system. This thesis presents an original cryptographic solution for protecting the organization's memory from the standpoint of knowledge management. Only authorized users of the system will have access to documents and data, based on defined access permissions. The authenticity of documents and their immutability would be ensured by means of digital signatures, which the proposed cryptographic solutions provide in accordance with current legal regulations for electronic documents. Principles and models that ensure both data protection and privileged access to data are also considered, all with the aim of enabling decisions based on the organization's memory. The best example for such an analysis are security and intelligence agencies, and there are numerous examples both of well-run organizations and of failures in their work.
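    A minimal sketch of the two mechanisms the abstract combines, encrypted storage and digital signatures, is shown below using only the standard Java crypto API (JCA) from Scala. Key management, access-control policy, and legal compliance are out of scope here, and nothing in this snippet is the thesis's own implementation.

```scala
import java.security.{KeyPairGenerator, Signature}
import javax.crypto.{Cipher, KeyGenerator}

object DocumentProtection {
  def main(args: Array[String]): Unit = {
    val doc = "quarterly decision memo".getBytes("UTF-8")

    // Encrypt the document at rest so it is readable only via the system.
    // Note: Cipher.getInstance("AES") defaults to ECB/PKCS5Padding, which is
    // fine for a sketch; a real system would use an authenticated mode.
    val aesKey = { val kg = KeyGenerator.getInstance("AES"); kg.init(128); kg.generateKey() }
    val enc = Cipher.getInstance("AES")
    enc.init(Cipher.ENCRYPT_MODE, aesKey)
    val ciphertext = enc.doFinal(doc)

    // Sign the document so authenticity and integrity can be verified later.
    val keys = { val g = KeyPairGenerator.getInstance("RSA"); g.initialize(2048); g.generateKeyPair() }
    val signer = Signature.getInstance("SHA256withRSA")
    signer.initSign(keys.getPrivate)
    signer.update(doc)
    val sig = signer.sign()

    // Verification with the public key detects any tampering with the document.
    val verifier = Signature.getInstance("SHA256withRSA")
    verifier.initVerify(keys.getPublic)
    verifier.update(doc)
    println(s"ciphertext bytes: ${ciphertext.length}, signature valid: ${verifier.verify(sig)}")
  }
}
```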