6 research outputs found
Distributed multi-label learning on Apache Spark
This thesis proposes a series of multi-label learning algorithms for classification and feature selection implemented on the Apache Spark distributed computing model. Five approaches for determining the optimal architecture to speed up multi-label learning methods are presented. These approaches range from local parallelization using threads to distributed computing using independent or shared memory spaces. It is shown that the optimal approach performs hundreds of times faster than the baseline method. Three distributed multi-label k-nearest-neighbors methods built on top of the Spark architecture are proposed: an exact iterative method that computes pairwise distances, an approximate tree-based method that indexes the instances across multiple nodes, and an approximate locality-sensitive hashing method that builds multiple hash tables to index the data. The results indicate that the predictions of the tree-based method are on par with those of the exact method while reducing execution times in all scenarios. This method is then used to evaluate the quality of a selected feature subset. The optimal adaptation of a multi-label feature selection criterion is discussed, and two distributed feature selection methods for multi-label problems are proposed: one that selects the feature subset maximizing the Euclidean norm of the individual information measures, and one that selects the subset maximizing the geometric mean. The results indicate that each method excels in different scenarios depending on the type of features and the number of labels. Rigorous experimental studies and statistical analyses over many multi-label metrics and datasets confirm that the proposals achieve better performance and scale better to larger data than state-of-the-art methods.
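The locality-sensitive hashing variant described above can be illustrated with a minimal single-machine sketch: random-hyperplane bit signatures over several hash tables, with the union of matching buckets forming the candidate set. All names and parameters here are illustrative, not the thesis implementation:

```python
import random

def make_table(dim, n_planes, rng):
    """One hash table: n_planes random hyperplanes define a bit signature."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

def signature(planes, point):
    """Bit signature: which side of each hyperplane the point falls on."""
    return tuple(int(sum(w * x for w, x in zip(p, point)) >= 0) for p in planes)

class LSHIndex:
    """Approximate nearest-neighbor index using several hash tables."""
    def __init__(self, dim, n_tables=4, n_planes=6, seed=0):
        rng = random.Random(seed)
        self.tables = [make_table(dim, n_planes, rng) for _ in range(n_tables)]
        self.buckets = [{} for _ in range(n_tables)]

    def add(self, idx, point):
        for planes, bucket in zip(self.tables, self.buckets):
            bucket.setdefault(signature(planes, point), []).append((idx, point))

    def candidates(self, query):
        """Union of the query's buckets across all tables (the candidate set)."""
        out = {}
        for planes, bucket in zip(self.tables, self.buckets):
            for idx, pt in bucket.get(signature(planes, query), []):
                out[idx] = pt
        return out

index = LSHIndex(dim=2)
index.add(0, [1.0, 0.0])
index.add(1, [0.9, 0.1])
index.add(2, [-1.0, 0.0])
near = index.candidates([1.0, 0.0])  # scans only the matching buckets
```

In the distributed setting each table's buckets would live in Spark partitions, so a query touches only candidate buckets rather than all instances, which is the exactness-for-speed trade the abstract describes.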
Efficient Online Processing for Advanced Analytics
With the advent of emerging technologies and the Internet of Things, the importance of online data analytics has become more pronounced. Businesses and companies are adopting approaches that provide responsive analytics to stay competitive in the global marketplace. Online analytics allow data analysts to react promptly to patterns or to gain preliminary insights from early results that aid in research, decision making, and effective strategy planning. The growth of data velocity in a variety of domains, including high-frequency trading, social networks, infrastructure monitoring, and advertising, requires adopting online engines that can efficiently process continuous streams of data. This thesis presents foundations, techniques, and system designs that extend the state of the art in online query processing to efficiently support relational joins with arbitrary join predicates (beyond traditional equi-joins), and to support other data models (beyond relational) that target machine learning and graph computations. The thesis is divided into two parts. We first present a brief overview of Squall, our open-source online query processing engine that supports SQL-like queries on top of streams. Then, we focus on extending Squall to support efficient theta-join processing. Scalable distributed join processing requires a partitioning policy that evenly distributes the processing load while minimizing the size of maintained state and the number of duplicated messages. Efficient load balancing demands a priori statistics, which are not available in the online setting. We propose a novel operator that continuously adjusts itself to the data dynamics through adaptive dataflow routing and state repartitioning. It is also resilient to data skew, maintains high throughput rates, avoids blocking during state repartitioning, and behaves as a black-box dataflow operator with provable performance guarantees.
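A standard statistics-free baseline that adaptive theta-join operators build on is content-insensitive matrix partitioning (in the spirit of 1-Bucket-Theta): workers form a grid, each tuple of one relation is routed to one random row and replicated across that row's columns, and each tuple of the other relation to one random column replicated across its rows, so every pair of tuples meets at exactly one worker regardless of the join predicate. A minimal simulation of that routing, offered as an illustration rather than Squall's actual code:

```python
import random

def make_routers(rows, cols, rng):
    """Routing functions for a rows x cols worker grid."""
    def route_r(_tuple_r):
        i = rng.randrange(rows)               # one random row ...
        return [(i, j) for j in range(cols)]  # ... replicated across its columns
    def route_s(_tuple_s):
        j = rng.randrange(cols)               # one random column ...
        return [(i, j) for i in range(rows)]  # ... replicated across its rows
    return route_r, route_s

def theta_join(R, S, pred, rows=2, cols=3, seed=0):
    """Evaluate a join with an arbitrary predicate on the simulated grid."""
    rng = random.Random(seed)
    route_r, route_s = make_routers(rows, cols, rng)
    workers = {(i, j): ([], []) for i in range(rows) for j in range(cols)}
    for r in R:
        for w in route_r(r):
            workers[w][0].append(r)
    for s in S:
        for w in route_s(s):
            workers[w][1].append(s)
    out = []
    for rs, ss in workers.values():  # each (r, s) pair meets at exactly one worker
        out += [(r, s) for r in rs for s in ss if pred(r, s)]
    return sorted(out)

# A band join, which no equi-join partitioner handles directly.
result = theta_join([1, 3, 5, 7], [2, 3, 8], lambda r, s: abs(r - s) <= 1)
```

Because row and column choices intersect in exactly one grid cell, the scheme produces no duplicate results and needs no statistics, at the cost of replication, which is exactly the cost the adaptive operator above works to reduce.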
Our evaluation demonstrates that the proposed operator outperforms state-of-the-art static partitioning schemes in resource utilization, throughput, and execution time by up to 7x. In the second part, we present a novel framework that supports the incremental view maintenance (IVM) of workloads expressed as linear algebra programs. Linear algebra represents a concrete substrate for advanced analytical tasks, including machine learning, scientific computation, and graph algorithms. Previous work on relational-calculus IVM is not applicable to matrix algebra workloads, because a single-entry change to an input matrix results in changes all over the intermediate views, rendering IVM useless in comparison to re-evaluation. We present Lago, a unified modular compiler framework that supports the IVM of a broad class of linear algebra programs. Lago automatically derives and optimizes incremental trigger programs of analytical computations, while freeing the user from erroneous manual derivations, low-level implementation details, and performance tuning. We present a novel technique that captures changes as low-rank matrices. Low-rank matrices are representable in a compressed factored form that enables cheaper computations. Lago automatically propagates the factored representation across program statements to derive an efficient trigger program. Moreover, Lago extends its support to other domains that use different semiring configurations, e.g., graph applications. Our evaluation results demonstrate orders of magnitude (10x-1
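The low-rank idea can be seen on a toy example: if C = A·B and a single entry of A changes, the change ΔA = u·vᵀ has rank one, so the maintained view can be updated with ΔC = u·(vᵀB), a vector-matrix product, instead of re-multiplying the full matrices. A hand-rolled sketch of this trigger (pure Python, illustrative only, not Lago's generated code):

```python
def matmul(A, B):
    """Dense matrix product over lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def rank1_update(C, u, vB):
    """Apply the factored delta: C + outer(u, vB)."""
    return [[C[i][j] + u[i] * vB[j] for j in range(len(vB))]
            for i in range(len(u))]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = matmul(A, B)                 # the maintained view, C = A @ B

# Single-entry change: A[1][0] += 5, i.e. delta_A = u v^T.
u = [0.0, 5.0]                   # which rows change, and by how much
v = [1.0, 0.0]                   # which column of A changed
vB = [sum(x * b for x, b in zip(v, col)) for col in zip(*B)]  # v^T B

C_new = rank1_update(C, u, vB)   # O(n^2) maintenance instead of O(n^3) reeval
```

Propagating the factored pair (u, vᵀB) through subsequent statements, instead of the dense delta, is what keeps the derived trigger programs cheap.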
Aprendizaje multi-etiqueta distribuido en Apache Spark
This thesis proposes a series of multi-label learning algorithms for classification and feature selection implemented on the Apache Spark distributed computing model. Five approaches for determining the optimal architecture to speed up the multi-label learning methods are presented. These approaches range from local parallelization using threads to distributed computing using independent or shared memory spaces. It is shown that the optimal approach performs hundreds of times faster than the baseline method. Three distributed multi-label k-nearest-neighbors methods built on top of the Spark architecture are proposed: an exact iterative method that computes pairwise distances, an approximate tree-based method that indexes the instances across multiple nodes, and an approximate locality-sensitive hashing method that builds multiple hash tables to index the data. The results indicate that the predictions of the tree-based method are on par with those of the exact method while reducing execution times in all scenarios. This method is then used to evaluate the quality of a selected feature subset. The optimal adaptation of a multi-label feature selection criterion is discussed, and two distributed feature selection methods for multi-label problems are proposed: one that selects the feature subset maximizing the Euclidean norm of the individual information measures, and one that selects the subset of features maximizing the geometric mean. The results indicate that each method excels in different scenarios depending on the type of features and the number of labels.
Rigorous experimental studies and statistical analyses over many multi-label metrics and datasets confirm that the proposals achieve better performance and scale better to larger data than state-of-the-art methods. This doctoral thesis proposes distributed classification and feature selection algorithms for multi-label learning implemented on Apache Spark. Five strategies for determining the optimal architecture to accelerate multi-label learning are presented. These strategies range from local parallelization using threads to distributing the computation across shared or independent memory spaces. The optimal strategy is shown to run hundreds of times faster than the reference method. Three distributed multi-label k-nearest-neighbors methods are proposed on top of the selected Spark architecture: an exact method that iteratively computes the distances, an approximate method that uses a tree to index the instances, and an approximate method that uses hash tables to index the instances. The results indicate that the predictions of the tree-based method are equivalent to those produced by an exact method while reducing execution times in all scenarios. This method is then used to evaluate the quality of a feature subset. The criterion for selecting features in multi-label problems is discussed, and two methods are proposed: one that selects the subset of features whose individual information measures have the largest Euclidean norm, and one that selects the subset of features with the largest geometric mean. The results indicate that each method excels in different scenarios depending on the type of features and the number of labels.
Experimental studies and statistical analyses using multiple multi-label metrics and datasets confirm that our proposals achieve better performance and provide better scalability to large data than the reference methods.
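The two aggregation criteria reward per-label relevance scores (e.g., an information measure between a feature and each label) differently: the Euclidean norm favors features with a few very high scores, while the geometric mean penalizes any feature that is useless for even one label. A toy illustration with made-up scores, not the thesis's actual measures or datasets:

```python
import math

def euclidean_norm(scores):
    """Aggregate per-label scores; rewards a few large values."""
    return math.sqrt(sum(s * s for s in scores))

def geometric_mean(scores):
    """Aggregate per-label scores; any zero collapses the whole score."""
    return math.prod(scores) ** (1.0 / len(scores))

# Per-label relevance of two hypothetical features across three labels.
specialist = [0.9, 0.9, 0.0]   # strong on two labels, useless on the third
generalist = [0.5, 0.5, 0.5]   # moderately useful everywhere

by_norm = max([specialist, generalist], key=euclidean_norm)
by_gmean = max([specialist, generalist], key=geometric_mean)
# The norm picks the specialist; the geometric mean picks the generalist.
```

This is consistent with the abstract's finding that each criterion excels in different scenarios depending on the features and the number of labels.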
Antares: a scalable, efficient platform for stream, historic, combined and geospatial querying
PhD Thesis. Traditional methods for storing and analysing data are proving inadequate for processing "Big Data", due to its volume and the rate at which it is being generated. The limitations of current technologies are further exacerbated by the increased demand for applications which allow users to access and interact with data as soon as it is generated. Near real-time analysis such as this can be partially supported by stream processing systems; however, they currently lack the ability to store data for efficient historic processing, and many applications require a combination of near real-time and historic data analysis. This thesis investigates this problem, and describes and evaluates a novel approach for addressing it. Antares is a layered framework that has been designed to exploit and extend the scalability of NoSQL databases to support low-latency querying and high throughput rates for both stream and historic data analysis simultaneously.
Antares began as a company-funded project sponsored by Red Hat; the motivation was to identify a new technology which could provide scalable analysis of both stream and historic data, and to explore new methods for supporting scale and efficiency, for example a layered approach. A layered approach would exploit the scale of historic stores and the speed of in-memory processing. New technologies were investigated to identify current mechanisms and suggest means of improvement. Antares supports a layered approach for analysis; the motivation for the platform was to provide scalable, low-latency querying of Twitter data for other researchers, to help automate analysis. Antares needed to provide temporal and spatial analysis of Twitter data using the timestamp and geotag. The approach used Twitter as a use case and derived requirements from social scientists for a broader research project called Tweet My Street.
Many data streaming applications have a location-based aspect, using geospatial data to enhance the functionality they provide. However, geospatial data is inherently difficult to process at scale due to its multidimensional nature. To address these difficulties, this thesis proposes Antares as a new solution providing scalable and efficient mechanisms for querying geospatial data. The thesis describes the design of Antares and evaluates its performance on a range of scenarios taken from a real social media analytics application. The results show significant performance gains when compared to existing approaches for particular types of analysis. The approach is evaluated by executing experiments across Antares and similar systems to show the improved results. Antares demonstrates that a layered approach can be used to improve performance for inserts and searches, as well as increasing the ingestion rate of the system.
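A common way to make multidimensional geospatial data queryable in a key-ordered NoSQL store (the setting Antares targets) is to linearize latitude and longitude into a single sortable key, such as a geohash, so that a spatial range query becomes a key-prefix scan. A minimal encoder, offered only as an illustration of this general technique, not as Antares's actual mechanism:

```python
# Geohash: interleave longitude/latitude range-halving bits, base32-encode.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=11):
    """Encode a (lat, lon) pair as a sortable geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    use_lon = True                       # even bit positions refine longitude
    while len(bits) < precision * 5:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1); lon_lo = mid
            else:
                bits.append(0); lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1); lat_lo = mid
            else:
                bits.append(0); lat_hi = mid
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):     # 5 bits per base32 character
        n = 0
        for b in bits[i:i + 5]:
            n = (n << 1) | b
        chars.append(BASE32[n])
    return "".join(chars)
```

Nearby points share key prefixes, so a one-dimensional range scan over keys retrieves a spatial neighbourhood, which is what lets a scalable key-value layer serve geotagged Twitter-style queries.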
Language Support for Distributed Functional Programming
Software development has taken a fundamental turn. Software today has gone from simple, closed programs running on a single machine to massively open programs, patching together user experiences by way of responses received via hundreds of network requests spanning multiple machines. At the same time, as data continues to stockpile, systems for big data analytics are on the rise. Yet despite this trend towards distributing computation, issues at the level of the language and runtime abound. Serialization is still a costly runtime affair, crashing running systems and confounding developers. Function closures are being added to APIs for big data processing for use by end users, without it reliably being possible to transmit them over the network. And many of the frameworks developed for handling multiple concurrent requests by way of asynchronous programming facilities rely on blocking threads, causing serious scalability issues. This thesis describes a number of extensions and libraries for the Scala programming language that aim to address these issues and to provide a more reliable foundation on which to build distributed systems. This thesis presents a new approach to serialization called pickling, based on the idea of generating and composing functional pickler combinators statically. The approach shifts the burden of serialization to compile time as much as possible, enabling users to catch serialization errors at compile time rather than at runtime. Further, by virtue of serialization code being generated at compile time, our framework is shown to be significantly more performant than other state-of-the-art serialization frameworks. We also generalize our technique for generating serialization code to generic functions other than pickling.
Second, in light of the trend of distributed data-parallel frameworks being designed around functional patterns, where closures are transmitted across cluster nodes to large-scale persistent datasets, this thesis introduces a new closure-like abstraction and type system, called spores, that can guarantee closures to be serializable, thread-safe, or even to have custom user-defined properties. Crucially, our system is based on the principle of encoding type information corresponding to captured variables in the type of a spore. We prove our type system sound, implement our approach for Scala, evaluate its practicality through a small empirical study, and show the power of these guarantees through a case analysis of real-world distributed and concurrent frameworks that this safe foundation for closures facilitates. Finally, we bring together the above building blocks, pickling and spores, to form the basis of a new programming model called function-passing. Function-passing is based on the idea of a distributed persistent data structure which stores in its nodes transformations to data rather than the distributed data itself, simplifying fault recovery by design. Lazy evaluation is also central to our model; by incorporating laziness into our design only at the point of initiating network communication, our model remains easy to reason about while remaining efficient in time and memory. We formalize our programming model in the form of a small-step operational semantics, which includes a precise specification of the semantics of functional fault recovery, and we provide an open-source implementation of our model in and for Scala.
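The spirit of spores (explicit, checked capture) can be imitated even in a dynamically typed language. The sketch below checks at construction time, rather than in Scala's type system as the thesis does, that every captured value is serializable, so a failure surfaces when the closure is built rather than when it is shipped over the network. Names and signatures here are illustrative, not the thesis's API:

```python
import pickle

def spore(captured, body):
    """Closure with an explicit capture list, checked for serializability.

    `captured` is a dict of the only free variables `body` may use; each
    value must survive pickling, or we fail fast at construction time.
    """
    for name, value in captured.items():
        try:
            pickle.dumps(value)
        except Exception as exc:
            raise TypeError(f"capture {name!r} is not serializable: {exc}")
    def run(*args):
        return body(captured, *args)
    return run

# A well-behaved spore: captures only a serializable constant.
scale = spore({"factor": 3}, lambda env, x: env["factor"] * x)

# A rejected spore: lambdas (like open file handles or sockets) do not
# pickle, so this capture is refused before it can reach the network.
# spore({"callback": lambda: 0}, lambda env, x: x)  # raises TypeError
```

The Scala version moves both checks into the type of the spore itself, so the error is caught at compile time rather than at closure-construction time.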
Cryptographic solutions for organizational memory protection from the knowledge management standpoint
Modern companies are confronted daily with the burden of large volumes of information and data, which complicates their operations and the making of effective business decisions. Finding new ways to apply knowledge management, in the sense of efficiently using the organization's memory (its knowledge), is a growing need for companies seeking to improve their business. Likewise, companies are paying increasing attention to protecting how the organization's memory (the company's knowledge) is accessed, exchanged, and managed. Applying the concept of business intelligence to the management of organizational memory is becoming an indispensable element of the strategy of successful companies. Integrated, automated management of organizational memory (a company's knowledge), although very complex, represents a solution for the interaction of knowledge management and information technology. This creates the possibility of fully explaining a company's decision-making process, as well as the flow of documents, information, and data. Through integrated, automated management of organizational memory, a company gains detailed data that makes concrete business decision-making easier. This also raises the requirement to protect such an integrated, automated process. In accordance with the adopted international standards (ISO 27001), management in a system such as organizational memory should ensure the effective implementation, monitoring, and improvement of the system for handling the security of the organization's memory.
Protecting and securing organizational memory through cryptographic solutions must balance user requirements, functionality within the organizational memory, and the need to protect sensitive data and preserve its integrity. Such an integrated, automated process for managing organizational memory is a solution that could find use both in the field of intelligent learning systems and in existing systems for modern business decision-making. One of the proposed ways of protecting the integrated system for managing organizational memory in this work is the possibility of storing data in encrypted form, so that the data become accessible only through the company's information system. This work presents an original cryptographic solution for protecting organizational memory from the knowledge-management standpoint. Only authorized users of the system will have access to documents and data, based on defined access permissions. The authenticity and immutability of documents would be ensured by means of digital signatures, which the proposed cryptographic solutions provide in accordance with current legal regulations for electronic documents. Principles and models that ensure both data protection and privileged access to data are also considered, all with the aim of enabling decisions based on organizational memory. The best example for such an analysis is security and intelligence agencies, with numerous examples of both well-run organizations and failures in their work.
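The document-integrity requirement above (authenticity and immutability, with tampering detectable before a document informs a decision) can be sketched with Python's standard library. Real deployments of the kind the thesis describes would use asymmetric digital signatures (e.g., RSA or ECDSA via a cryptographic library), but an HMAC over the document illustrates the tamper-evidence idea; the key and document below are illustrative:

```python
import hashlib
import hmac

def sign_document(key: bytes, document: bytes) -> str:
    """Tag the document so any later modification is detectable."""
    return hmac.new(key, document, hashlib.sha256).hexdigest()

def verify_document(key: bytes, document: bytes, tag: str) -> bool:
    """Constant-time check that the document still matches its tag."""
    return hmac.compare_digest(sign_document(key, document), tag)

key = b"company-information-system-key"   # key management is out of scope here
doc = b"Board decision: approve the project budget."
tag = sign_document(key, doc)

accepted = verify_document(key, doc, tag)                     # untouched
rejected = verify_document(key, doc + b" (edited)", tag)      # tampered
```

An HMAC proves integrity only to holders of the shared key; the per-user authenticity and non-repudiation the thesis requires is what motivates moving to asymmetric signatures tied to individual authorized users.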