
    Greedy adaptive algorithms for sparse representations

    A vector or matrix is said to be sparse if the number of non-zero elements is significantly smaller than the number of zero elements. In estimation theory, the vector of model parameters may be known in advance to have a sparse structure, and solving the estimation problem under this constraint can substantially improve the accuracy of the solution. The theory of sparse models has advanced significantly in recent years, providing many results that guarantee certain properties of the sparse solutions. These performance guarantees can be very powerful in applications and have no counterpart in the estimation theory for non-sparse models. Model sparsity is an inherent characteristic of many applications in signal processing and related areas (image compression, wireless channel estimation, direction-of-arrival estimation). Due to continuous technological advances that allow faster numerical computations, optimization problems that were too complex to be solved in the past can now be tackled with sparsity constraints taken into account. However, finding sparse solutions exactly generally requires a combinatorial search for the correct support, a very limiting factor because of its huge numerical complexity. This has motivated a growing interest in batch sparsity-aware algorithms over the past twenty years. More recently, the main goal of research on sparsity has been the quest for faster, less computationally intensive, adaptive methods able to recursively update the solution.
In this thesis we present several such algorithms. They are greedy in nature and minimize the least squares criterion under the constraint that the solution is sparse. As in other greedy sparse methods, two main steps are performed once new data are available: update the sparse support by changing the positions that contribute to the solution, and compute the coefficients that minimize the least squares criterion restricted to the current support. Two classes of adaptive algorithms are proposed. The first is derived from the batch matching pursuit algorithm. It uses a coordinate descent approach to update the solution, each coordinate being selected by a criterion similar to the one used by matching pursuit. We devise two algorithms that use a cyclic update strategy to improve the solution at each time instant. Since the solution support and coefficient values are assumed to vary slowly, a faster and better performing approach is then obtained by spreading the coordinate descent updates over time. This approach is also adapted to a distributed setup in which nodes communicate with their neighbors to improve their local solutions towards a global optimum. The second direction is linked to batch orthogonal least squares. The algorithms maintain a partial QR decomposition with pivoting and rely on a permutation-based support selection strategy to keep the complexity low while allowing the tracking of slow variations in the support. Two versions of the algorithm are proposed; they allow past data to be forgotten by using an exponential window or a sliding window, respectively. The former is further modified to improve the solution in the structured sparsity case, when the solution is group sparse. We also propose mechanisms for estimating the sparsity level online, based on information theoretic criteria, namely predictive least squares and the Bayesian information criterion.
The main contributions are the development of the adaptive greedy algorithms and the use of information theoretic criteria that enable the algorithms to behave robustly. The algorithms have good performance, require limited prior information, and are computationally efficient. Generally, the configuration parameters, when they exist, can be chosen easily as a tradeoff between stationary error and convergence speed.
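
As an illustration of the first algorithm family, the following Python sketch (numpy only, hypothetical helper name, illustrative parameter choices) combines matching-pursuit-style support selection with cyclic coordinate-descent refits on a batch least squares problem; the adaptive algorithms of the thesis spread updates of this kind over time as new data arrive.

```python
import numpy as np

def cyclic_mp_coordinate_descent(A, y, k, sweeps=3):
    """Greedy sparse least-squares sketch (hypothetical helper).

    Matching-pursuit-style support selection followed by cyclic
    coordinate-descent refits of the active coefficients, for the batch
    problem min ||y - A x||^2 s.t. x is k-sparse. A is the (m, n)
    dictionary, y the observation vector.
    """
    m, n = A.shape
    x = np.zeros(n)
    support = []
    residual = y.copy()
    # Greedy support growth: pick the column most correlated with the residual.
    for _ in range(k):
        corr = A.T @ residual
        corr[support] = 0.0                      # do not reselect active atoms
        j = int(np.argmax(np.abs(corr)))
        support.append(j)
        x[j] = corr[j] / (A[:, j] @ A[:, j])     # one-dimensional LS step
        residual = y - A[:, support] @ x[support]
    # Cyclic coordinate-descent sweeps over the current support.
    for _ in range(sweeps):
        for j in support:
            residual += A[:, j] * x[j]           # remove atom j's contribution
            x[j] = (A[:, j] @ residual) / (A[:, j] @ A[:, j])
            residual -= A[:, j] * x[j]
    return x, support
```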

    Learning for informative path planning

    Thesis (S.M.) -- Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (p. 104-108). Through the combined use of regression techniques, we learn models of the uncertainty propagation efficiently and accurately to replace computationally intensive Monte Carlo simulations in informative path planning. This enables us to decrease the uncertainty of the weather estimates more than current methods do, by allowing the evaluation of many more candidate paths with the same amount of resources. The learning method and the path planning method are validated by numerical experiments using the Lorenz-2003 model [32], an idealized weather model. By Sooho Park. S.M.
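
The core idea, replacing per-path Monte Carlo evaluation with a learned regression surrogate, can be sketched as follows; `path_features` and `monte_carlo_uncertainty` are hypothetical stand-ins for the thesis's path encoding and weather-uncertainty simulation, and the Gaussian-process regressor is only one possible choice of model.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def train_surrogate(training_paths, monte_carlo_uncertainty, path_features):
    """Fit a regression model mapping path features to the uncertainty
    reduction that an expensive Monte Carlo simulation would report."""
    X = np.array([path_features(p) for p in training_paths])
    y = np.array([monte_carlo_uncertainty(p) for p in training_paths])
    return GaussianProcessRegressor().fit(X, y)

def best_path(candidate_paths, surrogate, path_features):
    """Rank many candidate paths cheaply with the learned model."""
    X = np.array([path_features(p) for p in candidate_paths])
    predicted = surrogate.predict(X)           # cheap per-path prediction
    return candidate_paths[int(np.argmin(predicted))]
```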

    Energy-aware Sparse Sensing of Spatial-temporally Correlated Random Fields

    This dissertation focuses on the development of theories and practices for energy-aware sparse sensing of random fields that are correlated in the space and/or time domains. The objective of sparse sensing is to reduce the number of sensing samples in the space and/or time domains, and thus reduce the energy consumption and complexity of the sensing system. Both centralized and decentralized sensing schemes are considered in this dissertation. First, we study the problem of energy-efficient level set estimation (LSE) of random fields correlated in time and/or space under a total power constraint. We consider uniform sampling schemes for a sensing system with a single sensor and for a linear sensor network with sensors distributed uniformly on a line, where sensors employ a fixed sampling rate to minimize the long-term LSE error probability. Exact analytical cost functions of these sampling schemes, and their respective upper bounds, are developed by using an optimum thresholding-based LSE algorithm. The design parameters of the sampling schemes are optimized by minimizing their respective cost functions. With the analytical results, we can identify the optimum sampling period and/or node distance that minimizes the LSE error probability. Second, we propose active sparse sensing schemes with LSE of a spatially and temporally correlated random field by using a limited number of spatially distributed sensors. In these schemes, a central controller dynamically selects a limited number of sensing locations according to the information revealed by past measurements, with the objective of minimizing the expected level set estimation error. The expected estimation error probability is explicitly expressed as a function of the selected sensing locations, and the results are used to formulate the optimal sensing location selection problem as a combinatorial problem. Two low-complexity greedy algorithms are developed by using analytical upper bounds of the expected estimation error probability. Lastly, we study the distributed estimation of a spatially correlated random field with decentralized wireless sensor networks (WSNs). We propose a distributed iterative estimation algorithm that defines the procedures for both information propagation and local estimation in each iteration. The key parameters of the algorithm, including an edge weight matrix and a sample weight matrix, are designed following asymptotically optimum criteria. It is shown that the asymptotically optimum performance can be achieved by distributively projecting the measurement samples into a subspace related to the covariance matrices of the data and noise samples.
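
The active sensing schemes select sensing locations greedily against an analytical upper bound on the estimation error; a minimal sketch of that selection loop, assuming a generic `error_bound` oracle in place of the dissertation's derived bounds:

```python
def greedy_select_locations(candidates, budget, error_bound):
    """Greedy sensing-location selection sketch.

    At each step, add the candidate location that most reduces an
    (assumed given) upper bound on the expected LSE error probability.
    `error_bound(selected)` is a hypothetical oracle standing in for the
    analytical bounds derived in the dissertation.
    """
    selected = []
    for _ in range(budget):
        best = min(
            (c for c in candidates if c not in selected),
            key=lambda c: error_bound(selected + [c]),
        )
        selected.append(best)
    return selected
```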

    Proximal Bellman mappings for reinforcement learning and their application to robust adaptive filtering

    This paper aims at the algorithmic/theoretical core of reinforcement learning (RL) by introducing the novel class of proximal Bellman mappings. These mappings are defined in reproducing kernel Hilbert spaces (RKHSs) to benefit from the rich approximation properties and inner product of RKHSs. They are shown to belong to the powerful Hilbertian family of (firmly) nonexpansive mappings, regardless of the values of their discount factors, and possess ample degrees of design freedom to even reproduce attributes of the classical Bellman mappings and to pave the way for novel RL designs. An approximate policy-iteration scheme is built on the proposed class of mappings to solve the problem of selecting online, at every time instance, the "optimal" exponent p in a p-norm loss to combat outliers in linear adaptive filtering, without training data and without any knowledge of the statistical properties of the outliers. Numerical tests on synthetic data showcase the superior performance of the proposed framework over several non-RL and kernel-based RL schemes. Comment: arXiv admin note: text overlap with arXiv:2210.1175
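
The decision the scheme automates can be illustrated, in much simplified form, by a p-norm ("least mean p-power") adaptive filter whose exponent is picked at each step from a small candidate grid. The toy Python sketch below is not the paper's RKHS-based policy-iteration method; it only shows the kind of online choice the paper addresses, with a naive greedy heuristic standing in for the learned policy.

```python
import numpy as np

def pnorm_lms_step(w, x, d, p, mu=0.05):
    """One least-mean-p-power update: gradient step on |e|^p with e = d - w.x.
    For p < 2, large errors are down-weighted, which is robust to outliers."""
    e = d - w @ x
    return w + mu * p * np.abs(e) ** (p - 1) * np.sign(e) * x

def run_filter(X, d, p_grid=(1.0, 1.5, 2.0)):
    """Toy online choice of the exponent p (not the paper's method).

    At each step every candidate p produces a tentative update, and the
    one with the smallest error on the next sample is kept; the paper
    makes this choice with an approximate policy-iteration scheme built
    on proximal Bellman mappings instead of this greedy heuristic.
    """
    n, m = X.shape
    w = np.zeros(m)
    for t in range(n - 1):
        trials = [pnorm_lms_step(w, X[t], d[t], p) for p in p_grid]
        w = min(trials, key=lambda wt: abs(d[t + 1] - wt @ X[t + 1]))
    return w
```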

    Fundus image analysis for automatic screening of ophthalmic pathologies

    In recent years, the number of blindness cases has been significantly reduced. Despite this promising news, the World Health Organisation estimates that 80% of cases of visual impairment (285 million in 2010) could be avoided if diagnosed and treated early. To accomplish this, eye care services need to be established in primary health care, and screening campaigns should become a common task in centres with people at risk. However, these solutions entail a high workload for experts trained in the analysis of the anomalous patterns of each eye disease. Therefore, the development of algorithms for automatic screening systems plays a vital role in this field. This thesis focuses on the automatic identification of the retinal damage provoked by two of the most common pathologies in current society: diabetic retinopathy (DR) and age-related macular degeneration (AMD). Specifically, the final goal of this work is to develop novel methods, based on fundus image description and classification, to characterise healthy and abnormal tissue in the retina background. In addition, pre-processing algorithms are proposed with the aim of normalising the high variability of fundus images and removing the contribution of retinal structures that could hinder retinal damage detection. In contrast to most state-of-the-art work on damage detection in fundus images, the methods proposed throughout this manuscript avoid the need for lesion segmentation or candidate-map generation before the classification stage. Local binary patterns, granulometric profiles and fractal dimension are computed locally to extract texture, morphological and roughness information from retinal images. Different combinations of this information feed advanced classification algorithms formulated to optimally discriminate between exudates, microaneurysms, haemorrhages and healthy tissue. Through several experiments, the ability of the proposed system to identify DR and AMD signs is validated using different public databases with a large degree of variability and without image exclusion. Moreover, this thesis covers the basics of the deep learning paradigm. In particular, a novel approach based on convolutional neural networks (CNNs) is explored. The transfer learning technique is applied to fine-tune the most important state-of-the-art CNN architectures. Exudate detection and localisation tasks using neural networks are carried out in the last two experiments of this thesis. An objective comparison is established between the hand-crafted feature extraction and classification process and the predictions of the best CNN-based model. The promising results of this PhD thesis, together with the affordable cost and portability of retinal cameras, could facilitate the incorporation of the developed algorithms into a computer-aided diagnosis (CAD) system that helps specialists detect the anomalous patterns characteristic of the two diseases under study: DR and AMD.
Colomer Granero, A. (2018). Fundus image analysis for automatic screening of ophthalmic pathologies [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/99745
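
A hedged sketch of the hand-crafted branch of such a pipeline, local texture description followed by patch-level classification, using uniform LBP histograms; the patch size, P and R values are illustrative assumptions, not the thesis's configuration.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_patch_features(green_channel, patch=64, P=8, R=1):
    """Describe a fundus image by local LBP histograms.

    Each non-overlapping patch of the green channel is summarized by the
    histogram of its uniform LBP codes. Patch size, P and R are
    illustrative choices, not the configuration used in the thesis.
    """
    lbp = local_binary_pattern(green_channel, P, R, method="uniform")
    n_bins = P + 2                     # uniform LBP codes lie in [0, P + 1]
    feats = []
    h, w = lbp.shape
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            block = lbp[i:i + patch, j:j + patch]
            hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins), density=True)
            feats.append(hist)
    return np.array(feats)             # one row of texture features per patch

# A classifier (e.g., scikit-learn's RandomForestClassifier) would then be
# trained on such patch descriptors, labelled as healthy or pathological
# tissue from an annotated public database.
```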

    EXTRACTING NEURONAL DYNAMICS AT HIGH SPATIOTEMPORAL RESOLUTIONS: THEORY, ALGORITHMS, AND APPLICATION

    Analyses of neuronal activity have revealed that various types of neurons, both at the single-unit and population levels, undergo rapid dynamic changes in their response characteristics and their connectivity patterns in order to adapt to variations in the behavioral context or stimulus condition. In addition, these dynamics often admit parsimonious representations. Despite growing advances in neural modeling and data acquisition technology, a unified signal processing framework capable of capturing the adaptivity, sparsity and statistical characteristics of neural dynamics is lacking. The objective of this dissertation is to develop such a signal processing methodology in order to gain a deeper insight into the dynamics of neuronal ensembles underlying behavior, and consequently a better understanding of how the brain functions. The first part of this dissertation concerns the dynamics of stimulus-driven neuronal activity at the single-unit level. We develop a sparse adaptive filtering framework for the identification of neuronal response characteristics from spiking activity. We present a rigorous theoretical analysis of our proposed sparse adaptive filtering algorithms and characterize their performance guarantees. Application of our algorithms to experimental data provides new insights into the dynamics of attention-driven neuronal receptive field plasticity, with a substantial increase in temporal resolution. In the second part, we focus on the network-level properties of neuronal dynamics, with the goal of identifying the causal interactions within neuronal ensembles that underlie behavior. Building on the results of the first part, we introduce a new measure of causality, namely Adaptive Granger Causality (AGC), which captures the sparsity and dynamics of the causal influences in a neuronal network in a statistically robust and computationally efficient fashion. We develop a precise statistical inference framework for the estimation of AGC from simultaneous recordings of the activity of neurons in an ensemble. Finally, in the third part we demonstrate the utility of the proposed methodologies through application to synthetic and real data. We first validate our theoretical results using comprehensive simulations, and assess the performance of the proposed methods in terms of estimation accuracy and tracking capability. These results confirm that our algorithms provide significant gains over existing techniques. Furthermore, we apply our methodology to various experimentally recorded data from electrophysiology and optical imaging: 1) application of our methods to simultaneous spike recordings from the ferret auditory and prefrontal cortical areas reveals the dynamics of top-down and bottom-up functional interactions underlying attentive behavior at unprecedented spatiotemporal resolutions; 2) our analyses of two-photon imaging data from the mouse auditory cortex shed light on the sparse dynamics of functional networks under both spontaneous activity and auditory tone detection tasks; and 3) application of our methods to whole-brain light-sheet imaging data from larval zebrafish reveals unique insights into the organization of functional networks involved in visuo-motor processing.
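
The flavour of sparse adaptive filtering in the first part can be conveyed by a minimal online proximal-gradient sketch under a linear-Gaussian simplification; the dissertation's estimators operate on point-process (spiking) models and come with the performance guarantees described above, none of which this toy version claims.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the l1 penalty."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_adaptive_filter(stimuli, responses, lam=0.01, mu=0.01):
    """Minimal sketch of sparse adaptive receptive-field tracking.

    Online proximal-gradient updates of a linear receptive field w: a
    stochastic gradient step on the squared prediction error followed by
    soft-thresholding, which keeps the estimate sparse and lets it track
    slow changes. (Linear-Gaussian simplification, for illustration only.)
    """
    n, m = stimuli.shape
    w = np.zeros(m)
    trajectory = np.zeros((n, m))
    for t in range(n):
        x, y = stimuli[t], responses[t]
        grad = -(y - w @ x) * x              # gradient of 0.5 * (y - w.x)^2
        w = soft_threshold(w - mu * grad, mu * lam)
        trajectory[t] = w
    return trajectory
```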

    A compressed sensing approach to block-iterative equalization: connections and applications to radar imaging reconstruction

    The proliferation of underdetermined systems has brought forth a variety of new algorithmic solutions that capitalize on the compressed sensing (CS) of sparse data. Well-known greedy or iterative-thresholding CS recursions take the form of an adaptive filter followed by a proximal operator, which is no different in spirit from the role of block-iterative decision-feedback equalizers (BI-DFE), where structure is roughly exploited by the signal-constellation slicer. By taking advantage of the intrinsic sparsity of signal modulations in a communications scenario, interblock interference (IBI) can be handled more effectively in light of CS concepts, whereby the optimal feedback of detected symbols is devised adaptively. The new DFE takes the form of a more efficient re-estimation scheme, proposed under recursive-least-squares-based adaptations. Whenever suitable, these recursions are derived under a reduced-complexity, widely-linear formulation, which further reduces the minimum mean-square error (MMSE) in comparison with traditional strictly-linear approaches. Besides maximizing system throughput, the new algorithms exhibit significantly higher performance compared to existing methods. Our reasoning also shows that a properly formulated BI-DFE turns out to be a powerful CS algorithm in itself. A new algorithm, referred to as CS-Block DFE (CS-BDFE), exhibits improved convergence and detection compared to first-order methods, thus outperforming the state-of-the-art Complex Approximate Message Passing (CAMP) recursions. The merits of the new recursions are illustrated under a novel 3D MIMO radar formulation, where the CAMP algorithm is shown to fail with respect to important performance measures.
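
The "adaptive filter followed by a proximal operator" structure mentioned above is exactly the form of the classical iterative soft-thresholding recursion, sketched below for reference; the proposed CS-BDFE replaces the plain threshold with a constellation-aware decision device and RLS-type adaptation, which this generic sketch does not attempt to reproduce.

```python
import numpy as np

def ista(A, y, lam=0.1, iters=200):
    """Iterative soft-thresholding (ISTA) sketch for min ||y - Ax||^2 + lam*||x||_1.

    Each iteration is a linear, filter-like gradient step followed by a
    proximal (soft-thresholding) step that enforces sparsity.
    """
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2        # 1 / Lipschitz constant of the gradient
    for _ in range(iters):
        grad = A.T @ (A @ x - y)                  # filtering / gradient step
        z = x - step * grad
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # proximal step
    return x
```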

    Soft-decision equalization techniques for frequency selective MIMO channels

    Multi-input multi-output (MIMO) technology is an emerging solution for high data rate wireless communications. We develop soft-decision based equalization techniques for frequency selective MIMO channels in the quest for low-complexity equalizers whose BER performance is competitive with that of ML sequence detection. We first propose soft decision equalization (SDE) and demonstrate that decision feedback equalization (DFE) based on soft decisions, expressed via the posterior probabilities associated with feedback symbols, is able to outperform hard-decision DFE, with a low computational cost that is polynomial in the number of symbols to be recovered and linear in the signal constellation size. Building upon the probabilistic data association (PDA) multiuser detector, we present two new MIMO equalization solutions to handle the distinctive channel memory. With their low complexity, simple implementation, and impressive near-optimum performance offered by iterative soft-decision processing, the proposed SDE methods are attractive candidates to deliver efficient reception solutions to practical high-capacity MIMO systems. Motivated by the need for low-complexity receiver processing, we further present an alternative low-complexity soft-decision equalization approach for frequency selective MIMO communication systems. With the help of iterative processing, two detection and estimation schemes based on second-order statistics are harmoniously put together to yield a two-part receiver structure: local multiuser detection (MUD) using soft-decision probabilistic data association (PDA) detection, and dynamic noise-interference tracking using Kalman filtering. The proposed Kalman-PDA detector performs local MUD within a sub-block of the received data instead of over the entire data set, to reduce the computational load. At the same time, all the interference affecting the local sub-block, including both multiple access and inter-symbol interference, is properly modeled as the state vector of a linear system and dynamically tracked by Kalman filtering. Two types of Kalman filters are designed, both of which are able to track a finite impulse response (FIR) MIMO channel of any memory length. The overall algorithms enjoy low complexity that is only polynomial in the number of information-bearing bits to be detected, regardless of the data block size. Furthermore, we introduce two optional performance-enhancing techniques: cross-layer automatic repeat request (ARQ) for uncoded systems and a code-aided method for coded systems. We take Kalman-PDA as an example and show via simulations that both techniques yield error performance that is better than Kalman-PDA alone and competitive with sphere decoding. Finally, we consider the case in which channel state information (CSI) is not perfectly known to the receiver, and present an iterative channel estimation algorithm. Simulations show that the performance of SDE with channel estimation approaches that of SDE with perfect CSI.
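
The soft decisions at the heart of these equalizers are posterior-weighted symbol estimates; a minimal sketch with toy QPSK posteriors (illustrative values only, not the full iterative SDE/PDA receiver):

```python
import numpy as np

def soft_symbol(posteriors, constellation):
    """Posterior-mean ("soft") symbol used for decision feedback.

    posteriors : probability of each constellation point for one symbol.
    Feeding back this expectation instead of the hard decision is the core
    idea behind soft-decision DFE; the full SDE/PDA receivers also iterate
    and refine these posteriors, which is omitted here.
    """
    return np.dot(posteriors, constellation)

# Toy QPSK example (illustrative values only):
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
p = np.array([0.70, 0.15, 0.10, 0.05])
hard = qpsk[np.argmax(p)]        # hard decision
soft = soft_symbol(p, qpsk)      # soft decision, hedges against symbol errors
```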

    Parallelizing Set Similarity Joins

    One of today's major challenges in data science is to compare and relate data of similar nature. The join operation known from relational databases can help solve this problem: given a collection of records, the join finds all pairs of records that fulfill a user-chosen predicate. Real-world problems may require complex predicates, such as similarity, and a common way to measure similarity is with set similarity functions. In order to use set similarity functions as predicates, we assume records to be represented by sets of tokens. In this thesis, we focus on the set similarity join (SSJ) operation. The amount of data to be processed today is typically large and grows continually, while the SSJ is a compute-intensive operation. To cope with the increasing size of input data, additional means are needed to develop scalable implementations of SSJ. In this thesis, we focus on parallelization. We make the following three major contributions to SSJ. First, we elaborate on the state of the art in parallelizing SSJ. We compare ten MapReduce-based approaches from the literature, both analytically and experimentally. Surprisingly, their main limitation is low scalability, caused by excessive and/or skewed data replication; none of the approaches could compute the join on large datasets. Second, we leverage the abundant CPU parallelism of modern commodity hardware, which had not previously been exploited to scale SSJ, and propose a novel data-parallel multi-threaded SSJ. Our approach provides significant speedups compared to single-threaded execution. Third, we propose a novel, highly scalable distributed SSJ approach. With a cost-based heuristic and a data-independent scaling mechanism, we avoid data replication and recomputation; the heuristic assigns similar shares of the compute cost to each node. Our approach significantly scales up the join execution and processes much larger datasets than all parallel approaches designed and implemented so far.
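
For context, the sequential operation being parallelized can be sketched with the standard prefix-filtering approach to a Jaccard-threshold SSJ; this is illustrative background only, not the thesis's multi-threaded or distributed algorithm.

```python
import math
from collections import defaultdict

def jaccard(a, b):
    return len(a & b) / len(a | b)

def set_similarity_join(records, threshold):
    """Sequential SSJ sketch using prefix filtering.

    records : list of token sets. Returns all record-id pairs whose
    Jaccard similarity reaches the threshold. Tokens are put into a
    canonical (sorted) order and only the first
    len - ceil(threshold * len) + 1 tokens of each record (its "prefix")
    are indexed; two records can only satisfy the threshold if their
    prefixes share at least one token, which prunes most pairs.
    """
    index = defaultdict(set)           # token -> ids of records whose prefix contains it
    result = []
    for i, rec in enumerate(records):
        tokens = sorted(rec)
        prefix_len = len(tokens) - math.ceil(threshold * len(tokens)) + 1
        candidates = set()
        for tok in tokens[:prefix_len]:
            candidates |= index[tok]   # earlier records sharing a prefix token
            index[tok].add(i)
        for j in candidates:
            if jaccard(rec, records[j]) >= threshold:
                result.append((j, i))
    return result
```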