
    Software library for stream-based recommender systems

    Traditionally, a machine learning algorithm learns from a previously built and curated data set. One can also analyze that data set using data mining techniques and draw conclusions from it. Both concepts have numerous applications worldwide, from medical diagnosis to speech recognition or even search engine queries. Traditionally, however, the data set is assumed to be available at all times. That is not necessarily true with modern data, as distributed-systems applications receive and process millions of data streams within a limited fraction of time. Techniques are therefore needed to mine and process these data streams within a limited time period, with good results and effective scaling as the data grows. One specific use case for analyzing data and predicting future outcomes is recommender systems. Several online services use recommender systems to deliver personalized content to their users; in many cases, recommendations are among the most effective traffic generators in such services. The problem lies in finding the small subset of items in a system that best matches each user's personal preferences, through the analysis of a very large amount of historical data. The problem becomes more interesting when treated generically rather than for a specific case, that is, when a library is built so that it can be extended and used as a tool to help build a system for a particular use case. Solutions can be distinguished as exact or statistically similar. Given the large amount of data available, reprocessing all the data every time new data arrives would not be feasible; instead, incremental algorithms are used to process incoming data and keep the recommendation model up to date. The purpose of this work is to implement a library that contains and evaluates these incremental approaches to recommendation, since existing ones are task-specific.
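    As a concrete illustration of the incremental approach just described, here is a minimal sketch of stream-based matrix factorization: each incoming (user, item, rating) event triggers a single SGD step, so the model stays current without reprocessing historical data. This is not the library developed in this work; the class name and hyperparameters (IncrementalMF, n_factors, lr, reg) are illustrative assumptions.

```python
import numpy as np

class IncrementalMF:
    """Matrix factorization trained one event at a time: a single SGD
    step per (user, item, rating) keeps the model current without
    reprocessing historical data."""

    def __init__(self, n_factors=10, lr=0.05, reg=0.02, seed=0):
        self.k, self.lr, self.reg = n_factors, lr, reg
        self.rng = np.random.default_rng(seed)
        self.P, self.Q = {}, {}  # latent vectors, grown lazily per user/item

    def _vec(self, table, key):
        if key not in table:
            table[key] = self.rng.normal(0.0, 0.1, self.k)
        return table[key]

    def update(self, user, item, rating):
        p, q = self._vec(self.P, user), self._vec(self.Q, item)
        err = rating - p @ q
        p_old = p.copy()                          # use pre-update p for q's step
        p += self.lr * (err * q - self.reg * p)   # in-place: dict entry is updated
        q += self.lr * (err * p_old - self.reg * q)

    def recommend(self, user, candidates, n=5):
        p = self._vec(self.P, user)
        return sorted(candidates, key=lambda i: -(p @ self._vec(self.Q, i)))[:n]

# Feed a stream of events; the model is usable at any point in the stream.
model = IncrementalMF()
for user, item, rating in [("u1", "a", 5.0), ("u1", "b", 1.0), ("u2", "a", 4.0)]:
    model.update(user, item, rating)
print(model.recommend("u2", ["a", "b"], n=1))
```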

    A Scalable Clustering Algorithm for High-dimensional Data Streams over Sliding Windows

    Ph.D. dissertation, Department of Electrical and Computer Engineering, College of Engineering, Seoul National University, August 2017 (Sang-goo Lee). Data stream clustering over sliding windows generates clustering results whenever the window moves. However, iterative clustering using all data in a window is highly inefficient in terms of memory and computation time. In this thesis, we address the problem of data stream clustering over sliding windows using sliding-window aggregation and nearest neighbor search techniques. Our algorithm constructs and maintains temporal group features as a summary of the window using the sliding-window aggregation technique: it divides a window into disjoint chunks, computes partial aggregates over each chunk, and merges the partial aggregates to compute overall aggregates. To keep the summary at a constant size, the algorithm shrinks it by merging nearest neighbors, and it exploits Locality-Sensitive Hashing for fast nearest neighbor search. We show that Locality-Sensitive Hashing can serve as an effective method for reducing synopses while minimizing the impact on quality. In addition, we suggest a re-clustering policy that decides whether to append the new summary to pre-existing clusters or to re-cluster the whole summary. Our experiments on real-world and synthetic datasets demonstrate that our algorithm achieves a significant improvement when performing continuous clustering on data streams with sliding windows.
    Contents: 1. Introduction; 2. Preliminaries and Related Work (Data Streams; Window Models; k-Means Clustering; Coreset; Group Features; Related Work; Problem Statement); 3. GFCS: Group Feature-based Data Stream Clustering with Sliding Windows (2-Level Coresets Construction; 2-Level Coresets Maintenance; Clustering on 2-Level Coresets); 4. CSCS: Coreset-based Data Stream Clustering with Sliding Windows (Coreset Construction based on Nearest Neighbor Search; Coreset Construction based on Locality-Sensitive Hashing; Re-clustering Policy); 5. Empirical Evaluation of Data Stream Clustering with Sliding Windows (Experimental Setup; Experimental Results); 6. Application: Document Clustering (Vector Representation of Documents; Extension to Other Clustering Algorithms; Evaluation); 7. Conclusion; Appendix (Experimental Results of GFCS and CSCS; Experimental Results of Document Clustering).
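    The chunk-and-merge scheme the abstract describes can be sketched directly: the window is split into disjoint chunks, each chunk holds a partial group feature (count, linear sum, squared sum), window-wide statistics are obtained by merging the partials, and sliding drops the oldest chunk. This is a minimal illustration of that mechanism, not the thesis's GFCS/CSCS implementation; all names below are assumed.

```python
from collections import deque
import numpy as np

class ChunkedWindowGF:
    """Group features (count N, linear sum LS, squared sum SS) over a
    sliding window, maintained per chunk: the window is a fixed number of
    disjoint chunks, each holding a partial aggregate; merging the partials
    yields window-wide statistics, and sliding drops the oldest chunk."""

    def __init__(self, n_chunks, chunk_size, dim):
        self.chunk_size, self.dim = chunk_size, dim
        self.chunks = deque(maxlen=n_chunks)  # appending past maxlen evicts oldest
        self._start_chunk()

    def _start_chunk(self):
        self.chunks.append([0, np.zeros(self.dim), np.zeros(self.dim)])  # N, LS, SS

    def insert(self, x):
        if self.chunks[-1][0] == self.chunk_size:
            self._start_chunk()                # window slides by whole chunks
        cf = self.chunks[-1]
        x = np.asarray(x, dtype=float)
        cf[0] += 1
        cf[1] += x
        cf[2] += x * x

    def window_mean(self):
        n = sum(cf[0] for cf in self.chunks)   # merge the partial aggregates
        ls = sum((cf[1] for cf in self.chunks), np.zeros(self.dim))
        return ls / n

win = ChunkedWindowGF(n_chunks=4, chunk_size=250, dim=2)   # window of ~1000 points
for t in range(5000):
    win.insert([t % 7, t % 11])
print(win.window_mean())
```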

    On algorithms for large-scale graph and clustering problems

    This thesis is concerned with algorithmic methods of modern data analysis and treats two overarching topics: data stream algorithms with compression properties and approximation algorithms for clustering. Data stream algorithms process a data set sequentially and aim to determine properties of the data set (approximately) without storing it in its entirety. Clustering is the partitioning of a data set into groups. The first problem considered is matching in graphs, where the input consists of a sequence of edge insertions and deletions and the task is to determine the size of the maximum matching as accurately as possible. An algorithm is presented that, under the assumption that the matching has size at most k, determines the exact size using k² units of memory. This algorithm can also be used to obtain a constant-factor approximation of the matching size in planar graphs. In addition, lower bounds on the required memory are established, and a reduction from weighted to unweighted matching is given. Next, data stream algorithms for similarity search are considered, where the task is, given n sets, to find the pairs with high similarity in near-linear time; the similarity measure for two sets A and B is the Jaccard index |A ∩ B| / |A ∪ B|. The thesis describes a data structure that accomplishes this for the first time on dynamic data streams with low memory consumption, using random numbers with only 2-wise independence, which permits a very efficient implementation. The third problem lies at the intersection of the two topics of this thesis and concerns the k-center clustering problem in data streams with a sliding window. The task is to find k centers such that the maximum distance of any point to its nearest center is minimized. The results are a 6-approximation algorithm for arbitrary k and an optimal 4-approximation algorithm for k = 2; the techniques developed also apply to the diameter problem and yield an optimal algorithm for it. Next, clustering problems with respect to the Jaccard distance are analyzed: given a collection N of subsets of a universe U, the task is to find a set C that minimizes the maximum, over all X in N, of 1 − |X ∩ C| / |X ∪ C|. It is shown that while solving the problem exactly is NP-hard, it admits a PTAS. Finally, the widely used local search heuristic for k-median and k-means clustering is examined. Although these problems are in general hard to solve exactly or even approximately, they are considered relatively tractable in practice, which suggests that the hardness results rest on pathological inputs. Motivated by this discrepancy, there have been attempts to characterize practically relevant data sets. For three of the most important such characterizations, the behavior of a local search heuristic is studied, with the result that in these cases it finds optimal or near-optimal clusterings.
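    One classical building block for the Jaccard similarity search mentioned above is min-wise hashing with 2-wise independent hash functions of the form h(x) = (ax + b) mod p; the sketch below estimates |A ∩ B| / |A ∪ B| from the fraction of agreeing minima. This is a static illustration only, with assumed names and parameters; the thesis's actual contribution, supporting dynamic insert/delete streams with low memory, is not attempted here.

```python
import random

class MinHash:
    """Min-wise hashing sketch for the Jaccard index |A ∩ B| / |A ∪ B|.
    Each hash is h(x) = (a*x + b) mod p with p prime -- a 2-wise
    independent family, matching the weak independence noted above."""

    P = (1 << 61) - 1  # Mersenne prime; must exceed every element id

    def __init__(self, n_hashes=256, seed=0):
        rng = random.Random(seed)
        self.coeffs = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(n_hashes)]

    def signature(self, items):  # items: iterable of non-negative ints
        return [min((a * x + b) % self.P for x in items)
                for a, b in self.coeffs]

    @staticmethod
    def estimate(sig_a, sig_b):
        # The collision probability of one min-hash equals the Jaccard
        # index, so the fraction of agreeing components estimates it.
        return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

mh = MinHash()
a, b = set(range(100)), set(range(50, 150))   # true Jaccard = 50/150 ≈ 0.33
print(MinHash.estimate(mh.signature(a), mh.signature(b)))
```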

    Heterogeneous Computing for Data Stream Mining

    Graphics Processing Units are the de-facto standard for accelerating data-parallel tasks in high-performance computing, and they are widely used to accelerate batch machine learning algorithms. High-end discrete GPUs are characterized by a very high number of cores (thousands), high-bandwidth memory optimized for streaming access, and high power requirements. Integrated GPUs are characterized by a moderate number of cores (hundreds), medium-bandwidth memory shared with the CPU and optimized for random access, and low power requirements. Data stream processing applications are often required to respond within a limited time frame, operate on data in relatively small increments, and meet strict power budgets when deployed on embedded devices. This work evaluates the performance of integrated and discrete GPUs belonging to the same chip family on several variants of the k-nearest neighbours algorithm over a sliding window and on stochastic gradient descent, using OpenCL and the novel Heterogeneous System Architecture platform. We conclude that integrated GPUs provide a niche solution catering to small work sizes, offering better power efficiency and simplicity of deployment.
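    The data-parallel core of windowed k-NN, the part a GPU kernel would take over, can be sketched in a few lines: all distances from the query to the window are computed in one vectorized, embarrassingly parallel operation. The sketch below uses NumPy rather than OpenCL for brevity, and the class and parameter names are illustrative assumptions rather than this work's implementation.

```python
from collections import Counter, deque
import numpy as np

class SlidingWindowKNN:
    """k-NN classification over a sliding window. The distance step is a
    single vectorized broadcast over the whole window -- exactly the
    data-parallel workload a GPU kernel would accelerate."""

    def __init__(self, k=5, window=1000):
        self.k = k
        self.buf = deque(maxlen=window)        # oldest example evicted on overflow

    def learn(self, x, y):
        self.buf.append((np.asarray(x, dtype=float), y))

    def predict(self, x):
        X = np.stack([ex for ex, _ in self.buf])
        d = np.linalg.norm(X - np.asarray(x, dtype=float), axis=1)  # one parallel pass
        idx = np.argsort(d)[: self.k]
        votes = Counter(self.buf[i][1] for i in idx)
        return votes.most_common(1)[0][0]

clf = SlidingWindowKNN(k=3, window=500)
for i in range(600):
    clf.learn([i % 10, (3 * i) % 10], i % 2)   # toy labeled stream
print(clf.predict([2.0, 6.0]))
```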

    Cloud-Scale Entity Resolution: Current State and Open Challenges

    Entity resolution (ER) is the process of identifying records in information systems that refer to the same real-world entity. Because data volumes have grown so large over the past two decades, parallel techniques are called upon to satisfy ER's requirements of high performance and scalability. The development of parallel ER has reached a relatively mature stage and has found its way into several applications. In this work, we first comprehensively survey the state of the art in parallel ER approaches. From this overview, we extract classification criteria for parallel ER and then classify and compare the approaches against these criteria. Finally, we identify open research questions and challenges and discuss potential solutions and directions for further research in this field.
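    A minimal sketch of the pattern underlying most parallel ER systems of the kind surveyed here: a blocking step prunes the quadratic comparison space, and the surviving candidate pairs are matched in parallel across workers. The records, block key, similarity function, and threshold below are illustrative assumptions, not taken from any particular surveyed system.

```python
from itertools import combinations
from multiprocessing import Pool

RECORDS = [
    {"id": 1, "name": "John A Smith", "zip": "10001"},
    {"id": 2, "name": "John Smith",   "zip": "10001"},
    {"id": 3, "name": "Ann Lee",      "zip": "94103"},
]

def block_key(rec):
    # Blocking: only records sharing a key are ever compared, pruning
    # the quadratic comparison space before the parallel match step.
    return rec["zip"]

def candidate_pairs(records):
    blocks = {}
    for r in records:
        blocks.setdefault(block_key(r), []).append(r)
    for group in blocks.values():
        yield from combinations(group, 2)

def match(pair, threshold=0.5):
    a, b = pair
    ta, tb = set(a["name"].lower().split()), set(b["name"].lower().split())
    jaccard = len(ta & tb) / len(ta | tb)      # token-set similarity
    return (a["id"], b["id"]) if jaccard >= threshold else None

if __name__ == "__main__":
    with Pool() as pool:                       # comparisons fan out across workers
        results = pool.map(match, candidate_pairs(RECORDS))
    print([m for m in results if m])           # -> [(1, 2)]
```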