
    On the efficiency of finding and using tabular data summaries: scalability, accuracy, and hardness

    Tabular data is ubiquitous in modern computer science. However, these tables can be so large that computing statistics over them is inefficient in both time and space. This thesis is concerned with finding and using small summaries of large tables for scalable and accurate approximation of the data's properties, or with showing that such a summary is hard to obtain in small space. This perspective yields the following results (a code sketch of the third one follows the list):
    • We introduce projected frequency analysis over an n × d binary table. If the set S of query columns is revealed only after the data has been observed, we show that space exponential in d is required for a constant-factor approximation to statistics such as the number of distinct elements on the columns S. We present algorithms that use less space than the brute-force approach while tolerating some superconstant error in the frequency estimation.
    • We find small-space deterministic summaries for a variety of linear algebraic problems in all p-norms for p ≥ 1. These include finding rows of high leverage, subspace embedding, regression, and low-rank approximation.
    • We implement and compare various summary techniques for efficient training of large-scale regression models. We show that a sparse random projection can lead to fast model training despite weaker theoretical guarantees than its dense competitors. For ridge regression we show that a deterministic summary can reduce the number of gradient steps needed to train the model compared to random projections. We demonstrate the practicality of our approaches through various experiments, showing that small-space summaries can lead to close-to-optimal solutions.
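    As an illustration of the last point, here is a minimal sketch-and-solve example for least-squares regression using a CountSketch-style sparse random projection, which touches each nonzero of the data only once. The function name and problem sizes are illustrative, not the thesis's code.

        import numpy as np

        def sparse_sketch(A, b, m, seed=None):
            """CountSketch-style projection: hash each row of (A, b) to one of
            m sketch rows with a random sign, costing O(nnz(A)) time."""
            rng = np.random.default_rng(seed)
            n = A.shape[0]
            rows = rng.integers(0, m, size=n)        # hash bucket per input row
            signs = rng.choice([-1.0, 1.0], size=n)  # random sign per input row
            SA = np.zeros((m, A.shape[1]))
            Sb = np.zeros(m)
            np.add.at(SA, rows, signs[:, None] * A)  # accumulate signed rows
            np.add.at(Sb, rows, signs * b)
            return SA, Sb

        # Sketch-and-solve: fit the small m x d sketched problem instead of n x d.
        rng = np.random.default_rng(0)
        n, d = 100_000, 50
        A = rng.standard_normal((n, d))
        b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
        SA, Sb = sparse_sketch(A, b, m=2_000, seed=1)
        x_sketch, *_ = np.linalg.lstsq(SA, Sb, rcond=None)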

    Doctor of Philosophy

    Matrices are essential data representations for many large-scale problems in data analytics; for example, in text analysis under the bag-of-words model, a large corpus of documents is often represented as a matrix. Many data analytic tasks rely on obtaining a summary (a.k.a. sketch) of the data matrix. Using this summary in place of the original data matrix saves on the space usage and run-time of machine learning algorithms. Therefore, sketching a matrix is often a necessary first step in data reduction, and it sometimes has direct relationships to core techniques including PCA, LDA, and clustering. In this dissertation, we study the problem of matrix sketching over data streams. We first describe a deterministic matrix sketching algorithm called FrequentDirections. The algorithm is presented with an arbitrary input matrix A ∈ R^{n×d} one row at a time. It performs O(dl) operations per row and maintains a sketch matrix B ∈ R^{l×d} such that for any k < l,

        ||A^T A − B^T B||_2 ≤ ||A − A_k||_F^2 / (l − k)   and   ||A − π_{B_k}(A)||_F^2 ≤ (1 + k/(l−k)) ||A − A_k||_F^2.

    Here A_k stands for the minimizer of ||A − X||_F over all rank-k matrices X (similarly B_k), and π_{B_k}(A) is the rank-k matrix resulting from projecting A onto the row span of B_k. We show that both of these bounds are the best possible for the space allowed, and that the sketch is mergeable and hence trivially parallelizable. We propose several variants of FrequentDirections that improve its error-size tradeoff and nearly match the simple heuristic Iterative SVD method in practice (a minimal implementation follows below). We then describe SparseFrequentDirections for sketching sparse matrices. It resembles the original algorithm in many ways, including having the same optimal asymptotic guarantees with respect to the space-accuracy tradeoff in the streaming setting, but unlike FrequentDirections, which runs in O(ndl) time, SparseFrequentDirections runs in Õ(nnz(A)l + nl^2) time. We then extend our methods to the distributed streaming model, in which m distributed sites each observe a distinct stream of data and each have a communication channel with a coordinator. The goal is to track an ε-approximation (for ε ∈ (0,1)) to the norm of the matrix along any direction. We present novel algorithms to address this problem. All our methods satisfy an additive error bound: for any unit vector x,

        | ||Ax||^2 − ||Bx||^2 | ≤ ε ||A||_F^2.
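    The streaming algorithm itself is short enough to sketch. Below is a minimal Python implementation of FrequentDirections using the common doubled-buffer variant (2l rows, shrink on overflow); it is illustrative, not the dissertation's code.

        import numpy as np

        def frequent_directions(A, l):
            """Maintain an l-row sketch B of a row stream A so that
            ||A^T A - B^T B||_2 <= ||A - A_k||_F^2 / (l - k) for k < l."""
            n, d = A.shape
            B = np.zeros((2 * l, d))
            next_free = 0
            for row in A:                              # one pass over the stream
                B[next_free] = row
                next_free += 1
                if next_free == 2 * l:                 # buffer full: shrink it
                    _, s, Vt = np.linalg.svd(B, full_matrices=False)
                    delta = s[l] ** 2 if s.size > l else 0.0
                    s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
                    B = np.zeros((2 * l, d))
                    B[: s.size] = s[:, None] * Vt      # rows >= l are now zero
                    next_free = l
            if next_free > l:                          # final shrink of leftovers
                _, s, Vt = np.linalg.svd(B[:next_free], full_matrices=False)
                delta = s[l] ** 2 if s.size > l else 0.0
                s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
                B = np.zeros((2 * l, d))
                B[: s.size] = s[:, None] * Vt
            return B[:l]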

    Privacy-preserving recommendation system using federated learning

    Federated Learning is a form of distributed learning which leverages edge devices for training. It aims to preserve privacy by communicating users' learning parameters and gradient updates to the global server during training while keeping the actual data on the users' devices. The training on the global server is performed on these parameters instead of directly on user data, while fine-tuning of the model can be done locally on clients' devices. However, federated learning is not without its shortcomings, and in this thesis we present an overview of the learning paradigm and propose a new federated recommender system framework that utilizes homomorphic encryption. This results in a slight decrease in accuracy metrics but greatly increases user privacy. We also show that performing computations on encrypted gradients barely affects the recommendation performance while ensuring a more secure means of communicating user gradients to and from the global server.
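    To make the encrypted-aggregation idea concrete, here is a toy sketch using the open-source phe Paillier library, whose additive homomorphism lets a server sum gradients it cannot read. This illustrates the general technique, not the thesis's framework; in particular, co-locating the decryption key with the aggregator is a simplification a real deployment would avoid.

        import numpy as np
        from phe import paillier  # pip install phe

        public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

        def encrypt_grad(grad):
            """Client side: encrypt every gradient coordinate with the public key."""
            return [public_key.encrypt(float(g)) for g in grad]

        def aggregate(encrypted_grads):
            """Server side: Paillier ciphertexts add homomorphically, so the
            server sums the clients' gradients without ever decrypting them."""
            total = encrypted_grads[0]
            for enc in encrypted_grads[1:]:
                total = [a + b for a, b in zip(total, enc)]
            return total

        # Three clients send encrypted updates; only the key holder can decrypt
        # the aggregate, never the individual contributions.
        client_grads = [np.random.randn(4) for _ in range(3)]
        summed = aggregate([encrypt_grad(g) for g in client_grads])
        avg = np.array([private_key.decrypt(c) for c in summed]) / 3
        assert np.allclose(avg, np.mean(client_grads, axis=0))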

    Recent Developments in Cointegration

    It is well known that inference on the cointegrating relations in a vector autoregression (CVAR) is difficult in the presence of a near unit root: the test for a given cointegration vector can have rejection probabilities under the null that vary from the nominal size to more than 90%. This paper formulates a CVAR model allowing for multiple near unit roots and analyses the asymptotic properties of the Gaussian maximum likelihood estimator. Two critical value adjustments suggested by McCloskey (2017) for the test on the cointegrating relations are then implemented for the model with a single near unit root, and it is found by simulation that they eliminate the serious size distortions while retaining reasonable power for moderate values of the near-unit-root parameter. The findings are illustrated with an analysis of a number of different bivariate DGPs.
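    For intuition about the setting, the sketch below simulates a bivariate DGP in the standard local-to-unity parameterization, where the equilibrium error has autoregressive root ρ = 1 + c/T. It is an illustrative construction, not the paper's exact specification.

        import numpy as np

        def simulate_near_cointegrated(T=500, c=-10.0, seed=None):
            """Bivariate system: a common exact unit-root trend plus an
            equilibrium error u_t with near-unit root rho = 1 + c/T."""
            rng = np.random.default_rng(seed)
            rho = 1.0 + c / T                  # c = 0 gives no cointegration
            e = rng.standard_normal((T, 2))
            trend = np.cumsum(e[:, 0])         # exact unit-root common trend
            u = np.zeros(T)
            for t in range(1, T):
                u[t] = rho * u[t - 1] + e[t, 1]
            y1, y2 = trend + u, trend          # beta = (1, -1)', so beta'y = u
            return np.column_stack([y1, y2])

        # For c near 0 the relation y1 - y2 behaves like a random walk, and tests
        # that assume exact cointegration over-reject; for strongly negative c
        # the relation mean-reverts quickly and the tests behave well.
        Y = simulate_near_cointegrated(T=500, c=-10.0, seed=1)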

    PROMISE: Preconditioned Stochastic Optimization Methods by Incorporating Scalable Curvature Estimates

    This paper introduces PROMISE (Preconditioned Stochastic Optimization Methods by Incorporating Scalable Curvature Estimates), a suite of sketching-based preconditioned stochastic gradient algorithms for solving large-scale convex optimization problems arising in machine learning. PROMISE includes preconditioned versions of SVRG, SAGA, and Katyusha; each algorithm comes with a strong theoretical analysis and effective default hyperparameter values. In contrast, traditional stochastic gradient methods require careful hyperparameter tuning to succeed and degrade in the presence of ill-conditioning, a ubiquitous phenomenon in machine learning. Empirically, we verify the superiority of the proposed algorithms by showing that, using default hyperparameter values, they outperform or match popular tuned stochastic gradient optimizers on a test bed of 51 ridge and logistic regression problems assembled from benchmark machine learning repositories. On the theoretical side, this paper introduces the notion of quadratic regularity in order to establish linear convergence of all proposed methods even when the preconditioner is updated infrequently. The speed of linear convergence is determined by the quadratic regularity ratio, which often provides a tighter bound on the convergence rate than the condition number, both in theory and in practice, and explains the fast global linear convergence of the proposed methods.
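    One building block can be sketched compactly: a randomized Nyström approximation of the Hessian used as a preconditioner for a gradient step on ridge regression. The recipe below follows the standard randomized numerical linear algebra construction; names, ranks, and step sizes are illustrative, not PROMISE's implementation.

        import numpy as np
        from scipy.linalg import solve_triangular

        def nystrom_preconditioner(A, rank, rho, seed=None):
            """Rank-limited Nystrom approximation of H = A^T A / n, returning a
            function applying P^{-1} = U (L + rho I)^{-1} U^T + (I - U U^T)/rho."""
            rng = np.random.default_rng(seed)
            n, d = A.shape
            Omega = np.linalg.qr(rng.standard_normal((d, rank)))[0]
            Y = A.T @ (A @ Omega) / n              # H @ Omega without forming H
            nu = 1e-7 * np.linalg.norm(Y)          # small shift for stability
            Y += nu * Omega
            C = np.linalg.cholesky(Omega.T @ Y)
            B = solve_triangular(C, Y.T, lower=True).T
            U, S, _ = np.linalg.svd(B, full_matrices=False)
            lam = np.maximum(S ** 2 - nu, 0.0)     # approximate eigenvalues of H
            def apply_inv(g):
                Ug = U.T @ g
                return U @ (Ug / (lam + rho)) + (g - U @ Ug) / rho
            return apply_inv

        # One preconditioned full-gradient step for ridge regression.
        rng = np.random.default_rng(0)
        n, d, rho = 2000, 200, 1e-2
        A, b = rng.standard_normal((n, d)), rng.standard_normal(n)
        w = np.zeros(d)
        P_inv = nystrom_preconditioner(A, rank=20, rho=rho, seed=1)
        grad = A.T @ (A @ w - b) / n + rho * w
        w -= P_inv(grad)                           # unit step size, thanks to P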

    Building a semantic search engine with games and crowdsourcing

    Semantic search engines aim at improving conventional search with semantic information, or meta-data, on the data searched for and/or on the searchers. So far, approaches to semantic search exploit characteristics of the searchers like age, education, or spoken language for selecting and/or ranking search results. Such data allow a semantic search engine to be built up as an extension of a conventional search engine. The crawlers of well-established search engines like Google, Yahoo! or Bing can index documents, but so far their capabilities to recognize the intentions of searchers are still rather limited. Indeed, taking into account characteristics of the searchers considerably extends both the quantity of data to analyse and the dimensionality of the search problem. Well-established search engines therefore still focus on general search, that is, "search for all", not on specialized search, that is, "search for a few". This thesis reports on techniques that have been adapted or conceived, deployed, and tested for building a semantic search engine for the very specific context of artworks. In contrast to, for example, the interpretation of X-ray images, the interpretation of artworks is far from being fully automatable. Therefore artwork interpretation has been based on Human Computation, that is, a software-based gathering of contributions by many humans. The approach reported on in this thesis first relies on so-called Games With A Purpose, or GWAPs, for this gathering: casual games provide an incentive for a potentially unlimited community of humans to contribute their appreciations of artworks. Designing suitable incentives is less trivial than it might seem at first. An ecosystem of games is needed to collect the intended meta-data on artworks: one game generates data that can serve as input to another game. This results in semantically rich meta-data that can be used for building up a successful semantic search engine.
    Thus, a first part of this thesis reports on a "game ecosystem" specifically designed from one known game and including several novel games belonging to the following game classes: (1) Description Games for collecting obvious and trivial meta-data, basically the well-known ESP (for extra-sensorial perception) game of Luis von Ahn, (2) the Dissemination Game Eligo generating translations, (3) the Diversification Game Karido aiming at sharpening differences between the objects, that is, the artworks, interpreted, and (4) the Integration Games Combino, Sentiment and TagATag that generate structured meta-data.
    Secondly, the approach to building a semantic search engine reported on in this thesis relies on Higher-Order Singular Value Decomposition (SVD). More precisely, the data and meta-data on artworks gathered with the aforementioned GWAPs are collected in a tensor, that is, a mathematical structure generalising matrices to more than two dimensions (columns and rows). The dimensions considered are the artwork descriptions, the players, and the artworks themselves. A Higher-Order SVD of this tensor is first used for noise reduction, following the method known as Latent Semantic Analysis (LSA); this thesis thus also reports on deploying a Higher-Order LSA. The parallel Higher-Order SVD algorithm applied for the Higher-Order LSA, and its implementation, have been validated on an application related to, but independent of, the semantic search engine for artworks striven for: image compression. This thesis reports on the surprisingly good image compression which can be achieved with Higher-Order SVD.
    While common compression methods apply a matrix SVD to each color separately, the approach reported on in this thesis relies on one single (higher-order) SVD of the whole tensor. This results both in better quality of the compressed image and in a significant reduction of the memory space needed. Higher-Order SVD is extremely time-consuming, which calls for parallel computation. Thus, a step towards automating the construction of a semantic search engine for artworks was parallelizing the higher-order SVD method used and running the resulting parallel algorithm on a super-computer. This thesis reports on using Hestenes' method and R-SVD for parallelising the higher-order SVD; this method is an unconventional choice, which is explained and motivated. As for the super-computer needed, this thesis reports on turning the web browsers of the players or searchers into a distributed parallel computer, by means of a novel dedicated system and a novel implementation of the MapReduce framework for data parallelism. Harnessing the web browsers of the players or searchers saves computational power on the server side. It also scales extremely well with the number of players or searchers because both playing with and searching for artworks require human reflection, which leaves local processors idle; these can be brought together into a distributed super-computer.
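    For concreteness, the decomposition at the heart of this pipeline can be written sequentially in a few lines. The NumPy sketch below computes a truncated higher-order SVD (one matrix SVD per mode plus a core tensor) and uses it to compress a small random tensor; it is illustrative only and sidesteps the parallel Hestenes/R-SVD machinery the thesis actually develops.

        import numpy as np

        def hosvd(T, ranks):
            """Truncated higher-order SVD: the mode-n factor is the leading
            left singular vectors of the mode-n unfolding; the core tensor is
            T contracted with every factor transposed."""
            factors = []
            for mode, r in enumerate(ranks):
                unfold = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
                U, _, _ = np.linalg.svd(unfold, full_matrices=False)
                factors.append(U[:, :r])
            core = T
            for U in factors:
                # Contracting axis 0 each time cycles the axes back into order.
                core = np.tensordot(core, U, axes=(0, 0))
            return core, factors

        # Compress a toy height x width x color tensor and measure the error.
        T = np.random.default_rng(0).standard_normal((32, 32, 3))
        core, factors = hosvd(T, ranks=(8, 8, 3))
        approx = core
        for U in factors:                        # apply each factor back in turn
            approx = np.tensordot(approx, U, axes=(0, 1))
        print(np.linalg.norm(T - approx) / np.linalg.norm(T))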