A perceptual hash function to store and retrieve large scale DNA sequences
This paper proposes a novel approach for storing and retrieving massive DNA sequences. The method is based on a perceptual hash function, commonly used to determine the similarity between digital images, that we adapted for DNA sequences. The perceptual hash function presented here is based on a Discrete Cosine Transform Sign Only (DCT-SO). Each nucleotide is encoded as a pixel with a fixed gray-level intensity, and the hash is calculated from the sequence's significant frequency characteristics. This results in a drastic data reduction from the sequence to the perceptual hash. Unlike cryptographic hash functions, perceptual hashes are not affected by the "avalanche effect" and can therefore be compared. The similarity distance between two hashes is estimated with the Hamming distance, which is used to retrieve DNA sequences. The experiments we conducted show that our approach is relevant for storing massive DNA sequences and retrieving them.
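As a rough sketch of the pipeline described above (the nucleotide gray levels and the 64-bit hash length are illustrative assumptions, not the paper's exact parameters):

```python
import math

# Hypothetical encoding: the paper does not specify these exact gray levels.
GRAY = {"A": 0, "C": 85, "G": 170, "T": 255}

def dct_ii(x):
    """Plain (unnormalized) DCT-II of a real-valued sequence."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
            for k in range(n)]

def perceptual_hash(seq, bits=64):
    """Hash = signs of the first low-frequency DCT coefficients (DC excluded)."""
    pixels = [GRAY[c] for c in seq]
    coeffs = dct_ii(pixels)
    return [1 if c >= 0 else 0 for c in coeffs[1:bits + 1]]

def hamming(h1, h2):
    """Similarity distance between two hashes: count of differing bits."""
    return sum(a != b for a, b in zip(h1, h2))
```

Identical sequences yield distance 0, and because only coefficient signs are kept, similar sequences tend to produce small Hamming distances, which is what makes retrieval by hash comparison possible.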
Integer Sparse Distributed Memory and Modular Composite Representation
Challenging AI applications, such as cognitive architectures, natural language understanding, and visual object recognition, share some basic operations, including pattern recognition, sequence learning, clustering, and association of related data. Both the representations used and the structure of a system significantly influence which tasks and problems are most readily supported. A memory model and a representation that facilitate these basic tasks would greatly improve the performance of these challenging AI applications.

Sparse Distributed Memory (SDM), based on large binary vectors, has several desirable properties, auto-associativity, content addressability, distributed storage, and robustness to noisy inputs, that would facilitate the implementation of challenging AI applications. Here I introduce two variations on the original SDM, the Extended SDM and the Integer SDM, that significantly improve these desirable properties, as well as a new form of reduced description representation, the Modular Composite Representation (MCR).

Extended SDM, which uses word vectors larger than address vectors, enhances hetero-associativity, improving the storage of sequences of vectors as well as of other data structures. A novel sequence learning mechanism is introduced, and several experiments demonstrate the capacity and sequence learning capability of this memory.

Integer SDM uses modular integer vectors rather than binary vectors, improving the representation capabilities of the memory and its noise robustness. Several experiments show its capacity and noise robustness, and theoretical analyses of its capacity and fidelity are also presented.

A reduced description represents a whole hierarchy using a single high-dimensional vector, which can recover individual items and be used directly for complex calculations and procedures, such as making analogies. Furthermore, the hierarchy can be reconstructed from the single vector. MCR, a new reduced description model for the representations used in challenging AI applications, provides an attractive tradeoff between expressiveness and simplicity of operations. A theoretical analysis of its noise robustness, several experiments, and comparisons with similar models are presented.

My implementations of these memories include an object-oriented version using a RAM cache, a version for distributed and multi-threaded execution, and a GPU version for fast vector processing.
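The modular-vector idea can be illustrated with a toy sketch (the dimension, modulus, and the use of elementwise modular addition as the binding operator are assumptions for illustration; the dissertation's MCR design may differ in detail):

```python
import random

# Illustrative parameters, not the dissertation's exact choices.
MOD, DIM = 16, 256
rng = random.Random(0)

def random_vector():
    """A random modular integer vector."""
    return [rng.randrange(MOD) for _ in range(DIM)]

def bind(a, b):
    """Combine two vectors into one (elementwise modular sum)."""
    return [(x + y) % MOD for x, y in zip(a, b)]

def unbind(c, b):
    """Invert binding exactly (elementwise modular difference)."""
    return [(x - y) % MOD for x, y in zip(c, b)]

def distance(a, b):
    """Circular Manhattan distance on the modular ring."""
    return sum(min((x - y) % MOD, (y - x) % MOD) for x, y in zip(a, b))
```

Binding is exactly invertible (`unbind(bind(a, b), b) == a`), while unrelated random vectors sit at a large, concentrated distance from each other, which is the property that gives high-dimensional reduced descriptions their noise robustness.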
Analyzing the Privacy and Societal Challenges Stemming from the Rise of Personal Genomic Testing
Progress in genomics is enabling researchers to better understand the role of the genome in our health and well-being, stimulating hope for more effective and cost-efficient healthcare. At the same time, the rapid drop in the cost of genome sequencing has enabled the emergence of a booming market for direct-to-consumer (DTC) genetic testing. Nowadays, companies like 23andMe and AncestryDNA provide affordable health, genealogy, and ancestry reports, and have already tested tens of millions of customers. However, while this technology has the potential to transform society by improving people’s lives, it also harbors dangers, as it prompts important privacy and societal concerns. In this thesis, we shed light on these issues using a mixed-methods approach. We start by conducting a technical investigation of the limitations of the privacy-enhancing technologies used for testing, storing, and sharing genomic data. We rely on a structured methodology to contextualize and provide a critical analysis of the current state of the art, and we identify and discuss ten open problems faced by the community. We then focus on the societal aspects of DTC genetic testing by conducting two large-scale analyses of the genetic testing discourse on both mainstream and fringe social networks, specifically Twitter, Reddit, and 4chan. Our analyses show that DTC genetic testing is a popular topic of discussion on all platforms. However, these discussions often include highly toxic language expressed through hateful and racist comments and openly antisemitic rhetoric, often conveyed through memes. Overall, our findings highlight that the rise in popularity of this new technology is accompanied by several societal implications that are unlikely to be addressed by any single research field and instead require a multi-disciplinary approach.
Indexing Techniques for Image and Video Databases: an approach based on Animate Vision Paradigm
In this dissertation, novel indexing techniques for video and image databases based on the “Animate Vision” paradigm are presented and discussed.
On the one hand, it will be shown how, by embedding active mechanisms of biological vision, such as saccadic eye movements and fixations, within image inspection algorithms, more effective query processing in image databases can be achieved. In particular, we discuss how to generate two fixation sequences from a query image I_q and a test image I_t of the data set, respectively, and how to compare the two sequences in order to compute a possible similarity (consistency) measure between the two images. Meanwhile, it will be shown how the approach can be used with classical clustering techniques to discover and represent the hidden semantic associations among images, in terms of categories, which, in turn, allow an automatic pre-classification (indexing) and can be used to drive and improve the query processing. Finally, preliminary results will be presented and the proposed approach compared with the most recent techniques for image retrieval described in the literature.
On the other hand, it will be discussed how, by taking advantage of such a foveated representation of an image, it is possible to partition a video into shots. More precisely, the shot-change detection method is based on the computation, at each time instant, of the consistency measure between the fixation sequences generated by an ideal observer looking at the video. The proposed scheme aims at detecting both abrupt and gradual transitions between shots using a single technique, rather than a set of dedicated methods. Results on videos of various content types are reported and validate the proposed approach.
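The consistency-based shot detection described above can be caricatured as follows (the matching radius, the specific consistency formula, and the cut threshold are invented for illustration; the dissertation's actual measure may differ):

```python
# Toy sketch: a fixation is an (x, y) point; consistency between two fixation
# sequences is the fraction of fixations in A with a nearby fixation in B.

def consistency(fixations_a, fixations_b, radius=20.0):
    if not fixations_a:
        return 0.0
    matched = 0
    for (xa, ya) in fixations_a:
        if any((xa - xb) ** 2 + (ya - yb) ** 2 <= radius ** 2
               for (xb, yb) in fixations_b):
            matched += 1
    return matched / len(fixations_a)

def shot_changes(fixation_sequences, threshold=0.5):
    """Report frame indices where consistency with the previous frame drops."""
    cuts = []
    for t in range(1, len(fixation_sequences)):
        if consistency(fixation_sequences[t - 1], fixation_sequences[t]) < threshold:
            cuts.append(t)
    return cuts
```

A single drop test like this covers both abrupt cuts (consistency collapses in one step) and, with a sliding comparison window, gradual transitions, which is the appeal of using one technique for both.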
Audio content identification
The development and research of content-based music information retrieval (MIR) applications in recent years have shown that generating descriptions that enable the identification and classification of pieces of musical audio is a challenge that can be coped with. Due to the huge amount of digital music available and the growth of the corresponding databases, there are ongoing investigations into how to automatically perform tasks concerning the management of audio data.
In this thesis I provide a general introduction to music information retrieval techniques, especially the identification of audio material and the comparison of similarity-based approaches with content-based fingerprint technology. On the one hand, similarity retrieval systems try to model the human auditory system in various aspects, and therewith a model of perceptual similarity. On the other hand, there are fingerprints or signatures that try to identify music exactly, without any assessment of the similarity of sound titles. To figure out the differences and consequences of using these approaches, I performed several experiments that make clear how robustly and adaptably an identification system must work. Rhythm Patterns, a similarity-based feature extraction scheme, and FDMF, a free fingerprint algorithm, were investigated in 24 test cases in order to compare the principles behind them. This evaluation also focused on achieving the greatest possible accuracy. It turns out that similarity features like Rhythm Patterns are able to identify audio titles promisingly well (up to 89.53%) in the introduced test scenarios. The proper choice of features enables music tracks to be identified best when focusing on the highest similarity between the candidates, both for varied excerpts and after signal modifications.
Routing and search on large scale networks
In this thesis, we address two seemingly unrelated problems, namely routing in large wireless ad hoc networks and comparison-based search in image databases. However, the underlying problem is in essence similar, and we can use the same strategy to attack both. In both cases, the intrinsic complexity of the problem is in some sense low, and we can exploit this fact to design efficient algorithms. A wireless ad hoc network is a communication network consisting of wireless devices such as laptops or cell phones. The network does not have any fixed infrastructure, and hence nodes that cannot communicate directly over the wireless medium must use intermediate nodes as relays. This immediately raises the question of how to select the relay nodes. Ideally, one would like to find a path from the source to the destination that is as short as possible. The length of the found path, also called the route, typically depends on how much signaling traffic is generated in order to establish the route. This is the fundamental trade-off that we investigate in this thesis. As mentioned above, we try to exploit the fact that the communication network is intrinsically low-dimensional, or in other words has low complexity. We show that this is indeed the case for a large class of models and that we can design efficient routing algorithms that use this property. Low dimensionality implies that we can embed the network well in a low-dimensional space, or build simple hierarchical decompositions of the network. We use both of these techniques to design routing algorithms. Comparison-based search in image databases is a new problem that can be defined as follows: given a large database of images, can a human user retrieve an image which he has in mind, or at least an image similar to it, without going sequentially through all images? More precisely, we ask whether we can search a database of images only by making comparisons between images.
As a case in point, we ask whether we can find a query image q only by asking questions of the type "does image q look more like image A or image B?". The analogue of signaling traffic for wireless networks is here the questions we can ask human users in a learning phase prior to the search. In other words, we would like to ask as few questions as possible to pre-process and prepare the database, while guaranteeing a certain quality of the results obtained in the search phase. As the underlying image space is not necessarily metric, this raises new questions on how to search spaces for which only rank information can be obtained. The rank of A with respect to B is k if A is B's kth nearest neighbor. In this setup, low dimensionality is analogous to the homogeneity of the image space. As we will see, homogeneity can be captured by properties of the rank relationships. In turn, homogeneous spaces can be decomposed well hierarchically using comparisons. Further, homogeneity allows us to design good hash functions. To design efficient algorithms for these two problems, we can apply the same techniques mutatis mutandis. In both cases, we rely on the intuition that each problem has low intrinsic complexity and that we can exploit this fact. Our results take the form of simulations and asymptotic bounds.
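The comparison-only setting can be sketched as follows. The oracle simulates a human answering "does q look more like A or B?" via a hidden metric that the search algorithm itself never sees; the naive tournament below spends n-1 comparisons per query, which is exactly the cost that preprocessing with rank information aims to reduce:

```python
# Comparison-based search sketch: the searcher receives only pairwise answers,
# never distances. The hidden metric here stands in for human judgments.

def make_oracle(hidden_distance, q):
    """Simulated user: returns whichever of a, b is closer to the query q."""
    def closer(a, b):
        return a if hidden_distance(q, a) <= hidden_distance(q, b) else b
    return closer

def comparison_search(items, closer):
    """Find the item nearest to the hidden query using n - 1 comparisons."""
    best = items[0]
    for item in items[1:]:
        best = closer(best, item)
    return best
```

Hierarchical decompositions built from rank relationships replace this linear scan with a tree descent, trading questions asked during preprocessing for fewer questions at search time.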
Advances in Machine Learning: Nearest Neighbour Search, Learning to Optimize and Generative Modelling
Machine learning is the embodiment of an unapologetically data-driven philosophy that has increasingly become one of the most important drivers of progress in artificial intelligence and beyond. Existing machine learning methods, however, entail making trade-offs in terms of computational efficiency, modelling flexibility and/or formulation faithfulness. In this dissertation, we will cover three different ways in which limitations along each axis can be overcome, without compromising on other axes.

Computational Efficiency. We start with limitations on computational efficiency. Many modern machine learning methods require performing large-scale similarity search under the hood. For example, classifying an input into one of a large number of classes requires comparing the weight vector associated with each class to the activations of the penultimate layer; attending to particular memory cells of a neural net requires comparing the keys associated with each memory cell to the query; and sparse recovery requires comparing each dictionary element to the residual. Similarity search in many cases can be reduced to nearest neighbour search, which is both a blessing and a curse. On the plus side, the nearest neighbour search problem has been extensively studied for more than four decades. On the minus side, no exact algorithm developed over the past four decades can run faster than naive exhaustive search when the intrinsic dimensionality is high, which is almost certainly the case in machine learning. Given this state of affairs, should we give up any hope of doing better than the naive approach of exhaustively comparing each element one by one? It turns out this pessimism, while tempting, is unwarranted.
We introduce a new family of exact randomized algorithms, known as Dynamic Continuous Indexing (DCI), which overcomes both the curse of ambient dimensionality and the curse of intrinsic dimensionality: more specifically, DCI simultaneously achieves a query time complexity with a linear dependence on ambient dimensionality, a sublinear dependence on intrinsic dimensionality, and a sublinear dependence on dataset size. The key insight is that the curse of intrinsic dimensionality in many cases arises from space partitioning, a divide-and-conquer strategy used by most nearest neighbour search algorithms. While space partitioning makes intuitive sense and works well in low dimensions, we argue that it fundamentally fails in high dimensions, because it requires distances between each point and every possible query to be approximately preserved in the data structure. We develop a new indexing scheme that only requires the ordering of nearby points relative to distant points to be approximately preserved, and show that the number of out-of-place points after projecting to just a single dimension is sublinear in the intrinsic dimensionality. In practice, our algorithm achieves a 14-116x speedup and a 21x reduction in memory consumption compared to locality-sensitive hashing (LSH).

Modelling Flexibility. Next we move on to probabilistic modelling, which is critical to realizing one of the central objectives of machine learning: modelling the uncertainty that is inherent in prediction. The community has wrestled with the problem of how to strike the right balance between modelling flexibility and computational efficiency. Simple models can often be learned straightforwardly and efficiently but are not expressive; complex models are expressive, but in general cannot be learned both exactly and efficiently, often because learning requires evaluating some intractable integral.
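Returning to the nearest neighbour part of the abstract: the retrieval flavour behind projection-based indexing can be sketched by ranking points by how close their 1-D projections lie to the query's projection and re-ranking only that candidate set exactly. This is a simplification of DCI, which uses prioritized retrieval over many composite indices; the direction count and candidate budget below are arbitrary:

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def random_directions(dim, n, seed=0):
    """Random Gaussian projection directions (illustrative, unnormalized)."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

def nearest(points, q, dirs, n_candidates):
    """Gather points whose 1-D projections are close to q's, re-rank exactly."""
    candidates = set()
    for d in dirs:
        qp = dot(q, d)
        by_projection = sorted(range(len(points)),
                               key=lambda i: abs(dot(points[i], d) - qp))
        candidates.update(by_projection[:n_candidates])
    # Exact squared-distance re-ranking over the small candidate set only.
    return min(candidates,
               key=lambda i: sum((a - b) ** 2 for a, b in zip(points[i], q)))
```

The projection preserves the ordering of nearby versus distant points well enough that a small candidate budget usually suffices, which is exactly the weaker preservation requirement the abstract contrasts with space partitioning.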
The success of deep learning has motivated the development of probabilistic models that can leverage the inductive bias and modelling power of deep neural nets, such as variational autoencoders (VAEs) and generative adversarial nets (GANs), which belong to a subclass of probabilistic models known as implicit probabilistic models. Implicit probabilistic models are defined by a procedure for drawing samples from them, rather than by an explicit specification of the probability density function. On the positive side, sampling is always easy by definition; on the negative side, learning is difficult because not even the unnormalized complete likelihood can be expressed analytically. So these models must be learned using likelihood-free methods, but none of these have been shown to learn the underlying distribution with a finite number of samples. Perhaps the most popular likelihood-free method is the GAN. Unfortunately, GANs suffer from the well-documented issue of mode collapse, where the learned model (the generator, in GAN parlance) cannot generate some modes of the true data distribution. We argue this arises from the direction in which generated samples are matched to the real data. Under the GAN objective, each generated sample is made indistinguishable from some data example. Some data examples may not be chosen by any generated sample, resulting in mode collapse. We introduce a new likelihood-free method, known as Implicit Maximum Likelihood Estimation (IMLE), that overcomes mode collapse by inverting the direction of matching: instead of ensuring that each generated sample has a similar data example, our method ensures that each data example has a similar generated sample. This can be shown to be equivalent to maximizing a lower bound on the log-likelihood when the model class is richly parameterized and the density is smooth in parameters and data, hence the name.
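A one-dimensional toy version of the IMLE matching direction (a real implementation parameterizes the generator with a neural net and differentiates through the samples; here the "generator" is just a list of points, and the step size is arbitrary):

```python
# Toy IMLE step: for each *data example*, find its nearest generated sample
# and pull that sample toward the example. No example is left unmatched,
# which is how the inverted matching direction avoids mode collapse.

def imle_step(samples, data, step=0.5):
    updated = list(samples)
    for x in data:
        j = min(range(len(updated)), key=lambda i: abs(updated[i] - x))
        updated[j] += step * (x - updated[j])  # pull nearest sample toward x
    return updated

def imle_loss(samples, data):
    """Sum over data examples of the distance to the nearest sample."""
    return sum(min(abs(s - x) for s in samples) for x in data)
```

Contrast with the GAN direction: matching generated samples to data lets some data modes go unchosen; matching data to generated samples, as above, forces every mode to attract at least one sample.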
Compared to VAEs, which are not likelihood-free, IMLE eliminates the need for an approximate posterior and avoids the bias towards parameters where the true posteriors are less informative, a phenomenon known as "posterior collapse".

Formulation Faithfulness. Finally we introduce a novel formulation that enables the automatic discovery of new iterative gradient-based optimization algorithms, which have become the workhorse of modern machine learning. This effectively allows us to apply machine learning to improve machine learning, which has been a dream of machine learning researchers since the early days of the field. The key challenge, however, is that it is unclear how to represent a complex object like an algorithm in a way that is amenable to machine learning. Prior approaches represent algorithms as imperative programs, i.e., sequences of elementary operations, and therefore induce a search space whose size is exponential in the length of the optimal program. Searching in this space is unfortunately not tractable for anything but the simplest and shortest algorithms. Other approaches enumerate a small set of manually designed algorithms and search for the best algorithm within this set. Searching in this space is tractable, but the optimal algorithm may lie outside it. It remains an open question how to parameterize the space of possible algorithms in a way that is both complete and efficiently searchable. We get around this issue by observing that an optimization algorithm can be uniquely characterized by its update formula: different iterative optimization algorithms differ only in their choice of update formula. In gradient descent, for example, the update is a scaled negative gradient, whereas in gradient descent with momentum, it is a scaled exponentially-weighted average of the history of gradients. Therefore, if we can learn the update formula, we can automatically discover new optimization algorithms.
The update formula can be formulated as a mapping from the history of gradients, iterates, and objective values to the update step, which can be approximated with a neural net. We can then learn the optimization algorithm by learning the parameters of the neural net.
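The simplest possible instance of learning an update formula: parameterize the update with two coefficients, one on the raw gradient and one on an exponentially-weighted gradient history, and fit them by grid search on a training objective. The thesis instead approximates the update formula with a neural net over the full history; the grids, coefficients, and quadratic test function below are arbitrary:

```python
# Sketch of "learning to optimize": the update formula is parameterized
# (here by just two scalars) and those parameters are themselves optimized.

def run_optimizer(lr, beta, steps=50, x0=5.0):
    """Minimize f(x) = x^2 with update u = -lr*g - beta*m, m = gradient history."""
    x, m = x0, 0.0
    for _ in range(steps):
        g = 2 * x                  # gradient of x^2
        m = 0.9 * m + g            # exponentially-weighted gradient history
        x = x - lr * g - beta * m  # the parameterized update formula
    return x * x                   # final objective value

def learn_update_formula(lrs, betas):
    """Outer loop: pick the update-formula parameters with the best final loss."""
    return min(((lr, beta) for lr in lrs for beta in betas),
               key=lambda p: run_optimizer(*p))
```

Because plain gradient descent corresponds to one point of the parameter grid (beta = 0), the learned update formula can never do worse than that baseline on the training objective, illustrating why searching over update formulas, rather than over programs, keeps the space complete along this axis yet tractable.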
Using Perl for Statistics: Data Processing and Statistical Computing
In this paper we show how Perl, an expressive and extensible high-level programming language with network and object-oriented programming support, can be used in processing data for statistics and statistical computing. The paper is organized in two parts. In Part I, we introduce the Perl programming language, with particular emphasis on the features that distinguish it from conventional languages. Then, using practical examples, we demonstrate how Perl's distinguishing features make it particularly well suited to perform labor-intensive and sophisticated tasks ranging from the preparation of data to the writing of statistical reports. In Part II we show how Perl can be extended to perform statistical computations using modules and by "embedding" specialized statistical applications. We provide examples of how Perl can be used to do simple statistical analyses, perform complex statistical computations involving matrix algebra and numerical optimization, and make statistical computations more easily reproducible. We also investigate the numerical and statistical reliability of various Perl statistical modules. Important computing issues, such as ease of use, speed of calculation, and efficient memory usage, are also considered.