Curator: Efficient Indexing for Multi-Tenant Vector Databases
Vector databases have emerged as key enablers for bridging intelligent
applications with unstructured data, providing generic search and management
support for embedding vectors extracted from the raw unstructured data. As
multiple data users can share the same database infrastructure, multi-tenancy
support for vector databases is increasingly desirable. This hinges on an
efficient filtered search operation, i.e., only querying the vectors accessible
to a particular tenant. Multi-tenancy in vector databases is currently achieved
by building either a single, shared index among all tenants, or a per-tenant
index. The former optimizes for memory efficiency at the expense of search
performance, while the latter does the opposite. Instead, this paper presents
Curator, an in-memory vector index design tailored for multi-tenant queries
that simultaneously achieves the two conflicting goals, low memory overhead and
high performance for queries, vector insertion, and deletion. Curator indexes
each tenant's vectors with a tenant-specific clustering tree and encodes these
trees compactly as sub-trees of a shared clustering tree. Each tenant's
clustering tree adapts dynamically to its unique vector distribution, while
maintaining a low per-tenant memory footprint. Our evaluation, based on two
widely used data sets, confirms that Curator delivers search performance on par
with per-tenant indexing, while maintaining memory consumption at the same
level as metadata filtering on a single, shared index.
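The two baseline designs contrasted in this abstract can be sketched in a few lines. The following is a minimal illustration only, assuming brute-force scans in place of a real index; it is not Curator's clustering-tree design, and the class names `SharedIndex` and `PerTenantIndex` are hypothetical.

```python
import math
from collections import defaultdict


class SharedIndex:
    """Baseline 1 (illustrative): one store for all tenants, filtered at query time."""

    def __init__(self):
        self.vectors = {}                # vector id -> vector (stored once)
        self.access = defaultdict(set)   # vector id -> tenants allowed to see it

    def insert(self, vid, vec, tenant):
        self.vectors[vid] = vec
        self.access[vid].add(tenant)

    def search(self, query, tenant, k):
        # Every vector is visited; inaccessible ones are discarded by the filter.
        scored = [(math.dist(query, vec), vid)
                  for vid, vec in self.vectors.items()
                  if tenant in self.access[vid]]
        return [vid for _, vid in sorted(scored)[:k]]


class PerTenantIndex:
    """Baseline 2 (illustrative): one store per tenant, shared vectors duplicated."""

    def __init__(self):
        self.by_tenant = defaultdict(dict)   # tenant -> {vector id -> vector}

    def insert(self, vid, vec, tenant):
        self.by_tenant[tenant][vid] = vec    # a copy per tenant with access

    def search(self, query, tenant, k):
        # Only the tenant's own vectors are visited.
        scored = [(math.dist(query, vec), vid)
                  for vid, vec in self.by_tenant[tenant].items()]
        return [vid for _, vid in sorted(scored)[:k]]
```

Both classes return the same neighbours for a given tenant. The shared store keeps each vector once but scans everything, while the per-tenant store scans less but duplicates vectors shared between tenants, which is the trade-off the abstract describes.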
Non-equilibrium dynamics and entropy production in the human brain
Living systems operate out of thermodynamic equilibrium at small scales,
consuming energy and producing entropy in the environment in order to perform
molecular and cellular functions. However, it remains unclear whether
non-equilibrium dynamics manifest at macroscopic scales, and if so, how such
dynamics support higher-order biological functions. Here we present a framework
to probe for non-equilibrium dynamics by quantifying entropy production in
macroscopic systems. We apply our method to the human brain, an organ whose
immense metabolic consumption drives a diverse range of cognitive functions.
Using whole-brain imaging data, we demonstrate that the brain fundamentally
operates out of equilibrium at large scales. Moreover, we find that the brain
produces more entropy -- operating further from equilibrium -- when performing
physically and cognitively demanding tasks. By simulating an Ising model, we
show that macroscopic non-equilibrium dynamics can arise from asymmetries in
the interactions at the microscale. Together, these results suggest that
non-equilibrium dynamics are vital for cognition, and provide a general tool
for quantifying the non-equilibrium nature of macroscopic systems. Comment: 18 pages, 14 figures
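The abstract does not state which estimator is used. As a hedged illustration, the sketch below applies the textbook definition of entropy production for a stationary Markov chain: the Kullback-Leibler divergence between forward and time-reversed transition probabilities, estimated from a sequence of discrete states. The function name `entropy_production` and the smoothing pseudocount `alpha` are assumptions introduced here, not taken from the paper.

```python
import math
from collections import Counter


def entropy_production(states, alpha=1.0):
    """Estimate sum_ij P_ij * log(P_ij / P_ji) from a discrete state sequence.

    P_ij is the joint probability of observing the transition i -> j.
    `alpha` is a pseudocount (an assumption of this sketch) that keeps the
    logarithm finite when a transition is never observed in reverse.
    """
    labels = sorted(set(states))
    counts = Counter(zip(states, states[1:]))   # observed transitions i -> j
    total = sum(counts.values()) + alpha * len(labels) ** 2

    def p(i, j):
        return (counts[(i, j)] + alpha) / total

    # Each pair (i, j) and (j, i) contributes (P_ij - P_ji) * log(P_ij / P_ji) >= 0,
    # so the estimate is zero exactly when transitions are statistically reversible.
    return sum(p(i, j) * math.log(p(i, j) / p(j, i))
               for i in labels for j in labels)
```

For example, a palindromic sequence such as `[0, 1, 2, 1, 0]` has identical forward and reverse transition counts and yields zero, whereas a sequence that cycles `0 -> 1 -> 2 -> 0` has a clear direction in time and yields a positive value.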
Learning Structured Knowledge from Social Tagging Data: A critical review of methods and techniques
For more than a decade, researchers have been proposing various methods and techniques to mine social tagging data and to learn structured knowledge. It is essential to conduct a comprehensive survey of the related work, which would benefit the research community by providing a better understanding of the state of the art and insights into future research directions. The paper first defines the spectrum of Knowledge Organization Systems, from unstructured with less semantics to highly structured with richer semantics. It then reviews the related work by classifying the methods and techniques into two main categories, namely, learning term lists and learning relations. The methods and techniques originating from natural language processing, data mining, machine learning, social network analysis, and the Semantic Web are discussed in detail under the two categories. We summarize the prominent issues with the current research and highlight future directions on learning constantly evolving knowledge from social media data.
Fast feature matching for simultaneous localization and mapping
This bachelor's thesis deals with fast matching of local image features in large databases for simultaneous localization and mapping. It includes a brief overview of detectors and descriptors of local features that are invariant to scale, rotation, translation, and affine transformations. Real-time response for matching is crucial for various computer vision applications (SLAM, object retrieval, wide–robust baseline stereo, tracking, ...). We address the problem of sub-linear search complexity with multiple randomised KD-trees. In addition, we propose a novel way of splitting the dataset across the multiple trees. Moreover, a new general-purpose evaluation package (supporting KD-trees, BBD-trees, and k-means trees) was developed.
Discovery and characterisation of dietary patterns in two Nordic countries. Using non-supervised and supervised multivariate statistical techniques to analyse dietary survey data
This Nordic study encompasses multivariate data analysis (MDA) of Danish preschool consumers and Swedish preschool and elementary school consumers. Unlike comparable studies, it incorporates two separate MDA varieties: pattern discovery (PD) and predictive modelling (PM). PD, i.e. hierarchical cluster analysis (HCA) and factor analysis (using PCA), helped identify distinct consumer aggregations and relationships across food groups, respectively, whereas PM enabled the disclosure of deeply entrenched associations. Seventeen clusters, here defined as dietary prototypes, were identified by means of HCA in the entire bi-national data set. These prototypes underwent further processing, which disclosed several intriguing relationships in the consumption data. A striking disparity between the consumption patterns of Danish and Swedish preschool children was unveiled and further dissected by PM. Two prudent and mutually similar dietary prototypes appeared in each of the two data subsets of Swedish elementary school children. Dietary prototypes rich in sweetened soft beverages appeared among Danish and Swedish children alike. The results suggest prototype-specific risk assessment and study design.
Automatic Taxonomy Construction from Keywords via Scalable Bayesian Rose Trees
In this paper, we study the challenging problem of deriving a taxonomy from a set of keyword phrases. A solution can benefit many real-world applications because i) keywords give users the flexibility and ease to characterize a specific domain; and ii) in many applications, such as online advertisements, the domain of interest is already represented by a set of keywords. However, it is impossible to create a taxonomy out of a keyword set itself. We argue that additional knowledge and context are needed. To this end, we first use a general-purpose knowledge base and keyword search to supply the required knowledge and context. Then we develop a Bayesian approach to build a hierarchical taxonomy for a given set of keywords. We reduce the complexity of previous hierarchical clustering approaches from O(n^2 log n) to O(n log n) using a nearest-neighbor-based approximation, so that we can derive a domain-specific taxonomy from one million keyword phrases in less than an hour. Finally, we conduct comprehensive large-scale experiments to show the effectiveness and efficiency of our approach. A real-life example of building an insurance-related Web search query taxonomy illustrates the usefulness of our approach for specific domains.
Wo bin ich? Beiträge zum Lokalisierungsproblem mobiler Roboter
Self-localization addresses the problem of estimating the pose of mobile robots with respect to a certain coordinate system of their workspace. It is needed for various mobile robot applications like material handling in industry, disaster zone operations, vacuum cleaning, or even the exploration of foreign planets. Thus, self-localization is an essential capability. This problem has received considerable attention over the last decades. It can be decomposed into localization on a global and local level. Global techniques are able to localize the robot without any prior knowledge about its pose with respect to an a priori known map. In contrast, local techniques aim to correct so-called odometry errors occurring during robot motion. This thesis mainly addresses the global localization problem for mobile robots. The proposed method is based on matching an incremental local map to an a priori known global map. This approach is highly time- and memory-efficient and robust to structural ambiguity as well as to the occurrence of dynamic obstacles in non-static environments. The algorithm consists of several components, such as ego-motion estimation and global point cloud matching. Nowadays most computers feature multi-core processors, and thus map matching is performed by applying a parallelized variant of the Random Sample Matching (pRANSAM) approach originally devised for solving the 3D-puzzle problem. pRANSAM provides a set of hypotheses representing alleged robot poses. Techniques are discussed to postprocess the hypotheses, e.g. to decide when the robot pose is determined with sufficient accuracy. Furthermore, runtime aspects are considered in order to facilitate localization in real time. Finally, experimental results demonstrate the robustness of the method proposed in this thesis.
The localization problem of mobile robots is the task of determining their pose with respect to a given world coordinate system. The ability to self-localize is required in many application areas of mobile robots, such as material transport in industrial manufacturing, operations in disaster areas, or even the exploration of foreign planets. Existing methods for solving this problem are divided according to whether localization takes place at the local or the global level. Global localization algorithms determine the robot's pose with respect to a world coordinate system without any prior knowledge, whereas local methods start from a rough pose estimate, e.g. from the robot's odometry data. This dissertation presents a new approach to solving the global localization problem. The basic idea is to bring a local map and a global map into alignment. The approach is extremely robust both to ambiguities in the robot pose and to the occurrence of dynamic obstacles in non-static environments. The algorithm consists mainly of three components: a scan matcher for generating the local map, a method for matching the local and global maps, and an instance that decides when the robot is correctly localized with sufficient certainty. The matching of the local and global maps is performed by a parallelized variant of Random Sample Matching (pRANSAM), which yields a set of pose hypotheses. These hypotheses are analyzed in a further step in order to determine the correct robot pose once they are sufficiently unambiguous. Extensive experiments demonstrate the reliability and accuracy of the method presented in this dissertation.
Graph-based Methods for Visualization and Clustering
The amount of data that we produce and consume is larger than it has been at any point in the history of mankind, and it keeps growing exponentially. All this information, gathered in overwhelming volumes, often comes with two problematic characteristics: it is complex and deprived of semantic context. A common step to address those issues is to embed raw data in lower dimensions, by finding a mapping which preserves the similarity between data points from their original space to a new one. Measuring similarity between large sets of high-dimensional objects is, however, problematic for two main reasons: first, high-dimensional points are subject to the curse of dimensionality and second, the number of pairwise distances between points is quadratic with respect to the number of data points. Both problems can be addressed by using nearest neighbours graphs to understand the structure in data. As a matter of fact, most dimensionality reduction methods use similarity matrices that can be interpreted as graph adjacency matrices. Yet, despite recent progress, dimensionality reduction is still very challenging when applied to very large datasets. Indeed, although recent methods specifically address the problem of scalability, processing datasets of millions of elements remains a very lengthy process. In this thesis, we propose new contributions which address the problem of scalability using the framework of Graph Signal Processing (GSP), which extends traditional signal processing to graphs. We do so motivated by the premise that graphs are well suited to represent the structure of the data. In the first part of this thesis, we look at quantitative measures for the evaluation of dimensionality reduction methods. Using tools from graph theory and Graph Signal Processing, we show that specific characteristics related to quality can be assessed by taking measures on the graph, which indirectly validates the hypothesis relating graph to structure.
The second contribution is a new method for a fast eigenspace approximation of the graph Laplacian. Using principles of GSP and random matrices, we show that an approximated eigensubspace can be recovered very efficiently, which can be used for fast spectral clustering or visualization. Next, we propose a compressive scheme to accelerate any dimensionality reduction technique. The idea is based on compressive sampling and transductive learning on graphs: after computing the embedding for a small subset of data points, we propagate the information everywhere using transductive inference. The key components of this technique are a good sampling strategy to select the subset and the application of transductive learning on graphs. Finally, we address the problem of over-discriminative feature spaces by proposing a hierarchical clustering structure combined with multi-resolution graphs. Using efficient coarsening and refinement procedures on this structure, we show that dimensionality reduction algorithms can be run on intermediate levels and up-sampled to all points, leading to a very fast dimensionality reduction method. For all contributions, we provide extensive experiments on both synthetic and natural datasets, including large-scale problems. This allows us to show the pertinence of our models and the validity of our proposed algorithms. Following reproducible principles, we provide everything needed to repeat the examples and the experiments presented in this work.
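The two graph objects this abstract relies on, a nearest-neighbour graph and its Laplacian, can be illustrated with a small sketch. It assumes a brute-force neighbour search, an unweighted symmetrised graph, and the combinatorial Laplacian L = D - A; the function names `knn_graph` and `laplacian` are hypothetical, and none of the thesis's fast approximation methods are reproduced here.

```python
import math


def knn_graph(points, k):
    """Symmetric, unweighted k-nearest-neighbour adjacency matrix.

    Brute force: all pairwise distances are computed, which is the
    quadratic cost that scalable methods try to avoid.
    """
    n = len(points)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        nearest = sorted((math.dist(points[i], points[j]), j)
                         for j in range(n) if j != i)
        for _, j in nearest[:k]:
            adj[i][j] = 1
            adj[j][i] = 1   # symmetrise: keep the edge if either end selects it
    return adj


def laplacian(adj):
    """Combinatorial graph Laplacian L = D - A, with D the diagonal degree matrix."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]
```

Every row of the resulting Laplacian sums to zero, and its eigenvectors are the quantities used by spectral clustering and by spectral embeddings for visualization.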
Biological and genetic aspects of wild x domestic hybridization in wild boar and wolf populations
Nowadays, hybridization is recognized as a powerful evolutionary force promoting speciation and shaping adaptation, but also as a serious threat to the conservation of biodiversity.
This thesis is focused on two cases of hybridization between wild and domestic conspecifics, whose effects are mostly unexplored.
In Sus scrofa, I sought to expand knowledge about hybridization between wild boar and domestic pig. I investigated the main sources of domestic genes introgression, and assessed hybridization at neutral markers and functional genes at both local and European scale. I also developed a set of new uniparental markers for studying male-specific gene flow, and studied the reproductive phenology of wild populations.
In Canis lupus, I investigated patterns of hybridization between wolf and domestic dog in an Italian mountain area, focusing on the assessment of introgression and the food habits of hybrids.
For wild boar, I detected introgression all over Europe and highlighted the role of breeding stations in spreading domestic genes across wild populations. For wolf, a new approach was used to provide complementary (genetic and phenotypic) data on specific individuals and to support hybrid identification. A trophic niche overlap between wolves and hybrids was also demonstrated.
These studies can have relevant management implications, offering new elements of knowledge on different aspects of hybridization in two worrisome species of the Italian fauna.