15 research outputs found

    Curator: Efficient Indexing for Multi-Tenant Vector Databases

    Vector databases have emerged as key enablers for bridging intelligent applications with unstructured data, providing generic search and management support for embedding vectors extracted from the raw unstructured data. As multiple data users can share the same database infrastructure, multi-tenancy support for vector databases is increasingly desirable. This hinges on an efficient filtered search operation, i.e., only querying the vectors accessible to a particular tenant. Multi-tenancy in vector databases is currently achieved by building either a single, shared index among all tenants, or a per-tenant index. The former optimizes for memory efficiency at the expense of search performance, while the latter does the opposite. Instead, this paper presents Curator, an in-memory vector index design tailored for multi-tenant queries that simultaneously achieves the two conflicting goals, low memory overhead and high performance for queries, vector insertion, and deletion. Curator indexes each tenant's vectors with a tenant-specific clustering tree and encodes these trees compactly as sub-trees of a shared clustering tree. Each tenant's clustering tree adapts dynamically to its unique vector distribution, while maintaining a low per-tenant memory footprint. Our evaluation, based on two widely used data sets, confirms that Curator delivers search performance on par with per-tenant indexing, while maintaining memory consumption at the same level as metadata filtering on a single, shared index
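
The two baseline designs the abstract contrasts can be illustrated with a minimal brute-force sketch. This is not Curator's clustering-tree index; the class names `SharedIndex` and `PerTenantIndex`, the exhaustive scan, and the toy data are all assumptions made for illustration only.

```python
import math
from collections import defaultdict


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class SharedIndex:
    """Hypothetical baseline: one index for all tenants, filtered at query time."""

    def __init__(self):
        self.rows = []  # each row: (vector, set of tenant ids allowed to see it)

    def insert(self, vec, tenants):
        self.rows.append((tuple(vec), set(tenants)))  # each vector stored once

    def search(self, query, tenant, k):
        # Every query scans all rows, including vectors the tenant cannot see.
        visible = [v for v, allowed in self.rows if tenant in allowed]
        return sorted(visible, key=lambda v: l2(v, query))[:k]

    def stored_vectors(self):
        return len(self.rows)


class PerTenantIndex:
    """Hypothetical baseline: a separate index per tenant."""

    def __init__(self):
        self.by_tenant = defaultdict(list)

    def insert(self, vec, tenants):
        for t in tenants:
            self.by_tenant[t].append(tuple(vec))  # one copy per tenant

    def search(self, query, tenant, k):
        # Only the tenant's own vectors are scanned.
        own = self.by_tenant.get(tenant, [])
        return sorted(own, key=lambda v: l2(v, query))[:k]

    def stored_vectors(self):
        return sum(len(vs) for vs in self.by_tenant.values())


# Toy data: (vector, tenants with access). Purely illustrative.
data = [
    ((0.0, 0.0), {"a", "b"}),
    ((1.0, 1.0), {"a"}),
    ((0.1, 0.0), {"b"}),
    ((5.0, 5.0), {"a", "b"}),
]
shared, per_tenant = SharedIndex(), PerTenantIndex()
for vec, tenants in data:
    shared.insert(vec, tenants)
    per_tenant.insert(vec, tenants)
```

Both sketches return the same results for a given tenant; they differ in how many vectors are stored and how many are scanned per query, which is the memory versus search-performance trade-off the abstract describes.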

    Non-equilibrium dynamics and entropy production in the human brain

    Living systems operate out of thermodynamic equilibrium at small scales, consuming energy and producing entropy in the environment in order to perform molecular and cellular functions. However, it remains unclear whether non-equilibrium dynamics manifest at macroscopic scales, and if so, how such dynamics support higher-order biological functions. Here we present a framework to probe for non-equilibrium dynamics by quantifying entropy production in macroscopic systems. We apply our method to the human brain, an organ whose immense metabolic consumption drives a diverse range of cognitive functions. Using whole-brain imaging data, we demonstrate that the brain fundamentally operates out of equilibrium at large scales. Moreover, we find that the brain produces more entropy -- operating further from equilibrium -- when performing physically and cognitively demanding tasks. By simulating an Ising model, we show that macroscopic non-equilibrium dynamics can arise from asymmetries in the interactions at the microscale. Together, these results suggest that non-equilibrium dynamics are vital for cognition, and provide a general tool for quantifying the non-equilibrium nature of macroscopic systems. Comment: 18 pages, 14 figures
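
One common way to quantify entropy production from a sequence of discrete states is the Kullback-Leibler divergence between forward and reverse transition probabilities: the sum over state pairs (i, j) of P_ij log(P_ij / P_ji). The sketch below implements that estimator. Whether it matches the paper's exact procedure is an assumption, and the function name `entropy_production` and the `pseudocount` smoothing parameter are hypothetical.

```python
import math
from collections import Counter


def entropy_production(states, pseudocount=1e-3):
    """Estimate entropy production (in nats) from a discrete state sequence.

    Computes sum_ij P_ij * log(P_ij / P_ji), where P_ij is the smoothed
    joint probability of observing the transition i -> j.
    """
    forward = Counter(zip(states, states[1:]))  # counts of i -> j transitions
    labels = sorted(set(states))
    n = len(labels)
    # Additive smoothing keeps the ratio finite when a reverse
    # transition was never observed.
    total = sum(forward.values()) + pseudocount * n * n
    s = 0.0
    for i in labels:
        for j in labels:
            p_ij = (forward[(i, j)] + pseudocount) / total
            p_ji = (forward[(j, i)] + pseudocount) / total
            s += p_ij * math.log(p_ij / p_ji)
    return s


# A sequence that looks the same forwards and backwards in its transitions.
reversible = [0, 1, 0, 1, 0, 1, 0]
# A sequence that cycles in one direction only, breaking detailed balance.
cyclic = [0, 1, 2] * 20

s_reversible = entropy_production(reversible)
s_cyclic = entropy_production(cyclic)
```

The estimate is zero when every transition is as frequent as its reverse, and it grows as the dynamics acquire a preferred direction, which is the sense in which a system operates "further from equilibrium".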

    Learning Structured Knowledge from Social Tagging Data: A critical review of methods and techniques

    For more than a decade, researchers have been proposing various methods and techniques to mine social tagging data and to learn structured knowledge. It is essential to conduct a comprehensive survey of the related work, which would benefit the research community by providing a better understanding of the state of the art and insights into future research directions. The paper first defines the spectrum of Knowledge Organization Systems, from unstructured with less semantics to highly structured with richer semantics. It then reviews the related work by classifying the methods and techniques into two main categories, namely, learning term lists and learning relations. Methods and techniques originating from natural language processing, data mining, machine learning, social network analysis, and the Semantic Web are discussed in detail under the two categories. We summarize the prominent issues with the current research and highlight future directions on learning constantly evolving knowledge from social media data.

    Fast feature matching for simultaneous localization and mapping

    This bachelor's thesis deals with fast matching of local image features in large databases for simultaneous localization and mapping (SLAM). It includes a brief overview of local features invariant to scale, rotation, translation, and affine transformations, together with their detectors and descriptors. Real-time matching is crucial for many computer vision applications (SLAM, object retrieval, wide–robust baseline stereo, tracking, ...). We achieve sub-linear search complexity by using multiple randomised KD-trees. In addition, we propose a novel way of splitting the dataset across the multiple trees. Moreover, a new general-purpose evaluation package (supporting KD-trees, BBD-trees, and k-means trees) was developed.
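
The idea of searching several randomised KD-trees can be sketched as follows. This is a simplified illustration and not the thesis's method: each tree picks its split dimension at random among the highest-variance dimensions, and a query descends greedily to one leaf per tree without backtracking, so the result is approximate. The thesis's novel way of splitting data across trees is not reproduced here, and the names `build`, `descend`, `forest_query` and the parameters `leaf_size` and `top_dims` are assumptions.

```python
import random


def sqdist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build(points, rng, leaf_size=4, top_dims=3):
    """Build one randomised KD-tree as nested tuples."""
    if len(points) <= leaf_size:
        return ("leaf", points)

    def variance(d):
        vals = [p[d] for p in points]
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    # Randomisation: choose among the few dimensions with the largest spread.
    dims = sorted(range(len(points[0])), key=variance, reverse=True)[:top_dims]
    d = rng.choice(dims)
    vals = sorted(p[d] for p in points)
    split = vals[len(vals) // 2]  # median value along dimension d
    left = [p for p in points if p[d] < split]
    right = [p for p in points if p[d] >= split]
    if not left:  # too many ties to split further; stop here
        return ("leaf", points)
    return ("node", d, split,
            build(left, rng, leaf_size, top_dims),
            build(right, rng, leaf_size, top_dims))


def descend(tree, q):
    """Follow the query down to a single leaf (no backtracking)."""
    while tree[0] == "node":
        _, d, split, left, right = tree
        tree = left if q[d] < split else right
    return tree[1]


def forest_query(trees, q):
    """Approximate nearest neighbour: best candidate over one leaf per tree."""
    candidates = [p for t in trees for p in descend(t, q)]
    return min(candidates, key=lambda p: sqdist(p, q))


rng = random.Random(0)
points = [tuple(rng.random() for _ in range(4)) for _ in range(200)]
trees = [build(points, rng) for _ in range(4)]
query = (0.5, 0.5, 0.5, 0.5)
approx = forest_query(trees, query)
exact = min(points, key=lambda p: sqdist(p, query))  # brute-force reference
```

Each query touches only a handful of leaves rather than the whole dataset, which is where the sub-linear cost comes from. Because the trees split differently, a neighbour missed by one tree may be found by another. Production implementations typically add bounded backtracking with a shared priority queue, which this sketch omits.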

    Discovery and characterisation of dietary patterns in two Nordic countries. Using non-supervised and supervised multivariate statistical techniques to analyse dietary survey data

    This Nordic study applies multivariate data analysis (MDA) to dietary survey data from Danish preschool children and Swedish preschool and elementary school children. Unlike comparable studies, it combines two separate MDA approaches: pattern discovery (PD) and predictive modelling (PM). PD, i.e. hierarchical cluster analysis (HCA) and factor analysis (using PCA), helped identify distinct consumer aggregations and relationships across food groups, respectively, whereas PM enabled the disclosure of deeply entrenched associations. Seventeen clusters, here defined as dietary prototypes, were identified by means of HCA in the entire bi-national data set. These prototypes underwent further processing, which disclosed several intriguing relationships in the consumption data: A striking disparity between the consumption patterns of Danish and Swedish preschool children was unveiled and further dissected by PM. Two prudent and mutually similar dietary prototypes appeared in each of two Swedish elementary school data subsets. Dietary prototypes rich in sweetened soft beverages appeared among Danish and Swedish children alike. The results suggest prototype-specific risk assessment and study design.

    Automatic Taxonomy Construction from Keywords via Scalable Bayesian Rose Trees

    In this paper, we study a challenging problem of deriving a taxonomy from a set of keyword phrases. A solution can benefit many real-world applications because i) keywords give users the flexibility and ease to characterize a specific domain; and ii) in many applications, such as online advertisements, the domain of interest is already represented by a set of keywords. However, it is impossible to create a taxonomy out of a keyword set itself. We argue that additional knowledge and context are needed. To this end, we first use a general-purpose knowledgebase and keyword search to supply the required knowledge and context. Then we develop a Bayesian approach to build a hierarchical taxonomy for a given set of keywords. We reduce the complexity of previous hierarchical clustering approaches from O(n^2 log n) to O(n log n) using a nearest-neighbor-based approximation, so that we can derive a domain-specific taxonomy from one million keyword phrases in less than an hour. Finally, we conduct comprehensive large scale experiments to show the effectiveness and efficiency of our approach. A real life example of building an insurance-related Web search query taxonomy illustrates the usefulness of our approach for specific domains.

    Wo bin ich? Beiträge zum Lokalisierungsproblem mobiler Roboter

    Self-localization addresses the problem of estimating the pose of mobile robots with respect to a certain coordinate system of their workspace. It is needed for various mobile robot applications like material handling in industry, disaster zone operations, vacuum cleaning, or even the exploration of foreign planets. Thus, self-localization is an essential capability. This problem has received considerable attention over the last decades. It can be decomposed into localization on a global and local level. Global techniques are able to localize the robot without any prior knowledge about its pose with respect to an a priori known map. In contrast, local techniques aim to correct so-called odometry errors occurring during robot motion. In this thesis, the global localization problem for mobile robots is mainly addressed. The proposed method is based on matching an incremental local map to an a priori known global map. This approach is very time- and memory-efficient and robust to structural ambiguity as well as to the occurrence of dynamic obstacles in non-static environments. The algorithm consists of several components like ego motion estimation or global point cloud matching. Nowadays most computers feature multi-core processors and thus map matching is performed by applying a parallelized variant of the Random Sample Matching (pRANSAM) approach originally devised for solving the 3D-puzzle problem. pRANSAM provides a set of hypotheses representing alleged robot poses. Techniques are discussed to postprocess the hypotheses, e.g. to decide when the robot pose is determined with sufficient accuracy. Furthermore, runtime aspects are considered in order to facilitate localization in real time. Finally, experimental results demonstrate the robustness of the method proposed in this thesis.
    The localization problem of mobile robots is the task of determining their pose with respect to a given world coordinate system. The ability to self-localize is needed in many application areas of mobile robots, such as material transport in industrial manufacturing, operations in disaster zones, or even the exploration of foreign planets. Existing methods for solving this problem are divided according to whether localization takes place at the local or the global level. Global localization algorithms determine the pose of the robot with respect to a world coordinate system without any prior knowledge, whereas local methods start from a rough estimate of the pose, e.g. from the robot's odometry data. This dissertation presents a new approach to solving the global localization problem. The basic idea is to bring a local map and a global map into agreement. The approach is highly robust both to ambiguities of the robot pose and to the occurrence of dynamic obstacles in non-static environments. The algorithm consists mainly of three components: a scan matcher for generating the local map, a method for matching the local and global maps, and a component that decides when the robot is correctly localized with sufficient certainty. The matching of the local and global maps is performed by a parallelized variant of Random Sample Matching (pRANSAM), which yields a set of pose hypotheses. These hypotheses are analysed in a further step in order to determine the correct robot pose once they are sufficiently unambiguous. Extensive experiments demonstrate the reliability and accuracy of the method presented in this dissertation.

    Graph-based Methods for Visualization and Clustering

    The amount of data that we produce and consume is larger than it has been at any point in the history of mankind, and it keeps growing exponentially. All this information, gathered in overwhelming volumes, often comes with two problematic characteristics: it is complex and lacks semantic context. A common step to address those issues is to embed raw data in lower dimensions, by finding a mapping which preserves the similarity between data points from their original space to a new one. Measuring similarity between large sets of high-dimensional objects is, however, problematic for two main reasons: first, high-dimensional points are subject to the curse of dimensionality and second, the number of pairwise distances between points is quadratic with respect to the number of data points. Both problems can be addressed by using nearest-neighbour graphs to understand the structure in data. As a matter of fact, most dimensionality reduction methods use similarity matrices that can be interpreted as graph adjacency matrices. Yet, despite recent progress, dimensionality reduction is still very challenging when applied to very large datasets. Indeed, although recent methods specifically address the problem of scalability, processing datasets of millions of elements remains a very lengthy process. In this thesis, we propose new contributions which address the problem of scalability using the framework of Graph Signal Processing (GSP), which extends traditional signal processing to graphs. We do so motivated by the premise that graphs are well suited to represent the structure of the data. In the first part of this thesis, we look at quantitative measures for the evaluation of dimensionality reduction methods. Using tools from graph theory and Graph Signal Processing, we show that specific characteristics related to quality can be assessed by taking measures on the graph, which indirectly validates the hypothesis relating graph to structure.
The second contribution is a new method for a fast eigenspace approximation of the graph Laplacian. Using principles of GSP and random matrices, we show that an approximated eigensubspace can be recovered very efficiently, which can be used for fast spectral clustering or visualization. Next, we propose a compressive scheme to accelerate any dimensionality reduction technique. The idea is based on compressive sampling and transductive learning on graphs: after computing the embedding for a small subset of data points, we propagate the information everywhere using transductive inference. The key components of this technique are a good sampling strategy to select the subset and the application of transductive learning on graphs. Finally, we address the problem of over-discriminative feature spaces by proposing a hierarchical clustering structure combined with multi-resolution graphs. Using efficient coarsening and refinement procedures on this structure, we show that dimensionality reduction algorithms can be run on intermediate levels and up-sampled to all points, leading to a very fast dimensionality reduction method. For all contributions, we provide extensive experiments on both synthetic and natural datasets, including large-scale problems. This allows us to show the pertinence of our models and the validity of our proposed algorithms. Following the principles of reproducible research, we provide everything needed to repeat the examples and the experiments presented in this work.

    Biological and genetic aspects of wild x domestic hybridization in wild boar and wolf populations

    Nowadays, hybridization is recognized as a powerful evolutionary force promoting speciation and shaping adaptation, but also as a serious threat to the conservation of biodiversity. This thesis is focused on two cases of hybridization between wild and domestic conspecifics, whose effects are mostly unexplored. In Sus scrofa, I sought to expand knowledge about hybridization between wild boar and domestic pig. I investigated the main sources of domestic gene introgression, and assessed hybridization at neutral markers and functional genes at both local and European scales. I also developed a set of new uniparental markers for studying male-specific gene flow, and studied the reproductive phenology of wild populations. In Canis lupus, I investigated patterns of hybridization between wolf and domestic dog in an Italian mountain area, focusing on the assessment of introgression and the food habits of hybrids. As regards wild boar, I detected introgression all over Europe, also highlighting the role of breeding stations in spreading domestic genes across wild populations. With respect to wolf, a new approach was used to provide complementary (genetic and phenotypic) data on specific individuals and to support hybrid identification. A trophic niche overlap between wolves and hybrids was also demonstrated. These studies can have relevant management implications, offering new elements of knowledge on different aspects of hybridization in two worrisome species of the Italian fauna.