
    Computing With Contextual Numbers

    The Self-Organizing Map (SOM) has been applied to several classical modeling tasks, including clustering, classification, function approximation, and visualization of high dimensional spaces. The final products of a trained SOM are a set of ordered (low dimensional) indices and their associated high dimensional weight vectors. While in the above-mentioned applications the final high dimensional weight vectors play the primary role in the computational steps, from a certain perspective one can interpret a SOM as a nonparametric encoder, in which the final low dimensional indices of the trained SOM are pointers into the high dimensional space. We showed how, using a one-dimensional SOM, which is uncommon in typical applications of SOM, one can develop a nonparametric mapping from a high dimensional space to a continuous one-dimensional numerical field. These numerical values, called contextual numbers, are ordered so that, in a given context, similar numbers refer to similar high dimensional states. Further, because these numbers can be treated like ordinary continuous numbers, they can stand in for their corresponding high dimensional states in any data driven modeling problem. As a potential application, we showed how contextual numbers can be used to model high dimensional spatiotemporal dynamics.
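    A minimal sketch of the idea in plain NumPy, assuming a simple 1-D SOM with a decaying Gaussian neighborhood; the function and parameter names are illustrative, not taken from the paper:

```python
import numpy as np

def train_1d_som(data, n_units=100, n_iters=5000, lr0=0.5, sigma0=10.0, seed=0):
    """Train a 1-D SOM; returns weight vectors of shape (n_units, dim)."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    weights = rng.normal(size=(n_units, dim))
    units = np.arange(n_units)
    for t in range(n_iters):
        x = data[rng.integers(len(data))]
        # best matching unit (BMU) for the sampled high dimensional state
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
        # decaying learning rate and neighborhood width
        frac = t / n_iters
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * np.exp(-3.0 * frac)
        # 1-D Gaussian neighborhood around the BMU
        h = np.exp(-((units - bmu) ** 2) / (2.0 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)
    return weights

def contextual_number(x, weights):
    """Map a high dimensional state to a number in [0, 1] via its BMU index."""
    bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
    return bmu / (len(weights) - 1)

# usage: map 50-dimensional states to ordered scalar "contextual numbers"
data = np.random.default_rng(1).normal(size=(1000, 50))
w = train_1d_som(data)
print(contextual_number(data[0], w))
```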

    Ancestral causal learning in high dimensions with a human genome-wide application

    We consider learning ancestral causal relationships in high dimensions. Our approach is driven by a supervised learning perspective, with discrete indicators of causal relationships treated as labels to be learned from available data. We focus on the setting in which some causal (ancestral) relationships are known (via background knowledge or experimental data) and put forward a general approach that scales to large problems. This is motivated by problems in human biology which are characterized by high dimensionality and potentially many latent variables. We present a case study involving interventional data from human cells with total dimension p ~ 19,000. Performance is assessed empirically by testing model output against previously unseen interventional data. The proposed approach is highly effective and demonstrably scalable to the human genome-wide setting. We consider sensitivity to background knowledge and find that results are robust to nontrivial perturbations of the input information. We also consider the case, relevant to some applications, where the only prior information available concerns a small number of known ancestral relationships.
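    A hedged sketch of the supervised-learning framing only, not the authors' actual pipeline: ordered variable pairs are described by feature vectors, known ancestral relationships supply binary labels, and a standard classifier scores unseen pairs. The feature construction and all names below are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# toy stand-in for high dimensional data over p variables
p, n = 50, 200
X = rng.normal(size=(n, p))

def pair_features(data, i, j):
    """Illustrative features for the ordered pair (i, j): correlation,
    a simple regression coefficient, and marginal variances."""
    xi, xj = data[:, i], data[:, j]
    corr = np.corrcoef(xi, xj)[0, 1]
    beta = np.dot(xi, xj) / np.dot(xi, xi)
    return np.array([corr, beta, xi.var(), xj.var()])

# known ancestral relationships act as labels (1 = i is an ancestor of j)
known_pairs = [(0, 1, 1), (2, 3, 0), (4, 5, 1), (6, 7, 0)]
F = np.array([pair_features(X, i, j) for i, j, _ in known_pairs])
y = np.array([label for _, _, label in known_pairs])

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(F, y)

# score an unseen candidate pair: estimated probability that 8 is an ancestor of 9
print(clf.predict_proba(pair_features(X, 8, 9).reshape(1, -1))[0, 1])
```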

    1-D and 2-D Parallel Algorithms for All-Pairs Similarity Problem

    The all-pairs similarity problem asks for all pairs of vectors in a set whose similarity exceeds a given threshold; it is a computational kernel in several data mining and information retrieval tasks. We investigate the parallelization of a recent fast sequential algorithm. We propose effective 1-D and 2-D data distribution strategies that preserve the essential optimizations of the fast algorithm. The 1-D parallel algorithms distribute either dimensions or vectors, whereas the 2-D parallel algorithm distributes data both ways. Additional contributions to the 1-D vertical distribution include a local pruning strategy to reduce the number of candidates, a recursive pruning algorithm, and block processing to reduce load imbalance. The parallel algorithms were programmed in OCaml, which affords much convenience. Our experiments indicate that performance depends on the dataset; therefore, a variety of parallelizations is useful.
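    For orientation, a minimal reference implementation of the problem statement itself, assuming cosine similarity; this is a naive sequential sketch, not the pruned or parallel algorithm studied in the paper:

```python
import numpy as np

def all_pairs_above(vectors, threshold):
    """Return all index pairs (i, j), i < j, with cosine similarity >= threshold."""
    # normalize rows so that the dot product equals cosine similarity
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = V @ V.T
    pairs = []
    n = len(V)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                pairs.append((i, j, float(sims[i, j])))
    return pairs

# usage on random data
data = np.random.default_rng(0).normal(size=(100, 20))
print(len(all_pairs_above(data, 0.5)))
```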

    Methods for Integrating Knowledge with the Three-Weight Optimization Algorithm for Hybrid Cognitive Processing

    In this paper, we consider optimization as an approach for quickly and flexibly developing hybrid cognitive capabilities that are efficient, scalable, and able to exploit knowledge to improve solution speed and quality. In this context, we focus on the Three-Weight Algorithm, which aims to solve general optimization problems. We propose novel methods for integrating knowledge with this algorithm to improve expressiveness, efficiency, and scaling, and demonstrate these techniques on two example problems (Sudoku and circle packing).

    Dense Passage Retrieval for Open-Domain Question Answering

    Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art results on multiple open-domain QA benchmarks. Comment: EMNLP 2020.
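    A brief sketch of dual-encoder retrieval using the publicly released DPR checkpoints through the Hugging Face transformers library; the in-memory ranking below is a simplification (no index over millions of passages), and the exact API usage is a best-effort assumption:

```python
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

# separate encoders for questions and passages (the dual-encoder framework)
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
c_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
c_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

passages = [
    "Paris is the capital and most populous city of France.",
    "The mitochondrion is the powerhouse of the cell.",
]
question = "What is the capital of France?"

with torch.no_grad():
    # embed passages and the question into the same dense vector space
    p_emb = c_enc(**c_tok(passages, padding=True, return_tensors="pt")).pooler_output
    q_emb = q_enc(**q_tok(question, return_tensors="pt")).pooler_output
    # rank passages by inner product with the question embedding
    scores = (q_emb @ p_emb.T).squeeze(0)

print(passages[int(scores.argmax())])
```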

    News Across Languages - Cross-Lingual Document Similarity and Event Tracking

    In today's world, we follow news that is distributed globally. Significant events are reported by different sources and in different languages. In this work, we address the problem of tracking events in a large multilingual stream. Within the recently developed Event Registry system, we examine two aspects of this problem: how to compare articles in different languages and how to link collections of articles in different languages that refer to the same event. Taking a multilingual stream and clusters of articles from each language, we compare different cross-lingual document similarity measures based on Wikipedia. This allows us to compute the similarity of any two articles regardless of language. Building on previous work, we show that there are methods which scale well and can compute a meaningful similarity between articles from languages with little or no direct overlap in the training data. Using this capability, we then propose an approach to link clusters of articles across languages which represent the same event. We provide an extensive evaluation of the system as a whole, as well as an evaluation of the quality and robustness of the similarity measure and the linking algorithm. Comment: Accepted for publication in the Journal of Artificial Intelligence Research, Special Track on Cross-language Algorithms and Applications.
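    A hedged sketch of the general idea behind Wikipedia-based cross-lingual similarity: represent each article as a vector over a language-independent concept inventory and compare with cosine similarity. The concept labels and weights below are toy placeholders, not Event Registry's actual representation:

```python
import numpy as np

# shared, language-independent concept inventory (placeholder labels)
concepts = ["Paris", "France", "Election", "Earthquake"]

def concept_vector(doc_concept_weights):
    """Build a vector over the shared concept space from (concept, weight) pairs."""
    v = np.zeros(len(concepts))
    for c, w in doc_concept_weights:
        v[concepts.index(c)] = w
    return v

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# an English and a German article annotated against the same concept inventory
en_doc = concept_vector([("Paris", 0.8), ("Election", 0.6)])
de_doc = concept_vector([("Paris", 0.7), ("Election", 0.5), ("France", 0.2)])

# similarity is comparable regardless of the articles' languages
print(cosine(en_doc, de_doc))
```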

    ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"

    This paper documents the release of the ELKI data mining framework, version 0.7.5. ELKI is an open source (AGPLv3) data mining software written in Java. The focus of ELKI is research in algorithms, with an emphasis on unsupervised methods in cluster analysis and outlier detection. In order to achieve high performance and scalability, ELKI offers data index structures such as the R*-tree that can provide major performance gains. ELKI is designed to be easy to extend for researchers and students in this domain, and welcomes contributions of additional methods. ELKI aims at providing a large collection of highly parameterizable algorithms, in order to allow easy and fair evaluation and benchmarking of algorithms. We first outline the motivation for this release and the plans for the future, and then give a brief overview of the new functionality in this version. We also include an appendix presenting an overview of the overall implemented functionality.

    Big Data: Understanding Big Data

    Steve Jobs, one of the greatest visionaries of our time, was quoted in 1996 as saying "a lot of times, people do not know what they want until you show it to them" [38], indicating that he advocated developing products based on human intuition rather than research. With the advancement of mobile devices, social networks, and the Internet of Things, enormous amounts of complex data, both structured and unstructured, are being captured in the hope of allowing organizations to make better business decisions, as data is now vital to an organization's success. These enormous amounts of data are referred to as Big Data, and they enable a competitive advantage over rivals when processed and analyzed appropriately. However, Big Data analytics raises several concerns, including data-lifecycle management, privacy and security, and data representation. This paper reviews the fundamental concept of Big Data, the data storage domain, and the MapReduce programming paradigm used in processing these large datasets, focuses on two case studies showing the effectiveness of Big Data analytics, and discusses how it could be of greater benefit in the future if handled appropriately. Comment: 8 pages, Big Data Analytics, Data Storage, MapReduce, Knowledge-Space, Big Data Inconsistencies.
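    The MapReduce programming paradigm mentioned above can be illustrated with the canonical word-count example; this is a single-process sketch of the map, shuffle, and reduce phases, not a distributed implementation:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit (word, 1) for every word in a document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)

documents = ["big data needs big storage", "data drives decisions"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # e.g. {'big': 2, 'data': 2, ...}
```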

    Cognitive Database: A Step towards Endowing Relational Databases with Artificial Intelligence Capabilities

    We propose Cognitive Databases, an approach for transparently enabling Artificial Intelligence (AI) capabilities in relational databases. A novel aspect of our design is to first view the structured data source as meaningful unstructured text, and then use the text to build an unsupervised neural network model using a Natural Language Processing (NLP) technique called word embedding. This model captures the hidden inter- and intra-column relationships between database tokens of different types. For each database token, the model includes a vector that encodes contextual semantic relationships. We seamlessly integrate the word embedding model into existing SQL query infrastructure and use it to enable a new class of SQL-based analytics queries called cognitive intelligence (CI) queries. CI queries use the model vectors to enable complex queries such as semantic matching, inductive reasoning queries such as analogies, predictive queries using entities not present in a database, and, more generally, queries that draw on knowledge from external sources. We demonstrate the unique capabilities of Cognitive Databases using an Apache Spark based prototype to execute inductive reasoning CI queries over a multi-modal database containing text and images. We believe our first-of-a-kind system exemplifies using AI functionality to endow relational databases with capabilities that were previously very hard to realize in practice.
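    A hedged sketch of the core idea: treat each row as a "sentence" of database tokens and train word embeddings over them with gensim. The tiny table, the textification scheme, and the similarity query are illustrative assumptions, not the paper's SQL-integrated system:

```python
from gensim.models import Word2Vec

# a tiny relational table: (customer, product, category)
rows = [
    ("alice", "espresso", "coffee"),
    ("alice", "latte", "coffee"),
    ("bob",   "green_tea", "tea"),
    ("carol", "espresso", "coffee"),
]

# view each row as a "sentence" of database tokens (meaningful unstructured text)
sentences = [list(row) for row in rows]

# train an unsupervised embedding model over the tokens
model = Word2Vec(sentences, vector_size=32, window=5, min_count=1,
                 sg=1, epochs=200, seed=0)

# a CI-style semantic matching query: which tokens are closest to "espresso"?
print(model.wv.most_similar("espresso", topn=3))
```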

    Generating and auto-tuning parallel stencil codes

    In this thesis, we present a software framework, Patus, which generates high performance stencil codes for different types of hardware platforms, including current multicore CPU and graphics processing unit architectures. The ultimate goals of the framework are productivity, portability (of both the code and its performance), and achieving high performance on the target platform. A stencil computation updates every grid point in a structured grid based on the values of its neighboring points. This class of computations occurs frequently in scientific and general purpose computing (e.g., in partial differential equation solvers or in image processing), justifying the focus on this kind of computation. The proposed key ingredients to achieve the goals of productivity, portability, and performance are domain specific languages (DSLs) and the auto-tuning methodology. The Patus stencil specification DSL allows the programmer to express a stencil computation concisely, independently of hardware architecture-specific details. Thus, it increases programmer productivity by relieving him or her of low level programming model issues and of manually applying hardware platform-specific code optimization techniques. The use of domain specific languages also implies code reusability: once implemented, the same stencil specification can be reused on different hardware platforms, i.e., the specification code is portable across hardware architectures. Constructing the language to be geared towards a special purpose makes it amenable to more aggressive optimizations and therefore to potentially higher performance. Auto-tuning provides performance and performance portability by automatically adapting implementation-specific parameters to the characteristics of the hardware on which the code will run. By automating the process of parameter tuning, which essentially amounts to solving an integer optimization problem whose objective function is the code's performance as a function of the parameter configuration, the system can also be used more productively than if the programmer had to fine-tune the code manually. We show performance results for a variety of stencils for which Patus was used to generate the corresponding implementations. The selection includes stencils taken from two real-world applications: a simulation of the temperature within the human body during hyperthermia cancer treatment, and a seismic application. These examples demonstrate the framework's flexibility and its ability to produce high performance code.
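    A compact illustration of the two ingredients described above, in NumPy rather than the Patus DSL: a 2-D Jacobi-style stencil that updates every interior grid point from its four neighbors, and a naive auto-tuning loop that benchmarks a blocking parameter and keeps the fastest configuration. The kernel and the tuned parameter are illustrative, not Patus-generated code:

```python
import time
import numpy as np

def jacobi_step(grid):
    """Reference stencil: average of the four neighbors at every interior point."""
    new = grid.copy()
    new[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                              grid[1:-1, :-2] + grid[1:-1, 2:])
    return new

def blocked_sweep(grid, block):
    """Same stencil applied block by block (a parameter an auto-tuner would vary).
    Assumes a square grid."""
    new = grid.copy()
    n = grid.shape[0]
    for i0 in range(1, n - 1, block):
        for j0 in range(1, n - 1, block):
            i1, j1 = min(i0 + block, n - 1), min(j0 + block, n - 1)
            new[i0:i1, j0:j1] = 0.25 * (grid[i0-1:i1-1, j0:j1] + grid[i0+1:i1+1, j0:j1] +
                                        grid[i0:i1, j0-1:j1-1] + grid[i0:i1, j0+1:j1+1])
    return new

grid = np.random.default_rng(0).random((1024, 1024))
# the blocked variant must compute the same result as the reference stencil
assert np.allclose(jacobi_step(grid), blocked_sweep(grid, 64))

# naive auto-tuning: benchmark candidate block sizes and keep the fastest
best = None
for block in (16, 32, 64, 128, 256):
    t0 = time.perf_counter()
    blocked_sweep(grid, block)
    elapsed = time.perf_counter() - t0
    if best is None or elapsed < best[1]:
        best = (block, elapsed)
print("best block size:", best[0])
```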