Computing With Contextual Numbers
The Self-Organizing Map (SOM) has been applied to several classical modeling tasks, including clustering, classification, function approximation, and visualization of high-dimensional spaces. The final products of a trained SOM are a set of ordered (low-dimensional) indices and their associated high-dimensional weight vectors. While in the above-mentioned applications the final high-dimensional weight vectors play the primary role in the computational steps, from another perspective one can interpret the SOM as a nonparametric encoder, in which the final low-dimensional indices of the trained SOM are pointers into the high-dimensional space. We showed how, using a one-dimensional SOM, which is uncommon in typical applications of SOM, one can develop a nonparametric mapping from a high-dimensional space to a continuous one-dimensional numerical field. These numerical values, called contextual numbers, are ordered such that, within a given context, similar numbers refer to similar high-dimensional states. Further, since these numbers can be treated like ordinary continuous numbers, they can stand in for their corresponding high-dimensional states in any data-driven modeling problem. As a potential application, we showed how contextual numbers can be used for modeling high-dimensional spatiotemporal dynamics.
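The construction can be illustrated with a minimal sketch of a one-dimensional SOM used as a nonparametric encoder. This is an illustrative NumPy implementation under assumed settings (grid size, learning-rate and neighborhood schedules, toy data), not the authors' code:

    import numpy as np

    def train_1d_som(data, n_nodes=100, n_iters=5000, lr0=0.5, sigma0=None, seed=0):
        """Train a one-dimensional SOM on (n_samples, dim) data.

        Returns weights of shape (n_nodes, dim); the index of a node,
        rescaled to [0, 1], serves as the contextual number of the
        high-dimensional states it represents.
        """
        rng = np.random.default_rng(seed)
        dim = data.shape[1]
        sigma0 = sigma0 or n_nodes / 4
        weights = rng.normal(size=(n_nodes, dim))
        positions = np.arange(n_nodes)
        for t in range(n_iters):
            x = data[rng.integers(len(data))]
            # Best-matching unit: the node whose weight vector is closest to x.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Exponentially decaying learning rate and neighborhood width.
            lr = lr0 * np.exp(-t / n_iters)
            sigma = sigma0 * np.exp(-t / n_iters)
            # Gaussian neighborhood on the one-dimensional index line.
            h = np.exp(-((positions - bmu) ** 2) / (2 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
        return weights

    def contextual_number(x, weights):
        """Map a high-dimensional state to a number in [0, 1] via its BMU index."""
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
        return bmu / (len(weights) - 1)

    # Toy usage: similar 10-dimensional states receive nearby contextual numbers.
    data = np.random.default_rng(1).normal(size=(1000, 10))
    W = train_1d_som(data)
    print(contextual_number(data[0], W))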
Ancestral causal learning in high dimensions with a human genome-wide application
We consider learning ancestral causal relationships in high dimensions. Our
approach is driven by a supervised learning perspective, with discrete
indicators of causal relationships treated as labels to be learned from
available data. We focus on the setting in which some causal (ancestral)
relationships are known (via background knowledge or experimental data) and put
forward a general approach that scales to large problems. This is motivated by
problems in human biology which are characterized by high dimensionality and
potentially many latent variables. We present a case study involving
interventional data from human cells with total dimension . Performance is assessed empirically by testing model output against
previously unseen interventional data. The proposed approach is highly
effective and demonstrably scalable to the human genome-wide setting. We
consider sensitivity to background knowledge and find that results are robust
to nontrivial perturbations of the input information. We also consider the case, relevant to some applications, where the only prior information available concerns a small number of known ancestral relationships.
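A minimal sketch of the supervised framing described above, not the authors' method: ordered variable pairs are featurized from the data matrix, pairs with known ancestral status act as labeled examples, and a standard classifier scores the remaining pairs. The features, the classifier, and the toy background knowledge below are illustrative assumptions:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    n_samples, n_vars = 500, 50
    X = rng.normal(size=(n_samples, n_vars))  # stand-in for an expression matrix

    def pair_features(X, i, j):
        """Illustrative features for the ordered pair (i, j): 'is i an ancestor of j?'."""
        xi, xj = X[:, i], X[:, j]
        corr = np.corrcoef(xi, xj)[0, 1]
        return np.array([corr, xi.std(), xj.std(), np.mean(xi * xj ** 2)])

    # Hypothetical background knowledge: pairs with known ancestral status (1 = ancestor).
    known_pairs = [(0, 1, 1), (2, 3, 0), (4, 5, 1), (6, 7, 0), (8, 9, 1), (10, 11, 0)]
    F = np.array([pair_features(X, i, j) for i, j, _ in known_pairs])
    y = np.array([label for _, _, label in known_pairs])

    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(F, y)

    # Score an unlabeled pair: predicted probability that variable 12 is an ancestor of 13.
    print(clf.predict_proba(pair_features(X, 12, 13).reshape(1, -1))[0, 1])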
1-D and 2-D Parallel Algorithms for All-Pairs Similarity Problem
The all-pairs similarity problem asks for all vector pairs in a set of vectors whose similarities surpass a given similarity threshold; it is a computational kernel in data mining and information retrieval for several tasks. We investigate the parallelization of a recent fast sequential algorithm. We propose effective 1-D and 2-D data distribution strategies that preserve the essential optimizations in the fast algorithm. The 1-D parallel algorithms distribute either dimensions or vectors, whereas the 2-D parallel algorithm distributes data both ways. Additional contributions to the 1-D vertical distribution include a local pruning strategy to reduce the number of candidates, a recursive pruning algorithm, and block processing to reduce imbalance. The parallel algorithms were programmed in OCaml, which affords much convenience. Our experiments indicate that performance depends on the dataset; therefore, a variety of parallelizations is useful.
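For reference, a brute-force statement of the underlying computational kernel. This is only the problem definition; the paper's contribution lies in the pruning-preserving 1-D and 2-D parallel data distributions built on a fast sequential algorithm, which this sketch does not reproduce:

    import numpy as np
    from itertools import combinations

    def all_pairs_similarity(vectors, threshold):
        """Return all index pairs (i, j), i < j, with cosine similarity >= threshold."""
        # Normalize rows so that cosine similarity reduces to a dot product.
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        unit = vectors / np.clip(norms, 1e-12, None)
        result = []
        for i, j in combinations(range(len(unit)), 2):
            if unit[i] @ unit[j] >= threshold:
                result.append((i, j))
        return result

    vecs = np.random.default_rng(0).random((200, 64))
    print(len(all_pairs_similarity(vecs, 0.9)))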
Methods for Integrating Knowledge with the Three-Weight Optimization Algorithm for Hybrid Cognitive Processing
In this paper we consider optimization as an approach for quickly and flexibly developing hybrid cognitive capabilities that are efficient and scalable and that can exploit knowledge to improve solution speed and quality. In this context, we focus on the Three-Weight Algorithm, which aims to solve general optimization problems. We propose novel methods for integrating knowledge with this algorithm to improve expressiveness, efficiency, and scaling, and demonstrate these techniques on two example problems (Sudoku and circle packing).
Dense Passage Retrieval for Open-Domain Question Answering
Open-domain question answering relies on efficient passage retrieval to
select candidate contexts, where traditional sparse vector space models, such
as TF-IDF or BM25, are the de facto method. In this work, we show that
retrieval can be practically implemented using dense representations alone,
where embeddings are learned from a small number of questions and passages by a
simple dual-encoder framework. When evaluated on a wide range of open-domain QA
datasets, our dense retriever outperforms a strong Lucene-BM25 system by a large margin of 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish a new state of the art on multiple open-domain QA benchmarks.
Comment: EMNLP 2020.
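A minimal sketch of the dual-encoder idea: questions and passages are embedded by separate encoders, relevance is the dot product of their embeddings, and training uses in-batch negatives. The hashed bag-of-words encoder below is a toy stand-in for DPR's BERT encoders, and the data and hyperparameters are illustrative:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BagOfWordsEncoder(nn.Module):
        """Toy stand-in for a BERT-style encoder: hashed bag of words -> dense vector."""
        def __init__(self, vocab_size=10000, dim=128):
            super().__init__()
            self.vocab_size = vocab_size
            self.embed = nn.EmbeddingBag(vocab_size, dim, mode="mean")

        def forward(self, texts):
            ids = [torch.tensor([hash(w) % self.vocab_size for w in t.lower().split()])
                   for t in texts]
            offsets = torch.tensor([0] + [len(x) for x in ids[:-1]]).cumsum(0)
            return self.embed(torch.cat(ids), offsets)

    q_enc, p_enc = BagOfWordsEncoder(), BagOfWordsEncoder()
    opt = torch.optim.Adam(list(q_enc.parameters()) + list(p_enc.parameters()), lr=1e-2)

    questions = ["who wrote hamlet", "capital of france"]
    passages = ["Hamlet is a tragedy written by William Shakespeare.",
                "Paris is the capital and largest city of France."]

    for step in range(300):
        scores = q_enc(questions) @ p_enc(passages).T  # dot-product relevance
        # In-batch negatives: passage i is the positive for question i.
        loss = F.cross_entropy(scores, torch.arange(len(questions)))
        opt.zero_grad()
        loss.backward()
        opt.step()

    # After training, the matching passage should get the top score for each question.
    with torch.no_grad():
        print((q_enc(questions) @ p_enc(passages).T).argmax(dim=1))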
News Across Languages - Cross-Lingual Document Similarity and Event Tracking
In today's world, we follow news which is distributed globally. Significant
events are reported by different sources and in different languages. In this
work, we address the problem of tracking events in a large multilingual stream. Within a recently developed system, Event Registry, we examine two aspects of this problem: how to compare articles in different languages and how
to link collections of articles in different languages which refer to the same
event. Taking a multilingual stream and clusters of articles from each
language, we compare different cross-lingual document similarity measures based
on Wikipedia. This allows us to compute the similarity of any two articles
regardless of language. Building on previous work, we show there are methods
which scale well and can compute a meaningful similarity between articles from
languages with little or no direct overlap in the training data. Using this
capability, we then propose an approach to link clusters of articles across
languages which represent the same event. We provide an extensive evaluation of
the system as a whole, as well as an evaluation of the quality and robustness
of the similarity measure and the linking algorithm.
Comment: Accepted for publication in the Journal of Artificial Intelligence Research, Special Track on Cross-Language Algorithms and Applications.
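One family of Wikipedia-based cross-lingual similarity measures projects documents from any language into a shared concept space, with concepts aligned across languages via inter-language links, and compares documents there. Below is a minimal sketch of that idea; the hand-built concept index is hypothetical, and the paper evaluates several concrete measures rather than this toy one:

    import numpy as np

    def concept_vector(doc_tokens, concept_index, n_concepts):
        """Project a document onto a shared concept space.

        concept_index maps a token (in any language) to (concept_id, weight)
        pairs, e.g. derived from Wikipedia articles aligned across languages.
        """
        v = np.zeros(n_concepts)
        for tok in doc_tokens:
            for concept, w in concept_index.get(tok, []):
                v[concept] += w
        return v

    def cross_lingual_similarity(doc_a, doc_b, concept_index, n_concepts):
        """Cosine similarity in concept space, independent of the documents' languages."""
        a = concept_vector(doc_a, concept_index, n_concepts)
        b = concept_vector(doc_b, concept_index, n_concepts)
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom else 0.0

    # Tiny illustrative index: English and German tokens mapped to shared concepts.
    index = {"earthquake": [(0, 1.0)], "erdbeben": [(0, 1.0)],
             "paris": [(1, 1.0)], "election": [(2, 1.0)], "wahl": [(2, 1.0)]}
    print(cross_lingual_similarity(["earthquake", "paris"], ["erdbeben"], index, 3))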
ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"
This paper documents the release of the ELKI data mining framework, version
0.7.5.
ELKI is an open source (AGPLv3) data mining software written in Java. The
focus of ELKI is research in algorithms, with an emphasis on unsupervised
methods in cluster analysis and outlier detection. In order to achieve high
performance and scalability, ELKI offers data index structures such as the
R*-tree that can provide major performance gains. ELKI is designed to be easy
to extend for researchers and students in this domain, and welcomes
contributions of additional methods. ELKI aims at providing a large collection
of highly parameterizable algorithms, in order to allow easy and fair
evaluation and benchmarking of algorithms.
We will first outline the motivation for this release and the plans for the future, and then give a brief overview of the new functionality in this version. We also include an appendix presenting an overview of the overall implemented functionality.
Big Data: Understanding Big Data
Steve Jobs, one of the greatest visionaries of our time, was quoted in 1996 as saying that "a lot of times, people do not know what they want until you show it to them" [38], indicating that he advocated developing products based on human intuition rather than research. With the advancement of mobile devices, social networks, and the Internet of Things, enormous amounts of complex data, both structured and unstructured, are being captured in the hope of allowing organizations to make better business decisions, as data is now vital to an organization's success. These enormous amounts of data are referred to as Big Data, which, when processed and analyzed appropriately, can provide a competitive advantage over rivals. However, Big Data analytics raises several concerns, including data-lifecycle management, privacy and security, and data representation. This paper reviews the fundamental concept of Big Data and the data storage domain, discusses the MapReduce programming paradigm used to process these large datasets, focuses on two case studies showing the effectiveness of Big Data analytics, and presents how it could be of greater good in the future if handled appropriately.
Comment: 8 pages, Big Data Analytics, Data Storage, MapReduce, Knowledge-Space, Big Data Inconsistencies.
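For readers unfamiliar with the paradigm, a minimal in-process sketch of MapReduce's map, shuffle, and reduce phases on the classic word-count example; this is illustrative only, and real deployments such as Hadoop distribute these phases across a cluster:

    from collections import defaultdict

    def map_phase(document):
        """Map: emit a (word, 1) pair for every word in one input split."""
        return [(word.lower(), 1) for word in document.split()]

    def shuffle(mapped):
        """Shuffle: group the intermediate pairs by key across all splits."""
        groups = defaultdict(list)
        for pairs in mapped:
            for key, value in pairs:
                groups[key].append(value)
        return groups

    def reduce_phase(groups):
        """Reduce: aggregate the values collected for each key."""
        return {key: sum(values) for key, values in groups.items()}

    splits = ["big data enables better decisions",
              "big data requires scalable storage and processing"]
    print(reduce_phase(shuffle([map_phase(s) for s in splits])))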
Cognitive Database: A Step towards Endowing Relational Databases with Artificial Intelligence Capabilities
We propose Cognitive Databases, an approach for transparently enabling
Artificial Intelligence (AI) capabilities in relational databases. A novel
aspect of our design is to first view the structured data source as meaningful
unstructured text, and then use the text to build an unsupervised neural
network model using a Natural Language Processing (NLP) technique called word
embedding. This model captures the hidden inter-/intra-column relationships
between database tokens of different types. For each database token, the model
includes a vector that encodes contextual semantic relationships. We seamlessly
integrate the word embedding model into existing SQL query infrastructure and
use it to enable a new class of SQL-based analytics queries called cognitive
intelligence (CI) queries. CI queries use the model vectors to enable complex
queries such as semantic matching, inductive reasoning queries such as
analogies, predictive queries using entities not present in a database, and,
more generally, using knowledge from external sources. We demonstrate unique
capabilities of Cognitive Databases using an Apache Spark-based prototype to execute inductive reasoning CI queries over a multi-modal database containing text and images. We believe our first-of-a-kind system exemplifies using AI functionality to endow relational databases with capabilities that were previously very hard to realize in practice.
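A minimal sketch of the textification-plus-embedding step, using gensim's Word2Vec (version 4.x parameter names) as a stand-in for the paper's NLP pipeline; the toy table, token scheme, and hyperparameters are illustrative, and the SQL/Spark integration is not shown:

    from gensim.models import Word2Vec

    # Each relational row is "textified" into a sentence of column:value tokens,
    # so tokens that co-occur in rows end up close together in embedding space.
    rows = [
        {"name": "alice", "dept": "genomics", "city": "boston"},
        {"name": "bob",   "dept": "genomics", "city": "cambridge"},
        {"name": "carol", "dept": "finance",  "city": "new_york"},
        {"name": "dave",  "dept": "finance",  "city": "new_york"},
    ]
    sentences = [[f"{col}:{val}" for col, val in row.items()] for row in rows]

    # Train an unsupervised word-embedding model over the textified rows.
    model = Word2Vec(sentences, vector_size=32, window=5, min_count=1, sg=1, epochs=200)

    # A toy "semantic match": which tokens are closest to alice according to the
    # learned vectors, rather than according to exact value equality?
    print(model.wv.most_similar("name:alice", topn=3))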
Generating and auto-tuning parallel stencil codes
In this thesis, we present a software framework, Patus, which generates high-performance stencil codes for different types of hardware platforms, including current multicore CPU and graphics processing unit architectures. The ultimate goals of the framework are productivity, portability (of both the code and its performance), and achieving high performance on the target platform.
A stencil computation updates every grid point in a structured grid based on the values of its neighboring points. This class of computations occurs frequently in scientific and general purpose computing (e.g., in partial differential equation solvers or in image processing), justifying the focus on this kind of computation.
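For concreteness, here is a minimal NumPy sketch of such a stencil: a 5-point Jacobi update in which each interior grid point becomes the average of its four neighbors. This is plain Python for illustration only; Patus generates optimized implementations of kernels of this kind from a DSL specification:

    import numpy as np

    def jacobi_step(u):
        """One sweep of a 5-point stencil: every interior point becomes the
        average of its four neighbors (a classic heat-diffusion update)."""
        new = u.copy()
        new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                  u[1:-1, :-2] + u[1:-1, 2:])
        return new

    # Toy grid with a hot boundary on one edge; repeated sweeps diffuse the heat inward.
    grid = np.zeros((64, 64))
    grid[0, :] = 100.0
    for _ in range(200):
        grid = jacobi_step(grid)
    print(round(float(grid[32, 32]), 3))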
The proposed key ingredients to achieve the goals of productivity, portability, and performance are domain specific languages (DSLs) and the auto-tuning methodology.
The Patus stencil specification DSL allows the programmer to express a stencil computation concisely, independently of hardware architecture-specific details. It thus increases programmer productivity by relieving the programmer of low-level programming model issues and of manually applying hardware platform-specific code optimization techniques. The use of domain-specific languages also implies code reusability: once implemented, the same stencil specification can be reused on different hardware platforms, i.e., the specification code is portable across hardware architectures. Gearing the language towards a special purpose makes it amenable to more aggressive optimizations and therefore to potentially higher performance.
Auto-tuning provides performance and performance portability by automatically adapting implementation-specific parameters to the characteristics of the hardware on which the code will run. By automating the process of parameter tuning (which essentially amounts to solving an integer programming problem whose objective function is the code's performance as a function of the parameter configuration), the system can also be used more productively than if the programmer had to fine-tune the code manually.
We show performance results for a variety of stencils for which Patus was used to generate the corresponding implementations. The selection includes stencils taken from two real-world applications: a simulation of the temperature within the human body during hyperthermia cancer treatment, and a seismic application. These examples demonstrate the framework's flexibility and its ability to produce high-performance code.