A Compact Index for Order-Preserving Pattern Matching
Order-preserving pattern matching was introduced recently, but it has already
attracted much attention. Given a reference sequence and a pattern, we want to
locate all substrings of the reference sequence whose elements have the same
relative order as the pattern elements. For this problem we consider the
offline version in which we build an index for the reference sequence so that
subsequent searches can be completed very efficiently. We propose a
space-efficient index that works well in practice despite its lack of good
worst-case time bounds. Our solution is based on the new approach of
decomposing the indexed sequence into an order component, containing ordering
information, and a delta component, containing information on the absolute
values. Experiments show that this approach is viable, is faster than the
available alternatives, and is the first to offer small space usage and fast
retrieval simultaneously.
Comment: 16 pages. A preliminary version appeared in the Proc. IEEE Data
Compression Conference, DCC 2017, Snowbird, UT, USA, 2017
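The matching criterion described above can be illustrated with a minimal sketch (a naive scan over the reference, not the paper's compact index; the example sequences are invented for illustration):

```python
# Order-preserving pattern matching, naive version: a window matches
# when its elements have the same relative order as the pattern.
# Assumes distinct values within each window (ties are not handled).

def order_signature(seq):
    """Map each value to its rank, keeping only relative order."""
    ranks = {v: r for r, v in enumerate(sorted(set(seq)))}
    return tuple(ranks[v] for v in seq)

def op_match(reference, pattern):
    """Return start positions of substrings order-isomorphic to pattern."""
    m = len(pattern)
    sig = order_signature(pattern)
    return [i for i in range(len(reference) - m + 1)
            if order_signature(reference[i:i + m]) == sig]

# Windows starting at 0 and 2 rise, fall back, but stay above the start.
print(op_match([10, 20, 15, 25, 18, 30], [1, 3, 2]))  # [0, 2]
```

An index, as proposed in the paper, avoids recomputing the signature of every window at query time; the naive scan above only fixes the semantics of a match.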
An Information-Theoretic Test for Dependence with an Application to the Temporal Structure of Stock Returns
Information theory provides ideas for conceptualising information and
measuring relationships between objects. It has found wide application in the
sciences, but economics and finance have made surprisingly little use of it. We
show that time series data can usefully be studied as information -- by noting
the relationship between statistical redundancy and dependence, we are able to
use the results of information theory to construct a test for joint dependence
of random variables. The test is in the same spirit of those developed by
Ryabko and Astola (2005, 2006b,a), but differs from these in that we add extra
randomness to the original stochatic process. It uses data compression to
estimate the entropy rate of a stochastic process, which allows it to measure
dependence among sets of random variables, as opposed to the existing
econometric literature that uses entropy and finds itself restricted to
pairwise tests of dependence. We show how serial dependence may be detected in
S&P500 and PSI20 stock returns over different sample periods and frequencies.
We apply the test to synthetic data to judge its ability to recover known
temporal dependence structures.
Comment: 22 pages, 7 figures
Adaptive text mining: Inferring structure from sequences
Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although hand-crafted heuristics are a common practical approach for extracting information from text, a general, and generalizable, approach requires adaptive techniques. This paper studies the way in which the adaptive techniques used in text compression can be applied to text mining. It develops several examples: extraction of hierarchical phrase structures from text, identification of keyphrases in documents, locating proper names and quantities of interest in a piece of text, text categorization, word segmentation, acronym extraction, and structure recognition. We conclude that compression forms a sound unifying principle that allows many text mining problems to be tackled adaptively.
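One of the listed tasks, text categorization, illustrates the compression-as-unifying-principle idea directly: assign a document to the category whose corpus compresses it best. The sketch below uses zlib as a stand-in for the adaptive models discussed in the paper, and the corpora are invented:

```python
import zlib

def compress_gain(corpus, doc):
    """Extra bytes needed to compress doc when corpus is seen first:
    a crude stand-in for conditional (adaptive) model compression."""
    base = len(zlib.compress(corpus.encode()))
    return len(zlib.compress((corpus + doc).encode())) - base

def categorize(doc, corpora):
    """Assign doc to the category whose corpus compresses it best."""
    return min(corpora, key=lambda c: compress_gain(corpora[c], doc))

# Hypothetical training corpora, one per category.
corpora = {
    "weather": "rain wind cloud storm sunshine forecast temperature " * 20,
    "finance": "stock bond yield market price dividend portfolio " * 20,
}
print(categorize("the market price of the bond fell", corpora))  # finance
```

A document sharing vocabulary with a corpus yields long back-references and hence a smaller compression gain, so no explicit feature engineering is needed.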
Computational Approaches to Measuring the Similarity of Short Contexts : A Review of Applications and Methods
Measuring the similarity of short written contexts is a fundamental problem
in Natural Language Processing. This article provides a unifying framework by
which short context problems can be categorized both by their intended
application and proposed solution. The goal is to show that various problems
and methodologies that appear quite different on the surface are in fact very
closely related. The axes by which these categorizations are made include the
format of the contexts (headed versus headless), the way in which the contexts
are to be measured (first-order versus second-order similarity), and the
information used to represent the features in the contexts (micro versus macro
views). The unifying thread that binds together many short context applications
and methods is the fact that similarity decisions must be made between contexts
that share few (if any) words in common.
Comment: 23 pages
Scalable processing and autocovariance computation of big functional data
This is the peer-reviewed version of the following article: Brisaboa NR, Cao R, Paramá JR, Silva-Coira F. Scalable processing and autocovariance computation of big functional data. Softw Pract Exper. 2018; 48: 123–140, published in final form at https://doi.org/10.1002/spe.2524 .
[Abstract]: This paper presents two main contributions. The first is a compact representation of huge sets of functional data or trajectories of continuous-time stochastic processes, which allows the data to be kept compressed even while being processed in main memory. It is designed to facilitate the efficient computation of the sample autocovariance function without first decompressing the data set, using only partial local decoding. The second contribution is a new memory-efficient algorithm to compute the sample autocovariance function. In our experiments, the combination of the compact representation and the new algorithm yielded the following benefits: the compressed data occupy 75% of the disk space needed by the original data, and the computation of the autocovariance function used up to 13 times less main memory and ran 65% faster than the classical method implemented, for example, in R.
This work was supported by the Ministerio de Economía y Competitividad (PGE and FEDER) under grants TIN2016-78011-C4-1-R, MTM2014-52876-R, and TIN2013-46238-C4-3-R; Centro para el Desarrollo Tecnológico e Industrial MINECO under grants IDI-20141259, ITC-20151247, ITC-20151305, and ITC-20161074; Xunta de Galicia (co-funded with FEDER) under Grupos de Referencia Competitiva grant ED431C-2016-015; Xunta de Galicia-Consellería de Cultura, Educación e Ordenación Universitaria (co-funded with FEDER) under Redes grants R2014/041 and ED341D R2016/045, and under Centro Singular de Investigación de Galicia grant ED431G/01.
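The memory-saving idea behind the second contribution can be sketched as a single streaming pass that never holds all curves at once (this shows only the one-pass accumulation, not the paper's compressed representation, and assumes the biased 1/n estimator):

```python
import numpy as np

def autocovariance(curves):
    """Sample autocovariance C(s, t) of functional data in one pass:
    accumulate the running sum and outer-product sum per curve,
    so memory is O(grid^2) regardless of the number of curves."""
    it = iter(curves)
    first = np.asarray(next(it), dtype=float)
    n, s, ss = 1, first.copy(), np.outer(first, first)
    for x in it:
        x = np.asarray(x, dtype=float)
        n += 1
        s += x
        ss += np.outer(x, x)
    mean = s / n
    # E[X(s) X(t)] - E[X(s)] E[X(t)], the biased (1/n) estimator.
    return ss / n - np.outer(mean, mean)

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 4))      # 500 curves on a 4-point grid
C = autocovariance(iter(data))
print(np.allclose(C, np.cov(data, rowvar=False, bias=True)))  # True
```

Because only the accumulators live in memory, the same loop works whether the curves arrive from RAM, disk, or a decoder that decompresses one curve at a time.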
Enhancing Deep Learning Models through Tensorization: A Comprehensive Survey and Framework
The burgeoning growth of public domain data and the increasing complexity of
deep learning model architectures have underscored the need for more efficient
data representation and analysis techniques. This paper is motivated by the
work of Helal (2023) and aims to present a comprehensive overview of
tensorization. This transformative approach bridges the gap between the
inherently multidimensional nature of data and the simplified 2-dimensional
matrices commonly used in linear algebra-based machine learning algorithms.
This paper explores the steps involved in tensorization, multidimensional data
sources, various multiway analysis methods employed, and the benefits of these
approaches. A small example of Blind Source Separation (BSS) is presented
comparing 2-dimensional algorithms and a multiway algorithm in Python. Results
indicate that multiway analysis is more expressive. Contrary to the intuition
suggested by the curse of dimensionality, utilising multidimensional datasets
in their native form and applying multiway analysis methods grounded in
multilinear algebra reveal a profound capacity to capture intricate
interrelationships among various dimensions while, surprisingly, reducing the
number of model parameters and accelerating processing. A survey of multiway
analysis methods and their integration with various Deep Neural Network models
is presented using case studies in different application domains.
Comment: 34 pages, 8 figures, 4 tables
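The parameter-reduction claim can be made concrete with a tiny multilinear-algebra sketch: a rank-1 three-way tensor is fully described by its mode factors, far fewer numbers than its flattened form (sizes and data below are invented for illustration):

```python
import numpy as np

# A rank-1 three-way tensor: the outer product of three mode factors.
rng = np.random.default_rng(0)
a, b, c = rng.normal(size=5), rng.normal(size=6), rng.normal(size=7)
T = np.einsum("i,j,k->ijk", a, b, c)

# A flattened (matrix/vector) view of T stores 5*6*7 = 210 numbers,
# while the multiway (rank-1 CP) model needs only 5 + 6 + 7 = 18.
flat_params = T.size
cp_params = a.size + b.size + c.size
print(flat_params, cp_params)  # 210 18

# Multilinear algebra recovers the mode-1 factor (up to scale) from
# the mode-1 unfolding: an SVD of the 5 x 42 reshaped tensor.
U, S, Vt = np.linalg.svd(T.reshape(5, -1), full_matrices=False)
a_hat = U[:, 0] * S[0]
ratio = a_hat / a                 # constant: a_hat is proportional to a
print(np.allclose(ratio, ratio[0]))  # True
```

Higher CP ranks and other decompositions (Tucker, tensor-train) generalize this trade-off: structure along each mode is modelled with per-mode factors instead of one large flattened parameter block.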
A Nine Month Progress Report on an Investigation into Mechanisms for Improving Triple Store Performance
This report considers the requirement for fast, efficient, and scalable triple stores as part of the effort to produce the Semantic Web. It summarises relevant information in the major background field of Database Management Systems (DBMS), and provides an overview of the techniques currently in use amongst the triple store community. The report concludes that for individuals and organisations to be willing to provide large amounts of information as openly-accessible nodes on the Semantic Web, storage and querying of the data must be cheaper and faster than it is currently. Experiences from the DBMS field can be used to maximise triple store performance, and suggestions are provided for lines of investigation in areas of storage, indexing, and query optimisation. Finally, work packages are provided describing expected timetables for further study of these topics.
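One indexing technique common in the triple-store literature is to keep several permutations of the (subject, predicate, object) key so that any lookup pattern hits a suitable index. A toy in-memory sketch (illustrative only, not the report's proposal):

```python
from collections import defaultdict

class TripleStore:
    """Toy triple store with three permutation indexes (SPO, POS, OSP),
    so each lookup pattern is answered from a suitable index."""

    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def query(self, s=None, p=None, o=None):
        """Yield triples matching a pattern, where None is a wildcard."""
        if s is not None:                      # bound subject: SPO index
            for p2, objs in self.spo.get(s, {}).items():
                for o2 in objs:
                    if (p is None or p2 == p) and (o is None or o2 == o):
                        yield (s, p2, o2)
        elif p is not None:                    # bound predicate: POS index
            for o2, subs in self.pos.get(p, {}).items():
                if o is None or o2 == o:
                    for s2 in subs:
                        yield (s2, p, o2)
        elif o is not None:                    # bound object: OSP index
            for s2, preds in self.osp.get(o, {}).items():
                for p2 in preds:
                    yield (s2, p2, o)
        else:                                  # no bound term: full scan
            for s2 in list(self.spo):
                yield from self.query(s=s2)

store = TripleStore()
store.add("alice", "knows", "bob")
store.add("alice", "age", "30")
store.add("bob", "knows", "carol")
print(sorted(store.query(p="knows")))
# [('alice', 'knows', 'bob'), ('bob', 'knows', 'carol')]
```

The space cost of the extra permutations is the classic trade-off the report's storage and indexing work packages would need to weigh against query speed.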