How incomputable is Kolmogorov complexity?
Kolmogorov complexity is the length of the ultimately compressed version of a
file (that is, anything which can be put in a computer). Formally, it is the
length of a shortest program from which the file can be reconstructed. We
discuss the incomputability of Kolmogorov complexity, which formal loopholes
this leaves us, recent approaches to compute or approximate Kolmogorov
complexity, which approaches are problematic and which approaches are viable.
Comment: 9 pages LaTe
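One common practical stand-in for the "ultimately compressed" length is the output size of a real compressor. The sketch below is a minimal illustration under that assumption and is not a method taken from the paper: the function name `compressed_length_bits` and the two sample inputs are hypothetical, and zlib yields only an upper bound (up to an additive constant for the decompressor), never the true Kolmogorov complexity.

```python
import random
import zlib


def compressed_length_bits(data: bytes) -> int:
    """Length in bits of the zlib-compressed form of `data` (level 9).

    This is an upper-bound proxy for Kolmogorov complexity, not the real value.
    """
    return 8 * len(zlib.compress(data, 9))


# Two inputs of equal length: one highly regular, one pseudo-random.
rng = random.Random(0)
regular = b"ab" * 1000
noisy = bytes(rng.getrandbits(8) for _ in range(2000))

print(compressed_length_bits(regular), compressed_length_bits(noisy))
```

The regular input compresses to a small fraction of its size while the pseudo-random one does not, even though the pseudo-random bytes come from a short program (a seeded generator). That gap is one reason compressor-based estimates can be far above the true complexity.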
Consistent Quantification of Complex Dynamics via a Novel Statistical Complexity Measure
Natural systems often show complex dynamics. The quantification of such complex dynamics is an important step in, e.g., characterizing and classifying different systems or investigating the effect of an external perturbation on the dynamics. Promising routes were followed in the past using concepts based on (Shannon’s) entropy. Here, we propose a new, conceptually sound measure that can be pragmatically computed, in contrast to purely theoretical concepts based on, e.g., Kolmogorov complexity. We illustrate the applicability using a toy example with a control parameter and go on to the molecular evolution of the HIV1 protease, for which drug treatment can be regarded as an external perturbation that changes the complexity of its molecular evolutionary dynamics. In fact, our method identifies exactly those residues which are known to bind the drug molecules by their noticeable signal. We furthermore apply our method in a completely different domain, namely foreign exchange rates, and find convincing results as well.
Grokking Beyond Neural Networks: An Empirical Exploration with Model Complexity
In some settings neural networks exhibit a phenomenon known as grokking,
where they achieve perfect or near-perfect accuracy on the validation set long
after the same performance has been achieved on the training set. In this
paper, we discover that grokking is not limited to neural networks but occurs
in other settings such as Gaussian process (GP) classification, GP regression
and linear regression. We also uncover a mechanism by which to induce grokking
on algorithmic datasets via the addition of dimensions containing spurious
information. The presence of the phenomenon in non-neural architectures
provides evidence that grokking is not specific to SGD or weight norm
regularisation. Instead, grokking may be possible in any setting where solution
search is guided by complexity and error. Based on this insight and further
trends we see in the training trajectories of a Bayesian neural network (BNN)
and GP regression model, we make progress towards a more general theory of
grokking. Specifically, we hypothesise that the phenomenon is governed by the
accessibility of certain regions in the error and complexity landscapes.
Evaluating Point Cloud Quality via Transformational Complexity
Full-reference point cloud quality assessment (FR-PCQA) aims to infer the
quality of distorted point clouds with available references. Merging the
research of cognitive science and intuition of the human visual system (HVS),
the difference between the expected perceptual result and the practical
perception reproduction in the visual center of the cerebral cortex indicates
the subjective quality degradation. Therefore in this paper, we try to derive
the point cloud quality by measuring the complexity of transforming the
distorted point cloud back to its reference, which in practice can be
approximated by the code length of one point cloud when the other is given. For
this purpose, we first segment the reference and the distorted point cloud into
a series of local patch pairs based on one 3D Voronoi diagram. Next, motivated
by the predictive coding theory, we utilize one space-aware vector
autoregressive (SA-VAR) model to encode the geometry and color channels of each
reference patch in cases with and without the distorted patch, respectively.
Specifically, supposing that the residual errors follow the multi-variate
Gaussian distributions, we calculate the self-complexity of the reference and
the transformational complexity between the reference and the distorted sample
via covariance matrices. Besides the complexity terms, the prediction terms
generated by SA-VAR are introduced as one auxiliary feature to promote the
final quality prediction. Extensive experiments on five public point cloud
quality databases demonstrate that the transformational complexity based
distortion metric (TCDM) produces state-of-the-art (SOTA) results, and ablation
studies have further shown that our metric can be generalized to various
scenarios with consistent performance by examining its key modules and
parameters.
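As a rough illustration of how a code length can be read off a covariance matrix, the block below states the standard identity for Gaussian residuals. It is a sketch under stated assumptions and not necessarily the exact TCDM formulation: the symbols r (residual vector), d (its dimension), \Sigma_{\mathrm{self}}, \Sigma_{\mathrm{cond}} and \Delta L are notation introduced here.

```latex
% Ideal code length per residual vector, in bits, up to a constant set by the
% quantization step. Here e inside (2 \pi e) is Euler's number.
L(r) \approx h(r) = \tfrac{1}{2}\log_2\!\big((2\pi e)^{d}\,\det\Sigma\big)

% Bits saved when the reference patch is predicted with the distorted patch
% available (\Sigma_{\mathrm{cond}}) rather than without it (\Sigma_{\mathrm{self}}):
\Delta L = \tfrac{1}{2}\log_2\frac{\det\Sigma_{\mathrm{self}}}{\det\Sigma_{\mathrm{cond}}}
```

Under this reading, a small \Delta L means the distorted patch helps little in encoding the reference.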
Learn to Categorize or Categorize to Learn? Self-Coding for Generalized Category Discovery
In the quest for unveiling novel categories at test time, we confront the
inherent limitations of traditional supervised recognition models that are
restricted by a predefined category set. While strides have been made in the
realms of self-supervised and open-world learning towards test-time category
discovery, a crucial yet often overlooked question persists: what exactly
delineates a category? In this paper, we conceptualize a category through the
lens of optimization, viewing it as an optimal solution to a well-defined
problem. Harnessing this unique conceptualization, we propose a novel,
efficient and self-supervised method capable of discovering previously unknown
categories at test time. A salient feature of our approach is the assignment of
minimum length category codes to individual data instances, which encapsulates
the implicit category hierarchy prevalent in real-world datasets. This
mechanism affords us enhanced control over category granularity, thereby
equipping our model to handle fine-grained categories adeptly. Experimental
evaluations, bolstered by state-of-the-art benchmark comparisons, testify to
the efficacy of our solution in managing unknown categories at test time.
Furthermore, we fortify our proposition with a theoretical foundation,
providing proof of its optimality. Our code is available at
https://github.com/SarahRastegar/InfoSieve.
Comment: Accepted by NeurIPS 202
Ladderpath Approach: How Tinkering and Reuse Increase Complexity and Information
The notions of information and complexity are important in many scientific fields such as molecular biology, evolutionary theory and exobiology. Many measures of these quantities are either difficult to compute, rely on the statistical notion of information, or can only be applied to strings. Based on assembly theory, we propose the notion of a ladderpath, which describes how an object can be decomposed into hierarchical structures using repetitive elements. From the ladderpath, two measures naturally emerge: the ladderpath-index and the order-index, which represent two axes of complexity. We show how the ladderpath approach can be applied to both strings and spatial patterns and argue that all systems that undergo evolution can be described as ladderpaths. Further, we discuss possible applications to human language and the origin of life. The ladderpath approach provides an alternative characterization of the information that is contained in a single object (or a system) and could aid in our understanding of evolving systems and the origin of life in particular.
Synthetic Kolmogorov Complexity in Coq
We present a generalised, constructive, and machine-checked approach to Kolmogorov complexity in the constructive type theory underlying the Coq proof assistant. By proving that nonrandom numbers form a simple predicate, we obtain elegant proofs of undecidability for random and nonrandom numbers and a proof of uncomputability of Kolmogorov complexity. We use a general and abstract definition of Kolmogorov complexity and subsequently instantiate it to several definitions frequently found in the literature. Whereas textbook treatments of Kolmogorov complexity usually rely heavily on classical logic and the axiom of choice, we put emphasis on the constructiveness of all our arguments, however without blurring their essence. We first give a high-level proof idea using classical logic, which can be formalised with Markov's principle via folklore techniques we subsequently explain. Lastly, we show a strategy for eliminating Markov's principle from a certain class of computability proofs, rendering all our results fully constructive. All our results are machine-checked by the Coq proof assistant, which is enabled by using a synthetic approach to computability: rather than formalising a model of computation, which is well-known to introduce a considerable overhead, we abstractly assume a universal function, allowing the proofs to focus on the mathematical essence.
Efficient compression of biological sequences using a neural network (Compressão eficiente de sequências biológicas usando uma rede neuronal)
Background: The increasing production of genomic data has led to
an intensified need for models that can cope efficiently with the lossless
compression of biosequences. Important applications include long-term
storage and compression-based data analysis. In the literature, only a
few recent articles propose the use of neural networks for biosequence
compression. However, they fall short when compared with specific
DNA compression tools, such as GeCo2. This limitation is due to the
absence of models specifically designed for DNA sequences. In this
work, we combine the power of neural networks with specific DNA and
amino acids models. For this purpose, we created GeCo3 and AC2, two
new biosequence compressors. Both use a neural network for mixing
the opinions of multiple specific models.
Findings: We benchmark GeCo3 as a reference-free DNA compressor
in five datasets, including a balanced and comprehensive dataset
of DNA sequences, the Y-chromosome and human mitogenome, two
compilations of archaeal and virus genomes, four whole genomes, and
two collections of FASTQ data of a human virome and ancient DNA.
GeCo3 achieves a solid improvement in compression over the previous
version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively.
As a reference-based DNA compressor, we benchmark GeCo3 in four
datasets constituted by the pairwise compression of the chromosomes
of the genomes of several primates. GeCo3 improves the compression by
12.4%, 11.7%, 10.8% and 10.1% over the state-of-the-art. The cost of
this compression improvement is some additional computational time
(1.7× to 3.0× slower than GeCo2). The RAM is constant, and the tool
scales efficiently, independently from the sequence size. Overall, these
values outperform the state-of-the-art. For AC2 the improvements and
costs over AC are similar, which allows the tool to also outperform the
state-of-the-art.
Conclusions: GeCo3 and AC2 are biosequence compressors with
a neural network mixing approach that provides additional gains over
top specific biocompressors. The proposed mixing method is portable,
requiring only the probabilities of the models as inputs, providing easy
adaptation to other data compressors or compression-based data analysis
tools. GeCo3 and AC2 are released under GPLv3 and are available
for free download at https://github.com/cobilab/geco3 and
https://github.com/cobilab/ac2.
Mestrado em Engenharia de Computadores e Telemátic
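The idea of mixing the probabilities of several models and paying the resulting code length can be sketched as follows. This is a minimal toy under loud assumptions and not the GeCo3 or AC2 implementation: the weights are fixed by hand rather than produced by a neural network, and every name here (`mix`, `code_length_bits`, `uniform_model`, `repeat_model`) is hypothetical.

```python
import math

ALPHABET = "ACGT"


def mix(predictions, weights):
    """Weighted average of per-model probability distributions over ALPHABET."""
    total = sum(weights)
    return {
        s: sum(w * p[s] for w, p in zip(weights, predictions)) / total
        for s in ALPHABET
    }


def code_length_bits(sequence, models, weights):
    """Ideal code length, in bits, of `sequence` under the mixed prediction."""
    bits = 0.0
    for i, symbol in enumerate(sequence):
        context = sequence[:i]
        predictions = [model(context) for model in models]
        bits += -math.log2(mix(predictions, weights)[symbol])
    return bits


def uniform_model(context):
    """Toy model that knows nothing: every base is equally likely."""
    return {s: 0.25 for s in ALPHABET}


def repeat_model(context):
    """Toy order-1 model that expects the previous base to repeat."""
    if not context:
        return {s: 0.25 for s in ALPHABET}
    return {s: (0.91 if s == context[-1] else 0.03) for s in ALPHABET}


print(code_length_bits("AAAAAAAA", [uniform_model], [1.0]))
print(code_length_bits("AAAAAAAA", [uniform_model, repeat_model], [0.5, 0.5]))
```

Because the mixer needs only each model's probabilities as input, swapping the fixed weights for a learned function of those probabilities leaves the rest of the loop unchanged.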