HYPA: Efficient Detection of Path Anomalies in Time Series Data on Networks
The unsupervised detection of anomalies in time series data has important
applications in user behavioral modeling, fraud detection, and cybersecurity.
Anomaly detection has, in fact, been extensively studied in categorical
sequences. However, we often have access to time series data that represent
paths through networks. Examples include transaction sequences in financial
networks, click streams of users in networks of cross-referenced documents, or
travel itineraries in transportation networks. To reliably detect anomalies, we
must account for the fact that such data contain a large number of independent
observations of paths constrained by a graph topology. Moreover, the
heterogeneity of real systems rules out frequency-based anomaly detection
techniques, which do not account for highly skewed edge and degree statistics.
To address this problem, we introduce HYPA, a novel framework for the
unsupervised detection of anomalies in large corpora of variable-length
temporal paths in a graph. HYPA provides an efficient analytical method to
detect paths with anomalous frequencies that result from nodes being traversed
in unexpected chronological order.
Comment: 11 pages with 8 figures and supplementary material. To appear at SIAM
Data Mining (SDM 2020).
Optimal coding and the origins of Zipfian laws
The problem of compression in standard information theory consists of
assigning codes as short as possible to numbers. Here we consider the problem
of optimal coding -- under an arbitrary coding scheme -- and show that it
predicts Zipf's law of abbreviation, namely a tendency in natural languages for
more frequent words to be shorter. We apply this result to investigate optimal
coding also under so-called non-singular coding, a scheme where unique
segmentation is not warranted but codes stand for a distinct number. Optimal
non-singular coding predicts that the length of a word should grow
approximately as the logarithm of its frequency rank, which is again consistent
with Zipf's law of abbreviation. Optimal non-singular coding in combination
with the maximum entropy principle also predicts Zipf's rank-frequency
distribution. Furthermore, our findings on optimal non-singular coding
challenge common beliefs about random typing. It turns out that random typing
is in fact an optimal coding process, in stark contrast with the common
assumption that it is detached from cost cutting considerations. Finally, we
discuss the implications of optimal coding for the construction of a compact
theory of Zipfian laws and other linguistic laws.
Comment: in press in the Journal of Quantitative Linguistics; definition of
concordant pair corrected, proofs polished, references updated.
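The non-singular coding idea in the abstract above can be illustrated with a minimal sketch. It assumes the simplest reading of the scheme: every word gets a distinct string, no unique segmentation is required, and the shortest available strings go to the most frequent words. The function name `nonsingular_codes`, the two-letter alphabet, and the toy word list are illustrative assumptions, not taken from the paper.

```python
from itertools import count, product


def nonsingular_codes(alphabet, n):
    """Return the n shortest distinct strings over `alphabet`, shortest first."""
    if n <= 0:
        return []
    codes = []
    for length in count(1):  # lengths 1, 2, 3, ...
        for symbols in product(alphabet, repeat=length):
            codes.append("".join(symbols))
            if len(codes) == n:
                return codes


# Hypothetical toy input: words already sorted from most to least frequent.
ranked_words = ["the", "of", "and", "to", "in", "is", "that"]
codes = nonsingular_codes("ab", len(ranked_words))
code_of = dict(zip(ranked_words, codes))  # rank 1 gets the shortest code
lengths = [len(c) for c in codes]         # code length as a function of rank
```

With an alphabet of size N there are N strings of length 1, N^2 of length 2, and so on, so under this assignment the code length increases roughly as the base-N logarithm of the rank, which is the trend the abstract describes.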
Statistical analysis of simple repeats in the human genome
The human genome contains repetitive DNA at different levels of sequence
length, number and dispersion. Highly repetitive DNA is particularly rich in
homo- and di-nucleotide repeats, while middle repetitive DNA is rich in
families of interspersed, mobile elements hundreds of base pairs (bp) long,
among them the Alu families. A link between homo- and di-polymeric tracts and
mobile elements has been recently highlighted. In particular, the mobility of
Alu repeats, which form 10% of the human genome, has been correlated with the
length of poly(A) tracts located at one end of the Alu. These tracts have a
rigid and non-bendable structure and have an inhibitory effect on nucleosomes,
which normally compact the DNA. We performed a statistical analysis of the
genome-wide distribution of lengths and inter-tract separations of poly(X) and
poly(XY) tracts in the human genome. Our study shows that in humans the length
distributions of these sequences reflect the dynamics of their expansion and
DNA replication. By means of general tools from linguistics, we show that the
latter play the role of highly-significant content-bearing terms in the DNA
text. Furthermore, we find that such tracts are positioned in a non-random
fashion, with an apparent periodicity of 150 bases. This allows us to extend
the link between repetitive, highly mobile elements such as Alus and
low-complexity words in human DNA. More precisely, we show that Alus are
sources of poly(X) tracts, which in turn affect in a subtle way the combination
and diversification of gene expression and the fixation of multigene families.
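The two quantities measured in the abstract above, tract lengths and inter-tract separations, can be sketched for poly(X) tracts on a toy sequence. This is only an illustration under stated assumptions: the minimum tract length `min_len`, the function names, and the example sequence are hypothetical, and the paper's own definitions and thresholds may differ.

```python
import re


def homopolymer_tracts(seq, min_len=3):
    """Return (start, base, length) for each run of one base of length >= min_len."""
    # ([ACGT]) captures a base; \1{k,} requires at least k further copies of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0))) for m in pattern.finditer(seq)]


def separations(tracts):
    """Return the number of bases between the end of each tract and the next start."""
    return [
        nxt[0] - (cur[0] + cur[2])
        for cur, nxt in zip(tracts, tracts[1:])
    ]


# Hypothetical toy sequence, not real genomic data.
seq = "AAAACGTTTTTGCAAA"
tracts = homopolymer_tracts(seq)
gaps = separations(tracts)
```

Histograms of the `length` field and of `gaps` over a whole genome would give the kind of distributions the abstract refers to; poly(XY) tracts would need a separate pattern.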
TermEval 2020: shared task on automatic term extraction using the Annotated Corpora for Term Extraction Research (ACTER) dataset
The TermEval 2020 shared task provided a platform for researchers to work on automatic term extraction (ATE) with the same dataset: the Annotated Corpora for Term Extraction Research (ACTER). The dataset covers three languages (English, French, and Dutch) and four domains, of which the domain of heart failure was kept as a held-out test set on which final F1-scores were calculated. The aim was to provide a large, transparent, qualitatively annotated, and diverse dataset to the ATE research community, with the goal of promoting comparative research and thus identifying strengths and weaknesses of various state-of-the-art methodologies. The results show considerable variation between systems and illustrate how some methodologies reach higher precision or recall, how different systems extract different types of terms, and how some are exceptionally good at finding rare terms or are less affected by term length. The current contribution offers an overview of the shared task with a comparative evaluation, which complements the individual papers by all participants.
Signatures of arithmetic simplicity in metabolic network architecture
Metabolic networks perform some of the most fundamental functions in living
cells, including energy transduction and building block biosynthesis. While
these are the best characterized networks in living systems, understanding
their evolutionary history and complex wiring constitutes one of the most
fascinating open questions in biology, intimately related to the enigma of
life's origin itself. Is the evolution of metabolism subject to general
principles, beyond the unpredictable accumulation of multiple historical
accidents? Here we search for such principles by applying to an artificial
chemical universe some of the methodologies developed for the study of genome
scale models of cellular metabolism. In particular, we use metabolic flux
constraint-based models to exhaustively search for artificial chemistry
pathways that can optimally perform an array of elementary metabolic functions.
Despite the simplicity of the model employed, we find that the ensuing pathways
display a surprisingly rich set of properties, including the existence of
autocatalytic cycles and hierarchical modules, the appearance of universally
preferable metabolites and reactions, and a logarithmic trend of pathway length
as a function of input/output molecule size. Some of these properties can be
derived analytically, borrowing methods previously used in cryptography. In
addition, by mapping biochemical networks onto a simplified carbon atom
reaction backbone, we find that several of the properties predicted by the
artificial chemistry model hold for real metabolic networks. These findings
suggest that optimality principles and arithmetic simplicity might lie beneath
some aspects of biochemical complexity.