Communication Theoretic Data Analytics
Widespread use of the Internet and social networks drives the generation of
big data, which is proving to be useful in a number of applications. To deal
with explosively growing amounts of data, data analytics has emerged as a
critical technology related to computing, signal processing, and information
networking. In this paper, a formalism is considered in which data is modeled
as a generalized social network and communication theory and information theory
are thereby extended to data analytics. First, the creation of an equalizer to
optimize information transfer between two data variables is considered, and
financial data is used to demonstrate the advantages. Then, an information
coupling approach based on information geometry is applied for dimensionality
reduction, with a pattern recognition example to illustrate the effectiveness.
These initial trials suggest the potential of communication theoretic data
analytics for a wide range of applications.
Comment: Published in IEEE Journal on Selected Areas in Communications, Jan.
201
A Primer on Causality in Data Science
Many questions in Data Science are fundamentally causal in that our objective
is to learn the effect of some exposure, randomized or not, on an outcome
of interest. Even studies that are seemingly non-causal, such as those with the
goal of prediction or prevalence estimation, have causal elements, including
differential censoring or measurement. As a result, we, as Data Scientists,
need to consider the underlying causal mechanisms that gave rise to the data,
rather than simply the pattern or association observed in those data. In this
work, we review the 'Causal Roadmap' of Petersen and van der Laan (2014) to
provide an introduction to some key concepts in causal inference. Similar to
other causal frameworks, the steps of the Roadmap include clearly stating the
scientific question, defining the causal model, translating the scientific
question into a causal parameter, assessing the assumptions needed to express
the causal parameter as a statistical estimand, implementing statistical
estimators including parametric and semi-parametric methods, and interpreting
our findings. We believe that using such a framework in Data Science will
help to ensure that our statistical analyses are guided by the scientific
question driving our research, while avoiding over-interpreting our results. We
focus on the effect of an exposure occurring at a single time point and
highlight the use of targeted maximum likelihood estimation (TMLE) with Super
Learner.
Comment: 26 pages (with references); 4 figures
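The step from causal parameter to statistical estimand can be made concrete for a binary exposure at a single time point. The following is only a sketch under assumed notation that does not appear in the abstract: W for measured covariates, A for the exposure, Y for the outcome, and Y_a for the counterfactual outcome under exposure level a. Under the usual assumptions of no unmeasured confounding and positivity, the average treatment effect is identified by the G-computation formula:

```latex
% Causal parameter: average treatment effect (assumed notation)
\Psi^{\mathrm{causal}} = \mathbb{E}\left[ Y_1 - Y_0 \right]

% Statistical estimand, valid under no unmeasured confounding and positivity
\Psi^{\mathrm{stat}} = \mathbb{E}_W\left[ \mathbb{E}(Y \mid A = 1, W) - \mathbb{E}(Y \mid A = 0, W) \right]
```

An estimator such as TMLE then targets the second quantity, which depends only on the observed-data distribution.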
Fast filtering and animation of large dynamic networks
Detecting and visualizing the most relevant changes in an evolving
network is an open challenge in several domains. We present a fast algorithm
that filters subsets of the strongest nodes and edges representing an evolving
weighted graph and visualizes them either by creating a movie or by streaming them
to an interactive network visualization tool. The algorithm is an approximation
of an exponential sliding time-window that scales linearly with the number of
interactions. We compare the algorithm against rectangular and exponential
sliding time-window methods. Our network filtering algorithm: i) captures
persistent trends in the structure of dynamic weighted networks, ii) smooths
transitions between the snapshots of a dynamic network, and iii) uses limited
memory and processor time. The algorithm is publicly available as open-source
software.
Comment: 6 figures, 2 tables
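The idea of an exponential sliding time-window can be sketched in a few lines. This is not the authors' published algorithm (which the abstract describes as an approximation of such a window); it is a minimal illustration of plain exponential decay on edge weights. The function name `filter_strongest_edges` and the parameters `half_life` and `top_k` are hypothetical.

```python
import math
from collections import defaultdict


def filter_strongest_edges(interactions, half_life, top_k):
    """Return the top_k edges ranked by exponentially decayed weight.

    interactions: iterable of (time, source, target, weight), sorted by time.
    half_life:    time after which an edge's weight halves (assumed parameter).
    """
    decay_rate = math.log(2) / half_life
    weights = defaultdict(float)  # edge -> decayed weight as of last_seen[edge]
    last_seen = {}                # edge -> time of its most recent interaction
    now = None

    for t, u, v, w in interactions:
        edge = (u, v)
        # Decay the stored weight over the time since this edge was last
        # updated, then add the new interaction. Each event costs O(1).
        elapsed = t - last_seen.get(edge, t)
        weights[edge] = weights[edge] * math.exp(-decay_rate * elapsed) + w
        last_seen[edge] = t
        now = t

    # Bring every edge to the same reference time before ranking.
    current = {
        e: wt * math.exp(-decay_rate * (now - last_seen[e]))
        for e, wt in weights.items()
    }
    return sorted(current.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

The update loop touches each interaction once, so its cost grows linearly with the number of interactions; only the final ranking adds a sort over the distinct edges.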
The Multivariate k-Nearest Neighbor Model for Dependent Variables : One-Sided Estimation and Forecasting
This article gives the asymptotic properties of multivariate k-nearest neighbor regression estimators for dependent variables belonging to R^d, d > 1. The results derived here make it possible to provide consistent forecasts and confidence intervals for time series. An illustration of the method is given through the estimation of economic indicators used to compute the GDP with bridge equations. Forecast accuracy is compared empirically between this non-parametric method and a parametric one based on ARIMA modelling, which we consider as a benchmark because it is still often used in Central Banks to nowcast and forecast the GDP.
Keywords: Multivariate k-nearest neighbor, asymptotic normality of the regression, mixing time series, confidence intervals, forecasts, economic indicators, euro area.
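A one-sided k-nearest neighbor forecast can be sketched as follows. This is an illustration under stated assumptions, not the estimator studied in the article: it embeds a univariate series into vectors of `d` lagged values, uses Euclidean distance, and averages the successors of the `k` nearest past vectors with uniform weights. The function name `knn_forecast` and its parameters are hypothetical.

```python
import numpy as np


def knn_forecast(series, d, k):
    """One-step-ahead k-NN forecast that uses only past observations.

    series: 1-D sequence of observations, oldest first.
    d:      number of lagged values in each embedding vector (assumed).
    k:      number of nearest neighbors to average (assumed, uniform weights).
    """
    x = np.asarray(series, dtype=float)
    n = len(x)
    if k < 1 or n <= d:
        raise ValueError("need k >= 1 and more than d observations")

    # windows[i] = x[i : i + d]; the last window is the query vector.
    windows = np.lib.stride_tricks.sliding_window_view(x, d)
    query = windows[-1]
    candidates = windows[:-1]   # every past window that has a known successor
    successors = x[d:]          # successors[i] is the value following candidates[i]

    distances = np.linalg.norm(candidates - query, axis=1)
    k = min(k, len(candidates))
    nearest = np.argsort(distances)[:k]
    return float(successors[nearest].mean())
```

For example, on the periodic series `[1, 2, 3, 1, 2, 3, 1, 2]` with `d=2` and `k=2`, the two past windows equal to `[1, 2]` are both followed by 3, so the forecast is 3.0.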
A nonuniform popularity-similarity optimization (nPSO) model to efficiently generate realistic complex networks with communities
The hidden metric space behind complex network topologies is a hot topic
in current network science, and the hyperbolic space is one of the most studied
because it seems associated with the structural organization of many real
complex systems. The Popularity-Similarity-Optimization (PSO) model simulates
how random geometric graphs grow in the hyperbolic space, reproducing strong
clustering and a scale-free degree distribution; however, it fails to reproduce
an important feature of real complex networks, namely community
organization. The Geometrical-Preferential-Attachment (GPA) model was recently
developed to give the PSO a community structure as well, which is obtained by
forcing different angular regions of the hyperbolic disk to have variable
levels of attractiveness. However, the number and size of the communities
cannot be explicitly controlled in the GPA, which is a clear limitation for
real applications. Here, we introduce the nonuniform PSO (nPSO) model that,
unlike GPA, forces heterogeneous angular node attractiveness by
sampling the angular coordinates from a tailored nonuniform probability
distribution, for instance a mixture of Gaussians. The nPSO differs from GPA in
three other aspects: it allows the number and size of communities to be fixed
explicitly; it allows their mixing property to be tuned through the network
temperature; and it efficiently generates networks with high clustering. After
several tests, we propose the nPSO as a valid and efficient model to generate
networks with communities in the hyperbolic space, which can be adopted as a
realistic benchmark for different tasks such as community detection and link
prediction.
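The angular sampling step can be illustrated with a short sketch. It covers only the drawing of angular coordinates from a mixture of Gaussians on the circle, not radial coordinates or link formation, and it makes several assumptions that the abstract does not state: equally spaced component means, equal mixing proportions, and a standard deviation set to one sixth of the spacing between adjacent means. The function name `sample_angles` is hypothetical.

```python
import numpy as np


def sample_angles(n_nodes, n_communities, rng=None):
    """Sample angular coordinates from a Gaussian mixture wrapped on the circle.

    Assumptions (not taken from the abstract): equally spaced means, equal
    mixing proportions, sigma equal to one sixth of the spacing between means.
    """
    rng = np.random.default_rng(rng)
    spacing = 2 * np.pi / n_communities
    means = spacing * (np.arange(n_communities) + 0.5)
    sigma = spacing / 6

    # Choose a mixture component (community) for each node, then draw its angle.
    labels = rng.integers(n_communities, size=n_nodes)
    angles = rng.normal(means[labels], sigma) % (2 * np.pi)  # wrap onto the circle
    return angles, labels
```

Fixing `n_communities` sets the number of communities directly, and unequal mixing proportions (not shown here) would be one way to control their sizes.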