MEDUSA - New Model of Internet Topology Using k-shell Decomposition
The k-shell decomposition of a random graph provides a different and more
insightful separation of the roles of the different nodes in such a graph than
does the usual analysis in terms of node degrees. We develop this approach in
order to analyze the Internet's structure at a coarse level, that of the
"Autonomous Systems" or ASes, the subnetworks out of which the Internet is
assembled. We employ new data from DIMES (see http://www.netdimes.org), a
distributed agent-based mapping effort which at present has attracted over 3800
volunteers running more than 7300 DIMES clients in over 85 countries. We
combine this data with the AS graph information available from the RouteViews
project at Univ. Oregon, and have obtained an Internet map with far more detail
than any previous effort.
The data suggests a new picture of the AS-graph structure, which
distinguishes a relatively large, redundantly connected core of nearly 100 ASes
and two components that flow data in and out from this core. One component is
fractally interconnected through peer links; the second makes direct
connections to the core only. The model which results has superficial
similarities with and important differences from the "Jellyfish" structure
proposed by Tauro et al., so we call it a "Medusa." We plan to use this picture
as a framework for measuring and extrapolating changes in the Internet's
physical structure. Our k-shell analysis may also be relevant for estimating
the function of nodes in the "scale-free" graphs extracted from other
naturally-occurring processes.
Comment: 24 pages, 17 figures
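The k-shell index that drives this analysis can be computed by iterative peeling: repeatedly remove every node of current degree at most k, and raise k when no such node remains. A minimal pure-Python sketch (function name is ours, not from the paper):

```python
from collections import defaultdict

def k_shell_decomposition(edges):
    """Assign each node its k-shell index by iterative peeling:
    repeatedly remove all nodes of current degree <= k, increasing
    k only when no such node remains, until the graph is empty."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    shell = {}
    k = 0
    remaining = set(adj)
    while remaining:
        # peel all nodes whose degree within the remaining graph is <= k
        peel = [n for n in remaining if len(adj[n] & remaining) <= k]
        if not peel:
            k += 1
            continue
        for n in peel:
            shell[n] = k
            remaining.discard(n)
    return shell
```

On a triangle with one pendant node, the pendant lands in shell 1 and the triangle in shell 2, illustrating how the index separates a redundantly connected core from peripheral attachments in a way raw degree does not.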
Learning "graph-mer" Motifs that Predict Gene Expression Trajectories in Development
A key problem in understanding transcriptional regulatory networks is deciphering what cis-regulatory logic is encoded in gene promoter sequences and how this sequence information maps to expression. A typical computational approach to this problem involves clustering genes by their expression profiles and then searching for overrepresented motifs in the promoter sequences of genes in a cluster. However, genes with similar expression profiles may be controlled by distinct regulatory programs. Moreover, if many gene expression profiles in a data set are highly correlated, as in the case of whole organism developmental time series, it may be difficult to resolve fine-grained clusters in the first place. We present a predictive framework for modeling the natural flow of information, from promoter sequence to expression, to learn cis-regulatory motifs and characterize gene expression patterns in developmental time courses. We introduce a cluster-free algorithm based on a graph-regularized version of partial least squares (PLS) regression to learn sequence patterns—represented by graphs of k-mers, or "graph-mers"—that predict gene expression trajectories. Applying the approach to wildtype germline development in Caenorhabditis elegans, we found that the first and second latent PLS factors mapped to expression profiles for oocyte and sperm genes, respectively. We extracted both known and novel motifs from the graph-mers associated with these germline-specific patterns, including novel CG-rich motifs specific to oocyte genes. We found evidence supporting the functional relevance of these putative regulatory elements through analysis of positional bias, motif conservation and in situ gene expression. This study demonstrates that our regression model can learn biologically meaningful latent structure and identify potentially functional motifs from subtle developmental time course expression data.
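As a toy illustration of the sequence side of such a pipeline, k-mer counting over a promoter sequence and an overlap graph over k-mers can be sketched as follows (helper names are ours; the paper's graph-mer construction and graph-regularized PLS are considerably more involved):

```python
def kmer_counts(seq, k=3):
    """Count occurrences of each length-k substring in a sequence."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

def kmer_graph(kmers):
    """Directed edges between k-mers overlapping by k-1 characters,
    the kind of neighbourhood structure a graph-regularized
    regression could smooth coefficients over (an illustrative
    stand-in for the paper's graph-mer construction)."""
    edges = set()
    for a in kmers:
        for b in kmers:
            if a != b and a[1:] == b[:-1]:
                edges.add((a, b))
    return edges
```

Regularizing regression coefficients along such overlap edges encourages similar weights for overlapping k-mers, which is what lets contiguous motifs emerge from individual k-mer features.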
Where's Crypto?: Automated Identification and Classification of Proprietary Cryptographic Primitives in Binary Code
The continuing use of proprietary cryptography in embedded systems across
many industry verticals, from physical access control systems and
telecommunications to machine-to-machine authentication, presents a significant
obstacle to black-box security-evaluation efforts. In-depth security analysis
requires locating and classifying the algorithm in often very large binary
images, thus rendering manual inspection, even when aided by heuristics,
time-consuming.
In this paper, we present a novel approach to automate the identification and
classification of (proprietary) cryptographic primitives within binary code.
Our approach is based on Data Flow Graph (DFG) isomorphism, previously proposed
by Lestringant et al. Unfortunately, their DFG isomorphism approach is limited
to known primitives only, and relies on heuristics for selecting code fragments
for analysis. By combining the said approach with symbolic execution, we
overcome all limitations of their work, and are able to extend the analysis
into the domain of unknown, proprietary cryptographic primitives. To
demonstrate that our proposal is practical, we develop various signatures, each
targeted at a distinct class of cryptographic primitives, and present
experimental evaluations for each of them on a set of binaries, both publicly
available (and thus providing reproducible results), and proprietary ones.
Lastly, we provide a free and open-source implementation of our approach,
called Where's Crypto?, in the form of a plug-in for the popular IDA
disassembler.
Comment: A proof-of-concept implementation can be found at
https://github.com/wheres-crypto/wheres-crypt
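A data-flow graph in this sense can be sketched from toy three-address code; the coarse signature below (operation label plus fan-in) is only meant to illustrate why DFG-based matching survives register renaming, and is not the paper's actual isomorphism test:

```python
def build_dfg(instructions):
    """Build a toy data-flow graph from three-address code.
    Each instruction is (dest, op, operands); nodes are instruction
    indices labelled by op, and an edge i -> j means instruction j
    consumes the value produced by instruction i."""
    producer = {}
    edges = []
    labels = {}
    for i, (dest, op, operands) in enumerate(instructions):
        labels[i] = op
        for src in operands:
            if src in producer:
                edges.append((producer[src], i))
        producer[dest] = i
    return labels, edges

def signature(labels, edges):
    """A coarse structural signature: sorted (op, fan-in) pairs."""
    fan_in = {n: 0 for n in labels}
    for u, v in edges:
        fan_in[v] += 1
    return sorted((labels[n], fan_in[n]) for n in labels)
```

Two fragments that differ only in register names produce identical signatures, which is the property that makes graph-based matching attractive compared to byte-level pattern matching.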
Efficient, Scalable, and Accurate Program Fingerprinting in Binary Code
Why was this binary written? Which compiler was used? Which free software
packages did the developer use? Which sections of the code were borrowed? Who wrote
the binary? These questions are of paramount importance to security analysts and reverse
engineers, and binary fingerprinting approaches may provide valuable insights that can
help answer them. This thesis advances the state of the art by addressing some of the
most fundamental problems in program fingerprinting for binary code, notably, reusable
binary code discovery, fingerprinting free open source software packages, and authorship
attribution.
First, to tackle the problem of discovering reusable binary code, we employ a technique
for identifying reused functions by matching traces of a novel representation of binary
code known as the semantic integrated graph. This graph enhances the control flow
graph, the register flow graph, and the function call graph, key concepts from classical program analysis, and merges them with other structural information to create a joint data
structure. Second, we approach the problem of fingerprinting free open source software
(FOSS) packages by proposing a novel resilient and efficient system that incorporates
three components. The first extracts the syntactical features of functions by considering
opcode frequencies and performing a hidden Markov model statistical test. The second
applies a neighborhood hash graph kernel to random walks derived from control flow
graphs, with the goal of extracting the semantics of the functions. The third applies the
z-score to normalized instructions to extract the behavior of the instructions in a function.
Then, the components are integrated using a Bayesian network model which synthesizes
the results to determine the FOSS function, making it possible to detect user-related functions.
Third, with these elements now in place, we present a framework capable of decoupling
binary program functionality from the coding habits of authors. To capture coding habits,
the framework leverages a set of features that are based on collections of
functionality-independent choices made by authors during coding. Finally, it is
well known that techniques
such as refactoring and code transformations can significantly alter the structure
of code, even for simple programs. Applying such techniques or changing the compiler
and compilation settings can significantly affect the accuracy of available binary analysis
tools, which severely limits their practicability, especially when applied to malware. To
address these issues, we design a technique that extracts the semantics of binary code in terms of both data and control flow. The proposed technique allows more robust binary
analysis because the extracted semantics of the binary code is generally immune
from code transformation, refactoring, and varying the compilers or compilation settings.
Specifically, it employs data-flow analysis to extract the semantic flow of the registers as
well as the semantic components of the control flow graph, which are then synthesized
into a novel representation called the semantic flow graph (SFG).
We evaluate the framework on large-scale datasets extracted from selected open source
C++ projects on GitHub, Google Code Jam events, Planet Source Code contests, and students’
programming projects, and find that it outperforms existing methods in several
respects. First, it is able to detect the reused functions. Second, it can identify FOSS
packages in real-world projects and reused binary functions with high precision. Third, it
decouples authorship from functionality so that it can be applied to real malware binaries
to automatically generate evidence of similar coding habits. Fourth, compared to existing
research contributions, it successfully attributes a larger number of authors with a significantly
higher accuracy. Finally, the new framework is more robust than previous methods
in the sense that there is no significant drop in accuracy when the code is subjected to
refactoring techniques, code transformation methods, and different compilers.
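The syntactical component described above rests on opcode frequencies followed by per-opcode normalization; a minimal sketch of that idea, with hypothetical helper names (the thesis additionally applies a hidden Markov model test and a graph kernel, omitted here):

```python
from statistics import mean, pstdev

def opcode_frequencies(function_opcodes, vocabulary):
    """Relative frequency of each opcode within one function."""
    n = len(function_opcodes)
    return [function_opcodes.count(op) / n for op in vocabulary]

def zscore_features(freq_vectors):
    """Z-score each opcode column across functions, so a feature
    expresses how unusual a function's use of that opcode is
    relative to the corpus (an illustrative stand-in for the
    thesis's z-score component)."""
    cols = list(zip(*freq_vectors))
    out_cols = []
    for col in cols:
        mu, sd = mean(col), pstdev(col)
        out_cols.append([0.0 if sd == 0 else (x - mu) / sd for x in col])
    return [list(row) for row in zip(*out_cols)]
```

Normalizing per opcode rather than per function is what lets rare-but-distinctive instruction choices stand out as authorship or package signals.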
Towards better traffic volume estimation: Tackling both underdetermined and non-equilibrium problems via a correlation-adaptive graph convolution network
Traffic volume is an indispensable ingredient to provide fine-grained
information for traffic management and control. However, due to limited
deployment of traffic sensors, obtaining full-scale volume information is far
from easy. Existing works on this topic primarily focus on improving the
overall estimation accuracy of a particular method and ignore the underlying
challenges of volume estimation, thereby having inferior performances on some
critical tasks. This paper studies two key problems with regard to traffic
volume estimation: (1) underdetermined traffic flows caused by undetected
movements, and (2) non-equilibrium traffic flows arising from congestion
propagation. Here we demonstrate a graph-based deep learning method that can
offer a data-driven, model-free and correlation adaptive approach to tackle the
above issues and perform accurate network-wide traffic volume estimation.
Particularly, in order to quantify the dynamic and nonlinear relationships
between traffic speed and volume for the estimation of underdetermined flows, a
speed-pattern-adaptive adjacency matrix based on graph attention is developed and
integrated into the graph convolution process, to capture non-local
correlations between sensors. To measure the impacts of non-equilibrium flows,
a temporal masked and clipped attention combined with a gated temporal
convolution layer is customized to capture time-asynchronous correlations
between upstream and downstream sensors. We then evaluate our model on a
real-world highway traffic volume dataset and compare it with several benchmark
models. It is demonstrated that the proposed model achieves high estimation
accuracy even under 20% sensor coverage rate and outperforms other baselines
significantly, especially on underdetermined and non-equilibrium flow
locations. Furthermore, comprehensive quantitative model analyses are also
carried out to justify the model designs.
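At its core, the masked-and-clipped attention described above reduces to a softmax over clipped scores in which invalid positions are excluded; a minimal sketch (the learned score computation from sensor embeddings is omitted, and these helper names are ours):

```python
import math

def clip(scores, c):
    """Clip raw attention scores to [-c, c] before the softmax,
    limiting the influence of any single sensor pair."""
    return [max(-c, min(c, s)) for s in scores]

def masked_softmax(scores, mask):
    """Softmax that zeroes out masked positions, e.g. sensor pairs
    whose time lag makes them causally irrelevant for the current
    estimate."""
    exp = [math.exp(s) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exp)
    return [e / total for e in exp]
```

Masking before normalization (rather than after) guarantees the surviving weights still sum to one, so excluded sensors contribute exactly nothing to the aggregation.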
Graph-based Semi-Supervised & Active Learning for Edge Flows
We present a graph-based semi-supervised learning (SSL) method for learning
edge flows defined on a graph. Specifically, given flow measurements on a
subset of edges, we want to predict the flows on the remaining edges. To this
end, we develop a computational framework that imposes certain constraints on
the overall flows, such as (approximate) flow conservation. These constraints
render our approach different from classical graph-based SSL for vertex labels,
which posits that tightly connected nodes share similar labels and leverages
the graph structure accordingly to extrapolate from a few vertex labels to the
unlabeled vertices. We derive bounds for our method's reconstruction error and
demonstrate its strong performance on synthetic and real-world flow networks
from transportation, physical infrastructure, and the Web. Furthermore, we
provide two active learning algorithms for selecting informative edges on which
to measure flow, which has applications for optimal sensor deployment. The
first strategy selects edges to minimize the reconstruction error bound and
works well on flows that are approximately divergence-free. The second approach
clusters the graph and selects bottleneck edges that cross cluster-boundaries,
which works well on flows with global trends.
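The flow-conservation constraint at the heart of the method can be shown on a toy graph: hold the measured edge flows fixed and choose the unmeasured ones to minimize the squared divergence at every node. The sketch below uses plain gradient descent rather than the paper's regularized least-squares formulation:

```python
def interpolate_flows(nodes, edges, known, steps=2000, lr=0.1):
    """Fill in unknown edge flows by minimizing
    sum_v (net signed flow at v)^2 with known flows held fixed.
    Edges are directed pairs (u, v); known maps edges to measured
    flow values.  Simple gradient descent, adequate for tiny graphs."""
    flow = {e: known.get(e, 0.0) for e in edges}
    free = [e for e in edges if e not in known]
    for _ in range(steps):
        div = {v: 0.0 for v in nodes}
        for (u, v), f in flow.items():
            div[u] -= f   # flow leaves u
            div[v] += f   # flow enters v
        for (u, v) in free:
            # derivative of div[u]^2 + div[v]^2 with respect to this flow
            grad = -2 * div[u] + 2 * div[v]
            flow[(u, v)] -= lr * grad
    return flow
```

On a directed 3-cycle with one measured edge, divergence-free interpolation propagates the measured value around the loop, which is exactly the behaviour that distinguishes edge-flow SSL from label smoothing on vertices.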
Generalized Points-to Graphs: A New Abstraction of Memory in the Presence of Pointers
Flow- and context-sensitive points-to analysis is difficult to scale; for
top-down approaches, the problem centers on repeated analysis of the same
procedure; for bottom-up approaches, the abstractions used to represent
procedure summaries have not scaled while preserving precision.
We propose a novel abstraction called the Generalized Points-to Graph (GPG)
which views points-to relations as memory updates and generalizes them using
the counts of indirection levels leaving the unknown pointees implicit. This
allows us to construct GPGs as compact representations of bottom-up procedure
summaries in terms of memory updates and control flow between them. Their
compactness is ensured by the following optimizations: strength reduction
reduces the indirection levels, redundancy elimination removes redundant memory
updates and minimizes control flow (without over-approximating data dependence
between memory updates), and call inlining enhances the opportunities of these
optimizations. We devise novel operations and data flow analyses for these
optimizations.
Our quest for scalability of points-to analysis leads to the following
insight: The real killer of scalability in program analysis is not the amount
of data but the amount of control flow that it may be subjected to in search of
precision. The effectiveness of GPGs lies in the fact that they discard as much
control flow as possible without losing precision (i.e., by preserving data
dependence without over-approximation). This is the reason why the GPGs are
very small even for main procedures that contain the effect of the entire
program. This allows our implementation to scale to 158kLoC for C programs.
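The flavour of a GPG memory update can be conveyed with a quadruple (lhs, i, rhs, j), read as "assign the j-th pointee of rhs to the i-th pointee of lhs". The toy reduction below resolves one level of rhs indirection when the direct pointee is known, loosely in the spirit of the paper's strength reduction; it is our simplification, not the paper's operation set:

```python
def reduce_edge(edge, facts):
    """If rhs's direct pointee is known via a fact (rhs, 1, p, 0)
    ("rhs = &p"), then j dereferences of rhs equal j-1 dereferences
    of p, so the edge's rhs indirection level can be lowered by one."""
    lhs, i, rhs, j = edge
    for (a, ai, b, bj) in facts:
        if a == rhs and ai == 1 and bj == 0 and j >= 1:
            return (lhs, i, b, j - 1)
    return edge
```

For example, composing the copy "x = y" (encoded ("x", 1, "y", 1)) with the known fact "y = &z" (("y", 1, "z", 0)) lowers the rhs indirection and yields "x = &z", making the pointee explicit without re-analyzing the callee.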
Azimuthal Anisotropy in High Energy Nuclear Collision - An Approach based on Complex Network Analysis
Recently, a complex network based method of Visibility Graph has been applied
to confirm the scale-freeness and presence of fractal properties in the process
of multiplicity fluctuation. Analysis of data obtained from experiments on
hadron-nucleus and nucleus-nucleus interactions yields values of the
Power of Scale-freeness of Visibility Graph (PSVG) parameter extracted from the
visibility graphs. Here, the relativistic nucleus-nucleus interaction data have
been analysed to detect azimuthal-anisotropy by extending the Visibility Graph
method and extracting the average clustering coefficient, one of the important
topological parameters, from the graph. Azimuthal-distributions corresponding
to different pseudorapidity-regions around the central-pseudorapidity value are
analysed utilising the parameter. Here we attempt to correlate the conventional
physical significance of this coefficient with respect to complex-network
systems, with some basic notions of particle production phenomenology, like
clustering and correlation. Earlier methods for detecting anisotropy in
azimuthal distribution, were mostly based on the analysis of statistical
fluctuation. In this work, we have attempted to find deterministic information
on the anisotropy in azimuthal distribution by means of precise determination
of a topological parameter from a complex network perspective.
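The analysis rests on two standard constructions, both easy to sketch in full: the natural visibility graph of a series (two points are linked if every intermediate point lies strictly below the line joining them) and the average clustering coefficient extracted from it (function names are ours):

```python
def visibility_graph(series):
    """Natural visibility graph of a time series: points (i, y_i) and
    (j, y_j) are connected if every point between them lies strictly
    below the straight line joining them."""
    n = len(series)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            visible = all(
                series[k] < series[i] + (series[j] - series[i]) * (k - i) / (j - i)
                for k in range(i + 1, j)
            )
            if visible:
                edges.add((i, j))
    return edges

def average_clustering(n, edges):
    """Average clustering coefficient: for each node, the fraction of
    its neighbour pairs that are themselves connected, averaged over
    all n nodes (nodes of degree < 2 contribute zero)."""
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    total = 0.0
    for i in range(n):
        nb = adj[i]
        d = len(nb)
        if d < 2:
            continue
        links = sum(1 for u in nb for v in nb if u < v and v in adj[u])
        total += 2 * links / (d * (d - 1))
    return total / n
```

Applied to an azimuthal distribution restricted to a pseudorapidity window, the resulting coefficient summarizes how strongly visible points cluster, which is the topological quantity the analysis tracks across windows.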