Sequence-based Multiscale Model (SeqMM) for High-throughput chromosome conformation capture (Hi-C) data analysis
In this paper, I introduce a Sequence-based Multiscale Model (SeqMM) for biomolecular data analysis. Combining it with spectral graph methods, I reveal the essential difference between global-scale models and local-scale ones in structure clustering, i.e., different optimization of Euclidean (or spatial) distances and sequential (or genomic) distances. More specifically, clusters from global-scale models optimize Euclidean distance relations. Local-scale models, on the other hand, yield clusters that optimize genomic distance relations. For biomolecular data, Euclidean distances and sequential distances are two independent variables that can never be optimized simultaneously in data clustering. However, the sequence scale in my SeqMM works as a tuning parameter that balances these two variables and delivers different clusterings depending on the analysis purpose. Further, my SeqMM is used to explore the hierarchical structures of chromosomes. I find that at the global scale, the Fiedler vector from my SeqMM bears a great similarity to the principal vector from principal component analysis and can be used to study genomic compartments. In TAD analysis, I find that TADs evaluated at different scales are not consistent and vary considerably. In particular, when the sequence scale is small, the calculated TAD boundaries are dramatically different. Even for regions with high contact frequencies, TAD regions show no obvious consistency. However, when the scale value increases further, although TADs are still quite different, TAD boundaries in these high-contact-frequency regions become more and more consistent. Finally, I find that for a fixed local scale, my method delivers very robust TAD boundaries across different cluster numbers.
Comment: 22 pages, 13 figures
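The spectral-graph machinery behind the abstract can be illustrated with a minimal sketch: the Fiedler vector is the eigenvector for the second-smallest eigenvalue of the graph Laplacian built from a contact matrix, and its sign pattern partitions loci into two groups. The toy matrix below is illustrative, not data from the paper.

```python
import numpy as np

def fiedler_vector(contact_matrix):
    """Eigenvector of the second-smallest eigenvalue of the
    unnormalized graph Laplacian L = D - W of a symmetric matrix."""
    W = np.asarray(contact_matrix, dtype=float)
    D = np.diag(W.sum(axis=1))            # degree matrix
    L = D - W                             # graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)  # ascending eigenvalues
    return eigvecs[:, 1]                  # Fiedler vector

# Toy Hi-C-like contact map: two densely connected blocks of loci
# (illustrative "compartments"), weakly linked to each other.
A = np.array([
    [0, 5, 5, 1, 0, 0],
    [5, 0, 5, 0, 1, 0],
    [5, 5, 0, 0, 0, 1],
    [1, 0, 0, 0, 5, 5],
    [0, 1, 0, 5, 0, 5],
    [0, 0, 1, 5, 5, 0],
], dtype=float)

v = fiedler_vector(A)
labels = v > 0  # the sign of the Fiedler vector separates the two blocks
print(labels)
```

The same sign-based split is what makes the Fiedler vector comparable to the leading principal component when calling A/B compartments.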
On the Effect of Semantically Enriched Context Models on Software Modularization
Many of the existing approaches for program comprehension rely on the
linguistic information found in source code, such as identifier names and
comments. Semantic clustering is one such technique for modularization of the
system that relies on the informal semantics of the program, encoded in the
vocabulary used in the source code. Treating the source code as a collection of
tokens loses the semantic information embedded within the identifiers. We try
to overcome this problem by introducing context models for source code
identifiers to obtain a semantic kernel, which can be used for both deriving
the topics that run through the system as well as their clustering. In the
first model, we abstract an identifier to its type representation and build on
this notion of context to construct contextual vector representation of the
source code. The second notion of context is defined based on the flow of data
between identifiers to represent a module as a dependency graph where the nodes
correspond to identifiers and the edges represent the data dependencies between
pairs of identifiers. We have applied our approach to 10 medium-sized open
source Java projects, and show that by introducing contexts for identifiers,
the quality of the modularization of the software systems is improved. Both of
the context models give results that are superior to the plain vector
representation of documents. In some cases, the authoritativeness of
decompositions is improved by 67%. Furthermore, a more detailed evaluation of
our approach on JEdit, an open source editor, demonstrates that inferred topics
through performing topic analysis on the contextual representations are more
meaningful compared to the plain representation of the documents. The proposed
approach in introducing a context model for source code identifiers paves the
way for building tools that support developers in program comprehension tasks
such as application and domain concept location, software modularization, and
topic analysis.
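The second context model described above can be sketched as follows, assuming the extraction of assignments from real source code has already happened; all identifier names and the helper functions are illustrative, not the paper's implementation.

```python
from collections import defaultdict

def build_dependency_graph(assignments):
    """assignments: list of (target, [sources]) pairs extracted from code,
    e.g. ('total', ['price', 'quantity']) for `total = price * quantity`.
    Returns a directed graph: data flows from each source to its target."""
    graph = defaultdict(set)
    for target, sources in assignments:
        for src in sources:
            graph[src].add(target)
    return graph

def context_of(graph, identifier):
    """The context of an identifier: identifiers it flows into (successors)
    plus identifiers that flow into it (predecessors)."""
    successors = graph.get(identifier, set())
    predecessors = {n for n, targets in graph.items() if identifier in targets}
    return successors | predecessors

deps = [("total", ["price", "quantity"]),
        ("discounted", ["total", "rate"])]
g = build_dependency_graph(deps)
print(sorted(context_of(g, "total")))  # → ['discounted', 'price', 'quantity']
```

A semantic kernel could then be derived by comparing identifiers through the overlap of such contexts rather than through raw token co-occurrence.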
Assessing architectural evolution: A case study
This is the post-print version of the article. The official published version can be accessed from the link below. Copyright @ 2011 Springer.
This paper proposes to use a historical perspective on generic laws, principles, and guidelines, like Lehman’s software evolution laws and Martin’s design principles, in order to achieve a multi-faceted process and structural assessment of a system’s architectural evolution. We present a simple structural model with associated historical metrics and visualizations that could form part of an architect’s dashboard. We perform such an assessment for the Eclipse SDK, as a case study of a large, complex, and long-lived system for which sustained effective architectural evolution is paramount. The twofold aim of checking generic principles on a well-known system is, on the one hand, to see whether there are certain lessons that could be learned for best practice of architectural evolution, and, on the other hand, to get more insights about the applicability of such principles. We find that while the Eclipse SDK does follow several of the laws and principles, there are some deviations, and we discuss areas of architectural improvement and limitations of the assessment approach.
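As one example of the kind of metric such a dashboard might track, Martin's instability metric relates a module's efferent couplings (Ce, outgoing dependencies) to its afferent couplings (Ca, incoming dependencies); the plug-in counts below are hypothetical, not measurements from the Eclipse SDK study.

```python
def instability(efferent, afferent):
    """Martin's instability metric I = Ce / (Ce + Ca):
    0 means maximally stable (only depended upon),
    1 means maximally unstable (only depends on others)."""
    if efferent + afferent == 0:
        return 0.0
    return efferent / (efferent + afferent)

# Illustrative dependency counts for two hypothetical plug-ins
print(instability(efferent=2, afferent=8))  # → 0.2 (stable, widely used)
print(instability(efferent=9, afferent=1))  # → 0.9 (unstable, depends on many)
```

Tracking such a value per release, rather than at a single point in time, is what turns a design principle into a historical assessment.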
Analysis of Software Binaries for Reengineering-Driven Product Line Architecture: An Industrial Case Study
This paper describes a method for recovering software architectures from a set of similar (but unrelated) software products in binary form. One intention is to drive refactoring into software product lines and combine architecture recovery with run-time binary analysis and existing clustering methods. Using our run-time binary analysis, we create graphs that capture the dependencies between different software parts. These are clustered into smaller component graphs that group software parts with high interactions into larger entities. The component graphs serve as a basis for further software product line work. In this paper, we concentrate on the analysis part of the method and the graph clustering. We apply the graph clustering method to a real application in the context of automation/robot configuration software tools.
Comment: In Proceedings FMSPLE 2015, arXiv:1504.0301
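A minimal sketch of the grouping step, assuming the recovered dependency graphs carry interaction counts on their edges. The paper's actual clustering method is not detailed here, so thresholded union-find is an illustrative stand-in, and all part names are made up.

```python
def cluster_by_threshold(edges, threshold):
    """edges: {(part_a, part_b): interaction_count}. Parts joined by an
    edge at or above the threshold end up in the same component graph."""
    parent = {}

    def find(x):                      # union-find with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for (a, b), count in edges.items():
        find(a); find(b)              # register both nodes
        if count >= threshold:        # keep only high-interaction edges
            union(a, b)

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return sorted(map(sorted, clusters.values()))

edges = {("io", "parser"): 12, ("parser", "ast"): 9,
         ("ast", "ui"): 1, ("ui", "render"): 15}
print(cluster_by_threshold(edges, threshold=5))
# → [['ast', 'io', 'parser'], ['render', 'ui']]
```

Each resulting list plays the role of one component graph: a candidate unit for the subsequent product line work.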
apk2vec: Semi-supervised multi-view representation learning for profiling Android applications
Building behavior profiles of Android applications (apps) with holistic, rich, and multi-view information (e.g., incorporating several semantic views of an app such as API sequences, system calls, etc.) would significantly help downstream analytics tasks such as app categorization, recommendation, and malware analysis. Towards this goal, we design a semi-supervised Representation Learning (RL) framework named apk2vec to automatically generate a compact representation (aka profile/embedding) for a given app. More specifically, apk2vec has the following three unique characteristics which make it an excellent choice for large-scale app profiling: (1) it encompasses information from multiple semantic views such as API sequences, permissions, etc., (2) being a semi-supervised embedding technique, it can make use of labels associated with apps (e.g., malware family or app category labels) to build high-quality app profiles, and (3) it combines RL and feature hashing, which allows it to efficiently build profiles of apps that stream over time (i.e., online learning). The resulting semi-supervised multi-view hash embeddings of apps could then be used for a wide variety of downstream tasks such as the ones mentioned above. Our extensive evaluations with more than 42,000 apps demonstrate that apk2vec's app profiles significantly outperform state-of-the-art techniques in four app analytics tasks, namely malware detection, familial clustering, app clone detection, and app recommendation.
Comment: International Conference on Data Mining, 201
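The feature-hashing ingredient in (3) can be sketched independently of apk2vec itself: features from each semantic view are hashed into one fixed-size signed vector, so the dimensionality stays constant as new features stream in. The function, view names, and app features below are illustrative assumptions, not apk2vec's actual API.

```python
import hashlib

def hashed_profile(views, dim=16):
    """Collapse features from several semantic views (API calls,
    permissions, ...) into one fixed-size vector via signed feature
    hashing. Prefixing each feature with its view name keeps the
    views distinguishable after hashing."""
    vec = [0.0] * dim
    for view_name, features in views.items():
        for feat in features:
            key = f"{view_name}:{feat}".encode()
            h = int.from_bytes(hashlib.md5(key).digest()[:8], "big")
            index = h % dim                               # bucket
            sign = 1.0 if (h >> 63) & 1 == 0 else -1.0    # hashed sign
            vec[index] += sign
    return vec

app = {"api": ["sendTextMessage", "getDeviceId"],
       "permissions": ["SEND_SMS", "READ_PHONE_STATE"]}
profile = hashed_profile(app)
print(len(profile))  # fixed dimensionality regardless of vocabulary size
```

Because the hash is deterministic, the same app always maps to the same profile, which is what makes online updates over streaming apps cheap.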
The Parallelism Motifs of Genomic Data Analysis
Genomic data sets are growing dramatically as the cost of sequencing
continues to decline and small sequencing devices become available. Enormous
community databases store and share this data with the research community, but
some of these genomic data analysis problems require large scale computational
platforms to meet both the memory and computational requirements. These
applications differ from scientific simulations that dominate the workload on
high end parallel systems today and place different requirements on programming
support, software libraries, and parallel architectural design. For example,
they involve irregular communication patterns such as asynchronous updates to
shared data structures. We consider several problems in high performance
genomics analysis, including alignment, profiling, clustering, and assembly for
both single genomes and metagenomes. We identify some of the common
computational patterns or motifs that help inform parallelization strategies
and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing.
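The hashing motif the authors argue is missing from established lists appears, for example, in k-mer counting, a building block of profiling and assembly; the snippet below is a generic serial sketch, not code from the paper (the parallel version shards the hash table across nodes).

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count all overlapping k-mers of a sequence with a hash table:
    the hashing motif that underlies many genomic analyses."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

counts = kmer_counts("ACGTACGTAC", k=3)
print(counts["ACG"])  # → 2: 'ACG' occurs at positions 0 and 4
```

The irregular, data-dependent bucket accesses of such a table are exactly the asynchronous updates to shared data structures that the abstract contrasts with regular simulation workloads.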
Exponential Random Graph Modeling for Complex Brain Networks
Exponential random graph models (ERGMs), also known as p* models, have been
utilized extensively in the social science literature to study complex networks
and how their global structure depends on underlying structural components.
However, the literature on their use in biological networks (especially brain
networks) has remained sparse. Descriptive models based on a specific feature
of the graph (clustering coefficient, degree distribution, etc.) have dominated
connectivity research in neuroscience. Corresponding generative models have
been developed to reproduce one of these features. However, the complexity
inherent in whole-brain network data necessitates the development and use of
tools that allow the systematic exploration of several features simultaneously
and how they interact to form the global network architecture. ERGMs provide a
statistically principled approach to the assessment of how a set of interacting
local brain network features gives rise to the global structure. We illustrate
the utility of ERGMs for modeling, analyzing, and simulating complex
whole-brain networks with network data from normal subjects. We also provide a
foundation for the selection of important local features through the
implementation and assessment of three selection approaches: a traditional
p-value based backward selection approach, an information criterion approach
(AIC), and a graphical goodness of fit (GOF) approach. The graphical GOF
approach serves as the best method given the scientific interest in being able to capture and reproduce the structure of fitted brain networks.
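An ERGM models the probability of a graph g as proportional to exp(sum over i of theta_i * s_i(g)) for chosen network statistics s_i. The sketch below evaluates that unnormalized weight for two hypothetical statistics (edge count and triangle count) with made-up coefficients; real ERGM fitting estimates the intractable normalizing constant by MCMC (e.g., with the statnet suite in R).

```python
from itertools import combinations
import math

def triangle_count(nodes, edges):
    """Number of closed triangles, a common ERGM statistic."""
    e = {frozenset(p) for p in edges}
    return sum(1 for a, b, c in combinations(nodes, 3)
               if {frozenset((a, b)), frozenset((b, c)),
                   frozenset((a, c))} <= e)

def ergm_weight(nodes, edges, theta_edge, theta_tri):
    """Unnormalized exp(theta . s(g)); dividing by the sum of this
    quantity over all graphs on `nodes` would give the probability."""
    s = theta_edge * len(edges) + theta_tri * triangle_count(nodes, edges)
    return math.exp(s)

# Tiny hypothetical brain network: one triangle plus a pendant node
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
print(ergm_weight(nodes, edges, theta_edge=-0.5, theta_tri=1.0))
```

A positive triangle coefficient, as here, rewards clustering, which is how ERGMs let several interacting local features shape the global architecture at once.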