18,651 research outputs found
AFLOW-SYM: Platform for the complete, automatic and self-consistent symmetry analysis of crystals
Determination of the symmetry profile of structures is a persistent challenge
in materials science. Results often vary amongst standard packages, hindering
autonomous materials development by requiring continuous user attention and
educated guesses. Here, we present a robust procedure for evaluating the
complete suite of symmetry properties, featuring various representations for
the point-, factor-, space groups, site symmetries, and Wyckoff positions. The
protocol determines a system-specific mapping tolerance that yields symmetry
operations entirely commensurate with fundamental crystallographic principles.
The self consistent tolerance characterizes the effective spatial resolution of
the reported atomic positions. The approach is compared with the most used
programs and is successfully validated against the space group information
provided for over 54,000 entries in the Inorganic Crystal Structure Database.
Subsequently, a complete symmetry analysis is applied to all 1.7 million
entries of the AFLOW data repository. The AFLOW-SYM package has been
implemented in, and made available for, public use through the automated,
framework AFLOW.Comment: 24 pages, 6 figure
"Influence Sketching": Finding Influential Samples In Large-Scale Regressions
There is an especially strong need in modern large-scale data analysis to
prioritize samples for manual inspection. For example, the inspection could
target important mislabeled samples or key vulnerabilities exploitable by an
adversarial attack. In order to solve the "needle in the haystack" problem of
which samples to inspect, we develop a new scalable version of Cook's distance,
a classical statistical technique for identifying samples which unusually
strongly impact the fit of a regression model (and its downstream predictions).
In order to scale this technique up to very large and high-dimensional
datasets, we introduce a new algorithm which we call "influence sketching."
Influence sketching embeds random projections within the influence computation;
in particular, the influence score is calculated using the randomly projected
pseudo-dataset from the post-convergence Generalized Linear Model (GLM). We
validate that influence sketching can reliably and successfully discover
influential samples by applying the technique to a malware detection dataset of
over 2 million executable files, each represented with almost 100,000 features.
For example, we find that randomly deleting approximately 10% of training
samples reduces predictive accuracy only slightly from 99.47% to 99.45%,
whereas deleting the same number of samples with high influence sketch scores
reduces predictive accuracy all the way down to 90.24%. Moreover, we find that
influential samples are especially likely to be mislabeled. In the case study,
we manually inspect the most influential samples, and find that influence
sketching pointed us to new, previously unidentified pieces of malware.Comment: fixed additional typo
A Reference Interpreter for the Graph Programming Language GP 2
GP 2 is an experimental programming language for computing by graph
transformation. An initial interpreter for GP 2, written in the functional
language Haskell, provides a concise and simply structured reference
implementation. Despite its simplicity, the performance of the interpreter is
sufficient for the comparative investigation of a range of test programs. It
also provides a platform for the development of more sophisticated
implementations.Comment: In Proceedings GaM 2015, arXiv:1504.0244
Malware Classification based on Call Graph Clustering
Each day, anti-virus companies receive tens of thousands samples of
potentially harmful executables. Many of the malicious samples are variations
of previously encountered malware, created by their authors to evade
pattern-based detection. Dealing with these large amounts of data requires
robust, automatic detection approaches. This paper studies malware
classification based on call graph clustering. By representing malware samples
as call graphs, it is possible to abstract certain variations away, and enable
the detection of structural similarities between samples. The ability to
cluster similar samples together will make more generic detection techniques
possible, thereby targeting the commonalities of the samples within a cluster.
To compare call graphs mutually, we compute pairwise graph similarity scores
via graph matchings which approximately minimize the graph edit distance. Next,
to facilitate the discovery of similar malware samples, we employ several
clustering algorithms, including k-medoids and DBSCAN. Clustering experiments
are conducted on a collection of real malware samples, and the results are
evaluated against manual classifications provided by human malware analysts.
Experiments show that it is indeed possible to accurately detect malware
families via call graph clustering. We anticipate that in the future, call
graphs can be used to analyse the emergence of new malware families, and
ultimately to automate implementation of generic detection schemes.Comment: This research has been supported by TEKES - the Finnish Funding
Agency for Technology and Innovation as part of its ICT SHOK Future Internet
research programme, grant 40212/0
A Deep Siamese Network for Scene Detection in Broadcast Videos
We present a model that automatically divides broadcast videos into coherent
scenes by learning a distance measure between shots. Experiments are performed
to demonstrate the effectiveness of our approach by comparing our algorithm
against recent proposals for automatic scene segmentation. We also propose an
improved performance measure that aims to reduce the gap between numerical
evaluation and expected results, and propose and release a new benchmark
dataset.Comment: ACM Multimedia 201
- …