18,651 research outputs found

    AFLOW-SYM: Platform for the complete, automatic and self-consistent symmetry analysis of crystals

    Get PDF
    Determination of the symmetry profile of structures is a persistent challenge in materials science. Results often vary amongst standard packages, hindering autonomous materials development by requiring continuous user attention and educated guesses. Here, we present a robust procedure for evaluating the complete suite of symmetry properties, featuring various representations for the point-, factor-, space groups, site symmetries, and Wyckoff positions. The protocol determines a system-specific mapping tolerance that yields symmetry operations entirely commensurate with fundamental crystallographic principles. The self consistent tolerance characterizes the effective spatial resolution of the reported atomic positions. The approach is compared with the most used programs and is successfully validated against the space group information provided for over 54,000 entries in the Inorganic Crystal Structure Database. Subsequently, a complete symmetry analysis is applied to all 1.7++ million entries of the AFLOW data repository. The AFLOW-SYM package has been implemented in, and made available for, public use through the automated, ab-initio\textit{ab-initio} framework AFLOW.Comment: 24 pages, 6 figure

    "Influence Sketching": Finding Influential Samples In Large-Scale Regressions

    Full text link
    There is an especially strong need in modern large-scale data analysis to prioritize samples for manual inspection. For example, the inspection could target important mislabeled samples or key vulnerabilities exploitable by an adversarial attack. In order to solve the "needle in the haystack" problem of which samples to inspect, we develop a new scalable version of Cook's distance, a classical statistical technique for identifying samples which unusually strongly impact the fit of a regression model (and its downstream predictions). In order to scale this technique up to very large and high-dimensional datasets, we introduce a new algorithm which we call "influence sketching." Influence sketching embeds random projections within the influence computation; in particular, the influence score is calculated using the randomly projected pseudo-dataset from the post-convergence Generalized Linear Model (GLM). We validate that influence sketching can reliably and successfully discover influential samples by applying the technique to a malware detection dataset of over 2 million executable files, each represented with almost 100,000 features. For example, we find that randomly deleting approximately 10% of training samples reduces predictive accuracy only slightly from 99.47% to 99.45%, whereas deleting the same number of samples with high influence sketch scores reduces predictive accuracy all the way down to 90.24%. Moreover, we find that influential samples are especially likely to be mislabeled. In the case study, we manually inspect the most influential samples, and find that influence sketching pointed us to new, previously unidentified pieces of malware.Comment: fixed additional typo

    A Reference Interpreter for the Graph Programming Language GP 2

    Get PDF
    GP 2 is an experimental programming language for computing by graph transformation. An initial interpreter for GP 2, written in the functional language Haskell, provides a concise and simply structured reference implementation. Despite its simplicity, the performance of the interpreter is sufficient for the comparative investigation of a range of test programs. It also provides a platform for the development of more sophisticated implementations.Comment: In Proceedings GaM 2015, arXiv:1504.0244

    Malware Classification based on Call Graph Clustering

    Full text link
    Each day, anti-virus companies receive tens of thousands samples of potentially harmful executables. Many of the malicious samples are variations of previously encountered malware, created by their authors to evade pattern-based detection. Dealing with these large amounts of data requires robust, automatic detection approaches. This paper studies malware classification based on call graph clustering. By representing malware samples as call graphs, it is possible to abstract certain variations away, and enable the detection of structural similarities between samples. The ability to cluster similar samples together will make more generic detection techniques possible, thereby targeting the commonalities of the samples within a cluster. To compare call graphs mutually, we compute pairwise graph similarity scores via graph matchings which approximately minimize the graph edit distance. Next, to facilitate the discovery of similar malware samples, we employ several clustering algorithms, including k-medoids and DBSCAN. Clustering experiments are conducted on a collection of real malware samples, and the results are evaluated against manual classifications provided by human malware analysts. Experiments show that it is indeed possible to accurately detect malware families via call graph clustering. We anticipate that in the future, call graphs can be used to analyse the emergence of new malware families, and ultimately to automate implementation of generic detection schemes.Comment: This research has been supported by TEKES - the Finnish Funding Agency for Technology and Innovation as part of its ICT SHOK Future Internet research programme, grant 40212/0

    A Deep Siamese Network for Scene Detection in Broadcast Videos

    Get PDF
    We present a model that automatically divides broadcast videos into coherent scenes by learning a distance measure between shots. Experiments are performed to demonstrate the effectiveness of our approach by comparing our algorithm against recent proposals for automatic scene segmentation. We also propose an improved performance measure that aims to reduce the gap between numerical evaluation and expected results, and propose and release a new benchmark dataset.Comment: ACM Multimedia 201
    • …
    corecore