Decoding billions of integers per second through vectorization
In many important applications -- such as search engines and relational
database systems -- data is stored in the form of arrays of integers. Encoding
and, most importantly, decoding of these arrays consumes considerable CPU time.
Therefore, substantial effort has been made to reduce costs associated with
compression and decompression. In particular, researchers have exploited the
superscalar nature of modern processors and SIMD instructions. Nevertheless, we
introduce a novel vectorized scheme called SIMD-BP128 that improves over
previously proposed vectorized approaches. It is nearly twice as fast as the
previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the
same time, SIMD-BP128 saves up to 2 bits per integer. For even better
compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has
a compression ratio within 10% of a state-of-the-art scheme (Simple-8b) while
being two times faster during decoding.
Comment: For software, see https://github.com/lemire/FastPFor. For data, see
http://boytsov.info/datasets/clueweb09gap
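To make the idea of packing integers with a shared bit width more concrete, here is a minimal scalar sketch in Python. It is not the paper's vectorized SIMD-BP128 implementation; it only illustrates the general binary-packing principle, in which every integer in a block is stored with the number of bits needed by the block's largest value. The function names `pack_block` and `unpack_block` are hypothetical, and the block size of 128 in the usage note is an assumption taken from the scheme's name.

```python
def pack_block(values):
    """Pack non-negative integers using one shared bit width (illustrative only)."""
    if any(v < 0 for v in values):
        raise ValueError("only non-negative integers can be packed")
    # The widest value in the block decides the bit width for all of them.
    width = max((v.bit_length() for v in values), default=0)
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)
    nbytes = (len(values) * width + 7) // 8
    return width, len(values), packed.to_bytes(nbytes, "little")


def unpack_block(width, count, data):
    """Recover the original integers from a block produced by pack_block."""
    packed = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]
```

Under these assumptions, a block of 128 integers whose largest value needs 10 bits occupies 160 bytes instead of the 512 bytes used by 32-bit integers. A real implementation would pack and unpack many values per instruction rather than looping over them one at a time.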
The Inflation Technique Completely Solves the Causal Compatibility Problem
The causal compatibility question asks whether a given causal structure graph
-- possibly involving latent variables -- constitutes a genuinely plausible
causal explanation for a given probability distribution over the graph's
observed variables. Algorithms predicated on merely necessary constraints for
causal compatibility typically suffer from false negatives, i.e. they admit
incompatible distributions as apparently compatible with the given graph. In
[arXiv:1609.00672], one of us introduced the inflation technique for
formulating useful relaxations of the causal compatibility problem in terms of
linear programming. In this work, we develop a formal hierarchy of such causal
compatibility relaxations. We prove that inflation is asymptotically tight,
i.e., that the hierarchy converges to a zero-error test for causal
compatibility. In this sense, the inflation technique fulfills a longstanding
desideratum in the field of causal inference. We quantify the rate of
convergence by showing that any distribution which passes the $n$th-order
inflation test must be $O(n^{-1/2})$-close in Euclidean norm to some
distribution genuinely compatible with the given causal structure. Furthermore,
we show that for many causal structures, the (unrelaxed) causal compatibility
problem is faithfully formulated already by either the first or second order
inflation test.
Comment: Updated to match forthcoming journal publication as closely as
possible. Some content removed for brevity. Expanded citations. Most
footnotes moved into the main text. Significant changes to subsection 4.1,
where we corrected an error in the example of second order inflation not
converging, and added a converse example where second order inflation
outperforms other techniques.
Instructions-Based Detection of Sophisticated Obfuscation and Packing
Thousands of new malware samples are released online every day. The vast majority employ some kind of obfuscation, ranging from simple XOR encryption to more sophisticated anti-analysis, packing, and encryption techniques. Dynamic analysis methods can unpack a file and reveal its hidden code, but they are far more time-consuming than static analysis. Given the volume of new malware produced daily, relying solely on dynamic analysis is impractical. An effective way to filter samples and delegate only obfuscated and suspicious ones to more rigorous tests would therefore significantly improve the overall scanning process. Current techniques for identifying obfuscation rely mainly on signatures of known packers, file entropy scores, or anomalies in the file header. However, these features are easy to bypass and do not cover all types of obfuscation. In this paper, we introduce a novel approach that identifies obfuscated files based on anomalies in their instruction-based characteristics. We detect interleaving instructions, which result from the opaque-predicate anti-disassembly trick, and present distinguishing statistical properties based on the opcodes and control flow graphs of obfuscated files. Our detection system combines these features with other structural features of the file and achieves very good results in detecting obfuscated malware.
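The abstract names the file entropy score as one of the existing features for spotting obfuscation. As a minimal sketch of that baseline feature (not the paper's instruction-based method), the Shannon entropy of a file's byte distribution can be computed as follows. The function name `byte_entropy` is hypothetical, and any threshold for treating a file as packed would be an assumption that must be tuned on real data.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    # Sum -p * log2(p) over the byte values that actually occur.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Compressed or encrypted content tends to score close to 8 bits per byte, while plain code and text usually score lower. A score of this kind looks only at the byte distribution, which is consistent with the abstract's point that such features are easy to bypass.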
Role of Secondary Motifs in Fast Folding Polymers: A Dynamical Variational Principle
A fascinating and open question challenging biochemistry, physics and even
geometry is the presence of highly regular motifs such as alpha-helices in the
folded state of biopolymers and proteins. Stimulating explanations ranging from
chemical propensity to simple geometrical reasoning have been invoked to
rationalize the existence of such secondary structures. We formulate a
dynamical variational principle for selection in conformation space based on
the requirement that the backbone of the native state of biologically viable
polymers be rapidly accessible from the denatured state. The variational
principle is shown to result in the emergence of helical order in compact
structures.
Comment: 4 pages, RevTex, 4 eps figures