Smart Greybox Fuzzing
Coverage-based greybox fuzzing (CGF) is one of the most successful methods
for automated vulnerability detection. Given a seed file (as a sequence of
bits), CGF randomly flips, deletes, or copies bits to generate new files. CGF
iteratively constructs (and fuzzes) a seed corpus by retaining those generated
files which enhance coverage. However, random bitflips are unlikely to produce
valid files (or valid chunks in files), for applications processing complex
file formats.
In this work, we introduce smart greybox fuzzing (SGF) which leverages a
high-level structural representation of the seed file to generate new files. We
define innovative mutation operators that work on the virtual file structure
rather than on the bit level, which allows SGF to explore completely new input
domains while maintaining file validity. We introduce a novel validity-based
power schedule that enables SGF to spend more time generating files that are
more likely to pass the parsing stage of the program, which can expose
vulnerabilities much deeper in the processing logic.
Our evaluation demonstrates the effectiveness of SGF. On several libraries
that parse structurally complex files, our tool AFLSmart explores substantially
more paths (up to 200%) and exposes more vulnerabilities than baseline AFL. Our
tool AFLSmart has discovered 42 zero-day vulnerabilities in widely-used,
well-tested tools and libraries; so far 17 CVEs were assigned.
Comment: Accepted IEEE Transactions on Software Engineering, 202
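The chunk-level mutation described above can be sketched in a few lines. This is a minimal illustration, not AFLSmart's implementation: the flat (chunk_type, payload) representation, the operator names, and the same-type insertion rule are all simplifying assumptions standing in for the paper's virtual file structure.

```python
import random

# A "virtual file structure": a flat list of (chunk_type, payload) pairs.
# Chunk types and payloads here are illustrative, not a real file format.

def smart_delete(chunks, rng):
    """Chunk deletion: drop one whole chunk instead of random bits."""
    if len(chunks) <= 1:
        return list(chunks)
    i = rng.randrange(len(chunks))
    return chunks[:i] + chunks[i + 1:]

def smart_add(chunks, donor, rng):
    """Chunk addition: insert a donor chunk next to a chunk of the
    same type, so the file's structure stays plausible."""
    ctype, payload = donor
    positions = [i for i, (t, _) in enumerate(chunks) if t == ctype]
    if not positions:
        return list(chunks)
    i = rng.choice(positions)
    return chunks[:i + 1] + [donor] + chunks[i + 1:]

def smart_splice(a, b, rng):
    """Chunk splicing: keep a prefix of file a, append a suffix of file b."""
    i = rng.randrange(1, len(a)) if len(a) > 1 else 1
    j = rng.randrange(len(b)) if b else 0
    return a[:i] + b[j:]
```

Because every operator moves whole chunks, each mutant keeps a chunk-aligned layout, which is why such mutants are far more likely to survive the parsing stage than bit-flipped ones.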
LCNN: Lookup-based Convolutional Neural Network
Porting state of the art deep learning algorithms to resource constrained
compute platforms (e.g. VR, AR, wearables) is extremely challenging. We propose
a fast, compact, and accurate model for convolutional neural networks that
enables efficient learning and inference. We introduce LCNN, a lookup-based
convolutional neural network that encodes convolutions by few lookups to a
dictionary that is trained to cover the space of weights in CNNs. Training LCNN
involves jointly learning a dictionary and a small set of linear combinations.
The size of the dictionary naturally traces a spectrum of trade-offs between
efficiency and accuracy. Our experimental results on ImageNet challenge show
that LCNN can offer 3.2x speedup while achieving 55.1% top-1 accuracy using
AlexNet architecture. Our fastest LCNN offers 37.6x speed up over AlexNet while
maintaining 44.3% top-1 accuracy. LCNN not only offers dramatic speed ups at
inference, but it also enables efficient training. In this paper, we show the
benefits of LCNN in few-shot learning and few-iteration learning, two crucial
aspects of on-device training of deep learning models.
Comment: CVPR 1
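The lookup idea can be illustrated with a toy 1x1 convolution at a single spatial position. The shapes, the random dictionary, and the per-filter index/weight arrays below are assumptions for illustration, not the trained dictionary from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

k, c, m = 8, 16, 32          # dictionary size, input channels, number of filters
s = 3                        # each filter combines only s dictionary entries
D = rng.standard_normal((k, c))          # shared dictionary of channel vectors

# Each filter f = sum of s dictionary rows D[I[f]] weighted by C[f].
I = rng.integers(0, k, size=(m, s))
C = rng.standard_normal((m, s))

def lcnn_response(x):
    """Responses of all m filters at one position: compute the k
    dictionary dot products once, then each filter is a cheap
    s-term lookup-and-combine."""
    dict_resp = D @ x                    # k dot products, shared by all filters
    return np.array([C[f] @ dict_resp[I[f]] for f in range(m)])

def dense_response(x):
    """Equivalent dense filters, to check the two views agree when
    the filters really are built from the dictionary."""
    W = np.stack([C[f] @ D[I[f]] for f in range(m)])   # m x c weight matrix
    return W @ x
```

The cost drops from m dense dot products to k dot products plus m cheap s-term sums, which is the source of the speedups the abstract reports; the accuracy trade-off comes from forcing filters into the dictionary's span.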
GPU LSM: A Dynamic Dictionary Data Structure for the GPU
We develop a dynamic dictionary data structure for the GPU, supporting fast
insertions and deletions, based on the Log Structured Merge tree (LSM). Our
implementation on an NVIDIA K40c GPU has an average update (insertion or
deletion) rate of 225 M elements/s, 13.5x faster than merging items into a
sorted array. The GPU LSM supports the retrieval operations of lookup, count,
and range query operations with an average rate of 75 M, 32 M and 23 M
queries/s respectively. The trade-off for the dynamic updates is that the
sorted array is almost twice as fast on retrievals. We believe that our GPU LSM
is the first dynamic general-purpose dictionary data structure for the GPU.
Comment: 11 pages, accepted to appear in the Proceedings of IEEE International
Parallel and Distributed Processing Symposium (IPDPS'18)
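The cascading-merge behaviour of an LSM can be sketched on the CPU. This is a deliberately small model, not the paper's GPU design: the batch size, the per-level sorted lists, and the omission of tombstone-based deletion are all simplifications.

```python
import bisect

class SimpleLSM:
    """CPU sketch of an LSM: level i is either empty or holds
    batch * 2**i sorted entries; a full batch of inserts cascades
    downward by merging, like carry propagation in binary addition."""
    def __init__(self, batch=4):
        self.batch = batch
        self.levels = []              # each level: sorted list of (key, value)

    def insert_batch(self, items):
        run = sorted(items)           # one sorted run of exactly `batch` items
        assert len(run) == self.batch
        i = 0
        while i < len(self.levels) and self.levels[i]:
            run = sorted(run + self.levels[i])   # merge into a larger run
            self.levels[i] = []
            i += 1
        if i == len(self.levels):
            self.levels.append([])
        self.levels[i] = run

    def lookup(self, key):
        for level in self.levels:     # search newest (smallest) level first
            j = bisect.bisect_left(level, (key,))
            if j < len(level) and level[j][0] == key:
                return level[j][1]
        return None
```

Searching newest-first is what lets updates land cheaply while retrievals pay by probing several sorted runs, the trade-off the abstract quantifies against a single sorted array.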
Deep Neural Networks Ensemble for Detecting Medication Mentions in Tweets
Objective: After years of research, Twitter posts are now recognized as an
important source of patient-generated data, providing unique insights into
population health. A fundamental step to incorporating Twitter data in
pharmacoepidemiological research is to automatically recognize medication
mentions in tweets. Given that lexical searches for medication names may fail
due to misspellings or ambiguity with common words, we propose a more advanced
method to recognize them. Methods: We present Kusuri, an Ensemble Learning
classifier, able to identify tweets mentioning drug products and dietary
supplements. Kusuri ("medication" in Japanese) is composed of two modules.
First, four different classifiers (lexicon-based, spelling-variant-based,
pattern-based and one based on a weakly-trained neural network) are applied in
parallel to discover tweets potentially containing medication names. Second, an
ensemble of deep neural networks encoding morphological, semantic and
long-range dependencies of important words in the tweets discovered is used to
make the final decision. Results: On a balanced (50-50) corpus of 15,005
tweets, Kusuri demonstrated performances close to human annotators with 93.7%
F1-score, the best score achieved thus far on this corpus. On a corpus made of
all tweets posted by 113 Twitter users (98,959 tweets, with only 0.26%
mentioning medications), Kusuri obtained 76.3% F1-score. To our knowledge, no
prior drug extraction system has been evaluated on such an extremely unbalanced
dataset. Conclusion: The system identifies tweets mentioning drug names with
performance high enough to ensure its usefulness, and it is ready to be
integrated into larger natural language processing systems.
Comment: This is a pre-copy-editing, author-produced PDF of an article
accepted for publication in JAMIA following peer review. The definitive
publisher-authenticated version is "D. Weissenbacher, A. Sarker, A. Klein, K.
O'Connor, A. Magge, G. Gonzalez-Hernandez, Deep neural networks ensemble for
detecting medication mentions in tweets, Journal of the American Medical
Informatics Association, ocz156, 2019".
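The two-module design can be sketched as a pipeline: a high-recall first stage that keeps any tweet flagged by at least one cheap filter, then a majority vote over a second-stage ensemble. The filter and classifier callables below are hypothetical stand-ins, not Kusuri's actual lexicon, pattern, or neural components.

```python
def stage_one(tweet, filters):
    """Module 1 sketch: run the candidate filters in parallel; a tweet
    survives if any filter flags it (favoring recall over precision)."""
    return any(f(tweet) for f in filters)

def kusuri(tweets, filters, ensemble):
    """Two-module pipeline: cheap parallel filters narrow the stream,
    then an ensemble vote (standing in for the deep-network ensemble)
    makes the final decision on the survivors."""
    candidates = [t for t in tweets if stage_one(t, filters)]
    return [t for t in candidates
            if sum(clf(t) for clf in ensemble) > len(ensemble) / 2]
```

Splitting the work this way is what makes the extremely unbalanced setting tractable: the expensive ensemble only ever sees the small candidate set.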
Dynamically Hierarchy Revolution: DirNet for Compressing Recurrent Neural Network on Mobile Devices
Recurrent neural networks (RNNs) achieve cutting-edge performance on a
variety of problems. However, due to their high computational and memory
demands, deploying RNNs on resource constrained mobile devices is a challenging
task. To guarantee minimum accuracy loss with higher compression rate and
driven by the mobile resource requirement, we introduce a novel model
compression approach DirNet based on an optimized fast dictionary learning
algorithm, which 1) dynamically mines the dictionary atoms of the projection
dictionary matrix within each layer to adjust the compression rate, and 2)
adaptively changes the sparsity of the sparse codes across the hierarchical
layers. Experimental results on a language model and an ASR model trained with
a 1000h
speech dataset demonstrate that our method significantly outperforms prior
approaches. Evaluated on off-the-shelf mobile devices, we are able to reduce
the size of original model by eight times with real-time model inference and
negligible accuracy loss.
Comment: Accepted by IJCAI-ECAI 201
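The underlying decomposition, approximating a weight matrix W as a dictionary times sparse codes with a tunable sparsity level, can be sketched as follows. The truncated SVD here stands in for the paper's learned projection dictionary, and the hard thresholding stands in for its adaptive sparse coding; both are assumptions for illustration.

```python
import numpy as np

def compress_layer(W, n_atoms, sparsity):
    """Sketch: approximate W (out x in) as D @ S, where D has n_atoms
    columns and each column of S keeps only `sparsity` nonzeros.
    Varying (n_atoms, sparsity) per layer trades size for accuracy."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    D = U[:, :n_atoms] * s[:n_atoms]       # dictionary atoms (out x n_atoms)
    S = Vt[:n_atoms]                       # dense codes (n_atoms x in)
    # Hard-threshold each column to its `sparsity` largest-magnitude entries.
    keep = np.argsort(-np.abs(S), axis=0)[:sparsity]
    mask = np.zeros_like(S, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=0)
    return D, np.where(mask, S, 0.0)
```

Storing D plus the nonzeros of S replaces the dense W, which is where the roughly eight-fold size reduction quoted in the abstract comes from in this family of methods.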
Inner Product Similarity Search using Compositional Codes
This paper addresses the nearest neighbor search problem under inner product
similarity and introduces a compact code-based approach. The idea is to
approximate a vector using the composition of several elements selected from a
source dictionary and to represent this vector by a short code composed of the
indices of the selected elements. The inner product between a query vector and
a database vector is efficiently estimated from the query vector and the short
code of the database vector. Via theoretical and empirical analysis, we show
the superior performance of the proposed group -selection algorithm, which
selects elements from source dictionaries for vector approximation, in terms
of search accuracy and efficiency for compact codes of the same length.
Experimental results on large-scale datasets ( and SIFT features, linear
models and Netflix) demonstrate the superiority of the proposed approach.
Comment: The approach presented in this paper (ECCV14 submission) is closely
related to multi-stage vector quantization and residual quantization. Thanks
the reviewers (CVPR14 and ECCV14) for pointing out the relationship to the
two algorithms. Related paper:
http://sites.skoltech.ru/app/data/uploads/sites/2/2013/09/CVPR14.pdf, which
also adopts the summation of vectors for vector approximation.
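The core estimation trick, recovering an inner product from a short code with a few table lookups, can be sketched as follows. The random dictionaries and the greedy encoder are illustrative stand-ins for the paper's trained dictionaries and selection algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

M, K, d = 4, 16, 32              # M dictionaries, K elements each, dimension d
dicts = rng.standard_normal((M, K, d))

def encode(x):
    """Greedy sketch: per dictionary, pick the element most aligned
    with the residual; the code is just the M selected indices."""
    residual = x.copy()
    code = np.empty(M, dtype=int)
    for m in range(M):
        idx = int(np.argmax(dicts[m] @ residual))
        code[m] = idx
        residual = residual - dicts[m, idx]
    return code

def inner_product(query, code):
    """Estimate <query, x> from the code alone: build the lookup table
    tables[m, k] = <query, dicts[m, k]> once per query, then each
    database vector costs only M table lookups and additions."""
    tables = dicts @ query                 # shape (M, K)
    return sum(tables[m, code[m]] for m in range(M))
```

Because the approximation is a plain sum of dictionary elements, the inner product distributes over the sum exactly, so the only error comes from the encoding step, not from the lookup-based evaluation.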
Urban Delay Tolerant Network Simulator (UDTNSim v0.1)
Delay Tolerant Networking (DTN) is an approach to networking which handles
network disruptions and high delays that may occur in many kinds of
communication networks. The major reasons for high delay include partial
connectivity of networks as can be seen in many types of ad hoc wireless
networks with frequent network partitions, long propagation time as experienced
in inter-planetary and deep space networks, and frequent link disruptions due
to the mobility of nodes as observed in terrestrial wireless network
environments. Experimenting with network architectures, protocols, and mobility
models in such real-world scenarios is difficult due to the complexities
involved in the network environment. Therefore, in this document, we present
the documentation of an Urban Delay Tolerant Network Simulator (UDTNSim)
version 0.1, capable of simulating urban road network environments with DTN
characteristics including mobility models and routing protocols. The mobility
models included in this version of UDTNSim are (i) Stationary Movement, (ii)
Simple Random Movement, (iii) Path Type Based Movement, (iv) Path Memory Based
Movement, (v) Path Type with Restricted Movement, and (vi) Path Type with Wait
Movement. In addition to mobility models, we also provide three routing and
data hand-off protocols: (i) Epidemic Routing, (ii) Superior Only Handoff, and
(iii) Superior Peer Handoff. UDTNSim v0.1 is designed using object-oriented
programming approach in order to provide flexibility in addition of new
features to the DTN environment. UDTNSim v0.1 is distributed as an open source
simulator for the use of the research community.
Comment: 40 pages and 4 figures
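Epidemic Routing, the first of the three hand-off protocols listed, can be sketched in a few lines. The message-id sets and the contact function below are a minimal abstraction for illustration, not UDTNSim's actual classes or API.

```python
class Node:
    """Minimal epidemic-routing sketch: every contact floods any
    message the peer has not seen yet (summary-vector exchange)."""
    def __init__(self, name):
        self.name = name
        self.buffer = set()          # message ids carried by this node

def contact(a, b):
    """When two nodes meet, each copies the messages it is missing,
    so messages spread hop by hop despite intermittent connectivity."""
    a_new = b.buffer - a.buffer
    b_new = a.buffer - b.buffer
    a.buffer |= a_new
    b.buffer |= b_new
```

This store-carry-forward behaviour is what lets DTNs deliver data across the partitions and long delays described above, at the cost of many redundant copies, which is what the Superior Only and Superior Peer hand-off variants constrain.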
Fast Convolutional Sparse Coding in the Dual Domain
Convolutional sparse coding (CSC) is an important building block of many
computer vision applications ranging from image and video compression to deep
learning. We present two contributions to the state of the art in CSC. First,
we significantly speed up the computation by proposing a new optimization
framework that tackles the problem in the dual domain. Second, we extend the
original formulation to higher dimensions in order to process a wider range of
inputs, such as RGB images and videos. Our results show up to 20 times speedup
compared to current state-of-the-art CSC solvers.
Histopathological Image Classification using Discriminative Feature-oriented Dictionary Learning
In histopathological image analysis, feature extraction for classification is
a challenging task due to the diversity of histology features suitable for each
problem as well as the presence of rich geometrical structures. In this paper, we
propose an automatic feature discovery framework via learning class-specific
dictionaries and present a low-complexity method for classification and disease
grading in histopathology. Essentially, our Discriminative Feature-oriented
Dictionary Learning (DFDL) method learns class-specific dictionaries such that
under a sparsity constraint, the learned dictionaries allow representing a new
image sample parsimoniously via the dictionary corresponding to the class
identity of the sample. At the same time, the dictionary is designed to be
poorly capable of representing samples from other classes. Experiments on three
challenging real-world image databases: 1) histopathological images of
intraductal breast lesions, 2) mammalian kidney, lung and spleen images
provided by the Animal Diagnostics Lab (ADL) at Pennsylvania State University,
and 3) brain tumor images from The Cancer Genome Atlas (TCGA) database, reveal
the merits of our proposal over state-of-the-art alternatives. Moreover, we
demonstrate that DFDL exhibits a more graceful decay in classification accuracy
as the number of training images decreases, which is highly desirable in
practice, where generous training data is often not available.
Comment: Accepted version to Transactions on Medical Imaging, 13 pages
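The classification rule (assign a sample to the class whose dictionary reconstructs it best) can be sketched as follows. Plain least squares stands in for DFDL's sparsity-constrained coding, and the random toy dictionaries are assumptions for illustration.

```python
import numpy as np

def classify(sample, class_dicts):
    """Sketch of dictionary-based classification: code the sample
    against each class dictionary (least squares stands in for sparse
    coding) and return the class with the smallest residual."""
    errors = []
    for D in class_dicts:                    # D: features x atoms
        codes, *_ = np.linalg.lstsq(D, sample, rcond=None)
        errors.append(np.linalg.norm(sample - D @ codes))
    return int(np.argmin(errors))
```

The discriminative training described in the abstract sharpens exactly this rule: each dictionary is learned to reconstruct its own class well and the other classes poorly, widening the residual gap the argmin relies on.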
Adaptive Partitioning for Very Large RDF Data
Distributed RDF systems partition data across multiple computer nodes
(workers). Some systems perform cheap hash partitioning, which may result in
expensive query evaluation, while others apply heuristics aiming at minimizing
inter-node communication during query evaluation. This requires an expensive
data preprocessing phase, leading to high startup costs for very large RDF
knowledge bases. Apriori knowledge of the query workload has also been used to
create partitions, which however are static and do not adapt to workload
changes; hence, inter-node communication cannot be consistently avoided for
queries that are not favored by the initial data partitioning.
In this paper, we propose AdHash, a distributed RDF system, which addresses
the shortcomings of previous work. First, AdHash applies lightweight
partitioning on the initial data, which distributes triples by hashing on their
subjects; this renders its startup overhead low. At the same time, the
locality-aware query optimizer of AdHash takes full advantage of the
partitioning to (i) support the fully parallel processing of join patterns on
subjects and (ii) minimize data communication for general queries by applying
hash distribution of intermediate results instead of broadcasting, wherever
possible. Second, AdHash monitors the data access patterns and dynamically
redistributes and replicates the instances of the most frequent ones among
workers. As a result, the communication cost for future queries is drastically
reduced or even eliminated. To control replication, AdHash implements an
eviction policy for the redistributed patterns. Our experiments with synthetic
and real data verify that AdHash (i) starts faster than all existing systems,
(ii) processes thousands of queries before other systems become online, and
(iii) gracefully adapts to the query load, being able to evaluate queries on
billion-scale RDF data in sub-seconds.
Comment: 25 pages
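The lightweight initial partitioning step can be sketched as follows. The triple tuples and per-worker lists are illustrative only, not AdHash's storage layout.

```python
def partition(triples, n_workers):
    """Subject-hash partitioning sketch: each (s, p, o) triple goes to
    the worker owning hash(s), so all triples sharing a subject are
    co-located and star joins on subjects need no communication."""
    workers = [[] for _ in range(n_workers)]
    for s, p, o in triples:
        workers[hash(s) % n_workers].append((s, p, o))
    return workers
```

Because this is a single hash per triple with no global analysis, startup cost stays low; the adaptive redistribution described above then patches up only the query patterns that this initial placement does not favor.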