Blazes: Coordination Analysis for Distributed Programs
Distributed consistency is perhaps the most discussed topic in distributed
systems today. Coordination protocols can ensure consistency, but in practice
they cause undesirable performance unless used judiciously. Scalable
distributed architectures avoid coordination whenever possible, but
under-coordinated systems can exhibit behavioral anomalies under fault, which
are often extremely difficult to debug. This raises significant challenges for
distributed system architects and developers. In this paper we present Blazes,
a cross-platform program analysis framework that (a) identifies program
locations that require coordination to ensure consistent executions, and (b)
automatically synthesizes application-specific coordination code that can
significantly outperform general-purpose techniques. We present two case
studies, one using annotated programs in the Twitter Storm system, and another
using the Bloom declarative language. Comment: Updated to include additional materials from the original technical
report: derivation rules, output stream label
OGSA first impressions: a case study re-engineering a scientific application with the open grid services architecture
We present a case study of our experience re-engineering a scientific application using the Open Grid Services Architecture (OGSA), a new specification for developing Grid applications using web service technologies such as WSDL and SOAP. During the last decade, UCL's Chemistry department has developed a computational approach for predicting the crystal structures of small molecules. However, each search involves running large iterations of computationally expensive calculations and currently takes a few months to perform. Making use of early implementations of the OGSA specification we have wrapped the Fortran binaries into OGSI-compliant service interfaces to expose the existing scientific application as a set of loosely coupled web services. We show how the OGSA implementation facilitates the distribution of such applications across a large network, radically improving performance of the system through parallel CPU capacity, coordinated resource management and automation of the computational process. We discuss the difficulties that we encountered turning Fortran executables into OGSA services and delivering a robust, scalable system. One unusual aspect of our approach is the way we transfer input and output data for the Fortran codes. Instead of employing a file transfer service we transform the XML encoded data in the SOAP message to native file format, where possible using XSLT stylesheets. We also discuss a computational workflow service that enables users to distribute and manage parts of the computational process across different clusters and administrative domains. We examine how our experience re-engineering the polymorph prediction application led to this approach and to what extent our efforts have succeeded.
Improving Malware Detection Accuracy by Extracting Icon Information
Detecting PE malware files is now commonly approached using statistical and
machine learning models. While these models commonly use features extracted
from the structure of PE files, we propose that icons from these files can also
help better predict malware. We propose an innovative machine learning approach
to extract information from icons. Our proposed approach consists of two steps:
1) extracting icon features using summary statistics, histogram of gradients
(HOG), and a convolutional autoencoder, 2) clustering icons based on the
extracted icon features. Using publicly available data and by using machine
learning experiments, we show our proposed icon clusters significantly boost
the efficacy of malware prediction models. In particular, our experiments show
an average accuracy increase of 10% when icon clusters are used in the
prediction model. Comment: Full version. IEEE MIPR 201
Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture
We present the architecture behind Twitter's real-time related query
suggestion and spelling correction service. Although these tasks have received
much attention in the web search literature, the Twitter context introduces a
real-time "twist": after significant breaking news events, we aim to provide
relevant results within minutes. This paper provides a case study illustrating
the challenges of real-time data processing in the era of "big data". We tell
the story of how our system was built twice: our first implementation was built
on a typical Hadoop-based analytics stack, but was later replaced because it
did not meet the latency requirements necessary to generate meaningful
real-time results. The second implementation, which is the system deployed in
production, is a custom in-memory processing engine specifically designed for
the task. This experience taught us that the current typical usage of Hadoop as
a "big data" platform, while great for experimentation, is not well suited to
low-latency processing, and points the way to future work on data analytics
platforms that can handle "big" as well as "fast" data.
Algorithmic Clustering of Music
We present a fully automatic method for music classification, based only on
compression of strings that represent the music pieces. The method uses no
background knowledge about music whatsoever: it is completely general and can,
without change, be used in different areas like linguistic classification and
genomics. It is based on an ideal theory of the information content in
individual objects (Kolmogorov complexity), information distance, and a
universal similarity metric. Experiments show that the method distinguishes
reasonably well between various musical genres and can even cluster pieces by
composer. Comment: 17 pages, 11 figures
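The compression-based similarity idea in this abstract can be illustrated with a minimal sketch. Kolmogorov complexity is not computable, so a common practical stand-in is the normalized compression distance (NCD), computed with a real compressor. The sketch below assumes `zlib` as that compressor and uses made-up byte strings as hypothetical "pieces"; neither choice is taken from the paper, and the function names are illustrative only.

```python
import hashlib
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data; a rough stand-in for information content."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical toy inputs (assumptions, not data from the paper):
# two strings sharing most of a repeated motif, and one unrelated string.
piece_a = b"C E G C E G A F " * 40
piece_b = b"C E G C E G A F " * 38 + b"D F A D F A G E " * 2
# Deterministic, hard-to-compress bytes standing in for an unrelated piece.
piece_c = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(20))

if __name__ == "__main__":
    print("NCD(a, b) =", round(ncd(piece_a, piece_b), 3))
    print("NCD(a, c) =", round(ncd(piece_a, piece_c), 3))
```

Smaller values indicate that the two inputs share more structure. A full pipeline would compute the pairwise distance matrix over all pieces and feed it to a clustering method, which this sketch does not attempt.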
A computationally efficient framework for large-scale distributed fingerprint matching
A dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfilment of requirements for the degree of Master of Science, School of Computer Science and Applied Mathematics. May 2017. Biometric features have been widely implemented to be utilized for forensic and civil applications. Amongst many different kinds of biometric characteristics, the fingerprint is globally accepted and remains the mostly used biometric characteristic by commercial and industrial societies due to its easy acquisition, uniqueness, stability and reliability.
There are currently various effective solutions available, however the fingerprint identification is still not considered a fully solved problem mainly due to accuracy and computational time requirements. Although many of the fingerprint recognition systems based on minutiae provide good accuracy, the systems with very large databases require fast and real time comparison of fingerprints, they often either fail to meet the high performance speed requirements or compromise the accuracy.
For fingerprint matching that involves databases containing millions of fingerprints, real time identification can only be obtained through the implementation of optimal algorithms that may utilize the given hardware as robustly and efficiently as possible. There are currently no known distributed database and computing framework available that deal with real time solution for fingerprint recognition problem involving databases containing as many as sixty million fingerprints, the size which is close to the size of the South African population.
This research proposal intends to serve two main purposes: 1) exploit and scale the best known minutiae matching algorithm for a minimum of sixty million fingerprints; and 2) design a framework for distributed database to deal with large fingerprint databases based on the results obtained in the former item. GR201