Bouma2 - A Quasi-Stateless, Tunable Multiple String-Match Algorithm
The Bouma2 algorithm attempts to challenge the prevalent "stateful" exact
string-match paradigms by suggesting a "quasi-stateless" approach. We claim
that using state-machines to solve the multiple exact string-match problem
introduces a hidden artificial constraint, namely the Consume-Order Dependency,
which results in unnecessary overhead. Bouma2 is not restricted in this sense;
we postulate that this allows memory efficiency and improved performance versus
its state-machine equivalents. The heart of the Bouma2 preprocessing stage is
formulated as a weighted Integer Linear Programming problem that can be tuned
for memory footprint and performance optimization. Specifically, this allows
Bouma2 to be input-sensitive, as tuning can be based on input characteristics.
Evaluating Bouma2 against the Aho-Corasick variant of the popular Snort
Intrusion Prevention System, we demonstrate double the throughput while using
about 10% of the memory.
Comment: 33 pages, 3 figures
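The "stateful" paradigm the abstract challenges is typified by Aho-Corasick, where a single automaton consumes the input strictly symbol by symbol (the Consume-Order Dependency the authors describe). As a point of reference, a minimal sketch of that classical state-machine baseline (this is the approach Bouma2 argues against, not Bouma2 itself):

```python
from collections import deque

def build_automaton(patterns):
    # Classic Aho-Corasick trie with goto, fail, and output links.
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(p)
    queue = deque(goto[0].values())  # BFS so fail links of parents exist first
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]  # inherit matches ending at the fail state
    return goto, fail, out

def search(text, patterns):
    # One state transition per consumed symbol: the consume-order-dependent
    # scan that a stateless scheme would avoid.
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for p in out[state]:
            hits.append((i - len(p) + 1, p))
    return hits
```

Every byte of input forces a serialized automaton step, which is exactly the overhead the abstract attributes to stateful matchers.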
Big Data and Cross-Document Coreference Resolution: Current State and Future Opportunities
Information Extraction (IE) is the task of automatically extracting
structured information from unstructured/semi-structured machine-readable
documents. Among various IE tasks, extracting actionable intelligence from an
ever-increasing amount of data depends critically upon Cross-Document
Coreference Resolution (CDCR) - the task of identifying entity mentions across
multiple documents that refer to the same underlying entity. Recently, document
datasets on the order of tera- to peta-bytes have raised many challenges for
performing effective CDCR, such as scaling to large numbers of mentions and the
limited representational power of existing models. The problem of analysing
such datasets is often labelled "big data". The aim of this paper is to provide
readers with an understanding of the central concepts, subtasks, and the
current state of the art in the CDCR process. We provide an assessment of
existing tools and techniques for CDCR subtasks and highlight big data
challenges in each of
them to help readers identify important and outstanding issues for further
investigation. Finally, we provide concluding remarks and discuss possible
directions for future work.
Literature Review Of Attribute Level And Structure Level Data Linkage Techniques
Data Linkage is an important step that can provide valuable insights for
evidence-based decision making, especially for crucial events. Performing
sensible queries across heterogeneous databases containing millions of records
is a complex task that requires a complete understanding of each contributing
database's schema to define the structure of its information. The key aim is to
approximate the structure and content of the induced data into a concise
synopsis in order to extract and link meaningful data-driven facts. We identify
four major research issues in Data Linkage: the costs associated with pair-wise
matching, record-matching overheads, restrictions on the semantic flow of
information, and single-order classification limitations. In this paper, we
give a literature review of research in Data Linkage. The purpose of this
review is to establish a basic understanding of Data Linkage and to discuss
the background in the Data Linkage research domain. Particularly, we focus on
the literature related to the recent advancements in Approximate Matching
algorithms at Attribute Level and Structure Level. Their efficiency,
functionality, and limitations are critically analysed, and open problems are
exposed.
Comment: 20 pages
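Attribute-level approximate matching of the kind this review surveys is commonly built on edit distance. A minimal sketch, assuming a plain Levenshtein threshold (the function names and the two-edit tolerance are illustrative, not taken from any surveyed system):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def attributes_match(v1, v2, max_edits=2):
    # Attribute-level approximate match: link two field values when their
    # edit distance stays within a small tolerance.
    return levenshtein(v1.lower(), v2.lower()) <= max_edits
```

Structure-level techniques then aggregate such per-attribute decisions across a whole record or schema rather than comparing fields in isolation.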
Massively parallel read mapping on GPUs with PEANUT
We present PEANUT (ParallEl AligNment UTility), a highly parallel GPU-based
read mapper with several distinguishing features, including a novel q-gram
index (called the q-group index) with small memory footprint built on-the-fly
over the reads, and the option to output either only the best hits or all hits of
a read. Designing the algorithm particularly for the GPU architecture, we were
able to reach maximum core occupancy for several key steps. Our benchmarks show
that PEANUT outperforms other state-of-the-art mappers in terms of speed and
sensitivity. The software is available at http://peanut.readthedocs.org
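A conventional q-gram index, which PEANUT's q-group index refines, can be sketched with a simple seed-and-vote scheme (this is the textbook structure, not PEANUT's GPU-specific on-the-fly index):

```python
from collections import defaultdict

def build_qgram_index(reference, q=3):
    # Map every length-q substring of the reference to its start positions.
    index = defaultdict(list)
    for i in range(len(reference) - q + 1):
        index[reference[i:i + q]].append(i)
    return index

def candidate_positions(read, index, q=3):
    # Each q-gram of the read votes for a putative alignment start
    # (reference position minus read offset); positions with many votes
    # become candidates for full alignment.
    votes = defaultdict(int)
    for i in range(len(read) - q + 1):
        for pos in index.get(read[i:i + q], []):
            votes[pos - i] += 1
    return votes
```

On a GPU, the voting loop is the natural target for massive parallelism, since every q-gram lookup is independent.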
Moore-Machine Filtering for Timed and Untimed Pattern Matching
Monitoring is an important body of techniques in runtime verification of
real-time, embedded, and cyber-physical systems. Mathematically, the monitoring
problem can be formalized as a pattern matching problem against a pattern
automaton. Motivated by the needs in embedded applications---especially the
limited channel capacity between a sensor unit and a processor that
monitors---we pursue the idea of filtering as preprocessing for monitoring.
Technically, for a given pattern automaton, we present a construction of a
Moore machine that works as a filter. The construction is automata-theoretic,
and we find the use of Moore machines particularly suited for embedded
applications, not only because their sequential operation is relatively cheap
but also because they are amenable to hardware acceleration by dedicated
circuits. We prove soundness (i.e., absence of lost matches), too. We work in
two settings: in the untimed one, a pattern is an NFA; in the timed one, a
pattern is a timed automaton. The extension of our untimed construction to the
timed setting is technically involved, but our experiments demonstrate its
practical benefits.
Comment: Accepted for presentation at EMSOFT 2018 and for publication in IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems
(TCAD) as part of the ESWEEK-TCAD special issue
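Independent of the paper's specific construction, the appeal of a Moore machine as a filter is that its output depends only on the current state, so each consumed symbol maps cheaply to a keep/drop decision. A toy illustration (the pattern, state names, and outputs are invented for this example, not the authors' automaton):

```python
class MooreMachine:
    # A Moore machine: output is a function of the current state alone,
    # so each consumed symbol yields exactly one output symbol.
    def __init__(self, transitions, outputs, start):
        self.transitions, self.outputs, self.state = transitions, outputs, start

    def step(self, symbol):
        self.state = self.transitions[(self.state, symbol)]
        return self.outputs[self.state]

# Toy filter for the pattern "ab" over alphabet {a, b, c}: the machine
# keeps a symbol (output 1) only while it can still extend a match,
# and signals "drop" (output 0) otherwise.
transitions = {
    ("q0", "a"): "qa", ("q0", "b"): "q0", ("q0", "c"): "q0",
    ("qa", "a"): "qa", ("qa", "b"): "qab", ("qa", "c"): "q0",
    ("qab", "a"): "qa", ("qab", "b"): "q0", ("qab", "c"): "q0",
}
outputs = {"q0": 0, "qa": 1, "qab": 1}
m = MooreMachine(transitions, outputs, "q0")
mask = [m.step(s) for s in "cabca"]  # 1 marks symbols worth forwarding
```

Because the transition table and the per-state output are both fixed lookups, such a machine maps directly onto a small dedicated circuit on the sensor side of a bandwidth-limited channel.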
Toward Faultless Content-Based Playlists Generation for Instrumentals
This study deals with content-based musical playlists generation focused on
Songs and Instrumentals. Automatic playlist generation relies on collaborative
filtering and autotagging algorithms. Autotagging can solve the cold start
issue and popularity bias that are critical in music recommender systems.
However, autotagging remains imperfect and cannot yet generate satisfying
music playlists. In this paper, we suggest improvements toward better
autotagging-generated playlists compared to the state of the art. To assess our
method, we focus on the Song and Instrumental tags. Song and Instrumental are
two objective and opposite tags that are under-studied compared to genres or
moods, which are subjective and multi-modal tags. In this paper, we consider an
industrial real-world musical database that is unevenly distributed between
Songs and Instrumentals and bigger than databases used in previous studies. We
set up three incremental experiments to enhance automatic playlist generation.
Our suggested approach generates an Instrumental playlist with up to three
times fewer false positives than cutting-edge methods. Moreover, we provide a
design of experiment framework to foster research on Songs and Instrumentals.
We give insight into how to further improve the quality of generated playlists
and how to extend our methods to other musical tags. Furthermore, we provide
the source code to guarantee reproducible research.
Comment: single-column, 20 pages, 3 figures, 6 tables
Malicious Behavior Detection using Windows Audit Logs
As antivirus and network intrusion detection systems have increasingly proven
insufficient to detect advanced threats, large security operations centers have
moved to deploy endpoint-based sensors that provide deeper visibility into
low-level events across their enterprises. Unfortunately, for many
organizations in government and industry, the installation, maintenance, and
resource requirements of these newer solutions pose barriers to adoption and
are perceived as risks to organizations' missions.
To mitigate this problem we investigated the utility of agentless detection
of malicious endpoint behavior, using only the standard built-in Windows audit
logging facility as our signal. We found that Windows audit logs, while
emitting manageably sized data streams on the endpoints, provide enough
information to allow robust detection of malicious behavior. Audit logs provide
an effective, low-cost alternative to deploying additional expensive
agent-based breach detection systems in many government and industrial
settings, and can be used to detect, in our tests, 83% of malware
samples with a 0.1% false positive rate. They can also supplement already
existing host signature-based antivirus solutions, like Kaspersky, Symantec,
and McAfee, detecting, in our testing environment, 78% of malware missed by
those antivirus systems.
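The reported operating point (83% detection at a 0.1% false-positive rate) corresponds to thresholding a detector's scores. A generic sketch of how such a threshold is chosen and evaluated, using purely synthetic scores (not the paper's data, features, or model):

```python
def detection_at_fpr(benign_scores, malware_scores, target_fpr=0.001):
    # Allow at most target_fpr of the benign set above the threshold,
    # then report the fraction of malware scoring above it.
    k = int(target_fpr * len(benign_scores))       # permitted false positives
    thresh = sorted(benign_scores, reverse=True)[k]
    detected = sum(s > thresh for s in malware_scores)
    return detected / len(malware_scores)
```

In practice the scores would come from a classifier trained on audit-log features; the thresholding step itself is model-agnostic.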
HiFrames: High Performance Data Frames in a Scripting Language
Data frames in scripting languages are essential abstractions for processing
structured data. However, existing data frame solutions are either not
distributed (e.g., Pandas in Python) and therefore have limited scalability, or
they are not tightly integrated with array computations (e.g., Spark SQL). This
paper proposes a novel compiler-based approach where we integrate data frames
into the High Performance Analytics Toolkit (HPAT) to build HiFrames. It
provides expressive and flexible data frame APIs which are tightly integrated
with array operations. HiFrames then automatically parallelizes and compiles
relational operations along with other array computations in end-to-end data
analytics programs, and generates efficient MPI/C++ code. We demonstrate that
HiFrames is significantly faster than alternatives such as Spark SQL on
clusters, without forcing the programmer to switch to embedded SQL for part of
the program. HiFrames is 3.6x to 70x faster than Spark SQL for basic relational
operations, and can be up to 20,000x faster for advanced analytics operations,
such as weighted moving averages (WMA), that the map-reduce paradigm cannot
handle effectively. HiFrames is also 5x faster than Spark SQL for TPCx-BB Q26
on 64 nodes of the Cori supercomputer.
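The weighted moving average (WMA) cited as a workload that map-reduce handles poorly is a rolling computation over a sliding window. A plain-Python sketch using linearly increasing weights (one common WMA convention; this is not the HiFrames API):

```python
def weighted_moving_average(xs, window):
    # Linear weights 1..window: the newest point in each window weighs most.
    w = list(range(1, window + 1))
    total = sum(w)
    return [
        sum(x * wi for x, wi in zip(xs[i - window + 1:i + 1], w)) / total
        for i in range(window - 1, len(xs))
    ]
```

Because each output depends on a window of neighbouring rows rather than on independent rows, such operations fit an array compiler naturally but map awkwardly onto relational map-reduce plans, which is the gap the abstract highlights.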
Machine Learning for E-mail Spam Filtering: Review, Techniques and Trends
We present a comprehensive review of the most effective content-based e-mail
spam filtering techniques. We focus primarily on Machine Learning-based spam
filters and their variants, reviewing the relevant ideas, efforts, their
effectiveness, and the current progress. The
initial exposition of the background examines the basics of e-mail spam
filtering, the evolving nature of spam, spammers playing cat-and-mouse with
e-mail service providers (ESPs), and the Machine Learning front in fighting
spam. We conclude by measuring the impact of Machine Learning-based filters and
explore the promising offshoots of the latest developments.
Comment: Journal, 27 pages
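A representative content-based filter of the kind such reviews cover is multinomial Naive Bayes over message tokens. A minimal sketch with add-one smoothing (illustrative only; the review surveys many further techniques):

```python
import math
from collections import Counter

def train(spam_docs, ham_docs):
    # Multinomial Naive Bayes: per-class word counts plus class priors.
    spam_c = Counter(w for d in spam_docs for w in d.split())
    ham_c = Counter(w for d in ham_docs for w in d.split())
    vocab = len(set(spam_c) | set(ham_c))
    return spam_c, ham_c, vocab, len(spam_docs), len(ham_docs)

def classify(message, model):
    spam_c, ham_c, vocab, n_spam, n_ham = model
    score_spam = math.log(n_spam / (n_spam + n_ham))  # class prior
    score_ham = math.log(n_ham / (n_spam + n_ham))
    for w in message.split():
        # Add-one smoothing keeps unseen words from zeroing a class out.
        score_spam += math.log((spam_c[w] + 1) / (sum(spam_c.values()) + vocab))
        score_ham += math.log((ham_c[w] + 1) / (sum(ham_c.values()) + vocab))
    return "spam" if score_spam > score_ham else "ham"
```

The cat-and-mouse dynamic the abstract mentions shows up here directly: spammers mutate tokens precisely to escape counts learned by models of this shape.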
Beneath (or beyond) the surface: Discovering voice-leading patterns with skip-grams
Recurrent voice-leading patterns like the Mi-Re-Do compound cadence (MRDCC)
rarely appear on the musical surface in complex polyphonic textures, so finding
these patterns using computational methods remains a tremendous challenge. The
present study extends the canonical n-gram approach by using skip-grams, which
include sub-sequences in an n-gram list if their constituent members occur
within a certain number of skips. We compiled four data sets of Western tonal
music consisting of symbolic encodings of the notated score and a recorded
performance, created a model pipeline for defining, counting, filtering, and
ranking skip-grams, and ranked the position of the MRDCC in every possible
model configuration. We found that the MRDCC receives a higher rank in the list
when the pipeline employs 5 skips, filters the list by excluding n-gram types
that do not reflect a genuine harmonic change between adjacent members, and
ranks the remaining types using a statistical association measure.
Comment: This is an original manuscript / preprint of an article published by
Taylor & Francis in the Journal of Mathematics and Music, available online:
https://doi.org/10.1080/17459737.2020.1785568. 26 pages, 8 figures, 3 tables
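The skip-gram extension described above can be sketched directly: a sub-sequence qualifies if its members fit within the n-gram length plus an allowed number of skips. A small illustration (the function name and the window convention are assumptions, not the authors' exact pipeline):

```python
from itertools import combinations

def skip_grams(sequence, n, max_skips):
    # All ordered n-item sub-sequences whose members fall within a window
    # of n + max_skips consecutive positions; max_skips = 0 recovers the
    # ordinary contiguous n-gram list.
    grams = []
    for idxs in combinations(range(len(sequence)), n):
        if idxs[-1] - idxs[0] <= n - 1 + max_skips:
            grams.append(tuple(sequence[i] for i in idxs))
    return grams
```

With max_skips = 5, as in the configuration the study found best, the members of a bigram may be separated by up to five intervening events, which is how a pattern like the MRDCC can surface from beneath a dense polyphonic texture.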