Bouma2 - A Quasi-Stateless, Tunable Multiple String-Match Algorithm
The Bouma2 algorithm attempts to challenge the prevalent "stateful" exact
string-match paradigms by suggesting a "quasi-stateless" approach. We claim
that using state-machines to solve the multiple exact string-match problem
introduces a hidden artificial constraint, namely the Consume-Order Dependency,
which results in unnecessary overhead. Bouma2 is not restricted in this sense;
we postulate that this allows memory efficiency and improved performance versus
its state-machine equivalents. The heart of the Bouma2 preprocessing stage is
formulated as a weighted Integer Linear Programming problem that can be tuned
for memory footprint and performance optimization. Specifically, this allows
Bouma2 to be input-sensitive, as tuning can be based on input characteristics.
Evaluating Bouma2 against the Aho-Corasick variant of the popular Snort
Intrusion Prevention System, we demonstrate double the throughput while using
about 10% of the memory.
Comment: 33 pages, 3 figures
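The "stateful" paradigm the abstract challenges is typified by Aho-Corasick, where a single automaton consumes the input strictly symbol by symbol (the Consume-Order Dependency the authors describe). As a point of reference, a minimal sketch of that classical state-machine baseline (this is the approach Bouma2 argues against, not Bouma2 itself):

```python
from collections import deque

def build_automaton(patterns):
    # Classic Aho-Corasick trie with goto, fail, and output links.
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(p)
    queue = deque(goto[0].values())  # BFS so fail links of parents exist first
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]  # inherit matches ending at the fail state
    return goto, fail, out

def search(text, patterns):
    # One state transition per consumed symbol: the consume-order-dependent
    # scan that a stateless scheme would avoid.
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for p in out[state]:
            hits.append((i - len(p) + 1, p))
    return hits
```

Every byte of input forces a serialized automaton step, which is exactly the overhead the abstract attributes to stateful matchers.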
Big Data and Cross-Document Coreference Resolution: Current State and Future Opportunities
Information Extraction (IE) is the task of automatically extracting
structured information from unstructured/semi-structured machine-readable
documents. Among various IE tasks, extracting actionable intelligence from an
ever-increasing amount of data depends critically upon Cross-Document
Coreference Resolution (CDCR) - the task of identifying entity mentions across
multiple documents that refer to the same underlying entity. Recently, document
datasets on the order of tera- to peta-bytes have raised many challenges for
performing effective CDCR, such as scaling to large numbers of mentions and the
limited representational power of existing models. The problem of analysing
such datasets is often labelled "big data". The aim of this paper is to provide
readers with an understanding of the central concepts, subtasks, and the
current state of the art in the CDCR process. We provide an assessment of
existing tools and techniques for CDCR subtasks and highlight big data
challenges in each of
them to help readers identify important and outstanding issues for further
investigation. Finally, we provide concluding remarks and discuss possible
directions for future work.
Literature Review Of Attribute Level And Structure Level Data Linkage Techniques
Data Linkage is an important step that can provide valuable insights for
evidence-based decision making, especially for crucial events. Performing
sensible queries across heterogeneous databases containing millions of records
is a complex task that requires a complete understanding of each contributing
database's schema to define the structure of its information. The key aim is to
approximate the structure and content of the induced data into a concise
synopsis in order to extract and link meaningful data-driven facts. We identify
four major research issues in Data Linkage: the costs associated with pair-wise
matching, record-matching overheads, restrictions on the semantic flow of
information, and single-order classification limitations. In this paper, we
give a literature review of research in Data Linkage. The purpose of this
review is to establish a basic understanding of Data Linkage and to discuss
the background in the Data Linkage research domain. Particularly, we focus on
the literature related to the recent advancements in Approximate Matching
algorithms at Attribute Level and Structure Level. Their efficiency,
functionality, and limitations are critically analysed, and open problems are
exposed.
Comment: 20 pages
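Attribute-level approximate matching of the kind this review surveys is commonly built on edit distance. A minimal sketch, assuming a plain Levenshtein threshold (the function names and the two-edit tolerance are illustrative, not taken from any surveyed system):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def attributes_match(v1, v2, max_edits=2):
    # Attribute-level approximate match: link two field values when their
    # edit distance stays within a small tolerance.
    return levenshtein(v1.lower(), v2.lower()) <= max_edits
```

Structure-level techniques then aggregate such per-attribute decisions across a whole record or schema rather than comparing fields in isolation.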
Massively parallel read mapping on GPUs with PEANUT
We present PEANUT (ParallEl AligNment UTility), a highly parallel GPU-based
read mapper with several distinguishing features, including a novel q-gram
index (called the q-group index) with small memory footprint built on-the-fly
over the reads, and the option to output either only the best hits or all hits of
a read. Designing the algorithm particularly for the GPU architecture, we were
able to reach maximum core occupancy for several key steps. Our benchmarks show
that PEANUT outperforms other state-of-the-art mappers in terms of speed and
sensitivity. The software is available at http://peanut.readthedocs.org
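A conventional q-gram index, which PEANUT's q-group index refines, can be sketched with a simple seed-and-vote scheme (this is the textbook structure, not PEANUT's GPU-specific on-the-fly index):

```python
from collections import defaultdict

def build_qgram_index(reference, q=3):
    # Map every length-q substring of the reference to its start positions.
    index = defaultdict(list)
    for i in range(len(reference) - q + 1):
        index[reference[i:i + q]].append(i)
    return index

def candidate_positions(read, index, q=3):
    # Each q-gram of the read votes for a putative alignment start
    # (reference position minus read offset); positions with many votes
    # become candidates for full alignment.
    votes = defaultdict(int)
    for i in range(len(read) - q + 1):
        for pos in index.get(read[i:i + q], []):
            votes[pos - i] += 1
    return votes
```

On a GPU, the voting loop is the natural target for massive parallelism, since every q-gram lookup is independent.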
Moore-Machine Filtering for Timed and Untimed Pattern Matching
Monitoring is an important body of techniques in runtime verification of
real-time, embedded, and cyber-physical systems. Mathematically, the monitoring
problem can be formalized as a pattern matching problem against a pattern
automaton. Motivated by the needs in embedded applications---especially the
limited channel capacity between a sensor unit and a processor that
monitors---we pursue the idea of filtering as preprocessing for monitoring.
Technically, for a given pattern automaton, we present a construction of a
Moore machine that works as a filter. The construction is automata-theoretic,
and we find the use of Moore machines particularly suited for embedded
applications, not only because their sequential operation is relatively cheap
but also because they are amenable to hardware acceleration by dedicated
circuits. We prove soundness (i.e., absence of lost matches), too. We work in
two settings: in the untimed one, a pattern is an NFA; in the timed one, a
pattern is a timed automaton. The extension of our untimed construction to the
timed setting is technically involved, but our experiments demonstrate its
practical benefits.
Comment: Accepted for presentation at EMSOFT 2018 and for publication in IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems
(TCAD) as part of the ESWEEK-TCAD special issue
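Independent of the paper's specific construction, the appeal of a Moore machine as a filter is that its output depends only on the current state, so each consumed symbol maps cheaply to a keep/drop decision. A toy illustration (the pattern, state names, and outputs are invented for this example, not the authors' automaton):

```python
class MooreMachine:
    # A Moore machine: output is a function of the current state alone,
    # so each consumed symbol yields exactly one output symbol.
    def __init__(self, transitions, outputs, start):
        self.transitions, self.outputs, self.state = transitions, outputs, start

    def step(self, symbol):
        self.state = self.transitions[(self.state, symbol)]
        return self.outputs[self.state]

# Toy filter for the pattern "ab" over alphabet {a, b, c}: the machine
# keeps a symbol (output 1) only while it can still extend a match,
# and signals "drop" (output 0) otherwise.
transitions = {
    ("q0", "a"): "qa", ("q0", "b"): "q0", ("q0", "c"): "q0",
    ("qa", "a"): "qa", ("qa", "b"): "qab", ("qa", "c"): "q0",
    ("qab", "a"): "qa", ("qab", "b"): "q0", ("qab", "c"): "q0",
}
outputs = {"q0": 0, "qa": 1, "qab": 1}
m = MooreMachine(transitions, outputs, "q0")
mask = [m.step(s) for s in "cabca"]  # 1 marks symbols worth forwarding
```

Because the transition table and the per-state output are both fixed lookups, such a machine maps directly onto a small dedicated circuit on the sensor side of a bandwidth-limited channel.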
Toward Faultless Content-Based Playlists Generation for Instrumentals
This study deals with content-based musical playlists generation focused on
Songs and Instrumentals. Automatic playlist generation relies on collaborative
filtering and autotagging algorithms. Autotagging can solve the cold start
issue and popularity bias that are critical in music recommender systems.
However, autotagging remains imperfect and cannot yet generate satisfying
music playlists. In this paper, we suggest improvements toward better
autotagging-generated playlists compared to the state of the art. To assess our
method, we focus on the Song and Instrumental tags. Song and Instrumental are
two objective and opposite tags that are under-studied compared to genres or
moods, which are subjective and multi-modal tags. In this paper, we consider an
industrial real-world musical database that is unevenly distributed between
Songs and Instrumentals and bigger than databases used in previous studies. We
set up three incremental experiments to enhance automatic playlist generation.
Our suggested approach generates an Instrumental playlist with up to three
times fewer false positives than cutting-edge methods. Moreover, we provide a
design of experiment framework to foster research on Songs and Instrumentals.
We give insight into how to further improve the quality of generated playlists
and how to extend our methods to other musical tags. Furthermore, we provide
the source code to guarantee reproducible research.
Comment: single-column, 20 pages, 3 figures, 6 tables
Malicious Behavior Detection using Windows Audit Logs
As antivirus and network intrusion detection systems have increasingly proven
insufficient to detect advanced threats, large security operations centers have
moved to deploy endpoint-based sensors that provide deeper visibility into
low-level events across their enterprises. Unfortunately, for many
organizations in government and industry, the installation, maintenance, and
resource requirements of these newer solutions pose barriers to adoption and
are perceived as risks to organizations' missions.
To mitigate this problem we investigated the utility of agentless detection
of malicious endpoint behavior, using only the standard built-in Windows audit
logging facility as our signal. We found that Windows audit logs, while
emitting manageably sized data streams on the endpoints, provide enough
information to allow robust detection of malicious behavior. Audit logs provide
an effective, low-cost alternative to deploying additional expensive
agent-based breach detection systems in many government and industrial
settings, and can be used to detect, in our tests, 83% of malware
samples with a 0.1% false positive rate. They can also supplement already
existing host signature-based antivirus solutions, like Kaspersky, Symantec,
and McAfee, detecting, in our testing environment, 78% of malware missed by
those antivirus systems.
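The reported operating point (83% detection at a 0.1% false-positive rate) corresponds to thresholding a detector's scores. A generic sketch of how such a threshold is chosen and evaluated, using purely synthetic scores (not the paper's data, features, or model):

```python
def detection_at_fpr(benign_scores, malware_scores, target_fpr=0.001):
    # Allow at most target_fpr of the benign set above the threshold,
    # then report the fraction of malware scoring above it.
    k = int(target_fpr * len(benign_scores))       # permitted false positives
    thresh = sorted(benign_scores, reverse=True)[k]
    detected = sum(s > thresh for s in malware_scores)
    return detected / len(malware_scores)
```

In practice the scores would come from a classifier trained on audit-log features; the thresholding step itself is model-agnostic.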
HiFrames: High Performance Data Frames in a Scripting Language
Data frames in scripting languages are essential abstractions for processing
structured data. However, existing data frame solutions are either not
distributed (e.g., Pandas in Python) and therefore have limited scalability, or
they are not tightly integrated with array computations (e.g., Spark SQL). This
paper proposes a novel compiler-based approach where we integrate data frames
into the High Performance Analytics Toolkit (HPAT) to build HiFrames. It
provides expressive and flexible data frame APIs which are tightly integrated
with array operations. HiFrames then automatically parallelizes and compiles
relational operations along with other array computations in end-to-end data
analytics programs, and generates efficient MPI/C++ code. We demonstrate that
HiFrames is significantly faster than alternatives such as Spark SQL on
clusters, without forcing the programmer to switch to embedded SQL for part of
the program. HiFrames is 3.6x to 70x faster than Spark SQL for basic relational
operations, and can be up to 20,000x faster for advanced analytics operations,
such as weighted moving averages (WMA), that the map-reduce paradigm cannot
handle effectively. HiFrames is also 5x faster than Spark SQL for TPCx-BB Q26
on 64 nodes of the Cori supercomputer.
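The weighted moving average (WMA) cited as a workload that map-reduce handles poorly is a rolling computation over a sliding window. A plain-Python sketch using linearly increasing weights (one common WMA convention; this is not the HiFrames API):

```python
def weighted_moving_average(xs, window):
    # Linear weights 1..window: the newest point in each window weighs most.
    w = list(range(1, window + 1))
    total = sum(w)
    return [
        sum(x * wi for x, wi in zip(xs[i - window + 1:i + 1], w)) / total
        for i in range(window - 1, len(xs))
    ]
```

Because each output depends on a window of neighbouring rows rather than on independent rows, such operations fit an array compiler naturally but map awkwardly onto relational map-reduce plans, which is the gap the abstract highlights.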
Machine Learning for E-mail Spam Filtering: Review, Techniques and Trends
We present a comprehensive review of the most effective content-based e-mail
spam filtering techniques. We focus primarily on Machine Learning-based spam
filters and their variants, reviewing the relevant ideas, efforts, their
effectiveness, and the current progress. The
initial exposition of the background examines the basics of e-mail spam
filtering, the evolving nature of spam, spammers playing cat-and-mouse with
e-mail service providers (ESPs), and the Machine Learning front in fighting
spam. We conclude by measuring the impact of Machine Learning-based filters and
explore the promising offshoots of the latest developments.
Comment: Journal, 27 pages
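A representative content-based filter of the kind such reviews cover is multinomial Naive Bayes over message tokens. A minimal sketch with add-one smoothing (illustrative only; the review surveys many further techniques):

```python
import math
from collections import Counter

def train(spam_docs, ham_docs):
    # Multinomial Naive Bayes: per-class word counts plus class priors.
    spam_c = Counter(w for d in spam_docs for w in d.split())
    ham_c = Counter(w for d in ham_docs for w in d.split())
    vocab = len(set(spam_c) | set(ham_c))
    return spam_c, ham_c, vocab, len(spam_docs), len(ham_docs)

def classify(message, model):
    spam_c, ham_c, vocab, n_spam, n_ham = model
    score_spam = math.log(n_spam / (n_spam + n_ham))  # class prior
    score_ham = math.log(n_ham / (n_spam + n_ham))
    for w in message.split():
        # Add-one smoothing keeps unseen words from zeroing a class out.
        score_spam += math.log((spam_c[w] + 1) / (sum(spam_c.values()) + vocab))
        score_ham += math.log((ham_c[w] + 1) / (sum(ham_c.values()) + vocab))
    return "spam" if score_spam > score_ham else "ham"
```

The cat-and-mouse dynamic the abstract mentions shows up here directly: spammers mutate tokens precisely to escape counts learned by models of this shape.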
Beneath (or beyond) the surface: Discovering voice-leading patterns with skip-grams
Recurrent voice-leading patterns like the Mi-Re-Do compound cadence (MRDCC)
rarely appear on the musical surface in complex polyphonic textures, so finding
these patterns using computational methods remains a tremendous challenge. The
present study extends the canonical n-gram approach by using skip-grams, which
include sub-sequences in an n-gram list if their constituent members occur
within a certain number of skips. We compiled four data sets of Western tonal
music consisting of symbolic encodings of the notated score and a recorded
performance, created a model pipeline for defining, counting, filtering, and
ranking skip-grams, and ranked the position of the MRDCC in every possible
model configuration. We found that the MRDCC receives a higher rank in the list
when the pipeline employs 5 skips, filters the list by excluding n-gram types
that do not reflect a genuine harmonic change between adjacent members, and
ranks the remaining types using a statistical association measure.
Comment: This is an original manuscript / preprint of an article published by
Taylor & Francis in the Journal of Mathematics and Music, available online:
https://doi.org/10.1080/17459737.2020.1785568. 26 pages, 8 figures, 3 tables
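The skip-gram extension described above can be sketched directly: a sub-sequence qualifies if its members fit within the n-gram length plus an allowed number of skips. A small illustration (the function name and the window convention are assumptions, not the authors' exact pipeline):

```python
from itertools import combinations

def skip_grams(sequence, n, max_skips):
    # All ordered n-item sub-sequences whose members fall within a window
    # of n + max_skips consecutive positions; max_skips = 0 recovers the
    # ordinary contiguous n-gram list.
    grams = []
    for idxs in combinations(range(len(sequence)), n):
        if idxs[-1] - idxs[0] <= n - 1 + max_skips:
            grams.append(tuple(sequence[i] for i in idxs))
    return grams
```

With max_skips = 5, as in the configuration the study found best, the members of a bigram may be separated by up to five intervening events, which is how a pattern like the MRDCC can surface from beneath a dense polyphonic texture.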