
23 research outputs found

Lempel-Ziv Networks

Authors: Alam Mohammad Mahmudul; Holt James; Hurwitz John; Oates Tim; Raff Edward; Saul Rebecca
Publication date: 23/11/2022

Sequence processing has long been a central area of machine learning research. Recurrent neural nets have been successful in processing sequences for a number of tasks; however, they are known to be both ineffective and computationally expensive when applied to very long sequences. Compression-based methods have demonstrated more robustness when processing such sequences -- in particular, an approach pairing the Lempel-Ziv Jaccard Distance (LZJD) with the k-Nearest Neighbor algorithm has shown promise on long sequence problems (up to $T=200,000,000$ steps) involving malware classification. Unfortunately, use of LZJD is limited to discrete domains. To extend the benefits of LZJD to a continuous domain, we investigate the effectiveness of a deep-learning analog of the algorithm, the Lempel-Ziv Network. While we achieve successful proof of concept, we are unable to improve meaningfully on the performance of a standard LSTM across a variety of datasets and sequence processing tasks. In addition to presenting this negative result, our work highlights the problem of sub-par baseline tuning in newer research areas.

Comment: I Can't Believe It's Not Better Workshop at NeurIPS 202
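
The LZJD-plus-nearest-neighbor pairing mentioned in this abstract can be illustrated with a minimal sketch. It assumes the textbook form of the idea: build a Lempel-Ziv style set of previously unseen substrings for each byte string, then take one minus the Jaccard similarity of the two sets. The function names (`lz_set`, `lzjd`, `nearest_label`) are hypothetical, and the sketch uses exact sets rather than the hashed approximations a production implementation would likely need for very long inputs.

```python
def lz_set(data: bytes) -> set:
    """Collect the substrings a Lempel-Ziv style parse adds to its dictionary."""
    seen = set()
    start = 0
    for end in range(1, len(data) + 1):
        chunk = data[start:end]
        if chunk not in seen:
            # First time this substring appears: record it and start a new one.
            seen.add(chunk)
            start = end
    return seen


def lzjd(a: bytes, b: bytes) -> float:
    """One minus the Jaccard similarity of the two LZ sets (0.0 means identical sets)."""
    set_a, set_b = lz_set(a), lz_set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty inputs are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def nearest_label(query: bytes, labelled: list) -> str:
    """1-nearest-neighbor: return the label of the closest (sequence, label) pair."""
    return min(labelled, key=lambda pair: lzjd(query, pair[0]))[1]


if __name__ == "__main__":
    train = [(b"abababab", "x"), (b"cdcdcdcd", "y")]
    print(lzjd(b"ababab", b"abababab"))
    print(nearest_label(b"ababab", train))
```

Because the distance is defined on sets of discrete substrings, this sketch also shows why the abstract describes LZJD as limited to discrete domains.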

arXiv.org e-Print Archive

Security Enhancing Technologies for Cloud-of-Clouds

Author: João Miguel Maia Soares de Resende
Publication date: 15/07/2021

Repositório Aberto da Universidade do Porto

An investigation of music analysis by the application of grammar-based compressors

Authors: Humphreys David; Jones Andrew; Marshall David; Sidorov Kirill
Publication venue: Informa UK Limited
Publication date: 01/01/2021

Many studies have presented computational models of musical structure, as an important aspect of musicological analysis. However, the use of grammar-based compressors to automatically recover such information is a relatively new and promising technique. We investigate their performance extensively using a collection of nearly 8000 scores, on tasks including error detection, classification, and segmentation, and compare this with a range of more traditional compressors. Further, we detail a novel method for locating transcription errors based on grammar compression. Despite its lack of domain knowledge, we conclude that grammar-based compression offers competitive performance when solving a variety of musicological tasks.

Online Research @ Cardiff

Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection

Authors: Anderson Hyrum S.; Filar Bobby; Fleshman William; McLean Mark; Raff Edward; Zak Richard
Publication date: 16/12/2020

Recent works within machine learning have been tackling inputs of ever-increasing size, with cybersecurity presenting sequence classification problems of particularly extreme lengths. In the case of Windows executable malware detection, inputs may exceed 100 MB, which corresponds to a time series with $T=100,000,000$ steps. To date, the closest approach to handling such a task is MalConv, a convolutional neural network capable of processing up to $T=2,000,000$ steps. The $\mathcal{O}(T)$ memory of CNNs has prevented further application of CNNs to malware. In this work, we develop a new approach to temporal max pooling that makes the required memory invariant to the sequence length $T$. This makes MalConv $116\times$ more memory efficient, and up to $25.8\times$ faster to train on its original dataset, while removing the input length restrictions to MalConv. We re-invest these gains into improving the MalConv architecture by developing a new Global Channel Gating design, giving us an attention mechanism capable of learning feature interactions across 100 million time steps in an efficient manner, a capability lacked by the original MalConv CNN. Our implementation can be found at https://github.com/NeuromorphicComputationResearchProgram/MalConv2

Comment: To appear in AAAI 202
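
The abstract does not spell out how the pooling is made memory-invariant, so the following is only a toy illustration of the underlying observation: a global max over time can be computed with a running maximum over fixed-size chunks, so the activations held at any moment do not grow with $T$. It is not the authors' implementation and does not address training or gradients. All names (`window_scores`, `full_global_max`, `streaming_global_max`) and the tiny hand-written kernels are hypothetical.

```python
def window_scores(window, kernels):
    """Dot product of one input window with every kernel (one score per channel)."""
    return [sum(k * v for k, v in zip(kernel, window)) for kernel in kernels]


def full_global_max(x, kernels):
    """Reference version: materializes every activation, so memory grows with len(x)."""
    width = len(kernels[0])
    activations = [window_scores(x[i:i + width], kernels)
                   for i in range(len(x) - width + 1)]
    return [max(channel) for channel in zip(*activations)]


def streaming_global_max(x, kernels, chunk=1024):
    """Chunked version: keeps only one running maximum per channel.

    Assumes len(x) >= kernel width. Chunks overlap by (width - 1) samples so
    that no window straddling a chunk boundary is skipped.
    """
    width = len(kernels[0])
    best = [float("-inf")] * len(kernels)
    for start in range(0, len(x) - width + 1, chunk):
        piece = x[start:start + chunk + width - 1]
        for i in range(len(piece) - width + 1):
            scores = window_scores(piece[i:i + width], kernels)
            best = [max(b, s) for b, s in zip(best, scores)]
    return best


if __name__ == "__main__":
    signal = [((i * 37) % 101) / 10 - 5 for i in range(2000)]
    filters = [[1.0, -1.0, 0.5], [0.25, 0.5, -2.0]]
    print(streaming_global_max(signal, filters, chunk=128))
    print(full_global_max(signal, filters))
```

In this sketch the input list itself still sits in memory; in practice the chunks would presumably be read from the file as needed.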

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Building K-Anonymous User Cohorts with Consecutive Consistent Weighted Sampling (CCWS)

Authors: Li Ping; Li Xiaoyun; Zhao Weijie; Zheng Xinyi
Publication date: 26/04/2023

To retrieve personalized campaigns and creatives while protecting user privacy, digital advertising is shifting from member-based identity to cohort-based identity. Under such an identity regime, an accurate and efficient cohort building algorithm is desired to group users with similar characteristics. In this paper, we propose a scalable $K$-anonymous cohort building algorithm called consecutive consistent weighted sampling (CCWS). The proposed method combines the spirit of the ($p$-powered) consistent weighted sampling and hierarchical clustering, so that the $K$-anonymity is ensured by enforcing a lower bound on the size of cohorts. Evaluations on a LinkedIn dataset consisting of $>70$M users and ads campaigns demonstrate that CCWS achieves substantial improvements over several hashing-based methods including sign random projections (SignRP), minwise hashing (MinHash), as well as the vanilla CWS.

arXiv.org e-Print Archive

Engineering a Simplified 0-Bit Consistent Weighted Sampling

Authors: Chum O.; Raff Edward; Shrivastava Anshumali; Yang Dingqi
Publication venue: Association for Computing Machinery (ACM)
Publication date: 23/10/2018

The Min-Hashing approach to sketching has become an important tool in data analysis, information retrieval, and classification. To apply it to real-valued datasets, the ICWS algorithm has become a seminal approach that is widely used, and provides state-of-the-art performance for this problem space. However, ICWS suffers a computational burden as the sketch size K increases. We develop a new Simplified approach to the ICWS algorithm that enables us to obtain over 20x speedups compared to the standard algorithm. The veracity of our approach is demonstrated empirically on multiple datasets and scenarios, showing that our new Simplified CWS obtains the same quality of results while being an order of magnitude faster.
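
For orientation, here is a minimal sketch of the standard ICWS sampling step that this abstract takes as its baseline, written from the commonly cited formulation (Gamma(2, 1) draws for `r` and `c`, a uniform draw for `beta`). It is not the Simplified variant the paper proposes, which the abstract does not describe. The function names, the dictionary-based input format, and the string seeding scheme used to make the random draws repeatable per (sample index, feature) are all assumptions made for illustration.

```python
import math
import random


def icws_sample(weights: dict, index: int):
    """Draw one consistent weighted sample (feature, t) from positive weights."""
    best = None
    for feature, w in weights.items():
        if w <= 0:
            continue
        # Same (index, feature) always yields the same random draws; this is
        # what makes the sampling "consistent" across different inputs.
        rng = random.Random(f"{index}:{feature!r}")
        r = rng.gammavariate(2.0, 1.0)
        c = rng.gammavariate(2.0, 1.0)
        beta = rng.random()
        t = math.floor(math.log(w) / r + beta)
        y = math.exp(r * (t - beta))
        a = c / (y * math.exp(r))
        if best is None or a < best[0]:
            best = (a, feature, t)
    return None if best is None else (best[1], best[2])


def icws_sketch(weights: dict, k: int) -> list:
    """Sketch of size k: one (feature, t) pair per sample index."""
    return [icws_sample(weights, j) for j in range(k)]


def estimate_similarity(sketch_a: list, sketch_b: list) -> float:
    """Fraction of positions where the two sketches agree."""
    hits = sum(1 for a, b in zip(sketch_a, sketch_b) if a is not None and a == b)
    return hits / len(sketch_a)


def generalized_jaccard(u: dict, v: dict) -> float:
    """Exact sum-of-mins over sum-of-maxes, for comparison with the estimate."""
    keys = set(u) | set(v)
    den = sum(max(u.get(f, 0.0), v.get(f, 0.0)) for f in keys)
    num = sum(min(u.get(f, 0.0), v.get(f, 0.0)) for f in keys)
    return num / den if den else 0.0


if __name__ == "__main__":
    u = {"a": 1.0, "b": 2.0, "c": 3.0}
    v = {"a": 2.0, "b": 2.0, "d": 1.0}
    print(generalized_jaccard(u, v))
    print(estimate_similarity(icws_sketch(u, 512), icws_sketch(v, 512)))
```

The per-feature loop runs once for every one of the K samples, which is the cost that grows with the sketch size K as the abstract notes.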

arXiv.org e-Print Archive

Crossref

Differentially Private One Permutation Hashing and Bin-wise Consistent Weighted Sampling

Authors: Li Ping; Li Xiaoyun
Publication date: 13/06/2023

Minwise hashing (MinHash) is a standard algorithm widely used in industry for large-scale search and learning applications with the binary (0/1) Jaccard similarity. One common use of MinHash is for processing massive n-gram text representations so that practitioners do not have to materialize the original data (which would be prohibitive). Another popular use of MinHash is for building hash tables to enable sub-linear time approximate near neighbor (ANN) search. MinHash has also been used as a tool for building large-scale machine learning systems. The standard implementation of MinHash requires applying $K$ random permutations. In comparison, the method of one permutation hashing (OPH) is an efficient alternative to MinHash that splits the data vectors into $K$ bins and generates hash values within each bin. OPH is substantially more efficient and also more convenient to use. In this paper, we combine differential privacy (DP) with OPH (as well as MinHash) to propose the DP-OPH framework with three variants: DP-OPH-fix, DP-OPH-re and DP-OPH-rand, depending on which densification strategy is adopted to deal with empty bins in OPH. A detailed roadmap to the algorithm design is presented along with the privacy analysis. An analytical comparison of our proposed DP-OPH methods with the DP minwise hashing (DP-MH) is provided to justify the advantage of DP-OPH. Experiments on similarity search confirm the merits of DP-OPH, and guide the choice of the proper variant in different practical scenarios. Our technique is also extended to bin-wise consistent weighted sampling (BCWS) to develop a new DP algorithm called DP-BCWS for non-binary data. Experiments on classification tasks demonstrate that DP-BCWS is able to achieve excellent utility at around $\epsilon = 5\sim 10$, where $\epsilon$ is the standard parameter in the language of $(\epsilon, \delta)$-DP.
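
The contrast this abstract draws between MinHash with $K$ permutations and one permutation hashing with $K$ bins can be sketched as follows. This is a non-private illustration only: it omits the densification strategies and the differential privacy mechanisms that the paper is about. Salted hash functions stand in for random permutations, the bin is chosen as the hash value modulo $K$, and the function names are hypothetical; these are all simplifying assumptions.

```python
import hashlib


def _hash(item, salt: int) -> int:
    """64-bit salted hash, used here as a stand-in for a random permutation."""
    digest = hashlib.blake2b(repr(item).encode(), digest_size=8,
                             salt=salt.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(items, k: int) -> list:
    """Classic MinHash: k hash functions, one minimum each (items must be non-empty)."""
    return [min(_hash(x, j) for x in items) for j in range(k)]


def minhash_estimate(a: list, b: list) -> float:
    """Fraction of positions where the two minima agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def oph_sketch(items, k: int) -> list:
    """One permutation hashing: a single hash per item, minimum kept per bin."""
    bins = [None] * k
    for x in items:
        h = _hash(x, 0)
        b, value = h % k, h // k  # assumed bin assignment and within-bin value
        if bins[b] is None or value < bins[b]:
            bins[b] = value
    return bins


def oph_estimate(a: list, b: list) -> float:
    """Matching bins divided by the bins that are not empty in both sketches."""
    matched = sum(1 for x, y in zip(a, b) if x is not None and x == y)
    jointly_empty = sum(1 for x, y in zip(a, b) if x is None and y is None)
    usable = len(a) - jointly_empty
    return matched / usable if usable else 0.0


if __name__ == "__main__":
    s1, s2 = set(range(0, 300)), set(range(150, 450))  # true Jaccard = 1/3
    print(minhash_estimate(minhash_sketch(s1, 128), minhash_sketch(s2, 128)))
    print(oph_estimate(oph_sketch(s1, 128), oph_sketch(s2, 128)))
```

MinHash hashes every item $K$ times, whereas the OPH sketch hashes each item once, which is the efficiency gap the abstract refers to. The empty bins that OPH can leave behind are what the densification variants named in the abstract are designed to handle.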

arXiv.org e-Print Archive