20,320 research outputs found
GOGGLES: Automatic Image Labeling with Affinity Coding
Generating large labeled training data is becoming the biggest bottleneck in
building and deploying supervised machine learning models. Recently, the data
programming paradigm has been proposed to reduce the human cost in labeling
training data. However, data programming relies on designing labeling functions
which still requires significant domain expertise. Also, it is prohibitively
difficult to write labeling functions for image datasets as it is hard to
express domain knowledge using raw features for images (pixels).
We propose affinity coding, a new domain-agnostic paradigm for automated
training data labeling. The core premise of affinity coding is that the
affinity scores of instance pairs belonging to the same class on average should
be higher than those of pairs belonging to different classes, according to some
affinity functions. We build the GOGGLES system that implements affinity coding
for labeling image datasets by designing a novel set of reusable affinity
functions for images, and propose a novel hierarchical generative model for
class inference using a small development set.
We compare GOGGLES with existing data programming systems on 5 image labeling
tasks from diverse domains. GOGGLES achieves labeling accuracies ranging from a
minimum of 71% to a maximum of 98% without requiring any extensive human
annotation. In terms of end-to-end performance, GOGGLES outperforms the
state-of-the-art data programming system Snuba by 21% and a state-of-the-art
few-shot learning technique by 5%, and is only 7% away from the fully
supervised upper bound.Comment: Published at 2020 ACM SIGMOD International Conference on Management
of Dat
A multilabel fuzzy relevance clustering system for malware attack attribution in the edge layer of cyber-physical networks
The rapid increase in the number of malicious programs has made malware forensics a daunting task and caused users’ systems to become in danger. Timely identification of malware characteristics including its origin and the malware sample family would significantly limit the potential damage of malware. This is a more profound risk in Cyber-Physical Systems (CPSs), where a malware attack may cause significant physical damage to the infrastructure. Due to limited on-device available memory and processing power in CPS devices, most of the efforts for protecting CPS networks are focused on the edge layer, where the majority of security mechanisms are deployed.
Since the majority of advanced and sophisticated malware programs are combining features from different families, these malicious programs are not similar enough to any existing malware family and easily evade binary classifier detection. Therefore, in this article, we propose a novel multilabel fuzzy clustering system for malware attack attribution. Our system is deployed on the edge layer to provide insight into applicable malware threats to the CPS network. We leverage static analysis by utilizing Opcode frequencies as the feature space to classify malware families.
We observed that a multilabel classifier does not classify a part of samples. We named this problem the instance coverage problem. To overcome this problem, we developed an ensemble-based multilabel fuzzy classification method to suggest the relevance of a malware instance to the stricken families. This classifier identified samples of VirusShare, RansomwareTracker, and BIG2015 with an accuracy of 94.66%, 94.26%, and 97.56%, respectively
- …