This work presents an evaluation of six prominent commercial endpoint malware
detectors, a network malware detector, and a file-conviction algorithm from a
cyber technology vendor. The evaluation was administered as the first of the
Artificial Intelligence Applications to Autonomous Cybersecurity (AI ATAC)
prize challenges, funded by and completed in service of the US Navy. The
experiment employed 100K files (50% benign, 50% malicious) with a stratified
distribution of file types, including ~1K zero-day program executables
(increasing experiment size by two orders of magnitude over previous work). We
present an evaluation process that delivers each file to a fresh virtual machine
running the detection technology, waits 90 s to allow static detection, then
executes the file and waits another period for dynamic detection; this
yields observational data of greater fidelity than previous experiments, in
particular resource and time-to-detection statistics. To execute all 800K
trials (100K files × 8 tools), a software framework was designed to
choreograph the experiment into a completely automated, time-synced, and
reproducible workflow with substantial parallelization. A cost-benefit model
was configured to integrate the tools' recall, precision, time to detection,
and resource requirements into a single comparable quantity by simulating costs
of use. This provides a ranking methodology for cyber competitions and a lens
through which to reason about the varied statistical viewpoints of the results.
These statistical and cost-model results provide insights into the state of
commercial malware detection.
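
To illustrate how detection outcomes, time to detection, and resource
requirements might be folded into a single comparable quantity, the following
is a minimal sketch under stated assumptions. It is not the cost-benefit model
used in the evaluation: the linear time-to-detection penalty, the cost
parameters (attack_cost_max, seconds_to_max, triage_cost), and all class and
function names are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToolStats:
    """Per-tool summary of an experiment (hypothetical structure)."""
    name: str
    true_positives: int         # malicious files correctly detected
    false_positives: int        # benign files wrongly flagged
    false_negatives: int        # malicious files missed
    mean_detect_seconds: float  # average time to detection for true positives
    resource_cost: float        # assumed cost of compute, memory, licensing


def recall(t: ToolStats) -> float:
    """Fraction of malicious files that were detected."""
    total = t.true_positives + t.false_negatives
    return t.true_positives / total if total else 0.0


def precision(t: ToolStats) -> float:
    """Fraction of alerts that were actually malicious."""
    total = t.true_positives + t.false_positives
    return t.true_positives / total if total else 0.0


def simulated_cost(
    t: ToolStats,
    attack_cost_max: float = 1000.0,  # assumed cost of an undetected attack
    seconds_to_max: float = 3600.0,   # assumed time at which damage peaks
    triage_cost: float = 50.0,        # assumed cost of handling a false alert
) -> float:
    """Estimate the total cost of using a tool; lower is better."""
    # A missed malicious file incurs the full attack cost.
    missed = t.false_negatives * attack_cost_max
    # A detected file incurs damage that grows linearly with detection time
    # (an illustrative assumption), capped at the full attack cost.
    fraction = min(t.mean_detect_seconds / seconds_to_max, 1.0)
    late = t.true_positives * attack_cost_max * fraction
    # Each false alert costs analyst time.
    false_alerts = t.false_positives * triage_cost
    return missed + late + false_alerts + t.resource_cost


def rank_tools(tools: list[ToolStats]) -> list[ToolStats]:
    """Order tools from lowest to highest simulated cost."""
    return sorted(tools, key=simulated_cost)


if __name__ == "__main__":
    # Made-up numbers, not results from the evaluation.
    tools = [
        ToolStats("tool_a", 90, 5, 10, 36.0, 100.0),
        ToolStats("tool_b", 99, 40, 1, 360.0, 500.0),
    ]
    for t in rank_tools(tools):
        print(t.name, round(recall(t), 3), round(precision(t), 3),
              simulated_cost(t))
```

In this sketch a tool with higher recall can still rank lower if it detects
slowly or raises many false alerts, which is the kind of trade-off a single
cost figure is meant to expose.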