English Broadcast News Speech Recognition by Humans and Machines

Author: Dibert Tom
Huang Yinghui
Kaiser-Schatzlein Alice
Kingsbury Brian
Kurata Gakuto
Picheny Michael
Samko Bern
Saon George
Suzuki Masayuki
Thomas Samuel
Tuske Zoltan
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 30/04/2019

With recent advances in deep learning, considerable attention has been given to achieving automatic speech recognition performance close to human performance on tasks like conversational telephone speech (CTS) recognition. In this paper we evaluate the usefulness of these proposed techniques on broadcast news (BN), a similarly challenging task. We also perform a set of recognition measurements to understand how close the achieved automatic speech recognition results are to human performance on this task. On two publicly available BN test sets, DEV04F and RT04, our speech recognition system, which uses LSTM and residual network based acoustic models with a combination of n-gram and neural network language models, achieves word error rates of 6.5% and 5.9%, respectively. By achieving new performance milestones on these test sets, our experiments show that techniques developed on other related tasks, like CTS, can be transferred to achieve similar performance. In contrast, the best measured human word error rate on these test sets is much lower, at 3.6% and 2.8% respectively, indicating that there is still room for new techniques and improvements in this space to reach human performance levels.

Comment: © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other work.
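The word error rate figures in the abstract above are conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the recogniser's hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition only. It is not the authors' scoring pipeline, the function name `word_error_rate` is hypothetical, and it assumes whitespace tokenisation with no text normalisation, which real evaluations typically add.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly;
    no case folding or text normalisation is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For instance, a hypothesis that drops one word from a six-word reference scores 1/6, roughly 16.7%. Because insertions count as errors, the value can exceed 100% when the hypothesis is much longer than the reference.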

arXiv.org e-Print Archive

Crossref

JABBIC Lookups: A Backend Telemetry-Based System for Malware Triage

Author: Bordeanu OC
Davies T
Shen Y
Stringhini G
Publication venue: 'Springer Fachmedien Wiesbaden GmbH'
Publication date: 04/11/2021

In this paper, we propose JABBIC lookups, a telemetry-based system for malware triage at the interface between proprietary reputation score systems and malware analysts. JABBIC uses file download telemetry collected from client protection solutions installed on end-hosts to determine the threat level of an unknown file based on telemetry data associated with files already known to be malign. We apply word embeddings, together with semantic and relational similarities, to triage potentially malign files, following the intuition that, while single elements in a malware download might change over time, their context, defined as the semantic and relational properties between the different elements in a malware delivery system (e.g., servers, autonomous systems, files), does not change as fast. To this end, we show that JABBIC can leverage file download telemetry to allow security vendors to manage the collection and analysis of unknown files from remote end-hosts for timely processing by more sophisticated malware analysis systems. We test and evaluate JABBIC lookups with 33M download events collected during October 2015. We show that 85.83% of the files triaged with JABBIC lookups are part of the same malware family as their past counterpart files. We also show that, if used with proprietary reputation score systems, JABBIC can triage as malicious 55.1% of files before they are detected by VirusTotal, preceding this detection by over 20 days.

UCL Discovery

Information Retrieval with Finnish Case Law Embeddings

Author: Sarsa Sami
Publication venue: Helsingfors universitet
Publication date: 01/01/2019

This work studies the capability of five text vectorisation models to embed Finnish case law texts in vector space for inter-textual similarity computation. The embeddings and their computed similarities are used to create a Finnish case law retrieval system that allows effective querying with full documents. A working web application is presented as part of the work. The case law data for the work is provided by the Finnish Ministry of Justice, and the studied models are TF-IDF, LDA, Word2Vec, Doc2Vec and Doc2VecC.
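To make the idea of querying with a full document concrete, the sketch below ranks a small corpus against a query document using TF-IDF vectors and cosine similarity, the simplest of the five models named in the abstract. It is an illustration under stated assumptions, not the thesis's implementation: the function names (`tfidf_vectors`, `cosine`, `rank_by_similarity`) are hypothetical, tokenisation is plain lowercased whitespace splitting, and the smoothed IDF formula is one common variant. A real Finnish system would likely need lemmatisation or similar preprocessing, which is omitted here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency (an assumed, commonly used variant).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0
           for term, count in df.items()}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append({term: (count / len(tokens)) * idf[term]
                        for term, count in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_doc, corpus):
    """Return corpus indices ordered from most to least similar to query_doc."""
    # The query is vectorised together with the corpus so that all
    # vectors share the same IDF weights.
    vectors = tfidf_vectors(list(corpus) + [query_doc])
    query_vec = vectors[-1]
    scores = [(cosine(query_vec, vec), idx)
              for idx, vec in enumerate(vectors[:-1])]
    return [idx for _, idx in sorted(scores, key=lambda s: (-s[0], s[1]))]
```

Given a hypothetical three-document corpus in which only the first document shares vocabulary with the query, `rank_by_similarity` places index 0 first. Dense models such as Doc2Vec would replace `tfidf_vectors` with learned embeddings while keeping the same similarity-and-rank step.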

Helsingin yliopiston digitaalinen arkisto

Scalable Emulation of Heterogeneous Systems

Author: Garcia Cota Emilio
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2019

The breakdown of Dennard's transistor scaling has driven computing systems toward application-specific accelerators, which can provide orders-of-magnitude improvements in performance and energy efficiency over general-purpose processors. To enable the radical departures from conventional approaches that heterogeneous systems entail, research infrastructure must be able to model processors, memory and accelerators, as well as system-level changes, such as operating system or instruction set architecture (ISA) innovations, that might be needed to realize the accelerators' potential. Unfortunately, existing simulation tools that can support such system-level research are limited by the lack of fast, scalable machine emulators to drive execution. To fill this need, in this dissertation we first present a novel machine emulator design based on dynamic binary translation that makes the following improvements over the state of the art: it scales on multicore hosts while remaining memory efficient, correctly handles cross-ISA differences in atomic instruction semantics, leverages the host floating point (FP) unit to speed up FP emulation without sacrificing correctness, and can be efficiently instrumented to drive, among other possible uses, the execution of a full-system, cross-ISA simulator with support for accelerators. We then demonstrate the utility of machine emulation for studying heterogeneous systems by leveraging it to make two additional contributions. First, we quantify the trade-offs in different coupling models for on-chip accelerators. Second, we present a technique to reuse the private memories of on-chip accelerators when they are otherwise inactive to expand the system's last-level cache, thereby reducing the opportunity cost of the accelerators' integration.

Columbia University Academic Commons