140 research outputs found

The Use of Spectroscopy Handheld Tools in Brain Tumor Surgery: Current Evidence and Techniques

Author: Constantinos G. Hadjipanayis
Nikita Lakomkin
Publication venue: 'Frontiers Media SA'
Publication date: 01/05/2019

The fundamental principle in the operative treatment of brain tumors is to achieve maximal safe resection in order to improve postoperative outcomes. At present, challenges in visualizing microscopic disease and residual tumor remain an impediment to complete tumor removal. Spectroscopic tools have the theoretical advantage of accurate tissue identification, coupled with the potential for manual intraoperative adjustments to improve visualization of remaining tumor tissue that would otherwise be difficult to detect. The current evidence and techniques for handheld spectroscopic tools in surgical neuro-oncology are explored here.

Directory of Open Access Journals

Egocentric Audio-Visual Noise Suppression

Author: He Weipeng
Kalgaonkar Kaustubh
Lakomkin Egor
Lin Ju
Liu Yang
Sharma Roshan
Publication date: 07/11/2022

This paper studies audio-visual suppression for egocentric videos, where the speaker is not captured in the video. Instead, potential noise sources are visible on screen with the camera emulating the off-screen speaker's view of the outside world. This setting is different from prior work in audio-visual speech enhancement that relies on lip and facial visuals. In this paper, we first demonstrate that egocentric visual information is helpful for noise suppression. We compare object recognition and action classification based visual feature extractors, and investigate methods to align audio and visual representations. Then, we examine different fusion strategies for the aligned features, and locations within the noise suppression model to incorporate visual information. Experiments demonstrate that visual features are most helpful when used to generate additive correction masks. Finally, in order to ensure that the visual features are discriminative with respect to different noise types, we introduce a multi-task learning framework that jointly optimizes audio-visual noise suppression and video-based acoustic event detection. This proposed multi-task framework outperforms the audio-only baseline on all metrics, including a 0.16 PESQ improvement. Extensive ablations reveal the improved performance of the proposed model with multiple active distractors, over all noise types and across different SNRs. Comment: Under Review at ICASSP 202

arXiv.org e-Print Archive

End-to-End Speech Recognition Contextualization with Large Language Models

Author: Fathullah Yassir
Fuegen Christian
Kalinli Ozlem
Lakomkin Egor
Seltzer Michael L.
Wu Chunyang
Publication date: 19/09/2023

In recent years, Large Language Models (LLMs) have garnered significant attention from the research community due to their exceptional performance and generalization capabilities. In this paper, we introduce a novel method for contextualizing speech recognition models incorporating LLMs. Our approach casts speech recognition as a mixed-modal language modeling task based on a pretrained LLM. We provide audio features, along with optional text tokens for context, to train the system to complete transcriptions in a decoder-only fashion. As a result, the system is implicitly incentivized to learn how to leverage unstructured contextual information during training. Our empirical results demonstrate a significant improvement in performance, with a 6% WER reduction when additional textual context is provided. Moreover, we find that our method performs competitively and improves by 7.5% WER overall and 17% WER on rare words against a baseline contextualized RNN-T system that has been trained on a speech dataset more than twenty-five times larger. Overall, we demonstrate that by only adding a handful of trainable parameters via adapters, we can unlock contextualized speech recognition capability for the pretrained LLM while keeping the same text-only input functionality.

arXiv.org e-Print Archive