Search CORE

56 research outputs found

Uformer: A Unet based dilated complex & real dual-path conformer network for simultaneous speech enhancement and dereverberation

Author: Fu Yihui
Jv Yukai
Li Jingdong
Liu Yun
Luo Dawei
Lv Shubo
Xie Lei
Publication venue
Publication date: 04/05/2022
Field of study

Complex spectrum and magnitude are considered as two major features of speech enhancement and dereverberation. Traditional approaches always treat these two features separately, ignoring their underlying relationship. In this paper, we propose Uformer, a Unet based dilated complex & real dual-path conformer network in both complex and magnitude domain for simultaneous speech enhancement and dereverberation. We exploit time attention (TA) and dilated convolution (DC) to leverage local and global contextual information and frequency attention (FA) to model dimensional information. These three sub-modules contained in the proposed dilated complex & real dual-path conformer module effectively improve the speech enhancement and dereverberation performance. Furthermore, hybrid encoder and decoder are adopted to simultaneously model the complex spectrum and magnitude and promote the information interaction between two domains. Encoder decoder attention is also applied to enhance the interaction between encoder and decoder. Our experimental results outperform all SOTA time and complex domain models objectively and subjectively. Specifically, Uformer reaches 3.6032 DNSMOS on the blind test set of Interspeech 2021 DNS Challenge, which outperforms all top-performed models. We also carry out ablation experiments to tease apart all proposed sub-modules that are most important.Comment: Accepted by ICASSP 202

arXiv.org e-Print Archive

KeyPosS: Plug-and-Play Facial Landmark Detection through GPS-Inspired True-Range Multilateration

Author: Bao Xu
Cheng Zhi-Qi
Geng Yifeng
He Jun-Yan
Li Chenyang
Liu Hanbing
Liu Wei
Luo Bin
Sun Jingdong
Xiang Wangmeng
Xie Xuansong
Publication venue
Publication date: 18/08/2023
Field of study

In the realm of facial analysis, accurate landmark detection is crucial for various applications, ranging from face recognition and expression analysis to animation. Conventional heatmap or coordinate regression-based techniques, however, often face challenges in terms of computational burden and quantization errors. To address these issues, we present the KeyPoint Positioning System (KeyPosS) - a groundbreaking facial landmark detection framework that stands out from existing methods. The framework utilizes a fully convolutional network to predict a distance map, which computes the distance between a Point of Interest (POI) and multiple anchor points. These anchor points are ingeniously harnessed to triangulate the POI's position through the True-range Multilateration algorithm. Notably, the plug-and-play nature of KeyPosS enables seamless integration into any decoding stage, ensuring a versatile and adaptable solution. We conducted a thorough evaluation of KeyPosS's performance by benchmarking it against state-of-the-art models on four different datasets. The results show that KeyPosS substantially outperforms leading methods in low-resolution settings while requiring a minimal time overhead. The code is available at https://github.com/zhiqic/KeyPosS.Comment: Accepted to ACM Multimedia 2023; 10 pages, 7 figures, 6 tables; the code is at https://github.com/zhiqic/KeyPos

arXiv.org e-Print Archive

WordArt Designer: User-Driven Artistic Typography Synthesis using Large Language Models

Author: Cheng Zhi-Qi
Geng Yifeng
He Jun-Yan
Hu Yusen
Jin Zengke
Kang Xiaoyang
Li Chenyang
Lin Xianhui
Luo Bin
Sun Jingdong
Xiang Wangmeng
Xie Xuansong
Zhou Jingren
Publication venue
Publication date: 26/11/2023
Field of study

This paper introduces WordArt Designer, a user-driven framework for artistic typography synthesis, relying on the Large Language Model (LLM). The system incorporates four key modules: the LLM Engine, SemTypo, StyTypo, and TexTypo modules. 1) The LLM Engine, empowered by the LLM (e.g., GPT-3.5), interprets user inputs and generates actionable prompts for the other modules, thereby transforming abstract concepts into tangible designs. 2) The SemTypo module optimizes font designs using semantic concepts, striking a balance between artistic transformation and readability. 3) Building on the semantic layout provided by the SemTypo module, the StyTypo module creates smooth, refined images. 4) The TexTypo module further enhances the design's aesthetics through texture rendering, enabling the generation of inventive textured fonts. Notably, WordArt Designer highlights the fusion of generative AI with artistic typography. Experience its capabilities on ModelScope: https://www.modelscope.cn/studios/WordArt/WordArt.Comment: Accepted by EMNLP 2023, 10 pages, 11 figures, 1 table, the system is at https://www.modelscope.cn/studios/WordArt/WordAr

arXiv.org e-Print Archive