Uformer: A Unet based dilated complex & real dual-path conformer network for simultaneous speech enhancement and dereverberation
Complex spectrum and magnitude are considered as two major features of speech
enhancement and dereverberation. Traditional approaches always treat these two
features separately, ignoring their underlying relationship. In this paper, we
propose Uformer, a Unet based dilated complex & real dual-path conformer
network in both complex and magnitude domain for simultaneous speech
enhancement and dereverberation. We exploit time attention (TA) and dilated
convolution (DC) to leverage local and global contextual information and
frequency attention (FA) to model dimensional information. These three
sub-modules contained in the proposed dilated complex & real dual-path
conformer module effectively improve the speech enhancement and dereverberation
performance. Furthermore, hybrid encoder and decoder are adopted to
simultaneously model the complex spectrum and magnitude and promote the
information interaction between the two domains. Encoder-decoder attention is also
applied to enhance the interaction between encoder and decoder. In our
experiments, Uformer outperforms all SOTA time and complex domain models on
both objective and subjective measures. Specifically, Uformer reaches 3.6032
DNSMOS on the blind test set of the Interspeech 2021 DNS Challenge, surpassing
all top-performing models. We also carry out ablation experiments to identify
which of the proposed sub-modules are most important.
Comment: Accepted by ICASSP 202
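The relationship between the complex spectrum and the magnitude mentioned in this abstract can be illustrated with a minimal sketch. It is not taken from the Uformer code: it uses a plain DFT of a single hypothetical 8-sample frame instead of the STFT a real enhancement front end would use, and the names `dft`, `frame`, `magnitude`, `phase`, and `rebuilt` are assumptions chosen for illustration.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one real-valued frame (illustrative only)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


# Hypothetical frame: a sine wave that completes two cycles in 8 samples.
frame = [math.sin(2 * math.pi * 2 * t / 8) for t in range(8)]

# Complex spectrum: each bin carries a real and an imaginary part.
spectrum = dft(frame)

# Magnitude feature: the modulus of each complex bin (phase is discarded).
magnitude = [abs(c) for c in spectrum]

# Phase is the information the magnitude alone does not carry.
phase = [cmath.phase(c) for c in spectrum]

# Magnitude and phase together recover the complex spectrum.
rebuilt = [cmath.rect(m, p) for m, p in zip(magnitude, phase)]
```

The magnitude is a deterministic function of the complex spectrum, while the reverse direction additionally needs the phase. This is one way to read the abstract's point that the two features are related rather than independent.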
KeyPosS: Plug-and-Play Facial Landmark Detection through GPS-Inspired True-Range Multilateration
In the realm of facial analysis, accurate landmark detection is crucial for
various applications, ranging from face recognition and expression analysis to
animation. Conventional heatmap or coordinate regression-based techniques,
however, often face challenges in terms of computational burden and
quantization errors. To address these issues, we present the KeyPoint
Positioning System (KeyPosS) - a groundbreaking facial landmark detection
framework that stands out from existing methods. The framework utilizes a fully
convolutional network to predict a distance map, which encodes the distance
between a Point of Interest (POI) and multiple anchor points. These anchor
points are ingeniously harnessed to triangulate the POI's position through the
True-range Multilateration algorithm. Notably, the plug-and-play nature of
KeyPosS enables seamless integration into any decoding stage, ensuring a
versatile and adaptable solution. We conducted a thorough evaluation of
KeyPosS's performance by benchmarking it against state-of-the-art models on
four different datasets. The results show that KeyPosS substantially
outperforms leading methods in low-resolution settings while requiring a
minimal time overhead. The code is available at
https://github.com/zhiqic/KeyPosS.
Comment: Accepted to ACM Multimedia 2023; 10 pages, 7 figures, 6 tables.
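The true-range multilateration step described in this abstract can be sketched as follows. This is a generic linearized least-squares solver in 2D, not the KeyPosS implementation: the function name `multilaterate`, the anchor layout, and the test point are assumptions made for illustration, and the distances are computed exactly here rather than predicted by a network.

```python
import math


def multilaterate(anchors, distances):
    """Estimate a 2D point from its distances to three or more known anchors.

    Subtracting the first anchor's circle equation from the others gives a
    linear system in (x, y), solved here through the 2x2 normal equations.
    """
    if len(anchors) != len(distances) or len(anchors) < 3:
        raise ValueError("need at least three anchors with matching distances")

    (x0, y0), d0 = anchors[0], distances[0]
    k0 = x0 * x0 + y0 * y0 - d0 * d0

    sxx = sxy = syy = sxb = syb = 0.0
    for (xi, yi), di in zip(anchors[1:], distances[1:]):
        ax = 2.0 * (xi - x0)
        ay = 2.0 * (yi - y0)
        b = (xi * xi + yi * yi - di * di) - k0
        sxx += ax * ax
        sxy += ax * ay
        syy += ay * ay
        sxb += ax * b
        syb += ay * b

    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("anchors are collinear; position is not determined")

    x = (syy * sxb - sxy * syb) / det
    y = (sxx * syb - sxy * sxb) / det
    return x, y


# Hypothetical setup: four anchors on a coarse grid and one point of interest.
anchors = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
poi = (1.3, 2.7)
distances = [math.hypot(poi[0] - ax, poi[1] - ay) for ax, ay in anchors]
estimate = multilaterate(anchors, distances)
```

Because the estimate is a continuous solution rather than the index of a grid cell, it is not limited to the anchor grid's resolution, which is consistent with the abstract's claim about quantization errors in low-resolution settings.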
WordArt Designer: User-Driven Artistic Typography Synthesis using Large Language Models
This paper introduces WordArt Designer, a user-driven framework for artistic
typography synthesis that relies on a Large Language Model (LLM). The system
incorporates four key modules: the LLM Engine, SemTypo, StyTypo, and TexTypo
modules. 1) The LLM Engine, empowered by the LLM (e.g., GPT-3.5), interprets
user inputs and generates actionable prompts for the other modules, thereby
transforming abstract concepts into tangible designs. 2) The SemTypo module
optimizes font designs using semantic concepts, striking a balance between
artistic transformation and readability. 3) Building on the semantic layout
provided by the SemTypo module, the StyTypo module creates smooth, refined
images. 4) The TexTypo module further enhances the design's aesthetics through
texture rendering, enabling the generation of inventive textured fonts.
Notably, WordArt Designer highlights the fusion of generative AI with artistic
typography. Experience its capabilities on ModelScope:
https://www.modelscope.cn/studios/WordArt/WordArt.
Comment: Accepted by EMNLP 2023; 10 pages, 11 figures, 1 table.