X-CrossNet: A complex spectral mapping approach to target speaker extraction with cross attention speaker embedding fusion

Sun, Chang; Qin, Bo

arXiv:2411.13811 (cs)

[Submitted on 21 Nov 2024 (v1), last revised 25 Nov 2024 (this version, v2)]

Abstract: Target speaker extraction (TSE) is a technique for isolating a target speaker's voice from mixed speech using auxiliary features associated with the target speaker. It is another approach to the cocktail party problem and is generally considered to have better prospects for practical application than traditional speech separation methods. Although academic research in this area has achieved high evaluation scores on public datasets, most models perform significantly worse in real-world noisy or reverberant conditions. To address this limitation, we propose a novel TSE model, X-CrossNet, which uses CrossNet as its backbone. CrossNet is a speech separation network specifically optimized for challenging noisy and reverberant environments, and it achieves state-of-the-art performance in tasks such as speaker separation under these conditions. Additionally, to enhance the network's ability to capture and utilize auxiliary features of the target speaker, we integrate a cross-attention mechanism into the global multi-head self-attention (GMHSA) module within each CrossNet block. This facilitates more effective integration of target speaker features with mixed speech features. Experimental results show that our method achieves superior separation performance on the WSJ0-2mix and WHAMR! datasets, demonstrating strong robustness and stability.
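
The cross-attention fusion mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the X-CrossNet implementation. The abstract does not specify the layer layout, so the sketch assumes one common arrangement: mixture features act as queries, target speaker features act as keys and values, and the attended speaker context is added back to the mixture features through a residual connection. The function name `cross_attention_fusion`, the weight matrices, and all dimensions are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fusion(mix, spk, w_q, w_k, w_v, num_heads):
    """Fuse speaker features into mixture features (illustrative only).

    mix: (T, D) mixture features, used as queries (assumed layout).
    spk: (S, D) target speaker features, used as keys and values.
    w_q, w_k, w_v: (D, D) projection matrices (hypothetical parameters).
    Returns the fused features (T, D) and attention weights (H, T, S).
    """
    T, D = mix.shape
    S = spk.shape[0]
    d_head = D // num_heads

    # Project, then split the feature dimension into heads: (H, length, d_head).
    q = (mix @ w_q).reshape(T, num_heads, d_head).transpose(1, 0, 2)
    k = (spk @ w_k).reshape(S, num_heads, d_head).transpose(1, 0, 2)
    v = (spk @ w_v).reshape(S, num_heads, d_head).transpose(1, 0, 2)

    # Scaled dot-product attention of each mixture frame over speaker frames.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (H, T, S)
    attn = softmax(scores, axis=-1)
    context = attn @ v  # (H, T, d_head)

    # Merge heads and add the speaker context to the mixture features.
    context = context.transpose(1, 0, 2).reshape(T, D)
    return mix + context, attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, S, D, H = 50, 3, 16, 4  # arbitrary sizes chosen for the demo
    mix = rng.standard_normal((T, D))
    spk = rng.standard_normal((S, D))
    w_q, w_k, w_v = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
    fused, attn = cross_attention_fusion(mix, spk, w_q, w_k, w_v, H)
    print(fused.shape, attn.shape)  # (50, 16) (4, 50, 3)
```

In this arrangement each mixture frame decides how strongly to draw on each speaker frame, which is one plausible reading of "more effective integration of target speaker features with mixed speech features"; the paper itself should be consulted for the actual placement inside the GMHSA module.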

Subjects:	Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2411.13811 [cs.SD]
	(or arXiv:2411.13811v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2411.13811

Submission history

From: Chang Sun
[v1] Thu, 21 Nov 2024 03:21:42 UTC (764 KB)
[v2] Mon, 25 Nov 2024 00:21:53 UTC (764 KB)
