Electrical Engineering and Systems Science
Showing new listings for Friday, 22 November 2024
- [1] arXiv:2411.13557 [pdf, html, other]
Title: Fast Hyperspectral Reconstruction for Neutron Computed Tomography Using Subspace Extraction
Authors: Mohammad Samin Nur Chowdhury, Diyu Yang, Shimin Tang, Singanallur V. Venkatakrishnan, Andrew W. Needham, Hassina Z. Bilheux, Gregery T. Buzzard, Charles A. Bouman
Subjects: Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Hyperspectral neutron computed tomography enables 3D non-destructive imaging of the spectral characteristics of materials. In traditional hyperspectral reconstruction, the data for each neutron wavelength bin is reconstructed separately. This per-bin reconstruction is extremely time-consuming due to the typically large number of wavelength bins. Furthermore, these reconstructions may suffer from severe artifacts due to the low signal-to-noise ratio in each wavelength bin.
We present a novel fast hyperspectral reconstruction algorithm for computationally efficient and accurate reconstruction of hyperspectral neutron data. Our algorithm uses a subspace extraction procedure that transforms hyperspectral data into low-dimensional data within an intermediate subspace. This step effectively reduces data dimensionality and spectral noise. High-quality reconstructions are then performed within this low-dimensional subspace. Finally, the algorithm expands the subspace reconstructions into hyperspectral reconstructions. We apply our algorithm to measured neutron data and demonstrate that it reduces computation and improves reconstruction quality compared to the conventional approach.
- [2] arXiv:2411.13577 [pdf, html, other]
Title: WavChat: A Survey of Spoken Dialogue Models
Authors: Shengpeng Ji, Yifu Chen, Minghui Fang, Jialong Zuo, Jingyu Lu, Hanting Wang, Ziyue Jiang, Long Zhou, Shujie Liu, Xize Cheng, Xiaoda Yang, Zehan Wang, Qian Yang, Jian Li, Yidi Jiang, Jingzhen He, Yunfei Chu, Jin Xu, Zhou Zhao
Comments: 60 pages, work in progress
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain. Compared to traditional three-tier cascaded spoken dialogue models that comprise speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS), modern spoken dialogue models exhibit greater intelligence. These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech. Moreover, they generate high-quality, multi-turn speech responses with low latency, enabling real-time interaction through simultaneous listening and speaking capability. Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems and the underlying technologies. To address this, we first compile existing spoken dialogue systems in chronological order and categorize them into the cascaded and end-to-end paradigms. We then provide an in-depth overview of the core technologies in spoken dialogue models, covering aspects such as speech representation, training paradigm, streaming, duplex, and interaction capabilities. Each section discusses the limitations of these technologies and outlines considerations for future research. Additionally, we present a thorough review of relevant datasets, evaluation metrics, and benchmarks from the perspectives of training and evaluating spoken dialogue systems. We hope this survey will contribute to advancing both academic research and industrial applications in the field of spoken dialogue systems. The related material is available at this https URL.
- [3] arXiv:2411.13602 [pdf, other]
Title: Large-scale cross-modality pretrained model enhances cardiovascular state estimation and cardiomyopathy detection from electrocardiograms: An AI system development and multi-center validation study
Authors: Zhengyao Ding, Yujian Hu, Youyao Xu, Chengchen Zhao, Ziyu Li, Yiheng Mao, Haitao Li, Qian Li, Jing Wang, Yue Chen, Mengjia Chen, Longbo Wang, Xuesen Chu, Weichao Pan, Ziyi Liu, Fei Wu, Hongkun Zhang, Ting Chen, Zhengxing Huang
Comments: 23 pages, 8 figures
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cardiovascular diseases (CVDs) present significant challenges for early and accurate diagnosis. While cardiac magnetic resonance imaging (CMR) is the gold standard for assessing cardiac function and diagnosing CVDs, its high cost and technical complexity limit accessibility. In contrast, electrocardiography (ECG) offers promise for large-scale early screening. This study introduces CardiacNets, an innovative model that enhances ECG analysis by leveraging the diagnostic strengths of CMR through cross-modal contrastive learning and generative pretraining. CardiacNets serves two primary functions: (1) it evaluates detailed cardiac function indicators and screens for potential CVDs, including coronary artery disease, cardiomyopathy, pericarditis, heart failure and pulmonary hypertension, using ECG input; and (2) it enhances interpretability by generating high-quality CMR images from ECG data. We train and validate the proposed CardiacNets on two large-scale public datasets (the UK Biobank with 41,519 individuals and the MIMIC-IV-ECG comprising 501,172 samples) as well as three private datasets (FAHZU with 410 individuals, SAHZU with 464 individuals, and QPH with 338 individuals), and the findings demonstrate that CardiacNets consistently outperforms traditional ECG-only models, substantially improving screening accuracy. Furthermore, the generated CMR images provide valuable diagnostic support for physicians of all experience levels. This proof-of-concept study highlights how ECG can facilitate cross-modal insights into cardiac function assessment, paving the way for enhanced CVD screening and diagnosis at a population level.
- [4] arXiv:2411.13710 [pdf, other]
Title: Assessing the Impact of Electric Vehicle Charging on Residential Distribution Grids
Subjects: Systems and Control (eess.SY)
Electrification of the transportation sector plays an important role in achieving net-zero carbon emissions. A significant increase in electric vehicles (EVs) has been observed nationally and globally. While the transition to EVs presents substantial environmental benefits, EV charging activities pose several challenges to the power grid. Growing EV adoption greatly increases peak loads on residential grids, particularly during evening charging periods. This surge can result in operational challenges, including greater voltage drops, increased power losses, and potential overloading violations, compromising grid reliability and efficiency. This study focuses on determining ampacity violations and analyzing line loading levels in a 240-bus distribution system with 1120 customers, located in the Midwest U.S. By simulating a range of charging scenarios and evaluating EV chargers with varying power capacities under different distribution system voltage levels, this research aims to identify lines at risk of ampacity violations for various EV charging penetration rates up to 100%. The findings will provide valuable insights for utilities and grid operators, informing strategies for voltage level adjustments and necessary infrastructure reinforcements to effectively accommodate the growing energy demands associated with widespread EV adoption.
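The kind of ampacity check described in this abstract can be illustrated with a minimal sketch. It assumes a balanced three-phase line, unity power factor, and a worst case in which every EV on the feeder charges at once through a single line. The 7.2 kW charger size, the 400 A rating, and the 4.16 kV and 12.47 kV voltage levels are illustrative assumptions, not values from the study; only the 1120-customer count comes from the abstract.

```python
import math

def line_current_a(load_kw: float, v_ll_kv: float, power_factor: float = 1.0) -> float:
    """Balanced three-phase line current in amperes: I = P / (sqrt(3) * V_LL * pf)."""
    return load_kw / (math.sqrt(3) * v_ll_kv * power_factor)

def loading_percent(load_kw: float, v_ll_kv: float, ampacity_a: float,
                    power_factor: float = 1.0) -> float:
    """Line loading as a percentage of the conductor ampacity."""
    return 100.0 * line_current_a(load_kw, v_ll_kv, power_factor) / ampacity_a

def ev_load_kw(n_customers: int, penetration: float, charger_kw: float) -> float:
    """Coincident EV charging load, assuming every EV charges at the same time."""
    return n_customers * penetration * charger_kw

if __name__ == "__main__":
    N_CUSTOMERS = 1120    # from the abstract
    CHARGER_KW = 7.2      # hypothetical Level 2 charger rating
    AMPACITY_A = 400.0    # hypothetical conductor rating
    for v_ll_kv in (4.16, 12.47):          # hypothetical voltage levels
        for penetration in (0.25, 0.5, 1.0):
            load = ev_load_kw(N_CUSTOMERS, penetration, CHARGER_KW)
            pct = loading_percent(load, v_ll_kv, AMPACITY_A)
            flag = "VIOLATION" if pct > 100.0 else "ok"
            print(f"{v_ll_kv:6.2f} kV  {penetration:4.0%}  {pct:6.1f}%  {flag}")
```

Because current scales inversely with voltage for a fixed power, the same charging load that overloads a line at the lower voltage level can fit within its rating at the higher one, which is why voltage level adjustment appears among the mitigation strategies.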
- [5] arXiv:2411.13751 [pdf, html, other]
Title: ScAlN-on-SiC Ku-Band Solidly-Mounted Bidimensional Mode Resonators
Comments: Submitted to IEEE EDL
Subjects: Systems and Control (eess.SY)
This letter reports on Solidly-Mounted Bidimensional Mode Resonators (S2MRs) based on 30% Scandium-doped Aluminum Nitride (ScAlN) on Silicon Carbide (SiC), operating near 16 GHz. Experimental results demonstrate mechanical quality factors ($Q_m$) as high as 380, electromechanical coupling coefficients ($k_t^2$) of 4.5%, an overall Figure of Merit (FOM = $Q_m \cdot k_t^2$) exceeding 17, and power handling greater than 20 dBm for devices closely matched to 50 ohm. To the best of the authors' knowledge, S2MRs exhibit the highest Key Performance Indicators (KPIs) among solidly mounted resonators in the Ku band, paving the way for the integration of nanoacoustic devices on fast substrates with high-power electronics, tailored for military and harsh environment applications.
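As a quick consistency check on the reported figure of merit, and assuming the peak quality factor and the quoted coupling coefficient are taken together (the abstract does not say they were measured on the same device):

```latex
\mathrm{FOM} = Q_m \, k_t^2 = 380 \times 0.045 = 17.1
```

This agrees with the stated FOM exceeding 17.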
- [6] arXiv:2411.13769 [pdf, html, other]
Title: Which Channel, Low-rank or Full-rank, more needs RIS?
Subjects: Signal Processing (eess.SP)
RIS, as an efficient tool to improve the receive signal-to-noise ratio, extend coverage, and create more spatial diversity, is viewed as one of the most promising techniques for future wireless networks such as 6G. RIS is well suited to the special wireless scenario in which the link between the BS and the users is completely blocked. In this paper, we extend its application to a general scenario, i.e., rank-deficient channels, particularly extremely low-rank ones such as no-link and line-of-sight (LoS) channels. There are several potentially important low-rank applications, such as satellite, UAV, marine, and deep-space communications. In such situations, it is found that RIS may yield a dramatic DoF enhancement over the no-RIS case. By using a distributed RIS placement, the DoF of the channels from the BS to the users may even be boosted from a low rank such as 0 or 1 to full rank. This achieves an extreme rate improvement via the transmission of multiple spatial streams per user. In this paper, we present a complete review and an in-depth discussion of the DoF effect of RIS.
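The rank argument can be illustrated numerically with a minimal sketch. It assumes half-wavelength uniform linear arrays at both ends, pure far-field LoS propagation, and that each distributed RIS contributes one effective rank-one path (its phase shifts are folded into a scalar gain). The angles, gains, and array sizes are hypothetical and not taken from the paper.

```python
import numpy as np

def steering(n: int, angle_deg: float) -> np.ndarray:
    """Unit-norm response of a half-wavelength-spaced uniform linear array."""
    k = np.arange(n)
    return np.exp(1j * np.pi * k * np.sin(np.deg2rad(angle_deg))) / np.sqrt(n)

def los_path(n_rx: int, n_tx: int, aoa_deg: float, aod_deg: float,
             gain: float = 1.0) -> np.ndarray:
    """Rank-one far-field path: gain * a_rx(aoa) a_tx(aod)^H."""
    return gain * np.outer(steering(n_rx, aoa_deg), steering(n_tx, aod_deg).conj())

def dof(h: np.ndarray) -> int:
    """Degrees of freedom, taken here as the numerical rank of the channel matrix."""
    return int(np.linalg.matrix_rank(h))

N_RX = N_TX = 4

# Direct LoS link only: a single rank-one path.
h_direct = los_path(N_RX, N_TX, aoa_deg=10.0, aod_deg=-20.0)

# Three distributed RISs (hypothetical placements), each seen at the receiver
# and transmitter under its own distinct angle of arrival / departure.
ris_angles = [(-60.0, 45.0), (-20.0, 10.0), (45.0, -60.0)]
h_with_ris = h_direct + sum(
    los_path(N_RX, N_TX, aoa, aod, gain=0.5) for aoa, aod in ris_angles
)

print("DoF without RIS:", dof(h_direct))
print("DoF with distributed RIS:", dof(h_with_ris))
```

Each extra path arrives from a distinct direction, so its steering vectors are linearly independent of the others and the channel rank grows by one per RIS until it reaches the array size.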
- [7] arXiv:2411.13806 [pdf, html, other]
Title: Weak synchronization in heterogeneous multi-agent systems
Comments: This paper has been submitted to IJRNC at Nov. 5, 2024 for first round review. arXiv admin note: text overlap with arXiv:2403.18200
Subjects: Systems and Control (eess.SY)
In this paper, we propose a new framework for synchronization of heterogeneous multi-agent systems, which we refer to as weak synchronization. This new framework is based on achieving network stability in the absence of any information on the communication network, including its connectivity. Here, by network stability we mean that, in the basic setup of a multi-agent system, the signals exchanged over the network converge to zero. Moreover, we design protocols which achieve weak synchronization for any network without making any assumptions on the communication network. If the network happens to have a directed spanning tree, then we obtain classical synchronization. However, if this is not the case, then we describe in detail what kind of synchronization properties are preserved in the system and how the outputs of the different agents can behave.
- [8] arXiv:2411.13834 [pdf, html, other]
Title: Spatiotemporal Tubes for Temporal Reach-Avoid-Stay Tasks in Unknown Systems
Subjects: Systems and Control (eess.SY); Robotics (cs.RO)
The paper considers the controller synthesis problem for general MIMO systems with unknown dynamics, aiming to fulfill the temporal reach-avoid-stay task, where the unsafe regions are time-dependent and the target must be reached within a specified time frame. The primary aim of the paper is to construct the spatiotemporal tube (STT) using a sampling-based approach and thereby devise a closed-form, approximation-free control strategy to ensure that the system trajectory reaches the target set while avoiding time-dependent unsafe sets. The proposed scheme utilizes a novel method involving STTs to provide controllers that guarantee both system safety and reachability. In our sampling-based framework, we translate the requirements of STTs into a robust optimization program (ROP). To address the infeasibility of the ROP caused by its infinitely many constraints, we utilize the sampling-based scenario optimization program (SOP). Subsequently, we solve the SOP to generate the tube and closed-form controller for an unknown system, ensuring the temporal reach-avoid-stay specification. Finally, the effectiveness of the proposed approach is demonstrated through three case studies: an omnidirectional robot, a SCARA manipulator, and a magnetic levitation system.
- [9] arXiv:2411.13849 [pdf, html, other]
Title: Sequence-to-Sequence Neural Diarization with Automatic Speaker Detection and Representation
Subjects: Audio and Speech Processing (eess.AS)
This paper proposes a novel Sequence-to-Sequence Neural Diarization (SSND) framework to perform online and offline speaker diarization. It is developed from the sequence-to-sequence architecture of our previous target-speaker voice activity detection system and then evolves into a new diarization paradigm by addressing two critical problems. 1) Speaker Detection: The proposed approach can utilize incompletely given speaker embeddings to discover the unknown speaker and predict the target voice activities in the audio signal. It does not require a prior diarization system for speaker enrollment in advance. 2) Speaker Representation: The proposed approach can adopt the predicted voice activities as reference information to extract speaker embeddings from the audio signal simultaneously. The representation space of speaker embedding is jointly learned within the whole diarization network without using an extra speaker embedding model. During inference, the SSND framework can process long audio recordings blockwise. The detection module utilizes the previously obtained speaker-embedding buffer to predict both enrolled and unknown speakers' voice activities for each coming audio block. Next, the speaker-embedding buffer is updated according to the predictions of the representation module. Assuming that up to one new speaker may appear in a small block shift, our model iteratively predicts the results of each block and extracts target embeddings for the subsequent blocks until the signal ends. Finally, the last speaker-embedding buffer can re-score the entire audio, achieving highly accurate diarization performance as an offline system. (......)
- [10] arXiv:2411.13855 [pdf, html, other]
Title: A Multimodal Approach to The Detection and Classification of Skin Diseases
Authors: Allen Yang (1), Edward Yang (2) ((1) Mission San Jose High School, Fremont, CA, (2) Yale University, New Haven, CT)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
According to PBS, nearly one-third of Americans lack access to primary care services, and another forty percent delay seeking care to avoid medical costs. As a result, many diseases are left undiagnosed and untreated, even if the disease shows many physical symptoms on the skin. With the rise of AI, self-diagnosis and improved disease recognition have become more promising than ever; in spite of that, existing methods suffer from a lack of large-scale patient databases and outdated methods of study, resulting in studies being limited to only a few diseases or modalities. This study incorporates readily available and easily accessible patient information via image and text for skin disease classification on a new dataset of 26 skin disease types that includes both skin disease images (37K) and associated patient narratives. Using this dataset, baselines for various image models were established that outperform existing methods. Initially, the ResNet-50 model was only able to achieve an accuracy of 70%, but after various optimization techniques the accuracy was improved to 80%. In addition, this study proposes a novel fine-tuning strategy for sequence classification Large Language Models (LLMs), Chain of Options, which breaks down a complex reasoning task into intermediate steps at training time instead of inference. With Chain of Options and preliminary disease recommendations from the image model, this method achieves a state-of-the-art accuracy of 91% in diagnosing patient skin disease given just an image of the afflicted area as well as a patient description of the symptoms (such as itchiness or dizziness). Through this research, an earlier diagnosis of skin diseases can occur, and clinicians can work with deep learning models to give a more accurate diagnosis, improving quality of life and saving lives.
- [11] arXiv:2411.13862 [pdf, html, other]
Title: Image Compression Using Novel View Synthesis Priors
Comments: Preprint submitted to Ocean Engineering
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Real-time visual feedback is essential for tetherless control of remotely operated vehicles, particularly during inspection and manipulation tasks. Though acoustic communication is the preferred choice for medium-range communication underwater, its limited bandwidth renders it impractical to transmit images or videos in real-time. To address this, we propose a model-based image compression technique that leverages prior mission information. Our approach employs trained machine-learning based novel view synthesis models, and uses gradient descent optimization to refine latent representations to help generate compressible differences between camera images and rendered images. We evaluate the proposed compression technique using a dataset from an artificial ocean basin, demonstrating superior compression ratios and image quality over existing techniques. Moreover, our method exhibits robustness to introduction of new objects within the scene, highlighting its potential for advancing tetherless remotely operated vehicle operations.
- [12] arXiv:2411.13903 [pdf, other]
Title: AmpliNetECG12: A lightweight SoftMax-based relativistic amplitude amplification architecture for 12 lead ECG classification
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The urgent need to promptly detect cardiac disorders from 12-lead electrocardiograms (ECGs) using limited computation is motivated by the heart's fast and complex electrical activity and the restricted computational power of portable devices. Timely and precise diagnoses are crucial since delays might significantly impact patient health outcomes. This research presents a novel deep-learning architecture that aims to diagnose heart abnormalities quickly and accurately. We devised a new activation function called aSoftMax, designed to improve the visibility of ECG deflections. The proposed activation function is used with a Convolutional Neural Network (CNN) architecture that includes kernel weight sharing across the ECG's various leads. This method generalizes the global 12-lead ECG features and minimizes the model's complexity by decreasing the number of trainable parameters. Combining aSoftMax with the enhanced CNN architecture yields AmpliNetECG12, which obtains an accuracy of 84% in diagnosing cardiac disorders. AmpliNetECG12 shows outstanding prediction ability when used with the CPSC2018 dataset for arrhythmia classification. The model attains an F1-score of 80.71% and a ROC-AUC score of 96.00% with 280,000 trainable parameters, which signifies the lightweight yet efficient nature of AmpliNetECG12. The stochastic characteristics of aSoftMax, a fundamental element of AmpliNetECG12, improve prediction accuracy and also increase the model's interpretability. This feature enhances comprehension of important ECG segments in different forms of arrhythmias, establishing a new standard of explainable architecture for cardiac disorder classification.
- [13] arXiv:2411.13924 [pdf, html, other]
Title: Robust Data-Driven Predictive Control for Mixed Platoons under Noise and Attacks
Comments: 16 pages, 7 figures
Subjects: Systems and Control (eess.SY)
Controlling mixed platoons, which consist of both connected and automated vehicles (CAVs) and human-driven vehicles (HDVs), poses significant challenges due to the uncertain and unknown human driving behaviors. Data-driven control methods offer promising solutions by leveraging available trajectory data, but their performance can be compromised by process noise and adversarial attacks. To address this issue, this paper proposes a Robust Data-EnablEd Predictive Leading Cruise Control (RDeeP-LCC) framework based on data-driven reachability analysis. The framework over-approximates system dynamics under noise and attack using a matrix zonotope set derived from data, and develops a stabilizing feedback control law. By decoupling the mixed platoon system into nominal and error components, we employ data-driven reachability sets to recursively compute error reachable sets that account for noise and attacks, and obtain tightened safety constraints of the nominal system. This leads to a robust data-driven predictive control framework, solved in a tube-based control manner. Numerical simulations and human-in-the-loop experiments validate that the RDeeP-LCC method significantly enhances the robustness of mixed platoons, improving mixed traffic stability and safety against practical noise and attacks.
- [14] arXiv:2411.13931 [pdf, other]
Title: Implementation of tools for lessening the influence of artifacts in EEG signal analysis
Comments: 14 pages
Journal-ref: Applied Sciences, 14,971, 2024
Subjects: Signal Processing (eess.SP)
This manuscript describes an implementation of code scripts aimed at reducing the influence of artifacts, specifically ocular artifacts, in the measurement and processing of electroencephalogram (EEG) signals. This process is important because it benefits the analysis and study of long trial samples when the appearance of ocular artifacts cannot be avoided by simply discarding trials. The implementations provided to the reader illustrate, with slight modifications, previously proposed methods aimed at the partial or complete elimination of EEG channels or components that resemble the electro-oculogram (EOG) signals in which artifacts are detected. In addition to the description of each of the provided functions, examples of utilization and illustrative figures are included to show the expected results and processing pipeline.
- [15] arXiv:2411.13935 [pdf, html, other]
Title: Fast Stochastic MPC using Affine Disturbance Feedback Gains Learned Offline
Comments: Submitted to L4DC 2025
Subjects: Systems and Control (eess.SY)
We propose a novel Stochastic Model Predictive Control (MPC) for uncertain linear systems subject to probabilistic constraints. The proposed approach leverages offline learning to extract key features of affine disturbance feedback policies, significantly reducing the computational burden of online optimization. Specifically, we employ offline data-driven sampling to learn feature components of feedback gains and approximate the chance-constrained feasible set with a specified confidence level. By utilizing this learned information, the online MPC problem is simplified to optimization over nominal inputs and a reduced set of learned feedback gains, ensuring computational efficiency. In a numerical example, the proposed MPC approach achieves comparable control performance in terms of Region of Attraction (ROA) and average closed-loop costs to classical MPC optimizing over disturbance feedback policies, while delivering a 10-fold improvement in computational speed.
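For reference, affine disturbance feedback policies are commonly written in the standard form below. The symbols $v_k$, $M_{k,j}$, $w_j$, and the horizon $N$ are generic notation assumed here, not necessarily the paper's:

```latex
u_k = v_k + \sum_{j=0}^{k-1} M_{k,j}\, w_j, \qquad k = 0, \dots, N-1
```

Here $v_k$ is the nominal input, $w_j$ are the past disturbances, and $M_{k,j}$ are the feedback gains. The number of gain blocks grows quadratically with $N$, which is the online burden that restricting the optimization to a reduced set of learned gains is meant to cut.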
- [16] arXiv:2411.13944 [pdf, html, other]
Title: Semi-blind Channel Estimation for Massive MIMO LEO Satellite Communications
Subjects: Signal Processing (eess.SP)
This letter proposes decision-directed semi-blind channel estimation for massive multiple-input multiple-output low-Earth-orbit satellite communications. Two semi-blind estimators are proposed. The first utilizes detected data symbols in addition to pilot symbols. The second, a modified semi-blind estimator, is specially designed to mitigate the channel-aging effect caused by the highly dynamic nature of low-Earth-orbit satellite communication channels -- an issue that adversely impacts the performance of pilot-based estimators. Consequently, this modified estimator outperforms an optimal pilot-based estimator in terms of normalized mean square error and achieves symbol error rate performance comparable to that of a Genie-aided (perfectly known channel) detector. The trade-offs between the proposed estimators are also examined.
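The decision-directed idea behind the first estimator can be sketched as follows. This is a generic least-squares illustration under assumed conditions (a static channel, QPSK symbols, zero-forcing detection, orthogonal DFT pilots), not the estimators proposed in the letter; in particular it does not model channel aging. All function names and sizes are hypothetical.

```python
import numpy as np

QPSK = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)

def ls_estimate(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Least-squares channel estimate for Y = H X + N: H_hat = Y X^H (X X^H)^-1."""
    return y @ x.conj().T @ np.linalg.inv(x @ x.conj().T)

def detect_qpsk(y: np.ndarray, h_hat: np.ndarray) -> np.ndarray:
    """Zero-forcing equalization followed by nearest-QPSK-point decisions."""
    z = np.linalg.pinv(h_hat) @ y
    idx = np.argmin(np.abs(z[..., None] - QPSK), axis=-1)
    return QPSK[idx]

def semi_blind_estimate(y_pilot, x_pilot, y_data):
    """Pilot-based estimate, then re-estimation using pilots plus detected data."""
    h_pilot = ls_estimate(y_pilot, x_pilot)
    x_hat = detect_qpsk(y_data, h_pilot)        # decisions reused as extra "pilots"
    y_all = np.hstack([y_pilot, y_data])
    x_all = np.hstack([x_pilot, x_hat])
    return ls_estimate(y_all, x_all)

def dft_pilots(n_users: int, length: int) -> np.ndarray:
    """Orthogonal pilot sequences taken from rows of a DFT matrix."""
    k, n = np.arange(n_users)[:, None], np.arange(length)[None, :]
    return np.exp(-2j * np.pi * k * n / length)

def nmse(h_hat: np.ndarray, h: np.ndarray) -> float:
    """Normalized mean square error of a channel estimate."""
    return float(np.linalg.norm(h_hat - h) ** 2 / np.linalg.norm(h) ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_ant, n_users, n_pilot, n_data, sigma = 8, 2, 4, 64, 0.3
    h = (rng.standard_normal((n_ant, n_users))
         + 1j * rng.standard_normal((n_ant, n_users))) / np.sqrt(2)
    x_p = dft_pilots(n_users, n_pilot)
    x_d = QPSK[rng.integers(0, 4, size=(n_users, n_data))]

    def noise(shape):
        return sigma * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

    y_p = h @ x_p + noise((n_ant, n_pilot))
    y_d = h @ x_d + noise((n_ant, n_data))
    print("pilot-only NMSE:", nmse(ls_estimate(y_p, x_p), h))
    print("semi-blind NMSE:", nmse(semi_blind_estimate(y_p, x_p, y_d), h))
```

When most symbol decisions are correct, the detected data behave like a much longer pilot sequence, which is where the gain over a pilot-only estimate comes from. Wrong decisions feed errors back into the estimate, so the benefit depends on the operating SNR.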
- [17] arXiv:2411.13970 [pdf, html, other]
Title: Movable Antenna-Equipped UAV for Data Collection in Backscatter Sensor Networks: A Deep Reinforcement Learning-based Approach
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Backscatter communication (BC) has become a promising energy-efficient solution for future wireless sensor networks (WSNs). Unmanned aerial vehicles (UAVs) enable flexible data collection from remote backscatter devices (BDs), yet conventional UAVs rely on omni-directional fixed-position antennas (FPAs), limiting channel gain and prolonging data collection time. To address this issue, we consider equipping a UAV with a directional movable antenna (MA) with high directivity and flexibility. The MA enhances channel gain by precisely aiming its main lobe at each BD, focusing transmission power for efficient communication. Our goal is to minimize the total data collection time by jointly optimizing the UAV's trajectory and the MA's orientation. We develop a deep reinforcement learning (DRL)-based strategy using the azimuth angle and distance between the UAV and each BD to simplify the agent's observation space. To ensure stability during training, we adopt the Soft Actor-Critic (SAC) algorithm, which balances exploration with reward maximization for efficient and reliable learning. Simulation results demonstrate that our proposed MA-equipped UAV with SAC outperforms both FPA-equipped UAVs and other RL methods, achieving significant reductions in both data collection time and energy consumption.
- [18] arXiv:2411.13987 [pdf, html, other]
Title: Universal Scanning GUI Tool for Available and Usable TV White Space (TVWS) Spectrum
Subjects: Signal Processing (eess.SP)
In this era of advanced communication technologies, many remote rural and hard-to-reach areas still lack Internet access due to technological, geographical, and economic challenges. The TV white space (TVWS) technology has proven to be effective and feasible in connecting these areas to Internet service in many parts of the world. The TVWS-based systems operate based on geolocation white space databases (WSDB) to protect the primary systems from harmful interference and thus there is a critical need to know the available and usable channels that can be used by the secondary white space devices (WSDs) in a specific geographic area. In this work, we developed a generalized and flexible graphical user interface (GUI) tool to evaluate the availability and usability of the TVWS channels and their noise levels at each geographic location within the analyzed area. The developed tool has many features and capabilities such as allowing the users to scan the TVWS spectrum for any geographic area in the world and any frequency band in the TVWS spectrum. Moreover, it allows the user to apply widely used terrain-based radio propagation models. It provides the flexibility to import the elevation terrain profile of any region with the desired spatial accuracy and resolution. In addition, various system parameters including those related to regulation rules can be modified in the tool. This tool exports to an external dataset file the output data of the available and usable TVWS channels and their noise levels and it also visualizes these data interactively.
- [19] arXiv:2411.14013 [pdf, html, other]
Title: Single-Model Attribution for Spoofed Speech via Vocoder Fingerprints in an Open-World Setting
Subjects: Audio and Speech Processing (eess.AS); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
As speech generation technology advances, so do the potential threats of misusing spoofed speech signals. One way to address these threats is by attributing the signals to their source generative model. In this work, we are the first to tackle the single-model attribution task in an open-world setting, that is, we aim at identifying whether spoofed speech signals from unknown sources originate from a specific vocoder. We show that the standardized average residual between audio signals and their low-pass filtered or EnCodec filtered versions can serve as powerful vocoder fingerprints. The approach only requires data from the target vocoder and allows for simple but highly accurate distance-based model attribution. We demonstrate its effectiveness on LJSpeech and JSUT, achieving an average AUROC of over 99% in most settings. The accompanying robustness study shows that it is also resilient to noise levels up to a certain degree.
- [20] arXiv:2411.14017 [pdf, html, other]
Title: Automatic brain tumor segmentation in 2D intra-operative ultrasound images using MRI tumor annotations
Comments: 19, 8 figures, submitted to International Journal of Computer Assisted Radiology and Surgery
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Automatic segmentation of brain tumors in intra-operative ultrasound (iUS) images could facilitate localization of tumor tissue during resection surgery. The lack of large annotated datasets limits the performance of current models. In this paper, we investigate the use of tumor annotations in pre-operative MRI images, which are more easily accessible than annotations in iUS images, for training deep learning models for iUS brain tumor segmentation. We used 180 annotated pre-operative MRI images with corresponding unannotated iUS images, and 29 annotated iUS images. Image registration was performed to transfer the MRI annotations to the corresponding iUS images before training models with the nnU-Net framework. To validate the use of MRI labels, the models were compared to a model trained with only US annotated tumors, and a model trained with both US and MRI annotated tumors. In addition, the results were compared to annotations validated by an expert neurosurgeon on the same test set to measure inter-observer variability. The results showed similar performance for a model trained with only MRI annotated tumors, compared to a model trained with only US annotated tumors. The model trained using both modalities obtained slightly better results with an average Dice score of 0.62, where external expert annotations achieved a score of 0.67. The results also showed that the deep learning models were comparable to expert annotation for larger tumors (> 200 mm$^2$), but performed clearly worse for smaller tumors (< 200 mm$^2$). This shows that MRI tumor annotations can be used as a substitute for US tumor annotations to train a deep learning model for automatic brain tumor segmentation in intra-operative ultrasound images. Small tumors are a limitation for the current models and will be the focus of future work. The main models are available here: this https URL.
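The Dice score quoted above is the standard overlap measure 2|A ∩ B| / (|A| + |B|) between a predicted and a reference mask. A minimal sketch for binary masks follows; returning 1.0 when both masks are empty is a convention assumed here, and the paper's evaluation code may handle that case differently.

```python
import numpy as np

def dice_score(pred, target) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0
    return float(2.0 * np.logical_and(pred, target).sum() / denom)

if __name__ == "__main__":
    prediction = np.array([[1, 1, 0],
                           [0, 0, 0]])
    reference = np.array([[1, 0, 0],
                          [1, 0, 0]])
    print(dice_score(prediction, reference))
```

Because the denominator is the total mask area, a fixed number of misclassified pixels costs a small tumor far more Dice than a large one, which is consistent with the weaker results reported for tumors under 200 mm$^2$.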
- [21] arXiv:2411.14052 [pdf, html, other]
Title: Dynamic Trajectory and Power Control in Ultra-Dense UAV Networks: A Mean-Field Reinforcement Learning Approach
Subjects: Systems and Control (eess.SY)
In ultra-dense unmanned aerial vehicle (UAV) networks, it is challenging to coordinate the resource allocation and interference management among large-scale UAVs, for providing flexible and efficient service coverage to the ground users (GUs). In this paper, we propose a learning-based resource allocation scheme in an ultra-dense UAV communication network, where the GUs' service demands are time-varying with unknown distributions. We formulate the non-cooperative game among multiple co-channel UAVs as a stochastic game, where each UAV jointly optimizes its trajectory, user association, and downlink power control to maximize the expectation of its locally cumulative energy efficiency under the interference and energy constraints. To cope with the scalability issue in a large-scale network, we further formulate the problem as a mean-field game (MFG), which simplifies the interactions among the UAVs into a two-player game between a representative UAV and a mean-field. We prove the existence and uniqueness of the equilibrium for the MFG, and propose a model-free mean-field reinforcement learning algorithm named maximum entropy mean-field deep Q network (ME-MFDQN) to solve the mean-field equilibrium in both fully and partially observable scenarios. The simulation results reveal that the proposed algorithm improves the energy efficiency compared with the benchmark algorithms. Moreover, the performance can be further enhanced if the GUs' service demands exhibit higher temporal correlation or if the UAVs have wider observation capabilities over their nearby GUs.
- [22] arXiv:2411.14077 [pdf, html, other]
-
Title: On PI-control in Capacity-Limited Networks
Subjects: Systems and Control (eess.SY)
This paper concerns control of a class of systems where multiple dynamically stable agents share a nonlinear and bounded control-interconnection. The agents are subject to a disturbance which is too large to reject with the available control action, making it impossible to stabilize all agents in their desired states. In this nonlinear setting, we consider two different anti-windup equipped proportional-integral control strategies and analyze their properties. We show that a fully decentralized strategy will globally, asymptotically stabilize a unique equilibrium. This equilibrium also minimizes a weighted sum of the tracking errors. We also consider a light addition to the fully decentralized strategy, where rank-1 coordination between the agents is introduced via the anti-windup action. We show that any equilibrium to this closed-loop system minimizes the maximum tracking error for any agent. A remarkable property of these results is that they rely on extremely few assumptions on the interconnection between the agents. Finally we illustrate how the considered model can be applied in a district heating setting, and demonstrate the two considered controllers in a simulation.
- [23] arXiv:2411.14100 [pdf, html, other]
-
Title: BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection
Comments: Submitted to ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Spoken term detection (STD) is often hindered by reliance on frame-level features and the computationally intensive DTW-based template matching, limiting its practicality. To address these challenges, we propose a novel approach that encodes speech into discrete, speaker-agnostic semantic tokens. This facilitates fast retrieval using text-based search algorithms and effectively handles out-of-vocabulary terms. Our approach focuses on generating consistent token sequences across varying utterances of the same term. We also propose a bidirectional state space modeling within the Mamba encoder, trained in a self-supervised learning framework, to learn contextual frame-level features that are further encoded into discrete tokens. Our analysis shows that our speech tokens exhibit greater speaker invariance than those from existing tokenizers, making them more suitable for STD tasks. Empirical evaluation on LibriSpeech and TIMIT databases indicates that our method outperforms existing STD baselines while being more efficient.
- [24] arXiv:2411.14109 [pdf, html, other]
-
Title: Global and Local Attention-Based Transformer for Hyperspectral Image Change Detection
Comments: IEEE GRSL 2024
Subjects: Image and Video Processing (eess.IV)
Recently, Transformer-based hyperspectral image (HSI) change detection methods have shown remarkable performance. Nevertheless, existing attention mechanisms in Transformers have limitations in local feature representation. To address this issue, we propose Global and Local Attention-based Transformer (GLAFormer), which incorporates a global and local attention module (GLAM) to combine high-frequency and low-frequency signals. Furthermore, we introduce a cross-gating mechanism, called cross-gated feed-forward network (CGFN), to emphasize salient features and suppress noise interference. Specifically, the GLAM splits attention heads into global and local attention components to capture comprehensive spatial-spectral features. The global attention component employs global attention on downsampled feature maps to capture low-frequency information, while the local attention component focuses on high-frequency details using non-overlapping window-based local attention. The CGFN enhances the feature representation via convolutions and a cross-gating mechanism in parallel paths. The proposed GLAFormer is evaluated on three HSI datasets. The results demonstrate its superiority over state-of-the-art HSI change detection methods. The source code of GLAFormer is available at this https URL.
- [25] arXiv:2411.14135 [pdf, html, other]
-
Title: Compact Visual Data Representation for Green Multimedia -- A Human Visual System Perspective
Subjects: Image and Video Processing (eess.IV); Multimedia (cs.MM)
The Human Visual System (HVS), with its intricate sophistication, is capable of achieving ultra-compact information compression for visual signals. This remarkable ability is coupled with high generalization capability and energy efficiency. By contrast, the state-of-the-art Versatile Video Coding (VVC) standard achieves a compression ratio of around 1,000 times for raw visual data. This notable disparity motivates the research community to draw inspiration to effectively handle the immense volume of visual data in a green way. Therefore, this paper provides a survey of how visual data can be efficiently represented for green multimedia, in particular when the ultimate task is knowledge extraction instead of visual signal reconstruction. We introduce recent research efforts that promote green, sustainable, and efficient multimedia in this field. Moreover, we discuss how the deep understanding of the HVS can benefit the research community, and envision the development of future green multimedia technologies.
- [26] arXiv:2411.14147 [pdf, html, other]
-
Title: Spiking neural networks: Towards bio-inspired multimodal perception in robotics
Katerina Maria Oikonomou, Vasiliki Balaska, Konstantinos A. Tsintotas, Christos N. Mavridis, Ioannis Kansizoglou, Antonios Gasteratos
Subjects: Image and Video Processing (eess.IV)
Spiking neural networks (SNNs) have attracted evident interest in recent years, stemming from neuroscience and reaching the field of artificial intelligence. However, due to their nature, SNNs remain far from achieving the exceptional performance of deep neural networks (DNNs). As a result, many scholars are exploring ways to enhance SNNs by using learning techniques from DNNs. While this approach has been shown to achieve considerable improvements in SNN performance, we propose another perspective: enhancing the biological plausibility of the models to fully leverage the advantages of SNNs. We aim to propose a brain-like combination of audio-visual signal processing for recognition tasks, intended for more bio-plausible human-robot interaction applications.
- [27] arXiv:2411.14153 [pdf, html, other]
-
Title: MVANet: Multi-Stage Video Attention Network for Sound Event Localization and Detection with Source Distance Estimation
Subjects: Audio and Speech Processing (eess.AS)
Sound event localization and detection with source distance estimation (3D SELD) involves not only identifying the sound category and its direction-of-arrival (DOA) but also predicting the source's distance, aiming to provide full information about the sound position. This paper proposes a multi-stage video attention network (MVANet) for audio-visual (AV) 3D SELD. Multi-stage audio features are used to adaptively capture the spatial information of sound sources in videos. We propose a novel output representation that combines the DOA with distance of sound sources by calculating the real Cartesian coordinates to address the newly introduced source distance estimation (SDE) task in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 Challenge. We also employ a variety of effective data augmentation and pre-training methods. Experimental results on the STARSS23 dataset have proven the effectiveness of our proposed MVANet. By integrating the aforementioned techniques, our system outperforms the top-ranked method we used in the AV 3D SELD task of the DCASE 2024 Challenge without model ensemble. The code will be made publicly available in the future.
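The output representation described above combines direction-of-arrival and distance into Cartesian coordinates. The sketch below shows one common way to do that conversion; it is illustrative only and not the paper's code. The function names, the use of degrees, and the axis convention (azimuth measured in the x-y plane from the x axis, elevation measured up from that plane) are assumptions.

```python
import math


def doa_distance_to_cartesian(azimuth_deg, elevation_deg, distance_m):
    """Map (azimuth, elevation, distance) to a Cartesian source position.

    Assumed convention: azimuth in the x-y plane from +x, elevation from that plane.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance_m * math.cos(el) * math.cos(az)
    y = distance_m * math.cos(el) * math.sin(az)
    z = distance_m * math.sin(el)
    return x, y, z


def cartesian_to_doa_distance(x, y, z):
    """Inverse mapping: recover (azimuth, elevation, distance) from a position."""
    distance = math.sqrt(x * x + y * y + z * z)
    if distance == 0.0:
        return 0.0, 0.0, 0.0  # direction is undefined at the origin
    azimuth = math.degrees(math.atan2(y, x))
    # Clamp to guard against rounding slightly outside [-1, 1].
    ratio = max(-1.0, min(1.0, z / distance))
    elevation = math.degrees(math.asin(ratio))
    return azimuth, elevation, distance


# Round trip for a source 2.5 m away at 30 degrees azimuth, 20 degrees elevation.
position = doa_distance_to_cartesian(30.0, 20.0, 2.5)
print(cartesian_to_doa_distance(*position))
```

Under this convention, a single regression target carries both direction (the vector's orientation) and distance (its length), so a model does not need separate output heads for the two quantities.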
- [28] arXiv:2411.14172 [pdf, html, other]
-
Title: TaQ-DiT: Time-aware Quantization for Diffusion Transformers
Subjects: Image and Video Processing (eess.IV)
Transformer-based diffusion models, dubbed Diffusion Transformers (DiTs), have achieved state-of-the-art performance in image and video generation tasks. However, their large model size and slow inference speed limit their practical applications, calling for model compression methods such as quantization. Unfortunately, existing DiT quantization methods overlook (1) the impact of reconstruction and (2) the varying quantization sensitivities across different layers, which hinder their achievable performance. To tackle these issues, we propose innovative time-aware quantization for DiTs (TaQ-DiT). Specifically, (1) we observe a non-convergence issue when reconstructing weights and activations separately during quantization and introduce a joint reconstruction method to resolve this problem. (2) We discover that Post-GELU activations are particularly sensitive to quantization due to their significant variability across different denoising steps as well as extreme asymmetries and variations within each step. To address this, we propose time-variance-aware transformations to facilitate more effective quantization. Experimental results show that when quantizing DiTs' weights to 4-bit and activations to 8-bit (W4A8), our method significantly surpasses previous quantization methods.
- [29] arXiv:2411.14184 [pdf, html, other]
-
Title: Deep Learning Approach for Enhancing Oral Squamous Cell Carcinoma with LIME Explainable AI Technique
Comments: Under Review at an IEEE conference
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
The goal of the present study is to analyze the application of deep learning models to improve the diagnostic performance for oral squamous cell carcinoma (OSCC) in a longitudinal cohort study using the Histopathological Imaging Database for oral cancer analysis. The dataset consisted of 5192 images (2435 Normal and 2511 OSCC), which were allocated among training, testing, and validation sets, with the OSCC group accounting for about 52% of the images. Performance was validated on a combined set containing an almost equal number of samples per class, as the database was divided using a stratified splitting technique that preserved this near-even class proportion. We selected four deep-learning architectures for evaluation in the present study: ResNet101, DenseNet121, VGG16, and EfficientNetB3. EfficientNetB3 was found to be the best, with an accuracy of 98.33% and an F1 score of 0.9844, and it required remarkably less computing power than the other models. The next best was DenseNet121, with 90.24% accuracy and an F1 score of 90.45%. Moreover, we employed the Local Interpretable Model-agnostic Explanations (LIME) method to clarify why EfficientNetB3 made certain decisions, improving the explainability and trustworthiness of its predictions. This work provides evidence that the EfficientNetB3 model, combined with explainable AI techniques such as LIME, may enable superior OSCC diagnosis, and lays important groundwork to build on towards clinical usage.
- [30] arXiv:2411.14250 [pdf, html, other]
-
Title: CP-UNet: Contour-based Probabilistic Model for Medical Ultrasound Images Segmentation
Comments: 4 pages, 4 figures, 2 tables; for ICASSP 2025
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Deep learning-based segmentation methods are widely utilized for detecting lesions in ultrasound images. Throughout the imaging procedure, the attenuation and scattering of ultrasound waves cause contour blurring and the formation of artifacts, limiting the clarity of the acquired ultrasound images. To overcome this challenge, we propose a contour-based probabilistic segmentation model CP-UNet, which guides the segmentation network to enhance its focus on contour during decoding. We design a novel down-sampling module to enable the contour probability distribution modeling and encoding stages to acquire global-local features. Furthermore, the Gaussian Mixture Model utilizes optimized features to model the contour distribution, capturing the uncertainty of lesion boundaries. Extensive experiments with several state-of-the-art deep learning segmentation methods on three ultrasound image datasets show that our method performs better on breast and thyroid lesions segmentation.
- [31] arXiv:2411.14269 [pdf, html, other]
-
Title: Guided MRI Reconstruction via Schrödinger Bridge
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Magnetic Resonance Imaging (MRI) is a multi-contrast imaging technique in which different contrast images share similar structural information. However, conventional diffusion models struggle to effectively leverage this structural similarity. Recently, the Schrödinger Bridge (SB), a nonlinear extension of the diffusion model, has been proposed to establish diffusion paths between any distributions, allowing the incorporation of guided priors. This study proposes an SB-based, multi-contrast image-guided reconstruction framework that establishes a diffusion bridge between the guiding and target image distributions. By using the guiding image along with data consistency during sampling, the target image is reconstructed more accurately. To better address structural differences between images, we introduce an inversion strategy from the field of image editing, termed $\mathbf{I}^2$SB-inversion. Experiments on a paired T1 and T2-FLAIR dataset demonstrate that $\mathbf{I}^2$SB-inversion achieves a high acceleration of up to 14.4 and outperforms existing methods in terms of both reconstruction accuracy and stability.
- [32] arXiv:2411.14319 [pdf, other]
-
Title: Iteration-Free Cooperative Distributed MPC through Multiparametric Programming
Subjects: Systems and Control (eess.SY)
Cooperative Distributed Model Predictive Control (DiMPC) architecture employs local MPC controllers to control different subsystems, exchanging information with each other through an iterative procedure to enhance overall control performance compared to the decentralized architecture. However, this method can result in high communication between the controllers and computational costs. In this work, the amount of information exchanged and the computational costs of DiMPC are reduced significantly by developing novel iteration-free solution algorithms based on multiparametric (mp) programming. These algorithms replace the iterative procedure with simultaneous solutions of explicit mpDiMPC control law functions. The reduced communication among local controllers decreases system latency, which is crucial for real-time control applications. The effectiveness of the proposed iteration-free mpDiMPC algorithms is demonstrated through comprehensive numerical simulations involving groups of coupled linear subsystems, which are interconnected through their inputs and a cooperative plant-wide cost function.
- [33] arXiv:2411.14346 [pdf, html, other]
-
Title: Lower Dimensional Spherical Representation of Medium Voltage Load Profiles for Visualization, Outlier Detection, and Generative Modelling
Edgar Mauricio Salazar Duque, Bart van der Holst, Pedro P. Vergara, Juan S. Giraldo, Phuong H. Nguyen, Anne Van der Molen, Han (J.G.) Slootweg
Subjects: Systems and Control (eess.SY)
This paper presents the spherical lower dimensional representation for daily medium voltage load profiles, based on principal component analysis. The objective is to unify and simplify the tasks for (i) clustering visualisation, (ii) outlier detection and (iii) generative profile modelling under one concept. The lower dimensional projection of standardised load profiles unveils a latent distribution in a three-dimensional sphere. This spherical structure allows us to detect outliers by fitting probability distribution models in the spherical coordinate system, identifying measurements that deviate from the spherical shape. The same latent distribution exhibits an arc shape, suggesting an underlying order among load profiles. We develop a principal curve technique to uncover this order based on similarity, offering new advantages over conventional clustering techniques. This finding reveals that energy consumption in a wide region can be seen as a continuously changing process. Furthermore, we combined the principal curve with a von Mises-Fisher distribution to create a model capable of generating profiles with continuous mixtures between clusters. The presence of the spherical distribution is validated with data from four municipalities in the Netherlands. The uncovered spherical structure implies the possibility of employing new mathematical tools from directional statistics and differential geometry for load profile modelling.
- [34] arXiv:2411.14353 [pdf, other]
-
Title: Enhancing Medical Image Segmentation with Deep Learning and Diffusion Models
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Medical image segmentation is crucial for accurate clinical diagnoses, yet it faces challenges such as low contrast between lesions and normal tissues, unclear boundaries, and high variability across patients. Deep learning has improved segmentation accuracy and efficiency, but it still relies heavily on expert annotations and struggles with the complexities of medical images. The small size of medical image datasets and the high cost of data acquisition further limit the performance of segmentation networks. Diffusion models, with their iterative denoising process, offer a promising alternative for better detail capture in segmentation. However, they face difficulties in accurately segmenting small targets and maintaining the precision of boundary details. This article discusses the importance of medical image segmentation, the limitations of current deep learning approaches, and the potential of diffusion models to address these challenges.
- [35] arXiv:2411.14360 [pdf, html, other]
-
Title: Integrated Positioning and Communication via LEO Satellites: Opportunities and Challenges
Subjects: Signal Processing (eess.SP)
Low Earth orbit (LEO) satellites, as a prominent technology in the 6G non-terrestrial network, offer both positioning and communication capabilities. While these two applications have each been extensively studied and have achieved substantial progress in recent years, the potential synergistic benefits of integrating them remain an underexplored yet promising avenue. This article comprehensively analyzes the integrated positioning and communication (IPAC) systems on LEO satellites. By leveraging the distinct characteristics of LEO satellites, we examine how communication systems can enhance positioning accuracy and, conversely, how positioning information can be exploited to improve communication efficiency. In particular, we present two case studies to illustrate the potential of such integration. Finally, several key open research challenges in the LEO-based IPAC systems are discussed.
- [36] arXiv:2411.14365 [pdf, other]
-
Title: Formal Simulation and Visualisation of Hybrid Programs
Pedro Mendes (University of Minho, Portugal), Ricardo Correia (University of Minho, Portugal), Renato Neves (INESC-TEC & University of Minho, Portugal), José Proença (CISTER, Faculty of Sciences of the University of Porto, Portugal)
Comments: In Proceedings FMAS2024, arXiv:2411.13215
Journal-ref: EPTCS 411, 2024, pp. 20-37
Subjects: Systems and Control (eess.SY); Programming Languages (cs.PL)
The design and analysis of systems that combine computational behaviour with the continuous dynamics of physical processes - such as movement, velocity, and voltage - is a famously challenging task. Several theoretical results from programming theory have emerged in the last decades to tackle the issue; some of these are the basis of a proof-of-concept tool, called Lince, that aids in the analysis of such systems by presenting simulations of their respective behaviours.
However, being a proof-of-concept, the tool is quite limited with respect to usability, and when attempting to apply it to a set of common, concrete problems, involving autonomous driving and others, it either simply cannot simulate them or fails to provide a satisfactory user experience.
The current work complements the aforementioned theoretical approaches with a more practical perspective, by improving Lince along several dimensions: to name a few, richer syntactic constructs, more operations, more informative plotting systems and error messages, and better overall performance. We illustrate our improvements via a variety of examples that involve both autonomous driving and electrical systems.
- [37] arXiv:2411.14385 [pdf, html, other]
-
Title: Enhancing Diagnostic Precision in Gastric Bleeding through Automated Lesion Segmentation: A Deep DuS-KFCM Approach
Xian-Xian Liu, Mingkun Xu, Yuanyuan Wei, Huafeng Qin, Qun Song, Simon Fong, Feng Tien, Wei Luo, Juntao Gao, Zhihua Zhang, Shirley Siu
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Timely and precise classification and segmentation of gastric bleeding in endoscopic imagery are pivotal for the rapid diagnosis and intervention of gastric complications, which is critical in life-saving medical procedures. Traditional methods grapple with the challenge posed by the indistinguishable intensity values of bleeding tissues adjacent to other gastric structures. Our study seeks to revolutionize this domain by introducing a novel deep learning model, the Dual Spatial Kernelized Constrained Fuzzy C-Means (Deep DuS-KFCM) clustering algorithm. This Hybrid Neuro-Fuzzy system synergizes Neural Networks with Fuzzy Logic to offer a highly precise and efficient identification of bleeding regions. Implementing a two-fold coarse-to-fine strategy for segmentation, this model initially employs the Spatial Kernelized Fuzzy C-Means (SKFCM) algorithm enhanced with spatial intensity profiles and subsequently harnesses the state-of-the-art DeepLabv3+ with ResNet50 architecture to refine the segmentation output. Through extensive experiments across mainstream gastric bleeding and red spots datasets, our Deep DuS-KFCM model demonstrated unprecedented accuracy rates of 87.95%, coupled with a specificity of 96.33%, outperforming contemporary segmentation methods. The findings underscore the model's robustness against noise and its outstanding segmentation capabilities, particularly for identifying subtle bleeding symptoms, thereby presenting a significant leap forward in medical image processing.
- [38] arXiv:2411.14418 [pdf, html, other]
-
Title: Multimodal 3D Brain Tumor Segmentation with Adversarial Training and Conditional Random Field
Comments: 13 pages, 7 figures, Annual Conference on Medical Image Understanding and Analysis (MIUA) 2024
Journal-ref: Medical Image Understanding and Analysis (MIUA), Lecture Notes in Computer Science, Springer, vol. 14859, 2024
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Accurate brain tumor segmentation remains a challenging task due to the structural complexity and great individual differences of gliomas. Leveraging the pre-eminent detail resilience of CRF and the spatial feature extraction capacity of V-net, we propose a multimodal 3D Volume Generative Adversarial Network (3D-vGAN) for precise segmentation. The model utilizes Pseudo-3D for V-net improvement, adds a conditional random field after the generator, and uses the original image as supplemental guidance. Results on the BraTS-2018 dataset show that 3D-vGAN outperforms classical segmentation models, including U-net, GAN, FCN and 3D V-net, reaching specificity over 99.8%.
New submissions (showing 38 of 38 entries)
- [39] arXiv:2411.13560 (cross-list from cs.AI) [pdf, html, other]
-
Title: AMSnet-KG: A Netlist Dataset for LLM-based AMS Circuit Auto-Design Using Knowledge Graph RAG
Yichen Shi, Zhuofu Tao, Yuhao Gao, Tianjia Zhou, Cheng Chang, Yaxing Wang, Bingyu Chen, Genhao Zhang, Alvin Liu, Zhiping Yu, Ting-Jung Lin, Lei He
Subjects: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Signal Processing (eess.SP)
High-performance analog and mixed-signal (AMS) circuits are mainly full-custom designed, which is time-consuming and labor-intensive. A significant portion of the effort is experience-driven, which makes the automation of AMS circuit design a formidable challenge. Large language models (LLMs) have emerged as powerful tools for Electronic Design Automation (EDA) applications, fostering advancements in the automatic design process for large-scale AMS circuits. However, the absence of high-quality datasets has led to issues such as model hallucination, which undermines the robustness of automatically generated circuit designs. To address this issue, this paper introduces AMSnet-KG, a dataset encompassing various AMS circuit schematics and netlists. We construct a knowledge graph with annotations on detailed functional and performance characteristics. Facilitated by AMSnet-KG, we propose an automated AMS circuit generation framework that utilizes the comprehensive knowledge embedded in LLMs. We first formulate a design strategy (e.g., circuit architecture using a number of circuit components) based on required specifications. Next, matched circuit components are retrieved and assembled into a complete topology, and transistor sizing is obtained through Bayesian optimization. Simulation results of the netlist are fed back to the LLM for further topology refinement, ensuring the circuit design specifications are met. We perform case studies of operational amplifier and comparator design to verify the automatic design flow from specifications to netlists with minimal human effort. The dataset used in this paper will be open-sourced upon publishing of this paper.
- [40] arXiv:2411.13766 (cross-list from cs.SD) [pdf, html, other]
-
Title: Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge
Ruiyang Qin, Dancheng Liu, Gelei Xu, Zheyu Yan, Chenhui Xu, Yuting Hu, X. Sharon Hu, Jinjun Xiong, Yiyu Shi
Comments: 7 pages, 8 figures
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
The combination of Large Language Models (LLMs) and Automatic Speech Recognition (ASR), when deployed on edge devices (called edge ASR-LLM), can serve as a powerful personalized assistant to enable audio-based interaction for users. Compared to text-based interaction, edge ASR-LLM allows accessible and natural audio interactions. Unfortunately, existing ASR-LLM models are mainly trained in high-performance computing environments and produce substantial model weights, making them difficult to deploy on edge devices. More importantly, to better serve users' personalized needs, the ASR-LLM must be able to learn from each distinct user, given that audio input often contains highly personalized characteristics that necessitate personalized on-device training. Since individually fine-tuning the ASR or LLM often leads to suboptimal results due to modality-specific limitations, end-to-end training ensures seamless integration of audio features and language understanding (cross-modal alignment), ultimately enabling a more personalized and efficient adaptation on edge devices. However, due to the complex training requirements and substantial computational demands of existing approaches, cross-modal alignment between ASR audio and LLM can be challenging on edge devices. In this work, we propose a resource-efficient cross-modal alignment framework that bridges ASR and LLMs on edge devices to handle personalized audio input. Our framework enables efficient ASR-LLM alignment on resource-constrained devices like NVIDIA Jetson Orin (8GB RAM), achieving 50x training time speedup while improving the alignment quality by more than 50%. To the best of our knowledge, this is the first work to study efficient ASR-LLM alignment on resource-constrained edge devices.
- [41] arXiv:2411.13785 (cross-list from cs.IT) [pdf, html, other]
-
Title: Throughput Maximization for Movable Antenna Systems with Movement Delay Consideration
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
In this paper, we model the minimum achievable throughput within a transmission block of restricted duration and aim to maximize it in movable antenna (MA)-enabled multiuser downlink communications. Particularly, we account for the antenna moving delay caused by mechanical movement, which has not been fully considered in previous studies, and reveal the trade-off between the delay and signal-to-interference-plus-noise ratio at users. To this end, we first consider a single-user setup to analyze the necessity of antenna movement. By quantizing the virtual angles of arrival, we derive the requisite region size for antenna moving, design the initial MA position, and elucidate the relationship between quantization resolution and moving region size. Furthermore, an efficient algorithm is developed to optimize MA position via successive convex approximation, which is subsequently extended to the general multiuser setup. Numerical results demonstrate that the proposed algorithms outperform fixed-position antenna schemes and existing ones without consideration of movement delay. Additionally, our algorithms exhibit excellent adaptability and stability across various transmission block durations and moving region sizes, and are robust to different antenna moving speeds. This allows the hardware cost of MA-aided systems to be reduced by employing low rotational speed motors.
- [42] arXiv:2411.13811 (cross-list from cs.SD) [pdf, html, other]
-
Title: X-CrossNet: A complex spectral mapping approach to target speaker extraction with cross attention speaker embedding fusion
Subjects: Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Target speaker extraction (TSE) is a technique for isolating a target speaker's voice from mixed speech using auxiliary features associated with the target speaker. This approach addresses the cocktail party problem and is generally considered more promising for practical applications than conventional speech separation methods. Although academic research in this area has achieved high accuracy and evaluation scores on public datasets, most models exhibit significantly reduced performance in real-world noisy or reverberant conditions. To address this limitation, we propose a novel TSE model, X-CrossNet, which leverages CrossNet as its backbone. CrossNet is a speech separation network specifically optimized for challenging noisy and reverberant environments, achieving state-of-the-art performance in tasks such as speaker separation under these conditions. Additionally, to enhance the network's ability to capture and utilize auxiliary features of the target speaker, we integrate a Cross-Attention mechanism into the global multi-head self-attention (GMHSA) module within each CrossNet block. This facilitates more effective integration of target speaker features with mixed speech features. Experimental results show that our method performs superior separation on the WSJ0-2mix and WHAMR! datasets, demonstrating strong robustness and stability.
- [43] arXiv:2411.13860 (cross-list from cs.CV) [pdf, html, other]
-
Title: Decoupled Sparse Priors Guided Diffusion Compression Model for Point Clouds
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Lossy compression methods rely on an autoencoder to transform a point cloud into latent points for storage, leaving the inherent redundancy of latent representations unexplored. To reduce redundancy in latent points, we propose a sparse priors guided method that achieves high reconstruction quality, especially at high compression ratios. This is accomplished by a dual-density scheme separately processing the latent points (intended for reconstruction) and the decoupled sparse priors (intended for storage). Our approach features an efficient dual-density data flow that relaxes size constraints on latent points, and hybridizes a progressive conditional diffusion model to encapsulate essential details for reconstruction within the conditions, which are decoupled hierarchically into intra-point and inter-point priors. Specifically, our method encodes the original point cloud into latent points and decoupled sparse priors through separate encoders. Latent points serve as intermediates, while sparse priors act as adaptive conditions. We then employ a progressive attention-based conditional denoiser to generate latent points conditioned on the decoupled priors, allowing the denoiser to dynamically attend to geometric and semantic cues from the priors at each encoding and decoding layer. Additionally, we integrate the local distribution into the arithmetic encoder and decoder to enhance local context modeling of the sparse points. The original point cloud is reconstructed through a point decoder. Compared to state-of-the-art methods, our method obtains a superior rate-distortion trade-off, as evidenced by extensive evaluations on the ShapeNet dataset and standard test datasets from the MPEG group, including 8iVFB and Owlii.
- [44] arXiv:2411.13916 (cross-list from cs.RO) [pdf, html, other]
-
Title: Joint-repositionable Inner-wireless Planar Snake Robot
Authors: Ayato Kanada, Ryo Takahashi, Keito Hayashi, Ryusuke Hosaka, Wakako Yukita, Yasutaka Nakashima, Tomoyuki Yokota, Takao Someya, Mitsuhiro Kamezaki, Yoshihiro Kawahara, Motoji Yamamoto
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Bio-inspired multi-joint snake robots offer the advantage of terrain adaptability due to their limbless structure and high flexibility. However, the series of dozens of motor units in typical multi-joint snake robots results in a heavy body structure and high power consumption of hundreds of watts. This paper presents a joint-repositionable, inner-wireless snake robot that enables multi-joint-like locomotion using a low-powered underactuated mechanism. The snake robot, consisting of a series of flexible passive links, can dynamically change its joint coupling configuration by repositioning motor-driven joint units along rack gears inside the robot. Additionally, a soft robot skin wirelessly powers the internal joint units, avoiding the risk of wire tangling and disconnection caused by the movable joint units. The combination of the joint-repositionable mechanism and the wireless-charging-enabled soft skin achieves a high degree of bending, along with a lightweight structure of 1.3 kg and energy-efficient wireless power transmission of 7.6 watts.
- [45] arXiv:2411.13922 (cross-list from stat.ML) [pdf, html, other]
-
Title: Exponentially Consistent Nonparametric Clustering of Data Streams
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
In this paper, we consider nonparametric clustering of $M$ independent and identically distributed (i.i.d.) data streams generated from unknown distributions. The distributions of the $M$ data streams belong to $K$ underlying distribution clusters. Existing results on exponentially consistent nonparametric clustering algorithms, like single linkage-based (SLINK) clustering and $k$-medoids distribution clustering, assume that the maximum intra-cluster distance ($d_L$) is smaller than the minimum inter-cluster distance ($d_H$). First, in the fixed sample size (FSS) setting, we show that exponential consistency can be achieved for SLINK clustering under a less strict assumption, $d_I < d_H$, where $d_I$ is the maximum distance between any two sub-clusters of a cluster that partition the cluster. Note that $d_I < d_L$ in general. Our results show that SLINK is exponentially consistent for a larger class of problems than $k$-medoids distribution clustering. We also identify examples where $k$-medoids clustering is unable to find the true clusters, but SLINK is exponentially consistent. Then, we propose a sequential clustering algorithm, named SLINK-SEQ, based on SLINK and prove that it is also exponentially consistent. Simulation results show that the SLINK-SEQ algorithm requires a smaller expected number of samples than the FSS SLINK algorithm for the same probability of error.
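As a rough illustration of the single-linkage idea in this abstract, the sketch below clusters data streams by repeatedly merging the two closest clusters until $K$ remain. It is not the authors' implementation: the Kolmogorov-Smirnov distance between empirical CDFs, the names `ks_distance`, `single_linkage`, and `demo_streams`, and the assumption that $K$ is known in advance are all choices made here for illustration.

```python
import random


def ks_distance(x, y):
    """Kolmogorov-Smirnov distance between the empirical CDFs of two samples."""
    xs, ys = sorted(x), sorted(y)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        # Advance both empirical CDFs past every sample equal to v.
        while i < len(xs) and xs[i] == v:
            i += 1
        while j < len(ys) and ys[j] == v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


def single_linkage(streams, k):
    """Agglomerative single-linkage clustering of streams into k clusters."""
    n = len(streams)
    # Pairwise distances between streams (assumed distance: KS).
    dist = {(i, j): ks_distance(streams[i], streams[j])
            for i in range(n) for j in range(i + 1, n)}

    def link(a, b):
        # Single linkage: distance between the closest pair across clusters.
        return min(dist[min(i, j), max(i, j)] for i in a for j in b)

    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        _, p, q = min((link(clusters[p], clusters[q]), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]


def demo_streams(seed=0, n_samples=200):
    """Six synthetic streams: three from N(0, 1) and three from N(5, 1)."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, 1.0) for _ in range(n_samples)]
            for mu in (0.0, 0.0, 0.0, 5.0, 5.0, 5.0)]
```

Calling `single_linkage(demo_streams(), 2)` groups the first three streams together and the last three together, since the inter-cluster distances in this toy setup are far larger than the intra-cluster ones.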
- [46] arXiv:2411.13951 (cross-list from cs.LG) [pdf, html, other]
-
Title: A Dataset for Evaluating Online Anomaly Detection Approaches for Discrete Multivariate Time Series
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Systems and Control (eess.SY)
Benchmarking anomaly detection approaches for multivariate time series is challenging due to the lack of high-quality datasets. Current publicly available datasets are too small, lack diversity, and feature trivial anomalies, which hinders measurable progress in this research area. We propose a solution: a diverse, extensive, and non-trivial dataset generated via state-of-the-art simulation tools that reflects realistic behaviour of an automotive powertrain, including its multivariate, dynamic and variable-state properties. To cater for both unsupervised and semi-supervised anomaly detection settings, as well as time series generation and forecasting, we make different versions of the dataset available, where training and test subsets are offered in contaminated and clean versions, depending on the task. We also provide baseline results from a small selection of approaches based on deterministic and variational autoencoders, as well as a non-parametric approach. As expected, the baseline experimentation shows that the approaches trained on the semi-supervised version of the dataset outperform their unsupervised counterparts, highlighting a need for approaches more robust to contaminated training data.
- [47] arXiv:2411.13983 (cross-list from cs.MA) [pdf, html, other]
-
Title: Learning Two-agent Motion Planning Strategies from Generalized Nash Equilibrium for Model Predictive Control
Comments: Submitted to 2025 Learning for Dynamics and Control Conference (L4DC)
Subjects: Multiagent Systems (cs.MA); Robotics (cs.RO); Systems and Control (eess.SY)
We introduce an Implicit Game-Theoretic MPC (IGT-MPC), a decentralized algorithm for two-agent motion planning that uses a learned value function that predicts the game-theoretic interaction outcomes as the terminal cost-to-go function in a model predictive control (MPC) framework, guiding agents to implicitly account for interactions with other agents and maximize their reward. This approach applies to competitive and cooperative multi-agent motion planning problems which we formulate as constrained dynamic games. Given a constrained dynamic game, we randomly sample initial conditions and solve for the generalized Nash equilibrium (GNE) to generate a dataset of GNE solutions, computing the reward outcome of each game-theoretic interaction from the GNE. The data is used to train a simple neural network to predict the reward outcome, which we use as the terminal cost-to-go function in an MPC scheme. We showcase emerging competitive and coordinated behaviors using IGT-MPC in scenarios such as two-vehicle head-to-head racing and un-signalized intersection navigation. IGT-MPC offers a novel method integrating machine learning and game-theoretic reasoning into model-based decentralized multi-agent motion planning.
- [48] arXiv:2411.14030 (cross-list from cs.IT) [pdf, html, other]
-
Title: Performance Analysis of STAR-RIS-Assisted Cell-Free Massive MIMO Systems with Electromagnetic Interference and Phase Errors
Comments: 13 pages, 6 figures. This work has been submitted to the IEEE for possible publication
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Simultaneous Transmitting and Reflecting Reconfigurable Intelligent Surfaces (STAR-RISs) are being explored for sixth-generation (6G) networks. A promising configuration for their deployment is within cell-free massive multiple-input multiple-output (MIMO) systems. However, despite the advantages that STAR-RISs could bring, challenges such as electromagnetic interference (EMI) and phase errors may lead to significant performance degradation. In this paper, we investigate the impact of EMI and phase errors on STAR-RIS-assisted cell-free massive MIMO systems and propose techniques to mitigate these effects. We introduce a novel projected gradient descent (GD) algorithm for STAR-RIS coefficient matrix design by minimizing the local channel estimation normalised mean square error. We also derive closed-form expressions for the uplink and downlink spectral efficiency (SE) to analyze system performance with EMI and phase errors, in which fractional power control methods are applied for performance improvement. The results reveal that the projected GD algorithm can effectively tackle EMI and phase errors to improve estimation accuracy and compensate for performance degradation, with an SE improvement of nearly $10\%$ to $20\%$. Moreover, increasing the number of access points (APs), antennas per AP, and STAR-RIS elements can also improve SE performance. Applying STAR-RISs in the proposed system achieves a larger $25\%$-likely SE than conventional RISs. However, the advantages of employing more STAR-RIS elements are reduced when EMI is severe.
- [49] arXiv:2411.14088 (cross-list from cs.IT) [pdf, html, other]
-
Title: Channel Customization for Low-Complexity CSI Acquisition in Multi-RIS-Assisted MIMO Systems
Comments: Accepted by IEEE JSAC special issue on Next Generation Advanced Transceiver Technologies
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
The deployment of multiple reconfigurable intelligent surfaces (RISs) enhances the propagation environment by improving channel quality, but it also complicates channel estimation. Following the conventional wireless communication system design, which involves full channel state information (CSI) acquisition followed by RIS configuration, can reduce transmission efficiency due to substantial pilot overhead and computational complexity. This study introduces an innovative approach that integrates CSI acquisition and RIS configuration, leveraging the channel-altering capabilities of the RIS to reduce both the overhead and complexity of CSI acquisition. The focus is on multi-RIS-assisted systems, featuring both direct and reflected propagation paths. By applying a fast-varying reflection sequence during RIS configuration for channel training, the complex problem of channel estimation is decomposed into simpler, independent tasks. These fast-varying reflections effectively isolate transmit signals from different paths, streamlining the CSI acquisition process for both uplink and downlink communications with reduced complexity. In uplink scenarios, a positioning-based algorithm derives partial CSI, informing the adjustment of RIS parameters to create a sparse reflection channel, enabling precise reconstruction of the uplink channel. Downlink communication benefits from this strategically tailored reflection channel, allowing effective CSI acquisition with fewer pilot signals. Simulation results highlight the proposed methodology's ability to accurately reconstruct the reflection channel with minimal impact on the normalized mean square error while simultaneously enhancing spectral efficiency.
- [50] arXiv:2411.14207 (cross-list from cs.SD) [pdf, html, other]
-
Title: HARP: A Large-Scale Higher-Order Ambisonic Room Impulse Response Dataset
Comments: Submitted to ICASSP 2025 Workshop. Dataset and code to be uploaded at: this https URL
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
This contribution introduces a dataset of 7th-order Ambisonic Room Impulse Responses (HOA-RIRs), created using the Image Source Method. By employing higher-order Ambisonics, our dataset enables precise spatial audio reproduction, a critical requirement for realistic immersive audio applications. Leveraging virtual simulation, we present a unique microphone configuration, based on the superposition principle, designed to optimize sound field coverage while addressing the limitations of traditional microphone arrays. The presented 64-microphone configuration allows us to capture RIRs directly in the Spherical Harmonics domain. The dataset features a wide range of room configurations, encompassing variations in room geometry, acoustic absorption materials, and source-receiver distances. A detailed description of the simulation setup is provided to enable accurate reproduction. The dataset serves as a vital resource for researchers working on spatial audio, particularly in applications involving machine learning to improve room acoustics modeling and sound field synthesis. It further provides a very high level of spatial resolution and realism crucial for tasks such as source localization, reverberation prediction, and immersive sound reproduction.
- [51] arXiv:2411.14246 (cross-list from cs.RO) [pdf, html, other]
-
Title: Simulation-Aided Policy Tuning for Black-Box Robot Learning
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
How can robots learn and adapt to new tasks and situations with little data? Systematic exploration and simulation are crucial tools for efficient robot learning. We present a novel black-box policy search algorithm focused on data-efficient policy improvements. The algorithm learns directly on the robot and treats simulation as an additional information source to speed up the learning process. At the core of the algorithm, a probabilistic model learns the dependence of the policy parameters and the robot learning objective not only by performing experiments on the robot, but also by leveraging data from a simulator. This substantially reduces interaction time with the robot. Using this model, we can guarantee improvements with high probability for each policy update, thereby facilitating fast, goal-oriented learning. We evaluate our algorithm on simulated fine-tuning tasks and demonstrate the data-efficiency of the proposed dual-information source optimization algorithm. In a real robot learning experiment, we show fast and successful task learning on a robot manipulator with the aid of an imperfect simulator.
Cross submissions (showing 13 of 13 entries)
- [52] arXiv:2207.13021 (replaced) [pdf, other]
-
Title: CTVR-EHO TDA-IPH Topological Optimized Convolutional Visual Recurrent Network for Brain Tumor Segmentation and Classification
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
In today's world of health care, brain tumor detection has become common. However, manual brain tumor classification is time-consuming. Therefore, Deep Convolutional Neural Networks (DCNNs) are used by many researchers in the medical field for making accurate diagnoses and aiding in the patient's treatment. Traditional techniques have problems such as overfitting and the inability to extract necessary features. To overcome these problems, we developed the Topological Data Analysis based Improved Persistent Homology (TDA-IPH) and Convolutional Transfer learning and Visual Recurrent learning with Elephant Herding Optimization hyper-parameter tuning (CTVR-EHO) models for brain tumor segmentation and classification. Initially, the Topological Data Analysis based Improved Persistent Homology is designed to segment the brain tumor image. Then, from the segmented image, features are extracted using transfer learning (TL) via the AlexNet model and Bidirectional Visual Long Short-Term Memory (Bi-VLSTM). Next, Elephant Herding Optimization (EHO) is used to tune the hyperparameters of both networks to get an optimal result. Finally, the extracted features are concatenated and classified using the softmax activation layer. The simulation results of the proposed CTVR-EHO and TDA-IPH method are analyzed based on precision, accuracy, recall, loss, and F score metrics. When compared to other existing brain tumor segmentation and classification models, the proposed CTVR-EHO and TDA-IPH approaches show high accuracy (99.8%), high recall (99.23%), high precision (99.67%), and high F score (99.59%).
- [53] arXiv:2210.08624 (replaced) [pdf, html, other]
-
Title: Attention-Based Audio Embeddings for Query-by-Example
Comments: Accepted in ISMIR 2022
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
An ideal audio retrieval system efficiently and robustly recognizes a short query snippet from an extensive database. However, the performance of well-known audio fingerprinting systems falls short at high signal distortion levels. This paper presents an audio retrieval system that generates noise- and reverberation-robust audio fingerprints using the contrastive learning framework. Using these fingerprints, the method performs a comprehensive search to identify the query audio and precisely estimate its timestamp in the reference audio. Our framework involves training a CNN to maximize the similarity between pairs of embeddings extracted from clean audio and its corresponding distorted and time-shifted version. We employ a channel-wise spectral-temporal attention mechanism to better discriminate the audio by giving more weight to the salient spectral-temporal patches in the signal. Experimental results indicate that our system is efficient in computation and memory usage, is more accurate than competing state-of-the-art systems, particularly at higher distortion levels, and is scalable to a larger database.
- [54] arXiv:2212.09010 (replaced) [pdf, html, other]
-
Title: Risk-Sensitive Reinforcement Learning with Exponential Criteria
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
While reinforcement learning has shown experimental success in a number of applications, it is known to be sensitive to noise and perturbations in the parameters of the system, leading to high variance in the total reward amongst different episodes in slightly different environments. To introduce robustness, as well as sample efficiency, risk-sensitive reinforcement learning methods are being thoroughly studied. In this work, we provide a definition of robust reinforcement learning policies and formulate a risk-sensitive reinforcement learning problem to approximate them, by solving an optimization problem with respect to a modified objective based on exponential criteria. In particular, we study a model-free risk-sensitive variation of the widely-used Monte Carlo Policy Gradient algorithm and introduce a novel risk-sensitive online Actor-Critic algorithm based on solving a multiplicative Bellman equation using stochastic approximation updates. Analytical results suggest that the use of exponential criteria generalizes commonly used ad-hoc regularization approaches, improves sample efficiency, and introduces robustness with respect to perturbations in the model parameters and the environment. The implementation, performance, and robustness properties of the proposed methods are evaluated in simulated experiments.
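To make the idea of an exponential criterion concrete, the sketch below evaluates the standard exponential-utility objective $J_\beta = \frac{1}{\beta}\log \mathbb{E}[e^{\beta R}]$ on sampled returns $R$. For small $|\beta|$ this behaves like $\mathbb{E}[R] + \frac{\beta}{2}\mathrm{Var}[R]$, so a negative $\beta$ penalizes variance. This form, the function name `exponential_criterion`, and the sample values are assumptions made for illustration; the exact objective used in the paper may differ.

```python
import math
import statistics


def exponential_criterion(returns, beta):
    """Return (1/beta) * log(mean(exp(beta * R))) over sampled returns R."""
    if beta == 0:
        return statistics.fmean(returns)  # risk-neutral limit
    scaled = [beta * r for r in returns]
    shift = max(scaled)  # log-sum-exp shift for numerical stability
    log_mean_exp = shift + math.log(
        sum(math.exp(s - shift) for s in scaled) / len(scaled))
    return log_mean_exp / beta


# Two hypothetical return samples with the same mean but different spread.
low_var = [0.9, 1.1]
high_var = [0.0, 2.0]
risk_averse_low = exponential_criterion(low_var, beta=-1.0)
risk_averse_high = exponential_criterion(high_var, beta=-1.0)
```

With a negative `beta`, the high-variance sample scores lower than the low-variance one even though both have mean 1, which is the variance-penalizing behaviour that risk-sensitive methods exploit.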
- [55] arXiv:2302.09682 (replaced) [pdf, html, other]
-
Title: Dual Attention Model with Reinforcement Learning for Classification of Histology Whole-Slide Images
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Digital whole slide images (WSIs) are generally captured at microscopic resolution and encompass extensive spatial data. Directly feeding these images to deep learning models is computationally intractable due to memory constraints, while downsampling the WSIs risks incurring information loss. Alternatively, splitting the WSIs into smaller patches may result in a loss of important contextual information. In this paper, we propose a novel dual attention approach, consisting of two main components, both inspired by the visual examination process of a pathologist. The first component, a soft attention model, processes a low magnification view of the WSI to identify relevant regions of interest (ROIs), followed by a custom sampling method to extract diverse and spatially distinct image tiles from the selected ROIs. The second component, a hard attention classification model, further extracts a sequence of multi-resolution glimpses from each tile for classification. Since hard attention is non-differentiable, we train this component using reinforcement learning to predict the location of the glimpses. This approach allows the model to focus on essential regions instead of processing the entire tile, thereby aligning with a pathologist's way of diagnosis. The two components are trained in an end-to-end fashion using a joint loss function to demonstrate the efficacy of the model. The proposed model was evaluated on two WSI-level classification problems: Human epidermal growth factor receptor 2 scoring on breast cancer histology images and prediction of Intact/Loss status of two Mismatch Repair biomarkers from colorectal cancer histology images. We show that the proposed model achieves performance better than or comparable to the state-of-the-art methods while processing less than 10% of the WSI at the highest magnification and reducing the time required to infer the WSI-level label by more than 75%.
- [56] arXiv:2403.02288 (replaced) [pdf, html, other]
-
Title: PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings
Comments: Speaker Odyssey 2024
Subjects: Audio and Speech Processing (eess.AS)
A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, leading to poor real-world generalization. Mixture invariant training (MixIT) was proposed as an unsupervised alternative that uses real recordings, yet it struggles with overseparation and with adapting to long-form audio. We introduce PixIT, a joint approach that combines permutation invariant training (PIT) for speaker diarization (SD) and MixIT for SSep. With the small extra requirement of SD labels, it solves the problem of overseparation and allows stitching of locally separated sources by leveraging existing work on clustering-based neural SD. We measure the quality of the separated sources by applying automatic speech recognition (ASR) systems to them. PixIT boosts the performance of various ASR systems across two meeting corpora, in terms of both speaker-attributed and utterance-based word error rates, while not requiring any fine-tuning.
- [57] arXiv:2403.14931 (replaced) [pdf, html, other]
-
Title: Structured stability analysis of networked systems with uncertain links
Subjects: Systems and Control (eess.SY); Dynamical Systems (math.DS)
An input-output approach to stability analysis is explored for networked systems with uncertain link dynamics. The main result consists of a collection of integral quadratic constraints, which together imply robust stability of the uncertain networked system, under the assumption that stability is achieved with ideal links. The conditions are decentralized inasmuch as each involves only agent and uncertainty model parameters that are local to a corresponding link. This makes the main result, which imposes no restriction on network structure, suitable for the study of large-scale systems.
- [58] arXiv:2404.08610 (replaced) [pdf, html, other]
-
Title: Full-Duplex Beyond Self-Interference: The Unlimited Sensing Way
Comments: Accepted to IEEE Communications Letters
Subjects: Signal Processing (eess.SP)
The success of full-stack full-duplex communication systems depends on how effectively one can achieve digital self-interference cancellation (SIC). Towards this end, in this paper, we consider an unlimited sensing framework (USF)-enabled full-duplex system. We show that by injecting folding non-linearities in the sensing pipeline, one can not only suppress self-interference but also recover the signal of interest (SoI). This approach leads to a novel design of the receiver architecture that is complemented by a modulo-domain channel estimation method. We then demonstrate the advantages of the modulo ADC by analyzing the relationship between quantization noise, quantization bits, and dynamic range. Numerical experiments show that the USF-enabled receiver structure can achieve up to 40 dB digital SIC by using as few as 4 bits per sample. Our method outperforms the previous approach based on adaptive filters when it comes to SoI reconstruction, detection, and digital SIC performance.
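The folding non-linearity mentioned in this abstract can be illustrated with a centered modulo operation. The sketch below folds a signal whose amplitude exceeds the ADC range $[-\lambda, \lambda)$ and then unfolds it from first-order differences. It is not the receiver proposed in the paper: the names `fold` and `unfold`, the test signal, and the recovery rule (which assumes the first sample lies inside the range and consecutive samples differ by less than $\lambda$) are assumptions chosen to keep the example short.

```python
import math


def fold(x, lam):
    """Centered modulo: map x into [-lam, lam)."""
    return (x + lam) % (2 * lam) - lam


def unfold(y, lam):
    """Recover samples from folded samples y.

    Assumes |x[0]| < lam and |x[n+1] - x[n]| < lam, so that folding the
    differences of y yields the true differences of x.
    """
    x = [y[0]]
    for prev, cur in zip(y, y[1:]):
        x.append(x[-1] + fold(cur - prev, lam))
    return x


lam = 1.0  # hypothetical ADC threshold
# Slowly varying sinusoid with amplitude 3, i.e. three times the ADC range.
signal = [3.0 * math.sin(2 * math.pi * n / 200) for n in range(400)]
folded = [fold(s, lam) for s in signal]
recovered = unfold(folded, lam)
```

Here every folded sample stays within the ADC range, yet `recovered` matches `signal` up to floating-point error because the sampling is dense enough for the difference condition to hold.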
- [59] arXiv:2405.17141 (replaced) [pdf, html, other]
-
Title: MVMS-RCN: A Dual-Domain Unfolding CT Reconstruction with Multi-sparse-view and Multi-scale Refinement-correction
Comments: 14 pages, Accepted to IEEE Transactions on Computational Imaging, 2024
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
X-ray Computed Tomography (CT) is one of the most important diagnostic imaging techniques in clinical applications. Sparse-view CT imaging reduces the number of projection views to lower the radiation dose and alleviate the potential risk of radiation exposure. Most existing deep learning (DL) and deep unfolding sparse-view CT reconstruction methods: 1) do not fully use the projection data; 2) do not always link their architecture designs to a mathematical theory; 3) do not flexibly deal with multi-sparse-view reconstruction assignments. This paper aims to use mathematical ideas and design optimal DL imaging algorithms for sparse-view tomography reconstructions. We propose a novel dual-domain deep unfolding unified framework that offers a great deal of flexibility for multi-sparse-view CT reconstruction with different sampling views through a single model. This framework combines the theoretical advantages of model-based methods with the superior reconstruction performance of DL-based methods, resulting in the expected generalizability of DL. We propose a refinement module that utilizes the unfolding projection domain to refine full-sparse-view projection errors, as well as an image domain correction module that distills multi-scale geometric error corrections to reconstruct sparse-view CT. This provides us with a new way to explore the potential of projection information and a new perspective on designing network architectures. All parameters of our proposed framework are learnable end to end, and our method possesses the potential to be applied to plug-and-play reconstruction. Extensive experiments demonstrate that our framework is superior to other existing state-of-the-art methods. Our source codes are available at this https URL.
- [60] arXiv:2405.19347 (replaced) [pdf, html, other]
-
Title: Near-Field Spot Beamfocusing: A Correlation-Aware Transfer Learning Approach
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
3D spot beamfocusing (SBF), in contrast to conventional angular-domain beamforming, concentrates radiating power within a very small volume in both radial and angular domains in the near-field zone. Recently, channel-state-information (CSI)-independent machine learning (ML)-based approaches have been developed for effective SBF using extremely large-scale programmable metasurfaces (ELPMs). These methods involve dividing the ELPMs into subarrays and independently training them with Deep Reinforcement Learning to jointly focus the beam at the Desired Focal Point (DFP). This paper explores near-field SBF using ELPMs, addressing challenges associated with lengthy training times resulting from independent training of subarrays. To achieve a faster CSI-independent solution, inspired by the correlation between the beamfocusing matrices of the subarrays, we leverage transfer learning techniques. First, we introduce a novel similarity criterion based on the Phase Distribution Image of subarray apertures. Then we devise a subarray policy propagation scheme that transfers the knowledge from trained to untrained subarrays. We further enhance learning by introducing Quasi-Liquid-Layers as a revised version of the adaptive policy reuse technique. We show through simulations that the proposed scheme improves the training speed by about 5 times. Furthermore, for dynamic DFP management, we devise a DFP policy blending process, which augments the convergence rate up to 8-fold.
- [61] arXiv:2406.15656 (replaced) [pdf, html, other]
-
Title: Self-Supervised Adversarial Diffusion Models for Fast MRI Reconstruction
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Purpose: To propose a self-supervised deep learning-based compressed sensing MRI (DL-based CS-MRI) method named "Adaptive Self-Supervised Consistency Guided Diffusion Model (ASSCGD)" to accelerate data acquisition without requiring fully sampled datasets. Materials and Methods: We used the fastMRI multi-coil brain axial T2-weighted (T2-w) dataset from 1,376 cases and single-coil brain quantitative magnetization prepared 2 rapid acquisition gradient echoes (MP2RAGE) T1 maps from 318 cases to train and test our model. Robustness against domain shift was evaluated using two out-of-distribution (OOD) datasets: a multi-coil brain axial postcontrast T1-weighted (T1c) dataset from 50 cases and an axial T1-weighted (T1-w) dataset from 50 patients. Data were retrospectively subsampled at acceleration rates R in {2x, 4x, 8x}. ASSCGD partitions a random sampling pattern into two disjoint sets, ensuring data consistency during training. We compared our method with ReconFormer Transformer and SS-MRI, assessing performance using normalized mean squared error (NMSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM). Statistical tests included one-way analysis of variance (ANOVA) and multi-comparison Tukey's Honestly Significant Difference (HSD) tests. Results: ASSCGD preserved fine structures and brain abnormalities visually better than comparative methods at R = 8x for both multi-coil and single-coil datasets. It achieved the lowest NMSE at R in {4x, 8x}, and the highest PSNR and SSIM values at all acceleration rates for the multi-coil dataset. Similar trends were observed for the single-coil dataset, though SSIM values were comparable to ReconFormer at R in {2x, 8x}. These results were further confirmed by the voxel-wise correlation scatter plots. OOD results showed significant ($p \ll 10^{-5}$) improvements in undersampled image quality after reconstruction.
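Two ingredients named in this abstract, splitting a sampling pattern into two disjoint sets and scoring reconstructions with NMSE and PSNR, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the 50/50 split controlled by the hypothetical `fraction` parameter, and the use of flat lists in place of k-space arrays and images are not taken from the paper, and SSIM is omitted for brevity.

```python
import math
import random


def partition_mask(sampled_indices, fraction=0.5, seed=0):
    """Randomly split sampled k-space indices into two disjoint subsets."""
    rng = random.Random(seed)
    shuffled = list(sampled_indices)
    rng.shuffle(shuffled)
    cut = int(round(fraction * len(shuffled)))
    return sorted(shuffled[:cut]), sorted(shuffled[cut:])


def nmse(reference, estimate):
    """Normalized mean squared error: ||ref - est||^2 / ||ref||^2."""
    num = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    den = sum(r ** 2 for r in reference)
    return num / den


def psnr(reference, estimate, peak=None):
    """Peak signal-to-noise ratio in dB; peak defaults to max(|reference|).

    Undefined (division by zero) when the estimate equals the reference.
    """
    peak = max(abs(r) for r in reference) if peak is None else peak
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return 10.0 * math.log10(peak ** 2 / mse)
```

For example, `partition_mask(range(0, 64, 4))` splits the 16 sampled lines of a 4x-undersampled pattern into two sets of 8 whose union is the original pattern.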
- [62] arXiv:2407.06519 (replaced) [pdf, html, other]
-
Title: F2PAD: A General Optimization Framework for Feature-Level to Pixel-Level Anomaly Detection
Subjects: Image and Video Processing (eess.IV)
Image-based inspection systems have been widely deployed in manufacturing production lines. Due to the scarcity of defective samples, unsupervised anomaly detection that only leverages normal samples during training to detect various defects is popular. Existing feature-based methods, which utilize deep features from pretrained neural networks, show impressive performance in anomaly localization and require few training samples. However, the anomalous regions detected by these methods always exhibit inaccurate boundaries, which impedes downstream tasks. This deficiency has two causes: (i) the decreased resolution of high-level features compared with the original image, and (ii) the mixture of adjacent normal and anomalous pixels during feature extraction. To address these issues, we propose a novel unified optimization framework (F2PAD) that leverages the Feature-level information to guide the optimization process for Pixel-level Anomaly Detection in the inference stage. The proposed framework is universal and plug-and-play, and can enhance various feature-based methods with limited assumptions. Case studies are provided to demonstrate the effectiveness of our strategy, particularly when applied to three popular backbone methods: PaDiM, CFLOW-AD, and PatchCore.
- [63] arXiv:2407.10689 (replaced) [pdf, other]
-
Title: Classification of Heart Sounds Using Multi-Branch Deep Convolutional Network and LSTM-CNN
Comments: 22 pages
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
This paper presents a fast and cost-effective method for diagnosing cardiac abnormalities with high accuracy and reliability using low-cost systems in clinics. The primary limitation of automatic diagnosis of cardiac diseases is the rarity of correct and acceptable labeled samples, which can be expensive to prepare. To address this issue, two methods are proposed in this work. The first method is a unique Multi-Branch Deep Convolutional Neural Network (MBDCN) architecture inspired by human auditory processing, specifically designed to optimize feature extraction by employing various sizes of convolutional filters and the audio signal power spectrum as input. In the second method, called the Long Short-Term Memory-Convolutional Neural (LSCN) model, the network architecture additionally includes Long Short-Term Memory (LSTM) network blocks to improve feature extraction in the time domain. The innovative approach of combining multiple parallel branches consisting of one-dimensional convolutional layers along with LSTM blocks helps in achieving superior results in audio signal processing tasks. The experimental results demonstrate the superiority of the proposed methods over the state-of-the-art techniques. The overall classification accuracy of heart sounds with the LSCN network is more than 96%. The efficiency of this network is significant compared to common feature extraction methods such as Mel Frequency Cepstral Coefficients (MFCC) and wavelet transform. Therefore, the proposed method shows promising results in the automatic analysis of heart sounds and has potential applications in the diagnosis and early detection of cardiovascular diseases.
- [64] arXiv:2407.10921 (replaced) [pdf, html, other]
-
Title: Leveraging Bi-Focal Perspectives and Granular Feature Integration for Accurate Reliable Early Alzheimer's Detection
Comments: 14 pages, 12 figures, 6 tables
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Alzheimer's disease (AD) is the most common neurodegenerative disease, diagnosed in millions of patients annually. Current medical practice still faces challenges in the exact diagnosis and classification of AD from neuroimaging data. Traditional CNNs can extract a good amount of low-level information from an image but fail to capture minute high-level details, which is a significant challenge in detecting AD from MRI scans. To overcome this, we propose a novel Granular Feature Integration method that combines information extraction at different scales with an efficient information flow, enabling the model to capture both broad and fine-grained features simultaneously. We also propose a Bi-Focal Perspective mechanism to highlight the subtle neurofibrillary tangles and amyloid plaques in the MRI scans, ensuring that critical pathological markers are accurately identified. Our model achieved an F1-Score of 99.31%, precision of 99.24%, and recall of 99.51%. These scores prove that our model is significantly better than existing state-of-the-art (SOTA) CNNs.
- [65] arXiv:2409.11257 (replaced) [pdf, html, other]
-
Title: To What Extent do Open-loop and Feedback Nash Equilibria Diverge in General-Sum Linear Quadratic Dynamic Games?
Subjects: Systems and Control (eess.SY)
Dynamic games offer a versatile framework for modeling the evolving interactions of strategic agents, whose steady-state behavior can be captured by the Nash equilibria of the games. Nash equilibria are often computed in feedback, with policies depending on the state at each time, or in open-loop, with policies depending only on the initial state. Empirically, open-loop Nash equilibria (OLNE) could be more efficient to compute, while feedback Nash equilibria (FBNE) often encode more complex interactions. However, it remains unclear exactly which dynamic games yield FBNE and OLNE that differ significantly and which do not. To address this problem, we present a principled comparison study of OLNE and FBNE in linear quadratic (LQ) dynamic games. Specifically, we prove that the OLNE strategies of an LQ dynamic game can be synthesized by solving the coupled Riccati equations of an auxiliary LQ game with perturbed costs. The construction of the auxiliary game allows us to establish conditions under which OLNE and FBNE coincide and derive an upper bound on the deviation between FBNE and OLNE of an LQ game.
- [66] arXiv:2409.18862 (replaced) [pdf, html, other]
-
Title: Safe Decentralized Multi-Agent Control using Black-Box Predictors, Conformal Decision Policies, and Control Barrier Functions
Comments: 6 pages, 1 figure, submitted for ICRA 2025
Subjects: Systems and Control (eess.SY); Multiagent Systems (cs.MA); Robotics (cs.RO)
We address the challenge of safe control in decentralized multi-agent robotic settings, where agents use uncertain black-box models to predict other agents' trajectories. We use the recently proposed conformal decision theory to adapt the restrictiveness of control barrier functions-based safety constraints based on observed prediction errors. We use these constraints to synthesize controllers that balance between the objectives of safety and task accomplishment, despite the prediction errors. We provide an upper bound on the average over time of the value of a monotonic function of the difference between the safety constraint based on the predicted trajectories and the constraint based on the ground truth ones. We validate our theory through experimental results showing the performance of our controllers when navigating a robot in the multi-agent scenes in the Stanford Drone Dataset.
- [67] arXiv:2410.00392 (replaced) [pdf, html, other]
-
Title: MERIT: Multimodal Wearable Vital Sign Waveform Monitoring
Comments: 8 pages, 10 figures
Subjects: Systems and Control (eess.SY); Hardware Architecture (cs.AR)
Cardiovascular disease (CVD) is the leading cause of death and premature mortality worldwide, with occupational environments significantly influencing CVD risk, underscoring the need for effective cardiac monitoring and early warning systems. Existing methods of monitoring vital signs require subjects to remain stationary, which is impractical for daily monitoring as individuals are often in motion. To address this limitation, we propose MERIT, a multimodality-based wearable system designed for precise ECG waveform monitoring without movement restrictions. Daily activities, involving frequent arm movements, can significantly affect sensor data and complicate the reconstruction of accurate ECG signals. To mitigate motion impact and enhance ECG signal reconstruction, we introduce a deep independent component analysis (Deep-ICA) module and a multimodal fusion module. We conducted experiments with 15 subjects. Our results, compared with commercial wearable devices and existing methods, demonstrate that MERIT accurately reconstructs ECG waveforms during various office activities, offering a reliable solution for fine-grained cardiac monitoring in dynamic environments.
- [68] arXiv:2411.00656 (replaced) [pdf, html, other]
-
Title: Identification of Analytic Nonlinear Dynamical Systems with Non-asymptotic Guarantees
Comments: NeurIPS 2024
Subjects: Systems and Control (eess.SY)
This paper focuses on the system identification of an important class of nonlinear systems: linearly parameterized nonlinear systems, which enjoy wide applications in robotics and other mechanical systems. We consider two system identification methods: least-squares estimation (LSE), which is a point estimation method; and set-membership estimation (SME), which estimates an uncertainty set that contains the true parameters. We provide non-asymptotic convergence rates for LSE and SME under i.i.d. control inputs and control policies with i.i.d. random perturbations, both of which are considered non-active-exploration inputs. Compared with the counter-example based on piecewise-affine systems in the literature, the success of non-active exploration in our setting relies on a key assumption on the system dynamics: we require the system functions to be real-analytic. Our results, together with the piecewise-affine counter-example, reveal the importance of differentiability in nonlinear system identification through non-active exploration. Lastly, we numerically compare our theoretical bounds with the empirical performance of LSE and SME on a pendulum example and a quadrotor example.
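As a rough illustration of least-squares estimation for a linearly parameterized nonlinear system under i.i.d. inputs, the sketch below fits a toy scalar model. Everything in it is an assumption made for illustration and is not taken from the paper: the dynamics x_{t+1} = theta_1 x_t + theta_2 sin(x_t) + u_t + w_t, the feature map (x, sin x), the uniform input and noise distributions, and the function names `simulate` and `least_squares`.

```python
import math
import random


def simulate(theta, steps, rng, noise=0.01):
    """Roll out x_{t+1} = theta[0]*x_t + theta[1]*sin(x_t) + u_t + w_t (toy model)."""
    x, data = 0.0, []
    for _ in range(steps):
        u = rng.uniform(-1.0, 1.0)       # i.i.d. control input, no active exploration
        w = rng.uniform(-noise, noise)   # bounded process noise
        x_next = theta[0] * x + theta[1] * math.sin(x) + u + w
        data.append((x, u, x_next))
        x = x_next
    return data


def least_squares(data):
    """Solve the 2x2 normal equations for the features phi(x) = (x, sin x)."""
    s11 = s12 = s22 = b1 = b2 = 0.0
    for x, u, x_next in data:
        p1, p2 = x, math.sin(x)
        y = x_next - u                   # the input enters with a known unit gain
        s11 += p1 * p1
        s12 += p1 * p2
        s22 += p2 * p2
        b1 += p1 * y
        b2 += p2 * y
    det = s11 * s22 - s12 * s12          # determinant of the Gram matrix
    return ((s22 * b1 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det)


true_theta = (0.5, 0.3)
estimate = least_squares(simulate(true_theta, 2000, random.Random(0)))
print(estimate)
```

The model is linear in the unknown parameters even though it is nonlinear in the state, which is why ordinary least squares applies; the random input alone is what excites both features here.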
- [69] arXiv:2411.01589 (replaced) [pdf, html, other]
-
Title: BiT-MamSleep: Bidirectional Temporal Mamba for EEG Sleep Staging
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
In this paper, we address the challenges in automatic sleep stage classification, particularly the high computational cost, inadequate modeling of bidirectional temporal dependencies, and class imbalance issues faced by Transformer-based models. To address these limitations, we propose BiT-MamSleep, a novel architecture that integrates the Triple-Resolution CNN (TRCNN) for efficient multi-scale feature extraction with the Bidirectional Mamba (BiMamba) mechanism, which models both short- and long-term temporal dependencies through bidirectional processing of EEG data. Additionally, BiT-MamSleep incorporates an Adaptive Feature Recalibration (AFR) module and a temporal enhancement block to dynamically refine feature importance, optimizing classification accuracy without increasing computational complexity. To further improve robustness, we apply optimization techniques such as Focal Loss and SMOTE to mitigate class imbalance. Extensive experiments on four public datasets demonstrate that BiT-MamSleep significantly outperforms state-of-the-art methods, particularly in handling long EEG sequences and addressing class imbalance, leading to more accurate and scalable sleep stage classification.
- [70] arXiv:2411.05449 (replaced) [pdf, other]
-
Title: Unmanned F/A-18 Aircraft Landing Control on Aircraft Carrier in Adverse Conditions
Subjects: Systems and Control (eess.SY)
Carrier landings are a difficult control task due to wind disturbances and a changing trajectory. Demand for carrier-based drones is increasing. A robust and accurate landing control system is crucial to meet this demand. Control performance can be improved by using observers to estimate unknown variables and disturbances for feedback. This study applies a nonlinear observer to estimate the combined disturbance in the pitch dynamics of an F/A-18 during carrier landing. Additionally, controllers to regulate the velocity, rate of descent and vertical position are designed. A full model, including the nonlinear flight dynamics, controller, carrier deck motion, wind and measurement noise, is modelled numerically and implemented in software. Combined with proportional derivative control, the proposed pitch control method is shown to be very effective, converging 85% faster than a PID controller. The simulations verify that the pitch controller can quickly track a time-varying reference despite noise and disturbances. The positional controller used is found to be ineffective and requires improvement.
- [71] arXiv:2411.07249 (replaced) [pdf, html, other]
-
Title: SPDIM: Source-Free Unsupervised Conditional and Label Shift Adaptation in EEG
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
The non-stationary nature of electroencephalography (EEG) introduces distribution shifts across domains (e.g., days and subjects), posing a significant challenge to EEG-based neurotechnology generalization. Without labeled calibration data for target domains, the problem is a source-free unsupervised domain adaptation (SFUDA) problem. For scenarios with constant label distribution, Riemannian geometry-aware statistical alignment frameworks on the symmetric positive definite (SPD) manifold are considered state-of-the-art. However, many practical scenarios, including EEG-based sleep staging, exhibit label shifts. Here, we propose a geometric deep learning framework for SFUDA problems under specific distribution shifts, including label shifts. We introduce a novel, realistic generative model and show that prior Riemannian statistical alignment methods on the SPD manifold can compensate for specific marginal and conditional distribution shifts but hurt generalization under label shifts. As a remedy, we propose a parameter-efficient manifold optimization strategy termed SPDIM. SPDIM uses the information maximization principle to learn a single SPD-manifold-constrained parameter per target domain. In simulations, we demonstrate that SPDIM can compensate for the shifts under our generative model. Moreover, using public EEG-based brain-computer interface and sleep staging datasets, we show that SPDIM outperforms prior approaches.
- [72] arXiv:2411.10798 (replaced) [pdf, other]
-
Title: Unveiling Hidden Details: A RAW Data-Enhanced Paradigm for Real-World Super-Resolution
Long Peng, Wenbo Li, Jiaming Guo, Xin Di, Haoze Sun, Yong Li, Renjing Pei, Yang Wang, Yang Cao, Zheng-Jun Zha
Comments: We sincerely apologize, but due to some commercial confidentiality agreements related to the report, we have decided to withdraw the submission for now and will resubmit after making the necessary revisions
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Real-world image super-resolution (Real SR) aims to generate high-fidelity, detail-rich high-resolution (HR) images from low-resolution (LR) counterparts. Existing Real SR methods primarily focus on generating details from the LR RGB domain, often leading to a lack of richness or fidelity in fine details. In this paper, we pioneer the use of details hidden in RAW data to complement existing RGB-only methods, yielding superior outputs. We argue that key image processing steps in Image Signal Processing, such as denoising and demosaicing, inherently result in the loss of fine details in LR images, making LR RAW a valuable information source. To validate this, we present RealSR-RAW, a comprehensive dataset comprising over 10,000 pairs with LR and HR RGB images, along with corresponding LR RAW, captured across multiple smartphones under varying focal lengths and diverse scenes. Additionally, we propose a novel, general RAW adapter to efficiently integrate LR RAW data into existing CNNs, Transformers, and Diffusion-based Real SR models by suppressing the noise contained in LR RAW and aligning its distribution. Extensive experiments demonstrate that incorporating RAW data significantly enhances detail recovery and improves Real SR performance across ten evaluation metrics, including both fidelity and perception-oriented metrics. Our findings open a new direction for the Real SR task, and the dataset and code will be made available to support future research.
- [73] arXiv:2308.07266 (replaced) [pdf, html, other]
-
Title: Full Duplex Joint Communications and Sensing for 6G: Opportunities and Challenges
Chandan Kumar Sheemar, Sourabh Solanki, George C. Alexandropoulos, Eva Lagunas, Jorge Querol, Symeon Chatzinotas, Björn Ottersten
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
The paradigm of joint communications and sensing (JCAS) envisions a revolutionary integration of communication and radar functionalities within a unified hardware platform. This novel concept not only opens up unprecedented interoperability opportunities, but also exhibits unique design challenges. To this end, the success of JCAS is highly dependent on efficient full-duplex (FD) operation, which has the potential to enable simultaneous transmission and reception within the same frequency band. While JCAS research is lately expanding, there still exist relevant directions of investigation that hold tremendous potential to profoundly transform the sixth generation (6G), and beyond, cellular networks. This article presents new opportunities and challenges brought up by FD-enabled JCAS, taking into account the key technical peculiarities of FD systems. Unlike simplified JCAS scenarios, we delve into the most comprehensive configuration, encompassing uplink and downlink users, as well as monostatic and bistatic radars, all harmoniously coexisting to jointly push the boundaries of both communications and sensing. The performance improvements resulting from this advancement bring forth numerous new challenges, each meticulously examined and expounded upon.
- [74] arXiv:2309.10011 (replaced) [pdf, html, other]
-
Title: Universal Photorealistic Style Transfer: A Lightweight and Adaptive Approach
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Photorealistic style transfer aims to apply stylization while preserving the realism and structure of input content. However, existing methods often encounter challenges such as color tone distortions, dependency on pair-wise pre-training, inefficiency with high-resolution inputs, and the need for additional constraints in video style transfer tasks. To address these issues, we propose a Universal Photorealistic Style Transfer (UPST) framework that delivers accurate photorealistic style transfer on high-resolution images and videos without relying on pre-training. Our approach incorporates a lightweight StyleNet for per-instance transfer, ensuring color tone accuracy while supporting high-resolution inputs, maintaining rapid processing speeds, and eliminating the need for pretraining. To further enhance photorealism and efficiency, we introduce instance-adaptive optimization, which features an adaptive coefficient to prioritize content image realism and employs early stopping to accelerate network convergence. Additionally, UPST enables seamless video style transfer without additional constraints due to its strong non-color information preservation ability. Experimental results show that UPST consistently produces photorealistic outputs and significantly reduces GPU memory usage, making it an effective and universal solution for various photorealistic style transfer tasks.
- [75] arXiv:2312.10495 (replaced) [pdf, html, other]
-
Title: Computing Optimal Joint Chance Constrained Control Policies
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
We consider the problem of optimally controlling stochastic, Markovian systems subject to joint chance constraints over a finite-time horizon. For such problems, standard Dynamic Programming is inapplicable due to the time correlation of the joint chance constraints, which calls for non-Markovian, and possibly stochastic, policies. Hence, despite the popularity of this problem, solution approaches capable of providing provably-optimal and easy-to-compute policies are still missing. We fill this gap by augmenting the dynamics via a binary state, allowing us to characterize the optimal policies and develop a Dynamic Programming based solution method.
- [76] arXiv:2406.06371 (replaced) [pdf, html, other]
-
Title: mHuBERT-147: A Compact Multilingual HuBERT Model
Comments: Extended version of the Interspeech 2024 paper of same name
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment than the original method. We also apply a new multilingual batching up-sampling strategy, leveraging both language and dataset diversity. After 3 training iterations, our compact 95M parameter mHuBERT-147 outperforms larger models trained on substantially more data. We rank second and first on the ML-SUPERB 10min and 1h leaderboards, with SOTA scores for 3 tasks. Across ASR/LID tasks, our model consistently surpasses XLS-R (300M params; 436K hours) and demonstrates strong competitiveness against the much larger MMS (1B params; 491K hours). Our findings indicate that mHuBERT-147 is a promising model for multilingual speech tasks, offering an unprecedented balance between high performance and parameter efficiency.
- [77] arXiv:2409.10664 (replaced) [pdf, html, other]
-
Title: Proximal Gradient Dynamics: Monotonicity, Exponential Convergence, and Applications
Comments: Submitted to IEEE L-CSS and ACC, 7 pages, 1 figure
Subjects: Optimization and Control (math.OC); Signal Processing (eess.SP); Systems and Control (eess.SY)
In this letter, we study the proximal gradient dynamics. This recently proposed continuous-time dynamics solves optimization problems whose cost functions are separable into a nonsmooth convex component and a smooth component. First, we show that the cost function decreases monotonically along the trajectories of the proximal gradient dynamics. We then introduce a new condition that guarantees exponential convergence of the cost function to its optimal value, and show that this condition implies the proximal Polyak-Łojasiewicz condition. We also show that the proximal Polyak-Łojasiewicz condition guarantees exponential convergence of the cost function. Moreover, we extend these results to time-varying optimization problems, providing bounds for equilibrium tracking. Finally, we discuss applications of these findings, including the LASSO problem, certain matrix-based problems, and a numerical experiment on a feed-forward neural network.
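To make the monotone-decrease claim concrete, the sketch below integrates one commonly used form of the proximal gradient dynamics on a tiny LASSO problem. The abstract does not state the dynamics explicitly, so the form dx/dt = -x + prox_{gamma g}(x - gamma grad f(x)) is assumed here, with f(x) = 0.5*||Ax - b||^2 and g(x) = lam*||x||_1. The matrix, vector, step sizes, and function names are hypothetical, and forward Euler is only a stand-in for the continuous-time flow.

```python
def residual(A, b, x):
    """Return A x - b for a dense matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]


def cost(A, b, x, lam):
    """LASSO objective: 0.5*||A x - b||^2 + lam*||x||_1."""
    r = residual(A, b, x)
    return 0.5 * sum(ri * ri for ri in r) + lam * sum(abs(xj) for xj in x)


def grad_f(A, b, x):
    """Gradient of the smooth part: A^T (A x - b)."""
    r = residual(A, b, x)
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]


def soft_threshold(v, tau):
    """Proximal operator of tau*||.||_1, applied elementwise."""
    return [max(abs(vi) - tau, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]


def integrate(A, b, lam, gamma, h, steps):
    """Forward-Euler integration of dx/dt = -x + prox(x - gamma*grad f(x))."""
    x = [0.0] * len(A[0])
    costs = [cost(A, b, x, lam)]
    for _ in range(steps):
        g = grad_f(A, b, x)
        target = soft_threshold([xj - gamma * gj for xj, gj in zip(x, g)], gamma * lam)
        x = [xj + h * (tj - xj) for xj, tj in zip(x, target)]
        costs.append(cost(A, b, x, lam))
    return x, costs


# Hypothetical problem data; gamma is chosen below 1/L, where L is the
# largest eigenvalue of A^T A (about 2.87 for this matrix).
A = [[1.0, 0.5], [0.5, 1.0], [0.0, 1.0]]
b = [1.0, 2.0, 1.5]
x_final, costs = integrate(A, b, lam=0.1, gamma=0.3, h=0.1, steps=500)
print(x_final, costs[0], costs[-1])
```

With h = 1 the Euler update reduces to the ordinary proximal gradient iteration; for 0 < h <= 1 and gamma <= 1/L each update is a convex combination of the current point and the proximal gradient step, so the convex LASSO cost cannot increase along the discrete trajectory.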
- [78] arXiv:2410.02592 (replaced) [pdf, html, other]
-
Title: IC3M: In-Car Multimodal Multi-object Monitoring for Abnormal Status of Both Driver and Passengers
Comments: 16 pages, 17 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Recently, in-car monitoring has emerged as a promising technology for detecting early-stage abnormal status of the driver and providing timely alerts to prevent traffic accidents. Although training models with multimodal data enhances the reliability of abnormal status detection, the scarcity of labeled data and the imbalance of class distribution impede the extraction of critical abnormal state features, significantly deteriorating training performance. Furthermore, missing modalities due to environment and hardware limitations further exacerbate the challenge of abnormal status identification. More importantly, monitoring abnormal health conditions of passengers, particularly in elderly care, is of paramount importance but remains underexplored. To address these challenges, we introduce IC3M, an efficient camera-rotation-based multimodal framework for monitoring both driver and passengers in a car. IC3M comprises two key modules: an adaptive threshold pseudo-labeling strategy and a missing modality reconstruction module. The former customizes pseudo-labeling thresholds for different classes based on the class distribution, generating class-balanced pseudo labels to guide model training effectively, while the latter leverages cross-modality relationships learned from limited labels to accurately recover missing modalities by distribution transfer from available modalities. Extensive experimental results demonstrate that IC3M outperforms state-of-the-art benchmarks in accuracy, precision, and recall while exhibiting superior robustness under limited labeled data and severe missing modality.
- [79] arXiv:2410.12976 (replaced) [pdf, html, other]
-
Title: Kapitza-Inspired Stabilization of Non-Foster Circuits via Time Modulations
Comments: 10 pages (7 pages main text, 3 pages supplementary materials), 4 figures; a minor issue in Fig. 3(a) is corrected
Subjects: Applied Physics (physics.app-ph); Systems and Control (eess.SY)
With his formal analysis in 1951, the physicist Pyotr Kapitza demonstrated that an inverted pendulum with an externally vibrating base can be stable in its upper position, thus overcoming the force of gravity. Kapitza's work is an example of how an originally unstable system can become stable after a minor perturbation of its properties or initial conditions is applied. Inspired by his ideas, we show how non-Foster circuits can be stabilized with the application of external \textit{electrical vibration}, i.e., time modulations. Non-Foster circuits are highly appreciated in the engineering community since their bandwidth characteristics are not limited by passive-circuit bounds. Unfortunately, non-Foster circuits are usually unstable and must be stabilized prior to operation. Here, we focus on the study of non-Foster $L(t)C$ circuits with time-varying inductors and time-invariant negative capacitors. We find an intrinsic connection between Kapitza's inverted pendulum and non-Foster $L(t)C$ resonators. Moreover, we show how positive time-varying modulations of $L(t)>0$ can overcome and stabilize non-Foster negative capacitances $C<0$. These findings open up an alternative manner of stabilizing electric circuits with the use of time modulations, and lay the groundwork for the application of what we term \textit{Vibrational Electromagnetics} in more complex media.
- [80] arXiv:2410.23279 (replaced) [pdf, html, other]
-
Title: A Transformer Model for Segmentation, Classification, and Caller Identification of Marmoset Vocalization
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
The marmoset, a highly vocal primate, has become a popular animal model for studying social-communicative behavior and its underlying mechanisms in comparison with human infant linguistic development. In the study of vocal communication, it is vital to know the caller identities, call contents, and vocal exchanges. Previous work achieved a joint CNN model for call segmentation, classification, and caller identification for marmoset vocalizations. However, the CNN has limitations in modeling long-range acoustic patterns; the Transformer architecture, which has been shown to outperform CNNs, uses a self-attention mechanism that efficiently segregates information in parallel over long distances and captures the global structure of marmoset vocalization. We propose using the Transformer to jointly segment and classify the marmoset calls and identify the callers for each vocalization.
- [81] arXiv:2410.23773 (replaced) [pdf, other]
-
Title: Towards Generative Ray Path Sampling for Faster Point-to-Point Ray Tracing
Jérome Eertmans, Nicola Di Cicco, Claude Oestges, Laurent Jacques, Enrico M. Vittuci, Vittorio Degli-Esposti
Comments: 6 pages, 6 figures, submitted to IEEE ICMLCN 2025
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Radio propagation modeling is essential in telecommunication research, as radio channels result from complex interactions with environmental objects. Recently, Machine Learning has been attracting attention as a potential alternative to computationally demanding tools, like Ray Tracing, which can model these interactions in detail. However, existing Machine Learning approaches often attempt to directly learn specific channel characteristics, such as the coverage map, making them highly specific to the frequency and material properties and unable to fully capture the underlying propagation mechanisms. Hence, Ray Tracing, particularly the Point-to-Point variant, remains popular for accurately identifying all possible paths between transmitter and receiver nodes. Still, path identification is computationally intensive because the number of paths to be tested grows exponentially while only a small fraction is valid. In this paper, we propose a Machine Learning-aided Ray Tracing approach to efficiently sample potential ray paths, significantly reducing the computational load while maintaining high accuracy. Our model dynamically learns to prioritize potentially valid paths among all possible paths and scales linearly with scene complexity. Unlike recent alternatives, our approach is invariant to translation, scaling, or rotation of the geometry, and avoids dependency on specific environment characteristics.
- [82] arXiv:2411.00774 (replaced) [pdf, html, other]
-
Title: Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
Comments: Project Page: this https URL
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Rapidly developing large language models (LLMs) have enabled numerous intelligent applications. In particular, GPT-4o's excellent duplex speech interaction ability has given users an impressive experience. Researchers have recently proposed several multi-modal LLMs in this direction that can achieve user-agent speech-to-speech conversations. This paper proposes a novel speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is that the speech input and output modalities can be easily connected to a textual LLM while keeping the LLM's parameters frozen throughout the training process. We design a three-stage training strategy for modeling both the speech input and output, enabling Freeze-Omni to obtain speech-to-speech conversation ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A samples on 8 GPUs. Moreover, we can effectively ensure that the intelligence of Freeze-Omni in the speech modality is at the same level as that in the text modality of its backbone LLM, while achieving low-latency end-to-end spoken response. In addition, we also design a method to achieve duplex dialogue ability through multi-task training, giving Freeze-Omni a more natural style of dialogue between users and agents. In summary, Freeze-Omni holds great potential to conduct speech-to-speech dialogue based on a multimodal LLM under the condition of a frozen LLM, avoiding the catastrophic forgetting problem caused by limited data and training resources.
- [83] arXiv:2411.12641 (replaced) [pdf, other]
-
Title: Improving Controllability and Editability for Pretrained Text-to-Music Generation Models
Comments: PhD Thesis
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
The field of AI-assisted music creation has made significant strides, yet existing systems often struggle to meet the demands of iterative and nuanced music production. These challenges include providing sufficient control over the generated content and allowing for flexible, precise edits. This thesis tackles these issues by introducing a series of advancements that progressively build upon each other, enhancing the controllability and editability of text-to-music generation models.
First, we introduce Loop Copilot, a system that tries to address the need for iterative refinement in music creation. Loop Copilot leverages a large language model (LLM) to coordinate multiple specialised AI models, enabling users to generate and refine music interactively through a conversational interface. Central to this system is the Global Attribute Table, which records and maintains key musical attributes throughout the iterative process, ensuring that modifications at any stage preserve the overall coherence of the music. While Loop Copilot excels in orchestrating the music creation process, it does not directly address the need for detailed edits to the generated content.
To overcome this limitation, MusicMagus is presented as a further solution for editing AI-generated music. MusicMagus introduces a zero-shot text-to-music editing approach that allows for the modification of specific musical attributes, such as genre, mood, and instrumentation, without the need for retraining. By manipulating the latent space within pre-trained diffusion models, MusicMagus ensures that these edits are stylistically coherent and that non-targeted attributes remain unchanged. This system is particularly effective in maintaining the structural integrity of the music during edits, but it encounters challenges with more complex and real-world audio scenarios.
...