Computer Science
Showing new listings for Monday, 25 November 2024
- [1] arXiv:2411.14433 [pdf, html, other]
Title: Transforming Engineering Education Using Generative AI and Digital Twin Technologies
Authors: Yu-Zheng Lin, Ahmed Hussain J Alhamadah, Matthew William Redondo, Karan Himanshu Patel, Sujan Ghimire, Banafsheh Saber Latibari, Soheil Salehi, Pratik Satam
Comments: 8 pages, 7 figures
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
Digital twin technology, traditionally used in industry, is increasingly recognized for its potential to enhance educational experiences. This study investigates the application of industrial digital twins (DTs) in education, focusing on how DT models of varying fidelity can support different stages of Bloom's taxonomy in the cognitive domain. We align Bloom's six cognitive stages with educational levels: undergraduate studies for "Remember" and "Understand," master's level for "Apply" and "Analyze," and doctoral level for "Evaluate" and "Create." Low-fidelity DTs aid essential knowledge acquisition and skill training, providing a low-risk environment for grasping fundamental concepts. Medium-fidelity DTs offer more detailed and dynamic simulations, enhancing application skills and problem-solving. High-fidelity DTs support advanced learners by replicating physical phenomena, allowing for innovative design and complex experiments. Within this framework, large language models (LLMs) serve as mentors, assessing progress, filling knowledge gaps, and assisting with DT interactions, parameter setting, and debugging. We evaluate the educational impact using the Kirkpatrick Model, examining how each DT model's fidelity influences learning outcomes. This framework helps educators make informed decisions on integrating DTs and LLMs to meet specific learning objectives.
- [2] arXiv:2411.14435 [pdf, html, other]
Title: Generative AI Policy and Governance Considerations for Health Security in Southeast Asia
Comments: 13 pages, 1 appendix; accepted to GenAI for Health: Potential, Trust and Policy Compliance workshop at NeurIPS 2024
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Southeast Asia is a geopolitically and socio-economically significant region with unique challenges and opportunities. Intensifying progress in generative AI against a backdrop of existing health security threats makes applications of AI to mitigate such threats attractive but also risky if done without due caution. This paper provides a brief sketch of some of the applications of AI for health security and the regional policy and governance landscape. I focus on policy and governance activities of the Association of Southeast Asian Nations (ASEAN), an international body whose member states represent 691 million people. I conclude by identifying sustainability as an area of opportunity for policymakers and recommend priority areas for generative AI researchers to make the most impact with their work.
- [3] arXiv:2411.14436 [pdf, html, other]
Title: AssertLLM: Generating Hardware Verification Assertions from Design Specifications via Multi-LLMs
Comments: Accepted by ASPDAC'25. arXiv admin note: substantial text overlap with arXiv:2402.00386
Subjects: Hardware Architecture (cs.AR)
Assertion-based verification (ABV) is a critical method for ensuring that logic designs comply with their architectural specifications. ABV requires assertions, which are generally converted from specifications through human interpretation by verification engineers. Existing methods for generating assertions from specification documents are limited to sentences pre-extracted by engineers, which discourages their practical application. In this work, we present AssertLLM, an automatic assertion generation framework that processes complete specification documents. AssertLLM can generate assertions from both natural language and waveform diagrams in specification files. It first converts unstructured specification sentences and waveforms into structured descriptions using natural language templates. Then, a customized Large Language Model (LLM) generates the final assertions based on these descriptions. Our evaluation demonstrates that AssertLLM can generate more accurate and higher-quality assertions compared to GPT-4o and GPT-3.5.
- [4] arXiv:2411.14437 [pdf, other]
Title: Transforming Business with Generative AI: Research, Innovation, Market Deployment and Future Shifts in Business Models
Comments: 30 pages, 12 figures, original submission
Subjects: Computers and Society (cs.CY)
This paper explores the transformative impact of Generative AI (GenAI) on the business landscape, examining its role in reshaping traditional business models, intensifying market competition, and fostering innovation. By applying the principles of Neo-Schumpeterian economics, the research analyses how GenAI is driving a new wave of "creative destruction," leading to the emergence of novel business paradigms and value propositions. The findings reveal that GenAI enhances operational efficiency, facilitates product and service innovation, and creates new revenue streams, positioning it as a powerful catalyst for substantial shifts in business structures and strategies. However, the deployment of GenAI also presents significant challenges, including ethical concerns, regulatory demands, and the risk of job displacement. By addressing the multifarious nature of GenAI, this paper provides valuable insights for business leaders, policymakers, and researchers, guiding them towards a balanced and responsible integration of this transformative technology. Ultimately, GenAI is not merely a technological advancement but a driver of profound change, heralding a future where creativity, efficiency, and growth are redefined.
- [5] arXiv:2411.14438 [pdf, other]
Title: Agent-Based Modeling for Multimodal Transportation of $CO_2$ for Carbon Capture, Utilization, and Storage: CCUS-Agent
Subjects: Multiagent Systems (cs.MA)
To understand the system-level interactions between the entities in Carbon Capture, Utilization, and Storage (CCUS), an agent-based foundational modeling tool, CCUS-Agent, is developed for a large-scale study of transportation flows and infrastructure in the United States. Key features of the tool include (i) modular design, (ii) multiple transportation modes, (iii) capabilities for extension, and (iv) testing against various system components and networks of small and large sizes. Five matching algorithms for CO2 supply agents (e.g., powerplants and industrial facilities) and demand agents (e.g., storage and utilization sites) are explored: Most Profitable First Year (MPFY), Most Profitable All Years (MPAY), Shortest Total Distance First Year (SDFY), Shortest Total Distance All Years (SDAY), and Shortest distance to long-haul transport All Years (ACAY). Before matching, the supply agent, demand agent, and route must be available, and the connection must be profitable. A profitable connection means that the supply agent's portion of revenue from the 45Q tax credit must cover the supply agent's costs and all transportation costs, while the demand agent's revenue portion must cover all demand agent costs. A case study employing over 5,500 supply and demand agents and multimodal CCUS transportation infrastructure in the contiguous United States is conducted. The results suggest that it is possible to capture over 9 billion tonnes (Gt) of CO2 from 2025 to 2043, increasing significantly to 22 Gt if capture costs are reduced by 40%. The MPFY and SDFY algorithms capture more CO2 earlier in the time horizon, while the MPAY and SDAY algorithms capture more later in the time horizon.
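To make the matching logic concrete, here is a minimal Python sketch of an MPFY-style greedy matching with the profitability condition described above. All names, cost figures, and the 45Q credit split are illustrative assumptions, not the CCUS-Agent implementation.

```python
# Illustrative sketch (not the authors' code): a greedy "Most Profitable
# First Year" (MPFY) style matching between CO2 supply and demand agents.
from dataclasses import dataclass

TAX_CREDIT_45Q = 85.0  # USD per tonne of CO2 stored; assumed value

@dataclass
class Supply:
    name: str
    capture_cost: float   # USD/tonne at the supply agent
    annual_tonnes: float

@dataclass
class Demand:
    name: str
    storage_cost: float   # USD/tonne at the demand agent
    annual_capacity: float

def profitable(supply, demand, transport_cost, supply_share=0.75):
    """Mirror the condition in the abstract: the supply agent's share of the
    45Q credit must cover capture plus all transport costs, and the demand
    agent's share must cover its own costs. The 75/25 split is assumed."""
    supply_revenue = supply_share * TAX_CREDIT_45Q
    demand_revenue = (1.0 - supply_share) * TAX_CREDIT_45Q
    return (supply_revenue >= supply.capture_cost + transport_cost
            and demand_revenue >= demand.storage_cost)

def mpfy_match(supplies, demands, transport_cost_fn):
    """Greedily pair each supply with its most profitable feasible demand."""
    matches = []
    for s in supplies:
        candidates = []
        for d in demands:
            t = transport_cost_fn(s, d)
            if profitable(s, d, t):
                profit = TAX_CREDIT_45Q - s.capture_cost - t - d.storage_cost
                candidates.append((profit, d))
        if candidates:
            profit, best = max(candidates, key=lambda c: c[0])
            matches.append((s.name, best.name, profit))
    return matches
```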
- [6] arXiv:2411.14439 [pdf, other]
Title: Windstorm Economic Impacts on the Spanish Resilience: A Machine Learning Real-Data Approach
Authors: Matheus Puime Pedra (1), Josune Hernantes (1), Leire Casals (1), Leire Labaka (1) ((1) Industrial Management Department - TECNUN, University of Navarra, Donostia, Spain)
Journal-ref: XX Conferencia de la Asociacion Espanola para la Inteligencia Artificial 2024
Subjects: Computers and Society (cs.CY)
Climate change-associated disasters have become a significant concern, particularly when they affect urban areas. Assessing these regions' resilience to strengthen their disaster management is crucial, especially in areas vulnerable to windstorms, one of Spain's most critical disasters. Smart cities and machine learning offer promising solutions for managing disasters, but accurately estimating economic losses from windstorms can be difficult due to the unique characteristics of each region and limited data. This study proposes utilizing ML classification models to enhance disaster resilience by analyzing publicly available data on windstorms in Spanish regions. This approach can help decision-makers make informed decisions regarding preparedness and mitigation actions, ultimately creating a more resilient urban environment that can better withstand windstorms in the future.
- [7] arXiv:2411.14441 [pdf, html, other]
Title: GeMID: Generalizable Models for IoT Device Identification
Comments: 8 pages main (9 figures, 2 tables), 19 pages Supplementary Material, 27 pages total
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
With the proliferation of Internet of Things (IoT) devices, ensuring their security has become paramount. Device identification (DI), which distinguishes IoT devices based on their traffic patterns, plays a crucial role in both differentiating devices and identifying vulnerable ones, closing a serious security gap. However, existing approaches to DI that build machine learning models often overlook the challenge of model generalizability across diverse network environments. In this study, we propose a novel framework to address this limitation and evaluate the generalizability of DI models across datasets collected within different network environments. Our approach involves a two-step process: first, we develop a feature and model selection method that is more robust to generalization issues by using a genetic algorithm with external feedback and datasets from distinct environments to refine the selections. Second, the resulting DI models are then tested on further independent datasets in order to robustly assess their generalizability. We demonstrate the effectiveness of our method by empirically comparing it to alternatives, highlighting how fundamental limitations of commonly employed techniques such as sliding window and flow statistics limit their generalizability. Our findings advance research in IoT security and device identification, offering insights into improving model effectiveness and mitigating risks in IoT networks.
- [8] arXiv:2411.14442 [pdf, html, other]
Title: AI Ethics by Design: Implementing Customizable Guardrails for Responsible AI Development
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
This paper explores the development of an ethical guardrail framework for AI systems, emphasizing the importance of customizable guardrails that align with diverse user values and underlying ethics. We address the challenges of AI ethics by proposing a structure that integrates rules, policies, and AI assistants to ensure responsible AI behavior, while comparing the proposed framework to the existing state-of-the-art guardrails. By focusing on practical mechanisms for implementing ethical standards, we aim to enhance transparency, user autonomy, and continuous improvement in AI systems. Our approach accommodates ethical pluralism, offering a flexible and adaptable solution for the evolving landscape of AI governance. The paper concludes with strategies for resolving conflicts between ethical directives, underscoring the present and future need for robust, nuanced and context-aware AI systems.
- [9] arXiv:2411.14444 [pdf, other]
Title: Unlocking the Future: A Cloud-Based Artificial Intelligence Access Control System
Comments: Link to online article: this https URL
Journal-ref: ERCIM News Special theme: Software Security 2024
Subjects: Cryptography and Security (cs.CR)
Traditional access control systems, such as key cards, PIN pads, and physical keys, face challenges in scalability, security, and user experience in today's digital world. We present a cloud-based entry system using Raspberry Pi hardware and Amazon Web Services (AWS) technologies like Lambda, Simple Storage Service (S3), and Rekognition. This solution (AWSecure Entry System) enhances security, streamlines authentication, and increases operational efficiency.
- [10] arXiv:2411.14449 [pdf, other]
Title: Deferred Backdoor Functionality Attacks on Deep Learning Models
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Deep learning models are vulnerable to backdoor attacks, where adversaries inject malicious functionality during training that activates on trigger inputs at inference time. Extensive research has focused on developing stealthy backdoor attacks to evade detection and defense mechanisms. However, these approaches still have limitations that leave the door open for detection and mitigation due to their inherent design to cause malicious behavior in the presence of a trigger. To address this limitation, we introduce Deferred Backdoor Functionality Activation (DBFA), a new paradigm in backdoor attacks. Unlike conventional attacks, DBFA initially conceals its backdoor, producing benign outputs even when triggered. This stealthy behavior allows DBFA to bypass multiple detection and defense methods, remaining undetected during initial inspections. The backdoor functionality is strategically activated only after the model undergoes subsequent updates, such as retraining on benign data. DBFA attacks exploit the common practice of updating and fine-tuning models after initial deployment in the machine learning life cycle. To implement DBFA attacks, we make the unlearning of the backdoor fragile, so that it is easily cancelled and the backdoor functionality is subsequently reactivated. To achieve this, we propose a novel two-stage training scheme, called DeferBad. Our extensive experiments across various fine-tuning scenarios, backdoor attack types, datasets, and model architectures demonstrate the effectiveness and stealthiness of DeferBad.
- [11] arXiv:2411.14450 [pdf, other]
Title: Development of a threat modelling framework and a web-based threat modelling tool for micro businesses
Comments: 109 pages, 11 figures, 5 appendices
Subjects: Cryptography and Security (cs.CR)
While there is a plethora of cybersecurity and risk management frameworks for different target audiences and use cases, micro-businesses (MBs) are often overlooked. As the smallest business entities, MBs represent a special case with regard to cybersecurity for two reasons: (1) Having fewer than 10 employees, they tend to lack cybersecurity expertise. (2) Because of their low turnover, they usually have a limited budget for cybersecurity. As a result, MBs fall victim to security breaches and cyber-attacks every year, as demonstrated by various studies. This calls for a non-technical, simple solution tailored specifically to MBs. To address this pressing need, the SEANCE Cybersecurity Framework was developed through a 7-step methodology: (1) A literature review was conducted to explore the current state of research and available frameworks and methodologies, (2) followed by a qualitative survey to identify the cybersecurity challenges faced by MBs. (3) After analyzing the results of the literature review and the survey, (4) the relevant aspects of existing frameworks and tools for MBs were identified and (5) a non-technical framework was developed. (6) A web-based tool was developed to facilitate the implementation of the framework and (7) another qualitative survey was conducted to gather feedback. The SEANCE Framework suggests considering possible vulnerabilities and cyber threats in six hierarchical layers: (1) Self, (2) Employees, (3) Assets, (4) Network, (5) Customers and (6) Environment, with the underlying idea that a vulnerability in an inner layer propagates to the outer layers and therefore needs to be prioritized.
- [12] arXiv:2411.14451 [pdf, other]
Title: The Evolution of Cryptography through Number Theory
Comments: 27 pages, 6 Tables, 1 figure
Subjects: Cryptography and Security (cs.CR)
Cryptography, derived from the Greek for hidden writing, uses mathematical techniques to secure information by converting it into an unreadable format. While cryptography as a science began around 100 years ago, its roots trace back to ancient civilizations like Mesopotamia and Egypt. Over time, cryptography evolved from basic methods to complex systems involving number theory, such as modular arithmetic, the Euclidean algorithm, and Euler's totient function. This paper explores the link between early information-hiding techniques and modern cryptographic algorithms like RSA, which use advanced number theory to secure data for billions of people. By analyzing historical methods, this study shows how the development of number theory enabled the transition from simple letter-shifting ciphers, like the Caesar and Vigenère ciphers, to more sophisticated encryption methods. This evolution reflects a profound impact on daily life and the importance of number theory in protecting information.
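As a worked illustration of this arc, the toy Python below contrasts a letter-shifting Caesar cipher with textbook RSA built from modular arithmetic and Euler's totient function (educational-scale primes only; real RSA keys use primes hundreds of digits long).

```python
# Toy illustration of the paper's arc: from letter-shifting ciphers to RSA.

def caesar(text, shift):
    """Caesar cipher: shift each letter by a fixed amount mod 26."""
    return "".join(
        chr((ord(c) - ord("A") + shift) % 26 + ord("A")) if c.isalpha() else c
        for c in text.upper()
    )

# RSA from modular arithmetic and Euler's totient function:
p, q = 61, 53
n = p * q                  # public modulus
phi = (p - 1) * (q - 1)    # Euler's totient of n
e = 17                     # public exponent, coprime with phi
d = pow(e, -1, phi)        # private exponent: modular inverse (Python 3.8+)

m = 42                     # message encoded as an integer < n
c = pow(m, e, n)           # encryption: c = m^e mod n
assert pow(c, d, n) == m   # decryption recovers m (Euler's theorem)

print(caesar("ATTACK AT DAWN", 3))  # -> "DWWDFN DW GDZQ"
```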
- [13] arXiv:2411.14453 [pdf, html, other]
Title: Direct Speech-to-Speech Neural Machine Translation: A Survey
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Speech-to-Speech Translation (S2ST) models transform speech from one language to another while preserving the same linguistic information. S2ST is important for bridging the communication gap among communities and has diverse applications. In recent years, researchers have introduced direct S2ST models, which have the potential to translate speech without relying on intermediate text generation, have better decoding latency, and can preserve paralinguistic and non-linguistic features. However, direct S2ST has yet to achieve quality performance for seamless communication and still lags behind cascade models, especially in real-world translation. To the best of our knowledge, no comprehensive survey is available on direct S2ST systems that beginners and advanced researchers alike can consult for a quick overview. The present work provides a comprehensive review of direct S2ST models, data and application issues, and performance metrics. We critically analyze the models' performance over benchmark datasets and provide research challenges and future directions.
- [14] arXiv:2411.14456 [pdf, other]
Title: Can Artificial Intelligence Generate Quality Research Topics Reflecting Patient Concerns?
Authors: Jiyeong Kim, Michael L. Chen, Shawheen J. Rezaei, Mariana Ramirez-Posada, Jennifer L. Caswell-Jin, Allison W. Kurian, Fauzia Riaz, Kavita Y. Sarin, Jean Y. Tang, Steven M. Asch, Eleni Linos
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Patient-centered research is increasingly important in narrowing the gap between research and patient care, yet incorporating patient perspectives into health research has been inconsistent. We propose an automated framework leveraging innovative natural language processing (NLP) and artificial intelligence (AI) with patient portal messages to generate research ideas that prioritize important patient issues. We further quantified the quality of AI-generated research topics. To define patient clinical concerns, we analyzed 614,464 patient messages from 25,549 individuals with breast or skin cancer obtained from a large academic hospital (2013 to 2024), constructing a 2-staged unsupervised NLP topic model. Then, we generated research topics to resolve the defined issues using a widely used AI (ChatGPT-4o, OpenAI Inc, April 2024 version) with prompt-engineering strategies. We guided the AI to perform multi-level tasks: 1) knowledge interpretation and summarization (e.g., interpreting and summarizing the NLP-defined topics), 2) knowledge generation (e.g., generating research ideas corresponding to patients' issues), 3) self-reflection and correction (e.g., ensuring and revising the research ideas after searching for scientific articles), and 4) self-reassurance (e.g., confirming and finalizing the research ideas). Six highly experienced breast oncologists and dermatologists assessed the significance and novelty of the AI-generated research topics on a 5-point Likert scale (1-exceptional, 5-poor). One-third of the AI-suggested research topics were both highly significant and highly novel, with both scores below the average (lower scores indicate higher quality). Two-thirds of the AI-suggested topics were novel in both cancers. Our findings demonstrate that AI-generated research topics reflecting patient perspectives, derived from a large volume of patient messages, can meaningfully guide future directions in patient-centered health research.
- [15] arXiv:2411.14457 [pdf, html, other]
Title: Guiding Reinforcement Learning Using Uncertainty-Aware Large Language Models
Comments: 8 pages, 7 figures
Subjects: Machine Learning (cs.LG)
Human guidance in reinforcement learning (RL) is often impractical for large-scale applications due to high costs and time constraints. Large Language Models (LLMs) offer a promising alternative to mitigate RL sample inefficiency and potentially replace human trainers. However, applying LLMs as RL trainers is challenging due to their overconfidence and less reliable solutions in sequential tasks. We address this limitation by introducing a calibrated guidance system that uses Monte Carlo Dropout to enhance LLM advice reliability by assessing prediction variances from multiple forward passes. Additionally, we develop a novel RL policy shaping method based on dynamic model average entropy to adjust the LLM's influence on RL policies according to guidance uncertainty. This approach ensures robust RL training by relying on reliable LLM guidance. To validate our contributions, we conduct extensive experiments in a Minigrid environment with three goals in varying environment sizes. The results showcase superior model performance compared to uncalibrated LLMs, unguided RL, and calibrated LLMs with different shaping policies. Moreover, we analyze various uncertainty estimation methods, demonstrating the effectiveness of average entropy in reflecting higher uncertainty in incorrect guidance. These findings highlight the persistent overconfidence in fine-tuned LLMs and underscore the importance of effective calibration in sequential decision-making problems.
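A minimal sketch of the calibration idea follows, assuming a PyTorch advisor model with dropout layers and a small discrete action space (the paper wraps an LLM, not this toy classifier): run several stochastic forward passes, average the predicted distributions, and use their entropy to modulate the advisor's influence on the RL policy.

```python
# Hedged sketch of Monte Carlo Dropout guidance; all names are illustrative.
import torch
import torch.nn.functional as F

def mc_dropout_guidance(model, state, n_passes=20):
    """Multiple stochastic forward passes (dropout kept active) yield a mean
    action distribution plus its entropy as an uncertainty score."""
    model.train()  # keep dropout enabled at inference time
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(model(state), dim=-1) for _ in range(n_passes)]
        )
    mean_probs = probs.mean(dim=0)
    entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)
    return mean_probs, entropy

def shaped_logits(rl_logits, advisor_probs, entropy, max_entropy):
    """Policy shaping: down-weight the advisor when its entropy is high,
    so only confident LLM advice strongly steers the RL policy."""
    weight = 1.0 - (entropy / max_entropy)
    return rl_logits + weight * advisor_probs.clamp_min(1e-12).log()
```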
- [16] arXiv:2411.14458 [pdf, html, other]
Title: Improving training time and GPU utilization in geo-distributed language model training
Authors: Palak (Microsoft Research India), Rohan Gandhi (Microsoft Research India), Karan Tandon (Microsoft Research India), Debopam Bhattacherjee (Microsoft Research India), Venkata N. Padmanabhan (Microsoft Research India)
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The widespread adoption of language models (LMs) across multiple industries has caused a huge surge in demand for GPUs. Training LMs requires tens of thousands of GPUs, and housing them all in the same datacenter (DC) is becoming challenging. We focus on training such models across multiple DCs connected via a Wide-Area Network (WAN). We build ATLAS, which speeds up training using novel temporal bandwidth sharing and many other design choices. While ATLAS improves the training time, it does not eliminate the bubbles (idle GPU cycles). We built BUBBLETEA, which runs prefill-as-a-service (part of LM inference) during the bubbles, improving GPU utilization substantially without any impact on training. Together, ATLAS and BUBBLETEA improve training time by up to 17X and achieve GPU utilization of up to 94%.
- [17] arXiv:2411.14459 [pdf, html, other]
Title: Unveiling User Preferences: A Knowledge Graph and LLM-Driven Approach for Conversational Recommendation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Conversational Recommender Systems (CRSs) aim to provide personalized recommendations by dynamically capturing user preferences in interactive conversations. Conventional CRSs often extract user preferences as hidden representations, which are criticized for their lack of interpretability. This diminishes the transparency and trustworthiness of the recommendation process. Recent works have explored combining the impressive capabilities of Large Language Models (LLMs) with the domain-specific knowledge of Knowledge Graphs (KGs) to generate human-understandable recommendation explanations. Despite these efforts, the integration of LLMs and KGs for CRSs remains challenging due to the modality gap between unstructured dialogues and structured KGs. Moreover, LLMs pre-trained on large-scale corpora may not be well-suited for analyzing user preferences, which require domain-specific knowledge. In this paper, we propose COMPASS, a plug-and-play framework that synergizes LLMs and KGs to unveil user preferences, enhancing the performance and explainability of existing CRSs. To address integration challenges, COMPASS employs a two-stage training approach: first, it bridges the gap between the structured KG and natural language through an innovative graph entity captioning pre-training mechanism. This enables the LLM to transform KG entities into concise natural language descriptions, allowing it to comprehend domain-specific knowledge. COMPASS then optimizes user preference modeling via knowledge-aware instruction fine-tuning, where the LLM learns to reason about and summarize user preferences from both dialogue histories and KG-augmented context. This enables COMPASS to perform knowledge-aware reasoning and generate comprehensive and interpretable user preferences that can seamlessly integrate with existing CRS models, improving recommendation performance and explainability.
- [18] arXiv:2411.14460 [pdf, html, other]
Title: LLaSA: Large Language and Structured Data Assistant
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Structured data, such as tables, graphs, and databases, play a critical role in numerous NLP tasks such as question answering and dialogue systems. Recently, inspired by Vision-Language Models, Graph Neural Networks (GNNs) have been introduced as an additional modality into the input of Large Language Models (LLMs) to improve their performance on Structured Knowledge Grounding (SKG) tasks. However, those GNN-enhanced LLMs have the following limitations: (1) They employ diverse GNNs to model varying types of structured data, rendering them unable to uniformly process various forms of structured data. (2) The pretraining of GNNs is coupled with specific LLMs, which prevents GNNs from fully aligning with the textual space and limits their adaptability to other LLMs. To address these issues, we propose \textbf{L}arge \textbf{L}anguage and \textbf{S}tructured Data \textbf{A}ssistant (LLaSA), a general framework for enhancing LLMs' ability to handle structured data. Specifically, we represent various types of structured data in a unified hypergraph format and use self-supervised learning to pretrain a hypergraph encoder together with a G-Former, which compresses the encoded hypergraph representations via cross-attention. The compressed hypergraph representations are appended to the serialized inputs during the training and inference stages of LLMs. Experimental results on multiple SKG tasks show that our pretrained hypergraph encoder can adapt to various LLMs and enhance their ability to process different types of structured data. Besides, LLaSA with LoRA fine-tuning outperforms the previous SOTA method that uses full-parameter tuning.
- [19] arXiv:2411.14461 [pdf, html, other]
Title: Towards Next-Generation Medical Agent: How o1 is Reshaping Decision-Making in Medical Scenarios
Authors: Shaochen Xu, Yifan Zhou, Zhengliang Liu, Zihao Wu, Tianyang Zhong, Huaqin Zhao, Yiwei Li, Hanqi Jiang, Yi Pan, Junhao Chen, Jin Lu, Wei Zhang, Tuo Zhang, Lu Zhang, Dajiang Zhu, Xiang Li, Wei Liu, Quanzheng Li, Andrea Sikora, Xiaoming Zhai, Zhen Xiang, Tianming Liu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Artificial Intelligence (AI) has become essential in modern healthcare, with large language models (LLMs) offering promising advances in clinical decision-making. Traditional model-based approaches, including those leveraging in-context demonstrations and those with specialized medical fine-tuning, have demonstrated strong performance in medical language processing but struggle with real-time adaptability, multi-step reasoning, and handling complex medical tasks. Agent-based AI systems address these limitations by incorporating reasoning traces, tool selection based on context, knowledge retrieval, and both short- and long-term memory. These additional features enable the medical AI agent to handle complex medical scenarios where decision-making should be built on real-time interaction with the environment. Therefore, unlike conventional model-based approaches that treat medical queries as isolated questions, medical AI agents approach them as complex tasks and behave more like human doctors. In this paper, we study the choice of the backbone LLM for medical AI agents, which is the foundation for the agent's overall reasoning and action generation. In particular, we consider the emergent o1 model and examine its impact on agents' reasoning, tool-use adaptability, and real-time information retrieval across diverse clinical scenarios, including high-stakes settings such as intensive care units (ICUs). Our findings demonstrate o1's ability to enhance diagnostic accuracy and consistency, paving the way for smarter, more responsive AI tools that support better patient outcomes and decision-making efficacy in clinical practice.
- [20] arXiv:2411.14462 [pdf, html, other]
Title: Activation Functions for "A Feedforward Unitary Equivariant Neural Network"
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
In our previous work [Ma and Chan (2023)], we presented a feedforward unitary equivariant neural network. We proposed three distinct activation functions tailored for this network: a softsign function with a small residue, an identity function, and a Leaky ReLU function. While these functions demonstrated the desired equivariance properties, they limited the neural network's architecture. This short paper generalises these activation functions to a single functional form. This functional form represents a broad class of functions, maintains unitary equivariance, and offers greater flexibility for the design of equivariant neural networks.
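The abstract does not give the generalized functional form, but one standard recipe for building unitary-equivariant activations, sketched below, is to act on the norm of a feature vector while preserving its direction; the softsign-with-residue case fits this pattern (the residue parameter is an assumption, and this is not necessarily the authors' exact form).

```python
# One standard recipe for unitary-equivariant activations (a sketch of the
# general pattern, not the paper's generalized functional form).
import numpy as np

def radial_activation(x, g):
    """f(x) = g(||x||) * x / ||x||. Any unitary U preserves the norm,
    so f(U x) = g(||x||) * U x / ||x|| = U f(x): unitary equivariance."""
    r = np.linalg.norm(x)
    return (g(r) / r) * x if r > 0 else x

# Example radial profile: softsign with a small residue, one of the three
# activations the paper generalizes (eps is an assumed parameter).
softsign_residue = lambda r, eps=0.01: r / (1.0 + r) + eps * r

x = np.array([3.0, 4.0])
print(radial_activation(x, softsign_residue))  # same direction, squashed norm
```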
- [21] arXiv:2411.14463 [pdf, html, other]
Title: Leveraging AI and NLP for Bank Marketing: A Systematic Review and Gap Analysis
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Economics (econ.GN)
This paper explores the growing impact of AI and NLP in bank marketing, highlighting their evolving roles in enhancing marketing strategies, improving customer engagement, and creating value within this sector. While AI and NLP have been widely studied in general marketing, there is a notable gap in understanding their specific applications and potential within the banking sector. This research addresses this specific gap by providing a systematic review and strategic analysis of AI and NLP applications in bank marketing, focusing on their integration across the customer journey and operational excellence. Employing the PRISMA methodology, this study systematically reviews existing literature to assess the current landscape of AI and NLP in bank marketing. Additionally, it incorporates semantic mapping using Sentence Transformers and UMAP for strategic gap analysis to identify underexplored areas and opportunities for future research.
The systematic review reveals limited research specifically focused on NLP applications in bank marketing. The strategic gap analysis identifies key areas where NLP can further enhance marketing strategies, including customer-centric applications like acquisition, retention, and personalized engagement, offering valuable insights for both academic research and practical implementation. This research contributes to the field of bank marketing by mapping the current state of AI and NLP applications and identifying strategic gaps. The findings provide actionable insights for developing NLP-driven growth and innovation frameworks and highlight the role of NLP in improving operational efficiency and regulatory compliance. This work has broader implications for enhancing customer experience, profitability, and innovation in the banking industry.
- [22] arXiv:2411.14465 [pdf, html, other]
Title: Testing Uncertainty of Large Language Models for Physics Knowledge and Reasoning
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Large Language Models (LLMs) have gained significant popularity in recent years for their ability to answer questions in various fields. However, these models have a tendency to "hallucinate" their responses, making it challenging to evaluate their performance. A major challenge is determining how to assess the certainty of a model's predictions and how it correlates with accuracy. In this work, we introduce an analysis for evaluating the performance of popular open-source LLMs, as well as GPT-3.5 Turbo, on multiple-choice physics questionnaires. We focus on the relationship between answer accuracy and variability in topics related to physics. Our findings suggest that most models provide accurate replies in cases where they are certain, but this is far from a general behavior. The relationship between accuracy and uncertainty exposes a broad horizontal bell-shaped distribution. We report how the asymmetry between accuracy and uncertainty intensifies as the questions demand more logical reasoning of the LLM agent, while the same relationship remains sharp for knowledge retrieval tasks.
- [23] arXiv:2411.14466 [pdf, html, other]
Title: Learning to Ask: Conversational Product Search via Representation Learning
Comments: Accepted by ACM TOIS
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Online shopping platforms, such as Amazon and AliExpress, are increasingly prevalent in society, helping customers purchase products conveniently. With recent progress in natural language processing, researchers and practitioners have shifted their focus from traditional product search to conversational product search. Conversational product search enables user-machine conversations and, through them, collects explicit user feedback that allows the system to actively clarify users' product preferences. Therefore, prospective research on an intelligent shopping assistant via conversations is indispensable. Existing publications on conversational product search either model conversations independently from users, queries, and products or lead to a vocabulary mismatch. In this work, we propose a new conversational product search model, ConvPS, to assist users in locating desirable items. The model is first trained to jointly learn the semantic representations of user, query, item, and conversation via a unified generative framework. After learning these representations, they are integrated to retrieve the target items in the latent semantic space. Meanwhile, we propose a set of greedy and explore-exploit strategies to learn to ask the user a sequence of high-performance questions for conversations. Our proposed ConvPS model can naturally integrate the representation learning of the user, query, item, and conversation into a unified generative framework, which provides a promising avenue for constructing accurate and robust conversational product search systems that are flexible and adaptive. Experimental results demonstrate that our ConvPS model significantly outperforms state-of-the-art baselines.
- [24] arXiv:2411.14468 [pdf, other]
Title: A Neural Network Training Method Based on Distributed PID Control
Comments: 12 pages, 5 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
In the previous article, we introduced a neural network framework based on symmetric differential equations. This novel framework exhibits complete symmetry, endowing it with perfect mathematical properties. While we have examined some of the system's mathematical characteristics, a detailed discussion of the network training methodology has not yet been presented. Drawing on the principles of the traditional backpropagation algorithm, this study proposes an alternative training approach that utilizes differential equation signal propagation instead of chain-rule derivation. This approach not only preserves the effectiveness of training but also offers enhanced biological interpretability. The foundation of this methodology lies in the system's reversibility, which stems from its inherent symmetry, a key aspect of our research. However, this method alone is insufficient for effective neural network training. To address this, we further introduce a distributed Proportional-Integral-Derivative (PID) control approach, emphasizing its implementation within a closed system. By incorporating this method, we achieved both faster training speeds and improved accuracy. This approach not only offers novel insights into neural network training but also extends the scope of research into control methodologies. To validate its effectiveness, we apply this method to the MNIST dataset, demonstrating its practical utility.
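For intuition only, here is a toy per-parameter PID correction in Python; the paper's actual scheme propagates signals through symmetric differential equations within a closed system, which this sketch does not capture.

```python
# Hedged sketch of a PID-style training signal (illustrative, not the
# paper's method): each controller turns an error signal into a correction.
class PIDController:
    def __init__(self, kp=0.1, ki=0.01, kd=0.01):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error):
        """Return a correction from proportional, integral, and
        derivative terms of the error signal."""
        self.integral += error
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        return -(self.kp * error + self.ki * self.integral + self.kd * derivative)

# Usage: one controller per parameter, driving its error toward zero.
pid = PIDController()
param, target = 0.0, 1.0
for _ in range(100):
    param += pid.step(param - target)
print(round(param, 3))  # converges toward 1.0
```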
- [25] arXiv:2411.14469 [pdf, html, other]
Title: Popular LLMs Amplify Race and Gender Disparities in Human Mobility
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
As large language models (LLMs) are increasingly applied in areas influencing societal outcomes, it is critical to understand their tendency to perpetuate and amplify biases. This study investigates whether LLMs exhibit biases in predicting human mobility -- a fundamental human behavior -- based on race and gender. Using three prominent LLMs -- GPT-4, Gemini, and Claude -- we analyzed their predictions of visitations to points of interest (POIs) for individuals, relying on prompts that included names with and without explicit demographic details. We find that LLMs frequently reflect and amplify existing societal biases. Specifically, predictions for minority groups were disproportionately skewed, with these individuals being significantly less likely to be associated with wealth-related points of interest (POIs). Gender biases were also evident, as female individuals were consistently linked to fewer career-related POIs compared to their male counterparts. These biased associations suggest that LLMs not only mirror but also exacerbate societal stereotypes, particularly in contexts involving race and gender.
- [26] arXiv:2411.14472 [pdf, html, other]
Title: Exploring the Potential Role of Generative AI in the TRAPD Procedure for Survey Translation
Subjects: Computation and Language (cs.CL); Applications (stat.AP); Methodology (stat.ME)
This paper explores and assesses in what ways generative AI can assist in translating survey instruments. Writing effective survey questions is a challenging and complex task, made even more difficult for surveys that will be translated and deployed in multiple linguistic and cultural settings. Translation errors can be detrimental, with known errors rendering data unusable for its intended purpose and undetected errors leading to incorrect conclusions. A growing number of institutions face this problem as surveys deployed by private and academic organizations globalize, and the success of their current efforts depends heavily on researchers' and translators' expertise and the amount of time each party has to contribute to the task. Thus, multilinguistic and multicultural surveys produced by teams with limited expertise, budgets, or time are at significant risk for translation-based errors in their data. We implement a zero-shot prompt experiment using ChatGPT to explore generative AI's ability to identify features of questions that might be difficult to translate to a linguistic audience other than the source language. We find that ChatGPT can provide meaningful feedback on translation issues, including common source survey language, inconsistent conceptualization, sensitivity and formality issues, and nonexistent concepts. In addition, we provide detailed information on the practicality of the approach, including accessing the necessary software, associated costs, and computational run times. Lastly, based on our findings, we propose avenues for future research that integrate AI into survey translation practices.
- [27] arXiv:2411.14473 [pdf, html, other]
Title: Large Language Model for Qualitative Research -- A Systematic Mapping Study
Authors: Cauã Ferreira Barros, Bruna Borges Azevedo, Valdemar Vicente Graciano Neto, Mohamad Kassab, Marcos Kalinowski, Hugo Alexandre D. do Nascimento, Michelle C.G.S.P. Bandeira
Comments: 8 pages, includes 1 figure and 3 tables. Submitted to the WSESE 2025 ICSE Workshop
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
The exponential growth of text-based data in domains such as healthcare, education, and social sciences has outpaced the capacity of traditional qualitative analysis methods, which are time-intensive and prone to subjectivity. Large Language Models (LLMs), powered by advanced generative AI, have emerged as transformative tools capable of automating and enhancing qualitative analysis. This study systematically maps the literature on the use of LLMs for qualitative research, exploring their application contexts, configurations, methodologies, and evaluation metrics. Findings reveal that LLMs are utilized across diverse fields, demonstrating the potential to automate processes traditionally requiring extensive human input. However, challenges such as reliance on prompt engineering, occasional inaccuracies, and contextual limitations remain significant barriers. This research highlights opportunities for integrating LLMs with human expertise, improving model robustness, and refining evaluation methodologies. By synthesizing trends and identifying research gaps, this study aims to guide future innovations in the application of LLMs for qualitative analysis.
- [28] arXiv:2411.14474 [pdf, html, other]
Title: Attention-guided Spectrogram Sequence Modeling with CNNs for Music Genre Classification
Comments: 6 pages, 7 figures, 17 References
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Music genre classification is a critical component of music recommendation systems, generation algorithms, and cultural analytics. In this work, we present an innovative model for classifying music genres using attention-based temporal signature modeling. By processing spectrogram sequences through Convolutional Neural Networks (CNNs) and multi-head attention layers, our approach captures the most temporally significant moments within each piece, crafting a unique "signature" for genre identification. This temporal focus not only enhances classification accuracy but also reveals insights into genre-specific characteristics that can be intuitively mapped to listener perceptions. Our findings offer potential applications in personalized music recommendation systems by highlighting cross-genre similarities and distinctiveness, aligning closely with human musical intuition. This work bridges the gap between technical classification tasks and the nuanced, human experience of genre.
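A schematic PyTorch module in the spirit of this description is sketched below; the layer sizes, pooling, and sequence handling are assumptions rather than the paper's exact configuration.

```python
# Hedged sketch: CNN features per spectrogram chunk, multi-head attention
# over time, attention weights as the temporal "signature".
import torch
import torch.nn as nn

class GenreNet(nn.Module):
    def __init__(self, n_genres=10, d_model=128):
        super().__init__()
        self.cnn = nn.Sequential(          # per-chunk spectrogram encoder
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        self.proj = nn.Linear(64, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.head = nn.Linear(d_model, n_genres)

    def forward(self, spec_seq):           # (batch, time, freq, frames)
        b, t = spec_seq.shape[:2]
        x = self.cnn(spec_seq.reshape(b * t, 1, *spec_seq.shape[2:]))
        x = self.proj(x.reshape(b, t, 64))         # (batch, time, d_model)
        x, weights = self.attn(x, x, x)            # attention over time steps
        return self.head(x.mean(dim=1)), weights   # weights highlight key moments
```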
- [29] arXiv:2411.14476 [pdf, other]
Title: StreetviewLLM: Extracting Geographic Information Using a Chain-of-Thought Multimodal Large Language Model
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Geospatial predictions are crucial for diverse fields such as disaster management, urban planning, and public health. Traditional machine learning methods often face limitations when handling unstructured or multi-modal data like street view imagery. To address these challenges, we propose StreetViewLLM, a novel framework that integrates a large language model with chain-of-thought reasoning and multimodal data sources. By combining street view imagery with geographic coordinates and textual data, StreetViewLLM improves the precision and granularity of geospatial predictions. Using retrieval-augmented generation techniques, our approach enhances geographic information extraction, enabling a detailed analysis of urban environments. The model has been applied to seven global cities, namely Hong Kong, Tokyo, Singapore, Los Angeles, New York, London, and Paris, demonstrating superior performance in predicting urban indicators, including population density, accessibility to healthcare, normalized difference vegetation index, building height, and impervious surface. The results show that StreetViewLLM consistently outperforms baseline models, offering improved predictive accuracy and deeper insights into the built environment. This research opens new opportunities for integrating large language models into urban analytics, decision-making in urban planning, infrastructure management, and environmental monitoring.
- [30] arXiv:2411.14478 [pdf, html, other]
Title: Why you don't overfit, and don't need Bayes if you only train for one epoch
Subjects: Machine Learning (cs.LG)
Here, we show that in the data-rich setting where you only train on each datapoint once (or equivalently, you only train for one epoch), standard "maximum likelihood" training optimizes the true data generating process (DGP) loss, which is equivalent to the test loss. Further, we show that the Bayesian model average optimizes the same objective, albeit while taking the expectation over uncertainty induced by finite data. As standard maximum likelihood training in the single-epoch setting optimizes the same objective as Bayesian inference, we argue that we do not expect Bayesian inference to offer any advantages in terms of overfitting or calibration in these settings. This explains the diminishing importance of Bayes in areas such as LLMs, which are often trained with one (or very few) epochs.
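The core identity can be restated compactly (notation ours, a sketch rather than the paper's exact formulation): when each datapoint is an independent fresh draw from the DGP and is used exactly once, the expected training loss equals the population (test) loss, so minimizing one minimizes the other in expectation.

```latex
% Sketch, with assumed notation: \theta are model parameters and x_t are
% datapoints seen exactly once, each drawn i.i.d. from the DGP.
\mathcal{L}_{\text{train}}(\theta)
  = \frac{1}{T}\sum_{t=1}^{T} \ell(\theta; x_t),
\qquad
\mathbb{E}_{x_t \sim \text{DGP}}\!\left[\mathcal{L}_{\text{train}}(\theta)\right]
  = \mathbb{E}_{x \sim \text{DGP}}\!\left[\ell(\theta; x)\right]
  = \mathcal{L}_{\text{test}}(\theta).
```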
- [31] arXiv:2411.14479 [pdf, html, other]
Title: GRL-Prompt: Towards Knowledge Graph based Prompt Optimization via Reinforcement Learning
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large language models (LLMs) have demonstrated impressive success in a wide range of natural language processing (NLP) tasks due to their extensive general knowledge of the world. Recent works discovered that the performance of LLMs is heavily dependent on the input prompt. However, prompt engineering is usually done manually in a trial-and-error fashion, which is labor-intensive and makes it challenging to find optimal prompts. To address these problems and unleash the utmost potential of LLMs, we propose a novel LLM-agnostic framework for prompt optimization, namely GRL-Prompt, which aims to automatically construct optimal prompts via reinforcement learning (RL) in an end-to-end manner. To provide a structured action/state representation for optimizing prompts, we construct a knowledge graph (KG) that better encodes the correlation between the user query and candidate in-context examples. Furthermore, a policy network is formulated to generate the optimal action by selecting a set of in-context examples in a rewardable order to construct the prompt. Additionally, embedding-based reward shaping is utilized to stabilize the RL training process. The experimental results show that GRL-Prompt outperforms recent state-of-the-art methods, achieving an average increase of 0.10 in ROUGE-1, 0.07 in ROUGE-2, 0.07 in ROUGE-L, and 0.05 in BLEU.
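A hedged sketch of the selection loop follows: a policy scores candidate in-context examples against the query (here a simple bilinear score stands in for the KG-based representation) and samples an ordered subset whose log-probability can be trained with REINFORCE against a shaped reward. All module names are illustrative, not the paper's code.

```python
# Illustrative policy for ordered in-context example selection.
import torch
import torch.nn as nn

class ExamplePolicy(nn.Module):
    def __init__(self, d_emb=768):
        super().__init__()
        self.score = nn.Bilinear(d_emb, d_emb, 1)  # stand-in for KG-based scoring

    def forward(self, query_emb, example_embs):
        logits = self.score(
            query_emb.expand_as(example_embs), example_embs
        ).squeeze(-1)
        return torch.distributions.Categorical(logits=logits)

def select_prompt(policy, query_emb, example_embs, k=3):
    """Sample k in-context examples without replacement, in order."""
    base_logits = policy(query_emb, example_embs).logits
    chosen, log_probs = [], []
    mask = torch.zeros(example_embs.size(0), dtype=torch.bool)
    for _ in range(k):
        dist = torch.distributions.Categorical(
            logits=base_logits.masked_fill(mask, float("-inf")))
        idx = dist.sample()
        chosen.append(int(idx))
        log_probs.append(dist.log_prob(idx))
        mask[idx] = True
    # The summed log-prob feeds a REINFORCE update with a shaped reward
    # (e.g., ROUGE plus an embedding-similarity term).
    return chosen, torch.stack(log_probs).sum()
```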
- [32] arXiv:2411.14480 [pdf, html, other]
Title: Associative Knowledge Graphs for Efficient Sequence Storage and Retrieval
Comments: 10 pages, 6 figures
Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB)
This paper presents a novel approach for constructing associative knowledge graphs that are highly effective for storing and recognizing sequences. The graph is created by representing overlapping sequences of objects, as tightly connected clusters within the larger graph. Individual objects (represented as nodes) can be a part of multiple sequences or appear repeatedly within a single sequence. To retrieve sequences, we leverage context, providing a subset of objects that triggers an association with the complete sequence. The system's memory capacity is determined by the size of the graph and the density of its connections. We have theoretically derived the relationships between the critical density of the graph and the memory capacity for storing sequences. The critical density is the point beyond which error-free sequence reconstruction becomes impossible. Furthermore, we have developed an efficient algorithm for ordering elements within a sequence. Through extensive experiments with various types of sequences, we have confirmed the validity of these relationships. This approach has potential applications in diverse fields, such as anomaly detection in financial transactions or predicting user behavior based on past actions.
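A compact Python sketch of the storage-and-recall idea, with assumed data structures: each stored sequence becomes a dense cluster of weighted associations in one shared graph, and recall greedily extends a context by the element most strongly associated with everything recalled so far.

```python
# Hedged sketch of an associative sequence graph (not the paper's code).
from collections import defaultdict

class AssociativeSequenceGraph:
    def __init__(self):
        self.edges = defaultdict(int)       # (a, b) -> association weight
        self.successors = defaultdict(set)  # a -> elements seen after a

    def store(self, sequence):
        # Every ordered pair in the sequence becomes an association, so the
        # sequence forms a tightly connected cluster inside the graph.
        for i, a in enumerate(sequence):
            for b in sequence[i + 1:]:
                self.edges[(a, b)] += 1
                self.successors[a].add(b)

    def recall(self, context):
        """Greedy reconstruction: repeatedly pick the element most strongly
        associated with everything recalled so far."""
        recalled = list(context)
        while True:
            candidates = set().union(
                *(self.successors[x] for x in recalled)) - set(recalled)
            if not candidates:
                return recalled
            recalled.append(max(
                candidates,
                key=lambda c: sum(self.edges[(x, c)] for x in recalled),
            ))

g = AssociativeSequenceGraph()
g.store(["login", "browse", "add_to_cart", "checkout"])
print(g.recall(["login", "browse"]))  # reconstructs the stored tail
```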
- [33] arXiv:2411.14483 [pdf, html, other]
Title: Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Deciding which large language model (LLM) to use is a complex challenge. Pairwise ranking has emerged as a new method for evaluating human preferences for LLMs. This approach entails humans evaluating pairs of model outputs based on a predefined criterion. By collecting these comparisons, a ranking can be constructed using methods such as Elo. However, applying these algorithms as constructed in the context of LLM evaluation introduces several challenges. In this paper, we explore the effectiveness of ranking systems for head-to-head comparisons of LLMs. We formally define a set of fundamental principles for effective ranking and conduct a series of extensive evaluations on the robustness of several ranking algorithms in the context of LLMs. Our analysis uncovers key insights into the factors that affect ranking accuracy and efficiency, offering guidelines for selecting the most appropriate methods based on specific evaluation contexts and resource constraints.
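For reference, the standard Elo update used in such pairwise comparisons looks as follows (the paper evaluates this and other ranking algorithms against its principles; the K-factor here is a conventional default, not a value from the paper).

```python
# Standard Elo rating update for a pairwise comparison.
def elo_update(r_a, r_b, score_a, k=32):
    """score_a = 1 if model A wins, 0 if it loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

# Example: an upset win by the lower-rated model moves ratings sharply.
print(elo_update(1400, 1600, score_a=1))  # -> approximately (1424.31, 1575.69)
```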
- [34] arXiv:2411.14484 [pdf, html, other]
Title: Robust Planning with Compound LLM Architectures: An LLM-Modulo Approach
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Previous work has attempted to boost Large Language Model (LLM) performance on planning and scheduling tasks through a variety of prompt engineering techniques. While these methods can work within the distributions tested, they are neither robust nor predictable. This limitation can be addressed through compound LLM architectures, where LLMs work in conjunction with other components to ensure reliability. In this paper, we present a technical evaluation of a compound LLM architecture--the LLM-Modulo framework. In this framework, an LLM is paired with a complete set of sound verifiers that validate its output, re-prompting it if it fails. This approach ensures that the system never emits a fallacious output, and therefore that every output generated is guaranteed correct--something previous techniques have not been able to claim. Our results, evaluated across four scheduling domains, demonstrate significant performance gains with the LLM-Modulo framework using various models. Additionally, we explore modifications to the base configuration of the framework and assess their impact on overall system performance.
- [35] arXiv:2411.14485 [pdf, html, other]
Title: Mediating Modes of Thought: LLM's for design scripting
Comments: Published at ACADIA 2024
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Architects adopt visual scripting and parametric design tools to explore more expansive design spaces (Coates, 2010), refine their thinking about the geometric logic of their design (Woodbury, 2010), and overcome conventional software limitations (Burry, 2011). Despite two decades of effort to make design scripting more accessible, a disconnect between a designer's free ways of thinking and the rigidity of algorithms remains (Burry, 2011). Recent developments in Large Language Models (LLMs) suggest this might soon change, as LLMs encode a general understanding of human context and exhibit the capacity to produce geometric logic. This project speculates that if LLMs can effectively mediate between user intent and algorithms, they become a powerful tool to make scripting in design more widespread and fun. We explore if such systems can interpret natural language prompts to assemble geometric operations relevant to computational design scripting. In the system, multiple layers of LLM agents are configured with specific context to infer the user intent and construct a sequential logic. Given a user's high-level text prompt, a geometric description is created, distilled into a sequence of logic operations, and mapped to software-specific commands. The completed script is constructed in the user's visual programming interface. The system succeeds in generating complete visual scripts up to a certain complexity but fails beyond this complexity threshold. It shows how LLMs can make design scripting much more aligned with human creativity and thought. Future research should explore conversational interactions, expand to multimodal inputs and outputs, and assess the performance of these tools.
- [36] arXiv:2411.14486 [pdf, other]
Title: The Impossible Test: A 2024 Unsolvable Dataset and A Chance for an AGI Quiz
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
This research introduces a novel evaluation framework designed to assess large language models' (LLMs) ability to acknowledge uncertainty on 675 fundamentally unsolvable problems. Using a curated dataset of graduate-level grand challenge questions with intentionally unknowable answers, we evaluated twelve state-of-the-art LLMs, including both open and closed-source models, on their propensity to admit ignorance rather than generate plausible but incorrect responses. The best models achieved 62-68% accuracy in admitting that a problem's solution was unknown, in fields ranging from biology to philosophy and mathematics. We observed an inverse relationship between problem difficulty and model accuracy, with GPT-4 demonstrating higher rates of uncertainty acknowledgment on more challenging problems (35.8%) compared to simpler ones (20.0%). This pattern indicates that models may be more prone to generating speculative answers when problems appear more tractable. The study also revealed significant variations across problem categories, with models showing difficulty in acknowledging uncertainty in invention and NP-hard problems while performing relatively better on philosophical and psychological challenges. These results contribute to the growing body of research on artificial general intelligence (AGI) assessment by highlighting the importance of uncertainty recognition as a critical component of future machine intelligence evaluation. This impossibility test thus extends previous theoretical frameworks for universal intelligence testing by providing empirical evidence of current limitations in LLMs' ability to recognize their own knowledge boundaries, suggesting new directions for improving model training architectures and evaluation approaches.
- [37] arXiv:2411.14487 [pdf, other]
Title: Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine
Authors: Yifan Yang, Qiao Jin, Robert Leaman, Xiaoyu Liu, Guangzhi Xiong, Maame Sarfo-Gyamfi, Changlin Gong, Santiago Ferrière-Steinert, W. John Wilbur, Xiaojun Li, Jiaxin Yuan, Bang An, Kelvin S. Castro, Francisco Erramuspe Álvarez, Matías Stockle, Aidong Zhang, Furong Huang, Zhiyong Lu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
The remarkable capabilities of Large Language Models (LLMs) make them increasingly compelling for adoption in real-world healthcare applications. However, the risks associated with using LLMs in medical applications have not been systematically characterized. We propose using five key principles for safe and trustworthy medical AI: Truthfulness, Resilience, Fairness, Robustness, and Privacy, along with ten specific aspects. Under this comprehensive framework, we introduce a novel MedGuard benchmark with 1,000 expert-verified questions. Our evaluation of 11 commonly used LLMs shows that the current language models, regardless of their safety alignment mechanisms, generally perform poorly on most of our benchmarks, particularly when compared to the high performance of human physicians. Although recent reports indicate that advanced LLMs like ChatGPT can match or even exceed human performance in various medical tasks, this study underscores a significant safety gap, highlighting the crucial need for human oversight and the implementation of AI safety guardrails.
- [38] arXiv:2411.14489 [pdf, html, other]
-
Title: GhostRNN: Reducing State Redundancy in RNN with Cheap OperationsJournal-ref: Proc. INTERSPEECH 2023, 226-230Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Recurrent neural networks (RNNs) that are capable of modeling long-distance dependencies are widely used in various speech tasks, e.g., keyword spotting (KWS) and speech enhancement (SE). Due to the power and memory limitations of low-resource devices, efficient RNN models are urgently required for real-world applications. In this paper, we propose an efficient RNN architecture, GhostRNN, which reduces hidden state redundancy with cheap operations. In particular, we observe that some dimensions of the hidden state are similar to others in trained RNN models, suggesting that redundancy exists in such models. To reduce this redundancy and hence the computational cost, we propose to first generate a few intrinsic states, and then apply cheap operations to produce ghost states based on the intrinsic states. Experiments on KWS and SE tasks demonstrate that the proposed GhostRNN significantly reduces memory usage (~40%) and computation cost while maintaining similar performance.
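As a rough illustration of the intrinsic/ghost-state idea, the following PyTorch sketch generates a few intrinsic state dimensions with a small GRU cell and expands them with a cheap linear map; the module name, widths, and the choice of cheap operation are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GhostRNNCell(nn.Module):
    """Sketch of a GhostRNN-style cell: a small GRU produces a few
    "intrinsic" state dimensions, and a cheap linear map expands them
    into "ghost" dimensions approximating a full-width hidden state."""

    def __init__(self, input_size: int, hidden_size: int, ratio: int = 2):
        super().__init__()
        self.intrinsic_size = hidden_size // ratio
        self.ghost_size = hidden_size - self.intrinsic_size
        self.cell = nn.GRUCell(input_size, self.intrinsic_size)
        # Cheap operation: a single linear projection generating ghost states.
        self.ghost = nn.Linear(self.intrinsic_size, self.ghost_size)

    def forward(self, x, h_intrinsic):
        h_intrinsic = self.cell(x, h_intrinsic)
        h_ghost = torch.tanh(self.ghost(h_intrinsic))
        return torch.cat([h_intrinsic, h_ghost], dim=-1), h_intrinsic

cell = GhostRNNCell(input_size=40, hidden_size=128, ratio=2)
h = torch.zeros(8, 64)                    # batch of 8, intrinsic width 64
out, h = cell(torch.randn(8, 40), h)      # out has the full width 128
```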
- [39] arXiv:2411.14491 [pdf, html, other]
-
Title: A Survey on Human-Centric LLMsJing Yi Wang, Nicholas Sukiennik, Tong Li, Weikang Su, Qianyue Hao, Jingbo Xu, Zihan Huang, Fengli Xu, Yong LiSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
The rapid evolution of large language models (LLMs) and their capacity to simulate human cognition and behavior has given rise to LLM-based frameworks and tools that are evaluated and applied based on their ability to perform tasks traditionally performed by humans, namely those involving cognition, decision-making, and social interaction. This survey provides a comprehensive examination of such human-centric LLM capabilities, focusing on their performance in both individual tasks (where an LLM acts as a stand-in for a single human) and collective tasks (where multiple LLMs coordinate to mimic group dynamics). We first evaluate LLM competencies across key areas including reasoning, perception, and social cognition, comparing their abilities to human-like skills. Then, we explore real-world applications of LLMs in human-centric domains such as behavioral science, political science, and sociology, assessing their effectiveness in replicating human behaviors and interactions. Finally, we identify challenges and future research directions, such as improving LLM adaptability, emotional intelligence, and cultural sensitivity, while addressing inherent biases and enhancing frameworks for human-AI collaboration. This survey aims to provide a foundational understanding of LLMs from a human-centric perspective, offering insights into their current capabilities and potential for future development.
- [40] arXiv:2411.14493 [pdf, other]
-
Title: From Statistical Methods to Pre-Trained Models; A Survey on Automatic Speech Recognition for Resource Scarce Urdu LanguageComments: Submitted to SN Computer ScienceSubjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Automatic Speech Recognition (ASR) technology has witnessed significant advancements in recent years, revolutionizing human-computer interactions. While major languages have benefited from these developments, lesser-resourced languages like Urdu face unique challenges. This paper provides an extensive exploration of the dynamic landscape of ASR research, focusing particularly on the resource-constrained Urdu language, which is widely spoken across South Asian nations. It outlines current research trends, technological advancements, and potential directions for future studies in Urdu ASR, aiming to pave the way for forthcoming researchers interested in this domain. By leveraging contemporary technologies, analyzing existing datasets, and evaluating effective algorithms and tools, the paper seeks to shed light on the unique challenges and opportunities associated with Urdu language processing and its integration into the broader field of speech research.
- [41] arXiv:2411.14494 [pdf, html, other]
-
Title: dc-GAN: Dual-Conditioned GAN for Face Demorphing From a Single MorphSubjects: Computer Vision and Pattern Recognition (cs.CV)
A facial morph is an image created by combining two face images pertaining to two distinct identities. Face demorphing inverts the process and tries to recover the original images constituting a facial morph. While morph attack detection (MAD) techniques can be used to flag morph images, they do not divulge any visual information about the faces used to create them. Demorphing helps address this problem. Existing demorphing techniques are either very restrictive (assume identities during testing) or produce feeble outputs (both outputs look very similar). In this paper, we overcome these issues by proposing dc-GAN, a novel GAN-based demorphing method conditioned on the morph images. Our method overcomes morph-replication and produces high quality reconstructions of the bonafide images used to create the morphs. Moreover, our method is highly generalizable across demorphing paradigms (differential/reference-free). We conduct experiments on AMSL, FRLL-Morphs and MorDiff datasets to showcase the efficacy of our method.
- [42] arXiv:2411.14495 [pdf, html, other]
-
Title: Test-Time Adaptation of 3D Point Clouds via Denoising Diffusion ModelsComments: Accepted to WACV 2025 (Winter Conference on Applications of Computer Vision)Subjects: Computer Vision and Pattern Recognition (cs.CV)
Test-time adaptation (TTA) of 3D point clouds is crucial for mitigating discrepancies between training and testing samples in real-world scenarios, particularly when handling corrupted point clouds. LiDAR data, for instance, can be affected by sensor failures or environmental factors, causing domain gaps. Adapting models to these distribution shifts online is crucial, as training for every possible variation is impractical. Existing methods often focus on fine-tuning pre-trained models based on self-supervised learning or pseudo-labeling, which can lead to forgetting valuable source domain knowledge over time and reduce generalization on future tests. In this paper, we introduce a novel 3D test-time adaptation method, termed 3DD-TTA, which stands for 3D Denoising Diffusion Test-Time Adaptation. This method uses a diffusion strategy that adapts input point cloud samples to the source domain while keeping the source model parameters intact. The approach uses a Variational Autoencoder (VAE) to encode the corrupted point cloud into a shape latent and latent points. These latent points are corrupted with Gaussian noise and subjected to a denoising diffusion process. During this process, both the shape latent and latent points are updated to preserve fidelity, guiding the denoising toward generating consistent samples that align more closely with the source domain. We conduct extensive experiments on the ShapeNet dataset and investigate its generalizability on ModelNet40 and ScanObjectNN, achieving state-of-the-art results. The code has been released at \url{this https URL}.
- [43] arXiv:2411.14496 [pdf, html, other]
-
Title: Multi-agent reinforcement learning strategy to maximize the lifetime of Wireless Rechargeable Sensor NetworksComments: 77 pages, Bachelor's thesisSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
The thesis proposes a generalized charging framework for multiple mobile chargers to maximize the network lifetime and ensure target coverage and connectivity in large-scale Wireless Rechargeable Sensor Networks (WRSNs). Moreover, a multi-point charging model is leveraged to enhance charging efficiency, where a Mobile Charger (MC) can charge multiple sensors simultaneously at each charging location. The thesis proposes an effective Decentralized Partially Observable Semi-Markov Decision Process (Dec-POSMDP) model that promotes MC cooperation and detects optimal charging locations based on real-time network information. Furthermore, the proposal allows reinforcement learning algorithms to be applied to different networks without requiring extensive retraining. To solve the Dec-POSMDP model, the thesis proposes an Asynchronous Multi-Agent Reinforcement Learning algorithm (AMAPPO) based on the Proximal Policy Optimization (PPO) algorithm.
- [44] arXiv:2411.14497 [pdf, html, other]
-
Title: Star-Agents: Automatic Data Optimization with LLM Agents for Instruction TuningSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
The efficacy of large language models (LLMs) on downstream tasks usually hinges on instruction tuning, which relies critically on the quality of training data. Unfortunately, collecting high-quality and diverse data is both expensive and time-consuming. To mitigate this issue, we propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets through multi-agent collaboration and assessment. The framework adopts a three-pronged strategy. It initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method. Subsequently, the generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality. Finally, the above process evolves in a dynamic refinement phase, where more effective LLMs are prioritized, enhancing the overall data quality. Our empirical studies, including instruction tuning experiments with models such as Pythia and LLaMA, demonstrate the effectiveness of the proposed framework. Optimized datasets have achieved substantial improvements, with an average increase of 12% and notable gains in specific metrics, such as a 40% improvement in Fermi, as evidenced by benchmarks like MT-bench, Vicuna bench, and WizardLM testset.
- [45] arXiv:2411.14498 [pdf, html, other]
-
Title: Delta-NAS: Difference of Architecture Encoding for Predictor-based Evolutionary Neural Architecture SearchSubjects: Computer Vision and Pattern Recognition (cs.CV)
Neural Architecture Search (NAS) continues to play a key role in the design and development of neural networks for task-specific deployment. Modern NAS techniques struggle to deal with ever-increasing search space complexity and compute cost constraints. Existing approaches can be categorized into two buckets: fine-grained, computationally expensive NAS and coarse-grained, low-cost NAS. Our objective is to craft an algorithm capable of performing fine-grained NAS at low cost. We propose projecting the problem to a lower-dimensional space by predicting the difference in accuracy of a pair of similar networks. This paradigm shift reduces computational complexity from exponential down to linear with respect to the size of the search space. We present a strong mathematical foundation for our algorithm in addition to extensive experimental results across a host of common NAS benchmarks. Our method significantly outperforms existing works, achieving better performance coupled with significantly higher sample efficiency.
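To illustrate the pairwise paradigm, a minimal sketch follows: a surrogate is trained on differences of architecture encodings to regress differences in accuracy, so a new candidate can be ranked against an already-measured anchor without training it. The encodings, regressor choice, and data are stand-ins, not the paper's method.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy setup: each architecture is a fixed-length encoding with a
# measured accuracy (both random stand-ins here).
rng = np.random.default_rng(0)
encodings = rng.integers(0, 4, size=(200, 16)).astype(float)
accuracies = rng.uniform(0.6, 0.95, size=200)

# Pairwise training data: encoding difference in, accuracy difference out.
idx_a, idx_b = rng.integers(0, 200, 1000), rng.integers(0, 200, 1000)
X = encodings[idx_a] - encodings[idx_b]
y = accuracies[idx_a] - accuracies[idx_b]

predictor = RandomForestRegressor(n_estimators=100).fit(X, y)

# Rank a new candidate against a measured anchor architecture:
# predicted accuracy = anchor accuracy + predicted delta.
candidate = rng.integers(0, 4, size=(1, 16)).astype(float)
delta = predictor.predict(candidate - encodings[:1])[0]
print("estimated accuracy:", accuracies[0] + delta)
```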
- [46] arXiv:2411.14499 [pdf, html, other]
-
Title: Understanding World or Predicting Future? A Comprehensive Survey of World ModelsJingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Zefang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, Fengli Xu, Yong LiSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The concept of world models has garnered significant attention due to advancements in multimodal large language models such as GPT-4 and video generation models such as Sora, which are central to the pursuit of artificial general intelligence. This survey offers a comprehensive review of the literature on world models. Generally, world models are regarded as tools for either understanding the present state of the world or predicting its future dynamics. This review presents a systematic categorization of world models, emphasizing two primary functions: (1) constructing internal representations to understand the mechanisms of the world, and (2) predicting future states to simulate and guide decision-making. Initially, we examine the current progress in these two categories. We then explore the application of world models in key domains, including autonomous driving, robotics, and social simulacra, with a focus on how each domain utilizes these aspects. Finally, we outline key challenges and provide insights into potential future research directions.
- [47] arXiv:2411.14500 [pdf, html, other]
-
Title: Exploring Accuracy-Fairness Trade-off in Large Language ModelsComments: 9 pagesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Large Language Models (LLMs) have made significant strides in the field of artificial intelligence, showcasing their ability to interact with humans and influence human cognition through information dissemination. However, recent studies have brought to light instances of bias inherent within these LLMs, presenting a critical issue that demands attention. In our research, we delve deeper into the intricate challenge of harmonising accuracy and fairness in the enhancement of LLMs. While improving accuracy can indeed enhance overall LLM performance, it often occurs at the expense of fairness. Overemphasising optimisation of one metric invariably leads to a significant degradation of the other. This underscores the necessity of taking into account multiple considerations during the design and optimisation phases of LLMs. Therefore, we advocate for reformulating the LLM training process as a multi-objective learning task. Our investigation reveals that multi-objective evolutionary learning (MOEL) methodologies offer promising avenues for tackling this challenge. Our MOEL framework enables the simultaneous optimisation of both accuracy and fairness metrics, resulting in a Pareto-optimal set of LLMs. In summary, our study sheds valuable light on the delicate equilibrium between accuracy and fairness within LLMs, which is increasingly significant for their real-world applications. By harnessing MOEL, we present a promising pathway towards fairer and more efficacious AI technologies.
- [48] arXiv:2411.14501 [pdf, html, other]
-
Title: U-Motion: Learned Point Cloud Video Compression with U-Structured Motion EstimationSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Point cloud video (PCV) is a versatile 3D representation of dynamic scenes with many emerging applications. This paper introduces U-Motion, a learning-based compression scheme for both PCV geometry and attributes. We propose a U-Structured multiscale inter-frame prediction framework, U-Inter, which performs layer-wise explicit motion estimation and compensation (ME/MC) at different scales with varying levels of detail. It integrates both higher and lower-scale motion features, in addition to the information of current and previous frames, to enable accurate motion estimation at the current scale. In addition, we design a cascaded spatial predictive coding module to capture the inter-scale spatial redundancy remaining after U-Inter prediction. We further propose an effective context detach and restore scheme to reduce spatial-temporal redundancy in the motion and latent bit-streams and improve compression performance. We conduct experiments following the MPEG Common Test Condition and demonstrate that U-Motion can achieve significant gains over MPEG G-PCC-GesTM v3.0 and recently published learning-based methods for both geometry and attribute compression.
- [49] arXiv:2411.14502 [pdf, other]
-
Title: Global Challenge for Safe and Secure LLMs Track 1Xiaojun Jia, Yihao Huang, Yang Liu, Peng Yan Tan, Weng Kuan Yau, Mun-Thye Mak, Xin Ming Sim, Wee Siong Ng, See Kiong Ng, Hanqing Liu, Lifeng Zhou, Huanqian Yan, Xiaobing Sun, Wei Liu, Long Wang, Yiming Qian, Yong Liu, Junxiao Yang, Zhexin Zhang, Leqi Lei, Renmiao Chen, Yida Lu, Shiyao Cui, Zizhou Wang, Shaohua Li, Yan Wang, Rick Siow Mong Goh, Liangli Zhen, Yingjie Zhang, Zhe ZhaoSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
This paper introduces the Global Challenge for Safe and Secure Large Language Models (LLMs), a pioneering initiative organized by AI Singapore (AISG) and the CyberSG R&D Programme Office (CRPO) to foster the development of advanced defense mechanisms against automated jailbreaking attacks. With the increasing integration of LLMs in critical sectors such as healthcare, finance, and public administration, ensuring these models are resilient to adversarial attacks is vital for preventing misuse and upholding ethical standards. This competition focused on two distinct tracks designed to evaluate and enhance the robustness of LLM security frameworks. Track 1 tasked participants with developing automated methods to probe LLM vulnerabilities by eliciting undesirable responses, effectively testing the limits of existing safety protocols within LLMs. Participants were challenged to devise techniques that could bypass content safeguards across a diverse array of scenarios, from offensive language to misinformation and illegal activities. Through this process, Track 1 aimed to deepen the understanding of LLM vulnerabilities and provide insights for creating more resilient models.
- [50] arXiv:2411.14503 [pdf, html, other]
-
Title: Planning-Driven Programming: A Large Language Model Programming WorkflowSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
The strong performance of large language models (LLMs) on natural language processing tasks raises extensive discussion on their application to code generation. Recent work suggests multiple sampling approaches to improve initial code generation accuracy or program repair approaches to refine the code. However, these methods suffer from LLMs' inefficiencies and limited reasoning capacity. In this work, we propose an LLM programming workflow (LPW) designed to improve both initial code generation and subsequent refinements within a structured two-phase workflow. Specifically, in the solution generation phase, the LLM first outlines a solution plan that decomposes the problem into manageable sub-problems and then verifies the generated solution plan through visible test cases. Subsequently, in the code implementation phase, the LLM initially drafts code according to the solution plan and its verification. If the generated code fails the visible tests, the plan verification serves as the intended natural language solution to inform the refinement process for correcting bugs. We further introduce SLPW, a sampling variant of LPW, which initially generates multiple solution plans and plan verifications, produces a program for each plan and its verification, and refines each program as necessary until one successfully passes the visible tests. Compared to the state-of-the-art methods across various existing LLMs, our experimental results show that LPW significantly improves the Pass@1 accuracy by up to 16.4% on well-established text-to-code generation benchmarks, especially with a notable improvement of around 10% on challenging benchmarks. Additionally, SLPW demonstrates up to a 5.6% improvement over LPW and sets new state-of-the-art Pass@1 accuracy on various benchmarks, e.g., 98.2% on HumanEval, 84.8% on MBPP, 64.0% on APPS, and 35.3% on CodeContest, using GPT-4o as the backbone.
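A minimal sketch of the two-phase loop described above follows; the `llm` stub, prompt wording, test harness, and `solve(input)` convention are assumptions for illustration, not the authors' actual prompts or infrastructure.

```python
# Minimal sketch of the LPW plan -> verify -> implement -> refine loop.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def run_visible_tests(code: str, tests: list[tuple[str, str]]) -> list[str]:
    """Execute code against (input, expected) pairs; return failure messages.
    Assumes the generated code defines a solve() entry point."""
    failures = []
    for inp, expected in tests:
        scope: dict = {}
        try:
            exec(code, scope)
            got = scope["solve"](inp)
            if str(got) != expected:
                failures.append(f"input {inp!r}: got {got!r}, want {expected!r}")
        except Exception as e:
            failures.append(f"input {inp!r}: raised {e!r}")
    return failures

def lpw(problem: str, tests, max_refinements: int = 4) -> str:
    # Phase 1: solution plan, then plan verification on visible tests.
    plan = llm(f"Decompose into sub-problems and outline a solution plan:\n{problem}")
    verification = llm(f"Verify this plan step by step on the visible tests:\n{plan}\n{tests}")
    # Phase 2: implement, then refine using the verification as the intended solution.
    code = llm(f"Implement the plan as a Python function solve(input):\n{plan}\n{verification}")
    for _ in range(max_refinements):
        failures = run_visible_tests(code, tests)
        if not failures:
            break
        code = llm(f"The code fails: {failures}. Using the plan verification as the "
                   f"intended solution, fix the bugs:\n{verification}\n{code}")
    return code
```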
- [51] arXiv:2411.14504 [pdf, html, other]
-
Title: Night-to-Day Translation via Illumination Degradation DisentanglementComments: 8 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV)
Night-to-Day translation (Night2Day) aims to achieve day-like vision for nighttime scenes. However, processing night images with complex degradations remains a significant challenge under unpaired conditions. Previous methods that uniformly mitigate these degradations have proven inadequate in simultaneously restoring daytime domain information and preserving underlying semantics. In this paper, we propose \textbf{N2D3} (\textbf{N}ight-to-\textbf{D}ay via \textbf{D}egradation \textbf{D}isentanglement) to identify different degradation patterns in nighttime images. Specifically, our method comprises a degradation disentanglement module and a degradation-aware contrastive learning module. Firstly, we extract physical priors from a photometric model based on Kubelka-Munk theory. Then, guided by these physical priors, we design a disentanglement module to discriminate among different illumination degradation regions. Finally, we introduce the degradation-aware contrastive learning strategy to preserve semantic consistency across distinct degradation regions. Our method is evaluated on two public datasets, demonstrating a significant improvement in visual quality and considerable potential for benefiting downstream tasks.
- [52] arXiv:2411.14505 [pdf, html, other]
-
Title: LLaVA-MR: Large Language-and-Vision Assistant for Video Moment RetrievalSubjects: Computer Vision and Pattern Recognition (cs.CV)
Multimodal Large Language Models (MLLMs) are widely used for visual perception, understanding, and reasoning. However, long video processing and precise moment retrieval remain challenging due to LLMs' limited context size and coarse frame extraction. We propose the Large Language-and-Vision Assistant for Moment Retrieval (LLaVA-MR), which enables accurate moment retrieval and contextual grounding in videos using MLLMs. LLaVA-MR combines Dense Frame and Time Encoding (DFTE) for spatial-temporal feature extraction, Informative Frame Selection (IFS) for capturing brief visual and motion patterns, and Dynamic Token Compression (DTC) to manage LLM context limitations. Evaluations on benchmarks like Charades-STA and QVHighlights demonstrate that LLaVA-MR outperforms 11 state-of-the-art methods, achieving improvements of 1.82% and 1.29% on the recall metrics of the QVHighlights dataset. Our implementation will be open-sourced upon acceptance.
- [53] arXiv:2411.14507 [pdf, html, other]
-
Title: FuseGPT: Learnable Layers Fusion of Generative Pre-trained TransformersSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Generative Pre-trained Transformers (GPTs) have demonstrated remarkable performance across diverse domains through the extensive scaling of model parameters. Recent works observe the redundancy across the transformer blocks and develop compression methods by structured pruning of the unimportant blocks. However, such straightforward elimination always causes irreversible performance degradation. In this paper, we propose FuseGPT, a novel methodology to recycle the pruned transformer blocks to further recover the model performance. First, we introduce a new importance detection metric, Macro Influence (MI), to detect the long-term influence of each transformer block by calculating the loss of information after its removal. Then we propose group-level layers fusion, which adopts the parameters in layers of the unimportant blocks and injects them into the corresponding layers inside the neighboring blocks. The fusion is not one-off but proceeds through iterative parameter updates by lightweight group-level fine-tuning. Specifically, these injected parameters are frozen but weighted with learnable rank decomposition matrices to reduce the overhead during fine-tuning. Our approach works well not only on large language models but also on large multimodal models. The experiments show that, using modest amounts of data, FuseGPT can outperform previous works in both perplexity and zero-shot task performance.
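The following PyTorch sketch illustrates the flavor of the fusion step: a pruned layer's frozen weights are injected into a kept layer and gated by a learnable low-rank product, so fine-tuning touches only a few parameters. Module names, shapes, and the exact gating form are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn

class FusedLinear(nn.Module):
    """Sketch of group-level fusion: the weight of a pruned block's layer
    is injected into a neighboring kept layer, held frozen, and gated by
    a learnable low-rank matrix (A @ B) so fine-tuning stays lightweight."""

    def __init__(self, keep: nn.Linear, pruned: nn.Linear, rank: int = 8):
        super().__init__()
        self.keep = keep
        self.register_buffer("w_inj", pruned.weight.detach().clone())  # frozen
        out_f, in_f = pruned.weight.shape
        # A starts at zero, so the fusion begins as a no-op and is learned in.
        self.A = nn.Parameter(torch.zeros(out_f, rank))
        self.B = nn.Parameter(torch.randn(rank, in_f) * 0.01)

    def forward(self, x):
        # Element-wise gate on the injected weight via the low-rank product.
        gate = self.A @ self.B                    # (out_f, in_f), few params
        return self.keep(x) + nn.functional.linear(x, gate * self.w_inj)

keep, pruned = nn.Linear(64, 64), nn.Linear(64, 64)
fused = FusedLinear(keep, pruned, rank=4)
y = fused(torch.randn(2, 64))                     # same interface as keep
```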
- [54] arXiv:2411.14509 [pdf, html, other]
-
Title: End-to-End Convolutional Activation Anomaly Analysis for Anomaly DetectionSubjects: Machine Learning (cs.LG)
We propose an End-to-end Convolutional Activation Anomaly Analysis (E2E-CA$^3$), a significant extension of the A$^3$ anomaly detection approach proposed by Sperl, Schulze and Böttinger, in both architecture and scope of application. In contrast to the original idea, we utilize a convolutional autoencoder as a target network, which allows for natural application of the method to both image and tabular data. The alarm network is also designed as a CNN, where the activations of convolutional layers from the CAE are stacked together into a $(k+1)$-dimensional tensor. Moreover, we combine the classification loss of the alarm network with the reconstruction error of the target CAE, as a "best of both worlds" approach, which greatly increases the versatility of the network. The evaluation shows that despite its generally straightforward and lightweight architecture, the method achieves very promising anomaly detection performance on common datasets such as MNIST, CIFAR-10 and KDDcup99.
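As a rough sketch of the combined objective (using a single convolutional layer's activations rather than the full stacked tensor), the following PyTorch snippet adds the alarm network's classification loss to the target CAE's reconstruction error; the architectures, the 0.5 weighting, and the toy labels are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    """Toy convolutional autoencoder standing in for the target network."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.Conv2d(8, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        z = self.enc(x)                   # activations fed to the alarm net
        return self.dec(z), z

cae = ConvAE()
alarm = nn.Sequential(nn.Conv2d(8, 4, 3, padding=1), nn.Flatten(),
                      nn.LazyLinear(1))   # CNN alarm network on activations

bce = nn.BCEWithLogitsLoss()
mse = nn.MSELoss()

x = torch.rand(16, 1, 28, 28)
labels = torch.randint(0, 2, (16, 1)).float()    # 1 = anomaly (toy labels)

recon, acts = cae(x)
# "Best of both worlds": classification loss plus reconstruction error.
loss = bce(alarm(acts), labels) + 0.5 * mse(recon, x)
loss.backward()
```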
- [55] arXiv:2411.14511 [pdf, html, other]
-
Title: Variational Autoencoders for Efficient Simulation-Based InferenceSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
We present a generative modeling approach based on the variational inference framework for likelihood-free simulation-based inference. The method leverages latent variables within variational autoencoders to efficiently estimate complex posterior distributions arising from stochastic simulations. We explore two variations of this approach distinguished by their treatment of the prior distribution. The first model adapts the prior based on observed data using a multivariate prior network, enhancing generalization across various posterior queries. In contrast, the second model utilizes a standard Gaussian prior, offering simplicity while still effectively capturing complex posterior distributions. We demonstrate the efficacy of these models on well-established benchmark problems, achieving results comparable to flow-based approaches while maintaining computational efficiency and scalability.
- [56] arXiv:2411.14512 [pdf, other]
-
Title: Detecting Distributed Denial of Service Attacks Using Logistic Regression and SVM MethodsSubjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
A distributed denial-of-service (DDoS) attack is an attempt to produce humongous traffic within a network by overwhelming a targeted server or its neighboring infrastructure with a flood of service requests ceaselessly coming from multiple remotely controlled malware-infected computers or network-connected devices. Thus, exploring DDoS attacks by recognizing their functionalities and differentiating them from normal traffic is a primary concern of network security, particularly for online businesses. In modern networks, most DDoS attacks occur in the network and application layers, including HTTP flood, UDP flood, SIDDOS, SMURF, SNMP flood, IP NULL, etc. The goal of this paper is to detect DDoS attacks among all service requests and classify them according to DDoS classes. In this regard, a standard dataset is collected from the internet which contains several network-related attributes and their corresponding DDoS attack class name. Two different machine learning approaches, SVM and Logistic Regression, are implemented on the dataset for detecting and classifying DDoS attacks, and a comparative study is accomplished between them in terms of accuracy, precision, and recall rates. Logistic Regression and SVM both achieve 98.65% classification accuracy, which is the highest achieved accuracy among previous experiments with the same dataset.
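A minimal scikit-learn sketch of this kind of comparison is shown below, with synthetic data standing in for the collected traffic dataset; the feature scaling and hyperparameters are illustrative choices, not the paper's setup.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification

# Stand-in for the traffic dataset: rows are flow-level attributes,
# labels are DDoS classes (here synthetic).
X, y = make_classification(n_samples=2000, n_features=20, n_classes=4,
                           n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, clf in [("LogisticRegression", LogisticRegression(max_iter=1000)),
                  ("SVM", SVC(kernel="rbf"))]:
    model = make_pipeline(StandardScaler(), clf).fit(X_tr, y_tr)
    print(name)
    print(classification_report(y_te, model.predict(X_te)))  # accuracy/precision/recall
```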
- [57] arXiv:2411.14513 [pdf, html, other]
-
Title: Towards a Middleware for Large Language ModelsSubjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Large language models have gained widespread popularity for their ability to process natural language inputs and generate insights derived from their training data, nearing the qualities of true artificial intelligence. This advancement has prompted enterprises worldwide to integrate LLMs into their services. So far, this effort is dominated by commercial cloud-based solutions like OpenAI's ChatGPT and Microsoft Azure. As the technology matures, however, there is a strong incentive for independence from major cloud providers through self-hosting "LLM as a Service", driven by privacy, cost, and customization needs. In practice, hosting LLMs independently presents significant challenges due to their complexity and integration issues with existing systems. In this paper, we discuss our vision for a forward-looking middleware system architecture that facilitates the deployment and adoption of LLMs in enterprises, even for advanced use cases in which we foresee LLMs to serve as gateways to a complete application ecosystem and, to some degree, absorb functionality traditionally attributed to the middleware.
- [58] arXiv:2411.14514 [pdf, html, other]
-
Title: NexusSplats: Efficient 3D Gaussian Splatting in the WildComments: submitted to CVPR 2025Subjects: Computer Vision and Pattern Recognition (cs.CV)
While 3D Gaussian Splatting (3DGS) has recently demonstrated remarkable rendering quality and efficiency in 3D scene reconstruction, it struggles with varying lighting conditions and incidental occlusions in real-world scenarios. To accommodate varying lighting conditions, existing 3DGS extensions apply color mapping to the massive Gaussian primitives with individually optimized appearance embeddings. To handle occlusions, they predict pixel-wise uncertainties via 2D image features for occlusion capture. Nevertheless, such massive color mapping and pixel-wise uncertainty prediction strategies suffer from not only additional computational costs but also coarse-grained lighting and occlusion handling. In this work, we propose a nexus kernel-driven approach, termed NexusSplats, for efficient and finer 3D scene reconstruction under complex lighting and occlusion conditions. In particular, NexusSplats leverages a novel light decoupling strategy where appearance embeddings are optimized based on nexus kernels instead of massive Gaussian primitives, thus accelerating reconstruction speeds while ensuring local color consistency for finer textures. Additionally, a Gaussian-wise uncertainty mechanism is developed, aligning 3D structures with 2D image features for fine-grained occlusion handling. Experimental results demonstrate that NexusSplats achieves state-of-the-art rendering quality while reducing reconstruction time by up to 70.4% compared to the current best in quality.
- [59] arXiv:2411.14515 [pdf, html, other]
-
Title: Are Anomaly Scores Telling the Whole Story? A Benchmark for Multilevel Anomaly DetectionComments: Under reviewSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Anomaly detection (AD) is a machine learning task that identifies anomalies by learning patterns from normal training data. In many real-world scenarios, anomalies vary in severity, from minor anomalies with little risk to severe abnormalities requiring immediate attention. However, existing models primarily operate in a binary setting, and the anomaly scores they produce are usually based on the deviation of data points from normal data, which may not accurately reflect practical severity. In this paper, we address this gap by making three key contributions. First, we propose a novel setting, Multilevel AD (MAD), in which the anomaly score represents the severity of anomalies in real-world applications, and we highlight its diverse applications across various domains. Second, we introduce a novel benchmark, MAD-Bench, that evaluates models not only on their ability to detect anomalies, but also on how effectively their anomaly scores reflect severity. This benchmark incorporates multiple types of baselines and real-world applications involving severity. Finally, we conduct a comprehensive performance analysis on MAD-Bench. We evaluate models on their ability to assign severity-aligned scores, investigate the correspondence between their performance on binary and multilevel detection, and study their robustness. This analysis offers key insights into improving AD models for practical severity alignment. The code framework and datasets used for the benchmark will be made publicly available.
- [60] arXiv:2411.14516 [pdf, html, other]
-
Title: Memory Backdoor Attacks on Neural NetworksSubjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Neural networks, such as image classifiers, are frequently trained on proprietary and confidential datasets. It is generally assumed that once deployed, the training data remains secure, as adversaries are limited to query-response interactions with the model, where at best, fragments of arbitrary data can be inferred without any guarantees on their authenticity. In this paper, we propose the memory backdoor attack, where a model is covertly trained to memorize specific training samples and later selectively output them when triggered with an index pattern. What makes this attack unique is that it (1) works even when the tasks conflict (making a classifier output images), (2) enables the systematic extraction of training samples from deployed models and (3) offers guarantees on the authenticity of the extracted data. We demonstrate the attack on image classifiers, segmentation models, and a large language model (LLM). With this attack, it is possible to hide thousands of images and texts in modern vision architectures and LLMs respectively, all while maintaining model performance. The memory backdoor attack poses a significant threat not only to conventional model deployments but also to federated learning paradigms and other modern frameworks. Therefore, we suggest an efficient and effective countermeasure that can be immediately applied and advocate for further work on the topic.
- [61] arXiv:2411.14517 [pdf, html, other]
-
Title: The Double-Ellipsoid Geometry of CLIPSubjects: Computer Vision and Pattern Recognition (cs.CV)
Contrastive Language-Image Pre-Training (CLIP) is highly instrumental in machine learning applications within a large variety of domains. We investigate the geometry of this embedding, which is still not well understood. We examine the raw unnormalized embedding and show that text and image reside on linearly separable ellipsoid shells, not centered at the origin. We explain the benefits of this structure, which allows instances to be better embedded according to their uncertainty during contrastive training. Frequent concepts in the dataset yield more false negatives, inducing greater uncertainty. A new notion of conformity is introduced, which measures the average cosine similarity of an instance to any other instance within a representative data set. We show this measure can be accurately estimated by simply computing the cosine similarity to the modality mean vector. Furthermore, we find that CLIP's modality gap optimizes the matching of the conformity distributions of image and text.
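The conformity estimate can be made concrete in a few lines of NumPy: by linearity, the average cosine similarity of an instance to a set equals its dot product with the mean of the normalized embeddings, which is the cosine similarity to the modality mean up to a constant scale. The random embeddings below are stand-ins for real CLIP outputs.

```python
import numpy as np

# Sketch of the conformity measure and its mean-vector shortcut.
rng = np.random.default_rng(0)
embeddings = rng.normal(loc=0.3, scale=1.0, size=(10_000, 512))

unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
mean_vec = unit.mean(axis=0)
mean_unit = mean_vec / np.linalg.norm(mean_vec)

x = unit[0]
exact = (unit @ x).mean()                      # avg cosine sim to all instances
approx = float(x @ mean_unit) * np.linalg.norm(mean_vec)

print(exact, approx)   # identical by linearity; rankings need only x @ mean_unit
```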
- [62] arXiv:2411.14519 [pdf, html, other]
-
Title: Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy ConditioningComments: 15 pages, 5 figuresSubjects: Robotics (cs.RO)
Learning from multiple domains is a primary factor that influences the generalization of a single unified robot system. In this paper, we aim to learn the trajectory prediction model by using broad out-of-domain data to improve its performance and generalization ability. The trajectory model is designed to predict any-point trajectories in the current frame given an instruction and can provide detailed control guidance for robotic policy learning. To handle the diverse out-of-domain data distribution, we propose a sparsely-gated MoE (\textbf{Top-1} gating strategy) architecture for the trajectory model, coined as \textbf{Tra-MoE}. The sparse activation design enables a good balance between parameter cooperation and specialization, effectively benefiting from large-scale out-of-domain data while maintaining constant FLOPs per token. In addition, we further introduce an adaptive policy conditioning technique by learning 2D mask representations for predicted trajectories, which is explicitly aligned with image observations to guide action prediction more flexibly. We perform extensive experiments on both simulation and real-world scenarios to verify the effectiveness of Tra-MoE and the adaptive policy conditioning technique. We also conduct a comprehensive empirical study to train Tra-MoE, demonstrating that our Tra-MoE consistently exhibits superior performance compared to the dense baseline model, even when the latter is scaled to match Tra-MoE's parameter count.
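A minimal PyTorch sketch of Top-1 sparse gating, the routing scheme named above, follows; the expert architecture and sizes are illustrative, not Tra-MoE's actual configuration.

```python
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    """Sketch of a sparsely-gated MoE layer with Top-1 routing: each token
    activates a single expert, so FLOPs per token stay constant regardless
    of how many experts the model holds."""

    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts))

    def forward(self, x):                    # x: (tokens, dim)
        scores = self.gate(x).softmax(dim=-1)
        weight, idx = scores.max(dim=-1)     # Top-1 gating
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                  # only the routed tokens hit expert e
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

moe = Top1MoE(dim=256, num_experts=4)
y = moe(torch.randn(32, 256))
```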
- [63] arXiv:2411.14520 [pdf, other]
-
Title: Open Challenges in the Formal Verification of Autonomous DrivingPaolo Burgio (University of Modena and Reggio Emilia), Angelo Ferrando (University of Modena and Reggio Emilia), Marco Villani (University of Modena and Reggio Emilia)Comments: In Proceedings FMAS2024, arXiv:2411.13215Journal-ref: EPTCS 411, 2024, pp. 191-200Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Robotics (cs.RO)
In the realm of autonomous driving, the development and integration of highly complex and heterogeneous systems are standard practice. Modern vehicles are not monolithic systems; instead, they are composed of diverse hardware components, each running its own software systems. An autonomous vehicle comprises numerous independent components, often developed by different and potentially competing companies. This diversity poses significant challenges for the certification process, as it necessitates certifying components that may not disclose their internal behaviour (black-boxes). In this paper, we present a real-world case study of an autonomous driving system, identify key open challenges associated with its development and integration, and explore how formal verification techniques can address these challenges to ensure system reliability and safety.
- [64] arXiv:2411.14521 [pdf, html, other]
-
Title: MyTimeMachine: Personalized Facial Age TransformationComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Facial aging is a complex process, highly dependent on multiple factors like gender, ethnicity, lifestyle, etc., making it extremely challenging to learn a global aging prior to predict aging for any individual accurately. Existing techniques often produce realistic and plausible aging results, but the re-aged images often do not resemble the person's appearance at the target age and thus need personalization. In many practical applications of virtual aging, e.g. VFX in movies and TV shows, access to a personal photo collection of the user depicting aging in a small time interval (20$\sim$40 years) is often available. However, naive attempts to personalize global aging techniques on personal photo collections often fail. Thus, we propose MyTimeMachine (MyTM), which combines a global aging prior with a personal photo collection (using as few as 50 images) to learn a personalized age transformation. We introduce a novel Adapter Network that combines personalized aging features with global aging features and generates a re-aged image with StyleGAN2. We also introduce three loss functions to personalize the Adapter Network with personalized aging loss, extrapolation regularization, and adaptive w-norm regularization. Our approach can also be extended to videos, achieving high-quality, identity-preserving, and temporally consistent aging effects that resemble actual appearances at target ages, demonstrating its superiority over state-of-the-art approaches.
- [65] arXiv:2411.14522 [pdf, html, other]
-
Title: GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AITianbin Li, Yanzhou Su, Wei Li, Bin Fu, Zhe Chen, Ziyan Huang, Guoan Wang, Chenglong Ma, Ying Chen, Ming Hu, Yanjun Li, Pengcheng Chen, Xiaowei Hu, Zhongying Deng, Yuanfeng Ji, Jin Ye, Yu Qiao, Junjun HeSubjects: Computer Vision and Pattern Recognition (cs.CV)
Despite significant advancements in general artificial intelligence, such as GPT-4, their effectiveness in the medical domain (general medical AI, GMAI) remains constrained due to the absence of specialized medical knowledge. To address this challenge, we present GMAI-VL-5.5M, a comprehensive multimodal medical dataset created by converting hundreds of specialized medical datasets into meticulously constructed image-text pairs. This dataset features comprehensive task coverage, diverse modalities, and high-quality image-text data. Building upon this multimodal dataset, we propose GMAI-VL, a general medical vision-language model with a progressive three-stage training strategy. This approach integrates visual and textual information, significantly improving the model's ability to process multimodal data and support accurate diagnosis and clinical decision-making. Experimental evaluations demonstrate that GMAI-VL achieves state-of-the-art results across a wide range of multimodal medical tasks, such as visual question answering and medical image diagnosis. Our contributions include the development of the GMAI-VL-5.5M dataset, the introduction of the GMAI-VL model, and the establishment of new benchmarks in multiple medical domains. Code and dataset will be released at this https URL.
- [66] arXiv:2411.14538 [pdf, html, other]
-
Title: A hierarchy of reversible finite automataComments: 29 pages, 5 figuresSubjects: Formal Languages and Automata Theory (cs.FL)
In this paper, different variants of reversible finite automata are compared, and their hierarchy by the expressive power is established. It is shown that one-way reversible automata with multiple initial states (MRFA) recognize strictly more languages than sweeping reversible automata (sRFA), which are in turn stronger than one-way reversible automata with a single initial state (1RFA). The latter recognize strictly more languages than one-way permutation automata (1PerFA). It is also shown that the hierarchy of sRFA by the number of passes over the input string collapses: it turns out that three passes are always enough. On the other hand, MRFA form a hierarchy by the number of initial states: their subclass with at most $k$ initial states (MRFA$^k$) recognize strictly fewer languages than MRFA$^{k + 1}$, and also MRFA$^k$ are incomparable with sRFA. In the unary case, sRFA, MRFA$^k$ and MRFA become equal in their expressive power, and the inclusion of 1RFA into sRFA remains proper.
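For readers unfamiliar with the definitions, the sketch below checks the two basic conditions on a DFA's transition table: reversibility (every letter induces an injective map on states) and the permutation property (every letter induces a bijection on the state set); the example automaton is illustrative.

```python
# Sketch: checking reversibility and permutation conditions on a DFA's
# transition table delta[state][symbol] -> state (table is illustrative).
def is_reversible(states, alphabet, delta):
    # Each letter must act injectively on the states where it is defined.
    return all(len({delta[q][a] for q in states if a in delta[q]}) ==
               sum(1 for q in states if a in delta[q])
               for a in alphabet)

def is_permutation(states, alphabet, delta):
    # Each letter must act as a bijection on the full state set.
    return all({delta[q][a] for q in states} == set(states) for a in alphabet)

states = [0, 1, 2]
alphabet = "ab"
delta = {0: {"a": 1, "b": 2}, 1: {"a": 2, "b": 0}, 2: {"a": 0, "b": 1}}
print(is_reversible(states, alphabet, delta))   # True: each letter injective
print(is_permutation(states, alphabet, delta))  # True: each letter a bijection
```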
- [67] arXiv:2411.14539 [pdf, other]
-
Title: Performance Analysis of Traditional and Network Coded Transmission in Infrastructure-less Multi-hop Wireless NetworksComments: 10 pages, 9 figures, 4 tables, journalSubjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Infrastructure-less Multi-hop Wireless Networks are the backbone for mission critical communications such as in disaster and battlefield scenarios. However, interference signals in the wireless channel cause losses to transmission in wireless networks resulting in a reduced network throughput and making efficient transmission very challenging. Therefore, techniques to overcome interference and increase transmission efficiency have been a hot area of research for decades. In this paper two methods for transmitting data through infrastructure-less multi hop wireless networks, Traditional (TR) and Network Coded (NC) transmission are thoroughly examined for scenarios having one or two communication streams in a network. The study has developed network models in MATLAB for each transmission technique and scenario. The simulation results showed that the NC transmission method yielded a better throughput under the same network settings and physical interference. Furthermore, the impact of increasing numbers of hops between source and destination on the network capacity and the communications latency was also observed and conclusions were drawn.
- [68] arXiv:2411.14550 [pdf, other]
-
Title: The importance of the clustering model to detect new types of intrusion in data trafficComments: 18 pages, 4 figuresSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
In the current digital age, the volume of data generated by various cyber activities has become enormous and is constantly increasing. The data may contain valuable insights that can be harnessed to improve cyber security measures. However, much of this data is unclassified and qualitative, which poses significant challenges to traditional analysis methods. Clustering facilitates the identification of hidden patterns and structures in data by grouping similar data points, which makes it simpler to identify and address threats. Clustering can be defined as a data mining (DM) approach that uses similarity calculations to divide a data set into several categories. Hierarchical, density-based, and partitioning clustering algorithms are typical. The presented work uses the K-means algorithm, a popular clustering technique. Using the K-means algorithm, we worked with two different types of data: first, we gathered data using the XGBoost algorithm in the Kali Linux environment with CICFlowMeter traffic captures and the PuTTY software, using diverse and simple attacks, and then completed the aggregation with the K-means algorithm. The concept could assist in identifying new attack types, which are distinct from the known attacks, and labeling them based on the characteristics they exhibit, as the dynamic nature of cyber threats means that new attack types often emerge for which labeled data might not yet exist. The model counted the attacks and assigned numbers to each of them. Second, we applied the same approach to the ready-made data in the Kaggle repository called (Intrusion Detection in Internet of Things Network), and the clustering model worked well and detected the number of attacks correctly, as shown in the results section.
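A minimal sketch of the clustering step with scikit-learn is given below, using synthetic flow features in place of the captured traffic; the number of clusters and the feature layout are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Sketch: group unlabeled traffic flows with K-means and count the
# members of each cluster, so unusual clusters can be inspected as
# candidate new attack types. Features are synthetic stand-ins.
rng = np.random.default_rng(0)
flows = np.vstack([rng.normal(0, 1, (500, 10)),     # "normal" traffic
                   rng.normal(4, 1, (60, 10))])     # unusual burst

X = StandardScaler().fit_transform(flows)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

labels, counts = np.unique(km.labels_, return_counts=True)
for c, n in zip(labels, counts):
    print(f"cluster {c}: {n} flows")                # per-cluster counts
```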
- [69] arXiv:2411.14551 [pdf, html, other]
-
Title: An Experimental Study on Data Augmentation Techniques for Named Entity Recognition on Low-Resource DomainsArthur Elwing Torres, Edleno Silva de Moura, Altigran Soares da Silva, Mario A. Nascimento, Filipe MesquitaComments: 21 pages, 2 figuresSubjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Named Entity Recognition (NER) is a machine learning task that traditionally relies on supervised learning and annotated data. Acquiring such data is often a challenge, particularly in specialized fields like medical, legal, and financial sectors. Those are commonly referred to as low-resource domains, which comprise long-tail entities, due to the scarcity of available data. To address this, data augmentation techniques are increasingly being employed to generate additional training instances from the original dataset. In this study, we evaluate the effectiveness of two prominent text augmentation techniques, Mention Replacement and Contextual Word Replacement, on two widely-used NER models, Bi-LSTM+CRF and BERT. We conduct experiments on four datasets from low-resource domains, and we explore the impact of various combinations of training subset sizes and number of augmented examples. We not only confirm that data augmentation is particularly beneficial for smaller datasets, but we also demonstrate that there is no universally optimal number of augmented examples, i.e., NER practitioners must experiment with different quantities in order to fine-tune their projects.
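To make Mention Replacement concrete, here is a small sketch over BIO-tagged sentences: entity mentions are collected by type from the training set and swapped into new sentences. The helper names and toy data are illustrative, not the paper's implementation.

```python
import random

# Sketch of Mention Replacement on BIO-tagged token sequences.
def collect_mentions(dataset):
    """Gather all mentions from (tokens, tags) pairs, grouped by entity type."""
    mentions = {}
    for tokens, tags in dataset:
        i = 0
        while i < len(tags):
            if tags[i].startswith("B-"):
                etype, j = tags[i][2:], i + 1
                while j < len(tags) and tags[j] == f"I-{etype}":
                    j += 1
                mentions.setdefault(etype, []).append(tokens[i:j])
                i = j
            else:
                i += 1
    return mentions

def mention_replace(tokens, tags, mentions, rng=random):
    """Swap each mention for a random same-type mention from the pool."""
    out_tokens, out_tags, i = [], [], 0
    while i < len(tags):
        if tags[i].startswith("B-") and mentions.get(tags[i][2:]):
            etype, j = tags[i][2:], i + 1
            while j < len(tags) and tags[j] == f"I-{etype}":
                j += 1
            new = rng.choice(mentions[etype])
            out_tokens += new
            out_tags += [f"B-{etype}"] + [f"I-{etype}"] * (len(new) - 1)
            i = j
        else:
            out_tokens.append(tokens[i]); out_tags.append(tags[i]); i += 1
    return out_tokens, out_tags

data = [(["Acme", "Corp", "sued", "Bob"], ["B-ORG", "I-ORG", "O", "B-PER"]),
        (["Jane", "joined", "Initech"], ["B-PER", "O", "B-ORG"])]
print(mention_replace(*data[0], collect_mentions(data)))
```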
- [70] arXiv:2411.14553 [pdf, html, other]
-
Title: Reducibility among NP-Hard graph problems and boundary classesComments: 9 pages, 6 figuresSubjects: Computational Complexity (cs.CC); Computation and Language (cs.CL); Discrete Mathematics (cs.DM)
Many NP-hard graph problems become easy for some classes of graphs; for example, coloring is easy for bipartite graphs but NP-hard in general. So we can ask questions like: when does a hard problem become easy? What is the minimum substructure for which the problem remains hard? We use the notion of boundary classes to study such questions. In this paper, we introduce a method for transforming the boundary class of one NP-hard graph problem into a boundary class for another problem. If $\Pi$ and $\Gamma$ are two NP-hard graph problems where $\Pi$ is reducible to $\Gamma$, we transform a boundary class of $\Pi$ into a boundary class of $\Gamma$. More formally, if $\Pi$ is reducible to $\Gamma$, where the reduction is bijective and maps hereditary classes of graphs to hereditary classes of graphs, then $X$ is a boundary class of $\Pi$ if and only if the image of $X$ under the reduction is a boundary class of $\Gamma$. This gives us a relationship between boundary classes and reducibility among several NP-hard problems. To show the strength of our main result, we apply our theorem to obtain some previously unknown boundary classes for a few graph problems, namely vertex-cover, clique, traveling-salesperson, bounded-degree-spanning-tree, subgraph-isomorphism and clique-cover.
- [71] arXiv:2411.14554 [pdf, html, other]
-
Title: Swift: A Multi-FPGA Framework for Scaling Up Accelerated Graph AnalyticsComments: Accepted in International Conference on Field Programmable Technology (FPT-2024)Subjects: Hardware Architecture (cs.AR)
Graph analytics are vital in fields such as social networks, biomedical research, and graph neural networks (GNNs). However, traditional CPUs and GPUs struggle with the memory bottlenecks caused by large graph datasets and their fine-grained memory accesses. While specialized graph accelerators address these challenges, they often support only moderate-sized graphs (under 500 million edges). Our paper proposes Swift, a novel scale-up graph accelerator framework that processes large graphs by leveraging the flexibility of FPGA custom datapaths and memory resources, and optimizes utilization of high-bandwidth 3D memory (HBM). Swift supports up to 8 FPGAs in a node. Swift introduces a decoupled, asynchronous model based on the Gather-Apply-Scatter (GAS) scheme. It partitions the graph into subgraphs across FPGAs, and divides each subgraph into intervals based on source vertex IDs. Processing on these intervals is decoupled and executed asynchronously, instead of in bulk-synchronous operation where throughput is limited by the slowest task. This enables simultaneous processing within each multi-FPGA node and optimizes the utilization of communication (PCIe), off-chip (HBM), and on-chip BRAM/URAM resources. Swift demonstrates significant performance improvements compared to prior scalable FPGA-based frameworks, performing 12.8 times better than ForeGraph. Performance against Gunrock on NVIDIA A40 GPUs is mixed, because NVLink gives the GPU system a nearly 5x bandwidth advantage, but the FPGA system nevertheless achieves 2.6x greater energy efficiency.
- [72] arXiv:2411.14555 [pdf, html, other]
-
Title: Deep operator network models for predicting post-burn contractionSubjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Tissues and Organs (q-bio.TO)
Burn injuries present a significant global health challenge. Among the most severe long-term consequences are contractures, which can lead to functional impairments and disfigurement. Understanding and predicting the evolution of post-burn wounds is essential for developing effective treatment strategies. Traditional mathematical models, while accurate, are often computationally expensive and time-consuming, limiting their practical application. Recent advancements in machine learning, particularly in deep learning, offer promising alternatives for accelerating these predictions. This study explores the use of a deep operator network (DeepONet), a type of neural operator, as a surrogate model for finite element simulations, aimed at predicting post-burn contraction across multiple wound shapes. A DeepONet was trained on three distinct initial wound shapes, with enhancements made to the architecture: incorporating initial wound shape information and applying sine augmentation to enforce boundary conditions. The performance of the trained DeepONet was evaluated on a test set including finite element simulations based on convex combinations of the three basic wound shapes. The model achieved an $R^2$ score of $0.99$, indicating strong predictive accuracy and generalization. Moreover, the model provided reliable predictions over an extended period of up to one year, with speedups of up to 128-fold on CPU and 235-fold on GPU, compared to the numerical model. These findings suggest that DeepONets can effectively serve as a surrogate for traditional finite element methods in simulating post-burn wound evolution, with potential applications in medical treatment planning.
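A minimal PyTorch sketch of the DeepONet structure follows: a branch net encodes the sampled input function and a trunk net encodes a query coordinate, with their dot product giving the predicted field value. The sensor count, widths, and coordinate layout are illustrative assumptions; the paper's shape-information and sine-augmentation enhancements are omitted.

```python
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    """Minimal DeepONet sketch: branch net on the input function sampled
    at m sensor points (e.g. an initial wound-shape descriptor), trunk
    net on a query coordinate (x, y, t); output is their dot product."""

    def __init__(self, m_sensors=64, coord_dim=3, p=32):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(m_sensors, 128), nn.Tanh(),
                                    nn.Linear(128, p))
        self.trunk = nn.Sequential(nn.Linear(coord_dim, 128), nn.Tanh(),
                                   nn.Linear(128, p))

    def forward(self, u_sensors, coords):
        b = self.branch(u_sensors)        # (batch, p)
        t = self.trunk(coords)            # (batch, p)
        return (b * t).sum(dim=-1, keepdim=True)

net = DeepONet()
u = torch.randn(16, 64)                   # sampled input functions
xyt = torch.rand(16, 3)                   # query points (x, y, t)
pred = net(u, xyt)                        # predicted field value at each query
```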
- [73] arXiv:2411.14557 [pdf, html, other]
-
Title: Privacy-Preserving Power Flow Analysis via Secure Multi-Party ComputationJonas von der Heyden, Nils Schlüter, Philipp Binfet, Martin Asman, Markus Zdrallek, Tibor Jager, Moritz Schulze DarupSubjects: Cryptography and Security (cs.CR); Systems and Control (eess.SY)
Smart grids feature a bidirectional flow of electricity and data, enhancing flexibility, efficiency, and reliability in increasingly volatile energy grids. However, data from smart meters can reveal sensitive private information. Consequently, the adoption of smart meters is often restricted via legal means and hampered by limited user acceptance. Since metering data is beneficial for fault-free grid operation, power management, and resource allocation, applying privacy-preserving techniques to smart metering data is an important research problem. This work addresses this by using secure multi-party computation (SMPC), allowing multiple parties to jointly evaluate functions of their private inputs without revealing the latter. Concretely, we show how to perform power flow analysis on cryptographically hidden prosumer data. More precisely, we present a tailored solution to the power flow problem building on an SMPC implementation of Newton's method. We analyze the security of our approach in the universal composability framework and provide benchmarks for various grid types, threat models, and solvers. Our results indicate that secure multi-party computation can alleviate privacy issues in smart grids in certain applications.
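For orientation, a plain (non-secure) sketch of the underlying Newton iteration is shown below on a toy nonlinear system standing in for the power flow mismatch equations; in the paper's setting, each arithmetic operation would instead be evaluated under SMPC on secret-shared prosumer data.

```python
import numpy as np

# Plain Newton iteration; the toy 2-equation system is illustrative.
def newton(f, jac, x0, tol=1e-10, max_iter=50):
    x = x0.astype(float)
    for _ in range(max_iter):
        fx = f(x)
        if np.linalg.norm(fx) < tol:
            break
        x -= np.linalg.solve(jac(x), fx)    # Newton step
    return x

# Toy nonlinear system standing in for power flow mismatch equations.
f = lambda x: np.array([x[0]**2 + x[1] - 2.0, x[0] - x[1]**2])
jac = lambda x: np.array([[2 * x[0], 1.0], [1.0, -2 * x[1]]])
print(newton(f, jac, np.array([2.0, 0.5])))   # converges to [1, 1]
```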
- [74] arXiv:2411.14559 [pdf, html, other]
-
Title: Union of Finitely Generated Congruences on Ground Term AlgebraComments: 57 pagesSubjects: Symbolic Computation (cs.SC); Logic in Computer Science (cs.LO)
We show that for any ground term equation systems $E$ and $F$, (1) the union of the generated congruences by $E$ and $F$ is a congruence on the ground term algebra if and only if there exists a ground term equation system $H$ such that the congruence generated by $H$ is equal to the union of the congruences generated by $E$ and $F$ if and only if the congruence generated by the union of $E$ and $F$ is equal to the union of the congruences generated by $E$ and $F$, and (2) it is decidable in square time whether the congruence generated by the union of $E$ and $F$ is equal to the union of the congruences generated by $E$ and $F$, where the size of the input is the number of occurrences of symbols in $E$ plus the number of occurrences of symbols in $F$.
- [75] arXiv:2411.14560 [pdf, other]
-
Title: Enhancing GeoAI and location encoding with spatial point pattern statistics: A Case Study of Terrain Feature ClassificationComments: 4 pages with 1 figure. Accepted in 7th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge DiscoverySubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
This study introduces a novel approach to terrain feature classification by incorporating spatial point pattern statistics into deep learning models. Inspired by the concept of location encoding, which aims to capture location characteristics to enhance GeoAI decision-making capabilities, we improve the GeoAI model with a knowledge-driven approach that integrates both first-order and second-order effects of point patterns. This paper investigates how these spatial contexts impact the accuracy of terrain feature predictions. The results show that incorporating spatial point pattern statistics notably enhances model performance by leveraging different representations of spatial relationships.
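As a sketch of the two kinds of statistics involved, the snippet below computes a kernel estimate of local intensity (a first-order effect) and a naive Ripley's K (a second-order effect) for a synthetic point pattern; the bandwidth and window are illustrative, and no edge correction is applied.

```python
# First- and second-order point pattern statistics on a synthetic pattern.
import numpy as np

rng = np.random.default_rng(1)
pts = rng.uniform(0, 1, size=(200, 2))         # point pattern in the unit square
area = 1.0

def local_intensity(p, pts, bw=0.1):
    """First-order effect: Gaussian kernel estimate of intensity at location p."""
    d2 = np.sum((pts - p) ** 2, axis=1)
    return np.sum(np.exp(-d2 / (2 * bw ** 2))) / (2 * np.pi * bw ** 2)

def ripley_k(pts, r):
    """Second-order effect: naive Ripley's K at radius r (no edge correction)."""
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return area * np.sum(d < r) / (len(pts) * (len(pts) - 1))

r = 0.05
print("intensity at center:", local_intensity(np.array([0.5, 0.5]), pts))
print("K(r) = %.4f vs pi*r^2 = %.4f under complete spatial randomness"
      % (ripley_k(pts, r), np.pi * r * r))
```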
- [76] arXiv:2411.14561 [pdf, html, other]
-
Title: Subspace and auxiliary space preconditioners for high-order interior penalty discretizations in $H(\mathrm{div})$Comments: 24 pages, 1 figureSubjects: Numerical Analysis (math.NA)
In this paper, we construct and analyze preconditioners for the interior penalty discontinuous Galerkin discretization posed in the space $H(\mathrm{div})$. These discretizations are used as one component in exactly divergence-free pressure-robust discretizations for the Stokes problem. Three preconditioners are presently considered: a subspace correction preconditioner using vertex patches and the lowest-order $H^1$-conforming space as a coarse space, a fictitious space preconditioner using the degree-$p$ discontinuous Galerkin space, and an auxiliary space preconditioner using the degree-$(p-1)$ discontinuous Galerkin space and a block Jacobi smoother. On certain classes of meshes, the subspace and fictitious space preconditioners result in provably well-conditioned systems, independent of the mesh size $h$, polynomial degree $p$, and penalty parameter $\eta$. All three preconditioners are shown to be robust with respect to $h$ on general meshes, and numerical results indicate that the iteration counts grow only mildly with respect to $p$ in the general case. Numerical examples illustrate the convergence properties of the preconditioners applied to structured and unstructured meshes. These solvers are used to construct block-diagonal preconditioners for the Stokes problem, which result in uniform convergence when used with MINRES.
- [77] arXiv:2411.14563 [pdf, html, other]
-
Title: Constructing Trustworthy Smart ContractsSubjects: Cryptography and Security (cs.CR); Programming Languages (cs.PL)
Smart contracts form the core of Web3 applications. Contracts mediate the transfer of cryptocurrency, making them irresistible targets for hackers. We introduce Asp, a system aimed at easing the construction of provably secure contracts. The Asp system consists of three closely-linked components: a programming language, a defensive compiler, and a proof checker. The language semantics guarantee that Asp contracts are free of commonly exploited vulnerabilities such as arithmetic overflow and reentrancy. The defensive compiler enforces the semantics and translates Asp to Solidity, the most popular contract language. Deductive proofs establish functional correctness and freedom from critical vulnerabilities such as unauthorized access.
- [78] arXiv:2411.14565 [pdf, html, other]
-
Title: Privacy-Preserving Video Anomaly Detection: A SurveyComments: 19 pages, 6 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Video Anomaly Detection (VAD) aims to automatically analyze spatiotemporal patterns in surveillance videos collected from open spaces to detect anomalous events that may cause harm without physical contact. However, vision-based surveillance systems such as closed-circuit television often capture personally identifiable information. The lack of transparency and interpretability in video transmission and usage raises public concerns about privacy and ethics, limiting the real-world application of VAD. Recently, researchers have focused on privacy concerns in VAD by conducting systematic studies from various perspectives including data, features, and systems, making Privacy-Preserving Video Anomaly Detection (P2VAD) a hotspot in the AI community. However, current research in P2VAD is fragmented, and prior reviews have mostly focused on methods using RGB sequences, overlooking privacy leakage and appearance bias considerations. To address this gap, this article systematically reviews the progress of P2VAD for the first time, defining its scope and providing an intuitive taxonomy. We outline the basic assumptions, learning frameworks, and optimization objectives of various approaches, analyzing their strengths, weaknesses, and potential correlations. Additionally, we provide open access to research resources such as benchmark datasets and available code. Finally, we discuss key challenges and future opportunities from the perspectives of AI development and P2VAD deployment, aiming to guide future work in the field.
- [79] arXiv:2411.14567 [pdf, other]
-
Title: Energy Efficient Automated Driving as a GNEP: Vehicle-in-the-loop ExperimentsSubjects: Systems and Control (eess.SY)
In this paper, a multi-agent motion planning problem is studied, aiming to minimize the energy consumption of connected automated vehicles (CAVs) in lane change scenarios. We model this interactive motion planning as a generalized Nash equilibrium problem and formalize how vehicle-to-vehicle intention sharing enables solution of the game between multiple CAVs as an optimal control problem for each agent, so as to arrive at a generalized Nash equilibrium. The method is implemented via model predictive control (MPC) and compared with an advanced baseline MPC which utilizes unilateral predictions of other agents' future states. A ROS-based in-the-loop testbed is developed: the method is first evaluated in software-in-the-loop, and then vehicle-in-the-loop experiments are conducted. Experimental results demonstrate the energy and travel time benefits of the presented method in interactive lane change maneuvers.
- [80] arXiv:2411.14568 [pdf, html, other]
-
Title: Maximum Solar Energy Tracking Leverage High-DoF Robotics System with Deep Reinforcement LearningAnjie Jiang, Kangtong Mo, Satoshi Fujimoto, Michael Taylor, Sanjay Kumar, Chiotis Dimitrios, Emilia RuizSubjects: Robotics (cs.RO)
Solar trajectory monitoring is a pivotal challenge in solar energy systems, underpinning applications such as autonomous energy harvesting and environmental sensing. A prevalent failure mode in sustained solar tracking arises when the predictive algorithm erroneously diverges from the solar locus, anchoring instead to extraneous celestial or terrestrial features. This phenomenon is attributable to an inadequate assimilation of solar-specific objectness attributes within the tracking paradigm. To mitigate this deficiency inherent in extant methodologies, we introduce an innovative objectness regularization framework that compels tracking points to remain confined within the delineated boundaries of the solar entity. By encapsulating solar objectness indicators during the training phase, our approach obviates the necessity for explicit solar mask computation during operational deployment. Furthermore, we integrate our method on a high-DoF robot arm to improve its robustness and flexibility in different outdoor environments.
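A toy rendering of the regularization idea, with an invented mask, points, and penalty form: predicted tracking points that fall outside the solar mask incur a loss, which is the behavior the training-time regularizer encourages.

```python
# Illustrative objectness penalty: fraction of tracking points off the solar disk.
import numpy as np

H = W = 64
yy, xx = np.mgrid[0:H, 0:W]
solar_mask = ((yy - 32) ** 2 + (xx - 32) ** 2) <= 10 ** 2     # disk standing in for the sun

points = np.array([[32.0, 30.0], [10.0, 5.0], [33.5, 40.2]])  # predicted (row, col) points

def objectness_penalty(points, mask):
    r = np.clip(np.round(points[:, 0]).astype(int), 0, H - 1)
    c = np.clip(np.round(points[:, 1]).astype(int), 0, W - 1)
    inside = mask[r, c].astype(float)
    return float((1.0 - inside).mean())    # penalize points outside the object

print("penalty:", objectness_penalty(points, solar_mask))     # 1/3 here
```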
- [81] arXiv:2411.14569 [pdf, html, other]
-
Title: Variable Extraction for Model Recovery in Scientific LiteratureSubjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
The global output of academic publications exceeds 5 million articles per year, making it difficult for humans to keep up with even a tiny fraction of scientific output. We need methods to navigate and interpret the artifacts -- texts, graphs, charts, code, models, and datasets -- that make up the literature. This paper evaluates various methods for extracting mathematical model variables from epidemiological studies, such as ``infection rate ($\alpha$),'' ``recovery rate ($\gamma$),'' and ``mortality rate ($\mu$).'' Variable extraction appears to be a basic task, but plays a pivotal role in recovering models from scientific literature. Once extracted, we can use these variables for automatic mathematical modeling, simulation, and replication of published results.
We introduce a benchmark dataset comprising manually-annotated variable descriptions and variable values extracted from scientific papers. Based on this dataset, we present several baseline methods for variable extraction based on Large Language Models (LLMs) and rule-based information extraction systems. Our analysis shows that LLM-based solutions perform the best. Despite the incremental benefits of combining rule-based extraction outputs with LLMs, the leap in performance attributed to the transfer-learning and instruction-tuning capabilities of LLMs themselves is far more significant. This investigation demonstrates the potential of LLMs to enhance automatic comprehension of scientific artifacts and for automatic model recovery and simulation.
- [82] arXiv:2411.14571 [pdf, html, other]
-
Title: Assessment of LLM Responses to End-user Security QuestionsComments: 18 pages, 1 figure, 8 tablesSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Answering end user security questions is challenging. While large language models (LLMs) like GPT, LLAMA, and Gemini are far from error-free, they have shown promise in answering a variety of questions outside of security. We studied LLM performance in the area of end user security by qualitatively evaluating 3 popular LLMs on 900 systematically collected end user security questions.
While LLMs demonstrate broad generalist ``knowledge'' of end user security information, there are patterns of errors and limitations across LLMs consisting of stale and inaccurate answers, and indirect or unresponsive communication styles, all of which impact the quality of information received. Based on these patterns, we suggest directions for model improvement and recommend user strategies for interacting with LLMs when seeking assistance with security.
- [83] arXiv:2411.14572 [pdf, html, other]
-
Title: Towards Knowledge Checking in Retrieval-augmented Generation: A Representation PerspectiveShenglai Zeng, Jiankun Zhang, Bingheng Li, Yuping Lin, Tianqi Zheng, Dante Everaert, Hanqing Lu, Hui Liu, Hui Liu, Yue Xing, Monica Xiao Cheng, Jiliang TangSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Retrieval-Augmented Generation (RAG) systems have shown promise in enhancing the performance of Large Language Models (LLMs). However, these systems face challenges in effectively integrating external knowledge with the LLM's internal knowledge, often leading to issues with misleading or unhelpful information. This work aims to provide a systematic study on knowledge checking in RAG systems. We conduct a comprehensive analysis of LLM representation behaviors and demonstrate the significance of using representations in knowledge checking. Motivated by the findings, we further develop representation-based classifiers for knowledge filtering. We show substantial improvements in RAG performance, even when dealing with noisy knowledge databases. Our study provides new insights into leveraging LLM representations for enhancing the reliability and effectiveness of RAG systems.
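A minimal sketch of representation-based knowledge filtering: fit a linear probe on hidden-state representations to separate helpful from misleading retrieved passages, then threshold its scores before generation. The synthetic embeddings below stand in for LLM hidden states; the paper's actual features and classifiers may differ.

```python
# Linear probe over (synthetic) LLM representations for knowledge filtering.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d = 128
helpful = rng.normal(+0.2, 1.0, size=(500, d))      # stand-in representations
misleading = rng.normal(-0.2, 1.0, size=(500, d))
X = np.vstack([helpful, misleading])
y = np.array([1] * 500 + [0] * 500)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print("probe accuracy:", probe.score(Xte, yte))

# At retrieval time, chunks scoring below the threshold would be filtered out.
keep = probe.predict_proba(Xte)[:, 1] > 0.5
print("kept", int(keep.sum()), "of", len(keep), "chunks")
```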
- [84] arXiv:2411.14574 [pdf, html, other]
-
Title: SRSA: A Cost-Efficient Strategy-Router Search Agent for Real-world Human-Machine InteractionsSubjects: Artificial Intelligence (cs.AI)
Recently, as Large Language Models (LLMs) have shown impressive emerging capabilities and gained widespread popularity, research on LLM-based search agents has proliferated. In real-world situations, users often input contextual and highly personalized queries to chatbots, challenging LLMs to capture context and generate appropriate answers. However, much of the prior research has not focused specifically on authentic human-machine dialogue scenarios. It also ignores the important balance between response quality and computational cost by forcing all queries to follow the same agent process. To address these gaps, we propose a Strategy-Router Search Agent (SRSA), routing different queries to appropriate search strategies and enabling fine-grained serial searches to obtain high-quality results at a relatively low cost. To evaluate our work, we introduce a new dataset, Contextual Query Enhancement Dataset (CQED), comprising contextual queries to simulate authentic and daily interactions between humans and chatbots. Using LLM-based automatic evaluation metrics, we assessed SRSA's performance in terms of informativeness, completeness, novelty, and actionability. In conclusion, SRSA resolves the issue of simple serial searches producing degenerate answers for lengthy and contextual queries; it effectively and efficiently parses complex user queries and generates more comprehensive and informative responses without fine-tuning an LLM.
- [85] arXiv:2411.14576 [pdf, html, other]
-
Title: EdgeFlowNet: 100FPS@1W Dense Optical Flow For Tiny Mobile RobotsComments: this https URLSubjects: Robotics (cs.RO)
Optical flow estimation is a critical task for tiny mobile robotics to enable safe and accurate navigation, obstacle avoidance, and other functionalities. However, optical flow estimation on tiny robots is challenging due to limited onboard sensing and computation capabilities. In this paper, we propose EdgeFlowNet, a high-speed, low-latency dense optical flow approach for tiny autonomous mobile robots by harnessing the power of edge computing. We demonstrate the efficacy of our approach by deploying EdgeFlowNet on a tiny quadrotor to perform static obstacle avoidance, flight through unknown gaps and dynamic obstacle dodging. EdgeFlowNet is about 20 times faster than the previous state-of-the-art approaches while improving accuracy by over 20% and using only 1.08W of power, enabling advanced autonomy on palm-sized tiny mobile robots.
- [86] arXiv:2411.14578 [pdf, html, other]
-
Title: Block subspace expansions for eigenvalues and eigenvectors approximationSubjects: Numerical Analysis (math.NA); Functional Analysis (math.FA)
Let $A\in\mathbb C^{n\times n}$ and let $\mathcal X\subset \mathbb C^n$ be an $A$-invariant subspace with $\dim \mathcal X=d\geq 1$, corresponding to exterior eigenvalues of $A$. Given an initial subspace $\mathcal V\subset \mathbb C^n$ with $\dim \mathcal V=r\geq d$, we search for expansions of $\mathcal V$ of the form $\mathcal V+A(\mathcal W_0)$, where $\mathcal W_0\subset \mathcal V$ is such that $\dim \mathcal W_0\leq d$ and such that the expanded subspace is closer to $\mathcal X$ than the initial $\mathcal V$. We show that there exist (theoretical) optimal choices of such $\mathcal W_0$, in the sense that $\theta_i(\mathcal X,\mathcal V+A(\mathcal W_0))\leq \theta_i(\mathcal X,\mathcal V+A(\mathcal W))$ for every $\mathcal W\subset \mathcal V$ with $\dim \mathcal W\leq d$, where $\theta_i(\mathcal X,\mathcal T)$ denotes the $i$-th principal angle between $\mathcal X$ and $\mathcal T$, for $1\leq i\leq d\leq \dim \mathcal T$. We relate these optimal expansions to block Krylov subspaces generated by $A$ and $\mathcal V$. We also show that the corresponding iterative sequence of subspaces constructed in this way approximates $\mathcal X$ arbitrarily well, when $A$ is Hermitian and $\mathcal X$ is simple. We further introduce computable versions of this construction and compute several numerical examples that show the performance of the computable algorithms and test our convergence analysis.
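The quantities involved are straightforward to compute numerically: principal angles come from the SVD of the product of orthonormal bases. The sketch below expands a random initial subspace with $A(\mathcal W_0)$ for an arbitrary (not optimal) slice $\mathcal W_0$ and reports the angles before and after; all matrices are random illustrations.

```python
# Principal angles before and after a block subspace expansion.
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 50, 3, 5
A = rng.normal(size=(n, n))
A = (A + A.T) / 2                         # Hermitian, as in the convergence result

def orth(M):
    Q, _ = np.linalg.qr(M)                # orthonormal basis of the column span
    return Q

def principal_angles(Q1, Q2):
    s = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

X = orth(rng.normal(size=(n, d)))         # target subspace (random here)
V = orth(rng.normal(size=(n, r)))         # initial subspace

W0 = V[:, :d]                             # an arbitrary d-dimensional slice of V
V_exp = orth(np.hstack([V, A @ W0]))      # expanded subspace V + A(W0)

print("angles before:", principal_angles(X, V)[:d])
print("angles after: ", principal_angles(X, V_exp)[:d])   # never larger than before
```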
- [87] arXiv:2411.14579 [pdf, other]
-
Title: Functional Array Programming in an Extended Pi-CalculusHans Hüttel (Department of Computer Science, University of Copenhagen), Lars Jensen (Department of Computer Science, Aalborg University), Chris Oliver Paulsen (Department of Computer Science, Aalborg University), Julian Teule (Department of Computer Science, Aalborg University)Comments: In Proceedings EXPRESS/SOS 2024, arXiv:2411.13318Journal-ref: EPTCS 412, 2024, pp. 2-18Subjects: Programming Languages (cs.PL); Logic in Computer Science (cs.LO)
We study the data-parallel language BUTF, inspired by the Futhark language for array programming. We give a translation of BUTF into a version of the pi-calculus with broadcasting and labeled names. The translation is both complete and sound. Moreover, we propose a cost model by annotating translated BUTF processes. This is used for a complexity analysis of the translation.
- [88] arXiv:2411.14580 [pdf, other]
-
Title: Synchronisability in Mailbox CommunicationCinzia Di Giusto (Université Côte d'Azur, CNRS, I3S, France), Laetitia Laversa (Université Sorbonne Paris Nord, Paris, France), Kirstin Peters (Universität Augsburg, Augsburg, Germany)Comments: In Proceedings EXPRESS/SOS 2024, arXiv:2411.13318Journal-ref: EPTCS 412, 2024, pp. 19-34Subjects: Formal Languages and Automata Theory (cs.FL); Programming Languages (cs.PL)
We revisit the problem of synchronisability for communicating automata, i.e., whether the language of send messages for an asynchronous system is the same as the language of send messages with a synchronous communication. The un/decidability of the problem depends on the specific asynchronous semantics considered as well as the topology (the communication flow) of the system. Synchronisability is known to be undecidable under the peer-to-peer semantics, while it is still an open problem for mailbox communication. The problem was shown to be decidable for ring topologies. In this paper, we show that when generalising to automata with accepting states, synchronisability is undecidable under the mailbox semantics; this result is obtained by a reduction from the Post Correspondence Problem. In an attempt to solve the specific problem where all states are accepting, we also show that synchronisability is decidable for tree topologies (where, as well as for rings, peer-to-peer coincides with mailbox semantics). We also discuss synchronisability for multitrees in the mailbox setting.
- [89] arXiv:2411.14581 [pdf, other]
-
Title: Semantics for Linear-time Temporal Logic with Finite ObservationsRayhana Amjad (University of Edinburgh), Rob van Glabbeek (University of Edinburgh), Liam O'Connor (Australian National University)Comments: In Proceedings EXPRESS/SOS 2024, arXiv:2411.13318Journal-ref: EPTCS 412, 2024, pp. 35-50Subjects: Logic in Computer Science (cs.LO)
LTL3 is a multi-valued variant of Linear-time Temporal Logic for runtime verification applications. The semantic descriptions of LTL3 in previous work are given only in terms of the relationship to conventional LTL. Our approach, by contrast, gives a full model-based inductive accounting of the semantics of LTL3, in terms of families of definitive prefix sets. We show that our definitive prefix sets are isomorphic to linear-time temporal properties (sets of infinite traces), and thereby show that our semantics of LTL3 directly correspond to the semantics of conventional LTL. In addition, we formalise the formula progression evaluation technique, popularly used in runtime verification and testing contexts, and show its soundness and completeness up to finite traces with respect to our semantics. All of our definitions and proofs are mechanised in Isabelle/HOL.
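For readers new to formula progression, the sketch below rewrites an LTL formula one observed letter at a time over a finite prefix, using the standard until-unfolding; the tuple encoding and the fragment covered are illustrative, not the paper's Isabelle/HOL mechanization.

```python
# LTL formula progression over a finite trace prefix.
TRUE, FALSE = ("true",), ("false",)

def neg(p):
    return TRUE if p == FALSE else FALSE if p == TRUE else ("not", p)

def conj(p, q):
    if FALSE in (p, q):
        return FALSE
    return q if p == TRUE else p if q == TRUE else ("and", p, q)

def disj(p, q):
    if TRUE in (p, q):
        return TRUE
    return q if p == FALSE else p if q == FALSE else ("or", p, q)

def progress(phi, letter):
    """Progress phi through one observation (a set of atomic propositions)."""
    op = phi[0]
    if op == "true":
        return TRUE
    if op == "false":
        return FALSE
    if op == "atom":
        return TRUE if phi[1] in letter else FALSE
    if op == "not":
        return neg(progress(phi[1], letter))
    if op == "and":
        return conj(progress(phi[1], letter), progress(phi[2], letter))
    if op == "or":
        return disj(progress(phi[1], letter), progress(phi[2], letter))
    if op == "next":
        return phi[1]
    if op == "until":   # p U q  ==  q or (p and X(p U q))
        return disj(progress(phi[2], letter),
                    conj(progress(phi[1], letter), phi))
    raise ValueError(op)

phi = ("until", TRUE, ("atom", "err"))    # "eventually err"
for letter in [{"ok"}, {"ok"}, {"err"}]:
    phi = progress(phi, letter)
print(phi)   # ('true',): the prefix already satisfies the property (LTL3 verdict true)
```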
- [90] arXiv:2411.14583 [pdf, other]
-
Title: Expansion Laws for Forward-Reverse, Forward, and Reverse Bisimilarities via Proved EncodingsMarco Bernardo (University of Urbino), Andrea Esposito (University of Urbino), Claudio A. Mezzina (University of Urbino)Comments: In Proceedings EXPRESS/SOS 2024, arXiv:2411.13318Journal-ref: EPTCS 412, 2024, pp. 51-70Subjects: Logic in Computer Science (cs.LO)
Reversible systems exhibit both forward computations and backward computations, where the aim of the latter is to undo the effects of the former. Such systems can be compared via forward-reverse bisimilarity as well as its two components, i.e., forward bisimilarity and reverse bisimilarity. The congruence, equational, and logical properties of these equivalences have already been studied in the setting of sequential processes. In this paper we address concurrent processes and investigate compositionality and axiomatizations of forward bisimilarity, which is interleaving, and reverse and forward-reverse bisimilarities, which are truly concurrent. To uniformly derive expansion laws for the three equivalences, we develop encodings based on the proved trees approach of Degano & Priami. In the case of reverse and forward-reverse bisimilarities, we show that in the encoding every action prefix needs to be extended with the backward ready set of the reached process.
- [91] arXiv:2411.14584 [pdf, other]
-
Title: One Energy Game for the Spectrum between Branching Bisimilarity and Weak Trace SemanticsBenjamin Bisping (TU Berlin), David N. Jansen (Institute of Software, Chinese Academy of Sciences)Comments: In Proceedings EXPRESS/SOS 2024, arXiv:2411.13318Journal-ref: EPTCS 412, 2024, pp. 71-88Subjects: Logic in Computer Science (cs.LO)
We provide the first generalized game characterization of van Glabbeek's linear-time--branching-time spectrum with silent steps. Thereby, one multi-dimensional energy game can be used to characterize and decide a wide array of weak behavioral equivalences between stability-respecting branching bisimilarity and weak trace equivalence in one go. To establish correctness, we relate attacker-winning energy budgets and distinguishing sublanguages of Hennessy--Milner logic that we characterize by eight dimensions of formula expressiveness.
- [92] arXiv:2411.14585 [pdf, html, other]
-
Title: Efficient Spatio-Temporal Signal Recognition on Edge Devices Using PointLCA-NetComments: arXiv admin note: text overlap with arXiv:2411.00140Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
Recent advancements in machine learning, particularly through deep learning architectures like PointNet, have transformed the processing of three-dimensional (3D) point clouds, significantly improving 3D object classification and segmentation tasks. While 3D point clouds provide detailed spatial information, spatio-temporal signals introduce a dynamic element that accounts for changes over time. However, applying deep learning techniques to spatio-temporal signals and deploying them on edge devices presents challenges, including real-time processing, memory capacity, and power consumption. To address these issues, this paper presents a novel approach that combines PointNet's feature extraction with the in-memory computing capabilities and energy efficiency of neuromorphic systems for spatio-temporal signal recognition. The proposed method consists of a two-stage process: in the first stage, PointNet extracts features from the spatio-temporal signals, which are then stored in non-volatile memristor crossbar arrays. In the second stage, these features are processed by a single-layer spiking neural encoder-decoder that employs the Locally Competitive Algorithm (LCA) for efficient encoding and classification. This work integrates the strengths of both PointNet and LCA, enhancing computational efficiency and energy performance on edge devices. PointLCA-Net achieves high recognition accuracy for spatio-temporal data with substantially lower energy burden during both inference and training than comparable approaches, thus advancing the deployment of advanced neural architectures in energy-constrained environments.
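A minimal NumPy sketch of the LCA dynamics used in the second stage: membrane potentials integrate a feature-matched drive, leak, and inhibit one another through dictionary correlations, with a soft threshold producing the sparse code. The dictionary, step size, and threshold are illustrative.

```python
# Locally Competitive Algorithm (LCA) sparse coding of a feature vector.
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 256))
D /= np.linalg.norm(D, axis=0)           # unit-norm dictionary atoms
x = rng.normal(size=64)                  # input feature (e.g., a PointNet feature)

G = D.T @ D - np.eye(256)                # lateral inhibition weights
b = D.T @ x                              # feed-forward drive
u = np.zeros(256)                        # membrane potentials
lam, dt = 0.1, 0.1

def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

for _ in range(200):
    a = soft_threshold(u, lam)
    u += dt * (b - u - G @ a)            # LCA dynamics: du/dt = b - u - G a

a = soft_threshold(u, lam)
print("active neurons:", np.count_nonzero(a), "of", a.size)
print("reconstruction error:", np.linalg.norm(x - D @ a))
```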
- [93] arXiv:2411.14586 [pdf, html, other]
-
Title: Listening for Expert Identified Linguistic Features: Assessment of Audio Deepfake Discernment among Undergraduate StudentsSubjects: Sound (cs.SD); Computers and Society (cs.CY); Audio and Speech Processing (eess.AS)
This paper evaluates the impact of training undergraduate students to improve their audio deepfake discernment ability by listening for expert-defined linguistic features. Such features have been shown to improve performance of AI algorithms; here, we ascertain whether this improvement in AI algorithms also translates to improvement of the perceptual awareness and discernment ability of listeners. With humans as the weakest link in any cybersecurity solution, we propose that listener discernment is a key factor for improving trustworthiness of audio content. In this study we determine whether training that familiarizes listeners with English language variation can improve their abilities to discern audio deepfakes. We focus on undergraduate students, as this demographic group is constantly exposed to social media and the potential for deception and misinformation online. To the best of our knowledge, our work is the first study to uniquely address English audio deepfake discernment through such techniques. Our research goes beyond informational training by introducing targeted linguistic cues to listeners as a deepfake discernment mechanism, via a training module. In a pre-/post- experimental design, we evaluated the impact of the training across 264 students as a representative cross section of all students at the University of Maryland, Baltimore County, and across experimental and control sections. Findings show that the experimental group exhibited a statistically significant decrease in their uncertainty when evaluating audio clips and an improvement in their ability to correctly identify clips they were initially unsure about. While results are promising, future research will explore more robust and comprehensive trainings for greater impact.
- [94] arXiv:2411.14590 [pdf, html, other]
-
Title: LLOR: Automated Repair of OpenMP ProgramsComments: 23 pages, 1 algorithm, 2 figures, 26th International Conference on Verification Model Checking and Abstract Interpretation (VMCAI 2025)Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Software Engineering (cs.SE)
In this paper, we present a technique for repairing data race errors in parallel programs written in C/C++ and Fortran using the OpenMP API. Our technique can also remove barriers that are deemed unnecessary for correctness. We implement these ideas in our tool called LLOR, which takes a language-independent approach to provide appropriate placements of synchronization constructs to avoid data races. To the best of our knowledge, LLOR is the only tool that can repair parallel programs that use the OpenMP API. We showcase the capabilities of LLOR by performing extensive experiments on 415 parallel programs.
- [95] arXiv:2411.14592 [pdf, html, other]
-
Title: G-RAG: Knowledge Expansion in Material ScienceSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
In the field of Material Science, effective information retrieval systems are essential for facilitating research. Traditional Retrieval-Augmented Generation (RAG) approaches in Large Language Models (LLMs) often encounter challenges such as outdated information, hallucinations, limited interpretability due to context constraints, and inaccurate retrieval. To address these issues, Graph RAG integrates graph databases to enhance the retrieval process. Our proposed method processes Material Science documents by extracting key entities (referred to as MatIDs) from sentences, which are then utilized to query external Wikipedia knowledge bases (KBs) for additional relevant information. We implement an agent-based parsing technique to achieve a more detailed representation of the documents. Our improved version of Graph RAG called G-RAG further leverages a graph database to capture relationships between these entities, improving both retrieval accuracy and contextual understanding. This enhanced approach demonstrates significant improvements in performance for domains that require precise information retrieval, such as Material Science.
- [96] arXiv:2411.14593 [pdf, html, other]
-
Title: A Systematic Study of Multi-Agent Deep Reinforcement Learning for Safe and Robust Autonomous Highway Ramp EntryComments: 9 pages, 9 figuresSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Vehicles today can drive themselves on highways and driverless robotaxis operate in major cities, with more sophisticated levels of autonomous driving expected to be available and become more common in the future. Yet, technically speaking, so-called "Level 5" (L5) operation, corresponding to full autonomy, has not been achieved. For that to happen, functions such as fully autonomous highway ramp entry must be available, and provide provably safe and reliably robust behavior to enable full autonomy. We present a systematic study of a highway ramp function that controls the vehicle's forward-moving actions to minimize collisions with the stream of highway traffic into which a merging (ego) vehicle enters. We take a game-theoretic multi-agent (MA) approach to this problem and study the use of controllers based on deep reinforcement learning (DRL). The virtual environment of the MA DRL uses self-play with simulated data where merging vehicles safely learn to control longitudinal position during a taper-type merge. The work presented in this paper extends existing work by studying the interaction of more than two vehicles (agents) and does so by systematically expanding the road scene with additional traffic and ego vehicles. While previous work on the two-vehicle setting established that collision-free controllers are theoretically impossible in fully decentralized, non-coordinated environments, we empirically show that controllers learned using our approach are nearly ideal when measured against idealized optimal controllers.
- [97] arXiv:2411.14594 [pdf, html, other]
-
Title: Solving Zero-Shot 3D Visual Grounding as Constraint Satisfaction ProblemsSubjects: Computer Vision and Pattern Recognition (cs.CV)
3D visual grounding (3DVG) aims to locate objects in a 3D scene with natural language descriptions. Supervised methods have achieved decent accuracy, but have a closed vocabulary and limited language understanding ability. Zero-shot methods mostly utilize large language models (LLMs) to handle natural language descriptions, yet suffer from slow inference speed. To address these problems, in this work, we propose a zero-shot method that reformulates the 3DVG task as a Constraint Satisfaction Problem (CSP), where the variables and constraints represent objects and their spatial relations, respectively. This allows a global reasoning of all relevant objects, producing grounding results of both the target and anchor objects. Moreover, we demonstrate the flexibility of our framework by handling negation- and counting-based queries with only minor extra coding efforts. Our system, Constraint Satisfaction Visual Grounding (CSVG), has been extensively evaluated on the public ScanRefer and Nr3D datasets using only open-source LLMs. Results show the effectiveness of CSVG and superior grounding accuracy over current state-of-the-art zero-shot 3DVG methods, with improvements of $+7.0\%$ (accuracy) and $+11.2\%$ on the ScanRefer and Nr3D datasets, respectively. The code of our system is publicly available at this https URL.
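A toy version of the CSP reformulation, with an invented scene and relation: variables are the target and anchor objects, constraints are spatial relations, and a search returns a jointly consistent assignment.

```python
# Zero-shot grounding as constraint satisfaction (brute-force toy example).
import itertools

scene = {                                  # object id -> (label, x, y)
    0: ("chair", 0.0, 0.0),
    1: ("table", 0.5, 0.1),
    2: ("chair", 3.0, 0.2),
    3: ("lamp", 0.6, 0.3),
}

def near(a, b, eps=1.0):
    (_, xa, ya), (_, xb, yb) = scene[a], scene[b]
    return (xa - xb) ** 2 + (ya - yb) ** 2 < eps ** 2

# Query: "the chair near the table that is near a lamp"
variables = ["target", "anchor1", "anchor2"]
domains = {
    "target": [i for i, (l, _, _) in scene.items() if l == "chair"],
    "anchor1": [i for i, (l, _, _) in scene.items() if l == "table"],
    "anchor2": [i for i, (l, _, _) in scene.items() if l == "lamp"],
}
constraints = [("target", "anchor1", near), ("anchor1", "anchor2", near)]

for assignment in itertools.product(*(domains[v] for v in variables)):
    amap = dict(zip(variables, assignment))
    if all(rel(amap[u], amap[v]) for u, v, rel in constraints):
        print("grounding:", amap)          # target and anchors solved jointly
        break
```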
- [98] arXiv:2411.14596 [pdf, html, other]
-
Title: Conjugate momentum based thruster force estimate in dynamic multimodal robotShreyansh Pitroda, Eric Sihite, Taoran Liu, Kaushik Venkatesh Krishnamurthy, Chenghao Wang, Adarsh Salagame, Reza Nemovi, Alireza Ramezani, Morteza GharibComments: Submitted to ACC 2025. arXiv admin note: text overlap with arXiv:2411.12968Subjects: Robotics (cs.RO)
In multi-modal systems that combine thruster and legged locomotion, such as our state-of-the-art Harpy platform, a proper estimate of the thruster force is essential for performing dynamic locomotion. Harpy is a bipedal robot capable of legged-aerial locomotion using its legs and thrusters attached to its main frame. Thruster force can be characterized using a thrust stand, but such characterization generally does not account for working conditions such as battery voltage. In this study, we present a momentum-based thruster force estimator. One of the key pieces of information required for the estimate is terrain information; we show estimation results with and without terrain knowledge. In this work, we derive a conjugate momentum thruster force estimator and implement it in a numerical simulator that uses thruster force to perform thruster-assisted walking.
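A sketch of the estimator idea on a 1-DoF vertical mass: the residual of a generalized (conjugate) momentum observer converges to the unmodeled external force, which here stands in for the thruster force. The gains, dynamics, and force profile are illustrative.

```python
# Conjugate-momentum disturbance observer on a 1-DoF vertical mass.
import numpy as np

m, g, dt, K = 1.5, 9.81, 1e-3, 50.0   # mass, gravity, time step, observer gain
x, v = 0.0, 0.0                       # position and velocity
integral, r = 0.0, 0.0                # observer integral state and residual

for k in range(3000):                 # simulate 3 seconds
    t = k * dt
    u = m * g                         # known control input (gravity feedforward)
    F_true = 4.0 if t > 1.0 else 0.0  # unknown thruster force, switching on at t = 1 s
    a = (u - m * g + F_true) / m      # true dynamics (F_true is unknown to the observer)
    v += a * dt
    x += v * dt

    p = m * v                         # generalized (conjugate) momentum
    integral += (u - m * g + r) * dt  # integral of modeled forces plus residual
    r = K * (p - integral)            # residual converges to F_true at rate K

print("final estimate: %.3f N (true value 4.0 N)" % r)
```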
- [99] arXiv:2411.14611 [pdf, html, other]
-
Title: CodeSAM: Source Code Representation Learning by Infusing Self-Attention with Multi-Code-View GraphsAlex Mathai, Kranthi Sedamaki, Debeshee Das, Noble Saji Mathews, Srikanth Tamilselvam, Sridhar Chimalakonda, Atul KumarSubjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Machine Learning (ML) for software engineering (SE) has gained prominence due to its ability to significantly enhance the performance of various SE applications. This progress is largely attributed to the development of generalizable source code representations that effectively capture the syntactic and semantic characteristics of code. In recent years, pre-trained transformer-based models, inspired by natural language processing (NLP), have shown remarkable success in SE tasks. However, source code contains structural and semantic properties embedded within its grammar, which can be extracted from structured code-views like the Abstract Syntax Tree (AST), Data-Flow Graph (DFG), and Control-Flow Graph (CFG). These code-views can complement NLP techniques, further improving SE tasks. Unfortunately, there are no flexible frameworks to infuse arbitrary code-views into existing transformer-based models effectively. Therefore, in this work, we propose CodeSAM, a novel scalable framework to infuse multiple code-views into transformer-based models by creating self-attention masks. We use CodeSAM to fine-tune a small language model (SLM) like CodeBERT on the downstream SE tasks of semantic code search, code clone detection, and program classification. Experimental results show that by using this technique, we improve downstream performance when compared to SLMs like GraphCodeBERT and CodeBERT on all three tasks by utilizing individual code-views or a combination of code-views during fine-tuning. We believe that these results are indicative that techniques like CodeSAM can help create compact yet performant code SLMs that fit in resource constrained settings.
- [100] arXiv:2411.14612 [pdf, html, other]
-
Title: Exploiting Boosting in Hyperdimensional Computing for Enhanced Reliability in HealthcareComments: Accepted to DATE 2025Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Hyperdimensional computing (HDC) enables efficient data encoding and processing in high-dimensional space, benefiting machine learning and data analysis. However, underutilization of these spaces can lead to overfitting and reduced model reliability, especially in data-limited systems, a critical issue in sectors like healthcare that demand robustness and consistent performance. We introduce BoostHD, an approach that applies boosting algorithms to partition the hyperdimensional space into subspaces, creating an ensemble of weak learners. By integrating boosting with HDC, BoostHD enhances performance and reliability beyond existing HDC methods. Our analysis highlights the importance of efficient utilization of hyperdimensional spaces for improved model performance. Experiments on healthcare datasets show that BoostHD outperforms state-of-the-art methods. On the WESAD dataset, it achieved an accuracy of 98.37%, surpassing Random Forest, XGBoost, and OnlineHD. BoostHD also demonstrated superior inference efficiency and stability, maintaining high accuracy under data imbalance and noise. In person-specific evaluations, it achieved an average accuracy of 96.19%, outperforming other models. By addressing the limitations of both boosting and HDC, BoostHD expands the applicability of HDC in critical domains where reliability and precision are paramount.
- [101] arXiv:2411.14613 [pdf, other]
-
Title: Optimal Transcoding Preset Selection for Live Video StreamingComments: 23 pages, 10 figuresSubjects: Multimedia (cs.MM)
In today's digital landscape, video content dominates internet traffic, underscoring the need for efficient video processing to support seamless live streaming experiences on platforms like YouTube Live, Twitch, and Facebook Live. This paper introduces a comprehensive framework designed to optimize video transcoding parameters, with a specific focus on preset and bitrate selection to minimize distortion while respecting constraints on bitrate and transcoding time. The framework comprises three main steps: feature extraction, prediction, and optimization. It leverages extracted features to predict transcoding time and rate-distortion, employing both supervised and unsupervised methods. By utilizing integer linear programming, it identifies the optimal sequence of presets and bitrates for video segments, ensuring real-time application feasibility under set constraints. The results demonstrate the framework's effectiveness in enhancing video quality for live streaming, maintaining high standards of video delivery while managing computational resources efficiently. This optimization approach meets the evolving demands of video delivery by offering a solution for real-time transcoding optimization. Evaluation using the User Generated Content dataset showed an average PSNR improvement of 1.5 dB over the default Twitch configuration, highlighting significant PSNR gains. Additionally, subsequent experiments demonstrated a BD-rate reduction of -49.60%, reinforcing the framework's superior performance over Twitch's default configuration.
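A toy version of the selection step, with invented per-segment predictions: choose one preset per segment to minimize total distortion under a transcoding-time budget and a bitrate cap. The framework solves this with integer linear programming; the exhaustive search below merely illustrates the shape of the problem.

```python
# Joint preset selection for three segments under time and bitrate constraints.
import itertools

# Candidate (preset, distortion, transcode_time_s, bitrate_mbps) per segment.
options = [
    [("fast", 3.0, 0.4, 4.0), ("medium", 2.2, 0.8, 4.2), ("slow", 1.8, 1.6, 4.1)],
    [("fast", 2.5, 0.5, 3.5), ("medium", 1.9, 0.9, 3.6), ("slow", 1.5, 1.7, 3.4)],
    [("fast", 3.4, 0.4, 4.4), ("medium", 2.8, 0.7, 4.5), ("slow", 2.1, 1.5, 4.3)],
]
TIME_BUDGET, RATE_CAP = 2.5, 12.5      # real-time and bandwidth constraints

best = None
for combo in itertools.product(*options):
    time = sum(o[2] for o in combo)
    rate = sum(o[3] for o in combo)
    dist = sum(o[1] for o in combo)
    if time <= TIME_BUDGET and rate <= RATE_CAP and (best is None or dist < best[0]):
        best = (dist, [o[0] for o in combo], time, rate)

print("presets:", best[1], " distortion:", best[0], " time:", best[2])
```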
- [102] arXiv:2411.14617 [pdf, html, other]
-
Title: Data assimilation in 2D incompressible Navier-Stokes equations, using a stabilized explicit $O(\Delta t)^2$ leapfrog finite difference scheme run backward in timeJournal-ref: 2024 NIST Technical Note 2299Subjects: Numerical Analysis (math.NA)
For the 2D incompressible Navier-Stokes equations, with given hypothetical non-smooth data at time $T > 0$ that may not correspond to an actual solution at time $T$, a previously developed stabilized backward marching explicit leapfrog finite difference scheme is applied to these data, to find initial values at time $t = 0$ that can evolve into useful approximations to the given data at time $T$. That may not always be possible. Similar data assimilation problems, involving other dissipative systems, are of considerable interest in the geophysical sciences, and are commonly solved using computationally intensive methods based on neural networks informed by machine learning. Successful solution of ill-posed time-reversed Navier-Stokes equations is limited by uncertainty estimates, based on logarithmic convexity, that place limits on the value of $T > 0$. In computational experiments involving satellite images of hurricanes and other meteorological phenomena, the present method is shown to produce successful solutions at values of $T > 0$ that are several orders of magnitude larger than would be expected, based on the best-known uncertainty estimates. However, unsuccessful examples are also given. The present self-contained paper outlines the stabilizing technique, based on applying a compensating smoothing operator at each time step, and stresses the important differences between data assimilation, and backward recovery, in ill-posed time reversed problems for dissipative equations. While theorems are stated without proof, the reader is referred to a previous paper, on Navier-Stokes backward recovery, where these proofs can be found.
- [103] arXiv:2411.14618 [pdf, html, other]
-
Title: Active Learning-Based Optimization of Hydroelectric Turbine Startup to Minimize Fatigue DamageSubjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Hydro-generating units (HGUs) play a crucial role in integrating intermittent renewable energy sources into the power grid due to their flexible operational capabilities. This evolving role has led to an increase in transient events, such as startups, which impose significant stresses on turbines, leading to increased turbine fatigue and a reduced operational lifespan. Consequently, optimizing startup sequences to minimize stresses is vital for hydropower utilities. However, this task is challenging, as stress measurements on prototypes can be expensive and time-consuming. To tackle this challenge, we propose an innovative automated approach to optimize the startup parameters of HGUs with a limited budget of measured startup sequences. Our method combines active learning and black-box optimization techniques, utilizing virtual strain sensors and dynamic simulations of HGUs. This approach was tested in real-time during an on-site measurement campaign on an instrumented Francis turbine prototype. The results demonstrate that our algorithm successfully identified an optimal startup sequence using only seven measured sequences. It achieves a remarkable 42% reduction in the maximum strain cycle amplitude compared to the standard startup sequence. This study paves the way for more efficient HGU startup optimization, potentially extending their operational lifespans.
- [104] arXiv:2411.14619 [pdf, html, other]
-
Title: Path Planning and Task Assignment for Data Retrieval from Wireless Sensor Nodes Relying on Game-Theoretic LearningComments: In proceedings of the 5th International Conference on Control, Decision and Information Technologies, 2018. 6 pages, 5 figuresSubjects: Systems and Control (eess.SY)
The energy-efficient trip allocation of mobile robots employing differential drives for data retrieval from stationary sensor locations is the scope of this article. Given a team of robots and a set of targets (wireless sensor nodes), the planner computes all possible tours that each robot can make if it needs to visit a part of or the entire set of targets. Each segment of the tour relies on a minimum energy path planning algorithm. After the computation of all possible tour-segments, a utility function penalizing the overall energy consumption is formed. Rather than relying on the NP-hard Mobile Element Scheduling (MES) MILP problem, an approach using elements from game theory is employed. The suggested approach converges quickly in most practical cases, thus allowing its utilization in near-real-time applications. Simulations are offered to highlight the efficiency of the developed algorithm.
- [105] arXiv:2411.14622 [pdf, html, other]
-
Title: Learning Autonomous Surgical Irrigation and Suction with the da Vinci Research Kit Using Reinforcement LearningComments: 13 pages, 19 figures. Submitted to IEEE Transactions on Automation Science and Engineering (T-ASE)Subjects: Robotics (cs.RO)
The irrigation-suction process is a common procedure to rinse and clean up the surgical field in minimally invasive surgery (MIS). In this process, surgeons first irrigate liquid, typically saline, into the surgical scene for rinsing and diluting the contaminant, and then suction the liquid out of the surgical field. While recent advances have shown promising results in the application of reinforcement learning (RL) for automating surgical subtasks, fewer studies have explored the automation of fluid-related tasks. In this work, we explore the automation of both steps in the irrigation-suction procedure and train two vision-based RL agents to complete irrigation and suction autonomously. To achieve this, a platform is developed for creating simulated surgical robot learning environments and for training agents, and two simulated learning environments are built for irrigation and suction with visually plausible fluid rendering capabilities. With techniques such as domain randomization (DR) and carefully designed reward functions, two agents are trained in the simulator and transferred to the real world. Individual evaluations of both agents show satisfactory real-world results. With an initial amount of around 5 grams of contaminants, the irrigation agent ultimately achieved an average of 2.21 grams remaining after a manual suction. As a comparison, fully manual operation by a human results in 1.90 grams remaining. The suction agent achieved 2.64 and 2.24 grams of liquid remaining across two trial groups with more than 20 and 30 grams of initial liquid in the container. Fully autonomous irrigation-suction trials reduce the contaminant in the container from around 5 grams to an average of 2.42 grams, although yielding a higher total weight remaining (4.40 grams) due to residual liquid not suctioned. Further information about the project is available at this https URL.
- [106] arXiv:2411.14623 [pdf, html, other]
-
Title: Initial Evidence of Elevated Reconnaissance Attacks Against Nodes in P2P Overlay NetworksSubjects: Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
We hypothesize that peer-to-peer (P2P) overlay network nodes can be attractive to attackers due to their visibility, sustained uptime, and resource potential. Towards validating this hypothesis, we investigate the state of active reconnaissance attacks on Ethereum P2P network nodes by deploying a series of honeypots alongside actual Ethereum nodes across globally distributed vantage points. We find that Ethereum nodes experience not only increased attacks, but also specific types of attacks targeting particular ports and services. Furthermore, we find evidence that the threat assessment on our nodes is applicable to the wider P2P network by having performed port scans on other reachable peers. Our findings provide insights into potential mitigation strategies to improve the security of the P2P networking layer.
- [107] arXiv:2411.14625 [pdf, html, other]
-
Title: Predictive Analytics of Air Alerts in the Russian-Ukrainian WarSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
The paper considers exploratory data analysis and approaches in predictive analytics for air alerts during the Russian-Ukrainian war which broke out on Feb 24, 2022. The results illustrate that alerts in regions correlate with one another and have geospatial patterns which make it feasible to build a predictive model which predicts alerts that are expected to take place in a certain region within a specified time period. The obtained results show that the alert status in a particular region is highly dependent on the features of its adjacent regions. Seasonality features like hour, day of the week, and month are also crucial in predicting the target variable. Some regions rely heavily on the time feature, which equals the number of days since the initial date of the dataset. From this, we can deduce that the air alert pattern changes over time.
- [108] arXiv:2411.14627 [pdf, html, other]
-
Title: Generative AI for Music and AudioComments: PhD DissertationSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Generative AI has been transforming the way we interact with technology and consume content. In the next decade, AI technology will reshape how we create audio content in various media, including music, theater, films, games, podcasts, and short videos. In this dissertation, I introduce the three main directions of my research centered around generative AI for music and audio: 1) multitrack music generation, 2) assistive music creation tools, and 3) multimodal learning for audio and music. Through my research, I aim to answer the following two fundamental questions: 1) How can AI help professionals or amateurs create music and audio content? 2) Can AI learn to create music in a way similar to how humans learn music? My long-term goal is to lower the barrier of entry for music composition and democratize audio content creation.
- [109] arXiv:2411.14628 [pdf, html, other]
-
Title: HotSpot: Screened Poisson Equation for Signed Distance Function OptimizationSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
We propose a method, HotSpot, for optimizing neural signed distance functions, based on a relation between the solution of a screened Poisson equation and the distance function. Existing losses such as the eikonal loss cannot guarantee the recovered implicit function to be a distance function, even when the implicit function satisfies the eikonal equation almost everywhere. Furthermore, the eikonal loss suffers from stability issues in optimization, and the remedies that introduce area or divergence minimization can lead to oversmoothing. We address these challenges by designing a loss function that, when minimized, converges to the true distance function, is stable, and naturally penalizes large surface area. We provide theoretical analysis and experiments on both challenging 2D and 3D datasets and show that our method provides better surface reconstruction and more accurate distance approximation.
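For context, the sketch below implements the standard eikonal regularization that the paper argues is insufficient on its own (HotSpot addresses its shortcomings with a loss derived from the screened Poisson equation); the MLP and the sampling are illustrative.

```python
# The eikonal baseline: penalize ||grad f|| deviating from 1 at sampled points.
import torch

sdf = torch.nn.Sequential(               # a small MLP standing in for the neural SDF
    torch.nn.Linear(3, 64), torch.nn.Softplus(beta=100),
    torch.nn.Linear(64, 64), torch.nn.Softplus(beta=100),
    torch.nn.Linear(64, 1),
)

pts = (torch.rand(1024, 3) * 2 - 1).requires_grad_(True)   # samples in [-1, 1]^3
f = sdf(pts)
grad = torch.autograd.grad(f.sum(), pts, create_graph=True)[0]
eikonal_loss = ((grad.norm(dim=-1) - 1.0) ** 2).mean()     # enforce ||grad f|| = 1
eikonal_loss.backward()  # satisfying this a.e. still does not guarantee a distance function
print(float(eikonal_loss))
```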
- [110] arXiv:2411.14632 [pdf, other]
-
Title: An Investigation of the Relationship Between Crime Rate and Police CompensationJhancy Amarsingh, Likhith Kumar Reddy Appakondreddigari, Ashish Nunna, Charishma Choudary Tummala, John Winship, Alex Zhou, Huthaifa I. AshqarSubjects: Computers and Society (cs.CY); Applications (stat.AP)
The goal of this paper is to assess whether there is any correlation between police salaries and crime rates. Using public data sources that contain Baltimore Crime Rates and Baltimore Police Department (BPD) salary information from 2011 to 2021, our research uses a variety of techniques to capture and measure any correlation between the two. Based on that correlation, the paper then uses established social theories to make recommendations on how this data can potentially be used by State Leadership. Our initial results show a negative correlation between salary/compensation levels and crime rates.
- [111] arXiv:2411.14637 [pdf, html, other]
-
Title: Enhancing Clinical Trial Patient Matching through Knowledge Augmentation with Multi-AgentsSubjects: Multiagent Systems (cs.MA)
Matching patients effectively and efficiently for clinical trials is a significant challenge due to the complexity and variability of patient profiles and trial criteria. This paper presents a novel framework, Multi-Agents for Knowledge Augmentation (MAKA), designed to enhance patient-trial matching by dynamically supplementing matching prompts with external, domain-specific knowledge. The MAKA architecture consists of five key components: a knowledge probing agent that detects gaps in domain knowledge, a navigation agent that manages interactions among multiple specialized knowledge augmentation agents, a knowledge augmentation agent that incorporates relevant information into patient-trial matching prompts, a supervision agent that aligns the outputs from other agents with the instructions, and a matching agent that makes the final selection decision. This approach enhances the accuracy and contextual richness of patient matching, addresses inherent knowledge gaps in both trial criteria and large language models (LLMs), and improves the alignment between patient characteristics and the criteria.
- [112] arXiv:2411.14639 [pdf, html, other]
-
Title: Differentially Private Adaptation of Diffusion Models via Noisy Aggregated EmbeddingsSubjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
We introduce novel methods for adapting diffusion models under differential privacy (DP) constraints, enabling privacy-preserving style and content transfer without fine-tuning. Traditional approaches to private adaptation, such as DP-SGD, incur significant computational overhead and degrade model performance when applied to large, complex models. Our approach instead leverages embedding-based techniques: Universal Guidance and Textual Inversion (TI), adapted with differentially private mechanisms. We apply these methods to Stable Diffusion for style adaptation using two private datasets: a collection of artworks by a single artist and pictograms from the Paris 2024 Olympics. Experimental results show that the TI-based adaptation achieves superior fidelity in style transfer, even under strong privacy guarantees, while both methods maintain high privacy resilience by employing calibrated noise and subsampling strategies. Our findings demonstrate a feasible and efficient pathway for privacy-preserving diffusion model adaptation, balancing data protection with the fidelity of generated images, and offer insights into embedding-driven methods for DP in generative AI applications.
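A sketch of the noisy-aggregation idea: clip per-image embeddings, sum them, and add Gaussian noise calibrated to the clipping norm so that the released mean embedding is differentially private. The calibration shown is the textbook Gaussian-mechanism bound and may differ from the paper's.

```python
# Differentially private aggregation of embeddings via the Gaussian mechanism.
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 768))          # private per-image embeddings (stand-ins)
C = 1.0                                    # clipping norm = L2 sensitivity of the sum
eps, delta = 1.0, 1e-5

norms = np.linalg.norm(emb, axis=1, keepdims=True)
clipped = emb * np.minimum(1.0, C / norms)              # per-sample L2 clipping

sigma = C * np.sqrt(2 * np.log(1.25 / delta)) / eps     # classic bound, valid for eps <= 1
noisy_mean = (clipped.sum(axis=0) + rng.normal(0, sigma, 768)) / len(emb)
print("noise std per coordinate of the mean:", sigma / len(emb))
```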
- [113] arXiv:2411.14642 [pdf, html, other]
-
Title: VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent SpaceSubjects: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Generating high-quality speech efficiently remains a key challenge for generative models in speech synthesis. This paper introduces VQalAttent, a lightweight model designed to generate fake speech with tunable performance and interpretability. Leveraging the AudioMNIST dataset, consisting of human utterances of decimal digits (0-9), our method employs a two-step architecture: first, a scalable vector quantized autoencoder (VQ-VAE) that compresses audio spectrograms into discrete latent representations, and second, a decoder-only transformer that learns the probability model of these latents. The trained transformer generates similar latent sequences, convertible to audio spectrograms by the VQ-VAE decoder, from which we generate fake utterances. Interpreting the statistical and perceptual quality of the fakes, depending on the dimension and the extrinsic information of the latent space, enables guided improvements in larger, commercial generative models. As a valuable tool for understanding and refining audio synthesis, our results demonstrate VQalAttent's capacity to generate intelligible speech samples with limited computational resources, while the modularity and transparency of the training pipeline help easily correlate the analytics with modular modifications, hence providing insights for more complex models.
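A sketch of the vector-quantization step at the heart of the pipeline: each encoder output is snapped to its nearest codebook vector, and the resulting index sequence is what the decoder-only transformer later models. The codebook size and dimensions are illustrative.

```python
# Nearest-codebook vector quantization, as in the VQ-VAE bottleneck.
import numpy as np

rng = np.random.default_rng(0)
K, d = 512, 64
codebook = rng.normal(size=(K, d))         # learned codebook (random here)
z_e = rng.normal(size=(100, d))            # encoder outputs for 100 spectrogram patches

d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
codes = d2.argmin(axis=1)                  # discrete latent indices for the transformer
z_q = codebook[codes]                      # quantized latents fed to the decoder

print("first indices:", codes[:8])
print("quantization MSE:", ((z_e - z_q) ** 2).mean())
```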
- [114] arXiv:2411.14647 [pdf, html, other]
-
Title: Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural DomainsYurii Paniv, Artur Kiulian, Dmytro Chaplynskyi, Mykola Khandoga, Anton Polishko, Tetiana Bas, Guillermo GabrielliSubjects: Computation and Language (cs.CL)
While the evaluation of multimodal English-centric models is an active area of research with numerous benchmarks, there is a profound lack of benchmarks or evaluation suites for low- and mid-resource languages. We introduce ZNO-Vision, a comprehensive multimodal Ukrainian-centric benchmark derived from standardized university entrance examination (ZNO). The benchmark consists of over 4,300 expert-crafted questions spanning 12 academic disciplines, including mathematics, physics, chemistry, and humanities. We evaluated the performance of both open-source models and API providers, finding that only a handful of models performed above baseline. Alongside the new benchmark, we performed the first evaluation study of multimodal text generation for the Ukrainian language: we measured caption generation quality on the Multi30K-UK dataset, translated the VQA benchmark into Ukrainian, and measured performance degradation relative to original English versions. Lastly, we tested a few models from a cultural perspective on knowledge of national cuisine. We believe our work will advance multimodal generation capabilities for the Ukrainian language and our approach could be useful for other low-resource languages.
- [115] arXiv:2411.14650 [pdf, html, other]
-
Title: A generic Scheme For the time-dependent Navier-Stokes Equation Coupled With The Heat EquationSubjects: Numerical Analysis (math.NA)
In this work, we study the gradient discretisation method (GDM) of the time-dependent Navier-Stokes equations coupled with the heat equation, where the viscosity depends on the temperature. We design the discrete method and prove its convergence without non-physical conditions. The paper closes with numerical experiments that confirm the theoretical results.
- [116] arXiv:2411.14652 [pdf, html, other]
-
Title: Social Media Algorithms Can Shape Affective Polarization via Exposure to Antidemocratic Attitudes and Partisan AnimosityTiziano Piccardi, Martin Saveski, Chenyan Jia, Jeffrey T. Hancock, Jeanne L. Tsai, Michael BernsteinSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
There is widespread concern about the negative impacts of social media feed ranking algorithms on political polarization. Leveraging advancements in large language models (LLMs), we develop an approach to re-rank feeds in real-time to test the effects of content that is likely to polarize: expressions of antidemocratic attitudes and partisan animosity (AAPA). In a preregistered 10-day field experiment on X/Twitter with 1,256 consented participants, we increase or decrease participants' exposure to AAPA in their algorithmically curated feeds. We observe more positive outparty feelings when AAPA exposure is decreased and more negative outparty feelings when AAPA exposure is increased. Exposure to AAPA content also results in an immediate increase in negative emotions, such as sadness and anger. The interventions do not significantly impact traditional engagement metrics such as re-post and favorite rates. These findings highlight a potential pathway for developing feed algorithms that mitigate affective polarization by addressing content that undermines the shared values required for a healthy democracy.
- [117] arXiv:2411.14653 [pdf, other]
-
Title: Does Open Access Foster Interdisciplinary Citation? Decomposing Open Access Citation AdvantageSubjects: Digital Libraries (cs.DL)
The existence of an open access (OA) citation advantage, that is, whether OA increases citations, has been a topic of interest for many years. Although numerous previous studies have focused on whether OA increases citations, expectations for OA go beyond that. One such expectation is the promotion of knowledge transfer across various fields. This study aimed to clarify whether OA, especially gold OA, increases interdisciplinary citations in various natural science fields. Specifically, we measured the effect of OA on interdisciplinary and within-discipline citation counts by decomposing an existing metric of the OA citation advantage. The results revealed that OA increases both interdisciplinary and within-discipline citations in many fields and increases only interdisciplinary citations in chemistry, computer science, and clinical medicine. Among these fields, clinical medicine tends to obtain more interdisciplinary citations without being influenced by specific journals or papers. The findings indicate that OA fosters knowledge transfer to different fields, which extends our understanding of its effects.
- [118] arXiv:2411.14654 [pdf, html, other]
-
Title: Comparative Analysis of Pooling Mechanisms in LLMs: A Sentiment Analysis PerspectiveComments: 4 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have revolutionized natural language processing (NLP) by delivering state-of-the-art performance across a variety of tasks. Among these, Transformer-based models like BERT and GPT rely on pooling layers to aggregate token-level embeddings into sentence-level representations. Common pooling mechanisms such as Mean, Max, and Weighted Sum play a pivotal role in this aggregation process. Despite their widespread use, the comparative performance of these strategies on different LLM architectures remains underexplored. To address this gap, this paper investigates the effects of these pooling mechanisms on two prominent LLM families -- BERT and GPT -- in the context of sentence-level sentiment analysis. Comprehensive experiments reveal that each pooling mechanism exhibits unique strengths and weaknesses depending on the task's specific requirements. Our findings underline the importance of selecting pooling methods tailored to the demands of particular applications, prompting a re-evaluation of common assumptions regarding pooling operations. By offering actionable insights, this study contributes to the optimization of LLM-based models for downstream tasks.
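For concreteness, the three pooling strategies compared here can be written in a few lines over masked token embeddings; this is a generic sketch, not the paper's implementation:

```python
# Mean, Max, and Weighted Sum pooling over token embeddings h (batch, seq, dim)
# with a boolean padding mask (batch, seq); padded positions are excluded.
import torch

def mean_pool(h, mask):
    m = mask.unsqueeze(-1).float()                       # (B, T, 1)
    return (h * m).sum(1) / m.sum(1).clamp(min=1e-9)

def max_pool(h, mask):
    h = h.masked_fill(~mask.unsqueeze(-1), float("-inf"))
    return h.max(dim=1).values

def weighted_sum_pool(h, mask, w):
    # `w` is a learned scoring vector (dim,); softmax over valid tokens only.
    scores = (h @ w).masked_fill(~mask, float("-inf"))   # (B, T)
    alpha = scores.softmax(dim=1).unsqueeze(-1)          # (B, T, 1)
    return (h * alpha).sum(1)

h = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]], dtype=torch.bool)
w = torch.randn(8)
print(mean_pool(h, mask).shape, max_pool(h, mask).shape,
      weighted_sum_pool(h, mask, w).shape)               # all (2, 8)
```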
- [119] arXiv:2411.14655 [pdf, html, other]
-
Title: Construction and Preliminary Validation of a Dynamic Programming Concept InventoryMatthew Ferland, Varun Nagaraj Rao, Arushi Arora, Drew van der Poel, Michael Luu, Randy Huynh, Freddy Reiber, Sandra Ossman, Seth Poulsen, Michael ShindlerComments: Accepted to SIGCSE 2025Subjects: Data Structures and Algorithms (cs.DS); Computers and Society (cs.CY)
Concept inventories are standardized assessments that evaluate student understanding of key concepts within academic disciplines. While prevalent across STEM fields, their development lags for advanced computer science topics like dynamic programming (DP) -- an algorithmic technique that poses significant conceptual challenges for undergraduates. To fill this gap, we developed and validated a Dynamic Programming Concept Inventory (DPCI). We detail the iterative process used to formulate multiple-choice questions targeting known student misconceptions about DP concepts identified through prior research studies. We discuss key decisions, tradeoffs, and challenges faced in crafting probing questions to subtly reveal these conceptual misunderstandings. We conducted a preliminary psychometric validation by administering the DPCI to 172 undergraduate CS students, finding our questions to be of appropriate difficulty and to discriminate effectively between differing levels of student understanding. Taken together, our validated DPCI will enable instructors to accurately assess student mastery of DP. Moreover, our approach for devising a concept inventory for an advanced theoretical computer science concept can guide future efforts to create assessments for other under-evaluated areas currently lacking coverage.
- [120] arXiv:2411.14662 [pdf, other]
-
Title: Multiset Transformer: Advancing Representation Learning in Persistence DiagramsSubjects: Machine Learning (cs.LG)
To improve persistence diagram representation learning, we propose Multiset Transformer. This is the first neural network that utilizes attention mechanisms specifically designed for multisets as inputs and offers rigorous theoretical guarantees of permutation invariance. The architecture integrates multiset-enhanced attentions with a pool-decomposition scheme, allowing multiplicities to be preserved across equivariant layers. This capability enables full leverage of multiplicities while significantly reducing both computational and spatial complexity compared to the Set Transformer. Additionally, our method can greatly benefit from clustering as a preprocessing step to further minimize complexity, an advantage not possessed by the Set Transformer. Experimental results demonstrate that the Multiset Transformer outperforms existing neural network methods in the realm of persistence diagram representation learning.
- [121] arXiv:2411.14666 [pdf, other]
-
Title: Brain-Computer Interfaces for Emotional Regulation in Patients with Various DisordersSubjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Neurological and physiological disorders that impact emotional regulation each have their own unique characteristics, which are important to understand in order to create a generalized solution for all of them. The purpose of this experiment is to explore the potential applications of EEG-based Brain-Computer Interfaces (BCIs) in enhancing emotional regulation for individuals with neurological and physiological disorders. The research focuses on the development of a novel neural network algorithm for understanding EEG data, with a particular emphasis on recognizing and regulating emotional states. The procedure involves the collection of EEG-based emotion data from OpenNeuro. Using novel data modification techniques, information from the dataset can be altered to create a dataset that has the neural patterns of patients with disorders while showing emotional change. The data analysis reveals promising results, as the algorithm is able to successfully classify emotional states with a high degree of accuracy. This suggests that EEG-based BCIs have the potential to be a valuable tool in aiding individuals with a range of neurological and physiological disorders in recognizing and regulating their emotions. To improve upon this work, data should be collected from patients with neurological disorders to improve overall sample diversity.
- [122] arXiv:2411.14672 [pdf, html, other]
-
Title: Multiverse of Greatness: Generating Story Branches with LLMsPittawat Taveekitworachai, Chollakorn Nimpattanavong, Mustafa Can Gursesli, Antonio Lanata, Andrea Guazzini, Ruck ThawonmasComments: 12 pages, 14 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
This paper presents Dynamic Context Prompting/Programming (DCP/P), a novel framework for interacting with LLMs to generate graph-based content with a dynamic context window history. While there is an existing study utilizing LLMs to generate a visual novel game, the previous study involved a manual process of output extraction and did not provide flexibility in generating a longer, coherent story. We evaluate DCP/P against our baseline, which does not provide context history to an LLM and only relies on the initial story data. Through objective evaluation, we show that simply providing the LLM with a summary leads to a subpar story compared to additionally providing the LLM with the proper context of the story. We also provide an extensive qualitative analysis and discussion. We qualitatively examine the quality of the objectively best-performing generated game from each approach. In addition, we examine biases in word choices and word sentiment of the generated content. Consistent with previous studies, we find that LLMs are biased towards certain words, even across different LLM families. Finally, we provide a comprehensive discussion on opportunities for future studies.
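One way to picture a dynamic context window is a prompt builder that carries a running summary plus only the most recent story nodes. The sketch below is a hypothetical illustration of that idea, not the DCP/P framework itself; the `llm` function and prompt wording are placeholders:

```python
# Hypothetical sketch: prompt an LLM for the next story branch using a running
# summary plus a sliding window of the most recent nodes.
def llm(prompt: str) -> str:
    return "..."  # placeholder for a real model call

def generate_branch(story_summary: str, recent_nodes: list[str],
                    choice: str, window: int = 3) -> str:
    context = "\n".join(recent_nodes[-window:])   # keep only the last few nodes
    prompt = (
        f"Story so far (summary): {story_summary}\n"
        f"Recent scenes:\n{context}\n"
        f"The reader chose: {choice}\n"
        "Write the next scene and two new choices."
    )
    return llm(prompt)
```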
- [123] arXiv:2411.14676 [pdf, html, other]
-
Title: Depth-first search for tensor rank and border rank over finite fieldsComments: 10 pages, to appear in MURJ Fall 2024Subjects: Computational Complexity (cs.CC)
We present an $O^*\left(|\mathbb{F}|^{(R-n_*)\left(\sum_d n_d\right)+n_*}\right)$-time algorithm for determining whether a tensor of shape $n_0\times\dots\times n_{D-1}$ over a finite field $\mathbb{F}$ has rank $\le R$, where $n_*:=\max_d n_d$; we assume without loss of generality that $\forall d:n_d\le R$. We also extend this problem to its border rank analog, i.e., determining tensor rank over rings of the form $\mathbb{F}[x]/(x^H)$, and give an $O^*\left(|\mathbb{F}|^{H\sum_{1\le r\le R} \sum_d \min(r,n_d)}\right)$-time algorithm. Both of our algorithms use polynomial space.
- [124] arXiv:2411.14678 [pdf, html, other]
-
Title: Reinterpreting PID Controller From the Perspective of State Feedback and Lumped Disturbance CompensationSubjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
This paper analyzes the motion of solutions to non-homogeneous linear differential equations. It further clarifies that a proportional-integral-derivative (PID) controller essentially comprises two parts: a homogeneous controller and a disturbance observer, which are responsible for stabilizing the homogeneous system and compensating for the lumped disturbances (non-homogeneous components) of the system, respectively. Based on this framework, the impact of measurement noise on control performance is examined, and a parameter tuning scheme for the traditional PID controller is provided. Finally, as examples, controllers are designed for two representative control problems: a trajectory tracking controller for an underactuated vertical takeoff and landing (VTOL) aircraft in the time domain, and a lateral controller for a vehicle in the distance domain.
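Read this way, a discrete PID update is the sum of a stabilizing feedback term (the P and D parts) and an integral term that estimates and cancels the lumped disturbance. A toy sketch of the decomposition, with made-up gains and a simple first-order plant:

```python
# Discrete-time PID grouped as: feedback (P + D) stabilizes the homogeneous
# system; the integral term acts as a disturbance estimator/compensator.
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_err = 0.0

    def update(self, err):
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        feedback = self.kp * err + self.kd * deriv   # homogeneous stabilization
        compensation = self.ki * self.integral       # lumped-disturbance estimate
        return feedback + compensation

# Toy first-order plant x' = -x + u + d with constant disturbance d = 0.5.
pid, x, dt = PID(2.0, 1.0, 0.1, 0.01), 0.0, 0.01
for _ in range(2000):
    u = pid.update(1.0 - x)          # track setpoint 1.0
    x += (-x + u + 0.5) * dt
print(round(x, 3))                   # settles near 1.0 despite the disturbance
```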
- [125] arXiv:2411.14679 [pdf, html, other]
-
Title: Recursive Gaussian Process State Space ModelSubjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
Learning dynamical models from data is not only fundamental but also holds great promise for advancing principle discovery, time-series prediction, and controller design. Among various approaches, Gaussian Process State-Space Models (GPSSMs) have recently gained significant attention due to their combination of flexibility and interpretability. However, for online learning, the field lacks an efficient method suitable for scenarios where prior information regarding data distribution and model function is limited. To address this issue, this paper proposes a recursive GPSSM method with adaptive capabilities for both operating domains and Gaussian process (GP) hyperparameters. Specifically, we first utilize first-order linearization to derive a Bayesian update equation for the joint distribution between the system state and the GP model, enabling closed-form and domain-independent learning. Second, an online selection algorithm for inducing points is developed based on informative criteria to achieve lightweight learning. Third, to support online hyperparameter optimization, we recover historical measurement information from the current filtering distribution. Comprehensive evaluations on both synthetic and real-world datasets demonstrate the superior accuracy, computational efficiency, and adaptability of our method compared to state-of-the-art online GPSSM techniques.
- [126] arXiv:2411.14680 [pdf, html, other]
-
Title: Self-Supervised Learning for Ordered Three-Dimensional StructuresComments: Version as submitted to the Learning on Graphs Conference 2022, with small clarifying editsSubjects: Machine Learning (cs.LG)
Recent work has proven that training large language models with self-supervised tasks and fine-tuning these models to complete new tasks in a transfer learning setting is a powerful idea, enabling the creation of models with many parameters, even with little labeled data; however, the number of domains that have harnessed these advancements has been limited. In this work, we formulate a set of geometric tasks suitable for the large-scale study of ordered three-dimensional structures, without requiring any human intervention in data labeling. We build deep rotation- and permutation-equivariant neural networks based on geometric algebra and use them to solve these tasks on both idealized and simulated three-dimensional structures. Quantifying order in complex-structured assemblies remains a long-standing challenge in materials physics; these models can elucidate the behavior of real self-assembling systems in a variety of ways, from distilling insights from learned tasks without further modification to solving new tasks with smaller amounts of labeled data via transfer learning.
- [127] arXiv:2411.14681 [pdf, html, other]
-
Title: TrojanEdit: Backdooring Text-Based Image Editing ModelsSubjects: Cryptography and Security (cs.CR)
As diffusion models have achieved success in image generation tasks, many studies have extended them to other related fields like image editing. Unlike image generation, image editing aims to modify an image based on user requests while keeping other parts of the image unchanged. Among these, text-based image editing is the most representative. Studies have shown that diffusion models are vulnerable to backdoor attacks, where attackers may poison the training data to inject the backdoor into models. However, previous backdoor attacks on diffusion models primarily focus on image generation models without considering image editing models. Given that image editing models accept multimodal inputs, it raises a new question regarding the effectiveness of triggers of different modalities in backdoor attacks on these models. To address this question, we propose a backdoor attack framework for image editing models, named TrojanEdit, which can handle triggers of different modalities. We explore five types of visual triggers and three types of textual triggers, and combine them into fifteen types of multimodal triggers, conducting extensive experiments for three types of backdoor attack goals. Our experimental results show that the image editing model has a backdoor bias for texture triggers. Compared to visual triggers, textual triggers have stronger attack effectiveness but also cause more damage to the model's normal functionality. Furthermore, we found that multimodal triggers can achieve a good balance between attack effectiveness and the model's normal functionality.
- [128] arXiv:2411.14688 [pdf, html, other]
-
Title: Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video CaptioningSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Generating automatic dense captions for videos that accurately describe their contents remains a challenging area of research. Most current models require processing the entire video at once. Instead, we propose an efficient, online approach which outputs frequent, detailed and temporally aligned captions, without access to future frames. Our model uses a novel autoregressive factorized decoding architecture, which models the sequence of visual features for each time segment, outputting localized descriptions and efficiently leveraging the context from the previous video segments. This allows the model to output frequent, detailed captions to more comprehensively describe the video, according to its actual local content, rather than mimic the training data. Additionally, we propose an optimization for efficient training and inference, which enables scaling to longer videos. Our approach shows excellent performance compared to both offline and online methods, and uses 20% less compute. The annotations produced are much more comprehensive and frequent, and can further be utilized in automatic video tagging and in large-scale video data harvesting.
- [129] arXiv:2411.14691 [pdf, other]
-
Title: EV-PINN: A Physics-Informed Neural Network for Predicting Electric Vehicle DynamicsComments: This work has been submitted to the 2025 IEEE International Conference on Robotics and Automation (ICRA) for possible publicationSubjects: Machine Learning (cs.LG)
An onboard prediction of dynamic parameters (e.g., aerodynamic drag, rolling resistance) enables accurate path planning for EVs. This paper presents EV-PINN, a Physics-Informed Neural Network approach in predicting instantaneous battery power and cumulative energy consumption during cruising while generalizing to the nonlinear dynamics of an EV. Our method learns real-world parameters such as motor efficiency, regenerative braking efficiency, vehicle mass, coefficient of aerodynamic drag, and coefficient of rolling resistance using automatic differentiation based on dynamics and ensures consistency with ground truth vehicle data. EV-PINN was validated using 15 and 35 minutes of in-situ battery log data from the Tesla Model 3 Long Range and Tesla Model S, respectively. With only vehicle speed and time as inputs, our model achieves high accuracy and generalization to dynamics, with validation losses of 0.002195 and 0.002292, respectively. This demonstrates EV-PINN's effectiveness in estimating parameters and predicting battery usage under actual driving conditions without the need for additional sensors.
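The parameter-learning idea can be illustrated with a toy longitudinal power model whose coefficients are recovered by automatic differentiation; the physics and constants below are generic assumptions for the sketch, not the paper's exact formulation:

```python
# Toy version of physics-informed parameter recovery: fit drag and
# rolling-resistance coefficients to (speed, power) data via autograd.
import torch

m, g, rho, A = 1800.0, 9.81, 1.2, 2.2               # assumed vehicle constants
v = torch.linspace(5.0, 35.0, 100)                  # cruising speeds (m/s)
true_cd, true_crr = 0.23, 0.012
power = (0.5 * rho * A * true_cd * v**2 + true_crr * m * g) * v  # synthetic data

cd = torch.tensor(0.5, requires_grad=True)          # deliberately bad guesses
crr = torch.tensor(0.05, requires_grad=True)
opt = torch.optim.Adam([cd, crr], lr=1e-2)
for _ in range(3000):
    pred = (0.5 * rho * A * cd * v**2 + crr * m * g) * v
    loss = ((pred - power) / power).pow(2).mean()   # relative error, well scaled
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(cd), float(crr))                        # approaches 0.23 and 0.012
```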
- [130] arXiv:2411.14694 [pdf, html, other]
-
Title: A Data-Driven Pool Strategy for Price-Makers Under Imperfect InformationComments: Paper accepted for IEEE Transactions on Power Systems. Personal use of this material is permitted. Permission from IEEE must be obtained for all other usesJournal-ref: IEEE Transactions on Power Systems, vol. 38, no. 1, pp. 278-289, Jan. 2023Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
This paper studies the pool strategy for price-makers under imperfect information. In this setting, market participants cannot obtain essential transmission parameters of the power system. Thus, price-makers should estimate the market results with respect to their offer curves using available historical information. The linear programming model of economic dispatch is analyzed with the theory of rim multi-parametric linear programming (rim-MPLP). The characteristics of system patterns (combinations of status flags for generating units and transmission lines) are revealed. A multi-class classification model based on support vector machine (SVM) is trained to map the offer curves to system patterns, which is then integrated into the decision framework of the price-maker. The performance of the proposed method is validated on the IEEE 30-bus system, Illinois synthetic 200-bus system, and South Carolina synthetic 500-bus system.
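The classification step, mapping offer-curve features to system-pattern labels, might look like the following sketch, with synthetic data standing in for historical market outcomes:

```python
# Sketch of the multi-class SVM step: offer-curve features in, system-pattern
# label out. Features and labels here are synthetic placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))          # e.g., breakpoints/slopes of offer curves
y = (X[:, 0] > 0).astype(int) + 2 * (X[:, 1] > 0).astype(int)  # 4 toy "patterns"

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X[:400], y[:400])
print(clf.score(X[400:], y[400:]))     # held-out classification accuracy
```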
- [131] arXiv:2411.14695 [pdf, html, other]
-
Title: Anti-Forgetting Adaptation for Unsupervised Person Re-identificationComments: Accepted to TPAMISubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Regular unsupervised domain adaptive person re-identification (ReID) focuses on adapting a model from a source domain to a fixed target domain. However, an adapted ReID model can hardly retain previously-acquired knowledge and generalize to unseen data. In this paper, we propose a Dual-level Joint Adaptation and Anti-forgetting (DJAA) framework, which incrementally adapts a model to new domains without forgetting source domain and each adapted target domain. We explore the possibility of using prototype and instance-level consistency to mitigate the forgetting during the adaptation. Specifically, we store a small number of representative image samples and corresponding cluster prototypes in a memory buffer, which is updated at each adaptation step. With the buffered images and prototypes, we regularize the image-to-image similarity and image-to-prototype similarity to rehearse old knowledge. After the multi-step adaptation, the model is tested on all seen domains and several unseen domains to validate the generalization ability of our method. Extensive experiments demonstrate that our proposed method significantly improves the anti-forgetting, generalization and backward-compatible ability of an unsupervised person ReID model.
- [132] arXiv:2411.14698 [pdf, html, other]
-
Title: Improving Mathematical Reasoning Capabilities of Small Language Models via Feedback-Driven DistillationSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) demonstrate exceptional reasoning capabilities, often achieving state-of-the-art performance in various tasks. However, their substantial computational and memory demands, due to billions of parameters, hinder deployment in resource-constrained environments. A promising solution is knowledge distillation, where LLMs transfer reasoning capabilities to Small Language Models (SLMs, $\le$ 1B parameters), enabling wider deployment on low-resource devices. Existing methods primarily focus on generating high-quality reasoning rationales for distillation datasets but often neglect the critical role of data quantity and quality. To address these challenges, we propose a Feedback-Driven Distillation (FDD) framework to enhance SLMs' mathematical reasoning capabilities. In the initialization stage, a distillation dataset is constructed by prompting LLMs to pair mathematical problems with corresponding reasoning rationales. We classify problems into easy and hard categories based on SLM performance. For easy problems, LLMs generate more complex variations, while for hard problems, new questions of similar complexity are synthesized. In addition, we propose a multi-round distillation paradigm to iteratively enrich the distillation datasets, thereby progressively improving the mathematical reasoning abilities of SLMs. Experimental results demonstrate that our method can make SLMs achieve SOTA mathematical reasoning performance.
- [133] arXiv:2411.14699 [pdf, html, other]
-
Title: DNN based Two-stage Compensation Algorithm for THz Hybrid Beamforming with imperfect HardwareSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Terahertz (THz) communication is envisioned as a key technology for 6G and beyond wireless systems owing to its multi-GHz bandwidth. To maintain the same aperture area and the same link budget as at lower frequencies, ultra-massive multi-input and multi-output (UM-MIMO) with hybrid beamforming is promising. Nevertheless, hardware imperfections, particularly at THz frequencies, can degrade spectral efficiency and lead to a high symbol error rate (SER), which is often overlooked yet imperative to address in practical THz communication systems. In this paper, hybrid beamforming is investigated for THz UM-MIMO systems accounting for comprehensive hardware imperfections, including DAC and ADC quantization errors, in-phase and quadrature imbalance (IQ imbalance), phase noise, amplitude and phase errors of imperfect phase shifters, and power amplifier (PA) nonlinearity. Then, a two-stage hardware imperfection compensation algorithm is proposed. A deep neural network (DNN) is developed in the first stage to represent the combined hardware imperfections, while in the second stage, the digital precoder in the transmitter (Tx) or the combiner in the receiver (Rx) is designed using a neural network to effectively compensate for these imperfections. Furthermore, to balance performance and network complexity, three slimming methods, including pruning, parameter sharing, and removing parts of the network, are proposed and combined to slim the DNN in the first stage. Numerical results show that Tx compensation can perform better than Rx compensation. Additionally, using the combined slimming methods can reduce parameters by 97.2% and running time by 39.2% while maintaining nearly the same performance in both uncoded and coded systems.
- [134] arXiv:2411.14700 [pdf, html, other]
-
Title: Optimal Energy Dispatch of Grid-Connected Electric Vehicle Considering Lithium Battery Electrochemical ModelComments: Paper accepted for IEEE Transactions on Smart Grid. Personal use of this material is permitted. Permission from IEEE must be obtained for all other usesJournal-ref: IEEE Transactions on Smart Grid, vol. 15, no. 3, pp. 3000-3015, May 2024Subjects: Systems and Control (eess.SY)
The grid-connected electric vehicles (EVs) serve as a promising regulating resource in the distribution grid with Vehicle-to-Grid (V2G) facilities. In the day-ahead stage, electric vehicle batteries (EVBs) need to be precisely dispatched and controlled to ensure high efficiency and prevent degradation. This article focuses on incorporating a refined battery model, i.e., the electrochemical model (EM), into the optimal dispatch of a local energy system with high penetration of EVs, which replenish energy through V2G-equipped charging stations and battery swapping stations (BSS). To utilize the EM efficiently, recursive EVB constraints and a corresponding matrix-based state update method are proposed based on EM power characterization. The charging EV state distribution is profiled, and a multi-layer BSS model along with binary aggregation is proposed, in order to overcome the computational complexity of combining the refined battery constraints with the mixed-integer optimization. Finally, a local energy system scenario is investigated for evaluation. The efficiency and effectiveness of EM consideration are assessed from the perspective of both the system and the battery.
- [135] arXiv:2411.14701 [pdf, html, other]
-
Title: Personalised 3D Human Digital Twin with Soft-Body Feet for Walking SimulationComments: 10 pages, 16th International Conference on Social RoboticsSubjects: Robotics (cs.RO)
With the increasing use of assistive robots in rehabilitation and assisted mobility of human patients, there has been a need for a deeper understanding of human-robot interactions, particularly through simulations that allow these interactions to be studied in a digital environment. There is an emphasis on accurately modelling personalised 3D human digital twins in these simulations to glean more insights on human-robot interactions. In this paper, we propose to integrate personalised soft-body feet, generated using the motion capture data of real human subjects, into a skeletal model and train it with a walking control policy. Through evaluation using ground reaction force and joint angle results, the soft-body feet were able to generate ground reaction force results comparable to real measured data and closely follow the joint angle results of the bare skeletal model and the reference motion. This presents an interesting avenue to produce a dynamically accurate human model in simulation driven by its own control policy while only seeing kinematic information during training.
- [136] arXiv:2411.14704 [pdf, html, other]
-
Title: Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text RetrievalJournal-ref: IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1-18, 2024, Art no. 4709118Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Remote sensing cross-modal text-image retrieval (RSCTIR) has gained attention for its utility in information mining. However, challenges remain in effectively integrating global and local information due to variations in remote sensing imagery and ensuring proper feature pre-alignment before modal fusion, which affects retrieval accuracy and efficiency. To address these issues, we propose CMPAGL, a cross-modal pre-aligned method leveraging global and local information. Our Gswin transformer block combines local window self-attention and global-local window cross-attention to capture multi-scale features. A pre-alignment mechanism simplifies modal fusion training, improving retrieval performance. Additionally, we introduce a similarity matrix reweighting (SMR) algorithm for reranking, and enhance the triplet loss function with an intra-class distance term to optimize feature learning. Experiments on four datasets, including RSICD and RSITMD, validate CMPAGL's effectiveness, achieving up to 4.65% improvement in R@1 and 2.28% in mean Recall (mR) over state-of-the-art methods.
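The loss modification described here, a standard triplet term plus an intra-class distance penalty, could be sketched as follows; the weighting and exact form are assumptions for illustration, not the paper's definition:

```python
# Triplet loss augmented with an intra-class distance term that additionally
# pulls anchor-positive pairs together; `margin` and `lam` are assumed values.
import torch
import torch.nn.functional as F

def triplet_intra_loss(anchor, positive, negative, margin=0.2, lam=0.1):
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    triplet = F.relu(d_ap - d_an + margin).mean()   # standard triplet term
    intra = d_ap.mean()                             # intra-class compactness term
    return triplet + lam * intra

a, p, n = torch.randn(32, 128), torch.randn(32, 128), torch.randn(32, 128)
print(triplet_intra_loss(a, p, n))
```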
- [137] arXiv:2411.14707 [pdf, html, other]
-
Title: High-Bandwidth, Low-Computational Approach: Estimator-Based Control for Hybrid Flying Capacitor Multilevel Converters Using Multi-Cost Gradient Descent and State FeedforwardComments: 18 pages, 18 figures, 3 tablesSubjects: Systems and Control (eess.SY)
This paper presents an estimator-based control framework for hybrid flying capacitor multilevel (FCML) converters, achieving high-bandwidth control and reduced computational complexity. Utilizing a hybrid estimation method that combines closed-loop and open-loop dynamics, the proposed approach enables accurate and fast flying capacitor voltage estimation without relying on isolated voltage sensors or high-cost computing hardware. The methodology employs multi-cost gradient descent and state feedforward algorithms, enhancing estimation performance while maintaining low computational overhead. A detailed analysis of stability, gain setting, and rank-deficiency issues is provided, ensuring robust operation across diverse converter levels and duty cycle conditions. Simulation results validate the effectiveness of the proposed estimator in achieving active voltage balancing and current control with 6-level AC-DC buck FCML, contributing to cost-effective solutions for FCML applications, such as data centers and electric aircraft.
- [138] arXiv:2411.14708 [pdf, html, other]
-
Title: Understanding LLM Embeddings for RegressionComments: 15 pages, 13 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
With the rise of large language models (LLMs) for flexibly processing information as strings, a natural application is regression, specifically by preprocessing string representations into LLM embeddings as downstream features for metric prediction. In this paper, we provide one of the first comprehensive investigations into embedding-based regression and demonstrate that LLM embeddings as features can be better for high-dimensional regression tasks than using traditional feature engineering. This regression performance can be explained in part due to LLM embeddings over numeric data inherently preserving Lipschitz continuity over the feature space. Furthermore, we quantify the contribution of different model effects, most notably model size and language understanding, which we find surprisingly do not always improve regression performance.
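The recipe itself is simple: embed each string, then fit an ordinary regressor on the embedding features. A self-contained sketch, with a hash-based stand-in for real LLM embeddings (the `embed` function and the configuration strings are placeholders):

```python
# Embedding-based regression sketch: strings -> fixed-length vectors -> Ridge.
import numpy as np
from sklearn.linear_model import Ridge

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Deterministic stand-in for an LLM embedding, keyed on the text hash.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=dim)

configs = [f"layers={l},lr={lr}" for l in (2, 4, 8) for lr in (0.1, 0.01, 0.001)]
y = np.arange(len(configs), dtype=float)      # toy metric to predict

X = np.stack([embed(c) for c in configs])
model = Ridge(alpha=1.0).fit(X, y)
print(model.score(X, y))                      # in-sample fit on the toy task
```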
- [139] arXiv:2411.14711 [pdf, html, other]
-
Title: Can GNNs Learn Link Heuristics? A Concise Review and Evaluation of Link Prediction MethodsSubjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
This paper explores the ability of Graph Neural Networks (GNNs) to learn various forms of information for link prediction, alongside a brief review of existing link prediction methods. Our analysis reveals that GNNs cannot effectively learn structural information related to the number of common neighbors between two nodes, primarily due to the nature of set-based pooling in the neighborhood aggregation scheme. Also, our extensive experiments indicate that trainable node embeddings can improve the performance of GNN-based link prediction models. Importantly, we observe that the denser the graph, the greater the improvement. We attribute this to the characteristics of node embeddings, where the link state of each link sample could be encoded into the embeddings of nodes that are involved in the neighborhood aggregation of the two nodes in that link sample. In denser graphs, every node has more opportunities to participate in the neighborhood aggregation of other nodes and to encode the states of more link samples in its embedding, thus learning better node embeddings for link prediction. Lastly, we demonstrate that the insights gained from our research carry important implications for identifying the limitations of existing link prediction methods, which could guide the future development of more robust algorithms.
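The structural signal in question, the number of common neighbors, is trivial to compute directly, which is what makes its inaccessibility to set-based pooling notable. For example:

```python
# Score candidate links by their number of common neighbors -- the classic
# heuristic the paper argues GNN neighborhood pooling fails to recover.
import networkx as nx

G = nx.karate_club_graph()
candidates = [(0, 9), (5, 16), (24, 27)]
for u, v in candidates:
    cn = len(list(nx.common_neighbors(G, u, v)))
    print(f"({u}, {v}): common neighbors = {cn}")
```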
- [140] arXiv:2411.14713 [pdf, html, other]
-
Title: LIBER: Lifelong User Behavior Modeling Based on Large Language ModelsChenxu Zhu, Shigang Quan, Bo Chen, Jianghao Lin, Xiaoling Cai, Hong Zhu, Xiangyang Li, Yunjia Xi, Weinan Zhang, Ruiming TangSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
CTR prediction plays a vital role in recommender systems. Recently, large language models (LLMs) have been applied in recommender systems due to their emergent abilities. While leveraging semantic information from LLMs has shown some improvements in the performance of recommender systems, two notable limitations persist in these studies. First, LLM-enhanced recommender systems encounter challenges in extracting valuable information from lifelong user behavior sequences within textual contexts for recommendation tasks. Second, the inherent variability in human behaviors leads to a constant stream of new behaviors and irregularly fluctuating user interests. This characteristic imposes two significant challenges on existing models. On the one hand, it presents difficulties for LLMs in effectively capturing the dynamic shifts in user interests within these sequences; on the other hand, there is substantial computational overhead if the LLMs necessitate recurrent calls upon each update to the user sequences. In this work, we propose Lifelong User Behavior Modeling (LIBER) based on large language models, which includes three modules: (1) User Behavior Streaming Partition (UBSP), (2) User Interest Learning (UIL), and (3) User Interest Fusion (UIF). Initially, UBSP is employed to condense lengthy user behavior sequences into shorter partitions in an incremental paradigm, facilitating more efficient processing. Subsequently, UIL leverages LLMs in a cascading way to infer insights from these partitions. Finally, UIF integrates the textual outputs generated by the aforementioned processes to construct a comprehensive representation, which can be incorporated by any recommendation model to enhance performance. LIBER has been deployed on Huawei's music recommendation service and improved users' play count and play time by 3.01% and 7.69%, respectively.
- [141] arXiv:2411.14715 [pdf, html, other]
-
Title: Any-to-3D Generation via Hybrid Diffusion SupervisionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recent progress in 3D object generation has been fueled by the strong priors offered by diffusion models. However, existing models are tailored to specific tasks, accommodating only one modality at a time and necessitating retraining to change modalities. Given an image-to-3D model and a text prompt, a naive approach is to convert text prompts to images and then use the image-to-3D model for generation. This approach is both time-consuming and labor-intensive, resulting in unavoidable information loss during modality conversion. To address this, we introduce XBind, a unified framework for any-to-3D generation using cross-modal pre-alignment techniques. XBind integrates a multimodal-aligned encoder with pre-trained diffusion models to generate 3D objects from any modality, including text, images, and audio. We subsequently present a novel loss function, termed Modality Similarity (MS) Loss, which aligns the embeddings of the modality prompts and the rendered images, facilitating improved alignment of the 3D objects with multiple modalities. Additionally, Hybrid Diffusion Supervision combined with a Three-Phase Optimization process improves the quality of the generated 3D objects. Extensive experiments showcase XBind's broad generation capabilities in any-to-3D scenarios. To our knowledge, this is the first method to generate 3D objects from any modality prompts. Project page: this https URL.
- [142] arXiv:2411.14716 [pdf, html, other]
-
Title: VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous DrivingHaiming Zhang, Wending Zhou, Yiyao Zhu, Xu Yan, Jiantao Gao, Dongfeng Bai, Yingjie Cai, Bingbing Liu, Shuguang Cui, Zhen LiSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
This paper introduces VisionPAD, a novel self-supervised pre-training paradigm designed for vision-centric algorithms in autonomous driving. In contrast to previous approaches that employ neural rendering with explicit depth supervision, VisionPAD utilizes more efficient 3D Gaussian Splatting to reconstruct multi-view representations using only images as supervision. Specifically, we introduce a self-supervised method for voxel velocity estimation. By warping voxels to adjacent frames and supervising the rendered outputs, the model effectively learns motion cues in the sequential data. Furthermore, we adopt a multi-frame photometric consistency approach to enhance geometric perception. It projects adjacent frames to the current frame based on rendered depths and relative poses, boosting the 3D geometric representation through pure image supervision. Extensive experiments on autonomous driving datasets demonstrate that VisionPAD significantly improves performance in 3D object detection, occupancy prediction and map segmentation, surpassing state-of-the-art pre-training strategies by a considerable margin.
- [143] arXiv:2411.14717 [pdf, html, other]
-
Title: FedMLLM: Federated Fine-tuning MLLM on Multimodal Heterogeneity DataSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Multimodal Large Language Models (MLLMs) have made significant advancements, demonstrating powerful capabilities in processing and understanding multimodal data. Fine-tuning MLLMs with Federated Learning (FL) allows for expanding the training data scope by including private data sources, thereby enhancing their practical applicability in privacy-sensitive domains. However, current research remains in the early stage, particularly in addressing the \textbf{multimodal heterogeneities} in real-world applications. In this paper, we introduce a benchmark for evaluating various downstream tasks in the federated fine-tuning of MLLMs within multimodal heterogeneous scenarios, laying the groundwork for the research in the field. Our benchmark encompasses two datasets, five comparison baselines, and four multimodal scenarios, incorporating over ten types of modal heterogeneities. To address the challenges posed by modal heterogeneity, we develop a general FedMLLM framework that integrates four representative FL methods alongside two modality-agnostic strategies. Extensive experimental results show that our proposed FL paradigm improves the performance of MLLMs by broadening the range of training data and mitigating multimodal heterogeneity. Code is available at this https URL
- [144] arXiv:2411.14718 [pdf, html, other]
-
Title: GraphTheft: Quantifying Privacy Risks in Graph Prompt LearningSubjects: Cryptography and Security (cs.CR)
Graph Prompt Learning (GPL) represents an innovative approach in graph representation learning, enabling task-specific adaptations by fine-tuning prompts without altering the underlying pre-trained model. Despite its growing prominence, the privacy risks inherent in GPL remain unexplored. In this study, we provide the first evaluation of privacy leakage in GPL across three attacker capabilities: black-box attacks when GPL is offered as a service, and scenarios where node embeddings and prompt representations are accessible to third parties. We assess GPL's privacy vulnerabilities through Attribute Inference Attacks (AIAs) and Link Inference Attacks (LIAs), finding that under any capability, attackers can effectively infer the properties and relationships of sensitive nodes, with inference success rates on some datasets as high as 98%. Importantly, while targeted inference attacks on specific prompts (e.g., GPF-plus) maintain high success rates, our analysis suggests that prompt-tuning in GPL does not significantly elevate privacy risks compared to traditional GNNs. To mitigate these risks, we explored defense mechanisms, identifying that Laplacian noise perturbation can substantially reduce inference success, though balancing privacy protection with model performance remains challenging. This work highlights critical privacy risks in GPL, offering new insights and foundational directions for future privacy-preserving strategies in graph learning.
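The Laplace-noise defense mentioned above amounts to perturbing embeddings before release; a sketch using the usual sensitivity/epsilon scaling of the Laplace mechanism (the values are illustrative assumptions):

```python
# Perturb node embeddings with Laplace noise before exposing them to third
# parties; scale = sensitivity / epsilon follows the standard Laplace mechanism.
import numpy as np

def perturb_embeddings(Z: np.ndarray, sensitivity: float = 1.0,
                       epsilon: float = 0.5) -> np.ndarray:
    scale = sensitivity / epsilon
    return Z + np.random.laplace(loc=0.0, scale=scale, size=Z.shape)

Z = np.random.randn(100, 32)          # node embeddings
Z_private = perturb_embeddings(Z)
print(np.abs(Z_private - Z).mean())   # average perturbation magnitude ~ scale
```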
- [145] arXiv:2411.14720 [pdf, other]
-
Title: Optimizing Social Media Annotation of HPV Vaccine Skepticism and Misinformation Using Large Language Models: An Experimental Evaluation of In-Context Learning and Fine-Tuning Stance Detection Across Multiple ModelsLuhang Sun, Varsha Pendyala, Yun-Shiuan Chuang, Shanglin Yang, Jonathan Feldman, Andrew Zhao, Munmun De Choudhury, Sijia Yang, Dhavan ShahSubjects: Computation and Language (cs.CL)
This paper leverages large language models (LLMs) to experimentally determine optimal strategies for scaling up social media content annotation for stance detection on HPV vaccine-related tweets. We examine both conventional fine-tuning and emergent in-context learning methods, systematically varying prompt engineering strategies across widely used LLMs and their variants (e.g., GPT-4, Mistral, and Llama3). Specifically, we varied prompt template design, shot sampling methods, and shot quantity to detect stance on HPV vaccination. Our findings reveal that 1) in general, in-context learning outperforms fine-tuning in stance detection for HPV vaccine social media content; 2) increasing shot quantity does not necessarily enhance performance across models; and 3) different LLMs and their variants present differing sensitivity to in-context learning conditions. We found that the optimal in-context learning configuration for stance detection on HPV vaccine tweets involves six stratified shots paired with detailed contextual prompts. This study highlights the potential of LLMs and provides an applicable approach for research on social media stance and skepticism detection.
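A stratified six-shot prompt of the kind found optimal here could be assembled as below; the example tweets and instruction wording are invented placeholders, not the study's materials:

```python
# Build a stance-detection prompt with shots drawn evenly across classes
# (2 per class x 3 classes = 6 stratified shots).
import random

labeled = {
    "in-favor": ["Got my HPV shot today!", "Vaccines save lives.",
                 "Proud parent, kids vaccinated."],
    "opposed":  ["Not risking the HPV vaccine.", "Too many side effects.",
                 "No mandates for my family."],
    "neutral":  ["CDC updated its HPV page.", "Clinic offers HPV shots Tuesday.",
                 "HPV vaccine in the news again."],
}

def build_prompt(target_tweet: str, shots_per_class: int = 2) -> str:
    lines = ["You are classifying tweet stance toward HPV vaccination "
             "(in-favor, opposed, or neutral). Examples:"]
    for label, tweets in labeled.items():          # stratified shot sampling
        for t in random.sample(tweets, shots_per_class):
            lines.append(f'Tweet: "{t}"\nStance: {label}')
    lines.append(f'Tweet: "{target_tweet}"\nStance:')
    return "\n\n".join(lines)

print(build_prompt("Thinking about the HPV vaccine for my daughter."))
```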
- [146] arXiv:2411.14721 [pdf, html, other]
-
Title: MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and TextsJiatong Li, Yunqing Liu, Wei Liu, Jingdi Le, Di Zhang, Wenqi Fan, Dongzhan Zhou, Yuqiang Li, Qing LiComments: 22 pages, 12 figuresSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Molecule discovery is a pivotal research field, impacting everything from the medicines we take to the materials we use. Recently, Large Language Models (LLMs) have been widely adopted in molecule understanding and generation, yet the alignments between molecules and their corresponding captions remain a significant challenge. Previous endeavours often treat the molecule as a general SMILES string or molecular graph, neglecting the fine-grained alignments between the molecular sub-structures and the descriptive textual phrases, which are crucial for accurate and explainable predictions. In this case, we introduce MolReFlect, a novel teacher-student framework designed to contextually perform the molecule-caption alignments in a fine-grained way. Our approach initially leverages a larger teacher LLM to label the detailed alignments by directly extracting critical phrases from molecule captions or SMILES strings and mapping them to the corresponding sub-structures or characteristics. To refine these alignments, we propose In-Context Selective Reflection, which retrieves previous extraction results as context examples for the teacher LLM to reflect on and lets a smaller student LLM select from in-context reflection and previous extraction results. Finally, we enhance the learning process of the student LLM through Chain-of-Thought In-Context Molecule Tuning, integrating the fine-grained alignments and the reasoning processes within the Chain-of-Thought format. Our experimental results demonstrate that MolReFlect enables LLMs like Mistral-7B to significantly outperform the previous baselines, achieving SOTA performance on the ChEBI-20 dataset. This advancement not only enhances the generative capabilities of LLMs in the molecule-caption translation task, but also contributes to a more explainable framework.
- [147] arXiv:2411.14723 [pdf, html, other]
-
Title: Effective SAM Combination for Open-Vocabulary Semantic SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unlimited range of classes. Traditional methods address this by sequentially connecting a powerful mask proposal generator, such as the Segment Anything Model (SAM), with a pre-trained vision-language model like CLIP. However, these two-stage approaches often suffer from high computational costs and memory inefficiencies. In this paper, we propose ESC-Net, a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation within an efficient inference framework. By embedding pseudo prompts generated from image-text correlations into SAM's promptable segmentation framework, ESC-Net achieves refined spatial aggregation for accurate mask predictions. ESC-Net achieves superior performance on standard benchmarks, including ADE20K, PASCAL-VOC, and PASCAL-Context, outperforming prior methods in both efficiency and accuracy. Comprehensive ablation studies further demonstrate its robustness across challenging conditions.
- [148] arXiv:2411.14725 [pdf, html, other]
-
Title: Evaluating and Advancing Multimodal Large Language Models in Ability LensFeng Chen, Chenhui Gou, Jing Liu, Yang Yang, Zhaoyang Li, Jiyuan Zhang, Zhenbang Sun, Bohan Zhuang, Qi WuSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
As multimodal large language models (MLLMs) advance rapidly, rigorous evaluation has become essential, providing further guidance for their development. In this work, we focus on a unified and robust evaluation of \textbf{vision perception} abilities, the foundational skill of MLLMs. We find that existing perception benchmarks, each focusing on different question types, domains, and evaluation metrics, introduce significant evaluation variance, complicating comprehensive assessments of perception abilities when relying on any single benchmark. To address this, we introduce \textbf{AbilityLens}, a unified benchmark designed to evaluate MLLMs across six key perception abilities, focusing on both accuracy and stability, with each ability encompassing diverse question types, domains, and metrics. With the assistance of AbilityLens, we: (1) identify the strengths and weaknesses of current models, highlighting stability patterns and revealing a notable performance gap between open-source and closed-source models; (2) introduce an online evaluation mode, which uncovers interesting ability conflict and early convergence phenomena during MLLM training; and (3) design a simple ability-specific model merging method that combines the best ability checkpoint from early training stages, effectively mitigating performance decline due to ability conflict. The benchmark and online leaderboard will be released soon.
- [149] arXiv:2411.14726 [pdf, html, other]
-
Title: Enhancing Molecular Design through Graph-based Topological Reinforcement LearningSubjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
The generation of drug-like molecules is crucial for drug design. Existing reinforcement learning (RL) methods often overlook structural information, while feature engineering-based methods usually focus narrowly on binding affinity prediction without substantial molecular modification. To address this, we present Graph-based Topological Reinforcement Learning (GraphTRL), which integrates both chemical and structural data for improved molecular generation. GraphTRL leverages multiscale weighted colored graphs (MWCG) and persistent homology, combined with molecular fingerprints, as the state space for RL. Evaluations show that GraphTRL outperforms existing methods in binding affinity prediction, offering a promising approach to accelerate drug discovery.
- [150] arXiv:2411.14727 [pdf, html, other]
-
Title: Attributed Graph Clustering via Generalized Quaternion Representation LearningSubjects: Machine Learning (cs.LG)
Clustering complex data in the form of attributed graphs has attracted increasing attention, where appropriate graph representation is a critical prerequisite for accurate cluster analysis. However, the Graph Convolutional Network will homogenize the representation of graph nodes due to the well-known over-smoothing effect. This limits the network architecture to a shallow one, losing the ability to capture the critical global distribution information for clustering. Therefore, we propose a generalized graph auto-encoder network, which introduces quaternion operations to the encoders to achieve efficient structured feature representation learning without incurring deeper networks and larger-scale parameters. The generalization of our method lies in the following two aspects: 1) connecting the quaternion operation, naturally suited to four feature components, with graph data of arbitrary attribute dimensions, and 2) introducing a generalized graph clustering objective as a loss term to obtain clustering-friendly representations without requiring a pre-specified number of clusters $k$. It turns out that the representations of nodes learned by the proposed Graph Clustering based on Generalized Quaternion representation learning (GCGQ) are more discriminative, containing global distribution information, and are more general, suiting downstream clustering under different $k$s. Extensive experiments including significance tests, ablation studies, and qualitative results illustrate the superiority of GCGQ. The source code is temporarily available at \url{this https URL}.
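The quaternion operation underpinning such encoders is the Hamilton product of 4-component features; written out explicitly, with a quick check of the norm-multiplicative property that keeps the operation well-behaved:

```python
# Hamilton product of two quaternions (4-component features), plus a check
# that |p*q| = |p| * |q|.
import numpy as np

def hamilton(p: np.ndarray, q: np.ndarray) -> np.ndarray:
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return np.array([
        a1*a2 - b1*b2 - c1*c2 - d1*d2,
        a1*b2 + b1*a2 + c1*d2 - d1*c2,
        a1*c2 - b1*d2 + c1*a2 + d1*b2,
        a1*d2 + b1*c2 - c1*b2 + d1*a2,
    ])

p, q = np.random.randn(4), np.random.randn(4)
pq = hamilton(p, q)
print(np.isclose(np.linalg.norm(pq), np.linalg.norm(p) * np.linalg.norm(q)))  # True
```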
- [151] arXiv:2411.14728 [pdf, html, other]
-
Title: K-GBS3FCM -- KNN Graph-Based Safe Semi-Supervised Fuzzy C-MeansComments: 10 pagesSubjects: Machine Learning (cs.LG)
Clustering data using prior domain knowledge, starting from a partially labeled set, has recently been widely investigated. Often referred to as semi-supervised clustering, this approach leverages labeled data to enhance clustering accuracy. To maximize algorithm performance, it is crucial to ensure the safety of this prior knowledge. Methods addressing this concern are termed safe semi-supervised clustering (S3C) algorithms. This paper introduces the KNN graph-based safety-aware semi-supervised fuzzy c-means algorithm (K-GBS3FCM), which dynamically assesses neighborhood relationships between labeled and unlabeled data using the K-Nearest Neighbors (KNN) algorithm. This approach aims to optimize the use of labeled data while minimizing the adverse effects of incorrect labels. Additionally, we propose a mechanism that adjusts the influence of labeled data on unlabeled data through regularization parameters and the average safety degree. Experimental results on multiple benchmark datasets demonstrate that the graph-based approach effectively leverages prior knowledge to enhance clustering accuracy. The proposed method was significantly superior in 64% of the 56 test configurations, obtaining higher levels of clustering accuracy when compared to other semi-supervised and traditional unsupervised methods. This research highlights the potential of integrating graph-based approaches, such as KNN, with established techniques to develop advanced clustering algorithms, offering significant applications in fields that rely on both labeled and unlabeled data for more effective clustering.
- [152] arXiv:2411.14729 [pdf, html, other]
-
Title: A Lightweight Edge-CNN-Transformer Model for Detecting Coordinated Cyber and Digital Twin Attacks in Cooperative Smart FarmingSubjects: Cryptography and Security (cs.CR)
The agriculture sector is increasingly adopting innovative technologies to meet the growing food demands of the global population. To optimize resource utilization and minimize crop losses, farmers are joining cooperatives to share their data and resources among member farms. However, while farmers benefit from this data sharing and interconnection, it exposes them to cybersecurity threats and privacy concerns. A cyberattack on one farm can have widespread consequences, affecting the targeted farm as well as all member farms within a cooperative. In this research, we address existing gaps by proposing a novel and secure architecture for Cooperative Smart Farming (CSF). First, we highlight the role of edge-based digital twins (DTs) in enhancing the efficiency and resilience of agricultural operations. To validate this, we develop a test environment for CSF, implementing various cyberattacks on both the DTs and their physical counterparts using different attack vectors. We collect two smart farming network datasets to identify potential threats. After identifying these threats, we focus on preventing the transmission of malicious data from compromised farms to the central cloud server. To achieve this, we propose a CNN-Transformer-based network anomaly detection model, specifically designed for deployment at the edge. As a proof of concept, we implement this model and evaluate its performance by varying the number of encoder layers. Additionally, we apply Post-Quantization to compress the model and demonstrate the impact of compression on its performance in edge environments. Finally, we compare the model's performance with traditional machine learning approaches to assess its overall effectiveness.
- [153] arXiv:2411.14730 [pdf, other]
-
Title: Funhouse Mirror or Echo Chamber? A Methodological Approach to Teaching Critical AI Literacy Through MetaphorsJasper Roe (1), Leon Furze (2), Mike Perkins (3) ((1) James Cook University Singapore, Singapore, (2) Deakin University, Australia, (3) British University Vietnam, Vietnam)Subjects: Computers and Society (cs.CY)
As educational institutions grapple with teaching students about increasingly complex Artificial Intelligence (AI) systems, finding effective methods for explaining these technologies and their societal implications remains a major challenge. This study proposes a methodological approach combining Conceptual Metaphor Theory (CMT) with UNESCO's AI competency framework to develop Critical AI Literacy (CAIL). Through a systematic analysis of metaphors commonly used to describe AI systems, we develop criteria for selecting pedagogically appropriate metaphors and demonstrate their alignment with established AI literacy competencies, as well as UNESCO's AI competency framework.
Our method identifies and suggests four key metaphors for teaching CAIL. This includes GenAI as an echo chamber, GenAI as a funhouse mirror, GenAI as a black box magician, and GenAI as a map. Each of these seeks to address specific aspects of understanding characteristics of AI, from filter bubbles to algorithmic opacity. We present these metaphors alongside interactive activities designed to engage students in experiential learning of AI concepts. In doing so, we offer educators a structured approach to teaching CAIL that bridges technical understanding with societal implications. This work contributes to the growing field of AI education by demonstrating how carefully selected metaphors can make complex technological concepts more accessible while promoting critical engagement with AI systems.
- [154] arXiv:2411.14733 [pdf, html, other]
-
Title: FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer AccelerationSubjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
Encoder-based transformers, powered by self-attention layers, have revolutionized machine learning with their context-aware representations. However, their quadratic growth in computational and memory demands presents significant bottlenecks. Analog-Mixed-Signal Process-in-Memory (AMS-PiM) architectures address these challenges by enabling efficient on-chip processing. Traditionally, AMS-PiM relies on Quantization-Aware Training (QAT), which is hardware-efficient but requires extensive retraining to adapt models to AMS-PiMs, making it increasingly impractical for transformer models. Post-Training Quantization (PTQ) mitigates this training overhead but introduces significant hardware inefficiencies. PTQ relies on dequantization-quantization (DQ-Q) processes, floating-point units (FPUs), and high-ENOB (Effective Number of Bits) analog-to-digital converters (ADCs). In particular, high-ENOB ADCs scale exponentially in area and energy (as $2^{\mathrm{ENOB}}$), reduce sensing margins, and increase susceptibility to process, voltage, and temperature (PVT) variations, further compounding PTQ's challenges in AMS-PiM systems. To overcome these limitations, we propose RAP, an AMS-PiM architecture that eliminates DQ-Q processes, introduces FPU- and division-free nonlinear processing, and employs a low-ENOB-ADC-based sparse matrix-vector multiplication technique. Using the proposed techniques, RAP improves error resiliency, area/energy efficiency, and computational speed while preserving numerical stability. Experimental results demonstrate that RAP outperforms state-of-the-art GPUs and conventional PiM architectures in energy efficiency, latency, and accuracy, making it a scalable solution for the efficient deployment of transformers.
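To see why ENOB is the cost driver, here is a small illustrative simulation of an ideal ENOB-bit ADC; the full-scale range and the level-count cost model are assumptions used only to visualize the $2^{\mathrm{ENOB}}$ scaling the abstract cites:

```python
# Illustrative numbers only: quantize an analog value with an ENOB-bit ADC
# and show how the level count (and hence area/energy) grows as 2**ENOB.
import numpy as np

def adc_quantize(v, enob, v_min=-1.0, v_max=1.0):
    """Ideal ENOB-bit ADC: 2**enob uniform levels over the full-scale range."""
    levels = 2 ** enob
    step = (v_max - v_min) / levels
    code = np.clip(np.floor((v - v_min) / step), 0, levels - 1)
    return v_min + (code + 0.5) * step

v = np.random.uniform(-1, 1, 10000)
for enob in (4, 8, 12):
    err = np.abs(adc_quantize(v, enob) - v).mean()
    print(f"ENOB={enob:2d}  levels={2**enob:5d}  mean |error|={err:.5f}")
# Each extra effective bit doubles the level count; that is why pairing
# low-ENOB ADCs with error-resilient digital logic is attractive for PiM.
```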
- [155] arXiv:2411.14735 [pdf, other]
-
Title: Automatic Inference of Relational Object InvariantsComments: This is an extended version of the VMCAI 2025 paper, consisting of 26 pages. The artifact is available at this https URLSubjects: Programming Languages (cs.PL)
Relational object invariants (or representation invariants) are relational properties held by the fields of a (memory) object throughout its lifetime. For example, the length of a buffer never exceeds its capacity. Automatic inference of these invariants is particularly challenging because they are often broken temporarily during field updates. In this paper, we present an Abstract Interpretation-based solution to infer object invariants. Our key insight is a new object abstraction for memory objects, where memory is divided into multiple memory banks, each containing several objects. Within each bank, the objects are further abstracted by separating the most recently used (MRU) object, represented precisely with strong updates, while the rest are summarized. For an effective implementation of this approach, we introduce a new composite abstract domain, which forms a reduced product of numerical and equality sub-domains. This design efficiently expresses relationships between a small number of variables (e.g., fields of the same abstract object). We implement the new domain in the CRAB abstract interpreter and evaluate it on several benchmarks for memory safety. We show that our approach is significantly more scalable for relational properties than the existing implementation of CRAB. For evaluating precision, we have integrated our analysis as a pre-processing step for the SEABMC bounded model checker, and show that it is effective both at discharging assertions during pre-processing and at significantly improving the run-time of SEABMC.
- [156] arXiv:2411.14737 [pdf, html, other]
-
Title: AI Tailoring: Evaluating Influence of Image Features on Fashion Product PopularitySubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Identifying key product features that influence consumer preferences is essential in the fashion industry. In this study, we introduce a robust methodology to ascertain the most impactful features in fashion product images, utilizing past market sales data. First, we propose a metric called the "influence score" to quantitatively assess the importance of product features. Then we develop a forecasting model, the Fashion Demand Predictor (FDP), which integrates Transformer-based models and Random Forest to predict market popularity based on product images. We employ image-editing diffusion models to modify these images and perform an ablation study, which validates the impact of the highest- and lowest-scoring features on the model's popularity predictions. Additionally, we further validate these results through surveys that gather human rankings of preferences, confirming the accuracy of the FDP model's predictions and the efficacy of our method in identifying influential features. Notably, products enhanced with "good" features show marked improvements in predicted popularity over their modified counterparts. Our approach develops a fully automated and systematic framework for fashion image analysis that provides valuable guidance for downstream tasks such as fashion product design and marketing strategy development.
- [157] arXiv:2411.14738 [pdf, html, other]
-
Title: Universal and Context-Independent Triggers for Precise Control of LLM OutputsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Large language models (LLMs) have been widely adopted in applications such as automated content generation and even critical decision-making systems. However, the risk of prompt injection allows for potential manipulation of LLM outputs. While numerous attack methods have been documented, achieving full control over these outputs remains challenging, often requiring experienced attackers to make multiple attempts and depending heavily on the prompt context. Recent advancements in gradient-based white-box attack techniques have shown promise in tasks like jailbreaks and system prompt leaks. Our research generalizes gradient-based attacks to find a trigger that is (1) Universal: effective irrespective of the target output; (2) Context-Independent: robust across diverse prompt contexts; and (3) Precise Output: capable of manipulating LLM inputs to yield any specified output with high accuracy. We propose a novel method to efficiently discover such triggers and assess the effectiveness of the proposed attack. Furthermore, we discuss the substantial threats posed by such attacks to LLM-based applications, highlighting the potential for adversaries to take over the decisions and actions made by AI agents.
- [158] arXiv:2411.14739 [pdf, html, other]
-
Title: IRLab@iKAT24: Learned Sparse Retrieval with Multi-aspect LLM Query Generation for Conversational SearchSubjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
The Interactive Knowledge Assistant Track (iKAT) 2024 focuses on advancing conversational assistants that can adapt their interactions and responses using personalized user knowledge. The track incorporates a Personal Textual Knowledge Base (PTKB) alongside conversational AI tasks, such as passage ranking and response generation. Since query rewriting is an effective approach for resolving conversational context, we explore large language models (LLMs) as query rewriters. Specifically, our submitted runs explore multi-aspect query generation using the MQ4CS framework, which we further enhance with Learned Sparse Retrieval via the SPLADE architecture, coupled with robust cross-encoder models. We also propose an alternative to the previous interleaving strategy, aggregating multiple aspects during the reranking phase. Our findings indicate that multi-aspect query generation is effective in enhancing performance when integrated with advanced retrieval and reranking models. Our results also pave the way for better personalization in conversational search, relying on LLMs to integrate personalization within query rewriting and outperforming human rewrites.
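A toy sketch of the aggregation alternative to interleaving might look as follows; the scoring function, fusion rules, and names are assumptions, with a real SPLADE retriever or cross-encoder standing in for the toy scorer:

```python
# Hedged sketch: fuse reranker scores across multiple LLM-generated query
# aspects instead of interleaving their ranked lists (illustrative only).
from collections import defaultdict

def aggregate_rerank(aspect_queries, candidates, score_fn, agg="max"):
    """score_fn(query, passage) -> relevance; fuse scores across aspects."""
    fused = defaultdict(list)
    for q in aspect_queries:
        for passage in candidates:
            fused[passage].append(score_fn(q, passage))
    if agg == "max":
        ranked = sorted(fused, key=lambda p: max(fused[p]), reverse=True)
    else:  # mean fusion
        ranked = sorted(fused, key=lambda p: sum(fused[p]) / len(fused[p]),
                        reverse=True)
    return ranked

# Usage with a toy lexical scorer (a cross-encoder would go here instead):
score_fn = lambda q, p: len(set(q.split()) & set(p.split()))
aspects = ["battery life laptop", "laptop charging speed"]
docs = ["battery life of this laptop", "laptop screen quality"]
print(aggregate_rerank(aspects, docs, score_fn))
```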
- [159] arXiv:2411.14740 [pdf, html, other]
-
Title: TEXGen: a Generative Diffusion Model for Mesh TexturesXin Yu, Ze Yuan, Yuan-Chen Guo, Ying-Tian Liu, JianHui Liu, Yangguang Li, Yan-Pei Cao, Ding Liang, Xiaojuan QiComments: Accepted to SIGGRAPH Asia Journal Article (TOG 2024)Journal-ref: ACM Transactions on Graphics (TOG) 2024, Volume 43, Issue 6, Article No.: 213, Pages 1-14Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
While high-quality texture maps are essential for realistic 3D asset rendering, few studies have explored learning directly in the texture space, especially on large-scale datasets. In this work, we depart from the conventional approach of relying on pre-trained 2D diffusion models for test-time optimization of 3D textures. Instead, we focus on the fundamental problem of learning in the UV texture space itself. For the first time, we train a large diffusion model capable of directly generating high-resolution texture maps in a feed-forward manner. To facilitate efficient learning in high-resolution UV spaces, we propose a scalable network architecture that interleaves convolutions on UV maps with attention layers on point clouds. Leveraging this architectural design, we train a 700 million parameter diffusion model that can generate UV texture maps guided by text prompts and single-view images. Once trained, our model naturally supports various extended applications, including text-guided texture inpainting, sparse-view texture completion, and text-driven texture synthesis. Project page is at this http URL.
- [160] arXiv:2411.14743 [pdf, html, other]
-
Title: FOCUS: Knowledge-enhanced Adaptive Visual Compression for Few-shot Whole Slide Image ClassificationComments: 15 pages, 3 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Few-shot learning presents a critical solution for cancer diagnosis in computational pathology (CPath), addressing fundamental limitations in data availability, particularly the scarcity of expert annotations and patient privacy constraints. A key challenge in this paradigm stems from the inherent disparity between the limited training set of whole slide images (WSIs) and the enormous number of contained patches, where a significant portion of these patches lacks diagnostically relevant information, potentially diluting the model's ability to learn and focus on critical diagnostic features. While recent works attempt to address this by incorporating additional knowledge, several crucial gaps hinder further progress: (1) despite the emergence of powerful pathology foundation models (FMs), their potential remains largely untapped, with most approaches limiting their use to basic feature extraction; (2) current language guidance mechanisms attempt to align text prompts with vast numbers of WSI patches all at once, struggling to leverage rich pathological semantic information. To this end, we introduce the knowledge-enhanced adaptive visual compression framework, dubbed FOCUS, which uniquely combines pathology FMs with language prior knowledge to enable a focused analysis of diagnostically relevant regions by prioritizing discriminative WSI patches. Our approach implements a progressive three-stage compression strategy: we first leverage FMs for global visual redundancy elimination, then integrate compressed features with language prompts for semantic relevance assessment, and finally perform neighbor-aware visual token filtering while preserving spatial coherence. Extensive experiments on pathological datasets spanning breast, lung, and ovarian cancers demonstrate its superior performance in few-shot pathology diagnosis. Code will be made available at this https URL.
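The three-stage compression can be pictured with a hedged toy sketch; the thresholds, the consecutive-duplicate rule, and the neighbor test below are illustrative assumptions rather than FOCUS's actual operators:

```python
# Hedged toy version of progressive patch filtering in the spirit of FOCUS.
import torch
import torch.nn.functional as F

def focus_style_filter(patch_feats, text_feat, keep_ratio=0.25, sim_thresh=0.95):
    """patch_feats: (N, d) spatially ordered foundation-model patch features;
    text_feat: (d,) language-prior embedding from pathology prompts."""
    feats = F.normalize(patch_feats, dim=-1)
    # Stage 1: global redundancy elimination - drop consecutive near-duplicates.
    keep = [0]
    for i in range(1, feats.size(0)):
        if float(feats[i] @ feats[keep[-1]]) < sim_thresh:
            keep.append(i)
    feats = feats[keep]
    # Stage 2: rank surviving patches by similarity to the language prior.
    rel = feats @ F.normalize(text_feat, dim=-1)
    k = max(1, int(keep_ratio * feats.size(0)))
    top = rel.topk(k).indices.sort().values      # keep spatial order
    # Stage 3: neighbor-aware filtering - prefer tokens whose spatial
    # neighbor also survived, preserving coherent regions.
    top_set = set(top.tolist())
    final = [t for t in top.tolist()
             if (t - 1 in top_set) or (t + 1 in top_set)] or top.tolist()
    return feats[final]

kept = focus_style_filter(torch.randn(1000, 512), torch.randn(512))
print(kept.shape)   # a much smaller, diagnostically focused token set
```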
- [161] arXiv:2411.14744 [pdf, html, other]
-
Title: Point Cloud Understanding via Attention-Driven Contrastive LearningYi Wang, Jiaze Wang, Ziyu Guo, Renrui Zhang, Donghao Zhou, Guangyong Chen, Anfeng Liu, Pheng-Ann HengSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Transformer-based models have recently advanced point cloud understanding by leveraging self-attention mechanisms; however, these methods often overlook latent information in less prominent regions, leading to increased sensitivity to perturbations and limited global comprehension. To address this issue, we introduce PointACL, an attention-driven contrastive learning framework designed to overcome these limitations. Our method employs an attention-driven dynamic masking strategy that guides the model to focus on under-attended regions, enhancing the understanding of global structures within the point cloud. Then we combine the original pre-training loss with a contrastive learning loss, improving feature discrimination and generalization. Extensive experiments validate the effectiveness of PointACL, as it achieves state-of-the-art performance across a variety of 3D understanding tasks, including object classification, part segmentation, and few-shot learning. Specifically, when integrated with different Transformer backbones like Point-MAE and PointGPT, PointACL demonstrates improved performance on datasets such as ScanObjectNN, ModelNet40, and ShapeNetPart. This highlights its superior capability in capturing both global and local features, as well as its enhanced robustness against perturbations and incomplete data.
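One plausible reading of attention-driven masking is sketched below: hide the tokens the backbone already attends to most, so learning pressure shifts to under-attended regions. The masking ratio and the use of mean received attention are assumptions, not the paper's exact recipe:

```python
# Hedged sketch: derive a dynamic mask from self-attention maps.
import torch

def attention_driven_mask(attn, mask_ratio=0.4):
    """attn: (B, H, N, N) self-attention maps from the backbone's last block.
    Returns a boolean mask (B, N): True = token is hidden. Masking the
    most-attended tokens forces reliance on under-attended regions."""
    # Salience of each token = average attention it receives across heads.
    salience = attn.mean(dim=1).mean(dim=1)           # (B, N)
    n_mask = int(mask_ratio * salience.size(1))
    top = salience.topk(n_mask, dim=1).indices        # most-attended tokens
    mask = torch.zeros_like(salience, dtype=torch.bool)
    mask.scatter_(1, top, True)
    return mask

attn = torch.softmax(torch.randn(2, 8, 128, 128), dim=-1)
mask = attention_driven_mask(attn)
print(mask.sum(dim=1))   # 51 of 128 tokens hidden per point cloud
```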
- [162] arXiv:2411.14745 [pdf, html, other]
-
Title: Approximating the Held-Karp Bound for Metric TSP in Nearly Linear Work and Polylogarithmic DepthSubjects: Data Structures and Algorithms (cs.DS)
We present a nearly linear work parallel algorithm for approximating the Held-Karp bound for the Metric TSP problem. Given an edge-weighted undirected graph $G=(V,E)$ on $m$ edges and $\epsilon>0$, it returns a $(1+\epsilon)$-approximation to the Held-Karp bound with high probability, in $\tilde{O}(m/\epsilon^4)$ work and $\tilde{O}(1/\epsilon^4)$ depth. While a nearly linear time sequential algorithm was known for almost a decade (Chekuri and Quanrud'17), it was not known how to simultaneously achieve nearly linear work alongside polylogarithmic depth. Using a reduction by Chalermsook et al.'22, we also give a parallel algorithm for computing a $(1+\epsilon)$-approximate fractional solution to the $k$-edge-connected spanning subgraph (kECSS) problem, with the same complexity.
To obtain these results, we introduce a notion of core-sequences for the parallel Multiplicative Weights Update (MWU) framework (Luby-Nisan'93, Young'01). For the Metric TSP and kECSS problems, core-sequences enable us to exploit the structure of approximate minimum cuts to reduce the cost per iteration and/or the number of iterations. The acceleration technique via core-sequences is generic and of independent interest. In particular, it improves the best-known iteration complexity of MWU algorithms for packing/covering LPs from $\mathrm{poly}(\log \mathrm{nnz}(A))$ to polylogarithmic in the product of cardinalities of the core-sequence sets, where $A$ is the constraint matrix of the LP. For certain implicitly defined LPs such as the kECSS LP, this yields an exponential improvement in depth.
- [163] arXiv:2411.14750 [pdf, html, other]
-
Title: Ordinal Multiple-instance Learning for Ulcerative Colitis Severity Estimation with Selective Aggregated TransformerComments: 10 pages, 9 figures, Accepted in WACV 2025Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Patient-level diagnosis of severity in ulcerative colitis (UC) is common in real clinical settings, where the most severe score in a patient is recorded. However, previous UC classification methods (i.e., image-level estimation) mainly assumed that the input is a single image. Thus, these methods cannot utilize severity labels recorded in real clinical settings. In this paper, we propose a patient-level severity estimation method using a transformer with selective aggregator tokens, where a severity label is estimated from multiple images taken from a patient, similar to a clinical setting. Our method can effectively aggregate features of severe parts from a set of images captured for each patient, and it facilitates improving the discriminative ability between adjacent severity classes. Experiments demonstrate the effectiveness of the proposed method on two datasets compared with state-of-the-art MIL methods. Moreover, we evaluated our method in real clinical settings and confirmed that it outperformed previous image-level methods. The code is publicly available at this https URL.
- [164] arXiv:2411.14751 [pdf, html, other]
-
Title: TopoSD: Topology-Enhanced Lane Segment Perception with SDMap PriorSen Yang, Minyue Jiang, Ziwei Fan, Xiaolu Xie, Xiao Tan, Yingying Li, Errui Ding, Liang Wang, Jingdong WangComments: 17 pages, 7 figures, and 7 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Recent advances in autonomous driving systems have shifted towards reducing reliance on high-definition maps (HDMaps) due to the huge costs of annotation and maintenance. Instead, researchers are focusing on online vectorized HDMap construction using on-board sensors. However, sensor-only approaches still face challenges in long-range perception due to the restricted views imposed by the mounting angles of onboard cameras, just as human drivers also rely on bird's-eye-view navigation maps for a comprehensive understanding of road structures. To address these issues, we propose to train the perception model to "see" standard definition maps (SDMaps). We encode SDMap elements into neural spatial map representations and instance tokens, and then incorporate such complementary features as prior information to improve the bird's eye view (BEV) feature for lane geometry and topology decoding. Based on the lane segment representation framework, the model simultaneously predicts lanes, centrelines and their topology. To further enhance the ability of geometry prediction and topology reasoning, we also use a topology-guided decoder to refine the predictions by exploiting the mutual relationships between topological and geometric features. We perform extensive experiments on OpenLane-V2 datasets to validate the proposed method. The results show that our model outperforms state-of-the-art methods by a large margin, with gains of +6.7 and +9.1 on the mAP and topology metrics. Our analysis also reveals that models trained with SDMap noise augmentation exhibit enhanced robustness.
- [165] arXiv:2411.14754 [pdf, html, other]
-
Title: Subspace Collision: An Efficient and Accurate Framework for High-dimensional Approximate Nearest Neighbor SearchSubjects: Databases (cs.DB)
Approximate Nearest Neighbor (ANN) search in high-dimensional Euclidean spaces is a fundamental problem with a wide range of applications. However, there is currently no ANN method that performs well in both indexing and query answering performance, while providing rigorous theoretical guarantees for the quality of the answers. In this paper, we first design SC-score, a metric that we show follows the Pareto principle and can act as a proxy for the Euclidean distance between data points. Inspired by this, we propose a novel ANN search framework called Subspace Collision (SC), which can provide theoretical guarantees on the quality of its results. We further propose SuCo, which achieves efficient and accurate ANN search by designing a clustering-based lightweight index and query strategies for our proposed subspace collision framework. Extensive experiments on real-world datasets demonstrate that both the indexing and query answering performance of SuCo outperform state-of-the-art ANN methods that can provide theoretical guarantees, performing 1-2 orders of magnitude faster query answering with only up to one-tenth of the index memory footprint. Moreover, SuCo achieves top performance (best for hard datasets) even when compared to methods that do not provide theoretical guarantees. This paper was published in SIGMOD 2025.
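A hedged toy version of the subspace-collision idea follows: split dimensions into subspaces, cluster each subspace, and use the number of shared cluster assignments (collisions) as a cheap proximity proxy. The parameters are illustrative, and SuCo's actual index and SC-score differ in detail:

```python
# Hedged sketch: collision counting across clustered subspaces.
import numpy as np
from sklearn.cluster import KMeans

def subspace_codes(X, n_subspaces=8, n_clusters=16, seed=0):
    d = X.shape[1] // n_subspaces
    return np.stack([
        KMeans(n_clusters, random_state=seed, n_init=4)
        .fit_predict(X[:, i * d:(i + 1) * d])
        for i in range(n_subspaces)
    ], axis=1)                                    # (n_points, n_subspaces)

X = np.random.randn(2000, 64).astype(np.float32)
codes = subspace_codes(X)
q = 0                                             # treat point 0 as the query
collisions = (codes == codes[q]).sum(axis=1)      # collisions with the query
candidates = np.argsort(-collisions)[:10]         # most-colliding candidates
# Exact distances are then computed only on this short candidate list.
print(candidates)
```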
- [166] arXiv:2411.14755 [pdf, html, other]
-
Title: FairAdapter: Detecting AI-generated Images with Improved FairnessSubjects: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
The high-quality, realistic images generated by generative models pose significant challenges for exposing them. So far, data-driven deep neural networks have been shown to be the most effective forensic tools for this challenge. However, they may overfit to certain semantics, resulting in considerable inconsistency in detection performance across different contents of generated samples. This can be regarded as an issue of detection fairness. In this paper, we propose a novel framework named FairAdapter to tackle the issue. In comparison with existing state-of-the-art methods, our model achieves improved fairness performance. Our project: this https URL
- [167] arXiv:2411.14756 [pdf, html, other]
-
Title: KPG 193: A Synthetic Korean Power Grid Test System for Decarbonization StudiesSubjects: Systems and Control (eess.SY)
This paper introduces the 193-bus synthetic Korean power grid (KPG 193), developed using open data sources to address recent challenges of the Korean power system. The KPG 193 test system serves as a valuable platform for decarbonization research, capturing Korea's low renewable energy penetration, concentrated urban energy demand, and isolated grid structure. Clustering techniques were applied to preserve key system characteristics while maintaining computational tractability and representativeness. The system includes 193 buses, 123 generators, and 407 transmission lines, and incorporates temporal weather datasets. Its feasibility was validated through Unit Commitment (UC), DC Optimal Power Flow (DCOPF), and AC Optimal Power Flow (ACOPF) simulations using 2022 demand and renewable generation data. This test system aims to provide a foundational framework for modeling and analyzing the Korean power grid.
- [168] arXiv:2411.14759 [pdf, other]
-
Title: Hammer: Towards Efficient Hot-Cold Data Identification via Online LearningSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Efficient management of storage resources in big data and cloud computing environments requires accurate identification of data's "cold" and "hot" states. Traditional methods, such as rule-based algorithms and early AI techniques, often struggle with dynamic workloads, leading to low accuracy, poor adaptability, and high operational overhead. To address these issues, we propose a novel solution based on online learning strategies. Our approach dynamically adapts to changing data access patterns, achieving higher accuracy and lower operational costs. Rigorous testing with both synthetic and real-world datasets demonstrates a significant improvement, achieving a 90% accuracy rate in hot-cold classification. Additionally, the computational and storage overheads are considerably reduced.
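As an illustration of the online-learning framing, here is a minimal hedged sketch of a streaming hot/cold classifier; the features (recency, frequency, size) and the model choice are assumptions for illustration, not the paper's system:

```python
# Hedged sketch: an online linear classifier for hot/cold data identification
# that adapts incrementally as access patterns drift.
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss")   # online logistic regression
classes = np.array([0, 1])             # 0 = cold, 1 = hot

def features(recency_s, freq_per_h, size_kb):
    return np.array([[1.0 / (1.0 + recency_s), freq_per_h, np.log1p(size_kb)]])

# Stream of (object stats, observed hot/cold outcome) pairs:
stream = [((30, 120.0, 8), 1), ((86400, 0.1, 4096), 0), ((60, 80.0, 16), 1)]
for (rec, freq, size), label in stream:
    x = features(rec, freq, size)
    clf.partial_fit(x, [label], classes=classes)  # update on each observation

print(clf.predict(features(45, 100.0, 12)))       # likely classified "hot"
```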
- [169] arXiv:2411.14760 [pdf, html, other]
-
Title: The 1st Workshop on Human-Centered Recommender SystemsKaike Zhang, Yunfan Wu, Yougang lyu, Du Su, Yingqiang Ge, Shuchang Liu, Qi Cao, Zhaochun Ren, Fei SunComments: Workshop at TheWebConf 2025Subjects: Information Retrieval (cs.IR)
Recommender systems are quintessential applications of human-computer interaction. Widely utilized in daily life, they offer significant convenience but also present numerous challenges, such as the information cocoon effect, privacy concerns, fairness issues, and more. Consequently, this workshop aims to provide a platform for researchers to explore the development of Human-Centered Recommender Systems (HCRS). HCRS refers to the creation of recommender systems that prioritize human needs, values, and capabilities at the core of their design and operation. In this workshop, topics will include, but are not limited to, robustness, privacy, transparency, fairness, diversity, accountability, ethical considerations, and user-friendly design. We hope to engage in discussions on how to implement and enhance these properties in recommender systems. Additionally, participants will explore diverse evaluation methods, including innovative metrics that capture user satisfaction and trust. This workshop seeks to foster a collaborative environment for researchers to share insights and advance the field toward more ethical, user-centric, and socially responsible recommender systems.
- [170] arXiv:2411.14762 [pdf, html, other]
-
Title: Efficient Long Video Tokenization via Coordinated-based Patch ReconstructionComments: Code is available on the project webpage: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Efficient tokenization of videos remains a challenge in training vision models that can process long videos. One promising direction is to develop a tokenizer that can encode long video clips, as it would enable the tokenizer to leverage the temporal coherence of videos better for tokenization. However, training existing tokenizers on long videos often incurs a huge training cost as they are trained to reconstruct all the frames at once. In this paper, we introduce CoordTok, a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos, inspired by recent advances in 3D generative models. In particular, CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled $(x,y,t)$ coordinates. This allows for training large tokenizer models directly on long videos without requiring excessive training resources. Our experiments show that CoordTok can drastically reduce the number of tokens for encoding long video clips. For instance, CoordTok can encode a 128-frame video with 128$\times$128 resolution into 1280 tokens, while baselines need 6144 or 8192 tokens to achieve similar reconstruction quality. We further show that this efficient video tokenization enables memory-efficient training of a diffusion transformer that can generate 128 frames at once.
- [171] arXiv:2411.14765 [pdf, html, other]
-
Title: An Attention-based Framework for Fair Contrastive LearningSubjects: Machine Learning (cs.LG)
Contrastive learning has proven instrumental in learning unbiased representations of data, especially in complex environments characterized by high-cardinality and high-dimensional sensitive information. However, existing approaches within this setting require predefined modelling assumptions of bias-causing interactions that limit the model's ability to learn debiased representations. In this work, we propose a new method for fair contrastive learning that employs an attention mechanism to model bias-causing interactions, enabling the learning of a fairer and semantically richer embedding space. In particular, our attention mechanism avoids bias-causing samples that confound the model and focuses on bias-reducing samples that help learn semantically meaningful representations. We verify the advantages of our method against existing baselines in fair contrastive learning and show that our approach can significantly boost bias removal from learned representations without compromising downstream accuracy.
- [172] arXiv:2411.14768 [pdf, html, other]
-
Title: Grid and Road Expressions Are Complementary for Trajectory Representation LearningComments: This paper is accepted by KDD2025(August Cycle)Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Trajectory representation learning (TRL) maps trajectories to vectors that can be used for many downstream tasks. Existing TRL methods use either grid trajectories, capturing movement in free space, or road trajectories, capturing movement in a road network, as input. We observe that the two types of trajectories are complementary, providing either region and location information or road structure and movement regularity. Therefore, we propose a novel multimodal TRL method, dubbed GREEN, to jointly utilize Grid and Road trajectory Expressions for Effective representatioN learning. In particular, we transform raw GPS trajectories into both grid and road trajectories and tailor two encoders to capture their respective information. To align the two encoders such that they complement each other, we adopt a contrastive loss to encourage them to produce similar embeddings for the same raw trajectory and design a masked language model (MLM) loss that uses grid trajectories to help reconstruct masked road trajectories. To learn the final trajectory representation, a dual-modal interactor is used to fuse the outputs of the two encoders via cross-attention. We compare GREEN with 7 state-of-the-art TRL methods for 3 downstream tasks, finding that GREEN consistently outperforms all baselines and improves the accuracy of the best-performing baseline by an average of 15.99\%.
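The cross-modal alignment term can be sketched as a standard InfoNCE-style loss that pulls the grid and road embeddings of the same raw trajectory together; the encoder internals and the MLM term are omitted, and the temperature is an assumption:

```python
# Hedged sketch of a symmetric contrastive alignment loss between the
# grid-trajectory and road-trajectory encoders (illustrative only).
import torch
import torch.nn.functional as F

def alignment_loss(grid_emb, road_emb, tau=0.07):
    """grid_emb, road_emb: (B, d) embeddings of the same B raw trajectories."""
    g = F.normalize(grid_emb, dim=-1)
    r = F.normalize(road_emb, dim=-1)
    logits = g @ r.t() / tau                 # (B, B) similarity matrix
    targets = torch.arange(g.size(0))        # matched pairs on the diagonal
    # Symmetric loss over both matching directions (grid->road, road->grid).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = alignment_loss(torch.randn(32, 256), torch.randn(32, 256))
print(loss.item())
```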
- [173] arXiv:2411.14770 [pdf, html, other]
-
Title: Aim My Robot: Precision Local Navigation to Any ObjectXiangyun Meng, Xuning Yang, Sanghun Jung, Fabio Ramos, Srid Sadhan Jujjavarapu, Sanjoy Paul, Dieter FoxSubjects: Robotics (cs.RO)
Existing navigation systems mostly declare "success" when the robot reaches within a 1 m radius of a goal. This precision is insufficient for emerging applications where the robot needs to be positioned precisely relative to an object for downstream tasks, such as docking, inspection, and manipulation. To this end, we design and implement Aim-My-Robot (AMR), a local navigation system that enables a robot to reach any object in its vicinity at the desired relative pose, with centimeter-level precision. AMR achieves high precision and robustness by leveraging multi-modal perception and precise action prediction, and it is trained on large-scale photorealistic data generated in simulation. AMR shows strong sim2real transfer and can adapt to different robot kinematics and unseen objects with little to no fine-tuning.
- [174] arXiv:2411.14771 [pdf, html, other]
-
Title: Capacity Approximations for Insertion Channels with Small Insertion ProbabilitiesComments: 39 pages, 1 figureSubjects: Information Theory (cs.IT)
Channels with synchronization errors, exhibiting deletion and insertion errors, find practical applications in DNA storage, data reconstruction, and various other domains. The presence of insertions and deletions renders the channel one with memory, complicating capacity analysis. For instance, although the independent and identically distributed (i.i.d.) deletion channel was formulated more than fifty years ago and was proven to be information stable, hence possessing a Shannon capacity, calculation of that capacity has remained elusive. However, a relatively recent result establishes the capacity of the deletion channel in the asymptotic regime of small deletion probabilities by computing the dominant terms of the capacity expansion. This paper extends that result to binary insertion channels, determining the dominant terms of the channel capacity for small insertion probabilities and establishing capacity in this asymptotic regime. Specifically, we consider two i.i.d. insertion channel models: the insertion channel with possible random bit insertions after every transmitted bit, and the Gallager insertion model, in which a bit is replaced by two random bits with a certain probability. To prove our results, we build on methods used for the deletion channel, employing Bernoulli(1/2) inputs for achievability and coupling this with a converse using stationary and ergodic processes as inputs, and show that the channel capacity differs only in the higher order terms from the achievable rates with i.i.d. inputs. The results, for instance, show that the capacity of the random insertion channel is higher than that of the Gallager insertion channel, and quantify the difference in the asymptotic regime.
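To make the two channel models concrete, here is a small simulation of both; the seed and block length are arbitrary, and this only illustrates the channel definitions, not the capacity analysis:

```python
# Hedged simulation of the two i.i.d. insertion models from the abstract:
# (a) a uniform random bit may be inserted after each transmitted bit;
# (b) Gallager's model, where a bit is replaced by two random bits w.p. p.
import numpy as np

rng = np.random.default_rng(0)

def random_insertion(x, p):
    out = []
    for b in x:
        out.append(b)
        if rng.random() < p:                 # insert a uniform random bit
            out.append(rng.integers(0, 2))
    return np.array(out)

def gallager_insertion(x, p):
    out = []
    for b in x:
        if rng.random() < p:                 # replace b by two random bits
            out.extend(rng.integers(0, 2, size=2))
        else:
            out.append(b)
    return np.array(out)

x = rng.integers(0, 2, size=20)
print(len(random_insertion(x, 0.1)), len(gallager_insertion(x, 0.1)))
```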
- [175] arXiv:2411.14773 [pdf, html, other]
-
Title: Mode-conditioned music learning and composition: a spiking neural network inspired by neuroscience and psychologyComments: 18 pages, 8 figuresSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Neurons and Cognition (q-bio.NC)
Musical mode is one of the most critical elements establishing the framework of pitch organization and determining harmonic relationships. Previous works often use simplistic and rigid alignment methods and overlook the diversity of modes. In contrast to AI models, however, humans possess cognitive mechanisms for perceiving various modes and keys. In this paper, we propose a spiking neural network inspired by brain mechanisms and psychological theories to represent musical modes and keys, ultimately generating musical pieces that incorporate tonality features. Specifically, the contributions are as follows: 1) The model is designed with multiple collaborating subsystems inspired by the structures and functions of corresponding brain regions; 2) We incorporate mechanisms for neural circuit evolutionary learning that enable the network to learn and generate mode-related features in music, reflecting the cognitive processes involved in human music perception; 3) The results demonstrate that the proposed model exhibits a connection framework closely resembling the Krumhansl-Schmuckler model, one of the most significant key perception models in music psychology; 4) Experiments show that the model can generate music pieces with the characteristics of given modes and keys. Additionally, quantitative assessment of the generated pieces reveals that they have both tonality characteristics and the melodic adaptability needed to produce diverse and musical content. By combining insights from neuroscience, psychology, and music theory with advanced neural network architectures, our research aims to create a system that not only learns and generates music but also bridges the gap between human cognition and artificial intelligence.
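For context, the classic Krumhansl-Schmuckler key-finding algorithm that the learned connection framework is compared against can be sketched in a few lines: correlate a pitch-class histogram with rotated major/minor key profiles. The profile values below are as commonly reported in the music-psychology literature:

```python
# Hedged sketch of Krumhansl-Schmuckler key finding (profile correlation).
import numpy as np

MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NAMES = "C C# D D# E F F# G G# A A# B".split()

def estimate_key(pc_hist):
    """pc_hist: 12-dim duration-weighted pitch-class histogram."""
    best = None
    for mode, profile in (("major", MAJOR), ("minor", MINOR)):
        for tonic in range(12):
            r = np.corrcoef(pc_hist, np.roll(profile, tonic))[0, 1]
            if best is None or r > best[0]:
                best = (r, f"{NAMES[tonic]} {mode}")
    return best[1]

c_major_scale = np.zeros(12)
c_major_scale[[0, 2, 4, 5, 7, 9, 11]] = 1.0     # C D E F G A B
print(estimate_key(c_major_scale))               # -> "C major"
```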
- [176] arXiv:2411.14774 [pdf, html, other]
-
Title: Resolution-Agnostic Transformer-based Climate DownscalingSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Understanding future weather changes at regional and local scales is crucial for planning and decision-making, particularly in the context of extreme weather events, as well as for broader applications in agriculture, insurance, and infrastructure development. However, the computational cost of downscaling Global Climate Models (GCMs) to the fine resolutions needed for such applications presents a significant barrier. Drawing on advancements in weather forecasting models, this study introduces a cost-efficient downscaling method using a pretrained Earth Vision Transformer (Earth ViT) model. Initially trained on ERA5 data to downscale from 50 km to 25 km resolution, the model is then tested on the higher resolution BARRA-SY dataset at a 3 km resolution. Remarkably, it performs well without additional training, demonstrating its ability to generalize across different resolutions. This approach holds promise for generating large ensembles of regional climate simulations by downscaling GCMs with varying input resolutions without incurring additional training costs. Ultimately, this method could provide more comprehensive estimates of potential future changes in key climate variables, aiding in effective planning for extreme weather events and climate change adaptation strategies.
- [177] arXiv:2411.14775 [pdf, html, other]
-
Title: A Benchmark Dataset for Collaborative SLAM in Service EnvironmentsComments: 8 pages, 6 figures, Accepted to IEEE RA-LJournal-ref: IEEE Robotics and Automation Letters (RA-L), vol. 9, no. 12, pp. 11337-11344, Dec. 2024Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
As service environments have become diverse, they have started to demand complicated tasks that are difficult for a single robot to complete. This change has led to an interest in multiple robots instead of a single robot. C-SLAM, as a fundamental technique for multiple service robots, needs to handle diverse challenges such as homogeneous scenes and dynamic objects to ensure that robots operate smoothly and perform their tasks safely. However, existing C-SLAM datasets do not include the various indoor service environments with the aforementioned challenges. To close this gap, we introduce a new multi-modal C-SLAM dataset for multiple service robots in various indoor service environments, called C-SLAM dataset in Service Environments (CSE). We use the NVIDIA Isaac Sim to generate data in various indoor service environments with the challenges that may occur in real-world service environments. By using simulation, we can provide accurate and precisely time-synchronized sensor data, such as stereo RGB, stereo depth, IMU, and ground truth (GT) poses. We configure three common indoor service environments (Hospital, Office, and Warehouse), each of which includes various dynamic objects that perform motions suitable to each environment. In addition, we drive three robots to mimic the actions of real service robots. Through these factors, we generate a more realistic C-SLAM dataset for multiple service robots. We demonstrate our dataset by evaluating diverse state-of-the-art single-robot SLAM and multi-robot SLAM methods. Our dataset is available at this https URL.
- [178] arXiv:2411.14779 [pdf, html, other]
-
Title: New families of non-Reed-Solomon MDS codesSubjects: Information Theory (cs.IT)
MDS codes have garnered significant attention due to their wide applications in practice. To date, most known MDS codes are equivalent to Reed-Solomon codes. The construction of non-Reed-Solomon (non-RS) type MDS codes has emerged as an intriguing and important problem in both coding theory and finite geometry. Although some constructions of non-RS type MDS codes have been presented in the literature, the parameters of these MDS codes remain subject to strict constraints. In this paper, we introduce a general framework of constructing $[n,k]$ MDS codes using the idea of selecting a suitable set of evaluation polynomials and a set of evaluation points such that all nonzero polynomials have at most $k-1$ zeros in the evaluation set. Moreover, these MDS codes can be proved to be non-Reed-Solomon by computing their Schur squares. Furthermore, several explicit constructions of non-RS MDS codes are given by converting to combinatorial problems. As a result, new families of non-RS MDS codes with much more flexible lengths can be obtained and most of them are not covered by the known results.
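The Schur-square criterion mentioned above is easy to demonstrate numerically: an $[n,k]$ RS code has a Schur square of dimension $2k-1$ (when $2k-1 \le n$), while generic MDS codes tend toward $\min(n, k(k+1)/2)$, so the rank distinguishes them. Below is a hedged toy verification; the field GF(13) and the parameters $n=12$, $k=4$ are arbitrary illustrative choices:

```python
# Hedged sketch: compute the Schur square dimension of an RS code over GF(p).
import numpy as np

def rank_mod_p(M, p):
    """Rank of an integer matrix over GF(p) via Gaussian elimination."""
    M = M.copy() % p
    r = 0
    for c in range(M.shape[1]):
        piv = next((i for i in range(r, M.shape[0]) if M[i, c] % p), None)
        if piv is None:
            continue
        M[[r, piv]] = M[[piv, r]]
        M[r] = (M[r] * pow(int(M[r, c]), p - 2, p)) % p   # normalize pivot
        for i in range(M.shape[0]):
            if i != r and M[i, c]:
                M[i] = (M[i] - M[i, c] * M[r]) % p
        r += 1
    return r

def schur_square_dim(G, p):
    """Span of componentwise products of all pairs of generator rows."""
    rows = [(G[i] * G[j]) % p for i in range(len(G)) for j in range(i, len(G))]
    return rank_mod_p(np.array(rows), p)

p, n, k = 13, 12, 4
alphas = np.arange(1, n + 1) % p                          # distinct points
G = np.array([[pow(int(a), e, p) for a in alphas] for e in range(k)])
print(schur_square_dim(G, p))   # 2k-1 = 7 for this Reed-Solomon code
```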
- [179] arXiv:2411.14781 [pdf, html, other]
-
Title: Reconciling Semantic Controllability and Diversity for Remote Sensing Image Synthesis with Hybrid Semantic EmbeddingSubjects: Computer Vision and Pattern Recognition (cs.CV)
Significant advancements have been made in semantic image synthesis in remote sensing. However, existing methods still face formidable challenges in balancing semantic controllability and diversity. In this paper, we present a Hybrid Semantic Embedding Guided Generative Adversarial Network (HySEGGAN) for controllable and efficient remote sensing image synthesis. Specifically, HySEGGAN leverages hierarchical information from a single source. Motivated by feature description, we propose a hybrid semantic embedding method that coordinates fine-grained local semantic layouts to characterize the geometric structure of remote sensing objects without extra information. Besides, a Semantic Refinement Network (SRN) is introduced, incorporating a novel loss function to ensure fine-grained semantic feedback. The proposed approach mitigates semantic confusion and prevents geometric pattern collapse. Experimental results indicate that the method strikes an excellent balance between semantic controllability and diversity. Furthermore, HySEGGAN significantly improves the quality of synthesized images and achieves state-of-the-art performance as a data augmentation technique across multiple datasets for downstream tasks.
- [180] arXiv:2411.14783 [pdf, html, other]
-
Title: Segmenting Action-Value Functions Over Time-Scales in SARSA using TD($\Delta$)Comments: 17 pages. arXiv admin note: text overlap with arXiv:2411.14019Subjects: Machine Learning (cs.LG)
In numerous episodic reinforcement learning (RL) settings, SARSA-based methodologies are employed to enhance policies aimed at maximizing returns over long horizons. Conventional SARSA algorithms, however, have difficulty balancing bias and variance due to their reliance on a single, fixed discount factor. This study extends the temporal difference decomposition approach, TD($\Delta$), to the SARSA algorithm. SARSA, a widely utilised on-policy RL method, enhances action-value functions via temporal difference updates. TD($\Delta$) facilitates learning over several time-scales by breaking the action-value function into components associated with distinct discount factors. This decomposition improves learning efficiency and stability, particularly in problems necessitating long-horizon optimization. We illustrate that our methodology mitigates bias in SARSA's updates while facilitating accelerated convergence in contexts characterized by dense rewards. Experimental findings across many benchmark tasks indicate that the proposed SARSA($\Delta$) surpasses conventional TD learning methods in both tabular and deep RL contexts.
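A hedged tabular sketch of the decomposition applied to SARSA follows; the delta-estimator recursion is adapted from the TD($\Delta$) literature to action values, and the discount ladder, learning rate, and state/action counts are illustrative assumptions:

```python
# Hedged sketch: SARSA with a TD(Delta)-style decomposition. W[0] tracks
# Q at the shortest time-scale; W[i] tracks Q_{gamma_i} - Q_{gamma_{i-1}}.
import numpy as np

gammas = [0.9, 0.99, 0.999]          # increasing time-scales (assumption)
n_states, n_actions, alpha = 16, 4, 0.1
W = [np.zeros((n_states, n_actions)) for _ in gammas]

def q_values(s):                      # Q at gamma_K is the sum of components
    return sum(w[s] for w in W)

def update(s, a, r, s2, a2):
    # Component 0: ordinary SARSA at the shortest time-scale.
    td0 = r + gammas[0] * W[0][s2, a2] - W[0][s, a]
    W[0][s, a] += alpha * td0
    for i in range(1, len(gammas)):
        # Partial sums approximate Q_{gamma_{i-1}} at the next state-action.
        q_prev = sum(W[j][s2, a2] for j in range(i))
        target = (gammas[i] - gammas[i - 1]) * q_prev + gammas[i] * W[i][s2, a2]
        W[i][s, a] += alpha * (target - W[i][s, a])

update(0, 1, 1.0, 2, 3)               # one (s, a, r, s', a') transition
print(q_values(0)[1])                 # composite long-horizon estimate
```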
- [181] arXiv:2411.14786 [pdf, html, other]
-
Title: FastGrasp: Efficient Grasp Synthesis with DiffusionSubjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Effectively modeling the interaction between human hands and objects is challenging due to the complex physical constraints and the requirement for high generation efficiency in applications. Prior approaches often employ computationally intensive two-stage pipelines, which first generate an intermediate representation, such as contact maps, followed by an iterative optimization procedure that updates hand meshes to capture the hand-object relation. However, due to the high computation complexity of the optimization stage, such strategies often suffer from low inference efficiency. To address this limitation, this work introduces a novel diffusion-model-based approach that generates the grasping pose in a one-stage manner. This allows us to significantly improve generation speed and the diversity of generated hand poses. In particular, we develop a Latent Diffusion Model with an Adaptation Module for object-conditioned hand pose generation and a contact-aware loss to enforce the physical constraints between hands and objects. Extensive experiments demonstrate that our method achieves faster inference, higher diversity, and superior pose quality than state-of-the-art approaches. Code is available at this https URL.
- [182] arXiv:2411.14788 [pdf, html, other]
-
Title: Jovis: A Visualization Tool for PostgreSQL Query OptimizerSubjects: Databases (cs.DB); Human-Computer Interaction (cs.HC)
In the world of relational database management, the query optimizer is a critical component that significantly impacts query performance. To address the challenge of optimizing query performance due to the complexity of optimizers -- especially with join operations -- we introduce Jovis. This novel visualization tool provides a window into the often intricate process of query optimization in PostgreSQL, making it more accessible and understandable. PostgreSQL employs two different query optimization strategies: the Dynamic Programming (DP) Optimizer for most scenarios and the Genetic Query Optimizer (GEQO) for more complex queries with numerous joins, both of which are supported in Jovis. Our tool visualizes the optimizer's decision-making process, from evaluating access paths for each relation to determining join orderings, all using data derived from the optimizer's logs. Jovis not only clarifies the query optimization process through visualizations but also serves as an invaluable learning tool for learners and a practical resource for experienced database professionals looking to optimize their query performance or even the query optimizer itself. The source code has been made available at this https URL.
- [183] arXiv:2411.14789 [pdf, html, other]
-
Title: Simplifying CLIP: Unleashing the Power of Large-Scale Models on Consumer-level ComputersSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Contrastive Language-Image Pre-training (CLIP) has attracted a surge of attention for its superior zero-shot performance and excellent transferability to downstream tasks. However, training such large-scale models usually requires substantial computation and storage, which poses barriers for general users with consumer-level computers. Motivated by this observation, in this paper we investigate how to achieve competitive performance using only one Nvidia RTX3090 GPU and one terabyte of storage for the dataset. On the one hand, we simplify the transformer block structure and combine Weight Inheritance with multi-stage Knowledge Distillation (WIKD), thereby reducing the parameters and improving the inference speed during training and deployment. On the other hand, confronted with the convergence challenge posed by the small dataset, we generate synthetic captions for each sample as data augmentation, and devise a novel Pair Matching (PM) loss to fully exploit the distinction among positive and negative image-text pairs. Extensive experiments demonstrate that our model can achieve a new state-of-the-art trade-off among data scale, parameter count, and accuracy, which could further popularize the CLIP model in the related research community.
- [184] arXiv:2411.14790 [pdf, html, other]
-
Title: KBAda: Efficient Self Adaptation on Specific Knowledge BasesZheni Zeng, Yuxuan Chen, Shi Yu, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong SunSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Humans can utilize techniques to quickly acquire knowledge from specific materials in advance, such as creating self-assessment questions, enabling us to achieve related tasks more efficiently. In contrast, large language models (LLMs) usually rely on retrieval-augmented generation to exploit knowledge materials in an instant manner, or require external signals such as human preference data and stronger LLM annotations to conduct knowledge adaptation. To unleash the self-learning potential of LLMs, we propose KBAda, an approach designed for efficient adaptation to downstream tasks involving knowledge bases. Our method utilizes iterative training with self-annotated data such as Q&A pairs and revision suggestions, enabling the model to grasp the knowledge content efficiently. Experimental results on multiple datasets demonstrate the effectiveness of our approach, significantly boosting model performance in downstream tasks that require specific knowledge at a low cost. Notably, our approach achieves over 90% of the performance improvement that can be obtained by using GPT-4-turbo annotation, while relying entirely on self-supervision. We release our experimental data, models, and process analyses to the community for further exploration (this https URL).
- [185] arXiv:2411.14793 [pdf, html, other]
-
Title: Style-Friendly SNR Sampler for Style-Driven GenerationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recent large-scale diffusion models generate high-quality images but struggle to learn new, personalized artistic styles, which limits the creation of unique style templates. Fine-tuning with reference images is the most promising approach, but it often blindly utilizes objectives and noise level distributions used for pre-training, leading to suboptimal style alignment. We propose the Style-friendly SNR sampler, which aggressively shifts the signal-to-noise ratio (SNR) distribution toward higher noise levels during fine-tuning to focus on noise levels where stylistic features emerge. This enables models to better capture unique styles and generate images with higher style alignment. Our method allows diffusion models to learn and share new "style templates", enhancing personalized content creation. We demonstrate the ability to generate styles such as personal watercolor paintings, minimal flat cartoons, 3D renderings, multi-panel images, and memes with text, thereby broadening the scope of style-driven generation.
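The core idea lends itself to a short sketch: bias the noise-level sampling distribution used during fine-tuning toward high noise (low log-SNR), where stylistic features emerge. The distribution parameters below are illustrative assumptions, not the paper's tuned values:

```python
# Hedged sketch: a style-friendly log-SNR sampler for diffusion fine-tuning.
import torch

def style_friendly_logsnr(batch, loc=-6.0, scale=2.0):
    """Sample log-SNR from a normal shifted toward low SNR / high noise
    (standard samplers concentrate closer to log-SNR = 0)."""
    return torch.randn(batch) * scale + loc

def add_noise(x0, logsnr):
    """Variance-preserving noising: alpha^2 = sigmoid(logsnr)."""
    a = torch.sigmoid(logsnr).sqrt().view(-1, 1, 1, 1)    # signal coefficient
    s = torch.sigmoid(-logsnr).sqrt().view(-1, 1, 1, 1)   # noise coefficient
    return a * x0 + s * torch.randn_like(x0)

x0 = torch.randn(4, 4, 64, 64)                    # clean latents
noisy = add_noise(x0, style_friendly_logsnr(4))   # mostly heavily noised
# Fine-tuning on these heavily noised samples concentrates learning on the
# noise levels where style, per the abstract, is actually decided.
```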
- [186] arXiv:2411.14794 [pdf, html, other]
-
Title: VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame SelectionSonghao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si LiuComments: 14 pages, 14 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities. Our code and dataset will be released at: this https URL
- [187] arXiv:2411.14795 [pdf, html, other]
-
Title: De-biased Multimodal Electrocardiogram AnalysisSubjects: Computation and Language (cs.CL)
Multimodal large language models (MLLMs) are increasingly being applied in the medical field, particularly in medical imaging. However, developing MLLMs for ECG signals, which are crucial in clinical settings, has been a significant challenge beyond medical imaging. Previous studies have attempted to address this by converting ECGs into several text tags using an external classifier in a training-free manner. However, this approach significantly compresses the information in ECGs and underutilizes the reasoning capabilities of LLMs. In this work, we directly feed the embeddings of ECGs into the LLM through a projection layer, retaining more information about ECGs and better leveraging the reasoning abilities of LLMs. Our method can also effectively handle a common situation in clinical practice where it is necessary to compare two ECGs taken at different times. Recent studies found that MLLMs may rely solely on text input to provide answers, ignoring inputs from other modalities. We analyzed this phenomenon from a causal perspective in the context of ECG MLLMs and discovered that the confounder, severity of illness, introduces a spurious correlation between the question and answer, leading the model to rely on this spurious correlation and ignore the ECG input. Such models do not comprehend the ECG input and perform poorly in adversarial tests where different expressions of the same question are used in the training and testing sets. We designed a de-biased pre-training method to eliminate the confounder's effect according to the theory of backdoor adjustment. Our model performed well on the ECG-QA task under adversarial testing and demonstrated zero-shot capabilities. An interesting random ECG test further validated that our model effectively understands and utilizes the input ECG signal.
- [188] arXiv:2411.14796 [pdf, html, other]
-
Title: Adaptive Hyper-Graph Convolution Network for Skeleton-based Human Action Recognition with Virtual ConnectionsSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The shared topology of human skeletons motivated the recent investigation of graph convolutional network (GCN) solutions for action recognition. However, the existing GCNs rely on the binary connection of two neighbouring vertices (joints) formed by an edge (bone), overlooking the potential of constructing multi-vertex convolution structures. In this paper we address this oversight and explore the merits of a hyper-graph convolutional network (Hyper-GCN) to achieve the aggregation of rich semantic information conveyed by skeleton vertices. In particular, our Hyper-GCN adaptively optimises multi-scale hyper-graphs during training, revealing the action-driven multi-vertex relations. Besides, virtual connections are often designed to support efficient feature aggregation, implicitly extending the spectrum of dependencies within the skeleton. By injecting virtual connections into hyper-graphs, the semantic clues of diverse action categories can be highlighted. The results of experiments conducted on the NTU-60, NTU-120, and NW-UCLA datasets, demonstrate the merits of our Hyper-GCN, compared to the state-of-the-art methods. Specifically, we outperform the existing solutions on NTU-120, achieving 90.2\% and 91.4\% in terms of the top-1 recognition accuracy on X-Sub and X-Set.
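A single hypergraph convolution over skeleton joints can be sketched with the standard incidence-matrix formulation; the adaptive multi-scale hyper-graph optimisation and the virtual connections of Hyper-GCN are not reproduced here, and the sizes are illustrative:

```python
# Hedged sketch: one hypergraph convolution layer, where hyperedges may
# connect more than two joints (unlike ordinary bone edges).
import torch

def hypergraph_conv(X, H, theta):
    """X: (N, C) joint features; H: (N, E) incidence matrix;
    theta: (C, C_out) learnable weights."""
    Dv = H.sum(dim=1).clamp(min=1)           # vertex degrees
    De = H.sum(dim=0).clamp(min=1)           # hyperedge degrees
    Dv_inv_sqrt = Dv.pow(-0.5).diag()
    De_inv = De.reciprocal().diag()
    # Normalized vertex-edge-vertex propagation operator.
    A = Dv_inv_sqrt @ H @ De_inv @ H.t() @ Dv_inv_sqrt
    return torch.relu(A @ X @ theta)

N, C, E = 25, 64, 8                           # e.g., 25 NTU skeleton joints
X = torch.randn(N, C)
H = (torch.rand(N, E) > 0.7).float()          # random multi-joint hyperedges
out = hypergraph_conv(X, H, torch.randn(C, 128))
print(out.shape)                              # (25, 128)
```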
- [189] arXiv:2411.14797 [pdf, html, other]
-
Title: Continual SFT Matches Multimodal RLHF with Negative SupervisionSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Multimodal RLHF usually happens after the supervised finetuning (SFT) stage to continually improve vision-language models' (VLMs) comprehension. Conventional wisdom holds that it is superior to continual SFT during this preference-alignment stage. In this paper, we observe that the inherent value of multimodal RLHF lies in its negative supervision, the logits of the rejected responses. We thus propose a novel negative supervised finetuning (nSFT) approach that fully exploits this information. Our nSFT disentangles the negative supervision from the RLHF paradigm and continually aligns VLMs with a simple SFT loss. This is more memory-efficient than multimodal RLHF, where 2 (e.g., DPO) or 4 (e.g., PPO) large VLMs are strictly required. The effectiveness of nSFT is rigorously demonstrated by comparing it with various multimodal RLHF approaches, across different dataset sources, base VLMs, and evaluation metrics. Ample ablations are also provided to support our hypothesis. We hope this paper will stimulate further research on properly aligning large vision-language models.
- [190] arXiv:2411.14798 [pdf, html, other]
-
Title: Facial Features Matter: a Dynamic Watermark based Proactive Deepfake Detection ApproachSubjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Current passive deepfake face-swapping detection methods encounter significant bottlenecks in model generalization capabilities. Meanwhile, proactive detection methods often use fixed watermarks which lack a close relationship with the content they protect and are vulnerable to security risks. Dynamic watermarks based on facial features offer a promising solution, as these features provide unique identifiers. Therefore, this paper proposes a Facial Feature-based Proactive deepfake detection method (FaceProtect), which utilizes changes in facial characteristics during deepfake manipulation as a novel detection mechanism. We introduce a GAN-based One-way Dynamic Watermark Generating Mechanism (GODWGM) that uses 128-dimensional facial feature vectors as inputs. This method creates irreversible mappings from facial features to watermarks, enhancing protection against various reverse inference attacks. Additionally, we propose a Watermark-based Verification Strategy (WVS) that combines steganography with GODWGM, allowing simultaneous transmission of the benchmark watermark representing facial features within the image. Experimental results demonstrate that our proposed method maintains exceptional detection performance and exhibits high practicality on images altered by various deepfake techniques.
- [191] arXiv:2411.14802 [pdf, other]
-
Title: Enhancing a Hierarchical Graph Rewriting Language based on MELL Cut EliminationComments: 26 pages. Extended version of the paper to appear in Proc. 27th International Symposium on Practical Aspects of Declarative Languages (PADL 2025), LNCS, Springer-Verlag, 2025, with Appendices describing further details that could not be included in the conference version of the paperSubjects: Programming Languages (cs.PL)
Hierarchical graph rewriting is a highly expressive computational formalism that manipulates graphs enhanced with box structures for representing hierarchies. It has provided the foundations of various graph-based modeling tools, but the design of high-level declarative languages based on hierarchical graph rewriting is still a challenge. For a solid design choice, well-established formalisms with backgrounds other than graph rewriting would provide useful guidelines. Proof nets of Multiplicative Exponential Linear Logic (MELL) are such a framework because their original formulation of cut elimination is essentially graph rewriting involving box structures, where so-called Promotion Boxes with an indefinite number of non-local edges may be cloned, migrated and deleted. This work builds on LMNtal as a declarative language based on hierarchical (port) graph rewriting, and discusses how it can be extended to support the above operations on Promotion Boxes of MELL proof nets. LMNtal thus extended turns out to be a practical graph rewriting language that has strong affinity with MELL proof nets. The language features provided are general enough to encode other well-established models of concurrency. Using the toolchain of LMNtal that provides state-space search and model checking, we implemented cut elimination rules of MELL proof nets in extended LMNtal and demonstrated that the platform could serve as a useful workbench for proof nets.
- [192] arXiv:2411.14807 [pdf, html, other]
-
Title: Harlequin: Color-driven Generation of Synthetic Data for Referring Expression ComprehensionComments: Accepted to ICPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Referring Expression Comprehension (REC) aims to identify a particular object in a scene by a natural language expression, and is an important topic in visual language understanding. State-of-the-art methods for this task are based on deep learning, which generally requires expensive and manually labeled annotations. Some works tackle the problem with limited-supervision learning or by relying on Large Vision and Language Models. However, the development of techniques to synthesize labeled data is overlooked. In this paper, we propose a novel framework that generates artificial data for the REC task, taking into account both textual and visual modalities. First, our pipeline processes existing data to create variations in the annotations. Then, it generates an image using the altered annotations as guidance. The result of this pipeline is a new dataset, called Harlequin, comprising more than 1M queries. This approach eliminates manual data collection and annotation, enabling scalability and annotations of arbitrary complexity. We pre-train three REC models on Harlequin, then fine-tune and evaluate them on human-annotated datasets. Our experiments show that pre-training on artificial data is beneficial for performance.
- [193] arXiv:2411.14808 [pdf, html, other]
-
Title: High-Resolution Image Synthesis via Next-Token PredictionComments: 30 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Denoising with a Joint-Embedding Predictive Architecture (D-JEPA), an autoregressive model, has demonstrated outstanding performance in class-conditional image generation. However, the application of next-token prediction in high-resolution text-to-image generation remains underexplored. In this paper, we introduce D-JEPA$\cdot$T2I, an extension of D-JEPA incorporating flow matching loss, designed to enable data-efficient continuous resolution learning. D-JEPA$\cdot$T2I leverages a multimodal visual transformer to effectively integrate textual and visual features and adopts Visual Rotary Positional Embedding (VoPE) to facilitate continuous resolution learning. Furthermore, we devise a data feedback mechanism that significantly enhances data utilization efficiency. For the first time, we achieve state-of-the-art \textbf{high-resolution} image synthesis via next-token prediction.
The experimental code and pretrained models will be open-sourced at \url{this https URL}.
- [194] arXiv:2411.14810 [pdf, html, other]
-
Title: Scalable Wavelength Arbitration for Microring-based DWDM TransceiversSubjects: Hardware Architecture (cs.AR); Signal Processing (eess.SP)
This paper introduces the concept of autonomous microring arbitration, or \textit{wavelength arbitration}, to address the challenge of multi-microring initialization in microring-based Dense-Wavelength-Division-Multiplexed (DWDM) transceivers. This arbitration is inherently policy-driven, defining critical system characteristics such as the spectral ordering of microrings. Furthermore, to facilitate large-scale deployment, the arbitration algorithms must operate independently of specific wavelength information and be resilient to system variability. Addressing these complexities requires a holistic approach that encompasses the entire system, from device-level variabilities to the transceiver interface - this system-wide perspective is the focus of this paper. To support efficient analysis, we develop a hierarchical framework incorporating an ideal, wavelength-aware arbitration model to examine arbitration failures at both the policy and algorithmic levels. The effectiveness of this approach is demonstrated in two ways: by analyzing the robustness of each policy in relation to device variabilities, and by developing an algorithm that achieves near-perfect alignment with the ideal model, offering superior robustness compared to the traditional sequential tuning method. The simulator code used in this paper is available at \url{this https URL}.
- [195] arXiv:2411.14811 [pdf, html, other]
-
Title: Fine-Grained Alignment in Vision-and-Language Navigation through Bayesian OptimizationSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
This paper addresses the challenge of fine-grained alignment in Vision-and-Language Navigation (VLN) tasks, where robots navigate realistic 3D environments based on natural language instructions. Current approaches use contrastive learning to align language with visual trajectory sequences, but they encounter difficulties with fine-grained vision negatives. To enhance cross-modal embeddings, we introduce a novel Bayesian Optimization-based adversarial optimization framework for creating fine-grained contrastive vision samples. To validate the proposed methodology, we conduct a series of experiments assessing the effectiveness of the enriched embeddings on fine-grained vision negatives. Experiments on two common VLN benchmarks, R2R and REVERIE, demonstrate that these embeddings benefit navigation and lead to a promising performance enhancement. Our source code and trained models are available at: this https URL.
- [196] arXiv:2411.14816 [pdf, html, other]
-
Title: Unsupervised Multi-view UAV Image Geo-localization via Iterative RenderingComments: 13 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Unmanned Aerial Vehicle (UAV) Cross-View Geo-Localization (CVGL) presents significant challenges due to the view discrepancy between oblique UAV images and overhead satellite images. Existing methods heavily rely on the supervision of labeled datasets to extract viewpoint-invariant features for cross-view retrieval. However, these methods have expensive training costs and tend to overfit the region-specific cues, showing limited generalizability to new regions. To overcome this issue, we propose an unsupervised solution that lifts the scene representation to 3D space from UAV observations for satellite image generation, providing robust representation against view distortion. By generating orthogonal images that closely resemble satellite views, our method reduces view discrepancies in feature representation and mitigates shortcuts in region-specific image pairing. To further align the rendered image's perspective with the real one, we design an iterative camera pose updating mechanism that progressively modulates the rendered query image with potential satellite targets, eliminating spatial offsets relative to the reference images. Additionally, this iterative refinement strategy enhances cross-view feature invariance through view-consistent fusion across iterations. As such, our unsupervised paradigm naturally avoids the problem of region-specific overfitting, enabling generic CVGL for UAV images without feature fine-tuning or data-driven training. Experiments on the University-1652 and SUES-200 datasets demonstrate that our approach significantly improves geo-localization accuracy while maintaining robustness across diverse regions. Notably, without model fine-tuning or paired training, our method achieves competitive performance with recent supervised methods.
- [197] arXiv:2411.14819 [pdf, other]
-
Title: Inf-sup stable space-time Local Discontinuous Galerkin method for the heat equationSubjects: Numerical Analysis (math.NA)
We propose and analyze a space-time Local Discontinuous Galerkin method for the approximation of the solution to parabolic problems. The method allows for very general discrete spaces and prismatic space-time meshes. Existence and uniqueness of a discrete solution are shown by means of an inf-sup condition, whose proof does not rely on polynomial inverse estimates. Moreover, for piecewise polynomial spaces satisfying an additional mild condition, we show a second inf-sup condition that provides an additional control of the time derivative of the discrete solution. We derive hp-a priori error bounds based on these inf-sup conditions, which we use to prove convergence rates for standard, tensor-product, and quasi-Trefftz polynomial spaces. Numerical experiments validate our theoretical results.
- [198] arXiv:2411.14821 [pdf, html, other]
-
Title: Ex-post Stability under Two-Sided Matching: Complexity and CharacterizationSubjects: Computer Science and Game Theory (cs.GT); Computational Complexity (cs.CC)
A probabilistic approach to the stable matching problem has been identified as an important research area with several important open problems. When considering random matchings, ex-post stability is a fundamental stability concept. A prominent open problem is characterizing ex-post stability and establishing its computational complexity. We investigate the computational complexity of testing ex-post stability. Our central result is that when either side has ties in the preferences/priorities, testing ex-post stability is NP-complete. The result holds even if both sides have dichotomous preferences. On the positive side, we give an algorithm, based on integer programming, that can determine a decomposition with a maximum probability of being weakly stable. We also consider stronger versions of ex-post stability (in particular robust ex-post stability and ex-post strong stability) and prove that they can be tested in polynomial time.
- [199] arXiv:2411.14823 [pdf, html, other]
-
Title: Omni-IML: Towards Unified Image Manipulation LocalizationSubjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Image manipulation can lead to misinterpretation of visual content, posing significant risks to information security. Image Manipulation Localization (IML) has thus received increasing attention. However, existing IML methods rely heavily on task-specific designs: they perform well on only one target image type but are close to random guessing on other image types, and even joint training on multiple image types causes significant performance degradation. This hinders deployment in real applications, as it notably increases maintenance costs, and the misclassification of image types leads to serious error accumulation. To this end, we propose Omni-IML, the first generalist model to unify diverse IML tasks. Specifically, Omni-IML achieves generalism by adopting the Modal Gate Encoder and the Dynamic Weight Decoder to adaptively determine the optimal encoding modality and the optimal decoder filters for each sample. We additionally propose an Anomaly Enhancement module that enhances the features of tampered regions with box supervision and helps the generalist model to extract common features across different IML tasks. We validate our approach on IML tasks across three major scenarios: natural images, document images, and face images. Without bells and whistles, our Omni-IML achieves state-of-the-art performance on all three tasks with a single unified model, providing valuable strategies and insights for real-world application and future research in generalist image forensics. Our code will be publicly available.
- [200] arXiv:2411.14825 [pdf, html, other]
-
Title: Distributed Model Checking in Graphs Classes of Bounded ExpansionSubjects: Data Structures and Algorithms (cs.DS)
We show that for every first-order logic (FO) formula $\varphi$, and every graph class $\mathcal{G}$ of bounded expansion, there exists a distributed (deterministic) algorithm that, for every $n$-node graph $G\in\mathcal{G}$ of diameter $D$, decides whether $G\models \varphi$ in $O(D+\log n)$ rounds under the standard CONGEST model. Graphs of bounded expansion encompass many classes of sparse graphs such as planar graphs, bounded-treedepth graphs, bounded-treewidth graphs, bounded-degree graphs, and graphs excluding a fixed graph $H$ as a minor or topological minor. Note that our algorithm is optimal up to a logarithmic additional term, as even a simple FO formula such as "there are two vertices of degree 3" already on trees requires $\Omega(D)$ rounds in CONGEST.
Our result extends to solving optimization problems expressed in FO (e.g., $k$-vertex cover of minimum weight), as well as to counting the number of solutions of a problem expressible in a fragment of FO (e.g., counting triangles), still running in $O(D+\log n)$ rounds under the CONGEST model. This exemplifies the contrast between sparse graphs and general graphs as far as CONGEST algorithms are concerned. For instance, Drucker, Kuhn, and Oshman [PODC 2014] showed that the problem of deciding whether a general graph contains a 4-cycle requires $\Theta(\sqrt{n}/\log n)$ rounds in CONGEST. For counting triangles, the best known algorithm of Chang, Pettie, and Zhang [SODA 2019] takes $\tilde{O}(\sqrt{n})$ rounds.
Finally, our result extends to distributed certification. We show that, for every FO formula~$\varphi$, and every graph class of bounded expansion, there exists a certification scheme for $\varphi$ using certificates on $O(\log n)$ bits. This significantly generalizes the recent result of Feuilloley, Bousquet, and Pierron [PODC 2022], which held solely for graphs of bounded treedepth.
- [201] arXiv:2411.14827 [pdf, html, other]
-
Title: Physically Interpretable Probabilistic Domain CharacterizationAnaïs Halin, Sébastien Piérard, Renaud Vandeghen, Benoît Gérin, Maxime Zanella, Martin Colot, Jan Held, Anthony Cioppa, Emmanuel Jean, Gianluca Bontempi, Saïd Mahmoudi, Benoît Macq, Marc Van DroogenbroeckSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Characterizing domains is essential for models analyzing dynamic environments, as it allows them to adapt to evolving conditions or to hand the task over to backup systems when facing conditions outside their operational domain. Existing solutions typically characterize a domain by solving a regression or classification problem, which limits their applicability as they only provide a limited summarized description of the domain. In this paper, we present a novel approach to domain characterization by characterizing domains as probability distributions. Particularly, we develop a method to predict the likelihood of different weather conditions from images captured by vehicle-mounted cameras by estimating distributions of physical parameters using normalizing flows. To validate our proposed approach, we conduct experiments within the context of autonomous vehicles, focusing on predicting the distribution of weather parameters to characterize the operational domain. This domain is characterized by physical parameters (absolute characterization) and arbitrarily predefined domains (relative characterization). Finally, we evaluate whether a system can safely operate in a target domain by comparing it to multiple source domains where safety has already been established. This approach holds significant potential, as accurate weather prediction and effective domain adaptation are crucial for autonomous systems to adjust to dynamic environmental conditions.
- [202] arXiv:2411.14829 [pdf, html, other]
-
Title: OSPtrack: A Labeled Dataset Targeting Simulated Open-Source Package ExecutionSubjects: Cryptography and Security (cs.CR)
Open-source software is a fundamental part of the internet and the cyber supply chain, but its exploitation has become more frequent. While vulnerability detection in OSS has advanced, previous work mainly focuses on static code analysis, neglecting runtime indicators. To address this, we created a dataset spanning multiple ecosystems, capturing features generated during the execution of packages and libraries in isolated environments. The dataset includes 9,461 package reports (1,962 malicious), with static and dynamic features such as files, sockets, commands, and DNS records. Labeled with verified information and detailed sub-labels for attack types, this dataset helps identify malicious indicators, especially when source code access is limited, and supports efficient detection methods during runtime.
- [203] arXiv:2411.14832 [pdf, html, other]
-
Title: VisGraphVar: A Benchmark Generator for Assessing Variability in Graph Analysis Using Large Vision-Language ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
The rapid advancement of Large Vision-Language Models (LVLMs) has revealed their immense potential. These models are increasingly capable of tackling abstract visual tasks. Geometric structures, particularly graphs with their inherent flexibility and complexity, serve as an excellent benchmark for evaluating these models' predictive capabilities. While human observers can readily identify subtle visual details and perform accurate analyses, our investigation reveals that state-of-the-art LVLMs exhibit consistent limitations in specific visual graph scenarios, especially when confronted with stylistic variations. In response to these challenges, we introduce VisGraphVar (Visual Graph Variability), a customizable benchmark generator able to produce graph images for seven distinct task categories (detection, classification, segmentation, pattern recognition, link prediction, reasoning, matching), designed to systematically evaluate the strengths and limitations of individual LVLMs. We use VisGraphVar to produce 990 graph images and evaluate six LVLMs, employing two distinct prompting strategies, namely zero-shot and chain-of-thought. The findings demonstrate that variations in visual attributes of images (e.g., node labeling and layout) and the deliberate inclusion of visual imperfections, such as overlapping nodes, significantly affect model performance. This research emphasizes the importance of a comprehensive evaluation across graph-related tasks, extending beyond reasoning alone. VisGraphVar offers valuable insights to guide the development of more reliable and robust systems capable of performing advanced visual graph analysis.
- [204] arXiv:2411.14834 [pdf, html, other]
-
Title: Gradient Masking All-at-Once: Ensemble Everything Everywhere Is Not RobustSubjects: Machine Learning (cs.LG)
Ensemble everything everywhere is a defense to adversarial examples that was recently proposed to make image classifiers robust. This defense works by ensembling a model's intermediate representations at multiple noisy image resolutions, producing a single robust classification. This defense was shown to be effective against multiple state-of-the-art attacks. Perhaps even more convincingly, it was shown that the model's gradients are perceptually aligned: attacks against the model produce noise that perceptually resembles the targeted class.
In this short note, we show that this defense is not robust to adversarial attack. We first show that the defense's randomness and ensembling method cause severe gradient masking. We then use standard adaptive attack techniques to reduce the defense's robust accuracy from 48% to 1% on CIFAR-100 and from 62% to 4% on CIFAR-10, under the $\ell_\infty$-norm threat model with $\varepsilon=8/255$.
- [205] arXiv:2411.14842 [pdf, html, other]
-
Title: Who Can Withstand Chat-Audio Attacks? An Evaluation Benchmark for Large Language ModelsSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Adversarial audio attacks pose a significant threat to the growing use of large language models (LLMs) in voice-based human-machine interactions. While existing research has primarily focused on model-specific adversarial methods, real-world applications demand a more generalizable and universal approach to audio adversarial attacks. In this paper, we introduce the Chat-Audio Attacks (CAA) benchmark, including four distinct types of audio attacks, which aims to explore the vulnerabilities of LLMs to these audio attacks in conversational scenarios. To evaluate the robustness of LLMs, we propose three evaluation strategies: Standard Evaluation, utilizing traditional metrics to quantify model performance under attacks; GPT-4o-Based Evaluation, which simulates real-world conversational complexities; and Human Evaluation, offering insights into user perception and trust. We evaluate six state-of-the-art LLMs with voice interaction capabilities, including Gemini-1.5-Pro, GPT-4o, and others, using three distinct evaluation methods on the CAA benchmark. Our comprehensive analysis reveals the impact of four types of audio attacks on the performance of these models, demonstrating that GPT-4o exhibits the highest level of resilience.
- [206] arXiv:2411.14847 [pdf, html, other]
-
Title: Dynamics-Aware Gaussian Splatting Streaming Towards Fast On-the-Fly Training for 4D ReconstructionComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
The recent development of 3D Gaussian Splatting (3DGS) has led to great interest in 4D dynamic spatial reconstruction from multi-view visual inputs. While existing approaches mainly rely on processing full-length multi-view videos for 4D reconstruction, there has been limited exploration of iterative online reconstruction methods that enable on-the-fly training and per-frame streaming. Current 3DGS-based streaming methods treat the Gaussian primitives uniformly and constantly renew the densified Gaussians, thereby overlooking the difference between dynamic and static features and also neglecting the temporal continuity in the scene. To address these limitations, we propose a novel three-stage pipeline for iterative streamable 4D dynamic spatial reconstruction. Our pipeline comprises a selective inheritance stage to preserve temporal continuity, a dynamics-aware shift stage for distinguishing dynamic and static primitives and optimizing their movements, and an error-guided densification stage to accommodate emerging objects. Our method achieves state-of-the-art performance in online 4D reconstruction, demonstrating a 20% improvement in on-the-fly training speed, superior representation quality, and real-time rendering capability. Project page: this https URL
- [207] arXiv:2411.14855 [pdf, html, other]
-
Title: Applications of fractional calculus in learned optimizationComments: NeurIPS Workshop on Optimization for Machine LearningSubjects: Machine Learning (cs.LG)
Fractional gradient descent has been studied extensively, with a focus on its ability to extend traditional gradient descent methods by incorporating fractional-order derivatives. This approach allows for more flexibility in navigating complex optimization landscapes and offers advantages in certain types of problems, particularly those involving non-linearities and chaotic dynamics. Yet, the challenge of fine-tuning the fractional order parameters remains unsolved. In this work, we demonstrate that it is possible to train a neural network to predict the order of the gradient effectively.
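Fractional gradient descent comes in several formulations; the sketch below uses one simplified Caputo-style update found in parts of the fractional GD literature, in which the step is modulated by $|\theta_k - \theta_{k-1}|^{1-\alpha}/\Gamma(2-\alpha)$. It is shown only to illustrate how the order $\alpha$ enters the update, not the paper's method of predicting the order with a neural network.

```python
import numpy as np
from math import gamma

def fractional_gd(grad_fn, theta0, alpha=0.8, lr=0.1, steps=200, eps=1e-8):
    """Simplified Caputo-style fractional gradient descent.

    alpha in (0, 1]; alpha = 1 recovers ordinary gradient descent, since
    the |theta - prev|^(1 - alpha) factor becomes 1 and Gamma(1) = 1.
    """
    theta = np.asarray(theta0, dtype=float)
    prev = theta + eps                    # lower terminal of the fractional derivative
    for _ in range(steps):
        scale = np.abs(theta - prev) ** (1.0 - alpha) / gamma(2.0 - alpha)
        step = lr * grad_fn(theta) * scale
        prev, theta = theta, theta - step
    return theta

# Example: minimize f(x) = (x - 3)^2, df/dx = 2(x - 3); converges toward x = 3
print(fractional_gd(lambda x: 2 * (x - 3), theta0=[0.0]))
```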
- [208] arXiv:2411.14856 [pdf, html, other]
-
Title: A Rewriting Theory for Quantum Lambda-CalculusSubjects: Logic in Computer Science (cs.LO)
Quantum lambda calculus has been studied mainly as an idealized programming language -- the evaluation essentially corresponds to a deterministic abstract machine. Very little work has been done to develop a rewriting theory for quantum lambda calculus. Recent advances in the theory of probabilistic rewriting give us a way to tackle this task with tools unavailable a decade ago. Our primary focus is standardization and normalization results.
- [209] arXiv:2411.14858 [pdf, html, other]
-
Title: Domain and Range Aware Synthetic Negatives Generation for Knowledge Graph Embedding ModelsComments: Accepted at the Third Learning on Graphs Conference (LoG 2024)Subjects: Artificial Intelligence (cs.AI)
Knowledge Graph Embedding models, representing entities and edges in a low-dimensional space, have been extremely successful at solving tasks related to completing and exploring Knowledge Graphs (KGs). One of the key aspects of training most of these models is teaching them to discriminate between true statements (positives) and false ones (negatives). However, defining negatives is not trivial, as facts missing from the KG are not necessarily false and a set of ground-truth negatives is hardly ever given. This makes synthetic negative generation a necessity. Different generation strategies can heavily affect the quality of the embeddings, making it a primary aspect to consider. We revamp a strategy that generates corruptions during training respecting the domain and range of relations, extend its capabilities, and show that our methods bring substantial improvement (+10% MRR) on standard benchmark datasets and over +150% MRR on a larger ontology-backed dataset.
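As a concrete illustration, a corruption step that respects relation domains and ranges can look like the following. This is a hedged sketch under assumed data structures (type sets per relation, entity lists per type), not the authors' exact implementation.

```python
import random

def domain_range_negatives(triple, domain, range_, entities, k=5):
    """Generate k corruptions of (h, r, t) that respect r's domain and range.

    domain[r] / range_[r]: sets of entity types admissible as head / tail of r
    entities[type_]:       list of entities belonging to a given type
    """
    h, r, t = triple
    negatives = []
    while len(negatives) < k:
        if random.random() < 0.5:                        # corrupt the head
            h_type = random.choice(sorted(domain[r]))
            cand = (random.choice(entities[h_type]), r, t)
        else:                                            # corrupt the tail
            t_type = random.choice(sorted(range_[r]))
            cand = (h, r, random.choice(entities[t_type]))
        if cand != triple:                               # avoid regenerating the positive
            negatives.append(cand)
    return negatives
```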
- [210] arXiv:2411.14860 [pdf, html, other]
-
Title: Ex Uno Pluria: Insights on Ensembling in Low Precision Number SystemsComments: NeurIPS 2024Subjects: Machine Learning (cs.LG)
While ensembling deep neural networks has shown promise in improving generalization performance, scaling current ensemble methods to large models remains challenging. Given that recent progress in deep learning is largely driven by scale, exemplified by the widespread adoption of large-scale neural network architectures, scalability emerges as an increasingly critical issue for machine learning algorithms in the era of large-scale models. In this work, we first showcase the potential of low precision ensembling, where ensemble members are derived from a single model within low precision number systems in a training-free manner. Our empirical analysis demonstrates the effectiveness of our proposed low precision ensembling method compared to existing ensemble approaches.
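One plausible training-free instantiation of this idea is to let each ensemble member be the same network with its weights stochastically rounded onto a low-precision grid, then average the predictions. The sketch below makes that concrete; the bit width, rounding scheme, and member count are illustrative assumptions, not necessarily the paper's exact design.

```python
import torch

@torch.no_grad()
def low_precision_ensemble(model, x, n_bits=4, n_members=8, seed=0):
    """Training-free ensemble: every member is the same classifier whose
    float weights are stochastically rounded to an n_bits symmetric grid;
    the members' softmax outputs are averaged."""
    torch.manual_seed(seed)
    base = {k: v.clone() for k, v in model.state_dict().items()}
    probs = []
    for _ in range(n_members):
        noisy = {}
        for k, v in base.items():
            if v.is_floating_point():
                scale = v.abs().max() / (2 ** (n_bits - 1) - 1) + 1e-12
                low = torch.floor(v / scale)
                frac = v / scale - low            # stochastic rounding step
                noisy[k] = (low + torch.bernoulli(frac)) * scale
            else:
                noisy[k] = v
        model.load_state_dict(noisy)
        probs.append(model(x).softmax(dim=-1))    # assumes model outputs logits
    model.load_state_dict(base)                   # restore the original weights
    return torch.stack(probs).mean(dim=0)         # averaged ensemble prediction
```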
- [211] arXiv:2411.14863 [pdf, html, other]
-
Title: Latent Schrodinger Bridge: Prompting Latent Diffusion for Fast Unpaired Image-to-Image TranslationSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Diffusion models (DMs), which enable both image generation from noise and inversion from data, have inspired powerful unpaired image-to-image (I2I) translation algorithms. However, they often require a large number of neural function evaluations (NFEs), limiting their practical applicability. In this paper, we tackle this problem with Schrodinger Bridges (SBs), which are stochastic differential equations (SDEs) between distributions with minimal transport cost. We analyze the probability flow ordinary differential equation (ODE) formulation of SBs, and observe that we can decompose its vector field into a linear combination of source predictor, target predictor, and noise predictor. Inspired by this observation, we propose Latent Schrodinger Bridges (LSBs) that approximate the SB ODE via pre-trained Stable Diffusion, and develop appropriate prompt optimization and a change-of-variables formula to match training and inference between distributions. We demonstrate that our algorithm successfully conducts competitive I2I translation in an unsupervised setting with only a fraction of the computational cost required by previous DM-based I2I methods.
- [212] arXiv:2411.14868 [pdf, other]
-
Title: Defective Edge Detection Using Cascaded Ensemble Canny OperatorComments: 2 Pages and 2 FiguresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Edge detection has been one of the most difficult challenges in computer vision because of the difficulty of identifying borders and edges in real-world images containing objects of varying kinds and sizes. Methods based on ensemble learning, which combine backbones and attention modules, have outperformed more conventional approaches such as Sobel and Canny edge detection. Nevertheless, these algorithms are still challenged by complicated scene photos. In addition, the edges identified by current methods are not refined and often include incorrect edges. In this work, we use a Cascaded Ensemble Canny operator to address these problems and detect object edges. The challenging Fresh-and-Rotten and Berkeley datasets are used to test the proposed approach in Python. In terms of performance metrics and output picture quality, the acquired results outperform the specified edge detection networks.
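A minimal version of an ensemble-of-Canny detector, combining detections from several threshold pairs by majority vote, is sketched below; the paper's cascade is more involved, and the thresholds here are illustrative.

```python
import cv2
import numpy as np

def ensemble_canny(image_path,
                   threshold_pairs=((50, 150), (100, 200), (150, 250)),
                   vote_ratio=0.5):
    """Run Canny at several (low, high) threshold pairs and keep only the
    edge pixels that a majority of the detectors agree on."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)            # suppress noise first
    votes = np.zeros_like(gray, dtype=np.float32)
    for lo, hi in threshold_pairs:
        votes += cv2.Canny(gray, lo, hi) > 0            # accumulate per-detector votes
    keep = votes >= vote_ratio * len(threshold_pairs)   # majority vote
    return (keep * 255).astype(np.uint8)
```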
- [213] arXiv:2411.14869 [pdf, html, other]
-
Title: BIP3D: Bridging 2D Images and 3D Perception for Embodied IntelligenceSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
In embodied intelligence systems, a key component is the 3D perception algorithm, which enables agents to understand their surrounding environments. Previous algorithms primarily rely on point clouds, which, despite offering precise geometric information, still constrain perception performance due to inherent sparsity, noise, and data scarcity. In this work, we introduce a novel image-centric 3D perception model, BIP3D, which leverages expressive image features with explicit 3D position encoding to overcome the limitations of point-centric methods. Specifically, we leverage pre-trained 2D vision foundation models to enhance semantic understanding, and introduce a spatial enhancer module to improve spatial understanding. Together, these modules enable BIP3D to achieve multi-view, multi-modal feature fusion and end-to-end 3D perception. In our experiments, BIP3D outperforms current state-of-the-art results on the EmbodiedScan benchmark, achieving improvements of 5.69% in the 3D detection task and 15.25% in the 3D visual grounding task.
- [214] arXiv:2411.14870 [pdf, html, other]
-
Title: Application of AI to formal methods -- an analysis of current trendsSubjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
With artificial intelligence (AI) being well established within the daily lives of research communities, we turn our gaze toward an application area that appears intuitively unsuited for probabilistic decision-making: the area of formal methods (FM). FM aim to provide sound and understandable reasoning about problems in computer science, which seemingly collides with the black-box nature that inhibits many AI approaches. However, many researchers have crossed this gap and applied AI techniques to enhance FM approaches. As this dichotomy of FM and AI sparked our interest, we conducted a systematic mapping study to map the current landscape of research publications. In this study, we investigate the previous five years of applied AI to FM (2019-2023), as these correspond to periods of high activity. This investigation results in 189 entries, which we explore in more detail to find current trends, highlight research gaps, and give suggestions for future research.
- [215] arXiv:2411.14871 [pdf, html, other]
-
Title: Prioritize Denoising Steps on Diffusion Model Preference Alignment via Explicit Denoised Distribution EstimationSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Diffusion models have shown remarkable success in text-to-image generation, making alignment methods for these models increasingly important. A key challenge is the sparsity of preference labels, which are typically available only at the terminal of denoising trajectories. This raises the issue of how to assign credit across denoising steps based on these sparse labels. In this paper, we propose Denoised Distribution Estimation (DDE), a novel method for credit assignment. Unlike previous approaches that rely on auxiliary models or hand-crafted schemes, DDE derives its strategy more explicitly. The proposed DDE directly estimates the terminal denoised distribution from the perspective of each step. It is equipped with two estimation strategies and capable of representing the entire denoising trajectory with a single model inference. Theoretically and empirically, we show that DDE prioritizes optimizing the middle part of the denoising trajectory, resulting in a novel and effective credit assignment scheme. Extensive experiments demonstrate that our approach achieves superior performance, both quantitatively and qualitatively.
- [216] arXiv:2411.14873 [pdf, other]
-
Title: Implementation of Real-Time Lane Detection on Autonomous Mobile RobotComments: 4 pages, 9 figures 2 tablesJournal-ref: 2024 IEEE International Conference on Advanced Telecommunication and Networking Technologies (ATNT)Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
This paper describes the implementation of a learning-based lane detection algorithm on an Autonomous Mobile Robot. It aims to implement the Ultra Fast Lane Detection algorithm for real-time application on the SEATER P2MC-BRIN prototype using a camera and optimize its performance on the Jetson Nano platform. Preliminary experiments were conducted to evaluate the algorithm's performance in terms of data processing speed and accuracy using two types of datasets: outdoor using a public dataset and indoor using an internal dataset from the indoor area of the BRIN Workshop Building in Bandung. The experiments revealed that the algorithm runs more optimally on the Jetson Nano platform after conversion to TensorRT compared to the ONNX model, achieving processing times of approximately 101 ms using CULane and 105 ms using TuSimple, about 22 times faster than the previous model. While the algorithm demonstrates good accuracy on the outdoor public dataset, its performance falls short on the indoor dataset. Future work should focus on transfer learning and fine-tuning to enhance indoor lane detection accuracy.
- [217] arXiv:2411.14877 [pdf, html, other]
-
Title: Astro-HEP-BERT: A bidirectional language model for studying the meanings of concepts in astrophysics and high energy physicsComments: 7 pages, 4 figures, 1 tableSubjects: Computation and Language (cs.CL); History and Philosophy of Physics (physics.hist-ph)
I present Astro-HEP-BERT, a transformer-based language model specifically designed for generating contextualized word embeddings (CWEs) to study the meanings of concepts in astrophysics and high-energy physics. Built on a general pretrained BERT model, Astro-HEP-BERT underwent further training over three epochs using the Astro-HEP Corpus, a dataset I curated from 21.84 million paragraphs extracted from more than 600,000 scholarly articles on arXiv, all belonging to at least one of these two scientific domains. The project demonstrates both the effectiveness and feasibility of adapting a bidirectional transformer for applications in the history, philosophy, and sociology of science (HPSS). The entire training process was conducted using freely available code, pretrained weights, and text inputs, completed on a single MacBook Pro Laptop (M2/96GB). Preliminary evaluations indicate that Astro-HEP-BERT's CWEs perform comparably to domain-adapted BERT models trained from scratch on larger datasets for domain-specific word sense disambiguation and induction and related semantic change analyses. This suggests that retraining general language models for specific scientific domains can be a cost-effective and efficient strategy for HPSS researchers, enabling high performance without the need for extensive training from scratch.
- [218] arXiv:2411.14878 [pdf, other]
-
Title: Physical and Software Based Fault Injection Attacks Against TEEs in Mobile Devices: A Systemisation of KnowledgeComments: 25 pagesSubjects: Cryptography and Security (cs.CR)
Trusted Execution Environments (TEEs) are critical components of modern secure computing, providing isolated zones in processors to safeguard sensitive data and execute secure operations. Despite their importance, TEEs are increasingly vulnerable to fault injection (FI) attacks, including both physical methods, such as Electromagnetic Fault Injection (EMFI), and software-based techniques. This survey examines these FI methodologies, exploring their ability to disrupt TEE operations and expose vulnerabilities in devices ranging from smartphones and IoT systems to cloud platforms.
The study highlights the evolution and effectiveness of non-invasive techniques, such as EMFI, which induce faults through electromagnetic disturbances without physical modifications to hardware, making them harder to detect and mitigate. Real-world case studies illustrate the significant risks posed by these attacks, including unauthorised access, privilege escalation, and data corruption. In addition, the survey identifies gaps in existing TEE security architectures and emphasises the need for enhanced countermeasures, such as dynamic anomaly detection and updated threat models.
The findings underline the importance of interdisciplinary collaboration to address these vulnerabilities, involving researchers, manufacturers, and policymakers. This survey provides actionable insights and recommendations to guide the development of more robust TEE architectures in mobile devices, fortify FI resilience, and shape global security standards. By advancing TEE security, this research aims to protect critical digital infrastructure and maintain trust in secure computing systems worldwide.
- [219] arXiv:2411.14879 [pdf, html, other]
-
Title: Random Permutation Codes: Lossless Source Coding of Non-Sequential DataComments: Ph.D. ThesisSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
This thesis deals with the problem of communicating and storing non-sequential data. We investigate this problem through the lens of lossless source coding, also sometimes referred to as lossless compression, from both an algorithmic and information-theoretic perspective.
Lossless compression algorithms typically preserve the ordering in which data points are compressed. However, there are data types where order is not meaningful, such as collections of files, rows in a database, nodes in a graph, and, notably, datasets in machine learning applications.
Compressing with traditional algorithms is possible if we pick an order for the elements and communicate the corresponding ordered sequence. However, unless the order information is somehow removed during the encoding process, this procedure will be sub-optimal, because the order contains information and therefore more bits are used to represent the source than are truly necessary.
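The overhead is easy to quantify: storing a collection of n distinct elements as an ordered sequence spends up to log2(n!) extra bits on the ordering (for multisets with repeats, log2 of the number of distinct permutations). A quick illustration:

```python
from math import lgamma, log

def order_overhead_bits(n: int) -> float:
    """Upper bound on the bits spent encoding the order of n distinct
    elements stored as a sequence: log2(n!) = lgamma(n + 1) / ln(2)."""
    return lgamma(n + 1) / log(2)

# A collection of one million distinct records stored as an ordered
# sequence carries roughly 18.5 Mbit (about 2.3 MB) of ordering overhead.
print(f"{order_overhead_bits(10**6) / 8 / 1e6:.2f} MB")
```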
In this work we give a formal definition for non-sequential objects as random sets of equivalent sequences, which we refer to as Combinatorial Random Variables (CRVs). The definition of equivalence, formalized as an equivalence relation, establishes the non-sequential data type represented by the CRV. The achievable rates of CRVs are fully characterized as a function of the equivalence relation as well as the data distribution.
The optimal rates of CRVs are achieved within the family of Random Permutation Codes (RPCs) developed in later chapters. RPCs randomly select one-of-many possible sequences that can represent the instance of the CRV. Specialized RPCs are given for the case of multisets, graphs, and partitions/clusterings, providing new algorithms for compression of databases, social networks, and web data in the JSON file format.
- [220] arXiv:2411.14880 [pdf, html, other]
-
Title: Leveraging Hierarchical Prototypes as the Verbalizer for Implicit Discourse Relation RecognitionSubjects: Computation and Language (cs.CL)
Implicit discourse relation recognition involves determining relationships that hold between spans of text that are not linked by an explicit discourse connective. In recent years, the pre-train, prompt, and predict paradigm has emerged as a promising approach for tackling this task. However, previous work solely relied on manual verbalizers for implicit discourse relation recognition, which suffer from issues of ambiguity and even incorrectness. To overcome these limitations, we leverage the prototypes that capture certain class-level semantic features and the hierarchical label structure for different classes as the verbalizer. We show that our method improves on competitive baselines. In addition, our proposed approach can be extended to enable zero-shot cross-lingual learning, facilitating the recognition of discourse relations in languages with scarce resources. These advancements validate the practicality and versatility of our approach in addressing the issues of implicit discourse relation recognition across different languages.
- [221] arXiv:2411.14883 [pdf, html, other]
-
Title: Boundless Across Domains: A New Paradigm of Adaptive Feature and Cross-Attention for Domain Generalization in Medical Image SegmentationComments: 5 pages, 3 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Domain-invariant representation learning is a powerful method for domain generalization. Previous approaches face challenges such as high computational demands, training instability, and limited effectiveness with high-dimensional data, potentially leading to the loss of valuable features. To address these issues, we hypothesize that an ideal generalized representation should exhibit similar pattern responses within the same channel across cross-domain images. Based on this hypothesis, we use deep features from the source domain as queries, and deep features from the generated domain as keys and values. Through a cross-channel attention mechanism, the original deep features are reconstructed into robust regularization representations, forming an explicit constraint that guides the model to learn domain-invariant representations. Style augmentation is another common approach; however, existing methods typically generate new styles through convex combinations of source domains, which limits the diversity of training samples by confining the generated styles to the original distribution. To overcome this limitation, we propose an Adaptive Feature Blending (AFB) method that generates out-of-distribution samples while exploring the in-distribution space, significantly expanding the domain range. Extensive experimental results demonstrate that our proposed methods achieve superior performance on two standard domain generalization benchmarks for medical image segmentation.
- [222] arXiv:2411.14887 [pdf, html, other]
-
Title: OMP4Py: a pure Python implementation of OpenMPComments: 13 pages, 11 figuresSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Programming Languages (cs.PL)
Python demonstrates lower performance in comparison to traditional high performance computing (HPC) languages such as C, C++, and Fortran. This performance gap is largely due to Python's interpreted nature and the Global Interpreter Lock (GIL), which hampers multithreading efficiency. However, the latest version of Python includes the necessary changes to make the interpreter thread-safe, allowing Python code to run without the GIL. This important update will enable users to fully exploit multithreading parallelism in Python. In order to facilitate that task, this paper introduces OMP4Py, the first pure Python implementation of OpenMP. We demonstrate that it is possible to bring OpenMP's familiar directive-based parallelization paradigm to Python, allowing developers to write parallel code with the same level of control and flexibility as in C, C++, or Fortran. The experimental evaluation shows that OMP4Py significantly impacts the performance of various types of applications, although the current threading limitations of Python's interpreter (v3.13) reduce its effectiveness for numerical applications.
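OMP4Py's own directives are not shown in this listing, so the snippet below is only a generic, standard-library illustration of the loop-level thread parallelism that a free-threaded (GIL-less) CPython makes profitable for CPU-bound work; it does not depict OMP4Py's actual interface.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for(fn, iterable, num_threads=4):
    """Generic parallel-for over an iterable. On the free-threaded CPython
    build, CPU-bound loop bodies can run concurrently across threads."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(fn, iterable))

# Example: square a range of numbers across 4 threads
print(parallel_for(lambda i: i * i, range(8)))
```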
- [223] arXiv:2411.14894 [pdf, html, other]
-
Title: The Dynamics of Innovation in Open Source Software EcosystemsSubjects: Software Engineering (cs.SE); Social and Information Networks (cs.SI)
Software libraries are the elementary building blocks of open source software ecosystems, extending the capabilities of programming languages beyond their standard libraries. Although ecosystem health is often quantified using data on libraries and their interdependencies, we know little about the rate at which new libraries are developed and used. Here we study imports of libraries in 12 different programming language ecosystems within millions of Stack Overflow posts over a 15 year period. New libraries emerge at a remarkably predictable sub-linear rate within ecosystems per post. As a consequence, the distribution of the frequency of use of libraries in all ecosystems is highly concentrated: the most widely used libraries are used many times more often than the average. Although new libraries come out more slowly over time, novel combinations of libraries appear at an approximately linear rate, suggesting that recombination is a key innovation process in software. Newer users are more likely to use new libraries and new combinations, and we find significant variation in the rates of innovation between countries. Our work links the evolution of OSS ecosystems to the literature on the dynamics of innovation, revealing how ecosystems grow and highlighting implications for sustainability.
- [224] arXiv:2411.14896 [pdf, html, other]
-
Title: Evaluating LLM Prompts for Data Augmentation in Multi-label Classification of Ecological TextsComments: Ivannikov ISPRAS Open Conference (ISPRAS) 2024Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Large language models (LLMs) play a crucial role in natural language processing (NLP) tasks, improving the understanding, generation, and manipulation of human language across domains such as translating, summarizing, and classifying text. Previous studies have demonstrated that instruction-based LLMs can be effectively utilized for data augmentation to generate diverse and realistic text samples. This study applied prompt-based data augmentation to detect mentions of green practices in Russian social media. Detecting green practices in social media aids in understanding their prevalence and helps formulate recommendations for scaling eco-friendly actions to mitigate environmental issues. We evaluated several prompts for augmenting texts in a multi-label classification task, either by rewriting existing datasets using LLMs, generating new data, or combining both approaches. Our results revealed that all strategies improved classification performance compared to the models fine-tuned only on the original dataset, outperforming baselines in most cases. The best results were obtained with the prompt that paraphrased the original text while clearly indicating the relevant categories.
- [225] arXiv:2411.14901 [pdf, html, other]
-
Title: ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long VideosSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Large language models (LLMs) excel at retrieving information from lengthy text, but their vision-language counterparts (VLMs) face difficulties with hour-long videos, especially for temporal grounding. Specifically, these VLMs are constrained by frame limitations, often losing essential temporal details needed for accurate event localization in extended video content. We propose ReVisionLLM, a recursive vision-language model designed to locate events in hour-long videos. Inspired by human search strategies, our model initially targets broad segments of interest, progressively revising its focus to pinpoint exact temporal boundaries. Our model can seamlessly handle videos of vastly different lengths, from minutes to hours. We also introduce a hierarchical training strategy that starts with short clips to capture distinct events and progressively extends to longer videos. To our knowledge, ReVisionLLM is the first VLM capable of temporal grounding in hour-long videos, outperforming previous state-of-the-art methods across multiple datasets by a significant margin (+2.6% on MAD). The code is available at this https URL.
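The recursive coarse-to-fine search strategy can be sketched independently of the model. Below, score_fn stands in for a VLM relevance score over a video segment; the fan-out, stopping length, and top-k values are illustrative assumptions, not the paper's exact settings.

```python
def recursive_grounding(score_fn, start, end, min_len=60.0, top_k=2):
    """Coarse-to-fine event localization sketch.

    score_fn(s, e) -> float: relevance of video segment [s, e) to the query
    (e.g., produced by a VLM over sparsely sampled frames). Returns
    (score, start, end) candidates at the finest level, best first.
    """
    if end - start <= min_len:                     # fine enough: stop recursing
        return [(score_fn(start, end), start, end)]
    third = (end - start) / 3.0
    segments = [(start + i * third, start + (i + 1) * third) for i in range(3)]
    scored = sorted(segments, key=lambda seg: score_fn(*seg), reverse=True)
    results = []
    for s, e in scored[:top_k]:                    # revise focus on best segments
        results.extend(recursive_grounding(score_fn, s, e, min_len, top_k))
    return sorted(results, reverse=True)

# Toy usage: a synthetic relevance score peaking around t = 1234 s
score = lambda s, e: -abs((s + e) / 2 - 1234.0)
print(recursive_grounding(score, 0.0, 3600.0)[0])  # best segment contains t = 1234
```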
- [226] arXiv:2411.14904 [pdf, html, other]
-
Title: Exploring Kolmogorov-Arnold Networks for Interpretable Time Series ClassificationSubjects: Machine Learning (cs.LG)
Time series classification is a relevant step supporting decision-making processes in various domains, and deep neural models have shown promising performance.
Despite significant advancements in deep learning, the theoretical understanding of how and why complex architectures function remains limited, prompting the need for more interpretable models. Recently, Kolmogorov-Arnold Networks (KANs) have been proposed as a more interpretable alternative. While KAN-related research is growing rapidly, the study of KAN architectures for time series classification has to date been limited.
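For context, a KAN layer places a learnable one-dimensional function on every input-output edge and sums the results, in contrast to an MLP's fixed activations. The sketch below approximates those edge functions with Gaussian radial basis functions, as some fast KAN variants do, rather than B-splines; all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RBFKANLayer(nn.Module):
    """Minimal KAN-style layer: each input-output edge carries its own
    learnable 1-D function, approximated with Gaussian radial bases."""
    def __init__(self, in_dim, out_dim, n_centers=8, x_min=-2.0, x_max=2.0):
        super().__init__()
        self.register_buffer("centers", torch.linspace(x_min, x_max, n_centers))
        self.h = (x_max - x_min) / (n_centers - 1)          # basis width
        self.coef = nn.Parameter(torch.randn(in_dim, out_dim, n_centers) * 0.1)

    def forward(self, x):                                   # x: (batch, in_dim)
        # Gaussian basis responses per scalar input: (batch, in_dim, n_centers)
        phi = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.h) ** 2)
        # Evaluate each edge's learned function, then sum over the inputs
        return torch.einsum("bik,iok->bo", phi, self.coef)  # (batch, out_dim)

layer = RBFKANLayer(in_dim=16, out_dim=8)
print(layer(torch.randn(4, 16)).shape)                      # torch.Size([4, 8])
```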
In this paper, we aim to conduct a comprehensive and robust exploration of the KAN architecture for time series classification on the UCR benchmark. More specifically, we look at a) how reference architectures for forecasting transfer to classification, b) the influence of hyperparameters and implementation choices on classification performance, with a view to finding the configuration that performs best on the selected benchmark, c) the complexity trade-offs, and d) the interpretability advantages. Our results show that (1) Efficient KAN outperforms MLP in performance and computational efficiency, showcasing its suitability for classification tasks. (2) Efficient KAN is more stable than KAN across grid sizes, depths, and layer configurations, particularly with lower learning rates. (3) KAN maintains competitive accuracy compared to state-of-the-art models like HIVE-COTE2, with smaller architectures and faster training times, supporting its balance of performance and transparency. (4) The interpretability of the KAN model aligns with findings from SHAP analysis, reinforcing its capacity for transparent decision-making.
- [227] arXiv:2411.14905 [pdf, html, other]
-
Title: Feasibility Study for Supporting Static Malware Analysis Using LLMSubjects: Cryptography and Security (cs.CR)
Large language models (LLMs) are becoming more advanced and widespread and have shown their applicability to various domains, including cybersecurity. Static malware analysis is one of the most important tasks in cybersecurity; however, it is time-consuming and requires a high level of expertise. Therefore, we conducted a demonstration experiment focusing on whether an LLM can be used to support static analysis. First, we evaluated the ability of the LLM to explain malware functionality. The results showed that the LLM can generate descriptions that cover functions with an accuracy of up to 90.9\%. In addition, we asked six static analysts to perform a pseudo static analysis task using LLM explanations to verify that the LLM can be used in practice. Through subsequent questionnaires and interviews with the participants, we also demonstrated the practical applicability of LLMs. Lastly, we summarized the problems and required functions when using an LLM as static analysis support, as well as recommendations for future research opportunities.
- [228] arXiv:2411.14907 [pdf, html, other]
-
Title: DAIRHuM: A Platform for Directly Aligning AI Representations with Human Musical Judgments applied to Carnatic MusicComments: 4 Pages, ICASSP workshop submissionSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Quantifying and aligning music AI model representations with human behavior is an important challenge in the field of MIR. This paper presents a platform for exploring the Direct alignment between AI music model Representations and Human Musical judgments (DAIRHuM). It is designed to enable musicians and experimentalists to label similarities in a dataset of music recordings, and examine a pre-trained model's alignment with their labels using quantitative scores and visual plots. DAIRHuM is applied to analyze alignment between NSynth representations, and a rhythmic duet between two percussionists in a Carnatic quartet ensemble, an example of a genre where annotated data is scarce and assessing alignment is non-trivial. The results demonstrate significant findings on model alignment with human judgments of rhythmic harmony, while highlighting key differences in rhythm perception and music similarity judgments specific to Carnatic music. This work is among the first efforts to enable users to explore human-AI model alignment in Carnatic music and advance MIR research in Indian music while dealing with data scarcity and cultural specificity. The development of this platform provides greater accessibility to music AI tools for under-represented genres.
- [229] arXiv:2411.14908 [pdf, html, other]
-
Title: Reactive Robot Navigation Using Quasi-conformal Mappings and Control Barrier FunctionsSubjects: Robotics (cs.RO); Dynamical Systems (math.DS); Optimization and Control (math.OC)
This paper presents a robot control algorithm suitable for safe reactive navigation tasks in cluttered environments. The proposed approach consists of transforming the robot workspace into the \emph{ball world}, an artificial representation where all obstacle regions are closed balls. Starting from a polyhedral representation of obstacles in the environment, obtained using exteroceptive sensor readings, a computationally efficient mapping to ball-shaped obstacles is constructed using quasi-conformal mappings and Möbius transformations. The geometry of the ball world is amenable to provably safe navigation tasks achieved via control barrier functions employed to ensure collision-free robot motions with guarantees both on safety and on the absence of deadlocks. The performance of the proposed navigation algorithm is showcased and analyzed via extensive simulations and experiments performed using different types of robotic systems, including manipulators and mobile robots.
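Möbius transformations, one building block of the mapping used here, are simple to state: f(z) = (az + b)/(cz + d) with ad - bc != 0, and they map circles and lines to circles and lines. The snippet below is a textbook numerical check of a disk automorphism, not the paper's obstacle mapping.

```python
import numpy as np

def mobius(z, a, b, c, d):
    """Mobius transformation f(z) = (a z + b) / (c z + d), with ad - bc != 0.
    Such maps send circles/lines to circles/lines, which is what makes
    ball-shaped obstacle worlds convenient to rearrange."""
    assert abs(a * d - b * c) > 1e-12, "degenerate transformation"
    return (a * z + b) / (c * z + d)

# Map the unit disk to itself while moving the point p to the origin:
# f(z) = (z - p) / (1 - conj(p) z), i.e. a = 1, b = -p, c = -conj(p), d = 1
p = 0.4 + 0.2j
boundary = np.exp(1j * np.linspace(0, 2 * np.pi, 8))
print(np.round(np.abs(mobius(boundary, 1, -p, -np.conj(p), 1)), 6))  # all ~1.0
```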
- [230] arXiv:2411.14913 [pdf, html, other]
-
Title: Enhancing Exploration with Diffusion Policies in Hybrid Off-Policy RL: Application to Non-Prehensile ManipulationComments: 8 pagesSubjects: Robotics (cs.RO)
Learning diverse policies for non-prehensile manipulation is essential for improving skill transfer and generalization to out-of-distribution scenarios. In this work, we enhance exploration through a two-fold approach within a hybrid framework that tackles both discrete and continuous action spaces. First, we model the continuous motion parameter policy as a diffusion model, and second, we incorporate this into a maximum entropy reinforcement learning framework that unifies both the discrete and continuous components. The discrete action space, such as contact point selection, is optimized through Q-value function maximization, while the continuous part is guided by a diffusion-based policy. This hybrid approach leads to a principled objective, where the maximum entropy term is derived as a lower bound using structured variational inference. We propose the Hybrid Diffusion Policy algorithm (HyDo) and evaluate its performance on both simulation and zero-shot sim2real tasks. Our results show that HyDo encourages more diverse behavior policies, leading to significantly improved success rates across tasks - for example, increasing from 53% to 72% on a real-world 6D pose alignment task. Project page: this https URL
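To make the hybrid action selection concrete, here is a minimal sketch under our own assumptions (`q_net` and `eps_net` are hypothetical trained networks, and the noise schedule is a toy choice): the discrete contact point is chosen by maximizing Q-values, while the continuous motion parameters are drawn by standard DDPM ancestral sampling conditioned on that choice.

```python
# Sketch of hybrid action selection (assumptions ours: `q_net` and
# `eps_net` are hypothetical trained networks; the schedule is a toy).
import torch

@torch.no_grad()
def select_action(q_net, eps_net, state, dim, T=50):
    a_disc = int(torch.argmax(q_net(state)))     # discrete part: argmax Q
    betas = torch.linspace(1e-4, 0.02, T)        # toy variance schedule
    alphas = 1.0 - betas
    abars = torch.cumprod(alphas, dim=0)
    x = torch.randn(dim)                         # start from pure noise
    for t in reversed(range(T)):
        eps = eps_net(x, state, a_disc, t)       # learned noise predictor
        mean = (x - betas[t] / (1 - abars[t]).sqrt() * eps) / alphas[t].sqrt()
        x = mean + betas[t].sqrt() * torch.randn(dim) if t > 0 else mean
    return a_disc, x
```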
- [231] arXiv:2411.14914 [pdf, html, other]
-
Title: A Reproducibility and Generalizability Study of Large Language Models for Query GenerationSubjects: Information Retrieval (cs.IR)
Systematic literature reviews (SLRs) are a cornerstone of academic research, yet they are often labour-intensive and time-consuming due to the detailed literature curation process. The advent of generative AI and large language models (LLMs) promises to revolutionize this process by assisting researchers in several tedious tasks, one of them being the generation of effective Boolean queries that will select the publications to consider including in a review. This paper presents an extensive study of Boolean query generation using LLMs for systematic reviews, reproducing and extending the work of Wang et al. and Alaniz et al. Our study investigates the replicability and reliability of results achieved using ChatGPT and compares its performance with open-source alternatives like Mistral and Zephyr to provide a more comprehensive analysis of LLMs for query generation.
To this end, we implemented a pipeline that automatically creates a Boolean query for a given review topic using a previously selected LLM, retrieves all documents for this query from the PubMed database, and then evaluates the results. With this pipeline, we first assess whether the results obtained using ChatGPT for query generation are reproducible and consistent. We then generalize our results by analyzing open-source models and assessing their efficacy in generating Boolean queries.
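A compressed sketch of such a pipeline is shown below (our illustration, assuming Biopython for PubMed access; `generate_query` stands in for whichever LLM is evaluated, and the e-mail is a placeholder):

```python
# Compressed sketch of the pipeline (our illustration): an LLM drafts a
# Boolean query, PubMed is searched via Biopython's Entrez wrapper, and
# precision/recall are computed against the review's included studies.
from Bio import Entrez

Entrez.email = "you@example.org"      # required by NCBI; placeholder

def generate_query(topic: str) -> str:
    raise NotImplementedError("stands in for whichever LLM is evaluated")

def evaluate(topic, included_pmids):
    query = generate_query(topic)
    handle = Entrez.esearch(db="pubmed", term=query, retmax=10000)
    record = Entrez.read(handle)
    handle.close()
    retrieved = set(record["IdList"])
    tp = len(retrieved & included_pmids)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(included_pmids) if included_pmids else 0.0
    return precision, recall
```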
Finally, we conduct a failure analysis to identify and discuss the limitations and shortcomings of using LLMs for Boolean query generation. This examination helps to understand the gaps and potential areas for improvement in the application of LLMs to information retrieval tasks. Our findings highlight the strengths, limitations, and potential of LLMs in the domain of information retrieval and literature review automation.
- [232] arXiv:2411.14917 [pdf, html, other]
-
Title: Task-Aware Robotic Grasping by evaluating Quality Diversity Solutions through Foundation ModelsComments: 8 pages, 5 figuresSubjects: Robotics (cs.RO)
Task-aware robotic grasping is a challenging problem that requires the integration of semantic understanding and geometric reasoning. Traditional grasp planning approaches focus on stable or feasible grasps, often disregarding the specific tasks the robot needs to accomplish. This paper proposes a novel framework that leverages Large Language Models (LLMs) and Quality Diversity (QD) algorithms to enable zero-shot task-conditioned grasp selection. The framework segments objects into meaningful subparts and labels each subpart semantically, creating structured representations that can be used to prompt an LLM. By coupling semantic and geometric representations of an object's structure, the LLM's knowledge about tasks and which parts to grasp can be applied in the physical world. The QD-generated grasp archive provides a diverse set of grasps, allowing us to select the most suitable grasp based on the task. We evaluate the proposed method on a subset of the YCB dataset, where a Franka Emika robot is assigned to perform various actions based on object-specific task requirements. We created a ground truth by conducting a survey with six participants to determine the best grasp region for each task-object combination according to human intuition. The model was evaluated on 12 different objects across 4--7 object-specific tasks, achieving a weighted intersection over union (IoU) of 76.4% when compared to the survey data.
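The final selection step can be pictured with a small sketch (ours, not the authors' implementation), assuming a `part_of` function that maps a grasp to the subpart label it makes contact with:

```python
# Illustrative selection step (ours, not the authors' implementation),
# assuming `part_of` maps a grasp to the subpart it touches.
def select_grasp(archive, part_label, part_of):
    """archive: list of (grasp, quality) pairs from the QD algorithm;
    part_label: the subpart the LLM chose for the task at hand."""
    candidates = [(g, q) for g, q in archive if part_of(g) == part_label]
    pool = candidates if candidates else archive      # fall back to all grasps
    return max(pool, key=lambda gq: gq[1])[0]         # highest-quality grasp
```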
- [233] arXiv:2411.14919 [pdf, html, other]
-
Title: Optimal Beamforming for Multi-User Continuous Aperture Array (CAPA) SystemsComments: 13 pages, 6 figuresSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
The optimal beamforming design for multi-user continuous aperture array (CAPA) systems is proposed. In contrast to a conventional spatially discrete array (SPDA), the beamformer for CAPA is a continuous function rather than a discrete vector or matrix, rendering beamforming optimization a non-convex, integral-based functional programming problem. To address this challenging issue, we first derive the closed-form optimal structure of the CAPA beamformer for maximizing generic system utility functions, by using the Lagrangian duality and the calculus of variations. The derived optimal structure is a linear combination of the continuous channel responses for CAPA, with the linear weights determined by the channel correlations. As a further advance, a monotonic optimization method is proposed for obtaining globally optimal CAPA beamforming based on the derived optimal structure. More particularly, a closed-form fixed-point iteration is proposed to obtain the globally optimal solution to the power minimization problem for CAPA beamforming. Furthermore, based on the optimal structure, the low-complexity maximum ratio transmission (MRT), zero-forcing (ZF), and minimum mean-squared error (MMSE) designs for CAPA beamforming are derived. It is theoretically proved that: 1) the MRT and ZF designs are asymptotically optimal in low and high signal-to-noise ratio (SNR) regimes, respectively, and 2) the MMSE design is optimal for signal-to-leakage-plus-noise ratio (SLNR) maximization. Our numerical results validate the effectiveness of the proposed designs and reveal that: i) CAPA achieves significant communication performance gain over SPDA, and ii) the MMSE design achieves nearly optimal performance in most cases, while the MRT and ZF designs achieve nearly optimal performance in specific cases.
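To make the stated structure concrete, a schematic rendering in our own notation (the symbols $a_{kj}$, $h_j(s)$, and the aperture $\mathcal{A}$ are our assumptions, not the paper's) is:

```latex
% Schematic of the optimal CAPA beamformer structure described above
% (notation ours): the beamformer for user k is a linear combination of
% all users' continuous channel responses h_j(s) over the aperture A,
% with scalar weights a_{kj} determined by the channel correlations.
\[
  w_k(s) = \sum_{j=1}^{K} a_{kj}\, h_j(s), \qquad s \in \mathcal{A}.
\]
```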
- [234] arXiv:2411.14922 [pdf, html, other]
-
Title: GOT4Rec: Graph of Thoughts for Sequential RecommendationSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
With the advancement of large language models (LLMs), researchers have explored various methods to optimally leverage their comprehension and generation capabilities in sequential recommendation scenarios. However, several challenges persist in this endeavor. Firstly, most existing approaches rely on the input-output prompting paradigm, which can result in irrelevant or inaccurate responses. Secondly, while there have been attempts to enhance LLMs using prompting strategies such as chain-of-thought (CoT), these efforts have not fully harnessed the reasoning abilities of LLMs or effectively captured the multifaceted information contained within user sequences. To address these limitations, we propose GOT4Rec, a sequential recommendation method that utilizes the graph of thoughts (GoT) prompting strategy. Specifically, we identify and utilize three key types of information within user history sequences: short-term interests, long-term interests and collaborative information from other users. Our approach enables LLMs to independently reason and generate recommendations based on these distinct types of information, subsequently aggregating the results within the GoT framework to derive the final recommended items. This method allows LLMs, with enhanced reasoning capabilities, to more effectively consider the diverse information within user sequences, resulting in more accurate recommendations and more comprehensive explanations. Extensive experiments on real-world datasets demonstrate the effectiveness of GOT4Rec, indicating that it outperforms existing state-of-the-art baselines. Our code is available at this https URL.
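Schematically, the flow can be sketched as follows (our simplification; `llm` is a hypothetical text-completion function), with one reasoning branch per information type and a final aggregation step:

```python
# Schematic of the GoT-style flow described above (our simplification;
# `llm` is a hypothetical text-completion function).
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in a chat/completion client here")

def got_recommend(recent_items, full_history, neighbor_items):
    short = llm(f"Recent items: {recent_items}. Infer short-term interests.")
    long_ = llm(f"Full history: {full_history}. Infer long-term interests.")
    collab = llm(f"Similar users liked: {neighbor_items}. "
                 "Infer collaborative signals.")
    return llm("Aggregate the three analyses below and recommend 10 items:\n"
               f"1) {short}\n2) {long_}\n3) {collab}")
```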
- [235] arXiv:2411.14923 [pdf, other]
-
Title: Predictive Modeling For Real-Time Personalized Health Monitoring in Muscular Dystrophy ManagementSubjects: Machine Learning (cs.LG)
Muscular dystrophy (MD) is a group of genetic disorders that progressively affect the strength and functioning of muscles, affecting millions of people worldwide. Because MD is progressive and lifelong, it requires continuous follow-up care. This conceptual paper proposes an Internet of Things-based system to support the management of MD through remote, multi-dimensional monitoring of patients in order to provide real-time health status updates. Traditional methods have failed to give actionable data in real time, hence denying healthcare providers the opportunity to make evidence-based decisions. Technology-driven approaches are urgently needed to provide deep insights into disease progression and patient health. The proposed system aims to enhance treatment strategies, enabling patients to better manage their condition and giving healthcare professionals more confidence in their management decisions.
- [236] arXiv:2411.14925 [pdf, html, other]
-
Title: Purrfessor: A Fine-tuned Multimodal LLaVA Diet Health ChatbotComments: 10 pages, 5 figuresSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
This study introduces Purrfessor, an innovative AI chatbot designed to provide personalized dietary guidance through interactive, multimodal engagement. Leveraging the Large Language-and-Vision Assistant (LLaVA) model fine-tuned with food and nutrition data and a human-in-the-loop approach, Purrfessor integrates visual meal analysis with contextual advice to enhance user experience and engagement. We conducted two studies to evaluate the chatbot's performance and user experience: (a) simulation assessments and human validation were conducted to examine the performance of the fine-tuned model; (b) a 2 (Profile: Bot vs. Pet) by 3 (Model: GPT-4 vs. LLaVA vs. Fine-tuned LLaVA) experiment revealed that Purrfessor significantly enhanced users' perceptions of care ($\beta = 1.59$, $p = 0.04$) and interest ($\beta = 2.26$, $p = 0.01$) compared to the GPT-4 bot. Additionally, user interviews highlighted the importance of interaction design details, emphasizing the need for responsiveness, personalization, and guidance to improve user engagement.
- [237] arXiv:2411.14927 [pdf, html, other]
-
Title: LiDAR-based End-to-end Temporal Perception for Vehicle-Infrastructure CooperationComments: 11 pages, 7 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Temporal perception, the ability to detect and track objects over time, is critical in autonomous driving for maintaining a comprehensive understanding of dynamic environments. However, this task is hindered by significant challenges, including incomplete perception caused by occluded objects and observational blind spots, which are common in single-vehicle perception systems. To address these issues, we introduce LET-VIC, a LiDAR-based End-to-End Tracking framework for Vehicle-Infrastructure Cooperation (VIC). LET-VIC leverages Vehicle-to-Everything (V2X) communication to enhance temporal perception by fusing spatial and temporal data from both vehicle and infrastructure sensors. First, it spatially integrates Bird's Eye View (BEV) features from vehicle-side and infrastructure-side LiDAR data, creating a comprehensive view that mitigates occlusions and compensates for blind spots. Second, LET-VIC incorporates temporal context across frames, allowing the model to leverage historical data for enhanced tracking stability and accuracy. To further improve robustness, LET-VIC includes a Calibration Error Compensation (CEC) module to address sensor misalignments and ensure precise feature alignment. Experiments on the V2X-Seq-SPD dataset demonstrate that LET-VIC significantly outperforms baseline models, achieving at least a 13.7% improvement in mAP and a 13.1% improvement in AMOTA without considering communication delays. This work offers a practical solution and a new research direction for advancing temporal perception in autonomous driving through vehicle-infrastructure cooperation.
- [238] arXiv:2411.14933 [pdf, html, other]
-
Title: Fast-Decaying Polynomial ReproductionSubjects: Numerical Analysis (math.NA)
Polynomial reproduction plays a relevant role in deriving error estimates for various approximation schemes. Local reproduction in a quasi-uniform setting is a significant factor in the estimation of error and the assessment of stability, but for some computationally relevant schemes, such as Rescaled Localized Radial Basis Functions (RL-RBF), it becomes a limitation. To facilitate the study of a greater variety of approximation methods in a unified and efficient manner, this work proposes a framework based on fast-decaying polynomial reproduction: we do not restrict ourselves to compactly supported basis functions, but allow the basis functions to decay at infinity as a function of the separation distance. Implementing fast-decaying polynomial reproduction provides stable and convergent methods that can be smooth, as when approximating by moving least squares, or otherwise very efficient, as in the case of linear programming problems. All the results presented in this paper concerning the rate of convergence, the Lebesgue constant, the smoothness of the approximant, and the compactness of the support have been verified numerically, even in the multivariate setting.
- [239] arXiv:2411.14937 [pdf, html, other]
-
Title: Geminio: Language-Guided Gradient Inversion Attacks in Federated LearningSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Foundation models that bridge vision and language have made significant progress, inspiring numerous life-enriching applications. However, their potential for misuse to introduce new threats remains largely unexplored. This paper reveals that vision-language models (VLMs) can be exploited to overcome longstanding limitations in gradient inversion attacks (GIAs) within federated learning (FL), where an FL server reconstructs private data samples from gradients shared by victim clients. Current GIAs face challenges in reconstructing high-resolution images, especially when the victim has a large local data batch. While focusing reconstruction on valuable samples rather than the entire batch is promising, existing methods lack the flexibility to allow attackers to specify their target data. In this paper, we introduce Geminio, the first approach to transform GIAs into semantically meaningful, targeted attacks. Geminio enables a brand new privacy attack experience: attackers can describe, in natural language, the types of data they consider valuable, and Geminio will prioritize reconstruction to focus on those high-value samples. This is achieved by leveraging a pretrained VLM to guide the optimization of a malicious global model that, when shared with and optimized by a victim, retains only gradients of samples that match the attacker-specified query. Extensive experiments demonstrate Geminio's effectiveness in pinpointing and reconstructing targeted samples, with high success rates across complex datasets and large batch sizes under FL, while showing resilience against existing defenses.
- [240] arXiv:2411.14939 [pdf, html, other]
-
Title: Many happy returns: machine learning to support platelet issuing and waste reduction in hospital blood banksSubjects: Machine Learning (cs.LG)
Efforts to reduce platelet wastage in hospital blood banks have focused on ordering policies, but the predominant practice of issuing the oldest unit first may not be optimal when some units are returned unused. We propose a novel, machine learning (ML)-guided issuing policy to increase the likelihood of returned units being reissued before expiration. Our ML model, trained to predict returns on 17,297 platelet requests, achieved an AUROC of 0.74 on 9,353 held-out requests. Prior to developing the ML model, we built a simulation of the blood bank operation that incorporated returns, in order to understand the scale of benefits such a model could offer. Using our trained model in the simulation gave an estimated reduction in wastage of 14%. Our partner hospital is considering adopting our approach, which would be particularly beneficial for hospitals with higher return rates and where units have a shorter remaining useful life on arrival.
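One toy rendering of such an ML-guided rule (the paper does not publish this exact policy; the 0.5 threshold is our assumption) is to issue a fresher unit when a return is predicted, so that the unit still has useful shelf life when it comes back:

```python
# Toy rendering of an ML-guided issuing rule (not the paper's exact
# policy; the threshold is our assumption): issue a fresher unit when a
# return is predicted, otherwise default to oldest-first.
def choose_unit(shelf_lives, p_return, threshold=0.5):
    """shelf_lives: remaining days of life per in-stock unit;
    p_return: predicted probability the request comes back unused."""
    idx = range(len(shelf_lives))
    if p_return >= threshold:
        return max(idx, key=lambda i: shelf_lives[i])   # freshest unit
    return min(idx, key=lambda i: shelf_lives[i])       # oldest unit
```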
- [241] arXiv:2411.14945 [pdf, html, other]
-
Title: The CTSkills App -- Measuring Problem Decomposition Skills of Students in Computational ThinkingDorit Assaf, Giorgia Adorni, Elia Lutz, Lucio Negrini, Alberto Piatti, Francesco Mondada, Francesca Mangili, Luca Maria GambardellaSubjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
This paper addresses the incorporation of problem decomposition skills as an important component of computational thinking (CT) in K-12 computer science (CS) education. Despite the growing integration of CS in schools, there is a lack of consensus on the precise definition of CT in general and decomposition in particular. While decomposition is commonly referred to as the starting point of (computational) problem-solving, algorithmic solution formulation often receives more attention in the classroom, whereas decomposition remains rather unexplored. This study presents "CTSKills", a web-based skill assessment tool developed to measure students' problem decomposition skills. With the data collected from 75 students in grades 4-9, this research aims to contribute to a baseline of students' decomposition proficiency in compulsory education. Furthermore, a thorough understanding of a given problem is becoming increasingly important with the advancement of generative artificial intelligence (AI) tools that can effectively support the process of formulating algorithms. This study highlights the importance of problem decomposition as a key skill in K-12 CS education to foster more adept problem solvers.
- [242] arXiv:2411.14946 [pdf, html, other]
-
Title: Reliable Evaluation of Attribution Maps in CNNs: A Perturbation-Based ApproachSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
In this paper, we present an approach for evaluating attribution maps, which play a central role in interpreting the predictions of convolutional neural networks (CNNs). We show that the widely used insertion/deletion metrics are susceptible to distribution shifts that affect the reliability of the ranking. Our method proposes to replace pixel modifications with adversarial perturbations, which provides a more robust evaluation framework. By using smoothness and monotonicity measures, we illustrate the effectiveness of our approach in correcting distribution shifts. In addition, we conduct the most comprehensive quantitative and qualitative assessment of attribution maps to date. Introducing baseline attribution maps as sanity checks, we find that our metric is the only contender to pass all checks. Using Kendall's $\tau$ rank correlation coefficient, we show the increased consistency of our metric across 15 dataset-architecture combinations. Of the 16 attribution maps tested, our results clearly show SmoothGrad to be the best map currently available. This research makes an important contribution to the development of attribution maps by providing a reliable and consistent evaluation framework. To ensure reproducibility, we will provide the code along with our results.
- [243] arXiv:2411.14950 [pdf, html, other]
-
Title: Trajectory Planning and Control for Robotic Magnetic ManipulationComments: 8 pages, 6 figuresSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
Robotic magnetic manipulation offers a minimally invasive approach to gastrointestinal examinations through capsule endoscopy. However, controlling such systems using external permanent magnets (EPM) is challenging due to nonlinear magnetic interactions, especially when there are complex navigation requirements such as avoidance of sensitive tissues. In this work, we present a novel trajectory planning and control method incorporating dynamics and navigation requirements, using a single EPM fixed to a robotic arm to manipulate an internal permanent magnet (IPM). Our approach employs a constrained iterative linear quadratic regulator that considers the dynamics of the IPM to generate optimal trajectories for both the EPM and IPM. Extensive simulations and real-world experiments, motivated by capsule endoscopy operations, demonstrate the robustness of the method, showcasing resilience to external disturbances and precise control under varying conditions. The experimental results show that the IPM reaches the goal position with a maximum mean error of 0.18 cm and a standard deviation of 0.21 cm. This work introduces a unified framework for constrained trajectory optimization in magnetic manipulation, directly incorporating both the IPM's dynamics and the EPM's manipulability.
- [244] arXiv:2411.14951 [pdf, html, other]
-
Title: Morph: A Motion-free Physics Optimization Framework for Human Motion GenerationComments: 15 pages, 6 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Human motion generation plays a vital role in applications such as digital humans and humanoid robot control. However, most existing approaches disregard physics constraints, leading to the frequent production of physically implausible motions with pronounced artifacts such as floating and foot sliding. In this paper, we propose \textbf{Morph}, a \textbf{Mo}tion-f\textbf{r}ee \textbf{ph}ysics optimization framework, comprising a Motion Generator and a Motion Physics Refinement module, for enhancing physical plausibility without relying on costly real-world motion data. Specifically, the Motion Generator is responsible for providing large-scale synthetic motion data, while the Motion Physics Refinement Module utilizes these synthetic data to train a motion imitator within a physics simulator, enforcing physical constraints to project the noisy motions into a physically-plausible space. These physically refined motions, in turn, are used to fine-tune the Motion Generator, further enhancing its capability. Experiments on both text-to-motion and music-to-dance generation tasks demonstrate that our framework achieves state-of-the-art motion generation quality while improving physical plausibility drastically.
- [245] arXiv:2411.14953 [pdf, html, other]
-
Title: Evaluating Vision Transformer Models for Visual Quality Control in Industrial ManufacturingJournal-ref: Machine Learning and Knowledge Discovery in Databases.Applied Data Science Track, vol 14950, Springer (2024) 116-132Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
One of the most promising use-cases for machine learning in industrial manufacturing is the early detection of defective products using a quality control system. Such a system can save costs and reduce human errors due to the monotonous nature of visual inspections. Today, a rich body of research exists which employs machine learning methods to identify rare defective products in unbalanced visual quality control datasets. These methods typically rely on two components: a visual backbone to capture the features of the input image and an anomaly detection algorithm that decides if these features are within an expected distribution. With the rise of transformer architectures as the visual backbones of choice, there now exists a great variety of different combinations of these two components, ranging all along the trade-off between detection quality and inference time. Facing this variety, practitioners in the field often have to spend considerable time researching the right combination for their use-case at hand. Our contribution is to help practitioners with this choice by reviewing and evaluating current vision transformer models together with anomaly detection methods. For this, we chose SotA models of both disciplines, combined them, and evaluated them towards the goal of having small, fast and efficient anomaly detection models suitable for industrial manufacturing. We evaluated the results of our experiments on the well-known MVTecAD and BTAD datasets. Moreover, we give guidelines for choosing a suitable model architecture for a quality control system in practice, considering the given use-case and hardware constraints.
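As a minimal illustration of how the two components fit together (our sketch, not the benchmark code), a backbone's per-image features can be scored against a bank of defect-free training features with a k-nearest-neighbor distance:

```python
# Minimal pairing of the two components discussed above (our sketch):
# a frozen backbone yields per-image feature vectors, and the anomaly
# score is the mean distance to the k nearest defect-free features.
import numpy as np

def anomaly_score(bank: np.ndarray, feat: np.ndarray, k: int = 5) -> float:
    """bank: (n, d) features of known-good products; feat: (d,) query."""
    dists = np.linalg.norm(bank - feat, axis=1)  # distance to every bank entry
    return float(np.sort(dists)[:k].mean())      # mean of the k smallest
```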
- [246] arXiv:2411.14954 [pdf, other]
-
Title: Teaching Experiences using the RVfpga PackageD. Chaver, S. Harris, L. Pinuel, O. Kindgren, R. Kravitz, J. I. Gomez, F. Castro, K. Olcoz, J. Villalba, A. Grinshpun, F. Gabbay, L. Seed, R. Duarte, M. Lopez, O. Alonso, R. OwenSubjects: Hardware Architecture (cs.AR)
The RVfpga course offers a solid introduction to computer architecture using the RISC-V instruction set and FPGA technology. It focuses on providing hands-on experience with real-world RISC-V cores, the VeeR EH1 and the VeeR EL2, developed by Western Digital a few years ago and currently hosted by ChipsAlliance. This course is particularly aimed at educators and students in computer science, computer engineering, and related fields, enabling them to integrate practical RISC-V knowledge into their curricula. The course materials, which include detailed labs and setup guides, are available for free through the Imagination University Programme website. We have used RVfpga in different teaching activities and plan to continue using it in the future. Specifically, we have used RVfpga as the main experimental platform in several bachelor/master degree courses; we have completed several final bachelor/master degree projects based on this platform; we will offer a microcredential on processor design based on RVfpga; we have adapted RVfpga to a MOOC on the edX platform; and we have shared RVfpga worldwide through one-day hands-on workshops and tutorials. This paper begins by discussing how the RVfpga course matches the latest IEEE/ACM/AAAI computing curriculum guidelines. It then details various teaching implementations we have conducted over recent years using these materials. Finally, the paper examines other courses similar to RVfpga, comparing their strengths and weaknesses.
- [247] arXiv:2411.14957 [pdf, html, other]
-
Title: Information Extraction from Heterogenous Documents without Ground Truth Labels using Synthetic Label Generation and Knowledge DistillationComments: Accepted to WACV 2025Subjects: Computation and Language (cs.CL)
Invoices and receipts submitted by employees are visually rich documents (VRDs) with textual, visual and layout information. To protect against the risk of fraud and abuse, it is crucial for organizations to efficiently extract desired information from submitted receipts. This helps in the assessment of key factors such as appropriateness of the expense claim, adherence to spending and transaction policies, the validity of the receipt, as well as downstream anomaly detection at various levels. These documents are heterogeneous, with multiple formats and languages, uploaded with different image qualities, and often do not contain ground truth labels for the efficient training of models. In this paper, we propose Task Aware Instruction-based Labelling (TAIL), a method for synthetic label generation in VRD corpora without labels, and fine-tune a multimodal Visually Rich Document Understanding Model (VRDU) on TAIL labels using response-based knowledge distillation without using the teacher model's weights or training dataset to conditionally generate annotations in the appropriate format. Using a benchmark external dataset where ground truth labels are available, we demonstrate conditions under which our approach performs at par with Claude 3 Sonnet through empirical studies. We then show that the resulting model performs at par or better on the internal expense documents of a large multinational organization than state-of-the-art LMM (large multimodal model) Claude 3 Sonnet while being 85% less costly and ~5X faster, and outperforms layout-aware baselines by more than 10% in Average Normalized Levenshtein Similarity (ANLS) scores due to its ability to reason and extract information from rare formats. Finally, we illustrate the usage of our approach in overpayment prevention.
- [248] arXiv:2411.14959 [pdf, html, other]
-
Title: Design-o-meter: Towards Evaluating and Refining Graphic DesignsSahil Goyal, Abhinav Mahajan, Swasti Mishra, Prateksha Udhayanan, Tripti Shukla, K J Joseph, Balaji Vasan SrinivasanComments: Accepted to WACV 2025. Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Graphic designs are an effective medium for visual communication. They range from greeting cards to corporate flyers and beyond. Of late, machine learning techniques are able to generate such designs, which accelerates the rate of content production. An automated way of evaluating their quality becomes critical. Towards this end, we introduce Design-o-meter, a data-driven methodology to quantify the goodness of graphic designs. Further, our approach can suggest modifications to these designs to improve their visual appeal. To the best of our knowledge, Design-o-meter is the first approach that scores and refines designs in a unified framework despite the inherent subjectivity and ambiguity of the setting. Our exhaustive quantitative and qualitative analysis of our approach against baselines adapted for the task (including recent Multimodal LLM-based approaches) brings out the efficacy of our methodology. We hope our work will usher more interest in this important and pragmatic problem setting.
- [249] arXiv:2411.14961 [pdf, html, other]
-
Title: LoRA-FAIR: Federated LoRA Fine-Tuning with Aggregation and Initialization RefinementSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Foundation models (FMs) achieve strong performance across diverse tasks with task-specific fine-tuning, yet full parameter fine-tuning is often computationally prohibitive for large models. Parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaptation (LoRA) reduce this cost by introducing low-rank matrices for tuning fewer parameters. While LoRA allows for efficient fine-tuning, it requires significant data for adaptation, making Federated Learning (FL) an appealing solution due to its privacy-preserving collaborative framework. However, combining LoRA with FL introduces two key challenges: the \textbf{Server-Side LoRA Aggregation Bias}, where server-side averaging of LoRA matrices diverges from the ideal global update, and the \textbf{Client-Side LoRA Initialization Drift}, emphasizing the need for consistent initialization across rounds. Existing approaches address these challenges individually, limiting their effectiveness. We propose LoRA-FAIR, a novel method that tackles both issues by introducing a correction term on the server while keeping the original LoRA modules, enhancing aggregation efficiency and accuracy. LoRA-FAIR maintains computational and communication efficiency, yielding superior performance over state-of-the-art methods. Experimental results on ViT and MLP-Mixer models across large-scale datasets demonstrate that LoRA-FAIR consistently achieves performance improvements in FL settings.
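The server-side aggregation bias is easy to see numerically; in the toy example below (ours), averaging the LoRA factors separately does not reproduce the average of the clients' effective updates $B_iA_i$:

```python
# Toy numpy demonstration of the aggregation bias (ours): averaging the
# LoRA factors A and B separately is not the same as averaging the
# clients' effective updates B_i @ A_i.
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 8, 2, 4                                  # dims and client count
As = [rng.normal(size=(r, d)) for _ in range(n)]
Bs = [rng.normal(size=(d, r)) for _ in range(n)]

ideal = sum(B @ A for A, B in zip(As, Bs)) / n     # mean of products
naive = (sum(Bs) / n) @ (sum(As) / n)              # product of means
print(np.linalg.norm(ideal - naive))               # nonzero gap
```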
- [250] arXiv:2411.14962 [pdf, html, other]
-
Title: LLM for Barcodes: Generating Diverse Synthetic Data for Identity DocumentsComments: 5 pages, 1 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Accurate barcode detection and decoding in identity documents is crucial for applications like security, healthcare, and education, where reliable data extraction and verification are essential. However, building robust detection models is challenging due to the lack of diverse, realistic datasets, an issue often tied to privacy concerns and the wide variety of document formats. Traditional tools like Faker rely on predefined templates, making them less effective for capturing the complexity of real-world identity documents. In this paper, we introduce a new approach to synthetic data generation that uses LLMs to create contextually rich and realistic data without relying on predefined fields. Using the vast knowledge LLMs have about different documents and content, our method creates data that reflects the variety found in real identity documents. This data is then encoded into barcodes and overlaid on templates for documents such as driver's licenses, insurance cards, and student IDs. Our approach simplifies the process of dataset creation, eliminating the need for extensive domain knowledge or predefined fields. Compared to traditional methods like Faker, data generated by LLMs demonstrates greater diversity and contextual relevance, leading to improved performance in barcode detection models. This scalable, privacy-first solution is a big step forward in advancing machine learning for automated document processing and identity verification.
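A compressed sketch of this generation loop (our simplification: the paper's barcode symbology may differ, and `draft_record` stands in for the LLM call) using the `qrcode` package:

```python
# Compressed sketch of the generation loop (our simplification; the
# paper's barcode symbology may differ, and `draft_record` stands in
# for the LLM call).
import json
import qrcode

def draft_record(doc_type: str) -> dict:
    raise NotImplementedError("LLM drafts a realistic, fictional record")

def make_barcode(doc_type: str, out_path: str) -> None:
    record = draft_record(doc_type)         # e.g., a driver's-license dict
    img = qrcode.make(json.dumps(record))   # encode the record as a QR code
    img.save(out_path)                      # ready to overlay on a template
```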
- [251] arXiv:2411.14967 [pdf, html, other]
-
Title: SwissADT: An Audio Description Translation System for Swiss LanguagesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Audio description (AD) is a crucial accessibility service provided to blind persons and persons with visual impairment, designed to convey visual information in acoustic form. Despite recent advancements in multilingual machine translation research, the lack of well-crafted and time-synchronized AD data impedes the development of audio description translation (ADT) systems that address the needs of multilingual countries such as Switzerland. Furthermore, since the majority of ADT systems rely solely on text, uncertainty exists as to whether incorporating visual information from the corresponding video clips can enhance the quality of ADT outputs. In this work, we present SwissADT, the first ADT system implemented for three main Swiss languages and English. By collecting well-crafted AD data augmented with video clips in German, French, Italian, and English, and leveraging the power of Large Language Models (LLMs), we aim to enhance information accessibility for diverse language populations in Switzerland by automatically translating AD scripts to the desired Swiss language. Our extensive experimental ADT results, composed of both automatic and human evaluations of ADT quality, demonstrate the promising capability of SwissADT for the ADT task. We believe that combining human expertise with the generation power of LLMs can further enhance the performance of ADT systems, ultimately benefiting a larger multilingual target population.
- [252] arXiv:2411.14968 [pdf, html, other]
-
Title: Optimization Strategies for Parallel Computation of SkylinesComments: 18 pagesSubjects: Databases (cs.DB)
Skyline queries are one of the most widely adopted tools for Multi-Criteria Analysis, with applications covering diverse domains, including, e.g., Database Systems, Data Mining, and Decision Making. Skylines indeed offer a useful overview of the most suitable alternatives in a dataset, while discarding all the options that are dominated by (i.e., worse than) others.
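For concreteness, a minimal sequential skyline computation looks as follows (our sketch, with smaller-is-better criteria; the paper's focus is on parallel variants of this task):

```python
# Minimal sequential skyline (our sketch; the paper studies parallel
# variants). Smaller is better on every criterion; a point is kept
# unless some other point dominates it.
def dominates(p, q):
    """p dominates q: no worse everywhere, strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# skyline([(1, 3), (2, 2), (3, 1), (3, 3)]) -> [(1, 3), (2, 2), (3, 1)]
```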
The intrinsically quadratic complexity associated with skyline computation has pushed researchers to identify strategies for parallelizing the task, particularly by partitioning the dataset at hand. In this paper, after reviewing the main partitioning approaches available in the relevant literature, we propose two orthogonal optimization strategies for reducing the computational overhead, and compare them experimentally in a multi-core environment equipped with PySpark.
- [253] arXiv:2411.14971 [pdf, html, other]
-
Title: Leveraging LLMs for Legacy Code Modernization: Challenges and Opportunities for LLM-Generated DocumentationColin Diggs, Michael Doyle, Amit Madan, Siggy Scott, Emily Escamilla, Jacob Zimmer, Naveed Nekoo, Paul Ursino, Michael Bartholf, Zachary Robin, Anand Patel, Chris Glasz, William Macke, Paul Kirk, Jasper Phillips, Arun Sridharan, Doug Wendt, Scott Rosen, Nitin Naik, Justin F. Brunelle, Samruddhi ThakerComments: Abbreviated version submitted to LLM4Code 2025 (a workshop co-located with ICSE 2025), 13 pages, 3 figuresSubjects: Machine Learning (cs.LG); Software Engineering (cs.SE)
Legacy software systems, written in outdated languages like MUMPS and mainframe assembly, pose challenges in efficiency, maintenance, staffing, and security. While LLMs offer promise for modernizing these systems, their ability to understand legacy languages is largely unknown. This paper investigates the utilization of LLMs to generate documentation for legacy code using two datasets: an electronic health records (EHR) system in MUMPS and open-source applications in IBM mainframe Assembly Language Code (ALC). We propose a prompting strategy for generating line-wise code comments and a rubric to evaluate their completeness, readability, usefulness, and hallucination. Our study assesses the correlation between human evaluations and automated metrics, such as code complexity and reference-based metrics. We find that LLM-generated comments for MUMPS and ALC are generally hallucination-free, complete, readable, and useful compared to ground-truth comments, though ALC poses challenges. However, no automated metrics strongly correlate with comment quality to predict or measure LLM performance. Our findings highlight the limitations of current automated measures and the need for better evaluation metrics for LLM-generated documentation in legacy systems.
- [254] arXiv:2411.14974 [pdf, html, other]
-
Title: 3D Convex Splatting: Radiance Field Rendering with 3D Smooth ConvexesJan Held, Renaud Vandeghen, Abdullah Hamdi, Adrien Deliege, Anthony Cioppa, Silvio Giancola, Andrea Vedaldi, Bernard Ghanem, Marc Van DroogenbroeckComments: 13 pages, 13 figures, 10 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recent advances in radiance field reconstruction, such as 3D Gaussian Splatting (3DGS), have achieved high-quality novel view synthesis and fast rendering by representing scenes with compositions of Gaussian primitives. However, 3D Gaussians present several limitations for scene reconstruction. Accurately capturing hard edges is challenging without significantly increasing the number of Gaussians, creating a large memory footprint. Moreover, they struggle to represent flat surfaces, as they are diffused in space. Without hand-crafted regularizers, they tend to disperse irregularly around the actual surface. To circumvent these issues, we introduce a novel method, named 3D Convex Splatting (3DCS), which leverages 3D smooth convexes as primitives for modeling geometrically-meaningful radiance fields from multi-view images. Smooth convex shapes offer greater flexibility than Gaussians, allowing for a better representation of 3D scenes with hard edges and dense volumes using fewer primitives. Powered by our efficient CUDA-based rasterizer, 3DCS achieves superior performance over 3DGS on benchmarks such as Mip-NeRF360, Tanks and Temples, and Deep Blending. Specifically, our method attains an improvement of up to 0.81 in PSNR and 0.026 in LPIPS compared to 3DGS while maintaining high rendering speeds and reducing the number of required primitives. Our results highlight the potential of 3D Convex Splatting to become the new standard for high-quality scene reconstruction and novel view synthesis. Project page: this http URL.
- [255] arXiv:2411.14977 [pdf, html, other]
-
Title: A p-Multigrid Accelerated Nodal Spectral Element Method for Free-Surface Incompressible Navier-Stokes Model of Nonlinear Water WavesComments: 27 pages, 20 figuresSubjects: Numerical Analysis (math.NA); Fluid Dynamics (physics.flu-dyn)
We present a spectral element model for general-purpose simulation of non-overturning nonlinear water waves using the incompressible Navier-Stokes equations (INSE) with a free surface. The numerical implementation of the spectral element method is inspired by the related work by Engsig-Karup et al. (2016) and is based on nodal Lagrange basis functions, mass matrix-based integration and gradient recovery using global $L^2$ projections. The resulting model leverages the high-order accurate -- possibly exponential -- error convergence and has support for geometric flexibility allowing for computationally efficient simulations of nonlinear wave propagation. An explicit fourth-order accurate Runge-Kutta scheme is employed for the temporal integration, and a mixed-stage numerical discretization is the basis for a pressure-velocity coupling that makes it possible to maintain high-order accuracy in both the temporal and spatial discretizations while preserving mass conservation. Furthermore, the numerical scheme is accelerated by solving the discrete Poisson problem using an iterative solver strategy based on a geometric $p$-multigrid method. This problem constitutes the main computational bottleneck in INSE models. It is shown through numerical experiments that the model achieves spectral convergence in the velocity fields for highly nonlinear waves, and there is excellent agreement with experimental data for the simulation of the classical benchmark of harmonic wave generation over a submerged bar. The geometric $p$-multigrid solver demonstrates $O(n)$ computational scalability in simulations, making it an efficient solver strategy and a suitable candidate for extensions to more complex, real-world scenarios.
- [256] arXiv:2411.14980 [pdf, html, other]
-
Title: Generalized Multivariate Polynomial Codes for Distributed Matrix-Matrix MultiplicationComments: 5 pages, 2 figures, presented at ITW'24Subjects: Information Theory (cs.IT)
Supporting multiple partial computations efficiently at each of the workers is a keystone in distributed coded computing in order to speed up computations and to fully exploit the resources of heterogeneous workers in terms of communication, storage, or computation capabilities. Multivariate polynomial coding schemes have recently been shown to deliver faster results for distributed matrix-matrix multiplication compared to conventional univariate polynomial coding schemes by supporting multiple partial coded computations at each worker at reduced communication costs. In this work, we extend multivariate coding schemes to also support arbitrary matrix partitions. Generalized matrix partitions have been proved useful to trade-off between computation speed and communication costs in distributed (univariate) coded computing. We first formulate the computation latency-communication trade-off in terms of the computation complexity and communication overheads required by coded computing approaches as compared to a single server uncoded computing system. Then, we propose two novel multivariate coded computing schemes supporting arbitrary matrix partitions. The proposed schemes are shown to improve the studied trade-off as compared to univariate schemes.
- [257] arXiv:2411.14982 [pdf, html, other]
-
Title: Large Multi-modal Models Can Interpret Features in Large Multi-modal ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Recent advances in Large Multimodal Models (LMMs) have led to significant breakthroughs in both academia and industry. One question that arises is how we, as humans, can understand their internal neural representations. This paper takes an initial step towards addressing this question by presenting a versatile framework to identify and interpret the semantics within LMMs. Specifically, 1) we first apply a Sparse Autoencoder (SAE) to disentangle the representations into human-understandable features. 2) We then present an automatic interpretation framework to interpret the open-semantic features learned by the SAE, using the LMMs themselves. We employ this framework to analyze the LLaVA-NeXT-8B model using the LLaVA-OV-72B model, demonstrating that these features can effectively steer the model's behavior. Our results contribute to a deeper understanding of why LMMs excel in specific tasks, including EQ tests, and illuminate the nature of their mistakes along with potential strategies for their rectification. These findings offer new insights into the internal mechanisms of LMMs and suggest parallels with the cognitive processes of the human brain.
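A bare-bones version of the first step (our sketch; dimensions and the L1 coefficient would be illustrative choices) is a sparse autoencoder trained to reconstruct LMM activations through an overcomplete, non-negative feature layer:

```python
# Bare-bones sparse autoencoder for step 1 (our sketch): activations
# are mapped into an overcomplete, non-negative feature space with an
# L1 sparsity penalty, so individual features become inspectable.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_feats: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_feats)
        self.dec = nn.Linear(d_feats, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))        # sparse, non-negative features
        return self.dec(f), f

def sae_loss(model, x, l1_coef=1e-3):
    x_hat, f = model(x)
    return ((x_hat - x) ** 2).mean() + l1_coef * f.abs().mean()
```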
- [258] arXiv:2411.14984 [pdf, html, other]
-
Title: Adaptive Group Robust Ensemble Knowledge DistillationComments: Workshop Algorithmic Fairness through the Lens of Metrics and Evaluation at NeurIPS 2024Subjects: Machine Learning (cs.LG)
Neural networks can learn spurious correlations in the data, often leading to performance disparity for underrepresented subgroups. Studies have demonstrated that the disparity is amplified when knowledge is distilled from a complex teacher model to a relatively "simple" student model. Prior work has shown that ensemble deep learning methods can improve the performance of the worst-case subgroups; however, it is unclear if this advantage carries over when distilling knowledge from an ensemble of teachers, especially when the teacher models are debiased. This study demonstrates that traditional ensemble knowledge distillation can significantly drop the performance of the worst-case subgroups in the distilled student model even when the teacher models are debiased. To overcome this, we propose Adaptive Group Robust Ensemble Knowledge Distillation (AGRE-KD), a simple ensembling strategy to ensure that the student model receives knowledge beneficial for unknown underrepresented subgroups. Leveraging an additional biased model, our method selectively chooses teachers whose knowledge would better improve the worst-performing subgroups by upweighting the teachers with gradient directions deviating from the biased model. Our experiments on several datasets demonstrate the superiority of the proposed ensemble distillation technique and show that it can even outperform classic model ensembles based on majority voting.
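The upweighting rule can be sketched in a few lines (notation ours; the exact weighting in AGRE-KD may differ): teachers whose gradients deviate most from the biased model's gradient direction receive the largest ensemble weights.

```python
# Sketch of the teacher-upweighting rule (notation ours; the exact
# weighting in AGRE-KD may differ).
import torch
import torch.nn.functional as F

def teacher_weights(teacher_grads, biased_grad):
    """teacher_grads: list of flattened per-teacher gradient tensors;
    biased_grad: flattened gradient of the auxiliary biased model."""
    # deviation in [0, 2]: 0 = aligned with the biased model, 2 = opposite
    dev = torch.stack([1.0 - F.cosine_similarity(g, biased_grad, dim=0)
                       for g in teacher_grads])
    return dev / dev.sum()          # normalized ensemble weights
```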
- [259] arXiv:2411.14986 [pdf, html, other]
-
Title: Generative AI may backfire for counterspeechSubjects: Social and Information Networks (cs.SI); Computers and Society (cs.CY)
Online hate speech poses a serious threat to individual well-being and societal cohesion. A promising solution to curb online hate speech is counterspeech. Counterspeech aims to encourage users to reconsider hateful posts through direct replies. However, current methods lack scalability due to the need for human intervention or fail to adapt to the specific context of the post. A potential remedy is the use of generative AI, specifically large language models (LLMs), to write tailored counterspeech messages. In this paper, we analyze whether contextualized counterspeech generated by state-of-the-art LLMs is effective in curbing online hate speech. To do so, we conducted a large-scale, pre-registered field experiment (N=2,664) on the social media platform Twitter/X. Our experiment followed a 2x2 between-subjects design, plus a control condition with no counterspeech. On the one hand, users posting hateful content on Twitter/X were randomly assigned to receive either (a) contextualized counterspeech or (b) non-contextualized counterspeech. Here, the former is generated through LLMs, while the latter relies on predefined, generic messages. On the other hand, we tested two counterspeech strategies: (a) promoting empathy and (b) warning about the consequences of online misbehavior. We then measured whether users deleted their initial hateful posts and whether their behavior changed after the counterspeech intervention (e.g., whether users adopted less toxic language). We find that non-contextualized counterspeech employing a warning-of-consequence strategy significantly reduces online hate speech. However, contextualized counterspeech generated by LLMs proves ineffective and may even backfire.
- [260] arXiv:2411.14991 [pdf, html, other]
-
Title: Free Energy Projective Simulation (FEPS): Active inference with interpretabilityComments: 26 pages (including 5 pages appendix), 6 figuresSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
In the last decade, the free energy principle (FEP) and active inference (AIF) have achieved many successes connecting conceptual models of learning and cognition to mathematical models of perception and action. This effort is driven by a multidisciplinary interest in understanding aspects of self-organizing complex adaptive systems, including elements of agency. Various reinforcement learning (RL) models performing active inference have been proposed and trained on standard RL tasks using deep neural networks. Recent work has focused on improving such agents' performance in complex environments by incorporating the latest machine learning techniques. In this paper, we take an alternative approach. Within the constraints imposed by the FEP and AIF, we attempt to model agents in an interpretable way without deep neural networks by introducing Free Energy Projective Simulation (FEPS). Using internal rewards only, FEPS agents build a representation of their partially observable environments with which they interact. Following AIF, the policy to achieve a given task is derived from this world model by minimizing the expected free energy. Leveraging the interpretability of the model, techniques are introduced to deal with long-term goals and reduce prediction errors caused by erroneous hidden state estimation. We test the FEPS model on two RL environments inspired by behavioral biology: a timed response task and a navigation task in a partially observable grid. Our results show that FEPS agents fully resolve the ambiguity of both environments by appropriately contextualizing their observations based on prediction accuracy only. In addition, they infer optimal policies flexibly for any target observation in the environment.
- [261] arXiv:2411.14992 [pdf, html, other]
-
Title: Differentiable Biomechanics for Markerless Motion Capture in Upper Limb Stroke Rehabilitation: A Comparison with Optical Motion CaptureTim Unger, Arash Sal Moslehian, J.D. Peiffer, Johann Ullrich, Roger Gassert, Olivier Lambercy, R. James Cotton, Chris Awai EasthopeComments: 7 pages, 4 figures, 3 tables, RehabWeek 2025 ICORR, first 3 authors are shared-first and last two authors are shared lastSubjects: Computer Vision and Pattern Recognition (cs.CV)
Marker-based Optical Motion Capture (OMC) paired with biomechanical modeling is currently considered the most precise and accurate method for measuring human movement kinematics. However, combining differentiable biomechanical modeling with Markerless Motion Capture (MMC) offers a promising approach to motion capture in clinical settings, requiring only minimal equipment, such as synchronized webcams, and minimal effort for data collection. This study compares key kinematic outcomes from biomechanically modeled MMC and OMC data in 15 stroke patients performing the drinking task, a functional task recommended for assessing upper limb movement quality. We observed a high level of agreement in kinematic trajectories between MMC and OMC, as indicated by high correlations (median r above 0.95 for the majority of kinematic trajectories) and median RMSE values ranging from 2-5 degrees for joint angles, 0.04 m/s for end-effector velocity, and 6 mm for trunk displacement. Trial-to-trial biases between OMC and MMC were consistent within participant sessions, with interquartile ranges of bias around 1-3 degrees for joint angles, 0.01 m/s in end-effector velocity, and approximately 3mm for trunk displacement. Our findings indicate that our MMC for arm tracking is approaching the accuracy of marker-based methods, supporting its potential for use in clinical settings. MMC could provide valuable insights into movement rehabilitation in stroke patients, potentially enhancing the effectiveness of rehabilitation strategies.
- [262] arXiv:2411.14994 [pdf, html, other]
-
Title: Approximating Prize-Collecting Variants of TSPSubjects: Data Structures and Algorithms (cs.DS)
We present an approximation algorithm for the Prize-collecting Ordered Traveling Salesman Problem (PCOTSP), which simultaneously generalizes the Prize-collecting TSP and the Ordered TSP. The Prize-collecting TSP is well-studied and has a long history, with the current best approximation factor slightly below $1.6$, shown by Blauth, Klein and Nägele [IPCO 2024]. The best approximation ratio for Ordered TSP is $\frac{3}{2}+\frac{1}{e}$, presented by Böhm, Friggstad, Mömke, Spoerhase [SODA 2025] and Armbruster, Mnich, Nägele [Approx 2024]. The former also present a factor 2.2131 approximation algorithm for Multi-Path-TSP.
By carefully tuning the techniques of the latest results on the aforementioned problems and leveraging the unique properties of our problem, we present a 2.097-approximation algorithm for PCOTSP. A key idea in our result is to first sample a set of trees, and then probabilistically pick up some vertices, while using the pruning ideas of Blauth, Klein, Nägele [IPCO 2024] on other vertices to get cheaper parity correction; the sampling probability and the penalty paid by the LP playing a crucial part in both cases. A straightforward adaptation of the aforementioned pruning ideas would only give minuscule improvements over standard parity correction methods. Instead, we use the specific characteristics of our problem together with properties gained from running a simple combinatorial algorithm to bring the approximation factor below 2.1. Our techniques extend to Prize-collecting Multi-Path TSP, building on results from Böhm, Friggstad, Mömke, Spoerhase [SODA 2025], leading to a 2.41-approximation.
- [263] arXiv:2411.14995 [pdf, html, other]
-
Title: Learning Lifted STRIPS Models from Action Traces Alone: A Simple, General, and Scalable SolutionComments: submitted to ICAPS 2025Subjects: Artificial Intelligence (cs.AI)
Learning STRIPS action models from action traces alone is a challenging problem as it involves learning the domain predicates as well. In this work, a novel approach is introduced which, like the well-known LOCM systems, is scalable, but like SAT approaches, is sound and complete. Furthermore, the approach is general and imposes no restrictions on the hidden domain or the number or arity of the predicates. The new learning method is based on an \emph{efficient, novel test} that checks whether the assumption that a predicate is affected by a set of action patterns, namely, actions with specific argument positions, is consistent with the traces. The predicates and action patterns that pass the test provide the basis for the learned domain that is then easily completed with preconditions and static predicates. The new method is studied theoretically and experimentally. For the latter, the method is evaluated on traces and graphs obtained from standard classical domains like the 8-puzzle, which involve hundreds of thousands of states and transitions. The learned representations are then verified on larger instances.
- [264] arXiv:2411.15001 [pdf, html, other]
-
Title: A positive- and bound-preserving vectorial lattice Boltzmann method in two dimensionsComments: 23 pages, 10 figuresSubjects: Numerical Analysis (math.NA)
We present a novel positive kinetic scheme built on the efficient collide-and-stream algorithm of the lattice Boltzmann method (LBM) to address hyperbolic conservation laws. We focus on the compressible Euler equations with strong discontinuities. Starting from the work of Jin and Xin [20] and then [4,8], we show how the LBM discretization procedure can yield both first- and second-order schemes, referred to as vectorial LBM. Noticing that the first-order scheme is convex preserving under a specific CFL constraint, we develop a blending strategy that preserves both the conservation and simplicity of the algorithm. This approach employs convex limiters, carefully designed to ensure either positivity (of the density and the internal energy) preservation (PP) or well-defined local maximum principles (LMP), while minimizing numerical dissipation. On challenging test cases involving strong discontinuities and near-vacuum regions, we demonstrate the scheme accuracy, robustness, and ability to capture sharp discontinuities with minimal numerical oscillations.
- [265] arXiv:2411.15003 [pdf, html, other]
-
Title: Autonomous Tail-Sitter Flights in Unknown Environments
Subjects: Robotics (cs.RO)
Trajectory generation for fully autonomous flights of tail-sitter unmanned aerial vehicles (UAVs) presents substantial challenges due to their highly nonlinear aerodynamics. In this paper, we introduce, to the best of our knowledge, the world's first fully autonomous tail-sitter UAV capable of high-speed navigation in unknown, cluttered environments. The UAV autonomy is enabled by cutting-edge technologies including LiDAR-based sensing and differential-flatness-based trajectory planning and control, with purely onboard computation. In particular, we propose an optimization-based tail-sitter trajectory planning framework that generates high-speed, collision-free, and dynamically-feasible trajectories. To efficiently and reliably solve this nonlinear, constrained problem, we develop an efficient feasibility-assured solver, EFOPT, tailored for the online planning of tail-sitter UAVs. We conduct extensive simulation studies to benchmark EFOPT's superiority in planning tasks against conventional NLP solvers. We also present extensive real-world experiments of aggressive autonomous flights at speeds up to 15 m/s in various environments, including indoor laboratories, underground parking lots, and outdoor parks. A video demonstration is available at this https URL, and the EFOPT solver is open-sourced at this https URL.
- [266] arXiv:2411.15004 [pdf, html, other]
-
Title: ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large Language Model (LLM) agents are rapidly improving to handle increasingly complex web-based tasks. Most of these agents rely on general-purpose, proprietary models like GPT-4 and focus on designing better prompts to improve their planning abilities. However, general-purpose LLMs are not specifically trained to understand specialized web contexts such as HTML, and they often struggle with long-horizon planning. We explore an alternative approach that fine-tunes open-source LLMs using production-scale workflow data collected from over 250 domains, corresponding to 6 billion tokens. This simple yet effective approach shows substantial gains over prompting-based agents on existing benchmarks -- ScribeAgent achieves state-of-the-art direct generation performance on Mind2Web and improves the task success rate by 14.1% over the previous best text-only web agents on WebArena. We further perform detailed ablation studies on various fine-tuning design choices and provide insights into LLM selection, training recipes, context window optimization, and the effect of dataset size.
- [267] arXiv:2411.15005 [pdf, html, other]
-
Title: Multi-granularity Interest Retrieval and Refinement Network for Long-Term User Behavior Modeling in CTR Prediction
Xiang Xu, Hao Wang, Wei Guo, Luankang Zhang, Wanshan Yang, Runlong Yu, Yong Liu, Defu Lian, Enhong Chen
Comments: KDD2025
Subjects: Information Retrieval (cs.IR)
Click-through Rate (CTR) prediction is crucial for online personalization platforms. Recent advancements have shown that modeling rich user behaviors can significantly improve the performance of CTR prediction. Current long-term user behavior modeling algorithms predominantly follow two cascading stages. The first stage retrieves a subsequence related to the target item from the long-term behavior sequence, while the second stage models the relationship between the subsequence and the target item. Despite significant progress, these methods have two critical flaws. First, the retrieval query typically includes only target item information, limiting the ability to capture the user's diverse interests. Second, relational information, such as sequential and interactive information within the subsequence, is frequently overlooked; it therefore needs to be mined further to model user interests more accurately.
To this end, we propose the Multi-granularity Interest Retrieval and Refinement Network (MIRRN). Specifically, we first construct queries based on behaviors observed at different time scales to obtain subsequences, each capturing a user's interests at a different granularity. We then introduce a novel multi-head Fourier transformer to efficiently learn sequential and interactive information within the subsequences, leading to more accurate modeling of user interests. Finally, we employ multi-head target attention to adaptively assess the impact of these multi-granularity interests on the target item. Extensive experiments have demonstrated that MIRRN significantly outperforms state-of-the-art baselines. Furthermore, an A/B test shows that MIRRN increases the average number of songs listened to by 1.32% and the average listening time by 0.55% on a popular music streaming app. The implementation code is publicly available at this https URL.
- [268] arXiv:2411.15007 [pdf, html, other]
-
Title: FTA generation using GenAI with an Autonomy sensor Usecase
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Functional safety is an important aspect of system design, and its emphasis in the automotive industry has evolved significantly over the years. To date, many methods have been developed to derive appropriate Fault Tree Analyses (FTAs) for various scenarios and features pertaining to autonomous driving. This paper explores the scope of using Generative Artificial Intelligence (GenAI) to develop FTAs, with the use case of a malfunction of the Lidar sensor in mind. We survey various available open-source Large Language Models (LLMs) and then dive deep into one of them to study its responses and provide our analysis. This paper shows that existing large language models can be steered through prompt engineering to produce fault tree analyses for autonomy use cases, aided by the PlantUML tool.
- [269] arXiv:2411.15008 [pdf, html, other]
-
Title: Evolutionary Automata and Deep Evolutionary Computation
Subjects: Neural and Evolutionary Computing (cs.NE); Computation and Language (cs.CL)
Evolution by natural selection, one of the most compelling themes of modern science, brought forth evolutionary algorithms and evolutionary computation, applying mechanisms of evolution in nature to various problems solved by computers. In this paper we concentrate on evolutionary automata, which constitute a model of evolutionary computation analogous to well-known evolutionary algorithms. Evolutionary automata provide a more complete dual model of evolutionary computation, much as abstract automata (e.g., Turing machines) form a more formal and precise model than recursive algorithms and their subset, evolutionary algorithms. An evolutionary automaton is an automaton that evolves while performing evolutionary computation, possibly over an infinite number of generations. This model allows for directly modeling the evolution of evolution, and leads to tremendous expressiveness of evolutionary automata and evolutionary computation. This also hints at the power of natural evolution, which is self-evolving through interactive feedback with the environment.
- [270] arXiv:2411.15014 [pdf, other]
-
Title: On the Linear Speedup of Personalized Federated Reinforcement Learning with Shared Representations
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Federated reinforcement learning (FedRL) enables multiple agents to collaboratively learn a policy without sharing the local trajectories collected during agent-environment interactions. However, in practice, the environments faced by different agents are often heterogeneous, leading to poor performance by the single policy learned by existing FedRL algorithms on individual agents. In this paper, we take a further step and introduce a personalized FedRL framework (PFedRL) that takes advantage of possibly shared common structure among agents in heterogeneous environments. Specifically, we develop a class of PFedRL algorithms named PFedRL-Rep that learns (1) a shared feature representation collaboratively among all agents, and (2) an agent-specific weight vector personalized to its local environment. We analyze the convergence of PFedTD-Rep, a particular instance of the framework with temporal difference (TD) learning and linear representations. To the best of our knowledge, we are the first to prove a linear convergence speedup with respect to the number of agents in the PFedRL setting. To achieve this, we show that PFedTD-Rep is an example of federated two-timescale stochastic approximation with Markovian noise. Experimental results demonstrate that PFedTD-Rep, along with an extension to the control setting based on deep Q-networks (DQN), not only improves learning in heterogeneous settings, but also provides better generalization to new environments.
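As a rough illustration of the shared-representation setup, the following editor's sketch (names, step sizes, and update order are assumptions, not the paper's PFedTD-Rep pseudocode) runs linear TD(0) on a value estimate V(s) = phi(s)ᵀ B w_i, where the matrix B is averaged across agents by the server and the head w_i stays local.

```python
import numpy as np

# Editor's sketch of the two-timescale idea: a shared representation B
# (updated slowly, then federated) and a personalized head w (updated fast,
# kept local). Step sizes alpha >> beta reflect the two timescales.

def local_td_step(B, w, s_feat, r, s_next_feat,
                  gamma=0.99, alpha=0.05, beta=0.005):
    v, v_next = s_feat @ B @ w, s_next_feat @ B @ w
    delta = r + gamma * v_next - v                # TD error
    w = w + alpha * delta * (B.T @ s_feat)        # fast timescale: personal head
    B = B + beta * delta * np.outer(s_feat, w)    # slow timescale: representation
    return B, w

def server_round(Bs):
    # FedAvg on the shared representation only; each agent's w_i never leaves.
    return sum(Bs) / len(Bs)
```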
- [271] arXiv:2411.15016 [pdf, html, other]
-
Title: MSSF: A 4D Radar and Camera Fusion Framework With Multi-Stage Sampling for 3D Object Detection in Autonomous Driving
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
As one of the automotive sensors that have emerged in recent years, 4D millimeter-wave radar has a higher resolution than conventional 3D radar and provides precise elevation measurements. But its point clouds are still sparse and noisy, making it challenging to meet the requirements of autonomous driving. Camera, as another commonly used sensor, can capture rich semantic information. As a result, the fusion of 4D radar and camera can provide an affordable and robust perception solution for autonomous driving systems. However, radar-camera fusion has not yet been thoroughly investigated for 4D radar, resulting in a large performance gap compared to LiDAR-based methods. Specifically, existing methods ignore the feature-blurring problem and do not deeply interact with image semantic information. To this end, we present a simple but effective multi-stage sampling fusion (MSSF) network based on 4D radar and camera. On the one hand, we design a fusion block that deeply fuses point cloud features with image features and can be applied to commonly used single-modal backbones in a plug-and-play manner. The fusion block comes in two variants: simple feature fusion (SFF) and multiscale deformable feature fusion (MSDFF). The SFF is easy to implement, while the MSDFF has stronger fusion ability. On the other hand, we propose a semantic-guided head that performs foreground-background segmentation on voxels with voxel feature re-weighting, further alleviating the feature-blurring problem. Extensive experiments on the View-of-Delft (VoD) and TJ4DRadSet datasets demonstrate the effectiveness of MSSF. Notably, compared to state-of-the-art methods, MSSF achieves a 7.0% and 4.0% improvement in 3D mean average precision on the VoD and TJ4DRadSet datasets, respectively. It even surpasses classical LiDAR-based methods on the VoD dataset.
- [272] arXiv:2411.15018 [pdf, html, other]
-
Title: Neural 4D Evolution under Large Topological Changes from 2D Images
Comments: 15 pages, 21 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
It has been shown in the literature that the evolution of a known explicit 3D surface to a target one can be learned from 2D images using the instantaneous flow field, even when the known and target 3D surfaces differ largely in topology. We are interested in capturing 4D shapes whose topology changes largely over time. We find that the straightforward extension of the existing 3D-based method to the desired 4D case performs poorly.
In this work, we address the challenges in extending 3D neural evolution to 4D under large topological changes by proposing three novel components: (i) a new architecture to discretize and encode the deformation and learn the SDF, (ii) a technique to impose temporal consistency, and (iii) a rendering scheme for color prediction based on Gaussian splatting. Furthermore, to facilitate learning directly from 2D images, we propose a learning framework that can disentangle the geometry and appearance from RGB images. This disentanglement method, while useful for the 4D evolution problem we concentrate on, is also novel and valid for static scenes. Our extensive experiments on various data yield strong results and, most importantly, open a new approach toward reconstructing challenging scenes with significant topological changes and deformations. Our source code and the dataset are publicly available at this https URL.
- [273] arXiv:2411.15020 [pdf, html, other]
-
Title: ZT-SDN: An ML-powered Zero-Trust Architecture for Software-Defined Networks
Comments: 32 pages, 13 figures, 6 tables
Subjects: Cryptography and Security (cs.CR)
Zero Trust (ZT) is a security paradigm aiming to curtail an attacker's lateral movements within a network by implementing least-privilege and per-request access control policies. However, its widespread adoption is hindered by the difficulty of generating proper rules due to the lack of detailed knowledge of communication requirements and the characteristic behaviors of communicating entities under benign conditions. Consequently, manual rule generation becomes cumbersome and error-prone. To address these problems, we propose ZT-SDN, an automated framework for learning and enforcing network access control in Software-Defined Networks. ZT-SDN collects data from the underlying network and models the network "transactions" performed by communicating entities as graphs. The nodes represent entities, while the directed edges represent transactions identified by different protocol stacks observed. It uses novel unsupervised learning approaches to extract transaction patterns directly from the network data, such as the allowed protocol stacks and port numbers and data transmission behavior. Finally, ZT-SDN uses an innovative approach to generate correct access control rules and infer strong associations between them, allowing proactive rule deployment in forwarding devices. We show the framework's efficacy in detecting abnormal network accesses and abuses of permitted flows in changing network conditions with real network datasets. Additionally, we showcase ZT-SDN's scalability and the network's performance when applied in an SDN environment.
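A minimal editor's sketch of the transaction-graph idea follows (illustrative only; ZT-SDN's actual unsupervised learning and rule-inference pipeline is far more involved): observed flows become edges annotated with the protocol stacks seen, and least-privilege allow-rules are emitted to cover exactly what was observed.

```python
import networkx as nx

# Editor's sketch: model network "transactions" as a directed multigraph and
# derive default-deny access rules from the observed edges.

G = nx.MultiDiGraph()
observed_flows = [
    ("hostA", "hostB", "TCP/443"),
    ("hostA", "hostC", "UDP/53"),
]
for src, dst, stack in observed_flows:
    G.add_edge(src, dst, stack=stack)   # nodes = entities, edges = transactions

rules = [
    {"src": u, "dst": v, "stack": data["stack"], "action": "ALLOW"}
    for u, v, data in G.edges(data=True)
]
# Zero-trust posture: anything not matching an emitted rule is denied.
```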
- [274] arXiv:2411.15024 [pdf, html, other]
-
Title: DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
Comments: 12 pages, 6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Video large language models (VLLMs) have recently advanced significantly in processing complex video content, yet their inference efficiency remains constrained by the high computational cost of the thousands of visual tokens generated from video inputs. We empirically observe that, unlike single-image inputs, VLLMs typically attend to visual tokens from different frames at different decoding iterations, making a one-shot pruning strategy prone to removing important tokens by mistake. Motivated by this, we present DyCoke, a training-free token compression method to optimize token representation and accelerate VLLMs. DyCoke incorporates a plug-and-play temporal compression module to minimize temporal redundancy by merging redundant tokens across frames, and applies dynamic KV cache reduction to selectively prune spatially redundant tokens. It ensures high-quality inference by dynamically retaining the critical tokens at each decoding step. Extensive experimental results demonstrate that DyCoke outperforms prior SoTA counterparts, achieving a 1.5x inference speedup and a 1.4x memory reduction over the baseline VLLM while still improving performance, all without training.
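A minimal editor's sketch of training-free temporal token merging in this spirit (the threshold, shapes, and keep-first policy are assumptions; DyCoke's actual module differs in its details):

```python
import torch
import torch.nn.functional as F

# Editor's sketch: drop a visual token in frame t when some kept token from
# frame t-1 is nearly identical by cosine similarity, keeping one representative.

def merge_across_frames(tokens, threshold=0.9):
    """tokens: (T, N, D) visual tokens for T frames.
    Returns a list of per-frame tensors with redundant tokens removed."""
    kept = [tokens[0]]
    for t in range(1, tokens.shape[0]):
        prev = F.normalize(kept[-1], dim=-1)      # (N_prev, D)
        cur = F.normalize(tokens[t], dim=-1)      # (N, D)
        sim = cur @ prev.T                        # pairwise cosine similarity
        redundant = sim.max(dim=-1).values > threshold
        kept.append(tokens[t][~redundant])
    return kept
```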
- [275] arXiv:2411.15027 [pdf, html, other]
-
Title: Time is on my sight: scene graph filtering for dynamic environment perception in an LLM-driven robot
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Robots are increasingly being used in dynamic environments like workplaces, hospitals, and homes. As a result, interactions with robots must be simple and intuitive, with robot perception adapting efficiently to human-induced changes. This paper presents a robot control architecture that addresses key challenges in human-robot interaction, with a particular focus on the dynamic creation and continuous update of the robot's state representation. The architecture uses Large Language Models to integrate diverse information sources, including natural language commands, robotic skill representations, and real-time dynamic semantic mapping of the perceived scene. This enables flexible and adaptive robotic behavior in complex, dynamic environments. Traditional robotic systems often rely on static, pre-programmed instructions and settings, limiting their adaptability to dynamic environments and real-time collaboration. In contrast, this architecture uses LLMs to interpret complex, high-level instructions and generate actionable plans that enhance human-robot collaboration. At its core, the system's Perception Module generates and continuously updates a semantic scene graph using RGB-D sensor data, providing a detailed and structured representation of the environment. A particle filter is employed to ensure accurate object localization in dynamic, real-world settings. The Planner Module leverages this up-to-date semantic map to break down high-level tasks into sub-tasks and link them to robotic skills such as navigation, object manipulation (e.g., PICK and PLACE), and movement (e.g., GOTO). By combining real-time perception, state tracking, and LLM-driven communication and task planning, the architecture enhances adaptability, task efficiency, and human-robot collaboration in dynamic environments.
- [276] arXiv:2411.15028 [pdf, html, other]
-
Title: FloAt: Flow Warping of Self-Attention for Clothing Animation Generation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We propose a diffusion model-based approach, FloAtControlNet, to generate cinemagraphs composed of animations of human clothing. We focus on human clothing like dresses, skirts, and pants. The input to our model is a text prompt depicting the type and texture of the clothing (e.g., leopard, striped, or plain) and a sequence of normal maps that capture the underlying animation desired in the output. The backbone of our method is a normal-map-conditioned ControlNet operated in a training-free regime. The key observation is that the underlying animation is embedded in the flow of the normal maps. We utilize this flow to manipulate the self-attention maps of appropriate layers. Specifically, the self-attention maps of a particular layer and frame are recomputed as a linear combination of themselves and the self-attention maps of the same layer in the previous frame, warped by the flow computed on the normal maps of the two frames. We show that manipulating the self-attention maps greatly enhances the quality of the clothing animation, making it look more natural while suppressing background artifacts. Through extensive experiments, we show that the proposed method beats all baselines both in qualitative visual results and in a user study. Specifically, our method alleviates the background flickering present in the other diffusion model-based baselines we consider. In addition, we show that our method beats all baselines in terms of RMSE and PSNR computed between the input normal map sequences and the normal map sequences obtained from the output RGB frames. Further, we show that well-established evaluation metrics like LPIPS, SSIM, and CLIP scores, which generally target visual quality, are not necessarily suitable for capturing the subtle motions in human clothing animations.
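The described recombination can be sketched directly. The following is an editor's illustration (the layer choice, flow convention, and blending weight alpha are assumptions, not FloAt's published values): the previous frame's self-attention map, laid out spatially, is warped by the normal-map flow and blended with the current frame's map.

```python
import torch
import torch.nn.functional as F

# Editor's sketch: A_t' = alpha * A_t + (1 - alpha) * warp(A_{t-1}, flow).

def blend_attention(attn_cur, attn_prev, flow, alpha=0.5):
    """attn_*: (B, C, H, W) self-attention maps laid out spatially.
    flow: (B, H, W, 2) backward flow in normalized [-1, 1] coordinates."""
    B, _, H, W = attn_prev.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).expand(B, H, W, 2)  # base sampling grid
    warped = F.grid_sample(attn_prev, grid + flow, align_corners=True)
    return alpha * attn_cur + (1 - alpha) * warped
```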
- [277] arXiv:2411.15031 [pdf, html, other]
-
Title: PoneglyphDB: Efficient Non-interactive Zero-Knowledge Proofs for Arbitrary SQL-Query Verification
Subjects: Databases (cs.DB); Cryptography and Security (cs.CR)
In database applications involving sensitive data, the dual imperatives of data confidentiality and provable query processing must both be met. This paper introduces PoneglyphDB, a database system that leverages non-interactive zero-knowledge proofs (ZKP) to support both confidentiality and provability. Unlike traditional databases, PoneglyphDB enhances confidentiality by ensuring that raw data remains exclusively with the host, while also enabling verification of the correctness of query responses by providing proofs to clients. The main innovation in this paper is proposing efficient ZKP designs (called circuits) for basic operations in SQL query processing. These basic operation circuits are then combined to form ZKP circuits for larger, more complex queries. PoneglyphDB's circuits are carefully designed to be efficient by utilizing advances in cryptography such as PLONKish-based circuits, recursive proof composition techniques, and designs with low-order polynomial constraints. We demonstrate the performance of PoneglyphDB with the standard TPC-H benchmark. Our experimental results show that PoneglyphDB can efficiently achieve both confidentiality and provability, outperforming existing state-of-the-art ZKP methods.
- [278] arXiv:2411.15033 [pdf, html, other]
-
Title: One to rule them all: natural language to bind communication, perception and action
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
In recent years, research in the area of human-robot interaction has focused on developing robots capable of understanding complex human instructions and performing tasks in dynamic and diverse environments. These systems have a wide range of applications, from personal assistance to industrial robotics, emphasizing the importance of robots interacting flexibly, naturally, and safely with humans. This paper presents an advanced architecture for robotic action planning that integrates communication, perception, and planning with Large Language Models (LLMs). Our system is designed to translate commands expressed in natural language into executable robot actions, incorporating environmental information and dynamically updating plans based on real-time feedback. The Planner Module is the core of the system, where LLMs embedded in a modified ReAct framework are employed to interpret and carry out user commands. By leveraging their extensive pre-trained knowledge, LLMs can effectively process user requests without the need to introduce new knowledge about the changing environment. The modified ReAct framework further enhances the execution space by providing real-time environmental perception and the outcomes of physical actions. By combining robust and dynamic semantic map representations as graphs with control components and failure explanations, this architecture enhances a robot's adaptability, task execution, and seamless collaboration with human users in shared and dynamic environments. Through the integration of continuous feedback loops with the environment, the system can dynamically adjust its plan to accommodate unexpected changes, optimizing the robot's ability to perform tasks. Using a dataset of previous experience, it is possible to provide detailed feedback about failures, updating the LLM's context for the next iteration with suggestions on how to overcome the issue.
- [279] arXiv:2411.15034 [pdf, html, other]
-
Title: HeadRouter: A Training-free Image Editing Framework for MM-DiTs by Adaptively Routing Attention Heads
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Diffusion Transformers (DiTs) have exhibited robust capabilities in image generation tasks. However, accurate text-guided image editing for multimodal DiTs (MM-DiTs) still poses a significant challenge. Unlike UNet-based structures, which can utilize self/cross-attention maps for semantic editing, MM-DiTs inherently lack support for explicit and consistent incorporation of text guidance, resulting in semantic misalignment between the edited results and the texts. In this study, we disclose the sensitivity of different attention heads to different image semantics within MM-DiTs and introduce HeadRouter, a training-free image editing framework that edits the source image by adaptively routing the text guidance to different attention heads in MM-DiTs. Furthermore, we present a dual-token refinement module to refine text/image token representations for precise semantic guidance and accurate region expression. Experimental results on multiple benchmarks demonstrate HeadRouter's performance in terms of editing fidelity and image quality.
- [280] arXiv:2411.15036 [pdf, html, other]
-
Title: Safe Multi-Agent Reinforcement Learning with Convergence to Generalized Nash Equilibrium
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Multi-agent reinforcement learning (MARL) has achieved notable success in cooperative tasks, demonstrating impressive performance and scalability. However, deploying MARL agents in real-world applications presents critical safety challenges. Current safe MARL algorithms are largely based on the constrained Markov decision process (CMDP) framework, which enforces constraints only on discounted cumulative costs and lacks an all-time safety assurance. Moreover, these methods often overlook the feasibility issue (the system will inevitably violate state constraints within certain regions of the constraint set), resulting in either suboptimal performance or increased constraint violations. To address these challenges, we propose a novel theoretical framework for safe MARL with $\textit{state-wise}$ constraints, where safety requirements are enforced at every state the agents visit. To resolve the feasibility issue, we leverage a control-theoretic notion of the feasible region, the controlled invariant set (CIS), characterized by the safety value function. We develop a multi-agent method for identifying CISs, ensuring convergence to a Nash equilibrium on the safety value function. By incorporating CIS identification into the learning process, we introduce a multi-agent dual policy iteration algorithm that guarantees convergence to a generalized Nash equilibrium in state-wise constrained cooperative Markov games, achieving an optimal balance between feasibility and performance. Furthermore, for practical deployment in complex high-dimensional systems, we propose $\textit{Multi-Agent Dual Actor-Critic}$ (MADAC), a safe MARL algorithm that approximates the proposed iteration scheme within the deep RL paradigm. Empirical evaluations on safe MARL benchmarks demonstrate that MADAC consistently outperforms existing methods, delivering much higher rewards while reducing constraint violations.
- [281] arXiv:2411.15041 [pdf, html, other]
-
Title: mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA
Tao Zhang, Ziqi Zhang, Zongyang Ma, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Junfu Pu, Yuxuan Zhao, Zehua Xie, Jin Ma, Ying Shan, Weiming Hu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Advanced Multimodal Large Language Models (MLLMs) struggle with recent Knowledge-based VQA tasks, such as INFOSEEK and Encyclopedic-VQA, due to their limited and frozen knowledge scope, often leading to ambiguous and inaccurate responses. Thus, multimodal Retrieval-Augmented Generation (mRAG) is naturally introduced to provide MLLMs with comprehensive and up-to-date knowledge, effectively expanding the knowledge scope. However, current mRAG methods have inherent drawbacks, including: 1) performing retrieval even when external knowledge is not needed; 2) lacking identification of the evidence that supports the query; and 3) increasing model complexity due to additional information filtering modules or rules. To address these shortcomings, we propose a novel generalized framework called multimodal Retrieval-Reflection-Augmented Generation (mR$^2$AG), which achieves adaptive retrieval and useful information localization through two easy-to-implement reflection operations, avoiding high model complexity. In mR$^2$AG, Retrieval-Reflection is designed to distinguish different user queries and avoid redundant retrieval calls, and Relevance-Reflection is introduced to guide the MLLM in locating beneficial evidence in the retrieved content and generating answers accordingly. In addition, mR$^2$AG can be integrated into any well-trained MLLM with efficient fine-tuning on the proposed mR$^2$AG Instruction-Tuning dataset (mR$^2$AG-IT). mR$^2$AG significantly outperforms state-of-the-art MLLMs (e.g., GPT-4v/o) and RAG-based MLLMs on INFOSEEK and Encyclopedic-VQA, while maintaining the exceptional capabilities of base MLLMs across a wide range of visual-dependent tasks.
- [282] arXiv:2411.15042 [pdf, other]
-
Title: Enhancing Autonomous Driving Safety through World Model-Based Predictive Navigation and Adaptive Learning Algorithms for 5G Wireless Applications
Comments: 6 pages, 5 figures
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Addressing the challenge of ensuring safety in ever-changing and unpredictable environments, particularly in the swiftly advancing realm of autonomous driving in today's 5G wireless communication world, we present Navigation Secure (NavSecure). This vision-based navigation framework merges the strengths of world models with crucial safety-focused decision-making capabilities, enabling autonomous vehicles to navigate real-world complexities securely. Our approach anticipates potential threats and formulates safer routes by harnessing the predictive capabilities of world models, thus significantly reducing the need for extensive real-world trial-and-error learning. Additionally, our method empowers vehicles to autonomously learn and develop through continuous practice, ensuring the system evolves and adapts to new challenges. Incorporating radio frequency technology, NavSecure leverages 5G networks to enhance real-time data exchange, improving communication and responsiveness. Validated through rigorous experiments under simulation-to-real driving conditions, NavSecure has shown exceptional performance in safety-critical scenarios, such as sudden obstacle avoidance. Results indicate that NavSecure excels in key safety metrics, including collision prevention and risk reduction, surpassing other end-to-end methodologies. This framework not only advances autonomous driving safety but also demonstrates how world models can enhance decision-making in critical applications. NavSecure sets a new standard for developing more robust and trustworthy autonomous driving systems, capable of handling the inherent dynamics and uncertainties of real-world environments.
- [283] arXiv:2411.15043 [pdf, html, other]
-
Title: OVO-SLAM: Open-Vocabulary Online Simultaneous Localization and Mapping
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
This paper presents the first Open-Vocabulary Online 3D semantic SLAM pipeline, which we denote OVO-SLAM. Our primary contribution is the pipeline itself, particularly the mapping thread. Given a set of posed RGB-D frames, we detect and track 3D segments, which we describe using CLIP vectors computed through a novel aggregation over the viewpoints where these 3D segments are observed. Notably, our OVO-SLAM pipeline is not only faster but also achieves better segmentation metrics than offline approaches in the literature. Along with superior segmentation performance, we show experimental results of our contributions integrated with Gaussian-SLAM, providing the first demonstration of end-to-end open-vocabulary online 3D reconstruction without relying on ground-truth camera poses or scene geometry.
- [284] arXiv:2411.15045 [pdf, other]
-
Title: Who is Funding Indian Research? A look at major funding sources acknowledged in Indian research papers
Comments: First Draft
Subjects: Digital Libraries (cs.DL)
Science and scientific research activities, in addition to the involvement of the researchers, require resources like research infrastructure, materials and reagents, databases and computational tools, journal subscriptions and publication charges etc. In order to meet these requirements, researchers try to attract research funding from different funding sources, both intramural and extramural. Though some recent reports provide details of the amount of funding provided by different funding agencies in India, it is not known what quantum of research output resulted from such funding. This paper, therefore, attempts to quantify the research output produced with the funding provided by different funding agencies to Indian researchers. The major funding agencies that supported Indian research publications are identified and are further characterized in terms of being national or international, and public or private. The analytical results not only provide a quantitative estimate of funded research from India and the major funding agencies supporting the research, but also discusses the overall context of research funding in India, particularly in the context of upcoming operationalization of Anusandhan National Research Foundation (ANRF).
- [285] arXiv:2411.15046 [pdf, html, other]
-
Title: On Multi-Agent Inverse Reinforcement Learning
Comments: Currently under review
Subjects: Machine Learning (cs.LG)
In multi-agent systems, an agent's behavior is highly influenced by its utility function, as these utilities shape both individual goals and interactions with the other agents. Inverse Reinforcement Learning (IRL) is a well-established approach to inferring the utility function by observing expert behavior within a given environment. In this paper, we extend the IRL framework to the multi-agent setting, assuming we observe agents who follow Nash Equilibrium (NE) policies. We theoretically investigate the set of utilities that explain the behavior of NE experts. Specifically, we provide an explicit characterization of the feasible reward set and analyze how errors in estimating the transition dynamics and expert behavior impact the recovered rewards. Building on these findings, we provide the first sample complexity analysis for the multi-agent IRL problem. Finally, we provide a numerical evaluation of our theoretical results.
- [286] arXiv:2411.15049 [pdf, other]
-
Title: Indo-US Research Collaboration: strengthening or declining?
Comments: Pre print
Subjects: Digital Libraries (cs.DL)
Despite the importance of Indo-US research collaboration, the measurement and characterization of its dynamics remain relatively underexplored. In this work, we therefore investigate major patterns in Indo-US collaboration with respect to certain key aspects, using suitable scientometric notions and indicators. The research publication data for the last three decades (1990-2020) is obtained from Web of Science and analysed for this purpose. Results indicate an increase in the absolute number of Indo-US collaborated papers over time, with an impressive share of about one third of India's total internationally collaborated research output. However, the proportionate share of Indo-US collaborated papers in India's internationally collaborated papers has declined over time, as Indian researchers find new collaborating partners. Nevertheless, collaboration with the US is found to be highly rewarding in terms of citations and boost measures. Important insights and recommendations that may help shape a new perspective on Indo-US collaboration policy are presented in this work.
- [287] arXiv:2411.15051 [pdf, html, other]
-
Title: Fantastic Biases (What are They) and Where to Find Them
Comments: Publication in Spanish in the Journal Bits de Ciencias: this https URL
Journal-ref: Bits de Ciencias 26 (2024), 02-13
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
Deep Learning models tend to learn correlations of patterns on huge datasets. The bigger these systems are, the more complex phenomena they can detect, and the more data they need to do so. The use of Artificial Intelligence (AI) is becoming increasingly ubiquitous in our society, and its impact is growing every day. The promise it holds depends strongly on fair and universal use, such as access to information or education for all. In a world of inequalities, AI can help reach the most disadvantaged areas. However, such universal systems must be able to represent society without benefiting some at the expense of others. We must not reproduce the inequalities observed throughout the world, but educate these AIs to go beyond them. We have seen cases where these systems use gender, race, or even class information in ways that are not appropriate for resolving their tasks. Instead of real causal reasoning, they rely on spurious correlations, which is what we usually call a bias. In this paper, we first attempt to define what a bias is in general terms. This helps us demystify the concept of bias, to understand why biases can be found everywhere and why they are sometimes useful. Second, we focus on the notion of what is generally seen as negative bias, the kind we want to avoid in machine learning, before presenting a general zoology containing the most common of these biases. We finally conclude by looking at classical methods to detect them, by means of specially crafted datasets of templates and specific algorithms, and at classical methods to mitigate them.
- [288] arXiv:2411.15056 [pdf, html, other]
-
Title: Financial Risk Assessment via Long-term Payment Behavior Sequence Folding
Comments: ICDM2024 long paper
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Online inclusive financial services encounter significant financial risks due to their expansive user base and low default costs. Real-world practice reveals that utilizing longer-term user payment behaviors can enhance a model's ability to forecast financial risks. However, learning long behavior sequences is non-trivial for deep sequential models. Additionally, the diverse fields of payment behaviors carry rich information that requires thorough exploitation. These factors collectively complicate the task of long-term user behavior modeling. To tackle these challenges, we propose a Long-term Payment Behavior Sequence Folding method, referred to as LBSF. In LBSF, payment behavior sequences are folded based on merchants, using the merchant field as an intrinsic grouping criterion, which enables informative parallelism without reliance on external knowledge. Meanwhile, we maximize the utility of payment details through a multi-field behavior encoding mechanism. Subsequently, behavior aggregation at the merchant level, followed by relational learning across merchants, facilitates a comprehensive user financial representation. We evaluate LBSF on the financial risk assessment task using a large-scale real-world dataset. The results demonstrate that folding long behavior sequences based on internal behavioral cues effectively models long-term patterns and changes, thereby generating more accurate user financial profiles for practical applications.
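A minimal editor's sketch of the folding step (the field names are assumptions; LBSF's multi-field encoder and aggregation stages are not shown):

```python
from collections import defaultdict

# Editor's sketch: fold a long payment-behavior sequence by merchant so each
# sub-sequence can be encoded in parallel and later aggregated per user.

def fold_by_merchant(events):
    """events: chronologically ordered list of dicts with at least a
    'merchant' key plus other payment fields ('amount', 'timestamp', ...)."""
    folded = defaultdict(list)
    for e in events:
        folded[e["merchant"]].append(e)   # merchant as intrinsic grouping key
    return folded                          # merchant -> ordered sub-sequence

# Per the abstract, each sub-sequence would then pass through a multi-field
# behavior encoder, be aggregated at the merchant level, and feed relational
# learning across merchants to form the user's financial representation.
```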
- [289] arXiv:2411.15061 [pdf, other]
-
Title: Empowering Clients: Transformation of Design Processes Due to Generative AI
Subjects: Artificial Intelligence (cs.AI)
The domain of computational design, driven by advancements in Generative AI, is transforming creative fields. We explore the transformative effects of Generative AI on the architectural design process and discuss the changing role of the architect. The case of architecture is interesting, as designing houses is complex and involves extensive customer interaction. We employ a within-subject experiment using a popular general-purpose text-to-image tool for generating designs and providing feedback on existing designs, followed by expert interviews. The study reveals that AI can disrupt the ideation phase by enabling clients to engage in the design process through rapid visualization of their own ideas. In turn, the architect's role shifts more towards assessing the feasibility of designs generated conjointly by clients and AI. Our study also shows that while AI can provide valuable feedback on designs, it might fail to generate such designs itself, suggesting interesting connections to foundations of computer science, i.e., NP-completeness, where verifying a solution can be far easier than producing one. AI's feedback also tends to hamper creativity and innovation by steering novel, innovative approaches toward more standardized designs. Our study also reveals uncertainty among architects about the interpretative sovereignty of architecture and a loss of meaning and identity when AI increasingly takes over authorship in the design process.
- [290] arXiv:2411.15066 [pdf, html, other]
-
Title: SPAC-Net: Rethinking Point Cloud Completion with Structural Prior
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Point cloud completion aims to infer a complete shape from its partial observation. Many approaches utilize a pure encoder-decoder paradigm in which the complete shape is directly predicted from shape priors learned from partial scans; however, these methods inevitably suffer from loss of detail due to feature abstraction. In this paper, we propose a novel framework, termed SPAC-Net, that rethinks the completion task under the guidance of a new structural prior, which we call the interface. Specifically, our method first uses a Marginal Detector (MAD) module to localize the interface, defined as the intersection between the known observation and the missing parts. Based on the interface, our method predicts the coarse shape by learning the displacement by which points on the interface move to their corresponding positions in the missing parts. Furthermore, we devise an additional Structure Supplement (SSP) module before the upsampling stage to enhance the structural details of the coarse shape, enabling the upsampling module to focus on the upsampling task. Extensive experiments have been conducted on several challenging benchmarks, and the results demonstrate that our method outperforms existing state-of-the-art approaches.
- [291] arXiv:2411.15068 [pdf, html, other]
-
Title: Locating the Leading Edge of Cultural Change
Comments: Accepted CHR 2024
Subjects: Computation and Language (cs.CL)
Measures of textual similarity and divergence are increasingly used to study cultural change. But which measures align, in practice, with social evidence about change? We apply three different representations of text (topic models, document embeddings, and word-level perplexity) to three different corpora (literary studies, economics, and fiction). In every case, works by highly-cited authors and younger authors are textually ahead of the curve. We don't find clear evidence that one representation of text is to be preferred over the others. But alignment with social evidence is strongest when texts are represented through the top quartile of passages, suggesting that a text's impact may depend more on its most forward-looking moments than on sustaining a high level of innovation throughout.
- [292] arXiv:2411.15074 [pdf, html, other]
-
Title: Learning to Stabilize Faces
Comments: Eurographics 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Nowadays, it is possible to scan faces and automatically register them with high quality. However, the resulting face meshes often need further processing: we need to stabilize them to remove unwanted head movement. Stabilization is important for tasks like game development or movie making which require facial expressions to be cleanly separated from rigid head motion. Since manual stabilization is labor-intensive, there have been attempts to automate it. However, previous methods remain impractical: they either still require some manual input, produce imprecise alignments, rely on dubious heuristics and slow optimization, or assume a temporally ordered input. Instead, we present a new learning-based approach that is simple and fully automatic. We treat stabilization as a regression problem: given two face meshes, our network directly predicts the rigid transform between them that brings their skulls into alignment. We generate synthetic training data using a 3D Morphable Model (3DMM), exploiting the fact that 3DMM parameters separate skull motion from facial skin motion. Through extensive experiments we show that our approach outperforms the state-of-the-art both quantitatively and qualitatively on the tasks of stabilizing discrete sets of facial expressions as well as dynamic facial performances. Furthermore, we provide an ablation study detailing the design choices and best practices to help others adopt our approach for their own uses. Supplementary videos can be found on the project webpage this http URL.
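For background on the regression target, the classical closed-form solution for the rigid transform aligning two corresponding point sets is the Kabsch algorithm; the sketch below is an editor's illustration of that standard procedure, not the paper's network.

```python
import numpy as np

# Editor's illustration: Kabsch algorithm for the optimal rigid transform
# between corresponding point sets (the kind of transform the paper's
# network regresses directly from two face meshes).

def kabsch(P, Q):
    """Return R (3x3) and t (3,) minimizing ||(P @ R.T + t) - Q|| over
    rotations, for corresponding points P, Q of shape (N, 3)."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t
```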
- [293] arXiv:2411.15082 [pdf, html, other]
-
Title: Towards Speaker Identification with Minimal Dataset and Constrained Resources using 1D-Convolution Neural Network
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Voice recognition and speaker identification are vital for applications in security and personal assistants. This paper presents a lightweight 1D-Convolutional Neural Network (1D-CNN) designed to perform speaker identification on minimal datasets. Our approach achieves a validation accuracy of 97.87%, leveraging data augmentation techniques to handle background noise and limited training samples. Future improvements include testing on larger datasets and integrating transfer learning methods to enhance generalizability. We provide all code, the custom dataset, and the trained models to facilitate reproducibility. These resources are available on our GitHub repository: this https URL.
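A lightweight 1D-CNN classifier of the kind described can be sketched as follows; this is an editor's illustration with assumed layer sizes and input format, not the paper's exact architecture.

```python
import torch.nn as nn

# Editor's sketch of a small 1D-CNN speaker classifier. Input is assumed to
# be a mono waveform or feature stream shaped (batch, 1, time).

class SpeakerCNN(nn.Module):
    def __init__(self, n_speakers):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, stride=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=9, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # length-independent pooling
        )
        self.classifier = nn.Linear(64, n_speakers)

    def forward(self, x):
        return self.classifier(self.features(x).squeeze(-1))
```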
- [294] arXiv:2411.15087 [pdf, html, other]
-
Title: Instance-Aware Generalized Referring Expression Segmentation
Comments: 12 pages, 7 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Recent works on Generalized Referring Expression Segmentation (GRES) struggle with handling complex expressions referring to multiple distinct objects. This is because these methods typically employ an end-to-end foreground-background segmentation and lack a mechanism to explicitly differentiate and associate different object instances to the text query. To this end, we propose InstAlign, a method that incorporates object-level reasoning into the segmentation process. Our model leverages both text and image inputs to extract a set of object-level tokens that capture both the semantic information in the input prompt and the objects within the image. By modeling the text-object alignment via instance-level supervision, each token uniquely represents an object segment in the image, while also aligning with relevant semantic information from the text. Extensive experiments on the gRefCOCO and Ref-ZOM benchmarks demonstrate that our method significantly advances state-of-the-art performance, setting a new standard for precise and flexible GRES.
- [295] arXiv:2411.15091 [pdf, html, other]
-
Title: Somesite I Used To Crawl: Awareness, Agency and Efficacy in Protecting Content Creators From AI Crawlers
Comments: Under Submission
Subjects: Human-Computer Interaction (cs.HC)
The success of generative AI relies heavily on training on data scraped through extensive crawling of the Internet, a practice that has raised significant copyright, privacy, and ethical concerns. While few measures are designed to resist a resource-rich adversary determined to scrape a site, crawlers can be impacted by a range of existing tools such as this http URL, NoAI meta tags, and active crawler blocking by reverse proxies.
In this work, we seek to understand the ability and efficacy of today's networking tools to protect content creators against AI-related crawling. For targeted populations like human artists, do they have the technical knowledge and agency to utilize crawler-blocking tools such as this http URL, and can such tools be effective? Using large-scale measurements and a targeted user study of 182 professional artists, we find strong demand for tools like this http URL, but adoption is constrained by major hurdles in technical awareness, limited agency in deploying them, and limited efficacy against unresponsive crawlers. We further test and evaluate network-level crawler blocking by reverse proxies, and find that despite very limited deployment today, their reliable and comprehensive blocking of AI crawlers makes them the strongest protection for artists moving forward.
- [296] arXiv:2411.15096 [pdf, html, other]
-
Title: RED: Effective Trajectory Representation Learning with Comprehensive Information
Comments: This paper is accepted by VLDB2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Trajectory representation learning (TRL) maps trajectories to vectors that can then be used for various downstream tasks, including trajectory similarity computation, trajectory classification, and travel-time estimation. However, existing TRL methods often produce vectors that, when used in downstream tasks, yield insufficiently accurate results. A key reason is that they fail to utilize the comprehensive information encompassed by trajectories. We propose a self-supervised TRL framework, called RED, which effectively exploits multiple types of trajectory information. Overall, RED adopts the Transformer as the backbone model and masks the constituting paths in trajectories to train a masked autoencoder (MAE). In particular, RED considers the moving patterns of trajectories by employing a Road-aware masking strategy that retains key paths of trajectories during masking, thereby preserving crucial information of the trajectories. RED also adopts a spatial-temporal-user joint Embedding scheme to encode comprehensive information when preparing the trajectories as model inputs. To conduct training, RED adopts Dual-objective task learning: the Transformer encoder predicts the next segment in a trajectory, and the Transformer decoder reconstructs the entire trajectory. RED also considers the spatial-temporal correlations of trajectories by modifying the attention mechanism of the Transformer. We compare RED with 9 state-of-the-art TRL methods for 4 downstream tasks on 3 real-world datasets, finding that RED can usually improve the accuracy of the best-performing baseline by over 5%.
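A minimal editor's sketch of a road-aware masking step in this spirit (the function name and mask ratio are assumptions; RED's implementation details differ):

```python
import torch

# Editor's sketch: mask a fraction of path tokens for MAE-style training
# while never masking the designated key paths of the trajectory.

def road_aware_mask(n_tokens, key_idx, mask_ratio=0.6):
    """Boolean mask (True = masked) over n_tokens path tokens; tokens listed
    in key_idx (key paths) are always kept visible."""
    keep = set(key_idx)
    candidates = [i for i in range(n_tokens) if i not in keep]
    n_mask = min(int(mask_ratio * n_tokens), len(candidates))
    chosen = torch.randperm(len(candidates))[:n_mask]
    mask = torch.zeros(n_tokens, dtype=torch.bool)
    for i in chosen:
        mask[candidates[int(i)]] = True
    return mask
```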
- [297] arXiv:2411.15098 [pdf, html, other]
-
Title: OminiControl: Minimal and Universal Control for Diffusion Transformer
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
In this paper, we introduce OminiControl, a highly versatile and parameter-efficient framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models. At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions using itself as a powerful backbone and process them with its flexible multi-modal attention processors. Unlike existing methods, which rely heavily on additional encoder modules with complex architectures, OminiControl (1) effectively and efficiently incorporates injected image conditions with only ~0.1% additional parameters, and (2) addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions such as edges, depth, and more. Remarkably, these capabilities are achieved by training on images generated by the DiT itself, which is particularly beneficial for subject-driven generation. Extensive evaluations demonstrate that OminiControl outperforms existing UNet-based and DiT-adapted models in both subject-driven and spatially-aligned conditional generation. Additionally, we release our training dataset, Subjects200K, a diverse collection of over 200,000 identity-consistent images, along with an efficient data synthesis pipeline to advance research in subject-consistent generation.
- [298] arXiv:2411.15099 [pdf, html, other]
-
Title: Context-Aware Multimodal Pretraining
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage representations to support few-shot adaptation. In this work, we propose a simple, but carefully designed extension to multimodal pretraining which enables representations to accommodate additional context. Using this objective, we show that vision-language models can be trained to exhibit significantly increased few-shot adaptation: across 21 downstream tasks, we find up to four-fold improvements in test-time sample efficiency, and average few-shot adaptation gains of over 5%, while retaining zero-shot generalization performance across model scales and training durations. In particular, equipped with simple, training-free, metric-based adaptation mechanisms, our representations easily surpass more complex and expensive optimization-based schemes, vastly simplifying generalization to new domains.
- [299] arXiv:2411.15100 [pdf, html, other]
-
Title: XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
The applications of LLM agents are becoming increasingly complex and diverse, leading to a high demand for structured outputs that can be parsed into code, structured function calls, and embodied agent commands. These developments bring significant demands for structured generation in LLM inference. Context-free grammar is a flexible approach to enable structured generation via constrained decoding. However, executing a context-free grammar requires going through several stack states over all tokens in the vocabulary during runtime, bringing non-negligible overhead for structured generation. In this paper, we propose XGrammar, a flexible and efficient structured generation engine for large language models. XGrammar accelerates context-free grammar execution by dividing the vocabulary into context-independent tokens that can be prechecked and context-dependent tokens that need to be interpreted at runtime. We further build transformations to expand the grammar context and reduce the number of context-independent tokens. Additionally, we build an efficient persistent stack to accelerate the context-dependent token checks. Finally, we co-design the grammar engine with the LLM inference engine to overlap grammar computation with GPU execution. Evaluation results show that XGrammar can achieve up to a 100x speedup over existing solutions. Combined with an LLM inference engine, it enables near-zero-overhead structured generation in end-to-end LLM serving.
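A much-simplified editor's sketch of the precheck idea (toy vocabulary and checks; XGrammar's real engine operates on grammar stack states and byte-level tokens): tokens whose validity never depends on the parser state are classified once up front, so only the remaining tokens pay a per-step runtime check.

```python
import re

# Editor's sketch: partition a toy vocabulary into tokens that can be
# rejected once (context-independent) and tokens that must be checked
# against the grammar stack state on every decoding step.

VOCAB = ["{", "}", '"name"', ":", "42", ",", "foo", "###"]
ALWAYS_INVALID = re.compile(r"#")   # e.g., characters no grammar rule accepts

context_independent_reject = {t for t in VOCAB if ALWAYS_INVALID.search(t)}
context_dependent = [t for t in VOCAB if t not in context_independent_reject]

def allowed_tokens(stack_state, accepts):
    """accepts(stack_state, token) -> bool is the runtime CFG check; only
    context-dependent tokens incur its cost at each step."""
    return [t for t in context_dependent if accepts(stack_state, t)]
```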
- [300] arXiv:2411.15101 [pdf, html, other]
-
Title: What You See is Not What You Get: Neural Partial Differential Equations and The Illusion of Learning
Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Differentiable programming for scientific machine learning (SciML) has recently seen considerable interest and success, as it directly embeds neural networks inside PDEs, often called NeuralPDEs, derived from first-principles physics. There is therefore a widespread assumption in the community that NeuralPDEs are more trustworthy and generalizable than black-box models. However, like any SciML model, differentiable programming relies predominantly on high-quality PDE simulations as "ground truth" for training, and mathematics dictates that these are only discrete numerical approximations of the true physics. Therefore, we ask: are NeuralPDEs and differentiable programming models trained on PDE simulations as physically interpretable as we think? In this work, we rigorously attempt to answer these questions, using established ideas from numerical analysis, experiments, and analysis of model Jacobians. Our study shows that NeuralPDEs learn the artifacts in the simulation training data arising from the discretized Taylor-series truncation error of the spatial derivatives. Additionally, NeuralPDE models are systematically biased, and their generalization capability is likely enabled by a fortuitous interplay of numerical dissipation and truncation error in the training dataset and the NeuralPDE, which seldom happens in practical applications. This bias manifests aggressively even in relatively accessible 1-D equations, raising concerns about the veracity of differentiable programming on complex, high-dimensional, real-world PDEs, and about dataset integrity in foundation models. Further, we observe that the initial condition constrains the truncation error in initial-value problems in PDEs, thereby exerting limitations on extrapolation. Finally, we demonstrate that an eigenanalysis of model weights can indicate a priori whether the model will be inaccurate for out-of-distribution testing.
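As a concrete reminder of the truncation error at issue, the standard Taylor expansion of the second-order central difference (an editor-added textbook identity, not taken from the paper) reads:

```latex
% Central-difference approximation of u_x and its leading truncation error:
\frac{u(x+h) - u(x-h)}{2h}
  = u_x(x) + \frac{h^2}{6}\, u_{xxx}(x) + \mathcal{O}(h^4)
```

The $h^2 u_{xxx}/6$ term is exactly the kind of discretization artifact that, per the abstract, a NeuralPDE trained on such simulations can absorb as if it were physics.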
- [301] arXiv:2411.15102 [pdf, html, other]
-
Title: AttriBoT: A Bag of Tricks for Efficiently Approximating Leave-One-Out Context AttributionComments: 29 pages, 11 figuresSubjects: Machine Learning (cs.LG)
The influence of contextual input on the behavior of large language models (LLMs) has prompted the development of context attribution methods that aim to quantify each context span's effect on an LLM's generations. The leave-one-out (LOO) error, which measures the change in the likelihood of the LLM's response when a given span of the context is removed, provides a principled way to perform context attribution, but can be prohibitively expensive to compute for large models. In this work, we introduce AttriBoT, a series of novel techniques for efficiently computing an approximation of the LOO error for context attribution. Specifically, AttriBoT uses cached activations to avoid redundant operations, performs hierarchical attribution to reduce computation, and emulates the behavior of large target models with smaller proxy models. Taken together, AttriBoT can provide a >300x speedup while remaining more faithful to a target model's LOO error than prior context attribution methods. This stark increase in performance makes computing context attributions for a given response 30x faster than generating the response itself, empowering real-world applications that require computing attributions at scale. We release a user-friendly and efficient implementation of AttriBoT to enable efficient LLM interpretability as well as encourage future development of efficient context attribution methods.
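For concreteness, the quantity being approximated can be computed naively as follows. This sketch is an assumption-laden illustration, not AttriBoT's code: it uses GPT-2 as a stand-in model via Hugging Face transformers and measures how the response log-likelihood changes when each context span is removed; tokenization at span boundaries is handled loosely here.

```python
# Naive leave-one-out (LOO) context attribution, the quantity AttriBoT
# approximates efficiently. Model choice and span splitting are
# illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # stand-in proxy model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def response_logprob(context_spans, question, response):
    prompt = " ".join(context_spans) + " " + question
    full = tok(prompt + " " + response, return_tensors="pt")
    n_prompt = len(tok(prompt)["input_ids"])
    with torch.no_grad():
        logits = model(**full).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)  # predicts tokens 1..T-1
    targets = full["input_ids"][0, 1:]
    token_logp = logp[torch.arange(len(targets)), targets]
    return token_logp[n_prompt - 1:].sum().item()     # response tokens only

spans = ["The Eiffel Tower is in Paris.", "The Nile is in Africa."]
q, r = "Where is the Eiffel Tower?", "Paris"
base = response_logprob(spans, q, r)
for i in range(len(spans)):
    ablated = response_logprob(spans[:i] + spans[i + 1:], q, r)
    print(f"span {i}: LOO error = {base - ablated:.3f}")  # larger = more influential
```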
- [302] arXiv:2411.15103 [pdf, other]
-
Title: Coslice Colimits in Homotopy Type TheoryPerry Hart, Kuen-Bang Hou (Favonia)Subjects: Logic in Computer Science (cs.LO); Category Theory (math.CT); Logic (math.LO)
We contribute to the theory of (homotopy) colimits inside homotopy type theory. The heart of our work characterizes the connection between colimits in coslices of a universe, called coslice colimits, and colimits in the universe (i.e., ordinary colimits). To derive this characterization, we find an explicit construction of colimits in coslices that is tailored to reveal the connection. We use the construction to derive properties of colimits. Notably, we prove that the forgetful functor from a coslice creates colimits over trees. We also use the construction to examine how colimits interact with orthogonal factorization systems and with cohomology theories. As a consequence of their interaction with orthogonal factorization systems, all pointed colimits (special kinds of coslice colimits) preserve $n$-connectedness, which implies that higher groups are closed under colimits on directed graphs. We have formalized our main construction of the coslice colimit functor in Agda. The code for this paper is available at this https URL
- [303] arXiv:2411.15106 [pdf, html, other]
-
Title: About Time: Advances, Challenges, and Outlooks of Action UnderstandingSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
We have witnessed impressive advances in video action understanding. Increased dataset sizes, variability, and computation availability have enabled leaps in performance and task diversification. Current systems can provide coarse- and fine-grained descriptions of video scenes, extract segments corresponding to queries, synthesize unobserved parts of videos, and predict context. This survey comprehensively reviews advances in uni- and multi-modal action understanding across a range of tasks. We focus on prevalent challenges, overview widely adopted datasets, and survey seminal works with an emphasis on recent advances. We broadly distinguish between three temporal scopes: (1) recognition tasks of actions observed in full, (2) prediction tasks for ongoing partially observed actions, and (3) forecasting tasks for subsequent unobserved action. This division allows us to identify specific action modeling and video representation challenges. Finally, we outline future directions to address current shortcomings.
- [304] arXiv:2411.15109 [pdf, html, other]
-
Title: Effective Littlestone DimensionComments: 12 pagesSubjects: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Delle Rose et al. (COLT'23) introduced an effective version of the Vapnik-Chervonenkis dimension, and showed that it characterizes improper PAC learning with total computable learners. In this paper, we introduce and study a similar effectivization of the notion of Littlestone dimension. Finite effective Littlestone dimension is a necessary condition for computable online learning but is not a sufficient one -- which we establish already for classes of effective Littlestone dimension 2. However, the effective Littlestone dimension equals the optimal mistake bound for computable learners in two special cases: a) for classes of Littlestone dimension 1 and b) when the learner receives as additional information an upper bound on the numbers to be guessed. Interestingly, finite effective Littlestone dimension also guarantees that the class consists only of computable functions.
- [305] arXiv:2411.15110 [pdf, html, other]
-
Title: A Real-Time DETR Approach to Bangladesh Road Object Detection for Autonomous VehiclesSubjects: Computer Vision and Pattern Recognition (cs.CV)
In recent years, we have witnessed a paradigm shift in the field of computer vision with the advent of the transformer architecture. Detection Transformers have become a state-of-the-art solution for object detection and are a potential candidate for road object detection in autonomous vehicles. Despite the abundance of object detection schemes, real-time DETR models have been shown to perform significantly better in inference time, with minimal loss of accuracy and performance. In our work, we applied Real-Time DETR (RTDETR) object detection to the BadODD road object detection dataset based in Bangladesh, and performed the necessary experimentation and testing. Our results gave a mAP50 score of 0.41518 on the public 60% test set, and 0.28194 on the private 40% test set.
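For readers who want to reproduce this kind of pipeline, a hedged sketch using the Ultralytics RTDETR API is shown below; the checkpoint name and dataset YAML are assumptions for illustration, not artifacts released with the paper.

```python
# Sketch of an RT-DETR fine-tuning pipeline of the kind described.
# "badodd.yaml" is a hypothetical dataset config; the paper's exact
# training setup is not reproduced here.
from ultralytics import RTDETR

model = RTDETR("rtdetr-l.pt")           # COCO-pretrained Real-Time DETR
model.train(data="badodd.yaml",         # hypothetical BadODD dataset YAML
            epochs=50, imgsz=640)
metrics = model.val()                   # reports mAP50, mAP50-95, etc.
print(metrics.box.map50)
```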
- [306] arXiv:2411.15111 [pdf, html, other]
-
Title: Learnable Activation Functions in Physics-Informed Neural Networks for Solving Partial Differential EquationsSubjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
We investigate the use of learnable activation functions in Physics-Informed Neural Networks (PINNs) for solving Partial Differential Equations (PDEs). Specifically, we compare the efficacy of traditional Multilayer Perceptrons (MLPs) with fixed and learnable activations against Kolmogorov-Arnold Networks (KANs), which employ learnable basis functions. PINNs have emerged as an effective method for directly incorporating physical laws into the learning process, offering a data-efficient solution for both the forward and inverse problems associated with PDEs. However, challenges such as effective training and spectral bias, where low-frequency components are learned more effectively, often limit their applicability to problems characterized by rapid oscillations or sharp transitions. By employing different activation or basis functions in MLPs and KANs, we assess their impact on convergence behavior, spectral bias mitigation, and the accurate approximation of PDEs. The findings offer insights into the design of neural network architectures that balance training efficiency, convergence speed, and test accuracy for PDE solvers. By evaluating the influence of activation or basis function choices, this work provides guidelines for developing more robust and accurate PINN models. The source code and pre-trained models used in this study are made publicly available to facilitate reproducibility and future exploration.
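One of the simplest learnable-activation schemes in this design space can be sketched in a few lines of PyTorch (our own illustration, not the paper's released code): a trainable slope inside tanh, optimized jointly with the network weights through the physics-informed residual.

```python
# Minimal PINN sketch with a learnable activation: a trainable slope `a`
# inside tanh, trained jointly with the weights via the PDE residual.
import torch
import torch.nn as nn

class AdaptiveTanh(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.ones(1))   # learned activation slope

    def forward(self, x):
        return torch.tanh(self.a * x)

pinn = nn.Sequential(
    nn.Linear(1, 32), AdaptiveTanh(),
    nn.Linear(32, 32), AdaptiveTanh(),
    nn.Linear(32, 1),
)

# Physics-informed residual for u'' = -u (toy oscillator) via autograd.
x = torch.linspace(0, 1, 100, requires_grad=True).unsqueeze(1)
u = pinn(x)
du = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
d2u = torch.autograd.grad(du, x, torch.ones_like(du), create_graph=True)[0]
pde_loss = ((d2u + u) ** 2).mean()
pde_loss.backward()   # gradients flow into weights and activation slopes
```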
- [307] arXiv:2411.15113 [pdf, html, other]
-
Title: Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable DiffusionSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
As text-to-image models grow increasingly powerful and complex, their burgeoning size presents a significant obstacle to widespread adoption, especially on resource-constrained devices. This paper presents a pioneering study on post-training pruning of Stable Diffusion 2, addressing the critical need for model compression in the text-to-image domain. Our study tackles pruning techniques for previously unexplored multi-modal generation models, and in particular examines the impact of pruning on the textual component and the image generation component separately. We conduct a comprehensive comparison of pruning the full model or its individual components at various sparsities. Our results yield previously undocumented findings. For example, contrary to established trends in language model pruning, we discover that simple magnitude pruning outperforms more advanced techniques in the text-to-image context. Furthermore, our results show that Stable Diffusion 2 can be pruned to 38.5% sparsity with minimal quality loss, achieving a significant reduction in model size. We propose an optimal pruning configuration that prunes the text encoder to 47.5% and the diffusion generator to 35% sparsity. This configuration maintains image generation quality while substantially reducing computational requirements. In addition, our work uncovers intriguing questions about information encoding in text-to-image models: we observe that pruning beyond certain thresholds leads to sudden performance drops (unreadable images), suggesting that specific weights encode critical semantic information. This finding opens new avenues for future research in model compression, interpretability, and bias identification in text-to-image models. By providing crucial insights into the pruning behavior of text-to-image models, our study lays the groundwork for developing more efficient and accessible AI-driven image generation systems.
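Magnitude pruning of the two components can be sketched with standard PyTorch utilities. The snippet below is illustrative, not the authors' code: it applies the reported sparsity targets to a Hugging Face Stable Diffusion 2 checkpoint, and the restriction to Linear and Conv2d layers is our assumption.

```python
# Sketch of the simple magnitude pruning the paper finds effective,
# applied per-component to Stable Diffusion 2. Sparsity targets mirror
# the reported configuration; layer selection is an assumption.
import torch.nn as nn
import torch.nn.utils.prune as prune
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base")

def magnitude_prune(module, amount):
    for m in module.modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            prune.l1_unstructured(m, name="weight", amount=amount)
            prune.remove(m, "weight")   # bake the mask into the weights

magnitude_prune(pipe.text_encoder, amount=0.475)  # text encoder: 47.5%
magnitude_prune(pipe.unet, amount=0.35)           # diffusion generator: 35%
```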
- [308] arXiv:2411.15114 [pdf, html, other]
-
Title: RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human expertsHjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, Elena Ericheva, Katharyn Garcia, Brian Goodrich, Nikola Jurkovic, Megan Kinniment, Aron Lajko, Seraphina Nix, Lucas Sato, William Saunders, Maksym Taran, Ben West, Elizabeth BarnesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, there exist few evaluations for AI R&D capabilities, and none that are highly realistic and have a direct comparison to human performance. We introduce RE-Bench (Research Engineering Benchmark, v1), which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts. We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions. We compare humans to several public frontier models through best-of-k with varying time budgets and agent designs, and find that the best AI agents achieve a score 4x higher than human experts when both are given a total time budget of 2 hours per environment. However, humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2x the score of the top AI agent when both are given 32 total hours (across different attempts). Qualitatively, we find that modern AI agents possess significant expertise in many ML topics -- e.g., an agent wrote a faster custom Triton kernel than any of our human experts -- and can generate and test solutions over ten times faster than humans, at much lower cost. We open-source the evaluation environments, human expert data, analysis code, and agent trajectories to facilitate future research.
- [309] arXiv:2411.15115 [pdf, html, other]
-
Title: VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized RefinementComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Recent text-to-video (T2V) diffusion models have demonstrated impressive generation capabilities across various domains. However, these models often generate videos that have misalignments with text prompts, especially when the prompts describe complex scenes with multiple objects and attributes. To address this, we introduce VideoRepair, a novel model-agnostic, training-free video refinement framework that automatically identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback, enabling a T2V diffusion model to perform targeted, localized refinements. VideoRepair consists of four stages: In (1) video evaluation, we detect misalignments by generating fine-grained evaluation questions and answering them with an MLLM. In (2) refinement planning, we identify accurately generated objects and then create localized prompts to refine other areas in the video. Next, in (3) region decomposition, we segment the correctly generated area using a combined grounding module. Finally, in (4) localized refinement, we regenerate the video by adjusting the misaligned regions while preserving the correct regions. On two popular video generation benchmarks (EvalCrafter and T2V-CompBench), VideoRepair substantially outperforms recent baselines across various text-video alignment metrics. We provide a comprehensive analysis of VideoRepair components and qualitative examples.
- [310] arXiv:2411.15122 [pdf, html, other]
-
Title: ReXrank: A Public Leaderboard for AI-Powered Radiology Report GenerationXiaoman Zhang, Hong-Yu Zhou, Xiaoli Yang, Oishi Banerjee, Julián N. Acosta, Josh Miller, Ouwen Huang, Pranav RajpurkarSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
AI-driven models have demonstrated significant potential in automating radiology report generation for chest X-rays. However, there is no standardized benchmark for objectively evaluating their performance. To address this, we present ReXrank, this https URL, a public leaderboard and challenge for assessing AI-powered radiology report generation. Our framework incorporates ReXGradient, the largest test dataset consisting of 10,000 studies, and three public datasets (MIMIC-CXR, IU-Xray, CheXpert Plus) for report generation assessment. ReXrank employs 8 evaluation metrics and separately assesses models capable of generating only findings sections and those providing both findings and impressions sections. By providing this standardized evaluation framework, ReXrank enables meaningful comparisons of model performance and offers crucial insights into their robustness across diverse clinical settings. Beyond its current focus on chest X-rays, ReXrank's framework sets the stage for comprehensive evaluation of automated reporting across the full spectrum of medical imaging.
- [311] arXiv:2411.15124 [pdf, html, other]
-
Title: T\"ULU 3: Pushing Frontiers in Open Language Model Post-TrainingNathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, Hannaneh HajishirziSubjects: Computation and Language (cs.CL)
Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce TÜLU 3, a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. TÜLU 3, which builds on Llama 3.1 base models, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku. The training algorithms for our models include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR). With TÜLU 3, we introduce a multi-task evaluation scheme for post-training recipes with development and unseen evaluations, standard benchmark implementations, and substantial decontamination of existing open datasets on said benchmarks. We conclude with analysis and discussion of training methods that did not reliably improve performance.
In addition to the TÜLU 3 model weights and demo, we release the complete recipe -- including datasets for diverse core skills, a robust toolkit for data curation and evaluation, the training code and infrastructure, and, most importantly, a detailed report for reproducing and further adapting the TÜLU 3 approach to more domains.
- [312] arXiv:2411.15127 [pdf, html, other]
-
Title: PRIMUS: Pretraining IMU Encoders with Multimodal Self-SupervisionComments: Also presented under the title "PRIMUS: Pretraining IMU Encoders with Multimodal and Self-Supervised Learning" at NeurIPS 2024 TSALM Workshop (Time Series in the Age of Large Models)Subjects: Machine Learning (cs.LG)
Sensing human motions through Inertial Measurement Units (IMUs) embedded in personal devices has enabled significant applications in health and wellness. While labeled IMU data is scarce, we can collect unlabeled or weakly labeled IMU data to model human motions. For video or text modalities, the "pretrain and adapt" approach utilizes large volumes of unlabeled or weakly labeled data for pretraining, building a strong feature extractor, followed by adaptation to specific tasks using limited labeled data. This approach has not been widely adopted in the IMU domain for two reasons: (1) pretraining methods are poorly understood in the context of IMUs, and (2) pretrained models that generalize across datasets are rarely available as open source. In this paper, we aim to address the first issue by proposing PRIMUS, a method for PRetraining IMU encoderS. We conduct a systematic and unified evaluation of various self-supervised and multimodal learning pretraining objectives. Our findings indicate that using PRIMUS, which combines self-supervision, multimodal supervision, and nearest-neighbor supervision, can significantly enhance downstream performance. With fewer than 500 labeled samples per class, PRIMUS effectively enhances downstream performance by up to 15% on held-out test data, compared to the state-of-the-art multimodal training method. To benefit the broader community, our code and pre-trained IMU encoders will be made publicly available at this http URL upon publication.
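The loss composition described can be sketched as follows; this is our own illustration with stand-in embeddings, and PRIMUS's actual encoders, queues, and term weighting are not reproduced here. It shows three InfoNCE-style terms covering self-supervised, multimodal, and nearest-neighbor supervision.

```python
# Illustrative composition of the three supervision signals named in the
# abstract. Embeddings are random stand-ins for encoder outputs.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    # Contrastive loss: matched rows of z1 and z2 are positive pairs.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau
    labels = torch.arange(len(z1))
    return F.cross_entropy(logits, labels)

imu_emb = torch.randn(32, 128)     # IMU encoder outputs (one batch)
imu_aug = torch.randn(32, 128)     # embeddings of augmented IMU views
video_emb = torch.randn(32, 128)   # paired video encoder outputs
nn_emb = torch.randn(32, 128)      # nearest neighbors from a support set

loss = (info_nce(imu_emb, imu_aug)       # self-supervision
        + info_nce(imu_emb, video_emb)   # multimodal supervision
        + info_nce(imu_emb, nn_emb))     # nearest-neighbor supervision
```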
- [313] arXiv:2411.15128 [pdf, html, other]
-
Title: Health AI Developer FoundationsAtilla P. Kiraly, Sebastien Baur, Kenneth Philbrick, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Nick George, Fayaz Jamil, Jing Tang, Kai Bailey, Faruk Ahmed, Akshay Goel, Abbi Ward, Lin Yang, Andrew Sellergren, Yossi Matias, Avinatan Hassidim, Shravya Shetty, Daniel Golden, Shekoofeh Azizi, David F. Steiner, Yun Liu, Tim Thelin, Rory Pilgrim, Can KirmizibayrakComments: 16 pages, 8 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Robust medical Machine Learning (ML) models have the potential to revolutionize healthcare by accelerating clinical research, improving workflows and outcomes, and producing novel insights or capabilities. Developing such ML models from scratch is cost prohibitive and requires substantial compute, data, and time (e.g., expert labeling). To address these challenges, we introduce Health AI Developer Foundations (HAI-DEF), a suite of pre-trained, domain-specific foundation models, tools, and recipes to accelerate building ML for health applications. The models cover various modalities and domains, including radiology (X-rays and computed tomography), histopathology, dermatological imaging, and audio. These models provide domain specific embeddings that facilitate AI development with less labeled data, shorter training times, and reduced computational costs compared to traditional approaches. In addition, we utilize a common interface and style across these models, and prioritize usability to enable developers to integrate HAI-DEF efficiently. We present model evaluations across various tasks and conclude with a discussion of their application and evaluation, covering the importance of ensuring efficacy, fairness, and equity. Finally, while HAI-DEF and specifically the foundation models lower the barrier to entry for ML in healthcare, we emphasize the importance of validation with problem- and population-specific data for each desired usage setting. This technical report will be updated over time as more modalities and features are added.
- [314] arXiv:2411.15129 [pdf, other]
-
Title: Measuring Bullshit in the Language Games played by ChatGPTSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Generative large language models (LLMs), which create text without direct correspondence to truth value, are widely understood to resemble the uses of language described in Frankfurt's popular monograph On Bullshit. In this paper, we offer a rigorous investigation of this topic, identifying how the phenomenon has arisen and how it might be analysed, and we elaborate on this argument to propose that LLM-based chatbots play the 'language game of bullshit'. We use statistical text analysis to investigate the features of this Wittgensteinian language game, based on a dataset constructed to contrast the language of 1,000 scientific publications with typical pseudo-scientific text generated by ChatGPT. We then explore whether the same language features can be detected in two well-known contexts of social dysfunction: George Orwell's critique of politics and language, and David Graeber's characterisation of bullshit jobs. Using simple hypothesis-testing methods, we demonstrate that a statistical model of the language of bullshit can reliably relate the Frankfurtian artificial bullshit of ChatGPT to the political and workplace functions of bullshit as observed in natural human language.
- [315] arXiv:2411.15130 [pdf, html, other]
-
Title: Learning-based Trajectory Tracking for Bird-inspired Flapping-Wing RobotsSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
Bird-sized flapping-wing robots offer significant potential for agile flight in complex environments, but achieving agile and robust trajectory tracking remains a challenge due to the complex aerodynamics and highly nonlinear dynamics inherent in flapping-wing flight. In this work, a learning-based control approach is introduced to unlock the versatility and adaptiveness of flapping-wing flight. We propose a model-free reinforcement learning (RL)-based framework for a high degree-of-freedom (DoF) bird-inspired flapping-wing robot that allows for multimodal flight and agile trajectory tracking. Stability analysis was performed on the closed-loop system comprising the flapping-wing system and the RL policy. Additionally, simulation results demonstrate that the RL-based controller can successfully learn complex wing trajectory patterns, achieve stable flight, switch between flight modes spontaneously, and track different trajectories under various aerodynamic conditions.
- [316] arXiv:2411.15131 [pdf, html, other]
-
Title: WildLMa: Long Horizon Loco-Manipulation in the WildRi-Zhao Qiu, Yuchen Song, Xuanbin Peng, Sai Aneesh Suryadevara, Ge Yang, Minghuan Liu, Mazeyu Ji, Chengzhe Jia, Ruihan Yang, Xueyan Zou, Xiaolong WangComments: Website: this https URLSubjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
`In-the-wild' mobile manipulation aims to deploy robots in diverse real-world environments, which requires the robot to (1) have skills that generalize across object configurations; (2) be capable of long-horizon task execution in diverse environments; and (3) perform complex manipulation beyond pick-and-place. Quadruped robots with manipulators hold promise for extending the workspace and enabling robust locomotion, but existing results do not investigate such a capability. This paper proposes WildLMa with three components to address these issues: (1) adaptation of learned low-level controller for VR-enabled whole-body teleoperation and traversability; (2) WildLMa-Skill -- a library of generalizable visuomotor skills acquired via imitation learning or heuristics and (3) WildLMa-Planner -- an interface of learned skills that allow LLM planners to coordinate skills for long-horizon tasks. We demonstrate the importance of high-quality training data by achieving higher grasping success rate over existing RL baselines using only tens of demonstrations. WildLMa exploits CLIP for language-conditioned imitation learning that empirically generalizes to objects unseen in training demonstrations. Besides extensive quantitative evaluation, we qualitatively demonstrate practical robot applications, such as cleaning up trash in university hallways or outdoor terrains, operating articulated objects, and rearranging items on a bookshelf.
- [317] arXiv:2411.15133 [pdf, other]
-
Title: On Approximability of Satisfiable $k$-CSPs: VISubjects: Computational Complexity (cs.CC); Combinatorics (math.CO)
We prove local and global inverse theorems for general $3$-wise correlations over pairwise-connected distributions. Let $\mu$ be a distribution over $\Sigma \times \Gamma \times \Phi$ such that the supports of $\mu_{xy}$, $\mu_{xz}$, and $\mu_{yz}$ are all connected, and let $f: \Sigma^n \to \mathbb{C}$, $g: \Gamma^n \to \mathbb{C}$, $h: \Phi^n \to \mathbb{C}$ be $1$-bounded functions satisfying \[ \left|\mathbb{E}_{(x,y,z) \sim \mu^{\otimes n}}[f(x)g(y)h(z)]\right| \geq \varepsilon. \] In this setting, our local inverse theorem asserts that there is $\delta :=\textsf{exp}(-\varepsilon^{-O_{\mu}(1)})$ such that with probability at least $\delta$, a random restriction of $f$ down to $\delta n$ coordinates $\delta$-correlates to a product function. To get a global inverse theorem, we prove a restriction inverse theorem for general product functions, stating that if a random restriction of $f$ down to $\delta n$ coordinates is $\delta$-correlated with a product function with probability at least $\delta$, then $f$ is $2^{-\textsf{poly}(\log(1/\delta))}$-correlated with a function of the form $L\cdot P$, where $L$ is a function of degree $\textsf{poly}(1/\delta)$, $\|L\|_2\leq 1$, and $P$ is a product function.
We show applications to property testing and to additive combinatorics. In particular, we show the following result via a density increment argument. Let $\Sigma$ be a finite set and $S \subseteq \Sigma \times \Sigma \times \Sigma$ such that: (1) $(x, x, x) \in S$ for all $x \in \Sigma$, and (2) the supports of $S_{xy}$, $S_{xz}$, and $S_{yz}$ are all connected. Then, any set $A \subseteq \Sigma^n$ with $|\Sigma|^{-n}|A| \geq \Omega((\log \log \log n)^{-c})$ contains $x, y, z \in A$, not all equal, such that $(x_i,y_i,z_i) \in S$ for all $i$. This gives the first reasonable bounds for the restricted 3-AP problem over finite fields.
- [318] arXiv:2411.15136 [pdf, html, other]
-
Title: On Approximability of Satisfiable $k$-CSPs: VIISubjects: Computational Complexity (cs.CC); Combinatorics (math.CO)
Let $\Sigma_1,\ldots,\Sigma_k$ be finite alphabets, and let $\mu$ be a distribution over $\Sigma_1 \times \dots \times \Sigma_k$ in which the probability of each atom is at least $\alpha$. We prove that if $\mu$ does not admit Abelian embeddings, and $f_i: \Sigma_i \to \mathbb{C}$ are $1$-bounded functions (for $i=1,\ldots,k$) such that \[ \left|\mathbb{E}_{(x_1,\dots,x_k) \sim \mu^{\otimes n}}\Big[f_1(x_1) \dots f_k(x_k)\Big]\right| \geq \varepsilon, \] then there exists $L\colon \Sigma_1^n\to\mathbb{C}$ of degree at most $d$ and $\|L\|_2\leq 1$ such that $|\langle f_1, L\rangle|\geq \delta$, where $d$ and $\delta>0$ depend only on $k, \alpha$ and $\varepsilon$. This answers the analytic question posed by Bhangale, Khot, and Minzer (STOC 2022). We also prove several extensions of this result that are useful in subsequent applications.
- [319] arXiv:2411.15138 [pdf, html, other]
-
Title: Material Anything: Generating Materials for Any 3D Object via DiffusionComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
We present Material Anything, a fully-automated, unified diffusion framework designed to generate physically-based materials for 3D objects. Unlike existing methods that rely on complex pipelines or case-specific optimizations, Material Anything offers a robust, end-to-end solution adaptable to objects under diverse lighting conditions. Our approach leverages a pre-trained image diffusion model, enhanced with a triple-head architecture and rendering loss to improve stability and material quality. Additionally, we introduce confidence masks as a dynamic switcher within the diffusion model, enabling it to effectively handle both textured and texture-less objects across varying lighting conditions. By employing a progressive material generation strategy guided by these confidence masks, along with a UV-space material refiner, our method ensures consistent, UV-ready material outputs. Extensive experiments demonstrate our approach outperforms existing methods across a wide range of object categories and lighting conditions.
- [320] arXiv:2411.15139 [pdf, html, other]
-
Title: DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous DrivingBencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, Xinggang WangComments: Work in progress. Code & demo & model will be available at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Recently, the diffusion model has emerged as a powerful generative technique for robotic policy learning, capable of modeling multi-mode action distributions. Leveraging its capability for end-to-end autonomous driving is a promising direction. However, the numerous denoising steps in the robotic diffusion policy and the more dynamic, open-world nature of traffic scenes pose substantial challenges for generating diverse driving actions at a real-time speed. To address these challenges, we propose a novel truncated diffusion policy that incorporates prior multi-mode anchors and truncates the diffusion schedule, enabling the model to learn denoising from an anchored Gaussian distribution to the multi-mode driving action distribution. Additionally, we design an efficient cascade diffusion decoder for enhanced interaction with conditional scene context. The proposed model, DiffusionDrive, demonstrates a 10$\times$ reduction in denoising steps compared to the vanilla diffusion policy, delivering superior diversity and quality in just 2 steps. On the planning-oriented NAVSIM dataset, with the aligned ResNet-34 backbone, DiffusionDrive achieves 88.1 PDMS without bells and whistles, setting a new record, while running at a real-time speed of 45 FPS on an NVIDIA 4090. Qualitative results on challenging scenarios further confirm that DiffusionDrive can robustly generate diverse plausible driving actions. Code and model will be available at this https URL.
New submissions (showing 320 of 320 entries)
- [321] arXiv:2411.14434 (cross-list from quant-ph) [pdf, html, other]
-
Title: Quantum CORDIC -- Arcsin on a BudgetComments: 6 pages, 3 figures, 3 algorithms, pending acceptance at peer-reviewed conferenceSubjects: Quantum Physics (quant-ph); Cryptography and Security (cs.CR)
This work introduces a quantum algorithm for computing the arcsine function to arbitrary accuracy. We leverage a technique from embedded computing and field-programmable gate arrays (FPGAs) called the COordinate Rotation DIgital Computer (CORDIC). CORDIC is a family of iterative algorithms that, in a classical context, can approximate various trigonometric, hyperbolic, and elementary functions using only bit shifts and additions. Adapting CORDIC to the quantum context is non-trivial, as the algorithm traditionally uses several non-reversible operations. We detail a method for CORDIC that avoids such non-reversible operations, and we propose multiple approaches to calculate the arcsine function reversibly with CORDIC. For $n$ bits of precision, our method has a space complexity of $O(n)$ qubits, a layer count of $O(n \log n)$, and a CNOT count of $O(n^2)$. This primitive function is a required step for the Harrow-Hassidim-Lloyd (HHL) algorithm, is necessary for quantum digital-to-analog conversion, can simplify a quantum speed-up for Monte Carlo methods, and has direct applications in the quantum estimation of Shapley values.
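The classical building block being made reversible can be sketched directly. The Python snippet below is our illustration of textbook CORDIC-style arcsine, not the paper's quantum circuit: it rotates a vector by angles atan(2^-i), tracking the known CORDIC gain on the comparison target. For |t| near 1, single rotations per step are known to be insufficient, which is one reason the paper explores multiple reversible variants.

```python
# Classical CORDIC-style arcsine (floating point for clarity; hardware
# versions replace the multiplies by 2**-i with bit shifts).
import math

def cordic_arcsin(t, n_iter=32):
    # Rotate (1, 0) until its y-component matches t (scaled by the
    # accumulated CORDIC gain); the accumulated angle approaches asin(t).
    x, y, angle = 1.0, 0.0, 0.0
    scale = t
    for i in range(n_iter):
        sigma = 1.0 if y < scale else -1.0       # rotation direction
        dx = -sigma * y * 2.0 ** -i              # shift-add micro-rotation
        dy = sigma * x * 2.0 ** -i
        x, y = x + dx, y + dy
        scale *= math.sqrt(1 + 2.0 ** (-2 * i))  # fold CORDIC gain into target
        angle += sigma * math.atan(2.0 ** -i)
    return angle  # note: |t| near 1 needs double-iteration variants

print(cordic_arcsin(0.5), math.asin(0.5))  # both ~0.5236
```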
- [322] arXiv:2411.14443 (cross-list from eess.SP) [pdf, html, other]
-
Title: Industrial Machines Health Prognosis using a Transformer-based FrameworkComments: 10 pages, 5 figures. Accepted for presentation at the IEEE MetroAXRAINE conferenceSubjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
This article introduces Transformer Quantile Regression Neural Networks (TQRNNs), a novel data-driven solution for real-time machine failure prediction in manufacturing contexts. Our objective is to develop an advanced predictive maintenance model capable of accurately identifying machine system breakdowns. To do so, TQRNNs employ a two-step approach: (i) a modified quantile regression neural network to segment anomaly outliers while maintaining low time complexity, and (ii) a concatenated transformer network aimed at facilitating accurate classification even within a large timeframe of up to one hour. We have implemented our proposed pipeline in a real-world beverage manufacturing industry setting. Our findings demonstrate the model's effectiveness, achieving an accuracy rate of 70.84% with a 1-hour lead time for predicting machine breakdowns. Additionally, our analysis shows that using TQRNNs can increase high-quality production, improving product yield from 78.38% to 89.62%. We believe that predictive maintenance assumes a pivotal role in modern manufacturing, minimizing unplanned downtime, reducing repair costs, optimizing production efficiency, and ensuring operational stability. Its potential to generate substantial cost savings while enhancing sustainability and competitiveness underscores its importance in contemporary manufacturing practices.
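The first stage rests on quantile regression, whose defining ingredient is the pinball loss; a minimal PyTorch sketch (our illustration, not the authors' implementation) is given below.

```python
# Pinball (quantile) loss: its minimizer is the q-th conditional
# quantile, which is what lets a quantile regression network flag
# anomaly outliers above a high quantile.
import torch

def pinball_loss(pred, target, q=0.95):
    # Under-predictions are penalized by q, over-predictions by (1 - q).
    err = target - pred
    return torch.mean(torch.maximum(q * err, (q - 1) * err))

pred = torch.tensor([0.8, 1.2, 0.5])
target = torch.tensor([1.0, 1.0, 1.0])
print(pinball_loss(pred, target, q=0.95))
```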
- [323] arXiv:2411.14446 (cross-list from stat.ML) [pdf, other]
-
Title: Rising Rested Bandits: Lower Bounds and Efficient AlgorithmsComments: 62 pages. arXiv admin note: substantial text overlap with arXiv:2212.03798Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
This paper is in the field of stochastic Multi-Armed Bandits (MABs), i.e., those sequential selection techniques able to learn online using only the feedback given by the chosen option (a.k.a. $arm$). We study a particular case of the rested bandits in which the arms' expected reward is monotonically non-decreasing and concave. We study the inherent sample complexity of the regret minimization problem by deriving suitable regret lower bounds. Then, we design an algorithm for the rested case, $\textit{R-ed-UCB}$, providing a regret bound depending on the properties of the instance and, under certain circumstances, of $\widetilde{\mathcal{O}}(T^{\frac{2}{3}})$. We empirically compare our algorithms with state-of-the-art methods for non-stationary MABs over several synthetically generated tasks and an online model selection problem for a real-world dataset.
- [324] arXiv:2411.14452 (cross-list from eess.SP) [pdf, other]
-
Title: Past, Present, and Future of Sensor-based Human Activity Recognition using Wearables: A Surveying Tutorial on a Still Challenging TaskSubjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
In the many years since the inception of wearable sensor-based Human Activity Recognition (HAR), a wide variety of methods have been introduced and evaluated for their ability to recognize activities. Substantial gains have been made since the days of hand-crafting heuristics as features, yet progress has seemingly stalled on many popular benchmarks, with performance falling short of what may be considered 'sufficient' -- despite the increase in computational power and scale of sensor data, as well as the rising complexity of the techniques being employed. The HAR community approaches a new paradigm shift, this time incorporating world knowledge from foundational models. In this paper, we take stock of sensor-based HAR -- surveying it from its beginnings to the current state of the field, and charting its future. This is accompanied by a hands-on tutorial, through which we guide practitioners in developing HAR systems for real-world application scenarios. We provide a compendium, for novices and experts alike, of methods that aim at finally solving the activity recognition problem.
- [325] arXiv:2411.14464 (cross-list from q-bio.QM) [pdf, html, other]
-
Title: JESTR: Joint Embedding Space Technique for Ranking Candidate Molecules for the Annotation of Untargeted Metabolomics DataComments: 7 pages, 5 figures, 2 tablesSubjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Motivation: A major challenge in metabolomics is annotation: assigning molecular structures to mass spectral fragmentation patterns. Despite recent advances in molecule-to-spectra and in spectra-to-molecular fingerprint prediction (FP), annotation rates remain low. Results: In this paper, we introduce a novel paradigm for annotation, JESTR. Unlike prior approaches that explicitly construct molecular fingerprints or spectra, JESTR leverages the insight that molecules and their corresponding spectra are views of the same data and effectively embeds their representations in a joint space. Candidate structures are ranked based on cosine similarity between the embeddings of the query spectrum and each candidate. We evaluate JESTR against mol-to-spec and spec-to-FP annotation tools on three datasets. On average, for rank@[1-5], JESTR outperforms other tools by 23.6%-71.6%. We further demonstrate the strong value of regularization with candidate molecules during training, boosting rank@1 performance by 11.4% and enhancing the model's ability to discern between target and candidate molecules. Through JESTR, we offer a promising new avenue towards accurate annotation, therefore unlocking valuable insights into the metabolome.
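The ranking rule itself is simple once the joint space is trained; the sketch below uses stand-in random embeddings rather than JESTR's trained encoders, and shows cosine-similarity ranking of candidates against a query spectrum embedding.

```python
# Cosine-similarity ranking in a joint embedding space, the rule JESTR
# applies at annotation time. Embeddings here are random stand-ins.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_candidates(spectrum_emb, candidate_embs):
    scores = [cosine(spectrum_emb, c) for c in candidate_embs]
    return np.argsort(scores)[::-1]        # best-scoring candidate first

rng = np.random.default_rng(0)
spectrum_emb = rng.normal(size=128)        # stand-in spectrum embedding
candidate_embs = [rng.normal(size=128) for _ in range(10)]
print(rank_candidates(spectrum_emb, candidate_embs)[:5])
```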
- [326] arXiv:2411.14467 (cross-list from q-bio.QM) [pdf, html, other]
-
Title: Towards Scalable Insect Monitoring: Ultra-Lightweight CNNs as On-Device Triggers for Insect Camera TrapsSubjects: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Camera traps, combined with AI, have emerged as a way to achieve automated, scalable biodiversity monitoring. However, the passive infrared (PIR) sensors that trigger camera traps are poorly suited for detecting small, fast-moving ectotherms such as insects. Insects comprise over half of all animal species and are key components of ecosystems and agriculture. The need for an appropriate and scalable insect camera trap is critical in the wake of concerning reports of declines in insect populations. This study proposes an alternative to the PIR trigger: ultra-lightweight convolutional neural networks running on low-powered hardware to detect insects in a continuous stream of captured images. We train a suite of models to distinguish insect images from backgrounds. Our design achieves zero latency between trigger and image capture. Our models are rigorously tested and achieve high accuracy, ranging from 91.8% to 96.4% AUC on validation data and >87% AUC on data from distributions unseen during training. The high specificity of our models ensures minimal saving of false positive images, maximising deployment storage efficiency. High recall scores indicate a minimal false negative rate, maximising insect detection. Further analysis with saliency maps shows the learned representation of our models to be robust, with low reliance on spurious background features. Our system is also shown to operate when deployed on off-the-shelf, low-powered microcontroller units, with a maximum power draw of less than 300 mW. This enables longer deployment times using cheap and readily available battery components. Overall, we offer a step change in the cost, efficiency and scope of insect monitoring. Solving the challenging trigger problem, we demonstrate a system which can be deployed for far longer than existing designs and budgets power and bandwidth effectively, moving towards a generic insect camera trap.
- [327] arXiv:2411.14471 (cross-list from q-bio.GN) [pdf, html, other]
-
Title: Leveraging Gene Expression Data and Explainable Machine Learning for Enhanced Early Detection of Type 2 DiabetesComments: 8 pagesSubjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Diabetes, particularly Type 2 diabetes (T2D), poses a substantial global health burden, compounded by its associated complications such as cardiovascular diseases, kidney failure, and vision impairment. Early detection of T2D is critical for improving healthcare outcomes and optimizing resource allocation. In this study, we address the gap in early T2D detection by leveraging machine learning (ML) techniques on gene expression data obtained from T2D patients. Our primary objective was to enhance the accuracy of early T2D detection through advanced ML methodologies and to increase the model's trustworthiness using explainable artificial intelligence (XAI) techniques. Analyzing the biological mechanisms underlying T2D through gene expression datasets represents a novel research frontier, one relatively little explored in previous studies. While numerous investigations have focused on utilizing clinical and demographic data for T2D prediction, the integration of molecular insights from gene expression datasets offers a unique and promising avenue for understanding the pathophysiology of the disease. By employing six ML classifiers on data sourced from NCBI's Gene Expression Omnibus (GEO), we observed promising performance across all models. Notably, the XGBoost classifier exhibited the highest accuracy, achieving 97%. Our study addresses a notable gap in early T2D detection methodologies, emphasizing the importance of leveraging gene expression data and advanced ML techniques.
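A pipeline in this spirit can be sketched with standard libraries. The snippet below uses mock data in place of the GEO expression matrices, and SHAP as the XAI component -- a common choice, though it is our assumption rather than the paper's stated technique.

```python
# Illustrative classifier-plus-XAI pipeline: XGBoost on (mock) gene
# expression features, explained with SHAP. Data loading, preprocessing,
# and the paper's exact XAI method are not reproduced here.
import numpy as np
import shap
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X = np.random.rand(200, 500)          # stand-in expression matrix
y = np.random.randint(0, 2, 200)      # stand-in T2D labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)

clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))

explainer = shap.TreeExplainer(clf)   # per-gene contribution scores
shap_values = explainer.shap_values(X_te)
```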
- [328] arXiv:2411.14508 (cross-list from cond-mat.mtrl-sci) [pdf, html, other]
-
Title: Multi-objective Bayesian Optimisation of Spinodoid Cellular Structures for Crush Energy AbsorptionSubjects: Materials Science (cond-mat.mtrl-sci); Computational Engineering, Finance, and Science (cs.CE)
In the pursuit of designing safer and more efficient energy-absorbing structures, engineers must tackle the challenge of improving crush performance while balancing multiple conflicting objectives, such as maximising energy absorption and minimising peak impact forces. Accurately simulating real-world conditions necessitates the use of complex material models to replicate the non-linear behaviour of materials under impact, which comes at a significant computational cost. This study addresses these challenges by introducing a multi-objective Bayesian optimisation framework specifically developed to optimise spinodoid structures for crush energy absorption. Spinodoid structures, characterised by their scalable, non-periodic topologies and efficient stress distribution, offer a promising direction for advanced structural design. However, optimising design parameters to enhance crush performance is far from straightforward, particularly under realistic conditions. Conventional optimisation methods, although effective, often require a large number of costly simulations to identify suitable solutions, making the process both time-consuming and resource intensive. In this context, multi-objective Bayesian optimisation provides a clear advantage by intelligently navigating the design space, learning from each evaluation to reduce the number of simulations required, and efficiently addressing the complexities of non-linear material behaviour. By integrating finite element analysis with Bayesian optimisation, the framework developed in this study tackles the dual challenge of improving energy absorption and reducing peak force, particularly in scenarios where plastic deformation plays a critical role. The use of scalarisation and hypervolume-based techniques enables the identification of Pareto-optimal solutions, balancing these conflicting objectives.
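A scalarisation-based variant of the approach can be sketched with scikit-optimize. This is a hedged illustration: the finite element solver is mocked, and the design parameterization and library choice are our assumptions, not the paper's framework.

```python
# Scalarisation-based multi-objective BO sketch: sweep weights that trade
# off energy absorption against peak force, running one GP-based BO loop
# per weight to trace out Pareto-optimal designs. crush_sim is a mock.
import numpy as np
from skopt import gp_minimize

def crush_sim(params):
    # Stand-in for the finite element crush simulation.
    rho, theta = params
    energy = rho * np.sin(theta) + 0.1 * rho**2
    peak_force = 0.5 * rho + rho**2 * np.cos(theta) ** 2
    return energy, peak_force

pareto = []
for w in np.linspace(0.1, 0.9, 5):        # scalarisation weight sweep
    res = gp_minimize(
        lambda p: -(w * crush_sim(p)[0] - (1 - w) * crush_sim(p)[1]),
        dimensions=[(0.2, 0.8), (0.0, np.pi / 2)],  # assumed design bounds
        n_calls=30, random_state=0)
    pareto.append((res.x, crush_sim(res.x)))
print(pareto)
```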
- [329] arXiv:2411.14525 (cross-list from eess.IV) [pdf, html, other]
-
Title: SegBook: A Simple Baseline and Cookbook for Volumetric Medical Image SegmentationJin Ye, Ying Chen, Yanjun Li, Haoyu Wang, Zhongying Deng, Ziyan Huang, Yanzhou Su, Chenglong Ma, Yuanfeng Ji, Junjun HeSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Computed Tomography (CT) is one of the most popular modalities for medical imaging. To date, CT images have contributed the largest publicly available datasets for volumetric medical segmentation tasks, covering full-body anatomical structures. Large amounts of full-body CT images provide the opportunity to pre-train powerful models, e.g., STU-Net pre-trained in a supervised fashion, to segment numerous anatomical structures. However, it remains unclear under which conditions these pre-trained models can be transferred to various downstream medical segmentation tasks, particularly for segmenting other modalities and diverse targets. To address this problem, a large-scale benchmark for comprehensive evaluation is crucial for finding these conditions. Thus, we collected 87 public datasets varying in modality, target, and sample size to evaluate the transfer ability of full-body CT pre-trained models. We then employed a representative model, STU-Net with multiple model scales, to conduct transfer learning across modalities and targets. Our experimental results show that (1) there may be a bottleneck effect concerning the dataset size in fine-tuning, with more improvement on both small- and large-scale datasets than medium-size ones. (2) Models pre-trained on full-body CT demonstrate effective modality transfer, adapting well to other modalities such as MRI. (3) Pre-training on full-body CT not only supports strong performance in structure detection but also shows efficacy in lesion detection, showcasing adaptability across target tasks. We hope that this large-scale open evaluation of transfer learning can direct future research in volumetric medical image segmentation.
- [330] arXiv:2411.14533 (cross-list from math.OC) [pdf, html, other]
-
Title: The connected Grundy coloring problem: Formulations and a local-search enhanced biased random-key genetic algorithmSubjects: Optimization and Control (math.OC); Discrete Mathematics (cs.DM)
Given a graph G=(V,E), a connected Grundy coloring is a proper vertex coloring that can be obtained by a first-fit heuristic on a connected vertex sequence. A first-fit coloring heuristic is one that attributes to each vertex in a sequence the lowest-index color not used for its preceding neighbors. A connected vertex sequence is one in which each element, except for the first one, is connected to at least one element preceding it. The connected Grundy coloring problem consists of obtaining a connected Grundy coloring maximizing the number of colors. In this paper, we propose two integer programming (IP) formulations and a local-search enhanced biased random-key genetic algorithm (BRKGA) for the connected Grundy coloring problem. The first formulation follows the standard way of partitioning the vertices into color classes, while the second one relies on the idea of representatives in an attempt to break symmetries. The BRKGA encompasses a local search procedure using a newly proposed neighborhood. A theoretical neighborhood analysis is also presented. Extensive computational experiments indicate that the problem is computationally demanding for the proposed IP formulations. Nonetheless, the formulation by representatives outperforms the standard one for the considered benchmark instances. Additionally, our BRKGA can find high-quality solutions in low computational times for considerably large instances, showing improved performance when enhanced with local search and a reset mechanism. Moreover, we show that our BRKGA can be easily extended to successfully tackle the Grundy coloring problem, i.e., the variant without the connectivity requirement.
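The two definitions translate directly into code. The sketch below (our own illustration) runs first-fit along a given connected vertex sequence; the optimization problem the paper's IP models and BRKGA address is then the search over such sequences for one maximizing the number of colors.

```python
# First-fit coloring along a connected vertex sequence, a direct
# transcription of the definitions in the abstract.
import networkx as nx

def connected_first_fit(G, sequence):
    colors = {}
    for j, v in enumerate(sequence):
        # Connected sequence: v must neighbor some earlier vertex.
        assert j == 0 or any(u in colors for u in G[v])
        used = {colors[u] for u in G[v] if u in colors}
        c = 1
        while c in used:   # lowest-index color not used by
            c += 1         # already-colored neighbors
        colors[v] = c
    return colors

G = nx.path_graph(5)       # 0-1-2-3-4
coloring = connected_first_fit(G, [2, 1, 3, 0, 4])
print(max(coloring.values()), coloring)   # number of colors used
```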
- [331] arXiv:2411.14601 (cross-list from math.OC) [pdf, other]
-
Title: On Linear Convergence in Smooth Convex-Concave Bilinearly-Coupled Saddle-Point Optimization: Lower Bounds and Optimal AlgorithmsSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
We revisit the smooth convex-concave bilinearly-coupled saddle-point problem of the form $\min_x\max_y f(x) + \langle y,\mathbf{B} x\rangle - g(y)$. In the highly specific case where each of the functions $f(x)$ and $g(y)$ is either affine or strongly convex, there exist lower bounds on the number of gradient evaluations and matrix-vector multiplications required to solve the problem, as well as matching optimal algorithms. A notable aspect of these algorithms is that they are able to attain linear convergence, i.e., the number of iterations required to solve the problem is proportional to $\log(1/\epsilon)$. However, the class of bilinearly-coupled saddle-point problems for which linear convergence is possible is much wider and can involve smooth non-strongly convex functions $f(x)$ and $g(y)$. Therefore, we develop the first lower complexity bounds and matching optimal linearly converging algorithms for this problem class. Our lower complexity bounds are much more general, but they cover and unify the existing results in the literature. On the other hand, our algorithm implements the separation of complexities, which, for the first time, enables the simultaneous achievement of both optimal gradient evaluation and matrix-vector multiplication complexities, resulting in the best theoretical performance to date.
- [332] arXiv:2411.14626 (cross-list from eess.IV) [pdf, html, other]
-
Title: Unveiling the Hidden: A Comprehensive Evaluation of Underwater Image Enhancement and Its Impact on Object DetectionAli Awad (1), Ashraf Saleem (1), Sidike Paheding (2), Evan Lucas (1), Serein Al-Ratrout (1), Timothy C. Havens (1) ((1) Michigan Technological University, (2) Fairfield University)Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Underwater imagery often suffers from severe degradation that results in low visual quality and object detection performance. This work aims to evaluate state-of-the-art image enhancement models, investigate their impact on underwater object detection, and explore their potential to improve detection performance. To this end, we selected representative underwater image enhancement models covering major enhancement categories and applied them separately to two recent datasets: 1) the Real-World Underwater Object Detection Dataset (RUOD), and 2) the Challenging Underwater Plant Detection Dataset (CUPDD). Following this, we conducted qualitative and quantitative analyses on the enhanced images and developed a quality index (Q-index) to compare the quality distribution of the original and enhanced images. Subsequently, we compared the performance of several YOLO-NAS detection models that are separately trained and tested on the original and enhanced image sets. Then, we performed a correlation study to examine the relationship between enhancement metrics and detection performance. We also analyzed the inference results from the trained detectors presenting cases where enhancement increased the detection performance as well as cases where enhancement revealed missed objects by human annotators. This study suggests that although enhancement generally deteriorates the detection performance, it can still be harnessed in some cases for increased detection performance and more accurate human annotation.
- [333] arXiv:2411.14630 (cross-list from physics.med-ph) [pdf, other]
-
Title: ACE-Net: AutofoCus-Enhanced Convolutional Network for Field Imperfection Estimation with application to high b-value spiral Diffusion MRIComments: 8 pages, 5 figures, submitted to International Society for Magnetic Resonance in Medicine 32th Scientific Meeting, 2025Subjects: Medical Physics (physics.med-ph); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Spatiotemporal magnetic field variations from B0-inhomogeneity and diffusion-encoding-induced eddy-currents can be detrimental to rapid image-encoding schemes such as spiral, EPI and 3D-cones, resulting in undesirable image artifacts. In this work, a data-driven approach for automatic estimation of these field imperfections is developed by combining autofocus metrics with deep learning, and by leveraging a compact basis representation of the expected field imperfections. The method was applied to single-shot spiral diffusion MRI at high b-values, where accurate estimates of B0 and eddy-current fields were obtained, resulting in high-quality image reconstruction without the need for additional external calibrations.
- [334] arXiv:2411.14633 (cross-list from q-bio.NC) [pdf, html, other]
-
Title: Evaluating Representational Similarity Measures from the Lens of Functional CorrespondenceSubjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Neuroscience and artificial intelligence (AI) both face the challenge of interpreting high-dimensional neural data, where the comparative analysis of such data is crucial for revealing shared mechanisms and differences between these complex systems. Despite the widespread use of representational comparisons and the abundance of comparison-method classes, a critical question remains: which metrics are most suitable for these comparisons? While some studies evaluate metrics based on their ability to differentiate models of different origins or constructions (e.g., various architectures), another approach is to assess how well they distinguish models that exhibit distinct behaviors. To investigate this, we examine the degree of alignment between various representational similarity measures and behavioral outcomes, employing group statistics and a comprehensive suite of behavioral metrics for comparison. In our evaluation of eight commonly used representational similarity metrics in the visual domain -- spanning alignment-based, Canonical Correlation Analysis (CCA)-based, inner product kernel-based, and nearest-neighbor methods -- we found that metrics like linear Centered Kernel Alignment (CKA) and Procrustes distance, which emphasize the overall geometric structure or shape of representations, excelled in differentiating trained from untrained models and aligning with behavioral measures, whereas metrics such as linear predictivity, commonly used in neuroscience, demonstrated only moderate alignment with behavior. These insights are crucial for selecting metrics that emphasize behaviorally meaningful comparisons in NeuroAI research.
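Linear CKA, one of the metrics the study finds best aligned with behavior, is compact enough to state in full; the NumPy sketch below follows the standard formulation (Kornblith et al., 2019) rather than the authors' exact code.

```python
# Linear Centered Kernel Alignment (CKA) between two activation matrices
# recorded on the same stimuli: ||Y'X||_F^2 / (||X'X||_F * ||Y'Y||_F)
# after column-centering.
import numpy as np

def linear_cka(X, Y):
    # X: (n_stimuli, d1), Y: (n_stimuli, d2)
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 64))
print(linear_cka(X, X))                           # 1.0 for identical reps
print(linear_cka(X, rng.normal(size=(100, 32))))  # near 0 for random pairs
```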
- [335] arXiv:2411.14656 (cross-list from eess.SP) [pdf, html, other]
-
Title: mmWave Radar for Sit-to-Stand Analysis: A Comparative Study with Wearables and KinectShuting Hu, Peggy Ackun, Xiang Zhang, Siyang Cao, Jennifer Barton, Melvin G. Hector, Mindy J. Fain, Nima ToosizadehSubjects: Signal Processing (eess.SP); Emerging Technologies (cs.ET); Applications (stat.AP)
This study explores a novel approach for analyzing Sit-to-Stand (STS) movements using millimeter-wave (mmWave) radar technology. The goal is to develop a non-contact sensing, privacy-preserving, and all-day operational method for healthcare applications, including fall risk assessment. We used a 60GHz mmWave radar system to collect radar point cloud data, capturing STS motions from 45 participants. By employing a deep learning pose estimation model, we learned the human skeleton from Kinect's built-in body tracking and applied Inverse Kinematics (IK) to calculate joint angles, segment STS motions, and extract commonly used features in fall risk assessment. Radar-extracted features were then compared with those obtained from Kinect and wearable sensors. The results demonstrated the effectiveness of mmWave radar in capturing general motion patterns and large joint movements (e.g., trunk). Additionally, the study highlights the advantages and disadvantages of individual sensors and suggests the potential of integrated sensor technologies to improve the accuracy and reliability of motion analysis in clinical and biomedical research settings.
- [336] arXiv:2411.14663 (cross-list from eess.IV) [pdf, other]
-
Title: BrightVAE: Luminosity Enhancement in Underexposed Endoscopic ImagesComments: 18 pages, 6 figuresSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
The enhancement of image luminosity is especially critical in endoscopic images. Underexposed endoscopic images often suffer from reduced contrast and uneven brightness, significantly impacting diagnostic accuracy and treatment planning. Internal body imaging is challenging due to uneven lighting and shadowy regions. Enhancing such images is essential since precise image interpretation is crucial for patient outcomes. In this paper, we introduce BrightVAE, an architecture based on the hierarchical Vector Quantized Variational Autoencoder (hierarchical VQ-VAE) tailored explicitly for enhancing luminosity in low-light endoscopic images. Our architecture is meticulously designed to tackle the unique challenges inherent in endoscopic imaging, such as significant variations in illumination and obscured details due to poor lighting conditions. The proposed model emphasizes advanced feature extraction from three distinct viewpoints, incorporating various receptive fields, skip connections, and feature attentions to robustly enhance image quality and support more accurate medical diagnoses. Through rigorous experimental analysis, we demonstrate the effectiveness of these techniques in enhancing low-light endoscopic images. We evaluated our architecture with three widely recognized metrics (SSIM, PSNR, and LPIPS) on the Endo4IE dataset, which consists exclusively of endoscopic images, and showed significant advancements over state-of-the-art methods for enhancing luminosity in endoscopic imaging.
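For reference, a minimal sketch of one of the metrics named above: PSNR compares an enhanced image against its well-exposed reference (SSIM and LPIPS would typically come from libraries such as scikit-image and lpips). The array shapes and noise level are illustrative assumptions.

```python
import numpy as np

def psnr(reference: np.ndarray, enhanced: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((reference.astype(np.float64) - enhanced.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.random.default_rng(0).random((256, 256, 3))
noisy = np.clip(ref + 0.05 * np.random.default_rng(1).normal(size=ref.shape), 0, 1)
print(f"PSNR: {psnr(ref, noisy):.2f} dB")
```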
- [337] arXiv:2411.14664 (cross-list from stat.ML) [pdf, html, other]
-
Title: Sparsifying Suprema of Gaussian ProcessesComments: 30 pagesSubjects: Machine Learning (stat.ML); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Probability (math.PR)
We give a dimension-independent sparsification result for suprema of centered Gaussian processes: Let $T$ be any (possibly infinite) bounded set of vectors in $\mathbb{R}^n$, and let $\{\boldsymbol{X}_t\}_{t\in T}$ be the canonical Gaussian process on $T$. We show that there is an $O_\varepsilon(1)$-size subset $S \subseteq T$ and a set of real values $\{c_s\}_{s \in S}$ such that $\sup_{s \in S} \{\boldsymbol{X}_s + c_s\}$ is an $\varepsilon$-approximator of $\sup_{t \in T} {\boldsymbol{X}}_t$. Notably, the size of $S$ is completely independent of both the size of $T$ and of the ambient dimension $n$.
We use this to show that every norm is essentially a junta when viewed as a function over Gaussian space: Given any norm $\nu(x)$ on $\mathbb{R}^n$, there is another norm $\psi(x)$ which depends only on the projection of $x$ along $O_\varepsilon(1)$ directions, for which $\psi({\boldsymbol{g}})$ is a multiplicative $(1 \pm \varepsilon)$-approximation of $\nu({\boldsymbol{g}})$ with probability $1-\varepsilon$ for ${\boldsymbol{g}} \sim N(0,I_n)$.
We also use our sparsification result for suprema of centered Gaussian processes to give a sparsification lemma for convex sets of bounded geometric width: Any intersection of (possibly infinitely many) halfspaces in $\mathbb{R}^n$ that are at distance $O(1)$ from the origin is $\varepsilon$-close, under $N(0,I_n)$, to an intersection of only $O_\varepsilon(1)$ many halfspaces.
We describe applications to agnostic learning and tolerant property testing.
- [338] arXiv:2411.14665 (cross-list from stat.ML) [pdf, other]
-
Title: Double Machine Learning for Adaptive Causal Representation in High-Dimensional DataSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)
Adaptive causal representation learning from observational data is presented, integrated with an efficient sample splitting technique within the semiparametric estimating equation framework. Support points sample splitting (SPSS), a subsampling method based on energy distance, is employed for efficient double machine learning (DML) in causal inference. The support points are selected as optimal representative points of the full raw data, in contrast to traditional random splitting, and thus provide an optimal sub-representation of the underlying data-generating distribution. They offer the best representation of a full big dataset, whereas traditional random data splitting is unlikely to preserve the structural information of the underlying distribution. Three machine learning estimators were adopted for causal inference with SPSS: support vector machine (SVM), deep learning (DL), and a hybrid super learner (SL) with deep learning (SDL). A comparative study is conducted between the proposed SVM, DL, and SDL representations using SPSS and the benchmark results of Chernozhukov et al. (2018), which employed random forest, neural network, and regression trees with a random k-fold cross-fitting technique on the 401(k)-pension plan real data. The simulations show that DL with SPSS and the hybrid method of DL and SL with SPSS outperform SVM with SPSS in terms of computational efficiency and estimation quality, respectively.
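A hedged sketch of the energy distance that underlies support points: support points are chosen to (approximately) minimise this quantity between the subsample and the full data (Mak and Joseph, 2018). The optimisation loop itself is omitted; this only shows the criterion, and the toy data are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def energy_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Energy distance between two samples of d-dimensional points."""
    between = cdist(x, y).mean()    # E||x - y||
    within_x = cdist(x, x).mean()   # E||x - x'||
    within_y = cdist(y, y).mean()   # E||y - y'||
    return 2.0 * between - within_x - within_y

rng = np.random.default_rng(0)
full = rng.normal(size=(2000, 5))
subsample = full[rng.choice(2000, size=100, replace=False)]
print(energy_distance(subsample, full))  # small for a representative subsample
```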
- [339] arXiv:2411.14677 (cross-list from physics.ao-ph) [pdf, html, other]
-
Title: Exploring the Use of Machine Learning Weather Models in Data AssimilationSubjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
The use of machine learning (ML) models in meteorology has attracted significant attention for their potential to improve weather forecasting efficiency and accuracy. GraphCast and NeuralGCM, two promising ML-based weather models, are at the forefront of this innovation. However, their suitability for data assimilation (DA) systems, particularly for four-dimensional variational (4DVar) DA, remains under-explored. This study evaluates the tangent linear (TL) and adjoint (AD) models of both GraphCast and NeuralGCM to assess their viability for integration into a DA framework.
We compare the TL/AD results of GraphCast and NeuralGCM with those of the Model for Prediction Across Scales - Atmosphere (MPAS-A), a well-established numerical weather prediction (NWP) model. The comparison focuses on the physical consistency and reliability of TL/AD responses to perturbations. While the adjoint results of both GraphCast and NeuralGCM show some similarity to those of MPAS-A, they also exhibit unphysical noise at various vertical levels, raising concerns about their robustness for operational DA systems.
The implications of this study extend beyond 4DVar applications. Unphysical behavior and noise in ML-derived TL/AD models could lead to inaccurate error covariances and unreliable ensemble forecasts, potentially degrading the overall performance of ensemble-based DA systems as well. Addressing these challenges is critical to ensuring that ML models, such as GraphCast and NeuralGCM, can be effectively integrated into operational DA systems, paving the way for more accurate and efficient weather predictions.
- [340] arXiv:2411.14684 (cross-list from eess.IV) [pdf, html, other]
-
Title: Cross Group Attention and Group-wise Rolling for Multimodal Medical Image SynthesisTao Song, Yicheng Wu, Minhao Hu, Xiangde Luo, Linda Wei, Guotai Wang, Yi Guo, Feng Xu, Shaoting ZhangSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Multimodal MR image synthesis aims to generate a missing-modality image by fusing and mapping the few available MRI modalities. Most existing approaches typically adopt an image-to-image translation scheme. However, these methods often suffer from sub-optimal performance due to spatial misalignment between the different modalities when they are treated as input channels. Therefore, in this paper, we propose an Adaptive Group-wise Interaction Network (AGI-Net) that explores both inter-modality and intra-modality relationships for multimodal MR image synthesis. Specifically, groups are first pre-defined along the channel dimension, and an adaptive rolling of the standard convolutional kernel is then performed to capture inter-modality spatial correspondences. At the same time, a cross-group attention module is introduced to fuse information across different channel groups, leading to better feature representation. We evaluated the effectiveness of our model on the publicly available IXI and BraTS2023 datasets, where AGI-Net achieved state-of-the-art performance for multimodal MR image synthesis. The code will be released.
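The paper performs the adaptive rolling on the convolution kernel itself; a closely related view, equivalent up to boundary effects, is to circularly shift each modality's channel group before a standard convolution. The following is a minimal sketch of that group-wise shifting with fixed illustrative offsets (the paper predicts the offsets adaptively, which is not reproduced here).

```python
import torch

def group_wise_roll(x: torch.Tensor, shifts: list[tuple[int, int]]) -> torch.Tensor:
    """x: (B, C, H, W); shifts: one (dy, dx) spatial offset per channel group."""
    groups = torch.chunk(x, chunks=len(shifts), dim=1)   # one group per modality
    rolled = [torch.roll(g, shifts=s, dims=(2, 3)) for g, s in zip(groups, shifts)]
    return torch.cat(rolled, dim=1)                      # a conv would follow

x = torch.randn(2, 8, 32, 32)                # e.g. two modalities, 4 channels each
out = group_wise_roll(x, shifts=[(0, 0), (1, -2)])
print(out.shape)                             # torch.Size([2, 8, 32, 32])
```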
- [341] arXiv:2411.14696 (cross-list from quant-ph) [pdf, html, other]
-
Title: Quantum Hamiltonian Descent for Graph PartitionSubjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
We introduce Quantum Hamiltonian Descent (QHD) as a novel approach to solve the graph partition problem. By reformulating graph partition as a Quadratic Unconstrained Binary Optimization (QUBO) problem, we leverage QHD's quantum-inspired dynamics to identify optimal community structures. Our method implements a multi-level refinement strategy that alternates between QUBO formulation and QHD optimization to iteratively improve partition quality. Experimental results demonstrate that our QHD-based approach achieves superior modularity scores (improvements of up to 5.49\%) with reduced computational overhead compared to traditional optimization methods. This work establishes QHD as an effective quantum-inspired framework for tackling graph partition challenges in large-scale networks.
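A hedged sketch of the QUBO reformulation step for the two-way case: modularity maximisation reduces to minimising $x^T Q x$ over binary $x$ with $Q = -B$ for the modularity matrix $B$ (the rows of $B$ sum to zero, so the spin-to-binary change of variables leaves no linear term). The QHD solver itself is not reproduced; any QUBO solver, or the brute force below, can consume $Q$.

```python
import itertools
import numpy as np

def modularity_qubo(adj: np.ndarray) -> np.ndarray:
    """Q such that minimising x^T Q x over binary x maximises two-way modularity."""
    degrees = adj.sum(axis=1)
    B = adj - np.outer(degrees, degrees) / degrees.sum()  # modularity matrix
    return -B

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by the bridge 2-3.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0

Q = modularity_qubo(adj)
value, x = min((float(np.array(x) @ Q @ np.array(x)), x)
               for x in itertools.product((0, 1), repeat=6))
print(x, value)  # the two triangles land in different blocks, value ~ -2.5
```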
- [342] arXiv:2411.14697 (cross-list from quant-ph) [pdf, other]
-
Title: Quantum Advantage via Solving Multivariate QuadraticsSubjects: Quantum Physics (quant-ph); Cryptography and Security (cs.CR)
In this work, we propose a new way to (non-interactively, verifiably) demonstrate Quantum Advantage by solving the average-case $\mathsf{NP}$ search problem of finding a solution to a system of (underdetermined) multivariate quadratic equations over the finite field $\mathbb{F}_2$ drawn from a specified distribution. In particular, we design a distribution of degree-2 polynomials $\{p_i(x_1,\ldots,x_n)\}_{i\in [m]}$ for $m<n$ over $\mathbb{F}_2$ for which we show that there is a quantum polynomial-time algorithm that simultaneously solves $\{p_i(x_1,\ldots,x_n)=y_i\}_{i\in [m]}$ for a random vector $(y_1,\ldots,y_m)$. On the other hand, while a solution exists with high probability, we conjecture that it is classically hard to find one based on classical cryptanalysis that we provide, including a comprehensive review of all known relevant classical algorithms for solving multivariate quadratics. Our approach proceeds by examining the Yamakawa-Zhandry (FOCS 2022) quantum advantage scheme and replacing the role of the random oracle with our multivariate quadratic equations. Our work therefore gives several new perspectives:
First, our algorithm gives a counterexample to the conventional belief that generic classically hard multivariate quadratic systems are also quantumly hard.
Second, based on cryptanalytic evidence, our work gives an explicit simple replacement for the random oracle from the work of Yamakawa and Zhandry. We show how to instantiate the random oracle with families of just degree-2 multivariate polynomials over $\mathbb{F}_2$.
- [343] arXiv:2411.14741 (cross-list from physics.optics) [pdf, html, other]
-
Title: SecONN: An Optical Neural Network Framework with Concurrent Detection of Thermal Fault Injection AttacksSubjects: Optics (physics.optics); Cryptography and Security (cs.CR)
Silicon Photonics-based AI Accelerators (SPAAs) have been considered promising AI accelerators achieving high energy efficiency and low latency. While many researchers focus on improving SPAAs' energy efficiency and latency, their physical security has not been sufficiently studied. This paper first proposes a threat of thermal fault injection attacks on SPAAs based on Vector-Matrix Multipliers (VMMs) utilizing Mach-Zehnder Interferometers. This paper then proposes SecONN, an optical neural network framework that is capable of not only inference but also concurrent detection of the attacks. In addition, this paper introduces a concept of Wavelength Division Perturbation (WDP), where wavelength-dependent VMM results are utilized to increase detection accuracy. Simulation results show that the proposed method achieves an average recall of 88.7% for attack-caused mispredictions.
- [344] arXiv:2411.14748 (cross-list from astro-ph.CO) [pdf, html, other]
-
Title: Cosmological Analysis with Calibrated Neural Quantile Estimation and Approximate SimulatorsComments: 5+4 pages, 5+3 figures, to be submitted, comments are welcomeSubjects: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
A major challenge in extracting information from current and upcoming surveys of cosmological Large-Scale Structure (LSS) is the limited availability of computationally expensive high-fidelity simulations. We introduce Neural Quantile Estimation (NQE), a new Simulation-Based Inference (SBI) method that leverages a large number of approximate simulations for training and a small number of high-fidelity simulations for calibration. This approach guarantees an unbiased posterior and achieves near-optimal constraining power when the approximate simulations are reasonably accurate. As a proof of concept, we demonstrate that cosmological parameters can be inferred at field level from projected 2-dim dark matter density maps up to $k_{\rm max}\sim1.5\,h$/Mpc at $z=0$ by training on $\sim10^4$ Particle-Mesh (PM) simulations with transfer function correction and calibrating with $\sim10^2$ Particle-Particle (PP) simulations. The calibrated posteriors closely match those obtained by directly training on $\sim10^4$ expensive PP simulations, but at a fraction of the computational cost. Our method offers a practical and scalable framework for SBI of cosmological LSS, enabling precise inference across vast volumes and down to small scales.
- [345] arXiv:2411.14752 (cross-list from eess.IV) [pdf, html, other]
-
Title: Comparative Analysis of nnUNet and MedNeXt for Head and Neck Tumor Segmentation in MRI-guided RadiotherapyNikoo Moradi, André Ferreira, Behrus Puladi, Jens Kleesiek, Emad Fatemizadeh, Gijs Luijten, Victor Alves, Jan EggerComments: 15 pages, 3 figuresSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Radiation therapy (RT) is essential in treating head and neck cancer (HNC), with magnetic resonance imaging (MRI)-guided RT offering superior soft tissue contrast and functional imaging. However, manual tumor segmentation is time-consuming and complex, and therefore remains a challenge. In this study, we present our solution as team TUMOR to the HNTS-MRG24 MICCAI Challenge, which is focused on automated segmentation of primary gross tumor volumes (GTVp) and metastatic lymph node gross tumor volumes (GTVn) in pre-RT and mid-RT MRI images. We utilized the HNTS-MRG2024 dataset, which consists of 150 MRI scans from patients diagnosed with HNC, including original and registered pre-RT and mid-RT T2-weighted images with corresponding segmentation masks for GTVp and GTVn. We employed two state-of-the-art deep learning models, nnUNet and MedNeXt. For Task 1, we pretrained models on registered pre-RT and mid-RT images, followed by fine-tuning on original pre-RT images. For Task 2, we combined registered pre-RT images, registered pre-RT segmentation masks, and mid-RT data as a multi-channel input for training. Our solution for Task 1 achieved 1st place in the final test phase with an aggregated Dice Similarity Coefficient of 0.8254, and our solution for Task 2 ranked 8th with a score of 0.7005. The proposed solution is publicly available in a GitHub repository.
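For reference, a minimal sketch of the plain (volumetric) Dice Similarity Coefficient used to score such segmentations; the challenge's aggregated DSC additionally handles empty ground-truth cases, which this version does not, and the toy masks are assumptions.

```python
import numpy as np

def dice(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Dice coefficient between two binary masks of equal shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

gt = np.zeros((64, 64), int); gt[20:40, 20:40] = 1
pr = np.zeros((64, 64), int); pr[25:45, 20:40] = 1
print(f"DSC = {dice(pr, gt):.3f}")  # 15 of 20 rows overlap -> 0.750
```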
- [346] arXiv:2411.14833 (cross-list from eess.IV) [pdf, html, other]
-
Title: Cell as Point: One-Stage Framework for Efficient Cell TrackingComments: 17 pages, 8 figures, 8 tablesSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Cellular activities are dynamic and intricate, playing a crucial role in advancing diagnostic and therapeutic techniques, yet they often require substantial resources for accurate tracking. Despite recent progress, conventional multi-stage cell tracking approaches heavily rely on detection or segmentation results as a prerequisite for the tracking stage, demanding plenty of refined segmentation masks, and are further degraded by imbalanced and long-sequence data, leading to under-learning during training and missed cells at inference. To alleviate these issues, this paper proposes the novel end-to-end CAP framework, which leverages the idea of regarding Cell as Point to achieve efficient and stable cell tracking in a single stage. CAP abandons detection or segmentation stages and simplifies the process by exploiting the correlation among the trajectories of cell points to track cells jointly, thus reducing the label demand and the complexity of the pipeline. With cell point trajectories and visibility used to represent cell locations and lineage relationships, CAP leverages two key innovations: adaptive event-guided (AEG) sampling, which addresses data imbalance in cell division events, and the rolling-as-window (RAW) inference method, which ensures continuous long-term tracking of newly appearing cells. Eliminating the need for a prerequisite detection or segmentation stage, CAP demonstrates strong cell tracking performance while also being 10 to 55 times more efficient than existing methods. The code and models will be released.
- [347] arXiv:2411.14839 (cross-list from stat.AP) [pdf, html, other]
-
Title: Bayesian dynamic mode decomposition for real-time ship motion digital twinningGiorgio Palma, Andrea Serani, Kevin McTaggart, Shawn Aram, David W. Wundrow, David Drazen, Matteo DiezSubjects: Applications (stat.AP); Machine Learning (cs.LG); Dynamical Systems (math.DS)
Digital twins are widely considered enablers of groundbreaking changes in the development, operation, and maintenance of novel generations of products. They are meant to provide reliable and timely predictions to inform decisions along the entire product life cycle. One of their most interesting applications in the naval field is the digital twinning of ship performance in waves, a crucial aspect of design and operational safety. In this paper, a Bayesian extension of the Hankel dynamic mode decomposition method is proposed for ship motion nowcasting as a prediction tool for naval digital twins. The proposed algorithm meets all the requirements for formulations devoted to digital twinning, being able to adapt the resulting models with the data incoming from the physical system, using a limited amount of data, producing real-time predictions, and estimating their reliability. Results are presented and discussed for the course-keeping of the 5415M model in beam-quartering sea state 7 irregular waves at Fr = 0.33, using data from three different CFD solvers. The results show predictions keeping good accuracy levels up to five wave encounter periods, with the Bayesian formulation improving the deterministic forecasts. In addition, a connection between the predicted uncertainty and prediction accuracy is found.
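A hedged sketch of the deterministic core that the paper extends: Hankel dynamic mode decomposition. Time-delay embedding of a measured signal builds a Hankel matrix, a least-squares linear operator maps each delayed state to the next, and iterating that operator yields forecasts. The Bayesian treatment of the operator's uncertainty, which is the paper's contribution, is not reproduced; the signal and embedding sizes are assumptions.

```python
import numpy as np

def hankel_dmd_forecast(signal: np.ndarray, delay: int, horizon: int) -> np.ndarray:
    """Forecast `horizon` future samples of a 1D signal via Hankel DMD."""
    cols = len(signal) - delay + 1
    H = np.column_stack([signal[i:i + delay] for i in range(cols)])
    X, Y = H[:, :-1], H[:, 1:]
    A = Y @ np.linalg.pinv(X)      # best-fit linear propagator between delayed states
    state = H[:, -1]
    preds = []
    for _ in range(horizon):
        state = A @ state
        preds.append(state[-1])    # the newest sample sits in the last row
    return np.array(preds)

t = np.linspace(0, 20, 400)
x = np.sin(t) + 0.5 * np.sin(2.3 * t)   # toy two-frequency "motion" signal
print(hankel_dmd_forecast(x[:350], delay=40, horizon=5))
```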
- [348] arXiv:2411.14845 (cross-list from physics.soc-ph) [pdf, html, other]
-
Title: To Be a Truster or Not to Be: Evolutionary Dynamics of a Symmetric N-Player Trust Game in Well-Mixed and Networked PopulationsComments: 21 pages, 8 figuresSubjects: Physics and Society (physics.soc-ph); Computer Science and Game Theory (cs.GT); Populations and Evolution (q-bio.PE)
Trust and reciprocation of it form the foundation of economic, social and other interactions. While the Trust Game is widely used to study these concepts for interactions between two players, often alternating different roles (i.e., investor and trustee), its extensions to multi-player scenarios have been restricted to instances where players assume only one role. We propose a symmetric N-player Trust Game, in which players alternate between two roles, and the payoff of the player is defined as the average across their two roles and drives the evolutionary game dynamics. We find that prosocial strategies are harder to evolve with the present symmetric N-player Trust Game than with the Public Goods Game, which is well studied. In particular, trust fails to evolve regardless of payoff function nonlinearity in well-mixed populations in the case of the symmetric N-player trust game. In structured populations, nonlinear payoffs can have strong impacts on the evolution of trust. The same nonlinearity can yield substantially different outcomes, depending on the nature of the underlying network. Our results highlight the importance of considering both payoff structures and network topologies in understanding the emergence and maintenance of prosocial behaviours.
- [349] arXiv:2411.14850 (cross-list from quant-ph) [pdf, html, other]
-
Title: Quantum Algorithm for the Multiple String Matching ProblemComments: the paper is accepted in SOFSEM2025 ConferenceSubjects: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS)
Let us consider the Multiple String Matching Problem. In this problem, we consider a long string, denoted by $t$, of length $n$. This string is referred to as a text. We also consider a sequence of $m$ strings, denoted by $S$, which we refer to as a dictionary. The total length of all strings from the dictionary is denoted by $L$. The objective is to identify all instances of strings from the dictionary within the text. The standard classical solution to this problem is the Aho-Corasick algorithm, which has $O(n+L)$ query and time complexity. At the same time, the classical lower bound for the problem matches: $\Omega(n+L)$. We propose a quantum algorithm with $O(n+\sqrt{mL\log n}+m\log n)$ query complexity and $O(n+\sqrt{mL\log n}\log b+m\log n)=O^*(n+\sqrt{mL})$ time complexity, where $b$ is the maximal length of strings from the dictionary. This improvement is particularly significant in the case of dictionaries comprising long words. Our algorithm's complexity matches the quantum lower bound $\Omega(n + \sqrt{mL})$, up to a log factor. In some sense, our algorithm can be viewed as a quantum analogue of the Aho-Corasick algorithm.
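For context, a compact sketch of the classical baseline named above: the Aho-Corasick automaton, which reports all occurrences of all dictionary strings in $O(n + L + \#\text{matches})$ time. The quantum algorithm itself cannot be illustrated by a classical snippet; this only shows the problem being solved.

```python
from collections import deque

def aho_corasick(text: str, patterns: list[str]) -> list[tuple[int, str]]:
    goto, fail, out = [{}], [0], [[]]           # node 0 is the trie root
    for p in patterns:                          # phase 1: build the trie
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(p)
    queue = deque(goto[0].values())             # phase 2: BFS failure links
    while queue:
        u = queue.popleft()
        for ch, v in goto[u].items():
            queue.append(v)
            f = fail[u]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[v] = goto[f].get(ch, 0)
            out[v] += out[fail[v]]              # inherit matches via suffix links
    matches, node = [], 0                       # phase 3: scan the text once
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for p in out[node]:
            matches.append((i - len(p) + 1, p))
    return matches

print(aho_corasick("ushers", ["he", "she", "his", "hers"]))
# -> [(1, 'she'), (2, 'he'), (2, 'hers')]
```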
- [350] arXiv:2411.14865 (cross-list from eess.IV) [pdf, html, other]
-
Title: Benchmarking the Robustness of Optical Flow Estimation to CorruptionsComments: The benchmarks and source code will be released at this https URLSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Optical flow estimation is extensively used in autonomous driving and video editing. While existing models demonstrate state-of-the-art performance across various benchmarks, the robustness of these methods has been infrequently investigated. Despite some research focusing on the robustness of optical flow models against adversarial attacks, there has been a lack of studies investigating their robustness to common corruptions. Taking into account the unique temporal characteristics of optical flow, we introduce 7 temporal corruptions specifically designed for benchmarking the robustness of optical flow models, in addition to 17 classical single-image corruptions, including an advanced PSF blur simulation method. Two robustness benchmarks, KITTI-FC and GoPro-FC, are subsequently established as the first corruption robustness benchmarks for optical flow estimation, with Out-Of-Domain (OOD) and In-Domain (ID) settings to facilitate comprehensive studies. Robustness metrics -- Corruption Robustness Error (CRE), Corruption Robustness Error ratio (CREr), and Relative Corruption Robustness Error (RCRE) -- are further introduced to quantify optical flow estimation robustness. 29 model variants from 15 optical flow methods are evaluated, yielding 10 intriguing observations, such as: 1) the absolute robustness of a model is heavily dependent on its estimation performance; 2) corruptions that diminish local information are more harmful than those that merely reduce visual quality. We also give suggestions for the design and application of optical flow models. We anticipate that our benchmark will serve as a foundational resource for advancing research in robust optical flow estimation. The benchmarks and source code will be released at this https URL.
- [351] arXiv:2411.14875 (cross-list from stat.ML) [pdf, html, other]
-
Title: Iterative Reweighted Framework Based Algorithms for Sparse Linear Regression with Generalized Elastic Net PenaltySubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
The elastic net penalty is frequently employed in high-dimensional statistics for parameter regression and variable selection. It is particularly beneficial compared to the lasso when the number of predictors greatly surpasses the number of observations. However, empirical evidence has shown that the $\ell_q$-norm penalty (where $0 < q < 1$) often provides better regression than the $\ell_1$-norm penalty, demonstrating enhanced robustness in various scenarios. In this paper, we explore a generalized elastic net model that employs an $\ell_r$-norm (where $r \geq 1$) in the loss function to accommodate various types of noise, and an $\ell_q$-norm (where $0 < q < 1$) to replace the $\ell_1$-norm in the elastic net penalty. Theoretically, we establish computable lower bounds for the nonzero entries of the generalized first-order stationary points of the proposed generalized elastic net model. For implementation, we develop two efficient algorithms based on the locally Lipschitz continuous $\epsilon$-approximation to the $\ell_q$-norm. The first algorithm employs an alternating direction method of multipliers (ADMM), while the second utilizes a proximal majorization-minimization method (PMM), where the subproblems are addressed using the semismooth Newton method (SSN). We also perform extensive numerical experiments with both simulated and real data, showing that both algorithms demonstrate superior performance. Notably, PMM-SSN is more efficient than ADMM, even though the latter offers a simpler implementation.
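A hedged sketch of the reweighting idea behind such $\epsilon$-approximation schemes: the nonconvex $\ell_q$ term is replaced by $(x_i^2+\epsilon)^{q/2}$, which is majorised at the current iterate by a quadratic, so each step solves a weighted ridge problem in closed form. This is a generic IRLS baseline, not the paper's ADMM or PMM-SSN solvers, and the loss is fixed to $r=2$ for simplicity; all parameter values are illustrative.

```python
import numpy as np

def irls_lq(A, b, lam=0.1, q=0.5, eps=1e-6, iters=50):
    """Sparse regression with an (x^2 + eps)^(q/2) penalty via IRLS."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]              # least-squares warm start
    gram, Atb = A.T @ A, A.T @ b
    for _ in range(iters):
        w = (q / 2.0) * (x**2 + eps) ** (q / 2.0 - 1.0)   # MM weights at iterate x
        x = np.linalg.solve(gram + lam * np.diag(w), Atb)  # weighted ridge step
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(80, 200))
x_true = np.zeros(200); x_true[[3, 50, 120]] = [2.0, -1.5, 1.0]
b = A @ x_true + 0.01 * rng.normal(size=80)
x_hat = irls_lq(A, b)
print(np.flatnonzero(np.abs(x_hat) > 0.1))  # ideally recovers {3, 50, 120}
```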
- [352] arXiv:2411.14886 (cross-list from eess.SP) [pdf, html, other]
-
Title: CardioLab: Laboratory Values Estimation and Monitoring from Electrocardiogram Signals -- A Multimodal Deep Learning ApproachComments: 7 pages, 1 figure, code under this https URLSubjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Background: Laboratory values are fundamental to medical diagnosis and management, but acquiring these values can be costly, invasive, and time-consuming. While electrocardiogram (ECG) patterns have been linked to certain laboratory abnormalities, the comprehensive modeling of these relationships remains underexplored.
Methods: We utilize the MIMIC-IV dataset to develop multimodal deep-learning models and demonstrate the feasibility of estimating (in real time) and monitoring (predicting at future intervals) laboratory value abnormalities from ECG waveforms, demographics, biometrics, and vital signs.
Results: The models exhibit strong predictive performance, with AUROC scores above 0.70 in a statistically significant manner for 23 laboratory values in the estimation setting and up to 26 values in the monitoring setting. Most notably, the accurately predictable values encompass abnormalities across diverse physiological categories such as cardiac, renal, hematological, metabolic, immunological, and coagulation. For example, in the estimation setting, NTproBNP (>353 pg/mL) reached an AUROC of 0.882, whereas in the monitoring setting, urea nitrogen (<6 mg/dL) reached 0.851 at 30 minutes, creatinine (<0.5 mg/dL) 0.850 at 60 minutes, and hemoglobin (>17.5 g/dL) 0.821 at 120 minutes.
Conclusions: This study provides the first evidence of the feasibility of using ECG data alongside clinical routine data for the real-time estimation and monitoring of laboratory value abnormalities, which could provide a non-invasive, cost-effective supplement to traditional laboratory testing, with strong implications for enhanced patient monitoring and early intervention. Further validation could facilitate their integration into routine clinical practice.
- [353] arXiv:2411.14924 (cross-list from math.CO) [pdf, html, other]
-
Title: Construction of Toroidal Polyhedra corresponding to perfect Chains of wild TetrahedraSubjects: Combinatorics (math.CO); Computational Geometry (cs.CG); Discrete Mathematics (cs.DM)
In 1957, Steinhaus proved that a chain of regular tetrahedra, meeting face-to-face and forming a closed loop, does not exist. Over the years, various modifications of this statement have been considered and analysed. Weakening the statement by only requiring the tetrahedra of a chain to be wild, i.e. having all faces congruent, results in various examples of such chains. In this paper, we elaborate on the construction of these chains of wild tetrahedra. We therefore introduce the notions of chains and clusters of wild tetrahedra and relate these structures to simplicial surfaces. We establish that clusters and chains of wild tetrahedra can be described by polyhedra in Euclidean 3-space. As a result, we present methods to construct toroidal polyhedra arising from chains and provide a census of such toroidal polyhedra consisting of up to 20 wild tetrahedra. Here, we classify toroidal polyhedra with respect to self-intersections and reflection symmetries. We further prove the existence of an infinite family of toroidal polyhedra emerging from chains of wild tetrahedra and present clusters of wild tetrahedra that yield polyhedra of higher genera.
- [354] arXiv:2411.14942 (cross-list from hep-th) [pdf, html, other]
-
Title: Comparative Study of Neural Network Methods for Solving Topological SolitonsComments: 12 pages, 4 figuresSubjects: High Energy Physics - Theory (hep-th); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Topological solitons, which are stable, localized solutions of nonlinear differential equations, are crucial in various fields of physics and mathematics, including particle physics and cosmology. However, solving for these solitons presents significant challenges due to the complexity of the underlying equations and the computational resources required for accurate solutions. To address this, we have developed a novel method using neural networks (NNs) to efficiently solve for solitons. A similar NN approach is Physics-Informed Neural Networks (PINNs). In a comparative analysis between our method and PINN, we find that our method achieves shorter computation times while maintaining the same level of accuracy. This advancement in computational efficiency not only overcomes current limitations but also opens new avenues for studying topological solitons and their dynamical behavior.
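For context, a hedged sketch of the PINN-style approach the paper compares against, on the simplest topological soliton: the static $\phi^4$ kink, which solves $\phi'' = \phi^3 - \phi$ with $\phi(\pm\infty) = \pm 1$ (exact solution $\tanh(x/\sqrt{2})$). The network size, optimiser, and loss weights are illustrative choices, not the paper's.

```python
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
x = torch.linspace(-5, 5, 200).reshape(-1, 1)   # +-5 stands in for +-infinity

for step in range(3000):
    opt.zero_grad()
    xr = x.clone().requires_grad_(True)
    phi = net(xr)
    dphi = torch.autograd.grad(phi.sum(), xr, create_graph=True)[0]
    d2phi = torch.autograd.grad(dphi.sum(), xr, create_graph=True)[0]
    residual = d2phi - (phi**3 - phi)           # enforce the field equation
    boundary = (net(torch.tensor([[-5.0]])) + 1.0)**2 \
             + (net(torch.tensor([[5.0]])) - 1.0)**2
    loss = (residual**2).mean() + boundary.mean()
    loss.backward()
    opt.step()

print(net(torch.tensor([[0.0], [2.0]])))  # ~0 and ~tanh(2/sqrt(2)) ~ 0.89
```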
- [355] arXiv:2411.14972 (cross-list from eess.AS) [pdf, html, other]
-
Title: Open-Amp: Synthetic Data Framework for Audio Effect Foundation ModelsSubjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
This paper introduces Open-Amp, a synthetic data framework for generating large-scale and diverse audio effects data. Audio effects are relevant to many musical audio processing and Music Information Retrieval (MIR) tasks, such as modelling of analog audio effects, automatic mixing, tone matching and transcription. Existing audio effects datasets are limited in scope, usually including relatively few audio effects processors and a limited amount of input audio signals. Our proposed framework overcomes these issues, by crowdsourcing neural network emulations of guitar amplifiers and effects, created by users of open-source audio effects emulation software. This allows users of Open-Amp complete control over the input signals to be processed by the effects models, as well as providing high-quality emulations of hundreds of devices. Open-Amp can render audio online during training, allowing great flexibility in data augmentation. Our experiments show that using Open-Amp to train a guitar effects encoder achieves new state-of-the-art results on multiple guitar effects classification tasks. Furthermore, we train a one-to-many guitar effects model using Open-Amp, and use it to emulate unseen analog effects via manipulation of its learned latent space, indicating transferability to analog guitar effects data.
- [356] arXiv:2411.14975 (cross-list from eess.IV) [pdf, html, other]
-
Title: Exploring Foundation Models Fine-Tuning for Cytology ClassificationComments: 5 pages, 2 figuresSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Cytology slides are essential tools in diagnosing and staging cancer, but their analysis is time-consuming and costly. Foundation models have shown great potential to assist in these tasks. In this paper, we explore how existing foundation models can be applied to cytological classification. In particular, we focus on low-rank adaptation (LoRA), a parameter-efficient fine-tuning method suited to few-shot learning. We evaluated five foundation models across four cytological classification datasets. Our results demonstrate that fine-tuning the pre-trained backbones with LoRA significantly improves model performance compared to fine-tuning only the classifier head, achieving state-of-the-art results on both simple and complex classification tasks while requiring fewer data samples.
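A minimal sketch of the low-rank adaptation idea: the pretrained weight is frozen and only a rank-$r$ update $BA$ (plus the classifier head) is trained, so the number of tuned parameters scales with $r$ rather than with the weight's size. The dimensions and scaling follow the common LoRA convention; where the adapters sit inside the foundation-model backbone is the paper's design choice and is not shown.

```python
import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = torch.nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r                       # B = 0, so the update starts at 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(torch.nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 768 = 12288 trainable, vs 590592 frozen
```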
- [357] arXiv:2411.15002 (cross-list from q-fin.ST) [pdf, html, other]
-
Title: A New Way: Kronecker-Factored Approximate Curvature Deep Hedging and its BenefitsComments: 16 pages, 5 figuresSubjects: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
This paper advances the computational efficiency of Deep Hedging frameworks through the novel integration of Kronecker-Factored Approximate Curvature (K-FAC) optimization. While recent literature has established Deep Hedging as a data-driven alternative to traditional risk management strategies, the computational burden of training neural networks with first-order methods remains a significant impediment to practical implementation. The proposed architecture couples Long Short-Term Memory (LSTM) networks with K-FAC second-order optimization, specifically addressing the challenges of sequential financial data and curvature estimation in recurrent networks. Empirical validation using simulated paths from a calibrated Heston stochastic volatility model demonstrates that the K-FAC implementation achieves marked improvements in convergence dynamics and hedging efficacy. The methodology yields a 78.3% reduction in transaction costs ($t = 56.88$, $p < 0.001$) and a 34.4% decrease in profit and loss (P&L) variance compared to Adam optimization. Moreover, the K-FAC-enhanced model exhibits superior risk-adjusted performance with a Sharpe ratio of 0.0401, contrasting with $-0.0025$ for the baseline model. These results provide compelling evidence that second-order optimization methods can materially enhance the tractability of Deep Hedging implementations. The findings contribute to the growing literature on computational methods in quantitative finance while highlighting the potential for advanced optimization techniques to bridge the gap between theoretical frameworks and practical applications in financial markets.
- [358] arXiv:2411.15060 (cross-list from eess.IV) [pdf, html, other]
-
Title: Detecting Hallucinations in Virtual Histology with Neural PrecursorsSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Significant biomedical research and clinical care rely on the histopathologic examination of tissue structure using microscopy of stained tissue. Virtual staining (VS) offers a promising alternative with the potential to reduce cost and eliminate the use of toxic reagents. However, the critical challenge of hallucinations limits confidence in its use, necessitating a VS co-pilot to detect these hallucinations. Here, we first formally establish the problem of hallucination detection in VS. Next, we introduce a scalable, post-hoc hallucination detection method that identifies a Neural Hallucination Precursor (NHP) from VS model embeddings for test-time detection. We report extensive validation across diverse and challenging VS settings to demonstrate NHP's effectiveness and robustness. Furthermore, we show that VS models with fewer hallucinations do not necessarily disclose them better, risking a false sense of security when reporting just the former metric. This highlights the need for a reassessment of current VS evaluation practices.
- [359] arXiv:2411.15067 (cross-list from math.OC) [pdf, html, other]
-
Title: Linear convergence of proximal descent schemes on the Wasserstein spaceComments: 28 pagesSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR)
We investigate proximal descent methods, inspired by the minimizing movement scheme introduced by Jordan, Kinderlehrer and Otto, for optimizing entropy-regularized functionals on the Wasserstein space. We establish linear convergence under flat convexity assumptions, thereby relaxing the common reliance on geodesic convexity. Our analysis circumvents the need for discrete-time adaptations of the Evolution Variational Inequality (EVI). Instead, we leverage a uniform logarithmic Sobolev inequality (LSI) and the entropy "sandwich" lemma, extending the analysis from arXiv:2201.10469 and arXiv:2202.01009. The major challenge in the proof via LSI is to show that the relative Fisher information $I(\cdot|\pi)$ is well-defined at every step of the scheme. Since the relative entropy is not Wasserstein differentiable, we prove that along the scheme the iterates belong to a certain class of Sobolev regularity, and hence the relative entropy $\operatorname{KL}(\cdot|\pi)$ has a unique Wasserstein sub-gradient, and that the relative Fisher information is indeed finite.
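For reference, the minimizing movement (JKO) step underlying the proximal schemes discussed above can be stated, for the entropy objective and step size $\tau > 0$, as $\rho_{k+1} \in \operatorname{arg\,min}_{\rho} \{ \operatorname{KL}(\rho|\pi) + \frac{1}{2\tau} W_2^2(\rho, \rho_k) \}$; this is the textbook form of the scheme, and the paper's exact entropy-regularized functional may differ.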
- [360] arXiv:2411.15076 (cross-list from eess.IV) [pdf, html, other]
-
Title: RankByGene: Gene-Guided Histopathology Representation Learning Through Cross-Modal Ranking ConsistencyWentao Huang, Meilong Xu, Xiaoling Hu, Shahira Abousamra, Aniruddha Ganguly, Saarthak Kapse, Alisa Yurovsky, Prateek Prasanna, Tahsin Kurc, Joel Saltz, Michael L. Miller, Chao ChenComments: 17 pages, 8 figuresSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Spatial transcriptomics (ST) provides essential spatial context by mapping gene expression within tissue, enabling detailed study of cellular heterogeneity and tissue organization. However, aligning ST data with histology images poses challenges due to inherent spatial distortions and modality-specific variations. Existing methods largely rely on direct alignment, which often fails to capture complex cross-modal relationships. To address these limitations, we propose a novel framework that aligns gene and image features using a ranking-based alignment loss, preserving relative similarity across modalities and enabling robust multi-scale alignment. To further enhance the alignment's stability, we employ self-supervised knowledge distillation with a teacher-student network architecture, effectively mitigating disruptions from high dimensionality, sparsity, and noise in gene expression data. Extensive experiments on gene expression prediction and survival analysis demonstrate our framework's effectiveness, showing improved alignment and predictive performance over existing methods and establishing a robust tool for gene-guided image representation learning in digital pathology.
- [361] arXiv:2411.15084 (cross-list from eess.IV) [pdf, html, other]
-
Title: Leapfrog Latent Consistency Model (LLCM) for Medical Images GenerationLakshmikar R. Polamreddy, Kalyan Roy, Sheng-Han Yueh, Deepshikha Mahato, Shilpa Kuppili, Jialu Li, Youshan ZhangComments: Total 16 pages including 5 figures and 36 referencesSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The scarcity of accessible medical image data poses a significant obstacle to effectively training deep learning models for medical diagnosis, as hospitals refrain from sharing their data due to privacy concerns. In response, we gathered a diverse dataset named MedImgs, which comprises 250,127 images spanning 61 disease types and 159 classes of both humans and animals from open-source repositories. We propose a Leapfrog Latent Consistency Model (LLCM) that is distilled from a retrained diffusion model based on the collected MedImgs dataset, which enables our model to generate high-resolution images in real time. We formulate the reverse diffusion process as a probability flow ordinary differential equation (PF-ODE) and solve it in latent space using the Leapfrog algorithm. This formulation enables rapid sampling without necessitating additional iterations. Our model demonstrates state-of-the-art performance in generating medical images. Furthermore, our model can be fine-tuned with any custom medical image dataset, facilitating the generation of a vast array of images. Our experimental results outperform those of existing models on unseen dog cardiac X-ray images. Source code is available at this https URL.
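For context, a hedged sketch of the classic leapfrog (kick-drift-kick) integrator whose name the method borrows, shown on a generic second-order system $x'' = a(x)$; how the paper adapts the scheme to the latent-space PF-ODE of the diffusion model is specific to their formulation and not reproduced here.

```python
import numpy as np

def leapfrog(a, x0, v0, dt, steps):
    """Integrate x'' = a(x) with the symplectic kick-drift-kick scheme."""
    x, v = np.array(x0, float), np.array(v0, float)
    traj = [x.copy()]
    for _ in range(steps):
        v += 0.5 * dt * a(x)   # half kick
        x += dt * v            # full drift
        v += 0.5 * dt * a(x)   # half kick
        traj.append(x.copy())
    return np.array(traj)

# Harmonic oscillator x'' = -x: leapfrog stays energy-stable on the circle.
traj = leapfrog(lambda x: -x, x0=[1.0], v0=[0.0], dt=0.1, steps=100)
print(traj[-1])  # close to cos(10) ~ -0.839
```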
- [362] arXiv:2411.15086 (cross-list from eess.IV) [pdf, html, other]
-
Title: Quantum-enhanced unsupervised image segmentation for medical images analysisComments: 16 pages, 7 figuresSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantum Physics (quant-ph)
Breast cancer remains the leading cause of cancer-related mortality among women worldwide, necessitating the meticulous examination of mammograms by radiologists to characterize abnormal lesions. This manual process demands high accuracy and is often time-consuming, costly, and error-prone. Automated image segmentation using artificial intelligence offers a promising alternative to streamline this workflow. However, most existing methods are supervised, requiring large, expertly annotated datasets that are not always available, and they experience significant generalization issues. Unsupervised learning models can instead be leveraged for image segmentation, but they come at the cost of reduced accuracy or require extensive computational resources. In this paper, we propose the first end-to-end quantum-enhanced framework for unsupervised segmentation of mammography medical images that balances performance accuracy against computational requirements. We first introduce a quantum-inspired image representation that serves as an initial approximation of the segmentation mask. The segmentation task is then formulated as a QUBO problem, aiming to maximize the contrast between the background and the tumor region while ensuring a cohesive segmentation mask with minimal connected components. We conduct an extensive evaluation of quantum and quantum-inspired methods for image segmentation, demonstrating that quantum annealing and variational quantum circuits achieve performance comparable to classical optimization techniques. Notably, quantum annealing is shown to be an order of magnitude faster than the classical optimization method in our experiments. Our findings demonstrate that this framework achieves performance comparable to state-of-the-art supervised methods, including UNet-based architectures, offering a viable unsupervised alternative for breast cancer image segmentation.
- [363] arXiv:2411.15095 (cross-list from stat.ML) [pdf, html, other]
-
Title: Dimension-independent rates for structured neural density estimationSubjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Statistics Theory (math.ST)
We show that deep neural networks achieve dimension-independent rates of convergence for learning structured densities such as those arising in image, audio, video, and text applications. More precisely, we demonstrate that neural networks with a simple $L^2$-minimizing loss achieve a rate of $n^{-1/(4+r)}$ in nonparametric density estimation when the underlying density is Markov to a graph whose maximum clique size is at most $r$, and we provide evidence that in the aforementioned applications, this size is typically constant, i.e., $r=O(1)$. We then establish that the optimal rate in $L^1$ is $n^{-1/(2+r)}$ which, compared to the standard nonparametric rate of $n^{-1/(2+d)}$, reveals that the effective dimension of such problems is the size of the largest clique in the Markov random field. These rates are independent of the data's ambient dimension, making them applicable to realistic models of image, sound, video, and text data. Our results provide a novel justification for deep learning's ability to circumvent the curse of dimensionality, demonstrating dimension-independent convergence rates in these contexts.
- [364] arXiv:2411.15137 (cross-list from math.CO) [pdf, html, other]
-
Title: Reasonable Bounds for Combinatorial Lines of Length ThreeSubjects: Combinatorics (math.CO); Computational Complexity (cs.CC)
We prove that any subset $A \subseteq [3]^n$ with $3^{-n}|A| \ge (\log\log\log\log n)^{-c}$ contains a combinatorial line of length $3$, i.e., $x, y, z \in A$, not all equal, with $x_i=y_i=z_i$ or $(x_i,y_i,z_i)=(0,1,2)$ for all $i = 1, 2, \dots, n$. This improves on the previous best bound of $3^{-n}|A| \ge \Omega((\log^* n)^{-1/2})$ of [D.H.J. Polymath, Ann. of Math. 2012].
Cross submissions (showing 44 of 44 entries)
- [365] arXiv:1902.02698 (replaced) [pdf, html, other]
-
Title: Ranked Enumeration of Conjunctive Query ResultsComments: LMCS journal submissionSubjects: Databases (cs.DB)
We investigate the enumeration of top-k answers for conjunctive queries against relational databases according to a given ranking function. The task is to design data structures and algorithms that allow for efficient enumeration after a preprocessing phase. Our main contribution is a novel priority-queue-based algorithm with near-optimal delay and non-trivial space guarantees that are output-sensitive and depend on the structure of the query. In particular, we exploit certain desirable properties of ranking functions that frequently occur in practice, together with degree information in the database instance, to allow for efficient enumeration. We introduce the notions of {\em decomposable} and {\em compatible} ranking functions in conjunction with query decomposition, a property that allows for partial aggregation of tuple scores in order to efficiently enumerate the ranked output. We complement the algorithmic results with lower bounds justifying why certain assumptions about properties of ranking functions are necessary, and we discuss popular conjectures providing evidence for the optimality of our enumeration delay guarantees. Our results extend and improve upon a long line of work that has studied ranked enumeration from both theoretical and practical perspectives.
- [366] arXiv:2007.01931 (replaced) [pdf, html, other]
-
Title: A Deep-Generative Hybrid Model to Integrate Multimodal and Dynamic Connectivity for Predicting Spectrum-Level Deficits in AutismNiharika Shimona D'Souza, Mary Beth Nebel, Deana Crocetti, Nicholas Wymbs, Joshua Robinson, Stewart Mostofsky, Archana VenkataramanSubjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
We propose an integrated deep-generative framework, that jointly models complementary information from resting-state functional MRI (rs-fMRI) connectivity and diffusion tensor imaging (DTI) tractography to extract predictive biomarkers of a disease. The generative part of our framework is a structurally-regularized Dynamic Dictionary Learning (sr-DDL) model that decomposes the dynamic rs-fMRI correlation matrices into a collection of shared basis networks and time varying patient-specific loadings. This matrix factorization is guided by the DTI tractography matrices to learn anatomically informed connectivity profiles. The deep part of our framework is an LSTM-ANN block, which models the temporal evolution of the patient sr-DDL loadings to predict multidimensional clinical severity. Our coupled optimization procedure collectively estimates the basis networks, the patient-specific dynamic loadings, and the neural network weights. We validate our framework on a multi-score prediction task in 57 patients diagnosed with Autism Spectrum Disorder (ASD). Our hybrid model outperforms state-of-the-art baselines in a five-fold cross validated setting and extracts interpretable multimodal neural signatures of brain dysfunction in ASD.
- [367] arXiv:2103.02324 (replaced) [pdf, other]
-
Title: Estimating the Expected Influence Capacities of Nodes in Complex Networks under the Susceptible-Infectious-Recovered (SIR) ModelComments: There was a minor inaccuracy in coefficient calculation for a competitor centrality measure named as Convex Combinations of Centrality Measures. So, we excluded this centrality measure. Also, there were some minor computational errors in monotonicity calculations, which we think are caused by the computational precision of the programming tools we use or the computer. We fixed itSubjects: Social and Information Networks (cs.SI)
In recent years, epidemic modeling in complex networks has found many applications, including modeling of information or gossip spread in online social networks, modeling of malware spread in communication networks, and the most recent modeling of the COVID-19 pandemic. If the information disseminated is accurate, for example, maximizing its distribution is desirable, whereas if it is a rumor or a virus, its spread should be minimized. In this context, it is very important to identify the super-spreaders that maximize or minimize propagation. Lately, studies for detecting super-spreaders have gained momentum. Most of the studies carried out aim to distinguish the influence of nodes under a specific propagation model (such as SIR) using network centrality measures and subsequently to rank the nodes accordingly. In this study, however, we developed an algorithm that approximates the expected influence of nodes under the popular SIR model. By considering the behavior of the SIR model and only the shortest paths between nodes, the algorithm ranks the nodes according to this approximated value. Our developed algorithm is named the Expected Value Estimation (EVE). We compared the performance of EVE, using different SIR settings on real datasets, with that of many current well-known centrality measures. The experimental studies demonstrated that the solution quality (ranking capability) of EVE is superior to that of its competitors.
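A hedged sketch of the quantity EVE approximates: the expected influence of a seed node under a discrete-time SIR model, estimated here by plain Monte Carlo simulation (the expensive baseline that EVE's shortest-path approximation avoids). The infection probability beta, recovery probability gamma, and toy graph are illustrative assumptions.

```python
import random

def sir_influence(adj, seed, beta=0.1, gamma=1.0, runs=1000):
    """Mean number of ever-infected nodes when `seed` starts infectious."""
    total = 0
    for _ in range(runs):
        infected, recovered = {seed}, set()
        while infected:
            new_infected = set()
            for u in infected:
                for v in adj[u]:                     # try to infect neighbours
                    if v not in infected and v not in recovered:
                        if random.random() < beta:
                            new_infected.add(v)
                if random.random() < gamma:          # gamma = 1: recover after one step
                    recovered.add(u)
            infected = (infected - recovered) | new_infected
        total += len(recovered)                      # everyone infected has recovered
    return total / runs

adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}
print(sir_influence(adj, seed=1))
```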
- [368] arXiv:2110.09934 (replaced) [pdf, html, other]
-
Title: Elevating the future of mobility: UAV-enabled Intelligent Transportation SystemsComments: The 7th International Conference on Advanced Communication Technologies and Networking (CommNet 2024)Subjects: Networking and Internet Architecture (cs.NI)
Intelligent Transportation Systems (ITS) increasingly rely on connectivity for efficient traffic management and an enhanced user experience. Existing ITS solutions operate mainly within a 2D domain, thus missing the potential benefits of aerial platforms. This paper envisions 3D ITS by integrating aerial platforms, such as Unmanned Aerial Vehicles (UAVs), to simultaneously improve network coverage and support multi-modal transportation, including Advanced Air Mobility (AAM). Using stochastic models, we investigate how UAV-based Aerial Base Stations (ABSs) can address the limitations of traditional Terrestrial Base Stations (TBSs) by offering superior coverage, particularly in urban environments. Our results demonstrate that ABSs provide 106.67% more coverage area than TBSs, a higher Signal-to-Noise Ratio (SNR) distribution, and suitability for high-throughput ITS applications.
- [369] arXiv:2201.10825 (replaced) [pdf, html, other]
-
Title: Different Strokes in Randomised Strategies: Revisiting Kuhn's Theorem under Finite-Memory AssumptionsComments: Extended version, preprint of Information and Computation article, 36 pagesSubjects: Computer Science and Game Theory (cs.GT); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
Two-player (antagonistic) games on (possibly stochastic) graphs are a prevalent model in theoretical computer science, notably as a framework for reactive synthesis.
Optimal strategies may require randomisation when dealing with inherently probabilistic goals, balancing multiple objectives, or in contexts of partial information. There is no unique way to define randomised strategies. For instance, one can use so-called mixed strategies or behavioural ones. In the most general setting, these two classes do not share the same expressiveness. A seminal result in game theory -- Kuhn's theorem -- asserts their equivalence in games of perfect recall.
This result crucially relies on the possibility for strategies to use infinite memory, i.e., unlimited knowledge of all past observations. However, computer systems are finite in practice. Hence it is pertinent to restrict our attention to finite-memory strategies, defined as automata with outputs. Randomisation can be implemented in these in different ways: the initialisation, outputs or transitions can be randomised or deterministic respectively. Depending on which aspects are randomised, the expressiveness of the corresponding class of finite-memory strategies differs.
In this work, we study two-player concurrent stochastic games and provide a complete taxonomy of the classes of finite-memory strategies obtained by varying which of the three aforementioned components are randomised. Our taxonomy holds in games of perfect and imperfect information with perfect recall, and in games with more than two players. We also provide an adapted taxonomy for games with imperfect recall.
- [370] arXiv:2211.04439 (replaced) [pdf, html, other]
-
Title: Sampling from convex sets with a cold start using multiscale decompositionsComments: Changes from v3: Added further discussion/details, and fixed some typos. This version should be close to the final versionSubjects: Data Structures and Algorithms (cs.DS); Computational Geometry (cs.CG); Probability (math.PR)
Running a random walk in a convex body $K\subseteq\mathbb{R}^n$ is a standard approach to sample approximately uniformly from the body. The requirement is that from a suitable initial distribution, the distribution of the walk comes close to the uniform distribution $\pi_K$ on $K$ after a number of steps polynomial in $n$ and the aspect ratio $R/r$ (i.e., when $rB_2 \subseteq K \subseteq RB_{2}$).
Proofs of rapid mixing of such walks often require the probability density $\eta_0$ of the initial distribution with respect to $\pi_K$ to be at most $\mathrm{poly}(n)$: this is called a "warm start". Achieving a warm start often requires non-trivial pre-processing before starting the random walk. This motivates proving rapid mixing from a "cold start", wherein $\eta_0$ can be as high as $\exp(\mathrm{poly}(n))$. Unlike warm starts, a cold start is usually trivial to achieve. However, a random walk need not mix rapidly from a cold start: an example being the well-known "ball walk". On the other hand, Lovász and Vempala proved that the "hit-and-run" random walk mixes rapidly from a cold start. For the related coordinate hit-and-run (CHR) walk, which has been found to be promising in computational experiments, rapid mixing from a warm start was proved only recently but the question of rapid mixing from a cold start remained open.
We construct a family of random walks inspired by classical decompositions of subsets of $\mathbb{R}^n$ into countably many axis-aligned dyadic cubes. We show that even with a cold start, the mixing times of these walks are bounded by a polynomial in $n$ and the aspect ratio. Our main technical ingredient is an isoperimetric inequality for $K$ for a metric that magnifies distances between points close to the boundary of $K$. As a corollary, we show that the CHR walk also mixes rapidly both from a cold start and from a point not too close to the boundary of $K$.
- [371] arXiv:2301.07473 (replaced) [pdf, other]
-
Title: Discrete Latent Structure in Neural NetworksSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Many types of data from fields including natural language processing, computer vision, and bioinformatics, are well represented by discrete, compositional structures such as trees, sequences, or matchings. Latent structure models are a powerful tool for learning to extract such representations, offering a way to incorporate structural bias, discover insight about the data, and interpret decisions. However, effective training is challenging, as neural networks are typically designed for continuous computation.
This text explores three broad strategies for learning with discrete latent structure: continuous relaxation, surrogate gradients, and probabilistic estimation. Our presentation relies on consistent notations for a wide range of models. As such, we reveal many new connections between latent structure learning strategies, showing how most consist of the same small set of fundamental building blocks, but use them differently, leading to substantially different applicability and properties.
- [372] arXiv:2302.13342 (replaced) [pdf, html, other]
-
Title: Envy-freeness and maximum Nash welfare for mixed divisible and indivisible goodsSubjects: Computer Science and Game Theory (cs.GT)
We study fair allocation of resources consisting of both divisible and indivisible goods to agents with additive valuations. When only divisible or indivisible goods exist, it is known that an allocation that achieves the maximum Nash welfare (MNW) satisfies the classic fairness notions based on envy. Moreover, the literature shows the structures and characterizations of MNW allocations when valuations are binary and linear (i.e., divisible goods are homogeneous). In this paper, we show that when all agents' valuations are binary linear, an MNW allocation for mixed goods satisfies envy-freeness up to any good for mixed goods (EFXM). This notion is stronger than an existing one called envy-freeness for mixed goods (EFM), and our result generalizes the existing results for the case when only divisible or indivisible goods exist. When all agents' valuations are binary over indivisible goods and identical over divisible goods (e.g., the divisible good is money), we extend the known characterization of an MNW allocation for indivisible goods to mixed goods, and also show that an MNW allocation satisfies EFXM. For general additive valuations, we also provide a formal proof that an MNW allocation satisfies a weaker notion than EFM.
- [373] arXiv:2303.06252 (replaced) [pdf, other]
-
Title: AI-Enhanced Intensive Care Unit: Revolutionizing Patient Care with Pervasive SensingSubhash Nerella, Ziyuan Guan, Scott Siegel, Jiaqing Zhang, Ruilin Zhu, Kia Khezeli, Azra Bihorac, Parisa RashidiSubjects: Artificial Intelligence (cs.AI)
The intensive care unit (ICU) is a specialized hospital space where critically ill patients receive intensive care and monitoring. Comprehensive monitoring is imperative in assessing patients' conditions, in particular acuity, and ultimately the quality of care. However, the extent of patient monitoring in the ICU is limited due to time constraints and the workload on healthcare providers. Currently, visual assessments for acuity, including fine details such as facial expressions, posture, and mobility, are sporadically captured, or not captured at all. These manual observations are subjective, prone to documentation errors, and add to care providers' workload. Artificial Intelligence (AI) enabled systems have the potential to augment patient visual monitoring and assessment thanks to their exceptional learning capabilities. Such systems require robust annotated data for training. To this end, we have developed a pervasive sensing and data processing system which collects data from multiple modalities (depth images, color RGB images, accelerometry, electromyography, sound pressure, and light levels) in the ICU for developing intelligent monitoring systems for continuous and granular assessment of acuity, delirium risk, pain, and mobility. This paper presents the Intelligent Intensive Care Unit (I2CU) system architecture we developed for real-time patient monitoring and visual assessment.
- [374] arXiv:2303.15647 (replaced) [pdf, html, other]
-
Title: Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-TuningSubjects: Computation and Language (cs.CL)
This paper presents a systematic overview of parameter-efficient fine-tuning methods, covering over 50 papers published between early 2019 and mid-2024. These methods aim to address the challenges of fine-tuning large language models by training only a small subset of parameters. We provide a taxonomy that covers a broad range of methods and present a detailed method comparison with a specific focus on real-life efficiency in fine-tuning multibillion-scale language models. We also conduct an extensive head-to-head experimental comparison of 15 diverse PEFT methods, evaluating their performance and efficiency on models up to 11B parameters. Our findings reveal that methods previously shown to surpass a strong LoRA baseline face difficulties in resource-constrained settings, where hyperparameter optimization is limited and the network is fine-tuned only for a few epochs. Finally, we provide a set of practical recommendations for using PEFT methods and outline potential future research directions.
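To ground the LoRA baseline mentioned above, here is a minimal sketch of a LoRA-style adapter layer (our illustration under assumed dimensions and hyper-parameters, not code from any of the surveyed papers): the pre-trained weight is frozen and only the low-rank factors A and B are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style adapter: freeze W, train low-rank A and B."""
    def __init__(self, in_dim, out_dim, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)   # stands in for a pre-trained layer
        self.base.weight.requires_grad_(False)   # frozen "pre-trained" weight
        self.base.bias.requires_grad_(False)
        # Low-rank update W + (alpha/rank) * B @ A, with only A, B trainable.
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable}/{total}")  # only A and B are trained
```

With rank 8 on a 768x768 layer, roughly 2% of the parameters are trainable, which is the kind of saving the survey quantifies at multibillion scale.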
- [375] arXiv:2304.04640 (replaced) [pdf, html, other]
-
Title: NeuroBench: A Framework for Benchmarking Neuromorphic Computing Algorithms and SystemsJason Yik, Korneel Van den Berghe, Douwe den Blanken, Younes Bouhadjar, Maxime Fabre, Paul Hueber, Weijie Ke, Mina A Khoei, Denis Kleyko, Noah Pacik-Nelson, Alessandro Pierro, Philipp Stratmann, Pao-Sheng Vincent Sun, Guangzhi Tang, Shenqi Wang, Biyan Zhou, Soikat Hasan Ahmed, George Vathakkattil Joseph, Benedetto Leto, Aurora Micheli, Anurag Kumar Mishra, Gregor Lenz, Tao Sun, Zergham Ahmed, Mahmoud Akl, Brian Anderson, Andreas G. Andreou, Chiara Bartolozzi, Arindam Basu, Petrut Bogdan, Sander Bohte, Sonia Buckley, Gert Cauwenberghs, Elisabetta Chicca, Federico Corradi, Guido de Croon, Andreea Danielescu, Anurag Daram, Mike Davies, Yigit Demirag, Jason Eshraghian, Tobias Fischer, Jeremy Forest, Vittorio Fra, Steve Furber, P. Michael Furlong, William Gilpin, Aditya Gilra, Hector A. Gonzalez, Giacomo Indiveri, Siddharth Joshi, Vedant Karia, Lyes Khacef, James C. Knight, Laura Kriener, Rajkumar Kubendran, Dhireesha Kudithipudi, Yao-Hong Liu, Shih-Chii Liu, Haoyuan Ma, Rajit Manohar, Josep Maria Margarit-Taulé, Christian Mayr, Konstantinos Michmizos, Dylan Muir, Emre Neftci, Thomas Nowotny, Fabrizio Ottati, Ayca Ozcelikkale, Priyadarshini Panda, Jongkil Park, Melika Payvand, Christian Pehle, Mihai A. Petrovici, Christoph Posch, Alpha Renner, Yulia Sandamirskaya, Clemens JS Schaefer, André van Schaik, Johannes Schemmel, Samuel Schmidgall, Catherine Schuman, Jae-sun Seo, Sadique Sheik, Sumit Bam Shrestha, Manolis Sifalakis, Amos Sironi, Matthew Stewart, Kenneth Stewart, Terrence C. Stewart, Jonathan Timcheck, Nergis Tömen, Gianvito Urgese, Marian Verhelst, Craig M. Vineyard, Bernhard Vogginger, Amirreza Yousefzadeh, Fatima Tuz Zohora, Charlotte Frenkel, Vijay Janapa ReddiComments: System track baselines addedSubjects: Artificial Intelligence (cs.AI)
Neuromorphic computing shows promise for advancing computing efficiency and capabilities of AI applications using brain-inspired principles. However, the neuromorphic research field currently lacks standardized benchmarks, making it difficult to accurately measure technological advancements, compare performance with conventional methods, and identify promising future research directions. Prior neuromorphic computing benchmark efforts have not seen widespread adoption due to a lack of inclusive, actionable, and iterative benchmark design and guidelines. To address these shortcomings, we present NeuroBench: a benchmark framework for neuromorphic computing algorithms and systems. NeuroBench is a collaboratively-designed effort from an open community of researchers across industry and academia, aiming to provide a representative structure for standardizing the evaluation of neuromorphic approaches. The NeuroBench framework introduces a common set of tools and systematic methodology for inclusive benchmark measurement, delivering an objective reference framework for quantifying neuromorphic approaches in both hardware-independent (algorithm track) and hardware-dependent (system track) settings. In this article, we outline tasks and guidelines for benchmarks across multiple application domains, and present initial performance baselines across neuromorphic and conventional approaches for both benchmark tracks. NeuroBench is intended to continually expand its benchmarks and features to foster and track the progress made by the research community.
- [376] arXiv:2304.07741 (replaced) [pdf, html, other]
-
Title: Canvas: End-to-End Kernel Architecture Search in Neural NetworksSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
The demands for higher performance and accuracy in neural networks (NNs) never end. Existing tensor compilation and Neural Architecture Search (NAS) techniques orthogonally optimize the two goals but actually share many similarities in their concrete strategies. We exploit such opportunities by combining the two into one and make a case for Kernel Architecture Search (KAS). KAS reviews NAS from a system perspective and zooms into a more fine-grained level to generate neural kernels with both high performance and good accuracy. To demonstrate the potential of KAS, we build an end-to-end framework, Canvas, to find high-quality kernels as convolution replacements. Canvas samples from a rich set of fine-grained primitives to stochastically and iteratively construct new kernels and evaluate them according to user-specified constraints. Canvas supports freely adjustable tensor dimension sizes inside the kernel and uses two levels of solvers to satisfy structural legality and fully utilize model budgets. The evaluation shows that by replacing standard convolutions with generated new kernels in common NNs, Canvas achieves an average 1.5x speedup over the previous state-of-the-art with acceptable accuracy loss and search efficiency. Canvas verifies the practicability of KAS by rediscovering many previously hand-designed kernels and producing new structures that may inspire future machine learning innovations. For source code and implementation, we open-sourced Canvas at this https URL.
- [377] arXiv:2304.10851 (replaced) [pdf, html, other]
-
Title: What Do GNNs Actually Learn? Towards Understanding their RepresentationsSubjects: Machine Learning (cs.LG)
In recent years, graph neural networks (GNNs) have achieved great success in the field of graph representation learning. Although prior work has shed light on the expressiveness of those models (i.e., whether they can distinguish pairs of non-isomorphic graphs), it is still not clear what structural information is encoded into the node representations that are learned by those models. In this paper, we address this gap by studying the node representations learned by four standard GNN models. We find that some models produce identical representations for all nodes, while the representations learned by other models are linked to some notion of walks of specific length that start from the nodes. We establish Lipschitz bounds for these models with respect to the number of (normalized) walks. Additionally, we investigate the influence of node features on the learned representations. We find that if the initial representations of all nodes point in the same direction, the representations learned at the $k$-th layer of the models are also related to the initial features of nodes that can be reached in exactly $k$ steps. We also apply our findings to understand the phenomenon of oversquashing that occurs in GNNs. Our theoretical analysis is validated through experiments on synthetic and real-world datasets.
- [378] arXiv:2304.12606 (replaced) [pdf, other]
-
Title: Output Statistics of Random Binning: Tsallis Divergence and Its ApplicationsMasoud Kavian, Mohammad Mahdi Mojahedian, Mohammad Hossein Yassaee, Mahtab Mirmohseni, Mohammad Reza ArefSubjects: Information Theory (cs.IT)
Random binning is a widely used technique in information theory with diverse applications. In this paper, we focus on the output statistics of random binning (OSRB) using the Tsallis divergence $T_\alpha$. We analyze all values of $\alpha \in (0, \infty)\cup\{\infty\}$ and consider three scenarios: (i) the binned sequence is generated i.i.d., (ii) the sequence is randomly chosen from an $\epsilon$-typical set, and (iii) the sequence originates from an $\epsilon$-typical set and is passed through a non-memoryless virtual channel. Our proofs cover both achievability and converse results. To address the unbounded nature of $T_\infty$, we extend the OSRB framework using Rényi's divergence with order infinity, denoted $D_\infty$. As part of our exploration, we analyze a specific form of Rényi's conditional entropy and its properties. Additionally, we demonstrate the application of this framework in deriving achievability results for the wiretap channel, where Tsallis divergence serves as a security measure. The secure rate we obtain through the OSRB analysis matches the secure capacity for $\alpha \in (0, 2]\cup\{{\infty}\}$ and serves as a potential candidate for the secure capacity when $\alpha \in (2, \infty)$.
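For reference, the Tsallis divergence of order $\alpha$ used above is commonly defined for discrete distributions $P$ and $Q$ as
$$T_\alpha(P\|Q) = \frac{1}{\alpha-1}\Big(\sum_x P(x)^\alpha Q(x)^{1-\alpha} - 1\Big),$$
which recovers the Kullback-Leibler divergence in the limit $\alpha \to 1$ and relates monotonically to Rényi's divergence $D_\alpha(P\|Q) = \frac{1}{\alpha-1}\log \sum_x P(x)^\alpha Q(x)^{1-\alpha}$ via $T_\alpha = \frac{e^{(\alpha-1)D_\alpha}-1}{\alpha-1}$, consistent with the abstract's use of $D_\infty$ to handle the unbounded $T_\infty$ case.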
- [379] arXiv:2305.02922 (replaced) [pdf, html, other]
-
Title: Coloring tournaments with few colors: Algorithms and complexityComments: Journal versionSubjects: Data Structures and Algorithms (cs.DS)
A $k$-coloring of a tournament is a partition of its vertices into $k$ acyclic sets. Deciding if a tournament is 2-colorable is NP-hard. A natural problem, akin to that of coloring a 3-colorable graph with few colors, is to color a 2-colorable tournament with few colors. This problem does not seem to have been addressed before, although it is a special case of coloring a 2-colorable 3-uniform hypergraph with few colors, which is a well-studied problem with super-constant lower bounds.
We present a new efficient decomposition lemma for tournaments, which we use to design polynomial-time algorithms to color various classes of tournaments with few colors, notably, to color a 2-colorable tournament with ten colors. We also use this lemma to prove equivalence between the problems of coloring 3-colorable tournaments and coloring 3-colorable graphs with constantly many colors. For the classes of tournaments considered, we complement our upper bounds with strengthened lower bounds, painting a comprehensive picture of the algorithmic and complexity aspects of coloring tournaments.
- [380] arXiv:2305.03146 (replaced) [pdf, html, other]
-
Title: Testing Convex TruncationComments: Preliminary version in SODA 2023; v3 includes a simpler and stronger lower bound than v2. 26 pagesSubjects: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Probability (math.PR); Statistics Theory (math.ST)
We study the basic statistical problem of testing whether normally distributed $n$-dimensional data has been truncated, i.e. altered by only retaining points that lie in some unknown truncation set $S \subseteq \mathbb{R}^n$. As our main algorithmic results,
(1) We give a computationally efficient $O(n)$-sample algorithm that can distinguish the standard normal distribution $N(0,I_n)$ from $N(0,I_n)$ conditioned on an unknown and arbitrary convex set $S$.
(2) We give a different computationally efficient $O(n)$-sample algorithm that can distinguish $N(0,I_n)$ from $N(0,I_n)$ conditioned on an unknown and arbitrary mixture of symmetric convex sets.
These results stand in sharp contrast with known results for learning or testing convex bodies with respect to the normal distribution or learning convex-truncated normal distributions, where state-of-the-art algorithms require essentially $n^{\sqrt{n}}$ samples. An easy argument shows that no finite number of samples suffices to distinguish $N(0,I_n)$ from an unknown and arbitrary mixture of general (not necessarily symmetric) convex sets, so no common generalization of results (1) and (2) above is possible.
We also prove that any algorithm (computationally efficient or otherwise) that can distinguish $N(0,I_n)$ from $N(0,I_n)$ conditioned on an unknown symmetric convex set must use $\Omega(n)$ samples. This shows that the sample complexity of each of our algorithms is optimal up to a constant factor.
- [381] arXiv:2305.03223 (replaced) [pdf, html, other]
-
Title: Structural Group Unfairness: Measurement and Mitigation by means of the Effective ResistanceComments: Accepted at International AAAI Conference on Web and Social Media (ICWSM) 2025. Please cite accordinglySubjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Social networks contribute to the distribution of social capital, defined as the relationships, norms of trust and reciprocity within a community or society that facilitate cooperation and collective action. Therefore, better positioned members in a social network benefit from faster access to diverse information and higher influence on information dissemination. A variety of methods have been proposed in the literature to measure social capital at an individual level. However, there is a lack of methods to quantify social capital at a group level, which is particularly important when the groups are defined on the grounds of protected attributes. To fill this gap, we propose to measure the social capital of a group of nodes by means of the effective resistance and emphasize the importance of considering the entire network topology. Grounded in spectral graph theory, we introduce three effective resistance-based measures of group social capital, namely group isolation, group diameter and group control, where the groups are defined according to the value of a protected attribute. We denote the social capital disparity among different groups in a network as structural group unfairness, and propose to mitigate it by means of a budgeted edge augmentation heuristic that systematically increases the social capital of the most disadvantaged group. In experiments on real-world networks, we uncover significant levels of structural group unfairness when using gender as the protected attribute, with females being the most disadvantaged group in comparison to males. We also illustrate how our proposed edge augmentation approach is able to not only effectively mitigate the structural group unfairness but also increase the social capital of all groups in the network.
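To make the central quantity concrete, here is a small sketch (our illustration, not the authors' code) of computing pairwise effective resistances from the pseudoinverse of the graph Laplacian, using the standard identity $R_{uv} = L^+_{uu} + L^+_{vv} - 2L^+_{uv}$:

```python
import numpy as np

def effective_resistance_matrix(adj):
    """Pairwise effective resistances of a connected undirected graph.

    Uses R_uv = L+_uu + L+_vv - 2 L+_uv, where L+ is the Moore-Penrose
    pseudoinverse of the graph Laplacian L = D - A.
    """
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj
    lap_pinv = np.linalg.pinv(lap)
    d = np.diag(lap_pinv)
    return d[:, None] + d[None, :] - 2 * lap_pinv

# Toy example: path graph on 4 nodes; the resistance between endpoints is 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
R = effective_resistance_matrix(A)
print(R[0, 3])  # -> 3.0 (three unit resistors in series)
```

Group-level measures such as the isolation, diameter, and control notions above can then be obtained by aggregating $R$ over the nodes of a protected group.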
- [382] arXiv:2305.04508 (replaced) [pdf, html, other]
-
Title: Improving Code Search with Hard Negative Sampling Based on Fine-tuningComments: Accepted by APSEC 2024Subjects: Software Engineering (cs.SE)
Pre-trained code models have emerged as the state-of-the-art paradigm for code search tasks. The paradigm involves pre-training the model on search-irrelevant tasks such as masked language modeling, followed by the fine-tuning stage, which focuses on the search-relevant task. The typical fine-tuning method is to employ a dual-encoder architecture to encode semantic embeddings of query and code separately, and then calculate their similarity based on the embeddings. However, the typical dual-encoder architecture falls short in modeling token-level interactions between query and code, which limits the capabilities of the model. To address this limitation, we introduce a cross-encoder architecture for code search that jointly encodes the concatenation of query and code. We further introduce a Retriever-Ranker (RR) framework that cascades the dual-encoder and cross-encoder to promote the efficiency of evaluation and online serving. Moreover, we present a ranking-based hard negative sampling (PS) method to improve the ability of the cross-encoder to distinguish hard negative codes, which further enhances the cascaded RR framework. Experiments on four datasets using three code models demonstrate the superiority of our proposed method. We have made the code available at this https URL.
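As an illustration of the cascade (a schematic sketch, not the released implementation), the retriever scores the whole corpus cheaply with dual-encoder embeddings, and the cross-encoder re-scores only the top candidates jointly; all names here are hypothetical:

```python
import torch
import torch.nn.functional as F

def retrieve_then_rank(query_text, query_vec, codes, code_vecs, cross_scorer, k=10):
    """Retriever-Ranker cascade (schematic).

    query_vec:    (d,) dual-encoder embedding of the query
    code_vecs:    (N, d) precomputed dual-encoder embeddings of the code corpus
    cross_scorer: callable (query, code) -> float; stands in for a
                  cross-encoder applied to the concatenated pair
    """
    # Stage 1: cheap retrieval by embedding similarity over the whole corpus.
    sims = F.cosine_similarity(query_vec.unsqueeze(0), code_vecs, dim=1)
    top = torch.topk(sims, min(k, len(codes))).indices.tolist()
    # Stage 2: expensive joint re-ranking, but only over the k survivors.
    top.sort(key=lambda i: cross_scorer(query_text, codes[i]), reverse=True)
    return top

# Toy usage with random embeddings and a dummy character-overlap "cross-encoder".
d, corpus = 32, ["def add(a, b): ...", "def sort(xs): ...", "class Stack: ..."]
vecs = torch.randn(len(corpus), d)
ranked = retrieve_then_rank("add two numbers", torch.randn(d), corpus, vecs,
                            cross_scorer=lambda q, c: float(len(set(q) & set(c))))
print([corpus[i] for i in ranked])
```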
- [383] arXiv:2305.05933 (replaced) [pdf, html, other]
-
Title: Spectrum Breathing: Protecting Over-the-Air Federated Learning Against InterferenceSubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Theory (cs.IT)
Federated Learning (FL) is a widely embraced paradigm for distilling artificial intelligence from distributed mobile data. However, the deployment of FL in mobile networks can be compromised by exposure to interference from neighboring cells or jammers. Existing interference mitigation techniques require multi-cell cooperation or at least interference channel state information, which is expensive in practice. On the other hand, power control that treats interference as noise may not be effective due to limited power budgets, and may also trigger countermeasures by interference sources. As a practical approach for protecting FL against interference, we propose Spectrum Breathing, which cascades stochastic-gradient pruning and spread spectrum to suppress interference without bandwidth expansion. The cost is a higher learning latency, exploiting the graceful degradation of learning speed due to pruning. We synchronize the two operations such that their levels are controlled by the same parameter, Breathing Depth. To optimally control the parameter, we develop a martingale-based approach to convergence analysis of Over-the-Air FL with spectrum breathing, termed AirBreathing FL. We show a performance tradeoff between gradient-pruning and interference-induced errors as regulated by the breathing depth. Given the receive SIR and model size, the optimization of the tradeoff yields two schemes for controlling the breathing depth, which can be either fixed or adaptive to channels and the learning process. As shown by experiments, in scenarios where traditional Over-the-Air FL fails to converge in the presence of strong interference, AirBreathing FL with either fixed or adaptive breathing depth can ensure convergence, with the adaptive scheme achieving close-to-ideal performance.
- [384] arXiv:2305.09145 (replaced) [pdf, html, other]
-
Title: Deep ReLU Networks Have Surprisingly Simple PolytopesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
A ReLU network is a piecewise linear function over polytopes. Figuring out the properties of such polytopes is of fundamental importance for the research and development of neural networks. So far, theoretical and empirical studies on polytopes have stayed at the level of counting their number, which is far from a complete characterization. Here, we propose to study the shapes of polytopes via the number of faces of the polytope. Then, by computing and analyzing the histogram of faces across polytopes, we find that a ReLU network has relatively simple polytopes under both initialization and gradient descent, although these polytopes can be rather diverse and complicated by a specific design. This finding can be appreciated as a kind of generalized implicit bias, subject to the intrinsic geometric constraint in the space partition of a ReLU network. Next, we perform a combinatorial analysis to explain why adding depth does not generate a more complicated polytope, by bounding the average number of faces of polytopes with the dimensionality. Our results concretely reveal what kind of simple functions a network learns and what happens when a network goes deep. Also, by characterizing the shapes of polytopes, the number of faces can serve as a novel lever for other problems, e.g., as a generic tool to explain the power of popular shortcut networks such as ResNet and to analyze the impact of different regularization strategies on a network's space partition.
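The space-partition view is easy to probe empirically. A small sketch (ours, not the paper's code) that counts the distinct ReLU activation patterns, i.e. linear pieces, hit by random inputs of a toy network:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer ReLU net with random weights on 2D inputs.
W1, b1 = rng.normal(size=(16, 2)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 16)), rng.normal(size=16)

def activation_pattern(x):
    """Binary on/off pattern of every ReLU unit; each distinct pattern
    indexes one polytope on which the network is affine."""
    h1 = W1 @ x + b1
    h2 = W2 @ np.maximum(h1, 0) + b2
    return tuple((h1 > 0).astype(int)) + tuple((h2 > 0).astype(int))

samples = rng.uniform(-3, 3, size=(20000, 2))
patterns = {activation_pattern(x) for x in samples}
print(f"{len(patterns)} distinct linear regions hit by 20k samples")
```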
- [385] arXiv:2306.01229 (replaced) [pdf, html, other]
-
Title: Exploring the Boundaries of Semi-Supervised Facial Expression Recognition using In-Distribution, Out-of-Distribution, and Unconstrained DataComments: Accepted in IEEE Transactions on Affective Computing (TAFFC), 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Deep learning-based methods have been the key driving force behind much of the recent success of facial expression recognition (FER) systems. However, the need for large amounts of labelled data remains a challenge. Semi-supervised learning offers a way to overcome this limitation, allowing models to learn from a small amount of labelled data along with a large unlabelled dataset. While semi-supervised learning has shown promise in FER, most current methods from the general computer vision literature have not been explored in the context of FER. In this work, we present a comprehensive study of 11 of the most recent semi-supervised methods in the context of FER, namely Pi-model, Pseudo-label, Mean Teacher, VAT, UDA, MixMatch, ReMixMatch, FlexMatch, FixMatch, CoMatch, and CCSSL. Our investigation covers semi-supervised learning from in-distribution, out-of-distribution, unconstrained, and very small unlabelled data. Our evaluation includes five FER datasets plus one large face dataset for unconstrained learning. Our results demonstrate that FixMatch consistently achieves better performance on in-distribution unlabelled data, while ReMixMatch stands out among all methods for out-of-distribution, unconstrained, and scarce unlabelled data scenarios. Another significant observation is that with an equal number of labelled samples, semi-supervised learning delivers a considerable improvement over supervised learning, regardless of whether the unlabelled data is in-distribution, out-of-distribution, or unconstrained. We also conduct sensitivity analyses on critical hyper-parameters for the two best methods of each setting. To facilitate reproducibility and further development, we make our code publicly available at: this http URL.
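For context, the core of FixMatch, which the study finds strongest on in-distribution data, fits in a few lines. A schematic sketch under the usual weak/strong augmentation setup (function and argument names are hypothetical):

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, weak_batch, strong_batch, threshold=0.95):
    """Pseudo-label weakly augmented images, then train the model to predict
    those labels on strongly augmented views, keeping only confident ones."""
    with torch.no_grad():
        probs = F.softmax(model(weak_batch), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = conf.ge(threshold).float()          # confidence filter
    logits_strong = model(strong_batch)
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * mask).mean()
```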
- [386] arXiv:2306.05480 (replaced) [pdf, html, other]
-
Title: Artificial General Intelligence for Medical Imaging AnalysisXiang Li, Lin Zhao, Lu Zhang, Zihao Wu, Zhengliang Liu, Hanqi Jiang, Chao Cao, Shaochen Xu, Yiwei Li, Haixing Dai, Yixuan Yuan, Jun Liu, Gang Li, Dajiang Zhu, Pingkun Yan, Quanzheng Li, Wei Liu, Tianming Liu, Dinggang ShenSubjects: Artificial Intelligence (cs.AI)
Large-scale Artificial General Intelligence (AGI) models, including Large Language Models (LLMs) such as ChatGPT/GPT-4, have achieved unprecedented success in a variety of general domain tasks. Yet, when applied directly to specialized domains like medical imaging, which require in-depth expertise, these models face notable challenges arising from the medical field's inherent complexities and unique characteristics. In this review, we delve into the potential applications of AGI models in medical imaging and healthcare, with a primary focus on LLMs, Large Vision Models, and Large Multimodal Models. We provide a thorough overview of the key features and enabling techniques of LLMs and AGI, and further examine the roadmaps guiding the evolution and implementation of AGI models in the medical sector, summarizing their present applications, potentialities, and associated challenges. In addition, we highlight potential future research directions, offering a holistic view on upcoming ventures. This comprehensive review aims to offer insights into the future implications of AGI in medical imaging, healthcare, and beyond.
- [387] arXiv:2306.06077 (replaced) [pdf, html, other]
-
Title: Semantically-Prompted Language Models Improve Visual DescriptionsComments: Published at NAACL 2024. See this https URLJournal-ref: In Findings of the Association for Computational Linguistics: NAACL 2024, pages 4285-4302Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Language-vision models like CLIP have made significant strides in vision tasks, such as zero-shot image classification (ZSIC). However, generating specific and expressive visual descriptions remains challenging; descriptions produced by current methods are often ambiguous and lacking in granularity. To tackle these issues, we propose V-GLOSS: Visual Glosses, a novel method built upon two key ideas. The first is Semantic Prompting, which conditions a language model on structured semantic knowledge. The second is a new contrastive algorithm that elicits fine-grained distinctions between similar concepts. With both ideas, we demonstrate that V-GLOSS improves visual descriptions and achieves strong results in the zero-shot setting on general and fine-grained image-classification datasets, including ImageNet, STL-10, FGVC Aircraft, and Flowers 102. Moreover, these descriptive capabilities contribute to enhancing image-generation performance. Finally, we introduce a quality-tested silver dataset with descriptions generated with V-GLOSS for all ImageNet classes.
- [388] arXiv:2306.06202 (replaced) [pdf, html, other]
-
Title: NeuroGraph: Benchmarks for Graph Machine Learning in Brain ConnectomicsAnwar Said, Roza G. Bayrak, Tyler Derr, Mudassir Shabbir, Daniel Moyer, Catie Chang, Xenofon KoutsoukosComments: NeurIPS23Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Machine learning provides a valuable tool for analyzing high-dimensional functional neuroimaging data, and is proving effective in predicting various neurological conditions, psychiatric disorders, and cognitive patterns. In functional magnetic resonance imaging (fMRI) research, interactions between brain regions are commonly modeled using graph-based representations. The potency of graph machine learning methods has been established across myriad domains, marking a transformative step in data interpretation and predictive modeling. Yet, despite their promise, the transposition of these techniques to the neuroimaging domain has been challenging due to the expansive number of potential preprocessing pipelines and the large parameter search space for graph-based dataset construction. In this paper, we introduce NeuroGraph, a collection of graph-based neuroimaging datasets, and demonstrate its utility for predicting multiple categories of behavioral and cognitive traits. We delve deeply into the dataset generation search space by crafting 35 datasets that encompass static and dynamic brain connectivity, running in excess of 15 baseline methods for benchmarking. Additionally, we provide generic frameworks for learning on both static and dynamic graphs. Our extensive experiments lead to several key observations. Notably, using correlation vectors as node features, incorporating a larger number of regions of interest, and employing sparser graphs lead to improved performance. To foster further advancements in graph-based, data-driven neuroimaging analysis, we offer a comprehensive open-source Python package that includes the benchmark datasets, baseline implementations, model training, and standard evaluation.
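The best-performing recipe reported above (correlation vectors as node features, sparse graphs) is simple to reproduce in outline. A sketch under assumed array shapes, not the NeuroGraph package itself:

```python
import numpy as np

def build_connectome_graph(timeseries, density=0.05):
    """timeseries: (T, R) signals for R regions of interest.

    Returns a sparse adjacency matrix (keeping the top `density` fraction of
    correlations by magnitude) and node features set to each region's
    correlation vector."""
    corr = np.corrcoef(timeseries.T)             # (R, R) correlation matrix
    np.fill_diagonal(corr, 0.0)
    thresh = np.quantile(np.abs(corr), 1.0 - density)
    adj = (np.abs(corr) >= thresh).astype(float) # sparse graph
    features = corr                              # row i = correlation vector of region i
    return adj, features

adj, feats = build_connectome_graph(np.random.randn(200, 100))
print(adj.sum() / adj.size)  # achieved edge density, approximately `density`
```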
- [389] arXiv:2306.09471 (replaced) [pdf, html, other]
-
Title: Privacy Guarantees for Personal Mobility Data in Humanitarian ResponseSubjects: Cryptography and Security (cs.CR)
Personal mobility data from mobile phones and other sensors are increasingly used to inform policymaking during pandemics, natural disasters, and other humanitarian crises. However, even aggregated mobility traces can reveal private information about individual movements to potentially malicious actors. This paper develops and tests an approach for releasing private mobility data, which provides formal guarantees over the privacy of the underlying subjects. Specifically, we (1) introduce an algorithm for constructing differentially private mobility matrices, and derive privacy and accuracy bounds on this algorithm; (2) use real-world data from mobile phone operators in Afghanistan and Rwanda to show how this algorithm can enable the use of private mobility data in two high-stakes policy decisions: pandemic response and the distribution of humanitarian aid; and (3) discuss practical decisions that need to be made when implementing this approach, such as how to optimally balance privacy and accuracy. Taken together, these results can help enable the responsible use of private mobility data in humanitarian response.
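A minimal sketch of the first ingredient (not the authors' exact algorithm): releasing an origin-destination mobility matrix under $\epsilon$-differential privacy with the Laplace mechanism, assuming each person contributes at most one trip:

```python
import numpy as np

def dp_mobility_matrix(trips, n_regions, epsilon=1.0, sensitivity=1.0):
    """trips: list of (origin, destination) region pairs, one per person.

    If each person contributes at most `sensitivity` trips, adding
    Laplace(sensitivity/epsilon) noise to every cell yields epsilon-DP.
    """
    counts = np.zeros((n_regions, n_regions))
    for o, d in trips:
        counts[o, d] += 1
    noise = np.random.laplace(scale=sensitivity / epsilon, size=counts.shape)
    return np.clip(counts + noise, 0, None)  # clipping is post-processing, so DP is preserved

private = dp_mobility_matrix([(0, 1), (1, 2), (0, 1)], n_regions=3, epsilon=0.5)
print(private)
```

The privacy/accuracy balance the paper discusses corresponds to the choice of `epsilon`: smaller values give stronger guarantees but noisier matrices.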
- [390] arXiv:2306.14779 (replaced) [pdf, html, other]
-
Title: A Note On The Natural Range Of Unambiguous-SATComments: Replacement: Various sentences were rewritten for clarity and grammar. This did not change any of the resultsSubjects: Computational Complexity (cs.CC)
We discuss the natural range of the Unambiguous-SAT problem with respect to the number of clauses. We prove that for a given Boolean formula in precise conjunctive normal form with n variables, there exist functions f(n) and g(n) such that if the number of clauses is greater than f(n) then the formula does not have a satisfying truth assignment and if the number of clauses is greater than g(n) then the formula either has a unique satisfying truth assignment or no satisfying truth assignment. The interval between functions f(n) and g(n) is the natural range of the Unambiguous-SAT problem. We also provide several counting rules and an algorithm that determine the unsatisfiability of some formulas in polynomial time.
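As a reference point for such counting arguments, a brute-force check of whether a small CNF formula has zero, exactly one, or multiple satisfying assignments (exponential time, for illustration only; this is not the paper's polynomial-time algorithm):

```python
from itertools import product

def count_models(clauses, n, cap=2):
    """clauses: list of clauses; literal +i / -i means variable i true / false.
    Stops as soon as `cap` satisfying assignments are found."""
    found = 0
    for bits in product([False, True], repeat=n):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            found += 1
            if found >= cap:
                break
    return found  # 0: unsatisfiable, 1: unique model, 2: more than one

# (x1 or x2) and (not x1 or x2) and (x1 or not x2) has the unique model x1 = x2 = True.
print(count_models([[1, 2], [-1, 2], [1, -2]], n=2))  # -> 1
```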
- [391] arXiv:2307.00822 (replaced) [pdf, other]
-
Title: Space-time finite element analysis of the advection-diffusion equation using Galerkin/least-square stabilizationSubjects: Numerical Analysis (math.NA); Analysis of PDEs (math.AP)
We present a full space-time numerical solution of the advection-diffusion equation using a continuous Galerkin finite element method on conforming meshes. The Galerkin/least-square method is employed to ensure stability of the discrete variational problem. In the full space-time formulation, time is considered another dimension, and the time derivative is interpreted as an additional advection term of the field variable. We derive a priori error estimates and illustrate spatio-temporal convergence with several numerical examples. We also derive a posteriori error estimates, which coupled with adaptive space-time mesh refinement provide efficient and accurate solutions. The accuracy of the space-time solutions is illustrated against analytical solutions as well as against numerical solutions using a conventional time-marching algorithm.
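For concreteness, one standard form of the Galerkin/least-squares stabilization in this setting (our sketch of the general recipe, not necessarily the paper's exact formulation) reads: find $u_h$ such that for all test functions $v_h$,
$$(\partial_t u_h + \mathbf{a}\cdot\nabla u_h,\, v_h) + (\kappa\,\nabla u_h,\, \nabla v_h) + \sum_{K} \tau_K \big(\mathcal{L}u_h - f,\; \mathcal{L}v_h\big)_K = (f, v_h),$$
where $\mathcal{L}u = \partial_t u + \mathbf{a}\cdot\nabla u - \kappa\,\Delta u$ is the full space-time operator (with $\partial_t$ acting as advection in the time direction, as described above) and $\tau_K$ is an element-wise stabilization parameter.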
- [392] arXiv:2307.05520 (replaced) [pdf, html, other]
-
Title: How to use model architecture and training environment to estimate the energy consumption of DL trainingComments: 32 pages, 11 figures, under review in ACM Transactions on Software Engineering and Methodology (TOSEM). This work is an extension of arXiv:2307.05520v3 [cs.LG]Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Software Engineering (cs.SE)
To raise awareness of the huge impact Deep Learning (DL) has on the environment, several works have tried to estimate the energy consumption and carbon footprint of DL-based systems across their life cycle. However, the estimations for energy consumption in the training stage usually rely on assumptions that have not been thoroughly tested. This study aims to move past these assumptions by leveraging the relationship between energy consumption and two relevant design decisions in DL training: model architecture and training environment. To investigate these relationships, we collect multiple metrics related to energy efficiency and model correctness during the models' training. Then, we outline the trade-offs between the measured energy consumption and the models' correctness regarding model architecture, and their relationship with the training environment. Finally, we study the training's power consumption behavior and propose four new energy estimation methods. Our results show that selecting the proper model architecture and training environment can reduce energy consumption dramatically (up to 80.72%) at the cost of negligible decreases in correctness. Also, we find evidence that GPUs should scale with the models' computational complexity for better energy efficiency. Furthermore, we prove that current energy estimation methods are unreliable and propose alternatives that are twice as precise.
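Power-based estimation of the kind studied here can be approximated by sampling the GPU's instantaneous draw during training and integrating over time. A rough sketch using the NVML bindings (assumes an NVIDIA GPU with the pynvml package installed; the `train_step` callable is a stand-in for one training iteration):

```python
import time
import pynvml

def measure_energy_joules(train_step, n_steps):
    """Integrate sampled GPU power over a training loop.

    Uses simple rectangle integration of one power sample per step;
    adequate for coarse estimates, not for fine-grained profiling."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    energy, last = 0.0, time.time()
    for _ in range(n_steps):
        train_step()
        now = time.time()
        watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
        energy += watts * (now - last)
        last = now
    pynvml.nvmlShutdown()
    return energy  # joules; divide by 3.6e6 for kWh
```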
- [393] arXiv:2307.07693 (replaced) [pdf, html, other]
-
Title: Neural Deformable Models for 3D Bi-Ventricular Heart Shape Reconstruction and Modeling from 2D Sparse Cardiac Magnetic Resonance ImagingComments: Accepted by ICCV 2023Subjects: Computer Vision and Pattern Recognition (cs.CV)
We propose a novel neural deformable model (NDM) targeting the reconstruction and modeling of the 3D bi-ventricular shape of the heart from 2D sparse cardiac magnetic resonance (CMR) imaging data. We model the bi-ventricular shape using blended deformable superquadrics, which are parameterized by a set of geometric parameter functions and are capable of deforming globally and locally. While global geometric parameter functions and deformations capture gross shape features from visual data, local deformations, parameterized as neural diffeomorphic point flows, can be learned to recover the detailed heart anatomy. Different from iterative optimization methods used in conventional deformable model formulations, NDMs can be trained to learn such geometric parameter functions, global and local deformations from a shape distribution manifold. Our NDM can learn to densify a sparse cardiac point cloud with arbitrary scales and generate high-quality triangular meshes automatically. It also enables the implicit learning of dense correspondences among different heart shape instances for accurate cardiac shape registration. Furthermore, the parameters of NDM are intuitive, and can be used by a physician without sophisticated post-processing. Experimental results on a large CMR dataset demonstrate the improved performance of NDM over conventional methods.
- [394] arXiv:2308.11786 (replaced) [pdf, other]
-
Title: How Voice and Helpfulness Shape Perceptions in Human-Agent TeamsComments: 11 pages, 6 figuresSubjects: Human-Computer Interaction (cs.HC)
Voice assistants are increasingly prevalent, from personal devices to team environments. This study explores how voice type and contribution quality influence human-agent team performance and perceptions of anthropomorphism, animacy, intelligence, and trustworthiness. By manipulating both, we reveal mechanisms of perception and clarify ambiguity in previous work. Our results show that the human resemblance of a voice assistant's voice negatively interacts with the helpfulness of an agent's contribution to flip its effect on perceived anthropomorphism and perceived animacy. This means human teammates interpret the agent's contributions differently depending on its voice. Our study found no significant effect of voice on perceived intelligence, trustworthiness, or team performance. We find that differences in these measures are driven by manipulating the helpfulness of the agent. These findings suggest that function matters more than form when designing agents for high-performing human-agent teams, but controlling perceptions of anthropomorphism and animacy can be unpredictable even with high human resemblance.
- [395] arXiv:2309.05004 (replaced) [pdf, html, other]
-
Title: Reconstructing the kinetic chemotaxis kernel using macroscopic data: well-posedness and ill-posednessSubjects: Numerical Analysis (math.NA); Analysis of PDEs (math.AP); Optimization and Control (math.OC); Cell Behavior (q-bio.CB)
Bacterial motion is steered by external stimuli (chemotaxis), and the motion described on the mesoscopic scale is uniquely determined by a parameter $K$ that models the velocity change response of the bacteria. This parameter is called the chemotaxis kernel. In a practical setting, it is inferred from experimental data. We deploy a PDE-constrained optimization framework to perform this reconstruction using velocity-averaged, localized data taken in the interior of the domain. The problem can be well-posed or ill-posed depending on the data preparation and the experimental setup. In particular, we propose one specific design that guarantees numerical reconstructability and local convergence. This design is adapted to the discretization of $K$ in space and decouples the reconstruction of local values of $K$ into smaller cell problems, opening up parallelization opportunities. Numerical evidence supports the theoretical findings.
- [396] arXiv:2309.16181 (replaced) [pdf, html, other]
-
Title: MSF-Model: Queuing-Based Analysis and Prediction of Metastable Failures in Replicated Storage SystemsComments: Published in The 43rd International Symposium on Reliable Distributed Systems (SRDS 2024)Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Databases (cs.DB)
Metastable failure is a recent abstraction of a pattern of failures that occurs frequently in real-world distributed storage systems. In this paper, we propose a formal analysis and modeling of metastable failures in replicated storage systems. We focus on a foundational problem in distributed systems -- the problem of consensus -- to have an impact on a large class of systems. Our main contribution is the development of a queuing-based analytical model, MSF-Model, that can be used to characterize and predict metastable failures. MSF-Model integrates novel modeling concepts that allow modeling metastable failures, which were intractable to model prior to our work. We also perform real experiments to reproduce and validate our model. Comparing the outcomes of these experiments with the predictions of the queuing-based model shows that MSF-Model predicts metastable failures with high accuracy.
- [397] arXiv:2310.00013 (replaced) [pdf, html, other]
-
Title: Adaptive Communications in Collaborative Perception with Domain Alignment for Autonomous DrivingComments: Accepted by GLOBECOM'24Subjects: Artificial Intelligence (cs.AI)
Collaborative perception among multiple connected and autonomous vehicles can greatly enhance perceptive capabilities by allowing vehicles to exchange supplementary information via communications. Despite advances in previous approaches, challenges still remain due to channel variations and data heterogeneity among collaborative vehicles. To address these issues, we propose ACC-DA, a channel-aware collaborative perception framework that dynamically adjusts the communication graph and minimizes the average transmission delay while mitigating the side effects of data heterogeneity. Our novelties lie in three aspects. We first design a transmission delay minimization method, which constructs the communication graph and minimizes the transmission delay according to different channel state information. We then propose an adaptive data reconstruction mechanism, which can dynamically adjust the rate-distortion trade-off to enhance perception efficiency. Moreover, it minimizes the temporal redundancy during data transmissions. Finally, we conceive a domain alignment scheme to align the data distribution from different vehicles, which can mitigate the domain gap between different vehicles and improve the performance of the target task. Comprehensive experiments demonstrate the effectiveness of our method in comparison to existing state-of-the-art works.
- [398] arXiv:2310.06959 (replaced) [pdf, html, other]
-
Title: Proof Repair across Quotient Type EquivalencesComments: for associated code, see this https URLSubjects: Programming Languages (cs.PL)
Proofs in proof assistants like Coq can be brittle, breaking easily in response to changes. To address this, recent work introduced an algorithm and tool in Coq to automatically repair broken proofs in response to changes that correspond to type equivalences. However, many changes remained out of the scope of this algorithm and tool -- especially changes in underlying behavior. We extend this proof repair algorithm so that it can express certain changes in behavior that were previously out of scope. We focus in particular on equivalences between quotient types -- types equipped with a relation that describes what it means for any two elements of that type to be equal. Quotient type equivalences can be used to express interesting changes in representations of mathematical structures, as well as changes in the underlying implementations of data structures.
We extend this algorithm and tool to support quotient type equivalences in Coq. Notably, since Coq lacks quotient types entirely, our extensions use Coq's setoid machinery to represent quotients externally. Specifically, (1) our extension to the algorithm supports new changes corresponding to setoids, and (2) our extension to the tool supports this new class of changes and further automates away some of the new proof obligations. We ground our setoid extensions by way of a discussion of a corresponding manual proof repair approach in Cubical Agda, which supports quotient types and allows for some internalization of the correctness criteria for proof repair. We demonstrate our extensions on proof repair case studies for previously unsupported changes.
- [399] arXiv:2310.11728 (replaced) [pdf, html, other]
-
Title: EchoScan: Scanning Complex Room Geometries via Acoustic EchoesComments: 15 pages, 15 figures, 2 tablesJournal-ref: in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 4768-4782, 2024Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Accurate estimation of indoor space geometries is vital for constructing precise digital twins, whose broad industrial applications include navigation in unfamiliar environments and efficient evacuation planning, particularly in low-light conditions. This study introduces EchoScan, a deep neural network model that utilizes acoustic echoes to perform room geometry inference. Conventional sound-based techniques rely on estimating geometry-related room parameters such as wall position and room size, thereby limiting the diversity of inferable room geometries. In contrast, EchoScan overcomes this limitation by directly inferring room floorplan maps and height maps, thereby enabling it to handle rooms with complex shapes, including curved walls. The segmentation task for predicting floorplan and height maps enables the model to leverage both low- and high-order reflections. The use of high-order reflections further allows EchoScan to infer complex room shapes when some walls of the room are unobservable from the position of an audio device. Herein, EchoScan was trained and evaluated using room impulse responses (RIRs) synthesized from complex environments, including the Manhattan and Atlanta layouts, employing a practical audio device configuration compatible with commercial, off-the-shelf devices.
- [400] arXiv:2311.05810 (replaced) [pdf, other]
-
Title: Automated Lane Change via Adaptive Interactive MPC: Human-in-the-Loop ExperimentsSubjects: Systems and Control (eess.SY)
This article presents a new optimal control-based interactive motion planning algorithm for an autonomous vehicle interacting with a human-driven vehicle. The ego vehicle solves a joint optimization problem for its motion planning involving costs and coupled constraints of both vehicles and applies its own actions. The non-convex feasible region and lane discipline are handled by introducing integer decision variables and the resulting optimization problem is a mixed-integer quadratic program (MIQP) which is implemented via model predictive control (MPC). Furthermore, the ego vehicle imputes the cost of human-driven neighboring vehicle (NV) using an inverse optimal control method based on Karush-Kuhn-Tucker (KKT) conditions and adapts the joint optimization cost accordingly. We call the algorithm adaptive interactive mixed-integer MPC (aiMPC). Its interaction with human subjects driving the NV in a mandatory lane change scenario is tested in a developed software-and-human-in-the-loop simulator. Results show the effectiveness of the presented algorithm in terms of enhanced mobility of both the vehicles compared to baseline methods.
- [401] arXiv:2311.07783 (replaced) [pdf, html, other]
-
Title: Retrieving Top-k Hyperedge Triplets: Models and ApplicationsSubjects: Discrete Mathematics (cs.DM); Social and Information Networks (cs.SI); Data Analysis, Statistics and Probability (physics.data-an); Physics and Society (physics.soc-ph)
Complex systems frequently exhibit multi-way, rather than pairwise, interactions. These group interactions cannot be faithfully modeled as collections of pairwise interactions using graphs and instead require hypergraphs. However, methods that analyze hypergraphs directly, rather than via lossy graph reductions, remain limited. Hypergraph motifs hold promise in this regard, as motif patterns serve as building blocks for larger group interactions which are inexpressible by graphs. Recent work has focused on categorizing and counting hypergraph motifs based on the existence of nodes in hyperedge intersection regions. Here, we argue that the relative sizes of hyperedge intersections within motifs contain varied and valuable information. We propose a suite of efficient algorithms for finding top-k triplets of hyperedges based on optimizing the sizes of these intersection patterns. This formulation uncovers interesting local patterns of interaction, finding hyperedge triplets that either (1) are the least similar with each other, (2) have the highest pairwise but not groupwise correlation, or (3) are the most similar with each other. We formalize this as a combinatorial optimization problem and design efficient algorithms based on filtering hyperedges. Our comprehensive experimental evaluation shows that the resulting hyperedge triplets yield insightful information on real-world hypergraphs. Our approach is also orders of magnitude faster than a naive baseline implementation.
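A naive baseline of the kind the paper's filtering algorithms accelerate: enumerate all hyperedge triplets and keep the top-k under an intersection-based score. This sketch (ours, illustrative only, and cubic in the number of hyperedges) covers two of the three objectives named above:

```python
from itertools import combinations
import heapq

def topk_triplets(hyperedges, k=3, objective="least_similar"):
    """hyperedges: list of sets of node ids. Scores each triplet (A, B, C)
    from the sizes of its pairwise and groupwise intersections."""
    def score(a, b, c):
        pairwise = len(a & b) + len(a & c) + len(b & c)
        if objective == "least_similar":
            return -pairwise                    # reward small pairwise overlaps
        groupwise = len(a & b & c)
        return pairwise - 3 * groupwise         # high pairwise, low groupwise

    return heapq.nlargest(k, combinations(hyperedges, 3),
                          key=lambda t: score(*t))

H = [{1, 2, 3}, {3, 4, 5}, {5, 6, 1}, {7, 8}]
for a, b, c in topk_triplets(H, k=2):
    print(sorted(a), sorted(b), sorted(c))
```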
- [402] arXiv:2311.09016 (replaced) [pdf, html, other]
-
Title: The Chromatic Number of Kneser Hypergraphs via Consensus DivisionComments: 25 pagesSubjects: Computational Complexity (cs.CC); Discrete Mathematics (cs.DM); Algebraic Topology (math.AT); Combinatorics (math.CO)
We show that the Consensus Division theorem implies lower bounds on the chromatic number of Kneser hypergraphs, offering a novel proof for a result of Alon, Frankl, and Lovász (Trans. Amer. Math. Soc., 1986) and for its generalization by Kříž (Trans. Amer. Math. Soc., 1992). Our approach is applied to study the computational complexity of the total search problem Kneser$^p$, which given a succinct representation of a coloring of a $p$-uniform Kneser hypergraph with fewer colors than its chromatic number, asks to find a monochromatic hyperedge. We prove that for every prime $p$, the Kneser$^p$ problem with an extended access to the input coloring is efficiently reducible to a quite weak approximation of the Consensus Division problem with $p$ shares. In particular, for $p=2$, the problem is efficiently reducible to any non-trivial approximation of the Consensus Halving problem on normalized monotone functions. We further show that for every prime $p$, the Kneser$^p$ problem lies in the complexity class $\mathsf{PPA}$-$p$. As an application, we establish limitations on the complexity of the Kneser$^p$ problem, restricted to colorings with a bounded number of colors.
- [403] arXiv:2312.03720 (replaced) [pdf, other]
-
Title: Negotiating with LLMs: Prompt Hacks, Skill Gaps, and Reasoning DeficitsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large language models (LLMs) like ChatGPT have reached the 100-million-user mark in record time and might increasingly enter all areas of our lives, leading to a diverse set of interactions between these Artificial Intelligence models and humans. While many studies have discussed governance and regulations deductively from first-order principles, few studies provide an inductive, data-driven lens based on observing dialogues between humans and LLMs, especially when it comes to non-collaborative, competitive situations that have the potential to pose a serious threat to people. In this work, we conduct a user study engaging over 40 individuals across all age groups in price negotiations with an LLM. We explore how people interact with an LLM, investigating differences in negotiation outcomes and strategies. Furthermore, we highlight shortcomings of LLMs with respect to their reasoning capabilities and, in turn, their susceptibility to prompt hacking, which aims to manipulate the LLM into making agreements that are against its instructions or beyond any rationality. We also show that the negotiated prices humans manage to achieve span a broad range, which points to a literacy gap in effectively interacting with LLMs.
- [404] arXiv:2312.04767 (replaced) [pdf, html, other]
-
Title: Finite Horizon Multi-Agent Reinforcement Learning in Solving Optimal Control of State-Dependent Switched SystemsSubjects: Systems and Control (eess.SY)
In this article, a State-dependent Multi-Agent Deep Deterministic Policy Gradient (SMADDPG) method is proposed in order to learn an optimal control policy for regionally switched systems. We observe good performance of this method and explain it in rigorous mathematical language, using simplifying assumptions to motivate the ideas and apply them to canonical examples. Using reinforcement learning, the performance of the switched learning-based multi-agent method is compared with vanilla DDPG in two customized demonstrative environments with one- and two-dimensional state spaces.
- [405] arXiv:2312.14460 (replaced) [pdf, html, other]
-
Title: Quantum computing with error mitigation for data-driven computational homogenizationComments: 36 pages, 17 figuresSubjects: Computational Engineering, Finance, and Science (cs.CE)
As a crossover frontier of physics and mechanics, quantum computing is showing its great potential in computational mechanics. However, quantum hardware noise remains a critical barrier to achieving accurate simulation results due to the limitations of current hardware. In this paper, we integrate error-mitigated quantum computing into data-driven computational homogenization, where the zero-noise extrapolation (ZNE) technique is employed to improve the reliability of quantum computing. Specifically, ZNE is utilized to mitigate the quantum hardware noise in two quantum algorithms for distance calculation, namely a Swap-based algorithm and an H-based algorithm, thereby improving the overall accuracy of data-driven computational homogenization. Multiscale simulations of a 2D composite L-shaped beam and a 3D composite cylindrical shell are conducted with the quantum computer simulator Qiskit, and the results validate the effectiveness of the proposed method. We believe this work presents a promising step towards using quantum computing in computational mechanics.
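The ZNE step itself is compact: run the same circuit at artificially amplified noise levels (e.g. via gate folding) and extrapolate the expectation value back to the zero-noise limit. A sketch with Richardson-style polynomial extrapolation, where the circuit runner is a stand-in, not a Qiskit API:

```python
import numpy as np

def zero_noise_extrapolate(run_at_scale, scales=(1.0, 2.0, 3.0), degree=2):
    """run_at_scale(s): measured expectation value with noise amplified by
    factor s. Fit a polynomial in s and evaluate it at s = 0."""
    values = [run_at_scale(s) for s in scales]
    coeffs = np.polyfit(scales, values, deg=degree)
    return np.polyval(coeffs, 0.0)

# Toy noise model: true value 1.0, signal decays exponentially with scale s.
noisy = lambda s: 1.0 * np.exp(-0.15 * s) + np.random.normal(0, 0.002)
print(zero_noise_extrapolate(noisy))  # close to 1.0
```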
- [406] arXiv:2401.04487 (replaced) [pdf, other]
-
Title: Online convex optimization for robust control of constrained dynamical systemsComments: 16 pagesSubjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
This article investigates the problem of controlling linear time-invariant systems subject to time-varying and a priori unknown cost functions, state and input constraints, and exogenous disturbances. We combine the online convex optimization framework with tools from robust model predictive control to propose an algorithm that is able to guarantee robust constraint satisfaction. The performance of the closed loop emerging from application of our framework is studied in terms of its dynamic regret, which is proven to be bounded linearly by the variation of the cost functions and the magnitude of the disturbances. We corroborate our theoretical findings and illustrate implementational aspects of the proposed algorithm by a numerical case study on a tracking control problem of an autonomous vehicle.
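The online convex optimization backbone referenced here is, at its simplest, projected online gradient descent, whose regret scales with how much the cost sequence varies. A generic sketch (no constraint tightening or disturbance handling, which is where the paper's contribution lies):

```python
import numpy as np

def online_gradient_descent(grads, project, x0, eta=0.1):
    """grads: sequence of gradient oracles g_t(x), revealed one per round;
    project: Euclidean projection onto the feasible set."""
    x, xs = np.asarray(x0, dtype=float), []
    for g in grads:
        x = project(x - eta * g(x))   # step on the current cost, then project
        xs.append(x.copy())
    return xs

# Track a moving target inside the box [-1, 1]^2.
targets = [np.array([np.cos(t / 5), np.sin(t / 5)]) for t in range(50)]
grads = [lambda x, c=c: 2 * (x - c) for c in targets]   # gradient of ||x - c||^2
box = lambda x: np.clip(x, -1.0, 1.0)
iterates = online_gradient_descent(grads, box, x0=[0.0, 0.0], eta=0.3)
print(np.round(iterates[-1], 2))
```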
- [407] arXiv:2401.07576 (replaced) [pdf, html, other]
-
Title: PyTester: Deep Reinforcement Learning for Text-to-Testcase GenerationComments: 17 pages, 5 figuresSubjects: Software Engineering (cs.SE)
Test-driven development (TDD) is a widely-employed software development practice that mandates writing test cases based on requirements before writing the actual code. While writing test cases is the centerpiece of TDD, it is time-consuming, expensive, and often shunned by developers. To address these issues associated with TDD, automated test case generation approaches have recently been investigated. Such approaches take source code as input, but not the requirements. Therefore, existing work does not fully support true TDD, as actual code is required to generate test cases. In addition, current deep learning-based test case generation approaches are trained with one learning objective, i.e., to generate test cases that exactly match the ground-truth test cases. However, such approaches may limit the model's ability to generate different yet correct test cases. In this paper, we introduce PyTester, a Text-to-Testcase generation approach that can automatically generate syntactically correct, executable, complete, and effective test cases while being aligned with a given natural language requirement. We evaluate PyTester on the public APPS benchmark dataset, and the results show that our Deep RL approach enables PyTester, a small language model, to outperform much larger language models like GPT3.5, StarCoder, and InCoder. Our findings suggest that future research could consider improving small LMs rather than larger ones for better resource efficiency, by integrating SE domain knowledge into the design of the reinforcement learning architecture.
- [408] arXiv:2402.04292 (replaced) [pdf, other]
-
Title: AdaFlow: Imitation Learning with Variance-Adaptive Flow-Based PoliciesComments: NeurIPS 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Diffusion-based imitation learning improves Behavioral Cloning (BC) on multi-modal decision-making, but comes at the cost of significantly slower inference due to the recursion in the diffusion process. This motivates the design of efficient policy generators that retain the ability to generate diverse actions. To address this challenge, we propose AdaFlow, an imitation learning framework based on flow-based generative modeling. AdaFlow represents the policy with state-conditioned ordinary differential equations (ODEs), which are known as probability flows. We reveal an intriguing connection between the conditional variance of their training loss and the discretization error of the ODEs. With this insight, we propose a variance-adaptive ODE solver that can adjust its step size in the inference stage, making AdaFlow an adaptive decision-maker, offering rapid inference without sacrificing diversity. Interestingly, it automatically reduces to a one-step generator when the action distribution is uni-modal. Our comprehensive empirical evaluation shows that AdaFlow achieves high performance with fast inference speed.
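The variance-adaptive solver can be sketched in a few lines: take Euler steps along the learned probability flow, shrinking the step size wherever the predicted conditional variance (and hence the discretization error) is large. The `velocity_fn` and `variance_fn` callbacks and the step-size rule below are illustrative assumptions, not AdaFlow's exact schedule.

```python
# A hedged sketch of the variance-adaptive idea: Euler integration of a
# probability-flow ODE whose step size shrinks where the (learned) conditional
# variance is large, and collapses to a single step where it vanishes
# (a uni-modal action distribution). Both callbacks stand in for trained nets.
import numpy as np

def adaptive_flow_sample(velocity_fn, variance_fn, x0, eps=1e-3, h_max=1.0):
    x, t = np.asarray(x0, dtype=float), 0.0
    while t < 1.0:
        # Larger predicted variance -> larger discretization error -> smaller step.
        h = min(h_max / (1.0 + variance_fn(x, t) / eps), 1.0 - t)
        x = x + h * velocity_fn(x, t)   # Euler step along the probability flow
        t += h
    return x

# Toy flow toward the origin with zero variance: reduces to one big step.
sample = adaptive_flow_sample(lambda x, t: -x, lambda x, t: 0.0, x0=[1.0, -2.0])
print(sample)
```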
- [409] arXiv:2402.05349 (replaced) [pdf, html, other]
-
Title: Scrapping The Web For Early Wildfire Detection: A New Annotated Dataset of Images and Videos of Smoke Plumes In-the-wildComments: Preprint of ongoing workSubjects: Computer Vision and Pattern Recognition (cs.CV)
Early wildfire detection is of the utmost importance to enable rapid response efforts, and thus minimize the negative impacts of wildfire spreads. To this end, we present PyroNear-2024, a new dataset composed of both images and videos, allowing for the training and evaluation of smoke plume detection models, including sequential models. The data is sourced from: (i) web-scraped videos of wildfires from public networks of cameras for wildfire detection in-the-wild, (ii) videos from our in-house network of cameras, and (iii) a small portion of synthetic and real images. With around 150,000 manual annotations on 50,000 images covering 400 wildfires, PyroNear-2024 surpasses existing datasets in size and diversity, and includes data from France, Spain, and the United States. We ran cross-dataset experiments using a lightweight state-of-the-art object detection model and found that the proposed dataset is particularly challenging, with an F1 score of around 60%, but more stable than existing datasets. The video part of the dataset can be used to train a lightweight sequential model, improving global recall while maintaining precision. Finally, using it together with other public datasets helps reach higher results overall. We will make both our code and data available.
- [410] arXiv:2402.08133 (replaced) [pdf, html, other]
-
Title: Detecting Low-Degree TruncationComments: 36 pages; small correction to Theorem 3Subjects: Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS)
We consider the following basic, and very broad, statistical problem: Given a known high-dimensional distribution ${\cal D}$ over $\mathbb{R}^n$ and a collection of data points in $\mathbb{R}^n$, distinguish between the two possibilities that (i) the data was drawn from ${\cal D}$, versus (ii) the data was drawn from ${\cal D}|_S$, i.e. from ${\cal D}$ subject to truncation by an unknown truncation set $S \subseteq \mathbb{R}^n$.
We study this problem in the setting where ${\cal D}$ is a high-dimensional i.i.d. product distribution and $S$ is an unknown degree-$d$ polynomial threshold function (one of the most well-studied types of Boolean-valued function over $\mathbb{R}^n$). Our main results are an efficient algorithm when ${\cal D}$ is a hypercontractive distribution, and a matching lower bound:
$\bullet$ For any constant $d$, we give a polynomial-time algorithm which successfully distinguishes ${\cal D}$ from ${\cal D}|_S$ using $O(n^{d/2})$ samples (subject to mild technical conditions on ${\cal D}$ and $S$);
$\bullet$ Even for the simplest case of ${\cal D}$ being the uniform distribution over $\{+1, -1\}^n$, we show that for any constant $d$, any distinguishing algorithm for degree-$d$ polynomial threshold functions must use $\Omega(n^{d/2})$ samples.
- [411] arXiv:2402.12571 (replaced) [pdf, other]
-
Title: Solving fluid flow problems in space-time with multiscale stabilization: formulation and examplesSubjects: Numerical Analysis (math.NA); Analysis of PDEs (math.AP)
We present a space-time continuous-Galerkin finite element method for solving incompressible Navier-Stokes equations. To ensure stability of the discrete variational problem, we apply ideas from the variational multi-scale method. The finite element problem is posed on the "full" space-time domain, considering time as another dimension. We provide a rigorous analysis of the stability and convergence of the stabilized formulation. Finally, we apply this method to two benchmark problems in computational fluid dynamics, namely, lid-driven cavity flow and flow past a circular cylinder. We validate the current method against existing results from the literature and show that very large space-time blocks can be solved using our approach.
- [412] arXiv:2402.14989 (replaced) [pdf, html, other]
-
Title: Stable Neural Stochastic Differential Equations in Analyzing Irregular Time Series DataComments: Published at the Twelfth International Conference on Learning Representations (ICLR 2024), Spotlight presentation (Notable Top 5%). this https URLSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Irregular sampling intervals and missing values in real-world time series data present challenges for conventional methods that assume consistent intervals and complete data. Neural Ordinary Differential Equations (Neural ODEs) offer an alternative approach, utilizing neural networks combined with ODE solvers to learn continuous latent representations through parameterized vector fields. Neural Stochastic Differential Equations (Neural SDEs) extend Neural ODEs by incorporating a diffusion term, although this addition is not trivial, particularly when addressing irregular intervals and missing values. Consequently, careful design of drift and diffusion functions is crucial for maintaining stability and enhancing performance, while incautious choices can result in adverse properties such as the absence of strong solutions, stochastic destabilization, or unstable Euler discretizations, significantly affecting Neural SDEs' performance. In this study, we propose three stable classes of Neural SDEs: Langevin-type SDE, Linear Noise SDE, and Geometric SDE. Then, we rigorously demonstrate their robustness in maintaining excellent performance under distribution shift, while effectively preventing overfitting. To assess the effectiveness of our approach, we conduct extensive experiments on four benchmark datasets for interpolation, forecasting, and classification tasks, and analyze the robustness of our methods with 30 public datasets under different missing rates. Our results demonstrate the efficacy of the proposed method in handling real-world irregular time series data.
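For readers unfamiliar with Neural SDEs, the sketch below simulates a Langevin-type SDE with the Euler-Maruyama scheme on an arbitrary (possibly irregular) time grid; the mean-reverting drift and constant diffusion are simple placeholders for the learned networks, not the paper's parameterization.

```python
# A minimal sketch of simulating a Langevin-type SDE with Euler--Maruyama.
# In the paper, drift and diffusion would be neural networks; here they are
# placeholder functions to keep the sketch self-contained.
import numpy as np

def euler_maruyama(drift, diffusion, x0, t_grid, rng):
    xs = [np.asarray(x0, dtype=float)]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        dt = t1 - t0
        dw = rng.normal(scale=np.sqrt(dt), size=xs[-1].shape)  # Brownian increment
        xs.append(xs[-1] + drift(xs[-1], t0) * dt + diffusion(xs[-1], t0) * dw)
    return np.stack(xs)

rng = np.random.default_rng(0)
# Langevin-type dynamics: mean-reverting drift with constant diffusion,
# on an irregular time grid (as in irregularly sampled time series).
t_grid = np.sort(rng.uniform(0.0, 1.0, size=100))
path = euler_maruyama(lambda x, t: -x, lambda x, t: 0.5,
                      x0=[1.0], t_grid=t_grid, rng=rng)
print(path[-1])
```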
- [413] arXiv:2402.17298 (replaced) [pdf, html, other]
-
Title: ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual TasksYang Liu, Xiaomin Yu, Gongyu Zhang, Zhen Zhu, Christos Bergeles, Prokar Dasgupta, Alejandro Granados, Sebastien OurselinSubjects: Computer Vision and Pattern Recognition (cs.CV)
"A data scientist is tasked with developing a low-cost surgical VQA system for a 2-month workshop. Due to data sensitivity, she collects 50 hours of surgical video from a hospital, requiring two months for privacy approvals. Privacy restrictions prevent uploading data to platforms like ChatGPT, so she assembles one annotator and a medical expert to manually create QA pairs. This process takes three weeks and costs over $10,000. The trained model provides accurate responses within the limited data scope but lacks broader generalizability, completing the project in 3 months."
To address the challenges presented in the scenario above, in this paper we replace the image input with text for vision-language training. Inspired by prior noise-injection methods for reducing modality gaps, we introduce Adaptive ranged cosine Similarity injected noise (ArcSin). First, we introduce an innovative adaptive noise scale that generates textual elements with more variability while preserving the integrity of the original text features. Second, a similarity pool strategy is employed, expanding the domain generalization potential by broadening the overall noise scale. This dual strategy effectively broadens the scope of the original domain while safeguarding content integrity. Our empirical results demonstrate that these models closely rival those trained on images in terms of performance. Specifically, our method exhibits substantial improvements over the previous state-of-the-art, achieving gains of 1.9 and 1.1 CIDEr points in S-Cap and M-Cap, respectively. Additionally, we observe increases of 0.5 percentage points (pp), 1.4 pp, and 1.4 pp in accuracy for VQA, VQA-E, and VE, respectively, pushing the boundaries of what is achievable within the constraints of image-trained model benchmarks.
- [414] arXiv:2403.04037 (replaced) [pdf, html, other]
-
Title: OCD-FL: A Novel Communication-Efficient Peer Selection-based Decentralized Federated LearningComments: 6 pages, under review in IEEE Transactions on Vehicular Technology as a Correspondance (rev. 1)Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
The conjunction of edge intelligence and the ever-growing Internet-of-Things (IoT) network heralds a new era of collaborative machine learning, with federated learning (FL) emerging as the most prominent paradigm. With the growing interest in these learning schemes, researchers have started addressing some of their most fundamental limitations. Indeed, conventional FL with a central aggregator presents a single point of failure and a network bottleneck. To bypass this issue, decentralized FL, where nodes collaborate in a peer-to-peer network, has been proposed. Despite the latter's efficiency, communication costs and data heterogeneity remain key challenges in decentralized FL. In this context, we propose a novel scheme, called opportunistic communication-efficient decentralized federated learning, a.k.a. OCD-FL, consisting of a systematic FL peer selection for collaboration, aiming to achieve maximum FL knowledge gain while reducing energy consumption. Experimental results demonstrate the capability of OCD-FL to achieve similar or better performance than fully collaborative FL, while significantly reducing consumed energy by at least 30% and up to 80%.
- [415] arXiv:2403.05304 (replaced) [pdf, html, other]
-
Title: Spatiotemporal Predictive Pre-training for Robotic Motor ControlComments: 19 pages, 7 figures, 14 tablesSubjects: Robotics (cs.RO)
Robotic motor control necessitates the ability to predict the dynamics of environments and interaction objects. However, advanced self-supervised pre-trained visual representations for robotic motor control, leveraging large-scale egocentric videos, often focus solely on learning static content features. This neglects the crucial temporal motion cues in human video, which implicitly contain key knowledge about interacting with and manipulating environments and objects. In this paper, we present a simple yet effective visual pre-training framework for robotic motor control that jointly performs spatiotemporal prediction with dual decoders, utilizing large-scale video data, termed STP. STP adheres to two key designs in a multi-task learning manner. First, we perform spatial prediction on the masked current frame to learn content features. Second, we utilize the future frame with an extremely high masking ratio as a condition, based on the masked current frame, to conduct temporal prediction and capture motion features. The asymmetric masking and decoupled dual decoders ensure that our image representation focuses on motion information while capturing spatial details. Extensive simulation and real-world experiments demonstrate the effectiveness and generalization abilities of STP, especially in generalizing to unseen environments with more distractors. Additionally, further post-pre-training and hybrid pre-training unleash its generality and data efficiency. Our code and weights will be released for further applications.
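The asymmetric masking is straightforward to sketch: mask the current frame at a moderate ratio for spatial prediction, and mask the conditioning future frame at an extremely high ratio so it leaks only coarse motion cues. The specific ratios and ViT-style token shapes below are assumptions for illustration, not STP's exact configuration.

```python
# Illustrative sketch of asymmetric masking over patch tokens: a moderate
# ratio for the current frame, an extreme ratio for the future frame.
import numpy as np

def mask_patches(frame_tokens, mask_ratio, rng):
    """Zero out a random subset of patch tokens; return tokens and the mask."""
    n = frame_tokens.shape[0]
    n_masked = int(round(mask_ratio * n))
    idx = rng.permutation(n)[:n_masked]
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    visible = frame_tokens.copy()
    visible[mask] = 0.0
    return visible, mask

rng = np.random.default_rng(0)
tokens_cur = rng.normal(size=(196, 768))    # 14x14 patches, ViT-style dims
tokens_fut = rng.normal(size=(196, 768))
cur_visible, cur_mask = mask_patches(tokens_cur, mask_ratio=0.75, rng=rng)
fut_visible, fut_mask = mask_patches(tokens_fut, mask_ratio=0.95, rng=rng)
print(cur_mask.sum(), fut_mask.sum())       # 147 vs 186 masked tokens
```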
- [416] arXiv:2403.08936 (replaced) [pdf, html, other]
-
Title: Beyond Joint Demonstrations: Personalized Expert Guidance for Efficient Multi-Agent Reinforcement LearningPeihong Yu, Manav Mishra, Alec Koppel, Carl Busart, Priya Narayan, Dinesh Manocha, Amrit Bedi, Pratap TokekarSubjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Multi-Agent Reinforcement Learning (MARL) algorithms face the challenge of efficient exploration due to the exponential increase in the size of the joint state-action space. While demonstration-guided learning has proven beneficial in single-agent settings, its direct applicability to MARL is hindered by the practical difficulty of obtaining joint expert demonstrations. In this work, we introduce a novel concept of personalized expert demonstrations, tailored for each individual agent or, more broadly, each individual type of agent within a heterogeneous team. These demonstrations solely pertain to single-agent behaviors and how each agent can achieve personal goals without encompassing any cooperative elements, thus naively imitating them will not achieve cooperation due to potential conflicts. To this end, we propose an approach that selectively utilizes personalized expert demonstrations as guidance and allows agents to learn to cooperate, namely personalized expert-guided MARL (PegMARL). This algorithm utilizes two discriminators: the first provides incentives based on the alignment of individual agent behavior with demonstrations, and the second regulates incentives based on whether the behaviors lead to the desired outcome. We evaluate PegMARL using personalized demonstrations in both discrete and continuous environments. The results demonstrate that PegMARL learns near-optimal policies even when provided with suboptimal demonstrations and outperforms state-of-the-art MARL algorithms in solving coordinated tasks. We also showcase PegMARL's capability of leveraging joint demonstrations in the StarCraft scenario and converging effectively even with demonstrations from non-co-trained policies.
- [417] arXiv:2403.16785 (replaced) [pdf, html, other]
-
Title: Approximating maps into manifolds with lower curvature boundsSubjects: Numerical Analysis (math.NA); Differential Geometry (math.DG)
Many interesting functions arising in applications map into Riemannian manifolds. We present an algorithm, using the manifold exponential and logarithm, for approximating such functions. Our approach extends approximation techniques for functions into linear spaces in such a way that we can upper bound the forward error in terms of a lower bound on the manifold's sectional curvature. Furthermore, when the sectional curvature is nonnegative, such as for compact Lie groups, the error is guaranteed to not be worse than in the linear case.
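The recipe generalizes standard approximation via the manifold exponential and logarithm: pull sampled function values into a tangent space with the logarithm map, approximate there with ordinary linear methods, and push the result back with the exponential map. A minimal sketch on the unit sphere (with an assumed base point and plain linear interpolation in the tangent space) looks as follows.

```python
# A minimal sketch of the exp/log recipe on the unit sphere S^2. The base
# point and the tangent-space interpolation scheme are illustrative
# assumptions, not the paper's algorithm or its Julia implementation.
import numpy as np

def sphere_exp(p, v):
    n = np.linalg.norm(v)
    return p if n < 1e-12 else np.cos(n) * p + np.sin(n) * (v / n)

def sphere_log(p, q):
    w = q - np.dot(p, q) * p          # project q onto the tangent plane at p
    n = np.linalg.norm(w)
    if n < 1e-12:
        return np.zeros_like(p)
    return np.arccos(np.clip(np.dot(p, q), -1.0, 1.0)) * w / n

def f(t):  # a curve on S^2 that we only get to sample
    v = np.array([np.cos(t), np.sin(t), t])
    return v / np.linalg.norm(v)

base = f(0.0)
t0, t1, t = 0.0, 1.0, 0.4
# Linear interpolation in the tangent space at base, mapped back to the sphere.
v = (1 - t) * sphere_log(base, f(t0)) + t * sphere_log(base, f(t1))
approx = sphere_exp(base, v)
print(np.linalg.norm(approx - f(t)))        # small forward error
```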
We implement the algorithm in a Julia package this http URL and apply it to two example problems.
- [418] arXiv:2404.03307 (replaced) [pdf, html, other]
-
Title: Bi-level Trajectory Optimization on Uneven Terrains with Differentiable Wheel-Terrain Interaction ModelComments: 8 pages, 7 figures, submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Navigation of wheeled vehicles on uneven terrain necessitates going beyond the 2D approaches for trajectory planning. Specifically, it is essential to incorporate the full 6dof variation of vehicle pose and its associated stability cost in the planning process. To this end, most recent works aim to learn a neural network model to predict the vehicle evolution. However, such approaches are data-intensive and fraught with generalization issues. In this paper, we present a purely model-based approach that just requires the digital elevation information of the terrain. Specifically, we express the wheel-terrain interaction and 6dof pose prediction as a non-linear least squares (NLS) problem. As a result, trajectory planning can be viewed as a bi-level optimization. The inner optimization layer predicts the pose on the terrain along a given trajectory, while the outer layer deforms the trajectory itself to reduce the stability and kinematic costs of the pose. We improve the state-of-the-art in the following respects. First, we show that our NLS based pose prediction closely matches the output from a high-fidelity physics engine. This result coupled with the fact that we can query gradients of the NLS solver, makes our pose predictor, a differentiable wheel-terrain interaction model. We further leverage this differentiability to efficiently solve the proposed bi-level trajectory optimization problem. Finally, we perform extensive experiments, and comparison with a baseline to showcase the effectiveness of our approach in obtaining smooth, stable trajectories.
- [419] arXiv:2404.03854 (replaced) [pdf, html, other]
-
Title: Distributionally Robust Alignment for Medical Federated Vision-Language Pre-training Under Data HeterogeneitySubjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Vision-language pre-training (VLP) has emerged as an effective scheme for multimodal representation learning, but its reliance on large-scale multimodal data poses significant challenges for medical applications. Federated learning (FL) offers a promising solution to scale up the dataset for medical VLP while preserving data privacy. However, we observe that client data heterogeneity in real-world scenarios could cause models to learn biased cross-modal alignment during local pre-training. This would limit the transferability of the federated representation model to downstream tasks. To address this challenge, we propose Federated Distributionally Robust Alignment (FedDRA), a framework for federated VLP that achieves robust vision-language alignment under heterogeneous conditions. Based on client datasets, we construct a distribution family that encompasses potential test-time domains, and apply a distributionally robust framework to optimize the pre-trained model's performance across this distribution space. This approach bridges the gap between pre-training samples and downstream applications. To avoid over-fitting on client-specific information, we use anchor representation from the global model to guide the local training, and adopt a two-stage approach to first tune deeper layers before updating the entire network. Extensive experiments on real-world datasets demonstrate FedDRA's effectiveness in enhancing medical federated VLP under data heterogeneity. Our method also adapts well to various medical pre-training methods.
- [420] arXiv:2404.04507 (replaced) [pdf, html, other]
-
Title: Irrational-window-filter projection method and application to quasiperiodic Schrödinger eigenproblemsSubjects: Numerical Analysis (math.NA)
In this paper, we propose a new algorithm, the irrational-window-filter projection method (IWFPM), for quasiperiodic systems with concentrated spectral point distribution. Based on the projection method (PM), IWFPM filters out dominant spectral points by defining an irrational window and uses a corresponding index-shift transform to make the FFT available. The error analysis on the function approximation level is also given. We apply IWFPM to 1D, 2D, and 3D quasiperiodic Schrödinger eigenproblems (QSEs) to demonstrate its accuracy and efficiency. IWFPM exhibits a significant computational advantage over PM for both extended and localized quantum states. More importantly, by using IWFPM, the existence of Anderson localization in 2D and 3D QSEs is numerically verified.
- [421] arXiv:2404.04552 (replaced) [pdf, other]
-
Title: Fast and Simple Sorting Using Partial InformationComments: To appear at SODA 2025Subjects: Data Structures and Algorithms (cs.DS)
We consider the problem of sorting $n$ items, given the outcomes of $m$ pre-existing comparisons. We present a simple and natural deterministic algorithm that runs in $O(m+\log T)$ time and does $O(\log T)$ comparisons, where $T$ is the number of total orders consistent with the pre-existing comparisons.
Our running time and comparison bounds are best possible up to constant factors, thus resolving a problem that has been studied intensely since 1976 (Fredman, Theoretical Computer Science). The best previous algorithm with a bound of $O(\log T)$ on the number of comparisons has a time bound of $O(n^{2.5})$ and is more complicated.
Our algorithm combines three classic algorithms: topological sort, heapsort with the right kind of heap, and efficient search in a sorted list. It outputs the items in sorted order one by one. It can be modified to stop early, thereby solving the important and more general top-$k$ sorting problem: Given $k$ and the outcomes of some pre-existing comparisons, output the smallest $k$ items in sorted order. The modified algorithm solves the top-$k$ sorting problem in minimum time and comparisons, to within constant factors.
- [422] arXiv:2404.09699 (replaced) [pdf, html, other]
-
Title: Generative AI for Game Theory-based Mobile NetworkingSubjects: Computer Science and Game Theory (cs.GT)
With the continuous advancement of network technology, various emerging complex networking optimization problems have created a wide range of applications utilizing game theory. However, since game theory is a mathematical framework, game theory-based solutions often rely heavily on the experience and knowledge of human experts. Recently, the remarkable advantages exhibited by generative artificial intelligence (GAI) have gained widespread attention. In this work, we propose a novel GAI-enabled game theory solution that brings the powerful reasoning and generation capabilities of GAI to the design and optimization of mobile networking. Specifically, we first outline game theory and the key technologies of GAI, and explore the advantages of combining GAI with game theory. Then, we review the contributions and limitations of existing research and demonstrate the potential value of applying GAI to game theory in mobile networking. Subsequently, we develop a large language model (LLM)-enabled game theory framework to realize this combination, and demonstrate the effectiveness of the proposed framework through a case study in secured UAV networks. Finally, we provide several directions for future extensions.
- [423] arXiv:2404.11788 (replaced) [pdf, html, other]
-
Title: NonGEMM Bench: Understanding the Performance Horizon of the Latest ML Workloads with NonGEMM WorkloadsSubjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Performance (cs.PF)
Machine Learning (ML) operators are the building blocks for designing ML models with various target applications. GEneral Matrix Multiplication (GEMM) operators are the backbone of ML models. They are notorious for being computationally expensive, requiring billions of multiply-and-accumulate operations. Therefore, significant effort has been put into studying and optimizing GEMM operators in order to speed up the execution of ML models. GPUs and accelerators are widely deployed to accelerate ML workloads by optimizing the execution of GEMM operators. Nonetheless, the performance of NonGEMM operators has not been studied as thoroughly as that of GEMMs. Therefore, this paper describes NonGEMM Bench, a benchmark to study NonGEMM operators. We first construct NonGEMM Bench using popular ML workloads from different domains, then perform case studies on GPU platforms of various grades to analyze the behavior of NonGEMM operators in GPU-accelerated systems. Finally, we present some key takeaways to bridge the gap between GEMM and NonGEMM operators and to offer the community potential new optimization directions.
- [424] arXiv:2404.12389 (replaced) [pdf, html, other]
-
Title: Moving Object Segmentation: All You Need Is SAM (and Flow)Comments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
The objective of this paper is motion segmentation -- discovering and segmenting the moving objects in a video. This is a much studied area with numerous careful, and sometimes complex, approaches and training schemes including: self-supervised learning, learning from synthetic datasets, object-centric representations, amodal representations, and many more. Our interest in this paper is to determine if the Segment Anything model (SAM) can contribute to this task. We investigate two models for combining SAM with optical flow that harness the segmentation power of SAM with the ability of flow to discover and group moving objects. In the first model, we adapt SAM to take optical flow, rather than RGB, as an input. In the second, SAM takes RGB as an input, and flow is used as a segmentation prompt. These surprisingly simple methods, without any further modifications, outperform all previous approaches by a considerable margin in both single and multi-object benchmarks. We also extend these frame-level segmentations to sequence-level segmentations that maintain object identity. Again, this simple model achieves outstanding performance across multiple moving object segmentation benchmarks.
- [425] arXiv:2404.13736 (replaced) [pdf, html, other]
-
Title: Interval Abstractions for Robust Counterfactual ExplanationsComments: Published in Artificial Intelligence JournalSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Counterfactual Explanations (CEs) have emerged as a major paradigm in explainable AI research, providing recourse recommendations for users affected by the decisions of machine learning models. However, CEs found by existing methods often become invalid when slight changes occur in the parameters of the model they were generated for. The literature lacks a way to provide exhaustive robustness guarantees for CEs under model changes, in that existing methods to improve CEs' robustness are mostly heuristic, and the robustness performances are evaluated empirically using only a limited number of retrained models. To bridge this gap, we propose a novel interval abstraction technique for parametric machine learning models, which allows us to obtain provable robustness guarantees for CEs under a possibly infinite set of plausible model changes $\Delta$. Based on this idea, we formalise a robustness notion for CEs, which we call $\Delta$-robustness, in both binary and multi-class classification settings. We present procedures to verify $\Delta$-robustness based on Mixed Integer Linear Programming, using which we further propose algorithms to generate CEs that are $\Delta$-robust. In an extensive empirical study involving neural networks and logistic regression models, we demonstrate the practical applicability of our approach. We discuss two strategies for determining the appropriate hyperparameters in our method, and we quantitatively benchmark CEs generated by eleven methods, highlighting the effectiveness of our algorithms in finding robust CEs.
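The interval-abstraction idea can be illustrated on the simplest model class: bound each parameter of a linear (logistic-regression) model in an interval and check whether a counterfactual stays valid for every model in that interval. The closed-form worst-case bound below covers only this linear case; the paper's MILP-based procedures handle neural networks as well, and the weights and tolerance here are illustrative.

```python
# A hedged sketch of a Delta-robustness check for a linear model: the
# counterfactual x_cf is robustly valid only if the score stays positive for
# every weight vector within an L-infinity ball of radius delta.
import numpy as np

def delta_robust_linear(x, w, b, delta):
    """True if w'.x + b' > 0 for all |w' - w|_inf <= delta, |b' - b| <= delta."""
    # Worst case: every parameter shifts by delta against the sign of its feature.
    worst_logit = w @ x + b - delta * (np.sum(np.abs(x)) + 1.0)
    return worst_logit > 0.0

x_cf = np.array([1.2, -0.3])          # candidate counterfactual explanation
w, b = np.array([0.8, 0.5]), -0.2
print(delta_robust_linear(x_cf, w, b, delta=0.05))   # True: robustly valid
print(delta_robust_linear(x_cf, w, b, delta=0.50))   # False under larger shifts
```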
- [426] arXiv:2404.14117 (replaced) [pdf, html, other]
-
Title: Hierarchical localization with panoramic views and triplet loss functionsSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
The main objective of this paper is to tackle visual localization, which is essential for the safe navigation of mobile robots. The solution we propose employs panoramic images and triplet convolutional neural networks. We seek to exploit the properties of such architectures to address both hierarchical and global localization in indoor environments, which are prone to visual aliasing and other phenomena. Considering their importance in these architectures, a complete comparative evaluation of different triplet loss functions is performed. The experimental section proves that triplet networks can be trained with a relatively low number of images captured under a specific lighting condition and even so, the resulting networks are a robust tool to perform visual localization under dynamic conditions. Our approach has been evaluated against some of these effects, such as changes in the lighting conditions, occlusions, noise and motion blurring. Furthermore, to explore the limits of our approach, triplet networks have been tested in different indoor environments simultaneously. In all the cases, these architectures have demonstrated a great capability to generalize to diverse and challenging scenarios. The code used in the experiments is available at this https URL.
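As a point of reference, the standard hinge formulation, presumably among the triplet loss variants compared, is easy to state in code; the margin and toy embeddings below are illustrative.

```python
# A minimal sketch of the hinge-form triplet loss: pull the positive sample
# closer to the anchor than the negative, by at least a margin.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.1, 0.9])   # embedding of an anchor panorama
p = np.array([0.2, 0.8])   # same place, different lighting
n = np.array([0.9, 0.1])   # a different room
print(triplet_loss(a, p, n))
```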
- [427] arXiv:2404.16362 (replaced) [pdf, html, other]
-
Title: Feature graph construction with static features for malware detectionSubjects: Cryptography and Security (cs.CR)
Malware can greatly compromise the integrity and trustworthiness of information and is in a constant state of evolution. Existing feature fusion-based detection methods generally overlook the correlation between features, and mere concatenation of features reduces the model's characterization ability, leading to low detection accuracy. Moreover, these methods are susceptible to concept drift and significant model degradation. To address these challenges, we introduce a feature graph-based malware detection method, MFGraph, which characterizes applications by learning feature-to-feature relationships to achieve improved detection accuracy while mitigating the impact of concept drift. In MFGraph, we construct a feature graph using static features extracted from binary PE files, then apply a deep graph convolutional network to learn the representation of the feature graph. Finally, we employ the representation vectors obtained from the output of a three-layer perceptron to differentiate between benign and malicious software. We evaluated our method on the EMBER dataset, and the experimental results demonstrate that it achieves an AUC score of 0.98756 on the malware detection task, outperforming other baseline models. Furthermore, the AUC score of MFGraph decreases by only 5.884% in one year, indicating that it is the least affected by concept drift.
- [428] arXiv:2404.17916 (replaced) [pdf, html, other]
-
Title: FedCRL: Personalized Federated Learning with Contrastive Shared Representations for Label Heterogeneity in Non-IID DataSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Heterogeneity resulting from label distribution skew and data scarcity can lead to inaccuracy and unfairness in intelligent communication applications that mainly rely on distributed computing. To address this, this paper proposes a novel personalized federated learning algorithm, named Federated Contrastive Shareable Representations (FedCoSR), to facilitate knowledge sharing among clients while maintaining data privacy. Specifically, parameters of local models' shallow layers and typical local representations are both considered shareable information for the server and aggregated globally. To address poor performance caused by label distribution skew among clients, contrastive learning is adopted between local and global representations to enrich local knowledge. Additionally, to ensure fairness for clients with scarce data, FedCoSR introduces adaptive local aggregation to coordinate the global model involvement in each client. Our simulations demonstrate FedCoSR's effectiveness in mitigating label heterogeneity by achieving accuracy and fairness improvements over existing methods on datasets with varying degrees of label heterogeneity.
- [429] arXiv:2405.01769 (replaced) [pdf, html, other]
-
Title: A Survey on Large Language Models for Critical Societal Domains: Finance, Healthcare, and LawZhiyu Zoey Chen, Jing Ma, Xinlu Zhang, Nan Hao, An Yan, Armineh Nourbakhsh, Xianjun Yang, Julian McAuley, Linda Petzold, William Yang WangComments: TMLR 2024Subjects: Computation and Language (cs.CL)
In the fast-evolving domain of artificial intelligence, large language models (LLMs) such as GPT-3 and GPT-4 are revolutionizing the landscapes of finance, healthcare, and law: domains characterized by their reliance on professional expertise, challenging data acquisition, high-stakes, and stringent regulatory compliance. This survey offers a detailed exploration of the methodologies, applications, challenges, and forward-looking opportunities of LLMs within these high-stakes sectors. We highlight the instrumental role of LLMs in enhancing diagnostic and treatment methodologies in healthcare, innovating financial analytics, and refining legal interpretation and compliance strategies. Moreover, we critically examine the ethics for LLM applications in these fields, pointing out the existing ethical concerns and the need for transparent, fair, and robust AI systems that respect regulatory norms. By presenting a thorough review of current literature and practical applications, we showcase the transformative impact of LLMs, and outline the imperative for interdisciplinary cooperation, methodological advancements, and ethical vigilance. Through this lens, we aim to spark dialogue and inspire future research dedicated to maximizing the benefits of LLMs while mitigating their risks in these precision-dependent sectors. To facilitate future research on LLMs in these critical societal domains, we also initiate a reading list that tracks the latest advancements under this topic, which will be continually updated: \url{this https URL}.
- [430] arXiv:2405.04370 (replaced) [pdf, html, other]
-
Title: Diff-IP2D: Diffusion-Based Hand-Object Interaction Prediction on Egocentric VideosSubjects: Computer Vision and Pattern Recognition (cs.CV)
Understanding how humans would behave during hand-object interaction is vital for applications in service robot manipulation and extended reality. To achieve this, some recent works have been proposed to simultaneously forecast hand trajectories and object affordances on human egocentric videos. The joint prediction serves as a comprehensive representation of future hand-object interactions in 2D space, indicating potential human motion and motivation. However, the existing approaches mostly adopt the autoregressive paradigm for unidirectional prediction, which lacks mutual constraints within the holistic future sequence, and accumulates errors along the time axis. Meanwhile, these works largely overlook the effect of camera egomotion on first-person view predictions. To address these limitations, we propose a novel diffusion-based interaction prediction method, namely Diff-IP2D, to forecast future hand trajectories and object affordances concurrently in an iterative non-autoregressive manner. We transform the sequential 2D images into latent feature space and design a denoising diffusion model to predict future latent interaction features conditioned on past ones. Motion features are further integrated into the conditional denoising process to make Diff-IP2D aware of the camera wearer's dynamics for more accurate interaction prediction. Extensive experiments demonstrate that our method significantly outperforms the state-of-the-art baselines on both the off-the-shelf metrics and our newly proposed evaluation protocol. This highlights the efficacy of leveraging a generative paradigm for 2D hand-object interaction prediction. The code of Diff-IP2D is released as open source at this https URL.
- [431] arXiv:2405.05966 (replaced) [pdf, html, other]
-
Title: Natural Language Processing RELIES on LinguisticsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have become capable of generating highly fluent text in certain languages, without modules specially designed to capture grammar or semantic coherence. What does this mean for the future of linguistic expertise in NLP? We highlight several aspects in which NLP (still) relies on linguistics, or where linguistic thinking can illuminate new directions. We argue our case around the acronym RELIES that encapsulates six major facets where linguistics contributes to NLP: Resources, Evaluation, Low-resource settings, Interpretability, Explanation, and the Study of language. This list is not exhaustive, nor is linguistics the main point of reference for every effort under these themes; but at a macro level, these facets highlight the enduring importance of studying machine systems vis-à-vis systems of human language.
- [432] arXiv:2405.06058 (replaced) [pdf, html, other]
-
Title: Large Language Models Show Human-like Social Desirability Biases in Survey ResponsesAadesh Salecha, Molly E. Ireland, Shashanka Subrahmanya, João Sedoc, Lyle H. Ungar, Johannes C. EichstaedtComments: 3 pages, 2 figures, accepted at PNAS NexusSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
As Large Language Models (LLMs) become widely used to model and simulate human behavior, understanding their biases becomes critical. We developed an experimental framework using Big Five personality surveys and uncovered a previously undetected social desirability bias in a wide range of LLMs. By systematically varying the number of questions LLMs were exposed to, we demonstrate their ability to infer when they are being evaluated. When personality evaluation is inferred, LLMs skew their scores towards the desirable ends of trait dimensions (i.e., increased extraversion, decreased neuroticism, etc.). This bias exists in all tested models, including GPT-4/3.5, Claude 3, Llama 3, and PaLM-2. Bias levels appear to increase in more recent models, with GPT-4's survey responses changing by 1.20 (human) standard deviations and Llama 3's by 0.98 standard deviations, which are very large effects. This bias is robust to randomization of question order and paraphrasing. Reverse-coding all the questions decreases bias levels but does not eliminate them, suggesting that this effect cannot be attributed to acquiescence bias. Our findings reveal an emergent social desirability bias and suggest constraints on profiling LLMs with psychometric tests and on using LLMs as proxies for human participants.
- [433] arXiv:2405.08363 (replaced) [pdf, html, other]
-
Title: UnMarker: A Universal Attack on Defensive Image WatermarkingComments: To appear at IEEE S&P 2025Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Reports regarding the misuse of Generative AI (GenAI) to create deepfakes are frequent. Defensive watermarking enables GenAI providers to hide fingerprints in their images and use them later for deepfake detection. Yet, its potential has not been fully explored. We present UnMarker -- the first practical universal attack on defensive watermarking. Unlike existing attacks, UnMarker requires no detector feedback, no unrealistic knowledge of the watermarking scheme or similar models, and no advanced denoising pipelines that may not be available. Instead, being the product of an in-depth analysis of the watermarking paradigm revealing that robust schemes must construct their watermarks in the spectral amplitudes, UnMarker employs two novel adversarial optimizations to disrupt the spectra of watermarked images, erasing the watermarks. Evaluations against SOTA schemes prove UnMarker's effectiveness. It not only defeats traditional schemes while retaining superior quality compared to existing attacks but also breaks semantic watermarks that alter an image's structure, reducing the best detection rate to $43\%$ and rendering them useless. To our knowledge, UnMarker is the first practical attack on semantic watermarks, which have been deemed the future of defensive watermarking. Our findings show that defensive watermarking is not a viable defense against deepfakes, and we urge the community to explore alternatives.
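UnMarker's adversarial optimizations are not reproduced here, but the underlying lever, perturbing spectral amplitudes while preserving phase, can be sketched with a plain FFT; the bounded random noise below is an illustrative stand-in for the paper's optimized distortions.

```python
# Illustrative sketch, not UnMarker's attack: perturb the magnitude spectrum
# of an image (where the paper argues robust watermarks must live) while
# keeping the phase, then reconstruct the image.
import numpy as np

def perturb_spectral_amplitudes(img, strength=0.05, seed=0):
    rng = np.random.default_rng(seed)
    spec = np.fft.fft2(img)
    mag, phase = np.abs(spec), np.angle(spec)
    mag = mag * (1.0 + strength * rng.uniform(-1.0, 1.0, size=mag.shape))
    out = np.fft.ifft2(mag * np.exp(1j * phase)).real
    return np.clip(out, 0.0, 1.0)

img = np.random.default_rng(1).random((64, 64))   # stand-in grayscale image
print(np.abs(perturb_spectral_amplitudes(img) - img).max())
```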
- [434] arXiv:2405.13278 (replaced) [pdf, html, other]
-
Title: Single color digital H&E staining with In-and-Out NetMengkun Chen, Yen-Tung Liu, Fadeel Sher Khan, Matthew C. Fox, Jason S. Reichenberg, Fabiana C.P.S. Lopes, Katherine R. Sebastian, Mia K. Markey, James W. TunnellJournal-ref: Computerized Medical Imaging and Graphics, volume = {118}, pages = {102468}, year = {2024}, issn = {0895-6111},Subjects: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Virtual staining streamlines traditional staining procedures by digitally generating stained images from unstained or differently stained images. While conventional staining methods involve time-consuming chemical processes, virtual staining offers an efficient and low infrastructure alternative. Leveraging microscopy-based techniques, such as confocal microscopy, researchers can expedite tissue analysis without the need for physical sectioning. However, interpreting grayscale or pseudo-color microscopic images remains a challenge for pathologists and surgeons accustomed to traditional histologically stained images. To fill this gap, various studies explore digitally simulating staining to mimic targeted histological stains. This paper introduces a novel network, In-and-Out Net, specifically designed for virtual staining tasks. Based on Generative Adversarial Networks (GAN), our model efficiently transforms Reflectance Confocal Microscopy (RCM) images into Hematoxylin and Eosin (H&E) stained images. We enhance nuclei contrast in RCM images using aluminum chloride preprocessing for skin tissues. Training the model with virtual H&E labels featuring two fluorescence channels eliminates the need for image registration and provides pixel-level ground truth. Our contributions include proposing an optimal training strategy, conducting a comparative analysis demonstrating state-of-the-art performance, validating the model through an ablation study, and collecting perfectly matched input and ground truth images without registration. In-and-Out Net showcases promising results, offering a valuable tool for virtual staining tasks and advancing the field of histological image analysis.
- [435] arXiv:2405.15549 (replaced) [pdf, html, other]
-
Title: SEP: Self-Enhanced Prompt Tuning for Visual-Language ModelSubjects: Computer Vision and Pattern Recognition (cs.CV)
Prompt tuning based on Context Optimization (CoOp) effectively adapts visual-language models (VLMs) to downstream tasks by inferring additional learnable prompt tokens. However, these tokens are less discriminative as they are independent of the pre-trained tokens and fail to capture input-specific knowledge, such as class-aware textual or instance-aware visual knowledge. Leveraging the discriminative and generalization capabilities inherent in pre-trained tokens, we introduce a novel approach named Self-Enhanced Prompt Tuning (SEP). The core principle of SEP involves adapting the learnable prompt tokens at each encoder layer from the corresponding self-pretrained tokens, thereby explicitly incorporating discriminative prior knowledge to enhance both textual-level and visual-level embeddings. Furthermore, SEP's self-enhanced tokens not only boost discrimination but also mitigate domain shifts in unseen domains, enhancing generalization. In practice, SEP selects several representative tokens from all pre-trained tokens for each input data at every layer of the text/visual encoders. Subsequently, a Token Fusion Module (TFM) is introduced to generate a self-enhanced token by merging these representative tokens with the learnable tokens using a cross-attention mechanism. This self-enhanced token is then concatenated with all pre-trained tokens, serving as input for subsequent encoder layers to produce the relevant embeddings. Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning. Code: this https URL.
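The cross-attention fusion step can be sketched compactly: learnable prompt tokens query a few representative pre-trained tokens and are merged with the result. The single attention layer, residual merge, and dimensions below are illustrative assumptions rather than SEP's exact Token Fusion Module.

```python
# A hedged sketch of cross-attention token fusion: learnable prompts (queries)
# attend over representative pre-trained tokens (keys/values) to absorb
# discriminative prior knowledge.
import torch
import torch.nn as nn

dim, n_prompt, n_repr = 512, 4, 8
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

prompt_tokens = torch.randn(1, n_prompt, dim)   # learnable prompt tokens
repr_tokens = torch.randn(1, n_repr, dim)       # selected pre-trained tokens
# Cross-attention: prompts query the representative tokens.
fused, _ = attn(prompt_tokens, repr_tokens, repr_tokens)
self_enhanced = prompt_tokens + fused           # residual merge (an assumption)
print(self_enhanced.shape)                      # torch.Size([1, 4, 512])
```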
- [436] arXiv:2405.16158 (replaced) [pdf, html, other]
-
Title: Bigger, Regularized, Optimistic: scaling for compute and sample-efficient continuous controlComments: NeurIPS 2024 SpotlightSubjects: Machine Learning (cs.LG)
Sample efficiency in Reinforcement Learning (RL) has traditionally been driven by algorithmic enhancements. In this work, we demonstrate that scaling can also lead to substantial improvements. We conduct a thorough investigation into the interplay of scaling model capacity and domain-specific RL enhancements. These empirical findings inform the design choices underlying our proposed BRO (Bigger, Regularized, Optimistic) algorithm. The key innovation behind BRO is that strong regularization allows for effective scaling of the critic networks, which, paired with optimistic exploration, leads to superior performance. BRO achieves state-of-the-art results, significantly outperforming the leading model-based and model-free algorithms across 40 complex tasks from the DeepMind Control, MetaWorld, and MyoSuite benchmarks. BRO is the first model-free algorithm to achieve near-optimal policies in the notoriously challenging Dog and Humanoid tasks.
- [437] arXiv:2405.19544 (replaced) [pdf, html, other]
-
Title: One-Shot Safety Alignment for Large Language Models via Optimal DualizationComments: 32 pages, 6 figures, 8 tablesSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
The growing safety concerns surrounding large language models raise an urgent need to align them with diverse human preferences to simultaneously enhance their helpfulness and safety. A promising approach is to enforce safety constraints through Reinforcement Learning from Human Feedback (RLHF). For such constrained RLHF, typical Lagrangian-based primal-dual policy optimization methods are computationally expensive and often unstable. This paper presents a perspective of dualization that reduces constrained alignment to an equivalent unconstrained alignment problem. We do so by pre-optimizing a smooth and convex dual function that has a closed form. This shortcut eliminates the need for cumbersome primal-dual policy iterations, greatly reducing the computational burden and improving training stability. Our strategy leads to two practical algorithms in model-based and preference-based settings (MoCAN and PeCAN, respectively). A broad range of experiments demonstrate the effectiveness and merits of our algorithms.
- [438] arXiv:2406.00627 (replaced) [pdf, other]
-
Title: Prompt Framework for Role-playing: Generation and EvaluationSubjects: Computation and Language (cs.CL)
Large language models (LLMs) exhibit impressive proficiency in natural language generation, understanding user instructions, and emulating human-like language use, which has led to significant interest in their application to role-playing scenarios. However, the manual collection of role-specific script data and the evaluation of model performance are resource-intensive processes. This project introduces a prompt-based framework designed to leverage GPT's capabilities for the generation of role-playing dialogue datasets and the evaluation of role-playing performance. To validate the effectiveness of the GPT-based generation and evaluation, we further incorporate the recall-oriented Rouge-L metric, providing an additional quantitative measure of performance.
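The recall-oriented Rouge-L metric mentioned here reduces to a longest-common-subsequence computation over tokens; a minimal sketch follows.

```python
# A minimal sketch of recall-oriented Rouge-L: the length of the longest
# common subsequence between reference and candidate, divided by the
# reference length.
def lcs_len(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l_recall(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    return lcs_len(ref, cand) / len(ref)

print(rouge_l_recall("the knight guards the castle gate",
                     "a knight guards the gate"))   # 4/6 ~= 0.667
```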
- [439] arXiv:2406.01294 (replaced) [pdf, html, other]
-
Title: CE-VAE: Capsule Enhanced Variational AutoEncoder for Underwater Image EnhancementComments: Accepted for publication at IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Unmanned underwater image analysis for marine monitoring faces two key challenges: (i) degraded image quality due to light attenuation and (ii) hardware storage constraints limiting high-resolution image collection. Existing methods primarily address image enhancement with approaches that hinge on storing the full-size input. In contrast, we introduce the Capsule Enhanced Variational AutoEncoder (CE-VAE), a novel architecture designed to efficiently compress and enhance degraded underwater images. Our attention-aware image encoder can project the input image onto a latent space representation while being able to run online on a remote device. The only information that needs to be stored on the device or sent to a beacon is a compressed representation. There is a dual-decoder module that performs offline, full-size enhanced image generation. One branch reconstructs spatial details from the compressed latent space, while the second branch utilizes a capsule-clustering layer to capture entity-level structures and complex spatial relationships. This parallel decoding strategy enables the model to balance fine-detail preservation with context-aware enhancements. CE-VAE achieves state-of-the-art performance in underwater image enhancement on six benchmark datasets, providing up to 3x higher compression efficiency than existing approaches. Code available at \url{this https URL}.
- [440] arXiv:2406.01341 (replaced) [pdf, other]
-
Title: Important node identification for complex networks based on improved Electre Multi-Attribute fusionComments: Due to changes in authorship and substantial updates to the content, this manuscript is no longer validSubjects: Social and Information Networks (cs.SI)
The influence maximization problem involves selecting a subset of seed nodes within a social network to maximize information spread under a given diffusion model; identifying the important nodes is therefore the problem considered in this paper. Because real-world networks differ greatly, a class of multi-attribute decision fusion methods is often used to solve this problem. Electre is mostly used in economics for problems such as project investment ordering, benefit analysis, and risk assessment; it supports decision makers by comparing the differences between a set of alternatives. In this paper, we propose a multi-attribute decision fusion method named SK-E, which constructs local and global metrics for different networks, uses an improved Electre to perform decision fusion between the local and global metrics of nodes to obtain the optimal weight between them, and then identifies the important nodes. The proposed method demonstrates superior accuracy compared to other methods, as evaluated through three experiments: the SIR epidemic model, the independent cascade model, and constraint efficiency. These experiments were conducted across six different real networks selected as the experimental dataset.
- [441] arXiv:2406.01525 (replaced) [pdf, html, other]
-
Title: Polynomial Bounds of CFLOBDDs against BDDsXusheng Zhi (University of Wisconsin-Madison and Peking University), Thomas Reps (University of Wisconsin-Madison)Subjects: Symbolic Computation (cs.SC); Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS); Formal Languages and Automata Theory (cs.FL)
Binary Decision Diagrams (BDDs) are widely used for the representation of Boolean functions. Context-Free-Language Ordered Decision Diagrams (CFLOBDDs) are a plug-compatible replacement for BDDs -- roughly, they are BDDs augmented with a certain form of procedure call. A natural question to ask is, ``For a given family of Boolean functions $F$, what is the relationship between the size of a BDD for $f \in F$ and the size of a CFLOBDD for $f$?'' Sistla et al. established that there are best-case families of functions, which demonstrate an inherently exponential separation between CFLOBDDs and BDDs. They showed that there are families of functions $\{ f_n \}$ for which, for all $n = 2^k$, the CFLOBDD for $f_n$ (using a particular variable order) is exponentially more succinct than any BDD for $f_n$ (i.e., using any variable order). However, they did not give a worst-case bound -- i.e., they left open the question, ``Is there a family of functions $\{ g_i \}$ for which the size of a CFLOBDD for $g_i$ must be substantially larger than a BDD for $g_i$?'' For instance, it could be that there is a family of functions for which the BDDs are exponentially more succinct than any corresponding CFLOBDDs.
This paper studies such questions, and answers the second question posed above in the negative. In particular, we show that by using the same variable ordering in the CFLOBDD that is used in the BDD, the size of a CFLOBDD for any function $h$ cannot be far worse than the size of the BDD for $h$. The bound that relates their sizes is polynomial: If BDD $B$ for function $h$ is of size $|B|$ and uses variable ordering $\textit{Ord}$, then the size of the CFLOBDD $C$ for $h$ that also uses $\textit{Ord}$ is bounded by $O(|B|^3)$.
The paper also shows that the bound is tight: there is a family of functions for which $|C|$ grows as $\Omega(|B|^3)$.
- [442] arXiv:2406.01593 (replaced) [pdf, html, other]
-
Title: MaGS: Reconstructing and Simulating Dynamic 3D Objects with Mesh-adsorbed Gaussian SplattingComments: Project Page: see this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
3D reconstruction and simulation, although interrelated, have distinct objectives: reconstruction requires a flexible 3D representation that can adapt to diverse scenes, while simulation needs a structured representation to model motion principles effectively. This paper introduces the Mesh-adsorbed Gaussian Splatting (MaGS) method to address this challenge. MaGS constrains 3D Gaussians to roam near the mesh, creating a mutually adsorbed mesh-Gaussian 3D representation. Such representation harnesses both the rendering flexibility of 3D Gaussians and the structured property of meshes. To achieve this, we introduce RMD-Net, a network that learns motion priors from video data to refine mesh deformations, alongside RGD-Net, which models the relative displacement between the mesh and Gaussians to enhance rendering fidelity under mesh constraints. To generalize to novel, user-defined deformations beyond input video without reliance on temporal data, we propose MPE-Net, which leverages inherent mesh information to bootstrap RMD-Net and RGD-Net. Due to the universality of meshes, MaGS is compatible with various deformation priors such as ARAP, SMPL, and soft physics simulation. Extensive experiments on the D-NeRF, DG-Mesh, and PeopleSnapshot datasets demonstrate that MaGS achieves state-of-the-art performance in both reconstruction and simulation.
- [443] arXiv:2406.07329 (replaced) [pdf, html, other]
-
Title: Cinematic Gaussians: Real-Time HDR Radiance Fields with Depth of FieldChao Wang, Krzysztof Wolski, Bernhard Kerbl, Ana Serrano, Mojtaba Bemana, Hans-Peter Seidel, Karol Myszkowski, Thomas LeimkühlerSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Radiance field methods represent the state of the art in reconstructing complex scenes from multi-view photos. However, these reconstructions often suffer from one or both of the following limitations: First, they typically represent scenes in low dynamic range (LDR), which restricts their use to evenly lit environments and hinders immersive viewing experiences. Secondly, their reliance on a pinhole camera model, assuming all scene elements are in focus in the input images, presents practical challenges and complicates refocusing during novel-view synthesis. Addressing these limitations, we present a lightweight method based on 3D Gaussian Splatting that utilizes multi-view LDR images of a scene with varying exposure times, apertures, and focus distances as input to reconstruct a high-dynamic-range (HDR) radiance field. By incorporating analytical convolutions of Gaussians based on a thin-lens camera model as well as a tonemapping module, our reconstructions enable the rendering of HDR content with flexible refocusing capabilities. We demonstrate that our combined treatment of HDR and depth of field facilitates real-time cinematic rendering, outperforming the state of the art.
- [444] arXiv:2406.07966 (replaced) [pdf, html, other]
-
Title: Real-world Image Dehazing with Coherence-based Label Generator and Cooperative Unfolding NetworkComments: Accepted at NeurIPS 2024 as a Spotlight PaperSubjects: Computer Vision and Pattern Recognition (cs.CV)
Real-world Image Dehazing (RID) aims to alleviate haze-induced degradation in real-world settings. This task remains challenging due to the complexities in accurately modeling real haze distributions and the scarcity of paired real-world data. To address these challenges, we first introduce a cooperative unfolding network that jointly models atmospheric scattering and image scenes, effectively integrating physical knowledge into deep networks to restore haze-contaminated details. Additionally, we propose the first RID-oriented iterative mean-teacher framework, termed the Coherence-based Label Generator, to generate high-quality pseudo-labels for network training. Specifically, we provide an optimal label pool to store the best pseudo-labels during network training, leveraging both global and local coherence to select high-quality candidates and assign weights to prioritize haze-free regions. We verify the effectiveness of our method, with experiments demonstrating that it achieves state-of-the-art performance on RID tasks. Code will be available at this https URL.
- [445] arXiv:2406.09413 (replaced) [pdf, html, other]
-
Title: Interpreting the Weight Space of Customized Diffusion ModelsAmil Dravid, Yossi Gandelsman, Kuan-Chieh Wang, Rameen Abdal, Gordon Wetzstein, Alexei A. Efros, Kfir AbermanComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
We investigate the space of weights spanned by a large collection of customized diffusion models. We populate this space by creating a dataset of over 60,000 models, each of which is a base model fine-tuned to insert a different person's visual identity. We model the underlying manifold of these weights as a subspace, which we term weights2weights. We demonstrate three immediate applications of this space that result in new diffusion models -- sampling, editing, and inversion. First, sampling a set of weights from this space results in a new model encoding a novel identity. Next, we find linear directions in this space corresponding to semantic edits of the identity (e.g., adding a beard), resulting in a new model with the original identity edited. Finally, we show that inverting a single image into this space encodes a realistic identity into a model, even if the input image is out of distribution (e.g., a painting). We further find that these linear properties of the diffusion model weight space extend to other visual concepts. Our results indicate that the weight space of fine-tuned diffusion models can behave as an interpretable meta-latent space producing new models.
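The subspace view of fine-tuned weights lends itself to a plain PCA sketch; this is an illustrative reconstruction under assumptions (toy sizes, `flat_weights` hypothetical), not the authors' code:

```python
import numpy as np

# Hypothetical: each row is one fine-tuned model's parameters, flattened.
flat_weights = np.random.randn(500, 2048).astype(np.float32)

mean = flat_weights.mean(axis=0)
# Principal subspace of the centered weight matrix via SVD.
U, S, Vt = np.linalg.svd(flat_weights - mean, full_matrices=False)
k = 32                           # subspace dimension (illustrative)
basis = Vt[:k]                   # (k, num_params)
coords = (flat_weights - mean) @ basis.T

# Sampling: draw subspace coefficients, map back to weight space.
z = coords.mean(axis=0) + np.random.randn(k) * coords.std(axis=0)
new_weights = mean + z @ basis   # weights of a "new" model

# Editing would move along a semantic direction found in `coords`;
# inversion would least-squares project one model onto `basis`.
```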
- [446] arXiv:2406.11840 (replaced) [pdf, html, other]
-
Title: LLaNA: Large Language and NeRF AssistantComments: Under review. Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Multimodal Large Language Models (MLLMs) have demonstrated an excellent understanding of images and 3D data. However, both modalities have shortcomings in holistically capturing the appearance and geometry of objects. Meanwhile, Neural Radiance Fields (NeRFs), which encode information within the weights of a simple Multi-Layer Perceptron (MLP), have emerged as an increasingly widespread modality that simultaneously encodes the geometry and photorealistic appearance of objects. This paper investigates the feasibility and effectiveness of ingesting NeRFs into MLLMs. We create LLaNA, the first general-purpose NeRF-language assistant capable of performing new tasks such as NeRF captioning and Q&A. Notably, our method directly processes the weights of the NeRF's MLP to extract information about the represented objects without the need to render images or materialize 3D data structures. Moreover, we build a dataset of NeRFs with text annotations for various NeRF-language tasks with no human intervention. Based on this dataset, we develop a benchmark to evaluate the NeRF understanding capability of our method. Results show that processing NeRF weights performs favourably against extracting 2D or 3D representations from NeRFs.
- [447] arXiv:2406.13677 (replaced) [pdf, html, other]
-
Title: Leveraging Large Language Models to Measure Gender Representation Bias in Gendered Language CorporaSubjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Gender bias in text corpora that are used for a variety of natural language processing (NLP) tasks, such as for training large language models (LLMs), can lead to the perpetuation and amplification of societal inequalities. This phenomenon is particularly pronounced in gendered languages like Spanish or French, where grammatical structures inherently encode gender, making the bias analysis more challenging. A first step in quantifying gender bias in text entails computing biases in gender representation, i.e., differences in the prevalence of words referring to males vs. females. Existing methods to measure gender representation bias in text corpora have mainly been proposed for English and do not generalize to gendered languages due to the intrinsic linguistic differences between English and gendered languages. This paper introduces a novel methodology that leverages the contextual understanding capabilities of LLMs to quantitatively measure gender representation bias in Spanish corpora. By utilizing LLMs to identify and classify gendered nouns and pronouns in relation to their reference to human entities, our approach provides a robust analysis of gender representation bias in gendered languages. We empirically validate our method on four widely-used benchmark datasets, uncovering significant gender prevalence disparities with a male-to-female ratio ranging from 4:1 to 6:1. These findings demonstrate the value of our methodology for bias quantification in gendered language corpora and suggest its application in NLP, contributing to the development of more equitable language technologies.
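The core counting step — classify each gendered word by whether it refers to a male or female human, then take the prevalence ratio — can be sketched as follows; the toy lexicon stands in for the paper's LLM classifier and is purely illustrative:

```python
from collections import Counter

# Toy stand-in for the LLM call that decides whether a Spanish
# noun/pronoun refers to a male human, a female human, or neither.
LEXICON = {"él": "male", "profesor": "male",
           "ella": "female", "profesora": "female",
           "mesa": "neither"}   # "mesa" (table) is grammatically feminine
                                # but refers to no human

def classify_gender(word: str) -> str:
    return LEXICON.get(word.lower(), "neither")

def representation_ratio(words):
    counts = Counter(classify_gender(w) for w in words)
    # Male-to-female prevalence over human-referring words only.
    return counts["male"] / max(counts["female"], 1)

print(representation_ratio(["Él", "profesor", "ella", "mesa"]))  # 2.0
```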
- [448] arXiv:2406.14596 (replaced) [pdf, html, other]
-
Title: VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of ThoughtGabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina FragkiadakiComments: Project website: this http URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Large-scale generative language and vision-language models (LLMs and VLMs) excel in few-shot in-context learning for decision making and instruction following. However, they require high-quality exemplar demonstrations in their context window. In this work, we ask: Can LLMs and VLMs generate their own examples from generic, sub-optimal demonstrations? We propose In-Context Abstraction Learning (ICAL), a method that builds a memory of multimodal experience from sub-optimal demonstrations and human feedback. Given a task demonstration that may contain inefficiencies or mistakes, a VLM abstracts the trajectory into a generalized program of thoughts by correcting inefficient actions and annotating cognitive abstractions: causal relationships, object state changes, temporal subgoals, and task-relevant visual elements. These programs of thought are iteratively improved through human feedback while the agent executes the trajectory in a similar environment. The resulting examples significantly improve decision-making in retrieval-augmented LLM and VLM agents. Moreover, as the agent's library of examples grows, it becomes more efficient, relying less on human feedback and requiring fewer environment interactions per demonstration. Our ICAL agent surpasses the SOTA in dialogue-based instruction following in TEACh, multimodal web agents in VisualWebArena, and action anticipation in Ego4D. In TEACh, we achieve a 12.6% improvement in goal-condition success. In VisualWebArena, our task success rate improves over few-shot GPT-4V. In Ego4D action forecasting, we improve over few-shot GPT-4V and remain competitive with supervised models. We show finetuning our retrieval-augmented in-context agent yields additional improvements. Our approach significantly reduces reliance on manual prompt engineering and consistently outperforms in-context learning from action plans that lack such programs of thought.
- [449] arXiv:2406.14861 (replaced) [pdf, other]
-
Title: Resilience of the Electric Grid through Trustable IoT-Coordinated AssetsVineet J. Nair, Venkatesh Venkataramanan, Priyank Srivastava, Partha S. Sarker, Anurag Srivastava, Laurentiu D. Marinovici, Jun Zha, Christopher Irwin, Prateek Mittal, John Williams, Jayant Kumar, H. Vincent Poor, Anuradha M. AnnaswamyComments: Accepted to the Proceedings of the National Academy of Sciences (PNAS) 2024Subjects: Systems and Control (eess.SY); Emerging Technologies (cs.ET)
The electricity grid has evolved from a physical system to a cyber-physical system with digital devices that perform measurement, control, communication, computation, and actuation. The increased penetration of distributed energy resources (DERs) including renewable generation, flexible loads, and storage provides extraordinary opportunities for improvements in efficiency and sustainability. However, they can introduce new vulnerabilities in the form of cyberattacks, which can cause significant challenges in ensuring grid resilience. We propose a framework in this paper for achieving grid resilience through suitably coordinated assets including a network of Internet of Things (IoT) devices. A local electricity market is proposed to identify trustable assets and carry out this coordination. Situational Awareness (SA) of locally available DERs with the ability to inject power or reduce consumption is enabled by the market, together with a monitoring procedure for their trustability and commitment. With this SA, we show that a variety of cyberattacks can be mitigated using local trustable resources without stressing the bulk grid. Multiple demonstrations are carried out using a high-fidelity co-simulation platform, real-time hardware-in-the-loop validation, and a utility-friendly simulator.
- [450] arXiv:2406.18060 (replaced) [pdf, html, other]
-
Title: AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-TuningComments: Accepted for publication in EMNLP 2024Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks, yet it demands more and more memory as model sizes keep growing. To address this issue, the recently proposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. However, significant performance drops and a high risk of divergence have limited their widespread adoption. In this paper, we propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of the ZO methods. To enhance dimension-dependent ZO estimation accuracy, we introduce a fast-forward, low-parameter tensorized adapter. To tackle the frequently observed divergence issue in large-scale ZO fine-tuning tasks, we propose an adaptive query number schedule that guarantees convergence. Detailed theoretical analysis and extensive experimental results on RoBERTa-Large and Llama-2-7B models substantiate the efficacy of our AdaZeta framework in terms of accuracy, memory efficiency, and convergence speed.
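The zeroth-order estimation underlying MeZO-style methods is a two-point finite difference over random directions; a minimal NumPy sketch (this is the generic estimator, not AdaZeta's tensor-train adapter):

```python
import numpy as np

def zo_gradient(loss_fn, theta, eps=1e-3, queries=4, seed=0):
    """Forward-pass-only gradient estimate, averaged over random
    directions: g ~ [L(t + eps*z) - L(t - eps*z)] / (2*eps) * z."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(theta)
    for _ in range(queries):
        z = rng.standard_normal(theta.shape)
        g = (loss_fn(theta + eps * z) - loss_fn(theta - eps * z)) / (2 * eps)
        grad += g * z
    return grad / queries

# AdaZeta's adaptive query schedule would, per the abstract, grow
# `queries` over training to keep large-scale fine-tuning from diverging.
quadratic = lambda t: float(np.sum(t ** 2))
print(zo_gradient(quadratic, np.ones(3), queries=64))  # approx [2, 2, 2]
```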
- [451] arXiv:2406.18279 (replaced) [pdf, html, other]
-
Title: Improving EO Foundation Models with Confidence Assessment for enhanced Semantic segmentationComments: 5 pages, 7 figures, 4 tables, AcceptedSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Confidence assessments of semantic segmentation algorithms are important. Ideally, deep learning models should have the ability to predict in advance whether their output is likely to be incorrect. Assessing the confidence levels of model predictions in Earth Observation (EO) classification is essential, as it can enhance semantic segmentation performance and help prevent further exploitation of the results in case of erroneous prediction. The model we developed, Confidence Assessment for enhanced Semantic segmentation (CAS), evaluates confidence at both the segment and pixel levels, providing both labels and confidence scores as output. Our model, CAS, identifies segments with incorrect predicted labels using the proposed combined confidence metric, refines the model, and enhances its performance. This work has significant applications, particularly in evaluating EO Foundation Models on semantic segmentation downstream tasks, such as land cover classification using Sentinel-2 satellite data. The evaluation results show that this strategy is effective and that the proposed model CAS outperforms other baseline models.
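Pixel- and segment-level confidences of the kind CAS combines can be sketched directly from softmax outputs (an illustrative reconstruction; the paper's combined metric may be defined differently):

```python
import numpy as np

def pixel_confidence(logits):
    """logits: (C, H, W) class scores -> (H, W) max softmax probability."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    probs = e / e.sum(axis=0, keepdims=True)
    return probs.max(axis=0)

def segment_confidence(logits, segment_mask):
    """Mean pixel confidence over one predicted segment (boolean mask)."""
    return pixel_confidence(logits)[segment_mask].mean()

# Segments whose confidence falls below a threshold would be flagged as
# likely mislabeled and used to refine the model, per the abstract.
```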
- [452] arXiv:2407.00615 (replaced) [pdf, html, other]
-
Title: GC-Bench: An Open and Unified Benchmark for Graph CondensationQingyun Sun, Ziying Chen, Beining Yang, Cheng Ji, Xingcheng Fu, Sheng Zhou, Hao Peng, Jianxin Li, Philip S. YuComments: Accepted by NeurIPS 2024Subjects: Machine Learning (cs.LG)
Graph condensation (GC) has recently garnered considerable attention due to its ability to reduce large-scale graph datasets while preserving their essential properties. The core concept of GC is to create a smaller, more manageable graph that retains the characteristics of the original graph. Despite the proliferation of graph condensation methods developed in recent years, there is no comprehensive evaluation and in-depth analysis, which creates a great obstacle to understanding the progress in this field. To fill this gap, we develop a comprehensive Graph Condensation Benchmark (GC-Bench) to analyze the performance of graph condensation in different scenarios systematically. Specifically, GC-Bench systematically investigates the characteristics of graph condensation in terms of the following dimensions: effectiveness, transferability, and complexity. We comprehensively evaluate 12 state-of-the-art graph condensation algorithms in node-level and graph-level tasks and analyze their performance in 12 diverse graph datasets. Further, we have developed an easy-to-use library for training and evaluating different GC methods to facilitate reproducible research. The GC-Bench library is available at this https URL.
- [453] arXiv:2407.01782 (replaced) [pdf, html, other]
-
Title: Addressing a fundamental limitation in deep vision models: lack of spatial attentionSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
The primary aim of this manuscript is to underscore a significant limitation in current deep learning models, particularly vision models. Unlike human vision, which efficiently selects only the essential visual areas for further processing, leading to high speed and low energy consumption, deep vision models process the entire image. In this work, we examine this issue from a broader perspective and propose two solutions that could pave the way for the next generation of more efficient vision models. In the first solution, convolution and pooling operations are selectively applied to altered regions, with a change map sent to subsequent layers. This map indicates which computations need to be repeated. In the second solution, only the modified regions are processed by a semantic segmentation model, and the resulting segments are inserted into the corresponding areas of the previous output map. The code is available at this https URL.
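The first solution — propagate a change map and redo work only where the input changed — can be sketched as follows (tile size and threshold are illustrative, and the convolution's boundary halo is ignored for brevity):

```python
import numpy as np

def change_map(prev_frame, frame, threshold=0.05):
    """Boolean (H, W) map of pixels whose intensity changed noticeably."""
    return np.abs(frame - prev_frame).max(axis=-1) > threshold

def selective_update(prev_out, frame, prev_frame, conv, tile=16):
    """Recompute `conv` only on tiles overlapping changed pixels,
    reusing the previous layer output elsewhere."""
    changed = change_map(prev_frame, frame)
    out = prev_out.copy()
    H, W = changed.shape
    for y in range(0, H, tile):
        for x in range(0, W, tile):
            if changed[y:y+tile, x:x+tile].any():
                out[y:y+tile, x:x+tile] = conv(frame[y:y+tile, x:x+tile])
    return out
```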
- [454] arXiv:2407.02079 (replaced) [pdf, html, other]
-
Title: Theseus: Exploring Efficient Wafer-Scale Chip Design for Large Language ModelsJingchen Zhu, Chenhao Xue, Yiqi Chen, Zhao Wang, Chen Zhang, Yu Shen, Yifan Chen, Zekang Cheng, Yu Jiang, Tianqi Wang, Yibo Lin, Wei Hu, Bin Cui, Runsheng Wang, Yun Liang, Guangyu SunSubjects: Hardware Architecture (cs.AR)
The emergence of large language models (LLMs) has driven exponential growth in demand for computation throughput, memory capacity, and communication bandwidth, a growth that has significantly outpaced the improvement of corresponding chip designs. With the advancement of fabrication and integration technologies, designers have been developing Wafer-Scale Chips (WSCs) to scale up and exploit the limits of computation density, memory capacity, and communication bandwidth at the level of a single chip. Existing solutions have demonstrated the significant advantages of WSCs over traditional designs, showing potential to effectively support LLM workloads.
Despite the benefits, exploring the early-stage design space of WSCs for LLMs is a crucial yet challenging task due to the enormous and complicated design space, time-consuming evaluation methods, and inefficient exploration strategies. To address these challenges, we propose Theseus, an efficient WSC design space exploration framework for LLMs. We construct the design space of WSCs with various constraints considering the unique characteristics of WSCs. We propose efficient evaluation methodologies for large-scale NoC-based WSCs and introduce multi-fidelity Bayesian optimization to efficiently explore the design space. Evaluation results demonstrate the efficiency of Theseus: the searched Pareto-optimal designs outperform GPU clusters and existing WSC designs by up to 62.8%/73.7% in performance and 38.6%/42.4% in power consumption for LLM training, while improving performance and power by up to 23.2$\times$ and 15.7$\times$ for inference tasks. Furthermore, we conduct case studies to address the design tradeoffs in WSCs and provide insights to facilitate WSC designs for LLMs.
- [455] arXiv:2407.03318 (replaced) [pdf, html, other]
-
Title: Constant-Factor EFX Exists for ChoresComments: 72 pagesSubjects: Computer Science and Game Theory (cs.GT)
We study the problem of fair allocation of chores to agents with additive preferences. In the discrete setting, envy-freeness up to any chore (EFX) has emerged as a compelling fairness criterion. However, establishing its (non-)existence or achieving a meaningful approximation remains a major open question. The current best guarantee is the existence of $O(n^2)$-EFX allocations for $n$ agents, obtained through a sophisticated algorithm (Zhou and Wu, 2022). In this paper, we show the existence of $4$-EFX allocations, providing the first constant-factor approximation of EFX.
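For reference, the chores analogue of EFX and its multiplicative relaxation can be stated as follows (a standard formulation given here for the reader's convenience; $c_i$ denotes agent $i$'s additive cost function):

```latex
% An allocation A = (A_1, \dots, A_n) is \alpha-EFX for chores if
% removing any single chore from the envious agent's own bundle
% bounds the remaining envy by the factor \alpha:
\forall\, i, j, \ \forall\, t \in A_i : \quad
  c_i(A_i \setminus \{t\}) \;\le\; \alpha \cdot c_i(A_j).
% \alpha = 1 is exact EFX; the paper proves 4-EFX allocations exist.
```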
We also investigate the existence of allocations that are both fair and efficient, using Pareto optimality (PO) as our efficiency criterion. For the special case of bivalued instances, we establish the existence of allocations that are both $3$-EFX and PO, thus improving the current best factor of $O(n)$-EFX without any efficiency guarantees. For general additive instances, the existence of allocations that are $\alpha$-EF$k$ and PO has remained open for any constant values of $\alpha$ and $k$, where EF$k$ denotes envy-freeness up to $k$ chores. We provide the first positive result in this direction by showing the existence of allocations that are $2$-EF$2$ and PO.
Our results are obtained via a novel economic framework called earning restricted (ER) competitive equilibrium for fractional allocations, which limits agents' earnings from each chore. We show the existence of ER equilibria by formulating it as a linear complementarity problem (LCP) and proving that the classic complementary pivot algorithm on the LCP terminates at an ER equilibrium. We design algorithms that carefully round fractional ER equilibria, and perform bundle swaps and merges to meet the desired fairness and efficiency criteria. We expect that the concept of ER equilibrium will be useful in deriving further results on related problems.
- [456] arXiv:2407.04873 (replaced) [pdf, html, other]
-
Title: Evaluating Language Models for Generating and Judging Programming FeedbackComments: 2 tables. Accepted for SIGCSE TS 2025Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
The emergence of large language models (LLMs) has transformed research and practice across a wide range of domains. Within the computing education research (CER) domain, LLMs have garnered significant attention, particularly in the context of learning programming. Much of the work on LLMs in CER, however, has focused on applying and evaluating proprietary models. In this article, we evaluate the efficiency of open-source LLMs in generating high-quality feedback for programming assignments and judging the quality of programming feedback, contrasting the results with proprietary models. Our evaluations on a dataset of students' submissions to introductory Python programming exercises suggest that state-of-the-art open-source LLMs are nearly on par with proprietary models in both generating and assessing programming feedback. Additionally, we demonstrate the efficiency of smaller LLMs in these tasks and highlight the wide range of LLMs accessible, even for free, to educators and practitioners.
- [457] arXiv:2407.06194 (replaced) [pdf, other]
-
Title: More Distinctively Black and Feminine Faces Lead to Increased Stereotyping in Vision-Language ModelsComments: This submission is being withdrawn to address concerns related to the terms of use of a database utilized in the research. We aim to ensure full compliance with all data usage agreements before proceeding with publicationSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Vision Language Models (VLMs), exemplified by GPT-4V, adeptly integrate text and vision modalities. This integration enhances Large Language Models' ability to mimic human perception, allowing them to process image inputs. Despite VLMs' advanced capabilities, however, there is a concern that VLMs inherit biases of both modalities in ways that make biases more pervasive and difficult to mitigate. Our study explores how VLMs perpetuate homogeneity bias and trait associations with regard to race and gender. When prompted to write stories based on images of human faces, GPT-4V describes subordinate racial and gender groups with greater homogeneity than dominant groups and relies on distinct, yet generally positive, stereotypes. Importantly, VLM stereotyping is driven by visual cues rather than group membership alone, such that faces that are rated as more prototypically Black and feminine are subject to greater stereotyping. These findings suggest that VLMs may associate subtle visual cues related to racial and gender groups with stereotypes in ways that could be challenging to mitigate. We explore the underlying reasons behind this behavior, discuss its implications, and emphasize the importance of addressing these biases as VLMs come to mirror human perception.
- [458] arXiv:2407.08713 (replaced) [pdf, html, other]
-
Title: GTA: A Benchmark for General Tool AgentsComments: Github repo: this https URLSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Significant focus has been placed on integrating large language models (LLMs) with various tools in developing general-purpose agents. This poses a challenge to LLMs' tool-use capabilities. However, there are evident gaps between existing tool-use evaluations and real-world scenarios. Current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only interactions, failing to reveal the agents' real-world problem-solving abilities effectively. To address this, we propose GTA, a benchmark for General Tool Agents, featuring three main aspects: (i) Real user queries: human-written queries with simple real-world objectives but implicit tool-use, requiring the LLM to reason about the suitable tools and plan the solution steps. (ii) Real deployed tools: an evaluation platform equipped with tools across perception, operation, logic, and creativity categories to evaluate the agents' actual task execution performance. (iii) Real multimodal inputs: authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, used as the query contexts to align with real-world scenarios closely. We design 229 real-world tasks and executable tool chains to evaluate mainstream LLMs. Our findings show that real-world user queries are challenging for existing LLMs, with GPT-4 completing less than 50% of the tasks and most LLMs achieving below 25%. This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios, which provides future direction for advancing general-purpose tool agents. The code and dataset are available at this https URL.
- [459] arXiv:2407.09820 (replaced) [pdf, other]
-
Title: Mining individual daily commuting patterns of dockless bike-sharing users: a two-layer framework integrating spatiotemporal flow clustering and rule-based decision treesJournal-ref: Sustainable Cities and Society 118:105985,2024Subjects: Computers and Society (cs.CY)
The rise of dockless bike-sharing systems has led to increased interest in using bike-sharing data for sustainable transportation and travel behavior research. However, these studies have rarely focused on individual daily mobility patterns, hindering their alignment with the increasingly refined needs of active transportation planning. To bridge this gap, this paper presents a two-layer framework, integrating improved flow clustering methods and multiple rule-based decision trees, to mine individual cyclists' daily home-work commuting patterns from dockless bike-sharing trip data with user IDs. The effectiveness and applicability of the framework are demonstrated by over 200 million bike-sharing trip records in Shenzhen. Based on the mining results, we obtain two categories of bike-sharing commuters (74.38% of Only-biking commuters and 25.62% of Biking-with-transit commuters) and some interesting findings about their daily commuting patterns. For instance, many bike-sharing commuters live near urban villages and old communities with lower costs of living, especially in the central city. Only-biking commuters have a higher proportion of overtime than Biking-with-transit commuters, and the Longhua Industrial Park, a manufacturing-oriented area, has the longest average working hours (over 10 hours per day). Moreover, many users utilize bike-sharing for commuting to work more frequently than for returning home, which is intricately related to the over-demand for bikes around workplaces during peak commuting hours. In sum, this framework offers a cost-effective way to understand the nuanced non-motorized mobility patterns and low-carbon trip chains of residents. It also offers novel insights for improving bike-sharing services and planning of active transportation modes.
- [460] arXiv:2407.11747 (replaced) [pdf, html, other]
-
Title: PandORA: Automated Design and Comprehensive Evaluation of Deep Reinforcement Learning Agents for Open RANMaria Tsampazi, Salvatore D'Oro, Michele Polese, Leonardo Bonati, Gwenael Poitau, Michael Healy, Mohammad Alavirad, Tommaso MelodiaComments: 18 pages, 26 figures. arXiv admin note: text overlap with arXiv:2309.05621Subjects: Networking and Internet Architecture (cs.NI)
The highly heterogeneous ecosystem of NextG wireless communication systems calls for novel networking paradigms where functionalities and operations can be dynamically and optimally reconfigured in real time to adapt to changing traffic conditions and satisfy stringent and diverse QoS demands. Open RAN technologies, and specifically those being standardized by the O-RAN Alliance, make it possible to integrate network intelligence into the once monolithic RAN via intelligent applications, namely, xApps and rApps. These applications enable flexible control of the network resources and functionalities, network management, and orchestration through data-driven intelligent control loops. Recent work has shown how DRL is effective in dynamically controlling O-RAN systems. However, how to design these solutions in a way that manages heterogeneous optimization goals and prevents unfair resource allocation is still an open challenge, with the logic within DRL agents often considered a black box. In this paper, we introduce PandORA, a framework to automatically design and train DRL agents for Open RAN applications, package them as xApps and evaluate them in the Colosseum wireless network emulator. We benchmark $23$ xApps that embed DRL agents trained using different architectures, reward design, action spaces, and decision-making timescales, and with the ability to hierarchically control different network parameters. We test these agents on the Colosseum testbed under diverse traffic and channel conditions, in static and mobile setups. Our experimental results indicate how suitable fine-tuning of the RAN control timers, as well as proper selection of reward designs and DRL architectures, can boost network performance according to the network conditions and demand. Notably, finer decision-making granularities can improve mMTC's performance by ~56% and even increase eMBB Throughput by ~99%.
- [461] arXiv:2407.12043 (replaced) [pdf, html, other]
-
Title: The Art of Saying No: Contextual Noncompliance in Language ModelsFaeze Brahman, Sachin Kumar, Vidhisha Balachandran, Pradeep Dasigi, Valentina Pyatkin, Abhilasha Ravichander, Sarah Wiegreffe, Nouha Dziri, Khyathi Chandu, Jack Hessel, Yulia Tsvetkov, Noah A. Smith, Yejin Choi, Hannaneh HajishirziComments: The first two authors are co-first authors; Accepted at NeurIPS 2024 Track on Datasets and BenchmarksSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Chat-based language models are designed to be helpful, yet they should not comply with every user request. While most existing work primarily focuses on refusal of "unsafe" queries, we posit that the scope of noncompliance should be broadened. We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should not comply with user requests. Our taxonomy spans a wide range of categories including incomplete, unsupported, indeterminate, and humanizing requests (in addition to unsafe requests). To test noncompliance capabilities of language models, we use this taxonomy to develop a new evaluation suite of 1000 noncompliance prompts. We find that most existing models show significantly high compliance rates in certain previously understudied categories with models like GPT-4 incorrectly complying with as many as 30% of requests. To address these gaps, we explore different training strategies using a synthetically-generated training set of requests and expected noncompliant responses. Our experiments demonstrate that while direct finetuning of instruction-tuned models can lead to both over-refusal and a decline in general capabilities, using parameter efficient methods like low rank adapters helps to strike a good balance between appropriate noncompliance and other capabilities.
- [462] arXiv:2407.14115 (replaced) [pdf, other]
-
Title: Dual Adjunction Between $\Omega$-Automata and Wilke Algebra QuotientsSubjects: Formal Languages and Automata Theory (cs.FL)
$\Omega$-automata and Wilke algebras are formalisms for characterising $\omega$-regular languages via their ultimately periodic words. $\Omega$-automata read finite representations of ultimately periodic words, called lassos, and they are a subclass of lasso automata. We introduce lasso semigroups as a generalisation of Wilke algebras that mirrors how lasso automata generalise $\Omega$-automata, and we show that finite lasso semigroups characterise regular lasso languages. We then show a dual adjunction between lasso automata and quotients of the free lasso semigroup with a recognising set, and as our main result we show that this dual adjunction restricts to one between $\Omega$-automata and quotients of the free Wilke algebra with a recognising set.
- [463] arXiv:2407.15850 (replaced) [pdf, html, other]
-
Title: AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio DescriptionComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner. We use the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs), and develop visual and text prompting strategies for this task. Our contributions are three-fold: (i) We demonstrate that a VLM can successfully name and refer to characters if directly prompted with character information through visual indications without requiring any fine-tuning; (ii) A two-stage process is developed to generate ADs, with the first stage asking the VLM to comprehensively describe the video, followed by a second stage utilising a LLM to summarise dense textual information into one succinct AD sentence; (iii) A new dataset for TV audio description is formulated. Our approach, named AutoAD-Zero, demonstrates outstanding performance (even competitive with some models fine-tuned on ground truth ADs) in AD generation for both movies and TV series, achieving state-of-the-art CRITIC scores.
- [464] arXiv:2407.16485 (replaced) [pdf, other]
-
Title: Learning General Continuous Constraint from Demonstrations via Positive-Unlabeled LearningComments: The paper is hastily uploaded. We prefer to improve it and upload it later, and possibly after it is publishedSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Planning for a wide range of real-world tasks necessitates knowing and specifying all constraints. However, instances exist where these constraints are either unknown or challenging to specify accurately. A possible solution is to infer the unknown constraints from expert demonstrations. The majority of prior works limit themselves to learning simple linear constraints, or require strong knowledge of the true constraint parameterization or environmental model. To mitigate these problems, this paper presents a positive-unlabeled (PU) learning approach to infer a continuous, arbitrary and possibly nonlinear, constraint from demonstrations. From a PU learning view, we treat all data in demonstrations as positive (feasible) data, and learn a (sub)-optimal policy to generate high-reward-winning but potentially infeasible trajectories, which serve as unlabeled data containing both feasible and infeasible states. Under an assumption on data distribution, a feasible-infeasible classifier (i.e., constraint model) is learned from the two datasets through a postprocessing PU learning technique. The entire method employs an iterative framework alternating between updating the policy, which generates and selects higher-reward policies, and updating the constraint model. Additionally, a memory buffer is introduced to record and reuse samples from previous iterations to prevent forgetting. The effectiveness of the proposed method is validated in two MuJoCo environments, successfully inferring continuous nonlinear constraints and outperforming a baseline method in terms of constraint accuracy and policy safety.
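The alternating scheme can be condensed into a pseudocode sketch; `optimize_policy`, `rollout`, and `pu_classifier` are hypothetical helpers standing in for the paper's components:

```python
def infer_constraint(demos, env, iters=10):
    """Iterative PU constraint inference (illustrative pseudocode).

    demos: expert trajectories; every visited state is a positive
    (feasible) example. Returns a feasible/infeasible classifier
    that serves as the learned constraint model.
    """
    positives = [s for traj in demos for s in traj]
    memory = []                        # reuse old samples; prevents forgetting
    constraint = lambda s: 0.0         # initially: no penalty anywhere
    for _ in range(iters):
        # 1) Train a (sub)-optimal, possibly infeasible policy under the
        #    current constraint penalty (hypothetical helper).
        policy = optimize_policy(env, penalty=constraint)
        # 2) Its rollouts are unlabeled data: feasible + infeasible states.
        memory += rollout(policy, env)
        # 3) PU learning: positives vs. the unlabeled mixture.
        constraint = pu_classifier(positives, memory)
    return constraint
```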
- [465] arXiv:2407.21735 (replaced) [pdf, html, other]
-
Title: EMatch: A Unified Framework for Event-based Optical Flow and Stereo MatchingSubjects: Computer Vision and Pattern Recognition (cs.CV)
Event cameras have shown promise in vision applications like optical flow estimation and stereo matching, with many specialized architectures leveraging the asynchronous and sparse nature of event data. However, existing works only focus on event data within the confines of task-specific domains, overlooking how tasks across the temporal and spatial domains can reinforce each other. In this paper, we reformulate event-based flow estimation and stereo matching as a unified dense correspondence matching problem, enabling us to solve both tasks within a single model by directly matching features in a shared representation space. Specifically, our method utilizes a Temporal Recurrent Network to aggregate event features across temporal or spatial domains, and a Spatial Contextual Attention to enhance knowledge transfer across event flows via temporal or spatial interactions. By utilizing a shared feature similarities module that integrates knowledge from event streams via temporal or spatial interactions, our network performs optical flow estimation from temporal event segment inputs and stereo matching from spatial event segment inputs simultaneously. We demonstrate that our unified model inherently supports multi-task fusion and cross-task transfer. Without the need to retrain for a specific task, our model can effectively handle both optical flow and stereo estimation, achieving state-of-the-art performance on both tasks.
- [466] arXiv:2407.21753 (replaced) [pdf, html, other]
-
Title: Characterizing User Archetypes and Discussions on Scored.coSubjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
In recent years, the proliferation of social platforms has drastically transformed the way individuals interact, organize, and share information. In this scenario, interactions have grown in scale and complexity at an unprecedented rate, while some fringe social platforms have received little to no research attention. In this paper, we present a multi-dimensional framework for characterizing nodes and hyperedges in social hypernetworks, with a focus on the understudied alt-right platform this http URL. Our approach integrates the possibility of studying higher-order interactions, thanks to the hypernetwork representation, and various node features such as user activity, sentiment, and toxicity, with the aim of defining distinct user archetypes and understanding their roles within the network. Utilizing a comprehensive dataset from this http URL, we analyze the dynamics of these archetypes over time and explore their interactions and influence within the community. The framework's versatility allows for detailed analysis of both individual user behaviors and broader social structures. Our findings highlight the importance of higher-order interactions in understanding social dynamics, offering new insights into the roles and behaviors that emerge in complex online environments.
- [467] arXiv:2408.00041 (replaced) [pdf, html, other]
-
Title: Con4m: Context-aware Consistency Learning Framework for Segmented Time Series ClassificationSubjects: Artificial Intelligence (cs.AI)
Time Series Classification (TSC) encompasses two settings: classifying entire sequences or classifying segmented subsequences. The raw time series for segmented TSC usually contain Multiple classes with Varying Duration of each class (MVD). Therefore, the characteristics of MVD pose unique challenges for segmented TSC, yet have been largely overlooked by existing works. Specifically, there exists a natural temporal dependency between consecutive instances (segments) to be classified within MVD. However, mainstream TSC models rely on the assumption of independent and identically distributed (i.i.d.) data, focusing on independently modeling each segment. Additionally, annotators with varying expertise may provide inconsistent boundary labels, leading to unstable performance of noise-free TSC models. To address these challenges, we first formally demonstrate that valuable contextual information enhances the discriminative power of classification instances. Leveraging the contextual priors of MVD at both the data and label levels, we propose a novel consistency learning framework Con4m, which effectively utilizes contextual information more conducive to discriminating consecutive segments in segmented TSC tasks, while harmonizing inconsistent boundary labels for training. Extensive experiments across multiple datasets validate the effectiveness of Con4m in handling segmented TSC tasks on MVD.
- [468] arXiv:2408.00392 (replaced) [pdf, other]
-
Title: Polynomial quasi-Trefftz DG for PDEs with smooth coefficients: elliptic problemsComments: 26 pages, 6 figures, 2 tables, added some remarks and one figureSubjects: Numerical Analysis (math.NA)
Trefftz schemes are high-order Galerkin methods whose discrete spaces are made of elementwise exact solutions of the underlying PDE. Trefftz basis functions can be easily computed for many PDEs that are linear, homogeneous, and have piecewise-constant coefficients. However, if the equation has variable coefficients, exact solutions are generally unavailable. Quasi-Trefftz methods overcome this limitation relying on elementwise "approximate solutions" of the PDE, in the sense of Taylor polynomials.
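Concretely, for a second-order operator $\mathcal{L}$ with smooth coefficients, "approximate solution in the sense of Taylor polynomials" on an element $K$ with center $x_K$ is commonly formalized as below; this is one standard formulation, and the paper's precise definition may differ in the derivative order:

```latex
% Polynomial quasi-Trefftz space of degree p on element K:
\mathbb{QT}^p(K) := \bigl\{ v \in \mathbb{P}^p(K) \;:\;
  D^{\boldsymbol{i}}(\mathcal{L} v)(x_K) = 0
  \quad \forall\, \boldsymbol{i} \in \mathbb{N}_0^d,\ |\boldsymbol{i}| \le p - 2 \bigr\}
% i.e. the Taylor polynomial of (L v) at x_K vanishes to the order
% matching the polynomial degree, rather than L v = 0 exactly.
```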
We define polynomial quasi-Trefftz spaces for general linear PDEs with smooth coefficients and source term, describe their approximation properties and, under a non-degeneracy condition, provide a simple algorithm to compute a basis. We then focus on a quasi-Trefftz DG method for variable-coefficient elliptic diffusion-advection-reaction problems, showing stability and high-order convergence of the scheme. The main advantage over standard DG schemes is the higher accuracy for comparable numbers of degrees of freedom. For non-homogeneous problems with piecewise-smooth source term we propose to construct a local quasi-Trefftz particular solution and then solve for the difference. Numerical experiments in 2 and 3 space dimensions show the excellent properties of the method both in diffusion-dominated and advection-dominated problems.
- [469] arXiv:2408.01231 (replaced) [pdf, html, other]
-
Title: WaveMamba: Spatial-Spectral Wavelet Mamba for Hyperspectral Image ClassificationSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Hyperspectral Imaging (HSI) has proven to be a powerful tool for capturing detailed spectral and spatial information across diverse applications. Despite the advancements in Deep Learning (DL) and Transformer architectures for HSI classification, challenges such as computational efficiency and the need for extensive labeled data persist. This paper introduces WaveMamba, a novel approach that integrates wavelet transformation with the spatial-spectral Mamba architecture to enhance HSI classification. WaveMamba captures both local texture patterns and global contextual relationships in an end-to-end trainable model. The wavelet-enhanced features are then processed through the state-space architecture to model spatial-spectral relationships and temporal dependencies. The experimental results indicate that WaveMamba surpasses existing models, achieving an accuracy improvement of 4.5% on the University of Houston dataset and a 2.0% increase on the Pavia University dataset.
- [470] arXiv:2408.06047 (replaced) [pdf, html, other]
-
Title: BooW-VTON: Boosting In-the-Wild Virtual Try-On via Mask-Free Pseudo Data TrainingXuanpu Zhang, Dan Song, Pengxin Zhan, Tianyu Chang, Jianhao Zeng, Qingguo Chen, Weihua Luo, Anan LiuSubjects: Computer Vision and Pattern Recognition (cs.CV)
Image-based virtual try-on is an increasingly popular and important task to generate realistic try-on images of a specific person. Recent methods model virtual try-on as a mask-based image inpainting task, which requires masking the person image and results in significant loss of spatial information. In particular, for in-the-wild try-on scenarios with complex poses and occlusions, mask-based methods often introduce noticeable artifacts. Our research found that a mask-free approach can fully leverage spatial and lighting information from the original person image, enabling high-quality virtual try-on. Consequently, we propose a novel training paradigm for a mask-free try-on diffusion model. We ensure the model's mask-free try-on capability by creating high-quality pseudo-data and further enhance its handling of complex spatial information through effective in-the-wild data augmentation. Besides, a try-on localization loss is designed to concentrate on the try-on area while suppressing garment features in non-try-on areas, ensuring precise rendering of garments and preservation of the foreground and background. In the end, we introduce BooW-VTON, the mask-free virtual try-on diffusion model, which delivers SOTA try-on quality without parsing cost. Extensive qualitative and quantitative experiments have demonstrated superior performance in wild scenarios with such a low-demand input.
- [471] arXiv:2408.10338 (replaced) [pdf, html, other]
-
Title: Revisiting Tree Canonization using polynomialsComments: Added an appendix to include a simpler self-contained proof showing that arithmetic formula evaluation is in logspaceSubjects: Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS)
Graph Isomorphism (GI) is a fundamental algorithmic problem. Amongst graph classes for which the computational complexity of GI has been resolved, trees are arguably the most fundamental. Tree Isomorphism is complete for deterministic logspace, a tiny subclass of polynomial time, by Lindell's result. Over three decades ago, he devised a deterministic logspace algorithm that computes a string which is a canon for the input tree -- two trees are isomorphic precisely when their canons are identical.
Inspired by Miller-Reif's reduction of Tree Isomorphism to Polynomial Identity Testing, we present a new logspace algorithm for tree canonization fundamentally different from Lindell's algorithm. Our algorithm computes a univariate polynomial as canon for an input tree, based on the classical Eisenstein criterion for the irreducibility of univariate polynomials. This can be implemented in logspace by invoking the well-known Buss et al. algorithm for arithmetic formula evaluation. However, we have included in the appendix a simpler self-contained proof showing that arithmetic formula evaluation is in logspace.
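The flavor of polynomial tree canonization can be conveyed with a toy multivariate variant in the spirit of Miller-Reif; this is not the paper's univariate Eisenstein-based construction, and the sketch makes no logspace claim:

```python
import sympy as sp

def tree_poly(children, depth=1):
    """Map a rooted tree (a node is the list of its children; a leaf is
    []) to a polynomial. Unique factorization over Z[z1, z2, ...] gives:
    equal polynomials iff isomorphic rooted trees."""
    z = sp.symbols(f"z{depth}")
    if not children:
        return sp.Integer(0)              # leaf
    p = sp.Integer(1)
    for c in children:
        p *= (z - tree_poly(c, depth + 1))
    return sp.expand(p)

# Two isomorphic trees with children listed in different orders:
t1 = [[[], []], []]
t2 = [[], [[], []]]
assert tree_poly(t1) == tree_poly(t2)     # both: z1**2 - z1*z2**2
```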
This algorithm is conceptually very simple, avoiding the delicate case analysis and complex recursion that constitute the core of Lindell's algorithm. We illustrate the adaptability of our algorithm by extending it to a couple of other classes of graphs.
- [472] arXiv:2408.10517 (replaced) [pdf, html, other]
-
Title: Integrating Multi-Modal Input Token Mixer Into Mamba-Based Decision Models: Decision MetaMambaSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Sequence modeling with State Space models (SSMs) has demonstrated performance surpassing that of Transformers in various tasks, raising expectations for their potential to outperform the Decision Transformer and its enhanced variants in offline reinforcement learning (RL). However, decision models based on Mamba, a state-of-the-art SSM, failed to achieve superior performance compared to these enhanced Decision Transformers. We hypothesize that this limitation arises from information loss during the selective scanning phase. To address this, we propose the Decision MetaMamba (DMM), which augments Mamba with a token mixer in its input layer. This mixer explicitly accounts for the multimodal nature of offline RL inputs, comprising state, action, and return-to-go. The DMM demonstrates improved performance while significantly reducing parameter count compared to prior models. Notably, similar performance gains were achieved using a simple linear token mixer, emphasizing the importance of preserving information from proximate time steps rather than the specific design of the token mixer itself. This novel modification to Mamba's input layer represents a departure from conventional timestamp-based encoding approaches used in Transformers. By enhancing performance of Mamba in offline RL, characterized by memory efficiency and fast inference, this work opens new avenues for its broader application in future RL research.
- [473] arXiv:2408.10556 (replaced) [pdf, html, other]
-
Title: Hokoff: Real Game Dataset from Honor of Kings and its Offline Reinforcement Learning BenchmarksYun Qu, Boyuan Wang, Jianzhun Shao, Yuhang Jiang, Chen Chen, Zhenbin Ye, Lin Liu, Junfeng Yang, Lin Lai, Hongyang Qin, Minwen Deng, Juchao Zhuo, Deheng Ye, Qiang Fu, Wei Yang, Guang Yang, Lanxiao Huang, Xiangyang JiSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The advancement of Offline Reinforcement Learning (RL) and Offline Multi-Agent Reinforcement Learning (MARL) critically depends on the availability of high-quality, pre-collected offline datasets that represent real-world complexities and practical applications. However, existing datasets often fall short due to their simplicity and lack of realism. To address this gap, we propose Hokoff, a comprehensive set of pre-collected datasets that covers both offline RL and offline MARL, accompanied by a robust framework, to facilitate further research. This data is derived from Honor of Kings, a recognized Multiplayer Online Battle Arena (MOBA) game known for its intricate nature, closely resembling real-life situations. Utilizing this framework, we benchmark a variety of offline RL and offline MARL algorithms. We also introduce a novel baseline algorithm tailored for the inherent hierarchical action space of the game. We reveal the inadequacy of current offline RL approaches in handling task complexity, generalization and multi-task learning.
- [474] arXiv:2408.10609 (replaced) [pdf, html, other]
-
Title: PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation AnalysisYan Wu, Esther Wershof, Sebastian M Schmon, Marcel Nassar, Błażej Osiński, Ridvan Eksi, Kun Zhang, Thore GraepelComments: 9 pages plus 19 pages supplementary material. Code is available at this https URLSubjects: Machine Learning (cs.LG); Genomics (q-bio.GN); Machine Learning (stat.ML)
We present a comprehensive framework for predicting the effects of perturbations in single cells, designed to standardize benchmarking in this rapidly evolving field. Our framework, PerturBench, includes a user-friendly platform, diverse datasets, metrics for fair model comparison, and detailed performance analysis. Extensive evaluations of published and baseline models reveal limitations like mode or posterior collapse, and underscore the importance of rank metrics that assess the ordering of perturbations alongside traditional measures like RMSE. Our findings show that simple models can outperform more complex approaches. This benchmarking exercise sets new standards for model evaluation, supports robust model development, and advances the potential of these models to use high-throughput and high-content genetic and chemical screens for disease target discovery.
- [475] arXiv:2408.13920 (replaced) [pdf, html, other]
-
Title: Wav2Small: Distilling Wav2Vec2 to 72K parameters for Low-Resource Speech emotion recognitionDionyssos Kounadis-Bastian, Oliver Schrüfer, Anna Derington, Hagen Wierstorf, Florian Eyben, Felix Burkhardt, Björn SchullerComments: apply reviewSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Speech Emotion Recognition (SER) needs high computational resources to overcome the challenge of substantial annotator disagreement. Today, SER is shifting towards dimensional annotations of arousal, dominance, and valence (A/D/V). Universal metrics such as the L2 distance prove unsuitable for evaluating A/D/V accuracy due to the non-converging consensus of annotator opinions. However, the Concordance Correlation Coefficient (CCC) arose as an alternative metric for A/D/V where a model's output is evaluated to match a whole dataset's CCC rather than L2 distances of individual audios. Recent studies have shown that wav2vec2 / wavLM architectures outputting a float value for each A/D/V dimension achieve today's State-of-the-art (Sota) CCC on A/D/V. The Wav2Vec2.0 / WavLM family has a high computational footprint, but training small models using human annotations has been unsuccessful. In this paper we use a large Transformer Sota A/D/V model as Teacher/Annotator to train 5 student models: 4 MobileNets and our proposed Wav2Small, using only the Teacher's A/D/V outputs instead of human annotations. The Teacher model we propose also sets a new Sota on the MSP Podcast dataset of valence CCC=0.676. We choose MobileNetV4 / MobileNet-V3 as students, as MobileNet has been designed for fast execution times. We also propose Wav2Small - an architecture designed for minimal parameters and RAM consumption. Wav2Small with an .onnx (quantised) of only 120KB is a potential solution for A/D/V on hardware with low resources, having only 72K parameters vs 3.12M parameters for MobileNet-V4-Small.
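The CCC metric discussed above is easy to compute; a NumPy sketch:

```python
import numpy as np

def ccc(x, y):
    """Concordance Correlation Coefficient between predictions x and
    labels y: penalizes decorrelation *and* mean/scale mismatch."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

print(ccc([0.1, 0.5, 0.9], [0.1, 0.5, 0.9]))  # 1.0 (perfect agreement)
print(ccc([0.1, 0.5, 0.9], [0.3, 0.7, 1.1]))  # < 1, unlike Pearson's r
```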
- [476] arXiv:2408.15094 (replaced) [pdf, html, other]
-
Title: Constrained Diffusion Models via Dual TrainingComments: 31 pages, 4 figures, 4 tablesSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Diffusion models have attained prominence for their ability to synthesize a probability distribution for a given dataset via a diffusion process, enabling the generation of new data points with high fidelity. However, diffusion processes are prone to generating samples that reflect biases in a training dataset. To address this issue, we develop constrained diffusion models by imposing diffusion constraints based on desired distributions that are informed by requirements. Specifically, we cast the training of diffusion models under requirements as a constrained distribution optimization problem that aims to reduce the distribution difference between original and generated data while obeying constraints on the distribution of generated data. We show that our constrained diffusion models generate new data from a mixture data distribution that achieves the optimal trade-off among objective and constraints. To train constrained diffusion models, we develop a dual training algorithm and characterize the optimality of the trained constrained diffusion model. We empirically demonstrate the effectiveness of our constrained models in two constrained generation tasks: (i) we consider a dataset with one or more underrepresented classes where we train the model with constraints to ensure fairly sampling from all classes during inference; (ii) we fine-tune a pre-trained diffusion model to sample from a new dataset while avoiding overfitting.
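The dual training idea follows the standard primal-dual pattern: the primal step minimizes the Lagrangian-penalized loss, while the multiplier ascends on the constraint violation. A generic sketch under that assumption (not the paper's exact algorithm):

```python
def dual_train(primal_step, steps=1000, eta=0.01):
    """primal_step(lam): one training update minimizing
    loss + lam * constraint; returns the constraint gap E[c] - budget."""
    lam = 0.0
    for _ in range(steps):
        gap = primal_step(lam)             # primal descent
        lam = max(0.0, lam + eta * gap)    # projected dual ascent
    return lam
```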
- [477] arXiv:2408.15205 (replaced) [pdf, html, other]
-
Title: Leveraging Hallucinations to Reduce Manual Prompt Dependency in Promptable SegmentationComments: NeurIPS 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Promptable segmentation typically requires instance-specific manual prompts to guide the segmentation of each desired object. To minimize such a need, task-generic promptable segmentation has been introduced, which employs a single task-generic prompt to segment various images of different objects in the same task. Current methods use Multimodal Large Language Models (MLLMs) to reason detailed instance-specific prompts from a task-generic prompt for improving segmentation accuracy. The effectiveness of this segmentation heavily depends on the precision of these derived prompts. However, MLLMs often suffer hallucinations during reasoning, resulting in inaccurate prompting. While existing methods focus on eliminating hallucinations to improve a model, we argue that MLLM hallucinations can reveal valuable contextual insights when leveraged correctly, as they represent pre-trained large-scale knowledge beyond individual images. In this paper, we utilize hallucinations to mine task-related information from images and verify its accuracy for enhancing precision of the generated prompts. Specifically, we introduce an iterative Prompt-Mask Cycle generation framework (ProMaC) with a prompt generator and a mask this http URL prompt generator uses a multi-scale chain of thought prompting, initially exploring hallucinations for extracting extended contextual knowledge on a test this http URL hallucinations are then reduced to formulate precise instance-specific prompts, directing the mask generator to produce masks that are consistent with task semantics by mask semantic alignment. The generated masks iteratively induce the prompt generator to focus more on task-relevant image areas and reduce irrelevant hallucinations, resulting jointly in better prompts and masks. Experiments on 5 benchmarks demonstrate the effectiveness of ProMaC. Code given in this https URL.
- [478] arXiv:2408.15374 (replaced) [pdf, html, other]
-
Title: CycleGAN with Better CyclesComments: Technical Report 2018Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
CycleGAN provides a framework to train image-to-image translation with unpaired datasets using a cycle consistency loss [4]. While results are great in many applications, pixel-level cycle consistency can be problematic and cause unrealistic images in certain cases. In this project, we propose three simple modifications to cycle consistency, and show that such an approach achieves better results with fewer artifacts.
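For reference, the standard pixel-level cycle consistency term from the original CycleGAN, which this report proposes to relax, can be sketched as follows (G and F_inv stand for the two generators; the names are illustrative):

    import torch.nn.functional as nnf

    def cycle_consistency_loss(G, F_inv, real_x, real_y, lambda_cyc=10.0):
        # x -> G(x) -> F_inv(G(x)) should reconstruct x, and symmetrically for y.
        forward_cycle = nnf.l1_loss(F_inv(G(real_x)), real_x)
        backward_cycle = nnf.l1_loss(G(F_inv(real_y)), real_y)
        return lambda_cyc * (forward_cycle + backward_cycle)

Because the L1 penalty is applied pixel by pixel, any translation that legitimately changes local texture or geometry is punished, which is the source of the artifacts the report targets.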
- [479] arXiv:2408.17135 (replaced) [pdf, html, other]
-
Title: TIMotion: Temporal and Interactive Framework for Efficient Human-Human Motion GenerationComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Human-human motion generation is essential for understanding humans as social beings. Current methods fall into two main categories: single-person-based methods and separate modeling-based methods. To delve into this field, we abstract the overall generation process into a general framework, MetaMotion, which consists of two phases: temporal modeling and interaction mixing. For temporal modeling, the single-person-based methods concatenate two people into a single one directly, while the separate modeling-based methods skip the modeling of interaction sequences. The inadequate modeling described above results in sub-optimal performance and redundant model parameters. In this paper, we introduce TIMotion (Temporal and Interactive Modeling), an efficient and effective framework for human-human motion generation. Specifically, we first propose Causal Interactive Injection to model two separate sequences as a causal sequence leveraging the temporal and causal properties. Then we present Role-Evolving Scanning to adjust to the change in the active and passive roles throughout the interaction. Finally, to generate smoother and more rational motion, we design Localized Pattern Amplification to capture short-term motion patterns. Extensive experiments on InterHuman and InterX demonstrate that our method achieves superior performance. The project code will be released upon acceptance. Project page: this https URL
- [480] arXiv:2409.00146 (replaced) [pdf, html, other]
-
Title: Prioritized Information Bottleneck Theoretic Framework with Distributed Online Learning for Edge Video AnalyticsComments: Major revision in IEEE ToN. We conduct additional real-world experiments on various hardware platforms. arXiv admin note: text overlap with arXiv:2408.17047Subjects: Networking and Internet Architecture (cs.NI)
Collaborative perception systems leverage multiple edge devices, such as surveillance cameras or autonomous cars, to enhance sensing quality and eliminate blind spots. Despite their advantages, challenges such as limited channel capacity and data redundancy impede their effectiveness. To address these issues, we introduce the Prioritized Information Bottleneck (PIB) framework for edge video analytics. This framework prioritizes the shared data based on the signal-to-noise ratio (SNR) and camera coverage of the region of interest (RoI), reducing spatial-temporal data redundancy to transmit only essential information. This strategy avoids the need for video reconstruction at edge servers and maintains low latency. It leverages a deterministic information bottleneck method to extract compact, relevant features, balancing informativeness and communication costs. For high-dimensional data, we apply variational approximations for practical optimization. To reduce communication costs in fluctuating connections, we propose a gate mechanism based on distributed online learning (DOL) to filter out less informative messages and efficiently select edge servers. Moreover, we establish the asymptotic optimality of DOL by proving the sublinearity of its regret. To validate the effectiveness of the PIB framework, we conduct real-world experiments on three types of edge devices with varied computing capabilities. Compared to five coding methods for image and video compression, PIB improves mean object detection accuracy (MODA) by 17.8% while reducing communication costs by 82.65% under poor channel conditions.
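As context for the deterministic information bottleneck mentioned above, the standard objectives from the IB literature read as follows; the notation is assumed here (X: raw observations, Z: transmitted features, Y: task-relevant variable) and is not necessarily the paper's exact formulation:

    $\min_{p(z|x)} \; I(X;Z) - \beta \, I(Z;Y)$        (vanilla information bottleneck)
    $\min_{f:\, z=f(x)} \; H(Z) - \beta \, I(Z;Y)$     (deterministic variant, hard encoding)

Here $\beta$ trades the compactness of the transmitted features against how informative they remain for the downstream detection task, matching the balance between informativeness and communication costs described above.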
- [481] arXiv:2409.06255 (replaced) [pdf, html, other]
-
Title: Market Reaction to News Flows in Supply Chain NetworksSubjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
This study examines how positive and negative news about firms affects their stock prices and, moreover, how it affects stock prices of the firms' suppliers and clients, using a large sample of publicly listed firms around the world and another of Japanese listed firms. The level of positiveness and negativeness of each news article is determined by FinBERT, a natural language processing model fine-tuned specifically for financial information. Supply chains of firms across the world are identified mostly by financial statements, while those of Japanese firms are taken from large-scale firm-level surveys. We find that positive news increases the change rate of stock prices of firms mentioned in the news before its disclosure, most likely because of diffusion of information through private channels. Positive news also raises stock prices of the firms' suppliers and clients before its disclosure, confirming propagation of market values through supply chains. In addition, we generally find a larger post-news effect on stock prices of the mentioned firms and their suppliers and clients than the pre-news effect. The positive difference between the post- and pre-news effects can be considered as the net effect of the disclosure of positive news, controlling for information diffusion through private channels. However, the post-news effect on suppliers and clients in Japan is smaller than the pre-news effect, the opposite of the result for the global sample of firms.
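Article-level sentiment scoring with FinBERT can be sketched with the Hugging Face pipeline; "ProsusAI/finbert" is a public checkpoint, though the paper does not specify which FinBERT variant or scoring rule it uses:

    from transformers import pipeline

    classifier = pipeline("text-classification", model="ProsusAI/finbert")

    def news_score(text: str) -> float:
        # Map FinBERT's label to a signed positiveness score in [-1, 1].
        out = classifier(text[:512])[0]   # crude character-level truncation
        sign = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}[out["label"]]
        return sign * out["score"]

    print(news_score("The firm reported record quarterly earnings."))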
- [482] arXiv:2409.06617 (replaced) [pdf, html, other]
-
Title: When to Extract ReID Features: A Selective Approach for Improved Multiple Object TrackingComments: 8 pages, 5 figures. Presents a selective approach for ReID feature extraction in Multiple Object Tracking, reducing computational overhead while maintaining accuracy. Tested on StrongSORT and Deep OC-SORT using MOT17, MOT20, and DanceTrack datasets. Code: this https URL, this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Many state-of-the-art (SOTA) Multiple Object Tracking (MOT) methods extract and match Re-Identification (ReID) features, a strategy that is particularly effective against frequent and long-term occlusions. While end-to-end object detection and tracking have been the main focus of recent research, they have yet to outperform traditional methods in benchmarks like MOT17 and MOT20. Thus, from an application standpoint, methods with separate detection and embedding remain the best option for accuracy, modularity, and ease of implementation, though they are impractical for edge devices due to the overhead involved. In this paper, we investigate a selective approach to minimize the overhead of feature extraction while preserving accuracy, modularity, and ease of implementation. This approach can be integrated into various SOTA methods. We demonstrate its effectiveness by applying it to StrongSORT and Deep OC-SORT. Experiments on MOT17, MOT20, and DanceTrack datasets show that our mechanism retains the advantages of feature extraction during occlusions while significantly reducing runtime. Additionally, it improves accuracy by preventing confusion in the feature-matching stage, particularly in cases of deformation and appearance similarity, which are common in DanceTrack. this https URL, this https URL
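The general idea of selective extraction can be sketched as follows: run the costly ReID network only when association is ambiguous (here, when a detection overlaps another above a threshold), reusing cached embeddings otherwise. The exact trigger used by the paper may differ:

    def iou(a, b):
        # IoU of two boxes in (x1, y1, x2, y2) format.
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    def select_for_reid(boxes, iou_thr=0.3):
        # Indices of detections whose appearance features are worth extracting.
        picked = []
        for i, bi in enumerate(boxes):
            if any(iou(bi, bj) > iou_thr for j, bj in enumerate(boxes) if j != i):
                picked.append(i)   # crowded -> appearance cue needed
        return picked

Detections that pass the gate get fresh embeddings; all others keep the embedding from the last frame in which they were extracted, which is where the runtime saving comes from.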
- [483] arXiv:2409.07208 (replaced) [pdf, html, other]
-
Title: Almost-catalytic ComputationComments: 22 pages, A new lower bound on the subcube partition complexity of Hamming balls (Proposition 2.6 and Lemma 2.7), improving the bound and fixing an error in the previous versionSubjects: Computational Complexity (cs.CC)
Designing algorithms for space bounded models with restoration requirements on the space used by the algorithm is an important challenge posed about the catalytic computation model introduced by Buhrman et al. (2014). Motivated by scenarios where we do not need to restore unless it is useful, we define $ACL(A)$ to be the class of languages that can be accepted by almost-catalytic Turing machines with respect to $A$ (which we call the catalytic set), that use at most $c\log n$ work space and $n^c$ catalytic space.
We show that if there are almost-catalytic algorithms for a problem with catalytic set as $A \subseteq \Sigma^*$ and its complement respectively, then the problem can be solved by a ZPP algorithm. Using this, we derive that to design catalytic algorithms, it suffices to design almost-catalytic algorithms where the catalytic set is the set of strings of odd weight ($PARITY$). Towards this, we consider two complexity measures of the set $A$ which are maximized for $PARITY$ - random projection complexity (${\cal R}(A)$) and the subcube partition complexity (${\cal P}(A)$).
By making use of error-correcting codes, we show that for all $k \ge 1$, there is a language $A_k \subseteq \Sigma^*$ such that $DSPACE(n^k) \subseteq ACL(A_k)$ where for every $m \ge 1$, $\mathcal{R}(A_k \cap \{0,1\}^m) \ge \frac{m}{4}$ and $\mathcal{P}(A_k \cap \{0,1\}^m)=2^{m/4}$. This contrasts the catalytic machine model where it is unclear if it can accept all languages in $DSPACE(\log^{1+\epsilon} n)$ for any $\epsilon > 0$.
Improving the partition complexity of the catalytic set $A$ further, we show that for all $k \ge 1$, there is a $A_k \subseteq \{0,1\}^*$ such that $\mathsf{DSPACE}(\log^k n) \subseteq ACL(A_k)$ where for every $m \ge 1$, $\mathcal{R}(A_k \cap \{0,1\}^m) \ge \frac{m}{4}$ and $\mathcal{P}(A_k \cap \{0,1\}^m)=2^{m/4+\Omega(\log m)}$.
- [484] arXiv:2409.07779 (replaced) [pdf, html, other]
-
Title: AFFSegNet: Adaptive Feature Fusion Segmentation Network for Microtumors and Multi-Organ SegmentationFuchen Zheng, Xinyi Chen, Xuhang Chen, Haolun Li, Xiaojiao Guo, Guoheng Huang, Chi-Man Pun, Shoujun ZhouComments: 8 pages, 4 figures, 3 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Medical image segmentation, a crucial task in computer vision, facilitates the automated delineation of anatomical structures and pathologies, supporting clinicians in diagnosis, treatment planning, and disease monitoring. Notably, transformers employing shifted window-based self-attention have demonstrated exceptional performance. However, their reliance on local window attention limits the fusion of local and global contextual information, crucial for segmenting microtumors and miniature organs. To address this limitation, we propose the Adaptive Semantic Segmentation Network (ASSNet), a transformer architecture that effectively integrates local and global features for precise medical image segmentation. ASSNet comprises a transformer-based U-shaped encoder-decoder network. The encoder utilizes shifted window self-attention across five resolutions to extract multi-scale features, which are then propagated to the decoder through skip connections. We introduce an augmented multi-layer perceptron within the encoder to explicitly model long-range dependencies during feature extraction. Recognizing the constraints of conventional symmetrical encoder-decoder designs, we propose an Adaptive Feature Fusion (AFF) decoder to complement our encoder. This decoder incorporates three key components: the Long Range Dependencies (LRD) block, the Multi-Scale Feature Fusion (MFF) block, and the Adaptive Semantic Center (ASC) block. These components synergistically facilitate the effective fusion of multi-scale features extracted by the encoder while capturing long-range dependencies and refining object boundaries. Comprehensive experiments on diverse medical image segmentation tasks, including multi-organ, liver tumor, and bladder tumor segmentation, demonstrate that ASSNet achieves state-of-the-art results. Code and models are available at: \url{this https URL}.
- [485] arXiv:2409.08889 (replaced) [pdf, other]
-
Title: Extending the Benefits of Parallel Elasticity across Multiple Actuation Tasks: A Geometric and Optimization-Based ApproachSubjects: Robotics (cs.RO)
A spring in parallel with an effort source (e.g., electric motor or human muscle) can reduce its energy consumption and effort (i.e., torque or force) depending on the spring stiffness, spring preload, and actuation task. However, selecting the spring stiffness and preload that guarantees effort or energy reduction for an arbitrary set of tasks is a design challenge. This work formulates a convex optimization problem to guarantee that a parallel spring reduces the root-mean-square source effort or energy consumption for multiple tasks. Specifically, we guarantee the benefits across multiple tasks by enforcing a set of convex quadratic constraints in our optimization variables, the parallel spring stiffness and preload. These quadratic constraints are equivalent to ellipses in the stiffness and preload plane; any combination of stiffness and preload inside the ellipse represents a parallel spring that minimizes effort source or energy consumption with respect to an actuator without a spring. This geometric interpretation intuitively guides the stiffness and preload selection process. We analytically and experimentally show that the source effort and energy consumption are convex quadratic functions of the spring stiffness and preload. As applications, we analyze the stiffness and preload selection of a parallel spring for a knee exoskeleton using human muscle as the effort source and a prosthetic ankle powered by electric motors. To promote adoption, the optimization and geometric methods are available as supplemental open-source software that can be executed in a web browser.
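Why the RMS effort is a convex quadratic in stiffness and preload can be seen directly: under one sign convention, the motor supplies tau_m(t) = tau(t) - (k*q(t) + p), so the mean of tau_m^2 is quadratic in (k, p) with a positive-definite Hessian. A sketch with illustrative trajectories (not the paper's data):

    import numpy as np

    t = np.linspace(0.0, 1.0, 500)
    q = 0.4 * np.sin(2 * np.pi * t)             # joint angle trajectory [rad]
    tau = 30.0 * np.sin(2 * np.pi * t + 0.3)    # required joint torque [Nm]

    def rms_motor_torque(k, p):
        # Motor torque = task torque minus parallel-spring torque (k*q + p).
        return np.sqrt(np.mean((tau - (k * q + p)) ** 2))

    baseline = rms_motor_torque(0.0, 0.0)       # actuator without a spring
    print(rms_motor_torque(70.0, 0.0), "vs", baseline)

The sublevel set {(k, p) : rms_motor_torque(k, p) <= baseline} is exactly an ellipse in the (k, p) plane, matching the geometric interpretation in the abstract.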
- [486] arXiv:2409.10399 (replaced) [pdf, html, other]
-
Title: Lattice Boltzmann framework for multiphase flows by Eulerian-Eulerian Navier-Stokes equationsComments: 42 pages, preliminary LBM framework for multiphase flows, suggested procedure in section 2 extended, numerical validation in section 3 addedSubjects: Numerical Analysis (math.NA); Fluid Dynamics (physics.flu-dyn)
Although the Lattice Boltzmann Method (LBM) is relatively straightforward, it demands a well-crafted framework to handle the complex partial differential equations involved in multiphase flow simulations. For the first time to our knowledge, this work proposes a novel LBM framework for solving Eulerian-Eulerian multiphase flow equations without any finite-difference correction. The proposed methodology and all reported LBM formulas can already be applied in any dimension. This opens a promising avenue for simulating multiphase flows on large High Performance Computing (HPC) facilities and on novel parallel hardware. This LBM framework consists of six coupled LBM schemes - running on the same lattice - ensuring an efficient implementation in large codes with minimum effort. The preliminary numerical results agree excellently with a reference numerical solution obtained by a traditional finite-difference solver.
- [487] arXiv:2409.15371 (replaced) [pdf, html, other]
-
Title: Bone: Block-Affine Adaptation of Large Language ModelsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Low-Rank Adaptation (LoRA) has achieved remarkable training results by freezing the original weights and training only low-rank matrices, establishing itself as the predominant fine-tuning method for LLMs. In pursuit of performance closer to full-parameter training, a series of LoRA variants have emerged, such as LoRA+, PISSA, Olora, and LoRA-GA. This paper introduces a novel PEFT technique distinct from LoRA, called Block-Affine Adaptation (Bone). By dividing the original weights into multiple subspaces that share a single matrix for weight updates, Bone simplifies the process by requiring the trainable matrix to be initialized to zero, eliminating the need for complex initialization as in some LoRA variants. Compared to LoRA, Bone significantly reduces memory usage and achieves faster computation. Evaluations on both NLU and NLG tasks demonstrate that Bone substantially outperforms LoRA and its variants. Inspired by Pissa, we further propose the ``Weight Guide'' theory to better utilize the information from the original weights. By integrating ``Weight Guide'' with Bone, we develop a new structure called Block-Affine Transformation (Bat), and ablation experiments confirm the effectiveness of ``Weight Guide''.
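Based only on the abstract's description — a frozen weight split into subspaces that all share one zero-initialized trainable update matrix — a speculative sketch of such a layer might look as follows; the actual Bone parameterization in the paper may differ in detail:

    import torch
    import torch.nn as nn

    class BlockAffineLinear(nn.Module):
        # Frozen weight W split row-wise into blocks of size `block`; every
        # block shares the same trainable, zero-initialized update matrix.
        def __init__(self, weight: torch.Tensor, block: int = 64):
            super().__init__()
            assert weight.shape[0] % block == 0
            self.weight = nn.Parameter(weight, requires_grad=False)
            self.bone = nn.Parameter(torch.zeros(block, weight.shape[1]))

        def forward(self, x):
            reps = self.weight.shape[0] // self.bone.shape[0]
            w = self.weight + self.bone.repeat(reps, 1)
            return x @ w.t()

    layer = BlockAffineLinear(torch.randn(256, 128), block=64)
    print(layer(torch.randn(2, 128)).shape)   # torch.Size([2, 256])

Zero initialization makes the adapted layer start exactly at the pre-trained function, which is why no special initialization scheme is needed.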
- [488] arXiv:2409.15735 (replaced) [pdf, html, other]
-
Title: Boosting Cybersecurity Vulnerability Scanning based on LLM-supported Static Application Security TestingComments: Under review at IEEE SaTML 2024Subjects: Cryptography and Security (cs.CR)
The current cybersecurity landscape is increasingly complex, with traditional Static Application Security Testing (SAST) tools struggling to capture complex and emerging vulnerabilities due to their reliance on rule-based matching. Meanwhile, Large Language Models (LLMs) have demonstrated powerful code analysis capabilities, but their static training data and privacy risks limit their effectiveness. To overcome the limitations of both approaches, we propose LSAST, a novel approach that integrates LLMs with SAST scanners to enhance vulnerability detection. LSAST leverages a locally hostable LLM, combined with a state-of-the-art knowledge retrieval system, to provide up-to-date vulnerability insights without compromising data privacy. We set a new benchmark for static vulnerability analysis, offering a robust, privacy-conscious solution that bridges the gap between traditional scanners and advanced AI-driven analysis. Our evaluation demonstrates that incorporating SAST results into LLM analysis significantly improves detection accuracy, identifying vulnerabilities missed by conventional methods.
- [489] arXiv:2409.16845 (replaced) [pdf, html, other]
-
Title: IRASNet: Improved Feature-Level Clutter Reduction for Domain Generalized SAR-ATRComments: 16 pages, 11 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recently, computer-aided design models and electromagnetic simulations have been used to augment synthetic aperture radar (SAR) data for deep learning. However, an automatic target recognition (ATR) model struggles with domain shift when using synthetic data because the model learns specific clutter patterns present in such data, which disturbs performance when applied to measured data with different clutter distributions. This study proposes a framework particularly designed for domain-generalized SAR-ATR called IRASNet, enabling effective feature-level clutter reduction and domain-invariant feature learning. First, we propose a clutter reduction module (CRM) that maximizes the signal-to-clutter ratio on feature maps. The module reduces the impact of clutter at the feature level while preserving target and shadow information, thereby improving ATR performance. Second, we integrate adversarial learning with CRM to extract clutter-reduced domain-invariant features. The integration bridges the gap between synthetic and measured datasets without requiring measured data during training. Third, we improve feature extraction from target and shadow regions by implementing a positional supervision task using mask ground truth encoding. The improvement enhances the ability of the model to discriminate between classes. Our proposed IRASNet achieves new state-of-the-art performance on public SAR datasets by utilizing target and shadow information across various test conditions. IRASNet not only enhances generalization performance but also significantly improves feature-level clutter reduction, making it a valuable advancement in the field of radar image pattern recognition.
- [490] arXiv:2409.19345 (replaced) [pdf, other]
-
Title: Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and GeneralizationSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Transformers have demonstrated great power in the recent development of large foundational models. In particular, the Vision Transformer (ViT) has brought revolutionary changes to the field of vision, achieving significant accomplishments on the experimental side. However, their theoretical capabilities, particularly in terms of generalization when trained to overfit training data, are still not fully understood. To address this gap, this work delves deeply into the benign overfitting perspective of transformers in vision. To this end, we study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model. By developing techniques that address the challenges posed by softmax and the interdependent nature of multiple weights in transformer optimization, we successfully characterize the training dynamics and obtain generalization guarantees after training. Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model. The theoretical results are further verified by experimental simulation. To the best of our knowledge, this is the first work to characterize benign overfitting for Transformers.
- [491] arXiv:2410.00434 (replaced) [pdf, html, other]
-
Title: Rapid Integration of LLMs in Healthcare Raises Ethical Concerns: An Investigation into Deceptive Patterns in Social RobotsComments: 7 pages, 1 table, 1 figureSubjects: Computers and Society (cs.CY); Robotics (cs.RO)
Conversational agents are increasingly used in healthcare, and the integration of Large Language Models (LLMs) has significantly enhanced their capabilities. When integrated into social robots, LLMs offer the potential for more natural interactions. However, while LLMs promise numerous benefits, they also raise critical ethical concerns, particularly around the issue of hallucinations and deceptive patterns. In this case study, we observed a critical pattern of deceptive behavior in commercially available LLM-based care software integrated into robots. The LLM-equipped robot falsely claimed to have medication reminder functionalities. Not only did these systems assure users of their ability to manage medication schedules, but they also proactively suggested this capability, despite lacking it. This deceptive behavior poses significant risks in healthcare environments, where reliability is paramount. Our findings highlight the ethical and safety concerns surrounding the deployment of LLM-integrated robots in healthcare, emphasizing the need for oversight to prevent potentially harmful consequences for vulnerable populations.
- [492] arXiv:2410.01544 (replaced) [pdf, html, other]
-
Title: Boosting Weakly-Supervised Referring Image Segmentation via Progressive ComprehensionComments: Accepted by NeurIPS2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
This paper explores the weakly-supervised referring image segmentation (WRIS) problem, and focuses on a challenging setup where target localization is learned directly from image-text pairs. We note that the input text description typically already contains detailed information on how to localize the target object, and we also observe that humans often follow a step-by-step comprehension process (i.e., progressively utilizing target-related attributes and relations as cues) to identify the target object. Hence, we propose a novel Progressive Comprehension Network (PCNet) to leverage target-related textual cues from the input description for progressively localizing the target object. Specifically, we first use a Large Language Model (LLM) to decompose the input text description into short phrases. These short phrases are taken as target-related cues and fed into a Conditional Referring Module (CRM) in multiple stages, to allow updating the referring text embedding and enhance the response map for target localization in a multi-stage manner. Based on the CRM, we then propose a Region-aware Shrinking (RaS) loss to constrain the visual localization to be conducted progressively in a coarse-to-fine manner across different stages. Finally, we introduce an Instance-aware Disambiguation (IaD) loss to suppress instance localization ambiguity by differentiating overlapping response maps generated by different referring texts on the same image. Extensive experiments show that our method outperforms SOTA methods on three common benchmarks.
- [493] arXiv:2410.01966 (replaced) [pdf, html, other]
-
Title: Enhancing Screen Time Identification in Children with a Multi-View Vision Language Model and Screen Time TrackerXinlong Hou, Sen Shen, Xueshen Li, Xinran Gao, Ziyi Huang, Steven J. Holiday, Matthew R. Cribbet, Susan W. White, Edward Sazonov, Yu GanComments: Prepare for submissionSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Being able to accurately monitor the screen exposure of young children is important for research on phenomena linked to screen use such as childhood obesity, physical activity, and social interaction. Most existing studies rely upon self-report or manual measures from bulky wearable sensors, thus lacking efficiency and accuracy in capturing quantitative screen exposure data. In this work, we developed a novel sensor informatics framework that utilizes egocentric images from a wearable sensor, termed the screen time tracker (STT), and a vision language model (VLM). In particular, we devised a multi-view VLM that takes multiple views from egocentric image sequences and interprets screen exposure dynamically. We validated our approach by using a dataset of children's free-living activities, demonstrating significant improvement over existing approaches based on plain vision language models and object detection models. Results supported the promise of this monitoring approach, which could optimize behavioral research on screen exposure in children's naturalistic settings.
- [494] arXiv:2410.04683 (replaced) [pdf, other]
-
Title: Towards Measuring Goal-Directedness in AI SystemsComments: Updated acknowledgementsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Recent advances in deep learning have brought attention to the possibility of creating advanced, general AI systems that outperform humans across many tasks. However, if these systems pursue unintended goals, there could be catastrophic consequences. A key prerequisite for AI systems pursuing unintended goals is whether they will behave in a coherent and goal-directed manner in the first place, optimizing for some unknown goal; there exists significant research trying to evaluate systems for said behaviors. However, the most rigorous definitions of goal-directedness we currently have are difficult to compute in real-world settings. Drawing upon this previous literature, we explore policy goal-directedness within reinforcement learning (RL) environments. We propose a new family of definitions of the goal-directedness of a policy, which analyze whether the policy is well-modeled as near-optimal for many (sparse) reward functions. We operationalize this preliminary definition of goal-directedness and test it in toy Markov decision process (MDP) environments. Furthermore, we explore how goal-directedness could be measured in frontier large language models (LLMs). Our contribution is a definition of goal-directedness that is simpler and more easily computable, in order to approach the question of whether AI systems could pursue dangerous goals. We recommend further exploration of measuring coherence and goal-directedness, based on our findings.
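A toy operationalization of this idea: score a policy by how often it is near-optimal across many randomly drawn sparse reward functions on a small MDP. The thresholds and reward sparsity below are illustrative choices, not the paper's:

    import numpy as np

    rng = np.random.default_rng(0)
    S, A, gamma = 6, 2, 0.9
    P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel, shape (S, A, S)

    def value_iteration(R, iters=200):
        Q = np.zeros((S, A))
        for _ in range(iters):
            Q = R[:, None] + gamma * P @ Q.max(axis=1)
        return Q

    def policy_value(pi, R, iters=200):
        V = np.zeros(S)
        for _ in range(iters):
            V = np.array([R[s] + gamma * P[s, pi[s]] @ V for s in range(S)])
        return V.mean()

    def goal_directedness(pi, n_rewards=200, tol=0.9):
        hits = 0
        for _ in range(n_rewards):
            R = np.zeros(S)
            R[rng.integers(S)] = 1.0             # sparse reward: one goal state
            v_opt = value_iteration(R).max(axis=1).mean()
            hits += policy_value(pi, R) >= tol * v_opt
        return hits / n_rewards

    goal_seeking = value_iteration((np.arange(S) == S - 1).astype(float)).argmax(axis=1)
    random_pi = rng.integers(A, size=S)
    print(goal_directedness(goal_seeking), goal_directedness(random_pi))

The comparison between a policy optimized for one goal and a uniformly random one illustrates the intended contrast: coherent goal pursuit should register as near-optimality under many reward hypotheses.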
- [495] arXiv:2410.07866 (replaced) [pdf, html, other]
-
Title: System 2 Reasoning via Generality and AdaptationComments: Accepted by NeurIPS 2024 Workshop on System 2 Reasoning at ScaleSubjects: Artificial Intelligence (cs.AI)
While significant progress has been made in task-specific applications, current models struggle with deep reasoning, generality, and adaptation -- key components of System 2 reasoning that are crucial for achieving Artificial General Intelligence (AGI). Despite the promise of approaches such as program synthesis, language models, and transformers, these methods often fail to generalize beyond their training data and to adapt to novel tasks, limiting their ability to perform human-like reasoning. This paper explores the limitations of existing approaches in achieving advanced System 2 reasoning and highlights the importance of generality and adaptation for AGI. Moreover, we propose four key research directions to address these gaps: (1) learning human intentions from action sequences, (2) combining symbolic and neural models, (3) meta-learning for unfamiliar environments, and (4) reinforcement learning for multi-step reasoning. Through these directions, we aim to advance the ability to generalize and adapt, bringing computational models closer to the reasoning capabilities required for AGI.
- [496] arXiv:2410.12346 (replaced) [pdf, html, other]
-
Title: Efficient Diffusion as Low Light EnhancerComments: 8 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV)
The computational burden of the iterative sampling process remains a major challenge in diffusion-based Low-Light Image Enhancement (LLIE). Current acceleration methods, whether training-based or training-free, often lead to significant performance degradation, highlighting the trade-off between performance and efficiency. In this paper, we identify two primary factors contributing to performance degradation: fitting errors and the inference gap. Our key insight is that fitting errors can be mitigated by linearly extrapolating the incorrect score functions, while the inference gap can be reduced by shifting the Gaussian flow to a reflectance-aware residual space. Based on the above insights, we design the Reflectance-Aware Trajectory Refinement (RATR) module, a simple yet effective module to refine the teacher trajectory using the reflectance component of images. Following this, we introduce \textbf{Re}flectance-aware \textbf{D}iffusion with \textbf{Di}stilled \textbf{T}rajectory (\textbf{ReDDiT}), an efficient and flexible distillation framework tailored for LLIE. Our framework achieves comparable performance to previous diffusion-based methods with redundant steps in just 2 steps while establishing new state-of-the-art (SOTA) results with 8 or 4 steps. Comprehensive experimental evaluations on 10 benchmark datasets validate the effectiveness of our method, consistently outperforming existing SOTA methods.
- [497] arXiv:2410.14742 (replaced) [pdf, html, other]
-
Title: ArrivalNet: Predicting City-wide Bus/Tram Arrival Time with Two-dimensional Temporal Variation ModelingSubjects: Machine Learning (cs.LG)
Accurate arrival time prediction (ATP) of buses and trams plays a crucial role in public transport operations. Current methods focus on modeling one-dimensional temporal information but overlook the latent periodic information within time series. Moreover, most studies develop algorithms for ATP based on a single or a few routes of public transport, which reduces the transferability of the prediction models and their applicability in public transport management systems. To this end, this paper proposes \textit{ArrivalNet}, a two-dimensional temporal variation-based multi-step ATP for buses and trams. It decomposes the one-dimensional temporal sequence into intra-periodic and inter-periodic variations, which can be recast into two-dimensional tensors (2D blocks). Each row of a tensor contains the time points within a period, and each column involves the time points at the same intra-periodic index across various periods. The transformed 2D blocks in different frequencies have an image-like feature representation that enables effective learning with computer vision backbones (e.g., convolutional neural networks). Drawing on the concept of residual neural networks, the 2D block module is designed as a basic module for flexible aggregation. Meanwhile, contextual factors like workdays, peak hours, and intersections are also utilized in the augmented feature representation to improve the performance of prediction. 125 days of public transport data from Dresden were collected for model training and validation. Experimental results show that the root mean square error, mean absolute error, and mean absolute percentage error of the proposed predictor decrease by at least 6.1\%, 14.7\%, and 34.2\% compared with state-of-the-art baseline methods.
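The 1D-to-2D folding described above can be sketched as follows: detect a dominant period with the FFT and reshape the series so rows are periods and columns align intra-period phases (a TimesNet-style folding; the details here are illustrative):

    import numpy as np

    def fold_to_2d(x: np.ndarray):
        spec = np.abs(np.fft.rfft(x - x.mean()))
        spec[0] = 0.0                          # ignore the DC component
        k = max(1, spec.argmax())              # dominant frequency bin
        period = max(1, round(len(x) / k))
        n = (len(x) // period) * period        # truncate to whole periods
        return x[:n].reshape(-1, period)       # rows: periods, cols: phase

    x = np.sin(2 * np.pi * np.arange(240) / 24) + 0.1 * np.random.randn(240)
    print(fold_to_2d(x).shape)                 # (10, 24) for a period of 24

Once folded, the 2D block can be fed to any image backbone, since neighboring entries are adjacent both within a period (columns) and across periods (rows).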
- [498] arXiv:2410.16881 (replaced) [pdf, html, other]
-
Title: Just In Time TransformersSubjects: Machine Learning (cs.LG)
Precise energy load forecasting in residential households is crucial for mitigating carbon emissions and enhancing energy efficiency; indeed, accurate forecasting enables utility companies and policymakers, who advocate sustainable energy practices, to optimize resource utilization. Moreover, smart meters provide valuable information by allowing for granular insights into consumption patterns. Building upon available smart meter data, our study aims to cluster consumers into distinct groups according to their energy usage behaviours, effectively capturing a diverse spectrum of consumption patterns. Next, we design JITtrans (Just In Time transformer), a novel transformer deep learning model that significantly improves energy consumption forecasting accuracy compared with traditional forecasting methods. Extensive experimental results validate our claims using proprietary smart meter data. Our findings highlight the potential of advanced predictive technologies to revolutionize energy management and advance sustainable power systems: the development of efficient and eco-friendly energy solutions critically depends on such technologies.
- [499] arXiv:2410.18707 (replaced) [pdf, html, other]
-
Title: Disjoint Projected Enumeration for SAT and SMT without Blocking ClausesComments: arXiv admin note: text overlap with arXiv:2306.00461; extended journal version of arXiv:2306.00461Subjects: Logic in Computer Science (cs.LO)
All-Solution Satisfiability (AllSAT) and its extension, All-Solution Satisfiability Modulo Theories (AllSMT), have become more relevant in recent years, mainly in formal verification and artificial intelligence applications. The goal of these problems is the enumeration of all satisfying assignments of a formula (for SAT and SMT problems, respectively), making them useful for test generation, model checking, and probabilistic inference. Nevertheless, traditional AllSAT algorithms face significant computational challenges due to the exponential growth of the search space and inefficiencies caused by blocking clauses, which cause memory blowups and degrade unit propagation performance in the long term. This paper presents two novel solvers: tabularAllSAT, a projected AllSAT solver, and tabularAllSMT, a projected AllSMT solver. Both solvers combine Conflict-Driven Clause Learning (CDCL) with chronological backtracking to improve efficiency while ensuring disjoint enumeration. To retrieve compact partial assignments we propose a novel aggressive implicant shrinking algorithm, compatible with chronological backtracking, to minimize the number of partial assignments, reducing overall search complexity. Furthermore, we extend the solver framework to handle projected enumeration and SMT formulas effectively and efficiently, adapting the baseline framework to integrate theory reasoning and the distinction between important and non-important variables. An extensive experimental evaluation demonstrates the superiority of our approach compared to state-of-the-art solvers, particularly in scenarios requiring projection and SMT-based reasoning.
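The classic blocking-clause enumeration loop that this paper argues against is easy to state: each found model is excluded by adding its negation as a clause, so the formula grows with every model and unit propagation slows over time. A sketch using the PySAT package (pip install python-sat):

    from pysat.solvers import Solver

    def all_models_blocking(clauses, n_vars):
        models = []
        with Solver(bootstrap_with=clauses) as s:
            while s.solve():
                m = s.get_model()
                models.append(m)
                s.add_clause([-lit for lit in m])   # block this exact assignment
        return models

    # (x1 or x2) and (not x1 or x2): two models over two variables
    print(all_models_blocking([[1, 2], [-1, 2]], 2))

Chronological backtracking, as used by tabularAllSAT, avoids adding these clauses altogether, which is where the memory and propagation savings come from.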
- [500] arXiv:2410.18808 (replaced) [pdf, html, other]
-
Title: Delving into the Reversal Curse: How Far Can Large Language Models Generalize?Zhengkai Lin, Zhihang Fu, Kai Liu, Liang Xie, Binbin Lin, Wenxiao Wang, Deng Cai, Yue Wu, Jieping YeComments: Accepted at NeurIPS 2024. Our code and data are available at this https URLSubjects: Computation and Language (cs.CL)
While large language models (LLMs) showcase unprecedented capabilities, they also exhibit certain inherent limitations when facing seemingly trivial tasks. A prime example is the recently debated "reversal curse", which surfaces when models, having been trained on the fact "A is B", struggle to generalize this knowledge to infer that "B is A". In this paper, we examine the manifestation of the reversal curse across various tasks and delve into both the generalization abilities and the problem-solving mechanisms of LLMs. This investigation leads to a series of significant insights: (1) LLMs are able to generalize to "B is A" when both A and B are presented in the context as in the case of a multiple-choice question. (2) This generalization ability is highly correlated to the structure of the fact "A is B" in the training documents. For example, this generalization only applies to biographies structured in "[Name] is [Description]" but not to "[Description] is [Name]". (3) We propose and verify the hypothesis that LLMs possess an inherent bias in fact recalling during knowledge application, which explains and underscores the importance of the document structure to successful learning. (4) The negative impact of this bias on the downstream performance of LLMs can hardly be mitigated through training alone. These findings offer a novel perspective on interpreting LLMs' generalization through their intrinsic mechanisms and provide insights for developing more effective learning methods. Our code and data are available at this https URL.
- [501] arXiv:2410.18970 (replaced) [pdf, html, other]
-
Title: ConceptDrift: Uncovering Biases through the Lens of Foundation ModelsCristian Daniel Păduraru, Antonio Bărbălau, Radu Filipescu, Andrei Liviu Nicolicioiu, Elena BurceanuComments: 8 pages, 4 figures, 6 tables, under reviewSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
An important goal of ML research is to identify and mitigate unwanted biases intrinsic to datasets and already incorporated into pre-trained models. Previous approaches have identified biases using highly curated validation subsets, that require human knowledge to create in the first place. This limits the ability to automate the discovery of unknown biases in new datasets. We solve this by using interpretable vision-language models, combined with a filtration method using LLMs and known concept hierarchies. More precisely, for a given dataset, we use pre-trained CLIP models, which have an associated embedding for each class, and track how each class embedding drifts through learning towards embeddings that disclose hidden biases. We call this approach ConceptDrift and show that it can be scaled to automatically identify biases in datasets like ImageNet without human prior knowledge. We propose two bias identification evaluation protocols to fill the gap in the previous work and show that our method significantly improves over SoTA methods, both using our protocol and classical evaluations. Alongside validating the identified biases, we also show that they can be leveraged to improve the performance of different methods. Our method is not bounded to a single modality, and we empirically validate it both on image (Waterbirds, CelebA, ImageNet), and text datasets (CivilComments).
- [502] arXiv:2410.20772 (replaced) [pdf, html, other]
-
Title: Introducing Spectral Attention for Long-Range Dependency in Time Series ForecastingComments: Co-first Author: Bong Gyun Kang, Dongjun Lee. NeurIPS 2024 (Conference on Neural Information Processing Systems)Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Sequence modeling faces challenges in capturing long-range dependencies across diverse tasks. Recent linear and transformer-based forecasters have shown superior performance in time series forecasting. However, they are constrained by their inherent inability to effectively address long-range dependencies in time series data, primarily due to using fixed-size inputs for prediction. Furthermore, they typically sacrifice essential temporal correlation among consecutive training samples by shuffling them into mini-batches. To overcome these limitations, we introduce a fast and effective Spectral Attention mechanism, which preserves temporal correlations among samples and facilitates the handling of long-range information while maintaining the base model structure. Spectral Attention preserves long-period trends through a low-pass filter and facilitates gradient to flow between samples. Spectral Attention can be seamlessly integrated into most sequence models, allowing models with fixed-sized look-back windows to capture long-range dependencies over thousands of steps. Through extensive experiments on 11 real-world time series datasets using 7 recent forecasting models, we consistently demonstrate the efficacy of our Spectral Attention mechanism, achieving state-of-the-art results.
- [503] arXiv:2410.22135 (replaced) [pdf, html, other]
-
Title: Lightweight Frequency Masker for Cross-Domain Few-Shot Semantic SegmentationComments: Accepted by NeurIPS 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cross-domain few-shot segmentation (CD-FSS) is proposed to first pre-train the model on a large-scale source-domain dataset, and then transfer the model to data-scarce target-domain datasets for pixel-level segmentation. The significant domain gap between the source and target datasets leads to a sharp decline in the performance of existing few-shot segmentation (FSS) methods in cross-domain scenarios. In this work, we discover an intriguing phenomenon: simply filtering different frequency components for target domains can lead to a significant performance improvement, sometimes even as high as 14% mIoU. Then, we delve into this phenomenon for an interpretation, and find such improvements stem from the reduced inter-channel correlation in feature maps, which benefits CD-FSS with enhanced robustness against domain gaps and larger activated regions for segmentation. Based on this, we propose a lightweight frequency masker, which further reduces channel correlations by an Amplitude-Phase Masker (APM) module and an Adaptive Channel Phase Attention (ACPA) module. Notably, APM introduces only 0.01% additional parameters but improves the average performance by over 10%, and ACPA imports only 2.5% parameters but further improves the performance by over 1.5%, which significantly surpasses the state-of-the-art CD-FSS methods.
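The abstract does not spell out the masker's internals, but a frequency-domain channel filter in the spirit of the APM module might look like the following torch sketch; the shapes and the learnable-mask parameterization are assumptions:

    import torch
    import torch.nn as nn

    class AmplitudePhaseMasker(nn.Module):
        # Decompose feature maps with a 2D FFT and apply lightweight learnable
        # masks to the amplitude and phase spectra.
        def __init__(self, channels, h, w):
            super().__init__()
            w_freq = w // 2 + 1                  # rfft2 output width
            self.amp_mask = nn.Parameter(torch.ones(channels, h, w_freq))
            self.phase_mask = nn.Parameter(torch.ones(channels, h, w_freq))

        def forward(self, feat):                 # feat: (B, C, H, W)
            spec = torch.fft.rfft2(feat)
            amp = spec.abs() * self.amp_mask
            phase = spec.angle() * self.phase_mask
            return torch.fft.irfft2(torch.polar(amp, phase), s=feat.shape[-2:])

    m = AmplitudePhaseMasker(64, 32, 32)
    print(m(torch.randn(2, 64, 32, 32)).shape)   # torch.Size([2, 64, 32, 32])

Per-channel masks of this size add only a tiny number of parameters relative to the backbone, consistent with the 0.01% overhead quoted above.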
- [504] arXiv:2410.23054 (replaced) [pdf, other]
-
Title: Controlling Language and Diffusion Models by Transporting ActivationsPau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, Xavier SuauSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed to control model generation by steering model activations in order to effectively induce or prevent the emergence of concepts or behaviors in the generated output. In this paper we introduce Activation Transport (AcT), a general framework to steer activations guided by optimal transport theory that generalizes many previous activation-steering works. AcT is modality-agnostic and provides fine-grained control over the model behavior with negligible computational overhead, while minimally impacting model abilities. We experimentally show the effectiveness and versatility of our approach by addressing key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is). For LLMs, we show that AcT can effectively mitigate toxicity, induce arbitrary concepts, and increase their truthfulness. In T2Is, we show how AcT enables fine-grained style control and concept negation.
- [505] arXiv:2410.23148 (replaced) [pdf, other]
-
Title: HiBO: Hierarchical Bayesian Optimization via Adaptive Search Space PartitioningComments: There are some ethically sensitive words to be further modified in this paper. Hope that we can withdraw it first and re-post it back after a further investigation into the related guidelinesSubjects: Machine Learning (cs.LG)
Optimizing black-box functions in high-dimensional search spaces has been known to be challenging for traditional Bayesian Optimization (BO). In this paper, we introduce HiBO, a novel hierarchical algorithm integrating global-level search space partitioning information into the acquisition strategy of a local BO-based optimizer. HiBO employs a search-tree-based global-level navigator to adaptively split the search space into partitions with different sampling potential. The local optimizer then utilizes this global-level information to guide its acquisition strategy towards most promising regions within the search space. A comprehensive set of evaluations demonstrates that HiBO outperforms state-of-the-art methods in high-dimensional synthetic benchmarks and presents significant practical effectiveness in the real-world task of tuning configurations of database management systems (DBMSs).
- [506] arXiv:2410.23245 (replaced) [pdf, html, other]
-
Title: PointRecon: Online Point-based 3D Reconstruction via Ray-based 2D-3D MatchingSubjects: Computer Vision and Pattern Recognition (cs.CV)
We propose a novel online, point-based 3D reconstruction method from posed monocular RGB videos. Our model maintains a global point cloud representation of the scene, continuously updating the features and 3D locations of points as new images are observed. It expands the point cloud with newly detected points while carefully removing redundancies. The point cloud updates and the depth predictions for new points are achieved through a novel ray-based 2D-3D feature matching technique, which is robust against errors in previous point position predictions. In contrast to offline methods, our approach processes infinite-length sequences and provides real-time updates. Additionally, the point cloud imposes no pre-defined resolution or scene size constraints, and its unified global representation ensures view consistency across perspectives. Experiments on the ScanNet dataset show that our method achieves quality comparable to other online MVS approaches. Project page: this https URL
- [507] arXiv:2410.24060 (replaced) [pdf, html, other]
-
Title: Understanding Generalizability of Diffusion Models Requires Rethinking the Hidden Gaussian StructureSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
In this work, we study the generalizability of diffusion models by looking into the hidden properties of the learned score functions, which are essentially a series of deep denoisers trained on various noise levels. We observe that as diffusion models transition from memorization to generalization, their corresponding nonlinear diffusion denoisers exhibit increasing linearity. This discovery leads us to investigate the linear counterparts of the nonlinear diffusion models, which are a series of linear models trained to match the function mappings of the nonlinear diffusion denoisers. Surprisingly, these linear denoisers are approximately the optimal denoisers for a multivariate Gaussian distribution characterized by the empirical mean and covariance of the training dataset. This finding implies that diffusion models have the inductive bias towards capturing and utilizing the Gaussian structure (covariance information) of the training dataset for data generation. We empirically demonstrate that this inductive bias is a unique property of diffusion models in the generalization regime, which becomes increasingly evident when the model's capacity is relatively small compared to the training dataset size. In the case that the model is highly overparameterized, this inductive bias emerges during the initial training phases before the model fully memorizes its training data. Our study provides crucial insights into understanding the notable strong generalization phenomenon recently observed in real-world diffusion models.
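The "optimal denoiser for a multivariate Gaussian" referred to above has a standard closed form, the posterior mean under additive Gaussian noise: for x ~ N(mu, Sigma) and y = x + n with n ~ N(0, sigma^2 I), E[x | y] = mu + Sigma (Sigma + sigma^2 I)^{-1} (y - mu). A sketch computing it from the empirical statistics of a dataset:

    import numpy as np

    def gaussian_denoiser(y, mu, Sigma, sigma):
        # Posterior-mean (MMSE) denoiser for Gaussian data under Gaussian noise.
        d = len(mu)
        W = Sigma @ np.linalg.inv(Sigma + sigma**2 * np.eye(d))
        return mu + W @ (y - mu)

    rng = np.random.default_rng(0)
    data = rng.standard_normal((1000, 8)) @ rng.standard_normal((8, 8))
    mu, Sigma = data.mean(axis=0), np.cov(data, rowvar=False)
    x = data[0]
    y = x + 0.5 * rng.standard_normal(8)
    print(np.linalg.norm(gaussian_denoiser(y, mu, Sigma, 0.5) - x),
          np.linalg.norm(y - x))

The claim in the abstract is that, in the generalization regime, the learned nonlinear denoisers are approximately this linear map built from the training set's mean and covariance.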
- [508] arXiv:2410.24172 (replaced) [pdf, other]
-
Title: A Multiphysics Analysis and Investigation of Soft Magnetics Effect on IPMSM: Case Study DynamometerJournal-ref: ICEMG 2023Subjects: Systems and Control (eess.SY)
Interior Permanent Magnet Synchronous Motors (IPMSMs) have attracted industrial attention owing to their advantages. Moreover, in many cases, static tests are not enough, and electric machines must be investigated under dynamic conditions. Accordingly, a dynamometer system is employed to investigate the dynamic behavior of the electric machine under test. Among dynamometers, alternating-current (AC) dynamometers are the most capable, because basic dynamometers cannot handle highly complex loads. In this study, two IPMSMs with V-type and Delta-type rotor configurations are therefore designed and proposed for use in an AC dynamometer. Any electrical or mechanical non-ideality in the electric machines of an AC dynamometer causes errors in the measurement of the motor under test. The electrical and mechanical behavior of such a system depends significantly on the soft magnetic materials used, in addition to its physical and magnetic configuration. Accordingly, through a multiphysics analysis using the FEM tool to vary the soft magnetic materials in the rotor and stator cores, the behavior of the electric motors in the AC dynamometer is compared under identical electrical and mechanical operating conditions. Finally, the soft magnetic material most satisfactory for the AC dynamometer is identified.
- [509] arXiv:2411.00144 (replaced) [pdf, html, other]
-
Title: Self-Ensembling Gaussian Splatting for Few-Shot Novel View SynthesisSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
3D Gaussian Splatting (3DGS) has demonstrated remarkable effectiveness for novel view synthesis (NVS). However, the 3DGS model tends to overfit when trained with sparse posed views, limiting its generalization ability to novel views. In this paper, we alleviate the overfitting problem, presenting a Self-Ensembling Gaussian Splatting (SE-GS) approach. Our method encompasses a $\mathbf{\Sigma}$-model and a $\mathbf{\Delta}$-model. The $\mathbf{\Sigma}$-model serves as an ensemble of 3DGS models that generates novel-view images during inference. We achieve the self-ensembling by introducing an uncertainty-aware perturbation strategy at the training state. We complement the $\mathbf{\Sigma}$-model with the $\mathbf{\Delta}$-model, which is dynamically perturbed based on the uncertainties of novel-view renderings across different training steps. The perturbation yields diverse temporal samples in the Gaussian parameter space without additional training costs. The geometry of the $\mathbf{\Sigma}$-model is regularized by penalizing discrepancies between the $\mathbf{\Sigma}$-model and these temporal samples. Therefore, our SE-GS conducts an effective and efficient regularization across a large number of 3DGS models, resulting in a robust ensemble, the $\mathbf{\Sigma}$-model. Our experimental results on the LLFF, Mip-NeRF360, DTU, and MVImgNet datasets show that our approach improves NVS quality with few-shot training views, outperforming existing state-of-the-art methods. The code is released at: this https URL.
- [510] arXiv:2411.00554 (replaced) [pdf, html, other]
-
Title: Differentiable Physics-based System Identification for Robotic Manipulation of Elastoplastic MaterialsComments: Under review at the International Journal of Robotics ResearchSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Robotic manipulation of volumetric elastoplastic deformable materials, from foods such as dough to construction materials like clay, is in its infancy, largely due to the difficulty of modelling and perception in a high-dimensional space. Simulating the dynamics of such materials is computationally expensive. It tends to suffer from inaccurately estimated physics parameters of the materials and the environment, impeding high-precision manipulation. Estimating such parameters from raw point clouds captured by optical cameras suffers further from heavy occlusions. To address this challenge, this work introduces a novel Differentiable Physics-based System Identification (DPSI) framework that enables a robot arm to infer the physics parameters of elastoplastic materials and the environment using simple manipulation motions and incomplete 3D point clouds, aligning the simulation with the real world. Extensive experiments show that with only a single real-world interaction, the estimated parameters, Young's modulus, Poisson's ratio, yield stress and friction coefficients, can accurately simulate visually and physically realistic deformation behaviours induced by unseen and long-horizon manipulation motions. Additionally, the DPSI framework inherently provides physically intuitive interpretations for the parameters in contrast to black-box approaches such as deep neural networks.
- [511] arXiv:2411.02282 (replaced) [pdf, html, other]
-
Title: A Comprehensive Simulation Framework for CXL Disaggregated MemoryYanjing Wang, Lizhou Wu, Wentao Hong, Yang Ou, Zicong Wang, Sunfeng Gao, Jie Zhang, Sheng Ma, Dezun Dong, Xingyun Qi, Mingche Lai, Nong XiaoComments: 15 pages, 19 figuresSubjects: Emerging Technologies (cs.ET); Hardware Architecture (cs.AR)
Compute eXpress Link (CXL) is a pivotal technology for memory disaggregation in future heterogeneous computing systems, enabling on-demand memory expansion and improved resource utilization. Despite its potential, CXL is in its early stages with limited market products, highlighting the need for a reliable system-level simulation tool. This paper introduces CXL-DMSim, an open-source, high-fidelity full-system simulator for CXL disaggregated memory systems, comparable in speed to gem5. CXL-DMSim includes a flexible CXL memory expander model, device driver, and support for CXLio and CXLmem protocols. It supports both app-managed and kernel-managed modes, with the latter featuring a NUMA-compatible mechanism. Rigorous verification against real hardware testbeds with FPGA-based and ASIC-based CXL memory prototypes confirms CXL-DMSim's accuracy, with an average simulation error of 4.1%. Benchmark results using LMbench and STREAM indicate that CXL-FPGA memory has approximately 2.88x higher latency than local DDR, while CXL-ASIC latency is about 2.18x. CXL-FPGA achieves 45-69% of local DDR's memory bandwidth, and CXL-ASIC reaches 82-83%. The performance of CXL memory is significantly more sensitive to Rd/Wr patterns than local DDR, with optimal bandwidth at a 74%:26% ratio rather than 50%:50% due to the current CXL+DDR controller design. The study also shows that CXL memory can markedly enhance the performance of memory-intensive applications, with the most improvement seen in Viper (~23x) and in bandwidth-sensitive scenarios like MERCI (16%). CXL-DMSim's observability and expandability are demonstrated through detailed case studies, showcasing its potential for research on future CXL-interconnected hybrid memory pools.
- [512] arXiv:2411.02451 (replaced) [pdf, other]
-
Title: High-performance automated abstract screening with large language model ensemblesRohan Sanghera, Arun James Thirunavukarasu, Marc El Khoury, Jessica O'Logbon, Yuqing Chen, Archie Watt, Mustafa Mahmood, Hamid Butt, George Nishimura, Andrew SoltanComments: RS and AJT are joint-first authorsSubjects: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Large language models (LLMs) excel in tasks requiring processing and interpretation of input text. Abstract screening is a labour-intensive component of systematic review involving repetitive application of inclusion and exclusion criteria on a large volume of studies identified by a literature search. Here, LLMs (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o, Llama 3 70B, Gemini 1.5 Pro, and Claude Sonnet 3.5) were trialled on systematic reviews in a full issue of the Cochrane Library to evaluate their accuracy in zero-shot binary classification for abstract screening. Trials over a subset of 800 records identified optimal prompting strategies and demonstrated superior performance of LLMs to human researchers in terms of sensitivity (LLM-max = 1.000, human-max = 0.775), precision (LLM-max = 0.927, human-max = 0.911), and balanced accuracy (LLM-max = 0.904, human-max = 0.865). The best performing LLM-prompt combinations were trialled across every replicated search result (n = 119,691), and exhibited consistent sensitivity (range 0.756-1.000) but diminished precision (range 0.004-0.096). 66 LLM-human and LLM-LLM ensembles exhibited perfect sensitivity with a maximal precision of 0.458, with less observed performance drop in larger trials. Significant variation in performance was observed between reviews, highlighting the importance of domain-specific validation before deployment. LLMs may reduce the human labour cost of systematic review with maintained or improved accuracy and sensitivity. Systematic review is the foundation of evidence synthesis across academic disciplines, including evidence-based medicine, and LLMs may increase the efficiency and quality of this mode of research.
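The any-positive ensembling that yields the perfect-sensitivity behavior reported here can be sketched in a few lines; the screeners below are caller-supplied predicates (one per LLM prompt/model or human reviewer), with trivial keyword functions standing in for actual LLM calls:

    def ensemble_include(record: str, screeners) -> bool:
        # Union-style voting: include a record if ANY screener votes to include
        # it. This trades precision for sensitivity, since a relevant record is
        # lost only if every ensemble member misclassifies it.
        return any(screen(record) for screen in screeners)

    kept = [r for r in ["RCT of drug X", "protocol announcement"]
            if ensemble_include(r, [lambda t: "RCT" in t,
                                    lambda t: "trial" in t.lower()])]
    print(kept)   # ['RCT of drug X']

For abstract screening, high sensitivity is the binding requirement, because any record excluded at this stage never reaches full-text review; the precision loss only adds downstream screening work.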
- [513] arXiv:2411.02853 (replaced) [pdf, html, other]
-
Title: ADOPT: Modified Adam Can Converge with Any $\beta_2$ with the Optimal RateShohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seong Cheol Jeong, Go Nagahara, Tomoshi Iiyama, Masahiro Suzuki, Yusuke Iwasawa, Yutaka MatsuoComments: Accepted at Neural Information Processing Systems (NeurIPS 2024)Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Adam is one of the most popular optimization algorithms in deep learning. However, it is known that Adam does not converge in theory unless its hyperparameter $\beta_2$ is chosen in a problem-dependent manner. There have been many attempts to fix the non-convergence (e.g., AMSGrad), but they require an impractical assumption that the gradient noise is uniformly bounded. In this paper, we propose a new adaptive gradient method named ADOPT, which achieves the optimal convergence rate of $\mathcal{O} ( 1 / \sqrt{T} )$ with any choice of $\beta_2$ without depending on the bounded noise assumption. ADOPT addresses the non-convergence issue of Adam by removing the current gradient from the second moment estimate and changing the order of the momentum update and the normalization by the second moment estimate. We also conduct intensive numerical experiments and verify that our ADOPT achieves superior results compared to Adam and its variants across a wide range of tasks, including image classification, generative modeling, natural language processing, and deep reinforcement learning. The implementation is available at this https URL.
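For readers who want the reordering spelled out, here is a minimal NumPy sketch of the update structure the abstract describes; the hyperparameter defaults and the first-step initialization are plausible assumptions, so consult the official implementation for exact details.

```python
# Minimal sketch of the ADOPT update structure described in the abstract:
# normalize by the PREVIOUS second-moment estimate (so the current gradient
# is excluded from it), update the momentum after normalization, and only
# then refresh the second moment. Defaults here are illustrative.
import numpy as np

def adopt_step(theta, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.9999, eps=1e-6):
    normed = grad / np.maximum(np.sqrt(v), eps)   # v is from step t-1
    m = beta1 * m + (1.0 - beta1) * normed        # momentum after normalization
    theta = theta - lr * m
    v = beta2 * v + (1.0 - beta2) * grad**2       # second moment updated last
    return theta, m, v

# A plausible initialization: m = 0 and v = grad**2 from the first gradient.
```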
- [514] arXiv:2411.03205 (replaced) [pdf, other]
-
Title: GIS Copilot: Towards an Autonomous GIS Agent for Spatial AnalysisSubjects: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Recent advancements in Generative AI offer promising capabilities for spatial analysis. Despite their potential, the integration of generative AI with established GIS platforms remains underexplored. In this study, we propose a framework for integrating LLMs directly into existing GIS platforms, using QGIS as an example. Our approach leverages the reasoning and programming capabilities of LLMs to autonomously generate spatial analysis workflows and code through an informed agent that has comprehensive documentation of key GIS tools and parameters. The implementation of this framework resulted in the development of a "GIS Copilot" that allows GIS users to interact with QGIS using natural language commands for spatial analysis. The GIS Copilot was evaluated on over 100 spatial analysis tasks across three complexity levels: basic tasks that require one GIS tool and typically involve one data layer to perform simple operations; intermediate tasks involving multi-step processes with multiple tools, guided by user instructions; and advanced tasks, which involve multi-step processes requiring multiple tools without user guidance, necessitating that the agent independently decide on and execute the necessary steps. The evaluation reveals that the GIS Copilot demonstrates strong potential in automating foundational GIS operations, with a high success rate in tool selection and code generation for basic and intermediate tasks, while challenges remain in achieving full autonomy for more complex tasks. This study contributes to the emerging vision of Autonomous GIS, providing a pathway for non-experts to engage with geospatial analysis with minimal prior expertise. While full autonomy is yet to be achieved, the GIS Copilot demonstrates significant potential for simplifying GIS workflows and enhancing decision-making processes.
- [515] arXiv:2411.03817 (replaced) [pdf, other]
-
Title: From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement LearningSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
The outstanding capabilities of large language models (LLMs) render them a crucial component in various autonomous agent systems. While traditional methods depend on the inherent knowledge of LLMs without fine-tuning, more recent approaches have shifted toward the reinforcement learning strategy to further enhance agents' ability to solve complex interactive tasks with environments and tools. However, previous approaches are constrained by the sparse reward issue, where existing datasets solely provide a final scalar reward for each multi-step reasoning chain, potentially leading to ineffectiveness and inefficiency in policy learning. In this paper, we introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process. Inheriting the spirit of novice-to-expert theory, we first compare the actions of the expert and the agent to automatically generate intermediate rewards for fine-grained optimization. Additionally, we propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment. Further theoretical analysis demonstrates that the action distribution of the agent can converge toward the expert action distribution over multiple training cycles. Experimental results across various datasets indicate that StepAgent outperforms existing baseline methods.
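One plausible instantiation of such step-wise rewards is an agreement signal between agent and expert actions at each step; the binary match signal and discounting below are illustrative assumptions rather than the paper's exact formulation.

```python
# Hypothetical sketch of step-wise reward shaping via expert comparison.
# The 0/1 agreement signal and discounted returns are assumptions made for
# illustration, not StepAgent's exact reward design.
def stepwise_returns(agent_actions, expert_actions, final_reward, gamma=0.99):
    # dense per-step reward: 1 when the agent action matches the expert's
    rewards = [1.0 if a == e else 0.0
               for a, e in zip(agent_actions, expert_actions)]
    rewards[-1] += final_reward          # keep the sparse task-level signal
    returns, g = [], 0.0
    for r in reversed(rewards):          # discounted returns, back to front
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))
```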
- [516] arXiv:2411.04282 (replaced) [pdf, html, other]
-
Title: Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-RewardingHaolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky Ho, Phil Mui, Silvio Savarese, Caiming Xiong, Huan WangSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps. While prompt-based methods like Chain-of-Thought (CoT) can improve LLM reasoning at inference time, optimizing reasoning capabilities during training remains challenging. We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution and optimizes it via variational approaches. LaTRO enables LLMs to concurrently improve both their reasoning process and ability to evaluate reasoning quality, without requiring external feedback or reward models. We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures. On GSM8K, LaTRO improves zero-shot accuracy by an average of 12.5% over base models and 9.6% over supervised fine-tuning across Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B. Our findings suggest that pre-trained LLMs possess latent reasoning capabilities that can be unlocked and enhanced through our proposed optimization approach in a self-improvement manner. The code of LaTRO is available at \url{this https URL}.
- [517] arXiv:2411.04872 (replaced) [pdf, html, other]
-
Title: FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AIElliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Matej Vrzala, Jaime Sevilla, Qiuyu Ren, Elizabeth Pratt, Lionel Levine, Grant Barkley, Natalie Stewart, Bogdan Grechuk, Tetiana Grechuk, Shreepranav Varma Enugandla, Mark WildonSubjects: Artificial Intelligence (cs.AI)
We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics -- from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.
- [518] arXiv:2411.05857 (replaced) [pdf, html, other]
-
Title: Financial Fraud Detection using Jump-Attentive Graph Neural NetworksComments: International Conference on Machine Learning and Applications 2024Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
As the availability of financial services online continues to grow, the incidence of fraud has surged correspondingly. Fraudsters continually seek new and innovative ways to circumvent the detection algorithms in place. Traditionally, fraud detection relied on rule-based methods, where rules were manually created based on transaction data features. However, these techniques soon became ineffective due to their reliance on manual rule creation and their inability to detect complex data patterns. Today, a significant portion of the financial services sector employs various machine learning algorithms, such as XGBoost, Random Forest, and neural networks, to model transaction data. While these techniques have proven more efficient than rule-based methods, they still fail to capture interactions between different transactions and their interrelationships. Recently, graph-based techniques have been adopted for financial fraud detection, leveraging graph topology to aggregate neighborhood information of transaction data using Graph Neural Networks (GNNs). Despite showing improvements over previous methods, these techniques still struggle to keep pace with the evolving camouflaging tactics of fraudsters and suffer from information loss due to over-smoothing. In this paper, we propose a novel algorithm that employs an efficient neighborhood sampling method, effective for camouflage detection and preserving crucial feature information from non-similar nodes. Additionally, we introduce a novel GNN architecture that utilizes attention mechanisms and preserves holistic neighborhood information to prevent information loss. We test our algorithm on financial data to show that our method outperforms other state-of-the-art graph algorithms.
- [519] arXiv:2411.06387 (replaced) [pdf, html, other]
-
Title: Self-Training Meets Consistency: Improving LLMs' Reasoning With Consistency-Driven Rationale EvaluationComments: Under reviewSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Self-training approaches for large language models (LLMs) improve reasoning abilities by training the models on their self-generated rationales. Previous approaches have labeled rationales that produce correct answers for a given question as appropriate for training. However, a single measure risks misjudging rationale quality, leading the models to learn flawed reasoning patterns. To address this issue, we propose CREST (Consistency-driven Rationale Evaluation for Self-Training), a self-training framework that further evaluates each rationale through follow-up questions and leverages this evaluation to guide its training. Specifically, we introduce two methods: (1) filtering out rationales that frequently result in incorrect answers on follow-up questions and (2) preference learning based on mixed preferences from rationale evaluation results of both original and follow-up questions. Experiments on three question-answering datasets using open LLMs show that CREST not only improves the logical robustness and correctness of rationales but also yields better reasoning abilities than previous self-training approaches.
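The filtering step (1) can be pictured as below; the answering callback, gold-label format, and failure threshold are assumptions for illustration.

```python
# Sketch of consistency-driven rationale filtering in the spirit of CREST:
# keep a rationale only if it does not lead to too many wrong answers on
# follow-up questions. answer_with_rationale() and max_fail are assumptions.
def filter_rationales(rationales, followups, answer_with_rationale, max_fail=0.5):
    kept = []
    for rationale in rationales:
        fails = sum(answer_with_rationale(rationale, question) != gold
                    for question, gold in followups)
        if fails / len(followups) <= max_fail:
            kept.append(rationale)
    return kept
```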
- [520] arXiv:2411.07482 (replaced) [pdf, html, other]
-
Title: Enhancing Link Prediction with Fuzzy Graph Attention Networks and Dynamic Negative SamplingComments: 5 pagesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Link prediction is crucial for understanding complex networks, but traditional Graph Neural Networks (GNNs) often rely on random negative sampling, leading to suboptimal performance. This paper introduces Fuzzy Graph Attention Networks (FGAT), a novel approach integrating fuzzy rough sets for dynamic negative sampling and enhanced node feature aggregation. Fuzzy Negative Sampling (FNS) systematically selects high-quality negative edges based on fuzzy similarities, improving training efficiency. The FGAT layer incorporates fuzzy rough set principles, enabling robust and discriminative node representations. Experiments on two research collaboration networks demonstrate FGAT's superior link prediction accuracy, outperforming state-of-the-art baselines by leveraging the power of fuzzy rough sets for effective negative sampling and node feature learning.
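A rough sketch of the negative-sampling idea follows; interpreting "high-quality negatives" as high-similarity non-edges, and using cosine similarity as the fuzzy similarity, are both assumptions made for illustration.

```python
# Hedged sketch of fuzzy negative sampling: score candidate non-edges by a
# similarity measure and keep the most similar ones as hard negatives.
# Cosine similarity and top-k selection are illustrative assumptions.
import numpy as np

def fuzzy_negative_sampling(features, pos_edges, k):
    n = features.shape[0]
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    sim = (features @ features.T) / (norms @ norms.T + 1e-8)
    existing = {tuple(sorted(e)) for e in pos_edges}
    candidates = [(sim[i, j], (i, j))
                  for i in range(n) for j in range(i + 1, n)
                  if (i, j) not in existing]
    candidates.sort(key=lambda c: c[0], reverse=True)  # hardest negatives first
    return [edge for _, edge in candidates[:k]]
```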
- [521] arXiv:2411.08127 (replaced) [pdf, html, other]
-
Title: TIPO: Text to Image with Text Presampling for Prompt OptimizationComments: 26 pages, 19 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
TIPO (Text to Image with text pre-sampling for Prompt Optimization) is an innovative framework designed to enhance text-to-image (T2I) generation by using a language model (LM) for automatic prompt engineering. By refining and extending user-provided prompts, TIPO bridges the gap between simple inputs and the detailed prompts required for high-quality image generation. Unlike previous approaches that rely on Large Language Models (LLMs) or reinforcement learning (RL), TIPO adjusts user input prompts to match the distribution of a trained prompt dataset, avoiding complex runtime costs by using a lightweight model. This pre-sampling approach enables efficient and scalable prompt optimization, grounded in the model's training distribution. Experimental results demonstrate TIPO's effectiveness in improving aesthetic scores, reducing image corruption, and better aligning generated images with dataset distributions. These findings highlight the critical role of prompt engineering in T2I systems and open avenues for broader applications of automatic prompt refinement.
- [522] arXiv:2411.08508 (replaced) [pdf, html, other]
-
Title: BillBoard Splatting (BBSplat): Learnable Textured Primitives for Novel View SynthesisSubjects: Computer Vision and Pattern Recognition (cs.CV)
We present Billboard Splatting (BBSplat) - a novel approach for 3D scene representation based on textured geometric primitives. BBSplat represents the scene as a set of optimizable textured planar primitives with learnable RGB textures and alpha-maps to control their shape. BBSplat primitives can be used in any Gaussian Splatting pipeline as drop-in replacements for Gaussians. Our method's qualitative and quantitative improvements over 3D and 2D Gaussians are most noticeable when fewer primitives are used, in which case BBSplat achieves over 1200 FPS. Our novel regularization term encourages textures to have a sparser structure, unlocking efficient compression that reduces the model's storage footprint. Our experiments show the efficiency of BBSplat on standard datasets of real indoor and outdoor scenes such as Tanks&Temples, DTU, and Mip-NeRF-360. We demonstrate improvements on PSNR, SSIM, and LPIPS metrics compared to the state-of-the-art, especially when fewer primitives are used, which in turn yields up to a 2x inference speed improvement for the same rendering quality.
- [523] arXiv:2411.08567 (replaced) [pdf, html, other]
-
Title: Saliency Map-based Image Retrieval using Invariant Krawtchouk MomentsSubjects: Computer Vision and Pattern Recognition (cs.CV)
With the widespread adoption of digital devices equipped with cameras and the rapid development of Internet technology, numerous content-based image retrieval systems and novel image feature extraction techniques have emerged in recent years. This paper introduces a saliency map-based image retrieval approach using invariant Krawtchouk moments (SM-IKM) to enhance retrieval speed and accuracy. The proposed method applies a global contrast-based salient region detection algorithm to create a saliency map that effectively isolates the foreground from the background. It then combines multiple orders of invariant Krawtchouk moments (IKM) with local binary patterns (LBPs) and color histograms to comprehensively represent the foreground and background. Additionally, it incorporates LBPs derived from the saliency map to improve discriminative power, facilitating more precise image differentiation. A bag-of-visual-words (BoVW) model is employed to generate a codebook for classification and discrimination. By using compact IKMs in the BoVW framework and integrating a range of region-based features, including color histograms, LBPs, and saliency map-enhanced LBPs, our proposed SM-IKM achieves efficient and accurate image retrieval. Extensive experiments on publicly available datasets, such as Caltech 101 and Wang, demonstrate that SM-IKM outperforms recent state-of-the-art retrieval methods. The source code for SM-IKM is available at this http URL.
- [524] arXiv:2411.08977 (replaced) [pdf, html, other]
-
Title: Robustness and Confounders in the Demographic Alignment of LLMs with Human Perceptions of OffensivenessSubjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Large language models (LLMs) are known to exhibit demographic biases, yet few studies systematically evaluate these biases across multiple datasets or account for confounding factors. In this work, we examine LLM alignment with human annotations in five offensive language datasets, comprising approximately 220K annotations. Our findings reveal that while demographic traits, particularly race, influence alignment, these effects are inconsistent across datasets and often entangled with other factors. Confounders -- such as document difficulty, annotator sensitivity, and within-group agreement -- account for more variation in alignment patterns than demographic traits alone. Specifically, alignment increases with higher annotator sensitivity and group agreement, while greater document difficulty corresponds to reduced alignment. Our results underscore the importance of multi-dataset analyses and confounder-aware methodologies in developing robust measures of demographic bias in LLMs.
- [525] arXiv:2411.09813 (replaced) [pdf, html, other]
-
Title: Can Features for Phishing URL Detection Be Trusted Across Diverse Datasets? A Case Study with Explainable AIComments: 9 pages, 9 figures, 11th International Conference on Networking, Systems, and Security (NSysS 2024), 2024, Khulna, BangladeshSubjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Phishing has been a prevalent cyber threat that manipulates users into revealing sensitive private information through deceptive tactics, designed to masquerade as trustworthy entities. Over the years, proactive detection of phishing URLs (or websites) has been established as a widely accepted defense approach. In the literature, we often find supervised Machine Learning (ML) models with highly competitive performance for detecting phishing websites based on the extracted features from both phishing and benign (i.e., legitimate) websites. However, it is still unclear whether these features or indicators are dependent on a particular dataset or whether they generalize across datasets for overall phishing detection. In this paper, we delve deeper into this issue by analyzing two publicly available phishing URL datasets, where each dataset has its own set of unique and overlapping features related to URL strings and website contents. We investigate whether the overlapping features are similar in nature across datasets and how a model performs when trained on one dataset and tested on the other. We conduct practical experiments and leverage explainable AI (XAI) methods such as SHAP plots to provide insights into different features' contributions in phishing detection, answering our primary question, "Can features for phishing URL detection be trusted across diverse datasets?". Our case study results show that features for phishing URL detection can often be dataset-dependent and thus may not be trusted across different datasets, even when the datasets share the same set of features.
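The cross-dataset protocol can be pictured as follows; column names, the shared-feature list, and the choice of a random forest are placeholders, while the SHAP calls follow the library's standard tree-explainer API.

```python
# Sketch of the cross-dataset experiment: train on phishing dataset A,
# test on dataset B over their overlapping features, then inspect feature
# contributions with SHAP. Feature names and the model are placeholders.
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

def cross_dataset_eval(X_a, y_a, X_b, y_b, shared_features):
    model = RandomForestClassifier(n_estimators=300, random_state=0)
    model.fit(X_a[shared_features], y_a)           # train on dataset A
    preds = model.predict(X_b[shared_features])    # test on dataset B
    print(classification_report(y_b, preds))
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_b[shared_features])
    shap.summary_plot(shap_values, X_b[shared_features])
```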
- [526] arXiv:2411.09944 (replaced) [pdf, html, other]
-
Title: SlimLM: An Efficient Small Language Model for On-Device Document AssistanceSubjects: Computation and Language (cs.CL)
While small language models (SLMs) show promise for mobile deployment, their real-world performance and applications on smartphones remain underexplored. We present SlimLM, a series of SLMs optimized for document assistance tasks on mobile devices. Through extensive experiments on a Samsung Galaxy S24, we identify the optimal trade-offs between model size (ranging from 125M to 7B parameters), context length, and inference time for efficient on-device processing. SlimLM is pre-trained on SlimPajama-627B and fine-tuned on DocAssist, our constructed dataset for summarization, question answering and suggestion tasks. Our smallest model demonstrates efficient performance on the S24, while larger variants offer enhanced capabilities within mobile constraints. We evaluate SlimLM against existing SLMs, showing comparable or superior performance and offering a benchmark for future research in on-device language models. We also provide an Android application, offering practical insights into SLM deployment. Our findings illuminate the capabilities of running advanced language models on high-end smartphones, potentially reducing server costs and enhancing privacy through on-device processing.
- [527] arXiv:2411.09946 (replaced) [pdf, other]
-
Title: Assessing Response Disparities in California Wildland-Urban-Interface (WUI) Cities Using the Compartmental ModelSubjects: Social and Information Networks (cs.SI)
The increasing frequency and severity of wildfires pose significant risks to communities, infrastructure, and the environment, especially in Wildland-Urban Interface (WUI) areas. Effective disaster management requires understanding how the public perceives and responds to wildfire threats in real-time. This study uses social media data to assess public responses and explores how these responses are linked to city-level community characteristics. Specifically, we leveraged a transformer-based topic modeling technique called BERTopic to identify wildfire response-related topics and then utilized the Susceptible-Infectious-Recovered (SIR) model to compute two key metrics associated with wildfire responses - awareness and resilience indicators. Additionally, we used GIS-based spatial analysis to map wildfire locations along with four groups of city-level factors (racial/ethnic, socioeconomic, demographic, and wildfire-specific). Our findings reveal significant geographic and socio-spatial differences in public responses. Southern California cities with larger Hispanic populations demonstrate higher wildfire awareness and resilience. In contrast, urbanized regions in Central and Northern California exhibit lower awareness levels. Furthermore, resilience is negatively correlated with unemployment rates, particularly in southern regions where higher unemployment aligns with reduced resilience. These findings highlight the need for targeted and equitable wildfire management strategies to improve the adaptive capacity of WUI communities.
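As background for the compartmental analysis, the sketch below fits a classical SIR curve to a daily topic-adoption series; mapping "infection" to posting about a wildfire topic, and reading awareness and resilience off the fitted beta and gamma, are assumptions for illustration, not the paper's exact method.

```python
# Hedged sketch: fit a classical SIR model to normalized daily counts of
# wildfire-related posts. Interpreting beta/gamma as awareness/resilience
# indicators is an illustrative assumption.
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import curve_fit

def sir(y, t, beta, gamma):
    s, i, r = y
    return [-beta * s * i, beta * s * i - gamma * i, gamma * i]

def infected_curve(t, beta, gamma, i0=1e-3):
    y0 = [1.0 - i0, i0, 0.0]
    return odeint(sir, y0, t, args=(beta, gamma))[:, 1]

t = np.arange(30, dtype=float)                    # days since ignition
rng = np.random.default_rng(0)                    # synthetic stand-in data
observed = infected_curve(t, 0.6, 0.15) + 0.005 * rng.normal(size=t.size)
(beta_hat, gamma_hat), _ = curve_fit(infected_curve, t, observed, p0=[0.5, 0.1])
print(f"beta={beta_hat:.3f} (awareness proxy), gamma={gamma_hat:.3f} (resilience proxy)")
```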
- [528] arXiv:2411.10074 (replaced) [pdf, html, other]
-
Title: Improving the accuracy of automated labeling of specimen images datasets via a confidence-based processSubjects: Computer Vision and Pattern Recognition (cs.CV); Populations and Evolution (q-bio.PE)
The digitization of natural history collections over the past three decades has unlocked a treasure trove of specimen imagery and metadata. There is great interest in making this data more useful by further labeling it with additional trait data, and modern deep learning techniques utilizing convolutional neural networks (CNNs) and similar architectures show particular promise for reducing the amount of manual labeling required from human experts, making the process much faster and less expensive. However, in most cases, the accuracy of these approaches is too low for reliable utilization of the automatic labeling, typically in the range of 80-85% accuracy. In this paper, we present and validate an approach that can greatly improve this accuracy, essentially by examining the confidence the network assigns to each generated label and rejecting labels whose confidence falls below a user-defined threshold. We demonstrate that a naive model that produced 86% initial accuracy can achieve improved performance: over 95% accuracy (rejecting about 40% of the labels) or over 99% accuracy (rejecting about 65%) by selecting higher confidence thresholds. This gives flexibility to adapt existing models to the statistical requirements of various types of research and has the potential to move these automatic labeling approaches from being unusably inaccurate to being an invaluable new tool. After validating the approach in a number of ways, we annotate the reproductive state of a large dataset of over 600,000 herbarium specimens. The analysis of the results points to under-investigated correlations as well as general alignment with known trends. By sharing this new dataset alongside this work, we want to allow ecologists to gather insights for their own research questions, at their chosen point of accuracy/coverage trade-off.
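The thresholding procedure is simple enough to state directly; the sketch below accepts an automatic label only when the top softmax probability clears a user-chosen threshold and sweeps thresholds to chart the accuracy/coverage trade-off (threshold values are illustrative).

```python
# Sketch of confidence-based label rejection: accept a label only when the
# model's top softmax probability clears a user-defined threshold; raising
# the threshold trades coverage for accuracy, as described in the abstract.
import numpy as np

def accept_by_confidence(probs, threshold=0.9):
    """probs: (n_samples, n_classes) softmax outputs."""
    labels = probs.argmax(axis=1)
    accepted = probs.max(axis=1) >= threshold
    return labels, accepted

def sweep(probs, y_true, thresholds=(0.5, 0.7, 0.9, 0.99)):
    for t in thresholds:
        labels, mask = accept_by_confidence(probs, t)
        acc = (labels[mask] == y_true[mask]).mean() if mask.any() else float("nan")
        print(f"threshold={t:.2f} coverage={mask.mean():.1%} accuracy={acc:.1%}")
```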
- [529] arXiv:2411.10083 (replaced) [pdf, other]
-
Title: Xmodel-1.5: An 1B-scale Multilingual LLMSubjects: Computation and Language (cs.CL)
We introduce Xmodel-1.5, a 1-billion-parameter multilingual large language model pretrained on 2 trillion tokens, designed for balanced performance and scalability. Unlike most large models that use the BPE tokenizer, Xmodel-1.5 employs a custom unigram tokenizer with 65,280 tokens, optimizing both efficiency and accuracy. The model delivers competitive results across multiple languages, including Thai, Arabic, French, Chinese, and English, outperforming Alibaba's PolyLM-1.7B on respective evaluation datasets. Xmodel-1.5 excels in benchmarks like mMMLU and PIQA, and achieves state-of-the-art results in Thai. To support low-resource language research, we release Xdata_Thai, a Thai-specific evaluation dataset featuring unique linguistic challenges such as gendered particles and idioms. While the model demonstrates strong performance, there is still room for improvement in handling culturally specific nuances. We hope this work contributes to advancements in multilingual AI research. Models and code are publicly available on GitHub at this https URL
- [530] arXiv:2411.10499 (replaced) [pdf, html, other]
-
Title: FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-onBoyuan Jiang, Xiaobin Hu, Donghao Luo, Qingdong He, Chengming Xu, Jinlong Peng, Jiangning Zhang, Chengjie Wang, Yunsheng Wu, Yanwei FuComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Although image-based virtual try-on has made considerable progress, emerging approaches still encounter challenges in producing high-fidelity and robust fitting images across diverse scenarios. These methods often struggle with issues such as texture-aware maintenance and size-aware fitting, which hinder their overall effectiveness. To address these limitations, we propose a novel garment perception enhancement technique, termed FitDiT, designed for high-fidelity virtual try-on using Diffusion Transformers (DiT), which allocate more parameters and attention to high-resolution features. First, to further improve texture-aware maintenance, we introduce a garment texture extractor that incorporates garment priors evolution to fine-tune garment features, helping to better capture rich details such as stripes, patterns, and text. Additionally, we introduce frequency-domain learning by customizing a frequency distance loss to enhance high-frequency garment details. To tackle the size-aware fitting issue, we employ a dilated-relaxed mask strategy that adapts to the correct length of garments, preventing the generation of garments that fill the entire mask area during cross-category try-on. Equipped with the above design, FitDiT surpasses all baselines in both qualitative and quantitative evaluations. It excels in producing well-fitting garments with photorealistic and intricate details, while also achieving competitive inference times of 4.57 seconds for a single 1024x768 image after DiT structure slimming, outperforming existing methods.
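The frequency-domain term can be sketched as follows; comparing L1 distances between FFT amplitude spectra is one plausible reading of the "frequency distance loss", not necessarily the paper's exact definition.

```python
# Hedged sketch of a frequency-domain distance loss in the spirit of the
# abstract: penalize differences between the amplitude spectra of generated
# and target images. The L1-on-magnitudes choice is an assumption.
import torch

def frequency_distance_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: (B, C, H, W) image batches in [0, 1]."""
    pred_f = torch.fft.fft2(pred, norm="ortho")
    target_f = torch.fft.fft2(target, norm="ortho")
    return (pred_f.abs() - target_f.abs()).abs().mean()
```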
- [531] arXiv:2411.10745 (replaced) [pdf, html, other]
-
Title: TDSM: Triplet Diffusion for Skeleton-Text Matching in Zero-Shot Action RecognitionComments: Please visit our project page at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
We present the first diffusion-based zero-shot action recognition method for skeleton inputs. In zero-shot skeleton-based action recognition, aligning skeleton features with the text features of action labels is essential for accurately predicting unseen actions. Previous methods focus on direct alignment between skeleton and text latent spaces, but the modality gaps between these spaces hinder robust generalization learning. Motivated by the remarkable performance of text-to-image diffusion models, we leverage their alignment capabilities between different modalities mostly by focusing on the training process during reverse diffusion rather than using their generative power. Based on this, our framework is designed as a Triplet Diffusion for Skeleton-Text Matching (TDSM) method, which aligns skeleton features with text prompts through reverse diffusion, embedding the prompts into the unified skeleton-text latent space to achieve robust matching. To enhance discriminative power, we introduce a novel triplet diffusion (TD) loss that encourages our TDSM to pull correct skeleton-text matches together while pushing apart incorrect ones. Our TDSM significantly outperforms the very recent state-of-the-art methods with large margins of 2.36 to 13.05 percentage points, demonstrating superior accuracy and scalability in zero-shot settings through effective skeleton-text matching.
- [532] arXiv:2411.12168 (replaced) [pdf, html, other]
-
Title: Sketch-guided Cage-based 3D Gaussian Splatting DeformationComments: 10 pages, 9 figures, project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
3D Gaussian Splatting (GS) is one of the most promising novel 3D representations that has received great interest in computer graphics and computer vision. While various systems have introduced editing capabilities for 3D GS, such as those guided by text prompts, fine-grained control over deformation remains an open challenge. In this work, we present a novel sketch-guided 3D GS deformation system that allows users to intuitively modify the geometry of a 3D GS model by drawing a silhouette sketch from a single viewpoint. Our approach introduces a new deformation method that combines cage-based deformations with a variant of Neural Jacobian Fields, enabling precise, fine-grained control. Additionally, it leverages large-scale 2D diffusion priors and ControlNet to ensure the generated deformations are semantically plausible. Through a series of experiments, we demonstrate the effectiveness of our method and showcase its ability to animate static 3D GS models as one of its key applications.
- [533] arXiv:2411.12448 (replaced) [pdf, html, other]
-
Title: Large Language Models for Lossless Image Compression: Next-Pixel Prediction in Language Space is All You NeedKecheng Chen, Pingping Zhang, Hui Liu, Jie Liu, Yibing Liu, Jiaxin Huang, Shiqi Wang, Hong Yan, Haoliang LiSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
We have recently witnessed that "Intelligence" and "Compression" are two sides of the same coin, where the large language model (LLM) with unprecedented intelligence is a general-purpose lossless compressor for various data modalities. This attribute particularly appeals to the lossless image compression community, given the increasing need to compress high-resolution images in the current streaming media era. Consequently, a natural question emerges: Can the compression performance of the LLM elevate lossless image compression to new heights? However, our findings indicate that the naive application of LLM-based lossless image compressors suffers from a considerable performance gap compared with existing state-of-the-art (SOTA) codecs on common benchmark datasets. In light of this, we are dedicated to fulfilling the unprecedented intelligence (compression) capacity of the LLM for lossless image compression tasks, thereby bridging the gap between theoretical and practical compression performance. Specifically, we propose P$^{2}$-LLM, a next-pixel prediction-based LLM, which integrates various elaborated insights and methodologies, e.g., pixel-level priors, the in-context ability of LLMs, and a pixel-level semantic preservation strategy, to enhance the understanding capacity of pixel sequences for better next-pixel predictions. Extensive experiments on benchmark datasets demonstrate that P$^{2}$-LLM can beat SOTA classical and learned codecs.
- [534] arXiv:2411.12603 (replaced) [pdf, html, other]
-
Title: STREAM: A Universal State-Space Model for Sparse Geometric DataMark Schöne, Yash Bhisikar, Karan Bania, Khaleelulla Khan Nazeer, Christian Mayr, Anand Subramoney, David KappelSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Handling sparse and unstructured geometric data, such as point clouds or event-based vision, is a pressing challenge in the field of machine vision. Recently, sequence models such as Transformers and state-space models entered the domain of geometric data. These methods require specialized preprocessing to create a sequential view of a set of points. Furthermore, prior works involving sequence models iterate geometric data with either uniform or learned step sizes, implicitly relying on the model to infer the underlying geometric structure. In this work, we propose to encode geometric structure explicitly into the parameterization of a state-space model. State-space models are based on linear dynamics governed by a one-dimensional variable such as time or a spatial coordinate. We exploit this dynamic variable to inject relative differences of coordinates into the step size of the state-space model. The resulting geometric operation computes interactions between all pairs of N points in O(N) steps. Our model deploys the Mamba selective state-space model with a modified CUDA kernel to efficiently map sparse geometric data to modern hardware. The resulting sequence model, which we call STREAM, achieves competitive results on a range of benchmarks from point-cloud classification to event-based vision and audio classification. STREAM demonstrates a powerful inductive bias for sparse geometric data by improving the PointMamba baseline when trained from scratch on the ModelNet40 and ScanObjectNN point cloud analysis datasets. It further achieves, for the first time, 100% test accuracy on all 11 classes of the DVS128 Gestures dataset.
- [535] arXiv:2411.12620 (replaced) [pdf, html, other]
-
Title: Maps from Motion (MfM): Generating 2D Semantic Maps from Sparse Multi-view ImagesSubjects: Computer Vision and Pattern Recognition (cs.CV)
World-wide detailed 2D maps require enormous collective efforts. OpenStreetMap is the result of 11 million registered users manually annotating the GPS location of over 1.75 billion entries, including distinctive landmarks and common urban objects. At the same time, manual annotations can include errors and are slow to update, limiting the map's accuracy. Maps from Motion (MfM) is a step toward automating this time-consuming map-making procedure by computing 2D maps of semantic objects directly from a collection of uncalibrated multi-view images. From each image, we extract a set of object detections, and estimate their spatial arrangement in a top-down local map centered in the reference frame of the camera that captured the image. Aligning these local maps is not a trivial problem, since they provide incomplete, noisy fragments of the scene, and matching detections across them is unreliable because of the presence of repeated patterns and the limited appearance variability of urban objects. We address this with a novel graph-based framework that encodes the spatial and semantic distribution of the objects detected in each image, and learns how to combine them to predict the objects' poses in a global reference system, while taking into account all possible detection matches and preserving the topology observed in each image. Despite the complexity of the problem, our best model achieves global 2D registration with an average accuracy within 4 meters (i.e., below GPS accuracy) even on sparse sequences with strong viewpoint change, on which COLMAP has an 80% failure rate. We provide extensive evaluation on synthetic and real-world data, showing how the method obtains a solution even in scenarios where standard optimization techniques fail.
- [536] arXiv:2411.12872 (replaced) [pdf, html, other]
-
Title: From Text to Pose to Image: Improving Diffusion Model Control and QualityComments: Published at the NeurIPS 2024 Workshop on Compositional Learning: Perspectives, Methods, and Paths ForwardSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
In the last two years, text-to-image diffusion models have become extremely popular. As their quality and usage increase, a major concern has been the need for better output control. In addition to prompt engineering, one effective method to improve the controllability of diffusion models has been to condition them on additional modalities such as image style, depth map, or keypoints. This forms the basis of ControlNets or Adapters. When attempting to apply these methods to control human poses in outputs of text-to-image diffusion models, two main challenges have arisen. The first challenge is generating poses following a wide range of semantic text descriptions, for which previous methods involved searching for a pose within a dataset of (caption, pose) pairs. The second challenge is conditioning image generation on a specified pose while keeping both high aesthetic and high pose fidelity. In this article, we fix these two main issues by introducing a text-to-pose (T2P) generative model alongside a new sampling algorithm, and a new pose adapter that incorporates more pose keypoints for higher pose fidelity. Together, these two new state-of-the-art models enable, for the first time, a generative text-to-pose-to-image framework for higher pose control in diffusion models. We release all models and the code used for the experiments at this https URL.
- [537] arXiv:2411.13010 (replaced) [pdf, html, other]
-
Title: Deriving Activation Functions via IntegrationSubjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Activation functions play a crucial role in introducing non-linearities to deep neural networks. We propose a novel approach to designing activation functions by focusing on their gradients and deriving the corresponding functions through integration. Our work introduces the Expanded Integral of the Exponential Linear Unit (xIELU), a trainable piecewise activation function derived by integrating trainable affine transformations applied on the ELU activation function. xIELU combines two key gradient properties: a trainable and linearly increasing gradient for positive inputs, similar to ReLU$^2$, and a trainable negative gradient flow for negative inputs, akin to xSiLU. Conceptually, xIELU can be viewed as extending ReLU$^2$ to effectively handle negative inputs. In experiments with 1.1B parameter Llama models trained on 126B tokens of FineWeb Edu, xIELU achieves lower perplexity compared to both ReLU$^2$ and SwiGLU when matched for the same compute cost and parameter count.
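One plausible form consistent with the abstract's description (not necessarily the paper's exact parameterization) integrates a linearly increasing gradient 2*alpha_p*x + beta for positive inputs and an ELU-like gradient alpha_n*exp(x) + beta for negative inputs, giving the piecewise function below.

```python
# A plausible xIELU form, assumed for illustration: integrating the gradient
# 2*alpha_p*x + beta on x > 0 gives a ReLU^2-like branch, and integrating
# alpha_n*exp(x) + beta on x <= 0 gives an ELU-like branch; both vanish at 0.
import torch
import torch.nn as nn

class XIELU(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha_p = nn.Parameter(torch.tensor(0.5))  # positive-branch slope
        self.alpha_n = nn.Parameter(torch.tensor(1.0))  # negative gradient flow
        self.beta = nn.Parameter(torch.tensor(0.5))     # shared linear term

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pos = self.alpha_p * x * x + self.beta * x      # ReLU^2-like branch
        # clamp avoids exp overflow on the branch that is discarded for x > 0
        neg = self.alpha_n * (torch.exp(torch.clamp(x, max=0.0)) - 1.0) + self.beta * x
        return torch.where(x > 0, pos, neg)
```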
- [538] arXiv:2411.13147 (replaced) [pdf, html, other]
-
Title: GraphCL: Graph-based Clustering for Semi-Supervised Medical Image SegmentationComments: 9pageSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Semi-supervised learning (SSL) has made notable advancements in medical image segmentation (MIS), particularly in scenarios with limited labeled data, where it significantly enhances data utilization efficiency. Previous methods primarily focus on complex training strategies to utilize unlabeled data but neglect the importance of graph structural information. Different from existing methods, we propose a graph-based clustering method for semi-supervised medical image segmentation (GraphCL) that jointly models graph data structure in a unified deep model. The proposed GraphCL model enjoys several advantages. Firstly, to the best of our knowledge, this is the first work to model the data structure information for semi-supervised medical image segmentation (SSMIS). Secondly, to obtain clustered features across different graphs, we integrate both pairwise affinities between local image features and raw features as inputs. Extensive experimental results on three standard benchmarks show that the proposed GraphCL algorithm outperforms state-of-the-art semi-supervised medical image segmentation methods.
- [539] arXiv:2411.13152 (replaced) [pdf, html, other]
-
Title: AGLP: A Graph Learning Perspective for Semi-supervised Domain AdaptationComments: 8pageSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
In semi-supervised domain adaptation (SSDA), the model aims to leverage partially labeled target domain data along with a large amount of labeled source domain data to enhance its generalization capability for the target domain. A key advantage of SSDA is its ability to significantly reduce reliance on labeled data, thereby lowering the costs and time associated with data preparation. Most existing SSDA methods utilize information from domain labels and class labels but overlook the structural information of the data. To address this issue, this paper proposes a graph learning perspective (AGLP) for semi-supervised domain adaptation. We apply the graph convolutional network to the instance graph which allows structural information to propagate along the weighted graph edges. The proposed AGLP model has several advantages. First, to the best of our knowledge, this is the first work to model structural information in SSDA. Second, the proposed model can effectively learn domain-invariant and semantic representations, reducing domain discrepancies in SSDA. Extensive experimental results on multiple standard benchmarks demonstrate that the proposed AGLP algorithm outperforms state-of-the-art semi-supervised domain adaptation methods.
- [540] arXiv:2411.13187 (replaced) [pdf, html, other]
-
Title: Engagement-Driven Content Generation with Large Language ModelsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) exhibit significant persuasion capabilities in one-on-one interactions, but their influence within social networks remains underexplored. This study investigates the potential social impact of LLMs in these environments, where interconnected users and complex opinion dynamics pose unique challenges. In particular, we address the following research question: can LLMs learn to generate meaningful content that maximizes user engagement on social networks?
To answer this question, we define a pipeline to guide the LLM-based content generation which employs reinforcement learning with simulated feedback. In our framework, the reward is based on an engagement model borrowed from the literature on opinion dynamics and information propagation. Moreover, we force the text generated by the LLM to be aligned with a given topic and to satisfy a minimum fluency requirement.
Using our framework, we analyze the capabilities and limitations of LLMs in tackling the given task, specifically considering the relative positions of the LLM as an agent within the social network and the distribution of opinions in the network on the given topic. Our findings show the full potential of LLMs in creating social engagement. Notable properties of our approach are that the learning procedure is adaptive to the opinion distribution of the underlying network and agnostic to the specifics of the engagement model, which is embedded as a plug-and-play component. In this regard, our approach can be easily refined for more complex engagement tasks and interventions in computational social science.
The code used for the experiments is publicly available at this https URL.
- [541] arXiv:2411.13485 (replaced) [pdf, html, other]
-
Title: Utilizing Large Language Models to Synthesize Product Desirability DatasetsComments: 9 pages, 2 figures, 6 tables, updated author listSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
This research explores the application of large language models (LLMs) to generate synthetic datasets for Product Desirability Toolkit (PDT) testing, a key component in evaluating user sentiment and product experience. Utilizing gpt-4o-mini, a cost-effective alternative to larger commercial LLMs, three methods, Word+Review, Review+Word, and Supply-Word, were each used to synthesize 1000 product reviews. The generated datasets were assessed for sentiment alignment, textual diversity, and data generation cost. Results demonstrated high sentiment alignment across all methods, with Pearson correlations ranging from 0.93 to 0.97. Supply-Word exhibited the highest diversity and coverage of PDT terms, although with increased generation costs. Despite minor biases toward positive sentiments, in situations with limited test data, LLM-generated synthetic data offers significant advantages, including scalability, cost savings, and flexibility in dataset production.
- [542] arXiv:2411.13528 (replaced) [pdf, html, other]
-
Title: Entropy Bootstrapping for Weakly Supervised Nuclei DetectionComments: 8 PagesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Microscopy structure segmentation, such as detecting cells or nuclei, generally requires a human to draw a ground truth contour around each instance. Weakly supervised approaches (e.g., consisting of only single point labels) have the potential to reduce this workload significantly. Our approach uses individual point labels for an entropy estimation to approximate an underlying distribution of cell pixels. We infer full cell masks from this distribution, and use Mask-RCNN to produce an instance segmentation output. We compare this point-annotated approach with training on the full ground truth masks. We show that our method achieves a comparatively good level of performance, despite a 95% reduction in pixel labels.
- [543] arXiv:2411.13587 (replaced) [pdf, html, other]
-
Title: Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in RoboticsTaowen Wang, Dongfang Liu, James Chenhao Liang, Wenhao Yang, Qifan Wang, Cheng Han, Jiebo Luo, Ruixiang TangSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Recently in robotics, Vision-Language-Action (VLA) models have emerged as a transformative approach, enabling robots to execute complex tasks by integrating visual and linguistic inputs within an end-to-end learning framework. While VLA models offer significant capabilities, they also introduce new attack surfaces, making them vulnerable to adversarial attacks. With these vulnerabilities largely unexplored, this paper systematically quantifies the robustness of VLA-based robotic systems. Recognizing the unique demands of robotic execution, our attack objectives target the inherent spatial and functional characteristics of robotic systems. In particular, we introduce an untargeted position-aware attack objective that leverages spatial foundations to destabilize robotic actions, and a targeted attack objective that manipulates the robotic trajectory. Additionally, we design an adversarial patch generation approach that places a small, colorful patch within the camera's view, effectively executing the attack in both digital and physical environments. Our evaluation reveals a marked degradation in task success rates, with up to a 100% reduction across a suite of simulated robotic tasks, highlighting critical security gaps in current VLA architectures. By unveiling these vulnerabilities and proposing actionable evaluation metrics, this work advances both the understanding and enhancement of safety for VLA-based robotic systems, underscoring the necessity for developing robust defense strategies prior to physical-world deployments.
- [544] arXiv:2411.13802 (replaced) [pdf, html, other]
-
Title: SemiKong: Curating, Training, and Evaluating A Semiconductor Industry-Specific Large Language ModelChristopher Nguyen, William Nguyen, Atsushi Suzuki, Daisuke Oku, Hong An Phan, Sang Dinh, Zooey Nguyen, Anh Ha, Shruti Raghavan, Huy Vo, Thang Nguyen, Lan Nguyen, Yoshikuni HirayamaComments: On-going workSubjects: Computation and Language (cs.CL)
Large Language Models (LLMs) have demonstrated the potential to address some issues within the semiconductor industry. However, they are often general-purpose models that lack the specialized knowledge needed to tackle the unique challenges of this sector, such as the intricate physics and chemistry of semiconductor devices and processes. SemiKong, the first industry-specific LLM for the semiconductor domain, provides a foundation that can be used to develop tailored proprietary models. With SemiKong 1.0, we aim to develop a foundational model capable of understanding etching problems at an expert level. Our key contributions include (a) curating a comprehensive corpus of semiconductor-related texts, (b) creating a foundational model with in-depth semiconductor knowledge, and (c) introducing a framework for integrating expert knowledge, thereby advancing the evaluation process of domain-specific AI models. Through fine-tuning a pre-trained LLM using our curated dataset, we have shown that SemiKong outperforms larger, general-purpose LLMs in various semiconductor manufacturing and design tasks. Our extensive experiments underscore the importance of developing domain-specific LLMs as a foundation for company- or tool-specific proprietary models, paving the way for further research and applications in the semiconductor domain. Code and dataset will be available at this https URL
- [545] arXiv:2411.13909 (replaced) [pdf, html, other]
-
Title: Panther: Illuminate the Sight of Multimodal LLMs with Instruction-Guided Visual PromptsSubjects: Computer Vision and Pattern Recognition (cs.CV)
Multimodal large language models (MLLMs) are rapidly closing the gap to human visual perception, yet still lag behind in attending to subtle image details or precisely locating small objects. Common schemes to tackle these issues include deploying multiple vision encoders or operating on original high-resolution images. Few studies have concentrated on incorporating textual instructions into visual representation learning, which can result in a loss of focus in some vision-centric tasks, a phenomenon we herein term Amblyopia. In this work, we introduce Panther, an MLLM that closely adheres to user instructions and locates targets of interest precisely, with the finesse of a black panther. Specifically, Panther comprises three integral components: Panther-VE, Panther-Bridge, and Panther-Decoder. Panther-VE integrates user instruction information at the early stages of the vision encoder, thereby extracting the most relevant and useful visual representations. The Panther-Bridge module, equipped with powerful filtering capabilities, significantly reduces redundant visual information, leading to substantial savings in training costs. The Panther-Decoder is versatile and can be employed with any decoder-only architecture of LLMs without discrimination. Experimental results, particularly on vision-centric benchmarks, have demonstrated the effectiveness of Panther.
- [546] arXiv:2411.13918 (replaced) [pdf, html, other]
-
Title: Quantization without TearsSubjects: Computer Vision and Pattern Recognition (cs.CV)
Deep neural networks, while achieving remarkable success across diverse tasks, demand significant resources, including computation, GPU memory, bandwidth, storage, and energy. Network quantization, as a standard compression and acceleration technique, reduces storage costs and enables potential inference acceleration by discretizing network weights and activations into a finite set of integer values. However, current quantization methods are often complex and sensitive, requiring extensive task-specific hyperparameters, where even a single misconfiguration can impair model performance, limiting generality across different models and tasks. In this paper, we propose Quantization without Tears (QwT), a method that simultaneously achieves quantization speed, accuracy, simplicity, and generality. The key insight of QwT is to incorporate a lightweight additional structure into the quantized network to mitigate information loss during quantization. This structure consists solely of a small set of linear layers, keeping the method simple and efficient. More importantly, it provides a closed-form solution, allowing us to improve accuracy effortlessly in under 2 minutes. Extensive experiments across various vision, language, and multimodal tasks demonstrate that QwT is both highly effective and versatile. In fact, our approach offers a robust solution for network quantization that combines simplicity, accuracy, and adaptability, which provides new insights for the design of novel quantization paradigms.
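The closed-form flavor of this idea can be illustrated with ridge-regularized least squares on calibration features; treating QwT's extra structure as exactly this linear correction is an assumption based on the abstract's description.

```python
# Hedged sketch of a closed-form linear compensation in the spirit of QwT:
# fit, by ridge least squares on calibration data, a small linear layer that
# maps a quantized module's features toward the full-precision features.
import numpy as np

def fit_compensation(x_quant, x_full, ridge=1e-4):
    """x_quant, x_full: (n_samples, d) features of the same layer computed
    by the quantized and full-precision networks on calibration inputs."""
    n, d = x_quant.shape
    X = np.hstack([x_quant, np.ones((n, 1))])   # append a bias column
    A = X.T @ X + ridge * np.eye(d + 1)
    B = X.T @ (x_full - x_quant)                # residual the correction recovers
    Wb = np.linalg.solve(A, B)                  # closed form -- no gradient steps
    return Wb[:-1], Wb[-1]                      # weight W and bias b

# At inference the compensated output is x_quant + x_quant @ W + b.
```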
- [547] arXiv:2411.14349 (replaced) [pdf, html, other]
-
Title: Agnostic Learning of Arbitrary ReLU Activation under Gaussian MarginalsSubjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
We consider the problem of learning an arbitrarily-biased ReLU activation (or neuron) over Gaussian marginals with the squared loss objective. Despite the ReLU neuron being the basic building block of modern neural networks, we still do not understand the basic algorithmic question of whether one arbitrary ReLU neuron is learnable in the non-realizable setting. In particular, all existing polynomial time algorithms only provide approximation guarantees for the better-behaved unbiased setting or restricted bias setting.
Our main result is a polynomial time statistical query (SQ) algorithm that gives the first constant factor approximation for arbitrary bias. It outputs a ReLU activation that achieves a loss of $O(\mathrm{OPT}) + \varepsilon$ in time $\mathrm{poly}(d,1/\varepsilon)$, where $\mathrm{OPT}$ is the loss obtained by the optimal ReLU activation. Our algorithm presents an interesting departure from existing algorithms, which are all based on gradient descent and thus fall within the class of correlational statistical query (CSQ) algorithms. We complement our algorithmic result by showing that no polynomial time CSQ algorithm can achieve a constant factor approximation. Together, these results shed light on the intrinsic limitation of gradient descent, while identifying arguably the simplest setting (a single neuron) where there is a separation between SQ and CSQ algorithms.
- [548] arXiv:1804.04780 (replaced) [pdf, html, other]
-
Title: A Grid Based Adversarial Clustering AlgorithmSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Nowadays more and more data are gathered for detecting and preventing cyber attacks. In cyber security applications, data analytics techniques have to deal with active adversaries that try to deceive the data analytics models and avoid being detected. The existence of such adversarial behavior motivates the development of robust and resilient adversarial learning techniques for various tasks. Most of the previous work focused on adversarial classification techniques, which assumed the existence of a reasonably large amount of carefully labeled data instances. However, in practice, labeling the data instances often requires costly and time-consuming human expertise and becomes a significant bottleneck. Meanwhile, a large number of unlabeled instances can also be used to understand the adversaries' behavior. To address the above-mentioned challenges, in this paper, we develop a novel grid-based adversarial clustering algorithm. Our adversarial clustering algorithm is able to identify the core normal regions and draw defensive walls around the centers of the normal objects using game-theoretic ideas. Our algorithm also identifies sub-clusters of attack objects, the overlapping areas within clusters, and outliers that may be potential anomalies.
- [549] arXiv:2009.03238 (replaced) [pdf, html, other]
-
Title: A Joint Network Optimization Framework to Predict Clinical Severity from Resting State Functional MRI DataNiharika Shimona D'Souza, Mary Beth Nebel, Nicholas Wymbs, Stewart H. Mostofsky, Archana VenkataramanSubjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
We propose a novel optimization framework to predict clinical severity from resting state fMRI (rs-fMRI) data. Our model consists of two coupled terms. The first term decomposes the correlation matrices into a sparse set of representative subnetworks that define a network manifold. These subnetworks are modeled as rank-one outer-products which correspond to the elemental patterns of co-activation across the brain; the subnetworks are combined via patient-specific non-negative coefficients. The second term is a linear regression model that uses the patient-specific coefficients to predict a measure of clinical severity. We validate our framework on two separate datasets in a ten-fold cross-validation setting. The first is a cohort of fifty-eight patients diagnosed with Autism Spectrum Disorder (ASD). The second dataset consists of sixty-three patients from a publicly available ASD database. Our method outperforms standard semi-supervised frameworks, which employ conventional graph theoretic and statistical representation learning techniques to relate the rs-fMRI correlations to behavior. In contrast, our joint network optimization framework exploits the structure of the rs-fMRI correlation matrices to simultaneously capture group level effects and patient heterogeneity. Finally, we demonstrate that our proposed framework robustly identifies clinically relevant networks characteristic of ASD.
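One plausible instantiation of the two coupled terms described above, written with illustrative notation that is not taken from the paper: let $\Gamma_n$ denote the rs-fMRI correlation matrix of patient $n$, $\mathbf{b}_k$ the rank-one subnetwork bases, $c_{nk} \ge 0$ the patient-specific coefficients, $y_n$ the clinical severity score, $\mathbf{w}$ the regression weights, and $\lambda$ a trade-off parameter. The joint objective would then read

$\min_{\{\mathbf{b}_k\},\, \{c_{nk} \ge 0\},\, \mathbf{w}} \ \sum_{n} \Big\| \Gamma_n - \sum_{k=1}^{K} c_{nk}\, \mathbf{b}_k \mathbf{b}_k^{T} \Big\|_F^2 \;+\; \lambda \sum_{n} \big( y_n - \mathbf{c}_n^{T} \mathbf{w} \big)^2,$

where the first term learns the network manifold and the second couples the patient-specific coefficients to clinical severity.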
- [550] arXiv:2203.09677 (replaced) [pdf, html, other]
-
Title: Geodesics and dynamical information projections on the manifold of H\"older equilibrium probabilitiesComments: Keywords: Geodesics; infinite-dimensional Riemannian manifold; equilibrium probabilities; KL-divergence; information projections; Pythagorean inequalities; Fourier-like basisSubjects: Dynamical Systems (math.DS); Information Theory (cs.IT); Mathematical Physics (math-ph); Differential Geometry (math.DG); Probability (math.PR)
We consider here the discrete time dynamics described by a transformation $T:M \to M$, where $T$ is either the action of the shift $T=\sigma$ on the symbolic space $M=\{1,2,...,d\}^\mathbb{N}$, or $T$ describes the action of a $d$ to $1$ expanding transformation $T:S^1 \to S^1$ of class $C^{1+\alpha}$ (for example, $x \mapsto T(x) = d\,x \pmod{1}$), where $M=S^1$ is the unit circle. It is known that the infinite-dimensional manifold $\mathcal{N}$ of equilibrium probabilities for Hölder potentials $A:M \to \mathbb{R}$ is an analytical manifold and carries a natural Riemannian metric associated with the asymptotic variance. We show here that, under the assumption of the existence of a Fourier-like Hilbert basis for the kernel of the Ruelle operator, geodesic paths exist. When $T=\sigma$ and $M=\{0,1\}^\mathbb{N}$, such a basis exists.
In a different direction, we also consider the KL-divergence $D_{KL}(\mu_1,\mu_2)$ for a pair of equilibrium probabilities. If $D_{KL}(\mu_1,\mu_2)=0$, then $\mu_1=\mu_2$. Although $D_{KL}$ is not a metric in $\mathcal{N}$, it describes the proximity between $\mu_1$ and $\mu_2$. A natural problem is: for a fixed probability $\mu_1\in \mathcal{N}$ consider the probability $\mu_2$ in a convex set of probabilities in $\mathcal{N}$ which minimizes $D_{KL}(\mu_1,\mu_2)$. This minimization problem is a dynamical version of the main issues considered in information projections. We consider this problem in $\mathcal{N}$, a case where all probabilities are dynamically invariant, getting explicit equations for the solution sought. Triangle and Pythagorean inequalities will be investigated.
- [551] arXiv:2205.14627 (replaced) [pdf, html, other]
-
Title: Continuous Generative Neural Networks: A Wavelet-Based Architecture in Function SpacesComments: 40 pages, 8 figuresJournal-ref: Numerical Functional Analysis and Optimization, 2024Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
In this work, we present and study Continuous Generative Neural Networks (CGNNs), namely, generative models in the continuous setting: the output of a CGNN belongs to an infinite-dimensional function space. The architecture is inspired by DCGAN, with one fully connected layer, several convolutional layers and nonlinear activation functions. In the continuous $L^2$ setting, the dimensions of the spaces of each layer are replaced by the scales of a multiresolution analysis of a compactly supported wavelet. We present conditions on the convolutional filters and on the nonlinearity that guarantee that a CGNN is injective. This theory finds applications to inverse problems, and allows for deriving Lipschitz stability estimates for (possibly nonlinear) infinite-dimensional inverse problems with unknowns belonging to the manifold generated by a CGNN. Several numerical simulations, including signal deblurring, illustrate and validate this approach.
- [552] arXiv:2212.08162 (replaced) [pdf, html, other]
-
Title: Huber-energy measure quantizationSubjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Statistics Theory (math.ST)
We describe a measure quantization procedure, i.e., an algorithm which finds the best approximation of a target probability law (and more generally a signed finite variation measure) by a sum of $Q$ Dirac masses ($Q$ being the quantization parameter). The procedure is implemented by minimizing the statistical distance between the original measure and its quantized version; the distance is built from a negative definite kernel and, if necessary, can be computed on the fly and fed to a stochastic optimization algorithm (such as SGD, Adam, ...). We investigate theoretically the fundamental question of the existence of the optimal measure quantizer and identify the kernel properties required to guarantee suitable behavior. We propose two best linear unbiased (BLUE) estimators for the squared statistical distance and use them in an unbiased procedure, called HEMQ, to find the optimal quantization. We test HEMQ on several databases: multi-dimensional Gaussian mixtures, Wiener space cubature, Italian wine cultivars and the MNIST image database. The results indicate that the HEMQ algorithm is robust and versatile and, for the class of Huber-energy kernels, matches the expected intuitive behavior.
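A minimal sketch of the core loop, using the plain energy kernel in place of the paper's Huber-energy family and fixing uniform weights $1/Q$ on the atoms (both simplifying assumptions): the positions of the $Q$ Dirac masses are optimized with Adam against a sample-based estimate of the energy distance.

import torch

def energy_distance(x, y):
    # 2*E||X-Y|| - E||X-X'|| - E||Y-Y'||, a sample-based energy statistic
    mean_pdist = lambda a, b: torch.cdist(a, b).mean()
    return 2 * mean_pdist(x, y) - mean_pdist(x, x) - mean_pdist(y, y)

torch.manual_seed(0)
target = torch.randn(2000, 2) * torch.tensor([1.0, 0.3])  # samples of the target law
Q = 16
atoms = torch.randn(Q, 2, requires_grad=True)             # Dirac mass locations
opt = torch.optim.Adam([atoms], lr=0.05)
for step in range(500):
    opt.zero_grad()
    loss = energy_distance(target, atoms)
    loss.backward()
    opt.step()
print(atoms.detach())  # Q points that summarize the target distribution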
- [553] arXiv:2212.12725 (replaced) [pdf, html, other]
-
Title: Deep Quadratic HedgingComments: Accepted version. Final edited version available at this https URLSubjects: Computational Finance (q-fin.CP); Numerical Analysis (math.NA); Probability (math.PR); Mathematical Finance (q-fin.MF)
We propose a novel computational procedure for quadratic hedging in high-dimensional incomplete markets, covering mean-variance hedging and local risk minimization. Starting from the observation that both quadratic approaches can be treated from the point of view of backward stochastic differential equations (BSDEs), we (recursively) apply a deep learning-based BSDE solver to compute the entire paths of the optimal hedging strategies. This allows us to overcome the curse of dimensionality, extending the scope of applicability of quadratic hedging to high dimensions. We test our approach with a classic Heston model and with a multi-asset and multi-factor generalization thereof, showing that it attains high levels of accuracy.
- [554] arXiv:2306.05857 (replaced) [pdf, html, other]
-
Title: How Sparse Can We Prune A Deep Network: A Fundamental Limit ViewpointSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Network pruning is a commonly used measure to alleviate the storage and computational burden of deep neural networks. However, a characterization of the fundamental limit of network pruning is still lacking. To close the gap, in this work we take a first-principles approach: we directly impose the sparsity constraint on the loss function and leverage the framework of statistical dimension in convex geometry, which allows us to characterize the sharp phase transition point, i.e., the fundamental limit of the pruning ratio. Through this limit, we identify two key factors that determine the pruning ratio limit, namely, weight magnitude and network sharpness. Generally speaking, the flatter the loss landscape or the smaller the weight magnitude, the smaller the pruning ratio limit. Moreover, we provide efficient countermeasures to address the challenges in computing the pruning limit, which involves accurate spectrum estimation of a large-scale and non-positive Hessian matrix. Furthermore, through the lens of the pruning ratio threshold, we can provide rigorous interpretations of several heuristics in existing pruning algorithms. Extensive experiments demonstrate that our theoretical pruning ratio threshold coincides very well with empirical results. All codes are available at: this https URL
- [555] arXiv:2309.17368 (replaced) [pdf, html, other]
-
Title: Machine Learning for Practical Quantum Error MitigationComments: 11 pages, 7 figures (main text) + 9 pages, 4 figures (supplementary information)Journal-ref: Nature Machine Intelligence (2024)Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Quantum computers are progressing toward outperforming classical supercomputers, but quantum errors remain their primary obstacle. The key to overcoming errors on near-term devices has emerged through the field of quantum error mitigation, enabling improved accuracy at the cost of additional run time. Here, through experiments on state-of-the-art quantum computers using up to 100 qubits, we demonstrate that, without sacrificing accuracy, machine learning for quantum error mitigation (ML-QEM) drastically reduces the cost of mitigation. We benchmark ML-QEM using a variety of machine learning models -- linear regression, random forests, multi-layer perceptrons, and graph neural networks -- on diverse classes of quantum circuits, over increasingly complex device-noise profiles, under interpolation and extrapolation, and in both numerics and experiments. These tests employ the popular digital zero-noise extrapolation method as an added reference. Finally, we propose a path toward scalable mitigation by using ML-QEM to mimic traditional mitigation methods with superior runtime efficiency. Our results show that classical machine learning can extend the reach and practicality of quantum error mitigation by reducing its overheads, and they highlight its broader potential for practical quantum computations.
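In the simplest regression-based variant of this idea, a classical model learns the map from noisy expectation values (plus cheap circuit features) to ideal ones. The toy below uses a random forest on synthetic data; the exponential-decay noise model and the choice of circuit depth as the sole extra feature are illustrative assumptions, not the paper's setup.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 2000
depth = rng.integers(1, 40, n)                                  # circuit depth feature
ideal = rng.uniform(-1, 1, n)                                   # true expectation values
noisy = ideal * np.exp(-0.02 * depth) + rng.normal(0, 0.02, n)  # assumed noise model

X = np.column_stack([noisy, depth])
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[:1500], ideal[:1500])
pred = model.predict(X[1500:])
print("unmitigated MAE:", np.mean(np.abs(noisy[1500:] - ideal[1500:])))
print("ML-mitigated MAE:", np.mean(np.abs(pred - ideal[1500:])))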
- [556] arXiv:2311.12214 (replaced) [pdf, other]
-
Title: Random Fourier Signature FeaturesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Tensor algebras give rise to one of the most powerful measures of similarity for sequences of arbitrary length, the signature kernel, which is accompanied by attractive theoretical guarantees from stochastic analysis. Previous algorithms for computing the signature kernel scale quadratically in both the length and the number of the sequences. To mitigate this severe computational bottleneck, we develop a random Fourier feature-based acceleration of the signature kernel acting on the inherently non-Euclidean domain of sequences. We show uniform approximation guarantees for the proposed unbiased estimator of the signature kernel, while keeping its computation linear in the sequence length and number. In addition, combined with recent advances on tensor projections, we derive two even more scalable time series features with favourable concentration properties and computational complexity both in time and memory. Our empirical results show that the reduction in computational cost comes at a negligible price in terms of accuracy on moderate-sized datasets, and it enables one to scale to large datasets up to a million time series.
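For readers unfamiliar with the underlying trick, below is the classical random Fourier feature construction for the RBF kernel on vectors, which the paper adapts to signature kernels on sequences; the sketch is the textbook version, not the paper's sequence-level estimator.

import numpy as np

def rff(x, n_features=512, gamma=1.0, seed=0):
    # Random Fourier features: E[phi(x) . phi(y)] = exp(-gamma * ||x - y||^2)
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(x.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ W + b)

rng = np.random.default_rng(1)
x, y = rng.normal(size=(100, 5)), rng.normal(size=(100, 5))
approx = rff(x) @ rff(y).T                   # linear in the number of points
exact = np.exp(-((x[:, None] - y[None]) ** 2).sum(-1))
print(np.abs(approx - exact).max())          # small uniform approximation error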
- [557] arXiv:2312.09384 (replaced) [pdf, html, other]
-
Title: Modeling Epidemic Spread: A Gaussian Process Regression ApproachComments: The code for the analyses is available at this https URLSubjects: Machine Learning (stat.ML); Systems and Control (eess.SY); Physics and Society (physics.soc-ph)
Modeling epidemic spread is critical for informing policy decisions aimed at mitigation. Accordingly, in this work we present a new data-driven method based on Gaussian process regression (GPR) that models epidemic spread through the difference of infected case counts on the logarithmic scale. We bound the variance of the predictions made by GPR, which quantifies the impact of epidemic data on the proposed model. Next, we derive a high-probability error bound on the prediction error in terms of the distance between the training points and a testing point, the posterior variance, and the level of change in the spreading process, and we assess how the characteristics of the epidemic spread and infection data influence this error bound. We present examples that use GPR to model and predict epidemic spread by using real-world infection data gathered in the UK during the COVID-19 epidemic. These examples illustrate that, under typical conditions, the prediction for the next twenty days places 94.29% of the noisy data within the 95% confidence interval, validating these predictions. We further compare the modeling and prediction results with other methods, such as polynomial regression, k-nearest neighbors (KNN) regression, and neural networks, to demonstrate the benefits of leveraging GPR in disease spread modeling.
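A minimal sketch of this setup on synthetic data: regress the day-to-day difference of log case counts on time with GPR and check confidence-interval coverage on held-out days. The kernel (RBF plus white noise) and the synthetic epidemic curve are assumptions for illustration.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
t = np.arange(100.0)
cases = 1000 * np.exp(0.05 * t - 0.0004 * t ** 2) * rng.lognormal(0, 0.02, t.size)
y = np.diff(np.log(cases))                    # difference of cases on the log scale
X = t[1:, None]

gpr = GaussianProcessRegressor(RBF(10.0) + WhiteKernel(1e-3)).fit(X[:80], y[:80])
mean, std = gpr.predict(X[80:], return_std=True)
inside = np.abs(y[80:] - mean) <= 1.96 * std  # 95% confidence interval check
print(f"{inside.mean():.0%} of held-out days fall inside the 95% CI")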
- [558] arXiv:2402.05838 (replaced) [pdf, html, other]
-
Title: Introducing q-deformed binomial coefficients of wordsComments: 25 pages, submittedSubjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM); Formal Languages and Automata Theory (cs.FL)
Gaussian binomial coefficients are q-analogues of the binomial coefficients of integers. On the other hand, binomial coefficients have been extended to finite words, i.e., elements of finitely generated free monoids. In this paper we bring together these two notions by introducing q-analogues of binomial coefficients of words. We study their basic properties, e.g., by extending classical formulas such as the q-Vandermonde identity and the identities of Manvel et al. to our setting. As a consequence, we get information about the structure of the considered words: these q-deformations of binomial coefficients of words contain much richer information than the original coefficients. From an algebraic perspective, we introduce a q-shuffle product and a family of q-infiltration products for non-commutative formal power series. Finally, we apply our results to generalize a theorem of Eilenberg characterizing so-called p-group languages. We show that a language is of this type if and only if it is a Boolean combination of specific languages defined through q-binomial coefficients seen as polynomials over $\mathbb{F}_p$.
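For orientation, the classical integer-level object being generalized is easy to compute symbolically; the snippet below evaluates Gaussian binomial coefficients via the product formula $\binom{n}{k}_q = \prod_{i=0}^{k-1} (1-q^{n-i})/(1-q^{i+1})$. The word-level q-deformation introduced in the paper is not reproduced here.

from sympy import symbols, prod, cancel, expand

q = symbols('q')

def q_binomial(n, k):
    # Gaussian binomial coefficient, the q-analogue of C(n, k)
    if not 0 <= k <= n:
        return 0
    num = prod(1 - q ** (n - i) for i in range(k))
    den = prod(1 - q ** (i + 1) for i in range(k))
    return expand(cancel(num / den))

print(q_binomial(4, 2))   # q**4 + q**3 + 2*q**2 + q + 1, which is C(4, 2) = 6 at q = 1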
- [559] arXiv:2403.07247 (replaced) [pdf, html, other]
-
Title: GuideGen: A Text-Guided Framework for Full-torso Anatomy and CT Volume GenerationComments: submitted to CVPR2025Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The recently emerging conditional diffusion models seem promising for mitigating the labor and expenses in building large 3D medical imaging datasets. However, previous studies on 3D CT generation have yet to fully capitalize on semantic and textual conditions, and they have primarily focused on specific organs characterized by a local structure and fixed contrast. In this work, we present GuideGen, a controllable framework that generates anatomical masks and corresponding CT volumes for the entire torso, from chest to pelvis, based on free-form text prompts. Our approach includes three core components: a text-conditional semantic synthesizer for creating realistic full-torso anatomies; a contrast-aware autoencoder for detailed, high-fidelity feature extraction across varying contrast levels; and a latent feature generator that ensures alignment between CT images, anatomical semantics, and input prompts. To train and evaluate GuideGen, we compile a multi-modality cancer imaging dataset with paired CT and clinical descriptions from 12 public TCIA datasets and one private real-world dataset. Comprehensive evaluations across generation quality, cross-modality alignment, and data usability on multi-organ and tumor segmentation tasks demonstrate GuideGen's superiority over existing CT generation methods.
- [560] arXiv:2404.06535 (replaced) [pdf, html, other]
-
Title: Learning to rank quantum circuits for hardware-optimized performance enhancementComments: 18 pages, 5 figuresSubjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
We introduce and experimentally test a machine-learning-based method for ranking logically equivalent quantum circuits based on expected performance estimates derived from a training procedure conducted on real hardware. We apply our method to the problem of layout selection, in which abstracted qubits are assigned to physical qubits on a given device. Circuit measurements performed on IBM hardware indicate that the maximum and median fidelities of logically equivalent layouts can differ by an order of magnitude. We introduce a circuit score used for ranking that is parameterized in terms of a physics-based, phenomenological error model whose parameters are fit by training a ranking-loss function over a measured dataset. The dataset consists of quantum circuits exhibiting a diversity of structures and executed on IBM hardware, allowing the model to incorporate the contextual nature of real device noise and errors without the need to perform an exponentially costly tomographic protocol. We perform model training and execution on the 16-qubit ibmq_guadalupe device and compare our method to two common approaches: random layout selection and a publicly available baseline called Mapomatic. Our model consistently outperforms both approaches, predicting layouts that exhibit lower noise and higher performance. In particular, we find that our best model leads to a $1.8\times$ reduction in selection error when compared to the baseline approach and a $3.2\times$ reduction when compared to random selection. Beyond delivering a new form of predictive quantum characterization, verification, and validation, our results reveal the specific way in which context-dependent and coherent gate errors appear to dominate the divergence from performance estimates extrapolated from simple proxy measures.
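The core of the method is a ranking loss fit to measured layout performance. Below is a stand-alone sketch of that ingredient: a pairwise logistic (RankNet-style) loss trains the weights of a simple additive error model so that layouts with higher measured fidelity receive higher scores. The feature construction and the synthetic fidelity model are assumptions, not the paper's error model.

import torch

torch.manual_seed(0)
n_layouts, n_feats = 200, 6
feats = torch.rand(n_layouts, n_feats)            # e.g., per-layout gate/error counts
true_w = torch.tensor([0.5, 1.0, 0.2, 0.8, 0.1, 0.4])
fidelity = torch.exp(-(feats @ true_w)) + 0.01 * torch.randn(n_layouts)

w = torch.zeros(n_feats, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.05)
for step in range(300):
    i = torch.randint(0, n_layouts, (256,))
    j = torch.randint(0, n_layouts, (256,))
    score_gap = -((feats[i] - feats[j]) @ w)      # higher score = lower predicted error
    label = (fidelity[i] > fidelity[j]).float()
    loss = torch.nn.functional.binary_cross_entropy_with_logits(score_gap, label)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(w.detach())  # recovered error-model weights, up to an overall scale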
- [561] arXiv:2404.08748 (replaced) [pdf, other]
-
Title: Multi-Branch Generative Models for Multichannel Imaging with an Application to PET/CT Synergistic ReconstructionComments: 12 pages, 17 figures, 2 tables, submitted to IEEE TRPMSSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
This paper presents a novel approach for learned synergistic reconstruction of medical images using multi-branch generative models. Leveraging variational autoencoders (VAEs), our model learns from pairs of images simultaneously, enabling effective denoising and reconstruction. Synergistic image reconstruction is achieved by incorporating the trained models in a regularizer that evaluates the distance between the images and the model. We demonstrate the efficacy of our approach on both Modified National Institute of Standards and Technology (MNIST) and positron emission tomography (PET)/computed tomography (CT) datasets, showcasing improved image quality for low-dose imaging. Despite challenges such as patch decomposition and model limitations, our results underscore the potential of generative models for enhancing medical imaging reconstruction.
- [562] arXiv:2405.13063 (replaced) [pdf, html, other]
-
Title: A Foundation Model for the Earth SystemCristian Bodnar, Wessel P. Bruinsma, Ana Lucic, Megan Stanley, Anna Vaughan, Johannes Brandstetter, Patrick Garvan, Maik Riechert, Jonathan A. Weyn, Haiyu Dong, Jayesh K. Gupta, Kit Thambiratnam, Alexander T. Archibald, Chun-Chieh Wu, Elizabeth Heider, Max Welling, Richard E. Turner, Paris PerdikarisSubjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Reliable forecasts of the Earth system are crucial for human progress and safety from natural disasters. Artificial intelligence offers substantial potential to improve prediction accuracy and computational efficiency in this field; however, this potential remains underexplored in many domains. Here we introduce Aurora, a large-scale foundation model for the Earth system trained on over a million hours of diverse data. Aurora outperforms operational forecasts for air quality, ocean waves, tropical cyclone tracks, and high-resolution weather forecasting at orders of magnitude smaller computational expense than dedicated existing systems. With the ability to fine-tune Aurora to diverse application domains at only modest computational cost, Aurora represents significant progress in making actionable Earth system predictions accessible to anyone.
- [563] arXiv:2405.19552 (replaced) [pdf, html, other]
-
Title: Point process analysis of geographical diffusion of news in ArgentinaSubjects: Physics and Society (physics.soc-ph); Social and Information Networks (cs.SI)
The diffusion of information plays a crucial role in a society, affecting its economy and the well-being of the population. Characterizing the diffusion process is challenging because it is highly non-stationary and varies with the media type. To understand the spreading of newspaper news in Argentina, we collected data from more than 27,000 articles published in six main provinces over four months. We classified the articles into 20 thematic axes and obtained a set of time series that capture daily newspaper attention on different topics in different provinces. To analyze the data we use a point process approach. For each topic, $n$, and for all pairs of provinces, $i$ and $j$, we use two measures to quantify the synchronicity of the events: $Q_s(i,j)$, which quantifies the number of events that occur almost simultaneously in $i$ and $j$, and $Q_a(i,j)$, which quantifies the direction of news spreading. Our analysis unveils how fast the information diffusion process is, showing pairs of provinces with very similar and almost simultaneous temporal variations of media attention. On the other hand, we also compute other measures from the raw time series, such as Granger causality and transfer entropy, which do not perform well in this context because they often return opposite directions of information transfer. We interpret this as due to different factors, such as the characteristics of the data, which are highly non-stationary, and the features of the information diffusion process, which is very fast and probably acts at a sub-resolution time scale.
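An illustrative reconstruction of the two event-synchronicity measures on binary daily event series; the exact definitions of $Q_s$ and $Q_a$ in the paper may differ, so the tolerance window and the sign convention below are assumptions.

import numpy as np

def sync_measures(events_i, events_j, tol=1):
    # Q_s-like count of near-simultaneous events, Q_a-like lead/lag balance
    ti, tj = np.flatnonzero(events_i), np.flatnonzero(events_j)
    lags = ti[:, None] - tj[None, :]                 # positive: i occurs after j
    q_s = np.sum(np.abs(lags) <= tol)
    q_a = np.sum((lags > 0) & (lags <= tol)) - np.sum((lags < 0) & (lags >= -tol))
    return q_s, q_a                                  # q_a < 0: events in i precede j

rng = np.random.default_rng(0)
a = rng.random(120) < 0.1                            # daily attention events, province i
b = np.roll(a, 1)                                    # province j reacts one day later
print(sync_measures(a, b))                           # large q_s, negative q_a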
- [564] arXiv:2407.16877 (replaced) [pdf, html, other]
-
Title: Neural Network-Based Bandit: A Medium Access Control for the IIoT Alarm ScenarioSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Efficient Random Access (RA) is critical for enabling reliable communication in Industrial Internet of Things (IIoT) networks. Herein, we propose a deep reinforcement learning based distributed RA scheme, entitled Neural Network-Based Bandit (NNBB), for the IIoT alarm scenario. In such a scenario, the devices may detect a common critical event, and the goal is to ensure the alarm information is delivered successfully from at least one device. The proposed NNBB scheme is implemented at each device, where it trains itself online and establishes implicit inter-device coordination to achieve the common goal. Devices can transmit simultaneously on multiple orthogonal channels and each possible transmission pattern constitutes a possible action for the NNBB, which uses a deep neural network to determine the action. Our simulation results show that as the number of devices in the network increases, so does the performance gain of the NNBB compared to the Multi-Armed Bandit (MAB) RA benchmark. For instance, NNBB experiences a 7% success rate drop when there are four channels and the number of devices increases from 10 to 60, while MAB faces a 25% drop.
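A toy version of the Multi-Armed Bandit benchmark mentioned above, to make the action space concrete: each device runs an independent epsilon-greedy bandit whose actions are transmission patterns over $K$ channels, and the shared reward is whether the alarm got through. The collision model (success iff some channel has exactly one transmitter) and all hyperparameters are illustrative assumptions, not the paper's protocol.

import numpy as np

rng = np.random.default_rng(0)
n_dev, K, eps, rounds = 10, 4, 0.1, 5000
n_actions = 2 ** K                                   # every subset of channels
Q = np.zeros((n_dev, n_actions))
N = np.ones((n_dev, n_actions))
wins = 0
for t in range(rounds):
    explore = rng.random(n_dev) < eps
    a = np.where(explore, rng.integers(0, n_actions, n_dev), Q.argmax(1))
    tx = (a[:, None] >> np.arange(K)) & 1            # channel bits per device
    success = bool(np.any(tx.sum(0) == 1))           # one lone transmitter somewhere
    wins += success
    for d in range(n_dev):                           # all devices share the reward
        N[d, a[d]] += 1
        Q[d, a[d]] += (success - Q[d, a[d]]) / N[d, a[d]]
print("alarm delivery rate:", wins / rounds)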
- [565] arXiv:2409.13548 (replaced) [pdf, html, other]
-
Title: Data Diet: Can Trimming PET/CT Datasets Enhance Lesion Segmentation?Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
In this work, we describe our approach to compete in the autoPET3 data-centric track. While conventional wisdom suggests that larger datasets lead to better model performance, recent studies indicate that excluding certain training samples can enhance model accuracy. We find that in the autoPETIII dataset, a model that is trained on the entire dataset exhibits undesirable characteristics by producing a large number of false positives, particularly for PSMA PETs. We counteract this by removing the easiest samples from the training dataset, as measured by the model loss, before retraining from scratch. Using the proposed approach, we drive down the false negative volume and improve upon the baseline model in both false negative volume and Dice score on the preliminary test set. Code and pre-trained models are available at this http URL.
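The trimming step itself is simple to express. The sketch below ranks training samples by per-sample loss under a trained model and drops the easiest fraction before retraining; the drop fraction and the helper's interface are assumptions for illustration.

import torch

def trim_easiest(model, dataset, loss_fn, drop_frac=0.1):
    # loss_fn must return per-sample losses,
    # e.g. torch.nn.CrossEntropyLoss(reduction='none')
    model.eval()
    losses = []
    with torch.no_grad():
        for x, y in torch.utils.data.DataLoader(dataset, batch_size=64):
            losses.append(loss_fn(model(x), y))
    losses = torch.cat(losses)
    n_keep = int(len(losses) * (1 - drop_frac))
    keep = torch.argsort(losses, descending=True)[:n_keep]   # hardest samples stay
    return torch.utils.data.Subset(dataset, keep.tolist())

After this call, the model would be re-initialized and trained from scratch on the returned subset.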
- [566] arXiv:2410.00903 (replaced) [pdf, html, other]
-
Title: Causal Representation Learning with Generative Artificial Intelligence: Application to Texts as TreatmentsSubjects: Applications (stat.AP); Computation and Language (cs.CL); Machine Learning (cs.LG)
In this paper, we demonstrate how to enhance the validity of causal inference with unstructured high-dimensional treatments like texts, by leveraging the power of generative Artificial Intelligence. Specifically, we propose to use a deep generative model such as large language models (LLMs) to efficiently generate treatments and use their internal representation for subsequent causal effect estimation. We show that the knowledge of this true internal representation helps disentangle the treatment features of interest, such as specific sentiments and certain topics, from other possibly unknown confounding features. Unlike the existing methods, our proposed approach eliminates the need to learn causal representation from the data and hence produces more accurate and efficient estimates. We formally establish the conditions required for the nonparametric identification of the average treatment effect, propose an estimation strategy that avoids the violation of the overlap assumption, and derive the asymptotic properties of the proposed estimator through the application of double machine learning. Finally, using an instrumental variables approach, we extend the proposed methodology to settings in which the treatment feature is based on human perception rather than assumed to be fixed given the treatment object. The proposed methodology is also applicable to text reuse, where an LLM is used to regenerate existing texts. We conduct simulation and empirical studies, using text data generated by an open-source LLM, Llama 3, to illustrate the advantages of our estimator over state-of-the-art causal representation learning algorithms.
- [567] arXiv:2411.05771 (replaced) [pdf, html, other]
-
Title: Sketched Equivariant Imaging Regularization and Deep Internal Learning for Inverse ProblemsSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Optimization and Control (math.OC)
Equivariant Imaging (EI) regularization has become the de facto technique for unsupervised training of deep imaging networks, without any need for ground-truth data. Observing that the EI-based unsupervised training paradigm currently has significant computational redundancy, leading to inefficiency in high-dimensional applications, we propose a sketched EI regularization which leverages randomized sketching techniques for acceleration. We then extend our sketched EI regularization to develop an accelerated deep internal learning framework -- Sketched Equivariant Deep Image Prior (Sk-EI-DIP), which can be efficiently applied for single-image and task-adapted reconstruction. Additionally, for network adaptation tasks, we propose a parameter-efficient approach for accelerating both EI-DIP and Sk-EI-DIP by optimizing only the normalization layers. Our numerical study on X-ray CT image reconstruction tasks demonstrates that our approach can achieve order-of-magnitude computational acceleration over the standard EI-based counterpart in the single-input setting, as well as efficient network adaptation at test time.
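For concreteness, here is a schematic of an EI-style training loss with a random sketch applied to the equivariance residual. This is an illustration of the general idea only: the paper's actual sketching operator, its placement, and the loss weighting may differ.

import torch

def sketched_ei_loss(f, A, y, transform, sketch_dim=32):
    # f: reconstruction network, A: forward operator, y: measurements,
    # transform: random group action (e.g., rotation or shift)
    x1 = f(y)                              # reconstruct from the measurement
    x2 = transform(x1)                     # transformed image
    x3 = f(A(x2))                          # re-reconstruct from simulated measurement
    r = (x3 - x2).flatten(1)               # equivariance residual
    S = torch.randn(r.shape[1], sketch_dim, device=r.device) / sketch_dim ** 0.5
    mc = (A(x1) - y).flatten(1)            # measurement-consistency residual
    return (r @ S).pow(2).sum() + mc.pow(2).sum()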
- [568] arXiv:2411.09064 (replaced) [pdf, html, other]
-
Title: Minimax Optimal Two-Sample Testing under Local Differential PrivacyComments: 66 pages, 6 figures, 1 table; added a graphical illustration of central and local differential privacy in Section 1, referenced the Python package, fixed typos, and changed the citation styleSubjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
We explore the trade-off between privacy and statistical utility in private two-sample testing under local differential privacy (LDP) for both multinomial and continuous data. We begin by addressing the multinomial case, where we introduce private permutation tests using practical privacy mechanisms such as Laplace, discrete Laplace, and Google's RAPPOR. We then extend our multinomial approach to continuous data via binning and study its uniform separation rates under LDP over Hölder and Besov smoothness classes. The proposed tests for both discrete and continuous cases rigorously control the type I error for any finite sample size, strictly adhere to LDP constraints, and achieve minimax separation rates under LDP. The attained minimax rates reveal inherent privacy-utility trade-offs that are unavoidable in private testing. To address scenarios with unknown smoothness parameters in density testing, we propose an adaptive test based on a Bonferroni-type approach that ensures robust performance without prior knowledge of the smoothness parameters. We validate our theoretical findings with extensive numerical experiments and demonstrate the practical relevance and effectiveness of our proposed methods.
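A minimal sketch of the multinomial case under LDP: each sample's one-hot vector is privatized with Laplace noise (one of the standard mechanisms the abstract lists), and a permutation test calibrates a simple distance statistic. The statistic and noise scale are illustrative choices; the paper's tests and guarantees are more refined.

import numpy as np

def ldp_perm_test(x, y, k, eps, n_perm=999, seed=0):
    rng = np.random.default_rng(seed)
    def privatize(z):
        onehot = np.eye(k)[z]
        # L1 sensitivity of a one-hot vector is 2, so scale 2/eps gives eps-LDP
        return onehot + rng.laplace(scale=2.0 / eps, size=onehot.shape)
    px, py = privatize(x), privatize(y)
    stat = lambda a, b: np.sum((a.mean(0) - b.mean(0)) ** 2)
    obs = stat(px, py)
    pooled = np.vstack([px, py])
    perms = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        perms.append(stat(pooled[idx[:len(px)]], pooled[idx[len(px):]]))
    return (1 + np.sum(np.array(perms) >= obs)) / (n_perm + 1)  # finite-sample valid

rng = np.random.default_rng(1)
x, y = rng.integers(0, 4, 200), rng.integers(0, 4, 200)  # same distribution
print(ldp_perm_test(x, y, k=4, eps=1.0))                 # p-value, typically large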
- [569] arXiv:2411.09075 (replaced) [pdf, other]
-
Title: Weak Poincar\'e Inequalities, Simulated Annealing, and Sampling from Spherical Spin GlassesComments: 94 pages, removed an incorrect application to the ferromagnetic Potts modelSubjects: Probability (math.PR); Disordered Systems and Neural Networks (cond-mat.dis-nn); Data Structures and Algorithms (cs.DS); Mathematical Physics (math-ph)
There has been a recent surge of powerful tools to show rapid mixing of Markov chains, via functional inequalities such as Poincaré inequalities. In many situations, Markov chains fail to mix rapidly from a worst-case initialization, yet are expected to approximately sample from a random initialization. For example, this occurs if the target distribution has metastable states, small clusters accounting for a vanishing fraction of the mass that are essentially disconnected from the bulk of the measure. Under such conditions, a Poincaré inequality cannot hold, necessitating new tools to prove sampling guarantees.
We develop a framework to analyze simulated annealing, based on establishing so-called weak Poincaré inequalities. These inequalities imply mixing from a suitably warm start, and simulated annealing provides a way to chain such warm starts together into a sampling algorithm. We further identify a local-to-global principle to prove weak Poincaré inequalities, mirroring the spectral independence and localization schemes frameworks for analyzing mixing times of Markov chains.
As our main application, we prove that simulated annealing samples from the Gibbs measure of a spherical spin glass for inverse temperatures up to a natural threshold, matching recent algorithms based on algorithmic stochastic localization. This provides the first Markov chain sampling guarantee that holds beyond the uniqueness threshold for spherical spin glasses, where mixing from a worst-case initialization is provably slow due to the presence of metastable states. As an ingredient in our proof, we prove bounds on the operator norm of the covariance matrix of spherical spin glasses in the full replica-symmetric regime.
Additionally, we resolve a question related to sampling using data-based initializations.
- [570] arXiv:2411.11458 (replaced) [pdf, html, other]
-
Title: HistoEncoder: a digital pathology foundation model for prostate cancerJoona Pohjonen, Abderrahim-Oussama Batouche, Antti Rannikko, Kevin Sandeman, Andrew Erickson, Esa Pitkanen, Tuomas MirttiSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Foundation models are trained on massive amounts of data to distinguish complex patterns and can be adapted to a wide range of downstream tasks with minimal computational resources. Here, we develop a foundation model for prostate cancer digital pathology called HistoEncoder by pre-training on 48 million prostate tissue tile images. We demonstrate that HistoEncoder features extracted from tile images with similar histological patterns map closely together in the feature space. HistoEncoder outperforms models pre-trained with natural images, even without fine-tuning or with 1000 times less training data. We describe two use cases that leverage the capabilities of HistoEncoder by fine-tuning the model with a limited amount of data and computational resources. First, we show how HistoEncoder can be used to automatically annotate large-scale datasets with high accuracy. Second, we combine histomics with commonly used clinical nomograms, significantly improving prostate cancer-specific death survival models. Foundation models such as HistoEncoder can allow organizations with limited resources to build effective clinical software tools without needing extensive datasets or significant amounts of computing.
- [571] arXiv:2411.13490 (replaced) [pdf, html, other]
-
Title: Efficient Brain Imaging Analysis for Alzheimer's and Dementia Detection Using Convolution-Derivative OperationsSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Performance (cs.PF)
Alzheimer's disease (AD) is characterized by progressive neurodegeneration and results in detrimental structural changes in human brains. Detecting these changes is crucial for early diagnosis and timely intervention in disease progression. Jacobian maps, derived from spatial normalization in voxel-based morphometry (VBM), have been instrumental in interpreting volume alterations associated with AD. However, the computational cost of generating Jacobian maps limits their clinical adoption. In this study, we explore alternative methods and propose the Sobel kernel angle difference (SKAD) as a computationally efficient alternative. SKAD is a derivative operation that offers an optimized approach to quantifying volumetric alterations through localized analysis of the gradients. By efficiently extracting gradient amplitude changes at critical spatial regions, this derivative operation captures regional volume variations. Evaluation of SKAD over various medical datasets demonstrates that it is 6.3x faster than Jacobian maps while still maintaining comparable accuracy. This makes it an efficient and competitive approach in neuroimaging research and clinical practice.
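A 2D reconstruction of the gradient-angle idea, assuming SKAD compares the orientations of Sobel gradients between two registered images; the paper's exact 3D formulation and region selection are not reproduced here.

import numpy as np
from scipy.ndimage import sobel

def skad_map(img_a, img_b, eps=1e-8):
    # Angle (in degrees) between local Sobel gradient vectors of two images
    ga = np.stack([sobel(img_a, axis=0), sobel(img_a, axis=1)])
    gb = np.stack([sobel(img_b, axis=0), sobel(img_b, axis=1)])
    cos = (ga * gb).sum(0) / (
        np.linalg.norm(ga, axis=0) * np.linalg.norm(gb, axis=0) + eps)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

rng = np.random.default_rng(0)
a = rng.random((64, 64))
b = np.rot90(a)                 # a crude stand-in for structural change
print(skad_map(a, b).mean())    # mean gradient-angle change across the image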
- [572] arXiv:2411.14078 (replaced) [pdf, html, other]
-
Title: Self-supervised learning for radio-astronomy source classification: a benchmarkThomas Cecconello, Simone Riggi, Ugo Becciani, Fabio Vitello, Andrew M. Hopkins, Giuseppe Vizzari, Concetto Spampinato, Simone PalazzoSubjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
The upcoming Square Kilometer Array (SKA) telescope marks a significant step forward in radio astronomy, presenting new opportunities and challenges for data analysis. Traditional visual models pretrained on optical photography images may not perform optimally on radio interferometry images, which have distinct visual characteristics.
Self-Supervised Learning (SSL) offers a promising approach to address this issue, leveraging the abundant unlabeled data in radio astronomy to train neural networks that learn useful representations from radio images. This study explores the application of SSL to radio astronomy, comparing the performance of SSL-trained models with that of traditional models pretrained on natural images, evaluating the importance of data curation for SSL, and assessing the potential benefits of self-supervision to different domain-specific radio astronomy datasets.
Our results indicate that SSL-trained models achieve significant improvements over the baseline in several downstream tasks, especially in the linear evaluation setting; when the entire backbone is fine-tuned, the benefits of SSL are less evident, but SSL-trained models still outperform those pretrained on natural images. These findings suggest that SSL can play a valuable role in efficiently enhancing the analysis of radio astronomical data. The trained models and code are available at: \url{this https URL}
- [573] arXiv:2411.14390 (replaced) [pdf, html, other]
-
Title: Persistent Homology for Structural Characterization in Disordered SystemsComments: 19 pages, 17 figuresSubjects: Disordered Systems and Neural Networks (cond-mat.dis-nn); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Mathematical Physics (math-ph)
We propose a unified framework based on persistent homology (PH) to characterize both local and global structures in disordered systems. It can simultaneously generate local and global descriptors using the same algorithm and data structure, and it has been shown to be highly effective and interpretable in predicting particle rearrangements and classifying global phases. Based on this framework, we define a non-parametric metric, the Separation Index (SI), which not only outperforms traditional bond-orientational order parameters in phase classification tasks but also establishes a connection between particle environments and the global phase structure. Our methods provide an effective framework for understanding and analyzing the properties of disordered materials, with broad potential applications in materials science and even wider studies of complex systems.
- [574] arXiv:2411.14412 (replaced) [pdf, html, other]
-
Title: Adversarial Poisoning Attack on Quantum Machine Learning ModelsSubjects: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
With the growing interest in Quantum Machine Learning (QML) and the increasing availability of quantum computers through cloud providers, addressing the potential security risks associated with QML has become an urgent priority. One key concern in the QML domain is the threat of data poisoning attacks in the current quantum cloud setting. Adversarial access to training data could severely compromise the integrity and availability of QML models. Classical data poisoning techniques require significant knowledge and training to generate poisoned data, and lack noise resilience, making them ineffective for QML models in the Noisy Intermediate Scale Quantum (NISQ) era. In this work, we first propose a simple yet effective technique to measure intra-class encoder state similarity (ESS) by analyzing the outputs of encoding circuits. Leveraging this approach, we introduce a quantum indiscriminate data poisoning attack, QUID. Through extensive experiments conducted in both noiseless and noisy environments (e.g., IBM\_Brisbane's noise), across various architectures and datasets, QUID achieves up to $92\%$ accuracy degradation in model performance compared to baseline models and up to $75\%$ accuracy degradation compared to random label-flipping. We also tested QUID against state-of-the-art classical defenses, with accuracy degradation still exceeding $50\%$, demonstrating its effectiveness. This work represents the first attempt to reevaluate data poisoning attacks in the context of QML.
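The encoder state similarity measurement described above can be illustrated directly on simulated statevectors: average the pairwise fidelities $|\langle\psi_i|\psi_j\rangle|^2$ among encoded states of the same class. Working with raw statevectors is a simulator-side simplification; the paper's ESS on hardware would have to be estimated from measurement outcomes.

import numpy as np

def intra_class_ess(states_by_class):
    # Mean pairwise fidelity among same-class encoded states, self-pairs excluded
    out = {}
    for label, states in states_by_class.items():
        S = np.asarray(states)               # shape: (n_samples, 2**n_qubits)
        G = np.abs(S.conj() @ S.T) ** 2      # pairwise fidelities |<psi_i|psi_j>|^2
        n = len(S)
        out[label] = (G.sum() - n) / (n * (n - 1))
    return out

rng = np.random.default_rng(0)
psi = rng.normal(size=(5, 8)) + 1j * rng.normal(size=(5, 8))
psi /= np.linalg.norm(psi, axis=1, keepdims=True)    # normalize to valid states
print(intra_class_ess({0: psi}))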