Controlling Language and Diffusion Models by Transporting Activations

Rodriguez, Pau; Blaas, Arno; Klein, Michal; Zappella, Luca; Apostoloff, Nicholas; Cuturi, Marco; Suau, Xavier

arXiv:2410.23054 (cs)

[Submitted on 30 Oct 2024 (v1), last revised 22 Nov 2024 (this version, v2)]

Abstract: The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed controlling model generation by steering model activations, in order to induce or prevent the emergence of concepts or behaviors in the generated output. In this paper we introduce Activation Transport (AcT), a general framework for steering activations, guided by optimal transport theory, that generalizes many previous activation-steering works. AcT is modality-agnostic and provides fine-grained control over model behavior with negligible computational overhead, while minimally impacting model abilities. We experimentally show the effectiveness and versatility of our approach by addressing key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is). For LLMs, we show that AcT can effectively mitigate toxicity, induce arbitrary concepts, and increase truthfulness. In T2Is, we show how AcT enables fine-grained style control and concept negation.
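The abstract does not spell out the form of the transport maps. As one illustration of steering activations with optimal transport, the sketch below assumes that each unit is steered independently by an affine map estimated from activations collected on "source" prompts (e.g., toxic) and "target" prompts (e.g., non-toxic). It relies on a standard fact: in one dimension, the optimal transport plan between two empirical distributions of equal size pairs samples in sorted order. The affine map is then fitted to those sorted pairs by least squares. The function names, the per-unit independence, and the `strength` interpolation parameter are hypothetical choices for this sketch, not the authors' implementation.

```python
from typing import List, Tuple


def fit_affine_transport(source: List[float], target: List[float]) -> Tuple[float, float]:
    """Fit a 1D affine map a -> w * a + b approximating the OT map source -> target.

    Assumption for this sketch: both samples have the same size, so the 1D optimal
    transport plan simply matches them in sorted order; w and b are then the
    least-squares fit on those sorted pairs.
    """
    if len(source) != len(target) or len(source) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    xs, ys = sorted(source), sorted(target)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mx) ** 2 for x in xs)
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    # A constant source unit cannot be rescaled; fall back to a pure shift.
    w = cov_xy / var_x if var_x > 0 else 1.0
    b = my - w * mx
    return w, b


def transport(a: float, w: float, b: float, strength: float = 1.0) -> float:
    """Interpolate between the original activation (strength 0) and its transported value (strength 1)."""
    return (1.0 - strength) * a + strength * (w * a + b)


def fit_per_unit(source_acts: List[List[float]], target_acts: List[List[float]]) -> List[Tuple[float, float]]:
    """Fit one independent affine map per unit; rows are samples, columns are units."""
    n_units = len(source_acts[0])
    return [
        fit_affine_transport([row[j] for row in source_acts], [row[j] for row in target_acts])
        for j in range(n_units)
    ]


def steer(vector: List[float], maps: List[Tuple[float, float]], strength: float) -> List[float]:
    """Apply each unit's map to one activation vector at the given strength."""
    return [transport(a, w, b, strength) for a, (w, b) in zip(vector, maps)]
```

For example, a source sample `[3, 0, 2, 1]` and a target sample `[14, 10, 16, 12]` give `w = 2`, `b = 10`, so an activation of `1.0` moves to `12.0` at full strength and to `6.5` at `strength=0.5`. The strength knob is one simple way to express the "fine-grained control" mentioned above; setting it to 0 leaves the model untouched.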

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
MSC classes:	68T07, 49Q22
ACM classes:	I.2.6; I.2.7; I.4.8
Cite as:	arXiv:2410.23054 [cs.LG]
	(or arXiv:2410.23054v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2410.23054

Submission history

From: Pau Rodríguez López
[v1] Wed, 30 Oct 2024 14:21:33 UTC (23,672 KB)
[v2] Fri, 22 Nov 2024 16:04:44 UTC (24,947 KB)