Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
Large language model systems face important security risks from maliciously crafted messages that aim to overwrite the system's …

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating …

Privacy Side Channels in Machine Learning Systems
Most current approaches for protecting privacy in machine learning (ML) assume that models exist in a vacuum. Yet, in reality, these …