NeurIPS 2025

SAEMark: Steering Personalized Multilingual LLM Watermarks with Sparse Autoencoders

Peking University

Abstract

Watermarking LLM-generated text is critical for content attribution and misinformation prevention, yet existing methods compromise text quality and require white-box access for logit manipulation or training, excluding API-based models and multilingual scenarios.

We propose SAEMark, an inference-time framework for multi-bit watermarking that embeds personalized information through feature-based rejection sampling. Unlike logit-based or rewriting-based approaches, SAEMark does not modify model outputs directly and requires only black-box access, while naturally supporting multi-bit message embedding and generalizing across diverse languages and domains.

We instantiate the framework using Sparse Autoencoders as deterministic feature extractors and provide a theoretical worst-case analysis relating watermark accuracy to computational budget. Experiments across four datasets demonstrate strong watermarking performance on English, Chinese, and code while preserving text quality.

SAEMark establishes a new paradigm for scalable, quality-preserving watermarks that work seamlessly with closed-source LLMs across languages and domains.

Key Insight

Selection, Not Modification

Different LLM generations exhibit distinct patterns in their semantic features. We leverage this through selection rather than modification—picking outputs that match our watermark target, leaving text untouched.

Sparse Autoencoders

SAEs decompose LLM activations into interpretable features. Two key properties: Deterministic—identical text yields identical features; Diverse—sampled responses exhibit distinct patterns.
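
As a concrete (and deliberately simplified) picture of these two properties, a standard ReLU SAE encoder is a pure function of the input activations. This is an illustrative sketch, not the paper's pipeline; W_enc and b_enc are hypothetical pretrained weights:

import numpy as np

def sae_features(activations: np.ndarray, W_enc: np.ndarray, b_enc: np.ndarray) -> np.ndarray:
    """Encode dense LLM activations into sparse, interpretable features.

    Deterministic: the same activations always yield the same features.
    Diverse: different sampled responses produce different activations,
    and hence different sparse feature patterns.
    """
    return np.maximum(0.0, activations @ W_enc + b_enc)  # ReLU encoder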

Black-Box Compatible

No weight changes, no logit manipulation, no token edits—works with any LLM through API calls. Naturally supports multi-bit embedding and cross-lingual generalization.

How SAEMark Works

1. Feature Extraction: Use SAEs to extract deterministic, language-agnostic semantic features from text units.

2. Compute FCS: Calculate a Feature Concentration Score (FCS) measuring the semantic focus of each candidate.

3. Best-of-N Selection: Generate N candidates and select the one whose FCS best matches the key-derived target.

4. Detection: Segment the text, compute its FCS sequence, and statistically test it against candidate keys. (A minimal end-to-end sketch of steps 2-4 follows.)
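
The sketch below ties steps 2-4 together. It is illustrative only, not the paper's reference implementation: generate and extract_features are assumed black-box callables, feature_concentration stands in for the paper's FCS with a simple top-k mass statistic, target_from_key shows one way to derive per-segment targets from a user key, and detect uses a toy mean-deviation check in place of the paper's statistical test.

import hashlib
import numpy as np

def feature_concentration(features: np.ndarray, k: int = 16) -> float:
    """Stand-in for FCS: fraction of total feature mass carried by the
    top-k SAE features (higher = more semantically focused)."""
    mass = np.sort(np.abs(features))[::-1]
    total = mass.sum()
    return float(mass[:k].sum() / total) if total > 0 else 0.0

def target_from_key(key: str, segment_index: int) -> float:
    """Derive a pseudorandom target in [0, 1) from a user key.
    Deterministic, so detection can re-derive the same targets."""
    digest = hashlib.sha256(f"{key}:{segment_index}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def best_of_n(prompt: str, key: str, segment_index: int, n: int,
              generate, extract_features) -> str:
    """Steps 2-3: sample n candidates, keep the one whose FCS is
    closest to the key-derived target. The chosen text is an ordinary,
    unmodified model output, so quality is preserved."""
    target = target_from_key(key, segment_index)
    candidates = [generate(prompt) for _ in range(n)]
    scores = [feature_concentration(extract_features(c)) for c in candidates]
    best = min(range(n), key=lambda i: abs(scores[i] - target))
    return candidates[best]

def detect(segments, key: str, extract_features, threshold: float = 0.05) -> bool:
    """Step 4 (toy version): watermarked text should track the key's
    targets closely, so a small mean deviation suggests a match."""
    deviations = [
        abs(feature_concentration(extract_features(seg)) - target_from_key(key, i))
        for i, seg in enumerate(segments)
    ]
    return float(np.mean(deviations)) < threshold

Because the watermark lives in which output was selected rather than in edits to the output, the same loop applies to any API-served model.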

Theoretical Guarantee

Success probability ≥ 1 − (1 − p_min)^N, i.e., the failure probability decays exponentially in the candidate count N.
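
As an illustrative calculation with assumed numbers (p_min is not reported here): if p_min = 0.1 per candidate, then N = 32 candidates give a success probability of at least 1 − 0.9^32 ≈ 0.966, and N = 64 pushes it above 0.998.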

Citation

If you find SAEMark useful in your research, please consider citing our paper:

@inproceedings{yu2025saemark,
  title={SAEMark: Steering Personalized Multilingual LLM Watermarks with Sparse Autoencoders},
  author={Yu, Zhuohao and Jiang, Xingru and Gu, Weizheng and Wang, Yidong and Wen, Qingsong and Zhang, Shikun and Ye, Wei},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}