Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
James Oldfield1, Philip Torr2, Ioannis Patras1, Adel Bibi2, Fazl Barez2,3,4
1Queen Mary University of London 2University of Oxford 3WhiteBox 4Martian
Preprint 2025
We propose Truncated Polynomial Classifiers (TPCs): a generalization of linear probes for dynamic activation monitoring of LLMs for safety. TPCs can be evaluated term-by-term, with higher orders providing stronger guardrails when needed.

TPCs have two modes of evaluation:
Safety dial 📈
TPCs scale with inference-time compute: a single polynomial can be evaluated with an increasing number of its higher-order terms to meet different safety budgets.
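To make the "safety dial" concrete, here is a minimal sketch of term-by-term evaluation. The parameterization below (each order-k term as a power of a learned linear projection, c_k * (x @ w_k)**k) is a hypothetical illustration, not the paper's exact factorization; the weights are random stand-ins for a trained probe.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # activation dimension (toy size; a real probe uses the model's hidden size)
max_order = 3  # highest polynomial order kept after truncation

# Hypothetical parameterization: order-k term is c_k * (x @ w_k)**k.
w = rng.normal(size=(max_order, d))
c = rng.normal(size=max_order)
b = 0.0

def tpc_score(x, order):
    """Evaluate the polynomial probe truncated at `order` (the safety dial)."""
    s = b
    for k in range(1, order + 1):
        s += c[k - 1] * (x @ w[k - 1]) ** k
    return s

x = rng.normal(size=d)
scores = [tpc_score(x, k) for k in range(1, max_order + 1)]
```

Raising `order` reuses all lower-order work: each successive score is the previous one plus a single extra term, which is what lets one trained polynomial serve several compute budgets.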

Adaptive defense 🛡
TPCs can alternatively be evaluated as a cascade. Higher-order terms are computed only for ambiguous inputs, reducing average monitoring costs.
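The cascade can be sketched as an early-exit loop: the cheap linear term is always computed, and higher-order terms are added only while the running score stays inside an ambiguity band. The threshold `tau` and the stopping rule here are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
d, max_order = 8, 3
w = rng.normal(size=(max_order, d))   # stand-in weights for a trained probe
c = rng.normal(size=max_order)
tau = 0.5  # hypothetical ambiguity band: |score| < tau triggers the next term

def cascade_score(x):
    """Add higher-order terms only while the running score stays ambiguous."""
    s = c[0] * (x @ w[0])              # cheap linear term, always computed
    orders_used = 1
    for k in range(2, max_order + 1):
        if abs(s) >= tau:              # confident: exit early, skip remaining terms
            break
        s += c[k - 1] * (x @ w[k - 1]) ** k
        orders_used = k
    return s, orders_used

xs = rng.normal(size=(100, d))
used = [cascade_score(x)[1] for x in xs]
```

Clearly-safe or clearly-harmful inputs exit after the linear term, so the average per-input cost stays close to a linear probe's while ambiguous inputs still get the full polynomial.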

Key results
- On Gemma-3-27b-it, TPCs evaluated at a fixed order bring up to a 10% accuracy improvement over linear probes (for classifying particular categories of harmful prompts), and up to 6% over MLP baselines.
- Input-adaptive, cascaded evaluation of TPCs yields performance on par with the full polynomial, yet requires only slightly more net parameters/compute than the linear probe.
Additional results
See the full paper for experiments with TPCs' built-in feature attribution, ablations, and many more results!
BibTeX
If you find our work useful, please consider citing:
@misc{oldfield2025tpc,
  title={Beyond Linear Probes: Dynamic Safety Monitoring for Language Models},
  author={James Oldfield and Philip Torr and Ioannis Patras and Adel Bibi and Fazl Barez},
  year={2025},
  eprint={2509.26238},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}