Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
James Oldfield1, Philip Torr2, Ioannis Patras1, Adel Bibi2, Fazl Barez2,3,4
1Queen Mary University of London 2University of Oxford 3WhiteBox 4Martian
Preprint 2025
We propose Truncated Polynomial Classifiers (TPCs): a generalization of linear probes for dynamic activation monitoring of LLMs for safety. TPCs can be evaluated term-by-term, with higher orders providing stronger guardrails when needed.

TPCs have two modes of evaluation:
Safety dial 📈
TPCs scale with inference-time compute: a single polynomial can be evaluated with an increasing number of its higher-order terms to meet different safety budgets.
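To make the "safety dial" concrete, here is a minimal sketch of term-by-term evaluation. The parameterization below (each order-k term as a power of a learned linear projection, c_k * (x @ w_k)**k) is a hypothetical illustration, not the paper's exact factorization; the weights are random stand-ins for a trained probe.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # activation dimension (toy size; a real probe uses the model's hidden size)
max_order = 3  # highest polynomial order kept after truncation

# Hypothetical parameterization: order-k term is c_k * (x @ w_k)**k.
w = rng.normal(size=(max_order, d))
c = rng.normal(size=max_order)
b = 0.0

def tpc_score(x, order):
    """Evaluate the polynomial probe truncated at `order` (the safety dial)."""
    s = b
    for k in range(1, order + 1):
        s += c[k - 1] * (x @ w[k - 1]) ** k
    return s

x = rng.normal(size=d)
scores = [tpc_score(x, k) for k in range(1, max_order + 1)]
```

Raising `order` reuses all lower-order work: each successive score is the previous one plus a single extra term, which is what lets one trained polynomial serve several compute budgets.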

Adaptive defense 🛡
TPCs can alternatively be evaluated as a cascade. Higher-order terms are computed only for ambiguous inputs, reducing average monitoring costs.
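The cascade can be sketched as an early-exit loop: the cheap linear term is always computed, and higher-order terms are added only while the running score stays inside an ambiguity band. The threshold `tau` and the stopping rule here are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
d, max_order = 8, 3
w = rng.normal(size=(max_order, d))   # stand-in weights for a trained probe
c = rng.normal(size=max_order)
tau = 0.5  # hypothetical ambiguity band: |score| < tau triggers the next term

def cascade_score(x):
    """Add higher-order terms only while the running score stays ambiguous."""
    s = c[0] * (x @ w[0])              # cheap linear term, always computed
    orders_used = 1
    for k in range(2, max_order + 1):
        if abs(s) >= tau:              # confident: exit early, skip remaining terms
            break
        s += c[k - 1] * (x @ w[k - 1]) ** k
        orders_used = k
    return s, orders_used

xs = rng.normal(size=(100, d))
used = [cascade_score(x)[1] for x in xs]
```

Clearly-safe or clearly-harmful inputs exit after the linear term, so the average per-input cost stays close to a linear probe's while ambiguous inputs still get the full polynomial.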

Key results
- On Gemma-3-27b-it, TPCs evaluated at a fixed order bring up to a 10% accuracy improvement over linear probes (for classifying particular categories of harmful prompts), and up to 6% over MLP baselines.
- Input-adaptive, cascaded evaluation of TPCs yields performance on par with the full polynomial, yet requires only slightly more net parameters/compute than the linear probe.
Additional results
See the full paper for experiments with TPCs' built-in feature attribution, ablations, and many more results!
BibTeX
If you find our work useful, please consider citing:
@misc{oldfield2025tpc,
  title={Beyond Linear Probes: Dynamic Safety Monitoring for Language Models},
  author={James Oldfield and Philip Torr and Ioannis Patras and Adel Bibi and Fazl Barez},
  year={2025},
  eprint={2509.26238},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}