Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

James Oldfield¹ Philip Torr² Ioannis Patras¹ Adel Bibi² Fazl Barez²,³,⁴
¹Queen Mary University of London ²University of Oxford ³WhiteBox ⁴Martian
Preprint 2025

Paper Code

We propose Truncated Polynomial Classifiers (TPCs): a generalization of linear probes for dynamic activation monitoring of LLMs for safety. TPCs can be evaluated term-by-term, with higher orders providing stronger guardrails when needed.

Teaser figure

TPCs have two modes of evaluation:

Safety dial 📈

TPCs scale with inference-time compute: a single polynomial can be evaluated with an increasing number of its higher-order terms to meet different safety budgets.

Results on WildGuardMix
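
As a rough illustration of the safety-dial interface, the sketch below scores the same probe with terms up to any chosen order. The one-direction-per-order parameterization here is our own simplifying assumption for the example, not the exact TPC construction from the paper:

    import numpy as np

    class ToyTruncatedPolyProbe:
        """Toy truncated polynomial probe over a single activation vector.

        Order-k term: a_k * (w_k . x)**k, so order 1 recovers an ordinary
        linear probe and higher orders add capacity. This only illustrates the
        "evaluate up to order k" interface, not the paper's parameterization.
        """

        def __init__(self, d: int, max_order: int, seed: int = 0):
            rng = np.random.default_rng(seed)
            self.w = rng.normal(size=(max_order, d)) / np.sqrt(d)  # one direction per order
            self.a = rng.normal(size=max_order)                    # per-order scale
            self.b = 0.0                                           # bias
            self.max_order = max_order

        def score(self, x: np.ndarray, order: int) -> float:
            """Harmfulness logit using terms up to `order` (the safety dial)."""
            s = self.b
            for k in range(1, min(order, self.max_order) + 1):
                s += self.a[k - 1] * np.dot(self.w[k - 1], x) ** k
            return s

    # Turning the dial: the same probe, evaluated with more of its terms.
    probe = ToyTruncatedPolyProbe(d=4096, max_order=4)
    x = np.random.default_rng(1).normal(size=4096)
    for k in range(1, 5):
        print(f"order {k}: logit = {probe.score(x, order=k):+.3f}")
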

Adaptive defense 🛡

TPCs can alternatively be evaluated as a cascade. Higher-order terms are computed only for ambiguous inputs, reducing average monitoring costs.

Adaptive cascade
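
A minimal sketch of the cascade follows; the thresholds and per-order terms are illustrative assumptions rather than the paper's exact recipe. Low-order terms are added first, and higher-order terms are computed only while the prediction stays in an ambiguous band:

    import numpy as np

    def cascaded_score(x, terms, low: float = 0.25, high: float = 0.75):
        """Add polynomial terms one order at a time, stopping early once the
        predicted harmfulness probability leaves the ambiguous band [low, high].
        `terms` is a list of callables, one per order (cheapest first)."""
        logit, p = 0.0, 0.5
        for order, term in enumerate(terms, start=1):
            logit += term(x)                  # order-`order` contribution
            p = 1.0 / (1.0 + np.exp(-logit))  # sigmoid
            if p < low or p > high:           # confident enough: stop here
                return p, order
        return p, len(terms)                  # fell through: full polynomial

    # Toy per-order terms a_k * (w_k . x)**k on a 4096-d activation.
    rng = np.random.default_rng(0)
    d, max_order = 4096, 4
    w = rng.normal(size=(max_order, d)) / np.sqrt(d)
    terms = [lambda x, k=k: 0.5 * np.dot(w[k], x) ** (k + 1) for k in range(max_order)]

    x = rng.normal(size=d)
    p, orders_used = cascaded_score(x, terms)
    print(f"p(harmful) = {p:.3f} after evaluating {orders_used} of {max_order} orders")

In this setup, easy inputs exit at order 1 (the linear probe), so the average monitoring cost stays close to linear while ambiguous inputs receive the full polynomial.
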

Key results


  • On Gemma-3-27b-it, we find that TPCs evaluated at a fixed order bring up to a 10% improvement in accuracy over linear probes (when classifying particular categories of harmful prompts), and up to a 6% improvement over MLP baselines.
  • Input-adaptive, cascaded evaluation of TPCs yields performance on par with the full polynomial, while requiring only slightly more parameters/compute overall than the linear probe.

Additional results

See the full paper for experiments with TPCs' built-in feature attribution, ablations, and many more results!

BibTeX

If you find our work useful, please consider citing:


    @misc{oldfield2025tpc,
      title={Beyond Linear Probes: Dynamic Safety Monitoring for Language Models},
      author={James Oldfield and Philip Torr and Ioannis Patras and Adel Bibi and Fazl Barez},
      year={2025},
      eprint={2509.26238},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
    }