Multilinear Mixture of Experts:
Scalable Expert Specialization through Factorization

James Oldfield1, Markos Georgopoulos, Grigorios G. Chrysos2, Christos Tzelepis3, Yannis Panagakis4,5, Mihalis A Nicolaou6, Jiankang Deng7, Ioannis Patras1
1Queen Mary University of London 2University of Wisconsin-Madison 3City University of London
4National and Kapodistrian University of Athens 5Archimedes/Athena RC 6The Cyprus Institute 7Imperial College London
NeurIPS 2024


Abstract

The Mixture of Experts (MoE) paradigm provides a powerful way to decompose dense layers into smaller, modular computations often more amenable to human interpretation, debugging, and editability. However, a major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization. In this paper, we propose the Multilinear Mixture of Experts (μMoE) layer to address this, focusing on vision models. μMoE layers enable scalable expert specialization by performing an implicit computation on prohibitively large weight tensors entirely in factorized form. Consequently, μMoEs (1) avoid the restrictively high inference-time costs of dense MoEs, yet (2) do not inherit the training issues of the popular sparse MoEs' discrete (non-differentiable) expert routing. We present both qualitative and quantitative evidence that scaling μMoE layers when fine-tuning foundation models for vision tasks leads to more specialized experts at the class-level, further enabling manual bias correction in CelebA attribute classification. Finally, we show qualitative results demonstrating the expert specialism achieved when pre-training large GPT2 and MLP-Mixer models with parameter-matched μMoE blocks at every layer, maintaining comparable accuracy. Our code is available at: https://github.com/james-oldfield/muMoE.

Method Overview

μMoE layers fuse large numbers of (potentially hierarchical) experts' operations on an input vector in an efficient manner. By design, μMoE layers scale gracefully to tens of thousands of experts by performing implicit computation in factorized form, avoiding the problematic, non-differentiable 'hard' expert selection.


The μMoE forward pass

The forward pass of an (unfactorized) μMoE layer can be expressed as a series of two tensor contractions: each expert's weight matrix is matrix-multiplied with the input vector, and the results are summed, weighted by the expert coefficients.
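The two contractions above can be sketched with a single einsum (a minimal NumPy illustration with hypothetical sizes, not the paper's implementation):

```python
import numpy as np

# Hypothetical sizes: N experts, input dim I, output dim O
N, I, O = 4, 8, 6
rng = np.random.default_rng(0)

W = rng.standard_normal((N, O, I))   # one O x I weight matrix per expert
x = rng.standard_normal(I)           # input vector
a = rng.random(N); a /= a.sum()      # dense (soft) expert coefficients, summing to 1

# Two tensor contractions: contract W with x over the input mode,
# then with the coefficients a over the expert mode.
y = np.einsum('noi,i,n->o', W, x, a)

# Equivalent explicit weighted sum of the experts' matrix-vector products
y_ref = sum(a[n] * (W[n] @ x) for n in range(N))
assert np.allclose(y, y_ref)
```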

Our key insight is that the dense μMoE forward pass over all \(N\) experts simultaneously can be computed entirely in factorized form, never needing to materialize the prohibitively large weight tensors at any point during training or inference.

We achieve this by parameterizing the higher-order weight tensor at each μMoE layer as a composition of smaller factors, which are far more efficient to store and operate on.
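As one concrete illustration of this idea (a sketch, assuming a CP factorization of the weight tensor with hypothetical sizes and rank), the forward pass reduces to a few small contractions against the factors, and agrees with the dense two-contraction pass:

```python
import numpy as np

N, I, O, R = 1024, 64, 32, 16  # many experts; R is the (hypothetical) CP rank
rng = np.random.default_rng(0)

# CP factors of the (never materialized) N x O x I weight tensor
U_n = rng.standard_normal((N, R)) / np.sqrt(R)  # expert-mode factor
U_o = rng.standard_normal((O, R))               # output-mode factor
U_i = rng.standard_normal((I, R))               # input-mode factor

x = rng.standard_normal(I)
a = rng.random(N); a /= a.sum()  # dense expert coefficients

# Factorized forward pass: three small contractions, O(R(N+I+O)) work,
# instead of storing and contracting the full N x O x I tensor.
y = ((a @ U_n) * (x @ U_i)) @ U_o.T

# Reference: materialize W explicitly and run the dense two-contraction pass
W = np.einsum('nr,or,ir->noi', U_n, U_o, U_i)
y_ref = np.einsum('noi,i,n->o', W, x, a)
assert np.allclose(y, y_ref)
```

Note the factorized path never allocates the \(N \times O \times I\) tensor, which is what lets the expert count scale into the thousands.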

Results

Expert specialization

When fine-tuning foundation models (such as CLIP) for vision tasks, we find that increasing the number of experts in μMoE layers leads to more specialized experts at the class-level. We quantify this by asking a counterfactual question about each expert in turn: intervening in the model's forward pass (setting each expert's weights to 0) and recording the counterfactual change to the test set's class predictions.
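The intervention loop can be sketched as follows (a toy NumPy sketch with a random model and balanced labels, purely to show the ablate-and-measure pattern; not the paper's evaluation code):

```python
import numpy as np

# Toy setup: N experts, input dim I, C classes (all hypothetical)
N, I, C = 8, 16, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((N, C, I))    # per-expert classifier weights
X = rng.standard_normal((100, I))     # toy "test set"
labels = np.arange(100) % C           # balanced labels, 25 per class

def accuracy_per_class(mask):
    # mask[n] = 0 ablates expert n; uniform coefficients over the rest
    a = mask / mask.sum()
    logits = np.einsum('noi,bi,n->bo', W, X, a)
    preds = logits.argmax(1)
    return np.array([(preds[labels == c] == c).mean() for c in range(C)])

base = accuracy_per_class(np.ones(N))
for n in range(N):
    mask = np.ones(N); mask[n] = 0.0
    delta = accuracy_per_class(mask) - base  # counterfactual change per class
```

Experts whose ablation changes accuracy for exactly one class are the most specialized (monosemantic) under this measure.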


Using the pre- and post-intervention class accuracies, we compute a measure of mean "expert polysemanticity" (i.e. the extent to which an expert's computation is responsible for the accuracy of more than one class) across all experts that have any non-zero effect on class predictions.

Increasing the total number of μMoE experts leads to individual experts increasingly responsible for a single subtask: classifying all inputs of just one class.

Qualitative results

Comparing 256 vs. 32 total experts for a CPμMoE model: the larger the total number of experts, the more each expert appears to specialize to a particular visual theme. Each cell below shows random training-set images whose corresponding expert coefficient is at least 0.5, for the first few experts in numerical order:


Large scale models

We train both MLP-Mixer and GPT-2 models from scratch with μMoE blocks at every layer. Not only do we obtain comparable accuracy to the original models, but we also find (qualitatively) expert specialism emerging throughout.

In other words, each layer’s µMoE block performs computations with N experts but has the same parameter counts and FLOPs as a single, dense MLP block.
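The parameter-matching above is simple arithmetic (a sketch with hypothetical layer sizes and rank; the actual dimensions depend on the model configuration): a rank-\(R\) CP-factorized μMoE block stores one factor per tensor mode, so the rank can be chosen to roughly match a dense layer's parameter count.

```python
# Dense MLP layer: d_in x d_out weight matrix (hypothetical sizes)
d_in, d_out = 512, 2048
dense_params = d_in * d_out  # 1,048,576

# CP-factorized muMoE with N experts: factors of shapes (N, R), (d_in, R), (d_out, R)
N, R = 64, 400  # rank R chosen so the factor parameters roughly match
mumoe_params = R * (N + d_in + d_out)  # 1,049,600

print(dense_params, mumoe_params)  # within ~0.1% of each other
```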

Top-activating patches (top rows) and their full images (second rows) for two experts at two TRµMoE-e64 layers in µMoE MLP-Mixer models: µMoE blocks exhibit coarse-grained specialism (e.g. texture) at earlier layers and more fine-grained specialism (e.g. objects) deeper in the network.

Top-activating generated tokens for two selected experts at layer 4 for NanoGPT with CPµMoE-e256 blocks (each surrounding token is highlighted by the coefficient of the expert in question), exhibiting specialization to compound adjectives (left) and equality operators (right) respectively.

BibTeX

If you find our work useful, please consider citing our paper:

      
    @misc{oldfield2024mumoe,
      title={Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization},
      author={James Oldfield and Markos Georgopoulos and Grigorios G. Chrysos and Christos Tzelepis and Yannis Panagakis and Mihalis A. Nicolaou and Jiankang Deng and Ioannis Patras},
      year={2024},
      eprint={2402.12550},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
    }