We analyze over 1,100 deep neural networks—including 500 Mistral-7B LoRAs and 500 Vision Transformers. We provide the first large-scale empirical evidence that networks systematically converge to shared, low-dimensional spectral subspaces, regardless of initialization, task, or domain.
Models trained on disjoint data collapse into the same parametric subspace, suggesting that architecture dictates weight-space geometry more than training data does.
Storing only subspace coefficients enables massive compression (up to 100x). New tasks can be learned by optimizing lightweight coefficients instead of full weight matrices.
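As a rough illustration of the coefficient-storage idea, the sketch below projects a layer's flattened weights onto a shared orthonormal basis and reconstructs them from the stored coefficients. The basis here is a random stand-in, and all names are illustrative rather than taken from the paper's code; in practice the basis would come from the spectral analysis over many trained models.

```python
import numpy as np

# Minimal sketch of subspace compression, assuming a shared orthonormal basis
# of shape (d, k). The basis below is a random stand-in; in the paper's setting
# it would be estimated from the spectra of many trained networks.

def compress(weights: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project flattened weights (d,) onto the subspace -> coefficients (k,)."""
    return basis.T @ weights.ravel()

def reconstruct(coeffs: np.ndarray, basis: np.ndarray, shape) -> np.ndarray:
    """Map stored coefficients back to weight space and restore the layer shape."""
    return (basis @ coeffs).reshape(shape)

rng = np.random.default_rng(0)
d, k = 1024, 16                                        # layer size vs. subspace size
basis, _ = np.linalg.qr(rng.standard_normal((d, k)))   # stand-in orthonormal basis
layer = rng.standard_normal((32, 32))                  # toy 32x32 weight matrix (d = 1024)

coeffs = compress(layer, basis)                        # store only k = 16 numbers
layer_hat = reconstruct(coeffs, basis, layer.shape)    # approximate reconstruction
print(coeffs.shape, layer_hat.shape)                   # (16,) (32, 32)
```

Storing k coefficients per layer instead of d parameters gives roughly a d/k compression ratio, which is where figures like 100x come from when k is small relative to the layer size.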
Seamlessly merge models without data. Our method outperforms SOTA merging baselines (Task Arithmetic, TIES) by aligning spectral directions.
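The snippet below is a hedged sketch of subspace-based merging, not the paper's exact algorithm: each model's task vector (finetuned minus base weights) is projected onto the shared basis, the coefficients are averaged there, and the result is mapped back to weight space. Names such as `merge_in_subspace` are illustrative.

```python
import numpy as np

# Illustrative subspace merging (not the paper's exact method): combine models
# by averaging their task-vector coefficients inside the shared subspace.

def merge_in_subspace(base: np.ndarray, finetuned: list, basis: np.ndarray) -> np.ndarray:
    """base: (d,) base weights; finetuned: list of (d,) model weights; basis: (d, k)."""
    deltas = [w - base for w in finetuned]              # task vectors
    coeffs = np.stack([basis.T @ dv for dv in deltas])  # (num_models, k)
    merged = coeffs.mean(axis=0)                        # combine along shared directions
    return base + basis @ merged                        # back to full weight space
```

Unlike plain task-vector addition, the combination here happens along the shared spectral directions, which is the alignment property the comparison against Task Arithmetic and TIES refers to.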
The plot illustrates the explained variance ratio of principal components across 500 Vision Transformers.
Despite random initializations and different training datasets, the majority of the variance is captured by the first few components (the "Universal Subspace").
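The sketch below shows one way such a plot can be produced, assuming the flattened weights of a given layer from each model are stacked into a matrix: center the rows, take the SVD, and report the per-component explained variance. The exact spectral analysis in the paper may differ in detail.

```python
import numpy as np

# Sketch of the analysis behind an explained-variance plot: stack flattened
# weights from many models, center them, and compute the PCA spectrum via SVD.

def explained_variance_ratio(W: np.ndarray) -> np.ndarray:
    """W: (num_models, d) matrix of flattened weights from a given layer."""
    centered = W - W.mean(axis=0, keepdims=True)
    _, s, _ = np.linalg.svd(centered, full_matrices=False)
    var = s ** 2                                   # variance captured per component
    return var / var.sum()

# ratios = explained_variance_ratio(W)            # W: e.g. (500, d) for 500 ViTs
# print(ratios[:5].sum())                         # fraction explained by the top 5
```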
Drastically reduces the carbon footprint of training large-scale neural models by reusing shared subspaces.
Explains why techniques like parameter-efficient fine-tuning succeed across architectures.
Offers new insights into the intrinsic organization of information within deep networks.
Allows under-resourced researchers to adapt SOTA models without massive compute clusters.
@misc{kaushik2025universalweightsubspacehypothesis,
      title={The Universal Weight Subspace Hypothesis},
      author={Prakhar Kaushik and Shravan Chaudhari and Ankit Vaidya and Rama Chellappa and Alan Yuille},
      year={2025},
      eprint={2512.05117},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.05117},
}