On the Width Scaling of Neural Optimizers: A Matrix Operator Norm Perspective
----------------------------------------------------------------------------------------------------
Department of Systems Engineering and Engineering Management
The Chinese University of Hong Kong
----------------------------------------------------------------------------------------------------
Date: Friday, March 6, 2026, 4:30pm to 5:30pm HKT
Venue: ERB 513, The Chinese University of Hong Kong
Title: On the Width Scaling of Neural Optimizers: A Matrix Operator Norm Perspective
Speaker: Prof. Jiajin Li, University of British Columbia
Abstract:
A central question in modern deep learning and language models is how to design optimizers whose performance scales favorably with network width. We address this question by viewing neural-network optimizers such as AdamW and Muon through a unified lens, as instances of steepest descent under matrix operator norms. Within this framework, we align the optimizer geometry with the Lipschitz structure of the network's forward map, impose a requirement of layerwise composability, and show that standard p→q operator-norm steepest-descent rules generally fail to compose across layers. To overcome this limitation, we introduce a family of mean-normalized norm geometries (p, mean)→(q, mean) that admit closed-form, layerwise descent directions and yield practical optimizers such as a rescaled AdamW, row normalization, and column normalization. By construction, our rescaling recovers μP-style width scaling as a special case and provides predictable learning-rate transfer across widths for a broader class of optimizers. We further prove that the induced descent directions preserve standard convergence guarantees and achieve nearly width-insensitive smoothness for the mappings (1, mean)→(q, mean) with q ≥ 2 and (p, mean)→(∞, mean), where smoothness is measured in the corresponding matrix-norm geometry. Finally, we show that the resulting optimizer achieves improved width scaling compared with Muon, and that Muon in turn outperforms AdamW, suggesting a principled and practical route for mitigating dimensional dependence in large-scale optimization.
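To make the steepest-descent view concrete, below is a minimal NumPy sketch of two update rules of the kind named in the abstract: a spectral-norm (Muon-style) step and a row-normalization step. The function names and normalization constants here are illustrative assumptions for intuition, not the speaker's method; in particular, the exact (p, mean)→(q, mean) scaling factors are omitted.

```python
import numpy as np

def spectral_descent_update(grad, lr):
    """Steepest descent under the spectral (2 -> 2 operator) norm.

    The descent direction is U @ V.T from the SVD of the gradient,
    i.e. the gradient with all singular values set to 1. This is the
    geometry underlying Muon-style optimizers.
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return -lr * (u @ vt)

def row_normalized_update(grad, lr):
    """Illustrative 'row normalization' step: each row of the gradient
    is rescaled to unit RMS norm. The talk's mean-normalized geometry
    may use different constants; this sketch is an assumption.
    """
    row_rms = np.sqrt(np.mean(grad**2, axis=1, keepdims=True)) + 1e-12
    return -lr * grad / row_rms

# Toy usage on a fan_out x fan_in weight gradient.
rng = np.random.default_rng(0)
g = rng.standard_normal((256, 512))
delta = spectral_descent_update(g, lr=0.02)
```

This is roughly the intuition behind the learning-rate transfer described in the abstract: updates whose size does not grow with layer width let a learning rate tuned at a small width remain usable at larger widths.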
Biography:
Jiajin Li is a tenure-track Assistant Professor in the Operations & Logistics Division at the Sauder School of Business, University of British Columbia. She is also an associated faculty member of the Institute of Applied Mathematics (IAM) and the Department of Computer Science at UBC. Prior to joining UBC, she spent three years as a postdoctoral researcher in the Department of Management Science and Engineering (MS&E) at Stanford University. She received her Ph.D. in Systems Engineering and Engineering Management from the Chinese University of Hong Kong (CUHK) in 2021.
Everyone is welcome to attend the talk!
SEEM-5202 Website: http://seminar.se.cuhk.edu.hk