On the Width Scaling of Neural Optimizers: A Matrix Operator Norm Perspective
----------------------------------------------------------------------------------------------------
Department of Systems Engineering and Engineering Management
The Chinese University of Hong Kong
----------------------------------------------------------------------------------------------------
Date: Friday, March 6, 2026, 4:30pm to 5:30pm HKT
Venue: ERB 513, The Chinese University of Hong Kong
Title: On the Width Scaling of Neural Optimizers: A Matrix Operator Norm Perspective
Speaker: Prof. Jiajin Li, University of British Columbia
Abstract:
A central question in modern deep learning and language models is how to design optimizers whose performance scales favorably with network width. We address this question by viewing neural-network optimizers such as AdamW and Muon through a unified lens, as instances of steepest descent under matrix operator norms. Within this framework, we align the optimizer geometry with the Lipschitz structure of the network's forward map, impose a requirement of layerwise composability, and show that standard p→q operator-norm steepest-descent rules generally fail to compose across layers. To overcome this limitation, we introduce a family of mean-normalized norm geometries (p, mean)→(q, mean) that admit closed-form, layerwise descent directions and yield practical optimizers such as a rescaled AdamW, row normalization, and column normalization. By construction, our rescaling recovers μP-style width scaling as a special case and provides predictable learning-rate transfer across widths for a broader class of optimizers. We further prove that the induced descent directions preserve standard convergence guarantees and achieve nearly width-insensitive smoothness for the mappings (1, mean)→(q, mean) with q ≥ 2 and (p, mean)→(∞, mean), where smoothness is measured in the corresponding matrix-norm geometry. Finally, we show that the resulting optimizer achieves improved width scaling compared with Muon, and that Muon in turn outperforms AdamW, suggesting a principled and practical route for mitigating dimensional dependence in large-scale optimization.
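To make the steepest-descent view concrete, below is a minimal NumPy sketch of two update rules of the kind named in the abstract: a spectral-norm (Muon-style) step and a row-normalization step. The function names and normalization constants here are illustrative assumptions for intuition, not the speaker's method; in particular, the exact (p, mean)→(q, mean) scaling factors are omitted.

```python
import numpy as np

def spectral_descent_update(grad, lr):
    """Steepest descent under the spectral (2 -> 2 operator) norm.

    The descent direction is U @ V.T from the SVD of the gradient,
    i.e. the gradient with all singular values set to 1. This is the
    geometry underlying Muon-style optimizers.
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return -lr * (u @ vt)

def row_normalized_update(grad, lr):
    """Illustrative 'row normalization' step: each row of the gradient
    is rescaled to unit RMS norm. The talk's mean-normalized geometry
    may use different constants; this sketch is an assumption.
    """
    row_rms = np.sqrt(np.mean(grad**2, axis=1, keepdims=True)) + 1e-12
    return -lr * grad / row_rms

# Toy usage on a fan_out x fan_in weight gradient.
rng = np.random.default_rng(0)
g = rng.standard_normal((256, 512))
delta = spectral_descent_update(g, lr=0.02)
```

This is roughly the intuition behind the learning-rate transfer described in the abstract: updates whose size does not grow with layer width let a learning rate tuned at a small width remain usable at larger widths.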
Biography:
Jiajin Li is a tenure-track Assistant Professor in the Operations & Logistics Division at the Sauder School of Business, University of British Columbia. She is also an associated faculty member of the Institute of Applied Mathematics (IAM) and the Department of Computer Science at UBC. Prior to joining UBC, she spent three years as a postdoctoral researcher in the Department of Management Science and Engineering (MS&E) at Stanford University. She received her Ph.D. in Systems Engineering and Engineering Management from the Chinese University of Hong Kong (CUHK) in 2021.
Everyone is welcome to attend the talk!
SEEM-5202 Website: http://seminar.se.cuhk.edu.hk