Session: Rising Stars of Mechanical Engineering Celebration & Showcase
Paper Number: 148628
Unraveling a Central Mystery of Artificial Intelligence: Heavy-Tail Perspective
A strange and beautiful mathematical structure called “heavy tail” underlies seemingly disparate rare events such as the recent global pandemic, the 2012 blackout in India, and the 2007 financial crisis. In fact, the list of examples goes on far beyond literal catastrophes, and heavy tails are pervasive in large-scale complex systems and modern algorithms. Heavy tails provide mathematical models for extreme variability, and in the presence of heavy tails, high-impact rare events are guaranteed to happen eventually. Understanding how they will happen allows us to design resilient systems and control (or even utilize) the impact they inflict. A particularly well-known and simple manifestation of heavy tails is the “80-20 rule”—e.g., the richest 20% of the population control 80% of the wealth—whose variations are repeatedly discovered in a wide variety of application areas.
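As a toy illustration (not part of the abstract itself): the 80-20 rule emerges from a Pareto distribution with tail index roughly 1.16, a standard textbook fact; the parameter values below are illustrative assumptions. A short simulation sketch:

```python
import random

random.seed(0)
alpha = 1.16  # Pareto tail index for which the top 20% hold ~80% (assumption)
n = 200_000

# random.paretovariate(alpha) samples a Pareto(alpha) variable on [1, inf),
# a canonical heavy-tailed "wealth" model
wealth = sorted((random.paretovariate(alpha) for _ in range(n)), reverse=True)

# Wealth share held by the richest 20% of the population
share = sum(wealth[: n // 5]) / sum(wealth)
print(f"wealth share of the richest 20%: {share:.2f}")  # roughly 0.8
```

The theoretical share of the top fraction p is p^(1 - 1/alpha), which equals 0.8 at p = 0.2 when alpha is approximately 1.16; the simulated estimate fluctuates around that value precisely because the distribution is heavy-tailed.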
One of the most recent and surprising discoveries of heavy tails emerged in deep neural networks. The unprecedented empirical success of deep neural networks in modern AI tasks is often attributed to the stochastic gradient descent (SGD) algorithm’s mysterious ability to avoid sharp local minima in the loss landscape. Recently, heavy-tailed SGD has attracted significant attention for its ability to escape sharp local minima through a single big jump, and hence, within a realistic training horizon. In practice, however, when SGD exhibits such behaviors, practitioners adopt truncated variations of SGD to temper such movements. At first glance, this truncation scheme—known as gradient clipping—appears to effectively eliminate heavy tails from SGD’s dynamics, obliterating the aforementioned effects. Curiously, however, such modifications lead to the opposite of what a naive intuition predicts: heavy-tailed SGD with gradient clipping almost completely avoids sharp local minima.
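A minimal sketch of the mechanism described above, on a hypothetical one-dimensional toy loss (the loss, noise model, and all parameter values are illustrative assumptions, not the paper's actual setup): SGD updates are perturbed by heavy-tailed noise, and gradient clipping truncates each update at a fixed threshold.

```python
import random

random.seed(1)

def loss_grad(x):
    # Gradient of the toy double-well loss (x^2 - 1)^2, minima at x = +/-1
    return 4 * x ** 3 - 4 * x

def heavy_tailed_noise(alpha=1.5):
    # Symmetric heavy-tailed noise with tail index alpha (assumption):
    # Pareto magnitude shifted to start at 0, random sign
    mag = random.paretovariate(alpha) - 1.0
    return mag if random.random() < 0.5 else -mag

def clip(g, b):
    # Gradient clipping: truncate the update direction's magnitude at b
    return max(-b, min(b, g))

x, lr, b = 2.0, 0.01, 1.0
for _ in range(5000):
    g = loss_grad(x) + heavy_tailed_noise()
    x -= lr * clip(g, b)

print(f"final iterate: {x:.2f}")  # settles near one of the minima at x = +/-1
```

Without the `clip` call, a single large noise draw can move the iterate arbitrarily far in one step (the "single big jump"); with clipping, every step is bounded by `lr * b`, yet the heavy-tailed arrival of large gradients still shapes which minima the iterate ends up avoiding.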
To unravel this mystery and potentially further enhance SGD's ability to find flat minima, it is imperative to go beyond the traditional local convergence analysis and acquire a comprehensive understanding of SGD’s global dynamics within complex non-convex loss landscapes. My research provides systematic tools for characterizing the global dynamics of such variants of SGD through the lens of heavy-tailed large deviations and metastability analysis.
To be specific, we characterize the global dynamics of SGD building on the heavy-tailed large deviations and local stability framework developed in the first part of this work. This leads to the heavy-tailed counterparts of the classical Freidlin-Wentzell and Eyring-Kramers theories. Moreover, we reveal a fascinating phenomenon in deep learning: by injecting and then truncating heavy-tailed noise during the training phase, SGD can almost completely avoid sharp minima and hence achieve better generalization performance on test data.
Presenting Author: Chang-Han Rhee, Northwestern University
Presenting Author Biography: Chang-Han Rhee is an Assistant Professor in Industrial Engineering and Management Sciences at Northwestern University. Before joining Northwestern University, he was a postdoctoral researcher at Centrum Wiskunde & Informatica and Georgia Tech. He received his Ph.D. from Stanford University. His research interests include applied probability, stochastic simulation, experimental design, and the theoretical foundation of machine learning. His research has been recognized with the 2016 INFORMS Simulation Society Outstanding Publication Award, the 2012 Winter Simulation Conference Best Student Paper Award, the 2023 INFORMS George Nicholson Student Paper Competition (2nd place), and the 2013 INFORMS George Nicholson Student Paper Competition (finalist). Since 2022, his research has been supported by the NSF CAREER Award.
Authors:
Chang-Han Rhee, Northwestern University
Paper Type
Poster Presentation