Training Transformer models
Introduction
Training Transformer models is computationally expensive. This document explores an adaptive learning rate strategy that adjusts parameter updates dynamically based on the local curvature of the loss landscape, along with a recommendation for setting the rate-adjustment parameters.
Notation
θ - the model parameters,
X - an input sample,
H(θ) - the Hessian of the loss with respect to θ,
N - the number of training samples,
C - the number of classes,
η_t - the learning rate at training step t.
Mathematical Formulation
Consider a Transformer model parameterized by θ and an input X, producing an output representation.
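As a concrete illustration of this setup, the sketch below runs an input batch X through a small PyTorch Transformer encoder to obtain an output representation. The model dimensions, layer count, and the use of nn.TransformerEncoder are illustrative assumptions; the original text only specifies a model parameterized by θ with input X.

```python
import torch
import torch.nn as nn

# A small Transformer encoder standing in for the model parameterized by theta.
encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4)
model = nn.TransformerEncoder(encoder_layer, num_layers=2)

# Input X: (sequence length, batch size, d_model) -- the default layout
# for nn.TransformerEncoder when batch_first=False.
X = torch.randn(10, 32, 64)

# Output representation produced by the model.
Z = model(X)
print(Z.shape)  # torch.Size([10, 32, 64])
```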
The training objective involves minimizing a cross-entropy loss over the training set:

$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \, \log p_\theta(c \mid x_i)$$

Where:
N is the number of training samples,
C is the number of classes,
y_{i,c} is the ground-truth label for class c,
p_θ(c | x_i) is the predicted probability of class c for sample x_i.
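For concreteness, here is a minimal NumPy sketch of this loss, assuming one-hot ground-truth labels; the function name cross_entropy_loss and the toy numbers are illustrative.

```python
import numpy as np

def cross_entropy_loss(probs, labels_onehot):
    """Mean negative log-likelihood over N samples and C classes.

    probs:         (N, C) predicted probabilities p_theta(c | x_i)
    labels_onehot: (N, C) ground-truth indicators y_{i,c}
    """
    eps = 1e-12                                  # guard against log(0)
    log_p = np.log(np.clip(probs, eps, 1.0))
    return -np.mean(np.sum(labels_onehot * log_p, axis=1))

# Tiny example with N = 2 samples and C = 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([[1, 0, 0],
                   [0, 1, 0]])
print(cross_entropy_loss(probs, labels))         # ~0.29
```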
Adaptive Learning Rate Adjustment
We define a dynamic learning rate η_t based on the Hessian matrix H(θ):
$$\eta_t = \frac{\eta_0}{1 + \lambda \, \lVert H(\theta_t) \rVert}$$
Where η_0 is the initial learning rate and λ is a scaling factor. This formulation ensures that step sizes shrink in regions of high curvature, preventing instability.
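The sketch below shows one way such a rule could be applied in practice, assuming curvature is summarized by the Hessian spectral norm estimated with finite-difference Hessian-vector products. The function names adaptive_lr and hessian_spectral_norm, the specific denominator 1 + λ·curvature, and the quadratic toy problem are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def adaptive_lr(eta0, lam, curvature):
    """Curvature-scaled step size: eta_t = eta0 / (1 + lam * curvature)."""
    return eta0 / (1.0 + lam * curvature)

def hessian_spectral_norm(grad_fn, theta, n_iter=20, eps=1e-4, seed=0):
    """Estimate ||H(theta)|| by power iteration on finite-difference
    Hessian-vector products: H v ~ (grad(theta + eps*v) - grad(theta)) / eps."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    g0 = grad_fn(theta)
    norm_est = 0.0
    for _ in range(n_iter):
        hv = (grad_fn(theta + eps * v) - g0) / eps
        norm_est = np.linalg.norm(hv)
        if norm_est == 0.0:
            break
        v = hv / norm_est
    return norm_est

# Toy problem: quadratic loss 0.5 * theta^T A theta, whose Hessian is A.
A = np.diag([1.0, 10.0, 100.0])   # one direction has very high curvature

def grad(theta):
    return A @ theta              # gradient of the quadratic loss

theta = np.ones(3)
eta0, lam = 0.1, 0.1              # a fixed step of 0.1 alone would diverge here

for step in range(5):
    curv = hessian_spectral_norm(grad, theta)   # ~100 for this A
    eta_t = adaptive_lr(eta0, lam, curv)        # shrinks where curvature is high
    theta = theta - eta_t * grad(theta)
    print(f"step {step}: curvature ~ {curv:.1f}, eta_t = {eta_t:.4f}")
```

In this toy run the estimated curvature is dominated by the stiff direction, so the adapted step stays small enough for the iterates to remain stable, which is the behavior the formulation above is meant to guarantee.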