WOLFRAM NOTEBOOK

In[]:=

CompoundExpression[

]

deploy

Thu 4 May 2023 01:40:19

Power-law decay of Velikanov, Berthier, Zeke Xie

Summary:
- Grosse observed

-1

decay

by looking at KFAC factors of Gauss-Newton matrix
- Zeke Xie observes

-1.1

decayofNTKeigenvalues.TheydiscardtopKeigenvalueswhereK=numberofclasses.

- Velikanov observed

-1.3

decay and obtains power law decay of loss by finding generating function of the loss and applying Tauberian theorem. (TODO, get explicit formula for their approach)
- Berthier gives simulation on Gaussian linear least squares with

-1.4

decay and obtains

-0.29

decay in loss* by taking dimensionality to infinity
*Technically Berthier gives decay in error given non-uniform starting distribution, but it’s equivalent to decay in loss with isotropic starting distribution

top level section: Diagonal + rank1

Grosse, “Which Algorithmic Choices Matter at Which Batch Sizes?” https://arxiv.org/abs/1907.04164
(verified on CIFAR spectrum experiments here)

Zeke Xie “On the Power-Law Hessian Spectrums in Deep Learning”
https://arxiv.org/abs/2201.13011

Velikanov “A view of mini-batch SGD via generating functions: conditions of convergence, phase transitions, benefit from negative momenta”
https://arxiv.org/abs/2206.11124

“Tight Nonparametric Convergence Rates for Stochastic Gradient Descent under the Noiseless Linear Model”
https://arxiv.org/abs/2006.08212

Other notebooks:
- https://www.wolframcloud.com/obj/yaroslavvb/nn-linear/berthier-formulas.nb

Section 3.3 of Berthier’s Tight Nonparametric Convergence Rates

https://arxiv.org/abs/2006.08212

In Gaussian setting, e(s) can be computed exactly using Eq 5 in Bordelon’s “structured features” paper as follows

Code

In[]:=

d=10000;(*dimensions*)numSteps=d;β=1.4;(*spectrumdecay*)δ=1.2;(*initialerrordecay*)h=Table[

-β

,{i,1,d}];(*Covariancespectrum*)c=Table[

-δ

,{i,1,d}];(*initialerrorvector*)c=c/Total[c];SF=StringForm;α=

Total[h]+2Max[h]

;(*stepsize,1/2ofmaximalcontractiveforerrorcovinSchatten-1norm*)(*diag+rank-1update.ComesfromEq.5inBordelon"structured features",specializedtoC=diag(c)*)step[c_]:=(

(1-αh)

)c+

hTotal[hc];cs=NestList[step,c,numSteps];observedVals=Total/@cs;stepVals=Table[s,{s,0,numSteps}];predictedVals=

-0.29

(#+1)

&/@stepVals;nudge=2.25;observed={stepVals,observedVals};predicted={stepVals,nudge*predictedVals};ListLogLogPlot[{observed,predicted},PlotRange->All,PlotLegends->{"observed",SF["``*

-0.29

",nudge]},AxesLabel->{"step","error"},PlotLabel->SF["Power-law decay with p=`` in d=``",β,d]]

Out[]=

	observed
	2.25* -0.29 s

What power-law constant to use?

- “ON THE POWER-LAW HESSIAN SPECTRUMS IN DEEP LEARNING” https://arxiv.org/pdf/2201.13011.pdf

(*MNIST*)β=1.939;p=

β-1

Out[]=

1.06496

In[]:=

(*CIFAR*)β=1.908;p=

β-1

Out[]=

1.10132

In[]:=

p=1.4d=20000;(*dimensions*)numSteps=d/2;β=p;(*spectrumdecay*)δ=0;(*initialerrordecay*)h=Table[

-β

,{i,1,d}];(*Covariancespectrum*)c=Table[

-δ

,{i,1,d}];(*initialerrorvector*)c=c/Total[c];SF=StringForm;α=

Total[h]+2Max[h]

;(*stepsize,1/2ofmaximalcontractiveforerrorcovinSchatten-1norm*)(*diag+rank-1update.ComesfromEq.5inBordelon"structured features",specializedtoC=diag(c)*)step[c_]:=(

(1-αh)

)c+

hTotal[hc];cs=NestList[step,c,numSteps];observedVals=h.#&/@cs;stepVals=Table[s,{s,0,numSteps}];(*predictedVals=

-0.29

(#+1)

&/@stepVals;*)predictedVals=0;observedData={stepVals,observedVals};ListLogLogPlot[{observedData},PlotRange->All,PlotLegends->{"observed",},AxesLabel->{"step","error"},PlotLabel->SF["Power-law decay with p=`` in d=``",p,d]]logStepVals=Rest[Log/@stepVals];logObservedVals=Rest[Log/@observedVals];data={logStepVals,logObservedVals};dataToFit=data[[Floor[Length[data]/2];;]];(*usesecondhalfforfitting*)lm=Normal@LinearModelFit[dataToFit,x,x];logPedictedPlot=Plot[lm,{x,Min[logStepVals],Max[logStepVals]},PlotLabel->"Predicted"];logObservedPlot=ListPlot[Log@Rest@observedData];(*Show[logPedictedPlot,logObservedPlot,PlotLabel->"logs"]*)linearFit=Exp[lm/.x->Log[s]];berthierZ=3103.527039817376`;berthierFit=

-0.29

/berthierZ;predictedPlot=Plot[{None,linearFit,berthierFit},{s,1,numSteps},ScalingFunctions->{"Log","Log"},PlotLegends->{"",SF["`` (empirical)",linearFit],SF["``(berthier)",berthierFit]}];observedPlot=ListPlot[Rest@observedData,ScalingFunctions->{"Log","Log"}];Show[predictedPlot,observedPlot,PlotLabel->"loss of Gaussian SGD with `` decay"]

What power law does deterministic GD follow?

Can get explicit formula by taking Laplace transform in terms of Exponential integrals

You are using a browser not supported by the Wolfram Cloud

Supported browsers include recent versions of Chrome, Edge, Firefox and Safari.

I understand and wish to continue anyway »