Wolfram Cloud Document

In[]:=

CompoundExpression[

]

deploy

Thu 6 Apr 2023 14:16:57

High-level experiments relating to least-squares optimization and neural networks are summarized here.

More specific notebooks
gd-vs-sgd: shared code for running spectra analysis
linear-estimation-*: shared code for gradient descent/learning rate finders

nn<>least-squares-scratch: unorganized pre code

Utilities


Step length after normalization


Break-even for small batches


Get GS length dependence on batch size


Get residual dependence on batch size


Decay of GS contributions


Growth of singular values squared (linear)


Convergence of effective rank


Improvement from average Kaczmarz step


Ortho decay and effective rank?


Preconditioner improvements


Stable rank, effective dimension, required sketch size


Decay of eigenvalues vs Cholesky vs Pivoted


Does stable rank predict largest usable batch size?


What is the norm of typical batch at size=stable rank?


How angle and stable rank relate for X with 2 stacked rows


Normalization and efficiency for various distributions?


Optimal vs average optimal step


Expected drop as a function of dimension


Expected drop as cross correlation


Harmonic mean of sensitive rank


Estimating Frobenius norm


Step sizes for power law decay


Use Gaussian identities to estimate step sizes


Effective ranks and norm growth


Harmonic interpolation of step sizes


Deterministic vs stochastic rates

The deterministic gradient descent learning rate goes up with dimension, while the stochastic learning rate goes down. The difference in learning rate is huge in the deterministic case and tiny in the stochastic case.
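One way to see why the two regimes can behave so differently is the Gaussian fourth-moment identity. The derivation below is a sketch under assumptions that are not stated in the notebook: noiseless targets, one example per stochastic step drawn as x ~ N(0, Σ), and an error vector e averaged over an isotropic distribution with E[e eᵀ] = I/d. The symbols α (step size), t₁ = tr Σ and t₂ = tr Σ² are introduced here for illustration; they appear to correspond to trSigma and trSigma2 in the code.

```latex
\begin{align*}
\mathbb{E}\!\left[x x^\top x x^\top\right]
  &= 2\Sigma^2 + \operatorname{tr}(\Sigma)\,\Sigma
  && x \sim \mathcal{N}(0,\Sigma) \\[4pt]
\text{stochastic step: } e' &= (I - \alpha\, x x^\top)\, e \\
\mathbb{E}\,\|e'\|^2
  &= 1 - \frac{2\alpha\, t_1 - \alpha^2\,(2 t_2 + t_1^2)}{d}
  && \Rightarrow\quad \alpha^\ast_{\text{stoch}} = \frac{t_1}{2 t_2 + t_1^2} \\[4pt]
\text{deterministic step: } e' &= (I - \alpha\, \Sigma)\, e \\
\mathbb{E}\,\|e'\|^2
  &= 1 - \frac{2\alpha\, t_1 - \alpha^2\, t_2}{d}
  && \Rightarrow\quad \alpha^\ast_{\text{det}} = \frac{t_1}{t_2}
\end{align*}
```

In both cases the expected error stops decreasing at twice the optimal step. Under these assumptions the stochastic optimum carries an extra t₁² term in the denominator, so for spectra where tr Σ keeps growing with dimension while tr Σ² levels off (such as the power-law decay used in this section) the stochastic step shrinks while the deterministic step t₁/t₂ grows.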

In[]:=

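(* Experiment constants. *)
maxBatch = 50;
batchSizes = First /@ Partition[Range[maxBatch], Max[1, Floor[maxBatch/10]]];

(* As recorded, the spectrum is evaluated here before d and decay are set, *)
(* which is what triggers the Table messages in the output. *)
(* getRates recomputes these quantities from its argument. *)
(* Assumed: the base i of the power; only the exponent -decay survived export. *)
evals = N@Table[i^-decay, {i, 1, d}];
trSigma = Total[evals];
trSigma2 = Total[evals*evals];
normSigma = Max[evals];

(* Rank estimates from a data matrix X. *)
(* Assumed: the transpose and the exponents, which were lost in export. *)
estimateR1[X_] := With[{sigma = Transpose[X].X},
  Norm[X, "Frobenius"]^2/Norm[sigma]];
estimateR2[X_] := With[{sigma = Transpose[X].X},
  Norm[X, "Frobenius"]^4/Norm[sigma, "Frobenius"]^2];

(* Four learning rates followed by two ranks, in the order of the plot legends. *)
(* Assumed: the typeset fractions lost their exponents and bare numeric numerators in export; *)
(* each entry pairs the surviving numerator and denominator, with the lost parts filled in as marked. *)
getRates[evals_] := (
  trSigma = Total[evals];
  trSigma2 = Total[evals*evals];
  normSigma = Max[evals];
  {
    2 trSigma/(2 trSigma2 + trSigma^2), (* stochastic first; exponent assumed *)
    2/(2 normSigma + trSigma),          (* stochastic anytime; numerator assumed *)
    trSigma/trSigma2,                   (* deterministic first *)
    1/normSigma,                        (* deterministic anytime; numerator assumed *)
    trSigma2/normSigma,                 (* r; any exponent was lost *)
    trSigma^2/trSigma2                  (* R; exponent assumed *)
  }
);

(* Sweep the dimension d for a power-law spectrum and plot rates and ranks. *)
decay = 1.1;
vals = Transpose@Table[getRates[N@Table[i^-decay, {i, 1, d}]], {d, 1, 1000}];
ListLinePlot[vals[[;; 4]], PlotLegends -> {"stochastic first", "stochastic anytime", "deterministic first", "deterministic anytime"}]
ListLinePlot[vals[[5 ;;]], PlotLegends -> {"r", "R"}, PlotLabel -> "ranks"]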

Table: Iterator {i,1,d} does not have appropriate bounds.

Table: Iterator {i,1.,d} does not have appropriate bounds.

Out[]=

	stochastic first
	stochastic anytime
	determistic first
	deterministic anytime

Out[]=

	r
	R

Out[]=

	"stochastic first"
	"stochastic anytime"
	"deterministic first"
	"deterministic anytime"

Formula for loss decay on power law
