In[]:=
(*deployswithcanonicalname*)​​
CompoundExpression[
]
​​deploy
Sat 25 Nov 2023 13:10:21

2-NN<>Least Squares

High level experiments relating to least squares optimization/NN.
Companion to “2-NN<>Least Squares” notability​
​
Previous version: nn<>least-squares.nb
Blog-trajectories doc​
​
Rough (transient) prototyping on nn-least-squares-scratch3.nb
Dropbox version history: git0/nn-linear

Alpha Capacity, Effective Rank, Power-law decay

https://www.wolframcloud.com/obj/yaroslavvb/nn-linear/alpha-capacity-effective-rank.nb

Exp approximation of power

In[]:=
eval:=(​​d=100;​​X=RandomReal[{-1,1},{d,d}];​​H=X.X;​​H=H/Norm[H];​​A=-H;​​t=10;​​ii=IdentityMatrix[d];​​traceNormalize[vec_]:=vec/Total[vec];​​u=traceNormalize@RandomReal[{0,1},{d}];​​v=traceNormalize@RandomReal[{0,1},{d}];​​Clear[f];​​u=traceNormalize@ConstantArray[1.,{d}];​​v=u;​​​​f[mat_]:=u.mat.v;​​f@MatrixExp[tA]-f@MatrixPower[ii+A,t]​​);​​ListPlot@Table[eval,{100}]
Out[]=
In[]:=
With[{t=-3},LogPlot[{Exp[tx],(1+x)^t},{x,-1,1},PlotLegends->{"exp","power"}]]
Out[]=
exp
power
In[]:=
d=3;​​X=RandomReal[{-1,1},{d,d}];​​H=X.X;​​H=H/Norm[H];​​H=DiagonalMatrix[{1,1/2,1/3}];​​A=-H;​​t=10;​​ii=IdentityMatrix[d];​​LogPlot[{Tr@MatrixExp[tA],d+tTr[A],Exp[t*Max@Eigenvalues@A]},{t,0,d*d},PlotLegends->{"true","linear","exp"}]
Out[]=
true
linear
exp
In[]:=
With[{x=.3},Plot[{Exp[-tx],(1-x)^t},{t,-1,20},PlotLegends->{"
-xt
e
","(1-x
t
)\),
"}]]
Out[]=
-xt
e
(1-x
t
)\),
In[]:=
Clear[n];​​GeneratingFunction[
n
a
,n,x]
Out[]=
1
1-ax

Incomplete Gamma result

Loss of power-law decay in the continuous/continuous limit
https://mathematica.stackexchange.com/questions/283911/asymptotic-expressions-for-incomplete-gamma

Exp polynomial Factoring


Differential equation approach


Continuous GD with p-decay


Continuous GD + SGD


Kolmogorov Concentration

​​d=1000;​​p=1.;​​h$=Table[
-p
i
,{i,1,d}];​​updateDet[e_]:=
2
(1.-h$)
*e;​​genTraj[e0_]:=NestList[updateDet,e0,10000];​​B=10;​​randomTrajectories=genTraj/@RandomVariate[NormalDistribution[],{B,d}];​​norm2[point_]:=point.point;​​plot1=ListLinePlot[Map[norm2,randomTrajectories,{2}],ScalingFunctions->"Log",PlotStyle->Directive[Gray,Opacity[.2]]];​​e0=N[
d
]~Join~ConstantArray[0,d-1];​​nonRandom=genTraj/@{RotateLeft[e0],RotateRight[e0]};​​plot2=ListLinePlot[Map[norm2,nonRandom,{2}],ScalingFunctions->"Log"];​​Show[plot1,plot2]
Free probability
Effective rank:
- Nov 24: ask stats.SE “Frobenius norm of a product of Gaussian matrices” post
​
Shared notebooks:
- purity-of-matrix-products.nb (Mathoverflow post, my own answer)
- Frobenius norm of product of random matrices
- sent to Thomas thomas-convergence-of-rank.nb​
- forum-burda-evals.nb (checking for commutativity)
- forum-product-of-matrices.nb​
​
- Formulas-contents/”dot product formulas” notability, “NN<>Least Squares:Frobenius/Trace/Rank” notability

- formula-check.nb​
- scicomp (Effective algorithm): https://scicomp.stackexchange.com/questions/43039/estimating-the-sum-of-4th-powers-of-singular-values/43044#43044
​
​
- fast effective rank https://www.wolframcloud.com/obj/yaroslavvb/nn-linear/forum-sum-of-singular-vals.nb
​
​
3.59-3.60 of https://arxiv.org/pdf/1510.06128.pdf
Also https://iopscience.iop.org/article/10.1088/1742-6596/473/1/012002/pdf
​
forum-burda-evals.nb
​
​
Construct MeijerG:
​
1/n

Moments of Fuchs-Catalan density

Gaussian matrix moments

Solve using S-transform

​
Now for squared Wishart

Simplified solving using S-transform

https://mathematica.stackexchange.com/questions/287953/obtaining-moments-from-the-inverse-of-the-moment-generating-function

Convergence of effective rank

​
​
​

Isometry/Ahle

TLDR; information is kept on isometries, but isometry + small noise will lose it

TODO: learning rates complete picture

For a given problem, we have following variables
​
Step size: 1/Tr, optimal
Batch size: b=1, b=inf
Norm for alpha=1: l1, l2, linf
Norm for alpha=opt: l1, l2, linf
​
Growth at C=I: l1, l2, linf
Growth at C=stationary: l1, l2, linf (should be same as norm for l2)
​
Refactor these by using common methods. Get graphs for Gaussian with infinite size, Gaussian with d=1000, MNIST, MNIST whitened
​

Questions

Is stationary distribution/stationary more representative for GD than for SGD?
Compare behavior at steps=d for common distributions