What generic regularities exist in trained vs. untrained networks?

What is the initialized distribution? Generally, the tails of the weight distribution get longer during training...
In[]:=
piecewise = Piecewise[{{0, # < -2}, {1, # < -1}, {0, # < 0}, {1, # < 2}}] &;
SeedRandom[23049829048];
trainingx = RandomReal[{-3, 3}, 5000];
trainingy = piecewise /@ trainingx;
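A quick sanity check on the target (the sample points are arbitrary; Piecewise tries its conditions in order, so this function is 0 on (-∞,-2), 1 on [-2,-1), 0 on [-1,0), 1 on [0,2), and 0 beyond):
In[]:=
piecewise /@ {-2.5, -1.5, -0.5, 0.5, 2.5}
Out[]=
{0, 1, 0, 1, 0}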
In[]:=
(* === Training function === *)
TrainPiecewise[net_, seed_Integer] := Module[{res, checkpoints = {}, initialNet},
  initialNet = NetInitialize[net, RandomSeeding -> seed];
  res = NetTrain[initialNet, trainingx -> trainingy, All,
    TrainingProgressFunction -> {(AppendTo[checkpoints, #Net] &), "Interval" -> Quantity[10, "Rounds"]},
    MaxTrainingRounds -> 15000, BatchSize -> 5000, RandomSeeding -> seed];
  {res, Prepend[checkpoints, initialNet]}]
In[]:=
numInstances = 5; architecture = {8, 8, 8};
In[]:=
RRnet[lws_List] := NetChain[Join[Riffle[LinearLayer /@ lws, Ramp], {Ramp, LinearLayer[]}], "Input" -> "Real", "Output" -> "Real"]
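With architecture {8, 8, 8} and scalar input/output, the chain's linear layers have widths 1→8→8→8→1. A sketch of the expected weight-plus-bias count (the widths list here is reconstructed by hand, not computed from the net):
In[]:=
With[{ws = {1, 8, 8, 8, 1}}, Total[MapThread[#1 #2 + #2 &, {Most[ws], Rest[ws]}]]]
Out[]=
169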
In[]:=
allResults = Table[(Echo[i, "Training instance:"]; TrainPiecewise[RRnet[architecture], 123456*i]), {i, numInstances}];
» Training instance: 1
» Training instance: 2
» Training instance: 3
» Training instance: 4
» Training instance: 5
In[]:=
allResults
Out[]=
$Aborted
In[]:=
allCheckpoints = allResults[[All, 2]];

FlattenWeights[net_] := Join @@ Table[
   With[{layer = NetExtract[net, i]},
    If[Head[layer] === LinearLayer,
     Join[Flatten[Normal[NetExtract[layer, "Weights"]]],
      Flatten[Normal[NetExtract[layer, "Biases"]]]],
     {}]],
   {i, Length[net]}];

allWeights = Map[FlattenWeights, allCheckpoints, {2}];
Print["Weight dimensions: ", Dimensions[allWeights]];

allWeightsFlat = Flatten[allWeights, 1];
Weight dimensions: {5,1501,169}
In[]:=
Histogram /@ Take[allWeights[[1]], {1, -1, 100}]
Out[]=

,
,
,
,
,
,
,
,
,
,
,
,
,
,
,

In[]:=
ListDensityPlot[BinCounts[#, {-5, 5, .25}] & /@ allWeights[[1]]]
Out[]=
In[]:=
ListPlot3D[BinCounts[#, {-5, 5, .25}] & /@ allWeights[[1]]]
Out[]=
In[]:=
ListPlot3D[BinCounts[#, {-5, 5, .25}] & /@ allWeights[[1]], PlotRange -> All, ScalingFunctions -> {Identity, Identity, "Log"}]
Out[]=
In[]:=
ListLinePlot[Variance/@allWeights[[1]]]
Out[]=
200
400
600
800
1000
1200
1400
0.5
1.0
1.5
In[]:=
ListLinePlot[(Variance/@#)&/@allWeights]
Out[]=
200
400
600
800
1000
1200
1400
0.5
1.0
1.5
2.0
2.5
3.0
3.5

General claim: as the network trains, the distribution of weight values broadens.
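One way to quantify the longer-tails part of this claim beyond variance (a sketch, assuming allWeights as computed above; kurtosis is just one of several reasonable tail statistics):
In[]:=
ListLinePlot[Kurtosis /@ allWeights[[1]], AxesLabel -> {"checkpoint", "kurtosis"}]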

Microscopically, we could study correlations between weight values (e.g. how correlated are weights in different parts of the network; correlations might be exponentially damped with distance between layers, or they might not be).
One can also look at this as a function of network depth.
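A first pass at such correlations (a sketch: it treats each of the 169 flattened weight positions of instance 1 as a time series over checkpoints and plots all pairwise correlations; reading network distance off the block structure is an assumption):
In[]:=
MatrixPlot[Correlation[allWeights[[1]]]]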

More ...

Investigations

Look at MLPs of different widths and depths
Look at larger single networks vs. ensembles of networks
◼ Distribution of all weights as a function of training
◼ Distribution of weights per layer as a function of training
◼ Individual weights vs. training rounds; also make this plot normalized by the total norm of the weights. Take grid networks: can one visually see training?
◼ How do these results change if one changes the target function?
◼ Richard observes that adversarially changing a few weights usually won't affect the results.
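For the per-layer distribution item, a minimal sketch (assuming allCheckpoints from above; in the RRnet chain the linear layers sit at positions 1, 3, 5, and 7):
In[]:=
finalNet = allCheckpoints[[1, -1]];
Histogram[Flatten[Normal[NetExtract[NetExtract[finalNet, #], "Weights"]]]] & /@ {1, 3, 5, 7}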

Claim: as training progresses, the rank of the weight matrices decreases (?)

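One way to probe the rank claim is to track an effective-rank measure of a hidden layer's weight matrix across checkpoints (a sketch; the entropy-based effective rank and the choice of layer 3, the second linear layer, are assumptions, not something computed above):
In[]:=
EffectiveRank[mat_] := With[{p = Normalize[SingularValueList[mat], Total]}, Exp[-Total[p Log[p]]]];
ListLinePlot[Table[EffectiveRank[Normal[NetExtract[NetExtract[net, 3], "Weights"]]], {net, allCheckpoints[[1]]}]]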