Approximation Theory
Given a family of curves G, and a family of neurons F. For any curve , does there exist a neural net such that .
- Goal is to think about how deep and how wide the network should be? Why should we care about the architecture of a NN?
- One nice family: Lipschitz family of functions.
Prove a special version of universal approximation theorem for lipschitz functions and 3-layer relu networks. More details of the proof can be found here.
- To understand this, taking a special case of univariate case, and approximating the function using piecewise constant function or rectangles, and then finding the error such that can be an easier first introduction.
- Let be any L-lipschitz function. Then for any error , there exists a 3-layer relu network with total number of neurons such that .
- This can be generalised to any continuous function, i.e. when g is any continuous function. There exists a 3-layer ReLU network f with ReLU with .
- We aim to approximate the function using a partition as rectangles such that each side . Each is of the form .
- Let be the piecewise constant function. This means the function will be approximated at each partition using h when , where each .
- We can represent rectangles using a linear combination of ReLU , same for other dimensions as well. when and when , and 0 otherwise.
- This can be combined to form a hyperrectangle , (d-1) to cut the hyperrectangle at other positions. So, this is equal to 1 at only the partition . Thus, .
- and then linearly combined using a ReLU layer .
- So, we’re in a 3-way approximation g → h → f, and that’s why ,
- Note the curse of dimensionality in above proof (exponential dependence on ), and above proof fails to generalise because we’re using rectangles to approximate a function.
Universal Approximation class A class of functions is universal approximator over a compact set S if for every continuous function g and , there exists such that .
- Above proof can be succinctly done in 2 layers.
- This is generally proven using Stone-Weierstrass Theorem.
- TODO: read this.
Barron’s Theorem
Fourier representation is also a universal approximator - TODO: read this.
Depth Separation
Goal is to prove that to approximate a function that can be approximated with constant width deep networks, constant deep shallow network, require exponential many neurons. - Proven by taking a function . This function has the property that the number of kinks scale exponentially with . - In a L layer ReLU network or widths , number of affine pieces (piecewise linear regions) for layer i ⇐ . Thus, . - And total number of affine pieces = . - Any univariate function f with N piecewise linear regions or affine pieces (or kinks) can be composed together. - , . - Above result is used to prove that depth increases number of affine pieces multiplicatively, while width only scales it additively. - TODO: redo the proof properly.