Approximation Theory

Given a family of curves G, and a family of neurons F. For any curve , does there exist a neural net such that .

  • Goal is to think about how deep and how wide the network should be? Why should we care about the architecture of a NN?
  • One nice family: Lipschitz family of functions.

Prove a special version of universal approximation theorem for lipschitz functions and 3-layer relu networks. More details of the proof can be found here.

  • To understand this, taking a special case of univariate case, and approximating the function using piecewise constant function or rectangles, and then finding the error such that can be an easier first introduction.
  • Let be any L-lipschitz function. Then for any error , there exists a 3-layer relu network with total number of neurons such that .
    • This can be generalised to any continuous function, i.e. when g is any continuous function. There exists a 3-layer ReLU network f with ReLU with .
  • We aim to approximate the function using a partition as rectangles such that each side . Each is of the form .
  • Let be the piecewise constant function. This means the function will be approximated at each partition using h when , where each .
  • We can represent rectangles using a linear combination of ReLU , same for other dimensions as well. when and when , and 0 otherwise.
  • This can be combined to form a hyperrectangle , (d-1) to cut the hyperrectangle at other positions. So, this is equal to 1 at only the partition . Thus, .
  • and then linearly combined using a ReLU layer .
  • So, we’re in a 3-way approximation g h f, and that’s why ,
  • Note the curse of dimensionality in above proof (exponential dependence on ), and above proof fails to generalise because we’re using rectangles to approximate a function.

Universal Approximation class A class of functions is universal approximator over a compact set S if for every continuous function g and , there exists such that .

  • Above proof can be succinctly done in 2 layers.
  • This is generally proven using Stone-Weierstrass Theorem.
  • TODO: read this.

Barron’s Theorem

Fourier representation is also a universal approximator - TODO: read this.

Depth Separation

Goal is to prove that to approximate a function that can be approximated with constant width deep networks, constant deep shallow network, require exponential many neurons. - Proven by taking a function . This function has the property that the number of kinks scale exponentially with . - In a L layer ReLU network or widths , number of affine pieces (piecewise linear regions) for layer i . Thus, . - And total number of affine pieces = . - Any univariate function f with N piecewise linear regions or affine pieces (or kinks) can be composed together. - , . - Above result is used to prove that depth increases number of affine pieces multiplicatively, while width only scales it additively. - TODO: redo the proof properly.