Gaussian Processes

MVN

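The multivariate normal (MVN) distribution $\mathcal{N}(\mu, \Sigma)$ is a distribution over a vector of $n$ random variables $X = (X_1, \dots, X_n)$ in $n$ dimensions, where the index $i$ identifies the random variable $X_i$. The mean vector $\mu$ sets the position (centre) of the distribution, and the covariance matrix $\Sigma$ sets its scale and shape on the basis of the covariance between each pair $(X_i, X_j)$: $\Sigma_{ij} = \mathrm{Cov}(X_i, X_j)$.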
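As a small illustration of how the mean and covariance determine an MVN, the sketch below draws samples from a 2-dimensional example and recovers both parameters empirically. The specific numbers in `mu` and `Sigma` are made up for illustration and are not from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumed, not from the notes): a 2-dimensional MVN.
mu = np.array([1.0, -2.0])        # centre of the distribution
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])    # variances on the diagonal, covariance off it

# Each row is one draw of the random vector (X_1, X_2).
samples = rng.multivariate_normal(mu, Sigma, size=50_000)

# With enough samples, the empirical statistics approach mu and Sigma.
empirical_mean = samples.mean(axis=0)
empirical_cov = np.cov(samples, rowvar=False)
```

Changing `mu` moves the cloud of samples without altering its shape, while changing the off-diagonal entry of `Sigma` tilts and stretches it.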

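Joint distribution (for $X$ partitioned into two blocks $X_A$ and $X_B$):

$$\begin{bmatrix} X_A \\ X_B \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_A \\ \mu_B \end{bmatrix}, \begin{bmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{bmatrix} \right)$$

Conditional distribution:

$$X_A \mid X_B = x_B \sim \mathcal{N}\left( \mu_A + \Sigma_{AB} \Sigma_{BB}^{-1} (x_B - \mu_B),\ \Sigma_{AA} - \Sigma_{AB} \Sigma_{BB}^{-1} \Sigma_{BA} \right)$$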
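A minimal NumPy sketch of conditioning a multivariate normal on a subset of its variables, using the standard partitioned-Gaussian identity written out in the code comments. The function name `condition_mvn` and the example numbers are illustrative assumptions, not part of these notes.

```python
import numpy as np

def condition_mvn(mu, Sigma, idx_a, idx_b, x_b):
    """Mean and covariance of X_A | X_B = x_b for X ~ N(mu, Sigma).

    Uses the identity
        mean = mu_A + S_AB S_BB^{-1} (x_b - mu_B)
        cov  = S_AA - S_AB S_BB^{-1} S_BA
    """
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    x_b = np.asarray(x_b, dtype=float)
    mu_a, mu_b = mu[idx_a], mu[idx_b]
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    # Solve linear systems against S_bb rather than forming its inverse.
    mu_cond = mu_a + S_ab @ np.linalg.solve(S_bb, x_b - mu_b)
    Sigma_cond = S_aa - S_ab @ np.linalg.solve(S_bb, S_ab.T)
    return mu_cond, Sigma_cond

# Two unit-variance variables with covariance 0.8; observe the second one at 1.0.
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
mu_cond, Sigma_cond = condition_mvn(mu, Sigma, idx_a=[0], idx_b=[1], x_b=[1.0])
```

In this example the conditional mean moves towards the observed value (0.8) and the conditional variance shrinks from 1 to 0.36, which is the same mechanism a Gaussian process uses when it conditions on training data.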

  • A Gaussian process (GP) is a non-parametric method that lets us make predictions on our data while incorporating prior knowledge. It defines a probability distribution over possible functions.
  • The most common use of a Gaussian process is fitting a function to data (regression). The difficulty is that infinitely many functions are consistent with any finite data set, so the space of candidate functions is unconstrained.
  • A Gaussian process assigns a probability to each of these functions, which restricts the function space. The mean of this distribution gives us the most probable characterisation of the data.
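  • A GP defines a distribution over functions $f: \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ is any domain, under the assumption that the function values $f(x_1), \dots, f(x_n)$ at any finite set of inputs $x_1, \dots, x_n$ are jointly Gaussian with mean $\mu_i = m(x_i)$ and covariance $\Sigma_{ij} = k(x_i, x_j)$, for a mean function $m$ and a kernel $k$.
  • For a new input (test point) $x_*$, we infer $f(x_*)$ from knowledge of $f(x_1), \dots, f(x_n)$.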
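  • Under the GP assumption, the function values at the training and test inputs are jointly Gaussian, and because Gaussians are closed under marginalisation and conditioning, the conditional distribution of the test values given the training values is Gaussian too.
  • The joint distribution, with dimension $n + n_*$ (the number of training points plus the number of test points), spans the space of all possible function values for the function that we are trying to predict.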
  • We use Bayesian inference: the prior models the functions before any training data is seen, and the posterior is built up gradually as we see more training data.
  • To determine the conditional probability distribution, we first need to build the prior with the parameters $\mu$ and $\Sigma$.
    • The mean $\mu$ is generally assumed to be 0 at the start to keep things simple. It can be added back later, since it only shifts the distribution.
    • The covariance matrix $\Sigma$ is determined by the covariance function, or kernel, of the Gaussian process. The kernel takes a pair of points as input and outputs the similarity between them, so $\Sigma_{ij} = k(x_i, x_j)$.
  • Before diving into kernel methods, let’s move ahead and understand what happens after we have the parameters and thus, the prior.
    • Using the covariance function, we build the covariance matrix as a Gram matrix (a positive semi-definite matrix).
    • Once we have the parameters, we can draw samples from the distribution. Each sample is a vector of function values, and the value of the sampled function at the $i$th test point is the $i$th entry of that vector.
  • Kernel: measures the similarity between a pair of points, implicitly as an inner product after mapping the inputs to a (possibly higher-dimensional) feature space.
    • Stationary: invariant to translation, e.g. the RBF and periodic kernels. The covariance of two points depends only on their relative position.
    • Non-stationary: the covariance depends on the absolute position of the points, e.g. the linear kernel.
    • We can combine kernels to build a better prior, and even estimate the kernel hyperparameters using gradient-based methods.
  • Once we start observing training data, we model the joint distribution between the test and training points and compute the corresponding covariance matrix.
  • Next, we condition the joint Gaussian on the training data and compute $p(f_* \mid f)$, the distribution of the test values $f_*$ given the training values $f$, which gives us the derived parameters $\mu_*$ and $\Sigma_*$ (the posterior mean and covariance). With noise-free observations, the training points constrain the set of possible functions in our function space to those that pass through the training data. Predictive uncertainty drops sharply near the training points and grows as we move further away from them.
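  • We can also model the error in the training points as Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma_y^2)$ and add it to our training values, $y = f(x) + \epsilon$. With a zero prior mean, the joint distribution then gets modified slightly, with the noise variance added to the diagonal of the training block:

    $$\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\left( 0, \begin{bmatrix} K + \sigma_y^2 I & K_* \\ K_*^\top & K_{**} \end{bmatrix} \right)$$

    where $K$, $K_*$ and $K_{**}$ are the train-train, train-test and test-test covariance matrices.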
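The regression steps in the list above (build a kernel, sample from the zero-mean prior, then condition on noisy training data) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the RBF kernel with unit length scale, the toy `sin` targets, the noise variance, and the helper names `rbf_kernel` and `gp_posterior` are all illustrative choices, not taken from these notes.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=1.0, variance=1.0):
    """Stationary RBF kernel on 1-D inputs: depends only on xa - xb."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_train, y_train, x_test, noise_var=1e-2):
    """Posterior mean and covariance of f(x_test) under a zero-mean GP prior."""
    # Noise variance is added to the diagonal of the training block only.
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)    # train-test covariances
    K_ss = rbf_kernel(x_test, x_test)    # test-test covariances
    # A Cholesky factorisation is more stable than inverting K directly.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    cov = K_ss - v.T @ v
    return mean, cov

rng = np.random.default_rng(0)

# Toy data (assumed for illustration).
x_train = np.array([-2.0, -1.0, 0.5, 2.0])
y_train = np.sin(x_train)
x_test = np.linspace(-3.0, 3.0, 61)

# Prior: zero mean, covariance from the kernel. Each row is one sampled
# function evaluated at x_test. The small jitter keeps the matrix well behaved.
K_prior = rbf_kernel(x_test, x_test) + 1e-8 * np.eye(len(x_test))
prior_samples = rng.multivariate_normal(np.zeros(len(x_test)), K_prior, size=3)

# Posterior after conditioning on the training data.
post_mean, post_cov = gp_posterior(x_train, y_train, x_test, noise_var=1e-4)
post_std = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))
```

Plotting `post_mean` with a band of `post_std` around it shows the behaviour described above: the band is narrow at the training inputs and widens towards the edges of the test range, where no data constrains the function.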

TODO:

  • GPs for classification
  • Estimating the kernel using gradient-based methods

Further reading:

  • PML book 1, Section 17.2
  • PML book 2, Chapter 18

References: