Why Scaling Works: Inductive Biases vs The Bitter Lesson

by Tarik Dzekman, October 2024


Over the past decade we’ve witnessed the power of scaling deep learning models. Larger models, trained on heaps of data, consistently outperform previous methods in language modelling, image generation, playing games, and even protein folding. To understand why scaling works, let’s look at a toy problem.

We start with a 1D manifold weaving its way through the 2D plane and forming a spiral:

[Figure: A spiral path around the origin. It winds around four times, turning sharply in the top-left quadrant, and loops back to the origin, forming a closed 1D manifold in the 2D plane.]

Now we add a heatmap which represents the probability density of sampling a particular 2D point. Notably, this probability density is independent of the shape of the manifold:

[Figure: The closed spiral manifold, with a sharp tip in the top-left quadrant, overlaid with a heatmap of spots representing a Gaussian mixture density. The spots are placed independently of the spiral.]

Let’s assume that the data on either side of the manifold is always completely separable (i.e. there is no noise). Datapoints on the outside of the manifold are blue and those on the inside are orange. If we draw a sample of N=1000 points it may look like this:

[Figure: The spiral manifold with points sampled from the probability distribution. Points that land inside the spiral are marked orange; points outside are marked blue.]

Toy problem: How do we build a model which predicts the colour of a point based on its 2D coordinates?
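To make the setup concrete, here is a minimal sketch of how such a dataset could be generated. The article doesn't specify the exact spiral or density, so the spiral r = θ, the Gaussian mixture parameters, and the even-odd labelling rule below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed spiral: r = theta, winding 4 times (theta in [0, 8*pi]).
TURNS = 4
THETA_MAX = 2 * np.pi * TURNS

def inside_spiral(points):
    """Approximate even-odd test: count how many spiral arms the ray
    from the origin to each point crosses. An odd count -> 'inside'."""
    rho = np.hypot(points[:, 0], points[:, 1])
    phi = np.mod(np.arctan2(points[:, 1], points[:, 0]), 2 * np.pi)
    ks = np.arange(TURNS + 1)
    crossing_radii = phi[:, None] + 2 * np.pi * ks[None, :]  # arm radii along each ray
    valid = crossing_radii <= THETA_MAX
    return ((crossing_radii < rho[:, None]) & valid).sum(axis=1) % 2

# Gaussian mixture density, placed independently of the manifold's shape.
means = rng.uniform(-THETA_MAX, THETA_MAX, size=(10, 2))
component = rng.integers(0, len(means), size=1000)
X = means[component] + rng.normal(scale=4.0, size=(1000, 2))
y = inside_spiral(X)  # 1 = orange (inside), 0 = blue (outside)
```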

In the real world we often can’t sample uniformly from all parts of the feature space. For example, in image classification it’s easy to find images of trees in general but less easy to find many examples of specific trees. As a result, it may be harder for a model to learn the differences between species for which there are few examples. Similarly, in our toy problem, some parts of the space are harder to predict simply because they are less likely to be sampled.

First, we build a simple neural network with 3 layers and train it for 1,000 epochs. The network’s predictions are heavily influenced by the particulars of the sample. As a result, the trained model has difficulty inferring the shape of the manifold purely because of sampling sparsity:

[Figure: “Manifold shape inferred from samples — neural network”. The sampled points and the decision boundary of a neural network trained to classify points as inside or outside. The boundary is of poor quality, with many isolated patches and plenty of noise.]
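A minimal sketch of such a network, assuming scikit-learn and three hidden layers of 64 units (the article states the depth and epoch count but not the widths or framework):

```python
from sklearn.neural_network import MLPClassifier

# X, y: the N=1000 sample of 2D points and their inside/outside labels.
net = MLPClassifier(hidden_layer_sizes=(64, 64, 64), max_iter=1000)
net.fit(X, y)

# Predict the colour of an arbitrary 2D point.
print(net.predict([[1.0, -2.0]]))
```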

Even knowing that the points are completely separable, there are infinitely many ways to draw a boundary around the sampled points. Based on the sample data, why should any one boundary be considered superior to another?

With regularisation techniques we could encourage the model to produce a smoother boundary rather than curving tightly around predicted points. That helps to an extent but it won’t solve our problem in regions of sparsity.

Since we already know the manifold is a spiral, can we encourage the model to make spiral-like predictions?

We can add what’s called an “inductive prior”: something we put in the model architecture or the training process which contains information about the problem space. In this toy problem we can do some feature engineering and adjust the way we present inputs to the model. Instead of 2D (x, y) coordinates, we transform the input into polar coordinates (r, θ).
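The transformation itself is straightforward. A sketch, reusing the hypothetical MLPClassifier setup from above:

```python
import numpy as np

def to_polar(xy):
    """Map Cartesian (x, y) inputs to (r, theta) features."""
    r = np.hypot(xy[:, 0], xy[:, 1])
    theta = np.arctan2(xy[:, 1], xy[:, 0])
    return np.column_stack([r, theta])

# Same architecture as before, but trained on polar features.
net_polar = MLPClassifier(hidden_layer_sizes=(64, 64, 64), max_iter=1000)
net_polar.fit(to_polar(X), y)
```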

Now the neural network can make predictions based on the distance and angle from the origin. This biases the model towards producing decision boundaries which are more curved. Here is how the newly trained model predicts the decision boundary:

[Figure: “Manifold shape inferred from samples — neural network (polar coordinates)”. The same sampled points, with the decision boundary of a network trained on polar coordinates. The prediction closely follows the spiral, with some noise in various places.]

Notice how much better the model performs in parts of the input space where there are no samples. The features of those missing points remain similar to the features of observed points, so the model can predict an effective boundary without seeing additional data.

Obviously, inductive priors are useful.

Most architecture decisions induce an inductive prior. Let’s try some enhancements and think about what kind of inductive prior each one introduces (a code sketch of these pieces follows the list):

  1. Focal Loss — increases the loss on data points the model finds hard to predict. This might improve accuracy at the cost of increasing the model complexity around those points (as we would expect from the bias-variance trade-off). To reduce the impact of increased variance we can add some regularisation.
  2. Weight Decay — an L2 penalty on the size of the weights prevents the model from learning features that are weighted too strongly towards any one sample.
  3. Layer Norm — has a lot of subtle effects, one of which could be that the model focuses more on the relationships between points than on their magnitudes, which might help offset the increased variance from using Focal Loss.
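Here is one way these three pieces could fit together. scikit-learn’s MLP doesn’t expose focal loss or layer norm, so this sketch switches to PyTorch; the gamma, layer widths, and decay strength are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss: scales binary cross-entropy by (1 - p_t)^gamma so that
    easy, well-classified points contribute less and hard points dominate."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # model's probability for the true class
    return ((1.0 - p_t) ** gamma * bce).mean()

# Three hidden layers with LayerNorm after each linear map.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.LayerNorm(64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.LayerNorm(64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.LayerNorm(64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)

# AdamW applies decoupled weight decay, the L2-style penalty described above.
optimiser = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```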

After making all of these improvements, how much better does our predicted manifold look?

[Figure: “Manifold shape inferred from samples — neural network (polar | FL | LN | WD)”. The prediction closely follows the spiral with some noise, but a region near the centre of the spiral is wrongly predicted as “inside” instead of “outside”.]

Not much better at all. In fact, it’s introduced an artefact near the centre of the spiral. And it still fails to predict anything at the end of the spiral (in the upper-left quadrant) where there is no data. That said, it has managed to capture more of the curve near the origin, which is a plus.

Now suppose that another research team has no idea that there’s a hard boundary in the shape of a single continuous spiral. For all they know there could be pockets inside pockets with fuzzy probabilistic boundaries.

However, this team is able to collect a sample of 10,000 points instead of 1,000. For their model they use a k-Nearest Neighbours (kNN) approach with k=5.
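A sketch of that baseline, assuming scikit-learn and a hypothetical X_10k, y_10k drawn the same way as before:

```python
from sklearn.neighbors import KNeighborsClassifier

# k=5 as in the article; each prediction is a vote among the 5 nearest samples.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_10k, y_10k)
```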

Side note: k=5 is a poor inductive prior here. For this problem k=1 is generally better. Challenge: can you figure out why? Add a comment to this article with your answer.

Now, kNN is not a particularly powerful algorithm compared to a neural network. However, even with a bad inductive prior, here is how the kNN solution scales with 10× more data:
