A Patch is More than 16*16 Pixels | by Mengliu Zhao | Jun, 2024


Pixel Transformer (PiT) — from an inductive bias perspective

Meta AI released a technical report on arXiv last week: “An Image is Worth More than 16*16 Patches”. Instead of proposing a novel method, the report answers a long-standing question: does it make sense to use pixels instead of patches as input tokens? And if so, why?

The paper takes the perspective of the inductive bias of locality. According to K. Murphy’s well-known machine learning book, inductive bias refers to the “assumptions about the nature of the data distribution.” In the early, “non-deep-learning” era, inductive bias was mostly “feature-related,” coming from features manually engineered for specific tasks. This was not necessarily a bad thing: for tasks where human experts have strong prior knowledge, hand-engineered features can be very effective. From a generalization perspective, however, such features are hard to transfer to broader tasks like general image classification and segmentation.

Beyond feature bias, the architecture itself carries inductive bias as well. The ViT is a great example of showing less inductive bias than CNN models in terms of architectural hierarchy, propagation uniformity, representation scale, and attention locality (see my previous Medium post for a detailed discussion). Still, the ViT retains one particular inductive bias: locality. When a ViT processes a sequence of patch tokens, pixels within the same patch are naturally treated differently by the model than pixels from different patches, and that is where the locality comes from. A toy comparison is sketched below.
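To make the locality bias concrete, here is a minimal sketch (my own illustration, not the paper’s code) contrasting ViT-style patch tokens with PiT-style pixel tokens; the tensor shapes are what matter:

```python
import torch

# Toy illustration (not the paper's code) of where the locality bias enters:
# a ViT fuses each 16*16 neighborhood into one token, while PiT treats
# every pixel as its own token, so no neighborhood is privileged.
B, C, H, W = 1, 3, 224, 224
img = torch.randn(B, C, H, W)

# ViT-style patch tokens: pixels inside a patch are mixed before attention.
P = 16
patches = img.unfold(2, P, P).unfold(3, P, P)                  # (B, C, 14, 14, P, P)
patch_tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
print(patch_tokens.shape)   # torch.Size([1, 196, 768])

# PiT-style pixel tokens: pixels interact with each other only through attention.
pixel_tokens = img.flatten(2).transpose(1, 2)                  # (B, H*W, C)
print(pixel_tokens.shape)   # torch.Size([1, 50176, 3])
```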

So, is it possible to reduce the inductive bias of locality even further? The answer is yes. PiT proposes using the “pixel set” as input, with three position embedding (PE) strategies: sin-cos, learned, and none. It shows superior performance over ViT on supervised, self-supervised, and generation tasks. The proposed pipeline is shown in the figure below.

Pixel Transformer pipeline. Image source: https://arxiv.org/abs/2406.09415
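To make the pipeline concrete, here is a minimal, hedged PyTorch sketch of the idea: each pixel is linearly projected into a token, an optional positional embedding is added, and a standard Transformer encoder plus a classification head does the rest. The class name and hyper-parameters are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class PixelTransformerSketch(nn.Module):
    """A minimal sketch of the PiT idea (illustrative hyper-parameters):
    each pixel becomes a token, then a standard Transformer encoder
    processes the resulting (long) sequence."""
    def __init__(self, in_chans=3, dim=192, depth=6, heads=3, num_classes=100):
        super().__init__()
        self.to_token = nn.Linear(in_chans, dim)         # 1 pixel -> 1 token
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # classification token
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, img, pos_emb=None):                # img: (B, C, H, W)
        tokens = self.to_token(img.flatten(2).transpose(1, 2))  # (B, H*W, dim)
        if pos_emb is not None:                          # PE is optional in PiT
            tokens = tokens + pos_emb
        tokens = torch.cat([self.cls.expand(len(tokens), -1, -1), tokens], dim=1)
        return self.head(self.encoder(tokens)[:, 0])     # classify from [cls]

model = PixelTransformerSketch()
logits = model(torch.randn(2, 3, 28, 28))                # 28*28 = 784 pixel tokens
print(logits.shape)                                      # torch.Size([2, 100])
```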

The idea seems simple and straightforward, and the authors state they are “not introducing a new method” here. Still, PiT shows great potential: on supervised classification on CIFAR-100 and ImageNet (with the input size reduced to 28*28), classification accuracy increases by more than 2% over ViT. See the table below.

Supervised learning classification. Image source: https://arxiv.org/pdf/2406.09415

Similar improvements are also observed on self-supervised learning and image generation tasks. What’s more, the authors show a trend of increasing performance as the patch size is reduced from 8*8 down to 1*1 (a single pixel), as shown below:

Performance increase from ViT (8*8 patch) to PiT (1*1 patch). Image from: https://arxiv.org/pdf/2406.09415
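The price of this trend is worth spelling out: halving the patch side quadruples the number of tokens, and self-attention cost grows quadratically with the token count. A back-of-the-envelope calculation (using a 32*32 input purely for divisibility; the exact configurations are in the paper):

```python
# Token count and attention cost as the patch shrinks (illustrative 32*32 input).
image_size = 32
for patch in (8, 4, 2, 1):
    tokens = (image_size // patch) ** 2
    print(f"patch {patch}x{patch}: {tokens:>5} tokens, "
          f"~{tokens ** 2:,} attention pairs")
# patch 8x8:    16 tokens, ~256 attention pairs
# patch 1x1:  1024 tokens, ~1,048,576 attention pairs
```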

What about positional encoding?

As pointed out in this research paper, positional encoding has been considered a prerequisite in transformer-based models, both for ordering the input token sequence and for improving accuracy. However, PiT shows that even after dropping the PE entirely, the drop in model performance is minimal:

PiT performance using three different PE: 1. fixed sin-cos PE; 2. learnable PE; 3. no PE. Image source: https://arxiv.org/pdf/2406.09415
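For reference, the three PE options compare roughly as follows in code; this is a generic sketch of the standard recipes, not the paper’s implementation:

```python
import math
import torch
import torch.nn as nn

def sincos_pe(num_tokens, dim):
    # Fixed sin-cos positional embedding, the standard Transformer recipe.
    pos = torch.arange(num_tokens, dtype=torch.float32).unsqueeze(1)        # (N, 1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))                           # (dim/2,)
    pe = torch.zeros(num_tokens, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                                                               # (N, dim)

N, D = 28 * 28, 192                 # one token per pixel of a 28*28 input
pe_options = {
    "sincos": sincos_pe(N, D),                    # 1. fixed, never trained
    "learned": nn.Parameter(torch.zeros(N, D)),   # 2. trained jointly with the model
    "none": None,                                 # 3. drop PE entirely
}
# Feeding pe_options["none"] into the sketch above simply skips the addition,
# which is the setting the paper reports as costing almost no accuracy.
```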

Why drop the positional encoding? Not only because doing so removes another source of locality bias: if we think about computing self-attention in a distributed manner, dropping the PE also greatly reduces the cross-device communication effort, which we’ll discuss in detail in the next section. A quick sanity check of the underlying property is shown below.
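The property behind that claim is easy to verify: without a positional embedding, self-attention is permutation-equivariant, so the order in which pixel tokens are processed, and hence how they are partitioned across devices, stops mattering. A tiny sanity check (my own illustration, not from the paper):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
x = torch.randn(1, 10, 16)          # 10 tokens, no positional embedding added
perm = torch.randperm(10)

out, _ = attn(x, x, x)
out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])
# Permuting the input permutes the output in exactly the same way.
print(torch.allclose(out[:, perm], out_perm, atol=1e-5))   # True
```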
