1. Introduction
With the proliferation of generative modeling techniques, such as Generative Adversarial Networks (GANs) [24], accurately discerning which methods perform better has become a critical aspect of the field. For visual data, metrics such as Inception Score (IS) [59], Kernel Inception Distance (KID) [4], and the ubiquitously-used Fréchet Inception Distance (FID) [26] have become standard practice for developing and adopting models. Under the hood, these methods evaluate the discrepancy between generated and natural images in a deep feature space, in order to capture relevant statistics of the two distributions. After all, at its core, generative modeling involves learning and mimicking the high-order, complex statistics of visual data.
Downsampling a circle. We resize an input image (left) by a factor of 8, using different image processing libraries. The Lanczos, bicubic, and bilinear implementations in PIL (top row) adjust the antialiasing filter width by the downsampling factor. Other implementations (including those used by pytorch-fid and tensorflow-fid) use fixed filter widths, introducing aliasing artifacts and resembling naive nearest-neighbor subsampling. Aliasing artifacts induce inconsistencies in the calculation of downstream metrics such as Fréchet Inception Distance (FID) [26], KID [4], IS [59], and PPL [33]. Note that an antialias flag is available in TensorFlow 2, but it is set to false (its default value) for the FID calculation.
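The difference between factor-adaptive and fixed-width filtering can be illustrated with a minimal sketch. The example below is an assumption-laden stand-in: it uses a box prefilter whose width matches the downsampling factor (PIL's actual Lanczos/bicubic kernels are more sophisticated, but are likewise scaled by the factor), and compares it against naive nearest subsampling on a synthetic binary circle.

```python
import numpy as np

# Synthetic 256x256 binary circle (a stand-in for the paper's input image).
size, factor = 256, 8
yy, xx = np.mgrid[0:size, 0:size]
circle = ((xx - size / 2) ** 2 + (yy - size / 2) ** 2) < (size / 3) ** 2
img = circle.astype(np.float64) * 255

# Antialiased downsampling: average each (factor x factor) block, i.e. a
# box prefilter whose support grows with the downsampling factor.
blocks = img.reshape(size // factor, factor, size // factor, factor)
antialiased = blocks.mean(axis=(1, 3))

# Naive nearest subsampling: keep every 8th pixel with no prefilter,
# which is what a fixed (too-narrow) filter degenerates toward.
nearest = img[::factor, ::factor]

print(np.unique(nearest).size)      # 2: only {0, 255} survive (jagged edge)
print(np.unique(antialiased).size)  # > 2: intermediate gray edge values
```

The antialiased result renders the circle's boundary with smooth intermediate intensities, while nearest subsampling keeps only the original two values, producing the jagged, aliased edge that perturbs downstream feature statistics.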