1. Introduction
Recent RAW burst super-resolution pipelines have significantly improved the quality of modern smartphone photos [33], [8]. State-of-the-art algorithms use specialized deep learning models that learn to merge the burst frames into a single high-resolution image [3], [19], [18], [23], [24]. Training these models requires paired datasets in which each noisy burst is matched to a clean reference. Most approaches synthesize realistic bursts from the reference using carefully tuned degradation models [2], [3], [19], [24]. Yet, because of low-level mismatches between real and synthetically generated bursts (e.g., noise distribution, blur kernels, camera trajectories, scene motion), models trained on synthetic data often generalize poorly to real-world inputs (Figure 1). To avoid this issue, other works collect weakly-paired datasets in which the reference is a high-resolution image of the same scene captured with a DSLR and a zoom lens on a tripod [2], [39]. However, this capture process is tedious and time-consuming, and the resulting image pairs are often misaligned, exhibit color and detail mismatches due to the different sensors, and permit only limited scene motion.