1. Introduction
Diagnosis is the central part of medicine, which heavily relies on the individual practitioner assessment strategy. Re-cent studies suggest that misdiagnosis with potential mor-tality and morbidity is widespread for even the most com-mon health conditions [32], [49]. Hence, reducing the frequency of misdiagnosis is a crucial step towards improving healthcare. Medical image segmentation, which is a cen-tral part of diagnosis, plays a crucial role in clinical out-comes. Deep learning-based networks for segmentation are now getting traction for assisting in clinical settings, how-ever, most of the leading segmentation networks in the lit-erature are deterministic [17], [23], [34], [36], [41], [42], [44], meaning they predict a single segmentation mask for each input image. Unlike natural images, ground truths are not deterministic in medical images as different diagnosticians can have different opinions on the type and extent of an anomaly [1], [1]5,[37], [39]. Due to this, the diagnosis from medical images is quite challenging and often results in a low inter-rater agreement [22], [24], [56]. Depending on only pixel-wise probabilities and ignoring co-variances between the pixels might lead to misdiagnosis. In clinical practice, aggregating interpretations of multiple experts have shown to improve diagnosis and generate fewer false negatives [57].
A) deterministic networks produce a single output for an input image. B) c-vae-based methods encode prior information about the input image in a separate network and sample latent variables from there and inject it into the deterministic segmentation network to produce stochastic segmentation masks. C) in our method the diffusion model learns the latent structure of the seg-mentation as well as the ambiguity of the dataset by modeling the way input images are diffused through the latent space. Hence our method does not need an additional prior encoder to provide latent variables for multiple plausible annotations.