1. INTRODUCTION
A substantial body of research assumes that sizable amounts of annotated audio data are available for training end-to-end classifiers [1], [2], [3], [4]. These studies are mostly based on publicly available datasets in which each class typically contains more than 100 audio examples [5], [6], [7], [8], [9]. In contrast, only a few works study the problem of training neural audio classifiers with few audio examples (for instance, fewer than 10 per class) [10], [11], [12], [13]. In this work, we study how a number of neural network architectures perform in such a situation. Two primary reasons motivate our work: (i) given that humans are able to learn novel concepts from few examples, we aim to quantify to what extent such behavior is possible in current neural machine listening systems; and (ii) given that data curation processes are tedious and expensive, it is unreasonable to assume that sizable amounts of annotated audio are always available for training neural network classifiers.