1. INTRODUCTION
The goal of acoustic representation learning is to transform raw or surface features into high-level features that are more accessible to acoustic tasks [1]. Making acoustic representations more general and robust is critical to improving the performance of these tasks. However, the amount of labeled data for a specific acoustic task may be limited, so the learned representations can be less robust and performance can degrade on unseen data. Moreover, there exists a wide variety of acoustic tasks, ranging from speaker verification and speech recognition to event and scene detection. Under supervised learning, a representation learned for one task may be less suited to another. It is therefore worthwhile to explore how to utilize datasets of all kinds to learn a representation that is general and robust across acoustic tasks.