Enhancing Synthetic Data Generation for Class Imbalance and Privacy Preservation | IEEE Conference Publication | IEEE Xplore

Abstract:

Synthetic data generation has emerged as a powerful solution to meet the demand for high-quality, diverse, and privacy-preserving data in many domains. Still, handling class imbalance and privacy preservation in synthetic tabular data generation remains an open challenge. This study introduces two algorithms: balanced Tabular Generative Adversarial Network (b-TGAN) and balanced Tabular Principal Component Analysis (b-TPCA). b-TGAN tackles class imbalance by incorporating a re-balancing mechanism and leveraging an Autoencoder, while b-TPCA offers a privacy-preserving solution by generating synthetic data from statistical properties. Through experiments on five datasets, this study demonstrates the effectiveness of b-TGAN in generating balanced data, particularly in improving performance on minority classes. b-TPCA also shows promising results, achieving ML utility comparable to the baseline method while enhancing privacy preservation.
Date of Conference: 15-18 December 2024
Date Added to IEEE Xplore: 16 January 2025
Conference Location: Washington, DC, USA

I. Introduction

With the advances in artificial intelligence, the performance of machine learning (ML) models is often tied to the volume and quality of the data they are trained on. However, obtaining large amounts of real-world data is difficult. The collection process can be time-consuming and expensive. In addition, privacy concerns, particularly in sensitive domains such as healthcare and finance, severely limit data accessibility, and regulations such as the General Data Protection Regulation (GDPR) [1] restrict data acquisition and publication. This conflict between the need for data and the requirement to protect privacy has created a pressing need for innovative solutions. Moreover, real-world datasets often suffer from class imbalance and data scarcity in under-represented classes.
