The Generation of Synthetic Healthcare Data Using Deep Neural Networks
Abstract
High-quality tabular data is a crucial requirement for developing data-driven applications, especially healthcare-related ones, because most of the data collected in this context is in tabular form. However, to protect patients' privacy and confidentiality, strict data protection regulations such as the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR) present many obstacles to accessing and conducting scientific research on healthcare datasets. Synthetic data has therefore become an attractive alternative for data scientists and healthcare professionals who need to circumvent such hurdles. Although many healthcare data providers still use classical de-identification and anonymization techniques for generating synthetic data, deep learning-based generative models such as Generative Adversarial Networks (GANs) have shown remarkable performance in generating tabular datasets with complex structures. This thesis therefore examines the potential and applicability of GANs within the healthcare industry, which often faces serious challenges with insufficient training data and the sensitivity of patient records.
We investigate several state-of-the-art GAN-based models proposed for tabular synthetic data generation. Specifically, we assess the performance of the TGAN, CTGAN, CTABGAN, and WGAN-GP models on healthcare datasets with different sizes, numbers of variables, column data types, feature distributions, and inter-variable correlations. Moreover, we define a comprehensive evaluation framework to assess the quality of the synthetic records and the ability of each model to preserve patients' privacy. After training the selected models and generating synthetic datasets, we evaluate the strengths and weaknesses of each model based on statistical similarity metrics, machine learning-based evaluation scores, and distance-based privacy metrics.
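To make the idea of a distance-based privacy metric concrete, the following is a minimal sketch of one common choice, the distance to closest record (DCR). It is an illustration only: the abstract does not state which distance metrics the thesis uses, and the function name, the toy data, and the assumption that all features are numeric and already scaled are hypothetical.

```python
import math

def distance_to_closest_record(synthetic, real):
    """For each synthetic row, return the Euclidean distance to its nearest real row.

    Assumes (for illustration) that every row is a list of numeric,
    already-scaled feature values of the same length.
    """
    return [min(math.dist(s, r) for r in real) for s in synthetic]

# Toy data: three "real" patient records and three synthetic ones.
real = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
synthetic = [[0.0, 0.0], [1.0, 2.0], [4.0, 3.0]]

dcr = distance_to_closest_record(synthetic, real)

# A distance of zero means the synthetic row duplicates a real record,
# which would be a privacy concern under this metric.
exact_copies = sum(d == 0.0 for d in dcr)
```

Under this sketch, larger distances suggest that the generator is not simply memorizing training records, while many near-zero distances would indicate potential leakage of real patient data.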
The results indicate that the selected models can generate datasets that maintain the statistical characteristics, model compatibility, and privacy of the original ones. Synthetic tabular healthcare datasets can therefore be a viable option in many data-driven applications. However, there is still room for improvement in designing architectures for generating synthetic tabular data.