The Personal Data Protection Commission of Singapore has unveiled a comprehensive guide on synthetic data generation, aimed at enhancing data protection and AI development through innovative technologies.
Singapore’s PDPC Releases Guide on Synthetic Data Generation
On July 15, 2024, the Personal Data Protection Commission of Singapore (PDPC) unveiled its Proposed Guide on Synthetic Data Generation during Personal Data Protection Week. The Guide forms part of the Privacy Enhancing Technology (PET) Sandbox and is intended to help organisations understand the techniques and potential applications of Synthetic Data (SD), particularly in artificial intelligence (AI). Minister for Digital Development and Information Josephine Teo highlighted SD generation as an evolving PET that allows AI models to be trained on realistic data without compromising sensitive information.
Understanding Synthetic Data and Its Benefits
Synthetic Data refers to data that is artificially generated by purpose-built mathematical models, including AI and machine learning (ML) algorithms, so that it mimics the characteristics and structure of the source data. SD offers several advantages:
- Enhancing AI/ML Development: SD supports AI and ML growth by allowing model training without exposing actual personal data.
- Addressing Data Challenges: It tackles dataset-related challenges in AI model training by augmenting and diversifying training datasets, ensuring sufficiency and reducing biases.
- Facilitating Collaboration and Software Development: SD utilisation in data analytics and collaborative projects mitigates the risk of data breaches during development.
The Role of Privacy Enhancing Technologies (PETs)
PETs are defined as tools and techniques enabling the processing, analysis, and extraction of insights from data without revealing the underlying personal or commercially sensitive information. PETs fall into three main categories:
- Data Obfuscation
- Encrypted Data Processing
- Federated Analytics
SD generation is a form of data obfuscation and finds applications in privacy-preserving AI/ML, data sharing, and software testing.
Use Cases and Good Practices for Synthetic Data Generation
The Guide elaborates on various use cases for SD along with recommended best practices:
- Generating Training Data for AI/ML Models:
  - Good Practices: Implementing noise addition in appropriate scenarios and reducing the granularity of the SD points.
- Data Analysis and Collaboration:
  - Good Practices: Removing outliers, pseudonymising, and minimising granular data during the preparation phase, adding noise before or after SD generation, and integrating technical, contractual, and governance measures to mitigate re-identification risks post-generation.
- Software Testing:
  - Good Practices: Generating SD that adheres to the semantics of source data rather than mere statistical properties.
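Two of the good practices listed above, noise addition and granularity reduction, can be illustrated with a minimal Python sketch. It is not taken from the Guide: the function names, the ten-year age band, and the noise scale of 100 are assumptions chosen purely for illustration, and appropriate values would depend on the dataset and the organisation's risk threshold.

```python
import random


def generalise_age(age: int, band: int = 10) -> str:
    """Reduce granularity: map an exact age to a band such as '30-39'."""
    low = (age // band) * band
    return f"{low}-{low + band - 1}"


def add_noise(value: float, scale: float, rng: random.Random) -> float:
    """Perturb a numeric value with zero-mean Gaussian noise."""
    return value + rng.gauss(0.0, scale)


# Fixed seed so the sketch is repeatable; the records are made up.
rng = random.Random(0)
records = [
    {"age": 34, "income": 5200.0},
    {"age": 47, "income": 8100.0},
]

# Replace the exact age with a band and perturb the income figure.
prepared = [
    {
        "age_band": generalise_age(r["age"]),
        "income": round(add_noise(r["income"], 100.0, rng), 2),
    }
    for r in records
]
```

A wider band or a larger noise scale lowers re-identification risk at the cost of utility, which is the trade-off the Guide asks organisations to weigh.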
Key Steps in Synthetic Data Generation
The Guide delineates a structured five-step approach to minimising re-identification risks associated with SD:
- Know Your Data: Understand the purpose and use cases for SD, and the nature of the source data. Prioritise data protection over utility where necessary and establish objectives to balance risk thresholds and business requirements.
- Prepare Your Data: Identify necessary data attributes and trends that need to be preserved. Apply data minimisation, pseudonymisation, and noise addition to relevant attributes to reduce re-identification risks.
- Generate Synthetic Data: Choose an appropriate SD generation method based on use cases, data objectives, and types of data. Conduct checks on data integrity, fidelity, and utility post-generation.
- Assess Re-identification Risks: Evaluate re-identification risks through an attack-based assessment to gauge if an adversary can re-identify individuals from the source data.
- Manage Residual Risks: Identify and document residual risks, implementing appropriate mitigation controls encompassing technical, governance, and contractual measures.
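As a rough illustration of the risk-assessment step, the Python sketch below measures how many synthetic records share a quasi-identifier combination with exactly one source record, which is one simple proxy for linkage risk. The Guide does not prescribe this particular metric. The function name, the sample records, the choice of quasi-identifiers, and the 0.1 threshold are all hypothetical.

```python
from collections import Counter


def exact_match_rate(source, synthetic, quasi_identifiers):
    """Share of synthetic records whose quasi-identifier combination
    matches exactly one record in the source data."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    if not synthetic:
        return 0.0
    source_counts = Counter(key(r) for r in source)
    hits = sum(1 for r in synthetic if source_counts.get(key(r), 0) == 1)
    return hits / len(synthetic)


# Made-up records for illustration only.
source = [
    {"age_band": "30-39", "postal_sector": "52", "diagnosis": "A"},
    {"age_band": "30-39", "postal_sector": "52", "diagnosis": "B"},
    {"age_band": "60-69", "postal_sector": "10", "diagnosis": "C"},
]
synthetic = [
    {"age_band": "60-69", "postal_sector": "10", "diagnosis": "C"},
    {"age_band": "30-39", "postal_sector": "52", "diagnosis": "A"},
    {"age_band": "40-49", "postal_sector": "31", "diagnosis": "B"},
    {"age_band": "20-29", "postal_sector": "52", "diagnosis": "A"},
]

rate = exact_match_rate(source, synthetic, ["age_band", "postal_sector"])

# Hypothetical threshold; a real one would come from the objectives
# set in the "Know Your Data" step.
RISK_THRESHOLD = 0.1
needs_mitigation = rate > RISK_THRESHOLD
```

Here only the first synthetic record links to a unique source record, so the rate is 0.25 and the sketch flags the dataset for further mitigation. A real assessment would typically consider several attack models rather than a single match rate.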
Concluding Remarks
The Guide presents a comprehensive framework for using Synthetic Data while balancing utility against data protection risks, and it is positioned as an evolving document intended to stay relevant as standards advance. Organisations are advised to review their practices regularly against the latest recommendations so that their data protection measures remain robust.
The PDPC’s initiative in providing this Guide forms part of a broader effort to bolster data privacy, cybersecurity, and AI practices within Singapore, keeping organisations well-equipped to navigate the complexities of data protection.