Beyond Real Data: Synthetic Data through the Lens of Regularization
Summary
An Apple Machine Learning Research paper proposing a learning theory framework to quantify the optimal ratio of synthetic data to real data when real data is scarce.
Key Points
- Synthetic data can improve generalization performance when real data is sparse, but over-reliance can lead to performance degradation due to distributional mismatch.
- Derives generalization error bounds using algorithmic stability and suggests an optimal synthetic-to-real data ratio based on Wasserstein distance.
- Test error traces a U-shaped curve as a function of the synthetic-data ratio, so an intermediate, specific ratio is optimal.
- Verified theoretical predictions empirically on CIFAR-10 and clinical brain MRI datasets.
- Extendable to domain adaptation scenarios, where mixing synthetic target data with limited source data helps mitigate domain shift.
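The U-shaped trade-off above can be illustrated with a toy experiment (my own minimal sketch, not the paper's setup): fit a least-squares line on a few real samples mixed with increasingly many synthetic samples drawn from a slightly mismatched generator, and record test error at each synthetic-to-real ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, slope, noise=1.0):
    # Toy 1-D linear task: y = slope * x + Gaussian noise
    x = rng.normal(size=n)
    y = slope * x + rng.normal(scale=noise, size=n)
    return x, y

n_real = 10
x_real, y_real = sample(n_real, slope=2.0)   # scarce real training data
x_test, y_test = sample(5000, slope=2.0)     # held-out real test set

ratios = [0, 1, 2, 5, 10, 50, 200]           # synthetic-to-real ratios to sweep
errors = []
for r in ratios:
    # Abundant synthetic data from a deliberately mismatched generator
    x_syn, y_syn = sample(n_real * r, slope=2.6)
    x = np.concatenate([x_real, x_syn])
    y = np.concatenate([y_real, y_syn])
    A = np.stack([x, np.ones_like(x)], axis=1)   # design matrix with intercept
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    mse = float(np.mean((w[0] * x_test + w[1] - y_test) ** 2))
    errors.append(mse)
```

With a small real sample, a moderate amount of synthetic data can reduce variance and lower test error, while very large ratios let the generator's bias dominate, which is the mechanism behind the U-shape; the exact curve depends on the noise level and degree of mismatch.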
Notable Quotes & Details
- Verification datasets: CIFAR-10, clinical brain MRI dataset
- Key metric: Wasserstein distance (distance between the real and synthetic data distributions)
- Authors: Amitis Shidani†, Tyler Farghly†, Yang Sun‡, Habib Ganjgahi†‡, George Deligiannidis†
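The Wasserstein distance can be estimated directly from samples; for 1-D empirical distributions of equal size, the Wasserstein-1 distance reduces to the mean gap between sorted samples. A minimal numpy sketch on toy Gaussian data (my own illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=5000)       # stand-in for the real data distribution
synthetic = rng.normal(0.5, 1.0, size=5000)  # synthetic generator with a mean shift

# Empirical 1-D Wasserstein-1 distance: average gap between sorted samples
# (equivalent to scipy.stats.wasserstein_distance for equal-size samples)
w1 = float(np.mean(np.abs(np.sort(real) - np.sort(synthetic))))
```

For two Gaussians that differ only in mean, the true W1 distance equals the mean shift (0.5 here), so the empirical estimate should land near that value.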
Intended Audience
AI/ML researchers and engineers studying data augmentation and synthetic data utilization