Generative AI has drawn attention mainly for its ability to produce text and images. But those media are only a fraction of the data our society generates. Every interaction with a system, whether a patient moving through a medical process, a storm disrupting a flight, or a user working in a software application, generates data.
Using generative AI to produce realistic synthetic data for these settings can help organizations treat patients more effectively, reroute flights, or improve software platforms, particularly when real-world data is limited or sensitive.
Over the past three years, DataCebo, an MIT spinout, has been pioneering the development of a generative software system called the Synthetic Data Vault (SDV). SDV aids organizations in generating synthetic data to facilitate tasks such as software application testing and machine learning model training.
In 2016, Principal Research Scientist Kalyan Veeramachaneni’s group at the Data to AI Lab unveiled SDV as a suite of open-source generative AI tools for creating synthetic data that matches the statistical properties of real-world data. Companies can use this synthetic data in place of sensitive information while preserving the statistical relationships between data points. They can also run new software through simulations on synthetic data to see how it performs before public release.
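To make "matching the statistical properties" concrete, here is a minimal sketch in plain Python. It is not SDV's API or its modeling approach: it swaps in a simple two-column Gaussian model, and the column meanings (storm severity, flight delay) and all function names are hypothetical, chosen only for illustration. The idea is to learn the means, spreads, and correlation of the real columns, then sample new rows that share them without copying any real row.

```python
import math
import random
import statistics


def fit_gaussian(rows):
    """Estimate means, standard deviations, and correlation of two numeric columns.

    Assumes at least two rows and that neither column is constant.
    """
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    # Clamp to [-1, 1] to guard against floating-point drift.
    corr = max(-1.0, min(1.0, cov / (std_x * std_y)))
    return {"mean": (mean_x, mean_y), "std": (std_x, std_y), "corr": corr}


def sample(model, n, rng):
    """Draw n synthetic rows from the fitted bivariate Gaussian."""
    mean_x, mean_y = model["mean"]
    std_x, std_y = model["std"]
    rho = model["corr"]
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + std_x * z1
        # Mixing z1 into y reproduces the learned correlation between columns.
        y = mean_y + std_y * (rho * z1 + math.sqrt(1 - rho * rho) * z2)
        rows.append((x, y))
    return rows


# Hypothetical "real" data: storm severity (0-10) and resulting delay in minutes.
rng = random.Random(0)
real = []
for _ in range(2000):
    severity = rng.uniform(0, 10)
    real.append((severity, 5 * severity + rng.gauss(0, 4)))

model = fit_gaussian(real)
synthetic = sample(model, 2000, rng)
print("real correlation:     ", round(model["corr"], 3))
print("synthetic correlation:", round(fit_gaussian(synthetic)["corr"], 3))
```

A Gaussian is far too simple for most enterprise data, which mixes categorical fields, multiple tables, and skewed distributions, but the fit-then-sample pattern is the same idea at small scale.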
DataCebo began in 2020 with the aim of expanding SDV’s capabilities for larger organizations. Since then, SDV has been applied in varied settings: a new flight simulator helps airlines plan for rare weather events, and synthesized medical records have been used to predict health outcomes for patients with cystic fibrosis.
Furthermore, SDV has been utilized in various competitions and research endeavors, such as Kaggle’s data science competition, where data scientists leveraged SDV to create synthetic data sets, thus avoiding the use of proprietary data.
Despite SDV’s versatility, DataCebo remains focused on software testing. With generative models built in SDV, organizations can quickly generate synthetic data covering diverse scenarios and edge cases, which speeds up testing.
Moreover, synthetic data offers inherent privacy advantages, especially in domains dealing with sensitive data subject to regulatory constraints. DataCebo’s efforts are geared towards advancing the field of synthetic enterprise data, particularly data generated from user behavior within large companies’ software applications.
To make SDV more useful and more credible, DataCebo has introduced tools such as the SDMetrics library, which assesses how realistic generated data is, and SDGym, which compares the performance of different models. These tools are meant to build trust in synthetic data and make model development and testing more transparent.
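One common way to put a number on "realism" for a single numeric column is to compare the distribution of real values against the distribution of synthetic values. The sketch below computes a two-sample Kolmogorov-Smirnov statistic and reports one minus that value as a similarity score. It is a generic statistical illustration under that assumption, not the SDMetrics API, and the function names are hypothetical.

```python
import bisect


def ks_statistic(real, synthetic):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    gap = 0.0
    # The maximum gap between two step functions occurs at an observed value,
    # so it is enough to check every value from either sample.
    for value in real_sorted + synth_sorted:
        cdf_real = bisect.bisect_right(real_sorted, value) / len(real_sorted)
        cdf_synth = bisect.bisect_right(synth_sorted, value) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def similarity_score(real, synthetic):
    """Higher is better: 1.0 means the two samples have the same empirical CDF."""
    return 1.0 - ks_statistic(real, synthetic)


# Hypothetical column values: delays in minutes from real and synthetic tables.
real_delays = [12, 15, 20, 22, 30, 45, 60]
synthetic_delays = [11, 16, 19, 25, 31, 44, 58]
print("similarity:", similarity_score(real_delays, synthetic_delays))
```

A score like this covers only one column at a time. Checking that relationships between columns survive, such as the correlation between storm severity and delay, requires additional pairwise or multi-column comparisons.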
As AI and data science tools spread across industries, DataCebo’s approach to synthetic data generation could transform enterprise operations. Veeramachaneni envisions a future in which synthetic data from generative models becomes integral to 90% of enterprise operations, paving the way for transparent and responsible data practices.