What is synthetic data?

A working definition from MIT Sloan

synthetic data (noun)

Information created by an algorithm that can be used as a stand-in for real data.

Companies committed to data-driven decisions share common concerns about privacy, data integrity, and a lack of sufficient data.

Synthetic data is one promising solution. A synthetic data set has the same mathematical properties as the real-world data it’s standing in for, but doesn’t contain any of the same information.

Synthetic data can be generated by taking a relational database, training a machine learning model on it, and using that model to generate a second set of data. The result can be used to test machine learning models or to build and test software applications without compromising real, personal data.
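The fit-then-generate idea can be illustrated with a toy sketch. It assumes a single two-column numeric table and a simple bivariate Gaussian as the "model"; the table values, the function names `fit_model` and `generate`, and the choice of model are all invented for illustration and are not the method used by any particular tool.

```python
import math
import random
import statistics

# Invented "real" table for illustration only: (age, annual income).
real = [
    (23, 31000), (35, 52000), (41, 48000), (29, 39000),
    (52, 71000), (46, 65000), (33, 45000), (58, 69000),
]

def fit_model(rows):
    """Summarize a two-column table by its means, spreads, and correlation."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / len(rows)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def generate(model, n, seed=0):
    """Draw n new rows from the fitted bivariate Gaussian (assumed model)."""
    rng = random.Random(seed)
    rho = model["rho"]
    tail = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        # Mixing z1 into y reproduces the fitted correlation between columns.
        y = model["my"] + model["sy"] * (rho * z1 + tail * z2)
        rows.append((x, y))
    return rows

model = fit_model(real)
synthetic = generate(model, 2000)
refit = fit_model(synthetic)  # re-measure the same properties on the new data
print(f"correlation, real: {model['rho']:.2f}  synthetic: {refit['rho']:.2f}")
```

The generated rows are drawn from the fitted model, not copied from the original table, yet their means, spreads, and correlation should land close to the originals. Production tools handle multiple linked tables and mixed column types, which this sketch does not attempt.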

Besides protecting privacy, synthetic data can remove speed bumps and bottlenecks that slow down data work, according to Kalyan Veeramachaneni, a principal research scientist with MIT’s Schwarzman College of Computing. He and his research team developed the Synthetic Data Vault, an open-source software tool for creating and using synthetic data sets. The researchers found “no significant difference” between predictive models built on synthetic data and those built on the real thing.

MIT Sloan's Working Definitions explore the words and phrases behind emerging management ideas.
