Why You Need Synthetic Data

We do Data Strategy Consulting.

We've created comprehensive, cost-effective data strategies for global brands and local non-profits. We can help you to leverage the latest tools and technologies to fuel growth.

Synthetic data can solve two major problems almost everyone has: data isn't specific enough to be actionable, and data isn't recent enough to be relevant.

The Twin Problems: Granularity and Recency

Everyone wants data to make decisions. But all too often, the data you can get isn't detailed enough to be useful. We call this problem 'granularity'; data doesn't have enough of it to provide unique insights that can guide decision making.

A common solution to the granularity problem is to collect more data: get more customer detail, conduct surveys, add fields to user sign-up forms, and so on. However, this is typically expensive. In our experience, the cost of data acquisition scales linearly with the amount of data you'd like to collect.

This means that if you need a lot of data, you'll have to pay a lot of money in order to collect it. Once you reach a certain point, the costs of additional data outweigh the benefit.

Another factor is how rapidly data needs to be updated. If you have a seasonal business that sells most of its product in the summer, your data needs in June are going to be very different from those in January. This problem, called 'recency', is a compounding factor; the more often you need to update your data, the more expensive it will be.

The recency and granularity problems together tend to create a lot of frustration for organizations invested in data-driven decision making; they can see the theoretical value, but gaps make it difficult to take action.

What is Synthetic Data?

You've probably heard a lot about 'machine learning' and 'predictive modeling'. These are fancy terms for the processes that generate synthetic data. Synthetic data is any data that's generated by an algorithm, or model, to estimate the output for a given set of inputs.

Put another way, synthetic data is an educated guess. If we know that customers of a certain age buy a product, we can make an inference that any new customer who matches that age has a higher probability of making a purchase. That inference is synthetic data.
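The age example above can be sketched in a few lines of Python. This is a minimal illustration, not a production model: the function names, the ten-year age bands, and the sample records are all hypothetical and chosen only to show the idea of turning observed purchases into an estimate for a new customer.

```python
from collections import defaultdict

def age_band(age, width=10):
    """Bucket an age into a band, e.g. 30 for ages 30-39 (band width is an assumption)."""
    return (age // width) * width

def fit_purchase_rates(observed):
    """Compute the share of buyers in each age band from (age, purchased) pairs."""
    counts = defaultdict(lambda: [0, 0])  # band -> [buyers, total customers]
    for age, purchased in observed:
        band = age_band(age)
        counts[band][0] += int(purchased)
        counts[band][1] += 1
    return {band: buyers / total for band, (buyers, total) in counts.items()}

def estimate(rates, age, default=None):
    """Synthetic value: the estimated purchase probability for a new customer."""
    return rates.get(age_band(age), default)

# Hypothetical observed customers: (age, made a purchase?)
observed = [
    (34, True), (36, True), (38, False), (39, True),
    (52, False), (55, False), (58, True), (51, False),
]

rates = fit_purchase_rates(observed)
new_customer_estimate = estimate(rates, 31)  # a 31-year-old we have no purchase data for
```

The estimate for the new customer is the synthetic data point: it was never collected, only inferred from customers in the same age band. Real models use many more inputs than age, but the principle is the same.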

[Figure: Before Synthetic Data / After Synthetic Data]

Though we've always been able to make educated guesses, recent advances in computational capacity and software algorithms mean that these guesses are now almost as accurate as data you collect directly. This has changed the economics of data: the more you generate, the lower the unit cost. Because it's so accurate, the cost/benefit ratio changes.
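One simple way to picture the falling unit cost is to compare it with collected data, whose cost scales linearly with volume. The sketch below assumes a one-time cost to build a model plus a small compute cost per generated record. That cost structure and every dollar figure in it are hypothetical, used only to illustrate the shape of the two curves.

```python
def collected_unit_cost(n_records, cost_per_record):
    """Collected data: total cost grows linearly, so unit cost stays flat."""
    total = cost_per_record * n_records
    return total / n_records

def synthetic_unit_cost(n_records, model_cost, compute_cost_per_record):
    """Synthetic data (assumed structure): fixed model cost spread over all records."""
    total = model_cost + compute_cost_per_record * n_records
    return total / n_records

# Hypothetical figures, for illustration only
COST_PER_COLLECTED_RECORD = 2.00   # e.g. a survey response
MODEL_COST = 10_000.00             # one-time cost to build the model
COMPUTE_COST_PER_RECORD = 0.01     # cost to generate one estimate

for n in (1_000, 100_000, 1_000_000):
    collected = collected_unit_cost(n, COST_PER_COLLECTED_RECORD)
    synthetic = synthetic_unit_cost(n, MODEL_COST, COMPUTE_COST_PER_RECORD)
    print(f"{n:>9,} records: collected ${collected:.2f}/record, synthetic ${synthetic:.2f}/record")
```

Under these assumptions, synthetic data costs more per record at small volumes and far less at large volumes, while the per-record cost of collected data does not change.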

The cost of not having synthetic data is now much greater than the cost of acquiring it.

Historically, generating highly accurate synthetic data has required custom software developed by PhDs. In the last two years, the technology has improved and lowered in cost to the point that most organizations can afford to invest a modest amount in synthetic data and see an immediate return.

So what can you do with synthetic data?

The Opportunity

Synthetic data means you can finally see the detail that's been missing. By addressing the recency and granularity problems, and 'filling in the gaps', synthetic data can help you make better decisions. Here are some specific examples.

Improving access to care

One of the biggest challenges in health care is getting people who are likely to develop diseases into preventative care programs. Available population health statistics are usually limited to county-level data, while effective outreach requires much more specific neighborhood-level targeting.

Synthetic data can help health care providers identify 'hotspots' where the population is more at risk for a particular disease, and improve messaging and outreach to the community. The 500 Cities data set from the CDC is a great example of a highly successful synthetic data set - for the first time, it shows tract-level detail of health behavior risk factors for the 500 most populous cities across the country. It is being used by large hospital systems and health departments to plan interventions targeting specific neighborhoods.

Increasing sales

Understanding why and when customers buy is a critical part of growing a business. Synthetic data gives eCommerce companies a lot more information to guide ad spend and direct marketing investments.

For example, suppose an apparel business has strong Q2 sales to top-quintile income earners in the Bay Area. Synthetic data can show where else in the country is likely to deliver equivalent sales in Q3. Synthetic data can also predict in real time how weather conditions can affect sales.

Increasing voter turnout

Many, and sometimes most, registered voters don't vote. Increasingly, older voters turn out at a dramatically higher rate than younger or minority voters, giving them more 'clout' in elections.

Improving voter turnout among minority and under-represented populations requires specific targeting of neighborhoods and populations where voting is likely to be low. Synthetic data can show at a household level how likely a particular person is to vote.

Exploring the Options

We have deep experience generating data and building decision support tools that help people make informed decisions. We can help you understand the opportunities and options for synthetic data in your organization.

Send us an inquiry.