This Startup Wants to Build a “GitHub for Data”

A startup called Gretel wants to build a “GitHub for data” so developers can safely access sensitive data.

Often, developers don’t need full access to a bank of user data; they just need a portion or a sample to work with. In many cases, developers could make do with data that looks like real user data.

This so-called “synthetic data” is essentially artificial data that looks and works just like regular sensitive user data. Gretel uses machine learning to categorize the data (names, addresses and other customer identifiers, for example) and attach as many labels to it as possible. Once the data is labeled, access policies can be applied to it. Then the platform applies differential privacy, a technique used to anonymize vast amounts of data, so that it’s no longer tied to customer information.
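To make the label-then-policy idea concrete, here is a minimal sketch in Python. It is not Gretel's implementation: the regex patterns stand in for the machine-learning classifier the article mentions, and the names `LABELERS`, `POLICY`, `label_value` and `apply_policy` are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical pattern-based labelers. These are a simple stand-in for the
# machine-learning classifier described above, not a real product's rules.
LABELERS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?[\d\-\s()]{7,}$"),
}

# Hypothetical access policy: the action to take for each label.
POLICY = {"email": "redact", "phone": "redact"}


def label_value(value):
    """Return the first label whose pattern matches the value, or None."""
    for label, pattern in LABELERS.items():
        if pattern.match(str(value)):
            return label
    return None


def apply_policy(record):
    """Return a copy of the record with policy-covered values masked."""
    cleaned = {}
    for key, value in record.items():
        label = label_value(value)
        if label is not None and POLICY.get(label) == "redact":
            cleaned[key] = f"<{label}>"  # replace the value with its label
        else:
            cleaned[key] = value
    return cleaned


record = {"contact": "ana@example.com", "mobile": "+1 555-010-9999", "plan": "premium"}
print(apply_policy(record))
```

In this sketch the policy is attached to labels rather than to column names, so a sensitive value is masked wherever it appears in a record. A production system would need far more robust detection than two regular expressions.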

 

More Apps Should Use Differential Privacy

News app Tonic is different from most news apps because it uses differential privacy. More apps should do the same.

Before your eyes cross, a real-life example Cyphers gave me is the census. The government has a lot of aggregate data about its citizens—and it probably wants to share demographic information from that set without revealing anything about any one particular individual. Let’s say you live in a small census block with only one or two people. It wouldn’t take a genius to figure out personal information about you, given the right parameters. Differential privacy would be a way to summarize that data without putting any one individual at risk.
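The census example can be sketched with the Laplace mechanism, one common way to achieve differential privacy for counts. This is an illustration only, not how the Census Bureau or Tonic actually does it: the block names, the counts, and the privacy parameter `epsilon = 0.5` are made-up assumptions, and `laplace_noise` and `private_count` are hypothetical helper names.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng):
    """Release a noisy count under epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Made-up census blocks: one tiny, one large.
block_counts = {"block A": 2, "block B": 1450}
epsilon = 0.5  # assumed privacy budget; smaller means more noise

rng = random.Random()
for block, count in block_counts.items():
    print(block, round(private_count(count, epsilon, rng), 1))
```

The same amount of noise is added to both blocks. For the large block it barely changes the published figure, while for the two-person block it is large enough that the output no longer reveals whether any one individual is in the data.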