The Data Authenticity Protocol

We need a data authenticity protocol, stat!

On my journey to become more fluent with statistics and data analysis, I have been reading Sir David John Spiegelhalter’s book ‘The Art of Statistics.’ One of the take-homes from the book is that the source of a dataset is often the hardest thing to prove correct or authentic in origin.

It seems crazy to me that in a tech world with blockchain ledgers, PKI, Kerberos, and the like, we still have no way of proving the authenticity of a raw dataset.

I hope to tackle this by identifying the problem and providing a straw-man solution. This post will not produce a fully-fledged IETF RFC (yet), but it should pave the way for a more formal proposal at a later date (ETA: 2021).

The Problem

“When data is presented to an intended audience, there is no guarantee that the data is authentic in origin.” — Someone 2020

Why is that a big deal? Unfortunately, we live in a world of fake news and ill-informed social media hysteria. The impact of not being able to prove the correctness of a dataset is enormous. Consumers of the data are left skeptical and unsure of its origin, which does nothing to promote confidence. And after all, data is a commodity in the modern world, up there with oil, gas, and gold.

Could you imagine spending money at an online auction for sports memorabilia and not getting a certificate of authenticity with your purchase? No, you wouldn’t feel confident about that purchase at all.

The Straw-man Solution

Enter the Data Authenticity Protocol/Standard.

Future Pre-requisites:-

  • World Data Organisation.
  • A PKI-type protocol to sign and issue data authenticity certificates.

World Data Organisation — There is no such thing as a World Data Organisation at the time of writing, but there should be. We need an impartial entity to head up data issuance and distribution worldwide.

TBD data authenticity PKI — We have PKI for authenticating users and devices in the modern tech world. We would need a new protocol to maintain a chain of trust over a dataset and present that chain to the consumer.
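
To make the analogy concrete, here is a minimal sketch of the primitive such a protocol would build on: a data issuer signs the SHA-256 digest of a raw dataset, and anyone holding the issuer’s public key can verify it. This assumes Python’s ‘cryptography’ library, and the issuer key pair and dataset here are hypothetical stand-ins; a real DAP would anchor the key in a chain of trust, as covered in the flow below.

import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Hypothetical Data Issuer key pair; in a real DAP this key would be
# endorsed by the chain of trust rather than generated ad hoc.
issuer_key = Ed25519PrivateKey.generate()

dataset = b"region,year,value\nuk,2020,42\n"   # stand-in raw dataset
digest = hashlib.sha256(dataset).digest()      # the dataset checksum

# The core claim of a data authenticity certificate: a signature by
# the issuer over the dataset checksum.
signature = issuer_key.sign(digest)

# Consumer side: re-hash the dataset and verify the signature. verify()
# raises InvalidSignature if dataset or signature has been tampered with.
issuer_key.public_key().verify(signature, hashlib.sha256(dataset).digest())
print("dataset digest verified against the issuer's signature")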

[Figure: straw-man DAP diagram]

Terminology

Now for some terminology, using the straw-man diagram above:-

WDO (World Data Organisation) — This is a proposed future entity: a council or organisation of trusted specialists and experts in the field of data. The WDO will hold ‘smart’ contracts with its elected data issuers.

Data Issuer — An elected ‘manufacturer’ of data. Think of the likes of Google, Twitter, or Facebook. Data Issuers are also chosen intermediaries in the chain of data authenticity trust. A Data Issuer will be responsible for the delivery of data based on a quota contract, as well as for providing signed certificates and datasets to end consumers.

Data Merchant — This is the seller/reseller of data. A Data Merchant enters an agreement to buy data from Data Issuers and sell it on to consumers. The Data Merchant also provides the interfaces through which a consumer receives a dataset and a signed certificate of data authenticity; a hypothetical example of such a certificate follows.
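
As an illustration, the certificate of data authenticity handed over by a Data Merchant might carry fields like those below. This is only a sketch; every field name and value is a hypothetical placeholder, not part of any agreed standard.

import json

# Hypothetical shape of a certificate of data authenticity, serialised
# as JSON by the merchant alongside the raw dataset. All field names
# and values are illustrative placeholders.
certificate = {
    "version": 1,
    "issuer": "example-data-issuer",        # elected by the WDO
    "merchant": "example-data-merchant",    # reseller delivering the data
    "dataset_sha256": "<sha256-hex-of-raw-dataset>",
    "issued": "2020-01-01T00:00:00Z",
    "expires": "2023-01-01T00:00:00Z",      # would tie in with DCSR renewal
    "signature": "<issuer-signature-over-the-fields-above>",
}

print(json.dumps(certificate, indent=2))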

The Data Authenticity Flow

Using the same straw-man diagram for guidance, here is the general proposed flow for data authenticity (a toy code sketch follows the list):-

  1. The WDO maintains and publishes a ‘root’ Data Authority certificate.
  2. The WDO is responsible for electing ‘trusted’ data issuers.
  3. The WDO and Data Issuer enter a dataset manufacturer tenure negotiation, facilitated by smart contracts that only complete once the data issuer has issued all of the required datasets.
  4. The data issuer is a trusted ‘intermediary’ in the Data Authenticity chain. The Data Issuer is also responsible for creating DCSRs (Data Certificate Signing Requests) every n years.
  5. The Data Issuer and Data Merchants enter a dataset supply agreement. This differs from the manufacturer agreement in that the supply agreement must honor full, untampered access to the requested dataset. The data issuer must facilitate the signing and distribution of a certificate of data authenticity via a merchant to the end consumer.
  6. The end consumer has a raw dataset with a checksum and a data authenticity certificate.
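
To ground steps 1 to 6, here is a toy end-to-end sketch of the chain of trust, again assuming Python’s ‘cryptography’ library, with Ed25519 keys standing in for the WDO root and a Data Issuer. Real data authority certificates would be structured documents (X.509-like) rather than bare signatures over public-key bytes, and all names here are hypothetical.

import hashlib
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)

def raw_public_bytes(private_key):
    # Raw bytes of the public half of an Ed25519 key pair.
    return private_key.public_key().public_bytes(
        encoding=serialization.Encoding.Raw,
        format=serialization.PublicFormat.Raw,
    )

# Steps 1-2: the WDO holds a root key and elects a trusted Data Issuer.
wdo_root_key = Ed25519PrivateKey.generate()
issuer_key = Ed25519PrivateKey.generate()

# Steps 3-4: a stand-in for the DCSR step. The WDO signs the issuer's
# public key, endorsing it as a trusted intermediary.
issuer_public = raw_public_bytes(issuer_key)
issuer_endorsement = wdo_root_key.sign(issuer_public)

# Step 5: the issuer signs the checksum of a dataset supplied to a merchant.
dataset = b"region,year,value\nuk,2020,42\n"   # stand-in raw dataset
checksum = hashlib.sha256(dataset).digest()
dataset_signature = issuer_key.sign(checksum)

# Step 6: the consumer holds the raw dataset, its checksum, the dataset
# signature, the issuer's public key plus endorsement, and the WDO root
# public key (distributed out of band, like browser root CA certificates).
# Each verify() call raises InvalidSignature if the chain is broken.
wdo_root_key.public_key().verify(issuer_endorsement, issuer_public)
Ed25519PublicKey.from_public_bytes(issuer_public).verify(dataset_signature, checksum)
assert hashlib.sha256(dataset).digest() == checksum
print("chain of trust verified: WDO -> Data Issuer -> dataset")

Rotating or revoking an issuer key would simply mean the WDO publishing a new endorsement, which is where the every-n-years DCSR cadence in step 4 would come in.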

The Pros

These are the key points that this proposal helps with:-

  • The raw dataset can, with 100% certainty, be traced to the data issuer: a trusted issuer of data, elected and chosen by the WDO.
  • A consumer/presenter can show with confidence that the source data is authentic, ergo giving the audience confidence.
  • The distribution of data is open and honest.

The Cons

Of course, this early straw-man proposal has some huge, glaring issues:-

  • This proposal does not include a way to prove that the data has not become skewed or incorrect during ETL or data-cleansing processes due to bugs or errors in those processes.
  • Any visualizations or visual presentations may also have bugs/errors that incorrectly display the origin dataset.
  • The consumer can provide fake data to their audience/customer and still claim authenticity on the presented data via their acquired data authenticity certificate.

Next Steps

Defining a suitable structure for a World Data Organisation is an essential next step, as is designing a data authenticity PKI protocol to sign data authenticity certificates for consumers.

I will also be thinking about solutions for proving data authenticity all the way through to presentation.