Data is not binary

Why open data requires credibility and transparency.

Guest blogger Gavin Starks is founder and CEO of AMEE, a neutral aggregation platform designed to measure and track all the energy data in the world.

The World Bank has stated that “data in document format is effectively useless”.

However, “open data” is only the beginning of a journey. Simply applying the rules of open source as applied to software may help us take the first steps, but there are new categories of challenges to face.

Data needs to be computable (i.e., acted upon in context)

“Data” is a much broader term than “code.” The term embodies a range of dimensions: there are more than just the numbers at play, especially with scientific data.

  • How was the data collected?
  • How should the data be used?
  • Are the models for processing the data valid?
  • What assumptions exist, in words and equations?
  • What is the significance of the assumptions?

In an age when peer review is an anachronism, we are searching for new solutions for “scientific content management”. When Pascal’s Wager is evoked, it is equally important to remember Godel’s incompleteness theorems (in complex enough systems, logic can be used to prove anything, including untrue statements).

Only eight percent of members of the Scientific Research Society agreed that “peer review works well as it is” (Chubin and Hackett, 1990; p.192). Peer review has also been claimed to be “a non-validated charade whose processes generate results little better than does chance.” But in the same context: “Peer review is central to the organization of modern science … why not apply scientific [and engineering] methods to the peer review process” (Horrobin, 2001). The absence of URLs on those two pieces of research is indicative of one of the problems we are trying to solve.

Peer review survives in its current form for historical reasons, but it now occupies a niche, because technology has opened up usage to a mass audience.

We must build tools that enable credible engagement

To illustrate our story: we are engaged with the very pressing and complex issue of climate change. At AMEE we codify international, government, and proprietary data, models and methodologies that represent, at the most fundamental level, the algorithms that enable the energy, carbon and environmental cost of consumption and activities to be calculated. AMEE doesn’t just store and re-broadcast data, it performs the calculations based on inputs to the models.

One of our challenges is getting at the raw data in a useful, repeatable, and traceable form. As a result, one of the core services we offer to data and standards managers is a set of tools that enable this.

Releasing raw data is vital. There can be no excuse not to. Releasing source code is optional. It’s truly great for open source review, but it’s also dangerous if everyone just re-runs the same code with the same baked-in implicit and explicit assumptions and errors.

This is where data and code deviate substantially. The logic cascade for the interpretation of data is not unary (there is no single interpretation). It is based on assumptions that may vary and are subject to many quantitative and qualitative inputs: the interpretation of the data is not even binary.

We believe it’s much better to publish the following five components to provide transparent and auditable disclosure:

  1. The raw data
  2. The circumstances of its collection
  3. The method and assumptions used to process the data (in words and equations)
  4. The results of the processing
  5. The known limitations on the method and significance of the assumptions
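As a rough sketch of how those five components could travel together in machine-readable form, the hypothetical Python record below bundles them into one object and reports any component left empty. The class name, field names and example values are illustrative assumptions, not AMEE's actual schema.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Disclosure:
    """Hypothetical bundle of the five components listed above."""
    raw_data: list            # 1. the raw data
    collection_notes: str     # 2. the circumstances of its collection
    method: str               # 3. method and assumptions, in words and equations
    results: dict             # 4. the results of the processing
    limitations: list = field(default_factory=list)  # 5. known limitations

    def missing_components(self):
        """Return the names of components that were left empty."""
        return [name for name, value in asdict(self).items() if not value]


# Made-up example values, for illustration only.
readings = [12.1, 11.8, 12.4]
record = Disclosure(
    raw_data=readings,
    collection_notes="Three monthly meter readings in kWh, read by hand.",
    method="total_kwh = sum(readings)",
    results={"total_kwh": round(sum(readings), 1)},
)

# Serialise the whole disclosure so that it can be published as one unit.
print(json.dumps(asdict(record), indent=2))

# The limitations were never filled in, so the record flags them as missing.
print(record.missing_components())
```

A publisher could refuse to release a record until `missing_components()` comes back empty, which keeps the data, its context and its caveats from being separated.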

The processing code should be written from scratch as many times as possible to reduce the chance that it affected the results in any way.
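One way to picture this: write the same documented method twice, independently, and check that the results agree on the published raw data. The sketch below uses a simple mean as a stand-in for a real processing method; the function names and the tolerance are assumptions chosen for illustration.

```python
import math


def mean_v1(values):
    """First implementation: total divided by count."""
    return sum(values) / len(values)


def mean_v2(values):
    """Second implementation: incremental (running) mean."""
    mean = 0.0
    for count, value in enumerate(values, start=1):
        mean += (value - mean) / count
    return mean


def implementations_agree(implementations, data, rel_tol=1e-9):
    """Run every implementation on the same data and compare the results.

    Returns (agree, results), where agree is True when every result is
    within rel_tol of the first one.
    """
    results = [impl(data) for impl in implementations]
    agree = all(
        math.isclose(results[0], other, rel_tol=rel_tol)
        for other in results[1:]
    )
    return agree, results


agree, results = implementations_agree([mean_v1, mean_v2], [2.0, 4.0, 9.0])
print(agree, results)
```

Agreement between implementations only shows that they are consistent with each other. It does not show that the method itself, or its assumptions, are right.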

Once “published,” the challenge is how to build out a credible, and usable, set of services that encourage correct usage.

Building the solution stack

At AMEE we have developed a six-tier solution to try and address some of these issues. Specifically, we address the gap between content creators/managers (e.g. standards bodies) and content users (e.g. software apps, consultants, auditors), with a solution that is both human and machine-readable.

1. Aggregation — We aggregate the raw data, and track and log the sources. We have a standards spider that checks for changes, not unlike a search engine spider.

2. Content Enhancement — In the process of aggregation, we document the data, and embed provenance, linking back to the source. We also add authority, a measure of the reliability and credibility of the source. We’re beginning to add other taxonomies and semantic links that enable the data to be joined, and are building tools for engagement with the platform to stimulate discussion.

3. Discoverability — AMEE Explorer is the human-readable version of the data, and the only search engine on carbon calculation models (N.B.: we are focused on the industrial and human impacts at the moment, not modeling the climate itself).

4. Repeatable Quality — We have a quality-control process around the underlying data that is similar to a Six Sigma process. Our systems self-test the data every 30 minutes, and human checks are carried out at random intervals to ensure systemic errors have not been introduced. Our target accuracy metric is 100 percent, not five-nines.

5. Computable Engine — We believe we are taking the notion of a master database service to an entirely new level by ensuring that not only the data is robust, but AMEE performs the actual calculations. AMEE retains an audit history behind both the inputs and the calculations themselves.

6. Interoperability and auditability — The AMEE API is the machine-readable version of the data (in fact all of the content including meta data and documentation), which enables the calculations to be done. AMEE also stores the audit-history of both the inputs and the calculation mechanics. For example: PUT a (flight in an F-15 from London to New York at combat thrust), and GET the kgCO2 for that journey, or PUT (1000kWh reported by my Whirlpool fridge for this month, in Washington, using my preferred energy supplier and my solar panels) and GET the kgCO2.
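To make the last two tiers concrete, here is a minimal local sketch of a calculation that keeps an audit history of its inputs and of the mechanics used. It is not the AMEE API: the function, the supplier name and the emission factor are hypothetical, and the factor is a placeholder number rather than a real figure.

```python
from datetime import datetime, timezone

# Placeholder emission factors in kgCO2 per kWh.
# These are made-up values for illustration, not real data.
EXAMPLE_FACTORS = {"example_supplier": 0.5}

# Every calculation appends one entry here, so results can be traced later.
audit_log = []


def calculate_kg_co2(kwh, supplier):
    """Convert an energy reading to kgCO2 and record how it was done."""
    factor = EXAMPLE_FACTORS[supplier]
    kg_co2 = kwh * factor
    audit_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "inputs": {"kwh": kwh, "supplier": supplier},
        "method": "kg_co2 = kwh * factor",
        "factor_kg_co2_per_kwh": factor,
        "result_kg_co2": kg_co2,
    })
    return kg_co2


# "PUT" 1000 kWh for the month, then "GET" the kgCO2 for it.
result = calculate_kg_co2(1000, "example_supplier")
print(result)
print(audit_log[-1]["method"], audit_log[-1]["inputs"])
```

Because the log stores the inputs, the factor and the equation alongside the result, an auditor could recompute any past figure without trusting the code that produced it.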

Challenges

AMEE is positioned right at the junction between cloud, code, API, content, data, and the usage of the data, and as carbon becomes priced, we believe the consequences of getting it wrong are extremely high.

From an “open” standpoint, one of the big challenges we face includes defining where the boundaries of “open” lie. Our value, of course, is in the ongoing maintenance and reliability of the system, and connecting the data.

Commercially, we are treading very carefully through the platform and use-case stack (core platform, API, data, algorithms, code, structure, etc), and increasing transparency at the most relevant points for the end-user (who needs to feel confident about their own inputs and outputs). It’s a complex stack, and no open source or creative commons licenses wholly cover the kinds of issues we face.

Our field, carbon footprinting, is what we call a “non-trivial” example of where open data meets the markets: billions of dollars are flowing through or around these data on the carbon markets. For example, thousands of businesses in the UK have to start reporting their carbon footprint to the government this year, and paying for it next year. Very, very few people understand how to use this data, how it all joins together, where the trap doors are, and why it’s important to build an industry-stack to solve the problem.

If we don’t build a credible industry stack, from the ground up, the outcome could be no industry at all (or a tiny one), and that has dire consequences not only for the vendors and businesses in the space (such as SAP, SAS, CA, Microsoft, Google, and others), but also removes our ability to accelerate solving the underlying issue of carbon and climate change itself. The root cause of this credibility gap has been a lack of transparency, and no one has comprehensively joined the dots to see what is real, and what is not.

We also believe this kind of approach has huge value in many areas beyond the ones AMEE is addressing.

Open data isn’t just about re-broadcasting data, but combining it, re-using it and building upon it. It’s about creating new uses, creating new markets and building credibility into the data as it flows.

  • John

    I believe you have the sense of Godel’s theorem backward. You said that “In complex enough systems, logic can be used to prove anything, including untrue statements.” It would be more accurate to say “In complex enough systems, logic can NOT be used to prove EVERYTHING, including TRUE statements.” That is, there are true statements that cannot be proven true.

  • Nathan

    I disagree with your assertion that releasing code is optional. Being able to observe and analyze the implementation of the method of processing the data is essential to the notion of full disclosure. Even presuming multiple implementations are feasible, I am only able to black-box compare the multiple implementations against one another, which, even if they are deemed consistent, merely reduces the likelihood that the implementations are inconsistent with the method.

  • Ken Williams

    Yeah, you’ve got Gödel wrong. You can never prove untrue statements unless your axioms are inconsistent.

  • Mr. Gunn

    There’s a difference between data and knowledge. Data can certainly be binary. Most of the data in the world exists in the form of ones and zeros, and you can’t get more binary than that.

    However, context and understanding derived from data does require the things you mentioned above, and a good list it is. For what it’s worth, good biological science reports already do include this information.

  • Gavin Starks (http://www.amee.com)

    @Nathan – the point wasn’t to suggest that releasing code wasn’t desirable (it is highly desirable!), but rather to highlight that blind copying of analytics code could introduce systemic errors.

    @Gunn – the difference between data and knowledge gets harder to distinguish the more you think about it. We recently spoke to an academic who’s studying marine life in Antarctica in the context of climate change and he didn’t want to release the raw data because he put a lot of work into collecting it – and didn’t want to see someone else publish his results first. He also said he thought that almost all of his colleagues in the field would feel the same way. We know that’s true in other disciplines too from our direct experience. The point is that everyone should commit to publishing the raw data in 6 or 12 months – but give them the time to publish papers on it first.

    @John, @Ken
    I’m tempted to borrow from Arthur C. Clarke here and say “any sufficiently advanced mathematics is indistinguishable from magic”. The Godel theorem is used as an analogy here and isn’t core to the points I’m making. In this article I’m talking about how human minds and communication (in the form of creating, sharing and processing datasets) can lead us to prove false statements.

    The theorem is stated and proved in formal mathematical systems (formal having a very precise meaning here). Godel’s (first) incompleteness theorem asserts that it is possible to make a true statement in a system that cannot be proved within that system (its truth can only be seen outside the system). A system that permits such true but unprovable statements is an incomplete one. In order to make the system complete you need to add axioms (essentially statements that you decide to be true in that system, not requiring proof). If you manage to add axioms so that you avoid incompleteness, then – oops! – Godel also proved that you must have an inconsistent system. That is, a system is either incomplete or it is inconsistent. An inconsistent system allows you to prove statements that are false. For example, an inconsistent system will allow you to prove a theorem that states it is a consistent system!

    @Ken – Godel said that as well as the statement about incompleteness. It’s easily the case that a) you don’t know that your axioms are inconsistent and b) you can’t tell your statement is untrue without embedding in a larger system (that will also have inconsistent axioms).