Navigating Inefficiencies in Third-Party Data Use
Incorporating third-party data into your business is always a headache. Issues of format, price, rights, and accessibility consistently introduce inefficiency, slow things down, and generate excess heat, so it’s not inappropriate to refer to them as frictions.
Third-party data is hugely valuable and implementation need not be onerous. Using geodata as an exemplar, I’m outlining these four frictions here so we can stare them in the face, note what needs fixing, and suggest how best to address them. The good news is that these issues are not insurmountable, and we seem to be moving in the right direction.
We are simply not good at playing with others when it comes to data
Russia’s railway gauge is different from Western Europe’s. At the border of the former Soviet states, the Russian gauge of 1.524m meets the European and American ‘Standard’ gauge of 1.435m. The reasons for this literal disconnect arise from discussions between the Tsar and his War Minister. When asked the most effective way to prevent Russia’s own rail lines being used against it in times of invasion, the Minister suggested a different gauge to prevent supply trains rolling through the border. The artifact of this decision remains visible today at all rail crossings between Poland and Belarus or Slovakia and Ukraine. The rail cars are jacked up at the border, new wheels are inserted underneath, and the cars are lowered again. It is about a 2-4 hour time burn for each crossing.
That delay, per head, per crossing, over 170 years, adds up to a heck of a lot of wasted resource. But to change it would entail changing the rail stock of the entire country and realigning about 225,000 km (140,000 mi) of track.
Talk about technical debt.
Data suffers from a similar disconnect. It really wasn’t until the advent of XML 15 years ago that we had an agreed (but not entirely satisfactory) mechanism for storing arbitrary data structures outside the application layer. This is as much a commentary on our technical priorities as it is a social indictment. We are simply not good at playing with others when it comes to data.
Location coordinate data lacks important context.
Coordinate pairs are regular and orderly, but they are entirely ambiguous when used to represent more conceptual places like states, cities, stores and neighborhoods.
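To make that ambiguity concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the place names, the bounding boxes, and the helper places_at are invented for illustration and are not drawn from any real dataset or library. It shows only that a single latitude/longitude pair can fall inside a state, a city, a neighborhood, and a store all at once, so the coordinate alone does not say which place was meant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    kind: str
    # Bounding box as (min_lat, min_lon, max_lat, max_lon).
    # All values are made up for illustration.
    bbox: tuple

    def contains(self, lat: float, lon: float) -> bool:
        min_lat, min_lon, max_lat, max_lon = self.bbox
        return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


# Hypothetical nested places, ordered from largest to smallest.
PLACES = [
    Place("Example State", "state", (32.0, -125.0, 42.0, -114.0)),
    Place("Example City", "city", (33.7, -118.7, 34.4, -118.1)),
    Place("Example Neighborhood", "neighborhood", (34.00, -118.50, 34.05, -118.40)),
    Place("Example Store", "store", (34.0190, -118.4920, 34.0200, -118.4910)),
]


def places_at(lat: float, lon: float) -> list:
    """Return every known place whose bounding box contains the point."""
    return [p for p in PLACES if p.contains(lat, lon)]


# One coordinate pair, several equally valid answers.
matches = places_at(34.0195, -118.4915)
kinds = [p.kind for p in matches]
print(kinds)
```

The point matches all four places, which is the missing context in a nutshell: without an accompanying identifier or place type, a consumer of the data has to guess which level was intended.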
Big data as a discipline or a conference topic is still in its formative years.
Big data is a massive opportunity, but the language used to describe it ("goldrush," "data deluge," "firehose," etc.) reveals we're still searching for its identity.