Note: The following is an excerpt from Pete Warden’s free ebook “Where are the bodies buried on the web? Big data for journalists.”
There’s been a revolution in data over the last few years, driven by an astonishing drop in the price of gathering and analyzing massive amounts of information. It only cost me $120 to gather, analyze and visualize 220 million public Facebook profiles, and you can use 80legs to download a million web pages for just $2.20. Those are just two examples.
The technology is also getting easier to use. Companies like Extractiv and Needlebase are creating point-and-click tools for gathering data from almost any site on the web, and every other stage of the analysis process is getting radically simpler too.
What does this mean for journalists? You no longer have to be a technical specialist to find exciting, convincing and surprising data for your stories. For example, the following four services all easily reveal underlying data about web pages and domains.
Many of you will already be familiar with WHOIS, but it’s so useful for research it’s still worth pointing out. If you go to this site (or just type “whois www.example.com” in Terminal.app on a Mac) you can get the basic registration information for any website. In recent years, some owners have chosen “private” registration, which hides their details from view, but in many cases you’ll see a name, address, email and phone number for the person who registered the site.
You can also enter numerical IP addresses here and get data on the organization or individual that owns that server. This is especially handy when you’re trying to track down more information on an abusive or malicious user of a service, since most websites record an IP address for everyone who accesses them.
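For readers who would rather script these lookups, here is a minimal Python sketch of what a WHOIS client does: it opens a TCP connection to a WHOIS server on port 43, sends the name, and reads back plain text. The function names and the choice of whois.iana.org as the starting server are assumptions made for illustration. That server will often reply with a referral to the registry holding the full record, and output formats differ between registries, so treat the parser as a rough first pass rather than a complete client.

```python
import socket


def whois_query(name, server="whois.iana.org", port=43, timeout=10.0):
    """Send one WHOIS query and return the server's raw text reply.

    The default server is an assumption; pass the registry's own
    WHOIS server if you already know it.
    """
    with socket.create_connection((server, port), timeout=timeout) as conn:
        # The protocol is simply: send the name, end the line with CRLF.
        conn.sendall(name.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_whois(text):
    """Collect 'key: value' lines into a dict mapping each key to a list of values."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comment lines, and lines that are not key/value pairs.
        if not line or line.startswith(("%", "#")) or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if value:
            fields.setdefault(key.strip().lower(), []).append(value)
    return fields
```

Calling parse_whois(whois_query("example.com")) returns whatever fields the server chose to publish. With private registration, the contact fields will name the privacy service rather than the owner.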
Blekko is the newest search engine in town, and one of its selling points is the richness of the data it offers. If you type in a domain name followed by /seo, you’ll receive a page of statistics on that domain.
The first tab shows other sites that are linking to the current domain, in popularity order. This can be extremely useful when you’re trying to understand what coverage a site is receiving, or why it’s ranking highly in Google’s search results, since those rankings depend in part on inbound links. Inclusion of this information would have been an interesting addition to the recent DecorMyEyes story, for example.
The other handy tab is “Crawl stats,” especially the “Cohosted with” section.
This tells you which other websites are running from the same machine. It’s common for scammers and spammers to astroturf their way toward legitimacy by building multiple sites that review and link to each other. They look like independent domains, and may even have different registration details, but often they’ll actually live on the same server because that’s a lot cheaper. These statistics give you an insight into the hidden business structure of shady operators.
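If you already have a list of suspect domains, you can run a rough version of this check yourself. The sketch below resolves each domain to an IP address and groups the ones that share an address. It is not how Blekko builds its “Cohosted with” data, and the function names are hypothetical. It also looks only at the first IPv4 address returned for each name. A shared address is a lead worth following rather than proof, because unrelated sites on shared hosting or behind the same content delivery network will also match.

```python
import socket
from collections import defaultdict


def group_by_ip(domains, resolve=socket.gethostbyname):
    """Map each IP address to the sorted list of domains that resolve to it."""
    by_ip = defaultdict(list)
    for domain in domains:
        try:
            ip = resolve(domain)
        except OSError:
            continue  # the name could not be resolved, so leave it out
        by_ip[ip].append(domain)
    return {ip: sorted(names) for ip, names in by_ip.items()}


def cohosted(domains, resolve=socket.gethostbyname):
    """Keep only the addresses shared by two or more of the given domains."""
    groups = group_by_ip(domains, resolve)
    return {ip: names for ip, names in groups.items() if len(names) > 1}
```

The resolve parameter exists so that the lookup can be swapped out, for example to replay addresses you recorded earlier instead of hitting DNS again.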
I always turn to bit.ly when I want to know how people are sharing a particular link. To use it, enter the URL you’re interested in, then click on the ‘Info Page+’ link. That takes you to the full statistics page (though you may need to choose “aggregate bit.ly link” first if you’re signed in to the service).
This will give you an idea of how popular the page is, including activity on Facebook and Twitter. Below that you’ll see public conversations about the link, provided by backtype.com.
I find this combination of traffic data and conversations very helpful when I’m trying to understand why a site or page is popular, and who exactly its fans are. For example, it provided me with strong evidence that the prevailing narrative about grassroots sharing and Sarah Palin was wrong.
[Disclosure: O'Reilly AlphaTech Ventures is an investor in bit.ly.]
By surveying a cross-section of American consumers, Compete builds up detailed usage statistics for most websites, and they make some basic details freely available.
Choose the “Site Profile” tab and enter a domain. You’ll then see a graph of the site’s traffic over the last year, together with figures for how many people visited, and how often.
Since they’re based on surveys, Compete’s numbers are only approximate. Nonetheless, I’ve found them reasonably accurate when I’ve been able to compare them against internal analytics.
Compete’s stats are a good source when comparing two sites. While the absolute numbers may be off for both sites, Compete still offers a decent representation of the sites’ relative difference in popularity.
One caveat: Compete only surveys U.S. consumers, so the data will be poor for predominantly international sites.
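The point about comparing two sites rests on an assumption worth spelling out: if a survey misjudges both sites by roughly the same proportion, the error cancels out when you divide one figure by the other. The short Python sketch below illustrates this with invented numbers. The visitor counts, the 20% undercount, and the function name are all hypothetical and are not Compete data.

```python
def relative_popularity(visits_a, visits_b):
    """Return how many times more visits site A gets than site B."""
    if visits_b <= 0:
        raise ValueError("visits_b must be positive")
    return visits_a / visits_b


# Hypothetical monthly visitor counts, invented for illustration only.
true_a, true_b = 400_000, 100_000

# Assume the survey panel undercounts both sites by the same 20%.
bias = 0.8
est_a, est_b = true_a * bias, true_b * bias

true_ratio = relative_popularity(true_a, true_b)
estimated_ratio = relative_popularity(est_a, est_b)
```

Both ratios come out the same even though each estimate is 20% too low. The comparison breaks down when the bias differs between the two sites, which is exactly what happens when one of them has a mostly non-U.S. audience.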
Additional data resources and tools are discussed in Pete’s free ebook.