Here are a few of the data stories that caught my eye this week.
Aaron Swartz and the politics of violating a database’s TOS
Aaron Swartz, best known as an early Reddit-er and the founder of the progressive non-profit Demand Progress, was charged on Tuesday with multiple felony counts for the illegal download of some 4 million academic journal articles from JSTOR via the MIT network.
The indictment against Swartz (a full copy is here) details the steps he took to procure a laptop and register it on the MIT network, all in the name of securing access to JSTOR, an online database of academic journals that provides full-text search and article access to patrons of subscribing academic institutions and public libraries.
Swartz accessed the JSTOR database via MIT and proceeded to devise a mechanism to download a massive number of documents. It isn't clear what his intentions were for these documents; Swartz has previously been involved with open data efforts. Was he planning to liberate the JSTOR database? Or, as others have suggested, was he in the middle of an academic project that required a massive dataset?
The government has made it clear this is “stealing.” JSTOR, the library, and the university are less willing to comment or condemn.
Kevin Webb asks an important question in a post reprinted by Reuters. What’s the difference between what Swartz did and what Google does?
What’s missing from the news articles about Swartz’s arrest is a realization that the methods of collection and analysis he’s used are exactly what makes companies like Google valuable to its shareholders and its users. The difference is that Google can throw the weight of its name behind its scrapers …
Although Swartz did allegedly download data from JSTOR in quantities that violated its terms of service, many questions remain: Why does this constitute stealing? How much data does one need to take to risk accusations of theft and fraud? These are very real questions for data scientists, not just for activists.
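The indictment describes the mechanics only vaguely, but the collection method at issue is the same one every crawler uses; what varies is the rate and the terms of service. A purely hypothetical sketch (none of these names or limits come from the indictment) shows how thin the technical line is: a "polite" scraper and a mass download differ only in a delay and a cap.

```python
# Hypothetical sketch of a throttled bulk-collection loop. The `fetch`
# callable, the delay, and the request cap are all illustrative choices,
# not anything attributed to Swartz or to JSTOR's actual limits.
import time


def bulk_fetch(urls, fetch, delay=1.0, max_requests=100):
    """Fetch each URL via `fetch`, pausing `delay` seconds between
    requests and refusing to exceed `max_requests`. These self-imposed
    limits are the only mechanical difference between a polite crawler
    and a mass download."""
    results = []
    for i, url in enumerate(urls):
        if i >= max_requests:
            break  # stop at the cap instead of draining the archive
        results.append(fetch(url))
        if i + 1 < min(len(urls), max_requests):
            time.sleep(delay)  # rate-limit between requests
    return results
```

Remove the `delay` and raise `max_requests` and the same ten lines become the kind of collection that drew felony charges, which is exactly Webb's point about Google's scrapers.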
Update: GigaOm’s Janko Roettgers reports that a torrent with 18,592 scientific publications — all of them apparently from JSTOR — was uploaded to The Pirate Bay.
Microsoft releases its big data toolkit for scientists
Although we’re all creating massive amounts of data, for scholars and scientists that data creation and analysis can quickly run afoul of the limitations of university computing centers. To that end, Microsoft Research this week unveiled Daytona, a tool designed to help scientists with big data computation.
Created by the eXtreme Computing Group, the tool lets scholars and scientists use Microsoft’s Azure platform to work with large datasets. According to Roger Barga, an architect in the eXtreme Computing Group:
Daytona has a very simple, easy-to-use programming interface for developers to write machine-learning and data-analytics algorithms. They don’t have to know too much about distributed computing or how they’re going to spread the computation out, and they don’t need to know the specifics of Windows Azure.
Daytona is meant to be an easier-to-use alternative to Hadoop and other MapReduce implementations (it is itself built on a MapReduce runtime). It comes with code samples and programming guides to get people up and running.
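Daytona itself is a .NET system that runs on Windows Azure, but the programming model Barga describes is the familiar MapReduce pattern: the scientist writes only a map function and a reduce function, and the runtime handles distribution. A minimal single-process Python sketch of that model (illustrative only, not Daytona's actual API):

```python
# Single-process illustration of the MapReduce programming model.
# A runtime like Daytona or Hadoop distributes these same two
# user-supplied functions across a cluster; the user-facing contract
# is just "emit (key, value) pairs, then combine values per key."
from collections import defaultdict
from itertools import chain


def map_reduce(records, mapper, reducer):
    """Apply `mapper` to each record, group the emitted (key, value)
    pairs by key, then apply `reducer` to each group."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# The classic word-count example expressed in this model:
def mapper(line):
    return [(word, 1) for word in line.split()]


def reducer(word, counts):
    return sum(counts)
```

The appeal for scholars is that `mapper` and `reducer` contain all the domain logic; nothing in them knows how many machines the job runs on.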
The eXtreme Computing Group has also built Excel Datascope, which as the name suggests is a tool that offers data analytics from Excel.
While making it easier for academics to perform big data analysis is an honorable goal, I can't help but ask (as a recovering academic myself): when will the academy realize that the skills needed to work with these datasets warrant formal attention? Scholars need to be trained to manage this information. Then it wouldn't just be a matter of making these tools "easier," but of making them better.
The state of open data in Canada
Open government advocate David Eaves examines how Canadian governments (provincial and otherwise) have made strides toward opening up data to citizens, developers, and others. But as Eaves makes clear in his post, it isn't as simple as just "opening" data as a gesture; the data must also be readily accessible and usable.
“Licenses matter because they determine how you are able to use government data — a public asset,” he writes. “As I outlined in the three laws of open data, data is only open if it can be found, be played with and be shared.” Eaves contends that licensing is particularly important, as this can limit what sorts of restrictions are put on the sharing of data and, in turn, on the sorts of apps one can build using it.
What do we want then? Eaves lists these attributes:
- Open: there should be maximum freedom for reuse
- Secure: it offers governments appropriate protections for privacy and security
- Simple: to keep down legal costs and make it easier for everyone to understand
- Standardized: so my work is accessible across jurisdictions
- Stable: so I know that the government won’t change the rules on me
When it comes to the “where do we go from here” aspect, Eaves isn’t optimistic. He notes that while some municipalities may have opened their datasets, the federal government — in Canada and elsewhere — seems unprepared to fully engage with the developer and open data communities.
Got data news?
Feel free to email me.