Here are a few of the data stories that caught my attention this week.
Automated essay-scoring software scores as well as humans
Robot essay graders: They grade the same as humans. That’s the conclusion of a study conducted by Mark Shermis, dean of the University of Akron’s College of Education, and Kaggle data scientist Ben Hamner. The researchers examined some 22,000 essays written by junior high and high school students as part of their states’ standardized testing process, comparing the grades given by human graders with those given by automated grading software. They found that “overall, automated essay scoring was capable of producing scores similar to human scores for extended-response writing items with equal performance for both source-based and traditional writing genre” (PDF of the report).
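To make "scores similar to human scores" a little more concrete, here is a minimal sketch of how agreement between two sets of grades might be measured. It is an illustration only: the function names and the sample scores are made up, and this post does not say which statistics the study itself used. Exact agreement and quadratic weighted kappa are simply two common choices for comparing graders on an ordinal scale.

```python
from collections import Counter


def exact_agreement(human, machine):
    """Fraction of essays where both graders gave the same score."""
    if len(human) != len(machine) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    return sum(h == m for h, m in zip(human, machine)) / len(human)


def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """Agreement corrected for chance; larger disagreements are penalized more.

    Scores are assumed to be integers in [min_score, max_score].
    """
    if len(human) != len(machine) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    k = max_score - min_score + 1
    if k < 2:
        raise ValueError("need at least two possible scores")

    n = len(human)
    observed = Counter(zip(human, machine))  # how often each (human, machine) pair occurs
    human_hist = Counter(human)
    machine_hist = Counter(machine)

    disagreement = 0.0         # weighted disagreement actually observed
    chance_disagreement = 0.0  # weighted disagreement expected if graders were independent
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / (k - 1) ** 2
            disagreement += weight * observed[(i, j)] / n
            chance_disagreement += weight * human_hist[i] * machine_hist[j] / n**2

    if chance_disagreement == 0:
        # Both graders gave every essay the same single score.
        return 1.0
    return 1.0 - disagreement / chance_disagreement


if __name__ == "__main__":
    # Hypothetical scores on a 1-4 scale, invented for illustration.
    human_scores = [1, 2, 3, 4, 2, 3]
    machine_scores = [1, 2, 3, 3, 2, 4]
    print(exact_agreement(human_scores, machine_scores))
    print(quadratic_weighted_kappa(human_scores, machine_scores, 1, 4))
```

A kappa of 1 means the two graders always agree, while a value near 0 means they agree no more often than chance would predict. Squaring the score difference means an essay graded 1 by one grader and 4 by the other counts far more heavily than a one-point miss.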
“The demonstration showed conclusively that automated essay scoring systems are fast, accurate, and cost effective,” says Tom Vander Ark, managing partner at the investment firm Learn Capital, in a press release touting the study’s results.
The study coincides with an active competition hosted on Kaggle and sponsored by the Hewlett Foundation, in which data scientists are challenged to develop the best algorithm for automatically grading student essays. “Better tests support better learning,” noted the foundation’s Education Program Director Barbara Chow in the press release. “This demonstration of rapid and accurate automated essay scoring will encourage states to include more writing in their state assessments. And, the more we can use essays to assess what students have learned, the greater the likelihood they’ll master important academic content, critical thinking, and effective communication.”
Personally, I like writing for a human audience. Bots leave really stupid blog comments — but I bet there’s an algorithm for that too.
The billion-dollar acquisition of the mobile photo-sharing app Instagram was big news last week. The news coincided with a presentation by co-founder Mike Krieger at an Airbnb Tech Talk about how the startup managed to scale to 30 million users worldwide with a small team of back-end developers (a very small team, in fact). Krieger’s presentation is interesting in its own right, of course, but news of the acquisition by Facebook certainly fueled interest, both in the deal and in the tech under the Instagram hood.
Krieger’s slides can be found here. The presentation details some of the early and ongoing challenges of handling the app’s growing number of users and their photos (including the recent roll-out of an Android app, which added another million new users in just 12 hours). Although Instagram hasn’t suffered major outages like those seen at Twitter and Tumblr, Krieger does note a number of early problems, including a missing favicon.ico that was causing a lot of 404 errors in Django.
The UK’s National Audit Office has just released its look at the government’s open data efforts, reports The Guardian. Although the open data initiative gets good marks for the “tsunami of data” it’s released — 8,300 datasets — there remain questions about cost and usage.
Government departments estimate they spend between £53,000 and £500,000 each year on publishing the data. The police crime maps, for example, cost £300,000 to set up and £150,000 per year to maintain. And it’s not clear that the data is in demand, according to the National Audit Office report: “None of the departments reported significant spontaneous public demand for the standard dataset releases.” This doesn’t account for the ways in which third-party vendors may be using the data, however.
Big Data Week
April 23-29 is “Big Data Week,” an event created by DataSift that will feature meetups and hackathons in several cities around the world. Big Data Week aims to bring together the “core communities” — data scientists, data technologies, data visualization, and data business. A list of events is available on the Big Data Week website.
Got data news?
Feel free to email me.