Inside core features of specialized data analysis languages.
Big data frameworks like Hadoop have received a lot of attention recently, and with good reason: when you have terabytes of data to work with — and these days, who doesn’t? — it’s amazing to have affordable, reliable and ubiquitous tools that allow you to spread a computation over tens or hundreds of CPUs on commodity hardware. The dirty truth is, though, that many analysts and scientists spend as much time or more working with mere megabytes or gigabytes of data: a small sample pulled from a larger set, or the aggregated results of a Hadoop job, or just a dataset that isn’t all that big (like, say, all of Wikipedia, which can be squeezed into a few gigs without too much trouble).
MATLAB is one of the oldest programming languages designed specifically for data analysis, and it is still extremely popular today. MATLAB was conceived in the late ’70s as a simple scripting language wrapped around the FORTRAN libraries LINPACK and EISPACK, which at the time were the best way to efficiently work with large matrices of data — as they arguably still are, through their successor LAPACK. These libraries, and thus MATLAB, were solely concerned with one data type: the matrix, a two-dimensional array of numbers.
This may seem very limiting, but in fact, a very wide range of scientific and data-analysis problems can be represented as matrix problems, and often very efficiently. Image processing, for example, is an obvious fit for the 2D data structure; less obvious, perhaps, is that a directed graph (like Twitter’s follow graph, or the graph of all links on the web) can be expressed as an adjacency matrix, and that graph algorithms like Google’s PageRank can be easily implemented as a series of additions and multiplications of these matrices. Similarly, the winning entry to the Netflix Prize recommendation challenge relied, in part, on a matrix representation of everyone’s movie ratings (you can imagine every row representing a Netflix user, every column a movie, and every entry in the matrix a rating), and in particular on an operation called Singular Value Decomposition, one of those original LINPACK matrix routines that MATLAB was designed to make easy to use.