Julia’s Role in Data Science

Myths and Realities


Since its first public release in February 2012, the Julia programming language has received a lot of hype. This has led to some confusion about the language’s current status. In this post, I’d like to make clear where Julia stands and where Julia is going, especially in regard to Julia’s role in data science, where the dominant languages are R and Python. We’re working hard to make Julia a viable alternative to those languages, but it’s important to separate out myth from reality.

Where Julia Stands

In order to dispel some of the confusion about Julia, I want to discuss the two main types of misunderstanding that I come across:

  • Confusion 1: Julia already possesses a mature package ecosystem and can be used as a feature-complete replacement for R or Python.
  • Confusion 2: Julia’s compiler is so good that it will make any piece of code fast – even bad code.

The truth about Julia is closer to the following:

  • Reality 1: Julia has a quickly growing, but still very young, package ecosystem. If you want to be productive, Julia needs to be part of a multilanguage environment in which you use R or Python when they are more appropriate. How much of your work can be done using only Julia depends upon your specific needs. People who tend to construct novel models and fit them using optimization algorithms will find that Julia is already nearly feature-complete. People who depend upon R’s large collection of classical statistical procedures will find that Julia is still missing a lot of functionality.
  • Reality 2: Julia’s compiler can produce code nearly as efficient as similar C code if the Julia code given to the compiler is written with performance in mind. What sets Julia apart is not a sufficiently smart compiler that works around sloppy code, but rather a combination of (1) a strong type system that aligns naturally with primitive machine operations and (2) a fully automatic type inference system, which makes it possible for the compiler to do the tedious work of type declaration when users do not want to do it themselves.

Julia’s Ecosystem is Growing, but Young

Julia was publicly released ~1.5 years ago, after 2 years of internal development. Although the Julia community has grown substantially since February 2012, the language ecosystem is still very young. Many of the popular libraries available for languages like R or Python have no parallels in Julia yet. Julia is slowly developing its own ecosystem of packages, but any practical data scientist will need to mix Julia code with R or Python when the problem at hand demands it.

That said, it’s worth noting that the Julia ecosystem already has some very impressive packages:

  • Base Julia provides much of the functionality available in NumPy.
  • Additional Julia packages are slowly filling in the functionality of SciPy, including Stats.jl, Distributions.jl, Optim.jl and JuMP.jl.
  • DataFrames.jl provides tools for working with tabular data that will be familiar to users of R or pandas.
  • Gadfly.jl provides a bare-bones visualization package similar in spirit to ggplot2, while PyPlot.jl provides a complete interface to matplotlib from Julia.
  • Graphs.jl provides some of the functionality from packages like igraph or NetworkX.

When these packages do not meet users’ needs, most Julia users will get their work done using one of two strategies:

  • Direct language interop: PyCall.jl makes it possible to call Python code from inside of a Julia program. The Julia community is already using these interop facilities to build packages like SymPy.jl, which wraps a popular symbolic algebra system developed for Python. Similarly, Matlab.jl makes it possible to call Matlab from Julia.
  • Multistep pipelines: Many data science tasks can be divided into a pipeline of completely independent steps. Newcomers to Julia can transition a pipeline over to Julia in steps, which eases the transition. When I first started using Julia, I would frequently do data preprocessing and modeling in Julia, but all of the subsequent visualization steps in R. As Julia’s package ecosystem matures, more parts of a pipeline can be translated into Julia code.

Julia’s Compiler isn’t Magic: The Language Design Is

Julia has acquired a well-deserved reputation for speed. Microbenchmarks demonstrate that well-written Julia performs nearly as well as similar C code. But, unlike a language like JavaScript, Julia achieves its high level of performance through the systematic use of machine-appropriate types and data structures, rather than through the application of a compiler with the sophistication of JavaScript’s V8 engine.

To see what makes Julia’s approach special, consider a method definition in Julia like the simple line, double(x) = x + x. In Julia, this method definition actually defines a potentially infinite family of functions: one for each of the possible types of input that might be passed as an argument to the function. For example, double(2) will call a specialized function definition that uses a CPU’s native integer addition instruction, whereas double(2.0) will use the CPU’s native floating point addition instruction. Julia’s ability to generate specialized code for different input types, when coupled with the compiler’s ability to infer these types for most variables, makes it possible to write Julia code at a very abstract level while achieving the efficiency associated with low level code that would work on only a small subset of machine primitives. Julia’s ability to compile code that reads like Python into machine code that performs like C almost entirely derives from Julia’s ability to specialize function definitions in this way.
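As a minimal sketch of this specialization, the one-line definition from the paragraph above can be called with an integer and with a floating point argument. Only double itself comes from the text; the variable names and printing are illustrative assumptions.

```julia
# One generic definition. Julia compiles a separate specialization for
# each concrete argument type the first time that type is used in a call.
double(x) = x + x

a = double(2)      # uses the specialization for integer arguments
b = double(2.0)    # uses the specialization for Float64 arguments

# Each result has the same type as its input.
println(a, " :: ", typeof(a))
println(b, " :: ", typeof(b))
```

In an interactive session, `@code_native double(2)` and `@code_native double(2.0)` can be used to inspect the machine code generated for each specialization.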

While Julia’s compiler is able to exploit type inference to generate very efficient code in many cases, it’s important to keep in mind that Julia’s compiler isn’t doing anything magical. Code that can be interpreted in terms of simple operations on basic machine types will be as fast as careful C code, but code that doesn’t let the compiler do its tricks won’t be faster than code written in languages like R or Python.
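One common way to keep the compiler from doing its tricks is to let a variable change type inside a loop. The sketch below is an assumption-laden illustration rather than a benchmark: the function names sum_stable and sum_unstable are hypothetical and do not appear in Julia itself, and the size of any slowdown depends on the Julia version in use.

```julia
# Hypothetical example functions for illustration only.

# Type-stable: `total` is a Float64 on every iteration, so the loop maps
# directly onto floating point additions.
function sum_stable(xs::Vector{Float64})
    total = 0.0
    for x in xs
        total += x
    end
    return total
end

# Type-unstable: `total` starts as an Int and becomes a Float64 after the
# first addition, so it cannot be pinned down to a single machine type.
function sum_unstable(xs::Vector{Float64})
    total = 0
    for x in xs
        total += x
    end
    return total
end

xs = [1.0, 2.0, 3.5]
println(sum_stable(xs))
println(sum_unstable(xs))

# The instability shows up in the return type for an empty input:
# one function returns a Float64, the other an Int.
println(typeof(sum_stable(Float64[])))
println(typeof(sum_unstable(Float64[])))
```

Both functions compute the same sum for non-empty input; the difference lies in what the compiler can infer. In an interactive session, `@code_warntype sum_unstable(xs)` highlights the variable whose type could not be inferred as a single concrete type.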

In addition to the fact that Julia’s compiler can’t make arbitrary code fast, it’s important to keep in mind that many of the built-in functions in R or Python aren’t written in those languages, but in C. Because Julia performs roughly as well as C, this means that Julia won’t do better than R or Python if most of the work you do in R or Python is calling built-in functions without performing any explicit iteration or recursion. It’s only when you start doing custom work that Julia will really shine.

In other words, Julia is the perfect language for advanced users of R or Python who are trying to build advanced tools inside of those languages. The alternative to Julia is typically resorting to C or C++: R offers this through Rcpp and Python offers it through Cython. The goal of Julia is to make it possible to get Cython-like performance in the same language you use to build your prototype.

Where Julia is Going

In the next year, we’ll be working to push Julia forward in several different directions. First and foremost, we’ll be trying to improve on Julia’s graphical toolkits, so that binary installations of Julia ship with a high quality set of graphical functions that users can use to visualize data. We expect that Julia will be able to rival the toolkits from more established languages within another year or two.

We’ll also be working to make the integration between Julia and Python much tighter, which should make it much easier for advanced Python users to reimplement performance bottlenecks in Julia, much as they currently might use Cython.

Finally, we’ll improve the quality of our data infrastructure and modeling tools. More and more of the statistical functionality from R will be ported to Julia. At the same time, interfaces to Python libraries like scikit-learn will grow. Eventually newcomers will be able to expect that most data science tasks can be done in Julia as easily as they can now be done in Python or R.

In addition to this work, the basic Julia language will continue its gradual evolution, including the introduction of better tools for parallel processing and the development of a static compiler that will generate machine executables.

Although Julia’s core language is still evolving, it’s worth noting that the basic language design has been stable for several years now, unlike the still-changing package ecosystem. Users who are interested in experimenting with the language should find that it is already ready to handle many standard tasks. We hope you take Julia out for a spin and find it as enjoyable to work with as we do.

  • gappy3000

    It may take a couple of years to get there, but it’s very likely that Julia will become the best language for technical computing. It’s only partially surprising that Julia is going to have better interop with Python than with R. To some extent, the two languages are good complements, with Python being a great scripting language with mature support for system and web programming, and partially for numerical computing (against the grain, I don’t think that is Python’s strong suit). R, on the other side, has a narrower technical/statistical user base (without being a DSL). As someone who loves ggplot2, plyr, and a dozen R statistical libraries, I feel a little sad, not least because there are some major contributors to Julia who are or were R power users. Not sure about the long-term implications in terms of user base, but I understand that resources are finite.

    • John Myles White

      At some point, Julia will probably have better interop with R. The reason it is currently lacking is that no one who is actively working on Julia has decided to invest time in it. The Python and Matlab interop tools are almost entirely the work of five people. If a few people with the required skill set took up the task for R, it could be done in principle.

  • Greg Wilson

    What steps is the Julia community taking to avoid winding up with the same confusing mess of libraries that plagues so many other open languages? (More than one student in our Software Carpentry classes has found it easier to write their own routines than find something in the standard library, much less on PyPI.) And what is there in Julia to make it easier for people to package, deploy, and install code, and figure out exactly what versions of what are installed on a particular machine? Again, this is where we see scientists being most frustrated; all-in-one installers like Canopy and Anaconda are great if the thing you want is in them, but as soon as you need a third-party package, you’re back in a painful place.

    • Leah

      You can see some of the built-in functions for helping you find existing functions here: http://blog.leahhanson.us/julia-helps.html . In particular, the `help` function provides documentation at the REPL.

      Julia has a built-in package manager: http://docs.julialang.org/en/release-0.2/manual/packages/ . The function `Pkg.status()` lists all installed packages and their versions.

      • Greg Wilson

        Yes, but live help and a package listing service haven’t prevented package management in Python and other languages from becoming a tangled mess.

    • Isaiah

      Distributing pure Julia (or Python) code is straightforward most of the time; the trouble usually comes when compiled components are required. Julia has some advantages here. First, pure Julia code is usually quite fast, and additional optimizations can be performed within-language to get maximum performance. This eliminates the need to drop into a different pre-compiled language for performance. Second, external function calls are JIT’d, providing the flexibility of Python’s ctypes with no runtime overhead. Thus there is no need to distribute mutually-compatible versions of the whole stack (CPython + shim + shared library).

      Regarding ecosystem development, *having* an integrated package manager and highly-visible listing system from the outset will hopefully reduce duplication of efforts. The package manager is tightly integrated with version control and has simplified tools to facilitate package creation, submission, and pull-requests (each package is a Git repository), hopefully blurring the line between contributor and user and resulting in more participation (Julia’s speed is also important here, because the barrier to entry for contribution is much lower than in a hybrid-language project).

      We would love to hear your thoughts on this if you ever want to swing by the mailing list.

    • haydoni

      I guess one of the great things about the Julia ecosystem is that once installed, you’ve done the hard work, as most packages will just be written in Julia. Agree this is a critical thing for Julia; looking forward to a clean, non-sudo solution like Anaconda is for Python.

      (Note: I’ve found provided you’ve installed pip via Anaconda then pip works great with Anaconda.)

  • radiofreerome

    I am a numerical analyst specializing in machine learning. Julia looks like a very interesting language; however, I’d be more willing to use it in commercial products if it had refactoring IDE support. The creation of an Intellij plugin for Julia would be very helpful.