ENTRIES TAGGED "Python"
It has roots in academic scientific computing, but has features that appeal to many data scientists
As I noted in a recent post on reproducing data projects, notebooks have become popular tools for maintaining, sharing, and replicating long data science workflows. Much of that is due to the popularity of IPython1. In development since 2001, IPython grew out of the scientific computing community and has slowly added features that appeal to data scientists.
Roots in academic scientific computing
As IPython creator Fernando Perez noted in his “historical retrospective”, exploratory analysis in a scientific setting requires a solid interactive environment. After years of development IPython has become a great tool for interacting with data. IPython also addresses other important pain points for scientists – reproducibility and collaboration – issues that are equally important to data scientists working in industry.
IPython is more than just Python
With an interactive widget architecture that’s 100% language-agnostic, these days IPython is used by many other programming language communities2, including Julia, Haskell, F#, Ruby, Go, and Scala. If you’re a data scientist who likes to mix-and-match languages, you can create, maintain, and share multi-language data projects in IPython:
An interview with Allen Downey, the author of Think Bayes
When Mike first discussed Allen Downey’s Think Bayes book project with me, I remember nodding a lot. As the data editor, I spend a lot of time thinking about the different people within our Strata audience and how we can provide what I refer to “bridge resources”. We need to know and understand the environments that our users are the most comfortable in and provide them with the appropriate bridges in order to learn a new technique, language, tool, or …even math. I’ve also been very clear that almost everyone will need to improve their math skills should they decide to pursue a career in data science. So when Mike mentioned that Allen’s approach was to teach math not using math…but using Python, I immediately indicated my support for the project. Once the book was written, I contacted Allen about an interview and he graciously took some time away from the start of the semester to answer a few questions about his approach, teaching, and writing.
How did the “Think” series come about? What led you to start the series?
Allen Downey: A lot of it comes from my experience teaching at Olin College. All of our students take a basic programming class in the first semester, and I discovered that I could use their programming skills as a pedagogic wedge. What I mean is if you know how to program, you can use that skill to learn everything else.
I started with Think Stats because statistics is an area that has really suffered from the mathematical approach. At a lot of colleges, students take a mathematical statistics class that really doesn’t prepare them to work with real data. By taking a computational approach I was able to explain things more clearly (at least I think so). And more importantly, the computational approach lets students dive in and work with real data right away.
At this point there are four books in the series and I’m working on the fifth. Think Python covers Python programming–it’s the prerequisite for all the other books. But once you’ve got basic Python skills, you can read the others in any order.
Tools simplify the application of advanced analytics and the interpretation of results
A new set of tools make it easier to do a variety of data analysis tasks. Some require no programming, while other tools make it easier to combine code, visuals, and text in the same workflow. They enable users who aren’t statisticians or data geeks, to do data analysis. While most of the focus is on enabling the application of analytics to data sets, some tools also help users with the often tricky task of interpreting results. In the process users are able to discern patterns and evaluate the value of data sources by themselves, and only call upon expert1 data analysts when faced with non-routine problems.
Visual Analysis and Simple Statistics
Three SaaS startups – DataHero, DataCracker, Statwing – make it easy to perform simple data wrangling, visual analysis, and statistical analysis. All three (particularly DataCracker) appeal to users who analyze consumer surveys. Statwing and DataHero simplify the creation of Pivot Tables2 and suggest3 charts that work well with your data. StatWing users are also able to execute and view the results of a few standard statistical tests in plain English (detailed statistical outputs are also available).
Statistics and Machine-learning
BigML and Datameer’s Smart Analytics are examples of recent tools that make it easy for business users to apply machine-learning algorithms to data sets (massive data sets, in the case of Datameer). It makes sense to offload routine data analysis tasks to business analysts and I expect other vendors such as Platfora and ClearStory to provide similar capabilities in the near future.
A variety of tools are making data science tasks easy to do in Python
Here are a few observations inspired by conversations I had during the just concluded PyData conference1.
The Python data community is well-organized:
Besides conferences (PyData, SciPy, EuroSciPy), there is a new non-profit (NumFOCUS) dedicated to supporting scientific computing and data analytics projects. The list of supported projects are currently Python-based, but in principle NumFOCUS is an entity that can be used to support related efforts from other communities.
It’s getting easier to use the Python data stack:
There are tools that facilitate the dissemination and sharing of code and programming environments. IPython2 notebooks allow Python code and markup in the same document. Notebooks are used to record and share complex workflows and are used heavily for (conference) tutorials. As the data stack grows, one of the major pain points is getting all the packages to work properly together (version compatibility is a common issue). In particular setting up environments were all the pieces work together can be a pain. There are now a few solutions that address this issue: Anaconda and cloud-based Wakari from Continuum Analytics, and cloud computing platform PiCloud.
There are many more visualization tools to choose from:
The 2D plotting tool matplotlib is the first tool enthusiasts turn to, but as I learned at the conference, there are a number of other options available. Continuum Analytics recently introduced companion packages Bokeh and Bokeh.js that simplify the creation of static and interactive visualizations using Python. In particular Bokeh is the equivalent of ggplot (it even has an interface that mimics ggplot). With Nodebox, programmers use Python code to create sketches and interactive visualizations that are similar to those produced by Processing. Read more…
Use Gephi and Python to find your personal communities
Using a bit of Python and the Gephi graph tool, exploring your own Twitter network is a great way to learn about analyzing networks: and the results definitely have a "wow" factor.