Software engineering practices for graduate students

Recently I was talking with an Olin student who will start graduate school in the fall, and I suggested a few things I wish I had done in grad school. And then I thought I should write them down. So here is my list of Software Engineering Practices All Graduate Students Should Adopt:

Version Control

Every keystroke you type should be under version control from the time you initiate a project until you retire it. Here are the reasons:

  1. Everything you do will be backed up. But instead of being organized by date (as in most backup systems), your backups are organized by revision. So, for example, if you break something, you can roll back to an earlier working revision.
  2. When you are collaborating with other people, you can share repositories. Version control systems are well designed for managing this kind of collaboration. If you are emailing documents back and forth, you are doing it wrong.
  3. At various stages of the project, you can save a tagged copy of the repo. For example, when you submit a paper for publication, make a tagged copy. You can keep working on the trunk, and when you get reviewer comments (or a question 5 years later) you have something to refer back to.

I use Subversion (SVN) primarily, so I keep many of my projects on Google Code (if they are open source) or on my own SVN server. But these days it seems like all the cool kids are using Git and keeping their repositories on GitHub.

Either way, find a version control system you like, learn how to use it, and find someplace to host your repository.
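To make the tagging idea from reason 3 concrete, here is a minimal sketch in Python. It assumes you chose Git (with Subversion, a tag is a copy made with svn copy instead). The function names and the tag name in the usage note are made up for illustration.

```python
import subprocess


def tag_command(name, message):
    """Build the Git command that creates an annotated tag."""
    return ["git", "tag", "-a", name, "-m", message]


def tag_submission(name, message, repo_dir="."):
    """Tag the current commit in repo_dir.

    Raises subprocess.CalledProcessError if Git reports a failure,
    for example when the tag name already exists.
    """
    subprocess.run(tag_command(name, message), cwd=repo_dir, check=True)
```

Calling tag_submission("paper-submitted", "Version sent to the journal") on the day you submit would let you run git checkout paper-submitted later to see exactly what the reviewers saw.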

Build Automation

This goes hand in hand with version control. If someone checks out your repository, they should be able to rebuild your project by running a single command. That means that everything someone needs to replicate your results should be in the repo, and you should have scripts that process the data, generate figures and tables, and integrate them into your papers, slides, and other documents.

One simple tool for automating the build is Make. Every directory in your project should contain a Makefile. The top-level directory should contain the Makefile that runs all the others.
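If you have not used Make, the core rule it applies is simple: rebuild a target only if it is missing or older than one of the files it depends on. The sketch below imitates that rule in Python. It is a stand-in to show the idea, not a replacement for Make, and the function names are hypothetical.

```python
import os


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any source file."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)


def build(target, sources, action):
    """Run action only when target is out of date.

    action is a function that is expected to create or update target.
    Returns True if the action ran, False if target was already current.
    """
    if needs_rebuild(target, sources):
        action()
        return True
    return False
```

A real Makefile expresses the same thing declaratively: each rule names a target, the files it depends on, and the command that regenerates it, and Make works out which commands need to run.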

If you use GUI-based tools to process data, it might not be easy to automate your build. But it will be worth it. The night before your paper is due, you will find a bug somewhere in your data flow. If you’ve done things right, you should be able to rebuild the paper with just five keystrokes (m-a-k-e, and Enter).

Also, put a README in the top-level directory that documents the directory structure and the build process. If your build depends on other software, include it in the repo if practical; otherwise provide a list of required packages.
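A list of required packages is more useful if a newcomer can check it automatically. Here is one way that check might look, assuming the dependencies are Python packages. The package names are placeholders, not a recommendation.

```python
import importlib.util

# Hypothetical list; replace it with whatever your build actually needs.
REQUIRED = ["numpy", "matplotlib"]


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages found.")
```

Running a script like this as the first step of the build turns a confusing failure halfway through into a clear message at the start.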

Or, if your software environment is not easy to replicate, put your whole development environment in a virtual machine and ship the VM.

Agile Planning

For many people, the most challenging part of grad school is time management. If you are an undergraduate taking 4-5 classes, you can do deadline-driven scheduling; that is, you can work on whatever task is due next and you will probably get everything done on time.

In grad school, you have more responsibility for how you spend your time and fewer deadlines to guide you. It is easy to lose track of what you are doing, waste time doing things that are not important (see Yak Shaving), and neglect the things that move you toward the goal of graduation.

One of the purposes of agile planning tools is to help people decide what to do next. They provide several features that apply to grad school as well as software development:

  1. They encourage planners to divide large tasks into smaller tasks that have a clearly defined end condition.
  2. They maintain a prioritized ranking of tasks so that when you complete one you can start work on the next, or one of the next few.
  3. They provide mechanisms for collaborating with a team and for getting feedback from an adviser.
  4. They involve planning on at least two time scales. On a daily basis you decide what to work on by selecting tasks from the backlog. On a weekly (or longer) basis, you create and reorder tasks, and decide which ones you should work on during the next cycle.
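The first two features above amount to a small data structure: a list of tasks, each with an end condition and a rank. The Python sketch below is only an illustration of that idea, with made-up field names. Real planning tools add much more, including the collaboration features in item 3.

```python
from dataclasses import dataclass


@dataclass
class Task:
    title: str
    done_when: str      # the clearly defined end condition
    priority: int       # lower number means more important
    done: bool = False


def next_tasks(backlog, count=3):
    """Return up to count unfinished tasks, most important first."""
    open_tasks = [task for task in backlog if not task.done]
    return sorted(open_tasks, key=lambda task: task.priority)[:count]
```

Daily planning is then a call to next_tasks, and weekly planning is editing the priorities and adding or splitting tasks.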

If you use GitHub or Google Code for version control, you get an issue tracker as part of the deal. You can use issue trackers for agile planning, but there are other tools, like Pivotal Tracker, that have more of the agile methodology built in. I suggest you start with Pivotal Tracker because it has excellent documentation, but you might have to try out a few tools to find one you like.

Do these things — Version Control, Build Automation, and Agile Planning — and you will get through grad school in less than the average time, with less than the average drama.

Editor’s Note: This piece is reposted at the editor’s request because these tips also apply to those pursuing studies in data science or data engineering. It originally appeared on the Probably Overthinking It blog and has been edited.