Using and Learning Drake

Drake is a new piece of software I discovered that can be very helpful for automating data manipulation and management. The use case for Drake is simple: one of the biggest and most time-consuming tasks in any data project is cleaning the data so it can be transformed into a format suitable for analysis.

Data often comes in crazy, non-standard formats which need to be regularized before they can be imported into an analysis program like R or ArcGIS. There are at least two good reasons to automate as much of this process as possible.

First, you want to create something that can be reproduced by yourself sometime in the future or by others who may want to check your work. Reproducible research is increasingly important for all disciplines. Data publishing requirements at research journals and data management plan requirements at funders like the NSF mean that researchers should be able to publish their data as well as the software that was used to perform the analysis.

Second, automation can just make your life easier. Finding a way to repeat the same tasks again and again can be helpful whether you have a handful or hundreds of different files which need to be cleaned. Automation means you can repeat the same tasks months in the future when new data is collected, without having to remember each step. You can also share the automation pipeline with others who have the same software tools available, so it helps with collaboration.

I discovered Drake through a good screencast which demonstrates the basics of how Drake works. The model is similar to Make, a program often used by software engineers to automate building software projects. Each step in the workflow file names an output and the input it is built from. So a line with output <- input is followed by a series of commands that are used to transform the input file into the output file. The output of one step can then become an input to the next set of steps. You can choose to run the entire workflow or target specific steps depending on whether you have made changes to the analysis or the data. Drake is intelligent enough to work out which steps need to be redone after a change and rebuild them.
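
To make the rebuild idea concrete, here is a small Python sketch of the general Make-style rule: a step is redone if its output is missing, if one of its inputs is newer than its output, or if one of its inputs is itself being rebuilt. This is only an illustration of the concept, not Drake's actual implementation. The function name stale_steps, the file names, and the integer "timestamps" are all made up for the example.

```python
def stale_steps(steps, mtimes):
    """Return the outputs that need rebuilding, in workflow order.

    steps  : list of (output, [inputs]) pairs, listed in dependency order
    mtimes : dict mapping file name -> last-modified time
             (a file that does not exist yet is simply absent)

    Assumption: every raw input file exists, so it has an entry in mtimes.
    """
    stale = []
    for output, inputs in steps:
        out_time = mtimes.get(output)
        needs_rebuild = (
            out_time is None                                  # output was never built
            or any(inp in stale for inp in inputs)            # an upstream step is being redone
            or any(mtimes[inp] > out_time for inp in inputs)  # an input changed after the last build
        )
        if needs_rebuild:
            stale.append(output)
    return stale


# Hypothetical three-step workflow, written as output <- inputs.
workflow = [
    ("clean.csv", ["raw.csv"]),
    ("merged.csv", ["clean.csv", "demographics.csv"]),
    ("report.txt", ["merged.csv"]),
]

# demographics.csv was edited (time 6) after merged.csv was built (time 3).
times = {
    "raw.csv": 1,
    "clean.csv": 2,
    "demographics.csv": 6,
    "merged.csv": 3,
    "report.txt": 4,
}

print(stale_steps(workflow, times))  # ['merged.csv', 'report.txt']
```

In this example clean.csv is left alone because raw.csv has not changed since it was built, while merged.csv and everything downstream of it are redone. That is the behavior that makes it cheap to rerun a long pipeline after a small change.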

Overall, it is a very useful program which can make your life much easier.

I posted an example of how I used it in a project to clean up some data from the 2008 election results. The project is part of a class at Bucknell which will be studying the effects of demographic and geographic variables on election outcomes.