Beyond file soup: Data Science Reproducibility.


As part of our horizon scanning for potential product-market fit, one of our hot favourite areas is Data Science. We’ve got a real soft spot for science (several of us at Dotmesh have a science background) - throw in a bit of data and it’s doubly interesting!

Our 2-week research sprint into Data Science has given us a great chance to get geeky, dig deep into some desk research and have a number of really great conversations with some friendly Data Scientists.

What we’ve learned so far is that Data Scientists are a little bit different to the Developers we’re used to. They need to spend their considerable talents on the science, not the technology, so they find it less fun or interesting to get down into the weeds with their tooling. Sounds obvious, but developers are toolmakers, and if they don’t like their tools they fix them or make new ones. Not so for the rest of us mere mortals!

Tooling aside for a moment, we’ve seen that Data Scientists face a unique set of challenges across the industry and in academia:

1. Reproducibility.

It’s surprisingly common for Data Scientists to struggle to reproduce results from previous experiments, especially those conducted by others. These may be internal to the organisation or external experiments from published research. Pete Warden recently called it a crisis in reproducibility, and Airbnb has built its own tooling to address the problem.

2. File Soup.

“File Soup” is created as a byproduct of the Data Science workflow. As scientists explore and reason over their experiments they often generate a large number of files, including input datasets and output results. Keeping tabs on these is a chore, and a poor signal-to-noise ratio makes hunting for a particular file frustrating.

3. Clunky workflows.

In addition to these challenges, we noticed that there are other common areas of Data Science workflows that would benefit from support through better tooling. Many projects exist to solve some parts of the problem, but all of them require a bit (or a lot!) of wrangling to fit to the problem at hand. On top of this, they frequently introduce extra overhead on the user - Git anyone? Things get more clunky, not less.

Can Dotmesh help?

Ok, so I didn’t put tooling aside for long - it’s what we love to do! We don’t have the answers to these challenges yet, but we’re paper prototyping some ideas to share with the community.

We’re envisioning a plugin for your JupyterLab, RStudio or MATLAB IDE. It would automatically capture changes to both your code and your data in a single “Version history”.

Automatic history of your changes:

- An initial automatic commit upon project creation.
- Each saved change or test run results in an autocommit to your project history.
- Name any important commits; these are automatically saved back to a central repository, e.g. our Dothub.
- For each autocommit, you could drill down to see what had changed.

The workflow could extend to publishing, branching/duplicating and running your project in a shared cloud.

At any time you could go back to a point in your project history and reinstate the code and data exactly as they were then. In addition, your project could be saved centrally (e.g. in our Dothub) where your colleagues could pull down the exact version they need - no more requesting and emailing spreadsheets around!
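To make the workflow above concrete, here’s a toy Python sketch of the interaction we’re imagining - autocommit on every save, naming the commits that matter, and rolling back to any earlier state. The names (`ProjectHistory`, `autocommit`, `checkout`) are hypothetical placeholders, not a real Dotmesh API:

```python
import copy

class ProjectHistory:
    """Toy version history: snapshots code AND data together.
    Illustrative only - not the real Dotmesh plugin API."""

    def __init__(self, state):
        self.commits = []                 # list of (name, snapshot) pairs
        self.autocommit(state, name="initial")

    def autocommit(self, state, name=None):
        # Every saved change or test run produces an autocommit.
        self.commits.append((name, copy.deepcopy(state)))
        return len(self.commits) - 1      # commit id

    def name_commit(self, commit_id, name):
        # Naming an important commit marks it for saving to a central hub.
        _, snapshot = self.commits[commit_id]
        self.commits[commit_id] = (name, snapshot)

    def checkout(self, commit_id):
        # Reinstate code and data exactly as they were at that commit.
        return copy.deepcopy(self.commits[commit_id][1])

# Usage: work, autocommit, name the commit you care about, roll back.
history = ProjectHistory({"notebook": "v1", "data": [1, 2, 3]})
cid = history.autocommit({"notebook": "v2", "data": [1, 2, 3, 4]})
history.name_commit(cid, "added-fourth-sample")
old = history.checkout(0)   # back to the initial code + data
```

The point of the sketch is that code and data are versioned together in one history, so reinstating an old commit restores both at once.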

So, what do you think? Could this work for you? Does it address any needs? Is it a layer of overhead you’d rather do without? Have we misunderstood how you work? Or have we suggested the start of something that could be really helpful? How would you use it? How would you change it?

Join the discussion on Hacker News, tweet us, or join our Slack (look for the #datascience channel).