One of the main benefits of Dotscience is that it can automatically “version everything” for you. This supports our goal of making our users’ work 100% reproducible, accountable, and auditable.
Items that can be versioned include:
- Datasets (including large ones, via ZFS)
- Machine learning models
- Model hyperparameters
- Model metrics
An obvious question for a new user writing an analysis is: how? How does the system figure out which things in your code are datasets or models, and which things in your models are hyperparameters and metrics? The answer is that you annotate your code to tell the system what to record. At first glance, if you already have your code, this sounds onerous: no one wants to change their code just to put the same analysis into a new system. However, it is worth it in Dotscience, because the system is deliberately generic, so you get the benefits of versioning while keeping the flexibility to solve real business problems. In particular:
- The system works with Jupyter notebooks, via the integrated JupyterLab, or with Python .py scripts
- The user writes their own data science analysis, currently in Python
- Within these, arbitrary code can be run, and any available tools, machine learning or otherwise, can be installed
- Work is done in projects, which are tied to our versioning system, which works much like GitHub
While it is true that you currently get more functionality with some types of models than others (TensorFlow is the easiest to deploy), the scope of the system is to be an end-to-end platform not restricted to particular tools. When one considers that even within a single tool like TensorFlow or H2O there are dozens of models with hundreds of parameters, and that these are constantly changing and updating, it makes sense to have a general system where users annotate the code themselves to record parameters, rather than one that tries to derive them automatically.
And no, you don’t have to store a new copy of the data every time; that’s one of the nice things about ZFS :)
How to annotate your code
The key functionality is the Dotscience Python library. This provides a set of functions under the name ds that are added to existing code.
(Note, while the emphasis here is on Python, there is nothing in the system that prevents a Dotscience library being made available in other languages. Obvious candidates include R, Julia, and Scala.)
In total there are a couple of dozen functions. If you bring in the library with import dotscience as ds, then some of the main ones are ds.interactive(), ds.start(), ds.input(), ds.parameter(), ds.metric(), ds.model(), and ds.publish().
We show a couple of code examples below, but in general terms each one does the following.
ds.interactive(), in a Jupyter notebook, tells the system that the code is being run interactively rather than as a batch script such as a .py file (for which you would use ds.script()).
ds.start() indicates the start of a run. A run is all the actions within a data flow between ds.start() and the later ds.publish(), which indicates the end of the run. Just as the items within them, such as datasets and models, are versioned, runs themselves are versioned and repeatable, recording your data flow.
ds.input() tells the system that its argument, a string, is the location of an input dataset. The function returns the string, so you can write things like pd.read_csv(ds.input('~/path/to/my/file.csv')), if you happen to be using pandas, and it will work correctly. Alternatively, you can use ds.add_input(), or ds.add_inputs(), on a separate line if you don’t want to change your read command.
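The pass-through behaviour is what lets the annotation sit inside an existing read call. The sketch below is a stand-in written for illustration and is not the Dotscience implementation: the RunRecorder class and its attributes are hypothetical, and only the idea (record the path, then return it unchanged) is taken from the description above. It uses the standard csv module in place of pandas so that it runs without extra packages.

```python
import csv
import os
import tempfile


class RunRecorder:
    """Hypothetical stand-in that mimics a pass-through input annotation."""

    def __init__(self):
        self.inputs = []

    def input(self, path):
        # Record the dataset location, then return it unchanged so the
        # call can be nested inside an existing read function.
        self.inputs.append(path)
        return path

    def add_input(self, path):
        # Separate-line form: record the path without returning it.
        self.inputs.append(path)


recorder = RunRecorder()

# Write a tiny CSV file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "signs.csv")
with open(csv_path, "w", newline="") as f:
    csv.writer(f).writerows([["label", "count"], ["stop", "3"]])

# Inline form: the annotation wraps the path passed to open().
with open(recorder.input(csv_path), newline="") as f:
    rows = list(csv.reader(f))
```

Because the annotation returns its argument, the read call behaves exactly as it did before the annotation was added; the only difference is that the path is now on record.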
ds.parameter() and ds.metric() record information about your machine learning model. So if you have, for example, a gradient-boosted decision tree model, you might have things like ds.parameter('learning_rate', 0.1) or ds.metric('accuracy', results.accuracy), where the learning rate is a hyperparameter and the accuracy is a metric of the model. These are like ds.input() in the sense that you can use ds.add_parameter() and ds.add_metric() on separate lines as well, and you can of course have many parameters for a given model.
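To make the two styles concrete, here is a minimal sketch of inline versus separate-line recording. It again uses a hypothetical RunRecorder stand-in rather than the Dotscience library, and it assumes, as the comparison with ds.input() suggests, that the inline forms hand their value back to the caller. The toy train() function is invented purely for illustration.

```python
class RunRecorder:
    """Hypothetical stand-in for pass-through parameter and metric calls."""

    def __init__(self):
        self.parameters = {}
        self.metrics = {}

    def parameter(self, name, value):
        # Record the hyperparameter and hand the value back to the caller.
        self.parameters[name] = value
        return value

    def metric(self, name, value):
        # Record the metric and hand the value back to the caller.
        self.metrics[name] = value
        return value

    def add_parameter(self, name, value):
        # Separate-line form: record only.
        self.parameters[name] = value

    def add_metric(self, name, value):
        # Separate-line form: record only.
        self.metrics[name] = value


def train(learning_rate, steps):
    """Toy 'training': gradient descent on f(x) = (x - 3)^2."""
    x = 0.0
    for _ in range(steps):
        x -= learning_rate * 2 * (x - 3)
    return x


recorder = RunRecorder()

# Inline form: the recorded values flow straight into the call.
x = train(
    learning_rate=recorder.parameter("learning_rate", 0.1),
    steps=recorder.parameter("steps", 50),
)

# Separate-line form: the training code itself is left untouched.
recorder.add_metric("final_error", abs(x - 3))
```

The inline form keeps the recorded value and the value actually used in one place, so they cannot drift apart; the separate-line form is less intrusive when you would rather not touch an existing call.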
ds.model() is used to say that this resulting object from an analysis is a model. So you might say something like, for a Keras deep learning model in TensorFlow:
ds.model(tf, 'CIFAR10', model.save('model', save_format='tf'), classes='classes.json') which makes it suitable for deployment via our usual deploy -> Docker -> Kubernetes path.
ds.publish(), as mentioned, indicates the end of a run.
In addition to the above main functions, there are a few more, including ds.run(), ds.summary(), ds.output()/ds.add_output()/ds.add_outputs(), ds.connect() (on the command line interface), ds.end(), ds.error()/ds.set_error(), ds.set_description(), ds.add_parameters(), ds.add_metrics(), and ds.label()/ds.add_label()/ds.add_labels().
For more details, see the Dotscience Python library documentation.
To give you a better idea of how these ds() annotations are used in practice, here are a couple of examples in which the ds() additions have been highlighted.
In this simple data preparation example, we import a set of images of road signs and sample them to make small and large training, validation, and testing sets. The lines with added Dotscience Python library annotations are highlighted in gray.
Figure 1: Data preparation example code, augmented with Dotscience Python library (ds) functionality
The library is imported as ds in the usual way. We then indicate the start of the run using ds.start(). The pickle.load includes a ds.input(), and the dump includes a ds.output(), recording those as versioned datasets. Finally, we publish the run back to Dotscience using ds.publish(). There is then a second run, showing that multiple runs can be created in the same file.
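The overall shape of such a data preparation file can be sketched as follows. This is not the code from Figure 1: it uses a hypothetical RunRecorder stand-in in place of the real library so that it runs anywhere, the file names and the sample() helper are invented for illustration, and a list of integers stands in for the image set. What it does mirror is the pattern described above: a start, an annotated load, an annotated dump, a publish, and then a second run in the same file.

```python
import os
import pickle
import random
import tempfile


class RunRecorder:
    """Hypothetical stand-in; the real library is imported as ds."""

    def __init__(self):
        self.runs = []        # completed runs
        self._current = None  # run currently being recorded

    def start(self):
        self._current = {"inputs": [], "outputs": []}

    def input(self, path):
        self._current["inputs"].append(path)
        return path

    def output(self, path):
        self._current["outputs"].append(path)
        return path

    def publish(self):
        self.runs.append(self._current)
        self._current = None


recorder = RunRecorder()
work_dir = tempfile.mkdtemp()
full = os.path.join(work_dir, "signs.p")
with open(full, "wb") as f:
    pickle.dump(list(range(100)), f)  # placeholder for the image set


def sample(src, dst, n, seed):
    """One run: load the full set, sample n items, save the subset."""
    recorder.start()
    with open(recorder.input(src), "rb") as f:
        data = pickle.load(f)
    subset = random.Random(seed).sample(data, n)
    with open(recorder.output(dst), "wb") as f:
        pickle.dump(subset, f)
    recorder.publish()


small = os.path.join(work_dir, "small.p")
large = os.path.join(work_dir, "large.p")
sample(full, small, 10, seed=0)  # first run
sample(full, large, 50, seed=1)  # second run in the same file
```

Each start/publish pair yields its own record of inputs and outputs, which is what allows several runs to live in one notebook or script.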
In this second example, we train a convolutional neural network on the prepared sample of road sign images. Note that not all the lines of the analysis are shown, just those notebook cells that contain lines with ds() functions added. The highlights this time are in yellow and black.
Figure 2: Machine learning model training example code, similarly augmented
The import as ds is as above. We then follow with ds.start() and several ds.input() calls. In this case ds.parameter() has also been used to record the dataset name. We then use ds.parameter() in the compilation and training of the network, and ds.summary() is another way to report results. Finally, we call ds.output() and ds.publish(). There are also some ds.label() calls. Note that a ds.model() could also be added to this code.
The result of adding annotations such as the above is that when your code is executed on the system using your runner (the machine supplying the computation), everything that you have requested via the ds() functions gets recorded and versioned. This covers the runs within a project, together with their code, datasets, models, parameters, metrics, and so on. Other features, such as the data and model provenance graph, then get created automatically.
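To see why recorded inputs and outputs are enough to build a provenance graph, consider the sketch below. The run record format, the run names, and both helper functions are assumptions made for illustration; this is not how Dotscience stores or computes provenance internally. The idea is simply that each input becomes an edge into a run and each output an edge out of it, after which lineage is a graph traversal.

```python
def provenance_edges(runs):
    """Return (source, target) edges linking artifacts through runs."""
    edges = []
    for run in runs:
        for src in run["inputs"]:
            edges.append((src, run["name"]))
        for dst in run["outputs"]:
            edges.append((run["name"], dst))
    return edges


def upstream(artifact, edges):
    """Return every node that the given artifact depends on."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    seen, stack = set(), [artifact]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical run records of the kind the annotations would produce.
runs = [
    {"name": "prepare", "inputs": ["signs.p"], "outputs": ["train.p", "test.p"]},
    {"name": "train", "inputs": ["train.p"], "outputs": ["model"]},
]

edges = provenance_edges(runs)
lineage = upstream("model", edges)
```

Here lineage contains the training run, the training set, the preparation run, and the original data file, but not test.p, which the model never consumed.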
While it takes some work to annotate your code using the Dotscience Python library, the resulting analyses are fully versioned, and thus truly reproducible, accountable, and auditable. It is likely no less work (and probably more) to use some other method to achieve these same ideals, which are crucial for the success of modern machine learning in the enterprise.