How Dotscience works.

Dotscience Architecture

Dotscience consists of a hub and one or more runners. The hub is a central repository for projects (containing runs), datasets (or pointers to S3 datasets) and models (in the model library). The runners are where the actual work (data engineering, model training, etc.) happens. Runs which generate labelled models automatically show up in the model library.

Runners are connected to the hub by starting a Docker container (dotscience-runner) which opens a gRPC connection to the hub and awaits instructions. The user then sends instructions to start tasks (interactive Jupyter sessions, or ds run CLI tasks) on a runner. When the runner receives such an instruction, it starts a container called the dotscience-agent, which synchronizes datasets and workspaces (mounted as the home directory from the task's perspective) onto the runner. Workspaces and datasets are ZFS filesystems managed by Dotmesh. Once the data is synchronized, the agent spawns another container running the user's specified Docker image (in the case of ds run) or the bundled jupyterlab-tensorflow container (in the case of Jupyter).

Run tracking is like Git for executions of code

Central to Dotscience is the notion of Runs. As a run tracker, Dotscience is a bit like "Git for executions (runs) of your code". Instead of just tracking source files on disk, Dotscience tracks the interaction between code, input data and output data/models, along with the logs and metrics (such as model accuracy) that are side-effects of these runs. This is necessary for machine learning because it has many more moving parts than regular software engineering (data, parameters, models and metrics as well as code), and they all need to be tracked to make ML teams productive and to ensure that models have proper reproducibility and provenance. With this tracking in place, you can trace back from a version of a model to how it was trained, the data it was trained on, and, transitively, where that data came from. That relieves the burden of manual tracking, helps with forensic debugging of model issues, and makes it much easier to comply with regulations that require you to track your work.
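
To make that concrete, here is a rough sketch of what an instrumented training script can look like with the Dotscience Python library. ds.publish is the only call mentioned later in this post; the other function names and signatures in this sketch are assumptions from memory and may differ in the version of the library you have installed.

```python
# A rough, illustrative sketch -- not copied from the Dotscience docs.
# ds.publish is mentioned elsewhere in this post; the other calls
# (start, input, parameter, summary, output) are assumptions about the
# Dotscience Python library's API and may differ in your version.
import dotscience as ds

ds.start()                                   # begin tracking a new run

train_path = ds.input("data/train.csv")      # declare an input for provenance
lr = ds.parameter("learning_rate", 0.01)     # record a hyperparameter

# ... load train_path, train a model with learning rate lr ...
accuracy = 0.93                              # placeholder for a real metric

ds.summary("accuracy", accuracy)             # record a metric for this run
ds.output("model/model.h5")                  # declare the produced artifact

ds.publish("trained baseline model")         # emit the run metadata
```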

See also our blog post about how Run tracking liberates ML teams.

How runs are detected

A component called the committer runs within the dotscience-agent process and watches for new runs (in ds run, the run metadata is written to stdout by the workload; in Jupyter, it is written into the notebook itself and saved to disk, which acts as the trigger). When a new run occurs, the committer automatically creates a new lightweight filesystem snapshot and synchronizes it to the hub. This snapshot contains all the code, data, and metadata such as run parameters and metrics.

In the case of Jupyter tasks, the committer continuously watches for runs by detecting changes to notebook files which contain the Dotscience metadata JSON written by ds.publish in the Dotscience Python library. In the case of ds run command tasks, the run metadata is written to the stdout of the process by the Dotscience Python library and picked up at the end of the run by the dotscience-agent.
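
The exact run-metadata format is internal to Dotscience, but the detection idea can be sketched in a few lines: the workload emits a delimited JSON record describing the run, and a watcher scans the captured output for those delimiters. The marker strings and fields below are invented purely for illustration.

```python
# Rough illustration of stdout-based run detection. The BEGIN/END markers
# and metadata fields are invented for this sketch; the real format used
# by the Dotscience committer is internal to the product.
import json
import re
import sys

RUN_RE = re.compile(r"\[\[RUN-BEGIN\]\](.*?)\[\[RUN-END\]\]", re.DOTALL)

def extract_runs(captured_output):
    """Return every run-metadata record found in a captured output stream."""
    return [json.loads(blob) for blob in RUN_RE.findall(captured_output)]

if __name__ == "__main__":
    captured = sys.stdin.read()   # e.g. the stdout of a finished ds run workload
    for run in extract_runs(captured):
        print("detected run:", run.get("description"), run.get("summary"))
```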

Efficient lightweight filesystem snapshots

Because Dotscience uses Dotmesh, which uses ZFS, it can very efficiently synchronize changes to workspaces and datasets, both of which can contain large data files, between the hub and the runners. Only the blocks that have changed on disk from one run to another need to be synchronized to the hub, and because ZFS knows which blocks have changed, there's no need to scan or hash large files. ZFS can support multi-petabyte datasets and billions of files.

Bring your own compute

With Dotscience you can attach your own machines as runners to the platform, simply by running the Docker commands given in the runners UI or by using the ds runner subcommand in the CLI. The only requirements for a runner are Docker and an internet connection. It's not even necessary to have a public IP address, and yet you can access the Jupyter container running on a runner from anywhere, just by logging in to the Dotscience Hub. How does this work?

The trick is that we start a tunnel container on the runner, which makes an outbound connection to the Dotscience tunnel service and securely exposes the Jupyter container as a subdomain of app.cloud.dotscience.com. When a connection is made from the user's browser to the tunnel URL, it gets proxied through the tunnel service to the connected runner and on to the Jupyter container, even if the runner itself is behind NAT or a firewall which only allows outbound connections. This gives you the flexibility to attach any available compute resource to the cluster and still allow users to log in from anywhere, while managing the work in a central location (the hub).
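
If you're curious what this kind of reverse tunnel looks like in principle, here is a minimal, purely illustrative sketch (not the Dotscience tunnel implementation): the runner dials out to a relay, so it never needs an inbound port, and then shuttles bytes between the relay and the local Jupyter port.

```python
# Purely illustrative reverse-tunnel sketch -- NOT the Dotscience tunnel.
# The runner dials OUT to a relay (so no inbound ports or public IP are
# needed) and shuttles bytes between the relay and the local Jupyter port.
import asyncio

RELAY = ("tunnel.example.com", 443)     # hypothetical relay endpoint
LOCAL_JUPYTER = ("127.0.0.1", 8888)     # Jupyter listening on the runner

async def pump(reader, writer):
    # Copy bytes one way until the other side closes.
    while True:
        data = await reader.read(65536)
        if not data:
            break
        writer.write(data)
        await writer.drain()
    writer.close()

async def serve_one_session():
    # Outbound connection from the runner to the relay...
    relay_r, relay_w = await asyncio.open_connection(*RELAY)
    # ...and a local connection to Jupyter; browser traffic arriving at the
    # relay comes down the same outbound socket and is relayed here.
    jup_r, jup_w = await asyncio.open_connection(*LOCAL_JUPYTER)
    await asyncio.gather(pump(relay_r, jup_w), pump(jup_r, relay_w))

asyncio.run(serve_one_session())
```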

Runners enable "hybrid cloud" architectures

Because Dotscience uses Dotmesh for the workspace and dataset filesystems (which can be mirrors of S3 buckets), because Dotmesh uses ZFS, and because ZFS supports zfs send and zfs receive to stream snapshots between any nodes regardless of the underlying infrastructure, it is possible to synchronize data from any Linux machine to any other Linux machine, even if they are running in different environments or on different cloud providers.

This makes it easy to construct "hybrid" architectures where, for example, the hub runs on one cloud provider and one or more of the runners run on a different cloud (say, because you want access to data services provided by that cloud, like BigQuery on GCP from a GCP runner). Or you can run the hub in the cloud but attach a powerful GPU machine you have in your office or data center, such as an NVIDIA DGX server, which makes an awesome Dotscience runner.

Integration with CI systems to trigger model training

It is possible to integrate Dotscience with your CI system so that models are automatically trained, and their metrics and provenance published to Dotscience, whenever you push your code to version control. Simply configure your CI job to run, for example, ds run -d --repo [email protected]:org/repo --ref $CI_COMMIT_SHA python train.py. This way, the model training (which might take some time) happens asynchronously in Dotscience, freeing up your CI runners to carry on running tests. Every model training is fully tracked and lands in the Dotscience model library, from where it can easily be deployed (via API or clicky buttons) and then statistically monitored.

Read more about this use case in our ds run CI integration tutorial.

Deploying to production

Beta feature: contact us to enable on your account

Dotscience supports deploying models to production by integrating with a CI system. When a deployment is triggered in Dotscience, it starts a CI job which pulls the model files from a Dotscience S3-compatible endpoint, builds an optimized container image for that model, and pushes it to a Docker registry, from where a continuous delivery tool can deploy it to, for example, a Kubernetes cluster.
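
As a rough sketch, the "pull the model files" step of such a CI job could be a few lines of boto3 pointed at the S3-compatible endpoint. The endpoint URL, bucket, key and credentials below are placeholders, not real Dotscience values.

```python
# Sketch of the artifact-pull step such a CI job might run before building
# the model image. The endpoint, bucket, key and credentials are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://example-dotscience-hub/s3",   # S3-compatible endpoint (placeholder)
    aws_access_key_id="CI_ACCESS_KEY_ID",               # injected from CI secrets
    aws_secret_access_key="CI_SECRET_ACCESS_KEY",
)

# Fetch the model artifact for a specific run, then hand the local file to
# `docker build` (or similar) in a later step of the CI job.
s3.download_file("models", "my-project/run-1234/model.h5", "model.h5")
```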

Monitoring in production

Beta feature: contact us to enable on your account

Dotscience enables statistical monitoring with a component called the Dotscience model proxy. This service intercepts requests and responses to and from TensorFlow Serving (or similar services). Using the API, users can choose which parameters they want to capture for statistics. This integrates with Prometheus to, for example, let you monitor the distribution of predictions in a categorical model (one which predicts what category of thing a certain input is, such as predicting road signs from images). You can then use Prometheus and Grafana to create dashboards of the statistics of your models in production, in addition to the usual RED metrics (request rate, errors, duration) that you would want to monitor for any microservice.
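
As an illustration of the idea (not the model proxy's actual code), counting categorical predictions and exposing them to Prometheus can be as simple as this, using the standard prometheus_client library; the response shape is hypothetical.

```python
# Conceptual stand-in for what the model proxy does for categorical models:
# tally each predicted class and expose the counts to Prometheus.
# This is not the Dotscience model proxy's actual code.
import time
from prometheus_client import Counter, start_http_server

PREDICTIONS = Counter(
    "model_predictions_total",
    "Predictions served, by predicted class",
    ["predicted_class"],
)

def record(response_json):
    # Hypothetical response shape: {"predictions": [{"class": "stop_sign"}, ...]}
    for p in response_json.get("predictions", []):
        PREDICTIONS.labels(predicted_class=p["class"]).inc()

if __name__ == "__main__":
    start_http_server(9100)                 # Prometheus scrapes counters from here
    record({"predictions": [{"class": "stop_sign"}]})
    time.sleep(60)                          # keep the endpoint up so it can be scraped
```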

Technology choices

Dotscience is primarily written in Go, with React used for the JavaScript web frontend, and Python (of course) used for the Dotscience Python library. The JupyterLab plugin is written in TypeScript and Python.

Integrations
Jupyter: Work in Jupyter within Dotscience and see your run history in our Jupyter plugin.
Python: Instrument your data and model runs with the Dotscience Python library for full tracking.
AWS: Launch a private Dotscience deployment with a few clicks in the AWS Marketplace.
Docker: All work is automatically containerized. Bring your own images when running scripts.
CircleCI: Trigger runs from a CI job to track model training in CI. Works with Jenkins & other CI systems too.
Git: Give Dotscience access to your Git & GitHub repos to automatically check out code.
TensorFlow: Automatically monitor categorical predictions with our model proxy; works with TFX.
Prometheus: The model proxy is integrated with Prometheus, Grafana & Alertmanager for monitoring & alerting.
S3: Access data in S3 from within Dotscience, with versioning & provenance integration.
Kubernetes: Attach a Kubernetes cluster as a runner, and deploy models into Kubernetes via CI.
And many more: Integrates with any Python ML framework or library, CI system, infrastructure and deployment system…