Collaboration Tooling for End-to-End ML Data & Model Management


“The world of ML has a lot to learn from all the best practices developed to handle the Software Engineering lifecycle in the last 10 years. Dotscience has the potential to bring some of those hard-learned lessons to the ML world without forcing data scientists and researchers to completely abandon their tools of choice, like Jupyter Notebooks. It's a bold proposition and has the potential to make a huge impact.”

Luca Palmieri, Machine Learning and Data Engineering at TrueLayer

Dotscience Solves the Biggest Pain Points in Operationalizing AI

Complexity comes from too many moving parts

Building your own process & platform gets your first models into production, but that's where the problems begin.

It's easy to get mired in a mess of models, code, datasets, and metrics. Not to mention the infrastructure complexity of self-service Jupyter environments, clusters & pipelines. See the difference with a platform which manages the complexity for you.

Building on top of Docker, Git and S3 gets you part of the way to a reproducible model library. But wiring up the pieces with an efficient, intuitive user experience requires time and expertise. Use the open source tools you love with our tracked workflows so you can build better models by staying focused on the ML.

Friction comes from manual setups and poor knowledge transfer

Get your models into production faster and keep them performing reliably.

In a competitive hiring landscape, ensure optimal team productivity with a runnable ML knowledge base to eliminate silos. Remove key person risk by making it easy for anyone to pick up where another left off.

Tired of wasting time manually managing metrics, models and data? There's a better way: snapshot your complete ML workspace (models, data, code, environment, metrics) and make it reproducible with no manual record-keeping. Easily see what's been tried before to avoid going down rabbit holes.

Risk comes from failing to capture key information

Guarantee compliance with current and future regulation.

If stakeholders contest decisions made by a model, forensically reproduce any issues and guarantee they are fixed. Reduce financial and reputational risks from AI.

Pinpoint exact versions of training data and where it came from. Debug model issues with confidence about data provenance and environment. Connect isolated data engineering and model training systems in a single view, a window into your data & ML pipelines together.

Open and Interoperable

Dotscience is open and interoperable. Unlike other ML platforms, the design philosophy of Dotscience is to explicitly avoid being a walled garden. Connect to external data sources, while recording provenance. Bring your own compute to our SaaS, or deploy easily in your own cloud. Connect to your own CI system, container registry, and Kubernetes clusters, or use our lightweight built-ins. Simple, but flexible. And at its core is the open source Dotmesh project, so you can always get your projects, models and metadata out.

End-to-End Data Engineering & Machine Learning Features

Run Tracker

Dotscience tracks, packages and links together every run that goes into the data engineering and model creation process. Discover previous work and see exactly how it was built by tracking every version of every element in the model development phase.
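
As a rough sketch, a single tracked run from a notebook or script follows the shape below. The ds.* calls and file names are illustrative of the pattern described above, not a definitive rendering of the Dotscience API.

```python
# Minimal sketch of one tracked run. The ds.* calls and file names are
# illustrative placeholders for the run-tracking pattern, not guaranteed API.
import dotscience as ds

ds.start()                          # mark the beginning of the run
ds.input("data/raw.csv")            # hypothetical input, recorded with its version
# ... data engineering or model training happens here ...
ds.output("data/clean.csv")         # hypothetical output, recorded with its version
ds.publish("clean the raw data")    # snapshot code, data, environment and metadata as one run
```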

Data Versioning

As part of tracking runs, Dotscience bundles with each run a complete snapshot of the project workspace filesystem and any dependent datasets, using copy-on-write technology to ensure that no more disk space is used than absolutely required to ensure reproducibility.

Provenance Graph

Trace from a model to its training data and back from that to the raw data, so that if stakeholders contest decisions made by a model, you can forensically reproduce inferences, and if there are issues, isolate and fix them.

Metric Explorer

Dotscience gives data science and ML engineering teams the unique ability to collaboratively track, record and share run metrics. Explore historic runs, and see relationships between hyperparameters & metrics so that you can gain insights into which hyperparameters to tune next, and make better decisions about where to invest time & effort.
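
For illustration, this is roughly the kind of parameter-versus-metric view the Metric Explorer gives you, reconstructed by hand with pandas and matplotlib. The run values below are made up; in Dotscience the view comes from your recorded runs with no extra code.

```python
# Illustration only: a hand-rolled version of the hyperparameter-vs-metric view.
# The run values are invented for the example.
import pandas as pd
import matplotlib.pyplot as plt

runs = pd.DataFrame({
    "learning_rate": [0.1, 0.03, 0.01, 0.003, 0.001],
    "accuracy":      [0.71, 0.79, 0.84, 0.83, 0.78],
})

# Plot accuracy against learning rate across runs to see where to tune next.
runs.plot.scatter(x="learning_rate", y="accuracy", logx=True)
plt.title("Accuracy vs. learning rate across runs")
plt.show()
```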

S3 Datasets

Attach versioned S3 buckets as Dotscience Datasets while still tracking reproducibility & provenance all the way to the source. Dotscience mirrors the S3 dataset to local storage for high performance and reduced latency, and keeps track of which versions are accessed during data engineering and model training for you.

Bring Your Own Compute

Attach any compute: laptop, GPU rig, enterprise data center or cloud instances. Be productive on a new runner in seconds, as Dotscience ensures an identical development environment even when you switch runners. Dotscience handles the storage and network complexity; all you need is an internet connection and Docker.

Pull Requests

Jupyter Notebooks are notoriously hard to use well with Git and GitHub. Dotscience lets you fork someone else's project, create new runs in notebooks and propose them back along with their metrics. See a clear, full notebook diff and merge conflicting changes with ease.

Deploy to Production

Deploy your best model into production with a click or an API call. Dotscience will automatically build optimized Docker images and deploy them to your choice of cluster or any other production environment.

Beta feature, contact us to enable it on your account.

Statistical Monitoring

Statistically monitor models to get an early warning when models behave unexpectedly. Monitor model behavior on unlabelled production data by analyzing the statistical distribution of predictions.
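
The general technique behind this kind of monitoring is to compare the distribution of live predictions against a reference distribution and flag drift. The sketch below shows that idea with a two-sample Kolmogorov-Smirnov test; it is not Dotscience's implementation, and the numbers are invented for illustration.

```python
# Sketch of statistical drift detection on unlabelled production predictions.
import numpy as np
from scipy.stats import ks_2samp

def drifted(reference_preds, production_preds, alpha=0.01):
    """Return True if the two prediction distributions differ significantly."""
    statistic, p_value = ks_2samp(reference_preds, production_preds)
    return p_value < alpha

# Example with made-up data: a shifted production distribution triggers a warning.
rng = np.random.default_rng(0)
baseline = rng.normal(0.7, 0.05, size=5000)   # predictions seen at validation time
live = rng.normal(0.55, 0.08, size=5000)      # predictions on unlabelled production data
if drifted(baseline, live):
    print("Warning: prediction distribution has shifted - investigate the model.")
```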

Beta feature, contact us to enable it on your account.

Deep Dive Demo

Key Sections

Introduction

Simple Demo - getting started

Advanced Demo - full lifecycle

Conclusions

 

Motivation & Beliefs [0:07]

  • AI has the potential to make a positive impact on the world
  • But as a discipline it's immature
  • We've seen lots of problems affecting AI efforts: wasted time, inefficient collaboration, manual tracking, no reproducibility or provenance, and no proper monitoring

We've been here before [0:37]

  • We've been here before: in the 90s software was siloed and slow
  • What changed? This movement called DevOps transformed the way we ship software
  • The same kind of paradigm shift is possible for AI
  • If the following four requirements can be met, DevOps for ML becomes achievable

DevOps for ML Requirements [1:05]

  • Reproducible: every model has to be reproducible. Someone else can come along 6 months later, re-run exactly the same training run of your model, and get more-or-less the same result.
  • Accountable: every model must be accountable. That means the basis on which it made its decisions must be recorded. And that means knowing exactly what data it was trained on and how that data came to be.
  • Collaborative: The development environment for models has to be collaborative. I need to be able to pick up where you left off and try different things without treading on your toes.
  • Continuous: Proper model development requires a continuous lifecycle. You're not done when you ship, and deploying a model into production is just the start of a process of continuously monitoring it and improving it as the world changes. So models have to be retrained and statistically monitored for drift.

ML is different to software engineering [2:01]

  • Why can't we achieve these requirements using the existing tools we have for software?
  • The reason is the software lifecycle is much simpler than the model development lifecycle.
  • In software you have code which gets tested and deployed and monitored, and then you change the code and it goes round the loop.
  • Machine Learning is more complex: sure you have code, but that's just one of the inputs.
  • So what you're doing is you're building these models that automatically make predictions based on patterns they've observed in the data.
  • The way you create these models is by training them on a certain version of the data, with a certain version of the code and certain parameters.
  • It's then that model artifact that's deployed into production and monitored.

Key Dotscience innovation: tracking runs [2:41]

  • So the key Dotscience innovation is that we're not just tracking versions of code.
  • We're tracking runs – these can either be data runs which happen when you're doing data engineering, or they can be model runs which happen when you're training a model.
  • In both cases, Dotscience is capturing and bundling together the complete context of everything that went into either creating an intermediate dataset or training a model
  • So the runs are all fully reproducible and you can connect data engineering to model training.
  • This means you can track back from a model running in production to exactly the context in which it was trained and recursively find out exactly what data it was trained on and where that data came from.
  • And all of this is done in an environment that's fully collaborative, so that people can learn from each other, try different things freely, and pick up where someone else left off.

Dotscience features [3:30]

  • In this demo I'm going to show you a number of features:
  • Track runs
  • Collaborate
  • Generate a provenance graph
  • Explore relationships between parameters and metrics
  • Deploy any model into production
  • Statistically monitor models once they're in production
  • Flexibly attach any compute
  • Attach external datasets from S3

Machine learning model lifecycle [3:51]

  • Everything I'm going to show you is in the context of this machine learning model lifecycle
  • So we're going to start with data engineering where raw data gets processed
  • Then we're going into model development where we iteratively try a bunch of models and parameters to get the best performing model
  • As we do this we might go back into data engineering to tweak the way we're doing it
  • Then once we have a model we're happy with that looks like it's accurate we can try it out by deploying it into production
  • And then we get to see actually how well it performs in real life, and based on statistical monitoring and retraining on new datasets, we can then go back to the beginning and do more data engineering to build new models and then go round the lifecycle again.

Demo 1: Simple demo of getting started [4:41]

  • Starting with a simple example which you can try yourself for free on our website:
  • Signing up for a new account
  • Fork a sample project: makes an editable copy of the project
  • Must add a runner. Dotscience has a Hub, which is a repository of runs, data, code and models, and you attach Runners to the platform: it's on the Runners that the actual work of data engineering or machine learning model training will be executed
  • You can add your own machine -- bring your own compute, or use a Dotscience-provided runner
  • Click the button to attach a Dotscience-provided runner: this will spin up a VM on Google Cloud and attach it automatically to your account so that you can play around
  • This VM will have Docker on it, and will automatically start the dotscience runner container which connects to the Hub and receives instructions
  • Once the runner is online and ready, the first instruction we give it is to start Jupyter
  • Possible to go down a CLI route but Jupyter is easier to start with
  • You'll see some log messages as Jupyter starts up

Hello Dotscience Jupyter notebook [6:22]

  • Dotscience is a run tracker which helps with reproducibility
  • You specify the start and end of run and publish the run from your notebook
  • When you run your notebook in JupyterLab you get a new run recorded in the Dotscience tab. Run metadata is shown in the notebook cell output
  • Look at the same run in Dotscience and you will see the run metadata, all the versions of the files involved, and a snapshot of the notebook as you used it
  • How to capture metrics in Dotscience?
  • Specify what you want to capture via your notebook, e.g. a parameter value or a summary statistic (see the sketch after this list)
  • Carry out the run, then go to Dotscience to see the run's outputs plotted alongside the same outputs from other runs. You can inspect these to see what inputs caused each value of the output per run
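
A hedged sketch of what those notebook cells might look like. The ds.parameter / ds.summary names are illustrative of the pattern described in the demo; check the Dotscience docs for the exact calls.

```python
# Sketch of capturing a hyperparameter and a summary statistic in one run.
# Call names are illustrative, not guaranteed API.
import dotscience as ds

ds.start()
ds.parameter("learning_rate", 0.01)   # an input you want to compare across runs
# ... run the training step here ...
ds.summary("accuracy", 0.87)          # an output metric plotted on the Explore tab
ds.publish("try a lower learning rate")
```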

Tracking data in Dotscience [9:02]

  • Ingest data
  • Track input/output data per run
  • View a provenance graph per run
  • View the version of the data and the notebook for every run

Training an ML model in Dotscience [10:23]

  • Using: linear regression on a CSV file, outputting the model as a Pickle file (a sketch of this kind of run follows this list)
  • Track the notebook, metrics, data in each run
  • View provenance graph per run
  • Tune the model and see how the error rate changes on the Explore tab
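
Putting those pieces together, a run like the one in this demo might look roughly like the following. The ds.* calls illustrate the tracking pattern; the file and column names are hypothetical.

```python
# Sketch of a tracked training run: read a CSV, fit a linear regression,
# pickle the model, and record the error as a run metric.
import pickle
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import dotscience as ds

ds.start()
ds.input("data/input.csv")                       # record the training data version (hypothetical file)
df = pd.read_csv("data/input.csv")
X, y = df[["feature_1", "feature_2"]], df["target"]

model = LinearRegression().fit(X, y)
ds.summary("mse", float(mean_squared_error(y, model.predict(X))))

ds.output("model.pkl")                           # record the model artifact
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
ds.publish("baseline linear regression")
```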

Demo 2: More realistic example of the complete data & ML model lifecycle in Dotscience [12:30]

  • Using: S3, GitHub
  • We'll look at: data engineering, model development, deploy into production, monitor in prod
  • Create new project, Roadsigns
  • Use a local runner for compute. Connect the runner to GitHub
  • Attach dataset which is in an S3 bucket
  • Add collaborators to give visibility and sharing

Data engineering in Dotscience [14:35]

  • Using: Python scripts, versioned in GitHub
  • Using: Script for ingesting data from S3
  • Split the dataset into training and test sets (sketched after this list)
  • Wrap each operation in a Dotscience run
  • Use ds run on the CLI, specifying the project, branch, GitHub repo and Docker image. Run metadata is output on the command line
  • View runs in Dotscience.
  • The S3 data is also versioned in Dotscience
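
As a rough illustration of the split step above (shown here without the Dotscience wrapping), the data-engineering script might look like this; the file names are hypothetical.

```python
# Sketch of a data-engineering step: split a raw dataset into training and test
# sets so each split can be tracked as a run output.
import pandas as pd
from sklearn.model_selection import train_test_split

raw = pd.read_csv("data/roadsigns_raw.csv")            # ingested from S3 in the demo
train, test = train_test_split(raw, test_size=0.2, random_state=42)

train.to_csv("data/roadsigns_train.csv", index=False)
test.to_csv("data/roadsigns_test.csv", index=False)
```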

Model development in Dotscience [19:40]

  • Using: Neural net using Jupyter in model training notebook
  • Using: the data from the previous step
  • Note that the example model is not very accurate - try a different subset of the data
  • Overwrite your files in place without worrying. Dotscience captures each change as a new version
  • View the updated plot in Dotscience - the new dataset shows better accuracy

Deploy to production with Dotscience [29:55]

  • View model in Dotscience model library
  • Deploy with a single click into CI system
  • Note: S3-compatible API
  • Note: Deployed to AWS Kubernetes cluster

Asking for help: Collaboration with Dotscience [34:29]

  • Collaborate with Danesh to see if he can improve the model performance
  • Danesh can see all the background of what I tried
  • Danesh takes a copy and tries some new things
  • Both of us have made different changes, we can merge them back together
  • The most accurate run so far comes from using some changes from both people
  • Now Danesh can make a pull request on the Dotscience project
  • The manager can see the history of progress in the accuracy chart

Statistical monitoring in production [44:06]

  • Go to Dotscience model library
  • Deploy to Gitlab CI
  • Note: Model less accurate on real data than expected.
  • View monitoring in Grafana and Prometheus - gives you the option to set up alerting on any unusual model behavior

How Dotscience achieves the DevOps for ML manifesto [47:50]

Dotscience integrations [49:10]

  • Jupyter
  • Python
  • AWS
  • Docker
  • CircleCI
  • Git
  • TensorFlow
  • Prometheus
  • S3
  • Kubernetes
  • Lots more.

Dotscience is available today [49:21]

  • SaaS/Cloud service with a free account
  • AWS on your private VPC
  • On-prem
  • Or any hybrid of the above

Dotscience is highly differentiated [49:45]

  • Accelerate AI projects
  • Run anywhere
  • Model accountability
  • End-to-end AI platform
Try Dotscience
Dotscience works with every Python ML framework
TensorFlow, PyTorch, scikit-learn, Microsoft CNTK, MXNet, Keras, Caffe, Theano & all other frameworks…