Dotscience as an alternative to better-known tools


Companies don’t just want to build ML models; they want to get them into production and generating value.


This leads to various common paths toward choosing which tools to use:

  • Open source
  • The tool their data scientists know and like
  • Paid products from big companies, which can provide free trials and large consulting teams to solve your problems
  • Well-known or popular names
  • A startup whose product matches what they want to do
  • Deciding they can build it themselves
  • Outsourcing 100%, with consulting as needed


These options are not mutually exclusive of course, but they can come with various downsides:

  • Commercial products are too expensive
  • The day-to-day reality of using the tools to get value from AI doesn’t match the hype
  • Startup vendor goes out of business, taking its product with it
  • Company finds after a year that they in fact can’t build it themselves
  • Tool is too large and complex to get anything to work
  • Tool is poorly documented and/or too hard to use
  • Nice product but requires vendor lock-in
  • Nice product but it’s on the cloud and our company and its data are not
  • Data scientist who likes the tool leaves, taking their knowledge with them
  • Tool can’t get approved by IT


Products and tools that fall into one or more of the categories above, and suffer from some of these problems, include the 3 major clouds (Amazon, Google, Microsoft); open source languages and tools like Python, Jupyter, Scikit-learn, and TensorFlow; well-known machine learning platforms like H2O and Databricks; and the giant new pile of tools for getting something into production - Docker, Kubernetes, and so on. It’s not an easy space in which to find a solution that actually works in production for your company.

Then there is Dotscience. While we too fall into some of the categories (e.g., random startup you’ve never heard of), if you have explored the above landscape and not found any satisfactory solution, we offer another option to consider.

Our main value propositions


Dotscience’s value can be distilled into this list:

  • Easy model deploy
  • Run on-premise, on the cloud, hybrid, or multicloud
  • Add MLOps
  • RACC: reproducibility, accountability, collaboration, continuous delivery
  • End-to-end

Easy model deploy

The biggest pain point for companies using AI right now is production: they want to get their models generating value now. Not in a few years when the AI cloud space has matured, and not when they have managed to hire a team of experts to help them. Most companies will never hire a team of experts, because there are not enough to go around.

Deploy should be easy on the cloud, right? Well, here is an example of the steps to deploy a single model, from a large and advanced AI company (Databricks) on the platform of an even larger company (Microsoft). It’s not that there is anything wrong here - in fact everything is well presented and documented - but you can see the level of complexity and the number of steps that have to go right, and this is a best-case, presumably cloud-native, scenario.

Figure 1: Part of the documentation for a model deployment on the cloud.

The steps are:

  1. Set up a Databricks cluster, with the MLflow and Azure ML SDK libraries installed via PyPI
    Create or load an Azure ML Workspace
  2. Build an Azure Container Image for model deployment
    Use MLflow to build a Container Image for the trained model
  3. Deploy the model to “dev” using Azure Container Instances (ACI)
    Create an ACI webservice deployment using the model’s Container Image
  4. Query the deployed model in “dev”
    Create a sample input vector
    Evaluate the sample input vector by sending an HTTP request
  5. Deploy the model to production using Azure Kubernetes Service (AKS)
    Create a new AKS cluster (or use an existing one)
    Deploy the model’s image to the specified AKS cluster
  6. Query the deployed model in production
    Evaluate the sample input vector by sending an HTTP request

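For a flavor of what this involves in code, here is a rough sketch of steps 1 to 3 using the MLflow and Azure ML SDKs of the time. The workspace details and model URI are placeholders, and exact function signatures vary between library versions, so treat this as illustrative rather than copy-paste ready:

```python
# A sketch of steps 1-3: create a workspace, build a container image for the
# model with MLflow, and deploy it to "dev" on ACI. All names in angle
# brackets are placeholders.
import mlflow.azureml
from azureml.core import Workspace
from azureml.core.webservice import AciWebservice, Webservice

# Step 1 (workspace part): create or load an Azure ML Workspace
ws = Workspace.create(name="<workspace-name>",
                      subscription_id="<subscription-id>",
                      resource_group="<resource-group>",
                      exist_ok=True)

# Step 2: use MLflow to build an Azure Container Image for the trained model
image, azure_model = mlflow.azureml.build_image(model_uri="runs:/<run-id>/model",
                                                workspace=ws,
                                                synchronous=True)

# Step 3: deploy the image to "dev" as an ACI webservice
aci_config = AciWebservice.deploy_configuration()
dev_service = Webservice.deploy_from_image(workspace=ws,
                                           name="model-dev",
                                           image=image,
                                           deployment_config=aci_config)
dev_service.wait_for_deployment(show_output=True)
```

And this is only the “dev” half; the AKS production deployment in steps 5 and 6 still follows.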

Compare this to the same procedure on Dotscience:

  • Go to model list
  • Click deploy
  • Send the model some data

Dotscience integrates with a continuous integration (CI) system to pull the files from an endpoint (by default Amazon S3-compatible), build an optimized Docker container image, add the image to a container registry, and deploy the model into production via a continuous delivery (CD) tool, on a suitable cluster (by default Kubernetes). None of this has to be configured by the user. You can then also click monitor, and you get model monitoring too, via Prometheus and Grafana.

Obviously the cloud setup has other features that are useful, not least the resources to keep improving their tools. But if you want production deployment right now, without needing a room full of experts, we can provide it.

(One side note, in the interest of candor, and as a data scientist trying not to turn my blog entries into marketing BS: observant readers will have noticed that it says “send the model some data”, not “click to send the model some data”. Currently the user still has to code this step: they can use the command line, or construct the command from their Jupyter Python (or other) code, to send data via HTTP POST. The cloud example provides functions to do the send instead of constructing commands. The main explanation for this is that we are still a startup, and we plan to make it more convenient as soon as possible. You can, however, click to monitor, as mentioned, while data is being sent.)
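
For illustration, the send itself is just an HTTP POST, so from Python it might look like the following sketch. The endpoint URL and the payload shape are hypothetical and depend on your deployment and your model:

```python
# A minimal sketch of "send the model some data" via HTTP POST. The endpoint
# URL and input format are hypothetical; use the endpoint shown for your
# deployment and your model's expected input schema.
import requests

endpoint = "https://<your-deployment-endpoint>/predict"  # placeholder URL
payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}          # example input vector

response = requests.post(endpoint, json=payload)
response.raise_for_status()
print(response.json())  # the model's prediction
```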

On-premise / cloud / hybrid / multicloud

The cloud represents an obvious future for computing. It makes sense to not have to manage massive compute hardware and data warehouses yourself when the cloud providers can do it for you, and do it better. Right now, however, most companies are not cloud-native. Some would like to be but are restricted by practicalities like compliance and data migration, or lack the expertise. Others have various good reasons to remain 100% on-premise.

Since it is unlikely that most of your employees can be trained to expert level on all 3 major clouds (Amazon, Google, Microsoft), or others, it may make sense to pick one and use its capabilities for bringing its tools to your premises, or to do a partial migration to a hybrid setup.

There is an alternative, however: a platform that is fundamentally hardware-agnostic. Dotscience provides this, because it operates on the premise of bring-your-own-compute. There is a central hub, which can be on any machine, and as many other machines as needed to do the compute, the runners. Machines can be physical machines, cloud instances, or both. The result is a setup that is free to be on-premise, on-cloud, hybrid, or multicloud, and can be changed over time as needed. And this setup is not a special case grafted onto an otherwise preferred vendor, but the main way of using the product.

MLOps

MLOps, or DevOps for machine learning, has emerged as an in-demand subfield as companies realize the full complexity of putting AI into production.

Doing MLOps properly, however, is not the natural state of most users capable of doing data science to the level needed to create value from AI: they are problem solvers, not software engineers, so documenting and versioning everything is not their usual mode of working. Even if it were, data science is more than just the application of DevOps tools to ML, because it needs to incorporate extra concepts including dataset versioning, model hyperparameters, and monitoring production performance for degradation due to data drift. We have described this in more detail in a previous blog entry.

Therefore, a tool that automatically adds the rigor that software engineers already use, while letting users focus on problem solving, greatly aids the actual adoption of MLOps without imposing undue burdens. Dotscience does this by using the Git style of working, and then automatically versioning everything: datasets, models, notebooks, and the code runs themselves. The run tracking is like Git for executions of code. Add the setup for easily doing a modern-style deploy, with models as containerized microservices, and you have MLOps infused throughout the data science process.
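
To make the idea of run tracking concrete, here is a minimal, library-agnostic sketch of the kind of record captured per run. This is a concept illustration, not the Dotscience API: each execution is stored with its parameters, metrics, and hashes of its input and output files, like a commit for a run of code:

```python
# A concept sketch of run tracking: every execution records its parameters,
# metrics, and content hashes of inputs and outputs, giving a Git-commit-like
# history of runs. Not the Dotscience API; an illustration of the idea.
import hashlib
import json
import os
import time

def file_hash(path):
    with open(path, "rb") as fp:
        return hashlib.sha256(fp.read()).hexdigest()

def record_run(params, metrics, input_files, output_files):
    run = {
        "timestamp": time.time(),
        "parameters": params,    # e.g. hyperparameters used
        "metrics": metrics,      # e.g. accuracy, loss
        "inputs": {f: file_hash(f) for f in input_files},
        "outputs": {f: file_hash(f) for f in output_files},
    }
    run_id = hashlib.sha256(json.dumps(run, sort_keys=True).encode()).hexdigest()[:12]
    os.makedirs("runs", exist_ok=True)
    with open(f"runs/{run_id}.json", "w") as fp:
        json.dump(run, fp, indent=2)
    return run_id
```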

RACC: Reproducibility, Accountability, Collaboration, and Continuous Delivery

The result of adding MLOps rigor to the data science process is that a number of improvements come naturally, many of which are in fact vital to getting value from AI that really works for a business. The RACC combination forms the Dotscience manifesto.

Reproducibility: In Dotscience, everything is versioned, including all datasets, artifacts, model settings, and executions of code. Thus everything done is by definition reproducible, including dataflows with large datasets, thanks to the underlying ZFS filesystem.

Accountability: This comes from data and models being versioned, and their provenance recorded. The user can then see exactly what data, code, and settings went into producing any given model, and thus any outputs it produces and decisions made from them are accountable.

Collaboration: Because each user works on their own version of the code in a project, analyses can be shared with collaborators and later compared or merged using the usual software development capabilities - diffs, merges, and pull requests - extended to notebooks and analyses. At a time when many more people are working remotely, such asynchronous collaboration is of particular importance.

Continuous Delivery: Via continuous integration and delivery (CI/CD), and mechanisms such as microservices, modern software can be delivered in minutes rather than months, and the same can be made true of ML models in production. Dotscience does this by default, enabling companies to update their models as needed to keep up with changing circumstances.

End-to-end with added MLOps: More than the sum of the parts

The various functionalities described here are of course not unique to us: other tools have easy deploy, or multicloud, or accountability, or end-to-end. What makes Dotscience a proposition to consider in the MLOps and data science space is that it has all of these, without requiring expert-level engineering to get it all to work.

This combination of rigorous MLOps, easy deploy, cloud/on-prem/both, and so on, combined with the ability to host an end-to-end analysis, makes possible a more-than-the-sum-of-the-parts value effect for the user. An analogy is the use of commands in UNIX-like operating systems: each command does one thing and does it well, and the power comes from using them in combination.

This summing of parts will vary in each individual case, but some examples are:

Problem: My data scientist left!
Solution: Dotscience’s automatic recording and versioning of runs, datasets, models, and deployments, combined with a provenance graph of the analysis from start to finish, gives a guaranteed-correct representation of the work done by a user, or a user plus their collaborators. The collaboration and sharing features mean that a new person can take over without having to rediscover any of the work, and the focus returns to problem solving. Having myself been in the position of needing to understand someone else’s project, I can attest to the value of a correct diagrammatic representation of it.

Problem: Why are my customer’s models failing?
Solution: Dotscience has the potential to be used as a platform by companies that in turn provide data science consulting or services to their own customers. In this scenario of customer model failure, there are two benefits. Firstly, Dotscience’s model monitoring and alerting means that you can see quickly that a model is failing, before the failure does serious damage through the customer taking wrong actions. Secondly, because of our auditing and provenance capabilities, you can see exactly which versions of analysis, datasets, and models are involved in the problem. And since you can import and use whatever libraries you like on the tool, these can include ones that produce explainability outputs, such as reason codes.
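
As a sketch of what such monitoring can look like under the hood, a prediction service can expose counters and latency histograms for Prometheus to scrape, using the standard prometheus_client library; the metric names here are illustrative choices of ours, not ones the platform prescribes:

```python
# A sketch of instrumenting a prediction service for Prometheus scraping,
# using the standard prometheus_client library. Metric names are illustrative.
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions_total", "Predictions served")
ERRORS = Counter("model_prediction_errors_total", "Failed predictions")
LATENCY = Histogram("model_prediction_seconds", "Prediction latency")

def predict(model, features):
    with LATENCY.time():
        try:
            result = model.predict(features)
            PREDICTIONS.inc()
            return result
        except Exception:
            ERRORS.inc()
            raise

start_http_server(8000)  # serve /metrics for Prometheus; alert via Grafana
```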

Problem: I need to find the best model.
Solution: A mixture of experts can perform better than any individual. While this is often true for machine learning models (combine the outputs), it can be true for teams as well, especially when team members’ expertise is complementary, e.g., machine learning and the industry domain. Dotscience’s collaboration capabilities let you work with others in the same way that a development team does when producing software with Git, but with the addition of sharing and versioning notebooks and proper MLOps. So the inputs of different people on an analysis can be shared and merged, the result remaining a coherent and correctly versioned project.
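
As a simple illustration of “combine the outputs”, averaging the predicted class probabilities of several already-trained models is one common ensemble, sketched here for scikit-learn-style models:

```python
# A minimal ensemble sketch: average predicted class probabilities across
# several trained models, then take the most probable class.
import numpy as np

def ensemble_predict_proba(models, X):
    # Each model is assumed to implement predict_proba, as scikit-learn
    # classifiers do.
    return np.mean([m.predict_proba(X) for m in models], axis=0)

def ensemble_predict(models, X):
    return np.argmax(ensemble_predict_proba(models, X), axis=1)
```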

Problem: I found a great model but now can’t remember how I got it.
Solution: This is automatically recorded by the combination of end-to-end coverage with data, run, and model versioning, and represented visually by the provenance graph. Dotscience also includes some basic visualization capability, so you can compare performance metrics if you don’t know which model was best.

Problem: I like cloud A’s storage and experiment tools, but we need cloud B’s deployment.
Solution: Since Dotscience is bring-your-own-compute, it is not a special case within the product to connect to one cloud and use its tools for storing data or running model experiments, and then connect to a second cloud to do a deployment. Our deep dive demo video shows an example of this, where the data starts in Amazon S3 storage, and the model is deployed on Google Cloud Platform. Of course, any or all of the steps can be on-premise too.

Problem: I am working with large datasets but our data prep is complex. This results in many dataflow steps that change the data, leaving many versions of it, taking up a huge amount of space.
Solution: Dotscience is built on top of Dotmesh, a tool for snapshotting application states, which is in turn built on top of ZFS. ZFS is a filesystem that is aware of both the logical structure of the files and the physical block storage underneath. This means it can track versions of large datasets by recording only the changes from one file version to the next. ZFS can support petabytes of data and billions of files; while most machine learning projects are nowhere near this large, it means that properly versioning your data remains tractable even when the data is large.
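
As a toy illustration of why this is cheap (a concept sketch, not how ZFS is implemented): if file versions are stored as content-addressed chunks, unchanged chunks are shared between versions, and only the changed chunks cost new space:

```python
# A toy sketch of copy-on-write-style versioning: store data as
# content-addressed chunks so that versions share unchanged chunks.
# A concept illustration only, not how ZFS works internally.
import hashlib

CHUNK = 1024 * 1024  # 1 MiB chunks
store = {}           # chunk hash -> chunk bytes

def snapshot(data: bytes):
    """Record a version as a list of chunk hashes, storing only new chunks."""
    hashes = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)  # an unchanged chunk is stored only once
        hashes.append(h)
    return hashes

v1 = snapshot(b"a" * CHUNK + b"b" * CHUNK)
v2 = snapshot(b"a" * CHUNK + b"c" * CHUNK)  # only the second chunk changed
assert len(store) == 3  # two 2-chunk versions, but only 3 chunks stored
```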

Conclusion

We have shown how the needs of companies that want to put their AI into production and gain value from it are not necessarily solved by existing tools, even the well-known or expensive ones. In particular, not everyone is cloud-native, has a room full of experts to get large and complex platforms or deployments to work for them, or can afford the budget to pay for a tool or the compute that they need to run machine learning at scale.

Dotscience will not be the perfect answer for everyone, but if these are questions that have not yet been satisfactorily answered for you, then its capabilities are worth a look.



Try it out!

You can try out Dotscience for free right now, or, for more details about the product, head over to our product page and our documentation page.

Written by:

Dr. Nick Ball, Principal Data Scientist (Product) at Dotscience