Luke Marsden, Founder & CEO at Dotscience, and Alex Spanos, Data Scientist at TrueLayer, on stage at MCubed London describing TrueLayer's Dotscience-powered MLOps pipelines
Dotscience are proud to announce TrueLayer as a customer and present a case study of how they’ve created an MLOps pipeline with Dotscience on AWS. Alex Spanos, Data Scientist at TrueLayer, explains…
About TrueLayer - and the business problem we’re solving with MLOps
TrueLayer is one of the world’s leading providers of financial APIs. Its aim is to grow the Open Banking economy by creating a global platform for companies to develop innovative new financial services and products.
TrueLayer offers one Open Banking platform with a range of customisable APIs - including a Data API which allows companies to access the financial data of their customers securely and efficiently, and a Payments API which enables seamless, secure and real-time online payments via Payments Initiation.
TrueLayer is fully integrated with major European banks and challengers including Monzo, Starling Bank and Revolut. It works with leading businesses including Zopa, Tandem, ANNA Money, Emma, Canopy, Plum, CreditLadder and GoodLord. TrueLayer has also created a network of partnerships with major companies including Visa.
TrueLayer Data API Product where Machine Learning Use Case Exists
Our use case for Dotscience relates to the Data API product. The Data API is a unified and simple API that enables applications to securely and reliably retrieve financial information of their users, such as transaction history and balances from credit cards and accounts. With our Data API developers do not need to implement and maintain a connection to each individual provider; we handle all this complexity behind the scenes.
ML use case: purchase classification
One of the features of the Data API is classification of purchase-related transactions; raw transaction data are enriched with a category, sub-category and the name of the merchant (when relevant). Below is a demonstration of a purchase-related transaction before and after enrichment.
TrueLayer Purchase Classification Problem
The first iteration of our classification system was rules-based. It relied on a lookup between transaction description groups and merchant name, obtained through manual annotation. Whilst this system provided broad coverage and high accuracy, there were a number of inherent limitations.
Pre-ML solution: limitations of rules based system
To overcome these limitations we developed a new Machine-Learning-powered classification service.
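For context, the lookup-based approach described above can be sketched roughly as follows. This is a minimal illustration only: the `RULES` table, the merchant names, the categories and the `normalise` helper are all invented here and are not TrueLayer's actual rules or code.

```python
import re
from typing import Optional

# Hypothetical lookup table: normalised description group -> enrichment.
# In a real rules-based system these entries would come from manual annotation.
RULES = {
    "acme coffee": {
        "merchant": "Acme Coffee",
        "category": "Food & Dining",
        "sub_category": "Coffee shops",
    },
    "example rail": {
        "merchant": "Example Rail",
        "category": "Transport",
        "sub_category": "Trains",
    },
}


def normalise(description: str) -> str:
    """Lower-case the text and replace digits/punctuation with spaces."""
    text = re.sub(r"[^a-z ]+", " ", description.lower())
    return " ".join(text.split())


def enrich(description: str) -> Optional[dict]:
    """Return the enrichment for a raw description, or None if no rule matches."""
    group = normalise(description)
    for key, enrichment in RULES.items():
        if group.startswith(key):
            return enrichment
    return None


matched = enrich("ACME COFFEE 0042 LONDON")
unmatched = enrich("NEW BAKERY LTD")
```

In this sketch a description with no matching rule simply returns `None`, so every new merchant needs a new hand-written entry before it can be enriched.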
Why Dotscience - business motivation
“One does not simply deploy ML into production”
Of course, introducing a Machine Learning workflow to an organisation has its own set of challenges. At a high level, our requirements were the following:
- Integrate the new service into our existing infrastructure and ensure that the DevOps and Data Science teams collaborate effectively;
- Track data science experiments (e.g. hyperparameters and metrics) and enable transparent collaboration between scientists;
- Enable reproducibility. As we operate in the financial services industry, it’s possible that one of our predictions may be challenged in the future. If that happens, we need the provenance of the model: the exact version of the data and the exact version of the code that we used to generate that prediction. This is necessary to achieve model accountability.
Dotscience was able to help us in each of those steps.
Over to Luke to describe the architecture of the joint solution…
At Dotscience, we really enjoyed working with the team at TrueLayer to develop the following architecture. The TrueLayer data science team, DevOps team and leadership team truly understood what’s required to make a production-ready MLOps solution.
And so, this is the architecture we co-developed. It is split into two main parts, and Dotscience is used in both parts.
The first part is the prototyping pipeline, where data scientists experiment to find which types of model work best on the available data. In this mode, Jupyter is useful for rapidly visualizing datasets and experimenting, but it is still important to keep track of the experiments that have been run, the dataset versions used and the resulting metrics.
The second part is the production pipeline. Once data scientists have pinned down the major variables (e.g. decided what kind of model they want to build), they switch to a “productionizing” mode: they move the code into a Python library (which can be shared between the pipelines) and create Python scripts to train models, which should be triggered from their CI pipeline as data and code change.
The production pipeline also deploys the models into production where they need to be monitored.
Looking at each of the pipelines in detail, following the steps highlighted by yellow arrows:
TrueLayer MLOps Prototyping and Production Pipelines
(1) Data used for model training is stored in a versioned S3 bucket, which can be connected to Dotscience as a Dataset. These datasets can be used directly in prototyping and production pipelines in Dotscience, and the provenance of resulting models can be tracked back to specific S3 object versions.
(2) Data scientists iterate on the prototype models directly in JupyterLab, which is provided as a hosted development environment accessible from the Dotscience web interface.
(3) Each “run” (either a data engineering step or a model training step) is tracked in Dotscience, and can be shared between users. The provenance of every model is recorded (i.e. which data it was trained on, and which version of the code and hyperparameters were used). Metrics such as accuracy or f-score can be shared between data scientists. As data scientists can each have different forks of a project, they can collaborate on notebooks using notebook diffs and merges (including conflict resolution).
(4) Once a likely winning strategy is determined by prototyping, the code for training such a model given the input data is refactored by the data scientists into an internal Python library (instrumented with Dotscience commands, like ds.metric() to keep track of hyperparameters and metrics), so that it can be shared between Jupyter and the Python scripts in the production pipeline. This Python library is version controlled in GitHub as usual (Dotscience will capture exactly which version of the library is used for runs in both the prototyping and production phases).
(5) Now that we know what kind of models we’re going to create, we’ll productionize the training scripts for them. Using the Python library created in the previous step, Python scripts for model training are created. These scripts are pushed directly to GitHub, and the code is developed using a normal “git flow” approach, having a master branch and feature branches, etc.
(6) Rather than training the models directly in CircleCI, the model training is executed within Dotscience by kicking off a ds run command from inside the CircleCI pipeline. Because the training is asynchronous, the CI system is not clogged up by slow model training steps.
(7) Running the productionized model training itself inside Dotscience also means that data, provenance, metrics and versioned models from the model training steps can be captured in Dotscience. What’s more, hyperparameter tuning can be accomplished by executing jobs that will explore a set of possible hyperparameters, such as using scikit-learn’s GridSearchCV capability.
(8) From the versioned models that are saved (along with their complete provenance history) in Dotscience, Docker images to run these models in production can be created. In the TrueLayer case, they integrated with Dotscience before it had the ability to build & deploy these Docker images automatically, so they are currently using their own Docker build step in CircleCI. However, it’s now possible to automatically deploy models to Kubernetes directly from Dotscience.
(9) Once models are running in production, they need to be monitored. Dotscience has the ability to automatically instrument models that are running in production with Prometheus monitoring. However, TrueLayer integrated with us before we had this ability, so they are also using their own Prometheus integration in their custom Docker images.
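To make steps (3) and (4) more concrete, here is a minimal sketch of a shared library function that records hyperparameters, metrics and provenance through a tracker object. The `RunTracker` class is an in-memory stand-in written for this example and is not the Dotscience API; the commit hash, S3 object key, version id and the trivial model are all placeholder assumptions.

```python
import hashlib
import json


class RunTracker:
    """In-memory stand-in for an experiment tracker (not the Dotscience API)."""

    def __init__(self, code_commit, data_versions):
        self.code_commit = code_commit            # e.g. a Git commit hash
        self.data_versions = dict(data_versions)  # S3 object key -> version id
        self.parameters = {}
        self.metrics = {}

    def parameter(self, name, value):
        self.parameters[name] = value
        return value

    def metric(self, name, value):
        self.metrics[name] = value
        return value

    def provenance(self):
        """Everything needed to trace a model back to its inputs."""
        return {
            "code_commit": self.code_commit,
            "data_versions": self.data_versions,
            "parameters": self.parameters,
            "metrics": self.metrics,
        }

    def fingerprint(self):
        """Stable hash of the inputs (code, data, parameters), excluding metrics."""
        inputs = {k: v for k, v in self.provenance().items() if k != "metrics"}
        encoded = json.dumps(inputs, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()


def evaluate_model(predict, descriptions, labels, tracker, hyperparameters):
    """Shared library function: callable from a notebook or a training script."""
    for name, value in hyperparameters.items():
        tracker.parameter(name, value)
    predictions = [predict(d) for d in descriptions]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return tracker.metric("accuracy", correct / len(labels))


# Placeholder values throughout: the commit, object key and version id are invented.
tracker = RunTracker(
    code_commit="0123abc",
    data_versions={"training/transactions.csv": "example-version-1"},
)
accuracy = evaluate_model(
    predict=lambda description: "groceries",  # trivial stand-in model
    descriptions=["GROCERY MART 12", "CITY RAIL TICKET"],
    labels=["groceries", "transport"],
    tracker=tracker,
    hyperparameters={"C": 1.0},
)
```

Passing the tracker in as an argument is one way to let the same function run unchanged from Jupyter and from a CI-triggered script. Excluding metrics from the fingerprint means two runs with the same code, data and hyperparameters can be recognised as the same experiment.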
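The hyperparameter exploration mentioned in step (7) could look something like the sketch below, which uses scikit-learn's GridSearchCV on a small text classification pipeline. The transaction descriptions, the categories, the choice of TF-IDF features with logistic regression and the parameter grid are assumptions made for illustration; they are not TrueLayer's model or data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented transaction descriptions and categories, for illustration only.
DESCRIPTIONS = [
    "GROCERY MART 3021", "FRESH FOODS MARKET", "CITY SUPERMARKET 12",
    "CORNER GROCERY STORE", "GROCERY MART 118", "FRESH FOODS EXPRESS",
    "CITY RAIL TICKET", "METRO BUS PASS", "RAIL TICKET ONLINE",
    "AIRPORT TAXI RIDE", "METRO RAIL TOPUP", "CITY TAXI RIDE",
]
LABELS = ["groceries"] * 6 + ["transport"] * 6

# Text features followed by a linear classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Every combination in this grid is trained and cross-validated.
PARAM_GRID = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# 3-fold stratified cross-validation, scored by macro-averaged F1.
search = GridSearchCV(pipeline, PARAM_GRID, cv=3, scoring="f1_macro")
search.fit(DESCRIPTIONS, LABELS)

best_params = search.best_params_   # the winning hyperparameter combination
best_score = search.best_score_     # its mean cross-validated F1
prediction = search.predict(["GROCERY MART 77"])[0]  # uses the refitted best model
```

Each candidate in `search.cv_results_` carries its own parameters and scores, which is the kind of per-run information an experiment tracker would record.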
In the future, we hope to collaborate further with TrueLayer to bring them into our automatic statistical monitoring capabilities. We’re also excited about being able to take a model that’s running in production and, with one click, spin up a Jupyter environment which includes the exact data, environment and code that the model was trained with. We are also looking at ways to label projects so that metrics can be easily compared between the prototyping and production phases.
To finish, here’s a screenshot of the three projects: the prototyping phase, the production phase, and a third hyperparameter tuning project within Dotscience:
TrueLayer MLOps pipelines in Dotscience - prototyping, production and hyperparameter tuning
TrueLayer are MLOps pioneers and we are proud to be working with them!
The solution we built together with them addresses two key business requirements in fintech, productivity and model accountability, across both the prototyping and production pipelines.
We are excited to replicate this dual-pipeline approach for other customers. If you are interested in solving these problems in your business too, try Dotscience today or contact us to get your own setup.