Collaboration between data scientists and data engineers - Part 2

Machine learning projects are often costly. According to the Dotscience 2019 survey, ‘The State of Development and Operations of AI Applications’, 63.2% of businesses reported spending between $500,000 and $10 million on their AI efforts. Yet the same survey reports that nearly a quarter of respondents (22.4%) list collaboration as a primary challenge in their AI development processes.

Check out part 1 of this two-part series here.

Shortcomings in collaboration between and among data scientists and data engineers

Collaboration isn’t only a problem between data scientists and data engineers; it is also problematic within data science and data engineering teams themselves.

Furthermore, many organizations with big data, e.g. financial institutions, operate in a world where compliance plays an important role in their business. It is imperative for such institutions to have a clear understanding and path from business decisions back to the machine learning (ML) models and the environments that produced them.

To better understand why collaboration in data science and data engineering is still such an issue, it is important to recognize that ML projects have more moving parts than typical software projects, and each of those parts must be reproducible if collaboration is to be efficient.

In a software engineering project, versioning the code and the environment has led to huge increases in efficiency. ML projects need to track more than that:

  • Model training data
  • Model test data
  • Data engineering and model training code
  • Model parameters and hyper-parameters
  • The data engineering or model training environment, including not only the specific versions of libraries used but in some cases details like the specific version of the GPU the model was trained on
  • The metrics, such as the accuracy of a model, that are achieved during the training process

We’ll now discuss each of these separately.

Model training and test data

In developing and sharing data features and machine learning models, data engineers and data scientists need to ensure consistency of data both over time and across different environments.

Two data scientists working on the same machine learning problem, using identical code and environments, can still reach different results if there are differences in their data sets. To mitigate this problem, data scientists working on the same machine learning problem should strive to:

  • use the same data (e.g. querying the same data lake or using a shared server with data)
  • use the same procedures for data pre-processing and data wrangling
  • implement data versioning if possible (or use a collaborative environment like Dotscience where data versioning is already integrated)
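
One lightweight way to check that two people really are using the same data is to compare content fingerprints of their files. The following is a minimal sketch, not a description of how Dotscience implements data versioning: the function names `dataset_fingerprint` and `same_data` are hypothetical, and it assumes the data set is a single file on disk.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks.

    Hypothetical helper: any byte-level change to the file changes the digest.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # Read fixed-size chunks so large data sets do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_data(path_a, path_b):
    """True if the two files have byte-identical contents."""
    return dataset_fingerprint(path_a) == dataset_fingerprint(path_b)
```

Recording the fingerprint alongside each training run makes it possible to tell later whether a difference in results could have come from the data. Note that this only detects differences; it does not store or restore earlier versions of the data.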

Although inefficiencies may arise when data scientists collaborate with each other, or data engineers with each other, they are often even more consequential when they occur between data scientists and data engineers. Data engineers are key not only for providing data scientists with initial data sets in the exploration phase. They are especially important in later phases, when designing appropriate data infrastructure and optimizing the pipelines that ingest data into ML solutions.

Data engineering and model training code

Machine learning development involves a lot of code at various stages of the model development lifecycle:

  • early analytics to identify business cases to be solved
  • extraction of datasets from data lakes and data warehouses
  • pre-processing of data and feature engineering
  • training of machine learning models, evaluation, parameter tuning
  • using trained models to make predictions on new data
  • continuous evaluation of ML models after introduction to production

As mentioned above, efficient code collaboration can be achieved by sharing code via a version control system such as Git, Subversion or Mercurial. To ensure consistency, it is important that team members use the same code not only when running machine learning models but also in other phases like data pre-processing. This not only makes collaboration more efficient, but also allows others in the team or company to easily discover previous work, meaning that features and ML models can be reused across the organization.

Model Parameters and Hyper-Parameters

Although the model parameters and hyper-parameters are, strictly speaking, a part of the code, they require special consideration due to how they are used. When data scientists are collaborating on training a model, it is highly inefficient if they are forced to look through all previous versions of someone else’s code to see which parameters were used.

Instead, data scientists should be able to see, at a glance, which parameters were used in previous training runs so that they can quickly understand and pick up from previous work done by their colleagues.
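
As an illustration of keeping parameters visible outside the code history, the sketch below appends the parameters of each training run to a shared JSON Lines file and reads them back. It is an assumption-laden example rather than a recommended tool: the function names `log_run` and `load_runs`, the record layout and the idea of a single shared log file are all hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_run(log_path, author, params):
    """Append one training run's parameters to a JSON Lines file (hypothetical format)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "author": author,
        "params": params,
    }
    # One JSON object per line keeps the log append-only and easy to diff.
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def load_runs(log_path):
    """Read all recorded runs back, oldest first."""
    with Path(log_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A colleague can then call `load_runs` on the shared file and scan the `params` of earlier runs without opening any previous version of the training code.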

Environment

An important reason why data scientists encounter difficulties in reproducing the results of other team members is differences in their computing environments. The results of machine learning models can differ when different versions of libraries are used, even when the training code and data are otherwise identical. The same can apply to different versions of the GPUs used in training.

Data scientists are often involved in multiple machine learning projects that require different computing environments, which makes this problem even trickier: to collaborate efficiently, they need to be able to swap quickly between environments so that they can fully reproduce the work done by others. There are many ways to capture the state of one’s environment. Common approaches rely on tools such as pip or virtualenv for Python. As mentioned in part 1, an increasingly popular approach is to use containers to capture the environment.
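
To make the idea of capturing an environment concrete, here is a minimal sketch that records the Python version, the platform and the installed package versions using only the standard library, and then compares two such snapshots. The function names `snapshot_environment` and `diff_environments` are hypothetical, and the sketch deliberately leaves out GPU and driver details, which would need separate tooling.

```python
import platform
import sys
from importlib import metadata


def snapshot_environment():
    """Capture interpreter, platform and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken or missing metadata
            packages[name] = dist.version
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


def diff_environments(env_a, env_b):
    """Return {package: (version_in_a, version_in_b)} for every mismatch.

    A package missing from one snapshot is reported with None on that side.
    """
    a, b = env_a["packages"], env_b["packages"]
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

Saving the snapshot next to each training run gives colleagues a quick way to see which library versions differ before they try to reproduce a result. A container image captures far more than this (system libraries, for example), so the sketch is a complement to containers rather than a replacement for them.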

Although containers may create an adoption curve for some, tools like Dotscience hide this complexity under the hood, which means that data engineers and data scientists can focus on the work they need to do without worrying about the underlying implementation.

Metrics

During model training it is common for data scientists to work iteratively and collaboratively. As different team members try different approaches to solving a problem, their results often diverge quickly, and it is important for colleagues to be able to see at a glance which approaches are proving most fruitful without having to re-run each other’s work. This is where metric tracking helps to make collaboration more effective.

Without proper tooling and processes, teams often resort to recording their metrics (and other information such as which data sets were used and which hyperparameters were chosen) in a spreadsheet or other manual tracking tool. This is error-prone and risky: there is often too much information to record easily, so people naturally take shortcuts and record only the variables they believe to be important. Once this occurs, teams lose the ability to fully reproduce each other’s work, making collaboration less efficient.
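
The alternative to a hand-maintained spreadsheet is to store metrics in a structured record together with the data version and parameters that produced them, so that runs can be compared programmatically. The sketch below is one hypothetical way to do that in plain Python. The `Run` class, its fields and `rank_runs` are illustrative names only and do not reflect any particular tool's data model.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One training run: who ran it, on which data, with which settings."""
    author: str
    data_version: str          # e.g. a content hash of the training data
    params: dict
    metrics: dict = field(default_factory=dict)


def rank_runs(runs, metric, higher_is_better=True):
    """Return the runs that report `metric`, best first.

    Runs that never recorded the metric are left out rather than guessed at.
    """
    scored = [run for run in runs if metric in run.metrics]
    return sorted(
        scored,
        key=lambda run: run.metrics[metric],
        reverse=higher_is_better,
    )
```

Given a list of `Run` objects collected from several team members, `rank_runs(runs, "accuracy")` shows which approach is currently ahead, and each entry still carries the data version and parameters needed to reproduce it.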

What’s the cost of inefficient collaboration between data scientists and data engineers, and what can organizations gain by minimizing friction?

Shortcomings in collaboration can lead to a wide array of costs for companies:

  • project delays and budget overruns due to time spent searching for the causes of differences between ML model results
  • repeat work as some team members duplicate work already done by others
  • difficulties in reproducibility can lead to less trust in models and weaker propagation and reuse of ML results across the organization
  • silos can reduce the speed of introducing new team members to projects
  • lack of collaboration between data scientists and data engineers around the introduction of offline models to production environments can result in the deterioration of results of those models in production
  • diminished trust of management in data science projects

Organizations that manage to improve collaboration between members of machine learning projects will benefit from reduced costs, time savings, better data-driven business decisions and improved trust in machine learning and data science initiatives.

How does Dotscience help to improve collaboration between data scientists and data engineers on your machine learning projects?

The Dotscience platform is a collaborative development environment at its heart and brings collaboration on ML projects to a new level by enabling team members of ML projects to:

  • use the same data and implement data versioning
  • use the same procedures for data pre-processing and data wrangling
  • share ML model information: hyperparameters, trained model weights, metrics
  • achieve full reproducibility across the ML lifecycle
  • collaborate on all variables in the model development lifecycle from the same platform
  • expose a library of ML models to promote their reuse and validation within the organization

If you’re struggling to scale your machine learning initiatives, there’s a strong chance that collaboration is a key issue.

To try Dotscience, get started with a sample project now.

Written by:

Mark Coleman