Collaboration Between Data Scientists and Data Engineers
Many business leaders are investing significant resources in machine learning (ML) solutions to help them generate insights about their businesses from vast amounts of data, and to improve their business results. Although businesses can vary significantly in the goals and needs that they set for their ML teams, we see the same challenges occurring across most companies.
One of these challenges is achieving effective collaboration between the two types of team member typically involved in such projects: data scientists and data engineers.
Although both data scientists and data engineers work with data, they differ in their skills, educational backgrounds, responsibilities, tools and goals.
Data Scientists
Data scientists work with large amounts of data (both structured and unstructured), applying their knowledge of statistics, programming and other disciplines to extract insights from data to help their companies improve their business decisions, services or products. The educational background of data scientists is most often Mathematics and Statistics, followed by Computer Science and Engineering. They usually have many responsibilities:
- collecting and processing data, and extracting trends, patterns and other valuable insights using statistical analysis and models
- working with stakeholders to identify opportunities for exploiting company data to drive better business decisions and new products
- training and deploying machine learning models into production with help from data engineers, business analysts and software developers
- identifying the relevance and accuracy of various data sets
- presenting data science results and statistical insights to key decision makers across the business
Data scientists typically use a wide range of different machine learning tools in their work, many of which can be seen in the Dotscience 2019 survey, ‘The State of Development and Operations of AI Applications’.
Data Engineers
Like data scientists, data engineers work with large amounts of data, but their primary focus is on creating and maintaining data infrastructure for analytical purposes. Data engineers most often have Computer Science and Engineering backgrounds. They also have a wide range of responsibilities:
- building, maintaining, testing and optimizing scalable data architectures
- developing dataset processes, workflows and modelling pipelines
- aligning data infrastructure with business demands
- discovering manual processes that can be automated
- providing support and collaborating with data scientists in the deployment of ML models
A data engineer most often works with tools and languages such as Hadoop, Spark, Scala, MongoDB, Cassandra, Kafka and Python. When data engineers work as part of machine learning teams, they are often called machine learning engineers.
Reproducibility Is the Key to Efficient Collaboration
To collaborate efficiently, colleagues must be able to reproduce each other’s work easily. In software engineering, we’ve seen how GitHub has enabled millions of developers around the world to collaborate with each other, because they can see the work others have been doing, reproduce it locally, continue it, and then contribute changes back.
For years, however, software engineers still struggled to collaborate effectively despite version control systems like Git. Although they could easily reproduce each other’s code, it was still difficult to reproduce each other’s environments: the libraries and other dependencies on which the code relies. This made collaboration less efficient and more error-prone, and gave rise to the phrase “it works on my machine!”
A large part of the reason containerization, driven by Docker, has become so popular is that containers address exactly this problem. They allow software engineers to version control, and therefore reproduce, each other’s environments as well as the code.
Shortcomings in Collaboration Between and Among Data Scientists and Data Engineers
Collaboration isn’t only a problem between data scientists and data engineers; it is also problematic within data science and data engineering teams themselves.
Furthermore, many organizations with big data, e.g. financial institutions, operate in a world where compliance plays an important role in their business. It is imperative for such institutions to have a clear, traceable path from business decisions back to the ML models and the environments that produced them.
To better understand why collaboration in data science and data engineering is still such an issue, it is important to understand that ML projects have more moving parts than conventional software projects, and all of these parts must be reproducible if collaboration is to be efficient.
Unlike in a software engineering project where versioning the code and the environment has led to huge increases in efficiency, ML projects need to track:
- Model training data
- Model test data
- Data engineering and model training code
- Model parameters and hyper-parameters
- The data engineering or model training environment, including not only the specific versions of libraries used but in some cases details like the specific version of the GPU the model was trained on
- The metrics, such as the accuracy of a model, that are achieved during the training process
We’ll now discuss each of these separately.
Model Training and Test Data
In developing and sharing data features and machine learning models, data engineers and data scientists need to ensure consistency of data both over time and across different environments.
Two data scientists working on the same machine learning problem, using identical code and environments, can still reach different results if there are differences in their data sets. To mitigate this problem, data scientists working on the same machine learning problem should strive to:
- use the same data (e.g. querying the same data lake or using a shared server with data)
- use the same procedures for data pre-processing and data wrangling
- implement data versioning if possible (or use a collaborative environment like Dotscience where data versioning is already integrated)
Although inefficiencies may arise when data scientists or data engineers collaborate among themselves, they are often even more consequential when they occur between data scientists and data engineers. Data engineers are key not only to providing data scientists with initial data sets in the exploration phase; they are especially important in later phases, when designing appropriate data infrastructure and optimizing the data pipelines that ingest data into ML solutions.
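One lightweight way to check that two people really are working from the same data is to compare content hashes of their files. The sketch below is only an illustration under stated assumptions, not a full data versioning system: it assumes the data set is a single local file, and the function name dataset_fingerprint is a hypothetical helper rather than part of any tool mentioned here.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's contents.

    Hypothetical helper: the file is read in chunks so that large
    data sets do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # iter() with a sentinel keeps reading until f.read returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If two colleagues obtain different fingerprints for what they believe is the same training file, the data has diverged before any code or environment difference needs to be investigated.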
Data Engineering and Model Training Code
Machine learning development involves a lot of code at various stages of the model development lifecycle:
- early analytics to identify business cases to be solved
- extraction of datasets from data lakes and data warehouses
- pre-processing of data and feature engineering
- training of machine learning models, evaluation, parameter tuning
- using trained models to make predictions on new data
- continuous evaluation of ML models after introduction to production
As mentioned above, efficient code collaboration can be achieved by sharing code via a version control system such as Git, Subversion or Mercurial. To ensure consistency, it is important that team members use the same code not only when running machine learning models but also in other phases, such as data pre-processing. This not only makes collaboration more efficient, but also allows others in the team or company to easily discover previous work, meaning that features and ML models can be reused across the organization.
Model Parameters and Hyper-Parameters
Although the model parameters and hyper-parameters are, strictly speaking, a part of the code, they require special consideration due to how they are used. When data scientists are collaborating on training a model, it is highly inefficient if they are forced to look through all previous versions of someone else’s code to see which parameters were used.
Instead, data scientists should be able to see, at a glance, which parameters were used in previous training runs so that they can quickly understand and pick up from previous work done by their colleagues.
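As a minimal sketch of what this could look like without any dedicated tooling, each training run can write its parameters to a small JSON file, and a helper can report what changed between two runs. The directory layout (one folder per run containing params.json) and the function names are assumptions made for illustration only.

```python
import json
from pathlib import Path


def save_params(run_dir, params):
    """Write one run's parameters to <run_dir>/params.json (assumed layout)."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    path = run_dir / "params.json"
    path.write_text(json.dumps(params, indent=2, sort_keys=True), encoding="utf-8")
    return path


def load_params(run_dir):
    """Read the parameters previously saved for a run."""
    return json.loads((Path(run_dir) / "params.json").read_text(encoding="utf-8"))


def diff_params(old, new):
    """Return {name: (old_value, new_value)} for every parameter that differs.

    A parameter that is absent from one run is reported as None on that side.
    """
    changed = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changed[name] = (before, after)
    return changed
```

A colleague picking up the work can then call diff_params(load_params("runs/a"), load_params("runs/b")) and see only the settings that differ, instead of reading through earlier versions of the training code.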
Environment
An important reason why data scientists have difficulty reproducing the results of other team members is differences in their computing environments. The results of machine learning models can differ when different versions of libraries are used, even when the training code and data are otherwise identical, and the same can apply when different GPU versions are used in training.
The problem becomes even trickier when data scientists are involved in multiple machine learning projects that require different computing environments: to collaborate efficiently, they need to be able to switch quickly between environments so that they can fully reproduce the work done by others. There are many ways to capture the state of one’s environment; common approaches for Python include packaging tools such as pip and virtualenv. An increasingly popular approach is to use containers to capture the environment.
Although containers may involve a learning curve for some, tools like Dotscience hide this complexity under the hood, which means that data engineers and data scientists can focus on the work they need to do without worrying about the underlying implementation.
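Short of adopting containers, the library-version part of an environment can be recorded alongside each training run and compared later. The following sketch uses Python’s standard importlib.metadata module; the function names and the shape of the snapshot dictionary are assumptions for illustration, and it does not capture GPU or driver details.

```python
import platform
from importlib import metadata


def capture_environment(packages):
    """Snapshot the Python version, OS, and installed versions of named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


def environment_mismatches(recorded, current):
    """Return the sorted names of packages whose versions differ between snapshots."""
    rec, cur = recorded["packages"], current["packages"]
    return sorted(
        name for name in set(rec) | set(cur) if rec.get(name) != cur.get(name)
    )
```

Before trying to reproduce a colleague’s result, comparing their recorded snapshot with a fresh capture_environment call shows immediately which libraries need to be aligned.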
Metrics
During model training it is common for data scientists to work iteratively and collaboratively. As different team members try different approaches to solving a problem, their results often diverge quickly, and it is important for colleagues to be able to see at a glance which approaches are proving most fruitful without having to re-run each other’s work. This is where metric tracking helps to make collaboration more effective.
Without proper tooling and processes, teams often resort to recording their metrics (and other information such as which data sets were used and which hyperparameters were chosen) in a spreadsheet or other manual tracking tool. This is error-prone and risky: there is often too much information to record easily, so people naturally take shortcuts and record only the variables they believe to be important. Once this occurs, teams lose the ability to fully reproduce each other’s work, making collaboration less efficient.
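A minimal sketch of automated tracking, assuming nothing more than a shared append-only file, is shown below. Each training run appends one JSON record containing its parameters, metrics and an identifier for the data it used, so nothing depends on someone remembering to update a spreadsheet. The record fields and function names are hypothetical, and a real team would more likely use a dedicated tracking tool.

```python
import json
from pathlib import Path


def log_run(log_path, run_id, params, metrics, data_version):
    """Append one run record as a single JSON line to a shared log file."""
    record = {
        "run_id": run_id,
        "params": params,
        "metrics": metrics,
        "data_version": data_version,  # e.g. a content hash or snapshot name
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def load_runs(log_path):
    """Read every run record back from the log, skipping blank lines."""
    with Path(log_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def best_run(runs, metric, higher_is_better=True):
    """Return the run with the best value of `metric`, or None if no run has it."""
    scored = [run for run in runs if metric in run["metrics"]]
    if not scored:
        return None
    pick = max if higher_is_better else min
    return pick(scored, key=lambda run: run["metrics"][metric])
```

Calling best_run(load_runs("runs.jsonl"), "accuracy") then answers the question of which approach is currently ahead without anyone re-running a colleague’s experiment.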
What’s the Cost of Inefficient Collaboration Between Data Scientists and Data Engineers, and What Can Organizations Gain by Minimizing Friction?
Shortcomings in collaboration can lead to a wide array of costs for companies:
- project delays and budget overruns due to time spent searching for the causes of differences between ML model results
- repeated work, as some team members duplicate work already done by others
- less trust in models, and weaker propagation and reuse of ML results across the organization, because results are difficult to reproduce
- slower onboarding of new team members onto projects because of silos
- lack of collaboration between data scientists and data engineers around the introduction of offline models to production environments can result in the deterioration of results of those models in production
- diminished trust of management in data science projects
Organizations that manage to improve collaboration between members of machine learning projects will benefit from reduced costs, time savings, better data-driven business decisions and greater trust in machine learning and data science initiatives.