Collaboration between data scientists and data engineers - Part 1


Many business leaders are investing significant resources in machine learning (ML) solutions to help them generate insights about their businesses from vast amounts of data, and to improve their business results. Although businesses can vary significantly in the goals and needs that they set for their ML teams, we see the same challenges occurring across most companies.

One of these challenges is how to achieve effective collaboration between the two types of team members typically involved in such projects: data scientists and data engineers.

Although both data scientists and data engineers work with data, they differ in their skills, educational backgrounds, responsibilities, tools and goals.

Data scientists

Data scientists work with large amounts of data (both structured and unstructured), applying their knowledge of statistics, programming and other disciplines to extract insights from data to help their companies improve their business decisions, services or products. The educational background of data scientists is most often Mathematics and Statistics, followed by Computer Science and Engineering. They usually have many responsibilities:

  • collecting and processing data, and extracting trends, patterns and other valuable insights using statistical analysis and models
  • working with stakeholders to identify opportunities for exploiting company data to drive better business decisions and new products
  • training and deploying machine learning models into production with help from data engineers, business analysts and software developers
  • identifying the relevance and accuracy of various data sets
  • presenting data science results and statistical insights to key decision makers across the business

Data scientists typically use a wide range of different machine learning tools in their work, many of which can be seen in the Dotscience 2019 survey, ‘The State of Development and Operations of AI Applications’.

Data engineers

Like data scientists, data engineers work with large amounts of data, but their primary focus is on creating and maintaining data infrastructure for analytical purposes. Data engineers most often have Computer Science and Engineering backgrounds. They also have a wide range of responsibilities:

  • building, maintaining, testing and optimizing scalable data architectures
  • developing dataset processes, workflows and modelling pipelines
  • aligning data infrastructure with business demands
  • discovering manual processes that can be automated
  • providing support and collaborating with data scientists in the deployment of ML models

A data engineer most often works with tools like Hadoop, Spark, Scala, MongoDB, Cassandra, Kafka and Python. When data engineers work as part of machine learning teams, they are often called machine learning engineers.

Responsibilities - Data Scientist and Data Engineer

Reproducibility is the key to efficient collaboration

For colleagues to collaborate efficiently, they must be able to reproduce each other’s work easily. In software engineering we’ve seen how GitHub has enabled millions of developers around the world to collaborate with each other. This works because they can see the work others have been doing, reproduce it locally, continue the work, and then contribute changes back.

For years, despite version control systems like Git, software engineers still struggled to collaborate effectively. Although they could easily reproduce each other’s code, it was still difficult to reproduce each other’s environments: the libraries and other dependencies on which the code relies. This made collaboration less efficient and more error-prone, leading to the phrase “it works on my machine!”

A large part of why containerization, driven by Docker, has been so popular is that containers address exactly this problem. They allow software engineers to version control, and therefore reproduce, each other’s environments as well as the code.
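To make the environment problem concrete, here is a minimal Python sketch of how two colleagues might record and compare the libraries installed on their machines. It is an illustration only, not how Docker or any particular tool works: the function names (snapshot_environment, fingerprint, diff_environments) are hypothetical, and the sketch assumes a Python 3.8+ interpreter and covers only Python packages, not system libraries.

```python
import hashlib
import sys
from importlib import metadata


def snapshot_environment():
    """Return a sorted list of 'name==version' strings for installed packages."""
    packages = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata is missing a name
            packages.add(f"{name.lower()}=={dist.version}")
    return sorted(packages)


def fingerprint(python_version, packages):
    """Hash the interpreter version plus the package list into one short ID."""
    payload = "\n".join([python_version, *sorted(packages)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def diff_environments(mine, theirs):
    """Return the package pins that appear in only one of two snapshots."""
    mine, theirs = set(mine), set(theirs)
    return {
        "only_mine": sorted(mine - theirs),
        "only_theirs": sorted(theirs - mine),
    }


if __name__ == "__main__":
    version = ".".join(map(str, sys.version_info[:3]))
    # Two machines with identical fingerprints have the same Python packages.
    print(fingerprint(version, snapshot_environment()))
```

If two team members get different fingerprints, diff_environments shows which package versions differ, which is often where an “it works on my machine!” bug comes from. A container image goes further by packaging the whole environment, including system-level dependencies, so that it can be versioned and shared alongside the code.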

In part 2 we’ll look at why reproducibility, and therefore collaboration, is harder for data science and data engineering teams and discuss each sub-challenge separately.

If you’re struggling to scale your machine learning initiatives, there’s a strong chance that collaboration is a key issue.

To try Dotscience, get started with a sample project now.

Check out part 2 here.

Written by:

Mark Coleman