As the field of data science matures, so do the titles of the people working in it. For a long time, my colleagues and I mainly saw two catch-all titles in the various teams we joined as consultants: Data Scientist and Data Engineer. Nowadays, new titles are popping up that suggest more subtle distinctions, like Decision Scientist, ML Scientist, and MLOps Engineer. However, it seems like every company defines these roles in its own way.¹
I think the definitions people choose reflect their perspective on the data science workflow. In this blog, I will explain what I mean by this. I will also describe my own perspective on these titles in relation to my ideal workflow, based on my own and my colleagues’ experiences.
The old catch-all titles suggested that a data science project should consist of two completely separate and sequential streams of work. The first workstream was that of ‘science’, where one takes a static data source to create a model that meets certain statistical requirements. The second workstream was that of ‘engineering’, where one takes the result of the ‘science’ and translates it into something that meets software requirements. So first we ‘science’, then we ‘engineer’, and then we’re done: we have a model in production!
Unfortunately, this idea of having two completely separate streams of work has left many data science projects stuck in the proof-of-concept phase: the science workflow has been completed, but the engineering workflow never started. And even when both workflows were executed, the end result could be far from ideal. In most cases, a data science project is just not as simple as: ‘science’, ‘engineer’, ‘done’. We need to maintain the products we build. We also want to keep improving our models by leveraging new data and modeling insights.
Popular insights from tech giants² have rendered the workflow described above old-fashioned. They view data science products as being similar to software products, thus requiring a similar development approach. As such, data science roles are evolving to cover the new aspects such a workflow demands.
We should be cautious, however, not to simply reformulate the old siloed roles into new, smaller silos. Instead, we can define the roles using the concept of T-shaped individuals: people who are highly skilled in one area, but have sufficient knowledge of the topics their co-workers are experts in. Such a definition explicitly moves away from the idea that these different roles work in isolation from each other. Rather, they work in a team, where the same task could be picked up by individuals with different primary roles.³ Instead of having many separate workflows that produce a static end product, there is a single workflow of continuous improvement.
When looking at the roles from this perspective, defining a role with a list of titles and definitions feels inadequate. Instead, one can view the roles on a spectrum along the different aspects of a data science project (figure 1). If we then consider an aspect such as model monitoring, the person on your team with the title ML Engineer will probably bring the deepest knowledge of the tools available to set it up. She will also be capable of defining sensible metrics to monitor. However, she will be less perceptive than the ML Scientist when it comes to spotting potential biases in the data and fine-tuning the metrics to the problem at hand. In this way, the different roles can both collaborate and complement each other. It is crucial to this collaboration that all the roles described in the figure are involved from the onset of the project, and that each of them contributes to your project repositories.
By having all roles involved from onset to end, one can move away from the idea of ‘science’, ‘engineer’, ‘done’. And no, not just by replacing ‘done’ with ‘ops’. Right from the start of the project, the team can envision the complete end product, including the requirements for implementing DevOps practices. While developing, the team can define the model requirements and design an evaluation framework while setting up the model pipelines. This way, a simple model will be up and running fast, allowing the team to iterate on it in a controlled way. Instead of ‘first A, then B’, the development of the different components happens in sync. To really make this fly, it helps to have people in different roles scoping and refining stories together.
Reading this, you may still wonder which combination of titles you need to build a solid data science product. Unfortunately, there is no clear-cut answer. You will need a group of people who together have at least some knowledge of each of the topics described in figure 1. But the topics that require deep expertise depend on the type of product you aim to build, the type of user you build it for, and the sentiment towards data science in the rest of the company. My short advice? Don’t get too hung up on names and definitions. Ensure that team members complement each other well, and invest in creating team dynamics that encourage collaboration across titles.
Thanks to Sophron Vermeij, Guido Tournois, and Michel Meulpolder for reviewing this blog.