You might think that the job of a data scientist mainly consists of creating innovative solutions and writing quality code, but in my experience that is only part of it. Perhaps even more important is how data scientists translate their work to the business. From my perspective, it is one of the most underdeveloped skills in data science, and a lack of this competence often results in misunderstandings between managers and data scientists. As a consequence, a potentially valuable project might not generate business value. In this blog I will highlight three issues that managers often overlook, and show how practical examples can help in understanding the view of a data scientist. The goal of this blog post is to help bridge the gap between business and data science teams.

1. A specific example is not a reflection of the majority

Imagine a data scientist is sitting behind their computer and coding is going well today. Suddenly, the manager walks in with an unhappy face. They give the data scientist one example for which the model does not predict the expected outcome and it needs to be fixed. The data scientist looks at the example and notices that the characteristics of the example are quite extreme in comparison to the sample average. The model has not seen enough of such extreme observations, making its predictions uncertain. So the data scientist tells the manager that this specific example is not a reflection of the model’s performance. I can imagine this answer is not very satisfying for people who don’t have a background in data science.

Let me explain why this happens in the first place. The first reason is that, as I have observed, people are unrealistically willing to predict the occurrence of unlikely events. Let us look at a case I worked on at my previous job as an example, which was about the scouting of football players in the second division in the Netherlands. Scouts were analyzing several players and one player caught their attention by scoring 3 goals in two consecutive matches. Moreover, he turned out to be the son of a famous retired Dutch footballer, so they decided to give him a contract. He was something special, had the “football DNA” and therefore would probably make it to the Champions League in a few years. But now, let’s consider which of the following is a better bet:

The player will make it to the Champions League in a few years
The player will still play in the Dutch second division in a few years

They went for the first option, but from a statistical point of view this is a very bad guess. In the Dutch second division, most players will never play in the Champions League in their life. Besides, the information available in our example is almost irrelevant when deciding whether or not a player is going to make it. Ultimately, the scouts didn’t know anything about the player but pretended they did.

The second reason is that experts in the field (i.e. the manager in the situation outlined at the beginning of this section) are usually really good at thinking outside the box and considering complex combinations of variables in their predictions. This complexity may work in the odd cases, but it will hurt predictions for the majority of cases.

Applying our football example to daily practice brings us to the following advice. When the first results of the model catch your eye and you find a prediction obviously wrong, first try to assess how likely this example is and how important it is to predict it flawlessly. Is it indeed an edge case? Then the model has probably seen few observations like it, and most models are not designed to predict these cases. And keep in mind: people are often better at predicting the extreme cases, but do much worse on the large majority of the observations (Kahneman, D. (2011). Thinking, fast and slow).

2. The best models still have errors, but not every error is the same

As you have seen in the section above, data scientists will not develop a perfect model. The illusion that they can is often created by the media or by the ease of using high-quality face recognition algorithms in apps, but even these models made by multi-million dollar companies will produce prediction errors. For example, you can argue that data can’t capture everything, such as a person’s state of mind. Moreover, there is more randomness in the world than people think. What about luck? And timing? Due to these random factors in combination with missing data, all predictions will have some uncertainty and (small) errors. But lumping all errors together and not worrying about them at all would have a significant negative impact on the quality of a model. So, what is a key factor that distinguishes one error from the other?

Let’s start with an example to illustrate the term bias in errors. Suppose we predict the score of two golfers on day 2 of a golf tournament. We are quite lazy and don’t know a lot about golf, so we predict that their score will equal their score on day 1. Chances are that the golfer who did extremely well on the first day will perform worse on the second day (possibly due to being less lucky, more tired, or under more pressure). And of course, the reverse reasoning applies to a golfer who performs poorly on the first day. Eventually, if we look at the actual results and compare them to the predictions, we find bias in the errors. We overestimated the performance of the golfer with a good start and underestimated that of the other.

But how can we get rid of such bias and consequently make better predictions? We need to correct the prediction with evidence, but we don’t want to give that evidence too much weight. Here is an easy and general approach to predicting a golfer’s score on day 2, knowing their result on day 1:

Start with an estimate of the average score of all players on day 1. Let’s say the average score was 72.
Determine the score for the player on day 2 that matches your impression of the evidence (i.e. the score on day 1 of the golfer). For example, the score of the player on day 1 was 68, so you think he will score 69 on day 2.
Determine the correlation (i.e. coherence) between scores on consecutive days. How to calculate it is beyond the scope of this blog, so for now we assume the correlation is 0.3.
Now, move 30% of the distance from the average toward the matching score for day 2 (determined in the second step). So we will get (69 - 72) * 0.3 + 72 = 71.1.
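
The four steps above boil down to a one-line formula. Here is a minimal Python sketch of it; the function name `shrink_toward_average` and its parameter names are illustrative choices, not part of any library:

```python
def shrink_toward_average(average, matching_score, correlation):
    """Move from the average toward the matching score by a fraction
    equal to the correlation (0 = ignore the evidence, 1 = trust it fully)."""
    return average + correlation * (matching_score - average)


# Numbers from the golf example: average 72, matching score 69, correlation 0.3.
prediction = shrink_toward_average(average=72, matching_score=69, correlation=0.3)
print(round(prediction, 1))  # 71.1
```

With a correlation of 0 the prediction stays at the average, and with a correlation of 1 it equals the matching score, so the correlation acts as a dial for how much the evidence is allowed to move the prediction.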

This way, the bias will be eliminated from the prediction, which means that the predictions (both high and low) are equally likely to overestimate and underestimate the true value. Hence, you still make errors, but (depending on the quality and quantity of the evidence) they will be smaller and will not favor high or low.
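
To make the difference between biased and unbiased errors tangible, here is a small simulation sketch. It rests on assumptions that are mine, not measured golf data: each score is a stable "skill" part plus day-specific "luck", scores have a standard deviation of 3 strokes, the average is 72 and the day-to-day correlation is 0.3. It also takes the day 1 score itself as the matching score. It compares the lazy prediction (day 2 equals day 1) with the corrected one for the golfers who started best.

```python
import random
import statistics

AVERAGE = 72.0      # assumed average score, as in the example above
CORRELATION = 0.3   # assumed day-to-day correlation, as in the example above
TOTAL_SD = 3.0      # hypothetical spread of scores, chosen for illustration


def simulate_errors(n_golfers=20000, seed=1):
    """Return the mean prediction errors (predicted minus actual day 2 score)
    of the lazy and the corrected prediction, for the quarter of golfers
    with the best (lowest) day 1 scores."""
    rng = random.Random(seed)
    # Split the variance into a stable skill part and a day-specific luck part
    # so that the two days have the assumed correlation.
    skill_sd = (CORRELATION ** 0.5) * TOTAL_SD
    luck_sd = ((1 - CORRELATION) ** 0.5) * TOTAL_SD

    rounds = []
    for _ in range(n_golfers):
        skill = rng.gauss(AVERAGE, skill_sd)
        day1 = skill + rng.gauss(0, luck_sd)
        day2 = skill + rng.gauss(0, luck_sd)
        rounds.append((day1, day2))

    rounds.sort()  # lowest (best) day 1 scores first
    best_starters = rounds[: n_golfers // 4]

    lazy = [day1 - day2 for day1, day2 in best_starters]
    corrected = [
        AVERAGE + CORRELATION * (day1 - AVERAGE) - day2
        for day1, day2 in best_starters
    ]
    return statistics.mean(lazy), statistics.mean(corrected)


lazy_error, corrected_error = simulate_errors()
print(f"lazy: {lazy_error:+.2f}, corrected: {corrected_error:+.2f}")
```

Under these assumptions the lazy prediction is clearly negative on average for the best starters: it predicts a lower (better) score than they actually achieve on day 2. The corrected prediction has an average error close to zero for the same group, even though individual errors remain.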

The take-away for this section is that you can’t expect your data scientist to come up with a perfect model. Instead, accept that every model comes with errors. However, it is important to ask the data scientist about the possible occurrence of bias and about the quantity and quality of the data. In my experience, biased errors will cause a lot more problems along the way than unbiased errors, which tend to cancel each other out on average.

3. The recurring question of sufficient data

Ever talked to a statistician or data scientist about your future plans with data? I’m pretty sure one of their first questions will have been: “Is there sufficient data available?” I can hear you thinking: why does this matter so much to those people? I mean, some data is good, right? Do your trick with the computer and we can predict what we need. But let me explain to you where this question comes from.

Ever heard of the law of small numbers? In short, this law states that extreme outcomes are more likely in small samples. For example, let’s consider the following experiment. Brian and Jenny have a bowl full of two types of candy (blue and red). Brian will always take 4 pieces of candy each trial and Jenny always takes 7. They record each time they observe a sample where all candies are the same color. If I tell you that Brian will observe more of these extreme outcomes (i.e. where all pieces of candy are the same color) than Jenny, you are probably not surprised.
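
For the curious, the size of the difference can be computed directly. The sketch below assumes two things the example does not state: the bowl holds equal shares of blue and red candy, and it is large enough that each piece drawn is effectively independent of the others. The function name `prob_all_same_color` is just an illustrative choice.

```python
def prob_all_same_color(n_pieces, share_blue=0.5):
    """Probability that n independent draws are all blue or all red."""
    share_red = 1 - share_blue
    return share_blue ** n_pieces + share_red ** n_pieces


print(prob_all_same_color(4))  # Brian: 0.125
print(prob_all_same_color(7))  # Jenny: 0.015625
```

Under these assumptions, Brian sees a single-color sample in about 1 out of 8 trials and Jenny in about 1 out of 64, so the extreme outcome is eight times as common in the smaller sample.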

So how can this law of small numbers be translated to the data scientist and the business? In one of my projects I had to predict house prices in the Netherlands. Among other things, we knew the location of properties. It is widely known that houses in Amsterdam are on average more expensive than in the rest of the Netherlands. However, the value of houses across different neighbourhoods in Amsterdam varies a lot. When you have only a few observations for training a model, it might just be the case that your sample contains many houses from the deprived Amsterdam-Osdorp neighbourhood. As a consequence, you could end up with a model that predicts that all houses in Amsterdam are very inexpensive, because your training sample contains a lot of extreme values (i.e. very inexpensive houses in Amsterdam-Osdorp). This result is costly because you wasted your time on a useless model. The most straightforward way to mitigate this risk is to use a sufficiently large sample to train a model.

Conclusion

After reading this blog, I hope managers have a better understanding of some struggles in the life of a data scientist. You now know:

That models are not designed for edge cases and that a single such observation is not a reflection of the overall performance of a model.
That every model has errors, and that you should watch out for bias in them.
That sufficient data mitigates the risk of observing a lot of extreme values and thus prevents the model from generalizing their characteristics to the population.

On the other hand, if you are a data scientist recognizing these struggles, I hope that the examples were helpful and you now have more guidance in explaining these struggles to the business. This way, we can gradually create more beautiful products and don’t unnecessarily waste great ideas because of miscommunication.