Lessons Learned 2022 H1: challenges faced by a machine learning engineer

Two quarters in one this time 😎, still plenty of lessons.

Andrei Pascanean
Jul 5, 2022 5:24:39 PM

In my last blog of this series, I looked at some lessons I learned in Q4 of 2021. This time around I’ll be doing six months’ worth of tips and tricks in one blog, mostly because I noticed that there wasn’t enough content for a write-up every quarter. Who knows, maybe the next edition will go back to the old format.

I also wanted to focus not only on ‘lessons learned’, which implies mistakes happening in the first place, but also on cool new tech and methods that I learn along the way. Not everything is learned by failing first; some things are discovered by trying.

If you’re new here, I’m a machine learning engineer with one year of experience working as a consultant at Vantage AI. Right now I’m working on an end-to-end ML project in the energy sector. My previous post in this series covers the lessons I learned in Q4 of 2021.

I’ll keep making these blog posts regularly throughout the year, to reflect on my experience and hopefully help others learn and grow as well.

I hope these blogs can offer you tips and tricks for navigating your own career’s waters.

The Tech

#1: Use all your data!

When training a machine learning model, I usually divide my data into training, validation and test sets. This way of working has been drilled into me to avoid any form of overfitting or data leakage. You don’t want your model to perform amazingly on the half-baked Toshiba your uncle got you for your birthday, barely scraping by on 2GB of RAM, only to fail miserably in production.

Well, it turns out that while this is a good rule to stick to when optimising hyperparameters, it is usually a good idea to also train a final model on all your available data. Of course, at this point there is no held-out data left to compute a performance metric for that final model, but that’s why model monitoring is so important.

Your model yearns for more data

So do your train, validation and test split, optimise the hyperparameters, and then do a final training run on all your data. We now use this method to re-train all our models automatically and then keep an eye on how they perform in production.
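To make the recipe concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the toy one-feature ridge model, the synthetic data, the candidate `alphas` and the 60/20/20 split are my assumptions, not what we run in production. The point is only the order of the steps: tune on validation, report a score on the untouched test set, then refit on everything.

```python
import random
from statistics import mean


def split(rows, fractions=(0.6, 0.2), seed=0):
    """Shuffle rows and cut them into train, validation and test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * fractions[0])
    n_valid = int(len(rows) * fractions[1])
    return rows[:n_train], rows[n_train:n_train + n_valid], rows[n_train + n_valid:]


def fit_ridge_1d(rows, alpha):
    """Toy model: ridge regression through the origin, y ~ w * x."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, _ in rows)
    return sxy / (sxx + alpha)


def rmse(w, rows):
    """Root mean squared error of the toy model on (x, y) pairs."""
    return mean((y - w * x) ** 2 for x, y in rows) ** 0.5


def tune_then_refit(rows, alphas):
    train, valid, test = split(rows)
    # 1. Pick the hyperparameter using the validation set only.
    best_alpha = min(alphas, key=lambda a: rmse(fit_ridge_1d(train, a), valid))
    # 2. Estimate performance once on the untouched test set.
    test_rmse = rmse(fit_ridge_1d(train + valid, best_alpha), test)
    # 3. Final model: same hyperparameter, trained on ALL rows.
    #    There is no held-out score for this one, hence the monitoring.
    final_w = fit_ridge_1d(rows, best_alpha)
    return best_alpha, test_rmse, final_w


# Synthetic data (hypothetical): y = 3x plus Gaussian noise.
rng = random.Random(42)
data = [(x, 3.0 * x + rng.gauss(0, 1)) for x in (rng.uniform(-5, 5) for _ in range(200))]
alphas = [0.0, 1.0, 10.0, 100.0]
best_alpha, test_rmse, final_w = tune_then_refit(data, alphas)
print(f"alpha={best_alpha}, test RMSE={test_rmse:.2f}, final weight={final_w:.2f}")
```

The test RMSE describes the model from step 2, so treat it as an estimate for the final model rather than a measurement of it.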

#2: Track experiments with MLflow

Training machine learning models can get messy and unorganised real quick, especially if you use Jupyter Notebooks. After trying out a hypothesis and building a model that uses all the right input features, you want to record the model’s performance metrics. What was its RMSE? Which hyperparameters ended up being best?

Enter MLflow, a useful Python library for keeping track of all the little experiments you run. MLflow keeps a record of all your ‘runs’ and their results, and lets you store your best-performing models. It takes care of all the back-end tracking in SQL and file management for your pickle files. All you have to do is tell it to track your model training code.

We use MLflow on a remote private server, which allows us to keep track of all the models we train. This is especially important when you need to implement an automated model retraining solution in production. Mike Kraus has a handy article where he explains exactly how to schedule automatic model retraining using MLflow for experiment tracking.

#3: Testing, testing, 1 2 3

We all know by now how important testing is when it comes to software development. It’s all anyone talks about these days. Usually I brush off test-driven development as something that is both time-consuming and often inefficient, especially in the trial-and-error process of building ML models.

However, once your models are finalised and you finally feel ready to go to production, tests become crucial. After refactoring your code from dirty, dirty Jupyter Notebooks to noble clean code modules and functions, you should write tests as well. I discovered the value of tests when I tried to make a change to our existing code base, to add a new data preprocessing function. The tests I had written allowed me to make sure I didn’t break anything. Change a column reference? Run a test. Add an extra input feature? Run a test. Instantly find out if you made a mistake and spare your blushes during a client demo.

Although Python already has a built-in library for writing tests, unittest, I found the pytest library to be much easier to use. Shout-out to Jasper Derikx for organising an in-depth 2-day training in-house at Vantage AI. Fellow USB-stick enthusiast Björn van Dijkman has a nice blog about why you should use pytest in your data science project.
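As a small illustration of why pytest feels lightweight, here is a sketch of a preprocessing function with two tests. The function `add_lag_feature` and the `load` column are hypothetical examples, not code from our project. Test functions are plain functions whose names start with `test_` and that use bare `assert` statements; pytest discovers and runs them when you call `pytest` on the file.

```python
def add_lag_feature(rows, column, lag=1):
    """Return new rows with a `<column>_lag<lag>` field holding the
    value of `column` from `lag` rows earlier (None where unavailable)."""
    name = f"{column}_lag{lag}"
    result = []
    for i, row in enumerate(rows):
        new_row = dict(row)  # copy, so the caller's data is left untouched
        new_row[name] = rows[i - lag][column] if i >= lag else None
        result.append(new_row)
    return result


def test_add_lag_feature_shifts_values():
    rows = [{"load": 10}, {"load": 12}, {"load": 11}]
    result = add_lag_feature(rows, "load", lag=1)
    assert [r["load_lag1"] for r in result] == [None, 10, 12]


def test_add_lag_feature_does_not_mutate_input():
    rows = [{"load": 10}, {"load": 12}]
    add_lag_feature(rows, "load")
    assert rows == [{"load": 10}, {"load": 12}]
```

If someone later renames the column or changes the lag logic, the first test fails immediately instead of during the demo.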

The Business

#1: Start with NoML

When starting on a new machine learning project, it is tempting to immediately start with — machine learning. That may not always be the best approach. Most of us have heard about Andrew Ng’s data-centric AI, the idea that one should focus more on extracting performance from data than from models, but another first step that is overlooked is NoML.

After reading Eugene Yan’s post about NoML, I was quickly convinced of its simplicity and effectiveness. Based on the first rule in Google’s Rules of Machine Learning, the principle is that one should always start from simple heuristics. I would take this a step further and say: start from available business rules and implemented workflows. Measure how your client’s current approach performs. A first goal could be to automate the existing workflow, which would already save time and money.

Rule #1: Don’t be afraid to launch a product without machine learning

Achieving a ‘quick win’ by building a baseline model can offer insight into the domain, including the problem’s scope and limitations. In our energy project, the client already had a business rules-based model, which offered useful insight into data availability, forecast timing and feature importance.
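Measuring a rules-based baseline can be as simple as the sketch below. The ‘repeat the most recent values’ rule, the function names and the numbers are all invented for illustration; the client’s actual business rules were different. What matters is that the baseline gets scored with the same metric you will later use for any ML model.

```python
def persistence_forecast(history, horizon):
    """Heuristic baseline: forecast the next `horizon` values by
    repeating the most recent `horizon` observations."""
    if len(history) < horizon:
        raise ValueError("need at least `horizon` past observations")
    return history[-horizon:]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up numbers, just to show the mechanics.
history = [10, 12, 11, 13, 12, 14]
actual_next = [12, 13, 15]

baseline = persistence_forecast(history, horizon=3)  # [13, 12, 14]
baseline_mae = mean_absolute_error(actual_next, baseline)
print(baseline_mae)  # 1.0
```

Any model you build afterwards has to beat this number to justify its extra complexity.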

The Personal

#1: One in a row

Personal change is tough. Whether you want to read more insightful Medium articles, start a new certification or just bang out more high-quality memes, changing your behaviour can seem daunting. Personally I tried lots of different hobbies and habits, trying to implement a ton of change in a short timespan, only to fail horribly. One thing that does work, though, is tiny focus.

An eye-opener for me was Matthew McConaughey’s ‘oneinarow’ principle. In his book Greenlights, McConaughey talks about starting with one thing only, whether it’s a personal goal or a new behaviour, and practising it until you do that one thing well. It’s important to not clutter your development with too many goals in different directions or domains. Singular focus.

A second part of the solution is to make your goals tiny. Simple, small and achievable. Forget about the 5-day-a-week full body workout, start by just going to the gym once and doing whatever you want for however long you want. Setting yourself tiny goals lets you avoid setting yourself up for failure. The best part is that once you ‘crush’ these goals you often end up going a step further and doing more, and if you don’t — no biggie.

Start with a single tiny goal and crush it! Then move on to the next.

To give an example, I always wanted to constantly read as much as possible about tech, machine learning and management, and then share it with my network. Of course this led to tons of unread tabs open in my browser, half-read ‘The Economist’ issues and ignored email articles from interesting bloggers. So I set myself the tiny goal of just ‘reading one page a day’. What page didn’t matter; it could be a tab, a blog or even an interesting comment. While I still have a ton of unread content open in my browser as I’m typing this, I read a lot more than I did 2 months ago. Crushing it.

#2: Journaling

Mindfulness and self-reflection are good habits to have. I tend to beat myself up about the stuff I’m not so good at, while ignoring all the things that went well. I also don’t see myself as the type of person who journals and keeps track of their days and feelings, but I found a method that works for me. CGP Grey’s journaling method leaves space for gratefulness, self-reflection and positive reinforcement.

For a quick run-down, check out the man’s very own video where he explains how you can use any plain old notebook to track your days. You start off with listing two things that you’re grateful for each day, so life doesn’t seem so bad anymore. After a long day you can choose how much you reflect. I usually use short sentences that reflect my feelings about the day and what I’m thinking about. Lastly, you top it off with at least one thing that went well that day. Can be as simple as taking notes during a meeting or as complicated as finishing a side-project. This forces you to think of all your day’s ‘highlights’ instead of lows.

#3: Communication is 🔑

I’ve talked before about how communication is one of the most important skills you can have as an ML engineer. Especially if you’re in consulting. Communication with your client or stakeholders is important, but so is communicating with your teammates. Staying aligned on the goals you’re trying to achieve, discussing issues and solving problems are all benefits of checking in regularly. Of course this is tougher to do remotely. Here’s what we do in my team.

Daily check-ins. I know not everyone enjoys having a call first thing in the morning (although you can schedule it a bit later) but I found that it really helps to divide the workload clearly. Additionally, a check-out over whatever messaging platform you use (Slack) is also helpful to keep each other up-to-date on what went well and what could have gone better.

I would highly recommend a backlog to keep track of your team’s progress and the distribution of tasks. It is especially useful for avoiding confusion regarding dependencies or situations where two people end up writing the same code twice! Speaking of hiccups, leave space in your agenda for regular feedback sessions with your teammates. Our feedback session takes the form of a retrospective that we organise every 2 weeks, useful for finding ways to improve and lessons to blog about 😉.

In Review

An interesting first half of the year with plenty of learnings. Especially impactful was the tip to use all our data when retraining the final model for production. The client sure noticed an improvement!

Hope you enjoyed reading! Smack that follow button if you did 👨‍🌾

P.S. What did you learn these past 6 months? Feel free to drop a comment about your tech tools, business tips or personal experiences.