You have already seen how to prepare a dataset to train and evaluate a simple model in the notebook. There are many ways to improve model performance, some of which are described in this chapter. Specifically, the chapter covers performance metrics, model selection, and hyperparameter tuning.
3.1 Performance Metrics
It is important to make sure your performance metric translates directly to what you are trying to achieve for the use case. An overview of some of the well-known metrics is given in Performance Metrics in Machine Learning [Complete Guide]. Additionally, you can glance through the metrics listed at sklearn.metrics to see if anything fits your needs better. For example, the matthews_corrcoef metric can be a good fit for many classification tasks. Some domains also have their own specific metrics, such as Computer Vision (5 Object Detection Evaluation Metrics That Data Scientists Should Know) and NLP (The Most Common Evaluation Metrics In NLP).
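To illustrate why the choice of metric matters, here is a minimal sketch that compares accuracy with matthews_corrcoef on an imbalanced problem. The labels are made up for illustration and do not come from the notebook; the "model" simply predicts the majority class every time.

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Made-up labels (an assumption for this sketch): 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)

# A useless "model" that always predicts the majority class.
y_majority = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_majority)
mcc = matthews_corrcoef(y_true, y_majority)

print(f"accuracy: {accuracy:.2f}")  # high, although no positive is ever found
print(f"MCC:      {mcc:.2f}")       # zero: no better than a constant guess
```

Accuracy looks excellent here even though the predictions carry no information, while the Matthews correlation coefficient stays at zero. Whether that distinction matters depends on your use case.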
3.2 Model Selection
In the notebook you already saw a simple baseline model, the Random Forest Classifier, which is generally a good starting point for tabular data. Your educational background should give you enough intuition as to which model type is suitable for the task at hand. If you have no sense of what model would fit your particular classification or regression problem, you could consider using PyCaret for tabular data to automatically try out a suite of frequently used machine learning algorithms with a single function call.
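If you prefer to compare a few candidates by hand, the idea can be sketched with plain scikit-learn. This is not PyCaret and not the notebook's code: the synthetic dataset, the three candidate models, and the choice of matthews_corrcoef as the score are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the notebook data (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Three illustrative candidates; swap in whatever fits your problem.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Score every candidate with the same metric and the same folds.
scorer = make_scorer(matthews_corrcoef)

mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring=scorer)
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")

best_name = max(mean_scores, key=mean_scores.get)
print("best candidate:", best_name)
```

Using cross-validation rather than a single split makes the comparison less sensitive to one lucky or unlucky partition of the data.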
3.3 Hyperparameter Tuning
Now that you have settled on a model and a performance metric, the fun can begin. Manually tuning the hyperparameters of your model will only get you so far. The default brute-force approach is a grid search or random search, for example using scikit-learn. However, these methods can be extremely resource-intensive. Scikit-learn's built-in grid and randomized search are not the most advanced methods, although they can be a good option for models that are cheap to train. For models with longer training times, e.g. boosted trees and neural networks, more advanced optimization methods are better suited. Have a look at this blog on Bayesian optimization by Vantage alumnus Mike or this more general blog about hyperparameter tuning to get an idea of different tuning methods.
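As a rough sketch of the brute-force route, the snippet below runs scikit-learn's RandomizedSearchCV on a random forest. The synthetic dataset, the search ranges, and the trial budget are assumptions chosen to keep the example fast, not recommended settings.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the notebook data (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative search ranges; randint(a, b) samples integers in [a, b).
param_distributions = {
    "n_estimators": randint(20, 100),
    "max_depth": randint(2, 10),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,           # number of random configurations to try
    cv=3,                # 3-fold cross-validation per configuration
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best CV score:   {search.best_score_:.3f}")
```

Every configuration costs one full cross-validation run (here 10 x 3 model fits), which is why this approach becomes expensive for slow models.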
More efficient exploration of your parameter space can be achieved with dedicated hyperparameter optimization libraries for Python. One good example is Optuna, which provides a number of examples in its GitHub repository.
Use the techniques above to improve the performance of the simple model in the notebook.
- Choose a performance metric that you think fits this particular case best. Describe why you think it is the best fit.
- Try out at least three different models and see which model performs best.
- Choose one or more models from the previous step and try to tune their hyperparameters to improve the performance even further.