About Vantage 101

Welcome to Vantage 101, the machine learning engineering course that bridges the gap between university and the business world. This course is designed to help you quickly grasp the best practices you wish you had known at the beginning of your career.

Chapter 5: Object Oriented Programming

Now that you are all up to date on how to write clean code, and which tools can help you enforce the right standards, it is time to delve into the realm of object-oriented programming (OOP). OOP will help you keep your code structured, improve your productivity, increase the quality of your code and lower its maintenance cost.

5.1 What is OOP?

Object-oriented programming (OOP) is a programming paradigm that organizes code around objects, which bundle data (attributes) with the functions that operate on that data (methods). An object is an instance of a class. Most of the widely used programming languages, such as Python and C++, support OOP as well as other paradigms such as functional programming. In data science, you will often encounter OOP, and it is important to know your way around it, and to know when (not) to use it.

A class can be thought of as a blueprint for objects, and each object is an individual instance of that class. For example, in Python you can have a class Person, where a Person has an age and a name, and construct two Person objects:

class Person:
    def __init__(self,
                 age: int,
                 name: str):
        self.age = age
        self.name = name

    def print_name_and_age(self):
        print(f"Person {self.name} is {self.age} years old.")

anne = Person(age=41, name="Anne")
bob = Person(age=28, name="Bob")

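The self parameter, as the name suggests, refers to the object itself. Note that self is a naming convention rather than a reserved keyword, but you should always stick to it. The age and name variables belong to each individual object, because they are assigned on self; these are called instance attributes. The print_name_and_age function also belongs to the class and receives the object as its first argument, self; this is called a method (more precisely an instance method, not to be confused with a class method, which is defined with the @classmethod decorator and receives the class itself as its first argument).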

Dunder methods

The __init__() method is a special type of method (called a "magic" or "dunder" method, as in "double underscore"). Python calls it automatically when you instantiate the class, which allows you to set attributes at that moment. It takes the input parameters (e.g. age), which exist only in the scope of __init__(), and can use them to create attributes (e.g. self.age), which exist for the lifetime of the object.

You can use these attributes and methods as follows:

>>> anne.print_name_and_age()
Person Anne is 41 years old.
>>> print(anne.age)
41

So outside the class, we use the name of the variable that refers to the object (here anne) where the class definition uses self.

Some dunder methods can be useful for defining how objects interact with built-in operators and functions. For example, the __len__(self) method defines what happens when you call the built-in len() function on an object, and the __add__(self, other) method defines what happens when you use the + operator with this object. These two methods are useful if you are defining your own Dataset class, for instance.
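As a minimal sketch of how this could look, the Dataset class below is a hypothetical wrapper around a list of records (it is not taken from any library). It implements __len__() so that len() works, and __add__() so that two datasets can be combined with the + operator:

```python
class Dataset:
    """Hypothetical minimal dataset wrapper around a list of records."""

    def __init__(self, records: list):
        self.records = list(records)

    def __len__(self) -> int:
        # Called by the built-in len() function
        return len(self.records)

    def __add__(self, other: "Dataset") -> "Dataset":
        # Called by the + operator; returns a new, combined Dataset
        if not isinstance(other, Dataset):
            return NotImplemented
        return Dataset(self.records + other.records)


train = Dataset([1, 2, 3])
extra = Dataset([4, 5])
combined = train + extra
print(len(combined))  # 5
```

Returning NotImplemented for unsupported operands lets Python raise a regular TypeError instead of failing somewhere inside the method.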

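The dataclasses module from the standard library offers a @dataclass decorator that automatically generates common dunder methods, such as __init__(), __repr__() and __eq__(), from the annotated fields. For example: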

from dataclasses import dataclass

@dataclass()
class InventoryItem:
    name: str
    unit_price: float
    quantity_on_hand: int = 0

    def total_cost(self) -> float:
        return self.unit_price * self.quantity_on_hand

carrot_items = InventoryItem(name="Carrot",
                             unit_price=1.23,
                             quantity_on_hand=42)
print(carrot_items.total_cost())
print(carrot_items)  # Print `__repr__` output

This yields:

51.66
InventoryItem(name='Carrot', unit_price=1.23, quantity_on_hand=42)

We can see that the __init__() and __repr__() methods were defined implicitly by the decorator.
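The decorator also generates an __eq__() method by default. The sketch below compares a hand-written class with a dataclass to make that visible; PlainPoint and DataPoint are hypothetical names used for illustration only:

```python
from dataclasses import dataclass


class PlainPoint:
    """Hand-written class without a custom __eq__."""

    def __init__(self, x: int, y: int):
        self.x = x
        self.y = y


@dataclass
class DataPoint:
    x: int
    y: int


# Without a custom __eq__, == falls back to comparing object identity
plain_equal = PlainPoint(1, 2) == PlainPoint(1, 2)
# The generated __eq__ compares the fields one by one
data_equal = DataPoint(1, 2) == DataPoint(1, 2)

print(plain_equal)      # False
print(data_equal)       # True
print(DataPoint(1, 2))  # DataPoint(x=1, y=2)
```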

Using objects from libraries

Libraries often contain classes that let the user instantiate objects and interact with their methods and attributes. For example, the scikit-learn package provides commonly used algorithms in the form of classes. You can create a model simply by instantiating an object (e.g. using reg = LinearRegression()), use its .fit() method to train it on a dataset, and its .predict() method to generate predictions for new data. Note that datasets are themselves typically objects too, for example instances of the DataFrame class provided by the pandas library. One of the major advantages of classes is that they hide their underlying implementation; this is called encapsulation. You don't need to (and usually shouldn't) concern yourself with the underlying implementation of the object, but can focus on how to use it in the logic of your own program.

5.2 We need to go deeper: advanced concepts in OOP

What if we need to make two similar classes that share a lot of their methods and attributes? According to the "Don't Repeat Yourself" principle, you shouldn't write these methods and attributes twice; this makes the codebase error-prone and difficult to maintain. Instead, we can use a cool OOP concept called inheritance. With inheritance, you can create a child class that inherits all methods and attributes from a parent class (also called subclass and superclass, respectively). For example, let's revisit our two persons, Anne and Bob. Anne is a teacher, and teaches some courses. Bob is a student who's enrolled in a study programme. We can create two child classes named Teacher and Student that inherit from the parent class Person, and instantiate anne and bob accordingly:

class Teacher(Person):
    def __init__(self,
                 age: int,
                 name: str,
                 courses: set):
        super().__init__(age=age, name=name)
        self.courses = courses

class Student(Person):
    def __init__(self,
                 age: int,
                 name: str,
                 study_programme: str):
        super().__init__(age=age, name=name)
        self.study_programme = study_programme

anne = Teacher(age=41, name="Anne", courses={"Quantum Mechanics", "Advanced Algebra"})
bob = Student(age=28, name="Bob", study_programme="MSc. Theoretical Physics")

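Now, the Teacher and Student classes still have all the functionality of the Person class, but we didn't have to rewrite it explicitly! The super().__init__() call runs the initializer of the parent class, which sets the age and name attributes, and methods such as print_name_and_age() are inherited automatically.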
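To see inheritance at work, the short sketch below repeats compact versions of Person and Teacher so that it runs on its own, and then calls a method on a Teacher that is only defined on Person:

```python
class Person:
    def __init__(self, age: int, name: str):
        self.age = age
        self.name = name

    def print_name_and_age(self):
        print(f"Person {self.name} is {self.age} years old.")


class Teacher(Person):
    def __init__(self, age: int, name: str, courses: set):
        # Let the parent initializer set age and name
        super().__init__(age=age, name=name)
        self.courses = courses


anne = Teacher(age=41, name="Anne", courses={"Quantum Mechanics"})
anne.print_name_and_age()           # Person Anne is 41 years old.
print(isinstance(anne, Person))     # True: a Teacher is also a Person
print(issubclass(Teacher, Person))  # True
```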

A parent class can have arbitrarily many child classes, and the inheritance chain can be arbitrarily deep. Also, a child class can inherit from multiple parent classes, a concept aptly called multiple inheritance.
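With multiple inheritance, Python has to decide which parent's method wins when several parents define the same name. It does so using the method resolution order (MRO), which you can inspect through the __mro__ attribute. The classes below are hypothetical and only serve as an illustration:

```python
class Person:
    def describe(self) -> str:
        return "a person"


class Employee(Person):
    def describe(self) -> str:
        return "an employee"


class Student(Person):
    def describe(self) -> str:
        return "a student"


class TeachingAssistant(Employee, Student):
    """Hypothetical class that inherits from two parent classes."""


ta = TeachingAssistant()
print(ta.describe())  # an employee (Employee is listed first)
print([cls.__name__ for cls in TeachingAssistant.__mro__])
# ['TeachingAssistant', 'Employee', 'Student', 'Person', 'object']
```

Because the order of the parent classes changes the behaviour, multiple inheritance is best used sparingly.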

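Another powerful concept is polymorphism, which describes the principle that objects of different classes expose the same method and attribute names, so that they can be used interchangeably. Our previous scikit-learn example illustrates this principle well: different algorithms, with different underlying implementations, all have .fit() and .predict() methods. This enables the user to do powerful things, such as looping over different objects and applying the same method calls to them.

Related to encapsulation, methods and attributes can be marked as reserved for internal use by prefixing their names with a single underscore. This is a convention only; Python does not enforce it. Prefixing a name with two underscores (without trailing underscores) additionally triggers name mangling, which makes the name harder to access or accidentally override from outside the class, but does not truly hide it.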
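The polymorphism principle described above can be sketched without any external library. MeanModel and LastValueModel are hypothetical toy models, not scikit-learn classes; the only thing they share is the .fit() and .predict() interface, which is enough to treat them uniformly in a loop:

```python
class MeanModel:
    """Hypothetical model that always predicts the training mean."""

    def fit(self, y: list):
        self.value_ = sum(y) / len(y)
        return self

    def predict(self, n: int) -> list:
        return [self.value_] * n


class LastValueModel:
    """Hypothetical model that always predicts the last training value."""

    def fit(self, y: list):
        self.value_ = y[-1]
        return self

    def predict(self, n: int) -> list:
        return [self.value_] * n


y_train = [1.0, 2.0, 6.0]
results = {}
for model in (MeanModel(), LastValueModel()):
    # The same calls work on both objects, whatever their implementation
    results[type(model).__name__] = model.fit(y_train).predict(2)

print(results)
# {'MeanModel': [3.0, 3.0], 'LastValueModel': [6.0, 6.0]}
```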

Sometimes, you have parent classes that are only used to derive subclasses from, and you might want to preclude instantiation of the parent class. These types of classes are called abstract base classes in Python, and you can use the abc library to build them. Simply inherit from the abc.ABC class and give the parent class a generic method template:

from dataclasses import dataclass
from abc import ABC, abstractmethod

@dataclass
class Shape(ABC):
    color: str

    @property
    @abstractmethod
    def surface_area(self):
        ...

@dataclass
class Circle(Shape):
    radius: float

    @property
    def surface_area(self):
        return 3.1415 * self.radius ** 2  # area of a circle: pi * r^2

@dataclass
class Rectangle(Shape):
    width: float
    height: float

    @property
    def surface_area(self):
        return self.width * self.height

rect = Rectangle(width=3, height=5, color='orange')  # this is valid
print(rect.surface_area)  # outputs "15"
shape = Shape(color='blue')  # this gives a TypeError
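The same protection applies to subclasses that forget to implement an abstract method. In the sketch below, Triangle is a hypothetical subclass that does not define surface_area, so Python refuses to instantiate it; the exact wording of the error message depends on the Python version:

```python
from abc import ABC, abstractmethod


class Shape(ABC):
    @property
    @abstractmethod
    def surface_area(self) -> float:
        ...


class Triangle(Shape):
    """Hypothetical subclass that forgets to implement surface_area."""


instantiation_failed = False
try:
    Triangle()
except TypeError as error:
    # Raised because surface_area is still abstract in Triangle
    instantiation_failed = True
    print(f"Cannot instantiate Triangle: {error}")
```

This way, a missing implementation is caught when the object is created, rather than later when the method is first used.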

5.3 OOP in data science and machine learning

You'll often encounter OOP in your data science projects, and there are cases when you should design your own classes.

As mentioned before, most machine learning libraries offer out-of-the-box algorithms as classes that you can use. For instance, scikit-learn offers classes for particular ML algorithms (e.g. RandomForestClassifier), but also a Pipeline class that chains preprocessor objects, model objects, and more. This allows you to interface only with the Pipeline object, which encapsulates its components. A similar example is the Model class in keras, which consists of neural network layer objects, such as Input, Conv2D, or Dense instances. On this Model instance you can also call .fit(), just like in scikit-learn, or .summary() to print a string summary of the network. Like many other objects in Python, trained models can often be serialized to a file with the pickle module and reloaded later in another script, although not every object can be pickled and some libraries provide their own saving mechanism (keras models, for example, have a .save() method).
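To illustrate the idea of a pipeline that encapsulates its components, here is a toy sketch in plain Python. It is not scikit-learn's implementation: SimplePipeline, Scaler and ThresholdClassifier are hypothetical classes, and the data is one-dimensional to keep things short:

```python
class Scaler:
    """Hypothetical preprocessor: rescales values to the range [0, 1]."""

    def fit(self, values: list):
        self.min_, self.max_ = min(values), max(values)
        return self

    def transform(self, values: list) -> list:
        span = self.max_ - self.min_
        return [(value - self.min_) / span for value in values]


class ThresholdClassifier:
    """Hypothetical model: predicts True above a fixed threshold."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def fit(self, values: list):
        return self  # nothing to learn in this toy model

    def predict(self, values: list) -> list:
        return [value > self.threshold for value in values]


class SimplePipeline:
    """Chains one preprocessor and one model behind a single interface."""

    def __init__(self, preprocessor, model):
        self.preprocessor = preprocessor
        self.model = model

    def fit(self, values: list):
        transformed = self.preprocessor.fit(values).transform(values)
        self.model.fit(transformed)
        return self

    def predict(self, values: list) -> list:
        return self.model.predict(self.preprocessor.transform(values))


pipeline = SimplePipeline(Scaler(), ThresholdClassifier(threshold=0.5))
pipeline.fit([0, 5, 10])
print(pipeline.predict([2, 8]))  # [False, True]
```

The calling code only talks to the pipeline object and never has to apply the scaler by hand, which is the point of encapsulation.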

A common use case in machine learning where you need to define your own classes is building a complex neural network that includes custom blocks of layers, or a custom data loader. In these cases you typically subclass a base class provided by a library and work from there. In other cases, you might want to encapsulate a custom ensemble or cascade of models in one easy-to-use class, so that you can still easily interact with it, coordinate multiple instances of it, and, thanks to polymorphism, compare it to other models.
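A custom ensemble could be wrapped roughly as follows. This is a sketch under simplifying assumptions: ConstantModel and AveragingEnsemble are hypothetical classes, and the base models only need to offer .fit(X, y) and .predict(X):

```python
class ConstantModel:
    """Hypothetical base model that always predicts one fixed value."""

    def __init__(self, value: float):
        self.value = value

    def fit(self, X, y):
        return self  # nothing to learn in this toy model

    def predict(self, X) -> list:
        return [self.value for _ in X]


class AveragingEnsemble:
    """Wraps several models and averages their predictions."""

    def __init__(self, models):
        self.models = list(models)

    def fit(self, X, y):
        for model in self.models:
            model.fit(X, y)
        return self

    def predict(self, X) -> list:
        all_predictions = [model.predict(X) for model in self.models]
        # zip(*...) groups the predictions of all models per sample
        return [sum(column) / len(column) for column in zip(*all_predictions)]


ensemble = AveragingEnsemble([ConstantModel(1.0), ConstantModel(3.0)])
ensemble.fit(X=[[0], [1]], y=[1, 3])
print(ensemble.predict([[0], [1], [2]]))  # [2.0, 2.0, 2.0]
```

Because the ensemble exposes the same .fit() and .predict() methods as its members, it can be dropped into any code that expects a single model.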

Although the OOP paradigm is very important for organizing and maintaining code in large-scale projects, take care not to overuse it. When something can also be implemented as a standalone function or a simpler data structure (e.g. from the collections module) without losing code readability or maintainability, it is often better to opt for the simpler approach.
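As a small illustration of the simpler approach, the sketch below uses a namedtuple, a standalone function and a Counter from the collections module instead of hand-written classes. The names Point and manhattan_distance are made up for this example:

```python
from collections import Counter, namedtuple

# A namedtuple gives named, immutable fields without a hand-written class
Point = namedtuple("Point", ["x", "y"])


def manhattan_distance(a: Point, b: Point) -> int:
    """A standalone function is enough; no class with methods is needed."""
    return abs(a.x - b.x) + abs(a.y - b.y)


origin = Point(x=0, y=0)
target = Point(x=3, y=4)
print(manhattan_distance(origin, target))  # 7

# Counter replaces a custom "label counting" class
label_counts = Counter(["cat", "dog", "cat"])
print(label_counts.most_common(1))  # [('cat', 2)]
```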