Python: Why DataClasses are Awesome

Cool way to handle Data Oriented Classes

Pravash
7 min read · Jan 23, 2023
Python Dataclass

In this article I discuss Python dataclasses: where you can use them and how they help. I also walk through examples that should help you get started and use them in day-to-day coding.

What are Data Classes?

Data classes are a bread-and-butter tool for the everyday programmer. They can save you hours of writing boilerplate code every week, so this article looks at how to use them in practice rather than just showing you the syntax.

Data classes are mainly aimed at helping you write data-oriented classes. They simply act as containers of data, used by other classes.

Python 3.7 introduced an exciting new feature: the @dataclass decorator, provided by the dataclasses module.

The @dataclass decorator automatically generates basic functionality for a class, including __init__(), __repr__(), __eq__() and more, which removes a lot of boilerplate code.

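Installing the DataClasses module:

The dataclasses module is part of the standard library from Python 3.7 onwards, so there is nothing to install. Only Python 3.6 needs the backport from PyPI:

pip install dataclasses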

Syntax:
@dataclasses.dataclass(*, init=True, repr=True, eq=True, order=False, unsafe_hash=False, frozen=False)

Parameters:

  • init: If true, an __init__() method will be generated.
  • repr: If true, a __repr__() method will be generated.
  • eq: If true, an __eq__() method will be generated.
  • order: If true, __lt__(), __le__(), __gt__(), and __ge__() methods will be generated. This requires eq to be true as well.
  • unsafe_hash: If false (the default), a __hash__() method is generated according to how eq and frozen are set. If true, a __hash__() method is generated even when the class is mutable.
  • frozen: If true, assigning to fields will raise an exception.
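To see a few of these parameters working together, here is a minimal sketch. The Version class and its fields are made up for illustration and are not part of the article's running example.

```python
from dataclasses import dataclass, FrozenInstanceError


# Hypothetical example class: ordered and read-only.
@dataclass(order=True, frozen=True)
class Version:
    major: int
    minor: int


older = Version(1, 4)
newer = Version(2, 0)

# eq=True (the default): instances are compared field by field.
print(older == Version(1, 4))

# order=True: instances are compared like the tuples (1, 4) and (2, 0).
print(older < newer)

# frozen=True: assigning to a field is rejected.
try:
    older.major = 3
except FrozenInstanceError as exc:
    print(f"cannot modify: {exc}")
```

Because eq and frozen are both true here, a __hash__() method is generated as well, so Version instances can be used as dictionary keys.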

Explaining with Example

Let's take the example below, where you have a “Person” class and a “main” function that creates a person and prints it.

class Person:
    def __init__(self, name: str, address: str):
        self.name = name
        self.address = address

def main() -> None:
    person = Person(name="Jack", address="New York")
    print(person)

if __name__ == '__main__':
    main()

When we run the above code, it gives us the output below:

<__main__.Person object at 0x10371bfa0>

Yes, exactly: the class name and the memory address, which is not useful, as we want the person's name and address. What we can do is add the __str__ dunder method to the Person class to see the data.

class Person:
    def __init__(self, name: str, address: str):
        self.name = name
        self.address = address

    def __str__(self) -> str:
        # __str__ must return the string; print() then displays it.
        return f"{self.name}, {self.address}"

Now if we run this code, we can see the required data, so we have a better understanding of what the object holds.

There are a couple of other things we might want to do, like comparing one person with another or sorting people, and we might have to add other fields to the person such as city, email, or contact. So the “Person” class can get more complicated. The disadvantage of doing all this by hand is that every new field has to be added as an argument to the initializer and to each of the dunder methods.
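To make that cost concrete, here is a rough sketch of what the hand-written version might look like once one extra field and an equality check are added. The email field and the sample values are assumptions for illustration.

```python
class Person:
    def __init__(self, name: str, address: str, email: str):
        self.name = name
        self.address = address
        self.email = email  # new field: first place to touch

    def __repr__(self) -> str:
        # new field: second place to touch
        return (
            f"Person(name={self.name!r}, address={self.address!r}, "
            f"email={self.email!r})"
        )

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Person):
            return NotImplemented
        # new field: third place to touch
        return (self.name, self.address, self.email) == (
            other.name,
            other.address,
            other.email,
        )


first = Person("Jack", "New York", "jack@example.com")
second = Person("Jack", "New York", "jack@example.com")
print(first)
print(first == second)
```

Every additional field means editing all three methods, and forgetting one of them produces a bug that is easy to miss.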

This is where dataclass comes into the picture. Instead of writing all of this code, I can simply turn “Person” into a dataclass.

from dataclasses import dataclass

@dataclass
class Person:
    name: str
    address: str

def main() -> None:
    person = Person(name="Jack", address="New York")
    print(person)

if __name__ == '__main__':
    main()

Here I use “dataclass” as a decorator and define my instance variables as annotated class attributes, like above. The initializer is generated by the dataclass decorator.

Now when we run this code, it prints the required output, because the dataclass also generates a __repr__ dunder method for us.

Output:

Person(name='Jack', address='New York')
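The decorator generates __eq__ by default as well, which the short sketch below tries to show. The variable names are only illustrative.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    address: str


# The generated __init__ accepts positional or keyword arguments.
jack = Person("Jack", "New York")
same_jack = Person(name="Jack", address="New York")

# Generated __eq__: the fields are compared, not the object identity.
print(jack == same_jack)

# They are still two separate objects in memory.
print(jack is same_jack)
```

Without the decorator, the first comparison would fall back to identity and return False.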

Some Use cases

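  • You can also provide default values to the initializer.
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    address: str
    active: bool = False

# A mutable default such as a list does not work:
# this class definition raises ValueError.
@dataclass
class Person:
    name: str
    address: str
    email_addresses: list[str] = []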

NOTE:
Assigning a mutable default value such as a list creates a problem. Python evaluates default values once, when it interprets the class definition, so every instance of the person would end up with a reference to the same list. To protect you from this, dataclass refuses to create the class and raises a ValueError that tells you to use default_factory instead.
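The sketch below illustrates both halves of the problem: the shared list in a plain class, and the error that dataclass raises for the same pattern. PlainPerson and the emails field are hypothetical names used only for this demonstration.

```python
from dataclasses import dataclass


# Plain class: the default list is created once and shared by every call.
class PlainPerson:
    def __init__(self, name: str, emails: list[str] = []):
        self.name = name
        self.emails = emails


a = PlainPerson("Jack")
b = PlainPerson("John")
a.emails.append("jack@example.com")
print(b.emails)  # John now "has" Jack's address as well

# Dataclass: the same mistake is rejected when the class is defined.
try:

    @dataclass
    class Person:
        name: str
        emails: list[str] = []

except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```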

To solve this problem, dataclasses provides a function called “field” (import it from dataclasses). Inside it we can provide a “default_factory”, and we can use list as the factory function.

from dataclasses import dataclass, field

@dataclass
class Person:
    name: str
    address: str
    # list is called for every new instance, so each one gets its own list.
    email_addresses: list[str] = field(default_factory=list)

What happens is that every time a new instance is created, the generated initializer calls the factory function to build a fresh default value (in this example, list is provided as a function, not as a type).
You can also provide your own user-defined function as the default_factory.
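As a small sketch of a user-defined factory, the example below uses a made-up default_emails function. The function name and the placeholder address are assumptions for illustration only.

```python
from dataclasses import dataclass, field


def default_emails() -> list[str]:
    """Hypothetical factory: every new person starts with a placeholder."""
    return ["unknown@example.com"]


@dataclass
class Person:
    name: str
    address: str
    email_addresses: list[str] = field(default_factory=default_emails)


jack = Person("Jack", "New York")
john = Person("John", "Boston")

# Changing Jack's list does not affect John's list.
jack.email_addresses.append("jack@example.com")

print(jack.email_addresses)
print(john.email_addresses)
```

Each instance receives its own list because the factory runs once per instance rather than once per class.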

  • At the time of initializing you can override the default values as well.
    You can also set whether an instance variable is part of the initializer or not.
import uuid
from dataclasses import dataclass, field

@dataclass
class Person:
    name: str
    address: str
    active: bool = False
    # Not accepted by __init__; each instance gets a freshly generated ID.
    id: str = field(init=False, default_factory=lambda: uuid.uuid4().hex)

def main() -> None:
    # Override the default value of "active".
    person = Person(name="Jack", address="New York", active=True)
    print(person)

if __name__ == '__main__':
    main()
  • Let's take another use case: you want to generate a value, such as an ID or a search string, from other instance variables.
    Here the __post_init__ dunder method comes into play. It runs right after the generated initializer, so the other instance attributes already have their values.
from dataclasses import dataclass, field

@dataclass
class Person:
    name: str
    address: str
    active: bool = False
    search_string: str = field(init=False)

    def __post_init__(self):
        self.search_string = f"{self.name} {self.address}"
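A quick usage sketch of the same idea follows, assuming the field names from the example above. It also shows that a field declared with init=False cannot be passed to the initializer.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    name: str
    address: str
    search_string: str = field(init=False)

    def __post_init__(self) -> None:
        # Runs after the generated __init__ has set name and address.
        self.search_string = f"{self.name} {self.address}"


jack = Person(name="Jack", address="New York")
print(jack.search_string)

# search_string is not an __init__ parameter, so passing it fails.
try:
    Person(name="Jack", address="New York", search_string="custom")
except TypeError as exc:
    print(f"rejected: {exc}")
```

Note that search_string is computed only once, so it does not follow later changes to name or address.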
  • You can also add “_” or “__” in front of an attribute name to mark it as protected or private, signalling that it is not supposed to be changed from outside the class.
  • You can also exclude that field when we print a person by setting repr=False.
from dataclasses import dataclass, field

@dataclass
class Person:
    name: str
    address: str
    active: bool = False
    _search_string: str = field(init=False, repr=False)

    def __post_init__(self):
        # Use the same (underscored) name as the declared field.
        self._search_string = f"{self.name} {self.address}"
Output:

Person(name='Jack', address='New York', active=True)
  • Another use case is freezing the dataclass.
    This means that once you have initialized the object you can no longer change it; it is read-only.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Person:
    name: str
    address: str
    active: bool = False
    _search_string: str = field(init=False, repr=False)

    def __post_init__(self):
        # Plain assignment is blocked on a frozen instance, even here,
        # so the derived field is set through object.__setattr__.
        object.__setattr__(self, "_search_string", f"{self.name} {self.address}")

def main() -> None:
    person = Person(name="Jack", address="New York", active=True)
    person.name = "John"  # raises FrozenInstanceError
    print(person)

With frozen=True, running this code raises a FrozenInstanceError.

This is really useful, because in many cases we need to make sure that our data is not mutable; code that works with constant data is simpler to reason about.
You can still assign a new “Person” instance to the “person” variable, though.
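Here is a minimal sketch of how the error can be caught, and of rebinding the variable afterwards. The visited set at the end is an illustrative extra, not something from the example above.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Person:
    name: str
    address: str


person = Person(name="Jack", address="New York")

# Assigning to a field of a frozen instance is rejected.
try:
    person.name = "John"
except FrozenInstanceError as exc:
    print(f"rejected: {exc}")

# Rebinding the variable to a brand-new instance is fine.
person = Person(name="John", address="New York")
print(person)

# frozen=True together with the default eq=True makes instances hashable.
visited = {person}
print(Person("John", "New York") in visited)
```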

You can also use the replace function from dataclasses if you want to make a copy of an immutable dataclass with something changed in it.

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Person:
    name: str
    address: str
    active: bool = False

def main() -> None:
    person = Person(name="Jack", address="New York", active=True)
    # replace() takes the instance (not the class) and returns a modified copy.
    print(replace(person, name="John"))
  • You can include or exclude fields from comparison.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Person:
    name: str
    address: str
    active: bool = field(compare=False)

inst1 = Person("Person", "some address", True)
inst2 = Person("Person", "some address", False)

print(inst1 == inst2)  # o/p = True, because "active" is ignored in comparisons
  • You can also check which instance is greater or smaller by using order=True.
    Instances are compared field by field in the order the fields are defined, so a sort_index field is placed first. We need to set it after the initialization process, that is, after we have the values for the rest of the defined fields.
from dataclasses import dataclass, field

@dataclass(order=True)
class Person:
    sort_index: float = field(init=False, repr=False)
    name: str
    address: str
    cash: float = field(repr=True)

    def __post_init__(self):
        self.sort_index = self.cash

inst1 = Person("Person", "some address", 100)
inst2 = Person("Person", "some address", 200)

print(inst1 < inst2)  # o/p = True

NOTE:
You can also sort these instances on the basis of cash, with the default order being small to large.

from dataclasses import dataclass, field

@dataclass(order=True)
class Person:
    sort_index: float = field(init=False, repr=False)
    name: str
    address: str
    cash: float = field(repr=True)

    def __post_init__(self):
        self.sort_index = self.cash

inst1 = Person("Person", "some address", 1000)
inst2 = Person("Person2", "some address", 100)
inst3 = Person("Person3", "some address", 2000)
inst4 = Person("Person4", "some address", 200)

lst = [inst1, inst2, inst3, inst4]
lst.sort()
print(lst)

# o/p =
# [Person(name='Person2', address='some address', cash=100),
#  Person(name='Person4', address='some address', cash=200),
#  Person(name='Person', address='some address', cash=1000),
#  Person(name='Person3', address='some address', cash=2000)]
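If you only need a one-off ordering, or want large to small instead, a different technique is to pass a key function to sorted(). This is a swapped-in alternative rather than the sort_index approach shown above, and it does not need order=True at all. The trimmed-down Person class here is an assumption for brevity.

```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass
class Person:
    name: str
    cash: float


people = [
    Person("Person", 1000),
    Person("Person2", 100),
    Person("Person3", 2000),
]

# Sort by the cash attribute, largest first, without any __lt__ method.
richest_first = sorted(people, key=attrgetter("cash"), reverse=True)
print([p.name for p in richest_first])
```

The sort_index approach is still the better fit when the class has one natural ordering that should apply everywhere.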
  • There might be a case where your class is logically immutable but can nonetheless be mutated.
    You can force dataclass() to create a __hash__ method with unsafe_hash=True.
from dataclasses import dataclass, field

@dataclass(unsafe_hash=True)
class Person:
    name: str = field(hash=True)
    address: str
    cash: float

NOTE:
Fields marked with hash=False are left out of the hash calculation. So if two instances have the same values in the hashed fields and differ only in the fields where hash is False, their hash values will be the same.

from dataclasses import dataclass, field

@dataclass(unsafe_hash=True)
class Person:
    sort_index: float = field(init=False, repr=False, hash=False)
    name: str
    address: str
    cash: float = field(hash=False)

    def __post_init__(self):
        self.sort_index = self.cash

inst1 = Person("Person", "some address", 1000)
inst2 = Person("Person", "some address", 100)

print(hash(inst1))  # o/p: e.g. -1201635493101344805
print(hash(inst2))  # o/p: the same value; the exact number changes between runs
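Hash values matter mostly because sets and dictionaries rely on them. In the sketch below, cash is also excluded from comparison (an assumption added for this example) so that both instances count as the same set member.

```python
from dataclasses import dataclass, field


@dataclass(unsafe_hash=True)
class Person:
    name: str
    address: str
    # Assumption for this sketch: cash is ignored by both __hash__ and __eq__.
    cash: float = field(hash=False, compare=False)


inst1 = Person("Person", "some address", 1000)
inst2 = Person("Person", "some address", 100)

# Same hash and equal, so the set keeps only one of them.
unique_people = {inst1, inst2}
print(len(unique_people))
```

If cash took part in comparison but not in the hash, the two instances would share a hash yet remain distinct set members, which is valid but makes lookups slightly less efficient.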

Features available in Python 3.10 or above

Some new features were added to dataclasses in newer versions of Python, 3.10 or above. I will briefly cover some of the important ones.

  • kw_only: With this set, you can initialize an object of the class only by supplying keyword arguments.
from dataclasses import dataclass

@dataclass(kw_only=True)
class Person:
    name: str
    address: str
    active: bool = False

def main() -> None:
    person = Person(name="Jack", address="New York", active=True)
  • match_args: This one is for structural pattern matching. When it is True (which is the default), the decorator generates the __match_args__ dunder attribute, which supplies the positional arguments that we can use in structural pattern matching.
from dataclasses import dataclass

@dataclass(match_args=True)
class Person:
    name: str
    address: str
    active: bool = False

Structural Pattern Matching: this feature checks whether the value of an expression, called the subject, matches a given structure, called the pattern.
The Python documentation covers it in more detail.

  • slots: Normally, when you create an instance of a class, there is a __dict__ dunder object that holds all the instance variables.
    With slots=True, dataclasses generates a __slots__ declaration for you instead, which makes it faster to access instance variables and uses less memory. If you are doing a lot of data processing this can make a big difference.
from dataclasses import dataclass

@dataclass(slots=True)
class Person:
    name: str
    address: str
    active: bool = False

But there is also a disadvantage of using slots: they break when used in multiple inheritance, because Python cannot combine the memory layouts of two base classes that both define non-empty slots.

from dataclasses import dataclass

@dataclass(slots=True)
class PersonSlots:
    name: str
    address: str
    email: str

@dataclass(slots=True)
class EmployeeSlots:
    dept: str

# Code breaks here with:
# TypeError: multiple bases have instance lay-out conflict
class PersonEmployee(PersonSlots, EmployeeSlots):
    pass

When dealing with data-oriented classes, dataclasses are a great tool.

With data classes, you do not have to write boilerplate code to get proper initialization, representation, and comparisons for your objects.

There is a lot more to discuss about dataclasses, including the features added in Python 3.10 and newer versions.


Pravash

I am a passionate Data Engineer and Technology Enthusiast. Here I am using this platform to share my knowledge and experience on tech stacks.