Python Collections – An Introductory Guide

Collections is a built-in Python module that provides useful container datatypes. Container datatypes let us store and access values in a convenient way. You have most likely used lists, tuples, and dictionaries already. But when dealing with structured data, we often need smarter objects.

In this article, I will walk you through the data structures provided by the collections module and show, with examples, when to use each of them.

Contents

  1. namedtuple
    • What is namedtuple
    • Another way of creating a namedtuple
    • Why use namedtuple over dictionary
    • Creating a namedtuple from a python Dictionary
    • How to replace an attribute in a namedtuple
  2. Counter
  3. defaultdict
  4. OrderedDict
    • What happens when you delete and re-insert keys in OrderedDict
    • Sorting with OrderedDict
  5. ChainMap
    • What happens when we have redundant keys in a ChainMap
    • How to add a new dictionary to a ChainMap
    • How to reverse the order of dictionaries in a ChainMap
  6. UserList
  7. UserString
  8. UserDict
# Import the collections module
import collections

Let us start with the namedtuple

What is namedtuple()

You can think of namedtuple in two ways:

As an enhanced version of tuple. Or as a quick way of creating a python class with certain named attributes.

A key difference between a tuple and a namedtuple is: while a tuple lets you access data only through indices, a namedtuple also lets you access the elements by name.

You can define which attributes a namedtuple holds and then create multiple instances of it, just as you would with a class.

So, in terms of functionality, it’s more similar to a class, even though it has tuple in its name.

Let’s create a namedtuple that represents a ‘movie’ with the attributes ‘genre’, ‘rating’ and ‘lead_actor’.

# Creating a namedtuple.
# The field names are passed as a single string separated by spaces
from collections import namedtuple
movie = namedtuple('movie','genre rating lead_actor')

# Create instances of movie
ironman = movie(genre='action',rating=8.5,lead_actor='robert downey junior')
titanic = movie(genre='romance',rating=8,lead_actor='leonardo dicaprio')
seven   = movie(genre='crime',rating=9,lead_actor='Brad Pitt')

Now, you can access any detail of a movie using its field name. It’s quite convenient and user friendly.

# Access the fields
print(titanic.genre)
print(seven.lead_actor)
print(ironman.rating)

#> romance
#> Brad Pitt
#> 8.5

Another way of creating a namedtuple

Alternatively, you can pass the field names as a list of strings instead of a single space-separated string.

Let us see an example.

# Creating namedtuple by passing fieldnames as a list of strings
book = namedtuple('book',['price','no_of_pages','author'])

harry_potter = book('500','367','JK ROWLING')
pride_and_prejudice = book('300','200','jane_austen')
tale = book('199','250','christie')

print('Price of pride and prejudice is ',pride_and_prejudice.price)
print('author of harry potter is',harry_potter.author)

#> Price of pride and prejudice is  300
#> author of harry potter is JK ROWLING

The items in a namedtuple can be accessed by index as well as by field name.

print(tale[1])

#> 250
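
Because a namedtuple is still a tuple, the usual tuple operations keep working. The short sketch below uses a hypothetical Point type (not part of the examples above) to show unpacking, along with the _fields attribute and the _asdict() method that every namedtuple provides.

```python
from collections import namedtuple

# Hypothetical type, used only for illustration
Point = namedtuple('Point', ['x', 'y'])
p = Point(3, 4)

# Unpack it like a regular tuple
x, y = p
print(x, y)

#> 3 4

# The field names are stored on the class
print(Point._fields)

#> ('x', 'y')

# Convert the instance into a dictionary
print(dict(p._asdict()))

#> {'x': 3, 'y': 4}
```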

Why use namedtuple over dictionary

A major advantage of a namedtuple is that an instance takes up less memory than an equivalent dictionary.

So, when you need to hold a large number of records, namedtuples are the more memory-efficient choice.

I’ll demonstrate this in the example below.

# Create a dict and namedtuple with same data and compare the size
import random
import sys

# Create Dict
dicts = {'numbers_1': random.randint(0, 10000),'numbers_2':random.randint(5000,10000)} 
print('Size or space occupied by dictionary',sys.getsizeof(dicts))

# converting same dictionary to a namedtuple
data=namedtuple('data',['numbers_1','numbers_2'])
my_namedtuple= data(**dicts)
print('Size or space occupied by namedtuple',sys.getsizeof(my_namedtuple))

#> Size or space occupied by dictionary 240
#> Size or space occupied by namedtuple 64

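Executing the above code, you find that the namedtuple has a size of 64 bytes, whereas the dictionary occupies a much larger 240 bytes. That’s nearly 4x less memory. The exact numbers depend on your Python version and platform, and sys.getsizeof() measures only the container itself, not the objects stored inside it.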

You can imagine the effect when expanded to handle a large number of such objects.
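
To get a feel for that effect, here is a minimal sketch that builds the same 10,000 records both ways and adds up the container sizes. The Record type and the record count are assumptions made for illustration, and the byte totals will differ between Python versions, so no expected output is shown.

```python
import sys
from collections import namedtuple

# Hypothetical record type for the comparison
Record = namedtuple('Record', ['numbers_1', 'numbers_2'])

n = 10_000
as_dicts = [{'numbers_1': i, 'numbers_2': i + 1} for i in range(n)]
as_tuples = [Record(i, i + 1) for i in range(n)]

# getsizeof is shallow: it measures each container, not the integers inside
dict_bytes = sum(sys.getsizeof(d) for d in as_dicts)
tuple_bytes = sum(sys.getsizeof(t) for t in as_tuples)

print('dicts      :', dict_bytes, 'bytes')
print('namedtuples:', tuple_bytes, 'bytes')
```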

Creating a namedtuple from a python Dictionary

Did you notice how we converted a dictionary into a namedtuple using the ** operator?

All you need to do is define the structure of the namedtuple and then pass the dictionary (**dict) to it as an argument. The only requirement is that the keys of the dict match the field names of the namedtuple.

# Convert a dictionary into a namedtuple
dictionary = {'price': 567, 'no_of_pages': 878, 'author': 'cathy thomas'}

# Convert
book = namedtuple('book',['price','no_of_pages','author'])
print(book(**dictionary))

#> book(price=567, no_of_pages=878, author='cathy thomas')

How to replace an attribute in a namedtuple

What if the value of one attribute has to be changed?

A namedtuple is immutable, so you cannot modify it in place. Instead, the ._replace() method returns a new namedtuple with the updated value, which you can assign back to the same name.

# update the price of the book
my_book=book('250','500','abc')
my_book=my_book._replace(price=300)

print("Book Price:", my_book.price)

#> Book Price: 300
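
It is worth seeing what ._replace() does to the object it is called on. The sketch below (with variable names chosen here for illustration) keeps both the original and the returned object so they can be compared, and also shows that direct assignment is rejected.

```python
from collections import namedtuple

book = namedtuple('book', ['price', 'no_of_pages', 'author'])
original = book(250, 500, 'abc')

# _replace builds a new instance; `original` is left untouched
updated = original._replace(price=300)
print(original.price)
print(updated.price)

#> 250
#> 300

# Direct assignment fails because tuples are immutable
try:
    original.price = 300
except AttributeError as err:
    print('Cannot assign:', err)
```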

Counter

A counter object is provided by the collections library.

You have a list of some random numbers. What if you want to know how many times each number occurs?

Counter allows you to compute the frequency easily. It works not just for numbers but for any iterable of hashable items, such as strings and lists.

Counter is a dict subclass used to count hashable objects.

It returns a dictionary-like object with the elements as keys and their counts (the number of times each element occurs) as values.

EXAMPLES

#importing Counter from collections
from collections import Counter

numbers = [4,5,5,2,22,2,2,1,8,9,7,7]
num_counter = Counter(numbers)
print(num_counter)

#>Counter({2: 3, 5: 2, 7: 2, 4: 1, 22: 1, 1: 1, 8: 1, 9: 1})

Let’s use Counter to find the frequency of each character in a string

#counter with strings
string = 'lalalalandismagic'
string_count = Counter(string)
print(string_count)

#> Counter({'a': 5, 'l': 4, 'i': 2, 'n': 1, 'd': 1, 's': 1, 'm': 1, 'g': 1, 'c': 1})

As you saw, we can view which elements are present in a list or string, along with their counts.

What if you have a sentence and want to count its words?

Use the split() method to make a list of the words in the sentence and pass it to Counter().

# Using counter on sentences
line = 'he told her that her presentation was not that good'

list_of_words = line.split() 
line_count=Counter(list_of_words)
print(line_count)

#> Counter({'her': 2, 'that': 2, 'he': 1, 'told': 1, 'presentation': 1, 'was': 1, 'not': 1, 'good': 1})

How to find most common elements using Counter

Counter is especially useful when you need to process a large amount of data and find out how often certain elements occur. Let me show you some of its most useful methods.

Counter().most_common([n])

This returns a list of the n most common elements along with their counts, in descending order of count.

# Passing different values of n to most_common() function
print('The 2 most common elements in `numbers` are', Counter(numbers).most_common(2))
print('The 3 most common elements in `string` are', Counter(string).most_common(3))

#> The 2 most common elements in `numbers` are [(2, 3), (5, 2)]
#> The 3 most common elements in `string` are [('a', 5), ('l', 4), ('i', 2)]

The most_common() method can be used to find the most frequent item, which is handy in frequency analysis.

Counter(list_of_words).most_common(1)

#> [('her', 2)]

We can use the same method to find the most frequent characters in a string.

Counter(string).most_common(3)

#> [('a', 5), ('l', 4), ('i', 2)]

What happens if you don’t specify ‘n’ while using most_common(n)?

All the elements and their counts are returned in descending order of count.

Counter(string).most_common()

#>[('a', 5),('l', 4),('i', 2),('n', 1),('d', 1),('s', 1),('m', 1),('g', 1),('c', 1)]

The Counter().elements() method returns an iterator over the elements, repeating each one as many times as its count. Elements with a count less than one are skipped.

print(sorted(string_count.elements()))

#> ['a', 'a', 'a', 'a', 'a', 'c', 'd', 'g', 'i', 'i', 'l', 'l', 'l', 'l', 'm', 'n', 's']
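
A few other Counter behaviours are handy in practice: a missing key yields 0 instead of a KeyError, update() adds to the existing counts, and counters can be subtracted from one another. The sketch below uses a made-up inventory to illustrate them.

```python
from collections import Counter

# Hypothetical data, used only for illustration
inventory = Counter(['apple', 'pear', 'apple'])

# Missing keys return 0 rather than raising KeyError
print(inventory['banana'])

#> 0

# update() adds to existing counts instead of replacing them
inventory.update(['pear', 'pear', 'kiwi'])
print(inventory['pear'])

#> 3

# Subtraction keeps only the elements with a positive count
sold = Counter({'apple': 1, 'pear': 3})
remaining = inventory - sold
print(remaining)

#> Counter({'apple': 1, 'kiwi': 1})
```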

defaultdict

A dictionary is a collection of keys and values. (Since Python 3.7, it also preserves the order in which the keys were inserted.)

In the key: value pairs, each key must be distinct and hashable. That is why a list cannot be a key in a dictionary, as it is mutable. But a tuple can be a key, as long as its elements are hashable too.

# Dict with tuple as keys: OKAY
{('key1', 'key2'): "value"}


# Dict with list as keys: ERROR (TypeError: unhashable type: 'list')
{['key1', 'key2']: "value"}

How defaultdict is different from dict

If you try to access a key that is not present in a dictionary, it raises a KeyError. A defaultdict does not: if you access a key that is not present, it returns a default value instead.

Syntax: defaultdict(default_factory)

When we access a key that is not present, the default_factory function is called with no arguments. Its return value is stored under that key and returned.

# Creating a defaultdict and trying to access a key that is not present.
from collections import defaultdict
def_dict = defaultdict(object)
def_dict['fruit'] = 'orange'
def_dict['drink'] = 'pepsi'
print(def_dict['chocolate'])

#> <object object at 0x7f591a2f4510>

If you execute the above code, it does not give you a KeyError.
In case you want to report that the value for the requested key is not present, you can define your own function and pass it to the defaultdict.
See the example below.

# Passing a function to return default value
def print_default():
    return 'value absent'

def_dict=defaultdict(print_default)
print(def_dict['chocolate'])

#> value absent
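
In practice, the default factory is often a built-in type such as list or int. The sketch below, using a made-up word list, groups words by their first letter and tallies word lengths without ever checking whether a key already exists.

```python
from collections import defaultdict

# Hypothetical data, used only for illustration
words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']

# list() is called for each new key, so every key starts as []
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)
print(dict(by_letter))

#> {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}

# int() returns 0, which makes a simple tally
lengths = defaultdict(int)
for word in words:
    lengths[len(word)] += 1
print(dict(lengths))

#> {5: 1, 7: 1, 6: 2, 9: 1}
```

Keep in mind that merely reading a missing key from a defaultdict inserts that key with its default value.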

In all other ways, it behaves like a normal dictionary, and the same syntax is used for a defaultdict too.

Actually, it is also possible to avoid the KeyError in a normal dictionary by using the get() method.

# Make dict return a default value
mydict = {'a': 'Apple', 'b': 'Ball'}
mydict.get('c', 'NOT PRESENT')

#> 'NOT PRESENT'

OrderedDict

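Before Python 3.7, a dict was an UNORDERED collection of key-value pairs, whereas an OrderedDict maintained the ORDER in which the keys were inserted. Since Python 3.7, a regular dict also preserves insertion order, but an OrderedDict still adds order-aware features such as order-sensitive equality.

It is a subclass of dict.

I am going to create an ordinary dict and an OrderedDict to show you how both behave.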

# create a dict and print items
vehicle = {'bicycle': 'hercules', 'car': 'Maruti', 'bike': ' Harley', 'scooter': 'bajaj'}

print('This is normal dict')
for key,value in vehicle.items():
    print(key,value)

print('-------------------------------')

# Create an OrderedDict and print items
from collections import OrderedDict
ordered_vehicle=OrderedDict()
ordered_vehicle['bicycle']='hercules'
ordered_vehicle['car']='Maruti'
ordered_vehicle['bike']='Harley'
print('This is an ordered dict')

for key,value in ordered_vehicle.items():
    print(key,value)

#> This is normal dict
#> bicycle hercules
#> car Maruti
#> bike  Harley
#> scooter bajaj
#> -------------------------------
#> This is an ordered dict
#> bicycle hercules
#> car Maruti
#> bike Harley

In an OrderedDict, the order remains unchanged even after the value of a key is changed.

# Change the value of car in this ordered dictionary.
ordered_vehicle['car']='BMW'
for key,value in ordered_vehicle.items():
    print(key,value)

#> bicycle hercules
#> car BMW
#> bike Harley

What happens when you delete and re-insert keys in OrderedDict

When a key is deleted, the information about its order is also deleted. When you re-insert the key, it is treated as a new entry and corresponding order information is stored.

# deleting a key from an OrderedDict
ordered_vehicle.pop('bicycle')
for key,value in ordered_vehicle.items():
    print(key,value)

#> car BMW
#> bike Harley

On reinserting the key, it is considered as a new entry.

# Reinserting the same key and print
ordered_vehicle['bicycle']='hercules'
for key,value in ordered_vehicle.items():
    print(key,value)

#> car BMW
#> bike Harley
#> bicycle hercules

You can see that bicycle is now in the last position. The order changed because we deleted the key and inserted it again.
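
You do not have to delete and re-insert a key just to change its position. As a small sketch (the queue variable below is made up for illustration), OrderedDict offers move_to_end() and an order-aware popitem(), and it also compares order when testing equality.

```python
from collections import OrderedDict

queue = OrderedDict([('car', 'BMW'), ('bike', 'Harley'), ('bicycle', 'hercules')])

# Move an existing key to the end, or to the front with last=False
queue.move_to_end('car')
print(list(queue))

#> ['bike', 'bicycle', 'car']

queue.move_to_end('car', last=False)
print(list(queue))

#> ['car', 'bike', 'bicycle']

# popitem(last=False) removes the oldest entry first
print(queue.popitem(last=False))

#> ('car', 'BMW')

# Equality between two OrderedDicts takes the order into account
a = OrderedDict([('x', 1), ('y', 2)])
b = OrderedDict([('y', 2), ('x', 1)])
print(a == b, dict(a) == dict(b))

#> False True
```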

An OrderedDict also works well together with sorting, since it keeps the items in whatever order they are inserted.

Sorting with OrderedDict

What if you want to keep the items sorted by their keys or by their values? Build an OrderedDict from the sorted pairs and it will hold them in that order, which is often useful in data analysis.

Sort the items by KEY (in ascending order)

# Sorting items in ascending order of their keys
drinks = {'coke':5,'apple juice':2,'pepsi':10}
OrderedDict(sorted(drinks.items(), key=lambda t: t[0]))

#> OrderedDict([('apple juice', 2), ('coke', 5), ('pepsi', 10)])

Sort the pairs by VALUE (in ascending order)


# Sorting according to values
OrderedDict(sorted(drinks.items(), key=lambda t: t[1]))

#> OrderedDict([('apple juice', 2), ('coke', 5), ('pepsi', 10)])

Sort the dictionary by length of key string (in ascending order)

# Sorting according to length of key string
OrderedDict(sorted(drinks.items(), key=lambda t: len(t[0])))

#> OrderedDict([('coke', 5), ('pepsi', 10), ('apple juice', 2)])

ChainMap

ChainMap is a container datatype that groups multiple dictionaries into a single view.
If you have several related dictionaries, you can store them collectively in a ChainMap.

You can list all the underlying dictionaries of a ChainMap using the .maps attribute. The code below demonstrates this.

# Creating a ChainMap from 3 dictionaries.
from collections import ChainMap
dic1={'red':5,'black':1,'white':2}
dic2={'chennai':'tamil','delhi':'hindi'}
dic3={'firstname':'bob','lastname':'mathews'}

my_chain = ChainMap(dic1,dic2,dic3)
my_chain.maps

#> [{'red': 5, 'black': 1, 'white': 2}, {'chennai': 'tamil', 'delhi': 'hindi'}, {'firstname': 'bob', 'lastname': 'mathews'}]

You can print the keys of all the dictionaries in a ChainMap using the .keys() method.

print(list(my_chain.keys()))

#> ['firstname', 'lastname', 'chennai', 'delhi', 'red', 'black', 'white']

You can print the values of all the dictionaries in a ChainMap using the .values() method.

print(list(my_chain.values()))

#> ['bob', 'mathews', 'tamil', 'hindi', 5, 1, 2]

What happens when we have redundant keys in a ChainMap

It is possible that 2 dictionaries might have the same key. See an example below.

# Creating a chainmap whose dictionaries do not have unique keys
dic1 = {'red':1,'white':4}
dic2 = {'red':9,'black':8}
chain = ChainMap(dic1,dic2)
print(list(chain.keys()))

#> ['red', 'black', 'white']

Observe that ‘red’ is not repeated; it is listed only once.
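
When a key appears in more than one dictionary, a lookup returns the value from the first dictionary that contains it. The sketch below uses separate variable names (first, second, lookup), chosen here so that the dictionaries above stay untouched, and also shows that writes only affect the first dictionary.

```python
from collections import ChainMap

first = {'red': 1, 'white': 4}
second = {'red': 9, 'black': 8}
lookup = ChainMap(first, second)

# Lookups search the maps from left to right; the first match wins
print(lookup['red'])
print(lookup['black'])

#> 1
#> 8

# Writes only touch the first mapping, even if the key exists further down
lookup['black'] = 0
print(first)
print(second)

#> {'red': 1, 'white': 4, 'black': 0}
#> {'red': 9, 'black': 8}
```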

How to add a new dictionary to a ChainMap

You can add a new dictionary at the beginning of a ChainMap using the .new_child() method, which returns a new ChainMap. The code below demonstrates this.

# Add a new dictionary to the chainmap through .new_child()
print('original chainmap', chain)

new_dic={'blue':10,'yellow':12} 
chain=chain.new_child(new_dic)

print('chainmap after adding new dictionary',chain)

#> original chainmap ChainMap({'red': 1, 'white': 4}, {'red': 9, 'black': 8})
#> chainmap after adding new dictionary ChainMap({'blue': 10, 'yellow': 12}, {'red': 1, 'white': 4}, {'red': 9, 'black': 8})
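
A common use of .new_child() is layering temporary overrides on top of default values. Here is a minimal sketch with made-up settings; the defaults, settings and scoped names are assumptions for illustration only.

```python
from collections import ChainMap

# Hypothetical default settings
defaults = {'color': 'blue', 'size': 'medium'}
settings = ChainMap(defaults)

# new_child() with no argument puts an empty dict at the front
scoped = settings.new_child()
scoped['color'] = 'red'

print(scoped['color'])     # found in the new layer
print(scoped['size'])      # falls through to the defaults
print(settings['color'])   # the original chain is unchanged

#> red
#> medium
#> blue

# .parents returns a chain without the first mapping
print(scoped.parents == settings)

#> True
```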

How to reverse the order of dictionaries in a ChainMap

The order in which dictionaries are stored in a ChainMap can be reversed using the reversed() function. Since reversed() returns a one-time iterator, convert the result back to a list before assigning it to .maps.

# We are reversing the order of dictionaries using reversed() function
print('original chainmap',  chain)

chain.maps = list(reversed(chain.maps))
print('reversed Chainmap', str(chain))

#> original chainmap ChainMap({'blue': 10, 'yellow': 12}, {'red': 1, 'white': 4}, {'red': 9, 'black': 8})
#> reversed Chainmap ChainMap({'red': 9, 'black': 8}, {'red': 1, 'white': 4}, {'blue': 10, 'yellow': 12})

UserList

I hope you are familiar with Python lists.

A UserList is a list-like container datatype that acts as a wrapper class for lists.

Syntax: collections.UserList([list])

You pass a normal list as an argument to UserList. A copy of this list is stored in the data attribute and can be accessed through UserList.data.

# Creating a user list with argument my_list
from collections import UserList
my_list=[11,22,33,44]

# Accessing it through `data` attribute
user_list=UserList(my_list)
print(user_list.data)

#> [11, 22, 33, 44]

What is the use of UserLists

Suppose you want to double all the elements in certain lists. Or maybe you want to ensure that no element can be deleted from a given list.

In such cases, we need to add a certain ‘behavior’ to our lists, which can be done using UserList.

For example, let me show you how UserList can be used to override the functionality of a built-in method. The code below prevents appending a new value to a list.

# Creating a userlist where adding new elements is not allowed.

class user_list(UserList):
    # function to raise error while insertion
    def append(self,s=None):
        raise RuntimeError("Authority denied for new insertion")

my_list=user_list([11,22,33,44])

# trying to insert new element
my_list.append(55)

#> RuntimeError: Authority denied for new insertion

The above code raises a RuntimeError and does not allow appending. This can be helpful if, for example, you want to make sure nobody can add their name to a list after a particular deadline. So, UserList has very practical uses.

UserString

Just as UserList is a wrapper class for lists, UserString is a wrapper class for strings.

It allows you to add certain functionality or behavior to a string. You can pass any argument that can be converted to a string to this class, and you can access the resulting string through the data attribute.

# import Userstring
from collections import UserString
num=765

# passing a string-convertible argument to UserString
user_string = UserString(num)

# accessing the string stored 
user_string.data

#> '765'

As you can see in the above example, the number 765 was converted into the string ‘765’, which can be accessed through the UserString.data attribute.

How and when UserString can be used

UserString can be used to modify a string or to perform certain functions on it.

What if you want to remove a particular word from a text file, wherever it is present?

Maybe some words are out of place and need to be removed.

Let’s see an example of how `UserString` can be used to remove certain odd words from a string.

# Using UserString to remove odd words from the textfile
class user_string(UserString):

    def append(self, new):
        self.data = self.data + new

    def remove(self, s):
        self.data = self.data.replace(s, "")

text='apple orange grapes bananas pencil strawberry watermelon eraser'
fruits = user_string(text)

for word in ['pencil','eraser']:
    fruits.remove(word)

print(fruits)

#> apple orange grapes bananas  strawberry watermelon 

You can see that ‘pencil’ and ‘eraser’ were removed using the remove() method of the user_string class.
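
Removing text with str.replace() leaves the surrounding spaces behind, and it would also strip a word that happens to sit inside a longer one. As an alternative sketch, the hypothetical WordString class below splits the text and drops only whole words.

```python
from collections import UserString

class WordString(UserString):
    """Hypothetical helper that removes whole words only."""

    def remove_word(self, word):
        # split() with no argument also collapses repeated whitespace
        kept = [w for w in self.data.split() if w != word]
        self.data = ' '.join(kept)

fruits = WordString('apple orange grapes pineapple pencil eraser')
for word in ['pencil', 'apple']:
    fruits.remove_word(word)

print(fruits)

#> orange grapes pineapple eraser
```

Here ‘pineapple’ survives even though it contains ‘apple’, which a plain replace() would have cut down to ‘pine’.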

Let us consider another case. What if you need to replace a word by some other word throughout the file?

UserString makes this easy. The code below replaces a certain word throughout the text.

I have defined a method inside the class that replaces a given word with ‘The Chairman’ throughout.

# using UserString to replace the name or a word throughout.
class user_string(UserString):

    def append(self, new):
        self.data = self.data + new

    def replace(self,replace_text):
        self.data = self.data.replace(replace_text,'The Chairman')

text = 'Rajesh concluded the meeting very late. Employees were disappointed with Rajesh'
document = user_string(text)

document.replace('Rajesh')

print(document.data)
#> The Chairman concluded the meeting very late. Employees were disappointed with The Chairman

As you can see, ‘Rajesh’ is replaced with ‘The Chairman’ everywhere. In the same way, UserString lets you bundle other text-processing steps into methods of your own string class.

UserDict

It is a wrapper class for dictionaries. Its syntax and usage are similar to those of UserList and UserString.

Syntax: collections.UserDict([data])

We pass a dictionary as the argument, and its contents are stored in the data attribute of UserDict.

# importing UserDict
from collections import UserDict 
my_dict={'red':'5','white':2,'black':1} 

# Creating a UserDict
user_dict = UserDict(my_dict) 
print(user_dict.data) 

#> {'red': '5', 'white': 2, 'black': 1}

How UserDict can be used

UserDict allows you to create a dictionary modified to your needs. Let’s see an example of how UserDict can be used to override the functionality of a built-in method. The code below prevents a key-value pair from being removed with pop().

# Creating a dictionary where deletion of an item is not allowed
class user_dict(UserDict):
    # Function to stop delete/pop
    def pop(self, s = None):
        raise RuntimeError("Not Authorised to delete")

data = user_dict({'red':'5','white':2,'black':1})

# try to delete an item
data.pop('red')

#> RuntimeError: Not Authorised to delete

You will receive a RuntimeError. This will help if you don’t want to lose data.
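
Overriding pop() alone still leaves other ways to delete an item, such as the del statement. In UserDict, the pop(), popitem() and clear() methods are all built on top of __delitem__, so overriding that single method guards every one of them. The NoDeleteDict class below is a hypothetical sketch of the idea.

```python
from collections import UserDict

class NoDeleteDict(UserDict):
    """Hypothetical dictionary that refuses every kind of deletion."""

    def __delitem__(self, key):
        raise RuntimeError("Not Authorised to delete")

data = NoDeleteDict({'red': '5', 'white': 2, 'black': 1})

def delete_with_del():
    del data['red']

# Each of these ends up calling __delitem__
for attempt in (delete_with_del,
                lambda: data.pop('red'),
                data.popitem,
                data.clear):
    try:
        attempt()
    except RuntimeError as err:
        print(err)

print(len(data))

#> Not Authorised to delete
#> Not Authorised to delete
#> Not Authorised to delete
#> Not Authorised to delete
#> 3
```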

What if some keys have junk values and you need to replace them with nil or ‘0’? See the example below on how to use UserDict for the same.

class user_dict(UserDict):
    def replace(self,key):
        self[key]='0'

file= user_dict({'red':'5','white':2,'black':1,'blue':4567890})

# Replace the values of 'blue' and 'yellow' with '0' (a missing key is added)
for i in ['blue','yellow']:
    file.replace(i)

print(file)
#> {'red': '5', 'white': 2, 'black': 1, 'blue': '0', 'yellow': '0'}

The fields with junk values have been replaced with ‘0’. These are just simple examples of how a UserDict allows you to create a dictionary with the required functionality.
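
Another reason to build on UserDict is that its constructor and its update() method both store items through __setitem__, so a single override applies everywhere. The LowerKeyDict class below is a hypothetical sketch that normalises every key to lower case.

```python
from collections import UserDict

class LowerKeyDict(UserDict):
    """Hypothetical dictionary that stores every key in lower case."""

    def __setitem__(self, key, value):
        super().__setitem__(key.lower(), value)

colors = LowerKeyDict({'Red': 5})     # goes through __setitem__
colors['WHITE'] = 2                   # goes through __setitem__
colors.update({'Black': 1})           # goes through __setitem__ as well

print(colors)

#> {'red': 5, 'white': 2, 'black': 1}
```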

These are most of the container datatypes in the collections module (deque, a double-ended queue, is not covered here). Choosing the right one can make code that handles large datasets both simpler and more efficient.

Conclusion

I hope you have understood when and why to use the above container datatypes. If you have any questions, please drop them in the comments.


This article was contributed by Shrivarsheni.