Fundamentals of DataScience Week 3 - Data Objects - Sets
PadhAIdata_objects_sets

drawing

by One Fourth Labs

For more reading or to go indepth refer official reference of python language https://docs.python.org/3/reference/

Also refer to the Padhai-One Github repo for updates & other notebooks

Sets

Set data objects are mutable objects. They hold unique elements, i.e elements cannot have duplicates. Set hashes the elements and stores them, making it faster than other structures for many operations.

Sets are efficient for computating mathematical set operations such as intersection, union, difference etc.

Creating set

sets can be created in 2 ways

  1. using {} curly brackets, with elements seperated by comma.
  2. using set() function, which takes any sequence type or iterable objects (like other list, tuple) as argument and converts to set
In [1]:
# set with values
c = {1,2,3,4,5}
d = set((6,7,8,8,9,9)) # create set out of tuple (sequence type)
e = c # shallow copy from a set

# Print
print("\nc = ",c, "\nd = ",d, "\ne = ",e )
c =  {1, 2, 3, 4, 5} 
d =  {8, 9, 6, 7} 
e =  {1, 2, 3, 4, 5}

One can notice that duplicate elements from tuple got removed in set. Set can not/will not have duplicate elements.

There is catch in creating empty sets.

In [2]:
# empty set
a = set()
b = {}
c = {1,2,3,4,5}
print("Type of a:", type(a), "\nType of b:", type(b), "\nType of c:", type(c) )
Type of a: <class 'set'> 
Type of b: <class 'dict'> 
Type of c: <class 'set'>

It can be seen that the type of a is set where as b is a dictionary. when one uses {} empty it considers it as empty dictionary. When have elements seperated be , comma inside curly brakets only then it takes it as a set as seen in c

Possible way to remember this is to think set as dictionary with just keys and no values.Interpretor assumes empty {} as dictionary, and explicitly adding elements as keys with no value makes interpretor consider it as set.
From language evolution perspective, In python2 {} was reserved for dictionary and sets cannot be represented with them. One has to use set() to define a set, only in python3 this was changed for easiness.



what other objects can we convert to set?

In [3]:
a = (1,2,3,4) # tuple
b = {5,6,7,8} # set
c = {'a': 1, 'b':2, 'c':3} #dictionary
d = "truth alone triumphs"
p = set(a)
q = set(b)
r = set(c)
s = set(d)

print("p = ", p, "\nq =", q, "\nr =", r, "\ns =" ,s)
p =  {1, 2, 3, 4} 
q = {8, 5, 6, 7} 
r = {'c', 'a', 'b'} 
s = {'p', 'o', 't', 'l', 'e', 'm', 'h', 'n', 'i', 's', 'a', 'u', ' ', 'r'}

One can notice that for dictionary only the keys are taken for creating the set not the values. And for strings all the characters including 'space' is converted into elements of a string.

One can easily observe two things, firstly the ordering of the elements is not preserved, secondly all data is deduped and only unique lemets are present.

Sets have restriction on what type of objects it can have as a set. As mentioned earlier all elements of sets are hashed, which mkaes it necessary that only immutable elements can be present set.

In [4]:
a = {"apple", "orange", 'papaya', "tomato"} # strings types are immutable, hence hashable
b = {(1,2), (3,4), (5,6)} # Tuple are immutable
print("a:", a)
print("b:", b)

try:
    c = {[1,2], [4,5], [6,7]} # Lists are mutable(unhashable) so cannot be used as set element
except Exception as error:
    print("Error @@ c:", error)

try:
    d = {{1,2}, {4,5}, {6,7}} # Sets are mutable(unhashable) so cannot be used as set element
except Exception as error:
    print("Error @@ d:", error)
a: {'papaya', 'apple', 'tomato', 'orange'}
b: {(1, 2), (3, 4), (5, 6)}
Error @@ c: unhashable type: 'list'
Error @@ d: unhashable type: 'set'

Even set object cannot be nested inside a set.

Mutable objects are considered not hashable because the objects can change at any point, so hash will not be preserved. Hash values of objects in set should be persistent.




Accessing elements

Sets are not order preserving, and location of elements in memory will be based on hash values. Hence one cannot use index for accessing the set elements.

In [5]:
a = {1,2,3,4,5}
try:
    b =  a[0]
except Exception as error:
    print("Error:", error)
Error: 'set' object does not support indexing

In order to access set elements directly on has to iterate using for and in statements.

In [6]:
c = {8,2,9,4,1,6,7}

for i in c:
    print(i) # i will give all elements of c
1
2
4
6
7
8
9

Can one find if a element is present in a set?

One can do that using the in statement. Since set elements are hashed the membership testing(if elements exist or not) is much faster compared to other structures.

In [7]:
a = {"cat", "dog", "parrot"}
a_dog = "dog" in a
a_wolf = "wolf" in a

print("is dog present:", a_dog)
print("is wolf present:", a_wolf)
is dog present: True
is wolf present: False

Below is comparison between lists and sets for membership test.

Note: Access time of lists and sets are equivalent or sometimes faster for list.

In [8]:
%%timeit -n1 -r10
a = {1,2,3,4,5,6,7,8,9}
for i in range(100000):
    t = 7 in a
1 loop, best of 10: 3.9 ms per loop
In [9]:
%%timeit -n1 -r10
a = [1,2,3,4,5,6,7,8,9]
for i in range(100000):
    t = 7 in a
1 loop, best of 10: 11.4 ms per loop

In sets the check is faster since it is based on hashing, where as in list it has to check all elements sequentially.

For the nested sets, it only checks the top level objects, and will not check the objects inside.

In [10]:
b = { (1,2,3,4,5), 
      (11,12,13,14,15,16,17),
      (21,22,23,24,25,26)}
b1 = 11 in b

print("Is 11 in b:", b1)
Is 11 in b: False

As said checking b for 11 then it fails because all elements of b is a tuple and nothing matches 11.



Inserting elements

Set is mutable object, so one can add elements to existing object without creating new object.

One can use add() method to add elements to a set.
Note: Please remember that only hashable objects could be added to set

In [11]:
# Appending new element
a = {1,2,3}
a.add(55)

b = {(1,2), (5,3), (4,5)}
b.add((8,9))

# Print
print("a = ",a)
print("b = ",b)
a =  {1, 2, 3, 55}
b =  {(1, 2), (4, 5), (8, 9), (5, 3)}

What if one wants to append multiple elements to a set?

In [12]:
a = {10, 20, 30, 40}
b = {40,50,60} # Set
c = [60, 70, 70, 80] # List
a.update(b)
a.update(c)
print("a =", a)
a = {80, 50, 20, 70, 40, 10, 60, 30}

One can see that all the elements are deduped.

Both the add() and update() will function like mathematical union, owing to nature of the sets.

Note that both functions will modify the existing set object.



Deleting elements

If one needs to remove an element, one can use remove().

In [13]:
a = {11, 12, 13, 14, 15, 16, 17}
a.remove(13) # removes 13 from set
print("a =", a)
a = {11, 12, 14, 15, 16, 17}

The remove() will throw KeyError when one tries to remove an element that does not exist in the set.

In Order to avoid that one can use discard() which removes element if exists or else does nothing and will not throw error.

In [14]:
a = {11, 12, 13, 14, 15, 16, 17}
try:
    a.remove(19) # This will throw error
except KeyError as error:
    print("KeyError:", error)
KeyError: 19

But Why do we need remove in the first place we could use discard always right?

In some cases it is useful to have error. One might be trying to remove an element which is expected to be in set. If it is not there then there might be something wrong somewhere in code, and one may want to do something in this case with except block.

One can argue that we can test membership and act everytime before removing. but remove() provides simple ease of use.



One can use pop() over a set. But unlike in list where one provides index, in sets it does not take index, since indexing is not valid an meaningless in sets.

So what does it DO?
It removes element which is at random from the set.

In [15]:
a = {22,33,44,55, 66,77,88}
a.pop()

print("a =", a)
a = {66, 44, 77, 22, 55, 88}


One can use del on the object as whole.

In [16]:
c = {5,6,7,8,9}
del c # reference to object is removed

try: 
    print(c) #This will throw ERROR since c is not defined
except NameError as error:
    print("Error:", error)
Error: name 'c' is not defined

When using del on the entire object. The object remains same and exists in memory. Only the reference between variable and object is broken and variable becomes undefined.

In [17]:
d = {10, 20, 30, 40}
print("Id of `d`:", id(d))
e = d
del d # reference to object is removed

print("Id of `e`:", id(e))
print("Object in e:", e) # `e` still references the object
Id of `d`: 140416355162152
Id of `e`: 140416355162152
Object in e: {40, 10, 20, 30}


To remove all elements one can use clear() method, this will retain the old object and only remove the elements

In [18]:
a = {100, 200, 300, 400, 500, 600}
a.clear()
print("a =", a)
a = set()

At this point a question might come to mind that why should we clear elements of a set object and reuse the same object

Check the below case where we link the object in variable a to another variable b by shallow copy. And we are changing the set object in a by assigning a new set.

In [19]:
a = {11, 22, 33, 44, 55, 66, 77, 88}
b = a # a & b refers to same object
print("before modifying a: \na =", a, " id= ",id(a) )
print("b =", b, " id= ",id(b))

a = {11} # a now point to new object
print("after modifying a: \na =", " id= ",id(a) )
print("b =", " id= ",id(b))
before modifying a: 
a = {33, 66, 11, 44, 77, 22, 55, 88}  id=  140416355160808
b = {33, 66, 11, 44, 77, 22, 55, 88}  id=  140416355160808
after modifying a: 
a =  id=  140416693181576
b =  id=  140416355160808

The object initially a had, is not destroyed or modified, it still exists in memory and b still references the old object, which might not be desired in some case.

Also the link between objects a and b i.e object referenced by b is no longer associated to a.The link is broken which could be bad in some case.

In [20]:
a = {11, 22, 33, 44, 55, 66, 77, 88}
b = a # a & b refers to same object

print("Initially: \na =", a," id= ",id(a))
print(" b =", b," id= ",id(b))

a.clear()

print("clearing a: \na =", a, " id= ",id(a))
print("b =", b, " id= ",id(b))
Initially: 
a = {33, 66, 11, 44, 77, 22, 55, 88}  id=  140416355163720
 b = {33, 66, 11, 44, 77, 22, 55, 88}  id=  140416355163720
clearing a: 
a = set()  id=  140416355163720
b = set()  id=  140416355163720



Duplicating

It may not be the case always to have same object in two different variables.

Sometimes one may want to create a copy of a set and do some modification independent of other. In such cases it is better create two different objects. For which one can use copy() which creates a deep copy of the object.
Alternatively, one can use set() which is used to create new set from sequence type object.

In [21]:
a = {10, 20 , 30, 40 ,50}
b = a.copy() # deep copy
b.remove(10)

c = set(a)
c.remove(50)

print("a =", a, " id=",id(a))
print("b =", b, " id=",id(b))
print("b =", c, " id=",id(c))
a = {40, 10, 50, 20, 30}  id= 140416693181576
b = {50, 20, 40, 30}  id= 140416355163496
b = {20, 40, 10, 30}  id= 140416355163720

But one would wonder what could be the difference between copy() and set().

set() function takes any iteratable data objects, iterates through elements converts to set type object. This is same for set type objects.

whereas copy() method directly creates a copy of set object. This makes copy() method faster than set()

In [22]:
%%timeit -n1 -r10
a = {1,2,3,4,5,6,7,8,9}
for i in range(1000000):
    b = a.copy
1 loop, best of 10: 50.4 ms per loop
In [23]:
%%timeit -n1 -r10
a = {1,2,3,4,5,6,7,8,9}
for i in range(1000000):
    b = set(a)
1 loop, best of 10: 198 ms per loop



Other functionalities

One can find the total number of elements in a set using len() function, which will return the number of elements.

In [24]:
a = {1.1, 2.2, 3.3, 4.4, 5.5, 6.6}
b = {(1,2), (4,5,6)} #set with tuple objects

print("Num of elements in a:", len(a))
print("Num of elements in b:", len(b))
Num of elements in a: 6
Num of elements in b: 2

It can be seen that for b it show 2, len() considers only objects it contains, does not take into account elements present in the object. So b has two objects which is returned.



There would be cases where one will be in need to find sum of all elements in a set. For which one can use sum() function

In [25]:
a ={1.1, 2, 3.3, 4}
s = sum(a)
print("sum is ", s)
sum is  10.4

But the set has to have arithmeticaly summable object like float, int; else one will get error.

In [26]:
a = {"dog", "cat"} # string objects
try:
    sum(a) # this will throw error
except Exception as error:
    print("Error in a:", error)


b = {1,2, (1,2) }
try:
    sum(b) # this will throw error
except Exception as error:
    print("Error in b:", error)
Error in a: unsupported operand type(s) for +: 'int' and 'str'
Error in b: unsupported operand type(s) for +: 'int' and 'tuple'

In second case it has two int object and a tuple object. Though the inner tuple has int object, function doesn't not care about that. It just considers the object types in the first level. Adding up a int and tuple is not possible.



Mathematical Set Operations

Sets are datastructures that are intented to capture the Mathematical set nature. Thus python set has many methods for set operations.

All these set operations gives results as seprate objects and does not update the existing objects.

For union) of two sets A ∪ B use union()

In [27]:
a = {1,2,3,4}
b = {4,5,6,7}
# returns result as seperate object
c = a.union(b) 
d = b.union(c)

print("c=", c)
print("d=", d)
c= {1, 2, 3, 4, 5, 6, 7}
d= {1, 2, 3, 4, 5, 6, 7}

Notice that both a.union(b) and b.union(a) are same. Also mathematically set union and intersection are commutative

For intersection A ∩ B use intesection()

In [28]:
a = {1,2,3,4,5}
b = {4,5,6,7,8}
# returns result as seperate object
c = a.intersection(b)  # same as b.intersection(a)

print("c=", c)
c= {4, 5}

For checking if two sets are disjoint (A∩B=∅) or not use isdisjoint()

In [29]:
a = {1,2,3,4,5}
b = {4,5,6,7,8}
c = {10, 11, 12, 13}

## a.isdisjoint(b) same as b.isdisjoint(a)
print("A and B are Disjoint sets:", a.isdisjoint(b))
print("B and C are Disjoint sets:", b.isdisjoint(c))
A and B are Disjoint sets: False
B and C are Disjoint sets: True

For checking subsets A ⊆ B use issubset()

Note the subset is not commutative as union, intersection. So a.issubset(b) not equal to b.issubset(a)

In [30]:
a = {1,2,3,4,5}
b = {2,3}

##  a.issubset(b) not equal to b.issubset(a)
print("A is a subset of B:", a.issubset(b))
print("B is a subset of A:", b.issubset(a))
A is a subset of B: False
B is a subset of A: True

For checking superset A ⊇ B use issuperset()

Note the subset is not commutative as union, intersection. So a.issuperset(b) not equal to b.issuperset(a)

In [31]:
a = {1,2,3,4,5}
b = {2,3}

##  a.issuperset(b) not equal to b.issuperset(a)
print("A is a superset of B:", a.issuperset(b))
print("B is a superset of A:", b.issuperset(a))
A is a superset of B: True
B is a superset of A: False

For finding set difference) between A and B use difference() A\B = {x ∈ A | x ∉ B}

Set difference is not commutative.

In [32]:
a = {1,2,3,4}
b = {3,4,5,6}

##  a.difference(b) not equal to b.difference(a)
print("A difference B:", a.difference(b))
print("B difference A:", b.difference(a))
A difference B: {1, 2}
B difference A: {5, 6}

For finding symetric difference of sets use symmetric_difference()

A △ B = (A\B) ∪ (B\A)

Symmetric difference is commutative

In [33]:
a = {1,2,3,4}
b = {3,4,5,6}

##  a.symmetric_difference(b) not equal to b.symmetric_difference(a)
print("A symmetric_difference B:", a.symmetric_difference(b))
print("B symmetric_difference A:", b.symmetric_difference(a))
A symmetric_difference B: {1, 2, 5, 6}
B symmetric_difference A: {1, 2, 5, 6}


There will be requirements to update an existing set with results union, intersection, difference operation between two set.

Typical way could be seen below

In [34]:
a = {1,2,3,4,5}
b = {4,5,6,7}

temp = a.intersection(b)
a.clear() # clear elements without changing object ID

a.update(temp)
print("a=", a)
a= {4, 5}

This pattern is required in many cases, there exist a direct update methods for each methods that involves creating new set.

For union we can directly use update() seen in inserting elements, since sets are unique, elements not present inset gets added which is equivalent to union operation. For other methods there exists intersection_update(), difference_update(), Symmetric_difference_update()

In [35]:
a = {1,2,3,4,5}
b = {4,5,6,7}
a.update(b) # union and update

c = {11,12,13,14,15}
d = {14,15,16,17}
c.intersection_update(d) # intersection and update

e = {11,22,33,44,55}
f = {44,55,66,77}
e.difference_update(f) # difference and update

g = {10,20,30,40,50}
h = {40,50,60,70}
g.symmetric_difference_update(h) # symmetric difference and update

print("a =", a)
print("c =", c)
print("e =", e)
print("g =", g)
a = {1, 2, 3, 4, 5, 6, 7}
c = {14, 15}
e = {33, 11, 22}
g = {70, 10, 20, 60, 30}

Note that all updates change the object whose member method is invoked, i.e a.update(b) changes a not b.




Operators

The result will be stored in a new set object, existing objects will not get modified.

Only - operator can be used on set data types. The reslut will be equivalent to set difference. A\B = {x ∈ A | x ∉ B}

In [36]:
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
c = a - b
print("c =", c)
c = {1, 2}

One cannot use / or * or + operators on a set

In [37]:
a = {4, 5}
b = {1,2}
try:
    c = a/b 
except Exception as error:
    print("Error on /:", error)

try:
    c = a+b 
except Exception as error:
    print("Error on +:", error)

try:
    c = a*b 
except Exception as error:
    print("Error on *:", error)
Error on /: unsupported operand type(s) for /: 'set' and 'set'
Error on +: unsupported operand type(s) for +: 'set' and 'set'
Error on *: unsupported operand type(s) for *: 'set' and 'set'



Frozen Set

Frozen set is immutable set type. All other properties remain same as set except ones that rely on mutable nature of sets.

Creating set

Frozen set can be created only by

  1. using frozenset() function, which takes any sequence type or iterable objects (like other list, tuple) as argument and converts to set
In [38]:
# set with values
c = frozenset([1,2,3,4,5]) #create set out of list
d = frozenset((6,7,8,8,9,9)) # create set out of tuple (sequence type)
e = c # shallow copy from a set

# Print
print("\nc = ",c, "\nd = ",d, "\ne = ",e )
c =  frozenset({1, 2, 3, 4, 5}) 
d =  frozenset({8, 9, 6, 7}) 
e =  frozenset({1, 2, 3, 4, 5})

Frozen Sets have same restrictions as set on what type of objects it can have. Only immutable and hashable object can be elements of frozenset.

In [39]:
a = {"apple", "orange", 'papaya', "tomato"} # strings types are immutable, hence hashable
b = {(1,2), (3,4), (5,6)} # Tuple are immutable
print("a:", a)
print("b:", b)

try:
    c = {[1,2], [4,5], [6,7]} # Lists are mutable(unhashable) so cannot be used as set element
except Exception as error:
    print("Error @@ c:", error)

try:
    d = {{1,2}, {4,5}, {6,7}} # Sets are mutable(unhashable) so cannot be used as set element
except Exception as error:
    print("Error @@ d:", error)
a: {'papaya', 'apple', 'tomato', 'orange'}
b: {(1, 2), (3, 4), (5, 6)}
Error @@ c: unhashable type: 'list'
Error @@ d: unhashable type: 'set'

If one recollects set type data cannot be nested in set since sets are mutable.

Since FrozenSet are immutable they can be nested as elements of both set and frozenset.

In [40]:
a = frozenset((1,2,3))
b = frozenset((4,5,6))
c = {a , b} # Frozen set inside set
d = frozenset((a,b))

print("c =", c, "\nd =", d)
c = {frozenset({1, 2, 3}), frozenset({4, 5, 6})} 
d = frozenset({frozenset({1, 2, 3}), frozenset({4, 5, 6})})



Handling Frozensets

Accessing

Accessing the frozen set is similar to regular sets.

In [41]:
c = frozenset((8,2,9,4,1,6,))
for i in c:  # accessing elements
    print(i) # i will give all elements of c

d = 8 in c # Membership test
print("Is 8 in c: ", d)
1
2
4
6
8
9
Is 8 in c:  True

Inserting

Since Frozen sets are immutable inserting elements is not possible. One has to convert to a set, update elements and cover t back to frozensets.

In [42]:
a  = frozenset([1,2,3])
b = set(a)
b.add(4)
a = frozenset(b)
print("a =", a)
a = frozenset({1, 2, 3, 4})

Deleting

Only the entire object can be deleted. since Frozen sets are immutable.

In [43]:
c = frozenset((5,6,7,8,9))
del c # reference to object is removed

try: 
    print(c) #This will throw ERROR since c is not defined
except NameError as error:
    print("Error:", error)
Error: name 'c' is not defined

Like all other data objects del removes the reference to the object from the variable.

Duplicating

Both shallow copy and deep copy have same behaviour as of sets.

In [44]:
a = frozenset((1,2,3))
c = a.copy() # deep copy
d = a # shallow copy
del a 
print("c =", c)
print("d =", d)
c = frozenset({1, 2, 3})
d = frozenset({1, 2, 3})


Operations on Frozen sets

Set methods the available in set objects for math operations are alos available in frozenset aswell. Refer the set for usage.

Methods

The operations can also be carried out between sets and frozenset. The return type will be of type of object from which the method is invoked.

In [0]:
a = {1, 2, 3, 4}
b = frozenset({3, 4, 5, 6})

c = b.union(b) # between frozenset and frozenset
d = a.union(b) # returns set type object
e = b.union(a) # returns frozenset type object

All the update() based methods are not available since frozensets are immutable.

In [46]:
a = frozenset((1,2,3))

try:
    a.update({1})
except Exception as error:
    print("Error:", error)
Error: 'frozenset' object has no attribute 'update'

Functions

Both len() and sum() functions returns similar results and behaves same as sets.

In [47]:
 a = frozenset([1,2,3,4])
 print("Length:", len(a) )
 print("Sum:", sum(a) )
Length: 4
Sum: 10

Operators

Similar to set only the - can be used for set difference. All other operators like +, *, / are not applicable.

In [48]:
a = {1, 2, 3, 4}
b = frozenset({3, 4, 5, 6})
c = a - b
print("c =", c)
c = {1, 2}

Similar to methods the return type will be based on the first operand, (in this case type of a)




Please refer to the Padhai-One Github repo for updates & other notebooks