Making crawling faster

We used execution-time measurement as a tool to gauge list index performance, and we did the same for the dictionary index. This is a reliable metric as long as all results are collected on the same machine and environment and background activity is minimal. However, we can't always use it, and in some cases it doesn't make sense. For example, we can't use timeit to directly measure crawl_web performance.

Why? Because crawl_web sends network requests to fetch content, and communication over a network takes a variable amount of time. These fluctuations can skew the results, so it's not a good idea to rely on such measurements directly. We will not go into how we can still use execution measurement in this case.
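For contrast, the kind of code that timeit measures well is purely in-memory work, such as the index lookups mentioned earlier. The snippet below is a minimal sketch of that idea; the numbers list and the lookup_last function are made up for illustration and are not part of the crawler.

```python
import timeit

# Hypothetical data for illustration: a list of 10,000 integers.
numbers = list(range(10_000))

def lookup_last():
    # Worst case for a list lookup: the value is at the very end.
    return numbers.index(9_999)

# Call lookup_last 100 times and get the total elapsed time in seconds.
elapsed = timeit.timeit(lookup_last, number=100)
print(f"100 lookups took {elapsed:.6f} seconds")
```

Because nothing here touches the network, repeated runs on the same machine should give broadly comparable numbers, which is not the case for crawl_web.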

Set

Set is another complex datatype in Python. Sets are similar to lists, except that they are unordered, hold each element only once, and their elements must be immutable (that means you cannot change/mutate an element of a set once it has been added). However, you can add/remove elements from the set.

In other words:

A set is a mutable, unordered group of elements, where the elements themselves are immutable.
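To make this definition concrete, here is a small sketch that tries to store an immutable tuple and a mutable list in a set. The extra character names are only illustrative.

```python
cartoons = set(['Tom', 'Jerry'])

# A tuple is immutable, so it can be stored as an element.
cartoons.add(('Spike', 'Tyke'))

# A list is mutable, so Python refuses to store it in a set.
try:
    cartoons.add(['Butch', 'Toodles'])
    list_was_added = True
except TypeError:
    list_was_added = False

print(list_was_added)   # False
print(len(cartoons))    # 3
```

The set itself changed (it grew from two elements to three), while the rejected list shows that each element has to be an immutable value.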

A common way of creating a set in Python is to pass an iterable, such as a list, to the built-in set() function.

cartoons = set(['Tom', 'Jerry', 'Droopy'])
print(cartoons)    # Output: {'Jerry', 'Tom', 'Droopy'}

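Notice that the order in the output is different from the order we used when initializing the set. Sets are unordered, and the order you see may even differ from one run to the next. Dictionaries behaved the same way in older versions of Python, because sets and dictionaries are both built on hash tables. Since Python 3.7, dictionaries preserve insertion order, but sets still do not.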
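Since the order of a set cannot be relied on, code should not depend on it. The sketch below shows two consequences, assuming the same cartoons data as above: two sets built in different orders compare as equal, and sorted() can be used when a predictable order is needed for display.

```python
first = set(['Tom', 'Jerry', 'Droopy'])
second = set(['Droopy', 'Tom', 'Jerry'])

# Order plays no role when two sets are compared.
print(first == second)      # True

# sorted() returns a new list with a predictable order.
print(sorted(first))        # ['Droopy', 'Jerry', 'Tom']
```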

.add(...)

To add a new element to a set, we use the add method, much like we use append with a list. Unlike a list, a set doesn't guarantee that the added element will be placed at the end.

Let's take our cartoons example from above and add Dog to it.

cartoons = set(['Tom', 'Jerry', 'Droopy'])
print(cartoons)         # Output: {'Jerry', 'Tom', 'Droopy'}
cartoons.add('Dog')
print(cartoons)         # Output: {'Jerry', 'Tom', 'Dog', 'Droopy'}

Let's add Dog again and see what happens:

cartoons = set(['Tom', 'Jerry', 'Droopy'])
print(cartoons)         # Output: {'Jerry', 'Tom', 'Droopy'}
cartoons.add('Dog')
print(cartoons)         # Output: {'Jerry', 'Tom', 'Dog', 'Droopy'}
cartoons.add('Dog')
print(cartoons)         # Output: {'Jerry', 'Tom', 'Dog', 'Droopy'}

Although we attempted to add Dog twice to cartoons, it appears only once in the set. This happens because, like a set in mathematics, a Python set is a collection of unique items.
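The same behaviour is handy for dropping duplicates from a collection. The following is a minimal sketch using a made-up list of links; the file names are hypothetical and only serve as an illustration.

```python
# Hypothetical links for illustration; 'a.html' appears twice.
links = ['a.html', 'b.html', 'a.html', 'c.html']

seen = set()
for link in links:
    seen.add(link)      # adding a duplicate has no effect

print(len(links))       # 4
print(len(seen))        # 3
```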

.remove(...)

To remove an element from a set, we can use the remove method. It takes the element that needs to be removed.
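Continuing with the cartoons example, here is a short sketch of remove in action. It also shows what happens when the element is missing, and the related discard method, which is a standard set method not covered in the text above.

```python
cartoons = set(['Tom', 'Jerry', 'Droopy'])

cartoons.remove('Droopy')
print(sorted(cartoons))     # ['Jerry', 'Tom']

# remove raises KeyError when the element is not in the set.
try:
    cartoons.remove('Dog')
    missing_raised = False
except KeyError:
    missing_raised = True
print(missing_raised)       # True

# discard removes the element if present and does nothing otherwise.
cartoons.discard('Dog')
print(sorted(cartoons))     # ['Jerry', 'Tom']
```

Use remove when a missing element should be treated as an error, and discard when it is fine for the element to be absent.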
