Making lookup faster

Measure list index performance

Let's build a little complex index and measure execution time. But first let's build a small index:

test_index = []

add_page_to_index(test_index, 'http://eatthis.com', """
Another strategy is to ignore the fact that you are slowly killing yourself by not sleeping and
exercising enough. That frees up several hours a day. The only downside is that you get fat and die.
--- Scott Adams on Time Management
""")

add_page_to_index(test_index, 'http://libquotes.com', """
Good judgement comes from experience, experience comes from bad judgement. If things aren't going
well it probably means you are learning a lot and things will go better later.
--- Randy Pausch
""")

Now let's crunch the numbers:

import timeit

for query in ['good', 'bad', 'you']:
    print(query, timeit.timeit('lookup(test_index, "' + query + '")', globals=globals()))

Output:

good 2.540750673
bad 1.9949304300000001
you 0.5010264919999994

We are using three queries. good is not part of test_index, bad has one occurance while you has several occurances. Looking at results you can see, our lookup function takes ~2s in worst case scenerio (result might different on your machine).

Now let's add few more pages to index and run our queries again.

add_page_to_index(test_index, 'http://libquotes.com', """
My own computational world is a strange blend of Plan 9, Windows, and Inferno. I very much admire
Linux's growth and vigor. Occasionally, people ask me much the same question [about Linux], but
posed in a way that seems to expect an answer that shows jealousy or irritation about Linux vs.
Unix as delivered and branded by traditional companies. Not at all; I think of both as the
continuation of ideas that were started by Ken and me and many others, many years ago.
--- Denis Ritchie
""")

add_page_to_index(test_index, 'http://libquotes.com', """
Computer science research is different from these more traditional disciplines. Philosophically it
differs from the physical sciences because it seeks not to discover, explain, or exploit the natural
world, but instead to study the properties of machines of human creation. In this it as analogous to
mathematics, and indeed the "science" part of computer science is, for the most part mathematical in
spirit. But an inevitable aspect of computer science is the creation of computer programs: objects
that, though intangible, are subject to commercial exchange.
""")

Let's measure execution time again:

good 6.463424404
bad 1.8868498199999992
you 0.4186478089999994

This time we can see while bad and you has same position in the index, execution time are similar while in case of good (which is missing from index), execution time increased by almost 156%.

How can we make lookup faster? Why is it so slow?

It's slow because it has to go through the whole of the for loop

for entry in index:
    if ...

to find out if a keyword isn't there. This isn't how you use an index in real life. When you pick up a book, and look in the index, you don't start at A and work your way all the way through to Z to know if an entry isn't there. You go directly to where it should be. You can jump around the index because it is in sorted order. You know where the entry belongs because it is in alphabetic order. If it isn't where it belongs, it isn't in the index anywhere.

You could do something similar with the index in your code. If you had a sorted index, you could go directly to the entry you wanted. Sorting is an interesting problem, but it won't be covered in this training. Instead of having to keep the index in a specific order, you'll learn another way to speed up index lookup.

Dictionary (`dict`)

You've already seen two complex types in Python, string type and list type. The dict type is another. These three types have many things in common but other things are different. A dictionary aka dict is an unordered and mutable Python container that stores mappings of unique keys to values.

To create a string, you have a sequence of characters inside quotes ''. To create a list, you use square brackets [] and have a sequence of elements inside the square brackets, separated by commas. The elements can be of any type, unlike with a string which has to be made up of characters. The dictionary type is created using curly brackets {}, and consists of <key>:<value> pairs. The keys can be any immutable type, such as a string or a number, and the values can be of any type.

Recall a strings is immutable which means that once it is created, the characters can not be changed. A list is mutable, which means it can be changed once it's been created. A dictionary is also mutable and its key:value pairs are updated. In other words, if the key isn't in the list, it's created, and if it is, it's changed to the new value. The differences, and similarities between a string, list and dictionary are summarised below.

String	List	Dictionary
`'hello'`	`['alpha', 23]`	`{'hydrogen':1, 'helium':2}`
sequence of characters	list of elements	set of `<key:value>` pairs
immutable	mutable	mutable
`s[i]` ith character of `s`	`p[i]` ith element of `p`	`d[k]` value associated with the key `k`
Characters can't be replaced as strings are immutable	`p[i] = u` replaces the value of the ith element of `p` with `u`	`d[k] = v` updates the value associated with `k` to be `v`

Using Dictionaries

You can create a dictionary using {}. For the example below, the key:value pairs are the chemical elements along with their atomic numbers.

elements = {'hydrogen': 1, 'helium': 2, 'carbon': 6}
print(elements)  # Output (1)

Output:

{'helium': 2, 'hydrogen': 1, 'carbon': 6}

Unlike with a list, when the elements are printed out, the key:value pairs can be in a different order from the order they were entered into the dictionary. When you look up a chemical element in the dictionary, the value associated with that element is returned.

print(elements['hydrogen'])  # Output: 1
print(elements['carbon'])    # Output: 6

What do you think will happen when you try to lookup an element which is not in the list?

print(elements['lithium'])  # Output: (1)

Output:

Traceback (most recent call last):
 File "__main__", line 5, in <module>
  print(elements['lithium'])
KeyError: 'lithium'

You get a KeyError which tells you that 'lithium' is not in the dictionary. Unlike in your lookup where you defined it to return None if the key is not there, you get an error if a key is not in a dictionary and you try to do a lookup on it.

To prevent this error, you can check if a key is in the dictionary using in, just like with lists. Just like for lists, it will return True if the key in in the list and False if it is not.

print('lithium' in elements)  # Output: False

As a dictionary is mutable, it can be added to and changed. Using elements['lithium'] on the left hand side of an assignment does not cause an error even though 'lithium' is not in the dictionary. Instead it adds the key:value pair 'lithium': 3 to the dictionary.

elements['lithium'] = 3
elements['nitrogen'] = 8
print(elements)  # Output: (1)
print element['nitrogen']

Output:

{'helium': 2, 'lithium': 3, 'hydrogen': 1, 'nitrogen': 8, 'carbon': 6}

Although this gives the output expected, the atomic number of 'nitrogen' is actually 7 and not 8, so it needs to be changed. As 'nitrogen' is already in the dictionary, this time

elements['nitrogen'] = 7

doesn't create a new key:value pair, but instead it updates the value associated with 'nitrogen'. To see that, you can see print the value

print(element['nitrogen'])  # Output: 7
print(elements)             # Output: (1)

Output:

{'helium': 2, 'lithium': 3, 'hydrogen': 1, 'nitrogen': 7, 'carbon': 6}

Q5.3: Population

Define a dictionary, population, that provides information on the world's largest cities. The key is the name of a city (a string), and the associated value is its population in millions.

Shanghai 17.8
Istanbul 13.3
Karachi 13.0
Mumbai 12.5

If you define your dictionary correctly, you should be able to test it using print(population['Karachi'] for which you should get the output 13.0.

Answer

population = {}
population['Shanghai'] = 17.8 # the population of Shanghai is 17.8 million
population['Istanbul'] = 13.3
population['Karachi'] = 13.0
population['Mumbai'] = 12.5

print(population['Shanghai'])               # Output: 17.8
print(population['Karachi'])                # Output: 13.0

A Noble Gas

Now to return to the elements dictionary, but this time, to make it more interesting. The chemical element dictionary from earlier just contains elements and their atomic numbers.

elements = {'hydrogen': 1, 'helium': 2, 'carbon': 6}
elements['lithium'] = 3
elements['nitrogen'] = 8

print(elements['nitrogen'])         # Output: 8
elements['nitrogen'] = 7
print(elements['nitrogen'])         # Output: 7

Values don't have to be numbers or strings. They can be anything you want. They can even be other dictionaries. The next example will have atomic symbols as keys with associated values which are dictionaries. First, an empty dictionary is created and then hydrogen, with key 'H' and helium, with key 'He' are added to it.

elements = {}
elements['H'] = {'name': 'Hydrogen', 'number': 1, 'weight': 1.00794}
elements['He'] = {'name': 'Helium', 'number': 2, 'weight': 4.002602,
                  'noble gas': True}

The code:

elements['H'] = {'name': 'Hydrogen', 'number': 1, 'weight': 1.00794}

sets 'H' as the key with associated value the dictionary {'name': 'Hydrogen', 'number': 1,'weight':1.00794}. This dictionary has three entries with the keys name,number and weight. For helium, 'He' is the key and a dictionary containing the same keys as 'H' but with an extra entry which has key noble gas and value of True.

The dictionary of key:value pairs, atomic_symbol:{dictionary} is shown below.

{'H': {'name': 'Hydrogen', 'weight': 1.00794, 'number': 1},
'He': {'noble gas': True, 'name': 'Helium', 'weight': 4.002602, 'number': 2}}

To see the element hydrogen, look up its key which is 'H'. print(elements['H'])

{'name': 'Hydrogen', 'weight': 1.00794, 'number': 1}

Note that the elements appear in a different order from the order they were input as elements['H'] is a dictionary.

To look up the name of the element with symbol H, use

print(elements['H']['name'])    # Output: Hydrogen

where elements['H'] is itself a dictionary and name is one of its keys. You could also lookup the weights of hydrogen and of helium, or check if helium is a noble gas.

print(elements['H']['weight'])      # Output: 1.00794
print elements['He']['weight']      # Output: 4.002602
print elements['He']['noble gas']   # Output: True

What happens if you try to check if hydrogen is a noble gas?

print(elements['H']['noble gas'])  # Output: (1)

Output:

Traceback (most recent call last):
 File "__main__", line 11, in
<module>
  print(elements['H']['noble gas'])

KeyError: 'noble gas'

It's the same error as for the attempted lookup of lithium in the dictionary of elements which didn't include it. It tries to look up noble gas but it doesn't exist in the dictionary which is associated with the key H.

Modifying the Search Engine

Modifying the search engine code from the previous unit to use dictionary indexes instead of list indexes has the advantage of doing lookups in constant time rather than linear time.

Q5.4: Modifying the Search Engine

Which of the procedures in your search engine do you need to change to make use of a dictionary index instead of a list index?

A): get_all_links
B): crawl_web
C): add_page_to_index
D): add_to_index
E): lookup

Answer

The answers are B), _D) and E).

Q5.5: Changing `crawl_web`

Change the crawl_web procedure to use a dictionary rather than a list index.

Answer

def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    index = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            content = get_page(page)
            add_page_to_index(index, page, content)
            union(tocrawl, get_all_links(content))
            crawled.append(page)

return index

In this code you initialized index with an empty list and just used it to pass into add_page_to_index. To change index to use a dictionary, just change the square brackets to be curly brackets:

index = {}

Now, instead of starting with an empty list, you are starting with an empty dictionary. This is the only change you need to make to index.

Q5.6: Changing `add_to_index`

Change the add_to_index procedure to use a dictionary rather than a list index. This does not require any loop.

Answer

Change add_to_index is a little more complicated. In add_to_index, what happens to each page is that you crawl add_to_index, passing in the index, which is now a dictionary.

def add_to_index(index, keyword, url):
    for entry in index:
        if entry[0] == keyword:
            entry[1].append(url)
            return
    # not found, add new keyword to index
    index.append([keyword, [url]])

Before, you had add_to_index, which took in an index, a keyword and a url – it will still take the same parameters. However, when index was a list, your code was written such that you went through all of the entries in the index, checked for each one to see if it matched the keyword you were looking for. If it did, then you add the url, but if you got to the end without finding it then you append a new entry, which is the keyword and the list of urls, containing just the first url.

Here is how you can change this code to work with the dictionary index. Remember that with a dictionary you do not need to loop through anything. With the dictionary, the built in in operation serves this purpose. So, instead of looping, now you can check right away if a keyword is in the index:

def add_to_index(index, keyword, url):
    if keyword in index:
        index[keyword].append(url) # this will look up in the dictionary the entry that corresponds to index, which is going to be the list of urls
    else: # not found, add new keyword to index
        index[keyword] = [url]

This code is a lot simpler and is going to run a lot faster because you don't have to loop through manually. Because of the dictionary you can right away look up whether or not the keyword is in the index. If it is, you can find out what the value is by using the dictionary lookup and then append the new url to the list of urls associated with that keyword. If it is not found, you can create a new entry, index[keyword] = [url], using the dictionary syntax that contains just that url.

Q5.7: Changing `lookup`

Change the lookup procedure to use a dictionary rather than a list index. This does not require any loop. Make sure an error is not raised if the keyword is not in the index; in that case, return None.

Answer

def lookup(index, keyword):
  if keyword in index: # instead of loop, check to see if keyword is in the index
    return index[keyword] # if it is, use dictionary lookup
  else:
    return None # if the keyword is not in the index

Measure dict index performance

Let's perform the same tests that we did earlier with list index, but this time using dictionary index.

test_index = {}

add_page_to_index(test_index, 'http://eatthis.com', """
Another strategy is to ignore the fact that you are slowly killing yourself by not sleeping and
exercising enough. That frees up several hours a day. The only downside is that you get fat and die.
--- Scott Adams on Time Management
""")

add_page_to_index(test_index, 'http://libquotes.com', """
Good judgement comes from experience, experience comes from bad judgement. If things aren't going
well it probably means you are learning a lot and things will go better later.
--- Randy Pausch
""")

Measure execution time on smaller index:

import timeit

for query in ['good', 'bad', 'you']:
    print(query, timeit.timeit('lookup(test_index, "' + query + '")', globals=globals()))

Output:

good 0.11618620399999999
bad 0.140563376
you 0.12144473600000005

You can see, lookup function is taking almost similar time for all the queries. Now let's add few more pages to index and run our queries again.

add_page_to_index(test_index, 'http://libquotes.com', """
My own computational world is a strange blend of Plan 9, Windows, and Inferno. I very much admire
Linux's growth and vigor. Occasionally, people ask me much the same question [about Linux], but
posed in a way that seems to expect an answer that shows jealousy or irritation about Linux vs.
Unix as delivered and branded by traditional companies. Not at all; I think of both as the
continuation of ideas that were started by Ken and me and many others, many years ago.
--- Denis Ritchie
""")

add_page_to_index(test_index, 'http://libquotes.com', """
Computer science research is different from these more traditional disciplines. Philosophically it
differs from the physical sciences because it seeks not to discover, explain, or exploit the natural
world, but instead to study the properties of machines of human creation. In this it as analogous to
mathematics, and indeed the "science" part of computer science is, for the most part mathematical in
spirit. But an inevitable aspect of computer science is the creation of computer programs: objects
that, though intangible, are subject to commercial exchange.
""")

Let's measure execution time again:

good 0.16273928399999998
bad 0.14643949500000003
you 0.11602338800000006

See, results are still the same even index size is bigger. Previously with list index, we saw 156% increase in execution time for worst case scenerio.

Making lookup faster

Measure list index performance

Dictionary (dict)

Using Dictionaries

Q5.3: Population

A Noble Gas

Modifying the Search Engine

Q5.4: Modifying the Search Engine

Q5.5: Changing crawl_web

Q5.6: Changing add_to_index

Q5.7: Changing lookup

Measure dict index performance

Dictionary (`dict`)

Q5.5: Changing `crawl_web`

Q5.6: Changing `add_to_index`

Q5.7: Changing `lookup`