Building the Web Index

To build your web index you need a way to separate out all the words on a web page. You could build a procedure to do this using the concepts you've already seen; however, Python has a built-in operation that makes this much simpler.

.split(...)

When you invoke the split operation on a string, the output is a list of the words in the string.

<string>.split()
# results in [<word>, <word>, ... ]

For example,

quote = "Programs must be written for people to read, and only incidentally for machines to execute. --- Harold Abelson"
print(quote.split())
# -> ['Programs', 'must', 'be', 'written', 'for', 'people', 'to', 'read,', 'and', 'only', 'incidentally', 'for',
# 'machines', 'to', 'execute.', '---', 'Harold', 'Abelson']

This operation does a pretty good job of separating out the words in the list so that they will be useful. However, in the case of 'read,', which was followed by a comma in the quote, you would not want to include the comma in the keyword. While this isn't perfect, it is going to be good enough for now.
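If you did want to deal with the trailing punctuation, one possible refinement is to strip punctuation from the ends of each token after splitting. The helper below, split_words, is a hypothetical sketch (not part of the course code) using Python's built-in str.strip and the string module:

```python
import string

def split_words(text):
    """Split text into words, stripping punctuation from each end.

    split_words is an illustrative helper, not part of the course code.
    """
    words = []
    for token in text.split():
        word = token.strip(string.punctuation)
        if word:  # skip tokens that were pure punctuation, like '---'
            words.append(word)
    return words

quote = "Programs must be written for people to read, and only incidentally for machines to execute."
print(split_words(quote))
```

Note that str.strip only removes characters from the ends of a token, so words with internal punctuation, like "aren't", are left intact.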

Here is another example of how split works, using triple quotes ('''). Using the triple quotes you can define one string over several lines:

quote = '''Walking on water and developing software from a specification are easy if both are frozen. 
(Edward V. Berard)'''
print(quote.split())
# -> ['Walking', 'on', 'water', 'and', 'developing', 'software', 'from', 'a', 'specification',
# 'are', 'easy', 'if', 'both', 'are', 'frozen.', '(Edward', 'V.', 'Berard)']

This still has a similar problem to the first example: the parentheses are included in the words '(Edward' and 'Berard)'.

Quiz 4.4

Define a procedure, add_page_to_index, that takes three inputs:

  • index
  • url (string)
  • content (string)

It should update the index to include all of the word occurrences found in the page content by adding the url to the word's associated url list. For example:

index = []
add_page_to_index(index, 'fake.test', "This is a test")
print(index)     # -> [['This', ['fake.test']], ['is', ['fake.test']], ['a', ['fake.test']], ['test', ['fake.test']]]
print(index[0])  # -> ['This', ['fake.test']]
print(index[1])  # -> ['is', ['fake.test']]
Answer

The goal is to define a procedure, add_page_to_index, which takes in three inputs, the index, the url and the content at that url. This takes two steps:

  1. split the page into its component words,
  2. then add each word along with the url to the index.

The split method divides the content into its component words, while the procedure add_to_index adds a word and a url to the index. However, that alone is not sufficient for the second step: you still need to do it for each of the words found during the first step, which a for loop will take care of. Remember that the point of this procedure is to modify the index, so it has no return value.
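For reference, add_to_index from the earlier lesson looks something like this. This is a sketch matching the index structure used on this page, where each entry is a [keyword, list-of-urls] pair:

```python
def add_to_index(index, keyword, url):
    # If the keyword is already in the index, append the url to its list.
    for entry in index:
        if entry[0] == keyword:
            entry[1].append(url)
            return
    # Otherwise, create a new entry for the keyword.
    index.append([keyword, [url]])
```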

This is how you can put the code together using this structure. Use the split method to divide the content into its component words and store them in the words variable.

def add_page_to_index(index, url, content):
    words = content.split()

Go through each of the words in words, which we can do using a for loop, naming the loop variable word:

def add_page_to_index(index, url, content):
    words = content.split()
    for word in words:
        # ...

Then add each of the words to the index by calling add_to_index, passing in the index, the word and the url.

def add_page_to_index(index, url, content):
    words = content.split()
    for word in words:
        add_to_index(index, word, url)

Now, add a page called real.test, after fake.test:

index = []
add_page_to_index(index, 'fake.test', "This is a test")
add_page_to_index(index, 'real.test', "This is not a test")
print(index) # -> [['This', ['fake.test', 'real.test']], ['is', ['fake.test', 'real.test']], ['a', ['fake.test', 'real.test']], ['test', ['fake.test', 'real.test']], ['not', ['real.test']]]
print(index[1]) # -> ['is', ['fake.test', 'real.test']]
print(index[4]) # -> ['not', ['real.test']]

You have already defined a procedure for responding to a query, so check whether it works on this index by searching for the keyword is:

index = []
add_page_to_index(index, 'fake.test', "This is a test")
add_page_to_index(index, 'real.test', "This is not a test")
print(lookup(index, 'is')) # you should expect to see that this keyword appears on both pages ['fake.test', 'real.test']

Now try searching for a keyword, udacity, that is not in either of the urls:

print(lookup(index, 'udacity')) # you should expect to see an empty list []
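In case you need a reminder, lookup from the earlier lesson can be sketched as follows, assuming the same [keyword, list-of-urls] index structure:

```python
def lookup(index, keyword):
    # Return the list of urls associated with keyword, or [] if absent.
    for entry in index:
        if entry[0] == keyword:
            return entry[1]
    return []
```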

Note

Note that if a word occurs more than once on the same page, we’re going to keep adding it to the index each time. This means there might be more than one occurrence of the same url in the list of urls associated with a keyword. Depending on what we want our search engine to do and how we want it to respond to queries, this may or may not be a good thing.
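If you decide duplicates are not what you want, one possible variant of add_to_index checks the url list before appending. This is a design choice, not the course's original definition:

```python
def add_to_index(index, keyword, url):
    # Variant that records each url at most once per keyword.
    for entry in index:
        if entry[0] == keyword:
            if url not in entry[1]:
                entry[1].append(url)
            return
    index.append([keyword, [url]])
```

The trade-off is that you lose the occurrence count, which a search engine might later want for ranking pages.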

Examples

Here is the same example from before:

index = []
add_page_to_index(index, 'fake.test', 'This is a test')
print(index) # Output: (1)

index = []
add_page_to_index(index, 'not.test', 'This is not a test')
print(index) # Output: (2)
  1. Output:

    [
        ['This', ['fake.test']],
        ['is', ['fake.test']],
        ['a', ['fake.test']],
        ['test', ['fake.test']]
    ]
    

    output is broken into multiple lines for readability

  2. Output:

    [
        ['This', ['not.test']],
        ['is', ['not.test']],
        ['not', ['not.test']],
        ['a', ['not.test']],
        ['test', ['not.test']]
    ]
    

    output is broken into multiple lines for readability

To convince yourself that the code is working, here’s a more complex example using a quotation from Scott Adams (originally on dilbert.com), followed by one from Randy Pausch.

index = []
add_page_to_index(index, 'http://eatthis.com', """
Another strategy is to ignore the fact that you are slowly killing yourself by not sleeping and
exercising enough. That frees up several hours a day. The only downside is that you get fat and die.
--- Scott Adams on Time Management
""")

add_page_to_index(index, 'http://libquotes.com', """
Good judgement comes from experience, experience comes from bad judgement. If things aren't going
well it probably means you are learning a lot and things will go better later.
--- Randy Pausch
""")

print(index) # Output: (1)
  1. Output:

    [
        ['Another', ['http://eatthis.com']], ['strategy', ['http://eatthis.com']], ['is', ['http://eatthis.com', 'http://eatthis.com']],
        ['to', ['http://eatthis.com']], ['ignore', ['http://eatthis.com']], ['the', ['http://eatthis.com']],
        ['fact', ['http://eatthis.com']], ['that', ['http://eatthis.com', 'http://eatthis.com']],
        ['you', ['http://eatthis.com', 'http://eatthis.com', 'http://libquotes.com']], ['are', ['http://eatthis.com', 'http://libquotes.com']],
        ['slowly', ['http://eatthis.com']], ['killing', ['http://eatthis.com']],
        ['yourself', ['http://eatthis.com']], ['by', ['http://eatthis.com']], ['not', ['http://eatthis.com']], ['sleeping', ['http://eatthis.com']],
        ['and', ['http://eatthis.com', 'http://eatthis.com', 'http://libquotes.com']], ['exercising', ['http://eatthis.com']],
        ['enough.', ['http://eatthis.com']], ['That', ['http://eatthis.com']], ['frees', ['http://eatthis.com']],
        ['up', ['http://eatthis.com']], ['several', ['http://eatthis.com']], ['hours', ['http://eatthis.com']],
        ['a', ['http://eatthis.com', 'http://libquotes.com']], ['day.', ['http://eatthis.com']],
        ['The', ['http://eatthis.com']], ['only', ['http://eatthis.com']], ['downside', ['http://eatthis.com']],
        ['get', ['http://eatthis.com']], ['fat', ['http://eatthis.com']], ['die.', ['http://eatthis.com']],
        ['---', ['http://eatthis.com', 'http://libquotes.com']], ['Scott', ['http://eatthis.com']],
        ['Adams', ['http://eatthis.com']], ['on', ['http://eatthis.com']], ['Time', ['http://eatthis.com']],
        ['Management', ['http://eatthis.com']], ['Good', ['http://libquotes.com']], ['judgement', ['http://libquotes.com']],
        ['comes', ['http://libquotes.com', 'http://libquotes.com']], ['from', ['http://libquotes.com', 'http://libquotes.com']],
        ['experience,', ['http://libquotes.com']], ['experience', ['http://libquotes.com']], ['bad', ['http://libquotes.com']],
        ['judgement.', ['http://libquotes.com']], ['If', ['http://libquotes.com']], ['things', ['http://libquotes.com', 'http://libquotes.com']],
        ["aren't", ['http://libquotes.com']], ['going', ['http://libquotes.com']],
        ['well', ['http://libquotes.com']], ['it', ['http://libquotes.com']], ['probably', ['http://libquotes.com']], ['means', ['http://libquotes.com']],
        ['learning', ['http://libquotes.com']], ['lot', ['http://libquotes.com']], ['will', ['http://libquotes.com']], ['go', ['http://libquotes.com']],
        ['better', ['http://libquotes.com']], ['later.', ['http://libquotes.com']], ['Randy', ['http://libquotes.com']], ['Pausch', ['http://libquotes.com']]
    ]
    

    output is broken into multiple lines for readability

It’s quite big, as there were a lot of words in the two quotes. Note that some words appear in both lists and some in just one. To take a closer look at what is going on, you can try the following queries (it’s probably best to remove the print(index) first so you can see the results more clearly):

print(lookup(index, 'you'))  # Output: (1)
  1. Output:

    [
        'http://eatthis.com',
        'http://eatthis.com',
        'http://libquotes.com'
    ]
    

    output is broken into multiple lines for readability

In this result you see that 'you' occurred twice in the eatthis.com quote, but just once in the libquotes.com quote.

print(lookup(index, 'good')) # []

The word 'good' does not appear in either quote.

print(lookup(index, 'bad')) # ['http://libquotes.com']

'bad' appears in just the Randy Pausch quote. Using this code you can look up words in your index and get the urls where they are found, and you can add pages to your index, recording every word that occurs on each page. The one thing left to do is to connect the code here with the code for crawling the web.
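As a preview of that connection, here is one way the pieces could fit together. The crawler below is a sketch only: get_page and get_all_links are assumed from the crawling lessons, and they are passed in as parameters here just to keep the example self-contained.

```python
def add_to_index(index, keyword, url):
    # Append the url to the keyword's list, or create a new entry.
    for entry in index:
        if entry[0] == keyword:
            entry[1].append(url)
            return
    index.append([keyword, [url]])

def add_page_to_index(index, url, content):
    # Add every word on the page to the index.
    for word in content.split():
        add_to_index(index, word, url)

def crawl_web(seed, get_page, get_all_links):
    """Crawl from seed, indexing every page found.

    get_page(url) -> page content; get_all_links(content) -> list of urls.
    Both are assumed helpers from the crawling lessons; the signatures
    here are illustrative.
    """
    tocrawl = [seed]
    crawled = []
    index = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            content = get_page(page)
            add_page_to_index(index, page, content)
            tocrawl.extend(get_all_links(content))
            crawled.append(page)
    return index
```

With this wiring in place, a single call to crawl_web produces an index covering every page reachable from the seed, which lookup can then query.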
