
Finishing The Web Crawler

Returning to the code you wrote earlier for crawling the web, you can now modify it to incorporate the code you've just written. Before adding the indexing, here is a quick recap of how the crawl_web code below works:

def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled

First, you defined two variables, tocrawl and crawled. Starting with the seed page, tocrawl keeps track of the pages left to crawl, while crawled keeps track of the pages that have already been crawled. While there are still pages left to crawl, remove the last page from tocrawl using the pop method. If that page has not been crawled yet, get all the links from the page and add the new ones to tocrawl using the union procedure written earlier (a reminder sketch follows below). Then, add the page that was just crawled to the list of crawled pages. When there are no more pages to crawl, return the list of crawled pages.
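
As a reminder, union was written earlier in the course. A minimal sketch, assuming it simply appends to the first list every element of the second list that is not already there, looks like this:

def union(p, q):
    # add each element of q to p, unless p already contains it
    for e in q:
        if e not in p:
            p.append(e)

Because union modifies its first argument in place, calling union(tocrawl, get_all_links(...)) adds any newly discovered links directly to the crawl queue.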

Adapt the code so that you can use the information found on the crawled pages. The changed code is below. First, add the variable index to keep track of the content found on the pages along with their associated urls. Since it is really the index you are interested in, that is what will be returned.

It is possible to return both crawled and index, but to keep it simple just return index. Next, add a variable content to replace get_page(page). This variable will be used twice: once in the code already there, and once in the line to be filled in for the quiz. The procedure get_page(page) is expensive because it requires a web call, so it should not be called more often than necessary. Storing the result in the variable content means the call to get_page(page) only needs to be performed once; the result can then be reused without going through the expensive call again.

Quiz 4.5

Fill in the missing line using the variable content.

def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    index = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            content = get_page(page)
            # <<< FILL IN THE MISSING LINE HERE >>>
            union(tocrawl, get_all_links(content))
            crawled.append(page)
    return index
Answer

To add the crawled page to the search index, you need to call the procedure add_page_to_index:

add_page_to_index(index, page, content)
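
With the missing line filled in, the completed crawl_web procedure reads:

def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    index = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            content = get_page(page)                    # one web call per page
            add_page_to_index(index, page, content)     # index the page content
            union(tocrawl, get_all_links(content))      # queue any new links
            crawled.append(page)
    return index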

Startup

You now have a functioning web crawler! From a seed you can find a set of pages; for each of those pages you can add its content to an index and return that index. And since you have already written the lookup code, you can find the pages associated with any keyword (a reminder sketch of these procedures follows below).
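
The indexing and lookup procedures were written earlier in the course. A sketch of that code, under the assumption that the index is stored as a list of [keyword, [urls]] entries, might look like this:

def add_page_to_index(index, url, content):
    # split the page text into words and index each word under this url
    for word in content.split():
        add_to_index(index, word, url)

def add_to_index(index, keyword, url):
    # if the keyword is already in the index, record one more url for it
    for entry in index:
        if entry[0] == keyword:
            entry[1].append(url)
            return
    # otherwise start a new entry for this keyword
    index.append([keyword, [url]])

def lookup(index, keyword):
    # return the list of urls associated with keyword, or an empty list
    for entry in index:
        if entry[0] == keyword:
            return entry[1]
    return []

For example, lookup(crawl_web(seed), 'crawler') would return the urls of every crawled page whose text contains the word 'crawler'.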

But you're not completely done yet. In Unit 5 you will see how to make a search engine faster, and in Unit 6 you will learn how to find the best page for a given query rather than returning all the matching pages. Before then, you need to understand more about how the Internet works and what happens when you request a page on the World Wide Web.

The Internet

Let’s explore our get_page procedure to understand how it works:

def get_page(url):
    # assumes urlopen has already been imported from the urllib library
    return str(urlopen(url).read())

It is better to do some error handling when trying to read something from the Internet. This is the Python code that does that:

def get_page(url):
    try:
        import urllib
        return str(urllib.urlopen(url).read())
    except:
        return ""

The table below shows what each line of the code does.

Code: urllib.urlopen(url)
Explanation: This opens the web page at url.

Code: urllib.urlopen(url).read()
Explanation: This reads the requested page, which comes back as a string.

Code: return str(urllib.urlopen(url).read())
Explanation: That string is then returned.

Code: import urllib
Explanation: The code that contains urlopen lives in the library urllib, so that library has to be imported.

Code: try: ... except: return ""
Explanation: This is an exception handler, written as a try block. The statements inside the try block are attempted, but they might not always work: the page might not be returned, the url might be bad, or the request might time out. If a url is requested that cannot be loaded, execution jumps to the except block, which returns an empty string instead of producing an error.
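
The code above uses Python 2's urllib.urlopen. If you are following along in Python 3, the same function lives in urllib.request, and read() returns bytes rather than a string, so an equivalent sketch would be:

def get_page(url):
    try:
        from urllib.request import urlopen
        # read() returns bytes in Python 3, so decode it into a string
        return urlopen(url).read().decode('utf-8', errors='replace')
    except Exception:
        # a bad url, a timeout, or any other failure yields an empty string
        return ""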

Conclusion

The search engine so far works, but it isn't fast or smart. In the next unit, you'll look at how to make the search engine scale and respond to queries faster.
