Finishing Web Crawler

Now we are ready to finish our web crawler!

Let’s recap how the web crawler should work to find all the links that can be found from a seed page. We need to start by finding all the links on the seed page, but instead of just printing them like we did in Unit 2, we need to store them in a list so we can use them to keep going. We will go through all the links in that list to continue our crawl, and keep going as long as there are more pages to crawl.

The first step is to define a procedure, get_all_links, that takes as input a string representing the text of a web page and produces as output a list of all the URLs that are targets of link tags on that page.

Let’s recap the code we had from Unit 2:

def print_all_links(page):
  while True:
    url, endpos = get_next_target(page)
    if url:
      print(url)
      page = page[endpos:]
    else:
      break

We defined a procedure, get_next_target, that takes a page, searches for the first link on that page, returns that link as the value of url, and also returns the position at the end of the closing quote so we know where to continue. Then, we defined the procedure print_all_links, which keeps going as long as there are more links on the page: it repeatedly finds the next target, prints it out, and advances the page past the end position.
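
The code for get_next_target itself is not repeated in this section. A minimal sketch, assuming the Unit 2 link format <a href="...">, might look like this:

def get_next_target(page):
  # find the start of the next link tag; -1 means there are no more links
  start_link = page.find('<a href=')
  if start_link == -1:
    return None, 0
  # the URL is the text between the pair of quotes that follow the tag
  start_quote = page.find('"', start_link)
  end_quote = page.find('"', start_quote + 1)
  url = page[start_quote + 1:end_quote]
  return url, end_quote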

What we want to change is this: instead of printing out each URL as we find it, we want to collect the URLs so we can use them to keep crawling and find new pages. To do this, we will build a list of all the links we find. We change the print_all_links procedure into get_all_links so that we can use its output, a list of links corresponding to the links we were originally printing out.

As an example of how this should work, we use a simple test page. It contains three link tags that point to three different pages: crawling, walking, and flying (you can check them out for yourself by clicking the links on the test page in your web browser). Here is how get_all_links should behave:

links = get_all_links(get_page('https://training.dsquare.io/to-crawl/basic/index.html'))
print(links)

Output:

['https://training.dsquare.io/to-crawl/basic/crawling.html', 'https://training.dsquare.io/to-crawl/basic/walking.html', 'https://training.dsquare.io/to-crawl/basic/flying.html']

Because the result is a list, we can use it to continue crawling pages. Think about how to define get_all_links on your own, but if you get stuck, use the following quizzes to step through the changes we need to make.

Quiz 3.19

What should the initial value of links be? Remember, our goal for get_all_links is to return a list of all the links found on a page. We will use the links variable to refer to a list that contains all the links we have found.

def get_all_links(page):
  links = __________
  while True:
    url, endpos = get_next_target(page)
    if url:
      print(url)
      page = page[endpos:]
    else:
      break
Answer

We want links to start out as an empty list:

links = []

Quiz 3.20

What do we replace print(url) with to update the value of links?

def get_all_links(page):
  links = []
  while True:
    url, endpos = get_next_target(page)
    if url:
      print(url)
      page = page[endpos:]
    else:
      break
Answer

We want to append the url to links. This will add that value to the list that links refers to, keeping track of all the URLs that we found on that page.

def get_all_links(page):
  links = []
  while True:
    url, endpos = get_next_target(page)
    if url:
      links.append(url)
      page = page[endpos:]
    else:
      break
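
As a quick reminder of how append behaves (the URLs here are made up for illustration), it adds its input to the end of the list it is invoked on:

links = []
links.append('https://example.com/a')
links.append('https://example.com/b')
print(links)   # ['https://example.com/a', 'https://example.com/b']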

Quiz 3.21

For this last quiz on get_all_links, let’s figure out how to get the output:

def get_all_links(page):
  links = []
  while True:
    url, endpos = get_next_target(page)
    if url:
      links.append(url)
      page = page[endpos:]
    else:
      break
  ______________________ # (fill in here)
Answer

We need a return statement:

return links
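
Putting the pieces from these three quizzes together, the complete procedure looks like this:

def get_all_links(page):
  links = []
  while True:
    url, endpos = get_next_target(page)
    if url:
      links.append(url)
      page = page[endpos:]
    else:
      break
  return links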

Web Crawler (Contd.)

At this point we are ready to finish the web crawler. The web crawler should find the links on a seed page, collect them into a list, and then follow those links to new pages, which may contain more links for the crawler to follow.

In order to do this, the web crawler needs to keep track of all the pages. We will use the variable tocrawl as a list of the pages left to crawl, and the variable crawled to store the list of pages already crawled.

Crawling Process

Here is a description of the crawling process. We call this pseudocode since it is more precise than English and structured sort of like Python code, but is not actual Python code. As we develop more complex algorithms, it is useful to describe them in pseudocode before attempting to write the Python code to implement them. (In this case, it is also done to give you an opportunity to write the Python code yourself!)

start with tocrawl = [seed]
crawled = []
while there are more pages tocrawl:
  pick a page from tocrawl
  add that page to crawled
  add all the link targets on this page to tocrawl
return crawled

Quiz 3.22

What would happen if we follow this process on the test site, starting with the seed page (https://training.dsquare.io/to-crawl/basic/index.html)?

  • A): It will return a list of all the urls reachable from the seed page.
  • B): It will return a list of some of the urls reachable from the seed page.
  • C): It will never return.
Answer

The answer is C). The test for the while loop keeps the loop going as long as there are pages in tocrawl, so in order to finish, the value of tocrawl must eventually become empty. If you look at the test site and follow the walking link, you get to a page with a link to crawling, which leads back to the index, where you can follow the walking link again -- over and over. The crawler will never finish because it will always find another link to crawl.

To avoid this, you need to make sure you don't crawl pages that have already been crawled. You can do this by adding a test that checks to see if a certain page has already been crawled.

The next several quizzes implement our web crawling procedure, crawl_web, that takes as input a seed page url, and outputs a list of all the urls that can be reached by following links starting from the seed page.

Quiz 3.23

To start the crawl_web procedure, provide the initial values of tocrawl and crawled:

def crawl_web(seed):
  tocrawl = ______ # <- initialize this variable
  crawled = ________ # <- initialize this variable
Answer
tocrawl = [seed]
crawled = []

Quiz 3.24

The next step is to write a loop to do the crawling, where we keep going as long as there are pages to crawl. To do this, we will use a while loop with tocrawl as our test condition. We could test len(tocrawl) == 0 to check whether the list is empty, but there is an easier way to write this using just tocrawl: an empty list (a list with no elements) is interpreted as False, and every non-empty list is interpreted as True.
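
You can see this by converting lists to Booleans explicitly:

print(bool([]))           # False: an empty list behaves as False in a test
print(bool(['a', 'b']))   # True: a non-empty list behaves as True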

Inside the loop, we need to choose a page to crawl. For this quiz, your goal is to figure out a good way to do this. There are many ways to do this, but using things we have learned in this unit you can do it using one line of code that both initializes page to the next page we want to crawl and removes that page from the tocrawl list.

def crawl_web(seed):
  tocrawl = [seed]
  crawled = []
  while tocrawl:
    page = ____________
Answer

The best way to answer this is to use pop. pop is the only operation we have seen that actually removes an element from a list and also returns that element. If we write page = tocrawl.pop(), that gets the last element in the tocrawl list, removes that element from tocrawl, and assigns it to the variable page.

def crawl_web(seed):
  tocrawl = [seed]
  crawled = []
  while tocrawl:
    page = tocrawl.pop()

Because we are getting the last element first, we are implementing what is called a depth-first search. That means as we crawl, we keep following links deeper and deeper into a chain of pages until we reach a page with no new links; only then do we back up and look at the other links on the pages we passed through. If our goal were to get a good corpus of the web quickly, doing a depth-first search is probably not the best way to do that. If we complete our search, no matter what order we follow, we will find the same set of pages. If we are not able to complete the search -- and with a real web crawler there are far too many pages to wait until we crawl them all before returning a result -- then the order in which we visit the pages matters a lot.
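
To see the difference the choice makes, here is a small illustration with made-up page names: pop() takes the most recently added entry, which gives depth-first order, while pop(0) would take the earliest entry, which would give breadth-first order instead.

tocrawl = ['index', 'crawling', 'walking']   # made-up page names for illustration
print(tocrawl.pop())    # 'walking' -- most recently added entry (depth-first order)
print(tocrawl.pop(0))   # 'index' -- earliest entry (breadth-first order)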

Quiz 3.25

The next step is to manage the problem of cycles in the links. We do not want to crawl pages we have already crawled, so we need some way of testing whether a page has been crawled. To make a decision like this, we use if. We need a test condition for the if statement so that we only do the work of crawling a page if it has not been crawled before.

def crawl_web(seed):
  tocrawl = [seed]
  crawled = []
  while tocrawl:
    page = tocrawl.pop()
    if __________________
Answer

We should only crawl the page if we have not crawled it already. The crawled variable keeps track of the pages we have crawled so far, so we want to test whether the page is not in crawled. If the page is not in crawled, we crawl it. If it is, we do nothing else in this iteration of the while loop and go on to check the next page.

if page not in crawled:
  # ...
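
As a quick illustration of the membership test (the URLs here are made up):

crawled = ['https://example.com/index.html']
print('https://example.com/index.html' not in crawled)   # False: already crawled
print('https://example.com/flying.html' not in crawled)  # True: not crawled yet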

Quiz 3.26

Now we’re ready to finish writing our crawler. Write two lines of code to update the value of tocrawl to reflect all of the new links found on page and update the value of crawled to keep track of the pages that have been crawled.

def crawl_web(seed):
  tocrawl = [seed]
  crawled = []
  while tocrawl:
    page = tocrawl.pop()
    if page not in crawled:
      ______________________
      ______________________
  return crawled
Answer

First, we need to add all the links we find on the page to tocrawl, which we can do using the union procedure from an earlier quiz in this unit. Then, using append, we keep track of the pages we have already crawled.

def crawl_web(seed):
  tocrawl = [seed]
  crawled = []

  while tocrawl:
    page = tocrawl.pop()
    if page not in crawled:
      union(tocrawl, get_all_links(get_page(page)))
      crawled.append(page)

  return crawled
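
The union procedure itself is not shown in this section. A minimal sketch that is consistent with how it is used above (adding the elements of its second input to its first input in place, skipping duplicates) might be:

def union(a, b):
  # add each element of b to the list a, unless it is already there
  for e in b:
    if e not in a:
      a.append(e)

Assuming get_page from the earlier unit is also available, the crawler can then be run on the test site:

print(crawl_web('https://training.dsquare.io/to-crawl/basic/index.html'))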