Completing the Extraction

So far we have defined a procedure to eliminate writing tedious code:

def get_next_target(page):
  start_link = page.find('<a href=')
  start_quote = page.find('"', start_link)
  end_quote = page.find('"', start_quote + 1)
  url = page[start_quote + 1:end_quote]
  return url, end_quote

Although you have not yet used a procedure that returns two things, it is simple to do: put two names on the left side of an assignment statement. Assigning to more than one name in a single statement is called multiple assignment. To write a multiple assignment, put any number of names separated by commas, an equal sign, and then the same number of expressions separated by commas (or a single expression, such as a procedure call, that produces that many values). The number of names and the number of values have to match so that the first value can be assigned to the first name, and so on.

<name1>, <name2>,... = <expression1>, <expression2>, ...

Example: a, b = 1, 2
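Here is a minimal sketch of the matching rule described above. The names c, d and mismatch_message are only for illustration; the point is that Python reports a mismatch between names and values as a ValueError when the statement runs.

```python
# Two names, two expressions: values are assigned position by position.
a, b = 1, 2
print(a, b)

# Two names but three values: the counts do not match, so Python raises
# a ValueError when this statement runs.
try:
    c, d = 1, 2, 3
except ValueError as error:
    mismatch_message = str(error)
    print("mismatch:", mismatch_message)
```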

Therefore, to receive the two values (url, end_quote) returned by the procedure above, put two names on the left side and the procedure call on the right. Here is the syntax to do that:

url, endpos = get_next_target(page)

Quiz 2.21

What does this do? s, t = t, s

A) Nothing
B) Makes s and t both refer to the original value of t
C) Swaps the values of s and t
D) Error

Answer

C) Swaps the values of s and t

In a multiple assignment statement, all of the expressions on the right side are evaluated before any of the names on the left side are assigned.
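To see why the order of evaluation matters, the sketch below compares the swap with two alternatives. The names x, y, p, q and temp are made up for this illustration.

```python
s, t = "first", "second"

# The right side is evaluated first, giving the pair ("second", "first");
# only then are s and t assigned.
s, t = t, s
print(s, t)  # second first

# Without multiple assignment, a temporary name is needed to swap.
x, y = "first", "second"
temp = x
x = y
y = temp
print(x, y)  # second first

# Assigning one name at a time without a temporary loses a value:
# both names end up with the original value of q (option B in the quiz).
p, q = "first", "second"
p = q
q = p
print(p, q)  # second second
```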

No Links

There is one concern with the get_next_target procedure that has to be fixed before tackling the problem of outputting all of the links, which is: What should happen if the input does not have another link?

Test the code in the interpreter by printing the result of get_next_target on a test link:

def get_next_target(page):
  start_link = page.find('<a href=')
  start_quote = page.find('"', start_link)
  end_quote = page.find('"', start_quote + 1)
  url = page[start_quote + 1:end_quote]
  return url, end_quote

print(get_next_target('this is a <a href="http://udacity.com">link!</a>'))
# ('http://udacity.com', 37)

When you run this code you get both outputs as a tuple: the link followed by the position of the end quote. A tuple is like a list, but it is immutable, meaning it cannot be modified after it is created. Just as a list is written with square brackets, a tuple is written with parentheses.
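As a small illustration of what immutable means, the sketch below reuses the tuple printed above. The names result and modified are introduced only for this example.

```python
result = ('http://udacity.com', 37)  # the tuple printed above

# Elements are read by index, just like a list.
print(result[0])  # http://udacity.com
print(result[1])  # 37

# Unlike a list, a tuple cannot be changed in place:
# assigning to an element raises a TypeError.
try:
    result[1] = 0
    modified = True
except TypeError:
    modified = False
print(modified)  # False
```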

You can also use a multiple assignment to capture both values and print just the url:

def get_next_target(page):
  start_link = page.find('<a href=')
  start_quote = page.find('"', start_link)
  end_quote = page.find('"', start_quote + 1)
  url = page[start_quote + 1:end_quote]
  return url, end_quote

url, endpos = get_next_target('this is a <a href="http://udacity.com">link!</a>')
print(url) # -> http://udacity.com

What happens if we pass in a page that doesn’t have a link at all?

def get_next_target(page):
  ...

url, endpos = get_next_target('good')
print(url) # -> goo

The program prints goo because when the find operation does not find what it is looking for, it returns -1. Here every find returns -1, so the slice becomes page[0:-1], and -1 as the end of a slice means "stop before the last character". Compare with a string that includes double quotes:

def get_next_target(page):
  ...

url, endpos = get_next_target('Not "good" at all!')
print(url)  # -> Not

In the end, the code above is not very useful. It is going to be very hard to tell when you have reached the last target, because Not could, for all the program knows, be a valid url.
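To make the two results above less mysterious, here is a step-by-step trace of the same find calls and slices that get_next_target performs, written out for both inputs. The variable names ending in 2 are just labels for the second input.

```python
# Input with no link and no quotes.
page = 'good'
start_link = page.find('<a href=')            # -1: no link tag
start_quote = page.find('"', start_link)      # -1: no quote at or after the last character
end_quote = page.find('"', start_quote + 1)   # -1: no quote anywhere from index 0
print(start_link, start_quote, end_quote)     # -1 -1 -1
print(page[start_quote + 1:end_quote])        # page[0:-1] -> goo

# Input with no link but with double quotes.
page2 = 'Not "good" at all!'
start_link2 = page2.find('<a href=')              # -1: no link tag
start_quote2 = page2.find('"', start_link2)       # -1: the last character is '!', not a quote
end_quote2 = page2.find('"', start_quote2 + 1)    # 4: the first quote, searching from index 0
print(start_link2, start_quote2, end_quote2)      # -1 -1 4
print(repr(page2[start_quote2 + 1:end_quote2]))   # page2[0:4] -> 'Not '
```

Note that a negative start position passed to find counts from the end of the string, which is why the search for the first quote begins at the last character in both cases.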

Quiz 2.22

Think about making get_next_target more useful in the case where the input does not contain any link. This is something you can do!

def get_next_target(page):
  start_link = page.find('<a href=')  # HINT: put something into the code here
  start_quote = page.find('"', start_link)
  end_quote = page.find('"', start_quote + 1)
  url = page[start_quote + 1:end_quote]
  return url, end_quote

url, endpos = get_next_target('Not "good" at all!')
print(url)

Answer

def get_next_target(page):
  start_link = page.find('<a href=')
  if start_link == -1:  # no link tag found
    return None, 0
  start_quote = page.find('"', start_link)
  end_quote = page.find('"', start_quote + 1)
  url = page[start_quote + 1:end_quote]
  return url, end_quote

url, endpos = get_next_target('Not "good" at all!')
if url:
  print("Here!")
else:
  print("Not here!") # -> Not here!

print(url) # -> None

The modified get_next_target procedure behaves as before when there is a link, but if there is no link tag in the input string, it returns None, 0.

Print All Links

At this point you have a piece of code, get_next_target, to replace a formerly tedious program. Here is where we are so far, with a few modifications:

page = '<contents of some web page as a string>'
url, endpos = get_next_target(page)
print(url)
page = page[endpos:] # replaced end_quote with endpos
url, endpos = get_next_target(page)
# ...

This code will have to repeat and keep going until the url that’s returned is None.

So far, you have seen a way to keep going, which is a while loop. You have seen a way to test the url. Now you have everything you need to print all the links on the page!

Getting a Webpage

Instead of supplying the source of a webpage as a string, we can get a webpage directly from the Internet. For this, we need to import urlopen from the urllib library.

from urllib.request import urlopen

Having imported urlopen, we can get a webpage using its url:

page = str(urlopen(url).read())

The page is read as bytes and we need it as a string, so we use str() for the conversion. This works only when you are connected to the Internet; otherwise, you may assign the page source to page as a string yourself.
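It can help to see what str() actually produces from bytes. The sketch below uses a made-up bytes value, raw, as a stand-in for what urlopen(url).read() returns, so it runs without a network connection. It also shows decode as an alternative, under the assumption that the page is encoded as UTF-8, which is not true of every page.

```python
# Stand-in for urlopen(url).read(); the content is invented for this example.
raw = b'<a href="http://udacity.com">link!</a>\n'

# str() gives the printed form of the bytes object: it starts with b'
# and shows the line break as the two characters backslash and n.
as_str = str(raw)
print(as_str)

# decode() turns the bytes into ordinary text (assuming UTF-8 here).
decoded = raw.decode('utf-8')
print(decoded)

# For this sample, the link tag can be found in either form.
print(as_str.find('<a href='), decoded.find('<a href='))
```

For this sample the tag and the double quotes around the url appear unchanged in both strings, so get_next_target can work with either; the decoded form is simply closer to what view source shows.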

Quiz 2.23

Write the get_page procedure that takes a url as input and returns the page source.

Answer

from urllib.request import urlopen
def get_page(url):
    return str(urlopen(url).read())

Quiz 2.24

In the following code, fill in the test condition for the while loop and the rest of the else statement:

def print_all_links(page):
  while ________: # what goes as the test condition for the while?
    url, endpos = get_next_target(page)
    if url: # test whether the url you got back is not None
      print(url)
      page = page[endpos:] # advance page to next position
    else: # if there was no valid url, then get_next_target did not find
          # a link, so do something else. What do you need to do?

Answer

def print_all_links(page):
  while True:
    url, endpos = get_next_target(page)
    if url:
      print(url)
      page = page[endpos:]
    else:
      break

print_all_links('this <a href="test1">link 1</a> is <a href="test2">link2</a> a <a href="test3">link 3</a>')
# test1
# test2
# test3

Let’s go back to the xkcd web page we looked at earlier and try something a little more interesting. Go to the xkcd.com home page, click view source to find the first link on the page, and notice how many links there are in the source code: quite a few. Alongside your print_all_links code, print the result of get_page, passing in the page url, http://xkcd.com/353. Your program should print the page's source code when you run it.

def print_all_links(page):
  while True:
    url, endpos = get_next_target(page)
    if url:
      print(url)
      page = page[endpos:]
    else:
      break

print(get_page('http://xkcd.com/353'))

Since we do not need the entire source code for what we are looking for, use your print_all_links procedure to print just the links on the xkcd.com page.

def print_all_links(page):
  while True:
    url, endpos = get_next_target(page)
    if url:
      print(url)
      page = page[endpos:]
    else:
      break

print_all_links(get_page('http://xkcd.com/353'))

There are a few links that are not returned, but you will learn about those situations in the coming units.
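The notes do not say which links are missed, but one case follows directly from the code as written: get_next_target searches for the exact text '<a href=', so a link whose tag has another attribute before href is skipped. The sketch below demonstrates this on a made-up page. The helper find_links and the sample string are hypothetical; find_links runs the same loop as print_all_links but returns the urls in a list so the result is easy to inspect.

```python
def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:  # no link tag found
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote


def find_links(page):
    # Hypothetical helper: same loop as print_all_links, but it collects
    # the urls in a list instead of printing them.
    found = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            found.append(url)
            page = page[endpos:]
        else:
            break
    return found


# Made-up page: the second link has a class attribute before href,
# so it does not contain the exact text '<a href='.
sample = ('<a href="one">1</a> '
          '<a class="nav" href="two">2</a> '
          '<a href="three">3</a>')

print(find_links(sample))  # ['one', 'three']
```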

Congratulations! You just learned how to print all the links on a web page, you now know enough to write, in principle, every possible computer program, and you are in good shape to continue building your web browser! In the next unit you will learn how to collect the links from a web page and do something with them.