Extracting Links

A web page is really just a long string of characters. Your browser renders the web page in a way that looks more attractive than just the string of characters. You can view the string of characters for any web page in your browser. How to do this depends on the browser you are using. For Chrome and Firefox, right-click anywhere on the page that is not a link and select "View Page Source". For Internet Explorer, right-click and select "View Source".

Open http://blag.xkcd.com/ in Firefox and select View Page Source from the menu. The raw string for the web page pops up in a new window. For our web crawler, the important thing is to find the links to other web pages in the page. We can find those links by looking for the anchor tags that match this structure:

<a href="<url>">

For example, here is a link to the News/Blag page:

<a href="http://blag.xkcd.com/">

To build our crawler, for each web page we want to find all the link target URLs on the page. We want to keep track of them and follow them to find more content on the web. For this unit, we will do the first step, which is to extract the first target URL from the page. In Unit 2, we will see how to keep going to get all the link targets, and in Unit 3, we will see how to keep track of them to be able to crawl the target pages. For now, our goal is to take the text from a web request and find the first link target in that text. We can do this by finding the start of the anchor tag, <a href=, and then extracting from that tag the URL that is found between the double quotes. We will assume that the page's contents are stored in a variable, page.

Quiz 1.15

Write Python code that initializes the variable start_link to the position where the first '<a href=' occurs in the string that page refers to.

Answer

page = "<contents of a web page>"
start_link = page.find('<a href=')
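To see what find returns, here is a minimal sketch that uses a made-up one-line page (the page content is an assumption for illustration, not the real xkcd source). Positions are counted from 0, and find gives back -1 when the string being searched for does not occur.

```python
# Hypothetical page content, made up for illustration.
page = '<html><body><a href="http://blag.xkcd.com/">Blag</a></body></html>'

# '<html>' and '<body>' are 6 characters each, so the tag starts at position 12.
start_link = page.find('<a href=')
print(start_link)  # 12

# find returns -1 when the target string is not found anywhere in page.
missing = page.find('<img src=')
print(missing)  # -1
```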

Quiz 1.16

Write Python code that assigns to the variable url a string that is the value of the first URL that appears in a link tag in the string page. For the example page used in the test code, your code should end up with url having the value 'http://blag.xkcd.com'.

If you have not been able to initialize the page variable, use some content from the source of xkcd.com/353 and enclose it in triple quotes (type a triple quote as three single quotes in a row, with no spaces between them). For example, try this:

page = '''
<html>
<body>
  <div id="topContainer">
    <div id="topLeft">
      <ul>
        <li><a href="/archive">Archive</a></li>
        <li><a href="https://what-if.xkcd.com">What If?</a></li>
        <li><a href="https://blag.xkcd.com">Blag</a></li>
        <li><a href="/how-to/">How To</a></li>
        <li><a href="https://store.xkcd.com/">Store</a></li>
        <li><a rel="author" href="/about">About</a></li>
        <li>
          <a href="/atom.xml">Feed</a> &bull;
          <a href="/newsletter/">Email</a>
        </li>
      </ul>
    </div>
  </div>
</body>
</html>
'''

Answer

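# Position where the first link tag starts.
start_link = page.find('<a href=')
# Positions of the double quotes that surround the URL.
start_quote = page.find('"', start_link)
end_quote = page.find('"', start_quote + 1)
# The URL is the text between the two quotes.
url = page[start_quote + 1:end_quote]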
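As a quick check, the sketch below runs the same steps on a shortened, made-up page with two links (the page content is an assumption for illustration). It shows that the code picks up only the first link target, and that a relative target such as /archive is returned exactly as it is written in the page.

```python
# Shortened page, made up for illustration.
page = '''
<ul>
  <li><a href="/archive">Archive</a></li>
  <li><a href="https://blag.xkcd.com">Blag</a></li>
</ul>
'''

start_link = page.find('<a href=')
start_quote = page.find('"', start_link)
end_quote = page.find('"', start_quote + 1)
url = page[start_quote + 1:end_quote]
print(url)  # /archive
```

Note that searching for '<a href=' only matches anchor tags where href comes directly after the a. A tag such as <a rel="author" href="/about"> would be skipped by this search.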

Yay! You did it and are off to a great start! You've learned about programs, variables, expressions, and strings, and are well on your way to building a web crawler. Next unit, we will learn some big ideas in computer science that will make this code more useful and enable us to get all the links on the page, not just the first one.