Search Index
The main new computer science idea you will learn is how to build complex data structures. You will learn how to design a structure that you can use so that you can respond to queries without needing to rescan all the web pages every time you want to respond to a query. The structure you will build for this is called an index. The goal of an index is to map a keyword and where that keyword is found. For example, in the index of a book you can see a page number which serves as a map to where a term or concept can be found. The key ideas in index will allow us to find references to what we want. With a search engine the index gives you a way for a keyword to map to a list of web pages, which are the urls where those particular web pages appear. Once you have done the work of building an index, then the look-ups are really fast.
Quiz 4.1
Deciding on data structures is one of the most important parts of building software. As long as you pick the right data structure, the rest of the code will be a lot easier to write. Which of these data structures would be a good way to represent the index for your search engine?
- A)
[<keyword 1>, <url1>, <url2>, <keyword 2>, ...]
- B)
[[<keyword 1>, <url1>, <url2>], [<keyword 2>, <url3>, ...]
- C)
[[<url1>, [<keyword 1>, <keyword 23>, ...]], [<url2>, [<keyword 2>, ...]]
- D)
[[<keyword 1>, [ <url1>, <url2>]], [<keyword 2>, [<url3> ]], ...]
Answer
The best choice is D), and B) is a close runner up. Choices A) and C) would be really difficult. Here's why:
-
A): Hard to read because everything is all in one list. This structure will also make it hard to loop over the keywords.
-
B): This option is okay, but not as good as d. The structure of b is such that there is a list, where each element of the list is a list, and the element lists are themselves lists. The big advantage to this structure is that it is easy to tell the keywords from the urls, the keyword is always the first element of the list. It is also easier to go through the keywords because for each list you just have to look to the first element, check to see if it is the keyword you are looking for, and if not, move on to the next one. The only downfall to this structure is that the distinction between the keywords and the urls could be even more clear.
-
C): While this has more structure, but it does not make it easy to look up the pages where a keyword appears. To look for a particular keyword you would have to look in each entry, look in the second part of the entry, scan it to see if the keyword appears, if it does then you want the url in your result, and if it doesn't then that url is not in the result. This option is not the best because you would have to scan through all the pages, which would take too much time.
-
D) In this structure there are just two elements in the inner list; the keyword followed by a list of urls. This is the best option because it makes a very clear separation between the keyword and the list of urls. It also means that if you decide you want to keep track of something else, like the number of times someone searches for each keyword, you can easily do that by adding an extra element. This is something that would be more difficult to do with the structure of option B).
Quiz 4.2 (Add to Index)
Define a procedure, add_to_index
, that takes three inputs:
- an
index
[[<keyword>, [<url>, ...]], ...]
- a
keyword
string - a
url
string
If the keyword
is already in the index
, add the url
to the list of urls associated with that keyword
. If the keyword
is not in the index
, add an entry to to the index
: [keyword, [url]]
For example:
index = []
add_to_index(index, 'udacity', 'http://udacity.com')
add_to_index(index, 'computing', 'http://acm.org')
This code starts with the empty list index
. After the two lines of code the empty list will contain two lists beginning with the keywords, udacity
and computing
.
Adding another entry for udacity:
add_to_index(index, 'udacity', 'http://npr.org')
In this code, udacity is already in the index and you don't want to add a new entry to the index itself. Since udacity is already in the index, what you want to do is add the new url to the list already associated with that keyword.
Answer
So here is how it goes:
def add_to_index(index, keyword, url):
for entry in index: # loop through the elements of index, giving each one the name entry
if entry [0] == keyword: # test to see if the value at position 0 of entry identical to the keyword that's passed in
entry[1].append(url) # if it is equal you want to append the url to the list of urls associated with that entry
return # stop the loop
#not found, add new entry
If the keywords are not found, you want to add a new entry. The new entry is going to have as its value a list containing two elements, the keyword and the second element will be a list containing the urls that were found that have that keyword. So far, there is only one url that was passed in to index
. To do this, add to your code:
def add_to_index(index, keyword, url):
for entry in index:
if entry [0] == keyword:
entry[1].append(url)
return # stop the loop
index.append([keyword, [url]])
Let’s test the procedure with the procedure calls given above.
index = []
add_to_index(index, 'udacity', 'http://udacity.com')
add_to_index(index, 'computing', 'http://acm.org')
add_to_index(index, 'udacity', 'http://npr.org')
print(index)
Quiz 4.3 (Lookup Procedure)
Define a procedure, lookup, that takes two inputs:
- An
index
: A list where each element of the list is a list containing a keyword and a list as its second element. The second list element is a list of urls where that keyword appears. - The
keyword
to lookup
The output should be a list of the urls associated with the keyword. If the keyword is not in the index, the output should be an empty list.
For example:
lookup(index, 'udacity') # -> ['http://udacity.com', 'http://npr.org']
Answer
def lookup(index, keyword): # two parameters
for entry in index: # loop through the entries in the index
if entry[0] == keyword:
return entry[1] # if found then return the urls associated with that entry
return [] # if no keyword is found, then return empty list
Trying this procedure on our pre-buit index will result in:
print(lookup(index, 'udacity'))
['http://udacity.com', 'http://npr.org']