Thinking in Python - Part 1

Intended Audience

This post is intended for people new to programming in Python. It is hoped that this should serve well to at-least some of us Python programmers who have some Python programming experience but are looking at taking their Python programming to the next level.

Introduction

Over last few years Python has gained a lot of popularity mainly thanks to it's excellent libraries for machine learning, deep learning and set of matured features for web development, which makes it's appeal to a larger set of programmers who are coming from quite diverse backgrounds. While there are a number of good introductory sources on the web to get started programming in Python, a large majority of them focus on the syntax and then on the other hand, there are so many new libraries coming for specific application domain, it's considerably easier for a novice programmer to get overwhelmed and miss trees for the forest.

In this series of blog posts, I will look at some simple problems and how we can start with simple solutions to the problems and iterate and solve those problems using constructs which are more Pythonic. Note a lot of this discussion is based on looking at a code of a number of people for the very same problems in the past, so this might look like 'extremely simple' to a lot of people who have done Python programming in the past, but I have looked at so many implementations from so many people, I thought, it surely was worth the effort.

When learning a new language, usually we pay a lot of attention to the language syntax or how well a language suits particular type of programming paradigm (eg. functional / object oriented programming etc.). While this is a viable approach, one is better served by actually understanding in better details about the language standard library and language built-ins and how one can utilize those language services to solve a particular problem.

The objective of the effort is not to introduce one to new libraries but instead what can be done just using built-ins and standard library. This post was somewhat inspired by Dave Beazley's talk built-in Super Heroes. So let's get started.

Problem 1 - Parsing HTTP Log files

This problem is also representative of typical problems during data acquisition, data cleanup while working on any Data Science projects.

Let's take an example of a simple Problem, look at the log of an http server and find out meaningful information from this.

An HTTP Log file is typically generated by a Server while serving requests from the client, just looking at these logs, one can find out a lot of meaningful information, for instance, we can find out -

What are the most popular URLs (eg. Top 5 URLs)
How many Unique IPs did we get requests from?
What does the distribution of HTTP Status codes look like? (ie. How many 2XX, 5XX, 4XX error codes are there.)

A typical Log file looks like following - (These lines are taken from the log file that was used in older version of Stanford 202 course, however it seems that the internet reference is lost, so I am maintaining this file myself here.)

82.95.120.160 - - [14/May/2009:00:10:34 -0700] "GET www.twibuzz.com/ HTTP/1.1" 200 112813 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_4_11; en) AppleWebKit/528.16 (KHTML, like Gecko) Version/4.0 Safari/528.16"
82.95.120.160 - - [14/May/2009:00:10:35 -0700] "GET www.twibuzz.com/style.css HTTP/1.1" 200 639 "http://www.twibuzz.com/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_4_11; en) AppleWebKit/528.16 (KHTML, like Gecko) Version/4.0 Safari/528.16"
82.95.120.160 - - [14/May/2009:00:10:36 -0700] "GET www.twibuzz.com/logo.png HTTP/1.1" 200 5410 "http://www.twibuzz.com/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_4_11; en) AppleWebKit/528.16 (KHTML, like Gecko) Version/4.0 Safari/528.16"
82.95.120.160 - - [14/May/2009:00:10:43 -0700] "GET www.twibuzz.com/favicon.ico HTTP/1.1" 404 - "http://www.twibuzz.com/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_4_11; en) AppleWebKit/528.16 (KHTML, like Gecko) Version/4.0 Safari/528.16"

A very quick primer on HTTP - HTTP is a protocol that is used on the Internet for accessing pages. If you are reading this blog, likely you are using HTTP. It's a simple Request Response Protocol, where a client sends a Request for a URL and Server returns a response and logs this interaction in a log file. The lines shown above are from one such log file.

Baby Steps

One of the best part of Python is, you can fire your Python (or iPython Shell) and try out things through what is called as a REPL (Read-Eval-Print-Loop), this makes iterative development really easy. So let's do the simplest thing, let's print the contents of the file.

We are developing a program called parse_weblog.py, so initially it will look something like this

f = open("weblog.txt")

print (f)

We simply open a file and print the handle returned to us. One thing to note here is that open is a language built-in. So what does it mean, it basically means is something that is available to the Python interpreter without looking up elsewhere (ie. importing). So when you have an interpreter available, all the built-ins are available to you, so that is something 'part of the language' itself. Some of Pythons data types are built-ins eg. dict, list, set, also int, float etc. open happens to be a built-in functions that can be used for file IO. The open above takes file name as argument and gives us a file handle. Of course, simply printing a file handle isn't much fun. So we'll read the contents of the file. There are two methods available to read contents of the file. One is readline and another is for line in file. So which one should you use - typically unless you want to make some changes to the contents of the file (ie. append a line or edit a line etc. it is always a good idea to use for line in life construct. This does not read entire file into memory where as readlines reads all the contents of the file into a list (of strings) in the memory. This could be a problem for very large files, as in such cases, you probably just want to go one line at a time, do something with the line and keep the resultant data in memory rather than keeping the unprocessed line in the memory. So we update our parse_weblog.py as follows -

# file: `parse_weblog.py`
# Open the weblog file and print each line

f = open("weblog.txt")

### Wrong, avoid
lines = f.readlines()
for line in lines:
    print(line)

# Preferred: This one also means typing fewer keys.
for line in f:
    print(line)

So far so good, but this does not take care of a few things, what if the file name we provided does not exist or if it exist and we do not have permissions to read the file? It's better to handle such cases. Typical way of doing this in Python is what is called try-catch block. Let's for the time being ignore handling individual exceptions separately ie. whether the file didn't exist or we do not have permission to open the file etc and simply print an error when we cannot open a file. So our updated parse_weblog.py now becomes -

# file: parse_weblog.py

try:
    f = open("weblog.txt")

    for line in f:
        print(line)
except Exception as e:
    print ("Error opening file.")

Voila, we have implemented the cat command on Unix. This is somewhat useful, but we are far from answering the questions we described above. Let's get started -

Analysis - Step 1 - Tokenize

Let's look at an individual log line and what it depicts -

82.95.120.160 - - [14/May/2009:00:10:34 -0700] "GET www.twibuzz.com/ HTTP/1.1" 200 112813 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_4_11; en) AppleWebKit/528.16 (KHTML, like Gecko) Version/4.0 Safari/528.16"

This looks like a series of tokens separated by white space. So it might be a good idea to actually tokenize each individual line and extract information that's useful for us. The str class has got a function called split, that takes as input a string and returns list of tokens. So let's split this and get individual tokens. In general for the rest of the discussion we assume that the file is indeed opened properly and we are iterating over the contents of the line so we'll not show some of the try-catch-except explicitly. So here we go -

# file: weblog_parser.py
# tokenize every line

# Note: split function below when called with no parameters, tokenizes given string using white-spaces.
# Consequtive white spaces are folded and tokens are returned.

for line in f:
    tokens = line.split()
    print (tokens)

Let's look at individual tokens and we can identify some tokens as following -

Token 1: IP Address
Token 6: HTTP Method
Token 7: URL
Token 9: HTTP Status Code
Token 10: Bytes Transferred

There are other tokens as well, but we are ignoring them at the moment. This is kind of not perfect, but good enough for us to get started and we'll make it eventually better.

# file: weblog_parser.py
# Assign values to tokens

for line in f:
    tokens = line.split()
  ip, method, url, status_code, size = tokens[0], tokens[5], tokens[6], tokens[8], tokens[9]

  print (ip, method, url, status_code, size)

Analysis Step 2 - Organize Data

Since we are able to get important data from each line of the log file, it might be a good idea now, how we think about organizing this data so that we can process it effectively. Usually the type of questions we want to answer tell us about which of the Python built-in data structures we should be using for that particular task.

For example - Let's try to answer this question - 'Which are the most popular URLs'? To answer this question, it might be a good idea if we are able to - organize the data as follows, a Given URL and count corresponding to it's occurrence in the log file. This kind of data can be organized as - 'Collection of two items where first item is a URL and next item is a count or perhaps even better if we maintain some kind of 'table' in which keys are the URLs and values are the counts. The most ideal data structure for this available in Python built-ins is dictionary or dict.

While we are at it, perhaps to answer the questions above we will need to go through the contents of the file 'every time', to answer that particular question. File I/O being an expensive operation, it might be a good idea to 'iterate' over the file once, keep the data in the memory and then use this data to answer our questions. One might then argue, 'What is wrong then with doing file.readlines, after all that was also saving the data in the memory?' To answer that, this is not keeping the file contents in memory, but a processed data in the memory. So let's do those couple of things - Factoring out getting the information we want from the file and organizing that as data structures of our choices.

def parse_log_file(f):
    """ f: A File handle. """

    data = []
    for line in f:
        tokens = line.split()

        ip, method, url, status_code, size = tokens[0], tokens[5], tokens[6], tokens[8], tokens[9]
        tup = (ip, method, url, status_code, size,)
        data.append(tup)

    return data

urls_dict = {}

for value in parse_log_file(f):

    url = data[2]

    count = urls_dict.get(url, 0)
        count += 1
    urls[url] = count

print (urls_dict)

Let's try to answer the question 'How many Unique IP addresses are there?'. What we can do is - we can maintain IP address in a list and while going through the list, if the IP address is not in the list already, we will append that to the list. While this is not exactly a wrong way to do it, in Python there's a better way to do it using the set built-in. Adding the value to the set achieves the exact same effect of what we discussed above, except it is managed by Python and a better data structure (similar to hash tables) and we can perform this operations faster. In general, whenever a language provides a functionality that we need, it's better to utilize that as that has been used in different scenarios and probably has undergone lot more testing than something we will implement by hand.

So let's implement unique IPs as a set.

unique_ips = set()
for value in parse_log_file(f):
    ip = data[0]
    unique_ips.add(ip)

print(unique_ips)

A nifty, performant implementation.

To be able to answer third question - We need a data structure, that holds HTTP Status codes and corresponding number of occurrences in the HTTP Log file.

Analysis Step 3 - Compute Results

So we have organized the data using right data structures viz. dict and set for the type of questions that we wanted to answer. It's a good idea to actually solve the problems. We'll write individual functions, just so that the code is better organized as well. Let's write those functions.

Question 1: Which URL Was accessed maximum number of time?

We have discussed bulk of the solution already, We are putting things together.

A note about couple of things here, we are using sorted built-in that returns a list which sorts items in an iterable using a key and sorting order. Python's Sorting HOWTO is a very useful resource to learn more about it. An iterable for simplicity is something over which we can run the loop for i in iterable. Typically, Python's built-in list (iterates over all elements), dict (iterates over all keys), set (iterates over all elements) and even tuple (iterates over all elements) are all examples of iterables.

Also, we are using list slices that allow us to take parts of the list. As in example below, we are taking the first 5 elements. When there are fewer than five elements, this will return as many as are available.

def get_max_occuring_ips(file_handle, count=5):
    """
    Returns a List of max_occuring_ips.
    """

    urls_dict = {}

    for tup in parse_log_file(file_handle):
        url = data[2]

        count = urls_dict.get(url, 0)
            count += 1
        urls_dict[url] = count

    # We'll use the built-in sorted
    max_occuring_ips = sorted(urls_dict, key=urls_dict.get, reverse=True)

    return max_occuring_ips[:count]

Question 2: How many unique IPs did we get requests from?

We have already looked at the solution, we'll just wrap it in a function and return a list of IP Unique IP Addresses.

def unique_visitor_ips(file_handle):

    unique_ips = set()
    for value in parse_log_file(file_handle):
        ip = data[0]
        unique_ips.add(ip)

    return list(unique_ips)

Question 3: What does the distribution of HTTP Status Codes look like?

This question is actually pretty similar to Question 1. The only difference being we need to organize data around HTTP Status codes as opposed to IP Addresses and instead of sending top N, we'll send the results for All Status codes.

def http_status_codes_with_count(file_handle):

    all_status_codes = {}
    for tup in oarse_log_file(file_handle):
        status_code = int(tup[3])
        all_status_codes[status_code] = all_status_codes.setdefault(status_code, 1) + 1

    return all_status_codes

Next Steps - Raising the Bar

So far we have looked at answering a few questions. These are not the only questions that can be answered from the log files. There can be many other questions that can be answered -

What is the total number of bytes downloaded from the server?
Can you identify the busiest hours for his server? This question is little challenging and we will need to use the datetime package from the standard library.

Hopefully this showed some ideas as to how this should be done and reader is encouraged to try answering many more such questions.

Conclusion

While, this is a good start towards solving a problem, this is by no means an attempt at introducing all language features. For instance, we have not somehow tried to fit this problem in an Object Oriented solution (while it's not impossible to do so, that effort doesn't achieve any better results than what we have presented here). The key takeaway is - to be able to start thinking in terms of Organizing data in terms of Python built-in data structures and try to develop iterative solution to problems starting with something very simple and then looking around when discovers something new about a language (eg. Python Sorting Howto described above).

Future Posts

In some of the future posts of this series, I am going to look at following problems -

Running Random experiments (Birthday Paradox)
Object Oriented Python (Ranking Poker Hands)
Python to Parse Binary Data (Which country IP address belongs to.)