innoQ

Vladimir's Tech Blog


Time sheet - parsing input with Python

January 21, 2007

We will start developing our application in Python.

For the examples you will need a simple text editor like vi or Notepad, or you can type the code directly into the Python interpreter console. My favorite plain text and code editor is SciTE. Mac geeks will take TextMate (their only mate ;-) ). Just kidding, I'm simply jealous, sitting here with my boring black PC.

First, let's design our input format. It should be text-oriented and easy to type and read; in other words, it should go in the direction taken by the wiki movement and by DSLs (domain-specific languages).

# project "lookup" table
# keyword "project", shortcut, customer name, address - no spaces allowed
project N BigCustomer Musterstr.99,Duesseldorf

# multiple months per file possible
month 01 2007

# lines starting with a number represent entries for single days
# fields are separated by spaces or tabs
# field sequence: day_of_month project_id from to break remaining_fields_as_dictionary
1 	N 	9:00 	17:30 	0:30 comment:this_day_was_very_exhausting taxi:34.56
2 	N 	8:30 	15:00 	1:00 
3 	N 	9:00 	17:30 	0:30
12	N	10:00	20:00 	0:30

Opening a file in Python is as easy as

f = open("/path/to/sample_input.txt")

Going through all the lines and stripping the whitespace at the beginning and the end of every line is just as easy:

for line in f:
    print line.strip()

By the way, if your file is not on your local drive but somewhere on the internet, for example because you check it in with Subversion or some other WebDAV-based tool, you can use urlopen from the urllib module instead of open. It returns a file-like object, so the rest of your code does not need to change:

from urllib import urlopen  # Python 2 standard library

f = urlopen("http://mysvn.example.com/myrepository/my_time_sheet_data.txt")
for line in f:
    print line.strip()
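
Because the loop only needs something that hands out lines, it can be tried without any file or server at all. The following is a minimal sketch: the helper name stripped_lines is made up for this illustration, and io.StringIO serves as an in-memory stand-in for the opened file.

```python
import io

def stripped_lines(source):
    """Return the stripped lines of any file-like object (hypothetical helper)."""
    return [line.strip() for line in source]

# io.StringIO behaves like an open text file, so no real file is needed here
sample = io.StringIO(u"month 01 2007\n  1 \tN \t9:00 \t17:30 \t0:30  \n")
lines = stripped_lines(sample)
```

The same helper should accept the object returned by open or by a URL-opening function, since it relies on nothing but iteration over lines.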

As a next step, let's split the line into tokens. Guessed what the function is called? tokens = line.split() Called without arguments, it uses any run of whitespace (spaces, tabs) as the delimiter and returns a list. A Python list is similar to Java's array, ArrayList or Collection, only more powerful. You can access an element of the list or a range of elements using square brackets:

print tokens[2]   # third element, lists are zero-based
print tokens[3:5] # fourth to fifth element, the right border is not included
print tokens[5:]  # remaining elements, after the fifth
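
Applied to the first day entry of the sample file, the slices could be used as sketched below. The variable names (pause, extras) are assumptions made for this example, and the key:value handling is only one possible reading of "remaining_fields_as_dictionary" from the format description.

```python
line = "1 \tN \t9:00 \t17:30 \t0:30 comment:this_day_was_very_exhausting taxi:34.56"
tokens = line.split()

day, project_id = tokens[0], tokens[1]
start, end, pause = tokens[2:5]   # from, to, break

# every remaining token has the form key:value; split only at the first colon
extras = dict(token.split(":", 1) for token in tokens[5:])
```

Splitting at the first colon only keeps values intact even if they contain further colons. All values stay strings here; converting "34.56" to a number is left for a later step.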


A note about the brackets: Python has powerful built-in language concepts like tuples, lists and dictionaries. To initialize them, use parentheses, square brackets and curly braces, respectively.

# use tuples if the number and meaning of elements are fixed
invention1 = ("web", "Tim Berners-Lee", 1989)
invention2 = ("wheel", "anonymous", -3000)

# a list
my_breakfast = ["apple", "orange", "tea"]
numbers = range(10)


# dictionary, use key, colon, value for single elements
cities = {"New York":"USA", "Fuchu":"Japan", "Los Angeles":"USA"}
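
Getting the values back out is just as short. A small sketch of typical operations on the three types; the names on the left-hand side are chosen freely for the illustration:

```python
invention2 = ("wheel", "anonymous", -3000)
name, inventor, year = invention2     # unpack a tuple into three names

numbers = list(range(5))              # the list [0, 1, 2, 3, 4]
numbers.append(5)                     # lists can grow, tuples cannot

cities = {"New York": "USA", "Fuchu": "Japan", "Los Angeles": "USA"}
country = cities["Fuchu"]             # look up a value by its key
knows_berlin = "Berlin" in cities     # membership test on the keys
```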

Let's put the details of the project into a dictionary (something like a HashMap in other languages):

if tokens[0] == "project":
    projects[tokens[1]] = tokens

If we have the short name of a project, we can easily access, for example, its address in the following way:

projects[the_short_name][3]

Later we can define the container for the project details as a dictionary too, or as a class, so that we can access the properties of a project more comfortably by name instead of by number.
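
As a rough idea of what the class variant might look like, here is a sketch. The class name Project and its attribute names are invented for this example and are not fixed by the input format.

```python
class Project(object):
    """Container for one "project" line (hypothetical class and attribute names)."""

    def __init__(self, shortcut, customer, address):
        self.shortcut = shortcut
        self.customer = customer
        self.address = address

tokens = "project N BigCustomer Musterstr.99,Duesseldorf".split()
projects = {}
# tokens[0] is the keyword "project"; the next three become constructor arguments
projects[tokens[1]] = Project(*tokens[1:4])

address = projects["N"].address
```

Compared with projects["N"][3], the attribute access says what is being read, and it keeps working if the order of the fields in the file changes later.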

Putting it all together

projects = {}
employeeName = "nobody"
f = open("/home/vd/work/innoq/Sandbox/vd/TimeSheet/sample_input.txt")

for raw_line in f:
    line = raw_line.strip()
    # skip empty lines and comment lines
    if line and not line.startswith("#"):
        tokens = line.split()
        if tokens[0] == "employee":
            employeeName = tokens[1]
        elif tokens[0] == "project":
            # keep the whole token list, keyed by the project shortcut
            projects[tokens[1]] = tokens
        elif tokens[0] == "month":
            print "month"
        elif tokens[0].isdigit():
            # a day entry: print the "from" and "to" times
            print tokens[2], "-", tokens[3]
f.close()
print projects

If you run the source above on our data file, you will get output like this:

>python -u "TimeSheet.py"
month
9:00 - 17:30
8:30 - 15:00
9:00 - 17:30
10:00 - 20:00
{'N': ['project', 'N', 'BigCustomer', 'Musterstr.99,Duesseldorf']}
>Exit code: 0

It took me less than 5 minutes to write this first version of the parsing code. How many lines of code and how many seconds of your life would you need in C++ / Java / Assembler to write the parser for your first simple domain-specific language?

P.S. I used "if elif else" in my implementation. I am sure there is a more elegant way to write a dispatcher in Python. We just need to find out whether there is a "switch" statement in Python. ;-) Stay tuned.
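
One idiom that is often used for this kind of dispatch is a dictionary that maps each keyword to a handler function. The sketch below is only one possible shape of it; the handler names and the state dictionary are made up for the illustration.

```python
def handle_employee(tokens, state):
    state["employee"] = tokens[1]

def handle_project(tokens, state):
    state["projects"][tokens[1]] = tokens

def handle_month(tokens, state):
    state["months"].append((tokens[1], tokens[2]))

# keyword -> function that knows how to process such a line
handlers = {
    "employee": handle_employee,
    "project": handle_project,
    "month": handle_month,
}

def dispatch(line, state):
    tokens = line.split()
    handler = handlers.get(tokens[0])
    if handler is not None:
        handler(tokens, state)
    elif tokens[0].isdigit():
        # day entries start with a number instead of a keyword
        state["days"].append(tokens)

state = {"employee": "nobody", "projects": {}, "months": [], "days": []}
for line in ["project N BigCustomer Musterstr.99,Duesseldorf",
             "month 01 2007",
             "1 N 9:00 17:30 0:30"]:
    dispatch(line, state)
```

Adding a new keyword then means writing one function and adding one dictionary entry, without touching the loop.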
