Text Processing in Python - standard reading for all Python programmers

I came across a thread on reddit the other day where someone suggested looking into "Text Processing in Python". All I can say is wow! I ran through the first chapter and am drooling. I'm going to be ordering the book from Amazon for sure.

So as I was reading through it, the book made me think of a conversation that my former boss and I once had. I forget the context of the discussion, but I clearly remember it centering on VBScript. I do not care for VB at all. In fact, I have only written one VBScript in my entire career, and frankly that was enough to convince me that I don't need to be "ah wastein' mah time" with it. My boss, on the other hand, loved it and declared that it was the "only" scripting language authorized by him to be used.

In his effort to persuade me that it was a "good" language, he was going to impress me. He opens up Notepad and types in:

Msgbox "Hello World!!"

Then he saves it and executes it. Voilà! A Windows message box appears with "Hello World!!" in it. Then he proceeds to tell me [paraphrasing here] "Norm, VB is the only language in the world that I know of where you can do something like that so easily." I guess I should have been impressed, but I wasn't. What I was thinking at the time, but knew better than to say, was "Nice, if all I need is message boxes, I'll use VB. But if I need to do something useful like..... I'll use Python." =)

So that brings me to the present. That "something useful like...." would be text processing. Let's say you have a large text file, something like Ulysses by James Joyce, and you want to know how many times the word "truth" occurs in the document. In VBScript, run under Windows Script Host, here is _all_ you have to do (code taken from this post):

Const ForReading = 1

Set oFSO = CreateObject("Scripting.FileSystemObject")

Set re = New RegExp
re.Pattern    = "truth"
re.IgnoreCase = False
re.Global     = True
re.Multiline  = True

strReport = ""
For Each strFileName in WScript.Arguments.Unnamed
  Set oFile = oFSO.OpenTextFile(strFileName, ForReading)
  strText = oFile.ReadAll
  oFile.Close

  intCount  = re.Execute(strText).Count
  strReport = strReport & "There are " & intCount & " matches in " & strFileName & vbNewLine
Next

WScript.Echo strReport 

Including white space you have 21 lines of code. The answer is 29.

Now let's contrast that with Python:

f = open('c:\\ulyss.txt')
string = "truth"
tp = f.read()
f.close()

print("The word " + string + " appears: %d times." % tp.count(string))

Including white space we have 6 lines. And the winner is........
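
One thing worth pointing out: both scripts above count every place the letters "truth" show up, so "truthful" or "untruth" would be counted too. If you only want the standalone word, something like the rough sketch below might do it. The count_word helper and the sample sentence are my own made-up names for illustration, not anything from the book, and the match stays case-sensitive just like the scripts above.

```python
import re

def count_word(text, word):
    """Count case-sensitive, whole-word occurrences of word in text."""
    # \b marks a word boundary; re.escape guards against regex
    # metacharacters in the search word.
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text))

# Hypothetical sample text, just to show the difference.
sample = "The truth is that truthful people tell the truth."

print(count_word(sample, "truth"))  # whole words only
print(sample.count("truth"))        # plain substring count
```

To run it against the real file, you would pass in the text read from the file (tp in the script above) instead of the sample string.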

Imagine the things Microsoft could do if they would replace VB with IronPython or IronRuby. One can imagine I suppose.