A short searching script in Python

Recently, I was working with Monte Carlo samples and I noticed a couple of oddities when plotting the mass of the visible decay products of the tau. It turned out that the script wasn’t properly removing unwanted particles (like neutrinos, which should not be visible in the actual detector). As part of the debugging process, I needed a list of particle IDs (since in Monte Carlo you know precisely what kind of particles you are generating and what their properties are) to check what was going on. The problem was that I had dozens of log files, each containing several thousand lines like this:

FirstEvent is 21000 and EventMax is 22000
  Number of events is 45500
ID=111
ID=-211
ID=-11
  visible true tau pt is 45531
  visible true tau m is 1353.26
ID=111
ID=-211
ID=-211
ID=13
  visible true tau pt is 30605.4
  visible true tau m is 1130.7

Although only a few particle IDs are used in this MC sample, I wanted to be sure not to miss anything, so I wrote the following Python script:

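import glob, os

# Work from the directory that holds the log files
os.chdir("/home/mydirectory/")

IDlist = []

for file in glob.glob("*.sh"):
    with open(file, "r") as currentfile:
        for line in currentfile:
            if "ID=" in line:
                # Keep the number after "ID=" and convert it to an integer
                particleID = int(line.split("=")[1])
                # Store each particle ID only once
                if particleID not in IDlist:
                    IDlist.append(particleID)

print(IDlist)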

This is a perfect example of why I like Python so much: it’s short, it’s clean, it’s easily readable, and it’s fast.

We only need two modules: glob and os. The former lets you search for files by pathname pattern, pretty much as you would in a Unix shell (with ls); the latter provides os.chdir, which changes the working directory (opening and reading the files is done with the built-in open). Pretty standard, right? The next line simply sets the working directory to where my log files are; it’s more for convenience and readability than sheer necessity. Indeed, I could just as well have given the full path to my files later on.
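As a rough sketch of that full-path alternative, the pattern can be joined to the directory name so that the working directory never changes. The helper name find_log_files is hypothetical and not part of the original script; the directory is the same placeholder used above.

```python
import glob
import os


def find_log_files(directory, pattern="*.sh"):
    """Return the sorted paths in `directory` that match `pattern`.

    Hypothetical helper, not part of the original script.
    """
    # os.path.join builds "<directory>/<pattern>", so no os.chdir is needed.
    return sorted(glob.glob(os.path.join(directory, pattern)))


# glob returns an empty list when nothing matches or the directory is missing.
logfiles = find_log_files("/home/mydirectory/")
print("%d log files found" % len(logfiles))
```

The paths returned this way already include the directory, so they can be passed straight to open.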

The purpose of this script is to produce a list of particle IDs, so we define just that: an empty list that we’ll fill by looping over the files. We get the list of matching files by calling glob.glob("*.sh") - again, think ls *.sh in Unix. The for loop then assigns each file name in turn to the variable file, and we open the current file in read mode.

Now that we’re in a given file, we need to parse its contents and isolate the data we need - the number after “ID=”. We loop over the lines and retain only those that contain this string. line.split("=") returns two strings for us, “ID” and “<some number>”, conveniently arranged in a list. We simply add [1] to access the second one. Finally, we convert it from a string to an integer with int (which ignores the trailing newline) and assign the result to particleID.
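To see the parsing step in isolation, here is a minimal sketch applied to a few lines taken from the log excerpt above. The function name parse_particle_id is hypothetical; the original script does the same thing inline.

```python
def parse_particle_id(line):
    """Return the integer after "ID=", or None if the line has no ID.

    Hypothetical helper; the original script does this inline.
    """
    if "ID=" not in line:
        return None
    # split("=") gives ["ID", "-211\n"]; int() ignores surrounding whitespace.
    return int(line.split("=")[1])


sample = ["ID=111\n", "ID=-211\n", "  visible true tau m is 1353.26\n"]
print([parse_particle_id(line) for line in sample])
```

The first two lines yield 111 and -211, while the line without “ID=” is skipped and yields None.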

The last step is to check whether particleID is already in our list (we don’t want all the particle IDs, just all the different ones); if it’s not, we add it.

[me@work]$ python myscript.py
[-211,211,111,22,-11,11,-13,13,311,-321,221,-321,130,-311,321,310,223,323]
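The "not in" check on a list scans the whole list every time, which is harmless for a handful of particle IDs. If the number of distinct values were much larger, one possible variation (a sketch, not what the script above does) is to track the IDs already seen in a set while keeping a list for the order of first appearance. The name collect_unique_ids is hypothetical, and the sample lines are the IDs from the log excerpt.

```python
def collect_unique_ids(lines):
    """Return the distinct particle IDs in `lines`, in order of first appearance."""
    seen = set()   # fast membership test
    ordered = []   # remembers the order in which IDs first appeared
    for line in lines:
        if "ID=" in line:
            particle_id = int(line.split("=")[1])
            if particle_id not in seen:
                seen.add(particle_id)
                ordered.append(particle_id)
    return ordered


sample = ["ID=111\n", "ID=-211\n", "ID=-11\n",
          "ID=111\n", "ID=-211\n", "ID=-211\n", "ID=13\n"]
print(collect_unique_ids(sample))
```

For these seven lines the result is [111, -211, -11, 13].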