Robert's Notebook

Computational Chemistry, Python, Politics, Policy

Next Steps With Swift

Let's try doing something a little more complicated with Swift. Here's my new Swift script. Basically, it runs a Python script on each of a set of input files.

# count.swift
type File;
type Pythonscript;

app (File o) python(Pythonscript script, File input) {
  # this script will get executed just as python process.py <input>
  python @script @input stdout=@filename(o);
}

File inputfiles[] <filesys_mapper; pattern="*.txt">;
Pythonscript pyscript <"process.py">;

foreach f in inputfiles {
  File c <regexp_mapper;  source=@f, match="(.*)txt", transform="\\1processed">;
  c = python(pyscript, f);
}
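
To make the regexp_mapper line a bit more concrete, here is a rough Python analogy of its match/transform pair. It only mimics the string rewriting using Python's re module; it is not how Swift implements the mapper, and output_name is a hypothetical helper name I made up for illustration.

```python
import re

def output_name(input_name):
    # Analogy for match="(.*)txt", transform="\\1processed":
    # (.*) captures everything before the trailing "txt",
    # and \1 puts that captured text back in front of "processed".
    return re.sub(r"(.*)txt", r"\1processed", input_name)

print(output_name("alice.in.wonderland.txt"))  # alice.in.wonderland.processed
```

So each input file such as alice.in.wonderland.txt gets an output file with the same stem and a .processed ending.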

My Python script counts the words in a file and prints the 100 most common words to stdout.

# process.py
"""count the most common words in the file
"""
import os
import sys
import string
import pprint
from collections import Counter

counter = Counter()
exclude = set(string.punctuation)

with open(sys.argv[1]) as f:
    for line in f:
        for elem in line.split():
            word = ''.join(ch for ch in elem.lower() if ch not in exclude)
            counter[word] += 1

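# print() with a single argument behaves the same under Python 2 and 3
print(counter.most_common(100))
# this will print the uname to stdout so that we can see where we executed
# (when stdout is redirected to a file, the print above is buffered until
# exit, so the uname line can land first in the output)
os.system('uname -a')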
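
The per-word cleanup is the heart of the script, so here is the same logic pulled into a small function that can be tried on a list of strings instead of a file. The name count_words and the sample sentence are mine, purely for illustration. It also shows a side effect of stripping punctuation this way: a token made up only of punctuation collapses to the empty string and still gets counted, which is presumably where an entry like ('', 75) in the results comes from.

```python
import string
from collections import Counter

def count_words(lines):
    # Same normalization as the script: lowercase each token
    # and drop every punctuation character from it.
    counter = Counter()
    exclude = set(string.punctuation)
    for line in lines:
        for elem in line.split():
            word = ''.join(ch for ch in elem.lower() if ch not in exclude)
            counter[word] += 1
    return counter

counts = count_words(["Don't stop -- the Queen said, 'the end.'"])
print(counts["the"])   # "the" and "'the" both normalize to "the"
print(counts["dont"])  # the apostrophe is stripped from "Don't"
print(counts[""])      # the bare "--" token becomes the empty string
```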

I downloaded three books from Project Gutenberg: Les Miserables, Pride and Prejudice, and Alice in Wonderland. The files all end in the .txt extension, so they get picked up by the inputfiles mapper.

After running swift count.swift, I now have three new files on my workstation:

$ tail *.processed
==> alice.in.wonderland.processed <==
Linux vsp-compute-22.Stanford.EDU 2.6.18-274.el5 #1 SMP Fri Jul 22 04:43:29 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
[('the', 1804), ('and', 912), ('to', 801), ('a', 684), ('of', 625), ('it', 541), ('she', 538), ('said', 462), ('you', 429), ('in', 428), ('i', 400), ('alice', 385), ('was', 358), ('that', 291), ('as', 272), ('her', 248), ('with', 228), ('at', 224), ('on', 204), ('all', 197), ('this', 181), ('for', 179), ('had', 178), ('but', 169), ('not', 165), ('be', 165), ('or', 154), ('so', 151), ('very', 145), ('what', 137), ('they', 130), ('is', 128), ('little', 128), ('he', 122), ('its', 117), ('if', 114), ('out', 114), ('one', 102), ('about', 102), ('down', 101), ('up', 101), ('do', 98), ('no', 97), ('his', 96), ('then', 90), ('were', 87), ('know', 87), ('project', 86), ('like', 85), ('have', 85), ('them', 84), ('would', 83), ('went', 83), ('herself', 83), ('again', 82), ('when', 80), ('could', 78), ('there', 77), ('any', 76), ('by', 76), ('', 75), ('thought', 74), ('off', 73), ('are', 72), ('your', 71), ('see', 69), ('me', 68), ('how', 68), ('queen', 68), ('time', 68), ('into', 67), ('who', 64), ('did', 62), ('king', 61), ('an', 61), ('dont', 60), ('well', 60), ('my', 58), ('began', 58), ('im', 57), ('now', 57), ('turtle', 56), ('gutenbergtm', 56), ('mock', 56), ('which', 56), ('hatter', 55), ('gryphon', 55), ('quite', 55), ('must', 54), ('way', 54), ('work', 53), ('think', 53), ('other', 53), ('much', 52), ('some', 52), ('their', 52), ('just', 51), ('only', 51), ('from', 51), ('say', 50)]

==> les.miserables.processed <==
Linux vsp-compute-21.Stanford.EDU 2.6.18-274.el5 #1 SMP Fri Jul 22 04:43:29 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
[('the', 40845), ('of', 19924), ('and', 14877), ('a', 14485), ('to', 13705), ('in', 11183), ('he', 9580), ('was', 8613), ('that', 7768), ('it', 6475), ('his', 6459), ('is', 6184), ('had', 6171), ('which', 5138), ('with', 4525), ('on', 4462), ('at', 4055), ('this', 3971), ('not', 3799), ('you', 3661), ('i', 3634), ('as', 3253), ('one', 3127), ('for', 2964), ('him', 2923), ('have', 2793), ('her', 2633), ('there', 2615), ('who', 2540), ('all', 2451), ('from', 2447), ('she', 2428), ('be', 2389), ('by', 2382), ('are', 2159), ('an', 2116), ('they', 2113), ('but', 2043), ('no', 1967), ('man', 1899), ('were', 1824), ('what', 1796), ('said', 1791), ('been', 1517), ('when', 1362), ('marius', 1352), ('we', 1278), ('their', 1252), ('will', 1226), ('two', 1183), ('so', 1180), ('jean', 1176), ('my', 1166), ('me', 1150), ('more', 1128), ('himself', 1079), ('has', 1077), ('them', 1064), ('would', 1052), ('valjean', 1046), ('then', 1034), ('its', 1013), ('these', 998), ('did', 993), ('into', 992), ('out', 984), ('little', 975), ('like', 962), ('or', 954), ('do', 928), ('very', 922), ('up', 921), ('cosette', 913), ('other', 879), ('m', 878), ('old', 873), ('than', 866), ('made', 782), ('some', 781), ('only', 780), ('good', 773), ('time', 758), ('your', 757), ('those', 730), ('nothing', 729), ('if', 728), ('without', 699), ('could', 678), ('day', 673), ('rue', 664), ('about', 642), ('well', 614), ('where', 614), ('say', 598), ('men', 596), ('de', 592), ('any', 578), ('', 577), ('here', 576), ('first', 565)]

==> pride.and.prejudice.processed <==
Linux vsp-compute-20.Stanford.EDU 2.6.18-274.el5 #1 SMP Fri Jul 22 04:43:29 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
[('the', 4495), ('to', 4207), ('of', 3715), ('and', 3602), ('her', 2215), ('i', 2051), ('a', 1996), ('in', 1919), ('was', 1844), ('she', 1704), ('that', 1582), ('it', 1535), ('not', 1445), ('you', 1417), ('he', 1333), ('his', 1267), ('be', 1257), ('as', 1189), ('had', 1174), ('with', 1098), ('for', 1071), ('but', 977), ('is', 883), ('have', 846), ('at', 801), ('mr', 783), ('him', 761), ('on', 726), ('my', 717), ('by', 657), ('all', 637), ('they', 604), ('elizabeth', 594), ('so', 585), ('were', 565), ('which', 542), ('could', 525), ('been', 515), ('from', 505), ('this', 493), ('no', 493), ('very', 486), ('what', 474), ('would', 469), ('your', 465), ('their', 441), ('me', 439), ('them', 434), ('will', 418), ('said', 401), ('such', 393), ('or', 373), ('when', 372), ('darcy', 371), ('do', 364), ('if', 364), ('are', 359), ('an', 357), ('there', 347), ('mrs', 343), ('much', 328), ('more', 326), ('must', 318), ('am', 316), ('any', 306), ('bennet', 293), ('who', 286), ('than', 284), ('miss', 283), ('did', 270), ('one', 266), ('jane', 263), ('we', 260), ('bingley', 257), ('should', 250), ('know', 239), ('how', 231), ('before', 229), ('herself', 224), ('has', 223), ('other', 222), ('can', 221), ('though', 221), ('never', 220), ('only', 217), ('soon', 216), ('well', 212), ('think', 211), ('now', 209), ('some', 209), ('may', 207), ('time', 200), ('might', 200), ('after', 199), ('every', 198), ('most', 190), ('little', 189), ('lady', 183), ('own', 183), ('good', 182)]

Looks like "the" is the most common word in the English language. No surprise there. But each of these calculations ran on a different node. You can see that from the uname output at the top of each file. Since the vsp-compute nodes have 24 hyperthreaded cores and Python is (usually) single-threaded, this is pretty silly. Let's see if we can do better.

In my sites.xml file, I changed the vsp-compute pool to have the following line:

<!-- Swift should run 24 app() tasks at a time within each PBS job slot -->
<profile namespace="globus" key="jobsPerNode">24</profile>
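
As a back-of-the-envelope check on what this setting should buy us, here is a tiny sketch of the arithmetic. It assumes that each PBS slot runs at most jobsPerNode tasks at once, and that the earlier run was effectively using one task per slot (my inference from seeing three different nodes, not something I checked in the config). The function name slots_needed is made up, and this says nothing about how Swift actually schedules work internally.

```python
import math

def slots_needed(n_tasks, jobs_per_node):
    # Slots required to run every task at the same time,
    # if each slot runs at most jobs_per_node tasks concurrently.
    return int(math.ceil(n_tasks / float(jobs_per_node)))

print(slots_needed(3, 1))   # 3: one slot per book
print(slots_needed(3, 24))  # 1: all three books fit in one slot
```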

And now, rerunning Swift, you can see that the jobs all ran within a single PBS slot: every file reports the same node, vsp-compute-22.

$ tail *.processed
==> alice.in.wonderland.processed <==
Linux vsp-compute-22.Stanford.EDU 2.6.18-274.el5 #1 SMP Fri Jul 22 04:43:29 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
[('the', 1804), ('and', 912), ('to', 801), ('a', 684), ('of', 625), ('it', 541), ('she', 538), ('said', 462), ('you', 429), ('in', 428), ('i', 400), ('alice', 385), ('was', 358), ('that', 291), ('as', 272), ('her', 248), ('with', 228), ('at', 224), ('on', 204), ('all', 197), ('this', 181), ('for', 179), ('had', 178), ('but', 169), ('not', 165), ('be', 165), ('or', 154), ('so', 151), ('very', 145), ('what', 137), ('they', 130), ('is', 128), ('little', 128), ('he', 122), ('its', 117), ('if', 114), ('out', 114), ('one', 102), ('about', 102), ('down', 101), ('up', 101), ('do', 98), ('no', 97), ('his', 96), ('then', 90), ('were', 87), ('know', 87), ('project', 86), ('like', 85), ('have', 85), ('them', 84), ('would', 83), ('went', 83), ('herself', 83), ('again', 82), ('when', 80), ('could', 78), ('there', 77), ('any', 76), ('by', 76), ('', 75), ('thought', 74), ('off', 73), ('are', 72), ('your', 71), ('see', 69), ('me', 68), ('how', 68), ('queen', 68), ('time', 68), ('into', 67), ('who', 64), ('did', 62), ('king', 61), ('an', 61), ('dont', 60), ('well', 60), ('my', 58), ('began', 58), ('im', 57), ('now', 57), ('turtle', 56), ('gutenbergtm', 56), ('mock', 56), ('which', 56), ('hatter', 55), ('gryphon', 55), ('quite', 55), ('must', 54), ('way', 54), ('work', 53), ('think', 53), ('other', 53), ('much', 52), ('some', 52), ('their', 52), ('just', 51), ('only', 51), ('from', 51), ('say', 50)]

==> les.miserables.processed <==
Linux vsp-compute-22.Stanford.EDU 2.6.18-274.el5 #1 SMP Fri Jul 22 04:43:29 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
[('the', 40845), ('of', 19924), ('and', 14877), ('a', 14485), ('to', 13705), ('in', 11183), ('he', 9580), ('was', 8613), ('that', 7768), ('it', 6475), ('his', 6459), ('is', 6184), ('had', 6171), ('which', 5138), ('with', 4525), ('on', 4462), ('at', 4055), ('this', 3971), ('not', 3799), ('you', 3661), ('i', 3634), ('as', 3253), ('one', 3127), ('for', 2964), ('him', 2923), ('have', 2793), ('her', 2633), ('there', 2615), ('who', 2540), ('all', 2451), ('from', 2447), ('she', 2428), ('be', 2389), ('by', 2382), ('are', 2159), ('an', 2116), ('they', 2113), ('but', 2043), ('no', 1967), ('man', 1899), ('were', 1824), ('what', 1796), ('said', 1791), ('been', 1517), ('when', 1362), ('marius', 1352), ('we', 1278), ('their', 1252), ('will', 1226), ('two', 1183), ('so', 1180), ('jean', 1176), ('my', 1166), ('me', 1150), ('more', 1128), ('himself', 1079), ('has', 1077), ('them', 1064), ('would', 1052), ('valjean', 1046), ('then', 1034), ('its', 1013), ('these', 998), ('did', 993), ('into', 992), ('out', 984), ('little', 975), ('like', 962), ('or', 954), ('do', 928), ('very', 922), ('up', 921), ('cosette', 913), ('other', 879), ('m', 878), ('old', 873), ('than', 866), ('made', 782), ('some', 781), ('only', 780), ('good', 773), ('time', 758), ('your', 757), ('those', 730), ('nothing', 729), ('if', 728), ('without', 699), ('could', 678), ('day', 673), ('rue', 664), ('about', 642), ('well', 614), ('where', 614), ('say', 598), ('men', 596), ('de', 592), ('any', 578), ('', 577), ('here', 576), ('first', 565)]

==> pride.and.prejudice.processed <==
Linux vsp-compute-22.Stanford.EDU 2.6.18-274.el5 #1 SMP Fri Jul 22 04:43:29 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
[('the', 4495), ('to', 4207), ('of', 3715), ('and', 3602), ('her', 2215), ('i', 2051), ('a', 1996), ('in', 1919), ('was', 1844), ('she', 1704), ('that', 1582), ('it', 1535), ('not', 1445), ('you', 1417), ('he', 1333), ('his', 1267), ('be', 1257), ('as', 1189), ('had', 1174), ('with', 1098), ('for', 1071), ('but', 977), ('is', 883), ('have', 846), ('at', 801), ('mr', 783), ('him', 761), ('on', 726), ('my', 717), ('by', 657), ('all', 637), ('they', 604), ('elizabeth', 594), ('so', 585), ('were', 565), ('which', 542), ('could', 525), ('been', 515), ('from', 505), ('this', 493), ('no', 493), ('very', 486), ('what', 474), ('would', 469), ('your', 465), ('their', 441), ('me', 439), ('them', 434), ('will', 418), ('said', 401), ('such', 393), ('or', 373), ('when', 372), ('darcy', 371), ('do', 364), ('if', 364), ('are', 359), ('an', 357), ('there', 347), ('mrs', 343), ('much', 328), ('more', 326), ('must', 318), ('am', 316), ('any', 306), ('bennet', 293), ('who', 286), ('than', 284), ('miss', 283), ('did', 270), ('one', 266), ('jane', 263), ('we', 260), ('bingley', 257), ('should', 250), ('know', 239), ('how', 231), ('before', 229), ('herself', 224), ('has', 223), ('other', 222), ('can', 221), ('though', 221), ('never', 220), ('only', 217), ('soon', 216), ('well', 212), ('think', 211), ('now', 209), ('some', 209), ('may', 207), ('time', 200), ('might', 200), ('after', 199), ('every', 198), ('most', 190), ('little', 189), ('lady', 183), ('own', 183), ('good', 182)]

Perfect!
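
Eyeballing the uname lines works fine for three files, but with more inputs a small check is handy. Here is a minimal sketch that pulls the node name out of each uname -a line (it is the second whitespace-separated field) and lists the distinct nodes. The helper name hosts_used is hypothetical, and the sample lines are just the uname output from the two runs above, rebuilt from a template.

```python
UNAME = ("Linux vsp-compute-%d.Stanford.EDU 2.6.18-274.el5 #1 SMP "
         "Fri Jul 22 04:43:29 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux")

def hosts_used(uname_lines):
    # The second field of `uname -a` output is the node name.
    return sorted(set(line.split()[1] for line in uname_lines))

first_run = [UNAME % n for n in (22, 21, 20)]   # before the sites.xml change
second_run = [UNAME % n for n in (22, 22, 22)]  # after setting jobsPerNode

print(len(hosts_used(first_run)))  # 3
print(hosts_used(second_run))      # ['vsp-compute-22.Stanford.EDU']
```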
