Robert's Notebook

Computational Chemistry, Python, Politics, Policy

Setting up Swift

Swift is a parallel scripting language developed at the U. Chicago Computation Institute. It provides a way to manage heterogeneous clusters: I have n jobs that I want to get done, and I have access to about 5 high-performance computing clusters. I really don't want to handle file transfer, checking qstat to see where there's resource availability, etc. by hand. I want my jobs to run wherever they can, as fast as possible.

First, I downloaded the 0.94 release from here, untarred it into $HOME/local/swift-0.94, and added the executables to my $PATH.

cd $HOME/local
wget http://www.ci.uchicago.edu/swift/packages/swift-0.94.tar.gz
tar -xzvf swift-0.94.tar.gz
export PATH=$HOME/local/swift-0.94/bin:$PATH
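
To make sure the right swift is now on my PATH, a quick check is to ask it for its version string (the exact banner will depend on your build):

# prints something like: Swift 0.94 swift-r6492 cog-r3658
swift -version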

Setting up our cluster

First, we add a new cluster to the sites.xml file. This file tells swift what clusters we have available to us and what their queues look like. The file is located in $HOME/local/swift-0.94/etc/sites.xml. I added the following new pool to the file to describe our group's little analysis cluster, vsp-compute: a 40-node Linux cluster with a shared filesystem between the nodes. Each node also has individual (node-local) space mounted on /scratch.

<pool handle="vsp-compute">
  <!-- use the "coaster" provider, which enables Swift to ssh to another system and qsub from there -->
  <execution provider="coaster" jobmanager="ssh-cl:pbs" url="vsp-compute-01.stanford.edu"/>

  <!-- app() tasks should be limited to 5 minutes walltime -->
  <profile namespace="globus" key="maxWalltime">00:05:00</profile>

  <!-- app() tasks will be run within PBS coaster "pilot" jobs. Each PBS job should have a walltime of 1 hour -->
  <profile namespace="globus" key="lowOverAllocation">100</profile>
  <profile namespace="globus" key="highOverAllocation">100</profile>
  <profile namespace="globus" key="maxtime">3600</profile>

  <!-- Up to 5 concurrent PBS coaster jobs each asking for 1 node will be submitted to the default queue -->
  <profile namespace="globus" key="queue">default</profile>
  <profile namespace="globus" key="slots">5</profile>
  <profile namespace="globus" key="maxnodes">1</profile>
  <profile namespace="globus" key="nodeGranularity">1</profile>

  <!-- Swift should run only one app() task at a time within each PBS job slot -->
  <profile namespace="globus" key="jobsPerNode">1</profile>

  <profile namespace="karajan" key="jobThrottle">1.00</profile>
  <profile namespace="karajan" key="initialScore">10000</profile>

  <!-- the scratch filesystem is unique to each node, and not shared across the cluster -->
  <workdirectory>/scratch/{env.USER}/.swiftwork</workdirectory>
</pool>
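
Because the coaster provider is going to ssh into vsp-compute and submit PBS jobs on my behalf, a quick sanity check from the workstation (my own habit, not something Swift requires) is to confirm that passwordless ssh and the PBS client tools both work:

# should list the PBS queues on vsp-compute without prompting for a password
ssh vsp-compute-01.stanford.edu qstat -q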

Next, we set up the transformations catalog, tc.data. This file specifies which commands are installed on each machine. The user-specific tc.data is in $HOME/local/swift-0.94/etc/tc.data. I added two lines to the bottom to describe the software available there. The lines are

# vsp-compute
vsp-compute     uname   /bin/uname  null    null    null
vsp-compute     wc      /usr/bin/wc null    null    null

This tells the swift execution engine that the uname and wc commands are available on vsp-compute.
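
The format extends in the obvious way. For instance, if I later wanted to call hostname from a Swift script, a hypothetical third entry (not part of my actual setup) would look like:

# hypothetical entry: make /bin/hostname callable from swift scripts
vsp-compute     hostname    /bin/hostname   null    null    null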

Setting up the swift.properties file

How should we transfer the input files to the compute nodes (and bring the output files back)? One option is called "coaster provider staging". To set this up, I opened the $HOME/local/swift-0.94/etc/swift.properties file and changed these four settings.

# this lets the provider deal with the staging of files. we want this because
# vsp-compute does not share a filesystem with my workstation.
use.provider.staging=true
provider.staging.pin.swiftfiles=true
status.mode=provider

# this is just for debugging
wrapperlog.always.transfer=true

Dealing with a weird ssh issue

There was an issue with my ssh keys. To save you the pain of debugging this: if you have a file on your machine at $HOME/.ssh/id_rsa.pub but not one at $HOME/.ssh/identity.pub, make these softlinks.

ln -s ~/.ssh/id_rsa ~/.ssh/identity
ln -s ~/.ssh/id_rsa.pub ~/.ssh/identity.pub
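
A quick way to confirm the links point where you expect:

# both should resolve to the existing id_rsa keypair
ls -l ~/.ssh/identity ~/.ssh/identity.pub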

Running a parallel script

Enough configuration! I want to execute a script that just runs the *nix uname command. Remember, this command needs to be available in tc.data.

Here's my swift script.

# uname.swift
type file;

app (file o) uname() {
  # execute the uname command, with the argument -a, sending stdout to a file
  uname "-a" stdout=@o;
}
file outfile <"uname.txt">;

outfile = uname();

To run it, I just execute the script from the command line:

$ swift uname.swift

The following gets printed to my terminal:

Swift 0.94 swift-r6492 cog-r3658

RunID: 20130604-1330-fpx5r78b
Progress:  time: Tue, 04 Jun 2013 13:30:06 -0700
Progress:  time: Tue, 04 Jun 2013 13:30:36 -0700  Submitting:1
Progress:  time: Tue, 04 Jun 2013 13:30:49 -0700  Submitted:1
Progress:  time: Tue, 04 Jun 2013 13:30:51 -0700  Stage in:1
Final status: Tue, 04 Jun 2013 13:30:51 -0700  Finished successfully:1

Looking in my working directory, I now have a new file called uname.txt. The file indicates that the job ran on one of the vsp-compute worker nodes. Swift transparently submitted a PBS job and copied the results back to my workstation.

$ cat uname.txt
Linux vsp-compute-31.Stanford.EDU 2.6.18-274.el5 #1 SMP Fri Jul 22 04:43:29 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
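
This run only contained a single task, though. To actually exercise the parallelism, the natural next step is something like the sketch below, which launches ten uname calls concurrently and maps each result to its own output file. This is a sketch I haven't run on vsp-compute; the simple_mapper parameters and the [0:9] range syntax are my reading of the Swift documentation, so double-check them against the user guide.

# many_uname.swift
type file;

app (file o) uname() {
  # same app definition as before
  uname "-a" stdout=@o;
}

# map array elements to output files named uname.<index>.txt
file outputs[] <simple_mapper; prefix="uname.", suffix=".txt">;

# the iterations are independent, so swift is free to run them in parallel
foreach i in [0:9] {
  outputs[i] = uname();
}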
