Swift is a parallel scripting language developed at the U. Chicago Computation Institute.
It provides a way to manage heterogeneous clusters. I have n jobs that I want to get done,
and I have access to about five high-performance computing clusters. I really don't want to
handle file transfer, checking qstat to see where there's resource availability, and so on.
I want my jobs to run wherever they can, as fast as possible.
First, I downloaded the 0.94 release from here, untarred it into
$HOME/local/swift-0.94, and added the executables to my $PATH.
cd $HOME/local
wget http://www.ci.uchicago.edu/swift/packages/swift-0.94.tar.gz
tar -xzvf swift-0.94.tar.gz
export PATH=$HOME/local/swift-0.94/bin:$PATH
Setting up our cluster
First, we add a new cluster to the sites.xml file. This file tells Swift
what clusters we have available to us, and what their queues look like. The file
is located at $HOME/local/swift-0.94/etc/sites.xml. I added the following
new pool to the file to describe our group's little analysis cluster, vsp-compute.
vsp-compute is a 40-node Linux cluster with a shared filesystem between the nodes.
Each node also has individual (node-local) space mounted on /scratch.
<pool handle="vsp-compute">
  <!-- use the "coaster" provider, which enables Swift to ssh to another system and qsub from there -->
  <execution provider="coaster" jobmanager="ssh-cl:pbs" url="vsp-compute-01.stanford.edu"/>

  <!-- app() tasks should be limited to 5 minutes walltime -->
  <profile namespace="globus" key="maxWalltime">00:05:00</profile>

  <!-- app() tasks will be run within PBS coaster "pilot" jobs. Each PBS job should have a walltime of 1 hour -->
  <profile namespace="globus" key="lowOverAllocation">100</profile>
  <profile namespace="globus" key="highOverAllocation">100</profile>
  <profile namespace="globus" key="maxtime">3600</profile>

  <!-- Up to 5 concurrent PBS coaster jobs each asking for 1 node will be submitted to the default queue -->
  <profile namespace="globus" key="queue">default</profile>
  <profile namespace="globus" key="slots">5</profile>
  <profile namespace="globus" key="maxnodes">1</profile>
  <profile namespace="globus" key="nodeGranularity">1</profile>

  <!-- Swift should run only one app() task at a time within each PBS job slot -->
  <profile namespace="globus" key="jobsPerNode">1</profile>
  <profile namespace="karajan" key="jobThrottle">1.00</profile>
  <profile namespace="karajan" key="initialScore">10000</profile>

  <!-- the scratch filesystem is unique to each node, and not shared across the cluster -->
  <workdirectory>/scratch/{env.USER}/.swiftwork</workdirectory>
</pool>
Next, we set up the transformations catalog, tc.data. This file specifies what
commands are installed on each machine. The user-specific tc.data is at
$HOME/local/swift-0.94/etc/tc.data. I added two lines to the bottom
to describe the software available on vsp-compute. The lines are
# vsp-compute
vsp-compute uname /bin/uname null null null
vsp-compute wc /usr/bin/wc null null null
This tells the Swift execution engine that the uname and wc commands are
available on vsp-compute.
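For context, the second column of each tc.data line is the transformation name, and the third is the path it resolves to on that site. When an app() body in a Swift script invokes wc, Swift looks that name up in tc.data and runs /usr/bin/wc on vsp-compute. A rough sketch of such a wrapper (this wc app isn't part of my actual script below) might look like:
# sketch only: the "wc" inside the braces is the tc.data transformation name,
# not a path; Swift maps it to /usr/bin/wc when the task lands on vsp-compute
app (file o) wc(file i) {
    wc @i stdout=@o;
}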
Setting up the swift.properties file
How should we transfer the input files to the compute nodes (and bring the
output files back)? One option is called "coaster provider staging". To set
this up, I opened the $HOME/local/swift-0.94/etc/swift.properties file
and changed these four settings.
# this lets the provider deal with the staging of files. we want this because
# vsp-compute does not share a filesystem with my workstation.
use.provider.staging=true
provider.staging.pin.swiftfiles=true
status.mode=provider
# this is just for debugging
wrapperlog.always.transfer=true
Dealing with a weird ssh issue
There was an issue with my ssh keys. To save you the pain of debugging this:
if you have a file on your machine at $HOME/.ssh/id_rsa.pub but not one at
$HOME/.ssh/identity.pub, make these softlinks.
ln -s ~/.ssh/id_rsa ~/.ssh/identity
ln -s ~/.ssh/id_rsa.pub ~/.ssh/identity.pub
Running a parallel script
Enough configuration! Here's the script that I want to execute. It just runs
the *nix uname command; remember, this command needs to be available in tc.data
(we added it above). Here's my Swift script.
# uname.swift
type file;

app (file o) uname() {
    # execute the uname command, with the argument -a, sending stdout to a file
    uname "-a" stdout=@o;
}

file outfile <"uname.txt">;
outfile = uname();
To run it, I just execute the script from the command line
$ swift uname.swift
The following gets printed to my terminal
Swift 0.94 swift-r6492 cog-r3658
RunID: 20130604-1330-fpx5r78b
Progress: time: Tue, 04 Jun 2013 13:30:06 -0700
Progress: time: Tue, 04 Jun 2013 13:30:36 -0700 Submitting:1
Progress: time: Tue, 04 Jun 2013 13:30:49 -0700 Submitted:1
Progress: time: Tue, 04 Jun 2013 13:30:51 -0700 Stage in:1
Final status: Tue, 04 Jun 2013 13:30:51 -0700 Finished successfully:1
Looking in my working directory, I now have a new file called uname.txt. The
file indicates that the job ran on one of the vsp-compute worker nodes. Swift
transparently submitted a PBS job and copied the results back to my workstation.
$ cat uname.txt
Linux vsp-compute-31.Stanford.EDU 2.6.18-274.el5 #1 SMP Fri Jul 22 04:43:29 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
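That's the whole round trip for a single task, but the point of all this setup is running many jobs at once. As a sketch of where to go next (this is not from the run above; the data/ directory, the wordcount name, and the mapper parameters are placeholders of my own), a foreach loop over an array of mapped files lets Swift schedule every iteration in parallel, packing the tasks into the PBS coaster slots described in sites.xml:
# count.swift -- a hypothetical sketch, not part of the run above
type file;

# wc is listed in tc.data for vsp-compute, so Swift can run it there
app (file o) wordcount(file i) {
    wc @i stdout=@o;
}

# map every .txt file under data/ into the inputs array (assumed directory)
file inputs[] <filesys_mapper; location="data", suffix=".txt">;

# foreach iterations are independent, so Swift runs them concurrently,
# up to the slots/jobsPerNode limits set in sites.xml
foreach f, i in inputs {
    file out <single_file_mapper; file=@strcat("counts-", i, ".txt")>;
    out = wordcount(f);
}
Running it looks the same as before (swift count.swift), with coaster provider staging moving each input and output file between my workstation and the worker nodes.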