The result is a private key and a public key (.pub). The private key must be distributed to all nodes so that they can log on to the server.
Thus, put the contents of ppss-private.key.pub into a file called authorized_keys and place this file into the directory .ssh in the home directory of the PPSS user on the server.
This is necessary if you want PPSS to deploy itself to the nodes in an automated fashion. The alternative is to copy PPSS and all necessary files to each node by hand.
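The key installation on the server might look like the sketch below. The helper name `install_ppss_key` and its arguments are hypothetical and not part of PPSS; the file names follow the text above. The permissions are the ones OpenSSH normally expects when its default strict mode checks are enabled.

```shell
#!/bin/sh
# Sketch: append a public key to authorized_keys under a given home directory.
# install_ppss_key is a hypothetical helper, not part of PPSS.
install_ppss_key() {
    pubkey=$1      # e.g. ppss-private.key.pub
    home_dir=$2    # home directory of the PPSS user on the server

    mkdir -p "$home_dir/.ssh" || return 1
    chmod 700 "$home_dir/.ssh"
    # Append rather than overwrite, so keys that are already present survive.
    cat "$pubkey" >> "$home_dir/.ssh/authorized_keys" || return 1
    chmod 600 "$home_dir/.ssh/authorized_keys"
}

# Usage (run on the server as the PPSS user):
#   install_ppss_key ppss-private.key.pub "$HOME"
```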
When a node connects to the server for the first time, SSH will show you the fingerprint of the server and ask whether it is OK to connect to this host. To prevent this question, you must perform one of these actions:
* Log on to each node manually, connect once to the server and manually accept the server fingerprint
You may already have the server's public key in the ~/.ssh/known_hosts file of a system that has been used to log on to the server. If so, use the -K option to generate your own ./known_hosts file for use with PPSS. If a known_hosts file exists in the same directory in which PPSS resides, this file will automatically be used and deployed to the nodes.
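If you would rather hand PPSS a file that contains only the server's entry instead of your whole personal known_hosts file, the entry can be extracted first. The sketch below assumes the entry is stored unhashed; with `HashKnownHosts` enabled, `ssh-keygen -F` has to be used for the lookup instead. `copy_host_entry` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: copy the known_hosts line(s) for one host into a separate file.
# Assumes plain (unhashed) host names in the first field.
copy_host_entry() {
    host=$1; src=$2; dst=$3
    awk -v h="$host" '{
        # The first field is a comma-separated list of names/addresses.
        n = split($1, names, ",")
        for (i = 1; i <= n; i++)
            if (names[i] == h) { print; next }
    }' "$src" > "$dst"
    # Succeed only if at least one entry was written.
    [ -s "$dst" ]
}

# Usage:
#   copy_host_entry server.example.com ~/.ssh/known_hosts ./known_hosts
```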
Please note that using SSH keys without passphrases may pose a security threat if the machines are shared with other users. You must decide for yourself whether the security risk associated with this setup is acceptable for your environment. For example, if a node is compromised, the attacker will have (initially unprivileged) access to the server.
This is the most important part of setting up distributed PPSS. It is exactly the same as setting up a configuration file for standalone mode, except that more options are necessary.
The best way to explain how to create a configuration file for distributed PPSS is to provide an example. In this example, a script is used to encode WAV files to MP3. This script is called 'encode.sh' and takes a filename as an argument.
The third option, -c, specifies the command to be executed. *Please take special note of the single quotes and the space after the command.* You can also read -c 'encode.sh ' as -c 'encode.sh "$ITEM"'.
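Why the trailing space matters can be illustrated with a small sketch. It assumes the item is simply appended to the command string when no explicit `$ITEM` placeholder is present; `run_item` is a hypothetical helper and not PPSS's actual code, and `echo` stands in for encode.sh.

```shell
#!/bin/sh
# Sketch only: run_item is a hypothetical helper, not PPSS's actual code.
run_item() {
    template=$1
    ITEM=$2
    case $template in
        *'$ITEM'*)
            # The template already says where the item goes.
            eval "$template" ;;
        *)
            # No placeholder: append the quoted item directly after the template.
            eval "$template\"\$ITEM\"" ;;
    esac
}

run_item 'echo encoding ' 'track 01.wav'         # item appended after the space
run_item 'echo encoding "$ITEM"' 'track 01.wav'  # explicit placeholder, same result
run_item 'echo encoding' 'track 01.wav'          # no space: item fuses with the last word
```

The single quotes keep the calling shell from expanding `$ITEM` too early, so the variable is only filled in when the command is run for a specific item.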
This option specifies the location on the *server* where the files reside that must be processed. These files will be transferred using SCP to the nodes for local processing.
*Server*
The -s option specifies the SSH server that acts as both fileserver and SSH server for communication between nodes. The SSH server is mainly used for file-locking: nodes know that locked files are already processed or being processed, so another unlocked file must be selected.
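The locking idea can be illustrated locally with `mkdir`, which either creates a directory or fails if it already exists, so only one caller can win. This is a sketch of the general technique and not PPSS's actual code; in a distributed setup the equivalent command would be run on the server over SSH. `LOCK_DIR` and `try_lock` are hypothetical names.

```shell
#!/bin/sh
# Sketch: atomic per-item locks built on mkdir.
LOCK_DIR=$(mktemp -d)   # hypothetical lock location; a fresh directory for the demo

try_lock() {
    item=$1
    # Derive a lock name from the item path by replacing slashes.
    lock_name=$(printf '%s' "$item" | tr '/' '_')
    # mkdir without -p fails if the directory exists: the first caller wins.
    mkdir "$LOCK_DIR/$lock_name" 2>/dev/null
}

for item in /music/a.wav /music/b.wav /music/a.wav; do
    if try_lock "$item"; then
        echo "processing $item"
    else
        echo "skipping $item (already locked)"
    fi
done
```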
If the server acts as both file server and SSH server, it is not recommended to also use it as a node (in this case, for encoding). File transfers over SSH can take quite some processing power.
This is the name of the local system user that is used by the nodes to log on to the server with SSH. For deployment, such a user must also be present on the nodes.
Scripts using SSH require an SSH key without a passphrase. This key must be uploaded to the nodes and the nodes must know which key to use, so it must be specified.
*Script or program that must be uploaded*
The -S option specifies the script or program that should be uploaded to the node because it must be executed by the node for distributed computing. In this case, the encode.sh script must be deployed on all nodes and thus specified.
*List of nodes*
The -n option specifies the file containing all nodes. For every node, PPSS will perform actions such as deploy, start, stop and pause.
*Transfer files to local host*
If this option is specified, the file is copied from the source directory to a local temporary working directory for local processing. This is necessary if SCP is used to access files that must be processed.
If files are distributed over NFS or SMB, the files seem to be present on the local system, because the share is just a mount point and thus part of the local file system. In this case, the -t option can be omitted. However, if it is specified, files are copied to a local directory using 'cp'.
*The output directory*
If the -t option is used, the -o option specifies the destination directory on the server. The results are uploaded to this directory. If the -t option is not specified, the command 'cp' is used to transfer files back to the specified output directory.
The OUTPUT_DIR and OUTPUT_FILE variables are special. They tell your command where to store the output. This is important if you want to transfer the results of your command back to the server.
In this example, Lame requires that the user specifies an output file. PPSS generates the name of this output file for you, based on the name of the Item. This example shows that you don't need to create your own shell scripts in order to be able to use PPSS.
* = optional. If you created a file called 'known_hosts', this file will automatically be used. Warning: if you specify a different file with the -K option, the current known_hosts file will be replaced by this file.
PPSS transfers files to the node and uploads the output back to the server. In order to be able to upload output back to the server, PPSS must know where this output can be found.
By default, output is stored in the directory specified by $PPSS_LOCAL_OUTPUT/$ITEM. Of course, you can hard-code the PPSS_LOCAL_OUTPUT path. However, it is much easier to just source the PPSS configuration file and use the variables that are already defined there and that PPSS uses anyway.
The example script below uses the settings of the PPSS configuration file. It has actually been used to encode 400 GB of WAV files.
Take notice of the basename command. Items are provided with full path. Basename strips this path from the filename and uses just the filename in this script.
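How basename and the sourced variables fit together can be sketched as follows. This is a minimal illustration and not the script that was used for the encoding run; `output_path_for` is a hypothetical helper, and the fallback value for PPSS_LOCAL_OUTPUT exists only so the sketch runs on its own. The commented encoder call assumes lame's usual `lame [options] infile outfile` form.

```shell
#!/bin/sh
# Sketch of the path handling in an item-processing script.
# In a real setup the PPSS configuration file would be sourced here:
#   . ./config.cfg
# The fallback below is an assumption that keeps the sketch self-contained.
PPSS_LOCAL_OUTPUT=${PPSS_LOCAL_OUTPUT:-./ppss_output}

output_path_for() {
    item=$1
    name=$(basename "$item")   # strip the directory part of the item
    # Per-item output directory, with the extension swapped for .mp3.
    echo "$PPSS_LOCAL_OUTPUT/$name/${name%.*}.mp3"
}

# Example use inside the script (encoder call left as a comment):
#   out=$(output_path_for "$1")
#   mkdir -p "$(dirname "$out")"
#   lame --quiet "$1" "$out"
output_path_for "/data/wav/track 01.wav"
```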
* As with any decent shell script, use exit codes. Exit code 0 reflects successful execution, any other value a failure.
* Echo some information about what the script is doing. If something fails, echo what is wrong. This is caught by PPSS and logged in the logfile of the item that is processed.
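Both tips can be combined in a few lines. The sketch below uses a hypothetical `process_item` function in which `wc` stands in for the real work; the messages and the specific non-zero codes are illustrative choices, not something PPSS prescribes.

```shell
#!/bin/sh
# Sketch: report progress and problems with echo, signal the result via the exit code.
process_item() {
    item=$1
    if [ ! -r "$item" ]; then
        echo "ERROR: input file not readable: $item"
        return 1
    fi
    echo "Processing $item"
    # Placeholder for the real work (e.g. an encoder call); wc stands in here.
    if ! wc -c < "$item" > /dev/null; then
        echo "ERROR: processing failed for $item"
        return 2
    fi
    echo "Finished $item"
    return 0
}

# At the end of a real script, pass the result on as the exit code:
#   process_item "$1"
#   exit $?
```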
Once SSH access is set up and the configuration file is generated, PPSS can be deployed to the nodes. This is very simple, as this example demonstrates:
`./ppss.sh deploy -C config.cfg`
During the phase when we generated the configuration file, a nodes file was specified. Thus PPSS knows, just by reading this configuration file, which file contains a list of nodes.
{{{
bash-3.2$ ./ppss.sh deploy -C config.cfg
mrt 12 22:18:22: INFO - ---------------------------------------------------------
mrt 12 22:18:22: INFO - Distributed Parallel Processing Shell Script version 2.03
}}}
Please note that nodes will continue processing the item they are currently working on. They just stop processing new items if stop or pause is selected.
An important feature of PPSS is its extensive logging. There are two types of log files.
* A single log file created by PPSS itself. This file is found locally on each node. Using tail -f on this file, it is possible to monitor what PPSS is currently doing.
* An individual log file containing information and output of each processed item. These files are uploaded to the 'job_log' directory on the SSH server. For every item, a log file must be present.
As you can see, with a few simple grep commands, it is possible to quickly determine which items have failed to process. Also, you can see that my MacBook took 5 minutes and 23 seconds to process this WAV file.
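A search of this kind can be wrapped in a small helper. The sketch below assumes that the log of a failed item contains the word FAILURE somewhere; that marker is an assumption, so check one of your own item logs for the exact wording and pass it as the second argument if it differs. `failed_items` is a hypothetical name.

```shell
#!/bin/sh
# Sketch: list the item log files that contain a failure marker.
failed_items() {
    log_dir=$1
    marker=${2:-FAILURE}   # assumed marker; adjust to match your logs
    # grep -l prints only the names of the files that match.
    grep -l -- "$marker" "$log_dir"/* 2>/dev/null
}

# Usage:
#   failed_items job_log
#   failed_items job_log 'some other marker'
```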
I am convinced that PPSS is very easy to use and tailored to your needs. If you have questions and/or suggestions, don't hesitate to send an e-mail. If you find bugs, please report them using the issue tracker. Feedback is greatly appreciated.