Compare commits

...

No commits in common. "master" and "wiki" have entirely different histories.
master ... wiki

26 changed files with 1786 additions and 4740 deletions

322
Changelog.md Normal file

@ -0,0 +1,322 @@
### 2.98 (not released yet, available in SVN) ###
* Decided that the locking file name is based on MD5 hash of item. The log file name is based on the item itself unless the MD5 option is specified.
* Resolved [issue 66](https://code.google.com/p/ppss/issues/detail?id=66) regarding (lack of) support of Solaris.
### 2.97 ###
* Implemented some of the improvements suggested in [issue 39](https://code.google.com/p/ppss/issues/detail?id=39), such as removing the use of backticks and using [[ for tests.
* Fixed issue with copying files to local node when using regular 'cp'.
### 2.96 ###
* Fixed many minor issues, small tweaks
* Fixed issues with daemon mode not properly handling NFS mounted shares.
* Fixed issue when running in distributed mode
* Fixed [issue 40](https://code.google.com/p/ppss/issues/detail?id=40). FreeBSD is detected and then the bash shell is explicitly called. This is untested, as I do not have a BSD system.
* Fixed [issue 41](https://code.google.com/p/ppss/issues/detail?id=41): the status file of a node is not status.txt but $hostname-status.txt. So each status file is now unique, assuming that each system has a unique hostname.
* Nodes now upload their status to the ssh server. A 'ppss status' now just polls the ssh server and does not need to contact every individual client anymore.
* Fixed [issue 54](https://code.google.com/p/ppss/issues/detail?id=54): the deprecated usage of find -d instead of -depth
* Fixed [issue 56](https://code.google.com/p/ppss/issues/detail?id=56): exit codes. Exit now returns non-0 when PPSS itself fails or it failed to process an item.
* Fixed [issue 57](https://code.google.com/p/ppss/issues/detail?id=57): PPSS now keeps track of failed items. It reports if items have failed or not. Works also in distributed mode.
* Added feature for distributed mode: every node gets the same list of items. Nodes then try to process them by 'claiming' them through locking. At a certain moment, all items will be locked and it is not necessary for nodes to mindlessly continue to try and obtain a lock on all items. PPSS will now detect this and make the node finish and quit. It does this by comparing the number of item locks on the SSH server and the total number of items to process.
* Fixed [issue 41](https://code.google.com/p/ppss/issues/detail?id=41)
* Fixed [issue 42](https://code.google.com/p/ppss/issues/detail?id=42)
* Fixed [issue 46](https://code.google.com/p/ppss/issues/detail?id=46)
* Fixed [issue 47](https://code.google.com/p/ppss/issues/detail?id=47)
* Fixed [issue 52](https://code.google.com/p/ppss/issues/detail?id=52)
* Fixed [issue 60](https://code.google.com/p/ppss/issues/detail?id=60)
* Fixed [issue 61](https://code.google.com/p/ppss/issues/detail?id=61)
### 2.85 ###
* Fixed [issue 38](https://code.google.com/p/ppss/issues/detail?id=38): daemon mode lockup when using inotify
### 2.84 ###
* Fixed [issue 35](https://code.google.com/p/ppss/issues/detail?id=35): Total processing time not shown or logged.
* PPSS now estimates when PPSS will be finished (ETA).
* removed dead function get\_status (thanks Mr. Hartman).
* Improved error handling.
### 2.83 ###
* Fixed [issue 33](https://code.google.com/p/ppss/issues/detail?id=33): Daemon mode crashes if inotify is not installed. This version is available as an attachment to this issue.
* Fixed [issue 34](https://code.google.com/p/ppss/issues/detail?id=34): Daemon mode does not process new items under certain conditions.
### 2.82 ###
* This version supports the Linux inotify system. In daemon mode, file system events are processed asynchronously, in real time. If inotify is installed on the system, PPSS detects this and uses it to watch a specified directory for file system events. This allows PPSS to respond to file system events very quickly. It also does away with the locking mechanism that is required if inotify is not used. To use inotify, you must install it first (inotify-tools).
* Daemon mode now checks modification date of files to prevent processing of files while they are still being written to if inotify is not used or not installed.
* Some under the hood code improvements.
### 2.80 ###
> This version consists of many under-the-hood changes; no functionality was added. The changes are significant, however: some code has been cleaned up and some parts have been removed.
* Previously, a global locking mechanism was in place for distributing items to worker processes. A single central listener process now handles the distribution of items to worker processes. Because only one process hands out items, global locking is no longer needed. This provides a serious performance benefit.
* Fixed [issue 32](https://code.google.com/p/ppss/issues/detail?id=32): when processing large number of items, lots of memory is consumed because all items are loaded into an array. This is no longer the case. PPSS now uses 'sed' to read a particular line from an input file containing all the items. Therefore, the memory footprint of PPSS remains small.
### 2.65 ###
* Major change: PPSS now generates unique file names for log files and item locks using MD5 hashes. Thus, job file names cannot be tracked back to items, but that should not be a problem. Just grep for SUCCESS or FAILURE to determine issues, or grep for the particular item, to find the actual file containing the output.
```
./ppss_dir/job_log/51fbc529402f569855f0ec9c5edc33d1
./ppss_dir/job_log/94af29775c416edbe6dc75c8d9ec6eb5
./ppss_dir/job_log/b112de8ed197cfc738f76332b0c1d7cc
```
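As a rough illustration of the naming scheme above, the sketch below derives a fixed-length file name from an item with `md5sum`. The helper name `item_to_log_name` and the sample path are made up for this example, and the way PPSS itself computes the hash may differ (on Mac OS X the comparable tool is `md5`).

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn an arbitrary item into a safe, fixed-length name.
item_to_log_name() {
    local item="$1"
    # md5sum prints "<hash>  -" when reading stdin; keep only the hash field.
    printf '%s' "$item" | md5sum | cut -d ' ' -f 1
}

# Spaces, parentheses and slashes in the item no longer matter for the file name.
LOG_NAME=$(item_to_log_name "/music/album one/track (1).wav")
echo "./ppss_dir/job_log/$LOG_NAME"
```

Because the name no longer resembles the item, finding the log for a given item means searching the log contents, as the changelog entry suggests.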
* Fixed [issue 31](https://code.google.com/p/ppss/issues/detail?id=31): some strange files appeared under certain conditions.
* Some fixes regarding distributed mode.
### 2.63 (not released) ###
* PPSS can now run as a daemon, watching a file or directory with --daemon. Read the docs (TODO)
> > You must create a lock dir (default INPUT\_LOCK) within the source directory (specified with -d) to make sure that files are not read while they are still being written. After you have finished placing items in this directory, remove the lock directory.
* Added support for quiet operation. Only a progress indicator is displayed.
* Some minor cosmetic cleanups (display of percentage mechanism).
* Added some improvement based on comments on the code (thanks!)
### 2.62 ###
* Added support for reading from stdin with -f -, as a suggestion of walkerj /at/ walkerj.de. You can now do stuff like: cat /some/file | ppss -f - -c 'echo '
### 2.61 ###
* Fixed compatibility with Sun Solaris 10.
### 2.60 ###
* Cleaned up some code.
* Added some comments.
* Released PPSS as a .deb Debian / Ubuntu package.
### 2.60b2 (BETA) ###
* Fixed distributed computing. Many small bug fixes and changes.
* Changed some command line parameters, beware. -t is gone, --upload and --download are new.
* Cleaned up help page.
* Incorporated the help instructions on Amazon EC2 options.
### 2.57b1 (BETA, NOT RELEASED) ###
* Incorporated the patch from Sean M. Collins that integrates the use of the EC2 platform of Amazon with PPSS. Through this patch, PPSS can start EC2 instances and deploy PPSS on them.
### 2.56b4 (BETA, NOT RELEASED) ###
* Distributed processing using SSH is fixed partially, but it must be improved.
* PPSS now reports the total processing time, not only of individual items.
### 2.56b2 (BETA) ###
* Added new option '-r' that disables recursive traversal of directories.
* Fixed an error in the new recursion mechanism that prevented processing of symlinks (thanks John Lehr)
* Revamped logging: there is now a better separation between messages that must be displayed, logged or only logged when debugging is enabled with 'export PPSSDEBUG=1'
* Distributed processing using SSH is BROKEN in this version.
### 2.56b (BETA) ###
* Changed license from BSD to GPL.
* Renamed ppss.sh to just 'ppss' to make it more like a regular Unix command.
* The -d (directory) option now works differently. The option operates recursively, thus also processing all files within sub directories. This is the default. Recursion will be disabled as an option (which is not present yet).
* Fixed a bug that prevented PPSS from properly handling files with special characters or paths.
* Added an example script to svn that transcodes flac to mp3 in parallel using PPSS.
### 2.50 ###
* Created a unit-test script using shunit2.
* Reworked the process management code. Management of child processes within a shell script is always a hassle. It could occur that when PPSS was interrupted and aborted with ctrl-c, some spawned processes would continue to run until finished. PPSS now identifies all processes by their ppid and pgid and kills the appropriate processes when ctrl-c is invoked.
* Reworked some file name parsing issues.
* A mistake prevented nodes from setting up a single SSH 'channel'. As a result, an SSH connection had to be set up and torn down every time a node wanted to lock an item, which is too slow. This regression is now fixed.
### 2.41 ###
* Fixed some stupid bug that prevented distributed PPSS from functioning.
### 2.40 ###
* All usage screens have been overhauled to make it more readable.
* Reworked the distributed stuff.
* PPSS now keeps track of processes and kills them gracefully.
* Bugs in process handling have been removed.
### 2.34 ###
* PPSS now works on Solaris if Bash is installed.
* PPSS can now be run simultaneously within the same directory. If multiple instances are started of PPSS with the same arguments, they work together. If other arguments are used, they work separately.
* Cleaned stuff up a bit.
```
Oct 21 16:38:48: =========================================================
Oct 21 16:38:48: |P|P|S|S|
Oct 21 16:38:48: Distributed Parallel Processing Shell Script version 2.34
Oct 21 16:38:48: =========================================================
Oct 21 16:38:48: Hostname: opensolaris-vm
Oct 21 16:38:48: ---------------------------------------------------------
Oct 21 16:38:49: Processor architecture: i386 @ 3600 MHz.
Oct 21 16:38:49: Found 2 logic processors.
Oct 21 16:38:49: Starting 2 parallel workers.
Oct 21 16:38:49: ---------------------------------------------------------
```
### 2.31 ###
* The status screen in distributed mode is now working properly again.
### 2.30 ###
* PPSS now operates fully asynchronously. There are no polling mechanisms; every action is almost real-time.
### 2.21 ###
* Fixed bug in new mechanism for detecting multiple PPSS instances. (My fault).
* By default now PPSS creates a directory 'ppss' in the current working dir of PPSS.
> > By using 'export PPSS\_DIR=/some/dir' you can change the directory used.
### 2.20 ###
* PPSS did not take into account that different users can run PPSS on a single system. This has been fixed. Thanks to Cinly Ooi
* If a user is starting another instance of PPSS, the second instance will abort unless the -F option is specified.
Please note that if you run multiple instances of PPSS under the same user account, all instances will process items but fail to terminate.
* Improved error reporting.
### 2.19 ###
* Improved filename sanitization.
* Fixed typo.
### 2.18 ###
* PPSS now also records the CPU model of Mac OS X devices.
```
mrt 29 23:11:56: INFO =========================================================
mrt 29 23:11:56: INFO |P|P|S|S|
mrt 29 23:11:56: INFO Distributed Parallel Processing Shell Script version 2.18
mrt 29 23:11:56: INFO =========================================================
mrt 29 23:11:56: INFO Hostname: MacBoek.local
mrt 29 23:11:56: INFO ---------------------------------------------------------
mrt 29 23:11:56: INFO Found 2 logic processors.
mrt 29 23:11:56: INFO CPU: Intel Core 2 Duo 2.16 GHz
mrt 29 23:11:56: INFO Starting 2 workers.
mrt 29 23:11:56: INFO ---------------------------------------------------------
```
```
Mar 29 23:19:12: INFO =========================================================
Mar 29 23:19:12: INFO |P|P|S|S|
Mar 29 23:19:12: INFO Distributed Parallel Processing Shell Script version 2.18
Mar 29 23:19:12: INFO =========================================================
Mar 29 23:19:12: INFO Hostname: MINI.local
Mar 29 23:19:12: INFO ---------------------------------------------------------
Mar 29 23:19:13: INFO Found 2 logic processors.
Mar 29 23:19:14: INFO CPU: Intel Core Duo 1.66 GHz
Mar 29 23:19:14: INFO Starting 2 workers.
Mar 29 23:19:14: INFO ---------------------------------------------------------
```
### 2.17 ###
* Implemented nifty status screen for distributed mode.
```
mrt 29 22:18:27: INFO =========================================================
mrt 29 22:18:27: INFO |P|P|S|S|
mrt 29 22:18:27: INFO Distributed Parallel Processing Shell Script version 2.17
mrt 29 22:18:27: INFO =========================================================
mrt 29 22:18:27: INFO Hostname: MacBoek.local
mrt 29 22:18:27: INFO ---------------------------------------------------------
mrt 29 22:18:28: INFO Status: 100 percent complete.
mrt 29 22:18:28: INFO Nodes: 7
mrt 29 22:18:28: INFO ---------------------------------------------------------
mrt 29 22:18:28: INFO IP-address Hostname Processed Status
mrt 29 22:18:28: INFO ---------------------------------------------------------
mrt 29 22:18:28: INFO 192.168.0.4 Core7i 155 FINISHED
mrt 29 22:18:29: INFO 192.168.0.2 MINI.local 34 FINISHED
mrt 29 22:18:29: INFO 192.168.0.5 server 29 FINISHED
mrt 29 22:18:30: INFO 192.168.0.63 host3 6 FINISHED
mrt 29 22:18:31: INFO 192.168.0.64 host4 6 FINISHED
mrt 29 22:18:31: INFO 192.168.0.20 imac-2.local 34 FINISHED
mrt 29 22:18:32: INFO 192.168.0.1 router 7 FINISHED
mrt 29 22:18:32: INFO ---------------------------------------------------------
mrt 29 22:18:32: INFO Total processed: 271
```
### 2.16 ###
* Cleaned up output to screen.
* Deployment of ppss to nodes uses a single SSH connection for file transfer.
* Deployment of ppss to nodes is done in parallel.
### 2.15 ###
When using PPSS in distributed mode, it is now possible to obtain the status of individual nodes.
```
bash-3.2$ ./ppss.sh status -C config.cfg
mrt 29 01:22:04: INFO - ---------------------------------------------------------
mrt 29 01:22:04: INFO - Distributed Parallel Processing Shell Script version 2.15
mrt 29 01:22:04: INFO - Hostname: MacBoek.local
mrt 29 01:22:04: INFO - 77 percent complete.
mrt 29 01:22:04: INFO - 10.0.0.14: PAUZED (Core7i)
mrt 29 01:22:04: INFO - 10.0.0.12: RUNNING (MINI.local)
mrt 29 01:22:05: INFO - 10.0.0.4: PAUZED (server)
mrt 29 01:22:05: INFO - 10.0.0.30: PAUZED (host3)
mrt 29 01:22:05: INFO - 10.0.0.31: RUNNING (host4)
mrt 29 01:22:06: INFO - 10.0.0.50: PAUZED (imac-2.local)
mrt 29 01:22:06: INFO - 10.0.0.1: PAUZED (router)
```
* Also, the ; character is now supported in filenames.
### 2.14 ###
Major rework on path and filename handling. Filenames are now properly sanitized for special characters such as ',& and ( ).
Also, PPSS recreates the directory structure of the source location of files, based on the -f option.
### 2.10 ###
Fixed important bugs when using an input file instead of an input directory.
### 2.09 ###
Fixed important bug: when deploying, a key must be used to log on to the nodes using scp.
### 2.08 ###
Fixed some bugs...
### 2.07 ###
* User can now specify the known\_hosts file with option -K. The fact that a known\_hosts file must exist when distributing PPSS to nodes was not documented.
### 2.06 ###
* Forgot to update the version number within the script.
### 2.05 ###
* Deployment of PPSS to nodes is now performed in parallel, by executing the deploy function with &.
* It is now possible to specify the output directory and/or output filename within the -c option.
### 2.04 and older ###
I didn't realise that a changelog might be relevant until 2.05.

80
Design.md Normal file

@ -0,0 +1,80 @@
# Introduction #
This wiki page describes how PPSS is designed, how it works and which techniques are used.
**Please note that the design has changed with version 2.80 and differs from older versions.**
# Design #
There are two main ingredients that must be supplied to PPSS:
1. A list of items that must be processed:
* either a text file containing one item per line. These items can represent whatever you want;
* or a directory containing files that must be processed.
1. A command that must be executed for each item.
For every item the specified command will be executed with the item supplied as an argument.
* At any given moment, no more commands run in parallel than the number specified on the command line or, by default, the detected number of CPU cores.
* Two parallel running processes should never interfere or collide with each other by processing the same item.
* PPSS should not poll but wait for events to occur and 'do nothing' if there is nothing to do. It must be event-driven.
## Communication between parent and child processes ##
One of the main difficulties for shell scripts is interprocess communication. There is no communication mechanism for child and parent processes to communicate with each other. A solution might be the use of signals with the 'trap' command to catch events, however tests have proven that this is not reliable. The trap mechanism that bash employs is inherently unreliable (by design). During the time period the trap command is processing a trap, additional traps are ignored. Therefore, it is not possible to create a reliable mechanism using signals. There is actually a parallel processing shell script available on the web that is based on signals, and suffers exactly from this problem, which makes it unreliable.
However, repeated tests have determined that communication between processes using a FIFO named pipe is reliable and can be used for interprocess communication. PPSS uses a FIFO to allow a child process to communicate with the parent process.
Within PPSS, a child process only tells the master process one thing: 'I finished processing'. In response, a new process is started to process the next item, if any items remain.
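A minimal sketch of this kind of child-to-parent message, assuming nothing more than `mkfifo` and a temporary directory. The function name `worker`, the message text and the paths are invented for the example and are not taken from the PPSS source.

```bash
#!/usr/bin/env bash
# Sketch: a child process reports back to its parent through a named pipe.
WORK_DIR=$(mktemp -d)
FIFO="$WORK_DIR/fifo"
mkfifo "$FIFO"

worker() {
    # Pretend to process the item, then tell the parent we are done.
    echo "finished $1" > "$FIFO"
}

worker "item-1" &            # the child runs in the background
read -r MESSAGE < "$FIFO"    # the parent blocks here until the child writes
wait                         # reap the child
rm -rf "$WORK_DIR"
echo "parent received: $MESSAGE"
```

The parent spends its waiting time blocked inside `read`, so no polling loop is involved.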
## Queue management ##
There is a single listener process that is just waiting for events to occur, by listening to a FIFO. The most important event is that a worker process should be started. This listener process will request a new item and will start a worker process to process this item.
Since the listener is the central process that requests items, no locking mechanism is required. Versions of PPSS before 2.80 had a cumbersome locking mechanism to prevent race conditions, however as of 2.80 this is no longer necessary.
Locking is only used to lock individual items. This allows multiple instances of PPSS to process the same local pool of items. For example, you started PPSS with two workers, but it seems that there is room for more workers. Just execute PPSS again with the same parameters and you will have two instances of PPSS processing the same bunch of items.
## Technical design ##
![http://home.quicknet.nl/mw/prive/nan1/got/ppss-schema.png](http://home.quicknet.nl/mw/prive/nan1/got/ppss-schema.png)
### Function: get\_all\_items ###
The first step of PPSS is to read all items that must be processed into a special text file. Items are read from this file using 'sed' and fed to the get\_item function.
### get\_item function ###
If called, an item will be read from the special input file and a global counter is increased, so the next time the function is executed, the next item on the list is returned. Sed is used to read a particular line number from the internal text file containing item names. The line number is based on a global counter that is increased each time an item is returned.
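The counter-plus-sed approach can be sketched as follows. This is an illustration under assumptions, not the PPSS code: the variable names, the sample items and the choice to return the item through a global variable are all made up here.

```bash
#!/usr/bin/env bash
# Sketch: fetch item N from a text file with sed, driven by a global counter.
ITEMS_FILE=$(mktemp)
printf '%s\n' "a.wav" "b.wav" "c.wav" > "$ITEMS_FILE"
ITEM_COUNT=$(( $(wc -l < "$ITEMS_FILE") ))
COUNTER=0
ITEM=""

# Stores the next item in the global ITEM; returns non-zero when none are left.
get_item() {
    if [ "$COUNTER" -ge "$ITEM_COUNT" ]; then
        return 1
    fi
    COUNTER=$((COUNTER + 1))
    # Print only line number COUNTER; the file is never loaded into memory.
    ITEM=$(sed -n "${COUNTER}p" "$ITEMS_FILE")
}

PROCESSED=""
while get_item; do
    PROCESSED="$PROCESSED$ITEM "
done
rm -f "$ITEMS_FILE"
echo "items handed out: $PROCESSED"
```

The function is called directly rather than inside `$( ... )`, because a command substitution runs in a subshell and the increased counter would be lost.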
### Function: listen\_for\_job ###
The listen\_for\_job function is a process running in the background that listens on a FIFO special file (named pipe).
For every message that is received, the listener will execute the 'get\_item' function to get an item. The commando function is then executed with this item as an argument. The commando function is run as a background process.
If the list of items has been processed, the get\_item function will return with a non-zero return code, and the listen\_for\_job function will not start a new commando process. Thus over time, when commando jobs finish, all jobs die out. Once listen\_for\_job registers that all running jobs have died, it kills off PPSS itself.
The listen\_for\_job function keeps a counter for every worker thread that dies. Once this number hits the maximum number of parallel workers (like 4 if you have a quad-core CPU), it will terminate itself and eventually PPSS itself.
The whole listen\_for\_job function is executed as a background process. This function is the only permanent (while) loop running and is often blocked when no input is received, so it is doing nothing most of the time. This means that if PPSS has nothing to do, your system won't be wasting CPU cycles on some looping or polling.
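The loop described above can be tried out in miniature. The sketch below is not the PPSS implementation: the item list, the `START_JOB` message and the results file are invented, the listener runs in the foreground so the script can wait for it, and the FIFO is opened read-write on file descriptor 3 (this works on Linux, although POSIX leaves it unspecified) so that `read` blocks instead of seeing end-of-file.

```bash
#!/usr/bin/env bash
# Miniature event loop: one listener, one FIFO, self-renewing workers.
WORK_DIR=$(mktemp -d)
FIFO="$WORK_DIR/fifo"
RESULTS="$WORK_DIR/results"
mkfifo "$FIFO"
: > "$RESULTS"

ITEMS=("one" "two" "three" "four" "five")
MAX_WORKERS=2
NEXT=0                       # index of the next unprocessed item

exec 3<> "$FIFO"             # keep the FIFO open for the whole run

commando() {
    echo "processed $1" >> "$RESULTS"   # stand-in for the user's command
    echo "START_JOB" > "$FIFO"          # ask the listener for more work
}

listen_for_job() {
    local finished=0 message
    while [ "$finished" -lt "$MAX_WORKERS" ]; do
        read -r message <&3             # sleeps until a message arrives
        if [ "$NEXT" -lt "${#ITEMS[@]}" ]; then
            commando "${ITEMS[$NEXT]}" &
            NEXT=$((NEXT + 1))
        else
            finished=$((finished + 1))  # one worker chain has died out
        fi
    done
}

# Send one start message per worker, then let the listener take over.
for ((i = 0; i < MAX_WORKERS; i++)); do
    echo "START_JOB" > "$FIFO"
done
listen_for_job
wait
exec 3>&-
PROCESSED_COUNT=$(( $(wc -l < "$RESULTS") ))
rm -rf "$WORK_DIR"
echo "processed $PROCESSED_COUNT items"
```

Each start message either launches one job or, once the items run out, counts as one finished worker, so the loop ends after exactly `MAX_WORKERS` chains have died.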
### Function: start\_all\_workers ###
For every available cpu core, a thread will be started. If a user manually specifies a number of threads, that number will override the detected number of CPU cores.
So the start\_single\_worker function is called for each thread. This function just sends a message to the FIFO. There, it will be picked up by the listener process, which will request an item and execute the commando function to process the item.
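How the worker count might be chosen can be sketched like this. The helper name `worker_count` is hypothetical and `getconf _NPROCESSORS_ONLN` is only one way to count cores; the PPSS script may detect them differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide how many workers to start.
worker_count() {
    local requested="$1" cores
    if [ -n "$requested" ]; then
        echo "$requested"          # an explicit user choice overrides detection
        return
    fi
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null)
    echo "${cores:-1}"             # fall back to a single worker
}

WORKERS=$(worker_count "")          # no user preference: use detection
echo "starting $WORKERS workers"
```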
### Command function ###
The command function performs the following tasks:
* check if a supplied item has been processed already, if so, skip it. If a job log exists, the item is skipped.
* execute the user-supplied command with the item as an argument
* execute the 'start\_single\_worker' function to start a new job for a new item.
The third task is the most relevant. After the user-supplied command finishes, the command function calls start\_single\_worker: a snake biting its own tail. Essentially, a running thread keeps itself running by starting a new thread after it finishes, until there are no items left to process.
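The three tasks can be sketched as below. Everything here is illustrative: the `COMMAND` array stands in for the user-supplied command, `start_single_worker` is an empty placeholder, the log is named after the item's base name, and requesting a new job for a skipped item is an assumption about how the chain is kept alive.

```bash
#!/usr/bin/env bash
# Sketch of a command function that skips items with an existing job log.
JOB_LOG_DIR=$(mktemp -d)
COMMAND=(echo "handling")    # stand-in for the user-supplied command
EXECUTED=0

start_single_worker() {
    :   # placeholder; in PPSS this would send a message to the FIFO
}

commando() {
    local item="$1" log
    log="$JOB_LOG_DIR/$(basename "$item")"
    if [ -e "$log" ]; then
        start_single_worker  # assumption: a skipped item still requests a new job
        return 0
    fi
    "${COMMAND[@]}" "$item" > "$log" 2>&1   # run the command, keep its output
    EXECUTED=$((EXECUTED + 1))
    start_single_worker                      # the snake bites its own tail
}

commando "/tmp/example.wav"
commando "/tmp/example.wav"   # skipped: a job log already exists
rm -rf "$JOB_LOG_DIR"
echo "commands executed: $EXECUTED"
```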
### start\_new\_worker function ###
The start\_new\_worker function will send a message to the fifo to inform the listener process that a commando should be executed.

99
DistributedPPSS.md Normal file

@ -0,0 +1,99 @@
# Introduction #
PPSS allows you to distribute jobs to multiple hosts, thus allowing for distributed processing. So a large number of hosts can be used to process items, not just a single host (node). These nodes will share a single list of items that they will process in parallel.
To keep track of which items have been processed, nodes must be able to communicate with each other. Therefore, a server is necessary. The primary role of the server is just a communication channel for nodes. Nodes use the server to signal to other nodes that an item is being processed or has been processed. So nodes will never process the same file.
![http://home.quicknet.nl/mw/prive/nan1/img/ppss.png](http://home.quicknet.nl/mw/prive/nan1/img/ppss.png)
The secondary role of the server is to act as a file server. Assuming that the items are files, files stored on the PPSS server are transferred to a node, which processes each file and stores the result back on the server.
PPSS is very flexible: the file server can be a different host than the PPSS server that is used for inter-node communication. Beware: currently, this is only possible based on NFS/SMB shares, not for usage of SSH/SCP.
# Design considerations #
## Node installation ##
Installing PPSS on a larger number of hosts will become an appallingly boring, repetitive and time-consuming task if this is performed manually. Therefore, PPSS has a mode called 'deploy'. In this mode, PPSS connects to each node using SSH and deploys PPSS on this node. If you want to remove PPSS, use the mode 'erase'.
```
bash-3.2$ ./ppss.sh deploy -C testconfig.cfg
dec 17 16:36:17: =========================================================
dec 17 16:36:17: |P|P|S|S|
dec 17 16:36:17: Distributed Parallel Processing Shell Script version 2.50
dec 17 16:36:17: =========================================================
dec 17 16:36:17: Hostname: MacBoek.local
dec 17 16:36:17: ---------------------------------------------------------
dec 17 16:36:17: Deploying PPSS on nodes.
dec 17 16:36:19: PPSS installed on node 10.0.0.4.
dec 17 16:36:19: PPSS installed on node 10.0.0.14.
dec 17 16:36:19: PPSS installed on node 10.0.0.1.
```
## Node control ##
If a larger number of nodes are used, say more than five to ten, it will be a hassle to control these nodes individually by hand. The question is how to control all nodes without having to access nodes manually. Starting new jobs, pausing and stopping jobs should be controlled from a central location.
The modes 'start', 'pause', and 'stop', implement this functionality. They signal to nodes that PPSS must start, pause or stop.
## Node status ##
If a larger number of nodes are used, it would be nice if some simple overview could be generated about the current status of nodes and the overall progress of the entire process.
The current status screen polls the status of each host (running, paused, stopped, finished) and informs you about how many items have been processed by each host.
```
bash-3.2$ ./ppss.sh status -C testconfig.cfg
dec 17 16:39:15: =========================================================
dec 17 16:39:15: |P|P|S|S|
dec 17 16:39:15: Distributed Parallel Processing Shell Script version 2.50
dec 17 16:39:15: =========================================================
dec 17 16:39:15: Hostname: MacBoek.local
dec 17 16:39:15: ---------------------------------------------------------
dec 17 16:39:15: Status: 56 percent complete.
dec 17 16:39:15: Nodes: 3
dec 17 16:39:15: Items: 100
dec 17 16:39:15: ---------------------------------------------------------
dec 17 16:39:15: IP-address Hostname Processed Status
dec 17 16:39:15: ---------------------------------------------------------
dec 17 16:39:16: 10.0.0.4 server 8 RUNNING
dec 17 16:39:16: 10.0.0.14 Core7i 32 RUNNING
dec 17 16:39:17: 10.0.0.1 Mini 8 RUNNING
dec 17 16:39:17: ---------------------------------------------------------
dec 17 16:39:17: Total processed: 48
```
## Item (file) distribution ##
If items are files that need to be processed, they can be accessed in two ways:
* using a network file system such as NFS or SMB or other. The -d option must point to the mountpoint of this share.
* using scp within scripts to (securely) copy items (files) to the local host and copy the processed items back to the server. Please note that copying files using scp is more resource intensive (CPU) than SMB or NFS.
When using PPSS in a distributed fashion, it should be decided whether files can be processed in-place on the file server through the share, or whether they must be copied to the node first before being processed. The latter is the most robust solution.
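The copy-first approach can be sketched with plain `cp`, as one would use on a mounted share. The directories are temporary stand-ins for the share and the output directory, `tr` stands in for the real job, and the function name `process_via_local_copy` is invented; with scp the two copy steps would go over SSH instead.

```bash
#!/usr/bin/env bash
# Sketch: copy an item to local scratch space, process it, copy the result back.
SHARE=$(mktemp -d)       # stands in for the mounted NFS/SMB share
OUTPUT_DIR=$(mktemp -d)  # stands in for the output directory on the server
printf 'hello\n' > "$SHARE/input.txt"

process_via_local_copy() {
    local src="$1" scratch name status
    name=$(basename "$src")
    scratch=$(mktemp -d)
    cp "$src" "$scratch/$name" || return 1                     # download
    tr 'a-z' 'A-Z' < "$scratch/$name" > "$scratch/$name.out"   # stand-in job
    cp "$scratch/$name.out" "$OUTPUT_DIR/"                     # upload result
    status=$?
    rm -rf "$scratch"                                          # clean up locally
    return "$status"
}

process_via_local_copy "$SHARE/input.txt"
RESULT=$(cat "$OUTPUT_DIR/input.txt.out")
rm -rf "$SHARE" "$OUTPUT_DIR"
echo "result: $RESULT"
```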
# Technical background #
## Locking of items through SSH ##
According to many sources on the Internet, the only reliable solution to **atomic** locking is to use the 'mkdir' command to create a directory that acts as the lock. The fun thing is that this is also true if 'mkdir' is executed through SSH.
So a node tries to lock a file by issuing a mkdir on the server through SSH. If this mkdir fails, the directory and thus the lock already exists and the next item in the list is tried.
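The locking idea can be tried locally with a temporary directory standing in for the server. The function name `lock_item` and the item names are made up; the remote form shown in the comment uses hypothetical variable names and is not copied from PPSS.

```bash
#!/usr/bin/env bash
# Sketch: claim items by creating one lock directory per item.
LOCK_DIR=$(mktemp -d)    # stands in for the lock directory on the SSH server

# Returns 0 if this caller obtained the lock, non-zero if it already exists.
lock_item() {
    # Remote form (hypothetical names):
    #   ssh "$SSH_USER@$SSH_SERVER" "mkdir '$REMOTE_LOCK_DIR/$1'"
    mkdir "$LOCK_DIR/$1" 2>/dev/null
}

CLAIMED=""
for item in item-1 item-2 item-1; do
    if lock_item "$item"; then
        CLAIMED="$CLAIMED$item "    # we own this item now
    fi                               # otherwise: try the next item
done
rm -rf "$LOCK_DIR"
echo "claimed: $CLAIMED"
```

`mkdir` either creates the directory or fails because it exists, so two nodes racing for the same item cannot both succeed.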
## Requirements ##
* A central server for inter-node communication (item locking).
* Accessible through SSH.
* A central server for file distribution (optional).
* Sufficient bandwidth (gigabit? totally depends on your needs.).
* SCP / NFS / SMB share for distributing files.
* One or more nodes.
* Accessible through SSH.
* Must support Bash shell.
Although it is not necessary to run PPSS on the master SSH server, it must be installed on the master SSH server. This is done automatically by PPSS when the deploy command is issued.

22
Example_script.md Normal file

@ -0,0 +1,22 @@
In this example WAV files are converted to MP3 using Lame. The script takes two arguments that are supplied by the PPSS -c option.
PPSS is run like this:
```
ppss -d /source/directory/with/wav/files -c './wav2mp3.sh "$ITEM" "$OUTPUT_DIR"' -o /dest/dir/where/mp3/files/must/be/put
```
This is the code.
```
#!/usr/bin/env bash
SRC="$1"
DEST="$2"
BASENAME=$(basename "$SRC")
MP3FILE="${BASENAME%wav}mp3"
lame --quiet --preset insane "$SRC" "$DEST/$MP3FILE"
exit "$?"
```

1
MOVEDTOGITHUB.md Normal file

@ -0,0 +1 @@
https://github.com/louwrentius/PPSS

189
Manual.md Normal file

@ -0,0 +1,189 @@
# Introduction #
This page discusses the usage of PPSS on a single host. Examples show how PPSS is used.
# Overview of modes and options #
The following output is displayed by PPSS when executed without any options:
```
'Distributed Parallel Processing Shell Script
Version: 2.0
PPSS is a Bash shell script that executes commands in parallel on a set
of items, such as files, or lines in a file.
Usage: ./ppss.sh MODE [ options ]
or
Usage: ./ppss.sh MODE -c <config file>
Modes are:
standalone For execution of PPSS on a single host.
node For execution of PPSS on a node, that is part of a 'cluster'.
config Generate a config file based on the supplied option parameters.
deploy Deploy PPSS and related files on the specified nodes.
erase Erase PPSS and related files from the specified nodes.
start Starting PPSS on nodes.
pause Pausing PPSS on all nodes.
stop Stopping PPSS on all nodes.
Options are:
--command | -c Command to execute. Syntax: '<command> ' including the single quotes.
Example: -c 'ls -alh '. It is also possible to specify where an item
must be inserted: 'cp "$ITEM" /somedir'.
--sourcedir | -d Directory that contains files that must be processed. Individual files
are fed as an argument to the command that has been specified with -c.
--sourcefile | -f Each single line of the supplied file will be fed as an item to the
command that has been specified with -c.
--config | -c If the mode is config, a config file with the specified name will be
generated based on all the options specified. In the other modes.
this option will result in PPSS reading the config file and start
processing items based on the settings of this file.
--disable-ht | -j Disable hyperthreading. Is enabled by default.
--log | -l Sets the name of the log file. The default is ppss-log.txt.
--processes | -p Start the specified number of processes. Ignore the number of available
CPUs.
The following options are used for distributed execution of PPSS.
--server | -s Specifies the SSH server that is used for communication between nodes.
Using SSH, file locks are created, informing other nodes that an item
is locked. Also, often items, such as files, reside on this host. SCP
is used to transfer files from this host to nodes for local procesing.
--node | -n File containig a list of nodes that act as PPSS clients. One IP / DNS
name per line.
--key | -k The SSH key that a node uses to connect to the server.
--user | -u The SSH user name that is used when logging in into the master SSH
server.
--script | -s Specifies the script/program that must be copied to the nodes for
execution through PPSS. Only used in the deploy mode.
This option should be specified if necessary when generating a config.
--transfer | -t This option specifies that an item will be downloaded by the node
from the server or share to the local node for processing.
--no-scp | -b Do not use scp for downloading items. Use cp instead. Assumes that a
network file system (NFS/SMB) is mounted under a local mountpoint.
--outputdir | -o Directory on server where processed files are put. If the result of
encoding a wav file is an mp3 file, the mp3 file is put in the
directory specified with this option.
Example: encoding some wav files to mp3 using lame:
./ppss.sh standalone -c 'lame ' -d /path/to/wavfiles -j
Running PPSS based on a configuration file.
./ppss.sh node -C config.cfg
Running PPSS on a client as part of a cluster.
./ppss.sh node -d /somedir -c 'cp /some/destination' -s 10.0.0.50 -u ppss -t -k ppss-key.key'
```
A detailed explanation based on examples will follow.
# How to use PPSS #
PPSS allows a user to execute commands, scripts or programs in parallel. That's it. Its sole purpose is to turn a batch job into a parallel batch job. This is relevant because modern processors are almost always multi-core and designed to process jobs in parallel, so why not use that capacity?
Items can be two things:
* files within a user-specified directory
* arbitrary lines of text within a file
When PPSS has finished, it has produced a log file of its operation. By default, this file is called ppss-log.txt.
Also, a directory is created, by default JOB\_LOG. Within this directory a log file exists for each item that has been processed. If a log file is present for an item and PPSS is re-run, that item is skipped.
## Basic command line options ##
Before discussing the full list of command line options, an example is given of how to run PPSS in its simplest form, with the fewest options. In this example, some files are compressed with gzip.
`$ ./ppss.sh standalone -d /path/to/files -c 'gzip '`
In this example, we can distinguish a 'mode' and two options. The mode speaks for itself: PPSS is not part of a cluster, it is just running on the host.
The -d option specifies the directory where the files reside that must be processed.
The -c option specifies the command that will be executed by PPSS in parallel for each file within the directory specified by -d. In this example the command has a **trailing space**, which is necessary since the command will expand to 'gzip example.tar' when executed. If the space is omitted, an error will occur.
Sometimes, the item should not be appended to the command, but inserted somewhere in the middle. This is possible by using the placeholder "$ITEM". See the following example:
`$ ./ppss.sh standalone -d /path/to/files -c 'cp "$ITEM" /destination/dir '`
Another example is the use of an input file instead of a directory. Such a file is specified with the -f option.
`$ ./ppss.sh standalone -f list-of-urls.txt -c 'wget -q '`
In this example, a list of URLs is provided by the file list-of-urls.txt. These URLs are fed to wget, which retrieves them. To control the number of parallel downloads, add the -p option (for example, -p 5 starts 5 parallel downloads). Of course, this command can also be written like this:
`$ ./ppss.sh standalone -f list-of-urls.txt -c 'wget -q "$ITEM"'`
## Advanced command line options ##
In this section, some additional options are discussed.
**-p <configure manually number of parallel processes>**
This option allows you to specify how many parallel processes should be started. Thus, automatic detection of CPUs and cores is overridden. This is useful, for example, when downloading a bunch of files in parallel, or for other tasks that are not bound by the number of available CPUs.
**-j <disable hyper threading>**
If a CPU is found that supports hyper threading, the additional logical cores are used. For example, an Intel Core i7 quad-core processor supports HT, and thus effectively has 8 cores. When HT is enabled, not 4 but 8 parallel jobs are started.
Please note that this mechanism depends on what /proc/cpuinfo (Linux) reports. For example, an old dual CPU P3 doesn't report the 'physical id' section, thus if HT is disabled (why would you do that anyway) only one processor is used. So test this option if you need it.
**-l <PPSS log file>**
This option allows you to specify a custom name for the log file that is used by PPSS itself.
## Logging (must read) ##
There are two separate log mechanisms:
* the log file of PPSS itself
* the log file of each individual item that is processed
_PPSS log file_
The logfile of PPSS is by default ppss-log.txt. A different name can be chosen with the -l option. It contains all relevant information about what PPSS is doing.
_Item log file_
When an item is processed, any output that is generated is logged within its individual log file. This log file resides within the directory job\_log. This directory is created in the directory from which PPSS is executed.
An example of the output of a single log file for a single item is shown below:
```
===== PPSS Item Log File =====
Host: imac-2.local
Item: PPSS_LOCAL_TMPDIR/20080602.wav
Start date: Mar 03 00:10:32
Encode of PPSS_LOCAL_TMPDIR/20080602.wav successful.
Status: Succes - item has been processed.
Elapsed time (h:m:s): 0:4:48
```
If you tailor your command the right way, or create a (small) script, it is very easy to determine which items have not been processed correctly. A simple grep on 'error' might already give a clue.
**Important:** If a log file exists for an item, within the job\_log directory, and PPSS is run again, that item will be skipped. This allows you to interrupt PPSS and continue where you left off. If you want to process all items again, just remove the job\_log directory.
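The skip-on-rerun behaviour can be pictured with a small sketch. This is not PPSS's own code: the function name `process_once` and the `<item name>.log` naming are assumptions made purely for illustration.

```bash
#!/bin/bash
# Sketch only: skip an item when a log file for it already exists.
# The log naming (<basename of item>.log) is an assumption, not PPSS's scheme.
process_once() {
    local log_dir="$1" item="$2"
    local log_file
    log_file="$log_dir/$(basename "$item").log"
    if [ -e "$log_file" ]; then
        echo "skipped"
        return 0
    fi
    # A real run would execute the user's command here and capture its output.
    echo "Item: $item" > "$log_file"
    echo "processed"
}

demo_log_dir=$(mktemp -d)
first_run=$(process_once "$demo_log_dir" /path/to/files/a.wav)
second_run=$(process_once "$demo_log_dir" /path/to/files/a.wav)
echo "first run: $first_run, second run: $second_run"
rm -rf "$demo_log_dir"
```

Removing the log directory between the two calls would make the second call process the item again, which mirrors the advice above.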

317
Manual1.md Normal file

@ -0,0 +1,317 @@
# Introduction #
This page discusses the usage of PPSS on a single host. Examples show how PPSS is used.
# Overview of modes and options #
The following output is displayed by PPSS when executed without any options:
```
bash-3.2$ ppss
|P|P|S|S| Distributed Parallel Processing Shell Script 2.97
PPSS is a Bash shell script that executes commands in parallel on a set
of items, such as files in a directory, or lines in a file. The purpose
of PPSS is to make it simple to benefit from multiple CPUs or CPU cores.
This short summary only discusses options for stand-alone mode. For a
full listing of all options, run PPSS with the options --help
Usage ./ppss [[ options ]]
--command | -c Command to execute. Syntax: '<command> ' including the single quotes.
Example: -c 'ls -alh '. It is also possible to specify where an item
must be inserted: 'cp "$ITEM" /somedir'.
--sourcedir | -d Directory that contains files that must be processed. Individual files
are fed as an argument to the command that has been specified with -c.
--sourcefile | -f Each single line of the supplied file will be fed as an item to the
command that has been specified with -c. Read input from stdin with
-f -
--config | -C If the mode is config, a config file with the specified name will be
generated based on all the options specified. In the other modes.
this option will result in PPSS reading the config file and start
processing items based on the settings of this file.
--disable-ht | -j Disable hyper threading. Is enabled by default.
--log | -l Sets the name of the log file. The default is ppss-log.txt.
--processes | -p Start the specified number of processes. Ignore the number of available
CPUs.
--quiet | -q Shows no output except for a progress indication using percents.
--delay | -D Adds an initial random delay to the start of all parallel jobs to spread
the load. The delay (seconds) is only used at the start of all 'threads'.
--daemon Daemon mode. Do not exit after items are professed, but keep looking
for new items and process them. Read the manual how to use this!
See --help for important additional options regarding daemon mode.
--disable-inotify Linux users can use real-time inotify filesystem events when using
daemon mode. Requires inotify-tools. Enabled by default if available.
Automatically disabled if NFS is used as the daeon source dir.
--no-traversal|-r By default, PPSS uses the regular 'find' command to list all files
within the directory specified by the -d option. If you do not wish
for PPSS to process files in sub directories, use this option.
Only files within the specified directory will be processed. Any
subdirectories will then be ignored.
--email | -e PPSS sends an e-mail if PPSS has finished. It is also used if processing
of an item has failed (configurable, see -h).
--debug Enable debugging output to the |P|P|S|S| log file.
--help Extended help, including options for distributed mode.
Example: encoding some wav files to mp3 using lame:
./ppss -d /path/to/wavfiles -c 'lame '
Extended usage: use --help
```
A detailed explanation based on examples will follow.
# How to use PPSS #
PPSS allows a user to execute commands, scripts or programs in parallel. That's it. Its sole purpose is to turn a batch job into a parallel batch job. This is relevant because modern processors are almost always multi-core and designed to process jobs in parallel, so why not use that capacity?
Items can be two things:
* files within a user-specified directory
* arbitrary lines of text within a file
When PPSS has finished, it has produced a log file of its operation. By default, this file is called ppss-log.txt.
Also, a directory is created, by default JOB\_LOG. Within this directory a log file exists for each item that has been processed. If a log file is present for an item and PPSS is re-run, that item is skipped.
## Basic command line options ##
Before discussing the full list of command line options, an example is given of how to run PPSS in its simplest form, with the fewest options. In this example, some files are compressed with gzip.
`$ ./ppss -d /path/to/files -c 'gzip '`
In this example, we can distinguish two options.
The -d option specifies the directory where the files reside that must be processed.
The -c option specifies the command that will be executed by PPSS in parallel for each file within the directory specified by -d. In this example the command has a **trailing space**, which is necessary since the command will expand to 'gzip file01.tar' when executed. If the space is omitted, an error will occur.
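The effect of the trailing space can be seen with plain string concatenation. The sketch below only mimics the way the item is appended to the command text; the variable names are made up for this illustration and say nothing about PPSS internals.

```bash
#!/bin/bash
# Illustration only: append an item to a command string, with and without
# the trailing space.
ITEM="file01.tar"
with_space='gzip '
without_space='gzip'

expanded_ok="${with_space}${ITEM}"
expanded_bad="${without_space}${ITEM}"

echo "$expanded_ok"    # gzip file01.tar
echo "$expanded_bad"   # gzipfile01.tar
```

Without the space, the shell would look for a program called gzipfile01.tar, which does not exist.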
Sometimes, the item should not be appended to the command, but inserted somewhere in the middle. This is possible by using the placeholder "$ITEM". See the following example:
`$ ./ppss -d /path/to/files -c 'cp "$ITEM" /destination/dir'`
Another example is the use of an input file instead of a directory. Such a file is specified with the -f option.
For this example, create a file called numbers.txt and fill it with this:
```
1
2
3
4
5
```
Next, try this example.
`$ ./ppss -f numbers.txt -c 'touch '`
The result should be that five new files are 'touched' which have the name of the numbers you entered in the numbers.txt file.
This is the recommended way to use PPSS: put the items in the file and specify a single command with the -c option. As an example, I often see people fill the numbers.txt file with:
```
touch 1
touch 2
etc.
```
...and then process the items like:
`./ppss -f numbers.txt -c 'bash $ITEM'`
This is of course perfectly fine, but not necessary.
A more practical example is downloading a list of URLs:
`$ ./ppss -f list-of-urls.txt -c 'wget -q ' -p 5`
In this example, a list of URLs is provided by the file list-of-urls.txt. These URLs are fed to wget, which retrieves them. The -p option specifies that 5 parallel downloads should be started. Of course, this command can also be written like this:
`$ ./ppss -f list-of-urls.txt -c 'wget -q "$ITEM"' -p 5`
**Tip**: please note that the double quotes around "$ITEM" may or may not be necessary depending on the situation. When using an input file with the -f option, they are often not necessary.
PPSS also supports input from STDIN:
`$ cat list-of-urls.txt | ppss -f - -c 'wget -q '`
**Advanced usage of the -c command option**
Some commands require that you specify an output file. An example of such a command or program is the Lame mp3 encoder. Since the output file must be unique for each item, the output file name must be based on a variable. Like this:
`-c 'lame -a "$ITEM" "/some/path/$ITEM.mp3" --preset standard --quiet'`
The filename of the item is reused to create the output file name.
## Advanced command line options ##
In this section, some additional options are discussed.
**-p <configure manually number of parallel processes>**
This option allows you to specify how many parallel processes should be started. Thus, automatic detection of CPUs and cores is overridden. This is useful, for example, when downloading a bunch of files in parallel, or for other tasks that are not bound by the number of available CPUs.
**-j <disable hyper threading>**
If a CPU is found that supports hyper threading, the additional cores are used. For example, an Intel Core i7 quad-core processor supports HT, thus has effectively 8 cores. When HT is enabled, not 4 but 8 parallel jobs are started.
Please note that this mechanism depends on what /proc/cpuinfo (Linux) reports. For example, an old dual CPU P3 doesn't report the 'physical id' section, thus if HT is disabled (why would you do that anyway) only one processor is used. So test this option if you need it.
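To see what such a detection has to work with, the sketch below counts logical processors and physical cores from a file in /proc/cpuinfo format. It is an illustration under assumptions, not PPSS's own detection code: the function names are invented here, and it expects the 'physical id' and 'core id' fields to be present.

```bash
#!/bin/bash
# Count logical processors: one "processor" entry per logical CPU.
count_logical_cpus() {
    grep -c '^processor' "$1"
}

# Count physical cores: unique (physical id, core id) pairs.
count_physical_cores() {
    awk -F': *' '
        /^physical id/ { phys = $2 }
        /^core id/     { cores[phys "-" $2] = 1 }
        END            { n = 0; for (c in cores) n++; print n }
    ' "$1"
}

# Sample data: one CPU package, two cores, hyper threading enabled.
sample=$(mktemp)
while read -r proc core; do
    printf 'processor\t: %s\nphysical id\t: 0\ncore id\t\t: %s\n\n' "$proc" "$core"
done > "$sample" <<'EOF'
0 0
1 1
2 0
3 1
EOF

logical=$(count_logical_cpus "$sample")
physical=$(count_physical_cores "$sample")
echo "logical CPUs: $logical, physical cores: $physical"
rm -f "$sample"
```

On a Linux host the same functions can be pointed at /proc/cpuinfo. If the 'physical id' or 'core id' fields are missing, as on the old P3 mentioned above, the physical count comes out as 0, which is exactly the kind of case worth testing.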
**-l <PPSS log file>**
This option allows you to specify a custom name for the log file that is used by PPSS itself.
**Setting the working directory**
Prior to executing PPSS, set the working directory as follows:
`export PPSS_DIR=/path/to/dir`
Next, if PPSS is executed, the aforementioned directory is used to store all (temporary) files.
## Creating and using a config file ##
A config file is created when PPSS is called with the 'config' mode. In this mode, PPSS does not execute any job. Instead, all command line options are used to create a config file. An example:
`./ppss config -C config.cfg -d /source/dir -c 'gzip ' -j`
This command creates a config file config.cfg that can be used instead of re-entering the command line options, like this:
`./ppss -C config.cfg`
## Advanced usage (by example) ##
**Unrar files in parallel**
Unrarring some files in parallel can be as easy as:
`./ppss -d ./dir-with-rars -c 'unrar x "$ITEM" ./output-dir'`
However, this dumps all extracted files into the same directory, output-dir, which may not be what you want. To extract the files contained within each RAR file into its own directory, we need to perform two steps:
1. Create a directory for each item in ./output-dir.
1. Unrar the files into the individual directories.
Step 1: making directories based on the name of the RAR file:
`./ppss -d ./dir-with-rars -c 'ITEM=$(basename "$ITEM"); mkdir ./output-dir/"$ITEM"'`
Explanation: by default, each item consists of the full or relative path to that item. An item will expand as "./dir-with-rars/filename.rar". However, the directory name must be based only on the filename. So the standard Unix utility 'basename' is used to extract the filename from the item and use it to create the directory name.
As you can see, it is no problem to use multiple commands within the -c option, by using ';'.
Step 2: extracting the files of each RAR file into its own directory.
`./ppss -d ./dir-with-rars -c 'ITEM_DIR=$(basename "$ITEM"); unrar x "$ITEM" ./output-dir/"$ITEM_DIR"'`
In this example, we use the basename command again to be able to specify the output directory based on the supplied ITEM name.
Of course, it is possible to put this all in one command:
`./ppss -d ./dir-with-rars -c 'ITEM_DIR=$(basename "$ITEM"); mkdir ./output-dir/"$ITEM_DIR"; unrar x "$ITEM" ./output-dir/"$ITEM_DIR"'`
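To check what basename produces before handing a command to PPSS, the values can be tried out in a plain shell. This is a small sketch; the item path is the sample path used in the explanation above, and `ITEM_NAME` is a variable introduced here for illustration.

```bash
#!/bin/bash
ITEM="./dir-with-rars/filename.rar"

# Strip the directory part: this is what the unrar examples rely on.
ITEM_DIR=$(basename "$ITEM")

# Optionally strip a known suffix as well.
ITEM_NAME=$(basename "$ITEM" .rar)

echo "$ITEM_DIR"                 # filename.rar
echo "./output-dir/$ITEM_DIR"    # ./output-dir/filename.rar
echo "$ITEM_NAME"                # filename
```

Passing the suffix as a second argument to basename is handy when the output directory should not carry the .rar extension.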
**Execute commands in a file**
Let's assume you have a file containing these lines:
```
/home/user/dosomething.sh 1
/home/user/dosomething.sh 2
/home/user/dosomething.sh 3
/home/user/dosomething.sh 4
/home/user/dosomething.sh 5
```
To execute this properly, the command as provided to the -c option is slightly altered:
`./ppss -f afile.txt -c 'bash $ITEM'`
Notice that in this case, you **must** supply the '$ITEM' variable **without** double quotes. If you omit the '$ITEM' variable or use '"$ITEM"' then the commands will fail like this:
```
===== PPSS Item Log File =====
Host: Core7i
Process: 7905
Item: /home/user/ppss/dosomething.sh 1
Start date: Dec 16 16:32:00
bash: /home/user/ppss/dosomething.sh 1: No such file or directory
Status: FAILURE
Elapsed time (h:m:s): 0:0:0
```
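The difference between the two forms comes from shell word splitting: unquoted, the item is split into a script name and an argument; quoted, the whole line is treated as one file name. The sketch below reproduces this outside PPSS with a throwaway script. It assumes the temporary directory path contains no whitespace.

```bash
#!/bin/bash
workdir=$(mktemp -d)

# A stand-in for dosomething.sh that reports its first argument.
cat > "$workdir/dosomething.sh" <<'EOF'
#!/bin/bash
echo "processing $1"
EOF

ITEM="$workdir/dosomething.sh 1"

# Unquoted: split into "<script>" and "1", so the script runs.
unquoted_output=$(bash $ITEM)

# Quoted: bash looks for a file literally named "<script> 1" and fails.
quoted_status=0
bash "$ITEM" 2>/dev/null || quoted_status=$?

echo "unquoted: $unquoted_output"
echo "quoted exit status: $quoted_status"
rm -rf "$workdir"
```

The non-zero exit status of the quoted form corresponds to the "No such file or directory" failure in the item log above.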
## Specifying a different home directory ##
By default, PPSS creates a directory in the current working directory which will contain all (temporary) files.
This directory can be changed by exporting the PPSS\_DIR variable with another directory like this:
`export PPSS_DIR=/some/other/dir`
Next, just run PPSS as usual.
## Daemon mode ##
This mode is discussed on its own manual page: http://code.google.com/p/ppss/wiki/Manual3
## Logging (must read) ##
There are two separate log mechanisms:
* the log file of PPSS itself
* the log file of each individual item that is processed
_PPSS log file_
The logfile of PPSS is by default ppss-log.txt. A different name can be chosen with the -l option. It contains all relevant information about what PPSS is doing.
_Item log file_
When an item is processed, any output that is generated is logged within its individual log file. This log file resides within the directory job\_log. This directory is created in the directory from which PPSS is executed.
An example of the output of a single log file for a single item is shown below:
```
===== PPSS Item Log File =====
Host: imac-2.local
Item: PPSS_LOCAL_TMPDIR/20080602.wav
Start date: Mar 03 00:10:32
Encode of PPSS_LOCAL_TMPDIR/20080602.wav successful.
Status: Succes - item has been processed.
Elapsed time (h:m:s): 0:4:48
```
If you tailor your command the right way, or create a (small) script, it is very easy to determine which items have not been processed correctly. A simple grep on 'error' might already give a clue.
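As a sketch of such a check, the function below lists the item log files that contain a failure status. The status string 'Status: FAILURE' is taken from the failed item log shown earlier on this page; the function name, the demo directory and the log file names are assumptions for illustration.

```bash
#!/bin/bash
# Print the names of all item log files that report a failure.
list_failed_items() {
    local log_dir="$1"
    grep -l 'Status: FAILURE' "$log_dir"/* 2>/dev/null
}

# Demo with two hand-made item logs (file names are hypothetical).
demo_dir=$(mktemp -d)
printf 'Item: a.wav\nStatus: Succes - item has been processed.\n' > "$demo_dir/a.wav.log"
printf 'Item: b.wav\nStatus: FAILURE\n' > "$demo_dir/b.wav.log"

failed=$(list_failed_items "$demo_dir")
echo "$failed"
rm -rf "$demo_dir"
```

In practice the function would be pointed at the job\_log directory of a finished run.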
**Important:**
1) PPSS skips an item if a log file for that item is present in the job\_log directory. This allows you to interrupt PPSS and continue where you left off. If you want to process all items again, just remove the job\_log directory.
2) The -M or --md5 option allows you to inform PPSS to use MD5 hashes of the items as file names to be 100% sure to avoid collisions. This may be important when processing items that are not file names.
## Other things you should be aware of ##
PPSS must be run on a file system that supports file locking. The data that must be processed can reside on a non-locking file system.

434
Manual2.md Normal file

@ -0,0 +1,434 @@
# Design overview #
**SSH for communication**
The basis for communication between master and nodes is SSH. This requires the setup of SSH keys.
1. Nodes must be able to log in to the master server for actual distributed operation.
1. The master server must be able to log in to the nodes for deployment of PPSS and all required files.
The second requirement is not mandatory: deployment can be done from any other computer system, as long as it has the proper SSH key material to log on to the nodes.
# Installation steps in a nutshell #
To use PPSS in a distributed fashion, the following steps must be performed:
1. Set up SSH access on server and nodes.
1. Create a list of all nodes.
1. Create a configuration file for PPSS, that will be distributed to nodes.
1. Optional: create a custom script to be executed.
1. Deploy PPSS to the nodes.
1. Start PPSS on all nodes.
# A list of all relevant configuration options #
```
Modes are optional and mainly used for running in distributed mode. Modes are:
config Generate a config file based on the supplied option parameters.
deploy Deploy PPSS and related files on the specified nodes.
erase Erase PPSS and related files from the specified nodes.
start Starting PPSS on nodes.
pause Pausing PPSS on all nodes.
stop Stopping PPSS on all nodes.
node Running PPSS as a node, requires additional options.
Options are:
--config | -C If the mode is config, a config file with the specified name will be
generated based on all the options specified. In the other modes.
this option will result in PPSS reading the config file and start
processing items based on the settings of this file.
The following options are used for distributed execution of PPSS.
--master | -m Specifies the SSH server that is used for communication between nodes.
Using SSH, file locks are created, informing other nodes that an item
is locked. If items are files that must be processed, they must reside
on this host. SCP is used to transfer files from this host to nodes
for local procesing.
--node | -n File containig a list of nodes that act as PPSS clients. One IP / DNS
name per line.
--key | -k The SSH key that a node uses to connect to the master.
--known-hosts | -K The file that contains the server public key. Can often be found on
hosts that already once connected to the server. See the file
~/.ssh/known_hosts or else, manualy connect once and check this file.
--user | -u The SSH user name that is used by the node when logging in into the
master SSH server.
--script | -S Specifies the script/program that must be copied to the nodes for
execution through PPSS. Only used in the deploy mode.
This option should be specified if necessary when generating a config.
--download This option specifies that an item will be downloaded by the node
from the server or share to the local node for processing.
--upload This option specifies that the output file will be copied back to
the server, the --outputdir option is mandatory.
--no-scp | -b Do not use scp for downloading items. Use cp instead. Assumes that a
network file system (NFS/SMB) is mounted under a local mount point.
--outputdir | -o Directory on server where processed files are put. If the result of
encoding a wav file is an mp3 file, the mp3 file is put in the
directory specified with this option.
--homedir | -H Directory in which PPSS is installed on the node.
Default is 'ppss-home'.
--script | -S Script to run on the node. PPSS must copy this script to the node.
--randomize | -R Randomise which items to process by the client in distributed mode.
This makes sure that with many nodes, it is prevented that some
clients spend all their time trying to get a lock on an item.
```
# Preparation of server and nodes #
The following preparations must be made in order to use PPSS in a distributed fashion:
* Create an unprivileged user 'ppss' on the server.
* Create an unprivileged user 'ppss' on each node.
* Generate an SSH key without a pass phrase.
**Important**
The SSH key will be used by the nodes to log on to the server AND by the server to log on to the nodes. So in this example the same key material is used on both the nodes and the server.
Example:
`ssh-keygen -f ppss.key`
```
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in ppss.key.
Your public key has been saved in ppss.key.pub.
The key fingerprint is:
....
bash-3.2$ ls -alh
total 16
drwxr-xr-x 4 ppss staff 136B 15 mrt 00:09 .
drwxr-xr-x+ 51 ppss staff 1,7K 14 mrt 17:45 ..
-rw------- 1 ppss staff 1,6K 15 mrt 00:09 ppss.key
-rw-r--r-- 1 ppss staff 401B 15 mrt 00:09 ppss.key.pub
```
The result is a private and a public key (.pub). The private key is the key that needs to be distributed to all nodes so that they are able to log on to the server.
* Add the _public_ SSH key to the authorized\_keys file of the 'ppss' user on the server.
Thus, put the contents of ppss.key.pub into a file called authorized\_keys and place this file into the directory .ssh in the home directory of the PPSS user on the server.
* Add the public SSH key to the authorized\_keys file of the 'ppss' user on the nodes.
This is necessary if you want to deploy PPSS on the nodes in an automated fashion using PPSS (./ppss deploy -C config.cfg). The alternative is to copy PPSS and all necessary files to each node by hand.
* Create a 'known\_hosts' file containing the public key of the server. **Important**
When a node connects to the server for the first time, SSH will show you the fingerprint of the server and ask whether it is OK to connect to this host. To prevent this question, you must perform one of these actions:
1. Log on to each node manually, connect once to the server and manually accept the server signature.
1. Manually upload a known\_hosts file to each node and place it in the ~/.ssh directory of the ppss user.
1. Create a file called "known\_hosts" and put the server public key in this file. **Recommended**
You may already have the server public key in the ~/.ssh/known\_hosts file of a system that has been used to logon to the server. Thus use the -K option to generate your own ./known\_hosts file for usage with PPSS. If a known\_hosts file exists within the same directory in which PPSS resides, this file will automatically be used and deployed to nodes. So if you manually create a file called known\_hosts with the appropriate content, the -K option can be omitted.
* Place PPSS on the server within the PPSS home directory.
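The step of adding the public key to authorized\_keys can be sketched as a small helper. This is an illustration, not part of PPSS: the function name and the demo paths are made up, and the key in the demo is a dummy string. The restrictive permissions are included because OpenSSH commonly refuses keys when the .ssh directory or the authorized\_keys file is writable by others.

```bash
#!/bin/bash
# Append a public key to <home>/.ssh/authorized_keys with safe permissions.
install_public_key() {
    local pubkey_file="$1" home_dir="$2"
    local auth_file="$home_dir/.ssh/authorized_keys"
    mkdir -p "$home_dir/.ssh"
    chmod 700 "$home_dir/.ssh"
    touch "$auth_file"
    chmod 600 "$auth_file"
    # Append the key only if it is not already present.
    grep -qxF -f "$pubkey_file" "$auth_file" || cat "$pubkey_file" >> "$auth_file"
}

# Demo against a temporary directory instead of a real home directory.
demo_home=$(mktemp -d)
echo "ssh-rsa DUMMYKEYDATA ppss@server" > "$demo_home/ppss.key.pub"
install_public_key "$demo_home/ppss.key.pub" "$demo_home"
install_public_key "$demo_home/ppss.key.pub" "$demo_home"   # second call is a no-op
key_count=$(grep -c 'ppss@server' "$demo_home/.ssh/authorized_keys")
echo "keys installed: $key_count"
rm -rf "$demo_home"
```

On a real system the second argument would be the home directory of the 'ppss' user on the server or node.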
**Security**
Please note that usage of SSH keys without pass phrases may pose a security threat if the machines are shared with other users. You must decide for yourself if the security risk that is associated with this setup is acceptable for your environment. For example, if a node is compromised, the attacker will have (initially unprivileged) access to the server.
# Create a list of nodes #
A file must be created containing the hostnames (DNS) and/or IP-addresses of all nodes. The file must contain one node per line, such as:
```
192.168.0.100
192.168.0.101
host.domain.com
...
```
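Before deploying, it can help to check that the nodes file parses as expected. The helper below is a sketch for that purpose only; its name is hypothetical and PPSS reads the nodes file by itself. It prints one node per line and ignores blank lines.

```bash
#!/bin/bash
# Print every node from a nodes file, skipping empty lines.
list_nodes() {
    awk 'NF { print $1 }' "$1"
}

# Demo nodes file with an accidental blank line.
nodes_file=$(mktemp)
printf '192.168.0.100\n192.168.0.101\n\nhost.domain.com\n' > "$nodes_file"

node_count=$(list_nodes "$nodes_file" | wc -l | tr -d ' ')
list_nodes "$nodes_file" | while read -r node; do
    echo "node: $node"
done
echo "total nodes: $node_count"
rm -f "$nodes_file"
```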
# Create a PPSS configuration file #
This is the most important part of setting up distributed PPSS. It is exactly the same as setting up a configuration file for standalone mode, except that more options are necessary.
The best way to explain how to create a configuration file for distributed PPSS is to provide an example. In this example, a script is used to encode WAV files to MP3. This script is called 'encode.sh' and takes a filename as an argument.
`./ppss config -C config.cfg -c 'encode.sh ' -d /source/dir -m 192.168.1.100 -u ppss -k ppss.key -S ./encode.sh -n nodes.txt -o /some/output/dir --upload --download`
It is quite a long command line, but it is executed only once. After that, the config file config.cfg can be used for all further commands.
**Mode**
The first option sets the mode, in this case 'config' to generate a configuration file.
**Configuration file**
The second option, -C, specifies the name of the configuration file to be created.
**Command**
The third option, -c, specifies the command to be executed. **Please take special note of the single quotes and the space after the command.** You can read -c 'encode.sh ' also as -c 'encode.sh "$ITEM"'.
**Source directory**
This option specifies the location on the **server** where the files reside that must be processed. These files will be transferred using SCP to the nodes for local processing.
**Server**
The -m option specifies the SSH server that acts as both fileserver and SSH server for communication between nodes. The SSH server is mainly used for file-locking: nodes know that locked files are already processed or being processed, so another unlocked file must be selected.
If the server acts both as a file server and SSH server, it is not recommended to use it also as a node, in this case for encoding. File transfers using SSH can take quite some processing power. Using different hosts as file server (through SCP) and master is currently not possible.
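The idea of claiming an item through a lock can be illustrated with an atomic mkdir on a local directory. This is only a sketch of the concept: PPSS places its locks on the server over SSH, and this page does not state which mechanism it uses, so the function name and the `.lock` suffix are assumptions.

```bash
#!/bin/bash
# Try to claim an item. mkdir fails if the directory already exists,
# so only the first caller succeeds.
claim_item() {
    local lock_root="$1" item_name="$2"
    mkdir "$lock_root/$item_name.lock" 2>/dev/null
}

lock_root=$(mktemp -d)

if claim_item "$lock_root" "20080602.wav"; then
    first_claim="claimed"
else
    first_claim="locked"
fi

# A second node asking for the same item finds it locked.
if claim_item "$lock_root" "20080602.wav"; then
    second_claim="claimed"
else
    second_claim="locked"
fi

echo "first: $first_claim, second: $second_claim"
rm -rf "$lock_root"
```

A node that finds an item locked simply moves on to the next unlocked item.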
**User name**
This is the name of the local system user that is used by the nodes to log on to the server with SSH. For deployment, such a user must also be present on the nodes.
**SSH Key**
Scripts using SSH require an SSH key without a passphrase. This key must be uploaded to the nodes, and the nodes must know which key to use, so it must be specified.
**Script or program that must be uploaded**
The -S option specifies the script or program that should be uploaded to the node because it must be executed by the node for distributed computing. In this case, the encode.sh script must be deployed on all nodes and thus specified.
**List of nodes**
The -n option specifies the file containing all nodes. For every node, PPSS will perform actions such as deploy, start, stop and pause.
**Transfer files to local host**
--download: If this option is specified, the file that is to be processed is copied from the source directory to a local temporary working directory for local processing. This is necessary if SCP is used to access files that must be processed.
If files are distributed over NFS or SMB, the files look like they are present on the local system, because it is just a mount point and thus just a part of the local file system. In this case, the --download option can be omitted.
**The output directory**
If the --upload option is used, the -o option specifies the destination directory on the server. The results are uploaded to this directory.
**More examples**
The following example does the exact same thing as the encode script.
`./ppss config -C config.cfg -c 'lame -a "$ITEM" "$OUTPUT_DIR/$OUTPUT_FILE.mp3" --preset standard --quiet' -d /source/dir -m 192.168.1.100 -u ppss -k ppss-key.key -K /path/to/known_hosts_file -n nodes.txt -o /some/output/dir --download --upload`
The OUTPUT\_DIR and OUTPUT\_FILE variables are special. They tell your command where to store the output. This is important if you want to transfer the results of your command back to the server.
In this example, Lame requires that the user specifies an output file. PPSS generates the name of this output file for you, based on the name of the Item. This example shows that you don't need to create your own shell scripts in order to be able to use PPSS.
The -K option is optional. If you created a file called 'known\_hosts', this file will automatically be used. Warning: if you specify a different file with the -K option, the current known\_hosts file will be replaced by this file. If you manually create a file called known\_hosts with the appropriate content, the -K option can be omitted.
# Create a script #
**Entirely optional!**
This section is optional. It is possible to execute commands just by using the -c option and the appropriate variables.
PPSS transfers files to the node and uploads the output back to the server. In order to be able to upload output back to the server, PPSS must know where this output can be found.
By default, output is stored in the directory specified by $PPSS\_LOCAL\_OUTPUT. Of course, you can hard-code the PPSS\_LOCAL\_OUTPUT path, but it is much easier to source the PPSS configuration file and use the variables that are already defined there and used by PPSS anyway.
The example script below uses the settings of the PPSS configuration file. It has actually been used to encode 400 GB of WAV files.
```
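#!/bin/bash
# Encode a single WAV file (passed by PPSS as the first argument) to MP3.
ITEM="$1"
# Items are provided with their full path; keep only the file name.
TMP=$(basename "$ITEM")
# Load the PPSS settings, including PPSS_LOCAL_OUTPUT.
source config.cfg
lame -a "$ITEM" "$PPSS_LOCAL_OUTPUT/$TMP/$TMP.mp3" --preset standard --quiet
ERROR="$?"
if [ "$ERROR" -eq 0 ]
then
    echo "Encode of $ITEM successful."
    exit 0
else
    echo "Error when encoding $ITEM."
    exit 1
fi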
```
Take notice of the basename command. Items are provided with full path. Basename strips this path from the filename and uses just the filename in this script.
By sourcing the config.cfg file of PPSS, this script can use the PPSS\_LOCAL\_OUTPUT variable, or any other variable contained within the configuration file.
**Rules when writing a script for usage with PPSS**
* As with any decent shell script, use exit codes. Exit code 0 reflects successful execution, any other value a failure.
* Echo some information about what the script is doing. If something fails, echo what is wrong. This is caught by PPSS and logged in the log file of the item that is processed.
For example, the above script results in this kind of output:
```
===== PPSS Item Log File =====
Host: Beest
Item: PPSS_LOCAL_TMPDIR/20060907.wav
Start date: Mar 10 23:54:04
Encode of PPSS_LOCAL_TMPDIR/20060907.wav successful.
Status: Succes - item has been processed.
Elapsed time (h:m:s): 0:1:44
```
**TIP**
All variables specified when generating a configuration file can be used within your own script when sourcing the configuration file.
## Deploy PPSS to nodes ##
Once SSH access is set up and the configuration file is generated, PPSS can be deployed to the nodes. This is very simple, as this example demonstrates:
`./ppss deploy -C config.cfg`
During the phase when we generated the configuration file, a nodes file was specified. Thus PPSS knows, just by reading this configuration file, which file contains a list of nodes.
```
bash-3.2$ ./ppss.sh deploy -C config.cfg
mrt 30 23:20:00: INFO =========================================================
mrt 30 23:20:00: INFO |P|P|S|S|
mrt 30 23:20:00: INFO Distributed Parallel Processing Shell Script version 2.18
mrt 30 23:20:00: INFO =========================================================
mrt 30 23:20:00: INFO Hostname: MacBoek.local
mrt 30 23:20:00: INFO ---------------------------------------------------------
mrt 30 23:20:00: INFO Deploying PPSS on nodes.
mrt 30 23:20:01: INFO PPSS installed on node 192.168.0.18.
mrt 30 23:20:01: INFO PPSS installed on node 192.168.0.6.
mrt 30 23:20:01: INFO PPSS installed on node 192.168.0.4.
mrt 30 23:20:01: INFO PPSS installed on node 192.168.0.1.
mrt 30 23:20:01: INFO PPSS installed on node 192.168.0.15.
mrt 30 23:20:01: INFO PPSS installed on node 192.168.0.33.
mrt 30 23:20:05: INFO Cannot connect to node 192.168.0.20.
```
Deployment of PPSS is executed in parallel for each host.
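To give an idea of what "in parallel for each host" means, here is a minimal sketch. It is not the code PPSS uses: the function name `for_each_node` is made up, and the per-node command is whatever you pass in. It assumes a nodes file with one host per line.

```shell
#!/bin/bash
# Hypothetical sketch: run a command once per node, all nodes at the same time.
for_each_node () {
    local nodes_file="$1"
    shift
    local node
    while IFS= read -r node
    do
        # Skip empty lines in the nodes file.
        [ -n "$node" ] || continue
        # Start the command in the background, with the node as last argument.
        "$@" "$node" &
    done < "$nodes_file"
    # Block until every background job has finished.
    wait
}
```

For example, `for_each_node nodes.txt echo "would deploy to"` prints one line per node without waiting for the previous node to finish.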
## Start PPSS on nodes ##
Starting PPSS on all nodes is just as simple as deploying it.
`./ppss start -C config.cfg`
```
mrt 12 22:21:17: INFO - ---------------------------------------------------------
mrt 12 22:21:17: INFO - Distributed Parallel Processing Shell Script version 2.03
mrt 12 22:21:17: INFO - Hostname: MacBoek.local
mrt 12 22:21:17: INFO - Starting PPSS on node 10.0.0.14.
mrt 12 22:21:17: INFO - Starting PPSS on node 10.0.0.12.
mrt 12 22:21:20: INFO - Starting PPSS on node 10.0.0.4.
mrt 12 22:21:20: INFO - Starting PPSS on node 10.0.0.31.
```
## Stop, pause and continue PPSS on nodes ##
To stop, pause or continue processing on all nodes, use the following commands:
`./ppss stop -C config.cfg`
`./ppss pause -C config.cfg`
`./ppss continue -C config.cfg`
Please note that nodes will finish processing the item they are currently working on. They just stop picking up new items if stop or pause is selected.
## Show progress ##
The overall progress of the 'cluster' is determined by the number of files present in the input and output directories on the server.
```
bash-3.2$ ./ppss.sh status -C config.cfg
mrt 29 22:18:27: INFO =========================================================
mrt 29 22:18:27: INFO |P|P|S|S|
mrt 29 22:18:27: INFO Distributed Parallel Processing Shell Script version 2.17
mrt 29 22:18:27: INFO =========================================================
mrt 29 22:18:27: INFO Hostname: MacBoek.local
mrt 29 22:18:27: INFO ---------------------------------------------------------
mrt 29 22:18:28: INFO Status: 100 percent complete.
mrt 29 22:18:28: INFO Nodes: 7
mrt 29 22:18:28: INFO ---------------------------------------------------------
mrt 29 22:18:28: INFO IP-address Hostname Processed Status
mrt 29 22:18:28: INFO ---------------------------------------------------------
mrt 29 22:18:28: INFO 192.168.0.4 Core7i 155 FINISHED
mrt 29 22:18:29: INFO 192.168.0.2 MINI.local 34 FINISHED
mrt 29 22:18:29: INFO 192.168.0.5 server 29 FINISHED
mrt 29 22:18:30: INFO 192.168.0.63 host3 6 FINISHED
mrt 29 22:18:31: INFO 192.168.0.64 host4 6 FINISHED
mrt 29 22:18:31: INFO 192.168.0.20 imac-2.local 34 FINISHED
mrt 29 22:18:32: INFO 192.168.0.1 router 7 FINISHED
mrt 29 22:18:32: INFO ---------------------------------------------------------
mrt 29 22:18:32: INFO Total processed: 271
```
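The percentage itself is simple integer arithmetic. A minimal sketch, assuming you already know the number of processed items and the total (the function names are made up and this is not the PPSS code):

```shell
#!/bin/bash
# Hypothetical helpers, not taken from PPSS.

# Count the regular files below a directory.
count_files () {
    find "$1" -type f | wc -l | awk '{ print $1 }'
}

# Print the percentage of processed items, rounded down.
percent_complete () {
    local processed="$1"
    local total="$2"
    # Avoid a division by zero when there are no items at all.
    if [ "$total" -le 0 ]
    then
        echo 0
        return
    fi
    echo $(( processed * 100 / total ))
}
```

With the numbers from the log examples in this manual, `percent_complete 23 625` prints 3, which matches the "3 percent complete" line shown there.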
## Logging ##
An important feature of PPSS is its extensive logging. There are two types of log files.
* A single log file created by PPSS itself. This file is found on the local nodes. Using tail -f on these files, it is possible to monitor what PPSS is currently doing.
```
Mar 12 22:57:19: INFO - ---------------------------------------------------------
Mar 12 22:57:19: INFO - Distributed Parallel Processing Shell Script version 2.03
Mar 12 22:57:19: INFO - ---------------------------------------------------------
Mar 12 22:57:19: INFO - Hostname: Beest
Mar 12 22:57:19: DEBUG - Found 8 logic processors.
Mar 12 22:57:19: INFO - CPU: Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
Mar 12 22:57:19: INFO - ---------------------------------------------------------
Mar 12 22:57:19: DEBUG - Job log directory JOB_LOG exists.
Mar 12 22:57:20: INFO - Listener started.
Mar 12 22:57:20: INFO - Starting 8 workers.
Mar 12 22:57:20: INFO - Currently 0 percent complete. Processed 0 of 625 items.
Mar 12 22:57:20: DEBUG - Trying to lock item 20060731.wav.
Mar 12 22:57:20: DEBUG - Item 20060731.wav is locked.
Mar 12 22:57:20: INFO - Currently 0 percent complete. Processed 1 of 625 items.
Mar 12 22:57:20: DEBUG - Trying to lock item 20060801.wav.
Mar 12 22:57:20: DEBUG - Item 20060801.wav is locked.
Mar 12 22:57:20: INFO - Currently 0 percent complete. Processed 2 of 625 items.
Mar 12 22:57:20: DEBUG - Trying to lock item 20060802.wav.
Mar 12 22:57:20: DEBUG - Item 20060802.wav is locked.
Mar 12 22:57:20: INFO - Currently 0 percent complete. Processed 3 of 625 items.
............
mrt 10 23:51:23: DEBUG - Item 20060830.wav is locked.
mrt 10 23:51:23: INFO - Currently 3 percent complete. Processed 23 of 625 items.
mrt 10 23:51:23: DEBUG - Trying to lock item 20060831.wav.
mrt 10 23:51:23: DEBUG - Got lock on 20060831.wav, processing.
mrt 10 23:51:23: DEBUG - Transfering item 20060831.wav to local disk.
mrt 10 23:52:18: DEBUG - Exit code of transfer is 0
mrt 10 23:52:18: DEBUG - Processing item 20060831.wav
```
* An individual log file containing information and output of each processed item. These files are uploaded to the 'job\_log' directory on the SSH server. For every item, a log file must be present.
```
===== PPSS Item Log File =====
Host: MacBoek.local
Item: PPSS_LOCAL_TMPDIR/20060831.wav
Start date: mrt 10 23:52:18
Encode of PPSS_LOCAL_TMPDIR/20060831.wav successful.
Status: Succes - item has been processed.
Elapsed time (h:m:s): 0:5:23
```
As you can see, with a few simple grep commands, it is possible to quickly determine which items have failed to process. Also, you can see that my MacBook took 5 minutes and 23 seconds to process this WAV file.
Please note that the "Encode of..." part is output of the script that is executed on an item. The other content is generated by PPSS.
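As a sketch of such a grep command: the function below lists every item log file that does not contain the success status line shown above. It assumes the log files are available in a local directory (for example a copy of 'job\_log'); the function name is made up.

```shell
#!/bin/bash
# Hypothetical helper: list item log files without a success status line.
# The search string matches the status line exactly as PPSS writes it.
list_failed_items () {
    local job_log_dir="$1"
    # grep -L prints the names of the files that do NOT contain a match.
    grep -L "Status: Succes" "$job_log_dir"/*
}
```

Usage would be something like `list_failed_items ./ppss_dir/job_log`, where the path is just an example.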
# To wrap it up #
I am convinced that PPSS is very easy to use and tailored to your needs. If you have questions and/or suggestions, don't hesitate to send an e-mail. If you find bugs, please report them using the issue tracker. Feedback is greatly appreciated.

69
Manual3.md Normal file
View File

@ -0,0 +1,69 @@
# Daemon mode (2.63 and onward) #
PPSS can be run as a daemon, monitoring a file or directory for new items. If (new) input is found, it is processed. If multiple items are put into the directory at once, they are processed in parallel. In daemon mode, PPSS will show no output, but will perform some basic logging to its log file.
When running as a daemon, there is a risk that as soon as PPSS detects a new file, it starts processing it while the file has not been fully written to disk. To mitigate this risk, there are three options for running PPSS as a daemon:
* standard (default if inotify is not available)
* with Linux inotify (default if available)
* with manual locking
## Standard daemon mode ##
In this mode PPSS uses the 'stat' command to determine how long ago a file was last modified. By default, a file must have an age of 4 seconds before it is processed. If you want to wait for a longer or shorter period, use the --file-age (seconds) parameter. The --polling-interval option allows you to specify how often PPSS should check for new files within the directory. The default is to check for new files every 10 seconds. An example:
> ppss -d /some/directory -c 'gzip ' --daemon --polling-interval 30 --file-age 10
Please note that checking for new files in a directory with many files will stress the CPU, as PPSS must determine for each file found whether it has been processed or not. So it is advised to remove items from the directory once they are processed. Also, don't set the polling interval too short, or the system will be busy polling and unable to do any actual work. If a short polling interval is required, consider using the Linux inotify option as described below.
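The file age check can be sketched as follows. This is an illustration of the idea, not the code from PPSS: the function names are made up, and the sketch assumes that either GNU stat (`-c %Y`) or BSD stat (`-f %m`) is available.

```shell
#!/bin/bash
# Hypothetical helpers, not taken from PPSS.

# Print the number of seconds since the file was last modified.
file_age_seconds () {
    local file="$1"
    local mtime now
    # GNU stat uses -c %Y, BSD / Mac OS X stat uses -f %m.
    mtime=$(stat -c %Y "$file" 2> /dev/null) \
        || mtime=$(stat -f %m "$file" 2> /dev/null) \
        || return 1
    now=$(date +%s)
    echo $(( now - mtime ))
}

# Succeed if the file is at least the given number of seconds old.
is_old_enough () {
    local age
    age=$(file_age_seconds "$1") || return 1
    [ "$age" -ge "$2" ]
}
```

A polling loop would then only pick up files for which `is_old_enough "$file" 4` succeeds, 4 seconds being the default file age mentioned above.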
## Linux inotify (2.82 and onward) ##
A regular daemon just polls every x seconds for new files, but this polling is not very efficient. A robust and fast mechanism for monitoring of file system events is [inotify](http://en.wikipedia.org/wiki/Inotify). By default, the inotify program does nothing and just waits for a file system event to occur. Thus when using PPSS, PPSS will do absolutely nothing unless a file system event occurs. Only 'close' events are noticed by PPSS, making dead sure that only files are processed that have been closed and are not being operated upon.
Inotify is enabled by default if PPSS detects that inotify is installed and PPSS is run as a daemon.
To use inotify on a Linux system, you must install it first. For Debian-based operating systems, this can be done with:
> apt-get install inotify-tools
Inotify is regarded as the best option for running the daemon mode, however it requires additional software. The standard mechanism that just polls the directory at a regular interval and verifies the modification date of a file may be sufficient for many, so it is not required. The benefit of inotify is that it makes PPSS fast to respond to filesystem events. PPSS doesn't need to wait for the next polling event to pick up new items. They are processed as soon as they arrive.
Inotify can be explicitly disabled with the --disable-inotify option.
**Caveat** Inotify does not work on network file systems like NFS. Disable inotify in this case.
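For those curious how waiting for 'close' events looks in a shell script, here is a minimal sketch using `inotifywait` from the inotify-tools package. It is not the PPSS implementation: the function names are made up, and the choice of the `close_write` event is an assumption.

```shell
#!/bin/bash
# Hypothetical sketch, not taken from PPSS.

# Read one path per line from stdin and run the given command on each path.
process_closed_files () {
    local path
    while IFS= read -r path
    do
        # Keep the command from reading the list of paths on stdin.
        "$@" "$path" < /dev/null
    done
}

# Watch a directory and process every file that is closed after writing.
# -m keeps monitoring, -q suppresses informational messages, and
# --format '%w%f' prints the directory followed by the file name.
watch_directory () {
    local dir="$1"
    shift
    inotifywait -m -q -e close_write --format '%w%f' "$dir" \
        | process_closed_files "$@"
}
```

For example, `watch_directory /some/directory gzip` would compress each file as soon as the program writing it has closed it.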
## Manual locking mechanism ##
If you want to be dead sure that no race condition can occur and 'inotify' cannot be used, use the additional locking mechanism that is built into PPSS. The --enable-input-lock option forces PPSS to claim the input directory with a lock directory called INPUT\_LOCK. If this directory exists, PPSS will not process items. Once this directory is removed, PPSS will start processing. This way, you can lock the input directory in your script and make sure that all operations on files are finished before PPSS starts processing items. For this feature, your script needs some additional logic like this (almost identical to the code in PPSS):
```
# 1 - try to obtain the lock; mkdir fails if the directory already exists.
while true
do
    if mkdir "/some/directory/INPUT_LOCK" > /dev/null 2>&1
    then
        break
    else
        sleep 5
    fi
done
# 2 - do something here, for example copy a file into the input directory
cp file /some/directory/
# 3 - release the lock
rm -rf /some/directory/INPUT_LOCK
```
## Using a file as input ##
When using the -f option in daemon mode, you also need to specify a directory with the -d option. This directory is used for the lock, as described above, if locking is used.
## NFS and inotify ##
NFS does not work well with inotify. If a directory is exported through NFS and an NFS client writes to this directory through NFS, this creates thousands of CLOSE events instead of a single event.
Therefore inotify cannot be used on a directory exported through NFS. PPSS tries to detect this by parsing the output of the 'mount' command.
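Parsing the 'mount' output could look roughly like this. This is a sketch and not the detection code from PPSS: the function name is made up, it assumes the Linux output format (`<source> on <directory> type <fstype> (...)`), it only matches the mount point itself and it does not handle paths containing spaces.

```shell
#!/bin/bash
# Hypothetical helper: read `mount` output on stdin and succeed if the given
# directory is the mount point of an NFS file system (nfs, nfs4, ...).
is_nfs_mount () {
    local dir="$1"
    awk -v dir="$dir" '
        $2 == "on" && $3 == dir && $4 == "type" && $5 ~ /^nfs/ { found = 1 }
        END { exit !found }
    '
}
```

Usage would be `mount | is_nfs_mount /some/directory`, after which a script can decide to fall back to regular polling.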

27
Overview.md Normal file
View File

@ -0,0 +1,27 @@
# Introduction #
Most recent computer systems feature at least two processor cores, and often more. Most programs and tasks do not benefit from these extra CPU cores, because software must be (re)written in such a way that it can use them. Most of the time, only one CPU or CPU core is used. This is a waste of resources.
Most users can't benefit from these extra CPU cores, because the programs they use are often not aware of them. To support parallel processing, software must often be substantially rewritten, which is often not done. So only one core is used and the other cores are just idling. If they could also be used, the job could be done in half (dual-core), a quarter (quad-core) or even less (distributed cluster) of the time.
The solution is to just run the application multiple times in parallel. This is of course only beneficial if you have more than one file or item to process. And that is the principle behind PPSS.
The simple idea behind PPSS is that you have a (large) number of items, files for example, and you want to perform some action on them. Instead of processing one item at a time, you want to process 4 items at a time, since you have a nice quad-core CPU. A program is required that starts a process for every core and, when a process finishes, starts a new one. And some logging of the result (success or failure?) would also be nice.
PPSS does this for you.
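To make the principle concrete, here is a bare-bones sketch that uses `xargs -P` to keep a fixed number of processes busy. PPSS itself works differently and adds logging, locking and much more; the function name `run_parallel` is made up for this example.

```shell
#!/bin/bash
# Hypothetical sketch of the principle, not how PPSS is implemented.
# Reads items from stdin (one per line) and runs the given command once per
# item, with at most the given number of processes running at the same time.
run_parallel () {
    local jobs="$1"
    shift
    # Convert newlines to NUL bytes so that spaces in items are preserved.
    tr '\n' '\0' | xargs -0 -n 1 -P "$jobs" "$@"
}
```

For example, `ls /dir/with/some/files/* | run_parallel 4 gzip` compresses four files at a time. As soon as one gzip finishes, the next one is started.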
# Features #
Features of PPSS are:
* Very easy to use. You may be up and running within 5 minutes.
* Will run on any system that supports bash (although only tested on Linux and Mac OS X)
* Automatically detects the number of CPUs and CPU cores and starts a worker process for each of them.
* Supports hyper-threading if available.
* All output of individual processes will be logged for your inspection (were there errors? How long did it take?).
* Actions performed by PPSS are logged to a log file for your inspection.
* Can process a text file with one item per line. Items can be what you want. URLs, files, anything. Each line is fed to the command you specify.
* Can execute any command you like. Can execute your own scripts in parallel.
* If interrupted, will by default continue where it left off, skipping processed files.
* Can be run in distributed mode as a cluster over multiple computer systems using SSH.
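Detecting the number of CPU cores from a shell script could be done roughly like this. It is a sketch under assumptions, not the detection code from PPSS: it relies on `getconf _NPROCESSORS_ONLN` (available on Linux and Mac OS X) and falls back to `sysctl -n hw.ncpu`. The function name is made up.

```shell
#!/bin/bash
# Hypothetical helper: print the number of online logical processors,
# or 1 if it cannot be determined.
detect_cpus () {
    local n
    n=$(getconf _NPROCESSORS_ONLN 2> /dev/null) \
        || n=$(sysctl -n hw.ncpu 2> /dev/null) \
        || n=1
    # Fall back to 1 if the result is empty or not a plain number.
    case "$n" in
        ''|*[!0-9]*) n=1 ;;
    esac
    echo "$n"
}
```

The result is the number of logical processors, so on a system with hyper-threading it is higher than the number of physical cores.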

129
ProjectHome.md Normal file
View File

@ -0,0 +1,129 @@
## |P|P|S|S| - (Distributed) Parallel Processing Shell Script ##
---
**PROJECT HAS MOVED TO GITHUB:
https://github.com/louwrentius/PPSS**
---
PPSS is a Bash shell script that executes commands, scripts or programs in parallel. It is designed to make full use of current multi-core CPUs. It will detect the number of available CPUs and start a separate job for each CPU core. It will also use hyper threading by default.
PPSS can be run on multiple hosts, processing a single group of items, like a cluster.
PPSS provides you with examples that will make it obvious how it is used:
```
bash-3.2$ ppss
|P|P|S|S| Distributed Parallel Processing Shell Script 2.60
usage: ./ppss [ -d <sourcedir> | -f <sourcefile> ] [ -c '<command> "$ITEM"' ]
[ -C <configfile> ] [ -j ] [ -l <logfile> ] [ -p <# jobs> ]
[ -D <delay> ] [ -h ] [ --help ] [ -r ]
Examples:
./ppss -d /dir/with/some/files -c 'gzip '
./ppss -d /dir/with/some/files -c 'cp "$ITEM" /tmp' -p 2
./ppss -f <file> -c 'wget -q -P /destination/directory "$ITEM"' -p 10
```
Basically, just provide PPSS with a source of items (a directory with files, for example) and a command that must be applied to these items.
For a quick demonstration of its standalone usage, see the video below.
<a href='http://www.youtube.com/watch?feature=player_embedded&v=32PwsARbePw' target='_blank'><img src='http://img.youtube.com/vi/32PwsARbePw/0.jpg' width='600px' height=344 /></a>
A bit more advanced (better quality):
<a href='http://www.youtube.com/watch?feature=player_embedded&v=AdwZlW1eZ6A' target='_blank'><img src='http://img.youtube.com/vi/AdwZlW1eZ6A/0.jpg' width='600px' height=344 /></a>
PPSS will take a list of items as input. Items can be files within a directory or entries in a text file. PPSS executes a user-specified command for each item in this list. The item is supplied as an argument to this command. At any point in time, there are never more items processed in parallel than there are cores available.
An example of how this script is used:
```
user@host:~/ppss$ ./ppss.sh -d /wavs -c './encode.sh '
Mar 30 23:21:10: INFO =========================================================
Mar 30 23:21:10: INFO |P|P|S|S|
Mar 30 23:21:10: INFO Distributed Parallel Processing Shell Script version 2.18
Mar 30 23:21:10: INFO =========================================================
Mar 30 23:21:10: INFO Hostname: Core i7
Mar 30 23:21:10: INFO ---------------------------------------------------------
Mar 30 23:21:10: INFO Found 8 logic processors.
Mar 30 23:21:10: INFO CPU: Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
Mar 30 23:21:10: INFO Starting 8 workers.
Mar 30 23:21:10: INFO ---------------------------------------------------------
Mar 30 23:21:17: INFO Currently 76 percent complete. Processed 172 of 226 items.
```
In this example, the script detects that four CPU cores are available. Hyper-threading is used, as the Core i7 920 supports it, so 8 workers are started. Don't miss the trailing space within the command section.
**Logging**
One of the nice features of PPSS is logging. The output of the command executed on each item is logged to a separate log file for that item. Below is an example of such a file:
```
===== PPSS Item Log File =====
Host: imac-2.local
Item: PPSS_LOCAL_TMPDIR/20080602.wav
Start date: Mar 03 00:10:32
Encode of PPSS_LOCAL_TMPDIR/20080602.wav successful.
Status: Succes - item has been processed.
Elapsed time (h:m:s): 0:4:48
```
As you can see, a lot of information is logged by PPSS about the processed item, including the time it took to process it. Of particular interest is the status line: it is based on the exit status of the executed command, so error detection is built in.
This script is built with the goal of being very easy to use. It runs on Linux and Mac OS X. It should work on other Unix-like operating systems, such as Solaris, that support the Bash shell.
This script is (only) useful for jobs that can be easily broken down into separate tasks that can be executed in parallel. For example, encoding a bunch of WAV files to MP3 format, downloading a large number of files, resizing images, anything you can think of.
Please note that this script is _even useful on a single-core host_. Certain jobs, such as downloading files and processing these downloaded files can often be optimized by executing these processes in parallel.
**_PPSS is always a work in progress and although it seems to work for me, it might not for you for reasons I'm currently not aware of. I would very much appreciate it if you try it out and create an issue if you find a bug. Thanks!_**
## Distributed PPSS ##
From version 2.0 onward, PPSS supports distributed computing. With this version, it is possible to run PPSS on multiple hosts that each process a part of the same queue of items. Nodes communicate with each other through a single SSH server.
This script has already been used to convert 400 GB of WAV files to MP3 with four hosts: a Core i7 running Ubuntu, two Macs based on 1.8 and 2 GHz Core Duos running Leopard, and a 2.2 GHz AMD system running Debian.
The remarkable thing is that the Core i7 @ 3.6 GHz processed 380 files, while the other three systems _combined_ only processed 199. Keep in mind that the Core i7 has only 4 physical cores...
![http://chart.apis.google.com/chart?cht=p3&chd=t:66,11,11,12&chs=350x150&chl=Core%20i7%20|AMD|iMac|Mac%20Mini&noncense=test.png](http://chart.apis.google.com/chart?cht=p3&chd=t:66,11,11,12&chs=350x150&chl=Core%20i7%20|AMD|iMac|Mac%20Mini&noncense=test.png)
It is difficult to give an impression how PPSS works in distributed mode, however maybe the status screen can give you an idea.
```
mrt 29 22:18:27: INFO =========================================================
mrt 29 22:18:27: INFO |P|P|S|S|
mrt 29 22:18:27: INFO Distributed Parallel Processing Shell Script version 2.17
mrt 29 22:18:27: INFO =========================================================
mrt 29 22:18:27: INFO Hostname: MacBoek.local
mrt 29 22:18:27: INFO ---------------------------------------------------------
mrt 29 22:18:28: INFO Status: 100 percent complete.
mrt 29 22:18:28: INFO Nodes: 7
mrt 29 22:18:28: INFO ---------------------------------------------------------
mrt 29 22:18:28: INFO IP-address Hostname Processed Status
mrt 29 22:18:28: INFO ---------------------------------------------------------
mrt 29 22:18:28: INFO 192.168.0.4 Corei7 155 FINISHED
mrt 29 22:18:29: INFO 192.168.0.2 MINI.local 34 FINISHED
mrt 29 22:18:29: INFO 192.168.0.5 server 29 FINISHED
mrt 29 22:18:30: INFO 192.168.0.63 host3 6 FINISHED
mrt 29 22:18:31: INFO 192.168.0.64 host4 6 FINISHED
mrt 29 22:18:31: INFO 192.168.0.20 imac-2.local 34 FINISHED
mrt 29 22:18:32: INFO 192.168.0.1 router 7 FINISHED
mrt 29 22:18:32: INFO ---------------------------------------------------------
mrt 29 22:18:32: INFO Total processed: 271
```
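The idea of nodes sharing one queue by 'claiming' items can be sketched with a lock directory per item. This is a simplified, local illustration and not the PPSS code: in distributed mode the locks live on the SSH server, the function names below are made up, and the sketch assumes `md5sum` (Linux) or `md5` (Mac OS X) is installed.

```shell
#!/bin/bash
# Hypothetical sketch, not taken from PPSS.

# Print the MD5 hash of an item name, so it can be used as a safe lock name.
item_hash () {
    if command -v md5sum > /dev/null 2>&1
    then
        printf '%s' "$1" | md5sum | awk '{ print $1 }'
    else
        printf '%s' "$1" | md5
    fi
}

# Try to claim an item. Succeeds for the first caller only, because mkdir
# fails when the lock directory already exists.
claim_item () {
    local lock_dir="$1"
    local item="$2"
    mkdir "$lock_dir/$(item_hash "$item")" 2> /dev/null
}
```

A worker would call `claim_item "$LOCK_DIR" "$ITEM"` for each item in the list and only process the item if the call succeeds; `LOCK_DIR` is a placeholder name here.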

View File

@ -1 +0,0 @@
Automatically exported from code.google.com/p/ppss

30
Relatedprojects.md Normal file
View File

@ -0,0 +1,30 @@
### Project Middleman ###
Through Linuxtoday, I found a project called "Linux Middleman".
A quote from its main page:
```
Parallel Shell Scripting Made Easy
The Middleman System (mdm) is a set of open-source utilities that help you parallelize your shell scripts. Its features include:
Uses dynamic mix-n-match parallelization technology
Requires minimal modification to existing scripts
Provides ncurses-based monitoring console
The Middleman System is what you need to unleash the power of your multi-processor and multi-core computers.
```
Middleman uses a different approach to accomplish the same thing as PPSS: parallel processing using shell scripts. It seems that Middleman focuses on parallelising tasks within existing shell scripts. PPSS on the other hand just executes entire scripts or commands in parallel.
Use the tool you need. So take a look at the following link to learn more about Middleman:
http://mdm.berlios.de/
Some slides about Middleman: http://mdm.berlios.de/data/csgsc-talk-slides.pdf
### "parallel" ###
This also seems very useful. It is written in Perl. See:
https://savannah.nongnu.org/projects/parallel/

20
Requirements.md Normal file
View File

@ -0,0 +1,20 @@
The main requirements for PPSS to work are the Bash shell and the availability of the mkfifo command. PPSS is written using features that are specific to the Bash shell and is thus not portable to shells like dash and ksh.
For distributed PPSS, a host with SSH access is necessary, that will act as a server. Also, clients and server must be configured in order to use SSH keys and "Screen" must be available.
## Requirements for stand-alone usage of PPSS on single host ##
* Bash
* mkfifo
* md5sum
* sed
## Requirements for distributed usage of PPSS on multiple hosts ##
This is additional to the requirements for stand-alone.
* SSH(D)
* Screen
* On both node and server, an unprivileged user is required.
* SSH key with no passphrase for node -> server connection.
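A quick way to check whether the required commands are present is sketched below. The function name is made up and the check is not part of PPSS; the command names are the ones from the lists above.

```shell
#!/bin/bash
# Hypothetical helper: print the name of every given command that cannot be
# found in the PATH. Prints nothing if all commands are available.
missing_commands () {
    local cmd
    for cmd in "$@"
    do
        command -v "$cmd" > /dev/null 2>&1 || echo "$cmd"
    done
}
```

For stand-alone usage you could run `missing_commands bash mkfifo md5sum sed`, and for distributed usage add `ssh` and `screen` to the list.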

26
Roadmap.md Normal file
View File

@ -0,0 +1,26 @@
# Introduction #
PPSS Roadmap.
# Details #
* The distributed mechanism basically requires some script to be written in order to be able to upload content back to the server. This should not be a requirement. The same must be accomplished by just using the -c command line option. **(COMPLETED)**
* Better status monitoring of nodes. Nodes should periodically write some status information to the server, that can provide some insight in the current status of the jobs that are running. UPDATE: nodes write status information to their own disk and this is spidered by the computer running PPSS to control the nodes. **(COMPLETED)**
* I am considering adding the option to specify a host that acts as the SSH fileserver. The PPSS master host and fileserver must currently be the same host. This should not be necessary. **(COMPLETED)**
* Use netcat and ncat (nmap.org) for client communication. This may be easier to setup than SSH. Although if security is really an issue, SSH will still be the way to go. (This idea may not be possible.)
* Providing RPMS and DEB packages for the popular operating systems. **(COMPLETED)**
* Bootable LIVE image with pre-installed PPSS client for easy setup of a distributed computing environment.
* Run PPSS as a daemon, monitoring a directory for (new) files to process. **(COMPLETED)**
* Send email notifications when: **(Pending completion)**
1. PPSS has finished processing all items.
1. When processing an item returns an error, including the error message.
If you have any suggestions for additional features or improvements, let me know using comments or by creating an issue.

3
Tutorials.md Normal file
View File

@ -0,0 +1,3 @@
Some person (GBI) has written a HowToForge document about PPSS:
http://www.howtoforge.com/fully-utilizing-your-x-core-cpu

View File

@ -1,94 +0,0 @@
#!/usr/bin/env bash
INPUT="$1"
METATAGS="--export-tags-to="
LAMEOPTS=""
ERROR_STATUS="0"
function usage () {
echo
echo "Usage: $0 <flac file name>"
echo
exit 1
}
function error () {
ERROR="$1"
MSG="$2"
echo "Error: $MSG"
exit 1
}
if [ -z "$INPUT" ]
then
usage
fi
if [ ! -e "$INPUT" ]
then
echo "File $INPUT does not exist!"
exit 1
fi
FILETYPE="`file -b "$INPUT" | awk '{ print $1 }'`"
if [ ! "$FILETYPE" == "FLAC" ]
then
echo "File $INPUT is not a flac file..."
exit 0
fi
checkvar () {
VAR="$1"
if [ -z "$VAR" ] || [ "$VAR" == "" ]
then
echo "Unknown"
else
echo "$VAR"
fi
}
METATAGS="TITLE ARTIST ALBUM GENRE COMPOSER CONDUCTOR ENSEMBLE TRACKNUMBER DATE ALBUM ARTIST DISCNUMBER DISC"
function convert () {
FILE="$1"
META="$FILE.meta"
MP3FILE="`echo ${FILE%flac}mp3`"
DIR="`dirname "$FILE"`"
metaflac --export-tags-to="$META" "$FILE"
ARTIST="`metaflac "$FILE" --show-tag=ARTIST | sed s/.*=//g`"
TITLE="`metaflac "$FILE" --show-tag=TITLE | sed s/.*=//g`"
ALBUM="`metaflac "$FILE" --show-tag=ALBUM | sed s/.*=//g`"
GENRE="`metaflac "$FILE" --show-tag=GENRE | sed s/.*=//g`"
TRACKNUMBER="`metaflac "$FILE" --show-tag=TRACKNUMBER | sed s/.*=//g`"
for x in $METATAGS
do
declare $x="`grep "$x" "$META" | cut -d "=" -f 2`"
VAR=$(eval echo " \$$x")
VAR="`checkvar $VAR`"
done
flac -s -c -d "$FILE" | lame --tt "$TITLE" --tn "$TRACKNUMBER" --tg "$GENRE" --ty "$DATE" --ta "$ARTIST" --tl "$ALBUM" --ty "$YEAR" --preset insane - "$MP3FILE"
ERROR_STATUS="$?"
if [ -e "$META" ]
then
rm "$META"
fi
}
convert "$INPUT"
exit "$ERROR_STATUS"

3055
ppss

File diff suppressed because it is too large Load Diff

View File

@ -1,283 +0,0 @@
#!/bin/bash
DEBUG="$1"
VERSION="2.98"
TMP_DIR="/tmp/ppss"
PPSS=./ppss
PPSS_DIR=ppss_dir
export PPSS_DEBUG=1
HOST_ARCH=`uname`
SPECIAL_DIR=$TMP_DIR/root/special
. "$PPSS"
cleanup () {
unset RES1
unset RES2
GLOBAL_COUNTER=1
if [ ! "$DEBUG" = "debug" ]
then
for x in $REMOVEFILES
do
if [ -e ./$x ]
then
rm -r ./$x
fi
done
fi
if [ ! -z "$TMP_DIR" ] && [ -e "$TMP_DIR" ]
then
rm -rf "$TMP_DIR"
fi
}
parseJobStatus () {
TMP_FILE="$1"
RES=`grep "Status:" "$JOBLOG/$TMP_FILE"`
STATUS=`echo "$RES" | awk '{ print $2 }'`
echo "$STATUS"
}
get_item_count_of_input_file () {
if [ -e "$PPSS_DIR/INPUT_FILE-$$" ]
then
CONTENTS_OF_INPUTFILE=`cat $PPSS_DIR/INPUT_FILE-$$ | wc -l | awk '{ print $1 }'`
echo "$CONTENTS_OF_INPUTFILE"
else
echo "Error, file $PPSS_DIR/INPUT_FILE-$$ does not exist."
fi
}
oneTimeSetUp () {
JOBLOG=./$PPSS_DIR/job_log
INPUTFILENORMAL=test-normal.input
INPUTFILESPECIAL_DIR=test-special.input
LOCALOUTPUT=ppss_dir/PPSS_LOCAL_OUTPUT
REMOVEFILES="$PPSS_DIR test-ppss-*"
if [ ! -e "$TMP_DIR" ]
then
mkdir -p "$TMP_DIR"
fi
cleanup
}
testVersion () {
assertEquals "Version mismatch!" "$VERSION" "$SCRIPT_VERSION"
}
rename-ppss-dir () {
TEST="$1"
if [ -e "$PPSS_DIR" ] && [ -d "$PPSS_DIR" ] && [ ! -z "$TEST" ]
then
mv "$PPSS_DIR" test-ppss-"$TEST"
fi
}
oneTimeTearDown () {
if [ ! "$DEBUG" == "debug" ]
then
cleanup
fi
}
createDirectoryWithSomeFiles () {
ROOT_DIR=$TMP_DIR/root
CHILD_1=$ROOT_DIR/child_1
CHILD_2=$ROOT_DIR/child_2
if [ ! -e "$ROOT_DIR" ]
then
mkdir -p "$ROOT_DIR"
fi
if [ ! -e "$CHILD_1" ]
then
mkdir -p "$CHILD_1"
fi
if [ ! -e "$CHILD_2" ]
then
mkdir -p "$CHILD_2"
fi
for x in {1..10}
do
touch "$ROOT_DIR/file-$x"
touch "$CHILD_1/file-$x"
touch "$CHILD_2/file-$x"
done
ln -s /etc/resolve.conf "$ROOT_DIR" 2> /dev/null
ln -s /etc/hosts "$ROOT_DIR" 2> /dev/null
}
createSpecialFilenames () {
ERROR=0
mkdir -p "$SPECIAL_DIR"
touch "$SPECIAL_DIR/a file with spaces"
touch "$SPECIAL_DIR/a\\'file\\'with\\'quotes"
touch "$SPECIAL_DIR/a{file}with{curly}brackets}"
touch "$SPECIAL_DIR/a(file)with(parenthesis)"
touch "$SPECIAL_DIR/a\\file\\with\\backslashes"
touch "$SPECIAL_DIR/a!file!with!exclamationmarks"
touch "$SPECIAL_DIR/a filé with special characters"
touch "$SPECIAL_DIR/a\"file\"with\"double\"quotes"
}
testMD5 () {
export USE_MD5=1
init_vars > /dev/null 2>&1
ARCH=Darwin
set_md5
assertEquals "MD5 executable not set properly - $MD5" "$MD5" "md5"
ARCH=Linux
set_md5
assertEquals "MD5 executable not set properly - $MD5" "$MD5" "md5sum"
ARCH=$HOST_ARCH
}
init_get_all_items () {
DIR="$1"
TRAVERSAL="$2"
createDirectoryWithSomeFiles
create_working_directory
export SRC_DIR=$DIR
init_vars > /dev/null 2>&1
get_all_items
}
testRecursion () {
init_get_all_items $TMP_DIR/root 1
RESULT=`get_item_count_of_input_file`
EXPECTED=32
assertEquals "Recursion not correct." "$EXPECTED" "$RESULT"
rename-ppss-dir $FUNCNAME
}
testNoRecursion () {
init_get_all_items $TMP_DIR/root 0
RESULT=`get_item_count_of_input_file`
EXPECTED=12
assertEquals "Recursion not correct." "$EXPECTED" "$RESULT"
rename-ppss-dir $FUNCNAME
}
testGetItem () {
createSpecialFilenames
init_get_all_items $TMP_DIR/root 1
get_item
if [ -z "$ITEM" ]
then
ERROR=1
else
ERROR=0
fi
EXPECTED=0
assertEquals "Get item failed." "$EXPECTED" "$ERROR"
i=1
ERROR=0
while get_item
do
((i++))
done
EXPECTED=40
assertEquals "Got wrong number of items." "$EXPECTED" "$i"
rename-ppss-dir $FUNCNAME
cleanup
}
return_all_items () {
while get_item
do
ALL_ITEMS="$ALL_ITEMS$ITEM"$'\n'
done
echo "$ALL_ITEMS"
}
testNumberOfItems () {
createSpecialFilenames
RESULT=`init_get_all_items $TMP_DIR/root 1`
RES1=`find $TMP_DIR/root/ ! -type d`
RES2=`return_all_items`
echo "$RES1" > a
echo "$RES2" > b
assertEquals "Input file and actual files not the same!" "$RES1" "$RES2"
rename-ppss-dir $FUNCNAME
}
testInvalidProcessingOfitemVariable() {
createSpecialFilenames
init_get_all_items $TMP_DIR/root 1
COMMAND='echo $ITEM'
while get_item
do
commando "$ITEM"
done
RESULT=$(grep '$ITEM' $PPSS_DIR/job_log/*)
EXPECTED=""
assertEquals "Got incorrect processing of ITEM variable." "$EXPECTED" "$RESULT"
rename-ppss-dir $FUNCNAME
}
testNumberOfLogfiles () {
createSpecialFilenames
init_get_all_items $TMP_DIR/root 1
COMMAND='echo hoi'
while get_item
do
commando "$ITEM"
done
RESULT=`ls -1 $PPSS_DIR/job_log/ | wc -l | awk '{ print $1}'`
EXPECTED=40
assertEquals "Got wrong number of log files." "$EXPECTED" "$RESULT"
rename-ppss-dir $FUNCNAME
}
testUserInputFile () {
cleanup
INPUT_FILE=test-special.input
create_working_directory
init_vars > /dev/null 2>&1
get_all_items
RESULT=`return_all_items`
ORIGINAL=`cat $INPUT_FILE`
assertEquals "User input processing not ok." "$RESULT" "$ORIGINAL"
rename-ppss-dir $FUNCNAME
}
. ./shunit2

View File

@ -1,14 +0,0 @@
REMOTE_OUTPUT_DIR=/mnt/mp3
SSH_KEY=ppss-key.dsa
SSH_KNOWN_HOSTS=known_hosts
SRC_DIR=/mnt/wav
COMMAND='./wav2mp3.sh "$ITEM" "$OUTPUT_DIR"'
NODES_FILE=nodes.txt
SSH_SERVER=10.0.1.110
USER=ppss
SCRIPT=wav2mp3.sh
RANDOMIZE=1
DOWNLOAD_TO_NODE=0
UPLOAD_TO_SERVER=0
SECURE_COPY=1
PPSS_DEBUG=1

1116
shunit2

File diff suppressed because it is too large Load Diff

View File

@ -1,26 +0,0 @@
test-a
test-b
test-c
test-d
test-e
test-f
test-g
test-h
test-i
test-j
test-k
test-l
test-m
test-n
test-o
test-p
test-q
test-r
test-s
test-t
test-u
test-v
test-w
test-x
test-y
test-z

View File

@ -1,8 +0,0 @@
\'file-!@#$%^&*()_ +=-0987654321~\'
\'file-/\<>?:;'{}[]\'
file-/\/\:\/!@#$%^&*()_+=-0987654321~
file-42>424>424<2424>424?24<24>24
file-/\<>?:;'{}[]
http://www.google.nl
ftp://storage.nl
./flac/Bééthoven Overtures CD2/01 - Beethoven, Lv - Leonore I - Op.138.flac

View File

@ -1,124 +0,0 @@
#!/bin/bash
INPUT="$1"
RESOLUTION="$3"
SEPARATE="$4"
TITLES=0
OPTS_HIGHRES="-e x264 -q 20.0 -r 29.97 --pfr -a 1 -E faac -B 160 -6 dpl2 -R Auto -D 0.0 -f mp4 -4 -X 1024 --strict-anamorphic -m"
OPTS_LOWRES="-e x264 -q 20.0 -a 1 -E faac -B 128 -6 dpl2 -R 48 -D 0.0 -f mp4 -X 480 -m -x cabac=0:ref=2:me=umh:bframes=0:subme=6:8x8dct=0:trellis=0"
OPTS_SOURCE="-e x264 -q 20.0 -a 1,1 -E faac,ac3 -l 576 -B 160,160 -6 dpl2,auto -R Auto,Auto -D 0.0,0.0 -f mp4 --detelecine --decomb --strict-anamorphic -m -x b-adapt=2:rc-lookahead=50"
MODE=""
HANDBRAKE=HandBrakeCLI
DIRNAME=`dirname "$INPUT"`
BASENAME=`basename "$INPUT"`
OPTS=""
OUTPUT_DIR="$2"
OUTPUT_FILE_NAME=""
if [ -z "$INPUT" ]
then
echo "Usage: $0 <input file / folder> <output folder> <highres|lowres|source> [separate]"
echo
echo "Input is either a file, a VIDEO_TS directory or an .ISO"
echo
echo -e "highres:\t1024 x 576"
echo -e "lowres:\t\t480 x 320"
echo -e "source:\t\tsame as source."
echo
echo -e "separate:\tseparate files for episodes of a series."
exit 1
fi
if [ -n "$OUTPUT_DIR" ]
then
if [ ! -d "$OUTPUT_DIR" ]
then
echo "Output directory does not exist or is not a directory."
exit 1
fi
else
echo "Output to current directory."
OUTPUT_DIR="."
fi
if [ ! -e "$INPUT" ]
then
echo "$INPUT does not exist!"
exit 1
fi
if [ -d "$INPUT" ]
then
MODE=DIR
else
MODE=FILE
fi
echo "Input type is $MODE"
case "$RESOLUTION" in
highres|HIGHRES )
OPTS="$OPTS_HIGHRES" ;;
lowres|LOWRES )
OPTS="$OPTS_LOWRES" ;;
source|SOURCE )
OPTS="$OPTS_SOURCE" ;;
*)
echo "Resolution must be 'highres', 'source' or 'lowres'."
exit 1
;;
esac
function titles () {
TITLES=`./$HANDBRAKE -t 0 -i "$INPUT" 2>&1 | grep "+ title" | awk '{ print $3 }' | sed s/://g`
echo $TITLES
}
if [ "$MODE" = "FILE" ]
then
mkdir -p "$OUTPUT_DIR/$DIRNAME"
OUTPUT_FILE_NAME="$OUTPUT_DIR/$DIRNAME/${BASENAME%.*}"
elif [ "$MODE" = "DIR" ]
then
if echo "$INPUT" | grep -qi video_ts
then
INTERMEDIATE2=`basename "$DIRNAME"`
mkdir -p "$OUTPUT_DIR/$DIRNAME"
OUTPUT_FILE_NAME="$OUTPUT_DIR/$DIRNAME/$INTERMEDIATE2"
else
INTERMEDIATE2="$BASENAME"
mkdir -p "$OUTPUT_DIR/$DIRNAME/$INTERMEDIATE2"
OUTPUT_FILE_NAME="$OUTPUT_DIR/$DIRNAME/$INTERMEDIATE2/$INTERMEDIATE2"
fi
echo "INTERMEDIATE2 = $INTERMEDIATE2"
else
echo "Mode is not determined..."
exit 1
fi
if [ "$SEPARATE" = "separate" ]
then
TITLES=$(titles "$INPUT")
echo "TITLES = $TITLES"
ERROR=0
for x in $TITLES
do
HandBrakeCLI $OPTS -i "$INPUT" -o "$OUTPUT_FILE_NAME-$x.mp4"
if [ ! "$?" = "0" ]
then
ERROR="1"
fi
done
exit "$ERROR"
else
echo "Creating a single file."
HandBrakeCLI $OPTS -i "$INPUT" -o "$OUTPUT_FILE_NAME.mp4"
fi

@ -1,19 +0,0 @@
#!/usr/bin/env bash
SRC="$1"
DEST="$2"
TYPE=$(file -b "$SRC")
if ! echo "$TYPE" | grep -q "WAVE audio"
then
echo "File $SRC is not a wav file..."
echo "Type is $TYPE"
exit 0
fi
BASENAME=`basename "$SRC"`
# Use parameter expansion directly so whitespace in the file name is preserved.
MP3FILE="${BASENAME%wav}mp3"
lame --quiet --preset insane "$SRC" "$DEST/$MP3FILE"
exit "$?"

18
whynotxargs.md Normal file

@ -0,0 +1,18 @@
# Introduction #
It has been suggested that the command 'xargs', which is found by default on most unix-like systems, does exactly the same as PPSS when used with the -P option.
In its most basic form, this is true to some extent. Xargs processes items and keeps a fixed number of jobs running in parallel. I think there are cases where xargs is sufficient for the task at hand. A simple example that demonstrates how xargs can be used:
`$ touch 10 15 20 25 30 35 40`
`$ ls -1 | xargs -n1 -P 4 sleep`
The additional value of PPSS is that it:
* provides logging (for free)
* provides a progress indicator
* is simpler to use (in my opinion)
* does not reprocess items that have already been processed when it is restarted after an interruption.
However, use the tool that best fits the job at hand.
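
To give an idea of what the differences listed above mean in practice, here is a minimal sketch of the extra scripting needed to get per-item logging and skip-on-restart behaviour out of plain xargs. It is not how PPSS is implemented. `LOG_DIR`, the inline worker and the `echo` that stands in for real work are assumptions made for this example, and it relies on an xargs that supports `-P`.

```shell
#!/bin/sh
# Illustrative only: LOG_DIR and the inline worker are made up for this
# example; they are not PPSS internals.
LOG_DIR="${LOG_DIR:-./job_log}"
export LOG_DIR
mkdir -p "$LOG_DIR"

# Same shape as the xargs example: one item per job, four jobs in parallel.
# Items are assumed to be plain names without slashes or quotes.
printf '%s\n' 10 15 20 | xargs -n1 -P 4 sh -c '
    log="$LOG_DIR/$1.log"
    # A finished log means an earlier run already handled this item.
    [ -e "$log" ] && exit 0
    # "echo" stands in for the real command; keep the log only on success.
    if echo "processed $1" > "$log.tmp" 2>&1
    then
        mv "$log.tmp" "$log"
    fi
' _
```

Running the script a second time does nothing, because every item already has a log file. There is still no progress indicator, and items with unusual names would need more care.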