A detailed explanation based on examples will follow.
= How to use PPSS =
PPSS allows a user to execute commands, scripts or programs in parallel. That's it. Its sole purpose is to turn a batch job into a parallel batch job. This is relevant because modern processors are almost always multi-core and are designed to process jobs in parallel, so why not use that?
PPSS runs a command once for each item. Items can be one of two things:
* files within a user-specified directory
* arbitrary lines of text within a file
When PPSS has finished, it has produced a log file of its operation. By default, this file is called ppss-log.txt.
In addition, a directory is created, by default JOB_LOG. This directory contains a log file for each item that has been processed. If a log file is present for an item when PPSS is re-run, that item is skipped.
== Basic command line options ==
Before discussing the full list of command line options, an example is given of how to run PPSS in its simplest form, with the fewest possible options. In this example, some files are compressed with gzip.
In this example, we can distinguish a 'mode' and two options. The mode, standalone, speaks for itself: PPSS is not part of a cluster, it is just running on the host.
The -d option specifies the directory where the files reside that must be processed.
The -c option specifies the command that will be executed by PPSS in parallel for each file within the directory specified by -d. In this example the command has a *trailing space*, which is necessary since the command will expand to 'gzip example.tar' when executed. If the space is omitted, an error will occur.
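To see why the trailing space matters, the sketch below mimics the expansion described above by simply joining the command and the item. The helper `build_command` is hypothetical and is not PPSS's own code; it only shows the string that results. With the mode and options described here, an invocation would look something like `./ppss.sh standalone -d ./files -c 'gzip '`, where `./files` is an assumed path.

```shell
# Illustration only: joins command and item the way the text describes.
# build_command is a hypothetical helper, not part of PPSS.
build_command() {
    printf '%s%s\n' "$1" "$2"
}

build_command 'gzip ' 'example.tar'   # command with a trailing space
build_command 'gzip' 'example.tar'    # command without a trailing space
```

The first call prints `gzip example.tar`; the second prints `gzipexample.tar`, which is not a command that exists.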
Sometimes, the item should not be appended to the command, but inserted somewhere in the middle. This is possible by using the placeholder "$ITEM". See the following example:
In this example, a list of URLs is provided by the file list.txt. These URLs are fed to wget, which retrieves them. The -p option specifies that 5 parallel downloads or threads should be started. Of course, this command can also be written like this:
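A minimal sketch of the two forms, the placeholder form and the append form, is shown here. It assumes the command is expanded with `eval` while a variable named ITEM is set; whether PPSS does this internally in the same way is an assumption. The function `run_template` and the URL are hypothetical, and `echo` is used so that nothing is actually downloaded.

```shell
# Sketch only: PPSS's real internals may differ.
# run_template is a hypothetical helper.
#   $1: the command as it would be given to -c (single-quoted, so $ITEM stays literal)
#   $2: the current item
run_template() {
    ITEM=$2
    case $1 in
        *'$ITEM'*) eval "$1" ;;              # placeholder form: $ITEM expands here
        *)         eval "$1\"\$ITEM\"" ;;    # append form: item goes at the end
    esac
}

run_template 'echo wget -q "$ITEM"' 'http://example.com/a.iso'
run_template 'echo wget -q ' 'http://example.com/a.iso'
```

Both calls print the same command line. The single quotes around the command are what keep the shell from expanding `$ITEM` before it is handed over.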
Some commands require that you specify an output file. An example of such a command or program is the Lame mp3 encoder. Since the output file must be unique for each item, the output file name must be based on a variable. Like this:
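As a sketch of how such a unique output name can be derived from the item, the snippet below replaces the .wav extension with .mp3. The helper `mp3_name` and the path are hypothetical, and `echo` is used so lame does not need to be installed. Within -c, the simplest variant would presumably be something like `'lame "$ITEM" "$ITEM.mp3"'`, which just appends .mp3 to the item.

```shell
# Hypothetical helper: derive a unique output file name from the item
# by replacing a trailing .wav with .mp3.
mp3_name() {
    printf '%s\n' "${1%.wav}.mp3"
}

ITEM='./wavs/20080602.wav'   # assumed example item

# Print the command that would be run for this item.
echo lame "$ITEM" "$(mp3_name "$ITEM")"
```

Because each output name is built from its own item, two parallel jobs never write to the same file.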
In this paragraph, some additional options are discussed.
*-p <manually configure the number of parallel processes>*
This option allows you to specify how many parallel processes should be started. Automatic detection of CPUs and cores is thus overridden. This is useful, for example, when downloading a bunch of files in parallel, or for other tasks that are not bound by the number of available CPUs.
If a CPU is found that supports hyper-threading, the additional logical cores are used. For example, an Intel Core i7 quad-core processor supports HT and thus effectively has 8 cores. When HT is enabled, not 4 but 8 parallel jobs are started.
Please note that this mechanism depends on what /proc/cpuinfo (Linux) reports. For example, an old dual-CPU P3 doesn't report the 'physical id' section, so if HT is disabled (why would you do that anyway) only one processor is used. So test this option if you need it.
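To check what your own system reports, something like the following can be used. It is a sketch, not the detection code of PPSS: the function `count_cpus` is hypothetical, and it assumes the Linux cpuinfo layout with 'processor' and 'physical id' lines.

```shell
# Hypothetical helper: count logical processors and distinct physical ids
# in a cpuinfo-style file (defaults to /proc/cpuinfo on Linux).
count_cpus() {
    cpuinfo=${1:-/proc/cpuinfo}
    # One 'processor' line per logical CPU.
    logical=$(grep -c '^processor' "$cpuinfo" || true)
    # Distinct 'physical id' lines; 0 if the section is not reported at all.
    physical=$(grep '^physical id' "$cpuinfo" | sort -u | grep -c . || true)
    echo "logical=$logical physical=$physical"
}

if [ -r /proc/cpuinfo ]; then
    count_cpus /proc/cpuinfo
fi
```

If physical comes back as 0, the system does not report the 'physical id' section, which is the situation described above.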
A config file is created when PPSS is called with the 'config' mode. In this mode, PPSS does not execute any job. Instead, all command line options are used to create a config file. An example:
Unrarring some files in parallel can be as easy as:
`./ppss.sh standalone -d ./dir-with-rars -c 'unrar x "$ITEM" ./output-dir'`
However, this results in all extracted files being dumped in the directory output-dir, which may not be what you want. If you want to extract the files contained within each RAR file into its own directory, two steps are needed:
# Create a directory for each item in ./output-dir
# Unrar the files into the individual directories.
Step 1: making directories based on the name of the RAR file:
Explanation: by default, each item consists of the full or relative path to that item. An item will expand as "./dir-with-rars/filename.rar". However, the directory name must be based only on the filename. So the Unix utility 'basename' is used to extract the filename from the item and use it to create the directory name.
As you can see, it is no problem to use multiple commands within the -c option, by using ';'.
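The idea of step 1 can be sketched as follows. The item and the output location are hypothetical; a temporary directory stands in for ./output-dir so the snippet can be run anywhere.

```shell
# Hypothetical item and output location, for illustration only.
ITEM='./dir-with-rars/filename.rar'
OUTPUT_DIR=$(mktemp -d)

# basename strips the leading directories, leaving 'filename.rar'.
NAME=$(basename "$ITEM")

# Two commands chained with ';', as they could appear inside -c.
mkdir -p "$OUTPUT_DIR/$NAME"; echo "created directory for $NAME"
```

This prints `created directory for filename.rar`, and the new directory is named after the file only, not after its full path.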
Step 2: extracting the files of each RAR file into its own directory.
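A sketch of the command line that step 2 needs for a single item is given below. The function `extract_command` is hypothetical and only prints the command, so unrar does not have to be installed. It assumes the per-item directories from step 1 exist under ./output-dir, and it relies on unrar treating a last argument that ends in a slash as the destination directory.

```shell
# Hypothetical helper: print the unrar command for one item, extracting
# into a directory named after the RAR file itself.
extract_command() {
    ITEM=$1
    printf 'unrar x "%s" "./output-dir/%s/"\n' "$ITEM" "$(basename "$ITEM")"
}

extract_command './dir-with-rars/filename.rar'
```

This prints `unrar x "./dir-with-rars/filename.rar" "./output-dir/filename.rar/"`.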
PPSS produces two kinds of log files:
* the log file of PPSS itself
* the log file of each individual item that is processed
_PPSS log file_
The logfile of PPSS is by default ppss-log.txt. A different name can be chosen with the -l option. It contains all relevant information about what PPSS is doing.
_Item log file_
When an item is processed, any output that is generated is logged in its individual log file. This log file resides in the directory job_log, which is created in the directory from which PPSS is executed.
An example of the output of a single log file for a single item is shown below:
{{{
===== PPSS Item Log File =====
Host: imac-2.local
Item: PPSS_LOCAL_TMPDIR/20080602.wav
Start date: Mar 03 00:10:32
Encode of PPSS_LOCAL_TMPDIR/20080602.wav successful.
Status: Succes - item has been processed.
Elapsed time (h:m:s): 0:4:48
}}}
If you tailor your command the right way, or create a (small) script, it is very easy to determine which items have not been processed correctly. A simple grep on 'error' might already give a clue.
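Such a grep could look like the sketch below. The log directory is a temporary stand-in for job_log, and both the file names and the 'Error:' line are made up for the illustration; what your command actually prints on failure determines what to search for.

```shell
# Stand-in for the item log directory, filled with two made-up item logs.
LOG_DIR=$(mktemp -d)
printf 'Status: Succes - item has been processed.\n' > "$LOG_DIR/a.wav"
printf 'Error: could not open input file\n' > "$LOG_DIR/b.wav"

# -i: ignore case, -l: print only the names of the files that match.
FAILED=$(grep -il 'error' "$LOG_DIR"/*)
echo "$FAILED"
```

Only the log file of the second item is listed, so that is the item to look at.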
PPSS skips an item if a log file for it is present in the job_log directory. This allows you to interrupt PPSS and continue where you left off. If you want to process all items again, just remove the job_log directory.
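The skip behaviour can be pictured with the following sketch. It is not the code PPSS uses: `process_item` is a hypothetical function, a temporary directory stands in for job_log, and naming the log file after the item's filename is an assumption.

```shell
# Stand-in for the job_log directory.
JOB_LOG=$(mktemp -d)

# Hypothetical helper: process an item only if it has no log file yet.
process_item() {
    item=$1
    log="$JOB_LOG/$(basename "$item")"
    if [ -e "$log" ]; then
        echo "skipping $item"
        return 0
    fi
    echo "processing $item"
    echo "done" > "$log"      # the log file marks the item as processed
}

process_item ./files/a.txt    # first run: the item is processed
process_item ./files/a.txt    # second run: the item is skipped
```

Removing the log files makes the next run process everything again.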
== Other things you should be aware of ==
ppss.sh must be run inside a file system that supports file locking. The data to process can, however, reside on a non-locking file system.
PPSS controller and intermediate output, such as ppss.sh_is_running, JOB_LOG, the PPSS_* directories and ppss-array-pointer, is written to the current directory, the same directory that holds ppss.sh. This means that one copy of ppss.sh cannot be shared: each ppss.sh run must use its own copy of the ppss.sh file.
Q: Is it possible to modify the program to write
all these files to a user-specified directory instead?
A: As requested, this feature will be implemented, one way or the other.