#summary PPSS Manual (Stand-alone)
#labels Phase-Deploy
= Introduction =
This page discusses the usage of PPSS on a single host. Examples show how PPSS is used.
= How to use PPSS =
PPSS allows a user to execute commands, scripts or programs in parallel. That's it. Its sole purpose is to turn a batch job into a parallel batch job. This is relevant because modern processors are almost always multi-core and designed to run jobs in parallel, so why not use them?
PPSS runs the given command once for each item. Items can be one of two things:
* files within a user-specified directory
* arbitrary lines of text within a file
When PPSS has finished, it has produced a log file of its operation. By default, this file is called ppss-log.txt.
A directory is also created, by default JOB_LOG. This directory contains a log file for each item that has been processed. If a log file is present for an item and PPSS is re-run, that item will be skipped.
== Basic command line options ==
Before discussing the full list of command line options, here is an example of how to run PPSS in its simplest form, with the fewest options.
`$ ./ppss.sh standalone -d /path/to/files -c 'gzip '`
In this example, we can distinguish a 'mode' and two options. The mode speaks for itself: PPSS is not part of a cluster, it is just running on the host.
The -d option specifies the directory where the files reside that must be processed.
The -c option specifies the command that will be executed by PPSS in parallel for each file within the directory specified by -d. In this example the command has a *trailing space*, which is necessary since the command will expand to 'gzip example.tar' when executed. If the space is omitted, an error will occur.
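The effect of the trailing space can be illustrated with a minimal sketch. It assumes that the item is simply appended to the command string; `build_command` is a hypothetical helper written for this illustration, not part of PPSS.

```shell
# Hypothetical helper, not PPSS code: append an item to a -c command string.
# Assumption: the item is concatenated directly onto the end of the command.
build_command() {
    # $1 = command string as given to -c, $2 = item (a file name)
    printf '%s%s\n' "$1" "$2"
}

with_space=$(build_command 'gzip ' 'example.tar')
without_space=$(build_command 'gzip' 'example.tar')

echo "$with_space"      # a command followed by its argument
echo "$without_space"   # command and item run together into one word
```

Without the space, the shell would look for a program whose name is the command and the item joined together, which is why an error occurs.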
Sometimes, the item should not be appended to the command, but inserted somewhere in the middle. This is possible by using the placeholder "$ITEM". See the following example:
`$ ./ppss.sh standalone -d /path/to/files -c 'cp "$ITEM" /destination/dir '`
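The following sketch shows one way such a placeholder could work. It is an assumption for illustration only: `run_for_item` is a hypothetical helper, and the actual substitution logic inside PPSS may differ.

```shell
# Hypothetical helper, not PPSS code: run a command string for a single item.
run_for_item() {
    command_string=$1
    ITEM=$2
    case $command_string in
        *'$ITEM'*)
            # Placeholder present: the shell expands $ITEM when the string is evaluated.
            eval "$command_string"
            ;;
        *)
            # No placeholder: append the quoted item to the end of the command.
            eval "$command_string\"\$ITEM\""
            ;;
    esac
}

# Scratch files so that the sketch does not touch real data.
workdir=$(mktemp -d)
mkdir "$workdir/dest"
echo "data" > "$workdir/my file.txt"

# Placeholder form: the item is inserted in the middle of the command.
run_for_item 'cp "$ITEM" "$workdir/dest/"' "$workdir/my file.txt"

# Append form: the item is added after the trailing space.
appended=$(run_for_item 'echo ' 'hello')
```

Note the single quotes around the command: they keep "$ITEM" literal, so it is expanded per item rather than by the shell that starts the run.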
Another example is the use of an input file instead of a directory. Such a file is specified with the -f option.
`$ ./ppss.sh standalone -f list-of-urls.txt -c 'wget -q '`
In this example, the file list-of-urls.txt provides a list of URLs. These URLs are fed to wget, which retrieves them. The number of parallel downloads can be set with the -p option; for instance, -p 5 starts 5 parallel downloads or threads. Of course, this command can also be written like this:
`$ ./ppss.sh standalone -f list-of-urls.txt -c 'wget -q "$ITEM"'`
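To make the idea of "one line, one item, several at a time" concrete, here is a rough sketch in plain shell. It is not how PPSS schedules work: it uses simple batching with background jobs and `wait`. The `fetch` function is a hypothetical stand-in for wget, the URLs are made up, and `max_parallel` plays the role of the -p value.

```shell
# Hypothetical stand-in for 'wget -q': writes one marker file per item.
fetch() {
    # $1 = item (a URL), $2 = output directory
    name=$(printf '%s' "$1" | tr '/:' '__')
    echo "fetched $1" > "$2/$name"
}

max_parallel=2                 # assumed limit, comparable to the -p option
listfile=$(mktemp)             # stands in for list-of-urls.txt
outdir=$(mktemp -d)
printf '%s\n' 'http://example.com/a' 'http://example.com/b' 'http://example.com/c' > "$listfile"

running=0
while IFS= read -r item; do
    fetch "$item" "$outdir" &          # each line of the file is one item
    running=$((running + 1))
    if [ "$running" -ge "$max_parallel" ]; then
        wait                           # let the current batch finish
        running=0
    fi
done < "$listfile"
wait                                   # wait for the last, possibly partial, batch

processed=$(ls "$outdir" | wc -l | tr -d ' ')
echo "processed $processed items"
```

Batching is the simplest approach to show here, but it leaves slots idle while the slowest job of a batch finishes.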
== Logging (must read) ==
There are two separate log mechanisms:
* the log file of PPSS itself
* the log file of each individual item that is processed
_PPSS log file_
The PPSS log file is called ppss-log.txt by default. A different name can be chosen with the -l option. It contains all relevant information about what PPSS is doing.
_Item log file_
When an item is processed, any output that is generated is logged in its individual log file. This log file resides in the directory job_log, which is created in the directory from which PPSS is executed.
If you tailor your command the right way, or create a (small) script, it is very easy to determine which items have not been processed correctly. A simple grep on 'error' might already give a clue.
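As a small sketch of such a check, the commands below list every item log that mentions 'error'. The directory, file names and log contents are made up for the illustration; real item logs contain whatever your command prints.

```shell
# Scratch directory standing in for the item log directory (hypothetical contents).
job_log=$(mktemp -d)
echo "a.tar compressed" > "$job_log/a.tar.log"
echo "ERROR: could not read b.tar" > "$job_log/b.tar.log"
echo "c.tar compressed" > "$job_log/c.tar.log"

# -i ignores case, -l prints only the names of the files that match.
failed=$(grep -il 'error' "$job_log"/*)
echo "$failed"
```

This only works if the command prints something recognisable on failure, which is why tailoring the command or wrapping it in a small script helps.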
*Important:* If a log file exists for an item, and PPSS is run again, that item will be skipped. This allows you to interrupt PPSS and continue where you left off. If you want to process all items again, just remove the job_log directory.
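The skip rule can be pictured with the sketch below. It is only an illustration of the behaviour described above: `process_once` is a hypothetical function, and naming the log file after the item's base name is an assumption.

```shell
# Hypothetical sketch of the resume rule, not PPSS code.
process_once() {
    # $1 = item, $2 = item log directory
    item=$1
    logdir=$2
    logfile="$logdir/$(basename "$item")"   # assumed naming scheme
    if [ -e "$logfile" ]; then
        echo "skip $item"                   # a log exists: the item is not processed again
        return 0
    fi
    echo "output for $item" > "$logfile"    # command output goes to the item log
    echo "run $item"
}

logdir=$(mktemp -d)
first=$(process_once /data/a.tar "$logdir")    # first run: processed
second=$(process_once /data/a.tar "$logdir")   # re-run: skipped

# Removing the log directory makes every item eligible again.
rm -rf "$logdir"
mkdir "$logdir"
third=$(process_once /data/a.tar "$logdir")
```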