#summary How distributed PPSS will look like #labels Phase-Design = Introduction = The goal is to make PPSS distributed. So a large number of host can be used to process items, not just a single host (node). These nodes will share a single list of items that they will process in parallel. To keep track of which items have been processed by a node, nodes must be able to communicate with each other. Therefore, a server is necessary. The primary role of the server is just a communication channel for nodes. Nodes use the server to signal to other nodes that an item is being processed is processed. So two nodes will never process the same file. http://home.quicknet.nl/mw/prive/nan1/img/ppss.png The secondary role of the server is to act as a file server. Assuming that files are processed, files stored on the PPSS server are transfered to the node, that will proces a file and store the result back on the server. PPSS is very flexible: the file server can be a different host than the PPSS server that is used for inter-node communication. = Design considerations = == Node installation == Installing PPSS on a larger number of hosts will become an appalling boring repetitive and time consuming task if this is performed manually. The following actions must be performed on a node: * create a ppss user or use an existing *unprivileged* system account (do not use root). * copy ppss.sh to the node. * copy privade ssh key to the node (in a secure way). * create a crontab entry to start ppss automatically on the node. I have to come up with a simple solution to automate this process. Currently, I'm considering creating a separate tool that deploys PPSS. This tool should use a list of IP addresses and/or host names as input and install PPSS on those hosts using SSH and SCP. However, this functionality could also be implemented as a function within PPSS. == Node control == If a larger number of nodes are used, say more than five to ten, it will be a hassle to control these nodes individually by hand. The question is how to control all nodes without having to access nodes manually. Starting new jobs, pausing and stopping jobs should be controlled from a central location. The easiest solution that comes to mind is to use a central configuration or instruction file on the central server. Nodes will be able to access this file, read its contents and act accordingly. == Node status == If a larger number of nodes are used, it would be nice if some simple overview could be generated about the current status of nodes and the overal progress of the entire process. == Locking of items through SSH == According to many sources on the Internet, the only reliable solution to *atomic* locking is to use the 'mkdir' command to create a file. The fun thing is that this is also true if 'mkdir' is executed through SSH. So a node tries to lock a file by issueing a mkdir on the server through SSH. If this mkdir fails, the directory and thus the lock already exists and the next item in the list is tried. == Item (file) distribution == If items are files that need to be processed, they can be accessed in two ways: * using a network file system such as NFS or SMB or other. The -d option must point to the mountpoint of this share. * using scp within scripts to (securely) copy items (files) to the local host and copy the processed items back to the server. Please note that copying files using scp is more resource intensive (CPU) than SMB or NFS. When using PPSS in a distributed fashion, it should be decided if files can be processed in-place on the file server through the share, or that they must be copied to the node first before being processed. == Requirements == * A central server for inter-node communication (item locking). * Accessible through SSH. * A central server for file distribution (optional). * Sufficient bandwidth (gigabit? totally depends on your needs.). * SCP / NFS / SMB share for distributing files. * One or more slaves. * Must support Bash shell. Please note that it is *NOT* required to run PPSS on the central Master server. Only slaves need PPSS installed. Also the central server for inter-node communication (item locking) can (and will often be) the same host as the file server.