synchronously instead of probabilistically scheduling jobs, which
means that the job load on a machine never exceeds a desired
threshold, and we can preferentially use faster machines when they are
available. This has a dramatic effect on package build throughput,
although I don't yet have precise measurements of the performance
improvements.
Specifically, the changes are:
* Introduce the new variable maxjobs in portbuild. This replaces the
build scheduling weights previously listed in the mlist file, which
now changes format to list the build machines only, ranked in order of
preference for job dispatches (i.e. faster machines first).
* The ${arch}/queue directory is used to list machines available for
jobs (file content is the number of jobs currently running on the
machine). Changes to files in this directory are serialized using
lockf on the .lock file.
* Claim a machine with the getmachine script, with the .lock held.
This picks the machine with the fewestnumber of jobs running, which is
listed highest in the mlist file in case of multiple machines with
equal load. The job counter is incremented, and the file removed if
the counter reaches ${maxjobs} for that machine. If all machines are
busy, sleep for 15 seconds and retry.
* After we have claimed a machine, we run claim-chroot on it to claim
an empty chroot, as before. If the claim fails, release the job from
the queue with the releasemachine script and retry after a 15 second
wait.
* When the build is finished, decrement the job counter with the
releasemachine script, with .lock held.
* The checkmachines script now exists only to poll the load averages
for admin convenience (every 2 minutes), and to ping for unreachable
machines. When a machine cannot be reached, remove the entry in the
queue directory to stop further job dispatches to it. This needs more
work to deal with reinitialization of machines after they become
available again.
Additional changes to this file:
* Exit if passed a null package name, to avoid badness later on
* Send a nag-mail if pkg-plist errors are detected in the build
ssh times out)
* Support new portbuild.conf settings:
client_user = user to connect to on the client (not necessarily root)
sudo_cmd = If ssh'ing to a non-root user, run this command to gain
root privs (set to empty string for client_user=root,
or sudo for !root). Cannot require interactivity, of
course.
Approved by: portmgr (self)
* Clients no longer have ssh access to the master, so we need to
push/pull everything on the client from here. This means we need to
know where the build took place so we can go in and get the files
after it finishes. Introduce the claim-chroot script which
atomically claims a free chroot directory on the host and returns
the name. This directory is later populated by the portbuild script
if it does not already contain an extracted bindist.
* Use the per-node portbuild.$(hostname) config file to decide where
in the filesystem to claim the chroot on the build host.
* If a port failed unexpectedly (i.e. is not marked BROKEN), or if
something strange happened when trying to pull in build results from
a client, then send me email (XXX should be configurable).
* Clean up after the build finishes and we have everything we need, by
dispatching the clean-chroot script on the client.
- Increase timeout to 8 hours (this needs to be made per-arch so it
doesn't overly pessimize fast client machines)
- Support building as a non-privileged user
* Shorten timeout period from 12 hours to 4 hours to avoid delaying the builds
unnecessarily.
* Reverse sense of NOPLISTCHECK -> PLISTCHECK, since it's not an option
we want enabled by default (it causes too many build failures). This
was too easy to forget when building packages 'by hand' using the parallel
makefile.
behind. Useful for debugging.
Touch package on master after copying it back. This will avoid synchronization
problems when the machines' clocks are wildly skewed.
Remove log from master when build is successful. No need to keep around
transient error logs.
Tools/portbuild for details.
Note that this is still a major work in progress. I probably forgot
something but I need to go to sleep. At least it works here (most of
the time :).