Rework get_ports.sh to make it not use a lock file

The get_ports.sh script is used for determining the range of ports a
given system test should use.  It first determines the start of the port
range to return (the base port); it can either be specified explicitly
by the caller or chosen randomly.  Subsequent ports are picked
sequentially, starting from the base port.  To ensure no single port is
used by multiple tests, a state file (get_ports.state) containing the
last assigned port is maintained by the script.  Concurrent access to
the state file is protected by a lock file (get_ports.lock); if one
instance of the script holds the lock file while another instance tries
to acquire it, the latter retries its attempt to acquire the lock file
after sleeping for 1 second; this retry process can be repeated up to 10
times before the script returns an error.

There are some problems with this approach:

  - the sleep period in case of failure to acquire the lock file is
    fixed, which leads to a "thundering herd" type of problem, where
    (depending on how processes are scheduled by the operating system)
    multiple system tests try to acquire the lock file at the same time
    and subsequently sleep for 1 second, only for the same situation to
    likely happen the next time around,

  - the lock file is being locked and then unlocked for every single
    port assignment made, not just once for the entire range of ports a
    system test should use; in other words, the lock file is currently
    locked and unlocked 13 times per system test; this increases the
    odds of the "thundering herd" problem described above preventing a
    system test from getting one or more ports assigned before the
    maximum retry count is reached (assuming multiple system tests are
    run in parallel); it also enables the range of ports used by a given
    system test to be non-sequential (which is a rather cosmetic issue,
    but one that can make log interpretation harder than necessary when
    test failures are diagnosed),

  - both issues described above cause unnecessary delays when multiple
    system tests are started in parallel (due to high lock file
    contention among the system tests being started),

  - maintaining a state file requires ensuring proper locking, which
    complicates the script's source code.

Rework the get_ports.sh script so that it assigns non-overlapping port
ranges to its callers without using a state file or a lock file:

  - add a new command line switch, "-t", which takes the name of the
    system test to assign ports for,

  - ensure every instance of get_ports.sh knows how many ports all
    system tests which form the test suite are going to need in total
    (based on the number of subdirectories found in bin/tests/system/),

  - in order to ensure all instances of get_ports.sh work on the same
    global port range (so that no port range collisions happen), a
    stable (throughout the expected run time of a single system test
    suite) base port selection method is used instead of the random one;
    specifically, the base port, unless specified explicitly using the
    "-p" command line switch, is derived from the number of hours which
    passed since the Unix Epoch time,

  - use the name of the system test to assign ports for (passed via the
    new "-t" command line switch) as a unique index into the global
    system test range, to ensure all system tests use disjoint port
    ranges.
This commit is contained in:
Michał Kępień
2021-04-08 11:12:37 +02:00
parent a2484f2673
commit 31e5ca4bd9
4 changed files with 32 additions and 74 deletions

View File

@@ -241,7 +241,4 @@ AM_LOG_FLAGS = -r
$(TESTS): run.sh
clean-local:
-rm -f get_ports.state get_ports.lock
test-local: check