Mastodawn

zirias (on snac)Dec 18, 2024

#programming question, TL;DR: How to test for an (approximately) uniform #distribution?

Today at work, I created a piece of code that should #partition a stream of data entities based on some string keys of unknown format. The only requirements were that the same key must always be assigned to the same partition and the distribution should be approximately uniform (IOW all partitions should have roughly the same size). My approach was to apply a non-cryptographic #hash function to the keys (defaulting to #xxhash3), XOR-fold the hash down to 32 bits and then take this as an unsigned integer modulo the desired number of partitions.

I normally only code my private projects (as a software architect, I rarely have the time to touch any code at work, unfortunately), and there, I'd certainly test something like this on some large samples of input data, but probably just once manually. 🙈

But for work, I felt this should be done by a #unittest. I also think at least one set of input data should be somehow "random" (while others should contain "patterns"). My issue is with unit-testing the case for "random" input. One test I wrote feeds 16k GUIDs (in string representation) to my partitioner configured for 13 partitions, and checks that the factor between the largest and smallest partitions remains < 2, so, a very relaxed check. Still doubt remains because there's no way to guarantee this test won't go "red" eventually.

I now see several possible options:

just ignore this because hell freezing is more likely than that test going red ...
don't even attempt to test the resulting distribution on "random" input
bite the bullet and write some extra code creating "random" (unicode, random length within some limits) strings from a PRNG which will produce a predictable sequence

What do you think? 🤔 The latter option kind of sounds best, but then the complexity of the test will probably exceed the complexity of the code tested. 🙈

Show thread

zirias (on snac)Dec 19, 2024

Well, I solved this issue ... basically with the last option, but "simplifying" it a bit by creating Guids randomly. This makes sense in context because that's what we mostly expect for the key. I was annoyed by Microsoft's implementation of System.Guid on the way: First, the entropy source it uses is hardcoded (and not usable for "controlled" randomness), second it uses an internal representation with mixed endianness (no idea what this would mean on big-endian platforms) ... I finally decided to generate the Guid as a string myself without using System.Guid at all. This testcase is now run multiple times, using combinations of randomly chosen seeds for the PRNG and a few different numbers of partitions.

To be sure, I additionally added a testcase using the base64 encoding of the current iteration number as the key, just to check not-so-random input doesn't spoil the resulting distribution. All in all, I spent a lot more time on writing these tests than on the piece of actual functionality 🙈 ... but now it "feels right".

TL;DR: The sane way to test a distribution of a function on "random" input data is still "controlled" randomness by using a deterministic PRNG.

Show thread

zirias (on snac)

Just in case anyone else ever needs a #pseudorandom #Guid generator for #dotnet, here's what I finally did:

private const int guidSize = 16;
private const int guidStrLen = (guidSize * 2) + 4;
private readonly StringBuilder guidBuilder = new(guidStrLen, guidStrLen);

private string CreatePseudoRandomGuidString(Random r)
{
var bytes = new byte[guidSize];
r.NextBytes(bytes);

// Generate a valid UUID-v4 as per RFC 4122, section 4.4
bytes[6] = (byte)((bytes[6] & 0xfU) | (4U << 4)); // version 4
bytes[8] = (byte)((bytes[8] & 0x3fU) | 0x80U); // variant 1

// Format as a string. We don't use the System.Guid class here
// because it uses some mixed endian internal format which might
// not be portable.
guidBuilder.Clear();
for (int i = 0; i < guidSize; ++i)
{
guidBuilder.Append(bytes[i].ToString("x2"));
if (i % 2 != 0 && i > 2 && i < 10) guidBuilder.Append('-');
}
return guidBuilder.ToString();
}
To get a reproducable sequence of pseudo-random Guids, just seed the Random instance with a fixed value before calling this the first time.