Multithreading in Modern C++: Lock-Free Programming, Memory Ordering, and Atomics

Multithreaded programming in C++ is traditionally associated with mutexes, condition variables, and potential problems such as deadlocks and race conditions. However, modern C++ standards (C++11 and later) provide the tools to write high-performance multithreaded code without classic locks. This article looks at advanced techniques: lock-free programming, atomic operations, and the different memory ordering models.

https://habr.com/ru/articles/963818/

#multithreading_programming #c++ #concurrency #lockfree #atomic
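
The linked article is about C++, but the same acquire/release memory model exists in C11's <stdatomic.h>, which is what the C posts further down rely on. As a minimal illustration of what a memory ordering annotation buys you (my own sketch, not taken from the article; it uses C11 <threads.h>, which not every libc ships):

```c
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

/* The release store on 'ready' guarantees that the plain write to
 * 'payload' is visible to any thread that observes ready == 1 with an
 * acquire load -- no mutex involved. */
static int payload;
static atomic_int ready;

static int producer(void *arg)
{
    (void)arg;
    payload = 42;                                            /* plain write */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* publish     */
    return 0;
}

static int consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                                    /* spin-wait   */
    printf("%d\n", payload);                                 /* prints 42   */
    return 0;
}

int main(void)
{
    thrd_t p, c;
    thrd_create(&p, producer, NULL);
    thrd_create(&c, consumer, NULL);
    thrd_join(p, NULL);
    thrd_join(c, NULL);
    return 0;
}
```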

Lock-free code and chess: where LLMs reveal their inadequacy

We're all used to neural networks working wonders. Suno generates music indistinguishable from human-made music, Flux paints pictures better than many artists, Claude translates texts so well that even native speakers couldn't do better. It feels as if artificial intelligence is about to replace us in every field. But there is one small problem: as soon as a task requires genuine reasoning rather than reproducing memorized patterns, LLMs start producing such nonsense that you feel ashamed to even know them.

https://habr.com/ru/articles/935700/

#LLM #chess #lockfree #lockfree_algorithms #lockfree_data_structures

There are about 3 or 4 ways to make #hazardpointers #waitfree instead of merely #lockfree. I had thought Linux restartable sequences (RSEQ) couldn't be inlined, but apparently they can. I did a quick and dirty perf test and it looks like protect() runs 3x faster w/o that conditional branch. But this only works on Linux and you need asymmetric memory barriers.
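
For context, this is roughly what the conditional branch in question looks like in the textbook lock-free protect() (a generic sketch in C11 atomics; HpSlot/hp_protect are made-up names, not any particular library's API):

```c
#include <stdatomic.h>

/* One hazard-pointer slot, owned by the calling thread (illustrative). */
typedef struct HpSlot { _Atomic(void *) hazard; } HpSlot;

/* Classic lock-free protect(): publish the pointer read from *src, then
 * re-check that *src still holds it; retry if it changed.  That
 * publish-and-re-check branch is exactly what the wait-free schemes
 * (and the RSEQ/asymmetric-barrier trick) remove from the fast path. */
static void *hp_protect(HpSlot *slot, _Atomic(void *) *src)
{
    void *p = atomic_load(src);
    for (;;)
    {
        atomic_store(&slot->hazard, p);   /* publish the candidate     */
        void *q = atomic_load(src);       /* re-check the source       */
        if (q == p) return p;             /* unchanged: now protected  */
        p = q;                            /* changed: retry with new p */
    }
}
```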

Please help me spread the link to #swad 😎

https://github.com/Zirias/swad

I really need some users by now, for two reasons:

* I'm at a point where I've fully covered my own needs (the reasons I started coding this), and getting some users is the only way to learn what other people might need
* The complexity "exploded" after supporting so many OS-specific APIs (like #kqueue, #epoll, #eventfd, #signalfd, #timerfd, #eventports) and several #lockfree implementations based on #atomics while still providing fallbacks for everything that *should* work on any #POSIX system ... I'm definitely unable at this point to think of every possible edge case and test it. If there are #bugs left (which is somewhat likely), I really need people reporting these to me

Thanks! 🙃

GitHub - Zirias/swad: Simple Web Authentication Daemon

Next #swad release will still be a while. 😞

I *thought* I had the version with multiple #reactor #eventloop threads and quite some #lockfree stuff using #atomics finally crash free. I found that, while #valgrind doesn't help much, #clang's #thread #sanitizer is a very helpful debugging tool.

But I tested without #TLS (to be able to handle "massive load" which seemed necessary to trigger some of the more obscure data races). Also without the credential checkers that use child processes. Now I deployed the current state to my prod environment ... and saw a crash there (only after running a load test).

So, back to debugging. I hope the difference is not #TLS. This just doesn't work (for whatever reason) when enabling the address sanitizer, but I didn't check the thread sanitizer yet...

The #lockfree command #queue in #poser (for #swad) is finally fixed!

The original algorithm from [MS96] works fine *only* if the "free" function has some "magic" in place to defer freeing the object until no thread holds a reference any more ... and that magic is, well, left as an exercise to the reader. 🙈

Doing more research, I found a few suggestions for how to do that "magic", including for example #hazardpointers ... but they're known to cause quite some runtime overhead, so not really an option. I decided to implement a "shared object manager" based on the ideas from [WICBS18], which is kind of a "manually triggered garbage collector" in the end. And hey, it works! 🥳
https://github.com/Zirias/poser/blob/master/src/lib/core/sharedobj.c

[MS96] https://dl.acm.org/doi/10.1145/248052.248106
[WICBS18] https://www.cs.rochester.edu/u/scott/papers/2018_PPoPP_IBR.pdf

#coding #c #c11 #atomics
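
For readers who haven't met this kind of deferred reclamation: below is a very rough sketch of the general idea behind such a "manually triggered garbage collector" (a stripped-down epoch-style variant written purely for illustration; it is *not* poser's sharedobj.c, and it glosses over the announcement races and memory-ordering details a real implementation has to get right):

```c
#include <stdatomic.h>
#include <stdlib.h>

#define MAXTHREADS 64

/* Readers announce the epoch they entered; 0 means "not reading". */
static _Atomic unsigned long epoch = 1;
static _Atomic unsigned long reserved[MAXTHREADS];

typedef struct Retired {
    struct Retired *next;
    void *obj;
    unsigned long retired_at;    /* epoch at which retire() was called */
} Retired;

static Retired *retired_list;    /* per-thread in a real implementation */

void read_enter(int tid) { atomic_store(&reserved[tid], atomic_load(&epoch)); }
void read_leave(int tid) { atomic_store(&reserved[tid], 0); }

/* Instead of free(): remember the object together with the current epoch. */
void retire(void *obj)
{
    Retired *r = malloc(sizeof *r);
    r->obj = obj;
    r->retired_at = atomic_fetch_add(&epoch, 1);
    r->next = retired_list;
    retired_list = r;
}

/* The "manually triggered GC": free everything that was retired before
 * the oldest epoch any reader still holds a reservation for. */
void reclaim(void)
{
    unsigned long oldest = ~0UL;
    for (int i = 0; i < MAXTHREADS; ++i)
    {
        unsigned long e = atomic_load(&reserved[i]);
        if (e && e < oldest) oldest = e;
    }
    Retired **pp = &retired_list;
    while (*pp)
    {
        Retired *r = *pp;
        if (r->retired_at < oldest) { *pp = r->next; free(r->obj); free(r); }
        else pp = &r->next;
    }
}
```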

GitHub - Zirias/poser: POsix SERvices framework for C

This redesign of #poser (for #swad) to offer a "multi-reactor" (with multiple #threads, each running its own event loop) starts to give me severe headaches.

There is *still* a very rare data #race in the #lockfree #queue. I *think* I can spot it in the pseudo code from the paper I used[1], see screenshot. Have a look at lines E7 and E8. Suppose the thread executing this is suspended after E7 for a "very long time". Now, some dequeue operation in some other thread will eventually dequeue whatever "Q->Tail" was pointing to, and then free it after consumption. Our poor thread resumes, successfully passes the NULL check on the pointer it already read in E6, and then tries a CAS on tail->next in E9, which unfortunately lives inside an object that doesn't exist any more... If the CAS succeeds because that memory location happens to contain "zero" bytes, we corrupt some random other object that might now reside there. 🤯

Please tell me whether I have an error in my thinking here. Can it be ....? 🤔
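
To make that window concrete, here's roughly what the enqueue loop looks like transcribed to C11 atomics (my own simplified transcription, without the paper's counted pointers; the E-numbers refer to the pseudo code in [1]):

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct Node {
    _Atomic(struct Node *) next;
    void *value;
} Node;

typedef struct Queue {
    _Atomic(Node *) head;
    _Atomic(Node *) tail;
} Queue;

/* Simplified [MS96]-style enqueue; node->value is set by the caller. */
void enqueue(Queue *q, Node *node)
{
    atomic_store(&node->next, NULL);
    for (;;)
    {
        Node *tail = atomic_load(&q->tail);           /* E5 */
        Node *next = atomic_load(&tail->next);        /* E6 */
        if (tail != atomic_load(&q->tail)) continue;  /* E7 */
        /* <-- if this thread sleeps here, 'tail' may meanwhile be
         *     dequeued and free()d by another thread ...           */
        if (next == NULL)                             /* E8 */
        {
            /* ... and this CAS (E9) then writes into freed memory. */
            if (atomic_compare_exchange_weak(&tail->next, &next, node))
            {
                atomic_compare_exchange_strong(&q->tail, &tail, node);
                return;
            }
        }
        else
        {
            /* Tail is lagging behind; help advance it (E12). */
            atomic_compare_exchange_strong(&q->tail, &tail, next);
        }
    }
}
```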

Meanwhile, after fixing and improving lots of things, I checked the alternative implementation using #mutexes again, and surprise: although it's still a bit slower, the difference is now very, very small. And it has the clear advantage that it never crashes. 🙈 I'm seriously considering dropping all the lock-free #atomics stuff again and just going with mutexes.

[1] https://dl.acm.org/doi/10.1145/248052.248106

I guess this funny looking graph showing response time percentiles is exactly the result of one of 8 service worker threads having a lot more to do than all others. I wonder whether this could be a behavioral artifact of the #lockfree #queue used to distribute the accepted connections. 🤔

More interesting progress trying to make #swad suitable for very busy sites!

I realized that #TLS (both with #OpenSSL and #LibreSSL) is a *major* bottleneck. With TLS enabled, I couldn't cross 3000 requests per second, with somewhat acceptable response times (most below 500ms). Disabling TLS, I could really see the impact of a #lockfree queue as opposed to one protected by a #mutex. With the mutex, up to around 8000 req/s could be reached on the same hardware. And with a lockfree design, that quickly went beyond 10k req/s, but crashed. 😆

So I read some scientific papers 🙈 ... and redesigned a lot (*). And now it finally seems to work. My latest test reached a throughput of almost 25k req/s, with response times below 10ms for most requests! I really didn't expect to see *this* happen. 🤩 Maybe it could do even more; I didn't try yet.

Open issue: Can I do something about TLS? There *must* be some way to make it perform at least a *bit* better...

(*) edit: Here's the design I finally used, with a much simplified "dequeue" because the queues in question are guaranteed to have only a single consumer: https://dl.acm.org/doi/10.1145/248052.248106
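
For illustration, this is roughly what the single-consumer guarantee lets the dequeue side shrink to (my own sketch, reusing the Node/Queue shape from the enqueue sketch further up; retire() stands in for whatever deferred-free mechanism is used and is not poser's actual API):

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct Node { _Atomic(struct Node *) next; void *value; } Node;
typedef struct Queue { _Atomic(Node *) head; _Atomic(Node *) tail; } Queue;

void retire(void *obj);   /* defers the free (see the reclamation sketch) */

/* Single-consumer dequeue: q->head is only ever written by this one
 * thread, so no CAS loop is needed here at all.  The old dummy node
 * must still be retired rather than free()d, because a slow producer
 * may still be reading it as its snapshot of the tail. */
void *dequeue_sc(Queue *q)
{
    Node *head = atomic_load_explicit(&q->head, memory_order_relaxed);
    Node *next = atomic_load_explicit(&head->next, memory_order_acquire);
    if (next == NULL) return NULL;       /* queue is empty               */

    void *value = next->value;           /* 'next' becomes the new dummy */
    atomic_store_explicit(&q->head, next, memory_order_relaxed);
    retire(head);
    return value;
}
```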

Getting somewhat closer to releasing a new version of #swad. I now improved the functionality to execute something on a different worker thread: it uses an in-memory queue, with a #lockfree version provided. This gives me a consistent, reliable throughput of 3000 requests/s (with outliers up to 4500 r/s) at an average response time of 350 - 400 ms (with TLS enabled). For waking up worker threads, I implemented different backends as well: kqueue, eventfd and event-ports; the fallback is still a self-pipe.

So, #portability here really means implementing lots of different flavors of the same thing.
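
As an illustration of one of those wake-up flavors, here's a minimal sketch of the eventfd variant (Linux-only; names are made up, this is not swad's actual code):

```c
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

/* The eventfd is registered with the worker's event loop; writing to it
 * makes the loop's wait call report it readable. */
static int make_wakeup_fd(void)
{
    /* counter starts at 0; non-blocking so a stray read never stalls */
    return eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
}

/* Producer side: put the job on the (lock-free) queue, then poke. */
static void wakeup(int efd)
{
    uint64_t one = 1;
    (void)write(efd, &one, sizeof one);   /* adds 1 to the counter */
}

/* Worker side, called when the event loop reports the fd readable. */
static void drain_wakeups(int efd)
{
    uint64_t n;
    (void)read(efd, &n, sizeof n);   /* returns and resets the counter   */
    /* ... now drain the job queue; n wakeups may be coalesced into one. */
}
```

Compared to the self-pipe fallback, this needs only one fd and coalesces any number of wakeups into a single readable event.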

Looking at these startup logs, you can see that #kqueue (#FreeBSD and other BSDs) is really a "jack of all trades", being used for "everything" if available (and that's pretty awesome, it means one single #syscall per event loop iteration in the generic case). #illumos' (#Solaris) #eventports come somewhat close (but need a lot more syscalls, as there's no "batch registering" and certain event types need to be re-registered every time they fire); they just can't do signals, but illumos offers Linux-compatible signalfd. Looking at #Linux, there's a "special case fd" for everything. 🙈 Plus #epoll also needs one syscall for each event to be registered. The "generic #POSIX" case without any of these interfaces is just added for completeness. 😆
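
To make the syscall-count difference concrete, a rough side-by-side sketch (illustrative only, not swad's event loop; the kqueue half compiles on the BSDs/macOS, the epoll half on Linux):

```c
#if defined(__FreeBSD__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/event.h>

/* kqueue: registering a batch of interests and waiting for events can
 * happen in the same kevent() call -- one syscall per loop iteration. */
int wait_kqueue(int kq, int fd1, int fd2, struct kevent *out, int nout)
{
    struct kevent changes[2];
    EV_SET(&changes[0], fd1, EVFILT_READ, EV_ADD, 0, 0, NULL);
    EV_SET(&changes[1], fd2, EVFILT_READ, EV_ADD, 0, 0, NULL);
    return kevent(kq, changes, 2, out, nout, NULL);   /* single syscall */
}

#else
#include <sys/epoll.h>

/* epoll: one epoll_ctl() per fd to register, plus the epoll_wait()
 * itself; signals and timers need their own special fds on top. */
int wait_epoll(int epfd, int fd1, int fd2, struct epoll_event *out, int nout)
{
    struct epoll_event ev1 = { .events = EPOLLIN, .data = { .fd = fd1 } };
    struct epoll_event ev2 = { .events = EPOLLIN, .data = { .fd = fd2 } };
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd1, &ev1);   /* syscall #1 */
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd2, &ev2);   /* syscall #2 */
    return epoll_wait(epfd, out, nout, -1);      /* syscall #3 */
}
#endif
```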