what are some other tutorials for implementing a toy version of a hard thing in a short amount of time? Like https://implement-dns.wizardzines.com or https://raytracing.github.io

I’m especially interested in the “in a short amount of time” aspect, I think nand2tetris is extremely cool but you definitely can’t do it in 3 days.

If possible I'd love to hear about a project you personally did and how long it took you.

Implement DNS in a weekend

@b0rk For a more personal example, with notes… I read HTTP: The Definitive Guide in a night (yes, I speed read), then the next day wrote an HTTP/1.1 server in pure Python that was more functionally complete than any I could find in Python. (Mine handled chunked requests as well as responses, not just responses.)

The protocol bits compile to 191 opcodes.
Hybridises async and threaded execution.
And survives C10K consuming ~11MB of RAM.

https://github.com/marrow-legacy/server.http/blob/develop/marrow/server/http/protocol.py

Benchmark: https://gist.github.com/amcgregor/707936 (13 years ago)

@alice thanks, this is great -- I want to write an "implement HTTP" tutorial and I've been struggling a bit with figuring out which edge cases from the spec to cover. Definitely going to look through this.
@b0rk @alice A simple HTTP cache doesn't take long.

@b0rk You can see from mine that I’m handing off request processing (the WSGI API side of things) to a Futures thread pool. From the benchmark, you can then run multiple processes sharing the same listening socket to let the kernel itself distribute incoming connections between them. All communication happens asynchronously within the main thread of a given process.

HTTP: The Definitive Guide is an incredibly good resource. Chunked transfer, header trailers, and some TCP-specific quirks…

@michaell @b0rk I often use that, or my 1.9 million DRPC/second pure Python implementations as counter-examples to cries of “Python is slow”. I even micro-optimised these things.

Split? Partition?
Partition is WAY faster.

Uppercase or lowercase a mostly lower-case string?
Whichever requires fewer character replacements is faster; for mostly lower-case input, that's lowercasing.

&c. https://gist.github.com/amcgregor/405354
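A hedged re-creation of the kind of comparison in that gist, using `timeit` to race `str.split` against `str.partition` on a single-delimiter string. The specific numbers vary by machine and interpreter; the point is the methodology, not the figures.

```python
import timeit

# A typical single-delimiter case: splitting a header into name and value.
header = "Content-Type: text/html"

split_t = timeit.timeit(lambda: header.split(":", 1), number=100_000)
part_t = timeit.timeit(lambda: header.partition(":"), number=100_000)

# partition returns a fixed 3-tuple and never scans past the first
# delimiter, which is why it tends to win for this use case.
print(f"split:     {split_t:.4f}s")
print(f"partition: {part_t:.4f}s")
```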

Then there are the examples where Python is faster than C. My web server contains one. (But also malloc elimination, inlining…)


@alice @b0rk And then there are the new not-quite-ready-for-prime-time subinterpreters in 3.12:

“Snow's own initial experiments with subinterpreters significantly outperformed threading and multiprocessing. One example, a simple web service that performed some CPU-bound work, maxed out at 100 requests per second with threads, and 600 with multiprocessing. But with subinterpreters, it yielded 11,500 requests, and with little to no drop-off when scaled up from one client.”

@michaell

Sub-interpreters. 😐 I get 38,775 generations/second (Python 2.7) or 45,679 (PyPy) or 55,385 (CPython 3.8) without any additional magic. And yup, cinje (my template engine pseudo-DSL) is cross-compatible back to Python 2 without code modification. (Not like that matters any more, but it's demonstrative of a few points.)

That HTTP service C10K test, under Python 2.6 and with an overwhelming 10,000-connection concurrency, netted 6K r/sec, and that was, like, 13 years ago. It'd be far faster today.

@b0rk

@alice @b0rk But you’re just talking about the HTTP plumbing, not the app or service that runs behind it. That’s where subinterpreters could make a big difference — especially if that app or service makes blocking calls.

@michaell

We can then agree to disagree.

I'm talking about baseline performance; it can only get slower from there. How you manage blocking activity is up to the application; I'm a fan of thread-based Futures pools for most long-duration activities, even just enqueueing e-mail deliveries. (Also used for long-duration HTTP requests for data acquisition and processing for scheduled ingest.)
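The enqueueing pattern looks roughly like this, assuming a shared `concurrent.futures` pool; `deliver` here is a hypothetical stand-in for a blocking SMTP conversation, so the request path only pays for the submit.

```python
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)


def deliver(recipient):
    # Hypothetical stand-in for a blocking SMTP delivery.
    time.sleep(0.1)
    return f"delivered to {recipient}"


def handle_request(recipient):
    # Enqueue and return immediately; the Future can be checked later,
    # or simply fire-and-forget for things like e-mail.
    return pool.submit(deliver, recipient)


future = handle_request("user@example.com")
print(future.result())
```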

Though admittedly, my dynamic scaling thread pool is FAR more efficient than the built-in.

@b0rk

@michaell

The HTTP service test is performing the full WSGI pipeline, using a thread pool within each primarily async process to execute the WSGI endpoint, if you look very closely. (And kernel-level socket sharing for pre-fork/multi-process.) Very hybrid. 😜

And the template performance is based on the "de-facto standard" "Bigtable" test, that is, rendering a 100-column, 1000-row HTML table.
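For reference, the shape of that benchmark can be reproduced with plain string joining; this is not cinje or Django, just the structure of the test: build a 1,000-row, 100-column HTML table and time it.

```python
import time


def render_bigtable(rows=1000, cols=100):
    # Naive string-building version of the "Bigtable" benchmark:
    # one <tr> per row, each containing `cols` numbered <td> cells.
    out = ["<table>"]
    for _ in range(rows):
        out.append(
            "<tr>" + "".join(f"<td>{c}</td>" for c in range(cols)) + "</tr>"
        )
    out.append("</table>")
    return "".join(out)


start = time.perf_counter()
html = render_bigtable()
elapsed = time.perf_counter() - start
print(f"rendered {len(html):,} bytes in {elapsed:.3f}s")
```

Template engines are measured on how many of these full renders they manage per second.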

Django renders 2/second.

Two.

Insert "come at me bro" Monster (muppet drummer) GIF here. 😝

@b0rk

@michaell The “template render” is an example of a blocking process you might perform on each of those requests. And since it’s faster than the HTTP processing machinery, it’s the plumbing that’s the performance constraint, not the blocking process itself.

@b0rk