Gabe Parmer

System builder and Composite hacker. Professor at GWU.
Webpage: https://www.seas.gwu.edu/~gparmer/
Youtube: https://www.youtube.com/channel/UC8-zle4WDSKZx6yUUWdExvA
Github: https://github.com/gparmer
Twitter: @__gparmer

This day is coming in hot.

Half hour delayed pictures of the sunrise over Chesapeake Bay.

I never would have argued that inline assembly was well-designed. However, the design of the x86_64 extensions to access the higher registers just baffles me.

It

1. hijacks an existing storage-class specifier (`register`) that everyone wanted to go away, giving it a completely different purpose,

2. moves register naming out of the inline block where it belongs, increasing ambiguity about the semantics of using those named variables, and

3. requires the use of out-parameters where clobber variables *should* be the answer.

Why was this the reality we ended up with?

Example: https://git.musl-libc.org/cgit/musl/tree/arch/x86_64/syscall_arch.h#n53

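To make the complaint concrete, here's a minimal sketch of the pattern the linked musl code uses (assuming x86_64 Linux and GCC/Clang): arguments past the third must go in r10, r8, and r9, but no inline-asm constraint letter names those registers, so you get `register`-pinned locals declared *outside* the asm block.

```c
#include <assert.h>

/* A minimal, illustrative 4-argument syscall wrapper in the style of
 * musl's __syscall4 (the name my_syscall4 is mine, not musl's).
 * Assumes x86_64 Linux with GCC/Clang inline asm. */
static long my_syscall4(long n, long a1, long a2, long a3, long a4)
{
	unsigned long ret;
	/* The "register ... __asm__("r10")" storage-class hijack: a local
	 * variable pinned to r10 purely so the asm block below can refer
	 * to it with a generic "r" constraint -- there is no constraint
	 * letter for r10. */
	register long r10 __asm__("r10") = a4;

	__asm__ __volatile__ ("syscall"
		: "=a"(ret)                          /* return value in rax */
		: "a"(n), "D"(a1), "S"(a2), "d"(a3), /* rax, rdi, rsi, rdx  */
		  "r"(r10)                            /* pinned above       */
		: "rcx", "r11", "memory");           /* syscall clobbers    */
	return ret;
}
```

Note that rcx and r11 (which the `syscall` instruction itself clobbers) *can* appear in the clobber list, but the r10 argument cannot be expressed that way, hence the out-of-block register pinning the post complains about.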

Hard to read the tea leaves on this exactly, but it seems to strongly imply that the accelerator will restart work at a tile-load boundary, and that progress is tracked with `start_row`. Given this, I'm guessing that interrupt latencies are not negatively impacted.

A recent pub. Links below.

Challenge: The edge cloud is awesome as it promises low latency to clients. But it suffers from the limited capacity of its small "micro-datacenters".

How can we provide bounded latency, in a multi-tenant, dense edge cloud environment?

What: Edge-RT, published in RTSS, makes contributions in edge systems that must both maintain high (line-rate) throughput *and* provide strongly bounded, end-to-end request-processing latency.

We run thousands of per-client Network Function (NF) chains serving ML inference and control tasks, and meet request deadlines much more effectively than Linux and EdgeOS (which we build on).

How #1: We use end-to-end *packet* scheduling, and NFs inherit the priority of the packets as they are processed. Thus the system keeps its eye on the goal of controlling latency for each request, despite processing across many "processes".
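A minimal sketch of that inheritance idea (names like `pkt_prio` and `chain_process` are illustrative, not Edge-RT's API): each NF stage runs at a priority derived from the packet it is currently processing, so urgency follows the request across "process" boundaries.

```c
/* Hypothetical sketch of end-to-end packet scheduling with NF priority
 * inheritance. All identifiers are illustrative. */
struct pkt { unsigned long deadline; };

/* Under EDF-style scheduling, "priority" is just the deadline:
 * an earlier deadline means more urgent. */
static unsigned long pkt_prio(const struct pkt *p) { return p->deadline; }

typedef void (*nf_fn)(struct pkt *);

static unsigned long effective_prio;   /* what the scheduler would see */
static unsigned long observed[8];      /* priorities each stage ran at */
static int           nobserved;

/* Two toy NF stages; each records the priority it executed at. */
static void nf_decrypt(struct pkt *p) { (void)p; observed[nobserved++] = effective_prio; }
static void nf_infer(struct pkt *p)   { (void)p; observed[nobserved++] = effective_prio; }

/* Run a chain of NFs over one packet: every stage inherits the
 * packet's priority, not a fixed per-process priority. */
static void chain_process(nf_fn *chain, int n, struct pkt *p)
{
	effective_prio = pkt_prio(p);
	for (int i = 0; i < n; i++)
		chain[i](p);
}
```

The point of the sketch: the scheduling parameter travels with the packet, so a chain of otherwise-independent NFs still optimizes for the request's end-to-end deadline.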

How #2: This only works when we control which packets (with which deadlines) are buffered for NF processing. Too much buffering: NFs execute at an inappropriately high priority. Too little: throughput tanks. We create "deadline-bounded batching"!
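One way to picture the batching policy (a sketch under my own assumptions, not Edge-RT's exact algorithm): admit a packet into the current batch only while every deadline in the batch stays within a fixed bound of the earliest one; otherwise flush and start fresh.

```c
#include <stddef.h>

/* Hypothetical sketch of "deadline-bounded batching": cap both batch
 * size and the deadline spread within a batch, so urgent packets are
 * never delayed behind a long tail of batched work, while per-batch
 * costs are still amortized. Constants and names are illustrative. */
#define BATCH_MAX 32
#define BOUND     1000UL   /* max deadline spread within one batch */

struct batch {
	unsigned long deadlines[BATCH_MAX];
	size_t        n;
	unsigned long earliest;
};

/* Returns 1 if the packet joined the batch; 0 means "flush and process
 * the current batch, then retry this packet in a fresh one". */
static int batch_add(struct batch *b, unsigned long deadline)
{
	if (b->n == 0) {
		b->earliest = deadline;
	} else if (b->n == BATCH_MAX || deadline > b->earliest + BOUND) {
		return 0; /* full, or would stretch past the deadline bound */
	}
	if (deadline < b->earliest)
		b->earliest = deadline;
	b->deadlines[b->n++] = deadline;
	return 1;
}
```

The bound is the knob the post describes: too large and NFs run at an inappropriately high priority on behalf of lax packets; too small and throughput tanks from tiny batches.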

How #3: To control the costs of inter-core event notification (e.g. IPIs), and scheduling overheads, we use periodic event processing, and create a new "constant-time Earliest Deadline First" algorithm that is O(1).
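To give a flavor of how EDF can be made O(1) (this is a generic bucket-and-bitmap sketch of my own, not necessarily Edge-RT's exact structure): quantize deadlines into fixed buckets, track non-empty buckets in a machine word, and find the earliest with a single find-first-set.

```c
#include <stdint.h>

/* Hypothetical sketch of a constant-time EDF ready structure: enqueue,
 * dequeue, and "earliest" are all O(1). Wrap-around of the bucket ring
 * and per-bucket task queues are elided; constants are illustrative. */
#define NBUCKETS 64        /* one bit per bucket fits in a uint64_t */
#define QUANTUM  100UL     /* deadline units per bucket */

struct edf {
	uint64_t bitmap;             /* bit i set => bucket i non-empty */
	unsigned count[NBUCKETS];    /* tasks per bucket */
};

static void edf_enqueue(struct edf *q, unsigned long deadline)
{
	unsigned b = (deadline / QUANTUM) % NBUCKETS;
	q->count[b]++;
	q->bitmap |= 1ULL << b;
}

/* Earliest-deadline bucket, or -1 if empty: one ffs, no scan. */
static int edf_earliest(struct edf *q)
{
	if (q->bitmap == 0)
		return -1;
	return __builtin_ctzll(q->bitmap);
}

static void edf_dequeue(struct edf *q, unsigned b)
{
	if (q->count[b] && --q->count[b] == 0)
		q->bitmap &= ~(1ULL << b);
}
```

The trade-off is deadline granularity (the quantum) for constant-time operations, which pairs naturally with the periodic event processing mentioned above.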

Background: We build on EdgeOS, which enables memory-dense computation with featherweight processes and strong isolation properties. We use DPDK for networking, and build everything on our Composite micro-kernel.

Wenyuan Shao is the main researcher and is a fantastic systems hacker. He's likely graduating with his PhD and will be on the job market around Sept. We got the "best student paper" award, which is a testament to all of his hard work on this research.

Paper: https://www2.seas.gwu.edu/~gparmer/publications/rtss22edgert.pdf
Presentation: https://www2.seas.gwu.edu/~gparmer/publications/rtss22edgert_pres.pdf
EdgeOS: https://www2.seas.gwu.edu/~gparmer/publications/atc20edgeos.pdf
Related pubs: https://www2.seas.gwu.edu/~gparmer/pubs.html

Bonus: We empirically demonstrate why eBPF, kernel-bypass (DPDK), and Linux kernel deadline scheduling alone are not close to sufficient for this domain. We're proud of the background section, if you want a small intro to some of these technologies.