Sam Westrick

399 Followers
230 Following
229 Posts
assistant professor of CS at NYU Courant :: programming languages :: parallel computing :: music :: lead dev of the MaPLe compiler (https://github.com/mpllang/mpl)
Website: https://cs.nyu.edu/~shw8119/
Blog: https://shwestrick.github.io/
MaPLe: https://github.com/mpllang/mpl
Twitter: https://twitter.com/shwestrick

my student Seong-Heon talking about one of our new projects at NYU!

(We’re at NJPLS today, come say hi)

Enjoying this presentation about performance engineering in the Go GC implementation:
https://youtu.be/gPJkM95KpKo

Really fantastic talk!

GopherCon 2025 - Advancing Go Garbage Collection with Green Tea - Michael Knyszek

happy to announce that, earlier this Fall, our QCE'25 paper "Local Optimization of Quantum Circuits" received a Best Paper award!

https://cs.nyu.edu/~shw8119/25/qce25-oac.pdf

The key challenge here is optimizing a large quantum circuit (with, e.g., hundreds of thousands of gates). Many existing tools are essentially superoptimizers, with exponential search spaces, and therefore can only handle "small" circuits in a reasonable amount of time/space.

Thinking of these tools as black box optimizers or "oracles", a natural strategy would be to chunk up the circuit into small pieces and apply the oracle to each piece. This easily guarantees that the exponential searches are bounded and do not explode (in time/space).

But this immediately raises a question. Surely, this approach must miss optimizations that are possible across the boundaries between pieces, right?

(As you might expect, the answer is yes.)

To address this, we develop a "melding" algorithm which identifies and performs additional optimizations across the boundaries. Melding can in turn expose even more opportunities for optimizing individual chunks. So, after melding, we can continue optimizing recursively until we converge on a circuit which is (in some sense) "locally optimal" relative to the size-constrained oracle.

We call our algorithm OAC, for "optimize and compact". We prove that this algorithm only calls the oracle productively: the number of additional calls to the oracle (due to melding, etc) is bounded by the number of optimizations found.
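To make the shape of this optimize/meld loop concrete, here's a toy sketch in Python. To be clear: this is not the paper's OAC — the "gates" (letters, where an upper/lowercase pair like `Xx` is a gate and its inverse), the cancellation "oracle", and the melding step are all invented for illustration. It only shows why melding matters: cancellations that straddle a chunk boundary are invisible to the naive chunked strategy.

```python
# Toy sketch of a chunk/optimize/meld loop (NOT the paper's OAC; the
# gate representation, oracle, and meld below are invented for
# illustration). Gates are letters; "X" and "x" are inverses.

def oracle(chunk):
    """Black-box optimizer on a *small* gate list: cancel adjacent
    inverse pairs, stack-style (like matching parentheses)."""
    out = []
    for g in chunk:
        if out and out[-1] != g and out[-1].lower() == g.lower():
            out.pop()  # g cancels the previous gate
        else:
            out.append(g)
    return out

def optimize_chunked(circuit, k):
    """Naive strategy: apply the oracle to each size-k chunk alone."""
    return [oracle(circuit[i:i + k]) for i in range(0, len(circuit), k)]

def meld(chunks, k):
    """Re-optimize across each boundary: run the oracle on a window
    spanning the last k gates of one chunk and the first k of the next."""
    for i in range(len(chunks) - 1):
        left, right = chunks[i], chunks[i + 1]
        window = oracle(left[-k:] + right[:k])
        chunks[i] = left[:-k] + window  # keep melded gates on the left
        chunks[i + 1] = right[k:]
    return chunks

def oac_like(circuit, k=4):
    """Alternate oracle and meld until the gate count stops shrinking."""
    chunks = optimize_chunked(circuit, k)
    while True:
        before = sum(len(c) for c in chunks)
        chunks = [oracle(c) for c in meld(chunks, k)]
        if sum(len(c) for c in chunks) == before:
            return [g for c in chunks for g in c]

# "ABba" cancels to nothing, but only across the chunk boundary:
optimize_chunked(list("ABba"), 2)  # [['A','B'], ['b','a']]: naive misses it
oac_like(list("ABba"), k=2)        # []: melding finds both cancellations
```

The real OAC additionally has to keep every oracle call productive (bounded by the number of optimizations found), which this toy loop makes no attempt to guarantee.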

In our experiments, we confirm that OAC performs strictly better than the naive chunked strategy, sometimes significantly so, removing thousands more gates from large circuit instances. This confirms that "melding" is important for circuit quality. Additionally, OAC outperforms existing quantum circuit optimizers, often producing a better quality final circuit in orders of magnitude less time.

Check out the paper for more details!

One day, my apartment will look like this

in TypeDis (conditionally accepted at POPL!), we develop a type system for enforcing **disentanglement** statically at compile-time. This project was led by Alexandre Moine here at NYU, in close collaboration with Stephanie Balzer at CMU.

Disentanglement is all about parallel GC. It decomposes the heap into pieces that can be traced independently, in parallel, with no synchronization.

How does disentanglement work?

The idea is (1) give every task its own local heap, and (2) enforce that the heaps of any two _concurrent_ tasks never directly point to each other, i.e., the heaps are "disentangled".

We've been exploring this idea over the past ~10 years in MaPLe (https://github.com/mpllang/mpl). Today, in MaPLe, disentanglement is checked and managed dynamically, and we've put a ton of work into ensuring that this dynamic management cost is nearly zero (see https://dl.acm.org/doi/10.1145/3591284 and https://dl.acm.org/doi/10.1145/3547646).

However, nearly-zero cost only holds if your program is disentangled; as soon as a violation is detected, the GC has to start synchronizing across concurrent tasks and we're back to the tricky world of reasoning about the synchronization overheads of GC.

So, the question is: Can we reason about disentanglement _statically_, at the source level, and enforce it automatically at compile-time? In other words: can we statically guarantee that the GC remains fully parallel, and never has to synchronize across concurrent tasks?

Yes! Enter TypeDis. Similar to region types, TypeDis associates a task identifier or "timestamp", δ, with every allocation. E.g. the type string@δ indicates that the string was allocated at time δ. Unboxed types don't need these annotations (e.g. raw integers, booleans, etc).

At compile-time, TypeDis computes a partial order over timestamps (derived from the nested fork-join structure of parallel tasks) and enforces that all pointers must flow _backwards_ in time. Thinking about a tree of tasks (where parents fork into their children), this corresponds to an **up-pointer invariant**: the data of child tasks can only contain pointers to ancestor data. In this way, you can never have pointers between concurrent siblings/cousins/etc.
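The up-pointer invariant is easy to state as a check over the fork tree. Here's a minimal Python sketch (TypeDis enforces this statically at compile time; this dynamic check, and all the names in it, are invented just to illustrate the invariant): a pointer from one task's heap into another's is fine exactly when the target task is the source task or one of its ancestors.

```python
# Minimal sketch of the up-pointer invariant as a runtime check.
# (TypeDis does this statically; everything here is illustrative.)

class Task:
    def __init__(self, parent=None):
        self.parent = parent  # None for the root task

    def ancestors(self):
        """Yield this task and every task above it in the fork tree."""
        t = self
        while t is not None:
            yield t
            t = t.parent

def pointer_ok(src_task, dst_task):
    """A pointer from src_task's heap into dst_task's heap is
    disentangled iff dst_task is src_task or an ancestor of it --
    i.e., the pointer goes 'up' (backwards in time)."""
    return dst_task in src_task.ancestors()

# A tiny fork tree: root forks into two concurrent children.
root = Task()
left, right = Task(root), Task(root)

pointer_ok(left, root)   # True: up-pointer into the parent's heap
pointer_ok(left, left)   # True: pointer within the local heap
pointer_ok(left, right)  # False: concurrent siblings -- entangled!
```

Since concurrent tasks are never related by the ancestor order, the invariant rules out any pointer between their heaps, which is exactly what lets the GC trace them independently.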

One of the surprising things we discovered is that this "backwards-in-time" / "up-pointer invariant" concept can be elegantly encoded in the type system as a form of subtyping that we call subtiming. The whole system is essentially just standard unification + subtiming.

Of course, there are tons of juicy details and intricacies that cannot fit into (even this many) tweets. Take a look at the preprint to see all the nitty gritty stuff!

https://cs.nyu.edu/~am15509/publications/typedis.pdf

We got really fantastic feedback during reviews (thank you POPL reviewers!) and we're looking forward to updating this preprint. Stay tuned for the final paper.

absolutely thrilled to announce 2 papers (conditionally) accepted at POPL!

TypeDis: A Type System for Disentanglement
(Moine, Balzer, Xu, Westrick)

All for One and One for All: Program Logics for Exploiting Internal Determinism in Parallel Programs
(Moine, Westrick, Tassarotti)

Both projects led by Alexandre Moine here at NYU.

Preprints available on Alexandre's website!
https://cs.nyu.edu/~am15509/


one of the subtle and interesting bits is that the merging function isn’t associative, so the parallel algorithm is (in a weak sense) non-deterministic, depending on how the reduction is implemented.

And yet, all possible combination orders still yield a correct overall result
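A small Python sketch of the point (the folklore combine step, not the MaPLe code from the post): the state is a `(candidate, count)` pair, a singleton `x` maps to `(x, 1)`, and the merge pits the two candidates against each other. The merge is demonstrably not associative, yet any reduction order produces a state whose candidate is the majority element whenever one exists.

```python
# Parallel Boyer-Moore majority: state = (candidate, count).
# The merge below is the folklore combine step. It is NOT associative,
# but every reduction order still finds the majority element (if any).

def merge(a, b):
    (c1, n1), (c2, n2) = a, b
    if c1 == c2:
        return (c1, n1 + n2)
    if n1 >= n2:
        return (c1, n1 - n2)  # c2's votes cancel some of c1's
    return (c2, n2 - n1)

def reduce_left(xs):
    """Sequential left-to-right reduction order."""
    acc = (xs[0], 1)
    for x in xs[1:]:
        acc = merge(acc, (x, 1))
    return acc

def reduce_tree(xs):
    """Balanced (parallel-style) reduction order."""
    states = [(x, 1) for x in xs]
    while len(states) > 1:
        states = [merge(states[i], states[i + 1])
                  if i + 1 < len(states) else states[i]
                  for i in range(0, len(states), 2)]
    return states[0]

# Non-associativity: same three states, different intermediate results.
a, b, c = ('x', 1), ('y', 2), ('x', 1)
merge(merge(a, b), c)  # ('y', 0)
merge(a, merge(b, c))  # ('x', 0) -- different states, both count 0

# ...yet any order finds the true majority when one exists:
xs = [1, 2, 1, 1, 3, 1, 1]  # 1 occurs 5 of 7 times
reduce_left(xs)  # (1, 3): candidate 1
reduce_tree(xs)  # candidate 1 as well
```

As usual with Boyer-Moore, a nonzero final count only nominates a candidate; confirming it is actually a majority takes one more counting pass.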

parallel Boyer-Moore majority selection is a nice bit of code; here it is in MaPLe.

this algorithm seems to be folklore -- the original Boyer-Moore algorithm is sequential, but I've found at least two mentions of the parallel algorithm in the wild.

Oleg Kiselyov discusses its correctness here:
https://okmij.org/ftp/Algorithms/map-monoid-reduce.html#BM

and, Adam Blank alludes to the parallel algorithm on the bottom of page 6 in this set of lecture notes:
https://courses.cs.washington.edu/courses/cse332/17wi/sections/08/section08-solutions.pdf

so this leaves me wondering -- has this parallel Boyer-Moore algorithm always been folklore? Is there a definitive reference for it?

Looking over this paper today about parallel incremental convex hull: https://www.cs.ucr.edu/~yihans/papers/2020/SPAA20/convex-hull.pdf

An interesting connection from computational geometry is that 2D Delaunay triangulations can be computed as a special case of 3D convex hulls. So, the algorithm that Blelloch/Gu/Shun/Sun describe is not only a convex hull algorithm, but also a Delaunay algorithm.
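The bridge between the two problems is the classic lifting map: send each 2D point (x, y) to (x, y, x² + y²) on a paraboloid. The lower convex hull of the lifted points projects back down to the Delaunay triangulation, because a point lies inside the circumcircle of a triangle exactly when its lifted image is below the plane through the lifted triangle. A quick Python sketch of that predicate (standard textbook material, not code from the paper):

```python
# The lifting map behind "2D Delaunay = 3D convex hull":
# (x, y) |-> (x, y, x^2 + y^2) on a paraboloid. Point d is inside the
# circumcircle of ccw triangle (a, b, c) iff lift(d) lies below the
# plane through lift(a), lift(b), lift(c).

def lift(p):
    """Lift a 2D point onto the paraboloid z = x^2 + y^2."""
    x, y = p
    return (x, y, x * x + y * y)

def orient3d(a, b, c, d):
    """Signed volume test: positive iff d lies below the plane through
    a, b, c (with a, b, c counterclockwise when seen from above)."""
    m = [[a[i] - d[i] for i in range(3)],
         [b[i] - d[i] for i in range(3)],
         [c[i] - d[i] for i in range(3)]]
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def in_circumcircle(a, b, c, d):
    """True iff d is strictly inside the circumcircle of ccw (a,b,c)."""
    return orient3d(lift(a), lift(b), lift(c), lift(d)) > 0

# Unit circle through (1,0), (0,1), (-1,0): origin inside, (2,0) not.
in_circumcircle((1, 0), (0, 1), (-1, 0), (0, 0))  # True
in_circumcircle((1, 0), (0, 1), (-1, 0), (2, 0))  # False
```

(In a real implementation this determinant needs robust arithmetic rather than plain floats; the exact-coordinate examples above sidestep that.)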

They have an example implementation here (https://github.com/cmuparlay/parlaylib/blob/master/examples/delaunay.h) which uses a global hash table to store the incremental state of the mesh.

I had some fun over the weekend porting this to MaPLe.
WIP is here: https://github.com/MPLLang/parallel-ml-bench/tree/main/mpl/bench/delaunay-top-down

Some debugging still to do, but it's a nice example!

👀 👀

Singapore here I come!

@icfp_conference @splashcon