Mark Jeffrey

29 Followers
93 Following
31 Posts
Assistant professor at the University of Toronto. PhD from MIT, MASc & BASc from UofT. Previously Meta postdoc, Google and Epson intern, AeroFS software engineer.
Pronouns: he/him
Affiliation: University of Toronto
Web: https://www.eecg.utoronto.ca/~mcj
@adrian your poor self-control is a service to the community. This is so cool.

We have an IEEE CAL 2026 paper on a neat little area-efficient integer dot-product hardware unit.

📄Paper: https://www.eecg.utoronto.ca/~mcj/papers/2026.fased.cal.pdf
⚙️Code: https://github.com/mcj-group/fased-verilog

The goal is to efficiently support quantized AI models with variable bit widths across layers (e.g., LLMs with 2-, 4-, or 8-bit weights targeting edge devices or even servers). Our key insight is to optimize the dot product holistically, rather than only its subcomponents. Our proposed FASED builds on the design of a Booth multiplier and eliminates nearly half of all full-adders in the dot product unit by fusing the multiplication and reduction steps. FASED reduces area by up to 1.9x over prior variable-width integer dot product designs. This was a fun collaboration with Pavel Golikov, Karthik Ganesan, and Gennady Pekhimenko.
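
To make the fusion concrete, here is a tiny behavioral sketch (mine, not the paper's RTL; the radix-4 Booth recoding and 8-bit width are illustrative assumptions). The point is that every Booth partial product from every multiply feeds one shared reduction, rather than each multiplier reducing its own partial products before a separate adder tree sums the products:

```python
def booth_digits(b, n_bits=8):
    """Radix-4 Booth recoding of an n_bits-wide two's-complement value.
    Returns digits d_i in {-2, -1, 0, 1, 2} with b == sum(d_i * 4**i)."""
    assert n_bits % 2 == 0
    u = b & ((1 << n_bits) - 1)    # raw bit pattern
    digits, prev = [], 0           # implicit bit below the LSB is 0
    for i in range(0, n_bits, 2):
        b0, b1 = (u >> i) & 1, (u >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits

def fused_dot(a_vec, b_vec, n_bits=8):
    """Pool every Booth partial product from every multiply into one
    shared reduction, instead of reducing per product and then summing
    the products. Sharing one reduction tree is what removes adders."""
    pps = [(d * a) << (2 * i)      # d * a * 4**i
           for a, b in zip(a_vec, b_vec)
           for i, d in enumerate(booth_digits(b, n_bits))]
    return sum(pps)                # the single fused reduction

assert fused_dot([3, -7, 12], [5, 9, -4]) == 3*5 + (-7)*9 + 12*(-4)
```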

We have an upcoming paper at ICML 2025.

📄Paper: https://www.eecg.utoronto.ca/~mcj/papers/2025.alspec.icml.pdf
⚙️Code: https://github.com/mcj-group/alspec

The gist: When a large language model (LLM) responds to a prompt by generating text (inference), one of the most important but slowest stages of the computation is called attention. One way to speed up attention is to use more compute devices (e.g., multiple GPUs or accelerators) to work on it collaboratively (tensor parallelism). Unfortunately, when scaling to 8 or more devices, the communication between them overwhelms the benefit of their increased computational ability. Another way to speed up attention is to approximate its underlying math. Prior approximate-attention approaches can work well for particular inference tasks, but on others, like solving math problems, the quality of the generated text is poor.

We propose attention-level speculation, a technique that combines and enhances the multi-device and approximate approaches to speed up LLM inference. Attention-level speculation uses the output of the attention approximation only some of the time, verifying on the fly whether the approximation was of good quality. Using two devices, we overlap the verification of approximation quality with speculative downstream computation. Speculation succeeds for up to 90% of attention operations. Our experiments with Tenstorrent N150s suggest that combining attention-level speculation with tensor parallelism across 8 devices is up to 1.65x faster than tensor parallelism alone on 8 devices.
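
Here is a minimal sketch of the control flow, assuming NumPy and modeling the two devices as threads; approx_attention, downstream, and the max-error tolerance are illustrative stand-ins, not the paper's kernels or its verification criterion:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def exact_attention(q, k, v):
    s = (q @ k.T) / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def approx_attention(q, k, v, keep=0.25):
    """Stand-in approximation: attend to only a prefix of the keys."""
    m = max(1, int(keep * k.shape[0]))
    return exact_attention(q, k[:m], v[:m])

def downstream(x):
    """Stand-in for the rest of the layer (MLP, later layers, ...)."""
    return np.tanh(x) @ np.ones((x.shape[-1], x.shape[-1]))

def speculative_layer(q, k, v, tol=1e-2):
    approx = approx_attention(q, k, v)
    with ThreadPoolExecutor(max_workers=2) as pool:
        # "Device" 1 verifies while "device" 2 speculates downstream.
        exact_f = pool.submit(exact_attention, q, k, v)
        spec_f = pool.submit(downstream, approx)
        exact = exact_f.result()
        if np.abs(exact - approx).max() <= tol:
            return spec_f.result()   # speculation succeeded: work overlapped
    return downstream(exact)         # mis-speculation: redo downstream
```

When 90% of speculations succeed, the exact attention's latency is mostly hidden behind downstream work that would otherwise have had to wait for it.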

This project was led by Jack Cai, a recent BASc grad from University of Toronto Engineering and a former Tenstorrent intern. Jack pitched this as an undergraduate thesis project, and I did not think it would work. Shame on me, and kudos to Jack for his amazing work and persistence. Many thanks to our co-authors Ammar Vora, Randalph Zhang, and Mark O'Connor.

slowly working on a mega terminal cheat sheet

here's a link to the draft as a PDF: https://jvns.ca/terminal-cheat-sheet-draft.pdf

We have an FPT paper being presented next week.
📄Paper: https://www.eecg.utoronto.ca/~mcj/papers/2024.mqrouter.fpt.pdf
⚙️Code: https://github.com/verilog-to-routing/vtr-verilog-to-routing/tree/mq-parallel-router

The gist: #FPGA routing can take many hours; it is one of the longest stages in FPGA #CAD flows and impedes productivity for hardware designers. Meanwhile, the parallel algorithms community has demonstrated significant scalability for path-search problems like single-source shortest paths (SSSP) using relaxed concurrent priority schedulers. While maintaining the deterministic circuit routing demanded by FPGA CAD users, we applied techniques from the parallelism community to the A*, (heuristic) directed, and Dijkstra path searches in FPGA routing. Atop VTR 8, our work achieves 13x, 2x, and 18x average speedups, respectively.
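
For a flavor of the scheduling idea, here is a single-process Python sketch (mine, not VTR's router): a MultiQueue-style relaxed priority queue whose pops are only near-minimum, plus the stale-pop check that makes out-of-order scheduling cost redundant work rather than correctness. Per-thread locking and the determinism machinery the paper describes are omitted:

```python
import heapq, random

class RelaxedPQ:
    """Push to a random sub-queue; pop the better of two sampled
    sub-queue tops. Threads get near-minimum work with little
    contention, at the price of occasional out-of-order pops."""
    def __init__(self, nqueues=4):
        self.qs = [[] for _ in range(nqueues)]
    def push(self, prio, item):
        heapq.heappush(random.choice(self.qs), (prio, item))
    def pop(self):
        cands = [q for q in random.sample(self.qs, 2) if q] \
                or [q for q in self.qs if q]
        if not cands:
            return None
        return heapq.heappop(min(cands, key=lambda q: q[0]))

def sssp(graph, src):
    """Dijkstra that tolerates relaxed ordering: a stale pop is
    skipped, so out-of-order pops cause re-work, not wrong answers."""
    dist, pq = {src: 0}, RelaxedPQ()
    pq.push(0, src)
    while (entry := pq.pop()) is not None:
        d, u = entry
        if d != dist.get(u):
            continue                       # stale: a shorter path won
        for v, w in graph[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                pq.push(d + w, v)
    return dist

g = {0: [(1, 2), (2, 5)], 1: [(2, 1)], 2: []}
assert sssp(g, 0) == {0: 0, 1: 2, 2: 3}
```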

This was a super fun collaboration with Mirjana Stojilović, Vaughn Betz and his students Alexandre Singer and Hang Yan, and Guozheng Zhang, an MASc grad from my group. This paper blossomed out of a lunch conversation and then a course project: the dream scenario for graduate course instructors.

@notypes congratulations!!
@dingus_spingus darn, missed opportunity to squeeze some citrusy insights in rust. Thanks for reading!
Proud advisor moment 2: Javad Abdi convocated in Toronto as I presented his work at #spaa2024 (his choice!). Javad's work interrogates the #Rust programming language's claim of fearless concurrency. Through a case study, we find that #Rust programmers indeed need not be fearful when expressing easy parallelism, but when parallelism gets hard (e.g., irregular, run-time-varying data dependences), #Rust is not inherently easier to use than predecessors like C++.
Paper: https://www.eecg.utoronto.ca/~mcj/papers/2024.rpb.spaa.pdf
Proud advisor moment: Guozheng Zhang presented his master’s work at #spaa2024. Ordered algorithms are difficult to scale on manycores, but several priority schedulers have been proposed. Guozheng introduces a taxonomy to evaluate past schedulers and explores a new design point: the Multi Bucket Queue. See more in our paper https://www.eecg.utoronto.ca/~mcj/papers/2024.mbq.spaa.pdf
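
Here is my toy reading of that design point in Python, assuming integer priorities; the real MBQ's bucket sizing, locking, and batching follow the paper, not this sketch. The idea combines bucketing (within a bucket, order does not matter, so a cheap unordered bag replaces a heap) with MultiQueue-style replication and sampling:

```python
import random
from collections import defaultdict

class BucketQueue:
    """Unordered bags indexed by priority bucket of width delta."""
    def __init__(self, delta=1):
        self.delta = delta
        self.buckets = defaultdict(list)
    def push(self, prio, item):
        self.buckets[prio // self.delta].append((prio, item))
    def top(self):
        return min(self.buckets) if self.buckets else None
    def pop(self):
        b = self.top()
        entry = self.buckets[b].pop()    # any item in the best bucket
        if not self.buckets[b]:
            del self.buckets[b]
        return entry

class MultiBucketQueue:
    """Push to a random bucket queue; pop from the sampled queue whose
    best bucket is lowest. Relaxed, like a MultiQueue, but each pop is
    a cheap bag removal instead of a heap sift."""
    def __init__(self, nqueues=4, delta=1):
        self.qs = [BucketQueue(delta) for _ in range(nqueues)]
    def push(self, prio, item):
        random.choice(self.qs).push(prio, item)
    def pop(self):
        live = [q for q in random.sample(self.qs, 2) if q.top() is not None] \
               or [q for q in self.qs if q.top() is not None]
        return min(live, key=lambda q: q.top()).pop() if live else None

mbq = MultiBucketQueue(nqueues=4, delta=4)
for p in [17, 3, 9, 3, 25]:
    mbq.push(p, f"task{p}")
print(mbq.pop())  # near-best work: likely, not certainly, a priority-3 task
```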

In 1968, at 30 years old, Lynn Conway transitioned. In doing so she lost her wife, her children, and her job at IBM. She carried on, living authentically as her true self, and continued her career as an electrical engineer. In 1978 she became a visiting associate professor at MIT and taught a course in VLSI (very-large-scale integration) that became the basis of the Mead-Conway VLSI design methodology, changing how we design integrated circuits. In 1985 she became a professor of electrical engineering at the University of Michigan, and later the associate dean of engineering.

32 years after transitioning, Conway came out publicly as a transgender woman. Since then she has been an advocate for transgender people in the tech sector. In 2020, IBM formally apologized for firing Conway for being trans, over 50 years after the fact. Just a bit of #transHistory I learned today.

#history