Anemll (@anemll)

They shared work in progress on speeding up prefill for Flash-MoE. Instead of token-by-token processing, they added batched linear attention and batched full attention with custom Metal kernels, reporting a 6.2x prefill speedup without experts.

https://x.com/anemll/status/2036550949809037394

#flashmoe #prefill #metalkernel #attention #optimization

Anemll (@anemll) on X

WIP: First attempt to speed up prefill for Flash-MoE. Original repo did token-by-token without streamed experts. Added: Batched linear attention + batched full attention (Flash Attention style) with custom Metal kernels. Without experts: 6.2x faster prefill (11 -> 68 tok/s)
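For context, here is a minimal NumPy sketch of the difference the tweet describes: prefill computed token by token versus one batched, causally masked attention pass (the same quantity Flash Attention computes tile by tile). The single-head setup, shapes, and function names are illustrative assumptions, not Anemll's Metal implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prefill_token_by_token(q, k, v):
    # One tiny attention computation per prompt position:
    # T small matmuls, each over the keys seen so far (causal).
    T, d = q.shape
    out = np.empty_like(v)
    for t in range(T):
        scores = q[t] @ k[: t + 1].T / np.sqrt(d)  # (t+1,)
        out[t] = softmax(scores) @ v[: t + 1]      # (d,)
    return out

def prefill_batched(q, k, v):
    # The whole prompt in one masked attention pass; a Flash
    # Attention style kernel computes this tile by tile on the GPU.
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # causal mask
    scores[future] = -np.inf
    return softmax(scores) @ v

# Both paths produce identical outputs; only the schedule differs.
rng = np.random.default_rng(0)
T, d = 16, 8
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
assert np.allclose(prefill_token_by_token(q, k, v), prefill_batched(q, k, v))
```

The batched form replaces T small sequential launches with a few large matmuls, which is broadly why moving prefill from token-by-token processing to batched kernels can yield this kind of speedup.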
