Avi Chawla (@_avichawla)

A breakdown of 72 optimization techniques used across the full serving pipelines of major LLM providers such as Anthropic, OpenAI, and Gemini, organized into 9 layers. Covering everything from INT4 quantization of the weights to model cascading at the application edge, it systematically maps the core performance-optimization stack needed to run LLMs in production.

https://x.com/_avichawla/status/2045224379718791273

#llm #optimization #serving #quantization #ai

Avi Chawla (@_avichawla) on X

Anthropic. OpenAI. Gemini. Every production LLM runs on a stack of optimizations, not a single trick. I mapped out 72 of them across the full serving pipeline, grouped into 9 layers, from INT4 quantization at the weights all the way to model cascading at the application edge.

X (formerly Twitter)
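The first layer the thread names, weight quantization, can be illustrated with a toy sketch. This is not the thread's code; `quantize_int4` and the group size are illustrative choices, showing symmetric per-group INT4 quantization in NumPy:

```python
import numpy as np

def quantize_int4(w, group_size=32):
    """Symmetric per-group INT4 quantization: each group of weights
    shares one fp32 scale; values are rounded to integers in [-8, 7]."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map max |w| -> 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale, shape):
    """Recover approximate fp32 weights from int4 codes and group scales."""
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s, w.shape)
err = np.abs(w - w_hat).mean()  # small per-weight reconstruction error
```

Production INT4 schemes (e.g. GPTQ, AWQ) add calibration and bit-packing on top; the point here is only that each group of weights shares one scale and stores 4-bit integers, cutting memory roughly 4x versus fp16.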

bstn (@bstnxbt)

dflash-mlx v0.1.1 has been released. dflash-serve now supports tools, reasoning, streaming, and OpenAI-compatible serving, and integrates with OpenCode, aider, Continue, and Open WebUI. It is also available via oMLX. A feature-expansion update for an AI-development serving framework.

https://x.com/bstnxbt/status/2044115438443893030

#ai #serving #opensource #openai #tooling

bstn 👁️ (@bstnxbt) on X

dflash-mlx v0.1.1 dflash-serve now supports tools, reasoning, streaming, and full OpenAI-compatible serving. Works with OpenCode, aider, Continue, Open WebUI. Also available via oMLX (thanks jundot). https://t.co/Co31JoPAms

X (formerly Twitter)

Base Camp Bernie (@basecampbernie)

A shared report of serving concurrent agents at high bandwidth, suggesting that multi-agent inference/serving optimization is working impressively in practice.

https://x.com/basecampbernie/status/2042661495864177074

#agents #serving #multitasking #aiinfra

Base Camp Bernie (@basecampbernie) on X

@AiXsatoshi Yes, concurrent agents served with that bandwidth. It is wonderful to see.

X (formerly Twitter)

Ivan Fioravanti ᯅ (@ivanfioravanti)

A question asking whether Ollama supports continuous batching of concurrent requests. It is a notable developer-tooling inquiry tied to LLM serving performance and throughput optimization.

https://x.com/ivanfioravanti/status/2042622686128476553

#ollama #llm #serving #batching #inference

Ivan Fioravanti ᯅ (@ivanfioravanti) on X

Does @ollama support Continuous batching of concurrent requests? 🤔

X (formerly Twitter)
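The technique the question refers to can be sketched as a toy scheduler. This is an illustration of the concept only, not Ollama's internals: at each decode step, finished sequences free their batch slot and queued requests are admitted immediately, instead of waiting for the whole batch to drain.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy continuous-batching scheduler. `requests` is a list of
    (request_id, tokens_to_generate); returns (decode_steps, finish_order)."""
    queue = deque(requests)
    active = {}               # request_id -> tokens remaining
    steps, finished = 0, []
    while queue or active:
        # admit queued requests into any free batch slots
        while queue and len(active) < max_batch:
            rid, length = queue.popleft()
            active[rid] = length
        # one decode step generates one token for every active sequence
        steps += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                finished.append(rid)
                del active[rid]  # slot freed mid-batch for the next request
    return steps, finished

reqs = [("a", 8), ("b", 2), ("c", 2), ("d", 2), ("e", 2)]
steps, order = continuous_batching(reqs, max_batch=2)
```

For these five requests, static batching (draining each batch of 2 fully before admitting the next) would take 12 decode steps, since the short requests wait behind the 8-token one; the continuous scheduler above finishes in 8.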

Avi Chawla (@_avichawla)

A technical tweet comparing LLM inference speed with and without KV caching and explaining how and why KV caching improves performance. Useful for developers interested in LLM serving optimization and inference efficiency.

https://x.com/_avichawla/status/2035084029062750714

#llm #inference #kvcaching #optimization #serving

Avi Chawla (@_avichawla) on X

LLM inference speed with vs. without KV caching: (learn how and why it works below)

X (formerly Twitter)
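Why KV caching speeds up decoding can be shown with a single-head toy attention in NumPy (an illustration under simplified assumptions, not a production kernel): without the cache, the K/V projections of the entire prefix are recomputed at every step; with the cache, only the newest token is projected and appended.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((16, d))  # embeddings of a 16-token sequence

def attend(q, K, V):
    """Single-query scaled dot-product attention with a stable softmax."""
    scores = q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

# Without KV cache: every step re-projects K/V for the whole prefix.
proj_no_cache, out_no_cache = 0, []
for t in range(1, len(tokens) + 1):
    K, V = tokens[:t] @ Wk, tokens[:t] @ Wv  # recomputed from scratch
    proj_no_cache += t
    out_no_cache.append(attend(tokens[t - 1] @ Wq, K, V))

# With KV cache: only the newest token is projected, then appended.
proj_cache, out_cache = 0, []
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
for t in range(len(tokens)):
    k_new, v_new = tokens[t] @ Wk, tokens[t] @ Wv
    proj_cache += 1
    K_cache = np.vstack([K_cache, k_new])
    V_cache = np.vstack([V_cache, v_new])
    out_cache.append(attend(tokens[t] @ Wq, K_cache, V_cache))
```

For 16 tokens the uncached loop performs 1+2+…+16 = 136 K/V projections versus 16 with the cache, while producing numerically identical outputs; the gap grows quadratically with sequence length.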

No, we don't need high-protein boxed mac and cheese, experts say. But people want it

Kraft Heinz has just announced it's launching a high-protein mac and cheese called PowerMac that delivers 17 grams of protein and six grams of fibre per serving. But did we ... need this?

https://www.cbc.ca/news/canada/kraft-dinner-protein-9.7136154?cmp=rss

Whitney Port Talks Serving '90s-Inspired Tennis Looks and the Beauty Essentials in Her Match-Day Bag

https://misryoum.com/us/us24/whitney-port-talks-serving-90s-inspired-tennis-looks/

Serving in more ways than one! Whitney Port is bringing back '90s-inspired tennis fashion in a big way. The Hills alum sat down with ET to dish on her sporty style and the beauty essentials she relies on for match-day...

#Whitney #Port #Talks #Serving #90sInspired #Tennis #Looks #and #the #Beauty #Essentials #Her #MatchDay #Bag #US_News_Hub #misryoum_com

US News Hub

Serve, Don't Self-Care - Tony Robbins on The Diary of a CEO

#serving #service

AISatoshi (@AiXsatoshi)

A tweet presenting an example command for serving the MiniMaxAI/MiniMax-M2.5 model with vLLM: tensor-parallel-size set to 4, minimax_m2 as the tool-call parser, minimax_m2_append_think as the reasoning parser, and enable-auto-tool-choice turned on, illustrating how to configure a vLLM-based tool-calling and reasoning pipeline.

https://x.com/AiXsatoshi/status/2022313584160919865

#vllm #minimax #serving #tooling

AI✖️Satoshi⏩️ (@AiXsatoshi) on X

vLLM
vllm serve MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 4 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --enable-auto-tool-choice
https://t.co/zEIRPr6kNl

X (formerly Twitter)
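With a server like the one above running, requests go to its OpenAI-compatible endpoint. A sketch of such a request body, assuming vLLM's default /v1/chat/completions route; the get_weather tool schema and the user message are hypothetical, not taken from the tweet:

```python
import json

# Illustrative OpenAI-compatible chat-completions request body for a
# `vllm serve` instance; the tool schema here is a made-up example.
payload = {
    "model": "MiniMaxAI/MiniMax-M2.5",
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"},
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",  # pairs with --enable-auto-tool-choice
}
body = json.dumps(payload)  # POST body for /v1/chat/completions
```

With --enable-auto-tool-choice and the minimax_m2 tool-call parser, the server decides per request whether the model's output should be returned as plain text or parsed into a structured tool call.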