Oh, the audacity! 🧐 A groundbreaking treatise on GPU optimization reduced to a masterclass in web browsing 101: turn on #JavaScript and #cookies, and maybe the secrets of the universe will reveal themselves. 🚀🔒 Clearly, the real optimization here is finding a website that works. 🙄
https://dl.acm.org/doi/10.1145/3669940.3707274 #GPUOptimization #WebBrowsing #TechHumor #InternetSecrets #HackerNews #ngated
Optimizing Datalog for the GPU | Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1

ACM Conferences

RT @Kimi_Moonshot: We are open-sourcing FlashKDA, our high-performance, CUTLASS-based implementation of Kimi Delta Attention kernels. It achieves a 1.72× to 2.22× prefill speedup over the flash-linear-attention baseline on H20 GPUs and works as a drop-in backend for flash-linear-attention.

more on Arint.info

#AttentionMechanism #DeepLearning #GPUoptimization #LLM #OpenSource #arint_info

https://x.com/Kimi_Moonshot/status/2046607915424034839#m

The Hidden Engineering Behind Fast AI: How LLM Inference Actually Works

A deep dive into PagedAttention, speculative decoding, FlashAttention, and continuous batching — the clever tricks that make modern LLMs respond in milliseconds instead of minutes.

TechLife

YES SUCCEEDED!!!

Just rendered an image at 944×1152 (slightly more pixels than 1024×1024) using Flux1-Schnell-FP8 on my 6700 XT, and it works! (Image 1 is the Real-ESRGAN 2× upscaled version)

Workflow 1: Sampling (Image 2)

Prompt executed → UNet generates the latent

Step 1 (model load + latent generation) took 419 seconds

Output: Latent tensor saved to disk

Workflow 2: VAE Decode (Image 3)

Latent loaded → VAE decodes the image

Duration: 7.5 seconds

Advantage: UNet doesn’t need to stay in VRAM → VRAM freed, even on 12 GB GPUs

The problem with the stock LoadLatent Node

Dropdown only shows files if they were produced / annotated by a previous SaveLatent Node

Node is designed to pass latents inside a graph, not load arbitrary files from disk

Purpose: prevents accidentally loading wrong files

Workaround (Image 4)

Edited /ComfyUI/nodes.py, class LoadLatent

Hardcoded latent path → Node now loads directly from disk

Result: Workflow 2 runs on its own in seconds, and the UNet can be unloaded
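
A possible alternative to hardcoding the path in nodes.py, sketched under an assumption this post does not confirm: that the stock LoadLatent dropdown lists `.latent` files found in ComfyUI's input folder, while SaveLatent writes under the output folder. If that holds, copying the saved latents across would make them selectable without editing any node code. The function name `stage_latents` and both directory arguments are hypothetical.

```python
import shutil
from pathlib import Path


def stage_latents(output_dir, input_dir, pattern="*.latent"):
    """Copy saved latents into the folder the loader is ASSUMED to scan.

    output_dir / input_dir are hypothetical paths; adjust to your install.
    Returns the names of the files that were copied.
    """
    src = Path(output_dir)
    dst = Path(input_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    # rglob also picks up latents stored in subfolders of the output dir
    for latent in sorted(src.rglob(pattern)):
        target = dst / latent.name
        shutil.copy2(latent, target)  # keeps timestamps, overwrites same name
        copied.append(target.name)
    return copied
```

Called as `stage_latents("ComfyUI/output", "ComfyUI/input")` (paths are placeholders), it leaves the originals in place, so the sampling workflow's output is untouched.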

Timing

Step 1 (model load + latent generation): 419 s

Step 2 (VAE decode): 7.5 s

Result: High-res images on a 12 GB RDNA2 GPU are now possible on Flux1-Schnell-FP8 without ComfyUI crashing! (Image 5 is the original output)

This might actually become my new Flux workflow: render quick 512×512 previews first (which works perfectly on RDNA2 GPUs), pick the good ones, extract the seed from the PNG metadata, and then re-render only the selected images with the same seed using the split workflow at higher resolutions. This way, high-resolution Flux1-Schnell-FP8 renders become possible on 12 GB RDNA2 GPUs :D
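
The "extract the seed from the PNG metadata" step could look roughly like this. It is a minimal sketch, assuming the image carries a `prompt` tEXt chunk holding the node graph as JSON and that the sampler's seed sits in an input named `seed` or `noise_seed`; both are assumptions about how the file was saved, and the function names are made up for illustration. Only the standard library is used.

```python
import json
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_text_chunks(png_bytes):
    """Return {keyword: text} for every tEXt chunk in a PNG byte string."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    chunks = {}
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(png_bytes):
        # Each chunk: 4-byte big-endian length, 4-byte type, data, 4-byte CRC
        length, ctype = struct.unpack(">I4s", png_bytes[pos:pos + 8])
        data = png_bytes[pos + 8:pos + 8 + length]
        if ctype == b"tEXt":
            keyword, _, text = data.partition(b"\x00")
            chunks[keyword.decode("latin-1")] = text.decode("latin-1")
        pos += 12 + length
        if ctype == b"IEND":
            break
    return chunks


def find_seeds(node_graph):
    """Collect (node_id, input_name, value) for assumed seed inputs."""
    seeds = []
    for node_id, node in node_graph.items():
        inputs = node.get("inputs", {})
        for key in ("seed", "noise_seed"):  # assumed input names
            value = inputs.get(key)
            if isinstance(value, int) and not isinstance(value, bool):
                seeds.append((node_id, key, value))
    return seeds


def seeds_from_png(png_bytes):
    """Return all seeds found in the assumed 'prompt' metadata, or []."""
    text = read_text_chunks(png_bytes)
    if "prompt" not in text:
        return []
    return find_seeds(json.loads(text["prompt"]))
```

Usage would be something like `seeds_from_png(open("preview.png", "rb").read())`, where `preview.png` is a placeholder filename. The parser does not verify chunk CRCs, which keeps it short but means a corrupted file is not detected.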

Question at the end: Has anyone ever done this before? Because I have no clue xD

#ComfyUI #flux #Flux1SchnellFP8 #FP8 #AMD #RDNA2 #VAE #AIArt #Pixelfed #HighResolution #GPUOptimization #LatentWorkflow #AIWorkflow #AIHacks #RealESRGAN #Upscale #AIExperiment #CreativeAI #DigitalArt #AICommunity #python #linux #opensource #foss

⚡️ PP/s is up 90%, but TPS only improves 10–20% when using 2 GPUs (RTX Pro 6000 & 5090). Does anyone know how to optimize this? I'm running an AI server to provide a fast service! #AI #GPUOptimization #LlamaServer #MáyHọc #CôngNghệThôngTin

https://www.reddit.com/r/LocalLLaMA/comments/1qopgpp/llama_server_using_dual_gpus_pp_is_amazing_tps/
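
One way to read those numbers: if request time is modeled as prompt tokens / PP rate plus generated tokens / TG rate, the end-to-end gain depends on which phase dominates. Below is a minimal sketch of that arithmetic. The latency model is a simplification, the function name is hypothetical, and every number in the usage lines is a placeholder, not a measurement from the thread.

```python
def end_to_end_speedup(prompt_tokens, gen_tokens,
                       pp_rate, tg_rate, pp_gain, tg_gain):
    """Ratio of old to new request time under a simple two-phase model.

    pp_rate / tg_rate: baseline tokens per second for prompt processing
    and token generation. pp_gain / tg_gain: multipliers after the change
    (1.9 means "90% faster").
    """
    base = prompt_tokens / pp_rate + gen_tokens / tg_rate
    new = (prompt_tokens / (pp_rate * pp_gain)
           + gen_tokens / (tg_rate * tg_gain))
    return base / new


# Hypothetical workloads: long prompt with short answer, and the reverse
for prompt, gen in [(8000, 200), (200, 2000)]:
    s = end_to_end_speedup(prompt, gen, pp_rate=2000, tg_rate=40,
                           pp_gain=1.9, tg_gain=1.15)
    print(f"prompt={prompt} gen={gen} -> {s:.2f}x overall")
```

In the limiting cases the result equals `pp_gain` (no generated tokens) or `tg_gain` (no prompt), so a generation-heavy workload sees little of a prefill-only improvement.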

Explore Microsoft's phi2 AI model, suitable for running on a PC with 12GB RAM + 3GB VRAM + GTX 1050 + Linux Mint. Phi2 is quantized to Q4K, optimized for performance on mid-range GPUs. Try downloading it from Hugging Face or TheBloke and experience this non-commercial AI model! #AIModel #Linux #TechVietnam #LocalLLaMA #Phi2 #GPUOptimization #AICommunity

https://www.reddit.com/r/LocalLLaMA/comments/1qm2yns/any_good_model_for_12_gb_ram_3_gb_vram_gtx_1050/

Qwen3 Next 80B with a 250k-token context runs entirely on a single 7900 XTX GPU (24 GB) at 41 tok/s. Uses IQ2_XSS quantization, Q4_0 KV & FA. A big change for LLM applications on a single card, with excellent code-handling ability. #Qwen3 #AILocal #GPUOptimization #LocalLLM #AIProgramming #MôHìnhHóaAI #LậpTrìnhViên

https://www.reddit.com/r/LocalLLaMA/comments/1pnnkxc/qwen3_next_80b_w_250k_tok_context_fits_fully_on/
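
A rough sketch of why a Q4_0 KV cache helps with a 250k-token context: the cache grows linearly with context length, and Q4_0 packs 32 values into 18 bytes, against 64 bytes for f16. The layer count, KV-head count and head size below are hypothetical placeholders, not Qwen3 Next's real architecture, and the formula only counts layers that keep a standard K/V cache.

```python
Q4_0_BYTES_PER_ELEMENT = 18 / 32  # 16 bytes of 4-bit values + 2-byte scale
F16_BYTES_PER_ELEMENT = 2.0


def kv_cache_bytes(n_ctx, n_attn_layers, n_kv_heads, head_dim,
                   bytes_per_element):
    """Estimate KV cache size: K and V, per layer, per head, per token."""
    return (2 * n_ctx * n_attn_layers * n_kv_heads * head_dim
            * bytes_per_element)


# Hypothetical model shape, chosen only to illustrate the scaling
shape = dict(n_ctx=250_000, n_attn_layers=12, n_kv_heads=2, head_dim=256)
for name, width in [("f16", F16_BYTES_PER_ELEMENT),
                    ("q4_0", Q4_0_BYTES_PER_ELEMENT)]:
    gib = kv_cache_bytes(bytes_per_element=width, **shape) / 2**30
    print(f"{name}: {gib:.2f} GiB")
```

Whatever the real shape is, the ratio between the two cache types stays at 64/18, about 3.6×, since both scale with the same product of dimensions.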

A 5060ti setup with upgraded RAM (6000MHz) and a switch to CUDA raised LLaMA speed from 22 t/s to nearly 37 t/s. Cost ~$2200, less than a 5090. #GPUoptimization #LLaMA #AI #tech #Performance #TốiMAXGPU #LLaMAtrong #Tètresjpg #nghiencoded #xuấtkho

https://www.reddit.com/r/LocalLLaMA/comments/1oe8v21/5060ti_chads_ram_overclocking_the_phantom_menace/

Lenovo launches GPU Advanced Services, promising up to 30 percent faster AI performance

https://web.brid.gy/r/https://nerds.xyz/2025/09/lenovo-gpu-ai/

🎨💡 Imagine spending hours optimizing a GPU only to discover it's as pointless as a #penguin with a solar panel. 🤔 But hey, at least it makes for a riveting blog post nobody will read! 📚🔍
https://blog.speechmatics.com/pointless-gpu-optimization-exercise #GPUoptimization #blogpost #humor #techfails #HackerNews #ngated
An Almost Pointless Exercise in GPU Optimization | Speechmatics

Experience converting a multi-threaded C++ application to run faster on GPU. How to interpret Nsight Compute recommendations to improve an algorithm on GPU.