@enigma
Well, a good bit of AI work already runs on #FP8 or #FP4 ... ;)
With FP4, two floating-point values fit into one byte .... Wonder if something could be done with that....? ;)
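To make the two-values-per-byte idea concrete: a minimal pure-Python sketch, assuming the common E2M1 layout (1 sign, 2 exponent, 1 mantissa bit; representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6). Real formats like NVFP4 add a shared scale factor per small block of values on top of this.

```python
# FP4 (E2M1): 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit.
# The eight positive representable magnitudes:
E2M1_POS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def encode_fp4(x: float) -> int:
    """Round x to the nearest FP4 value and return its 4-bit code."""
    sign = 1 if x < 0 else 0
    mag = min(abs(x), 6.0)  # clamp to the FP4 maximum
    code = min(range(8), key=lambda i: abs(E2M1_POS[i] - mag))
    return (sign << 3) | code

def decode_fp4(code: int) -> float:
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_POS[code & 0b0111]

def pack_pair(a: float, b: float) -> int:
    """Pack two FP4 values into one byte: a in the low nibble, b in the high."""
    return encode_fp4(a) | (encode_fp4(b) << 4)

def unpack_pair(byte: int) -> tuple:
    return decode_fp4(byte & 0x0F), decode_fp4(byte >> 4)

print(unpack_pair(pack_pair(1.5, -3.0)))  # -> (1.5, -3.0)
```

Note how coarse the grid is: anything between 4 and 6 rounds to one of those two values, which is why FP4 only works with per-block scaling.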

cedric (@cedric_chee)

Mistral's new model 'Mistral Small 4 119B A6B' is presented as a versatile model that unifies Magistral's reasoning ability, Pixtral's multimodal features, and Devstral's agentic coding performance, with adjustable reasoning effort. Weights in FP8 or NVFP4 format are available for download on Hugging Face.

https://x.com/cedric_chee/status/2033695928167899294

#mistral #llm #multimodal #huggingface #fp8

cedric (@cedric_chee) on X

Mistral Small 4 119B A6B combines Magistral's reasoning, Pixtral's multimodal capabilities, and Devstral's agentic coding strengths into a single versatile model with configurable reasoning effort. Download FP8 or NVFP4 weights on HF.

X (formerly Twitter)

Andrej Karpathy (@karpathy)

nanochat now trains a GPT-2-capability model in about 2 hours on a single 8×H100 node (down from about 3 hours a month ago). fp8 support, assorted tuning, and switching the dataset away from FineWeb-edu are cited as the main improvements, a technical step closer to near-interactive training.

https://x.com/karpathy/status/2029701092347630069

#nanochat #gpt2 #training #h100 #fp8

Andrej Karpathy (@karpathy) on X

nanochat now trains GPT-2 capability model in just 2 hours on a single 8XH100 node (down from ~3 hours 1 month ago). Getting a lot closer to ~interactive! A bunch of tuning and features (fp8) went in but the biggest difference was a switch of the dataset from FineWeb-edu to


Qwen (@Alibaba_Qwen)

Announcement that FP8 weights for the Qwen 3.5 Medium model series are now open and ready for deployment. Native support for vLLM and SGLang is included, and example code is provided on the model card. FP8 precision can streamline workflows; the weights are available on Hugging Face.

https://x.com/Alibaba_Qwen/status/2026682179305275758

#qwen3.5 #fp8 #vllm #huggingface #sglang

Qwen (@Alibaba_Qwen) on X

🔥 Qwen 3.5 Medium Model Series FP8 weights are now open and ready for deployment! Native support for vLLM and SGLang. Check the model card for example code. ⚡️ Optimize your workflow with FP8 precision. 👇 Get the weights: Hugging Face:https://t.co/3MSb7miq68

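For intuition about what FP8 weights actually store, here is a pure-Python model of round-to-nearest E4M3FN quantization (1 sign, 4 exponent bits with bias 7, 3 mantissa bits, finite range ±448). This only illustrates the number format; it is not the code from the model card, and real deployments in vLLM or SGLang pair the format with per-tensor or per-channel scales and fused kernels.

```python
def decode_e4m3(code: int) -> float:
    """Decode an 8-bit E4M3FN code: 1 sign, 4 exponent bits (bias 7), 3 mantissa bits."""
    sign = -1.0 if code & 0x80 else 1.0
    exp, mant = (code >> 3) & 0xF, code & 0x7
    if exp == 0:  # zero and subnormals
        return sign * (mant / 8.0) * 2.0 ** -6
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)

# All finite values; codes 0x7F and 0xFF are NaN in the FN variant.
E4M3_VALUES = sorted(decode_e4m3(c) for c in range(256) if (c & 0x7F) != 0x7F)

def quantize_e4m3(x: float) -> float:
    """Round-to-nearest onto the E4M3 grid, clamping beyond +/-448."""
    x = max(-448.0, min(448.0, x))
    return min(E4M3_VALUES, key=lambda v: abs(v - x))

print(quantize_e4m3(0.3))  # -> 0.3125 (nearest representable to 0.3)
```

With only 254 finite values, the usable dynamic range per tensor is narrow, which is exactly why scaling factors matter in practice.
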
Diving into LTXV: my latest video diffusion experiments.

I’ve been experimenting with LTXV (ltxv-2b-0.9.8-distilled-fp8), combined with the text encoder umt5_xxl_fp8_e4m3fn_scaled.

The renderings showcase the hackercat, cherry blossoms, and a surreal city tour.

What it does:
- Generates latent video clips from text prompts
- Can produce a wide range of scenes, from surreal to photorealistic and beyond
- Perfect for short 1-2 second clips with creative prompts

Caution! 12 GB VRAM is tight:
- On my RX 6700 XT, it easily runs into OOM
- Frames, steps, and resolution need careful tuning
- FP8 helps, but some layers get upcast → memory can still fill up

Conclusion: Extremely powerful, but you need to tweak VRAM and settings to get stable results.

#AI #VideoDiffusion #LTXV #FP8 #GPU #CreativeAI #ShortVideos #Surreal #Photorealistic #StableVRAM #RX6700XT #AMD #ROCm #ComfyUI
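A back-of-envelope check on why 12 GB fills up: weight memory is parameter count times bytes per element, and every layer upcast from fp8 to fp16 doubles its share. A sketch with hypothetical layer splits (the real LTXV upcast set differs), ignoring activations and the umt5 text encoder, which add several GiB more:

```python
def weight_gib(layers):
    """Weight memory in GiB from (param_count, bytes_per_param) pairs."""
    return sum(n * b for n, b in layers) / 2**30

# Hypothetical 2B-parameter model: everything in fp8 (1 byte/param)
# versus 20% of the parameters upcast to fp16 (2 bytes/param).
all_fp8 = [(2_000_000_000, 1)]
mixed   = [(1_600_000_000, 1), (400_000_000, 2)]

print(round(weight_gib(all_fp8), 2))  # -> 1.86
print(round(weight_gib(mixed), 2))    # -> 2.24
```

Frames × resolution drive activation memory on top of this, which is why tuning those knobs matters more than the weight format alone.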

Awni Hannun (@awnihannun)

MLX's CUDA backend has improved, with faster startup times and better performance overall. The author reports running Qwen3 4B in fp8 on a DGX Spark, processing 18,500 tokens in under 4 seconds and generating at 32.5 tokens/sec with an 18,500-token context, a real-world performance gain at large context sizes.

https://x.com/awnihannun/status/2020576431307452682

#mlx #cuda #qwen3 #fp8 #dgx

Awni Hannun (@awnihannun) on X

MLXs CUDA backend is getting better. It's especially nice if you appreciate fast startup times. But it's also quite fast in general. Here's Qwen3 4B in fp8 running on my DGX Spark. - Processed 18.5k tokens in < 4 seconds - Generates at 32.5 tok/sec with 18.5k context

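The quoted figures imply a prefill rate two orders of magnitude above the decode rate, which is expected: prompt processing is compute-bound and parallel, while token-by-token generation is memory-bandwidth-bound. Checking the arithmetic:

```python
prompt_tokens = 18_500
prefill_seconds = 4.0        # "< 4 seconds", so the rate below is a lower bound
decode_tok_per_sec = 32.5

prefill_rate = prompt_tokens / prefill_seconds
ratio = prefill_rate / decode_tok_per_sec

print(f"prefill >= {prefill_rate:.0f} tok/s, ~{ratio:.0f}x the decode rate")
# -> prefill >= 4625 tok/s, ~142x the decode rate
```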

Andrej Karpathy (@karpathy)

Reports that enabling FP8 training improved 'time to GPT-2' by 4.3%, down to 2.91 hours, and that at 8×H100 spot-instance prices reproducing GPT-2 costs only about $20. He recalls the original GPT-2 release controversy and highlights today's economics and performance gains.

https://x.com/karpathy/status/2018804068874064198

#fp8 #training #gpt2 #h100 #optimization

Andrej Karpathy (@karpathy) on X

Enabled fp8 training for +4.3% improvement to "time to GPT-2", down to 2.91 hours now. Also worth noting that if you use 8XH100 spot instance prices, this GPT-2 repro really only costs ~$20. So this is exciting - GPT-2 (7 years ago): too dangerous to release. GPT-2 (today): new

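The ~$20 figure is easy to sanity-check; the spot rate below is inferred from the post's numbers, not quoted from it:

```python
hours = 2.91        # time to GPT-2 from the post
gpus = 8            # one 8xH100 node
total_cost = 20.0   # the ~$20 estimate

gpu_hours = hours * gpus               # 23.28 GPU-hours
implied_rate = total_cost / gpu_hours  # implied spot price per GPU-hour

print(f"{gpu_hours:.2f} GPU-hours -> ${implied_rate:.2f}/GPU-hour")
# -> 23.28 GPU-hours -> $0.86/GPU-hour
```

So the claim holds at roughly $0.86 per H100 spot GPU-hour, well within typical spot pricing.
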
YES SUCCEEDED!!!

Just rendered an image at 944×1152 (slightly above 1024×1024) using Flux1-Schnell-FP8 on my 6700 XT, and it works! (Image 1 is the Real-ESRGAN 2× upscaled version)

Workflow 1: Sampling (Image 2)

Prompt executed → UNet generates the latent

Step 1 (model load + latent generation) took 419 seconds

Output: Latent tensor saved to disk

Workflow 2: VAE Decode (Image 3)

Latent loaded → VAE decodes the image

Duration: 7.5 seconds

Advantage: UNet doesn’t need to stay in VRAM → VRAM freed, even on 12 GB GPUs

The problem with the stock LoadLatent Node

Dropdown only shows files if they were produced / annotated by a previous SaveLatent Node

Node is designed to pass latents inside a graph, not load arbitrary files from disk

Purpose: prevents accidentally loading wrong files

Workaround (Image 4)

Edited /ComfyUI/nodes.py, class LoadLatent

Hardcoded latent path → Node now loads directly from disk

Result: Workflow 2 runs instantly, UNet can be unloaded

Timing

Step 1 (model load + latent generation): 419 s

Step 2 (VAE decode): 7.5 s

Result: High-res images on a 12 GB RDNA2 GPU are now possible on Flux1-Schnell-FP8 without ComfyUI crashing! (Image 5 is the original output)

This might actually become my new Flux workflow: render quick 512×512 previews first (which works perfectly on RDNA2 GPUs), sort out the good ones, extract the seed from the PNG metadata, and then re-render only the selected images with the same seed using the split workflow at higher resolutions. This way, high-resolution Flux1-Schnell-FP8 renders become possible on 12 GB RDNA2 GPUs D:
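The seed-extraction step can be scripted: ComfyUI writes generation metadata into PNG tEXt chunks (the exact keyword, such as "prompt", depends on the save node used), and tEXt chunks are trivial to parse. A minimal parser, demonstrated on a hand-built PNG so it runs standalone; in practice you would point `png_text_chunks` at the bytes of a real output file:

```python
import struct, zlib

def png_text_chunks(data: bytes) -> dict:
    """Return all tEXt chunks in a PNG as a {keyword: value} dict."""
    assert data[:8] == b"\x89PNG\r\n\x1a\n", "not a PNG"
    out, pos = {}, 8
    while pos < len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        if ctype == b"tEXt":
            key, _, val = data[pos + 8:pos + 8 + length].partition(b"\x00")
            out[key.decode("latin-1")] = val.decode("latin-1")
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
    return out

def _chunk(ctype: bytes, body: bytes) -> bytes:
    crc = zlib.crc32(ctype + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)

# Build a minimal 1x1 grayscale PNG header carrying a "seed" tEXt chunk
# (no IDAT, so not renderable -- just enough to exercise the parser).
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
png = (b"\x89PNG\r\n\x1a\n"
       + _chunk(b"IHDR", ihdr)
       + _chunk(b"tEXt", b"seed\x00123456789")
       + _chunk(b"IEND", b""))

print(png_text_chunks(png)["seed"])  # -> 123456789
```

With this, the "sort previews, re-render keepers" loop could be automated end to end.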

Question at the end: Has anyone ever done this before? Because I have no clue xD

#ComfyUI #flux #Flux1SchnellFP8 #FP8 #AMD #RDNA2 #VAE #AIArt #Pixelfed #HighResolution #GPUOptimization #LatentWorkflow #AIWorkflow #AIHacks #RealESRGAN #Upscale #AIExperiment #CreativeAI #DigitalArt #AICommunity #python #linux #opensource #foss
Maia 200: The AI accelerator built for inference - The Official Microsoft Blog

Today, we’re proud to introduce Maia 200, a breakthrough inference accelerator engineered to dramatically improve the economics of AI token generation. Maia 200 is an AI inference powerhouse: an accelerator built on TSMC’s 3nm process with native FP8/FP4 tensor cores, a redesigned memory system with 216GB HBM3e at 7 TB/s and 272MB of on-chip SRAM, plus...

The Official Microsoft Blog