Mastodawn

sayzard Feb 12

AISatoshi (@AiXsatoshi)

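The INT4-quantized version of the model is reported to be 405 GB. The tweet simply states the storage and distribution footprint of the INT4-quantized model, which is useful for capacity planning in local deployments.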

https://x.com/AiXsatoshi/status/2021877728904118582

#int4 #quantization #model #modelsize

AI✖️Satoshi⏩️ (@AiXsatoshi) on X

INT4 405GB https://t.co/0QA5zYYBj3

X (formerly Twitter)

sayzard Jan 29

cedric (@cedric_chee)

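A tweet sharing results from running local inference with an INT4 quant of Kimi K2.5 on eight RTX Pro 6000 GPUs (8x). Throughput ranged from 8 to 40 TPS. The model answered both a classic reasoning problem (the father-surgeon riddle) and a word-counting task correctly, with thinking times of about 58 s and 55 s respectively. The key takeaways are the local INT4 quant performance and the latency and throughput figures.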

https://x.com/cedric_chee/status/2016868174004969710

#kimi #int4 #quantization #localinference #rtx6000

cedric (@cedric_chee) on X

Core reasoning tests. Local Kimi K2.5 Thinking INT4 quant running on 8x RTX Pro 6000. 8–40 TPS. 1) A classic father-surgeon riddle Got it right. Thought for 58 s. 2) Counting words What is the fourth word in your response to this message? Answer correct. Thought for 55 s.

X (formerly Twitter)

Reddit Tech VN Bot Jan 27

Moonshot AI's new Kimi K2.5 model makes a splash with 1 trillion parameters, of which only 32B are active per token. Its advanced MoE architecture has 384 experts, selects the top 8 plus 1 shared expert, and supports native INT4 thanks to QAT. It beats GPT-5 on Humanity's Last Exam (50.2% vs 41.7%) and nearly matches GPT-5 on LiveCodeBench (83.1%). It supports internal "thinking" in the style of System 2. It can run on 4x H100, opening the way to running large models locally. #KimiK25 #AI #LLM #MoE #Int4 #Reasoning #TríTuệNhânTạo #AI ViệtNam #MôHìnhNgônNgữ #M
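For capacity planning, the parameter count and bit width in the post above translate into a rough weight-storage figure. The sketch below is back-of-the-envelope arithmetic only. It assumes every weight is stored at the same bit width and ignores quantization scales, any layers kept at higher precision, the KV cache, and activations, so real checkpoints will differ. The function name `weight_storage_gb` is made up for illustration.

```python
def weight_storage_gb(num_params: float, bits_per_weight: float) -> float:
    """Estimate raw weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_weight / 8  # 8 bits per byte
    return total_bytes / 1e9


# 1 trillion parameters, the figure quoted in the post.
total_params = 1e12

print(weight_storage_gb(total_params, 4))   # all weights at 4 bits
print(weight_storage_gb(total_params, 16))  # all weights at 16 bits, for comparison
```

Under these assumptions, going from 16-bit to 4-bit weights cuts raw storage by a factor of four.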

sayzard Jan 15

金のニワトリ (@gosrum)

GLM-Image was slow when quantized to 4 bits. The cause was that it was being quantized as INT4; after switching to nf4, processing became more than 3x faster, according to this real-world performance report.

https://x.com/gosrum/status/2011574959890710823

#glmimage #quantization #nf4 #int4 #performance

金のニワトリ (@gosrum) on X

On the issue of GLM-Image being slow with 4-bit quantization: it turns out it was being quantized as INT4, and switching to nf4 made it more than 3x faster

X (formerly Twitter)
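For readers unfamiliar with the two 4-bit formats named above, here is a minimal sketch of how they differ numerically. It says nothing about speed, which depends on the library and kernels in use and which the post does not detail. INT4 is shown as plain symmetric round-to-nearest with one scale per block. NF4 is shown as a lookup into 16 unevenly spaced levels; the values are rounded to four decimals from the table published with QLoRA and bitsandbytes, so treat them as approximate. Function names are made up for illustration.

```python
def quantize_int4(values):
    """Uniform symmetric INT4: 16 evenly spaced integer codes in [-8, 7]."""
    scale = max(abs(v) for v in values) / 7 or 1.0  # fall back to 1.0 for an all-zero block
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return [c * scale for c in codes]  # dequantized values


# Approximate NF4 levels in [-1, 1], spaced more densely near zero.
NF4_LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
              0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]


def quantize_nf4(values):
    """Absmax-normalize the block, then snap each value to the nearest NF4 level."""
    absmax = max(abs(v) for v in values) or 1.0
    return [absmax * min(NF4_LEVELS, key=lambda q: abs(q - v / absmax))
            for v in values]


block = [0.02, -0.05, 0.11, -0.3, 1.0]
print(quantize_int4(block))
print(quantize_nf4(block))
```

Both functions return dequantized values so the two grids can be compared side by side on the same block.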

GOMOOT

Feb 12, 2025

💡 Snapdragon 6 Gen 4, Qualcomm's new mid-range processor

https://gomoot.com/snapdragon-6-gen-4-il-nuovo-processore-di-fascia-media-di-qualcomm/

#5g #blog #bluetooth 5.4 #cpu #gpu #int4 #kryo #lossless #lpddr5 #news #npu #picks #qualcomm #snapdragon6gen4 #tech #tecnologia #wifi6e

Snapdragon 6 Gen 4, the new mid-range processor

Snapdragon 6 Gen 4, Qualcomm's new processor, introduces on-device AI and improves performance, efficiency, and connectivity for mid-range smartphones

Gomoot: technology and lifestyle. Discover the latest in hardware, technology, and more

Victoria Stuart 🇨🇦 🏳️‍⚧️ Jun 23, 2023

Training Transformers with 4-bit Integers
https://arxiv.org/abs/2306.11987

... we propose a training method for transformers with matrix multiplications implemented with the INT4 arithmetic. Training with an ultra-low INT4 precision is challenging ... we carefully analyze the specific structures of activation & gradients in transformers to propose dedicated quantizers for them. For forward propagation, we identify ...

#ML #MachineLearning #parametrization #INT4 #NeuralNetworks #transformers #matrices

Training Transformers with 4-bit Integers

Quantizing the activation, weight, and gradient to 4-bit is promising to accelerate neural network training. However, existing 4-bit training methods require custom numerical formats which are not supported by contemporary hardware. In this work, we propose a training method for transformers with all matrix multiplications implemented with the INT4 arithmetic. Training with an ultra-low INT4 precision is challenging. To achieve this, we carefully analyze the specific structures of activation and gradients in transformers to propose dedicated quantizers for them. For forward propagation, we identify the challenge of outliers and propose a Hadamard quantizer to suppress the outliers. For backpropagation, we leverage the structural sparsity of gradients by proposing bit splitting and leverage score sampling techniques to quantize gradients accurately. Our algorithm achieves competitive accuracy on a wide range of tasks including natural language understanding, machine translation, and image classification. Unlike previous 4-bit training methods, our algorithm can be implemented on the current generation of GPUs. Our prototypical linear operator implementation is up to 2.2 times faster than the FP16 counterparts and speeds up the training by up to 35.1%.

arXiv.org
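The abstract mentions suppressing outliers with a Hadamard quantizer. The sketch below is a simplified illustration of the general idea of rotating a vector with a normalized Hadamard matrix before INT4 quantization; it is not the paper's algorithm, whose details are not given in the post. The vector is hand-picked, and all function names are made up for illustration.

```python
import math


def hadamard_transform(x):
    """Normalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b  # butterfly step
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in y]


def quantize_int4(values):
    """Symmetric round-to-nearest INT4 with one shared scale; returns dequantized values."""
    scale = max(abs(v) for v in values) / 7 or 1.0
    return [max(-8, min(7, round(v / scale))) * scale for v in values]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


x = [0.5, -0.5, 0.5, 8.0]  # one large outlier dominates the scale

direct = quantize_int4(x)
# The normalized Hadamard matrix is its own inverse, so the same transform undoes the rotation.
rotated = hadamard_transform(quantize_int4(hadamard_transform(x)))

print(mse(x, direct), mse(x, rotated))
```

On this hand-picked vector, direct quantization rounds the three small entries to zero, while the rotated path spreads the outlier across all entries and ends up with the lower error. Other inputs can behave differently.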