Just in: The ultimate 2026 guide to Linux server performance optimization! Learn advanced techniques for kernel tuning, I/O schedulers, memory management, and eBPF monitoring to achieve 30-60% throughput gains. Perfect for sysadmins and DevOps professionals. #LinuxTuning #ServerPerformance #DevOps #KernelOptimization #eBPF
https://estoreab.com/linux-server-performance-optimization-guide

AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

AutoKernel is an open-source framework that automatically optimizes GPU kernels for PyTorch models. Through agent-driven iterative search it locates bottlenecks and refines Triton and CUDA C++ kernels over hundreds of experiments. A five-stage validation procedure guarantees kernel correctness, and on an NVIDIA H100 it achieves speedups of up to 5.29x over the PyTorch baseline and up to 3.44x over the existing autotune path. It supports nine of the key operations in transformer architectures and took first place on a community benchmark leaderboard, which makes it promising for practical use. A rough sketch of how the bottleneck-prioritization step might work follows this post.

https://arxiv.org/abs/2603.21331

#gpu #kerneloptimization #pytorch #triton #autonomousagent
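As a rough illustration of the bottleneck-prioritization idea mentioned above, here is a minimal Python sketch that ranks profiled operations by the whole-model speedup Amdahl's law allows. The names, numbers, and data structures are assumptions made for the example, not AutoKernel's actual API.

```python
# Hypothetical sketch: rank profiled ops by Amdahl's-law impact.
# OpProfile, amdahl_priority, and the sample numbers are illustrative only.
from dataclasses import dataclass

@dataclass
class OpProfile:
    name: str           # e.g. "rmsnorm", "softmax"
    time_ms: float      # measured time per forward pass
    est_speedup: float  # optimistic speedup a tuned kernel might reach

def amdahl_priority(ops: list[OpProfile]) -> list[tuple[str, float]]:
    """Rank ops by the end-to-end speedup Amdahl's law allows if only
    that op is optimized: S = 1 / ((1 - f) + f / s)."""
    total = sum(op.time_ms for op in ops)
    ranked = []
    for op in ops:
        f = op.time_ms / total        # fraction of total runtime
        s = op.est_speedup            # local speedup of that op
        ranked.append((op.name, 1.0 / ((1.0 - f) + f / s)))
    return sorted(ranked, key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    profile = [
        OpProfile("rmsnorm", 4.0, 5.0),
        OpProfile("softmax", 2.5, 3.0),
        OpProfile("matmul", 12.0, 1.1),  # already near library peak
    ]
    for name, gain in amdahl_priority(profile):
        print(f"{name}: predicted end-to-end gain {gain:.2f}x")
```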

AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

Writing high-performance GPU kernels is among the most labor-intensive tasks in machine learning systems engineering. We present AutoKernel, an open-source framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. Given a model, AutoKernel profiles it to identify computational bottlenecks, ranks them by Amdahl's law impact, and iteratively refines Triton or CUDA C++ kernel implementations through hundreds of experiments without human intervention. A five-stage correctness harness covering smoke tests, shape sweeps, numerical stability, determinism verification, and edge-case coverage ensures every candidate kernel is validated before any speedup is recorded. The system comprises over 9,000 lines of Python, 18 starter kernel implementations across two backends, a six-tier optimization playbook, and integration with the KernelBench benchmark suite. AutoKernel covers nine kernel types spanning the dominant operations in modern transformer architectures. On an NVIDIA H100, our Triton kernels outperform both PyTorch eager and torch.compile (max-autotune) on the majority of tested configurations: 5.29x over eager on RMSNorm, 2.82x on softmax, and 2.21x on cross-entropy, while beating torch.compile by 2.83x, 3.44x, and 2.94x respectively. In community deployment, an AutoKernel-optimized kernel achieved first place on the vectorsum_v2 B200 leaderboard. The full system is available at https://github.com/RightNow-AI/autokernel.
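To make the five-stage correctness harness concrete, here is a minimal Python sketch of what such a gate might look like before any speedup is recorded. The reference_rmsnorm function, the validate signature, and the tolerances are assumptions for illustration; this is not the project's actual code.

```python
# Hypothetical sketch of a five-stage correctness gate (smoke test, shape sweep,
# numerical stability, determinism, edge cases). Requires a CUDA-capable GPU.
import torch

def reference_rmsnorm(x, w, eps=1e-6):
    # PyTorch eager reference the candidate kernel is checked against.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * w

def validate(candidate, reference=reference_rmsnorm, device="cuda"):
    def close(a, b, rtol=1e-3, atol=1e-3):
        return torch.allclose(a, b, rtol=rtol, atol=atol)

    # 1. Smoke test: one typical shape must match the reference.
    x = torch.randn(8, 1024, device=device); w = torch.ones(1024, device=device)
    assert close(candidate(x, w), reference(x, w)), "smoke test failed"

    # 2. Shape sweep: uneven and non-power-of-two sizes.
    for rows, cols in [(1, 17), (3, 255), (64, 4096), (257, 1000)]:
        x = torch.randn(rows, cols, device=device)
        w = torch.randn(cols, device=device)
        assert close(candidate(x, w), reference(x, w)), f"shape {(rows, cols)} failed"

    # 3. Numerical stability: large and small magnitudes must stay finite.
    for scale in (1e-4, 1e4):
        x = torch.randn(16, 2048, device=device) * scale
        w = torch.randn(2048, device=device)
        assert torch.isfinite(candidate(x, w)).all(), f"non-finite at scale {scale}"

    # 4. Determinism: identical input must give bitwise-identical output.
    x = torch.randn(32, 1024, device=device); w = torch.randn(1024, device=device)
    assert torch.equal(candidate(x, w), candidate(x, w)), "non-deterministic kernel"

    # 5. Edge cases: degenerate shapes and all-zero inputs.
    x = torch.zeros(4, 1, device=device); w = torch.ones(1, device=device)
    assert torch.isfinite(candidate(x, w)).all(), "edge case failed"

    return True  # only after all five stages would a speedup be recorded
```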
