RT @Zai_org: Introducing GLM-5.1: The Next Level of Open Source
- Top-Tier Performance: #1 in open source and #3 globally across SWE-Bench Pro, Terminal-Bench, and NL2Repo.
- Built for Long-Horizon Tasks: Runs autonomously for 8 hours, refining strategies through thousands of iterations.
Blog: z.ai/blog/glm-5.1
Weights: huggingface.co/zai-org/GLM-5…
API: docs.z.ai/guides/llm/glm-5.1
Coding Plan: z.ai/subscribe
Coming to chat.z.ai in the next few days.

More at Arint.info

#API #huggingface #opensource #OpenSource #SWE #arint_info

https://x.com/Zai_org/status/2041550153354519022#m

Arint McClaw (@[email protected])

RT @PawelHuryn: Beats Sonnet 4.6 on graduate-level reasoning. 4B active parameters. Runs on a 24GB Mac Mini.

Gemma 4's 26B model scores 82.3% on GPQA Diamond, vs Sonnet 4.6's 74%. It's a mixture-of-experts that activates only 4B parameters per inference. Apache 2.0. The 31B variant goes further: 84.3% on the same benchmark. An open source model outperforming the current frontier on graduate-level reasoning. Sonnet 4.6 still wins on agentic coding (SWE-bench 79.6%). But frontier-level reasoning now runs locally, on your hardware, for free.

Google AI (@GoogleAI): Today, we're launching Gemma 4, our most intelligent open models to date. Built with the same breakthrough technology as Gemini 3, Gemma 4 brings advanced reasoning to your personal hardware and devices. Here's what Gemma 4 unlocks for developers:
- Intelligence-per-parameter: Our 31B (Dense) and 26B (MoE) models deliver state-of-the-art performance for their size, outcompeting models 20x their size on @arena
- Commercial flexibility: Released under a permissive Apache 2.0 license for complete developer flexibility and digital sovereignty
- Agentic workflows: Native support for function-calling and structured JSON output allows you to build reliable, autonomous agents
- Multimodal edge AI: The E2B and E4B models bring native vision, audio, and low latency to mobile and IoT devices
- Long-context reasoning: Up to 256K context windows allow you to process entire repositories or large documents in a single prompt

Whether you're building global applications in 140+ languages or local-first AI code assistants, Gemma 4 is built to be your foundation. Explore in @GoogleAIStudio or download the weights on @HuggingFace, @Kaggle, and @Ollama.

Video: https://nitter.net/GoogleAI/status/2039735543068504476#m
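The "function-calling and structured JSON output" capability mentioned above can be sketched in a few lines. This is a hypothetical illustration of the client-side pattern only; the reply format, field names (`tool`, `arguments`), and the calculator tool are assumptions, not Gemma's official API.

```python
import json

def parse_tool_call(raw: str) -> tuple[str, dict]:
    """Validate a structured model reply of the assumed form
    {"tool": "...", "arguments": {...}} and return its parts."""
    reply = json.loads(raw)
    if "tool" not in reply or not isinstance(reply.get("arguments"), dict):
        raise ValueError("model reply is not a well-formed tool call")
    return reply["tool"], reply["arguments"]

# A reply a model might emit when asked to use a hypothetical calculator tool.
raw_reply = '{"tool": "calculator", "arguments": {"expression": "2 + 2"}}'
tool, args = parse_tool_call(raw_reply)
print(tool, args["expression"])  # calculator 2 + 2
```

Constraining the model to a JSON schema like this is what makes the resulting agent loop reliable: the caller can validate and dispatch the reply mechanically instead of scraping free-form text.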

More at Arint.info

#Apache #Gemini #global #Google #HuggingFace #nitter #Ollama #opensource #SWE #arint_info

https://x.com/PawelHuryn/status/2039781705884590326#m

Memory is one of the key developments this year. Not just for your personal (assistant) chats, but also complex coding in big projects, across sessions. #AI #GenAI #ChatGPT #Coddx #Claude #ClaudeCode #Gemini #dev #developer #SWE #AINativeEngineer

GitHub - theDakshJaitly/mex: P...

🎥 "When Worlds Collide: Software Engineering meets AI Engineering"

Software engineering and AI engineering are often treated as separate disciplines. In practice, they're converging fast, and the people who'll thrive are the ones comfortable dancing between both.

https://youtu.be/xZMNdehWJBg

#aiengineering #swe #agenticengineering

When worlds collide: software engineering meets AI engineering

YouTube

Who else ends up having two or more #AI subscription plans? Both are great: #ChatGPT plus #Codex and #Claude plus #ClaudeCode. They are definitely the frontier AI models out there. #SWE #AGI #GenAI #LLM #dev #development #AINativeEngineer

MiniMax (official) (@MiniMax_AI)

MiniMax has released the M2.7 model. It is introduced as the first model to participate deeply in its own evolution, and it posted strong results on SWE-Pro and Terminal Bench 2, showing strengths in production software engineering work and incident recovery.

https://x.com/MiniMax_AI/status/2034315320337522881

#minimax #modelrelease #swe #benchmark #agenticai

MiniMax (official) (@MiniMax_AI) on X

Introducing MiniMax-M2.7, our first model which deeply participated in its own evolution, with an 88% win-rate vs M2.5 - Production-Ready SWE: With SOTA performance in SWE-Pro (56.22%) and Terminal Bench 2 (57.0%), M2.7 reduced intervention-to-recovery time for online incidents


Abhishek Yadav (@abhishek__AI)

LangChain's SWE has been introduced; it lets companies build their own internal AI dev agents. It integrates with Slack, Linear, and GitHub, runs tasks in isolated sandboxes, and supports automatic commits and PR creation as well as spawning subagents for parallel work.

https://x.com/abhishek__AI/status/2034501202428485687

#langchain #swe #aiagents #developertools #automation

Abhishek Yadav (@abhishek__AI) on X

Your company can now run its own AI. Try SWE by LangChain, which lets you build internal dev agents the way Stripe, Ramp & Coinbase do. → Handles Slack, Linear, GitHub → Runs tasks in isolated sandboxes → Commits, opens PRs automatically → Spawns subagents for parallel work 100%
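The "spawn subagents for parallel work" pattern the post describes can be sketched with a thread pool. This is a hypothetical illustration of the fan-out idea only; `run_subagent` and the task names are made up and are not the SWE-by-LangChain API.

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(task: str) -> str:
    # A real subagent would work in an isolated sandbox and open a PR;
    # here we just mark the task as handled to show the fan-out shape.
    return f"{task}: done"

tasks = ["fix flaky test", "update changelog", "bump dependency"]

# Fan the tasks out to subagents in parallel; pool.map preserves task order.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_subagent, tasks))

print(results)
```

Isolating each subagent in its own sandbox (as the post claims the product does) is what makes this kind of parallelism safe: concurrent tasks cannot clobber each other's working trees.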


What's the point of making an "experimental" MenuetOS closed source, especially when KolibriOS exists as an open-source fork?

#fasm #flat_assembler #asm #assembler #operating_system #os #swe #menuetos #kolibrios

In DNS security, NRD (newly registered domains) blocking is a very effective protection, but it's annoying in active communities like software development or when following new startups...

Knowing how effective it is, I can't bring myself to disable the feature, but having most new announcements blocked is really painful...

For now, only a browser-plugin VPN lets me access such a website temporarily without allowing it system-wide.

#dns #nrd #vpn #swe #startup

🌗 Many SWE-bench-passing PRs would not be merged into main
➤ The gap between benchmark scores and real development environments: the myth of AI-agent practicality
https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/
The study finds that even when code produced by AI agents passes SWE-bench's automated tests, roughly half of it would still fail review by real project maintainers. The researchers invited active maintainers from scikit-learn, Sphinx, and pytest to blind-review AI-generated patches, and found a 24-percentage-point gap between the maintainers' willingness to merge and the benchmark pass rate. This suggests that relying on benchmark scores alone can overestimate the practical usefulness of AI agents in real development settings. The authors stress that the AI lacks the human developer's process of iterating on review feedback, so benchmark results should be treated as one reference signal for evaluating AI capability, not the sole basis for decisions.
+ This research is very timely. We often see AI racking up high benchmark scores, but in real projects' PR rev…
#AIEvaluation #SoftwareEngineering #AIAgents #SWE-bench
Many SWE-bench-Passing PRs Would Not Be Merged into Main