AI proposes, I merge: why I don't give the agent the last move

TL;DR. I am not trying to turn a coding agent into an autonomous developer. I define a process for it: SPEC → PLAN → TEST → CODE → REVIEW → LEARN, with artifacts at every step and a human accept wherever responsibility begins. This article is the entry point to a series about map-framework: hooks, contracts, context, memory, and everything else I have taken from research papers to a working process.

https://habr.com/ru/articles/1050678/

#AIагенты #кодингагенты #LLM #Claude_Code #code_review #specdriven_development #автоматизация_разработки #инженерные_практики #mapframework #arxiv

AI proposes, I merge: why I don't give the agent the last move

There is an unpleasant illusion: if the model has become stronger, it can be given more freedom. In coding this quickly backfires. The agent writes a lot, confidently, sometimes even beautifully. Then you open the diff and...

Habr
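
The staged process named in the post (SPEC → PLAN → TEST → CODE → REVIEW → LEARN, with an artifact per step and a human accept) can be sketched roughly as below. This is only an illustration, not map-framework's actual code: `run_pipeline`, `produce`, `accept`, and the choice of which stages are gated are all hypothetical names and assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set

# Stage names come from the post; everything else here is an assumption.
STAGES = ["SPEC", "PLAN", "TEST", "CODE", "REVIEW", "LEARN"]


@dataclass
class Run:
    artifacts: Dict[str, str] = field(default_factory=dict)
    stopped_at: Optional[str] = None  # stage where the human declined, if any


def run_pipeline(
    produce: Callable[[str, Dict[str, str]], str],  # agent: stage + prior artifacts -> artifact
    accept: Callable[[str, str], bool],             # human: stage + artifact -> accept?
    gated: Set[str],                                # stages that need a human accept
) -> Run:
    run = Run()
    for stage in STAGES:
        artifact = produce(stage, run.artifacts)
        # The agent proposes; on gated stages a human decides whether it counts.
        if stage in gated and not accept(stage, artifact):
            run.stopped_at = stage
            return run
        run.artifacts[stage] = artifact
    return run


if __name__ == "__main__":
    result = run_pipeline(
        produce=lambda stage, prior: f"{stage.lower()} draft",
        accept=lambda stage, artifact: stage != "REVIEW",  # human rejects the review
        gated={"SPEC", "REVIEW"},
    )
    print(result.stopped_at, list(result.artifacts))
```

The point of the shape is that a rejected artifact is never recorded, so later stages cannot build on something no human accepted.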

#arXiv:
"The fate of Earth during the Sun’s giant phases: New constraints from ab initio tidal modelling and AGB mass loss"
"...the need for improved constraints on the late-stages of stellar evolution. However, considering observational proxies for the Sun during the AGB phase, it is likely that the Earth will survive the Sun's giant phases."
https://arxiv.org/abs/2606.19575

17.6.2026

#AsymptoticGiantBranch #Earth #Erde #Gezeitendissipation #Gravitation #RoterRiese #Sonne #Star #StellarEvolution #Stern #Sternentwicklung #Sun

The fate of Earth during the Sun's giant phases: New constraints from ab initio tidal modelling and AGB mass loss

The long-term evolution of planetary systems around solar-type stars is governed by the interplay between stellar expansion, tidal interactions, and mass loss during the red giant branch (RGB) and asymptotic giant branch (AGB) phases. However, tidal dissipation efficiencies and AGB mass-loss rates both remain poorly constrained, leading to significant uncertainty in predicting the fate of planetary systems, in particular, that of the Earth orbiting the ageing Sun. We reassess the survival of the Earth and the inner Solar System planets during the entire evolution of the Sun, focusing on the impact of updated tidal dissipation prescriptions and varying AGB mass-loss rates. We modelled the orbital evolution of the Earth using stellar evolution tracks for a solar-mass star. We compared these results with outcomes obtained using previously published and commonly adopted tidal prescriptions, and we explored a range of AGB mass-loss rates. We find that the predicted fate of the Earth is highly sensitive to the tidal model and the assumed mass-loss rate. Based on updated tidal dissipation prescriptions, Earth survives the RGB and AGB phases of the Sun. In contrast, the use of earlier tidal dissipation prescriptions leads to engulfment during the AGB phase. Furthermore, low AGB mass-loss rates result in engulfment, and vice versa. Using the observed mass-loss rates of the AGB star L2 Pup as a proxy for the Sun's future AGB mass-loss rate results in the survival of the Earth during the AGB phase when combined with our tidal dissipation evaluation. Given the current observational uncertainties in AGB mass-loss rates, the ultimate fate of the Earth remains uncertain, highlighting the need for improved constraints on the late-stages of stellar evolution. However, considering observational proxies for the Sun during the AGB phase, it is likely that the Earth will survive the Sun's giant phases.

arXiv.org

Found in my Calishat Snaps: LinXiv. “Discover, manage, and visualize academic papers from arXiv — run your library on hardware you control, with a modern desktop app, optional AI, Obsidian integration, and an interactive network graph.”

https://rbfirehose.com/2026/06/21/organized-academic-papers-on-your-desktop-linxiv/
Organized Academic Papers On Your Desktop: LinXiv

ResearchBuzz: Firehose

Scientific journals act as though they were tabloid magazines. But that is not their function. Their purpose is to evaluate novelty, assess correctness, and place their stamp on the work. Nothing more. Editorial opinions belong elsewhere. Who reads journals?!? Today, scientific publication happens on #arXiv. Discovery happens through search engines, social networks, citations, talks, and personal recommendations.

⬇️

Semiclassical Gravity Efficiently Solves NP-Complete Problems

https://arxiv.org/abs/2606.14806

#arxiv

Semiclassical Gravity Efficiently Solves $\mathsf{NP}$-Complete Problems

Assuming the gravitational field is classical and that it couples to quantum fields via the semiclassical Einstein field equations, we show that the weak-field dynamics of a massive and non-relativistic qubit can in principle be used to solve an $\mathsf{NP}$-complete problem in polynomial time. We attribute this vast computational power to the non-linear dynamics afforded by the semiclassical Einstein field equations. Consequently, the above two assumptions entail a violation of the Physical Extended Church-Turing Thesis, which we regard as evidence for the quantization of gravity.

arXiv.org

💙🖤 the vignette for our R package bpvars for forecasting with panel vector autoregressions has just landed on arxiv 🖤💙 It's a 40-page-long page-turner 😜

🌐 https://doi.org/10.48550/arXiv.2606.14143

#bpvars #bsvars #arxiv #rstats

RT @DailyDoseOfDS_: Claude Code fully dissected! Researchers from UCL reverse-engineered the leaked Claude source. What they found changes how you should think about agent design. Only 1.6% of the codebase is AI decision logic. The other 98.4% is operational infrastructure. Permission gates, tool routing, context compaction, recovery logic, session persistence. The model reasons. The harness does everything else. This is the opposite of what most agent frameworks do today. LangGraph routes model outputs through explicit state machines. Devin bolts heavy planners onto operational scaffolding. Claude Code gives the model maximum decision latitude inside a rich deterministic harness, and invests all its engineering effort in that harness. The core loop is a simple while-true. Call model, run tools, repeat. But the systems around that loop are where the real design lives: A permission system with 7 modes and an ML classifier. Users approve 93% of prompts anyway, so the architecture compensates with automated layers instead of adding more warnings. A 5-layer context compaction pipeline. Each layer runs only when cheaper ones fail. Budget reduction, snip, microcompact, context collapse, auto-compact. Four extension mechanisms ordered by context cost. Hooks (zero), skills (low), plugins (medium), MCP (high). Each answers a different integration problem. Subagents return only summary text to the parent. Their full transcripts live in sidechain files. Agent teams still cost roughly 7x the tokens of a standard session. Resume does not restore session-scoped permissions. Trust is re-established every ses…

more at Arint.info

#agent #Agent #arXiv #Claude #ClaudeCode #Devin #MCP #medium #nitter #arint_info

https://x.com/DailyDoseOfDS_/status/2065728394084626773#m

Arint - SEO+KI (@[email protected])

Mastodon Glitch Edition
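
The retweeted thread describes a compaction pipeline in which "each layer runs only when cheaper ones fail". A minimal sketch of that cascade pattern follows. It is not Claude Code's implementation: the three toy layers, the word-count `size` function, and all names are assumptions made up for illustration, and the thread's own five layers are not reproduced.

```python
from typing import Callable, List, Tuple

Context = List[str]


def size(ctx: Context) -> int:
    # Stand-in for a token count: total whitespace-separated words.
    return sum(len(message.split()) for message in ctx)


# Hypothetical layers, ordered from cheapest to most destructive.
def drop_empty(ctx: Context) -> Context:
    return [m for m in ctx if m.strip()]


def truncate_long(ctx: Context, limit: int = 20) -> Context:
    return [" ".join(m.split()[:limit]) for m in ctx]


def keep_recent(ctx: Context, n: int = 2) -> Context:
    return ctx[-n:]


LAYERS: List[Tuple[str, Callable[[Context], Context]]] = [
    ("drop_empty", drop_empty),
    ("truncate_long", truncate_long),
    ("keep_recent", keep_recent),
]


def compact(ctx: Context, budget: int) -> Tuple[Context, List[str]]:
    """Apply layers in order, stopping as soon as the context fits the budget."""
    applied: List[str] = []
    for name, layer in LAYERS:
        if size(ctx) <= budget:
            break  # cheaper layers were enough; skip the costlier ones
        ctx = layer(ctx)
        applied.append(name)
    return ctx, applied


if __name__ == "__main__":
    history = ["w " * 30, "", "x y", "z"]
    compacted, used = compact(history, budget=25)
    print(used, size(compacted))
```
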
Can I Buy Your KV Cache?

Right now, across the world, AI agents are repeating the same absurd act: to read one document, they each recompute it from scratch. Every agent re-runs prefill, the most compute-intensive step a large model takes, over identical text, only to rebuild a key-value (KV) cache identical to the one the agent before it just built. The same answer, computed a million times. We make a proposal that is almost offensively simple: compute it once. Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill. It works, and it is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill's attention scales with L^2), so a single reuse already pays it back. Then the part that matters: where the KV lives. Shipping it fails, because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. Hosting it provider-side, exactly as production prompt-caching works, removes egress entirely. The size of the prize is set by our measured compute saving: serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). The 0.1x cache-read tariff APIs charge passes a 10x discount to users while sitting inside this measured envelope, so the 10x is a floor that the measured ~50x compute saving clears, and the gap to the physical ~50x is provider margin: millions of dollars per popular document. We frame the resulting agent-native prefill CDN and leave lossless KV compression and a cross-party payment layer as the open problems.

arXiv.org
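
The headline arithmetic in the abstract can be checked in a few lines. The inputs below are the figures quoted above (80M agents, ~$1.5M to re-prefill, 49.7x compute saving); the per-agent breakdown is a derived illustration, not a number from the paper, and the variable names are arbitrary.

```python
# Figures quoted in the abstract.
AGENTS = 80_000_000          # agents reading the same hot document
REPREFILL_TOTAL_USD = 1.5e6  # approximate cost if every agent re-runs prefill
COMPUTE_SAVING = 49.7        # measured reuse-vs-prefill compute ratio

# Derived values (illustrative; rounding in the abstract is approximate).
reuse_total_usd = REPREFILL_TOTAL_USD / COMPUTE_SAVING
prefill_per_agent_usd = REPREFILL_TOTAL_USD / AGENTS
reuse_per_agent_usd = reuse_total_usd / AGENTS
saved_total_usd = REPREFILL_TOTAL_USD - reuse_total_usd

if __name__ == "__main__":
    print(f"reuse compute, total: ${reuse_total_usd:,.0f}")
    print(f"prefill per agent:    ${prefill_per_agent_usd:.5f}")
    print(f"reuse per agent:      ${reuse_per_agent_usd:.6f}")
    print(f"compute saved, total: ${saved_total_usd:,.0f}")
```

Dividing $1.5M by 49.7 gives roughly $30k, which matches the abstract's "~$0.03M of reuse compute".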
MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities -- proof generation, proof verification, and critique-conditioned proof repair -- using a defense-in-depth generative verifier engineered for low false-positive rate. These capabilities are merged into a single released M3 model. At test time, MaxProof treats the model as a generator, verifier, refiner, and ranker, searches over a population of candidate proofs, and returns one final proof through tournament selection. With MaxProof test-time scaling, the M3 model reaches 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold-medal threshold on both.

arXiv.org
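
The abstract says MaxProof "returns one final proof through tournament selection" over a population of candidates. As a rough illustration of that general idea only, here is a single-elimination tournament over candidates with a pairwise comparator. The abstract does not specify the bracket structure or the ranker, so `tournament_select`, the `better` callback, and the numeric scores are all hypothetical stand-ins.

```python
import random
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def tournament_select(
    population: List[T],
    better: Callable[[T, T], T],  # hypothetical ranker: returns the preferred candidate
    rng: random.Random,
) -> T:
    """Single-elimination tournament: pair candidates up until one remains."""
    if not population:
        raise ValueError("population must not be empty")
    pool = list(population)
    rng.shuffle(pool)  # random bracket
    while len(pool) > 1:
        winners = [better(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2 == 1:
            winners.append(pool[-1])  # odd candidate out gets a bye
        pool = winners
    return pool[0]


if __name__ == "__main__":
    # Candidates are (label, score) pairs; the score stands in for a verifier's judgment.
    candidates: List[Tuple[str, float]] = [("proof-a", 0.2), ("proof-b", 0.9), ("proof-c", 0.5)]
    pick = tournament_select(candidates, lambda x, y: x if x[1] >= y[1] else y, random.Random(0))
    print(pick)
```

With a consistent comparator the bracket simply returns the best candidate; the tournament form matters when comparisons are pairwise judgments from a model rather than absolute scores.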