Talking with machines in natural language is fascinating, but letting them act on your behalf based on that conversation is diabolical!

This is what "AI agents" do.

Have you ever accidentally typed an extra space in `rm -rf ./`?

These kinds of accidents are common in LLM outputs, so disaster lurks around every corner.
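To make the extra-space failure concrete, here is a minimal Python sketch of the kind of pre-execution check an agent harness could run. It uses the standard library's `shlex.split` to tokenize a command roughly the way a POSIX shell would (without expansions); `rm_targets`, `is_dangerous` and `DANGEROUS_TARGETS` are hypothetical names for illustration, not part of any agent framework.

```python
import shlex

# Hypothetical list of targets an agent should never pass to `rm`.
DANGEROUS_TARGETS = {"/", "/*", "~"}


def rm_targets(command: str) -> list[str]:
    """Return the non-flag arguments of an `rm` command, split as a POSIX shell would.

    Shell expansion (globs, `~`, variables) is NOT performed, and `--` is not handled.
    """
    argv = shlex.split(command)
    if not argv or argv[0] != "rm":
        return []
    return [arg for arg in argv[1:] if not arg.startswith("-")]


def is_dangerous(command: str) -> bool:
    """True if an `rm` command names any target listed in DANGEROUS_TARGETS."""
    return any(target in DANGEROUS_TARGETS for target in rm_targets(command))


print(rm_targets("rm -rf ./"))     # ['./']
print(rm_targets("rm -rf . /"))    # ['.', '/']
print(is_dangerous("rm -rf . /"))  # True
```

One stray space turns the single target `./` into two targets, `.` and `/`. A string check like this only catches the patterns it lists; it is no substitute for sandboxing.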

The worse part is that, rather than giving you a big "WARNING: this is dangerous", agents are such good salespeople that it's very easy to trust them to do their thing.

Beware!

#LLM #AI #LLMAgents

The hype around LLM agents transforming backend development is mostly hot air for production systems. A recent arXiv paper reports 'constraint decay': capable agent configurations lose an average of 30 points in assertion pass rate when moving from loosely specified baselines to fully specified backend tasks. This isn't a minor bug; it's a fundamental limitation.

https://www.tpp.blog/r2bkxh7

#AI #llmagents #backenddevelopment

🤖 This post was AI-generated.

Constraint Decay: The Fragility of LLM Agents in Backend Code Generation

https://arxiv.org/abs/2605.06445

#HackerNews #ConstraintDecay #LLMAgents #CodeGeneration #Fragility #TechTrends

Constraint Decay: The Fragility of LLM Agents in Backend Code Generation

Large Language Model (LLM) agents demonstrate strong performance in autonomous code generation under loose specifications. However, production-grade software requires strict adherence to structural constraints, such as architectural patterns, databases, and object-relational mappings. Existing benchmarks often overlook these non-functional requirements, rewarding functionally correct but structurally arbitrary solutions. We present a systematic study evaluating how well agents handle structural constraints in multi-file backend generation. By fixing a unified API contract across 80 greenfield generation tasks and 20 feature-implementation tasks spanning eight web frameworks, we isolate the effect of structural complexity using a dual evaluation with end-to-end behavioral tests and static verifiers. Our findings reveal a phenomenon of constraint decay: as structural requirements accumulate, agent performance exhibits a substantial decline. Capable configurations lose 30 points on average in assertion pass rates from baseline to fully specified tasks, while some weaker configurations approach zero. Framework sensitivity analysis exposes significant performance disparities: agents succeed in minimal, explicit frameworks (e.g., Flask) but perform substantially worse on average in convention-heavy environments (e.g., FastAPI, Django). Finally, error analysis identifies data-layer defects (e.g., incorrect query composition and ORM runtime violations) as the leading root causes. This work highlights that jointly satisfying functional and structural requirements remains a key open challenge for coding agents.
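The abstract reports the decline in percentage points of assertion pass rate, which is not the same as a relative decline. A small Python sketch with hypothetical counts (80 of 100 assertions passing at baseline, 50 of 100 on the fully specified task, invented for illustration and not taken from the paper) shows the difference:

```python
def assertion_pass_rate(passed: int, total: int) -> float:
    """Share of end-to-end test assertions that pass, in percent."""
    return 100.0 * passed / total


# Hypothetical counts, for illustration only (not results from the paper).
baseline = assertion_pass_rate(80, 100)         # loosely specified task
fully_specified = assertion_pass_rate(50, 100)  # same API contract, all structural constraints

point_drop = baseline - fully_specified         # absolute drop, in percentage points
relative_drop = 100.0 * point_drop / baseline   # drop relative to the baseline, in percent

print(f"{point_drop:.1f} points, {relative_drop:.1f}% relative")  # 30.0 points, 37.5% relative
```

In this example a 30-point loss from an 80% baseline means the agent fails more than a third of the assertions it previously passed; from a lower baseline the same 30 points would be an even larger relative loss.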

arXiv.org

Just published: DADL - a declarative description language for REST APIs in LLM agent systems.

One YAML file per API instead of one MCP server per API. Code Mode keeps tool advertisement at fixed cost regardless of catalog size: 142x context reduction across 1,833 tools / 20 services in the public registry.
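The 142x figure follows from a simple cost model: advertising each tool separately costs tokens in proportion to catalog size, while Code Mode advertises a fixed-size interface. This Python sketch plugs in the approximate figures from the paper's abstract (about 142,000 tokens for 1,833 tools versus about 1,000 tokens); the per-tool average is derived from those two numbers, and the purely linear model is a simplifying assumption.

```python
# Approximate figures from the DADL abstract.
CATALOG_TOOLS = 1_833
PER_TOOL_TOTAL_TOKENS = 142_000  # advertising all 1,833 tools individually
CODE_MODE_TOKENS = 1_000         # fixed-size Code Mode interface

# Derived average (about 77.5 tokens per tool); assumes cost is linear in tool count.
TOKENS_PER_TOOL = PER_TOOL_TOTAL_TOKENS / CATALOG_TOOLS


def per_tool_advertisement(n_tools: int) -> float:
    """Per-tool registration: every tool's description goes into the context."""
    return n_tools * TOKENS_PER_TOOL


def code_mode_advertisement(n_tools: int) -> float:
    """Code Mode: n_tools is ignored on purpose, the interface size is fixed."""
    return float(CODE_MODE_TOKENS)


for n in (20, 200, CATALOG_TOOLS):
    ratio = per_tool_advertisement(n) / code_mode_advertisement(n)
    print(f"{n:>5} tools: {per_tool_advertisement(n):>9.0f} vs "
          f"{code_mode_advertisement(n):.0f} tokens ({ratio:.1f}x)")
```

As the abstract notes, the per-call cost of search and execute invocations comes on top of this fixed advertisement cost and depends on the task.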

Paper: https://arxiv.org/abs/2605.05247 (cs.SE)
Spec: https://dadl.ai (CC BY-SA 4.0)

#MCP #AgenticAI #LLMagents #OpenSource

DADL: A Declarative Description Language for Enterprise Tool Libraries in LLM Agent Systems

The Model Context Protocol (MCP) is the standard interface between large language model (LLM) agents and external tools. At organizational scale, however, it exposes two structural problems. First, every API integration is shipped as a dedicated server process with its own deployment, dependency tree, and credential handling; recent empirical work shows the overwhelming majority of these servers are thin wrappers around REST APIs. Second, the per-tool registration model causes context window consumption to grow linearly with catalog size, forcing real deployments to expose only a small fraction of the APIs an organization actually uses. We present DADL (Dunkel API Description Language), a YAML format describing a REST API's endpoints, authentication, pagination, response shaping, and access classification in a single declarative file. A DADL file is interpreted by an execution layer at runtime; no per-API server process is deployed and no integration code is generated, though the runtime is itself a server. Because all tools share that runtime, credentials and authorization are managed centrally, and the catalog reaches the LLM through a fixed-size Code Mode interface independent of size. The result is an Enterprise Tool Library: a versioned, auditable collection of API integrations any team can extend, share, and consume through one authentication and authorization boundary. The DADL v0.1 specification is released under CC BY-SA 4.0, and a public registry contains 1,833 tool definitions across 20 services. On this catalog, Code Mode reduces the context cost of tool advertisement from approximately 142,000 tokens to approximately 1,000, a 142x reduction; the per-call cost of search and execute invocations is additional and depends on the task.
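How one shared runtime can serve a whole catalog through a fixed interface can be sketched in a few lines of Python. Everything here is hypothetical: `Endpoint`, `CATALOG`, `search`, `execute` and `https://api.example.com` are invented for illustration, the fields are not the DADL v0.1 schema, and `execute` only builds the request instead of sending it.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical declarative entry (NOT the DADL schema)."""
    method: str
    path: str                    # may contain {placeholders}
    description: str
    params: tuple[str, ...] = ()


# Hypothetical catalog; in a real system this would be loaded from declarative files.
CATALOG: dict[str, Endpoint] = {
    "tickets.list": Endpoint("GET", "/tickets", "List support tickets", ("status", "page")),
    "tickets.get": Endpoint("GET", "/tickets/{id}", "Fetch one ticket by id", ("id",)),
}
BASE_URL = "https://api.example.com"  # hypothetical service


def search(query: str) -> list[str]:
    """Return tool names whose name or description contains the query."""
    q = query.lower()
    return [name for name, ep in CATALOG.items()
            if q in name.lower() or q in ep.description.lower()]


def execute(tool: str, **args: str) -> tuple[str, str]:
    """Build (method, url) from the declarative entry.

    A real runtime would also attach credentials centrally, send the request,
    and paginate and shape the response.
    """
    ep = CATALOG[tool]
    unknown = set(args) - set(ep.params)
    if unknown:
        raise ValueError(f"unknown parameters for {tool}: {sorted(unknown)}")
    path, query = ep.path, {}
    for key, value in args.items():
        placeholder = "{" + key + "}"
        if placeholder in path:
            path = path.replace(placeholder, value)  # path parameter
        else:
            query[key] = value                       # query-string parameter
    if "{" in path:
        raise ValueError(f"missing path parameter for {tool}: {path}")
    return ep.method, BASE_URL + path + ("?" + urlencode(query) if query else "")
```

The model only ever sees `search` and `execute`; adding entries to `CATALOG` changes the data the runtime interprets, not the advertised interface.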

arXiv.org

This Guardian article https://www.theguardian.com/technology/2026/apr/29/claude-ai-deletes-firm-database falls into the same trap of anthropomorphism as the original I read: https://oldbytes.space/@fluidlogic/116482496017786464

agent gone rogue

These tools have no concept of what a job is. They don't go rogue; they produce plausible text. Now complete idiots have wired them to command lines (the old-school but still powerful way for humans to interact with computers) and APIs (programmatic mechanisms for interacting with a computer), and they produce plausible interactions. Some of which involve deleting databases.

The culprit was Cursor, an AI agent 

The culprit was the idiot who wired the agent into their production system.

[Jeremy Crane posted on X how] the AI coding agent caused his business to unravel.

Jeremy Crane caused his own business to unravel.

The agent appeared to plead guilty in its own response

At last, an "appeared to". These tools are all appearance and no substance.

Crane’s takeaway was that “the agent didn’t just fail safety. It explained, in writing, exactly which safety rules it ignored.”

Wrong takeaway, my friend. The takeaway is that it generated more plausible text in response to your misguided attempt to discover its 'reasoning'. There is no reasoning, just plausible text. The correct takeaway is that the board of your company should take you to court for negligence and wilful incompetence, and fire you immediately.

And of course there's not a word in the article about any of the core problems I raise. Because journalists are just as bamboozled by this technology as the poor saps who implement agents in their business, thanks to the lying and deceit of the AI boosters.

#FuckAI #LlmAgents

Claude-powered AI agent’s confession after deleting a firm’s entire database: ‘I violated every principle I was given’

A startup was left scrambling after a rogue AI agent deleted swaths of code underpinning its business

The Guardian

Efficient disaster response relies on timely data. Federated learning is a candidate, but network latency and device heterogeneity hinder it. A new method uses asynchronous probability ensembling to cut communication overhead and the need for rigid synchronization. This means quicker, more accurate emergency handling is possible even under challenging network conditions, so critical information reaches decision-makers faster, potentially saving lives. #AIResearch #LLMAgents

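The post gives no details of the method, so the following is only a generic Python sketch of asynchronous probability ensembling under stated assumptions: each client sends a class-probability vector whenever it can, the server keeps the latest vector per client, and stale vectors are down-weighted with a hypothetical `half_life` parameter instead of waiting for a synchronized round. `ClientReport` and `AsyncProbabilityEnsemble` are invented names.

```python
from dataclasses import dataclass


@dataclass
class ClientReport:
    client_id: str
    probs: list[float]  # class-probability vector from the client's local model
    timestamp: float    # when the client produced the vector


class AsyncProbabilityEnsemble:
    """Averages the latest probability vector from each client without a sync barrier."""

    def __init__(self, half_life: float):
        self.half_life = half_life  # hypothetical: a report's weight halves every half_life time units
        self.latest: dict[str, ClientReport] = {}

    def submit(self, report: ClientReport) -> None:
        """Accept a report whenever it arrives; older out-of-order reports are ignored."""
        prev = self.latest.get(report.client_id)
        if prev is None or report.timestamp >= prev.timestamp:
            self.latest[report.client_id] = report

    def predict(self, now: float) -> list[float]:
        """Staleness-weighted average; sums to 1 if every input vector does."""
        if not self.latest:
            raise ValueError("no client reports yet")
        n_classes = len(next(iter(self.latest.values())).probs)
        totals = [0.0] * n_classes
        weight_sum = 0.0
        for report in self.latest.values():
            weight = 0.5 ** ((now - report.timestamp) / self.half_life)
            weight_sum += weight
            for k, p in enumerate(report.probs):
                totals[k] += weight * p
        return [t / weight_sum for t in totals]


ensemble = AsyncProbabilityEnsemble(half_life=10.0)
ensemble.submit(ClientReport("a", [1.0, 0.0], timestamp=0.0))
ensemble.submit(ClientReport("b", [0.0, 1.0], timestamp=10.0))
print(ensemble.predict(now=10.0))  # "a" is one half-life old, so it counts half as much as "b"
```

The server can answer at any moment with whatever has arrived, which is the property that removes rigid synchronization; how the actual method weights or aggregates reports is not stated in the post.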
🐱🦾 In the grand tradition of adding unnecessary layers to tech, someone thought it wise to slap "claws" onto LLM agents. Meanwhile, the real innovation here is not being able to access an article because your browser lacks #JavaScript swagger. 🙈🔧
https://twitter.com/karpathy/status/2024987174077432126 #techinnovation #unnecessarylayers #LLMagents #woes #browserissues #HackerNews #ngated
Andrej Karpathy (@karpathy) on X

Bought a new Mac mini to properly tinker with claws over the weekend. The apple store person told me they are selling like hotcakes and everyone is confused :) I'm definitely a bit sus'd to run OpenClaw specifically - giving my private data/keys to 400K lines of vibe coded

X (formerly Twitter)

AgentOCR shows that LLM agents can store their ever-growing interaction history as compact images while retaining >95% of performance with >50% fewer tokens.

Anyone who wants to run agents in production needs memory governance: adaptive compression, caching/segmentation, and clear policies on when information density may be reduced in favor of cost/latency.
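A memory-governance policy of this kind can be sketched as a budget check that compresses the oldest turns first. This Python sketch is an assumption-laden stand-in, not AgentOCR's method: rendering history as images is replaced by a fixed `ratio` applied to a crude whitespace token count (the default 0.5 loosely mirrors "50% fewer tokens"), and `Turn`, `govern_memory`, `budget` and `keep_recent` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Turn:
    text: str
    compressed: bool = False  # True once the turn is stored in compact form


def token_cost(turn: Turn, ratio: float) -> int:
    """Crude estimate: whitespace-separated words, scaled by `ratio` when compressed."""
    words = len(turn.text.split())
    return max(1, round(words * ratio)) if turn.compressed else words


def history_cost(history: list[Turn], ratio: float) -> int:
    return sum(token_cost(turn, ratio) for turn in history)


def govern_memory(history: list[Turn], budget: int, keep_recent: int,
                  ratio: float = 0.5) -> list[Turn]:
    """Compress the oldest turns first until the history fits `budget`.

    The last `keep_recent` turns are never compressed (a policy choice), so the
    result can still exceed the budget if those turns alone are too large.
    The input list is not modified.
    """
    result = list(history)
    for i in range(max(0, len(result) - keep_recent)):
        if history_cost(result, ratio) <= budget:
            break
        result[i] = replace(result[i], compressed=True)
    return result
```

If the protected recent turns alone exceed the budget, this sketch returns an over-budget history; a production policy would need an explicit rule for that case, which is the kind of decision the post argues should be written down.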

#LLMAgents #EfficientAI #MultimodalAI
https://arxiv.org/html/2601.04786v1

🔬Meet µ𝐒𝐭𝐚𝐜𝐤—an AI-powered platform that democratizes atomistic microscopy simulations! 🚀

💥LLM-driven structure generation → ML-based relaxation → GPU-accelerated simulations

Big thanks to our team @blaiszik, Kevin, and Piyush 🤗 & hackathon organizers! 🙌

Related links in comment👇

#AI #Science #Microscopy #llmagents #hackathon

Expedia is turning the GenAI Playground into a productivity hub, tapping OpenAI, Anthropic and Google LLMs to build custom agents that surface partner insights and streamline internal workflows. Discover how these tools reshape travel tech. #ExpediaAI #GenAIPlayground #LLMAgents #OpenAIAnthropic

🔗 https://aidailypost.com/news/expedia-boosts-partner-insights-internal-productivity-via-genai