OpAMP server with MCP – aka conversational Fluent Bit control

I’ve written a few times about how OpAMP (Open Agent Management Protocol) may emerge from the OpenTelemetry CNCF project, but like OTLP (OpenTelemetry Protocol), it applies to just about any observability agent, not just the OTel Collector. As a side project, this gives me a real-world use case to work on my Python skills, as well as an excuse to work with FastMCP (and LangGraph shortly). It is also a chance to bring back the evolved idea of ChatOps (see here and here).

One of the goals of ChatOps was to free us from having to actively log into specific tools to mine for information once metrics, traces, and logs reach the aggregating back ends, while still being able to get at that information when we need it. We can do this if we leverage a decent LLM with Model Context Protocol tools through an app such as Claude Desktop or ChatGPT (or their mobile variants). Ideally, we then have a means to free ourselves to use social collaboration tools, rather than being tied to a specific LLM toolkit.

The result is a UI and the ability to communicate with Fluentd and Fluent Bit without imposing changes on the agent code base (we use a supervisor model), to issue commands, track what is going on, and optionally apply authentication (more improvements in this space to come).

New ChatOps – Phase 1

With the first level of the new ChatOps dynamism being delivered through LLM desktop tooling and MCP, the following screenshots show how we’ve exposed part of our OpAMP server via APIs. As you can see in the screenshot of our OpAMP server, we have the concept of commands. We have taken some of the commands described in the OpAMP spec, calling them standard commands, and then defined a construct for custom commands (which can be dynamically added to the server and client).
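To illustrate the standard vs. custom command distinction, the model can be sketched along these lines. This is a hypothetical illustration only, assuming simple dataclasses; the names, fields, and capability string here are not the project’s actual code:

```python
from dataclasses import dataclass

# Hypothetical sketch of the command model described above.
# Standard commands mirror those named in the OpAMP spec; custom
# commands are a construct this project adds, and can be registered
# dynamically on both the server and the client.

@dataclass(frozen=True)
class StandardCommand:
    name: str          # e.g. "restart", as described by the OpAMP spec

@dataclass(frozen=True)
class CustomCommand:
    capability: str    # capability string the client must declare
    name: str
    description: str = ""

class CommandRegistry:
    def __init__(self) -> None:
        self._commands = {}

    def register(self, cmd) -> None:
        self._commands[cmd.name] = cmd

    def lookup(self, name: str):
        return self._commands.get(name)

registry = CommandRegistry()
registry.register(StandardCommand(name="restart"))
registry.register(CustomCommand(capability="org.example.chatops",
                                name="stop-node",
                                description="Stop a managed Fluentd node"))
```

The useful property of the split is that the standard set stays spec-aligned while the custom set can grow without touching the protocol layer.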

The following screenshot illustrates using plain text, rather than carefully structured English, to get the OpAMP server to shut down a Fluentd node (in this case, as we only had one Fluentd node, it worked out which node to stop).

Interesting considerations

It will be interesting to see how LLM token consumption changes as the portfolio of managed agents grows, given that, to achieve the shutdown, the LLM had to obtain all the Fluent Bit and Fluentd instances being managed. If we provide an endpoint to find a specific agent instance, will the LLM reason that it should use that rather than trawl through all the information?

Next phase

ChatGPT, Claude Desktop, and others already incorporate some level of collaboration capabilities if the users involved are on a suitable premium account (Team/Enterprise). It would be good to enable greater freedom, and potentially lower costs, by adding the ability to operate through collaboration platforms such as Teams and Slack. This means the next steps need to look something along the lines of:

#AI #chatops #FluentBit #Fluentd #LangGraph #LLM #MCP

OpAMP with Fluent Bit – Observability and ChatOps

With KubeCon Europe happening this week, it felt like a good moment to break cover on this pet project.

If you are working with Fluent Bit at any scale, one question keeps coming up: how do we consistently control and observe all those edge agents, especially outside a Kubernetes-only world?

This is exactly the problem the OpAMP specification is trying to solve. At its core, OpAMP defines a standard contract between a central server and distributed agents/supervisors, so status, health, commands, and config-related interactions follow one protocol instead of ad-hoc integration per tool.

That is where this project sits. We’re implementing the OpAMP specification to support Fluent Bit (and later Fluentd).

In this implementation, we have:

  • a provider (the OpAMP server), and
  • a consumer acting as a supervisor to manage Fluent Bit deployments.

Right now, we are focused on Fluent Bit first. That is deliberate: it keeps scope practical while we validate the framework. The same framework is being shaped so it can evolve to support Fluentd as well.

The repository for the implementation can be found at https://github.com/mp3monster/fluent-opamp

Quick summary

The provider/server is the control plane endpoint. It tracks clients, accepts status, queues commands, and returns instructions using OpAMP payloads over HTTP or WebSocket.

The consumer/supervisor handles the local execution and reporting. It launches Fluent Bit, polls local health/status endpoints, sends heartbeat and metadata to the provider, and handles inbound commands (including custom ones). The server and supervisor can be deployed independently, which is important for real-world rollout patterns.

Because they follow the OpAMP protocol model, clients and servers can be interchanged with other OpAMP-compliant implementations (although we’ve not yet tested this aspect of the development).

Together, they give us a manageable, spec-aligned path to coordinating distributed Fluent Bit nodes without hard-coding one-off control logic into every environment.

Deployment options and scripts

There are a few practical ways to get started quickly:

  • Deploy just the server/provider using scripts/run_opamp_server.sh (or scripts/run_opamp_server.cmd on Windows).
  • Deploy just the client/supervisor using scripts/run_supervisor.sh (or scripts/run_supervisor.cmd on Windows).
  • Run both components either together in a single environment or independently across different hosts.

The scripts will set up a virtual environment and retrieve the necessary dependencies.

If you want an initial MCP client setup as part of your workflow, there are helper scripts for that too:

  • mcp/configure-codex-fastmcp.sh and mcp/configure-codex-fastmcp.ps1
  • mcp/configure-claude-desktop-fastmcp.sh and mcp/configure-claude-desktop-fastmcp.ps1

Server screenshots

Here is a first view of the server:

The Server Console with a single Agent

The UI is still evolving, but this gives a concrete picture of the provider side control plane we are discussing.

What the OpAMP server (provider) does

The provider is responsible for the shared view of fleet state and intent.

Today it provides:

  • OpAMP transport endpoints (/v1/opamp) over HTTP and WebSocket.
  • API and UI endpoints to inspect clients and queue actions.
  • In-memory command queueing per client.
  • Emission of standard command payloads (for example, restart).
  • Emission of custom message payloads for custom capabilities.
  • Discovery and publication of custom capabilities supported by the server-side command framework.

Operationally, this means we can queue intent once at the server and let the next client poll/connection cycle deliver that action in protocol-native form.
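That queue-once, deliver-on-next-cycle idea can be sketched as follows. This is a minimal illustration only, assuming a plain in-memory structure; the names and command shapes here are assumptions, not the provider’s actual code:

```python
from collections import defaultdict, deque

# Sketch: in-memory, per-client command queues. Intent is queued once
# at the server; the next poll/connection cycle from that client
# drains its queue and carries the commands back in the response.
class CommandQueue:
    def __init__(self) -> None:
        self._queues = defaultdict(deque)

    def queue(self, client_id: str, command: dict) -> None:
        self._queues[client_id].append(command)

    def drain(self, client_id: str) -> list:
        """Called when the client next polls/connects; returns and
        clears any pending commands for that client."""
        q = self._queues[client_id]
        pending = list(q)
        q.clear()
        return pending

queues = CommandQueue()
queues.queue("agent-1", {"type": "restart"})
queues.queue("agent-1", {"type": "custom", "name": "stop-node"})
# On the next poll cycle from agent-1, both commands are delivered in order:
delivered = queues.drain("agent-1")
```

The point of keeping the queue per client is that intent survives until the client next shows up, without the server needing to reach out to the node directly.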

What the supervisor (consumer) does for Fluent Bit

The supervisor is the practical glue between OpAMP and Fluent Bit:

  • Starts Fluent Bit as a local child process.
  • Parses Fluent Bit config details needed for status polling.
  • Polls Fluent Bit local endpoints on a heartbeat loop.
  • Builds and sends AgentToServer messages (identity, capabilities, health/status context).
  • Receives ServerToAgent responses and dispatches commands.
  • Handles custom capabilities and custom messages through a handler registry.

So for Fluent Bit specifically, the supervisor gives us a way to participate in OpAMP now, even before native in-agent OpAMP support is universal.
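The handler-registry dispatch described above can be sketched roughly like this. It is a simplified illustration, assuming dict-shaped messages; the real supervisor’s registry and message types will differ:

```python
# Sketch: a handler registry keyed by custom capability. When a
# ServerToAgent response carries a custom message, the supervisor
# looks up the capability it was sent under and dispatches to the
# registered handler for it.
from typing import Callable

Handler = Callable[[dict], str]

class HandlerRegistry:
    def __init__(self) -> None:
        self._handlers: dict = {}

    def register(self, capability: str, handler: Handler) -> None:
        self._handlers[capability] = handler

    def dispatch(self, capability: str, message: dict) -> str:
        handler = self._handlers.get(capability)
        if handler is None:
            return f"unsupported capability: {capability}"
        return handler(message)

def handle_chatops(message: dict) -> str:
    # A real handler would act on the local Fluent Bit process.
    return f"chatops command received: {message.get('command')}"

registry = HandlerRegistry()
registry.register("org.example.chatops", handle_chatops)
result = registry.dispatch("org.example.chatops", {"command": "status"})
```

Dispatching by capability keeps the transport layer generic: new behaviours are added by registering handlers, not by changing the protocol plumbing.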

And to be explicit: this is the current target. Fluentd support is a planned evolution of this same model, not a separate rewrite.

Where ChatOps fits

ChatOps is where this gets interesting for day-2 operations.

In this implementation, ChatOps commands are carried as OpAMP custom messages (custom capability org.mp3monster.opamp_provider.chatopcommand). The provider queues the custom command, and the supervisor’s ChatOps handler executes it by calling a local HTTP endpoint on the configured chat_ops_port.

That gives us a cleaner control path:

  • Chat/user intent can go to the central server/API.
  • The server routes to the right node through OpAMP.
  • The supervisor performs the local action and can return failure context when local execution fails.

This is a stronger pattern than directly letting chat tooling call every node individually, and it opens the door to better auditability and policy controls around who can trigger what.
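Under the flow described above, the supervisor-side translation from an inbound custom message into a local HTTP call might look something like this. Only the capability string and the chat_ops_port setting come from the post; the endpoint path, payload shape, and function names are illustrative assumptions:

```python
import json
from urllib import request

# Capability string used for ChatOps custom messages (from the post).
CHATOPS_CAPABILITY = "org.mp3monster.opamp_provider.chatopcommand"

def build_local_request(custom_message: dict, chat_ops_port: int) -> request.Request:
    """Turn an inbound ChatOps custom message into a request against
    the local HTTP endpoint on the configured chat_ops_port.
    (The /chatops path and JSON body shape are assumptions.)"""
    url = f"http://127.0.0.1:{chat_ops_port}/chatops"
    body = json.dumps(custom_message).encode("utf-8")
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"},
                           method="POST")

req = build_local_request({"command": "restart"}, chat_ops_port=8099)
# The supervisor would then send this with urllib.request.urlopen(req)
# and report any failure context back to the provider.
```

Because the supervisor builds and executes the call locally, the chat tooling never needs network reach to individual nodes, which is what makes the central audit and policy story possible.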

Reality check: we are still testing

This is important: we are still actively testing functionality.

Current status is intentionally mixed:

  • Core identity, sequencing, capabilities, disconnect handling, and heartbeat/status pathways are in place.
  • Some protocol fields are partial, still to do, or on the long-term backlog.
  • Custom capabilities/message pathways are implemented as a practical extension point and are still being hardened with test coverage and real-world runs.

So treat this as a working framework with proven pieces, not a finished all-capabilities implementation.

What is coming next (based on docs/features.md)

Near-term priorities include:

  • stricter header/channel validation,
  • heartbeat validation hardening,
  • payload validation against declared capabilities,
  • server-side duplicate WebSocket connection control behaviour.

Broader roadmap themes include:

  • authentication/security model for APIs and UI,
  • persistence in the provider,
  • richer UI controls for node/global polling and multi-node config push,
  • certificate and signing workflows,
  • packaging improvements.

And yes, a key strategic direction is evolving the framework abstraction so it can support Fluentd in due course, not only Fluent Bit. Some feature areas (like package/status richness) make even more sense in that broader collector ecosystem.

Why this matters

OpAMP gives us a standard envelope for control-plane interactions; the server/supervisor split gives us pragmatic deployment flexibility; and ChatOps provides a human-friendly control surface.

Put together, this becomes a useful pattern for managing telemetry agents in real environments where fleets are mixed, rollout velocity matters, and “just redeploy everything” is not always an option.

If you are evaluating this right now, the right mindset is: useful today, promising for tomorrow, and still under active verification as we close feature gaps.

#AI #artificialIntelligence #Cloud #Fluentbit #Fluentd #LLM #observability #OpAMP #Technology

More Game Servers Should have Webhooks

Webhooks are useful for bridging multiple applications together, and I often wish more apps supported them. One use case where webhook support could be fun is multiplayer game servers.

PaulTraylor.net

In this episode of 🌩️ Thunder, Peter Wilcsinszky, maintainer of Logging Operator, explains how the project's Flow and Output custom resources bring Kubernetes-native control to log routing. We cover the noisy neighbor problem, what happens when a tenant's aggregator goes down, and how the new Telemetry Controller enables OpenTelemetry Collector integration.

Watch now →
https://youtu.be/NDsaw0hO0T8

#Kubernetes #LoggingOperator #Observability #FluentBit #CNCF #DevOps

Fluent Bit was designed for tiny IoT devices with barely any memory. Why did cloud native teams running massive Kubernetes clusters adopt it instead?

Eduardo Silva explains in this 🌩️ Thunder episode: https://youtu.be/t6RxvVGvM7Q

#FluentBit #IoT #Observability #CloudNative #CNCF #Kubernetes

What happens when Fluent Bit's built-in functionality isn't enough for your use case? The plugin architecture lets you extend your pipeline with custom logic — including Lua scripts — right out of the box.

Eduardo Silva explains in this 🌩️ Thunder episode: https://youtu.be/t6RxvVGvM7Q

#FluentBit #Observability #CloudNative #DataPipelines #CNCF

12 years ago, collecting logs from different sources meant dealing with incompatible formats everywhere. How did we get from that chaos to the unified pipelines we have today?

Eduardo Silva shares Fluentd's origin story in this 🌩️ Thunder episode:
https://youtu.be/t6RxvVGvM7Q

#Fluentd #FluentBit #Logging #Observability #CloudNative

If Fluent Bit and OpenTelemetry Collector both collect, process, and export telemetry data, when should you reach for one over the other?

Eduardo Silva explains in this 🌩️ Thunder episode:
https://youtu.be/t6RxvVGvM7Q

#FluentBit #OpenTelemetry #Observability #CloudNative #CNCF

New @fluentbit workshop release!!! 🚨

Our #fluentbit workshop has a new lab where you learn hands-on how to visualize Fluent Bit telemetry pipeline data in @OpenSearchProject! @chronosphereio #cloudnative #observability #o11y https://o11y-workshops.gitlab.io/workshop-fluentbit/

For now, I've filed a report upstream to see if I can get some hints.

https://github.com/fluent/fluent-bit/issues/11139

#fluentbit #prometheus