How hard is it to update an Ubuntu between two LTS versions? On a laptop it's usually not that hard: it needs 1-2 hours and a couple of manual config updates but that's it. But keeping a server up-to-date is usually way harder: will my app work the same with the newer kernel/gcc/npm/100 other packages?

This is a solved problem in programming: in npm, for example, I have a package-lock.json. I bump the dependencies, run the tests, and if they pass then I commit the changes. The tests are key here: they run the new code and validate the expectations.

That's why I'm excited about NixOS: it promises to bring the same process to the system-level. Every package, every config, firewall rule, is in code, and dependencies are fixed in a lockfile. With it, I can follow the same route as with npm: update the code, check it still works, and only then deploy to the server. And it can be automated, bringing a nice property to the updates: the system auto-updates but if a change breaks the tests then it will preserve the last known good state.

This relies on tests, and this is my first step with NixOS: I chose two topics that should be testable to see if I can manage to cover them. The first is the firewall configuration: how to test if the firewall is running and is filtering what it needs to filter? The other is the DNS-over-HTTPS (DoH) config: DNS that works by sending HTTPS requests, which I think is a nice privacy enhancement for a personal laptop.

The DoH test case was especially tricky, but it seems possible to write tests for both of them with NixOS.

This article describes my motivation in a bit more detail and how the test cases work. These are my first steps with this system, so I expect that I'll learn a lot more when I actually start to use it.

Read the full article on my blog: https://advancedweb.hu/nixos-first-impressions-writing-system-level-tests/

#nixos #nix

When you implement authorization for an endpoint that returns a list of items, there is an optimization that simplifies the policy structure a bit: define only the permission to list but not to get items. This makes it a bit easier for a policy writer to think about permissions as there is less duplication.

For example, in a ticketing system that provides an endpoint to list tickets (/project/project1/tickets) and to get a ticket by ID (/tickets/:ticketid) both endpoints need an authorization check (can the user list tickets / get this ticket?). If both of them return the same objects (the full ticket) then only one permission is needed: either allow getting a ticket and then the list endpoint (/project/project1/tickets) only returns the allowed items, or allow listing the tickets and then the /tickets/:ticketid can authorize whether the ticket can be listed.

This results in either a getTicket or a listTickets permission but not both. This avoids problems like "a user cannot list a ticket but if they know its ID then they can still access it", called "insecure direct object references", which is part of #1 on the OWASP list, and also "the ticket is available via the list endpoint but denied when accessed via the direct-ID endpoint". Note that this makes policies simpler, but not the backend code.
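To make the idea concrete, here is a minimal sketch in Python of the "list permission only" variant. Everything in it is hypothetical and only for illustration: the `Ticket` type, the `can_list` policy (project membership), and the in-memory ticket collection are assumptions, not part of any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    id: str
    project: str


def can_list(user_projects: set, ticket: Ticket) -> bool:
    # Hypothetical policy: a user may list tickets of projects they belong to.
    return ticket.project in user_projects


def list_tickets(tickets, user_projects, project):
    # The list endpoint returns only the tickets the user is allowed to list.
    return [t for t in tickets if t.project == project and can_list(user_projects, t)]


def get_ticket(tickets, user_projects, ticket_id):
    # The direct-ID endpoint reuses the same permission: "could this user list it?"
    for t in tickets:
        if t.id == ticket_id:
            return t if can_list(user_projects, t) else None
    return None
```

Since both endpoints go through the same `can_list` check, a ticket is either visible through both or through neither.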

It took me a couple of detours to notice this pattern and when it can be used. I started with a different problem: how should I define permissions to a list endpoint? I noticed that if the permission is getTicket then when combined with pagination it exposes information that it should not. Then when I moved to the direct-ID endpoint I realized that it only works when the information returned by the two endpoints is the same, so this is not a generic approach.

An API that better supports authorization would return less data about each item in the list operation and require a call to the direct-ID endpoint for details. But that is not how many existing systems work and so in practice this simplification might be doable.

This article is about this learning process and how I'd design an API that fits better in an authorization system.

Read it at https://advancedweb.hu/designing-safer-listitems-and-getitem-permissions/

After reading that the timeline for post-quantum cryptography is bumped closer to today I started looking into the standards and protocols that are going away and the ones that are coming. Then I started to look at the ESP32 chip: can it run post-quantum crypto?

Since there is no PQC support in the official SDK, to test that I needed to go a bit lower and find the building blocks of TLS. The conclusion: yes, it's powerful enough.

But I bumped into a separate problem: checking whether the TLS certificate is valid requires the device to know the time. How can the device know the time without TLS?

It turned out that it's a known problem: it's called the time bootstrap problem. It's about the circular requirements: secure communication needs the knowledge of time, but knowing the time needs communication when the device does not have an always-on clock.
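The time dependency can be shown with a small Python sketch. The validity window below is made up for illustration, and the assumption is that a device without a battery-backed clock starts counting from the Unix epoch after a cold start.

```python
from datetime import datetime, timezone


def within_validity(not_before: datetime, not_after: datetime, now: datetime) -> bool:
    """The time-dependent part of certificate validation: now must be inside the window."""
    return not_before <= now <= not_after


# Hypothetical certificate validity window, for illustration only.
NOT_BEFORE = datetime(2025, 1, 1, tzinfo=timezone.utc)
NOT_AFTER = datetime(2025, 4, 1, tzinfo=timezone.utc)

# Assumption: the clock starts at the Unix epoch after a cold start.
cold_start_clock = datetime(1970, 1, 1, tzinfo=timezone.utc)
synced_clock = datetime(2025, 2, 1, tzinfo=timezone.utc)

print(within_validity(NOT_BEFORE, NOT_AFTER, cold_start_clock))  # False
print(within_validity(NOT_BEFORE, NOT_AFTER, synced_clock))      # True
```

With the unsynced clock every currently valid certificate looks "not yet valid", so the device needs the time before it can trust TLS.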

This article is what I learned looking into the different technologies and how they shape the best practices. My conclusion is a bit anticlimactic: it's nice to use protocols that have some security built in, but for most cases I believe even the unencrypted, decades-old plain NTP is good enough.

Read it on my blog: https://advancedweb.hu/esp32-time-bootstrap-problem/

#iot #ntp #nts #tls

My first two months using AI

I started using AI more seriously in early November, so around this time marks my second month. When I talk to others, everyone's experience feels very
different. So to add one more data point, here is mine.

I resisted using AI for a long time. My reasoning was that prompting is easy to catch up with, so it does not matter if I join the crowd a year later. This
turned out to be true. Also, I had some trials with the ChatGPT free version which was a mixed bag so I wasn't that convinced it is any good.

What prompted me for a proper try was talking with people. Some of them told me that for Python scripts it is correct almost 100% of the time. Others used it
daily for programming tasks and were impressed by recent advancements. Moreover, apparently it is quite good for adapting cooking recipes for X amount of people
and providing a shopping list which was particularly useful when I found myself in a rural community. With a change of scenery and the time needed for
experimentation, I felt that this was a good time to start learning.

(I got a bit carried away and it does not fit in a post. Read the rest on my blog: https://advancedweb.hu/my-first-two-months-using-ai/)

I'm changing my mind about serverless

I keep track of an "ideal architecture", one that I would use if tasked to design a new system from scratch. For several years now this was AWS serverless. The
AWS part is personal: this is the stack I'm most familiar with. And serverless because it works the same for small and for large. It is a magical feeling to do
a `terraform apply` and see that all the different parts are coming live, ready to serve whatever load is coming its way. A well-designed serverless application
combines the best of all worlds: the cost scales with traffic and there is no upper ceiling.

Now? I'm not sure anymore.

When I started as a professional developer, the general consensus was:

* Compute is expensive so we need elasticity. When the traffic is lower, we can shut down machines so we don't pay for peak capacity.
* A single machine puts a ceiling on scale so we need horizontal scalability.
* Physical failures happen so we need resilient systems.

Now it seems like these are not true anymore for the vast majority. I read the [One Big Server Is Probably
Enough](https://oneuptime.com/blog/post/2025-12-12-one-big-server-is-enough/view) post and then the [The Small Data
Manifesto](https://motherduck.com/blog/small-data-manifesto/) and at some point I checked how much a dedicated server would cost. A server with 192 cores, 3.1
TB RAM, and 25 Gbps unmetered bandwidth is ~$5k. In terms of cost, that's not a significant expense for a team of developers, and in terms of capacity (to
borrow words from Gemini): this server is a tank and it will be bored. And this is the top, one that can probably handle a small country, a smaller server costs
less.

How does that change the calculations?

* Compute is cheap, so it's viable to plan for 10 times the peak. There is no need for elasticity.
* There is a ceiling but it's so high that unless you are Cloudflare or Google you won't hit it.
* I don't think most services need higher reliability than what a single machine or a primary-secondary setup can provide.

Of course, backup, disaster recovery plans, and monitoring are still needed. I'd design a system that takes periodic backups offsite and also asynchronously
replicates data as it comes. But these don't need fancy tools.

Moreover, I think the necessary reliability is way lower than most people think. "We need 5 9s!", I've heard. Last October
[AWS](https://aws.amazon.com/message/101925/) and the following month [Cloudflare](https://blog.cloudflare.com/18-november-2025-outage/) were down for hours bringing down
a big part of the internet. I'm not saying downtimes are good. But setting expectations too high is unproductive. Once every 3 years the server is down and it
takes 30 minutes to fail over to the secondary? An update brings down the service for 30 seconds? I believe these are entirely acceptable values for most cases.
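To put numbers on that, here is a quick back-of-the-envelope calculation in Python. It only covers the failover scenario above (30 minutes of downtime once every 3 years, averaged per year) and ignores leap years; the function names are made up for this sketch.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability(downtime_minutes_per_year: float) -> float:
    """Fraction of the year the service is up."""
    return 1 - downtime_minutes_per_year / MINUTES_PER_YEAR


def allowed_downtime_minutes(target_availability: float) -> float:
    """Yearly downtime budget for a given availability target."""
    return (1 - target_availability) * MINUTES_PER_YEAR


# 30 minutes of failover once every 3 years is 10 minutes per year on average.
failover_minutes_per_year = 30 / 3

print(f"{availability(failover_minutes_per_year):.5%}")
print(f"{allowed_downtime_minutes(0.99999):.2f} minutes/year for five nines")
print(f"{allowed_downtime_minutes(0.9999):.2f} minutes/year for four nines")
```

The failover scenario averages 10 minutes per year, which lands between four nines (about 53 minutes per year) and five nines (about 5 minutes per year).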

What are the upsides? No eventual consistency, local reproducibility, easy debugging, instant restarts. The lack of these silently decreases productivity. Also, a more
complex setup increases the chance that *logical* errors happen, such as the ones that caused the AWS and the Cloudflare incidents.

I'll explore the alternatives in the future. What I have in mind is a Linux box with Postgres started from a NixOS configuration file that can be rigorously
tested before deployment. There are several challenges here that need to be solved, such as tenant isolation, replication, and a lot of configuration, but I
think when set up properly this can be a base for a more developer-friendly environment.

One downside is that it needs a lot of rigour to keep the upsides and this is where I saw these setups go wrong in the past: someone SSHs into the machine,
changes some configuration or installs some packages and forgets to update the code for that. Or a process writes files to the filesystem instead of the
database and suddenly the replication does not include everything. These small things then accumulate until the point that nobody dares to touch the system
anymore. This is why I'm particularly interested in learning about NixOS: everything is code, there are no one-off changes that are quickly forgotten. It can
merge the advantages of IaC and local development.

I'd still use a serverless architecture for things that are less likely to change, things that are bounded in scope. My backup solution, for example, will stay
serverless as it benefits from the cost structure of S3 and needs only a small monitoring function. But for a product that the team is actively iterating on,
I'm exploring alternatives.

Originally published [on my blog](https://advancedweb.hu/shorts/im-changing-my-mind-about-serverless/)

Why I prefer multi-tenant systems

A multi-tenant system can be used by many customers and for each of them it looks like they are the only ones. Think about AWS, for example: the account
is isolated from all other accounts, and apart from the account ID there is no indication that anybody else is using that platform.

The obvious reason is that there is only one deployment and not one per customer. But I found that even if I needed to design a system that is only used by one
customer I would design it with support for multiple tenants.

This is a tradeoff, of course: every feature makes the system more complex and the cost of complexity tends to accumulate over time. This is especially true for
multi-tenancy as that affects almost everything.

So, why do I still prefer to add that complexity?

**Running integration tests** is easily parallelizable in a multi-tenant system: since each tenant is separated from the others, all tests can run in parallel. The
tests are simple: create a tenant, run the test, delete the tenant. The only extra thing is to clean up the junk if a test exits abruptly. I found that a naming
convention that encodes an expiration time works well: a script can periodically run and clean up the expired ones. For example, the tenants that tests create
all follow the `__TEST_AUTODELETE_<expiration>__` pattern.
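A minimal sketch of that convention in Python could look like this. Encoding the expiration as a Unix timestamp in seconds is an assumption made for this example, and the function names are hypothetical; the actual tenant deletion is left out.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: <expiration> is a Unix timestamp in seconds.
_TEST_TENANT = re.compile(r"^__TEST_AUTODELETE_(\d+)__$")


def make_test_tenant_name(now: datetime, ttl: timedelta = timedelta(hours=1)) -> str:
    """Build a tenant name that carries its own expiration time."""
    expires_at = int((now + ttl).timestamp())
    return f"__TEST_AUTODELETE_{expires_at}__"


def is_expired(name: str, now: datetime) -> bool:
    """True only for test tenants whose expiration has passed."""
    match = _TEST_TENANT.match(name)
    if match is None:
        return False  # not a test tenant: the cleanup must never touch it
    return int(match.group(1)) <= int(now.timestamp())


def expired_tenants(names, now: datetime):
    """The tenants a periodic cleanup script would delete."""
    return [name for name in names if is_expired(name, now)]
```

A test would call `make_test_tenant_name(datetime.now(timezone.utc))` when it creates its tenant, and the periodic script would pass all tenant names to `expired_tenants` and delete the result.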

Also, it enables **production monitoring that mimics real user behavior**. Instead of looking at charts and trying to figure out which one is going to go up
or down during an incident, you can write a script that logs in as a tenant, runs some scenarios, and reports success/failure. All this without worrying about
breaking something for real clients.

Then it **allows sales (and developers) to use the production system**. Create a tenant for them and they are free to demo the features.

Finally, **requirements can change**. I find it likely that more tenants will need to be put into a single system as the business evolves.

Lately, I've been looking into row-level security in Postgres and I believe multi-tenancy can be set up in a non-intrusive way. I'm still formulating my
thoughts on this, but it seems that the whole complexity of adding multi-tenancy can be contained in a limited amount of code since the tenant ID can be set
in a central place and then filtering and access control are handled by the database instead of every single query. I'll explore this topic in the future, but
it seems like a good approach.

#software-engineering

Originally published [on my blog](https://advancedweb.hu/shorts/why-i-prefer-multi-tenant-systems/)

AppSync subscriptions: waiting for start_ack can still result in missing events

It seems that when AppSync returns a `start_ack` message in response to a subscription `start`, it doesn't necessarily mean that all future events will be delivered.

Subscriptions are the mechanism to deliver real-time events from AppSync. It is based on WebSockets and its protocol is documented [here](https://docs.aws.amazon.com/appsync/latest/devguide/real-time-websocket-client.html).

In the protocol, a client needs to send a `start` message with the GraphQL query to start receiving updates. Then AppSync responds with a `start_ack` if everything is OK and then sends `data` events whenever an update happens.

Reading the documentation my impression was that `start_ack` is the moment when the subscription is live and all future events will be delivered. But what I'm seeing is that **it's not the case**. Even when the event is strictly triggered after the `start_ack` is received sometimes it is not delivered to the client.

Why is it a problem?

A common pattern for APIs with real-time updates is to subscribe to updates first then query the current state. This way there is no "temporal dead zone" when updates are lost. But that requires a definitive *point in time* when the subscription is live. Without that, it's only best-effort and messages will be lost every now and then especially in cases when the subscription is made just before the event, common in tests and some async workflows.
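The subscribe-first pattern can be sketched in Python with a toy in-memory feed. This is not the AppSync API: the `Feed` and `LiveView` classes are made up for illustration, and the sketch assumes exactly the guarantee in question, namely that the subscription is live as soon as `subscribe` returns.

```python
class Feed:
    """Toy in-memory event source (hypothetical, not AppSync)."""

    def __init__(self):
        self._state = {}
        self._subscribers = []

    def subscribe(self, callback):
        # Assumption: the subscription is live as soon as this call returns.
        self._subscribers.append(callback)

    def publish(self, key, value):
        self._state[key] = value
        for callback in self._subscribers:
            callback(key, value)

    def query(self):
        return dict(self._state)


class LiveView:
    """Client-side copy of the state that is kept up to date by events."""

    def __init__(self, feed):
        self.items = {}
        feed.subscribe(self._on_event)  # 1. subscribe first
        for key, value in feed.query().items():  # 2. then query the current state
            # An event that already arrived is at least as new as the snapshot.
            self.items.setdefault(key, value)

    def _on_event(self, key, value):
        self.items[key] = value
```

If the order is reversed (query, then subscribe), anything published between the two calls is lost. The same happens with the correct order when the subscription is not actually live at the point the client thinks it is.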

Real-time updates, especially in AppSync, are a complex topic and easy to get wrong. I've [written about it before](https://advancedweb.hu/shorts/apollos-subscribetomore-is-the-wrong-abstraction/), it has a [separate section in my book](https://www.graphql-on-aws-appsync-book.com/client-side/implementing-subscriptions/), and I even [made a library](https://github.com/sashee/appsync-subscription-observable) because I wasn't particularly happy with the AWS-provided one.

I have noticed tests that use subscriptions timing out for a long time now, but I wrote it off as "something complex is happening" and added some retries to handle it. A message is published to IoT Core that triggers a Lambda, which writes to DynamoDB, which then triggers the subscription. A lot can go wrong, so it's realistic that the 10-ish seconds sometimes pass.

But I then started working on a simpler setup and still noticed that some events seemingly never arrive. This time I could pinpoint the issue because if a parallel subscription is opened before then the event is delivered there. So the problem must be that the subscription is not live even though AppSync says it is.

Hopefully, it will get fixed soon.

Bug report opened [in the AppSync repo](https://github.com/aws/aws-appsync-community/issues/405).

#aws #appsync

Originally published [on my blog](https://advancedweb.hu/shorts/appsync-subscriptions-waiting-for-start_ack-can-still-result-in-missing-events/)

Many projects close issues after a triage if the feature/bug is not planned. For example, the terraform-provider-aws uses a bot that detects stale issues (for example, I'm following this one and I'm getting periodic emails about it). If nobody comments for a period of time the issue gets closed.

I get why it's good from the project's perspective: if you open the issue tracker and it's full of open issues then it's both depressing and counterproductive as it buries the important things to work on. Most of these projects with a public issue tracker are open-source ones and the maintainers are usually not paid for their work. So it's not up to me to decide how they organize their work.

But from my side this pattern is a problem.

Let's say I encounter an edge case. I report it, find some workaround, and add a comment along the lines of "workaround for X, see ticket Y". But maybe it's outside the current priorities of the project so it gets closed. I get an email saying it's closed because it's "not planned".

Fast-forward a couple of years. Maybe by then the issue is in scope for the project and they actually implement it. How do I know? I only get notifications for the original ticket, which was closed after I opened it. And when something is closed, nobody goes back to it to let me know that it's now done. Maybe there is another ticket opened for the same thing, but I'm not subscribed to that one. I need to actively look for the fix to find it.

The same thing happens in internal projects as well. The backlog is always too large, so there is a tendency to say: we're unlikely to work on most of these things, so let's close the majority of these tickets.

My argument against this: even though these tickets are not useful now they can be useful in the future. Maybe someone will take a look at the backlog at some point and realize that some important thing did not get implemented. Or someone bumps into an issue and realizes that it's a known bug.

In this case the backlog is a list of things we know we don't have. Adding some tags so that they can be easily hidden would be a better solution.

Published [on my blog](https://advancedweb.hu/shorts/closing-issues-because-they-are-unplanned-is-bad-ux/)

Closing issues because they are unplanned is bad UX

Hardening with Firejail, Landlock, and bubblewrap

Recently I've been looking into securing my laptop a bit. By default, every single program has access to everything: filesystem, network, other programs.

First, I started looking into Firejail. It allows specifying paths the program can access, as well as the network and other special things. It's not bad and I used it for a while.

What I don't like about Firejail is that it's setuid: it runs as root, sets up the sandbox, then starts the program that is passed as an argument. If there is a problem in Firejail then it can even extend the blast radius.

Then I learned about Landlock. It is unprivileged and also allows restricting the network. At some point I found a [CLI](https://github.com/Zouuup/landrun) that makes it easy to run. Landlock solves the privilege problem: it restricts the process without having more permissions to do so.

The problem with Landlock is that its filesystem restrictions are a bit too coarse: if a directory is allowed then everything below it is also allowed. For example, giving read access to $HOME also gives read access to the Chromium profile.

Now I'm looking into bubblewrap. It promises to combine Firejail and Landlock in the best way: unprivileged and also allows layering filesystem access.

I'm still working on moving my dotfiles to bubblewrap and it takes some mental energy to do that. But it seems like it's going to be a good next step.

#security #linux #bwrap #landlock #firejail

Originally published [on my blog](https://advancedweb.hu/shorts/hardening-with-firejail-landlock-and-bubblewrap/)

Another AWS footgun: Cognito custom attributes

You can define extra attributes for users in user pools. Maybe you want to store information that is not covered by the standard attributes, such as social profiles or preferred currency.

But there is a [catch](https://docs.aws.amazon.com/cognito/latest/developerguide/user-pool-settings-attributes.html#user-pool-settings-custom-attributes):

> You can't remove or change it after you add it to the user pool.

I had to remove all users and recreate the user pool because of this (it was a personal dev environment fortunately).

Why is it a big thing?

* There is a limit of 50 custom attributes you can add. It's a finite resource
* You use code to deploy your infrastructure? Now you can't rollback
* Or you use clickops? Watch where you click as this is a one-way road

To make things worse, it's practically impossible to replace a user pool. You don't have access to the passwords and the MFA secrets (which is a good thing), which means that if you move users everything is reset for them, including their `sub` (subject id), which might affect your databases.

I'd stay very far away from using custom attributes.

What's the better solution? You probably already have some backend with some database: use that to store any extra information about users.

#aws #cognito

Originally published [on my blog](https://advancedweb.hu/shorts/another-aws-footgun-cognito-custom-attributes/)
