
My first two months using AI

I started using AI more seriously in early November, so this marks roughly my second month. When I talk to others, everyone's experience feels very different. So to add one more data point, here is mine.

I resisted using AI for a long time. My reasoning was that prompting is easy to catch up with, so it does not matter if I join the crowd a year later. This turned out to be true. Also, my earlier trials with the free version of ChatGPT were a mixed bag, so I wasn't convinced it was any good.

What prompted me to give it a proper try was talking with people. Some of them told me that for Python scripts it is correct almost 100% of the time. Others used it daily for programming tasks and were impressed by recent advancements. Moreover, apparently it is quite good at adapting cooking recipes for a given number of people and producing a shopping list, which was particularly useful when I found myself in a rural community. With a change of scenery and time for experimentation, I felt it was a good time to start learning.

(I got a bit carried away and it does not fit in a post. Read the rest on my blog: https://advancedweb.hu/my-first-two-months-using-ai/)


I'm changing my mind about serverless

I keep track of an "ideal architecture", one that I would use if tasked to design a new system from scratch. For several years now this was AWS serverless. The AWS part is personal: this is the stack I'm most familiar with. And serverless because it works the same at small scale and at large. It is a magical feeling to do a `terraform apply` and see all the different parts come alive, ready to serve whatever load comes their way. A well-designed serverless application combines the best of all worlds: the cost scales with traffic and there is no upper ceiling.

Now? I'm not sure anymore.

When I started as a professional developer, the general consensus was:

* Compute is expensive, so we need elasticity. When traffic is lower, we can shut down machines so we don't pay for peak capacity.
* A single machine puts a ceiling on scale, so we need horizontal scalability.
* Physical failures happen, so we need resilient systems.

Now it seems like these are not true anymore for the vast majority of systems. I read the [One Big Server Is Probably Enough](https://oneuptime.com/blog/post/2025-12-12-one-big-server-is-enough/view) post and then [The Small Data Manifesto](https://motherduck.com/blog/small-data-manifesto/), and at some point I checked how much a dedicated server would cost. A server with 192 cores, 3.1 TB RAM, and 25 Gbps unmetered bandwidth is ~$5k. In terms of cost, that's not a significant expense for a team of developers, and in terms of capacity (to borrow words from Gemini): this server is a tank and it will be bored. And this is the top end, one that can probably handle a small country; a smaller server costs less.

How does that change the calculations?

* Compute is cheap; it's viable to plan for 10 times the peak, so there is no need for elasticity.
* There is a ceiling, but it's so high that unless you are Cloudflare or Google you won't hit it.
* I don't think most services need higher reliability than what a single machine or a primary-secondary setup can provide.

Of course, backups, disaster recovery plans, and monitoring are still needed. I'd design a system that takes periodic backups offsite and also asynchronously replicates data as it comes in. But these don't need fancy tools.

Moreover, I think the necessary reliability is way lower than most people think. "We need 5 9s!", I've heard. Last October [AWS](https://aws.amazon.com/message/101925/) was down for hours, and the next month [Cloudflare](https://blog.cloudflare.com/18-november-2025-outage/) followed, each bringing down a big part of the internet. I'm not saying downtimes are good. But setting expectations too high is unproductive. Once every 3 years the server is down and it takes 30 minutes to fail over to the secondary? An update brings down the service for 30 seconds? I believe these are entirely acceptable values for most cases.
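As a quick back-of-the-envelope check (the numbers are mine; the update frequency is an assumption, since the post only gives per-incident durations), those downtime values still translate to a respectable availability:

```python
# Downtime budget from the scenario above: a 30-minute failover once
# every 3 years, plus a 30-second outage per update (assuming weekly
# deployments -- my assumption, not stated in the post).
MINUTES_PER_YEAR = 365 * 24 * 60

failover_minutes_per_year = 30 / 3       # one failover every 3 years
update_minutes_per_year = 52 * 30 / 60   # 52 updates x 30 seconds
downtime = failover_minutes_per_year + update_minutes_per_year  # 36 min/year

availability = 1 - downtime / MINUTES_PER_YEAR
print(f"{availability:.5%}")  # 99.99315% -- above "four nines"
```

So even this pessimistic budget clears 99.99%, which is better than what many teams achieve with far more complex setups.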

What are the upsides? No eventual consistency, plus local reproducibility, easy debugging, and instant restarts: the things whose absence silently decreases productivity. Also, a more complex setup increases the chance of *logical* errors, such as the ones that caused the AWS and the Cloudflare incidents.

I'll explore the alternatives in the future. What I have in mind is a Linux box with Postgres started from a NixOS configuration file that can be rigorously
tested before deployment. There are several challenges here that need to be solved, such as tenant isolation, replication, and a lot of configuration, but I
think when set up properly this can be a base for a more developer-friendly environment.

One downside is that it needs a lot of rigour to keep the upsides and this is where I saw these setups go wrong in the past: someone SSHs into the machine,
changes some configuration or installs some packages and forgets to update the code for that. Or a process writes files to the filesystem instead of the
database and suddenly the replication does not include everything. These small things then accumulate until the point that nobody dares to touch the system
anymore. This is why I'm particularly interested in learning about NixOS: everything is code, there are no one-off changes that are quickly forgotten. It can
merge the advantages of IaC and local development.

I'd still use a serverless architecture for things that are less likely to change, things that are bounded in scope. My backup solution, for example, will stay
serverless as it benefits from the cost structure of S3 and needs only a small monitoring function. But for a product that the team is actively iterating on,
I'm exploring alternatives.

Originally published [on my blog](https://advancedweb.hu/shorts/im-changing-my-mind-about-serverless/)


Why I prefer multi-tenant systems

A multi-tenant system can be used by many customers and for each of them it looks like they are the only ones. Think about AWS, for example: the account
is isolated from all other accounts, and apart from the account ID there is no indication that anybody else is using that platform.

The obvious reason is that there is only one deployment instead of one per customer. But I found that even if I needed to design a system used by only one customer, I would still design it with support for multiple tenants.

This is a tradeoff, of course: every feature makes the system more complex and the cost of complexity tends to accumulate over time. This is especially true for
multi-tenancy as that affects almost everything.

So, why do I still prefer to add that complexity?

**Running integration tests** is easily parallelizable in a multi-tenant system: since each tenant is separated from the others, all tests can run in parallel. The tests are simple: create a tenant, run the test, delete the tenant. The only extra thing is to clean up the junk if a test exits abruptly. I found that a naming convention that encodes an expiration time works well: a script can run periodically and clean up the expired ones. For example, the tenants that tests create all follow the `__TEST_AUTODELETE_<expiration>__` pattern.
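A minimal sketch of such a cleanup script (the encoding of `<expiration>` as a Unix timestamp and the function names are my assumptions, not from the post):

```python
import re
import time

# Tenants created by tests follow the __TEST_AUTODELETE_<expiration>__
# pattern from the post; the expiration is assumed here to be a Unix
# timestamp (the exact encoding is my choice).
PATTERN = re.compile(r"^__TEST_AUTODELETE_(\d+)__$")

def expired_test_tenants(tenant_names, now=None):
    """Return the test tenants whose encoded expiration has passed."""
    now = time.time() if now is None else now
    expired = []
    for name in tenant_names:
        match = PATTERN.match(name)
        if match and int(match.group(1)) < now:
            expired.append(name)
    return expired

# A periodic job would call this and delete whatever it returns:
tenants = ["acme", "__TEST_AUTODELETE_1700000000__", "__TEST_AUTODELETE_9999999999__"]
print(expired_test_tenants(tenants, now=1800000000))
# -> ['__TEST_AUTODELETE_1700000000__']
```

Real tenants never match the pattern, so the job can run against the full tenant list without risk.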

Also, it enables **production monitoring that mimics real user behavior**. Instead of looking at charts and trying to figure out which one is going to go up
or down during an incident, you can write a script that logs in as a tenant, runs some scenarios, and reports success/failure. All this without worrying about
breaking something for real clients.

Then it **allows sales (and developers) to use the production system**. Create a tenant for them and they are free to demo the features.

Finally, **requirements can change**. I find it likely that putting more tenants into a single system will become necessary as the business evolves.

Lately, I've been looking into row-level security in Postgres and I believe multi-tenancy can be set up in a non-intrusive way. I'm still formulating my
thoughts on this, but it seems that the whole complexity of adding multi-tenancy can be contained in a limited amount of code since the tenant ID can be set
in a central place and then filtering and access control are handled by the database instead of every single query. I'll explore this topic in the future, but
it seems like a good approach.
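As a hedged sketch of what that central place could look like (the table, policy, and setting names such as `app.tenant_id` are illustrative, not from the post), the idea is that a policy filters every query by the tenant set on the connection:

```python
# Row-level security setup, run once as DDL. With a policy like this,
# application queries need no explicit tenant filter: Postgres applies
# it to every statement touching the table.
RLS_SETUP = """
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON orders
    USING (tenant_id = current_setting('app.tenant_id'));
"""

def set_tenant(cursor, tenant_id):
    """The single central place that scopes a connection to one tenant."""
    # is_local=true limits the setting to the current transaction.
    cursor.execute("SELECT set_config('app.tenant_id', %s, true)", (tenant_id,))
```

The important property is that only `set_tenant` mentions the tenant ID; filtering and access control move into the database instead of every single query.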

#software-engineering

Originally published [on my blog](https://advancedweb.hu/shorts/why-i-prefer-multi-tenant-systems/)


AppSync subscriptions: waiting for start_ack can still result in missing events

It seems that when AppSync returns a `start_ack` message in response to a subscription `start`, it does not necessarily mean that all future events will be delivered.

Subscriptions are the mechanism to deliver real-time events from AppSync. It is based on WebSockets and its protocol is documented [here](https://docs.aws.amazon.com/appsync/latest/devguide/real-time-websocket-client.html).

In the protocol, a client needs to send a `start` message with the GraphQL query to start receiving updates. Then AppSync responds with a `start_ack` if everything is OK and then sends `data` events whenever an update happens.

Reading the documentation, my impression was that `start_ack` is the moment when the subscription is live and all future events will be delivered. But what I'm seeing is that **this is not the case**. Even when the event is triggered strictly after the `start_ack` is received, sometimes it is not delivered to the client.

Why is it a problem?

A common pattern for APIs with real-time updates is to subscribe to updates first, then query the current state. This way there is no "temporal dead zone" in which updates are lost. But that requires a definitive *point in time* when the subscription is live. Without that, it's only best-effort, and messages will be lost every now and then, especially when the subscription is made just before the event, which is common in tests and some async workflows.
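The subscribe-first pattern can be sketched with an in-memory model (this is not AppSync code; the class and method names are made up to illustrate the ordering):

```python
import threading

# An in-memory model of "subscribe first, then query": events that arrive
# before the snapshot is applied are buffered and replayed, so nothing
# falls into the gap between the two calls.
class LiveState:
    def __init__(self, source):
        self._buffer = []
        self._state = None
        self._lock = threading.Lock()
        source.subscribe(self._on_event)  # 1. subscribe BEFORE querying
        snapshot = source.query()         # 2. fetch the current state
        with self._lock:                  # 3. apply snapshot, replay buffer
            self._state = list(snapshot) + self._buffer
            self._buffer = None

    def _on_event(self, event):
        with self._lock:
            if self._buffer is not None:  # snapshot not applied yet: buffer
                self._buffer.append(event)
            else:                         # live: apply directly
                self._state.append(event)

    def state(self):
        with self._lock:
            return list(self._state)

class FakeSource:
    """Delivers one event between subscribe and query, simulating the race."""
    def subscribe(self, handler):
        handler({"id": 2})  # arrives before the snapshot query returns
    def query(self):
        return [{"id": 1}]

live = LiveState(FakeSource())
print(live.state())  # [{'id': 1}, {'id': 2}] -- the early event is not lost
```

A real implementation also has to deduplicate events already included in the snapshot. The point is that the pattern only works if the subscription is truly live before `query` runs, which is exactly the guarantee the `start_ack` behavior breaks.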

Real-time updates, especially in AppSync, are a complex topic and easy to get wrong. I've [written about it before](https://advancedweb.hu/shorts/apollos-subscribetomore-is-the-wrong-abstraction/), it has a [separate section in my book](https://www.graphql-on-aws-appsync-book.com/client-side/implementing-subscriptions/), and I even [made a library](https://github.com/sashee/appsync-subscription-observable) because I wasn't particularly happy with the AWS-provided one.

I had noticed tests using subscriptions timing out for a long time, but I wrote it off as "something complex is happening" and added some retries to handle it. A message is published to IoT Core, which triggers a Lambda that writes to DynamoDB, which then triggers the subscription. A lot can go wrong, so it seemed realistic that the 10-ish second timeout is sometimes exceeded.

But then I started working on a simpler setup and still noticed that some events seemingly never arrive. This time I could pinpoint the issue: if a parallel subscription is opened beforehand, the event is delivered there. So the problem must be that the subscription is not live even though AppSync says it is.

Hopefully, it will get fixed soon.

Bug report opened [in the AppSync repo](https://github.com/aws/aws-appsync-community/issues/405).

#aws #appsync

Originally published [on my blog](https://advancedweb.hu/shorts/appsync-subscriptions-waiting-for-start_ack-can-still-result-in-missing-events/)


Many projects close issues after a triage if the feature/bug is not planned. For example, the terraform-provider-aws uses a bot that detects stale issues (for example, I'm following this one and I'm getting periodic emails about it). If nobody comments for a period of time the issue gets closed.

I get why it's good from the project's perspective: if you open the issue tracker and it's full of open issues, it's both depressing and counterproductive, as it buries the important things to work on. Most of these projects with a public issue tracker are open-source and the maintainers are usually not paid for their work. So it's not up to me to decide how they organize their work.

But from my side this pattern is a problem.

Let's say I encounter an edge case. I report it, find some workaround, and add a comment along the lines of "workaround for X, see ticket Y". But maybe it's outside the current priorities of the project so it gets closed. I get an email saying it's closed because it's "not planned".

Fast-forward a couple of years. Maybe the issue is now in scope for the project and they actually implement it. How do I know? I only get notifications for the original ticket, which was closed shortly after I opened it. And when something is closed, nobody goes back to it to let me know that it's now done. Maybe there is another ticket opened for the same thing, but I'm not subscribed to that one. I need to actively look for the fix to find it.

The same thing happens in internal projects as well. The backlog is always too large, so there is a tendency to say: we're unlikely to work on most of these things, so let's close the majority of these tickets.

My argument against this: even though these tickets are not useful now they can be useful in the future. Maybe someone will take a look at the backlog at some point and realize that some important thing did not get implemented. Or someone bumps into an issue and realizes that it's a known bug.

In this case the backlog is a list of things we know we don't have. Adding some tags so that they can be easily hidden would be a better solution.

Published [on my blog](https://advancedweb.hu/shorts/closing-issues-because-they-are-unplanned-is-bad-ux/)

Closing issues because they are unplanned is bad UX

Hardening with Firejail, Landlock, and bubblewrap

Recently I've been looking into securing my laptop a bit. By default, every single program has access to everything: filesystem, network, other programs.

First, I started looking into Firejail. It allows specifying paths the program can access, as well as the network and other special things. It's not bad and I used it for a while.

What I don't like about Firejail is that it's setuid: it runs as root, sets up the sandbox, then starts the program that is passed as an argument. If there is a problem in Firejail then it can even extend the blast radius.

Then I learned about Landlock. It is unprivileged and also allows restricting the network. At some point I found a [CLI](https://github.com/Zouuup/landrun) that makes it easy to run. Landlock solves the privilege problem: it restricts the process without having more permissions to do so.

The problem with Landlock is its fs restrictions are a bit too coarse: if a directory is allowed then everything below it is also allowed. For example, giving read access to $HOME also gives read access to the chromium profile.

Now I'm looking into bubblewrap. It promises to combine Firejail and Landlock in the best way: unprivileged and also allows layering filesystem access.

I'm still working on moving my dotfiles to bubblewrap and it takes some mental energy to do that. But it seems like it's going to be a good next step.

#security #linux #bwrap #landlock #firejail

Originally published [on my blog](https://advancedweb.hu/shorts/hardening-with-firejail-landlock-and-bubblewrap/)


Another AWS footgun: Cognito custom attributes

You can define extra attributes for users in user pools. Maybe you want to store information that is not covered by the standard attributes, such as social profiles or preferred currency.

But there is a [catch](https://docs.aws.amazon.com/cognito/latest/developerguide/user-pool-settings-attributes.html#user-pool-settings-custom-attributes):

> You can't remove or change it after you add it to the user pool.

I had to remove all users and recreate the user pool because of this (it was a personal dev environment fortunately).

Why is this a big deal?

* There is a limit of 50 custom attributes you can add: it's a finite resource.
* Do you use code to deploy your infrastructure? Now you can't roll back.
* Or do you use ClickOps? Watch where you click, as this is a one-way road.
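Since the schema is effectively append-only, one mitigation is a pre-deployment check that diffs the desired custom attributes against what is already in the pool. A sketch (the function and data shapes are mine, not an AWS API; only the 50-attribute limit and the no-remove/no-change rules come from the docs):

```python
# Guard against accidental one-way schema changes in a Cognito user pool.
# Attributes are represented as {name: data_type} dicts for simplicity.
MAX_CUSTOM_ATTRIBUTES = 50  # documented Cognito limit

def check_schema_change(existing, desired):
    """Return a list of problems a deployment would hit."""
    problems = []
    removed = set(existing) - set(desired)
    if removed:
        problems.append(f"cannot remove attributes: {sorted(removed)}")
    for name in set(existing) & set(desired):
        if existing[name] != desired[name]:
            problems.append(f"cannot change attribute: {name}")
    if len(set(existing) | set(desired)) > MAX_CUSTOM_ATTRIBUTES:
        problems.append(f"over the {MAX_CUSTOM_ATTRIBUTES} custom attribute limit")
    return problems
```

Running a check like this in CI at least turns a silent one-way change into a failed build.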

To make things worse, it's practically impossible to replace a user pool. You don't have access to the passwords and the MFA secrets (which is a good thing), which means that if you migrate users, everything is reset for them. That includes their `sub` (subject ID), which might affect your databases.

I'd stay very far away from using custom attributes.

What's the better solution? You probably already have a backend with a database: use that to store any extra information about users.

#aws #cognito

Originally published [on my blog](https://advancedweb.hu/shorts/another-aws-footgun-cognito-custom-attributes/)


CloudFormation (and by extension everything that builds on it, such as the CDK) promises declarative deployments: you describe what your infrastructure should look like and the tool makes reality match that.

When it works well it simplifies thinking about deployments a lot: you can update, roll back, and switch to any version, just like you could with a version-controlled local program.

But the reality is that there are edge cases: some resources don't play well with this declarative approach. Most of my time working with CloudFormation and the CDK is spent working around these edge cases.

I've just published an article with a couple of examples of these problematic resources I've encountered, starting with something as simple as an S3 bucket, moving on to IAM Roles, and ending with my favourite, IoT DomainConfigurations.

Read it here: https://advancedweb.hu/edge-cases-in-cloudformation/

#cloudformation #aws


Retries are still a best practice for serverless architectures

I watched Marc Brooker's talk on re:Invent 2024, [Try again: The tools and techniques behind resilient systems (ARC403)](https://www.youtube.com/watch?v=rvHd4Y76-fs). In that, he talks about metastability and how retries can make a transient failure a systemic one.

His example is a spike in requests that overloads a server. Without retries, clients get errors, but after a while things go back to normal. The downside is that clients will see errors instead of just some delay.

Let's fix that by adding some retries!

Sounds reasonable, but now there is a bigger problem: the server will never recover, because an already overloaded server is getting even more traffic. According to Marc, this is the "effect behind some of the biggest outages of large-scale systems over the history of the industry".

The rest of the talk is equally interesting (erasure coding for reducing tail latency? wow!), but I kept thinking about this.

I work primarily with serverless architectures, and I think they are fundamentally different: retries are not as harmful to them as they are to a server-based architecture.

The advantage of a serverless architecture is that it has no practical upper limit in scalability. Maybe DynamoDB or Lambda needs some time to scale out, but eventually it will and then the increased traffic will be handled just fine. The system won't be stuck in an endless overloaded state.

There is a huge caveat here though. Lambda, S3, DynamoDB, AppSync, API Gateway, and similar services scale to infinity, but other parts might not. If the app uses any service in the critical path, first- or third-party, that has an upper limit, then suddenly all those nice scalability features go out the window.

And the worst part? You won't even know until the crash happens.
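For completeness, the retry behavior Marc recommends, capped exponential backoff with full jitter, can be sketched like this (the names and defaults are mine, not from the talk):

```python
import random
import time

# Capped exponential backoff with full jitter. Backoff spreads retries
# out over time; jitter prevents synchronized retry storms; the cap
# bounds the worst-case delay.
def call_with_retries(operation, attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            # Full jitter: sleep a random amount up to the exponential cap.
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Even with a well-behaved client like this, the argument above still holds: retries are only safe if every component in the critical path can eventually absorb the extra traffic.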

#serverless

Originally published [on my blog](https://advancedweb.hu/shorts/retries-are-still-a-best-practice-for-serverless-architectures/)


Key management and cryptography

I've read [an interesting quote](https://digitalseams.com/blog/when-encryption-works-perfectly-and-still-fails):

> Kissner’s law: Cryptography converts many problems into key management problems, and key management problems are way harder than you think.

This resonates with my experience. There is a huge difference between encrypted storage when encryption happens on the client-side and when it's managed by the server, for example.

I've written a couple of articles about encryption mostly in the context of the cloud.

In [Seamless S3 encryption does not imply better security](https://advancedweb.hu/seamless-s3-encryption-does-not-imply-better-security/) I wrote:

> SSE-S3 helps to tick a box, but nothing else

This was at the time when by default objects were stored unencrypted and there was an option to turn on seamless encryption. Since it did not change how objects are accessed (hence the "seamless" in the name) it had no observable effect.

Then in a separate article, [Encryption in the cloud](https://advancedweb.hu/encryption-in-the-cloud/), I looked into the other, non-seamless varieties of encryption in AWS. My conclusion was mainly about KMS:

> encryption does nothing else but splits permissions into two required parts

This is because if an object is encrypted using a KMS key then the user needs two permissions: one is to read the object, the other is to use the key to decrypt it. But since there is no way to get back the encrypted object itself, it is only a separate inter-service access control layer.
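That claim can be illustrated with a toy model (the permission names mirror IAM actions, but this is not AWS code):

```python
# With SSE-KMS, reading an object effectively requires two independent
# permissions: one to fetch the object, one to use the key to decrypt it.
# S3 never hands out the ciphertext alone, so the encryption acts as a
# second access control layer rather than protection against S3 itself.
def can_read_object(permissions):
    return {"s3:GetObject", "kms:Decrypt"} <= set(permissions)

print(can_read_object(["s3:GetObject", "kms:Decrypt"]))  # True
print(can_read_object(["s3:GetObject"]))                 # False: key access missing
```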

Finally, after the Zoom fiasco, when they claimed to use end-to-end encryption but did not, I wrote [What is end-to-end encryption and why it's such a confusing term](https://advancedweb.hu/what-is-end-to-end-encryption-and-why-its-such-a-confusing-term/). In it, I looked at how TLS termination breaks E2EE and how key management determines who can access the data.

#encryption #security

Originally published [on my blog](https://advancedweb.hu/shorts/key-management-and-cryptography/)
