Honestly, I really love Patroni as a PostgreSQL failover solution, coupled with the amazing HAProxy.

patronictl -c conf.yml reinit postgres nfnode2 was all I needed to reinitialize a cluster node that had drifted too far out of sync.

Normally, when a node joins the cluster, it is synced automatically.

But when it has been gone for too long, you can't simply sync it anymore; you have to reinitialize it, because its history predates what is still available on the active node.

But that command just reinitializes the node and resyncs it from scratch.

We'll see if it worked as expected ... tomorrow 😅

#WorkTopics

Why you need a staging environment that is near-identical to your production environment: to find out what happens when the root partition fills up because you (or someone else) forgot to put Couchbase on its own partition.

The result was that one of the Couchbase servers in the cluster failed and I couldn't log in anymore.

Also: all of this only became visible because I wanted to test the PostgreSQL failover by shutting down the primary PostgreSQL server and watching how quickly HAProxy switches to one of the other two.

The experiment worked, but the side effect was the new insight that we had forgotten to put the Couchbase data on its own filesystem.

Yay!!

So, in any case, the planned work is off for this evening, because we first need to revisit the future CB servers on production and move the CB storage to its own FS before I can add them to the existing cluster.

#WorkTopics

What a day...

I have finally designed the new architecture - although so far it only exists as diagrams.
I have also agreed with our service provider to set up completely new VMs for this architecture and then integrate them into the running system without disruption. (I don't see a problem there, as the only change to the existing logic is moving the Couchbase servers from the existing three VMs to three new VMs, which can be done on the fly.)

Now to:

  • Implement this architecture - manually - in the dev environment (waiting for firewall rules to be updated)
  • While doing so, document every single step in the new Operations Manual
  • Create config files for each component and store them in a config repository
  • Deploy those configs into etcd
  • Write any scripts needed to generate config files from etcd for those components that don't support etcd directly
  • Then, if everything works and is tested:

    • Hand over the documentation + config files to our service provider so that they can create Puppet files to set all of this up automatically for Staging.

    If that works (and I have deployed to staging):

    • Try all of this on Production

    And all of this before mid-December.
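
For the "generate config files from etcd" step, a minimal sketch could look like this. It assumes a flat key/value layout per component (the layout and key names are my assumptions, not the real setup), and the etcd fetch itself is stubbed out so the rendering logic stands on its own:

```python
# Minimal sketch: render a component's config file from key/value pairs.
# In practice the pairs would come from etcd (e.g. via the python-etcd3
# client, reading everything under a per-component prefix); here they are
# passed in directly so the logic is self-contained and testable.

def render_config(values: dict[str, str]) -> str:
    """Render key/value pairs as simple KEY=VALUE lines, sorted for stable diffs."""
    lines = [f"{key}={value}" for key, value in sorted(values.items())]
    return "\n".join(lines) + "\n"

# Example pairs, as they might be stored for one component (hypothetical keys)
sample = {"listen_port": "8080", "log_level": "info"}
print(render_config(sample), end="")
```

Sorting the keys keeps the generated files stable, so re-running the script only produces diffs when a value in etcd actually changed.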

I am very confident though, because we all want to get away from the existing architecture and everybody is really enthusiastic.

Yeah

#WorkTopics

What one doesn't do for their team ...

167 files changed or added.

For 10 development servers (and production, staging, integration, obviously).

The majority so that my team has a comfortable and more useful dev environment (the dev servers).

I am sooooo done today.

Background

We have 11 development servers, ten of which are allocated one per developer on the team. The dev servers should, obviously, mirror the production environment as closely as possible, including all the regular timers/services, etc.

Of course they all need separate configurations, because ... well, that's just how it is, right? (Yes, yes, I am working on deploying etcd and moving every config there, but that takes some time.)

And of course they all need their own systemd services and timers, meaning that each timer/service exists at least 11 times, and since there are five services per server (search, api-server, three timer services), that makes 55 files to be changed.

But that's not all.

I also have to change the scripts that these services/timers use. There are now three scripts per server, adding another 33 files.

Yeah, and since I aggregated a lot of existing services into those shell scripts, that also meant deleting a lot of obsolete scripts/files/services, which brings the total to 167 changes (including 14 * 3 [production, staging, integration] = 42 changes).
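
The file counts above add up like this (the number of deleted files is inferred as the remainder, which is my assumption):

```python
# Worked arithmetic for the 167 changes described above.
# The deletion count is inferred as the remainder - an assumption.

servers = 11
unit_files = servers * 5        # five systemd services/timers per server
script_files = servers * 3      # three shell scripts per server
env_changes = 14 * 3            # production, staging, integration

subtotal = unit_files + script_files + env_changes
deletions = 167 - subtotal      # obsolete scripts/files/services removed

print(unit_files, script_files, subtotal, deletions)  # 55 33 130 37
```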

Yep, what one doesn't do for their team ...

Listen: without neovim/vim, this task would've taken forever. My most-used command in neovim is . - repeat the last change. If your editor doesn't have the .-command, don't even bother talking to me ...

#WorkTopics

I know this stuff isn't easy, and I know you are the only one in the world who has implemented it really successfully - but I think you really, really need to update your dev documentation.

Also, it wouldn't hurt to modernize your API - but still, I am absolutely thankful that you have done this.

#WorkTopics

#TailorMade

I don't want to, but I have to write tickets for my team - splitting the new-product development into logical tasks.

Before you ask: this is how we work. My team doesn't like writing tickets for new products; they write bug tickets or change tickets, but they don't like writing enhancement tickets or tickets for a completely new product.

Usually, we discuss how we are going to approach the new product in a dev-team meeting, and then it is up to me to break it down into logical pieces, write the tickets, and order them.

Yeah - being Head of Development comes with a huge price tag, but I love ... my team.

#WorkTopics

Today, I finally got access to Veracode and AIKIDO.

Via Veracode, I did a static analysis of all our code. The result was 65 findings in about 280k lines of code - all of which were false positives.

AIKIDO had about 30 findings: 1 critical, 10 high, the rest medium/low. These were all fixable by updating the dependencies, including (unfortunately) a few overrides (I don't like overrides, but these ones are harmless). We do a package-dependency update every 2-3 months, and today just fell in between those runs, so not really something to worry about.

We now have only one(!!) "high" finding - that one will be fixed in the code - and it is really just about 15 minutes of work.

ONE(!!) finding that needs to be fixed, in 280k LoC - AWESOME!

I must say, I am extremely proud of my team's work: two years of development and only a single finding that needs fixing! WOW!! WOWOWOWOW!!!

Yeah, some non-political positive news.

I am so proud - OMG! WOW!

#WorkTopics

(Warning: #WorkTopics)

What a beautiful feeling: on Tuesday, we had a really, really big launch and ... nothing happened. Nothing, really, absolutely nothing. Not even the slightest bit of feedback from customers (yes, yes, we have about 10k unique registered users/day). We have found a few minor bugs, but they are no show-stoppers and are being worked on.

And you know what's amazing? Of the tickets we still have, very few are important (Priority: Normal/High/Critical). All the others are either Priority: Low, being worked on ("In Progress"), waiting to be tested by the ticket owner, or already in testing.

We are now preparing the next big enhancements for the product:

• v1.1: end of November => adding new API endpoints so that we can port our mobile app
• v1.2: end of December => WebDAV support for the new api-server
• v1.5: end of January => "confidential" 😂
• v2: end of February => "even more confidential" 😂 ...

When the team "complained": "We don't have that many tickets left", my response was: "Ok, then, what about slacking off a little for the next two months? I think you've all earned that, don't you? After 20 months of writing all that stuff from scratch..."

Yep, great stuff.

Now, off to meet a wonderful friend.

We have introduced an OAuth2 server for my client.

This server supports a variety of clients, one of which is the new web app that we developed. We had specifically built the auth server for use in the new web app (and future apps).

But since the auth server is really great, we decided to add support to the classic web app as well, so we did.

The question now is which one (the classic or the new web app) should be the default after login if a user just uses a URL like .com/login (the existing login URL)?

Ok, of course, the decision is "new web app". So far, so good.

There is, though, a situation where, after login, the new web app realizes that this user can (for various reasons) only use the classic web app, so we then redirect them to the classic web app. So far, so good as well.

The thing is, currently, to detect that situation, the new web app is loaded into the user's browser, then thrown away, and the classic code is loaded instead, which can take up to 1-2 seconds.

The dev's suggestion was: "Let's move some of that code from the new web app to the OAuth2 server and let the auth server decide where to send the user."
Me: "Are you f**** crazy? Move app functionality to the OAuth2 server? How many times do I have to tell you that we ain't gonna add any app functionality to the OAuth2 server? It is a friggin' auth server; its sole job is to authenticate a user, nothing more, nothing less... No, we are not going to do that!"
Them: "But how are we going to solve this problem then?"
Me: "Umm, maybe by splitting that code in the web app into a smaller chunk, sending the user to that small chunk, which just checks the criteria and then loads either the bigger main chunk or the classic app...?"
Them: "Ooof, but then we have to restructure our code a lot."
Me: "Well, who told you that software engineering was easy? Yes, that's exactly what we have to do and that's exactly what we will do!"
Them: "..."
Me: "And btw, we will do a lot more restructuring after next week's big launch, because I can't accept chunk sizes of 5MB!"
Them: "... ooof ..."

(Note: I wasn't as rude as I write here - this is what I felt and wanted to say, but I said most of it in a nicer tone.)
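
The "smaller chunk" idea from that exchange boils down to a tiny decision step that runs first and only then pulls in the one big bundle that is actually needed. Here is a hedged Python sketch of just that logic (in reality this would live in the web app's front-end code, and the criterion name is made up):

```python
# Hedged sketch of the "small bootstrap chunk" idea: a tiny first step checks
# the user's situation and only then loads the bundle it needs, instead of
# loading the full new app and throwing it away for classic-only users.
# The flag name (supports_new_app) is hypothetical, not from the real code.

def pick_bundle(user_flags: dict[str, bool]) -> str:
    """Return which app bundle the small chunk should load next."""
    if user_flags.get("supports_new_app", False):
        return "new-webapp-main.js"   # the big main chunk of the new app
    return "classic-webapp.js"        # go straight to the classic app

print(pick_bundle({"supports_new_app": True}))  # new-webapp-main.js
print(pick_bundle({}))                          # classic-webapp.js
```

The win is that the browser downloads the small chunk plus exactly one bundle, rather than the full new app followed by the classic one.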

#WorkTopics

Developer writes a ticket: "If the user logs out in this, they are still logged out in the other. We need a way to log out in the other when the other redirects to this, maybe with a flag that the user shouldn't be logged out while redirecting to this."

Me: "Erm, that sounds wrong. And even if that is the case, the user can't access 'the other' anyway, so I'm deprioritizing this."

Dev: "But, but, but ... a user staying logged in after they are logged out? That's not important?"

Me: "Call!! Now!!"

Well, it turns out that the user is actually never logged out in 'this' at all, because 'the other' is the one that handles login/logout, and only 'the other' can do a logout - so the ticket is completely wrong. It should actually say: "Logout doesn't work correctly if the user only uses 'this' and doesn't return to 'the other'."

Ticket prio: CRITICAL! (Also: the solution we discussed is about 15 minutes of work - changing a few lines to redirect the user to the /logout route on 'the other' instead of /logout on 'this'.)
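
That fix could look roughly like this - a hedged sketch only, assuming the auth server's base URL is configured somewhere (AUTH_BASE is a made-up name, not the real setting):

```python
# Hedged sketch of the 15-minute fix: on logout, send the user to the
# /logout route on 'the other' (the auth server), not the local /logout,
# because only the auth server can actually end the session.
# AUTH_BASE is a hypothetical config value, not from the real codebase.

AUTH_BASE = "https://auth.example.com"

def logout_url(use_auth_server: bool) -> str:
    """Where to send the user when they click logout."""
    if use_auth_server:
        return AUTH_BASE + "/logout"  # fix: the session owner logs them out
    return "/logout"                  # old behavior: only 'this' "logs out"

print(logout_url(True))
```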

#WorkTopics