Adventures in PKI:

Ok so here is the story so far as a recap....
* The starting point was Crowdsec. Crowdsec has three components: agents, which parse logs/events; remediation engines, which act on decisions; and a local API (lapi), which the first two connect to and which tracks the decisions and pulls from public block lists
* I realized I could also get external hosts involved, and also wait Crowdsec can parse logs from an aggregator, in this case Loki
* Awesome, step one, get logs into Loki. This led to a whole chain of events that caused me to deploy Grafana/Alloy to collect those logs
* At this point I realized that shit, the remote nodes need auth and I'd need to copy around tokens everywhere
* Right, tokens everywhere, on remote nodes, etc. But wait, both Alloy and Crowdsec support mTLS, all I need is client certs

record scratch

* Right so this would be easy if it wasn't for the pesky external nodes
* This led me to setting up smallstep's step-ca with an ACME provider
* I got rsyslog sending logs to a central log server via mTLS! Even without the rest of this, the log collection is a win.
* (Aside, I also got ssh certs working)
* And I got the Traefik bouncer plus agent to lapi connections working over mTLS but there was a little bit of strangeness there
* Crowdsec's components do not understand cert lifespans, and will not reload certs when they're renewed. Hilarious. Fine, they get certs with a lifespan measured in "eh, I'll probably reboot a node before then"
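
A rough sketch of why the lifespan matters, not anything from Crowdsec itself: a process that loads its cert once at startup keeps presenting that same cert until it restarts, so the cert needs to outlive the process. The function names here are made up for illustration. The date format is the one Python's `ssl` module uses for `notAfter`.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days left before a cert expires; negative once it has expired.

    not_after is in the format that ssl.SSLSocket.getpeercert() reports,
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch
    return (expires - now.timestamp()) / 86400


def loaded_cert_is_expired(loaded_not_after: str, now: datetime) -> bool:
    # The cert held in memory is what counts, even if a renewed cert
    # is already sitting on disk waiting for a restart.
    return days_until_expiry(loaded_not_after, now) <= 0


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(loaded_cert_is_expired("Jan  5 09:34:43 2018 GMT", now))
```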

Ok and here we are, caught up to the present day. The very last part is getting the various non-cluster nodes connected so their ssh is covered by the block lists. I go to edit the config, and...

nothing

In the logs of the lapi there is a bad cert error. After some browsing of the issue tracker I see mention of an allowed OU setting. Huh. Yeah. The certs created by the helm chart have an OU setting.

Ok but can I ask for a specific OU via ACME?

Whelp.
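
For anyone following along, the kind of OU allow-list check described above can be pictured roughly like this. It is a sketch only, not Crowdsec's actual code. The OU value `agent-ou` and both function names are hypothetical, and it assumes the ACME-issued cert simply carries no OU in its subject. The dict shape is the one Python's `ssl.SSLSocket.getpeercert()` returns.

```python
def subject_ous(peercert: dict) -> list[str]:
    """Collect every OU from a cert dict shaped like getpeercert() output."""
    ous = []
    # "subject" is a tuple of RDNs, each RDN a tuple of (key, value) pairs.
    for rdn in peercert.get("subject", ()):
        for key, value in rdn:
            if key == "organizationalUnitName":
                ous.append(value)
    return ous


def ou_allowed(peercert: dict, allowed_ous: set[str]) -> bool:
    # A cert with no OU at all never matches, whatever the allow-list says.
    return any(ou in allowed_ous for ou in subject_ous(peercert))


if __name__ == "__main__":
    with_ou = {"subject": ((("organizationalUnitName", "agent-ou"),),
                           (("commonName", "node1"),))}
    without_ou = {"subject": ((("commonName", "node1"),),)}
    print(ou_allowed(with_ou, {"agent-ou"}))
    print(ou_allowed(without_ou, {"agent-ou"}))
```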



@homelab
#Homelab #Suffering #PKI #Grafana #Crowdsec
"Oh I know an easy solution to this, all I need to do is set up a basic PKI infra and the problem is solved!" - a statement made by someone who is about to find out
On the plus side I have learned sooooo much about PKI in general over the last few months
@rachel hello fellow "surely we can rest everything on smallstep and some Python scripts, right..? RIGHT..?" dark alchemist
@wilbr so far Crowdsec has been the biggest source of pain somehow. The face I made when I realized that none of their components understand cert validity and rotation.......
@wilbr I take that back, step-ca had a bizarre and painful helm chart, and its single config file randomly expects secrets embedded as strings

@rachel I have a dirty secret

I tried step-ca, used it to generate the chain, set everything to ten years, and then used python -> openssl instead 🙃

@wilbr I have step-ca and cert-manager in-cluster and sharing an intermediate

Cert-manager for in-cluster
ACME for other non-k8s hosts

I also got ssh host certs via ansible/sshpop, and user ssh certs via OIDC working
@rachel nice! Oh I remembered why I did all that. Because I wanted the medium to be MQTT, not HTTP, so I basically reimplemented AWS IoT's JIT provisioning
@wilbr aaah, yes I also replaced MQTT user/pass with TLS as part of this
@rachel "I had 3 problems I could solve with PKI infra so I deployed it. I now have 18 problems"

@rachel
I need to use my Vault-based PKI infra more. 😁

But this sounds like a nice setup? If I understand it right, the tokens you now replaced were for access to CrowdSec's lapi? And that's now (supposed to 😁 ) be replaced by mTLS certs instead?

Also very nice to see that CrowdSec can read logs directly from Loki. So it can just directly use Loki's API?

At some point I also want to set up CrowdSec. I really like the "community" idea of it.

@mmeier yup except for one final remaining part

But as far as in-cluster goes it is working well so far

A Crowdsec agent could be running on each node and collecting Traefik logs directly, but I didn't do that because I didn't want a second daemonset listening to logs via hostpath when I wanted log aggregation anyway.

Logs are collected just once, then a single Crowdsec agent reads from the central point
@mmeier speaking of vault and secrets ..

Last weekend I went and cataloged every non automated secret in the cluster (so ignoring cert-manager created certs and cloudnative-pg passwords and so on)

Then I sorted them into three categories:
* External: these are things like SMTP passwords, or access tokens for cloud hosted git repos and other APIs for things like DNS record updates. There is basically nothing I can do about these, but they're also tightly scoped and can easily be regenerated
* Local: pretty much the same as above, but targeting something on-prem. Very few of these left but they're things like git repo/container registry tokens for forgejo
* Self: these are used by the same namespace that they're hosted in. I want to eliminate a few of these, like a handful of redis passwords, but others are data that simply needs to remain static, like Django secret keys. There are a handful of keys here that are critical components that can't simply be regenerated, like password salts and matrix instance signing keys.

Any secret which contained multiple types was broken up, and then they were all renamed so that they ended in
-external-secret/-cluster-secret/-local-secret
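
A naming convention like this is easy to lint. A minimal sketch, assuming you already have the secret names as a list of strings; the example names and function names are made up, and how the names get pulled out of the cluster is left out:

```python
# The three category suffixes from the renaming scheme.
SUFFIXES = ("-external-secret", "-cluster-secret", "-local-secret")


def has_category_suffix(name: str) -> bool:
    # str.endswith accepts a tuple and matches any one of its entries.
    return name.endswith(SUFFIXES)


def uncategorized(names: list[str]) -> list[str]:
    """Return the secret names that do not follow the suffix convention."""
    return [name for name in names if not has_category_suffix(name)]


if __name__ == "__main__":
    # Hypothetical secret names, for illustration only.
    names = ["smtp-external-secret", "forgejo-registry-local-secret", "redis-password"]
    print(uncategorized(names))
```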
@rachel Yeah, especially that last category is a bit annoying, the stuff that needs to be kept around because it isn't replaceable, but also can't be auto-generated. What are you using for these at the moment?
@mmeier any project with a secret like that, I write a note in the README detailing how it was generated, and then save it in pass along with keepass