pgrep wget
tail -f /proc/<pid>/fd/4  # (or a higher-numbered descriptor)
So, here's how the web scrape is going...
I'm estimating completion by Wednesday.
I'm 170 base sites in. Each one points to about 30 others.
This is with a depth of 1.
Every 10.0s: echo number of sites archived ; ls /var/lib/topgen/vhosts/ | wc -l ; echo; echo current; ps -ef | grep wget | grep -v grep | awk '{ print $NF }'; ps -ef | grep wget | grep -v grep | awk '{ print $NF }' >> wfile ; sort... 1ecf4ac17ca0: Mon Aug 11 00:32:47 2025
number of sites archived
4815
current
lefigaro.fr
free
403G
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp ESTAB 0 0 192.18.0.254:51466 23.193.174.82:https
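For anyone wanting a similar status view, here's a minimal sketch of the first two checks visible in the watch header above: counting archived vhosts and listing what each running wget is fetching. The function names are made up for illustration, and the default path is simply the one shown in that header.

```shell
# Hypothetical helpers; the names are illustrative and not part of topgen.

# Count entries in the vhosts directory (default path taken from the watch header).
count_archived() {
    local dir="${1:-/var/lib/topgen/vhosts}"
    ls -1 "$dir" | wc -l | tr -d ' '
}

# Print the last argument of every running wget
# (assumed here to be the site currently being fetched).
current_targets() {
    ps -eo args= | awk '$1 ~ /(^|\/)wget$/ { print $NF }'
}
```

Put calls to both at the end of a script and run that script under `watch -n 10` to get a refresh every ten seconds.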
So, I'm currently unemployed and looking for a new job. Fully remote or in the Central Finland area. I do mostly C#, but can kinda do front-end stuff too (it's not pretty though). Over 20 years of experience ranging from logistics to finance and healthcare. I'm not picky, but I don't want to touch finance, AI or crypto.
linkedin: https://fi.linkedin.com/in/tuulaturto
codeberg:
https://codeberg.org/tuturto/
boost ok
So what am I doing on my birthday weekend?
Herding a very large webscrape.
It got stuck on the live video at cnn.com.
Killing the wget broke the scraper process.
Had to hack out a bunch of the sanity checks in topgen-scrape.sh to get it running again.
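One way to keep a never-ending download (like a live video stream) from wedging the whole run is to put a wall-clock limit on each wget and treat a timeout as a skip. This is only a sketch, not how topgen-scrape.sh actually calls wget: `run_bounded` and the limits in the usage comment are made up, and it assumes GNU coreutils `timeout`.

```shell
# Hypothetical wrapper: run a command with a wall-clock limit.
# A timeout is logged and treated as "skip this one", not as a failure,
# so a calling script with sanity checks can carry on.
run_bounded() {
    local limit="$1"
    shift
    timeout "$limit" "$@"
    local status=$?
    if [ "$status" -eq 124 ]; then   # GNU timeout exits 124 when the limit is hit
        echo "timed out after ${limit}s: $*" >&2
        return 0
    fi
    return "$status"
}

# Usage (illustrative values; $site is a placeholder):
#   run_bounded 900 wget --recursive --level=1 --timeout=30 --tries=2 "https://$site/"
```

Any other non-zero exit status is passed through unchanged, so real failures still show up.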
I'm currently building a little cyber lab setup. For fake Internet I'm using topgen.
Doing topgen-scrape to pull down websites for the fake Internet. wget with depth 1.
It's on line 50 of 300ish.
It's pulled down 1500 sites, because each site in the top level list links to ~30 other sites.
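As a back-of-the-envelope check, assuming the rest of the list fans out at about the same rate (a guess, not something measured), the final count can be extrapolated from the numbers above. `estimate_total` is a made-up helper name.

```shell
# Hypothetical helper: extrapolate the final site count proportionally.
# usage: estimate_total <sites_so_far> <lines_done> <lines_total>
estimate_total() {
    echo $(( $1 * $3 / $2 ))
}

estimate_total 1500 50 300   # prints 9000
```

So roughly 9000 sites in total if the ~30-links-per-site ratio holds.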
It doesn't have to be like this.
Find someone who understands you and will put up with you.
I just reviewed the last couple of text messages from my wife.
1) headed home. [Telling me to put pants on so we can go out later]
2) write calc.exe exploit [A reminder of something I need to do next week at work]
$gods! I just want to know the characteristics of a given plant variety. After dredging past the AI plop, the advertisements consisting mainly of AI plop, etc., I finally find an article by someone who's grown the freaking plant. Except it's AI plop, with links to TikTok.
We've broken the Internet. But not in the good way.
A weird thing about #Slack on #Linux #X11: whenever it gains keyboard focus, it immediately reads the X11 clipboard, even if you didn't try to paste anything. I see no evidence that it _does_ anything with what it reads, but I know it reads it.
I found this out by accident, because I wrote a stunt X11 client which owns the clipboard for just long enough to paste _once_, and then terminates. The idea was to queue up three different pastes on the command line, and paste them in quick succession into fields of a form. Works very well, _unless_ I accidentally mouse over Slack on the way to the form I want to paste into – then my focus-follows-mouse activates it, and it consumes one of my clipboard strings!
I have no idea why, or whether it's on purpose. I'm prepared to give it the benefit of the doubt, by assuming until further evidence that it's some unforeseen emergent consequence of the huge wobbly tower of libraries and wrappers and browsers that the desktop Slack app is built on top of. But it's not great. Some password managers will put passwords in the clipboard!