Mastodawn

themachinestops Oct 27

A single point of failure triggered the Amazon outage affecting millions

https://lemmy.dbzer0.com/post/56397131

A single point of failure triggered the Amazon outage affecting millions - Divisions by zero

We need more cloud services.

Show thread

DudeImMacGyver

We need to ditch cloud entirety and go in house again.

Show thread

UnderpantsWeevil Oct 27

The inverse of the old axiom “The cloud is just someone else’s computer” is “Yes, duh, that’s how you get economies of scale”.

In-housing would mean an enormous increase in demand for physical hardware and IT technical services with a large variance in quality and accessibility. Like, it doesn’t fix the problem. It just takes one big problem and shatters it into a thousand little problems.

Show thread

dubyakay Oct 28

That’s good though. It means half the internet wouldn’t fail.

Show thread

UnderpantsWeevil Oct 28

I think some of you younger folks really don’t know what the Internet was like 20 years ago.Shit was up and down all the time.

I worked on a project back in 2008 where I had to physically haul hardware from Houston to Dallas just to keep a second rate version of a website running until we got power back at the original office. Latency at the new location was so bad that we were scrambling to reinvent the website in real time to try and improve performance. We ended up losing the client. They ended up going bankrupt. An absolute nightmare.

Getting screamed at by clients. Working 14 hour days in a cramped server room on something way outside my scope.

Would have absolutely killed for something as clean and reliable as AWS. Not like it didn’t even exist back then. But we self-hosted because it was cheaper.

Show thread

rumba Oct 27

I certainly don’t miss dealing with air conditioning, dry fire protection, and redundant internet connections.

I also don’t miss trying to deal with aging servers out and bringing new hardware in.

Show thread

dubyakay Oct 28

That work is still being done by someone in a data centre. But all these jobs went from in-house positions to the centres.

Show thread

partial_accumen Oct 28

That work is still being done by someone in a data centre. But all these jobs went from in-house positions to the centres.

The difference is scale. When in-house, the person responsible for managing the glycol loop is also responsible for the other CRACs, possibly the power rails, and likely the fire suppression. In a giant provider, each one of those is its own team with dozens or hundreds of people that specialize in only their area. They can spend 100% on their one area of responsibilty instead of having to wear multiple hats. The small the company, the more hats people have to wear, and the worse to overall result is because of being spread to thin.

Show thread

partial_accumen Oct 28

We need to ditch cloud entirety and go in house again.

For many many companies that would be returning to the bad-old-days.

I don’t miss getting an emergency page during the Thanksgiving meal because there’s excessive temperature being reported in the in-house datacenter. Going into the office and finding the CRAC failed and its now 105 degree F. And you knew the CRAC preventive maintenance was overdue and management wouldn’t approve the cost to get it serviced even though you’ve been asking for it for more than 6 months. You also know with this high temp event, you’re going to have an increased rate of hard drive failures over the next year.

No thank you.

Show thread

corsicanguppy Oct 28

There’s a huge gulf between pub clowd and shitty on-prem. My daytime contract is with an organization almost completely on-prem for privacy, although on-prem to them means priv-cloud. Space has been rented. Redundant everything piped in. Redundant everything set up. We run VMs by terraform. Wheeeeee

Point is, posing shitty on-prem as the alternative to the clowd is moving the goalposts a bit.

Show thread

partial_accumen Oct 28

There’s a huge gulf between pub clowd and shitty on-prem.

We agree on this.

Redundant everything piped in. Redundant everything set up. We run VMs by terraform. Wheeeeee

For that customer of yours, is that a single datacenter or does is represent multiple datacenters separated by a large distance across a nation, or perhaps even across national borders?

Point is, posing shitty on-prem as the alternative to the clowd is moving the goalposts a bit.

I think ignoring that shitty on-prem represented a large part of IT infrastructure prior cloud providers is ignoring a critical point. Was it possible to have well-run enterprise IT data centers prior to cloud? Sure. Was everyone doing that? Absolutely not, I’d argue the majority had at least a certain level of jank in their infra and that that floor is raised with cloud providers. Just the basic facilities is enterprise grade irrespective of the server or app config.

Show thread

nova_ad_vitum Oct 28

We don’t have to. It is entirely possible to engineer applications and services in a way that they’re not dependent on any one cloud service, while also using cloud services for IaaS. Netflix famously does this, and sure enough Netflix experience no service interruptions during this latest outage despite having a large AWS presence.

Show thread

DudeImMacGyver Oct 28

If we want a truly robust system, yeah, we kinda do. This sort of event is only one of the issues with allowing a single entity to control pretty much everything.

There are plenty of potential issues from a corrupt rogue corporation hijacking everything to attacks to internal fuck-ups like we just experienced. Sure, they can design a better cloud, but at the end of the day, it's still their cloud. The Internet needs to be less centralized, not more (and I don't just mean that purely in terms of infrastructure, though that is included of course).

Show thread

nova_ad_vitum Oct 29

If we want a truly robust system, yeah, we kinda do. This sort of event is only one of the issues with allowing a single entity to control pretty much everything.

What I’m advocating for is the opposite of “allowing one entity to control everything”.

en.wikipedia.org/wiki/Chaos_engineering#Chaos_Mon…

Read about it dude. Netflix has a large presence in all major cloud providers (and they have their own data centers), but has a service whose uptime is NOT dependent on any one of those hosting environments. The proof is the pudding - Netflix service did not go down in the recent AWS outage, nor in the last one.

Chaos engineering - Wikipedia

Show thread

DudeImMacGyver Oct 29

Yes, Netflix had their own infrastucture in addition to other multiple redundant cloud services for their CDNs: You're kind of proving my point?

Show thread

nova_ad_vitum Oct 30

You’re kind of proving (part of) my point?

How? Their reliability would exist without that. There’s nothing inherent to their own data center that makes their setup that much better. Having a distributed system across multiple cloud service providers means your actual chance of downtime (here I mean inverse of uptime) is their individual chances of uptime multiplied by each other. In other words, they all have to go down for your service to fail. The catch is you have to use only commodity IaaS and PaaS, nothing proprietary to one CSP.

For smaller companies especially, in terms of pure reliability, there’s no reason to think that they would be better at running a high availability data center than Microsoft or AWS or Google.

Parallel distributed architectures give you the advantages of using public cloud (not having to physically manage your own data center) without the disadvantages (dependence on any one cloud vendor), while also potentially increasing your reliability beyond the reliability of any one of your cloud vendors .

Show thread

DudeImMacGyver Oct 30

You really don't see the risk of having no data centers you actually control as an organization? Maybe I misunderstood your initial statement? At first it sounded like we kind of agreed with each other but didn't understand that was the case at first.

Show thread

nova_ad_vitum Oct 30

You really don’t see the risk of having no data centers you actually control as an organization?

This really depends on what you think you’re getting from having your own DC. Is it reliability? Flexibility? Control? What are you objectives?

There’s some argument to be made to have some locally hosted stuff for some flexibility and control. And in some niche cases the pricing of public offerings doesn’t make sense.

But as I said, if you’re building your own data center for increased reliability then 1) you’re necessarily assuming the premise that you’re going to be better at managing DCs than Google, Microsoft and AWS which I think in reality would be hard to prove let alone do, and 2) is hard to justify considering you can distribute workloads across multiple data centers already (as proven by the Netflix example) so that your reliability isn’t limited by any one vendor.

Show thread

blackn1ght Oct 28

Bit of an over-reaction to one incident. I’d be willing to bet the uptime, reliability and scalability of AWS is significantly better than what the vast majority of in-house solutions could do. It’s absolutely not worth going back.

Millions of customers using AWS also weren’t affected - the company I work for certainly wasn’t, although some of our tools like Jira were.

Show thread

DudeImMacGyver Oct 28

The problem is far more pervasive than any single incident, allowing a single megacorporation to control most of the Internet is a bad idea for many reasons.

Show thread

blackn1ght Oct 28

Agreed, but other cloud providers exist and it would be good if there was stronger competition in this space. But going back to self hosting is a huge step back and I think if a CTO said they were going to move from the cloud back to a self hosted solution, pretty much everyone would hate it.

Show thread

DudeImMacGyver Oct 28

There are still self hosted places today, not everything is cloud based.

Also, there isn't more competition largely because of Amazon so, while I agree with the sentiment that it could improve things, in practice it's a moot point.