Wednesday, February 28, 2024

As Twitter Goes Down Yet Again, Report Highlights How Fragile Its Infrastructure Has Become

from the tougher-than-rocket-science,-apparently dept

On Wednesday there was yet another major global outage at Twitter, something that feels like it’s becoming a recurring issue and bringing us back to the days when Twitter regularly crashed and had to put up a “Fail Whale” graphic.

In response, Twitter spent a few years hiring some fantastic engineers and building up a strong core competency in keeping the site highly reliable, even during periods of heavy traffic and rapid updates. A site like Twitter is more difficult to manage than many other sites, because it’s customized for each and every viewer, and has a real-time aspect built into it as well. That combination is tough to do well, and Twitter built up a team of engineers who made it work.

And Elon Musk fired basically all of them.

While it’s been somewhat clear, anecdotally, that the site has really struggled to keep running, NetBlocks, as reported in the NY Times, now confirms that it’s not your imagination. Twitter is failing much more often:

In February alone, Twitter experienced at least four widespread outages, compared with nine in all of 2022, according to NetBlocks, an organization that tracks internet outages. That suggests the frequency of service failures is on the rise, NetBlocks said. And bugs that have made Twitter less usable — by preventing people from posting tweets, for instance — have been more noticeable, researchers and users said.

Twitter’s reliability has deteriorated as Mr. Musk has repeatedly slashed the company’s work force. After another round of layoffs on Saturday, Twitter has fewer than 2,000 employees, down from 7,500 when Mr. Musk took over in October. The latest cuts affected dozens of engineers responsible for keeping the site online, three current and former employees said.

Yeah, four in one month, when it was nine in all of last year, or fewer than one a month on average (and that count included at least some from after Musk began his somewhat chaotic style of ownership of the company). And, yes, much of this is because of Musk’s decisions to get rid of basically anyone who knew anything. A former Twitter employee mentioned to me soon after Musk took over the company that, whether it was good or bad (and I believe this person was suggesting it was bad…), Twitter had a small number of “load bearing” employees. And nearly all of them, if not all of them, are gone.

Mr. Musk has ended operations at one of Twitter’s three main data centers, further slashed the teams that work on the company’s back-end technology such as servers and cloud storage, and gotten rid of leaders overseeing that area.

The moves have exacerbated fears that there are not enough people or institutional knowledge to triage Twitter’s problems, especially if the service one day encounters a problem its remaining workers do not know how to fix, two people with knowledge of the company’s internal operations said.

In the past, Twitter prevented breakages from escalating by having people around to diagnose and solve problems immediately. Now the platform is likely to be plagued by more glitches as workers take longer to pinpoint issues, the people said.

“It used to be that you’d see smaller things fail, but now Twitter is going down completely for certain regions of the world,” said Saagar Jha, a Twitter engineer who left in May. “When serious things break, the people who knew the systems aren’t there anymore.”

And even when things do go down, the lack of institutional knowledge makes it that much harder to figure out what went wrong, leading to much slower response times to fix the problems:

Employee errors led to other outages. In early February, a Twitter worker deleted data from an internal service meant to prevent spam, leading to a glitch that left many people unable to tweet or to message one another, according to three people familiar with the incident.

Twitter’s engineers took several hours to diagnose the problem and restore the data stored with a backup. In that time, users received error messages that said they could not tweet because they had already posted too much. The Platformer newsletter earlier reported the cause of the problem.

A week later, an engineer testing a change to people’s Twitter profiles on Apple mobile devices caused another temporary outage. The engineer disregarded a past practice of testing new features on small subsets of users and simply rolled out the change — a tweak for Spaces, Twitter’s live audio service — to a wide swath of users, two people familiar with the move said.

“Welp, I just accidentally took down Twitter,” Leah Culver, the engineer, later tweeted. The app eventually came back online after the change was reversed, she said. Ms. Culver did not respond to a request for comment.

While it’s not mentioned in the NY Times article, TechCrunch reported a few days ago that Leah Culver was one of those laid off over the weekend.
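The “past practice” described in the Times excerpt, testing new features on small subsets of users, is commonly implemented as a percentage-based rollout. Here is a minimal sketch of that general idea in Python. To be clear, this is not Twitter’s actual system: the function names, the feature name, and the hashing scheme are all assumptions made up for illustration.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given feature.

    Hypothetical helper: hashing the feature name together with the user ID
    means the same user always lands in the same bucket for that feature.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, feature: str, percent: int) -> bool:
    """Return True if the feature is switched on for this user.

    `percent` is the share of users (0 to 100) who should see the change.
    """
    return rollout_bucket(user_id, feature) < percent


# Illustrative usage: expose a made-up feature to roughly 5% of users first.
users = [f"user-{i}" for i in range(10_000)]
exposed = [u for u in users if is_enabled(u, "spaces-profile-tweak", 5)]
print(f"{len(exposed)} of {len(users)} users see the change")
```

Because each user’s bucket is fixed, raising the percentage only adds users and never reshuffles the ones already exposed, and setting it back to 0 turns the change off for everyone. That is what lets a bad change get caught while it is affecting a small slice of users instead of “a wide swath” of them.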

And, while it does appear that the last engineers standing are doing their best, it’s apparently been quite a mess internally as well:

The constant loss of workers has only added to the sense of instability, two current and former employees said. Some junior employees are overseeing products or services they had never touched before, they said, and there is no clear leadership. The company has been without a permanent head of global infrastructure since last year when Mr. Musk fired Nelson Abramson, who held that job. Mr. Musk brought on a temporary replacement, a Tesla engineer named Sheen Austin, who resigned in January.

Fixing technical challenges has also become more difficult because of changes to internal systems and communication. Last week, employees lost access to the workplace chat platform Slack, leaving them without their main mode of communicating with colleagues or the ability to see a record of how workers previously fixed problems with Twitter, three current and former employees said.

On Monday, the company brought Slack back. But it archived thousands of old Slack channels that workers had used to communicate, according to an internal email seen by The Times.

The decision to shut down Slack seems, again, to be an example of Musk shooting himself in the foot over his own vanity and ego. Twitter employees have long relied on Slack as a communications tool, and part of that is that it became a huge and extremely important repository of institutional knowledge — the exact kind of knowledge that would be helpful at a moment like this when many engineers have walked out the door.

While there were some rumors that Slack got shut down because Elon wouldn’t pay the bill, Platformer reported that while that’s true (Musk isn’t paying the bill), it’s not why Slack got shut down. Instead, it sounds like Musk got annoyed that employees were using Slack to gripe about everything going on under his leadership. So in order to keep them quiet, he basically destroyed the last store of useful internal knowledge:

“After everyone was gone, I had no one to ask questions when stuck,” an employee who stayed on past the first round of layoffs wrote in Blind. “I used to search for the error [messages] on Slack and got help 99 percent of the time.”

Websites don’t just fall over. The early predictions some (not us!) made that Twitter would just shut down completely never made much sense. But all of the evidence suggests that things are a huge mess, and anyone relying on the website is asking for trouble.

It’s still possible that Musk and his new team can somehow turn this around and get the site working again. Musk himself keeps making pronouncements about how the site is working better than ever (which lack any evidence whatsoever). But the early returns should raise serious questions.

Companies: twitter