McLaren Stanley, now a Principal Engineer at Amazon and a former software engineer at Uber, took to Twitter and shared what reads almost like a detective story about rewriting the Uber app from scratch.
Enjoy the read!
Alright folks, gather round and let me tell you the story of (almost) the biggest engineering disaster I’ve ever had the misfortune of being involved in. It’s a tale of politics, architecture, and the sunk cost fallacy [I’m drinking an Aberlour Cask Strength Single Malt Scotch].
The year was 2016. Donald Trump was not yet President, so the #DeleteUber movement hadn’t happened yet. Travis Kalanick was still the CEO, we were still in the hyper-growth phase of international rollout, public sentiment was overwhelmingly positive, and Uber was riding high.
But hyper-growth is not without its problems, and the app itself was starting to show some cracks. The engineering organization had doubled in size almost every year prior, and when you grow that fast you end up with an incredibly wide range of skills. That, paired with a hacking mentality that we called “Let builders build”, meant that the app architecture was complicated and fragile. Uber at the time was extremely heavy on client-side logic, so the app would break a lot. We were constantly doing hotfixes, burning releases, etc. The design was also scaling badly.
As a consequence of all these problems, there began to be a growing movement across all levels of the organization that was rallying around the idea of “rewriting the app from scratch”. The general sentiment was that the architecture was slowing us down and that starting over would be faster.
So a team was formed to build a new mobile architecture for this new app. The driving charter for the team was to build an architecture that would “sustain mobile development at Uber for the next 5 years”. We did both platforms at once. Product and Design also started over.
On the iOS side of the world, the rewrite presented an opportunity to adopt Swift (which was in version 2.x during this timespan). Uber had tried Swift before, but as for many who had adopted it that early on, it had been extremely problematic, so it had been banned prior to the rewrite.
But the general feeling of the architecture team was that most of Swift’s problems centered around the flakiness of the Objective-C interop back then, so if we wrote a pure Swift app we could avoid the major issues.
There was also a push to use the same major architectural patterns on both Android and iOS. The Android folks at the time were big RxJava fans, and there was an equivalent RxSwift library that took advantage of the functional programming paradigms in Swift. Seemed straightforward.
So this smaller core team of Design, Product, and Architecture went off into a room for a few months with their new functional/reactive patterns, new language, and new app. Everything went well. The architecture relied heavily on the advanced language features of Swift.
The UI design was scalable for the growing number of products that Uber offered, the functional programming paradigm was powerful (albeit with a bit of a learning curve), and the architecture centered around our new realtime, stream-based networking protocol (that’s the part I wrote).
A few months and a number of flashy demos later, the momentum was building. The project was looking like a success. They had built amazing experiences in a short time with a small number of engineers. Most of the core product was built out. The execs were sold.
So the company-wide rollout began. Teams began shifting all their focus to bringing their features to the new app. At first the excitement of the new created a flurry of motivation and productivity. The architecture was built for feature isolation which allowed teams to move fast.
But once Swift started to scale past ten engineers, the wheels started coming off. The Swift compiler is still much slower than Objective-C even now, but back then it was practically unusable. Build times went through the roof. Typeahead/debugging stopped working entirely.
There’s a video somewhere in one of our talks of an Uber engineer typing a single-line statement in Xcode and then waiting 45 seconds for the letters to appear in the editor slowly, one by one.
Then we hit a wall with the dynamic linker. At the time you could only link Swift libraries dynamically. Unfortunately, the linker executed in polynomial time, so Apple’s recommended maximum number of libraries in a single binary was 6. We had 92 and counting.
As a result, it took 8-12 seconds after tapping the app icon before main was even called. Our shiny new app was slower than the old clunky one. Then the binary size problem hit.
When the problems started showing up in earnest, we were already way past the point of no return (sunk cost fallacy). At this point the whole company was pouring its energy into the new app.
Thousands of people across every discipline, millions and millions (I can’t tell you the real number but it was way more than 1) of dollars had been spent. The whole management chain was fully bought in. I had privately had the “we need to stop” conversation with my director.
He told me that if this project fails he might as well pack his bags. The same was true for his boss all the way up to the VP. There was no way out.
So we rolled up our sleeves, put our best people on each of the problems, and prioritized the launch-critical issues (dynamic linking, binary size). I was assigned to both dynamic linking and binary size, in that order.
We quickly discovered that putting all of our code in the main executable solved the linking problem at app startup. But as we all know, Swift conflates namespacing with frameworks, so doing that would take a huge code change involving countless namespace checks.
That’s when the brilliant Richard Howell (not sure if he’s on Twitter) discovered while reading the Xcode build output that he could take all the intermediate object files and re-link them back into the main executable with a custom script after the build was complete.
Since Swift mangles the object namespace into the symbol name itself at compile time, this meant that he could safely preserve the namespacing while doing this. This allowed us to effectively statically link our libraries and cut our pre-main time from roughly 10 seconds to basically 0.
Next problem: App Size. At the time we were planning to include the new app in the old app bundle and slowly roll it out at runtime as a safety net. First thing we did to buy space was to just remove the old app. We called this release strategy “Yolo”. TK himself made the call.
We also replaced *all* of our Swift structs with classes. Value types in general have a ton of overhead due to object flattening and the extra machine code needed for the copy behavior and auto-initializers etc. This saved us space so we pressed on.
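For readers less familiar with Swift, the struct/class distinction mentioned above can be illustrated with a minimal sketch. The type names here are hypothetical and this is not Uber's code. It only shows the semantic difference (copy on assignment and a compiler-synthesized initializer for structs, shared references and a hand-written initializer for classes), not the binary-size effect itself.

```swift
// Hypothetical model types for illustration only; not Uber's actual code.

// Value type: every assignment yields an independent copy, and the
// compiler synthesizes the memberwise initializer.
struct RiderValue {
    var name: String
    var rating: Double
}

// Reference type: assignments share a single instance, and the
// initializer has to be written by hand.
final class RiderReference {
    var name: String
    var rating: Double

    init(name: String, rating: Double) {
        self.name = name
        self.rating = rating
    }
}

let originalValue = RiderValue(name: "Ana", rating: 4.9)
var copiedValue = originalValue      // independent copy
copiedValue.rating = 4.0             // originalValue is unaffected

let originalReference = RiderReference(name: "Ana", rating: 4.9)
let sharedReference = originalReference   // same underlying object
sharedReference.rating = 4.0              // visible through both names
```

The copy and initializer behavior that structs get for free is what the compiler has to generate code for, which is the overhead the thread refers to.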
But the app kept growing. Soon we hit the cellular download limit (100 MB) for our universal binaries (iOS 8 and earlier). This represented a substantial amount of lost signups (in dollars, it would cost us on the order of 8 figures from people who hadn’t upgraded yet).
At this point we were weeks away from the public launch date. We had graciously received help from a certain company that I’m still under NDA with, but they couldn’t solve our problem. The only thing we could do was regenerate all the model code (25% of the total line count) back into Objective-C or drop support for iOS 8. Since iOS 9 had introduced individual architecture slicing, the download was effectively half the size (give or take). With only a week left, we decided to eat the 8 figures and drop support for iOS 8.
The general thinking was that at half the size we still had plenty of runway with the iOS 9 binary, and after the rewrite was done we could solve the problem sometime way down the road, because things would slow down a bit. We were unfortunately completely wrong about that.
After the app release, we threw a huge party. The app was well-received by the press. It was fast and snappy, with a flashy new design. A bunch of people got promoted. We all breathed a sigh of relief. The 90-hour work weeks stopped for a few weeks.
Stay tuned with Software Focus!