Stack Overflow: A Technical Deconstruction

One of the reasons I love working at Stack Overflow is we’re encouraged to talk about almost anything out in the open. Except for things companies always keep private like financials and the nuclear launch codes, everything else is fair game. That’s an awesome thing that we haven’t taken advantage of on the technical side lately. I think it’s time for an experiment in extreme openness.

By sharing what we do (and I mean all of us), we better our world. Everyone who works at Stack shares at least one passion: improving life for all developers. Sharing how we do things is one of the best and biggest ways we can do that. It helps you. It helps me. It helps all of us.

When I tell you how we do <something>, a few things happen:

  • You might learn something cool you didn’t know about.
  • We might learn we’re doing it wrong.
  • We’ll both find a better way, together…and we share that too.
  • It helps eliminate the perception that “the big boys” always do it right. No, we screw up too.

There’s nothing to lose here and there’s no reason to keep things to yourself unless you’re afraid of being wrong. Good news: that’s not a problem. We get it wrong all the time anyway, so I’m not really worried about that one. Failure is always an option. The best any of us can do is live, learn, move on, and do it better next time.

Here’s where I need your help

I need you to tell me: what do you want to hear about? My intention is to get to a great many things, but it will take some time. What are people most interested in? How do I decide which topic to blog about next? The answer: I don’t know and I can’t decide. That’s where you come in. Please, tell me.

I put together this Trello board: Blog post queue for Stack Overflow topics

I’m also embedding it here for ease; hopefully this adds a lot of concreteness to the adventure:

It’s public. You can comment and vote on topics as well as suggest new topics, either on the board itself or by shooting me a tweet: @Nick_Craver. Please, help me out by simply voting for what you want to know so I can prioritize the queue. If you see a topic and have specific questions, please comment on the card so I make sure to answer them in the post.

The first post won’t be vote-driven. I think it has to be the architecture overview so all future references make sense. After that, I’ll go down the board and blog the highest-voted topic each time.

I’ve missed blogging due to spending my nights entirely in open source lately. I don’t believe that’s necessarily the best or only way for me to help developers. Having votes for topics gives me real motivation to dedicate the time to writing them up, pulling the stats, and making the pretty pictures. For that, I thank everyone participating.

If you’re curious about my writing style and what to expect, check out some of my previous posts:

Am I crazy? Yep, probably - that’s already a lot of topics. But I think it’ll be fun and engaging. Let’s go on this adventure together.

Why you should wait on upgrading to .Net 4.6

Update (August 11th): A patch for this bug has been released by Microsoft. Here’s their update to the advisory:

We released an updated version of RyuJIT today, which resolves this advisory. The update was released as Microsoft Security Bulletin MS15-092 and is available on Windows Update or via direct download as KB3086251. The update resolves: CoreCLR #1296, CoreCLR #1299, and VisualFSharp #536. Major thanks to the developers who reported these issues. Thanks to everyone for their patience.

Original Post

What follows is the work of several people: Marc Gravell and I have taken the lead on this at Stack Overflow and we continue to coordinate with Microsoft on a resolution. They have fixed the bug internally, but not for users. Given the severity, we can’t in good conscience let such a subtle yet high-impact bug linger silently. We are not upgrading Stack Overflow to .Net 4.6, and you shouldn’t upgrade yet either. You can find the issue we opened on GitHub (for public awareness) here. A fix has been released; see Update #5 below.

Update #1 (July 27th): A pull request has been posted by Matt Mitchell (Microsoft).

Update #2 (July 28th): There are several smaller repros now (including a small console app). Microsoft has confirmed they are working on an expedited hotfix release but we don’t have details yet.

Update #3 (July 28th): Microsoft’s Rich Lander has posted an update: RyuJIT Bug Advisory in the .NET Framework 4.6.

Update #4 (July 29th): There’s another subtle bug found by Andrey Akinshin and the F# Engine Exception is confirmed to be a separate issue. I still recommend disabling RyuJIT in production given the increasing bug count.

Update #5 (August 11th): A patch for this bug has been released by Microsoft, see above.

This critical bug is specific to .Net 4.6 and RyuJIT (64-bit). I’ll make this big and bold so we get to the point quickly:

The methods you call can get different parameter values than you passed in.

The JIT (Just-in-Time compiler) in .Net (and many platforms) does something called Tail Call optimization. This happens to alleviate stack load on the last-called method in a chain. I won’t go into what a tail call is because there’s already an excellent write up by David Broman.

The issue here is a bug in how RyuJIT x64 implements this optimization in certain situations.

Optimization Considerations: Rebuilding this site

This week I took a few days and re-built my blog. It was previously a WordPress instance on a small host with CloudFlare in front. It is now statically generated, open source, managed via git, hosted on GitHub Pages, and still proxied through CloudFlare. This post is my attempt to explain my reasoning and process in optimizing it along the way.

Why?

I couldn’t do what I wanted with WordPress, at least…not without a fight. I want to do several posts with interactive elements such as charts, maps, simulations, directly included CSV data, etc. to compare a lot of numbers I’ll be throwing out about some major infrastructure changes we’re making over at Stack Exchange. Here’s a quick example of many things I want to do in future posts. That was my motivation. When I looked into what I needed to change to support these things (besides even the basic editor fighting me along the way), I also took a long look at how the blog was performing. It was heavy…very heavy, as most WordPress installs tend to be. As a result, it was slow. Here’s what my blog looked like before the do-over:

How we upgrade a live data center

I blogged about how we upgrade a live data center over on the Server Fault blog, which you may find interesting. It details how we planned and executed a large hardware refresh in the Stack Exchange New Jersey data center complete with hundreds of photos along the way. I love our hardware upgrade trips and will try my best to share future ones the same way. Expect a similar post on building the Denver data center around June this year.

If reading isn’t your thing and you want to go directly to the server porn, never fear. The album of the move (fully commented) is here: http://imgur.com/a/X1HoY

What it takes to run Stack Overflow

I like to think of Stack Overflow as running with scale but not at scale. By that I mean we run very efficiently, but I still don’t think of us as “big”, not yet. Let’s throw out some numbers so you can get an idea of what scale we are at currently. Here are some quick numbers from a 24-hour window a few days ago - November 12th, 2013 to be exact. These numbers are from a typical weekday and only include our active data center - what we host. Things like hits/bandwidth to our CDN are not included, since they don’t hit our network.

  • 148,084,883 HTTP requests to our load balancer
  • 36,095,312 of those were page loads
  • 833,992,982,627 bytes (776 GB) of HTTP traffic sent
  • 286,574,644,032 bytes (267 GB) total received
  • 1,125,992,557,312 bytes (1,048 GB) total sent
  • 334,572,103 SQL Queries (from HTTP requests alone)
  • 412,865,051 Redis hits
  • 3,603,418 Tag Engine requests
  • 558,224,585 ms (155 hours) spent running SQL queries
  • 99,346,916 ms (27 hours) spent on redis hits
  • 132,384,059 ms (36 hours) spent on Tag Engine requests
  • 2,728,177,045 ms (757 hours) spent processing in ASP.Net
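
To put the raw totals above in perspective, here is a rough back-of-the-envelope sketch in Python. It only uses figures from the list; the derived averages (requests per second, queries per page load, time per query) are my own arithmetic and not numbers we measured directly. The names are made up for illustration. Two assumptions to keep in mind: the GB figures are treated as binary gigabytes (2^30 bytes), which is what matches the rounded values in the list, and dividing the ASP.Net time by all HTTP requests assumes that total covers every request.

```python
# Figures copied from the list above (24-hour window, November 12th, 2013).
HTTP_REQUESTS = 148_084_883
PAGE_LOADS = 36_095_312
HTTP_BYTES_SENT = 833_992_982_627
SQL_QUERIES = 334_572_103
SQL_MS = 558_224_585
REDIS_HITS = 412_865_051
REDIS_MS = 99_346_916
ASPNET_MS = 2_728_177_045

SECONDS_PER_DAY = 24 * 60 * 60
MS_PER_HOUR = 60 * 60 * 1000


def to_gib(num_bytes: int) -> float:
    """Convert bytes to binary gigabytes (assumption: 1 GB = 2**30 bytes)."""
    return num_bytes / 2**30


def to_hours(milliseconds: int) -> float:
    """Convert a duration in milliseconds to hours."""
    return milliseconds / MS_PER_HOUR


# Derived averages: illustrative only, averaged over the whole day,
# so they say nothing about peak load.
stats = {
    "http_requests_per_second": HTTP_REQUESTS / SECONDS_PER_DAY,
    "page_loads_per_second": PAGE_LOADS / SECONDS_PER_DAY,
    "sql_queries_per_page_load": SQL_QUERIES / PAGE_LOADS,
    "avg_ms_per_sql_query": SQL_MS / SQL_QUERIES,
    "avg_ms_per_redis_hit": REDIS_MS / REDIS_HITS,
    # Assumes the ASP.Net total spans all HTTP requests, not just page loads.
    "avg_aspnet_ms_per_http_request": ASPNET_MS / HTTP_REQUESTS,
}

if __name__ == "__main__":
    print(f"HTTP traffic sent: {to_gib(HTTP_BYTES_SENT):.1f} GB")
    print(f"Time in SQL: {to_hours(SQL_MS):.1f} hours")
    for name, value in stats.items():
        print(f"{name}: {value:.2f}")
```

Averaged over the day, that works out to roughly 1,700 HTTP requests per second and a bit over 9 SQL queries per page load, with the caveat that the SQL count covers all HTTP requests, not only page loads.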