
Stack Exchange: Planning for failure, part 1

This will be the first in a series of interspersed posts about how our backup/secondary infrastructure is built and designed to work.

Stack Overflow started as the only site we had. That was over 3 years ago (August 2008), in a data center called PEAK Internet in Corvallis, Oregon.  Since then we’ve grown a lot, moved the primary network to New York, and added room to grow in the process.  A lot has changed since then in both locations, but much of the activity has stuck to the New York side of things.  The only services we’re currently running out of Oregon are chat and Data Explorer (so if you’re wondering why chat still runs during most outages, that’s why).

A few months ago we outgrew our CruiseControl.Net build system and changed over to TeamCity by JetBrains.  We did this for manageability, scalability, extensibility, and because it’s just generally a better product (for our needs, at least).  These build changes were pretty straightforward in the NY datacenter because we have a homogeneous web tier.  Our sysadmins insist that all the web servers be identical in configuration, and it pays off in many ways…such as when you change them all over to a new build source. Once NY was converted, it was time to set our eyes on Oregon.  This was going to net us several benefits: a consistent build process, a single URL (the NY and OR CC.Net instances weren’t connected in any way, nor even on the same version), and a single build system managing it all – including notifications.

So what’s the problem?  Oregon is, for lack of a more precise description, a mess.  No one here forgets it’s where we started out, and the configuration there was just fine at the time, but as you grow, things need to be in order.  We felt the time had come to get things organized.  Though some cleanup and naming conventions were applied when we joined OR to the new domain a few months ago, many things were still all over the place.  Off the top of my head:

  • Web tier is on Windows 2k8, not 2k8 R2
  • Web tier is not homogeneous
    • OR-WEB01 – Doesn’t exist; this box became a DNS server a looooong time ago
    • OR-WEB02 – Chat, or.sstatic.net
    • OR-WEB03 – Stack Exchange Data Explorer, CC.Net primary build server
    • OR-WEB04 – or.sstatic.net
    • OR-WEB05 – or.sstatic.net, used to be a VM server (we can’t get this to uninstall, heh)
    • OR-WEB06 – Chat
  • The configuration looks nothing like NY
  • Automatic updates are a bit flaky
  • Missing several components compared to NY (such as a physical Redis box and the current & upcoming service boxes)

So we’re doing what any reasonable person would do.  NUKE. EVERYTHING. New hardware for the web tier and primary database server has already been ordered by our sysadmin team (Kyle’s the lead on this one) and racked by our own Geoff Dalgas.  Here’s the plan:

  • Nuke the web tier, format it all
  • Replace OR-DB01 with the new database server, which has plenty of space on 6x Intel 320 Series 300GB drives
  • Re-task 2 of the web tier boxes as Linux load balancers running HAProxy in a failover configuration (a rough sketch follows this list)
  • Re-task the old OR-DB01 as a service box (upcoming posts on this – unused at the moment, but it has plenty of processing power and memory, so it fits)
  • Install 4 new web tier boxes as OR-WEB01 through OR-WEB04
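
To make the load balancer piece a bit more concrete, here’s a minimal sketch of roughly what a base HAProxy config for that pair could look like.  It’s illustrative only – the IPs and the /ping health-check path are assumptions, not our actual config – and the failover between the two load balancer boxes themselves (a shared IP hopping from one box to the other) is handled outside of HAProxy.

    # Illustrative sketch only – not our production config.
    # Both re-tasked boxes would run an identical config like this; a shared
    # (floating) IP determines which one actually receives traffic, so losing
    # one box just means the other picks up the IP and keeps serving.
    defaults
        mode http
        timeout connect 5s
        timeout client  30s
        timeout server  30s

    frontend http-in
        bind *:80
        default_backend or_web_tier

    backend or_web_tier
        balance roundrobin
        option httpchk GET /ping    # hypothetical health-check endpoint
        server or-web01 10.0.0.11:80 check
        server or-web02 10.0.0.12:80 check
        server or-web03 10.0.0.13:80 check
        server or-web04 10.0.0.14:80 check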

Why all of this work just for Stack Exchange Chat & Data Explorer?  Because it’s not just for that.  Oregon is also our failover in case of catastrophic failure in New York.  We send backups of all databases there every night.  In a pinch, we want to switch DNS over and get Oregon up ASAP (probably in read-only mode though, until we’re sure NY can’t be recovered any time soon). The OR web tier will tentatively look something like this:

  • OR-WEB01 – Chat, or.sstatic.net
  • OR-WEB02 – Chat, or.sstatic.net
  • OR-WEB03 – Data Explorer
  • OR-WEB04 – Idle

Now that doesn’t look right, does it?  I said earlier that the web tier should be homogeneous, and that’s true.  The above list is what’s effectively running on each server.  In reality (just like in New York) they’ll have identical IIS configs, all running the same app pools; the only difference is which servers get traffic for which sites via HAProxy.  Servers that don’t get traffic for a site – say OR-WEB04 for chat – simply won’t spin up that app pool. In addition to the above, each of these servers will be running everything else we have in New York, just not active or getting any traffic.  This includes every Q&A site in the network (including Stack Overflow), stackexchange.com, careers.stackoverflow.com, area51.stackexchange.com, openid.stackexchange.com, sstatic.net, etc.  All of these will be in a standby state of some sort…we’re still working out the exact details.  In any case, it won’t be drastically different from the New York load balancing setup, which I’ll cover in detail in a future post.
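
For the curious, here’s a rough sketch of how that per-site routing might look on the HAProxy side.  Again, this is an illustration under assumptions – the ACL names, backend names, and IPs are made up, and the real config carries a lot more than this – but it shows the idea: every box runs the same IIS config, and HAProxy simply decides which subset gets the traffic for each site.

    frontend http-in
        bind *:80
        # Route by Host header; any server *could* serve any site, but only
        # the servers in the chosen backend actually get traffic for it.
        acl is_chat hdr(host) -i chat.stackexchange.com
        acl is_sede hdr(host) -i data.stackexchange.com
        use_backend be_chat if is_chat
        use_backend be_sede if is_sede

    backend be_chat
        server or-web01 10.0.0.11:80 check
        server or-web02 10.0.0.12:80 check

    backend be_sede
        server or-web03 10.0.0.13:80 check
        # OR-WEB04 sits idle; it could be added here (or anywhere) in a pinch:
        # server or-web04 10.0.0.14:80 check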

Things will also get more interesting on the backup/restore side with the SQL 2012 move.  I’ll do a follow-up post on our initial plans around the SQL upgrade in the coming weeks – we’re waiting on some info about new hardware in that department.

Growing pains and lessons learned

In this blog, I aim to give you some behind-the-scenes views of what goes on at Stack Exchange and share some of the lessons we learn along the way.

Life at Stack Exchange is pretty busy at the moment; we have lots of projects in the air.  In short, we’re growing, and growing fast.  What effect does this have?

While growth is awesome (it’s what almost every company wants), it’s not without technical challenges.  A significant portion of our time is currently devoted to fighting fires in one way or another, whether they’re software issues around community scaling (like the mod flag queue) or actual technical walls (like drive space and Ethernet limits).

Off the top of my head, these are just a few items from the past few weeks:

  • We managed to completely saturate our outbound bandwidth in New York (100 Mbps).  When we took an outage a few days ago to bump a database server from 96GB to 144GB of RAM, we served error pages without the backing of our CDN…it turns out that’s not something we’re capable of doing anymore.  There were other factors at play, but the bottom line is we’ve grown too big to serve even static HTML and a few small images off that 100 Mbps pipe. We need a CDN at this point, but just to be safe we’ll be upping that connection at the datacenter as well.
  • The Stack Overflow database server is running out of space.  Those Intel X25-E SSDs we went with have performed superbly, but a RAID 10 of 6x64GB drives (half of the 384GB raw, or about 177GB usable after formatting) only goes so far.  We’ll be bumping those drives up to 200GB Intel 710 SSDs for the next 12-18 months of growth.  Since we have to eat an outage to do the swap anyway and memory is incredibly cheap, we’ll be bumping that database server to 288GB of RAM as well.
  • Our original infrastructure in Oregon (which now hosts Stack Exchange chat) is too old and a bit disorganized, so we’re replacing it.  Oregon isn’t only a home for chat and Data Explorer; it’s also the emergency failover if anything catastrophic were to happen in New York.  The old hardware has no chance of standing up to the current load of our network, so it’s being swapped out for shiny new goodies.
  • We’ve changed build servers – we’re building lots of projects across the company now and we need something that scales and is a bit more extensible.  We moved from CruiseControl.Net to TeamCity (still in progress, will be completed with the Oregon upgrade).
  • We’re in the process of changing core architecture to continue scaling.  The tag engine is duplicating work, with a separate copy running on every web server.  The search engine (built on Lucene.Net) is both running from disk (rather than having the entire index loaded into memory) and duplicating work.  Both of these are solvable problems, but they need a fundamental change.  I’ll discuss this further in an upcoming post; hopefully we’ll have some more open source goodness to share with the community as a result.
  • Version 2.0 of our API is rolling out (with lots of SSL-related scaling fun behind the scenes).
  • A non-trivial amount of time has gone into our monitoring systems as of late.  We have a lot of servers running a lot of stuff, and we need to see what’s going on.  I’ll go into more detail on this later.  Since there seems to be at least some demand for open-sourcing the dashboard we’ve built, we’ll do that as soon as time permits.

There are lots of things going on around here; as I get time, I’ll try to share more detailed happenings like the examples above as we grow.  Not many companies grow as fast as we are with as little hardware, or with as much passion for performance.  I don’t believe anyone else runs an architecture like ours at the scale we’re running it (traffic-wise, we actually use very little hardware); we’re both passionate and insane.

We’ll go through some tough technical changes coming up, both paying down technical debt and provisioning for the future.  I’ll try to share as much of that as I can here – for those who are merely curious what happens behind the curtain, and for those going through the same troubles we already have; maybe our experiences can help you out.
