For starters, let’s get a bad assumption I personally had before being hired out of the way: you don’t see most features we deploy.  In fact a small percentage of features we deploy are for regular users, or seen directly.  Aside from user-facing features, there exists a great deal of UI and infrastructure for our moderators, and even more for developers.  Besides features directly on the site, a lot more goes on behind the scenes.  Here’s a quick list of what’s underway right now:

  • Moving search off the web tier, a little over a year after putting it there.
  • Moving the tag engine off the web tier (you may have not even heard this exists).
  • Real-time updates to a number of site elements (voting, new answers, comments, your rep changes, etc.)
  • Deploying Oregon failover cluster (and other related infrastructure changes)
  • [redacted] sidebar project
  • [redacted] profile project
  • Home-grown TCP Server
  • Home-grown WebSockets server (built on the base TCP)
  • Home-grown error handler (similar to elmah but suited for our needs)
  • Monitoring dashboard improvements (on the way to open source)
  • Major network infrastructure changes (next Saturday, 4/14/2012)

That’s just the major stuff in process right now, of course there are a lot of other minor features and bug fixes being rolled out all the time.  There are some inter-connected pieces here, let’s first look at the how some of these groups of projects came to be.

When you visit a tag page, or pretty much anything to do with finding or navigating via a tag, you’re hitting our tag engine.  This little gem was cooked up by Marc Gravell and Sam Saffron to take the load off SQL Server (where it was Full Text Search-based previously) and do it much more efficiently on the web tier, inside the app domain.  This has been a tremendous performance win, but can we do it better? absolutely.  We can always do it better, the question is: at what cost?  The tag engine is now used a few times on each web server, for example inside the API as well as the main Q&A application (which handles all SE site requests, including Stack Overflow).  This is duplicate work…as is running it on all 11 web servers.  This same duplicate work wastefulness is true of our search indexing.  While it’s quite redundant, it’s too much so, and in practice doesn’t gain us anything…so what do we do?

We need to move some things into a service-oriented architecture, tag engine and search are first up.  Yes, there are other approaches, but this is how we’re choosing to do it.  So how do we go about this? We ask Marc what he can cook up, and he comes back with awesome every time.

There are a two things in play here, websockets and internal traffic…both have very different behavior and needs.  Internal API requests (tag engine, search) are a few clients (the web tier) getting a *lot* of data across those connections.  Websockets is a lot of consumers (tens of thousands or more) getting a little data.  We can optimize each case, but they are different and need independent love (possibly with separate implementations).

Enter our in-house TCP server, still a work in progress. This is what’s already powering our websockets/real-time functionality, so it’s optimized for the many clients/little data case.  If we start using this internally, it may very well be a different implementation optimized for the other end of the spectrum.  I won’t go into more detail on this because things change greatly with the next .Net release, and the TCP server can be much more efficient by being based on HttpListener (though it’s crazy efficient right now, Marc and normal humans have different definitions, so by his standards it needs improvement).  Accordingly, we’ll wait to open source this until that large refactor happens, the same goes for the websockets server impersonation built on top of it.

I’ve also spent a bit of my time lately replacing our exception handler, which we’ll be open sourcing after testing in production for a bit.  Currently it’s file system based and…well, it melts if we are in a high velocity error-throwing situation.  Instead we’ll be moving to a SQL-based error store (with JSON file and memory support for those who want it, it’s pluggable so we can add more stores later…like redis) with an in-memory backup buffer that’ll handle SQL being down as well.  This was needed because we monitor exceptions across all applcations in our monitoring dashboard…and we also want to open source that.  That means there’s a few dominos to get in place first.  For the same reasoning as projects above, the real-time websockets based real-time monitoring will have to wait for a future release (hopefully not too long)…but that’s a separate post, also coming soon.

There are some good things coming, especially in terms of open sourcing some of our goodies.  While we have some things out there already, we’ll be adding more…and a few improvements to the existing projects.  Also, we’ll be creating a central place where you find all these things.  That blog post is a good start, but we plan on giving our open source creations a permanent home so the community can find and help us improve them, so everyone can benefit.

This will be the first in an series of interspersed posts about how our backup/secondary infrastructure is built and designed to work.

Stack Overflow started as the only site we had. That was over 3 years ago (August 2008) in a data center called PEAK Internet in Corvallis, Oregon.  Since them we’ve grown a lot, moved the primary network to New York, and added room to grow in the process.  A lot has changed since then in both locations, but much of the activity has stuck to the New York side of things.  The only services we’re currently running out of Oregon are chat and data explorer (so if you’re wondering why chat still runs during most outages, that’s why).

Back a few months ago we outgrew our CruiseControl.Net build system, changing over to TeamCity by JetBrains.  We did this for manageability, scalability, extensibility and because it’s just generally a better product (for our needs at least)  These build changes were pretty straightforward in the NY datacenter because we have a homogeneous web tier.  Our sysadmins insist that all the web servers be identical in configuration, and it pays off in many ways…such as when you change them all to a new build source. Once NY was converted, it was time to set our eyes on Oregon.  This was going to net us several benefits: a consistent build, a single URL (NY and OR CC.Net instances were in no way connected, the same version, etc.), and a single build system managing it all – including notifications, etc.

So what’s the problem?  Oregon is, for lack of a more precise description, a mess.  No one here forgets it’s where we started out, and the configuration there was just fine at the time, but as you grow things need to be in order.  We felt the time has come to do that organization.  Though some cleanup and naming conventions were applied when we joined OR to the new domain a few months ago, many things were all over the place.  Off the top of my head:

  • Web tier is on Windows 2k8, not 2k8 SP1
  • Web tier is not homogeneous
    • OR-WEB01 – Doesn’t exist, this became a DNS server a looooong time ago
    • OR-WEB02 – Chat, or.sstatic.net
    • OR-WEB03 – Stack Exchange Data Explorer, CC.Net primary build server
    • OR-WEB04 – or.sstatic.net
    • OR-WEB05 – or.sstatic.net, used to be a VM server (we can’t get this to uninstall, heh)
    • OR-WEB06 – Chat
  • The configuration looks nothing like NY
  • Automatic updates are a tad bit flaky
  • Missing several components compared to NY (such as physical redis, current & upcoming service boxes)

So we’re doing what any reasonable person would do.  NUKE. EVERYTHING. New hardware for the web tier and primary database server has already been ordered by our Sysadmin team (Kyle’s lead on this one) and racked by our own Geoff Dalgas.  Here’s the plan:

  • Nuke the web tier, format it all
  • Replace OR-DB01 with the new database server with plenty of space on 6x Intel 320 Series 300GB drives
  • Re-task 2 of the web tier as Linux load balancers running HAProxy (failover config)
  • Re-task the old OR-DB01 as a service box (upcoming posts on this – unused at the moment, but it has plenty of processing power and memory, so it fits)
  • Install 4 new web tier boxes as OR-WEB01 through OR-WEB04

Why all of this work just for Stack Exchange Chat & Data Explorer?  Because it’s not just for that.  Oregon is also our failover in case of catastrophic failure in New York.  We send backups of all databases there every night.  In a pinch, we want to switch DNS over and get Oregon up ASAP (probably in read-only mode though, until we’re sure NY can’t be recovered any time soon). The OR web tier will tentatively look something like this:

  • OR-WEB01 – Chat, or.sstatic.net
  • OR-WEB02 – Chat, or.sstatic.net
  • OR-WEB03 – Data Explorer
  • OR-WEB04 – Idle

Now that doesn’t look right, does it?  I said earlier that the web tier should be homogeneous and that’s true.  The above list is what’s effectively running on each server.  In reality (just like New York) they’ll have identical IIS configs, all running the same app pools.  The only difference is which ones get traffic for which sites via HAProxy.  Ones that don’t get traffic, let’s say OR-WEB04 for chat, simply won’t spin up that app pool. In addition to the above, each of the servers will be running everything else we have in New York, just not active/getting any traffic.  This includes things like every Q&A site in the network (including Stack Overflow), stackexchange.com, careers.stackoverflow.com, area51.stackexchange.com, openid.stackexchange.com, sstatic.net, etc.  All of these will be in a standby state of some sort…we’re working on the exact details.  In any case, it won’t be drastically different from the New York load balancing setup, which I’ll cover in detail in a future post.

Things will also get more interesting on the backup/restore side with the SQL 2012 move.  I’ll also do a follow-up post on our initial plans around the SQL upgrade in the coming weeks – we’re waiting on some info around new hardware in that department.

Stack Overflow handles a lot of traffic.  Quantcast ranks us (at the time of this writing) as the 274th largest website in the US, and that’s rising.  That means everything that traffic relates to grows as well.  With growth, there are 2 areas of concern I like to split problems into: technical and non-technical.

Non-technical would be things like community management, flag queues, moderator counts, spam protection, etc.  These might (and often do) end up with technical solutions (that at least help, if not solve the problem), but I tend to think of them as “people problems.”  I won’t cover those here for the most part, unless there’s some technical solution that we can expose that may help others.

So what about the technical?  Now those are more interesting to programmers.  We have lots of things that grow along with traffic, to name a few:

  • Bandwidth (and by-proxy, CDN dependency)
  • Traffic logs (HAProxy logs)
  • CPU/Memory utilization (more processing/cache involved for more users/content)
  • Performance (inefficient things take more time/space, so we have to constantly look for wins in order to stay on the same hardware)
  • Database Size
I’ll probably get to all of the above in the coming months, but today let’s focus on what our current issue is. Stack Overflow is running out of space.  This isn’t news, it isn’t shocking; anything that grows is going to run out of room eventually.

 

This story starts a little over a year ago, right after I was hired.  One of the first things I was asked to do was growth analysis of the Stack Overflow DB. You can grab a copy of it here (excel format).  Brent Ozar, our go-to DBA expert/database magician, told us you can’t predict growth that accurately…and he was right.  There are so many unknowns: traffic from new sources, new features, changes in usage patterns (e.g. more editing caused a great deal more space usage than ever before).  I projected that right now we’d be at about 90GB in Stack Overflow data; we are at 113GB.  That’s not a lot, right?  Unfortunately, it’s all relative…and it is a lot.  Also, we currently have a transaction log of 41GB.  Yes, we could shrink it…but on the next re-org of the PK_Posts it’ll just grow to the same size (edit: thanks to Kendra Little on updating my knowledge base here, it was the cluster re-org itself eating that much transaction space, other indexes on the table need not be rebuilt in the clustered’s re-org since SQL 2000).

 

These projections were used to influence our direction on where we were going with scale, up vs. out.  We did a bit of both.  First let’s look at the original problems we faced when the entire network ran off one SQL Server box:
  • Memory usage (even at 96GB, we were over-crammed, we couldn’t fit everything in memory)
  • CPU (to be fair, this had other factors like Full Text search eating most of the CPU)
  • Disk IO (this is the big one)

What happens when you have lots of databases is all your sequential performance goes to crap because it’s not sequential anymore.  For disk hardware, we had one array for the DB data files: a RAID 10, 6 drive array of magnetic disks.  When dozens of DBs are competing in the disk queue, all performance is effectively random performance.  That means our read/write stalls were way higher than we liked.  We tuned our indexing and trimmed as much as we could (you should always do this before looking at hardware), but it wasn’t enough.  Even if it was enough there were the CPU/Memory issues of the shared box.

Ok, so we’ve outgrown a single box, now what?  We got a new one specifically for the purpose of giving Stack Overflow its own hardware.  At the time this decision was made, Stack Overflow was a few orders of magnitude larger than any other site we have.  Performance-wise, it’s still the 800 lb. gorilla.  A very tangible problem here was that Stack Overflow was so large and “hot,” it was a bully in terms of memory, forcing lesser sites out of memory and causing slow disk loads for queries after idle periods.  Seconds to load a home page? Ouch. Unacceptable.  It wasn’t just a hardware decision though, it had a psychological component.  Many people on our team just felt that Stack Overflow, being the huge central site in the network that is is, deserved its own hardware…that’s the best I can describe it.

Now, how does that new box solve our problems?  Let’s go down the list:

  • Memory (we have another 96GB of memory just for SO, and it’s not using massive amounts on the original box, win)
  • CPU (fairly straightforward: it’s now split and we have 12 new cores to share the load, win)
  • Disk IO (what’s this? SSDs have come out, game. on.)

We looked at a lot of storage options to solve that IO problem.  In the end, we went with the fastest SSDs money could buy.  The configuration on that new server is a RAID 1 for the OS (magnetic) and a RAID 10 6x Intel X-25E 64GB, giving us 177 GB of usable space.  Now let’s do the math of what’s on that new box as of today:

  • 114 GB – StackOverflow.mdf
  • 41 GB – StackOverflow.ldf

With a few other miscellaneous files on there, we’re up to 156 GB.  155/177 = 12% free space.  Time to panic? Not yet.  Time to plan? Absolutely.  So what is the plan?

We’re going to be replacing these 64GB X-25E drives with 200GB Intel 710 drives.  We’re going with the 710 series mainly for the endurance they offer.  And we’re going with 200GB and not 300GB because the price difference just isn’t worth it, not with the high likelihood of rebuilding the entire server when we move to SQL Server 2012 (and possibly into a cage at that data center).  We simply don’t think we’ll need that space before we stop using these drives 12-18 months from now.

Since we’re eating an outage to do this upgrade (unknown date, those 710 drives are on back-order at the moment) why don’t we do some other upgrades?  Memory of the large capacity DIMM variety is getting cheap, crazy cheap.  As the database grows, less and less of it fits into memory, percentage-wise.  Also, the server goes to 288GB (16GB x 18 DIMMs)…so why not?  For less than $3,000 we can take this server from 6x16GB to 18x16GB and just not worry about memory for the life of the server.  This also has the advantage of balancing all 3 memory channels on both processors, but that’s secondary.  Do we feel silly putting that much memory in a single server? Yes, we do…but it’s so cheap compared to say a single SQL Server license that it seems silly not to do it.

I’ll do a follow-up on this after the upgrade (on the Server Fault main blog, with a stub here).

In this blog, I aim to give you some behind the scenes views of what goes on at Stack Exchange and share some lessons we learn along the way.

Life at Stack Exchange is pretty busy at the moment; we have lots of projects in the air.  In short, we’re growing, and growing fast.  What effect does this have?

While growth is awesome (it’s what almost every company wants to do), it’s not without technical challenges.  A significant portion of our time is currently devoted to fighting fires in one way or another, whether it be software issues with community scaling (like the mod flag queue) or actual technical walls (like drive space, Ethernet limits).

Off the top of my head, these are just a few items from the past few weeks:

  • We managed to completely saturate our outbound bandwidth in New York (100mbps).  When we took an outage a few days ago to bump a database server from 96GB to 144GB of RAM, we served error pages without the backing of our CDN…turns out that’s not something we’re quite capable of doing anymore.  There were added factors here, but the bottom line is we’ve grown too far to serve even static HTML and a few small images off that 100mbps pipe. We need a CDN at this point, but just to be safe we’ll be upping that connection at the datacenter as well.
  • The Stack Overflow database server is running out of space.  Those Intel X25-E SSD drives we went with have performed superbly, but a raid 10 of 6x64GB (177GB usable) only goes so far.  We’ll be bumping those drives up to 200GB Intel 710 SSDs for the next 12-18 months of growth.  Since we have to eat an outage to do the swap and memory is incredibly cheap, we’ll be bumping that database server to 288GB as well.
  • Our original infrastructure in Oregon (which now hosts Stack Exchange chat) is too old and a bit disorganized – we’re replacing it.  Oregon isn’t only a home for chat and data explorer, it’s the emergency failover if anything catastrophic were to happen in New York.  The old hardware just has no chance of standing up to the current load of our network – so we’re replacing it with shiny new goodies.
  • We’ve changed build servers – we’re building lots of projects across the company now and we need something that scales and is a bit more extensible.  We moved from CruiseControl.Net to TeamCity (still in progress, will be completed with the Oregon upgrade).
  • We’re in process of changing core architecture to continue scaling.  The tag engine that runs on each web server is doing duplicate work and running multiple times.  The search engine (built on Lucene.Net) is both running from disk (not having the entire index bank loaded into memory) and duplicating work.  Both of these are solvable problems, but they need a fundamental change.  I’ll discuss this further coming up; hopefully we’ll have some more open source goodness to share with the community as a result.
  • Version 2.0 on our API is rolling out (lots of SSL-related scaling fun around this behind the scenes).
  • A non-trivial amount of time has gone into our monitoring systems as of late.  We have a lot of servers running a lot of stuff, we need to see what’s going on.  I’ll go into more detail on this later.  Since there seems to be at least some demand for open-sourcing the dashboard we’ve built, we will as soon as time permits.

There are lots of things going on around here, as I get time I’ll try to share more detailed happenings like the above examples with you as we grow.  Not many companies grow as fast as we are with as little hardware or as much passion for performance.  I don’t believe anyone runs the architecture we do at the scale we’re running at (traffic-wise, we actually have very little hardware being utilized); we’re both passionate and insane.

We’ll go through some tough technical changes coming up, from both paying down technical debt and provisioning for the future.  I’ll try and share as much as I can of that here, for those who are merely curious what happens behind the curtain and those who are going through the same troubles we already have, maybe our experiences can help you out.

Copyright © 2012 . All rights reserved. Theme based on the Jarrah by Templates Next