I like to think of Stack Overflow as running with scale but not at scale.  By that I mean we run very efficiently, but I still don’t think of us as “big”, not yet.  Let’s throw out some numbers so you can get an idea of what scale we’re at currently.  Here are some quick numbers from a 24 hour window a few days ago – November 12th, 2013 to be exact.  These numbers are from a typical weekday and only include our active data center – what we host.  Things like hits/bandwidth to our CDN are not included, since they don’t hit our network.

  • 148,084,883 HTTP requests to our load balancer
  • 36,095,312 of those were page loads
  • 833,992,982,627 bytes (776 GB) of HTTP traffic sent
  • 286,574,644,032 bytes (267 GB) total received
  • 1,125,992,557,312 bytes (1,048 GB) total sent
  • 334,572,103 SQL Queries (from HTTP requests alone)
  • 412,865,051 Redis hits
  • 3,603,418 Tag Engine requests
  • 558,224,585 ms (155 hours) spent running SQL queries
  • 99,346,916 ms (27 hours) spent on redis hits
  • 132,384,059 ms (36 hours) spent on Tag Engine requests
  • 2,728,177,045 ms (757 hours) spent processing in ASP.Net

(I should do a post on how we get those numbers quickly, and how just having those numbers is worth the effort)

Keep in mind these numbers are for the entire Stack Exchange network, but they still don’t include everything. With the exception of the 2 totals, these numbers come only from the HTTP requests we log to look at performance. Also, whoa, that’s a lot of hours in a day – how do you do that?  We like to call it magic; other people call it “multiple servers with multi-core processors” – but we’ll stick with magic. Here’s what runs the Stack Exchange network in that data center:

Here’s what that looks like:

DataCenter Rear

We don’t only run the sites on those servers, though.  The rest of the servers in the nearest rack are VMs and other infrastructure for auxiliary things not directly involved in serving the sites, like deployments, domain controllers, monitoring, an ops database for sysadmin goodies, etc.  Of the list above, 2 SQL servers were backups only until very recently – they are now used for read-only loads (mainly the Stack Exchange API) so we can keep on scaling without thinking about it for even longer.  Two of those web servers are dedicated to dev and meta and run very little traffic.

Core Hardware

When you remove redundancy here’s what Stack Exchange needs to run (while maintaining our current level of performance):

  • 2 SQL servers (SO is on one, everything else on another…they could still run on a single machine with headroom, though)
  • 2 Web Servers (maybe 3, but I have faith in just 2)
  • 1 Redis Server
  • 1 Tag Engine server
  • 1 elasticsearch server
  • 1 Load balancer
  • 1 Network
  • 1 ASA
  • 1 Router

(we really should test this one day by turning off equipment and seeing what the breaking point is)

Now there are a few VMs and such in the background to take care of other jobs, domain controllers, etc., but those are extremely lightweight, and we’re focusing on Stack Overflow itself and what it takes to render all of its pages at full speed.  If you want a full apples-to-apples comparison, throw in a single VMware server for all of those stragglers. So that’s not a large number of machines, but the specs on those machines typically aren’t available in the cloud, not at reasonable prices.  Here are some quick “scale up” server notes:

  • SQL servers have 384 GB of memory with 1.8TB of SSD storage
  • Redis servers have 96 GB of RAM
  • elasticsearch servers have 196 GB of RAM
  • Tag engine servers have the fastest raw processors we can buy
  • Network cores have 10 Gb of bandwidth on each port
  • Web servers aren’t that special: 32 GB of RAM, 2x quad-core processors, and 300 GB of SSD storage.
  • Servers that don’t have 2x 10Gb (e.g. SQL) have 4x 1 Gb of network bandwidth

Is 20 Gb massive overkill? You bet your ass it is – the active SQL servers average around 100-200 Mb out of that 20 Gb pipe.  However, things like backups and rebuilds can completely saturate it due to how much memory and SSD storage is present, so it does serve a purpose.

Storage

We currently have about 2 TB of SQL data (1.06 TB / 1.63 TB across 18 SSDs on the first cluster, 889 GB / 1.45 TB across 4 SSDs on the second cluster), so that’s what we’d need in the cloud (hmmm, there’s that word again).  Keep in mind that’s all SSD.  The average write time on any of our databases is 0 milliseconds – it’s below the smallest unit we can measure, because the storage handles it that well.  With the database in memory and 2 levels of cache in front of it, Stack Overflow actually has a 40:60 read-write ratio.  Yeah, you read that right: 60% of our database disk access is writes (you should know your read/write workload too).  There’s also storage for each web server – 2x 320 GB SSDs in a RAID 1.  The elastic boxes need about 300 GB apiece and perform much better on SSDs (we write/re-index very frequently).

It’s worth noting we do have a SAN, an EqualLogic PS6110X with 24x 900 GB 10K SAS drives on a 2x 10 Gb link (active/standby) to our core network.  It’s used exclusively as shared storage for the VM servers for high availability, but it isn’t involved in hosting the websites.  To put it another way, if the SAN died the sites wouldn’t even notice for a while (only the VM domain controllers are a factor).

Put it all together

Now, what does all that do?  We want performance.  We need performance.  Performance is a feature, a very important feature to us.  The main page loaded on all of our sites is the question page, affectionately known as Question/Show (its route name) internally.  On November 12th, that page rendered in an average of 28 milliseconds.  While we strive to stay under 50 ms, we really try to shave every possible millisecond off your page load experience.  All of our developers are certifiably anal curmudgeons when it comes to performance, so that helps keep times low as well. Here are the other top hit pages on SO, with average render times over the same 24 hour period as above:

  • Question/Show: 28 ms (29.7 million hits)
  • User Profiles: 39 ms (1.7 million hits)
  • Question List: 78 ms (1.1 million hits)
  • Home page: 65 ms (1 million hits) (that’s very slow for us – Kevin Montrose will be fixing this perf soon: here’s the main cause)

We have high visibility into what goes into our page loads because we record timings for every single request to our network.  You need some sort of metrics like this; otherwise, what are you basing your decisions on?  With those metrics handy, we can make easy-to-access, easy-to-read views like this:

Route Time - QuestionsList

After that the percentage of hits drops off dramatically, but if you’re curious about a specific page I’m happy to post those numbers too.  I’m focusing on render time here because that’s how long it takes our servers to produce a webpage; the speed of transmission is an entirely different (though admittedly very related) topic I’ll cover in the future.
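As a rough illustration of what recording a per-request timing involves, here’s a minimal sketch of an ASP.Net HTTP module that wraps every request in a stopwatch.  The module name and logging sink are hypothetical – this is not our actual instrumentation, which records route names and timings to a store we can query:

    using System;
    using System.Diagnostics;
    using System.Web;

    // Illustrative only: time every request from start to finish.
    // A real system would log route name + elapsed ms somewhere queryable.
    public class RouteTimerModule : IHttpModule
    {
        public void Init(HttpApplication app)
        {
            app.BeginRequest += (s, e) =>
                app.Context.Items["RequestTimer"] = Stopwatch.StartNew();

            app.EndRequest += (s, e) =>
            {
                if (app.Context.Items["RequestTimer"] is Stopwatch sw)
                {
                    sw.Stop();
                    // Hypothetical sink; route + elapsed ms is the interesting bit.
                    Trace.WriteLine($"{app.Context.Request.Path} took {sw.ElapsedMilliseconds} ms");
                }
            };
        }

        public void Dispose() { }
    }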

Room to grow

It’s definitely worth noting that these servers run at very low utilization.  Those web servers average between 5-15% CPU, 15.5 GB of RAM used and 20-40 Mb/s network traffic.  The SQL servers average around 5-10% CPU, 365 GB of RAM used, and 100-200 Mb/s of network traffic.  This affords us a few major things: general room to grow before we upgrade, headroom to stay online for when things go crazy (bad query, bad code, attacks, whatever it may be), and the ability to clock back on power if needed.  Here’s a view from Opserver of our web tier taken just now:

Opserver - Web Tier

The primary reason the utilization is so low is efficient code.  That’s not the topic of this post, but efficient code is critical to stretching your hardware further.  Anything you’re doing that doesn’t need doing costs more than not doing it, and that continues to apply to any subset of your code that could be more efficient.  That cost comes in the form of: power consumption, hardware cost (since you need more/bigger servers), developers understanding something more complicated (to be fair, this can go both ways – efficient isn’t necessarily simple), and likely a slower page render – meaning fewer users sticking around for another page load…or being less likely to come back.  The cost of inefficient code can be higher than you think.

Now that we know how Stack Overflow performs on its current hardware, next time we can see why we don’t run in the cloud.

Okay, let’s start with the obvious: why run Stack Overflow on pre-release software?  If no one tests anything, it only means one thing: more bugs at release.  We have the opportunity to test the hell out of the 2014 platform.  We can encounter, triage and, in most cases, get bugs resolved before release.  What company doesn’t want testers for their software in real-world environments whenever possible?  I’d like to think all of them do, just not all have the resources to make it happen.

All of that adds up to it being a huge win for us to test pre-release software in many cases.  First there’s the greedy benefit: we can help ourselves to make sure our pain points with the current version are resolved in the next.  Then there’s something we love doing even more: helping others.  By eliminating bugs from the RTM release and being able to blog about upgrade experiences such as this, we hope it helps others…just as posting a question publicly does.

It’s worth noting this isn’t an isolated case.  When we make changes to the open source libraries we release, you can bet that code has already run under heavy stackoverflow.com load before we push a library update out – we can’t think of a better test in most cases.

I’m going to elaborate on why we upgraded.  If you’re curious about the upgrade to SQL 2014 itself definitely check out Brent Ozar’s post: Update on Stack Overflow’s Recovery Strategy with SQL Server 2014.

What’s in SQL 2014 for us?

We have one major issue with SQL 2012 that’s solved by upgrading to SQL 2014: an AlwaysOn replica in the disconnected state is now read-only instead of offline.  Currently (in SQL 2012), when a server loses quorum in a cluster, the AlwaysOn availability groups on that server shut down and the databases in them become unavailable.  In SQL 2014, this doesn’t happen; minority nodes go into read-only mode rather than offline.

That’s it – that’s all the reason we needed to upgrade. That is, if it works as advertised…and how do we know unless we test it?  Much better to test it now, while we can actually get bugs fixed before RTM.  The net result in a communication-down scenario is that Stack Overflow goes into read-only mode automatically, instead of offline.  We want to be online at all times (who doesn’t?), but in the worst case, being read-only isn’t bad.  Being online in read-only mode means all of our users can still find existing answers and the world keeps spinning.
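To give a feel for what read-only mode can look like from the application side, here’s a minimal sketch, assuming a hypothetical listener name and helper (this is not our data access layer).  ApplicationIntent=ReadOnly is the standard connection string hint for routing to an AlwaysOn readable secondary, and an application can fall back to it when a writable connection isn’t available:

    using System.Data.SqlClient;

    // Minimal sketch: prefer the writable primary, fall back to a read-intent
    // connection so the site can stay up in read-only mode.
    public static class Db
    {
        private const string Primary  = "Server=SQL-LISTENER;Database=StackOverflow;Integrated Security=true;";
        private const string ReadOnly = Primary + "ApplicationIntent=ReadOnly;";

        public static SqlConnection Open(out bool readOnly)
        {
            try
            {
                var conn = new SqlConnection(Primary);
                conn.Open();
                readOnly = false;
                return conn;
            }
            catch (SqlException)
            {
                // Writable primary unreachable (e.g. quorum loss): serve reads instead of erroring.
                var conn = new SqlConnection(ReadOnly);
                conn.Open();
                readOnly = true;
                return conn;
            }
        }
    }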

To be fair, we want some of the other goodies in 2014 as well.  The improved query optimizer (due to better cardinality estimations) is already showing significant savings for some of our most active queries (for example: the query that fetches comments and runs 200 times per second is 30% cheaper). We can now have 8 read-only replicas per cluster (limited to 4 in 2012).

Problems with SQL 2012 quorum behavior

Let’s start by understanding Windows Server clustering quorum behavior.  Quorum isn’t anything fancy; it’s exactly what you think.  In Windows clustering it means that a server needs to be able to see the majority of the total cluster to maintain quorum (for example, in a 5-node cluster a node must be able to see at least 3 nodes, including itself).  Without this, it assumes that the minority fraction of servers it can see (possibly just itself) is the side affected by whatever connectivity issue exists.  The number of servers required for quorum can change and shift as servers go up and down for reboots, etc.  This is known as dynamic quorum.  Dynamic quorum takes 90 seconds to change the calculation in most of our unplanned disconnect cases.  This matters because losing several nodes quickly can result in all your databases being offline everywhere…because everyone lost quorum.

Let’s take a look specifically at Stack Overflow.  In our current setup we have 2 SQL clusters that are of the same configuration for the sake of this discussion. In each cluster, we have 2 servers in the New York data center and 1 in the Oregon data center:

SE Network SQL

Now the key here is that VPN link from one data center to the other.  When the internet blips (as it tends to do), the VPN mesh drops and the Oregon side loses quorum and (in 2012) goes offline.  What would be a read-only data center that continued to operate without a connection to New York is essentially dead in the water.  Dammit.

Well, that sucks.  What would be worse than losing a read-only data center?  If Oregon blips less than 90 seconds after a node in New York goes down for maintenance and dynamic quorum hasn’t adjusted yet, all nodes go down, and stackoverflow.com with them.  That’s what we call a bad day.  Yes, we can avoid this by removing Oregon’s node weight before maintenance (and we do just that), but it still leaves Oregon totally offline.

How SQL 2014 makes it all better

It’s important to understand that SQL AlwaysOn availability groups are built on Windows Server clustering.  When a cluster node loses quorum, it sends a signal to SQL letting it know.  In SQL 2012, the handling of that signal is to shut down the availability group, and the databases within it become unavailable.  In SQL 2014, availability groups handle that signal differently: they stay up as read-only replicas.

The whole point of not being writable when in the minority is to prevent a split-brain situation where 2 (or more) separate groups each have a writing node…resulting in a forked database.  Being read-only doesn’t cause a split brain, so honestly this is the fail-back-to-read-only behavior we expected SQL 2012 to have.

This simple change lets us do more than just stay online; it opens up new possibilities without re-architecting our primary data store.  Let’s say, hypothetically, we have a few servers in Europe: a few web servers, whatever networking hardware, a few support VMs, and a SQL read-only replica.  What happens when the VPN to this hypothetical node goes down?  It’s still online, just with data that’s not getting updated…that’s not a terrible alternative to being offline and traffic having to be routed away immediately.  Even if we needed to re-route traffic, we’d have time to make that decision, or for whatever mechanism accomplishes that to kick in.

What if that scenario wasn’t so hypothetical?

A question we often get asked at Stack Exchange is why stackoverflow.com and all our other domains aren’t served over SSL.  It’s a request we see from users at least a few times a month, asking us to help ensure their security and privacy.  So why haven’t we done it yet?  I wanted to address that here: it’s not that we don’t want to do it, it’s just a lot of work and we’re getting there.

So, what’s needed to move our network to SSL? Let’s make a quick list:

  • Third party content must support SSL:
    • Ads
    • Avatars
    • Facebook
    • Google Analytics
    • Inline Images & Videos
    • MathJax
    • Quantcast
  • Our side has to support SSL:
    • Our CDN
    • Our load balancers
    • The sites themselves
    • Websockets
    • Certificates (this one gets interesting)

Ok, so that doesn’t look so hard – what’s the big deal?  Let’s look at third party content first.  Note that all of these items are totally outside our control.  All we can do is ask for them to support SSL…but luckily we work with awesome people, and they’re helping us out.

    • Ads: We’re working with Adzerk to support SSL; this had to be done on their side, and it’s ready for testing now.
    • Avatars: Gravatar and Imgur can support SSL – Gravatar is ready, but i.stack.imgur.com, where our images are hosted, is not yet (we’re working on this).
    • Facebook: done.
    • Google Analytics: done.
    • Inline Images: we can’t include insecure content on the page…so that means switching our images to SSL when i.stack.imgur.com is ready.  For images embedded from other domains, we have to turn them into links, or solve this via another approach.
    • MathJax: we currently use MathJax’s CDN for that content, but it doesn’t support SSL yet, so we may have to host this on our own CDN.
    • Quantcast: done – under another domain.

Now here’s where stuff gets fun: we have to support SSL on our side.  Let’s cover the easy parts first.  Our CDN has to support SSL.  Okay, it’s not cheap, but it’s a problem we can buy our way out of.  We only use cdn.sstatic.net for production content, so a cdn.sstatic.net & *.cdn.sstatic.net combo cert should cover us.  The CDN endpoint on our side (it’s a pull model) has to support SSL as well so it’s an encrypted handshake on both legs, but that’s minor traffic and easily doable.

With websockets we’ll just try and see when all dependencies are in place.  We don’t anticipate any particular problems there, and when it works it should work for more people.  Misconfigured or old proxies tend to interfere with HTTP websocket traffic they don’t understand, but those same proxies will just forward on the encrypted HTTPS traffic.

Of course, our load balancers have to support SSL as well, so let’s take a look at how that works in our infrastructure.  Whether our SSL setup is typical or not I have no idea, but here’s how we currently do SSL:

nginx network layout

HTTPS traffic goes to nginx on the load balancer machines and terminates there.  From there a plain HTTP request is made to HAProxy which delegates the request to whichever web server set it should go to.  The response then goes back along the same path.  So you have a secure connection all the way to us, but inside our internal network it’s transitioned to a regular HTTP request.

So what’s changing?  Logically, not much – we’re just switching from nginx to HAProxy as the terminator.  The request now goes all the way to HAProxy (specifically, a process dedicated to SSL termination) and then to the HTTP front-end to continue on to a web server.  This both eases management (one install and config via Puppet) and, we hope, lets HAProxy handle SSL more efficiently, since CPU load on the balancers is an unknown once we go full SSL.  An HAProxy instance (without getting crazy) is tied to a single core, but you can have many SSL processes doing termination, all feeding the single HTTP front-end process.  With this approach, the heavy load should scale out well across our 12-physical-core machines…we hope.  If it doesn’t work, then we need active/active load balancers, which is another project we’re working on just in case.

Now here’s the really fun part: certificates.  Let’s take a look at a sample of domains we’d have to cover:

  • stackoverflow.com
  • meta.stackoverflow.com
  • stackexchange.com
  • careers.stackoverflow.com
  • gaming.stackexchange.com
  • meta.gaming.stackexchange.com
  • superuser.com
  • meta.superuser.com

Ok, so the top level domains are easy: a SAN cert allows many domains on a single cert – we can sanely combine up to 100 here.  So what about all of our *.stackexchange.com domains? A wildcard cert – excellent, we’re knocking these out like crazy. What about meta.*.stackexchange.com? Damn. Can’t do that. You can’t have a wildcard of that form – at least not one supported by most major browsers, which means effectively it’s not an option.  Let’s see where these restrictions originate.

Section 3.1 of RFC 2818 is very open/ambiguous on wildcard usage; it states:

Names may contain the wildcard character * which is considered to match any single domain name component or component fragment. E.g., *.a.com matches foo.a.com but not bar.foo.a.com. f*.com matches foo.com but not bar.com.

It doesn’t really disallow meta.*.stackexchange.com or *.*.stackexchange.com.  So far so good…then some jerk tried to make a certificate for *.com, which obviously wasn’t good, so that was revoked and disallowed.  So what happened? Some other jerk went and tried *.*.com.  Well, that ruined it for everyone.  Thanks, jerks.

The rules were further clarified in Section 6.4.3 of RFC 6125 which says (emphasis mine):

The client SHOULD NOT attempt to match a presented identifier in which the wildcard character comprises a label other than the left-most label (e.g., do not match bar.*.example.net)
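To make that restriction concrete, here’s a tiny hypothetical checker (not anything we actually run) that honors a wildcard only in the left-most label, the way browsers do:

    using System;

    // Illustrative only: a certificate wildcard may occupy the left-most label and nothing else.
    public static class WildcardMatch
    {
        public static bool Matches(string pattern, string host)
        {
            var p = pattern.Split('.');
            var h = host.Split('.');
            if (p.Length != h.Length) return false;        // * matches exactly one label

            for (int i = 0; i < p.Length; i++)
            {
                if (p[i] == "*")
                {
                    if (i != 0) return false;               // RFC 6125 §6.4.3: left-most label only
                    continue;
                }
                if (!p[i].Equals(h[i], StringComparison.OrdinalIgnoreCase)) return false;
            }
            return true;
        }
    }

    // Matches("*.stackexchange.com", "gaming.stackexchange.com")            -> true
    // Matches("meta.*.stackexchange.com", "meta.gaming.stackexchange.com")  -> false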

This means no *.*.stackexchange.com or meta.*.stackexchange.com.  Enough major browsers conform to this RFC that it’s a non-option.  So what do we do?  We thought of a few approaches.  We would prefer not to change domains for our content, so the first thought was setting up an automated operation to install new entries on a SAN cert for each new meta created.  As we worked through this option, we found several major problems:

  • We are limited to approximately 100 entries per SAN cert, so every new 100 metas means another IP allocation on our load balancer. (This appears to be an industry restriction due to abuse, rather than a technical limitation)
  • The IP usage issues would be multiplied as we move to active/active load balancers, draining our allocated IP ranges faster and putting an even shorter lifetime on this solution.
  • It delays site launches, due to waiting on certificates to be requested, generated, merged, received and installed.
  • Every site launch has to have a DNS entry for the meta, this exposes us to additional risk of DNS issues and complicates data center failover.
  • We have to build an entire system to support this, rotating through certificates in a bit of a Rube Goldberg style, installing into HAProxy, writing to BIND, etc.

So what do we do?  We’re not 100% decided yet – we’re still researching and talking with people.  We may have to do the above.  The alternatives would be to shift child meta domains to the same level as *.stackexchange.com, or under a common *.x.stackexchange.com for which we can get a wildcard.  If we shift domains, we have to put in redirects and update URLs in already-posted content to point to the new place (to save users a redirect).  Also, changing the domain so it’s no longer a child means that setting a cookie on the parent domain and sharing it down is no longer an option – so the login model has to change there, while staying as transparent and easy as possible.

Now let’s say we do all of that and it all works – what happens to our Google rank when we start sending everyone to HTTPS, including crawlers?  We don’t know, and it’s a little scary…so we’ll test the best we can with a slow rollout.  Hopefully it won’t matter. We have to send everything to SSL, because otherwise logged-in users would get a redirect every time after clicking an http:// link from Google…that needs to be an https:// in the search results.

Some of those simple items above aren’t so simple, especially changing child meta logins and building a cert grabbing/installing system.  So is that why we aren’t doing it?  Of course not, we love pain – so we are actively working on SSL now that some of our third party content providers are ready.  We’re not ignoring the request for enhanced security and privacy while using our network; it’s just not as simple as many people seem to think at first glance – not when you’re dealing with our domain variety.  We’ll be working on it over the next 6-8 weeks.

The best part of working for Stack Exchange is that our code, network and databases are so awesome that we never throw exceptions.

Okay, okay, back to reality.  Everyone has exceptions, so how do we handle the first step of recording them?  Most .Net developers have heard of ELMAH (Error Logging Modules and Handlers), and Stack Exchange started out using a modified version of it for over a year.  The setup was simple: we used the XML file store pointing to a share on one of the web servers (since most of the problems were SQL or network related when shit hit the fan).

What this setup didn’t allow for was high-traffic logging, since the 200-file loop necessitated a directory file listing (not so fast in Windows), reading existing files to see if the error was a duplicate, then updating the duplicate with a new count if one was found.  Take into account the level of traffic Stack Overflow gets at any point in the day, and all this meant was we effectively took that web server out of rotation (by pegging its network throughput just logging the errors…yes, we saw the humor there).

When I finally got some time to stick the errors into SQL in a way that fit our needs, I wrote StackExchange.Exceptional (with input from all our dev teams along the way).  It certainly borrows a fundamental idea from ELMAH (multiple stores on a single error interface mainly), but after that they diverge significantly.  We needed a few things that no existing error handler put in one package:

  • High speed logging (upwards of 100,000 exceptions/minute)
  • Error roll-ups (for similar exceptions, showing a duplicate count that increases rather than logging a separate entry)
  • Handling the case where we can’t reach the central error store (e.g. the connection to SQL is interrupted)
  • Custom data for our exceptions
  • Querying relevant exception data

All but the last one went into StackExchange.Exceptional directly; the last one factors into a bigger picture coming soon.

Let me explain why we need the above.  The high volume one’s pretty easy: we get a lot of traffic on Stack Overflow alone.  When shit hits the fan, the exceptions roll in pretty fast – for example, when redis goes offline or SQL Server is unreachable, we’re throwing 10,000 errors in under a few seconds.  While we’re throwing lots of errors, they’re not likely to tell us much in all that repetition.  10,000 of the same exception helps us no more in debugging than that error logged once with an x10,000 beside it…so that’s what Exceptional does.  If an error has the same stack trace, then we just increase a duplicate counter on the SQL row instead of logging another row.
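Conceptually the roll-up looks something like this – a simplified sketch using Dapper’s Execute for brevity, with invented table and column names, not Exceptional’s actual code:

    using System;
    using System.Data;
    using Dapper;

    // Simplified sketch of the roll-up idea: identical stack traces collapse into
    // one row whose DuplicateCount keeps climbing, instead of logging another row.
    public static class ErrorRollup
    {
        public static void Log(IDbConnection db, Exception ex)
        {
            // Key duplicates on something stable; a real implementation would use a proper
            // hash (e.g. SHA-256) rather than GetHashCode().
            var hash = (ex.GetType().FullName + ex.StackTrace).GetHashCode().ToString();

            var updated = db.Execute(
                @"UPDATE Exceptions
                     SET DuplicateCount = DuplicateCount + 1, LastLogDate = GETUTCDATE()
                   WHERE ErrorHash = @hash",
                new { hash });

            if (updated == 0) // first time we've seen this one: store the full details
            {
                db.Execute(
                    @"INSERT INTO Exceptions (ErrorHash, Type, Message, StackTrace, DuplicateCount, CreationDate)
                      VALUES (@hash, @type, @message, @stack, 1, GETUTCDATE())",
                    new { hash, type = ex.GetType().FullName, message = ex.Message, stack = ex.StackTrace });
            }
        }
    }

A real implementation also has to make that update/insert pair atomic and cope with the store being unreachable, which is the next point.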

When we’re throwing exceptions due to a network issue, guess what trying to log over the network does.  Yeah, this one’s not hard to see coming, so what Exceptional does is keep a memory-based exception store that queues up to 1000 exceptions (duplicates roll up within that 1000) whenever it’s unable to write to whatever remote store you’ve configured (SQL or JSON).  It will retry writing the exceptions every 2 seconds.  Once successful, things go back to normal and exceptions are logged as they come in.  If there is currently a problem connecting to the error store, the issue will be shown in the exceptions list.
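That fallback is roughly this shape – again a sketch with hypothetical names, not Exceptional’s internals:

    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    // Sketch of the "can't reach the error store" fallback: queue in memory, retry on a timer.
    // The real store is configurable (SQL, JSON file, memory); names here are invented.
    public class BufferedErrorStore
    {
        private readonly ConcurrentQueue<Exception> _backlog = new ConcurrentQueue<Exception>();
        private readonly Timer _retry;
        private const int MaxQueued = 1000;

        public BufferedErrorStore()
        {
            // Try to flush the backlog every 2 seconds until the remote store is reachable again.
            _retry = new Timer(_ => Flush(), null, TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(2));
        }

        public void Log(Exception ex)
        {
            try { WriteToRemoteStore(ex); }
            catch
            {
                if (_backlog.Count < MaxQueued) _backlog.Enqueue(ex); // duplicates would roll up here too
            }
        }

        private void Flush()
        {
            while (_backlog.TryPeek(out var ex))
            {
                try { WriteToRemoteStore(ex); }
                catch { return; }            // still down; try again in 2 seconds
                _backlog.TryDequeue(out _);
            }
        }

        private void WriteToRemoteStore(Exception ex) { /* SQL/JSON store write goes here */ }
    }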

The last two are probably less relevant.  Custom data just enables storing string name/value pairs with the exception, for display or for use via JavaScript includes on the exception views.  I’ll cover this in detail on the project page in a specific wiki; for now the quick bits are in the setup wiki.  Querying is not really covered in Exceptional itself – it just has a SQL structure that’s friendly for doing so (relevant bits of exceptions broken out into individual columns).  The actual searching is something we have, but I have to do a bit more work than expected to get that dashboard out the door.  There are other benefits this netted us, but they’re hard to explain without a full dashboard post when that’s released…I’ll try to do so as quickly as I can figure out the open-source charting story.
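As a rough example of the custom data idea (the key names here are invented; the setup wiki has the real conventions), the name/value pairs can simply ride along on the exception itself:

    using System;
    using System.Collections;

    public static class CustomDataExample
    {
        // Hypothetical operation that fails; the interesting part is the catch block.
        private static void ProcessVote(int postId, int userId)
            => throw new InvalidOperationException("vote failed");

        public static void Main()
        {
            try
            {
                ProcessVote(postId: 1234, userId: 42);
            }
            catch (Exception ex)
            {
                // Name/value pairs travel with the exception; the error store can pull
                // them out of ex.Data and show them on the exception detail view.
                ex.Data["PostId"] = "1234";
                ex.Data["UserId"] = "42";
                LogToErrorStore(ex);
            }
        }

        private static void LogToErrorStore(Exception ex) // stand-in for the real logging call
        {
            foreach (DictionaryEntry pair in ex.Data)
                Console.WriteLine($"{pair.Key} = {pair.Value}");
        }
    }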

I’ve added a few initial wikis to github for getting Exceptional up and running, I’ll try and get an example project up as well…in the next few days as time allows.  Here’s a quick view of the list/detail screens to get a feel:

Update: A sample project is now posted alongside Exceptional core on github.  Some additional store providers (e.g. MongoDB) have already been requested, stay tuned for those – they’ll show up in the form of other packages with a dependency on StackExchange.Exceptional, so if you don’t want that store’s driver, you won’t have to include it.

For starters, let’s get a bad assumption I personally had before being hired out of the way: you don’t see most features we deploy.  In fact, only a small percentage of the features we deploy are for regular users or seen directly.  Aside from user-facing features, there exists a great deal of UI and infrastructure for our moderators, and even more for developers.  Besides features directly on the site, a lot more goes on behind the scenes.  Here’s a quick list of what’s underway right now:

  • Moving search off the web tier, a little over a year after putting it there.
  • Moving the tag engine off the web tier (you may not have even heard this exists).
  • Real-time updates to a number of site elements (voting, new answers, comments, your rep changes, etc.)
  • Deploying Oregon failover cluster (and other related infrastructure changes)
  • [redacted] sidebar project
  • [redacted] profile project
  • Home-grown TCP Server
  • Home-grown WebSockets server (built on the base TCP)
  • Home-grown error handler (similar to elmah but suited for our needs)
  • Monitoring dashboard improvements (on the way to open source)
  • Major network infrastructure changes (next Saturday, 4/14/2012)

That’s just the major stuff in process right now; of course there are a lot of other minor features and bug fixes being rolled out all the time.  There are some inter-connected pieces here, so let’s first look at how some of these groups of projects came to be.

When you visit a tag page, or do pretty much anything to do with finding or navigating via a tag, you’re hitting our tag engine.  This little gem was cooked up by Marc Gravell and Sam Saffron to take the load off SQL Server (where it was Full Text Search-based previously) and do it much more efficiently on the web tier, inside the app domain.  This has been a tremendous performance win, but can we do it better? Absolutely.  We can always do it better; the question is: at what cost?  The tag engine is now used a few times on each web server, for example inside the API as well as the main Q&A application (which handles all SE site requests, including Stack Overflow).  This is duplicate work…as is running it on all 11 web servers.  The same wastefulness is true of our search indexing.  Some redundancy is fine, but this is more than we need, and in practice it doesn’t gain us anything…so what do we do?

We need to move some things into a service-oriented architecture, tag engine and search are first up.  Yes, there are other approaches, but this is how we’re choosing to do it.  So how do we go about this? We ask Marc what he can cook up, and he comes back with awesome every time.

There are two things in play here, websockets and internal traffic…both have very different behavior and needs.  Internal API requests (tag engine, search) are a few clients (the web tier) getting a *lot* of data across those connections.  Websockets is a lot of consumers (tens of thousands or more) each getting a little data.  We can optimize each case, but they are different and need independent love (possibly with separate implementations).

Enter our in-house TCP server, still a work in progress. This is what already powers our websockets/real-time functionality, so it’s optimized for the many clients/little data case.  If we start using it internally, it may very well be a different implementation optimized for the other end of the spectrum.  I won’t go into more detail on this because things change greatly with the next .Net release, and the TCP server can be made much more efficient by being based on HttpListener (though it’s crazy efficient right now, Marc and normal humans have different definitions of efficient, so by his standards it needs improvement).  Accordingly, we’ll wait to open source this until that large refactor happens; the same goes for the websockets server implementation built on top of it.
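For the curious, the barest shape of the many clients/little data pattern in .Net is just an async accept loop.  This toy sketch bears no resemblance to the real server; it’s only here to show the idea:

    using System;
    using System.Net;
    using System.Net.Sockets;
    using System.Text;
    using System.Threading.Tasks;

    // Toy example: accept lots of connections, send each a tiny payload.
    // Purely illustrative; not related to the actual in-house implementation.
    public class TinyTcpServer
    {
        public static async Task Main()
        {
            var listener = new TcpListener(IPAddress.Any, 9000);
            listener.Start();
            while (true)
            {
                var client = await listener.AcceptTcpClientAsync();
                _ = HandleAsync(client); // fire and forget; each client is just a small async state machine
            }
        }

        private static async Task HandleAsync(TcpClient client)
        {
            using (client)
            {
                var payload = Encoding.UTF8.GetBytes("hello\n"); // e.g. a small real-time update
                await client.GetStream().WriteAsync(payload, 0, payload.Length);
            }
        }
    }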

I’ve also spent a bit of my time lately replacing our exception handler, which we’ll be open sourcing after testing it in production for a bit.  Currently it’s file system based and…well, it melts if we are in a high velocity error-throwing situation.  Instead we’ll be moving to a SQL-based error store (with JSON file and memory support for those who want it; it’s pluggable, so we can add more stores later…like redis) with an in-memory backup buffer that will handle SQL being down as well.  This was needed because we monitor exceptions across all applications in our monitoring dashboard…and we also want to open source that.  That means there are a few dominos to get in place first.  For the same reasoning as the projects above, the websockets-based real-time monitoring will have to wait for a future release (hopefully not too long)…but that’s a separate post, also coming soon.

There are some good things coming, especially in terms of open sourcing some of our goodies.  While we have some things out there already, we’ll be adding more…and a few improvements to the existing projects.  Also, we’ll be creating a central place where you can find all these things.  That blog post is a good start, but we plan on giving our open source creations a permanent home so the community can find them and help us improve them, so everyone can benefit.

This will be the first in a series of interspersed posts about how our backup/secondary infrastructure is built and designed to work.

Stack Overflow started as the only site we had. That was over 3 years ago (August 2008), in a data center called PEAK Internet in Corvallis, Oregon.  Since then we’ve grown a lot, moved the primary network to New York, and added room to grow in the process.  A lot has changed since then in both locations, but much of the activity has stuck to the New York side of things.  The only services we’re currently running out of Oregon are chat and data explorer (so if you’re wondering why chat still runs during most outages, that’s why).

A few months back we outgrew our CruiseControl.Net build system and changed over to TeamCity by JetBrains.  We did this for manageability, scalability, extensibility, and because it’s just generally a better product (for our needs at least).  These build changes were pretty straightforward in the NY data center because we have a homogeneous web tier.  Our sysadmins insist that all the web servers be identical in configuration, and it pays off in many ways…such as when you change them all to a new build source. Once NY was converted, it was time to set our eyes on Oregon.  This was going to net us several benefits: a consistent build, a single URL (the NY and OR CC.Net instances were in no way connected, not even on the same version), and a single build system managing it all – including notifications, etc.

So what’s the problem?  Oregon is, for lack of a more precise description, a mess.  No one here forgets it’s where we started out, and the configuration there was just fine at the time, but as you grow things need to be in order.  We felt the time had come to do that organizing.  Though some cleanup and naming conventions were applied when we joined OR to the new domain a few months ago, many things were still all over the place.  Off the top of my head:

  • Web tier is on Windows 2k8, not 2k8 SP1
  • Web tier is not homogeneous
    • OR-WEB01 – Doesn’t exist, this became a DNS server a looooong time ago
    • OR-WEB02 – Chat, or.sstatic.net
    • OR-WEB03 – Stack Exchange Data Explorer, CC.Net primary build server
    • OR-WEB04 – or.sstatic.net
    • OR-WEB05 – or.sstatic.net, used to be a VM server (we can’t get this to uninstall, heh)
    • OR-WEB06 – Chat
  • The configuration looks nothing like NY
  • Automatic updates are a tad bit flaky
  • Missing several components compared to NY (such as physical redis, current & upcoming service boxes)

So we’re doing what any reasonable person would do.  NUKE. EVERYTHING. New hardware for the web tier and primary database server has already been ordered by our Sysadmin team (Kyle’s lead on this one) and racked by our own Geoff Dalgas.  Here’s the plan:

  • Nuke the web tier, format it all
  • Replace OR-DB01 with the new database server with plenty of space on 6x Intel 320 Series 300GB drives
  • Re-task 2 of the web tier as Linux load balancers running HAProxy (failover config)
  • Re-task the old OR-DB01 as a service box (upcoming posts on this – unused at the moment, but it has plenty of processing power and memory, so it fits)
  • Install 4 new web tier boxes as OR-WEB01 through OR-WEB04

Why all of this work just for Stack Exchange Chat & Data Explorer?  Because it’s not just for that.  Oregon is also our failover in case of catastrophic failure in New York.  We send backups of all databases there every night.  In a pinch, we want to switch DNS over and get Oregon up ASAP (probably in read-only mode though, until we’re sure NY can’t be recovered any time soon). The OR web tier will tentatively look something like this:

  • OR-WEB01 – Chat, or.sstatic.net
  • OR-WEB02 – Chat, or.sstatic.net
  • OR-WEB03 – Data Explorer
  • OR-WEB04 – Idle

Now that doesn’t look right, does it?  I said earlier that the web tier should be homogeneous, and that’s still true.  The above list is what’s effectively running on each server.  In reality (just like New York) they’ll have identical IIS configs, all running the same app pools.  The only difference is which ones get traffic for which sites via HAProxy.  Ones that don’t get traffic, say OR-WEB04 for chat, simply won’t spin up that app pool. In addition to the above, each of the servers will be running everything else we have in New York, just not active/getting any traffic.  This includes things like every Q&A site in the network (including Stack Overflow), stackexchange.com, careers.stackoverflow.com, area51.stackexchange.com, openid.stackexchange.com, sstatic.net, etc.  All of these will be in a standby state of some sort…we’re working on the exact details.  In any case, it won’t be drastically different from the New York load balancing setup, which I’ll cover in detail in a future post.

Things will also get more interesting on the backup/restore side with the SQL 2012 move.  I’ll also do a follow-up post on our initial plans around the SQL upgrade in the coming weeks – we’re waiting on some info around new hardware in that department.

Stack Overflow handles a lot of traffic.  Quantcast ranks us (at the time of this writing) as the 274th largest website in the US, and that’s rising.  That means everything that traffic relates to grows as well.  With growth, there are 2 areas of concern I like to split problems into: technical and non-technical.

Non-technical would be things like community management, flag queues, moderator counts, spam protection, etc.  These might (and often do) end up with technical solutions (that at least help, if not solve the problem), but I tend to think of them as “people problems.”  I won’t cover those here for the most part, unless there’s some technical solution that we can expose that may help others.

So what about the technical?  Now those are more interesting to programmers.  We have lots of things that grow along with traffic, to name a few:

  • Bandwidth (and by-proxy, CDN dependency)
  • Traffic logs (HAProxy logs)
  • CPU/Memory utilization (more processing/cache involved for more users/content)
  • Performance (inefficient things take more time/space, so we have to constantly look for wins in order to stay on the same hardware)
  • Database Size

I’ll probably get to all of the above in the coming months, but today let’s focus on our current issue: Stack Overflow is running out of space.  This isn’t news, and it isn’t shocking; anything that grows is going to run out of room eventually.

 

This story starts a little over a year ago, right after I was hired.  One of the first things I was asked to do was a growth analysis of the Stack Overflow DB. You can grab a copy of it here (Excel format).  Brent Ozar, our go-to DBA expert/database magician, told us you can’t predict growth that accurately…and he was right.  There are so many unknowns: traffic from new sources, new features, changes in usage patterns (e.g. more editing caused a great deal more space usage than ever before).  I projected that right now we’d be at about 90 GB of Stack Overflow data; we’re at 113 GB.  That’s not a lot, right?  Unfortunately, it’s all relative…and it is a lot.  We also currently have a transaction log of 41 GB.  Yes, we could shrink it…but on the next re-org of PK_Posts it’ll just grow to the same size (edit: thanks to Kendra Little for updating my knowledge here – it was the clustered index re-org itself eating that much transaction log space; the other indexes on the table haven’t needed rebuilding as part of the clustered re-org since SQL 2000).

 

These projections were used to influence our direction on where we were going with scale, up vs. out.  We did a bit of both.  First let’s look at the original problems we faced when the entire network ran off one SQL Server box:

  • Memory usage (even at 96GB, we were over-crammed, we couldn’t fit everything in memory)
  • CPU (to be fair, this had other factors like Full Text search eating most of the CPU)
  • Disk IO (this is the big one)

What happens when you have lots of databases is all your sequential performance goes to crap because it’s not sequential anymore.  For disk hardware, we had one array for the DB data files: a RAID 10, 6 drive array of magnetic disks.  When dozens of DBs are competing in the disk queue, all performance is effectively random performance.  That means our read/write stalls were way higher than we liked.  We tuned our indexing and trimmed as much as we could (you should always do this before looking at hardware), but it wasn’t enough.  Even if it was enough there were the CPU/Memory issues of the shared box.

Ok, so we’ve outgrown a single box, now what?  We got a new one specifically for the purpose of giving Stack Overflow its own hardware.  At the time this decision was made, Stack Overflow was a few orders of magnitude larger than any other site we have.  Performance-wise, it’s still the 800 lb. gorilla.  A very tangible problem here was that Stack Overflow was so large and “hot” that it was a bully in terms of memory, forcing lesser sites out of memory and causing slow disk loads for queries after idle periods.  Seconds to load a home page? Ouch. Unacceptable.  It wasn’t just a hardware decision though; it had a psychological component.  Many people on our team just felt that Stack Overflow, being the huge central site in the network that it is, deserved its own hardware…that’s the best I can describe it.

Now, how does that new box solve our problems?  Let’s go down the list:

  • Memory (we have another 96GB of memory just for SO, and it’s no longer eating massive amounts on the original box – win)
  • CPU (fairly straightforward: it’s now split and we have 12 new cores to share the load, win)
  • Disk IO (what’s this? SSDs have come out, game. on.)

We looked at a lot of storage options to solve that IO problem.  In the end, we went with the fastest SSDs money could buy.  The configuration on that new server is a RAID 1 for the OS (magnetic) and a RAID 10 of 6x Intel X-25E 64GB, giving us 177 GB of usable space (6x 64 GB is 384 GB raw; mirroring halves that to 192 GB, and formatting brings it to about 177 GB).  Now let’s do the math on what’s on that new box as of today:

  • 114 GB – StackOverflow.mdf
  • 41 GB – StackOverflow.ldf

With a few other miscellaneous files on there, we’re up to about 156 GB used.  156 GB used out of 177 GB leaves roughly 12% free space.  Time to panic? Not yet.  Time to plan? Absolutely.  So what is the plan?

We’re going to be replacing these 64GB X-25E drives with 200GB Intel 710 drives.  We’re going with the 710 series mainly for the endurance they offer.  And we’re going with 200GB and not 300GB because the price difference just isn’t worth it, not with the high likelihood of rebuilding the entire server when we move to SQL Server 2012 (and possibly into a cage at that data center).  We simply don’t think we’ll need that space before we stop using these drives 12-18 months from now.

Since we’re eating an outage to do this upgrade (unknown date, those 710 drives are on back-order at the moment) why don’t we do some other upgrades?  Memory of the large capacity DIMM variety is getting cheap, crazy cheap.  As the database grows, less and less of it fits into memory, percentage-wise.  Also, the server goes to 288GB (16GB x 18 DIMMs)…so why not?  For less than $3,000 we can take this server from 6x16GB to 18x16GB and just not worry about memory for the life of the server.  This also has the advantage of balancing all 3 memory channels on both processors, but that’s secondary.  Do we feel silly putting that much memory in a single server? Yes, we do…but it’s so cheap compared to say a single SQL Server license that it seems silly not to do it.

I’ll do a follow-up on this after the upgrade (on the Server Fault main blog, with a stub here).

In this blog, I aim to give you some behind the scenes views of what goes on at Stack Exchange and share some lessons we learn along the way.

Life at Stack Exchange is pretty busy at the moment; we have lots of projects in the air.  In short, we’re growing, and growing fast.  What effect does this have?

While growth is awesome (it’s what almost every company wants to do), it’s not without technical challenges.  A significant portion of our time is currently devoted to fighting fires in one way or another, whether it be software issues with community scaling (like the mod flag queue) or actual technical walls (like drive space, Ethernet limits).

Off the top of my head, these are just a few items from the past few weeks:

  • We managed to completely saturate our outbound bandwidth in New York (100 Mbps).  When we took an outage a few days ago to bump a database server from 96GB to 144GB of RAM, we served error pages without the backing of our CDN…and it turns out that’s not something we’re quite capable of doing anymore.  There were other factors at play here, but the bottom line is we’ve grown too far to serve even static HTML and a few small images off that 100 Mbps pipe. We need a CDN at this point, but just to be safe we’ll be upping that connection at the data center as well.
  • The Stack Overflow database server is running out of space.  Those Intel X25-E SSD drives we went with have performed superbly, but a RAID 10 of 6x 64GB (177GB usable) only goes so far.  We’ll be bumping those drives up to 200GB Intel 710 SSDs for the next 12-18 months of growth.  Since we have to eat an outage to do the swap and memory is incredibly cheap, we’ll be bumping that database server to 288GB as well.
  • Our original infrastructure in Oregon (which now hosts Stack Exchange chat) is too old and a bit disorganized – we’re replacing it.  Oregon isn’t only a home for chat and data explorer, it’s the emergency failover if anything catastrophic were to happen in New York.  The old hardware just has no chance of standing up to the current load of our network – so we’re replacing it with shiny new goodies.
  • We’ve changed build servers – we’re building lots of projects across the company now and we need something that scales and is a bit more extensible.  We moved from CruiseControl.Net to TeamCity (still in progress, will be completed with the Oregon upgrade).
  • We’re in the process of changing core architecture to continue scaling.  The tag engine that runs on each web server is doing duplicate work and running multiple times.  The search engine (built on Lucene.Net) is both running from disk (not having the entire index loaded into memory) and duplicating work.  Both of these are solvable problems, but they need a fundamental change.  I’ll discuss this further in upcoming posts; hopefully we’ll have some more open source goodness to share with the community as a result.
  • Version 2.0 on our API is rolling out (lots of SSL-related scaling fun around this behind the scenes).
  • A non-trivial amount of time has gone into our monitoring systems as of late.  We have a lot of servers running a lot of stuff, and we need to see what’s going on.  I’ll go into more detail on this later.  Since there seems to be at least some demand for open-sourcing the dashboard we’ve built, we will as soon as time permits.

There are lots of things going on around here; as I get time, I’ll try to share more detailed happenings like the above examples with you as we grow.  Not many companies grow as fast as we are with as little hardware, or with as much passion for performance.  I don’t believe anyone else runs the architecture we do at the scale we’re running at (traffic-wise, we actually have very little hardware being utilized); we’re both passionate and insane.

We’ll go through some tough technical changes coming up, both paying down technical debt and provisioning for the future.  I’ll try to share as much of that as I can here, for those who are merely curious what happens behind the curtain and for those who are going through the same troubles we already have; maybe our experiences can help you out.
