A question we often get asked at Stack Exchange is why stackoverflow.com and all our other domains aren’t served over SSL.  It’s a user request we see at least a few times a month asking to ensure their security and privacy.  So why haven’t we done it yet?  I wanted to address that here, it’s not that we don’t want to do it, it’s just a lot of work and we’re getting there.

So, what’s needed to move our network to SSL? Let’s make a quick list:

  • Third party content must support SSL:
    • Ads
    • Avatars
    • Facebook
    • Google Analytics
    • Inline Images & Videos
    • MathJax
    • Quantcast
  • Our side has to support SSL:
    • Our CDN
    • Our load balancers
    • The sites themselves
    • Websockets
    • Certificates (this one gets interesting)

Ok, so that doesn’t look so hard, what’s the big deal?  Let’s look at third party content first.  Note that with all of these items, they’re totally outside our control.  All we can do is ask for them to support SSL…but luckily we work with awesome people that are they’re helping us out.

    • Ads: We’re working with Adzerk to support SSL, this had to be done on their side and it’s ready for testing now.
    • Avatars: Gravatars and Imgur can support SSL – Gravatar is ready but i.stack.imgur.com where our images are hosted is not yet (we’re working on this).
    • Facebook: done.
    • Google Analytics: done.
    • Inline Images: we can’t include insecure content on the page…so that means turning our images to SSL when i.stack.imgur.com is ready.  For other domains images are embedded from we have to turn them into links, or solve via another approach.
    • MathJax: we currently use MathJax’s CDN for that content, but they don’t currently support SSL so we may have to host this on our CDN.
    • Quantcast: done – under another domain.

Now here’s where stuff gets fun, we have to support SSL on our side.  Let’s cover the easy parts first.  Our CDN has to support SSL.  Okay, it’s not cheap but it’s a problem we can buy away.  We only use cdn.sstatic.net for production content so a cdn.sstatic.net & *.cdn.sstatic.net combo cert should cover us.  The CDN endpoint on our side (it’s a pull model) has to support SSL as well so it’s an encrypted handshake on both legs, but that’s minor traffic and easily doable.

With websockets we’ll just try and see when all dependencies are in place.  We don’t anticipate any particular problems there, and when it works it should work for more people.  Misconfigured or old proxies tend to interfere with HTTP websocket traffic they don’t understand, but those same proxies will just forward on the encrypted HTTPS traffic.

Of course, our load balancers have to support SSL as well, so let’s take a look at how that works in our infrastructure.  Whether our SSL setup is typical or not I have no idea, but here’s how we currently do SSL:

nginx network layout

HTTPS traffic goes to nginx on the load balancer machines and terminates there.  From there a plain HTTP request is made to HAProxy which delegates the request to whichever web server set it should go to.  The response then goes back along the same path.  So you have a secure connection all the way to us, but inside our internal network it’s transitioned to a regular HTTP request.

So what’s changing?  Logically we’re not changing much, we’re just switching from nginx to HAProxy for the terminator.  The request now goes all the way to HAProxy (specifically, a process only for SSL termination) then to the HTTP front-end to continue to a web server.  This is both a management-easing change (one install and config via puppet) and hope that HAProxy handles SSL more efficiently, since CPU load on the balancers is an unknown once we go full SSL.  An HAProxy instance (without crazy) only ties to a single core, but you can have many SSL processes doing termination all feeding to the single HTTP front-end process.  With this approach, the heavy load does scale out across our 12 physical core machines well…we hope.  If it doesn’t work then we need active/active load balancers, which is another project we’re working on just in case.

Now here’s the really fun part, certificates.  Let’s take a look a sample of domains we’d have to cover:

  • stackoverflow.com
  • meta.stackoverflow.com
  • stackexchange.com
  • careers.stackoverflow.com
  • gaming.stackexchange.com
  • meta.gaming.stackexchange.com
  • superuser.com
  • meta.superuser.com

Ok so the top level domains are easy, a SAN cert which allows many domains on a single cert – we can sanely combine up to 100 here.  So what about all of our *.stackexchange.com domains? A wildcart cert, excellent we’re knocking these out like crazy. What about meta.*.stackexchange.com? Damn. Can’t do that. You can’t have a wildcard of that form – at least not one supported by most major browsers, which means effectively it’s not an option.  Let’s see where these restrictions originate.

Section 3.1 of RFC 2818 is very open/ambiguous on wildcard usage, it states:

Names may contain the wildcard character * which is considered to match any single domain name component or component fragment. E.g., *.a.com matches foo.a.com but not bar.foo.a.com. f*.com matches foo.com but not bar.com.

It doesn’t really disallow meta.*.stackexchange.com or *.*.meta.stackexchange.com.  So far so good…then some jerk tried to make a certificate for *.com which obviously wasn’t good, so that was revoked and disallowed.  So what happened? Some other jerk went and tried *.*.com.  Well, that ruined it for everyone.  Thanks, jerks.

The rules were further clarified in Section 6.4.3 of RFC 6125 which says (emphasis mine):

The client SHOULD NOT attempt to match a presented identifier in which the wildcard character comprises a label other than the left-most label (e.g., do not match bar.*.example.net)

This means no *.*.stackexchange.com or meta.*.stackexchange.com.  Enough major browsers conform to this RFC that it’s a non-option.  So what do we do?  We thought of a few approaches.  We would prefer not to change domains for our content, so the first thought was setting up an automated operation to install new entries on a SAN cert for each new meta created.  As we worked though this option, we found several major problems:

  • We are limited to approximately 100 entries per SAN cert, so every new 100 metas means another IP allocation on our load balancer. (This appears to be an industry restriction due to abuse, rather than a technical limitation)
  • The IP usage issues would be multiplied as we move to active/active load balancers, draining our allocated IP ranges faster and putting an even shorter lifetime on this solution.
  • It delays site launches, due to waiting on certificates to be requested, generated, merge, received and installed.
  • Every site launch has to have a DNS entry for the meta, this exposes us to additional risk of DNS issues and complicates data center failover.
  • We have to build an entire system to support this, rotating through certificates a bit rube goldberg style, installing into HAProxy, writing to BIND, etc.

So what do we do?  We’re not 100% decided yet – we’re still researching and talking with people.  We may have to do the above.  The alternatives would be to shift child meta domains to be on the same level as *.stackexchange.com or under a common *.x.stackexchange.com for which we can get a wildcard.  If we shift domains, we have to put in redirects and update URLs in already posted content to point to the new place (to save users a redirect).  Also changing the domain to not be a child means setting a child cookie on the parent domain that is shared down is no longer an option – so the login model has to change there but still be as transparent and easy as possible.

Now let’s say we do all of that and it all works, what happens to our google rank when we start sending everyone to HTTPS, including crawlers?  We don’t know, and it’s a little scary…so we’ll test the best we can with a slow rollout.  Hopefully, it won’t matter. We have to send everything to SSL because otherwise logged-in users would get a redirect every time after clicking an http:// link from google…that needs to be an https:// from the search results.

Some of those simple items above aren’t so simple, especially changing child meta logins and building cert grabbing/installing system.  So is that why we aren’t doing it?  Of course not, we love pain – so we are actively working on SSL now that some of our third party content providers are ready.  We’re not ignoring the request for enhanced security and privacy while using our network, it’s just not as simple as many people seem to think it is at first glance – not when you’re dealing with our domain variety.  We’ll be working on it over the next 6-8 weeks.

The best part of working for Stack Exchange is that our code, network and databases are so awesome that we never throw exceptions.

Okay, okay, back to reality.  Everyone has exceptions, so how do we handle the first step of recording them?  Most .Net developers have heard of ELMAH (Error Logging Modules and Handlers), and Stack Exchange started out using a modified version of this for over a year.  The setup was simple: we were using the XML file store pointing to a share on one of the web servers (due to most of the problems being SQL or network related when shit hit the fan).

What this setup didn’t allow for was high-traffic logging, since the 200 file loop necessitated a directory file listing (not so fast in windows), reading existing files to see if it was a duplicate, then updating the duplicate with a new count if one was found.  Take into account the level of traffic Stack Overflow gets at any point in the day and all this meant was we effectively took that web server out of rotation (due to pegging it’s network throughput just logging the errors…yes, we saw the humor there).

When I finally got some time to stick the errors into SQL in a way that fit our needs, I wrote StackExchange.Exceptional (with input from all our dev teams along the way).  It certainly borrows a fundamental idea from ELMAH (multiple stores on a single error interface mainly), but after that they diverge significantly.  We needed a few things that no existing error handler put in one package:

  • High speed logging (upwards of 100,000 exceptions/minute)
  • Error roll-ups (for similar exceptions, showing a duplicate count that increases rather than logging a separate entry)
  • Handling the case where we can’t reach the central error store (e.g. the connection to SQL is interrupted)
  • Custom data for our exceptions
  • Querying relevant exception data

All but the last one went into StackExchange.Exceptional directly, the last one factors into a bigger picture coming soon.

Let me explain why we need the above.  The high volume one’s pretty easy; we get a lot of traffic on Stack Overflow alone.  When shit hits the fan, the exceptions roll in pretty fast.  For example when redis goes offline or SQL server is unreachable we’re throwing 10,000 errors in under a few seconds.  While we’re throwing lots of errors, they’re not likely to tell us much in all that repetition. 10,000 of the same exception helps us no more in debugging than that error logged once with a x10,000 beside it…so that’s what Exceptional does.  If an error has the same stack trace then we just increase a duplicate counter on the SQL row, instead of logging another row.

When we’re throwing exceptions due to a network issue, guess what trying to log over the network does.  Yeah, this one’s not hard to see coming, so what Exceptional does is keep a memory-based exception store that queues up to 1000 exceptions (duplicates roll up in that 1000) when it’s unable to write whatever remote store you’ve configured (SQL or JSON).  It will retry writing the exceptions every 2 seconds.  Once successful, things go back to normal and exceptions are logged as they come in.  If there is currently a problem connecting to the error store, the issue will be shown in the exceptions list.

The last two are probably less relevant.  Custom data just enables storing of string name/value pairs with the exception for display only or use via JavaScript includes on the exception views.  I’ll cover this in detail on the project page in a specific wiki, for now the quick bits are in the setup wiki.  Querying is not really covered in Exceptional, it just has a SQL structure that’s friendly for doing so (relevant bits of exceptions broken out into individual columns).  The actual searching is something we have, but I have to do a bit more unexpected work to get that dashboard out the door.  There are other benefits this netted us, but they’re hard to explain without a full dashboard post when that’s released…I’ll try and do so as quickly as I can figure out the open-source charting story.

I’ve added a few initial wikis to github for getting Exceptional up and running, I’ll try and get an example project up as well…in the next few days as time allows.  Here’s a quick view of the list/detail screens to get a feel:

Update: A sample project is now posted alongside Exceptional core on github.  Some additional store providers (e.g. MongoDB) have already been requested, stay tuned for those – they’ll show up in the form of other packages with a dependency on StackExchange.Exceptional, so if you don’t want that store’s driver, you won’t have to include it.

For starters, let’s get a bad assumption I personally had before being hired out of the way: you don’t see most features we deploy.  In fact a small percentage of features we deploy are for regular users, or seen directly.  Aside from user-facing features, there exists a great deal of UI and infrastructure for our moderators, and even more for developers.  Besides features directly on the site, a lot more goes on behind the scenes.  Here’s a quick list of what’s underway right now:

  • Moving search off the web tier, a little over a year after putting it there.
  • Moving the tag engine off the web tier (you may have not even heard this exists).
  • Real-time updates to a number of site elements (voting, new answers, comments, your rep changes, etc.)
  • Deploying Oregon failover cluster (and other related infrastructure changes)
  • [redacted] sidebar project
  • [redacted] profile project
  • Home-grown TCP Server
  • Home-grown WebSockets server (built on the base TCP)
  • Home-grown error handler (similar to elmah but suited for our needs)
  • Monitoring dashboard improvements (on the way to open source)
  • Major network infrastructure changes (next Saturday, 4/14/2012)

That’s just the major stuff in process right now, of course there are a lot of other minor features and bug fixes being rolled out all the time.  There are some inter-connected pieces here, let’s first look at the how some of these groups of projects came to be.

When you visit a tag page, or pretty much anything to do with finding or navigating via a tag, you’re hitting our tag engine.  This little gem was cooked up by Marc Gravell and Sam Saffron to take the load off SQL Server (where it was Full Text Search-based previously) and do it much more efficiently on the web tier, inside the app domain.  This has been a tremendous performance win, but can we do it better? absolutely.  We can always do it better, the question is: at what cost?  The tag engine is now used a few times on each web server, for example inside the API as well as the main Q&A application (which handles all SE site requests, including Stack Overflow).  This is duplicate work…as is running it on all 11 web servers.  This same duplicate work wastefulness is true of our search indexing.  While it’s quite redundant, it’s too much so, and in practice doesn’t gain us anything…so what do we do?

We need to move some things into a service-oriented architecture, tag engine and search are first up.  Yes, there are other approaches, but this is how we’re choosing to do it.  So how do we go about this? We ask Marc what he can cook up, and he comes back with awesome every time.

There are a two things in play here, websockets and internal traffic…both have very different behavior and needs.  Internal API requests (tag engine, search) are a few clients (the web tier) getting a *lot* of data across those connections.  Websockets is a lot of consumers (tens of thousands or more) getting a little data.  We can optimize each case, but they are different and need independent love (possibly with separate implementations).

Enter our in-house TCP server, still a work in progress. This is what’s already powering our websockets/real-time functionality, so it’s optimized for the many clients/little data case.  If we start using this internally, it may very well be a different implementation optimized for the other end of the spectrum.  I won’t go into more detail on this because things change greatly with the next .Net release, and the TCP server can be much more efficient by being based on HttpListener (though it’s crazy efficient right now, Marc and normal humans have different definitions, so by his standards it needs improvement).  Accordingly, we’ll wait to open source this until that large refactor happens, the same goes for the websockets server impersonation built on top of it.

I’ve also spent a bit of my time lately replacing our exception handler, which we’ll be open sourcing after testing in production for a bit.  Currently it’s file system based and…well, it melts if we are in a high velocity error-throwing situation.  Instead we’ll be moving to a SQL-based error store (with JSON file and memory support for those who want it, it’s pluggable so we can add more stores later…like redis) with an in-memory backup buffer that’ll handle SQL being down as well.  This was needed because we monitor exceptions across all applcations in our monitoring dashboard…and we also want to open source that.  That means there’s a few dominos to get in place first.  For the same reasoning as projects above, the real-time websockets based real-time monitoring will have to wait for a future release (hopefully not too long)…but that’s a separate post, also coming soon.

There are some good things coming, especially in terms of open sourcing some of our goodies.  While we have some things out there already, we’ll be adding more…and a few improvements to the existing projects.  Also, we’ll be creating a central place where you find all these things.  That blog post is a good start, but we plan on giving our open source creations a permanent home so the community can find and help us improve them, so everyone can benefit.

This will be the first in an series of interspersed posts about how our backup/secondary infrastructure is built and designed to work.

Stack Overflow started as the only site we had. That was over 3 years ago (August 2008) in a data center called PEAK Internet in Corvallis, Oregon.  Since them we’ve grown a lot, moved the primary network to New York, and added room to grow in the process.  A lot has changed since then in both locations, but much of the activity has stuck to the New York side of things.  The only services we’re currently running out of Oregon are chat and data explorer (so if you’re wondering why chat still runs during most outages, that’s why).

Back a few months ago we outgrew our CruiseControl.Net build system, changing over to TeamCity by JetBrains.  We did this for manageability, scalability, extensibility and because it’s just generally a better product (for our needs at least)  These build changes were pretty straightforward in the NY datacenter because we have a homogeneous web tier.  Our sysadmins insist that all the web servers be identical in configuration, and it pays off in many ways…such as when you change them all to a new build source. Once NY was converted, it was time to set our eyes on Oregon.  This was going to net us several benefits: a consistent build, a single URL (NY and OR CC.Net instances were in no way connected, the same version, etc.), and a single build system managing it all – including notifications, etc.

So what’s the problem?  Oregon is, for lack of a more precise description, a mess.  No one here forgets it’s where we started out, and the configuration there was just fine at the time, but as you grow things need to be in order.  We felt the time has come to do that organization.  Though some cleanup and naming conventions were applied when we joined OR to the new domain a few months ago, many things were all over the place.  Off the top of my head:

  • Web tier is on Windows 2k8, not 2k8 SP1
  • Web tier is not homogeneous
    • OR-WEB01 – Doesn’t exist, this became a DNS server a looooong time ago
    • OR-WEB02 – Chat, or.sstatic.net
    • OR-WEB03 – Stack Exchange Data Explorer, CC.Net primary build server
    • OR-WEB04 – or.sstatic.net
    • OR-WEB05 – or.sstatic.net, used to be a VM server (we can’t get this to uninstall, heh)
    • OR-WEB06 – Chat
  • The configuration looks nothing like NY
  • Automatic updates are a tad bit flaky
  • Missing several components compared to NY (such as physical redis, current & upcoming service boxes)

So we’re doing what any reasonable person would do.  NUKE. EVERYTHING. New hardware for the web tier and primary database server has already been ordered by our Sysadmin team (Kyle’s lead on this one) and racked by our own Geoff Dalgas.  Here’s the plan:

  • Nuke the web tier, format it all
  • Replace OR-DB01 with the new database server with plenty of space on 6x Intel 320 Series 300GB drives
  • Re-task 2 of the web tier as Linux load balancers running HAProxy (failover config)
  • Re-task the old OR-DB01 as a service box (upcoming posts on this – unused at the moment, but it has plenty of processing power and memory, so it fits)
  • Install 4 new web tier boxes as OR-WEB01 through OR-WEB04

Why all of this work just for Stack Exchange Chat & Data Explorer?  Because it’s not just for that.  Oregon is also our failover in case of catastrophic failure in New York.  We send backups of all databases there every night.  In a pinch, we want to switch DNS over and get Oregon up ASAP (probably in read-only mode though, until we’re sure NY can’t be recovered any time soon). The OR web tier will tentatively look something like this:

  • OR-WEB01 – Chat, or.sstatic.net
  • OR-WEB02 – Chat, or.sstatic.net
  • OR-WEB03 – Data Explorer
  • OR-WEB04 – Idle

Now that doesn’t look right, does it?  I said earlier that the web tier should be homogeneous and that’s true.  The above list is what’s effectively running on each server.  In reality (just like New York) they’ll have identical IIS configs, all running the same app pools.  The only difference is which ones get traffic for which sites via HAProxy.  Ones that don’t get traffic, let’s say OR-WEB04 for chat, simply won’t spin up that app pool. In addition to the above, each of the servers will be running everything else we have in New York, just not active/getting any traffic.  This includes things like every Q&A site in the network (including Stack Overflow), stackexchange.com, careers.stackoverflow.com, area51.stackexchange.com, openid.stackexchange.com, sstatic.net, etc.  All of these will be in a standby state of some sort…we’re working on the exact details.  In any case, it won’t be drastically different from the New York load balancing setup, which I’ll cover in detail in a future post.

Things will also get more interesting on the backup/restore side with the SQL 2012 move.  I’ll also do a follow-up post on our initial plans around the SQL upgrade in the coming weeks – we’re waiting on some info around new hardware in that department.

Stack Overflow handles a lot of traffic.  Quantcast ranks us (at the time of this writing) as the 274th largest website in the US, and that’s rising.  That means everything that traffic relates to grows as well.  With growth, there are 2 areas of concern I like to split problems into: technical and non-technical.

Non-technical would be things like community management, flag queues, moderator counts, spam protection, etc.  These might (and often do) end up with technical solutions (that at least help, if not solve the problem), but I tend to think of them as “people problems.”  I won’t cover those here for the most part, unless there’s some technical solution that we can expose that may help others.

So what about the technical?  Now those are more interesting to programmers.  We have lots of things that grow along with traffic, to name a few:

  • Bandwidth (and by-proxy, CDN dependency)
  • Traffic logs (HAProxy logs)
  • CPU/Memory utilization (more processing/cache involved for more users/content)
  • Performance (inefficient things take more time/space, so we have to constantly look for wins in order to stay on the same hardware)
  • Database Size
I’ll probably get to all of the above in the coming months, but today let’s focus on what our current issue is. Stack Overflow is running out of space.  This isn’t news, it isn’t shocking; anything that grows is going to run out of room eventually.

 

This story starts a little over a year ago, right after I was hired.  One of the first things I was asked to do was growth analysis of the Stack Overflow DB. You can grab a copy of it here (excel format).  Brent Ozar, our go-to DBA expert/database magician, told us you can’t predict growth that accurately…and he was right.  There are so many unknowns: traffic from new sources, new features, changes in usage patterns (e.g. more editing caused a great deal more space usage than ever before).  I projected that right now we’d be at about 90GB in Stack Overflow data; we are at 113GB.  That’s not a lot, right?  Unfortunately, it’s all relative…and it is a lot.  Also, we currently have a transaction log of 41GB.  Yes, we could shrink it…but on the next re-org of the PK_Posts it’ll just grow to the same size (edit: thanks to Kendra Little on updating my knowledge base here, it was the cluster re-org itself eating that much transaction space, other indexes on the table need not be rebuilt in the clustered’s re-org since SQL 2000).

 

These projections were used to influence our direction on where we were going with scale, up vs. out.  We did a bit of both.  First let’s look at the original problems we faced when the entire network ran off one SQL Server box:
  • Memory usage (even at 96GB, we were over-crammed, we couldn’t fit everything in memory)
  • CPU (to be fair, this had other factors like Full Text search eating most of the CPU)
  • Disk IO (this is the big one)

What happens when you have lots of databases is all your sequential performance goes to crap because it’s not sequential anymore.  For disk hardware, we had one array for the DB data files: a RAID 10, 6 drive array of magnetic disks.  When dozens of DBs are competing in the disk queue, all performance is effectively random performance.  That means our read/write stalls were way higher than we liked.  We tuned our indexing and trimmed as much as we could (you should always do this before looking at hardware), but it wasn’t enough.  Even if it was enough there were the CPU/Memory issues of the shared box.

Ok, so we’ve outgrown a single box, now what?  We got a new one specifically for the purpose of giving Stack Overflow its own hardware.  At the time this decision was made, Stack Overflow was a few orders of magnitude larger than any other site we have.  Performance-wise, it’s still the 800 lb. gorilla.  A very tangible problem here was that Stack Overflow was so large and “hot,” it was a bully in terms of memory, forcing lesser sites out of memory and causing slow disk loads for queries after idle periods.  Seconds to load a home page? Ouch. Unacceptable.  It wasn’t just a hardware decision though, it had a psychological component.  Many people on our team just felt that Stack Overflow, being the huge central site in the network that is is, deserved its own hardware…that’s the best I can describe it.

Now, how does that new box solve our problems?  Let’s go down the list:

  • Memory (we have another 96GB of memory just for SO, and it’s not using massive amounts on the original box, win)
  • CPU (fairly straightforward: it’s now split and we have 12 new cores to share the load, win)
  • Disk IO (what’s this? SSDs have come out, game. on.)

We looked at a lot of storage options to solve that IO problem.  In the end, we went with the fastest SSDs money could buy.  The configuration on that new server is a RAID 1 for the OS (magnetic) and a RAID 10 6x Intel X-25E 64GB, giving us 177 GB of usable space.  Now let’s do the math of what’s on that new box as of today:

  • 114 GB – StackOverflow.mdf
  • 41 GB – StackOverflow.ldf

With a few other miscellaneous files on there, we’re up to 156 GB.  155/177 = 12% free space.  Time to panic? Not yet.  Time to plan? Absolutely.  So what is the plan?

We’re going to be replacing these 64GB X-25E drives with 200GB Intel 710 drives.  We’re going with the 710 series mainly for the endurance they offer.  And we’re going with 200GB and not 300GB because the price difference just isn’t worth it, not with the high likelihood of rebuilding the entire server when we move to SQL Server 2012 (and possibly into a cage at that data center).  We simply don’t think we’ll need that space before we stop using these drives 12-18 months from now.

Since we’re eating an outage to do this upgrade (unknown date, those 710 drives are on back-order at the moment) why don’t we do some other upgrades?  Memory of the large capacity DIMM variety is getting cheap, crazy cheap.  As the database grows, less and less of it fits into memory, percentage-wise.  Also, the server goes to 288GB (16GB x 18 DIMMs)…so why not?  For less than $3,000 we can take this server from 6x16GB to 18x16GB and just not worry about memory for the life of the server.  This also has the advantage of balancing all 3 memory channels on both processors, but that’s secondary.  Do we feel silly putting that much memory in a single server? Yes, we do…but it’s so cheap compared to say a single SQL Server license that it seems silly not to do it.

I’ll do a follow-up on this after the upgrade (on the Server Fault main blog, with a stub here).

In this blog, I aim to give you some behind the scenes views of what goes on at Stack Exchange and share some lessons we learn along the way.

Life at Stack Exchange is pretty busy at the moment; we have lots of projects in the air.  In short, we’re growing, and growing fast.  What effect does this have?

While growth is awesome (it’s what almost every company wants to do), it’s not without technical challenges.  A significant portion of our time is currently devoted to fighting fires in one way or another, whether it be software issues with community scaling (like the mod flag queue) or actual technical walls (like drive space, Ethernet limits).

Off the top of my head, these are just a few items from the past few weeks:

  • We managed to completely saturate our outbound bandwidth in New York (100mbps).  When we took an outage a few days ago to bump a database server from 96GB to 144GB of RAM, we served error pages without the backing of our CDN…turns out that’s not something we’re quite capable of doing anymore.  There were added factors here, but the bottom line is we’ve grown too far to serve even static HTML and a few small images off that 100mbps pipe. We need a CDN at this point, but just to be safe we’ll be upping that connection at the datacenter as well.
  • The Stack Overflow database server is running out of space.  Those Intel X25-E SSD drives we went with have performed superbly, but a raid 10 of 6x64GB (177GB usable) only goes so far.  We’ll be bumping those drives up to 200GB Intel 710 SSDs for the next 12-18 months of growth.  Since we have to eat an outage to do the swap and memory is incredibly cheap, we’ll be bumping that database server to 288GB as well.
  • Our original infrastructure in Oregon (which now hosts Stack Exchange chat) is too old and a bit disorganized – we’re replacing it.  Oregon isn’t only a home for chat and data explorer, it’s the emergency failover if anything catastrophic were to happen in New York.  The old hardware just has no chance of standing up to the current load of our network – so we’re replacing it with shiny new goodies.
  • We’ve changed build servers – we’re building lots of projects across the company now and we need something that scales and is a bit more extensible.  We moved from CruiseControl.Net to TeamCity (still in progress, will be completed with the Oregon upgrade).
  • We’re in process of changing core architecture to continue scaling.  The tag engine that runs on each web server is doing duplicate work and running multiple times.  The search engine (built on Lucene.Net) is both running from disk (not having the entire index bank loaded into memory) and duplicating work.  Both of these are solvable problems, but they need a fundamental change.  I’ll discuss this further coming up; hopefully we’ll have some more open source goodness to share with the community as a result.
  • Version 2.0 on our API is rolling out (lots of SSL-related scaling fun around this behind the scenes).
  • A non-trivial amount of time has gone into our monitoring systems as of late.  We have a lot of servers running a lot of stuff, we need to see what’s going on.  I’ll go into more detail on this later.  Since there seems to be at least some demand for open-sourcing the dashboard we’ve built, we will as soon as time permits.

There are lots of things going on around here, as I get time I’ll try to share more detailed happenings like the above examples with you as we grow.  Not many companies grow as fast as we are with as little hardware or as much passion for performance.  I don’t believe anyone runs the architecture we do at the scale we’re running at (traffic-wise, we actually have very little hardware being utilized); we’re both passionate and insane.

We’ll go through some tough technical changes coming up, from both paying down technical debt and provisioning for the future.  I’ll try and share as much as I can of that here, for those who are merely curious what happens behind the curtain and those who are going through the same troubles we already have, maybe our experiences can help you out.

Copyright © 2013 . All rights reserved. Theme based on the Jarrah by Templates Next