Why you should wait on upgrading to .Net 4.6

Update (August 11th): A patch for this bug has been released by Microsoft. Here’s their update to the advisory:

We released an updated version of RyuJIT today, which resolves this advisory. The update was released as Microsoft Security Bulletin MS15-092 and is available on Windows Update or via direct download as KB3086251. The update resolves: CoreCLR #1296, CoreCLR #1299, and VisualFSharp #536. Major thanks to the developers who reported these issues. Thanks to everyone for their patience.

Original Post

What follows is the work of several people: Marc Gravell and I have taken lead on this at Stack Overflow and we continue to coordinate with Microsoft on a resolution. They have fixed the bug internally, but not for users. Given the severity, we can’t in good conscience let such a subtle yet high-impact bug linger silently. We are not upgrading Stack Overflow to .Net 4.6, and you shouldn’t upgrade yet either. You can find the issue we opened on GitHub (for public awareness) here. A fix has been released, see Update 5 below.

Update #1 (July 27th): A pull request has been posted by Matt Michell (Microsoft).

Update #2 (July 28th): There are several smaller repros now (including a small console app). Microsoft has confirmed they are working on an expedited hotfix release but we don’t have details yet.

Update #3 (July 28th): Microsoft’s Rich Lander has posted an update: RyuJIT Bug Advisory in the .NET Framework 4.6.

Update #4 (July 29th): There’s another subtle bug found by Andrey Akinshin and the F# Engine Exception is confirmed to be a separate issue. I still recommend disabling RyuJIT in production given the increasing bug count.

Update #5 (August 11th): A patch for this bug has been released by Microsoft, see above.

This critical bug is specific to .Net 4.6 and RyuJIT (64-bit). I’ll make this big and bold so we get to the point quickly:

The methods you call can get different parameter values than you passed in.

The JIT (Just-in-Time compiler) in .Net (and many platforms) does something called Tail Call optimization. This happens to alleviate stack load on the last-called method in a chain. I won’t go into what a tail call is because there’s already an excellent write up by David Broman.

The issue here is a bug in how RyuJIT x64 implements this optimization in certain situations. Let’s look at the specific example we hit at Stack Overflow (we have uploaded a minimal version of this reproduction to GitHub).

We noticed that MiniProfiler (which we use to track performance) was showing only on the first page load. The profiler then failed to show again until an application recycle. This turned out to be a caching bug based on the HTTP Cache usage locally. HTTP Cache is our “L1” cache at Stack Overflow; redis is typically the “L2.” After over a day of debugging (and sanity checking), we tracked the crazy to here:

void Set<T>(string key, T val, int? durationSecs, bool sliding, bool broadcastRemoveFromCache = false)
{
    SetWithPriority<T>(key, val, durationSecs, sliding, CacheItemPriority.Default);
}

void SetWithPriority<T>(string key, T val, int? durationSecs, bool isSliding, CacheItemPriority priority)
{
    key = KeyInContext(key);

    RawSet(key, val, durationSecs, isSliding, priority);
}

void RawSet(string cacheKey, object val, int? durationSecs, bool isSliding, CacheItemPriority priority)
{
    var absolute = !isSliding && durationSecs.HasValue 
                   ? DateTime.UtcNow.AddSeconds(durationSecs.Value) 
                   : Cache.NoAbsoluteExpiration;
    var sliding = isSliding && durationSecs.HasValue 
                  ? TimeSpan.FromSeconds(durationSecs.Value) 
                  : Cache.NoSlidingExpiration;

    HttpRuntime.Cache.Insert(cacheKey, val, null, absolute, sliding, priority, null);
}

What was happening? We were setting the MiniProfiler cache duration (passed to Set<T>) as 3600 seconds. But often (~98% of the time), we were seeing it immediately expire from HTTP cache. Next we narrowed this down to being a bug only when optimizations are enabled (the “Optimize Code” checkbox on your project’s build properties). At this point sanity is out the window and you debug everything.

Here’s what that code looks like now. Note: I have slightly shortened it to fit this page. The unaltered code is on GitHub here.

void Set<T>(string key, T val, int? durationSecs, bool sliding, bool broadcastRemoveFromCache = false)
{
    LocalCache.OnLogDuration(key, durationSecs, "LocalCache.Set");
    SetWithPriority<T>(key, val, durationSecs, sliding, CacheItemPriority.Default);
}

void SetWithPriority<T>(string key, T val, int? durationSecs, bool isSliding, CacheItemPriority priority)
{
    LocalCache.OnLogDuration(key, durationSecs, "LocalCache.SetWithPriority");
    key = KeyInContext(key);

    RawSet(key, val, durationSecs, isSliding, priority);
}

void RawSet(string cacheKey, object value, int? durationSecs, bool isSliding, CacheItemPriority priority)
{
    LocalCache.OnLogDuration(cacheKey, durationSecs, "RawSet");
    var absolute = !isSliding && durationSecs.HasValue 
                   ? DateTime.UtcNow.AddSeconds(durationSecs.Value) 
                   : Cache.NoAbsoluteExpiration;
    var sliding = isSliding && durationSecs.HasValue 
                  ? TimeSpan.FromSeconds(durationSecs.Value) 
                  : Cache.NoSlidingExpiration;

    HttpRuntime.Cache.Insert(cacheKey, value, null, absolute, sliding, priority, Removed);
    var evt = Added;
    if(evt != null) evt(cacheKey, value, absolute, sliding, priority, durationSecs, isSliding);
}

This is nothing fancy, all we have is some methods calling each other. Here’s the scary result of those LocalCache.OnLogDuration calls:

  • LocalCache.Set: 3600
  • LocalCache.SetWithPriority: 3600
  • RawSet: null, or 114, or 97, or some other seemingly random value

Here’s an example test run from the GitHub repo:
RyuJIT Tail Tests

The method we called did not get the parameters we passed. That’s it. The net result of this is that local cache (which we use very heavily) is either unreliable or non-existent. This would add a tremendous amount of load to our entire infrastructure, making Stack Overflow much slower and likely leading to a full outage.

That’s not why we’re telling you about this though. Let’s step back and look at the big picture. What are some other variable names we could use?

  • amountToWithdraw
  • qtyOfStockToBuy
  • carbonCopyToAccountId
  • voltage
  • targetAltitude
  • rotorVelocity
  • oxygenPressure
  • dosageMilliliters

Does that help put things in perspective?

This bug is not obvious for several reasons:

  • It only happens with optimizations enabled. For most developers and projects, that’s not in DEBUG and won’t show locally.
    • That means you’ll only see this in RELEASE, which for most people is only production.
  • Attaching a debugger alters the behavior. This almost always hides the issue.
  • Adding a Debug.WriteLine() will often fix the issue because of the tail change.
  • It won’t reproduce in certain scenarios (e.g. we can’t repro this in a console application or VS hosting, only IIS).
  • Given the nature of the bug, as far as we can tell, it can equally affect any framework library as well.
  • It can happen in a NuGet library (most of which are RELEASE); the issue may not be in your code at all.

To address an obvious question: is this a security issue? Answer: it can be. It’s not something you could actively exploit in almost all cases, since stack variables are the things being swapped around here. However, it can be an issue indirectly. For example, if your code makes an assumption with a null param like if (user == null) { user = SystemUser; }, then a null passed in will certainly be a problem, giving other users that access sporadically. A more common example of this would be a value for a Role enum being passed incorrectly.

Recommendations

  1. Do not install .Net 4.6 in production.
  2. If you have installed .Net 4.6, disable RyuJIT immediately (this is a temporary fix and should be removed when an update is released). You can disable RyuJIT via a registry setting (note: this requires an application pool recycle to take effect).
    • Under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\.NETFramework add a useLegacyJit DWORD with a value of 1
    • Or via PowerShell:
Set-ItemProperty -Path HKLM:\Software\Microsoft\.NETFramework -Name useLegacyJit -Type DWord -Value 1

Be aware, the web.config method (#3) of disabling RyuJIT does not work. Outside of IIS hosting, applying this fix via app.config does work.

We are talking with and pushing Microsoft to get a fix for this shipped ASAP. We recognize releasing a fix for .Net on the Microsoft side isn’t a small deal. Our disclosure is prompted by the reality that a fix cannot be released, distributed, and applied immediately.

Optimization Considerations: Rebuilding this site

This week I took a few days and re-built my blog. It was previously a WordPress instance on a small host with CloudFlare in front. It is now a statically generated, open source, managed via git, hosted on GitHub pages, and still proxied through CloudFlare. This post is my attempt to explain my reasoning and process in optimizing it along the way.

Why?

I couldn’t do what I wanted with WordPress, at least…not without a fight. I want to do several posts with interactive elements such as charts, maps, simulations, directly included CSV data, etc. to compare a lot of numbers I’ll be throwing out about some major infrastructure changes we’re making over at Stack Exchange. Here’s a quick example of many things I want to do in future posts. That was my motivation. When I looked into what I needed to change to support these things (besides even the basic editor fighting me along the way), I also took a long look at how the blog was performing. It was heavy…very heavy, as most WordPress installs tend to be. As a result, it was slow. Here’s what my blog looked like before the do-over:

Continue reading...

How we upgrade a live data center

I blogged about how we upgrade a live data center over on the Server Fault blog, which you may find interesting. It details how we planned and executed a large hardware refresh in the Stack Exchange New Jersey data center complete with hundreds of photos along the way. I love our hardware upgrade trips and will try my best to share future ones the same way. Expect a similar post on building the Denver data center around June this year.

If reading isn’t your thing and you want to go directly to the server porn, never fear. The album of the move (fully commented) is here: http://imgur.com/a/X1HoY

Continue reading...

What it takes to run Stack Overflow

I like to think of Stack Overflow as running with scale but not at scale. By that I meant we run very efficiently, but I still don’t think of us as “big”, not yet. Let’s throw out some numbers so you can get an idea of what scale we are at currently. Here are some quick numbers from a 24 hour window few days ago - November 12th, 2013 to be exact. These numbers are from a typical weekday and only include our active data center - what we host. Things like hits/bandwidth to our CDN are not included, they don’t hit our network.

  • 148,084,883 HTTP requests to our load balancer
  • 36,095,312 of those were page loads
  • 833,992,982,627 bytes (776 GB) of HTTP traffic sent
  • 286,574,644,032 bytes (267 GB) total received
  • 1,125,992,557,312 bytes (1,048 GB) total sent
  • 334,572,103 SQL Queries (from HTTP requests alone)
  • 412,865,051 Redis hits
  • 3,603,418 Tag Engine requests
  • 558,224,585 ms (155 hours) spent running SQL queries
  • 99,346,916 ms (27 hours) spent on redis hits
  • 132,384,059 ms (36 hours) spent on Tag Engine requests
  • 2,728,177,045 ms (757 hours) spent processing in ASP.Net
Continue reading...

Running Stack Overflow on SQL 2014 CTP 2

Ok let’s start with the obvious, why run Stack Overflow on pre-release software? If no one tests anything, it only means one thing: more bugs at release. We have the opportunity to test the hell out of the 2014 platform. We can encounter, triage and in most cases, get bugs resolved before release. What company doesn’t want testers for their software in real world environments whenever possible? I would like to think all of them do, just not all have the resources to make it happen.

All of that adds up to it being a huge win for us to test pre-release software in many cases. First there’s the greedy benefit: we can help ourselves to make sure our pain points with the current version are resolved in the next. Then there’s something we love doing even more: helping others. By eliminating bugs from the RTM release and being able to blog about upgrade experiences such as this, we hope it helps others…just as posting a question publicly does.

Continue reading...
View Archive (14 posts)