Sorry I didn't post here earlier, but I was heavily focused on figuring this thing out.
Let me give you all a brief update.
As you all know, the price movement shined a very bright spotlight on the Ripple network. People came around to look, kick the tires, take the network for a spin. All that is wonderful - and expected.
As a result of the increased interest, a number of people spun up fresh servers, all of which were attempting to retrieve a lot of data from the existing, established servers. While the servers know how to protect themselves from overextending (they do resource/load monitoring and adapt), some of this logic actually worked against us. Additionally, the increased interest resulted in a significantly larger number of client connections (from wallets, ripple-lib instances, etc.) and a corresponding amount of additional work (imagine a few thousand new websocket connections, all of which are asking for every update on the same four or five order books, or other expensive operations).
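To make that cost concrete, here is a minimal sketch (TypeScript, using the `ws` package) of the kind of order-book subscription each of those new clients was opening. The `subscribe` command and its `books` fields are the standard rippled WebSocket API, but the server URL and the issuer address below are placeholders I picked for illustration, not the actual endpoints involved.

```ts
import WebSocket from "ws";

const ws = new WebSocket("wss://s1.ripple.com"); // placeholder public server

ws.on("open", () => {
  ws.send(JSON.stringify({
    id: 1,
    command: "subscribe",
    books: [{
      taker_gets: { currency: "XRP" },
      taker_pays: { currency: "USD", issuer: "rPLACEHOLDER..." }, // hypothetical issuer
      snapshot: true, // send the full current book up front -- the expensive part
      both: true,     // and subscribe to both sides of it
    }],
  }));
});

// Every transaction touching this book now pushes an update down this socket;
// multiply that by a few thousand sockets all watching the same handful of books.
ws.on("message", (data) => console.log(data.toString()));
```

Each individual subscription is cheap; thousands of them concentrated on the same few books is what added up.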
Although the servers weren't anywhere near their max, it seemed reasonable to spin up some additional capacity - which we did, within hours. Once that was done, we dug deeper to understand what was happening.
Our investigation revealed that some of the default values we were using to detect connection quality were a bit too aggressive and could cause connection instability. Servers would incorrectly conclude they had poor connectivity, drop their "poor" links, and try to find better ones, causing churn and link flapping. This happened sporadically, resulting in brief periods of unstable connectivity before things settled down again and everything recovered.
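To illustrate the failure mode (this is an invented sketch, not rippled's actual peer-management code; every name and number in it is hypothetical): with a single aggressive cutoff, a borderline link gets dropped, its freshly dialed replacement looks borderline too, and the cycle repeats. Widening the drop threshold past the "healthy" range and refusing to judge young links breaks the cycle:

```ts
// Invented illustration -- not rippled's actual code. Hysteresis plus a
// minimum hold time keeps borderline links from flapping.
interface PeerLink {
  latencyMs: number;        // smoothed round-trip estimate for this peer
  connectedSinceMs: number; // when the link was established
}

const DROP_ABOVE_MS = 600;  // only drop links that are *clearly* bad
const MIN_HOLD_MS = 60_000; // never judge a link younger than a minute

function shouldDropLink(link: PeerLink, nowMs: number): boolean {
  // A freshly established link hasn't had time to settle; leave it alone.
  if (nowMs - link.connectedSinceMs < MIN_HOLD_MS) return false;
  // Borderline links (say, 300-600 ms) ride out the noise instead of flapping.
  return link.latencyMs > DROP_ABOVE_MS;
}
```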
The good news is that we not only identified the issue but have a fix: PR 2111, which will go out tomorrow as rippled 0.60.3. We are in the process of updating some of our servers with this build. I believe this will help settle the network down and improve things across the board.
The additional capacity we have spun up will likely remain active (or at least on hot standby), although I believe that once 0.60.3 rolls out across the network, it will no longer be necessary.
With all that said, let me shift gears: I appreciate all the commentary here. I know that the intermittent connectivity issues, and the appearance of instability that resulted from them, were disappointing to the community. Believe me, we were disappointed too. We are committed to learning from this and improving our processes to make sure it doesn't happen again.
But look at it this way too: despite the connection instability, the RCL kept closing ledgers like clockwork every 3.5 seconds in the face of significantly increased volume. The most important code held up brilliantly and didn't even bat an eyelash. The automated fee escalation engine responded well to the ebb and flow of volume, adjusting fees to match supply and demand. Our internal, automated monitoring systems functioned well and gave us good visibility. And, most importantly, our people worked well together. Kudos to all my coworkers for stepping up.
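For anyone curious about the shape of that fee response, here is a toy model. It is my own illustration, not rippled's actual implementation (which works in "fee levels" and tracks recent ledger sizes); the point is just that fees stay at the base rate while the open ledger has room, then ramp quadratically with fullness, so bursts of demand price themselves back down:

```ts
// Toy model of the quadratic-escalation idea only; constants are illustrative.
const BASE_FEE_DROPS = 10; // reference minimum transaction cost, in drops

function requiredFeeDrops(txInOpenLedger: number, expectedLedgerSize: number): number {
  if (txInOpenLedger < expectedLedgerSize) return BASE_FEE_DROPS;
  const fullness = txInOpenLedger / expectedLedgerSize;
  return Math.ceil(BASE_FEE_DROPS * fullness * fullness); // quadratic ramp
}

console.log(requiredFeeDrops(50, 100));  // 10  -- ledger has room
console.log(requiredFeeDrops(200, 100)); // 40  -- 2x full => ~4x base fee
console.log(requiredFeeDrops(400, 100)); // 160 -- 4x full => ~16x base fee
```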
Again, thank you to everyone here for the comments and the suggestions. I haven't yet gone over every post, but I will do so tonight.