In June 2014, Microsoft suffered a several-hour outage of Office 365. For my company, the main issue was mail delivery, but other companies seemed to have email access issues as well. It became apparent to us that Office 365 was having an issue because both FOPE (long story) and our on-premises mail system couldn’t send into EOP. We discovered the problem through monitoring and then made some assumptions: FOPE is set up to email us queue alerts, and we also have internal Exchange and Unix mail host monitoring that does the same. All 3 systems couldn’t send to EOP, so the issue was most likely EOP. So, what does Microsoft do to monitor Office 365?
Well, Microsoft tells us what they do in a succinct way. Of course, I don’t want the details because I’m sure there are pages of ITIL-based documentation and processes around their monitoring. What they do tell us can be found in summary in an Office blog entry, Cloud services you can trust: Office 365 availability:
- Our internal monitoring systems continuously monitor the service for any failure and are built to drive automated recovery of the service.
Matt’s interpretation: infrastructure and services are monitored with automated failover capabilities
- Our systems analyze any deviations in service behavior to alert on-call engineers to take proactive measures.
Matt’s interpretation: internal trending analysis with intelligence
- We also have Outside-In monitoring constantly executing from multiple locations around the world both from trusted third party services (for independent SLA verification) and our own worldwide datacenters to raise alerts.
Matt’s interpretation: external monitoring (and hopefully trending analysis). To me, this is the most useful because it is a real view.
- For diagnostics, we have extensive logging, auditing, and tracing. Granular tracing and monitoring helps us isolate issues to root cause.
Matt’s interpretation: logging that, while useful, may be too cumbersome to analyze in real time and is therefore used for root cause analysis after the fact.
Microsoft pretty much monitors the same way I would in my own environment, as long as my interpretations are correct. There are probably other monitors and redundancies built in that I can’t even fathom, but this got me thinking about what else happens at my company when there is a widespread issue. Beyond the monitoring, there are 3 things that happen when an issue starts:
- Social Interaction – People report the issue to the Help Desk or complain to each other.
- Self Help Searching – Hit up Bing or Google and start investigating the error.
- Power Users – Bypass the Help Desk and call directly to someone in IT (not recommended, but it gets the job done).
This gave me some ideas on additional ways Microsoft, or really any IT group, could quickly identify issues before a large number of customers call in. I believe much of this could even be achieved at a low cost using existing Microsoft technologies. Better still, this framework leverages both technology and people to create a kind of social monitoring that shows what is really happening out there.
Social Interaction
I think that when Office 365 is down, the second thing an IT person would do (the initial triage being first) is to check the dashboard and, if nothing is there, call the issue into Office 365 (or Premier).
Suggestion 1: Perform real-time trending analysis on Office 365 tickets to look for patterns beyond complete outages.
Not everyone will call in immediately, though. Some folks will spend a while troubleshooting the issue, but others will vent their frustration on social media networks.
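To make Suggestion 1 a little more concrete, here is a minimal Python sketch of one way to spot an unusual jump in ticket volume. It is only an illustration: the function name, the threshold of 3 standard deviations, and the minimum count are my own assumptions, not anything Microsoft has described. The idea is to compare the ticket count for the current interval against a recent baseline.

```python
from statistics import mean, stdev


def is_ticket_spike(history, current, threshold=3.0, min_floor=5):
    """Return True when `current` looks unusually high compared with `history`.

    history:   ticket counts for recent comparable intervals (e.g. per 5 minutes).
    current:   ticket count for the interval being evaluated.
    threshold: standard deviations above the mean that count as a spike (assumed value).
    min_floor: counts below this are ignored, so 2 tickets on a quiet night
               does not page anyone (assumed value).
    """
    if len(history) < 2 or current < min_floor:
        return False
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: anything above the baseline is a deviation.
        return current > baseline
    return (current - baseline) / spread > threshold
```

Running the same check per service or per symptom keyword, rather than on the overall total, is what would surface patterns short of a complete outage.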
Suggestion 2: Monitor social media and forums for mentions of an Office 365 outage. Microsoft has enabled Twitter hashtag searching in Bing. Leverage this partnership to perform trend analysis in real-time. Also, use the data from the Office 365 forums to determine if there is a thread (or threads) about an outage.
Self Help Searching
IT folks will hit up their favorite search engine when a problem arises. Not only can you search for specific error codes, but with news results included, you can even find articles reporting a service outage.
Suggestion 2.5: Use Bing’s own search APIs along with Bing’s Prediction capabilities to better determine when there is an issue affecting many customers. If “Office 365 outage” is suddenly trending, there might be a problem! This suggestion is 2.5 because it augments Suggestion 2.
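Suggestions 2 and 2.5 both boil down to counting mentions over time. The sketch below shows that counting step only; it does not call Bing, Twitter, or any real API, and the phrases, window size, and alert level are assumptions I picked for illustration. It expects posts or search queries to be fed in roughly in time order from whatever source you have.

```python
from collections import deque
from datetime import datetime, timedelta

# Assumed phrases; a real list would be tuned over time.
OUTAGE_PHRASES = ("office 365 outage", "office365 down", "o365 down")


class MentionTracker:
    """Counts outage mentions inside a sliding time window."""

    def __init__(self, window=timedelta(minutes=10), alert_at=20):
        self.window = window      # how far back mentions are counted
        self.alert_at = alert_at  # mentions in the window that trigger an alert
        self._hits = deque()      # timestamps of matching posts, oldest first

    def observe(self, text, when):
        """Record one post; return True if the window holds enough mentions."""
        lowered = text.lower()
        if any(phrase in lowered for phrase in OUTAGE_PHRASES):
            self._hits.append(when)
        # Drop mentions that have aged out of the window.
        while self._hits and when - self._hits[0] > self.window:
            self._hits.popleft()
        return len(self._hits) >= self.alert_at
```

A fixed alert level is the simplest choice. Comparing against a normal baseline for that time of day would cut down on false alarms, at the cost of more bookkeeping.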
Power Users
For Office 365, the power users are Microsoft’s own vast partner network. As a large enterprise, you can’t really get into Office 365 without help from a good partner. These are the folks in the trenches, doing migrations, writing code, and slamming Office 365 all the time.
Suggestion 3: Create a prioritized alert system for partners to report issues. This could be a special code when they call into Office 365 (or Premier) support that alerts Microsoft they are seeing something abnormal. To save time, provide a framework for these partners to provide specific information that can help in triage.
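As a rough idea of what the "framework" in Suggestion 3 might capture, here is a hypothetical report structure in Python. Every field name, the partner code format, and the priority cutoffs are assumptions of mine for illustration; none of this reflects an actual Microsoft support process.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Fields a partner must fill in before the report can be triaged (assumed list).
REQUIRED_FIELDS = ("partner_code", "service", "symptom", "tenants_affected")


@dataclass
class PartnerReport:
    partner_code: str       # the "special code" identifying a trusted partner
    service: str            # e.g. "Exchange Online"
    symptom: str            # short description of what looks abnormal
    tenants_affected: int   # how many customer tenants show the symptom
    error_text: str = ""    # optional raw error message
    reported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def missing_fields(self):
        """Names of required fields that were left empty or zero."""
        return [name for name in REQUIRED_FIELDS if not getattr(self, name)]

    def priority(self):
        """Rough triage bucket; the cutoffs are illustrative only."""
        if self.missing_fields():
            return "needs-info"
        if self.tenants_affected >= 5:
            return "high"
        if self.tenants_affected >= 2:
            return "elevated"
        return "normal"
```

The point of the structure is that a partner seeing the same symptom across several tenants is a much stronger signal than a single customer call, and the report makes that visible up front.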
Those are just 3 simple suggestions that take what is typically a marketing-driven social media search to monitor interest in a product and turn it into something that can make an impact with existing customers and potentially improve service uptime.
This past outage, though, was plagued by one big problem: the Office 365 dashboard wasn’t updating with the outage for some customers. When the dashboard doesn’t update, a customer is left completely in the dark until they call in. I saw the ‘all clear’ on the dashboard, but our internal monitoring was alerting, so I called in and got a voice message that Exchange Online was experiencing an issue. This was very helpful since the dashboard was not reflecting the outage. But there is still room for improvement here.
A More Proactive Dashboard
This one could fall under “swallow your pride.” It follows the premise that people appreciate honesty. If there’s an issue, tell me and tell me as much as you can so that at least I can explain it to my customers.
Suggestion 4: Redesign the Office 365 dashboard with some intelligence. What is out there now is a good start, but create a table for potential issues that are under investigation, each with a brief description. Organize your alert and ticket data so you can proactively report on these potential issues. Build workflows like ‘if this alert comes in, update the dashboard because this might be broken.’ Put a disclaimer there that says, “Microsoft is currently investigating a potential issue. If you believe you are experiencing a related issue, please call into your support and reference this incident number.” If I’m not having an issue, I’m most likely not checking the dashboard and won’t bother to call in. If I am having an issue and I can see that a particular network provider might be down and I have that network provider, maybe I’ll call them instead of Microsoft.
I understand some of these suggestions are more reactive than proactive. Certainly, I wouldn’t go on Twitter if I thought my Office 365 tenant might go down. These suggestions, though, may help support understand the scope of the issue quickly. This in turn starts an information flow that could lead to quicker resolution times and better communication to customers. Many of these suggestions can be used by your internal IT for any service or application as well!
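The ‘if this alert comes in, update the dashboard’ workflow could be as simple as a lookup table. The Python sketch below is a guess at the shape of such a rule, not a description of how the real dashboard works: the alert names, the incident number format, and the status text are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical alert names mapped to (service, customer-facing description).
ALERT_RULES = {
    "eop_inbound_queue_growth": (
        "Exchange Online", "Inbound mail delivery may be delayed."),
    "portal_login_failures": (
        "Office 365 Portal", "Some users may be unable to sign in."),
}

DISCLAIMER = (
    "Microsoft is currently investigating a potential issue. If you believe "
    "you are experiencing a related issue, please call into your support and "
    "reference this incident number."
)


@dataclass
class DashboardEntry:
    incident_id: str
    service: str
    description: str
    status: str = "Under investigation"


def handle_alert(alert_name, dashboard, next_id):
    """Post a 'potential issue' entry for a known alert; return the entry or None."""
    rule = ALERT_RULES.get(alert_name)
    if rule is None:
        return None  # unknown alert: nothing is posted automatically
    service, description = rule
    # Reuse an open entry for the same service instead of posting duplicates.
    for entry in dashboard:
        if entry.service == service and entry.status == "Under investigation":
            return entry
    entry = DashboardEntry(f"PI{next_id:05d}", service, description)
    dashboard.append(entry)
    return entry
```

Keeping the mapping as plain data means support staff could add or adjust rules without touching the workflow itself.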