Matthew Sekol

"The basic tool for the manipulation of reality is the manipulation of words."

Month: July 2014

Roast Beef and IT

I recently came across a neglected IT system. The software itself was up to date and patched, but the processes and management of this system had fallen away. The explanation I received was a story that I hadn’t heard before, but one that is used frequently to describe such a phenomenon in a business. The story goes like this…

A newly married couple was preparing a roast beef dinner. The wife cut the ends off of the roast and placed it in the pan. The husband asked her why she was doing that. She explained that was the way her mother always did it. The next day, she rang up her mother who explained that the grandmother had done that when she was a little girl. When they asked the grandmother about it, she explained that the roasting pan she used to use was too small to fit a roast in, so she cut the ends off.

Clearly, the point of the story is to show how outdated processes can be a hindrance, or at least wasteful, if left unchecked. When it comes to IT processes, though, what harm can they do? Well, it turns out, quite a lot.

The rest of this is a little technical, so hang on!

Active Directory Replication
The particular system that brought this story up was Active Directory. After working with a company, I heard a complaint that when computers would lose their domain membership, Site Support couldn’t delete the computer object from the domain and add it back with the same name. There would always be a conflict that would prevent it despite deleting the computer object from AD. As a result, their domain was riddled with computer01, computer01a, computer01b, etc.

After going through several Microsoft Active Directory Healthchecks over the years, this sounded to me like the convergence time was too high across the domain. My assumption was that the computer object deletion wouldn’t replicate fast enough before a new computer was joined to the domain. I found an excellent PowerShell script to test out my theory.

After running the script, I noticed that it was taking 15 minutes to replicate an object across the domain. In my own observation with manual object creation, this appeared to be upwards of 45 minutes! Furthermore, it was taking most of the time just to reach one particular site. I started my investigation.
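To make the idea of a convergence test concrete, here is a minimal Python sketch. It is not the PowerShell script mentioned above, and it does not talk to Active Directory at all. It only assumes you have recorded when a test object was created and when each domain controller first saw it. The DC names and timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def convergence_report(created_at, first_seen):
    """Return (total convergence time, slowest DC, per-DC delays).

    created_at: when the test object was created on the source DC.
    first_seen: dict of DC name -> time the object first appeared there.
    """
    delays = {dc: seen - created_at for dc, seen in first_seen.items()}
    slowest = max(delays, key=delays.get)
    # The domain has converged only once the slowest DC has the object.
    return delays[slowest], slowest, delays


# Hypothetical observations: two fast DCs and one slow remote site.
created = datetime(2014, 7, 1, 9, 0, 0)
seen = {
    "DC-HQ-01": created + timedelta(seconds=20),
    "DC-HQ-02": created + timedelta(seconds=45),
    "DC-BRANCH-01": created + timedelta(minutes=15),
}

total, slowest, delays = convergence_report(created, seen)
print(f"Converged in {total}; slowest DC: {slowest}")
```

The useful part is the per-DC breakdown: if one site accounts for nearly all of the delay, that is where to start looking.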

I found that Site Links were neglected and in their place, manual connections were created between the domain controllers. So, I dug deeper and saw that IP subnets were incorrectly configured and domain controllers in 3 different physical locations were in 1 particular site. When I asked about this, it seemed this had been part of some legacy process.

The first thing I did was fix the Site Links so that replication followed the paths I intended. I also enabled change notification on all the Site Links. If your network can handle this, I highly recommend it.

After that, I split the domain controllers from the 3 physical locations into 3 separate sites, added the IP subnets accordingly, and configured new Site Links between them.
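The subnet-to-site check can be sketched in a few lines of Python using the standard `ipaddress` module. This is only an illustration of the logic, not a tool that reads from Active Directory. The site names, subnets, and DC addresses are all hypothetical.

```python
import ipaddress


def site_for(ip, subnet_to_site):
    """Return the site of the most specific subnet containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [
        (net.prefixlen, site)
        for net, site in subnet_to_site.items()
        if addr in net
    ]
    # The longest prefix is the most specific match.
    return max(matches)[1] if matches else None


def misplaced_dcs(dc_ips, dc_sites, subnet_to_site):
    """List (dc, configured site, expected site) where the two disagree."""
    problems = []
    for dc, ip in dc_ips.items():
        expected = site_for(ip, subnet_to_site)
        if expected != dc_sites[dc]:
            problems.append((dc, dc_sites[dc], expected))
    return problems


# Hypothetical layout: 3 physical locations, but every DC sits in SiteA.
subnets = {
    ipaddress.ip_network("10.1.0.0/16"): "SiteA",
    ipaddress.ip_network("10.2.0.0/16"): "SiteB",
    ipaddress.ip_network("10.3.0.0/16"): "SiteC",
}
dc_ips = {"DC1": "10.1.0.10", "DC2": "10.2.0.10", "DC3": "10.3.0.10"}
dc_sites = {"DC1": "SiteA", "DC2": "SiteA", "DC3": "SiteA"}

print(misplaced_dcs(dc_ips, dc_sites, subnets))
```

With this sample data, DC2 and DC3 are flagged because their addresses belong to subnets assigned to other sites.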

Now that things were looking better, I waited 15 minutes for everything to replicate around. I then logged into each domain controller and deleted the manual connections that had been created. I tried to do this (as much as possible) in pairs so that when I kicked off the KCC, it would find the new pair and create the connection, which it did.

I gave the domain around an hour to flesh out the new connections. One sidenote – there were 2 legacy 2003 domain controllers. I noticed the 2 Sites with those were having problems automatically creating all the connections. These were set to be retired, so I isolated them using their own IPs as the subnet, and then the connections were created properly.

I let everything settle down for around an hour and then ran the convergence test again. It was down to 43 seconds! No more sequential computer names needed!

Group Policy
Now that replication was fixed, it was time to check out Group Policies. What I found there was mind-boggling. There were at least 8 DNS-based group policies on site-based computer OUs that did the exact same thing. Several of these referenced legacy VBS scripts that no longer existed on the NETLOGON share.

There were also other Group Policies with other VBS scripts that no longer existed and policies set with no settings. I found computer policies with the User Configuration enabled, but no user settings. There were also Domain Level policies with security settings configured to apply to only 1 user or a few computers (like Domain Controllers).

Where to start? Well, I consolidated the DNS policies down into one and removed the offending VBS script. I also combined several other computer-based (not site-based) GPOs into one and disabled User Configuration processing on it. This would speed up GPO processing in general.
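For anyone facing a similar cleanup, the two checks (identical policies and references to missing scripts) can be sketched in Python. This assumes the GPO settings have already been exported into plain dictionaries by some other means; the structure, policy names, and file names below are hypothetical and do not reflect any real export format.

```python
from collections import defaultdict


def find_duplicate_gpos(gpos):
    """Group GPO names whose settings are identical."""
    by_settings = defaultdict(list)
    for name, gpo in gpos.items():
        # frozenset makes the settings hashable and order-independent.
        key = frozenset(gpo["settings"].items())
        by_settings[key].append(name)
    return [sorted(names) for names in by_settings.values() if len(names) > 1]


def find_missing_scripts(gpos, netlogon_files):
    """Map GPO name -> scripts it references that are not on the share."""
    existing = {f.lower() for f in netlogon_files}
    missing = {}
    for name, gpo in gpos.items():
        gone = [s for s in gpo["scripts"] if s.lower() not in existing]
        if gone:
            missing[name] = gone
    return missing


# Hypothetical export of three policies and the share's contents.
gpos = {
    "DNS-Site1": {"settings": {"dns_suffix": "corp.example.com"},
                  "scripts": ["setdns.vbs"]},
    "DNS-Site2": {"settings": {"dns_suffix": "corp.example.com"},
                  "scripts": []},
    "Printers": {"settings": {"default_printer": "hq-01"},
                 "scripts": ["map.vbs"]},
}
netlogon = ["MAP.VBS"]

print(find_duplicate_gpos(gpos))
print(find_missing_scripts(gpos, netlogon))
```

Comparing file names case-insensitively matches how Windows shares behave, which is why `MAP.VBS` on the share satisfies the reference to `map.vbs`.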

I also combined the top level domain policy that was only applying to some Domain Controllers (because new ones hadn’t been added), into the Default Domain Controllers Policy. I also added in another policy that was on the Default Domain Controllers OU into the same policy.

I moved down top level policies that only applied to one user or computer (for testing) to avoid someone accidentally turning it on for everyone. This should also speed things up because users and computers would no longer even see this policy.

Needless to say, this whole exercise took about 4 hours, but the benefits were massive. Local errors were reduced, login times were decreased and simplicity was restored! The changes were communicated out to support teams so that these legacy processes were removed or updated.

No more Roast Beef!
Active Directory is a great example of something that can be so easy to manage that it falls by the wayside for support. It can be passed around to those who know just enough and who perpetuate legacy issues over and over. This served as a good lesson that can be applied to any IT system. It is easy to assume legacy processes are still relevant enough to support the environment, but what does it take to give the users a good experience? In this case, it was buckling down for a day and sorting things out. That wasn’t so bad!

What Weird Al can teach us about Corporate Communication

No, I’m not talking about “Word Crimes.” If you are that far gone, please close your LinkedIn account to avoid embarrassment and check with your nearest 7 year old for assistance. I’m talking about Weird Al’s all too familiar generic corporate speak found in “Mission Statement.” The song is a parody of Crosby, Stills & Nash’s “Suite: Judy Blue Eyes.” Go listen to both, I’ll wait…

I think the “Mission Statement” is great for two reasons:
1. The melody is technically proficient and catchy!
2. Anyone working at a modern business has been exposed to a mission statement similar to Weird Al’s interpretation and its ensuing spawn of generic corporate speak. Typically, when people hear these nebulous phrases, there is an uptick in self-inflicted head injuries.

As much as I’d like to discuss the finer points of Weird Al’s musical ability, that’s what Facebook is for. Let’s focus on number 2 and start with the mission statement itself.

In a few scant sentences, a business is expected to write a framework for its purpose. Think of it as an elevator pitch that justifies the company’s existence and its place in the universe. Furthermore, it relinquishes any responsibility to answer questions about your business’s true nature. Just refer to the mission statement!

The word ‘mission’ is jam packed with action! Didn’t you realize that your employees jump over hurdles and crawl under barbed wire? It’s the ‘statement’ that gives us trouble. An example of a statement is, “I am confused by your mission statement.” That’s it, one sentence. A quote from Disney’s “Aladdin” comes to mind – “Phenomenal cosmic power! Itty bitty living space.” This thing is doomed from the start.

But, we’re stuck with it. Someone long ago etched in stone that every company must have a tweet-length summary to define itself – so say we all! The result is nothing less than a complete deconstruction of the English language into broad terms that could fit any company, as Weird Al portrays. Quartz went as far as to call out companies that use phrases from “Mission Statement.” See how many you can interchange! It’s fun for all ages!

How does a company move on from this blatant attempt to dumb down the true essence of the business?

1. Your employees, consultants, and fortune tellers need to know what exactly your company does and what your expectations are. Let’s help them get there. Look past the bland words that you’ve associated with your awesome business and define a clear strategy for the company with specifics. Something like, “we strive to be the preferred provider for 70% or higher of the widget resellers.” After all, you want to do well, right? Take those ideas (more than one paragraph is encouraged here) and communicate your expectations to your employees.

Adjust these paragraphs and ultimately your mission statement as your company evolves.

2. When selling, communicating, or interacting in any way with the outside world, use those same internal paragraphs to craft a concrete summary specific to the situation at hand. For example, “we provide our inventory to 70% of the widget resellers because ours are more yellow than our competitors’.”

3. If at all possible, don’t refer to or publish your mission statement on the internet unless you can supplement it with honesty and specifics. Bury it deep down in your soul and cover it with junk food. Weird Al isn’t the only one judging you.

Well, we’ve got our mission statement and we’ve isolated it fairly well from doing any harm elsewhere by expanding it with specifics, or at least we thought so! Out of left field, where all problems originate, we find a new communication issue that permeates a lot of organizations, stemming from the mediocrity of the mission statement – the dreaded convoluted management email.

This email is from anyone in charge to any group that is affected by some action. For example, it may be a CxO level email to a group that is being downsized or a director level email announcing a new goal for a group of teams. Typically, HR has been mistakenly encouraged to perform in an editorial capacity to soften the communication.

Beware! Danger! Watch Out! These might be typical signs I would flash at you before you click Send. While you might think you have written a carefully thought out communication, you need to reconsider if you’ve been infected by the mission statement virus. Here are some tips.

For bad news, don’t beat around the bush or pontificate, just get to the point! Folks have most likely heard rumors or at least understand what’s what and how your business works. They have either read the internet articles and analyst reports about your company or Gary in Accounting has told them the full deal (after all, Gary is the one who told the analysts). Give them some credit and be straight with them. Yes, it will suck, but they will at least appreciate your honesty, even if they hate you forever for whatever it is you’re doing to them.

If you are a manager or higher, you might find yourself having your hands tied as to what you can say and when. Assess whether it is the proper time to communicate the message. It might be better to wait until the complete picture has formed. If, on the other hand, there are top secret machinations at work, but some communication still needs to get out (because information has a mind of its own, especially when HR is involved), you should be succinct.

For other news, set clear goals and timelines or at least communicate downstream and make sure the real message gets to the folks doing the work. Measurable results allow you to, get this, measure the results you get and hold people accountable! This isn’t rocket science, people (unless your company is building rockets, then I apologize).

The generic mission statement has infected our corporate culture, causing strife and despair across many companies. There are all sorts of reasons this happens. Sometimes people think they know what people want to hear and try to coddle them, but when a convoluted and generic ‘feel-good’ statement escapes, it can be a poison. Poor communication loaded with useless information only insults your audience and leads to mental uprisings complete with pretend pitchforks! Instead of being burned as a virtual effigy, modify your communication methods and be honest, specific and succinct.

Don’t be the target of internet ridicule for your mission statement! That’s what the comment section below is for!

Social Monitoring and Office 365

In June 2014, Microsoft suffered a several-hour outage of Office 365. For my company, the main issue was around mail delivery, but other companies seemed to have email access issues as well. It became apparent to us that Office 365 was having an issue because both FOPE (long story) and our on-premises mail system couldn’t send into EOP. We discovered this problem through monitoring and then making some assumptions – FOPE is set up to email us queue alerts and we also have internal Exchange and Unix mail host monitoring to do the same. None of the 3 systems could send to EOP, so the issue was most likely EOP. So, what does Microsoft do to monitor Office 365?
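The reasoning we used here (several independent senders all failing to reach the same destination points at the destination) is simple enough to sketch in Python. This is a toy illustration of the inference, not how our monitoring is actually built; the data structure and the `partner.example.com` destination are made up.

```python
def suspect_destinations(send_results, min_senders=2):
    """Return destinations that every sender failed to reach.

    send_results: dict of sender -> {destination: True if delivery worked}.
    A destination is only suspected if at least min_senders tried it,
    so a single broken sender does not implicate the destination.
    """
    attempts = {}
    for sender, results in send_results.items():
        for dest, ok in results.items():
            attempts.setdefault(dest, []).append(ok)
    return sorted(
        dest
        for dest, oks in attempts.items()
        if len(oks) >= min_senders and not any(oks)
    )


# Hypothetical snapshot of the three systems described above.
results = {
    "FOPE": {"EOP": False},
    "Exchange": {"EOP": False, "partner.example.com": True},
    "Unix mail host": {"EOP": False, "partner.example.com": True},
}

print(suspect_destinations(results))
```

Because the other destination is still reachable from two of the senders, only EOP comes back as a suspect.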

Well, Microsoft tells us what they do in a succinct way. Of course, I don’t want the details because I’m sure there are pages of ITIL based documentation and processes around their monitoring. What they do tell us can be found in summary in an Office blog entry, Cloud services you can trust: Office 365 availability:

  • Our internal monitoring systems continuously monitor the service for any failure and are built to drive automated recovery of the service.
    Matt’s interpretation: infrastructure and services are monitored with automated failover capabilities
  • Our systems analyze any deviations in service behavior to alert on-call engineers to take proactive measures.
    Matt’s interpretation: internal trending analysis with intelligence
  • We also have Outside-In monitoring constantly executing from multiple locations around the world both from trusted third party services (for independent SLA verification) and our own worldwide datacenters to raise alerts.
    Matt’s interpretation: external monitoring (and hopefully trending analysis). To me, this is the most useful because it is a real view.
  • For diagnostics, we have extensive logging, auditing, and tracing. Granular tracing and monitoring helps us isolate issues to root cause.
    Matt’s interpretation: logging that, while useful, may be too cumbersome to analyze in real time and is therefore used for root cause after the fact.

Microsoft pretty much monitors the same way I would in my own environment, as long as my interpretations are correct. There are probably other monitors and redundancies built in that I can’t even fathom, but this got me thinking about what else happens at my company when there is a widespread issue. Beyond the monitoring, there are 3 things that happen when an issue starts:

  • Social Interaction – People report the issue to the Help Desk or complain to each other.
  • Self Help Searching – Hit up Bing or Google and start investigating the error.
  • Power Users – Bypass the Help Desk and call directly to someone in IT (not recommended, but it gets the job done).

This gave me some ideas on additional ways Microsoft, or really any IT group, could quickly identify issues before a large number of customers call in. I believe much of this could even be achieved at a low cost using existing Microsoft technologies. Better still, this framework leverages both technology and people to create a kind of social monitoring that shows what is really happening out there.

Social Interaction
I think that when Office 365 is down, the second thing an IT person would do (the initial triage being first) is to check the dashboard and, if nothing is there, call the issue into Office 365 (or Premier).
Suggestion 1: Perform real-time trending analysis on Office 365 tickets to look for patterns beyond complete outages.
Not everyone will call in immediately, though. Some folks will linger around troubleshooting the issue, but others will vent out their frustration on social media networks.
Suggestion 2: Monitor social media and forums for mentions of an Office 365 outage. Microsoft has enabled Twitter hashtag searching in Bing. Leverage this partnership to perform trend analysis in real-time. Also, use the data from the Office 365 forums to determine if there is a thread (or threads) about an outage.
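Whether the input is support tickets or social media mentions, the trending analysis in these suggestions boils down to spotting a count that is well above its recent baseline. Here is a minimal Python sketch of that idea. It is only one simple approach (a standard deviation threshold), the counts are invented, and I have no knowledge of how Microsoft actually analyzes this data.

```python
from statistics import mean, stdev


def is_spike(history, current, threshold=3.0):
    """True when current sits more than threshold std devs above baseline.

    history: recent per-interval counts (tickets, mentions, searches).
    """
    if len(history) < 2:
        return False  # Not enough data to form a baseline.
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any increase is notable.
        return current > baseline
    return (current - baseline) / spread > threshold


# Hypothetical mentions of "Office 365 outage" per 5-minute window.
history = [4, 6, 5, 7, 5, 6]

print(is_spike(history, 6))   # a normal window
print(is_spike(history, 40))  # a sudden surge
```

The threshold is a tuning knob. Set it too low and every busy Monday morning looks like an outage.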

Self Help Searching
IT folks will hit up their favorite search engine when a problem arises. Not only can you search for specific error codes, but with news results included, you can even find articles reporting a service outage.
Suggestion 2.5: Use Bing’s own search APIs along with Bing’s Prediction capabilities to better determine when there is an issue affecting many customers. If “Office 365 outage” is suddenly trending, there might be a problem! This suggestion is 2.5 because it augments Suggestion 2.

Power Users
For Office 365, the power users are Microsoft’s own vast partner network. As a large enterprise, you can’t really get into Office 365 without the help from a good partner. These are the folks in the trenches, doing migrations, writing code, and slamming Office 365 all the time.
Suggestion 3: Create a prioritized alert system for partners to report issues. This could be a special code when they call into Office 365 (or Premier) support that alerts Microsoft they are seeing something abnormal. To save time, provide a framework for these partners to provide specific information that can help in triage.
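As a rough sketch of what a prioritized partner alert system might look like, here is a small Python example built on a priority queue. Everything in it is hypothetical: the report fields, the partner codes, and the priority scheme are my own invention to illustrate the suggestion, not anything Microsoft offers.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class PartnerReport:
    priority: int   # 1 = most urgent
    sequence: int   # tie-breaker so equal priorities stay in arrival order
    partner_code: str = field(compare=False)
    service: str = field(compare=False)
    symptom: str = field(compare=False)


class TriageQueue:
    """Hands out partner reports, most urgent first."""

    def __init__(self):
        self._heap = []
        self._counter = count()

    def submit(self, partner_code, service, symptom, priority):
        report = PartnerReport(
            priority, next(self._counter), partner_code, service, symptom
        )
        heapq.heappush(self._heap, report)

    def next_report(self):
        return heapq.heappop(self._heap)


queue = TriageQueue()
queue.submit("PARTNER-001", "SharePoint Online", "slow page loads", 3)
queue.submit("PARTNER-002", "Exchange Online", "mail queues backing up", 1)
queue.submit("PARTNER-003", "Exchange Online", "mail queues backing up", 1)

first = queue.next_report()
print(first.partner_code, first.service, first.symptom)
```

The fixed fields are the point of the framework: a partner who has to state the service and symptom up front saves support a round of questions during triage.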

Those are just 3 simple suggestions that take what is typically a marketing driven social media search to monitor interest in a product and turn it into something that can be used to make an impact with existing customers and potentially improve service uptime.

This past outage, though, was plagued by one big problem. The Office 365 dashboard wasn’t showing the outage for some customers. When the dashboard doesn’t update, a customer is left completely in the dark until they call in. I saw the ‘all clear’ on the dashboard, but our internal monitoring was alerting, so I called in and got a voice message that Exchange Online was experiencing an issue. This was very helpful since the dashboard wasn’t reporting it. But there is still room for improvement here.

A More Proactive Dashboard
This one could fall under “swallow your pride.” It follows the premise that people appreciate honesty. If there’s an issue, tell me and tell me as much as you can so that at least I can explain it to my customers.
Suggestion 4:  Redesign the Office 365 dashboard with some intelligence. What is out there now is a good start, but create a table for potential issues that are under investigation with a brief description. Use data organization on your alerts and tickets to proactively report on these potential issues. Build workflows like ‘if this alert comes in, update the dashboard because this might be broken.’ Put a disclaimer there that says, “Microsoft is currently investigating a potential issue. If you believe you are experiencing a related issue, please call into your support and reference this incident number.” If I’m not having an issue, I’m most likely not checking the dashboard and won’t bother to call in. If I am having an issue and I can see that a particular network provider might be down and I have that network provider, maybe I’ll call them instead of Microsoft.
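The ‘if this alert comes in, update the dashboard’ workflow could be as simple as a lookup table from alert types to dashboard entries. Here is a minimal Python sketch of that. The alert names, incident number, and descriptions are all hypothetical, and a real dashboard would obviously sit on top of something sturdier than a list.

```python
from dataclasses import dataclass


@dataclass
class DashboardEntry:
    incident: str
    service: str
    status: str
    description: str


# Hypothetical rules: alert type -> (affected service, customer-facing text).
RULES = {
    "mail_queue_backlog": (
        "Exchange Online",
        "Mail delivery may be delayed.",
    ),
    "network_provider_degraded": (
        "Connectivity",
        "A network provider may be experiencing problems.",
    ),
}


def update_dashboard(dashboard, alert_type, incident):
    """Post a 'potential issue' entry if a rule exists for this alert."""
    rule = RULES.get(alert_type)
    if rule is None:
        return None  # Unknown alerts stay internal.
    service, description = rule
    entry = DashboardEntry(incident, service, "Investigating", description)
    dashboard.append(entry)
    return entry


dashboard = []
update_dashboard(dashboard, "mail_queue_backlog", "INC-0001")
update_dashboard(dashboard, "disk_space_low", "INC-0002")

for entry in dashboard:
    print(f"[{entry.status}] {entry.service}: {entry.description} "
          f"(reference {entry.incident})")
```

Only alerts with a rule reach the customer, which keeps internal noise off the dashboard while still getting a potential issue posted before anyone has confirmed a root cause.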

I understand some of these suggestions are more reactive than proactive. Certainly, I wouldn’t go on Twitter if I thought my Office 365 tenant might go down. These suggestions though may help support understand the scope of the issue quickly. This in turn starts an information flow that could lead to quicker resolution times and better communication to the customers. Many of these suggestions can be used by your internal IT for any service or application as well!

© 2017 Matthew Sekol
