Friday, 16 September 2011

Windows Servers and Clock Drift…Why you may or may not need a flux capacitor (or a really big hammer)

Jeremy Williams, Sr. Director, Modern Workplace

Take a look down at your system clock. He's hiding down there in the right-hand corner of your screen (most likely, unless you moved your taskbar), and he does a seemingly thankless job. Day in and day out, your time is reliably kept, and everything runs smoothly. By default (on non-domain-joined systems), the system clock is maintained by your BIOS, and it is accurate enough for most people. Sure, it may need to be fixed every year or so, but it's nothing serious. However, once a machine is domain-joined, its clock will synchronize with a domain controller. This ensures a cohesive time throughout an enterprise; should time not be synchronized, other issues can crop up, like Kerberos problems (by default, Kerberos rejects authentication once clocks differ by more than five minutes), caching issues, load balancer issues, etc.

Problem 1: Users are reporting that their system time is off by 3-7 minutes compared to their cell phones (most of which are synced with the cell phone companies; not a bullet-proof defense, but an arguable one for sure).

Solution 1: Hop on a Domain Controller and set it up to use an NTP server of some variety (purchase an appliance you can access, or risk the internet).  How to:
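The exact steps are not reproduced here, so treat the following as a rough sketch under stated assumptions rather than the definitive procedure. It is a small Python helper that only builds the w32tm command line for pointing a DC at a manual NTP peer list, so you can review the command before running it from an elevated prompt. The function name and the peer host names are hypothetical placeholders; swap in your own appliance or whichever public servers you trust.

```python
def build_ntp_config_command(peers):
    """Return the w32tm arguments that point a DC at a manual NTP peer list."""
    if not peers:
        raise ValueError("at least one NTP peer is required")
    # w32tm takes the peers as one space-separated value after /manualpeerlist:
    peer_list = " ".join(peers)
    return [
        "w32tm", "/config",
        f"/manualpeerlist:{peer_list}",
        "/syncfromflags:manual",  # sync from the manual peer list, not the domain hierarchy
        "/reliable:yes",          # advertise this DC as a reliable time source
        "/update",                # tell the time service the configuration changed
    ]


# Placeholder peers (an assumption for illustration only).
EXAMPLE_PEERS = ["0.pool.ntp.org", "1.pool.ntp.org"]
example_command = build_ntp_config_command(EXAMPLE_PEERS)
```

On the DC itself you could hand the resulting list to subprocess.run(example_command, check=True), or simply type the equivalent command by hand, and then run 'w32tm /resync' to force an immediate sync.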

Problem 2: Solution 1 worked for a few days, then all of a sudden users were reporting that problem #1 was occurring again… This seemed odd, so I checked out the domain controller, and everything was correct there (as far as the time was concerned). So I punched in ‘w32tm /monitor’ at an administrator command prompt, and I was informed that one of the two domain controllers was in sync, and the other had a drift of more than 400 seconds. That didn’t make any sense, so I started digging… [Note: If you’re not into the adventure, just skip to ‘Solution 2’ below.]

Heuristics: Instead of digging too terribly hard, I figured that I would simply apply to DC2 the same configuration I had applied to DC1; that way both DCs would be pulling the same time and would be pretty close to each other in terms of clock drift. After performing all of the steps, though, the clock still had a drift of over 5 minutes, which made absolutely no sense. To get to the root of it all, I ran ‘w32tm /query /status’ to get a look at what was going on under the covers of the Windows Time service. While in there, I noticed an odd IP address as the time source, one that I had certainly never configured it to query. After a quick search, I discovered that w32tm was actually synchronizing with its virtual host (since DC2 is a virtual box). Before I go much further, here’s a diagram of what’s going on…


As you can see, the issue was that an infinite loop (gasp!) of sorts was occurring.  Here’s how it went:

  • Hyper-V maintains system time on all of its guest OSes [hence that goofy IP I saw earlier]
  • One of the Hyper-V guests is a domain controller for the domain of which the Hyper-V host is a member
  • The Hyper-V host updates its time from a domain controller (which could be DC2)
  • The Hyper-V guest updates its time from the Hyper-V host

Okay, so it’s not a pure-evil infinite loop, but it certainly is a runaway time issue. Whatever the time offset was between the Hyper-V host and the Hyper-V guest, it would grow on each successive time sync (whenever the host synced off the guest’s time, which was originally the host’s own… follow?).
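To make the runaway a bit more concrete, here is a toy simulation. It is a deliberately simplified model and not a measurement of what Windows Time or Hyper-V actually does: it assumes a host clock that gains a fixed number of seconds per hour, one sync per hour, and a guest that copies the host exactly. The function name and the drift figure are made up for illustration.

```python
def simulate_host_error(hours, drift_s_per_hour, external_reference):
    """Return the host clock error in seconds after each hourly sync."""
    true_time = 0
    host = 0
    errors = []
    for _ in range(hours):
        true_time += 3600
        host += 3600 + drift_s_per_hour  # host hardware clock runs slightly fast
        guest = host                     # Hyper-V pushes host time into the guest DC
        if external_reference:
            host = true_time             # host is corrected from a real outside source
        else:
            host = guest                 # host "corrects" itself from the guest DC
        errors.append(host - true_time)
    return errors


closed_loop = simulate_host_error(3, 2, external_reference=False)  # [2, 4, 6]
with_ntp = simulate_host_error(3, 2, external_reference=True)      # [0, 0, 0]
```

In the closed loop nobody ever consults an outside clock, so the drift is never removed and just keeps piling up; with an external reference in the picture the error is wiped out at every sync.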

Solution 2: I’m fairly certain that Microsoft doesn’t recommend running a domain controller as a guest on a Hyper-V host that is a member server of that same domain, but that issue aside, w32tm has a flag that should resolve the issue. On the virtualized server, you’ll want to run:

w32tm /config /reliable:NO

w32tm /config /update

net stop w32time

net start w32time

After you’ve run all that, you should manually set the clock on your Hyper-V host (or configure it to use the same NTP configuration as your primary DC for good measure).
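If you would rather script those four commands than type them, something like the following sketch would do. It is only a convenience wrapper around the exact commands listed above; the function name and the dry-run default are my own assumptions, and it needs an elevated session on the virtualized DC when actually run.

```python
import subprocess

# The four Solution 2 commands, in the order given above.
SOLUTION_2_STEPS = [
    ["w32tm", "/config", "/reliable:NO"],
    ["w32tm", "/config", "/update"],
    ["net", "stop", "w32time"],
    ["net", "start", "w32time"],
]


def apply_solution_2(dry_run=True, runner=subprocess.run):
    """Run each step in order; with dry_run=True only report what would run."""
    planned = []
    for step in SOLUTION_2_STEPS:
        planned.append(" ".join(step))
        if not dry_run:
            # check=True raises CalledProcessError on failure, so later steps are skipped.
            runner(step, check=True)
    return planned
```

Calling apply_solution_2() returns the command lines without touching anything; apply_solution_2(dry_run=False) actually executes them and stops at the first failure.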

Oh, I almost forgot… You’d need the flux capacitor to go back in time and have Hyper-V not sync time all ninja-like, OR you could use the big hammer to simply take out your Hyper-V host, thereby directing all of your users back to DC1, where the time is accurate.