I recently had a frustrating day of Active Directory troubleshooting that led to a bunch of dead ends and, eventually, a solution. I'm documenting the resolution in case I run into the same problem again or someone else encounters a similar issue.
Original Issue: We moved a domain with two domain controllers (for the sake of example we'll call them DC1 and DC2) to a new Hyper-V environment. Everything had been working flawlessly with these servers in their original environment. The problems started when adding a batch of around 500 users using ADIMPORT. The first 250 users imported without any problem, and then the server started giving the error “Directory Service has exhausted pool of relative identifiers”. Further investigation in the Directory Services event log and running DCDIAG showed:
- “A Global Catalog Server could not be located – All GC’s are down”
- Event ID 16645 (RID Pool Request)
These indicated that directory services could not communicate with the Global Catalog (GC) server, which was very odd given that DC1 was itself the GC server. The other server, DC2, was also a GC server, but running DCDIAG on DC2 gave these errors:
- The processing of the group policy failed. Windows could not obtain the name of a domain controller
- KRB_AP_ERR_MODIFIED error
and running repadmin /showrepl gave these errors:
- The source destination server is currently rejecting replication requests
- “The RPC Server is unavailable”
SOLUTION: The eventual solution required addressing several problems. The first was to remove any shadow (hidden) network adapters and the current network adapter, and then set up a new network adapter. The steps to do this:
- Open a command prompt as administrator
- Run “set devmgr_show_nonpresent_devices=1”
- Run “start devmgmt.msc” from that same command prompt, so that Device Manager inherits the variable you just set
- Under View in the device manager select “show hidden devices”
- Under network adapters remove any hidden NICs
- Finally remove your active NIC
- Select “Scan for hardware changes”, which will detect the hardware again and create a new NIC.
- Close device manager.
- Note that the “set devmgr_show_nonpresent_devices=1” command only applies to the current command prompt session. Once you close that command prompt, Device Manager will no longer show the shadow NICs even if you select “show hidden devices”. This is misleading, since it looks like you don’t have any shadow adapters when you actually do.
- Go into your network adapter properties and set up your static IP
- It's important that you set the DNS server to your main AD DNS server (in this case the internal IP of DC1)
- After doing all of this I was still experiencing network connectivity issues. Running IPCONFIG /ALL showed a gateway of 0.0.0.0 in addition to the proper gateway. Running “Route Delete 0.0.0.0” from the command prompt forced the removal of the phantom 0.0.0.0 gateway. Note that you have to set up your proper gateway again in network properties after running the route delete.
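The phantom gateway check in the last step can be scripted if you have several servers to look at. Below is a minimal Python sketch, assuming the usual English-language layout of “ipconfig /all” output (unindented adapter headings, indented “Label . . . : value” rows, and extra gateways on continuation lines). The function names and the sample output are made up for illustration.

```python
import re

def default_gateways(ipconfig_text):
    """Return {section heading: [gateway, ...]} parsed from `ipconfig /all` text."""
    gateways = {}
    adapter = None
    collecting = False  # True while reading continuation lines of a gateway entry
    for line in ipconfig_text.splitlines():
        if not line.strip():
            collecting = False
            continue
        if not line[0].isspace():
            # Unindented lines such as "Ethernet adapter Ethernet:" start a section.
            adapter = line.strip().rstrip(":")
            gateways.setdefault(adapter, [])
            collecting = False
            continue
        if adapter is None:
            continue
        match = re.match(r"\s*Default Gateway[ .]*:\s*(.*)$", line)
        if match:
            collecting = True
            if match.group(1).strip():
                gateways[adapter].append(match.group(1).strip())
        elif ". :" in line:
            # Any other labelled row ends the gateway entry.
            collecting = False
        elif collecting:
            # Additional gateways appear alone on indented continuation lines.
            gateways[adapter].append(line.strip())
    return gateways

def phantom_gateways(ipconfig_text):
    """List the sections that report a 0.0.0.0 default gateway."""
    found = default_gateways(ipconfig_text)
    return [name for name, gws in found.items() if "0.0.0.0" in gws]

# Hypothetical sample resembling what I saw on DC1 (addresses are invented).
SAMPLE_OUTPUT = """\
Ethernet adapter Ethernet:

   IPv4 Address. . . . . . . . . . . : 10.0.0.10(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 0.0.0.0
                                       10.0.0.1
   DNS Servers . . . . . . . . . . . : 10.0.0.10
"""

print(phantom_gateways(SAMPLE_OUTPUT))
```

On a real server you would feed it the text captured from “ipconfig /all” instead of the sample string. It only reports the problem; the fix is still the route delete described above.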
The second issue was that the DNS server on DC1 was not responding to queries from other machines. I tested this by running NSLOOKUP on DC2 and querying the DC1 DNS server, which always came back as not responding. Going into the DNS manager and checking the main DNS server properties showed that the DNS server was set to listen on “All IP addresses”. Oddly enough, it was not actually listening on all of them. The solution was to select the internal IP specifically, which fixed the DNS issue. I believe this was caused by the listener IP registry entry being lost when the network card changed during the VM move. After this change, HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DNS\parameters had the correct IP keys. After the changes above, rebooting DC1 resulted in a working AD that could correctly query the Global Catalog and get unique IDs to create new users.
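If you want to repeat the “is DNS answering at all” test without NSLOOKUP, the same check can be approximated with a few lines of Python. This is only a sketch: it hand-builds a plain A-record query in the standard DNS wire format and sends it over UDP port 53. The function names, the host name dc1.example.local and the address 10.0.0.10 are placeholders I made up, not values from my environment.

```python
import socket
import struct

def build_dns_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet asking for an A record."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, no other records.
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # The name is encoded as length-prefixed labels ending with a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A record, QCLASS 1 = Internet.
    return header + qname + struct.pack(">HH", 1, 1)

def dns_server_responds(server_ip, hostname, timeout=3.0):
    """Return True if the server sends back any DNS response to our query."""
    query = build_dns_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts as well as unreachable or refused destinations.
            return False
    # A valid reply echoes our ID and has the response (QR) bit set.
    return len(reply) >= 12 and reply[:2] == query[:2] and bool(reply[2] & 0x80)

if __name__ == "__main__":
    # Placeholder values: replace with your own DNS server IP and a name it hosts.
    print(dns_server_responds("10.0.0.10", "dc1.example.local"))
```

Run it from the other domain controller (DC2 in my case). A False result matches the “not responding” that NSLOOKUP gave me, while True only tells you the listener is answering, not that the records are correct.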
The next issue was that DC2 was still giving all of the replication and RPC errors. Originally I thought this was a DNS issue, since DCDIAG was giving errors of “Replication error 8524 The DSA operation is unable to proceed because of a DNS lookup failure”. I spent considerable time debugging the AD DNS entries, but this turned out to be a dead end. The DNS error was actually being caused by DC2's inability to connect to DNS via RPC.
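One quick way to separate “the network is blocking RPC” from “RPC is reachable but something else is failing” is to test whether the RPC endpoint mapper port (TCP 135) accepts a connection at all. The Python sketch below does only that; the function name and the host name dc1.example.local are placeholders. A successful connection says nothing about authentication, so it would not have caught the underlying cause in my case, but it would have ruled out a plain connectivity problem sooner.

```python
import socket

# The RPC endpoint mapper on Windows listens on TCP port 135.
RPC_ENDPOINT_MAPPER_PORT = 135

def tcp_port_open(host, port=RPC_ENDPOINT_MAPPER_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or the name could not be resolved.
        return False

if __name__ == "__main__":
    # Placeholder host name: replace with the domain controller you are testing.
    print(tcp_port_open("dc1.example.local"))
```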
This appears to have been caused by the Kerberos IDs on DC2 not being recognized by DC1. I tried a number of methods to query these with KLIST and to reset SIDs, but none of them worked. Eventually the solution was to demote DC2 and then re-add it as a DC in the domain, which fixed all the replication, RPC, and other issues. The steps to do this:
- Run “DCPROMO /FORCEREMOVAL” to force removal of directory services from DC2. Doing this without /ForceRemoval was always unsuccessful, since DC2 could not communicate with DC1 due to the Kerberos problems.
- After performing the forced removal you have to clean up the metadata for DC2 on DC1, since DC1 does not know it's been removed. There are several command line methods on the Internet for doing this with ntdsutil, but the easiest way I found was to simply go into Active Directory Sites and Services, drill down to your site and then Servers, right click on the server (DC2 in this case) and pick Delete.
- Deleting the server above removed most, but not all, of its DNS entries. Go into DNS and check the domain zone for any references to the removed server, then go into the _msdcs.<my domain> zone, look for any remaining entries, and remove them.
- Once this is completed, run DCPROMO on the previously removed server to add it back as a domain controller. This allows it to rejoin the domain and also recreates any required DNS entries.
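Hunting for leftover DNS entries by eye is easy to get wrong, so here is a small Python sketch of the search done in the cleanup step above. It assumes you already have the zone's records as lines of text (from whatever export or copy and paste you prefer) and simply flags lines that mention the removed server as a whole host label. The function name and the sample records are invented for illustration.

```python
import re

def find_stale_records(record_lines, removed_host):
    """Return the record lines that still mention the removed server's name."""
    # Match the host as a whole label, so "dc2" does not also match "dc20".
    pattern = re.compile(
        r"(?<![A-Za-z0-9-])" + re.escape(removed_host) + r"(?![A-Za-z0-9-])",
        re.IGNORECASE,
    )
    return [line.rstrip() for line in record_lines if pattern.search(line)]

# Hypothetical records, loosely in zone-file style.
SAMPLE_RECORDS = [
    "@ NS dc1.example.local.",
    "@ NS dc2.example.local.",
    "_ldap._tcp.dc._msdcs SRV 0 100 389 dc2.example.local.",
    "dc20 A 10.0.0.20",
    "dc1 A 10.0.0.10",
]

for line in find_stale_records(SAMPLE_RECORDS, "dc2"):
    print(line)
```

Anything it prints still needs to be reviewed and removed by hand in the DNS manager before you run DCPROMO again.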