SCOM Wintel Gray Agents Health State and Cache Flush – Part II: Automation

In the previous post, we learned that we can clear the agent's cache and recycle the Health Service, and this will (hopefully) resolve our gray agent issue. But what happens when we have to do this for hundreds of agents? One word: PowerShell. PowerShell lets us automate this task across hundreds of servers, turning a very tedious job into a very quick one! Here is the code I use.

Just make sure every server you want to fix is listed in list.txt (one server name per line, in the folder you run the script from), and that the account you run the script as has local administrator rights on each server.
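The script reads list.txt with a plain Get-Content, so a stray blank line or trailing space ends up being treated as a server name. A minimal sketch of a hypothetical Read-ServerList helper (the name is mine, not part of SCOM or any module) that trims each line, skips blank lines, and drops exact duplicates:

```powershell
# Hypothetical helper (not part of SCOM or any module): read server names
# from a text file, one per line.
function Read-ServerList {
    param(
        [Parameter(Mandatory = $true)]
        [string]$Path
    )

    # Trim each line, skip blank lines, and drop exact duplicates
    # (Select-Object -Unique compares case-sensitively and keeps first occurrences).
    Get-Content -LiteralPath $Path |
        ForEach-Object { $_.Trim() } |
        Where-Object { $_ -ne '' } |
        Select-Object -Unique
}
```

You could then swap the script's first line for `$list = Read-ServerList -Path ".\list.txt"`.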

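# Flush the Health Service cache on every server in list.txt.
# Get-WmiObject (gwmi) is not available in PowerShell 7+, so run this in Windows PowerShell.
$list = Get-Content ".\list.txt"

# Health Service State folder (the agent cache) as a WQL string, so backslashes are doubled.
# Adjust to your agent's install path, e.g. C:\\Program Files\\Microsoft Monitoring Agent\\Agent\\Health Service State
$cacheFolder = 'C:\\PROGRA~1\\SYSTEM~1\\Agent\\HEALTH~1\\HEALTH~1'

foreach ($server in $list)
{
    Write-Host "$server Check Service: " -NoNewline
    $service = Get-WmiObject Win32_Service -ComputerName $server -Filter "Name='HealthService'"

    if ($service.State -eq "Running")
    {
        # Stop the Health Service; a ReturnValue of 0 means success
        $result = $service.StopService()
        if ($result.ReturnValue -eq 0) { Write-Host "Stop: Successful" -ForegroundColor Green }
        else { Write-Host "Stop: Failed ($($result.ReturnValue))" -ForegroundColor Red }

        # Give the service a few seconds to stop before deleting its cache
        Start-Sleep -Seconds 5

        # Delete the Health Service State folder
        $folder = Get-WmiObject Win32_Directory -ComputerName $server -Filter "Name='$cacheFolder'"
        if ($folder)
        {
            $result = $folder.DeleteEx()
            if ($result.ReturnValue -eq 0) { Write-Host "Delete: Successful" -ForegroundColor Green }
            else { Write-Host "Delete: Failed ($($result.ReturnValue))" -ForegroundColor Red }
        }
        else { Write-Host "Delete: Folder not found" -ForegroundColor Red }

        # Start the Health Service again; it rebuilds the folder we just deleted
        $result = $service.StartService()
        if ($result.ReturnValue -eq 0) { Write-Host "Start: Successful" -ForegroundColor Green }
        else { Write-Host "Start: Failed ($($result.ReturnValue))" -ForegroundColor Red }
    }
    else
    {
        Write-Host "Not running, skipped"
    }
}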
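When a step prints Failed, the ReturnValue from StopService() or StartService() tells you why. A sketch of a hypothetical Get-ServiceReturnMessage helper that maps a common subset of the Win32_Service method return codes (as listed in Microsoft's Win32_Service documentation) to readable text; any code it does not list is reported as unlisted:

```powershell
# Hypothetical helper: translate a Win32_Service method ReturnValue
# (StopService/StartService) into text. Only a common subset is listed.
function Get-ServiceReturnMessage {
    param(
        [Parameter(Mandatory = $true)]
        [int]$Code
    )

    $messages = @{
        0  = 'Success'
        2  = 'Access denied'
        3  = 'Dependent services are running'
        5  = 'Service cannot accept control'
        6  = 'Service not active'
        7  = 'Service request timeout'
        8  = 'Unknown failure'
        10 = 'Service already running'
        14 = 'Service disabled'
        15 = 'Service logon failed'
    }

    # Fall back to the raw number for codes not in the table above
    if ($messages.ContainsKey($Code)) { return $messages[$Code] }
    return "Unlisted return code $Code"
}
```

For example, `Get-ServiceReturnMessage -Code 2` returns "Access denied", which usually means the account lacks local administrator rights on that server.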

SCOM Wintel Gray Agents Health State and Cache Flush

The problem: you launch your SCOM console, and your server shows up gray. You browse to the server, check the Health Service, and it is clearly running… So now what?

A Windows-based machine appears in a “Not monitored” state. While SCOM thinks the machine is unresponsive, we can confirm this is not the case: we can ping the machine, and we are able to log in to it.
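The same checks can be run from your own workstation against many servers at once. A sketch of a hypothetical Test-AgentHost function, assuming Windows PowerShell (for Get-WmiObject) and rights to query WMI on the target: it pings the server once and, if it answers, reads the HealthService state. A server that is reachable with a Running Health Service but still shows as “Not monitored” matches the symptom described here.

```powershell
# Hypothetical helper: is the server reachable, and is HealthService running?
# Assumes Windows PowerShell (Get-WmiObject) and WMI access to the target.
function Test-AgentHost {
    param(
        [Parameter(Mandatory = $true)]
        [string]$ComputerName
    )

    # One ping; -Quiet returns $true or $false instead of reply objects
    $reachable = Test-Connection -ComputerName $ComputerName -Count 1 -Quiet

    $state = $null
    if ($reachable) {
        # Query the Health Service through WMI, as the automation script does
        $svc = Get-WmiObject Win32_Service -ComputerName $ComputerName -Filter "Name='HealthService'" -ErrorAction SilentlyContinue
        if ($svc) { $state = $svc.State }
    }

    [pscustomobject]@{
        ComputerName       = $ComputerName
        Reachable          = $reachable
        HealthServiceState = $state
    }
}

# Usage: servers that answer and run HealthService (gray-agent candidates)
# Get-Content .\list.txt | ForEach-Object { Test-AgentHost -ComputerName $_ } |
#     Where-Object { $_.Reachable -and $_.HealthServiceState -eq 'Running' }
```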

This happens when the SCOM Health Service's cache needs to be cleared.

SCOM has a built-in task that does exactly this; however, because SCOM believes the machine is offline, it cannot send the task to the “Not monitored” machine.

SOLUTION – MANUAL PROCESS

  1. Remote into the machine and open Services (services.msc). Locate the Microsoft Monitoring Agent service and stop it.

  2. Once the service has stopped, browse to the following folder: “C:\Program Files\Microsoft Monitoring Agent\Agent\”

     Delete the entire “Health Service State” folder.

  3. Go back to Services (services.msc) and start the Microsoft Monitoring Agent service. This rebuilds the folder we just deleted.
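If you are already logged on to the machine, the three steps above can also be run from an elevated PowerShell prompt. A minimal sketch, assuming the default Microsoft Monitoring Agent path from step 2 and the HealthService service name (the service behind the Microsoft Monitoring Agent display name); Reset-HealthServiceCache is a hypothetical name. It supports -WhatIf, so you can preview each action before running it:

```powershell
# Hypothetical function: runs the three manual steps on the local machine.
# Run from an elevated prompt; the default path is the one from step 2.
function Reset-HealthServiceCache {
    [CmdletBinding(SupportsShouldProcess = $true)]
    param(
        [string]$StatePath = 'C:\Program Files\Microsoft Monitoring Agent\Agent\Health Service State',
        [string]$ServiceName = 'HealthService'
    )

    # Step 1: stop the Microsoft Monitoring Agent service.
    # Stop-Service waits for the service to reach the Stopped state.
    if ($PSCmdlet.ShouldProcess($ServiceName, 'Stop service')) {
        Stop-Service -Name $ServiceName -ErrorAction Stop
    }

    # Step 2: delete the Health Service State folder, if it exists
    if ((Test-Path -LiteralPath $StatePath) -and $PSCmdlet.ShouldProcess($StatePath, 'Delete folder')) {
        Remove-Item -LiteralPath $StatePath -Recurse -Force -ErrorAction Stop
    }

    # Step 3: start the service again; the agent rebuilds the folder
    if ($PSCmdlet.ShouldProcess($ServiceName, 'Start service')) {
        Start-Service -Name $ServiceName -ErrorAction Stop
    }
}

# Preview first, then run for real:
# Reset-HealthServiceCache -WhatIf
# Reset-HealthServiceCache
```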

Give SCOM a few seconds, maybe a few minutes, and the Health State of our machine will turn back to healthy!