There has been an issue with the test Siteminder environment since the day I came back (am I cursed??). The problem happens intermittently. The Siteminder Policy Server has the following recorded:
[3804/644][Tue Mar 01 2011 17:45:44][CServer.cpp:1395][ERROR] Bad security handshake attempt. Handshake error: 3152
[3804/644][Tue Mar 01 2011 17:45:44][CServer.cpp:1402][ERROR] Handshake error: Failed to receive client hello. Socket error 0
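Because the problem is intermittent, it can help to see when the handshake errors cluster. Here is a minimal Python sketch that tallies them per hour. The field layout (process/thread IDs, timestamp, source file and line, level, message) is only inferred from the two lines above, and the function names are my own, so treat it as a starting point rather than an official log parser.

```python
import re
from collections import Counter
from datetime import datetime

# Layout inferred from the two sample lines:
# [pid/tid][timestamp][source_file:line][LEVEL] message
LINE_RE = re.compile(
    r"^\[(?P<pid>\d+)/(?P<tid>\d+)\]"
    r"\[(?P<ts>[^\]]+)\]"
    r"\[(?P<src>[^\]:]+):(?P<line>\d+)\]"
    r"\[(?P<level>[A-Z]+)\]\s*(?P<msg>.*)$"
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    # Timestamp format as seen in the sample, e.g. "Tue Mar 01 2011 17:45:44"
    rec["ts"] = datetime.strptime(rec["ts"], "%a %b %d %Y %H:%M:%S")
    return rec

def handshake_errors_per_hour(lines):
    """Count ERROR lines mentioning 'handshake', bucketed by hour."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and rec["level"] == "ERROR" and "handshake" in rec["msg"].lower():
            counts[rec["ts"].strftime("%Y-%m-%d %H:00")] += 1
    return counts

# The two lines from the Policy Server log quoted above
SAMPLE_LINES = [
    "[3804/644][Tue Mar 01 2011 17:45:44][CServer.cpp:1395][ERROR] Bad security handshake attempt. Handshake error: 3152",
    "[3804/644][Tue Mar 01 2011 17:45:44][CServer.cpp:1402][ERROR] Handshake error: Failed to receive client hello. Socket error 0",
]

if __name__ == "__main__":
    for hour, n in sorted(handshake_errors_per_hour(SAMPLE_LINES).items()):
        print(hour, n)
```

Feeding it the whole log file instead of the two sample lines would show whether the errors line up with particular times of day.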
The CA knowledge base suggests it may be a shared secret rollover that is out of sync, and that the solution is to re-register the Siteminder IIS Web Agent trusted host object. Simple enough, done that, but it did not solve the issue. The next troubleshooting step was to find out whether anything had changed at the server level before that Monday, e.g. OS patching, an IIS config change, a performance bottleneck or a disk space issue. I have checked the obvious, but the problem persists.
The test Siteminder environment is used for testing the Identity Manager and a few multi-million project applications. The project managers are panicking, the solution architects are worried, and I am stressed. Since then I have rebuilt the IIS server (without the last 6 months of security patches and hotfixes) and relocated only the major applications to the newly rebuilt server, with the SM Policy Server and Web Agent installed locally. No luck. I then installed a copy of eDirectory and used it locally as the Siteminder Policy Store. At this stage, all of the Siteminder components are isolated and reside on the new server. Still… no luck. I am running out of ideas. The application developers then started to hassle me about the obvious things that I had already checked on day one of taking over the problem, which is quite annoying. The last resort was to enable the Siteminder traces and lodge a support call with CA for assistance.
It took CA 3 days to get back to us with a request for more information. We updated the trace log format and resubmitted the logs. They got back to us today. The trace indicates that the policy server's TCP/IP sockets are being used up by numerous connections to one of the user directories. The connections eventually time out, but until then they are held open for no apparent reason. CA support suggested that we look at the performance of the user directories.
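A quick way to check this kind of socket build-up yourself is to count TCP connections per remote endpoint and state. The Python sketch below does that from captured `netstat -an` text. It assumes the Windows column layout (Proto, Local Address, Foreign Address, State), and the IP addresses and ports in the sample are made up for illustration, not taken from our environment.

```python
from collections import Counter

def count_tcp_connections(netstat_text):
    """Count TCP connections by (remote endpoint, state).

    Assumes Windows-style `netstat -an` rows:
    Proto  Local Address  Foreign Address  State
    """
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers, blank lines and UDP rows (UDP has no State column)
        if len(fields) >= 4 and fields[0].upper() == "TCP":
            remote, state = fields[2], fields[3]
            counts[(remote, state)] += 1
    return counts

def busiest_remotes(counts, top=5):
    """Return the remote endpoints holding the most connections, all states combined."""
    totals = Counter()
    for (remote, _state), n in counts.items():
        totals[remote] += n
    return totals.most_common(top)

# Hypothetical sample output; addresses and ports are invented
SAMPLE_NETSTAT = """
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    10.0.0.5:49152         10.0.0.20:389          ESTABLISHED
  TCP    10.0.0.5:49153         10.0.0.20:389          ESTABLISHED
  TCP    10.0.0.5:49154         10.0.0.20:389          TIME_WAIT
  TCP    10.0.0.5:49155         10.0.0.30:636          ESTABLISHED
  UDP    0.0.0.0:123            *:*
"""

if __name__ == "__main__":
    counts = count_tcp_connections(SAMPLE_NETSTAT)
    for remote, total in busiest_remotes(counts):
        print(remote, total)
```

Running this against snapshots taken a few minutes apart would show whether connections to one directory host keep piling up instead of being released.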
The user directories are NLB enabled with a Cisco ACE module front end. We still can’t determine the cause of the socket error. The performance of the user directories seems OK. What the heck, it’s a test environment, so I decided to arrange a restart after hours. It has now been restarted and the above errors have not recurred since. I will see what the developers say tomorrow.