13th September 2023
One of our clients carried out Windows patching across a number of their servers, including one hosting a SQL Server 2014 Standard Edition instance that we support. Patching on the SQL Server’s host server took the form of a large number (about a dozen) of Security Updates. After a reboot it became almost immediately apparent that there was a problem, with users and applications unable to connect to databases and SQL Agent jobs not progressing. Even local SSMS connections were failing with timout errors.
After we were asked to help with the problem it was possible to connect with a Dedicated Admin Connection and all databases seemed to be online and healthy, and an examination of the current log showed no obvious problems. Restarting the SQL Server service allowed users and applications to connect temporarily, however those connections started to time out again fairly quickly. After having looked at the ‘usual suspects’ associated with timeouts – e.g. network latency exceeding the timeout value, a network mis-configuration, incorrect server name, DNS/IP address issues, etc – the recommendation was to roll back the patching as that seemed to be the obvious reason for the problem with no other information.
Unfortunately, after the patching was rolled back, the problem persisted so it was time to look elsewhere. It was then that we also became aware of an Oracle upgrade the previous week which resulted in Oracle 19c being installed in tandem with the pre-existing version 12. That rang a bell as another client had suffered a similar problem about a year earlier. Again this was after Windows patching and an upgrade from Oracle version 11 to 19c, whereupon they started having serious issues with a SQL Agent job that was connecting to a remote Oracle data source via a Linked server, causing the Database Engine service to hang and generating a stack dump. That client had approached Oracle support at that point, the problem was traced to a driver mis-match and the recommendation was to uninstall version 11 and re-create all Oracle Linked Servers. This solution then resolved the problem and this blog post summaries the case.
This closely mirrored what was happening with the current client as they too had SQL Agent jobs connecting to Oracle data sources via Linked Servers. Shutting down all services except the SQL Server service subsequently allowed user, application and SSMS connections to stay up, although eventually these started to time out again, albeit after a far longer period. In retrospect shutting down all other services would have been a useful first step before rolling back the patching (almost certainly unnecessarily as it transpired) and the blog post referred to above does recommend that.
The same recommendation then was made to the current client. Despite the fact that, in this case there was no sign of any stack dumps in the log, it was felt that the similarities in the behaviour warranted such a recommendation and, although it transpired that the client had opted to uninstall version 19c rather than the older version, this still proved to be a successful fix.
In summary it’s possible that, having upgraded Oracle a week earlier, there had been no reboot of the server and that that only occurred after the patching, so the patching may well have been a red herring. And although a snapshot backup had been taken prior to the patching, reverting to a snapshot prior to the Oracle upgrade would also have meant backing up and restoring the databases as part of the recovery in order to minimise data loss. Finally the client was asked to consider some form of High Availability or Disaster Recovery technology since an upgrade of Oracle and patching on an HA or DR secondary could have been delayed until after an evaluation of any incipient problems on a patched/upgraded primary.
It’s possible that the reason the problem eventually recurred even after after shutting down the SQL Agent service was that an application was querying one or more Linked Servers directly after a period of time rather than a scheduled SQL Agent job that was running almost immediately after the service had started, although the client was unable to confirm that.
Takeaways are that the Windows server should perhaps always be rebooted after any upgrade or patching. And that clients don’t always inform you about every change – only the most recent one.
Gordon Feeney, SQL Server Database Consultant.
Type above, then press return to search