Catalyst Switch Memory Leak Bug
On the call, you looped me in on the issue: three switches were alerting due to high response times. When you accessed the switches, the CLI was extremely slow and unresponsive. Two of the switches crashed, and you intentionally reloaded the third to recover services.
Thank you for calling out the “%PLATFORM_INFRA-5-IOS_INTR_OVER_LIMIT_HIGH_STIME” log. This is a really good data point, as these logs typically indicate memory exhaustion at the kernel level. While the system-reports (one from Mar-18 and one from Jan-20) were copying over, we looked through the crashinfo file on the switch (which is plaintext). In addition to a number of “CRL_FETCH_FAIL” syslogs indicating that the CRL fetch for the DNAC-CA trustpoint was failing, we saw several indicators of high memory utilization prior to each crash:
1023858: Jan 19 23:51:04.356: %PLATFORM-4-ELEMENT_WARNING: Switch 1 R0/0: smand: 1/RP/0: Used Memory value 92% exceeds warning level 90%. Top memory allocators are:...
1023973: Jan 20 00:01:14.327: %PLATFORM-4-ELEMENT_WARNING: Switch 1 R0/0: smand: 1/RP/0: Used Memory value 92% exceeds warning level 90%. Top memory allocators are:...
1117330: Mar 18 20:11:01.998: %PLATFORM-4-ELEMENT_WARNING: Switch 1 R0/0: smand: 1/RP/0: Used Memory value 92% exceeds warning level 90%. Top memory allocators are:...
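If it helps with trending, these warnings are easy to scrape out of a syslog archive. A small sketch (the regex and sample line are illustrative, built from the messages above):

```python
import re

# Illustrative: pull the used-memory and threshold percentages out of an
# ELEMENT_WARNING syslog so repeated warnings can be trended over time.
line = ("1023858: Jan 19 23:51:04.356: %PLATFORM-4-ELEMENT_WARNING: Switch 1 "
        "R0/0: smand: 1/RP/0: Used Memory value 92% exceeds warning level 90%.")
m = re.search(r"Used Memory value (\d+)% exceeds warning level (\d+)%", line)
print(m.group(1), m.group(2))  # 92 90
```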
The function backtraces for both crashes align with defect CSCwk62333:
CSCwk62333 - Cat9200L Unexpected reload crimson db write lock held for too long
However, the “Conditions” section of the release notes for CSCwk62333 mentions that this crash is a consequence of events that generate internal tracebacks, one of which is high memory utilization. So our focus is the cause of the high memory utilization on this switch. There are a few well-known pubd memory leaks that affect this platform and software version, and I noticed that you have NETCONF, DNAC, and telemetry configured. We checked the telemetry connection and confirmed that it is stuck in a “Connecting” state (which is the trigger for the pubd memory leaks):
NSHC_ITAS_P5_CAEDGE01#sh telemetry connec all
Telemetry connections
Index Peer Address Port VRF Source Address State State Description
----- -------------------------- ----- --- -------------------------- ---------- --------------------
11801 10.40.1.183 25103 0 10.40.114.141 Connecting Connection request made to transport handler
I was also able to decode the tracelogs for the Mar-18 event, and we see a flood of “Closing TLS connection returned error” logs, which is also indicative of this pubd memory leak:
cats-eng-lnx1:1036> pwd; grep -ari "Closing TLS connection returned error" | tail
/nobackup/dmonazah/698847335/sd64756-698847335.NSHC_ITAS_P5_CAEDGE01_1_RP_0-system-report_1_20250318-202050-UTC.tar.gz_decoded.log/crashinfo/tracelogs
utf_R0-0.7906_45543.20250318184316.bin.gz_decoded.log:2025/03/18 18:44:13.261659713 {pubd_R0-0}{1}: [pubd] [23971]: UUID: 0, ra: 0 (ERR): CNDP_MGR:conn_id[]Closing TLS connection returned error, rc -6992
utf_R0-0.7906_45543.20250318184316.bin.gz_decoded.log:2025/03/18 18:44:18.473718549 {pubd_R0-0}{1}: [pubd] [23971]: UUID: 0, ra: 0 (ERR): CNDP_MGR:conn_id[]Closing TLS connection returned error, rc -6992
utf_R0-0.7906_45543.20250318184316.bin.gz_decoded.log:2025/03/18 18:44:23.687913423 {pubd_R0-0}{1}: [pubd] [23971]: UUID: 0, ra: 0 (ERR): CNDP_MGR:conn_id[]Closing TLS connection returned error, rc -6992
utf_R0-0.7906_45543.20250318184316.bin.gz_decoded.log:2025/03/18 18:44:28.900262658 {pubd_R0-0}{1}: [pubd] [23971]: UUID: 0, ra: 0 (ERR): CNDP_MGR:conn_id[]Closing TLS connection returned error, rc -6992
utf_R0-0.7906_45543.20250318184316.bin.gz_decoded.log:2025/03/18 18:44:34.111763218 {pubd_R0-0}{1}: [pubd] [23971]: UUID: 0, ra: 0 (ERR): CNDP_MGR:conn_id[]Closing TLS connection returned error, rc -6992
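As a side note, the repeated errors above fire on a steady cadence, which reads like a retry loop against the stuck telemetry connection rather than a one-off failure. A quick sketch using the timestamps quoted above (truncated to microseconds, since strptime's %f accepts at most six fractional digits):

```python
from datetime import datetime

# Timestamps copied from the decoded pubd tracelog above.
stamps = [
    "2025/03/18 18:44:13.261659",
    "2025/03/18 18:44:18.473718",
    "2025/03/18 18:44:23.687913",
    "2025/03/18 18:44:28.900262",
    "2025/03/18 18:44:34.111763",
]
times = [datetime.strptime(s, "%Y/%m/%d %H:%M:%S.%f") for s in stamps]
deltas = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
print(deltas)  # each interval is ~5.2 s
```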
Moving forward, we’ll want to monitor memory utilization closely and proactively free up the leaked memory to avoid impact during production hours. The “show platform software status control-processor brief” command gives a quick, high-level overview of current kernel memory utilization:
------------------ show platform software status control-processor brief ------------------
<...>
Memory (kB)
Slot Status Total Used (Pct) Free (Pct) Committed (Pct)
1-RP0 Healthy 1973316 1084220 (55%) 889096 (45%) 1800824 (91%)
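For reference, the (Pct) columns are simple ratios against Total, which is handy if you end up scripting a poller around this output. Using the figures from the snapshot above:

```python
# Figures (in kB) from the control-processor output above.
total, used, free, committed = 1973316, 1084220, 889096, 1800824

def pct(value_kb):
    """Percentage of total memory; matches the (Pct) values shown in the CLI."""
    return round(100 * value_kb / total)

print(pct(used), pct(free), pct(committed))  # 55 45 91
```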
For a more detailed breakdown of the top processes utilizing memory within the kernel, we can use the following command:
------------------ show processes memory platform sorted ------------------
System memory: 1973316K total, 1096340K used, 876976K free,
Lowest: 601812K
Pid Text Data Stack Dynamic RSS Name
----------------------------------------------------------------------
4276 178431 214300 136 128 214300 linux_iosd-imag
26472 2855 165148 136 1096 165148 confd.smp
5487 176 96640 136 112 96640 fed main event
24098 68 31540 136 160 31540 pubd
If we see pubd bubbling to the top, that would be characteristic of this memory leak as well.
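If you want to trend this over time from periodic collections, here is a minimal sketch that pulls pubd's RSS out of that output. The column order (Pid Text Data Stack Dynamic RSS Name) is taken from the header above, and the sample text is abbreviated:

```python
# Minimal sketch: extract the RSS (kB) column for a process from raw
# "show processes memory platform sorted" text, so pubd growth can be
# compared across collections. Note: process names containing spaces
# (e.g. "fed main event") would need extra handling.
sample = """\
 4276  178431  214300  136  128  214300  linux_iosd-imag
26472    2855  165148  136 1096  165148  confd.smp
24098      68   31540  136  160   31540  pubd
"""

def rss_kb(cli_text, process="pubd"):
    for line in cli_text.splitlines():
        fields = line.split()
        if len(fields) == 7 and fields[-1] == process:
            return int(fields[-2])  # RSS is the second-to-last column
    return None

print(rss_kb(sample))  # 31540
```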
Once memory is leaked, we have two main options for freeing it back up:
- Reload the switch
- Remove and re-apply netconf-yang to restart the pubd process:
configure terminal
no netconf-yang
netconf-yang
end
In addition, we’ll want to address the “Connecting” state of the DNAC telemetry connection. The workaround for defect CSCwe09745 mentions performing a “Force Configuration Push” from DNAC’s side, and defect CSCwk90747 calls out that this state is usually the result of an expired certificate. I have looped in my colleague Joshua from the DNAC team to assist from that standpoint.
In terms of software patches, we would want to pick up the new API for OpenSSL (internal defect ID CSCwk07994), which is integrated starting in 17.12.5, 17.15.1, and later releases. 17.12.5 and 17.15.2 also carry the software patches for defect CSCwm80596. You may consider moving to one of these releases, but please review the release notes for any version you plan to upgrade to so that you are aware of the supported features and open caveats on that release.