Catalyst Switch Memory Leak Bug
On the call, you looped me in on the issue: three switches were alerting due to high response times. When you accessed the switches, the CLI was extremely slow and unresponsive. Two of the switches crashed, and you intentionally reloaded the third to recover services.
Thank you for calling out the “%PLATFORM_INFRA-5-IOS_INTR_OVER_LIMIT_HIGH_STIME” log. This is a really good data point, as these logs typically indicate memory exhaustion at the kernel level. While the system-reports (one from Mar-18 and one from Jan-20) were copying over, we looked through the crashinfo file on the switch (which is plaintext). In addition to a number of “CRL_FETCH_FAIL” syslogs indicating that the CRL fetch for the DNAC-CA trustpoint was failing, we saw several indicators of high memory utilization prior to each crash:
1023858: Jan 19 23:51:04.356: %PLATFORM-4-ELEMENT_WARNING: Switch 1 R0/0: smand: 1/RP/0: Used Memory value 92% exceeds warning level 90%. Top memory allocators are:...
1023973: Jan 20 00:01:14.327: %PLATFORM-4-ELEMENT_WARNING: Switch 1 R0/0: smand: 1/RP/0: Used Memory value 92% exceeds warning level 90%. Top memory allocators are:...
1117330: Mar 18 20:11:01.998: %PLATFORM-4-ELEMENT_WARNING: Switch 1 R0/0: smand: 1/RP/0: Used Memory value 92% exceeds warning level 90%. Top memory allocators are:...
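If it helps with trending, these warnings are easy to scrape out of a syslog archive. A small sketch (the regex and sample line are illustrative, built from the messages above):

```python
import re

# Illustrative: pull the used-memory and threshold percentages out of an
# ELEMENT_WARNING syslog so repeated warnings can be trended over time.
line = ("1023858: Jan 19 23:51:04.356: %PLATFORM-4-ELEMENT_WARNING: Switch 1 "
        "R0/0: smand: 1/RP/0: Used Memory value 92% exceeds warning level 90%.")
m = re.search(r"Used Memory value (\d+)% exceeds warning level (\d+)%", line)
print(m.group(1), m.group(2))  # 92 90
```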
The function backtraces for both crashes align with defect CSCwk62333:
CSCwk62333 - Cat9200L Unexpected reload crimson db write lock held for too long
However, the “Conditions” section of the release notes for CSCwk62333 mentions that this crash is a consequence of events that generate internal tracebacks, one of which is high memory utilization. So our focus is the cause of the high memory utilization on this switch. There are a few well-known pubd memory leaks that affect this platform and software version, and I noticed that you have NETCONF, DNAC, and telemetry configured. We checked the telemetry connection and confirmed that it is stuck in a “Connecting” state (which is the trigger for the pubd memory leaks):
NSHC_ITAS_P5_CAEDGE01#sh telemetry connec all
Telemetry connections
Index Peer Address Port VRF Source Address State State Description
----- -------------------------- ----- --- -------------------------- ---------- --------------------
11801 10.40.1.183 25103 0 10.40.114.141 Connecting Connection request made to transport handler
I was also able to decode the tracelogs for the Mar-18 event, and we see a flood of “Closing TLS connection returned error” logs, which is also indicative of this pubd memory leak:
cats-eng-lnx1:1036> pwd; grep -ari "Closing TLS connection returned error" | tail
/nobackup/dmonazah/698847335/sd64756-698847335.NSHC_ITAS_P5_CAEDGE01_1_RP_0-system-report_1_20250318-202050-UTC.tar.gz_decoded.log/crashinfo/tracelogs
utf_R0-0.7906_45543.20250318184316.bin.gz_decoded.log:2025/03/18 18:44:13.261659713 {pubd_R0-0}{1}: [pubd] [23971]: UUID: 0, ra: 0 (ERR): CNDP_MGR:conn_id[]Closing TLS connection returned error, rc -6992
utf_R0-0.7906_45543.20250318184316.bin.gz_decoded.log:2025/03/18 18:44:18.473718549 {pubd_R0-0}{1}: [pubd] [23971]: UUID: 0, ra: 0 (ERR): CNDP_MGR:conn_id[]Closing TLS connection returned error, rc -6992
utf_R0-0.7906_45543.20250318184316.bin.gz_decoded.log:2025/03/18 18:44:23.687913423 {pubd_R0-0}{1}: [pubd] [23971]: UUID: 0, ra: 0 (ERR): CNDP_MGR:conn_id[]Closing TLS connection returned error, rc -6992
utf_R0-0.7906_45543.20250318184316.bin.gz_decoded.log:2025/03/18 18:44:28.900262658 {pubd_R0-0}{1}: [pubd] [23971]: UUID: 0, ra: 0 (ERR): CNDP_MGR:conn_id[]Closing TLS connection returned error, rc -6992
utf_R0-0.7906_45543.20250318184316.bin.gz_decoded.log:2025/03/18 18:44:34.111763218 {pubd_R0-0}{1}: [pubd] [23971]: UUID: 0, ra: 0 (ERR): CNDP_MGR:conn_id[]Closing TLS connection returned error, rc -6992
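As a side note, the repeated errors above fire on a steady cadence, which reads like a retry loop against the stuck telemetry connection rather than a one-off failure. A quick sketch using the timestamps quoted above (truncated to microseconds, since strptime's %f accepts at most six fractional digits):

```python
from datetime import datetime

# Timestamps copied from the decoded pubd tracelog above.
stamps = [
    "2025/03/18 18:44:13.261659",
    "2025/03/18 18:44:18.473718",
    "2025/03/18 18:44:23.687913",
    "2025/03/18 18:44:28.900262",
    "2025/03/18 18:44:34.111763",
]
times = [datetime.strptime(s, "%Y/%m/%d %H:%M:%S.%f") for s in stamps]
deltas = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
print(deltas)  # each interval is ~5.2 s
```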
Moving forward, we’ll want to monitor memory utilization closely and proactively free up the leaked memory to avoid impact during production hours. The “show platform software status control-processor brief” command gives a quick, high-level overview of current kernel memory utilization:
------------------ show platform software status control-processor brief ------------------
<...>
Memory (kB)
Slot Status Total Used (Pct) Free (Pct) Committed (Pct)
1-RP0 Healthy 1973316 1084220 (55%) 889096 (45%) 1800824 (91%)
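For reference, the (Pct) columns are simple ratios against Total, which is handy if you end up scripting a poller around this output. Using the figures from the snapshot above:

```python
# Figures (in kB) from the control-processor output above.
total, used, free, committed = 1973316, 1084220, 889096, 1800824

def pct(value_kb):
    """Percentage of total memory; matches the (Pct) values shown in the CLI."""
    return round(100 * value_kb / total)

print(pct(used), pct(free), pct(committed))  # 55 45 91
```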
For a more detailed breakdown of the top processes utilizing memory within the kernel, we can use the following command:
------------------ show processes memory platform sorted ------------------
System memory: 1973316K total, 1096340K used, 876976K free,
Lowest: 601812K
Pid Text Data Stack Dynamic RSS Name
----------------------------------------------------------------------
4276 178431 214300 136 128 214300 linux_iosd-imag
26472 2855 165148 136 1096 165148 confd.smp
5487 176 96640 136 112 96640 fed main event
24098 68 31540 136 160 31540 pubd
If we see pubd bubbling to the top, that would be characteristic of this memory leak as well.
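If you want to trend this over time from periodic collections, here is a minimal sketch that pulls pubd's RSS out of that output. The column order (Pid Text Data Stack Dynamic RSS Name) is taken from the header above, and the sample text is abbreviated:

```python
# Minimal sketch: extract the RSS (kB) column for a process from raw
# "show processes memory platform sorted" text, so pubd growth can be
# compared across collections. Note: process names containing spaces
# (e.g. "fed main event") would need extra handling.
sample = """\
 4276  178431  214300  136  128  214300  linux_iosd-imag
26472    2855  165148  136 1096  165148  confd.smp
24098      68   31540  136  160   31540  pubd
"""

def rss_kb(cli_text, process="pubd"):
    for line in cli_text.splitlines():
        fields = line.split()
        if len(fields) == 7 and fields[-1] == process:
            return int(fields[-2])  # RSS is the second-to-last column
    return None

print(rss_kb(sample))  # 31540
```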
Once memory is leaked, we have two main options for freeing it back up:
- Reload the switch
- Remove and re-apply netconf-yang to restart the pubd process:
configure terminal
no netconf-yang
netconf-yang
end
In addition, we’ll want to address the “Connecting” state of the DNAC telemetry connection. The workaround for defect CSCwe09745 mentions performing a “Force Configuration Push” from DNAC’s side, and defect CSCwk90747 calls out that this state is usually the result of an expired certificate. I have looped in my colleague Joshua from the DNAC team to assist from that standpoint.
In terms of software patches, we would want to pick up the new API for OpenSSL (internal defect ID CSCwk07994), which is integrated starting in 17.12.5, 17.15.1, and later releases. 17.12.5 and 17.15.2 also carry the software patches for defect CSCwm80596. You may consider moving to one of these releases, but please review the release notes for any version you plan to upgrade to so that you are aware of the supported features and open caveats on that release.