Mind Dump, Tech And Life Blog
written by Ivan Alenko
published under license CC4-BY
posted in category Systems Software / Desktop
posted at 03. Jun '24

Exiting GPU process because some drivers can’t recover from errors

Recently I updated my NVIDIA Linux drivers to:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080        On  |   00000000:09:00.0  On |                  N/A |
|  0%   41C    P8              7W /  320W |    1550MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

and I got these errors all the time when running Chromium:

máj 08 03:02:12 rapthalia kwin_x11[5007]: qt.qpa.xcb: QXcbConnection: XCB error: 3 (BadWindow), sequence: 62375, resource id: 79691784, major code: 18 (ChangeProperty), minor code: 0
máj 08 03:04:38 rapthalia krunner[9371]: [9371:9371:0508/030438.745549:ERROR:shared_context_state.cc(1079)] SharedContextState context lost via ARB/EXT_robustness. Reset status = GL_GUILTY_CONTEXT_RESET_KHR
máj 08 03:04:38 rapthalia krunner[9371]: [9371:9371:0508/030438.745705:ERROR:gpu_service_impl.cc(1124)] Exiting GPU process because some drivers can't recover from errors. GPU process will restart shortly.
máj 08 03:04:38 rapthalia krunner[6850]: [6850:6850:0508/030438.765977:ERROR:command_buffer_proxy_impl.cc(323)] GPU state invalid after WaitForGetOffsetInRange.
máj 08 03:04:38 rapthalia krunner[6850]: [6850:6850:0508/030438.800885:ERROR:gpu_process_host.cc(997)] GPU process exited unexpectedly: exit_code=8704

DMESG - NVRM: krcWatchdogCallbackVblankRecovery_IMPL Error

The output from dmesg:

[13517.427715] NVRM: GPU at PCI:0000:09:00: GPU-39d6bee3-b86c-946b-b921-9d8ca886556b
[13517.427724] NVRM: Xid (PCI:0000:09:00): 16, pid='<unknown>', name=<unknown>, Head 00000003 Count 000bd943
[13517.427731] NVRM: krcWatchdogCallbackVblankRecovery_IMPL: NVRM-RC: RM has detected that 7 Seconds without a Vblank Counter Update on head:D0
[13525.619416] NVRM: Xid (PCI:0000:09:00): 16, pid='<unknown>', name=<unknown>, Head 00000003 Count 000bd944
[13525.619432] NVRM: krcWatchdogCallbackVblankRecovery_IMPL: NVRM-RC: RM has detected that 7 Seconds without a Vblank Counter Update on head:D0
[13533.811249] NVRM: Xid (PCI:0000:09:00): 16, pid='<unknown>', name=<unknown>, Head 00000003 Count 000bd945
[13533.811268] NVRM: krcWatchdogCallbackVblankRecovery_IMPL: NVRM-RC: RM has detected that 7 Seconds without a Vblank Counter Update on head:D0
[13542.002862] NVRM: Xid (PCI:0000:09:00): 16, pid='<unknown>', name=<unknown>, Head 00000003 Count 000bd946
[13542.002879] NVRM: krcWatchdogCallbackVblankRecovery_IMPL: NVRM-RC: RM has detected that 7 Seconds without a Vblank Counter Update on head:D0
[13550.194621] NVRM: Xid (PCI:0000:09:00): 16, pid='<unknown>', name=<unknown>, Head 00000003 Count 000bd947
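
To see how regular these events are, the Xid lines can be pulled out of the log and the gaps between them measured. The snippet below is only a minimal sketch: the regular expression is my own guess at the line format based on the output above, and `parse_xid_events`, `intervals` and `SAMPLE` are hypothetical names, not part of any NVIDIA tool.

```python
import re

# Matches the "NVRM: Xid" lines as they appear in the dmesg output above.
# The pattern is an assumption derived from those lines only.
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid "
    r"\((?P<pci>PCI:[0-9a-fA-F:]+)\): (?P<xid>\d+),"
)


def parse_xid_events(text):
    """Return a list of (timestamp_seconds, pci_address, xid_code) tuples."""
    events = []
    for line in text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (float(match.group("ts")), match.group("pci"), int(match.group("xid")))
            )
    return events


def intervals(events):
    """Seconds elapsed between consecutive Xid events, rounded to milliseconds."""
    return [round(b[0] - a[0], 3) for a, b in zip(events, events[1:])]


# Three of the Xid lines from the dmesg output above, copied verbatim.
SAMPLE = """\
[13517.427724] NVRM: Xid (PCI:0000:09:00): 16, pid='<unknown>', name=<unknown>, Head 00000003 Count 000bd943
[13525.619416] NVRM: Xid (PCI:0000:09:00): 16, pid='<unknown>', name=<unknown>, Head 00000003 Count 000bd944
[13533.811249] NVRM: Xid (PCI:0000:09:00): 16, pid='<unknown>', name=<unknown>, Head 00000003 Count 000bd945
"""

if __name__ == "__main__":
    events = parse_xid_events(SAMPLE)
    for ts, pci, xid in events:
        print(f"{ts:.3f}s  {pci}  Xid {xid}")
    print("gaps between events (s):", intervals(events))
```

On the lines above, the gaps come out at a bit over 8 seconds each, so the watchdog fires on a steady cadence for as long as the freeze lasts. Saved dmesg output can be passed to `parse_xid_events` the same way.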

It is supposed to be fixed in 555.42.02 (https://github.com/NVIDIA/open-gpu-kernel-modules/issues/632), yet I still see it on 550.54.15, which is older than that release. The issue is still open, though, so there might be more going on.
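
A quick sanity check on the two version numbers mentioned above, as a minimal sketch. It assumes driver versions can be compared component by component as integers; `parse_version` is a hypothetical helper, not an NVIDIA API.

```python
def parse_version(version):
    """Split a dotted driver version like '550.54.15' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


installed = parse_version("550.54.15")  # from the nvidia-smi output above
fixed_in = parse_version("555.42.02")   # release said to contain the fix

# Tuples compare element by element, so this is a plain version comparison.
has_fix = installed >= fixed_in

if __name__ == "__main__":
    print("installed:", installed)
    print("fixed in: ", fixed_in)
    print("installed driver includes the fix:", has_fix)
```

Under that assumption, the installed 550 series driver sorts before 555.42.02, so it would not include the fix yet.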

Why now? How? I don’t know. But it freezes the Chromium window, or the whole desktop, for anywhere from 30 seconds to a minute. I thought NVIDIA drivers were, while very proprietary, rock solid, but it seems I’m fucked. Others wrote that this error is not caused by Chromium itself but is a side effect of a problem in the drivers. Why can’t we just have stable graphics drivers when I want to use CUDA or OpenCL? Man, it was a nightmare in 2005, and by 2010 it had gotten much better, but still, here we are, 15 years later. If the drivers were open source, others could at least take a look and fix stuff.

Best bet is to reinstall the operating system.
