Problem Statement
Telegram desktop client crashes with a SIGABRT whenever it tries to render a webpage with QT webview. This seems to be a problem unique to Wayland-based systems running on the proprietary NVIDIA userspace GL/EGL drivers. Telegram upstream is aware of the issue but isn’t going to spend their time finding the root cause.
At the time of writing, this is reproducible with the following versions:
- Telegram desktop 6.8.2
- NVIDIA egl-wayland 1.1.21
- NVIDIA drivers 610.43.02
- QT6 6.11.1
OriginCode opened a ticket with egl-wayland over a month ago, but so far the ticket doesn’t seem to have received any replies. No choice but to take a stab at it myself.
Debug working notes
First look at a previous core dump
Systemd’s core dumper managed to capture Telegram’s dead memory space. Loading the core in GDB revealed an assertion failure in wlExternalApiLock, called from wlEglAcquireDisplay.
... [telegram aborting itself after an assertion failure]
#5 0x00007ffff066e4e5 in __assert_fail (assertion=<optimized out>, file=<optimized out>, line=<optimized out>, function=0x55555d94a1a0 "`\263w]UU") at assert.c:127
#6 0x00007fffcd421e1d in wlExternalApiLock () at ../src/wayland-thread.c:87
#7 0x00007fffcd4226c1 in wlEglAcquireDisplay (dpy=dpy@entry=0x55555d94a1a0) at ../src/wayland-egldisplay.c:1469
#8 0x00007fffcd4234dc in wlEglGetInternalHandleExport (dpy=<optimized out>, type=<optimized out>, handle=0x55555d94a1a0) at ../src/wayland-eglhandle.c:186
#9 0x00007fffc8331081 in ??? () at /usr/lib/libEGL_nvidia.so.0
#10 0x00007fffc82d62ce in ??? () at /usr/lib/libEGL_nvidia.so.0
#11 0x00007fffcd4295e3 in wlEglCreateStreamAttribHook (dpy=0x55555d94a1a0, attribs=0x7fffffffd150) at ../src/wayland-eglstream.c:200
#12 0x00007fffc83359e3 in ??? () at /usr/lib/libEGL_nvidia.so.0
#13 0x00007fffc82d6321 in ??? () at /usr/lib/libEGL_nvidia.so.0
#14 0x00007fffcd45d1ab in WaylandEglClientBuffer::setCommitted(QRegion&) () at /usr/lib/libQt6WaylandEglCompositorHwIntegration.so.6
#15 0x00007ffff30dc178 in QWaylandSurfacePrivate::surface_commit(QtWaylandServer::wl_surface::Resource*) () at /usr/lib/libQt6WaylandCompositor.so.6
... [A whole bunch of normal looking QT stuff. Presumed unrelated.]
Looking at the offending code in wlExternalApiLock, which wlEglAcquireDisplay calls:
int wlExternalApiLock(void)
{
    if (pthread_once(&wlMutexOnceControl, wlExternalApiInitializeLock)) {
        assert(!"pthread once failed");
        return -1;
    }

    if (!wlMutexInitialized || pthread_mutex_lock(&wlMutex)) {
        assert(!"failed to lock pthread mutex");
        return -1;
    }

    return 0;
}
The whether-mutex-is-initialized flag shows no weird signs, but the very mutex it’s trying to acquire is already locked.
(gdb) print wlMutex
$1 = {__data = {__lock = 1, __count = 0, __owner = 530834, __nusers = 1, __kind = 2, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = "\001\000\000\000\000\000\000\000\222\031\b\000\001\000\000\000\002", '\000' <repeats 22 times>, __align = 1}
(gdb) pipe info threads | grep 530834
* 1 Thread 0x7fffe86d3ac0 (LWP 530834) __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
The current thread already holds this mutex, but is somehow attempting to lock it again. NVIDIA egl-wayland initializes its mutex with PTHREAD_MUTEX_ERRORCHECK, so instead of hanging in a self-deadlock, pthread_mutex_lock() failed with EDEADLK, which triggered the assert().
Now the remaining question becomes where and why the fuck egl-wayland attempts to lock the same lock twice.
Tracking lock usage in GDB with overengineered python
In hindsight it should have been rather obvious that the earlier lock must have been taken somewhere in the upper stack frames, but my monkey brain is too comfortable writing Python plugins for GDB, both in my day job and when I struggle with Hollow Knight.
I loaded the script into GDB, set up the breakpoints that track where the lock is taken and released, and launched a new Telegram process.
INFO:__main__:Mutex 00007fffcd432380 unlocked by thread 530834
INFO:__main__:Mutex 00007fffcd432380 locked by thread 530834
INFO:__main__:Mutex 00007fffcd432380 unlocked by thread 530834
INFO:__main__:Mutex 00007fffcd432380 locked by thread 530834
INFO:__main__:Mutex 00007fffcd432380 locked by thread 530834
Telegram: ../src/wayland-thread.c:87: wlExternalApiLock: Assertion !"failed to lock pthread mutex" failed.
Thread 1 "Telegram" received signal SIGABRT, Aborted.
Two successive lock acquisitions on the same thread, with no unlock in between. My script saved a backtrace for each of them, which should reveal more details.
[BEGIN lock attempt -2]
#0 wlExternalApiLock () at ../src/wayland-thread.c:79
#1 0x00007fffcd4292b9 in wlEglCreateStreamAttribHook (dpy=0x55555d94a1a0, attribs=0x7fffffffd140) at ../src/wayland-eglstream.c:82
#2 0x00007fffc83359e3 in ??? () at /usr/lib/libEGL_nvidia.so.0
#3 0x00007fffc82d6321 in ??? () at /usr/lib/libEGL_nvidia.so.0
#4 0x00007fffcd45d1ab in WaylandEglClientBuffer::setCommitted(QRegion&) () at /usr/lib/libQt6WaylandEglCompositorHwIntegration.so.
...
[BEGIN lock attempt -1]
#0 wlExternalApiLock () at ../src/wayland-thread.c:79
#1 0x00007fffcd4226c1 in wlEglAcquireDisplay (dpy=dpy@entry=0x55555d94a1a0) at ../src/wayland-egldisplay.c:1469
#2 0x00007fffcd4234dc in wlEglGetInternalHandleExport (dpy=<optimized out>, type=<optimized out>, handle=0x55555d94a1a0) at ../src/wayland-eglhandle.c:186
#3 0x00007fffc8331081 in ??? () at /usr/lib/libEGL_nvidia.so.0
#4 0x00007fffc82d62ce in ??? () at /usr/lib/libEGL_nvidia.so.0
#5 0x00007fffcd4295e3 in wlEglCreateStreamAttribHook (dpy=0x55555d94a1a0, attribs=0x7fffffffd150) at ../src/wayland-eglstream.c:200
#6 0x00007fffc83359e3 in ??? () at /usr/lib/libEGL_nvidia.so.0
#7 0x00007fffc82d6321 in ??? () at /usr/lib/libEGL_nvidia.so.0
#8 0x00007fffcd45d1ab in WaylandEglClientBuffer::setCommitted(QRegion&) () at /usr/lib/libQt6WaylandEglCompositorHwIntegration.so.6
...
[Lock-acquire attempt fails here and causes SIGABRT]
An unknown piece of code in the proprietary NVIDIA libEGL_nvidia.so called wlEglCreateStreamAttribHook, which then somehow deadlocked against itself. I guess it’s time to read a bit more about egl-wayland and its purpose. A random Google search landed me on an NVIDIA presentation about libEGL from XDC2016. Looks like we have some wild pointer chasing and circular calling ahead of us.
egl-wayland API trampoline madness
The commentary inside the code base isn’t very enlightening, but together with those slides from NVIDIA I can kinda make an educated guess on the execution flow (after installing QT6 symbols).
[Qt6] WaylandEglClientBufferIntegrationPrivate::initEglStream
|
| 0. QT makes EGL create_stream_attrib_nv() call with an "external"
| display.
v
[libEGL_nvidia.so]
|
| 1. Device EGL calls "external platform API".
v
[libnvidia-egl-wayland.so.1.1.21]
wlEglCreateStreamAttribHook(display, attribute)
|
======== | ===== wlExternalApiLock() held.
H |
H | 2. With the external API lock held, it fetches some display
H | metadata through the following API.
H | - wl_eglstream_display_get()
H | - wl_eglstream_display_get_stream()
H | With this information, it invokes the device EGL again
H | to actually create the stream.
H v
H data->egl.createStreamAttrib(display, modified_attrib)
H [libEGL_nvidia.so] Function pointer set during driver init.
H |
H | 3. Calls wlEglGetInternalHandleExport through external API.
H v
H [libnvidia-egl-wayland.so.1.1.21]
H wlEglGetInternalHandleExport(display, EGL_OBJECT_DISPLAY_KHR, display)
H |
H v
H [libnvidia-egl-wayland.so.1.1.21]
H wlEglAcquireDisplay(display)
H |
H | 4. Attempts to lock the external API lock again with
H | wlExternalApiLock().
H |
H <========== PTHREAD_MUTEX_ERRORCHECK deadlock assertion tripped.
The wlExternalApiLock() logically protects the global linked list of displays against data races, so that a display or an associated stream cannot just disappear mid-use. Functions that validate streams/displays or modify the list internally take this lock. But does it make sense to keep holding it across the “device-platform-device-platform” trampoline that ends in step 4?
The Fix
Whatever downstream “platform” external EGL API it calls seems to do a fairly good job of making sure incoming pointers remain valid regardless of who (application or platform side) calls it. Looks like the way out is to release the lock as soon as the lookups in step 2 of the chart are done, so that the remaining device EGL calls can take the lock themselves as needed.
diff --git a/src/wayland-eglstream.c b/src/wayland-eglstream.c
index 3c40a0d..611e773 100644
--- a/src/wayland-eglstream.c
+++ b/src/wayland-eglstream.c
@@ -89,15 +89,17 @@ EGLStreamKHR wlEglCreateStreamAttribHook(EGLDisplay dpy,
}
if (err != EGL_SUCCESS) {
- goto fail;
+ goto fail_unlock;
}
wlStream = wl_eglstream_display_get_stream(wlStreamDpy, resource);
if (wlStream == NULL) {
err = EGL_BAD_ACCESS;
- goto fail;
+ goto fail_unlock;
}
+ wlExternalApiUnlock();
+
if (wlStream->eglStream != EGL_NO_STREAM_KHR ||
wlStream->handle == -1) {
err = EGL_BAD_STREAM_KHR;
@@ -237,12 +239,11 @@ EGLStreamKHR wlEglCreateStreamAttribHook(EGLDisplay dpy,
wlStream->eglStream = stream;
wlStream->handle = -1;
- wlExternalApiUnlock();
-
return stream;
-fail:
+fail_unlock:
wlExternalApiUnlock();
+fail:
wlEglSetError(data, err);
return EGL_NO_STREAM_KHR;
}
Submitted upstream as https://github.com/NVIDIA/egl-wayland/pull/194
Appendix
Crude python script used to make GDB dump where and who took and released the mutex.
Mashed up GDB python plugin
#!/usr/bin/env python3
import json
import logging
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Literal

import gdb

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class Trace:
    kind: Literal["lock", "unlock"]
    thread_id: int
    stacktrace: List[str]


# Mutex address -> trace info
_trace_db: Dict[int, List[Trace]] = defaultdict(list)
_breakpoints: List[gdb.Breakpoint] = []


class MutexLockWatchpoint(gdb.Breakpoint):
    def __init__(self, trace_db: Dict[int, List[Trace]], track_lock: bool):
        if track_lock:
            super().__init__("wlExternalApiLock", gdb.BP_BREAKPOINT)
        else:
            super().__init__("wlExternalApiUnlock", gdb.BP_BREAKPOINT)
        self.trace_db = trace_db
        self.track_lock = track_lock

    def stop(self) -> bool:
        # Fires on entry to the function, so a "lock" record is an attempt,
        # not necessarily a successful acquisition.
        kind = "lock" if self.track_lock else "unlock"
        stacktrace = gdb.execute("bt", to_string=True)
        mutex_address = int(gdb.parse_and_eval("&wlMutex"))
        # ptid is (pid, lwpid, tid); the LWP id is what "info threads" shows.
        thread_id = gdb.selected_thread().ptid[1]
        self.trace_db[mutex_address].append(
            Trace(kind=kind, thread_id=thread_id,
                  stacktrace=stacktrace.splitlines()))
        logger.info(
            f"Mutex {mutex_address:016x} {kind}ed by thread {thread_id}")
        # Log only; let the inferior keep running.
        return False


def start_watch():
    global _breakpoints
    global _trace_db
    if len(_breakpoints) == 0:
        _breakpoints.append(MutexLockWatchpoint(_trace_db, True))
        _breakpoints.append(MutexLockWatchpoint(_trace_db, False))
    else:
        logger.warning("Watchpoints already set up. Skipping.")


def end_watch():
    global _breakpoints
    for bp in _breakpoints:
        bp.delete()
    _breakpoints = []


def dump_traces(out_file: str):
    global _trace_db
    # Trace objects are not JSON serializable, so flatten them to dicts.
    mapped = {k: [{"kind": t.kind, "thread_id": t.thread_id, "stacktrace": t.stacktrace} for t in v] for k, v in _trace_db.items()}
    with open(out_file, "w") as f:
        json.dump(mapped, f, indent=2)
    logger.info(f"Traces dumped to {out_file}")


class MutexWatchCmd(gdb.Command):
    def __init__(self):
        super().__init__("mutex_watch", gdb.COMMAND_DATA)

    def invoke(self, arg, from_tty):
        argv = gdb.string_to_argv(arg)
        if len(argv) == 0:
            logger.info("Usage: mutex_watch <start|end|dump FILE>")
            return
        if argv[0] == "start":
            start_watch()
            return
        if argv[0] == "end":
            end_watch()
            return
        if argv[0] == "dump":
            if len(argv) < 2:
                logger.info("Where to dump???")
                return
            dump_traces(argv[1])
            return


if __name__ == "__main__":
    MutexWatchCmd()





