2013/08/18

Software Rants 15: The Window Tree

I've had a barrel of fun (sarcasm) with X recently, involving multi-seat, multi-head, multi-gpu - in general, multiples of things you can have multiples of but usually don't, so the implementations of such things in X are lacking at best and utterly broken at worst.

I am also becoming quite frustrated with openSUSE, trying to fix various graphical and interface glitches to get a working multihead system for my grandparents.

But I look towards Wayland, and while I appreciate the slimming, I have to be worried when things like minimizing, input, and network transport are hacks on top of a 1.0 core that has already shipped. It reeks of a repeat of the X behavior that led to the mess we have now.

So I want to talk about how I, a random user with no experience in writing a display protocol or server, would implement a more modern incarnation.

The first step is to identify the parts of the system involved. This might have been a shortcoming in Wayland - the necessary parts were not carved out in advance, so they needed to be tacked on after the fact. You can represent this theoretical system as an hourglass, with two trees on either side of a central management framework. In Linux terms, this would work through DRI and mode setting, but the principle is that you must map virtual concepts like desktops and windows (and more) onto physical devices, and do so in a fluid, organic, interoperable, hotpluggable fashion. This might be one of Wayland's greatest weaknesses: its construction doesn't lend itself to using an arbitrary protocol as a display pipe.

You would have a collection of display sinks - physical screens first and foremost, but also projectors, recorders, remote display servers, cameras to record from, and so on. They are all presented as screens: you can read a screen with the necessary permissions (through the display server), and to write a screen you must also go through the display server. You can orient these screens in a myriad of ways - disjoint desktops running in separate sessions, or disparate servers each managing separate displays, with inter-server display connectivity achieved either through a general wide-band network transport (rdp, udp, etc) or over a lower latency, lower overhead local interconnect (dbus). Servers claim ownership of the displays they manage, and so sit at a lower level than a userspace server like X or even the partially kernel-implemented Wayland. This supplants the need for redundant display stacks: right now virtual terminals are not managed by a display server but by the kernel itself, whereas in this implementation a virtual terminal would just be another possible desktop provided by the display server.
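To make the "everything is a screen" idea concrete, here is a rough sketch in Python - the Screen and DisplayServer classes and every field on them are hypothetical illustrations of the model, not an existing API:

    from dataclasses import dataclass, field

    @dataclass
    class Screen:
        name: str    # e.g. "screen0", "rscreen3", "recorder0"
        kind: str    # "panel", "projector", "recorder", "remote", "camera"
        width: int
        height: int

    @dataclass
    class DisplayServer:
        transport: str                      # "local", "dbus", "rdp", "udp", ...
        screens: dict = field(default_factory=dict)

        def claim(self, screen: Screen):
            # The server takes ownership; every read or write of this sink
            # now has to go through this server.
            self.screens[screen.name] = screen

    local = DisplayServer(transport="local")
    local.claim(Screen("screen0", "panel", 1920, 1080))
    local.claim(Screen("recorder0", "recorder", 1280, 720))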

Obviously, this server needs EGL and hardware acceleration where possible, falling back to llvmpipe otherwise. The system needs to target maximal acceleration when available, account for disparate compute resources, and not assume anything about the state of its execution environment - you could have variable numbers of processors with heterogeneous compute performance, bandwidth, and latency, and an arbitrary number of (hot pluggable) acceleration devices (accessible through DRI or GL) that may or may not be capable of symmetric bulk processing of workloads. Multiple devices can't be assumed to be clones, and while you should correlate and optimize for displays made available through certain acceleration devices (think PCI GPUs with display outs, vs the onboard outs, vs USB converters, vs a laptop where the outs are bridged, vs a server where the outs are on another host), you need to be open to accelerating in one place and outputting in another, as long as it is the most efficient use of resources.
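As a sketch of what "accelerate in one place, output in another" could look like, here is a hypothetical device-selection helper - the device list, its field names, and the usb-dl0 example are all made up for illustration:

    def pick_renderer(display, devices):
        # Prefer an accelerated device that scans out to this display directly.
        for dev in devices:
            if dev["accelerated"] and display in dev["outputs"]:
                return dev["name"]
        # Otherwise render on any accelerator and copy to whichever device
        # actually owns the output (e.g. a USB converter).
        for dev in devices:
            if dev["accelerated"]:
                return dev["name"]
        # Last resort: software rasterization, llvmpipe-style.
        return "software"

    devices = [
        {"name": "gpu0", "accelerated": True, "outputs": ["screen0", "screen1"]},
        {"name": "usb-dl0", "accelerated": False, "outputs": ["screen2"]},
    ]
    print(pick_renderer("screen2", devices))   # "gpu0": render there, output via usb-dl0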

So this isn't actually a display server at all; it is the abolition of age-old assumptions about the state of an operating computer system, assumptions that prevent the advancement of kernel-managed graphics as a whole. Tangentially, this relates to my Altimit conceptualizations - in my idealized merged firmware / OS model, where drivers are shared rather than needlessly replicated between firmware and payload, the firmware would initialize the core components of this display server model and use a standardized minimum set of accelerated APIs to present the boot environment on all available output displays (you wouldn't see network ones until that whole stack could be initialized, for example). Once the payload is done, the OS can reallocate displays according to saved preferences. But the same server would be running all the way through - using the same acceleration drivers, the same output protocols, the same memory mapping of the port sinks.

Sadly, we aren't there yet, so we can't get that kind of unified video (or unified everything, as the general case). Instead, we look towards what we can do with what we have now - we can accept that the firmware will use its own world of display management and refactor our world (the kernel and beyond) to use one single stack for everything.

So once you have this server running, you need to correlate virtual desktops and terminals to displays. The traditional TTY model is a good analogy here - when the server starts (as a part of the kernel) it would initialize a configured number of VTs allocated to displays in a configured way (i.e. tty1 -> screen0, tty2 -> screen1 which clones to remote screen rscreen3, tty4 -> recorder0 which captures the output, tty5 -> screen2 which clones to recorder1, etc). tty6 could be unassociated, and on screen0, with its associated keyboard input device, you could switch terminals like you do now. You could also have the same terminal opened on multiple displays, where instead of a display side clone you have a window side clone (i.e. not all output to, say, screen0 and screen1 is cloned, but tty15 outputs to both of them as simultaneous display sinking).
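A boot-time configuration for that kind of layout might look something like the sketch below - the terminal and sink names mirror the examples above, and the format itself is hypothetical:

    # Hypothetical VT-to-sink layout: the key is the virtual terminal, the
    # value the list of sinks it renders to (a window side clone when there
    # is more than one).
    vt_layout = {
        "tty1":  ["screen0"],
        "tty2":  ["screen1", "rscreen3"],    # cloned to a remote screen
        "tty4":  ["recorder0"],              # captured, never shown locally
        "tty5":  ["screen2", "recorder1"],   # shown and recorded
        "tty6":  [],                         # unassociated until switched to
        "tty15": ["screen0", "screen1"],     # simultaneous display sinking
    }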

A window manager would start and register itself with the server (probably over dbus) as a full screen application and request either the default screen or a specific screen - recalling that screens need not be physical, but could also be virtual, as is the case in network transport or multidisplay desktops. It is provided information about the screen it is rendering to, such as the size, the DPI, brightness, contrast, refresh, etc - with some of these optionally configurable over the protocol. This window manager may also request the ability to see or write to other screens beyond its immediate parent, and the server can manage access permissions accordingly per application.
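The registration handshake might look roughly like this - the function names, the stubbed-out dbus transport, and the reply fields are all assumptions, not a defined protocol:

    def register_fullscreen(bus, screen="default"):
        # In a real system this would be a dbus call; here the reply is faked.
        return {
            "screen": "vscreen0",
            "size": (1920, 1080),
            "dpi": 96,
            "refresh_hz": 60,
            "writable": ["brightness", "contrast"],   # negotiable over the protocol
        }

    def request_screen_access(bus, screens, mode="read"):
        # Per-application permission grants, decided server side.
        return {name: mode for name in screens}

    info = register_fullscreen(bus=None)
    grants = request_screen_access(bus=None, screens=["screen1"], mode="read")
    print(info["size"], grants)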

That desktop is, to the server, just a window occupying a screen as a "full screen controlling application" (akin to most implementations of a full screen application), and whenever it spawns new windowed processes, it allocates them as its own child windows. You get a tree of windows, starting with a root full screen application, which is bound to one or more displays to render to. It could also be bound to a null display, or to no display at all - in the former case you render to nothing, and in the latter you enter a freeze state where the entire window tree is suspended under the assumption that the application will be rebound later.
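A minimal sketch of that tree, covering the null-display and no-display cases - the Window class and its fields are invented for illustration:

    class Window:
        def __init__(self, name, parent=None):
            self.name = name
            self.parent = parent
            self.children = []
            self.displays = []        # [], ["null"], or real sinks
            self.frozen = False
            if parent:
                parent.children.append(self)

        def bind(self, displays):
            # Bound to "null": render to nothing.  Bound to nothing at all:
            # the whole subtree freezes until it is rebound.
            self.displays = displays
            self.set_frozen(len(displays) == 0)

        def set_frozen(self, state):
            self.frozen = state
            for child in self.children:
                child.set_frozen(state)    # suspension cascades down the tree

    desktop = Window("kwin")               # root: a full screen application
    terminal = Window("konsole", parent=desktop)
    desktop.bind(["screen0"])              # rendering normally
    desktop.bind([])                       # unbound: desktop and terminal suspend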

In this sense, a program running in full screen on a display and a desktop window manager are acting the same way - they are spawned, communicate with the central server as a full screen application running on some screen, and assume control. If you run a full screen application from a desktop environment, it might halt the desktop itself, or, more likely, move it to the null screen, where the desktop can recognize internally that it isn't rendering and stop on its own.

It would actually require some deeper analysis to decide whether you even want an application to be able to unbind from displays at all. Additionally, you often have many applications in a minimized state, but you want to give a full screen application ownership of the display it runs on (or do you?), so you would need to create virtual null screens dynamically for any application entering a hidden state.

Permissions, though, are important. You can introduce good security into this model - peers can't view one another, but parents can control their children. You can request information about your parent (be it the server itself or a window manager), and you can get information about the screen you are running on only with the necessary permissions to do so. Your average application should just care about the virtual window it is given, and support notification of when its window changes (is hidden, resized, closed, or maybe even when it enters a maximized state, or is obscured but still presented). Any window can spawn its own child windows, up to a depth and density limit (to prevent a windowed application from assaulting the system with forking) set by its parent, which is set by its parent, and so on, up to a display manager limit on how much depth and breadth of windows any full screen application may take.
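A sketch of how those cascading limits might be computed - the ceiling values and the dict format are made up; the point is only that a child can never receive more than its parent holds:

    SERVER_MAX_DEPTH = 16        # hypothetical display-manager-wide ceilings
    SERVER_MAX_CHILDREN = 1024

    def child_limits(parent, requested):
        # A child gets at most what its parent has left, minus one level of depth.
        depth = min(parent["depth"] - 1, requested.get("depth", SERVER_MAX_DEPTH))
        children = min(parent["children"], requested.get("children", SERVER_MAX_CHILDREN))
        if depth < 0:
            raise PermissionError("window tree depth exhausted")
        return {"depth": depth, "children": children}

    root = {"depth": SERVER_MAX_DEPTH, "children": SERVER_MAX_CHILDREN}
    wm = child_limits(root, {"children": 512})      # the window manager's budget
    app = child_limits(wm, {"children": 4096})      # clamped to 512 by the parent
    print(app)                                      # {'depth': 14, 'children': 512}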

The full screen application paradigm supports traditional application switching in resource constrained environments - when you take a screen from some other full screen application, the display server will usually place that application in a suspended state until you finish or close, or until a fixed timer limit on your ownership expires (a lot like preemptive multiprocessing, but with screens) and control is returned. Permissions are server side and cascade through children; while they can be diminished, they require first class windows with privilege ascension to raise them.
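A sketch of that preemptive hand-off, with a lease timer that returns the screen to the previous owner - the data structures and timings are hypothetical:

    import threading

    owners = {}          # screen name -> owning full screen application
    suspended = set()    # applications currently in the suspend state

    def take_screen(screen, app, lease_seconds=None):
        previous = owners.get(screen)
        if previous:
            suspended.add(previous)          # the old owner drops into suspend
        owners[screen] = app
        suspended.discard(app)
        if lease_seconds and previous:
            # Like preemptive multiprocessing, but with screens: when the
            # lease expires, control returns to the previous owner.
            timer = threading.Timer(lease_seconds, take_screen, (screen, previous))
            timer.daemon = True
            timer.start()

    take_screen("vscreen0", "kwin")
    take_screen("vscreen0", "doomsday", lease_seconds=300)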

You can also bind any other output device to an application's context. If you only want sound playing out of one desktop environment, you can control hardware allocation server side accordingly. The same goes for inputs - keyboards, mice, touchscreens, motion tracking, etc can all be treated as input, be it digital keycoding, vector motion, or stream based (like a webcam or a UDP input), and assigned to whatever window you want from the display server itself (which delegates all events to all fullscreens). Or you can bind them to a focus: in the same way you have default screens, you can have a default window according to focus and delegate events into it (at the application level, this would be managed by the window manager).
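A sketch of that kind of server side routing table - every device, window, and screen name here is invented, and the "focus:" resolution stands in for the window manager's own focus handling:

    bindings = {
        "keyboard0":    {"target": "focus:vscreen0"},        # follows that screen's focus
        "mouse0":       {"target": "focus:vscreen0"},
        "touchscreen0": {"target": "window:plasma-desktop"},
        "gamepad0":     {"target": "window:doomsday"},
        "webcam0":      {"target": "window:telepathy", "kind": "stream"},
    }

    focus = {"vscreen0": "kwin"}      # the window manager refines this further down

    def route(device, event):
        target = bindings[device]["target"]
        if target.startswith("focus:"):
            screen = target.split(":", 1)[1]
            return focus[screen], event        # delegated to whatever holds focus
        return target.split(":", 1)[1], event  # pinned straight to a window

    print(route("keyboard0", "keydown:F1"))    # ('kwin', 'keydown:F1')
    print(route("gamepad0", "axis:0.7"))       # ('doomsday', 'axis:0.7')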

You could also present a lot of this functionality through the filesystem - say, /sys/wm, where screens/ corresponds to the physical and virtual screens in use (similar to hard drives, or network transports, or audio sinks), and /sys/wm/displays is where the fullscreen parents reside, such as displays/kwin, displays/doomsday, or displays/openbox. These are simultaneously writable as parents and browsable as directories of their own child windows, assuming adequate permissions in the browser. You could write to another window to communicate with it over the protocol, or you could write to your own window as if writing to your framebuffer object. Since the protocol's initial state is always to commune with one's direct parent, you can request permissions to view, read, or write your peers, corresponding to knowing their existence, viewing their state, and communicating with them over the protocol. As a solution to the subwindow tearing problem, the server understands that movements of a parent are recursive to its children, such that moving, say, one window 5px means a displacement of all children by 5px, and a corresponding notification to each child window that it has moved, and that it has moved due to a parent's movement.
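The recursive movement rule is easy to sketch - the window dicts and the notify stub below are illustrative only:

    def move(window, dx, dy, cause="direct"):
        window["x"] += dx
        window["y"] += dy
        notify(window["name"], dx, dy, cause)
        for child in window["children"]:
            move(child, dx, dy, cause="parent")    # displacement cascades down

    def notify(name, dx, dy, cause):
        # Stand-in for the protocol notification each window would receive.
        print(f"{name}: moved by ({dx}, {dy}) due to {cause} movement")

    toolbar = {"name": "toolbar", "x": 10, "y": 0, "children": []}
    editor = {"name": "editor", "x": 0, "y": 0, "children": [toolbar]}
    move(editor, 5, 0)    # the editor and its toolbar both shift 5px, no tearing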

The means by which you write your framebuffer are not the display server's problem - you could use EGL, or just write the pixels (in a format established at window creation) into a buffer shared between the server and the process. Access to acceleration hardware, while visible to the display server, is a separate permissions model, probably based off the user permissions of the executing process rather than a separate permissions hierarchy per binary.

In practice, the workflow would be as follows: the system boots, and udev resolves all display devices and establishes them in /sys/wm/screens. This would include network transport displays, USB adapted displays, virtual displays (cloned windows, virtual screens combined from multiple other screens in an established orientation, duplicate outputs to one physical screen as overlays), and devices related to screens and visuals like cameras - or, in the future, holographics, or even something far out like an imaging protocol to transmit scenes to the brain.
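Enumerating those sinks could then be as boring as reading a directory - the path below is hypothetical (it doesn't exist on today's systems), so the sketch guards against that:

    import os

    SCREENS_DIR = "/sys/wm/screens"      # hypothetical path, not present today

    def enumerate_screens(path=SCREENS_DIR):
        if not os.path.isdir(path):
            return []                    # nothing resolved yet (or no such tree)
        return sorted(os.listdir(path))

    print(enumerate_screens())           # e.g. ['rscreen3', 'screen0', 'webcam0']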

Because output screens are abstracted, the display manager starts after udev's initial resolution pass and uses configuration files to create the default virtual screens. It doesn't spawn any windowing applications, though.

After this step, virtual terminals can be spawned and assigned screens (usually the default screen, which, in the absence of other configuration, is just a virtual screen spanning all physical screens, with its physical properties set to the lowest common denominator of feature support amongst the displays). By modern analogy, you would probably spawn VTs with tty1 running on vscreen0, and 2-XX in suspend states, ready to do a full screen switch when the display server intercepts certain keycodes from any input device.
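Building that default virtual screen is just a lowest-common-denominator fold over the physical ones - the field names and example modes here are made up:

    screens = [
        {"name": "screen0", "w": 1920, "h": 1080, "refresh": 60, "depth": 30},
        {"name": "screen1", "w": 1280, "h": 1024, "refresh": 75, "depth": 24},
    ]

    vscreen0 = {
        "name": "vscreen0",
        "w": sum(s["w"] for s in screens),             # span all screens side by side
        "h": max(s["h"] for s in screens),
        "refresh": min(s["refresh"] for s in screens), # lowest common denominator
        "depth": min(s["depth"] for s in screens),
    }
    print(vscreen0)    # {'name': 'vscreen0', 'w': 3200, 'h': 1080, 'refresh': 60, 'depth': 24}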

Then you could spawn a window manager, like kwin, which would communicate with this display server and do a full screen swap with its own display configuration - by default it would also claim vscreen0, swapping tty1 out to suspend. It would request all available input delegation and run its own windows - probably kdm as a window encompassing its entire display while it internally manages stopping its render loops. It would spawn windows like plasma-desktop, which occupy space on its window frame that it assigns over this standard protocol. plasma-desktop can have elevated permissions to view its peer windows (like kdm) and lax spawn permissions (so you can create thousands of windows on a desktop, each with thousands of their own children, without hitting any limits). If you run a full screen application from plasma-active, it can request a switch with the display server on the default screen or its current screen, or find out what screens are available (within its per-app permissions) to claim. If it claims kwin's screen, kwin would be swapped into a suspend state, which cascades to all its children windows. Maybe kwin also had permission to spawn an overlay screen on vscreen0, and forked a separate full screen application of the form knotify or some such, which would continue running after vscreen0 was taken by another full screen application; since it has overlay precedence (set in configuration), notifications could pop up on vscreen0 without vblank flipping or tearing server-side.

Wayland is great, but I feel it might not be handling the genericism a next generation display server needs well enough. Its input handling is "what we need", not "what might be necessary", which might prompt its obsolescence one day. I'd rather sacrifice some performance now to cover all our future bases and make a beautiful, sensible solution than optimize for the present case and forsake the future.

Also, I'd like the niche use cases (mirroring a virtual 1920x1080 60Hz display onto a hundred screens as a window, or capturing a webcam as a screen to send over telepathy, or having 3 DEs running on 3 screens with 3 independent focuses and input device associations across touchscreens, gamepads, keyboards, mice, trackpads, etc) to work magically - to fit in as easily as the standard use case (one display that doesn't hotplug, one focus on one window manager, with one keyboard and one tracking device).


















2013/08/01

Reddit Rants 2: So I wrote a book for a reddit comment

http://www.reddit.com/r/planbshow/comments/1je0xj/regulation_dooms_bitcoin_plan_b_17_bitcoin_podcast/cbe7mv1

So I really like the planb show! I guess I like debating macroeconomics. I can't post the entire conversation here because it's a back and forth. 2500 words on the second reply, though!