2012/11/14

Software Rants 5: Architecting a 21st Century OS

In multiple posts on this pile of cruft and silliness I've spoken about how I don't like the design of most modern operating systems.  I've even written some poorly conceived ideas on the subject.  So how would I go about building something that, in a couple of cases, has comparatively taken billions of man hours to create? (Although projects like Haiku make me feel a bit more optimistic.)

So I said microkernel.  This theoretical system has one goal: minimizing kernel-space code - anything platform specific that doesn't directly tie into the ability to execute software should definitely be userspace.  Linux has the philosophy that one source archive needs to compile, with a configuration, to work on any system ever, from the lowly phone to a massive supercomputer cluster of thousands of nodes.  You will get tremendously different binaries and KO sets depending on your configuration.  The difference between a PPC64 supercomputer kernel with all necessary modules compiled in and a udev-based, dynamic-module-loading ARM kernel for phones makes them barely resemble each other - besides some basic tenets, the same command line arguments, the idea of loading an init payload, initializing devices, and using the same (similar?) memory management and scheduling algorithms (some builds of the kernel use different ones...), the resulting systems will be using entirely different tracks through the source tree.

I disagree with that, in that those aren't the same software project anymore, and keeping them in the same source tree is the exact opposite of the do-one-thing-well-and-concisely UNIX ideology.  And that is fine, because it means the kernel is ultraportable: you can get an upstream codebase, build it, and run it on practically anything, and with a source clone you can customize the configuration to suit almost any use case. 

But there is also a reason Linux takes multiple levels of management and a ton of organization behind it - everything from device drivers to hardware translation layers to language documentation to binary blobs gets merged into one program, and it becomes impossible for any one person to understand or conceptualize such divergent tech.  I don't call myself smart, but I do think smart people are averse to complexity where possible, and this is definitely one of the places I see ample unneeded complexity.

So here is my proposition: a microkernel core, designed and written in L, that only handles virtual memory management, preemptive process scheduling ("virtual execution"), a virtual filesystem abstraction layer (so that aspects of this kernel can be mapped into the final root filesystem without awkward dependencies on user-space filesystem managers), a virtual socket layer (for communication purposes - the sockets themselves would be managed in userspace, but the kernel would initialize this system so that some user-space network daemon can manage socket communication; the kernel itself will be using sockets as well, to communicate hardware state and to talk with the daemons it connects to!), and a hardware abstraction layer that allows user-space privileged daemons to take control of hardware components (disks, video devices, buses and USB hubs, etc. - basically everything divergent from the RAM/CPU mix*).

*: I would be interested in exploring whether there would be a way to have the memory controller in user space, in that the kernel could start and initialize itself only in processor space, but it seems hopelessly complex... you can't even establish any other virtual systems without system memory to use.  It would have the same issue a filesystem host would have: the kernel would need to start up this daemon in advance, and then fall back to its own boot procedures.

The traditional boot philosophy is: on -> firmware is payloaded and initialized by hardware -> firmware does primitive device detection, scans persistent storage buses for something with a recognizable partition table, and payloads a bootloader from it -> hands off a "direct media interface table" (even though that term is Intel jargon), with some hardware memory mapping used to provide devices.

UEFI is like this, with more complex device loading (including a 3D driver of some sort for the graphical setups) and the ability to scan not just partition tables but FAT32 filesystems for bootable executables.  It is pretty much an OS in and of itself, considering how it behaves.

In my grand delusions, we could scrap the unnecessary parts - the important aspects of a boot cycle are device initialization, error detection, and searching for a payload.  Device initialization is already pretty well "passed on" to the resulting OS.  Error reporting is more complicated, because you are dealing in a world where the most you have access to may be some primitive bus signals to indicate problems, such as beep codes, error panels on the board, or keyboard signals to blink the LEDs over the PS/2 port.  BIOS and EFI boot procedures are obscured by platform differences - each new chipset does things differently, merging more parts onto the CPU or handling device channels differently.  In terms of payload searching, EFI actually does a really good job - given a boot device, and a system partition on it, load a binary.  No need for traditional bootloaders (which is nice).

On -> check for CPU / memory errors; on error, try to signal all devices with some special hardware failure signal and a payload error descriptor.  Devices would need to be designed to handle this signal if they have some way to display errors (VGA, keyboards, sound systems), or drop it otherwise.  The expectation is that "on" means devices power on, and independent units like network controllers come online at the same time and await the CPU to initialize them.

Check errors -> initialize devices.  Given no catastrophic errors, check if any other device has errors, and if not, build a table of devices to hand to the payload being booted.  If a device has an error, broadcast another error payload signal to let anything capable alert the "outside" of the problem.  But don't stop booting if the error is recoverable - just indicate a failed device in the provided "table".

Initialize devices -> payload.  You have a table of devices, and the firmware needs to be aware of where it can find something to payload.  In terms of binary byte ordering, that is an open question - we would prefer big endian for readability if it doesn't incur a hardware complexity cost.  Almost nobody should ever be working with binary data representations at this scale anyway, but if we can do big endian without circuitry overhead, we should; otherwise, keep it simple, stupid, and use little endian.

Since we effectively have an integrated bootloader, we need some very simple file system to just store binaries.  Now here is a point of contention - have complex file system analysis machinery that can read our "stock" FS type (which would absolutely be in the btrfs vein of COW, auto-compressing, snapshotted filesystems), or have a discrete filesystem just for the bootloader to read off.  We need to think of MBRs here - and logical and physical sector sizes.

I want to propose variable sized sectors, the same way there are variable sized memory pages - 4K memory pages and 4K disk sectors are great defaults, but you can always use larger contiguous blocks of... both.  In both cases, small page and sector sizes impose overhead on the managers of these spaces, while larger sizes impose overhead of their own when boundaries are not nicely met.

For one, traditional storage media just can't have variable sized physical sectors.  Having different logical sectors seems silly, because it is a great simplification of work to have 4K sector sizes in both cases.  That will continue to work well for some time, and a sufficiently smart operating system can optimize sector utilization to minimize wasted space.  That is a device driver problem, though.

In terms of hardware memory pages, I still think swapping has value even if traditional desktops don't need it anymore - too many problems just require tremendous amounts of memory, beyond the bounds of traditional computing concepts, and we like embedded systems with low memory.  Even if you could theoretically implement something akin to paging with file system abstractions (writing to and from disk once you approach the physical memory limit), having the option there has proven to be worth it.

So page sizes - we don't want too many sizes, and we want them to scale nicely if possible.  This would require research and insight I don't possess, but you definitely want to support variable sized pages.

So we assume disks have 4K sectors, pages are at least 4K, and we may or may not have a dedicated partition standard for bootable binaries with some specialized file system.  We will need a disk partition table, and if we're mandating 4K sectors, we have 4 KiB to store info on the device.  I like how GPT has a backup copy of itself, so we want one of those - so it's really 8 KiB total, in the first and last sectors.  In terms of sector addressing, I'm starting to think about 48 bit as an option - the overhead of 64 bit just for 64 zettabytes seems unnecessary when 48 bits gives an exabyte of storage.  Currently, the largest storage device is approximately 4 TB, up from last year's 3 TB, the year before that at 2.5 TB, the year before that at 2, the year before that at 1.5, etc.  So if we go with a terabyte a year (which is about right for the last 5 years), we have a few decades before this becomes an issue, and we can just add a 64-bit-addressable 2.0 version anyway, since we want userspace drivers.

Similarly, I'm not entirely sold on pure 64 bit CPUs either.  The real big argument for large memory address spaces is that server farms with shared memory need to address all that space, but you still get 256 tebibytes (2^48 bytes) of address space with 48 bits.  I'd probably make the 1.0 of this system 48 bit, and make sure the language and instruction set are inherently valid converting to 64 and maybe even 80 / 96 bit.  This is actually really easy in the conceptualization of L, because you would have pointers as objects, and their size would be compile-specific to the platform pointer size.  Integers are decoupled from words, so you can create ints of all sizes from 1 to ~16 bytes (I don't see why you wouldn't implement 128 bit integers on a modern architecture).  This also brings up the potential to just do the 64 bit virtual / 48 bit physical addressing Intel uses, but there is a translation in there that adds unnecessary complexity in my opinion.

So: 48 bit CPU, 48 bit sector addresses, 4K standard pages and sectors.  Big endian unless there is a complexity or performance hit, in which case we just use little.  We want point-to-point memory rather than dual channel, and I already talked about CPU architecture earlier - having a heterogeneous collection of registers, ALUs, FPUs, SIMD cores, etc. to run dedicated parallel instructions.  This way the traditionally discrete GPU and CPU cores (even on a shared die) can be more tightly integrated.  You could also use one single pool of RAM, and not have to reserve part of it for the GPU cores.  As long as the instruction set is designed around supporting SIMD instructions for the parallel processing cores, we should be sound.

So we have our payload, we run it, and we set up a process scheduler that I will look into more in the future (really, this is the kind of decision that takes a ton of reading to figure out the best preemptive scheduler for the purpose, but CFS, I guess, is the industry standard).  We have the virtual memory, and we need some way for the kernel to initialize physical hardware virtual memory mappings.

So we don't want the kernel explicitly dealing with device management, but it needs to initialize device memory, so we can just add that into the device table the firmware provides - one of the hardware signals can be for the memory map size.  The kernel can then map memory addresses to the given table's memory requirements, and when an application accesses device hooks, it also gets control of the virtual page table referring to that device.

So scheduler and memory are up.  We want a virtual file system now - no hardware folders are even initialized yet, but we can provide a virtual hardware file system for device access.  This would be an elevated privilege folder - as an abstraction, device control servers can open devices for writing to take exclusive control over them, and the writable file is their memory map.  I proposed a file system layout earlier, so here we are talking about something akin to /System/Hardware/*.  The VFS would populate a node per hardware device provided, the folder would require level 1 permissions to access, and once a device server has control of the virtual memory map with the device opened for writing, the only thing other servers can do is read it.

So this virtual file server needs the concept of permissions at the kernel level - we want running executables to have a level of privilege, beyond the bounds of kernel vs user mode processor execution state.  This is a software privilege, where level 0 is the kernel, and level 1 is init and systemwide access services and daemons, which have a "pure" view of the file system.  Level 2 would be the usual run mode of programs - restricted access to system essential files, restricted views of the file system, and each application would have device privileges specific to it, granted by a level 1 service.

Some examples: the VFS, socket layer, memory mapping, and scheduler operate in level 0 kernel mode.  A GPU driver, device controller, USB host, file system daemon, DHCP host, SMB server, or virtual machine service would run at level 1.  You want two levels of indirection here in most cases - a sound server to manage the audio device servers running, a display server to manage display adapters, a network server to manage networking devices, etc.  Access to these servers is restricted in the VFS and VSS through executable privileges, probably by a level 1 executor server that wraps kernel level execution behavior.  Basically, the kernel sets up permissions that the executor server manages, since level 2 permissions can't access level 0 directly.  Level 3 is the "pure sandbox" - it would be started by level 2 programs (like a VM), has no device access directly, only has its own restricted view of the VFS, and by default has no write permissions outside its execution context.  You could thus host users from level 2 session managers (maybe run by an administrator) and they would be unable to manipulate the outer system by design.

So we have 4 permission levels right now, and you could theoretically add more to get further levels of virtualization.  A virtual machine could thus just be a level 2 program that pipes commands from level 4 devices through the virtual memory map of a level 3 kernel into the level 1 devices above it.  Very slick, I think.

The other major revelation is the idea of display management.  In the absence of dedicated video hardware to control by a level 1 daemon, the video daemon could itself emulate a video server.  Or you could set up the userspace where the kernel is level 0, device controllers are level 1, device managers are level 2, and user applications are level 3, so that user applications never interface with devices directly but only through abstraction layers.  I actually like that somewhat more than the other model in some use cases.

And then of course the traditional server model can just run everything at level 1.  It isn't kernel mode, but it has device mapper access, so you can set up the traditional ultra-fast interconnects.  This alleviates the problem of FUSE and its ilk in Linux, because you don't have to emulate the device controller without any hooks and pipe through it - you inherit devices into user space.

So I'll talk about the device / manager services more next post.
