Operating Systems Design Principles

This page is about high-level OS concepts and desirable transcendental properties of OSes, not implementation details.

Issues such as global versus per-process address spaces have no place here, except as a derivation. Same thing with synchronous versus asynchronous communication.

Performance has no place here, even as a derivation. Every OS can be made to perform well by using advanced compiler techniques, and if that fails then systematically breaking down component boundaries. Make it work, make it right, make it fast. Fast is last, always.

Performance is strictly a matter of engineering and so simply isn't a valid concern of OS design. The same thing with scalability; running on different kinds and classes of hardware with adequate performance on each. See Design Vs Engineering.

This turns out not to be the case. For example, Ell Four has been shown to be inherently faster than Mach Microkernel, because it is a tighter design. (Ell Four's message passing semantics are simpler, so message passing is about an order of magnitude faster; and the Micro Kernel is only about 12K, which means much less cache thrash.) Tests have shown that the performance cost of hosting Linux on Mach Microkernel is about 10 times as high as the cost of hosting it on Ell Four. This should not surprise anyone who has solved difficult performance problems; any substantial program which has not been designed for performance often has semantics which simply cannot be implemented efficiently. Consider the difference between HTTP/1.0 and HTTP/1.1; if browsers are HTTP/1.1-compliant, then an HTTP/1.0 server will never perform as well as an HTTP/1.1 server, no matter how much engineering has been put into it. -- John Stracke

It has to be understood that OS design isn't taught in "OS design" classes and books like "OS concepts" are a twisted joke. Teaching low-level implementation details of Unix in OS Design classes is the quintessence of a Cargo Cult. The material on this page comes as close to an introduction to the subject of OS design as exists either online or off. Without further ado,

The big list of fundamental OS principles,

Elegance

Simplicity

Integrity = Elegance + Uniformity

Resilience

Robustness

Modularity

Composability

Name Space principles (eg. Connectedness)

Uniformity, or Consistency

Extensibility

Distribution

Reflection (includes Monitors)

Reliability

Self-documentation

Predictability / Understandability

Reversibility

Liveness

Directness

Correct(ness-proved)

Issues requiring special attention:

Political Economy (Resource Management / Resource Allocation / Real Time being features)

Relative Importance of Principles

Note that these are ALL reconcilable with each other, both in logic and in practice. IOW, it is possible to create an OS that satisfies all of them.

Of course, it is wise to spend one's time satisfying the most important ones and leave out those, like correctness proofs, which provide little benefit at enormous cost. The question is, which principles are essential to include in any new OS?

The answer is that anything which every other OS project has accomplished or seeks to accomplish ... is completely non-essential. There are dozens of OS projects that sought and have achieved highly-performing, reliable, distributed, secure operating systems. If a new OS is to stand out from the crowd, to challenge the established antediluvian systems, it must do something different. In fact, it must do something which it is impossible for the established OSes to do. Performance, reliability, distribution and even security to some degree, are simply not it!

If you're an OS designer and someone tells you "your OS must have so and so because every other OS already has it", you can choose to comply with the narrow-minded idiot and be a complete loser. (What else to call someone intent on reinventing the wheel?) Or you can choose to do something difficult, risky and new.


Security

Security has two halves: preventing users from accessing objects they have no right to access and allowing users to access objects they have every right to access. Sometimes it seems like OSes are written by Fascists; eg, it is impossible to share any object in Unix in any meaningful way. Other times, it seems like they're written by people who don't have the faintest notion of what privacy means. Genuine security has a lot of implications. See www.caplet.com , Capability Computing, Eros Os.

Note that dealing with Denial Of Service, deadlocks and the like has nothing to do with security. They are economic (Resource Allocation) problems; a user should not have so much power or money that they can secure a resource forever. And if they do, then DoS / deadlock is perfectly legitimate behaviour.

I'm reading up on no-wait based concurrency, it strikes me that the 'no-indeterminate wait/finite number of steps to finish any operation' aspect would be a solution to the DoS problem; the worst-case is that an operation takes longer to complete... no DoS attack could hold off any arbitrary user task indefinitely. It follows that any operating system/language must provide an atomic primitive operation with a consensus number of infinity it order to provide lock-free synchronization. I need to reread the papers, but it struck me that message passing is not sufficient to do this without in-order delivery of broadcast messages. Other primitives which are sufficient include atomic test-with-conditional-swap (although this might be limited to consensus=[capacity of operand]), and atomic memory move (I'm not completely clear on this one). -- William Underwood

I seriously doubt that DoS is a security problem in the first place. It seems to me to be an economic problem at its core. People give away an unlimited amount of money to whoever wants it and then they whine when they buy out your finitely priced CPU / network capacity. -- RK

DoS is in no way abated with lock-free concurrency. DoS occurs precisely because things are lock-free. To wit, the Internet. One of the best ways to attack a website is to have millions of (scripted) clients conducting a perfectly legal and fully asynchronous operation like obtaining a website. The resources consumed are completely transient -- nobody has exclusive rights to the server's network connection for longer than a single HTTP request. The volume of the requests, however, are what denies service to other, legitimate users. Had a legitimate user had the ability to exclusively lock a chunk of bandwidth for his/her purposes, then this becomes a non-issue.

However, the opposite situation happens. If you allow exclusive assignment of resources, now you run into deadlock situations. It's a no-win battle. You either deal with DoS, or you deal with deadlock. --Samuel Falvo

I disagree. DoS would be much easier if things were lockable. One legitimate user could far more easily deny service to other legitimate users.

Elegance

Sometimes called Quality Withouta Name or Right Thing though it is neither. Elegance is achieved when every property of the system is derived from every other property such that the number of arbitrary choices is zero. String theory is elegant, the Standard Model is not.

Simplicity

Simplicity doesn't mean that the OS is weak or that it provides few abstractions. What it means is that each abstraction is simple and components of the OS have simple sets of abstractions. In particular, it means that multiplexing is separated from abstraction. A layer either multiplexes or it abstracts, but never both. The Exo Kernel fad was an attempt to provide this simplicity in one tiny part of the OS. Providing a multiplexing layer beneath every abstraction layer is an excellent way to make an OS extensible.

I detect more of RK's favored Blue Abyss-colored glasses behind the above statements. Simplicity does NOT mean having some layers multiplex while others abstract. However, that would be one instance of greater simplicity (as compared to service-layers that each do two orthogonal jobs). In a broader view, simplicity in OS design needs apply to both the component-abstractions (i.e. what is a process, what is a 'file', what is a 'service', etc.) AND to the individual services. Simple services are characterized by strong cohesion and minimal unnecessary coupling, much like any other simple software component. Since coupling of multiplexing and abstraction is very rarely 'necessary' within a service, there should very rarely be any service-layers that do both.

Resilience

The design is resilient if the radical addition of a feature in the fundamental design causes it to be simpler, solves unexpected problems, or increases its elegance by causing a previously made choice to no longer be arbitrary.

This happened many, many times in the design of Blue Abyss. The latest instance was considering extending Number with classes Positive Infinity and Negative Infinity for the sake of infinite version numbers. Then actually doing it when it turns out that infinite quotas are critical to flexible resource allocation. Then finding out that having done it, you've got indeterminate-size nodes, ones whose start or end location is unknowable even by the node, for free without writing a single line of code. Which is quite a bit better than having to write message-eating Delayed Exceptions.

In opposition, the design is brittle when the addition of a feature causes a security hole or other undesirable interaction to exist.

The above description of resilience is difficult to associate with how the word is used in systems design. A system (network or service, for example) is called resilient if it can bounce back from partial or complete failures, attacks, and sabotage. It is closely associated with graceful degredation and a necessary component of survivability, though resilience doesn't imply continuous service through the duration of the attack. (A service that fails gracefully and is resilient will fail just a little then bounce back into original condition when the cause of the malady is resolved). A design, in this line of reasoning, might be called 'resilient' if changes to it (e.g. the addition or removal of a feature) automatically shape the rest of the design, such that nothing else needs to be redesigned just to fit it in. The above statements are symptoms of this, though resilience itself doesn't imply the overall system becomes simpler or at all more elegant - it can often go the other way, where a small addition invalidates assumptions and makes previous optimizations impossible in all other parts of the design. Still, a resilient design would somehow 'repair' itself to working condition, so the other features live on, even if they run slower.

I face some confusion here, too. Is 'Resilience' supposed to be a property of the operating system's design, or is it to be a property of the Operating System? RK spoke of the wonderful 'resilience' his design possessed, but I can't help but think he grossly misinterpreted the intent in placing 'resilience' among the top design principles. What would 'Liveness' mean when discussing 'Live' designs?

Resilience of the Operating System, as opposed to its design, would be its ability to restore service (or be restored to service) if broken or attacked. Resilience of a Security System would be the ability to kick infiltrators out and rebuild walls that are penetrated or infiltrated (compare to notion of nearly unbreakable wall that, once penetrated, never self-repairs and is difficult to fix by hand). Resilience is a very good property for an operating system and its orthogonal security system to possess. Resilience doesn't prevent things from breaking (that'd be 'Robustness'), but Resilience does imply that broken things get fixed.

At the level of individual process-objects, 'resilience' would be the ability to restore to working condition even if interrupted, moved to a different environment, loss of communications, etc. In modern operating systems, most process-objects are incredibly fragile... to the point that the normal mechanism of providing resilience is to destroy the process-object entirely and create a new one every time the operating system is shut down, and getting them back into working condition after moving them to another computer or after power cycle is wishful thinking or even impossible. One might say that a process is 'fragile' insofar as it requires information or other processes that are no longer available in order to restore it to working condition. Thus, resilience can be obtained by maintaining this extra information.

I submit a hypothesis: resilient process-objects will need to provide a systematic mechanism for communicating to the hosting environment what services it need make available, where this communication is always initiated by the hosting environment (e.g. after first receiving the process-object, possibly after a restart of some sort). This could be a script or list of sorts published with an expected name and type ('export environ-requirements = "Give me a timer-signal on port T once every 15 minutes. Oh, and please call my exported procedure named 'On Destruct' before destroying me."'). More easily, it could be a simple message on a known port that tells the process-object whom its host is (for callbacks) and tells it to regenerate itself. These aren't exclusive... the former 'script' could easily be used to simply tell the environment to send the process a special message telling it to self-regenerate. The mechanism is, perhaps, relevant to other design principles, but what's important to resilience is that the host environment be capable of putting Humpty Dumpty back together again after dropping him, whether this be relying upon Humpty Dumpty to tell it how, or Humpty Dumpty coming with instructions.

I favor use of a script, myself, because it allows most processes to operate under the illusion that they never broke (even if they require a timer signal every 15 minutes). The script makes 'Transparent Persistence' of process-objects considerably more transparent in the great many cases where explicit regeneration isn't required, and it also allows for more transparent migration of processes.

Resilient process-objects are possibly a necessary component of any Operating System when uniting 'Liveness' with 'Persistence' (esp. Transparent Persistence).

Robustness

The system is robust if the addition of a feature doesn't cause the breakdown of a previous feature.

(But, isn't this, by definition, the same as Resilience? I think a better explanation for robustness is the exact same as that used by every other vendor -- it's resistance to failure from outside stimuli. In other words, no amount of user-generated or environmentally-generated input can cause the system to fail. Since there are always cosmic rays to deal with, a system can never be 100% robust. But we can approach it asymptotically. --Samuel Falvo)

Robust means it's hard to break. Resilient means that it bounces back quickly when it breaks (either it self-repairs or is simply easy to fix). They aren't the same, though the differences don't seem to be well explained on this page. It's important to combine the two: a system can't be 100% robust unless it has no features (nothing to break), so some resilience is always necessary. However, resilience is often far more difficult to achieve efficiently than robustness. In the case of a robust design, there is no implication that other parts of the design (or features) would be shaped by the addition of a new one... to the contrary, robustness implies a certain rigidity, such that the original features will be there no matter what you change in other areas. In a good, robust OS, there's no way to crash or damage one part of the OS by crashing or adding insane features to another part.

It's worth noting that a robust design might simply block the addition of a feature, and still be called robust. It's robust if you can't break it, and all other concerns are secondary.

Composability

Components should be composable with each other at the user's whim.

I'd note that this one makes Robustness and Correctness-Proved considerably more difficult. An OS that provides these other features will only allow for composition within constrained limits, not at the user's whim. Essentially, if you're going to provide composability and these other design principles, you'll need a strongly typed (for robust) and statically typed (for correctness-proved) operating system.

Name Space principles

'''Global Consistency'''

'''Connectedness'''

Components' accessibility from each other should be symmetric and transitive. This implies that all mounts and hardlinks (in Unix jargon) be bidirectional.

Connectedness allows "the system [to] reveals itself through Progressive Disclosure" (from Great Design).

Uniformity

Components should obey the same syntax, have the same meta-interface. This implies near complete network transparency. That is, scheduling your program on a remote CPU should be no different from scheduling it on a particular one of the local CPUs (which feature must be available to comply with the Exo Kernel principle). It is not the business of the OS designer to care how the user structures their processes.

(Note that caching can be made available to make remote process execution tolerable. But that's a general solution to remote data transfer.)

One should be able to put a monitor on any object and on any operation. A monitor is just a message sent to an object to Watch for the passage of another message. Monitors are a reflection not of the OS, but of the primitive objects of which the OS is composed.

Extensibility

New components can be added to the system at any time and (because of uniformity and composability) used by anyone to whom they're available from that very moment. Preferably, individual components should be extensible at run-time.

Since Oo Fits Our Mental Abilities, this means user-oriented versus developer-oriented.

Political Economy

(This is probably a feature any way you slice it but the issue is so often misunderstood that it's given special treatment here.)

By this we mean an expressive, powerful yet simple political economy. Whenever you multiplex a resource, you run into economic issues. These fundamental economic issues are usually left unresolved or only halfway resolved using some vague appeal to fairness. But fairness simply shouldn't be a designer's concern. It's a question of policy and building a policy (such as "fairness") into a basic mechanism of the OS is a grievous mistake.

What the OS designer must be concerned with is the creation of a general-purpose mechanism to allocate resources. And then to make sure that this mechanism is used to allocate resources in the OS. Obviously, this mechanism has to be secure, simple, powerful and so on.

Predictability

The design must be well-defined and predictable. Not merely deterministic, nor even tractable, but tractable by a normal human being. There is a large, perhaps even total, overlap between predictability, uniformity and elegance, but saying that one set of principles is more fundamental than another is mistaken.

Liveness

An object is an object is an object, not a copy or a representation. For example, Plan Nine's proc filesystem is not live.

Reversibility

The ability to undo and redo any operation. This requires Transparent Persistence. It also requires the ability to compose primitive operations into transactions and transactions into changesets and changesets into still higher level changesets. This is because actual human users need to understand what operations to undo together to achieve what they want.

Common implementations of reversible systems where you get a single undo stack and no ability to see that stack are simply broken.

Referential transparency

It means that (cap read: 1 to: 20 version: 2) will always return the same value, and that (cap write: 'abcd' at: 20 version: 2) will either always proceed or never proceed. What it does NOT mean is that (cap read: 1 to: 20 version: #latest) will always have the same value ....


How do different OS designs compare?

Plan Nine has modularity, composability, distribution and half of uniformity. You can also include reliability.

Unix has only modularity and many people seem to act like it's the "best OS of all time."

Eros Os satisfies Reliability, Transparent Persistence and most of security. It doesn't seem to give very practical security.

Blue Abyss aims to satisfy all of the principles to the utmost, except for Correctness Proofs, which I see as a costly fad that adds little value.


Discussion

Ignoring such things as "runs application X", as that's external to the intrinsic nature of an operating system, what about: [all refactored up]

Perhaps you've lumped all of these in with "reliability"; it is useful to break them out.

Robustness. Does the thing crash, or come down to its knees, especially under load?

Availability. (Partially covered in Extensibility). Do you have to reboot it whenever you add a driver? Or new HW for that matter? You should never have to reboot it. Covered under extensibility and exokernel.

Fault Tolerance. If a disk crashes, does the OS keep on chugging. (This is more of a Whole System property - the most fault-tolerant OS in the world won't be fault-tolerant if loaded on a garden-variety PC.)

Data Integrity. Do crashes, unplanned/uncontrolled reboots, and other such nastiness result in data corruption? This is a security problem.

So basically reliability means robustness.

In addition, a few other comments:

Real Time: Many applications/environments don't need this trait. Many applications don't need graphics. An OS / library / language that doesn't provide suitable abstractions in graphics simply isn't doing its job.

Proof Of Correctness: Likely possible for some parts of the system; but the General Halting Problem is likely to rear its ugly head if you try to prove an entire general-purpose OS correct. Assuming you have a spec with sufficient coverage and detail to prove out.

Also, Exo Kernel seems to be a particular architecture for OS design, not a "design principle". Exo Kernels may be a good way to write an OS; however they aren't a fundamental requirement or figure of merit.

Transparent Persistence and Object Orientation are some other items on the list that some people think would be nice, but which are not fundamental principles or general requirements.

I'd like you to define a "fundamental" principle of OS design as opposed to a "non-fundamental" one. The distinction is simply not useful, if it can even be defined.

OO crosscuts all system components and that makes it a general requirement. Transparent Persistence is a feature since it doesn't crosscut across all system components. However, it's a requirement that every decent OS is obligated to meet.


One of the design principles behind the Incompatible Time Sharing System is that system calls be atomic. This means that they cannot appear to be interrupted. If they are interrupted, one of two things happens:

the system call backs out and the program counter is reset to the start of the system call, saving whatever state has been changed, so that when the process resumes, it can pick up where it left off

the system call completes if this is possible after a few more instructions.

This is called PCLSRing (or PC-lusering), and was also used by WAITS

(West Coast Alternative to ITS, a variant of TOPS-10).

The approach taken by most modern operating systems is to exit the system call and signal failure. EROS (IIUC) uses a variant of PCLSRing, but with the difference that system calls are constrained to take a limited time to run, and the Fluke kernel, which has multiple entry points to system calls so that they can be resumed. -- Donald Fisk

Nowadays, these would no longer be called OS design principles. They would be consequences of resource allocation, security, elegance, and other such principles. System call internals are simply too low level.

''Maybe some consideration should be given to system call internals before worrying about higher level issues. Avoiding this means responsibility is being passed from the operating system designer to other programmers who use the system calls.''

System calls (if such a thing even exists) must use PCLSRing. Elegance requires this. But worrying about system calls first is stupid. There are just so many much more important things.


Documentation

Why has nobody even *mentioned* the documentation? Because it's not a principle of OS design?

Yes, that's pretty much it. The design principles as they stand will already work towards many of the tasks for which documentation is used for in systems. The tasks which remain, although important, are not operating system design issues. Windows would be windows, with or without the manual. Its design contains many faults, and none of them could be fixed with a manual. And a major error in the manual, perhaps misrepresenting a security aspect so as to describe a system with a certain hole, doesn't by force of existence bring that hole into being. -- cwillu

I'm starting to think that self-documentation is an OS principle at least as much as other derivative things like OO or Exo Kernel, and more so than proper resource allocation. Documentation is not an OS principle, but self-documentation? Consider the difference between a Smalltalk system with sources and one without (as happens in Visual Works when you don't set $Home). Now imagine the difference if a component of Blue Abyss were written in Smalltalk and if it were written in assembly. I'm starting to wonder what kind of support I'll need for self-documentation in Blue Abyss ... -- RK

Hmm... for some reason I was under the impression that the old list already included something along the lines of progressive disclosure, which combined with uniformity would result in a... I don't know. I want to say intuitive, but I don't believe that exists per say. Perhaps a system which would be understandable with a minimum of hand-holding?

If the principles don't already include such that the resulting os will be self-documenting for the most part, then self-documentation belongs to os design principles as much as self-documenting code belongs to application design principles.

So while Documention still doesn't have a place, but that the system should be self-documenting certainly does.

-- cwillu

You got it right the first time. Uniformity and progressive disclosure should result in a largely understandable system. But even Smalltalk has specialized mechanisms for documenting code (eg. class comment method). -- rk



See original on c2.com