Dependable C is an attempt to document a subset of C for developers who want to write code they can depend on.
C is the most portable and widely implemented language available. C has been called the lingua franca of computing. A problem solved in C will remain solved for the foreseeable future. Changes in operating systems, computing environments, or hardware are unlikely to render a well-written C implementation obsolete. A library written in C can be used from almost any language. While many programmers don't use C, many can read and understand it. This means that code written in C can be modified by a larger pool of programmers.
If longevity is a measure of quality, C is a prime candidate for writing high-quality code.
Not all C code is portable, will compile the same in all compilers, or can even be understood by most C programmers. C has a long history of quirks and corner cases that can be hard to navigate. Writing non-portable code that is only intended to run on one platform and be built with a particular toolchain is perfectly legitimate, but if you want to write code that is portable and remains usable for decades, this guide is for you. It is for developers who value code that is guaranteed to compile and work correctly over having the latest language features.
Dependable C is the opposite of a dialect. It is a C that tries to be as middle-of-the-road as possible in order to be understood and implemented as widely as possible. Think of it as Newscaster C: a neutral, universally understood language.
Dependable C is not a style guide; it does not prescribe formatting, indentation, or style. It simply tries to document what C functionality can be depended on, and how. It is perfectly valid to use Dependable C as a guide for what functionality to use, while at the same time adhering to a style guide like MISRA. The MISRA standard prioritizes safety, whereas Dependable C prioritizes compatibility. It is entirely possible to adhere to both at the same time.
Many languages have derived their syntax from C: C++, Java, C#, D, JavaScript, and Objective-C, to name a few. Almost all of these languages are based on C89 and have not incorporated C99 or later features.
The purpose of this project is to document the small subset of C that is dependable, and it therefore strongly encourages writing standard-compliant code free of any UB. However, on some very rare occasions, this guide will highlight where writing code that is technically UB is permitted, because in practice it is dependable. Likewise, there are many, many ways to write technically standard-compliant code that is far from dependable, and in some cases implementations do not even exist (see Annex K). The goal is to give guidance on how to write code that works in the real world, on real implementations, not just paper products produced by a standards body. Having written that, most implementations study the standard carefully and do their best to follow it, and the standards body goes to considerable lengths to try to make the standard as complete and clear as possible.
This page is maintained by Eskil Steenberg Hald. I am a long-time C developer, and represent Sweden in the C standard committee. This page is maintained to chronicle my own understanding of the language, and as a guide for my employees and anyone who wants to write dependable C. I consider myself an expert in writing software in C and in undefined behaviour, and I am proficient in the memory model and concurrency model (I would probably rank as one of the world's experts in these two areas, but I still do not want to claim to understand them fully...). I would consider myself less experienced in "modern" versions of the language. Any corrections to this document, or proposed additions, would be much appreciated. I am especially interested in hearing about C features that people have found to be unreliable in any implementation. You can email me at eskil at dependablec dot com.
My participation in the WG14 C standard committee is for my own education and for participation in the memory model and undefined behaviour study groups. Because I will never use any of the newer versions of the language, and do not recommend their use, I abstain from voting on the language's development.
Most other languages have only one or very few implementations. This means you can rely on the implementation's behaviour not varying between platforms. C has numerous implementations, with a very wide range of complexity and feature support.
Many C implementations have bugs, and they mostly manifest when you stretch the language to its limits. All basic functionality can be relied on because the most idiomatic code is also the most tested code. Compiler developers use publicly available code to test their implementations, and therefore a more common construct is much more likely to have been rigorously tested than an esoteric corner case. By writing code in a syntax that you can be sure all compilers have encountered in the past, you minimize the chance that you will trigger a bug.
Code should try to avoid relying on the user having the latest version of a compiler. Some platforms may have had their support deprecated by major compilers, or may only be supported by a specific compiler.
Dependable C advocates using a subset of all versions of C.
Given that C89 is the smallest of the C standards, in practice this means a subset of C89. Simply using the C89 standard is not enough to fully understand C. Many of the changes made to the C standard text in the years since it was published address ambiguities and issues with previous versions. If something is unclear in one standard but has been clarified in later standards, users tend to get the clarified behaviour even when they set their compiler to follow the earlier standard. Given that C89/ANSI C was the first version of the language, it is the version of the standard written with the least implementation experience, and it therefore has the most issues.
While a C89 subset is recommended, the point of writing Dependable C is to be universally accepted, and that includes being accepted by compilers set to any version of C. You may choose to set your compiler to adhere to a strict C89 subset in order to verify that your code is not using any newer functionality, but the code should run just as well using any other version of C. Your code should not require a compiler that has a C89 mode; it should be universal. This is why Dependable C discourages the use of any deprecated functionality or any functionality that clashes with new C features (see "auto").
The vast majority of features added to the C standard since C89 add new ways of doing things that are already possible in C89 if you know how. It is our intention to document as much of this as possible over time. In some cases features that have been introduced in later versions are needed, and in these cases we will try to document how to access these features in the most dependable way possible.
The C programming language is, unfortunately, unfixable. Fortunately, C is good enough not to need fixing.
One of the greatest strengths of C is its compatibility. C has more implementations than any other programming language, more existing code, more documentation, and more experienced programmers than any other language. The cost of breaking all of this compatibility is simply higher than the value brought by any improvement to the language.
There is a wide range of C dialects and proposed replacements that all try to fix perceived deficiencies of C. However, almost none of them have had any success.
The ISO C standard had, until C23, taken backwards compatibility seriously. This means that the committee has been unable to remove functionality, only add it. On some rare occasions, features have been marked as deprecated, but in practice it has not been possible to remove these features from implementations, because users simply need them to compile existing code.
A situation where features can only be added, never removed, serves a language like C poorly, since among its core values are its simplicity, compactness, and ease of implementation. Stability is also poorly maintained by a group of language designers who, not surprisingly, want to design language features. People do not join standards organizations in order to not develop the standard. (In general, my personal experience is that the members of the ISO C WG14 standards body are competent, hardworking, and very knowledgeable, and have the best of intentions. However, when enough people each want to change just one thing, the result is not a clean design.)
While WG14 has historically worked hard to maintain backwards compatibility, it has ignored compatibility in the opposite direction. Writing code in newer versions of C simply makes it incompatible with many platforms and implementations. Often code written in newer versions of the language will not compile in older implementations, but on occasion the meaning of the code simply changes. This is obviously very dangerous.
An example of this hazard is the removal of some UB. At first glance, it seems like a clear improvement to define behaviours that have in the past needlessly been undefined. But it is problematic if a programmer reads a later standard that makes a guarantee that isn't upheld by the many implementations written before the behaviour was defined. The behaviour may be technically defined, but in practice it is still not dependable, since it has a history of being undefined, and unlike in the past, this hazard is no longer clearly spelled out in the standard. The well-intended effort to remove an issue instead creates an issue. This is one of the issues that compelled the Dependable C effort.
A time traveller going back to 1972 could address many issues in C, but today the situation is much more complicated. Luckily, the small subset defined here is more than capable of doing everything that needs to be done. In the grand scheme of things, the sacrifices are minor. Most of the issues of C can, for a developer, simply be addressed by "just don't do that then"; implementations don't have that luxury, since they need to compile existing code that isn't always as well written.
Any software project has requirements; just because your code adheres to a language standard does not mean that it is meaningful to run on all platforms that support the language. C is a language that can be implemented on very exotic platforms, where for instance bytes aren't 8 bits, or where memory is very limited. It is perfectly reasonable to write software that follows the Dependable C guidelines but that isn't portable to, for example, hardware with pointers smaller than 64 bits. Common examples of such assumptions:
- Bytes are 8 bits.
- Types are aligned to no more than their size.
- Function pointers have the same size as data pointers.
None of these are guaranteed by the standard.
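Where a project makes such assumptions, they can be verified at compile time. Below is a minimal sketch using the classic negative-array-size trick, which works even in C89; the ASSUMPTION_CHECK macro name is our own invention, not an established convention:

```c
#include <limits.h>
#include <stddef.h>

/* Fails to compile (negative array size) if the expression is false. */
#define ASSUMPTION_CHECK(name, expr) typedef char name[(expr) ? 1 : -1]

struct align_probe { char c; double d; };

ASSUMPTION_CHECK(bytes_are_8_bits, CHAR_BIT == 8);
ASSUMPTION_CHECK(alignment_is_at_most_size,
                 offsetof(struct align_probe, d) <= sizeof(double));
ASSUMPTION_CHECK(function_and_data_pointers_match,
                 sizeof(void (*)(void)) == sizeof(void *));
```

A build that violates one of the assumptions then fails immediately, rather than misbehaving at run time.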
I would caution against assuming pointers are and always will be 64 bits on modern platforms, as there are new platforms that enable 128-bit pointers. While this is unlikely to come into mainstream use any time soon, it is likely to be a growing niche. (128-bit pointers allow mapping very large NUMA architectures; for instance, it is possible to encode a network address (MAC or IP) and a memory address into one pointer.)
Dependable C encourages C++ compatibility in all interfaces, but does not guarantee that code compiles correctly in a C++ compiler. C is not a subset of C++, and the differences between the two languages are subtle and often unintended. Writing code that is guaranteed to produce the same results in both C and C++ requires deep knowledge of both languages and is not something we recommend. We strongly encourage header files to be C++ compatible and not contain any functions, as shown in the sketch below. We also discourage any use of C++ keywords.
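As an illustration, the conventional pattern for a C++-compatible header looks like this (the MYLIB names are placeholders):

```c
#ifndef MYLIB_H
#define MYLIB_H

#ifdef __cplusplus
extern "C" { /* give the declarations C linkage when included from C++ */
#endif

int mylib_init(void);
void mylib_shutdown(void);

#ifdef __cplusplus
}
#endif

#endif /* MYLIB_H */
```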
There are many misconceptions about C, and features have even been added to C that are invalidated by the as-if concept.
[[_TOC_]]
N3308
This document is an educational document that tries to explain the concept of "undefined behavior" in the C programming language. It is the combined effort of the ISO WG14 Undefined Behavior Study Group to clarify the term and its implications.
ISO C defines undefined behavior (UB) in Section 3.4.3 as:
behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this document imposes no requirements
Note 1 to entry: Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).

Note 2 to entry: J.2 gives an overview over properties of C programs that lead to undefined behavior.

Note 3 to entry: Any other behavior during execution of a program is only affected as a direct consequence of the concrete behavior that occurs when encountering the erroneous or non-portable program construct or data. In particular, all observable behavior (5.1.2.4) appears as specified in this document when it happens before an operation with undefined behavior in the execution of the program.
Inherent to the ISO specification of the C programming language is the concept that a set of behaviors are undefined. From this the specification derives several strengths as well as several weaknesses. UB allows a platform to either define platform-specific behaviors or ignore the possibility of an erroneous state. The language does not require a platform to detect these errors.
Undefined behavior is used in many places in the C standard, and for several reasons, some of which are explored below.
Undefined behavior can either be explicitly specified in the standard or remain implicit if the standard does not define a behavior. The C standards body has a goal to document all UB in the C standard, but identifying all UB is a difficult and laborious task. The standard states that the rules for undefined behavior extend to behavior that is not specified by the standard.
Additionally, there are paragraphs in the standard where it is unclear whether a behavior is defined or not. This can mean that some platforms treat a behavior as defined while others treat it as undefined. For example, the standard states that the first member of a struct has a zero offset from the struct itself. Some argue that this means that the first member of the struct therefore must have the same pointer address as the struct while others argue that it is undefined if the struct has the same address as its first member, as the standard does not explicitly resolve this ambiguity.
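For illustration, the disputed guarantee can be written as the comparison below (a minimal sketch; on common implementations the two pointers compare equal, but whether the standard requires it is exactly the point of contention):

```c
#include <stdio.h>

struct widget {
	int first;
	int second;
};

int main(void)
{
	struct widget w;
	if ((void *)&w == (void *)&w.first) /* the disputed comparison */
		printf("struct and first member share an address\n");
	return 0;
}
```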
Beyond undefined behavior, the C standard defines a range of terms for behaviors, such as unspecified behavior, implementation-defined behavior, and locale-specific behavior. Unlike undefined behavior, each of these terms does define a constrained behavior that the implementation has some form of responsibility to uphold, even if it may differ between implementations.
All of these differ from undefined behavior in that, while they may produce different behaviors on different implementations, they do represent behaviors that a user can depend on in an ISO-compliant C implementation.
The C standard states that any platform is free to detect UB and to provide platform-specific behavior and document this behavior if it wishes. In this sense, what is in strict ISO C terms "UB" may be well-defined behavior on a particular implementation.
This can be very useful, because it enables implementers to extend C's capabilities, and thereby grants users access to platform-specific features. While the C language is designed to enable cross-platform development, developers are free to only support a limited set of platforms. For example, there are implementations of C that do define the behavior of out-of-bounds array writes, signed integer overflow, and dereferencing null pointers.
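As a concrete illustration, GCC and Clang offer the -fwrapv option, which defines signed integer overflow to wrap. Code like the sketch below is well-defined under that extension but is UB in strict ISO C (the function name is our own):

```c
/* Relies on -fwrapv (a GCC/Clang extension): signed overflow wraps,
   so the post-hoc test below is defined. In strict ISO C it is UB. */
int increment_would_overflow(int x)
{
	return x + 1 < x;
}
```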
For brevity, unless otherwise noted, this document will consider UB only in cases where the implementation has not defined a platform-specific behavior or implementation-specific behavior.
Consider the following code:
int a[5];
a[x] = 0;
What should happen if x is 42? A language design could issue an error, exit the program, or resize the array, among other choices. However, any of these choices would require the implementation of the language to perform a test to see if the value is within the valid range of the array.
int a[5];
if (x < 0 || x >= 5) {
	/* Handle out-of-bounds write */
} else {
	a[x] = 0;
}
This range check would add work for the compiler and the execution environment. Adding any requirement to detect whether the assignment is out of bounds would come at a cost in run-time performance and complexity. Not only would the implementation have to check each access to the array, it would also have to keep track of valid array ranges.
C is designed to be fast, simple, and easily implementable; this is why C does not require any detection of out-of-bounds states. Consequently, C cannot define a behavior for a state that isn't detected. The behavior must be undefined.
It is a common misconception that all undefined behavior in the standard stems from oversights, or from the standard body's failure to agree on an appropriate behavior. The above example clearly shows that it is not practical to define any consistent behavior for out-of-bounds array access without imposing a considerable burden on the implementation to detect the state. The cost of detecting an erroneous state prevents the language from defining any behavior should it occur.
Furthermore, if the standard were to require a program to exit on an out-of-bounds write, then the following piece of code would become a valid way to exit a program:
int a[5];
a[24] = 0;
This is not a good way to deliberately exit a program. It is preferred that a program exit in a manner that the standard explicitly documents as exiting, such as by calling a function named `exit`.
Reconsider this code:
int a[5];
a[x] = 0;
Another interpretation of the above code is that if there are no requirements for an implementation to handle an out-of-bounds access, then the code contains an implicit contract that `x` can only be between 0 and 4. The implementation can then assume that the user is aware of the contract and consents to it, even if the implementation cannot by itself determine that the contract is valid by analysis of the possible values `x` may hold. The implementation therefore need not check the value of `x`.
If the user cannot guarantee that `x` is within range, they can rewrite the code:
int a[5];
if (x >= 0 && x < 5)
	a[x] = 0;
One big reason that many behaviors are undefined is that detecting these undefined behaviors may be difficult to do at compile time, or it may impose too much of a performance penalty at run time.
The existence of undefined behavior implies conversely that when a program has no undefined behavior, its behavior is well-specified by the ISO C standard and the platform on which it runs. This is a promise, or contract, between the ISO C standard, the platform, and the developer. If the program violates this promise, the result can be anything; it is likely to violate the user's intentions, and it will not be portable. We will call this promise the "Assumed Absence of UB".
A C program that enters a state of UB can be considered to contain an error that the platform is under no obligation to catch or report and the result could be anything.
Consider this code:
x = (x * 4) / 4;
From a mathematical perspective, this operation should not change the value of x. The multiplication and the division should cancel each other out. However, when calculated in a computer, x * 4 may produce a value that cannot be represented in the type of x. If x is an unsigned 32-bit integer with the value 2,000,000,000 and it is multiplied by 4, the operation wraps on a 32-bit platform and produces 3,705,032,704. The subsequent division by 4 then produces 926,258,176. Since the standard declares that operations on unsigned integers have defined wrapping behavior, the two operations do not cancel each other out.
If we instead perform the same operation using signed integer types, things change because signed integer overflow is UB. By using a signed integer, the programmer has agreed to the contract that no operations using the type will ever overflow. Therefore, the optimizer is free to ignore any potential overflow, and can assume that the two operations cancel each other out. This means that there is a significant optimization advantage in declaring that signed integer overflow is UB.
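A minimal sketch contrasting the two cases, assuming a 32-bit int as in the example above:

```c
unsigned int u = 2000000000u;
int s = 2000000000;

u = (u * 4u) / 4u; /* defined: wraps, u becomes 926258176 */
s = (s * 4) / 4;   /* signed overflow is UB: the compiler may fold this to just s */
```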
The assumption that the program contains no UB is a powerful tool that compilers can employ to analyze code to find optimizations. If we assume that a program contains no UB, we can use this information to learn about the expected state of the execution. Consider:
int a[5];
a[x] = 0;
If x is any value below 0 or above 4, the code contains UB. On many platforms, `a[-1]` and `a[5]` would be assigned to addresses outside the bounds of `a`. Without requiring implementations to explicitly add bounds checks, it becomes impossible to predict the side effects of an out-of-bounds write. The implementation is therefore allowed to assume that UB will not happen. This phenomenon is known as "Assumed Absence of UB", and it lets compilers make further deductions. By writing the above code, the programmer respects a contract with the compiler that `x` will never exceed the bounds of the array.
If we consider:
int a[5];
a[x] = 0;
if (x > 5) {
	/* ... */
}
In this case, since the compiler assumes `x` must be between 0 and 4, the if statement cannot possibly be true. This allows the compiler to optimize away the if statement entirely. This completely conforms to the standard, but it removes some predictability of UB, and can make programs with UB much harder to debug. The out-of-bounds write no longer causes a predictable wild write and it also causes an `if` statement to be removed.
A common bug is to try to detect and avoid signed integer overflow with code like this:
if (x + 1 > x) {
	x++;
}
If we assume that UB cannot happen, then we must assume the `if` condition must always be true. Consequently, many compilers will optimize away the `if` statement entirely.
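A dependable version of the check tests the bound before performing the addition, so that no overflow is ever produced (a minimal sketch):

```c
#include <limits.h>

if (x < INT_MAX) {
	x++;
}
```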
The confluence of UB and more aggressive but standards-compliant compiler optimizations exposes latent bugs that may otherwise behave according to user intentions. These bugs are characterized as hard to find and diagnose. They often do not appear at lower optimization levels, which means they do not appear in the executables that developers produce during development. Consequently, these bugs can bypass many tests. Debuggers tend to operate on executables compiled with lower optimization settings, where many of these issues do not show up. This makes it harder to find and fix these bugs.
An early example of a vulnerability arising from such aggressive optimization is [CERT vulnerability 162289](https://www.kb.cert.org/vuls/id/162289).
A common consideration when discussing UB is the question of when UB is invoked. While some have argued that programs that are able to produce UB have no requirements whatsoever, it is the position of the WG14 UB Study Group that a program must first reach a state of UB before the requirements of the language standard are suspended. This view is shared by implementers, who have a history of classifying instances where this isn't true as compiler bugs.
Consider the following:
int a[5], x;
scanf("%i", &x);
a[x] = 0;
In this example, a user-provided index is used to access an array of five elements. While this program may be bad form, it is well-defined until and unless `scanf` sets `x` to outside the range of the array. The developer has (implicitly) guaranteed that the index used to access the array will stay within the bounds of the array, but this guarantee is maintained outside of the program. Many programs depend on input strictly conforming to a set of requirements to operate correctly. While this may present safety and security issues, the developers must weigh those considerations against other factors, such as performance. Even a strictly-conforming program could enter a state of UB under some environmental circumstances. A program is only erroneous when it reaches UB. An implementation is not released from complying with the ISO C standard because UB is possible when executing that program; the implementation is released only once the program has entered a state of UB.
A core tenet of the C standard is the "as-if" rule. This rule states that an implementation is not required to operate in the way the program is strictly written, so long as the implementation's observable behavior (defined in C23, s5.1.2.3p6) is identical to the program. The program must behave, but not operate, as if the written program was executed.
This means that the actual program behavior can vary radically depending on how an implementation is able to transform the program, as long as its observable behavior remains constant. For example, two non-observable operations can be reordered. Consider:
int a, b;
a = 0;
b = 1;
These are two non-observable assignments (because neither a nor b is `volatile`). As two independent operations they are not required to be executed in any particular order. They may in fact be executed concurrently. If we then consider:
*p = 0;
x = 42 / y;
These two operations are also non-observable operations, however both operations can produce UB (either by `p` pointing to an invalid address, or `y` producing a divide by zero). Because the operations are non-observable, they may be re-ordered. If `y` is zero, there is no guarantee that `*p` is written before the program enters a state of UB.
Because any non-observable operation can be reordered and transformed, a program might reach a state of UB in an ordering not explicitly expressed in the source code. Due to the assumed absence of UB, and the "as-if" rule, a program can show symptoms of UB before any actual UB is encountered during program execution. Consider:
int a[5];
if (x < 0 || x >= 5)
	y = 0;
a[x] = 0;
Using assumed absence of UB, the implementation can determine that `x` must be a value between 0 and 4, and therefore the `if` statement can be removed. This causes an out-of-order behavior known as "time traveling UB", where a program bug causes unintended consequences before the UB is encountered during program execution. It is as if the UB traveled backwards in time from the array access to the if statement.
Time traveling UB is permitted if it does not interfere with observable behavior that occurs before entering a state of UB. Consider:
int a[5];
if (x < 0)
	y = 0;
if (x >= 5)
	printf("Error!\n");
a[x] = 0;
In this case, the call to `printf` is an observable event, and any re-ordering requires it to execute correctly unless it is preceded by a state of UB. The compiler is not permitted to optimize away the second if statement. The first if statement, however, has no impact on the observable behavior and can therefore be removed.
Note: Historically, there have been cases where time travel has impacted observable state. Implementers have generally considered these to be implementation bugs. To clarify that they indeed are bugs, the document [N3128 Uecker] was proposed and accepted for C23. It adds the non-normative Note 3 that clarifies the issue in the standard.
Consider this code:
int a[5];
a[42] = 0;
Every time this code runs, it will produce UB. The state of UB does not depend on any dynamic or external factors other than the code being executed. We choose to define this type of UB as "static UB", because it only depends on variables that are known at compile time. The term "static UB" is somewhat complicated because different implementations have differing abilities to detect UB at compile time. Consider:
int a[5];
if (x > 0) {
	y = 42;
} else {
	y = INT_MAX;
}
a[y] = 0;
This code also contains static UB but requires a more complex analysis to reach that conclusion. The term "static UB" denotes any UB that is not dependent on runtime state. An implementation is under no obligation to detect static UB, but if an implementation does detect static UB we have recommendations for how to proceed. Static UB denotes expressions that always produce UB even if it's not proven that the expression will ever be evaluated.
Any statement that produces a state of UB (with the exception of the `unreachable()` macro) is erroneous, unless an implementation has defined its own behavior for that statement. An implementation is under no obligation to detect any UB. If, however, the implementation doesn't detect static UB, it is free to assume the statement will not produce UB. Therefore any static UB (again, excepting `unreachable()`) should be considered a developer error and not an intended use of the language. In these cases, an implementation should issue an error with an appropriate diagnostic when it detects UB.
An implementation can assume that a program will not enter a state of UB, but no implementation should assume that a program that reaches a state of UB is intentional.
Consider again:
int a[5];
a[x] = 0;
The assignment may or may not produce UB. In this case if we follow the rule "assumed absence of UB", we can assume that `x` must be between 0 and 4. The assignment is not just an assignment; it also provides a hint to the compiler as to what `x` may be. If we then add:
int a[5];
a[x] = 0;
if (x > 4)
	...
The if statement here can be considered dead code and optimized away. The if statement doesn't produce UB, it just cannot happen without UB. If we instead consider:
int a[5];
if (x > 4) {
	a[x] = 0;
}
Again, this code may or may not trigger UB, but if the assignment is ever executed it is guaranteed to trigger UB. (Note that an implementation is not required to detect the UB). In other words, the UB is static, but only if the assignment is executed.
The correct interpretation of the detected static UB is that the code is erroneous. It is incorrect to interpret the above code as a valid way for the user to express that `x` is 4 or less. The "assumed absence of UB" rule only applies to the way a construct can be assumed to be executed, not that a construct that always produces UB will never be executed. One divided by x lets the compiler assume x is not zero, whereas x divided by zero should cause the compiler to assume an unintended user error.
The one exception to this is the `unreachable()` macro. The `unreachable()` macro is the only way for a user to express that a statement can be assumed to never be executed. Incidentally, executing `unreachable()` is UB, but it should not be regarded as equivalent to other UB in this regard.
For example:
if (x > 4)
	unreachable();
This is a correct way to express that a compiler can assume that `x` is smaller or equal to 4. Despite `unreachable()` being UB, it is not equivalent to:
if (x > 4)
	x /= 0;
Division by zero is UB, but unlike `unreachable()`, it is assumed to be a user error. The `unreachable()` macro can therefore not be implemented by the user by producing UB in some way other than the `unreachable()` macro. UB is also erroneous even when it can be determined never to be executed. The following can be detected as erroneous:
if (0)
	x /= 0;
C is designed to make naive, as well as highly optimizing implementations possible. The C standard therefore places no requirements or limits on the efforts an implementation takes to analyze the code. Whichever erroneous UB may be detected will therefore vary between implementations.
Operating systems and even hardware have been designed to mitigate the side effects of unintentional UB, or deliberate sabotage using UB, with features such as protection of the memory containing the executable or execution stack. Due to some of these protections, some UB is predictably caught at run time. This mitigates the unpredictable nature of UB and improves the stability and security of the system. However, this can also give the false impression that some UB has predictable side effects. While dereferencing null pointers is technically UB, doing so has a very predictable outcome (a trap) on many platforms. Even if the behavior of dereferencing null is reliable on a platform, the compilers' assumption that the code will not dereference null will make it unreliable.
Some UB was initially included in the C standard because the standard wanted to allow for different platform designs. Over the years, some designs have grown so dominant that few developers will ever encounter a platform that does not conform to these dominant designs. One example of this is two's-complement arithmetic, which causes signed integer overflow to wrap.
This means that many UBs have predictable behavior on most platforms:
| UB | Convention |
|----|------------|
| Dereferencing null pointer | Traps |
| Signed integer overflow | Wraps |
| Using the offset between 2 allocations | Treats pointers as integer addresses |
| Comparing the pointer to freed memory with a newly allocated pointer | Treats pointers as integer addresses |
| Reading uninitialized memory | You get whatever is there |
Such behavior is not defined by the C standard but can seem to be predictable. Predictability is of great value to most developers. The knowledge of how the underlying platform operates lets the developer predict and diagnose bugs. A trapped null pointer dereference is easy to find in a debugger. In fact, a programmer may deliberately add a null pointer dereference to a program to invoke a core dump. In MSVC debug builds, uninitialized heap memory is filled with 0xCDCDCDCD, a [magic number](https://en.wikipedia.org/wiki/Magic_number_(programming)) that is instantly recognizable to any experienced Windows programmer. If the sum of two large positive signed integers results in a negative value, a wise programmer will suspect signed integer overflow that happened to wrap.
This apparent predictability of many types of UB hides the fact that UB is not predictable. This causes many programmers to either not realize that some of these behaviors are undefined, or to confuse UB with implementation-defined behavior. They may believe that the behavior is defined in the C standard, or they may know it is non-portable but assume that the behavior of their platform applies to all platforms, or to other hosts of their machine's platform. This faulty assumption creates a variety of hard-to-diagnose issues that we will explore further.
An out-of-bounds write may have a wide range of consequences, as it can disturb many kinds of state. However, most developers would assume that an out-of-bounds write is executed as a write operation, which is not true in general. For another UB such as signed integer overflow, it is even less intuitive that a simple arithmetic operation can have a wide range of unpredictable outcomes.
Undefined behavior in C gives an implementation wide latitude to optimize the code. This freedom has enabled implementers to successively generate faster and faster machine code, which enables significant reduction in computing time and energy consumption for a wide range of workloads. C is the de facto benchmark for efficiency that other languages are compared against and strive to match.
Significant portions of UB, such as the aliasing, provenance, and overflow rules, are specifically designed to enable implementations to make optimizations. Violating these categories of UB is likely to cause unpredictable behavior only when an implementation engages with these opportunities to optimize code.
As many implementations support varying levels of optimizations, a perception has formed in parts of the C community that compilers, at higher levels of optimizations, ignore the C standard and "break" code. This is a misconception. Most C implementations are consistent with the C standard even at the highest levels of optimization settings. Optimizations reveal existing bugs in the source code much more often than they reveal bugs in the compiler. These bugs are usually in violation of the C standard even when the program operates consistently with the developers' expectations.
The higher the level of optimization employed, the more bugs are exposed; but as the code is further transformed, it also becomes harder to debug. Many tools, like debuggers, depend on low levels of optimization to be able to correctly associate the binary's execution with the source code. This compounds the difficulty of diagnosing UB bugs.
Given the misconception that optimizations break code, rather than reveal latent bugs, implementers often unfairly get blamed for issues arising from UB. This has made many compilers avoid making certain optimizations, even when supported by the specification, if they anticipate a user backlash. This creates a gray area, where unsound code that contains UB may have an undocumented reliable or semi-reliable behavior. This gray area comes at the cost of denying performance afforded by the standard to compliant code.
C is regarded as an "unsafe language". This is, in the strictest sense, not true. The C standard does not require an implementation to check for several errors, but it also does not prevent an implementation from doing so. Hence, each implementation may choose the level of safety guaranteed.
In practice, C is an unsafe language because the most popular implementations of C choose not to make many additional guarantees, but instead choose to prioritize performance and power efficiency. As such, C is perceived as a de facto unsafe language because that is how most users have chosen to use it.
There are safer implementations, but these are predominantly used to detect issues during development rather than to add additional protections to deployment. One such implementation is [Valgrind](https://valgrind.org/), whose default tool "memcheck" detects out-of-bounds reads and writes to memory on the heap, as well as uninitialized reads, use-after-free errors, and memory leaks. Valgrind achieves these safety constraints at a significant performance cost. Many implementations such as GCC, LLVM and MSVC offer various tools for detecting and diagnosing UB. Several static analyzers also exist to alleviate this problem.
Users can also write their own memory tracking shims to detect small out-of-bounds writes, double frees, memory consumption and memory leaks, using macros:
#define malloc(n) my_debug_mem_malloc(n, __FILE__, __LINE__) /* Replaces malloc. */
#define free(n) my_debug_mem_free(n, __FILE__, __LINE__) /* Replaces free. */
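A minimal sketch of what such a shim might look like behind those macros. The shim file itself must be compiled without the macros active, and everything beyond the two function names above (the guard band, header layout, and reporting) is an illustrative assumption rather than a prescribed design:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_SIZE 16 /* bytes of known pattern appended to each block */

typedef struct MemHeader {
	struct MemHeader *next;
	size_t size;
	const char *file;
	int line;
} MemHeader; /* assumed to give sufficient alignment for the payload */

static MemHeader *mem_list = NULL; /* all live allocations */

void *my_debug_mem_malloc(size_t size, const char *file, int line)
{
	unsigned char *payload;
	MemHeader *header = malloc(sizeof *header + size + GUARD_SIZE);
	if (header == NULL)
		return NULL;
	header->size = size;
	header->file = file;
	header->line = line;
	header->next = mem_list;
	mem_list = header;
	payload = (unsigned char *)(header + 1);
	memset(payload + size, 0xAA, GUARD_SIZE); /* write the guard band */
	return payload;
}

void my_debug_mem_free(void *pointer, const char *file, int line)
{
	MemHeader **link, *header;
	unsigned char *payload;
	size_t i;
	for (link = &mem_list; *link != NULL; link = &(*link)->next)
		if ((void *)(*link + 1) == pointer)
			break;
	if (*link == NULL) { /* double free, or pointer never came from malloc */
		fprintf(stderr, "%s:%i: bad free of %p\n", file, line, pointer);
		return;
	}
	header = *link;
	payload = (unsigned char *)(header + 1);
	for (i = 0; i < GUARD_SIZE; i++) {
		if (payload[header->size + i] != 0xAA) { /* small out-of-bounds write */
			fprintf(stderr, "%s:%i: block allocated at %s:%i was overwritten\n",
			        file, line, header->file, header->line);
			break;
		}
	}
	*link = header->next; /* unlink, then release the real allocation */
	free(header);
}
```

Walking mem_list at program exit can then report total memory consumption and any blocks that were never freed.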
While not in any way mandated by the C specification, the prevailing modus operandi of C users consists of using safety-related tools to detect issues during development, rather than as backstops during deployment. A major drawback of this approach is that since UB is a state that often cannot be definitively detected until it occurs at run time, there is no easy way to definitively guarantee that a program will not enter a state of UB.
Despite this, it is worth noting that some of the most trusted software in the world, like the Linux kernel, Apache, MySQL, Curl, OpenSSL and Git are written in C. The simplicity of C makes it significantly easier to read and detect issues.
C does suffer where the standard is unclear, particularly in the areas of the memory model and concurrent execution. Rules about aliasing, active type, thread safety, and volatile leave a lot open to interpretation as to what is UB and what is not. On many of these issues there is a lack of consensus within WG14. Most implementations do support behaviors that in the strictest reading of the standard would be considered UB, simply because of user expectation and to be able to compile important existing software. In this sense most implementations deviate from the standard, but how, and how much, they deviate varies. Some projects, like the Linux kernel, have explicitly opted out of these ambiguities and defined their own requirements.
As this document has hopefully illustrated, Undefined Behavior in the context of C is complex. To simply say that its behavior has been omitted from the standard does not convey this complexity.
C is designed to be a language that trusts the developer. In the case of UB, developers should interpret this to mean "Trust the developer not to initiate UB", rather than "The developer can trust UB if they know the underlying implementation and platform". The Undefined Behavior Study Group therefore strongly advises developers to avoid any UB, unless a platform has explicitly defined that behavior. Testing to determine what observable effect use of a nonportable or erroneous program construct has on your platform is insufficient cause for assuming the UB will consistently have the same behavior on all platforms, including the next one that your code will run on. Only trust an implementation's explicit documentation of a language extension that defines a behavior. We advise that implementations clearly document any language extensions that replace undefined behavior so that users can differentiate between such extensions and seemingly predictable but still unintended behavior.
A computer language is a tool for humans to communicate with computers, but it is also a tool for computers to communicate with humans. Humans spend more time reading the code they write and trying to figure out why its behavior does not match their expectations, than computers do. Traditionally implementations have been black boxes that users must rely on, without understanding how they operate. UB shows that this approach causes issues, because modern compilers do not operate like many users expect them to. We would therefore recommend that implementations try to find ways to be more transparent with their transformations. The ability for users to inspect code that has been transformed could reveal out-of-order issues, code removal, load/store omissions and other non-obvious transformations. We recognize that this involves significant user interface and architectural challenges.
This document was written by Eskil Steenberg Hald. It is the result of many invaluable discussions in the Undefined Behavior Study Group and ISO WG14, so many of its members deserve credit for its creation. Specifically, the author wants to thank David Svoboda, Chris Bazley, and Martin Uecker for providing feedback, editing, and suggesting improvements.
if, for, while, do, goto, break, continue and return are all dependable.
However, there are limits to what you can do inside an if, for, or while statement. C99 allows declaring variables in the first clause of a for statement:
for (int i = 0; i < 10; i++)
This is not legal in C89 and is therefore not always dependable. The C99 ability to declare variables in a for statement does not extend to other flow control statements. All of these are illegal:
for (i = 0; int x = i < 10; i++)
if (int x = i < 10)
while (int x = i < 10)
switch (int x = i < 10)
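The dependable form, legal in C89 and later, declares the variable before the statement (do_something is a placeholder):

```c
int i;

for (i = 0; i < 10; i++)
	do_something(i);
```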
for loops are often explained as equivalent to while loops, like this:
for (<statement0>; <statement1>; <statement2>)
{
	...
}
is equivalent to:
<statement0>
while (<statement1>)
{
	...
	<statement2>
}
This is true in C++ but not in C, because statement0 cannot define a type inside a for loop in C, while a type can be defined before a while loop.
auto is an obscure keyword that indicates that a variable has "automatic storage duration". That is the default for variables in function scope, and auto cannot be used on variables outside function scope (although some compilers allow it).
Unfortunately auto has gone from pointless to dangerous in C23. In C23 auto was given a new meaning as a means to automatically assign the type of a variable using assignment:
auto x = 0.0;
In C23, the above code will make x a variable of type double, since 0.0 is a double. This feature is not dependable. In fact, if you write the above in older versions of C, you do not need to specify a type at all, and the variable will default to type int. The above in C89 would therefore make x an int. The implicit int default was removed in C99, but almost all compilers still support it with a warning. This means that fairly recent compilers will compile this with x as an int, while even more recent compilers (supporting C23) will make x a double and no longer warn about it.
Therefore any use of the keyword auto should be considered not dependable.
Both float and double can be considered dependable and are well supported in most C implementations. There are however a few things to consider.
Some small embedded platforms do not have FPUs and may choose to emulate floating point in software, not support floating point operations at all, or only support 32-bit floats. So in some cases it can be useful to avoid needlessly using floating point types. If you, for instance, implement a small library for a file format, network protocol, or compression, keeping the library free from floating point operations (apart from, say, instrumentation for benchmarking) makes it more portable.
While pretty much all floating point implementations use the IEEE 754 standard representation of floating point numbers, it's worth pointing out that different hardware implementations (sometimes from the same vendor) can yield different results due to how the various implementations handle rounding. This means that executing the exact same instructions with the exact same input data can yield different results on two different machines running the same executable. This means that floats are not reliable for lockstep synchronization.
Floating point arithmetic should never be relied upon to produce exact results that can be compared with other values. Here are some examples where x and y may not be equal:
x = (y * 2.0) / 2.0;

x = a / 2.0;
y = a / 2.0;
The compiler may fold some operations at compile time using one implementation, while other operations are not folded and get computed at execution time using a different implementation.
As a general rule, it is only safe to ==-compare floating point values that have been assigned, not values that have been computed; the exception is comparing a floating point value to itself in order to detect a NaN state.
x = 6.0;
...
if (x == 6.0) /* safe to test */

x = 6.0 / 2;
...
if (x == 3.0) /* not safe to test */

x = 1.0 / y;
if (x == x) /* a safe way to test that x is not a NaN */
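When computed values must be compared, a common alternative is a tolerance comparison, as in the sketch below; the threshold is an arbitrary illustrative choice that has to be picked per use case, and handle_equal_case is a placeholder:

```c
#include <math.h>

if (fabs(x - y) < 0.000001) /* treat x and y as equal within a tolerance */
	handle_equal_case();
```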