Dependable C is an attempt to document a subset of C for developers who want to write code they can depend on.
C is the most portable and widely implemented language available. C has been called the lingua franca of computing. A problem solved in C will remain solved for the foreseeable future. Changes in operating systems, computing environments, or hardware are unlikely to render a well-written C implementation obsolete. A library written in C can be used from almost any language. While many programmers don't use C, many can read and understand it. This means that code written in C can be modified by a larger pool of programmers.
If longevity is a measure of quality, C is a prime candidate for writing high-quality code.
Not all C code is portable, will compile the same in all compilers, or can even be understood by most C programmers. C has a long history of quirks and corner cases that can be hard to navigate. Writing non-portable code that is only intended to run on one platform and be built with a particular toolchain is perfectly legitimate, but if you want to write code that is portable and remains usable for decades, this guide is for you. It is for developers who value code that is guaranteed to compile and work correctly over having the latest language features.
Dependable C is the opposite of a dialect. It is a C that tries to be as middle-of-the-road as possible in order to be understood and implemented as widely as possible. Think of it as Newscaster C: a neutral, universally understood language.
Dependable C is not a style guide; it does not prescribe formatting, indentation, or style. It simply tries to document what C functionality can be depended on, and how. It is perfectly valid to use Dependable C as a guide for what functionality to use, while at the same time adhering to a style guide like MISRA. The MISRA standard prioritizes safety, whereas Dependable C prioritizes compatibility. It is entirely possible to adhere to both at the same time.
Many languages have derived their syntax from C: C++, Java, C#, D, JavaScript, and Objective-C, to name a few. Almost all of these languages are based on C89 and have not incorporated C99 or later features.
The purpose of this project is to document the small subset of C that is dependable, and it therefore strongly encourages writing standard-compliant code free of any UB. However, on some very rare occasions, this guide will highlight where writing code that is technically UB is permitted, because in practice it is dependable. Likewise, there are many, many ways to write technically standard-compliant code that is far from dependable, and in some cases implementations do not even exist (see Annex K). The goal is to give guidance on how to write code that works in the real world, on real implementations, not just paper products produced by a standards body. Having written that, most implementations study the standard carefully and do their best to follow it, and the standards body goes to considerable lengths to try to make the standard as complete and clear as possible.
This page is maintained by Eskil Steenberg Hald. I am a long-time C developer, and represent Sweden in the C standard committee. This page is maintained to chronicle my own understanding of the language, and as a guide for my employees and anyone who wants to write dependable C. I consider myself an expert in writing software in C and in undefined behaviour, and I am proficient in the memory model and concurrency model (I would probably rank as one of the world's experts in these two areas, but I still do not want to claim to understand them fully...). I would consider myself less experienced in "modern" versions of the language. Any corrections to this document, or proposed additions, would be much appreciated. I am especially interested in hearing about C features that people have found to be unreliable in any implementation. You can email me at eskil at dependablec dot com.
My participation in the WG14 C standard committee is for my own education and for participation in the memory model and undefined behaviour study groups. Because I will never use any of the newer versions of the language, and do not recommend their use, I abstain from voting on the language's development.
Most other languages have only one or very few implementations. This means you can rely on the implementation's behaviour not varying between platforms. C has numerous implementations, with a very wide range of complexity and feature support.
Many C implementations have bugs, and they mostly manifest when you stretch the language to its limits. All basic functionality can be relied on because the most idiomatic code is also the most tested code. Compiler developers use publicly available code to test their implementations, and therefore a more common construct is much more likely to have been rigorously tested than an esoteric corner case. By writing code in a syntax that you can be sure all compilers have encountered in the past, you minimize the chance that you will trigger a bug.
Code should try to avoid relying on the user having the latest version of a compiler. Some platforms may have had their support deprecated by major compilers, or may only be supported by a specific compiler.
Dependable C advocates using a subset of all versions of C.
Given that C89 is the smallest of the C standards, in practice this means a subset of C89. Simply using the C89 standard is not enough to fully understand C. Many of the changes made to the C standard text in the years since it was published address ambiguities and issues with previous versions. If something is unclear in one standard but has been clarified in later standards, users tend to get the clarified behaviour even when they set their compiler to follow the earlier standard. Given that C89/ANSI C was the first version of the language, it is the version of the standard written with the least implementation experience, and it therefore has the most issues.
While a C89 subset is recommended, the point of writing Dependable C is to be universally accepted, and that includes being accepted by compilers set to any version of C. You may choose to set your compiler to adhere to a strict C89 subset in order to verify that your code is not using any newer functionality, but the code should run just as well using any other version of C. Your code should not require a compiler that has a C89 mode; it should be universal. This is why Dependable C discourages the use of any deprecated functionality or any functionality that clashes with new C features (see "auto").
The vast majority of features added to the C standard since C89 add new ways of doing things that are already possible in C89 if you know how. It is our intention to document as much of this as possible over time. In some cases features that have been introduced in later versions are needed, and in these cases we will try to document how to access these features in the most dependable way possible.
The C programming language is, unfortunately, unfixable. Fortunately, C is good enough not to need fixing.
One of the greatest strengths of C is its compatibility. C has more implementations than any other programming language, more existing code, more documentation, and more experienced programmers than any other language. The cost of breaking all of this compatibility is simply higher than the value brought by any improvement to the language.
There is a wide range of C dialects and proposed replacements that all try to fix perceived deficiencies of C. However, almost none of them have had any success.
The ISO C standard had, until C23, taken backwards compatibility seriously. This means that the committee has been unable to remove functionality, only add it. On some rare occasions, features have been marked as deprecated, but in practice it has not been possible to remove these features from implementations, because users simply need them to compile existing code.
A situation where features can only be added, never removed, serves a language like C poorly, since among its core values are its simplicity, compactness, and ease of implementation. Stability is also poorly maintained by a group of language designers who, not surprisingly, want to design language features. People do not join standards organizations in order to not develop the standard. (In general, my personal experience is that the members of the ISO C WG14 standards body are competent, hardworking, and very knowledgeable, and have the best of intentions. However, when enough people each want to change just one thing, the result is not a clean design.)
While WG14 has historically worked hard to maintain backwards compatibility, it has ignored compatibility in the opposite direction. Writing code in newer versions of C simply makes it incompatible with many platforms and implementations. Often code written in newer versions of the language will not compile in older implementations, but on occasion the meaning of the code simply changes. This is obviously very dangerous.
An example of this hazard is the removal of some UB. At first glance, it seems like a clear improvement to define behaviours that have in the past needlessly been undefined. But it is problematic if a programmer reads a later standard that makes a guarantee that isn't upheld by the many implementations written before the behaviour was defined. The behaviour may be technically defined, but in practice it is still not dependable, since it has a history of being undefined, and unlike in the past, this hazard is no longer clearly spelled out in the standard. The well-intended effort to remove an issue instead creates an issue. This is one of the issues that compelled the Dependable C effort.
A time traveller going back to 1972 could address many issues in C, but today the situation is much more complicated. Luckily, the small subset defined here is more than capable of doing everything that needs to be done. In the grand scheme of things, the sacrifices are minor. Most of the issues of C can, for a developer, simply be addressed by "just don't do that then"; implementations don't have that luxury, since they need to compile existing code that isn't always as well written.
Any software project has requirements; just because your code adheres to a language standard does not mean that it is meaningful to run on all platforms that support the language. C is a language that can be implemented on very exotic platforms, where for instance bytes aren't 8 bits, or where memory is very limited. It is perfectly reasonable to write software that follows the Dependable C guidelines but that isn't portable to, for example, hardware with pointers smaller than 64 bits. Common examples of such assumptions:
- Bytes are 8 bits.
- Types are aligned to no more than their size.
- Function pointers have the same size as data pointers.
None of these are guaranteed by the standard.
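Where a project makes such assumptions, they can be verified at compile time. Below is a minimal sketch using the classic negative-array-size trick, which works even in C89; the ASSUMPTION_CHECK macro name is our own invention, not an established convention:

```c
#include <limits.h>
#include <stddef.h>

/* Fails to compile (negative array size) if the expression is false. */
#define ASSUMPTION_CHECK(name, expr) typedef char name[(expr) ? 1 : -1]

struct align_probe { char c; double d; };

ASSUMPTION_CHECK(bytes_are_8_bits, CHAR_BIT == 8);
ASSUMPTION_CHECK(alignment_is_at_most_size,
                 offsetof(struct align_probe, d) <= sizeof(double));
ASSUMPTION_CHECK(function_and_data_pointers_match,
                 sizeof(void (*)(void)) == sizeof(void *));
```

A build that violates one of the assumptions then fails immediately, rather than misbehaving at run time.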
I would caution against assuming pointers are and always will be 64 bits on modern platforms, as there are new platforms that enable 128-bit pointers. While this is unlikely to come into mainstream use any time soon, it is likely to be a growing niche. (128-bit pointers allow mapping very large NUMA architectures; for instance, it is possible to encode a network address (MAC or IP) and a memory address into one pointer.)
Dependable C encourages C++ compatibility in all interfaces, but does not guarantee that code compiles correctly in a C++ compiler. C is not a subset of C++, and the differences between the two languages are subtle and often unintended. Writing code that is guaranteed to produce the same results in both C and C++ requires deep knowledge of both languages and is not something we recommend. We strongly encourage header files to be C++ compatible and not contain any functions, as shown in the sketch below. We also discourage any use of C++ keywords.
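As an illustration, the conventional pattern for a C++-compatible header looks like this (the MYLIB names are placeholders):

```c
#ifndef MYLIB_H
#define MYLIB_H

#ifdef __cplusplus
extern "C" { /* give the declarations C linkage when included from C++ */
#endif

int mylib_init(void);
void mylib_shutdown(void);

#ifdef __cplusplus
}
#endif

#endif /* MYLIB_H */
```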
There are many misconceptions about C, and features have even been added to C that are invalidated by the as-if concept.
[[_TOC_]]
N3308
This document is an educational document that tries to explain the concept of "undefined behavior" in the C programming language. It is the combined effort of the ISO WG14 Undefined Behavior Study Group to clarify the term and its implications.
ISO C defines undefined behavior (UB) in Section 3.4.3 as:
behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this document imposes no requirements
Note 1 to entry: Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).

Note 2 to entry: J.2 gives an overview over properties of C programs that lead to undefined behavior.

Note 3 to entry: Any other behavior during execution of a program is only affected as a direct consequence of the concrete behavior that occurs when encountering the erroneous or non-portable program construct or data. In particular, all observable behavior (5.1.2.4) appears as specified in this document when it happens before an operation with undefined behavior in the execution of the program.
Inherent to the ISO specification of the C programming language is the concept that a set of behaviors are undefined. From this the specification derives several strengths as well as several weaknesses. UB allows a platform to either define platform-specific behaviors or ignore the possibility of an erroneous state. The language does not require a platform to detect these errors.
Undefined behavior is used in many places in the C standard, and for several reasons, some of which are explored below.
Undefined behavior can either be explicitly specified in the standard or remain implicit if the standard does not define a behavior. The C standards body has a goal to document all UB in the C standard, but identifying all UB is a difficult and laborious task. The standard states that the rules for undefined behavior extend to behavior that is not specified by the standard.
Additionally, there are paragraphs in the standard where it is unclear whether a behavior is defined or not. This can mean that some platforms treat a behavior as defined while others treat it as undefined. For example, the standard states that the first member of a struct has a zero offset from the struct itself. Some argue that this means that the first member of the struct therefore must have the same pointer address as the struct while others argue that it is undefined if the struct has the same address as its first member, as the standard does not explicitly resolve this ambiguity.
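For illustration, the disputed guarantee can be written as the comparison below (a minimal sketch; on common implementations the two pointers compare equal, but whether the standard requires it is exactly the point of contention):

```c
#include <stdio.h>

struct widget {
	int first;
	int second;
};

int main(void)
{
	struct widget w;
	if ((void *)&w == (void *)&w.first) /* the disputed comparison */
		printf("struct and first member share an address\n");
	return 0;
}
```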
Beyond undefined behavior, the C standard defines a range of terms for behaviors, such as unspecified behavior, implementation-defined behavior, and locale-specific behavior. Unlike undefined behavior, each of these terms does define a constrained behavior that the implementation has some form of responsibility to uphold, even if it may differ between implementations.
All of these differ from undefined behavior in that, while they may produce different behaviors on different implementations, they do represent behaviors that a user can depend on in an ISO-compliant C implementation.
The C standard states that any platform is free to detect UB and to provide platform-specific behavior and document this behavior if it wishes. In this sense, what is in strict ISO C terms "UB" may be well-defined behavior on a particular implementation.
This can be very useful, because it enables implementers to extend C's capabilities, and thereby grants users access to platform-specific features. While the C language is designed to enable cross-platform development, developers are free to only support a limited set of platforms. For example, there are implementations of C that do define the behavior of out-of-bounds array writes, signed integer overflow, and dereferencing null pointers.
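As a concrete illustration, GCC and Clang offer the -fwrapv option, which defines signed integer overflow to wrap. Code like the sketch below is well-defined under that extension but is UB in strict ISO C (the function name is our own):

```c
/* Relies on -fwrapv (a GCC/Clang extension): signed overflow wraps,
   so the post-hoc test below is defined. In strict ISO C it is UB. */
int increment_would_overflow(int x)
{
	return x + 1 < x;
}
```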
For brevity, unless otherwise noted, this document will consider UB only in cases where the implementation has not defined a platform-specific behavior or implementation-specific behavior.
Consider the following code:
int a[5];
a[x] = 0;
What should happen if x is 42? A language design could issue an error, exit the program, or resize the array, among other choices. However, any of these choices would require the implementation of the language to perform a test to see if the value is within the valid range of the array.
int a[5];
if (x < 0 || x >= 5) {
	/* Handle out-of-bounds write */
} else {
	a[x] = 0;
}
This range check would add work for the compiler and the execution environment. Adding any requirement to detect whether the assignment is out of bounds would come at a cost in run-time performance and complexity. Not only would the implementation have to check each access to the array, it would also have to keep track of valid array ranges.
C is designed to be fast, simple, and easily implementable; this is why C does not require any detection of out-of-bounds states. Consequently, C cannot define a behavior for a state that isn't detected. The behavior must be undefined.
It is a common misconception that all undefined behavior in the standard stems from oversights, or from the standard body's failure to agree on an appropriate behavior. The above example clearly shows that it is not practical to define any consistent behavior for out-of-bounds array access without imposing a considerable burden on the implementation to detect the state. The cost of detecting an erroneous state prevents the language from defining any behavior should it occur.
Furthermore, if the standard were to require a program to exit on an out-of-bounds write, then the following piece of code would become a valid way to exit a program:
int a[5];
a[24] = 0;
This is not a good way to deliberately exit a program. It is preferred that a program exit in a manner that the standard explicitly documents as exiting, such as by calling a function named `exit`.
Reconsider this code:
int a[5];
a[x] = 0;
Another interpretation of the above code is that if there are no requirements for an implementation to handle an out-of-bounds access, then the code contains an implicit contract that `x` can only be between 0 and 4. The implementation can then assume that the user is aware of the contract and consents to it, even if the implementation cannot by itself determine that the contract is valid by analysis of the possible values `x` may hold. The implementation therefore need not check the value of `x`.
If the user cannot guarantee that `x` is within range, they can rewrite the code:
int a[5];
if (x >= 0 && x < 5)
	a[x] = 0;
One big reason that many behaviors are undefined is that detecting these undefined behaviors may be difficult to do at compile time, or it may impose too much of a performance penalty at run time.
The existence of undefined behavior implies conversely that when a program has no undefined behavior, its behavior is well-specified by the ISO C standard and the platform on which it runs. This is a promise, or contract, between the ISO C standard, the platform, and the developer. If the program violates this promise, the result can be anything; it is likely to violate the user's intentions, and it will not be portable. We will call this promise the "Assumed Absence of UB".
A C program that enters a state of UB can be considered to contain an error that the platform is under no obligation to catch or report and the result could be anything.
Consider this code:
x = (x * 4) / 4;
From a mathematical perspective, this operation should not change the value of x. The multiplication and the division should cancel each other out. However, when calculated in a computer, x * 4 may produce a value that cannot be represented in the type of x. If x is an unsigned 32-bit integer with the value 2,000,000,000 and it is multiplied by 4, the operation wraps on a 32-bit platform and produces 3,705,032,704. The subsequent division by 4 then produces 926,258,176. Since the standard declares that operations on unsigned integers have defined wrapping behavior, the two operations do not cancel each other out.
If we instead perform the same operation using signed integer types, things change because signed integer overflow is UB. By using a signed integer, the programmer has agreed to the contract that no operations using the type will ever overflow. Therefore, the optimizer is free to ignore any potential overflow, and can assume that the two operations cancel each other out. This means that there is a significant optimization advantage in declaring that signed integer overflow is UB.
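A minimal sketch contrasting the two cases, assuming a 32-bit int as in the example above:

```c
unsigned int u = 2000000000u;
int s = 2000000000;

u = (u * 4u) / 4u; /* defined: wraps, u becomes 926258176 */
s = (s * 4) / 4;   /* signed overflow is UB: the compiler may fold this to just s */
```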
The assumption that the program contains no UB is a powerful tool that compilers can employ to analyze code to find optimizations. If we assume that a program contains no UB, we can use this information to learn about the expected state of the execution. Consider:
int a[5];
a[x] = 0;
If x is any value below 0 or above 4, the code contains UB. On many platforms, `a[-1]` and `a[5]` would be assigned to addresses outside the bounds of `a`. Without requiring implementations to explicitly add bounds checks, it becomes impossible to predict the side effects of an out-of-bounds write. The implementation is therefore allowed to assume that UB will not happen. This phenomenon is known as "Assumed Absence of UB", and it lets compilers make further deductions. By writing the above code, the programmer respects a contract with the compiler that `x` will never exceed the bounds of the array.
If we consider:
int a[5];
a[x] = 0;
if (x > 5) {
	/* ... */
}
In this case, since the compiler assumes `x` must be between 0 and 4, the if statement cannot possibly be true. This allows the compiler to optimize away the if statement entirely. This completely conforms to the standard, but it removes some predictability of UB, and can make programs with UB much harder to debug. The out-of-bounds write no longer causes a predictable wild write and it also causes an `if` statement to be removed.
A common bug is to try to detect and avoid signed integer overflow with code like this:
if (x + 1 > x) {
	x++;
}
If we assume that UB cannot happen, then we must assume the `if` condition must always be true. Consequently, many compilers will optimize away the `if` statement entirely.
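A dependable version of the check tests the bound before performing the addition, so that no overflow is ever produced (a minimal sketch):

```c
#include <limits.h>

if (x < INT_MAX) {
	x++;
}
```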
The confluence of UB and more aggressive but standards-compliant compiler optimizations exposes latent bugs that may otherwise behave according to user intentions. These bugs are characterized as hard to find and diagnose. They often do not appear at lower optimization levels, which means they do not appear in the executables that developers produce during development. Consequently, these bugs can bypass many tests. Debuggers tend to operate on executables compiled with lower optimization settings, where many of these issues do not show up. This makes it harder to find and fix these bugs.
An early example of a vulnerability arising from such aggressive optimization is [CERT vulnerability 162289](https://www.kb.cert.org/vuls/id/162289).
A common consideration when discussing UB is the question of when UB is invoked. While some have argued that programs that are able to produce UB have no requirements whatsoever, it is the position of the WG14 UB Study Group that a program must first reach a state of UB before the requirements of the language standard are suspended. This view is shared by implementers, who have a history of classifying instances where this isn't true as compiler bugs.
Consider the following:
int a[5], x;
scanf("%i", &x);
a[x] = 0;
In this example, a user-provided index is used to access an array of five elements. While this program may be bad form, it is well-defined until and unless `scanf` sets `x` to outside the range of the array. The developer has (implicitly) guaranteed that the index used to access the array will stay within the bounds of the array, but this guarantee is maintained outside of the program. Many programs depend on input strictly conforming to a set of requirements to operate correctly. While this may present safety and security issues, the developers must weigh those considerations against other factors, such as performance. Even a strictly-conforming program could enter a state of UB under some environmental circumstances. A program is only erroneous when it reaches UB. An implementation is not released from complying with the ISO C standard because UB is possible when executing that program; the implementation is released only once the program has entered a state of UB.
A core tenet of the C standard is the "as-if" rule. This rule states that an implementation is not required to operate in the way the program is strictly written, so long as the implementation's observable behavior (defined in C23, s5.1.2.3p6) is identical to the program. The program must behave, but not operate, as if the written program was executed.
This means that the actual program behavior can vary radically depending on how an implementation is able to transform the program, as long as its observable behavior remains constant. For example, two non-observable operations can be reordered. Consider:
int a, b;
a = 0;
b = 1;
These are two non-observable assignments (because neither a nor b is `volatile`). As two independent operations they are not required to be executed in any particular order. They may in fact be executed concurrently. If we then consider:
*p = 0;
x = 42 / y;
These two operations are also non-observable operations, however both operations can produce UB (either by `p` pointing to an invalid address, or `y` producing a divide by zero). Because the operations are non-observable, they may be re-ordered. If `y` is zero, there is no guarantee that `*p` is written before the program enters a state of UB.
Because any non-observable operation can be reordered and transformed, a program might reach a state of UB in an ordering not explicitly expressed in the source code. Due to the assumed absence of UB, and the "as-if" rule, a program can show symptoms of UB before any actual UB is encountered during program execution. Consider:
int a[5];
if (x < 0 || x >= 5)
	y = 0;
a[x] = 0;
Using assumed absence of UB, the implementation can determine that `x` must be a value between 0 and 4, and therefore the `if` statement can be removed. This causes an out-of-order behavior known as "time traveling UB", where a program bug causes unintended consequences before the UB is encountered during program execution. It is as if the UB traveled backwards in time from the array access to the if statement.
Time traveling UB is permitted if it does not interfere with observable behavior that occurs before entering a state of UB. Consider:
int a[5];
if (x < 0)
	y = 0;
if (x >= 5)
	printf("Error!\n");
a[x] = 0;
In this case, the call to `printf` is an observable event, and any re-ordering requires it to execute correctly unless it is preceded by a state of UB. The compiler is not permitted to optimize away the second if statement. The first if statement, however, has no impact on the observable behavior and can therefore be removed.
Note: Historically, there have been cases where time travel has impacted observable state. Implementers have generally considered these to be implementation bugs. To clarify that they indeed are bugs, the document [N3128 Uecker] was proposed and accepted for C23. It adds the non-normative Note 3 that clarifies the issue in the standard.
Consider this code:
int a[5];
a[42] = 0;
Every time this code runs, it will produce UB. The state of UB does not depend on any dynamic or external factors other than the code being executed. We choose to define this type of UB as "static UB", because it only depends on variables that are known at compile time. The term "static UB" is somewhat complicated because different implementations have differing abilities to detect UB at compile time. Consider:
int a[5];
if (x > 0) {
	y = 42;
} else {
	y = INT_MAX;
}
a[y] = 0;
This code also contains static UB but requires a more complex analysis to reach that conclusion. The term "static UB" denotes any UB that is not dependent on runtime state. An implementation is under no obligation to detect static UB, but if an implementation does detect static UB we have recommendations for how to proceed. Static UB denotes expressions that always produce UB even if it's not proven that the expression will ever be evaluated.
Any statement that produces a state of UB (with the exception of the `unreachable()` macro) is erroneous, unless an implementation has defined its own behavior for that statement. An implementation is under no obligation to detect any UB. If, however, the implementation doesn't detect static UB, it is free to assume the statement will not produce UB. Therefore any static UB (again, excepting `unreachable()`) should be considered a developer error and not an intended use of the language. In these cases, an implementation should issue an error with an appropriate diagnostic when it detects UB.
An implementation can assume that a program will not enter a state of UB, but no implementation should assume that a program that reaches a state of UB is intentional.
Consider again:
int a[5];
a[x] = 0;
The assignment may or may not produce UB. In this case if we follow the rule "assumed absence of UB", we can assume that `x` must be between 0 and 4. The assignment is not just an assignment; it also provides a hint to the compiler as to what `x` may be. If we then add:
int a[5];
a[x] = 0;
if (x > 4)
	...
The if statement here can be considered dead code and optimized away. The if statement doesn't produce UB, it just cannot happen without UB. If we instead consider:
int a[5];
if (x > 4) {
	a[x] = 0;
}
Again, this code may or may not trigger UB, but if the assignment is ever executed it is guaranteed to trigger UB. (Note that an implementation is not required to detect the UB). In other words, the UB is static, but only if the assignment is executed.
The correct interpretation of the detected static UB is that the code is erroneous. It is incorrect to interpret the above code as a valid way for the user to express that `x` is 4 or less. The "assumed absence of UB" rule only applies to the way a construct can be assumed to be executed, not that a construct that always produces UB will never be executed. One divided by x lets the compiler assume x is not zero, whereas x divided by zero should cause the compiler to assume an unintended user error.
The one exception to this is the `unreachable()` macro. The `unreachable()` macro is the only way for a user to express that a statement can be assumed to never be executed. Incidentally, executing `unreachable()` is UB, but it should not be regarded as equivalent to other UB in this regard.
For example:
if (x > 4)
	unreachable();
This is a correct way to express that a compiler can assume that `x` is smaller or equal to 4. Despite `unreachable()` being UB, it is not equivalent to:
if (x > 4)
	x /= 0;
Division by zero is UB, but unlike `unreachable()`, it is assumed to be a user error. The `unreachable()` macro can therefore not be implemented by the user by producing UB in some way other than the `unreachable()` macro. UB is also erroneous even when it can be determined never to be executed. The following can be detected as erroneous:
if (0)
	x /= 0;
C is designed to make naive, as well as highly optimizing implementations possible. The C standard therefore places no requirements or limits on the efforts an implementation takes to analyze the code. Whichever erroneous UB may be detected will therefore vary between implementations.
Operating systems and even hardware have been designed to mitigate the side effects of unintentional UB, or deliberate sabotage using UB, with features such as protection of the memory containing the executable or execution stack. Due to some of these protections, some UB is predictably caught at run time. This mitigates the unpredictable nature of UB and improves the stability and security of the system. However, this can also give the false impression that some UB has predictable side effects. While dereferencing null pointers is technically UB, doing so has a very predictable outcome (a trap) on many platforms. Even if the behavior of dereferencing null is reliable on a platform, the compilers' assumption that the code will not dereference null will make it unreliable.
Some UB was initially included in the C standard because the standard wanted to allow for different platform designs. Over the years, some designs have grown so dominant that few developers will ever encounter a platform that does not conform to these dominant designs. One example of this is two's-complement arithmetic, which causes signed integer overflow to wrap.
This means that many UBs have predictable behavior on most platforms:
| UB | Convention |
|----|------------|
| Dereferencing null pointer | Traps |
| Signed integer overflow | Wraps |
| Using the offset between 2 allocations | Treats pointers as integer addresses |
| Comparing the pointer to freed memory with a newly allocated pointer | Treats pointers as integer addresses |
| Reading uninitialized memory | You get whatever is there |
Such behavior is not defined by the C standard but can seem to be predictable. Predictability is of great value to most developers. The knowledge of how the underlying platform operates lets the developer predict and diagnose bugs. A trapped null pointer dereference is easy to find in a debugger. In fact, a programmer may deliberately add a null pointer dereference to a program to invoke a core dump. In MSVC debug builds, uninitialized heap memory is filled with 0xCDCDCDCD, a [magic number](https://en.wikipedia.org/wiki/Magic_number_(programming)) that is instantly recognizable to any experienced Windows programmer. If the sum of two large positive signed integers results in a negative value, a wise programmer will suspect signed integer overflow that happened to wrap.
This apparent predictability of many types of UB hides the fact that UB is not predictable. This causes many programmers to either not realize that some of these behaviors are undefined, or to confuse UB with implementation-defined behavior. They may believe that the behavior is defined in the C standard, or they may know it is non-portable but assume that the behavior of their platform applies to all platforms, or to other hosts of their machine's platform. This faulty assumption creates a variety of hard-to-diagnose issues that we will explore further.
An out-of-bounds write may have a wide range of consequences, as it can disturb many kinds of state. However, most developers would assume that an out-of-bounds write is executed as a write operation, which is not true in general. For another UB such as signed integer overflow, it is even less intuitive that a simple arithmetic operation can have a wide range of unpredictable outcomes.
Undefined behavior in C gives an implementation wide latitude to optimize the code. This freedom has enabled implementers to successively generate faster and faster machine code, which enables significant reduction in computing time and energy consumption for a wide range of workloads. C is the de facto benchmark for efficiency that other languages are compared against and strive to match.
Significant portions of UB, such as the aliasing, provenance, and overflow rules, are specifically designed to enable implementations to make optimizations. Violating these categories of UB is likely to cause unpredictable behavior only when an implementation engages with these opportunities to optimize code.
As many implementations support varying levels of optimizations, a perception has formed in parts of the C community that compilers, at higher levels of optimizations, ignore the C standard and "break" code. This is a misconception. Most C implementations are consistent with the C standard even at the highest levels of optimization settings. Optimizations reveal existing bugs in the source code much more often than they reveal bugs in the compiler. These bugs are usually in violation of the C standard even when the program operates consistently with the developers' expectations.
The higher the level of optimization employed, the more bugs are exposed; but as the code is further transformed, it also becomes harder to debug. Many tools, like debuggers, depend on low levels of optimization to be able to correctly associate the binary's execution with the source code. This compounds the difficulty of diagnosing UB bugs.
Given the misconception that optimizations break code, rather than reveal latent bugs, implementers often unfairly get blamed for issues arising from UB. This has made many compilers avoid making certain optimizations, even when supported by the specification, if they anticipate a user backlash. This creates a gray area, where unsound code that contains UB may have an undocumented reliable or semi-reliable behavior. This gray area comes at the cost of denying performance afforded by the standard to compliant code.
C is regarded as an "unsafe language". This is, in the strictest sense, not true. The C standard does not require an implementation to check for several errors, but it also does not prevent an implementation from doing so. Hence, each implementation may choose the level of safety guaranteed.
In practice, C is an unsafe language because the most popular implementations of C choose not to make many additional guarantees, but instead choose to prioritize performance and power efficiency. As such, C is perceived as a de facto unsafe language because that is how most users have chosen to use it.
There are safer implementations, but these are predominantly used to detect issues during development rather than to add additional protections to deployment. One such implementation is [Valgrind](https://valgrind.org/), whose default tool "memcheck" detects out-of-bounds reads and writes to memory on the heap, as well as uninitialized reads, use-after-free errors, and memory leaks. Valgrind achieves these safety constraints at a significant performance cost. Many implementations such as GCC, LLVM and MSVC offer various tools for detecting and diagnosing UB. Several static analyzers also exist to alleviate this problem.
Users can also write their own memory tracking shims to detect small out-of-bounds writes, double frees, memory consumption and memory leaks, using macros:
#define malloc(n) my_debug_mem_malloc(n, __FILE__, __LINE__) /* Replaces malloc. */
#define free(n) my_debug_mem_free(n, __FILE__, __LINE__) /* Replaces free. */
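A minimal sketch of what such a shim might look like behind those macros. The shim file itself must be compiled without the macros active, and everything beyond the two function names above (the guard band, header layout, and reporting) is an illustrative assumption rather than a prescribed design:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_SIZE 16 /* bytes of known pattern appended to each block */

typedef struct MemHeader {
	struct MemHeader *next;
	size_t size;
	const char *file;
	int line;
} MemHeader; /* assumed to give sufficient alignment for the payload */

static MemHeader *mem_list = NULL; /* all live allocations */

void *my_debug_mem_malloc(size_t size, const char *file, int line)
{
	unsigned char *payload;
	MemHeader *header = malloc(sizeof *header + size + GUARD_SIZE);
	if (header == NULL)
		return NULL;
	header->size = size;
	header->file = file;
	header->line = line;
	header->next = mem_list;
	mem_list = header;
	payload = (unsigned char *)(header + 1);
	memset(payload + size, 0xAA, GUARD_SIZE); /* write the guard band */
	return payload;
}

void my_debug_mem_free(void *pointer, const char *file, int line)
{
	MemHeader **link, *header;
	unsigned char *payload;
	size_t i;
	for (link = &mem_list; *link != NULL; link = &(*link)->next)
		if ((void *)(*link + 1) == pointer)
			break;
	if (*link == NULL) { /* double free, or pointer never came from malloc */
		fprintf(stderr, "%s:%i: bad free of %p\n", file, line, pointer);
		return;
	}
	header = *link;
	payload = (unsigned char *)(header + 1);
	for (i = 0; i < GUARD_SIZE; i++) {
		if (payload[header->size + i] != 0xAA) { /* small out-of-bounds write */
			fprintf(stderr, "%s:%i: block allocated at %s:%i was overwritten\n",
			        file, line, header->file, header->line);
			break;
		}
	}
	*link = header->next; /* unlink, then release the real allocation */
	free(header);
}
```

Walking mem_list at program exit can then report total memory consumption and any blocks that were never freed.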
While not in any way mandated by the C specification, the prevailing modus operandi of C users consists of using safety-related tools to detect issues during development, rather than as backstops during deployment. A major drawback of this approach is that since UB is a state that often cannot be definitively detected until it occurs at run time, there is no easy way to definitively guarantee that a program will not enter a state of UB.
Despite this, it is worth noting that some of the most trusted software in the world, like the Linux kernel, Apache, MySQL, Curl, OpenSSL and Git are written in C. The simplicity of C makes it significantly easier to read and detect issues.
C does suffer where the standard is unclear, particularly in the areas of the memory model and concurrent execution. Rules about aliasing, active type, thread safety, and volatile leave a lot open to interpretation as to what is UB and what is not. On many of these issues there is a lack of consensus within WG14. Most implementations do support behaviors that in the strictest reading of the standard would be considered UB, simply because of user expectation and to be able to compile important existing software. In this sense most implementations deviate from the standard, but how, and how much, they deviate varies. Some projects, like the Linux kernel, have explicitly opted out of these ambiguities and defined their own requirements.
As this document has hopefully illustrated, Undefined Behavior in the context of C is complex. To simply say that its behavior has been omitted from the standard does not convey this complexity.
C is designed to be a language that trusts the developer. In the case of UB, developers should interpret this to mean "Trust the developer not to initiate UB", rather than "The developer can trust UB if they know the underlying implementation and platform". The Undefined Behavior Study Group therefore strongly advises developers to avoid any UB, unless a platform has explicitly defined that behavior. Testing to determine what observable effect use of a nonportable or erroneous program construct has on your platform is insufficient cause for assuming the UB will consistently have the same behavior on all platforms, including the next one that your code will run on. Only trust an implementation's explicit documentation of a language extension that defines a behavior. We advise that implementations clearly document any language extensions that replace undefined behavior so that users can differentiate between such extensions and seemingly predictable but still unintended behavior.
A computer language is a tool for humans to communicate with computers, but it is also a tool for computers to communicate with humans. Humans spend more time reading the code they write and trying to figure out why its behavior does not match their expectations, than computers do. Traditionally implementations have been black boxes that users must rely on, without understanding how they operate. UB shows that this approach causes issues, because modern compilers do not operate like many users expect them to. We would therefore recommend that implementations try to find ways to be more transparent with their transformations. The ability for users to inspect code that has been transformed could reveal out-of-order issues, code removal, load/store omissions and other non-obvious transformations. We recognize that this involves significant user interface and architectural challenges.
This document was written by Eskil Steenberg Hald. It is the result of many invaluable discussions in the Undefined Behavior Study Group and ISO WG14, so many of its members deserve credit for its creation. Specifically, the author wants to thank David Svoboda, Chris Bazley, and Martin Uecker for providing feedback, editing, and suggesting improvements.
if, for, while, do, goto, break, continue and return are all dependable.
However, there are limits to what you can do inside an if, for, or while statement. C99 allows declaring variables in the first clause of a for statement:
for (int i = 0; i < 10; i++)
This is not legal in C89 and is therefore not always dependable. The C99 ability to declare variables in a for statement does not extend to other flow control statements. All of these are illegal:
for (i = 0; int x = i < 10; i++)
if (int x = i < 10)
while (int x = i < 10)
switch (int x = i < 10)
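The dependable form, legal in C89 and later, declares the variable before the statement (do_something is a placeholder):

```c
int i;

for (i = 0; i < 10; i++)
	do_something(i);
```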
for loops are often explained as equivalent to while loops, like this:
for (<statement0>; <statement1>; <statement2>)
{
	...
}
is equivalent to:
<statement0>
while (<statement1>)
{
	...
	<statement2>
}
This is true in C++ but not in C, because statement0 cannot define a type inside a for loop in C, while a type can be defined before a while loop.
auto is an obscure keyword that indicates that a variable has "automatic storage duration". That is the default for variables in function scope, and auto cannot be used on variables outside function scope (although some compilers allow it).
Unfortunately auto has gone from pointless to dangerous in C23. In C23 auto was given a new meaning as a means to automatically assign the type of a variable using assignment:
auto x = 0.0;
In C23, the above code will make x a variable of type double, since 0.0 is a double. This feature is not dependable. In fact, if you write the above in older versions of C, you do not need to specify a type at all, and the variable will default to type int. The above in C89 would therefore make x an int. The implicit int default was removed in C99, but almost all compilers still support it with a warning. This means that fairly recent compilers will compile this with x as an int, while even more recent compilers (supporting C23) will make x a double and no longer warn about it.
Therefore any use of the keyword auto should be considered not dependable.
Both float and double can be considered dependable and are well supported in most C implementations. There are however a few things to consider.
Some small embedded platforms do not have FPUs and may choose to emulate floating point in software, not support floating point operations at all, or only support 32-bit floats. So in some cases it can be useful to avoid needlessly using floating point types. If you, for instance, implement a small library for a file format, network protocol, or compression, keeping the library free from floating point operations (apart from, say, instrumentation for benchmarking) makes it more portable.
While pretty much all floating point implementations use the IEEE 754 standard representation of floating point numbers, it's worth pointing out that different hardware implementations (sometimes from the same vendor) can yield different results due to how the various implementations handle rounding. This means that executing the exact same instructions with the exact same input data can yield different results on two different machines running the same executable. This means that floats are not reliable for lockstep synchronization.
Floating point arithmetic should never be relied upon to produce exact results that can be compared with other values. Here are some examples where x and y may not be equal:
x = (y * 2.0) / 2.0;

x = a / 2.0;
y = a / 2.0;
The compiler may fold some operations at compile time using one implementation, while other operations are not folded and get computed at execution time using a different implementation.
As a general rule, it is only safe to ==-compare floating point values that have been assigned, not values that have been computed; the exception is comparing a floating point value to itself in order to detect a NaN state.
x = 6.0;
...
if (x == 6.0) /* safe to test */

x = 6.0 / 2;
...
if (x == 3.0) /* not safe to test */

x = 1.0 / y;
if (x == x) /* a safe way to test that x is not a NaN */
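When computed values must be compared, a common alternative is a tolerance comparison, as in the sketch below; the threshold is an arbitrary illustrative choice that has to be picked per use case, and handle_equal_case is a placeholder:

```c
#include <math.h>

if (fabs(x - y) < 0.000001) /* treat x and y as equal within a tolerance */
	handle_equal_case();
```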