By Eskil Steenberg Hald eskil@quelsolaar.com
Initialization in C has a number of pitfalls, and they usually stem from programmers (or language designers) trying to be clever. The general advice is to use "=" to assign values directly.
Let's start with a simple example:
int a = 0;
Is "a" initialized in this code? No, not necessarily. "a" is declared in the scope from the point where the declaration appears, but for the value to be initialized the declaration actually has to be executed, and there are ways to get around that, using goto or a switch:
goto label;
{
int a = 0;
label :
printf("%i", a); /* a is declared but not initialized */
}
switch(1)
{
case 0 :
{
int a = 0;
case 1 :
printf("%i", a); /* a is declared but not initialized */
}
}
Being able to declare variables anywhere, as you can in C99, makes the problem a lot worse. The simple solution is to always declare variables only at the beginning of a scope (C89 style), and only in the function scope. By never declaring variables in other scopes you also avoid a range of other bugs where variables in different scopes have the same name.
Arrays can be initialized with braces. This is usually good, but there are some traps. One such trap is the definition of the array length. Consider:
int a[] = {1, 2, 3};
int b[3] = {1, 2, 3};
The size of 'a' is implicit from the initialization, so there is no clear definition of the array's size that can be referred to elsewhere. You can use sizeof to extract the size, but you then need to divide it by the size of int. In general I advise against using sizeof on any array: arrays can decay to pointers, they can easily be replaced by pointers during refactoring, and if the type of the array changes and the divisor doesn't, that's another cause for a bug. The size of 'b' is explicit and much clearer, and if it is declared with a define then the define can be reused. It still presents the issue that there may be fewer initializers than there are members of the array. The safest way to initialize an array is therefore without braces, if possible.
#define LENGTH_OF_ARRAY 3
int c[LENGTH_OF_ARRAY];
unsigned int i;
for(i = 0; i < LENGTH_OF_ARRAY; i++)
c[i] = i + 1;
Yes, it's more verbose, but it is fail-safe.
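The sizeof pitfall described above can be sketched as follows. This is only an illustration: the function names are hypothetical and not part of the original article. The same expression measures the whole array where the array is in scope, but only a pointer once the array has been passed to a function.

```c
#include <stddef.h>

#define LENGTH_OF_ARRAY 3

/* An array passed to a function arrives as a pointer, so sizeof here
   measures the pointer, not the caller's array. */
size_t size_seen_in_callee(int *values)
{
    return sizeof values;
}

/* Where the array itself is in scope, sizeof measures the whole array,
   and dividing by the element size gives the element count. */
size_t elements_seen_in_scope(void)
{
    int b[LENGTH_OF_ARRAY] = {1, 2, 3};
    return sizeof b / sizeof b[0];
}

/* Hands the same kind of array to the callee and reports what it sees. */
size_t bytes_seen_in_callee(void)
{
    int b[LENGTH_OF_ARRAY] = {1, 2, 3};
    return size_seen_in_callee(b);
}
```

The callee only ever sees the size of a pointer, which is why passing LENGTH_OF_ARRAY alongside the pointer is the more robust choice.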
Another thing to be aware of with array initialization is that if you provide fewer initializers than the array has elements, all the remaining elements are initialized to zero. Even a single initializer therefore initializes every value. Consider:
char string[1024] = {'\0'};
This code does not write a single byte to null-terminate a string; it fills the entire array with null characters, making it considerably slower than:
char string[1024];
string[0] = '\0';
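A minimal sketch of the fill behaviour described above, using an int array so the effect is easy to check. The helper names are hypothetical. One initializer is given, and the remaining elements are all set to zero.

```c
#include <stddef.h>

/* Returns 1 if v[0] equals first and every later element is zero. */
int rest_is_zero(const int *v, size_t length, int first)
{
    size_t i;
    if(v[0] != first)
        return 0;
    for(i = 1; i < length; i++)
        if(v[i] != 0)
            return 0;
    return 1;
}

/* Only one initializer is written, yet all eight elements get a value. */
int partial_initializer_demo(void)
{
    int v[8] = {7};
    return rest_is_zero(v, 8, 7);
}
```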
As a general rule, I would try to avoid having arrays with complex operations on the stack. Stack overrun bugs are far more dangerous than heap overruns, and are harder to debug.
NULL is a reserved address that may or may not reside at address 0x0. Where NULL resides is platform defined. NULL has two definitions in C (three in C23): (void *)0 and 0. When you type:
int *p = (void *)0;
You are not setting p to zero, you are setting it to NULL. That means the compiler has to recognize that you are assigning one of the two definitions of NULL to a pointer, and then set the pointer to whatever the representation of NULL is on the platform. Therefore this is not portable C code:
int *p;
memset(&p, 0, sizeof p);
You cannot portably initialize pointers to NULL with memset!
Because the compiler has to be able to differentiate between NULL and other values, using 0 as NULL is generally considered bad. Therefore you should always define NULL as (void *)0. This is why you should also never use NULL as a null-terminator for strings. Null terminators are in fact not NULL, they are a reserved character (that has to have all its bits set to 0). The null terminator you should use is '\0'. Therefore this is not advisable:
char string[] = {'H', 'i', '!', NULL};
printf("%s", string);
If NULL is defined as (void *)0, it may be translated to a reserved address that is not zero, and then translated back to a char integer, that is not the same as '\0'.
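For comparison, a small sketch of the same array terminated with the character '\0' instead of NULL. The function name is hypothetical.

```c
#include <string.h>

/* The terminator is the character '\0', not the pointer constant NULL,
   so strlen stops after the three visible characters. */
size_t terminated_length(void)
{
    char string[] = {'H', 'i', '!', '\0'};
    return strlen(string);
}
```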
C23 adds nullptr, a third way to define NULL, confusing the situation further. Don't use it, only use (void*)0.
If we are going to be super pedantic (let's!), the representations of types in C are mostly platform defined. This means that an implementation can represent numbers any way it wants. A platform may decide that zero represented in an int is all bits set to zero except the sign bit, which is set to 1. This is perhaps just trivia, but it shows that any time you assume the bit representation, you are strictly speaking not writing portable code. An example where this does come into play is IEEE 754 floating point, where if you set only the sign bit you get -0. -0 equals 0, but does not have the same bit representation. (Comparing two floats is not the same as comparing the bit representations of two floats: -0 and 0 are equal, but two identical NaNs are unequal.)
Because zero initialization isn't NULL, it causes issues when you initialize the memory of a struct with zeros. This:
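The difference between comparing values and comparing representations can be sketched like this. It assumes a platform with IEEE 754 floats and a math.h that provides the NAN macro. The function names are hypothetical.

```c
#include <math.h>   /* NAN */
#include <string.h> /* memcmp */

/* Value comparison: on IEEE 754, 0 and -0 compare as equal. */
int zeros_compare_equal(void)
{
    float positive = 0.0f, negative = -0.0f;
    return positive == negative;
}

/* Representation comparison: on IEEE 754 the sign bit differs,
   so the bytes of 0 and -0 are not identical. */
int zeros_share_representation(void)
{
    float positive = 0.0f, negative = -0.0f;
    return memcmp(&positive, &negative, sizeof positive) == 0;
}

/* A NaN never compares equal to anything, not even to an identical NaN. */
int nan_equals_itself(void)
{
    float a = NAN, b = NAN;
    return a == b;
}
```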
struct my_struct s = {0};
Sets the memory to zero, not to NULL! In C23 there was an attempt to address this by adding the feature of NULL initialization:
struct my_struct s = {};
In C23 this initializes all members to zero, and all pointers to NULL. This is a death-trap. If anyone compiles this with a compiler that does not support this feature you will get entirely uninitialized memory, without any warnings or error! Stay away from this feature, and if possible add tooling to detect accidental use of this.
In general I think that using braces to initialize structs is bad practice, and should never be done. Consider:
struct my_struct s = {1, 5, NULL};
This is incredibly unclear code. Three members are being initialized, but what do they do? What if someone adds a new member, or changes their order? This is incredibly fragile code that depends on the programmer always being right about the content of a struct. C99 adds the ability to designate members:
struct my_struct s = {.member = 1, .other = 5, .pointer = NULL};
This is much better, but raises the language requirement and results in long lines of code. A much simpler way to initialize them is to explicitly set the values:
struct my_struct s;
s.member = 1;
s.other = 5;
s.pointer = NULL;
Yes, this is much more verbose, but it has the advantage of being clear. The greatest advantage of this approach comes with the process for writing it: simply take the struct definition, paste it into the code where you want to initialize the struct, and then edit it to be an initialization. That way you can be sure you get all members, that you spell them correctly (this may be more of an advantage for some than others...), and you have the type right there so that you can make sure it gets initialized with the right type.
To reiterate:
struct my_struct s;
memset(&s, 0, sizeof s);
Is not a portable way to set pointer members of the struct to NULL!
(Footnote: I always typedef in the keyword struct, so you won't find it in any variable definitions in my code. I only used it here to clarify that we are initializing structs.)
So setting a pointer's memory to zeros may, on some platforms, not set it to NULL, but these platforms are rare. Many programmers would argue "It works on all the machines I care about", and while that in general is a very precarious way to program C that will get you into all kinds of trouble not covered in this article, it is not the main reason not to initialize memory with zeros.
When you make a mistake you want that mistake to be as obvious as possible, and you want it to stick out like a sore thumb. 0x0 is a very common value both for pointers and other variables, and therefore it does not clearly stand out as an uninitialized value. You want something that is as recognizable as possible. Many compilers (in debug mode) therefore initialize memory with a sentinel magic number. Visual Studio uses 0xCD or 0xCC, and other platforms use hexspeak like 0xDEADBEEF. If your platform doesn't have this you can easily implement it yourself:
#ifdef DEBUG_MODE
void *debug_malloc(size_t size)
{
void *p;
p = malloc(size);
if(p != NULL) /* malloc can fail, never memset a NULL pointer */
memset(p, 0xCD, size);
return p;
}
#define malloc(x) debug_malloc(x)
#endif
If you read or write to a pointer set to 0xCDCDCDCDCDCDCDCD, your program will crash and it will be very obvious that the problem is an uninitialized value. Because you can wrap malloc and give the memory a recognizable value, it is much safer to use than memory on the stack, where you can't automatically initialize it. This is another reason why stack memory is more error prone than allocated memory.
If a pointer is initialized to 0x00000000 it is much less likely to crash, because it's likely that some form of null check will stop it from crashing. Isn't that a good thing? No! You want your code to work because it's correct, not because it accidentally worked! Just because the error doesn't fail right away doesn't mean that there isn't an issue, it just means the issue is harder to find!
Let's imagine we want to write a linked list, with a pool of links that we allocate up front and an API for retrieving links and returning them. Then we are going to add a bug to this code, and discuss how zero initialization makes this bug far harder to find.
typedef struct{
void *data;
void *next;
}Link;
void free_link(Link *l)
{
l->data = NULL;
}
Link *alloc_link(Link *link_array, unsigned int link_array_length, void *data)
{
unsigned int i;
for(i = 0; i < link_array_length; i++)
{
if(link_array[i].data == NULL)
{
link_array[i].data = data;
/* link_array[i].next = NULL; OOPS, this line was accidentally lost! */
return &link_array[i];
}
}
return NULL;
}
void do_something(void)
{
Link *link_array;
unsigned int i, link_array_length = 1024;
link_array = calloc(link_array_length, sizeof *link_array);
for(i = 0; i < link_array_length; i++)
link_array[i].data = NULL;
.... use alloc_link and free_link to do stuff with data ...
}
If link_array is initialized to garbage, the first use of a next pointer will crash. However, if the next pointers are initialized to NULL, they will be caught by null checks. Only when alloc_link is called and returns a previously used link will it return a non-null next member. That non-null next member will point to a valid link in the linked list! This makes the bug incredibly hard to find, because the user is going to assume that either it works, or that the bug is in the code using the code above and not in the code itself. Since the initialization only happens at the first use, the bug may show up only on second use. It's very easy to write test code for something like this where the code passes, because the test never uses the memory enough to reuse links. This is a good example of how, by mitigating a simple bug, you make more complex bugs significantly harder to find.
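A sketch of a test that does reuse a link, and therefore exposes the bug, might look like the following. It repeats the buggy pool code so that it stands alone. The test function name is hypothetical, and a two-link pool on the stack is used only to keep the example short.

```c
#include <stddef.h>

typedef struct{
    void *data;
    void *next;
}Link;

void free_link(Link *l)
{
    l->data = NULL;
}

/* Same bug as in the article: next is never reset on allocation. */
Link *alloc_link(Link *link_array, unsigned int link_array_length, void *data)
{
    unsigned int i;
    for(i = 0; i < link_array_length; i++)
    {
        if(link_array[i].data == NULL)
        {
            link_array[i].data = data;
            return &link_array[i];
        }
    }
    return NULL;
}

/* Returns 1 if a reused link comes back with a stale next pointer. */
int reused_link_has_stale_next(void)
{
    Link pool[2];
    Link *first, *second;
    int payload = 0;
    pool[0].data = NULL;
    pool[0].next = NULL;
    pool[1].data = NULL;
    pool[1].next = NULL;
    first = alloc_link(pool, 2, &payload);
    second = alloc_link(pool, 2, &payload);
    first->next = second;                  /* build a two link list */
    free_link(first);                      /* return the first link */
    first = alloc_link(pool, 2, &payload); /* the same slot is handed out again */
    return first->next != NULL;            /* still points at the old neighbour */
}
```

A test that only allocates each link once would never reach the last line with a stale pointer, which is exactly why such tests pass.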
Essentially, I strongly discourage the use of calloc, or memsetting memory to zero at allocation, for this reason.
calloc has one advantage over malloc: it is able to detect overflows when you multiply the size of the type by the number of elements you want to allocate. But this is a very rare issue in comparison. If it is an issue for you, write a wrapper around calloc that uses calloc to allocate, and then uses memset to initialize the memory to something other than zero.
There are, in rare cases, performance gains to be had by "pre-priming" a struct using memset. Consider:
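The wrapper described above could be sketched roughly like this. The name debug_calloc and the 0xCD fill value are assumptions chosen to match the earlier debug_malloc example, and the sketch relies on calloc returning NULL when the multiplication would overflow.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper: let calloc do the overflow-checked multiplication,
   then overwrite its zeros with a recognizable pattern. */
void *debug_calloc(size_t count, size_t size)
{
    void *p;
    p = calloc(count, size); /* expected to fail if count * size overflows */
    if(p != NULL)
        memset(p, 0xCD, count * size); /* calloc succeeded, so the product fits in size_t */
    return p;
}
```

Memory obtained this way is released with free, as usual.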
typedef struct{
int a;
short b;
char c;
}MyStruct;
void my_function(MyStruct *s)
{
memset(s, 0xCD, sizeof *s);
s->a = 1;
s->b = 2;
s->c = 3;
}
Here, (on most common platforms) the struct's members are 4, 2, and 1 bytes respectively. The implementation will add an eighth byte of padding. Without the memset, many compilers will generate three instructions to initialize the struct, writing each of the members individually. The compiler is careful not to change the content of the padding. With the added memset, the compiler knows the content of the entire struct, even the padding, and can therefore replace the three instructions with a single 64-bit write instruction that is much faster.
It's bad, worse than you think. There are some arguments about what you can do with uninitialized values/memory according to the standard. It's not clear that every use of uninitialized values is UB, but I would also argue there are no uses for uninitialized values that are clearly not UB.
The standard uses the term "indeterminate state" to describe the value of uninitialized values. "Indeterminate state" does not just mean the value can be any combination of zeros and ones, it can also be values that are not expressible using zeros and ones! It can be "trap representations", values that, when used, cause execution to trap. They can also be values that do things that normal values can't do, essentially UB.
One such real world instance is "wobbly values". Consider this code:
int *p, a, b;
p = malloc(sizeof *p);
a = *p;
b = *p;
if(a != b)
printf("WTF!");
There are platforms (in common use) where "WTF!" could be printed. Let's dig into why. When you call malloc on a modern computer, the OS has to do two things: it has to allocate an address range for the allocation, and it has to allocate the memory pages needed to back it. Allocating all the memory pages may be a slow process, and the program may not immediately need all the memory. So modern operating systems may only allocate the address range, allocate some or none of the pages needed, and then allocate more pages as the application starts using the memory. When the program executes the line "a = *p;" the OS needs to provide a memory page to read from. However, it doesn't have to assign this memory page to the program, because the application has not yet stored anything in the page. If we then assume that the OS task switches between the assignment of "a" and "b", another program may need a memory page. Since the program has yet to be assigned the memory page, the OS is free to give the page to the other application. Once the task switches back to the original program and executes the assignment of "b", it again has to find an available memory page to read from. The program is then given another uninitialized memory page, and "b" can therefore be assigned a different value than "a".
A more common issue with reading uninitialized values is that, since they are UB, the compiler may assume they will never happen. Consider:
int x;
if(a)
x = 0;
if(b)
func(x);
Here a compiler may reason that if "b" is true, "a" must also be true, otherwise "x" would be uninitialized and the use of "x" would be UB. It may therefore transform the code into:
int x;
if(a)
{
x = 0;
if(b)
func(x);
}
This kind of UB has caused problems when some programs have tried reading uninitialized memory in order to collect entropy to generate good secure random numbers. Compilers have then optimized away the reading of the uninitialized values, with catastrophic consequences.
Finally, let's remember that just because you say you want to set a value doesn't mean the compiler has to. As we saw in the example where a memset followed by three struct member assignments was turned into a single 64-bit instruction, compilers are able to optimize things away. The main issue with this is if you want to erase a value from memory. Consider this:
{
char password[64];
... /* store a password in the array, and use it for secret stuff */
memset(password, 0, sizeof password); /* erase password from memory */
}
Here the compiler can remove the memset, because it can conclude that there is no point in setting a variable that isn't read after it is set. To get around this issue there are various versions of memset supported by compilers, and C23 adds "memset_explicit". You can also implement your own secure memset by using volatile.
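A minimal sketch of the volatile approach mentioned above. The name secure_memset is hypothetical. Writing through a volatile-qualified pointer is a widely used technique for keeping the stores from being optimized away, but treat it as a sketch rather than a guarantee, and prefer a platform-provided function where one exists.

```c
#include <stddef.h>

/* Hypothetical helper: each byte is written through a volatile-qualified
   pointer, so the compiler is expected to keep every store. */
void secure_memset(void *memory, int value, size_t length)
{
    volatile unsigned char *p = memory;
    size_t i;
    for(i = 0; i < length; i++)
        p[i] = (unsigned char)value;
}
```

In the password example it would replace the final memset: secure_memset(password, 0, sizeof password);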