The C way to handle text

C/C++ text handling

The C and C++ languages use two different approaches to handle text. We will start by learning the C way to do it.

Text and numbers

As you know, computers can only store and manipulate numbers. However, it is up to us to decide what these numbers represent. In order to represent text, we assign a number to each letter (and to each character in general). If we decided that 1 meant a, 2 meant b and so on, then the sequence of numbers '1 2 3 4 5 6' could be translated as "abcdef". On most systems, the C and C++ standard libraries use a table called ASCII to assign numerical values to characters. According to this table, the number 65 represents the character 'A' and 97 means 'a'. The ASCII table only defines 128 characters, numbered from 0 to 127. In order to handle non-English characters, we have to use other approaches (we will see that later).

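The variable type used to represent characters is char (char stands for character). Why? Well, a char is at least 8 bits wide, so it can always hold the values from 0 to 127, which covers every code defined in the ASCII table. Whether char is signed (typically the range [-128, 127]) or unsigned (typically [0, 255]) depends on the compiler and the platform.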

Single character

To get the ASCII code (value) of a character, we put it between apostrophes. That means that 'B' is equivalent to the number 66 (which is the ASCII code for 'B'). Example:

    #include <iostream>

    int main()
    {
        char letter = 'h'; // The variable letter is set to 104.
        std::cout << letter << std::endl; // The letter 'h' is printed in the console.

        int asciiCode = letter;
        std::cout << asciiCode << std::endl; // The number 104 is printed in the console.

        return 0;
    }

The object std::cout interprets the values inside variables of type char as ASCII codes representing characters instead of displaying their numerical values.

We now know how to handle single characters. We are almost there.

Strings of characters

In C, text is stored in an array of type char that ends with the ASCII character number 0 (the character that has the numerical value 0). The character number 0 marks the end of a string of characters and is often called the null character. An array of characters is called a string, and an array of characters that ends with the null character is called a null-terminated string. The C standard library uses null-terminated strings to represent text, which is why they are sometimes called C-style strings. That means that the following text: "Good morning", which is made of 12 characters (the space counts as a character), requires an array of char of size 13 (12 characters + the null character) to be stored.

Defining a null-terminated string

To define a string, we put the text inside double quotes. Doing so creates a constant array of char filled with the characters of the text, followed by the null character. This constant array can be used to initialize an array of char. Example:

    #include <iostream>

    int main()
    {
        char str[] = "Hello"; // The array str is filled with the ASCII values of the characters "Hello"
                              // plus the null character. Its size is 6.
        std::cout << str << std::endl; // "Hello" is written in the console.

        return 0;
    }

It is possible to do the same with a pointer. In this case, the null-terminated string is created somewhere in memory and the pointer is set to point to it. Example:

    #include <iostream>

    int main()
    {
        char * str = "Hello"; // str points to the null-terminated string "Hello" stored somewhere in the memory.
        std::cout << str << std::endl; // "Hello" is written in the console.

        return 0;
    }

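In C, the pointer initialization above is fine, but in C++, it gives a warning, and since C++11 it is not valid at all, so some compilers reject it with an error. The reason is that the quotes create a constant null-terminated string. Therefore, the pointer should be a pointer to constant characters to be allowed to point to it. Older C++ standards tolerated the conversion for compatibility with C, which is why many compilers still let it go and only warn you. We can solve that by declaring the pointed characters constant: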

    const char * str = "Hello";

Comparing two strings

With char arrays, we cannot directly compare two strings using the == operator.

    #include <iostream>

    int main()
    {
        char word[] = "sky";
        char word2[] = "sky";

        if(word == word2)
        {
            std::cout << "The two strings are equal." << std::endl; // This line is not printed.
        }

        return 0;
    }

In the example above, the expression word == word2 evaluates to false. The reason is that, in this expression, each array is converted to the address of its first element. So the expression word == word2 could be translated as (address in memory of word) == (address in memory of word2). Since they are two different arrays, their addresses are different, so the result is false.

The reason giving a char array directly to std::cout works is that when the object std::cout receives a pointer to char (an address), it iterates through the pointed characters and prints them until it finds the character number 0 (the null character).

The way to test whether two strings are equal, in C, is to compare the characters of each array one by one, or to use a function that does it.

Here is an example of a function that compares two null-terminated strings:

    #include <iostream>

    bool areEqual(const char *str, const char *str2)
    {
        unsigned c = 0;

        while(str[c] != 0 && str2[c] != 0) // As long as the end of none of the strings is reached.
        {
            if(str[c] != str2[c]) // If a character is different, both strings are different.
                return false;

            c++;
        }

        return str[c] == str2[c]; // If both characters at index c equals 0, both strings are equal.
    }

    int main()
    {
        char word[] = "sky";
        char word2[] = "sky";

        if(areEqual(word, word2))
        {
            std::cout << "The two strings are equal." << std::endl; // This line is printed.
        }

        return 0;
    }