Sunday, December 10, 2017

First Code

Let's jump into some code. I'll be spewing terms right and left here. I'll deal with some quickly, just enough detail to get by, but defer others for later discussion.

I'm also going to be a bit wordy, saying the same things in different ways so that you pick up the terminology and the different ways people express these things. Forgive me if I beat a concept to death. As you'll see, many terms also get reused for different things in a mix of formal and informal usage. That's why terminology can be so confusing.

And for everything I say, there are exceptions, arguments, counter-arguments, and many more details. I'll address some of those in later posts. For now just bear with me so we don't get too far off into the weeds.

For every characteristic of a given language, there are those who think it's great, and those who think it's terrible. Good or bad, some things are certainly a source of confusion. I take a neutral approach. A language is a tool, it is what it is. Understand the pros and cons and bear them in mind when using it.

An important point to remember is that there are always multiple ways to do something. From the formatting of source code to data structures, design, and organization of a program, you have virtually infinite choices. These rapidly escalate into religious wars. Some choices are worth arguing over. Some aren't.

The C Language

C is a high-level language (as opposed to a low-level language), which means it is a human-readable text language where source code specifies the instructions for how a program runs. However, computers don't execute high-level code; they execute binary machine instructions, also known as machine code. At some point, high-level source code needs to be translated to machine code so that it can be executed.

This brings up the critically important concept of abstraction, which will show up in many ways in this series. It takes multiple machine instructions to carry out each source instruction.

High-level languages elevate your thinking to a higher level of abstraction, allowing you to abstract the low-level details. This is a huge benefit, because instead of having to think about all the tiny details of machine instructions, you can focus on the higher-level concepts of your program logic.

Contrast this to assembly language, which is a human-readable low-level language. Assembly language statements translate one for one to machine code, so you have to deal with all those low-level details.

C is a compiled language (as opposed to an interpreted language), which means a software tool called a compiler translates source code into machine code, also called object code. The compiler compiles source code instructions and generates object code corresponding to their logic.

Each source file produces an object file. These files are often referred to simply as sources and objects (but don't confuse this use of the term object with its use in object-oriented programming (OOP)).

A tool called a linker then links your objects with objects from pre-built runtime libraries to produce the final complete program. The result is an executable, also known simply as a binary.

This set of tools and libraries makes up the toolchain, and is specific to the type of system where you'll be running the binary. That's why there are separate versions of programs for Mac and for Windows.

Once you've built the binary with this build process, you can run it any number of times without having to run it through the toolchain again. You only have to rebuild if you make a change to the source.

You can distribute the binary to other people who have the same type of system without having to give them your source code. They don't need to have the toolchain to be able to use the binary.

C is a statically-typed language (as opposed to a dynamically-typed language). Static means constant, fixed, unchanging; dynamic means varying, changing. You have to declare each item of data in a type declaration that tells the compiler what type it is (integer numeric, floating point numeric, character string, etc.) before you can use it, and you can only store that type of data in it. That's the static part: the type is fixed ahead of time.

C has rules about what words are reserved as part of the language, known as keywords, how you can name your own things as user-defined names, punctuation, and how to form statements. This is the syntax of the language, just as grammar and spelling rules are the syntax of spoken languages.

Those statements have some particular meaning and cause the program to behave in a particular way. This is the semantics of the language, just as the meaning and implications of sentences are the semantics of spoken languages.

And like spoken languages, you can construct statements that are syntactically correct, but semantically incorrect, such as saying, "The sky is fast." That's a perfectly legal sentence, but it doesn't make any sense. Similarly, you can write code that is legal, but doesn't do what you want.

To make a program that does what you want, you have to write code that is both syntactically correct, and semantically correct. The compiler will tell you if you make syntax errors, so you correct the code and try again, but once you have correct syntax, it can't tell you anything about the semantics. You have to run the program and test it to tell if you got the semantics right.

That's the real challenge of software development. The compiler will quickly help you find and correct syntax errors. But a complex piece of software can have many behaviors, and testing and verifying them can be as much work as writing it in the first place.

Further complicating things, while there's only one set of correct behaviors that you want it to do, there's an infinite variety of random incorrect things it can do if the semantics are wrong. When it does strange and unexpected things, you have to figure it out so you can correct the semantics. This can be time-consuming and frustrating. "Well, I know what I wanted it to do, but what did it actually do? And why?"

The classic book The C Programming Language by Kernighan and Ritchie (known as K&R) established the tradition of the "hello, world" program before getting into slightly more complex examples. I'll buck that tradition by skipping right to the latter. Like their examples, this provides the framework to start presenting lots of details.

K&R described the initial version of the C language, defining the original syntax and semantics. Over the years, the language has been changed to improve and standardize it. The current version as of this writing is known as C11, for the 2011 standard. The previous version was C99.

You need to know which standard your compiler supports so that you write the code it will understand. Some language version differences are minor, making the syntax a little more convenient, and some are major, changing the way you do things. I'll use C11 here, but I have a tendency to use older style when I don't think about it, and you'll run into code written that way out in the real world.

Source Control

You can find all the source code for this series in a public GitHub repository at https://github.com/sdbranam/learntocode. GitHub is an online version control system (VCS), a place that stores source code, also known as a source control system. It can store multiple versions of the code. There are other VCSs besides GitHub, which is actually an online hosting service built on git.

A VCS has two main purposes: protecting code against loss, and sharing code among multiple developers. Source code can be lost two ways: by deleting a file (losing the entire file), or by changing its contents (losing that particular version of the code).

By storing multiple versions, a VCS allows any version to be recovered. It also allows tracing specific changes to specific developers, so you can tell who did what to the code, known as the annotate or blame function.

Sharing allows multiple people to work on code, or distribution of code from the authors to others, as I'm doing here. Settings on the repository, known as a repo, control who is allowed to see its contents (you might want it to be private within your company), and who is allowed to make changes to it (you might only want authorized developers to be able to change it).

This is my repo, that I've made publicly accessible. Anyone can create their own public repo for free, and I'll show you how to do that later, since you'll want to have one for working on this series. A public repo is also a good way to showcase your work to others.

The Program

This is file printargs.c, containing the entire program:

 1  /*
 2   * Printargs prints the list of arguments from the command line.
 3   * It returns EXIT_SUCCESS if at least one argument was specified,
 4   * or EXIT_FAILURE if no arguments were specified.
 5   *
 6   * 2017 Steve Branam <sdbranam@gmail.com> learntocode
 7   */
 8
 9  #include <stdio.h>
10  #include <stdlib.h>
11
12  int main(int argc, char *argv[])
13  {
14      int i;
15
16      if (argc < 2) {
17          printf("Usage: %s <arguments>\n"
18                 "Prints command line arguments.\n",
19                 argv[0]);
20          return EXIT_FAILURE;
21      }
22      else {
23          for (i = 0; i < argc; ++i) {
24              printf("%d: %s\n", i, argv[i]);
25          }
26      }
27      return EXIT_SUCCESS;
28  }

Note the different colors. This is called syntax highlighting, which helps in visually navigating through code. Many source code editors do this, along with other convenience features that help speed up working on code.

I used the online source code formatter http://hilite.me/ to produce this listing, with CSS: "border:solid gray;border-width:1px 1px 1px 1px;padding:.1em .2em;" and Style: "emacs" (although it appears the blog settings override the 1px border). I ran the output of that through a crude Python script to extract specific lines below.

The way to read code is to scan visually looking for the blocks, to get a feel for the overall shape. And when I say shape, I mean that literally. The way it's spaced out and indented is a visual guide to its logic.

Spacing and indentation help you follow that structure. Code without spacing or indentation is very hard to follow. It's like trying to read a book where all the text is jammed together. Paragraphs help break up the page. Spacing and indentation help break up the source code.

The C compiler ignores spacing and indentation. It simply reads through the file, skipping over them as it parses the statements. With a couple of exceptions that I'll cover below, line boundaries are irrelevant, so you can arrange the code any way you want.

Identify the boundaries between blocks, then pick the ones to dig into further. That's more of that abstraction, allowing you to focus on the bigger things before you get into the finer details, like seeing the forest before you see the trees.

The first blocks to look for are the functions. These are the modular building blocks of code. They provide the overall separation of logic into manageable chunks.

Deciding how to divide things up into those chunks is a major part of the art of coding. It's one of those things that can be difficult to describe, but you know it when you see it. When you do see it, look for more by that person. Good code, like good art or good music, is something to be appreciated and emulated.

There are many guidelines for what constitutes good structuring of functions. For now, think in terms of division of labor. Rather than one big chunk that does it all from start to finish, divide up the work, like delegating a big job to a team of workers, each with their own responsibility, with some of them providing helper services that the others can use.

And just as it's a bad idea to overload an individual worker with too much, it's a bad idea to make a function too long. Break up the work into smaller functions that the larger function can call.

Just like a large team of workers that needs to be organized into a hierarchy of different levels that depend on each other to carry out their responsibilities, organize functions into a hierarchy where they depend on each other to get their work done.

A top level function depends on the next level of functions for the overall program, and those functions depend on a third level of functions for their responsibility, and so on, as deeply nested as necessary. That's how you manage complexity in real software.

Now I'll break down the different sections in the file, known as snippets. This is how I'll go through code throughout this series. I'll go into excruciating gory detail on this one because there are so many concepts to introduce.

 1  /*
 2   * Printargs prints the list of arguments from the command line.
 3   * It returns EXIT_SUCCESS if at least one argument was specified,
 4   * or EXIT_FAILURE if no arguments were specified.
 5   *
 6   * 2017 Steve Branam <sdbranam@gmail.com> learntocode
 7   */

Lines 1-7: a comment, free-form text that helps the reader understand the code. The comment is delimited by the /* and */ markers. Everything between these comment delimiters is ignored by the compiler.

Another style of comment delimiter is the double slash //. Everything from the double slash to the end of the line is a comment. This is one place where line boundaries mean something to the compiler.

This particular comment describes the program. It also lists author information. A brief header comment like this at the top of a file is a big help to readers sifting through files.

 9  #include <stdio.h>
10  #include <stdlib.h>

Lines 9-10: preprocessor directives that direct the preprocessor to include two system header files at this point in the file. These are also known as system headers, header files, or simply headers. File inclusion is a way to pull other code into the file, breaking things up into modular parts.

These headers are from the standard library, which contains predefined code required to make your source a complete program. The headers themselves contain various declarations, including forward declarations that tell the compiler about the functions in the library.

File stdio.h contains declarations for the standard input/output (I/O) functions. File stdlib.h contains declarations for general utility functions and various constants, fixed data values that don't change as the program runs (i.e. they remain constant).

Preprocessor directives are the other place where line boundaries are significant. The preprocessor is actually an initial stage of the compiler that processes the source code text before compiling, executing directives as it finds them.

Each preprocessor directive takes one line, although that can be extended by putting a backslash \ at the end of the line to form a multi-line directive.

12  int main(int argc, char *argv[])
13  {
14      int i;
15
16      if (argc < 2) {
17          printf("Usage: %s <arguments>\n"
18                 "Prints command line arguments.\n",
19                 argv[0]);
20          return EXIT_FAILURE;
21      }
22      else {
23          for (i = 0; i < argc; ++i) {
24              printf("%d: %s\n", i, argv[i]);
25          }
26      }
27      return EXIT_SUCCESS;
28  }

Lines 12-28: the main function, the top level function in the function call hierarchy. C requires one function in the program to be named main(), which defines the program entry point. This is where the program starts when you run it. Note that I use the function name with empty parentheses when I refer to it informally.

You can define other functions with any name you want as long as they conform to the C naming syntax.

C encloses things in braces {}. They delimit the function body itself, and blocks within the function. Any number of lines may appear in a block.

There are places where the braces aren't required, when a block contains only one statement, but I put them in anyway because a common source of bugs is expanding a one-statement block into a multi-statement block and forgetting to add the braces.

Look at the shape of the function. You can see its outline and the shape of the blocks inside, hinting at its logical flow. After identifying the blocks, look at the details inside them. This one is pretty simple, but others can get more complex, with blocks nested within blocks.

Just as the functions work in layers, the code inside them does. As with the hierarchical layers of functions, these are layers of logic, like peeling back the layers of an onion.

But unlike layers of functions that can nest to any depth, you don't want to have too many layers within a single function, or it can be difficult to follow. Code that's difficult to follow has a higher likelihood of bugs (or you end up adding bugs when you try to change it).

Now let's dig down a level into the function.

12  int main(int argc, char *argv[])

Line 12: the function declaration, telling the compiler what the function's call interface is. This is how other code must call the function to invoke it.

The function main() has an interface that's predefined by C, but other functions allow you to specify the interface yourself.

The items separated by commas in the parentheses () are the function parameters. These are a type of variable that holds the arguments passed into the function when it is called.

Each parameter is identified by its data type, shown in bold green, and its name, shown in black. The asterisk * and square brackets [] indicate characteristics of the arguments. The asterisk means pointer. The brackets mean array. I'll cover pointers later; they're often a source of confusion, but once you understand how to visualize them, they're easy.

A data type specifies what kind of values are used. The int type means integer numbers, such as 0, 1, and -2. The char type means a character, as in a letter, digit, or punctuation mark.

An array is a contiguous block of elements of the same data type. An array of characters forms a character string, or simply string.

Because these types are predefined by the language, they are known as primitive types. You can use these as building blocks to define your own user-defined types.

In the case of main(), the parameters are the command line arguments that are passed to the program itself when you run it. This is one way to get information from the outside world into the program.

The argc parameter is the argument count, and argv is the argument vector that contains the list of all the arguments, including the program name. Vector is another term for an array. Thus argv is an array of pointers to characters.

Because of the way strings work in C, a pointer to a character is often interpreted as a pointer to a whole string of characters, not just a single character. I'll cover that more when I talk about pointers. But that means argv is an array of pointers to strings.

Once you've built the program, typing "printargs hello world" on the command line results in running the program and calling main() with argc containing the value 3, and argv containing the strings "printargs", "hello", and "world". Notice that the first string is the name of the program as it appeared on the command line.

What about that very first int on line 12? That's the function return type, indicating the type of value the function returns to whatever called it. So just as the parameters had a data type and name, the function has a data type and name.

Since main() is called by the operating system (OS), through some extra layers that we won't worry about now, the int return type means that main() returns an integer value to the OS. Since this is the value the program returns when it exits, this is called the exit code.

The term caller refers to whatever code is calling the function, and the term callee refers to the function being called. The caller calls the function with specific arguments. The function, as callee, receives those specific argument values in its parameters, and returns a value to the caller. The caller may use the return value in some way.

Some functions exist purely to produce a value to return to the caller, such as square(x), which computes the mathematical square of x and returns it. The value produced is the primary purpose of the function, and any work performed producing it is a consequence of achieving that goal.

Other functions are intended to do some sort of work, and then return a result indicating the status of the work. The work performed is the primary purpose of the function, and the value returned from it is merely a report of what happened.

The first type of function is closer to the mathematical concept of functions. In its purest form, such a function has no side effects, meaning it does not affect anything else. The only output of the function is the return value, which is based solely on its input parameters. There is an entire field of functional programming based on this.

The second type of function is intended to produce side effects. Such a function is intended to affect other things in the program or the outside world. In this case, main() has the side effect of printing something out. The value returned by main() indicates the success or failure of the program.

A return value that is intended as a status indicator is often referred to as a status code. Some status codes simply indicate a binary status, "true" or "false", "yes" or "no", "success" or "failure". Other status codes may convey more detailed information, often used to discriminate various errors, such as "success", "failure, bad filename", "failure, full disk", or "failure, not authorized".

It's also possible to have a function that doesn't return a value. The side effects of running the function are the only thing that happens. In this case, the return type is declared as void, so this is known as a void function.

What line 12 means is "Main is a function that accepts an integer argument count and an array of character pointers to argument strings, and returns an integer." Everything in line 12, except for the parameter names, forms the function signature. That is, the signature consists of the function name, its parameter types, and its return type.

Saying a signature out loud is a mouthful, so in informal usage you just use the name. But when you write code that calls the function, you have to know the precise signature so that you call the function the right way.

14      int i;
15
16      if (argc < 2) {
17          printf("Usage: %s <arguments>\n"
18                 "Prints command line arguments.\n",
19                 argv[0]);
20          return EXIT_FAILURE;
21      }
22      else {
23          for (i = 0; i < argc; ++i) {
24              printf("%d: %s\n", i, argv[i]);
25          }
26      }

Lines 14-26: the function body, everything enclosed in the braces that follow the signature. This defines the function. It's where the work of the function gets done.

This function uses two control structures, an if-else decision in lines 16-26, and a for-loop in lines 23-25, nested in the else block of the decision. Control structures direct the flow of control of execution.

An if-else decision checks some condition, in this case whether the argument count is less than 2, and does something based on the result. It goes one way if the condition is true, and the other way if the condition is false.

If there's nothing to do when the condition is false, you can omit the else portion. You use a simple if decision, which only does something if the condition is true.

A for-loop repeats the block it contains for some number of times. Each repetition cycle is known as an iteration. It is therefore often used for iterating through something, cycling through it. Iterating through the elements of an array is a common use of for-loops.

Line 14 is another variable of type int, named simply i. This one is a local variable, a variable that is local to the function; it exists only within the scope of the function. After the function returns, it no longer exists and no longer has a value.

This line is both a variable definition and a variable declaration. It declares the type and name of the variable, and defines the memory for it.

The function parameters are also local variables, the difference being that their values are set by the arguments that are passed in.

What exactly is a variable? It's a small portion of memory that contains a value that can change over time based on what the program does. The fact that it can change is what makes it a variable.

C uses call by value, meaning that it passes the values of things into function parameters. But pointers provide a way to call by reference.

The name "i" is very simple and doesn't convey much meaning. However, it's common to use single-letter names for local variables used as simple for-loop controls. For other variables, used in more complex ways, it's better to use more descriptive names.

Line 16 tests the value of argc to see if it's less than 2. If so, it calls library function printf() to print a formatted message. This function was forward-declared in stdio.h, so the compiler knows its signature and can check that I used it correctly (syntactically, not necessarily semantically).

The arguments to printf() are a string that describes the format of the message, and the data values to be formatted.

In this case, the string is a hard-coded constant, meaning the actual string is coded right there where it's used. It's delimited by double-quotes. The \n at the end of each line is an escape sequence that contains a control character called newline. Newline causes a new line to be started in the program output. The %s is a conversion specification that shows where the value of another string should be substituted into the output; the process of substituting values for markers in a string is called string substitution.

There's another subtle thing going on with this function call. Notice that the arguments in a function call are separated by commas. But the comma is missing after the first string in line 17. This is a syntactic convenience called string concatenation, where the compiler joins adjacent string literals, those separated only by spaces or line breaks rather than commas or semicolons, into a single string. This allows you to break up long strings in the source code for readability. So lines 17 and 18 contain only a single string, the first argument to printf().

The second argument, argv[0], is the first element (i.e. the first entry) of the argv array. The square brackets [] contain the index of the element, which is 0. You might think that the first one would be 1, but C uses 0-based indexing. It's like the years in a century; the first year of a century is the 0 year, such as 1900 or 2000.

Recall that argv was declared to be an array of pointers to strings. The first element is therefore a single pointer to a string, so it matches up with the %s conversion specification.

If you run the program with just the name on the command line, no arguments, argc will be 1. When line 16 checks that argc is less than 2, the condition will be true, and the function will execute the block in lines 17-21. The printf() will print out:

Usage: printargs <arguments>
Prints command line arguments.

A usage message like this is a common way to inform the user that they didn't supply all the command line arguments expected, or that the arguments were in some way unacceptable.

After printing that message, line 20 will return from the function, with the value EXIT_FAILURE. This is a symbolic constant that was defined in stdlib.h. It's symbolic because we don't know its actual value here, all we know is a symbolic name that's been given to it. This indicates the program completed with some kind of error.

If you run the program with additional arguments on the command line, the condition in line 16 will be false, and the function will execute the else block in lines 22-26. This consists of the for-loop in lines 23-25.

The for-loop iterates through the items in array argv, printing each one with printf().

The for-loop uses i as the control variable, which it also uses as the index into the array. The for statement has three control expressions in the parentheses, separated by semicolons, that control how it runs:
  • The loop initialization, executed once before starting the loop, here initializing i to 0.
  • The loop condition, executed at the beginning of each cycle, here checking that i is less than argc.
  • The loop update, executed at the end of each cycle, here pre-incrementing i by 1.
As long as the condition is true, the loop keeps executing. Here, with i starting at 0 and incrementing on each iteration, it will execute until i reaches whatever count is in argc.

You can have an empty initialization expression, if the condition that is being checked is already initialized before the for-loop. You can have an empty update expression, if the condition that is being checked is updated within the loop.

The format string for the printf() in line 24 has a %d conversion specification, which means to substitute a decimal integer value, and a %s for a string. The remaining printf() arguments are the array index, and the array value at that index. So the printf() prints out a number and a string.

It's important to be aware of how the 0-based indexing relates to the specific check in the for-loop condition. Otherwise the loop may not execute enough times, or may execute one time too many. This is a common source of off-by-one bugs.

Incorrect control expressions can also cause dead loops, that never run through any iterations, or infinite loops, that never end.

The easiest way to figure this out is to step through the iterations yourself, remembering that this update expression increments i after every cycle. If the command line is "printargs hello world", argc will be 3. Therefore:
  • On the first iteration, i will be 0, so the condition is true, and it will print "0: printargs".
  • On the second iteration, i will be 1, so the condition is true, and it will print "1: hello".
  • On the third iteration, i will be 2, so the condition is true, and it will print "2: world".
  • On the fourth iteration, i will be 3, so the condition is false, and the loop terminates.
A simple way to model this on paper is with a table that steps through the index values I, the actual values used in the condition C, the condition result R (t for true, f for false), and the resulting value V represented by that iteration:
I   C       R   V
0   0 < 3   t   printargs
1   1 < 3   t   hello
2   2 < 3   t   world
3   3 < 3   f
Drawing things out like this and stepping through the code yourself is a great way to work out the details, even on a simple example, so that you get the initial conditions and termination conditions right. It's even more helpful when the initialization is something other than 0, or the condition or resulting value is more complex.

Notice also that i isn't used anywhere except in the for-loop, yet I declared it at the top of the function, where any parts of the function could access it (maybe when they shouldn't). A reader might reasonably wonder why I did it that way.

This is one of those cases where my old-version C habits take over when I'm not thinking. That was a requirement of old C. The new hotness, C99 and later, allows i to be declared where it's used. So I could have put it right in the for statement:

for (int i = 0; i < argc; ++i) {

That limits the scope of code that can access it, and also makes it clear that this simply-named variable is just the loop control, not used for anything else. That's just one of the subtleties of coding to minimize the potential for errors and maximize understanding, especially as a function gets complex.

The moral here is not only to keep up to date on language versions, but to remember to take advantage of them!

27      return EXIT_SUCCESS;

Line 27: if the function reaches this point, it returns EXIT_SUCCESS, indicating it successfully printed out the arguments.

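It's important to remember that since you declared the function as returning a value, you must make sure that every possible return from the function actually does return a value. It's possible for the function to "run off the end" and return without explicitly returning a value. If the caller then uses the result, the behavior is undefined; in practice the caller typically gets back whatever value happened to be left lying around, which is effectively random. (The one exception is main itself: since C99, running off the end of main is equivalent to returning 0.)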

This can be a nasty type of bug, because sometimes that random value might be acceptable to the caller as a valid return value, even though it has no relation to what the function actually did. This can cause very mystifying behavior.

For a function that returns a value, always put an explicit return statement at the end of the function. For a void function, which doesn't have a return value, you can simply let the function run off the end and return implicitly; you can also use a return statement with no value at the end of the function, but that's considered redundant and unnecessary.

C also allows you to have different return points in a function (for a void function, these would be return statements without values). Some people like to code that way, as I did here, with two return statements. Others prefer to have only one return statement at the end of a function, using a local variable to keep track of the value to return.
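Here's a rough sketch of the two styles side by side. Both function names are hypothetical, and I'm assuming for illustration that the failure path tests argc < 2 and returns EXIT_FAILURE.

```c
#include <stdlib.h>

/* Style 1: multiple return points. Return as soon as the answer is known. */
int check_args_early(int argc)
{
    if (argc < 2) {
        return EXIT_FAILURE;    /* assumed failure code */
    }
    return EXIT_SUCCESS;
}

/* Style 2: a single return point. A local variable tracks the value
 * to return, and the only return statement is at the end. */
int check_args_single(int argc)
{
    int status = EXIT_SUCCESS;

    if (argc < 2) {
        status = EXIT_FAILURE;  /* assumed failure code */
    }
    return status;
}
```

Both functions return the same value for any argc. Which one reads better is one of those choices people argue over.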

That's an awful lot to talk about a mere 28 lines of code. But now you're armed with a lot of terminology that will make getting through subsequent code faster. Some of the concepts may be a little shaky, but they'll firm up as we proceed.

Building And Running The Program

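The toolchain I'm using is GCC, the GNU Compiler Collection (the name originally stood for GNU C Compiler). The gcc command both compiles and links the program. It comes with most Linux systems. On Mac OS, the gcc command installed with the developer command line tools is actually Apple's Clang/LLVM compiler, which accepts the same basic options, as the version output below shows.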

There are also free online tools that allow you to build and run C code (and other languages) on a server. These are sandboxed environments that allow you to play around with code without any risk of affecting anyone else. These are useful if you're using a Chromebook or an OS that's not set up for software development.

One example of such an online IDE (Integrated Development Environment) is CodingGround. Some include complete online courses, and some include integration with GitHub. My experience with them shows that they're a great way to practice coding, but some can be a bit buggy, resulting in lost coding sessions or other problems. So don't rely on them to save your code.

That's another good reason to create an account on GitHub, which is free to use for public and open source projects, and create your own public repo like my learntocode repo. You can upload and download your files, or even edit them directly on GitHub. If you use an online tool that doesn't have GitHub integration, you can copy-and-paste from a GitHub window into the tool window.

1  $ gcc -v
2  Configured with: --prefix=/Library/Developer/CommandLineTools/usr
3   --with-gxx-include-dir=/usr/include/c++/4.2.1
4  Apple LLVM version 7.0.2 (clang-700.1.81)
5  Target: x86_64-apple-darwin15.3.0
6  Thread model: posix
7
8  $ gcc printargs.c -o printargs

Lines 1-6: show the version of gcc. I'm building and running on a Mac, so the compiler and standard library themselves are built to run under Mac OS X on an x86 processor, generating code that will run under Mac OS X on an x86 processor.

Line 8: the build command. This is about the simplest possible build command, directing gcc to compile source file printargs.c and output an executable in file printargs. Builds can get quite complex, allowing you to construct software from a number of parts.

If gcc finds a syntax error, it prints it out along with the line number, and won't produce an executable. It may also print warnings, which indicate things that are at risk of being an error. If there are warnings but no errors, it will go ahead and produce an executable.

Depending on the severity of an error, the compilation may end prematurely. But the compiler will try to get as far as it can, reporting as many errors as it can.

That's both good and bad. It's nice to get all the errors at once so you can fix them all. But some errors can have a cascading effect that causes the compiler to report many other things as errors because incorrect syntax has thrown it off. With a little experience, you'll learn to pick out the real errors quickly.

Sometimes error messages are obscure. The compiler may be trying to report a very technical issue with your code, but it's not clear what the error means, so you don't know how to fix it. Googling error messages is a useful way to get help interpreting them. You're probably not the first person who had that problem.

Once you have a successful build, indicated by the absence of error messages, you can run it. Here's where it's useful to start thinking about test cases. Because you want to make sure your code works, right? You'll feel stupid if you let someone else run it and it doesn't work properly.

In fact, you should think about test cases before you even write the code, so that you write the code in a way that makes it easy to test. Testability affects how you design the code.

How many possible paths are there through the code? Ideally, you should run the program in a way that exercises each one to test it. That's easy for a simple program like this.

For more complex software, however, that becomes a big job, and it can be difficult to achieve that ideal. Just identifying all the paths can be tricky. Then coming up with inputs that guarantee you cover all those paths is even trickier.

It's further complicated by the fact that some paths are meant to handle error conditions that are hard to produce. And what if the code has poor testability? These issues get off into the whole art of testing.

For this program there are two test cases, going through the two possible paths of the if-else decision:
  1. You run the program with no additional arguments.
  2. You run the program with additional arguments.
There's no need to worry about differentiating between 1 additional argument, 2 additional arguments, 3, etc. They all generalize to the single case "with additional arguments". That simplifies testing so you don't have to keep going with 98 additional arguments, 99 additional arguments, 100...

You can identify test cases by the different control structures in the code, the various decisions and loops that it contains. These create a combinatorial set of possible execution paths. That set gets large quickly, a problem known as combinatorial explosion, because real code has lots of control structures to deal with the various combinations of inputs its many functions may handle.

What about the for-loop in this program? Well, we can see logically that the loop gets executed only if there are additional arguments. Testing loops often includes test cases where the loop has nothing to do. In this program, we can see that such a scenario is impossible. So all the possible loop test cases are already covered by the if-else test cases.

Coming up with a suitable set of test cases is very much an art form. An incomplete set of cases risks missing a bug that then shows up when someone else uses the code. But redundant test cases just waste time without telling you anything further.

I'll cover more about testing later, because it's an important part of being an RPSD. If you do a poor job of testing, it can have consequences for your career. It can also have consequences for the people who depend on your code. It becomes a matter of being ethical and responsible.

If you think that's overblown, think about the problems caused by software failures you've experienced or heard about in the news. The security breaches, the software crashes, the system failures, and the inconvenience, aggravation, frustration, and misery heaped on people's lives as a result.

I'll be covering more about testing in later posts, but for a separate presentation on it, see Testing Is How You Avoid Looking Stupid.

I can exercise the two test cases for this program by running it with and without additional arguments:

1  $ printargs
2  Usage: printargs <arguments>
3  Prints command line arguments.
4
5  $ printargs hello world everywhere
6  0: printargs
7  1: hello
8  2: world
9  3: everywhere

Lines 1-3: test case 1. As expected, the program prints the usage message, referring to the program name correctly.

Lines 5-9: test case 2. As expected, the program prints the program name followed by each of the three additional arguments that were on the command line, with the correct 0-based indices. It would have been sufficient to use just one extra argument.

The process of manually running each test case like this and examining the results is called manual testing. Manually testing a simple program is pretty easy, but even that can get awfully tedious if something goes wrong and you keep having to repeat the tests as you chase down and correct the bugs. That's especially true if you have to carefully scrutinize every line being printed out to check for an unexpected result.

An RPSD quickly moves from manual testing to automated testing, using some form of test automation. That creates an efficient workflow and gives you a way to repeat the testing later without having to remember all the cases. But that's a topic for another post.
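To give a taste of what automation might look like, here's a minimal sketch. It assumes the printing logic has been pulled out into a function that writes into a buffer instead of straight to the screen, so a test can inspect the result. The names format_args and run_tests are hypothetical, as is the EXIT_FAILURE return for the usage case; none of this is in printargs.c.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical refactoring of the program's logic: same text, but
 * written into a caller-supplied buffer (size must be at least 1). */
int format_args(char *buf, size_t size, int argc, char *argv[])
{
    size_t used = 0;

    buf[0] = '\0';
    if (argc < 2) {
        snprintf(buf, size,
                 "Usage: %s <arguments>\nPrints command line arguments.\n",
                 argv[0]);
        return EXIT_FAILURE;    /* assumed failure code */
    }
    for (int i = 0; i < argc && used < size; ++i) {
        int n = snprintf(buf + used, size - used, "%d: %s\n", i, argv[i]);
        if (n < 0) {
            break;              /* formatting error: stop early */
        }
        used += (size_t)n;      /* output is truncated if the buffer fills */
    }
    return EXIT_SUCCESS;
}

/* Runs the two test cases. Returns the number of failed cases. */
int run_tests(void)
{
    char buf[256];
    int failures = 0;

    /* Test case 1: no additional arguments. */
    char *no_args[] = { "printargs", NULL };
    if (format_args(buf, sizeof buf, 1, no_args) != EXIT_FAILURE ||
        strncmp(buf, "Usage: printargs", 16) != 0) {
        ++failures;
    }

    /* Test case 2: with additional arguments (one is sufficient). */
    char *with_args[] = { "printargs", "hello", NULL };
    if (format_args(buf, sizeof buf, 2, with_args) != EXIT_SUCCESS ||
        strcmp(buf, "0: printargs\n1: hello\n") != 0) {
        ++failures;
    }
    return failures;
}
```

A test program's main could simply call run_tests() and return EXIT_SUCCESS when it reports 0 failures. The point is that the checking is done by code, not by my eyeballs.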

For now, I've proven to myself that the code works as expected.
