Tokenize A String In C With ‘strtok’

C is not known for making string manipulation easy. Languages such as Perl or PHP handle strings more comfortably because they have built-in string types and manage the memory for you. However, the C standard library does have a few functions that make string manipulation a lot easier, and strtok is one of them. strtok is used to tokenize, or split, a string on one or more user-specified delimiter characters. This can be handy for parsing simple comma-separated value (CSV) files, or any other consistently delimited strings or files.

The strtok function is unlike most other C functions in that its first call takes the string as an argument, but subsequent calls take NULL in its place. This is because strtok keeps internal state between calls: it remembers where the previous token ended and resumes from that position whenever it is passed NULL. How is strtok defined?
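One consequence of that behind-the-scenes state is that only one tokenization can be in progress at a time: starting a second strtok pass inside the loop of the first overwrites the saved position. POSIX systems provide strtok_r, which takes an extra char ** argument where the caller keeps that position instead. The sketch below assumes a POSIX platform (strtok_r is not part of standard C); the print_pairs helper and its "key=value;key=value" input format are made up purely for illustration.

```c
#define _POSIX_C_SOURCE 200809L  /* assumption: POSIX system; exposes strtok_r */
#include <stdio.h>
#include <string.h>

// Hypothetical helper: prints each "key=value" pair found in buf
// (pairs separated by ';') and returns how many complete pairs it saw.
// buf is modified in place, just as it would be with strtok.
int print_pairs(char *buf)
{
  char *outer_state;  // saved position for the ';' split
  char *inner_state;  // saved position for the '=' split
  int pairs = 0;

  for (char *pair = strtok_r(buf, ";", &outer_state);
       pair != NULL;
       pair = strtok_r(NULL, ";", &outer_state)) {
    // A second, independent tokenization running inside the first one
    char *key = strtok_r(pair, "=", &inner_state);
    char *value = strtok_r(NULL, "=", &inner_state);

    if (key != NULL && value != NULL) {
      printf("%s -> %s\n", key, value);
      pairs++;
    }
  }

  return pairs;
}
```

Called on a writable buffer such as char buf[] = "a=1;b=2";, this should print one line per pair and return 2. Writing the same nested loops with plain strtok would not work, because the inner calls would replace the position the outer loop depends on.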

C ‘strtok’ Definition

strtok is defined as:

// Header file to include
#include <string.h>
 
// Function prototype
char *strtok(char *str, const char *delim);

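As shown above, strtok takes two arguments. The first argument, “str”, is the string we would like to tokenize, or break into pieces at our delimiters. The second argument, “delim”, is a string containing every character strtok should split on; each character in it is treated as a separate delimiter, so it may hold a single character or several. strtok returns a pointer to the next token, or NULL when no tokens remain. This may sound somewhat confusing, so here is an example: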
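One detail the prototype hints at: “str” is a plain char *, not a const char *, because strtok writes into it. Each time it finds the end of a token it overwrites the delimiter there with a '\0'. That means a string literal must not be passed directly, and the original text is no longer intact afterwards. A minimal sketch of working on a private copy instead, where count_tokens is a hypothetical helper name and not part of the standard library:

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: counts the tokens in text without altering it.
// strtok overwrites delimiters with '\0', so we tokenize a heap copy.
size_t count_tokens(const char *text, const char *delims)
{
  size_t len = strlen(text) + 1;  // include the terminating '\0'
  char *copy = malloc(len);
  size_t count = 0;

  if (copy == NULL)
    return 0;                     // out of memory: report no tokens
  memcpy(copy, text, len);

  for (char *tok = strtok(copy, delims); tok != NULL; tok = strtok(NULL, delims))
    count++;

  free(copy);
  return count;
}
```

With this approach count_tokens("a,b,c", ",") can be called on a literal safely, at the cost of one allocation per call.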

Example Of Using ‘strtok’ On A C String

#include <stdio.h>
#include <string.h>
 
// Delimiter values
#define DELIM " ,.-+"
 
int main (int argc, char **argv)
{
  char str_to_tokenize[] = "- Strtok is meant for - breaking up, strings with funny values. + 5";
  char *str_ptr;
 
  fprintf(stdout, "Split \"%s\" into tokens:\n", str_to_tokenize);
 
  str_ptr = strtok(str_to_tokenize, DELIM);
  for(; str_ptr != NULL ;){
    fprintf(stdout, "%s\n", str_ptr);
    str_ptr = strtok(NULL, DELIM);
  }
 
  return 0;
}

Output from running the above example program:

$ ./token 
Split "- Strtok is meant for - breaking up, strings with funny values. + 5" into tokens:
Strtok
is
meant
for
breaking
up
strings
with
funny
values
5

A few things to note. The DELIM #define is a string of characters I tell strtok to split on. The string I tokenized contained spaces, dashes, commas, a period and a plus sign, and strtok used all of them to break up my string, as shown by the output above. Runs of adjacent delimiters, such as the leading “- ” or the “, ” after “up”, are treated as a single separator, so no empty tokens appear. For comma-separated (CSV) data your delimiter would most likely be just a comma, but keep in mind that empty fields are skipped for the same reason. Another thing to note is the last statement inside the for loop, where I call strtok(NULL, DELIM). The NULL is very important: behind the scenes strtok holds a pointer to the remainder of the string, and NULL tells it to continue from there on each iteration. Once strtok runs out of tokens it returns NULL, which is what I used as the condition in the for loop.
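Because strtok collapses runs of delimiters, a line such as "a,,c" yields two tokens, not three fields. The sketch below contrasts the two behaviours. count_with_strtok counts what strtok returns, while count_fields is a hypothetical alternative that does not use strtok at all and simply counts separator characters with strchr, so empty fields are included. Both names are invented for this example, and neither handles quoted CSV fields.

```c
#include <string.h>

// Counts tokens the way strtok sees them. buf is modified in place.
size_t count_with_strtok(char *buf, const char *delims)
{
  size_t n = 0;

  for (char *tok = strtok(buf, delims); tok != NULL; tok = strtok(NULL, delims))
    n++;
  return n;
}

// Hypothetical alternative: counts fields separated by a single
// character, keeping empty ones. Assumes sep is not '\0'.
size_t count_fields(const char *line, char sep)
{
  size_t n = 1;  // a line with no separators is still one field

  for (const char *p = strchr(line, sep); p != NULL; p = strchr(p + 1, sep))
    n++;
  return n;
}
```

For the input "a,,c", count_with_strtok reports 2 and count_fields reports 3, which is worth checking before relying on strtok for data where blank columns matter.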

Rather than simply printing out each token, you could process it on the spot or store it in an array, whatever needs to be done with each piece of data. Now you don’t have to be afraid of string parsing in C!
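As a rough sketch of the "store it in an array" option, the helper below collects token pointers into a caller-supplied array. The name split_tokens and its parameters are assumptions made for this example. Note that the stored pointers point into the original buffer, so that buffer has to stay alive and unmodified for as long as the tokens are in use.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper: stores pointers to at most max tokens in out[]
// and returns how many were stored. str is modified by strtok, and the
// pointers in out[] refer to positions inside str.
size_t split_tokens(char *str, const char *delims, char **out, size_t max)
{
  size_t n = 0;
  char *tok = strtok(str, delims);

  while (tok != NULL && n < max) {
    out[n++] = tok;
    tok = strtok(NULL, delims);
  }

  return n;
}
```

A caller might declare char line[] = "red,green,blue"; and char *parts[4];, then call split_tokens(line, ",", parts, 4) to get 3 back, with parts[0] through parts[2] pointing at "red", "green" and "blue". Tokens beyond max are simply left uncollected.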
