4cpp Lexing Library

§1 Introduction
§2 Lexer Library

§1 Introduction

This is the documentation for the 4cpp lexer version 1.1. The documentation is the newest piece of this lexer project so it may still have problems. What is here should be correct and mostly complete.

If you have questions or discover errors, please contact editor@4coder.net. To get help from community members, you can post on the 4coder forums hosted on handmade.network at 4coder.handmade.network.

§2 Lexer Library

§2.1 Lexer Intro

The 4cpp lexer system provides a polished, fast, flexible system that takes in C/C++ and outputs a tokenization of the text data. There are two API levels. One level is set up to let you easily get a tokenization of the file. This level manages memory for you with malloc to make it as fast as possible to start getting your tokens. The second level enables deep integration by allowing control over allocation, data chunking, and output rate.

To use the quick setup API you simply include 4cpp_lexer.h and read the documentation at cpp_lex_file.

To use the fancier API include 4cpp_lexer.h and read the documentation at cpp_lex_step. If you want to be absolutely sure you are not including malloc into your program you can define FCPP_FORBID_MALLOC before the include and the "step" API will continue to work.

There are a few more features in 4cpp that are not documented yet. You are free to try to use these, but I am not totally sure they are ready yet, and when they are they will be documented.

§2.2 Lexer Function List

cpp_get_token
cpp_lex_step
cpp_lex_data_init
cpp_lex_data_temp_size
cpp_lex_data_temp_read
cpp_lex_data_new_temp_DEP
cpp_get_relex_range
cpp_relex_init
cpp_relex_start_position
cpp_relex_declare_first_chunk_position
cpp_relex_is_start_chunk
cpp_relex_step
cpp_relex_get_new_count
cpp_relex_complete
cpp_relex_abort
cpp_make_token_array
cpp_free_token_array
cpp_resize_token_array
cpp_lex_file

§2.4 Lexer Function Descriptions

§2.4.1: cpp_get_token

Cpp_Get_Token_Result cpp_get_token(Cpp_Token_Array *token_array_in,
int32_t pos
)

Parameters

token_array_in

The array of tokens from which to get a token.

pos

The position, measured in bytes, to get the token for.

Return

A Cpp_Get_Token_Result struct is returned containing the index of a token and a flag indicating whether the pos is contained in the token or in whitespace after the token.

Description

This call performs a binary search over all of the tokens looking for the token that contains the specified position. If the position is in whitespace between the tokens, the returned token index is the index of the token immediately before the provided position. The returned index can be -1 if the position is before the first token.
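
The search described above can be sketched roughly as follows. This is only an illustration, not the library's implementation: Sketch_Token, Sketch_Get_Token_Result, and sketch_get_token are hypothetical stand-ins that mirror the documented fields, and reporting in_whitespace as 1 when the position is before the first token is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins mirroring the documented fields of Cpp_Token
   and Cpp_Get_Token_Result. The real types come from 4cpp_lexer.h. */
typedef struct { int32_t start; int32_t size; } Sketch_Token;
typedef struct { int32_t token_index; int32_t in_whitespace; } Sketch_Get_Token_Result;

/* Binary search for the last token whose start is <= pos.
   Tokens must be sorted by start, as lexer output is. */
Sketch_Get_Token_Result
sketch_get_token(const Sketch_Token *tokens, int32_t count, int32_t pos){
    /* Assumption: a position before the first token is reported as whitespace. */
    Sketch_Get_Token_Result result = { -1, 1 };
    int32_t first = 0, last = count;   /* search window is [first, last) */
    assert(count >= 0);
    while (first < last){
        int32_t mid = first + (last - first)/2;
        if (tokens[mid].start <= pos){
            first = mid + 1;           /* answer is mid or something after it */
        }
        else{
            last = mid;                /* answer is strictly before mid */
        }
    }
    result.token_index = first - 1;    /* -1 when pos is before every token */
    if (result.token_index >= 0){
        const Sketch_Token *token = &tokens[result.token_index];
        result.in_whitespace = (pos >= token->start + token->size);
    }
    return result;
}
```

With tokens covering bytes [0,3), [4,5) and [6,8), position 3 yields index 0 with in_whitespace set, and position 4 yields index 1 with in_whitespace clear.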

See Also

Cpp_Get_Token_Result

§2.4.2: cpp_lex_step

Cpp_Lex_Result cpp_lex_step(Cpp_Lex_Data *S_ptr,
char *chunk,
int32_t size,
int32_t full_size,
Cpp_Token_Array *token_array_out,
int32_t max_tokens_out
)

Parameters

S_ptr

The lexer state. Go to the Cpp_Lex_Data section to see how to initialize the state.

chunk

The first or next chunk of the file being lexed.

size

The number of bytes in the chunk including the null terminator if the chunk ends in a null terminator. If the chunk ends in a null terminator the system will interpret it as the end of the file.

full_size

If the final chunk is not null terminated this parameter should specify the length of the file in bytes. To rely on an eventual null terminator use HAS_NULL_TERM for this parameter.

token_array_out

The token array structure that will receive the tokens output by the lexer.

max_tokens_out

The maximum number of tokens to be output to the token array. To rely on the max built into the token array pass NO_OUT_LIMIT here.

Description

This call is the primary interface of the lexing system. It is quite general so it can be used in a lot of different ways. I will explain the general rules first, and then give some examples of common ways it might be used.

First a lexing state, Cpp_Lex_Data, must be initialized. The file to lex must be read into N contiguous chunks of memory. An output Cpp_Token_Array must be allocated and initialized with the appropriate count and max_count values. Then each chunk of the file must be passed to cpp_lex_step in order using the same lexing state for each call. Every time a call to cpp_lex_step returns LexResult_NeedChunk, the next call to cpp_lex_step should use the next chunk. If the return is some other value, the lexer hasn't finished with the current chunk and it stopped for some other reason, so the same chunk should be used again in the next call.

If the file chunks contain a null terminator the lexer will return LexResult_Finished when it finds this character. At this point calling the lexer again with the same state will result in an error. If you do not have a null terminated chunk to end the file, you may instead pass the exact size in bytes of the entire file to the full_size parameter and it will automatically handle the termination of the lexing state when it has read that many bytes. If a full_size is specified and the system terminates for having seen that many bytes, it will return LexResult_Finished. If a full_size is specified and a null character is read before the total number of bytes have been read the system will still terminate as usual and return LexResult_Finished.

If the system has filled the entire output array it will return LexResult_NeedTokenMemory. When this happens, if you want to continue lexing the file, you can grow the token array or switch to a new output array, and then call cpp_lex_step again with the chunk that was being lexed and the new output. You can also specify a max_tokens_out, which limits how many new tokens will be added to the token array. Even if token_array_out still has space to hold more tokens, if the max_tokens_out limit is hit, the lexer will stop and return LexResult_HitTokenLimit. If this happens there is still space left in the token array, so you can resume simply by calling cpp_lex_step again with the same chunk and the same output array. Also note that, unlike the chunks, which must only be replaced when the system says it needs a chunk, you may switch to or modify the output array in between calls as much as you like.

The most basic use of this system is to get it all done in one big chunk and try to allocate a nearly "infinite" output array so that it will not run out of memory. This way you can get the entire job done in one call and then just assert to make sure it returns LexResult_Finished to you:

Cpp_Token_Array lex_file(char *file_name){
    File_Data file = read_whole_file(file_name);
    
    char *temp = (char*)malloc(4096); // hopefully big enough
    Cpp_Lex_Data lex_state = cpp_lex_data_init(temp); 
    
    Cpp_Token_Array array = {0};
    array.tokens = (Cpp_Token*)malloc(1 << 20); // hopefully big enough
    array.max_count = (1 << 20)/sizeof(Cpp_Token);
    
    Cpp_Lex_Result result = 
        cpp_lex_step(&lex_state, file.data, file.size, file.size,
                     &array, NO_OUT_LIMIT);
    Assert(result == LexResult_Finished);
    
    free(temp);
    
    return(array);
}
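
When the file arrives in several chunks, the calling rules above turn into a small driver loop. The sketch below shows only the shape of that loop. It does not call the real library: toy_lex_step is a made-up stand-in that treats every non-space byte as a one-byte token, and the Toy_ names are hypothetical. Only the result values match the ones documented for Cpp_Lex_Result.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Same numeric values as the documented Cpp_Lex_Result. */
typedef enum {
    Toy_Finished = 0, Toy_NeedChunk = 1,
    Toy_NeedTokenMemory = 2, Toy_HitTokenLimit = 3
} Toy_Result;

typedef struct { int32_t start, size; } Toy_Token;
typedef struct { Toy_Token *tokens; int32_t count, max_count; } Toy_Array;
typedef struct { int32_t chunk_pos, file_pos; } Toy_State;

/* Toy stand-in for cpp_lex_step. It only imitates the return protocol:
   a null byte ends the file and a full output array pauses the lexer. */
Toy_Result toy_lex_step(Toy_State *s, const char *chunk, int32_t size, Toy_Array *out){
    while (s->chunk_pos < size){
        char c = chunk[s->chunk_pos];
        if (c == 0) return Toy_Finished;
        if (c != ' ' && c != '\n'){
            if (out->count == out->max_count) return Toy_NeedTokenMemory;
            out->tokens[out->count].start = s->file_pos;
            out->tokens[out->count].size = 1;
            out->count++;
        }
        s->chunk_pos++;
        s->file_pos++;
    }
    s->chunk_pos = 0;   /* the next call begins a new chunk */
    return Toy_NeedChunk;
}

/* Driver: move to the next chunk only on NeedChunk, grow the output on
   NeedTokenMemory, stop on Finished. Any other result (such as
   HitTokenLimit) retries the same chunk with the same array.
   Returns the token count, or -1 on failure. */
int32_t lex_chunks(const char **chunks, const int32_t *sizes,
                   int32_t chunk_count, Toy_Array *out){
    Toy_State state = {0, 0};
    int32_t i = 0;
    assert(out != NULL);
    while (i < chunk_count){
        Toy_Result r = toy_lex_step(&state, chunks[i], sizes[i], out);
        if (r == Toy_Finished) return out->count;
        if (r == Toy_NeedChunk){ i++; continue; }
        if (r == Toy_NeedTokenMemory){
            int32_t new_max = out->max_count*2 + 1;
            Toy_Token *grown =
                (Toy_Token*)realloc(out->tokens, sizeof(Toy_Token)*(size_t)new_max);
            if (grown == NULL) return -1;
            out->tokens = grown;
            out->max_count = new_max;
        }
    }
    return -1;   /* ran out of chunks before the null terminator */
}
```

With the real API the same loop structure would apply, with cpp_lex_step in place of the toy function and the remaining parameters (full_size, max_tokens_out) passed as documented above.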

See Also

Cpp_Lex_Data

Cpp_Lex_Result

§2.4.3: cpp_lex_data_init

Cpp_Lex_Data cpp_lex_data_init(
)

Return

A brand new lex state ready to begin lexing a file from the beginning.

Description

Creates a new lex state in the form of a Cpp_Lex_Data struct and returns the struct. The system needs a temporary buffer that is as long as the longest token. 4096 is usually enough but the buffer is not checked, so to be 100% bulletproof it has to be the same length as the file being lexed.

§2.4.4: cpp_lex_data_temp_size

int32_t cpp_lex_data_temp_size(Cpp_Lex_Data *lex_data
)

Parameters

lex_data

The lex state from which to get the temporary buffer size.

Description

This call gets the current size of the temporary buffer in the lexer state so that you can move to a new temporary buffer by copying the data over.

See Also

cpp_lex_data_temp_read

cpp_lex_data_new_temp_DEP

§2.4.5: cpp_lex_data_temp_read

void cpp_lex_data_temp_read(Cpp_Lex_Data *lex_data,
char *out_buffer
)

Parameters

lex_data

The lex state from which to read the temporary buffer.

out_buffer

The buffer into which the contents of the temporary buffer will be written. The size of the buffer must be at least the size as returned by cpp_lex_data_temp_size.

Description

This call reads the current contents of the temporary buffer.

See Also

cpp_lex_data_temp_size

cpp_lex_data_new_temp_DEP

§2.4.6: cpp_lex_data_new_temp_DEP

void cpp_lex_data_new_temp_DEP(Cpp_Lex_Data *lex_data,
char *new_buffer
)

§2.4.7: cpp_get_relex_range

Cpp_Relex_Range cpp_get_relex_range(Cpp_Token_Array *array,
int32_t start_pos,
int32_t end_pos
)

Parameters

array

A pointer to the token array that will be modified by the relex. This array should already contain the tokens for the previous state of the file.

start_pos

The start position of the edited region of the file. The start and end points are based on the edited region of the file before the edit.

end_pos

The end position of the edited region of the file. In particular, end_pos is the first character after the edited region that is not affected by the edit. Thus if the edited region contained one character, end_pos - start_pos should equal 1. The start and end points are based on the edited region of the file before the edit.

§2.4.8: cpp_relex_init

Cpp_Relex_Data cpp_relex_init(Cpp_Token_Array *array,
int32_t start_pos,
int32_t end_pos,
int32_t character_shift_amount
)

Parameters

array

A pointer to the token array that will be modified by the relex. This array should already contain the tokens for the previous state of the file.

start_pos

The start position of the edited region of the file. The start and end points are based on the edited region of the file before the edit.

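end_pos

The end position of the edited region of the file, that is, the first character after the edited region that is not affected by the edit. The start and end points are based on the edited region of the file before the edit.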

character_shift_amount

The shift in the characters after the edited region.

Return

Returns a partially initialized relex state.

Description

This call does the first setup step of initializing a relex state. To finish initializing the relex state you must tell the state about the positioning of the first chunk it will be fed. There are two methods of doing this. The direct method is with cpp_relex_declare_first_chunk_position. The method that is often more convenient is with cpp_relex_is_start_chunk. If the file is not chunked, the second step of initialization can be skipped.
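
As a small worked example of the edit parameters, the sketch below computes a shift amount from an edit. It assumes that character_shift_amount is the number of bytes inserted minus the number of bytes removed, which is how I read "the shift in the characters after the edited region". Edit_Range and edit_shift_amount are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one edit: the bytes [start_pos, end_pos)
   of the old file are replaced by new_length bytes of new text. */
typedef struct {
    int32_t start_pos;
    int32_t end_pos;
    int32_t new_length;
} Edit_Range;

/* Assumption: characters after the edit move by
   (bytes inserted) - (bytes removed). */
int32_t edit_shift_amount(Edit_Range edit){
    assert(edit.start_pos <= edit.end_pos);
    return edit.new_length - (edit.end_pos - edit.start_pos);
}
```

For instance, replacing the three bytes at positions 10 through 12 with five new bytes gives start_pos = 10, end_pos = 13, and a shift of 2. Deleting a single byte gives a shift of -1.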

See Also

cpp_relex_declare_first_chunk_position

cpp_relex_is_start_chunk

§2.4.9: cpp_relex_start_position

int32_t cpp_relex_start_position(Cpp_Relex_Data *S_ptr
)

Parameters

S_ptr

Return

Returns the first position in the file the relexer wants to read. This is usually a position slightly earlier than the start_pos provided as the edit range.

Description

After doing the first stage of initialization this call is useful for figuring out what chunk of the file to feed to the lexer first. It should be a chunk that contains the position returned by this call.

See Also

cpp_relex_init

cpp_relex_declare_first_chunk_position

§2.4.10: cpp_relex_declare_first_chunk_position

void cpp_relex_declare_first_chunk_position(Cpp_Relex_Data *S_ptr,
int32_t position
)

Parameters

S_ptr

position

The start position of the first chunk that will be fed to the relex process.

Description

To initialize the relex system completely, the system needs to know how the characters in the first chunk line up with the file's absolute layout. This call declares where the first chunk's start position is in the absolute file layout, and the system infers the alignment from that. For this method to work the starting position of the relexing needs to be inside the first chunk. To get the relexer's starting position call cpp_relex_start_position.
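
If your file is stored as a list of chunk sizes, finding the chunk that contains the relexer's starting position is a short loop. The helper below is a sketch under that assumption. find_chunk_for_position is a hypothetical name and is not part of the library.

```c
#include <assert.h>
#include <stdint.h>

/* Given chunk sizes in file order, finds the chunk that holds byte pos.
   Returns the chunk index and writes that chunk's absolute start position
   to chunk_start_out. Returns -1 if pos lies past the last chunk. */
int32_t find_chunk_for_position(const int32_t *chunk_sizes, int32_t chunk_count,
                                int32_t pos, int32_t *chunk_start_out){
    int32_t chunk_start = 0;
    int32_t i;
    assert(pos >= 0);
    for (i = 0; i < chunk_count; ++i){
        if (pos < chunk_start + chunk_sizes[i]){
            *chunk_start_out = chunk_start;
            return i;
        }
        chunk_start += chunk_sizes[i];
    }
    return -1;
}
```

The position passed in would be the value returned by cpp_relex_start_position, and the start written to chunk_start_out is what would be handed to cpp_relex_declare_first_chunk_position.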

See Also

cpp_relex_init

cpp_relex_start_position

§2.4.11: cpp_relex_is_start_chunk

int32_t cpp_relex_is_start_chunk(Cpp_Relex_Data *S_ptr,
char *chunk,
int32_t chunk_size
)

Parameters

S_ptr

chunk

The chunk to check.

chunk_size

The size of the chunk to check.

Return

Returns non-zero if the passed in chunk should be used as the first chunk for lexing.

Description

With this method, once a state is initialized, each chunk can be fed in one after the other in the order they appear in the absolute file layout. When this call returns non-zero it means that the chunk that was passed in on that call should be used in the first call to cpp_relex_step. If, after trying all of the chunks, they all return zero, pass in NULL for chunk and 0 for chunk_size to tell the system that all possible chunks have already been tried, and then use those values again in the one and only call to cpp_relex_step.

See Also

cpp_relex_init

§2.4.12: cpp_relex_step

Cpp_Lex_Result cpp_relex_step(Cpp_Relex_Data *S_ptr,
char *chunk,
int32_t chunk_size,
int32_t full_size,
Cpp_Token_Array *array,
Cpp_Token_Array *relex_array
)

Parameters

S_ptr

A pointer to a fully initialized relex state.

chunk

A chunk of the edited file being relexed.

chunk_size

The size of the current chunk.

full_size

The full size of the edited file.

array

A pointer to a token array that contained the original tokens before the edit.

relex_array

A pointer to a token array for spare space. The capacity of the relex_array determines how far the relex process can go. If it runs out, the process can be continued if the same relex_array is extended without losing the tokens it contains. To get an appropriate capacity for relex_array, you can get the range of tokens that the relex operation is likely to traverse by looking at the result from cpp_get_relex_range.

Description

When a file has already been lexed, and then it is edited in a small local way, rather than lexing the new file all over again, cpp_relex_step can try to find just the range of tokens that need to be updated and fix them in.

First the relex state must be initialized (cpp_relex_init). Then one or more calls to cpp_relex_step will start editing the array and filling out the relex_array. The return value of cpp_relex_step indicates whether the relex was successful or was interrupted and, if it was interrupted, what the system needs to resume.

LexResult_Finished indicates that the relex engine finished successfully.

LexResult_NeedChunk indicates that the system needs the next chunk of the file.

LexResult_NeedTokenMemory indicates that the relex_array has reached capacity, and that it needs to be extended if it is going to continue. Sometimes in this case it is better to stop and just lex the entire file normally, because there are a few cases where a small local change affects a long range of the lexer's output.

The relex operation can be closed in one of two ways. If the LexResult_Finished value has been returned by this call, then to complete the edits to the array make sure the original array has enough capacity to store the final result by calling cpp_relex_get_new_count. Then the operation can be finished successfully by calling cpp_relex_complete.

Whether or not the relex process finished with LexResult_Finished, the process can be finished by calling cpp_relex_abort, which puts the array back into its original state. No close is necessary if getting the original array state back is not necessary.

See Also

cpp_relex_init

cpp_get_relex_range

Cpp_Lex_Result

cpp_relex_get_new_count

cpp_relex_complete

cpp_relex_abort

§2.4.13: cpp_relex_get_new_count

int32_t cpp_relex_get_new_count(Cpp_Relex_Data *S_ptr,
int32_t current_count,
Cpp_Token_Array *relex_array
)

Parameters

S_ptr

A pointer to a state that has gone through cpp_relex_step with a LexResult_Finished return.

current_count

The count of tokens in the original array before the edit.

relex_array

The relex_array that was used in the cpp_relex_step call/calls.

Description

After getting a LexResult_Finished from cpp_relex_step, this call can be used to get the size the new array will have. If the original array doesn't have enough capacity to store the new array, its capacity should be increased before passing to cpp_relex_complete.

§2.4.14: cpp_relex_complete

void cpp_relex_complete(Cpp_Relex_Data *S_ptr,
Cpp_Token_Array *array,
Cpp_Token_Array *relex_array
)

Parameters

S_ptr

A pointer to a state that has gone through cpp_relex_step with a LexResult_Finished return.

array

The original array being edited by cpp_relex_step calls.

relex_array

The relex_array that was filled by cpp_relex_step.

Description

After getting a LexResult_Finished from cpp_relex_step, and ensuring that array has a large enough capacity by calling cpp_relex_get_new_count, this call does the necessary replacement of tokens in the array to make it match the new file.
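
To give an idea of what "replacement of tokens" involves, here is a rough sketch of splicing a run of new tokens into an array. It is not the library's code and it assumes you already know which token range is being replaced. The Sketch_ types, spliced_count, and splice_tokens are hypothetical, and the sketch does not adjust the start positions of the tokens after the edit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for Cpp_Token and Cpp_Token_Array. */
typedef struct { int32_t start, size; } Sketch_Token;
typedef struct { Sketch_Token *tokens; int32_t count, max_count; } Sketch_Array;

/* Token count after replacing [first, one_past_last) with relex_count tokens. */
int32_t spliced_count(int32_t current_count, int32_t first,
                      int32_t one_past_last, int32_t relex_count){
    return current_count - (one_past_last - first) + relex_count;
}

/* Replaces tokens [first, one_past_last) of array with the tokens in relex.
   Returns 0 without changing anything if array lacks capacity, 1 on success. */
int splice_tokens(Sketch_Array *array, int32_t first, int32_t one_past_last,
                  const Sketch_Array *relex){
    int32_t new_count, tail;
    assert(0 <= first && first <= one_past_last && one_past_last <= array->count);
    new_count = spliced_count(array->count, first, one_past_last, relex->count);
    if (new_count > array->max_count) return 0;
    tail = array->count - one_past_last;
    /* Move the untouched tail first. The regions may overlap, so use memmove. */
    memmove(array->tokens + first + relex->count,
            array->tokens + one_past_last,
            sizeof(Sketch_Token)*(size_t)tail);
    memcpy(array->tokens + first, relex->tokens,
           sizeof(Sketch_Token)*(size_t)relex->count);
    array->count = new_count;
    return 1;
}
```

The capacity check mirrors the documented requirement: the original array has to be large enough for the new count before the replacement is attempted.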

§2.4.15: cpp_relex_abort

void cpp_relex_abort(Cpp_Relex_Data *S_ptr,
Cpp_Token_Array *array
)

Parameters

S_ptr

A pointer to a state that has gone through at least one cpp_relex_step.

array

The original array that went through cpp_relex_step to be edited.

Description

After the first call to cpp_relex_step, the array's contents may have been changed. This call ensures the array is in its original state. After this call the relex state is dead.

§2.4.16: cpp_make_token_array

Cpp_Token_Array cpp_make_token_array(int32_t starting_max
)

Parameters

starting_max

The number of tokens to initialize the array with.

Return

An empty Cpp_Token_Array with memory malloc'd for storing tokens.

Description

This call allocates a Cpp_Token_Array with malloc for use in other convenience functions. Arrays that are not allocated this way should not be used in the convenience functions.

§2.4.17: cpp_free_token_array

void cpp_free_token_array(Cpp_Token_Array token_array
)

Parameters

token_array

An array previously allocated by cpp_make_token_array.

Description

This call frees a Cpp_Token_Array.

See Also

cpp_make_token_array

§2.4.18: cpp_resize_token_array

void cpp_resize_token_array(Cpp_Token_Array *token_array,
int32_t new_max
)

Parameters

token_array

An array previously allocated by cpp_make_token_array.

new_max

The new maximum size the array should support. If this is not greater than the current size of the array the operation is ignored.

Description

This call allocates a new memory chunk and moves the existing tokens in the array over to the new chunk.
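
The allocate-and-move behavior can be pictured with the sketch below. It is an illustration rather than the library's implementation. The Sketch_ types and sketch_resize_token_array are hypothetical, and reading "current size" as the count of used tokens is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for Cpp_Token and Cpp_Token_Array. */
typedef struct { int32_t start, size; } Sketch_Token;
typedef struct { Sketch_Token *tokens; int32_t count, max_count; } Sketch_Array;

/* Moves the tokens into a newly allocated block that holds new_max tokens.
   Returns 1 on success. Returns 0 and leaves the array untouched when the
   request is ignored or the allocation fails. */
int sketch_resize_token_array(Sketch_Array *array, int32_t new_max){
    Sketch_Token *fresh;
    assert(array != NULL);
    /* Assumption: "current size" means the number of tokens in use. */
    if (new_max <= array->count) return 0;
    fresh = (Sketch_Token*)malloc(sizeof(Sketch_Token)*(size_t)new_max);
    if (fresh == NULL) return 0;
    if (array->count > 0){
        memcpy(fresh, array->tokens, sizeof(Sketch_Token)*(size_t)array->count);
    }
    free(array->tokens);
    array->tokens = fresh;
    array->max_count = new_max;
    return 1;
}
```

Because the tokens move to a new block, any pointer into the old tokens memory is invalid after a successful resize.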

See Also

cpp_make_token_array

§2.4.19: cpp_lex_file

void cpp_lex_file(char *data,
int32_t size,
Cpp_Token_Array *token_array_out
)

Parameters

data

The file data to be lexed in a single contiguous block.

size

The number of bytes in data.

token_array_out

The token array where the output tokens will be pushed. This token array must be previously allocated with cpp_make_token_array.

Description

Lexes an entire file and manages the interaction with the lexer system so that it is quick and convenient to lex files.

Cpp_Token_Array lex_file(char *file_name){
    File_Data file = read_whole_file(file_name);
    
    // This array will be automatically grown if it runs
    // out of memory.
    Cpp_Token_Array array = cpp_make_token_array(100);
    
    cpp_lex_file(file.data, file.size, &array);
    
    return(array);
}

See Also

cpp_make_token_array

§2.5 Lexer Type Descriptions

§2.5.1: Cpp_Token_Type

enum Cpp_Token_Type;

Description

A Cpp_Token_Type classifies a token to make parsing easier. Some types are not actually output by the lexer, but exist because parsers will also make use of token types in their own output.

Values

CPP_TOKEN_JUNK = 0

CPP_TOKEN_COMMENT = 1

CPP_PP_INCLUDE = 2

CPP_PP_DEFINE = 3

CPP_PP_UNDEF = 4

CPP_PP_IF = 5

CPP_PP_IFDEF = 6

CPP_PP_IFNDEF = 7

CPP_PP_ELSE = 8

CPP_PP_ELIF = 9

CPP_PP_ENDIF = 10

CPP_PP_ERROR = 11

CPP_PP_IMPORT = 12

CPP_PP_USING = 13

CPP_PP_LINE = 14

CPP_PP_PRAGMA = 15

CPP_PP_STRINGIFY = 16

CPP_PP_CONCAT = 17

CPP_PP_UNKNOWN = 18

CPP_PP_DEFINED = 19

CPP_PP_INCLUDE_FILE = 20

CPP_PP_ERROR_MESSAGE = 21

CPP_TOKEN_KEY_TYPE = 22

CPP_TOKEN_KEY_MODIFIER = 23

CPP_TOKEN_KEY_QUALIFIER = 24

CPP_TOKEN_KEY_OPERATOR = 25

This type is not stored in token output from the lexer.

CPP_TOKEN_KEY_CONTROL_FLOW = 26

CPP_TOKEN_KEY_CAST = 27

CPP_TOKEN_KEY_TYPE_DECLARATION = 28

CPP_TOKEN_KEY_ACCESS = 29

CPP_TOKEN_KEY_LINKAGE = 30

CPP_TOKEN_KEY_OTHER = 31

CPP_TOKEN_IDENTIFIER = 32

CPP_TOKEN_INTEGER_CONSTANT = 33

CPP_TOKEN_CHARACTER_CONSTANT = 34

CPP_TOKEN_FLOATING_CONSTANT = 35

CPP_TOKEN_STRING_CONSTANT = 36

CPP_TOKEN_BOOLEAN_CONSTANT = 37

CPP_TOKEN_STATIC_ASSERT = 38

CPP_TOKEN_BRACKET_OPEN = 39

CPP_TOKEN_BRACKET_CLOSE = 40

CPP_TOKEN_PARENTHESE_OPEN = 41

CPP_TOKEN_PARENTHESE_CLOSE = 42

CPP_TOKEN_BRACE_OPEN = 43

CPP_TOKEN_BRACE_CLOSE = 44

CPP_TOKEN_SEMICOLON = 45

CPP_TOKEN_ELLIPSIS = 46

CPP_TOKEN_STAR = 47

This is an 'ambiguous' token type because it requires parsing to determine the full nature of the token.

CPP_TOKEN_AMPERSAND = 48

This is an 'ambiguous' token type because it requires parsing to determine the full nature of the token.

CPP_TOKEN_TILDE = 49

This is an 'ambiguous' token type because it requires parsing to determine the full nature of the token.

CPP_TOKEN_PLUS = 50

This is an 'ambiguous' token type because it requires parsing to determine the full nature of the token.

CPP_TOKEN_MINUS = 51

This is an 'ambiguous' token type because it requires parsing to determine the full nature of the token.

CPP_TOKEN_INCREMENT = 52

This is an 'ambiguous' token type because it requires parsing to determine the full nature of the token.

CPP_TOKEN_DECREMENT = 53

This is an 'ambiguous' token type because it requires parsing to determine the full nature of the token.

CPP_TOKEN_SCOPE = 54

CPP_TOKEN_POSTINC = 55

This type is for parser use, it is not output by the lexer.

CPP_TOKEN_POSTDEC = 56

This type is for parser use, it is not output by the lexer.

CPP_TOKEN_FUNC_STYLE_CAST = 57

This type is for parser use, it is not output by the lexer.

CPP_TOKEN_CPP_STYLE_CAST = 58

CPP_TOKEN_CALL = 59

This type is for parser use, it is not output by the lexer.

CPP_TOKEN_INDEX = 60

This type is for parser use, it is not output by the lexer.

CPP_TOKEN_DOT = 61

CPP_TOKEN_ARROW = 62

CPP_TOKEN_PREINC = 63

This token is for parser use, it is not output by the lexer.

CPP_TOKEN_PREDEC = 64

This token is for parser use, it is not output by the lexer.

CPP_TOKEN_POSITIVE = 65

This token is for parser use, it is not output by the lexer.

CPP_TOKEN_NEGAITVE = 66

This token is for parser use, it is not output by the lexer.

CPP_TOKEN_NOT = 67

CPP_TOKEN_BIT_NOT = 68

This type is for parser use, it is not output by the lexer.

CPP_TOKEN_CAST = 69

This type is for parser use, it is not output by the lexer.

CPP_TOKEN_DEREF = 70

This type is for parser use, it is not output by the lexer.

CPP_TOKEN_TYPE_PTR = 71

This type is for parser use, it is not output by the lexer.

CPP_TOKEN_ADDRESS = 72

This type is for parser use, it is not output by the lexer.

CPP_TOKEN_TYPE_REF = 73

This type is for parser use, it is not output by the lexer.

CPP_TOKEN_SIZEOF = 74

CPP_TOKEN_ALIGNOF = 75

CPP_TOKEN_DECLTYPE = 76

CPP_TOKEN_TYPEID = 77

CPP_TOKEN_NEW = 78

CPP_TOKEN_DELETE = 79

CPP_TOKEN_NEW_ARRAY = 80

This type is for parser use, it is not output by the lexer.

CPP_TOKEN_DELETE_ARRAY = 81

This type is for parser use, it is not output by the lexer.

CPP_TOKEN_PTRDOT = 82

CPP_TOKEN_PTRARROW = 83

CPP_TOKEN_MUL = 84

This type is for parser use, it is not output by the lexer.

CPP_TOKEN_DIV = 85

CPP_TOKEN_MOD = 86

CPP_TOKEN_ADD = 87

This type is for parser use, it is not output by the lexer.

CPP_TOKEN_SUB = 88

This type is for parser use, it is not output by the lexer.

CPP_TOKEN_LSHIFT = 89

CPP_TOKEN_RSHIFT = 90

CPP_TOKEN_LESS = 91

CPP_TOKEN_GRTR = 92

CPP_TOKEN_GRTREQ = 93

CPP_TOKEN_LESSEQ = 94

CPP_TOKEN_EQEQ = 95

CPP_TOKEN_NOTEQ = 96

CPP_TOKEN_BIT_AND = 97

This type is for parser use, it is not output by the lexer.

CPP_TOKEN_BIT_XOR = 98

CPP_TOKEN_BIT_OR = 99

CPP_TOKEN_AND = 100

CPP_TOKEN_OR = 101

CPP_TOKEN_TERNARY_QMARK = 102

CPP_TOKEN_COLON = 103

CPP_TOKEN_THROW = 104

CPP_TOKEN_EQ = 105

CPP_TOKEN_ADDEQ = 106

CPP_TOKEN_SUBEQ = 107

CPP_TOKEN_MULEQ = 108

CPP_TOKEN_DIVEQ = 109

CPP_TOKEN_MODEQ = 110

CPP_TOKEN_LSHIFTEQ = 111

CPP_TOKEN_RSHIFTEQ = 112

CPP_TOKEN_ANDEQ = 113

CPP_TOKEN_OREQ = 114

CPP_TOKEN_XOREQ = 115

CPP_TOKEN_COMMA = 116

CPP_TOKEN_EOF = 117

This type is for parser use, it is not output by the lexer.

CPP_TOKEN_TYPE_COUNT = 118

§2.5.2: Cpp_Token

struct Cpp_Token {
Cpp_Token_Type type;
int32_t start;
int32_t size;
uint16_t state_flags;
uint16_t flags;
};

Description

Cpp_Token represents a single lexed token. It is the primary output of the lexing system.

Fields

type

The type field indicates the type of the token. All tokens have a type no matter the circumstances.

start

The start field indicates the index of the first character of this token's lexeme.

size

The size field indicates the number of bytes in this token's lexeme.

state_flags

The state_flags should not be used outside of the lexer's implementation.

flags

The flags field contains extra useful information about the token.
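
Since start and size describe a byte range in the lexed file, pulling out a token's text is a bounded copy. The helper below is a sketch with hypothetical names (Sketch_Token, copy_lexeme). It assumes file_data is the same buffer that was handed to the lexer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in holding the two Cpp_Token fields used here. */
typedef struct { int32_t start; int32_t size; } Sketch_Token;

/* Copies the token's lexeme into out as a null-terminated string.
   Truncates if out is too small and returns the number of bytes copied. */
int32_t copy_lexeme(const char *file_data, Sketch_Token token,
                    char *out, int32_t out_cap){
    int32_t n = token.size;
    assert(token.start >= 0 && token.size >= 0);
    if (out_cap <= 0) return 0;
    if (n > out_cap - 1) n = out_cap - 1;   /* leave room for the terminator */
    memcpy(out, file_data + token.start, (size_t)n);
    out[n] = 0;
    return n;
}
```

For the text "int x = 42;" a token with start 8 and size 2 produces the string "42".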

See Also

Cpp_Token_Flag

§2.5.3: Cpp_Token_Flag

enum Cpp_Token_Flag;

Description

The Cpp_Token_Flags are used to mark up tokens with additional information.

Values

CPP_TFLAG_PP_DIRECTIVE = 0x1

Indicates that the token is a preprocessor directive.

CPP_TFLAG_PP_BODY = 0x2

Indicates that the token is on the line of a preprocessor directive.

CPP_TFLAG_MULTILINE = 0x4

Indicates that the token spans across multiple lines. This can show up on line comments and string literals with backslash line continuation.

CPP_TFLAG_IS_OPERATOR = 0x8

Indicates that the token is some kind of operator or punctuation like braces.

CPP_TFLAG_IS_KEYWORD = 0x10

Indicates that the token is a keyword.
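
The flags are bit values, so they are tested with bitwise AND and may be combined with OR. The sketch below redefines the values listed above so that it stands alone. With the real header they come from 4cpp_lexer.h. The two helper functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values copied from the list above. */
enum {
    CPP_TFLAG_PP_DIRECTIVE = 0x1,
    CPP_TFLAG_PP_BODY      = 0x2,
    CPP_TFLAG_MULTILINE    = 0x4,
    CPP_TFLAG_IS_OPERATOR  = 0x8,
    CPP_TFLAG_IS_KEYWORD   = 0x10
};

/* True when the token is a directive or sits on a directive's line. */
int token_in_preprocessor_line(uint16_t flags){
    return (flags & (CPP_TFLAG_PP_DIRECTIVE | CPP_TFLAG_PP_BODY)) != 0;
}

/* True for keywords that are not part of a preprocessor line. */
int token_is_plain_keyword(uint16_t flags){
    return (flags & CPP_TFLAG_IS_KEYWORD) != 0 &&
           !token_in_preprocessor_line(flags);
}
```

A syntax highlighter might use a test like token_is_plain_keyword to color keywords inside a #define body differently from keywords in ordinary code.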

§2.5.4: Cpp_Token_Array

struct Cpp_Token_Array {
Cpp_Token * tokens;
int32_t count;
int32_t max_count;
};

Description

Cpp_Token_Array is used to bundle together the common elements of a growing array of Cpp_Tokens. To initialize it the tokens field should point to a block of memory with a size equal to max_count*sizeof(Cpp_Token) and the count should be initialized to zero.

Fields

tokens

The tokens field points to the memory used to store the array of tokens.

count

The count field counts how many tokens in the array are currently used.

max_count

The max_count field specifies the maximum size the count field may grow to before the tokens array is out of space.

§2.5.5: Cpp_Get_Token_Result

struct Cpp_Get_Token_Result {
int32_t token_index;
int32_t in_whitespace;
};

Description

Cpp_Get_Token_Result is the return result of the cpp_get_token call.

Fields

token_index

The token_index field indicates which token answers the query. To get the token from the source array, index the array with it:

array.tokens[result.token_index]

in_whitespace

The in_whitespace field is true when the query position was actually in whitespace after the result token.
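
Because token_index can be -1 and in_whitespace can be set, a caller usually checks both before touching the token. The helper below is one possible way to do that. The Sketch_ types mirror the documented structs, and token_containing_query is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins mirroring the documented structs. */
typedef struct { int32_t start, size; } Sketch_Token;
typedef struct { Sketch_Token *tokens; int32_t count, max_count; } Sketch_Array;
typedef struct { int32_t token_index; int32_t in_whitespace; } Sketch_Get_Token_Result;

/* Returns the token that actually contains the queried position, or NULL
   when the position was in whitespace or before the first token. */
Sketch_Token *token_containing_query(Sketch_Array *array,
                                     Sketch_Get_Token_Result result){
    assert(array != NULL);
    if (result.token_index < 0 || result.token_index >= array->count) return NULL;
    if (result.in_whitespace) return NULL;
    return &array->tokens[result.token_index];
}
```

Callers that want the nearest preceding token even for whitespace positions would skip the in_whitespace check and keep only the range check.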

See Also

cpp_get_token

§2.5.6: Cpp_Relex_Range

struct Cpp_Relex_Range {
int32_t start_token_index;
int32_t end_token_index;
};

Description

Cpp_Relex_Range is the return result of the cpp_get_relex_range call.

Fields

start_token_index

The index of the first token in the unedited array that needs to be relexed.

end_token_index

The index of the first token in the unedited array after the edited range that may not need to be relexed. Sometimes a relex operation has to lex past this position to find a token that is not affected by the edit.
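
One use of this range is choosing a starting capacity for the relex_array passed to cpp_relex_step. The sketch below is a guess at a reasonable policy and not something the library prescribes. Sketch_Relex_Range mirrors the struct above, while relex_capacity_hint and its slack parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in mirroring Cpp_Relex_Range. */
typedef struct {
    int32_t start_token_index;
    int32_t end_token_index;
} Sketch_Relex_Range;

/* Suggests a relex_array capacity: the number of tokens in the range plus
   some slack, because the relex may have to lex past end_token_index. */
int32_t relex_capacity_hint(Sketch_Relex_Range range, int32_t slack){
    int32_t span = range.end_token_index - range.start_token_index;
    assert(slack >= 0);
    if (span < 0) span = 0;
    return span + slack;
}
```

If the relex still returns LexResult_NeedTokenMemory with this capacity, the relex_array can be extended as described under cpp_relex_step.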

See Also

cpp_get_relex_range

§2.5.7: Cpp_Lex_Data

struct Cpp_Lex_Data { /* non-public internals */ } ;

Description

Cpp_Lex_Data represents the state of the lexer so that the system may be resumable and the user can manage the lexer state and decide when to resume lexing with it. To create a new lexer state call cpp_lex_data_init.

The internals of the lex state should not be treated as a part of the public API.

See Also

cpp_lex_data_init

§2.5.8: Cpp_Lex_Result

enum Cpp_Lex_Result;

Description

Cpp_Lex_Result is returned from the lexing engine to indicate why it stopped lexing.

Values

LexResult_Finished = 0

This indicates that the system got to the end of the file and will not accept more input.

LexResult_NeedChunk = 1

This indicates that the system got to the end of an input chunk and is ready to receive the next input chunk.

LexResult_NeedTokenMemory = 2

This indicates that the output array ran out of space to store tokens and needs to be replaced or expanded before continuing.

LexResult_HitTokenLimit = 3

This indicates that the maximum number of output tokens as specified by the user was hit.

§2.5.9: Cpp_Relex_Data

struct Cpp_Relex_Data { /* non-public internals */ } ;

Description

Cpp_Relex_Data represents the state of the relexer so that the system may be resumable. To create a new relex state call cpp_relex_init.

See Also

cpp_relex_init
