= Compiler construction

The Lexer emits Tokens. A Token has a type, some content, and a range of byte offsets identifying where this content/slice occurs in the source. The Parser reads the Tokens and builds a structured document, the Abstract Syntax Tree (AST).

== Lexing state diagram notes

• defining character classes (the set of acceptable characters for identifiers/… or for parts of them) helps to maintain an overview of the admissible characters
• when reading a string, the edge leading to a node reads the first character and a self-loop on the node reads further characters
• for escaping, the best approach is to assign special semantics to one escape character; this escape character must also allow writing the literal escape character itself
• you will need at least 2 iterations until you arrive at the final lexing state diagram
• for each node of the diagram, think through whether you have covered all possible inputs properly, to detect grammar problems early

These notes work best for syntaxes with only single-letter operators rather than long keywords, since such syntaxes require little lookahead.

== Lexer/Parser rules

1. The set of Tokens must be non-overlapping and cover the entire text document.
2. An error in the Lexer or Parser must always carry the problematic Token, so that a visual error message can be generated with {call stack with the names of all source files, source code with 1 line before and 1 line after, line number and column, visual guide pointing to the token in the source line}.
3. The Parser must implement an interface that answers the question "after reading these tokens, can this be the end of the document?"
4. Decide for your implementation whether the Lexer/Parser have access to the entire source code in memory. If so, references can point to the token content/slice inside the source code; otherwise, copy all values to get a more pure/independent structure.
5. The AST can be serialized into a normalized representation.
6. Depending on memory management requirements: try to design the parser such that access to the source code is not required after parsing has finished.

== Desirable features

• Tokens should have an "original content" representation, but also a "debug" representation (with readable byte offsets)
• an AST node should have a "structural" representation (the type of a node in the tree), a "debug" representation (yielding all attributes of a node, non-recursively), and a "complete" representation (yielding all attributes of a node, recursively)
• the set of tokens should be XML-serializable, e.g. <token type="identifier" start="0" end="3">abc</token> (here start is inclusive, end is exclusive; see the sketch after this list)
• create an XML serialization of the AST where additional fields declare (e.g.) whether this element defines the contained identifier
• provide tools which – given an AST – can return the documentation/testcase/definition line/… for a given identifier
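A minimal sketch of the token-to-XML serialization, assuming Python and its standard library. The tag and attribute names follow the example above; `TokenType` and the helper `serialize_tokens` are illustrative assumptions, not a prescribed API.

```python
# Minimal sketch: XML-serializing a token sequence (start inclusive, end exclusive).
from dataclasses import dataclass
from enum import Enum, auto
import xml.etree.ElementTree as ET

class TokenType(Enum):
    IDENTIFIER = auto()
    OPERATOR = auto()

@dataclass
class Token:
    type: TokenType
    content: bytes
    start: int  # inclusive byte offset
    end: int    # exclusive byte offset

def serialize_tokens(tokens: list[Token]) -> ET.Element:
    root = ET.Element("tokens")
    for tok in tokens:
        el = ET.SubElement(root, "token", {
            "type": tok.type.name.lower(),
            "start": str(tok.start),
            "end": str(tok.end),
        })
        el.text = tok.content.decode("utf-8")
    return root

# yields <tokens><token type="identifier" start="0" end="3">abc</token></tokens>
print(ET.tostring(serialize_tokens([Token(TokenType.IDENTIFIER, b"abc", 0, 3)]),
                  encoding="unicode"))
```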
== How can user requirements be implemented then?

• REPL capabilities → the lexer can tell whether the end is reached; an additional interface could allow determining which characters might follow next
• syntax highlighting → XML serializability of the Token sequence
• auto-formatting → normalized representation of the AST

== Interfaces and objects

Token:
  type: enum-value
  content: bytes
  start: u32
  end: u32
  original_representation() → bytes
  debug_representation() → string

TreeNode:
  name: string
  attributes: Map[string, any]
  children: Maybe<[TreeNode]>
  start: u32
  end: u32
  structural_representation() → string
  debug_representation() → string
  complete_representation() → string

Lexer:
  lexing_state: enum-value
  cache: bytes
  consume(bytes) → Maybe<[Token]>
  end() → Maybe<[Token]>

Parser:
  root: TreeNode
  stack: [&TreeNode]
  consume(Token)
  has_finished() → bool
  end() → TreeNode

TokensSerialized:
  root: XMLDocument
  attach(Token)
  end() → XMLDocument

function serialize_tree(tree: TreeNode) → XMLDocument
function identify(tree: TreeNode, query: string) → TreeNode
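To make the streaming Lexer interface (consume/end) concrete, here is a minimal sketch in Python. It assumes a toy syntax of identifiers, single-letter operators, and double-quoted strings with one backslash escape character; the state names, character classes, and whitespace handling are illustrative assumptions (a full lexer per rule 1 would also emit whitespace tokens rather than skipping them).

```python
# Minimal sketch of the streaming Lexer interface; the toy syntax, state
# names, and character classes are assumptions for illustration.
from dataclasses import dataclass
from enum import Enum, auto

class TokenType(Enum):
    IDENTIFIER = auto()
    OPERATOR = auto()
    STRING = auto()

class State(Enum):
    DEFAULT = auto()
    IDENTIFIER = auto()
    STRING = auto()
    STRING_ESCAPE = auto()  # one escape character; it can also escape itself

@dataclass
class Token:
    type: TokenType
    content: bytes  # the raw slice, so start/end delimit it exactly
    start: int      # inclusive byte offset
    end: int        # exclusive byte offset

IDENT_CLASS = set(b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789")
OPERATOR_CLASS = set(b"+-*/=")

class Lexer:
    def __init__(self) -> None:
        self.state = State.DEFAULT
        self.cache = bytearray()  # bytes of the token currently being read
        self.offset = 0           # byte offset of the byte currently processed
        self.token_start = 0

    def _emit(self, type_: TokenType, end: int) -> Token:
        tok = Token(type_, bytes(self.cache), self.token_start, end)
        self.cache.clear()
        self.state = State.DEFAULT
        return tok

    def consume(self, data: bytes) -> list[Token]:
        out: list[Token] = []
        for byte in data:
            self._step(byte, out)
            self.offset += 1
        return out

    def _step(self, byte: int, out: list[Token]) -> None:
        if self.state is State.IDENTIFIER:
            if byte in IDENT_CLASS:          # self-loop: read further characters
                self.cache.append(byte)
                return
            out.append(self._emit(TokenType.IDENTIFIER, self.offset))
            # fall through: the same byte starts the next token
        if self.state is State.STRING:
            self.cache.append(byte)          # keep the raw slice, incl. quotes/escapes
            if byte == ord("\\"):
                self.state = State.STRING_ESCAPE
            elif byte == ord('"'):
                out.append(self._emit(TokenType.STRING, self.offset + 1))
            return
        if self.state is State.STRING_ESCAPE:
            self.cache.append(byte)          # the escaped byte is taken literally
            self.state = State.STRING
            return
        # State.DEFAULT: the edge into a node reads the first character
        if byte in IDENT_CLASS:
            self.token_start = self.offset
            self.cache.append(byte)
            self.state = State.IDENTIFIER
        elif byte in OPERATOR_CLASS:
            self.token_start = self.offset
            self.cache.append(byte)
            out.append(self._emit(TokenType.OPERATOR, self.offset + 1))
        elif byte == ord('"'):
            self.token_start = self.offset
            self.cache.append(byte)
            self.state = State.STRING
        elif byte not in b" \t\n":           # whitespace is skipped in this sketch
            raise ValueError(f"unexpected byte {byte!r} at offset {self.offset}")

    def end(self) -> list[Token]:
        if self.state is State.IDENTIFIER:
            return [self._emit(TokenType.IDENTIFIER, self.offset)]
        if self.state is not State.DEFAULT:
            raise ValueError("unterminated string at end of input")
        return []

lexer = Lexer()
result = lexer.consume(b'ab = "x\\"y"') + lexer.end()
# yields IDENTIFIER b"ab" [0,2), OPERATOR b"=" [3,4), STRING b'"x\\"y"' [5,11)
```

Keeping the raw slice (including quotes and escape characters) as the token content means start/end always delimit the content exactly, matching the "original content" representation above; unescaping can then happen in a later stage.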