= Compiler construction

The Lexer emits Tokens. A Token has a type, some content, and a range of byte offsets identifying where this content/slice occurs in the source. The Parser reads the Tokens and builds a structured document, the Abstract Syntax Tree (AST).

== Lexing state diagram notes

• defining character classes (the set of acceptable characters for identifiers/… or for parts of them) helps to maintain an overview of the admissible characters
• when reading a string, the edge leading to a node reads the first character and a self-loop on the node reads further characters
• for escaping, the best approach is to assign special semantics to one escape character; this escape character must also allow writing the literal escape character itself
• you will need at least 2 iterations until you arrive at the final lexing state diagram
• for each node of the diagram, think through whether you have covered all possible inputs properly, to detect grammar problems early

These notes work best for syntaxes with only single-letter operators rather than long keywords, since such syntaxes require little lookahead.

== Lexer/Parser rules

1. The set of Tokens must be non-overlapping and cover the entire text document.
2. An error in the Lexer or Parser must always carry the problematic Token, so that a visual error message can be generated with {call stack with the names of all source files, source code with 1 line before and 1 line after, line number and column, visual guide pointing to the token in the source line}.
3. The Parser must implement an interface that answers the question "after reading these tokens, can this be the end of the document?"
4. Decide for your implementation whether the Lexer/Parser have access to the entire source code in memory. If so, references can point to the token content/slice inside the source code; otherwise, copy all values to get a more pure/independent structure.
5. The AST can be serialized into a normalized representation.
6. Depending on memory management requirements: try to design the parser such that access to the source code is not required after parsing has finished.

== Desirable features

• Tokens should have an "original content" representation, but also a "debug" representation (with readable byte offsets)
• an AST node should have a "structural" representation (the type of a node in the tree), a "debug" representation (yielding all attributes of a node, non-recursively), and a "complete" representation (yielding all attributes of a node, recursively)
• the set of tokens should be XML-serializable, e.g. <token type="identifier" start="0" end="3">abc</token> (here start is inclusive, end is exclusive; see the sketch after this list)
• create an XML serialization of the AST where additional fields declare (e.g.) whether this element defines the contained identifier
• provide tools which – given an AST – can return the documentation/testcase/definition line/… for a given identifier
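A minimal sketch of the token-to-XML serialization, assuming Python and its standard library. The tag and attribute names follow the example above; `TokenType` and the helper `serialize_tokens` are illustrative assumptions, not a prescribed API.

```python
# Minimal sketch: XML-serializing a token sequence (start inclusive, end exclusive).
from dataclasses import dataclass
from enum import Enum, auto
import xml.etree.ElementTree as ET

class TokenType(Enum):
    IDENTIFIER = auto()
    OPERATOR = auto()

@dataclass
class Token:
    type: TokenType
    content: bytes
    start: int  # inclusive byte offset
    end: int    # exclusive byte offset

def serialize_tokens(tokens: list[Token]) -> ET.Element:
    root = ET.Element("tokens")
    for tok in tokens:
        el = ET.SubElement(root, "token", {
            "type": tok.type.name.lower(),
            "start": str(tok.start),
            "end": str(tok.end),
        })
        el.text = tok.content.decode("utf-8")
    return root

# yields <tokens><token type="identifier" start="0" end="3">abc</token></tokens>
print(ET.tostring(serialize_tokens([Token(TokenType.IDENTIFIER, b"abc", 0, 3)]),
                  encoding="unicode"))
```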
== How can user requirements be implemented then?

• REPL capabilities → the lexer can tell whether the end is reached; an additional interface could allow determining which characters might follow next
• syntax highlighting → XML serializability of the Token sequence
• auto-formatting → normalized representation of the AST

== Interfaces and objects

Token:
  type: enum-value
  content: bytes
  start: u32
  end: u32
  original_representation() → bytes
  debug_representation() → string

TreeNode:
  name: string
  attributes: Map[string, any]
  children: Maybe<[TreeNode]>
  start: u32
  end: u32
  structural_representation() → string
  debug_representation() → string
  complete_representation() → string

Lexer:
  lexing_state: enum-value
  cache: bytes
  consume(bytes) → Maybe<[Token]>
  end() → Maybe<[Token]>

Parser:
  root: TreeNode
  stack: [&TreeNode]
  consume(Token)
  has_finished() → bool
  end() → TreeNode

TokensSerialized:
  root: XMLDocument
  attach(Token)
  end() → XMLDocument

function serialize_tree(tree: TreeNode) → XMLDocument
function identify(tree: TreeNode, query: string) → TreeNode
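To make the streaming Lexer interface (consume/end) concrete, here is a minimal sketch in Python. It assumes a toy syntax of identifiers, single-letter operators, and double-quoted strings with one backslash escape character; the state names, character classes, and whitespace handling are illustrative assumptions (a full lexer per rule 1 would also emit whitespace tokens rather than skipping them).

```python
# Minimal sketch of the streaming Lexer interface; the toy syntax, state
# names, and character classes are assumptions for illustration.
from dataclasses import dataclass
from enum import Enum, auto

class TokenType(Enum):
    IDENTIFIER = auto()
    OPERATOR = auto()
    STRING = auto()

class State(Enum):
    DEFAULT = auto()
    IDENTIFIER = auto()
    STRING = auto()
    STRING_ESCAPE = auto()  # one escape character; it can also escape itself

@dataclass
class Token:
    type: TokenType
    content: bytes  # the raw slice, so start/end delimit it exactly
    start: int      # inclusive byte offset
    end: int        # exclusive byte offset

IDENT_CLASS = set(b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789")
OPERATOR_CLASS = set(b"+-*/=")

class Lexer:
    def __init__(self) -> None:
        self.state = State.DEFAULT
        self.cache = bytearray()  # bytes of the token currently being read
        self.offset = 0           # byte offset of the byte currently processed
        self.token_start = 0

    def _emit(self, type_: TokenType, end: int) -> Token:
        tok = Token(type_, bytes(self.cache), self.token_start, end)
        self.cache.clear()
        self.state = State.DEFAULT
        return tok

    def consume(self, data: bytes) -> list[Token]:
        out: list[Token] = []
        for byte in data:
            self._step(byte, out)
            self.offset += 1
        return out

    def _step(self, byte: int, out: list[Token]) -> None:
        if self.state is State.IDENTIFIER:
            if byte in IDENT_CLASS:          # self-loop: read further characters
                self.cache.append(byte)
                return
            out.append(self._emit(TokenType.IDENTIFIER, self.offset))
            # fall through: the same byte starts the next token
        if self.state is State.STRING:
            self.cache.append(byte)          # keep the raw slice, incl. quotes/escapes
            if byte == ord("\\"):
                self.state = State.STRING_ESCAPE
            elif byte == ord('"'):
                out.append(self._emit(TokenType.STRING, self.offset + 1))
            return
        if self.state is State.STRING_ESCAPE:
            self.cache.append(byte)          # the escaped byte is taken literally
            self.state = State.STRING
            return
        # State.DEFAULT: the edge into a node reads the first character
        if byte in IDENT_CLASS:
            self.token_start = self.offset
            self.cache.append(byte)
            self.state = State.IDENTIFIER
        elif byte in OPERATOR_CLASS:
            self.token_start = self.offset
            self.cache.append(byte)
            out.append(self._emit(TokenType.OPERATOR, self.offset + 1))
        elif byte == ord('"'):
            self.token_start = self.offset
            self.cache.append(byte)
            self.state = State.STRING
        elif byte not in b" \t\n":           # whitespace is skipped in this sketch
            raise ValueError(f"unexpected byte {byte!r} at offset {self.offset}")

    def end(self) -> list[Token]:
        if self.state is State.IDENTIFIER:
            return [self._emit(TokenType.IDENTIFIER, self.offset)]
        if self.state is not State.DEFAULT:
            raise ValueError("unterminated string at end of input")
        return []

lexer = Lexer()
result = lexer.consume(b'ab = "x\\"y"') + lexer.end()
# yields IDENTIFIER b"ab" [0,2), OPERATOR b"=" [3,4), STRING b'"x\\"y"' [5,11)
```

Keeping the raw slice (including quotes and escape characters) as the token content means start/end always delimit the content exactly, matching the "original content" representation above; unescaping can then happen in a later stage.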