Compiler Internals
A comprehensive technical deep-dive into the MathViz compiler architecture, implementation details, and optimization strategies. This document is intended for contributors, language implementers, and anyone interested in compiler design.
Overview
The MathViz compiler is a multi-stage source-to-source transpiler that transforms
.mviz source files into optimized Python code. It follows a traditional
compiler pipeline architecture with additional analysis passes for performance optimization.
The compiler prioritizes correctness, readable output, and seamless Python interoperability. Generated code is clean, idiomatic Python that can be inspected and debugged easily.
Key Design Decisions
- Immutable AST: All AST nodes are frozen dataclasses for thread safety and predictability
- Visitor Pattern: Tree traversal via the visitor pattern enables clean separation of concerns
- Multi-pass Architecture: Separate analysis passes for type checking, purity, complexity, and parallelization
- Optional Strictness: Type errors can be warnings or fatal errors depending on configuration
- Zero Runtime Overhead: No MathViz runtime library required; output is standalone Python
Compilation Pipeline
The compiler executes the following stages in order. Each stage is independent and
can be enabled/disabled via the CompilationPipeline configuration.
```
                      MathViz Compilation Pipeline

Source Code (.mviz)
       │
       ▼
┌─────────────┐
│    LEXER    │  Tokenization: Unicode math symbols, f-strings, keywords
│  lexer.py   │  Output: Token stream
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   PARSER    │  Recursive descent parser with operator precedence
│  parser.py  │  Output: Abstract Syntax Tree (AST)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   MODULE    │  Resolves `use` statements, loads dependencies
│   LOADER    │  Detects circular dependencies
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    TYPE     │  Hindley-Milner style inference with bidirectional hints
│   CHECKER   │  Output: Type errors, function signatures
└──────┬──────┘
       │
       ├──────────────────┬──────────────────┐
       ▼                  ▼                  ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   PURITY    │    │ COMPLEXITY  │    │  PARALLEL   │
│  ANALYZER   │    │  ANALYZER   │    │  ANALYZER   │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       │                  │                  │
       └──────────────────┼──────────────────┘
                          │
                          ▼
                   ┌─────────────┐
                   │    CODE     │  Emits Python with Numba decorators
                   │  GENERATOR  │  Output: Executable Python code
                   └──────┬──────┘
                          │
                          ▼
                Generated Python (.py)
```
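The per-stage toggling described above can be sketched as follows. This is an illustrative model only: the real `CompilationPipeline` configuration lives in `__init__.py`, and the `PipelineConfig` name and its fields here are hypothetical.

```python
# Hypothetical sketch of stage toggling; the actual CompilationPipeline
# API may differ. Required stages always run; analysis passes are optional.
from dataclasses import dataclass


@dataclass
class PipelineConfig:
    type_check: bool = True
    purity_analysis: bool = True
    complexity_analysis: bool = False
    parallel_analysis: bool = False


def run_pipeline(source: str, config: PipelineConfig) -> list:
    """Return the names of the stages that would run, in pipeline order."""
    stages = ["lexer", "parser", "module_loader"]
    if config.type_check:
        stages.append("type_checker")
    for name, enabled in [
        ("purity_analyzer", config.purity_analysis),
        ("complexity_analyzer", config.complexity_analysis),
        ("parallel_analyzer", config.parallel_analysis),
    ]:
        if enabled:
            stages.append(name)
    stages.append("codegen")
    return stages
```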
File Structure
- src/mathviz/compiler/
  - __init__.py (31 KB) - Pipeline orchestration
  - tokens.py (14 KB) - Token definitions
  - lexer.py (28 KB) - Lexical analysis
  - parser.py (115 KB) - Syntax parser
  - ast_nodes.py (69 KB) - AST node definitions
  - type_checker.py (143 KB) - Type inference
  - codegen.py (122 KB) - Python code generation
  - module_loader.py (22 KB) - Module resolution
  - purity_analyzer.py (28 KB) - Side effect analysis
  - complexity_analyzer.py (23 KB) - Big-O estimation
  - parallel_analyzer.py (34 KB) - Loop parallelization
  - jit_optimizer.py (63 KB) - Numba optimization
  - vectorizer.py (63 KB) - SIMD vectorization
  - memory_optimizer.py (61 KB) - Memory optimization
  - const_fold.py (99 KB) - Constant folding
  - linter.py (56 KB) - Static analysis
  - diagnostics.py (25 KB) - Error formatting
  - call_graph.py (24 KB) - Call graph analysis
Stage 1: Lexical Analysis
The lexer (lexer.py) transforms source code into a stream of tokens.
It supports both ASCII and Unicode operators for mathematical notation.
Lexer Implementation
Token Categories
| Category | Examples | Token Types |
|---|---|---|
| Keywords | `fn`, `let`, `struct`, `scene` | FN, LET, STRUCT, SCENE |
| Literals | `42`, `3.14`, `"hello"`, `true` | INTEGER, FLOAT, STRING, BOOLEAN |
| Math Unicode | ∈, ∪, ∩, ⊆, π | ELEMENT_OF, UNION, INTERSECTION, SUBSET, PI |
| Operators | `+`, `-`, `*`, `^`, `->`, `::` | PLUS, MINUS, STAR, CARET, THIN_ARROW, DOUBLE_COLON |
| F-strings | `f"x = {x}"` | FSTRING_START, FSTRING_PART, FSTRING_EXPR_* |
Special Features
- F-string Tokenization: Handles nested expressions in `f"..."` strings with proper brace matching
- Range Syntax: Recognizes `..` and `..=` for exclusive/inclusive ranges
- Doc Comments: Parses `///` and `/** */` documentation comments
- Unicode Support: First-class support for mathematical symbols (∈, ∪, ∩, ⊆, ⊇, π, ∞)
```python
# Input: let x = {1, 2} ∪ {3, 4}
# Output tokens:
Token(LET, "let")
Token(IDENTIFIER, "x")
Token(ASSIGN, "=")
Token(LBRACE, "{")
Token(INTEGER, 1)
Token(COMMA, ",")
Token(INTEGER, 2)
Token(RBRACE, "}")
Token(UNION, "∪")  # Unicode operator!
Token(LBRACE, "{")
Token(INTEGER, 3)
Token(COMMA, ",")
Token(INTEGER, 4)
Token(RBRACE, "}")
Token(EOF, None)
```
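One way a lexer can dispatch on Unicode math operators is a simple lookup table. The sketch below is hypothetical: the token-type names follow the table above, but the real lexer.py also handles multi-character operators, ranges, and f-string modes.

```python
# Hypothetical single-character Unicode operator dispatch; not the
# actual lexer.py implementation.
from enum import Enum, auto
from typing import Optional


class TokenType(Enum):
    ELEMENT_OF = auto()
    UNION = auto()
    INTERSECTION = auto()
    SUBSET = auto()


UNICODE_OPERATORS = {
    "∈": TokenType.ELEMENT_OF,
    "∪": TokenType.UNION,
    "∩": TokenType.INTERSECTION,
    "⊆": TokenType.SUBSET,
}


def classify(ch: str) -> Optional[TokenType]:
    """Map a Unicode operator character to its token type, else None."""
    return UNICODE_OPERATORS.get(ch)
```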
Stage 2: Parsing
The parser (parser.py) is a hand-written recursive descent parser with
Pratt parsing for expressions. It produces an immutable Abstract Syntax Tree.
Parser Implementation
Parsing Strategy
- Recursive Descent: Top-down parsing for statements and declarations
- Pratt Parsing: Operator precedence parsing for expressions
- Error Recovery: Continues parsing after errors to report multiple issues
- Location Tracking: Every AST node includes source location for error messages
Operator Precedence (Low to High)
| Level | Operators | Associativity |
|---|---|---|
| 1 | or | Left |
| 2 | and | Left |
| 3 | not | Prefix |
| 4 | ==, !=, <, >, <=, >=, ∈, ⊆ | Left |
| 5 | ∪, ∩, ∖ | Left |
| 6 | +, - | Left |
| 7 | *, /, % | Left |
| 8 | ^, ** | Right |
| 9 | - (unary), not | Prefix |
| 10 | ., ::, (), [] | Left |
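The expression grammar above can be handled with a compact precedence-climbing loop, which is the core of Pratt parsing. The sketch below uses a toy token list and tuple-based AST rather than the real parser's token and node types:

```python
# Simplified Pratt / precedence-climbing expression parser. Atoms are
# plain values; operators are strings. A subset of the precedence table
# above, with ^ right-associative.
PRECEDENCE = {"or": 1, "and": 2, "+": 6, "-": 6, "*": 7, "/": 7, "^": 8}
RIGHT_ASSOC = {"^"}


def parse_expression(tokens, pos=0, min_prec=1):
    """Parse tokens into a nested (op, left, right) tuple AST.

    Returns (ast, next_pos)."""
    left, pos = tokens[pos], pos + 1          # consume an atom
    while pos < len(tokens):
        op = tokens[pos]
        prec = PRECEDENCE.get(op)
        if prec is None or prec < min_prec:
            break
        # Right-associative operators recurse at the same precedence,
        # left-associative ones at one level higher.
        next_min = prec if op in RIGHT_ASSOC else prec + 1
        right, pos = parse_expression(tokens, pos + 1, next_min)
        left = (op, left, right)
    return left, pos
```

For example, `parse_expression([1, "+", 2, "*", 3])` groups the multiplication tighter, and `2 ^ 3 ^ 2` nests to the right.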
Abstract Syntax Tree
The AST (ast_nodes.py) consists of over 100 node types organized into
hierarchies for expressions, statements, patterns, and types.
AST Design
Node Hierarchy
```
ASTNode (abstract base)
├── Expression
│   ├── Literal (Integer, Float, String, Boolean, None)
│   ├── Identifier
│   ├── BinaryExpression
│   ├── UnaryExpression
│   ├── CallExpression
│   ├── MemberAccess
│   ├── IndexExpression
│   ├── LambdaExpression (PipeLambda)
│   ├── MatchExpression
│   ├── ConditionalExpression (ternary)
│   ├── RangeExpression
│   ├── CollectionLiteral (List, Set, Dict, Tuple)
│   └── FString
├── Statement
│   ├── LetStatement
│   ├── AssignmentStatement
│   ├── FunctionDef
│   ├── ClassDef / SceneDef
│   ├── StructDef / EnumDef / ImplBlock
│   ├── IfStatement / IfLetStatement
│   ├── ForStatement / WhileStatement / LoopStatement
│   ├── MatchStatement
│   ├── ReturnStatement / BreakStatement / ContinueStatement
│   └── UseStatement / ModuleDecl
├── Pattern (for pattern matching)
│   ├── LiteralPattern
│   ├── IdentifierPattern
│   ├── TuplePattern / ListPattern
│   ├── ConstructorPattern
│   ├── OrPattern / BindingPattern
│   └── WildcardPattern / RestPattern
└── TypeAnnotation
    ├── SimpleType
    ├── GenericType
    └── FunctionType
```
Immutability
All AST nodes are defined as frozen dataclasses with slots:
```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass(frozen=True, slots=True)
class BinaryExpression(Expression):
    left: Expression
    operator: BinaryOperator
    right: Expression
    location: Optional[SourceLocation] = None

    def accept(self, visitor: ASTVisitor) -> Any:
        return visitor.visit_binary_expression(self)
```
Stage 3: Type Checking
The type checker (type_checker.py) implements bidirectional type inference
with support for generics, structs, enums, and algebraic data types.
Type System
Type Representation
| Type Class | Description | Examples |
|---|---|---|
| `PrimitiveType` | Built-in scalar types | Int, Float, Bool, String |
| `GenericTypeInstance` | Parameterized types | List[Int], Dict[String, Float], Optional[T] |
| `FunctionType` | Function signatures | (Int, Int) -> Int |
| `ClassType` | User-defined structs/classes | Point, Vector3D |
| `ArrayType` | NumPy-style arrays | Array[Float, 2] |
| `UnknownType` | Unresolved type variable | Used during inference |
Inference Algorithm
- Bidirectional: Combines synthesis (bottom-up) and checking (top-down)
- Local Inference: Types propagate within functions without global constraint solving
- Explicit Annotations: Type annotations serve as ground truth
- Coercion: Implicit Int → Float promotion where safe
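A minimal model of the synthesis/checking split, including the safe Int → Float coercion, might look like the sketch below. Types are plain strings here; the real checker uses the type classes above, and this is an illustration rather than its actual algorithm.

```python
# Toy bidirectional checker: synthesize infers a type bottom-up,
# check validates against an expected type top-down.
def synthesize(expr, env):
    """Infer the type of a literal or identifier (bottom-up)."""
    if isinstance(expr, bool):       # bool first: bool is an int subclass
        return "Bool"
    if isinstance(expr, int):
        return "Int"
    if isinstance(expr, float):
        return "Float"
    if isinstance(expr, str):        # identifier: look up in the environment
        return env[expr]
    raise TypeError(f"cannot synthesize a type for {expr!r}")


def check(expr, expected, env):
    """Check an expression against an expected type (top-down)."""
    actual = synthesize(expr, env)
    if actual == expected:
        return True
    # Safe implicit promotion: an Int is accepted where a Float is expected.
    return actual == "Int" and expected == "Float"
```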
Built-in Types
# Primitive types
INT_TYPE = PrimitiveType("Int")
FLOAT_TYPE = PrimitiveType("Float")
BOOL_TYPE = PrimitiveType("Bool")
STRING_TYPE = PrimitiveType("String")
NONE_TYPE = PrimitiveType("None")
# Math types
VEC_TYPE = ArrayType(FLOAT_TYPE, 1) # Vector
MAT_TYPE = ArrayType(FLOAT_TYPE, 2) # Matrix
# Special types
ANY_TYPE = AnyType() # Accepts anything
Analysis Passes
Multiple analysis passes run in parallel to gather information for optimization and code generation.
Purity Analyzer
Determines function purity by tracking side effects. Pure functions are candidates for memoization and JIT compilation.
Side Effect Categories
- IO: print, file operations
- Mutation: Modifying non-local state
- Manim: Animation calls (play, wait)
- External: Calling impure functions
```python
from enum import Enum, auto

class Purity(Enum):
    PURE = auto()       # No side effects
    READ_ONLY = auto()  # Reads external state
    IMPURE = auto()     # Has side effects
    UNKNOWN = auto()    # Cannot determine
```
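Purity information composes: a function is at most as pure as the least-pure thing it does. A sketch of that combination rule follows; the severity ranking (with UNKNOWN treated most conservatively) is an assumption of this example, not necessarily the analyzer's exact lattice.

```python
# Sketch of purity combination across a function's operations and callees.
from enum import Enum, auto


class Purity(Enum):  # redefined here so the sketch is self-contained
    PURE = auto()
    READ_ONLY = auto()
    IMPURE = auto()
    UNKNOWN = auto()


# Assumed severity order: PURE < READ_ONLY < IMPURE < UNKNOWN.
_SEVERITY = {Purity.PURE: 0, Purity.READ_ONLY: 1,
             Purity.IMPURE: 2, Purity.UNKNOWN: 3}


def combine(*purities):
    """A function is at most as pure as its least-pure operation."""
    return max(purities, key=_SEVERITY.__getitem__, default=Purity.PURE)
```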
Complexity Analyzer
Estimates Big-O algorithmic complexity by analyzing loop nesting and recursion patterns.
```python
from enum import Enum

class Complexity(Enum):
    O_1 = "O(1)"              # Constant
    O_LOG_N = "O(log n)"      # Logarithmic
    O_N = "O(n)"              # Linear
    O_N_LOG_N = "O(n log n)"  # Linearithmic
    O_N_SQUARED = "O(n²)"     # Quadratic
    O_N_CUBED = "O(n³)"       # Cubic
    O_2_N = "O(2ⁿ)"           # Exponential
```
Parallel Analyzer
Identifies loops that can be parallelized with Numba's prange. Analyzes
data dependencies, reduction patterns, and loop-carried dependencies.
Parallelization Criteria
- No loop-carried dependencies (each iteration independent)
- Reduction variables detected (sum, product, min, max)
- No I/O or mutation of shared state
- Iteration count known or estimable
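These criteria can be approximated on a toy representation of a loop body, where each statement is a `(target, op, reads)` triple. This illustrates the dependency test only; the real analyzer works over the AST, not this IR.

```python
# Toy loop-carried dependency check. A loop body is a list of
# (target, op, reads) triples describing each statement.
REDUCTION_OPS = {"+=", "*=", "min=", "max="}


def is_parallelizable(body):
    """True when no statement re-reads its own previous-iteration value,
    except through a recognized reduction pattern (sum, product, min, max)."""
    for target, op, reads in body:
        if op in REDUCTION_OPS:
            continue              # reductions are handled specially by prange
        if target in reads:
            return False          # e.g. x = f(x): loop-carried dependency
    return True
```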
Stage 4: Code Generation
The code generator (codegen.py) transforms the AST into executable Python
code, handling operator translation, Manim integration, and Numba decorators.
Code Generator
Operator Translation
| MathViz | Python | Description |
|---|---|---|
| `x ^ y` | `x ** y` | Exponentiation |
| `A ∪ B` | `A \| B` or `A.union(B)` | Set union |
| `A ∩ B` | `A & B` or `A.intersection(B)` | Set intersection |
| `x ∈ S` | `x in S` | Set membership |
| `A ⊆ B` | `A.issubset(B)` | Subset |
| `sqrt(x)` | `np.sqrt(x)` | Square root |
| `sin(x)` | `np.sin(x)` | Trigonometric |
| `PI` | `np.pi` | Pi constant |
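A template-driven lowering of these operators might look like the following sketch. The real codegen.py works over the typed AST and chooses between the operator and method forms based on operand types; the template table here is hypothetical.

```python
# Hypothetical binary-operator lowering to Python source text.
BINOP_TEMPLATES = {
    "^": "({left}) ** ({right})",
    "∪": "({left}) | ({right})",
    "∩": "({left}) & ({right})",
    "∈": "({left}) in ({right})",
    "⊆": "({left}).issubset({right})",
}


def emit_binop(op, left, right):
    """Render one binary expression as Python source; operands are
    already-rendered source strings, parenthesized for safety."""
    return BINOP_TEMPLATES[op].format(left=left, right=right)
```

For example, `emit_binop("∪", "A", "B")` yields `(A) | (B)`.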
Struct to Dataclass
```
// MathViz
struct Point {
    x: Float
    y: Float
}

impl Point {
    fn new(x: Float, y: Float) -> Point {
        return Point { x: x, y: y }
    }
}
```

```python
# Generated Python
@dataclass
class Point:
    x: float
    y: float

def Point_new(x: float, y: float) -> Point:
    return Point(x=x, y=y)
```
Scene to Manim Class
```
// MathViz
scene HelloWorld {
    fn construct(self) {
        let text = Text("Hello")
        play(Write(text))
    }
}
```

```python
# Generated Python
from manim import *

class HelloWorld(Scene):
    def construct(self):
        text = Text('Hello')
        self.play(Write(text))
```
Optimization Passes
The compiler includes multiple optimization passes that can be enabled for high-performance numerical code.
Numba JIT Integration
Functions that pass purity analysis are automatically decorated with @njit
for Just-In-Time compilation to native code.
```python
from numba import njit

# Automatic JIT decoration
@njit(cache=True)
def compute_sum(n: int) -> int:
    total = 0
    for i in range(0, n):
        total = (total + i)
    return total
```
Constant Folding
The const_fold.py module (99 KB) implements multiple optimization passes:
- Constant Folding: Evaluate constant expressions at compile time
- Constant Propagation: Replace variables with known constant values
- Dead Code Elimination: Remove unreachable code
- Algebraic Simplification: Simplify expressions (x * 1 → x, x + 0 → x)
- Strength Reduction: Replace expensive ops (x * 2 → x << 1)
- Common Subexpression Elimination: Reuse repeated computations
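The folding and algebraic-simplification passes can be illustrated with a tiny folder over `(op, left, right)` expression tuples. This is a toy representation, not the compiler's AST, and it covers only a few operators:

```python
# Tiny constant folder plus the algebraic simplifications
# x * 1 -> x and x + 0 -> x listed above.
def fold(expr):
    if not isinstance(expr, tuple):
        return expr                          # literal or variable name
    op, left, right = expr
    left, right = fold(left), fold(right)    # fold children first
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        # Both operands are constants: evaluate at compile time.
        return {"+": left + right, "-": left - right, "*": left * right}[op]
    if op == "*" and 1 in (left, right):     # x * 1 -> x
        return left if right == 1 else right
    if op == "+" and 0 in (left, right):     # x + 0 -> x
        return left if right == 0 else right
    return (op, left, right)
```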
Vectorization
The vectorizer.py module detects patterns that can be vectorized to
NumPy operations for SIMD acceleration:
- Element-wise array operations
- Reduction operations (sum, product, min, max)
- Stencil computations
- Broadcasting patterns
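For example, an element-wise scaling loop and the NumPy form the vectorizer targets. This pair is an illustration of the rewrite, not actual compiler output, and it assumes NumPy is installed:

```python
# An element-wise loop and its vectorized equivalent: the scalar loop
# processes one element per iteration, while the NumPy version performs
# a single array multiply that can use SIMD under the hood.
import numpy as np


def scale_loop(xs, factor):
    out = [0.0] * len(xs)
    for i in range(len(xs)):
        out[i] = xs[i] * factor
    return out


def scale_vectorized(xs, factor):
    return (np.asarray(xs) * factor).tolist()
```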
Memory Optimization
The memory_optimizer.py module optimizes memory access patterns:
- Buffer reuse analysis
- In-place operation detection
- Cache-friendly loop ordering
- Temporary variable elimination
- Memory pool generation
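In-place operation detection can be illustrated with NumPy's `out=` parameter, which writes the result into an existing buffer instead of allocating a temporary array. The example below is illustrative; it is not generated compiler output.

```python
import numpy as np


def accumulate(acc, delta):
    """Add delta into acc in place; no temporary array is allocated."""
    np.add(acc, delta, out=acc)
    return acc
```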
Summary
The MathViz compiler is a production-grade source-to-source transpiler with:
- 31,000+ lines of carefully crafted Python code
- 22 modules covering all aspects of compilation
- 8-stage pipeline with optional analysis passes
- Full type inference with generics and algebraic data types
- Automatic optimization via Numba JIT, vectorization, and constant folding
- Rich diagnostics with source locations and suggestions
The compiler is open source. Contributions are welcome! See the GitHub repository for contribution guidelines.