LeoGreenAI
Configurable AI Hardware + Full Stack HW/SW Visibility • R&D partnership • FPGA inference enablement • RTL IP Licensing

Configurability end to end

LeoGreenAI is built around a simple idea: most of the machine should be a compile-time decision, not a fixed black box. The LEO execution core, its memory system, host attach, and the compiler that targets it are designed together so teams can sweep shapes, precisions, and topologies while keeping a single coherent ISA and tool flow.

We position the stack as among the most configurable inference-oriented engines available—not a single silicon point design, but a family you can specialize per program, board, or research question.

Training, Leo Compiler, host & Leo Execution Core

End-to-end view: QAT and the Leo Compiler produce weights, biases, and instructions; the host path prepares activations and streams everything over PCIe into the Leo Execution Core with DDR, MMU, and CSM. The host/driver path sits between the two columns detailed below.

Architecture diagram: Fully configurable LEO execution core with DDR and PCIe (MMU, CSM); host path from ML input through Leo Transform and activation to configurable driver, linked over PCIe; Leo compiler side with training DB, ML model, configurable QAT, Q model, weights, bias, instructions, and configurable Leo compiler.

Design-time knobs: Leo Execution Core & Leo Compiler

Leo Execution Core

This column is the RTL / silicon catalog we configure for each FPGA or ASIC target—fixed at generation time. It spans datapath, PE/MAC grades (performance / power / area), array shape, on-chip storage, activations, LLM-oriented units (layer norm, LUT SoftMax, transpose paths, and related parameters), instruction decoder (fetch / decode / issue), external DRAM/HBM, and PCIe host attach. Host-visible CSM gives observability and control (see note below the list). For transformers, those LLM blocks supply the primitives the Leo Compiler maps onto; the same core stays useful when the workload is not attention-heavy. Bullets are representative—the real space is larger and not limited to what is shown.

Processing core logic — BUS

  • DATA width — 4, 8, 16, …; readily extended to custom widths.
  • Accumulation width — arbitrary size (typical default 32).
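
A quick way to see why 32 bits is a comfortable accumulation default for an 8-bit DATA path: the widest signed product of two w-bit values needs 2w bits, and summing K of them adds roughly log2(K) guard bits. The sizing helper below is a hypothetical illustration, not part of the Leo toolchain.

```python
import math

def accumulator_bits(data_width: int, dot_length: int) -> int:
    """Bits needed to hold a dot product of `dot_length` signed
    data_width x data_width products without overflow."""
    # 2w bits for each product, plus ceil(log2(K)) guard bits for the sum.
    return 2 * data_width + math.ceil(math.log2(dot_length))

# An 8-bit datapath reducing a 4096-long dot product:
print(accumulator_bits(8, 4096))  # 16 + 12 = 28 bits -> fits the 32-bit default
```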

Processing many-core

  • Core dimension — 2×2, 4×4, 8×8, 16×16, 32×32, 64×64, 128×128, 256×256, …
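
As a sketch of what "fixed at generation time" means in practice, the record below models a few of the knobs above as a frozen configuration object. The parameter names (`data_width`, `array_dim`, and so on) are invented for illustration; the real RTL catalog is far larger.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CoreConfig:
    data_width: int = 8        # DATA bus width: 4, 8, 16, ...
    accum_width: int = 32      # accumulation width (typical default)
    array_dim: int = 16        # square PE array: 2, 4, ..., 256

    def __post_init__(self):
        # Validate against the catalog's supported points.
        assert self.data_width in (4, 8, 16), "extend for custom widths"
        assert self.array_dim & (self.array_dim - 1) == 0, "power-of-two array"

    @property
    def macs_per_cycle(self) -> int:
        return self.array_dim * self.array_dim

cfg = CoreConfig(data_width=8, array_dim=32)
print(cfg.macs_per_cycle)  # 1024 MACs issued per cycle for a 32x32 array
```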

PE MAC type & grades

  • PE / MAC implementation grades—catalog choices that trade performance, power, and area (e.g. higher-throughput vs denser low-leakage variants) selected at RTL generation time.
  • Leo specialized ultra-efficient core—optional MAC/PE style optimized for energy and area when the program targets maximum efficiency; remains on the same ISA family so the Leo Compiler flow stays consistent across grades.

Internal storage

  • Global buffer depth / capacity parameters.
  • Partial result buffer depth / capacity parameters.

Vectorized ALU

  • Compile-time selection of which functions the ALU exposes across a menu of supported operations—chosen when the core variant is generated.

Activation

  • ReLU, quantization-only paths, and extendable activation families.
  • Activation + residual add fusion—fused regions where dependencies and the datapath allow.
  • LUT-based custom activations—table geometry and interpolation matched to compiler and accuracy targets.
  • Rounding mode as a configurable choice per activation path (e.g. toward zero, nearest, stochastic) where the RTL variant supports it.
  • Further activation modes as the catalog grows.
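
To make the LUT-based activation idea concrete, here is a software model of table lookup with linear interpolation, as such a unit might evaluate a function. The table geometry (256 entries over [-8, 8]) and the choice of GELU are illustrative, not a statement of the shipped configuration.

```python
import math

def build_lut(fn, lo, hi, entries):
    """Sample fn at `entries` evenly spaced points on [lo, hi]."""
    step = (hi - lo) / (entries - 1)
    return [fn(lo + i * step) for i in range(entries)], lo, step

def lut_eval(lut, lo, step, x):
    # Locate the segment, clamp the index, interpolate between neighbors.
    idx = (x - lo) / step
    i = max(0, min(len(lut) - 2, int(idx)))
    frac = idx - i
    return lut[i] + frac * (lut[i + 1] - lut[i])

gelu = lambda x: 0.5 * x * (1 + math.erf(x / math.sqrt(2)))
lut, lo, step = build_lut(gelu, -8.0, 8.0, 256)
print(abs(lut_eval(lut, lo, step, 1.0) - gelu(1.0)) < 1e-3)  # True
```

Table size and interpolation order trade area against accuracy, which is why they appear as compiler- and accuracy-matched parameters above.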

LLM-related hardware parameters

  • Layer normalization — LUT size and width; internal BUS (distinct from DATA and partial-sum path); error recovery via Newton–Raphson (enable flag, iteration count); I/O matched to DATA and Result BUS.
  • SoftMax (LUT-based) — quantization settings; LUT size and width; internal BUS size; mean / variance compute sizing; inputs and outputs matched to DATA and Result BUS.
  • Transpose — automatically configured with processing core and BUS; further options as the catalog grows.
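
The LUT-based SoftMax above can be modeled in a few lines: subtract the row maximum so every exp argument is non-positive, look the exponential up in a fixed table, then normalize. The table size and the input quantization step are assumed values for illustration.

```python
import math

SCALE = 0.125                      # input quantization step (assumption)
EXP_LUT = [math.exp(-i * SCALE) for i in range(256)]   # exp(0) .. exp(-31.875)

def lut_softmax(q):
    """q: quantized integer logits. Returns a normalized distribution."""
    m = max(q)
    # m - x >= 0, so the table only needs the non-positive exp domain.
    e = [EXP_LUT[min(m - x, 255)] for x in q]
    s = sum(e)
    return [v / s for v in e]

probs = lut_softmax([10, 20, 30])
print(round(sum(probs), 6))        # 1.0 -- outputs form a distribution
```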

Instruction decoder (fetch / decode / issue)

  • Input instruction buffer size.
  • Instruction window size—how many instructions are visible for parallelism exploration and dependency analysis in the issue logic.
  • Instruction packet size—alignment with host link framing (e.g. PCIe).
  • Instruction compression and packing—optional encodings for denser streams and better PCIe utilization.
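
The packet-size and packing bullets can be pictured as grouping variable-length encoded instructions into fixed-size payloads aligned to the host link. The 256-byte framing unit below is an assumed figure, not a Leo constant.

```python
PACKET_BYTES = 256                 # assumed link framing unit

def pack(instrs):
    """instrs: list of encoded instructions (bytes). Returns padded packets."""
    packets, cur = [], b""
    for ins in instrs:
        if len(cur) + len(ins) > PACKET_BYTES:
            # Instruction would straddle the boundary: flush and start fresh.
            packets.append(cur.ljust(PACKET_BYTES, b"\x00"))
            cur = b""
        cur += ins
    if cur:
        packets.append(cur.ljust(PACKET_BYTES, b"\x00"))
    return packets

stream = [b"\x01" * 96, b"\x02" * 96, b"\x03" * 96]
print(len(pack(stream)))           # 2 -- first two fit one packet, third spills
```

Denser encodings (the compression bullet) raise how many instructions fit per packet, which is exactly the PCIe-utilization argument made above.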

External DRAM / HBM memory interface

  • Unified vs separate channels—traffic classes, dedicated paths for partial sums vs global buffer vs host-visible regions as configured.
  • Number of DRAM interfaces and bandwidth per channel (width, rate, burst behavior).
  • Support for heterogeneous DRAM and parallel write patterns where the memory subsystem allows.
  • DRAM addressing length — variable support for how many address bits (and which map semantics) the interface exposes to software.
  • Controller buffer sizing; read/write burst sizes; memory behavior control.
  • LUT-adjacent interface hooks where tied to SoftMax, LayerNorm, and similar units.

PCIe host attach

  • Number of PCIe channels / port instances (as integrated in the macro).
  • Width and sizing per channel—lanes per link, effective data width, buffer depth, and packet framing.
  • Lane count, bits per lane, and PHY/link parameters matched to board and SoC assumptions.
  • Instruction packet mapping from PCIe framing to the instruction ingress path (see instruction decoder above).

The CSM (configuration and status module) is the host-facing path for visibility into core operation—counters, status, and configuration—together with interrupt-based event handling and shared resource management (for example coordination when DRAM, ingress, and on-chip buffers are visible to both host and core). It is not enumerated line-by-line here; partners tune CSR maps, mutex policies, and counter banks per integration.
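
As a rough host-side picture of what "counters, status, and shared resource management" looks like, here is a toy CSR shim: a register map with a 64-bit cycle counter read as two words and a software mutex. Every offset, name, and field here is invented for illustration; real CSR maps are tuned per integration, as noted above.

```python
# Hypothetical CSR offsets -- illustrative only.
CSR = {"STATUS": 0x00, "CYCLES_LO": 0x08, "CYCLES_HI": 0x0C, "MUTEX": 0x10}

class CsmShim:
    def __init__(self):
        self.regs = {off: 0 for off in CSR.values()}   # stand-in for MMIO

    def read64(self, lo, hi):
        # Compose a 64-bit counter from two 32-bit register reads.
        return self.regs[CSR[lo]] | (self.regs[CSR[hi]] << 32)

    def try_lock(self):
        # Real hardware would make this an atomic test-and-set CSR so host
        # and core can coordinate access to shared DRAM/buffer regions.
        if self.regs[CSR["MUTEX"]] == 0:
            self.regs[CSR["MUTEX"]] = 1
            return True
        return False

csm = CsmShim()
csm.regs[CSR["CYCLES_LO"]], csm.regs[CSR["CYCLES_HI"]] = 0xFFFF, 0x1
print(hex(csm.read64("CYCLES_LO", "CYCLES_HI")))  # 0x10000ffff
```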

Leo Compiler

Fixed at compile time for a given build: lowering, optimization, and mapping choices aligned with the configured ISA—including fused vs separated operations (e.g. FMA vs distinct multiply/add). It chooses passes, scheduling, tiling, prefetch and DDR spill policy, and buffer partitioning so graphs match the Leo Execution Core you generated. On LLM-style graphs, the mapper tiles attention-heavy regions, pipelines and interleaves stages, and can use macro-style issue for long softmax flows or reshape+matmul; QAT and IR shaping track the precision and rounding options you set in hardware. Bullets are representative—options extend beyond the list.

IR, lowering & visualization

  • ONNX-oriented (and related) lowering; graph visibility for mapping and verification.
  • Quantization-aware training and IR-reduction options where enabled in the flow.
  • Options for how schedules and tilings are generated and inspected in tooling.
  • Knobs stay aligned with RTL parameters so what you compile matches what you configured in hardware.

Optimization passes

  • Additional passes: loop unrolling, tiling, pruning, reduction, and similar transforms.
  • Custom passes targeting specific hardware features or ISA extensions.

Parallelism & dependency chaining

  • Restrict or allow wide parallelism to match the configured core and hazards.
  • Restrict or relax dependency chaining (deep vs shallow dependence across ops).

Scheduling & tiling policy

  • Multi-threaded vs single-threaded scheduling.
  • Tiling policies—static vs dynamic splits; how large ops are broken for the ISA.
  • Choice among data prefetching policies—how far ahead tensors and tiles are staged vs compute.
  • Control of spill-out to DDR—when partial results or activations are allowed to leave on-chip storage for external memory.
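
A toy version of the tiling and spill decisions: split an (M, N, K) matmul into tiles matching the configured array dimension, and check whether keeping the whole output slab of partial sums on chip would exceed the partial-result buffer. `DIM` and the buffer capacity are illustrative parameters, not Leo defaults.

```python
import math

DIM = 32                       # configured array dimension (assumption)
PSUM_BYTES = 256 * 1024        # assumed partial-result buffer capacity

def tile_counts(M, N, K):
    """Tiles along each axis when large ops are broken for a DIM x DIM array."""
    return math.ceil(M / DIM), math.ceil(N / DIM), math.ceil(K / DIM)

def spills_to_ddr(M, N, accum_bytes=4):
    # Toy policy: spill if keeping the full output's partial sums resident
    # would exceed on-chip capacity, forcing traffic to external memory.
    return M * N * accum_bytes > PSUM_BYTES

print(tile_counts(512, 512, 4096))   # (16, 16, 128)
print(spills_to_ddr(512, 512))       # True -- 1 MiB of partials vs 256 KiB
```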

Unified buffer model

  • Division of the unified global buffer between weights and activations in the mapped program.
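
The weights-vs-activations split can be sketched as a simple partition of an assumed unified-buffer capacity; when both working sets fit, everything stays resident, otherwise each region is scaled down proportionally and the remainder streams. Capacity and policy here are illustrative.

```python
GLOBAL_BUFFER = 2 * 1024 * 1024   # assumed 2 MiB unified buffer

def partition(weight_bytes, act_bytes):
    """Return (weight_region, activation_region) byte allocations."""
    total = weight_bytes + act_bytes
    if total <= GLOBAL_BUFFER:
        return weight_bytes, GLOBAL_BUFFER - weight_bytes  # all-resident
    # Proportional split; what does not fit is streamed or spilled.
    w = GLOBAL_BUFFER * weight_bytes // total
    return w, GLOBAL_BUFFER - w

w, a = partition(3 * 1024 * 1024, 1024 * 1024)
print(w, a)  # 3/4 of the buffer to weights, 1/4 to activations
```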

Fusion vs separation

  • Fused vs separated operations (e.g. fused multiply–accumulate vs separate multiply and add)—a compiler decision matched to the RTL datapath.

LLM-oriented support

Attention, self-attention, and multi-head attention are not one monolithic “block” in the catalog. They are compositions of work the stack already exposes: GEMM-class layers, softmax, layer normalization, transposes, elementwise paths, and fused regions chosen in the Leo Compiler (see Fusion vs separation in the Leo Compiler column). The goal is efficient transformer inference while keeping arrays, memory paths, and specialized units busy—without siloed attention-only silicon that idles on the rest of the graph.
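
The composition argument above can be made concrete: single-head attention written purely in terms of the primitives the catalog names (GEMM, transpose, row softmax, elementwise scaling). This is a tiny pure-Python reference for the decomposition, not the mapped kernel.

```python
import math

def gemm(A, B):
    # Dense matmul: rows of A against columns of B.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax_rows(M):
    out = []
    for row in M:
        m = max(row)                       # max-subtract for stability
        e = [math.exp(x - m) for x in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

def attention(Q, K, V):
    d = len(Q[0])
    scores = gemm(Q, transpose(K))                     # GEMM + transpose
    scaled = [[s / math.sqrt(d) for s in r] for r in scores]  # elementwise
    return gemm(softmax_rows(scaled), V)               # softmax + GEMM

Q = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, Q, [[1.0, 2.0], [3.0, 4.0]])
print(len(out), len(out[0]))  # 2 2 -- output keeps the (seq, d) shape
```

Multi-head attention is the same composition repeated per head with reshapes in between, which is why the stack treats it as scheduling and fusion work rather than dedicated silicon.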

Together, Leo Execution Core and Leo Compiler configurability are why partners use LeoGreenAI as a research vehicle: you can change the machine in meaningful ways without abandoning the toolchain. For collaboration ideas, see Research partnerships.
