LeoGreenAI
Configurable AI Hardware + Full Stack HW/SW Visibility • R&D partnership • FPGA inference enablement • RTL IP Licensing

Configurability end to end

LeoGreenAI is built around a simple idea: most of the machine should be a compile-time decision, not a fixed black box. The LEO execution core, its memory system, host attach, and the compiler that targets it are designed together so teams can sweep shapes, precisions, and topologies while keeping a single coherent ISA and tool flow.

We position the stack as among the most configurable inference-oriented engines available—not a single silicon point design, but a family you can specialize per program, board, or research question.

Training, Leo Compiler, host & Leo Execution Core

End-to-end view: QAT and the Leo Compiler produce weights, biases, and instructions; the host path prepares activations and streams everything over PCIe into the Leo Execution Core with DDR, MMU, and CSM. The host/driver path sits between the two columns detailed below.

Architecture diagram: Fully configurable LEO execution core with DDR and PCIe (MMU, CSM); host path from ML input through Leo Transform and activation to configurable driver, linked over PCIe; Leo compiler side with training DB, ML model, configurable QAT, Q model, weights, bias, instructions, and configurable Leo compiler.

Design-time knobs: Leo Execution Core & Leo Compiler

Leo Execution Core

This column is the RTL / silicon catalog we configure for each FPGA or ASIC target—fixed at generation time. It spans datapath, PE/MAC grades (performance / power / area), array shape, on-chip storage, activations, LLM-oriented units (layer norm, LUT SoftMax, transpose paths, and related parameters), instruction decoder (fetch / decode / issue), external DRAM/HBM, and PCIe host attach. Host-visible CSM gives observability and control (see note below the list). For transformers, those LLM blocks supply the primitives the Leo Compiler maps onto; the same core stays useful when the workload is not attention-heavy. Bullets are representative—the real space is larger and not limited to what is shown.

Processing core logic — BUS

  • DATA width — 4, 8, 16, …; readily extended to custom widths.
  • Accumulation width — arbitrary size (typical default 32).
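
A quick way to see why 32 bits is a comfortable accumulation default for an 8-bit DATA path: the widest signed product of two w-bit values needs 2w bits, and summing K of them adds roughly log2(K) guard bits. The sizing helper below is a hypothetical illustration, not part of the Leo toolchain.

```python
import math

def accumulator_bits(data_width: int, dot_length: int) -> int:
    """Bits needed to hold a dot product of `dot_length` signed
    data_width x data_width products without overflow."""
    # 2w bits for each product, plus ceil(log2(K)) guard bits for the sum.
    return 2 * data_width + math.ceil(math.log2(dot_length))

# An 8-bit datapath reducing a 4096-long dot product:
print(accumulator_bits(8, 4096))  # 16 + 12 = 28 bits -> fits the 32-bit default
```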

Processing many-core

  • Core dimension — 2×2, 4×4, 8×8, 16×16, 32×32, 64×64, 128×128, 256×256, …
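
As a sketch of what "fixed at generation time" means in practice, the record below models a few of the knobs above as a frozen configuration object. The parameter names (`data_width`, `array_dim`, and so on) are invented for illustration; the real RTL catalog is far larger.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CoreConfig:
    data_width: int = 8        # DATA bus width: 4, 8, 16, ...
    accum_width: int = 32      # accumulation width (typical default)
    array_dim: int = 16        # square PE array: 2, 4, ..., 256

    def __post_init__(self):
        # Validate against the catalog's supported points.
        assert self.data_width in (4, 8, 16), "extend for custom widths"
        assert self.array_dim & (self.array_dim - 1) == 0, "power-of-two array"

    @property
    def macs_per_cycle(self) -> int:
        return self.array_dim * self.array_dim

cfg = CoreConfig(data_width=8, array_dim=32)
print(cfg.macs_per_cycle)  # 1024 MACs issued per cycle for a 32x32 array
```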

PE MAC type & grades

  • PE / MAC implementation grades—catalog choices that trade performance, power, and area (e.g. higher-throughput vs denser low-leakage variants) selected at RTL generation time.
  • Leo specialized ultra-efficient core—optional MAC/PE style optimized for energy and area when the program targets maximum efficiency; remains on the same ISA family so the Leo Compiler flow stays consistent across grades.

Internal storage

  • Global buffer depth / capacity parameters.
  • Partial result buffer depth / capacity parameters.

Vectorized ALU

  • Compile-time selection of which functions the ALU exposes across a menu of supported operations—chosen when the core variant is generated.

Activation

  • ReLU, quantization-only paths, and extendable activation families.
  • Activation + residual add fusion—fused regions where dependencies and the datapath allow.
  • LUT-based custom activations—table geometry and interpolation matched to compiler and accuracy targets.
  • Rounding mode as a configurable choice per activation path (e.g. toward zero, nearest, stochastic) where the RTL variant supports it.
  • Further activation modes as the catalog grows.
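
To make the LUT-based activation idea concrete, here is a software model of table lookup with linear interpolation, as such a unit might evaluate a function. The table geometry (256 entries over [-8, 8]) and the choice of GELU are illustrative, not a statement of the shipped configuration.

```python
import math

def build_lut(fn, lo, hi, entries):
    """Sample fn at `entries` evenly spaced points on [lo, hi]."""
    step = (hi - lo) / (entries - 1)
    return [fn(lo + i * step) for i in range(entries)], lo, step

def lut_eval(lut, lo, step, x):
    # Locate the segment, clamp the index, interpolate between neighbors.
    idx = (x - lo) / step
    i = max(0, min(len(lut) - 2, int(idx)))
    frac = idx - i
    return lut[i] + frac * (lut[i + 1] - lut[i])

gelu = lambda x: 0.5 * x * (1 + math.erf(x / math.sqrt(2)))
lut, lo, step = build_lut(gelu, -8.0, 8.0, 256)
print(abs(lut_eval(lut, lo, step, 1.0) - gelu(1.0)) < 1e-3)  # True
```

Table size and interpolation order trade area against accuracy, which is why they appear as compiler- and accuracy-matched parameters above.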

LLM-related hardware parameters

  • Layer normalization — LUT size and width; internal BUS (distinct from DATA and partial-sum path); error recovery via Newton–Raphson (enable flag, iteration count); I/O matched to DATA and Result BUS.
  • SoftMax (LUT-based) — quantization settings; LUT size and width; internal BUS size; mean / variance compute sizing; inputs and outputs matched to DATA and Result BUS.
  • Transpose — automatically configured with processing core and BUS; further options as the catalog grows.
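
The LUT-based SoftMax above can be modeled in a few lines: subtract the row maximum so every exp argument is non-positive, look the exponential up in a fixed table, then normalize. The table size and the input quantization step are assumed values for illustration.

```python
import math

SCALE = 0.125                      # input quantization step (assumption)
EXP_LUT = [math.exp(-i * SCALE) for i in range(256)]   # exp(0) .. exp(-31.875)

def lut_softmax(q):
    """q: quantized integer logits. Returns a normalized distribution."""
    m = max(q)
    # m - x >= 0, so the table only needs the non-positive exp domain.
    e = [EXP_LUT[min(m - x, 255)] for x in q]
    s = sum(e)
    return [v / s for v in e]

probs = lut_softmax([10, 20, 30])
print(round(sum(probs), 6))        # 1.0 -- outputs form a distribution
```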

Instruction decoder (fetch / decode / issue)

  • Input instruction buffer size.
  • Instruction window size—how many instructions are visible for parallelism exploration and dependency analysis in the issue logic.
  • Instruction packet size—alignment with host link framing (e.g. PCIe).
  • Instruction compression and packing—optional encodings for denser streams and better PCIe utilization.
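
The packet-size and packing bullets can be pictured as grouping variable-length encoded instructions into fixed-size payloads aligned to the host link. The 256-byte framing unit below is an assumed figure, not a Leo constant.

```python
PACKET_BYTES = 256                 # assumed link framing unit

def pack(instrs):
    """instrs: list of encoded instructions (bytes). Returns padded packets."""
    packets, cur = [], b""
    for ins in instrs:
        if len(cur) + len(ins) > PACKET_BYTES:
            # Instruction would straddle the boundary: flush and start fresh.
            packets.append(cur.ljust(PACKET_BYTES, b"\x00"))
            cur = b""
        cur += ins
    if cur:
        packets.append(cur.ljust(PACKET_BYTES, b"\x00"))
    return packets

stream = [b"\x01" * 96, b"\x02" * 96, b"\x03" * 96]
print(len(pack(stream)))           # 2 -- first two fit one packet, third spills
```

Denser encodings (the compression bullet) raise how many instructions fit per packet, which is exactly the PCIe-utilization argument made above.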

External DRAM / HBM memory interface

  • Unified vs separate channels—traffic classes, dedicated paths for partial sums vs global buffer vs host-visible regions as configured.
  • Number of DRAM interfaces and bandwidth per channel (width, rate, burst behavior).
  • Support for heterogeneous DRAM and parallel write patterns where the memory subsystem allows.
  • DRAM addressing length — variable support for how many address bits (and which map semantics) the interface exposes to software.
  • Controller buffer sizing; read/write burst sizes; memory behavior control.
  • LUT-adjacent interface hooks where tied to SoftMax, LayerNorm, and similar units.

PCIe host attach

  • Number of PCIe channels / port instances (as integrated in the macro).
  • Width and sizing per channel—lanes per link, effective data width, buffer depth, and packet framing.
  • Lane count, bits per lane, and PHY/link parameters matched to board and SoC assumptions.
  • Instruction packet mapping from PCIe framing to the instruction ingress path (see instruction decoder above).

The CSM (configuration and status module) is the host-facing path for visibility into core operation—counters, status, and configuration—together with interrupt-based event handling and shared resource management (for example coordination when DRAM, ingress, and on-chip buffers are visible to both host and core). It is not enumerated line-by-line here; partners tune CSR maps, mutex policies, and counter banks per integration.
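
As a rough host-side picture of what "counters, status, and shared resource management" looks like, here is a toy CSR shim: a register map with a 64-bit cycle counter read as two words and a software mutex. Every offset, name, and field here is invented for illustration; real CSR maps are tuned per integration, as noted above.

```python
# Hypothetical CSR offsets -- illustrative only.
CSR = {"STATUS": 0x00, "CYCLES_LO": 0x08, "CYCLES_HI": 0x0C, "MUTEX": 0x10}

class CsmShim:
    def __init__(self):
        self.regs = {off: 0 for off in CSR.values()}   # stand-in for MMIO

    def read64(self, lo, hi):
        # Compose a 64-bit counter from two 32-bit register reads.
        return self.regs[CSR[lo]] | (self.regs[CSR[hi]] << 32)

    def try_lock(self):
        # Real hardware would make this an atomic test-and-set CSR so host
        # and core can coordinate access to shared DRAM/buffer regions.
        if self.regs[CSR["MUTEX"]] == 0:
            self.regs[CSR["MUTEX"]] = 1
            return True
        return False

csm = CsmShim()
csm.regs[CSR["CYCLES_LO"]], csm.regs[CSR["CYCLES_HI"]] = 0xFFFF, 0x1
print(hex(csm.read64("CYCLES_LO", "CYCLES_HI")))  # 0x10000ffff
```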

Leo Compiler

Fixed at compile time for a given build: lowering, optimization, and mapping choices aligned with the configured ISA—including fused vs separated operations (e.g. FMA vs distinct multiply/add). It chooses passes, scheduling, tiling, prefetch and DDR spill policy, and buffer partitioning so graphs match the Leo Execution Core you generated. On LLM-style graphs, the mapper tiles attention-heavy regions, pipelines and interleaves stages, and can use macro-style issue for long softmax flows or reshape+matmul; QAT and IR shaping track the precision and rounding options you set in hardware. Bullets are representative—options extend beyond the list.

IR, lowering & visualization

  • ONNX-oriented (and related) lowering; graph visibility for mapping and verification.
  • Quantization-aware training and IR-reduction options where enabled in the flow.
  • Options for how schedules and tilings are generated and inspected in tooling.
  • Knobs stay aligned with RTL parameters so what you compile matches what you configured in hardware.

Optimization passes

  • Additional passes: loop unrolling, tiling, pruning, reduction, and similar transforms.
  • Custom passes targeting specific hardware features or ISA extensions.

Parallelism & dependency chaining

  • Restrict or allow wide parallelism to match the configured core and hazards.
  • Restrict or relax dependency chaining (deep vs shallow dependence across ops).

Scheduling & tiling policy

  • Multi-threaded vs single-threaded scheduling.
  • Tiling policies—static vs dynamic splits; how large ops are broken for the ISA.
  • Choice among data prefetching policies—how far ahead tensors and tiles are staged vs compute.
  • Control of spill-out to DDR—when partial results or activations are allowed to leave on-chip storage for external memory.
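
A toy version of the tiling and spill decisions: split an (M, N, K) matmul into tiles matching the configured array dimension, and check whether keeping the whole output slab of partial sums on chip would exceed the partial-result buffer. `DIM` and the buffer capacity are illustrative parameters, not Leo defaults.

```python
import math

DIM = 32                       # configured array dimension (assumption)
PSUM_BYTES = 256 * 1024        # assumed partial-result buffer capacity

def tile_counts(M, N, K):
    """Tiles along each axis when large ops are broken for a DIM x DIM array."""
    return math.ceil(M / DIM), math.ceil(N / DIM), math.ceil(K / DIM)

def spills_to_ddr(M, N, accum_bytes=4):
    # Toy policy: spill if keeping the full output's partial sums resident
    # would exceed on-chip capacity, forcing traffic to external memory.
    return M * N * accum_bytes > PSUM_BYTES

print(tile_counts(512, 512, 4096))   # (16, 16, 128)
print(spills_to_ddr(512, 512))       # True -- 1 MiB of partials vs 256 KiB
```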

Unified buffer model

  • Division of the unified global buffer between weights and activations in the mapped program.
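
The weights-vs-activations split can be sketched as a simple partition of an assumed unified-buffer capacity; when both working sets fit, everything stays resident, otherwise each region is scaled down proportionally and the remainder streams. Capacity and policy here are illustrative.

```python
GLOBAL_BUFFER = 2 * 1024 * 1024   # assumed 2 MiB unified buffer

def partition(weight_bytes, act_bytes):
    """Return (weight_region, activation_region) byte allocations."""
    total = weight_bytes + act_bytes
    if total <= GLOBAL_BUFFER:
        return weight_bytes, GLOBAL_BUFFER - weight_bytes  # all-resident
    # Proportional split; what does not fit is streamed or spilled.
    w = GLOBAL_BUFFER * weight_bytes // total
    return w, GLOBAL_BUFFER - w

w, a = partition(3 * 1024 * 1024, 1024 * 1024)
print(w, a)  # 3/4 of the buffer to weights, 1/4 to activations
```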

Fusion vs separation

  • Fused vs separated operations (e.g. fused multiply–accumulate vs separate multiply and add)—a compiler decision matched to the RTL datapath.

LLM-oriented support

Attention, self-attention, and multi-head attention are not one monolithic “block” in the catalog. They are compositions of work the stack already exposes: GEMM-class layers, softmax, layer normalization, transposes, elementwise paths, and fused regions chosen in the Leo Compiler (see Fusion vs separation in the Leo Compiler column). The goal is efficient transformer inference while keeping arrays, memory paths, and specialized units busy—without siloed attention-only silicon that idles on the rest of the graph.
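
The composition argument above can be made concrete: single-head attention written purely in terms of the primitives the catalog names (GEMM, transpose, row softmax, elementwise scaling). This is a tiny pure-Python reference for the decomposition, not the mapped kernel.

```python
import math

def gemm(A, B):
    # Dense matmul: rows of A against columns of B.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax_rows(M):
    out = []
    for row in M:
        m = max(row)                       # max-subtract for stability
        e = [math.exp(x - m) for x in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

def attention(Q, K, V):
    d = len(Q[0])
    scores = gemm(Q, transpose(K))                     # GEMM + transpose
    scaled = [[s / math.sqrt(d) for s in r] for r in scores]  # elementwise
    return gemm(softmax_rows(scaled), V)               # softmax + GEMM

Q = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, Q, [[1.0, 2.0], [3.0, 4.0]])
print(len(out), len(out[0]))  # 2 2 -- output keeps the (seq, d) shape
```

Multi-head attention is the same composition repeated per head with reshapes in between, which is why the stack treats it as scheduling and fusion work rather than dedicated silicon.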

Together, Leo Execution Core and Leo Compiler configurability are why partners use LeoGreenAI as a research vehicle: you can change the machine in meaningful ways without abandoning the toolchain. For collaboration ideas, see Research partnerships.
