LeoGreenAI
Configurable AI Hardware + Full-Stack HW/SW Visibility • R&D partnership • FPGA inference enablement • RTL IP Licensing

End-to-end flow & observability

Diagram: model path and input path converge into host staging with DRAM; LEO execution core exchanges data with DRAM while consuming instructions; CSM spans configuration, locks, interrupts, and hardware performance counters.

In a typical inference run, work branches before it merges on the device. The diagram in the next section shows one physical attachment pattern; the steps below spell out responsibilities in order.

  1. Model path — A model is selected (often from the model zoo), then the LeoGreenAI compiler lowers it and emits an instruction stream plus the model data (weights, tables, and other persistent tensors) the program needs.
  2. Input path — Separately, an input source (sensor, file, network, database, synthetic generator, etc.) produces the input stream (tokens, frames, features) for the run or batch.
  3. Host staging — The host (or runtime) places the instruction stream on the path into the LEO execution core (e.g. via PCIe and an instruction ingress path) and stages model data and input stream into DRAM (and related system memory) as the mapping requires.
  4. Execution — During the run, data moves between DRAM and the core (and back) for activations, partial results, and memory-bound phases—while the core consumes instructions. Hardware-side tiling, skipping, and local reordering can change effective work from the compiler’s static plan.
  5. CSM everywhere — CSM spans configuration, locks, interrupts, and hardware performance counters so you can govern the engine and correlate compiler intent with measured behavior.
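The five steps above can be sketched as a host-side driver loop. Everything here is illustrative: `compile_model`, `Device`, `stage`, and `submit` are invented stand-ins, since the actual LeoGreenAI runtime API is not shown on this page.

```python
# Hypothetical host-side sketch of the five-step inference flow.
# All class and function names are illustrative, not the real LeoGreenAI API.
from dataclasses import dataclass


@dataclass
class CompiledProgram:
    instructions: bytes   # instruction stream for the LEO execution core
    model_data: bytes     # weights, tables, other persistent tensors


def compile_model(model_name: str) -> CompiledProgram:
    # Stand-in for the LeoGreenAI compiler: lower the model and emit
    # an instruction stream plus the model data it references.
    return CompiledProgram(instructions=b"\x01\x02", model_data=b"\x00" * 16)


class Device:
    """Stand-in for the LEO device: DRAM staging, instruction ingress, CSM."""

    def __init__(self):
        self.dram = {}
        self.counters = {"stalls": 0, "skips": 0, "utilization": 0.0}

    def stage(self, region: str, payload: bytes):
        self.dram[region] = payload          # step 3: host staging into DRAM

    def submit(self, instructions: bytes):
        # Step 4: the core consumes instructions while exchanging data
        # with DRAM; here we only record a placeholder measurement.
        self.counters["utilization"] = 0.9


prog = compile_model("example_model")        # step 1: model path
inputs = b"\x07" * 8                         # step 2: input path (frames/tokens)
dev = Device()
dev.stage("model_data", prog.model_data)
dev.stage("inputs", inputs)
dev.submit(prog.instructions)
print(dev.counters)                          # step 5: CSM view of the run
```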

Architecture diagram — physical data-path (example)

The figure illustrates one configuration among many. Here the host attaches over PCIe; two memory channels each include an MMU between the link and memory, one serving four DDR devices, the other serving HBM and paths into the core. Instructions follow PCIe into instruction input registers inside the LEO processing core. CSM (CSRs, counters, optional interrupts) sits on the same host attachment, providing control and measurement alongside the data path.

Channel count, DDR organization, capacities, widths, and topology are configuration decisions, not fixed by this drawing. Tuning the memory interface remains a central lever for model–hardware co-design; see Configurability for the wider option space.
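Since channel count and memory organization are build-time decisions, one simple way to capture a configuration point is a plain configuration record. The field names below are assumptions chosen to mirror the example drawing, not actual LeoGreenAI parameters.

```python
# Hypothetical configuration record for one memory-interface build point.
# Field names are illustrative; the drawing shows one configuration among many.
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryChannel:
    kind: str            # "DDR" or "HBM"
    devices: int         # e.g. four DDR devices behind one MMU
    mmu: bool = True     # each channel includes an MMU in the example


@dataclass
class LeoConfig:
    host_link: str = "PCIe"
    channels: List[MemoryChannel] = field(default_factory=list)


# The dual-channel example from the figure: four DDR devices on one
# channel, HBM on the other, both behind MMUs, host attached over PCIe.
example = LeoConfig(channels=[
    MemoryChannel(kind="DDR", devices=4),
    MemoryChannel(kind="HBM", devices=1),
])
print(example.host_link, len(example.channels))
```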

LeoGreenAI hardware–software stack: example physical data path with dual-channel memory (four DDR and HBM), PCIe to host, MMUs, and LEO core; one configuration among many.

Compiler: the plan

The compiler’s tiling and schedule are only part of the story. The LEO execution core also applies its own hardware-side tiling and sub-blocking as it maps logical tiles to arrays, buffers, and memory bursts. The full picture is a cooperative split: software proposes shape and ordering; hardware refines what actually runs on the floorplan. Interpreting performance therefore benefits from both compiler-side plans and on-chip behavior.
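The software/hardware split can be made concrete with a small sketch: the compiler proposes logical tiles over an operand, and the hardware further sub-blocks each tile to fit an array or burst size. Tile and array dimensions here are arbitrary examples, not LeoGreenAI parameters.

```python
# Illustrative split between software tiling and hardware sub-blocking.
# The 256x256 operand, 64-wide tiles, and 16-wide array are assumptions.
def software_tiles(rows: int, cols: int, tile: int):
    """Compiler side: propose logical tiles over the operand."""
    for r in range(0, rows, tile):
        for c in range(0, cols, tile):
            yield (r, c, min(tile, rows - r), min(tile, cols - c))


def hardware_subblocks(tile_h: int, tile_w: int, array: int):
    """Hardware side: refine each logical tile to the array/burst size."""
    for r in range(0, tile_h, array):
        for c in range(0, tile_w, array):
            yield (r, c, min(array, tile_h - r), min(array, tile_w - c))


plan = list(software_tiles(256, 256, 64))          # 16 logical tiles
refined = sum(len(list(hardware_subblocks(h, w, 16)))
              for (_, _, h, w) in plan)            # 16 sub-blocks per tile
print(len(plan), refined)                          # 16 256
```

The point of the split is that only the hardware knows the final mapping, which is why counting sub-blocks (or skips) requires on-chip counters rather than the static plan alone.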

Hardware execution: beyond the compiled schedule

The core is not a passive interpreter of a fixed micro-schedule. It can exploit structure in the data and in the instruction window to go faster than a literal “programmed” step count would suggest.

Because of these hardware-side effects (tile refinement, work skipping, local reordering), end-to-end analysis needs both views: compiler-side tiling, stream layout, and constraints, plus CSM-backed counters and traces (stalls, skips, reorder events, utilization).
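One way to combine the two views is to reconcile the compiler's static work count against what the counters report. The counter names (`tiles_executed`, `tiles_skipped`, `stall_cycles`) and values below are hypothetical.

```python
# Sketch of reconciling a compiler-side plan with CSM counter readings.
# Counter names and values are invented for illustration.
planned_tiles = 256                      # from the compiler-side report
csm = {"tiles_executed": 231, "tiles_skipped": 25, "stall_cycles": 1200}

executed = csm["tiles_executed"]
skipped = csm["tiles_skipped"]
# Sanity check: every planned tile was either executed or skipped.
assert executed + skipped == planned_tiles, "plan and counters disagree"

skip_rate = skipped / planned_tiles
print(f"hardware skipped {skip_rate:.1%} of the static plan")
```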

CSM: hardware statistics & control

The CSM (control and statistics module) is the read/write path between the host and the accelerator: configuration status, shared-resource locks, interrupts, debug handshakes, and hardware performance counters. Because the LEO execution core architecture is designed for extension, additional on-chip signals can be brought out as custom counter banks when a program needs deeper visibility.
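As a mental model, the CSM behaves like a memory-mapped register window: the host writes configuration CSRs, claims shared-resource locks, and reads counter banks. The register offsets and the test-and-set lock semantics below are assumptions for illustration, not the documented CSM layout.

```python
# Hypothetical memory-mapped CSM access: CSRs, a lock, and counter reads.
# Register offsets, names, and lock semantics are invented for illustration.
class CSM:
    CFG, LOCK, IRQ_STATUS, COUNTER_BASE = 0x00, 0x04, 0x08, 0x100

    def __init__(self):
        self.regs = {}                     # stand-in for an MMIO window

    def write(self, off: int, val: int):
        self.regs[off] = val

    def read(self, off: int) -> int:
        return self.regs.get(off, 0)

    def try_lock(self) -> bool:
        # Shared-resource lock sketched as test-and-set over a CSR.
        if self.read(self.LOCK) == 0:
            self.write(self.LOCK, 1)
            return True
        return False


csm = CSM()
first = csm.try_lock()                     # host acquires the engine
second = csm.try_lock()                    # a second claimant is refused
stalls = csm.read(csm.COUNTER_BASE + 0)    # read one performance counter
print(first, second, stalls)
```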

Why both layers matter

When comparing architectures or compiler variants under accuracy constraints, outcomes depend on how much parallelism actually materializes and how evenly resources are used. Software chooses graph-level tiling and stream layout; the core may refine tiles, skip work, or reorder locally. CSM together with compiler-side reports reconciles those two stories—linking planned structure to stalls, utilization, skip activity, and reorder slack in ways neither view alone can approximate reliably.
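"How evenly resources are used" can be quantified with ordinary statistics over per-unit utilization read from counters. The four-lane utilization figures below are made up; only the evenness metric itself (one minus the coefficient of variation) is standard.

```python
# One simple evenness check over per-unit utilization from CSM counters.
# The per-lane values are invented; the metric is ordinary statistics.
from statistics import mean, pstdev

per_unit_util = [0.92, 0.88, 0.41, 0.90]   # e.g. four compute lanes
evenness = 1 - pstdev(per_unit_util) / mean(per_unit_util)
print(f"mean={mean(per_unit_util):.2f} evenness={evenness:.2f}")
```

A low evenness score flags exactly the case the paragraph describes: planned parallelism that did not materialize uniformly across the floorplan.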
