This post introduces basic VHDL syntax by means of examples. I wrote it about a year ago as a guide for students who started developing hardware for the CompSOC platform, but I think the information is generally useful. I left most of the CompSOC-specific bits in the tutorial for the sake of completeness (the occasional CompSOC: label marks them). There is also a section on DTL, which may be interesting for academic readers who wonder how the sentence "DTL, which is similar to AXI", which I wrote in my thesis and some papers, actually holds up.

I wrote the guide with Xilinx tools in mind (version 14.7). It gradually introduces new language constructs. Although it is far from complete, you should be able to find the most frequently used snippets in here.

Table of Contents

  1. Main tutorial
    1. Unclocked logic
    2. Clocked / Sequential logic
    3. FSM (Finite State Machines)
    4. Loops
    5. More Constants
    6. Conditional instantiation
    7. Arrays
    8. Records (like structs for C++)
    9. Functions
    10. Naming Conventions and Code Style
    11. Do’s and Don’ts (mostly don’ts)
    12. XPS quick re-synthesis
  2. DTL
    1. Signals
    2. Protocol definition
    3. Default accept implementation
    4. CompSOC: DTL proxy blocks
    5. Proxy added latency
    6. Best DTL practices

Main tutorial

Unclocked logic

Here’s a simple example of unclocked / combinatorial logic. Operations are performed directly on the signals.

-- Lines starting with -- are comments
-- Data types and function are imported from libraries
library ieee;
-- Basic data types (std_logic, std_logic_vector)
use ieee.std_logic_1164.all;
-- Type conversion function (to_integer, unsigned)
-- and operators (+, -) for unsigned numbers.
use ieee.numeric_std.all;
-- Comparison operators for std_logic_vector
-- (note: this is a non-standard Synopsys package, although widely supported)
use ieee.std_logic_unsigned.all;

-- Port level description of the hardware. Sets input and output names.
entity example is
port(
a : in std_logic; -- The basic data type. Use it.
b : in std_logic;
c : out std_logic
);
end example;

-- Internals of this hardware block
architecture rtl of example is
-- Before the 'begin' keyword, extra signals can be defined
signal d : std_logic;
begin
c <= not d;
d <= a and b;
end rtl;
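
If you want to check a block like this in a simulator before synthesizing it, a small self-checking testbench helps. The following is only a sketch: the name example_tb and the 10 ns delays are my own choices, and the example block is repeated so the file compiles on its own.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Copy of the 'example' block from above, so this file is self-contained.
entity example is
    port(
        a : in  std_logic;
        b : in  std_logic;
        c : out std_logic
    );
end example;

architecture rtl of example is
    signal d : std_logic;
begin
    c <= not d;
    d <= a and b;
end rtl;

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench: it has no ports and is for simulation only.
entity example_tb is
end example_tb;

architecture sim of example_tb is
    signal a, b : std_logic := '0';
    signal c    : std_logic;
begin
    -- Instantiate the device under test directly from the work library
    dut : entity work.example port map(a => a, b => b, c => c);

    stimulus : process
    begin
        a <= '0'; b <= '0'; wait for 10 ns;
        assert c = '1' report "expected c = 1 for a=0, b=0" severity failure;
        a <= '1'; b <= '0'; wait for 10 ns;
        assert c = '1' report "expected c = 1 for a=1, b=0" severity failure;
        a <= '1'; b <= '1'; wait for 10 ns;
        assert c = '0' report "expected c = 0 for a=1, b=1" severity failure;
        report "example_tb finished";
        wait; -- suspend forever, which ends the simulation
    end process stimulus;
end sim;
```

The wait statements are fine here because a testbench is never synthesized; do not use them in the hardware description itself.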

Clocked / Sequential logic

The following example shows a hardware design with a clock and a reset.

The example starts with a header-like structure as shown before:

-- Clocked process example.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use ieee.std_logic_unsigned.all;

entity example2 is
port(
clk : in std_logic;
rst_n : in std_logic;
operand0 : in std_logic;
operand1 : in std_logic;
action : in std_logic;
result : out std_logic -- Note: no semicolon on last line.
);
end example2;

architecture rtl of example2 is
-- The _r signal represents the register.
-- The _nxt signal will be the value which we
-- write into the register on clock edges.
signal result_nxt, result_r : std_logic;

Constants can be used to give a nice name to values with a specific meaning:

    -- You may define constants in the following manner:
constant OP_XOR : std_logic := '0';
constant OP_MAGIC : std_logic := '1';
-- Note the single quotes around the 1-bit std_logic value. No brackets.
begin

This hardware block is split into two processes. In general, you can have as many processes as you like within your design. In this case, the synchronous (clocked) part of the hardware is separated from the combinatorial part. The signals between the brackets after the process keyword form the sensitivity list: the wires/signals/registers to which the process is sensitive.

Separation of sequential and combinatorial logic

In general, the sequential process is:

  • Sensitive to rising edge of clock
  • Stores values to registers
-- Synchronous process, with synchronous reset
process(clk) -- Note: only sensitive to clk.
begin
if rising_edge(clk) then
if rst_n = '0' then
-- On reset, value set to 0
result_r <= '0';
else
-- Load new value in register
result_r <= result_nxt;
end if;
end if;
end process;

The reset is synchronous, i.e. the block only responds to the reset signal on (rising) clock edges. It is active low, i.e. the block gets reset when the reset wire contains the value 0.

The second process is purely combinatorial. It generally:

  1. Creates variables for the results it will generate.
  2. Assigns a default value to each variable at the start of the process, to prevent latches.
  3. May use don’t cares ('-') where you don’t want to assign a specific value.
  4. Performs operations on the variables.
  5. Writes the variables to the data inputs of registers, or to output ports of the block.
combinatial_proc: process(operand0, operand1, action, result_r)
-- (Note: the name (combinatial_proc) is optional.)
-- The process is sensitive to all signals it will use.
-- Not adding all used signals will make simulation results
-- different from synthesis results.
-- 1) Create a variable for all the outputs of this process
variable var_result_nxt : std_logic;
begin
-- 2) Assign defaults to the variables.
var_result_nxt := result_r; -- Note: := assigns to variables
-- 3) Do the fancy operation
-- No brackets around the argument of the if-statement.
if action = OP_XOR then
var_result_nxt := operand0 xor operand1;
else
var_result_nxt := not (operand0 or result_r) and operand1;
end if;
-- 4) Assign variable to input of register
result_nxt <= var_result_nxt; -- Note: <= assigns to wires.
end process combinatial_proc;
    -- Connect the register output to an output port.
-- This is fully combinatorial, so we are allowed to do this from
-- outside the process.
-- (But it is also allowed to do it from within one).
result <= result_r;
end rtl;

Most hardware can be built according to this 2-process structure.
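
To see why the default assignment at the start of the combinatorial process matters, here is a small sketch that puts a process without a default next to one with a default. The entity latch_demo and its port names are made up for this illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical demo block; names are chosen for this sketch only.
entity latch_demo is
    port(
        en        : in  std_logic;
        d         : in  std_logic;
        q_latched : out std_logic;
        q_safe    : out std_logic
    );
end latch_demo;

architecture rtl of latch_demo is
begin
    -- BAD: no default. When en = '0' nothing is assigned, so the output
    -- has to remember its old value and synthesis infers a latch.
    bad_proc : process(en, d)
    begin
        if en = '1' then
            q_latched <= d;
        end if;
    end process bad_proc;

    -- GOOD: the default covers every path through the process,
    -- so the result is purely combinatorial (en and d).
    good_proc : process(en, d)
        variable var_q : std_logic;
    begin
        var_q := '0'; -- default first
        if en = '1' then
            var_q := d;
        end if;
        q_safe <= var_q;
    end process good_proc;
end rtl;
```

In simulation, q_latched keeps its last value when en drops to '0', while q_safe returns to '0'. Synthesis tools normally warn about the inferred latch, so it is worth reading the synthesis log for such messages.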

FSM (Finite State Machines)

Quite often the blocks we use resemble a finite state machine. Certain VHDL constructs are helpful when describing them. This example introduces generics, std_logic_vectors, subtypes and FSMs / case-statements.

We start with a new header. Note the new section in the entity description called generic. It contains a list of parameters, which we later use to size the width of the input/output ports. They add a great deal of re-usability to your design, so it is highly recommended to use them.

-- State machine example.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use ieee.std_logic_unsigned.all;

entity example3 is
generic(
-- Use generics for configurable parameters
DATA_WIDTH : natural := 32; -- A 'natural' is a non-negative integer.
NUM_OPERATION_BITS : natural := 2
);
port(
clk : in std_logic;
rst_n : in std_logic;
-- Always use the "size - 1 downto 0" pattern.
operand0 : in std_logic_vector(DATA_WIDTH - 1 downto 0);
action : in std_logic_vector(NUM_OPERATION_BITS - 1 downto 0);
result : out std_logic_vector(DATA_WIDTH - 1 downto 0)
);
end example3;

Another construct which helps maintainability and readability is a subtype. It allows you to specify custom types based on existing types:

architecture rtl of example3 is
-- Define subtypes for commonly used vectors
subtype DATA_T is std_logic_vector(DATA_WIDTH - 1 downto 0);
subtype ACTION_T is std_logic_vector(NUM_OPERATION_BITS - 1 downto 0);

-- Use constants for internal parameters
constant OPER_LOAD : std_logic := '1';
constant OPER_OTHER : ACTION_T := "01";
-- This creates an std_logic_vector, with the value 1,
-- built from DATA_WIDTH bits.
constant DATA_ONE : DATA_T :=
std_logic_vector(to_unsigned(1, DATA_WIDTH));

signal result_nxt, result_r : DATA_T;
signal operand_nxt, operand_r : DATA_T;

When building an FSM you probably want to define states, using the type keyword. This is quite similar to the enum from C++:

    --FSM states
type STATE_T is (LOAD_R0_S, LOAD_R1_S, BUSY_S);
signal state_r, state_nxt : STATE_T;
begin
-- Synchronous process, with synchronous reset
process(clk) -- Note: only sensitive to clk.
begin
if rising_edge(clk) then
-- No reset required for this register
operand_r <= operand_nxt;
if rst_n = '0' then
-- Setting entire vector can be done with (others => )
result_r <= (others => '0');
state_r <= LOAD_R0_S;
else
-- Load new value in register
result_r <= result_nxt;
state_r <= state_nxt;
end if;
end if;
end process;

-- Register to output port
result <= result_r;

Using the pre-defined constants / types and subtypes makes it very easy to quickly see the purpose of variables, and changing widths / types can be done in a central place, instead of all over the code.

The next snippet also shows the case statement, which describes what the FSM does. It is good to add the when others => null; line, since ModelSim complains if you don’t (Xilinx does not care).

combinatial_proc: process(operand0, action, result_r, operand_r, state_r)
-- 1) Create a variable for all the outputs of this process
variable var_result_nxt : DATA_T;
variable var_operand_nxt : DATA_T;
variable var_state_nxt : STATE_T;
begin
-- 2) Assign defaults to the variables.
var_result_nxt := result_r;
var_operand_nxt := operand_r;
var_state_nxt := state_r;

state_case: case state_r is
when LOAD_R0_S =>
if action(0) = OPER_LOAD then
var_operand_nxt := operand0;
var_state_nxt := LOAD_R1_S;
end if;

when LOAD_R1_S =>
if action(0) = OPER_LOAD then
var_result_nxt := operand0;
var_state_nxt := BUSY_S;
end if;

when BUSY_S =>
-- numeric_std does not define the `+`-operator on std_logic_vectors.
-- Therefore, we cast to 'unsigned', do the addition, and
-- then transform back to an std_logic_vector.
var_result_nxt := std_logic_vector(unsigned(result_r) +
unsigned(result_r));
-- A numerical constant 1:
var_operand_nxt := std_logic_vector(unsigned(operand_r)-1);
if operand_r = DATA_ONE then
var_state_nxt := LOAD_R0_S;
-- Round brackets on an std_logic_vector index a
-- specific bit.
-- (in this case the most significant one).
var_result_nxt(DATA_WIDTH-1) := '1';
end if;
when others => null;
end case;
-- 4) Assign variable to input of register
result_nxt <= var_result_nxt;
operand_nxt <= var_operand_nxt;
state_nxt <= var_state_nxt;
end process combinatial_proc;
end rtl;

Loops

An example with a for loop within a process. It also contains a wrapping counter:

-- For loop example
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity for_loop is
port(
clk : in std_logic;
rst_n : in std_logic;
sel : in std_logic_vector(2 downto 0);
outb : out std_logic
);
end for_loop;

architecture rtl of for_loop is
constant DATA_WIDTH : natural := 8;
subtype VEC_T is std_logic_vector(DATA_WIDTH - 1 downto 0);
signal vec_nxt, vec_r : VEC_T;
signal out_nxt, out_r : std_logic;

begin
process(clk)
begin
if rising_edge(clk) then
if rst_n = '0' then
vec_r <= (others => '0');
out_r <= '0';
else
vec_r <= vec_nxt;
out_r <= out_nxt;
end if;
end if;
end process;

comb_proc: process(vec_r, out_r, sel)
variable var_vec_nxt : VEC_T;
variable var_out_nxt : std_logic;
begin
-- A wrapping counter:
var_vec_nxt := std_logic_vector(unsigned(vec_r) + 1);
var_out_nxt := out_r;
-- Note: I am not sure what this for loop synthesizes into.
-- I suspect it turns into a long chain of back-to-back connected
-- muxes. Writing something like:
-- var_out_nxt := vec_r(to_integer(unsigned(sel)));
-- is probably better, since it leaves the
-- mux-inference to the synthesis tools.
-- (You would still have to assert that 'sel'
-- is bounded between 0 and DATA_WIDTH - 1).
for i in 0 to (DATA_WIDTH - 1) loop
if i = unsigned(sel) then
var_out_nxt := vec_r(i);
end if;
end loop;
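vec_nxt <= var_vec_nxt;
out_nxt <= var_out_nxt; -- Without this line, out_r would never get a new value.
end process comb_proc;

-- Register to output port
outb <= out_r;
end rtl;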
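
The comment inside the loop suggests indexing the vector directly instead. A minimal sketch of that alternative could look as follows. The entity index_mux and the SEL_WIDTH generic are my own additions for this illustration, and I left out the registers to keep it short.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical block: selects one bit of 'vec' by direct indexing.
entity index_mux is
    generic(
        DATA_WIDTH : natural := 8;
        SEL_WIDTH  : natural := 3 -- assumption: 2**SEL_WIDTH <= DATA_WIDTH
    );
    port(
        vec  : in  std_logic_vector(DATA_WIDTH - 1 downto 0);
        sel  : in  std_logic_vector(SEL_WIDTH - 1 downto 0);
        outb : out std_logic
    );
end index_mux;

architecture rtl of index_mux is
begin
    -- Every value that 'sel' can take must be a valid index into 'vec'.
    -- The condition only depends on generics, so it is checked once.
    assert 2**SEL_WIDTH <= DATA_WIDTH
        report "sel can address bits outside of vec" severity failure;

    -- Cast to unsigned, convert to integer, and use the result as index.
    -- Mux inference is left to the synthesis tool.
    outb <= vec(to_integer(unsigned(sel)));
end rtl;
```

Whether this results in better hardware than the loop depends on the tool, so treat it as something to try and compare in the synthesis report.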

More Constants

Hexadecimal constants may come in handy. The literal below has 8 hex digits, so WIDTH has to be 32:

constant HEX_CONSTANT : std_logic_vector(WIDTH -1 downto 0) := X"BADC0FFE";

Conditional instantiation

Conditional instantiation of hardware can be used to enable or disable parts of the design based on a generic, like the size of a bus for example. In this example we enable a block of debugging hardware based on the value of the ENABLE_MEMCTRL_DEBUG flag:

gen_controller_debug_off : if (ENABLE_MEMCTRL_DEBUG = 0) generate
dbg_wr_fifo_rd_valid <= '0';
dbg_wr_fifo_rd <= (others => '0');
end generate;

gen_controller_debug_on : if (ENABLE_MEMCTRL_DEBUG = 1) generate
-- Instantiate complex and large debug structure here
-- (...)
end generate;
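
The snippet above is only a fragment, so here is a complete sketch of the same pattern that you can compile and simulate. The entity gen_demo, the ENABLE_DEBUG generic and the debug counter are invented for this illustration; they are not taken from the memory controller.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical block with an optional debug counter.
entity gen_demo is
    generic(
        ENABLE_DEBUG : natural := 0;
        DATA_WIDTH   : natural := 8
    );
    port(
        clk       : in  std_logic;
        rst_n     : in  std_logic;
        dbg_count : out std_logic_vector(DATA_WIDTH - 1 downto 0)
    );
end gen_demo;

architecture rtl of gen_demo is
begin
    -- Debug disabled: tie the debug port to a constant.
    gen_debug_off : if (ENABLE_DEBUG = 0) generate
        dbg_count <= (others => '0');
    end generate;

    -- Debug enabled: instantiate a cycle counter.
    gen_debug_on : if (ENABLE_DEBUG /= 0) generate
        -- Signals that are local to a generate block are declared
        -- before its 'begin' keyword.
        signal count_r : unsigned(DATA_WIDTH - 1 downto 0);
    begin
        process(clk)
        begin
            if rising_edge(clk) then
                if rst_n = '0' then
                    count_r <= (others => '0');
                else
                    count_r <= count_r + 1; -- wraps around
                end if;
            end if;
        end process;

        dbg_count <= std_logic_vector(count_r);
    end generate;
end rtl;
```

Note that the two conditions have to be mutually exclusive, otherwise dbg_count would have two drivers.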

Arrays

Arrays are built by creating a custom type.

constant NSIGNALS_PER_CMD : integer := 6;
subtype CMD_T is std_logic_vector(NSIGNALS_PER_CMD - 1 downto 0);

constant NCOMMANDS : integer := 9;
type CMD_LOOKUP_T is array(0 to NCOMMANDS -1) of CMD_T;
signal cmd_lookup : CMD_LOOKUP_T;

Note that this example actually creates a 2D structure (an array of arrays), since CMD_T is already an array of std_logic signals. Initialization of such an array to zero is done like this:

cmd_lookup  <= (others => (others => '0'));

You can also postpone sizing the array until you use the type:

subtype BANKCMD_VEC_T is std_logic_vector(BANKCMD_WIDTH -1  downto 0);
type PATTERN_MEM_T is array (integer range <>) of BANKCMD_VEC_T;
-- Note: '**' means: to the power of
signal pattern_mem1 : PATTERN_MEM_T (0 to 2**PATTERN_MEM_ADDR_WIDTH - 1);
-- Another instance, with a different size, using the same array template:
signal pattern_mem2 : PATTERN_MEM_T (0 to 66);
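
Arrays are also useful as constant lookup tables. The sketch below reuses the CMD_T / CMD_LOOKUP_T types from above in a small combinatorial block. The entity cmd_rom, the number of entries and the table contents are arbitrary placeholders that I picked for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical lookup block: maps a 2-bit index to a 6-bit command.
entity cmd_rom is
    port(
        index : in  std_logic_vector(1 downto 0);
        cmd   : out std_logic_vector(5 downto 0)
    );
end cmd_rom;

architecture rtl of cmd_rom is
    constant NSIGNALS_PER_CMD : integer := 6;
    subtype CMD_T is std_logic_vector(NSIGNALS_PER_CMD - 1 downto 0);

    constant NCOMMANDS : integer := 4;
    type CMD_LOOKUP_T is array(0 to NCOMMANDS - 1) of CMD_T;

    -- A constant array is initialized with an aggregate.
    -- 'others' fills all entries that are not listed explicitly.
    constant CMD_LOOKUP : CMD_LOOKUP_T := (
        0      => "000001",
        1      => "000010",
        2      => "000100",
        others => (others => '0')
    );
begin
    -- Index the table with the integer value of 'index'.
    cmd <= CMD_LOOKUP(to_integer(unsigned(index)));
end rtl;
```

Because index is 2 bits wide and the table has 4 entries, every index value is in range here. With other sizes you would have to guard against out-of-range indices yourself.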

Records (like structs for C++)

Records can be used to group signals.

constant NSIGNALS_PER_CMD : integer := 6;
subtype CMD_T is std_logic_vector(NSIGNALS_PER_CMD-1 downto 0);

constant BANK_ADDR_WIDTH : integer := 4;
-- Note: a range can be a subtype if you like!
subtype BANK_ADDR_RNG is natural range BANK_ADDR_WIDTH - 1 downto 0;
subtype BANK_ADDR_T is std_logic_vector(BANK_ADDR_RNG);

type BANKCMD_T is record
last : std_logic;
cmd : CMD_T;
bank : BANK_ADDR_T;
end record;

signal cmd_n0 : BANKCMD_T;
signal cmd_n1 : BANKCMD_T;
(...)
cmd_n0.last <= '0';
cmd_n0.bank <= "1010";

Functions

Another possible way to reuse code is to use functions. The following example combines functions and records:

-- Address decoder functions:
function cs_shift_vec(vec : std_logic_vector ;
shmt : std_logic_vector)
return std_logic_vector is
begin
return std_logic_vector(shift_right(unsigned(vec),
to_integer(unsigned(shmt))));
end function cs_shift_vec;

constant COL_ADDR_WIDTH : integer := 10;
subtype COL_ADDR_RNG is natural range COL_ADDR_WIDTH - 1 downto 0;
subtype COL_ADDR_T is std_logic_vector(COL_ADDR_RNG);
(...)

-- Types have to be declared before they are used, so the record
-- comes after the address subtypes.
type BRC_T is record
bank : BANK_ADDR_T;
row : ROW_ADDR_T;
col : COL_ADDR_T;
end record;

function addr_decode( addr : ADDR_T;
bank_shift : AD_SHIFT_T;
bank_mask : AD_MASK_T;
col_BCBL_shift : AD_SHIFT_T;
col_rest_shift : AD_SHIFT_T;
col_BCBL_mask : AD_MASK_T;
col_rest_mask : AD_MASK_T;
row_shift : AD_SHIFT_T;
row_mask : AD_MASK_T ) return BRC_T is
variable res : BRC_T;
variable var_bank_tmp : ADDR_T;
variable var_ca_0_tmp, var_ca_1_tmp : ADDR_T;
variable var_ra_tmp : ADDR_T;
begin
-- Banks:
var_bank_tmp := cs_shift_vec(addr, bank_shift);
res.bank := var_bank_tmp(BANK_ADDR_RNG) and
bank_mask(BANK_ADDR_RNG);
-- Columns: Take the 2 parts of the column address,
-- bitwise-or them into one resulting vector
var_ca_0_tmp := cs_shift_vec(addr, col_BCBL_shift);
var_ca_1_tmp := cs_shift_vec(addr, col_rest_shift);

res.col := (var_ca_0_tmp(COL_ADDR_RNG) and
col_BCBL_mask(COL_ADDR_RNG)) or
(var_ca_1_tmp(COL_ADDR_RNG) and
col_rest_mask(COL_ADDR_RNG));

var_ra_tmp := cs_shift_vec(addr, row_shift);
res.row := var_ra_tmp(ROW_ADDR_RNG) and row_mask(ROW_ADDR_RNG);
return res;
end function addr_decode;

Sharing functions across libraries is possible but tricky (you need to place the shared functionality in a separate shared library). CompSOC: rdt_dtl_proxy_shared shows an example of how this can be done.
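
The basic mechanism for sharing a function is a package. Here is a minimal sketch, compiled into the default work library. The names util_pkg, required_bits_sketch and ptr_demo are hypothetical, and this is not the CompSOC implementation.

```vhdl
-- Hypothetical package: the declaration lists what is visible to users.
package util_pkg is
    -- Number of bits needed to represent the values 0 .. x-1 (at least 1).
    -- Sketch only: valid for x up to 2**30.
    function required_bits_sketch(x : positive) return positive;
end package util_pkg;

-- The package body contains the implementation.
package body util_pkg is
    function required_bits_sketch(x : positive) return positive is
        variable var_bits : positive := 1;
        variable var_span : positive := 2; -- values covered by var_bits bits
    begin
        while var_span < x loop
            var_bits := var_bits + 1;
            var_span := var_span * 2;
        end loop;
        return var_bits;
    end function required_bits_sketch;
end package body util_pkg;

library ieee;
use ieee.std_logic_1164.all;
-- Import the package, just like the ieee packages.
use work.util_pkg.all;

-- Hypothetical user of the package: the port width follows from DEPTH.
entity ptr_demo is
    generic(
        DEPTH : positive := 9
    );
    port(
        ptr : out std_logic_vector(required_bits_sketch(DEPTH) - 1 downto 0)
    );
end ptr_demo;

architecture rtl of ptr_demo is
begin
    ptr <= (others => '0');
end rtl;
```

If the package is compiled into a different library, you replace work by the name of that library and add a matching library clause in front of the use clause. How the library itself gets registered depends on your tool flow.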

Naming Conventions and Code Style

VHDL is case insensitive. We (CompSOC) use all lowercase, except for

  1. constants and generics.
  2. Type names. They usually end in _T, like ADDR_T for example.
  3. State names. They usually end in _S, like START_S for example.

Furthermore:

  1. Variable names start with var_.
  2. Register names end in _r, register inputs end in _nxt.
  3. Please indent using 4 spaces (no tabs).
  4. We use active-low reset, i.e. ‘0’ means: reset the block.
  5. Don’t drown yourself in hierarchy: if you are not re-using a block, then there is no point in turning it into a separate entity. Multiple processes in the same file are often a good alternative to separate functionally disjoint blocks without textual overhead.

CompSOC: We have standard implementations for:

  1. FIFOs
  2. One-hot decoding
  3. Various functions (required_bits(x), log2(x), etc.)

Most of them sit in the rdt_util_misclib. Don’t reinvent the wheel: instead please ask / search for previous implementations of common tasks.

Do’s and Don’ts (mostly don’ts)

  1. Don’t copy-paste a line more than 3 or 4 times. Use for-loops, vectors or hierarchy to duplicate functionality instead.
  2. Don’t copy VHDL files to your own folder to use as library. Instead, import the library. You can refer to a different PCore folder as if it were a library (assuming it has a package definition).
  3. CompSOC: Do use the DTL proxies if you are not crazy latency critical.
  4. Give all registers a value on reset (initialize with a don’t care if you don’t care about the value).
  5. Make sure your blocks don’t try to communicate while they are in reset (i.e. keep all cmd_valid and cmd_accept signals low).

CompSOC: Please discuss with the rest of the CompSOC group / supervisor before you:

  1. Create your own clock domains. Clock domains are a system-level design feature, which can cause a lot of headaches if not done right.
  2. Use Xilinx primitives. They reduce portability to different FPGA devices. You usually don’t need them, so it may be an indication something else is wrong.
  3. Use BRAM blocks. They are scarce, and we need them as local memory for the processor tiles. You can use distributed LUTRAM in most cases.

XPS quick re-synthesis

If the check_pcore script fails or behaves differently from xps, or if you simply want to make a quick change to a VHDL file to check the effect on the maximum synthesis frequency, you can invoke xst (the Xilinx synthesizer) directly.

Source the xilinx settings file:

source $XILINX_SETTINGS_FILE

Navigate to the synthesis folder of the pcore. For example:

cd synthesis/parallel_run/<name_of_your_pcore_instance>_wrapper

Trigger synthesis by passing the scr file to xst:

xst -intstyle xflow -ifn *.scr

DTL

This section contains a small description of the subset of the DTL (Device Transaction Layer) protocol used in CompSOC. A DTL link consists of a DTL-initiator and a DTL-target port. The initiator port initiates transactions, in other words, it owns the cmd_valid signal. Initiator ports have _i attached to their I/O signals, target ports use the _t postfix. All hardware in CompSOC follows this convention.

Signals

The protocol defines 3 groups of signals that are individually (but not independently) handshaked. The driver of the signal is denoted by an i for the initiator-port or a t for the target-port.

  1. Command-group:

    1. cmd_valid (i): valid flag [1 bit].
    2. cmd_accept (t): accept flag [1 bit].
    3. cmd_address (i): the address of the transaction. [Variable width, usually 32 bits]
    4. cmd_read (i): 1 if the transaction is a read, 0 if it is a write. [1 bit]
    5. cmd_block_size (i): the size of the transaction in dtl-words, minus 1. If, for example, the data busses are 4 bytes wide and the transaction is 12 bytes, then cmd_block_size is 12/4 - 1 = 2.
  2. Write-group:

    1. wr_valid (i): valid flag [1 bit].
    2. wr_accept (t): accept flag [1 bit].
    3. wr_last (i): 1 if the current wr-valid is the last one in this transaction. [1 bit]
    4. wr_data (i): the data that is transmitted. [Variable width, usually 32 bits]
    5. wr_mask (i): byte-masks. [wr_data width/8 bits]
  3. Read-group:

    1. rd_valid (t): valid flag [1 bit].
    2. rd_accept (i): accept flag [1 bit].
    3. rd_last (t): 1 if the current rd-valid is the last one in this transaction. [1 bit]
    4. rd_data (t): the data that is transmitted. [Variable width, usually 32 bits]

Furthermore, there is a dtl_clk signal and a dtl_rst_n signal. The convention is to use an active-low reset, which means the hardware is reset when dtl_rst_n is 0. Another signal, dtl_flush, is attached to virtually every dtl-port in the system, but never actually used. It may be removed in the future.

Protocol definition

The protocol rules within one group are quite simple:

  • When cmd_valid is 1, cmd_address, cmd_read, and cmd_block_size contain valid values.
  • When wr_valid is 1, wr_last, wr_data, and wr_mask contain valid values.
  • When rd_valid is 1, rd_last and rd_data contain valid data.
  • If valid and accept are 1 in the same clock cycle, a handshake is completed.
  • cmd_valid, once raised, may not be lowered until the cmd-handshake completes.
  • cmd_accept, once raised, may not be lowered until the cmd-handshake completes.

The rules describing the interaction between different groups are more complicated. The port responsible for implementing each rule is denoted with (i) or (t).

  1. (i) wr_valid may coincide with, but not precede, its associated cmd_valid signal.
  2. (i) wr_valid may not depend on its associated cmd_accept signal.
  3. (t) wr_accept may coincide with, but not precede, its associated cmd_accept.

The third rule has some peculiar implications on when a wr_accept is allowed. It can be given if:

  • cmd_accept is high, or
  • a cmd handshake has already happened for a write transaction that has not finished yet

There are few rules on the required implementation, although one very important one is given in the spec on page 16. This rule is more often broken than followed, leading to timing problems during synthesis:

  • (i, t) Combinatorial paths from DTL inputs to DTL outputs are not allowed

This rule impacts how the default accept signal may be implemented.

The following things are perhaps not intuitive, but still allowed:

  • rd_accept can be 1, even if the associated command has not been handshaked.
  • A target may raise both its cmd and write-accept flag at the same time. If a command-handshake takes place, and it turns out to be a read command, then the write-accept signal has to be lowered in the next cycle (Unless the target is immediately ready to accept a new command, and keeps its cmd-accept flag also raised).

Default accept implementation

To support fully pipelined operation (100% throughput), a DTL port has to implement default-accept. Without it, it would take 2 cycles to transfer 1 word of data:

clock cycle   0  1  2  3
cmd_valid     1  1  1  1
cmd_accept    0  1  0  1
wr_valid      1  1  1  1
wr_accept     0  1  0  1

No combinatorial path between cmd_valid_t and cmd_accept_t is allowed, and a DTL port is not allowed to lower a command accept once it is raised, until a handshake happens. This means that if the processing of data by a target takes a variable number of cycles, and if default accept is used, then its cmd and wr-group signals need to be buffered in the target in case blocking occurs. Any default-accept DTL block connected to a resource that may block fits this description (like the NoC or a finite buffer, so practically any block in the platform).

Assume for example a fictional piece of hardware, which has a DTL initiator and target. It takes data from the target port, processes it, and then sends it out of the initiator port. For example:

clock cycle      0  1  2  3
cmd_valid_i      1  1  1  1
cmd_accept_t     1  1  0  1
wr_valid_i       1  1  1  1
wr_accept_t      1  1  0  1
wr_data_i        b  c  d  d
processing data  a  b  b  c
processing done  1  0  1  1

Here, data-element c has to be buffered in cycle 2 by the target block, otherwise it would be lost. The alternative, connecting the processing_done signal to cmd_accept, is a very bad idea. Chaining accept/valid signals across different blocks leads to long paths spanning multiple hardware modules, especially if Mealy machines are used.

[Figure: Accept-valid-chain]
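
To make the buffering idea more concrete, here is a sketch of a two-entry buffer for a single valid/accept group. Its accept and valid outputs come straight from registers, so there is no combinatorial path from its inputs to its outputs. The entity accept_buffer and all of its port names are hypothetical. It only illustrates the principle for one signal group under my own assumptions, and it ignores the rules between the cmd, write and read groups.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical two-entry buffer with registered valid/accept outputs.
entity accept_buffer is
    generic(
        DATA_WIDTH : natural := 32
    );
    port(
        clk        : in  std_logic;
        rst_n      : in  std_logic;
        in_valid   : in  std_logic;
        in_accept  : out std_logic;
        in_data    : in  std_logic_vector(DATA_WIDTH - 1 downto 0);
        out_valid  : out std_logic;
        out_accept : in  std_logic;
        out_data   : out std_logic_vector(DATA_WIDTH - 1 downto 0)
    );
end accept_buffer;

architecture rtl of accept_buffer is
    subtype DATA_T is std_logic_vector(DATA_WIDTH - 1 downto 0);
    signal head_r, tail_r    : DATA_T; -- head_r is the oldest element
    signal count_r           : natural range 0 to 2;
    signal accept_r, valid_r : std_logic;
begin
    -- All outputs are driven by registers only.
    in_accept <= accept_r;
    out_valid <= valid_r;
    out_data  <= head_r;

    process(clk)
        variable var_push, var_pop : boolean;
        variable var_count         : natural range 0 to 2;
    begin
        if rising_edge(clk) then
            if rst_n = '0' then
                -- Keep valid and accept low while in reset.
                count_r  <= 0;
                accept_r <= '0';
                valid_r  <= '0';
            else
                var_push  := (in_valid = '1') and (accept_r = '1');
                var_pop   := (out_accept = '1') and (valid_r = '1');
                var_count := count_r;
                if var_pop then
                    head_r    <= tail_r;
                    var_count := var_count - 1;
                end if;
                if var_push then
                    if var_count = 0 then
                        head_r <= in_data; -- overrides the shift above
                    else
                        tail_r <= in_data;
                    end if;
                    var_count := var_count + 1;
                end if;
                count_r <= var_count;
                -- Default accept: accept whenever there is room left.
                if var_count < 2 then accept_r <= '1'; else accept_r <= '0'; end if;
                if var_count > 0 then valid_r <= '1'; else valid_r <= '0'; end if;
            end if;
        end if;
    end process;
end rtl;
```

With one element stored, the buffer can accept and deliver a word in the same cycle, so it sustains one word per cycle. The second entry catches the word that was already accepted when the consumer stalls, after which accept is lowered in the next cycle.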

CompSOC: DTL proxy blocks

Properly implementing default-accept in DTL without chaining accept-valid signals is tricky. Luckily, it has been done already, in src/vhdl/data/rdt_dtl_lib/rdt_dtl_proxy_(init|target). These blocks force output-buffering of flow-control and data signals and implement default accept, basically acting as a protocol-safe buffer. The non-dtl side of the proxy has no constraints that span across the different signal groups. Data is consumed in-order from the write and read inputs, and forwarded when the DTL protocol allows it.

[Figure: DTL Proxy]

The basic idea is to instantiate the init and/or target proxy in your VHDL module and connect it to the DTL interface of your module. You are then allowed to toggle the internal handshake signals at will, with no effect on the external interface of your block until at least the next clock cycle. Three usage examples are:

  1. the tft controller: (pcores/dtl_tft_v0_00_a/hdl/vhdl/dtl_tft.vhd)
  2. the atomizer: (src/vhdl/data/rdt_compose_lib/rdt_dtl_atomizer)
  3. the initiator bus: (src/vhdl/data/rdt_bus_lib/rdt_bus_dtl_init_gen)

Proxy added latency

If two proxy blocks are chained (initiator to target), the minimal latency to get data from one block into the other is two clock cycles. This is illustrated in this image, where left of the dotted line is part of the initiator proxy, and on the right there’s a target proxy:

[Figure: Chained-proxies, internals]

There is a very small pipeline stage at every interface (between colored pairs in the chained-proxies image), which could help the Place and Route (PAR) stage when generating a bit-file. The extra pipeline stage could optionally be omitted, but that requires changes in the cmd_push/pull logic as well. In the future, generation of _last signals could also be automated (based on the _block_size signal).

Best DTL practices

  • Use the dtl_proxy blocks; they make your life easier.
  • If you choose not to use them, then you need to properly implement the protocol, AND think about the hardware it synthesizes to.
  • Most blocks in CompSOC serialize command and write-handshakes, such that there is only one uncompleted write-transaction at a time. For most resources this has no implications on throughput (the DDR controller is an exception to this rule) and it can simplify the implementation of the protocol.
  • Never hard-code the widths of the variable-width signals; use generics instead. Using the default naming convention is also highly recommended.