Skip to content

Latest commit

 

History

History
114 lines (74 loc) · 9.03 KB

rv32zild_proposal.adoc

File metadata and controls

114 lines (74 loc) · 9.03 KB

Zilsp (Integer Load/Store Pairs) v0.12

Note
This document is in the Discussion Document state. Assume everything can change. This document is not complete yet and was created only for the purpose of conversation outside of the document. For more information see: https://riscv.org/spec-state

Zilsp (working title) is an RV32-only extension, with the sole purpose of enabling 64-bit load/store instructions from the RV64 base ISA into RV32.

In the future, a similar extension could be added for RV64, with the purpose of enabling 128-bit load/store instructions from the RV128 base ISA into RV64 once that has been ratified. See Future RV64 instructions and future compressed RV64 instructions

Changes

Changes since v0.11

  • Removed RV64 instructions from the initial spec and instead kept them as future improvements since RV128 isn’t yet ratified (see #6)

  • Renamed the extension from Zild to Zilsp (thanks to Christian Herber for the suggestion)

  • Addressed how misaligned accesses should be treated

Changes since v0.10 (initial version)

  • Added similar RV64 instructions for reusing the RV128 load/store instructions for loading/storing pairs of 64-bit registers.

  • Added note to clarify that encodings with odd register encodings are reserved and will remain so until a potential future extension.

  • Minor typo fixes and formatting

Rationale

The motivation behind this proposal is a combination of multiple use-cases:

  • Performance: in applications with very tight inner-loops, the ability to issue multiple register loads with a single instruction can be beneficial in order to meet tight real-time requirements where every CPU cycle counts. If the CPU bus interface is wider than the native data width, this will also allow for better utilization and double the throughput compared to the regular load instructions.[1]

  • Atomicity: in some cases, the ability to read or write two registers in an atomic fashion can be beneficial

  • Zdinx: with the Zdinx extension, the LD instruction can be used to load a double-precision floating-point value into two adjacent integer registers for use by the FP ops; directly replacing the FLD instruction.

  • Code size (in case the compressed encodings are included): being able to encode two adjacent loads/stores with a single compressed instruction could have the potential to save quite a bit of code size (measurements/bencmarking would be needed to determine how much).

Instructions

This proposed extension doesn’t specify any new instructions per se, it only adds support for instructions which already exist in the 64-bit base ISA to RV32.

The instructions will follow the same restrictions in terms of register specification as the Zdinx extension:

  • Registers are allocated in pairs with r containing the lower half of the value, and (r+1) containing the upper half.

    • For RV32 lower half means bits (31:0) of the 64-bit value, and the upper half means the high-order bits (63:32).

    • Similarly for a future RV64 extension the lower half would refer to bits (63:0) of the 128-bit value, and the upper half would refer to bits (127:64).

    • This alignment applies regardless of machine endianness.

  • Only even-numbered registers may be used - odd-numbered register encodings are reserved.[2]

    • If x0/zero is specified as the rs2 source operand of a store operation, zero should be written for both words - regardless of the contents in x1.

    • Conversely, if x0/zero is specified as the destination, x1 should not be updated with the upper half of the result.

Regarding accesses that are not naturally 2*XLEN-aligned, this extension will follow the example given by the base ISA; misaligned accesses are not required to raise an address-misaligned exception, but raising such an exception in this case is perfectly valid behavior, and atomicity is not mandated for such misaligned accesses.
For best performance across different implementations, software should make sure that load/store-pair instructions are only generated with 2*XLEN-aligned addresses.

If a misaligned access crosses a page boundary it should follow the same behavior as for other accesses that are not naturally aligned to the data width. In case of an access fault trap in the second half (when the implementation supports accesses that are not naturally aligned), the behavior will be like defined in the Privileged Spec; the xtval CSR (if implemented) will contain the virtual address of the portion of the access that caused the fault.

Currently proposed instructions

The following instructions would be added (enabled) by this extension for RV32:

  • LD: 64-bit data load into {rd+1, rd} from memory address stored in rs1 + immediate

  • SD: 64-bit data store from {rs2+1, rs2} to memory address stored in rs1 + immediate

Future RV64 instructions

Note
This section is just an idea for a future where the RV128 is defined, and will not be part of the current extension proposal.

The following instructions could be added (enabled) by a future extension for RV64 (once RV128 is defined):

  • LQ: 128-bit data load into {rd+1, rd} from memory address stored in rs1 + immediate

  • SQ: 128-bit data store from {rs2+1, rs2} to memory address stored in rs1 + immediate

(Optional) compressed encodings

Note
If this cannot be inferred directly from the presence of Zilsp and Zca (but not Zcf), then we could define a separate extension Zclsp (Compressed Load/Store Pairs) containing the following compressed instructions

RV32

If the compressed extension Zca is enabled, then we can add the RV64C encodings for 64-bit load/stores to RV32 as well:

  • C.LD: 64-bit data load

  • C.SD: 64-bit data store

  • C.LDSP: 64-bit data load (stack-pointer relative)

  • C.SDSP: 64-bit data store (stack-pointer relative)

This is of course incompatible with the (RV32) F extension, as those opcodes are then occupied by the single-precision load/store ops (C.FLW, C.FSW, C.FLWSP and C.FSWSP respectively).

Because of the [restrictions] imposed on register selection, the least significant bit of the register selections for all the compressed encodings (the ones for rs1[0], rs2[0] and rd[0]) can be reserved and thus we free up half the code points (and twice of that for C.LD/C.SD) for future use.

RV64

Note
Although the RV128 base ISA is not yet defined, the compressed encodings are included in the Zca extension. Therefore we could push this as a compressed-only RV64 extension, although if it turns out to be too controversial we can hold off until the non-compressed counterparts (LQ/SQ) are defined

If the compressed extension Zca is enabled, then we can add the RV128C encodings for 128-bit load/stores to RV64 as well:

  • C.LQ: 128-bit data load

  • C.SQ: 128-bit data store

  • C.LQSP: 128-bit data load (stack-pointer relative)

  • C.SQSP: 128-bit data store (stack-pointer relative)

This is of course incompatible with the (RV64) D extension, as those opcodes are then occupied by the double-precision load/store ops (C.FLD, C.FSD, C.FLDSP and C.FSDSP respectively).

Because of the [restrictions] imposed on register selection, the least significant bit of the register selections for all the compressed encodings (the ones for rs1[0], rs2[0] and rd[0]) can be reserved and thus we free up half the code points (and twice of that for C.LQ/C.SQ) for future use.

Extra encoding bit (future discussion)

Note
This section is just brainstorming and will not be part of the current extension proposal.

The encoding bit saved by the pairwise register assignment could be repurposed to give more utility by extending the addressable register or immediate range - for example like this:

  • For LD and SD: introduce new register encoding rs/rd'' - [2:1|4], making it possible to address registers 8,10,..,30 (bit 3 still implicitly wired to 1 for more similarity with existing encoding).

  • For LDSP and SDSP: add another high-order bit to the immediate, doubling the addressable range.


1. While the same could be achieved using macro-op fusion, having a dedicated instruction for this makes it easier for smaller implementations with simple decoders and short pipelines to utilize this. The most realistic alternative for most smaller implementations would be to fuse two adjacent C.LW/C.LD instructions, as anything else would require a decoder wider than 32 bits and/or multiple decode stages - all adding complexity and (power/area) cost.
2. Since this restriction is already in place for the Zdinx extension it makes sense to keep it here as well