This blog post details how the QEMU TCG engine manages guest memory accesses.
Since TCG is a JIT compiler translating guest code into host code, a lot of questions arise:
- How can a virtual PowerPC userland address be translated and accessed on the host?
- How are the host CPU instructions restricted to accessing only the memory available to the QEMU virtual machine?
- Is there an address translation cache?
- What about performance?
Don't hesitate to refresh your QEMU memory model knowledge by reading this blog post.
We assume the reader has previous knowledge of memory management in modern architectures and operating systems. A lot of resources on the subject are publicly available.
As usual in the blog series, we will assume the guest is a 32-bit PowerPC machine and the host an Intel x86_64 one.
We won't detail anything related to address translation in the host, because QEMU is a simple user process of your hosting operating system. The guest physical memory (RAM) is just a buffer allocated in that QEMU process. The job of the QEMU TCG is thus to redirect any guest memory access to address X (whether virtual or physical) to that memory buffer.
Obviously, the guest vCPU type and running mode will have an impact on how addresses are translated. This is where the so-called QEMU-softmmu enters the game. In system-mode emulation, as opposed to user-mode, and for some architectures, the QEMU engine supports a Software Memory Management Unit (soft-MMU). This component is able to translate guest virtual addresses into guest physical ones. QEMU also supports Virtual Translation Lookaside Buffers (vTLBs) to speed up further accesses to previously translated addresses.
The guest physical address is always translated into an offset into the QEMU-maintained representation of the guest RAM, which is usually a memory-mapped area in the host (a buffer obtained via malloc() or mmap()).
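To make the "guest RAM is just a host buffer" idea concrete, here is a minimal sketch. The GuestRam type and both function names are hypothetical and are not QEMU code; they only illustrate how a guest physical address becomes an offset into a host allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of guest RAM: one host buffer plus its guest base. */
typedef struct {
    uint8_t  *host_base;   /* start of the host buffer backing guest RAM */
    uint32_t  guest_base;  /* guest physical address of the first byte */
    uint32_t  size;        /* RAM size in bytes */
} GuestRam;

/* Allocate a zeroed host buffer standing in for the guest RAM. */
static int guest_ram_init(GuestRam *ram, uint32_t guest_base, uint32_t size)
{
    ram->host_base = calloc(size, 1);
    ram->guest_base = guest_base;
    ram->size = size;
    return ram->host_base != NULL ? 0 : -1;
}

/* Translate a guest physical address into a host pointer,
 * or return NULL when the address falls outside the RAM. */
static uint8_t *guest_phys_to_host(const GuestRam *ram, uint32_t gpa)
{
    assert(ram->host_base != NULL);
    if (gpa < ram->guest_base || gpa - ram->guest_base >= ram->size) {
        return NULL;
    }
    return ram->host_base + (gpa - ram->guest_base);
}
```

With a RAM of 0x2000 bytes based at guest address 0x1000, guest address 0x1004 resolves to host_base + 4, while 0x3000 is rejected. The bounds check is the whole "confinement" story at this level: a guest access can only ever land inside the buffer.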
What happens in QEMU when it has to translate the following PowerPC instruction:
0xfff0017c: 90010004 stw r0, 4(r1)
Like any other instruction, stw has a specific entry in the opcodes table:
static opcode_t opcodes[] = {
...
GEN_STS(stw, st32, 0x04, PPC_INTEGER)
...
};
#define GEN_ST(name, stop, opc, type) \
GEN_HANDLER(name, opc, 0xFF, 0xFF, 0x00000000, type),
#define GEN_HANDLER(name, opc1, opc2, opc3, inval, type) \
GEN_OPCODE(name, opc1, opc2, opc3, inval, type, PPC_NONE)
We purposely omitted the expansion of GEN_STS to GEN_HANDLER because it breaks down into all the signed, unsigned and extended variants of the store operation, which are of no interest to us.
The GEN_OPCODE macro will expand to the declaration, if you remember our TCG article, of the gen_stw handler. We can't find its direct definition in the QEMU source code: as for TCG helpers, it is partly generated at compilation time:
#define GEN_ST(name, stop, opc, type) \
static void glue(gen_, name)(DisasContext *ctx) \
{ \
TCGv EA; \
gen_set_access_type(ctx, ACCESS_INT); \
EA = tcg_temp_new(); \
gen_addr_imm_index(ctx, EA, 0); \
gen_qemu_##stop(ctx, cpu_gpr[rS(ctx->opcode)], EA); \
tcg_temp_free(EA); \
}
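The effective address computation mentioned here follows the architectural rule for D-form stores, EA = (rA|0) + EXTS(d). The sketch below decodes our example instruction and applies that rule. The DFormInsn type and the function names are ours, for illustration only; this is not how QEMU implements gen_addr_imm_index, which emits TCG ops instead of computing a value.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a PowerPC D-form store such as "stw rS, d(rA)". */
typedef struct {
    unsigned primary_opcode;  /* top 6 bits of the instruction word */
    unsigned rs;              /* source register number */
    unsigned ra;              /* base register number; 0 means "no base" */
    int16_t  d;               /* signed 16-bit displacement */
} DFormInsn;

static DFormInsn decode_dform(uint32_t insn)
{
    DFormInsn f;
    f.primary_opcode = insn >> 26;
    f.rs = (insn >> 21) & 0x1f;
    f.ra = (insn >> 16) & 0x1f;
    f.d  = (int16_t)(insn & 0xffff);
    return f;
}

/* EA = (rA|0) + EXTS(d), truncated to 32 bits. */
static uint32_t effective_address(const DFormInsn *f, const uint32_t gpr[32])
{
    assert(f->ra < 32);
    uint32_t base = (f->ra == 0) ? 0 : gpr[f->ra];
    return base + (uint32_t)(int32_t)f->d;
}
```

Decoding 0x90010004 gives primary opcode 36, rS = 0, rA = 1 and d = 4, so with r1 = 0x1000 the store targets guest address 0x1004.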
Besides some effective address computation, the interesting line of code is the expansion of gen_qemu_##stop into gen_qemu_st32, which again has no direct definition but is generated thanks to several macro expansions:
#define GEN_QEMU_STORE_TL(stop, op) \
static void glue(gen_qemu_, stop)(DisasContext *ctx, \
TCGv val, \
TCGv addr) \
{ \
tcg_gen_qemu_st_tl(val, addr, ctx->mem_idx, op); \
}
GEN_QEMU_STORE_TL(st32, DEF_MEMOP(MO_UL))
/* from tcg/tcg-op.h */
#if TARGET_LONG_BITS == 32
...
#define tcg_gen_qemu_st_tl tcg_gen_qemu_st_i32
...
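If the glue() token-pasting idiom is unfamiliar, the following reduced sketch shows why functions such as gen_qemu_st32 cannot be found with a plain text search. GEN_STORE_VALUE, truncate_to_bits and the generated store_value_* functions are hypothetical names of ours; only the two-level glue/xglue pattern mirrors what QEMU does.

```c
#include <assert.h>
#include <stdint.h>

/* Two-level paste so that macro arguments are expanded before ##. */
#define xglue(a, b) a##b
#define glue(a, b)  xglue(a, b)

/* Common worker, parameterised by an operand width in bits. */
static uint32_t truncate_to_bits(uint32_t val, unsigned bits)
{
    assert(bits >= 1 && bits <= 32);
    return bits == 32 ? val : (val & ((UINT32_C(1) << bits) - 1));
}

/* Hypothetical generator: defines store_value_<stop>(), a thin wrapper
 * that calls the common worker with a constant baked in. */
#define GEN_STORE_VALUE(stop, bits)                          \
static uint32_t glue(store_value_, stop)(uint32_t val)       \
{                                                            \
    return truncate_to_bits(val, bits);                      \
}

GEN_STORE_VALUE(st8, 8)
GEN_STORE_VALUE(st16, 16)
GEN_STORE_VALUE(st32, 32)
```

After preprocessing, three functions named store_value_st8, store_value_st16 and store_value_st32 exist, although none of those names appears literally in the source. GEN_QEMU_STORE_TL has the same shape: a generated wrapper that forwards to a generic function with a fixed memory operation.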
For a 32-bit PowerPC guest, the initial stw guest instruction gets translated into the QEMU TCG frontend-op tcg_gen_qemu_st_i32:
void tcg_gen_qemu_st_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
{
...
gen_ldst_i32(INDEX_op_qemu_st_i32, val, addr, memop, idx);
...
}
From that point, an IR qemu_st_i32 opcode is emitted. We won't explain host code generation again; read the dedicated blog post. Once the execution loop reaches tcg_gen_code, and more specifically tcg_reg_alloc_op, QEMU generates the TCG backend-op for qemu_st_i32.
static void tcg_reg_alloc_op(TCGContext *s, const TCGOp *op)
{
...
tcg_out_op(s, op->opc, new_args, const_args);
...
}
In our situation, the tcg-target is an Intel x86 machine, so we will find a suitable tcg_out_op definition in tcg/i386/tcg-target.inc.c:
static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
const TCGArg *args, const int *const_args)
{
...
case INDEX_op_qemu_st_i32:
tcg_out_qemu_st(s, args, 0);
...
}
Here we are. The tcg_out_qemu_st function is very interesting to study. It holds the internals of QEMU guest memory addressing.
The blog series does not intend to be a complete guide to the TCG internals. Keep in mind that, at this level, functions are developed with some conventions related to TCG arguments (i.e. TCGArg *args).
static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
{
TCGReg datalo, datahi, addrlo;
TCGReg addrhi __attribute__((unused));
TCGMemOpIdx oi;
MemOp opc;
#if defined(CONFIG_SOFTMMU)
int mem_index;
tcg_insn_unit *label_ptr[2];
#endif
datalo = *args++;
datahi = (TCG_TARGET_REG_BITS == 32 && is64 ? *args++ : 0);
addrlo = *args++;
addrhi = (TARGET_LONG_BITS > TCG_TARGET_REG_BITS ? *args++ : 0);
oi = *args++;
opc = get_memop(oi);
#if defined(CONFIG_SOFTMMU)
mem_index = get_mmuidx(oi);
tcg_out_tlb_load(s, addrlo, addrhi, mem_index, opc,
label_ptr, offsetof(CPUTLBEntry, addr_write));
/* TLB Hit. */
tcg_out_qemu_st_direct(s, datalo, datahi, TCG_REG_L1, -1, 0, 0, opc);
/* Record the current context of a store into ldst label */
add_qemu_ldst_label(s, false, is64, oi, datalo, datahi, addrlo, addrhi,
s->code_ptr, label_ptr);
#else
tcg_out_qemu_st_direct(s, datalo, datahi, addrlo, x86_guest_base_index,
x86_guest_base_offset, x86_guest_base_seg, opc);
#endif
}
Thanks to the soft-mmu and support of virtual TLBs, QEMU offers a slow path and a fast path when accessing guest memory. The slow path can be seen as a TLB-miss and implies a subsequent call to the PowerPC software MMU implemented inside QEMU to translate a guest virtual address into a guest physical address.
If there is a TLB-hit, QEMU already holds the guest physical address in its vCPU-maintained TLBs and is able to directly generate the final memory access into guest RAM with an Intel x86 instruction. Have a look at tcg_out_qemu_st_direct.
The mechanism behind that is tied to the following 3 lines:
/* try to find a filled TLB entry */
tcg_out_tlb_load(s, addrlo, addrhi, mem_index, opc,
label_ptr, offsetof(CPUTLBEntry, addr_write));
/* TLB Hit. So generate a physical guest memory access */
tcg_out_qemu_st_direct(s, datalo, datahi, TCG_REG_L1, -1, 0, 0, opc);
/* TLB Miss. Filled during tlb_load and redirect to soft-MMU */
add_qemu_ldst_label(s, false, is64, oi, datalo, datahi, addrlo, addrhi,
s->code_ptr, label_ptr);
First, tcg_out_tlb_load will generate host instructions to check for a TLB entry. The QEMU TLBs are generic to the architecture and defined in cpu-defs.h:
typedef struct CPUTLBEntry {
/* bit TARGET_LONG_BITS to TARGET_PAGE_BITS : virtual address
bit TARGET_PAGE_BITS-1..4 : Nonzero for accesses that should not
go directly to ram.
bit 3 : indicates that the entry is invalid
bit 2..0 : zero
*/
union {
struct {
target_ulong addr_read;
target_ulong addr_write;
target_ulong addr_code;
/* Addend to virtual address to get host address. IO accesses
use the corresponding iotlb value. */
uintptr_t addend;
};
/* padding to get a power of two size */
uint8_t dummy[1 << CPU_TLB_ENTRY_BITS];
};
} CPUTLBEntry;
As translated blocks are generated once and potentially executed several times, it is convenient to generate host code that checks the QEMU-maintained vCPU TLBs dynamically. During the life of a translated block, a given TLB entry might be invalidated and then filled again.
The tcg_out_tlb_load code is quite tedious to read, full of tcg-target opcode generators. A resulting Intel x86_64 translated block containing the tlb_load algorithm looks like the following:
tcg_out_tlb_load:
0x7ffff41888e9 <code_gen_buffer+22716>: mov %esp,%edi
0x7ffff41888eb <code_gen_buffer+22718>: shr $0x7,%edi
0x7ffff41888ee <code_gen_buffer+22721>: and 0x338(%rbp),%edi
0x7ffff41888f4 <code_gen_buffer+22727>: add 0x388(%rbp),%rdi
0x7ffff41888fb <code_gen_buffer+22734>: lea 0x3(%r12),%esi
0x7ffff4188900 <code_gen_buffer+22739>: and $0xfffff000,%esi
0x7ffff4188906 <code_gen_buffer+22745>: cmp 0x4(%rdi),%esi
0x7ffff4188909 <code_gen_buffer+22748>: mov %r12d,%esi
0x7ffff418890c <code_gen_buffer+22751>: jne 0x7ffff418897f ---> back to LDST labels
0x7ffff4188912 <code_gen_buffer+22757>: add 0x10(%rdi),%rsi
QEMU tries to read its CPUTLBEntries for the given guest virtual address. For a store operation, the addr_write field is used for the comparison at 0x7ffff4188906 in the extract. The RDI register points to the CPUTLBEntry and ESI holds the guest address masked down to its page.
If the comparison fails, it's a TLB-miss and we jump to a LDST label that we will explain later. Otherwise, the RSI register is reloaded with the guest address, adjusted to the final host address thanks to CPUTLBEntry.addend, and the memory access can be done:
tcg_out_qemu_st_direct:
0x7ffff4188925 <code_gen_buffer+22776>: movbe %ebx,(%rsi)
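The assembly above is easier to follow next to a C rendition of the same steps. This is a simplified sketch under our own assumptions: the TlbEntry type, the 256-entry direct-mapped table and all function names are hypothetical, and QEMU keeps the table mask and base in the CPU state rather than in constants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BITS 12
#define PAGE_MASK ((uint32_t)~((UINT32_C(1) << PAGE_BITS) - 1))
#define TLB_SIZE  256u                  /* assumed size, power of two */

typedef struct {
    uint32_t  addr_write;  /* page-aligned guest address, or an invalid tag */
    uintptr_t addend;      /* host address minus guest virtual address */
} TlbEntry;

static TlbEntry *tlb_lookup_entry(TlbEntry *tlb, uint32_t vaddr)
{
    return &tlb[(vaddr >> PAGE_BITS) & (TLB_SIZE - 1)];
}

/* Mark every entry invalid: the all-ones tag never matches a page address. */
static void tlb_flush(TlbEntry *tlb)
{
    for (size_t i = 0; i < TLB_SIZE; i++) {
        tlb[i].addr_write = UINT32_MAX;
        tlb[i].addend = 0;
    }
}

/* Install a mapping from a guest virtual page to a host page. */
static void tlb_set_entry(TlbEntry *tlb, uint32_t vpage, uint8_t *host_page)
{
    TlbEntry *e = tlb_lookup_entry(tlb, vpage);
    e->addr_write = vpage & PAGE_MASK;
    e->addend = (uintptr_t)host_page - (vpage & PAGE_MASK);
}

/* Fast path: return the host address on a hit, NULL on a miss. */
static uint8_t *tlb_store_fast_path(TlbEntry *tlb, uint32_t vaddr, unsigned size)
{
    assert(size >= 1);
    TlbEntry *e = tlb_lookup_entry(tlb, vaddr);
    /* Like "lea 0x3(%r12),%esi; and $0xfffff000,%esi": masking the address
     * of the last byte makes a page-crossing access miss. */
    uint32_t tag = (vaddr + (size - 1)) & PAGE_MASK;
    if (e->addr_write != tag) {
        return NULL;                     /* the "jne" to the slow path */
    }
    return (uint8_t *)(e->addend + vaddr);   /* "add 0x10(%rdi),%rsi" */
}
```

A 4-byte store at 0x10000004 hits once the page 0x10000000 is installed, whereas a 4-byte store at 0x10000ffe misses because its last byte lives in the next page. Note that the addend turns a guest virtual address directly into a host address with a single addition.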
The TLB verification implementation can also be found in the QEMU cpu_ld/st_xxx API functions. As defined in the documentation, they operate on guest virtual addresses and may cause guest CPU exceptions. Thus they do check TLBs and might redirect to the software MMU. They are implemented through macros in cpu_ldst_template.h:
/* generic store macro */
static inline void
glue(glue(glue(cpu_st, SUFFIX), MEMSUFFIX), _ra)(CPUArchState *env,
target_ulong ptr,
RES_TYPE v, uintptr_t retaddr)
{
...
addr = ptr;
mmu_idx = CPU_MMU_INDEX;
entry = tlb_entry(env, mmu_idx, addr);
if (unlikely(tlb_addr_write(entry) !=
(addr & (TARGET_PAGE_MASK | (DATA_SIZE - 1))))) {
oi = make_memop_idx(SHIFT, mmu_idx);
glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)(env, addr, v, oi,
retaddr);
} else {
uintptr_t hostaddr = addr + entry->addend;
glue(glue(st, SUFFIX), _p)((uint8_t *)hostaddr, v);
}
...
}
Looks familiar, doesn't it? :)
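One detail of the comparison in this template deserves a closer look: the address is masked with TARGET_PAGE_MASK | (DATA_SIZE - 1), so the low alignment bits take part in the comparison. The sketch below isolates that test. The function name and the fixed 4 KiB page size are assumptions of ours.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BITS 12
#define PAGE_MASK ((uint32_t)~((UINT32_C(1) << PAGE_BITS) - 1))

/* Returns true when the access cannot take the direct RAM path.
 * tlb_addr_write is the tag read from the TLB entry: a page-aligned
 * address, possibly with low flag bits set. */
static bool needs_slow_path(uint32_t tlb_addr_write, uint32_t addr,
                            unsigned data_size)
{
    /* data_size must be a power of two for the mask trick to work. */
    assert(data_size != 0 && (data_size & (data_size - 1)) == 0);
    return tlb_addr_write != (addr & (PAGE_MASK | (data_size - 1)));
}
```

Because a plain RAM entry has all its low bits cleared, three different situations fall into the slow path with a single comparison: a different page, an address that is not aligned on the access size, and an entry carrying flag bits (invalid entry, or one that should not go directly to RAM).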
The LDST labels stand for load/store labels. It is the mechanism used by QEMU to redirect a TLB-miss to a call to the software MMU via a TCG helper.
The TCGContext object holds the output buffer that receives the generated host assembly opcodes. Any time an instruction is added, the code_ptr pointer into that buffer is incremented.
During tcg_out_tlb_load, when QEMU generates the comparison and jump instructions for a TLB-miss, it also records the location in the output buffer that will hold the JNE offset to redirect execution to.
static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
int mem_index, MemOp opc,
tcg_insn_unit **label_ptr, int which)
{
...
/* jne slow_path */
tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
label_ptr[0] = s->code_ptr;
s->code_ptr += 4;
...
}
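This "leave a hole now, fill it later" technique is classic back-patching. Here is a standalone sketch of it, using a hypothetical CodeBuf type and function names of our own. The only x86 fact it relies on is the encoding of the long conditional jump: 0F 85 followed by a 32-bit little-endian displacement relative to the end of the instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical output buffer; code_ptr plays the role of s->code_ptr. */
typedef struct {
    uint8_t  buf[64];
    uint8_t *code_ptr;   /* next free byte */
} CodeBuf;

static void codebuf_init(CodeBuf *s)
{
    memset(s->buf, 0, sizeof(s->buf));
    s->code_ptr = s->buf;
}

static void emit_byte(CodeBuf *s, uint8_t b)
{
    assert(s->code_ptr < s->buf + sizeof(s->buf));
    *s->code_ptr++ = b;
}

/* Emit "jne rel32" with an empty displacement and return the address of
 * the 4-byte slot, to be patched once the target is known. */
static uint8_t *emit_jne_placeholder(CodeBuf *s)
{
    emit_byte(s, 0x0f);
    emit_byte(s, 0x85);
    assert(s->code_ptr + 4 <= s->buf + sizeof(s->buf));
    uint8_t *slot = s->code_ptr;
    s->code_ptr += 4;
    return slot;
}

/* Fill the slot: the displacement is relative to the byte following it. */
static void patch_rel32(uint8_t *slot, const uint8_t *target)
{
    uint32_t rel = (uint32_t)(int32_t)(target - (slot + 4));
    for (int i = 0; i < 4; i++) {
        slot[i] = (uint8_t)(rel >> (8 * i));   /* little-endian */
    }
}
```

In this sketch, emitting the placeholder, then two bytes of fast-path code, then patching the slot with the current code_ptr yields a displacement of 2: on a miss, execution skips exactly the fast-path bytes.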
Additionally, when we get back to tcg_out_qemu_st, the call to add_qemu_ldst_label creates a new LDST label and records context details to later prepare a call to the softmmu slow path TCG helper:
static void add_qemu_ldst_label(TCGContext *s, bool is_ld, bool is_64,
TCGMemOpIdx oi,
TCGReg datalo, TCGReg datahi,
TCGReg addrlo, TCGReg addrhi,
tcg_insn_unit *raddr,
tcg_insn_unit **label_ptr)
{
TCGLabelQemuLdst *label = new_ldst_label(s);
label->is_ld = is_ld;
label->oi = oi;
label->type = is_64 ? TCG_TYPE_I64 : TCG_TYPE_I32;
label->datalo_reg = datalo;
label->datahi_reg = datahi;
label->addrlo_reg = addrlo;
label->addrhi_reg = addrhi;
label->raddr = raddr;
label->label_ptr[0] = label_ptr[0];
if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
label->label_ptr[1] = label_ptr[1];
}
}
The call to the slow path helper is inserted by tcg_gen_code during translated block epilogue generation with a call to tcg_out_ldst_finalize. QEMU checks whether any LDST labels exist and generates the corresponding TCG helper calls. For our store operation, tcg_out_ldst_finalize will emit a tcg_out_qemu_st_slow_path:
int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
{
...
#ifdef TCG_TARGET_NEED_LDST_LABELS
i = tcg_out_ldst_finalize(s);
if (i < 0) {
return i;
}
#endif
...
}
static int tcg_out_ldst_finalize(TCGContext *s)
{
...
/* qemu_ld/st slow paths */
QSIMPLEQ_FOREACH(lb, &s->ldst_labels, next) {
if (lb->is_ld
? !tcg_out_qemu_ld_slow_path(s, lb)
: !tcg_out_qemu_st_slow_path(s, lb)) {
return -2;
}
...
}
static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
{
...
/* "Tail call" to the helper, with the return address back inline. */
tcg_out_push(s, retaddr);
tcg_out_jmp(s, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
return true;
}
The preparation of the slow path helper call implies:
- resolving the label offset, previously recorded during JNE generation
- arguments/registers setup
- call to the appropriate qemu_st_helper
As with other memory access functions, there are many variants of the store and load operations: signed, unsigned, byte, word, long, big or little endian. Each one has a specific TCG helper which is a wrapper around store_helper:
/* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
* uintxx_t val, int mmu_idx, uintptr_t ra)
*/
static void * const qemu_st_helpers[16] = {
[MO_UB] = helper_ret_stb_mmu,
[MO_LEUW] = helper_le_stw_mmu,
[MO_LEUL] = helper_le_stl_mmu,
[MO_LEQ] = helper_le_stq_mmu,
[MO_BEUW] = helper_be_stw_mmu,
[MO_BEUL] = helper_be_stl_mmu,
[MO_BEQ] = helper_be_stq_mmu,
};
void helper_be_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
TCGMemOpIdx oi, uintptr_t retaddr)
{
store_helper(env, addr, val, oi, retaddr, MO_BEUL);
}
static inline void QEMU_ALWAYS_INLINE
store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
TCGMemOpIdx oi, uintptr_t retaddr, MemOp op)
{
...
if (!tlb_hit_page(tlb_addr2, page2)) {
if (!victim_tlb_hit(env, mmu_idx, index2, tlb_off, page2)) {
tlb_fill(env_cpu(env), page2, size2, MMU_DATA_STORE,
mmu_idx, retaddr);
index2 = tlb_index(env, mmu_idx, page2);
entry2 = tlb_entry(env, mmu_idx, page2);
}
tlb_addr2 = tlb_addr_write(entry2);
}
...
haddr = (void *)((uintptr_t)addr + entry->addend);
store_memop(haddr, val, op);
}
The helper checks the TLBs and, if there is a miss, calls tlb_fill. In any case, once the address is resolved, it does the final memory access in the host memory with store_memop.
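The combination of a helper table indexed by memory operation and a final endian-aware store can be sketched as follows. The operation enum, the table and every function name here are hypothetical stand-ins for the MO_* constants, qemu_st_helpers and store_memop. They are not the QEMU definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory operation codes (size + endianness). */
enum { OP_8 = 0, OP_LE16, OP_LE32, OP_BE16, OP_BE32, OP_COUNT };

/* Final access: write `size` bytes of val at haddr in the given order. */
static void store_bytes(uint8_t *haddr, uint32_t val, int size, int big_endian)
{
    for (int i = 0; i < size; i++) {
        int shift = big_endian ? 8 * (size - 1 - i) : 8 * i;
        haddr[i] = (uint8_t)(val >> shift);
    }
}

/* One thin wrapper per variant, all forwarding to the common worker. */
static void st_b(uint8_t *h, uint32_t v)    { store_bytes(h, v, 1, 0); }
static void st_le_w(uint8_t *h, uint32_t v) { store_bytes(h, v, 2, 0); }
static void st_le_l(uint8_t *h, uint32_t v) { store_bytes(h, v, 4, 0); }
static void st_be_w(uint8_t *h, uint32_t v) { store_bytes(h, v, 2, 1); }
static void st_be_l(uint8_t *h, uint32_t v) { store_bytes(h, v, 4, 1); }

typedef void (*StoreFn)(uint8_t *haddr, uint32_t val);

static StoreFn const store_helpers[OP_COUNT] = {
    [OP_8]    = st_b,
    [OP_LE16] = st_le_w,
    [OP_LE32] = st_le_l,
    [OP_BE16] = st_be_w,
    [OP_BE32] = st_be_l,
};

/* Dispatch on the operation code, as the generated slow path does. */
static void do_store(int op, uint8_t *haddr, uint32_t val)
{
    assert(op >= 0 && op < OP_COUNT && store_helpers[op] != NULL);
    store_helpers[op](haddr, val);
}
```

Our stw on a big-endian PowerPC guest corresponds to the OP_BE32 entry: storing 0x11223344 leaves the bytes 11 22 33 44 in host memory, whatever the host endianness. This is the same result the fast path obtains with a single movbe instruction.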
static void tlb_fill(CPUState *cpu, target_ulong addr, int size,
MMUAccessType access_type, int mmu_idx, uintptr_t retaddr)
{
CPUClass *cc = CPU_GET_CLASS(cpu);
bool ok;
ok = cc->tlb_fill(cpu, addr, size, access_type, mmu_idx, false, retaddr);
assert(ok);
}
As you may guess, the process of filling a TLB is done by the software MMU, whose implementation depends on the emulated architecture. During the PowerPC CPU initialisation, cc->tlb_fill is set to ppc_cpu_tlb_fill.
bool ppc_cpu_tlb_fill(CPUState *cs, vaddr addr, int size,
MMUAccessType access_type, int mmu_idx,
bool probe, uintptr_t retaddr)
{
PowerPCCPU *cpu = POWERPC_CPU(cs);
PowerPCCPUClass *pcc = POWERPC_CPU_GET_CLASS(cs);
CPUPPCState *env = &cpu->env;
int ret;
if (pcc->handle_mmu_fault) {
ret = pcc->handle_mmu_fault(cpu, addr, access_type, mmu_idx);
} else {
ret = cpu_ppc_handle_mmu_fault(env, addr, access_type, mmu_idx);
}
if (unlikely(ret != 0)) {
if (probe) {
return false;
}
raise_exception_err_ra(env, cs->exception_index, env->error_code,
retaddr);
}
return true;
}
And pcc->handle_mmu_fault might eventually be set to ppc_hash32_handle_mmu_fault, depending on your PowerPC CPU family. This is where the software MMU implementation lies for the PowerPC.
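The dispatch pattern used by ppc_cpu_tlb_fill (an optional per-family hook with a generic fallback, plus a probe mode) can be reduced to the sketch below. Everything in it is hypothetical: the Cpu type, the fake translation rules and the function names. One deliberate difference is labelled in the code: where QEMU raises a guest exception and never returns, the sketch just records an error code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Cpu Cpu;
typedef int (*MmuFaultFn)(Cpu *cpu, uint32_t addr);

struct Cpu {
    MmuFaultFn handle_mmu_fault;  /* family-specific hook, may be NULL */
    int pending_exception;        /* 0 when no guest exception is pending */
};

/* Assumed fallback MMU: translates only the low 16 MiB, faults above. */
static int generic_mmu_fault(Cpu *cpu, uint32_t addr)
{
    (void)cpu;
    return addr < 0x01000000u ? 0 : 1;
}

/* Assumed family-specific MMU that can translate every address. */
static int family_mmu_fault(Cpu *cpu, uint32_t addr)
{
    (void)cpu;
    (void)addr;
    return 0;
}

/* Returns true when the translation succeeded and the TLB was filled. */
static bool cpu_tlb_fill(Cpu *cpu, uint32_t addr, bool probe)
{
    assert(cpu != NULL);
    int ret = cpu->handle_mmu_fault
            ? cpu->handle_mmu_fault(cpu, addr)
            : generic_mmu_fault(cpu, addr);
    if (ret != 0) {
        if (probe) {
            return false;             /* report the miss, raise nothing */
        }
        /* Sketch only: stands in for raise_exception_err_ra(), which in
         * QEMU unwinds to the main loop instead of returning. */
        cpu->pending_exception = ret;
        return false;
    }
    return true;
}
```

A CPU without a hook falls back to the generic handler. A probe of an unmapped address returns false without side effects, while a real access to that address leaves an exception pending for the guest.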