piolib: data transfers slower than expected #116

Open
jepler opened this issue Jan 17, 2025 · 4 comments

jepler commented Jan 17, 2025

In my work for Adafruit, I've implemented a PIO program for driving HUB75 style displays.

The data transfer to the PIO peripheral is slower than anticipated, topping out at about 10MB/s. That's too bad, as we ideally would like to run at several times that speed. (naively, I'd expected that like on rp2, we could keep it fed with data every cycle even at the highest PIO frequencies)

Here is a simple reproducer that requires no hardware -- it just does an out x, 32 every cycle, consuming a FIFO entry each time. So e.g., at 1MHz it should consume 4MB/s, at 5MHz it should consume 20MB/s, etc. However, the transfer speed tops out at around 10MB/s:

$ for frequency in 1e6 2e6 5e6 10e6 20e6 200e6; do for xfersize in 65532; do ./build/examples/bench $frequency $xfersize ; done; done 2>/dev/null
{"frequency": 1e+06, "xfer_size": 65532, "rate": 3.99201e+06}
{"frequency": 2e+06, "xfer_size": 65532, "rate": 7.96719e+06}
{"frequency": 5e+06, "xfer_size": 65532, "rate": 1.07482e+07}
{"frequency": 1e+07, "xfer_size": 65532, "rate": 1.07461e+07}
{"frequency": 2e+07, "xfer_size": 65532, "rate": 1.07484e+07}
{"frequency": 2e+08, "xfer_size": 65532, "rate": 1.0746e+07}

Notice how the top rate is about 1e7 (i.e., 10MB/s), and does not continue increasing as the clock rate increases.

Firmware & kernel:

$ uname -a
Linux m5 6.6.70-v8+ #1 SMP PREEMPT Fri Jan 10 13:53:47 UTC 2025 aarch64 GNU/Linux
$ vcgencmd version
2025/01/08 17:52:48 
Copyright (c) 2012 Broadcom
version 97facbf4 (release) (embedded)
My test program
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include "piolib.h"
#include "ws2812.pio.h"

#define bench_wrap_target 0
#define bench_wrap 0

static const uint16_t bench_program_instructions[] = {
            // .wrap_target
    0x6020, // out x, 32
            // .wrap
};

static const struct pio_program bench_program = {
    .instructions = bench_program_instructions,
    .length = 1,
    .origin = -1,
};

static inline pio_sm_config bench_program_get_default_config(uint offset) {
    pio_sm_config c = pio_get_default_sm_config();
    sm_config_set_wrap(&c, offset + bench_wrap_target, offset + bench_wrap);
    sm_config_set_sideset(&c, 1, false, false);
    return c;
}

static inline float bench_program_init(PIO pio, int sm, int offset, float freq) {
    pio_sm_config c = bench_program_get_default_config(offset);
    sm_config_set_out_shift(&c, false, true, 32);
    sm_config_set_fifo_join(&c, PIO_FIFO_JOIN_TX);
    float div = clock_get_hz(clk_sys) / freq;
    if(div < 1) div = 1;
    if(div > 65535) div = 65535;
    int div_int = (int)div;
    int div_frac = (int)((div - div_int) * 256);
    sm_config_set_clkdiv_int_frac(&c, div_int, div_frac);
    pio_sm_init(pio, sm, offset, &c);
    pio_sm_set_enabled(pio, sm, true);
    return clock_get_hz(clk_sys) / (div_int + div_frac / 256.);
}


double monotonic() {
    struct timespec tv;
    clock_gettime(CLOCK_MONOTONIC, &tv);
    return tv.tv_sec + tv.tv_nsec * 1e-9;
}

long databuf[1048576];

int main(int argc, const char **argv)
{
    float frequency = argc > 1 ? atof(argv[1]) : 10e6;
    size_t xfer_size = argc > 2 ? atoi(argv[2]) : 256;
    PIO pio;
    int sm;
    uint offset;

    pio = pio0;
    sm = pio_claim_unused_sm(pio, true);
    pio_sm_config_xfer(pio, sm, PIO_DIR_TO_SM, xfer_size, 1);

    offset = pio_add_program(pio, &bench_program);
    fprintf(stderr, "Loaded program at %d, using sm %d\n", offset, sm);

    float actual_frequency = bench_program_init(pio, sm, offset, frequency);
    fprintf(stderr, "Actual frequency %fMHz\n", actual_frequency/1e6);
    pio_sm_clear_fifos(pio, sm);

    double t0 = monotonic();
    size_t xfer = 0;
    do {
        pio_sm_xfer_data(pio, sm, PIO_DIR_TO_SM, sizeof(databuf), databuf);
        xfer += sizeof(databuf);
    } while(monotonic() - t0 < 1);
    double t1 = monotonic();
    double dt = t1 - t0;
    double rate = xfer / dt; // bytes per second
    fprintf(stderr, "%zu bytes in %.1fms (%.1fMiB/s)\n",
        xfer, dt*1e3, rate / 1048576);
    printf("{\"frequency\": %g, \"xfer_size\": %zu, \"rate\": %g}\n",
        actual_frequency, xfer_size, rate);
    return 0;
}

PS is there a more appropriate repo to report this issue in?

jepler changed the title from "pio data transfers slower than expected" to "piolib: data transfers slower than expected" on Jan 17, 2025

pelwell commented Jan 17, 2025

> PS is there a more appropriate repo to report this issue in?

No, this is fine.


pelwell commented Jan 17, 2025

I modified the state machine to run OUT PINS, 32, having configured a pair of GPIOs for PIO, and filled the buffer with alternating words of 0x55555555 and 0xaaaaaaaa. Putting a logic analyser on the pins then shows the ticks of the SM.

The result is not at all what I expected - the steady transmission of a buffer full of data, with 40us gaps in between the buffers. The limiting factor seems to be the clock speed, which has been capped at ~2.6MHz. The exact value varies with the exact clock frequency, the clock limping slightly due to a non-integer clock divisor.

I'm encouraged by this - fixing some clock weirdness feels more tractable than making something quicker. More next week.


jepler commented Jan 17, 2025

I'm not confident it's a clocking problem.

I can think of two ways to test it. One would involve a program that turns off auto pull and uses pull noblock, something like:

	set x, 0
.wrap_target
	pull noblock
	out pins, 32
.wrap

In this case, you'll see 0x55.. or 0xaa.. in the case where data was available, and then 0 whenever data was not available, because (if I understand the docs right) x is copied back to osr in this case.

The other is what I quickly implemented: still using out x, ## but varying the count from 1 to 32. I ran this test at the max 200MHz clock but saw a data rate of around 10MB/s regardless of the PIO clock. If it was actually the PIO clock that was capped at 2.5MHz, the smaller out numbers would have lower throughput.

my new test program
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include "piolib.h"
#include "ws2812.pio.h"

#define bench_wrap_target 0
#define bench_wrap 0

static uint16_t bench_program_instructions[] = {
            // .wrap_target
    0x6020, // out x, 32
            // .wrap
};

static const struct pio_program bench_program = {
    .instructions = bench_program_instructions,
    .length = 1,
    .origin = -1,
};

static inline pio_sm_config bench_program_get_default_config(uint offset) {
    pio_sm_config c = pio_get_default_sm_config();
    sm_config_set_wrap(&c, offset + bench_wrap_target, offset + bench_wrap);
    sm_config_set_sideset(&c, 1, false, false);
    return c;
}

static inline float bench_program_init(PIO pio, int sm, int offset, float freq) {
    pio_sm_config c = bench_program_get_default_config(offset);
    sm_config_set_out_shift(&c, false, true, 32);
    sm_config_set_fifo_join(&c, PIO_FIFO_JOIN_TX);
    float div = clock_get_hz(clk_sys) / freq;
    if(div < 1) div = 1;
    if(div > 65535) div = 65535;
    int div_int = (int)div;
    int div_frac = (int)((div - div_int) * 256);
    sm_config_set_clkdiv_int_frac(&c, div_int, div_frac);
    pio_sm_init(pio, sm, offset, &c);
    pio_sm_set_enabled(pio, sm, true);
    return clock_get_hz(clk_sys) / (div_int + div_frac / 256.);
}


double monotonic() {
    struct timespec tv;
    clock_gettime(CLOCK_MONOTONIC, &tv);
    return tv.tv_sec + tv.tv_nsec * 1e-9;
}

long databuf[1048576];

int main(int argc, const char **argv)
{
    float frequency = argc > 1 ? atof(argv[1]) : 10e6;
    int out_count = argc > 2 ? atoi(argv[2]) : 32;
    size_t xfer_size = 65532;
    PIO pio;
    int sm;
    uint offset;

    pio = pio0;
    sm = pio_claim_unused_sm(pio, true);
    pio_sm_config_xfer(pio, sm, PIO_DIR_TO_SM, xfer_size, 1);

    bench_program_instructions[0] = pio_encode_out(pio_x, out_count);
    offset = pio_add_program(pio, &bench_program);
    fprintf(stderr, "Loaded program at %d, using sm %d\n", offset, sm);

    float actual_frequency = bench_program_init(pio, sm, offset, frequency);
    fprintf(stderr, "Actual frequency %fMHz\n", actual_frequency/1e6);
    pio_sm_clear_fifos(pio, sm);

    double t0 = monotonic();
    size_t xfer = 0;
    do {
        pio_sm_xfer_data(pio, sm, PIO_DIR_TO_SM, sizeof(databuf), databuf);
        xfer += sizeof(databuf);
    } while(monotonic() - t0 < 3);
    double t1 = monotonic();
    double dt = t1 - t0;
    double rate = xfer / dt; // bytes per second
    fprintf(stderr, "%zu bytes in %.1fms (%.1fMiB/s)\n",
        xfer, dt*1e3, rate / 1048576);
    printf("{\"frequency\": %g, \"out_count\": %d, \"rate\": %g}\n",
        actual_frequency, out_count, rate);
    return 0;
}

$ for b in 1 2 4 8 16 32; do ./build/examples/bench  200000000 ${b}; done 2>/dev/null
{"frequency": 2e+08, "out_count": 1, "rate": 1.07461e+07}
{"frequency": 2e+08, "out_count": 2, "rate": 1.0746e+07}
{"frequency": 2e+08, "out_count": 4, "rate": 1.0746e+07}
{"frequency": 2e+08, "out_count": 8, "rate": 1.07461e+07}
{"frequency": 2e+08, "out_count": 16, "rate": 1.07461e+07}
{"frequency": 2e+08, "out_count": 32, "rate": 1.07461e+07}


jepler commented Jan 17, 2025

The "pull noblock" test, scoped. Because of my existing probe setup, I have pi5 pins 5 & 6 on scope channels 4 & 6.

PIO clock configured to 10MHz:

Image

PIO clock configured to 100MHz:

Image

Note how the pulses get shorter (the PIO is clocking at the expected rate) but stay the same distance apart (data is arriving in the PIO FIFO more slowly than expected).

The pull noblock test
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include "piolib.h"
#include "ws2812.pio.h"

#define bench_wrap_target 1
#define bench_wrap 2

static const uint16_t bench_program_instructions[] = {
    0xe020, // set x,0
            // .wrap_target
    0x8080, // pull noblock
    0x6000, // out pins, 32
            // .wrap
};

static const struct pio_program bench_program = {
    .instructions = bench_program_instructions,
    .length = 3,
    .origin = -1,
};

static inline pio_sm_config bench_program_get_default_config(uint offset) {
    pio_sm_config c = pio_get_default_sm_config();
    sm_config_set_wrap(&c, offset + bench_wrap_target, offset + bench_wrap);
    sm_config_set_sideset(&c, 1, false, false);
    return c;
}

static inline float bench_program_init(PIO pio, int sm, int offset, float freq, int gpio_base) {
    pio_sm_config c = bench_program_get_default_config(offset);
    sm_config_set_out_shift(&c, false, false /* auto pull */, 32);
    sm_config_set_out_pins(&c, 0, 32);
    sm_config_set_fifo_join(&c, PIO_FIFO_JOIN_TX);
    float div = clock_get_hz(clk_sys) / freq;
    if(div < 1) div = 1;
    if(div > 65535) div = 65535;
    int div_int = (int)div;
    int div_frac = (int)((div - div_int) * 256);
    sm_config_set_clkdiv_int_frac(&c, div_int, div_frac);
    pio_sm_init(pio, sm, offset, &c);
    pio_sm_set_enabled(pio, sm, true);
    pio_gpio_init(pio, gpio_base);
    pio_gpio_init(pio, gpio_base+1);
    pio_sm_set_consecutive_pindirs(pio, sm, gpio_base, 2, true);
    return clock_get_hz(clk_sys) / (div_int + div_frac / 256.);
}


double monotonic() {
    struct timespec tv;
    clock_gettime(CLOCK_MONOTONIC, &tv);
    return tv.tv_sec + tv.tv_nsec * 1e-9;
}

long databuf[1048576];

int main(int argc, const char **argv)
{
    float frequency = argc > 1 ? atof(argv[1]) : 10e6;
    size_t xfer_size = 65532;
    PIO pio;
    int sm;
    uint offset;

    pio = pio0;
    sm = pio_claim_unused_sm(pio, true);
    pio_sm_config_xfer(pio, sm, PIO_DIR_TO_SM, xfer_size, 1);

    offset = pio_add_program(pio, &bench_program);
    fprintf(stderr, "Loaded program at %d, using sm %d\n", offset, sm);

    float actual_frequency = bench_program_init(pio, sm, offset, frequency, /* base pin */ 5);
    fprintf(stderr, "Actual frequency %fMHz\n", actual_frequency/1e6);
    pio_sm_clear_fifos(pio, sm);

    for(size_t i=0; i<sizeof(databuf)/sizeof(databuf[0]); i++ )
        databuf[i] = i % 2 ? 0x55555555 : 0xaaaaaaaa;

    double t0 = monotonic();
    size_t xfer = 0;
    do {
        pio_sm_xfer_data(pio, sm, PIO_DIR_TO_SM, sizeof(databuf), databuf);
        xfer += sizeof(databuf);
    } while(monotonic() - t0 < 3);
    double t1 = monotonic();
    double dt = t1 - t0;
    double rate = xfer / dt; // bytes per second
    fprintf(stderr, "%zu bytes in %.1fms (%.1fMiB/s)\n",
        xfer, dt*1e3, rate / 1048576);
    printf("{\"frequency\": %g, \"rate\": %g}\n",
        actual_frequency, rate);
    return 0;
}
