Fix iperf no connection and PerfResult warmup timestamp calculation #382
Conversation
When the iperf job fails, we create a slightly different PerfResult data structure, which can cause problems in other parts of the code that assume a stable structure. This fixes that by adding one additional level of a SequentialPerfResult. Signed-off-by: Ondrej Lichtner <[email protected]>
When creating these PerfIntervals representing zero flows, we need to set the duration to a non-zero value to avoid zero-division issues. Signed-off-by: Ondrej Lichtner <[email protected]>
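To illustrate what the two commits above describe, here is a minimal, self-contained sketch using hypothetical stand-in classes (not the actual LNST result types): the failure path keeps the same extra level of nesting as a successful run, and the placeholder interval gets a non-zero duration so averaging does not divide by zero.

```python
# Hypothetical stand-ins, not the real LNST classes; for illustration only.

class PerfInterval:
    """A single measured interval."""
    def __init__(self, value, duration, unit, timestamp):
        self.value = value
        self.duration = duration
        self.unit = unit
        self.timestamp = timestamp

    @property
    def average(self):
        # A zero duration would raise ZeroDivisionError here, which is why
        # the "no data" placeholder interval gets a non-zero duration.
        return self.value / self.duration


class SequentialPerfResult(list):
    """A list of intervals measured back to back."""
    @property
    def start_timestamp(self):
        return self[0].timestamp

    @property
    def end_timestamp(self):
        return self[-1].timestamp + self[-1].duration


# A successful run produces: flow -> SequentialPerfResult -> PerfInterval(s).
ok_flow = SequentialPerfResult([PerfInterval(1000, 5, "bits", 1700000000.0)])

# The failure path used to return a flatter structure; wrapping the placeholder
# in the same SequentialPerfResult level keeps the shape consumers expect.
failed_flow = SequentialPerfResult([PerfInterval(0, 1, "bits", 1700000000.0)])

for flow in (ok_flow, failed_flow):
    print(flow.start_timestamp, flow.end_timestamp, flow[0].average)
```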
This will need some testing on real data to see if the warmup/warmdown trimming isn't causing issues... I tested this locally by printing the difference between the old calculation and the new one, and the difference was minimal. The original implementation comes from #248, so that will be relevant for testing this.
Besides my concern about the warmup/warmdown timestamps, this looks ok.
Did you test this on any parallel iperf test we have internally? E.g. 35469f23-2f89-4b8a-a40d-d29395f71fdf (96 parallel flows)
This is an arm test which I can't run at the moment... instead I'll schedule a job with all tests that have
Test job scheduled: Edit: I made a mistake in scheduling... new job:
Scheduled a test for the other solution to check the difference.
Force-pushed from 83c9e63 to fef6817
Compared the "Option A" test run vs the "Option B" test run here: There are a couple of tests where different results are seen, however IMO these are mostly due to the unstable nature of multistream performance tests that we usually see anyway. Both test runs evaluate just OK when compared to historical baselines. So IMO both implementations are basically equivalent. As one more step, I'll run a multistream test with 64 threads locally and print out the calculation of the `warmup_end`.
So the local comparison from running a test in containers is here. The `warmup_end` code I used:
The output I got:
We can see that optionA and optionB are identical and they're always basically at most
The recipe I used for this locally was with this configuration:
And it is worth noting that the startup of the iperf clients took a while, as the difference between the first and the last client is sizable:
Did the same with the
logs:
Here we see some small difference between A and B, but when comparing A to old or B to old, IMO the difference is irrelevant.
Force-pushed from fef6817 to 8e924a8
With the new data and testing I believe the simpler option A implementation is better and there should be no real impact on any of our testing.
Besides my only comment, this looks ok.
lnst/RecipeCommon/Perf/Measurements/Results/NeperFlowMeasurementResults.py (review comment outdated, resolved)
This index-based calculation is inaccurate if the data structure doesn't reflect a single very specific use case... instead we can simply calculate the timestamps by addition... Signed-off-by: Ondrej Lichtner <[email protected]>
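As a rough illustration of the difference this commit describes (hypothetical helper functions and made-up interval data, not the LNST implementation): the index-based variant assumes the relevant timestamp sits at a fixed position in one specific layout, while the addition-based variant only needs the start timestamp and the warmup durations, so it does not depend on how the intervals are grouped. With a flat, evenly spaced layout the two results differ only by the small per-interval timestamp drift, which matches the minimal differences mentioned earlier in this thread.

```python
# Hypothetical illustration; interval records are made up, not from a real run.

def warmup_end_by_index(intervals, warmup):
    # Index based: take the timestamp of the first interval after the warmup,
    # which only holds for one specific data layout.
    return intervals[warmup]["timestamp"]

def warmup_end_by_addition(intervals, warmup):
    # Addition based: start of the measurement plus the durations of the
    # warmup intervals, independent of how the intervals are nested.
    start = intervals[0]["timestamp"]
    return start + sum(i["duration"] for i in intervals[:warmup])

# Per-second intervals whose timestamps drift slightly, as real iperf
# interval reports tend to.
intervals = [{"timestamp": 100.0 + i * 1.001, "duration": 1.0} for i in range(10)]

print(warmup_end_by_index(intervals, warmup=2))     # 102.002
print(warmup_end_by_addition(intervals, warmup=2))  # 102.0
```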
Force-pushed from 8e924a8 to 3cf4333
`XDPBenchMeasurements` results used to use `FlowMeasurementResults` as a base class, overriding some of its methods, because the xdp-bench tool doesn't measure CPU usage and so the `FlowMeasurementResults` CPU metrics were set to `None`. Recent changes from [0] now expect `FlowMeasurementResults` to use both perf and CPU results. In particular, the `.{start,end}_timestamp` properties expect CPU metrics to be set; if they are not, the code crashes, since the absence of CPU metrics is "masked" by setting `None` instead of regular result containers. [0] LNST-project#382
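A short, self-contained sketch of the failure mode described above, with hypothetical class and attribute names rather than the real `XDPBenchMeasurements`/`FlowMeasurementResults` code: a timestamp property that reads both the flow results and the CPU results breaks as soon as the CPU side is masked with `None` instead of a regular (possibly empty) result container.

```python
# Hypothetical sketch; names do not match the real LNST classes.

class Interval:
    def __init__(self, timestamp, duration):
        self.timestamp = timestamp
        self.duration = duration


class Results:
    """A result container holding flow data and (optionally) CPU data."""

    def __init__(self, flow_intervals, cpu_intervals):
        self.flow_data = flow_intervals
        self.cpu_data = cpu_intervals

    @property
    def start_timestamp(self):
        # Reads both sides; crashes if cpu_data is None instead of a
        # regular result container -- the problem described above.
        return min(self.flow_data[0].timestamp, self.cpu_data[0].timestamp)


flow = [Interval(100.0, 1.0)]

ok = Results(flow, [Interval(100.5, 1.0)])
print(ok.start_timestamp)  # 100.0

masked = Results(flow, None)
try:
    print(masked.start_timestamp)
except TypeError as exc:  # 'NoneType' object is not subscriptable
    print(f"masking CPU results with None breaks the property: {exc}")
```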
Description
This should fix the issues we regularly see with lnst crashing with an exception when iperf fails to establish a connection. The source of the issue is that the perf result data structure created in that case was slightly different.
While testing this I also found issues with the warmup/warmdown timestamp calculation, because the created zero iperf measurements don't have a duration long enough to include a warmup/warmdown... this however shouldn't crash the trimming, which should instead return another zero measurement.
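To make the intended trimming behaviour concrete, here is a minimal sketch (hypothetical function and data layout, not the LNST trimming code): if the measurement is too short to contain the warmup and warmdown windows, as the zero placeholder measurements are, trimming falls back to a zero measurement instead of raising.

```python
# Hypothetical sketch of warmup/warmdown trimming that degrades gracefully.
# Intervals are (value, duration) tuples; times are in seconds.

def trim(intervals, warmup, warmdown):
    """Drop `warmup` seconds from the start and `warmdown` seconds from the end."""
    total = sum(duration for _, duration in intervals)
    if total <= warmup + warmdown:
        # Too short to trim (e.g. the zero placeholder created when iperf
        # failed to connect): return a zero measurement instead of crashing.
        return [(0, 1)]

    trimmed = []
    elapsed = 0.0
    for value, duration in intervals:
        if elapsed >= warmup and elapsed + duration <= total - warmdown:
            trimmed.append((value, duration))
        elapsed += duration
    return trimmed


print(trim([(1000, 1.0)] * 10, warmup=2, warmdown=2))  # the 6 middle intervals
print(trim([(0, 1.0)], warmup=2, warmdown=2))          # [(0, 1)] fallback
```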
Tests
(Please provide a list of tests that prove that the pull
request doesn't break the stable state of the master branch. This should
include test runs with valid results for all of critical workflows.)
Reviews
@jtluka @enhaut @Axonis
Closes: #