Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set epoch_seed_change attribute on SimulationDataset #840

Merged

Conversation

srstevenson
Copy link
Contributor

@srstevenson srstevenson commented Dec 3, 2024

Set epoch_seed_change attribute on SimulationDataset

This was added to the StreamingDataset which the SimulationDataset inherits, so also needed to be added here. Without this, the code attempts to access the missing attribute when running a simulation:

AttributeError: 'SimulationDataset' object has no attribute 'epoch_seed_change'
Traceback:
File "/home/scott/projects/streaming/.venv/lib64/python3.12/site-packages/streamlit/runtime/scriptrunner/exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
             ^^^^^^
File "/home/scott/projects/streaming/.venv/lib64/python3.12/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 579, in code_to_exec
    exec(code, module.__dict__)
File "/home/scott/projects/streaming/simulation/interfaces/sim_ui.py", line 409, in <module>
    submit_jobs(shuffle_quality, dataset, time_per_sample, node_internet_bandwidth,
File "/home/scott/projects/streaming/simulation/interfaces/sim_ui.py", line 110, in submit_jobs
    for output in gen_sim:
                  ^^^^^^^
File "/home/scott/projects/streaming/simulation/core/main.py", line 110, in simulate
    samples_per_node = dataset.get_samples_per_node(epoch, 0)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/scott/projects/streaming/simulation/core/sim_dataset.py", line 367, in get_samples_per_node
    partition = generate_work(self.batching_method, self, self.world, epoch, sample_in_epoch)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/scott/projects/streaming/streaming/base/batching/__init__.py", line 45, in generate_work
    return get(dataset, world, epoch, sample_in_epoch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/scott/projects/streaming/streaming/base/batching/random.py", line 49, in generate_work_random_batching
    shuffle_units, small_per_big = dataset.resample_streams(epoch)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/scott/projects/streaming/streaming/base/dataset.py", line 878, in resample_streams
    epoch, self.epoch_seed_change)
           ^^^^^^^^^^^^^^^^^^^^^^

Closes #831

@srstevenson srstevenson force-pushed the srs/add-missing-call-to-parent branch from c78a1c7 to 64cd184 Compare December 9, 2024 22:07
@snarayan21
Copy link
Collaborator

Hey @srstevenson, we actually don't want to call StreamingDataset's constructor here -- SimulationDataset is meant to run on single process unlike StreamingDataset so there's some logic that's different between the two init methods. The epoch_seed_change attribute not being present is because when that attribute was added to StreamingDataset it was not also added to SimulationDataset. Admittedly, this isn't the best design pattern. But yeah the immediate solution would be to copy over that logic from StreamingDataset to SimulationDataset.

@srstevenson srstevenson marked this pull request as draft December 10, 2024 21:59
@srstevenson
Copy link
Contributor Author

Thanks, @snarayan21, that makes sense. I've marked this PR as draft for now, and will update it to set epoch_seed_change without calling the parent constructor.

@srstevenson srstevenson force-pushed the srs/add-missing-call-to-parent branch from 64cd184 to 148157b Compare December 16, 2024 18:38
@srstevenson srstevenson changed the title Add missing call to parent constructor in SimulationDataset Set epoch_seed_change attribute on SimulationDataset Dec 16, 2024
@srstevenson srstevenson marked this pull request as ready for review December 16, 2024 18:39
@srstevenson
Copy link
Contributor Author

@snarayan21 I've updated this to just set epoch_seed_change with the same logic as in the parent class, PTAL.

@snarayan21
Copy link
Collaborator

Hey @srstevenson, just getting back to this PR now. Before I approve, can you ensure that you can correctly run the simulator with this change now? Just to make sure that we can resolve issue #831

This was added to the `StreamingDataset` which the `SimulationDataset`
inherits, so also needed to be added here. Without this, the code
attempts to access the missing attribute when running a simulation:

```
AttributeError: 'SimulationDataset' object has no attribute 'epoch_seed_change'
Traceback:
File "/home/scott/projects/streaming/.venv/lib64/python3.12/site-packages/streamlit/runtime/scriptrunner/exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
             ^^^^^^
File "/home/scott/projects/streaming/.venv/lib64/python3.12/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 579, in code_to_exec
    exec(code, module.__dict__)
File "/home/scott/projects/streaming/simulation/interfaces/sim_ui.py", line 409, in <module>
    submit_jobs(shuffle_quality, dataset, time_per_sample, node_internet_bandwidth,
File "/home/scott/projects/streaming/simulation/interfaces/sim_ui.py", line 110, in submit_jobs
    for output in gen_sim:
                  ^^^^^^^
File "/home/scott/projects/streaming/simulation/core/main.py", line 110, in simulate
    samples_per_node = dataset.get_samples_per_node(epoch, 0)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/scott/projects/streaming/simulation/core/sim_dataset.py", line 367, in get_samples_per_node
    partition = generate_work(self.batching_method, self, self.world, epoch, sample_in_epoch)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/scott/projects/streaming/streaming/base/batching/__init__.py", line 45, in generate_work
    return get(dataset, world, epoch, sample_in_epoch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/scott/projects/streaming/streaming/base/batching/random.py", line 49, in generate_work_random_batching
    shuffle_units, small_per_big = dataset.resample_streams(epoch)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/scott/projects/streaming/streaming/base/dataset.py", line 878, in resample_streams
    epoch, self.epoch_seed_change)
           ^^^^^^^^^^^^^^^^^^^^^^
```

Closes #831
@srstevenson srstevenson force-pushed the srs/add-missing-call-to-parent branch from 148157b to ec9c4c4 Compare January 6, 2025 13:08
@srstevenson
Copy link
Contributor Author

Yes, the simulator is running without errors from this branch (rebased on main as of 63e8907):
Screenshot From 2025-01-06 13-07-22

@snarayan21
Copy link
Collaborator

@srstevenson perfect thanks! lgtm

Copy link
Collaborator

@snarayan21 snarayan21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@snarayan21 snarayan21 merged commit 9165c9e into mosaicml:main Jan 6, 2025
7 checks passed
@srstevenson srstevenson deleted the srs/add-missing-call-to-parent branch January 6, 2025 20:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Streaming Simulator errors with default parameters
2 participants