Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error injection and fixes to resilient OpenMP execution spaces #71

Open
wants to merge 34 commits into
base: main
Choose a base branch
from

Conversation

ElisabethGiem
Copy link
Contributor

@ElisabethGiem ElisabethGiem commented Sep 18, 2024

An omnibus PR to bring Kokkos Resilience up to date with execution space work.

  • Adds error injection for one and two-dimensional ResOMP space views
  • Changes behavior of triple modular redundancy and dual modular redundancy cmake options from previous
  • Changes kernel fusion cmake option to accept WorkTag if present
  • Various error and bugfixes
  • Removes (some) unnecessary comments

Apologies for the mega PR, smaller PRs in the future!

@nmm0 nmm0 changed the title Elisabethgiem/resilient dmr Error injection and fixes to resilient OpenMP execution spaces Sep 19, 2024
@ElisabethGiem ElisabethGiem self-assigned this Oct 16, 2024
@nmm0 nmm0 self-requested a review October 17, 2024 20:11
@nmm0 nmm0 force-pushed the elisabethgiem/resilient-dmr branch from 1914fa8 to 0174855 Compare October 17, 2024 20:29
@nmm0 nmm0 marked this pull request as ready for review October 17, 2024 20:51
@nmm0
Copy link
Contributor

nmm0 commented Oct 22, 2024

Can you rebase this on main to get the CI fix?

@ElisabethGiem ElisabethGiem force-pushed the elisabethgiem/resilient-dmr branch 2 times, most recently from 2579e4a to 47cff8d Compare October 24, 2024 23:26
@ElisabethGiem ElisabethGiem force-pushed the elisabethgiem/resilient-dmr branch from 47cff8d to 75992b4 Compare November 13, 2024 19:59
Copy link
Contributor

@nmm0 nmm0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial part of the review, will continue with more

@@ -1,6 +1,9 @@
cmake_minimum_required(VERSION 3.17)
project(kokkos-resilience VERSION 0.1.0)

#OLD bheavior is deprecated by definition and may be removed in future
cmake_policy(SET CMP0144 OLD)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't add this

@@ -10,7 +13,7 @@ add_library(Kokkos::resilience ALIAS resilience)


option(KR_ALL_WARNINGS "Enable all warnings" ON)
option(KR_WARNINGS_AS_ERRORS "Enable warnings as errors" ON)
option(KR_WARNINGS_AS_ERRORS "Enable warnings as errors" OFF)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep this as on -- if you need to set this to off do so in your configure

@@ -138,6 +142,22 @@ export(TARGETS resilience
FILE resilienceTargets.cmake
)

if (Kokkos_ENABLE_Cuda)
target_compile_definitions(resilience PUBLIC KR_ENABLE_CUDA)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How necessary is this?

@@ -48,6 +48,8 @@
#include <iomanip>
#include <iostream>

#include <iostream>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be removed

}
}
else{
std::cout << "Error finding error_rate. Were global error settings enabled?\n";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we need this output

KokkosResilience::clear_duplicates_map();
#endif
repeats--;

}// while (!success & repeats left)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should remove dead code here


repeats--;

}// while (!success & repeats left)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove dead code here


// Range policy implementation
template <class CombinedFunctorReducerType, class... Traits>
class ParallelReduce< CombinedFunctorReducerType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this guy working properly? If not lets omit from the PR

Comment on lines +103 to +104
inline static std::chrono::duration<long int, std::ratio<1, 1000000000>> elapsed_seconds{};
inline static std::chrono::duration<long int, std::ratio<1, 1000000000>> total_error_time{};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
inline static std::chrono::duration<long int, std::ratio<1, 1000000000>> elapsed_seconds{};
inline static std::chrono::duration<long int, std::ratio<1, 1000000000>> total_error_time{};
inline static std::chrono::duration<long int, std::nano> elapsed_seconds{};
inline static std::chrono::duration<long int, std::nano> total_error_time{};

inline static size_t global_next_inject = 0;
};

struct ETimer{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
struct ETimer{
struct ErrorTimerSettings{

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants