
Support for mixed precision builtins #1786

Open
thomasmelvin opened this issue Jul 5, 2022 · 15 comments
Assignees
Labels
LFRic Issue relates to the LFRic domain Release Planning and creating PSyclone releases

Comments

@thomasmelvin (Collaborator)

LFRic ticket #3055 introduces four new builtins to support mixed-precision work. These will need to be implemented in PSyclone. Briefly, they are:

  1. invoke_rdouble_X_innerproduct_X: compute x.x, where the individual sums are promoted to double precision and the output is cast to r_def precision
  2. invoke_rdouble_X_innerproduct_Y: compute x.y, where the individual sums are promoted to double precision and the output is cast to r_def precision
  3. invoke_copy_to_rsolver: effectively setval_X(x, y), where x is r_solver and y is r_def
  4. invoke_copy_to_rdef: effectively setval_X(x, y), where x is r_def and y is r_solver

The final two could be achieved by a generalisation of the setval_X builtin using generic interfaces (as used elsewhere in the LFRic code).
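A rough Python sketch of the intended semantics of the four builtins (illustration only, not PSyclone or LFRic code; single precision is emulated with ctypes.c_float, and the kind mapping r_solver = 32-bit, r_def = r_double = 64-bit is an assumption, since the real kinds are configurable):

```python
from ctypes import c_float

def f32(v):
    # Round to the nearest single-precision value; stands in for r_solver.
    return c_float(v).value

def rdouble_X_innerproduct_X(x):
    # x.x: each term promoted to double precision (Python floats are
    # 64-bit) before summing; result returned at (assumed 64-bit) r_def.
    return sum(float(v) ** 2 for v in x)

def rdouble_X_innerproduct_Y(x, y):
    # x.y with the products formed in double precision.
    return sum(float(a) * float(b) for a, b in zip(x, y))

def copy_to_rsolver(y):
    # setval_X-style copy of an r_def field into an r_solver field.
    return [f32(v) for v in y]

def copy_to_rdef(y):
    # setval_X-style copy of an r_solver field into an r_def field.
    return [float(v) for v in y]
```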

@thomasmelvin (Collaborator, Author)

In more detail, the current (PSyKAl-lite) implementation of these is:

    ! Perform innerproduct of a r_solver precision field in r_double precision
    subroutine invoke_rdouble_X_innerproduct_X(field_norm, field)

      use scalar_mod,         only: scalar_type
      use omp_lib,            only: omp_get_thread_num
      use omp_lib,            only: omp_get_max_threads
      use mesh_mod,           only: mesh_type
      use r_solver_field_mod, only: r_solver_field_type, r_solver_field_proxy_type

      implicit none

      real(kind=r_def), intent(out) :: field_norm
      type(r_solver_field_type), intent(in) :: field

      type(scalar_type)                           :: global_sum
      integer(kind=i_def)                         :: df
      real(kind=r_double), allocatable, dimension(:) :: l_field_norm
      integer(kind=i_def)                         :: th_idx
      integer(kind=i_def)                         :: loop0_start, loop0_stop
      integer(kind=i_def)                         :: nthreads
      type(r_solver_field_proxy_type)             :: field_proxy
      integer(kind=i_def)                         :: max_halo_depth_mesh
      type(mesh_type), pointer                    :: mesh => null()
      !
      ! Determine the number of OpenMP threads
      !
      nthreads = omp_get_max_threads()
      !
      ! Initialise field and/or operator proxies
      !
      field_proxy = field%get_proxy()
      !
      ! Create a mesh object
      !
      mesh => field_proxy%vspace%get_mesh()
      max_halo_depth_mesh = mesh%get_halo_depth()
      !
      ! Set-up all of the loop bounds
      !
      loop0_start = 1
      loop0_stop = field_proxy%vspace%get_last_dof_owned()
      !
      ! Call kernels and communication routines
      !
      !
      ! Zero summation variables
      !
      field_norm = 0.0_r_def
      ALLOCATE (l_field_norm(nthreads))
      l_field_norm = 0.0_r_double
      !
      !$omp parallel default(shared), private(df,th_idx)
      th_idx = omp_get_thread_num()+1
      !$omp do schedule(static)
      DO df=loop0_start,loop0_stop
        l_field_norm(th_idx) = l_field_norm(th_idx) + real(field_proxy%data(df),r_double)**2
      END DO
      !$omp end do
      !$omp end parallel
      !
      ! sum the partial results sequentially
      !
      DO th_idx=1,nthreads
        field_norm = field_norm+real(l_field_norm(th_idx),r_def)
      END DO
      DEALLOCATE (l_field_norm)
      global_sum%value = field_norm
      field_norm = global_sum%get_sum()
      !
    end subroutine invoke_rdouble_X_innerproduct_X

    ! Perform innerproduct of two r_solver precision fields in r_double precision
    subroutine invoke_rdouble_X_innerproduct_Y(field_norm, field1, field2)

      use scalar_mod,         only: scalar_type
      use omp_lib,            only: omp_get_thread_num
      use omp_lib,            only: omp_get_max_threads
      use mesh_mod,           only: mesh_type
      use r_solver_field_mod, only: r_solver_field_type, r_solver_field_proxy_type

      implicit none

      real(kind=r_def), intent(out) :: field_norm
      type(r_solver_field_type), intent(in) :: field1, field2

      type(scalar_type)                           :: global_sum
      integer(kind=i_def)                         :: df
      real(kind=r_double), allocatable, dimension(:) :: l_field_norm
      integer(kind=i_def)                         :: th_idx
      integer(kind=i_def)                         :: loop0_start, loop0_stop
      integer(kind=i_def)                         :: nthreads
      type(r_solver_field_proxy_type)             :: field1_proxy, field2_proxy
      integer(kind=i_def)                         :: max_halo_depth_mesh
      type(mesh_type), pointer                    :: mesh => null()
      !
      ! Determine the number of OpenMP threads
      !
      nthreads = omp_get_max_threads()
      !
      ! Initialise field and/or operator proxies
      !
      field1_proxy = field1%get_proxy()
      field2_proxy = field2%get_proxy()
      !
      ! Create a mesh object
      !
      mesh => field1_proxy%vspace%get_mesh()
      max_halo_depth_mesh = mesh%get_halo_depth()
      !
      ! Set-up all of the loop bounds
      !
      loop0_start = 1
      loop0_stop = field1_proxy%vspace%get_last_dof_owned()
      !
      ! Call kernels and communication routines
      !
      !
      ! Zero summation variables
      !
      field_norm = 0.0_r_def
      ALLOCATE (l_field_norm(nthreads))
      l_field_norm = 0.0_r_double
      !
      !$omp parallel default(shared), private(df,th_idx)
      th_idx = omp_get_thread_num()+1
      !$omp do schedule(static)
      DO df=loop0_start,loop0_stop
        l_field_norm(th_idx) = l_field_norm(th_idx) + real(field1_proxy%data(df),r_double)*real(field2_proxy%data(df),r_double)
      END DO
      !$omp end do
      !$omp end parallel
      !
      ! sum the partial results sequentially
      !
      DO th_idx=1,nthreads
        field_norm = field_norm+real(l_field_norm(th_idx),r_def)
      END DO
      DEALLOCATE (l_field_norm)
      global_sum%value = field_norm
      field_norm = global_sum%get_sum()
      !
    end subroutine invoke_rdouble_X_innerproduct_Y

    ! Copy a field_type to a r_solver_field_type
    subroutine invoke_copy_to_rsolver(rsolver_field, field)

      use omp_lib,            only: omp_get_thread_num
      use omp_lib,            only: omp_get_max_threads
      use mesh_mod,           only: mesh_type
      use r_solver_field_mod, only: r_solver_field_type, r_solver_field_proxy_type
      use field_mod,          only: field_type, field_proxy_type

      implicit none

      type(r_solver_field_type), intent(inout) :: rsolver_field
      type(field_type),          intent(in)    :: field

      integer(kind=i_def)             :: df
      integer(kind=i_def)             :: loop0_start, loop0_stop
      type(r_solver_field_proxy_type) :: rsolver_field_proxy
      type(field_proxy_type)          :: field_proxy
      integer(kind=i_def)             :: max_halo_depth_mesh
      type(mesh_type), pointer        :: mesh => null()
      !
      ! Initialise field and/or operator proxies
      !
      rsolver_field_proxy = rsolver_field%get_proxy()
      field_proxy = field%get_proxy()
      !
      ! Create a mesh object
      !
      mesh => rsolver_field_proxy%vspace%get_mesh()
      max_halo_depth_mesh = mesh%get_halo_depth()
      !
      ! Set-up all of the loop bounds
      !
      loop0_start = 1
      loop0_stop = rsolver_field_proxy%vspace%get_last_dof_halo(1)
      !
      ! Call kernels and communication routines
      !
      IF (field_proxy%is_dirty(depth=1)) THEN
        CALL field_proxy%halo_exchange(depth=1)
      END IF
      !
      !$omp parallel default(shared), private(df)
      !$omp do schedule(static)
      DO df=loop0_start,loop0_stop
        rsolver_field_proxy%data(df) = real(field_proxy%data(df), r_solver)
      END DO
      !$omp end do
      !
      ! Set halos dirty/clean for fields modified in the above loop
      !
      !$omp master
      CALL rsolver_field_proxy%set_dirty()
      CALL rsolver_field_proxy%set_clean(1)
      !$omp end master
      !
      !$omp end parallel
      !
    end subroutine invoke_copy_to_rsolver

    ! Copy a r_solver_field_type to a field_type
    subroutine invoke_copy_to_rdef(rdef_field, field)

      use omp_lib,            only: omp_get_thread_num
      use omp_lib,            only: omp_get_max_threads
      use mesh_mod,           only: mesh_type
      use r_solver_field_mod, only: r_solver_field_type, r_solver_field_proxy_type
      use field_mod,          only: field_type, field_proxy_type

      implicit none

      type(field_type),          intent(inout) :: rdef_field
      type(r_solver_field_type), intent(in)    :: field

      integer(kind=i_def)             :: df
      integer(kind=i_def)             :: loop0_start, loop0_stop
      type(r_solver_field_proxy_type) :: field_proxy
      type(field_proxy_type)          :: rdef_field_proxy
      integer(kind=i_def)             :: max_halo_depth_mesh
      type(mesh_type), pointer        :: mesh => null()
      !
      ! Initialise field and/or operator proxies
      !
      rdef_field_proxy = rdef_field%get_proxy()
      field_proxy = field%get_proxy()
      !
      ! Create a mesh object
      !
      mesh => rdef_field_proxy%vspace%get_mesh()
      max_halo_depth_mesh = mesh%get_halo_depth()
      !
      ! Set-up all of the loop bounds
      !
      loop0_start = 1
      loop0_stop = rdef_field_proxy%vspace%get_last_dof_halo(1)
      !
      ! Call kernels and communication routines
      !
      IF (field_proxy%is_dirty(depth=1)) THEN
        CALL field_proxy%halo_exchange(depth=1)
      END IF
      !
      !$omp parallel default(shared), private(df)
      !$omp do schedule(static)
      DO df=loop0_start,loop0_stop
        rdef_field_proxy%data(df) = real(field_proxy%data(df), r_def)
      END DO
      !$omp end do
      !
      ! Set halos dirty/clean for fields modified in the above loop
      !
      !$omp master
      CALL rdef_field_proxy%set_dirty()
      CALL rdef_field_proxy%set_clean(1)
      !$omp end master
      !
      !$omp end parallel
      !
    end subroutine invoke_copy_to_rdef

@TeranIvy TeranIvy self-assigned this Jul 5, 2022
@arporter arporter added the LFRic Issue relates to the LFRic domain label Sep 27, 2022
@TeranIvy TeranIvy added the Release Planning and creating PSyclone releases label Sep 29, 2023
@TeranIvy (Collaborator)

I was just notified by LFRicians that it would be good to have this in a release (early next year) so that we can remove some LFRic infrastructure code that relies on PSyKAl-lite.

@stevemullerworth (Collaborator)

> I was just notified by LFRicians that it would be good to have this in a release (early next year) so that we can remove some LFRic infrastructure code that relies on PSyKAl-lite.

Yes. Specifically, r_solver_field_vector_mod in the infrastructure uses a PSyKAl-lite builtin from GungHo, and early next year we are planning to split the repository to move apps such as GungHo to a different repository.

@oakleybrunt (Collaborator)

Regarding invoke_rdouble_X_innerproduct_X, the current implementation in the PSyKAl-lite code (given in this issue) uses two loops to achieve the aim.

      th_idx = omp_get_thread_num()+1
      ! calculate the inner product of each element after converting to double
      DO df=loop0_start,loop0_stop
        l_field_norm(th_idx) = l_field_norm(th_idx) + real(field_proxy%data(df),r_double)**2
      END DO
      ! sum the partial results sequentially, converting the output to r_def at each step
      DO th_idx=1,nthreads
        field_norm = field_norm+real(l_field_norm(th_idx),r_def)
      END DO

When creating a builtin for this, is it better to combine these into one line rather than starting two loops over the array?

For example:

      th_idx = omp_get_thread_num()+1
      DO df=loop0_start,loop0_stop
        field_norm = field_norm + real(real(field_proxy%data(df),r_double)**2, r_def)
      END DO

Does nesting like this even work?
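A plain-Python sketch of this (illustration only, not PSyclone code; single precision emulated with ctypes.c_float) suggests the nested conversion itself is fine, and also shows why the double-precision accumulator matters:

```python
from ctypes import c_float

def f32(v):
    # Round to the nearest single-precision value (stands in for r_solver).
    return c_float(v).value

# Nested conversions work: cast up, square, cast the result back down.
nested = f32(float(f32(3.0)) ** 2)

# Why accumulating in double matters: in single precision, adding 1e-8 to
# 1.0 does nothing (it is below half an ulp at 1.0), so the sum never moves...
acc32 = 1.0
for _ in range(100_000):
    acc32 = f32(acc32 + f32(1e-8))

# ...whereas a 64-bit accumulator (a Python float) keeps every contribution.
acc64 = 1.0
for _ in range(100_000):
    acc64 = acc64 + f32(1e-8)
```

If r_def were narrower than r_double, casting every term back down inside the loop would forfeit most of the benefit of the promotion; accumulating in double and casting once at the end (as the PSyKAl-lite code does) avoids that.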

@arporter (Member) commented Nov 29, 2023

Hi @oakleybrunt, the rule for proper Kernels (as opposed to their PSyKAl-lite implementation) is that they must make no reference to parallelism (which, in this case, means no references to the thread index). Here it will be PSyclone's job to do the reduction. (PSyclone has support for generating the type of code used here, which guarantees matching run-for-run results on the same number of OMP threads.) I think all you need to do is copy the implementation of the other innerproduct builtin in PSyclone but include the casting.
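The reproducible-reduction pattern being described can be sketched in plain Python (illustration only; the function name and the round-robin partition are invented, and PSyclone's static schedule actually hands each thread a contiguous chunk):

```python
from concurrent.futures import ThreadPoolExecutor

def reproducible_sum_of_squares(data, nthreads):
    # One accumulator slot per thread (the l_field_norm array in the
    # generated Fortran).
    partial = [0.0] * nthreads

    def work(tid):
        # Deterministic partition: thread tid handles elements
        # tid, tid + nthreads, tid + 2*nthreads, ...
        for x in data[tid::nthreads]:
            partial[tid] += x * x

    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        list(pool.map(work, range(nthreads)))

    # Combine the partial results sequentially, in a fixed order, so the
    # rounding is identical on every run with this thread count.
    total = 0.0
    for p in partial:
        total += p
    return total
```

Because the partition and the final combination order are both fixed, the floating-point rounding is the same run-for-run for a given nthreads, which an unordered OpenMP reduction clause does not guarantee.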

@oakleybrunt (Collaborator)

Hey @arporter, I included those lines for context, really. Though I guess it isn't hard for anyone to work out what th_idx is.

Okay, so will I need to cast twice like in the example I gave?

@arporter (Member) commented Nov 29, 2023

In the Builtin itself you will need to ensure that the LHS of the Assignment has the correct precision and that the RHS has a cast. The code you want to look at is:

# Get indexed references for the field (proxy) argument.
arg_refs = self.get_indexed_field_argument_references()
# Get a reference for the kernel scalar reduction argument.
lhs = self._reduction_reference()
# Create the PSyIR for the kernel:
#     asum = asum + proxy0%data(df) * proxy0%data(df)
mult_op = BinaryOperation.create(BinaryOperation.Operator.MUL,
                                 arg_refs[0].copy(), arg_refs[0])
rhs = BinaryOperation.create(BinaryOperation.Operator.ADD,
                             lhs.copy(), mult_op)
assign = Assignment.create(lhs, rhs)
# Finally, replace this kernel node with the Assignment
self.replace_with(assign)
return assign
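As a toy illustration of where the cast would sit in that tree (these classes are invented stand-ins, not the real PSyIR node API): wrap each field reference in a cast node before building the MUL, and the rendered Fortran then contains the real(..., r_double) calls:

```python
class Ref:
    # Stand-in for a PSyIR Reference.
    def __init__(self, name):
        self.name = name
    def render(self):
        return self.name

class Cast:
    # Stand-in for a cast node, rendered as a Fortran real() call.
    def __init__(self, child, kind):
        self.child, self.kind = child, kind
    def render(self):
        return f"real({self.child.render()},{self.kind})"

class BinOp:
    # Stand-in for a PSyIR BinaryOperation.
    OPS = {"MUL": "*", "ADD": "+"}
    def __init__(self, op, lhs, rhs):
        self.op, self.lhs, self.rhs = op, lhs, rhs
    def render(self):
        return f"{self.lhs.render()} {self.OPS[self.op]} {self.rhs.render()}"

# asum = asum + real(proxy0%data(df),r_double) * real(proxy0%data(df),r_double)
mult = BinOp("MUL",
             Cast(Ref("proxy0%data(df)"), "r_double"),
             Cast(Ref("proxy0%data(df)"), "r_double"))
rhs = BinOp("ADD", Ref("asum"), mult)
```

The LHS (asum) would additionally need to be declared with the target precision, which is the part handled by the reduction machinery below.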

However, you are also going to have to update the bit of PSyclone that generates the reduction result and that's not going to be so easy because it doesn't yet use the PSyIR properly:
def local_reduction_name(self):
    '''
    :returns: a local reduction variable name that is unique for the
        current reduction argument name. This is used for
        thread-local reductions with reproducible reductions.
    :rtype: str
    '''
    # TODO #2381: Revisit symbol creation, now moved to the
    # Kern._reduction_reference() method, and try to associate it
    # with the PSy-layer generation or relevant transformation.
    return "l_" + self.reduction_arg.name

def zero_reduction_variable(self, parent, position=None):
    '''
    Generate code to zero the reduction variable and to zero the local
    reduction variable if one exists. The latter is used for reproducible
    reductions, if specified.

    :param parent: the Node in the AST to which to add new code.
    :type parent: :py:class:`psyclone.psyir.nodes.Node`
    :param str position: where to position the new code in the AST.

    :raises GenerationError: if the variable to zero is not a scalar.
    :raises GenerationError: if the reprod_pad_size (read from the
        configuration file) is less than 1.
    :raises GenerationError: for a reduction into a scalar that is
        neither 'real' nor 'integer'.
    '''
    if not position:
        position = ["auto"]
    var_name = self._reduction_arg.name
    local_var_name = self.local_reduction_name
    var_arg = self._reduction_arg
    # Check for a non-scalar argument
    if not var_arg.is_scalar:
        raise GenerationError(
            f"Kern.zero_reduction_variable() should be a scalar but "
            f"found '{var_arg.argument_type}'.")
    # Generate the reduction variable
    var_data_type = var_arg.intrinsic_type
    if var_data_type == "real":
        data_value = "0.0"
    elif var_data_type == "integer":
        data_value = "0"
    else:
        raise GenerationError(
            f"Kern.zero_reduction_variable() should be either a 'real' or "
            f"an 'integer' scalar but found scalar of type "
            f"'{var_arg.intrinsic_type}'.")
    # Retrieve the precision information (if set) and append it
    # to the initial reduction value
    if var_arg.precision:
        kind_type = var_arg.precision
        zero_sum_variable = "_".join([data_value, kind_type])
    else:
        kind_type = ""
        zero_sum_variable = data_value
    parent.add(AssignGen(parent, lhs=var_name, rhs=zero_sum_variable),
               position=position)
    if self.reprod_reduction:
        parent.add(DeclGen(parent, datatype=var_data_type,
                           entity_decls=[local_var_name],
                           allocatable=True, kind=kind_type,
                           dimension=":,:"))
        nthreads = \
            self.scope.symbol_table.lookup_with_tag("omp_num_threads").name
        if Config.get().reprod_pad_size < 1:
            raise GenerationError(
                f"REPROD_PAD_SIZE in {Config.get().filename} should be a "
                f"positive integer, but it is set to "
                f"'{Config.get().reprod_pad_size}'.")
        pad_size = str(Config.get().reprod_pad_size)
        parent.add(AllocateGen(parent, local_var_name + "(" + pad_size +
                               "," + nthreads + ")"), position=position)
        parent.add(AssignGen(parent, lhs=local_var_name,
                             rhs=zero_sum_variable), position=position)

def reduction_sum_loop(self, parent):
    '''
    Generate the appropriate code to place after the end parallel
    region.

    :param parent: the Node in the f2pygen AST to which to add new code.
    :type parent: :py:class:`psyclone.f2pygen.SubroutineGen`

    :raises GenerationError: for an unsupported reduction access in
        LFRicBuiltIn.
    '''
    var_name = self._reduction_arg.name
    local_var_name = self.local_reduction_name
    # A non-reproducible reduction requires a single-valued argument
    local_var_ref = self._reduction_reference().name
    # A reproducible reduction requires a multi-valued argument stored
    # as a padded array separately for each thread
    if self.reprod_reduction:
        local_var_ref = FortranWriter().arrayreference_node(
            self._reduction_reference())
    reduction_access = self._reduction_arg.access
    try:
        reduction_operator = REDUCTION_OPERATOR_MAPPING[reduction_access]
    except KeyError as err:
        api_strings = [access.api_specific_name()
                       for access in REDUCTION_OPERATOR_MAPPING]
        raise GenerationError(
            f"Unsupported reduction access "
            f"'{reduction_access.api_specific_name()}' found in "
            f"LFRicBuiltIn:reduction_sum_loop(). Expected one of "
            f"{api_strings}.") from err
    symtab = self.scope.symbol_table
    thread_idx = symtab.lookup_with_tag("omp_thread_index").name
    nthreads = symtab.lookup_with_tag("omp_num_threads").name
    do_loop = DoGen(parent, thread_idx, "1", nthreads)
    do_loop.add(AssignGen(do_loop, lhs=var_name, rhs=var_name +
                          reduction_operator + local_var_ref))
    parent.add(do_loop)
    parent.add(DeallocateGen(parent, local_var_name))

def _reduction_reference(self):
    '''
    Return the reference to the reduction variable if OpenMP is set to
    be unreproducible, as we will be using the OpenMP reduction clause.
    Otherwise we will be computing the reduction ourselves and therefore
    need to store values into a (padded) array separately for each
    thread.

    :returns: reference to the variable to be reduced.
    :rtype: :py:class:`psyclone.psyir.nodes.Reference` or
        :py:class:`psyclone.psyir.nodes.ArrayReference`
    '''
    # TODO #2381: Revisit symbol creation, moved from the
    # Kern.local_reduction_name property, and try to associate it
    # with the PSy-layer generation or relevant transformation.
    symtab = self.scope.symbol_table
    reduction_name = self.reduction_arg.name
    # Return a multi-valued ArrayReference for a reproducible reduction
    if self.reprod_reduction:
        array_dim = [
            Literal("1", INTEGER_TYPE),
            Reference(symtab.lookup_with_tag("omp_thread_index"))]
        reduction_array = ArrayType(
            symtab.lookup(reduction_name).datatype, array_dim)
        local_reduction = DataSymbol(
            self.local_reduction_name, datatype=reduction_array)
        symtab.find_or_create_tag(
            tag=self.local_reduction_name,
            symbol_type=DataSymbol, datatype=reduction_array)
        return ArrayReference.create(
            local_reduction, array_dim)
    # Return a single-valued Reference for a non-reproducible reduction
    return Reference(symtab.lookup(reduction_name))

In fact, it may be that someone at STFC (myself, @sergisiso or @rupertford) needs to do that bit. Take a look and see what you think.

@arporter (Member)

The code generated by PSyclone (and that present in the PSyKAl-lite you are looking at) is specifically designed to ensure that if you run the same code on the same number of OpenMP threads then you get the same answer. See https://psyclone.readthedocs.io/en/stable/transformations.html#openmp-reductions for a bit more info. This is optional functionality: by default PSyclone will just use OpenMP's built-in support for reductions, but that doesn't give reproducible results.


@oakleybrunt (Collaborator)

Okay, I understand more... but also have more questions!

Questions:

  1. What do you mean when you say the reduction reference "doesn't yet use the PSyIR properly"?
  2. Why is REPRODUCIBLE_REDUCTIONS set to false? Would anything break if this was set to true?
  3. Why can't the rdouble_x_innerproduct_x work right now? Do you want to enforce reproducibility on this builtin?

@arporter (Member)

> What do you mean when you say the reduction reference "doesn't yet use the PSyIR properly"? Why is REPRODUCIBLE_REDUCTIONS set to false? Would anything break if this was set to true?

So, in PSyclone there's an old way of doing the code generation and a new way. We are gradually migrating to the new way and both the GOcean and NEMO APIs use it. LFRic does not yet use it. The new way is to construct everything in terms of Nodes in a PSyIR tree and then give that tree to a 'backend' that generates the (Fortran) code. The old way uses functionality from the f2pygen module and builds up an entirely separate (Fortran specific) tree. Code is then generated by doing str(root_node) on that.

I can't quite remember why REPRODUCIBLE_REDUCTIONS is set to False. Probably it harms performance and also it's good to be able to keep things as simple as possible sometimes.

> Why can't the rdouble_x_innerproduct_x work right now? Do you want to enforce reproducibility on this builtin?

You're going to have to help me here - is that the one you're working on? If you're creating a new builtin then it needs to support the option for reproducible reductions - we can't have exceptions. I'm guessing that the answer to your question is: "because the code that generates the reduction (using f2pygen) doesn't know about precision".

@oakleybrunt (Collaborator)

> the code that generates the reduction (using f2pygen) doesn't know about precision

Ahhh this is what I was missing, thank you!!

> The old way uses functionality from the f2pygen module and builds up an entirely separate (Fortran-specific) tree

Okay, I see. It's all linking together now. I had read this in the User or Developer guide today but didn't realise that the old way does not recognize precisions.

Does the new way know about precision?

@arporter (Member)

Well, it's not so much that it doesn't know about precision (I think it can be supplied), it's more that the code that currently uses it will be assuming/hardwired to use r_def and i_def I think. The PSyIR does have a full 'type system' that understands precision: https://psyclone-dev.readthedocs.io/en/latest/psyir_symbols.html

@rupertford (Collaborator)

Where possible, the current version of PSyclone extracts the precision of variables from the algorithm layer in LFRic. This information is required for declaring variables appropriately in the PSy-layer. Whether that information ends up in PSyIR or still uses f2pygen I don't know.

@TeranIvy (Collaborator) commented Dec 19, 2023

Note, after a discussion with Chris we have decided to revisit the approach to global reductions in #1570, as there is ongoing work on using 32- and 64-bit definitions of global sums in LFRic instead of r_def and other science-based precisions.

@oakleybrunt added support for real-to-real conversion built-ins (PR #2420) and that will make it into the upcoming PSyclone release 2.5.0 (issue #2447).
