Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add health check package and update SQS receive func #178

Merged
merged 4 commits into from
Nov 13, 2024

Conversation

bpeng
Copy link
Contributor

@bpeng bpeng commented Nov 4, 2024

Proposed Changes

Ticket https://github.com/GeoNet/tickets/issues/13878

Changes proposed in this pull request:

  • package to check health
  • update sqs receiveMesage to pass in a callback func to do something when the receive continues waiting on empty queue

Production Changes

The following production changes are required to deploy these changes:

  • None

Review

Check the box that applies to this code review. If necessary please seek help with adding a checklist guide for the reviewer.
When assigning the code review please consider the expertise needed to review the changes.

  • This is a content (documentation, web page etc) only change.
  • This is a minor change (meta data, bug fix, improve test coverage etc).
  • This is a larger change (new feature, significant refactoring etc). Please use the code review guidelines to add a checklist below to guide the code reviewer.

Code Review Guide

Insert check list here if needed.

@bpeng bpeng changed the title add health check package feat: add health check package Nov 4, 2024
@bpeng bpeng requested a review from junghao November 4, 2024 03:15
@bpeng bpeng force-pushed the health-check branch 4 times, most recently from 8767a00 to 56ad5fc Compare November 5, 2024 21:46
@bpeng bpeng changed the title feat: add health check package feat: add health check package and update SQS receive func Nov 5, 2024
@CallumNZ
Copy link
Contributor

CallumNZ commented Nov 5, 2024

So from what I can tell, the health package is to be used by services that don't have an http server for their primary function, and simply adds one with an /soh endpoint - is that right?

How does the ECS health check work - does ECS call the endpoint at regular intervals?
How would normal usage of this package look like?

@bpeng
Copy link
Contributor Author

bpeng commented Nov 5, 2024

The check is scheduled on AWS via terraform , the health status can be viewed on dev

@bpeng bpeng requested a review from CallumNZ November 6, 2024 00:15
aws/sqs/sqs.go Outdated
Comment on lines 30 to 33
// define a func to callback when no message received from the sqs and the process continues
// to receive for a long time, for the calling func to do sth in the middle,
// e.g. send a heartbeat
type CallBackOperation func()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems wrong to couple SQS to the health check.
Can the caller be responsible for running the health check? Like how Mark does it? https://github.com/GeoNet/tickets/issues/13878#issuecomment-1667189581

Copy link
Contributor Author

@bpeng bpeng Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the health check is run by the caller, this is just sending a heartbeat in between the SQS receives, otherwise it will wait on the empty queue for ages (like the tilde-ingest-backfill) and the health check will fail, unless we don't do age check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not trust if the queue is being listened to, all is well?

What is the ECS health check trying to achieve? Is getting a successful response from the health http server not enough? What failure states are possible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is another option to do, but allow to send a heartbeat in between is also reasonable

health/service.go Show resolved Hide resolved
}

func (s *Service) handler(w http.ResponseWriter, r *http.Request) {
switch ok := s.state(); {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for scoping ok like this? Strange to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mark's code, technically correct but not easy to read

Comment on lines +103 to +159
// Alive allows an application to perform a complex task while still sending hearbeats.
func (s *Service) Alive(ctx context.Context, heartbeat time.Duration) context.CancelFunc {
ctx, cancel := context.WithCancel(ctx)

go func() {
defer cancel()

ticker := time.NewTicker(heartbeat)
defer ticker.Stop()

s.Ok()

for {
select {
case <-ticker.C:
s.Ok()
case <-ctx.Done():
return
}
}
}()

return cancel
}

// Pause allows an application to stall for a set period of time while still sending hearbeats.
func (s *Service) Pause(ctx context.Context, deadline, heartbeat time.Duration) context.CancelFunc {
ctx, cancel := context.WithTimeout(ctx, deadline)

go func() {
defer cancel()

ticker := time.NewTicker(heartbeat)
defer ticker.Stop()

s.Ok()

for {
select {
case <-ticker.C:
s.Ok()
case <-ctx.Done():
return
}
}
}()

return cancel
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use case for these? Might be unneeded complexity.
If the application has a long running process, wouldn't you just adjust aged ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently not used, don't know if there is any use case by Mark, i would keep it in case we need it later

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could Pause() while you wait to receive a message, for example

Comment on lines 25 to 27
// aged is the time if no updates have happened indicates the service is no longer running.
// set to 0 if no age check needed
aged time.Duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value will need to be coordinated with the ECS health check frequency in terraform, is that right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

health/service.go Outdated Show resolved Hide resolved
health/service.go Show resolved Hide resolved
health/service.go Outdated Show resolved Hide resolved
health/service.go Show resolved Hide resolved
health/check.go Show resolved Hide resolved
@junghao
Copy link
Contributor

junghao commented Nov 8, 2024

My suggestion to this:

In sqs.go:

type SQS struct {
	client   *sqs.Client
	Callback *func()
}

Then in (s *SQS) receiveMessage():

for {
		if s.Callback != nil {
			(*s.Callback)()
		}
                .....

So from the service, we do

	sqsClient, err = sqs.New()
	if err != nil {
		log.Fatal(err)
	}

	callback := func() {
		log.Println("callback")
	}

	sqsClient.Callback = &callback

In this way, it becomes optional and not change the signature of the main function in SQS, nor it creates a new function to hook the callback.

aws/sqs/sqs.go Outdated
@@ -134,7 +137,12 @@ func (s *SQS) receiveMessage(ctx context.Context, input *sqs.ReceiveMessageInput

switch {
case r == nil || len(r.Messages) == 0:
// no message received
// No message received
retryCount++
Copy link
Contributor

@junghao junghao Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retrying this doesn't seem to help.

No matter it's retried or not, the caller has to dealt with IsNoMessageError, and continue the loop with the same sqsClient.receiveMewssage() call.

eg.

m, err := sqsClient.receiveMessage()
if err, ok := err.(* IsNoMessageError); ok {
  continue
}

No matter we have retry inside the wrapper or not, the caller's code doesn't change. And retrying inside the wrapper does the same number of API requests as looping from the caller.

So I suggest that we remove this retry, and let the caller decide whether it wants to count the number of retries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retrying this doesn't seem to help.

Retrying just lets it to wait longer for messages to come, and with less interaction

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Say, the message arrives after 99 aws.receiveMessage.

  • The existing implementation takes one call to kit.receive, and no need to deal with empty retires.
    I would say this is the reason we want a retry inside kit.
  • Removing the retry completely takes 99 calls to kit.receive, and in main we need to deal with empty retries.
  • Adding 3 retries takes 33 calls to kit.receive, and in main we need to deal with empty retries.

Reducing the number of call to kit.receive doesn't save anything, but adding the complexity of retrying inside kit.

@bpeng
Copy link
Contributor Author

bpeng commented Nov 8, 2024

Callback
In sqs.go:

type SQS struct {
client *sqs.Client
Callback *func()
}

Yes, this is a good option to solve the problem

@bpeng bpeng force-pushed the health-check branch 6 times, most recently from d19647c to 3263b4f Compare November 12, 2024 20:40
Copy link
Contributor

@junghao junghao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use this as a start.

Copy link
Contributor

@CallumNZ CallumNZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 👍

@sue-h-gns sue-h-gns merged commit 9ad1489 into main Nov 13, 2024
7 checks passed
@sue-h-gns sue-h-gns deleted the health-check branch November 13, 2024 02:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants