-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add health check package and update SQS receive func #178
Conversation
8767a00
to
56ad5fc
Compare
So from what I can tell, the How does the ECS health check work - does ECS call the endpoint at regular intervals? |
The check is scheduled on AWS via terraform , the health status can be viewed on dev |
aws/sqs/sqs.go
Outdated
// define a func to callback when no message received from the sqs and the process continues | ||
// to receive for a long time, for the calling func to do sth in the middle, | ||
// e.g. send a heartbeat | ||
type CallBackOperation func() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems wrong to couple SQS to the health check.
Can the caller be responsible for running the health check? Like how Mark does it? https://github.com/GeoNet/tickets/issues/13878#issuecomment-1667189581
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the health check is run by the caller, this is just sending a heartbeat in between the SQS receives, otherwise it will wait on the empty queue for ages (like the tilde-ingest-backfill
) and the health check will fail, unless we don't do age check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not trust if the queue is being listened to, all is well?
What is the ECS health check trying to achieve? Is getting a successful response from the health http server not enough? What failure states are possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is another option to do, but allow to send a heartbeat in between is also reasonable
health/service.go
Outdated
} | ||
|
||
func (s *Service) handler(w http.ResponseWriter, r *http.Request) { | ||
switch ok := s.state(); { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for scoping ok
like this? Strange to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mark's code, technically correct but not easy to read
// Alive allows an application to perform a complex task while still sending hearbeats. | ||
func (s *Service) Alive(ctx context.Context, heartbeat time.Duration) context.CancelFunc { | ||
ctx, cancel := context.WithCancel(ctx) | ||
|
||
go func() { | ||
defer cancel() | ||
|
||
ticker := time.NewTicker(heartbeat) | ||
defer ticker.Stop() | ||
|
||
s.Ok() | ||
|
||
for { | ||
select { | ||
case <-ticker.C: | ||
s.Ok() | ||
case <-ctx.Done(): | ||
return | ||
} | ||
} | ||
}() | ||
|
||
return cancel | ||
} | ||
|
||
// Pause allows an application to stall for a set period of time while still sending hearbeats. | ||
func (s *Service) Pause(ctx context.Context, deadline, heartbeat time.Duration) context.CancelFunc { | ||
ctx, cancel := context.WithTimeout(ctx, deadline) | ||
|
||
go func() { | ||
defer cancel() | ||
|
||
ticker := time.NewTicker(heartbeat) | ||
defer ticker.Stop() | ||
|
||
s.Ok() | ||
|
||
for { | ||
select { | ||
case <-ticker.C: | ||
s.Ok() | ||
case <-ctx.Done(): | ||
return | ||
} | ||
} | ||
}() | ||
|
||
return cancel | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the use case for these? Might be unneeded complexity.
If the application has a long running process, wouldn't you just adjust aged
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently not used, don't know if there is any use case by Mark, i would keep it in case we need it later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could Pause()
while you wait to receive a message, for example
health/service.go
Outdated
// aged is the time if no updates have happened indicates the service is no longer running. | ||
// set to 0 if no age check needed | ||
aged time.Duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value will need to be coordinated with the ECS health check frequency in terraform, is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
My suggestion to this: In sqs.go:
Then in
So from the service, we do
In this way, it becomes optional and not change the signature of the main function in SQS, nor it creates a new function to hook the callback. |
aws/sqs/sqs.go
Outdated
@@ -134,7 +137,12 @@ func (s *SQS) receiveMessage(ctx context.Context, input *sqs.ReceiveMessageInput | |||
|
|||
switch { | |||
case r == nil || len(r.Messages) == 0: | |||
// no message received | |||
// No message received | |||
retryCount++ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retrying this doesn't seem to help.
No matter it's retried or not, the caller has to dealt with IsNoMessageError
, and continue the loop with the same sqsClient.receiveMewssage()
call.
eg.
m, err := sqsClient.receiveMessage()
if err, ok := err.(* IsNoMessageError); ok {
continue
}
No matter we have retry inside the wrapper or not, the caller's code doesn't change. And retrying inside the wrapper does the same number of API requests as looping from the caller.
So I suggest that we remove this retry, and let the caller decide whether it wants to count the number of retries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retrying this doesn't seem to help.
Retrying just lets it to wait longer for messages to come, and with less interaction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Say, the message arrives after 99 aws.receiveMessage.
- The existing implementation takes one call to kit.receive, and no need to deal with empty retires.
I would say this is the reason we want a retry inside kit. - Removing the retry completely takes 99 calls to kit.receive, and in main we need to deal with empty retries.
- Adding 3 retries takes 33 calls to kit.receive, and in main we need to deal with empty retries.
Reducing the number of call to kit.receive doesn't save anything, but adding the complexity of retrying inside kit.
Yes, this is a good option to solve the problem |
d19647c
to
3263b4f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use this as a start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good 👍
Proposed Changes
Ticket https://github.com/GeoNet/tickets/issues/13878
Changes proposed in this pull request:
Production Changes
The following production changes are required to deploy these changes:
Review
Check the box that applies to this code review. If necessary please seek help with adding a checklist guide for the reviewer.
When assigning the code review please consider the expertise needed to review the changes.
Code Review Guide
Insert check list here if needed.