Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add incomplete events #872

Merged
merged 1 commit into from
Jun 13, 2024
Merged

Add incomplete events #872

merged 1 commit into from
Jun 13, 2024

Conversation

benjben
Copy link
Contributor

@benjben benjben commented Feb 10, 2024

This PR adds incomplete events to Enrich flow.

Before, enriching an event was resulting in either a BadRow or an EnrichedEvent. What's new is that now a BadRow can have an EnrichedEvent associated with it, which is the incomplete event.

All the changes are related to the fact that now we don't short circuit the processing as soon as there is an error but we keep going with what is valid.

As a result we're not using ValidatedNel and EitherT any more.

To make things easier to follow, I reorganized the flow in 3 steps:

@benjben benjben changed the title Add incomplete events (close #) Add incomplete events Feb 12, 2024
Copy link
Contributor

@istreeter istreeter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No real strong objections from me on this. I do think it has made the code harder to read compared with how it was before, but maybe that's inevitable when adding a complex feature.

The point about Ior is that it has a Monad on the right hand side. So it might allow arranging the code like this:

for {
  a <- doSomethingThatMightYieldBadRows(collectorPayload)
  b <- doSomethingElseThatMightYieldBadRows(a)
  event <- andSomethingElse(b)
} yield event

...so you accumulate failures on the left hand side, while continuing to process the event on the right hand side.

There's even a IorT which is similar to an EitherT.

I understand this work was done to unblock the next phases of this project, so I don't think we should stick around for too long searching for perfection.

@benjben
Copy link
Contributor Author

benjben commented Feb 12, 2024

Thanks for the review @istreeter ! I will address your feedback.

I do think it has made the code harder to read compared with how it was before

Yeah I feel the same 🥲 But the logic has become more complicated, so I'm afraid it's an inevitable outcome. Now after each step we need to check if there are bad rows and the value of emitIncomplete.

The point about Ior is that it has a Monad on the right hand side

I'm not sure that we will benefit from it, as we need to check for bad rows after each step, and not only at the end. But I'll try Ior to see if it can simplify the workflow.

@benjben
Copy link
Contributor Author

benjben commented Feb 13, 2024

Hi @istreeter , I addressed your feedback, please have a look. I agree that having Ior[BadRow, EnrichedEvent] as the return type of enrichEvent looks nicer. Then I tried to use Ior everywhere in the flow but I didn't like it. One problem is that to get the errors you need to pattern match and to have case Ior.Left whereas we know already that we will never be in this case.

for {
a <- doSomethingThatMightYieldBadRows(collectorPayload)
b <- doSomethingElseThatMightYieldBadRows(a)
event <- andSomethingElse(b)
} yield event
...so you accumulate failures on the left hand side, while continuing to process the event on the right hand side.

I might be missing something because I don't see how this can work. Before executing doSomethingElseThatMightYieldBadRows we need to check the errors that are in a (not everyone will have incomplete events activated), how would that work? I want to be able to accumulate the errors and to continue the processing (so no short circuiting with IorT).

An alternative that I envisaged was to use IorT and to have a Ref[F, List[BadRow]] . We would pass it to each function, along with emitIncomplete. Then inside each function, when there is an error, if emitIncomplete is set to false, we return a Left which short circuis the rest of the processing. If set to true, we add the bad row to the Ref and we return Right, which will continue the processing. And at the end we look at what is inside the Ref to decide what to return. Do you think that it would be nicer? We can jump on a call if it's unclear.

@istreeter
Copy link
Contributor

Hey @benjben

Before executing doSomethingElseThatMightYieldBadRows we need to check the errors that are in a (not everyone will have incomplete events activated), how would that work?

Imagine for a second that everyone has incomplete events activated. Would that remove the need to check errors after each step, and therefore remove the part that makes it messy?

Under what circumstances would someone have this feature disabled? And in those cases, could we continue to process the event, but simply not emit it in the final step?

(Not saying this is the right solution, but it's worth exploring while we're trying to find options available to us).

@benjben
Copy link
Contributor Author

benjben commented Feb 13, 2024

Hey @istreeter

Imagine for a second that everyone has incomplete events activated. Would that remove the need to check errors after each step, and therefore remove the part that makes it messy?

Indeed that would remove the need for that and would make the flow look nicer.

Under what circumstances would someone have this feature disabled?

This feature will cost money as it will require an additional stream and additional apps to run, so I can imagine customers not wanting it.

@colmsnowplow should we consider that once this feature exists we will activate it for all our customers ?

And in those cases, could we continue to process the event, but simply not emit it in the final step?

For sure we can but that would waste resources. That's not very satisfying to force features only to make the code look nicer.

@colmsnowplow
Copy link
Contributor

I think I can offer a point of view here.

First just to double check I'm reading it right:

Before executing doSomethingElseThatMightYieldBadRows we need to check the errors that are in a (not everyone will have incomplete events activated), how would that work?

I'm not 100% clear why we need to check the errors, but it sounds like it's so that we can decide not to continue processing the event if we hit an error. Just stating this in case it's not correct. :)

Under what circumstances would someone have this feature disabled?

This could happen! An example - one of the customers in the research uses the JS enrichment for custom bot filtering. They specifically don't want that data to go to the db. First version of this they will most likely want to just disable it. Some customers might simply not want the additional cost of an extra stream and an extra loader.

And in those cases, could we continue to process the event, but simply not emit it in the final step?

This would be acceptable. But perhaps we can find something cleaner. (Not sure as haven't yet carved out the time to read this PR properly)

@colmsnowplow
Copy link
Contributor

@colmsnowplow should we consider that once this feature exists we will activate it for all our customers ?

I think we were editing at the same time, think I answered that at least at first I don't think we can. @stanch, FYI, think you'll agree. :)

@colmsnowplow
Copy link
Contributor

colmsnowplow commented Feb 13, 2024

I'm not 100% sure I understand the code enough to weigh in heavily, but I think at this point it is worth considering the other option - the goal is to choose the right design, so if there's a trade-off like this I think we should consider it.

On this point:

For sure we can but that would waste resources. That's not very satisfying to force features only to make the code look nicer.

I wouldn't say it's just about making the code look nice. It's about making the code easier to understand and easier to work with. Which in turn reduces risk of bugs.

I don't know which option is better but I do think that simpler code has tangible value, and it's worth consideration. If one option is margnially less efficient but significantly less complex, IMO it is often the better choice. (But I don't know enough to say if that applies here)

@benjben benjben force-pushed the incomplete_events branch 3 times, most recently from ecef087 to b404c0a Compare February 16, 2024 09:59
@benjben benjben requested a review from istreeter February 16, 2024 09:59
Copy link
Contributor

@istreeter istreeter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are lots of places where you've changed to emitting a SchemaViolation instead of an EnrichmentFailure. This definitely needs checking with others!

It is a design decision that affects how users interact with the bad data. It deserves a discussion beyond just me and you in a code review.

@benjben benjben requested a review from istreeter February 19, 2024 14:35
@benjben benjben force-pushed the incomplete_events branch 2 times, most recently from 463b54a to 140ff91 Compare February 20, 2024 20:14
@benjben benjben force-pushed the incomplete_events branch from fc4b389 to b2f8067 Compare March 6, 2024 18:42
@benjben benjben requested a review from istreeter March 6, 2024 18:43
@benjben benjben changed the base branch from develop to switch_to_schema_violations March 6, 2024 18:45
@benjben benjben force-pushed the switch_to_schema_violations branch from 378fe6c to 9846a2f Compare March 7, 2024 16:20
Base automatically changed from switch_to_schema_violations to develop March 8, 2024 08:51
@benjben benjben force-pushed the incomplete_events branch from b2f8067 to e6df527 Compare March 8, 2024 09:26
@benjben benjben force-pushed the incomplete_events branch from 0e92213 to ebe793f Compare April 9, 2024 08:52
@benjben benjben force-pushed the incomplete_events branch from f5d611a to 838d0bc Compare May 17, 2024 09:16
benjben added a commit that referenced this pull request Jun 11, 2024
#872)

Before this change, any error in the enriching workflow would short circuit
and a failed event would get emitted as JSON. After this change, if incomplete events
are enabled, the enriching goes to the end with what is possible,
accumulating errors as it goes. Errors get attached in derived_contexts
and a failed event gets emitted to a third stream with TSV format
(same as enriched event).
@benjben benjben force-pushed the incomplete_events branch from 838d0bc to d12c41a Compare June 11, 2024 15:24
@benjben benjben changed the base branch from develop to master June 11, 2024 16:29
@benjben benjben changed the base branch from master to develop June 11, 2024 16:29
@benjben benjben force-pushed the incomplete_events branch from 0ecd9f8 to 317b485 Compare June 12, 2024 13:42
@istreeter
Copy link
Contributor

I had already approved it, but it looks even better now ✅

#872)

Before this change, any error in the enriching workflow would short circuit
and a failed event would get emitted as JSON. After this change, if incomplete events
are enabled, the enriching goes to the end with what is possible,
accumulating errors as it goes. Errors get attached in derived_contexts
and a failed event gets emitted to a third stream with TSV format
(same as enriched event).
@benjben benjben force-pushed the incomplete_events branch from 317b485 to e11145d Compare June 13, 2024 15:48
@benjben benjben merged commit a599505 into develop Jun 13, 2024
1 check failed
@benjben benjben deleted the incomplete_events branch June 13, 2024 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants