Add incomplete events #872

benjben · 2024-02-10T13:51:20Z

This PR adds incomplete events to the Enrich flow.

Previously, enriching an event resulted in either a BadRow or an EnrichedEvent. What's new is that a BadRow can now have an EnrichedEvent associated with it, which is the incomplete event.

All the changes follow from the fact that we no longer short-circuit the processing as soon as there is an error; instead we keep going with what is valid.

As a result we're not using ValidatedNel and EitherT any more.

To make things easier to follow, I reorganized the flow in 3 steps:

1. Mapping and validating the input.
2. Running the enrichments. This step now takes care of validating the enrichments contexts.
3. Validating the atomic fields lengths.
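
As a rough illustration of the three steps above, here is a minimal sketch in which every step returns the errors it found together with whatever part of the event is still valid, and a fold accumulates the errors without short-circuiting. All names, types and validation rules here (`Step`, `maxLength`, the `ua` field, and so on) are hypothetical stand-ins and are not taken from the Enrich code base.

```scala
object ThreeStepSketch {
  // Hypothetical stand-ins; the real BadRow and EnrichedEvent are far richer.
  final case class BadRow(message: String)
  final case class EnrichedEvent(fields: Map[String, String])

  // A step reports the errors it found and returns whatever is still valid.
  type Step = EnrichedEvent => (List[BadRow], EnrichedEvent)

  // Step 1: map and validate the input (illustrative rule: app_id must be non-empty).
  val mapAndValidate: Step = e =>
    if (e.fields.get("app_id").exists(_.nonEmpty)) (Nil, e)
    else (List(BadRow("missing app_id")), e)

  // Step 2: run the enrichments (illustrative rule: derive ua_family from ua).
  val runEnrichments: Step = e =>
    if (e.fields.contains("ua")) (Nil, e.copy(fields = e.fields + ("ua_family" -> "parsed")))
    else (List(BadRow("no ua to enrich")), e)

  // Step 3: validate field lengths (illustrative limit), dropping offending fields.
  val maxLength = 8
  val validateLengths: Step = e => {
    val (tooLong, ok) = e.fields.partition { case (_, v) => v.length > maxLength }
    (tooLong.keys.toList.sorted.map(k => BadRow(s"$k too long")), EnrichedEvent(ok))
  }

  // Run all steps in order, accumulating errors instead of stopping at the first one.
  def run(input: EnrichedEvent): (List[BadRow], EnrichedEvent) =
    List(mapAndValidate, runEnrichments, validateLengths)
      .foldLeft((List.empty[BadRow], input)) { case ((errors, event), step) =>
        val (found, next) = step(event)
        (errors ++ found, next)
      }
}
```

An event with an empty `app_id` and an over-long field goes through all three steps and comes out with two accumulated errors plus the fields that survived.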

...les/common-fs2/src/main/scala/com/snowplowanalytics/snowplow/enrich/common/fs2/package.scala

istreeter

No real strong objections from me on this. I do think it has made the code harder to read compared with how it was before, but maybe that's inevitable when adding a complex feature.

The point about Ior is that it has a Monad on the right hand side. So it might allow arranging the code like this:

for {
  a <- doSomethingThatMightYieldBadRows(collectorPayload)
  b <- doSomethingElseThatMightYieldBadRows(a)
  event <- andSomethingElse(b)
} yield event

...so you accumulate failures on the left hand side, while continuing to process the event on the right hand side.

There's even an IorT which is similar to an EitherT.
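
To make the suggestion concrete, here is a self-contained sketch of the idea. It does not use cats: `Result` is a simplified, hypothetical stand-in for `Ior[List[String], A]`, with `Failed`, `Ok` and `Partial` playing the roles of `Ior.Left`, `Ior.Right` and `Ior.Both`. The three function names come from the snippet above; their bodies and the string-based payload format are invented for illustration.

```scala
object IorStyleSketch {
  // Simplified stand-in for Ior with a fixed List[String] on the left.
  sealed trait Result[+A] {
    def flatMap[B](f: A => Result[B]): Result[B] = this match {
      case Failed(errors) => Failed(errors) // nothing on the right: stop here
      case Ok(a)          => f(a)
      case Partial(errors, a) =>
        // Keep going with the right-hand value, carrying earlier errors along.
        f(a) match {
          case Failed(more)     => Failed(errors ++ more)
          case Ok(b)            => Partial(errors, b)
          case Partial(more, b) => Partial(errors ++ more, b)
        }
    }
    def map[B](f: A => B): Result[B] = flatMap(a => Ok(f(a)))
  }
  final case class Failed(errors: List[String]) extends Result[Nothing]
  final case class Ok[+A](value: A) extends Result[A]
  final case class Partial[+A](errors: List[String], value: A) extends Result[A]

  // Illustrative rule: comma-separated items without '=' are reported and dropped.
  def doSomethingThatMightYieldBadRows(payload: String): Result[List[String]] = {
    val (good, bad) = payload.split(",").toList.partition(_.contains("="))
    val errors = bad.map(b => s"malformed: $b")
    if (bad.isEmpty) Ok(good) else if (good.isEmpty) Failed(errors) else Partial(errors, good)
  }

  // Illustrative rule: a missing app_id is an error, but the fields are still usable.
  def doSomethingElseThatMightYieldBadRows(pairs: List[String]): Result[Map[String, String]] = {
    val fields = pairs.map { p => val (k, v) = p.span(_ != '='); k -> v.drop(1) }.toMap
    if (fields.contains("app_id")) Ok(fields) else Partial(List("missing app_id"), fields)
  }

  def andSomethingElse(fields: Map[String, String]): Result[Map[String, String]] =
    Ok(fields + ("enriched" -> "true"))

  def enrich(collectorPayload: String): Result[Map[String, String]] =
    for {
      a <- doSomethingThatMightYieldBadRows(collectorPayload)
      b <- doSomethingElseThatMightYieldBadRows(a)
      event <- andSomethingElse(b)
    } yield event
}
```

With this shape, `enrich("oops,platform=web")` reaches the last step and still reports both the malformed item and the missing `app_id`. Processing only stops when a step has nothing at all to put on the right-hand side.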

I understand this work was done to unblock the next phases of this project, so I don't think we should stick around for too long searching for perfection.

.../main/scala/com.snowplowanalytics.snowplow.enrich/common/enrichments/EnrichmentManager.scala

...les/common/src/main/scala/com.snowplowanalytics.snowplow.enrich/common/utils/IgluUtils.scala

benjben · 2024-02-12T13:22:20Z

Thanks for the review @istreeter ! I will address your feedback.

I do think it has made the code harder to read compared with how it was before

Yeah I feel the same 🥲 But the logic has become more complicated, so I'm afraid it's an inevitable outcome. Now after each step we need to check if there are bad rows and the value of emitIncomplete.

The point about Ior is that it has a Monad on the right hand side

I'm not sure that we will benefit from it, as we need to check for bad rows after each step, and not only at the end. But I'll try Ior to see if it can simplify the workflow.

benjben · 2024-02-13T08:17:04Z

Hi @istreeter, I addressed your feedback, please have a look. I agree that having Ior[BadRow, EnrichedEvent] as the return type of enrichEvent looks nicer. Then I tried to use Ior everywhere in the flow but I didn't like it. One problem is that to get the errors you need to pattern match and handle case Ior.Left, even though we already know that we will never be in that case.

for {
  a <- doSomethingThatMightYieldBadRows(collectorPayload)
  b <- doSomethingElseThatMightYieldBadRows(a)
  event <- andSomethingElse(b)
} yield event

...so you accumulate failures on the left hand side, while continuing to process the event on the right hand side.

I might be missing something because I don't see how this can work. Before executing doSomethingElseThatMightYieldBadRows we need to check the errors that are in a (not everyone will have incomplete events activated), how would that work? I want to be able to accumulate the errors and to continue the processing (so no short circuiting with IorT).

An alternative that I envisaged was to use IorT and to have a Ref[F, List[BadRow]]. We would pass it to each function, along with emitIncomplete. Then inside each function, when there is an error, if emitIncomplete is set to false, we return a Left, which short-circuits the rest of the processing. If it is set to true, we add the bad row to the Ref and we return a Right, which continues the processing. At the end we look at what is inside the Ref to decide what to return. Do you think that it would be nicer? We can jump on a call if it's unclear.
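
A rough sketch of how that alternative could look, with heavy simplifications: a plain mutable buffer stands in for `Ref[F, List[BadRow]]` (so this is not concurrency-safe and has no effect type), and `Either` stands in for the short-circuiting transformer. The helper `report` and the two validation steps are hypothetical and exist only to show the control flow.

```scala
import scala.collection.mutable.ListBuffer

object RefAlternativeSketch {
  // Hypothetical stand-in for the real BadRow type.
  final case class BadRow(message: String)

  // Stand-in for Ref[F, List[BadRow]]: a plain mutable buffer.
  final class Collected { val rows: ListBuffer[BadRow] = ListBuffer.empty[BadRow] }

  // On error: either record it and continue (Right) or short-circuit (Left).
  def report[A](error: BadRow, fallback: A, emitIncomplete: Boolean, acc: Collected): Either[BadRow, A] =
    if (emitIncomplete) { acc.rows += error; Right(fallback) }
    else Left(error)

  type Event = Map[String, String]

  // Illustrative rule: app_id is required.
  def checkAppId(e: Event, emitIncomplete: Boolean, acc: Collected): Either[BadRow, Event] =
    if (e.contains("app_id")) Right(e)
    else report(BadRow("missing app_id"), e, emitIncomplete, acc)

  // Illustrative rule: unknown platforms are reported and the field is dropped.
  def checkPlatform(e: Event, emitIncomplete: Boolean, acc: Collected): Either[BadRow, Event] =
    e.get("platform") match {
      case Some(p) if !Set("web", "mob").contains(p) =>
        report(BadRow(s"unknown platform $p"), e - "platform", emitIncomplete, acc)
      case _ => Right(e)
    }

  // At the end, what was collected decides what to return.
  def enrich(input: Event, emitIncomplete: Boolean): (List[BadRow], Option[Event]) = {
    val acc = new Collected
    val result = for {
      a <- checkAppId(input, emitIncomplete, acc)
      b <- checkPlatform(a, emitIncomplete, acc)
    } yield b
    result match {
      case Left(bad)    => (List(bad), None)
      case Right(event) => (acc.rows.toList, Some(event))
    }
  }
}
```

With `emitIncomplete = false` the first error stops the flow; with `emitIncomplete = true` both errors are collected and the partial event is returned alongside them.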

istreeter · 2024-02-13T10:13:48Z

Hey @benjben

Before executing doSomethingElseThatMightYieldBadRows we need to check the errors that are in a (not everyone will have incomplete events activated), how would that work?

Imagine for a second that everyone has incomplete events activated. Would that remove the need to check errors after each step, and therefore remove the part that makes it messy?

Under what circumstances would someone have this feature disabled? And in those cases, could we continue to process the event, but simply not emit it in the final step?

(Not saying this is the right solution, but it's worth exploring while we're trying to find options available to us).

benjben · 2024-02-13T10:37:48Z

Hey @istreeter

Imagine for a second that everyone has incomplete events activated. Would that remove the need to check errors after each step, and therefore remove the part that makes it messy?

Indeed that would remove the need for that and would make the flow look nicer.

Under what circumstances would someone have this feature disabled?

This feature will cost money as it will require an additional stream and additional apps to run, so I can imagine customers not wanting it.

@colmsnowplow should we consider that once this feature exists we will activate it for all our customers ?

And in those cases, could we continue to process the event, but simply not emit it in the final step?

For sure we can but that would waste resources. That's not very satisfying to force features only to make the code look nicer.

colmsnowplow · 2024-02-13T10:43:10Z

I think I can offer a point of view here.

First just to double check I'm reading it right:

Before executing doSomethingElseThatMightYieldBadRows we need to check the errors that are in a (not everyone will have incomplete events activated), how would that work?

I'm not 100% clear why we need to check the errors, but it sounds like it's so that we can decide not to continue processing the event if we hit an error. Just stating this in case it's not correct. :)

Under what circumstances would someone have this feature disabled?

This could happen! An example - one of the customers in the research uses the JS enrichment for custom bot filtering. They specifically don't want that data to go to the db. For the first version of this, they will most likely want to just disable it. Some customers might simply not want the additional cost of an extra stream and an extra loader.

And in those cases, could we continue to process the event, but simply not emit it in the final step?

This would be acceptable. But perhaps we can find something cleaner. (Not sure as haven't yet carved out the time to read this PR properly)

colmsnowplow · 2024-02-13T11:06:45Z

@colmsnowplow should we consider that once this feature exists we will activate it for all our customers ?

I think we were editing at the same time. I believe I answered that above: at least at first, I don't think we can. @stanch, FYI, think you'll agree. :)

colmsnowplow · 2024-02-13T18:00:57Z

I'm not 100% sure I understand the code enough to weigh in heavily, but I think at this point it is worth considering the other option - the goal is to choose the right design, so if there's a trade-off like this I think we should consider it.

On this point:

For sure we can but that would waste resources. That's not very satisfying to force features only to make the code look nicer.

I wouldn't say it's just about making the code look nice. It's about making the code easier to understand and easier to work with. Which in turn reduces risk of bugs.

I don't know which option is better but I do think that simpler code has tangible value, and it's worth consideration. If one option is marginally less efficient but significantly less complex, IMO it is often the better choice. (But I don't know enough to say if that applies here)

...les/common/src/main/scala/com.snowplowanalytics.snowplow.enrich/common/utils/IgluUtils.scala

istreeter

There are lots of places where you've changed to emitting a SchemaViolation instead of an EnrichmentFailure. This definitely needs checking with others!

It is a design decision that affects how users interact with the bad data. It deserves a discussion beyond just me and you in a code review.

.../main/scala/com.snowplowanalytics.snowplow.enrich/common/enrichments/EnrichmentManager.scala

...les/common/src/main/scala/com.snowplowanalytics.snowplow.enrich/common/utils/IgluUtils.scala

modules/common-fs2/src/main/scala/com/snowplowanalytics/snowplow/enrich/common/fs2/Enrich.scala

...t/scala/com.snowplowanalytics.snowplow.enrich.common/enrichments/EnrichmentManagerSpec.scala

.../main/scala/com.snowplowanalytics.snowplow.enrich/common/enrichments/EnrichmentManager.scala

#872) Before this change, any error in the enriching workflow would short circuit and a failed event would get emitted as JSON. After this change, if incomplete events are enabled, the enriching goes to the end with what is possible, accumulating errors as it goes. Errors get attached in derived_contexts and a failed event gets emitted to a third stream with TSV format (same as enriched event).
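
The behaviour described in this commit message can be pictured with the small routing sketch below. Everything in it is hypothetical: the `Outcome` type, the `Outputs` holder, and the toy `toTsv`/`toJson` encoders are illustrative only, and the sketch assumes (the message does not say so) that the bad row is still written as JSON to the bad stream when an incomplete event is also emitted. Attaching the errors to derived_contexts is not modelled.

```scala
object RoutingSketch {
  // Hypothetical stand-ins for the real types.
  final case class BadRow(message: String)
  final case class EnrichedEvent(appId: String, eventName: String)

  sealed trait Outcome
  final case class Enriched(event: EnrichedEvent) extends Outcome
  final case class Failed(bad: BadRow) extends Outcome
  final case class Incomplete(bad: BadRow, partial: EnrichedEvent) extends Outcome

  // What gets written to each of the three streams for one input.
  final case class Outputs(good: List[String], bad: List[String], incomplete: List[String])

  // Toy encoders: incomplete events share the TSV format of enriched events.
  def toTsv(e: EnrichedEvent): String = List(e.appId, e.eventName).mkString("\t")
  def toJson(b: BadRow): String = s"""{"message":"${b.message}"}"""

  def route(outcome: Outcome, emitIncomplete: Boolean): Outputs = outcome match {
    case Enriched(e) => Outputs(List(toTsv(e)), Nil, Nil)
    case Failed(b)   => Outputs(Nil, List(toJson(b)), Nil)
    case Incomplete(b, e) =>
      // Assumption: the bad row is always emitted; the TSV only if the feature is on.
      Outputs(Nil, List(toJson(b)), if (emitIncomplete) List(toTsv(e)) else Nil)
  }
}
```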

...n/src/main/scala/com.snowplowanalytics.snowplow.enrich/common/enrichments/AtomicFields.scala

.../main/scala/com.snowplowanalytics.snowplow.enrich/common/enrichments/EnrichmentManager.scala

...common/src/main/scala/com.snowplowanalytics.snowplow.enrich/common/enrichments/Failure.scala

istreeter · 2024-06-13T15:45:43Z

I had already approved it, but it looks even better now ✅

benjben changed the title ~~Add incomplete events (close #)~~ Add incomplete events Feb 12, 2024

istreeter reviewed Feb 12, 2024

...les/common-fs2/src/main/scala/com/snowplowanalytics/snowplow/enrich/common/fs2/package.scala

istreeter reviewed Feb 12, 2024

spenes force-pushed the develop branch from 57cb773 to fad74df February 12, 2024 12:37

benjben force-pushed the incomplete_events branch from eb84c16 to 5b40ead February 13, 2024 07:59

benjben force-pushed the incomplete_events branch 3 times, most recently from ecef087 to b404c0a February 16, 2024 09:59

benjben requested a review from istreeter February 16, 2024 09:59

benjben commented Feb 16, 2024

...les/common/src/main/scala/com.snowplowanalytics.snowplow.enrich/common/utils/IgluUtils.scala

benjben mentioned this pull request Feb 16, 2024

Add incomplete events -- alternative #876

Closed

istreeter reviewed Feb 16, 2024

benjben requested a review from istreeter February 19, 2024 14:35

istreeter approved these changes Feb 19, 2024

istreeter reviewed Feb 19, 2024

...les/common/src/main/scala/com.snowplowanalytics.snowplow.enrich/common/utils/IgluUtils.scala

colmsnowplow reviewed Feb 19, 2024

modules/common-fs2/src/main/scala/com/snowplowanalytics/snowplow/enrich/common/fs2/Enrich.scala

benjben force-pushed the incomplete_events branch 2 times, most recently from 463b54a to 140ff91 February 20, 2024 20:14

benjben force-pushed the incomplete_events branch from fc4b389 to b2f8067 March 6, 2024 18:42

benjben requested a review from istreeter March 6, 2024 18:43

benjben changed the base branch from develop to switch_to_schema_violations March 6, 2024 18:45

benjben force-pushed the switch_to_schema_violations branch from 378fe6c to 9846a2f March 7, 2024 16:20

Base automatically changed from switch_to_schema_violations to develop March 8, 2024 08:51

benjben force-pushed the incomplete_events branch from b2f8067 to e6df527 March 8, 2024 09:26

benjben commented Mar 8, 2024

...t/scala/com.snowplowanalytics.snowplow.enrich.common/enrichments/EnrichmentManagerSpec.scala

benjben force-pushed the incomplete_events branch from 0e92213 to ebe793f April 9, 2024 08:52

benjben force-pushed the incomplete_events branch from f5d611a to 838d0bc May 17, 2024 09:16

spenes reviewed May 31, 2024

.../main/scala/com.snowplowanalytics.snowplow.enrich/common/enrichments/EnrichmentManager.scala

benjben mentioned this pull request Jun 11, 2024

Add com.snowplowanalytics.snowplow/failure/jsonschema/1-0-0 snowplow/iglu-central#1402

Closed

benjben force-pushed the incomplete_events branch from 838d0bc to d12c41a June 11, 2024 15:24

benjben changed the base branch from develop to master June 11, 2024 16:29

benjben changed the base branch from master to develop June 11, 2024 16:29

istreeter reviewed Jun 11, 2024

benjben force-pushed the incomplete_events branch from 0ecd9f8 to 317b485 June 12, 2024 13:42

benjben force-pushed the incomplete_events branch from 317b485 to e11145d June 13, 2024 15:48

benjben merged commit a599505 into develop Jun 13, 2024
1 check failed

benjben deleted the incomplete_events branch June 13, 2024 15:50
