Add incomplete events #872
...les/common-fs2/src/main/scala/com/snowplowanalytics/snowplow/enrich/common/fs2/package.scala
No real strong objections from me on this. I do think it has made the code harder to read compared with how it was before, but maybe that's inevitable when adding a complex feature.
The point about Ior is that it has a Monad on the right hand side. So it might allow arranging the code like this:
for {
  a     <- doSomethingThatMightYieldBadRows(collectorPayload)
  b     <- doSomethingElseThatMightYieldBadRows(a)
  event <- andSomethingElse(b)
} yield event
...so you accumulate failures on the left hand side, while continuing to process the event on the right hand side.
There's even an IorT, which is similar to an EitherT.
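To make the Ior suggestion concrete, here is a minimal sketch, assuming cats is on the classpath. `Event`, `extract`, `geoLookup` and the failure messages are hypothetical stand-ins, not names from this PR. A step that returns `Ior.both` records a failure and still hands a value to the next step, while a step that returns `Ior.left` would stop the chain.

```scala
import cats.data.{Ior, NonEmptyList}

object IorSketch {
  // Hypothetical stand-ins for bad rows and enrichment steps; not taken from the PR.
  type Failures = NonEmptyList[String]
  type Step[A]  = Ior[Failures, A]

  final case class Event(appId: String, ip: Option[String], country: Option[String])

  // Records a failure when "aid" is missing, but still yields a partial event.
  def extract(payload: Map[String, String]): Step[Event] = {
    val ip = payload.get("ip")
    payload.get("aid") match {
      case Some(aid) => Ior.right(Event(aid, ip, None))
      case None      => Ior.both(NonEmptyList.one("missing aid"), Event("", ip, None))
    }
  }

  // Pretend lookup that only knows a single address; on failure the event passes through unchanged.
  def geoLookup(e: Event): Step[Event] =
    e.ip match {
      case Some("1.2.3.4") => Ior.right(e.copy(country = Some("FR")))
      case _               => Ior.both(NonEmptyList.one("geo lookup failed"), e)
    }

  // flatMap combines the left-hand failures with the NonEmptyList Semigroup
  // and keeps threading the right-hand value through the remaining steps.
  def process(payload: Map[String, String]): Step[Event] =
    for {
      a     <- extract(payload)
      event <- geoLookup(a)
    } yield event
}
```

With an empty payload, `IorSketch.process` ends in an `Ior.Both` carrying both failure messages alongside the partial event. With a valid payload it ends in an `Ior.Right`.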
I understand this work was done to unblock the next phases of this project, so I don't think we should stick around for too long searching for perfection.
Thanks for the review @istreeter! I will address your feedback.
Yeah I feel the same 🥲 But the logic has become more complicated, so I'm afraid it's an inevitable outcome. Now after each step we need to check if there are bad rows and the value of
I'm not sure that we will benefit from it, as we need to check for bad rows after each step, and not only at the end. But I'll try.
eb84c16
to
5b40ead
Compare
Hi @istreeter, I addressed your feedback, please have a look. I agree that having
I might be missing something because I don't see how this can work. Before executing An alternative that I envisaged was to use
Hey @benjben
Imagine for a second that everyone has incomplete events activated. Would that remove the need to check errors after each step, and therefore remove the part that makes it messy? Under what circumstances would someone have this feature disabled? And in those cases, could we continue to process the event, but simply not emit it in the final step? (Not saying this is the right solution, but it's worth exploring while we're trying to find options available to us).
Hey @istreeter
Indeed that would remove the need for that and would make the flow look nicer.
This feature will cost money as it will require an additional stream and additional apps to run, so I can imagine customers not wanting it. @colmsnowplow should we consider that once this feature exists we will activate it for all our customers?
For sure we can, but that would waste resources. It's not very satisfying to force features only to make the code look nicer.
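The option discussed above (always run the event to the end, and only decide in the final step what to emit) could look roughly like the sketch below. All names here (`Output`, `Result`, `emit`, `incompleteEnabled`) are hypothetical, and the sketch assumes that the JSON bad row is still emitted when incomplete events are enabled, which this thread does not confirm.

```scala
object EmitSketch {
  // Hypothetical, simplified output model; not taken from the PR.
  sealed trait Output
  final case class Enriched(tsv: String)   extends Output // good stream
  final case class Bad(json: String)       extends Output // bad rows stream
  final case class Incomplete(tsv: String) extends Output // third stream

  // What processing to the end produces: every failure seen, plus whatever could be enriched.
  final case class Result(failures: List[String], event: String)

  // The feature flag is consulted only here, not after each processing step.
  def emit(result: Result, incompleteEnabled: Boolean): List[Output] =
    result.failures match {
      case Nil => List(Enriched(result.event))
      case failures =>
        val bad = Bad(failures.mkString("[", ",", "]"))
        if (incompleteEnabled) List(bad, Incomplete(result.event)) else List(bad)
    }
}
```

The trade-off raised in the reply still applies: with the flag off, the work done after the first failure is thrown away.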
I think I can offer a point of view here. First just to double check I'm reading it right:
I'm not 100% clear why we need to check the errors, but it sounds like it's so that we can decide not to continue processing the event if we hit an error. Just stating this in case it's not correct. :)
This could happen! An example: one of the customers in the research uses the JS enrichment for custom bot filtering. They specifically don't want that data to go to the db. With the first version of this, they will most likely want to just disable it. Some customers might simply not want the additional cost of an extra stream and an extra loader.
This would be acceptable. But perhaps we can find something cleaner. (Not sure, as I haven't yet carved out the time to read this PR properly.)
I think we were editing at the same time. I think I answered that: at least at first, I don't think we can. @stanch, FYI, I think you'll agree. :)
I'm not 100% sure I understand the code enough to weigh in heavily, but I think at this point it is worth considering the other option - the goal is to choose the right design, so if there's a trade-off like this I think we should consider it. On this point:
I wouldn't say it's just about making the code look nice. It's about making the code easier to understand and easier to work with, which in turn reduces the risk of bugs. I don't know which option is better, but I do think that simpler code has tangible value, and it's worth consideration. If one option is marginally less efficient but significantly less complex, IMO it is often the better choice. (But I don't know enough to say if that applies here.)
ecef087
to
b404c0a
Compare
...les/common/src/main/scala/com.snowplowanalytics.snowplow.enrich/common/utils/IgluUtils.scala
There are lots of places where you've changed to emitting a SchemaViolation instead of an EnrichmentFailure. This definitely needs checking with others!
It is a design decision that affects how users interact with the bad data. It deserves a discussion beyond just me and you in a code review.
463b54a
to
140ff91
Compare
fc4b389
to
b2f8067
Compare
378fe6c
to
9846a2f
Compare
b2f8067
to
e6df527
Compare
0e92213
to
ebe793f
Compare
f5d611a
to
838d0bc
Compare
#872) Before this change, any error in the enriching workflow would short circuit and a failed event would get emitted as JSON. After this change, if incomplete events are enabled, the enriching goes to the end with what is possible, accumulating errors as it goes. Errors get attached in derived_contexts and a failed event gets emitted to a third stream with TSV format (same as enriched event).
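The before/after behaviour described in this commit message can be illustrated with a small plain-Scala sketch. The `Event` and `Step` types and the three steps are hypothetical and far simpler than the real enrichment workflow. `shortCircuit` stops at the first failure, whereas `accumulate` keeps the last valid event, carries on, and collects every failure.

```scala
object ShortCircuitVsAccumulate {
  // Hypothetical model: an event is a bag of fields, a step either fails or returns an updated event.
  type Event = Map[String, String]
  type Step  = Event => Either[String, Event]

  val steps: List[Step] = List(
    e => if (e.contains("app_id")) Right(e) else Left("missing app_id"),
    e => e.get("ip").map(ip => e + ("geo" -> s"lookup($ip)")).toRight("missing ip"),
    e => Right(e + ("enriched" -> "true"))
  )

  // Before: the first Left wins and later steps never run.
  def shortCircuit(event: Event): Either[String, Event] =
    steps.foldLeft[Either[String, Event]](Right(event))((acc, step) => acc.flatMap(step))

  // After: a failing step is recorded and skipped, then processing continues with what is valid.
  def accumulate(event: Event): (List[String], Event) =
    steps.foldLeft((List.empty[String], event)) { case ((errors, current), step) =>
      step(current) match {
        case Right(next) => (errors, next)
        case Left(error) => (errors :+ error, current)
      }
    }
}
```

For an empty event, `shortCircuit` returns only the first failure, while `accumulate` returns both failures together with the event produced by the last step.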
838d0bc
to
d12c41a
Compare
0ecd9f8
to
317b485
Compare
I had already approved it, but it looks even better now ✅
317b485
to
e11145d
Compare
This PR adds incomplete events to the Enrich flow.
Before, enriching an event resulted in either a BadRow or an EnrichedEvent. What's new is that a BadRow can now have an EnrichedEvent associated with it, which is the incomplete event.
All the changes are related to the fact that we no longer short circuit the processing as soon as there is an error, but keep going with what is valid.
As a result we're not using ValidatedNel and EitherT any more.
To make things easier to follow, I reorganized the flow in 3 steps: