Drop the Set-Cookie header #132
Merged
Commits (8):
736080f Drop the Set-Cookie header (Gallaecio)
5d25576 Fix a typo, reword a bit (Gallaecio)
867cf77 Do not drop Set-Cookie for HTTP responses without experimenta.respons… (Gallaecio)
97acb86 Comment improvements (Gallaecio)
ce2da62 Silence mypy issues in new tests (Gallaecio)
ad1444a Always drop Set-Cookie if responseCookies are received (Gallaecio)
cce2001 Merge remote-tracking branch 'scrapy-plugins/main' into drop-set-cookie (Gallaecio)
f9fb461 Fix CI issues (Gallaecio)
Does it handle an edge case where only httpResponseHeaders are requested, without httpResponseBody?
No, it does not; it did not even cross my mind.
Hmm… This is more involved than I initially thought.
Since httpResponseHeaders can be combined with any other output, including extraction (current and future extraction keys), filtering only by “httpResponseHeaders and no other output” is not feasible.
So I think we need to choose:
- Drop Set-Cookie even if httpResponseHeaders is the only output, as long as response cookies are requested as well.
- Keep the Set-Cookie header whenever httpResponseHeaders is present, even if browserHtml or other outputs are also requested.
- Keep the Set-Cookie header whenever httpResponseHeaders is present as long as no other known output is requested, and accept that there will be times, between Zyte API adding support for a new extraction type and us supporting it here, where Set-Cookie would be deleted for those new scenarios. We could expose the list of extractions as a setting as a workaround, if we want to go that far.
Hm, so, we need to remove the header if we know that browser rendering was not used internally. And it's not straightforward to know whether it was used or not: e.g. it could be the case that httpResponseBody + browserHtml are supported at the same time. Extraction may also be picking browser vs. no browser transparently in the future.
Is it the case that ideally, we'd like to keep the Set-Cookie header always (for informational purposes), but ensure it's not processed by the cookie middleware when responseCookies are present? Or are there any issues with that? Can we address this directly somehow, instead of removing Set-Cookie header conditionally?
Also, what's the actual use case of keeping Set-Cookie header when responseCookies are received? If there is no use case, and addressing the point above is complicated, then can we drop Set-Cookie unconditionally when responseCookies are present?
@proway2 Thoughts?
httpResponseBody and browserHtml are mutually exclusive, so there's no way to use these two together at the moment. The problem is that when httpResponseHeaders is used together with browserHtml, it's obviously a mistake. There might be the very same cookie in two different places and with different values, so the same cookie is going to be sent twice, which may lead to errors that are really hard to find and debug. We've seen this with Lowes, when zipcode is automatically populated based on IP, and after that a proper location is set and another cookie with a completely different zipcode is populated as well, and both end up in Scrapy. I'd say that there's no reason to use httpResponseHeaders with browserHtml.
When httpResponseHeaders are used together with httpResponseBody, then it's a different story: in this case headers contain valid data, as no rendering is used and no other headers are created by the engine.
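To make the duplicate-cookie problem described above concrete, here is a minimal sketch. It assumes a Zyte API response in which httpResponseHeaders is a list of name/value objects and cookies arrive under experimental.responseCookies; the response fragment, its values, and the helper names are hypothetical and only serve as illustration.

```python
from http.cookies import SimpleCookie

# Hypothetical response fragment; the cookie values are made up.
api_response = {
    "httpResponseHeaders": [
        {"name": "Content-Type", "value": "text/html"},
        {"name": "Set-Cookie", "value": "zipcode=10001; Path=/"},
    ],
    "experimental": {
        "responseCookies": [
            {"name": "zipcode", "value": "94105", "domain": "example.com"},
        ],
    },
}


def cookie_values_by_source(response):
    """Collect cookie name -> value from Set-Cookie headers and responseCookies."""
    from_header = {}
    for header in response.get("httpResponseHeaders", []):
        if header["name"].lower() == "set-cookie":
            parsed = SimpleCookie()
            parsed.load(header["value"])
            for name, morsel in parsed.items():
                from_header[name] = morsel.value
    cookies = response.get("experimental", {}).get("responseCookies", [])
    from_cookies = {cookie["name"]: cookie["value"] for cookie in cookies}
    return from_header, from_cookies


def conflicting_cookies(response):
    """Return cookies whose value differs between the two sources."""
    from_header, from_cookies = cookie_values_by_source(response)
    return {
        name: (from_header[name], from_cookies[name])
        for name in from_header.keys() & from_cookies.keys()
        if from_header[name] != from_cookies[name]
    }


print(conflicting_cookies(api_response))
```

If both sources were fed to a cookie jar, the zipcode cookie here would be set with two different values, which is the kind of hard-to-debug situation described for Lowes.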
That still does not explain why, when using httpResponseBody, you would read the cookies from Set-Cookie instead of responseCookies. It's accommodating that scenario, i.e. not dropping Set-Cookie even though responseCookies have been received, that makes the implementation tricky.
This is what is stated on our website for httpResponseHeaders:
I'm not sure what “Usually” exactly means here; I'd read cookies from responseCookies. But I guess responseCookies may be set to False while httpResponseHeaders is set to True; that may be the case when a user needs something from the headers (e.g. some key) but doesn't actually need a cookie.
I'd say that if browserHtml is used, remove httpResponseHeaders altogether. If httpResponseBody is used and cookies are enabled and responseCookies is True, read cookies from responseCookies, but if responseCookies is False, read cookies from httpResponseHeaders. In any case, any other header from httpResponseHeaders must remain untouched.
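The rule proposed in the comment above can be sketched as two small functions. This is only an illustration under assumptions: request parameters are a plain dict, responseCookies is requested under an experimental key, and the function names are hypothetical. The browserHtml-plus-responseCookies case is not spelled out in the comment, so its handling here is a guess.

```python
def wants_response_cookies(params):
    """True if responseCookies is requested (assumed under 'experimental')."""
    return bool(params.get("experimental", {}).get("responseCookies"))


def keep_response_headers(params):
    """Per the proposal, headers are removed altogether for browserHtml."""
    return not params.get("browserHtml")


def cookie_source(params, cookies_enabled=True):
    """Pick where cookies should be read from, or None if nowhere."""
    if not cookies_enabled:
        return None
    if wants_response_cookies(params):
        # Assumption: responseCookies wins whenever it is requested,
        # including browser requests, which the comment does not cover.
        return "responseCookies"
    if params.get("httpResponseBody") and not params.get("browserHtml"):
        # No rendering involved, so Set-Cookie headers are trustworthy.
        return "httpResponseHeaders"
    return None


print(cookie_source({"httpResponseBody": True, "httpResponseHeaders": True}))
```

Note that neither function touches headers other than by the all-or-nothing browserHtml check, in line with the requirement that other headers remain untouched.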
I'm not sure about this; if it's supported by the API, and requested explicitly by the user, then it might not be a mistake.
Overall, it seems we're all on the same page - we should prioritize responseCookies, and there is no issue with removing Set-Cookie from httpResponseHeaders if responseCookies are present, right?
Based on the discussion in Slack, although it's allowed to use httpResponseHeaders + browserHtml, it's strongly discouraged. I'm not sure if there's a use case for this.
Yes
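For reference, the rule the thread converges on (drop Set-Cookie from httpResponseHeaders whenever responseCookies are received) could look roughly like the sketch below. It is not the code merged in this pull request; the response shape (a list of name/value header objects, cookies under experimental.responseCookies) and the function name are assumptions.

```python
def drop_set_cookie(api_response):
    """Return a copy of the response without Set-Cookie headers,
    but only if responseCookies were received (assumed location:
    api_response["experimental"]["responseCookies"])."""
    received = "responseCookies" in api_response.get("experimental", {})
    headers = api_response.get("httpResponseHeaders")
    if not received or headers is None:
        # Nothing to prioritize over Set-Cookie; leave the response as is.
        return api_response
    filtered = [
        header for header in headers
        if header["name"].lower() != "set-cookie"
    ]
    # Every other header is kept untouched.
    return {**api_response, "httpResponseHeaders": filtered}


example = {
    "httpResponseHeaders": [
        {"name": "Set-Cookie", "value": "a=b"},
        {"name": "Content-Type", "value": "text/html"},
    ],
    "experimental": {
        "responseCookies": [
            {"name": "a", "value": "b", "domain": "example.com"},
        ],
    },
}
print(drop_set_cookie(example)["httpResponseHeaders"])
```

The check does not depend on which outputs (httpResponseBody, browserHtml, extraction) were requested, which sidesteps the question of whether browser rendering was used internally.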