-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Drop the Set-Cookie header #132
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #132 +/- ##
=======================================
Coverage 98.83% 98.84%
=======================================
Files 10 10
Lines 687 691 +4
=======================================
+ Hits 679 683 +4
Misses 8 8
|
scrapy_zyte_api/responses.py
Outdated
# Cookies should be fetched from experimental.responseCookies only, the | ||
# Set-Cookie header may define cookies from the main HTTP response that | ||
# were overriden later on, e.g. during browser rendering. | ||
"set-cookie", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should remove headers when httpResponseBody = True
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
scrapy_zyte_api/responses.py
Outdated
if input_headers: | ||
headers_to_remove = copy(cls.REMOVE_HEADERS) | ||
if response_cookies or "httpResponseBody" not in api_response: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it handle an edge case where only httpResponseHeaders are requested, without httpResponseBody?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it does not, did not even cross my mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm… This is more involved than I initially thought.
Since httpResponseHeaders
can be combined with any other output, including extraction (current and future extraction keys), filtering only by “httpResponseHeaders
and no other output” is not feasible.
So I think we need to choose:
- Keep the logic as is, remove
Set-Cookie
even ifhttpResponseHeaders
is the only output, as long as response cookies are requested as well. - Keep the
Set-Cookie
header wheneverhttpResponseHeaders
is present, even ifbrowserHtml
or other outputs are also requested. - Keep the
Set-Cookie
header wheneverhttpResponseHeaders
is present as long as no other known output is requested, and accept that there will be times between Zyte API adding support for a new extraction type and us supporting it here whereSet-Cookie
would be deleted for those new scenarios. We could expose the list of extractions as a setting as a workaround, if we want to go that far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, so, we need to remove the header if we know that the browser rendering was not used internally. And it's not straightforward to know if it was used or not: e.g. it could be the case httpResponseBody + browserHtml could be supported at the same time. Extraction may also be picking browser vs no browser transparently in the future.
Is it the case that ideally, we'd like to keep the Set-Cookie header always (for informational purposes), but ensure it's not processed by the cookie middleware when responseCookies are present? Or are there any issues with that? Can we address this directly somehow, instead of removing Set-Cookie header conditionally?
Also, what's the actual use case of keeping Set-Cookie header when responseCookies are received? If there is no use case, and addressing the point above is complicated, then can we drop Set-Cookie unconditionally when responseCookies are present?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, what's the actual use case of keeping Set-Cookie header when responseCookies are received?
@proway2 Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
httpResponseBody
and browserHtml
are mutually exclusive, so there's no way to use these two together at the moment. The problem is that when httpResponseHeaders
is used together with browserHtml
it's obviously a mistake. There might be the very same cookie in two different places and with different values, therefore the same cookie is going to be sent twice which may lead to errors that are really hard to find and debug. We've seen this with Lowes, when zipcode
is automatically populated based on IP and after that a proper location is set and another cookie with completely different zipcode
is populated as well and both end up in Scrapy. I'd say that there's no reason to use httpResponseHeaders
with browserHtml
.
When httpResonseHeaders
are used together with httpResponseBody
then it's a different story, in this case headers contain valid data as no rendering is used and no other headers are created by the engine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That still does not explain why, when using httpResponseBody
, you would read the cookies from Set-Cookie
instead of responseCookies
. It’s accommodating that scenario, i.e. not dropping Set-Cookie
even though responseCookies
have been received, what makes the implementation tricky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what stated on our website for httpResponseHeaders
:
The Set-Cookie header value, when present, contains the header value received from the main HTTP response. These cookies could have changed later on, e.g. during browser rendering. Usually you will want to ignore this header in favor of responseCookies, which provides the final cookies.
I'm not sure what Usually
exactly means here, I'd read cookies from responseCookies
. But I guess responseCookies
may be set to False
but httpResponseHeaders
may be set to True
, it may be the case when a user needs something from headers (e.g. some key) but doesn't actually need a cookie.
I'd say that if browserHtml
is used, remove httpResponseHeaders
at all. If httpResponseBody
is used and cookies are enabled and responseCookies
is True
- read cookies from responseCookies
, but if responseCookies
is False
- read cookies from httpResponseHeaders
. In any case any other header from httpResponseHeaders
must remain untouched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that when httpResponseHeaders is used together with browserHtml it's obviously a mistake.
I'd say that if browserHtml is used, remove httpResponseHeaders at all.
I'm not sure about this; if it's supported by the API, and requested explicitly by the user, then it might not be a mistake.
Overall, it seems we're all on the same page - we should prioritize responseCookies, and there is no issue with removing Set-Cookie from httpResponseHeaders if responseCookies are present, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this; if it's supported by the API, and requested explicitly by the user, then it might not be a mistake.
Based on the discussion in Slack though it's allowed to use httpResponseHeaders+browserHtml it's strongly discouraged. I'm not sure if there's a use case for this.
Overall, it seems we're all on the same page - we should prioritize responseCookies, and there is no issue with removing Set-Cookie from httpResponseHeaders if responseCookies are present, right?
Yes
No description provided.