csv-parse stream will process the whole buffer even with back pressure. #408
Comments
I messed around with this for a while trying to figure out a solution, but this seems to be a case that just isn't handled by the parser. Anyway, my workaround for now is to split up the incoming buffers so that they are always small, and that seems to work OK. But with this issue lingering, I suppose others will be bitten by it in the future.
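A minimal sketch of that re-chunking workaround, assuming a plain Node.js `Transform` placed in front of the CSV parser; the class name and slice size are illustrative, not from the original comment:

```ts
import { Transform, TransformCallback } from 'node:stream';

// Re-slice large buffers into small pieces so the downstream CSV parser
// only ever receives a little data per chunk, letting its normal
// between-chunk back-pressure keep the parsed-row buffer small.
class Rechunk extends Transform {
  constructor(private readonly sliceSize = 16 * 1024) {
    super();
  }

  _transform(chunk: Buffer, _encoding: BufferEncoding, callback: TransformCallback) {
    for (let offset = 0; offset < chunk.length; offset += this.sliceSize) {
      // subarray() shares memory with the original buffer, so the slices
      // add almost no overhead of their own.
      this.push(chunk.subarray(offset, offset + this.sliceSize));
    }
    callback();
  }
}

// Hypothetical usage: source.pipe(new Rechunk()).pipe(parser)
```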
Is it not the responsibility of the Stream Reader?
I don't think it's unreasonable to expect that, even if you pass the entire large dataset as a string or Buffer, it would still be processed in chunks to reduce memory usage. A 300 MB Buffer of CSV data could easily use several GB of RAM if you parse all the rows in advance. Putting 300 MB in RAM might not be a big deal, but it's reasonable to expect the rows to be parsed in small batches so the additional memory usage stays minimal. This issue took me about 6 hours to figure out and work around yesterday, so I filed it here because I can imagine others being unpleasantly surprised by it. The workaround isn't hard to implement once you know it's needed, but it's not at all obvious that it is.
With a bit more research, I think resolving this would require implementing Duplex directly instead of using Transform. That's kind of a pain, but the Node.js Transform API doesn't have any built-in concept of one-sided back-pressure: it basically assumes the input and output are of similar size and passes all back-pressure straight upstream.
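A rough sketch of what a Duplex with one-sided back-pressure could look like. This is not csv-parse's code: the line-splitting "parsing" and all names are stand-ins. Rows are queued internally and only pushed while the readable side accepts them, and the write callback is held until the queue drains, which is the one-sided behaviour a Transform can't express:

```ts
import { Duplex } from 'node:stream';

class RowDuplex extends Duplex {
  private rows: string[] = [];
  private pendingWrite: (() => void) | null = null;

  constructor() {
    super({ readableObjectMode: true });
  }

  _write(chunk: Buffer, _encoding: BufferEncoding, callback: (error?: Error | null) => void) {
    // Stand-in "parsing": split the chunk into lines and queue them
    // (ignores rows split across chunk boundaries).
    this.rows.push(...chunk.toString('utf8').split('\n'));
    // Hold the write callback; the writer only gets more input once the
    // row queue has drained.
    this.pendingWrite = callback;
    this.drain();
  }

  _read() {
    // The readable side asking for data is the signal to resume pushing.
    this.drain();
  }

  _final(callback: (error?: Error | null) => void) {
    this.push(null);
    callback();
  }

  private drain() {
    while (this.rows.length > 0) {
      if (!this.push(this.rows.shift())) return; // stop on back-pressure
    }
    const callback = this.pendingWrite;
    if (callback) {
      this.pendingWrite = null;
      callback(); // accept the next input chunk
    }
  }
}
```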
In case you're still looking for a solution here, I wrote my own back-pressure-aware transform stream (with an example of it being used). I'm thinking about making this a separate NPM package.
Ok, here it is: https://www.npmjs.com/package/@sciactive/back-pressure-transform
Try it out and see if it works for you.
Describe the bug
Using csv-parse, I have found that if I provide a buffer or stream input to the parser, it will always emit every row of a chunk it receives, even if there is back-pressure. It only applies back-pressure between the chunks it receives from its input.
To Reproduce
If you run a reproduction script along the lines of the sketch below, you will see that `(parser as any)._readableState.length` increments to include all rows immediately: all the rows are buffered into the Readable half of the parser.

In some cases a user of the library may want to pass in a buffer of many MB, thinking that it will be processed in small batches (say, bounded by the stream's high water mark). However, with this library all the rows will be processed immediately, using a lot more memory than necessary.
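A hedged sketch of that kind of reproduction, assuming the csv-parse `parse()` API; the import style, CSV contents, and row count here are assumptions for illustration:

```ts
import { parse } from 'csv-parse';

// Build one large CSV buffer (contents are illustrative).
const rowCount = 100_000;
const csv = Buffer.from(
  Array.from({ length: rowCount }, (_, i) => `${i},name-${i}`).join('\n')
);

const parser = parse();

// A single large write: per the report, the parser emits every record into
// its readable buffer immediately instead of stopping at the high water mark.
parser.write(csv);

console.log(
  'records buffered in the Readable half:',
  (parser as any)._readableState.length // roughly rowCount, not ~highWaterMark
);
parser.end();
```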
In order to fix this, the library should check the return value of `push`, and if it is `false` it should pause parsing even if it has enough data buffered from the input to read another record. I'm not actually sure how to know when it is OK to call `push` again, though; the documentation isn't clear on this point.

See also: calling `push()` in the `_transform()` implementation of a transform stream, nodejs/help#1791.
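On the "when is it OK to call `push` again" question, one common answer (stated here as an assumption about the Node.js streams API, not about csv-parse internals) is that the next `_read()` call is the resume signal. A minimal Readable-side sketch with made-up record data:

```ts
import { Readable } from 'node:stream';

// Stand-in records; in a real parser these would come from parsing input.
const records = Array.from({ length: 1_000 }, (_, i) => ({ row: i }));
let index = 0;

const source = new Readable({
  objectMode: true,
  read() {
    // Push until push() returns false; the next read() call from the
    // stream machinery is the signal that the consumer has caught up.
    while (index < records.length) {
      if (!this.push(records[index++])) return; // pause on back-pressure
    }
    this.push(null); // all records emitted
  },
});

// Hypothetical consumption: source.pipe(slowWritable) would now only pull
// records as the downstream writable drains.
```

Doing the equivalent inside a Transform is harder, which is what the Duplex suggestion earlier in the thread is about.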