csv-parse stream will process the whole buffer even with back pressure. #408
Comments
I messed around with this for a while trying to figure out a solution, but this seems to be a case that just isn't handled by the parser. Anyway, my workaround for now is to split up the incoming buffers so that they are always small, and that seems to work OK. But with this issue lingering, I suppose others will be bitten by it in the future.
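A minimal sketch of that re-chunking workaround, assuming a plain Node.js `Transform` placed in front of the CSV parser; the class name and slice size are illustrative, not from the original comment:

```ts
import { Transform, TransformCallback } from 'node:stream';

// Re-slice large buffers into small pieces so the downstream CSV parser
// only ever receives a little data per chunk, letting its normal
// between-chunk back-pressure keep the parsed-row buffer small.
class Rechunk extends Transform {
  constructor(private readonly sliceSize = 16 * 1024) {
    super();
  }

  _transform(chunk: Buffer, _encoding: BufferEncoding, callback: TransformCallback) {
    for (let offset = 0; offset < chunk.length; offset += this.sliceSize) {
      // subarray() shares memory with the original buffer, so the slices
      // add almost no overhead of their own.
      this.push(chunk.subarray(offset, offset + this.sliceSize));
    }
    callback();
  }
}

// Hypothetical usage: source.pipe(new Rechunk()).pipe(parser)
```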
Is it not the responsibility of the Stream Reader?
I don't think it's unreasonable to expect that, even if you pass the entire large dataset as a string or Buffer, it would still be processed in chunks to reduce memory usage. A 300 MB Buffer of CSV data could easily use several GB of RAM if you parse all the rows in advance. Putting 300 MB in RAM might not be a big deal, but it's reasonable to expect the rows to be parsed in small batches so the additional memory usage stays minimal. This issue took me about 6 hours to figure out and work around yesterday, so I filed it here because I can imagine others being unpleasantly surprised by it. The workaround isn't hard to implement once you know it's needed, but it's not at all obvious that it is.
With a bit more research, I think resolving this would require implementing Duplex directly instead of using Transform. That's kind of a pain, but the Node.js Transform API doesn't have any built-in concept of one-sided back-pressure: it basically assumes the input and output are of similar size and passes all back-pressure straight upstream.
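A rough sketch of what a Duplex with one-sided back-pressure could look like. This is not csv-parse's code: the line-splitting "parsing" and all names are stand-ins. Rows are queued internally and only pushed while the readable side accepts them, and the write callback is held until the queue drains, which is the one-sided behaviour a Transform can't express:

```ts
import { Duplex } from 'node:stream';

class RowDuplex extends Duplex {
  private rows: string[] = [];
  private pendingWrite: (() => void) | null = null;

  constructor() {
    super({ readableObjectMode: true });
  }

  _write(chunk: Buffer, _encoding: BufferEncoding, callback: (error?: Error | null) => void) {
    // Stand-in "parsing": split the chunk into lines and queue them
    // (ignores rows split across chunk boundaries).
    this.rows.push(...chunk.toString('utf8').split('\n'));
    // Hold the write callback; the writer only gets more input once the
    // row queue has drained.
    this.pendingWrite = callback;
    this.drain();
  }

  _read() {
    // The readable side asking for data is the signal to resume pushing.
    this.drain();
  }

  _final(callback: (error?: Error | null) => void) {
    this.push(null);
    callback();
  }

  private drain() {
    while (this.rows.length > 0) {
      if (!this.push(this.rows.shift())) return; // stop on back-pressure
    }
    const callback = this.pendingWrite;
    if (callback) {
      this.pendingWrite = null;
      callback(); // accept the next input chunk
    }
  }
}
```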
In case you're still looking for a solution here, I wrote my own back-pressure-aware transform stream (with an example of it being used). I'm thinking about making this a separate NPM package.
Ok, here it is: https://www.npmjs.com/package/@sciactive/back-pressure-transform
Try it out and see if it works for you.
Describe the bug
Using csv-parse, I have found that if I provide a buffer or stream input to the parser, it will always emit every row of a chunk it receives, even if there is back-pressure. It only applies back-pressure between the chunks it receives from its input.
To Reproduce
If you run a reproduction script along the lines of the sketch below, you will see that `(parser as any)._readableState.length` increments to include all rows immediately: all the rows are buffered into the Readable half of the parser.

In some cases a user of the library may want to pass in a buffer of many MB, thinking that it will be processed in small batches (say, bounded by the stream's high water mark). However, with this library all the rows will be processed immediately, using a lot more memory than necessary.
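A hedged sketch of that kind of reproduction, assuming the csv-parse `parse()` API; the import style, CSV contents, and row count here are assumptions for illustration:

```ts
import { parse } from 'csv-parse';

// Build one large CSV buffer (contents are illustrative).
const rowCount = 100_000;
const csv = Buffer.from(
  Array.from({ length: rowCount }, (_, i) => `${i},name-${i}`).join('\n')
);

const parser = parse();

// A single large write: per the report, the parser emits every record into
// its readable buffer immediately instead of stopping at the high water mark.
parser.write(csv);

console.log(
  'records buffered in the Readable half:',
  (parser as any)._readableState.length // roughly rowCount, not ~highWaterMark
);
parser.end();
```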
In order to fix this, the library should check the return value of `push`, and if it is `false` it should pause parsing even if it has enough data buffered from the input to read another record. I'm not actually sure how to know when it is OK to call `push` again, though; the documentation isn't clear on this point.

See also: calling `push()` in the `_transform()` implementation of a transform stream, nodejs/help#1791.
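On the "when is it OK to call `push` again" question, one common answer (stated here as an assumption about the Node.js streams API, not about csv-parse internals) is that the next `_read()` call is the resume signal. A minimal Readable-side sketch with made-up record data:

```ts
import { Readable } from 'node:stream';

// Stand-in records; in a real parser these would come from parsing input.
const records = Array.from({ length: 1_000 }, (_, i) => ({ row: i }));
let index = 0;

const source = new Readable({
  objectMode: true,
  read() {
    // Push until push() returns false; the next read() call from the
    // stream machinery is the signal that the consumer has caught up.
    while (index < records.length) {
      if (!this.push(records[index++])) return; // pause on back-pressure
    }
    this.push(null); // all records emitted
  },
});

// Hypothetical consumption: source.pipe(slowWritable) would now only pull
// records as the downstream writable drains.
```

Doing the equivalent inside a Transform is harder, which is what the Duplex suggestion earlier in the thread is about.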