
CSV parsing fails on large files #442

Closed
mmarkell opened this issue Nov 14, 2024 · 9 comments

@mmarkell

mmarkell commented Nov 14, 2024

When parsing a large CSV file from a Readable stream like this:

const parsedIterable = readable.pipe(parse({ ... }));
for await (const row of parsedIterable) {
  // do stuff
}

I get this error:

Error: Cannot create a string longer than 0x1fffffe8 characters
        at Object.slice (node:buffer:638:37)
        at Buffer.toString (node:buffer:857:14)
        at ResizeableBuffer.toString (/Users/alias/code/monorepo/node_modules/csv-parse/dist/cjs/index.cjs:103:45)
        at Object.__onField (/Users/alias/code/monorepo/node_modules/csv-parse/dist/cjs/index.cjs:1100:36)
        at Object.parse (/Users/alias/code/monorepo/node_modules/csv-parse/dist/cjs/index.cjs:920:35)
        at Parser._flush (/Users/alias/code/monorepo/node_modules/csv-parse/dist/cjs/index.cjs:1365:26)
        at Parser.final [as _final] (node:internal/streams/transform:132:10)
        at callFinal (node:internal/streams/writable:698:12)
        at prefinish (node:internal/streams/writable:710:7)
        at finishMaybe (node:internal/streams/writable:720:5)
        at afterWrite (node:internal/streams/writable:507:3)
        at onwrite (node:internal/streams/writable:480:7)
        at Parser.Transform._read (node:internal/streams/transform:201:5)
        at Parser.Readable.read (node:internal/streams/readable:539:12)
        at createAsyncIterator (node:internal/streams/readable:1163:54)
        at createAsyncIterator.next (<anonymous>)

I manually checked the CSV and there are no individual cell values longer than 0x1fffffe8 characters. Also, the entire CSV is 448,665 rows, which I can confirm via wc -l.
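
(For context, here is roughly the kind of check I mean; a simplified sketch that scans the raw bytes and tracks the longest run between commas and newlines, so nothing ever has to be materialized as one giant string. It ignores quoting, so it's only an approximation, and the path is just a placeholder for my file.)

import fs from "fs";

// Approximate check for the longest field: scan bytes, reset the counter at
// every comma or newline, and keep the maximum. Quoted fields with embedded
// delimiters are not handled, so treat the result as a rough upper bound.
async function maxFieldLength(path: string): Promise<number> {
    let max = 0;
    let current = 0;
    for await (const chunk of fs.createReadStream(path)) {
        for (const byte of chunk as Buffer) {
            if (byte === 0x2c || byte === 0x0a || byte === 0x0d) { // ',' '\n' '\r'
                max = Math.max(max, current);
                current = 0;
            } else {
                current++;
            }
        }
    }
    return Math.max(max, current);
}

maxFieldLength("./data/bigfile.csv").then((n) =>
    console.log(`longest field: ${n} bytes (the limit is 0x1fffffe8)`)
);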

Maybe this is related to #408?

@wdavidw
Member

wdavidw commented Nov 14, 2024

A sample reproducing the error will certainly help.

@mmarkell
Author

mmarkell commented Nov 14, 2024

import { parse } from "csv-parse";
import fs from "fs";

async function readFile(file: string) {
    const path = `./data/${file}`;
    const readable = fs.createReadStream(path);
    const parser = parse({ delimiter: "," });
    const parsedIterable = readable.pipe(parser);
    let i = 0;
    await new Promise<void>((resolve, reject) => {
        parsedIterable.on("readable", function () {
            let record;
            while ((record = parser.read()) !== null) {
                i++;
            }
        });

        // Without this, the promise never settles when the parser errors.
        parsedIterable.on("error", reject);

        parsedIterable.on("end", () => {
            resolve();
        });
    });

    console.log(`Parsed ${i} records from ${file}`);
}
readFile("bigfile.csv");

This code should reproduce the issue. I'm using csv-parse version 5.5.6.

How do you want me to share an example file with you that triggers the issue? Does Google Drive work?

@wdavidw
Member

wdavidw commented Nov 14, 2024

Yes, Google Drive will do.

@mmarkell
Author

Ok, thank you. I'll anonymize the data and get back to you soon.

@mmarkell
Author

mmarkell commented Nov 14, 2024

Ok, this is very weird. We tried scrambling the data (same number of bytes) and it's parsing just fine. Then out of curiosity I tried writing the exact same contents one-to-one into a new file:

import pandas as pd
df = pd.read_csv('bigboy.csv')

df.to_csv('original_file.csv', index=False)

Running that through works too. And the resulting file is a fraction of the size (957 MB down to 146 MB).

Now I'm realizing this probably has nothing to do with csv-parse, but actually with a corrupted file. I identified the line that it's breaking on, but there's nothing clearly off about it.

Do you have any recommendations on how I can look into this further? Unfortunately the file is confidential financial data so I cannot share it as is.

I appreciate any advice.

@wdavidw
Member

wdavidw commented Nov 15, 2024

Internally, the parser works with bytes and not strings. String conversion happens once the parsing is done and the field extracted. This is where your error is located, in csv-parse/dist/cjs/index.cjs:1100.


I am not familiar with the "Cannot create a string longer than 0x1fffffe8 characters" error, but it sounds like the buffer is too large to be converted to a string, as if no CSV parsing has occurred and a single field ends up huge, filled with the entire dataset. Setting an incorrect delimiter and record_delimiter would lead to that situation.
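
For example, something like this contrived sketch (not your actual options) never matches either delimiter, so the whole input accumulates into a single field that is only converted to a string when the stream is flushed:

import { parse } from "csv-parse";

// Contrived sketch: neither delimiter ever appears in the data, so the parser
// accumulates the entire input into one field of one record and only converts
// that buffer to a string at flush time, which is where a "Cannot create a
// string longer than ..." error would surface on a large enough file.
const parser = parse({ delimiter: "\x01", record_delimiter: "\x00" });
parser.on("readable", () => {
    let record;
    while ((record = parser.read()) !== null) {
        console.log(record); // one record, one field containing everything
    }
});
parser.write("a,b,c\n1,2,3\n");
parser.end();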

@mmarkell
Author

If it were a delimiter issue, then I shouldn't be seeing intermediate rows, right? When I iterate through the rows, I can still see the parsed values just fine.

import { parse } from "csv-parse";
import fs from "fs";

async function readFile(file: string) {
    const path = `./data/${file}`;
    const readable = fs.createReadStream(path);
    const parser = parse();
    const parsedIterable = readable.pipe(parser);
    let i = 0;
    for await (const record of parsedIterable) {
        if (i % 100000 === 0) {
            console.log(record);
        }
        i++;
    }
    console.log(`Parsed ${i} records from ${file}`);
}

This works just fine until the corrupt line. Could there be some invisible byte sequence? I'm not sure how to diagnose these sorts of things.
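
Would something along these lines be a sensible way to look for hidden bytes around that record? Just a sketch; the byte offset and window size are placeholders I'd have to fill in.

import fs from "fs";

// Sketch: dump the raw bytes around a suspicious offset as hex so that
// invisible characters (NUL bytes, stray quotes, BOMs, odd encodings) become
// visible. suspectOffset is a placeholder for wherever the bad record starts.
const suspectOffset = 123_456_789;
const windowSize = 256;

const stream = fs.createReadStream("./data/bigfile.csv", {
    start: Math.max(0, suspectOffset - windowSize),
    end: suspectOffset + windowSize,
});

const chunks: Buffer[] = [];
stream.on("data", (chunk) => chunks.push(chunk as Buffer));
stream.on("end", () => {
    const bytes = Buffer.concat(chunks);
    console.log(bytes.toString("hex"));
    console.log(JSON.stringify(bytes.toString("utf8")));
});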

@wdavidw
Member

wdavidw commented Nov 15, 2024

I would say that the parser is agnostic to any byte that is not one of its options, like the delimiter, unless proven wrong of course.

@mmarkell
Author

Ok, I'll close this for now, but I'll let you know if I can pin down a specific bad byte here.
