CSV parsing fails on large files #442

When parsing a large CSV file from a Readable stream (a sketch of this kind of setup follows below), I get this error:

Error: Cannot create a string longer than 0x1fffffe8 characters

I manually checked the CSV and note that there are no individual cell values greater than 0x1fffffe8 characters in length. Also, the entire CSV is 448,665 rows, which I can confirm via wc -l. Maybe this is related to #408?
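The original snippet is not shown in this capture; the following is a minimal sketch of the kind of streaming setup described, with a placeholder file name and assumed options:

```js
const fs = require('fs');
const { parse } = require('csv-parse');

// Stream the file through the parser instead of loading it into memory.
const parser = fs.createReadStream('large-file.csv').pipe(
  parse({ columns: true })
);

(async () => {
  let count = 0;
  for await (const record of parser) {
    count += 1; // each record is one parsed row
  }
  console.log(`parsed ${count} rows`);
})();
```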
Comments
A sample reproducing the error will certainly help.
This code should reproduce the issue. I'm using csv-parse version 5.5.6. How do you want me to share an example file with you that triggers the issue? Does Google Drive work?
Yes, Google Drive will do.
Ok, thank you. I'll anonymize the data and get back to you soon.
Ok, this is very weird. We tried scrambling the data (same number of bytes) and it's parsing just fine. Then out of curiosity I tried writing the exact same contents 1 to 1 into a new file:
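(The copy script itself is not shown; this is a sketch of that kind of line-by-line rewrite, with placeholder paths. Note that decoding each line to a string and writing it back re-encodes the data, which could change the byte count if the original file is not plain UTF-8.)

```js
const fs = require('fs');
const readline = require('readline');

(async () => {
  const rl = readline.createInterface({
    input: fs.createReadStream('original.csv'), // decoded as UTF-8 by default
    crlfDelay: Infinity,
  });
  const out = fs.createWriteStream('copy.csv');
  for await (const line of rl) {
    out.write(line + '\n'); // write each line back out unchanged
  }
  out.end();
})();
```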
Running that through works too. And the resulting file is 1/9 the size (957 MB down to 146 MB). Now I'm realizing this probably has nothing to do with csv-parse, but actually with a corrupted file. I identified the line that it's breaking on, but there's nothing clearly off about it. Do you have any recommendations on how I can look into this further? Unfortunately the file is confidential financial data, so I cannot share it as-is. I appreciate any advice.
Internally, the parser works with bytes, not strings. String conversion happens once the parsing is done and the field is extracted. This is where your error is located, in csv-parse/dist/cjs/index.cjs:1100. I am not familiar with the "Cannot create a string longer than 0x1fffffe8 characters" error, but it feels as if a buffer grew too large to be converted to a string, as if no CSV parsing had occurred and a single field ended up holding the entire dataset. Setting an incorrect delimiter could produce this.
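As a toy illustration of that failure mode (this uses the sync API and made-up data, not the reporter's file): with record_delimiter forced to a sequence that never occurs in the input, the parser never closes a record, so everything accumulates into very few fields.

```js
const { parse } = require('csv-parse/sync');

// The input uses '\n', but we tell the parser records end with '\r\n'.
const records = parse('a,b,c\n1,2,3\n', { record_delimiter: '\r\n' });

console.log(records.length); // 1: the whole input became a single record
console.log(records[0]);     // [ 'a', 'b', 'c\n1', '2', '3\n' ]
```

Scaled up to a ~1 GB file, one such ever-growing field would eventually exceed V8's maximum string length (0x1fffffe8 characters) when converted.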
If it were a delimiter issue, then I shouldn't be seeing intermediate rows, right? When I iterate through the rows, I can still see the parsed values just fine.
Iterating works just fine until the corrupt line, then it fails. Could there be some invisible byte sequence? I'm not sure how to diagnose these sorts of things.
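One way to hunt for invisible bytes (a sketch; the path and line number are placeholders) is to extract the suspect line and print every byte outside the printable ASCII range, for example with hexdump -C, or in Node:

```js
const fs = require('fs');
const readline = require('readline');

const TARGET = 12345; // hypothetical: the line the parser breaks on

(async () => {
  const rl = readline.createInterface({
    input: fs.createReadStream('data.csv'),
    crlfDelay: Infinity,
  });
  let n = 0;
  for await (const line of rl) {
    n += 1;
    if (n !== TARGET) continue;
    // Flag anything outside printable ASCII; legitimate UTF-8 text will
    // also show up here, so judge the output against the expected content.
    for (const byte of Buffer.from(line)) {
      if (byte < 0x20 || byte > 0x7e) {
        console.log('non-printable/non-ASCII byte: 0x' + byte.toString(16));
      }
    }
    break;
  }
})();
```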
I would say that the parser is agnostic to any byte that is not configured as an option, such as the delimiter, unless proven wrong of course.
Ok, I'll close for now, but I'll let you know if I can diagnose a specific bad byte here.