You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am trying to import a csv files (over 850 rows with 51 columns) that is located in an AWS S3 folder. At the moment I am processing one row at a time.
I see the column count mismatch error happening on different rows. The erroring row keeps moving around. Sometimes the 11th row will fail. Sometime 439th row will fail. Whenever the row fails I can see the delimiter '|@|' has been taken in as part of the field. For example, a failing row comes in as '11|@|543E9108-177F-4402-B423-AF9A006A6698', '9', '00000000014621DD', ... You see the delimiter was not processed correctly and the first two fields got fused together, and as a result the column count does not match and thus the error. Also, the key observation is that the location of this error moves around.
Sometimes the delimiter all by itself becomes a field, like in this example of row data parsed, '2023-11-29 11:08:25.2150201 -05:00', '|@|', 'Boot', '', 'Testing, ...
I have not looked at the code but based on my observation I can guess that the delimiter (and terminator) detection misses the case when the delimiter (or terminator) character sequence spans across the current buffer boundary. The partial delimiter sequence at the end of the buffer or at the beginning of the buffer gets handled like a regular field characters and gets fused with a field.
Single character delimiters and terminators on the other hand are easier to process. Multi-character delimiter and row terminator will need extra care. I feel that care is missing.
To Reproduce
const parser = parse({
delimiter: '|@|',
record_delimiter: '|$|',
columns: true,
from: 1,
relax_quotes: false,
encoding: "utf8",
quote: '"',
ignore_last_delimiters: true,
trim: true,
bom: true,
} as Options);
const s3Stream = Readable.from(response.Body as AsyncIterable<Uint8Array>);
const csvStream = s3Stream.pipe(parser);
for await (const row of csvStream) {
// process the row
}
The text was updated successfully, but these errors were encountered:
metarama
changed the title
cvs-parse fails on multi-character column delimiter and row terminator
cvs-parse fails 'randomly' on multi-character column delimiter and row terminator
Jun 8, 2024
Below is the __isDelimiter function defined at line 596 of node-csv/packages/csv-parse/lib/api/index.js.
I bolded the line 'if(del[j] !== buf[pos+j]) continue loop1;'. What if the pos+j exceeds the buffer length? Should this line be enhanced to handle spilling over the buffer length?
Describe the bug
Thank you for the great work on csv-parse.
I am trying to import a csv files (over 850 rows with 51 columns) that is located in an AWS S3 folder. At the moment I am processing one row at a time.
I see the column count mismatch error happening on different rows. The erroring row keeps moving around. Sometimes the 11th row will fail. Sometime 439th row will fail. Whenever the row fails I can see the delimiter '|@|' has been taken in as part of the field. For example, a failing row comes in as '11|@|543E9108-177F-4402-B423-AF9A006A6698', '9', '00000000014621DD', ... You see the delimiter was not processed correctly and the first two fields got fused together, and as a result the column count does not match and thus the error. Also, the key observation is that the location of this error moves around.
Sometimes the delimiter all by itself becomes a field, like in this example of row data parsed, '2023-11-29 11:08:25.2150201 -05:00', '|@|', 'Boot', '', 'Testing, ...
I have not looked at the code but based on my observation I can guess that the delimiter (and terminator) detection misses the case when the delimiter (or terminator) character sequence spans across the current buffer boundary. The partial delimiter sequence at the end of the buffer or at the beginning of the buffer gets handled like a regular field characters and gets fused with a field.
Single character delimiters and terminators on the other hand are easier to process. Multi-character delimiter and row terminator will need extra care. I feel that care is missing.
To Reproduce
The text was updated successfully, but these errors were encountered: