Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cvs-parse fails 'randomly' on multi-character column delimiter and row terminator #433

Open
metarama opened this issue Jun 8, 2024 · 3 comments

Comments

@metarama
Copy link

metarama commented Jun 8, 2024

Describe the bug

Thank you for the great work on csv-parse.

I am trying to import a csv files (over 850 rows with 51 columns) that is located in an AWS S3 folder. At the moment I am processing one row at a time.

I see the column count mismatch error happening on different rows. The erroring row keeps moving around. Sometimes the 11th row will fail. Sometime 439th row will fail. Whenever the row fails I can see the delimiter '|@|' has been taken in as part of the field. For example, a failing row comes in as '11|@|543E9108-177F-4402-B423-AF9A006A6698', '9', '00000000014621DD', ... You see the delimiter was not processed correctly and the first two fields got fused together, and as a result the column count does not match and thus the error. Also, the key observation is that the location of this error moves around.

Sometimes the delimiter all by itself becomes a field, like in this example of row data parsed, '2023-11-29 11:08:25.2150201 -05:00', '|@|', 'Boot', '', 'Testing, ...

I have not looked at the code but based on my observation I can guess that the delimiter (and terminator) detection misses the case when the delimiter (or terminator) character sequence spans across the current buffer boundary. The partial delimiter sequence at the end of the buffer or at the beginning of the buffer gets handled like a regular field characters and gets fused with a field.

Single character delimiters and terminators on the other hand are easier to process. Multi-character delimiter and row terminator will need extra care. I feel that care is missing.

To Reproduce

const parser = parse({ 
      delimiter: '|@|', 
      record_delimiter: '|$|', 
      columns: true, 
      from: 1, 
      relax_quotes: false,
      encoding: "utf8", 
      quote: '"',
      ignore_last_delimiters: true,
      trim: true, 
      bom: true, 
    } as Options);

const s3Stream = Readable.from(response.Body as AsyncIterable<Uint8Array>);
const csvStream = s3Stream.pipe(parser);

for await (const row of csvStream) {
  // process the row
}


@metarama metarama changed the title cvs-parse fails on multi-character column delimiter and row terminator cvs-parse fails 'randomly' on multi-character column delimiter and row terminator Jun 8, 2024
@metarama
Copy link
Author

metarama commented Jun 8, 2024

Below is the __isDelimiter function defined at line 596 of node-csv/packages/csv-parse/lib/api/index.js.

I bolded the line 'if(del[j] !== buf[pos+j]) continue loop1;'. What if the pos+j exceeds the buffer length? Should this line be enhanced to handle spilling over the buffer length?

    __isDelimiter: function(buf, pos, chr){
      const {delimiter, ignore_last_delimiters} = this.options;
      if(ignore_last_delimiters === true && this.state.record.length === this.options.columns.length - 1){
        return 0;
      }else if(ignore_last_delimiters !== false && typeof ignore_last_delimiters === 'number' && this.state.record.length === ignore_last_delimiters - 1){
        return 0;
      }
      loop1: for(let i = 0; i < delimiter.length; i++){
        const del = delimiter[i];
        if(del[0] === chr){
          for(let j = 1; j < del.length; j++){
            **if(del[j] !== buf[pos+j]) continue loop1;**
          }
          return del.length;
        }
      }
      return 0;
    },

@metarama
Copy link
Author

metarama commented Jun 8, 2024

I see the identical problem in the __isRecordDelimiter function defined at line 614 of the same file mentioned above.

@wdavidw
Copy link
Member

wdavidw commented Jun 8, 2024

What if the pos+j exceeds the buffer length

There is a function check just for this. It is called __needMoreData. Look inside if you think that could be the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants