
[BUG] Last columns with empty first-line values not being read from .xlsx #785

mand35 opened this issue Sep 15, 2023 · 3 comments
mand35 commented Sep 15, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Reading an .xlsx file whose first read line (or first considered line, if the first line is completely empty) contains empty columns cuts off the remaining columns at the end, after the last value in that line, even when a specific range is given and headers are not being read.


+------+------+------+------+------+
|_c0   |_c1   |_c2   |_c3   |_c4   |
+------+------+------+------+------+
|null  |null  |cat_a |null  |cat_b |
|name_a|name_b|name_c|name_d|name_e|
|1_1   |2_1   |3_1   |4_1   |5_1   |
|1_2   |2_2   |3_2   |4_2   |5_2   |
|1_3   |2_3   |3_3   |4_3   |5_3   |
|1_4   |2_4   |3_4   |4_4   |5_4   |
|1_5   |2_5   |3_5   |4_5   |5_5   |
|1_6   |2_6   |3_6   |4_6   |5_6   |
|1_7   |2_7   |3_7   |4_7   |5_7   |
+------+------+------+------+------+

Background: I want to process the first two lines to get a consolidated header. Reading only the data block starting from line 4 works as expected (see the sketch below).
This is similar to #631 and #366.
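
For context, a minimal sketch of the intended two-pass workaround. The A2:G3/A4:G10 ranges and the consolidation logic are illustrative assumptions, and reading the header range is exactly what this bug currently breaks:

    // Sketch only: read the two header lines and the data block separately,
    // then consolidate the header lines into one name per column.
    // Note: reading A2:G3 hits this bug, since its first line has empty cells.
    val headerRows = spark.read
      .format("excel")
      .option("dataAddress", "'Sheet1'!A2:G3") // category line + name line
      .option("header", "false")
      .load("test.xlsx")
      .collect()

    val data = spark.read
      .format("excel")
      .option("dataAddress", "'Sheet1'!A4:G10") // data block only; works as expected
      .option("header", "false")
      .load("test.xlsx")

    // e.g. (cat_a, name_c) -> "cat_a_name_c", (null, name_a) -> "name_a"
    val Array(categories, names) = headerRows
    val consolidated = data.columns.indices.map { i =>
      Seq(Option(categories.getString(i)), Option(names.getString(i))).flatten.mkString("_")
    }
    val consolidatedDf = data.toDF(consolidated: _*)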

Expected Behavior

Reading the full specified range of data.

Steps To Reproduce

    val df = spark.read
      .format("excel")
      .option("dataAddress", "'Sheet1'!A2:G10")
      .option("header", "false") // Required
      .load("test.xlsx")
    df.show(false)

test.xlsx

Environment

- Spark version: 3.3.2
- Spark-Excel version: 0.18.6-beta1
- OS: Windows 10E with IntelliJ
- Cluster environment

Anything else?

No response

@cegaspar

Same issue here.

The problem seems to be caused by the filters at https://github.com/crealytics/spark-excel/blob/f74995b346344adc9154b8a2caf24328ccae338f/src/main/scala/com/crealytics/spark/excel/v2/DataLocator.scala#L57 and https://github.com/crealytics/spark-excel/blob/f74995b346344adc9154b8a2caf24328ccae338f/src/main/scala/com/crealytics/spark/excel/v2/DataLocator.scala#L46

Apache POI's getLastCellNum() ignores trailing empty cells in a row, which is why those columns are removed from the row iterator even when dataAddress explicitly includes them. One possible fix is to not filter out empty cells that are inside the dataAddress range; another option is to take the number of columns in the schema into account.
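
For illustration, a hedged sketch of the first option: walk every column the requested range covers instead of iterating only the cells POI materializes. cellsInRange is a hypothetical helper, not the actual DataLocator API:

    import org.apache.poi.ss.usermodel.{Cell, MissingCellPolicy, Row}
    import org.apache.poi.ss.util.CellRangeAddress

    // Hypothetical helper: getLastCellNum() stops at the last cell that
    // physically exists in the row, so trailing empty columns vanish.
    // Walking the requested range instead keeps a slot for each of them.
    def cellsInRange(row: Row, range: CellRangeAddress): Seq[Cell] =
      (range.getFirstColumn to range.getLastColumn).map { col =>
        row.getCell(col, MissingCellPolicy.RETURN_BLANK_AS_NULL) // null for missing/blank
      }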

@nightscape (Owner)

Hey @cegaspar, thanks for the analysis!
Would you mind giving a PR a try?

@cegaspar

Hey @nightscape.
Sure, no problem.
