
[BUG] Last columns with empty first-line values not being read from .xlsx #785

mand35 opened this issue Sep 15, 2023 · 3 comments
mand35 commented Sep 15, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Reading an .xlsx file whose first read line (or first considered line, if the first line is completely empty) contains empty columns cuts off the remaining columns at the end, after the last value in that line, even when a specific range is given and headers are not being read.


+------+------+------+------+------+
|_c0   |_c1   |_c2   |_c3   |_c4   |
+------+------+------+------+------+
|null  |null  |cat_a |null  |cat_b |
|name_a|name_b|name_c|name_d|name_e|
|1_1   |2_1   |3_1   |4_1   |5_1   |
|1_2   |2_2   |3_2   |4_2   |5_2   |
|1_3   |2_3   |3_3   |4_3   |5_3   |
|1_4   |2_4   |3_4   |4_4   |5_4   |
|1_5   |2_5   |3_5   |4_5   |5_5   |
|1_6   |2_6   |3_6   |4_6   |5_6   |
|1_7   |2_7   |3_7   |4_7   |5_7   |
+------+------+------+------+------+

Background: I want to process the first two lines to get a consolidated header. Reading only the data block starting from line 4 works as expected (see the sketch below).
This is similar to #631 and #366.
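
For context, a minimal sketch of the intended two-pass workaround. The A2:G3/A4:G10 ranges and the consolidation logic are illustrative assumptions, and reading the header range is exactly what this bug currently breaks:

    // Sketch only: read the two header lines and the data block separately,
    // then consolidate the header lines into one name per column.
    // Note: reading A2:G3 hits this bug, since its first line has empty cells.
    val headerRows = spark.read
      .format("excel")
      .option("dataAddress", "'Sheet1'!A2:G3") // category line + name line
      .option("header", "false")
      .load("test.xlsx")
      .collect()

    val data = spark.read
      .format("excel")
      .option("dataAddress", "'Sheet1'!A4:G10") // data block only; works as expected
      .option("header", "false")
      .load("test.xlsx")

    // e.g. (cat_a, name_c) -> "cat_a_name_c", (null, name_a) -> "name_a"
    val Array(categories, names) = headerRows
    val consolidated = data.columns.indices.map { i =>
      Seq(Option(categories.getString(i)), Option(names.getString(i))).flatten.mkString("_")
    }
    val consolidatedDf = data.toDF(consolidated: _*)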

Expected Behavior

Reading the full specified range of data.

Steps To Reproduce

    val df = spark.read
      .format("excel")
      .option("dataAddress", "'Sheet1'!A2:G10")
      .option("header", "false") // Required
      .load("test.xlsx")
    df.show(false)

test.xlsx

Environment

- Spark version: 3.3.2
- Spark-Excel version: 0.18.6-beta1
- OS: Windows 10E with IntelliJ
- Cluster environment

Anything else?

No response

@cegaspar

Same issue here.

The problem seems to be caused by the filters at https://github.com/crealytics/spark-excel/blob/f74995b346344adc9154b8a2caf24328ccae338f/src/main/scala/com/crealytics/spark/excel/v2/DataLocator.scala#L57 and https://github.com/crealytics/spark-excel/blob/f74995b346344adc9154b8a2caf24328ccae338f/src/main/scala/com/crealytics/spark/excel/v2/DataLocator.scala#L46

Apache POI's getLastCellNum() ignores trailing empty cells in a row, which is why those columns are removed from the row iterator even when dataAddress explicitly includes them. One possible fix is to not filter out empty cells that are inside the dataAddress range; another option is to take the number of columns in the schema into account.
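
For illustration, a hedged sketch of the first option: walk every column the requested range covers instead of iterating only the cells POI materializes. cellsInRange is a hypothetical helper, not the actual DataLocator API:

    import org.apache.poi.ss.usermodel.{Cell, MissingCellPolicy, Row}
    import org.apache.poi.ss.util.CellRangeAddress

    // Hypothetical helper: getLastCellNum() stops at the last cell that
    // physically exists in the row, so trailing empty columns vanish.
    // Walking the requested range instead keeps a slot for each of them.
    def cellsInRange(row: Row, range: CellRangeAddress): Seq[Cell] =
      (range.getFirstColumn to range.getLastColumn).map { col =>
        row.getCell(col, MissingCellPolicy.RETURN_BLANK_AS_NULL) // null for missing/blank
      }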

@nightscape (Owner)

Hey @cegaspar, thanks for the analysis!
Would you mind giving a PR a try?

@cegaspar

Hey @nightscape.
Sure, no problem.
