Adult dataset incorrect handling of formatting error in adult datafile #7

stighellemans · 2024-02-01T22:12:52Z

Describe the bug
The rows of the Adult dataset end with a dot: '...,<=50K.'
When handled naively, the dot is included in the label column, which is undesirable. To deal with this, the following code is implemented in the preprocessing function of the Adult class (Adult.py):

df["Target"] = df["Target"].str.replace(r".", "", regex=True)

In my case, however, this does not fix the problem.
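One possible contributor, shown here as a minimal sketch rather than a confirmed diagnosis: with regex=True, an unescaped "." is a regular-expression wildcard that matches any character, not just a literal dot. The sample values in `target` are made up for illustration; only the `str.replace` call mirrors the line quoted above.

```python
import pandas as pd

# Hypothetical label values; the first two carry the trailing dot described above.
target = pd.Series(["<=50K.", ">50K.", "<=50K"])

# As quoted in the report: "." is a regex wildcard, so every character is removed.
wiped = target.str.replace(r".", "", regex=True)

# Option 1: treat the pattern as a literal string.
literal = target.str.replace(".", "", regex=False)

# Option 2: keep regex mode, but escape the dot and anchor it to the end of the value.
anchored = target.str.replace(r"\.$", "", regex=True)

print(wiped.tolist())
print(literal.tolist())
print(anchored.tolist())
```

Either alternative leaves the label text intact and drops only the dot. The anchored form is the more conservative of the two, since it only touches a dot at the very end of the value.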

To Reproduce
adult = Adult(
    test_path="local_path_to_test_adult.csv",
    train_path="local_path_to_train_adult.csv",
    preprocess=True,
)

adult_test_data = adult.inverse_preprocess(adult.test_data)

adult_data = pd.concat(
    [
        adult_test_data,
        adult.test_labels[">50K"].to_frame(name="labels").astype("float32"),
    ],
    axis=1,
)

--> This will give the error

Expected behavior
I would expect the code to remove the trailing dot from the label values.

Environment

Python version: 3.8
OS: MacOS M1 chip

stighellemans mentioned this issue Feb 1, 2024

Fix bugs #10

Open

9 tasks
