Adult dataset incorrect handling of formatting error in adult datafile #7

Open
stighellemans opened this issue Feb 1, 2024 · 0 comments

Comments

@stighellemans

Describe the bug
The rows of the Adult dataset end with a dot: '...,<=50K.'
When the file is parsed naively, the dot ends up in the label column, which is undesirable. To handle this, the preprocessing function of the Adult class (Adult.py) contains the following line:

df["Target"] = df["Target"].str.replace(r".", "", regex=True)

However, in my case this does not fix the issue.
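
One possible explanation, offered as a guess rather than a confirmed diagnosis: with regex=True, an unescaped "." is a wildcard that matches every character, so the call removes the whole label instead of just the dot. The sketch below uses a toy DataFrame, not the real file, and its sample values (including the leading space) are assumptions about what the raw column looks like. It compares the current call with an escaped, end-anchored pattern.

```python
import pandas as pd

# Toy stand-in for the raw label column (assumed values, not read from the real file).
df = pd.DataFrame({"Target": [" <=50K.", " >50K.", " <=50K"]})

# Unescaped "." with regex=True matches any character, so every label is emptied.
wiped = df["Target"].str.replace(r".", "", regex=True)

# Escaping the dot and anchoring it to the end removes only a trailing dot.
# The strip() call also drops surrounding whitespace.
cleaned = df["Target"].str.strip().str.replace(r"\.$", "", regex=True)

print(wiped.tolist())    # ['', '', '']
print(cleaned.tolist())  # ['<=50K', '>50K', '<=50K']
```

Passing regex=False, or using str.rstrip("."), would be other ways to treat the dot literally. Which one fits best depends on how the rest of the preprocessing handles the column.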

To Reproduce
import pandas as pd

# Adult is the dataset class defined in Adult.py
adult = Adult(
    test_path="local_path_to_test_adult.csv",
    train_path="local_path_to_train_adult.csv",
    preprocess=True,
)

adult_test_data = adult.inverse_preprocess(adult.test_data)

adult_data = pd.concat(
    [adult_test_data,
     adult.test_labels[">50K"].to_frame(name="labels").astype("float32")],
    axis=1,
)

--> This will give the error

Expected behavior
I would expect the code to remove the trailing dot from the labels.

Environment

  • Python version: 3.8
  • OS: macOS (M1 chip)