Skip to content

Latest commit

 

History

History
185 lines (163 loc) · 4.86 KB

README.md

File metadata and controls

185 lines (163 loc) · 4.86 KB

Anonymize df: a convenient way to anonymize your data for analytics

PyPI PyPI - Status PyPI - License Code style: black

What is it?

Anonymize df is a package that helps you quickly and easily generate realistic fake data from a Pandas DataFrame.

What are the expected use cases / why was this made?

  • You're hiring consultants to work on your data but need to anonymize it first
  • You're a consultant and created something great that you want to make into a template

Installation

You can install anonymizedf using pip:

pip install anonymizedf

This will also try downloading the tableau hyper api and pandas packages if you don't have them already.

If you don't want to use pip you can also download this repository and execute:

python setup.py install

Example usage

import pandas as pd
from anonymizedf.anonymizedf import anonymize

# Import the data
df = pd.read_csv("https://query.data.world/s/shcktxndtu3ojonm46tb5udlz7sp3e")

# Prepare the data to be anonymized
an = anonymize(df)

# Select what data you want to anonymize and your preferred style

# Example 1 - just updates df
an.fake_names("Customer Name")
an.fake_ids("Customer ID")
an.fake_whole_numbers("Loyalty Reward Points")
an.fake_categories("Segment")
an.fake_dates("Date")
an.fake_decimal_numbers("Fraction")

# Example 2 - method chaining
fake_df = (
    an
    .fake_names("Customer Name", chaining=True)
    .fake_ids("Customer ID", chaining=True)
    .fake_whole_numbers("Loyalty Reward Points", chaining=True)
    .fake_categories("Segment", chaining=True)
    .fake_dates("Date", chaining=True)
    .fake_decimal_numbers("Fraction", chaining=True)
    .show_data_frame()
)

# Example 3 - multiple assignments
fake_df = an.fake_names("Customer Name")
fake_df = an.fake_ids("Customer ID")
fake_df = an.fake_whole_numbers("Loyalty Reward Points")
fake_df = an.fake_categories("Segment")
fake_df = an.fake_dates("Date")
fake_df = an.fake_decimal_numbers("Fraction")

fake_df.to_csv("fake_customers.csv", index=False)

# One thing to note is that you can't directly pass in a list of columns.
# If you want to apply the same function to multiple columns there are many ways to do that.

# Example 4 - for multiple columns

for column in column_list:
    an.fake_categories(column)

Example output

Customer ID Customer Name Loyalty Reward Points Segment Date Fraction Fake_Customer Name Fake_Customer ID Fake_Loyalty Reward Points Fake_Segment Fake_Date Fake_Fraction
0 AA-10315 Alex Avila 76 Consumer 01/01/2000 7.6 Christian Metcalfe-Reid YEJP71011502726136 558 Segment 1 1978-11-09 29.96
1 AA-10375 Allen Armold 369 Consumer 02/01/2000 36.9 Helen Taylor XWOB83170110594048 286 Segment 1 1989-12-29 72.50
2 AA-10480 Andrew Allen 162 Consumer 03/01/2000 16.2 Joanne Price VVCJ28547588747677 742 Segment 1 1982-09-23 79.77
3 AA-10645 Anna Andreadi 803 Consumer 04/01/2000 80.3 Rhys Jones OXCI12190813836802 206 Segment 1 2000-10-14 7.15
4 AB-10015 Aaron Bergman 935 Consumer 05/01/2000 93.5 Nigel Baldwin-Cook JOXS05799252235987 914 Segment 1 2018-01-30 40.66

Dependencies