Fed Statistics: Adding Percentiles support (NVIDIA#3124)

* 1. Add percentile support using t-digest
  2. Add examples for df_stats
  3. Refactor some of the codebase
  4. Missing work:
     1. add DP noise
     2. make writing filters easier for end users
     3. add a job API for the stats job
     4. make it even easier to work on stats
     5. unit tests
* add unit tests; add job API in example; format style
* add tdigest license file
* remove debugging print
* fix test
* format style changes
SYangster · Jan 3, 2025 · 62e2441
1 parent 6d9d775
commit 62e2441

Showing 29 changed files with 819 additions and 143 deletions.
diff --git a/3rdParty/tdigest.LICENSE.txt b/3rdParty/tdigest.LICENSE.txt
@@ -0,0 +1,23 @@
+https://github.com/CamDavidsonPilon/tdigest/blob/master/LICENSE.txt
+
+The MIT License (MIT)
+
+Copyright (c) 2015 Cameron Davidson-Pilon
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -44,6 +44,10 @@ To collaborate efficiently, please read through this section and follow them.
 * [Building documentation](#building-the-documentation)
 * [Signing your work](#signing-your-work)
 
+> Note:
+  > Some package dependencies require `python<version>-dev` for local development, such as
+  > `python3.12-dev`.
+
 #### Checking the coding style
 We check code style using flake8 and isort.
 A bash script (`runtest.sh`) is provided to run all tests locally.

diff --git a/examples/advanced/federated-statistics/df_stats.ipynb b/examples/advanced/federated-statistics/df_stats.ipynb
@@ -144,15 +144,27 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "0d5041aa-c2e0-4af6-a2c8-bae76e4512d0",
+   "id": "6361a85e-4187-433c-976c-0dc4021908ac",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! nvflare simulator df_stats/jobs/df_stats -w /tmp/nvflare/df/workdir -n 2 -t 2"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4fdbfb95-90c9-4d45-b727-dab6f5a8bc41",
    "metadata": {
     "tags": []
    },
-   "outputs": [],
    "source": [
+    "Or use the equivalent Python code:\n",
+    "```\n",
     "from nvflare.private.fed.app.simulator.simulator_runner import SimulatorRunner\n",
-    "runner = SimulatorRunner(job_folder=\"df_stats/jobs/df_stats\", workspace=\"/tmp/nvflare/df_stats/workdir\", n_clients = 2, threads=2)\n",
-    "runner.run()"
+    "runner = SimulatorRunner(job_folder=\"df_stats/jobs/df_stats\", workspace=\"/tmp/nvflare/df/workdir\", n_clients = 2, threads=2)\n",
+    "runner.run()\n",
+    "\n",
+    "```"
    ]
   },
   {
@@ -167,7 +179,7 @@
     "From a **terminal** one can also the following equivalent CLI\n",
     "\n",
     "```\n",
-    "nvflare simulator df_stats/jobs/df_stats -w /tmp/nvflare/df_stats -n 2 -t 2\n",
+    "nvflare simulator df_stats/jobs/df_stats -w /tmp/nvflare/df/workdir -n 2 -t 2\n",
     "\n",
     "```\n",
     "\n",
@@ -184,9 +196,9 @@
    "metadata": {},
    "source": [
     "\n",
-    "The results are stored in workspace \"/tmp/nvflare/df_stats/workdir/\"\n",
+    "The results are stored in workspace \"/tmp/nvflare/df/workdir/\"\n",
     "```\n",
-    "/tmp/nvflare/df_stats/workdir/server/simulate_job/statistics/adults_stats.json\n",
+    "/tmp/nvflare/df/workdir/server/simulate_job/statistics/adults_stats.json\n",
     "```"
    ]
   },
@@ -199,7 +211,7 @@
    },
    "outputs": [],
    "source": [
-    "cat /tmp/nvflare/df_stats/workdir/server/simulate_job/statistics/adults_stats.json"
+    "cat /tmp/nvflare/df/workdir/server/simulate_job/statistics/adults_stats.json"
    ]
   },
   {
@@ -222,7 +234,7 @@
    },
    "outputs": [],
    "source": [
-    "! cp /tmp/nvflare/df_stats/workdir/server/simulate_job/statistics/adults_stats.json df_stats/demo/."
+    "! cp /tmp/nvflare/df/workdir/server/simulate_job/statistics/adults_stats.json df_stats/demo/."
    ]
   },
   {
@@ -271,7 +283,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.19"
+   "version": "3.10.2"
   }
  },
  "nbformat": 4,

diff --git a/examples/advanced/federated-statistics/df_stats/demo/visualization.ipynb b/examples/advanced/federated-statistics/df_stats/demo/visualization.ipynb
@@ -285,7 +285,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.17"
+   "version": "3.10.2"
   }
  },
  "nbformat": 4,

diff --git a/examples/advanced/federated-statistics/df_stats/job_api/df_statistics.py b/examples/advanced/federated-statistics/df_stats/job_api/df_statistics.py
@@ -0,0 +1,75 @@
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import Dict, Optional
+
+import pandas as pd
+
+from nvflare.apis.fl_context import FLContext
+from nvflare.app_opt.statistics.df.df_core_statistics import DFStatisticsCore
+
+
+class DFStatistics(DFStatisticsCore):
+    def __init__(self, data_path):
+        super().__init__()
+        self.data_root_dir = "/tmp/nvflare/df_stats/data"
+        self.data_path = data_path
+        self.data: Optional[Dict[str, pd.DataFrame]] = None
+        self.data_features = [
+            "Age",
+            "Workclass",
+            "fnlwgt",
+            "Education",
+            "Education-Num",
+            "Marital Status",
+            "Occupation",
+            "Relationship",
+            "Race",
+            "Sex",
+            "Capital Gain",
+            "Capital Loss",
+            "Hours per week",
+            "Country",
+            "Target",
+        ]
+
+        # the original dataset has no header,
+        # we will use the adult.train dataset for site-1, the adult.test dataset for site-2
+        # the adult.test dataset has an incorrectly formatted first row, so we skip it.
+        self.skip_rows = {
+            "site-1": [],
+            "site-2": [0],
+        }
+
+    def load_data(self, fl_ctx: FLContext) -> Dict[str, pd.DataFrame]:
+        client_name = fl_ctx.get_identity_name()
+        self.log_info(fl_ctx, f"load data for client {client_name}")
+        try:
+            skip_rows = self.skip_rows[client_name]
+            data_path = f"{self.data_root_dir}/{fl_ctx.get_identity_name()}/{self.data_path}"
+            # example of loading data from CSV
+            df: pd.DataFrame = pd.read_csv(
+                data_path, names=self.data_features, sep=r"\s*,\s*", skiprows=skip_rows, engine="python", na_values="?"
+            )
+            train = df.sample(frac=0.8, random_state=200)  # random state is a seed value
+            test = df.drop(train.index).sample(frac=1.0)
+
+            self.log_info(fl_ctx, f"load data done for client {client_name}")
+            return {"train": train, "test": test}
+
+        except Exception as e:
+            raise Exception(f"Load data for client {client_name} failed! {e}") from e
+
+    def initialize(self, fl_ctx: FLContext):
+        self.data = self.load_data(fl_ctx)
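
The `load_data` method above passes `sep=r"\s*,\s*"` and `na_values="?"` to `pd.read_csv`. As a rough, pandas-free sketch of what those two arguments accomplish on a single raw line: the separator swallows whitespace around each comma, and the `?` marker is treated as missing. `parse_row` is a hypothetical helper and the sample row is made up for illustration; neither is part of this commit.

```python
import re
from typing import List, Optional

# Same pattern that is passed as `sep` to pd.read_csv in load_data.
SEP = re.compile(r"\s*,\s*")


def parse_row(line: str, na_value: str = "?") -> List[Optional[str]]:
    """Hypothetical helper: split one CSV line and map the NA marker to None."""
    fields = SEP.split(line.strip())
    return [None if field == na_value else field for field in fields]


# Made-up row in the style of the dataset (not copied from the real file).
row = parse_row("39, State-gov , 77516, ?, Bachelors\n")
print(row)  # ['39', 'State-gov', '77516', None, 'Bachelors']
```

Because the separator is a regular expression, pandas needs `engine="python"` here, which is why the call above sets it.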
diff --git a/examples/advanced/federated-statistics/df_stats/job_api/df_stats_job.py b/examples/advanced/federated-statistics/df_stats/job_api/df_stats_job.py
@@ -0,0 +1,72 @@
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+
+from df_statistics import DFStatistics
+
+from nvflare.job_config.stats_job import StatsJob
+
+
+def define_parser():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("-n", "--n_clients", type=int, default=3)
+    parser.add_argument("-d", "--data_root_dir", type=str, nargs="?", default="/tmp/nvflare/dataset/output")
+    parser.add_argument("-o", "--stats_output_path", type=str, nargs="?", default="statistics/stats.json")
+    parser.add_argument("-j", "--job_dir", type=str, nargs="?", default="/tmp/nvflare/jobs/stats_df")
+    parser.add_argument("-w", "--work_dir", type=str, nargs="?", default="/tmp/nvflare/jobs/stats_df/work_dir")
+    parser.add_argument("-co", "--export_config", action="store_true", help="config only mode, export config")
+
+    return parser.parse_args()
+
+
+def main():
+    args = define_parser()
+
+    n_clients = args.n_clients
+    data_root_dir = args.data_root_dir
+    output_path = args.stats_output_path
+    job_dir = args.job_dir
+    work_dir = args.work_dir
+    export_config = args.export_config
+
+    statistic_configs = {
+        "count": {},
+        "mean": {},
+        "sum": {},
+        "stddev": {},
+        # per-feature histogram settings belong under the "histogram" key
+        "histogram": {"*": {"bins": 20}, "Age": {"bins": 20, "range": [0, 10]}},
+        "percentile": {"*": [25, 50, 75], "Age": [50, 95]},
+    }
+    # define local stats generator
+    df_stats_generator = DFStatistics(data_root_dir=data_root_dir)
+
+    job = StatsJob(
+        job_name="stats_df",
+        statistic_configs=statistic_configs,
+        stats_generator=df_stats_generator,
+        output_path=output_path,
+    )
+
+    sites = [f"site-{i + 1}" for i in range(n_clients)]
+    job.setup_clients(sites)
+
+    if export_config:
+        job.export_job(job_dir)
+    else:
+        job.simulator_run(work_dir)
+
+
+if __name__ == "__main__":
+    main()
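
The `percentile` entry in `statistic_configs` above maps `"*"` to a default list and `"Age"` to a feature-specific list. One plausible way to read such a config is that a feature-specific entry overrides the `"*"` default. The sketch below illustrates that reading only; the override rule and the `percentiles_for` helper are assumptions for illustration, not NVFlare's confirmed behavior.

```python
from typing import Dict, List

# Copied from the "percentile" entry of statistic_configs in df_stats_job.py.
percentile_config: Dict[str, List[int]] = {"*": [25, 50, 75], "Age": [50, 95]}


def percentiles_for(feature: str, config: Dict[str, List[int]]) -> List[int]:
    """Hypothetical helper: a feature-specific entry wins over the "*" default."""
    if feature in config:
        return config[feature]
    return config.get("*", [])


print(percentiles_for("Age", percentile_config))     # [50, 95]
print(percentiles_for("fnlwgt", percentile_config))  # [25, 50, 75]
```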
diff --git a/...es/advanced/federated-statistics/df_stats/jobs/df_stats/app/config/config_fed_client.json b/...es/advanced/federated-statistics/df_stats/jobs/df_stats/app/config/config_fed_client.json
@@ -14,23 +14,7 @@
       }
     }
   ],
-  "task_result_filters": [
-    {
-      "tasks": ["fed_stats"],
-      "filters":[
-        {
-          "path": "nvflare.app_common.filters.statistics_privacy_filter.StatisticsPrivacyFilter",
-          "args": {
-            "result_cleanser_ids": [
-              "min_count_cleanser",
-              "min_max_noise_cleanser",
-              "hist_bins_cleanser"
-            ]
-          }
-        }
-      ]
-    }
-  ],
+
   "task_data_filters": [],
   "components": [
     {

diff --git a/...es/advanced/federated-statistics/df_stats/jobs/df_stats/app/config/config_fed_server.json b/...es/advanced/federated-statistics/df_stats/jobs/df_stats/app/config/config_fed_server.json
@@ -18,10 +18,14 @@
               "bins": 10,
               "range": [0,120]
             }
+          },
+          "percentile": {
+            "*": [25, 50, 75]
           }
         },
         "writer_id": "stats_writer",
-        "enable_pre_run_task": false
+        "enable_pre_run_task": false,
+        "precision" : 2
       }
     }
   ],
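
The server config above requests the 25th, 50th, and 75th percentiles for every feature, and the commit message says percentile support is built on t-digest. The sketch below is not the t-digest algorithm; it is a small pure-Python illustration, on made-up data, of why a mergeable summary is needed at all: averaging each site's local median does not in general give the median of the pooled data. `nearest_rank_percentile` is an illustrative helper, not an NVFlare function.

```python
import math
from typing import List


def nearest_rank_percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


site_1 = [1, 2, 3, 4, 100]  # made-up values held by one client
site_2 = [50, 60, 70]       # made-up values held by another client

# Naive aggregation: average the per-site medians.
local_medians = [nearest_rank_percentile(site, 50) for site in (site_1, site_2)]
averaged = sum(local_medians) / len(local_medians)

# Ground truth: median of the pooled data, which the sites cannot share directly.
pooled = nearest_rank_percentile(site_1 + site_2, 50)

print(local_medians, averaged, pooled)  # [3, 60] 31.5 4
```

A digest that can be merged across sites lets the server approximate the pooled percentile without seeing the raw values, which is presumably the role t-digest plays in this change.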