Parallel initialization for k-means #1754

Open
wants to merge 3 commits into
base: main
from

Conversation

Hakdag97
Collaborator

@Hakdag97 Hakdag97 commented Dec 17, 2024

Description

The runtime bottleneck of k-means clustering is the initialization of centroids, which previously relied on a cost-intensive serial algorithm. This pull request replaces that algorithm with the more sophisticated k-means|| initialization of centroids.
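
For orientation, the general idea behind k-means|| can be sketched in plain Python. This is a single-process illustration under stated assumptions, not the implementation in this PR: the names kmeans_parallel_init, sq_dist, and nearest, the parameters oversampling and rounds, and their defaults are all hypothetical, and the final reduction to k centers uses a weighted k-means++ step as one possible choice. In a distributed setting, the per-point loops below are the parts that could run on each process independently.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Return (index, squared distance) of the center closest to `point`."""
    dists = [sq_dist(point, c) for c in centers]
    idx = min(range(len(centers)), key=dists.__getitem__)
    return idx, dists[idx]


def kmeans_parallel_init(points, k, oversampling=None, rounds=5, seed=0):
    """Illustrative k-means||-style seeding (hypothetical helper).

    Returns up to k centers chosen from `points`; fewer than k are
    returned only if the data has fewer than k usable candidates.
    """
    rng = random.Random(seed)
    # Assumed default: oversample about 2k candidates per round.
    ell = oversampling if oversampling is not None else 2 * k

    # Step 1: start from one uniformly random point.
    candidates = [rng.choice(points)]

    # Step 2: in each round, sample every point independently with
    # probability proportional to its squared distance to the candidates.
    for _ in range(rounds):
        d2 = [nearest(p, candidates)[1] for p in points]
        cost = sum(d2)
        if cost == 0:
            break  # every point already coincides with a candidate
        candidates.extend(
            p for p, d in zip(points, d2)
            if rng.random() < min(1.0, ell * d / cost)
        )

    # Step 3: weight each candidate by the number of points closest to it.
    weights = [0] * len(candidates)
    for p in points:
        weights[nearest(p, candidates)[0]] += 1

    # Step 4: reduce the candidates to k centers with weighted k-means++.
    centers = [rng.choices(candidates, weights=weights, k=1)[0]]
    while len(centers) < k:
        scores = [w * nearest(c, centers)[1]
                  for c, w in zip(candidates, weights)]
        if sum(scores) == 0:
            break  # no remaining candidate is distinct from the centers
        centers.append(rng.choices(candidates, weights=scores, k=1)[0])
    return centers
```

The point of the oversampling rounds is that only a small, fixed number of passes over the data is needed, whereas a serial k-means++ seeding needs one pass per centroid.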

Issue/s resolved:

Changes proposed:

  • Completely new implementation of the centroid initialization used for k-means, k-medians, and k-medoids
  • Adjustment of classes (like KMeans) to match the new implementation

Type of change

  • Bug fix
  • New feature

Performance

  • Reduces the initialization runtime of the clustering algorithms by at least an order of magnitude, in both distributed and non-distributed mode and for both split=None and split not None; the exact speedup depends on the setting (e.g., data size and chosen parameters)

Does this change modify the behaviour of other functions? If so, which?

  • Yes: the classes KMeans, KMedoids, and KMedians, as well as the function where, are affected

Contributor

Thank you for the PR!

@Hakdag97 Hakdag97 force-pushed the features/1674-Optimization_of_k-means_initialization branch from 0889330 to 7f860c4 Compare January 6, 2025 10:38
@github-actions github-actions bot added cluster core features testing Implementation of tests, or test-related issues labels Jan 6, 2025
Contributor

github-actions bot commented Jan 6, 2025

Thank you for the PR!


codecov bot commented Jan 6, 2025

Codecov Report

Attention: Patch coverage is 97.82609% with 1 line in your changes missing coverage. Please review.

Project coverage is 92.45%. Comparing base (87f2812) to head (7f860c4).

Files with missing lines | Patch % | Lines
heat/cluster/_kcluster.py | 97.05% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1754      +/-   ##
==========================================
+ Coverage   92.26%   92.45%   +0.18%     
==========================================
  Files          84       84              
  Lines       12445    12438       -7     
==========================================
+ Hits        11482    11499      +17     
+ Misses        963      939      -24     
Flag | Coverage Δ
unit | 92.45% <97.82%> (+0.18%) ⬆️


@Hakdag97 Hakdag97 requested a review from mrfh92 January 6, 2025 11:59
@JuanPedroGHM JuanPedroGHM changed the title Features/1674 optimization of k means initialization Optimization of k means initialization Jan 13, 2025
@JuanPedroGHM JuanPedroGHM self-requested a review January 13, 2025 09:20
@ClaudiaComito ClaudiaComito added this to the 1.6 milestone Jan 13, 2025
@ClaudiaComito ClaudiaComito changed the title Optimization of k means initialization Parallel initialisation for k-means Jan 13, 2025
@Hakdag97 Hakdag97 changed the title Parallel initialisation for k-means Parallel initialization for k-means Jan 13, 2025
Labels
benchmark PR cluster core features PR talk testing Implementation of tests, or test-related issues

3 participants