K-means clustering

Sharing a simple implementation of a k-means clustering algorithm in Snap!, based on this cool article written by Chris Piech and also a very cool IBM article.

k-means clustering is an algorithm that allows you to group points into k clusters. It's cool because it tells you which points are similar (each cluster ends up being the set of points inside its centroid's Voronoi cell), and it is just a very fun algorithm to play around with.

This is more of a proof of concept than anything, just a fun idea I had a few days ago and wanted to try and write. According to project notes:

turn on turbo mode to make the project run faster. generally it runs 2x faster in turbo mode. turbo mode eliminates drawing the results of each iteration to allow for faster speeds.

Please run this project in the editor, not in presentation mode or on the community site! Also, if you are interested in seeing how fast the project runs, there is a testing block in the project that lets you run the algorithm lots of times and see how fast it goes (plus the average time, the geometric mean time, and how many iterations per second per point you're getting).

If you increase the number of points, you will eventually get a "too many points" error, which occurs when a coordinate of a centroid goes to infinity. This is a result of the geometric mean block multiplying lots of values together, and when the running product gets too big, it just becomes infinity (according to Snap!/JS).

This is a problem I had! If the product of a list is defined as Product←{×⌿⍵}, the geometric mean can be defined as GeoMean←{(Product ⍵)*(÷≢⍵)}. Because calculating the geometric mean means multiplying so many values together, the product can quickly exceed the largest number JavaScript (and therefore Snap!) can represent. BIGNUMS doesn't fix this for me, so my current fix is to divide all of the coordinates of the points by 10 before geo-mean-ing them, and then multiply the result by 10 afterwards.
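To make the failure concrete, here's a quick Python sketch (not the Snap! code, just an illustration — the coordinate values and point count are made up) of both the overflow and the divide-by-10 workaround:

```python
import math

# 200 made-up points whose coordinate is around 120 (stage-ish scale).
xs = [120.0] * 200

# Multiplying them all together overflows to infinity...
print(math.prod(xs))  # inf

# ...but dividing by 10 first keeps the running product representable,
# and scaling the geometric mean back up by 10 recovers the answer.
scaled = [x / 10 for x in xs]
geomean = math.prod(scaled) ** (1 / len(scaled)) * 10
print(geomean)  # ~120.0
```

The workaround only buys a constant factor of headroom, which is why it eventually breaks again as the point count grows.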

Tangent on APL & geometric roots

Yes, I did learn a bit of APL just to model this. TryAPL.com is a lifesaver... it's something like that, I don't quite remember the link. I was originally struggling because my geometric mean wouldn't work for my randomized dataset! But then I found out that a fractional root of a negative number is in fact a complex number, so I just had to constrain my datapoints to have only positive coordinates. That's the original reason why only a quarter of the stage is used, but I just ended up thinking it looks cool and not bothering to change it.
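For anyone who hasn't hit this before, the same gotcha shows up in Python: a fractional power of a negative number comes back as the principal complex root, not the real root you might expect.

```python
# (-8) ** (1/3): instead of -2, Python returns the principal complex
# cube root, roughly 1 + 1.732j.
z = (-8) ** (1 / 3)
print(z)
print(isinstance(z, complex))  # True
```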

in theory this project can be extended to [group] n-dimensional datapoints [into k groups] by changing both the method by which data is displayed and by changing the dimensions of the random point generator.

This is true!! If I extend the (n random points) block to return [x, y, z], then the algorithm will still work (I think? I haven't tested it), and do grouping in 3 dimensions. I can't easily display that on the Snap! stage, though, so I didn't.
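A sanity-check sketch of that idea in Python (hypothetical — the actual Snap! blocks aren't reproduced here): the only dimension-dependent parts are the random point generator and the distance measure, and Euclidean distance generalizes to any number of coordinates unchanged.

```python
import math
import random

rng = random.Random(1)

# A random-point generator parameterized by dimension; returning
# (x, y, z) instead of (x, y) is the whole change. The 0-240 range is
# a stand-in for the quarter of the Snap! stage the project uses.
def random_point(dims, lo=0.0, hi=240.0):
    return tuple(rng.uniform(lo, hi) for _ in range(dims))

p, q = random_point(3), random_point(3)
print(math.dist(p, q))  # Euclidean distance works unchanged in 3-D
```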

So that's what I've been working on recently. I have had lots of fun!

Can't you instead compute Product (⍵*(÷≢⍵))? (I know, the parentheses I added aren't necessary, but I added them to make it more obvious what I changed.) Then the numbers you're multiplying are very small, so it's okay that there are a lot of them.

Oh, I hadn't even thought of that. That works much, much better: 15,000 points in 74 seconds isn't too bad!
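In Python terms (a sketch of the idea, not the actual Snap! block), the rewrite works because the n-th root distributes over the product, so each factor stays close to 1 and the running product never blows up:

```python
import math

def geomean_stable(xs):
    # (x1 * x2 * ... * xn) ** (1/n) == x1**(1/n) * x2**(1/n) * ... * xn**(1/n)
    # Taking each root first keeps every factor small.
    n = len(xs)
    return math.prod(x ** (1 / n) for x in xs)

# 15,000 values that would overflow if multiplied together directly:
print(geomean_stable([120.0] * 15000))  # ~120.0
```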

woah, haven't seen you in a while

I hate this font, it makes ≢ look like ≡

?

my computer's font

that's just how the symbol looks

no, it's supposed to look like this \not\equiv

i think they're separate symbols

huh?

I don't think they're supposed to be separate. Here's a comparison (with Arimo and the weird Roboto that ChromeOS uses):

[Image: Screenshot 2025-02-07 10.22.55]

And the Roboto that the actual text system uses:

[Image: Screenshot 2025-02-07 10.23.47]

I think it might be the "combining" character having a problem.

shows for me correctly...

Mine doesn't

[Image: from the original post, using the preformatted text font]

[Image: coder_07, with the standard font]


(Wikipedia on the tribar symbol)

Anyways, let's get back on track about the actual project.

Hmm... It took only 49.7 seconds for me to do 15,000 points with 3 centroids. With turbo mode, it only took 34.1 seconds!

I don't know what is going on with the speed here. With 1500 points, turbo mode increased speed by 20%. With ten times as many points (15000), turbo mode increased speed by just under 50%. With 20000 points, turbo mode only increased speed by 15%. And with 50000 points, it took 187 seconds in turbo mode, and I was too scared to try without turbo mode.

I don't really know what's supposed to be happening.

From Chris Piech's article:

Say you are given a data set where each observed example has a set of features, but has no labels. Labels are an essential ingredient to [other data analysis algorithms]. What can we do?

One of the most straightforward tasks we can perform on a data set without labels is to find groups of data in our dataset which are similar to one another -- what we call clusters. K-Means is one of the most popular "clustering" algorithms. K-means stores k centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than any other centroid.

K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.
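Those two alternating steps can be sketched in a few lines of Python (a textbook version with arithmetic-mean updates, not the Snap! project's geometric-mean variant):

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Alternate (1) assignment and (2) centroid update, as described above."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # (1) Assign each point to the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # (2) Move each centroid to the mean of the points assigned to it.
        for i, cluster in enumerate(clusters):
            if cluster:  # leave a centroid alone if its cluster is empty
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centroids, clusters

# Two well-separated blobs of points should end up in two clusters.
points = [(float(x), float(y)) for x in range(5) for y in range(5)]
points += [(float(x) + 50, float(y) + 50) for x in range(5) for y in range(5)]
centroids, clusters = kmeans(points, 2)
```

Since `math.dist` takes points of any dimension, the same sketch also covers the n-dimensional extension discussed earlier in the thread.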