Sharing a simple implementation of a k-means clustering algorithm in Snap!, based on this cool article written by chris piech and also a very cool ibm article.
k-means clustering is an algorithm that allows you to group points into k clusters. It's cool because it tells you which points are similar (something something voronoi cell something) and it is just a very fun algorithm to play around with.
This is more of a proof of concept than anything, just a fun idea I had a few days ago and wanted to try and write. According to project notes:
turn on turbo mode to make the project run faster. generally it runs 2x faster in turbo mode. turbo mode eliminates drawing the results of each iteration to allow for faster speeds.
Please run this project in the editor and not in presentation mode or on the community site!! + also if you are interested in seeing how fast the project runs, there is a testing block in the project that allows you to run the algorithm lots of times and see how fast it goes (plus see the average time, the geometric mean time, and also how many iterations per second per point you're getting).
if you increase the points, you will eventually get a "too many points" error, which occurs when a coordinate of the centroid goes to infinity. this is a result of the geometric mean block multiplying large amounts of values, which when they get too big... just become infinity (according to snap/js).
This is a problem I had! If the product of a list can be defined as Product←{×⌿⍵}
, the geometric mean can be defined as GeoMean←{(Product ⍵)*(÷≢⍵)}
. Because to calculate the geometric mean you are multiplying so many points together, this can quickly exceed the large number limit of JavaScript (and therefore Snap!). BIGNUMS doesn't fix this, for me, so my current fix is to just divide all of the coordinates of the points by 10 before geo-mean-ing them, and then afterwards multiplying them by 10.
Tangent on APL & geometric roots
Yes, I did learn a bit of APL just to model this. TryAPL.com is a lifesaver... it's something like that, I don't quite remember the link. I originally was struggling because my geometric wouldn't work for my randomized dataset! But then I found out that a decimal root of a negative number is in fact a complex number, and so I just had to constrain my datapoints to have only positive coordinates. That's the original reason why only a quarter of the stage is used, but I just ended up thinking it looks cool and not bothering to change it.
in theory this project can be extended to [group] n-dimensional datapoints [into k groups] by changing both the method by which data is displayed and by changing the dimensions of the random point generator.
This is true!! If I extend the (n random points) block to return [x, y, z], then the algorithm will still work (I think? I haven't tested it), and do grouping in 3 dimensions. I can't easily display that on the Snap! stage, though, so I didn't.
So that's what I've been working on recently. I have had lots of fun!