List library function "remove duplicates from" on a list of lists (possible bug?)

sm2art · September 19, 2022, 10:20am

It seems to me that the "remove duplicates from" library function only works on a flat list. If the input is a list of lists, it does not remove the duplicates.

Let's see an example:

Under the hood, it calls dta_analyze, which counts the occurrences of each element of the list. In this case the list has two identical elements, both list[1], but they are not being identified as equal.
[image: remove duplicates from bug under the hood]
I was expecting the result to have only one row, with 2 in the second column:
list[1] 2
instead of:
list[1] 1
list[1] 1
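
To make the two outcomes concrete, here is a rough sketch in Python rather than Snap!. It is not how dta_analyze is actually implemented; `count_occurrences` and its `same` parameter are names made up for illustration, and the only assumption is that some equality test decides whether two elements land in the same row.

```python
import operator

def count_occurrences(items, same):
    """Return [element, count] rows, grouping elements for which same(a, b) is true."""
    rows = []
    for item in items:
        for row in rows:
            if same(row[0], item):
                row[1] += 1
                break
        else:
            # no existing row matched, so start a new one
            rows.append([item, 1])
    return rows

data = [[1], [1]]  # two separate lists with the same contents

# identity test: the two sublists are different objects, so two rows
print(count_occurrences(data, operator.is_))  # [[[1], 1], [[1], 1]]

# value test: the two sublists have equal contents, so one row
print(count_occurrences(data, operator.eq))   # [[[1], 2]]
```

The second call gives the single row I was expecting; the first matches what I am seeing.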

A flat list poses no problem:
[image: remove duplicates from bug under the hood 2]

I am wondering what the best way is to extend the original functionality to lists of lists. I am assuming that a predicate is fed into dta_analyze to decide whether two elements are the same, and that the occurrences are counted on that basis.

Thank you for your help in advance.

sarpnt · September 19, 2022, 12:16pm

this happens because lists in snap are by reference.

lists aren't compared by having the same contents, they're compared by having the data stored in the same place. in the cases where data would be stored in the same place, this means that changing values of one of the lists would change the other since they're really just the same list.

[image: remove duplicates]

< is identical to ?> does the reference comparison, but every other non-library block:

< = >
<[list] contains ?>
(index of in [list])
compares by value, so i'm not sure whether the behavior is intentional or not.
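
for anyone who finds text easier to follow, here's roughly the same distinction in python (not snap, just an analogy: python's `is` plays the role of < is identical to ?> and `==` plays the role of < = >):

```python
a = [1]
b = [1]   # same contents as a, but a different list
c = a     # not a copy: c and a are the same list

print(a == b)  # True, compared by value
print(a is b)  # False, stored in different places
print(a is c)  # True, the same list under two names

# "contains" and "index of" style lookups in python also go by value
print(b in [a])        # True
print([a].index(b))    # 0

# changing the list through one name is visible through the other
c.append(2)
print(a)  # [1, 2]
print(b)  # [1]
```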

anyways, the first solution i thought of was just a block that replaces lists with identical ones

sm2art · September 19, 2022, 12:55pm

Thank you @sarpnt for your clear illustration and your code to deal with the issue.

As you pointed out, there are some discrepancies between the various blocks in how they treat "=". For example, keep also behaves the way I had expected:

For my purposes, I will use this to identify duplicates rather than relying on the remove duplicates function for now. Maybe we could have a switch somewhere in the library function to select the desired behavior...

sm2art · September 19, 2022, 2:05pm

I created my own remove duplicates, which uses my own my_list_analyzer, which in turn uses my own split function :-)

I believe all the helper function(s) should be included in the script pic, am I right?

Thanks.

sarpnt · September 19, 2022, 2:13pm

that isn't to do with the keep block, it's the < = > block you put in it. if you were using < is identical to ?> it would return an empty list.
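
a rough python analogy (the `keep` helper and the sample data are made up for illustration, this isn't the actual script from your picture):

```python
def keep(predicate, items):
    """Return the items for which predicate(item) is true."""
    return [item for item in items if predicate(item)]

data = [[1], [2], [1]]
target = [1]  # a fresh list, not one of the lists stored in data

# value comparison, like < = >: both [1] sublists match
by_value = keep(lambda item: item == target, data)
print(by_value)     # [[1], [1]]

# reference comparison, like < is identical to ?>: nothing matches
by_identity = keep(lambda item: item is target, data)
print(by_identity)  # []
```

the filtering is the same both times, only the predicate changes.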

sm2art · September 19, 2022, 2:15pm

Got it, it has to do with the predicate we feed into, thanks @sarpnt !

sm2art · September 19, 2022, 2:32pm

Just re-read your code, @sarpnt. Your solution is great in that we can still use the original function. The process is to build a copy of the original list in which every sublist is replaced by the first instance found with the same contents. That works because, as you pointed out, "index of" also compares by value rather than by reference...
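
To check my understanding, here is the idea sketched in Python. This is my reading of the approach, not @sarpnt's actual block; `canonicalize` and `remove_duplicates_by_identity` are hypothetical names, and the second function only stands in for a remove duplicates that compares by reference.

```python
def canonicalize(items):
    """Return a copy in which each element is replaced by the first
    element of items that is equal to it by value."""
    result = []
    for item in items:
        # list.index compares with ==, so it finds the first equal element
        first = items[items.index(item)]
        result.append(first)
    return result

def remove_duplicates_by_identity(items):
    """Stand-in for a remove duplicates that only recognizes the same object."""
    seen = []
    for item in items:
        if not any(item is kept for kept in seen):
            seen.append(item)
    return seen

data = [[1], [2], [1]]

# equal contents but different objects, so nothing is removed
print(remove_duplicates_by_identity(data))                # [[1], [2], [1]]

# after canonicalizing, equal sublists are the very same object
print(remove_duplicates_by_identity(canonicalize(data)))  # [[1], [2]]
```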

bh · September 20, 2022, 1:36am

Hmm. The original implementation of REMOVE DUPLICATES would have done what you expected. When Jens invented dta_analyze, for the purpose of making histograms from spreadsheets, he realized that it would also speed up looking for duplicates. I'm not sure if changing the behavior for lists of lists was intentional. One more thing to discuss for the next release...

sm2art · September 20, 2022, 9:47am

Thanks @bh for sharing a little background here. I think it would be nice to have the consistent behavior.

A list[1] by itself, without being assigned to a variable, seems to be treated as immutable by Snap!. For example, if you try to do:

Nothing would happen. However, you can do this:
[image: remove duplicates from bug under the hood 6]
a then correctly shows as list[1, 3]. Somehow assigning list[1] to a variable made it mutable.

Therefore, these "immutable" (or constant) lists would be sound candidates for comparison by value.

I am quite sure Jens had his reasons, but I would be interested in knowing why he decided to switch the behavior. I was reading an article about how JavaScript's map function can have undesired side effects if the callback changes the original list. Could this be the reason why Jens is using the is_identical function? As we work more on the functional programming side, I think it would be better to make sure that we write our callback functions without those side effects instead...

bh · September 20, 2022, 10:10am

I'm not sure what you expect your example (add 2 to (list 1)) to do. It's not that the result of (list 1) is immutable; it's that you don't have a pointer to it, so it's gone, whether or not you ADD something to it. I hope you don't want it to be the case that from that point on in your program, whenever you do (list 1) you actually get the list {1 2}!

There's nothing special about putting it in a variable. You could put it in a list, and then say (add 2 to (item 1 of ...)). But somehow or other you need a pointer to it.
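
Here is the same point as a Python analogy (not Snap!; the variable names are just for illustration):

```python
# The list is created, changed, and then nothing refers to it any more.
[1].append(2)

# Keeping a reference in a variable lets you see the change afterwards.
a = [1]
a.append(3)
print(a)          # [1, 3]

# A variable is not required; a slot in another list works just as well.
container = [[1]]
container[0].append(2)
print(container)  # [[1, 2]]

# Writing the literal again builds a brand-new list, unaffected by the above.
fresh = [1]
print(fresh)      # [1]
```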

But I don't think all this has anything to do with the question of = comparison vs. is identical to comparison. Those depend only on the values, not on what points to them.

sm2art · September 20, 2022, 11:21am

Got it, thanks @bh for your clarification, that makes sense. Since I did not save a pointer to the list, I cannot retrieve it after adding to it. It's lost in the "snap" sea...

18001767679 · October 4, 2022, 1:14am

[image: untitled script pic (16)]