How to create a new data type: set?

Why are you trying to implement sets, instead of just implementing the features to make lists able to function like sets.
You could have some blocks like these mocked up for it:

(And yes, I wasn't bothered with 4 script pics)

If I you explain what you want from sets that this can't achieve, then maybe I could help you implement sets. I think if you implement the above blocks and maybe some more, then the distinction between sets and lists wouldn't really matter.

One of the standard answers to this sort of question is to prepend type tags. Conceptually you'd say untitled script pic (5) but that leaves open the possibility of a non-set list just happening to start with *set*, so instead the usual thing is to make a list
untitled script pic (6)
and then use
untitled script pic (7)
to test for a set, because no other list created in the program can be EQ? to this one. (This is one of the very non-Lispy properties of Snap!; in Lisp (including Scheme) there is only one empty list, and if you try to make another one in your program you just get a pointer to that one. In Lisp you'd make a nonempty list as the type tag:
untitled script pic (8)

Here is my implementation.

Time to debug

I think it's just the PROPER ones that are problematic.

I fixed them.


Thanks for the hint! I'm sure it will work (even though there is probably a way around this too - but that would be on purpose; so let's not worry about that).

Here's what I made of it - even this basic code to properly create a set is complicated and probably needs to be optimized, but it's a start.

Creating a set tag:
untitled script pic (64)

Testing for set-hood:
untitled script pic (69)

Creating a new set:
untitled script pic (65)

A "set" variety of CONTAINS:
set basics script pic (1)

A "set" variety of "=":
set basics script pic

The final two blocks are co-recursive by the way.

Proof of the pudding:
set basics script pic (3)
,,, reports TRUE (after 10 ms processing time, which IMO is a lot).

I have understood that for (r)b-trees to work data must be sorted. Since a set member could be any data type, and a set could even contain mutiple data types, I have been looking for a universal sorting operator. I haven't found it yet within Snap! E.g. set basics script pic (4) appears to work for most datatypes, but not for lists (unless deep-sorted), sounds (I think), and only partially for commands, reporters and predicates. If you know how to sort all or any of these, I'd be glad to know.

  • E.g. set basics script pic (6) reports TRUE, whereas < reports FALSE. So far so good.

  • But! e.g. both
    set basics script pic (8), set basics script pic (9) and set basics script pic (10)
    report FALSE, just like IDENTICAL TO, whereas ≠, ≤ and ≥ report TRUE (which tells me that ≠ must have been coded as NOT (=), ≤ as NOT (>), ...) :slight_smile:
    Sorting will therefore produce inconsistent results.

BTW, if you (or anyone else) happen to have well-built Snap! blocks for (r)b-trees, I'd be interested in that, too.

This probably means that the code is converting back and forth between linked list form and dynamic array form. Sigh.

Possibly, yes. Like I wrote, it needs some optimizing. For now though I’m happy it works at all.

The thing is, according to my theory about lists, we should never convert back and forth. If you use mutation commands, we should (invisibly) convert to array once if necessary; if you use IN FRONT OF and ALL BUT FIRST OF, we should (also invisibly) convert to linked list once if necessary. We did it this way because I figured users would be consistent in how they use any particular list, even if they don't have a consistent style overall. I mean, that should just work out automatically/naturally, not because you make a special effort to be consistent. So if that isn't happening for you, it's evidence against my theory, not a complaint about your code.

Perhaps in a future release list blocks could be labeled with small and not-too-conspicuous markers (like the flash-symbol for speed-optimized blocks) signifying whether they go well with linked lists, or arrays, or both.

Hmm. The whole reason for this strange hybrid data structure is so that users shouldn't know anything about it. And I think that's actually the case for most of our users; as Jens says, the forum gang are totally atypical in their need to look under the hood of everything. If we're going to expose the hybridness to users, we might as well have two separate data types, arrays and pairs. Maybe we should do that.

The reason we didn't in the first place is that there are certain common algorithms that are O(n²) time for arrays but O(n) time for lists, and what's worse is that they tend to be the functional ones, so users would be pushed into doing things imperatively, just what we don't want them to learn.

Of course, for small n, the aymptotic behavior is hidden by constant factors, such as the fact that arrays are provided and optimized by Javascript, while linked lists aren't. My original guess was that the cutover point would be somewhere around n=100, and for such small n it doesn't matter how slow your algorithm is. It turns out that my guess wasn't even close, and for n=10,000 most things are still faster with JS arrays. So, first, I should get off my butt and do serious timing tests, and second, based on the results we should rethink list storage. Maybe the right thing is to flush linked lists altogether, and let data structures students build them in Snap!, out of arrays of length 2, the way everyone else does. Or maybe the right thing would be to look at the list size and keep everything below n items, whatever n turns out to be, as arrays. But if n=10⁶ then the cutover point would be where things stop working anyway because the lists don't fit in browser memory (especially because linked lists take more space than arrays).

Or maybe instead of linked lists we should provide B-trees or something, including the ability to moved non-recently-used nodes to temporary storage on disk. But that would require the user to provide an ordering function for every large list, not just when explicitly sorting them. More generally, we could build several underlying storage disciplines, depending on things such as how densely the list is populated.

(And yes, I did notice "small and not-too-conspicuous," but it seems to me that we either do or don't tell users what they mean; there's no real middle ground.)

There are? Which ones, for example?

O, but there is. An example: you maintain a reference manual; it’s actually read by merely a fraction of Snap! users, so most of what you write there will not reach, or bother, the average user. The “forum gang”) discuss all sorts of under-the-hood topics - consider us beta testers - and though anyone could read all about it most users choose not to. Among communication professionals a well-known phrase is: “If they haven’t heard it, you haven’t said it.” … so, plenty of middle ground, IMO.

Is there a way to know the present underlying representation of a list (linked list / array)? Or to force it, with 100% certainty, to be one or the other? I’d be happy to run some user tests.

Revisiting an earlier post:

I found this to work prefectly well if “set-tag” is initialized before any set is created. Still, it requires a conscious act (setting the set-tag variable to be list ()), and if by any chance one initializes “set-tag” again and subsequently uses a set that was created earlier, the latter is not recognized as a set. So I wonder if it wouldn’t be better to choose a very specific fixed value, like: Programming tools SANDBOX script pic 53.

Anything that traverses the list using ALL BUT FIRST OF in a recursive call, or constructs a new list using IN FRONT OF, typically in a recursive procedure. For lists, those primitives are constant time, so their use for each item of the list is linear time. But for arrays, those primitives require copying the entire list so far, which is linear time, so the entire traversal is quadratic time.

Mostly you could replace such recursive procedures with iterative ones, except for ones that do branched recursion (several recursive calls at each level, not just one), but not if your goal in the last 25% of your course is to teach recursion in depth.

I suppose. But you're talking about a change to Snap! itself; if that isn't documented, every student will ask about it. (Half of them, the ones who don't read the manual, will ask even if you do.) What's more, in order to know which version to use, they have to have their noses rubbed in the difference; you can't make that difference hard to see.

If the list is created by IN FRONT OF, it's linked. If the list is created by ADD ot INSERT commands, it's an array. (The LIST constructor with more than zero inputs creates an array. (The empty array IS IDENTICAL TO the empty list.) If you're consistent about it, you know which it is.

To make things more interesting, if you create an array and then extend it with IN FRONT OF, you end up with a hybrid list, with the new items linked but the original ones still an array. This is historically called a cdr-coded list. So don't do that. (Exception: an empty list is the same as an empty array, so if you call (LIST) with no inputs and then extend it with IN FRONT OF, what you have is a pure list.

(A property of Snap! that's incorrect when we're viewed as a Lisp is that in Lisp there's only one empty list; if you call (LIST) twice the two results are EQ (what we call IDENTICAL). I don't think this matters, really, because in Snap! people hardly ever use IDENTICAL on lists, and the people who do know how to deal with it. But this is why you'll sometimes hear me say "the empty list" instead of "an empty list.")

NUMBERS FROM creates an array. If you want a linked list of numbers, you can either avoid using NUMBERS FROM or listify the result with

I don't understand what you mean here. If you save that procedure in a variable to use each time, it has the same vulnerability to redefinition as SET-TAG does. If you reconstruct it each time, I don't think the resulting procedures are EQ.

By contrast, the EQ-ness of lists is well-understood; every call to IN FRONT OF generates a guaranteed unique value.

But whatever floats your boat.

You are right, I had made an assumption without thorough testing.

Now about the processing times of arrays vs. linked lists: I ran some tests on my system for lists of 10,000 items; it turns out that for many operations arrays are about the same speed as linked lists (e.g. KEEP, MAP, SORT), whereas they are considerably faster for e.g. creating a list item by item (iterative ADD vs. recursive IN FRONT OF), LENGTH, and REVERSE. With COMBINE the difference was almost unbelievable: arrays are over 100 times faster than linked lists, I can’t think of a good explanation.


  • The differences in processing times for COMBINE and LENGTH get even worse for lists of 100,000 items. And I haven’t detected any difference to the advantage of linked lists yet.
  • APPEND is also very much slower for linked lists. (Btw. I had to take APPEND out of the test set eventually because of a stack size issue)
  • I wasn’t able to measure accurate processing times for ADD, INSERT etc., because I think after each single operation the list is apparently converted to an array anyway.
  • I also tried a cdr-coded list (5,000 recursive IN FRONT OF’s an array [5,000]). Processing times were roughly the same as those for a linked list, so I don’t think a warning would be necessary.

Thanks for the test, which I'll look at after the meeting I'm about to have!