How to create a new data type: set?

I’m attempting to create a “new” (i.e. not yet implemented in Snap!) data type: set. In many respects a set is like a list, so I started out implementing it as a list, with type-specific functions. Soon I ran unto trouble though; let me explain.

Unlike lists, sets are identical if their members are the same, irrespective of their order within the set. E.g. {1, 2} = {2, 1} - if I tell my set comparator that both inputs are sets, this can easily be dealt with. But what if the sets have subsets? E.g. {1, {1, 2}} = {1, {2, 1}} … but if sets are implemented as lists, there is no way for the code to discern {1, {1, 2}} from {1, (1, 2)} - the second structure being a set containing a list; and since order does matter where lists are concerned, (1, 2) ≠ (2, 1); consequently {1, (1, 2)} ≠ {1, (2, 1)).

One way to deal with this is writing a “deep” set comparator, that will assume all lists contained by sets are themselves sets. This excludes hybrid sets, containing both subsets and lists. Even if in most practical cases such hybrids will be exceptional, it doesn’t feel right to knowingly build an inconsistency into the code from the outset. So I’m looking for a different solution.

A second way to deal with it is along the “OOP with procedures” route (Ref.Manual, ch. viii). A variable referencing a set will thus be typed as a “command” by the Programming tools SANDBOX script pic 52 predicate, so it’s easily discernible from a list. But how can it be discerned from other “commands” that are members of a set, or any of its subsets?

Or perhaps someone has a better suggestion for creating a new data type?

do it the c++ way,implement rb trees

Why are you trying to implement sets, instead of just implementing the features to make lists able to function like sets.
You could have some blocks like these mocked up for it:


(And yes, I wasn't bothered with 4 script pics)

If I you explain what you want from sets that this can't achieve, then maybe I could help you implement sets. I think if you implement the above blocks and maybe some more, then the distinction between sets and lists wouldn't really matter.

One of the standard answers to this sort of question is to prepend type tags. Conceptually you'd say untitled script pic (5) but that leaves open the possibility of a non-set list just happening to start with *set*, so instead the usual thing is to make a list
untitled script pic (6)
and then use
untitled script pic (7)
to test for a set, because no other list created in the program can be EQ? to this one. (This is one of the very non-Lispy properties of Snap!; in Lisp (including Scheme) there is only one empty list, and if you try to make another one in your program you just get a pointer to that one. In Lisp you'd make a nonempty list as the type tag:
untitled script pic (8)
)

Here is my implementation.

Time to debug

I think it's just the PROPER ones that are problematic.

I fixed them.

Cool!

Thanks for the hint! I'm sure it will work (even though there is probably a way around this too - but that would be on purpose; so let's not worry about that).

Here's what I made of it - even this basic code to properly create a set is complicated and probably needs to be optimized, but it's a start.

Creating a set tag:
untitled script pic (64)

Testing for set-hood:
untitled script pic (69)

Creating a new set:
untitled script pic (65)

A "set" variety of CONTAINS:
set basics script pic (1)

A "set" variety of "=":
set basics script pic

The final two blocks are co-recursive by the way.

Proof of the pudding:
set basics script pic (3)
,,, reports TRUE (after 10 ms processing time, which IMO is a lot).

I have understood that for (r)b-trees to work data must be sorted. Since a set member could be any data type, and a set could even contain mutiple data types, I have been looking for a universal sorting operator. I haven't found it yet within Snap! E.g. set basics script pic (4) appears to work for most datatypes, but not for lists (unless deep-sorted), sounds (I think), and only partially for commands, reporters and predicates. If you know how to sort all or any of these, I'd be glad to know.

  • E.g. set basics script pic (6) reports TRUE, whereas < reports FALSE. So far so good.

  • But! e.g. both
    set basics script pic (8), set basics script pic (9) and set basics script pic (10)
    report FALSE, just like IDENTICAL TO, whereas ≠, ≤ and ≥ report TRUE (which tells me that ≠ must have been coded as NOT (=), ≤ as NOT (>), ...) :slight_smile:
    Sorting will therefore produce inconsistent results.

BTW, if you (or anyone else) happen to have well-built Snap! blocks for (r)b-trees, I'd be interested in that, too.

This probably means that the code is converting back and forth between linked list form and dynamic array form. Sigh.

Possibly, yes. Like I wrote, it needs some optimizing. For now though I’m happy it works at all.

The thing is, according to my theory about lists, we should never convert back and forth. If you use mutation commands, we should (invisibly) convert to array once if necessary; if you use IN FRONT OF and ALL BUT FIRST OF, we should (also invisibly) convert to linked list once if necessary. We did it this way because I figured users would be consistent in how they use any particular list, even if they don't have a consistent style overall. I mean, that should just work out automatically/naturally, not because you make a special effort to be consistent. So if that isn't happening for you, it's evidence against my theory, not a complaint about your code.

Perhaps in a future release list blocks could be labeled with small and not-too-conspicuous markers (like the flash-symbol for speed-optimized blocks) signifying whether they go well with linked lists, or arrays, or both.

Hmm. The whole reason for this strange hybrid data structure is so that users shouldn't know anything about it. And I think that's actually the case for most of our users; as Jens says, the forum gang are totally atypical in their need to look under the hood of everything. If we're going to expose the hybridness to users, we might as well have two separate data types, arrays and pairs. Maybe we should do that.

The reason we didn't in the first place is that there are certain common algorithms that are O(n²) time for arrays but O(n) time for lists, and what's worse is that they tend to be the functional ones, so users would be pushed into doing things imperatively, just what we don't want them to learn.

Of course, for small n, the aymptotic behavior is hidden by constant factors, such as the fact that arrays are provided and optimized by Javascript, while linked lists aren't. My original guess was that the cutover point would be somewhere around n=100, and for such small n it doesn't matter how slow your algorithm is. It turns out that my guess wasn't even close, and for n=10,000 most things are still faster with JS arrays. So, first, I should get off my butt and do serious timing tests, and second, based on the results we should rethink list storage. Maybe the right thing is to flush linked lists altogether, and let data structures students build them in Snap!, out of arrays of length 2, the way everyone else does. Or maybe the right thing would be to look at the list size and keep everything below n items, whatever n turns out to be, as arrays. But if n=10⁶ then the cutover point would be where things stop working anyway because the lists don't fit in browser memory (especially because linked lists take more space than arrays).

Or maybe instead of linked lists we should provide B-trees or something, including the ability to moved non-recently-used nodes to temporary storage on disk. But that would require the user to provide an ordering function for every large list, not just when explicitly sorting them. More generally, we could build several underlying storage disciplines, depending on things such as how densely the list is populated.

(And yes, I did notice "small and not-too-conspicuous," but it seems to me that we either do or don't tell users what they mean; there's no real middle ground.)

There are? Which ones, for example?

O, but there is. An example: you maintain a reference manual; it’s actually read by merely a fraction of Snap! users, so most of what you write there will not reach, or bother, the average user. The “forum gang”) discuss all sorts of under-the-hood topics - consider us beta testers - and though anyone could read all about it most users choose not to. Among communication professionals a well-known phrase is: “If they haven’t heard it, you haven’t said it.” … so, plenty of middle ground, IMO.

Is there a way to know the present underlying representation of a list (linked list / array)? Or to force it, with 100% certainty, to be one or the other? I’d be happy to run some user tests.

Revisiting an earlier post:

I found this to work prefectly well if “set-tag” is initialized before any set is created. Still, it requires a conscious act (setting the set-tag variable to be list ()), and if by any chance one initializes “set-tag” again and subsequently uses a set that was created earlier, the latter is not recognized as a set. So I wonder if it wouldn’t be better to choose a very specific fixed value, like: Programming tools SANDBOX script pic 53.

Anything that traverses the list using ALL BUT FIRST OF in a recursive call, or constructs a new list using IN FRONT OF, typically in a recursive procedure. For lists, those primitives are constant time, so their use for each item of the list is linear time. But for arrays, those primitives require copying the entire list so far, which is linear time, so the entire traversal is quadratic time.

Mostly you could replace such recursive procedures with iterative ones, except for ones that do branched recursion (several recursive calls at each level, not just one), but not if your goal in the last 25% of your course is to teach recursion in depth.

I suppose. But you're talking about a change to Snap! itself; if that isn't documented, every student will ask about it. (Half of them, the ones who don't read the manual, will ask even if you do.) What's more, in order to know which version to use, they have to have their noses rubbed in the difference; you can't make that difference hard to see.

If the list is created by IN FRONT OF, it's linked. If the list is created by ADD ot INSERT commands, it's an array. (The LIST constructor with more than zero inputs creates an array. (The empty array IS IDENTICAL TO the empty list.) If you're consistent about it, you know which it is.

To make things more interesting, if you create an array and then extend it with IN FRONT OF, you end up with a hybrid list, with the new items linked but the original ones still an array. This is historically called a cdr-coded list. So don't do that. (Exception: an empty list is the same as an empty array, so if you call (LIST) with no inputs and then extend it with IN FRONT OF, what you have is a pure list.

(A property of Snap! that's incorrect when we're viewed as a Lisp is that in Lisp there's only one empty list; if you call (LIST) twice the two results are EQ (what we call IDENTICAL). I don't think this matters, really, because in Snap! people hardly ever use IDENTICAL on lists, and the people who do know how to deal with it. But this is why you'll sometimes hear me say "the empty list" instead of "an empty list.")

NUMBERS FROM creates an array. If you want a linked list of numbers, you can either avoid using NUMBERS FROM or listify the result with

I don't understand what you mean here. If you save that procedure in a variable to use each time, it has the same vulnerability to redefinition as SET-TAG does. If you reconstruct it each time, I don't think the resulting procedures are EQ.

By contrast, the EQ-ness of lists is well-understood; every call to IN FRONT OF generates a guaranteed unique value.

But whatever floats your boat.