How lists work

moving stuff here...

lists are weird in snap and they have weird rules and i still dont really get all of the rules and why they exist lol

Do you mean about linked vs array?

The Rules, simple version:

For any particular list, use the commands ADD/INSERT/REPLACE/DELETE or the reporters ALL BUT FIRST OF/IN FRONT OF, but not both.

"Why?" simple answer:

Because that way we can store each list internally in the way most efficient for its style (command-based or reporter-based). And also, if you follow that rule you don't have to worry about lists sharing items.


Okay, I have to eat dinner, I'll say more later...

Lesson 2. Memory organization.

The memory we're talking about is (fingers crossed behind back) the stuff they mean when they sell you a computer with so many GB of RAM (random access memory). "Random access" means that the time required to find a value in memory is constant for all locations in the memory.

What's a "location"? That's also slightly complicated. The smallest amount of memory you can read or write at once is a byte, eight bits. So each byte has its own memory address, from 0 up to however much memory you have. However, more commonly, you read or write a word, which, at this time (2020) is either 32 or 64 bits -- four or eight bytes. Words are used to hold numbers, more or less (we are not going into hardware representation of numbers today), and also to hold pointers, which are memory addresses. Since memory addresses are nonnegative integers -- unsigned integers as computer people say -- the most memory you can have on your computer is around 4Gb in a 32-bit computer, or, umm, 16 Pb, petabytes, do I have that right?, in a 64-bit computer. The main reason for moving from 32- to 64-bit words isn't that the number of integers in the world has increased, but rather that these days there are programs and/or data sets bigger than 4Gb.

Addressing of bytes is straightforward; any unsigned integer can be the address of a byte. Addressing of words is slightly more complicated: In order to make the interface hardware between the processor and the memory simpler (and therefore faster), word addresses are required to be multiples of four or eight, depending on the width (32-bit or 64-bit) of the computer.

Uncrossing my fingers slightly: Some of the possible addresses don't address RAM. They address various special things, such as the bootstrap ROM that holds the code that runs when you turn on or reset the computer, and input/output device registers.

Uncrossing most of the way: Once upon a time, your program asked for the byte or word at a certain address, and the hardware read the byte or word of RAM at that address. That's no longer true. The reason is that bigger memories are inherently slower than smaller memories. That's because the addressing circuitry in the memory interface has to examine log₂ N bits, where N is the number of bytes in the memory. So it's desirable to arrange things so that the memory actually addressed in hardware is smaller than what your program thinks it is. There are (a simplification but not a lie) three levels of memory. From small to large:

  1. Cache memory. May be inside the processor chip. These days, less than a GB, maybe a lot less. There are probably three or more levels of cache; the smallest and fastest may be less than 1 MB.
  2. Physical RAM. Several GB.
  3. Virtual memory. This is what your program sees, all 4 GB to 16 EB of it. At any moment, some parts of it will be in RAM; other parts may be on disk.

Having a small memory pretend to be a large memory is a good trick. What makes it possible is locality of reference: the fact that the next memory address your program will ask for is almost certainly pretty near, maybe very near, the last address it asked for. So each level of the memory hierarchy contains the memory locations in the vicinity of the ones that have been used recently, and almost all the time the next address you want will already be in faster memory. When it's not, the next memory access is much slower than usual.

Okay, that's almost the entire truth about memory, more than we need for our purposes. Hereafter, we pretend that a software address just points to a specific chunk of RAM, while silently giving thanks to locality of reference for making our computers run fast.

Lesson 3. Storage of data aggregates.

An aggregate is a bunch of data under one umbrella. :sn: lists are aggregates.

There are certain things you want to be able to do with an aggregate: Find an item, add an item, change an item, delete an item. Depending on your program's algorithm, you may want to find items in order, front to back; or you may want to jump around in the items.

So, in the beginning there was the array. An array is a contiguous chunk of memory. So if you have an array of numbers, each stored in a four-byte word, starting at address 4000, then the address of item N is 4000+(4*N). The multiplication by 4 is there because each word takes four bytes, so the second item is at 4004, not 4001.

But, oops, according to the formula, item 2 is at 4000+(4*2) = 4008, not 4004. And this is why programming languages that are all about what's where in memory index their arrays starting from zero, so their compilers don't have to use 3996+(4*N) as the formula. And that's why "item 2" is the third item in those languages. In languages that are more interested in making sense than in what's where in memory, of course, the first item of an array is called item 1.
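
If it helps to see that arithmetic as code, here's a little Python sketch using the made-up numbers from above (starting address 4000, four-byte words):

```
BASE = 4000     # made-up starting address of the array
WORD = 4        # bytes per item

def address_zero_based(n):              # C-style: the first item is item 0
    return BASE + WORD * n

def address_one_based(n):               # :sn:-style: the first item is item 1
    return (BASE - WORD) + WORD * n     # i.e. 3996 + 4*n

print(address_zero_based(0), address_one_based(1))   # 4000 4000
print(address_zero_based(2), address_one_based(2))   # 4008 4004
```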

So, how do arrays work out as far as speed is concerned? Finding any item is really fast, because you don't have to search for it; you just add (four times) the item number to the starting address of the array. Replacing an item with a new value is similarly fast. But adding and deleting items is slow. Deleting at the end is just deciding not to use the last item any more. But since the items have to be contiguous in memory, deleting anywhere else requires copying the elements past the deletee one word lower in memory. This takes an amount of time proportional to the number of items in the array. Adding an item is even worse, because only a fixed number of words were allocated for this array, and if you add more than that many items, you have to allocate a bigger block of memory, copy the entire array over there, insert the new item (having left an open slot for it when doing the copy), and make anything that points to the array point to the new one instead.
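
Here's a rough Python sketch of why middle-of-the-array deletion and insertion are slow; the point is the shifting loop, which touches every item past the one you're changing. (Python's own lists do all this for you behind the scenes; this just makes the copying visible.)

```
def delete_item(arr, i):                # i is a zero-based index here
    for j in range(i, len(arr) - 1):
        arr[j] = arr[j + 1]             # shift everything after i down by one slot
    arr.pop()                           # forget the now-duplicated last slot

def insert_item(arr, i, value):
    arr.append(arr[-1])                 # grow by one slot (this may itself copy!)
    for j in range(len(arr) - 2, i, -1):
        arr[j] = arr[j - 1]             # shift everything after i up by one slot
    arr[i] = value

cards = [10, 20, 30, 40]
delete_item(cards, 1)                   # [10, 30, 40]
insert_item(cards, 1, 99)               # [10, 99, 30, 40]
print(cards)
```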

By the way, in order to make this just-do-arithmetic approach work, to find any item in constant time, the items all have to be the same size. So in array-centric languages like C, when you allocate an array you have to say not only how many items it'll hold, but also what data type they all are. You can have an array of integers, or you can have an array of characters (bytes), but you can't have an array with some of each. If you want that, you have to make it an array of pointers, all of which are the same size, and have some way to know whether a particular item is a pointer to a byte or a pointer to a word. From this unfortunate restriction of arrays has grown a whole mythology about why it's bad to have a data aggregate with mixed item types.

So, along came Lisp, with applications in artificial intelligence in which it's rare to ask for item number 437 of an aggregate. Instead, you're much more likely to traverse the aggregate: look at the first item, then the second, then the third, and so on. But you'd like your aggregates to be able to get bigger and bigger without any recopying. So you invent the linked list, which is made of pairs.

A pair is an array of length two. Since it's a fixed size, there's no question of inserting or deleting items of the pair. Since all pairs are the same size, memory allocation is particularly simple. Instead of calling the two items "item 1" and "item 2," you call them "the car" and "the cdr" of the pair. (Look it up if you're wondering why.) The way we use pairs to implement lists is that we allocate a pair for each list item. The car of the pair points to (i.e., contains the address of) the list item. The cdr of the pair points to the next pair, unless this is the last pair, in which case it points to a special thing called "the empty list."

In early Lisp implementations it was common to use a value of 0 instead of a pointer to indicate the empty list. More recently it's become more common to see an actual location in memory counted as the empty list, so the last pair of a list will have an actual pointer to that location in its cdr. (Never mind why.) In :sn: there can be more than one empty list; a new one is made every time you ask for a list with no items. This is sort of wrong, but harmless, because we don't mutate lists that much.
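
To make pairs concrete, here's a minimal Python sketch of them; the names (Pair, cons, EMPTY) are mine, not anything from :sn:'s actual source:

```
class Pair:
    def __init__(self, car, cdr):
        self.car = car          # the list item (or a pointer to it)
        self.cdr = cdr          # the rest of the list

EMPTY = None                    # playing the role of "the empty list"

def cons(item, lst):            # allocate one new pair in front of lst
    return Pair(item, lst)

# The list (A B C D), built back to front:
abcd = cons('A', cons('B', cons('C', cons('D', EMPTY))))
print(abcd.car, abcd.cdr.car)   # A B
```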

So, how's the timing of linked list operations? Linked lists aren't random access; looking for item number N requires chasing the cdr of the cdr of the cdr... N-1 times. But traversing a list is fast; if you're looking at the part of the list starting with item N, a single cdr operation will get you to item N+1, regardless of how big N is. And it takes only constant time (fingers crossed) to add an item, if you add it at the front of the list; you just allocate one new pair, whose car is the new item and whose cdr is the previous head of the list. In :sn: the next-item operation is ALL BUT FIRST OF, and the add-an-item operation is IN FRONT OF. For linked lists, these take constant time.
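
In terms of that Pair sketch, the two reporters and the slow ITEM lookup look roughly like this:

```
def in_front_of(item, lst):     # constant time: allocate one new pair
    return cons(item, lst)

def all_but_first_of(lst):      # constant time: just follow one cdr
    return lst.cdr

def item_of(n, lst):            # linear time: chase n-1 cdrs, then take the car
    for _ in range(n - 1):
        lst = lst.cdr
    return lst.car

print(item_of(3, abcd))         # C
```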

Lesson 4. Aggregates in :sn:.

Some algorithms lend themselves to arrays, because the aggregate size doesn't change, and you have to jump around in it. The canonical example is to shuffle an array (e.g., a deck of cards) randomly. Here's how:


Notice that SHUFFLE is a command, and it uses the command REPLACE ITEM to manipulate the data. Arrays are best here because, e.g., a deck is 52 cards both before and after shuffling, and the REPLACE ITEM command takes constant time for an array.
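
Here's roughly the same idea as a Python sketch (the standard Fisher-Yates shuffle, which may or may not match the pictured block exactly); the important part is that it only ever replaces items, so the array never changes size:

```
import random

def shuffle(deck):                          # in place, like the SHUFFLE command
    for i in range(len(deck) - 1, 0, -1):
        j = random.randint(0, i)            # pick a random position up to i
        deck[i], deck[j] = deck[j], deck[i] # REPLACE two items: constant time each

deck = list(range(1, 53))                   # a 52-card deck, numbered 1-52
shuffle(deck)
print(deck[:5])                             # the first five cards after shuffling
```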

Other algorithms lend themselves to linked lists, because you want to grow the size of the aggregate, but you don't have to jump around in it, just traverse it from front to back. One standard example is to merge two already-sorted lists:


Here, the operations used are IN FRONT OF and ALL BUT FIRST OF. (A subtle point: ITEM is also used, but only in the form ITEM 1 OF, which is constant time for both arrays and linked lists.) The MERGE block is a reporter, not a command, because it makes a new list, using IN FRONT OF. And it's recursive; it calls itself instead of looping.
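
Here's the same shape as a Python sketch, using the Pair operations from lesson 3 (a guess at the structure of MERGE, not a transcription of the block):

```
def merge(a, b):            # a and b are already-sorted linked lists of Pairs
    if a is EMPTY:
        return b
    if b is EMPTY:
        return a
    if a.car <= b.car:      # ITEM 1 OF each list: constant time
        return in_front_of(a.car, merge(all_but_first_of(a), b))
    else:
        return in_front_of(b.car, merge(a, all_but_first_of(b)))

odds  = cons(1, cons(3, cons(7, EMPTY)))
evens = cons(2, cons(4, cons(8, EMPTY)))
merged = merge(odds, evens)         # the list (1 2 3 4 7 8)
```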

(Note for experts: The recursive call is not a tail call. If you don't know what that means, don't worry about it; it's not important to our story.)

This procedure is much faster for linked lists than for arrays, because IN FRONT OF is making the aggregate result bigger, which for arrays would involve copying. (Yeah, there's a different algorithm you can use for arrays, but it makes mergesort, where this MERGE block would be used, really complicated instead of really simple.)

So, the best thing really would be to provide users with both arrays and linked lists, and let them use whichever better fits their algorithm. But this requires users to know a lot. (We are on lesson 4 and not done yet.) Since we want :sn: to be usable by children as well as by teenage hackers, we try to give users the benefit of the best structure for their program, automatically.

To do that, we allow for conversion between the two structures automatically. If you use IN FRONT OF or ALL BUT FIRST OF, we convert your list to linked form, if it wasn't in that form already. If you use ADD or INSERT or REPLACE ITEM or DELETE ITEM, we convert your list to array form, if it wasn't in that form already. That means that for any one list, it's best if you use only the Scratch-style commands, or use only the Lisp-style reporters. If you switch back and forth, your program will run much more slowly than necessary because of repeated conversions.
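
Here's a toy Python sketch of that strategy, just to show its shape; the real :sn: list object (in JavaScript) is more elaborate, and all the names here are made up. It builds on the Pair sketch from lesson 3:

```
class HybridList:
    """Toy model: keep the items in whichever form (array or linked)
    was used last, converting when the other kind of operation shows up."""
    def __init__(self, items=()):
        self.form = 'array'
        self.array = list(items)
        self.linked = EMPTY

    def become_array(self):
        if self.form == 'linked':
            self.array, p = [], self.linked
            while p is not EMPTY:
                self.array.append(p.car)
                p = p.cdr
            self.form = 'array'

    def become_linked(self):
        if self.form == 'array':
            self.linked = EMPTY
            for x in reversed(self.array):
                self.linked = cons(x, self.linked)
            self.form = 'linked'

    def replace_item(self, n, value):       # command style: wants array form
        self.become_array()
        self.array[n - 1] = value

    def all_but_first_of(self):             # reporter style: wants linked form
        self.become_linked()
        rest = HybridList()
        rest.form, rest.linked = 'linked', self.linked.cdr
        return rest
```

Every switch between replace_item and all_but_first_of on the same list pays for a full conversion, which is exactly why mixing the two styles on one list is slow.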

I like to use the example of a card game. The first thing you want to do is shuffle, for which you want the deck in array form. But after that, you want to deal from the deck, using ITEM 1 OF for the dealt card and ALL BUT FIRST OF for the remainder of the deck. So once the game is in play, you want linked list form. It's not bad to have one conversion in between shuffling and dealing, but you don't want lots of conversions.

You only have to be consistent in how you use each specific list. You can have some arrays and some linked lists, and that works fine.

Lesson 5. Garbage collection.

So, I lied when I said that for linked lists IN FRONT OF takes constant time.

That's true until you use up all the memory in the computer. That actually has two different meanings. (1) You use up all the physical RAM that the operating system has allocated for you. Then what happens is that the system, on your behalf, writes out to disk some chunks of memory that you haven't used recently, allowing other chunks to be read in, or newly allocated. This takes roughly forever. (2) You use up all the virtual address space you're allowed to use (by the OS, or by the Lisp interpreter, or something). Then your interpreter has to look at everything in memory, to see if something else still points to it. If nothing points to a particular pair, then it is "garbage," and can be recycled (yeah, yeah, garbage is different from recycling, it's a metaphor). This takes time roughly proportional to the size of your address space, all the memory your program thinks it has.
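
If you're curious what "look at everything in memory" means concretely, here's a toy mark-and-sweep sketch over the Pair objects from lesson 3; real collectors (including the ones in JavaScript engines) are far more sophisticated, but the shape is the same:

```
def garbage_collect(heap, roots):
    """heap: every Pair ever allocated.  roots: the pairs your variables
    still point to.  Reports the pairs that are still in use."""
    marked = set()

    def mark(p):                        # follow every pointer reachable from a root
        while isinstance(p, Pair) and id(p) not in marked:
            marked.add(id(p))
            mark(p.car)                 # an item might itself be a list
            p = p.cdr

    for r in roots:
        mark(r)

    # The sweep: anything unmarked is garbage, and its space can be reused.
    # Note that this loop looks at the whole heap, which is why a collection
    # takes time proportional to all the memory you've used.
    return [p for p in heap if id(p) in marked]
```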

So you're happily bouncing along with your nice fast program, all constant time operations, and all of a sudden everything grinds to a halt for, like, two minutes. This may not matter if you're doing a long computation and it doesn't happen too often, but if you're trying to do a smooth animation of a moving sprite, it's a disaster, and if you're the program controlling the brakes in a self-driving car, it's a real disaster. It was this behavior, with unpredictable sudden freezes, that mainly led to Lisp's early reputation as impractical.

That was more than 60 years ago. As you might expect, garbage collection technology has gotten better, in several different directions. In generational garbage collection, you take advantage of the fact that if a pair doesn't become garbage in the few minutes after it's first allocated, it's probably part of a permanent data structure. So when it's time to do a garbage collection, you look only at the pairs that were allocated recently. The details are complicated, but that's the basic idea. In real-time garbage collection, the program doesn't grind to a halt; it can keep working in parallel with the GC. There are lots of other refinements. So now, pretty much every language that isn't C has a garbage collector.

Lesson 6. Modern aggregate methods.

Arrays and linked lists both have weaknesses. But computer science marches on, and today there are a lot of data structures with better time behavior than old-fashioned arrays and linked lists.

First comes the dynamic array. Here we're solving the problem that when you want to add an item to an array you have to allocate a bigger space, which is slow. The basic idea of the solution is easy: When you have to allocate more space, allocate a lot more space, so the next bunch of add-an-item operations will be fast. The particular detail that makes this a dynamic array is that whenever you have to grow an array, you double its size. This seems horribly inefficient in the use of memory, because on average, 1/4 of the space in the array is unused all the time. And so when memory was really expensive (2¢/bit over many years of my childhood, even as memory density improved), dynamic arrays weren't widely used. But now memory is essentially free, and because of caching (see lesson 2) even large memories can be fast. And the payoff is that when you double the size of the array on each copying, adding N items causes only log₂ N copying events, whose total cost is proportional to N; so the average (amortized) time per add-an-item operation -- made up of many almost-zero times plus an occasional huge time -- is constant. Scratch lists are really dynamic arrays, and so is the array form of :sn: lists.
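
Here's the doubling trick as a Python sketch (Python's built-in lists and JavaScript arrays already grow more or less this way; this just makes the bookkeeping visible):

```
class DynamicArray:
    def __init__(self):
        self.capacity = 4                   # slots allocated
        self.count = 0                      # slots actually in use
        self.slots = [None] * self.capacity

    def add(self, value):
        if self.count == self.capacity:     # out of room: double and copy
            self.capacity *= 2
            bigger = [None] * self.capacity
            bigger[:self.count] = self.slots
            self.slots = bigger
        self.slots[self.count] = value      # the usual case: constant time
        self.count += 1

d = DynamicArray()
for n in range(100):
    d.add(n)                # only 5 copying events happen along the way
print(d.count, d.capacity)  # 100 128
```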

We can do better than that. The champion of fast data structures is the hash table, in which insertions and lookups are fast, constant-time operations. (There is a bunch of fine print. You have to allocate quite a lot more memory than you expect to need. You need to choose a good hash function (yeah, I know, you don't know what that is yet) for your specific data types. You have to remember to make the size of the table a prime number. If you want to store N distinct character strings, then at least one of them has length at least log N (why? exercise for the reader) and so the hash function is likely to require log N time. And you have to have read the hash table chapter of your data structures textbook really carefully so you don't fall into one of the black holes littered around the terrain.)

The trick is that you don't control where in the hash table your new entry will end up. Instead, you've provided a hash function whose domain is data of whatever type you're putting in the table, and whose range is nonnegative integers less than the size of the hash table. In other words, it turns values into indices into the hash table's array. What makes a good hash function is that similar values don't turn into nearby indices. So the words FOO, FOO1, FOO2, FOOD, and FOOL should all end up in very different places. If your hash function can't achieve that, several of the values you actually use will end up at the same index (a "collision"), and then you don't get constant-time behavior.
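
Here's a minimal hash-table sketch with the usual "chaining" fix for collisions; Python's built-in dict and set do this job for real, so this is only to show the mechanism:

```
TABLE_SIZE = 101                            # a prime, as the fine print suggests
table = [[] for _ in range(TABLE_SIZE)]     # one "bucket" of values per slot

def hash_index(value):
    # Python's built-in hash() mixes up every character, so similar strings
    # like FOO and FOO1 land in very different slots.
    return hash(value) % TABLE_SIZE

def insert(value):
    bucket = table[hash_index(value)]
    if value not in bucket:
        bucket.append(value)

def lookup(value):
    return value in table[hash_index(value)]    # constant time if buckets stay tiny

for word in ["FOO", "FOO1", "FOO2", "FOOD", "FOOL"]:
    insert(word)
print(lookup("FOOD"), lookup("BAR"))        # True False
```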

But there's a big problem with hash tables: They're not ordered. That is, since things go into quasi-random slots in the table, you can't ask "What's item number 427?" or even "What's item number 1?"

There are two solutions to this problem. First, you can include the item number in the table. To do that, you allocate a pair, you put the item number (the key) in the car and the value in the cdr (thus making a key-value pair, an idea that comes up repeatedly in different data structures), and you hash only on the car (the item number). Then, to find item number 427, you call the hash function with 427 as input, and the result is the place to look in the hash table. This structure (a hash table of key-value pairs) is called a dictionary. If you restrict yourself to nonnegative integers as keys, it's equivalent to an array. But you can also use anything else as a key. If you use words as the keys and their definitions as the values, you can see where the name "dictionary" comes from.
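
Python's dict is exactly this structure, a hash table of key-value pairs, so the word-to-definition example is built in:

```
definitions = {}                            # a hash table of key-value pairs
definitions["aggregate"] = "a bunch of data under one umbrella"
definitions[427] = "some item value"        # nonnegative integers work as keys too

print(definitions["aggregate"])
print(definitions[427])                     # behaves like asking for item 427
```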

The other all-purpose modern data structure is the search tree. In particular, you can put key-value pairs into a self-balancing binary search tree to get a data structure in which you can ask for the first item, and can insert or delete items, even mid-list, all in log N time operations. That's not as good as constant time, but in return you get well-behaved ordering. Want to know more? Ask your friend Wikipedia. :~P
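
Here's a bare-bones binary search tree of key-value pairs in Python; this one skips the self-balancing part (so a lopsided tree degrades to linear time), but it shows the idea the balanced versions build on:

```
class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

def tree_insert(node, key, value):
    if node is None:
        return Node(key, value)
    if key < node.key:
        node.left = tree_insert(node.left, key, value)
    elif key > node.key:
        node.right = tree_insert(node.right, key, value)
    else:
        node.value = value                  # an existing key gets a new value
    return node

def tree_lookup(node, key):                 # log N steps if the tree stays balanced
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None

root = None
for k in [5, 2, 8, 1, 3]:
    root = tree_insert(root, k, k * 10)
print(tree_lookup(root, 3))                 # 30
```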

Why don't we use these more complicated structures in :sn:? Because they're way more complicated and our users don't typically have big enough data sets to make it pay.

That's the end of the list mini-course. Feel free to post specific questions. If you just say "you lost me" there's not much I can do about it.

Oops I forgot something.

Lesson 7. Shared memory.

So, in a linked list, the cdr of any particular pair can only have one value in it (like any word of memory), so if you want a list (A B C D) and another list (A B C E) they have to be two entirely separate structures, with no common pairs.

[diagram: the lists (A B C D) and (A B C E) as two entirely separate chains of pairs]

But you can have as many pointers as you want to the same pair. So a list (A B C D) and another list (X B C D) could share the pairs for the items (B C D). So those two lists could require only five pairs altogether.

[diagram: the lists (A B C D) and (X B C D) sharing the pairs for B, C, and D]

So, if you say

[script: LIST A B C D]

and then you say

[script: LIST X B C D]

does :sn: just notice that those lists could share memory and set it up that way? No, it's not that smart. But if you say

[script: X IN FRONT OF (ALL BUT FIRST OF (the A B C D list))]

the ALL BUT FIRST OF reports a pointer to the second pair of the (A B C D) list, and then the IN FRONT OF makes a pair whose car is X and whose cdr is that (B C D) pair. :sn: has to do it that way, or else ALL BUT FIRST OF and IN FRONT OF wouldn't be constant-time operations.
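
In terms of the Pair sketch from lesson 3, that's all this is; the two lists end up holding pointers to the very same pairs:

```
abcd = cons('A', cons('B', cons('C', cons('D', EMPTY))))
bcd  = all_but_first_of(abcd)   # a pointer to abcd's second pair; nothing is copied
xbcd = in_front_of('X', bcd)    # one new pair whose cdr is that same pointer

print(xbcd.cdr is abcd.cdr)     # True: the two lists share the (B C D) pairs
```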

So, isn't this business of shared memory all scary and stuff? Because modifying one list magically modifies the other one also? Well, those of you who remember back to lesson 4 aren't scared, because you remember that we use linked lists only when you use the reporters IN FRONT OF and ALL BUT FIRST OF to work with a particular list, and in particular that means you aren't using REPLACE ITEM, ADD, DELETE, or INSERT. In short, you're not modifying that list. So it doesn't matter that it's sharing memory with another list! Funny how elegantly that worked out.

What about Jens's Bigger Data library? Is the search tree data structure useful there?

Oh, thanks for the lists lessons; there was a lot I was not aware of.

If what you're doing is a traversal of the data (which is the case for all the higher order functions) plain old arrays and plain old linked lists are both perfectly fine. It's really if you want to jump around, or insert or delete items, that you have to worry about complicated data structures.

I am sorry. I must have forgotten that HOFs only do traversals and don't really remove, add, insert, or replace any items.

EDIT
No, actually they can replace items, can't they?

@bh Do you think that it's a good idea to explain the details of RAM in order to explain the performance benefits of linked lists versus arrays? On one hand, I believe in teaching the high-level abstract idea instead of the machine-level details (and I was under the impression that this was your philosophy of teaching). On the other hand, since this is about performance, not just the mathematical meaning of programs, understanding how the computer operates is important. But, getting into things like the cache and data locality may be distracting. My concern is that getting lost in these details may prevent people from learning the concept of arrays. Since Snap! is implemented in JavaScript, do you know if Snap! arrays actually benefit from data locality?

In PL theory, when discussing operational semantics (giving programs meaning by describing how they run), people use an abstract machine, which is a simplified idealization of the computer. Should abstract machines be used in pedagogy in cases like this? Or, perhaps it's a bad idea to use abstract machines to explain concepts if people don't know how they might translate to actual computers.

No, they make new lists, leaving the input list alone. I mean, yeah, you could, I guess, put command blocks inside a script inside the function you're calling, but that would really be weird. FOR EACH ITEM could, I guess.

All this was in response to @miniepicness talking about the "weird rules" of :sn: lists. So in order to make the rules feel sensible, I thought I had to explain what's really happening. Yes, the part about caching and VM could be left out, but I thought it was worthwhile to make the point that what really happens is complicated. The programmer's-eye view of a program having a huge, fast, contiguous memory from location 0 to location infinity is a beautiful abstraction. What I should have said and didn't (sigh, I just pounded that out at the keyboard without planning) is that arrays have better locality of reference than linked lists do. An array is by definition all contiguous in memory. A linked list, especially one created after a few garbage collections, can be spread out all over the place. And because of caching, that matters.

So, you're right, if I were teaching "the concept of arrays" I wouldn't get into all that. But the audience for all this is people who do know what an array is, but don't know why :sn: uses two different representations for lists.

Yeah, JS arrays are real contiguous arrays, aren't they?

Wow, @bh, that looks like it took a lot of work. This is another reason why I like SNAP! over Scratch. Someone asks a question, and the developers take the time to write out pages-long answers that really explain why things work the way they do. This actually really helped explain why lists get linked, not just that they do, and why programs that use both the HOF reporters and the command blocks run slow.


P.S. How do you get your SNAP! picture so small? When I try it, it looks like this:

[a full-size Snap! logo image]

Type :sn: to make the small :sn: logo,
and type <ooo>:</ooo>sn: to make the literal text ":sn:" show up without turning into the logo.

No offence, but could you remove me from the Kazoo kid collection? Thanks!