How lists work

moving stuff here...

lists are weird in snap and they have weird rules and i still dont really get all of the rules and why they exist lol

Do you mean about linked vs array?

The Rules, simple version:

For any particular list, use the commands ADD/INSERT/REPLACE/DELETE or the reporters ALL BUT FIRST OF/IN FRONT OF, but not both.

"Why?" simple answer:

Because that way we can store each list internally in the way most efficient for its style (command-based or reporter-based). And also, if you follow that rule you don't have to worry about lists sharing items.


Okay, I have to eat dinner, I'll say more later...

Lesson 2. Memory organization.

The memory we're talking about is (fingers crossed behind back) the stuff they mean when they sell you a computer with so many GB of RAM (random access memory). "Random access" means that the time required to find a value in memory is constant for all locations in the memory.

What's a "location"? That's also slightly complicated. The smallest amount of memory you can read or write at once is a byte, eight bits. So each byte has its own memory address, from 0 up to however much memory you have. However, more commonly, you read or write a word, which, at this time (2020) is either 32 or 64 bits -- four or eight bytes. Words are used to hold numbers, more or less (we are not going into hardware representation of numbers today), and also to hold pointers, which are memory addresses. Since memory addresses are nonnegative integers -- unsigned integers as computer people say -- the most memory you can have on your computer is around 4Gb in a 32-bit computer, or, umm, 16 Pb, petabytes, do I have that right?, in a 64-bit computer. The main reason for moving from 32- to 64-bit words isn't that the number of integers in the world has increased, but rather that these days there are programs and/or data sets bigger than 4Gb.

Addressing of bytes is straightforward; any unsigned integer can be the address of a byte. Addressing of words is slightly more complicated: In order to make the interface hardware between the processor and the memory simpler (and therefore faster), word addresses are required to be multiples of four or eight, depending on the width (32-bit or 64-bit) of the computer.

Uncrossing my fingers slightly: Some of the possible addresses don't address RAM. They address various special things, such as the bootstrap ROM that holds the code that runs when you turn on or reset the computer, and input/output device registers.

Uncrossing most of the way: Once upon a time, your program asked for the byte or word at a certain address, and the hardware read the byte or word of RAM at that address. That's no longer true. The reason is that bigger memories are inherently slower than smaller memories. That's because the addressing circuitry in the memory interface has to examine log₂ N bits, where N is the number of bytes in the memory. So it's desirable to arrange things so that the memory actually addressed in hardware is smaller than what your program thinks it is. There are (a simplification but not a lie) three levels of memory. From small to large:

  1. Cache memory. May be inside the processor chip. These days, less than a GB, maybe a lot less. There are probably three or more levels of cache; the smallest and fastest may be less than 1 MB.
  2. Physical RAM. Several GB.
  3. Virtual memory. This is what your program sees, all 4 GB to 16 EB of it. At any moment, some parts of it will be in RAM; other parts may be on disk.

Having a small memory pretend to be a large memory is a good trick. What makes it possible is locality of reference: the fact that the next memory address your program will ask for is almost certainly pretty near, maybe very near, the last address it asked for. So each level of the memory hierarchy contains the memory locations in the vicinity of the ones that have been used recently, and almost all the time the next address you want will already be in faster memory. When it's not, the next memory access is much slower than usual.

Okay, that's almost the entire truth about memory, more than we need for our purposes. Hereafter, we pretend that a software address just points to a specific chunk of RAM, while silently giving thanks to locality of reference for making our computers run fast.

Lesson 3. Storage of data aggregates.

An aggregate is a bunch of data under one umbrella. :sn: lists are aggregates.

There are certain things you want to be able to do with an aggregate: Find an item, add an item, change an item, delete an item. Depending on your program's algorithm, you may want to find items in order, front to back; or you may want to jump around in the items.

So, in the beginning there was the array. An array is a contiguous chunk of memory. So if you have an array of numbers, each stored in a four-byte word, starting at address 4000, then the address of item N is 4000+(4*N). The multiplication by 4 is there because each word takes four bytes, so the second item is at 4004, not 4001.

But, oops, according to the formula, item 2 is at 4000+(4*2) = 4008, not 4004. And this is why programming languages that are all about what's where in memory index their arrays starting from zero, so their compilers don't have to use 3996+(4*N) as the formula. And that's why "item 2" is the third item in those languages. In languages that are more interested in making sense than in what's where in memory, of course, the first item of an array is called item 1.
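
If it helps to see that arithmetic as code, here's a little Python sketch using the made-up numbers from above (starting address 4000, four-byte words):

```
BASE = 4000     # made-up starting address of the array
WORD = 4        # bytes per item

def address_zero_based(n):              # C-style: the first item is item 0
    return BASE + WORD * n

def address_one_based(n):               # :sn:-style: the first item is item 1
    return (BASE - WORD) + WORD * n     # i.e. 3996 + 4*n

print(address_zero_based(0), address_one_based(1))   # 4000 4000
print(address_zero_based(2), address_one_based(2))   # 4008 4004
```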

So, how do arrays work out as far as speed is concerned? Finding any item is really fast, because you don't have to search for it; you just add (four times) the item number to the starting address of the array. Replacing an item with a new value is similarly fast. But adding and deleting items is slow. Deleting at the end is just deciding not to use the last item any more. But since the items have to be contiguous in memory, deleting anywhere else requires copying the elements past the deletee one word lower in memory. This takes an amount of time proportional to the number of items in the array. Adding an item is even worse, because only a fixed number of words were allocated for this array, and if you add more than that many items, you have to allocate a bigger block of memory, copy the entire array over there, insert the new item (having left an open slot for it when doing the copy), and make anything that points to the array point to the new one instead.
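
Here's a rough Python sketch of why middle-of-the-array deletion and insertion are slow; the point is the shifting loop, which touches every item past the one you're changing. (Python's own lists do all this for you behind the scenes; this just makes the copying visible.)

```
def delete_item(arr, i):                # i is a zero-based index here
    for j in range(i, len(arr) - 1):
        arr[j] = arr[j + 1]             # shift everything after i down by one slot
    arr.pop()                           # forget the now-duplicated last slot

def insert_item(arr, i, value):
    arr.append(arr[-1])                 # grow by one slot (this may itself copy!)
    for j in range(len(arr) - 2, i, -1):
        arr[j] = arr[j - 1]             # shift everything after i up by one slot
    arr[i] = value

cards = [10, 20, 30, 40]
delete_item(cards, 1)                   # [10, 30, 40]
insert_item(cards, 1, 99)               # [10, 99, 30, 40]
print(cards)
```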

By the way, in order to make this just-do-arithmetic approach work, to find any item in constant time, the items all have to be the same size. So in array-centric languages like C, when you allocate an array you have to say not only how many items it'll hold, but also what data type they all are. You can have an array of integers, or you can have an array of characters (bytes), but you can't have an array with some of each. If you want that, you have to make it an array of pointers, all of which are the same size, and have some way to know whether a particular item is a pointer to a byte or a pointer to a word. From this unfortunate restriction of arrays has grown a whole mythology about why it's bad to have a data aggregate with mixed item types.

So, along came Lisp, with applications in artificial intelligence in which it's rare to ask for item number 437 of an aggregate. Instead, you're much more likely to traverse the aggregate: look at the first item, then the second, then the third, and so on. But you'd like your aggregates to be able to get bigger and bigger without any recopying. So you invent the linked list, which is made of pairs.

A pair is an array of length two. Since it's a fixed size, there's no question of inserting or deleting items of the pair. Since all pairs are the same size, memory allocation is particularly simple. Instead of calling the two items "item 1" and "item 2," you call them "the car" and "the cdr" of the pair. (Look it up if you're wondering why.) The way we use pairs to implement lists is that we allocate a pair for each list item. The car of the pair points to (i.e., contains the address of) the list item. The cdr of the pair points to the next pair, unless this is the last pair, in which case it points to a special thing called "the empty list."

In early Lisp implementations it was common to use a value of 0 instead of a pointer to indicate the empty list. More recently it's become more common to see an actual location in memory counted as the empty list, so the last pair of a list will have an actual pointer to that location in its cdr. (Never mind why.) In :sn: there can be more than one empty list; a new one is made every time you ask for a list with no items. This is sort of wrong, but harmless, because we don't mutate lists that much.
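
To make pairs concrete, here's a minimal Python sketch of them; the names (Pair, cons, EMPTY) are mine, not anything from :sn:'s actual source:

```
class Pair:
    def __init__(self, car, cdr):
        self.car = car          # the list item (or a pointer to it)
        self.cdr = cdr          # the rest of the list

EMPTY = None                    # playing the role of "the empty list"

def cons(item, lst):            # allocate one new pair in front of lst
    return Pair(item, lst)

# The list (A B C D), built back to front:
abcd = cons('A', cons('B', cons('C', cons('D', EMPTY))))
print(abcd.car, abcd.cdr.car)   # A B
```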

So, how's the timing of linked list operations? Linked lists aren't random access; looking for item number N requires chasing the cdr of the cdr of the cdr... N-1 times. But traversing a list is fast; if you're looking at the part of the list starting with item N, a single cdr operation will get you to item N+1, regardless of how big N is. And it takes only constant time (fingers crossed) to add an item, if you add it at the front of the list; you just allocate one new pair, whose car is the new item and whose cdr is the previous head of the list. In :sn: the next-item operation is ALL BUT FIRST OF, and the add-an-item operation is IN FRONT OF. For linked lists, these take constant time.
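
In terms of that Pair sketch, the two reporters and the slow ITEM lookup look roughly like this:

```
def in_front_of(item, lst):     # constant time: allocate one new pair
    return cons(item, lst)

def all_but_first_of(lst):      # constant time: just follow one cdr
    return lst.cdr

def item_of(n, lst):            # linear time: chase n-1 cdrs, then take the car
    for _ in range(n - 1):
        lst = lst.cdr
    return lst.car

print(item_of(3, abcd))         # C
```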

Lesson 4. Aggregates in :sn:.

Some algorithms lend themselves to arrays, because the aggregate size doesn't change, and you have to jump around in it. The canonical example is to shuffle an array (e.g., a deck of cards) randomly. Here's how:


Notice that SHUFFLE is a command, and it uses the command REPLACE ITEM to manipulate the data. Arrays are best here because, e.g., a deck is 52 cards both before and after shuffling, and the REPLACE ITEM command takes constant time for an array.
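
Here's roughly the same idea as a Python sketch (the standard Fisher-Yates shuffle, which may or may not match the pictured block exactly); the important part is that it only ever replaces items, so the array never changes size:

```
import random

def shuffle(deck):                          # in place, like the SHUFFLE command
    for i in range(len(deck) - 1, 0, -1):
        j = random.randint(0, i)            # pick a random position up to i
        deck[i], deck[j] = deck[j], deck[i] # REPLACE two items: constant time each

deck = list(range(1, 53))                   # a 52-card deck, numbered 1-52
shuffle(deck)
print(deck[:5])                             # the first five cards after shuffling
```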

Other algorithms lend themselves to linked lists, because you want to grow the size of the aggregate, but you don't have to jump around in it, just traverse it from front to back. One standard example is to merge two already-sorted lists:


Here, the operations used are IN FRONT OF and ALL BUT FIRST OF. (A subtle point: ITEM is also used, but only in the form ITEM 1 OF, which is constant time for both arrays and linked lists.) The MERGE block is a reporter, not a command, because it makes a new list, using IN FRONT OF. And it's recursive; it calls itself instead of looping.
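
Here's the same shape as a Python sketch, using the Pair operations from lesson 3 (a guess at the structure of MERGE, not a transcription of the block):

```
def merge(a, b):            # a and b are already-sorted linked lists of Pairs
    if a is EMPTY:
        return b
    if b is EMPTY:
        return a
    if a.car <= b.car:      # ITEM 1 OF each list: constant time
        return in_front_of(a.car, merge(all_but_first_of(a), b))
    else:
        return in_front_of(b.car, merge(a, all_but_first_of(b)))

odds  = cons(1, cons(3, cons(7, EMPTY)))
evens = cons(2, cons(4, cons(8, EMPTY)))
merged = merge(odds, evens)         # the list (1 2 3 4 7 8)
```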

(Note for experts: The recursive call is not a tail call. If you don't know what that means, don't worry about it; it's not important to our story.)

This procedure is much faster for linked lists than for arrays, because IN FRONT OF is making the aggregate result bigger, which for arrays would involve copying. (Yeah, there's a different algorithm you can use for arrays, but it makes mergesort, where this MERGE block would be used, really complicated instead of really simple.)

So, the best thing really would be to provide users with both arrays and linked lists, and let them use whichever better fits their algorithm. But this requires users to know a lot. (We are on lesson 4 and not done yet.) Since we want :sn: to be usable by children as well as by teenage hackers, we try to give users the benefit of the best structure for their program, automatically.

To do that, we allow for conversion between the two structures automatically. If you use IN FRONT OF or ALL BUT FIRST OF, we convert your list to linked form, if it wasn't in that form already. If you use ADD or INSERT or REPLACE ITEM or DELETE ITEM, we convert your list to array form, if it wasn't in that form already. That means that for any one list, it's best if you use only the Scratch-style commands, or use only the Lisp-style reporters. If you switch back and forth, your program will run much more slowly than necessary because of repeated conversions.
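
Here's a toy Python sketch of that strategy, just to show its shape; the real :sn: list object (in JavaScript) is more elaborate, and all the names here are made up. It builds on the Pair sketch from lesson 3:

```
class HybridList:
    """Toy model: keep the items in whichever form (array or linked)
    was used last, converting when the other kind of operation shows up."""
    def __init__(self, items=()):
        self.form = 'array'
        self.array = list(items)
        self.linked = EMPTY

    def become_array(self):
        if self.form == 'linked':
            self.array, p = [], self.linked
            while p is not EMPTY:
                self.array.append(p.car)
                p = p.cdr
            self.form = 'array'

    def become_linked(self):
        if self.form == 'array':
            self.linked = EMPTY
            for x in reversed(self.array):
                self.linked = cons(x, self.linked)
            self.form = 'linked'

    def replace_item(self, n, value):       # command style: wants array form
        self.become_array()
        self.array[n - 1] = value

    def all_but_first_of(self):             # reporter style: wants linked form
        self.become_linked()
        rest = HybridList()
        rest.form, rest.linked = 'linked', self.linked.cdr
        return rest
```

Every switch between replace_item and all_but_first_of on the same list pays for a full conversion, which is exactly why mixing the two styles on one list is slow.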

I like to use the example of a card game. The first thing you want to do is shuffle, for which you want the deck in array form. But after that, you want to deal from the deck, using ITEM 1 OF for the dealt card and ALL BUT FIRST OF for the remainder of the deck. So once the game is in play, you want linked list form. It's not bad to have one conversion in between shuffling and dealing, but you don't want lots of conversions.

You only have to be consistent in how you use each specific list. You can have some arrays and some linked lists, and that works fine.

Lesson 5. Garbage collection.

So, I lied when I said that for linked lists IN FRONT OF takes constant time.

That's true until you use up all the memory in the computer. That actually has two different meanings. (1) You use up all the physical RAM that the operating system has allocated for you. Then what happens is that the system, on your behalf, writes out to disk some chunks of memory that you haven't used recently, allowing other chunks to be read in, or newly allocated. This takes roughly forever. (2) You use up all the virtual address space you're allowed to use (by the OS, or by the Lisp interpreter, or something). Then your interpreter has to look at everything in memory, to see if something else still points to it. If nothing points to a particular pair, then it is "garbage," and can be recycled (yeah, yeah, garbage is different from recycling, it's a metaphor). This takes time roughly proportional to the size of your address space, all the memory your program thinks it has.
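
If you're curious what "look at everything in memory" means concretely, here's a toy mark-and-sweep sketch over the Pair objects from lesson 3; real collectors (including the ones in JavaScript engines) are far more sophisticated, but the shape is the same:

```
def garbage_collect(heap, roots):
    """heap: every Pair ever allocated.  roots: the pairs your variables
    still point to.  Reports the pairs that are still in use."""
    marked = set()

    def mark(p):                        # follow every pointer reachable from a root
        while isinstance(p, Pair) and id(p) not in marked:
            marked.add(id(p))
            mark(p.car)                 # an item might itself be a list
            p = p.cdr

    for r in roots:
        mark(r)

    # The sweep: anything unmarked is garbage, and its space can be reused.
    # Note that this loop looks at the whole heap, which is why a collection
    # takes time proportional to all the memory you've used.
    return [p for p in heap if id(p) in marked]
```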

So you're happily bouncing along with your nice fast program, all constant time operations, and all of a sudden everything grinds to a halt for, like, two minutes. This may not matter if you're doing a long computation and it doesn't happen too often, but if you're trying to do a smooth animation of a moving sprite, it's a disaster, and if you're the program controlling the brakes in a self-driving car, it's a real disaster. It was this behavior, with unpredictable sudden freezes, that mainly led to Lisp's early reputation as impractical.

That was more than 60 years ago. As you might expect, garbage collection technology has gotten better, in several different directions. In generational garbage collection, you take advantage of the fact that if a pair doesn't become garbage in the few minutes after it's first allocated, it's probably part of a permanent data structure. So when it's time to do a garbage collection, you look only at the pairs that were allocated recently. The details are complicated, but that's the basic idea. In real-time garbage collection, the program doesn't grind to a halt; it can keep working in parallel with the GC. There are lots of other refinements. So now, pretty much every language that isn't C has a garbage collector.

Lesson 6. Modern aggregate methods.

Arrays and linked lists both have weaknesses. But computer science marches on, and today there are a lot of data structures with better time behavior than old-fashioned arrays and linked lists.

First comes the dynamic array. Here we're solving the problem that when you want to add an item to an array you have to allocate a bigger space, which is slow. The basic idea of the solution is easy: When you have to allocate more space, allocate a lot more space, so the next bunch of add-an-item operations will be fast. The particular detail that makes this a dynamic array is that whenever you have to grow an array, you double its size. This seems horribly inefficient in the use of memory, because on average, 1/4 of the space in the array is unused all the time. And so when memory was really expensive (2¢/bit over many years of my childhood, even as memory density improved), dynamic arrays weren't widely used. But now memory is essentially free, and because of caching (see lesson 2) even large memories can be fast. And the payoff is that when you double the size of the array on each copying, adding N items causes only log₂ N copying events, whose total cost is proportional to N; so the average (amortized) time per add-an-item operation -- made up of many almost-zero times plus an occasional huge time -- is constant. Scratch lists are really dynamic arrays, and so is the array form of :sn: lists.
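
Here's the doubling trick as a Python sketch (Python's built-in lists and JavaScript arrays already grow more or less this way; this just makes the bookkeeping visible):

```
class DynamicArray:
    def __init__(self):
        self.capacity = 4                   # slots allocated
        self.count = 0                      # slots actually in use
        self.slots = [None] * self.capacity

    def add(self, value):
        if self.count == self.capacity:     # out of room: double and copy
            self.capacity *= 2
            bigger = [None] * self.capacity
            bigger[:self.count] = self.slots
            self.slots = bigger
        self.slots[self.count] = value      # the usual case: constant time
        self.count += 1

d = DynamicArray()
for n in range(100):
    d.add(n)                # only 5 copying events happen along the way
print(d.count, d.capacity)  # 100 128
```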

We can do better than that. The champion of fast data structures is the hash table, in which insertions and lookups are fast, constant-time operations. (There is a bunch of fine print. You have to allocate quite a lot more memory than you expect to need. You need to choose a good hash function (yeah, I know, you don't know what that is yet) for your specific data types. You have to remember to make the size of the table a prime number. If you want to store N distinct character strings, then at least one of them has length at least log N (why? exercise for the reader) and so the hash function is likely to require log N time. And you have to have read the hash table chapter of your data structures textbook really carefully so you don't fall into one of the black holes littered around the terrain.)

The trick is that you don't control where in the hash table your new entry will end up. Instead, you've provided a hash function whose domain is data of whatever type you're putting in the table, and whose range is nonnegative integers less than the size of the hash table. In other words, it turns values into indices into the hash table's array. What makes a good hash function is that similar values don't turn into nearby indices. So the words FOO, FOO1, FOO2, FOOD, and FOOL should all end up in very different places. If your hash function can't achieve that, several of the values you actually use will end up at the same index (a "collision"), and then you don't get constant-time behavior.
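
Here's a minimal hash-table sketch with the usual "chaining" fix for collisions; Python's built-in dict and set do this job for real, so this is only to show the mechanism:

```
TABLE_SIZE = 101                            # a prime, as the fine print suggests
table = [[] for _ in range(TABLE_SIZE)]     # one "bucket" of values per slot

def hash_index(value):
    # Python's built-in hash() mixes up every character, so similar strings
    # like FOO and FOO1 land in very different slots.
    return hash(value) % TABLE_SIZE

def insert(value):
    bucket = table[hash_index(value)]
    if value not in bucket:
        bucket.append(value)

def lookup(value):
    return value in table[hash_index(value)]    # constant time if buckets stay tiny

for word in ["FOO", "FOO1", "FOO2", "FOOD", "FOOL"]:
    insert(word)
print(lookup("FOOD"), lookup("BAR"))        # True False
```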

But there's a big problem with hash tables: They're not ordered. That is, since things go into quasi-random slots in the table, you can't ask "What's item number 427?" or even "What's item number 1?"

There are two solutions to this problem. First, you can include the item number in the table. To do that, you allocate a pair, you put the item number (the key) in the car and the value in the cdr (thus making a key-value pair, an idea that comes up repeatedly in different data structures), and you hash only on the car (the item number). Then, to find item number 427, you call the hash function with 427 as input, and the result is the place to look in the hash table. This structure (a hash table of key-value pairs) is called a dictionary. If you restrict yourself to nonnegative integers as keys, it's equivalent to an array. But you can also use anything else as a key. If you use words as the keys and their definitions as the values, you can see where the name "dictionary" comes from.
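
Python's dict is exactly this structure, a hash table of key-value pairs, so the word-to-definition example is built in:

```
definitions = {}                            # a hash table of key-value pairs
definitions["aggregate"] = "a bunch of data under one umbrella"
definitions[427] = "some item value"        # nonnegative integers work as keys too

print(definitions["aggregate"])
print(definitions[427])                     # behaves like asking for item 427
```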

The other all-purpose modern data structure is the search tree. In particular, you can put key-value pairs into a self-balancing binary search tree to get a data structure in which you can ask for the first item, and can insert or delete items, even mid-list, all in log N time operations. That's not as good as constant time, but in return you get well-behaved ordering. Want to know more? Ask your friend Wikipedia. :~P
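
Here's a bare-bones binary search tree of key-value pairs in Python; this one skips the self-balancing part (so a lopsided tree degrades to linear time), but it shows the idea the balanced versions build on:

```
class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

def tree_insert(node, key, value):
    if node is None:
        return Node(key, value)
    if key < node.key:
        node.left = tree_insert(node.left, key, value)
    elif key > node.key:
        node.right = tree_insert(node.right, key, value)
    else:
        node.value = value                  # an existing key gets a new value
    return node

def tree_lookup(node, key):                 # log N steps if the tree stays balanced
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None

root = None
for k in [5, 2, 8, 1, 3]:
    root = tree_insert(root, k, k * 10)
print(tree_lookup(root, 3))                 # 30
```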

Why don't we use these more complicated structures in :sn:? Because they're way more complicated and our users don't typically have big enough data sets to make it pay.

That's the end of the list mini-course. Feel free to post specific questions. If you just say "you lost me" there's not much I can do about it.

Oops I forgot something.

Lesson 7. Shared memory.

So, in a linked list, the cdr of any particular pair can only have one value in it (like any word of memory), so if you want a list (A B C D) and another list (A B C E) they have to be two entirely separate structures, with no common pairs.

[diagram: the lists (A B C D) and (A B C E) as two entirely separate chains of pairs]

But you can have as many pointers as you want to the same pair. So a list (A B C D) and another list (X B C D) could share the pairs for the items (B C D). So those two lists could require only five pairs altogether.

[diagram: the lists (A B C D) and (X B C D) sharing the pairs for B, C, and D]

So, if you say

[script: LIST A B C D]

and then you say

[script: LIST X B C D]

does :sn: just notice that those lists could share memory and set it up that way? No, it's not that smart. But if you say

[script: X IN FRONT OF (ALL BUT FIRST OF (the A B C D list))]

the ALL BUT FIRST OF reports a pointer to the second pair of the (A B C D) list, and then the IN FRONT OF makes a pair whose car is X and whose cdr is that (B C D) pair. :sn: has to do it that way, or else ALL BUT FIRST OF and IN FRONT OF wouldn't be constant-time operations.
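
In terms of the Pair sketch from lesson 3, that's all this is; the two lists end up holding pointers to the very same pairs:

```
abcd = cons('A', cons('B', cons('C', cons('D', EMPTY))))
bcd  = all_but_first_of(abcd)   # a pointer to abcd's second pair; nothing is copied
xbcd = in_front_of('X', bcd)    # one new pair whose cdr is that same pointer

print(xbcd.cdr is abcd.cdr)     # True: the two lists share the (B C D) pairs
```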

So, isn't this business of shared memory all scary and stuff? Because modifying one list magically modifies the other one also? Well, those of you who remember back to lesson 4 aren't scared, because you remember that we use linked lists only when you use the reporters IN FRONT OF and ALL BUT FIRST OF to work with a particular list, and in particular that means you aren't using REPLACE ITEM, ADD, DELETE, or INSERT. In short, you're not modifying that list. So it doesn't matter that it's sharing memory with another list! Funny how elegantly that worked out.

What about Jens's Bigger Data library? Is the search tree data structure useful there?

Oh, thanks for the lists lessons; there was a lot I was not aware of.

If what you're doing is a traversal of the data (which is the case for all the higher order functions) plain old arrays and plain old linked lists are both perfectly fine. It's really if you want to jump around, or insert or delete items, that you have to worry about complicated data structures.

I am sorry. I must have forgotten that HOFs only do traversals and don't really remove, add, insert, or replace any items.

EDIT
No, actually they can replace items, can't they?

@bh Do you think that it's a good idea to explain the details of RAM in order to explain the performance benefits of linked lists versus arrays? On one hand, I believe in teaching the high-level abstract idea instead of the machine-level details (and I was under the impression that this was your philosophy of teaching). On the other hand, since this is about performance, not just the mathematical meaning of programs, understanding how the computer operates is important. But, getting into things like the cache and data locality may be distracting. My concern is that getting lost in these details may prevent people from learning the concept of arrays. Since Snap! is implemented in JavaScript, do you know if Snap! arrays actually benefit from data locality?

In PL theory, when discussing operational semantics (giving programs meaning by describing how they run), people use an abstract machine, which is a simplified idealization of the computer. Should abstract machines be used in pedagogy in cases like this? Or, perhaps it's a bad idea to use abstract machines to explain concepts if people don't know how they might translate to actual computers.

No, they make new lists, leaving the input list alone. I mean, yeah, you could, I guess, put command blocks inside a script inside the function you're calling, but that would really be weird. FOR EACH ITEM could, I guess.

All this was in response to @miniepicness talking about the "weird rules" of :sn: lists. So in order to make the rules feel sensible, I thought I had to explain what's really happening. Yes, the part about caching and VM could be left out, but I thought it was worthwhile to make the point that what really happens is complicated. The programmer's-eye view of a program having a huge, fast, contiguous memory from location 0 to location infinity is a beautiful abstraction. What I should have said and didn't (sigh, I just pounded that out at the keyboard without planning) is that arrays have better locality of reference than linked lists do. An array is by definition all contiguous in memory. A linked list, especially one created after a few garbage collections, can be spread out all over the place. And because of caching, that matters.

So, you're right, if I were teaching "the concept of arrays" I wouldn't get into all that. But the audience for all this is people who do know what an array is, but don't know why :sn: uses two different representations for lists.

Yeah, JS arrays are real contiguous arrays, aren't they?

Wow, @bh, that looks like it took a lot of work. This is another reason why I like SNAP! over Scratch. Someone asks a question, and the developers take the time to write out pages-long answers that really explain why things work the way they do. This actually really helped explain why lists get linked, not just that they do, and why programs that use both the HOF reporters and the command blocks run slow.


P.S. How do you get your SNAP! picture so small? When I try it, it looks like this:

[a full-size Snap! logo image]

Type :sn: to make the small :sn: logo,
and type <ooo>:</ooo>sn: to make the literal text ":sn:" show up without turning into the logo.

No offence, but could you remove me from the Kazoo kid collection? Thanks!