Built-in hashtables

mineoskar999 · August 28, 2019, 3:25pm

Hashtables will store items with name instead of position.

Blocks used:
(get [] of [])
set [] of [] to []
delete [] from []

bh · August 28, 2019, 5:20pm

Hi! Welcome to the forum.

The thing is, our policy has been to give users only the structures they need, and let them create more complex data structures themselves. One of the uses of Snap! is teaching data structures courses! Hash tables are a perfect example.

spaceelephant · April 2, 2020, 10:45pm

I can't even figure out how to make hashtables is JS. (js, however, has native hashtables. If you don't like the keys then you can make your own hash function) Can you make a variable store a js object?

ego-lay_atman-bay · April 3, 2020, 1:59am

what are hashtables?

bh · April 3, 2020, 6:49am

I'll get to "what are hashtables?" in a moment but first comes the discussion about implementing them.

All a hashtable is, is a list of lists. Each sublist is a bucket for key-value pairs. What's special about the hashtable is that in addition to the actual table it has a hash function that translates a key into a hash index, i.e., a positive integer less than or equal to the length of the table. (In a language with zero-origin lists, it'd be a nonnegative integer strictly less than the length of the table.)

So it really doesn't matter whether you use Snap! or Javascript to implement the table. The entire insertion algorithm is

hashtable[hashfunction(new_key)] ← new_key_value_pair IN FRONT OF hashtable[hashfunction(new_key)]

To find something in the table, you get hashtable[hashfunction(key)] and look inside it for the key you want. (Our list library has a function assoc that does this searching.)

The tricky parts are (1) pick a prime number as the size of the table, to help avoid collisions; and (2) pick a hash function that's smart about spreading out similar keys. So, for example, if the key is a text string, you could just add up the bytes of the string modulo the length of the table, but that's not a good hash function, because most of those bytes will probably be alphabetic, which means they're all within the same 26 consecutive Unicode codes. So you left-shift the value already computed by five bits before adding the new byte. That means that the low-order five bits of the byte contribute more strongly to the key than the high-order three bits, the ones that are mostly the same for all (alphabetic) characters. (You don't just turn off those top three bits because they do distinguish letters from digits and punctuation characters.) The reason you go to all this trouble is that if your user has variables named FOO1, FOO2, FOO3, and so on, you want to be sure they don't all hash to the same bucket.

Okay, time for "what are hashtables." The problem we're trying to solve is to remember a bunch of key-value pairs (e.g., variable name and variable value) and we want to be able to add a new one and to look up one as quickly as possible. (As always in thinking about algorithms, if you only have ten things to look up, you might as well use a simple data structure. You have to imagine 1000 or more key-value pairs.) If you use a plain linear list to hold these key-value pairs, it takes linear time (i.e., time proportional to the number of pairs you have) to look one up; on average the one you want will be halfway down the table, but if the key you want isn't in the table, you have to get all the way to the end before you find out.

A better approach would be something like a binary search tree, in which each node is a key-value pair with links to a left child and to a right child, arranged so that every key that sorts before the one at that node is in the left child (also a binary search tree) and every key that sorts after the one at that node is in the right child. Ideally, it takes only log₂ n, where n is the number of key-value pairs in the table, to find the one you want or to find out that it's not in the table. But that's true only if the two children of each node are more or less balanced. If you just leave that to chance, and you imagine inserting a bunch of key-value pairs that are in sorted order, every new pair will end up in the right child of the one just before it, so you just get a linear list. There are techniques for making balanced trees that guarantee log n time, but I'm not going into that right now, because we can do better than trees anyway.

The best approach is to use a hashtable, which looks up a key in constant time. This sounds like magic, and it is kind of magical. The magic is in the hash function. If you pick a bad function, there will be lots of collisions: two different keys that hash into the same bucket. Oh, besides having a good hash function, you have to allocate a list whose length is a prime number close to twice the number of pairs you expect to insert. (If the table isn't big enough, you're guaranteed to have collisions!)

There's also a very subtle way in which we're cheating: The hash function for text strings deals with each byte separately, adding them up modulo the size of the table or something like that. Now, if we have 1000 unique keys to add to the table, then the average length of a key has to be at least log₂₆1000 bytes, supposing keys contain only letters. So the hash function by itself takes time proportional to the log of the number of entries, similar to the time for a balanced binary search tree. If you really want constant time lookup, you need a constant time hash function that still spreads the keys out among the buckets. Perhaps you'd use a byte-based function but look only at the first five and last five bytes of each key. That would be constant time and maybe almost as good. (I'm not a mathematician.)

snapenilk · April 3, 2020, 12:57pm

Yes

spaceelephant · April 3, 2020, 6:32pm

Yes, but there is no way to disable transient. I know what a hashtable is. I have written one. In C where there is a 64 bit integer type. I could probably make one in python, with an infinite precision int, but not without an int at all.
I have found a way, and the only remaining challenge is to write foreach.

spaceelephant · April 3, 2020, 7:36pm

Fixed. https://snap.berkeley.edu/snap/snap.html#present:Username=spaceelephant&ProjectName=hashtable

Some example modifications:

Set: All values are false (or 0 or '') Use the contains block to check if the set contains an element.
Multiset: Values are numbers specifying the count of a key.
Dictionary: A hashtable and a list of keys. Keys are ordered by the list. Slows down removal because of removing from list. Or you could always keep a key, and only remove the value (set to false/0/'')
Ordered Set: A list and a set.

bh · April 3, 2020, 9:49pm

I don't understand this. Why do you need the list of keys? You can just look up a word in the hash table.

spaceelephant · April 3, 2020, 10:20pm

So that there is an order. It must be a twice linked list (each item has a link to previous and next)

bh · April 3, 2020, 11:17pm

This kind of defeats the purpose of a hash table, I think. Wouldn't you be better off using an order-preserving structure such as a balanced binary search tree instead of two parallel structures?

spaceelephant · April 3, 2020, 11:21pm

Possibly. I haven't personally used the ordering, but I still thought it would be a good idea to describe the implementation. I only knew about the possibility from python.

wunder_wulfe · April 5, 2020, 5:58am

You could always offer a simple variant/version of hashtables by using the builtin "dictionary" or objects in JavaScript. But aside from that, more complex classes like maps (unordered and ordered) and sets (same) may be a bit more difficult to implement. You can try looking at Lua source code (in C) to see how they make their dictionaries (they use tombstones). Java makes theirs by using a fixed length array of linked lists and then eventually converts it when there are too many links. As for how C++ classes are set up, I don't really know.

bh · April 5, 2020, 8:03am

There's a deep question hiding behind these detail questions: To fulfil our role as a teaching language, how do we decide when to offer a feature as primitive, versus leaving it as something a student (including self-directed learners) can/should implement?

In the beginning, our policy was never to include as a primitive something that could be written in Snap! itself. There's something sublime about writing

and seeing how much power RUN gives you to create your own language.

The trouble is that, wearing my curriculum writer hat, I want students to be able to use FOR before they write it. Thus was born the Tools library, in which we write all those blocks that can be written in Snap!. This is arguably the pessimal solution, in that we've taken away the joy of doing it yourself, but the blocks are still slow because they're written in Snap! instead of in JS.

And so now we have FOR and the HOFs as primitives. This is a big win for people who want to use Snap! for real work, and also for the "big data" section of the curriculum. But there's a real cost; rewriting something you've been using all along as a primitive doesn't give the same power rush as writing it yourself in the first place.

(Footnote: In BJC we are constrained by having to cram in all the stuff that the College Board wants us to teach before May. If we weren't an AP CS Principles course, we could try out things like teaching RUN and CALL early.)

So, what about hashtables, or the other 20 gazillion things JS offers either as primitive or in widely used libraries? For JS programmers, the natural thing is to use Snap! essentially as a fancy front end to JS. But that's not what we set out to build, and we're still resisting. I can imagine that if BJC were built around dictionaries, we might have added hashtable primitives at the same time we added primitive HOFs.

There's no right and wrong answer about what to include. Snap! has the flavor it does partly because I grew up in Logo and later in Scheme (real Scheme, R5RS or earlier, before the hostile takeover by the forces of Common Lisp). Jens, on the other hand, grew up in Smalltalk. This is why we argue about things like whether to include the complete Scheme numeric tower in Snap!.

It may be that we're going to keep being beaten back on this, and ten years from now Snap! will have primitive dictionaries, primitive regular expression searching, and so on. But right now we're still fighting.

Meanwhile, feel free to write a hashtable-based dictionary library!

jens · April 5, 2020, 9:35am

Thank you, Brian, for this wonderful write-up of our design dilemma and its conflicts. I keep showing how to make map and for and in doing so I'm annoyed by the apparent pointlessness - because these blocks are already there -, but I equally love introducing novices to the joys of using map for ... almost anything!

wunder_wulfe · April 5, 2020, 4:04pm

I think one of the main concerns with trying to create some sort of hashmap structure in Snap! or JS is the limitations that this provides. As JS and Snap! are unable to offer something such as some sort of memory address lookup or specific ID / UUID for all objects and data, things like strings and numbers will be easily hashable but things like ringed blocks and custom objects and other datatypes (lists, functions..) are really hard to make hash functions for since most applications will just hash the memory address of the object's pointer (as this provides a solution for all objects too, not just specific ones). Sure, we could have string/numerical key dictionaries but that misses out on some of the most important uses of dictionaries: making lookup tables and key-pair iteration.

bh · April 5, 2020, 5:39pm

Okay, that's a really good point. Providing an address_of primitive would definitely fit our policy of having a minimal set of primitives that enable writing anything in Snap! itself. But I worry that it might not work once a project is saved and reloaded. I would hate to have user-visible handles (pointers to pointers) or similar kludges. Jens? Could we license-plate every object, or would that slow us down too much?

wunder_wulfe · April 5, 2020, 5:53pm

You could try labelling them with a timestamp for when they are created along with some additional information like the type of block or data into some sort of hash or base64 string (like Discord does with Snowflakes). That should stay the same throughout, regardless of when something is reloaded and be fairly quick to compute. While you could justify also that (2) and (2) are the same thing, they are two different ring objects and thus should be treated as two, so using a hash map to check if a object exists by using another object that is similar should not be possible. Instead, they have to use the same object or use iteration. This should work for most Snap! objects but for JS native datatypes, I am unsure.