Streams library 2.0 development, part 1

qw23 · March 3, 2024, 10:01am

I’m still digesting your post (#87) - thx again for all of the explanation! I may react on other parts of the content later - however, there is something that is not sitting well with me and I would like to discuss that now:.
[quote="bh, post:87, topic:13828"]
the-empty-stream can actually be any value that STREAM-NULL? can find unambiguously. We use the empty list, but it could be (LIST [THE EMPTY LIST]) too.
[/quote]
Isn’t it fundamental to define the “base case” of streams?
I actually found the “could be” implementation in the map … block of the Streams library:

I suggest canonical versions of the empty stream and is stream () empty? will be added to the Streams library, and these will be used by all other blocks, where applicable.

Something different
Could / would you explain this to me:

with:

What surprises me is that while the third “unique” item is based on the number 9 (floor of (sqrt (9)) = 3), the stream of positive integers is actually scanned up to and including the number 16:

BTW two more suggestions for inclusion into a next version of the Streams library: uniques of stream (), and log () stream ().

bh · March 3, 2024, 7:56pm

Yes. As we're having this conversation I'm working on bringing the library more in agreement with what's in SICP. I'll publish an update soon. When I wrote the library, I just did the minimum work to get to showing off the prime stream, which I think is one of the wonders of the modern world.

Yeah, the forcing of a stream is one ahead of what's shown. Consider that when you make the stream of integers, its internal representation is (1 . <promise>) before you show any of it! That 1 is an actual number from the beginning, because CONS-STREAM delays only the tail of the stream, not the head.

You could have a version of CONS-STREAM that delays both of its inputs. I think other authors actually do it that way. But this way is, iirc, a little easier to work with. I'm not sure; I just did what SICP does.

UNIQUES maybe, although it's more valuable as a programming exercise than as a tool for practical stream development. I think not LOG, because it's non-functional, and because even partially turning a stream into a list makes it take up memory. Serious stream programming requires not keeping a pointer to the head of a stream as you cash in the promise for the tail, so that even in an infinite amount of time you're using a bounded, finite amount of memory.

qw23 · March 3, 2024, 9:43pm

I now expect to be going to use it for the logic programming interpreter, streams version. Besides I based my version of uniques on sieve, it’s structure is really very similar.

I feel Snap! could use some more debugging tools, and logging intermediate results is one of them.

Some more candidates (I think) for a Streams library update - both inspired by SICP - may be:

toontalk · March 4, 2024, 6:08am

Not if the producers are producing faster than the consumers.

qw23 · March 4, 2024, 9:59am

Wasn't the Big Idea behind streams: that data production is on-demand (as-needed)?

toontalk · March 4, 2024, 1:40pm

Whoops. You are right unless the streams are how concurrent processes communicate

qw23 · March 4, 2024, 4:46pm

Like I wrote in post #90, I expect to use uniques of stream () (and interleave) within the logic programming interpreter. I rewrote it to be more effcient than the one I proposed earlier:

It now has an extra argument: history, which is a list of previously found results (at the first call this should be the empty list, of course). I'm sure the same principle will make sieve more efficient, too - it's just not as elegant, by far, though, as the current Streams library version.

bh · March 4, 2024, 9:00pm

I don't understand the SHOW STREAM ... -1 in your implementation of FOR EACH. That will try to turn the infinite stream reported by MAP into an infinite list, so it will never return.

Any time you're generating a lot of data and then IGNOREing it, that's a red flag. It'd be one thing if you were just doing IGNORE MAP ..., which would generate a promise that could never be cashed in, but SHOW STREAM makes a plain old huge list, even if finite. I would just implement FOR EACH recursively:

Note that the input INDEX is variadic 0/0/1, i.e., semi-invisible to the user, meant only for the recursive call. If I were less lazy I'd write a FOR EACH HELPER that had INDEX as a plain number slot rather than variadic. If this ends up in the library I'll do it right.

I'm not convinced this is more efficient. CONTAINS is linear time, not constant. You could still be right but I'd have to think about it more carefully.

But, interesting that, like me in FOR EACH, you are surfacing to the user an input that's meant to be used only in the recursive call. We need to invent a way to do this invisibly.

qw23 · March 5, 2024, 10:51am

Only so in case of an infinite stream, which is fitting. With any finite stream as input, show stream () (-1) will definitely halt.

How about this implementation? (I added a stream filter)

The other mechanism (sieve style) compares each new number with all of the earlier generated stream output, too - so it isn’t in constant time either, just much slower, as demonstrated by the following example:

qw23 · March 5, 2024, 2:06pm

Here's another suggestion for an updated Streams library (perhaps @bh would create a separate topic for this?). Stream operations take considerable processing time (and memory? not quite sure how much actually). A quick win would be to combine operations into a single block. As an example - granted, probably the most obvious case - I made a block combining keep and map (much like Python's list comprehension). It's almost twice as fast as keep -> map:

And this is the code:

It's for processing a single stream only, whereas map () over stream[s] () is a multi-map. I suppose it could be extended to process multiple streams too, though I'm not sure whether such extension would have any reasonable applications.

bh · March 5, 2024, 7:50pm

I guess. I was imagining that FOR EACH would process only the head of the stream until somehow the tail was forced, sort of like TRY-AGAIN in the AMB evaluator. But I'm not sure how that'd actually happen. I suppose imperative programming such as FOR EACH doesn't really fit well with the idea of streams, which beg for functional programming. For that reason I'm tempted to say we shouldn't have FOR EACH in a stream library. But I'm open to discussion...

Yeah, that's better.

Okay. I wasn't thinking the other way was constant time (per item); I was thinking that you thought your way was constant time.

The trouble is that there are too many imaginable ways to combine HOFs. For example, map->keep instead of keep->map. Keep->map->keep. Map->keep->map. Pretty sure I've used all those patterns. Want to list the ones you'd include?

qw23 · March 5, 2024, 10:51pm

According to SICP, stream-for-each is “useful for viewing streams”. Even so, most of what it can do may also be implemented using stream-map. The best argument to include it into the Streams library may well be the very fact that it’s “different”.

That would have required magical thinking on my part.

How reminiscent of cadadr and all that … keep -> map (~ list comprehension) is almost certainly useful - 10 million Python developers can’t be all wrong. (and many other languages)
I happened to employ the reverse order (map -> keep) as part of my improved for each. You mentioned having used several different patterns composed of 2-3 primitives … would it make sense to try and build a block that will execute any pattern of keep and map operations? or even any sequence of operations at all?

bh · March 5, 2024, 11:32pm

We have that; it's PIPE. But you're trying get an efficiency improvement by not calling the individual HOFs. I suppose we could do that by compiling the nested HOFs into one recursive procedure, using metaprogramming. Not sure I want to take that on right now, though!

Yeah, I guess.

I mean, constant time per item. Not noticing the CONTAINSes. Not magical thinking, just overlooking something. But you didn't do that either. :~)

Sure they can: They're all programming in Python!

qw23 · March 6, 2024, 10:27am

It's actually not that hard - I went about building a prototype anyway. Doing so I encountered an issue with keep vs. stream-keep. Would you (@bh) agree that items should be handled in the same way by keep and stream-keep?

They do however handle items-that-are-a-list differently:
Sorting streams script pic
Sorting streams script pic (1)

BTW command-if is more or less consistent with stream-keep:

... whereas reporter-if takes yet another approach:
Sorting streams script pic (2)

I fully understand the underlying dilemma, and am aware that how reporter-if should handle items-that-are-lists has been discussed extensively between you and jens (and others). However I didn't find anything about keep and stream-keep in this context. Aligning keep and stream-keep shouldn't be too difficult (though especially changing the regular keep may affect some existing user code).

qw23 · March 6, 2024, 1:31pm

Here’s my proposal for an almost universal keep / map combiner on streams. No metaprogramming! The list of operations is executed from right to left, or from the bottom up:

This is the code:

with:

Is it useful? hmm ... not sure - but it was fun to create! I used a kind of "may be" mechanism for integration of boolean operations (~ keep on a 1-item scale):

the item to be operated on is formatted as a list: (value);
if the operation is a reporter, the result of the operation is also formatted as a list: (intermediate);
if the operation is a predicate, the result is a Boolean: true = keep; false = discard.

With a predicate as operation, if the item (or intermediate result) to be operated on is a list, I have (for now) followed the convention of Snap! command-if and stream-keep.

Runtime performance

Less ambitious: map -> keep

~~Something else~~

I wonder if the run time performance (and memory use?) of streams could be improved by using a different data structure. During my research of partial sorting (in 2022) I developed a "Mickey Mouse" special purpose version of the streams mechanism, which proved much faster. By now my understanding of streams has much improved, shall I (at least briefly) look into generalizing this approach? As far as I can see there's only a few blocks that would have to be rewritten, notably: head of stream (), tail of stream (), () in front of stream (), and optionally: is stream () empty, and the empty stream. Plus a conversion block from "old" to "new" perhaps.

bh · March 7, 2024, 6:23am

For starters:

That's the value of the CALL, no matter who's calling CALL. When the function being called is >, I'd argue that that's what the user wants, since lists aren't ordered, as numbers are, and so there's no right answer to "is this list greater than 1?" (See that unsightly space between the 1 and the question mark? That's because the font is properly making digits monospaced, but it's a sans serif font, so there's no way for it to make the 1 actually take up the same space as a 5.)

If the predicate were = instead of >, there wouldn't be such an obvious right answer, because lists are in its natural domain. We are in fact conservative about this, and don't hyperize =. We do hyperize arithmetic operators and Boolean operations. We don't hyperize custom blocks, although the code in the block can check for a list as input. (Maybe we should have a HYPERIZED primitive that takes a function as input and reports a hyperized version, to make that easier. There's one in the APL library.) Then there's JOIN, which bizarrely does consider lists to be in its natural domain. Especially now that we have a FLATTEN primitive, maybe we should reconsider that, although changing it at this point would be a flag day event.

Since KEEP FROM STREAM is built with CALL, it's not surprising that it has the same behavior. That doesn't necessarily make it right, of course. I think it is right, though, since, again, lists aren't in the natural domain of >. On the other hand, your example comes out less pleasantly if, as one might expect, the data structure were streams all the way down:

So if we're going to hyperize KEEP FROM STREAM, we have to protect streams from being treated as lists. :~(

As for the primitive KEEP, it doesn't have to worry about protecting streams, so we can do whatever we want, except that since it's primitive that means doing whatever Jens wants, although as you say, changing its behavior would be a flag day, and Jens hates those. (Although we've done them in the past, notably when we changed the behavior of ITEM with a list as the index input.) There's an especially good case for hyperizing it if the list input is a uniform array:

I'll take all this up with the team.

Command IF, not hyperized, takes the view that anything that isn't False is true (as in Scheme), except that, regrettably, what it really does is say that anything that isn't Javascript-falsy is true. So in particular the list (True False) is true. One could argue that we should throw an error instead, but we're unlikely to change that at this late date.

Reporter IF, however, is hyperized, so given the list (True False) it turns True into True and False into False.

The difference between the two IFs is consistent with our general philosophy that only reporters are hyperized, although we recently introduced an exception, iirc, and I'd remember which command it is if I were more awake. Also, how would we hyperize command IF? Start a thread for each item? Presumably the predicate would report different values for different items, as in your example.

I'm aware that one of my flaws is overengineering things, but still, I'd be happier if users could answer some of these questions for themselves. One way to do that would be to have an ATOMIC LIST primitive, just like LIST except that it flags the list as exempt from being descended into by hyperization. IN FRONT OF and ALL BUT FIRST OF would preserve the flag. So would KEEP, and probably so would MAP, although I'd want to think more about that when awake. But I'm betting Jens will say that that would slow things down too much.

I'll look at your other message tomorrow.

qw23 · March 7, 2024, 11:16am

I don't agree: stream-keep tests for head of stream only, so it's just the structure of the head that matters - and that isn't any different from item 1 of any regular list.

Optimizing streams
I tried my hand at optimizing the streams mechanism. I had hoped to be able to speed things up a bit, but haven't figured out how (as yet).

The main drivers of the mechanism's comparative slowness are, I think:

Production: The to-and-fro within the series of calling and producing functions, for each individual data item.
Retrieval: the explicit (= user code level) "linked list" data structure

Combining functions (such as keep and map) may speed up production, at least somewhat.
Ideally (from an execution speed pov), instead of cascading stream functions (stream-in, stream-out), any production chain would contain at most a single stream – i.e. all data processing is done EITHER in front of OR behind the stream, by the likes of:

(stream-in only);
and (stream-out only).

Perhaps some more (example) blocks could be developed that are either stream-in or stream-out only?

As for changing the data structure: from a retrieval point of view a simple list of all stream items that have already been evaluated would be ideal. I'm not sure if this could fully replace the existing user code level linked list - more precisely put: the part thereof concerning already evaluated items ... perhaps both structures would have to be complementary.

Better even: if streams and their essential operations (head of, tail of, in front of, empty stream, empty?) would be implemented as Snap! primitives. But that's a matter of (future) priorities, I suppose.

Meanwhile I did some more experiments, and I give up: streams are inherently slow, I guess; they may provide external benefits, but the internal workings are not going to be much more efficient whatever one tries.

Think big, start small
For now I've found the tinest of changes to the existing data structure, that may (only so slightly) reduce (?) the data usage. The current data structure:

(1) head, evaluated (2) tail, unevaluated (3) was tail evaluated yet? (4) link to tail, after evaluation

could (IMO) safely be replaced by:

(1) head (2) tail, unevaluated OR link to tail, after evaluation.

This is the relevant (changed) code:

Sorting streams script pic (5)

Demo:

bh · March 7, 2024, 10:35pm

This is very clever. Personally, I'd write it to call the functions from left to right, the way PIPE does. Normal people not accustomed to functional programming think of it as a quasi-imperative script. The only thing that bothers me about your approach is that it won't map a predicate over a stream so as to make a stream of Booleans. I don't often want to do that, but students do it all the time. (They should use KEEP instead, usually, but they don't.)

~~It's in MAP, not in KEEP, that we have to distinguish streams from regular lists. (Read my message again. ;~) )~~

What do you think about my example?

It should find the numbers inside the inner stream, rather than KEEPing the stream itself, which is what it does.

I'm not sure what you mean by "in front of" or "behind." So I don't really understand this whole section of your message.

This works unless the elements of the stream are scripts. I grant that that'd be unusual, but I wanted to make streams that could handle anything. So perhaps we could compromise by reusing item 2 for the evaluated tail, but keep the flag as item 3. But honestly I'd rather optimize time than space.

qw23 · March 8, 2024, 1:06am

I feel my implementation is more “natural”; I see your point, though.

I had not considered that yet. I may want to implement a type selector block so as to deal with this issue. BTW I can see at least two more types of operations: combine (over the stream from item 1 to present, or any moving window - if it can be done at all), and spawn (create multiple output items from a single input item), which could be integrated into such solution.

Suppose one requires a list of even numbers, generated on-demand. Three different ways to achieve this, all of them based on:

The first solution multiplies all items (integers) by 2 “in front of” the stream, the third way does it “behind” the stream, whereas the second solution uses a function in the middle (stream-in, stream-out). The latter is clearly slower executed than the other two:

Same.

bh · March 8, 2024, 1:59am

We can do that with just MAP. For example, start with a constant stream:

Now watch what the following MAP does:

But this:

generates all 10,000 numbers right now, because it makes a plain list, not a stream!