Streams library 2.0 development

Here's another suggestion for an updated Streams library (perhaps @bh would create a separate topic for this?). Stream operations take considerable processing time (and memory? not quite sure how much actually). A quick win would be to combine operations into a single block. As an example - granted, probably the most obvious case - I made a block combining keep and map (much like Python's list comprehension). It's almost twice as fast as keep -> map:

Logic programming DEVLAB script pic (11)

And this is the code:

Logic programming DEVLAB script pic (12)

It's for processing a single stream only, whereas map () over stream[s] () is a multi-map. I suppose it could be extended to process multiple streams too, though I'm not sure whether such extension would have any reasonable applications.

I guess. I was imagining that FOR EACH would process only the head of the stream until somehow the tail was forced, sort of like TRY-AGAIN in the AMB evaluator. But I'm not sure how that'd actually happen. I suppose imperative programming such as FOR EACH doesn't really fit well with the idea of streams, which beg for functional programming. For that reason I'm tempted to say we shouldn't have FOR EACH in a stream library. But I'm open to discussion...

Yeah, that's better.

Okay. I wasn't thinking the other way was constant time (per item); I was thinking that you thought your way was constant time.

The trouble is that there are too many imaginable ways to combine HOFs. For example, map->keep instead of keep->map. Keep->map->keep. Map->keep->map. Pretty sure I've used all those patterns. Want to list the ones you'd include?

According to SICP, stream-for-each is “useful for viewing streams”. Even so, most of what it can do may also be implemented using stream-map. The best argument to include it into the Streams library may well be the very fact that it’s “different”.

That would have required magical thinking on my part.

How reminiscent of cadadr and all that … keep -> map (~ list comprehension) is almost certainly useful - 10 million Python developers can’t be all wrong. :smirk: (and many other languages)
I happened to employ the reverse order (map -> keep) as part of my improved for each. You mentioned having used several different patterns composed of 2-3 primitives … would it make sense to try and build a block that will execute any pattern of keep and map operations? or even any sequence of operations at all?

We have that; it's PIPE. But you're trying get an efficiency improvement by not calling the individual HOFs. I suppose we could do that by compiling the nested HOFs into one recursive procedure, using metaprogramming. Not sure I want to take that on right now, though!

Yeah, I guess.

I mean, constant time per item. Not noticing the CONTAINSes. Not magical thinking, just overlooking something. But you didn't do that either. :~)

Sure they can: They're all programming in Python!

It's actually not that hard - I went about building a prototype anyway. Doing so I encountered an issue with keep vs. stream-keep. Would you (@bh) agree that items should be handled in the same way by keep and stream-keep?

They do however handle items-that-are-a-list differently:
Sorting streams script pic
Sorting streams script pic (1)

BTW command-if is more or less consistent with stream-keep:

Sorting streams script pic (3)

... whereas reporter-if takes yet another approach:
Sorting streams script pic (2)

I fully understand the underlying dilemma, and am aware that how reporter-if should handle items-that-are-lists has been discussed extensively between you and jens (and others). However I didn't find anything about keep and stream-keep in this context. Aligning keep and stream-keep shouldn't be too difficult (though especially changing the regular keep may affect some existing user code).

Here’s my proposal for an almost universal keep / map combiner on streams. No metaprogramming! The list of operations is executed from right to left, or from the bottom up:

This is the code:

with:

Is it useful? hmm ... not sure - but it was fun to create! I used a kind of "may be" mechanism for integration of boolean operations (~ keep on a 1-item scale):

  • the item to be operated on is formatted as a list: (value);
  • if the operation is a reporter, the result of the operation is also formatted as a list: (intermediate);
  • if the operation is a predicate, the result is a Boolean: true = keep; false = discard.

With a predicate as operation, if the item (or intermediate result) to be operated on is a list, I have (for now) followed the convention of Snap! command-if and stream-keep.

Runtime performance

Logic programming DEVLAB script pic (13)

Less ambitious: map -> keep

Something else

I wonder if the run time performance (and memory use?) of streams could be improved by using a different data structure. During my research of partial sorting (in 2022) I developed a "Mickey Mouse" special purpose version of the streams mechanism, which proved much faster. By now my understanding of streams has much improved, shall I (at least briefly) look into generalizing this approach? As far as I can see there's only a few blocks that would have to be rewritten, notably: head of stream (), tail of stream (), () in front of stream (), and optionally: is stream () empty, and the empty stream. Plus a conversion block from "old" to "new" perhaps.

For starters:


That's the value of the CALL, no matter who's calling CALL. When the function being called is >, I'd argue that that's what the user wants, since lists aren't ordered, as numbers are, and so there's no right answer to "is this list greater than 1?" (See that unsightly space between the 1 and the question mark? That's because the font is properly making digits monospaced, but it's a sans serif font, so there's no way for it to make the 1 actually take up the same space as a 5.)

If the predicate were = instead of >, there wouldn't be such an obvious right answer, because lists are in its natural domain. We are in fact conservative about this, and don't hyperize =. We do hyperize arithmetic operators and Boolean operations. We don't hyperize custom blocks, although the code in the block can check for a list as input. (Maybe we should have a HYPERIZED primitive that takes a function as input and reports a hyperized version, to make that easier. There's one in the APL library.) Then there's JOIN, which bizarrely does consider lists to be in its natural domain. Especially now that we have a FLATTEN primitive, maybe we should reconsider that, although changing it at this point would be a flag day event.

Since KEEP FROM STREAM is built with CALL, it's not surprising that it has the same behavior. That doesn't necessarily make it right, of course. I think it is right, though, since, again, lists aren't in the natural domain of >. On the other hand, your example comes out less pleasantly if, as one might expect, the data structure were streams all the way down:


So if we're going to hyperize KEEP FROM STREAM, we have to protect streams from being treated as lists. :~(

As for the primitive KEEP, it doesn't have to worry about protecting streams, so we can do whatever we want, except that since it's primitive that means doing whatever Jens wants, although as you say, changing its behavior would be a flag day, and Jens hates those. (Although we've done them in the past, notably when we changed the behavior of ITEM with a list as the index input.) There's an especially good case for hyperizing it if the list input is a uniform array:

I'll take all this up with the team.


Command IF, not hyperized, takes the view that anything that isn't False is true (as in Scheme), except that, regrettably, what it really does is say that anything that isn't Javascript-falsy is true. So in particular the list (True False) is true. One could argue that we should throw an error instead, but we're unlikely to change that at this late date.

Reporter IF, however, is hyperized, so given the list (True False) it turns True into True and False into False.

The difference between the two IFs is consistent with our general philosophy that only reporters are hyperized, although we recently introduced an exception, iirc, and I'd remember which command it is if I were more awake. Also, how would we hyperize command IF? Start a thread for each item? Presumably the predicate would report different values for different items, as in your example.


I'm aware that one of my flaws is overengineering things, but still, I'd be happier if users could answer some of these questions for themselves. One way to do that would be to have an ATOMIC LIST primitive, just like LIST except that it flags the list as exempt from being descended into by hyperization. IN FRONT OF and ALL BUT FIRST OF would preserve the flag. So would KEEP, and probably so would MAP, although I'd want to think more about that when awake. But I'm betting Jens will say that that would slow things down too much.


I'll look at your other message tomorrow.

I don't agree: stream-keep tests for head of stream only, so it's just the structure of the head that matters - and that isn't any different from item 1 of any regular list.

:+1:

Optimizing streams
I tried my hand at optimizing the streams mechanism. I had hoped to be able to speed things up a bit, but haven't figured out how (as yet).

The main drivers of the mechanism's comparative slowness are, I think:

  1. Production: The to-and-fro within the series of calling and producing functions, for each individual data item.

  2. Retrieval: the explicit (= user code level) "linked list" data structure

Combining functions (such as keep and map) may speed up production, at least somewhat.
Ideally (from an execution speed pov), instead of cascading stream functions (stream-in, stream-out), any production chain would contain at most a single stream – i.e. all data processing is done EITHER in front of OR behind the stream, by the likes of:

  • untitled script pic (54) (stream-in only);

  • untitled script pic (56) and untitled script pic (55) (stream-out only).

Perhaps some more (example) blocks could be developed that are either stream-in or stream-out only?

As for changing the data structure: from a retrieval point of view a simple list of all stream items that have already been evaluated would be ideal. I'm not sure if this could fully replace the existing user code level linked list - more precisely put: the part thereof concerning already evaluated items ... perhaps both structures would have to be complementary.

Better even: if streams and their essential operations (head of, tail of, in front of, empty stream, empty?) would be implemented as Snap! primitives. But that's a matter of (future) priorities, I suppose.

Meanwhile I did some more experiments, and I give up: streams are inherently slow, I guess; they may provide external benefits, but the internal workings are not going to be much more efficient whatever one tries.

Think big, start small :smirk:
For now I've found the tinest of changes to the existing data structure, that may (only so slightly) reduce (?) the data usage. The current data structure:

(1) head, evaluated (2) tail, unevaluated (3) was tail evaluated yet? (4) link to tail, after evaluation

could (IMO) safely be replaced by:

(1) head (2) tail, unevaluated OR link to tail, after evaluation.

This is the relevant (changed) code:

Sorting streams script pic (5)

Sorting streams script pic (4)

Demo:
Sorting streams script pic (6)

This is very clever. Personally, I'd write it to call the functions from left to right, the way PIPE does. Normal people not accustomed to functional programming think of it as a quasi-imperative script. The only thing that bothers me about your approach is that it won't map a predicate over a stream so as to make a stream of Booleans. I don't often want to do that, but students do it all the time. (They should use KEEP instead, usually, but they don't.)

It's in MAP, not in KEEP, that we have to distinguish streams from regular lists. (Read my message again. ;~) )

What do you think about my example?

It should find the numbers inside the inner stream, rather than KEEPing the stream itself, which is what it does.


I'm not sure what you mean by "in front of" or "behind." So I don't really understand this whole section of your message.

This works unless the elements of the stream are scripts. I grant that that'd be unusual, but I wanted to make streams that could handle anything. So perhaps we could compromise by reusing item 2 for the evaluated tail, but keep the flag as item 3. But honestly I'd rather optimize time than space.

I feel my implementation is more “natural”; I see your point, though.

I had not considered that yet. I may want to implement a type selector block so as to deal with this issue. BTW I can see at least two more types of operations: combine (over the stream from item 1 to present, or any moving window - if it can be done at all), and spawn (create multiple output items from a single input item), which could be integrated into such solution.

Suppose one requires a list of even numbers, generated on-demand. Three different ways to achieve this, all of them based on:

The first solution multiplies all items (integers) by 2 “in front of” the stream, the third way does it “behind” the stream, whereas the second solution uses a function in the middle (stream-in, stream-out). The latter is clearly slower executed than the other two:

Same.

We can do that with just MAP. For example, start with a constant stream:

Now watch what the following MAP does:


But this:

generates all 10,000 numbers right now, because it makes a plain list, not a stream!

I’ve been aware of that. My demonstration was meant to show that a chain with two streams and a stream-to-stream (map) operation in between takes much more processing time than a chain with one stream and the mapping in front of or behind the stream. I had 10k items generated all at once in order to create some mass and make the runtime difference visible (1k would have been fine, too; but with just a single item the result would have been unreliable due to random influences).

Even so I made a mistake:

is not realistic, it should have been like:

Multiplicate stream items
I mentioned I was thinking about a spawn operation. The idea is a single input stream element may generate multiple output stream elements, individually mapped. So e.g. (. x .) -> (. x2 (print x) .). Each of the outputs would also be conditional - it’s a kind of antagonist of keep. Such operation could facilitate splitting a stream: items may be marked such that consumers will recognize items that may be of relevance for them.

I dipped a toe in the water by constructing a simple item duplicator:

In the end stream-spawn should do something like the following, but with streams.

Actually, spawn can do anything list comprehension can do, and more.
Perhaps spawn is not the best of names, as it usually refers to new processes, not data items; multiply may be better, it’s just that it could be mixed up with multiplication as an arithmetic operator.

OK, I see. I still would rather have a HOF compiler than a user-visible complicated toolkit. Nesting HOFs is such a great abstraction!

Yeah, not MULTIPLY. :~) Although it's ugly, I think the most expressive name would be ONE-TO-MANY. (Yeah, "many" could be zero. Still.) Or maybe CALL MULTIPLE?

That orange block needs work, too. I don't like the "else false" because (I'm making this complaint a lot lately...) it rules out a list of Booleans. I'd be inclined to have it not report anything, and have SPAWN (whatever it ends up being called) catch the resulting error.

And that always-empty input slot at the end just screams out for dynamic scope! Or maybe metaprogramming to add a not-user-visible input, or something. So just IF _ REPORT _. (No parentheses needed, especially since the ring groups whatever's inside it.)

“Else false” is not applicable any more now, and no potential results are ruled out. The output is always a list, carrying 0, 1 or more items. These items might be Booleans, or lists of Booleans, or whatever.

I don’t like the always-empty input slot either, but will also keep away from metaprogramming if possible - do you have an example of what is meant with dynamic scoping?
The parentheses were to signify that the result is always a list; they don’t do anything of course; but perhaps they only confuse.

Oh that's much nicer!

Two name things: 1-2-MANY-PROJECTOR is misleading because, importantly, it might report an empty list. CALL MULTIPLE? Or of course it could be 0-1-2-MANY-PROJECTOR. :~)

And IF _ MAP _ is misleading because MAP implies calling a function for each item of a list, whereas this just calls the function once. (I read the code, I know that technically it doesn't call anything at all, but we're not telling the users that...) I'd make it IF _ CALL _ or even IF _ INCLUDE _ especially if you changed that second input to Any(Unevaluated).


Dynamic scope: In your previous version, the reason for the empty input slot at the end of ...W/INPUT () is that that orange block, in an inner ring, needs access to the input to the outer ring. The way we use empty slots as implicit placeholders for inputs makes this hard to talk about, so imagine your example looked like


So the inner procedure (the one that calls IF_REPORT) needs access to the variable VALUE from the outer procedure (the one that calls SPAWN). But if you just drag an orange VALUE oval into the definition of IF_REPORT, you'll get a no-such-variable error. That's the problem you're solving by putting VALUE into an input slot of IF_REPORT, so it can be used by the procedures it calls.

But if Snap! were dynamically scoped (instead of being lexically scoped), then IF_REPORT would automatically have access to the variables of the procedure that calls it, in this case SPAWN.

By the way, you don't have to go whole hog metaprogramming to get your caller's inputs. You can just say
untitled script pic (7)
Neat, huh? :~) This is a sort of backdoor dynamic scope. You can also ask your caller for the value of a specific variable with a ringed orange oval.


One thing I don't like about 1-2-MANY-PROJECTOR is that its input expressions have to be IF_REPORT_ calls. You can imagine wanting to include an expression that's not conditional on anything. I know, you can just put True in the first slot, or you can write another orange block with only one input that supplies the True itself. But as a naive user, I want to just drag in an expression without an orange block.

But maybe I'm just quibbling.

The block’s name is actually a pun on 1 (original value) to (= 2) many (image elements). And “many” in this context may be any non-negative integer, including 0. But if even you don’t understand it, the pun is probably to far-fetched.

I did something like that.

Thx4 the explanation of dynamic scoping. I may not going to be using it here, but probably it will come in handy at some other occasion.

I see what you mean. The issue here is that if a user doesn’t apply an dedicated block, they must (in the current version) insert an expression producing a list of two lambda’s, which is even more complicated.
Alternatively (like in the previous version) the inserted expression would have to produce a list of 1 or 0 items, which isn’t obvious either.
Ideally the user would be enabled to insert any expression, such as multiple items in front of stream issue script pic 2 or multiple items in front of stream issue script pic 3. The latter, however, always reports something (i.e. the value of the input). I tried a special version if (condition) then (expression), not reporting anything if the condition fails. However that causes an error (“hmm ... reporter didn’t report”) that can’t be caught by safely try … any suggestions?

Anomaly with Any (unevaluated)
For now I’m using an auxiliary block if (condition) apply (expression) (input types: Boolean (uneval.), and Any (uneval.), respectively). For the first input, I specified true as default; for the second I would like to have id of as default, but haven’t found out how yet - if no input is specified, the corresponding output will be: multiple items in front of stream issue script pic 5: when called, it produces NIL, and I don’t know how I can test for it in advance (I even tried comparing it with the output of an ad-hoc test reporter called Any (uneval.), but it was different (≠) ! the same holds for literals (constants), whereas comparing functions is OK. Try it here. My speculation: with Any (uneval.), if the input value is not a reporter, it is replaced internally by a unique ad-hoc reporter that will reproduce the value. But knowing so (if it is true) doesn’t solve the issue …

Why doesn’t this work? Solved! See post #118
I made a block to map the 1-2-many-projector (now renamed (item) -> 0 .. many images, using (rules...) over a stream. Though it’s very little code: it doesn’t work, even after hours of testing. More precisely: it gets stuck in an infinite loop. I am using a block heads (items) in front of stream (stream), which appears to work well (but I wonder if it actually does). Perhaps you (@bh), or anyone else reading this, could have a look - it’s probably something trivial I have consistently been missing. Same link as above. Thx in advance!

Long ago I read about some alleged primitive tribe somewhere whose language has only the number names "one," "two," and "many," so that's what I thought the name meant, and yeah, the two/to pun went over my head. (I think I later learned that the story about that tribe wasn't true, but was a misunderstanding by some anthropologist.)

What? Really? I'll file a bug report.

All the rest of your post, which requires that I read your code, will have to wait until I'm actually awake. (This is a long process!)

That was also part of the pun, of course. Plus “one too many”, and “want to many” (or much).

The secret is going to bed on time (or so they’ve told me :wink: )

Yeah, I tried to do that last night, but ended up having a fight with my CPAP machine, which thinks there's a leak in my mask. That's another thing for me to deal with when I'm awake.

But we old people sleep more than you whippersnappers.

And toki pona has wan, tu, and mute as their only numbers.