Partial sorting

I would like to pick the first (n) sorted items from a large unsorted list (length: L), in a runtime-efficient manner.

What I found to work, but not very efficiently, is a combination of two blocks:

  • from the List utilities library: sort (data) ordering with (function);
  • from the APL library: take (howmany) from (array).
    When combined, I call this algorithm A:
  • take (n) from [sort (data) ordering with (function)].
    Thus the list is fully sorted first, and subsequently the first (n) elements are selected from the result of the first operation. Even though both functions were implemented quite efficiently, a full sort of a large list takes unnecessary runtime.

One alternative approach I found to work (algorithm B) is to using “combine” to select the key of the first item, subsequently picking and deleting this item from the list, and recursively picking (n-1) items from the remainder of the list. For very small (n) this turns out more efficiently than a full sort etc., especially if (function) evaluation is trivial (alternatively, (function) would be evaluated once for each element of the list, and retained for multiple usage).
However for somewhat larger (n), e.g. when selecting the first 100 items from a list of 1000, the inherent inefficiencies of this approach tend to dominate, making it even less efficient than a full sort (such as employed in algorithm A).

Can anyone suggest a potentially faster third algorithm for (n < L) ∧ ¬ (n << L) ?

No need for that block, just do
untitled script pic

This returns a list of items, item numbers from the numbers from block, of the list. It makes a lot more sense if you do it.

I think there's an official best algorithm, which you can find on the web. But it's been a long time since I taught the algorithms course.

The first experiment I'd try would be to do a mergesort of the data, but using streams (lazy lists) and stopping when you get N numbers out of the merge. But it's possible that the overhead of setting up the streams will in practice outweigh the savings from stopping early.

"n is less than L and not n is bitwise left shift L"?
you should not use bitwises as predicates

there is something called arena sort

Out in the real world, "<<" means "much less than."

I couldn't find it in a web search. Can you provide a reference?

擂台 sort
You get a list of numbers.They fight on an arena.The smaller one gets killed.

Writing "arena" in Chineese does not help :wink:
Do you think of Tournament sort ?

Input list contains real or whole numbers?

I think this is an "efficient" approach as it only transverses the large list once

nope

bucket sort?

thats arena sort

@d4s_over_dt4, dardoro & cymplecy: thank you for pointing me to Tournament (or Arena) sort, being a one-pass approach. Not sure how much of the efficiency is lost however in sorting the intermediate results after every replacement, but I'll look into it.

By the way, where can I find how to paste Snap! blocks into a forum post? (like some of you did).

@ego-lay_atman-bay: thanks for the tip. "Item (numbers from (1) to (n)) of (list)" works well, and it's native Snap! (notwithstanding the APL-library being generally very useful when working with large sets)

@bh: I'm eager to try your approach. However I can't think of how to implement it, not being proficient at streams. Would you write a few lines of code demonstrating the essence ? Thanks beforehand!

You right-click on a script and select script pic.

Once it's downloaded - you can drag and drop it into a forum post

In here:
https://snap.berkeley.edu/snap/snap.html#present:Username=bh&ProjectName=sorting
is a collection of sorts, including mergesort, so start there.

Then you load the stream library, and wherever you see X replace it with Y:

item 1 of / head of stream
all but first of / tail of stream
in front of / in front of stream

And then you write FIRST (n) ITEMS OF STREAM (stream).
But as I said, I have no idea whether it'll be fast or slow.

P.S. To generate the data for sorting, you can

Cymplecy’s example is quite simple to implement - execution is not as fast as I was hoping for though. Contributing factors are: (1) the one-by-one comparison instruction, and (2) the frequent sorting of intermediate results.

Building on the same tournament sort principle I constructed the following block. It employs Sorting script pic 3 to speed up the comparison process, and intermediate results are sorted much less frequently (e.g. with n = 5: after 10, 20, 40, 80 … comparisons). This works especially well if the data are randomly distributed over the list.

It can be sped up somewhat more by leaving out the predicate input variable, and sticking with
Sorting script pic 4 .

Wait are you tring to let us help you on the day 1 problem?

I don't think they are, since they made the post before the puzzle went public.

Following your hints I tried to implement Mergesort for streams:.

Sorting streams script pic (6)

I couldn’t get it to work though.

Besides it seems hardly a suitable algorithm for postponing evaluation, since the latter would only make sense during the final merge operation, when most of the work has already been done. Selection sort, though very inefficient for full sorting, is perhaps best suited for deferred partial (or incremental) sorting; that is when the exact number of elements required to be extracted is unknown beforehand.

So I tried to construct a Selection sort algorithm using Snap! streams.

I guess I would need to turn the list of unsorted elements into a stream, using the
Sorting streams script pic 15 block, thus creating a finite stream …

(As an aside, I’m puzzled by the SNAP! implementation of infinite vs. finite streams. I would have expected both of the following blocks to yield the same value, vid. “1” (one):
Sorting streams script pic
Sorting streams script pic 2
In reality only the first block yields “1”, the second yields a list with elements “1” and “2”. Is that even how it’s supposed to work?)

… and then I would need to write an algorithm extracting the smallest element from a finite set of numbers that is part of the stream, and modifying the stream to keep the other numbers for possible next increments. I haven’t figured out that algorithm yet.

So I developed my own (alas completely problem-specific) version of a lazy list of items to be sorted one at a time. Perhaps you (or some other reader) would be willing to show me how to turn this idea into a Snap!-style stream a/o algorithm? Thanks again!

Preparation: I devised my own Mickey Mouse version of a lazy list: item 1 contains evaluated (= sorted) values, item 2 holds unsorted values; the algorithm is not a part of the lazy list. Initially all values are in item 2:

Incremental sorting: the smallest value of the unsorted bunch is reported, and transfered to to the sorted queue:

Random access: to any item of the (virtual) sorted queue. If necessary, it will be calculated (as will all values before it within the sorted queue):

This algorithm, or alternatively the (non-incremental) Tournament sort I developed from other forum contributors’ hints, works well enough for my current application, but I’m truly interested to understand how streams work and how they can be applied.

What is Sorting streams script pic 16 ?
It’s not in the Streams library. Would


be equivalent to
Sorting streams script pic 18 ?

And a more general question: how can I generate a useable (finite) stream based on an ordinary list of values? E.g.
Sorting streams script pic 19 appears not to be valid as
Sorting streams script pic 20
doesn’t yield (the same result as)
Sorting streams script pic 21 as, in my opinion, it should.

Sorry to keep bugging you - it’s just that I would like to understand (and master using) the stream concept, and its Snap! Implementation.

The items of a stream don't have to be numbers; they can be words or lists or streams etc. That's why using NUMBERS FROM as an input to STREAM just gives you one item, the list of numbers. And SHOW STREAM turns a stream into a list, so your one-element stream whose one element is the list (4 7 1) becomes a one-element list whose one element is the list (4 7 1).

The STREAM block's input is variadic, so you can say
untitled script pic (6)
to get a finite stream. If you want to use an expression instead of typing all the numbers in separately, drag the expression over the arrowheads in the STREAM block, so you get a red halo instead of a white halo, then let go:


and you get

which does what you want, as if you'd clicked the right arrowhead 999 times and filled in all 1000 input slots with the items of the list reported by NUMBERS FROM.

I'll think about the sorting issue later...