String encoding problem

d4s_over_dt4 · March 16, 2023, 11:42am

I have a string that contains four possible characters:0,1,_,).
The frequency of 0 and 1 are same.
) is less frequent than _
_ would not neighbor ) or _ ("))"is okay)
Encode it into a short as possible string.
There can be at most 20 different characters.
How can I encode it?(No using huffman trees,i have tried it and it makes an unbalance on 0 and 1 since they are not the least frequent)

d4s_over_dt4 · March 16, 2023, 11:44am

@here
joecooldoo can you help?

PS:Your algorithm should be better than "0=0,1=10,_=110,)=111" and "0=00,1=01,_=10,)=11"

bh · March 16, 2023, 3:27pm

Can you use two bits per character, so as to stuff four of them into a byte?

d4s_over_dt4 · March 16, 2023, 10:20pm

Yes,This is my original plan,but I found out that this wastes almost half of the bits
also i have to stuff them into a string instead of a bytearray

qw23 · March 17, 2023, 8:48am

Some questions to help the rest of us understand the problem:

What are the relative frequencies of each character? (approx.)
Part of the optimisation must be from knowledge of the frequency of series of characters. E.g. how often will “0“ be followed by “1”? (If the answer deviates from the relative frequencies of the characters)
What does this mean: “There can be at most 20 different characters”? I thought there were only 4. Do you mean that the length of the string to be coded is at most 20 characters? Or something else?

d4s_over_dt4 · March 17, 2023, 9:02am

0->0.47 1->0.47 _->0.04 )->0.02

it is impossible to have _),__,and )_,and the rest of the combinations have no special problems unlisted on the char frequency

After encoding.

d4s_over_dt4 · March 17, 2023, 9:06am

Full problem:Encode an Anyadic syntax tree with binary node values into a string with <=20 types of characters.
Anyadic means that the same symbol has different meanings when used as niladic,monadic,etc(i.e no information of arity)

qw23 · March 17, 2023, 9:38am

Thanks for you clarification. So the frequencies of “(“ and [underscore] are much lower than those of “0” and “1”. I choose to ignore the parts about anyadic trees and 20 different characters for now (IOW leaving it to you), and present a suggestion about compressing the information from the source (the object with “0”, “1”, “(“, and [underscore]). I don’t know if it will help you, but who knows:

Code “(“ and [underscore] as a sparse list; followed by a series of bits representing the remaining characters.

d4s_over_dt4 · March 17, 2023, 9:47am

like 00101_010)10100)) becomes 0001 00000101 0011 00000111 00001100 00001101 0010101010100 (then use hexadecimal)?
(spaces aren't really allowed,so length data is bound to 4 bits and indices 8,an index means to insert the stuff into that place)
Hmm maybe use a bitflag instead of two seperate counts?or use the 17th character for delimiter?
Maybe allow both schemes,since in some cases the ratio is closer to 1/4 1/4 1/4 1/4 due to short names and then seperate them by a bitflag?
I'll try to program this

d4s_over_dt4 · March 17, 2023, 12:08pm

How about this?https://replit.com/@imcute-aaaa/UngrammarlyTranslate-qw23#qw23encode.py(it was for a kinda language)

qw23 · March 17, 2023, 3:09pm

Sorry, I don't use Python (having nothing against it - just deeply appreciating Snap!'s user friendliness).
Thanks for the compliment of naming your program after my Snap id
Did it work out well?

d4s_over_dt4 · March 17, 2023, 11:45pm

Yeah,i see,someone who doesn't know how to swim doesn't hate swimmers or water either

np
I'm kinda making a fork

quite well

qw23 · March 18, 2023, 6:29am

Happy to have helped you