String encoding problem

I have a string that contains four possible characters:0,1,_,).
The frequency of 0 and 1 are same.
) is less frequent than _
_ would not neighbor ) or _ ("))"is okay)
Encode it into a short as possible string.
There can be at most 20 different characters.
How can I encode it?(No using huffman trees,i have tried it and it makes an unbalance on 0 and 1 since they are not the least frequent)

@here
joecooldoo can you help?

PS:Your algorithm should be better than "0=0,1=10,_=110,)=111" and "0=00,1=01,_=10,)=11"

Can you use two bits per character, so as to stuff four of them into a byte?

Yes,This is my original plan,but I found out that this wastes almost half of the bits
also i have to stuff them into a string instead of a bytearray

Some questions to help the rest of us understand the problem:

  1. What are the relative frequencies of each character? (approx.)
  2. Part of the optimisation must be from knowledge of the frequency of series of characters. E.g. how often will “0“ be followed by “1”? (If the answer deviates from the relative frequencies of the characters)
  3. What does this mean: “There can be at most 20 different characters”? I thought there were only 4. Do you mean that the length of the string to be coded is at most 20 characters? Or something else?

0->0.47 1->0.47 _->0.04 )->0.02

it is impossible to have _),__,and )_,and the rest of the combinations have no special problems unlisted on the char frequency

After encoding.

Full problem:Encode an Anyadic syntax tree with binary node values into a string with <=20 types of characters.
Anyadic means that the same symbol has different meanings when used as niladic,monadic,etc(i.e no information of arity)

Thanks for you clarification. So the frequencies of “(“ and [underscore] are much lower than those of “0” and “1”. I choose to ignore the parts about anyadic trees and 20 different characters for now (IOW leaving it to you), and present a suggestion about compressing the information from the source (the object with “0”, “1”, “(“, and [underscore]). I don’t know if it will help you, but who knows:

Code “(“ and [underscore] as a sparse list; followed by a series of bits representing the remaining characters.

like 00101_010)10100)) becomes 0001 00000101 0011 00000111 00001100 00001101 0010101010100 (then use hexadecimal)?
(spaces aren't really allowed,so length data is bound to 4 bits and indices 8,an index means to insert the stuff into that place)
Hmm maybe use a bitflag instead of two seperate counts?or use the 17th character for delimiter?
Maybe allow both schemes,since in some cases the ratio is closer to 1/4 1/4 1/4 1/4 due to short names and then seperate them by a bitflag?
I'll try to program this

How about this?https://replit.com/@imcute-aaaa/UngrammarlyTranslate-qw23#qw23encode.py(it was for a kinda language)

Sorry, I don't use Python (having nothing against it - just deeply appreciating Snap!'s user friendliness).
Thanks for the compliment of naming your program after my Snap id :wink:
Did it work out well?

Yeah,i see,someone who doesn't know how to swim doesn't hate swimmers or water either

np
I'm kinda making a fork

quite well

Happy to have helped you :slight_smile: