Split by XML

warped_wart_wars · August 26, 2021, 1:05am

For example, say I get the XML of my project Extended For with:

as a test of "How to get project xml with the url block?". How would I turn that into a list?

bh · August 26, 2021, 1:31am

Oh, that sounds like an interesting and useful project!

We don't have anything built in to do that. The first step, I guess, is to decide what you want the list to look like. I'm envisioning that
<foo>x y z</foo>
turns into
(list <foo> [listified x] [listified y] [listified z])

My first step would be to get each <...> in its own item, like this:

(The code could be a little shorter, but I think it's easier to read this way.)
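Since Snap! scripts can't be pasted as text, here is a rough Python sketch of the same first step: putting each <...> in its own item. The function name `split_tags` is made up for this illustration, and the sketch assumes no ">" ever appears inside a tag (for example inside an attribute value).

```python
import re

def split_tags(xml_text):
    """Split text so that every <...> tag becomes its own list item."""
    # The capturing group makes re.split keep the tags in the result.
    pieces = re.split(r"(<[^>]*>)", xml_text)
    # Trim each piece and drop the ones that are empty or only whitespace.
    return [piece.strip() for piece in pieces if piece.strip()]

tokens = split_tags("<foo>x <bar>y</bar> z</foo>")
print(tokens)
```

Text between tags stays together as one item here; splitting it further into words would be a separate step.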

Then comes the fun part: You go through the list, and every time you see a <foo> you recursively parse the list up to the matching </foo>. (I wouldn't bother keeping that </foo> in the result list, because it's implied by the end of the sublist that starts with the matching <foo>.)

warped_wart_wars · August 26, 2021, 1:44am

I'm imagining something like this:

And what if I have something like this, with a <foo> inside another <foo>?

<foo>
    x
    <foo>
        y
    </foo>
    z
</foo>

bh · August 26, 2021, 2:00am

Oh, right, I forgot about the stuff inside a tag.

As for a foo inside a foo, that's the joy of recursion: When you see the second foo, you make a recursive call to the same split procedure, and when it sees the matching (inner) /foo it returns the list it created, which will be an item in the outer foo's list.

What makes a recursive solution a little tricky is that the outer call needs to know how far the inner call got in the input list. So you attach an item number to the input list, and every time you read an item from the list you increment that item number. (Or I suppose you could just SET (input) TO (ALL BUT FIRST OF (input)).)
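A rough Python sketch of that idea, not anyone's actual Snap! code: the input is a list with each tag already in its own item, and `state` is a one-item list holding the index of the next unread item, so the outer call can see how far the inner call got. The names `parse_items` and `state` are invented for this sketch, and it assumes well-formed input with no self-closing tags, comments, or `<?xml ...?>` header.

```python
def parse_items(tokens, state, closing=None):
    """Collect items until the `closing` tag, or until the tokens run out.

    `state[0]` is the index of the next unread token. It is shared with
    every recursive call, so each token is read exactly once.
    """
    result = []
    while state[0] < len(tokens):
        token = tokens[state[0]]
        state[0] += 1  # every read advances the shared index
        if token == closing:
            return result  # the end tag itself is not kept
        if token.startswith("<") and not token.startswith("</"):
            name = token[1:-1].split()[0]  # tag name without attributes
            children = parse_items(tokens, state, "</" + name + ">")
            result.append([token] + children)  # sublist starts with its tag
        else:
            result.append(token)
    return result

tokens = ["<foo>", "x", "<foo>", "y", "</foo>", "z", "</foo>"]
tree = parse_items(tokens, [0])
print(tree)
```

The inner <foo> becomes a sublist inside the outer <foo>'s list, and the outer call carries on with "z" because the shared index has already moved past the inner </foo>.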

sir_kitten2 · August 26, 2021, 2:26am

You already have a file for reading xml in your source code. I'm sure it wouldn't take that much time.

warped_wart_wars · August 26, 2021, 2:41am

Yes, it's here. But I have no idea how to use that for a (split by xml).

dardoro · August 26, 2021, 2:53am

It's a naive XML2XPath parser.

The first column holds the tag with its attributes (or the end tag); the second holds the values.
You can then build the full path by concatenating tags or by parsing the attributes:
"catalog/book/author", "Gam..."
"catalog/book/title", "XML Dev..."
"catalog/book/genre", "Computer"

XML from Sample XML File (books.xml) | Microsoft Learn
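For comparison, here is a short Python sketch that produces the same kind of path/value pairs. It is not the Snap! script above: it leans on Python's standard `xml.etree.ElementTree` parser instead of a hand-written one, the function name `path_value_pairs` is invented, and the sample values are placeholders rather than the real contents of books.xml.

```python
import xml.etree.ElementTree as ET

def path_value_pairs(element, prefix=""):
    """Report (path, text) pairs for every element that has text of its own."""
    path = prefix + "/" + element.tag if prefix else element.tag
    pairs = []
    text = (element.text or "").strip()
    if text:
        pairs.append((path, text))
    for child in element:
        # Recurse with the path built so far as the new prefix.
        pairs.extend(path_value_pairs(child, path))
    return pairs

# Placeholder data shaped like a catalog of books (values are made up).
sample = """<catalog>
  <book id="bk101">
    <author>Example Author</author>
    <title>Example Title</title>
    <genre>Computer</genre>
  </book>
</catalog>"""

pairs = path_value_pairs(ET.fromstring(sample))
for path, value in pairs:
    print(path, value)
```

Attributes such as `id` are ignored in this sketch; they are available as `element.attrib` if the path should include them.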

warped_wart_wars · August 26, 2021, 3:03am

How would I do that?

bh · August 26, 2021, 3:09am

I'm sitting on my hands to keep myself from writing it for you...

For starters, why don't you leave XML aside temporarily and write a program to parse
((A B) C (D E)) into

Don't use SPLIT; just a FOR loop that looks at each character and does something special if it's ( or ).

But make sure you call the same procedure to parse the (A B) etc.

warped_wart_wars · August 26, 2021, 3:29am

Ok, and should I give you the program when I'm done?

Edit 2: Working on it now.
Edit 3: I'm getting stumped ~~by the outer parentheses~~.

This is my code so far:

And it doesn't work at all:

bh · August 26, 2021, 8:08pm

Could you share the project? It's not obvious to me just from looking at it why it's doing that. Thanks.

warped_wart_wars · August 26, 2021, 9:04pm

Here.

I haven't completed it, but it seems like it shouldn't be this complicated.

bh · August 26, 2021, 9:55pm

Ah. So I instrumented your code with a block at the beginning, and what I found was that in the recursive calls, the input was "(" the first time and "((" the second time. So that's why you're getting those weird lists of parentheses.

If you just work with the text string, when you return from a recursive call, the outer call won't know how much of the string the inner call parsed. That's why I suggested a data structure that combines the string with a pointer to where you're up to in it:

The "2" in listify skips over the initial open paren. If the text doesn't start with an open paren, it doesn't represent a list, so I just report the same text.

This can, of course, be done purely functionally, but parsing is one of the things that turn out way easier non-functionally, because of this business of reading each character only once. You can think of text with pointer as a sort of poor-man's object, with next character as its only method.
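Spoiler warning for anyone who wants to write it themselves. The following is a Python sketch of the text-with-pointer technique, written independently of the linked project, so the details surely differ from the Snap! version. The text and the pointer live together in a two-item list; because Python counts from 0, the pointer starts at 1 where the Snap! block uses 2. This sketch also strips leading spaces before checking for the open paren.

```python
def next_character(text_with_pointer):
    """Report the character at the pointer, then advance the pointer.

    Reports "" once the text is used up.
    """
    text, position = text_with_pointer
    text_with_pointer[1] = position + 1  # the caller sees this change
    return text[position] if position < len(text) else ""

def listify_helper(text_with_pointer):
    """Read up to the matching close paren and report the items as a list."""
    result = []
    item = ""
    while True:
        char = next_character(text_with_pointer)
        if char in ("(", ")", " ", ""):
            if item:  # a delimiter ends the item being built, if any
                result.append(item)
                item = ""
            if char == "(":
                # Same procedure parses the sublist; the shared pointer
                # tells us where to resume afterwards.
                result.append(listify_helper(text_with_pointer))
            elif char != " ":
                return result  # close paren, or the text ran out
        else:
            item += char

def listify(text):
    text = text.strip()
    if not text.startswith("("):
        return text  # not a list, so report the text unchanged
    return listify_helper([text, 1])  # 1 skips the initial open paren

print(listify("((A B) C (D E))"))
```

Two spaces in a row produce an empty item, which the `if item:` test quietly skips.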

I'm refraining from showing listify helper so as not to spoil it if you want to write it yourself, but if you'd rather just read mine it's here: https://snap.berkeley.edu/snap/snap.html#present:Username=bh&ProjectName=listify

Once you've absorbed this technique, you can go back to parsing XML. :~)

P.S. Why not just

? Because TEXT is just a local variable in this block, and changing its value won't be visible to its caller. By contrast, REPLACE in a list is visible to the caller, because the local variable and the caller's variable point to the identical list. (But you could just put the text in a one-item list, and next character could replace the one item with its all but first letter.)
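The same distinction shows up in Python, which makes for a compact illustration. Both function names below are made up for this sketch: the first reassigns its local variable, the second replaces an item in a list that the caller also holds.

```python
def drop_first_by_assignment(text):
    # Rebinds the local name only; the caller's variable is untouched.
    text = text[1:]

def drop_first_in_list(box):
    # Replaces the item inside the very list the caller passed in.
    box[0] = box[0][1:]

text = "(A B)"
drop_first_by_assignment(text)
print(text)  # still "(A B)"

box = ["(A B)"]
drop_first_in_list(box)
print(box)   # the one item has lost its first character
```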

P.P.S. I have a slight bug in LISTIFY, which should skip over leading spaces before doing anything else. But you get the idea.

warped_wart_wars · August 26, 2021, 10:21pm

I was still stumped, so I took a look at your code, and I still can't figure out how it parses it. For example, in what case is "item" an empty string?

bh · August 26, 2021, 10:22pm

If the input text has two spaces in a row.

warped_wart_wars · August 26, 2021, 10:24pm

I probably need something interactive to help me understand it. (Same with matrix multiplication, but that's for a different topic.)

bh · August 26, 2021, 10:28pm

Oh, try showing the local variables and visual-stepping it.

warped_wart_wars · August 26, 2021, 11:05pm

I kind of get it now.

bh · August 26, 2021, 11:13pm

Hmm, not sure what to make of "kind of." Want to ask a question? Want to drop the topic? Want to talk about XML?

warped_wart_wars · August 27, 2021, 1:12am

I guess I might know enough to talk about XML now.