Re: Bypassing XML inconsistencies with REXML::StreamListener
nutsmuggler wrote:
> Hello folks.
> I am trying to build a simple XML parser to extract data from IBM
> Translation Manager memories. Here is a sample of one such memory file:
>
> <NTMMemoryDb>
> <Description>
>
> </Description>
> <Segment>0000000001
> <Control>
> 00012200000001178876638English(U.S.)ITALIANI BMIDDOCBB1CTMST.
> 000BB1CTmst.idd
> </Control>
> <Source><ibmiddoc company="ibm" docstyle="ibmxagd" ibmcopyr="2007"
> noteindent="no-noteindent"
> brand="default-brand"></Source>
> <Target><ibmiddoc company="ibm" docstyle="ibmxagd" ibmcopyr="2007"
> noteindent="no-noteindent"
> brand="default-brand"></Target>
> </Segment>
> <Segment>0000000002
> <Control>
> 00000300000001178876638English(U.S.)ITALIANI BMIDDOCCONFIGUR.
> 000Configuration_PDSG.IDE
> </Control>
> <Source><titleblk>
> <title>Configuration information and guidelines</title>
> </titleblk></Source>
> <Target><titleblk>
> <title>Informazioni e istruzioni per la configurazione</title>
> </titleblk></Target>
> etc...
>
>
> These memory files look a lot like XML, but I suspect they actually
> conform to another standard. In fact, they often include unclosed
> ("open") tags. This is because they store translation segments: when
> the translation comes from a website or an SGML document, the original
> HTML or SGML may be split into two or more parts. So I often encounter
> faulty segments, and the open tags cause REXML to raise an error.
<snip>
It might be worth trying HTML Tidy in XML mode. I can't remember off
the top of my head how it'll react to missing close tags, but it's worth
a shot...
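
If Tidy doesn't cope, a possible fallback is to skip the XML parser for
the outer structure: pull the Source/Target pairs out with a regular
expression, and use REXML only to check which fragments are well-formed.
This is a rough sketch, not tested against real memory files. The names
SEGMENT_RE, extract_segments and well_formed? are made up for
illustration, and it assumes every segment has exactly one Source
followed by one Target, as in the sample above.

```ruby
require 'rexml/document'

# Hypothetical pattern: segment number, then the raw Source and Target
# bodies. Assumes Source always comes directly before Target.
SEGMENT_RE = %r{<Segment>(\d+).*?<Source>(.*?)</Source>\s*<Target>(.*?)</Target>}m

# Returns one hash per segment without parsing the fragments as XML.
def extract_segments(text)
  text.scan(SEGMENT_RE).map do |id, source, target|
    { id: id, source: source.strip, target: target.strip }
  end
end

# True when the fragment parses as XML once wrapped in a dummy root
# element; unclosed tags make REXML raise a ParseException.
def well_formed?(fragment)
  REXML::Document.new("<root>#{fragment}</root>")
  true
rescue REXML::ParseException
  false
end

sample = <<~MEM
  <Segment>0000000001
  <Control>
  000BB1CTmst.idd
  </Control>
  <Source><ibmiddoc company="ibm"></Source>
  <Target><ibmiddoc company="ibm"></Target>
  </Segment>
  <Segment>0000000002
  <Control>
  000Configuration_PDSG.IDE
  </Control>
  <Source><titleblk><title>Configuration information and guidelines</title></titleblk></Source>
  <Target><titleblk><title>Informazioni e istruzioni per la configurazione</title></titleblk></Target>
  </Segment>
MEM

extract_segments(sample).each do |seg|
  status = well_formed?(seg[:source]) ? "well-formed" : "open tags"
  puts "#{seg[:id]}: #{status}"
end
```

The segments flagged as having open tags can then be kept as plain
strings, while the well-formed ones can go through REXML as usual.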
--
Alex