Re: Bypassing XML inconsistencies with REXML::StreamListener
nutsmuggler wrote:
> Hello folks.
> I am trying to build a simple XML parser to extract data from IBM
> translation manager memories. Here is a sample of such memory files:
...
>
> I need to bypass mistakes, and tell StreamListener: "when you
> encounter a faulty segment, don't bother!"
> How do I achieve this?
Don't use an XML parser to handle non-XML?
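One way to read that: cut the file into segments with plain string matching first, then hand each segment to REXML on its own and drop the ones that fail. A rough sketch follows. The tag name `seg`, the child tag `src`, and the helper `parse_good_segments` are all made up for illustration, since the real memory format isn't shown here.

```ruby
require 'rexml/document'

# Hypothetical segment tag; substitute whatever wraps one entry in the
# actual memory file.
SEGMENT_RE = %r{<seg\b.*?</seg>}m

# Returns the root element of every chunk that parses cleanly.
def parse_good_segments(source)
  roots = source.scan(SEGMENT_RE).map do |chunk|
    begin
      REXML::Document.new(chunk).root
    rescue REXML::ParseException
      nil # faulty segment: don't bother
    end
  end
  roots.compact
end

sample = '<mem><seg><src>a</src></seg>' \
         '<seg><src>b</wrong></seg>' \
         '<seg><src>c</src></seg></mem>'
texts = parse_good_segments(sample).map { |seg| seg.elements['src'].text }
```

Note that if a faulty segment is missing its closing tag, the non-greedy match runs on to the next `</seg>`, so that chunk and its neighbour are both discarded.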
Alternatively, have you tried the REXML pull parser? A bit more work in
that you have to explicitly pop items off the tag stack, but it may have
better options for recovering from bad markup.
However, the underlying parser may still barf in trying to segment the
source into tags and such.
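For reference, a minimal sketch of the pull-parser loop. The tag name `seg` and the helper `extract_segments` are hypothetical, and this does not recover from bad markup: it only keeps whatever was read before REXML raised.

```ruby
require 'rexml/parsers/pullparser'

# Collects the text of each <tag> element until the parser gives up.
def extract_segments(xml, tag = 'seg')
  parser   = REXML::Parsers::PullParser.new(xml)
  stack    = []   # open tags, maintained by hand
  segments = []
  buffer   = nil  # non-nil only while inside a <tag> element

  begin
    while parser.has_next?
      event = parser.pull
      if event.start_element?
        stack.push(event[0])
        buffer = String.new if event[0] == tag
      elsif event.text?
        buffer << event[0] if buffer   # event[0] is the raw text
      elsif event.end_element?
        stack.pop
        if event[0] == tag && buffer
          segments << buffer
          buffer = nil
        end
      end
    end
  rescue REXML::ParseException => e
    # The underlying parser refused the markup; keep what we have so far.
    warn "Stopped at malformed markup: #{e.message.lines.first}"
  end

  segments
end

good = extract_segments('<mem><seg>one</seg><seg>two</seg></mem>')
bad  = extract_segments('<mem><seg>one</seg><seg>two</wrong></mem>')
```

With the second input, the segment before the bad closing tag is kept and everything after it is lost, which is the limitation mentioned above.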
Also, I don't know if Hpricot is happy with non-HTML, but it's worth a
shot to see if it can read and "fix" the source before you pass it to
another parser. You'll want to check that any modifications made to the
input do not change the essential semantics.
(Or perhaps you could just use Hpricot and extract data with XPath.)
--
James Britt
"In Ruby, no one cares who your parents were, all they care
about is if you know what you are talking about."
- Logan Capaldo