#1 |
Hello folks.
I am trying to build a simple XML parser to extract data from IBM translation manager memories. Here is a sample of such a memory file:

    <NTMMemoryDb>
    <Description>
    </Description>
    <Segment>0000000001
    <Control>
    00012200000001178876638English(U.S.)ITALIANI BMIDDOCBB1CTMST. 000BB1CTmst.idd
    </Control>
    <Source><ibmiddoc company="ibm" docstyle="ibmxagd" ibmcopyr="2007" noteindent="no-noteindent" brand="default-brand"></Source>
    <Target><ibmiddoc company="ibm" docstyle="ibmxagd" ibmcopyr="2007" noteindent="no-noteindent" brand="default-brand"></Target>
    </Segment>
    <Segment>0000000002
    <Control>
    00000300000001178876638English(U.S.)ITALIANI BMIDDOCCONFIGUR. 000Configuration_PDSG.IDE
    </Control>
    <Source><titleblk>
    <title>Configuration information and guidelines</title>
    </titleblk></Source>
    <Target><titleblk>
    <title>Informazioni e istruzioni per la configurazione</title>
    </titleblk></Target>
    etc...

These memory files are quite similar to XML files, but I suspect they actually conform to another standard. In fact, they often include "open" (unclosed) tags. This is because they store segments of translation: when the translation refers to a website or an SGML document, the original HTML or SGML might be split into two or more parts. So I often encounter faulty segments, and the unclosed tags make REXML raise an error.
My code is quite simple:

    require 'rexml/document'
    require 'rexml/streamlistener'
    include REXML

    class Listener
      include StreamListener

      $segment = ""      # text collected for the current segment
      $result = ""       # every segment that matched so far
      $is_there = false  # true once the current segment mentions "blade"

      def tag_start(name, attributes)
        $segment << "EN:" if name == "Source"
        $segment << "IT:" if name == "Target"
      end

      def tag_end(name)
        if name == "Target"
          # Keep the segment only if the keyword was seen, then reset.
          $result << $segment if $is_there
          $segment = ""
          $is_there = false
        end
        puts $result if name == "NTMMemoryDb"
      end

      def text(text)
        $segment << text
        $is_there = true if text =~ /blade/
      end
    end

    listener = Listener.new
    parser = Parsers::StreamParser.new(File.new("bch01aad006_MEMORIA.EXP"), listener)
    parser.parse

I need to bypass mistakes, and tell StreamListener: "when you encounter a faulty segment, don't bother!" How do I achieve this?

Thanks in advance,
Davide
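One possible way to get the "don't bother" behaviour asked for here, as a minimal sketch and not a definitive answer: cut the file into `<Segment>` chunks first and hand each chunk to REXML on its own, rescuing `REXML::ParseException` so that one unclosed tag only costs that one segment. The helper names `well_formed_pairs` and `text_of` are made up for illustration, and the sketch assumes that segments are never nested and that each one ends with `</Segment>`.

```ruby
require 'rexml/document'

# Hypothetical helper: concatenate all text below an element,
# collapsing runs of whitespace into single spaces.
def text_of(element)
  element.children.map do |child|
    case child
    when REXML::Element then text_of(child)
    when REXML::Text    then child.value
    else ''
    end
  end.join.gsub(/\s+/, ' ').strip
end

# Hypothetical helper: return [source_text, target_text] pairs for every
# segment REXML can parse, and skip the segments it cannot.
def well_formed_pairs(raw)
  pairs = []
  # Assumption: segments are not nested and each ends with </Segment>.
  raw.scan(%r{<Segment>.*?</Segment>}m) do |chunk|
    begin
      segment = REXML::Document.new(chunk).root
    rescue REXML::ParseException
      next # faulty segment (for example an unclosed tag): don't bother
    end
    source = segment.elements['Source']
    target = segment.elements['Target']
    pairs << [text_of(source), text_of(target)] if source && target
  end
  pairs
end
```

It could be called as `well_formed_pairs(File.read("bch01aad006_MEMORIA.EXP"))`. The trade-off is that faulty segments are dropped entirely, so any translation stored in them is lost.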
#2 |
nutsmuggler wrote:
> Hello folks.
> I am trying to build a simple XML parser to extract data from IBM
> translation manager memories. Here is a sample of such a memory file:
...
>
> I need to bypass mistakes, and tell StreamListener: "when you
> encounter a faulty segment, don't bother!"
> How do I achieve this?

Don't use an XML parser to handle non-XML?

Alternatively, have you tried the REXML pull parser? It is a bit more work, in that you have to explicitly pop items off the tag stack, but it may have better options for recovering from bad markup. However, the underlying parser may still barf when trying to segment the source into tags and such.

Also, I don't know if Hpricot is happy with non-HTML, but it's worth a shot to see if it can read and "fix" the source before you pass it to another parser. You'll want to check that any modifications made to the input do not change the essential semantics.

(Or perhaps you could just use Hpricot and extract the data with XPath.)

--
James Britt

"In Ruby, no one cares who your parents were, all they care about is if you know what you are talking about."
- Logan Capaldo
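To illustrate the "don't use an XML parser" option, here is a rough sketch that pulls the Source/Target pairs out with a plain regular expression, so unclosed inner tags cannot raise anything. The constant `PAIR` and the method `matching_pairs` are hypothetical names, and the sketch assumes that every `<Source>` is immediately followed by its `<Target>`, as in the sample quoted in the thread.

```ruby
# Assumption: each <Source>...</Source> is directly followed by its
# <Target>...</Target>, separated by whitespace only.
PAIR = %r{<Source>(.*?)</Source>\s*<Target>(.*?)</Target>}m

# Hypothetical helper: return [source, target] pairs whose source text
# matches the given pattern. Inner markup is kept verbatim, whether or
# not it is well formed.
def matching_pairs(raw, pattern)
  raw.scan(PAIR)
     .select { |source, _target| source.match?(pattern) }
     .map { |source, target| [source.strip, target.strip] }
end
```

For example, `matching_pairs(File.read("bch01aad006_MEMORIA.EXP"), /blade/)` would return the pairs whose English side mentions "blade". Unlike a strict parser, this keeps faulty segments too, at the cost of leaving the inner tags in the output.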
#3 |
nutsmuggler wrote:
> Hello folks.
> I am trying to build a simple XML parser to extract data from IBM
> translation manager memories. Here is a sample of such a memory file:
>
> <NTMMemoryDb>
> <Description>
>
> </Description>
> <Segment>0000000001
> <Control>
> 00012200000001178876638English(U.S.)ITALIANI BMIDDOCBB1CTMST.
> 000BB1CTmst.idd
> </Control>
> <Source><ibmiddoc company="ibm" docstyle="ibmxagd" ibmcopyr="2007"
> noteindent="no-noteindent"
> brand="default-brand"></Source>
> <Target><ibmiddoc company="ibm" docstyle="ibmxagd" ibmcopyr="2007"
> noteindent="no-noteindent"
> brand="default-brand"></Target>
> </Segment>
> <Segment>0000000002
> <Control>
> 00000300000001178876638English(U.S.)ITALIANI BMIDDOCCONFIGUR.
> 000Configuration_PDSG.IDE
> </Control>
> <Source><titleblk>
> <title>Configuration information and guidelines</title>
> </titleblk></Source>
> <Target><titleblk>
> <title>Informazioni e istruzioni per la configurazione</title>
> </titleblk></Target>
> etc...
>
>
> These memory files are quite similar to XML files, but I suspect they
> actually conform to another standard. In fact, they often include
> "opened" tags; these because they store segments of translation; thus,
> when the translation is referred to a website or a SGML document, the
> original HTML or SGML might be split in two or more parts. So I often
> encounter faulty segments; open tags generate a REXML fault.

<snip>

It might be worth trying HTML Tidy in XML mode. I can't remember off the top of my head how it'll react to missing close tags, but it's worth a shot...

--
Alex
#4 |
Hpricot is my man :-)
Being an HTML parser, it's much easier to please. Here is the basic code I am using:

    require 'rubygems'
    require 'hpricot'

    doc = Hpricot.XML(open("bch01aad006_MEMORIA.EXP"))

    # Print each English segment that mentions "firmware",
    # followed by its Italian counterpart.
    doc.search("Source").each do |item|
      if item.innerHTML =~ /firmware/
        puts "EN: #{item}"
        puts "IT: #{item.next_sibling}"
      end
    end

The principle is quite simple, and the code is much more concise than the REXML solution. Thanks a million for the tip.

Davide