#1
Hi all.

I'm pleased to announce 0.0.1 (aka the "early adopters only" release) of my Uniforma library. It's here: http://rubyforge.org/projects/uniforma/

== What is it?

A library for parsing "simple text" formats (RD, Textile, Markdown, etc.) and generating output in various formats (including simple text, HTML/XML and more complex ones). The heart of the library is two DSLs: one for defining parsers and one for defining generators.

== Why?

1. While preparing the documentation for "one more serious library", I ran into a dilemma: write it in RD (to auto-generate everything with RDoc)? Or in Trac's wiki format (for uploading to a Trac site)? Or in Textile (for eventually uploading to a stand-alone site)? So I decided to write a conversion library/tool.

2. I use RedCloth (Textile) for all my work, and while trying to patch it for my needs I found it's a mess. I just need separate, clear descriptions of the "how is it parsed" and "how is it generated" aspects.

3. For my journalism I need MS Word output (I take no pleasure in editing text in MS Word, but the ability to generate it is a must). Right now I use a "Textile=>(RedCloth)=>HTML=>`winword mytext.html`" scheme, which has several flaws. I want to be able to easily define an MS Word generator (using win32ole, of course, no hand-made heroism).

== Show. Me. The. Code.

Usage:

  puts Uniforma::textile('*some text* "with links":http://google.com.').to_html_string

Output:

  <html><body>
  <p><b>some text</b> <a href='http://google.com'>with links</a>.</p>
  </body></html>

Defining parsers:
---
module Uniforma::Parsers
  class Textile < LineParser
    definition do
      ....
      #how to parse some line
      ....
      line /^h(\d+)\.\s+/ do
        para(:heading, :level => @_1.to_i)
      end
      ....
      #how to parse inline formatting:
      inline /__(.+?)__/, :italic
    end
  end
end
---

Defining generators:
---
module Uniforma::Generators
  class HtmlString < TextGenerator
    definition do
      ...
      #what to place around some "paragraph type"
      around(:heading) {|p| i = p.level; ["<h#{i}>", "</h#{i}>\n"]}
      ...
      #what to place around some "inline markup type"
      around(:italic) {["<i>", "</i>"]}
    end
  end
end
---

Uniforma is smart enough to allow:
* parsers for non-line-based formats (in fact, it also has one "toy" parser for HTML, which even works! on not-very-complex HTML documents)
* generators for non-text formats (I'm working on PDF and MS Word generators. They're not very hard to define with Uniforma)

== Important notes about current release

* This release shamelessly includes the htmlentities library by Paul Battley[1], without even mentioning it in the license files. This will be fixed ASAP.
* It's really an "early adopters" release. Almost no docs, and very, very poor tests. But it shows the idea and is a base for further work.
* This release includes parsers for Textile, RD and HTML, and generators for BBcode, RD and HTML. All of them are incomplete but tend to work.
* I'd like to hear opinions on whether the parser/generator DSLs look "right" from the point of view of a) native English speakers and b) real Ruby ninjas. You can examine my parsers in lib/uniforma/parsers/ and my generators in lib/uniforma/generators/

Again, the library is here: http://rubyforge.org/projects/uniforma/

Thanx.
Zverok.

1: http://rubyforge.org/projects/htmlentities/
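For readers wondering how a pair of DSLs like this might work under the hood, here is a minimal, self-contained sketch. It is not Uniforma's implementation: the class names (MiniParser, MiniTextile, MiniHtml), the block-argument style and the intermediate data layout are all assumptions made for illustration. It only shows the general idea of registering `line`/`inline` rules on the parser side and `around` wrappers on the generator side.

```ruby
# Hypothetical mini-DSL. None of these names come from Uniforma itself.
class MiniParser
  class << self
    def line_rules;   @line_rules   ||= []; end
    def inline_rules; @inline_rules ||= []; end

    # Register a paragraph-level rule; the block maps a MatchData to [type, attrs].
    def line(pattern, &handler)
      line_rules << [pattern, handler]
    end

    # Register an inline rule; capture group 1 becomes the node's text.
    def inline(pattern, type)
      inline_rules << [pattern, type]
    end
  end

  # Returns one hash per non-empty input line: { type:, attrs:, body: }.
  def parse(text)
    text.each_line.map(&:chomp).reject(&:empty?).map do |ln|
      pattern, handler = self.class.line_rules.find { |pat, _| pat.match?(ln) }
      if pattern
        m = ln.match(pattern)
        type, attrs = handler.call(m)
        { type: type, attrs: attrs, body: parse_inline(m.post_match) }
      else
        { type: :para, attrs: {}, body: parse_inline(ln) }
      end
    end
  end

  private

  # Splits a string into plain strings and [type, text] inline nodes.
  def parse_inline(text)
    self.class.inline_rules.reduce([text]) do |nodes, (pattern, type)|
      nodes.flat_map { |n| n.is_a?(String) ? split_inline(n, pattern, type) : [n] }
    end
  end

  def split_inline(str, pattern, type)
    out = []
    while (m = str.match(pattern))
      out << m.pre_match unless m.pre_match.empty?
      out << [type, m[1]]
      str = m.post_match
    end
    out << str unless str.empty?
    out
  end
end

class MiniTextile < MiniParser
  line(/^h(\d+)\.\s+/) { |m| [:heading, { level: m[1].to_i }] }
  inline(/__(.+?)__/, :italic)
end

class MiniHtml
  def self.wrappers; @wrappers ||= {}; end
  def self.around(type, &block); wrappers[type] = block; end

  around(:heading) { |a| ["<h#{a[:level]}>", "</h#{a[:level]}>\n"] }
  around(:para)    { |_| ["<p>", "</p>\n"] }
  around(:italic)  { |_| ["<i>", "</i>"] }

  def render(paragraphs)
    paragraphs.map do |p|
      inner = p[:body].map { |n| n.is_a?(String) ? n : wrap(n[0], {}, n[1]) }.join
      wrap(p[:type], p[:attrs], inner)
    end.join
  end

  private

  def wrap(type, attrs, inner)
    open, close = self.class.wrappers.fetch(type).call(attrs)
    "#{open}#{inner}#{close}"
  end
end

doc = MiniTextile.new.parse("h2. Title\n\nsome __italic__ text\n")
puts MiniHtml.new.render(doc)
```

Keeping the rules in class-level tables means the "how is it parsed" and "how is it generated" descriptions stay separate, which is the property the announcement asks for; a real library would also need escaping, nesting and multi-line paragraphs, which this sketch ignores.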
#2
Your work is interesting and sure looks like good Ruby to me.

I had to write a similar parser for Zena (to parse Textile additions and the zafu templates), and I thought that running so many regex evaluations over the full text (which can be long) would be too slow (please tell me if this is wrong). I therefore chose to use left-anchored regexes (/\A.../) and eat through the text only once. This has the added advantage that you enter different modes (tag parameters, comments, raw data, etc.) along the way, which makes it very easy to parse sub-languages from within these modes.

The parser is a two-step operation: 1. parse, 2. render. This might be overkill for the kind of transformations you need, but it is very interesting because the parsed elements can use knowledge of their context when they are rendered. I intend to do a Textile -> LaTeX transformation so that users can write zafu templates to generate PDF.

The current implementation of the parser is not as clean as yours, but it works. You can have a look at the parser here: http://dev.zenadmin.org/browser/trunk/lib/parser

Let me know what you think.

Gaspard
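The single-pass, mode-switching approach described above can be sketched with Ruby's standard `strscan` library, whose `StringScanner#scan` only matches at the current position (the same effect as a `\A` anchor on the remaining text). This is an illustration under assumptions, not Zena's parser: the `tokenize` function, the two modes and the token shapes are invented for the example.

```ruby
require "strscan"

# Hypothetical tokenizer: one left-to-right pass, switching between
# :text mode and :tag mode (where tag parameters are read).
def tokenize(src)
  scanner = StringScanner.new(src)
  mode    = :text
  tokens  = []

  until scanner.eos?
    case mode
    when :text
      if scanner.scan(/<(\w+)/)             # a tag opens: switch to parameter mode
        tokens << [:tag_open, scanner[1]]
        mode = :tag
      elsif scanner.scan(/[^<]+/)           # plain text up to the next "<"
        tokens << [:text, scanner.matched]
      else
        tokens << [:text, scanner.getch]    # stray "<": consume it so the scan always advances
      end
    when :tag
      if scanner.scan(/\s*(\w+)="([^"]*)"/) # name="value" parameter
        tokens << [:param, scanner[1], scanner[2]]
      elsif scanner.scan(/\s*\/?>/)         # end of tag: back to text mode
        tokens << [:tag_close]
        mode = :text
      else
        scanner.getch                       # skip anything unrecognised inside the tag
      end
    end
  end
  tokens
end

p tokenize('see <link href="/a"/> now')
```

Each character is consumed exactly once, and the set of patterns tried depends on the current mode, which is what makes it straightforward to hand a region of the input to a sub-language parser.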
#3
On 9/14/07, Jeff Barczewski <jeff.barczewski@gmail.com> wrote:
> To generate msword docs - it might be easier (and more portable) to simply
> write out the new xml form of word rather than using win32ole. I believe it
> would still have all the same capabilities but just represented in xml
> format.
>
> http://en.wikipedia.org/wiki/Office_Open_XML

OOXML's specification is over 6,000 pages long and full of idiosyncrasies. I don't know how much of it he'd need for his documents, but using OLE and Word is probably easier than trying to build an OOXML-compliant document generator from scratch. Which I guess is exactly why Microsoft made the spec that long in the first place, but I digress...

--
Bira
http://compexplicita.blogspot.com
http://sinfoniaferida.blogspot.com
#4
> Library for parsing "simple text" formats (RD, Textile, Markdown, etc.) and
> generating output in various formats (including simple text, html/xml and
> more complex ones).

I wrote deplate[1], which has similar goals (well, with the exception of source quality maybe ;-). The point here is of course that simple formats are easy to parse, so the question is how simple you mean by "simple".

> I want be able to easy define MS Word generator (using
> win32ole, of course, no hand-made heroism).

If simple is really simple, like rdoc-simple, why not simply import HTML? Although I slightly prefer the way OpenOffice handles HTML files.

If "simple" includes cross references, footnotes, endnotes, headers, footers, table of contents/tables/figures etc., I think you'll probably need:

- a general way to define counters and lists
- some notion of metadata (like index, footnotes, labels, section names etc.)
- a way to place text at an arbitrary position in the output document (e.g. for headers & footers), e.g. to move text to the top of the document, after packages are required but before the start of the body etc. deplate defines "slots" for this, which allow users to place an element at any position they want.
- in the long (or intermediate) run, you might also think about some plugin mechanism (e.g. e-mail obfuscation that can be loaded when converting the document without being hard-coded, although this could also be done by post-processing the output).

> * non-line based formats parsers (in fact, it also has one "toy" parser for
> HTML, which even works! on not-very-complex HTML documents)

From a pragmatic point of view, using hpricot and writing and map classes on its output could be the better strategy.

Anyway, I'm eager to see how this develops.

Cheers,
Thomas.

[1] http://deplate.sf.net
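To make the "slots" and counter ideas above a little more concrete, here is a small sketch. It is not deplate's API: the `SlottedOutput` class, its method names and the slot names are all hypothetical. The point is only that text is filed under a named slot as it is produced, and the final document is assembled in slot order rather than in production order.

```ruby
# Hypothetical slot-based output buffer. Names are illustrative, not deplate's.
class SlottedOutput
  def initialize(*slot_order)
    @order = slot_order
    @slots = Hash.new { |hash, key| hash[key] = [] }
  end

  # File a piece of text under a named slot; returns self so calls can be chained.
  def add(slot, text)
    raise ArgumentError, "unknown slot #{slot.inspect}" unless @order.include?(slot)
    @slots[slot] << text
    self
  end

  # Assemble the document in the declared slot order.
  def to_s
    @order.flat_map { |slot| @slots[slot] }.join
  end
end

out = SlottedOutput.new(:header, :body, :footnotes)
footnote_no = 0                                   # a simple document-wide counter

out.add(:body, "<p>Some claim")
footnote_no += 1
out.add(:body, "<sup>#{footnote_no}</sup></p>\n")
out.add(:footnotes, "<p>#{footnote_no}. The source of the claim</p>\n")
out.add(:header, "<title>Doc</title>\n")          # produced last, emitted first

puts out.to_s
```

A generator built this way can emit a footnote body at the moment it meets the footnote marker, and still have it appear at the end of the document.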
#5
> From a pragmatic point of view, using hpricot and writing and map
> classes on its output could be the better strategy.

BTW, this makes me wonder whether you could split your library into a parser and a formatter. A parser for wiki-like formats that generates something like what hpricot does would be something I would very much like to see.

Regards,
Thomas.
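As a rough picture of what "something like what hpricot does" could mean for a wiki-format parser, here is a minimal sketch of a searchable document tree. The `Node` struct and its `search` and `text` methods are assumptions for illustration; they are neither hpricot's nor Uniforma's API.

```ruby
# Hypothetical document tree that a wiki-format parser could hand to any formatter.
Node = Struct.new(:type, :attrs, :children) do
  # Depth-first search for all descendant nodes of the given type.
  def search(wanted)
    children.flat_map do |child|
      next [] unless child.is_a?(Node)
      (child.type == wanted ? [child] : []) + child.search(wanted)
    end
  end

  # Concatenated plain text of this node and everything below it.
  def text
    children.map { |child| child.is_a?(Node) ? child.text : child }.join
  end
end

doc = Node.new(:document, {}, [
  Node.new(:heading, { level: 1 }, ["Intro"]),
  Node.new(:para, {}, [
    "See ",
    Node.new(:link, { href: "http://google.com" }, ["Google"]),
    "."
  ])
])

links = doc.search(:link).map { |node| node.attrs[:href] }
plain = doc.text

p links
puts plain
```

With the parser producing such a tree, formatters (HTML, BBcode, plain text) become independent consumers, and callers can also query the document without rendering it.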