This short series of blog posts attempts to show how to use librevenge to facilitate writing of import filters for office document formats. It is focused on writing libraries, as I would like to encourage sharing of import filters between Open Source projects.
There has been no release of librevenge yet, but I do not expect any significant changes at this stage, so what I say here should be true later too. Despite that, librevenge is already used by 10 libraries, and more are in progress.
What is librevenge?
librevenge is a library that simplifies writing of import filters by providing interfaces for typical office file formats: text documents, presentations, vector drawings and spreadsheets. Using it frees import library developers from the necessity to invent their own interface and allows them to focus on writing the actual import code. It also simplifies integration of new import libraries into applications: to integrate a new library of a kind the application already handles, one only needs to write some boilerplate code that registers the importer into the application’s filter framework (and that code can typically be copied from a previous occurrence with just a slight modification).
The library replaces existing import interfaces from libwpd (text documents), libwpg (vector drawings) and libetonyek (presentations). Support for spreadsheets is new–we had not had that in any library previously. It is split into three parts:
- librevenge, containing the import interfaces and types used there (and for historical reasons also an SVG generator for vector drawing interface);
- librevenge-stream, containing the stream interface that is used for input data;
- librevenge-generators, containing implementations of the import interfaces that produce some useful output document formats (more about that later).
The code can be found here.
How does it work?
The import interfaces are event-based. In other words, they define callbacks that should be called at various stages of import of the source document, like `startDocument()`, `closeParagraph()` or `insertText()`. The caller must provide an implementation of the chosen interface that actually does something meaningful, e.g., builds an internal document model. Alternatively, the caller can use one of the prepared implementations (we call them generators) that are available in the librevenge-generators library. These are plain text and HTML for text documents; SVG for presentations and drawings; and CSV for spreadsheets. There are also ODF generators for all document kinds in libodfgen. Note that once one has implemented a generator, it can be used with all import libraries for the same kind of document.
The callback functions have (we hope) self-explanatory names. Most of them are paired; in that case the pairs are named either `startFoo()`/`endFoo()` or `openFoo()`/`closeFoo()`. Standalone callbacks are named `insertFoo()` or `defineFoo()`. All opening callbacks and most of the standalone callbacks take a single argument which is a property list. The closing callbacks never take an argument.
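To make the callback model more concrete, here is a minimal, self-contained sketch. It does not use the real librevenge headers: `MiniTextInterface`, `PropertyList` (a plain `std::map`), `PlainTextGenerator` and `importHelloWorld()` are stand-ins invented for this illustration, and the `fo:text-align` key is just an example property. Only the callback names mirror the ones mentioned above.

```cpp
#include <map>
#include <string>

// Stand-in for a property list: string keys mapped to string values.
// (Hypothetical; the real RVNGPropertyList is richer.)
using PropertyList = std::map<std::string, std::string>;

// Hypothetical, heavily reduced text document interface.
class MiniTextInterface
{
public:
    virtual ~MiniTextInterface() = default;
    // Paired callbacks: the opening one takes a property list,
    // the closing one takes nothing.
    virtual void startDocument(const PropertyList &props) = 0;
    virtual void endDocument() = 0;
    virtual void openParagraph(const PropertyList &props) = 0;
    virtual void closeParagraph() = 0;
    // Standalone callback.
    virtual void insertText(const std::string &text) = 0;
};

// A "generator": an implementation of the interface that produces output,
// here a plain text string with one line per paragraph.
class PlainTextGenerator : public MiniTextInterface
{
public:
    void startDocument(const PropertyList &) override { m_out.clear(); }
    void endDocument() override {}
    void openParagraph(const PropertyList &) override {}
    void closeParagraph() override { m_out += '\n'; }
    void insertText(const std::string &text) override { m_out += text; }
    const std::string &result() const { return m_out; }

private:
    std::string m_out;
};

// What an import library does: walk the source and fire callbacks.
// The "source document" is hard-coded here instead of being parsed.
void importHelloWorld(MiniTextInterface &document)
{
    document.startDocument(PropertyList());
    PropertyList para;
    para["fo:text-align"] = "center"; // example property only
    document.openParagraph(para);
    document.insertText("Hello, ");
    document.insertText("world!");
    document.closeParagraph();
    document.endDocument();
}
```

Passing a `PlainTextGenerator` to `importHelloWorld()` and then calling `result()` yields the paragraph text followed by a newline. Any other implementation of the same interface could be passed instead without touching the import code.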
The following sections contain more details about the helper types. All are defined in namespace `librevenge`, but that namespace is omitted for the sake of brevity.
Strings
`RVNGString` is used for passing strings through the interfaces. It always uses UTF-8 encoding.
Properties
`RVNGProperty` is an interface for specific properties handled by `RVNGPropertyList`. It contains convenience functions to extract objects of types supported by librevenge (except `RVNGPropertyList` itself).
Property lists
`RVNGPropertyList` maps string keys to `RVNGProperty` objects. It has factory functions for constructing and inserting properties for various data types: `int`, `double`, `RVNGString`, `RVNGBinaryData`, … Note that while the insertion functions for numbers allow specifying a unit, it is not advised to use anything but `RVNG_UNIT_INCH` (which is the default), as legacy generators do not really handle units, so the results might be incorrect.
Property list vectors
`RVNGPropertyListVector` is a sequence of `RVNGPropertyList` objects. This type is also used to implement nesting of property lists: `RVNGPropertyList` cannot contain another `RVNGPropertyList` directly, but it can contain an `RVNGPropertyListVector`.
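The nesting rule can be sketched with standard containers. This is not the librevenge API: `PropList`, `makeParagraphProps()` and `countTabStops()` are hypothetical stand-ins, and the `style:tab-stops` and `style:position` keys are only illustrative. The point is the shape of the data: a list holds simple values plus named vectors of further lists, never another list directly.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Stand-in for a property list (hypothetical, not the real class).
struct PropList
{
    // Simple properties, stored as strings for brevity.
    std::map<std::string, std::string> values;
    // Nested data: a key maps to a *vector* of property lists,
    // mirroring the role of a property list vector.
    std::map<std::string, std::vector<PropList>> children;
};

// Builds paragraph properties that carry two tab stops as nested lists.
PropList makeParagraphProps()
{
    PropList para;
    para.values["fo:text-align"] = "left";

    PropList tab1;
    tab1.values["style:position"] = "1in";
    PropList tab2;
    tab2.values["style:position"] = "2.5in";

    para.children["style:tab-stops"] = {tab1, tab2};
    return para;
}

// A consumer looks the child vector up by key and iterates over it.
std::size_t countTabStops(const PropList &props)
{
    const auto it = props.children.find("style:tab-stops");
    return it == props.children.end() ? 0 : it->second.size();
}
```

The positions are given in inches here, in line with the advice above to stick to the default unit.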
Binary data
`RVNGBinaryData` serves to store, well, arbitrary binary data 🙂 As it is quite a common operation, it is possible to convert to/from base64.
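For readers who have not met base64 before, the encoding direction looks roughly like this. The function `toBase64()` is a hypothetical free function written for this illustration, using the standard base64 alphabet with `=` padding; it is not how librevenge implements the conversion.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Encodes arbitrary bytes as base64 text (standard alphabet, '=' padding).
std::string toBase64(const std::vector<unsigned char> &data)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    std::string out;
    out.reserve((data.size() + 2) / 3 * 4);

    for (std::size_t i = 0; i < data.size(); i += 3)
    {
        const std::size_t left = data.size() - i; // bytes left, at least 1

        // Pack up to three bytes into a 24-bit group, zero-filling the rest.
        unsigned long chunk = static_cast<unsigned long>(data[i]) << 16;
        if (left > 1)
            chunk |= static_cast<unsigned long>(data[i + 1]) << 8;
        if (left > 2)
            chunk |= data[i + 2];

        // Emit four 6-bit symbols; pad where input bytes were missing.
        out += alphabet[(chunk >> 18) & 0x3f];
        out += alphabet[(chunk >> 12) & 0x3f];
        out += left > 1 ? alphabet[(chunk >> 6) & 0x3f] : '=';
        out += left > 2 ? alphabet[chunk & 0x3f] : '=';
    }
    return out;
}
```

Every three input bytes become four output characters, which is why base64 is a convenient way to embed binary data, such as images, in text-based output formats.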
String lists
`RVNGStringList` is just that: a list of `RVNGString` objects. The predefined drawing and presentation generators use an `RVNGStringList` for output, inserting each page/slide separately.
Streams
The `RVNGInputStream` interface, defined in the librevenge-stream library, serves to pass around the input data. It has the usual functions for a read-only stream: `readNBytes()`, `seek()`, `tell()`, `isEnd()` etc. It also has functions for handling internal structure, as that is useful for many formats (which use Zip, OLE2, etc. internally).
There are two implementations of `RVNGInputStream` available in librevenge-stream: `RVNGFileStream`, which also transparently handles Zip and OLE2, and `RVNGDirectoryStream`, which is useful for handling directory-based document formats. An application typically needs to implement its own stream type that wraps whatever internal stream type it uses.
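As a rough idea of what such a wrapper involves, here is a sketch of a read-only stream over a memory buffer. `MiniInputStream`, `Seek` and `MemoryStream` are hypothetical stand-ins; the method names follow the ones listed above, but the signatures and return conventions are assumptions, and the functions for structured (Zip, OLE2) input are left out entirely.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

enum class Seek { Start, Current, End };

// Hypothetical, reduced stand-in for a read-only input stream interface.
class MiniInputStream
{
public:
    virtual ~MiniInputStream() = default;
    // Returns a pointer to up to numBytes bytes and advances the position;
    // numBytesRead receives how many bytes were actually available.
    virtual const unsigned char *readNBytes(std::size_t numBytes,
                                            std::size_t &numBytesRead) = 0;
    // Returns 0 on success, -1 if the target had to be clamped.
    virtual int seek(long offset, Seek whence) = 0;
    virtual long tell() const = 0;
    virtual bool isEnd() const = 0;
};

// Wraps an in-memory buffer, standing in for an application's own stream.
class MemoryStream : public MiniInputStream
{
public:
    explicit MemoryStream(std::vector<unsigned char> data)
        : m_data(std::move(data)), m_pos(0) {}

    const unsigned char *readNBytes(std::size_t numBytes,
                                    std::size_t &numBytesRead) override
    {
        numBytesRead = std::min(numBytes, m_data.size() - m_pos);
        if (numBytesRead == 0)
            return nullptr;
        const unsigned char *p = m_data.data() + m_pos;
        m_pos += numBytesRead;
        return p;
    }

    int seek(long offset, Seek whence) override
    {
        const long size = static_cast<long>(m_data.size());
        long base = 0;
        if (whence == Seek::Current)
            base = static_cast<long>(m_pos);
        else if (whence == Seek::End)
            base = size;

        long target = base + offset;
        int status = 0;
        if (target < 0) { target = 0; status = -1; }
        if (target > size) { target = size; status = -1; }
        m_pos = static_cast<std::size_t>(target);
        return status;
    }

    long tell() const override { return static_cast<long>(m_pos); }
    bool isEnd() const override { return m_pos >= m_data.size(); }

private:
    std::vector<unsigned char> m_data;
    std::size_t m_pos;
};
```

An application would replace the buffer with its own stream object and forward each call to it; the import library only ever sees the interface.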
Creating an import library
So you have decided that you like librevenge and that the library for parsing format XYZ you are contemplating writing will use it. Because we want to make it as easy as possible to start, we have written a tool that creates a skeleton of a new project. It is called project-generator and it can be found in this repository.
It has several options, but only a few of them are really needed:
- `-p` sets the project name;
- `-d` sets a one-sentence description (used in the .pc file and in `README`);
- `-a` sets the main author and `-e` the author’s e-mail; both are used in .rc files and in `CREDITS`;
- `-D`, `-P`, `-S`, `-T` select the kind of document the library handles. `-D` is for vector drawings, `-P` for presentations, `-S` for spreadsheets and `-T` for text documents (this is the default).
Anatomy of a project
So we have created a new project named, let’s say, libfoo. Let’s take a peek at what is inside…
Build system
The project uses autoconf and automake for the build, as we believe that it is the least bad of all the existing build systems. In addition to that, there are project files for several versions of Microsoft Visual Studio in `build/win32`.
Headers
The public headers are in `inc`. The public interface is really minimal: a class (by default named `FooDocument`) that has two static functions:
- `isSupported()` takes an `RVNGInputStream` and tests whether the input has the right format;
- `parse()` takes an `RVNGInputStream` and an `RVNGXYZInterface` (which one depends on the document kind the library imports), reads the input and produces the document by calling `RVNGXYZInterface`’s callbacks.
That means that it is the caller’s responsibility to supply the input stream and the generator–by providing suitable implementations for them or using one of the existing ones from librevenge-generators and librevenge-stream. The library only uses the two interfaces.
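The shape of such a public class can be sketched without librevenge. In the sketch below, `std::istream` plays the role of the input stream and `MiniTextInterface` the role of the document interface; both are stand-ins, as are `RecordingDocument`, the invented `FOO1` magic line, the one-line-per-paragraph "format" and the `bool` return types. The generated skeleton may well differ in all of these details.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical, reduced stand-in for a text document interface.
class MiniTextInterface
{
public:
    virtual ~MiniTextInterface() = default;
    virtual void openParagraph() = 0;
    virtual void closeParagraph() = 0;
    virtual void insertText(const std::string &text) = 0;
};

// Records the callbacks it receives as a string; handy for testing.
class RecordingDocument : public MiniTextInterface
{
public:
    void openParagraph() override { log += "<p>"; }
    void closeParagraph() override { log += "</p>"; }
    void insertText(const std::string &text) override { log += text; }
    std::string log;
};

class FooDocument
{
public:
    // Cheap detection: rewind and look for the (invented) magic line.
    static bool isSupported(std::istream &input)
    {
        input.clear();
        input.seekg(0);
        std::string magic;
        return static_cast<bool>(std::getline(input, magic)) && magic == "FOO1";
    }

    // Reads the input and reports every line after the magic line
    // as one paragraph. Returns false if the format is not recognized.
    static bool parse(std::istream &input, MiniTextInterface &document)
    {
        if (!isSupported(input))
            return false;
        std::string line;
        while (std::getline(input, line))
        {
            document.openParagraph();
            document.insertText(line);
            document.closeParagraph();
        }
        return true;
    }
};
```

A caller would first try `FooDocument::isSupported()` on its stream and, if that succeeds, call `FooDocument::parse()` with whatever document implementation it wants to fill. The library itself never learns where the bytes come from or where the callbacks end up.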
Library
Since the library only works with streams, it has no notion of the path to the source document (as it cannot even know if the input stream is based on an actual file or just a memory buffer). Therefore, if a library needs to read other files (especially with paths relative to the input), it has to delegate that task to the caller in some way.
The code of the library is in `src/lib`. The project-generator produces `FooDocument.cpp` containing empty implementations of the two public functions, and `libfoo_utils.h` and `libfoo_utils.cpp` with some functions and types that we generally find helpful, but which we do not want to put into librevenge for various reasons. Many of the functions handle reading numbers and strings from `RVNGInputStream` (e.g., `readU32()` or `readCString()`), but there are other things too. (I am being deliberately vague here, as more functions can be added, or existing ones removed, in future versions of project-generator.)
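To give a flavour of what such reading helpers do, here is a sketch written against `std::istream` instead of `RVNGInputStream`. The signatures, the `bigEndian` parameter, the `readBytes()` helper and the choice to throw on a short read are all assumptions made for this illustration; the generated `libfoo_utils` code may look different.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Reads exactly n bytes into buf or throws if the stream ends early.
void readBytes(std::istream &input, unsigned char *buf, std::size_t n)
{
    input.read(reinterpret_cast<char *>(buf), static_cast<std::streamsize>(n));
    if (input.gcount() != static_cast<std::streamsize>(n))
        throw std::runtime_error("unexpected end of stream");
}

// Reads a 32-bit unsigned integer, little-endian unless told otherwise.
// Assembling the value byte by byte keeps the result independent of the
// byte order of the machine running the code.
std::uint32_t readU32(std::istream &input, bool bigEndian = false)
{
    unsigned char b[4];
    readBytes(input, b, 4);
    if (bigEndian)
        return (std::uint32_t(b[0]) << 24) | (std::uint32_t(b[1]) << 16)
             | (std::uint32_t(b[2]) << 8) | std::uint32_t(b[3]);
    return (std::uint32_t(b[3]) << 24) | (std::uint32_t(b[2]) << 16)
         | (std::uint32_t(b[1]) << 8) | std::uint32_t(b[0]);
}

// Reads bytes up to a terminating NUL, which is consumed but not returned.
std::string readCString(std::istream &input)
{
    std::string result;
    unsigned char c = 0;
    for (readBytes(input, &c, 1); c != 0; readBytes(input, &c, 1))
        result += static_cast<char>(c);
    return result;
}
```

Binary formats are typically a sequence of such fixed-size numbers and length-prefixed or terminated strings, so a parser tends to be built almost entirely out of calls like these.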
Unit tests
Unit tests, if any, should go into `src/test` and use CppUnit. When implementing a new test class, do not forget to use `CPPUNIT_TEST_SUITE_REGISTRATION(TestClassName)` at the end (in namespace scope). That macro registers the class in the default test suite at the test manager, so the tests from this class will actually be run. (This can be done manually, of course, but why bother?)
Command-line tools
Last but not least, project-generator creates several command-line conversion tools into formats suitable for the document kind: HTML and plain text for text documents; SVG for vector drawings; SVG and plain text for presentations; and CSV for spreadsheets. The sources for these tools are in subdirectories of `src/conv`, named by the output type.
There is also another tool, converting to the so-called “raw” format. The raw generator prints all callbacks and their arguments. It can also check that paired callbacks are properly nested. This is particularly useful during development, but we use the output for regression tests as well.
All these converters use the generators provided by librevenge-generators.
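How the raw generator checks nesting internally is not covered here, but the idea can be sketched with a stack. `NestingChecker` below is a hypothetical helper written for this illustration: every opening callback pushes its name, every closing callback must match the innermost open one.

```cpp
#include <string>
#include <vector>

// Tracks whether paired callbacks arrive properly nested.
class NestingChecker
{
public:
    // Call from every openFoo()/startFoo() callback, passing "Foo".
    void open(const std::string &what) { m_stack.push_back(what); }

    // Call from every closeFoo()/endFoo() callback, passing "Foo".
    // A close that does not match the innermost open is remembered
    // as an error.
    void close(const std::string &what)
    {
        if (m_stack.empty() || m_stack.back() != what)
            m_ok = false;
        else
            m_stack.pop_back();
    }

    // True if every close matched and nothing was left open.
    bool balanced() const { return m_ok && m_stack.empty(); }

private:
    std::vector<std::string> m_stack;
    bool m_ok = true;
};
```

A generator that wraps such a checker can run over a whole corpus of test documents and flag any parser change that starts emitting, say, a `closeParagraph()` while a span is still open.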
Stay tuned!
In the next part I will present a complete parser for an invented text document format.
1. As I understand, code goes like “libFilterName (e.g. libqxp, libwpd, etc.) ↔ librevenge ↔ application (e.g. LibreOffice, Calligra, etc.)”. But I still don’t understand how the three communicate with each other.
E.g. suppose I implemented a filter. I imagine, it could work like this: an application asks librevenge a list of doDetectFormat() functions, and applies them until one matches. librevenge, in the first place, gets the list by exploring certain directories in search of filter libs. Is it correct (well, I’m pretty sure it’s not)?
2. “generator”, if I understand correctly, is a fancy name for examples of interfaces, so somebody writing a filter could just implement some. But what’s really confusing: none of them in `inc/librevenge-generators` has the “extern” keyword; moreover, they’re members of a class. The problem is, they’re C++, so they’ve got to get mangled upon getting into a lib. This is just not gonna work, I’m not even sure if GCC mangles names the same way across compiler versions, not to mention clang and co.
3. Am I correct that the public interface consisting of `isSupported()` and `parse()` is the interface between librevenge and an application?
It seems to me that you think the import libraries are automatically-loaded “plugins” and your questions stem from that confusion. It’s not so. If a project wants to use an import library, it has to integrate it into its import mechanism. And there is no interface/protocol/whatever that import libs would be obliged to follow in its public API; if the APIs of the existing libs are similar, it’s because we feel no need to do it in another way…
A full explanation of how it all fits together would be too lengthy, so I’ll do it as another blog post (hopefully during the next few days). But if you’re too impatient, you can try to google ‘Document Liberation Project’ talk given at LGM 2014; IIRC Fridrich talked about this in the 3rd part of the talk (he can explain things better than I can, in any case).
Thank you, I’ve watched this video https://www.youtube.com/watch?v=cApNAvouYdY I didn’t find the slides though, and my listening comprehension of English isn’t perfect, but I think I understood everything.
Unfortunately the talk didn’t make things much clearer. From the video and your comment I’d guess librevenge was supposed to be a library that only defines a few input/output types, like RVNGInputStream. However, in this case it would be enough for it to be header-only, but I do see actual binary libraries in the list of files of the librevenge package.
Overall for now, I’ve got a feeling that there are import filter libraries, like libwpd, libqxp, which have to be integrated into an application, and they have nothing in common. And then there is “librevenge” which is there… I don’t know, just for lols :Ь
I haven’t finished the promised blog post yet, but, seeing you’re impatient, I’ll try to give a brief summary here: librevenge provides 4 document interfaces, some types that are used by them (that do have some impl., of course -> a library) and an input stream interface. An import library’s parsing function takes, as a minimum, an input stream and a “document” (an instance of an impl. of one of the import interfaces). It reads from the input and calls appropriate functions on the “document” object.
So, there are 3 steps in integration of a DLP library into an application: 1/ Provide an impl. of RVNGInputStream. 2/ Provide an impl. of RVNG*Interface. 3/ Write the necessary boilerplate to register the import filter, call its detection and parsing functions, etc., as needed by the application. The first 2 steps are one-time and can even be avoided by using pre-canned impls (e.g., LibreOffice or Calligra use libodfgen as an impl. of the document interfaces).
Okay, so I’ve explored AbiWordImportFilter as an example, and I think I get how it works.
The first argument to parse() is a `librevenge::RVNGInputStream`, which is the input data. It supports trivial operations, like `read()`, `seek()`.
The second one is RVNGTextInterface. As far as I understand, an import filter should parse the RVNGInputStream and call the corresponding functions on RVNGTextInterface to shape a representation of the document inside LibreOffice.
I have a question now. I looked through the interface functions (FTR: inc/librevenge/RVNGTextInterface.h), can you give some hints:
1. How to write a text above or below another (e.g. super- and sub- script, and ideally with controlled height).
2. How to write one text over another? E.g. writing a letter “-” over “Q” should result in “Q̶” (without unicode though).
1. Pass style:text-position property to openSpan. This property is copied from ODF (most of the usable properties are), so the value is the same: “sub” | “super” | % offset from baseline.
2. There’s no way to do that. Is there actually any word processor that supports that?
> 2. There’s no way to do that. Is there actually any word processor that supports that?
There is; moreover, you might be surprised, but this is a core feature of one of the most popular text processors in the world. I was aiming to make an import filter for LaTeX: making “office documents” from LaTeX is a recurring problem, it hurt me badly as a student, there’s a big demand, and I thought it could be a nice kickstarter project. Unfortunately my preparations for making at least a basic working filter took too long, so I had to move on.
FYI, LaTeX is widely used not only in scientific publications. E.g. I was a simple student, and my institute demanded papers in “office format”. Unfortunately(?) I am a Vim-mode user (technically I’m using Emacs with Evil-mode, but that’s not the point), and writing lots of text without the possibility to move around and edit stuff quickly is real torture. Lots of stuff happened back then, but sticking to the topic: LaTeX import support in LibreOffice would’ve made my life so much easier…