This short series of blog posts attempts to show how to use librevenge to facilitate writing of import filters for office document formats. It is focused on writing libraries, as I would like to encourage sharing of import filters between Open Source projects.
There has been no release of librevenge yet, but I do not expect any significant changes at this stage, so what I say here should remain true later too. Despite that, librevenge is already used by 10 libraries, and more are in progress.
What is librevenge?
librevenge is a library that simplifies writing of import filters by providing interfaces for typical office file formats: text documents, presentations, vector drawings and spreadsheets. Using it frees import library developers from the necessity to invent their own interface and allows them to focus on writing the actual import code. It also simplifies integration of new import libraries into applications: to integrate a new library of a kind the application already handles, one only needs to write some boilerplate code that registers the importer into the application’s filter framework (and that code can typically be copied from a previous occurrence with just a slight modification).
The library replaces existing import interfaces from libwpd (text documents), libwpg (vector drawings) and libetonyek (presentations). Support for spreadsheets is new; we did not have that in any library previously. It is split into three parts:
- librevenge, containing the import interfaces and types used there (and for historical reasons also an SVG generator for vector drawing interface);
- librevenge-stream, containing the stream interface that is used for input data;
- librevenge-generators, containing implementations of the import interfaces that produce some useful output document formats (more about that later).
The code can be found here.
How does it work?
The import interfaces are event-based. In other words, they define callbacks that should be called at various stages of import of the source document, like insertText(). The caller must provide an implementation of the chosen interface that actually does something meaningful, e.g., builds an internal document model. Alternatively, the caller can use one of the prepared implementations (we call them generators) that are available in the librevenge-generators library. These are plain text and HTML for text documents; SVG for presentations and drawings; and CSV for spreadsheets. There are also ODF generators for all document kinds in libodfgen. Note that once one has implemented a generator, it can be used with all import libraries for the same kind of document.
The callback functions have (we hope) self-explanatory names. Most of them are paired; in that case the pairs are named either openFoo() and closeFoo() or startFoo() and endFoo(). Standalone callbacks are named insertFoo() or defineFoo(). All opening callbacks and most of the standalone callbacks take a single argument, which is a property list. The closing callbacks never take an argument.
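To make the event-based idea concrete, here is a minimal sketch. It does not use the real librevenge headers: TextInterface, PropList, PlainTextGenerator and emitSampleDocument() are hypothetical stand-ins, and the interface is reduced to five callbacks that only follow the naming pattern described above.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for a property list: string keys mapped to string values.
// (Hypothetical; this is not the real RVNGPropertyList.)
using PropList = std::map<std::string, std::string>;

// Hypothetical, heavily reduced text interface: paired and standalone callbacks.
class TextInterface
{
public:
    virtual ~TextInterface() = default;
    virtual void startDocument(const PropList &props) = 0;
    virtual void endDocument() = 0;
    virtual void openParagraph(const PropList &props) = 0;
    virtual void closeParagraph() = 0;
    virtual void insertText(const std::string &text) = 0;
};

// A tiny "generator": it turns the callbacks into plain text.
class PlainTextGenerator : public TextInterface
{
public:
    void startDocument(const PropList &) override { m_out.clear(); }
    void endDocument() override {}
    void openParagraph(const PropList &) override {}
    void closeParagraph() override { m_out += '\n'; }
    void insertText(const std::string &text) override { m_out += text; }
    const std::string &result() const { return m_out; }

private:
    std::string m_out;
};

// What an import library does while parsing: emit events in document order.
inline void emitSampleDocument(TextInterface &doc)
{
    PropList centered;
    centered["fo:text-align"] = "center"; // illustrative key name only

    doc.startDocument(PropList());
    doc.openParagraph(centered);
    doc.insertText("Hello");
    doc.closeParagraph();
    doc.openParagraph(PropList());
    doc.insertText("world");
    doc.closeParagraph();
    doc.endDocument();
}
```

The import side only ever talks to TextInterface, so swapping PlainTextGenerator for another implementation changes the output format without touching the parser.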
The following sections contain more details about the helper types. All are defined in namespace librevenge, but that namespace is omitted for the sake of brevity.
RVNGString is used for passing strings through the interfaces. It always uses UTF-8 encoding.
RVNGProperty is an interface for specific properties handled by RVNGPropertyList. It contains convenience functions to extract objects of types supported by librevenge (except
RVNGPropertyList maps string keys to RVNGProperty objects. It has factory functions for constructing and inserting properties for various data types: RVNGBinaryData,… Note that while the insertion functions for numbers allow specifying a unit, it is not advised to use anything but RVNG_UNIT_INCH (which is the default), as legacy generators do not really handle units, so the results might be incorrect.
Property list vectors
RVNGPropertyListVector is a sequence of RVNGPropertyList objects. This type is also used to implement nesting of property lists: RVNGPropertyList cannot contain another RVNGPropertyList directly, but it can contain an RVNGPropertyListVector.
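The nesting rule is easier to see in code. The following is a sketch with a hypothetical stand-in type, not the real RVNGPropertyList API: PropList, makeParagraphProps() and all key names are assumptions chosen for illustration. Numbers are stored as inches, in line with the advice above.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a property list.
struct PropList
{
    // Plain values: numbers are kept in inches, text as strings.
    std::map<std::string, double> inches;
    std::map<std::string, std::string> strings;
    // Nesting goes through a vector of child lists, never a direct child list.
    std::map<std::string, std::vector<PropList>> children;
};

// Describe a paragraph with two tab stops as nested property lists.
inline PropList makeParagraphProps()
{
    PropList tab1;
    tab1.inches["position"] = 1.0;
    PropList tab2;
    tab2.inches["position"] = 2.5;

    PropList para;
    para.strings["text-align"] = "left";
    para.inches["margin-left"] = 0.5;
    para.children["tab-stops"] = {tab1, tab2};
    return para;
}
```

A generator receiving such a list would look up "tab-stops", iterate over the vector, and read each child list in turn.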
RVNGBinaryData serves to store, well, arbitrary binary data 🙂 As it is quite a common operation, it is possible to convert to/from base64.
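For readers who have not met base64 before, here is a self-contained sketch of the encoding direction. It is a generic implementation of the standard alphabet with '=' padding, written for illustration; toBase64() is a hypothetical name and the code says nothing about how librevenge implements the conversion internally.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Encode arbitrary bytes as standard base64 (RFC 4648 alphabet, '=' padding).
inline std::string toBase64(const std::vector<unsigned char> &data)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve((data.size() + 2) / 3 * 4);
    for (std::size_t i = 0; i < data.size(); i += 3)
    {
        const std::size_t left = data.size() - i; // bytes remaining, at least 1
        // Pack up to three bytes into a 24-bit group, zero-filling the rest.
        unsigned long group = static_cast<unsigned long>(data[i]) << 16;
        if (left > 1)
            group |= static_cast<unsigned long>(data[i + 1]) << 8;
        if (left > 2)
            group |= data[i + 2];
        // Emit four 6-bit symbols; missing input bytes become padding.
        out += alphabet[(group >> 18) & 0x3f];
        out += alphabet[(group >> 12) & 0x3f];
        out += left > 1 ? alphabet[(group >> 6) & 0x3f] : '=';
        out += left > 2 ? alphabet[group & 0x3f] : '=';
    }
    return out;
}
```

This is handy for output formats that embed images as text, which is why the conversion is a common need for generators.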
RVNGStringList is just that: a list of RVNGString objects. The predefined drawing and presentation generators use an RVNGStringList for output, inserting each page/slide separately.
The RVNGInputStream interface, defined in the librevenge-stream library, serves to pass around the input data. It has the usual functions for a read-only stream: isEnd() etc. It also has functions for handling internal structure, as that is useful for many formats (which use Zip, OLE2, etc. internally).
There are two implementations of RVNGInputStream available in librevenge-stream: RVNGFileStream, which also transparently handles Zip and OLE2, and RVNGDirectoryStream, which is useful for handling directory-based document formats. An application typically needs to implement its own stream type that wraps whatever internal stream type it uses.
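As a rough idea of what such an application-side wrapper looks like, here is a sketch that wraps an in-memory buffer. InputStream is a hypothetical, simplified stand-in and not the real RVNGInputStream: the signatures are assumptions, seeking is absolute only, and the functions for internal structure are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical, reduced stand-in for a read-only stream interface.
class InputStream
{
public:
    virtual ~InputStream() = default;
    // Returns a pointer to up to numBytes bytes; numRead receives the real count.
    virtual const unsigned char *read(std::size_t numBytes, std::size_t &numRead) = 0;
    virtual bool seek(std::size_t offset) = 0; // absolute position only, for brevity
    virtual std::size_t tell() const = 0;
    virtual bool isEnd() const = 0;
};

// Wraps a buffer that the application already holds in memory.
class MemoryStream : public InputStream
{
public:
    explicit MemoryStream(std::vector<unsigned char> data)
        : m_data(std::move(data)), m_pos(0) {}

    const unsigned char *read(std::size_t numBytes, std::size_t &numRead) override
    {
        // Never hand out more than what is left in the buffer.
        numRead = std::min(numBytes, m_data.size() - m_pos);
        if (numRead == 0)
            return nullptr;
        const unsigned char *start = m_data.data() + m_pos;
        m_pos += numRead;
        return start;
    }

    bool seek(std::size_t offset) override
    {
        if (offset > m_data.size())
            return false; // refuse to move past the end
        m_pos = offset;
        return true;
    }

    std::size_t tell() const override { return m_pos; }
    bool isEnd() const override { return m_pos >= m_data.size(); }

private:
    std::vector<unsigned char> m_data;
    std::size_t m_pos;
};
```

A real wrapper would forward these calls to the application's own stream class instead of a vector, but the shape stays the same.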
Creating an import library
So you decided that you like librevenge and that the library for parsing format XYZ you are contemplating writing will use it. Because we want to make it as easy as possible to start, we have written a tool that creates a skeleton of a new project. It is called project-generator and it can be found in this repository.
It has several options, but only a few of them are really needed:
- -p sets the project name;
- -d sets a one-sentence description (used in .pc file and in
- -a sets the main author and -e his e-mail; both are used in .rc files and in
- -T select the kind of document the library handles. -D is for vector drawings, -S for spreadsheets and -T for text documents (this is the default).
Anatomy of a project
So we have created a new project named, let’s say, libfoo. Let’s take a peek at what is inside…
The project uses autoconf and automake for build, as we believe that it is the least bad of all the existing build systems. In addition to that, there are project files for several versions of Microsoft Visual Studio in
The public headers are in inc. The public interface is really minimal: a class (by default named FooDocument) that has two static functions:
- RVNGInputStream and tests whether the input has the right format;
- RVNGXYZInterface (which one depends on the document kind the library imports), reads the input and produces the document by calling
That means that it is the caller’s responsibility to supply the input stream and the generator–by providing suitable implementations for them or using one of the existing ones from librevenge-generators and librevenge-stream. The library only uses the two interfaces.
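The division of labour can be sketched as follows. Everything here is hypothetical: the function names isSupported() and parse(), the "FOO1" magic, and the InputStream, TextInterface and Collector stand-ins are assumptions made for the example, not the code that project-generator emits.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins: a stream reduced to a byte string and a
// generator interface reduced to a single callback.
struct InputStream
{
    std::string bytes;
};

class TextInterface
{
public:
    virtual ~TextInterface() = default;
    virtual void insertText(const std::string &text) = 0;
};

// Public face of the library: two static functions, nothing else.
class FooDocument
{
public:
    // Cheap format detection: the invented format starts with the magic "FOO1".
    static bool isSupported(const InputStream &input)
    {
        return input.bytes.compare(0, 4, "FOO1") == 0;
    }

    // Reads the input and reports its content through the interface.
    static bool parse(const InputStream &input, TextInterface &document)
    {
        if (!isSupported(input))
            return false;
        document.insertText(input.bytes.substr(4));
        return true;
    }
};

// Caller-side generator that simply collects the text.
class Collector : public TextInterface
{
public:
    void insertText(const std::string &t) override { text += t; }
    std::string text;
};
```

The library never creates the stream or the generator itself; both come from the caller, which is what keeps the public interface this small.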
Since the library only works with streams, it has no notion of the path to the source document (as it cannot even know if the input stream is based on an actual file or just a memory buffer). Therefore, if a library needs to read other files (especially with paths relative to the input), it has to delegate that task to the caller in some way.
The code of the library is in src/lib. The project-generator produces FooDocument.cpp, containing empty implementations of the two public functions, and libfoo_utils.cpp with some functions and types that we generally find helpful, but which we do not want to put into librevenge for various reasons. Many of the functions handle reading numbers and strings from the input stream (e.g., readCString()), but there are other things too. (I am being deliberately vague here, as more functions can be added, or existing ones removed, in future versions of project-generator.)
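To give a flavour of such helpers, here is a sketch of three of them. It is not the generated code: the Reader type is a hypothetical stand-in for the input stream, readU8() and readU16() are assumed names, and the little-endian byte order and the exception on premature end of input are choices made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the input: a byte buffer plus a read position.
struct Reader
{
    std::vector<unsigned char> data;
    std::size_t pos = 0;
};

// Reads one byte, or throws if the input is exhausted.
inline std::uint8_t readU8(Reader &r)
{
    if (r.pos >= r.data.size())
        throw std::runtime_error("unexpected end of input");
    return r.data[r.pos++];
}

// Reads a little-endian 16-bit value: low byte first.
inline std::uint16_t readU16(Reader &r)
{
    const std::uint16_t lo = readU8(r);
    const std::uint16_t hi = readU8(r);
    return static_cast<std::uint16_t>(lo | (hi << 8));
}

// Reads bytes up to the terminating NUL, which is consumed but not returned.
inline std::string readCString(Reader &r)
{
    std::string s;
    for (std::uint8_t c = readU8(r); c != 0; c = readU8(r))
        s += static_cast<char>(c);
    return s;
}
```

Funnelling every read through one checked function means a truncated file ends in a single, predictable error path instead of reads past the end of the buffer.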
Unit tests (if any) should go into src/test and use CppUnit. When implementing a new test class, do not forget to use CPPUNIT_TEST_SUITE_REGISTRATION(TestClassName) at the end (in namespace scope). That macro registers the class in the default test suite at the test manager, so the tests from this class will actually be run. (This can be done manually, of course, but why bother?)
Last but not least, project-generator creates several command-line conversion tools into formats suitable for the document kind: HTML and plain text for text documents; SVG for vector drawings; SVG and plain text for presentations; and CSV for spreadsheets. The sources for these tools are in subdirectories of src/conv, named by the output type.
There is also another tool, converting to the so-called “raw” format. The raw generator prints all callbacks and their arguments. It also makes it possible to check proper nesting of paired callbacks. This is particularly useful during development, but we use the output for regression tests as well.
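The nesting check boils down to a stack of currently open elements. The sketch below shows that idea only: NestingChecker is a hypothetical class, and its log format is invented for the example rather than copied from the actual raw generator.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical checker: logs every callback and tracks open elements.
class NestingChecker
{
public:
    void open(const std::string &what)
    {
        m_log += indent() + "open" + what + "()\n";
        m_stack.push_back(what);
    }

    void close(const std::string &what)
    {
        if (m_stack.empty() || m_stack.back() != what)
        {
            m_ok = false; // closing something that is not the innermost open element
            return;
        }
        m_stack.pop_back();
        m_log += indent() + "close" + what + "()\n";
    }

    // True if every close matched and nothing was left open.
    bool balanced() const { return m_ok && m_stack.empty(); }
    const std::string &log() const { return m_log; }

private:
    // Two spaces per currently open element.
    std::string indent() const { return std::string(m_stack.size() * 2, ' '); }

    std::vector<std::string> m_stack;
    std::string m_log;
    bool m_ok = true;
};
```

Because the log is plain text with stable formatting, comparing it against a stored copy is enough to notice when a change in the parser alters the sequence of callbacks.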
All these converters use the generators provided by librevenge-generators.
In the next part I will present a complete parser for an invented text document format.