
This short series of blog posts attempts to show how to use librevenge to facilitate writing of import filters for office document formats. It is focused on writing libraries, as I would like to encourage sharing of import filters between Open Source projects.
There has been no release of librevenge yet, but I do not expect any significant changes at this stage, so what I say here should be true later too. Despite that, librevenge is already used by 10 libraries and there are more that are work-in-progress.

What is librevenge?

librevenge is a library that simplifies writing of import filters by providing interfaces for typical office file formats: text documents, presentations, vector drawings and spreadsheets. Using it frees import library developers from the necessity to invent their own interface and allows them to focus on writing the actual import code. It also simplifies integration of new import libraries into applications: to integrate a new library of a kind the application already handles, one only needs to write some boilerplate code that registers the importer into the application’s filter framework (and that code can typically be copied from a previous occurrence with just a slight modification).
The library replaces existing import interfaces from libwpd (text documents), libwpg (vector drawings) and libetonyek (presentations). Support for spreadsheets is new–we had not had that in any library previously. It is split into three parts:

  • librevenge, containing the import interfaces and types used there (and for historical reasons also an SVG generator for the vector drawing interface);
  • librevenge-stream, containing the stream interface that is used for input data;
  • librevenge-generators, containing implementations of the import interfaces that produce some useful output document formats (more about that later).

The code can be found here.

How does it work?

The import interfaces are event-based. In other words, they define callbacks that should be called at various stages of import of the source document, like startDocument(), closeParagraph() or insertText(). The caller must provide an implementation of the chosen interface that actually does something meaningful, e.g., builds an internal document model. Alternatively, the caller can use one of the prepared implementations (we call them generators) that are available in the librevenge-generators library. These are plain text and HTML for text documents; SVG for presentations and drawings; and CSV for spreadsheets. There are also ODF generators for all document kinds in libodfgen. Note that once one has implemented a generator, it can be used with all import libraries for the same kind of document.
The callback functions have (we hope) self-explanatory names. Most of them are paired; in that case the pairs are named either start/endFoo() or open/closeFoo(). Standalone callbacks are named insertFoo() or defineFoo(). All opening callbacks and most of the standalone callbacks take a single argument which is a property list. The closing callbacks never take an argument.
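To make the callback style concrete, here is a minimal sketch of a plain-text generator. It does not use the real librevenge headers: TextInterfaceSketch, PropertyListSketch, PlainTextGeneratorSketch, fakeImport() and renderSample() are stand-ins invented for this illustration, and only the callback names come from the text above.

```cpp
#include <map>
#include <string>

// Stand-in for a property list: string keys mapped to string values.
// (The real RVNGPropertyList is richer; this is only an illustration.)
typedef std::map<std::string, std::string> PropertyListSketch;

// Stand-in for an import interface for text documents.
class TextInterfaceSketch
{
public:
  virtual ~TextInterfaceSketch() {}
  // Paired callbacks: the opening one takes a property list,
  // the closing one takes no argument.
  virtual void startDocument(const PropertyListSketch &props) = 0;
  virtual void endDocument() = 0;
  virtual void openParagraph(const PropertyListSketch &props) = 0;
  virtual void closeParagraph() = 0;
  // Standalone callback.
  virtual void insertText(const std::string &text) = 0;
};

// A generator: an implementation of the interface that builds plain text.
class PlainTextGeneratorSketch : public TextInterfaceSketch
{
public:
  void startDocument(const PropertyListSketch &) { m_out.clear(); }
  void endDocument() {}
  void openParagraph(const PropertyListSketch &) {}
  void closeParagraph() { m_out += '\n'; }
  void insertText(const std::string &text) { m_out += text; }
  const std::string &result() const { return m_out; }

private:
  std::string m_out;
};

// What an import library does: it drives the interface while parsing.
void fakeImport(TextInterfaceSketch &iface)
{
  const PropertyListSketch none;
  iface.startDocument(none);
  iface.openParagraph(none);
  iface.insertText("Hello, ");
  iface.insertText("world");
  iface.closeParagraph();
  iface.endDocument();
}

// Convenience wrapper: run the fake import with the plain-text generator.
std::string renderSample()
{
  PlainTextGeneratorSketch generator;
  fakeImport(generator);
  return generator.result();
}
```

Because fakeImport() only knows the interface, any other implementation (say, one building an application's document model) could be passed in without changing the import code.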
The following sections contain more details about the helper types. All are defined in namespace librevenge, but that namespace is omitted for the sake of brevity.

Strings

RVNGString is used for passing strings through the interfaces. It always uses UTF-8 encoding.

Properties

RVNGProperty is an interface for specific properties handled by RVNGPropertyList. It contains convenience functions to extract objects of types supported by librevenge (except RVNGPropertyList itself).

Property lists

RVNGPropertyList maps string keys to RVNGProperty objects. It has factory functions for constructing and inserting properties for various data types: int, double, RVNGString, RVNGBinaryData,… Note that while the insertion functions for numbers allow specifying a unit, it is not advised to use anything but RVNG_INCH (which is the default), as legacy generators do not really handle units, so the results might be incorrect.
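Since inches are the only unit that is safe with all generators, an importer would typically convert whatever the source format uses before inserting a value. A minimal sketch of such helpers follows; the function names are made up for this example and are not part of librevenge. The only facts used are that an inch is 72 points or 1440 twips.

```cpp
// Hypothetical helpers (not part of librevenge): convert common source
// units to inches before the value is put into a property list.

// 1 inch == 72 points
double pointsToInches(double points)
{
  return points / 72.0;
}

// 1 inch == 1440 twips (a twip is 1/20 of a point)
double twipsToInches(double twips)
{
  return twips / 1440.0;
}
```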

Property list vectors

RVNGPropertyListVector is a sequence of RVNGPropertyList objects. This type is also used to implement nesting of property lists: RVNGPropertyList cannot contain another RVNGPropertyList directly, but it can contain an RVNGPropertyListVector.

Binary data

RVNGBinaryData serves to store, well, arbitrary binary data 🙂 As it is quite a common operation, it is possible to convert to/from base64.
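For readers who have not met base64 before, the encoding direction can be sketched in a few lines. This is a standalone illustration of the standard scheme (RFC 4648, with '=' padding), not the code RVNGBinaryData uses; base64EncodeSketch() is a name invented for the example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Encode a byte buffer as base64 (RFC 4648, with '=' padding).
std::string base64EncodeSketch(const std::vector<unsigned char> &data)
{
  static const char table[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::string out;
  std::size_t i = 0;
  // Every full group of 3 input bytes becomes 4 output characters.
  for (; i + 2 < data.size(); i += 3)
  {
    const unsigned long n = (static_cast<unsigned long>(data[i]) << 16)
                            | (static_cast<unsigned long>(data[i + 1]) << 8)
                            | static_cast<unsigned long>(data[i + 2]);
    out += table[(n >> 18) & 0x3f];
    out += table[(n >> 12) & 0x3f];
    out += table[(n >> 6) & 0x3f];
    out += table[n & 0x3f];
  }
  // One or two leftover bytes are encoded and padded with '='.
  const std::size_t rest = data.size() - i;
  if (rest == 1)
  {
    const unsigned long n = static_cast<unsigned long>(data[i]) << 16;
    out += table[(n >> 18) & 0x3f];
    out += table[(n >> 12) & 0x3f];
    out += "==";
  }
  else if (rest == 2)
  {
    const unsigned long n = (static_cast<unsigned long>(data[i]) << 16)
                            | (static_cast<unsigned long>(data[i + 1]) << 8);
    out += table[(n >> 18) & 0x3f];
    out += table[(n >> 12) & 0x3f];
    out += table[(n >> 6) & 0x3f];
    out += '=';
  }
  return out;
}
```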

String lists

RVNGStringList is just that: a list of RVNGString objects. The predefined drawing and presentation generators use an RVNGStringList for output, inserting each page/slide separately.

Streams

The RVNGInputStream interface, defined in the librevenge-stream library, serves to pass around the input data. It has the usual functions for a read-only stream: read(), seek(), tell(), isEnd() etc. It also has functions for handling internal structure, as that is useful for many formats (which use Zip, OLE2, etc. internally).
There are two implementations of RVNGInputStream available in librevenge-stream: RVNGFileStream, which also transparently handles Zip and OLE2, and RVNGDirectoryStream, which is useful for handling directory-based document formats. An application typically needs to implement its own stream type that wraps whatever internal stream type it uses.
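The last point can be sketched as follows: an application that already keeps the document in a memory buffer wraps that buffer in a stream object. The class below is a stand-in invented for illustration; it does not derive from the real RVNGInputStream, and its member functions only loosely mirror the read-only operations listed above (their signatures are assumptions).

```cpp
#include <cstddef>
#include <string>

// Stand-in for an application-side stream wrapping an in-memory buffer.
// It does not derive from the real RVNGInputStream.
class MemoryStreamSketch
{
public:
  explicit MemoryStreamSketch(const std::string &data)
    : m_data(data), m_pos(0)
  {
  }

  // Read up to count bytes; fewer are returned near the end of the data.
  std::string read(std::size_t count)
  {
    const std::string chunk = m_data.substr(m_pos, count);
    m_pos += chunk.size();
    return chunk;
  }

  // Move to an absolute offset, clamped to the size of the data.
  void seek(std::size_t offset)
  {
    m_pos = offset < m_data.size() ? offset : m_data.size();
  }

  // Current position, counted from the start of the data.
  std::size_t tell() const { return m_pos; }

  // True once all data have been consumed.
  bool isEnd() const { return m_pos >= m_data.size(); }

private:
  std::string m_data;
  std::size_t m_pos;
};
```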

Creating an import library

So you decided that you like librevenge and that the library for parsing format XYZ you are contemplating writing will use it. Because we want to make it as easy as possible to start, we have written a tool that creates a skeleton of a new project. It is called project-generator and it can be found in this repository.
It has several options, but only a few of them are really needed:

  • -p sets the project name;
  • -d sets a one-sentence description (used in .pc file and in README);
  • -a sets the main author and -e the author’s e-mail; both are used in .rc files and in CREDITS;
  • -D, -P, -S, -T select the kind of document the library handles. -D is for vector drawings, -P for presentations, -S for spreadsheets and -T for text documents (this is the default).

Anatomy of a project

So we have created a new project named, let’s say, libfoo. Let’s take a peek at what is inside…

Build system

The project uses autoconf and automake for build, as we believe that it is the least bad of all the existing build systems. In addition to that, there are project files for several versions of Microsoft Visual Studio in build/win32.

Headers

The public headers are in inc. The public interface is really minimal: a class (by default named FooDocument) that has two static functions:

  • isSupported() takes an RVNGInputStream and tests whether the input has the right format;
  • parse() takes an RVNGInputStream and an RVNGXYZInterface (which one depends on the document kind the library imports), reads the input and produces the document by calling RVNGXYZInterface’s callbacks.

That means that it is the caller’s responsibility to supply the input stream and the generator–by providing suitable implementations for them or using one of the existing ones from librevenge-generators and librevenge-stream. The library only uses the two interfaces.
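The shape of that two-function interface can be sketched with stand-in types. Nothing below is what project-generator actually produces: the "FOO1" signature, the std::string used in place of an input stream and the TextSink callback used in place of an import interface are all assumptions made for the example.

```cpp
#include <functional>
#include <string>

// Stand-in for an import interface: a single "insert text" callback.
typedef std::function<void(const std::string &)> TextSink;

class FooDocumentSketch
{
public:
  // Cheap format check: does the input start with the (invented) signature?
  static bool isSupported(const std::string &input)
  {
    return input.compare(0, 4, "FOO1") == 0;
  }

  // Parse the input and report its content through the callback.
  // Returns false if the input is not in the expected format.
  static bool parse(const std::string &input, const TextSink &sink)
  {
    if (!isSupported(input))
      return false;
    sink(input.substr(4)); // everything after the signature is the text
    return true;
  }
};
```

The point of the design is visible even in this toy version: the class itself holds no state and knows nothing about where the input came from or what the callback does with the content.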

Library

Since the library only works with streams, it has no notion of the path to the source document (as it cannot even know if the input stream is based on an actual file or just a memory buffer). Therefore, if a library needs to read other files (especially with paths relative to the input), it has to delegate that task to the caller in some way.
The code of the library is in src/lib. The project-generator produces FooDocument.cpp containing empty implementations of the two public functions, and libfoo_utils.h and libfoo_utils.cpp with some functions and types that we generally find helpful, but which we do not want to put into librevenge for various reasons. Many of the functions handle reading numbers and strings from RVNGInputStream (e.g., readU32() or readCString()), but there are other things too. (I am being deliberately vague here, as more functions can be added–or existing ones removed–in future versions of project-generator.)
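As an illustration of what such a helper does, here is a sketch of a little-endian 32-bit read. It works on a plain byte buffer rather than on an RVNGInputStream, and the name readU32Sketch() and its signature are assumptions; the generated readU32() may well look different.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Read an unsigned 32-bit little-endian number starting at offset pos
// and advance pos past it. Throws if fewer than 4 bytes are left.
std::uint32_t readU32Sketch(const std::vector<unsigned char> &data, std::size_t &pos)
{
  if (pos > data.size() || data.size() - pos < 4)
    throw std::out_of_range("not enough data for a 32-bit number");

  std::uint32_t value = 0;
  // The last of the four bytes is the most significant one.
  for (int i = 3; i >= 0; --i)
    value = (value << 8) | data[pos + static_cast<std::size_t>(i)];

  pos += 4;
  return value;
}
```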

Unit tests

Unit tests–if any–should go into src/test and use CppUnit. When implementing a new test class, do not forget to use CPPUNIT_TEST_SUITE_REGISTRATION(TestClassName) at the end (in namespace scope). That macro registers the class in the default test suite at the test manager, so the tests from this class will actually be run. (This can be done manually, of course, but why bother?)

Command-line tools

Last but not least, project-generator creates several command-line tools that convert into formats suitable for the document kind: HTML and plain text for text documents; SVG for vector drawings; SVG and plain text for presentations; and CSV for spreadsheets. The sources for these tools are in subdirectories of src/conv, named by the output type.
There is also another tool, converting to so-called “raw” format. The raw generator prints all callbacks and their arguments. It also allows checking proper nesting of paired callbacks. This is particularly useful during development, but we use the output for regression tests as well.
All these converters use the generators provided by librevenge-generators.
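The nesting check mentioned above can be illustrated with a small stack-based sketch. This is not the raw generator's code; checkNestingSketch() and the representation of callbacks as plain strings are assumptions for the example. It relies only on the naming convention described earlier: a startFoo() must be matched by endFoo() and an openFoo() by closeFoo().

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Check that paired callbacks, given by name in call order, nest properly.
// Names that start with neither start/open nor end/close (e.g. "insertText")
// are treated as standalone callbacks and ignored.
bool checkNestingSketch(const std::vector<std::string> &calls)
{
  // For every element that is currently open, the closing callback it needs.
  std::vector<std::string> expected;

  for (std::size_t i = 0; i != calls.size(); ++i)
  {
    const std::string &name = calls[i];
    if (name.compare(0, 5, "start") == 0)
      expected.push_back("end" + name.substr(5));
    else if (name.compare(0, 4, "open") == 0)
      expected.push_back("close" + name.substr(4));
    else if (name.compare(0, 3, "end") == 0 || name.compare(0, 5, "close") == 0)
    {
      // A closing callback must match the most recently opened element.
      if (expected.empty() || expected.back() != name)
        return false;
      expected.pop_back();
    }
  }

  // Everything that was opened must have been closed again.
  return expected.empty();
}
```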

Stay tuned!

In the next part I will present a complete parser for an invented text document format.