RDF text compression experiment

Large RDF-based datasets such as the National Library of Medicine’s Medical Subject Headings (MeSH) controlled vocabulary thesaurus are often encoded in machine-friendly formats such as N-Triples and compressed with standard methods such as gzip. As a regular producer and consumer of RDF datasets, I’m interested in how effective different off-the-shelf compression methods are when applied to text-based RDF encodings such as N-Triples. N-Triples is the preferred encoding for programmatic use of RDF triples, since it is line-based and can be streamed easily.
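To illustrate why a line-based encoding streams easily, here is a minimal sketch that reads a gzip-compressed N-Triples document one line at a time and counts the triples. The sample data and the `count_triples` helper are hypothetical and are not part of the experiment's code.

```python
import gzip
import io


def count_triples(stream):
    """Count non-blank, non-comment lines in an N-Triples text stream."""
    count = 0
    for line in stream:
        stripped = line.strip()
        # In N-Triples, each remaining line holds exactly one triple.
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


# Build a tiny gzip-compressed N-Triples document in memory (hypothetical data).
sample = (
    '<http://example.org/s1> <http://example.org/p> "one" .\n'
    "# a comment line\n"
    '<http://example.org/s2> <http://example.org/p> "two" .\n'
)
compressed = gzip.compress(sample.encode("utf-8"))

# Decompress lazily and iterate line by line instead of loading everything.
with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as stream:
    triple_count = count_triples(stream)

print(triple_count)
```

For a real dataset, the in-memory buffer would be replaced with the path to a `.nt.gz` file; memory use stays roughly constant because only one line is held at a time.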

I developed a small experiment to evaluate several common compression methods (Brotli, bzip2, and gzip) on representative N-Triples-encoded datasets, including AGROVOC and MeSH.

I measured the size of the uncompressed N-Triples file, the size of the same data converted to Turtle with common namespace prefixes, and the size of both encodings after compression with each of the above methods. I tracked the time each method took to run on my laptop and calculated compression ratios and space savings. To keep the comparison fair, I used stock, single-process command-line implementations of the compression methods rather than parallel implementations such as pbzip2. There is no parallel command-line implementation of Brotli at the time of this writing.
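The measurement procedure can be sketched roughly as follows. This is not the experiment's code: it uses Python's standard-library `gzip` and `bz2` modules rather than the command-line tools, leaves out Brotli because it is not in the standard library, and runs on small synthetic data. It assumes the usual definitions: compression ratio is uncompressed size divided by compressed size, and space savings is one minus the inverse of that ratio. The `measure` helper is hypothetical.

```python
import bz2
import gzip
import time


def measure(name, compress, data):
    """Compress data once and report size, elapsed time, ratio, and savings."""
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    return {
        "method": name,
        "bytes": len(compressed),
        "seconds": elapsed,
        # Assumed definition: uncompressed size / compressed size.
        "ratio": len(data) / len(compressed),
        # Assumed definition: 1 - compressed size / uncompressed size.
        "savings": 1 - len(compressed) / len(data),
    }


# Synthetic, highly repetitive N-Triples-like data (hypothetical).
data = b"".join(
    b'<http://example.org/item/%d> <http://example.org/label> "Item %d" .\n' % (i, i)
    for i in range(10000)
)

results = [
    measure("gzip", gzip.compress, data),
    measure("bzip2", bz2.compress, data),
]

for r in results:
    print(
        f"{r['method']}: {r['bytes']} bytes, {r['seconds']:.3f} s, "
        f"ratio {r['ratio']:.1f}, savings {r['savings']:.1%}"
    )
```

Timings from a sketch like this are not comparable to the command-line results, since library defaults and process startup costs differ.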

The code for the experiment and the raw results are available in this GitHub repository.

My key findings were:

  • Brotli compression of N-Triples has a slightly higher compression ratio than bzip2, but takes 10 to 25 times as long to run on large datasets. Brotli compressed the 938.4 MiB AGROVOC dataset in N-Triples to 38.6 MiB and the 1.9 GiB MeSH dataset in N-Triples to 65.1 MiB.
  • bzip2 compression of N-Triples has a significantly higher compression ratio than gzip, but takes 5 to 10 times as long to run on large datasets.
  • Turtle encodings of the datasets were approximately half the size of the N-Triples encodings. Brotli compression of the Turtle encodings reduced AGROVOC to 23.4 MiB and MeSH to 42.0 MiB.
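As a worked example, the Brotli figures above can be turned into compression ratios and space savings. The sketch below assumes the usual definitions (ratio is uncompressed size over compressed size, savings is one minus its inverse) and treats 1.9 GiB as 1.9 × 1024 MiB, so the MeSH numbers are only as precise as that rounded input. The function names are hypothetical.

```python
def compression_ratio(uncompressed, compressed):
    """Uncompressed size divided by compressed size (same units)."""
    return uncompressed / compressed


def space_savings(uncompressed, compressed):
    """Fraction of the original size removed by compression."""
    return 1 - compressed / uncompressed


# Sizes in MiB, taken from the findings above.
agrovoc_nt, agrovoc_nt_brotli, agrovoc_ttl_brotli = 938.4, 38.6, 23.4
mesh_nt = 1.9 * 1024  # 1.9 GiB is a rounded figure, so this is approximate
mesh_nt_brotli, mesh_ttl_brotli = 65.1, 42.0

agrovoc_ratio = compression_ratio(agrovoc_nt, agrovoc_nt_brotli)
mesh_ratio = compression_ratio(mesh_nt, mesh_nt_brotli)
agrovoc_savings = space_savings(agrovoc_nt, agrovoc_nt_brotli)
mesh_savings = space_savings(mesh_nt, mesh_nt_brotli)

# Size of Brotli-compressed Turtle relative to Brotli-compressed N-Triples.
agrovoc_ttl_vs_nt = agrovoc_ttl_brotli / agrovoc_nt_brotli
mesh_ttl_vs_nt = mesh_ttl_brotli / mesh_nt_brotli

print(f"AGROVOC: ratio {agrovoc_ratio:.1f}, savings {agrovoc_savings:.1%}")
print(f"MeSH:    ratio {mesh_ratio:.1f}, savings {mesh_savings:.1%}")
print(f"Turtle+Brotli vs N-Triples+Brotli: {agrovoc_ttl_vs_nt:.0%}, {mesh_ttl_vs_nt:.0%}")
```

Under these assumptions, Brotli on N-Triples works out to about 24:1 for AGROVOC and about 30:1 for MeSH, and the Brotli-compressed Turtle files are roughly 60 to 65 percent the size of their Brotli-compressed N-Triples counterparts.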