Efficient String Compression for Modern Database Systems

crazygringo 2 hours ago | parent | next [-]

I'm genuinely surprised that there isn't column-level shared-dictionary string compression built into SQLite, MySQL/MariaDB or Postgres, like this post is describing.

SQLite has no compression support, MySQL/MariaDB have page-level compression, which doesn't work well and which I've never seen anyone enable in production, and Postgres has per-value compression, which is good for extremely long strings but useless for short ones.

There are just so many string columns where values and substrings get repeated so much, whether you're storing names, URLs, or just regular text. And I have databases I know would be reduced in size by at least half.

Is it just really really hard to maintain a shared dictionary when constantly adding and deleting values? Is there just no established reference algorithm for it?

It still seems like it would be worth it even if it were something you had to manually set. E.g. wait until your table has 100,000 values, build a dictionary from those, and the dictionary is set in stone and used for the next 10,000,000 rows too unless you rebuild it in the future (which would be an expensive operation).
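A minimal sketch of that "build once, then freeze" idea, using the preset-dictionary support in Python's zlib (the `zdict` argument). The helper names, the 32 KiB cap and the frequency ordering are illustrative assumptions, not how any of the databases above work; a real system would more likely use a purpose-built scheme like the one in the post.

```python
import zlib
from collections import Counter

MAX_DICT_BYTES = 32 * 1024  # deflate can only look back 32 KiB


def build_dictionary(sample_values):
    """Build a frozen dictionary from an initial sample of column values (hypothetical helper)."""
    counts = Counter(sample_values)
    # zlib's docs recommend putting the most common content at the END of the dictionary.
    ordered = sorted(counts, key=counts.get)  # rarest first, most common last
    blob = "".join(ordered).encode("utf-8")
    return blob[-MAX_DICT_BYTES:]


def compress_value(value, zdict):
    """Compress a single short string against the shared dictionary."""
    compressor = zlib.compressobj(zdict=zdict)
    return compressor.compress(value.encode("utf-8")) + compressor.flush()


def decompress_value(data, zdict):
    """Decompress a value; this needs the exact dictionary used to compress it."""
    decompressor = zlib.decompressobj(zdict=zdict)
    return (decompressor.decompress(data) + decompressor.flush()).decode("utf-8")
```

Since every stored value can only be decoded with the dictionary it was written with, a rebuild means either rewriting existing rows or keeping old dictionary versions around, which is where the "expensive operation" comes from.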

	hinkley an hour ago | parent | next [-]
		There are some databases that can move an entire column into the index. But that's mostly going to work for schemas where the number of distinct values is <<< rowcount, so that you're effectively interning the rows.
	analyst74 an hour ago | parent | prev [-]
		Compression is not free. Dictionary compression (1) complicates and slows down updates, which typically matter more in OLTP than in OLAP, and (2) is generally bad for high-cardinality columns, so the system has to track cardinality to make decisions, which complicates things further. Lastly, the additional operational complexity (like the table maintenance scheme you described in your last paragraph) could reduce system reliability, and the maintainers might decide it's not worth the price or that it goes against their philosophy.
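To make the cardinality point concrete, here is a rough sketch of whole-value dictionary encoding (interning) that gives up when a column has too many distinct values. The function name and the 0.5 threshold are made up for illustration; real engines use their own heuristics.

```python
def dictionary_encode(values, max_distinct_ratio=0.5):
    """Return (entries, codes), or None if the column is too high-cardinality.

    `entries` holds each distinct value once; `codes` holds one small integer
    per row pointing into `entries`. Both names are hypothetical.
    """
    code_of = {}
    codes = []
    for value in values:
        # Assign the next free code the first time a value is seen.
        code = code_of.setdefault(value, len(code_of))
        codes.append(code)

    # With mostly unique values, the dictionary is as big as the data itself,
    # so encoding buys nothing and only adds indirection.
    if len(code_of) > max_distinct_ratio * len(values):
        return None

    entries = [None] * len(code_of)
    for value, code in code_of.items():
        entries[code] = value
    return entries, codes
```

Note that the cardinality check only happens after scanning the data, which is the kind of bookkeeping an OLTP engine would have to keep up to date on every insert and update.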

ayuhito an hour ago | parent | prev | next [-]

DuckDB has one of my favourite articles on this topic if you want something a little more high level: https://duckdb.org/2022/10/28/lightweight-compression

mbfg 5 hours ago | parent | prev | next [-]

I wonder how one does like queries.

	hcs 3 hours ago | parent | next [-]
		After decompression, with the performance characteristics you'd expect. If it has to come off disk it's still a win or at least usually breaks even in their measurements. https://cedardb.com/blog/string_compression/#query-runtime The paper suggests that you could rework string matching to work on the compressed data but they haven't done it.
	speed_spread 4 hours ago | parent | prev [-]
		s, jst cmprss ll qrs b rmvng vyls!
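On the LIKE question: for a plain whole-value dictionary (a different and simpler scheme than the substring compression in the post), one common trick is to evaluate the pattern once per distinct dictionary entry and then filter rows by integer code. A hedged sketch, with hypothetical function names and only `%` and `_` supported (no escape character):

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regex (% = any run, _ = one char)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def like_on_dictionary(entries, codes, pattern):
    """Return the row numbers whose value matches `pattern`.

    `entries` is the list of distinct strings, `codes` has one index into
    `entries` per row. The regex runs once per distinct value, not once per row.
    """
    regex = like_to_regex(pattern)
    matching_codes = {
        code for code, entry in enumerate(entries) if regex.fullmatch(entry)
    }
    return [row for row, code in enumerate(codes) if code in matching_codes]
```

This only pays off when the number of distinct values is much smaller than the row count; with a substring-level scheme the values still have to be decompressed first, as described above.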

ForHackernews 4 hours ago | parent | prev [-]

Never heard of CedarDB.

Seems to be another commercial cloud-hosted thing offering a Postgres API? https://dbdb.io/db/cedardb

https://cedardb.com/blog/ode_to_postgres/

switz 4 hours ago | parent | next [-]

I was evaluating it recently but it's not FOSS, so buyer beware. I'm totally fine with commercialization, but I hesitate to build on top of data stores with no escape hatches or maintenance plans, especially when they're venture backed. It is self-hostable, but not OSS.

cmrdporcupine 3 hours ago | parent | prev [-]

It's a startup founded by -- and built with tech coming out of research by -- some well known people in the DB research community.

Successor to Umbra, I believe.

I know somebody (quite talented) working there. It's likely to kick ass in terms of performance.

But it's hard to get people to pay for a DB these days.

atombender an hour ago | parent [-]

It's probably going to be acquired. The last effort to commercialize the TUM (Technical University of Munich) database group's work was acquired by Snowflake and disappeared into that stack.

CedarDB is the commercialization of Umbra, the TUM group's in-memory database led by Professor Thomas Neumann. Umbra is a successor to HyPer, so this is the third generation of the system Neumann came up with.

Umbra/CedarDB isn't a completely new way of doing database stuff, but basically a combination of several things that rearchitect the query engine from the ground up for modern systems: a query compiler that generates native code, a buffer pool manager optimized for multi-core, push-based DAG execution that divides work into batches ("morsels"), and in-memory Adaptive Radix Tries (never used in a database before, I think).

It also has an advanced query planner that embraces the latest theoretical advances in query optimization, particularly techniques for unnesting complex query plans that have a ton of joins. The TUM group has published some great papers on this.

senderista 12 minutes ago | parent | next [-]

Umbra is not an in-memory database (Hyper was). TUM gave up on the feasibility of in-memory databases several years ago (when the price of RAM relative to storage stopped falling).

	cmrdporcupine 2 minutes ago | parent [-]
		Yeah, I think the way Umbra was pitched when I watched the talks and read the paper was more as a "hybrid", in the sense that it aimed for something close to in-memory performance while optimizing the page-in/page-out performance profile. The part of Umbra I found interesting was the buffer pool, so that's where I focused most of my attention when reading, though.

senderista 15 minutes ago | parent | prev [-]

Are you thinking of Hyper being acquired by Tableau?