From Star Schema to Complete Denormalization

Bennie Schut —  March 28, 2013 — 1 Comment


Back in 2008, when we took our first baby steps into the Hadoop world, we started with some custom MapReduce jobs, later moved to Pig, and finally settled on Hive. Since our legacy RDBMS schemas were nicely designed star schemas that had served us well, we had so far always stuck with that setup.


But as the dimensions grew over time, it became more and more obvious that the star schema did not perfectly match the realities of big data. Although you can easily do massive joins, they end up needing lots of resources, and skews in the data can get you into skewed-join trouble. This left us with a question: what would happen if you removed all of the joins and went to a completely flat design, or at least as flat as possible? Our assumption was that it would explode the fact table in size, but we wanted to measure by how much, and whether it would, in the end, still be less I/O than doing the joins.

It’s probably good to know we use Snappy on all our tables to reduce I/O cost.
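For reference, enabling Snappy output in Hive looks roughly like this. This is a sketch of typical session settings for Hive of this era, not our exact configuration; property names vary across Hive/Hadoop versions:

```sql
-- Sketch: turn on Snappy compression for Hive query output.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Avro-backed tables use Avro's own codec property:
SET avro.output.codec=snappy;
```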

Let’s try it…

So we flattened our table, used the Avro serializer for the schema, compressed it lightly with Snappy, and looked at the results on disk. To be fair, both the original design and the denormalized design got fresh tables, since sorting can have an effect on compression and we wouldn't want the new design to have an unfair advantage.
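As a sketch of the flattening step, with hypothetical table and column names (the AvroSerDe classes shown are the ones available in Hive at the time; note that older AvroSerDe versions also require an explicit `avro.schema.literal` or `avro.schema.url` in `TBLPROPERTIES`, and the `STORED AS AVRO` shorthand only arrived later):

```sql
-- Hypothetical example: join the dimensions in once at build time,
-- so downstream queries never have to join again.
CREATE TABLE sessions_flat
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS
    INPUTFORMAT  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
AS
SELECT f.session_id,
       f.start_time,
       ua.useragent_string,   -- formerly a join to the useragent dimension
       ua.browser_name,
       g.country_name         -- formerly a join to a geo dimension
FROM   sessions_fact f
JOIN   dim_useragent ua ON f.useragent_id = ua.useragent_id
JOIN   dim_geo       g  ON f.geo_id       = g.geo_id;
```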

hadoop fs -dus /user/hive/warehouse/old

648090512 bytes, so 618 MiB

hadoop fs -dus /user/hive/warehouse/new

1014676023 bytes, so 968 MiB

So the old table is about 36% smaller than the new one. The difference might initially look disappointing, but only if you miss the point of the exercise: remember, we don't have to do any joins anymore. So let's use the result.

A query against the original star schema (with joins):

On a busy cluster: 485 sec, so 8 min 5 sec
On a quiet cluster: 161 sec, so 2 min 41 sec

The same query against the denormalized schema (no joins):

On a busy cluster: 198 sec, so 3 min 18 sec
On a quiet cluster: 98 sec, so 1 min 38 sec
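To make the comparison concrete, the two query shapes look something like this (table and column names are hypothetical, not our actual schema):

```sql
-- Star schema: every report query pays for the join.
SELECT ua.browser_name, COUNT(*) AS sessions
FROM   sessions_fact f
JOIN   dim_useragent ua ON f.useragent_id = ua.useragent_id
GROUP BY ua.browser_name;

-- Flat schema: a straight scan, no join stage at all.
SELECT browser_name, COUNT(*) AS sessions
FROM   sessions_flat
GROUP BY browser_name;
```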

Apart from the performance difference, it's also clear that the new schema will be easier to use for entry-level employees, especially as queries get more complex. This will likely result in fewer mistakes when using the data. As a bonus, you can also drop a lot of the ETL work that went into filling the dimensions, since you don't have them anymore.

So why did this work?

In our case, our useragent dimension has grown over the years to about 1 GiB. So while you might assume a query reads only the 618 MiB fact table, once you join in that dimension you realistically read about 1.6 GiB.

Our schema contains four large dimensions. Sometimes we join several of them, pushing that 1.6 GiB even higher, while the new schema stays at 968 MiB.

Without compression, the numbers would likely look very different. A star schema inherently stores data in a more compact format even without any compression (mostly surrogate keys instead of full strings), but that advantage is mostly lost once you apply lightweight compression like Snappy.

One response to From Star Schema to Complete Denormalization


    Nice work! Very interesting point, as Hive is sold as a DWH tool and the star schema is at the heart of DWH modeling. We were tempted to port our star schema directly over to Hive.

    Would you mind describing your new table? Did you use any collection/struct type for it, or is it a simple table with all columns from all dimensions plus the session table columns?

    Almost one year after your post, Hive has evolved quite a bit with the Stinger initiative, and we've also had other initiatives like Impala. Which Hive version/Hadoop distribution are you using? Did you try any improvements with a new feature like ORCFile?


