Back in 2008, when we took our first baby steps into the Hadoop world, we started with custom MapReduce jobs, later moved to Pig, and finally switched completely to Hive. Since our legacy RDBMS schemas were nicely designed star schemas that had served us well, we had so far always stuck with this setup.
But as the dimensions grew over time, it became more and more obvious that the star schema did not perfectly match the realities of big data. Although you can easily do massive joins, they end up needing lots of resources, and skews in the data can get you into skewed-join trouble. This left us with a question: what would happen if we removed all of the joins, or at least as many as possible, and went to a completely flat design? Our assumption was that it would explode the fact table in size, but we wanted to measure how much, and whether it would, in the end, mean less I/O than doing the joins.
It’s probably good to know we use Snappy on all our tables to reduce I/O cost.
Let’s try it…
So we flattened our table, used the Avro serializer for our schema, compressed it with Snappy, and looked at the results on disk. To be fair, both the original design and the denormalized design got fresh tables, since sorting can have an effect on compression and we wouldn't want the new design to have an unfair advantage.
hadoop fs -dus /user/hive/warehouse/old
648090512 bytes, so 618 MiB
hadoop fs -dus /user/hive/warehouse/new
1014676023 bytes, so 968 MiB
So the flat table is about 57% larger on disk (or, measured the other way, the star-schema fact table is 36% smaller than the flat one). That might initially look disappointing, but only if you miss the point of the exercise: remember, we don't have to do any joins anymore. So let's use the result.
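To make the percentages precise, here is the arithmetic behind the two disk sizes:

```python
old = 648_090_512    # bytes on disk, star-schema fact table
new = 1_014_676_023  # bytes on disk, flattened table

print(round(old / 2**20))              # → 618  (MiB, old table)
print(round(new / 2**20))              # → 968  (MiB, new table)
print(round((new - old) / old * 100))  # → 57   (how much larger the flat table is, in %)
print(round((new - old) / new * 100))  # → 36   (the same gap relative to the new table)
```

Note that "36%" and "57%" describe the same 366 MiB gap; it just depends on which table you use as the baseline.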
With the old star schema, the query joining in the dimensions took:

On a busy cluster: 485 sec, so 8 min 5 sec
On a quiet cluster: 161 sec, so 2 min 41 sec

The same query against the new flat schema took:

On a busy cluster: 198 sec, so 3 min 18 sec
On a quiet cluster: 98 sec, so 1 min 38 sec
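Those wall-clock numbers work out to a solid speedup either way:

```python
# Wall-clock seconds from the runs above.
busy_old, busy_new = 485, 198
quiet_old, quiet_new = 161, 98

print(round(busy_old / busy_new, 1))   # → 2.4  (speedup on a busy cluster)
print(round(quiet_old / quiet_new, 1)) # → 1.6  (speedup on a quiet cluster)
```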
Apart from the performance difference, it's also clear that the new schema will be easier to use for entry-level employees, especially when the queries get more complex. This will likely result in fewer mistakes when using this data. As a bonus, you can also stop doing a lot of ETL work to fill the dimensions, since you don't have them anymore.
So why did this work?
In our case the useragent dimension has grown to about 1 GiB over the years, so while you might assume a query hits only the 618 MiB fact table, joining it against that dimension means you realistically read about 1.6 GiB.
Our schema contains four large dimensions. Sometimes we join several of them at once, pushing the total well past 1.6 GiB, while the new schema stays at 968 MiB.
It's likely the numbers would look very different without compression. A star schema inherently stores data in a more compact form even before any compression is applied (mostly numeric keys instead of full strings), but this advantage is largely lost once you use lightweight compression like Snappy, which squeezes most of the redundancy out of the repeated strings anyway.
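You can see this effect with a quick sketch. Here Python's zlib stands in for Snappy (both are lightweight LZ-family compressors), and the table sizes and useragent strings are made up for illustration:

```python
import random
import zlib

random.seed(42)

# Made-up example: 100k fact rows, each referencing one of 50 useragent strings.
agents = [f"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 variant {i}"
          for i in range(50)]
keys = [random.randrange(50) for _ in range(100_000)]

# Star schema: the fact table stores a small integer key per row.
star = b"".join(k.to_bytes(4, "big") for k in keys)
# Flat schema: every row carries the full useragent string.
flat = "\n".join(agents[k] for k in keys).encode()

raw_ratio = len(flat) / len(star)
compressed_ratio = len(zlib.compress(flat)) / len(zlib.compress(star))
print(f"raw:        flat is {raw_ratio:.1f}x the size of the keyed fact table")
print(f"compressed: flat is {compressed_ratio:.1f}x")
```

Uncompressed, the flat layout is an order of magnitude larger than the keyed layout; after compression, the gap shrinks dramatically because the compressor deduplicates the repeated strings, which is exactly why the on-disk penalty we measured stayed modest.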