Archives For Hadoop

Back in 2008, when we took our first baby steps into the Hadoop world, we started with some custom MapReduce jobs. Later we used Pig, and finally we moved completely to Hive. Since our legacy RDBMS were nicely designed star schemas that had served us well, we have so far always stuck with this setup.


But as the dimensions grew over time, it became more and more obvious that the star schema did not perfectly match the realities of big data. Although you can easily do massive joins, they end up needing lots of resources. Sometimes there are skews in the data, which can get you into skewed-join trouble. This left us with a question: what would happen if we removed all of the joins and went to a completely flat design, or at least as flat as possible? Our assumption was that this would explode the fact table in size, but we wanted to test how large the increase would be and whether it would, in the end, mean less I/O than doing the joins.
