Steps To Big Data: Hello, Pail

Just before taking my first anniversary holiday since joining communityengine, I started a new role in the recommendation team. The team is relatively new, so we are open to using virtually any technology.

In the book Big Data (MEAP), Nathan Marz describes a framework called Pail (dfs-datastores), a data storage solution built on top of Hadoop. It supports schemas, merging small files into larger chunks for better HDFS performance, and more.

We started to use Pail for our data collection. So what have we done?

  • a data schema defined in protobuf
  • an implementation of PailStructure
  • a job to run the data collection
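
To give a feel for the second item, here is a minimal sketch of the kind of contract a PailStructure implementation fulfils: serialize a record, deserialize it, and pick the sub-directory it is stored under. It makes several assumptions. The nested PailStructure interface is a local stand-in modelled on the one in dfs-datastores so that the example compiles without the library, and its signatures should be checked against the real one. EventStructure and the "type|payload" string record are hypothetical and stand in for our protobuf schema.

```java
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;

public class PailSketch {

    // Local stand-in modelled on dfs-datastores' PailStructure (an assumption:
    // verify the method signatures against the library before relying on them).
    interface PailStructure<T> extends Serializable {
        boolean isValidTarget(String... dirs);
        T deserialize(byte[] serialized);
        byte[] serialize(T object);
        List<String> getTarget(T object);
        Class<?> getType();
    }

    // Hypothetical structure for records of the form "type|payload".
    // A real implementation would wrap the protobuf-generated class instead.
    static class EventStructure implements PailStructure<String> {

        // Accept any target path that has at least one directory level.
        public boolean isValidTarget(String... dirs) {
            return dirs.length > 0;
        }

        public String deserialize(byte[] serialized) {
            return new String(serialized, StandardCharsets.UTF_8);
        }

        public byte[] serialize(String event) {
            return event.getBytes(StandardCharsets.UTF_8);
        }

        // Partition records by event type, so each type gets its own sub-directory.
        // Records without a usable type prefix go to "unknown".
        public List<String> getTarget(String event) {
            int sep = event.indexOf('|');
            String type = sep <= 0 ? "unknown" : event.substring(0, sep);
            return Collections.singletonList(type);
        }

        public Class<?> getType() {
            return String.class;
        }
    }

    public static void main(String[] args) {
        EventStructure structure = new EventStructure();
        String event = "click|item-42";

        // Round-trip the record through its byte representation.
        byte[] bytes = structure.serialize(event);
        System.out.println(structure.deserialize(bytes));

        // Show which sub-directory the record would be written to.
        System.out.println(structure.getTarget(event));
    }
}
```

Partitioning by a field of the record is only one possible choice; returning an empty list from getTarget would keep all records in a single directory.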

Ready to dig deeper? Let’s get started.
