Plotting and visualizing large Hadoop datasets with Python Datashader

When dealing with a lot of data, it is not easy to visualize it on a standard plot. For instance, if you plot coordinate data (like the NYC taxi dataset), the picture is quickly overwhelmed by overlapping points (see below).

The purpose of this post is to show a scalable way to visualize and plot extremely large datasets using a great Python library called Datashader (from ...


Handle 200 GB of data with an AWS EC2 Hadoop cluster

As explained in the previous post, the Hadoop ecosystem can be installed manually (often called vanilla Hadoop) to see how things are tied together and how MapReduce's parallelism works. But when it comes to dealing with a large amount of data, to avoid the installation and configuration overhead you would probably be better off using one of the popular Hadoop distributions such as Cloudera, Hortonworks ...


Set up a multi-node Hadoop cluster on AWS EC2: a working example using Python with Hadoop Streaming

Many Hadoop distribution vendors like Cloudera or Hortonworks offer a package including all the Hadoop-related projects such as Hive, Spark, etc. But for learning purposes, it can be useful to know how to install a multi-node Hadoop cluster manually, to see how components like HDFS or MapReduce work. This post will walk you step by step through setting up a multi-node Hadoop cluster ...


Multi-Objective Shortest Path Algorithms for Multi-Transfer Flight Routes

In this final post of the three-part series, I will describe different algorithms that exist for the Multi-Objective Shortest Path problem and how they apply to the multi-transfer flight routes explained in the previous post. I also came up with a simple, easy-to-understand algorithm for this problem. Before describing the algorithms, let's have an overview of what has been done in this field by looking at some of the ...