Getting Started with Apache Zeppelin and Airbnb Visuals

I’ve been playing around with Apache Zeppelin for a few months now (though initially it was less playing and more frustration just getting it to work). After using it consistently for a while, I find it incredibly useful for data visualization and business intelligence purposes.

Apache Zeppelin is self-described as “a web-based notebook that enables interactive data analytics”. Imagine it as an IPython notebook for interactive visualizations, but supporting more languages than just Python for munging your data. After getting PySpark working on it, I find it incredibly useful for displaying business data and analytics. Right now it only has a handful of chart options: bar graphs, line graphs, pie charts, and scatter plots. It’s also currently an Apache incubator project and fully open source!

 

SF Neighborhoods Grouped by Average Price Per Number of Beds on Airbnb

On a business and company level, I’ve found that Zeppelin is probably the best way to introduce a new visualization tool, because queries can be written in Spark SQL. SQL comes naturally to analysts and most product managers, so once the data is loaded and formatted accordingly, anyone who knows SQL can play around with building their own visualizations.

Ultimately it looks and functions a bit like Tableau, minus the thousands of dollars for a Tableau license. And as the data grows, the hope is that Zeppelin’s speed and functionality scale along with it when running on a cluster with Spark. I believe the end goal is to run huge amounts of data through it and potentially visualize billions of data points, with Spark doing most of the heavy lifting.

Well, how does it work?! I’ll show a quick demo of the install and then some initial code. Here’s the link to the current Zeppelin GitHub repo.

Installing Zeppelin and getting it set up and working is kind of a headache, but let’s give it a try.

Installing Spark and PySpark

If you care about getting PySpark working on Zeppelin, you’ll have to download and install Spark (which includes PySpark) manually. My friend Alex created a pretty good tutorial on how to install Spark here. I’ll go over the initial part quickly.

First we have to download the latest version of Spark, which we’ll install into /usr/local/bin.

Then we’ll change into that directory and decompress the archive:

cd /usr/local/bin
sudo tar -zxf spark-1.5.2-bin-hadoop2.6.tgz -C /usr/local/bin

Awesome. Now that Spark is downloaded and installed, we can set and export the SPARK_HOME and PYTHONPATH variables in bash.

export SPARK_HOME=/usr/local/bin/spark-1.5.2-bin-hadoop2.6
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH

These variables need to be set and exported in each new session you run (adding them to your ~/.bashrc saves you from retyping them). This caused me a bit of a headache over the past few months: more than once, the only reason things weren’t working was that PYTHONPATH wasn’t set correctly.
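
Before moving on, it’s worth a quick sanity check that the paths resolve and pyspark is importable. A minimal sketch, run with the plain python interpreter outside of Zeppelin (the app name is just a placeholder):

from pyspark import SparkContext

# if SPARK_HOME and PYTHONPATH are set correctly, this import and local context will just work
sc = SparkContext("local[2]", "sanity-check")
print(sc.parallelize(range(10)).sum())  # should print 45
sc.stop()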

Installing Zeppelin

Before installing Zeppelin, make sure you have the necessary dependencies installed:

sudo apt-get update
sudo apt-get install git
sudo apt-get install openjdk-7-jdk
sudo apt-get install npm
sudo apt-get install libfontconfig

# install maven
wget http://www.eu.apache.org/dist/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
sudo tar -zxf apache-maven-3.3.3-bin.tar.gz -C /usr/local/
sudo ln -s /usr/local/apache-maven-3.3.3/bin/mvn /usr/local/bin/mvn

pip install py4j #Necessary for pyspark

Navigate to the directory you want and clone the repository. After that, cd into the folder and build with Maven. There are a couple of build profiles to choose from, but I just built against Spark 1.5 for local use.

git clone https://github.com/apache/incubator-zeppelin/
cd incubator-zeppelin/
mvn clean package -Pspark-1.5 -DskipTests
./bin/zeppelin-daemon.sh start

Awesome, now go to http://localhost:8080/ and you should see the Zeppelin homepage.

Zeppelin Homepage

If you happen to get an error when installing Zeppelin, you can email me or tweet at me @racketracer, though I likely won’t have the answer either, so checking the user community will help as well. Just the other day I tried setting it up on another server and it didn’t work, so at this point I almost think the install is a crapshoot. This Stack Overflow question also helped me a lot.

Getting Started with Apache Zeppelin Notebook

With everything set up correctly, we can open a new notebook and start writing some code. I grabbed the Airbnb dataset from the website Inside Airbnb: Adding Data to the Debate. It has a plethora of information on Airbnb listings from cities all across the world. We’re going to explore the San Francisco data to see if we can visualize any trends or interesting factoids from all of the listings and prices. Each row in the dataset is an Airbnb listing in San Francisco with numerous features like neighbourhood, price, review score, etc. Get the data here.

In the first cell, write %pyspark to tell Zeppelin that the cell’s code will be written in PySpark. Scala, Java, and other languages also work for ingesting the data.

%pyspark
from pyspark.sql.types import *
from pyspark.sql import functions 
import pandas as pd

path = "/path_to_dataset/sf_listings_zeppelin.csv"

data = pd.read_csv(path)
df = sqlContext.createDataFrame(data)
df.registerTempTable("airbnb")

I’m reading the data with pandas for ease, then creating a Spark SQL DataFrame from it and registering it as a temporary table in memory. Note that Spark is really meant for datasets that are too large to fit in memory with pandas on your laptop, so while this isn’t best practice, it works for our situation since the dataset is only around 7,000 rows. Also, while in a standalone Spark application you would have to declare sqlContext yourself (wrapping the Spark context), in Zeppelin this isn’t necessary.
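
For reference, here’s roughly what that setup looks like outside of Zeppelin. This is just a sketch of a standalone PySpark 1.5-style script (the app name is a placeholder); in a %pyspark cell you can skip it because sc and sqlContext are already provided:

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Zeppelin creates these two objects for you; a standalone script has to build them itself
sc = SparkContext("local[*]", "airbnb-demo")
sqlContext = SQLContext(sc)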

airbnb = sqlContext.sql("""
SELECT 
    CAST(id as INT) AS                          id, 
    CAST(host_id as INT) AS                     host_id, 
    CAST(host_response_rate AS INT) AS          host_response_rate, 
    CAST(host_acceptance_rate AS INT) AS        host_acceptance_rate, 
    CAST(host_listings_count AS INT) AS         host_listings_count,
    CAST(host_total_listings_count AS INT) AS   host_total_listings_count, 
    CAST(latitude AS FLOAT) AS                  latitude, 
    CAST(longitude AS FLOAT) AS                 longitude,
    CAST(accommodates AS INT) AS                accommodates,
    CAST(bathrooms AS FLOAT) AS                 bathrooms,
    CAST(beds AS INT) AS                        beds,
    CAST(minimum_nights AS INT) AS              minimum_nights,
    CAST(maximum_nights AS INT) AS              maximum_nights,
    CAST(availability_365 AS INT) AS            availability_365,
    CAST(number_of_reviews AS INT) AS           number_of_reviews,
    CAST(first_review AS DATE) AS               first_review,
    CAST(last_review AS DATE) AS                last_review,
    CAST(review_scores_rating AS INT) AS        review_scores_rating,
    CAST(review_scores_accuracy AS INT) AS      review_scores_accuracy,
    CAST(review_scores_cleanliness AS INT) AS   review_scores_cleanliness,
    CAST(review_scores_checkin AS INT) AS       review_scores_checkin,
    CAST(review_scores_communication AS INT) AS review_scores_communication,
    CAST(review_scores_location AS INT) AS      review_scores_location,
    CAST(review_scores_value AS INT) AS         review_scores_value,
    CAST(reviews_per_month AS FLOAT) AS         reviews_per_month,
    CAST(price AS INT) AS                       price,
    CAST(weekly_price AS INT) AS                weekly_price,
    CAST(monthly_price AS INT) AS               monthly_price,
    CAST(security_deposit AS INT) AS            security_deposit,
    CAST(cleaning_fee AS INT) AS                cleaning_fee,
    CAST(guests_included AS INT) AS             guests_included,
    CAST(extra_people AS INT) AS                extra_people,
    host_response_time, host_is_superhost, host_neighbourhood, host_identity_verified, neighbourhood, property_type, room_type, bed_type, instant_bookable,
    cancellation_policy, require_guest_profile_picture, require_guest_phone_verification
FROM airbnb
""")

airbnb.registerTempTable("sf")
airbnb.show(1)

Here I’m writing a SQL query that casts all the numeric columns to integers and floats, just to be safe, and grabs the relevant columns out of the dataset. sqlContext.sql lets us write SQL against the registered table and assign the result to a variable, in this case called “airbnb”. After that, I register the result as a temporary SQL table again (as “sf”) so that we can query it in the next cells.
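
A quick optional check in the same %pyspark cell confirms the casts took effect and the temp table is queryable (a small sketch, nothing more):

%pyspark
airbnb.printSchema()  # numeric columns should now show up as int/float, review dates as date
print(sqlContext.sql("SELECT COUNT(*) FROM sf").collect())  # sanity-check the registered table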

A screenshot of what each cell should look like

Visualizations with SQL

Now it’s time to just write SQL to get some interesting visualizations out of the dataset. First we create a new cell and annotate it with %sql at the top, like we did with %pyspark in the first cell. This tells Zeppelin that the interpreter for the current cell is Spark SQL. The SQL syntax follows the same conventions as Hive.

Let’s try a query to get the average acceptance rates of Airbnb listings by the respective SF neighbourhoods.

%sql
SELECT AVG(host_acceptance_rate) avg_acceptance, neighbourhood
FROM sf
GROUP BY neighbourhood
ORDER BY 1 DESC

Press SHIFT+ENTER and the cell will run and save the notebook at the same time. You’ll see a toolbar pop up with the available graphs and options. If you play around with each graph type, you can generally picture which one fits the data you have. Since we’re representing neighbourhoods and their average acceptance rates, I tried the stacked line graph. You can also drag and drop the fields from the SELECT query into three different tabs.

Screenshot of the SQL query graphed by neighbourhood acceptance rate. Very colorful.

The “Keys” tab sets the x-axis of the graph, so I dragged and dropped the neighbourhood field into it. “Groups” separates the different values in a field into different colors and groupings, and “Values” displays numeric values on the y-axis. Because the aggregation has already happened in the SQL query, it doesn’t really matter whether the avg_acceptance value is set to SUM, COUNT, etc.; since every key maps to a single value in the result, no further aggregation is necessary. We could also reproduce this exact visualization by selecting the raw rows in SQL without doing an AVG(host_acceptance_rate) and then choosing AVG on the light blue value button. However, since Zeppelin limits each query to 1,000 rows, we wouldn’t get all of the data that way.
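
If you’d rather keep the aggregation in PySpark, the same pre-aggregation can be done with the DataFrame API before a %sql cell ever touches the data. A small sketch against the airbnb DataFrame defined earlier (the table and variable names here are just placeholders):

%pyspark
from pyspark.sql import functions as F

# pre-aggregate to one row per neighbourhood, well under the 1000-row display limit
acceptance = (airbnb
              .groupBy("neighbourhood")
              .agg(F.avg("host_acceptance_rate").alias("avg_acceptance"))
              .orderBy(F.desc("avg_acceptance")))
acceptance.registerTempTable("acceptance_by_neighbourhood")

A %sql cell can then simply SELECT * FROM acceptance_by_neighbourhood and chart the result.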

Zeppelin Input Boxes

Another example to try out is integrating input values into the SQL through text boxes. For example, note the query below and the bracketed ${...} parameters.

%sql
-- Average price grouped by neighbourhood and number of beds

WITH subquery AS(
    SELECT AVG(price) PRICE, neighbourhood, beds, COUNT(1) AS cnt
    FROM sf
    WHERE beds <= ${max_accommodations=4}
     AND price < 1000
    GROUP BY neighbourhood, beds
)
SELECT * 
FROM subquery q2
INNER JOIN (
    SELECT DISTINCT(neighbourhood)
    FROM subquery q1
    WHERE q1.cnt > ${min_count_of_listings=50}
) AS distinct_neigh
ON distinct_neigh.neighbourhood = q2.neighbourhood
WHERE cnt > 15

This query grabs the average price per number of beds in each SF neighbourhood. However, because I don’t want neighbourhoods without enough listings to skew the results, I set a “minimum count of listings” input box in the second subquery, defaulted to 50.

The ${variable_name=50} syntax creates an input box named after the variable (here “min_count_of_listings”) and lets me change the default value dynamically without editing the query.

Notice the input boxes of max_accommodations and min_count_of_listings
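
The PySpark interpreter has a similar dynamic-forms hook through the ZeppelinContext object z. A hedged sketch, assuming the z.input form API is available in your build of Zeppelin (the variable names are just placeholders):

%pyspark
# read a value from a Zeppelin input box (falls back to 4 if untouched), then reuse it in SQL
max_beds = int(z.input("max_accommodations", 4))
query = ("SELECT neighbourhood, AVG(price) AS avg_price FROM sf "
         "WHERE beds <= {0} GROUP BY neighbourhood").format(max_beds)
by_neighbourhood = sqlContext.sql(query)
by_neighbourhood.show(5)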

The last thing I’ll try is a scatter plot. As of now, Zeppelin also limits the amount of data that can be visualized to 1,000 data points, so scatter plots aren’t as useful as they could be.

%sql
SELECT review_scores_rating, price
FROM sf
WHERE review_scores_rating > "${min_review_score=0}"
    AND price < "${price_max=1000}"

Scatter plot displaying Airbnb review score vs price of listings

We can see a general trend of better-reviewed listings costing more money, as common sense would suggest. But graphs like these really help with identifying features in the data that we could later use when running machine learning models to predict the price of each individual Airbnb listing.
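
To put a rough number on that relationship, the DataFrame API can compute the correlation directly. A quick sketch using the stat helpers available since Spark 1.4 (I’m not asserting a particular value; the sign just tells you the direction of the trend):

%pyspark
# drop rows missing either field, then compute the Pearson correlation between score and price
scored = airbnb.na.drop(subset=["review_scores_rating", "price"])
print(scored.stat.corr("review_scores_rating", "price"))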

Tips and Tricks

There are a couple of bullet-point tips I’ll add.

  • Optimize queries for visualizations: Think about what you want to visualize before you write your query. For example, if you want the average price by number of beds and neighbourhood, it’s smarter to group by those fields in the query than to rely on the grouping function in the UI. Essentially, imagine building the graph from the rows of your result table, as if you were charting in Excel: the SQL result should already be well segmented and selected, so avoid doing a SELECT *.
  • Keep all data wrangling in one cell: Doing all of the data cleaning and formatting before writing SQL in other cells helps with the overall organization of the notebook.
  • Sort your data so that the first row isn’t null: I’ve noticed a bug in Zeppelin that currently nullifies an entire column in the chart if the first row in that column is null. Arrange your data so that every column has a value in its first row to get around this bug; a small sketch of one workaround follows below.
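
Here’s a hedged sketch of that last workaround. Rather than sorting, one blunt way to guarantee the first row of every charted column has a value is to drop the null rows before registering the table that the %sql cell will read (the column list and table name here are just examples):

%pyspark
# drop listings missing the fields we plan to chart, so no column starts with a null
charted = airbnb.na.drop(subset=["review_scores_rating", "price", "neighbourhood"])
charted.registerTempTable("sf_nonnull")

A %sql cell can then chart from sf_nonnull instead of sf.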

Conclusion and Business Case

By segmenting data out like this in a notebook-style visualization layer, Zeppelin is well on its way to an awesome user experience for business needs. If you put a notebook up on a server, multiple people can edit it and append their own visualizations based on the existing DataFrames. In a BI sense, I’ve found Zeppelin particularly useful for visualizing time series data, since new data can be continuously fed in through cron jobs and SQL pulls.

Try Zeppelin out today and tell me what you think.

As always, connect with me on LinkedIn, Twitter, Email or leave a comment below!

3 comments

  1. Great writeup!! One addition I might suggest is to use SparkSession.read() with the inferSchema = true option, to avoid the boilerplate casting code!

  2. Hi mate,
    I am trying to use the built-in visualisation of Zeppelin when retrieving data from a Cassandra keyspace using the spark-cassandra connector.

    import org.apache.spark.sql.cassandra.CassandraSQLContext
    import org.apache.spark.sql.cassandra._
    import org.apache.spark.sql
    val csc=new CassandraSQLContext(sc)
    import sqlContext.implicits._
    val rdd1=csc.sql("SELECT * from q_data.qr_ep_data limit 100")
    ///rdd1.registerTempTable("epdata")

    rdd1.registerTempTable("epdata")

    %sql
    SELECT * from q_data.qr_ep_data limit 10
    Error::171: error: not found: value %
    %sql

    I am not able to use %sql; it throws the error above. Any help is much appreciated.
