I was planning my trip to Amsterdam in January and was looking through hostels on Hostelworld, filtering for different features and amenities. One amenity I knew I would definitely need was free wifi, both so I could do some programming from the hostel and just because life demands it in general. While there are a ton of hostels that offer free wifi, I've definitely been on the wrong end of the stick where the quality of the wifi was unmentionably bad. This probably goes for hotels as well as hostels, but hostels are generally cheaper and offer less in the way of complimentary services.

That got me thinking about building an application that could judge the quality of wifi from reviews. On a whim, I decided to spin up a scraper/API for Hostelworld that could actually find the reviews mentioning wifi and other useful amenities. Instead of meticulously scanning through hundreds of reviews, I could just scrape the reviews, parse out keywords, and assign sentiment scores to each review.

Eventually I made it into a Twitter Bot:

Heh, I am replying to myself. Try it out yourselves! Mention @HostelReviewBot and link a hostel from Hostelworld and include the word wifi, breakfast, noise, bathroom, or shower.

Positive and negative refer to the number of positive- and negative-sentiment reviews respectively. The quote is the phrase that contains the most of the common words found when scraping the reviews. Overall there's too much information to stuff into 140 characters, which is quite a shame. I should learn how to create a quick Flask API. Maybe that's for later.
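If I do get around to the Flask API, it could be as simple as one endpoint that takes a Hostelworld URL and an amenity and returns the full summary as JSON. A rough sketch (analyze is a hypothetical helper that would wrap the scraping and NLP steps below):

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/summary')
def summary_endpoint():
    hostel_url = request.args.get('url')
    amenity = request.args.get('amenity', 'wifi')
    #analyze() is hypothetical: scrape the reviews + run the NLP and return the summary dict
    return jsonify(analyze(hostel_url, amenity))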

Let’s go through a quick tutorial of some python stuff.

amenities = {
'wifi':      ['wifi','internet','wi-fi', 'wi fi', 'wireless'],
'breakfast': ['breakfast', 'breakfest', 'break fast', 'brunch'],
'bathroom':  ['bathroom', 'bath room', 'bath', 'restroom', 'toilet', 
              'urinal', 'lavatory', 'washroom'],
'shower':    ['shower', 'bathe', 'showers'],
'noise':     ['noise', 'noisy', 'quiet', 'loud', 'silent']
}

The idea is to first create a list of amenities that we would like to track for each hostel. I can think of five pretty important things that a hostel or hotel should have which aren't rated on Hostelworld or on another review site like TripAdvisor. amenities is a dictionary where each key is an amenity and the value is a list of synonyms (and common misspellings) that reviewers might use to describe it. That way, if someone writes "wi fi" or "wi-fi" instead of "wifi", or just says "internet", we can still track their opinion. I welcome any more ideas for things to track that are currently ambiguous or require actually reading through the reviews.

Scraping

import requests
from lxml import html

def request_xml(url):
    """ Passes in a url and returns the xml of the page """
    response = requests.get(url)
    xml = html.fromstring(response.text)
    return xml

def find_end(xml):
    """ Passes in xml and returns the number of pages to scrape.
        Grab at most the first 25 pages (500 reviews) for relevancy.
    """
    num_reviews = int(xml
        .xpath("//div/div[@class='results']/text()")[0]
        .split(' ')[0].split('(')[1]) #extract total number of reviews
    total_pages = -(-num_reviews // 20) #20 reviews per page, rounded up
    pages = min(total_pages, 25) + 1 #+1 because the scraping loop stops before this value
    return pages

url =       "http://www.hostelworld.com/hosteldetails.php/Black-Swan/Barcelona/66913/reviews/"
key =       'wifi' #amenity keyword to analyze
first_url = url + "1?period=all"
xml =       request_xml(first_url)
pages =     find_end(xml)

I'm using the requests library to grab the HTML from Hostelworld. Here I've picked the Black Swan hostel in Barcelona as an example and taken the URL from its reviews page.

lxml is a useful package for parsing an HTML page into an XML tree so we can use XPath to select elements. By grabbing the total number of reviews from the original base URL, we can work out how many pages we need to scrape when the review count is under 500; otherwise we just go from page 1 to 25, since Hostelworld lists 20 reviews per page.

import pandas as pd

def scrape_to_df(base_url, pages):
    """
    Takes a base url and number of pages and returns a dataframe
    with each row representing a review. Columns are:
    rating, review, and page
    """
    df = []
    for i in xrange(1, pages):
        url = base_url + str(i) + "?period=all"
        xml = request_xml(url)
        reviews = xml.xpath('//div[@class="microreviews rounded"]')
        for review in reviews:
            df.append({
             'rating': int(review
                 .xpath('.//div/text()')[1].replace('%', '')),
             'review': ''.join(review
                 .xpath('.//div/p/text()')).strip(),
             'page': i
            })
    return pd.DataFrame(df) 

Each review comes with a rating from 1-100 that we want to grab. We can later subset the reviews by our specific keyword (in this case "wifi") and see what the average rating of those reviews is. Chances are that if the reviews mentioning "wifi" average noticeably lower than the overall rating, the hostel probably doesn't do well on that amenity. I used pandas to store the data for ease of use.
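As a rough sketch of that comparison (assuming the dataframe above is called df; the real code later subsets on the parsed key column rather than a raw string match):

hostel_avg = df['rating'].mean() #overall average rating for the hostel
wifi_avg = df[df['review'].str.contains('wifi', case=False)]['rating'].mean() #average for reviews mentioning wifi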

Here's an example of what the dataframe looks like (note that I've already subsetted out the unrelated reviews):

Excel screenshot of the head of the dataframe. I really need to integrate IPython notebooks.

Natural Language Processing

There are a couple of different interesting ways we can approach text analysis when looking at reviews. I haven't read the entire NLTK tutorial guide yet, so the approaches I can think of are the ones I'd use if I were doing the whole thing manually:

  • Averaging the subsetted reviews: As mentioned before, we can subset the reviews down to those that mention the keyword, take their average rating, and compare it against the overall average.
  • N-grams: Another possibility is to find common two- or three-word phrases. The problem is that there may not be enough reviews to extract these phrases. Also, since we chained synonyms for the keyword "wifi", we could theoretically replace each synonym we find with "wifi" and look for repeated n-grams, but with maybe 10-20 reviews actually describing the wifi, that's probably not enough to be significant.
  • Common words: An alternative to n-grams is to find common individual words in the reviews. These are essentially uni-grams (so technically still n-grams), but we can filter out the unimportant words that don't really contribute to the conversation when selecting them.
  • Sentiment analysis: Sentiment analysis is always a bit tricky to do programmatically because it means curating our tokenization and dictionary with keywords specific to our purpose. If we individually read each review that contained "wifi", we could probably get a sense of how each reviewer felt about the wifi, tally up the good and bad reviews in our head, and form an overall impression. Programmatically, though, it's much harder to assign sentiment without a huge amount of customization for the product at hand. For example, take "too gimmicky for them to need a 'like' on Facebook to use their wifi". We understand that as mildly negative, even though it doesn't really describe whether the wifi is good quality or not, but a machine could read it as a great review if it sees the keywords "like" and "wifi" in the same sentence and doesn't understand Facebook "likes" as gimmicky advertising. In the end, I decided to use a package called TextBlob, which is a wrapper around NLTK and gives general sentiment scores for sentences and phrases (see the quick sketch after this list). It's a pretty good solution for general-purpose projects, and you don't have to dig through positive and negative sentiment dictionaries by hand.
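For a quick sense of what TextBlob gives you (the exact polarity value depends on TextBlob's bundled model, so treat the number as illustrative):

from textblob import TextBlob

phrase = "too gimmicky for them to need a like on facebook to use their wifi"
polarity = TextBlob(phrase).sentiment.polarity #a float in [-1, 1]; whether this reads as negative depends on TextBlob's lexicon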

Okay, let's try extracting the key phrases first. For each review, we first need to find the sentence or phrase where our keyword "wifi" is mentioned. To do this, we split the review into chunks of phrases using delimiters such as commas, periods, and so on. After that, we check whether "wifi" exists in the list of phrases, and if it does, we parse it out into its own column called "wifi" in the dataframe. I used the apply method here and passed in the function get_key_sentence, which checks whether the key exists and returns None if it doesn't.

import re

def count_amenities(hostel, key):
    """ If key/amenity found in review, apply phrase in key column """
    hostel[key] = hostel.apply(lambda x: 
        get_key_sentence(x['review'], amenities[key]), axis=1)
    return hostel
    
def get_key_sentence(x, key_list):
    """Takes a review and a bag of words associated with the key.
       Returns the first phrase in the review containing one of the bag of words.
    """
    delimiters = ',', '.', ';', '!', '?'
    sentences = split(delimiters, x, maxsplit=0)
    for sent in sentences: 
        for word in sent.split(): #loop through words in phrase
            if word.lower() in key_list:
                return sent.lower().strip() #return phrase

def split(delimiters, string, maxsplit=0):
    """ Takes a collection of delimiters and splits a paragraph string
        into a list of phrases """
    regexPattern = '|'.join(map(re.escape, delimiters))
    return re.split(regexPattern, string, maxsplit)

Awesome, so now we have a dataframe we can subset down to just the reviews that mention "wifi", with the individual sentences and phrases containing "wifi" isolated. Let's apply TextBlob to each of those phrases to get numerical sentiments.

import numpy as np  
from textblob import TextBlob
import nltk
from nltk.util import ngrams
from nltk.collocations import *
from collections import Counter

subset = hostel.dropna() #drop reviews not mentioning key
subset.reset_index(inplace=True) 

#Apply sentiment values to each phrase 
subset['sentiment'] = subset[key].apply(lambda x: 
    TextBlob(x).sentiment.polarity)

Cool, so we got sentiment analysis pretty easily. Note that this is probably the hackiest and easiest way to do it, since a library wrapper is already packaged around NLTK. It works well enough when we're okay with roughly 80% of it being correct and can live with the remaining 20% being completely off. The next step is grabbing common uni-gram words.

def count_words(word_freq, sent, stopwords, list_key):
    """Takes in a dictionary, sentence or phrase, stopwords, and bag of words
       and tallies frequencies for words not in the stopwords or the bag of words
       to find the most common words in the reviews
    """
    for word in sent.split():
        if word not in stopwords and word not in list_key:
            if word not in word_freq:
                word_freq[word] = 1
            else:
                word_freq[word] += 1

def parse_reviews(subset, key):
    """ Takes in a dataframe and key
        Returns a dictionary of word frequencies across the reviews
        where the key was found
    """
    word_freq = {}
    stopwords = nltk.corpus.stopwords.words('english')
    for i in xrange(0, len(subset)): #loop through each review
        count_words(word_freq, subset[key][i], stopwords, amenities[key])
    return word_freq

word_freq = parse_reviews(subset, key)
d = Counter(word_freq)
phrase_words = [x[0] for x in d.most_common(3)] #top 3 keywords describing the reviews

Here nltk provides stopwords, which are essentially filler words like "is", "the", "I", and so on that should be filtered out. We want to count occurrences of every word besides the stopwords, and we do that by passing in a dictionary and tallying counts (note that NLTK has its own utility for this, but I didn't figure it out in time). Since dictionaries are mutable and passed by reference in Python, we can pass word_freq into count_words and it will be updated in place without having to return anything. Then we use the collections library, which gives us an easy way to grab the top three most common words.
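For what it's worth, the NLTK utility I was thinking of is probably nltk.FreqDist, which would collapse the manual tally into something like this (a minimal sketch, assuming the same subset dataframe and stopword filtering as above):

import nltk
from nltk import FreqDist

stopwords = set(nltk.corpus.stopwords.words('english'))
words = [word for phrase in subset[key] for word in phrase.split()
         if word not in stopwords and word not in amenities[key]]
top_words = FreqDist(words).most_common(3) #FreqDist behaves like a Counter in NLTK 3+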

summary = {
   'phrase_words': d.most_common(3), #top 3 common keywords and their counts
   'hotel_avg': np.mean(hostel['rating']), #average rating of the hostel
   'key_avg':   np.mean(subset['rating']), #average rating of reviews specific to key
   'num':       len(hostel), #number of reviews at hostel
   'mean':      np.mean(subset['sentiment']), #average sentiment of review related to key
   'positive':  len(subset[subset['sentiment'] > 0]), #number of positive reviews
   'negative':  len(subset[subset['sentiment'] < 0]), #number of negative reviews
   'zero':      len(subset[subset['sentiment'] == 0]), #number of zero sentiment reviews
   'max_val':   {
       'num': subset.loc[subset['sentiment'].idxmax()]['sentiment'], #sentiment rating for best review
       'phrase': subset.loc[subset['sentiment'].idxmax()][key] #text for best review
    },
   'min_val':   {
       'num': subset.loc[subset['sentiment'].idxmin()]['sentiment'], #sentiment rating for worst review
       'phrase': subset.loc[subset['sentiment'].idxmin()][key] #text for worst review
    },
   'common_phrase': {
       'phrase': '', 
       'num': -1
    }
}
# Find the review with the most number of common words aggregated from all key reviews
for phrase in subset[key]:
    num_words = len([x for word in phrase.split() for x in phrase_words if x in word])
    if  num_words > summary['common_phrase']['num']:
        summary['common_phrase'] = {
            'phrase': phrase, 
            'num': num_words
        }

This humongous dictionary called summary is the object of analysis, spanning the most positive phrase, the most negative phrase, positive and negative sentiment counts, and rating averages. It's hard to store all of this information concisely enough to return it to the Twitter bot; each averaged metric should be fairly self-explanatory, but in the end I only used a couple of them since Twitter maxes out at 140 characters. For common_phrase, I took the top three common words from word_freq and looked for the review that contained the most of the three.

{
'common_phrase': {
    'num':    2,
    'phrase': 'free breakfast and sometimes free dinner'},
 'hotel_avg': 94.708,
 'key_avg':   92.59375,
 'max_val': {
     'num':    0.80000000000000004, 
     'phrase': 'breakfast was great'},
 'mean':       0.24566998106060609,
 'min_val': {  
     'num': -0.27083333333333331,
     'phrase': "breakfast was weak and it's a bit expensive for being not too close to major sights of barce"},
 'negative': 3,
 'num':      500,
 'phrase_words': [('free', 17), ('cereal', 3), ('nice', 3)],
 'positive': 22,
 'zero': 7
}

The rest of the code is up on GitHub. I'll be sure to upload my Twitter API code as well once it's totally bug free. There's a lot that can be improved: I'd like to add concurrent requests (though without re-inventing Scrapy) and caching of previously seen hostels. There's also a lot more to do on the natural language processing side, since I've only touched the surface with some trivial techniques. As my knowledge of NLTK grows, I want to implement more customized features, because there's a ton of hidden gems in a wealth of information like hostel and hotel reviews. The only real problem is laziness.
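As a rough idea of what the concurrent requests could look like, here's a minimal sketch using a thread pool via multiprocessing.dummy, reusing request_xml, url, and pages from above (not the approach currently in the repo):

from multiprocessing.dummy import Pool #thread pool, works on Python 2 and 3

page_urls = [url + str(i) + "?period=all" for i in xrange(1, pages)]
pool = Pool(8) #fetch up to 8 pages at once
xml_pages = pool.map(request_xml, page_urls) #results come back in the same order as page_urls
pool.close()
pool.join()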

Please add comments and feedback.

Connect with me on LinkedIn, Twitter, Email