
Atliq Hotel Data Analysis In Pandas

Atliq Grands, a leading player in India's hospitality industry, is losing market share and revenue due to competitor moves and management inefficiencies. To counter this, the managing director plans to adopt Business and Data Intelligence through a third-party provider. Our goals include using Pandas in Jupyter Notebook for data exploration, cleaning, transformation, and deriving insights from historical data to guide strategic decisions.

Atliq Grands owns multiple five-star hotels across India and has operated in the hospitality industry for the past 20 years. Due to strategic moves by competitors and ineffective decision-making by management, the company is losing market share and revenue in the luxury/business hotel category. To regain both, the managing director wants to incorporate "Business and Data Intelligence." However, there is no in-house data analytics team to provide these insights, so the revenue management team decided to hire a third-party service provider to generate insights from the company's historical data.

[Image: Atliq Grands' data collection and storage process]

The image above shows Atliq Grands' end-to-end data collection and storage process. Our part of the project begins once the data has been duplicated, at which point we perform analysis and transformation.

[Image: the four phases of the analysis]
  1. Data Exploration:

    • Objective: Understand dataset structure and characteristics.

    • Process: Load data, use Pandas for initial insights, and visualize key features.

  2. Data Cleaning:

    • Objective: Ensure data quality by addressing missing values and inconsistencies.

    • Process: Handle missing values, standardize formats, remove duplicates, and address outliers.


  3. Data Transformation:

    • Objective: Prepare data for analysis by transforming variables and encoding categorical data.

    • Process: Engineer features, encode categorical variables, normalize numerical data, and handle temporal data.

  4. Deriving Insights from Data:

    • Objective: Extract actionable insights to guide strategic decisions.

    • Process: Conduct statistical analysis, visualize trends, apply machine learning, and interpret findings.

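The four phases above can be sketched end-to-end on a toy bookings table. This is only an illustration of the workflow: the values below are made up, and only the column names (room_category, no_guests, revenue_generated) come from the actual dataset.

```python
import pandas as pd

# Toy bookings table; values are illustrative, not the real dataset.
df = pd.DataFrame({
    "room_category": ["RT1", "RT2", "RT1", "RT2"],
    "no_guests": [2, -1, 3, 2],
    "revenue_generated": [9000.0, 12000.0, None, 11000.0],
})

# 1. Exploration: shape and summary statistics
print(df.shape)
print(df.describe())

# 2. Cleaning: drop invalid guest counts, fill missing revenue with the median
df = df[df.no_guests > 0].copy()
df["revenue_generated"] = df["revenue_generated"].fillna(df["revenue_generated"].median())

# 3. Transformation: encode the room category as a numeric code
df["room_code"] = df["room_category"].astype("category").cat.codes

# 4. Insight: average revenue per room category
avg_revenue = df.groupby("room_category")["revenue_generated"].mean()
print(avg_revenue)
```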

DATA CLEANING:

Problem: Exploring the dataset surfaced two significant issues. First, the revenue columns contained outliers: revenue values that were incorrect, recorded in the wrong currency, or missing (NA). Second, some rooms were booked with a negative number of guests. Both needed to be removed before performing any type of analysis or statistics.

Input:

df_bookings[df_bookings.no_guests<=0]
# As shown in the output below, a non-positive guest count represents a data error; we can ignore these records.

Output:

[Output: booking rows with a non-positive guest count]

Solution:

df_bookings = df_bookings[df_bookings.no_guests > 0]

# Keep only rows with a positive guest count;
# anything above zero is treated as a legitimate value.
df_bookings.revenue_generated.min(), df_bookings.revenue_generated.max()
# No hotel room costs 28 million, so the maximum is an outlier and needs fixing.
avg, std = df_bookings.revenue_generated.mean(), df_bookings.revenue_generated.std()
avg, std
higher_limit = avg + 3*std
higher_limit
lower_limit = avg - 3*std
lower_limit
df_bookings[df_bookings.revenue_generated <= 0]
# No revenue values below zero.
df_bookings[df_bookings.revenue_generated > higher_limit]
# Values above the higher limit are outliers and need to be removed.
higher_limit = df_bookings.revenue_realized.mean() + 3*df_bookings.revenue_realized.std()
higher_limit

df_bookings[df_bookings.revenue_realized>higher_limit]

# There are 1,299 rows greater than the higher limit, which is too many to
# discard blindly, so we may need to rethink the 3*std multiplier. Common sense
# says RT4, the presidential suite, will cost the most, so it is reasonable to
# judge outliers per room category rather than across the whole dataset.

df_bookings[df_bookings.room_category=="RT4"].revenue_realized.describe()

# we are looking at just RT4

# mean + 3*standard deviation
23439+3*9048

# Because the higher limit for RT4 alone comes out to 50,583 and the maximum
# revenue_realized for a suite in our data is 45,220 (below that limit),
# there are no outliers in the revenue_realized column.
  • Here the higher limit comes to 50,583, and in the dataframe above the max value for revenue realized is 45,220. Hence we can conclude there is no outlier and we don't need to do any data cleaning on this particular column.
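The per-category reasoning above generalizes: instead of one global limit, a mean + 3*std upper limit can be computed within each room category using groupby().transform(). A minimal sketch on synthetic revenue numbers (the injected 28,000,000 value mimics the corrupted record noted earlier; only the column names match the project):

```python
import numpy as np
import pandas as pd

# Synthetic revenue data: RT1 rooms around 10,000 and RT4 suites around 23,000.
rng = np.random.default_rng(0)
df_bookings = pd.DataFrame({
    "room_category": ["RT1"] * 100 + ["RT4"] * 100,
    "revenue_realized": np.concatenate([
        rng.normal(10000, 1000, 100),
        rng.normal(23000, 9000, 100),
    ]),
})
df_bookings.loc[0, "revenue_realized"] = 28_000_000  # injected outlier

# Per-category mean + 3*std upper limit, broadcast back onto each row
grp = df_bookings.groupby("room_category")["revenue_realized"]
upper = grp.transform("mean") + 3 * grp.transform("std")

outliers = df_bookings[df_bookings.revenue_realized > upper]
df_clean = df_bookings[df_bookings.revenue_realized <= upper]
```

transform() returns a Series aligned to the original rows, so each booking is compared against the limit of its own room category rather than a single dataset-wide threshold.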

Input:

df_bookings.isnull().sum()

# isnull() returns True or False for each value depending on whether it is NA

# NA values are acceptable in ratings_given, since not every guest leaves a rating
  • In aggregate bookings we find the columns that contain null values and fill them with the median capacity of each room


df_agg_bookings[df_agg_bookings.capacity.isna()]
# Assigning the result back avoids pandas' chained-assignment warning
# (column-level fillna with inplace=True is deprecated).
df_agg_bookings["capacity"] = df_agg_bookings["capacity"].fillna(df_agg_bookings.capacity.median())
df_agg_bookings.loc[[8, 14]]
  • In aggregate bookings we find records whose successful_bookings value is greater than capacity and filter those records out


df_agg_bookings[df_agg_bookings['successful_bookings'] > df_agg_bookings['capacity']]

# This shows the rows where the successful bookings value is greater than the capacity

df_agg_bookings = df_agg_bookings[df_agg_bookings.successful_bookings<=df_agg_bookings.capacity]
df_agg_bookings.shape

# We reverse the condition: filtering out successful bookings greater than
# capacity keeps only the bookings that are less than or equal to capacity.
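When dropping rows like this, it helps to count what is being removed before removing it. A small sketch using toy numbers in the same two columns (the middle row is deliberately invalid):

```python
import pandas as pd

# Toy aggregate-bookings rows; the middle row is deliberately invalid.
df_agg_bookings = pd.DataFrame({
    "successful_bookings": [25, 40, 20],
    "capacity": [30, 30, 25],
})

# Build the mask of invalid rows, count them, then drop them with ~mask.
bad = df_agg_bookings["successful_bookings"] > df_agg_bookings["capacity"]
n_dropped = int(bad.sum())
df_agg_bookings = df_agg_bookings[~bad]
```

Keeping the mask in a variable makes the cleaning step auditable: you can report n_dropped alongside the cleaned frame.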
DATA TRANSFORMATION:

Various types of Data Transformation Tactics we will be using for this dataset

  1. Creating new columns

  2. Normalization

  3. Merging data

  4. Aggregation


Creating occupancy percentage column

# Method 1: row-wise apply
df_agg_bookings['occ_pct'] = df_agg_bookings.apply(lambda row: row['successful_bookings']/row['capacity'], axis=1)

# Method 2: compute the same column, then attach it with assign()
new_col = df_agg_bookings.apply(lambda row: row['successful_bookings']/row['capacity'], axis=1)
df_agg_bookings = df_agg_bookings.assign(occ_pct=new_col.values)
df_agg_bookings.head(3)
  • The second snippet adds the same 'occ_pct' column to the df_agg_bookings DataFrame by an alternative route: it first calculates the occupancy percentage for each row as before, then attaches the resulting column with .assign().

Converting it to % value

df_agg_bookings['occ_pct'] = df_agg_bookings['occ_pct'].apply(lambda x: round(x*100, 2))
df_agg_bookings.head(3)
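The row-wise apply above works, but plain column arithmetic is the more idiomatic (and much faster) pandas route, and the division plus percentage rounding collapse into one vectorized step. A sketch on toy values in the same two columns:

```python
import pandas as pd

# Toy aggregate-bookings frame with the columns used in the project
df_agg_bookings = pd.DataFrame({
    "successful_bookings": [25, 28, 20],
    "capacity": [30, 30, 25],
})

# Vectorized alternative to the row-wise apply: divide the columns
# directly, then round to a percentage in one step.
df_agg_bookings["occ_pct"] = (
    df_agg_bookings["successful_bookings"] / df_agg_bookings["capacity"] * 100
).round(2)
```

Column-level operations run in compiled NumPy code, whereas apply(..., axis=1) calls the lambda once per row in Python.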
INSIGHTS GENERATION:

1. What is an average occupancy rate in each of the room categories?

Input:

df_agg_bookings.groupby("room_category")["occ_pct"].mean()

Matching RT1, RT2, etc. with room classes such as Standard, Premium, and Elite, along with the average occupancy percentage:

df = pd.merge(df_agg_bookings, df_rooms, left_on="room_category", right_on="room_id")
df.head(4)
df.drop("room_id",axis=1, inplace=True)
df.head(4)

# inplace=True modifies the DataFrame in place; this is important because
# without it, drop() returns a new DataFrame and the original is not updated.

df.groupby("room_class")["occ_pct"].mean()

df[df.room_class=="Standard"].occ_pct.mean()

Output:

57.88964285714285
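One caveat with the merge above: pd.merge defaults to an inner join, which silently drops any booking whose room_category has no match in df_rooms. A left join with indicator=True makes unmatched rows visible. A sketch with toy frames and a deliberately unmatched code (RT9 is hypothetical):

```python
import pandas as pd

# Toy versions of the two frames; RT9 is a deliberately unmatched code.
df_agg_bookings = pd.DataFrame({
    "room_category": ["RT1", "RT2", "RT9"],
    "occ_pct": [58.0, 60.0, 55.0],
})
df_rooms = pd.DataFrame({
    "room_id": ["RT1", "RT2"],
    "room_class": ["Standard", "Elite"],
})

# A left merge keeps every booking row; indicator=True adds a _merge
# column flagging rows that found no matching room record.
df = pd.merge(df_agg_bookings, df_rooms, left_on="room_category",
              right_on="room_id", how="left", indicator=True)
unmatched = df[df["_merge"] == "left_only"]
```

Checking that unmatched is empty before dropping room_id is a cheap safeguard against losing rows to a typo in the lookup table.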

2. What was the occupancy in different cities in the month of June?

df_june_22 = df[df["month_year"]=="Jun 22"]
df_june_22.head(4)

# Filtering our data frame specifically for June
# for the occupancy rate
df_june_22.groupby('city')['occ_pct'].mean().round(2).sort_values(ascending=False)
df_june_22.groupby('city')['occ_pct'].mean().round(2).sort_values(ascending=False).plot(kind="bar")

Output:

[Output: bar chart of average June occupancy percentage by city]


3. Print month by month revenue

df_date["mmm yy"].unique()
df_bookings_all.info()

# Here we notice that the check_in_date column has dtype object and has to be converted to datetime
def parse_date(date_str):
    try:
        return pd.to_datetime(date_str, format="%m/%d/%Y")
    except ValueError:
        return pd.to_datetime(date_str, format="%d-%m-%y")

df_bookings_all["check_in_date"] = df_bookings_all["check_in_date"].apply(parse_date)
df_bookings_all.head(4)

For this particular problem I kept receiving an error message when trying to convert the data type, and I had to get creative to find the solution.

Summary of the Solution:
  • We encountered an issue where the date formats in the "check_in_date" column were inconsistent, leading to parsing errors. To address this, we defined a custom parsing function (parse_date) that attempts to parse the date string using two different formats: "%m/%d/%Y" and "%d-%m-%y". By applying this custom function to each date string in the DataFrame, we successfully parsed the dates, handling both formats correctly.


Prevention:
  • To prevent similar issues in the future, it's essential to ensure consistency in date formats across the dataset. Standardizing the date format during data collection or preprocessing can help avoid parsing errors. Additionally, specifying a consistent date format when reading data into pandas DataFrame can be helpful. If multiple date formats are expected, implementing robust error handling, as demonstrated in the provided solution, can handle parsing errors gracefully.
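A vectorized alternative to the row-wise parse_date above, assuming the same two formats: parse once with the first format and errors="coerce" so failures become NaT, then fill the gaps with the second format (the date strings below are made-up examples):

```python
import pandas as pd

# Example strings in the two formats the dataset mixes.
dates = pd.Series(["5/1/2022", "02-05-22", "6/15/2022"])

# First pass: everything matching the slash format; failures become NaT.
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")
# Second pass: fill the NaT gaps using the dashed day-month-year format.
parsed = parsed.fillna(pd.to_datetime(dates, format="%d-%m-%y", errors="coerce"))
```

This avoids calling a Python function once per row and keeps the two accepted formats explicit, which makes silent mis-parses easier to spot.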


df_bookings_all = pd.merge(df_bookings_all, df_date, left_on="check_in_date", right_on="date")
df_bookings_all.head(3)
df_bookings_all.groupby("mmm yy")["revenue_realized"].sum()

Output:

Jul 22    389940912
Jun 22    377191229
May 22    408375641
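Note that grouping on the "mmm yy" string sorts the months alphabetically (Jul, Jun, May above), not chronologically. Grouping on a monthly Period derived from the real check-in date keeps calendar order; a sketch on toy rows with the project's column names:

```python
import pandas as pd

# Toy version of the merged bookings frame.
df_bookings_all = pd.DataFrame({
    "mmm yy": ["May 22", "May 22", "Jun 22", "Jul 22"],
    "check_in_date": pd.to_datetime(
        ["2022-05-01", "2022-05-02", "2022-06-01", "2022-07-01"]),
    "revenue_realized": [100, 200, 150, 175],
})

# Grouping on a monthly Period sorts chronologically, unlike the month label.
monthly = (
    df_bookings_all
    .groupby(df_bookings_all["check_in_date"].dt.to_period("M"))["revenue_realized"]
    .sum()
)
```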
4. Showing a pie chart of revenue realized per booking platform

Input:

df_bookings_all.groupby("booking_platform")["revenue_realized"].sum().sort_values(ascending=True).plot(kind="pie")

# Note: do not assign the .plot() result back to df_bookings_all;
# that would overwrite the DataFrame with a Matplotlib Axes object.

Output:

[Output: pie chart of revenue realized by booking platform]
