Predict your customers’ next purchase (Part 1)

Feb 21, 2022


Our day-to-day purchases and their details are saved in the databases of our favorite restaurant, mall, or even coffee shop!

As interest in data science and machine learning has grown, so has the effort to detect purchase patterns and model personal taste. Using these patterns, we can answer questions such as: when should we expect a customer’s next purchase, and how do sales change in response to discounts and offers?

In this article, we’ll analyze customers’ purchases, look for insights, and work toward a model that predicts which product they are likely to be interested in next.

We’ll use the IBM Cognos Analytics dataset for a coffee shop, which only contains entries from April 2019.
You can also download the dataset from Kaggle.

The steps to our desired results are:
1- Exploratory analysis
2- Formulate our hypothesis
3- Feature engineering
4- Model selection
5- Results interpretation

Exploratory analysis:
We can use Tableau to do some exploratory analysis and create visualizations that help us formulate our hypothesis.
You can find mine here

In this process, my questions were:
1- What were the rush hours of different product categories?
2- What is the gender distribution of our customers?
3- Which products performed better than others, and does price affect this?
4- Who are our most loyal customers? (longest subscription period and most purchases)
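
The first question can also be checked outside Tableau. The following is a minimal pandas sketch, not the analysis behind the dashboards: the column names `product_category` and `transaction_time` and the small inline sample are assumptions made for illustration, and the real dataset may name things differently.

```python
import pandas as pd

# Hypothetical sample; the real dataset's column names may differ.
sales = pd.DataFrame({
    "product_category": ["Coffee", "Coffee", "Coffee", "Tea", "Tea", "Bakery"],
    "transaction_time": ["08:15:00", "08:40:00", "12:05:00",
                         "16:30:00", "16:45:00", "09:10:00"],
})

# Extract the hour of day from the time string.
sales["hour"] = pd.to_datetime(sales["transaction_time"], format="%H:%M:%S").dt.hour

# Count transactions per (category, hour) pair.
counts = (
    sales.groupby(["product_category", "hour"])
    .size()
    .reset_index(name="orders")
)

# Keep the row with the highest count for each category.
busiest = counts.loc[counts.groupby("product_category")["orders"].idxmax()]
rush_hours = busiest.set_index("product_category")["hour"]

print(rush_hours.to_dict())
```

With real data, ties between hours are resolved by taking the first hour encountered, so it may be worth looking at the full `counts` table as well.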

Product categories rush hours and gender distribution

These graphs show what three of the most in-demand categories have in common and where they differ. For example, they show that coffee is ordered more than tea around noon, even though the two have similar numbers for the rest of the day.

Best performing products overall

This graph shows how many times each product was ordered. Seeing that our top 10 or so products are on the cheap side of the menu can change our perspective on what is selling more, and on how to measure that performance.
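
One way to put a number on the link between price and popularity is to rank products by order count and compute a rank correlation with price. This is only a sketch under assumptions: the column names `product` and `unit_price` and the sample rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample; product names, prices, and column names are illustrative only.
sales = pd.DataFrame({
    "product": ["Drip coffee"] * 4 + ["Latte"] * 3 + ["Chai"] * 2 + ["Gift mug"],
    "unit_price": [2.0] * 4 + [3.5] * 3 + [3.0] * 2 + [12.0],
})

# Number of orders and price per product, best sellers first.
summary = (
    sales.groupby("product")
    .agg(orders=("unit_price", "size"), unit_price=("unit_price", "first"))
    .sort_values("orders", ascending=False)
)

# Spearman correlation computed as the Pearson correlation of the ranks.
# A negative value means cheaper products tend to be ordered more often.
price_vs_orders = summary["orders"].rank().corr(summary["unit_price"].rank())

print(summary)
print(round(price_vs_orders, 2))
```

A rank correlation is used here because a single expensive item could otherwise dominate a plain linear correlation.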

For the last bit of information we can extract from our data, I used Python and seaborn to make a heatmap showing the number of orders for each hour of every day.

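import pandas as pd
import seaborn as sns

# Combine the date and time columns into a single timestamp
sales['timestamp'] = pd.to_datetime(sales['transaction_date'] + sales['transaction_time'], format='%Y-%m-%d%H:%M:%S')
sales['Day'] = sales['timestamp'].dt.day
sales['Hour'] = sales['timestamp'].dt.hour
# Total orders per day of the month (rows) and hour of the day (columns)
orders_by_hour = sales.groupby(['Day', 'Hour'])['order'].sum().unstack().fillna(0)
sns.set(rc={"figure.figsize": (16, 8)})
sns.heatmap(orders_by_hour)

Resulting heatmap of the order distribution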

Some other interesting things that I found using Python were:
Number of orders by generation:
generation — orders
Baby Boomers — 5876
Gen X — 5559
Older Millennials — 5345
Gen Z — 4184
Younger Millennials — 3301

Top generation and serving-size combinations by number of orders:
Baby Boomers — 16 oz. — 1778
Gen X — 16 oz. — 1706
Older Millennials — 16 oz. — 1665
Baby Boomers — 24 oz. — 1451
Gen Z — 16 oz. — 1280
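
Both tables above are the kind of result a pandas `groupby` produces. The sketch below shows the idea on a tiny made-up sample. Only the `order` column name appears in the heatmap code earlier; `generation` and `serving_size` are assumed names, and the numbers are illustrative, not the real counts.

```python
import pandas as pd

# Hypothetical sample rows; `generation` and `serving_size` are assumed column names.
sales = pd.DataFrame({
    "generation": ["Baby Boomers", "Baby Boomers", "Gen X", "Gen X", "Gen Z"],
    "serving_size": ["16 oz.", "24 oz.", "16 oz.", "16 oz.", "16 oz."],
    "order": [2, 1, 1, 3, 1],
})

# Total orders per generation, largest first.
orders_by_generation = (
    sales.groupby("generation")["order"].sum().sort_values(ascending=False)
)

# Total orders per (generation, serving size) pair, largest first.
orders_by_size = (
    sales.groupby(["generation", "serving_size"])["order"]
    .sum()
    .sort_values(ascending=False)
)

print(orders_by_generation)
print(orders_by_size.head())
```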

Formulate our hypothesis:

Since our data has no details about discounts, new products, or events, we have to base our assumptions on the patterns we explored earlier.
My basic idea is that you can predict any customer’s next purchase from three or four conditions:
1- Serving size, depending on their age
2- Product category, depending on the time of day
3- Product, depending on the items most frequently bought by the same customer
4- Amount of spending, depending on whether there are discounts or it’s their birthday
(there is not enough data to verify this last condition, since our dataset only covers one month)
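
Before training any model, conditions 2 and 3 can be turned into a simple rule-based baseline. The sketch below is one possible reading of the hypothesis, not the method used in Part 2: the helper `predict_next_purchase`, the tuple layout, and the sample history are all hypothetical, and it works at the product level rather than the category level to keep things short.

```python
from collections import Counter, defaultdict

# Hypothetical purchase history: (customer_id, hour_of_day, product).
history = [
    ("c1", 8, "Latte"),
    ("c1", 9, "Latte"),
    ("c1", 15, "Chai"),
    ("c2", 12, "Drip coffee"),
    ("c2", 13, "Drip coffee"),
    ("c3", 12, "Drip coffee"),
    ("c3", 16, "Chai"),
]

by_customer = defaultdict(Counter)  # product counts per customer
by_hour = defaultdict(Counter)      # product counts per hour of day
overall = Counter()                 # product counts across all purchases

for customer_id, hour, product in history:
    by_customer[customer_id][product] += 1
    by_hour[hour][product] += 1
    overall[product] += 1


def predict_next_purchase(customer_id, hour):
    """Return a guess for the next product (hypothetical helper)."""
    # Condition 3: the customer's most frequently bought product.
    if customer_id in by_customer:
        return by_customer[customer_id].most_common(1)[0][0]
    # Condition 2: for unknown customers, the most popular product at this hour.
    if hour in by_hour:
        return by_hour[hour].most_common(1)[0][0]
    # Last resort: the best seller overall.
    return overall.most_common(1)[0][0]


print(predict_next_purchase("c1", 10))
print(predict_next_purchase("new", 12))
```

A baseline like this gives any later model a score to beat: if a trained model cannot outperform "most frequent product", the extra features are not adding much.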

Part 2
Where we’ll choose how to represent our hypothesis and engineer some features that can better represent our customers’ personal taste!