A Guide to RFM Analysis using Python

A common mistake many businesses make is treating all customers the same. They run generic promotions, send the same emails, and wonder why their revenue isn’t growing. The truth is, not all customers are created equal. But how do you tell the difference? That’s where RFM Analysis comes in. It’s a powerful, straightforward technique that helps you identify your most valuable customers and understand their behaviour. In this article, I’ll take you through a practical guide to RFM Analysis using Python.

What is RFM Analysis?

RFM stands for Recency, Frequency, and Monetary. It’s a marketing analysis tool used to quantitatively rank and segment customers based on their purchasing habits. Here’s what each metric tells you:

Recency (R): How recently did a customer make a purchase? The more recent their purchase, the more likely they are to respond to promotions.
Frequency (F): How often do they buy? Customers who buy more frequently are generally more engaged and loyal.
Monetary (M): How much do they spend? Customers who spend more money are often your most profitable.

By combining these three metrics, you can group customers into different segments and develop targeted strategies for each group.

RFM Analysis using Python: Getting Started

Please download the dataset from here to get started with RFM Analysis.

Let’s import the dataset:

import pandas as pd

# load the raw transaction data
df = pd.read_csv('rfm_data.csv')
print("First 5 rows of the dataset:")
# note: to_markdown() requires the optional 'tabulate' package
print(df.head().to_markdown(index=False, numalign="left", stralign="left"))

Before you can calculate anything, you need to get your data in the correct format. We typically start with a raw transaction dataset. In our example, the data includes:

CustomerID, PurchaseDate, TransactionAmount, ProductInformation, OrderID, and Location

The first crucial step is to ensure your PurchaseDate column is in the correct datetime format. This is essential for calculating recency. Here’s how:

df['PurchaseDate'] = pd.to_datetime(df['PurchaseDate'])
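If the raw file contains malformed dates, the conversion above raises an error. Here is a minimal sketch of one way to handle that, assuming you would rather drop unparseable rows than stop. The toy values and the names raw, n_bad, and clean are made up for illustration:

```python
import pandas as pd

# Toy transactions; the column name matches the article, the values are made up.
raw = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "PurchaseDate": ["2023-04-11", "not a date", "2023-05-02"],
})

# errors="coerce" turns unparseable values into NaT instead of raising.
raw["PurchaseDate"] = pd.to_datetime(raw["PurchaseDate"], errors="coerce")

# Count and drop the rows whose date could not be parsed.
n_bad = int(raw["PurchaseDate"].isna().sum())
clean = raw.dropna(subset=["PurchaseDate"]).reset_index(drop=True)

print(f"Unparseable dates: {n_bad}")
print(clean.dtypes)
```

Whether dropping rows is acceptable depends on your data; reporting the count first makes the decision visible.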

Calculate R, F, and M Values

This is the core of the analysis. We’ll calculate the Recency, Frequency, and Monetary values for each customer.

To calculate recency, we first need a snapshot date, which is the latest date in the dataset plus one day. We then subtract the customer’s most recent purchase date from this snapshot date. This gives us the number of days since their last purchase, and the extra day means a customer who bought on the latest date gets a Recency of 1 rather than 0. For Frequency, we count the number of orders (OrderID) for each unique customer. And, for Monetary, we sum up the TransactionAmount for each customer.

Here’s how to code these calculations:

# get the latest date in the dataset and add one day to serve as the snapshot date
snapshot_date = df['PurchaseDate'].max() + pd.Timedelta(days=1)

# calculate RFM values for each customer
rfm_df = df.groupby('CustomerID').agg(
    Recency=('PurchaseDate', lambda date: (snapshot_date - date.max()).days),
    Frequency=('OrderID', 'count'),
    Monetary=('TransactionAmount', 'sum')
).reset_index()

print("Calculated RFM values:")
print(rfm_df.head().to_markdown(index=False, numalign="left", stralign="left"))

Calculated RFM values:
| CustomerID   | Recency   | Frequency   | Monetary   |
|:-------------|:----------|:------------|:-----------|
| 1011         | 34        | 2           | 1129.02    |
| 1025         | 22        | 1           | 359.29     |
| 1029         | 1         | 1           | 704.99     |
| 1046         | 44        | 1           | 859.82     |
| 1049         | 14        | 1           | 225.72     |
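To see the arithmetic on a small scale, here is a sketch that applies the same aggregation to three made-up transactions. The toy data and the names toy, snapshot, and toy_rfm are assumptions for illustration, not part of the dataset:

```python
import pandas as pd

# Made-up transactions for two customers (illustration only).
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "OrderID": [101, 102, 103],
    "PurchaseDate": pd.to_datetime(["2023-06-01", "2023-06-08", "2023-06-10"]),
    "TransactionAmount": [50.0, 70.0, 20.0],
})

# Latest purchase is 2023-06-10, so the snapshot date is 2023-06-11.
snapshot = toy["PurchaseDate"].max() + pd.Timedelta(days=1)

toy_rfm = toy.groupby("CustomerID").agg(
    Recency=("PurchaseDate", lambda d: (snapshot - d.max()).days),
    Frequency=("OrderID", "count"),
    Monetary=("TransactionAmount", "sum"),
).reset_index()

print(toy_rfm)
```

Customer 1 last bought on 2023-06-08, three days before the snapshot, with two orders totalling 120. Customer 2 bought on the latest date and therefore gets a Recency of 1.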

Now, let’s have a look at how recently customers have made transactions:

import plotly.graph_objects as go
import plotly.express as px

fig = go.Figure()

fig.add_trace(go.Histogram(
    x=rfm_df['Recency'],
    nbinsx=20,
    name="Customer Count",
    marker_color='skyblue',
    opacity=0.7
))

fig.add_trace(go.Histogram(
    x=rfm_df['Recency'],
    nbinsx=100,
    histnorm='probability density',
    marker_color='orange',
    opacity=0.3,
    name="Recency Density"
))

fig.update_layout(
    title=dict(
        text="How Recently Have Customers Purchased?",
        x=0.5,
        xanchor='center',
        font=dict(size=20)
    ),
    xaxis_title="Recency (Days since last purchase)",
    yaxis_title="Number of Customers",
    barmode='overlay',  # draw both histograms on top of each other so the opacity settings take effect
    bargap=0.1,
    template="plotly_white",
    legend=dict(title="Legend", orientation="h", x=0.3, y=1.1),
    annotations=[
        dict(
            x=rfm_df['Recency'].median(),
            y=0,
            xref="x",
            yref="y",
            text="Median Recency",
            showarrow=True,
            arrowhead=2,
            ax=-40,
            ay=-40
        )
    ]
)

fig.show()

The distribution appears spread out, with a slight peak around the 35 to 45-day mark. The median recency is highlighted at approximately 30 days, indicating that half of all customers have purchased within the last month. The shape of the histogram suggests that customer purchases are spread across a range of recency values, rather than being heavily concentrated on very recent purchases.
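Reading values off a chart can be imprecise, so it may help to back the picture with numbers. The sketch below counts customers per recency band; the band edges (30 and 60 days), the function name recency_buckets, and the demo values are assumptions chosen for illustration:

```python
import pandas as pd

def recency_buckets(recency: pd.Series) -> pd.Series:
    """Count customers in 0-30, 31-60, and 61+ day recency bands."""
    bands = pd.cut(
        recency,
        bins=[0, 30, 60, float("inf")],  # assumed band edges
        labels=["0-30 days", "31-60 days", "61+ days"],
        include_lowest=True,
    )
    # sort_index keeps the bands in their natural order
    return bands.value_counts().sort_index()

# Made-up recency values for six customers.
demo = pd.Series([1, 14, 22, 34, 44, 75], name="Recency")
print(recency_buckets(demo))
print("Median recency:", demo.median())
```

On the real data you would pass rfm_df['Recency'] instead of the demo series.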

Now, let’s have a look at how often customers buy:

import numpy as np

freq = rfm_df['Frequency']
q80, med = np.percentile(freq, 80), np.median(freq)

fig = px.histogram(
    rfm_df, x='Frequency', nbins=30, marginal='box',
    title='How Often Do Customers Buy? (Spot the Loyalists)',
    labels={'Frequency': 'Purchases per Customer'}, template='plotly_white', opacity=0.9
)
fig.update_traces(hovertemplate='Purchases: %{x}<br>Customers: %{y}')

fig.add_vline(x=med, line_dash='dash', line_width=2,
              annotation_text=f"Median = {med:.0f}", annotation_position='top left')
fig.add_vrect(x0=q80, x1=freq.max(), opacity=0.1, line_width=0,
              annotation_text='Top 20% loyalists', annotation_position='top right')

fig.add_annotation(x=freq.min(), y=0, yshift=40, showarrow=False,
                   text='Left tail: one-time/rare buyers → retention opportunities')

fig.update_layout(yaxis_title='Number of Customers')
fig.show()


The histogram shows that the vast majority of customers are one-time or rare buyers, as indicated by the tall bar at “1 purchase per customer” and the median Frequency of 1. Only a small group of customers has made two or more purchases. The “Top 20% loyalists” band highlights the customers with the highest purchase frequency, a small but valuable group. The long tail on the right side of the distribution contains few customers, but these are the ones with the most purchases, which points to a key opportunity for retention and loyalty programs.
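The share of one-time buyers can also be computed directly rather than read from the chart. This is a minimal sketch; the function name repeat_buyer_summary and the demo values are made up for illustration:

```python
import pandas as pd

def repeat_buyer_summary(frequency: pd.Series) -> dict:
    """Return the share of one-time and repeat buyers plus the 80th percentile."""
    one_time = (frequency == 1).mean()
    return {
        "one_time_share": round(float(one_time), 3),
        "repeat_share": round(float(1 - one_time), 3),
        "p80": float(frequency.quantile(0.8)),
    }

# Made-up purchase counts for ten customers.
demo_freq = pd.Series([1, 1, 1, 1, 2, 1, 1, 3, 1, 1])
print(repeat_buyer_summary(demo_freq))
```

If the 80th percentile comes out as 1, a band drawn from that percentile to the maximum spans the whole range, so it is worth checking this value before interpreting the shaded area.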

Segment Customers with RFM Scores

Now, we’ll transform the raw RFM values into scores (1-4) to create customer segments. The scoring is based on quartiles:

Recency: A lower Recency value (meaning a more recent purchase) gets a higher score. So, customers in the first quartile (the lowest Recency values, i.e. the most recent buyers) get a score of 4, while those in the last quartile (the least recent buyers) get a score of 1.
Frequency and Monetary: The logic is reversed. A higher Frequency or Monetary value gets a higher score. Customers in the first quartile get a score of 1, while those in the last quartile get a score of 4.
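The quartile rule described above can be sketched on a few toy values. This version uses NumPy's searchsorted as a vectorized stand-in for an if/elif chain; the helper name quartile_scores and all sample values are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def quartile_scores(values: pd.Series, higher_is_better: bool = True) -> pd.Series:
    """Score values 1-4 using the 25th, 50th, and 75th percentiles as cut points."""
    cuts = values.quantile([0.25, 0.5, 0.75]).to_numpy()
    # side="left" counts the cut points strictly below each value,
    # so a value equal to a cut point stays in the lower band.
    band = np.searchsorted(cuts, values.to_numpy(), side="left")  # 0..3
    scores = band + 1 if higher_is_better else 4 - band
    return pd.Series(scores, index=values.index, name=f"{values.name}_Score")

# Made-up values for five customers.
recency = pd.Series([1, 14, 22, 34, 44], name="Recency")
monetary = pd.Series([100.0, 250.0, 400.0, 800.0, 1200.0], name="Monetary")
frequency = pd.Series([1, 1, 1, 1, 2], name="Frequency")

print(quartile_scores(recency, higher_is_better=False).tolist())
print(quartile_scores(monetary).tolist())
print(quartile_scores(frequency).tolist())
```

The Frequency example shows a side effect of quartile scoring: when most customers share the same value, the cut points coincide and the middle scores (2 and 3) are never assigned.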

Here’s how we can segment our customers:

# define scoring functions
def r_score(value, r_quartiles):
    if value <= r_quartiles[0.25]:
        return 4
    elif value <= r_quartiles[0.5]:
        return 3
    elif value <= r_quartiles[0.75]:
        return 2
    else:
        return 1

def fm_score(value, fm_quartiles):
    if value <= fm_quartiles[0.25]:
        return 1
    elif value <= fm_quartiles[0.5]:
        return 2
    elif value <= fm_quartiles[0.75]:
        return 3
    else:
        return 4
      
# get quartiles
r_quartiles = rfm_df['Recency'].quantile([0.25, 0.5, 0.75])
f_quartiles = rfm_df['Frequency'].quantile([0.25, 0.5, 0.75])
m_quartiles = rfm_df['Monetary'].quantile([0.25, 0.5, 0.75])

# apply scoring functions to create RFM scores
rfm_df['R_Score'] = rfm_df['Recency'].apply(lambda x: r_score(x, r_quartiles))
rfm_df['F_Score'] = rfm_df['Frequency'].apply(lambda x: fm_score(x, f_quartiles))
rfm_df['M_Score'] = rfm_df['Monetary'].apply(lambda x: fm_score(x, m_quartiles)) 

rfm_df['RFM_Score'] = rfm_df['R_Score'].astype(str) + rfm_df['F_Score'].astype(str) + rfm_df['M_Score'].astype(str)

After scoring, we combine the three scores into a single RFM_Score string, like ‘444’ or ‘111’. Next, we define a function that maps each combined score to a segment:

# define a function to assign segments based on the RFM scores
def rfm_segment(score):
    if score in ['444', '443', '434', '344', '433', '343', '334']:
        return 'Champions'
    elif score in ['442', '424', '244', '333', '324', '342', '432']:
        return 'Loyal Customers'
    elif score in ['411', '412', '421', '422', '311', '312', '321', '322', '211', '212', '221', '222']:
        return 'New Customers'
    elif score in ['144', '143', '134', '243', '234', '133', '124', '123']:
        return 'At Risk'
    # '211' is already matched by the 'New Customers' branch above, so listing it here would have no effect
    elif score in ['111', '112', '121']:
        return 'Lost Customers'
    else:
        return 'Other'
      
rfm_df['Segment'] = rfm_df['RFM_Score'].apply(rfm_segment)

The function above assigns each customer to a segment based on their combined score. Here’s a breakdown of the key segments:

Champions: Your best customers. They bought recently, buy often, and spend the most.
Loyal Customers: High Frequency and Monetary value, but their recency might be slightly lower than that of Champions.
New Customers: Bought recently but haven’t purchased frequently or spent much.
At Risk: Used to buy often or spend a lot, but haven’t purchased recently.
Lost Customers: Haven’t bought in a long time, don’t buy often, and don’t spend much.
Other: Any score combination not covered by the lists above.
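Because the segment function checks its lists in order, a score that appears under two segments is always assigned to the first match. Here is a small sketch of a check for that situation. The helper find_overlapping_codes and the shortened mapping are hypothetical and only serve to demonstrate the idea:

```python
from collections import defaultdict

def find_overlapping_codes(segment_codes: dict) -> dict:
    """Map each RFM code listed under more than one segment to those segments."""
    seen = defaultdict(list)
    for segment, codes in segment_codes.items():
        for code in codes:
            seen[code].append(segment)
    return {code: segs for code, segs in seen.items() if len(segs) > 1}

# Hypothetical, shortened mapping used only to demonstrate the check.
demo_segments = {
    "Champions": ["444", "443"],
    "New Customers": ["411", "211"],
    "Lost Customers": ["111", "211"],
}

print(find_overlapping_codes(demo_segments))
```

An empty result means every code belongs to exactly one segment, so the order of the branches no longer matters.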

After applying this segmentation logic, we can visualize the distribution of our customer base across these segments:

seg = (rfm_df['Segment'].value_counts()
       .rename_axis('Segment').reset_index(name='Customers'))
seg['Share'] = (seg['Customers']/seg['Customers'].sum()*100).round(1)
top = seg.iloc[0]['Segment']
seg['Highlight'] = np.where(seg['Segment'].eq(top), 'Top segment', 'Others')

fig = px.bar(
    seg, x='Segment', y='Customers', color='Highlight',
    text=seg.apply(lambda r: f"{int(r.Customers)} ({r.Share}%)", axis=1),
    title='Customer Segments: Where Is The Value?',
    labels={'Customers':'Number of Customers'}, template='plotly_white', opacity=0.95
)

fig.update_traces(textposition='outside', hovertemplate='<b>%{x}</b><br>Customers: %{y}<br>Share: %{text}')
fig.update_layout(
    xaxis_title='RFM Segment',
    xaxis={'categoryorder':'array','categoryarray':seg['Segment']},
    showlegend=False,
    margin=dict(t=70, b=20)
)

# Story cue: call out the biggest slice
top_y = int(seg.loc[seg['Segment'].eq(top), 'Customers'].iloc[0])
fig.add_annotation(x=top, y=top_y, yshift=35, showarrow=True,
                   text='Largest segment → prioritize retention/upsell')

fig.show()

The “Other” segment is the largest, with 435 customers, making up 46.0% of the total. This is followed by “New Customers” at 344, or 36.4%, and “Lost Customers” with 121, or 12.8%. The most valuable segments, “Champions” and “Loyal Customers,” are among the smallest, with 29 and 15 customers, representing 3.1% and 1.6% of the customer base, respectively. Finally, the “At Risk” segment is the smallest of all, with only two customers, or 0.2%.

Our analysis showed that the “Champions” segment has the highest average spend, but a large portion of customers fall into the “Other” or “New Customers” segments. This highlights a massive opportunity for upsell and activation campaigns aimed at moving these customers up the value chain.
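A comparison of average spend across segments can be produced with a grouped summary. The following is a sketch under the assumption that the table has CustomerID, Recency, Frequency, Monetary, and Segment columns; the function name summarize_segments and the demo rows are made up for illustration:

```python
import pandas as pd

def summarize_segments(rfm: pd.DataFrame) -> pd.DataFrame:
    """Per-segment customer count and mean Recency, Frequency, and Monetary."""
    summary = rfm.groupby("Segment").agg(
        Customers=("CustomerID", "count"),
        AvgRecency=("Recency", "mean"),
        AvgFrequency=("Frequency", "mean"),
        AvgMonetary=("Monetary", "mean"),
    )
    # Highest average spend first.
    return summary.sort_values("AvgMonetary", ascending=False).round(2)

# Made-up rows for four customers.
demo = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Recency": [2, 5, 40, 55],
    "Frequency": [3, 2, 1, 1],
    "Monetary": [900.0, 700.0, 120.0, 80.0],
    "Segment": ["Champions", "Champions", "Other", "Other"],
})
print(summarize_segments(demo))
```

On the real data you would call summarize_segments(rfm_df) after the Segment column has been created.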

Final Words

So, RFM Analysis is a simple yet potent tool that every data analyst and business owner should have in their arsenal. It moves you from generic, one-size-fits-all marketing to data-driven strategies that genuinely connect with your customers. I hope you liked this practical guide to RFM Analysis using Python. Feel free to ask questions in the comments section below. You can follow me on Instagram for many more resources.