RFM analysis is a fairly common analysis method in refined user operations.

Today I will share a commonly used framework in data analysis: RFM analysis. The model is widely used, which shows that it has real advantages, but it also has quite a few problems. Let's discuss both today.

1. What is RFM Analysis

RFM analysis is actually a method of stratifying users and then performing refined operations for different user groups.

Each of the three letters in RFM represents one dimension:

R (Recency): The time since the user's last purchase. It reflects how recently the user has been active and is used to gauge whether the user has churned. In theory, the longer it has been since the last purchase, the higher the churn probability.

F (Frequency): How often the user purchases. It reflects the user's loyalty to the product and brand. In theory, the higher the purchase frequency within a given period, the higher the user's loyalty.

M (Monetary): The amount the user has spent. It reflects the user's purchasing power.

Generally speaking, a threshold is set for each dimension, splitting users into two groups (above the threshold and below it). With the three dimensions treated equally, users as a whole fall into 2^3 = 8 subgroups, as shown below:

Once users are segmented, each segment can receive targeted marketing. For example, for [important value customers] we should focus on maintaining their benefits and privileges, and for [important retention customers] we should focus on winning back users who are churning.
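The two-way split on each dimension can be sketched in a few lines of Python. This is only a minimal illustration: the function name `segment`, the threshold values, and the "high"/"low" labels are assumptions made for the example, not names taken from the model itself. Note that for R, fewer days since the last purchase counts as "high".

```python
from itertools import product

def segment(r_days, frequency, monetary, r_threshold, f_threshold, m_threshold):
    """Return an (R, F, M) tuple of 'high'/'low' flags for one user.

    For R, a smaller number of days since the last purchase is better,
    so 'high' means the user bought at least as recently as the threshold.
    """
    r_flag = "high" if r_days <= r_threshold else "low"
    f_flag = "high" if frequency >= f_threshold else "low"
    m_flag = "high" if monetary >= m_threshold else "low"
    return (r_flag, f_flag, m_flag)

# Two levels on each of three dimensions give 2^3 = 8 possible segments.
all_segments = list(product(["high", "low"], repeat=3))
print(len(all_segments))

# Illustrative thresholds only; real ones depend on the business.
print(segment(r_days=5, frequency=12, monetary=900.0,
              r_threshold=30, f_threshold=4, m_threshold=500.0))
```

Each of the 8 tuples can then be mapped to whatever segment names the operations team uses.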

That is only a brief sketch of what the model means and where it is useful; the details follow below.

2. How to conduct RFM modeling

The establishment of the RFM model can generally be divided into the following steps.

1. About raw data

From the definitions, we can see that R, F, and M all relate to consumption. So the raw data needed to build an RFM model is clear: the order (transaction) table.

Moreover, the fields required are not complicated. The following are sufficient:

That is, as long as we have each order's unique user ID, consumption time, and consumption amount, we can build an RFM analysis model.

Of course, the raw data needs some cleaning, which I won't go into in detail here. For example, keep only completed orders rather than unpaid ones, exclude orders from large institutions, and so on.
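As a rough sketch of what such an order record and the cleaning step might look like in Python: the `Order` class, the `status` and `is_institution` fields, and the sample rows are all hypothetical, chosen only to illustrate the two cleaning rules mentioned above. A real order table will have its own column names.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Order:
    user_id: str                   # unique user ID
    paid_at: datetime              # consumption time
    amount: float                  # consumption amount
    status: str = "completed"      # hypothetical field: order state
    is_institution: bool = False   # hypothetical flag: large institutional buyer

def clean_orders(orders):
    """Keep only completed orders placed by non-institutional users."""
    return [o for o in orders if o.status == "completed" and not o.is_institution]

# Made-up sample data for illustration.
orders = [
    Order("u1", datetime(2024, 1, 5), 120.0),
    Order("u1", datetime(2024, 2, 1), 80.0, status="unpaid"),
    Order("u2", datetime(2024, 1, 20), 5000.0, is_institution=True),
    Order("u3", datetime(2024, 3, 2), 45.5),
]
cleaned = clean_orders(orders)
print([o.user_id for o in cleaned])
```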

2. Calculating the three dimensions

Based on the original order data above, the next step is to process the three dimensions of RFM. There are many details here.

First, recency. The definition of this indicator is fairly clear: take the difference between the current time and the time of the last purchase.

Next, consumption frequency, which requires a time range. Should we count purchases over the last year or over the last month? The choice makes a big difference. Generally speaking, the right range depends heavily on the industry of the users being analyzed. For fast-moving consumer goods, a few months of data is enough. For durable consumer goods it clearly is not: even over a full year, a user may not make a repeat purchase.

Finally, consumption amount. As with frequency, a time range must be set, for the same reason. Once the range is fixed, simply sum the amounts; there is little ambiguity here.

Therefore, there is no fixed standard for setting these parameters; they have to be chosen to fit the patterns of your own industry. The finished table looks like this:
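The three calculations can be sketched as follows. This is a minimal illustration under stated assumptions: orders are plain `(user_id, paid_at, amount)` tuples that have already been cleaned, the function name `compute_rfm` and the `window_days` parameter are made up for the example, and recency is taken over all supplied history while frequency and monetary count only orders inside the time window. Whether recency should also be limited to the window is a choice the business has to make.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def compute_rfm(orders, now, window_days):
    """Aggregate orders into {user_id: (recency_days, frequency, monetary)}.

    orders: iterable of (user_id, paid_at, amount) tuples.
    Recency is days between `now` and the user's latest order.
    Frequency and monetary only count orders within the last `window_days`.
    """
    window_start = now - timedelta(days=window_days)
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for user_id, paid_at, amount in orders:
        if paid_at > now:
            continue  # skip orders dated after the reference time
        if user_id not in last_purchase or paid_at > last_purchase[user_id]:
            last_purchase[user_id] = paid_at
        if paid_at >= window_start:
            frequency[user_id] += 1
            monetary[user_id] += amount
    return {
        uid: ((now - last).days, frequency[uid], monetary[uid])
        for uid, last in last_purchase.items()
    }

# Made-up sample data for illustration.
orders = [
    ("u1", datetime(2024, 1, 5), 120.0),
    ("u1", datetime(2024, 3, 1), 80.0),
    ("u2", datetime(2023, 1, 20), 300.0),
]
rfm = compute_rfm(orders, now=datetime(2024, 3, 31), window_days=365)
for user_id, (r, f, m) in sorted(rfm.items()):
    print(user_id, r, f, m)
```

Changing `window_days` from 365 to 90 is the kind of adjustment that separates a fast-moving consumer goods analysis from a durable goods one.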

3. Threshold division

After computing the three basic indicators, the next step is to determine the thresholds, that is, the cut-off value on each dimension that divides users into segments.

Generally speaking, only one threshold is needed per dimension, which divides users into 8 segments overall. There is also a practice of dividing each dimension into 5 segments, giving 5^3 = 125 layers in total, euphemistically called [subdivision]. I personally don't agree with it. To me, the value of RFM analysis lies in how interpretable and practical the segmentation is. With 125 user groups, how do you run refined operations on each one? In the end, the groups will have to be merged anyway.

OK, let's stick with the usual 8 layers. Given the aggregated table above, the distribution often looks like this (taking R as an example):

How do we divide users into two groups? There are several different methods.

The first method is to use the mean. I personally don't recommend the mean as the threshold. Real data often contains outliers that distort the mean, and it is difficult to remove every anomaly during data cleaning.
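A small made-up example shows how sensitive the mean is. The recency values below are invented purely for illustration, and `split_by_mean` is a hypothetical helper; the point is only that two extreme users are enough to move the threshold so far that almost everyone else lands on the same side.

```python
from statistics import mean

def split_by_mean(values):
    """Split values into (threshold, at_or_below, above) using the mean."""
    threshold = mean(values)
    below = [v for v in values if v <= threshold]
    above = [v for v in values if v > threshold]
    return threshold, below, above

# Days since last purchase for eight ordinary users (made-up numbers).
recency_days = [3, 7, 10, 12, 15, 20, 25, 30]
# The same users plus two long-dormant accounts acting as outliers.
with_outliers = recency_days + [900, 1200]

t1, below1, above1 = split_by_mean(recency_days)
t2, below2, above2 = split_by_mean(with_outliers)

print(t1, len(below1), len(above1))
print(t2, len(below2), len(above2))
```

Without the outliers the eight users split 5 to 3 around a threshold of about 15 days. With them, the threshold jumps past 200 days and all eight ordinary users fall into the same group, so the split no longer tells them apart.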