How do you perform market basket analysis using R?

AUTHOR: Sanjana Mahabale

Association rule mining, also known as market basket analysis, is used in the retail industry to assess which items are likely to be bought together in a store. In this guide we explore how Safecity applies market basket analysis to data on sexual harassment to assess which categories of sexual harassment are frequently perpetrated in combination.

Here is an example of Market Basket Analysis when it comes to the retail industry:

When it was found that diapers and beer are frequently bought together, diapers were moved closer to beer crates in stores to increase sales.

An association rule looks like this: A -> B

This implies that when A occurs, B tends to occur as well. It is important, however, to determine how interesting the rule is. The following measures help:

• Support: Ratio of the number of records in which A and B occur together to the total number of records.

• Confidence: Ratio of the number of records in which A and B occur together to the number of records in which A occurs (with or without B). It can be read as the probability of B occurring given that A is present in a record.

• Lift: Ratio of the confidence of the rule to the proportion of all records in which B occurs. A lift value greater than 1 indicates that A and B appear together more often than would be expected if they were independent; this means that the occurrence of A has a positive effect on the occurrence of B, or that A is positively correlated with B.
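To make the three measures concrete, here is a minimal sketch in base R that computes them by hand for a single rule A -> B. The ten records below are made up purely for illustration and are not Safecity data.

```r
# Toy 0/1 data (hypothetical): each row is one record, each column one item.
records <- data.frame(
  A = c(1, 1, 1, 0, 1, 0, 1, 0, 0, 1),
  B = c(1, 1, 0, 0, 1, 1, 1, 0, 0, 0)
)

n    <- nrow(records)                          # total number of records
n_A  <- sum(records$A == 1)                    # records containing A
n_B  <- sum(records$B == 1)                    # records containing B
n_AB <- sum(records$A == 1 & records$B == 1)   # records containing both A and B

support    <- n_AB / n                 # share of all records with A and B
confidence <- n_AB / n_A               # share of A's records that also have B
lift       <- confidence / (n_B / n)   # confidence relative to B's overall share

print(c(support = support, confidence = confidence, lift = lift))
```

Here A and B occur together in 4 of 10 records, A occurs in 6 and B in 5, so the lift comes out above 1: B is more common among records containing A than in the data as a whole.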

The file containing “Full Data” from the Safecity Dashboard was loaded into the R workspace.

This data needs to be converted into “transactions”. In retail terms, a transaction is a particular consumer’s basket containing the different items bought from the store. With respect to Safecity data, a transaction is a particular harassment victim’s experience with different categories of sexual harassment. If the victim was catcalled and groped, their “transaction” contains ‘1’ or ‘Yes’ for the columns Catcalls/Whistles and Touching/Groping, and ‘0’ or ‘No’ for all other categories of harassment.

The final transaction data was called ‘TransData.csv’.
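One way to turn a table of 0/1 columns into transactions with the arules package is sketched below. The data frame and its column names are hypothetical stand-ins for the category columns. The choice to treat every column/value pair as an item (so that both “=1” and “=0” items exist) is an assumption, made because the mined rules in this guide refer to both present and absent categories.

```r
library(arules)

# Hypothetical stand-in for the 0/1 category columns of the transaction file.
reports <- data.frame(
  Catcalls   = c(1, 0, 1, 1, 0, 1),
  Groping    = c(1, 0, 0, 1, 0, 0),
  Commenting = c(1, 1, 1, 0, 0, 1)
)

# Convert every column to a factor so that each column/value pair
# (for example "Catcalls=1" and "Catcalls=0") becomes its own item.
reports[] <- lapply(reports, factor, levels = c(0, 1))

# Coerce the data frame of factors to the arules "transactions" class.
trans <- as(reports, "transactions")

summary(trans)      # number of transactions, items, and item frequencies
itemLabels(trans)   # the item names that rules will be built from
```

With this encoding every transaction contains exactly one item per column, which is why the absence of a category can appear on either side of a rule.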

The apriori algorithm from the arules library in R mines association rules that meet a threshold support and a threshold confidence. Threshold support is the minimum proportion of all transactions in which A and B (from the A -> B rule) must occur together. Threshold confidence is the minimum proportion of the transactions containing A in which B must also occur.

Since the total number of transactions is 9719, we would want any two types of harassment to occur together in at least 10% of the total transactions (i.e. 972 times, which is a sizeable number of records). Hence, the threshold support is 0.1. The minimum confidence is kept at 0.5: at least half of the records containing crime A should also contain crime B for the rule to be interesting.
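As a rough sketch of how these thresholds are passed to apriori, the example below mines rules with support 0.1 and confidence 0.5. The toy data is randomly generated with hypothetical column names; the real analysis would use the transaction data described above.

```r
library(arules)

# Hypothetical toy data; the real analysis would use the transaction file.
set.seed(42)
n <- 500
catcalls <- rbinom(n, 1, 0.5)
toy <- data.frame(
  Catcalls   = catcalls,
  Commenting = rbinom(n, 1, ifelse(catcalls == 1, 0.7, 0.4)),
  Groping    = rbinom(n, 1, 0.3)
)
toy[] <- lapply(toy, factor, levels = c(0, 1))
trans <- as(toy, "transactions")

# A support of 0.1 on 9719 transactions corresponds to a minimum count of 972.
min_count <- ceiling(0.1 * 9719)

# Mine rules that meet both thresholds.
rules <- apriori(trans, parameter = list(support = 0.1, confidence = 0.5))

# Show the five rules with the highest confidence.
inspect(head(sort(rules, by = "confidence"), 5))
```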

Before fixing the threshold confidence at 0.5, a sensitivity analysis was carried out to determine the most appropriate minimum confidence value. When the confidence was raised to 0.6, the results were the same as those at 0.5 except for one rule. No difference was observed in the rules when the confidence was reduced to 0.4, and many interesting rules were lost when the confidence was increased to 0.8. Hence, to allow for new data being added to the current dataset, the confidence threshold was kept at a safe estimate of 0.5.
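A sensitivity check of this kind can be sketched as a loop that re-runs apriori at each candidate confidence and counts the resulting rules. The toy data and column names are hypothetical, and the minlen and maxlen values are illustrative choices, so the counts will not match those of the real dataset.

```r
library(arules)

# Hypothetical toy data standing in for the real transactions.
set.seed(42)
n <- 500
catcalls <- rbinom(n, 1, 0.5)
toy <- data.frame(
  Catcalls   = catcalls,
  Commenting = rbinom(n, 1, ifelse(catcalls == 1, 0.7, 0.4)),
  Groping    = rbinom(n, 1, 0.3)
)
toy[] <- lapply(toy, factor, levels = c(0, 1))
trans <- as(toy, "transactions")

# Candidate confidence thresholds to compare.
conf_levels <- c(0.4, 0.5, 0.6, 0.8)

# Count the rules mined at each threshold, holding support fixed at 0.1.
n_rules <- vapply(conf_levels, function(conf) {
  rules <- suppressWarnings(apriori(
    trans,
    parameter = list(support = 0.1, confidence = conf, minlen = 2, maxlen = 2),
    control   = list(verbose = FALSE)
  ))
  length(rules)
}, numeric(1))

print(data.frame(confidence = conf_levels, rules = n_rules))
```

A sharp drop in the rule count between two adjacent thresholds is a hint that interesting rules are being cut off at the higher value.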

In association rules in general, A and B (from A -> B) can each be sets of more than one item. In arules, minlen and maxlen set the minimum and maximum total number of items in a rule, counting the LHS (A) and RHS (B) together, and apriori mines only rules with a single item in the RHS. A further sensitivity run was carried out with maxlen set to 3. This means a rule could have up to 2 items in the LHS and 1 item in the RHS. The following is a sample rule mined using maxlen = 3.

{Ogling/Facial Expressions/Staring=1, Touching/Groping=0} => {North East India Report=0}

The insights gathered from this sensitivity run were the same as those with maxlen = 2. Hence, for the sake of brevity and ease of parsing through the rules, maxlen was kept at 2.
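The effect of maxlen can be checked with a small comparison like the one below, again on hypothetical toy data. The helper function `mine()` is introduced here only for illustration. In arules, `size()` reports the total number of items in each rule (LHS plus RHS).

```r
library(arules)

# Hypothetical toy data standing in for the real transactions.
set.seed(42)
n <- 500
catcalls <- rbinom(n, 1, 0.5)
toy <- data.frame(
  Catcalls   = catcalls,
  Commenting = rbinom(n, 1, ifelse(catcalls == 1, 0.7, 0.4)),
  Groping    = rbinom(n, 1, 0.3)
)
toy[] <- lapply(toy, factor, levels = c(0, 1))
trans <- as(toy, "transactions")

# Illustrative helper: mine rules for a given maxlen with fixed thresholds.
mine <- function(maxlen) {
  suppressWarnings(apriori(
    trans,
    parameter = list(support = 0.1, confidence = 0.5,
                     minlen = 2, maxlen = maxlen),
    control   = list(verbose = FALSE)
  ))
}

rules2 <- mine(2)   # one item on each side
rules3 <- mine(3)   # up to two items in the LHS, one in the RHS

# How many rules of each total length were mined with maxlen = 3?
print(table(size(rules3)))
cat("Rules with maxlen = 2:", length(rules2), "\n")
cat("Rules with maxlen = 3:", length(rules3), "\n")
```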

The final set of mined rules is called ‘Rules24.csv’.

The following insights were gathered from the mined rules:

· Of all cases in which Ogling/Facial Expressions/Staring was reported, Commenting was also reported in about 55%

· Of all cases in which Catcalls/Whistling was reported, Commenting was also reported in about 62%

· Of all cases in which Ogling/Facial Expressions/Staring was reported, Rape/Sexual Assault did not occur in about 98%

· Of all cases in which Catcalls/Whistling was reported, Stalking did not occur in 93%

· Of all cases in which Catcalls/Whistling was reported, Rape/Sexual Assault did not occur in 98%

· Of all cases in which Touching/Groping was reported, Rape/Sexual Assault did not occur in 96%

· Of all cases in which Commenting was reported, Rape/Sexual Assault did not occur in 99%

· No positive correlation was found between Rape/Sexual Assault and other forms of sexual harassment, even when the threshold confidence was dropped to 0.1 and no filter was applied on the lift of the rules (i.e. lift is not necessarily >=1)
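Insights like these can be read off a rule set by filtering on the consequent and on lift. The sketch below uses hypothetical toy data and item names (Stalking stands in for any category of interest) and shows one way to do this with the `subset()` method in arules; it is not the exact procedure used on the Safecity data.

```r
library(arules)

# Hypothetical toy data standing in for the real transactions.
set.seed(42)
n <- 500
catcalls <- rbinom(n, 1, 0.5)
toy <- data.frame(
  Catcalls   = catcalls,
  Commenting = rbinom(n, 1, ifelse(catcalls == 1, 0.7, 0.4)),
  Stalking   = rbinom(n, 1, 0.2)
)
toy[] <- lapply(toy, factor, levels = c(0, 1))
trans <- as(toy, "transactions")

# Low confidence threshold so that weak rules are kept for inspection.
rules <- suppressWarnings(apriori(
  trans,
  parameter = list(support = 0.1, confidence = 0.1, minlen = 2, maxlen = 2),
  control   = list(verbose = FALSE)
))

# Rules whose consequent is the presence of the category of interest.
target_rules <- subset(rules, subset = rhs %in% "Stalking=1")

# Rules indicating a positive association (lift above 1).
positive <- subset(rules, subset = lift > 1)

cat("Rules predicting Stalking=1:", length(target_rules), "\n")
cat("Rules with lift above 1:", length(positive), "\n")
```

The confidence column of a rule such as {Catcalls=1} => {Stalking=0} is what a statement like "Stalking did not occur in X% of cases" is read from.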

 
