Research
Abstract
Some datasets are sufficiently large to defeat traditional algorithms for data analysis. Data mining is the practice of locating important data within these extremely large datasets. In this research, an algorithm was developed to study data mining techniques and to investigate the purpose of different parts of the process. It demonstrates the technique for finding product associations and the application of this technique to real-world scenarios, even though the written algorithm was not applied to any actual datasets. It also shows, through timing analysis, that changing the parameters of the algorithm can change the growth of computation time from linear to exponential. When analyzing very large datasets, even linear growth can be too great for the algorithm to be efficient. This is why the algorithm developed in this research was not applied to very large datasets. However, it still demonstrated why effective data mining algorithms cannot afford linear or exponential growth in computation time.
Introduction
Imagine a vast dataset of billions of sales transactions. Even finding something as simple as the largest and smallest transactions would take a very long time. Some datasets are so immense that they can be regarded as nearly infinite, yet results of some kind must still be generated. When analyzing these vast datasets, programmers must take a different perspective on the problem. Most algorithm failures come from a bug in the program, a syntax error, or user error. However, algorithms can also fail because of their inefficiency. Algorithms take a certain amount of time to run, and when they must run billions of loops they might take years to finish. Thus, successful algorithms are efficient and can be executed in a reasonable amount of time.
This is absolutely essential in what is called data mining. Data mining is the process of analyzing these very large datasets. It is done by a computer, which uses a specific data mining algorithm to find important data and then analyze it. Data mining algorithms are aimed specifically at datasets that are large enough to cause traditional algorithms to fail. Brute force is the most basic and inefficient type of algorithm: it simply finds solutions by looking at every single combination of every single point of data. Other traditional algorithms are more efficient than brute force, but they still fail on a sufficiently large dataset.
There are many different data mining algorithms, all of which are built to be streamlined and efficient when operating on massive datasets. One such algorithm is Apriori. Proposed in 1994 by two computer scientists, Rakesh Agrawal and Ramakrishnan Srikant (pp. 487-499), Apriori finds association rules by finding frequent items. Frequent items are items that occur frequently within a dataset, often a transaction database. Association rules state whether two or more items often occur in combination within the dataset, implying that they are associated, or purchased together frequently. This process can turn up some interesting and helpful results. The most useful results are the ones that exist but are not intuitive. For example, someone who buys tortilla chips is also likely to buy salsa, but that result is not particularly helpful to marketers. According to Nutan Dhange and Prof. Sheetal Dhande (2010), the “Apriori algorithm is a classical algorithm of association rule mining and [is] widely used for mining association [rules which use] frequent item” (p. 341). When applied to transactions, this is called product analysis. Table 1 below provides a very basic transaction database to exemplify this, showing 6 transactions of up to 3 items each.
The algorithm’s first step in finding associations is counting how many times each individual item was purchased. In a dataset with billions of points of data, Apriori cannot count every single one, so it instead takes a small sample that has only millions of points instead of billions. It can then count all of the individual points and combinations in a more reasonable amount of time, and use this as a basis to apply to the entire dataset. Table 2 below shows this first step by displaying the frequency with which each item was purchased in the example from Table 1, which acts like a sample drawn from a larger dataset.
Table 2 shows the first step of the analysis that Apriori performs. It counts how many times each item was purchased in the sample and uses that information to decide what to do next with the whole dataset. In this simple example, lightbulbs were purchased infrequently: only once out of the 13 total items purchased. Apriori therefore assumes that there are also very few combinations that include lightbulbs, and instead of finding those combinations, it ignores them entirely. In this example only one item is removed, but in a dataset with billions of transactions and possibly millions of items, most items are likely to be removed because they are unlikely to have useful combinations. That leaves only the items that are likely to have useful combinations. This significantly reduces the number of different combinations and speeds up the algorithm, both in the example and in real applications. After going through this process in the example, the algorithm returns Table 3 below.
The next step is to figure out how many of the items were frequently bought in combination. To do this, the result uses the purchased items as both the horizontal and vertical labels, with each cell representing how frequently the two items were bought together. In a real-world application, this step is done with the whole dataset, using all of the items that the algorithm has determined are likely to have frequent combinations. For simplicity, this example keeps the same set of transactions. The algorithm returns Table 4 below from the example dataset.
Table 4 has the bottom section shaded with no values because buying beer and a basketball is the same as buying a basketball and beer; only one side of the table holds results, to avoid counting purchases twice. The diagonal is also shaded because the quantity of a single product purchased is irrelevant. For example, if two basketballs were bought in the same transaction, the algorithm would not associate basketballs with basketballs.
Table 4 is the useful result, because it is the table that shows the associations in the data. For a dataset as small as the example, pairs are the largest practical level of correlation. However, in a much larger dataset, finding groups of three, four, or even five associated items may be useful for marketing purposes. One of the obvious correlations in the example data was the one between diapers and beer. This was an actual correlation found by a data analysis team for Walmart, as described in the Forbes article “Diaper-Beer Syndrome”. The team ran their algorithm, and “Out popped a most unexpected correlation: sales of diapers and beer. Evidently, young fathers would make a late-night run to the store to pick up Pampers and get some Bud Light while they were there.” This is just one example of the usefulness of the results that Apriori provides.
To recap, according to Yan Guo, Minxi Wang, and Xin Li, Apriori’s “...first step is to find all the frequent itemsets from the database or data warehouse; the second step is to generate association rules from the frequent item sets” (p. 289). In the example, the frequent items are lettuce, beer, basketballs, diapers, and tires. The association rule found was that beer and diapers were often bought in combination. This is just one real-world example of what Apriori discovers and returns.
The project goal is to write an algorithm that effectively and efficiently draws connections in the datasets it analyzes, based on an Apriori technique. In other words, the project is to streamline the algorithm so that it takes a reasonable amount of time to mine through a large amount of data. The algorithm could then be applied to real-life datasets to find correlations within them.
Methods and Materials
To further understand the need for efficient data mining algorithms, a traditional algorithm was written to solve the same problem. This allowed for a comparison with data mining algorithms in terms of efficiency. The written algorithm followed traditional, brute force methods, but returned the same results that the Apriori algorithm would.
The first step was to write a function that counts how many times a certain item occurs within the transaction dataset. It does this by running a for-loop, which runs the code inside of it as many times as specified; in this case, once for each item in the set. This means that the code checks every value to see if it is the target value. This is a brute force method, and it starts the process of analysis on the data. Code Example 1 below displays this code.
Code Example 1
This function searches a dataset for a certain value, target.
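The original listing is not reproduced here; the following is a minimal sketch of the described function, where the name countTarget and its exact signature are assumptions.

```python
def countTarget(dataset, target):
    # Brute force search: check every value in the dataset against the target.
    count = 0
    for i in range(len(dataset)):  # one pass of the loop per item in the set
        if dataset[i] == target:
            count += 1
    return count
```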
The next step was to create a function that used this code and ran it for each possible value in the dataset, again using a for-loop. This showed how many of each item were in the dataset. This is like the first step Apriori performs when analyzing the sample of the dataset: it counts how many times each item appears, and in the next step it would decide whether or not to analyze that item’s combinations. A function for this step was not included in the written algorithm, but would be developed in future work. The next functions to be defined in the written algorithm found the high and low values of the datasets. Code Example 2 below shows these functions.
Code Example 2
These two functions find the lowest and highest values in the input dataset.
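A sketch of these two functions under the same brute force approach; the names findLow and findHigh are assumptions.

```python
def findLow(dataset):
    # Start with the first value and replace it whenever a smaller one appears.
    low = dataset[0]
    for i in range(len(dataset)):
        if dataset[i] < low:
            low = dataset[i]
    return low

def findHigh(dataset):
    # The same scan, but tracking the largest value seen so far.
    high = dataset[0]
    for i in range(len(dataset)):
        if dataset[i] > high:
            high = dataset[i]
    return high
```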
From this point, the next step was to make a function that finds associations within individual transactions. This required sets nested within sets. Instead of datasets that were just lists of numbers, such as [1,4,2,3,5,3,2], the function needed lists of lists, such as [[1,3,2], [4,1,5], [2,4,1]]. Each inner list holds the items of one transaction, while the number of inner lists is the number of transactions. Any analysis needs to be written to account for this list of lists, which generally requires calling a previously written function within the for-loop. The code for this appears below in Code Example 3.
Code Example 3
This function lists all of the possible correlations within each transaction in the dataset.
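The name factorialList comes from the discussion below; this sketch assumes that “correlations” means every unordered pair of items within a single transaction.

```python
def factorialList(transactions):
    # transactions is a list of lists, e.g. [[1,3,2], [4,1,5], [2,4,1]];
    # each inner list is one transaction.
    pairs = []
    for transaction in transactions:
        # Pair every item with every later item in the same transaction.
        for i in range(len(transaction)):
            for j in range(i + 1, len(transaction)):
                pairs.append((transaction[i], transaction[j]))
    return pairs
```

For the example dataset [[1,3,2], [4,1,5], [2,4,1]], this returns nine pairs, beginning with (1, 3), (1, 2), and (3, 2).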
Code Example 3 above is used to generate every possible combination within a set of transactions. Another function, which keeps individual counters in a matrix for how many times each combination occurred, is then called. This is displayed in Code Example 4.
Code Example 4
This function creates a matrix and adds to it a count for every combination in the dataset.
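A sketch of this counting step, assuming items are small non-negative integers that can serve as matrix indices; the name countMatrix and the numItems parameter are assumptions.

```python
def countMatrix(pairs, numItems):
    # One row and one column per possible item, all counters starting at zero.
    matrix = [[0] * numItems for _ in range(numItems)]
    for a, b in pairs:
        if a == b:
            continue  # the diagonal (an item paired with itself) is ignored
        # Order each pair so (a, b) and (b, a) increment the same cell,
        # leaving one triangle of the matrix empty, as in Table 4.
        low, high = min(a, b), max(a, b)
        matrix[low][high] += 1
    return matrix
```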
This function takes all of the outputs of factorialList and builds a matrix, or chart, of how many of each combination occurred. The combination of all of these functions allowed the analysis of datasets and produced the results.
In the code, the Apriori algorithm serves as the point of comparison for a traditional algorithm that finds frequent itemsets in a dataset of transactions. Python served as the programming language to implement a traditional, brute force version of the Apriori algorithm. This algorithm returned the same results that an Apriori algorithm would, but because it was brute force, it was much less efficient. It did not prune the data as a data mining Apriori algorithm would, so it was not practical to apply it to real-life datasets. This means that the results are not correlations between data in real datasets, but rather the time it took to compute results for different inputs to the algorithm. Testing involved varying two parameters: the number of transactions, and how many items are in each transaction.
This research implemented a random transaction generator, so that large random datasets could be generated on the fly. This allowed testing of the processing speed of the algorithm, which is a good gauge of its efficiency. All code was written from scratch, without the use of external libraries.
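A minimal sketch of such a generator; the parameter names are assumptions, and the random module is part of the Python standard library rather than an external package.

```python
import random

def generateTransactions(numTransactions, itemsPerTransaction, numItems):
    # Each transaction is a list of randomly chosen item IDs in [0, numItems).
    return [[random.randrange(numItems) for _ in range(itemsPerTransaction)]
            for _ in range(numTransactions)]
```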
Results
Discussion
In timing the algorithm written for this research, some level of error is a given. The timing depends on the processing speed of the computer, which can be affected by its temperature, battery level, and background activity. To minimize the effect this had on the timing, all of the tests were performed one after another in the same sitting. Although this kept the effect minimal, it may still have influenced the results, which needs to be acknowledged. However, for the purposes of this research, approximate numbers are sufficient.
In these results, it is not the exact numbers that matter, but how the numbers scale, or the ratio between them. This indicates whether the increase is exponential or linear, which points to how efficient the algorithm is. In this case, a linear increase is preferred: for very large datasets, the time will be more reasonable if the growth has been linear rather than exponential. However, for very large datasets, even linear growth is too steep.
In the future, this research will add data pruning to the algorithm, which will greatly increase its efficiency. To implement pruning, a new function would need to be added that filters the data after each item in the sample has been counted once. It would remove items that were not counted enough times to be viable in combinations, and apply this smaller search to the whole dataset. This would take more time than remains in the research window, but would yield very interesting results.
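A hypothetical sketch of that pruning step; the function name and the minimum support threshold are assumptions, since the filter was never implemented.

```python
def pruneItems(counts, minSupport):
    # counts maps each item to how many times it appeared in the sample; only
    # items seen at least minSupport times are kept for combination analysis.
    return {item for item, count in counts.items() if count >= minSupport}
```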
This research will then also be applied to much larger datasets, and to datasets that come from real transactions. This will yield results that are applicable to the real world and to marketing in general. To do this, a new function would need to be added that converts data from something like a spreadsheet into a form that the algorithm can use and get results from.
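A hypothetical sketch of that conversion, assuming the spreadsheet is exported as a CSV file in which each row lists the integer item IDs of one transaction; the csv module is part of the Python standard library.

```python
import csv

def loadTransactions(path):
    # Each non-empty cell in a row becomes one item ID in that transaction.
    with open(path, newline="") as f:
        return [[int(cell) for cell in row if cell] for row in csv.reader(f)]
```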