Tuesday, February 9, 2010

Association Rule Mining with Extended Vertical Format Data Mining

I and my final year project team members at Department of Computer Science and Engineering, University of Moratuwa conducted a research on better alternative to the Apriori algorithm, and proving the efficiency enhancement by using a dataset. Under the supervision of Dr. Shehan Perera, we analyzed the Apriori algorithm, and came up with a better implementation which is supposed to be more efficient than its predecessors.

Current databases are very large sizes, reaching Tera-bytes and Peta-bytes, and the trend towards further increase. With this explosion of growth of databases of particular importance is the question of scalability of data mining techniques. Therefore, to find association rules require efficient scalable algorithms that allow solving the problem with in a reasonable time.
Large companies for decades accumulated data on their customers, suppliers, products and services. Due to high rate of development of e-commerce working in Web start-ups can turn into a huge enterprise in a matter of months, rather than something those years. And, as a consequence, will grow rapidly and their databases. Data mining, which is also called knowledge discovery in databases provides organizations with the tools developed to analyze the large collection of information to find trends, patterns and relationships that can help in making strategic decisions.

Traditionally, that the algorithms of data analysis assumed that the input database containing a relatively small number of records. However, the size of modern databases is too large, which is why they can not be fully deployed in the main memory. Extracting data fro m your hard drive is considerably slower access to data located in memory. Therefore, to methods of data mining used to work with very large databases, to become effective, they must possess a significant level of scalability. The algorithm is called scalable, if sustained capacity of main memory with an increase in th e number of records in the input database, its execution time increases linearly.

Recently, researchers have focused their efforts on the study of scalable algorithms for data mining in very large data sets. Here it's described an efficient and scalable frequent item-set mining method with Apriori algorithm.

Apriori algorithm is proposed for mining frequent item sets for Boolean association rules. It operates on databases having transactions to learn the association rules. Apriori algorithm is a base algorithm proposed by R. Agrawal and R. Srikant in 1994, on which many researches are done, and improvements are suggested for the general case as well as a specific subset of the applicable data. Due to the huge amount of data that is mined in the present applications, even a small performance gain on the algorithm will result in a considerable amount of throughput gain. Some enhancements to Apriori algorithm sacrifice the accuracy for a better response time. Sampling is a simplest example, where accuracy is lost in favor of performance gain.

Hash-based technique, Transaction reduction, Partitioning, Dynamic itemset counting, and multilevel and multidimensional association rules are some of the other common enhancements proposed to improve the efficiency of Apriori algorithm.

Apriori algorithm generates candidate sets and tests them to find the frequent itemsets, significantly reducing the size of candidate sets. Algorithms such as Frequent-pattern growth (FP-Growth) mine frequent itemsets without candidate generation. Both the algorithm sets have their own advantages and disadvantages. Many hybrid algorithms have been proposed and still researched to suit the general case, or mostly a particular case specialized for a given dataset.


Related articles - Project proposal

No comments: