Data Mining

  • Meer Hamza, Khaled Mahar and Sherif Hegazy, “Incremental Mining of Association Rules Using Data Cubes,” Proceedings of IASTED, Malaga, Spain, September 2002.

Abstract: An incremental updating algorithm for the maintenance of previously discovered association rules is applied to data cubes. Previous research concentrated on the development of incremental algorithms working on flat data, i.e., the database files. Because of the huge amounts of data usually involved, using data cubes accelerates the task and avoids scanning the whole database after every update. The approach also suggests a way to perform the incremental mining process in practice without affecting the original (operational) database. Previous methods required tagging new records in order to count them in the incremental algorithm; the proposed approach overcomes this limitation.
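To illustrate the general idea of incremental maintenance against pre-aggregated counts (a minimal sketch, not the paper's actual data-cube algorithm; item names, thresholds, and helper functions are invented), the Python snippet below keeps itemset support counts in a cube-like aggregate and folds in an increment of new transactions without rescanning the original data.

```python
from itertools import combinations
from collections import Counter

def itemset_counts(transactions, max_size=2):
    """Aggregate support counts for all itemsets up to max_size
    (a stand-in for pre-aggregated data cube cells)."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            counts.update(combinations(items, k))
    return counts

def merge_counts(cube, increment_cube):
    """Fold the counts of the new transactions into the stored cube,
    so the full database never has to be rescanned."""
    updated = Counter(cube)
    updated.update(increment_cube)
    return updated

def frequent_itemsets(cube, total_transactions, min_support=0.4):
    threshold = min_support * total_transactions
    return {iset: c for iset, c in cube.items() if c >= threshold}

# Hypothetical example: original data plus an increment of new records.
old = [("bread", "milk"), ("bread", "butter"), ("milk", "butter")]
new = [("bread", "milk"), ("bread", "milk", "butter")]

cube = itemset_counts(old)
cube = merge_counts(cube, itemset_counts(new))   # incremental update
print(frequent_itemsets(cube, len(old) + len(new)))
```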

  • Khaled Mahar, Meer Hamza and Sherif Hegazy, “Adaptive Web Sites Based on the ACM Incremental Mining,” Proceedings of the 13th International Conference on Computer Theory and Applications, ICCTA’2003, Arab Academy for Science and Technology, Alexandria, Egypt, Aug. 2003.

Abstract: Due to the increasing numbers of users and websites on the Internet, as well as the increasing demands of web users and the growing competition, it has become inevitable for companies and organizations to move towards adaptive websites. Adaptive websites are websites that automatically update themselves according to user access patterns and behaviors. Several approaches have been introduced in this area. In this paper, we present an approach for developing adaptive websites based on the ACM incremental mining.

  • Meer Hamza, Khaled Mahar, and Mohamed Younis, “NewSPIRIT: Sequential Pattern Mining with Regular Expression Constraints,” ACIT 2003, Alexandria, Egypt, Dec. 2003.

Abstract: The significant growth of sequence database sizes in recent years increases the importance of developing new techniques for data organization and query processing. Discovering sequential patterns is an important problem in data mining with a host of application domains, including medicine, telecommunications, and the World Wide Web. Conventional mining systems provide users with only a very restricted mechanism (based on minimum support) for specifying patterns of interest. For effectiveness and efficiency considerations, constraints are essential for many sequential applications. In this paper, we give a brief review of different sequential pattern mining algorithms and then introduce a new algorithm (termed NewSPIRIT) for mining frequent sequential patterns that satisfy user-specified regular expression constraints. The general idea of our algorithm is to use finite state automata to represent the regular expression constraints and to build a tree that represents all data sequences satisfying these constraints by scanning the database of sequences only once.
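As a rough illustration of regular-expression-constrained sequential pattern mining (a toy sketch, not the NewSPIRIT tree construction itself), the snippet below compiles a constraint into a finite automaton via Python's re module and, in a single pass over the sequence database, counts the candidate subsequences that both satisfy the constraint and meet a minimum support; the sequence encoding, the constraint, and the thresholds are made up for the example.

```python
import re
from itertools import combinations
from collections import Counter

def constrained_patterns(sequences, constraint, min_support, max_len=3):
    """One pass over the sequences: enumerate short order-preserving
    subsequences, keep only those accepted by the compiled automaton,
    and count their support."""
    automaton = re.compile(constraint)          # regex -> finite automaton
    support = Counter()
    for seq in sequences:                       # single database scan
        seen = set()
        for k in range(1, min(max_len, len(seq)) + 1):
            for sub in combinations(seq, k):    # order-preserving subsequences
                if sub not in seen and automaton.fullmatch("".join(sub)):
                    seen.add(sub)               # count each pattern once per sequence
                    support[sub] += 1
    return {p: c for p, c in support.items() if c >= min_support}

# Hypothetical sequences over items a, b, c and the constraint a(b|c)*
db = ["abc", "acb", "ab", "bca"]
print(constrained_patterns(db, r"a[bc]*", min_support=2))
```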

  • Meer Hamza, Khaled Mahar, and Mohamed Younis, “Mining Sequential Pattern with Regular Expression Constraints Using Sequential Pattern Tree,” ICEIS 2004, Porto, Portugal, April 2004.
  • G. Drahem, M. Abougabal, H. Sueyllam and K. Mahar, “BSPC: A Novel Bitmap-Based Scalable Parallel Classifier,” Alexandria Engineering Journal, Vol. 44, No. 4, pp. 585-595, July 2005.

Abstract: Decision trees have been found very effective for classification, especially in data mining. Although classification is a well-studied problem, most current classification algorithms need an in-memory data structure to achieve efficiency, which limits their suitability for mining very large databases. In this paper, a novel Bitmap-based Scalable Parallel Classifier (BSPC) is presented. It removes the in-memory data structure requirements of existing algorithms. Since scalability is a key requirement for any data mining algorithm, it is considered and achieved in the design of BSPC. Additionally, the suggested algorithm has been designed to be easily parallelized, allowing many processors to work together to build a single consistent model. Performance analysis demonstrates that BSPC outperforms other state-of-the-art algorithms. The superiority of the new algorithm is demonstrated through the classification of the Wisconsin breast cancer dataset.
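To give a flavor of the bitmap idea (a simplified sketch on assumed toy data, not the BSPC algorithm as published), the example below encodes each attribute-value predicate and each class label as an integer bitset and evaluates a candidate split with bitwise ANDs and popcounts, so no sorted in-memory attribute lists are needed.

```python
def popcount(bits):
    """Number of set bits (records) in an integer bitset."""
    return bin(bits).count("1")

def bitmap(flags):
    """Pack a list of booleans (one per record) into an integer bitset."""
    bits = 0
    for i, flag in enumerate(flags):
        if flag:
            bits |= 1 << i
    return bits

def gini_of_split(value_bitmaps, class_bitmaps):
    """Weighted Gini impurity of a candidate split, computed purely from
    bitmaps: each partition/class count is a bitwise AND plus a popcount."""
    total = sum(popcount(b) for b in value_bitmaps)
    weighted = 0.0
    for vb in value_bitmaps:
        n = popcount(vb)
        if n == 0:
            continue
        impurity = 1.0 - sum((popcount(vb & cb) / n) ** 2 for cb in class_bitmaps)
        weighted += (n / total) * impurity
    return weighted

# Hypothetical toy data: six records, one attribute ("outlook"), a binary class.
outlook = ["sun", "sun", "rain", "rain", "sun", "rain"]
label   = [True,  True,  False, False,  True,  True]

value_bitmaps = [bitmap([v == x for v in outlook]) for x in ("sun", "rain")]
class_bitmaps = [bitmap(label), bitmap([not y for y in label])]
print(gini_of_split(value_bitmaps, class_bitmaps))   # ~0.22 for this toy split
```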

  • Khaled Mahar and Mohamed Younis, “Predicting Number of New Students in a Course Enrollment: A Data Mining Approach,” Proceedings of the 4th International Conference on Informatics and Systems, INFOS 2006, Cairo University, March 2006.

Abstract: Finding patterns in sequences is a challenging problem of great importance. In many domains such as medicine, finance, and marketing, data are represented as sequences that can be used to predict certain behaviors or events. This can also be extended to the education domain for a number of potentially useful applications. This paper presents a predictive model with a high level of accuracy for predicting the number of students who will register in a certain course in the next term. The proposed model views the student registration process as sequential patterns, where the college offers courses and the students can register for some of these courses each term. By using the stored data of past semesters along with the course plan (course prerequisites), regular expressions can be constructed to constrain the extracted sequential patterns of previously registered courses, which are then used to predict the number of new students who will enroll in a course. To demonstrate the working of the model, a brief overview of a sequential pattern mining algorithm called TSPIRIT (Tree approach for Sequential Pattern mining with Regular expressIon consTraints) is first presented. Then, a study of the complexity of TSPIRIT is given to justify its use as the basis of a predictive model. Finally, TSPIRIT is applied to the student database of the Arab Academy for Science and Technology (AAST) to extract the sequential patterns. Experimental results show the effectiveness of our approach as a predictive model.
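The sketch below is only a loose illustration of the prediction idea, not TSPIRIT itself: a regular expression derived from a hypothetical prerequisite chain is matched against students' past course sequences, and the number of students whose history satisfies the prerequisites but does not yet include the target course is used as a naive estimate of next-term enrollment. All course codes and the constraint are invented.

```python
import re

# Hypothetical prerequisite chain: CS101 must precede CS201, which precedes CS301.
prereq_pattern = re.compile(r"CS101.*CS201")   # CS101 somewhere before CS201

def predict_enrollment(histories, target="CS301"):
    """Count students whose past registrations satisfy the prerequisite
    constraint for the target course but who have not taken it yet."""
    count = 0
    for courses in histories:                  # courses: list of codes in order
        seq = ",".join(courses)
        if target not in courses and prereq_pattern.search(seq):
            count += 1
    return count

# Toy student histories (one list per student, in registration order).
students = [
    ["CS101", "MA101", "CS201"],
    ["CS101", "CS201", "CS301"],
    ["CS101", "MA101"],
    ["MA101", "CS101", "CS201"],
]
print(predict_enrollment(students))   # -> 2 under these toy assumptions
```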

  • Shaaban Mahran and Khaled Mahar, “Using Grid for Accelerating Density-Based Clustering,” 8th IEEE International Conference on Computer and Information Technology (CIT 2008), Sydney, Australia, pp. 35-40, July 2008.

Abstract: Clustering analysis is a primary method for data mining, and the ever-increasing volumes of data in different applications force clustering algorithms to cope with them. DBSCAN is a well-known algorithm for density-based clustering. It is both effective, as it can detect arbitrarily shaped clusters of dense regions, and efficient, especially when spatial indexes are available to perform the neighborhood queries. In this paper we introduce a new algorithm, GriDBSCAN, that enhances the performance of DBSCAN using grid partitioning and merging, yielding high performance together with a high degree of parallelism. We verified the correctness of the algorithm theoretically and experimentally, and studied its performance both theoretically and through experiments on real and synthetic data. It proved to run much faster than the original DBSCAN. We also compared the algorithm with a similar algorithm, EnhancedDBSCAN, which is likewise an enhancement of DBSCAN based on partitioning. Experiments showed the new algorithm's superiority in performance and degree of parallelism.
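As a rough sketch of the grid-and-merge idea (simplified relative to GriDBSCAN as published, and using scikit-learn's DBSCAN rather than the authors' implementation), the code below partitions 2D points into grid cells enlarged by eps, clusters each cell independently, and merges per-cell clusters that share points in the overlapping borders; the grid size, eps, and min_samples values are arbitrary choices here.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def grid_dbscan(points, eps=0.3, min_samples=5, cell=2.0):
    """Partition 2D points into grid cells widened by eps, run DBSCAN in each
    cell independently (hence parallelizable), then merge clusters that share
    points in the overlapping borders."""
    points = np.asarray(points)
    parent = {}                                   # union-find over (cell, local label)

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    def union(a, b):
        parent[find(a)] = find(b)

    memberships = [set() for _ in range(len(points))]
    x0s = np.arange(points[:, 0].min(), points[:, 0].max() + cell, cell)
    y0s = np.arange(points[:, 1].min(), points[:, 1].max() + cell, cell)
    for x0 in x0s:
        for y0 in y0s:
            # enlarge the cell by eps so border points also appear in neighbours
            mask = ((points[:, 0] >= x0 - eps) & (points[:, 0] < x0 + cell + eps) &
                    (points[:, 1] >= y0 - eps) & (points[:, 1] < y0 + cell + eps))
            idx = np.where(mask)[0]
            if len(idx) < min_samples:
                continue
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points[idx])
            for i, lab in zip(idx, labels):
                if lab == -1:
                    continue
                cid = (float(x0), float(y0), int(lab))
                parent.setdefault(cid, cid)
                for other in memberships[i]:      # shared point => merge clusters
                    union(cid, other)
                memberships[i].add(cid)

    final, roots = [], {}
    for clusters in memberships:
        if not clusters:
            final.append(-1)                      # noise
        else:
            root = find(next(iter(clusters)))
            final.append(roots.setdefault(root, len(roots)))
    return final

# Made-up data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.2, (60, 2)), rng.normal(3, 0.2, (60, 2))])
print(set(grid_dbscan(data)))                     # expected: two cluster ids (and possibly -1)
```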

  • Ashraf Shawky, Essam Kosba and Khaled Mahar, “Using Association Rule Mining to Detect Adverse Drug Events,” Proceedings of the 20th International Conference on Computer Theory and Applications, ICCTA’2010, Arab Academy for Science and Technology, Alexandria, Egypt, Oct. 2010.

Abstract: Improving patient safety is a top priority in health care. However, adverse drug events (ADEs) are estimated to affect up to 5% of hospitalized patients annually, causing morbidity and mortality. With the availability of large amounts of medical data, discovering ADE patterns using data mining (DM) techniques becomes a challenge. This paper proposes a framework to mine a large database containing data about previously prescribed drugs and their adverse outcomes. This mining process yields association rules that can be used to detect ADEs in future prescriptions. In addition, an Adverse Drug Events Detection Tool (ADEDT) is built to help physicians detect ADEs during the drug prescription phase.
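Purely to illustrate the kind of rule mining the abstract describes (a toy sketch with invented drug and event names, not the paper's framework or the ADEDT tool), the snippet below counts drug-to-outcome co-occurrences in prescription records and keeps rules of the form drug -> adverse event that pass minimum support and confidence thresholds.

```python
from collections import Counter

def mine_ade_rules(records, min_support=0.2, min_confidence=0.6):
    """records: list of (set_of_drugs, set_of_adverse_events).
    Returns rules drug -> event with their support and confidence."""
    n = len(records)
    drug_count, pair_count = Counter(), Counter()
    for drugs, events in records:
        for d in drugs:
            drug_count[d] += 1
            for e in events:
                pair_count[(d, e)] += 1
    rules = []
    for (d, e), c in pair_count.items():
        support, confidence = c / n, c / drug_count[d]
        if support >= min_support and confidence >= min_confidence:
            rules.append((d, e, round(support, 2), round(confidence, 2)))
    return rules

# Hypothetical prescription history: (drugs given, adverse events observed).
history = [
    ({"warfarin", "aspirin"}, {"bleeding"}),
    ({"warfarin"}, {"bleeding"}),
    ({"aspirin"}, set()),
    ({"warfarin", "statin"}, {"bleeding"}),
    ({"statin"}, set()),
]
print(mine_ade_rules(history))   # -> [('warfarin', 'bleeding', 0.6, 1.0)]
```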

  • Yasser El-Sonbaty and Rasha Kashef, “New Fast Algorithm for Incremental Mining of Association Rules,” ICEIS 2004 - 6th International Conference on Enterprise Information Systems, Porto, Portugal, pp. 275-281, 2004.

Abstract: Mining association rules is a well-studied problem, and several algorithms have been presented for finding large itemsets. In this paper we present a new algorithm for incremental discovery of large itemsets in a growing set of transactions. The proposed algorithm is based on partitioning the database and keeping a summary of local large itemsets for each partition based on the negative border technique. A global summary for the whole database is also created to facilitate the fast updating of the overall large itemsets. When adding a new set of transactions to the database, the algorithm uses these summaries instead of scanning the whole database, thus reducing the number of database scans. The results of applying the new algorithm showed that the new technique is quite efficient and, in many respects, superior to other incremental algorithms like Fast Algorithm (FUP) and Large Itemsets (ULI).
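The following toy sketch is one interpretation of the summary-plus-negative-border idea, not the authors' exact algorithm: besides the tracked itemset counts it scans only the new batch of transactions, and it signals a full rescan only when an itemset outside the stored summary looks frequent within the increment. Item names and thresholds are hypothetical, and for brevity the toy summary simply tracks all itemsets up to size two rather than only the frequent sets and their border.

```python
from itertools import combinations
from collections import Counter

def count_itemsets(transactions, max_size=2):
    counts = Counter()
    for t in transactions:
        for k in range(1, max_size + 1):
            counts.update(combinations(sorted(set(t)), k))
    return counts

def incremental_update(summary, n_old, new_transactions, min_support=0.5):
    """Scan only the increment, refresh the counts of itemsets already in the
    summary, and report whether a full database rescan is needed."""
    inc = count_itemsets(new_transactions)
    n_total = n_old + len(new_transactions)
    updated = Counter(summary)
    updated.update({k: inc[k] for k in summary})        # refresh tracked counts only
    frequent = {k: c for k, c in updated.items() if c >= min_support * n_total}
    # crude rescan signal: an untracked itemset is frequent within the increment alone
    untracked = [k for k in inc
                 if k not in summary and inc[k] >= min_support * len(new_transactions)]
    return frequent, bool(untracked)

# Hypothetical: a summary built from 4 old transactions, then 2 new ones arrive.
old_summary = count_itemsets([("a", "b"), ("a", "b"), ("a", "c"), ("b", "c")])
print(incremental_update(old_summary, 4, [("a", "b"), ("a", "b", "c")]))
```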

  • Yasser El-Sonbaty, M. A. Ismail and Mohamed Farouk, “An Efficient Density Based Clustering Algorithm for Large Databases,” The 16th IEEE International Conference on Tools with Artificial Intelligence, Boca Raton, Florida, USA, pp. 673-679, November 15-17, 2004.

Abstract: Clustering in data mining is used for identifying useful patterns and interesting distributions in the underlying data. Several algorithms for clustering large data sets have been proposed in the literature using different techniques. The density-based method is one of these methodologies; it can detect arbitrarily shaped clusters, where clusters are defined as dense regions separated by low-density regions. In this paper, we present a new clustering algorithm that enhances the density-based algorithm DBSCAN. Synthetic datasets are used for the experimental evaluation, which shows that the new clustering algorithm is faster and more scalable than the original DBSCAN.

  • Yasser El-Sonbaty and Amgad Neematallah, “Multiway Decision Tree Induction Using Projection and Merging (MPEG),” The 17th IEEE International Conference on Tools with Artificial Intelligence, Hong Kong, China, November 14-16, 2005.

Abstract: Decision trees are one of the most popular and commonly used classification models. Many algorithms have been designed to build decision trees over the last twenty years. These algorithms are categorized into two groups according to the type of trees they build: binary and multiway trees. In this paper, a new algorithm, MPEG, is designed to build multiway trees. MPEG uses DBMS indices and optimized queries to access the dataset it works on; hence it has low memory requirements and no restrictions on dataset size. Projection of examples over attribute values, merging of the generated partitions using class values, applying the Gini index to select among different attributes, and finally post-pruning using the EBP method are the basic steps of MPEG. The trees built by MPEG have the advantages of binary trees, being accurate and small in size, and the advantages of multiway trees, being compact and easy for humans to comprehend.
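As a simplified illustration of two of the listed steps (projection over attribute values and Gini-based evaluation; a sketch only, not the MPEG algorithm, and without the DBMS indexing or EBP pruning), the code below projects examples onto an attribute's values, merges partitions that share the same majority class, and reports the weighted Gini index of the resulting multiway split; the toy attribute and labels are invented.

```python
from collections import Counter, defaultdict

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def multiway_split(values, labels):
    """Project examples over attribute values, merge partitions with the same
    majority class, and score the split with the weighted Gini index."""
    partitions = defaultdict(list)
    for v, y in zip(values, labels):
        partitions[v].append(y)                       # projection step
    merged = defaultdict(list)
    for v, ys in partitions.items():
        majority = Counter(ys).most_common(1)[0][0]
        merged[majority].extend(ys)                   # merging step (by class)
    n = len(labels)
    score = sum(len(ys) / n * gini(ys) for ys in merged.values())
    return dict(merged), score

# Hypothetical attribute "weather" and a binary class label.
weather = ["sun", "sun", "rain", "fog", "fog", "rain"]
play    = ["yes", "yes", "no",  "no",  "yes", "no"]
print(multiway_split(weather, play))                  # weighted Gini = 0.25 here
```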

  • Sherine Nagy and Yasser El-Sonbaty, “A Feature Selection Algorithm with Redundancy Reduction for Text Classification,” 22nd International Symposium on Computer and Information Sciences, Turkey, 7-9 November 2007.

Abstract: Document classification involves classifying documents according to their content into predefined categories. One of the main problems of document classification is the large dimensionality of the data. To overcome this problem, feature selection is required, which reduces the number of selected features and thus improves the classification accuracy. In this paper, a new algorithm for multi-label document classification is presented. This algorithm focuses on the reduction of redundant features using the concept of minimal-redundancy maximal-relevance, which is based on the mutual information measure. The features selected by the proposed algorithm are then input to one of two classifiers, the multinomial naive Bayes classifier or the linear-kernel support vector machine. The experimental results on the Reuters dataset show that the proposed algorithm is superior to some recent algorithms presented in the literature in many respects, such as the F1-measure and the break-even point.
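To make the minimal-redundancy maximal-relevance criterion concrete (a small sketch of the general mRMR idea on made-up binary term features, not the paper's algorithm or its Reuters setup), the code below computes discrete mutual information and greedily selects features that maximize relevance to the class minus average redundancy with the already selected features.

```python
import numpy as np

def mutual_info(x, y):
    """Mutual information between two discrete vectors (natural log)."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def mrmr(features, labels, k):
    """Greedy mRMR: maximize MI(feature, class) minus the mean MI between the
    candidate and the features already selected."""
    selected = []
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(features.shape[1]):
            if j in selected:
                continue
            relevance = mutual_info(features[:, j], labels)
            redundancy = (np.mean([mutual_info(features[:, j], features[:, s])
                                   for s in selected]) if selected else 0.0)
            if relevance - redundancy > best_score:
                best, best_score = j, relevance - redundancy
        selected.append(best)
    return selected

# Hypothetical binary term-occurrence matrix (documents x terms) and labels;
# column 1 duplicates column 0, column 2 is complementary, column 3 is weak.
X = np.array([[0, 0, 0, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 1],
              [1, 1, 1, 0]])
y = np.array([0, 1, 1, 1])
print(mrmr(X, y, k=2))   # expected [0, 2]: the duplicate column 1 is skipped
```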

  • Yasser El-Sonbaty and Hany Said, “Enhanced Density Based Algorithm for Clustering Large Datasets,” Advances in Soft Computing: Computer Recognition Systems, Springer Berlin/Heidelberg, Vol. 57, pp. 195-203, ISBN 978-3-540-93904-7, 2009.

Abstract: Clustering is one of the data mining techniques that extract knowledge from spatial datasets. The DBSCAN algorithm is considered well-founded, as it discovers clusters of different shapes and handles noise effectively. Several algorithms improve on DBSCAN, such as the fast hybrid density algorithm (L-DBSCAN) and the fast density-based clustering algorithm. In this paper, an enhanced algorithm is proposed that improves the fast density-based clustering algorithm in its ability to discover clusters with different densities and to cluster large datasets.