Abstract: An incremental updating algorithm for the maintenance of previously discovered association rules is applied to data cubes. Previous research concentrated on the development of incremental algorithms working on flat data, i.e. the database files. Because of the huge amounts of data usually involved, using data cubes accelerates the mining process and avoids scanning the whole database after every update to the data. The approach also suggests a way to perform incremental mining in practice without affecting the original (functioning) database. Previous methods required tagging new records in order to count them in the incremental algorithm; our approach eliminates this requirement.
Abstract: Due to the increasing numbers of users and websites on the Internet, the growing demands of web users, and intensifying competition, it has become inevitable for companies and organizations to move towards adaptive websites. Adaptive websites are websites that automatically update themselves according to user access patterns and behavior. Several approaches have been introduced for this purpose. In this paper, we present an approach for developing adaptive websites based on ACM incremental mining.
Abstract: The significant growth of sequence database sizes in recent years increases the importance of developing new techniques for data organization and query processing. Discovering sequential patterns is an important problem in data mining, with a host of application domains including medicine, telecommunications, and the World Wide Web. Conventional mining systems provide users with only a very restricted mechanism (based on minimum support) for specifying patterns of interest. For both effectiveness and efficiency, constraints are essential in many sequential applications. In this paper, we give a brief review of different sequential pattern mining algorithms and then introduce a new algorithm (termed NewSPIRIT) for mining frequent sequential patterns that satisfy user-specified regular expression constraints. The general idea of our algorithm is to use finite state automata to represent the regular expression constraints and to build a tree representing all data sequences that satisfy these constraints by scanning the sequence database only once.
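The core idea of representing a regular expression constraint as a finite state automaton and pruning candidate sequences against it can be sketched as follows. This is a minimal illustration, not the paper's implementation: the item alphabet, the toy constraint `a (b | c) d`, and the transition-table encoding are all assumptions for the example.

```python
# Sketch: a regular-expression constraint encoded as a deterministic
# finite automaton (DFA), used to filter data sequences in a single pass.
# The transition table, start state, and accepting states are illustrative.

def make_dfa(transitions, start, accepting):
    """Build a DFA checker from a {(state, item): next_state} table."""
    def accepts(sequence):
        state = start
        for item in sequence:
            key = (state, item)
            if key not in transitions:
                return False  # no valid transition: constraint violated
            state = transitions[key]
        return state in accepting
    return accepts

# Toy constraint: sequences matching the regular expression  a (b | c) d
dfa = make_dfa(
    transitions={(0, "a"): 1, (1, "b"): 2, (1, "c"): 2, (2, "d"): 3},
    start=0,
    accepting={3},
)

database = [["a", "b", "d"], ["a", "c", "d"], ["a", "d"], ["b", "a", "d"]]
# Only sequences satisfying the constraint survive this single scan.
valid = [seq for seq in database if dfa(seq)]
```

Only `["a", "b", "d"]` and `["a", "c", "d"]` pass the filter; the others fail a transition or end in a non-accepting state.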
Abstract: Decision trees have been found very effective for classification, especially in data mining. Although classification is a well-studied problem, most current classification algorithms need an in-memory data structure to achieve efficiency, which limits their suitability for mining large databases. In this paper, a novel Bitmap-based Scalable Parallel Classifier (BSPC) is presented that removes all the memory requirements of existing algorithms. Since scalability is a key requirement for any data mining algorithm, it is considered and achieved in the design of BSPC. Additionally, the suggested algorithm has been designed to be easily parallelized, allowing many processors to work together to build a single consistent model. Performance analysis demonstrates that BSPC outperforms other state-of-the-art algorithms. The superiority of the novel algorithm is demonstrated through the classification of the Wisconsin breast cancer dataset.
Abstract: Finding patterns in sequences is a challenging problem of great importance. In many domains such as medicine, finance, and marketing, data are represented as sequences that can be used to predict certain behaviors or events. This can also be extended to the education domain for a number of potentially useful applications. This paper presents a predictive model with a high level of accuracy for predicting the number of students who will register for a certain course in an upcoming term. The proposed model views the student registration process as sequential patterns, where the college offers courses and the students register for some of these courses each term. By using the stored data of past semesters along with the course plan (course prerequisites), regular expressions can be constructed to constrain the extracted sequential patterns of previously registered courses, which are then used to predict the number of new students who will enroll in a course. To demonstrate the working of the model, a brief overview of a sequential pattern mining algorithm called TSPIRIT (Tree approach for Sequential Pattern mining with Regular expressIon consTraints) is first presented. Then, a study of the complexity of TSPIRIT is given to establish its efficiency as a predictive model. Finally, TSPIRIT is applied to the student database of the Arab Academy for Science and Technology (AAST) to extract the sequential patterns. Experimental results show the effectiveness of our approach as a predictive model.
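The constraint derived from a course plan admits exactly those registration sequences in which every course appears only after all of its prerequisites. A minimal sketch of that admissibility check, with hypothetical course names and a hypothetical prerequisite table (the paper's actual regular-expression construction is not reproduced here):

```python
def respects_prereqs(sequence, prereqs):
    """Check that a term-by-term course sequence respects the course plan:
    every course appears only after all its prerequisites were completed
    in an earlier term. Course names and the table are illustrative."""
    taken = set()
    for term in sequence:
        for course in term:
            # all prerequisites must already be in an earlier term
            if not prereqs.get(course, set()) <= taken:
                return False
        # courses within the same term cannot satisfy each other
        taken |= set(term)
    return True

# Hypothetical plan: CS102 requires CS101, CS201 requires CS102.
plan = {"CS102": {"CS101"}, "CS201": {"CS102"}}
```

Sequences passing this check correspond to the registration patterns the regular-expression constraint would admit.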
Abstract: Clustering analysis is a primary method for data mining. The ever-increasing volumes of data in different applications force clustering algorithms to scale accordingly. DBSCAN is a well-known algorithm for density-based clustering. It is both effective, as it can detect arbitrarily shaped clusters of dense regions, and efficient, especially when spatial indexes are available to perform the neighborhood queries quickly. In this paper we introduce a new algorithm, GriDBSCAN, that enhances the performance of DBSCAN using grid partitioning and merging, yielding high performance with the advantage of a high degree of parallelism. We verified the correctness of the algorithm theoretically and experimentally, and studied its performance both theoretically and through experiments on real and synthetic data. It proved to run much faster than the original DBSCAN. We also compared the algorithm with a similar algorithm, EnhancedDBSCAN, which likewise enhances DBSCAN using partitioning. Experiments showed the new algorithm's superiority in performance and degree of parallelism.
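The grid-partitioning step behind a GriDBSCAN-style approach can be sketched as follows: points are assigned to grid cells, and points lying within `eps` of a cell border are replicated into the neighboring cell, so that per-cell clustering results can later be merged consistently and cells processed in parallel. The cell size, the 2-D restriction, and the replication rule below are assumptions made for illustration, not the paper's exact scheme.

```python
from collections import defaultdict

def grid_partition(points, cell_size, eps):
    """Assign 2-D points to grid cells, replicating each point into any
    neighboring cell whose eps-expanded region contains it. Sketch of the
    partitioning phase; clustering and merging are done per cell later."""
    cells = defaultdict(list)
    for (x, y) in points:
        cx, cy = int(x // cell_size), int(y // cell_size)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                nx, ny = cx + dx, cy + dy
                # eps-expanded bounds of the neighboring cell
                lo_x, hi_x = nx * cell_size - eps, (nx + 1) * cell_size + eps
                lo_y, hi_y = ny * cell_size - eps, (ny + 1) * cell_size + eps
                if lo_x <= x < hi_x and lo_y <= y < hi_y:
                    cells[(nx, ny)].append((x, y))
    return cells

# A point near a cell border lands in both adjacent cells.
cells = grid_partition([(0.5, 0.5), (9.5, 0.5)], cell_size=10.0, eps=1.0)
```

Because border points appear in both cells, clusters that straddle a cell boundary can be stitched together in the merge phase.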
Abstract: Improving patient safety is a top priority in health care. However, adverse drug events (ADEs) are estimated to affect up to 5% of hospitalized patients annually, causing morbidity and mortality. Given the availability of large amounts of medical data, discovering ADE patterns using data mining (DM) techniques has become a challenge. This paper proposes a framework for mining a large database containing data about previously prescribed drugs and their adverse outcomes. This mining process yields the association rules necessary to detect ADEs in future prescriptions. In addition, an Adverse Drug Events Detection Tool (ADEDT) is built to help physicians detect ADEs during the drug prescription phase.
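At detection time, a tool of this kind essentially matches mined association rules against a new prescription: a rule fires when its antecedent drug set is contained in the prescribed drugs. A minimal sketch, with made-up rules and drug names (the paper's rule format and thresholds are not specified here):

```python
def detect_ades(prescription, rules):
    """Return (outcome, confidence) pairs for every mined rule whose
    antecedent drug set is a subset of the prescription. Rules and drug
    names are hypothetical examples, not mined results."""
    rx = set(prescription)
    return [(outcome, conf)
            for drugs, outcome, conf in rules
            if set(drugs) <= rx]

# Illustrative rule base: {drug set} -> adverse outcome, with confidence.
rules = [({"drugA", "drugB"}, "bleeding", 0.8),
         ({"drugC"}, "nausea", 0.5)]
alerts = detect_ades(["drugA", "drugB", "drugD"], rules)
```

Only the first rule fires here, since `drugC` is not in the prescription.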
Abstract: Mining association rules is a well-studied problem, and several algorithms have been presented for finding large itemsets. In this paper we present a new algorithm for incremental discovery of large itemsets in a growing set of transactions. The proposed algorithm partitions the database and keeps a summary of local large itemsets for each partition based on the negative border technique. A global summary for the whole database is also created to facilitate fast updating of the overall large itemsets. When a new set of transactions is added to the database, the algorithm uses these summaries instead of scanning the whole database, thus reducing the number of database scans. The results of applying the new algorithm show that the technique is quite efficient, and in many respects superior to other incremental algorithms such as FUP and ULI.
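The negative border mentioned above is the set of itemsets that are not frequent themselves but all of whose proper subsets are frequent; keeping it alongside the frequent itemsets is what lets an incremental algorithm decide, without a full rescan, whether an update can promote new itemsets. A brute-force sketch over a toy universe (exponential in the number of items, for illustration only):

```python
from itertools import combinations

def negative_border(frequent):
    """Compute the negative border of a collection of frequent itemsets:
    itemsets that are NOT frequent but all of whose proper subsets are.
    Brute-force sketch; real summaries are built during candidate
    generation, not by enumeration."""
    frequent = {frozenset(s) for s in frequent}
    items = set().union(*frequent) if frequent else set()
    border = set()
    for k in range(1, len(items) + 1):
        for cand in combinations(sorted(items), k):
            cs = frozenset(cand)
            if cs in frequent:
                continue
            # all (k-1)-subsets must be frequent (trivially true for k=1)
            subsets_ok = all(frozenset(sub) in frequent
                             for sub in combinations(cand, k - 1)) if k > 1 else True
            if subsets_ok:
                border.add(cs)
    return border
```

For example, with frequent itemsets {a}, {b}, {c}, {a,b}, the negative border is {a,c} and {b,c}: both are infrequent, yet every proper subset of each is frequent.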
Abstract: Clustering in data mining is used for identifying useful patterns and interesting distributions in the underlying data. Several algorithms for clustering large data sets have been proposed in the literature using different techniques. Density-based clustering is one such methodology; it can detect arbitrarily shaped clusters, where clusters are defined as dense regions separated by low-density regions. In this paper, we present a new clustering algorithm that enhances the density-based algorithm DBSCAN. Synthetic datasets are used for experimental evaluation, which shows that the new clustering algorithm is faster and more scalable than the original DBSCAN.
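The notion of clusters as dense regions separated by low-density regions is exactly what DBSCAN formalizes with its `eps` radius and `min_pts` density threshold. A minimal, brute-force sketch of the baseline algorithm being enhanced (real implementations use spatial indexes for the neighborhood queries; this is illustrative only):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: label each point with a cluster id, or -1
    for noise. O(n^2) brute-force neighborhood queries."""
    def neighbors(i):
        return [j for j in range(len(points))
                if sum((a - b) ** 2
                       for a, b in zip(points[i], points[j])) <= eps ** 2]

    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1          # noise (may become a border point)
            continue
        labels[i] = cid             # i is a core point: start a cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cid     # noise reached from a core: border
                continue
            if labels[j] is not None:
                continue
            labels[j] = cid
            if len(neighbors(j)) >= min_pts:
                seeds.extend(neighbors(j))  # expand through core points
        cid += 1
    return labels
```

Three mutually close points form one dense cluster, while an isolated point is labeled noise.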
Abstract: Decision trees are among the most popular and commonly used classification models. Many algorithms have been designed to build decision trees over the last twenty years. These algorithms fall into two groups according to the type of trees they build: binary and multiway trees. In this paper, a new algorithm, MPEG, is designed to build multiway trees. MPEG uses DBMS indices and optimized queries to access the dataset it works on; hence it has minimal memory requirements and no restrictions on dataset size. The basic steps of MPEG are: projection of examples over attribute values, merging of the generated partitions using class values, applying the GINI index to choose among different attributes, and finally post-pruning using the EBP method. The trees built by MPEG have the advantages of binary trees, being accurate and small in size, as well as the advantages of multiway trees, being compact and easy for humans to comprehend.
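The GINI-based attribute choice in the step list above rests on two standard quantities: the Gini impurity of a node and the weighted Gini of a candidate split (lower is better). A short sketch of both, independent of MPEG's DBMS-backed implementation:

```python
def gini(class_counts):
    """Gini impurity of a node given its per-class example counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(partitions):
    """Weighted Gini impurity of a candidate split. Each partition is a
    list of per-class counts; the attribute whose split minimizes this
    value is the one a GINI-based builder selects."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)
```

A perfectly mixed two-class node has impurity 0.5, a pure node 0.0, and a split that cleanly separates the classes scores a weighted Gini of 0.0.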
Abstract: Document classification is the act of assigning documents to predefined categories according to their content. One of the main problems of document classification is the large dimensionality of the data. To overcome this problem, feature selection is required, which reduces the number of selected features and thus improves the classification accuracy. In this paper, a new algorithm for multi-label document classification is presented. This algorithm focuses on the reduction of redundant features using the concept of minimal redundancy maximal relevance, which is based on the mutual information measure. The features selected by the proposed algorithm are then input to one of two classifiers, the multinomial naive Bayes classifier or linear-kernel support vector machines. Experimental results on the Reuters dataset show that the proposed algorithm is superior to some recent algorithms in the literature in several respects, such as the F1-measure and the break-even point.
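The mutual information measure underlying minimal-redundancy-maximal-relevance (mRMR) selection scores how much knowing a feature value tells you about the label (relevance) or about another feature (redundancy). A small sketch of the measure itself for discrete variables, in bits (illustrative; not the paper's implementation):

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits between two discrete variables,
    estimated from paired samples. In mRMR this scores feature-label
    relevance and feature-feature redundancy."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        mi += p_joint * log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi
```

A binary feature that perfectly predicts a binary label carries 1 bit of information about it; an independent feature carries 0 bits.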
Abstract: Clustering is one of the data mining techniques that extract knowledge from spatial datasets. The DBSCAN algorithm is considered well founded, as it discovers clusters of different shapes and handles noise effectively. Several algorithms improve on DBSCAN, such as the fast hybrid density algorithm (L-DBSCAN) and the fast density-based clustering algorithm. In this paper, an enhanced algorithm is proposed that improves the fast density-based clustering algorithm in its ability to discover clusters with different densities and to cluster large datasets.