 


Intelligence Science Lab
- Zhongzhi Shi
 
General Data Mining Platform – MSMiner
 

1. Introduction

2. Architecture

3. MetaData

4. ETL

5. Data Warehouse and OLAP

6. Data Mining

7. Application Example

 

1. Introduction

MSMiner, developed by the Key Lab of Intelligent Information Processing, is a multi-strategy data mining system. It uses object-oriented knowledge representation and processing, integrates many data mining methods, and is combined with data warehouse technology. The system supports On-Line Analytical Processing (OLAP) on multidimensional data and decision-making for high-level users. Its data mining functions include feature extraction, classification, clustering, prediction, association rule discovery, and statistical analysis, each supported by several algorithms. It also provides data mining and decision-making services for different kinds of users.

MSMiner consists of four parts: the ETL (data extraction, transformation, and loading) subsystem, the metadata management subsystem, the data warehouse manager subsystem, and the data mining subsystem. Working with the ETL subsystem, the data warehouse manager subsystem creates a data warehouse from relational data sources; the warehouse is managed and maintained through the metadata management subsystem. On top of this data warehouse, MSMiner supports OLAP and data mining tasks. The data mining subsystem integrates many algorithms and provides a task manager and a task processing engine. It expresses and processes data mining and decision-making work in the form of an object-oriented task model.


2. Architecture

 


3. MetaData

       Metadata is data about data: it describes the content, quality, condition, and other characteristics of data. It plays an important role not only in the design, implementation, and maintenance of the data warehouse, but also in organizing data, querying information, and understanding results [3]. Metadata usually records the location and description of the warehouse system's components. Here we expand its scope and use it to describe and manage the data and environment of the whole system, which covers not only the data in the data warehouse platform but also the task models and algorithms (or functions) used in ETL and data mining. Metadata is at the core of the system because it integrates the ETL, data warehouse, and data mining tools. It controls the whole flow from ETL through the data warehouse to data mining, so ETL and data mining tasks can be defined and executed more conveniently and effectively.

    In MSMiner, the contents of the metadata are as follows:

    (1) Description of the external data source. The external data source can be a relational database or another kind of data, such as Excel data, plain text, or XML text. The metadata records the location and environment information of the external data source, its data structure, and a description of its contents.

    (2) Description of the subject, including the name and remark of the subject and when it was created and updated.

    (3) Description of the databases under a subject, including the name, type, and remark of each database, its login information, and other information.

    (4) Description of the tables in a database, including fact tables, dimension tables, and temporary tables. It contains information about the tables and about their fields.

    (5) Description of the ETL task, covering the organization and steps of the task, the data source, the transformation functions selected, the parameter settings, the creation and execution history of the task, and so on.

    (6) Description of the data mining task, covering the organization and steps of the task, the data source, the mining algorithms selected, the parameter settings, the evaluation and output of the results, the creation and execution history of the task, and so on.

    (7) Description of the data cube, covering the dimensions and measures of the extracted information, the construction information of the star structure, and so on.

    (8) Management of the algorithm base for data mining, covering the registration and management of the mining algorithms.

    (9) Management of the functions for ETL, covering the registration and management of the functions.

   (10) User information, covering each user's basic information, permissions, operation history, and so on.

       We build the corresponding metadata classes with an object-oriented method. The system uses a three-tier architecture, with the metadata management subsystem placed in the middle tier, where it can be regarded as a metadata management server. The upper tier accesses and manages metadata through the middle tier.

       Metadata is generated automatically whenever a component of the system is created, and it changes during the daily maintenance of the system. MSMiner provides a dedicated metadata manager subsystem that can maintain the metadata directly, so that the whole system is managed effectively.
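As a rough illustration of object-oriented metadata classes behind a middle-tier server, the following Python sketch models two of the items listed above (subjects and tables). All class, method, and field names here are hypothetical and chosen for this example; they are not MSMiner's actual classes.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class TableMeta:
    """Metadata for one warehouse table (item 4 in the list above)."""
    name: str
    kind: str  # "fact", "dimension", or "temporary"
    fields: List[str] = field(default_factory=list)


@dataclass
class SubjectMeta:
    """Metadata for one subject (item 2), together with its tables."""
    name: str
    remark: str = ""
    created: datetime = field(default_factory=datetime.now)
    tables: Dict[str, TableMeta] = field(default_factory=dict)


class MetadataServer:
    """Hypothetical middle-tier object through which upper tiers reach metadata."""

    def __init__(self) -> None:
        self._subjects: Dict[str, SubjectMeta] = {}

    def create_subject(self, name: str, remark: str = "") -> SubjectMeta:
        # The metadata record is generated when the component is created.
        subject = SubjectMeta(name=name, remark=remark)
        self._subjects[name] = subject
        return subject

    def add_table(self, subject: str, table: TableMeta) -> None:
        self._subjects[subject].tables[table.name] = table

    def tables_of(self, subject: str, kind: str) -> List[str]:
        # Return the sorted names of the subject's tables of the given kind.
        tables = self._subjects[subject].tables.values()
        return sorted(t.name for t in tables if t.kind == kind)


server = MetadataServer()
server.create_subject("sales", remark="monthly sales analysis")
server.add_table("sales", TableMeta("sales_fact", "fact", ["time_id", "product_id", "amount"]))
server.add_table("sales", TableMeta("time_dim", "dimension", ["time_id", "year", "month"]))
print(server.tables_of("sales", "dimension"))
```

Keeping all reads and writes behind one server object mirrors the idea that upper tiers never touch metadata storage directly.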


4. ETL

       The ETL subsystem is an important part of MSMiner. Its main purpose is to transform the operational data in source databases into analytical data in the data warehouse. The data in a data warehouse is extracted and integrated from distributed databases (for example Oracle, SQL Server, Access, FoxPro, Excel, and DB2), and operational data in a source database differs in many ways from analytical data in a data warehouse, so loading data from the various sources directly into the warehouse is not a good approach. To obtain clean data, the source data must be cleaned, collected, and transformed before it is integrated into the data warehouse. This is a key and complex step in building a data warehouse. In general, the ETL subsystem has to do the following work:

    (1) Because source data from distributed databases contains duplicates and conflicts, the subsystem should reconcile the conflicting data.

    (2) To obtain comprehensive data for the data warehouse, the subsystem should transform the original data structure from an application-oriented one to a subject-oriented one and perform the necessary derivation and computation.

 

   

     The basic architecture of the ETL subsystem is shown in Figure 2. It has four modules:

    (1) The friendly user interface

      Through this interface users can conveniently perform any ETL operation, such as designing ETL tasks, registering new ETL DLL functions, scheduling the execution of ETL tasks, and viewing the results of ETL tasks.

    (2) The integrated ETL function management and ETL task management

    This module covers registering new ETL DLL functions, building new ETL tasks, and scheduling and processing ETL tasks.

    (3) The uniform metadata management

    The whole subsystem is developed in a metadata-oriented way: all of its information, including data sources, algorithms, and results, is managed through metadata.

    (4) The database server

    The ETL subsystem supports a variety of distributed databases (for example Oracle, SQL Server, Access, FoxPro, Excel, and DB2).

    The subsystem supports an extensible ETL function base. The main ETL algorithms are implemented as dynamic link libraries (DLLs) with uniform interfaces. Users design an ETL task by choosing the ETL DLLs that fit their needs. At present the subsystem provides about 30 kinds of ETL DLLs. Users can also develop new ETL DLLs that conform to the uniform interfaces and add them to the ETL function base. To improve efficiency, ETL tasks can be scheduled for a designated time and processed concurrently.
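The idea of a function base with a uniform interface can be sketched in a few lines of Python. This is only an analogy: plain Python functions stand in for the ETL DLLs, and the registry, the row format, and the two sample functions are assumptions made for this example rather than MSMiner's real interface.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
# Assumed uniform interface: every ETL function maps a list of rows to a list of rows.
EtlFunction = Callable[[List[Row]], List[Row]]

# Hypothetical function base: name -> function with the uniform signature.
FUNCTION_BASE: Dict[str, EtlFunction] = {}


def register(name: str) -> Callable[[EtlFunction], EtlFunction]:
    """Register a function in the base under the given name."""
    def decorator(func: EtlFunction) -> EtlFunction:
        FUNCTION_BASE[name] = func
        return func
    return decorator


@register("trim_strings")
def trim_strings(rows: List[Row]) -> List[Row]:
    # Strip surrounding whitespace from every string value.
    return [{k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
            for row in rows]


@register("drop_duplicates")
def drop_duplicates(rows: List[Row]) -> List[Row]:
    # Keep the first occurrence of each distinct row, preserving order.
    seen = set()
    result = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


def run_etl_task(steps: List[str], rows: List[Row]) -> List[Row]:
    """An ETL task is modeled here as an ordered list of registered function names."""
    for name in steps:
        rows = FUNCTION_BASE[name](rows)
    return rows


source = [{"city": " Beijing "}, {"city": "Beijing"}, {"city": "Shanghai"}]
cleaned = run_etl_task(["trim_strings", "drop_duplicates"], source)
print(cleaned)
```

Because every function shares one signature, a new function becomes available to tasks as soon as it is registered, without changes to the task runner.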

 


5. Data Warehouse and OLAP

       A data warehouse is “a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management decisions”. The data warehouse component provides a general data warehouse environment in which users can create and maintain their own data warehouses according to their needs, carry out data analysis and processing, and prepare data for data mining tasks.

        The data warehouse in MSMiner consists of many subjects. When a data warehouse is created, users define several subject areas according to the needs of the application, and the system helps them extract the data for each subject and model it with a star schema. On this basis, the data warehouse implements the multidimensional data cube and OLAP and provides valid data sources for data mining and decision-making. The final results can be shown with visualization tools.

        The data warehouse in MSMiner is modeled with the star schema. At the request of a subject, the system extracts data from source tables or views and builds multiple fact tables through data extraction, transformation, and loading. A star schema consists of one fact table and several dimension tables related to it; the fact table contains multiple dimensions and measures. A dimension is a particular perspective for viewing the data, such as the time dimension, the distribution dimension, or the product dimension. A measure carries the actual meaning of the data: it is the quantity that the data describes. Each dimension table describes one dimension and its values, and each dimension consists of several levels. For example, a time dimension may be divided into three levels (year, season, and month), each corresponding to a different level of query detail. One or several star schema structures form a subject, which is the basic unit of the data warehouse.
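To make the star schema concrete, here is a minimal sketch using Python's built-in sqlite3 module: one fact table, a time dimension with the three levels mentioned above, and a product dimension. The table names, column names, and sample rows are invented for illustration and do not come from MSMiner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    time_id INTEGER PRIMARY KEY,
    year    INTEGER NOT NULL,
    season  INTEGER NOT NULL,   -- 1..4
    month   INTEGER NOT NULL    -- 1..12
);
CREATE TABLE product_dim (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE sales_fact (
    time_id    INTEGER REFERENCES time_dim(time_id),
    product_id INTEGER REFERENCES product_dim(product_id),
    amount     REAL NOT NULL    -- the measure
);
""")
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?)",
                 [(1, 2002, 1, 1), (2, 2002, 1, 2), (3, 2002, 2, 4)])
conn.executemany("INSERT INTO product_dim VALUES (?, ?)", [(1, "A"), (2, "B")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (2, 1, 20.0), (3, 1, 5.0), (3, 2, 7.0)])

# Query the measure at the "season" level of the time dimension.
by_season = conn.execute("""
    SELECT t.year, t.season, SUM(f.amount)
    FROM sales_fact f JOIN time_dim t ON f.time_id = t.time_id
    GROUP BY t.year, t.season
    ORDER BY t.year, t.season
""").fetchall()
print(by_season)
```

Grouping by a coarser or finer column of the dimension table is what moves a query between the levels of that dimension.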

        OLAP can be implemented in two ways: with a dedicated multidimensional database system (MOLAP), or by simulating multidimensional data in a relational database (ROLAP). MSMiner supports ROLAP, based on the star schema. A star structure with its multiple dimension tables simulates the multidimensional data cube; the dimensions and measures of the data cube come from the dimensions and measures of the star schema. When an OLAP operation is executed on the data cube, the multidimensional analysis component translates the request into SQL statements, queries the fact tables, and presents the results in multidimensional form.

        At present the system supports the standard OLAP operations: slice, dice, roll up, drill down, and pivot. The results can be displayed as cross-tabulation tables, bar charts, pie charts, or other forms of graphical output.

        The results of OLAP operations and the data in fact tables can serve as data sources for the data mining subsystem, and they can help with the preparatory work for data mining.
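The translation of a cube request into SQL over a star schema might look roughly like the following Python sketch, which covers roll up, drill down, and slice. The `build_query` function, the `LEVELS` mapping, and the tiny schema are hypothetical; the source does not describe how MSMiner generates its SQL.

```python
import sqlite3
from typing import Dict, List, Optional, Tuple

# Hypothetical cube description: dimension level -> qualified column in the star schema.
LEVELS = {"year": "t.year", "month": "t.month", "region": "r.name"}


def build_query(group_by: List[str],
                slice_on: Optional[Dict[str, object]] = None) -> Tuple[str, list]:
    """Translate a cube request into one SQL statement over the fact table."""
    cols = ", ".join(LEVELS[level] for level in group_by)
    sql = (f"SELECT {cols}, SUM(f.amount) FROM fact f "
           "JOIN time_dim t ON f.time_id = t.id "
           "JOIN region_dim r ON f.region_id = r.id")
    params: list = []
    if slice_on:
        # A slice fixes one or more dimension levels to a single value.
        sql += " WHERE " + " AND ".join(f"{LEVELS[k]} = ?" for k in slice_on)
        params = list(slice_on.values())
    sql += f" GROUP BY {cols} ORDER BY {cols}"
    return sql, params


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE region_dim (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact (time_id INTEGER, region_id INTEGER, amount REAL);
INSERT INTO time_dim VALUES (1, 2002, 1), (2, 2002, 2), (3, 2003, 1);
INSERT INTO region_dim VALUES (1, 'east'), (2, 'north');
INSERT INTO fact VALUES (1, 1, 10), (2, 1, 20), (3, 1, 5), (1, 2, 7);
""")

drill_down = conn.execute(*build_query(["year", "month"])).fetchall()
roll_up = conn.execute(*build_query(["year"])).fetchall()
east_only = conn.execute(*build_query(["year"], {"region": "east"})).fetchall()
print(drill_down, roll_up, east_only)
```

Only column names taken from the fixed `LEVELS` mapping are interpolated into the SQL text; slice values are passed as bound parameters.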


6. Data Mining

       The data mining subsystem organizes and executes data mining tasks in object-oriented form, with data sources obtained from the data warehouse. It integrates many kinds of data mining algorithms and is flexibly extensible.

The data mining subsystem mainly consists of an algorithm manager, a task manager, and a task processing engine. The algorithm manager registers new data mining algorithms and unregisters existing ones; registered algorithms can be called by data mining tasks. The task manager provides a task wizard that helps users select a data source and a mining algorithm and build the corresponding task model. The task processing engine schedules and executes tasks, and uses multithreading to achieve high efficiency. After the results have been interpreted and evaluated, they are stored in the data warehouse and can be visualized or exported to files.

        The basic architecture of the data mining subsystem is shown in Figure 3.

 

6.1 Expandable Algorithm Base

         In this subsystem the core data mining algorithms are implemented as dynamic link libraries (DLLs). We define a set of standard interfaces for the DLLs that embed data mining algorithms. Any algorithm encapsulated according to these interfaces can be integrated into the system conveniently, so users can easily develop their own data mining algorithm modules. The subsystem also includes some built-in algorithms, which users can apply directly to data mining tasks.

        At present the system provides many algorithms, such as decision trees, back-propagation, SVM, fuzzy clustering, SOM, multiple regression analysis, CBR, and association rule discovery. They are applied to classification, prediction, clustering, data reduction, and so on. Users can develop new algorithms and add them to the algorithm base, after which data mining tasks can call them flexibly.
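The combination of a standard interface and an algorithm manager can be illustrated with a small Python sketch. Python classes stand in for the DLLs here, and the `MiningAlgorithm` interface, the `AlgorithmManager` class, and the 1-nearest-neighbour stand-in algorithm are all assumptions for this example, not the interfaces MSMiner defines.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence, Type


class MiningAlgorithm(ABC):
    """Hypothetical standard interface that every algorithm module implements."""

    @abstractmethod
    def fit(self, rows: Sequence[Sequence[float]], labels: Sequence[str]) -> None:
        ...

    @abstractmethod
    def predict(self, row: Sequence[float]) -> str:
        ...


class NearestNeighbour(MiningAlgorithm):
    """1-nearest-neighbour classifier, used only as a small stand-in algorithm."""

    def fit(self, rows, labels):
        self._rows, self._labels = list(rows), list(labels)

    def predict(self, row):
        def sq_dist(other):
            return sum((a - b) ** 2 for a, b in zip(row, other))
        # Index of the training row closest to the query row.
        best = min(range(len(self._rows)), key=lambda i: sq_dist(self._rows[i]))
        return self._labels[best]


class AlgorithmManager:
    """Registers and unregisters algorithms so that tasks can look them up by name."""

    def __init__(self) -> None:
        self._base: Dict[str, Type[MiningAlgorithm]] = {}

    def register(self, name: str, cls: Type[MiningAlgorithm]) -> None:
        if not issubclass(cls, MiningAlgorithm):
            raise TypeError(f"{cls.__name__} does not implement MiningAlgorithm")
        self._base[name] = cls

    def unregister(self, name: str) -> None:
        del self._base[name]

    def create(self, name: str) -> MiningAlgorithm:
        return self._base[name]()

    def names(self) -> List[str]:
        return sorted(self._base)


manager = AlgorithmManager()
manager.register("1nn", NearestNeighbour)
model = manager.create("1nn")
model.fit([[0.0, 0.0], [10.0, 10.0]], ["low", "high"])
prediction = model.predict([1.0, 2.0])
print(manager.names(), prediction)
```

Checking the interface at registration time is one way to reject a module that does not follow the standard before any task tries to call it.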


6.2 Data Mining Task Wizard

       A data mining task consists of several mining steps. Each step corresponds to a data mining algorithm module (a DLL) and requires some parameters. To make mining tasks easier to build, the system provides a mining task wizard, through which users select the mining algorithms step by step, set their parameters, and select the data sources. In this way a suitable task model can be constructed for every data mining algorithm the subsystem provides.

6.3 Flexible Task Scheduling

       The subsystem provides a flexible task scheduling module. By configuring a few parameters, users can have a task execute at a designated time, or even have it repeat at a specific interval.

       For example, suppose there is a data mining task that predicts the turnover of the company. New data is added to the system every month, so the data source of this task changes frequently. To always have the latest result, users can set the schedule parameters of the task (for instance, execute on the last day of the month and repeat monthly), and the task processing engine will run it automatically.
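The schedule in this example (last day of the month, repeated monthly) can be computed with Python's standard library as follows. The function names are made up for this sketch, which shows only the date arithmetic and not how the MSMiner engine stores or triggers schedules.

```python
import calendar
from datetime import date
from typing import List


def last_day_of_month(year: int, month: int) -> date:
    # calendar.monthrange returns (weekday of the first day, number of days in the month).
    return date(year, month, calendar.monthrange(year, month)[1])


def upcoming_runs(start: date, count: int) -> List[date]:
    """Return the next `count` month-end run dates, starting with the month of `start`."""
    runs = []
    year, month = start.year, start.month
    while len(runs) < count:
        runs.append(last_day_of_month(year, month))
        # Advance to the next month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return runs


runs = upcoming_runs(date(2003, 12, 15), 3)
print([d.isoformat() for d in runs])
```

Deriving the day from the calendar rather than fixing it at 30 or 31 keeps the schedule correct for February and for leap years.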

6.4 Efficient Task Processing

A mining system contains many tasks, and a mining task may contain many steps; both tasks and steps can execute concurrently. We use multithreading to achieve this. The task processing engine checks all tasks in the system, and when a task is due to be executed it creates a new thread to process that task. The steps within a task are also executed concurrently, subject to the ordering constraints between them (the steps commonly form a DAG).
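Running the steps of one task concurrently while respecting a DAG of dependencies can be sketched in Python as below. The sketch uses a thread pool and the standard `graphlib` module (Python 3.9 or later) instead of creating one thread per task as the text describes, and the step names and `run_task` function are hypothetical.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_task(steps: Dict[str, Callable[[], None]],
             depends_on: Dict[str, Set[str]],
             workers: int = 4) -> List[str]:
    """Run steps in threads; a step starts only after all of its dependencies finish."""
    sorter = TopologicalSorter({name: depends_on.get(name, set()) for name in steps})
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are not a DAG
    finished: List[str] = []
    running = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while sorter.is_active():
            # Submit every step whose dependencies are all done.
            for name in sorter.get_ready():
                running[pool.submit(steps[name])] = name
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any exception from the step
                name = running.pop(future)
                finished.append(name)
                sorter.done(name)  # may unlock dependent steps
    return finished


# Hypothetical task: two mining steps share one loading step and feed one report.
steps = {name: (lambda: time.sleep(0.01))
         for name in ("load", "cluster", "classify", "report")}
depends_on = {"cluster": {"load"}, "classify": {"load"},
              "report": {"cluster", "classify"}}
order = run_task(steps, depends_on)
print(order)
```

Here `cluster` and `classify` may finish in either order, since they are independent, but `load` always completes first and `report` last.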


7. Application Example

       We have applied MSMiner in many areas, such as tax deviation, analysis of fishery information, and analysis of VIP (very important person) customers for a telecom corporation. For example, the fishing ground prediction system is a good application of CBR. This system has been applied to the prediction of fishing centers in the East China Sea. In 2002 it was awarded the second grade of the National Science and Technology Progress Award.


 

Copyright © 2002-2003 Intelligent Science Research Group, at Key Lab of IIP, ICT, CAS, China.