|
1. Introduction
MSMiner, developed by the Key Lab of Intelligent Information
Processing, is a multi-strategy data mining system. It uses
object-oriented knowledge representation and processing technology,
integrates many data mining methods, and incorporates data warehouse
technology. The system supports On-Line Analytical Processing (OLAP)
on multidimensional data and decision-making for high-level users. It
provides algorithms for feature extraction, classification,
clustering, prediction, association rule discovery, and statistical
analysis, and it offers data mining and decision-making services to
different kinds of users.
MSMiner consists of four parts: the ETL (data extraction, data
transformation, data loading) subsystem, the metadata management
subsystem, the data warehouse manager subsystem and the data mining
subsystem. Working with the ETL subsystem, the data warehouse manager
subsystem creates a data warehouse from relational data sources; the
warehouse is managed and maintained by the metadata management
subsystem. On top of this data warehouse, MSMiner supports OLAP and
data mining tasks. The data mining subsystem integrates many
algorithms and provides a task manager and a task processing engine
for data mining. It expresses and processes data mining and
decision-making work in the form of an object-oriented task model.
2. Architecture

3. MetaData
Metadata is data about data: it describes the content, quality,
condition, and other characteristics of data. It plays an important
role not only in the design, implementation and maintenance of the
data warehouse, but also in organizing data, querying information and
interpreting results [3]. Metadata usually records the location and
description of the warehouse system components. Here we expand its
scope and use it to describe and manage the data and environment of
the whole system, which includes not only the data in the data
warehouse platform but also the task models and algorithms (or
functions) used in ETL and data mining. Metadata is at the core of
the system because it integrates the ETL, data warehouse, and data
mining tools. It controls the whole flow from ETL through the data
warehouse to data mining, so ETL and data mining tasks can be defined
and executed more conveniently and effectively.
In MSMiner, the metadata covers the following:
(1) Description of the external data source. The external data source
can be a relational database or another kind of data, such as Excel
data, plain text, XML text, etc. The metadata records the location
and environment information of the external data source, its data
structure, and a description of its contents.
(2) Description of the subject, including the name and remark of the
subject, when the subject was created and updated, etc.
(3) Description of the databases under a subject, including the name,
type and remark of each database, the login information, and other
information.
(4) Description of the tables in a database, including fact tables,
dimension tables and temporary tables. It contains information about
the tables and their fields.
(5) Description of the ETL task, containing the organization and
steps of the task, the data source, the selected transformation
functions, the parameter settings, the creation and execution history
of the task, and so on.
(6) Description of the data mining task, containing the organization
and steps of the task, the data source, the selected mining
algorithms, the parameter settings, the evaluation and output of the
results, the creation and execution history of the task, and so on.
(7) Description of the data cube, containing the dimensions and
measures of the extracted information, the construction information
of the star structure, and so on.
(8) Management of the algorithm base for data mining, containing the
registration and management of the mining algorithms.
(9) Management of the functions for ETL, containing the registration
and management of the functions.
(10) User information, containing each user's basic information,
authority, operational history, and so on.
We build the corresponding metadata classes with an object-oriented
method. The system uses a three-tier architecture, with the metadata
management subsystem in the middle tier, where it can be regarded as
a metadata management server. The upper tier accesses and manages
metadata through the middle tier.
Metadata is generated automatically whenever a component of the
system is created, and it changes during the daily maintenance of the
system. MSMiner provides a dedicated metadata manager subsystem that
can maintain the metadata directly, so that the whole system is
managed effectively.
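As a rough illustration of object-oriented metadata classes behind a middle-tier server, the sketch below models a subject and its tables. The class names (`TableMeta`, `SubjectMeta`, `MetadataServer`) and all field names are hypothetical and chosen for this example; they are not taken from MSMiner itself.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TableMeta:
    """Describes one warehouse table (hypothetical structure)."""
    name: str
    kind: str  # "fact", "dimension", or "temporary"
    fields: list[str] = field(default_factory=list)


@dataclass
class SubjectMeta:
    """Describes a subject: name, remark, timestamps, and its tables."""
    name: str
    remark: str = ""
    created: datetime = field(default_factory=datetime.now)
    updated: datetime = field(default_factory=datetime.now)
    tables: dict[str, TableMeta] = field(default_factory=dict)

    def add_table(self, table: TableMeta) -> None:
        # Record the new table and refresh the subject's update time,
        # mirroring "metadata is generated when a component is created".
        self.tables[table.name] = table
        self.updated = datetime.now()


class MetadataServer:
    """Middle-tier access point: upper tiers read and write metadata here."""

    def __init__(self) -> None:
        self._subjects: dict[str, SubjectMeta] = {}

    def create_subject(self, name: str, remark: str = "") -> SubjectMeta:
        subject = SubjectMeta(name=name, remark=remark)
        self._subjects[name] = subject
        return subject

    def get_subject(self, name: str) -> SubjectMeta:
        return self._subjects[name]


server = MetadataServer()
sales = server.create_subject("sales", remark="monthly sales analysis")
sales.add_table(TableMeta("fact_sales", "fact", ["time_id", "product_id", "amount"]))
print(server.get_subject("sales").tables["fact_sales"].kind)
```

Keeping all reads and writes behind one server object is one simple way to let ETL, warehouse, and mining tools share the same descriptions.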
4. ETL
The ETL subsystem is an important part of MSMiner. Its main purpose
is to transform the operational data in source databases into
analytical data in the data warehouse. The data in a data warehouse
is extracted and integrated from dispersed databases (for example
Oracle, SQL Server, Access, FoxPro, Excel, DB2, etc.), and operational
data in a source database differs in many ways from analytical data
in a data warehouse, so loading data from the various sources
directly into the warehouse is not a good approach. To obtain clean
data for the warehouse, the source data must be cleaned, collected
and transformed before it is integrated. This is a key and complex
step in building a data warehouse. Generally speaking, the ETL
subsystem has to accomplish the following:
(1) Because the source data from dispersed databases contains
duplicates and conflicts, the subsystem should reconcile the
conflicting data.
(2) To obtain integrated data in the data warehouse, the subsystem
should transform the original data structure from an
application-oriented one into a subject-oriented one and derive and
compute some values.

The basic architecture of the ETL subsystem is shown in Figure 2. As
the figure shows, the ETL subsystem has four modules:
(1) The user-friendly interface
Through this interface users can conveniently perform any ETL
operation, such as designing ETL tasks, registering new ETL DLL
functions, scheduling the execution of ETL tasks and viewing the
results of ETL tasks.
(2) The integrated ETL function management and ETL task management
This module covers registering new ETL DLL functions, building new
ETL tasks, scheduling and processing ETL tasks, etc.
(3) The uniform metadata management
The whole subsystem is developed in a metadata-oriented way; that is,
all information in this subsystem, including data sources, algorithms
and results, is managed through metadata.
(4) The database server
The ETL subsystem supports dispersed and varied databases (for
example Oracle, SQL Server, Access, FoxPro, Excel, DB2, etc.).
The subsystem supports an expandable ETL function base. The main ETL
algorithms are implemented as dynamic link libraries (DLLs) with
uniform interfaces. Users can design an ETL task according to their
needs by choosing the relevant ETL DLLs. At present the subsystem
provides about 30 kinds of ETL DLLs. In addition, users can develop
new ETL DLLs that conform to the uniform interfaces and add them to
the ETL function base. To improve efficiency, ETL tasks can be
scheduled at designated times and processed concurrently.
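The idea of a function base with a uniform interface can be sketched in a few lines. This is only an illustration: MSMiner implements its functions as DLLs, whereas the sketch uses plain Python callables, and the signature `(rows, params) -> rows`, the registry `ETL_FUNCTIONS`, and the two sample functions are all assumptions made for the example.

```python
from typing import Callable

Row = dict[str, object]
EtlFunction = Callable[[list[Row], dict], list[Row]]

# Hypothetical function base: name -> callable with the uniform signature.
ETL_FUNCTIONS: dict[str, EtlFunction] = {}


def register(name: str) -> Callable[[EtlFunction], EtlFunction]:
    """Add a function to the base under the given name."""
    def decorator(func: EtlFunction) -> EtlFunction:
        ETL_FUNCTIONS[name] = func
        return func
    return decorator


@register("drop_duplicates")
def drop_duplicates(rows: list[Row], params: dict) -> list[Row]:
    """Keep the first row seen for each value of the key column."""
    seen, result = set(), []
    for row in rows:
        key = row[params["key"]]
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


@register("rename_column")
def rename_column(rows: list[Row], params: dict) -> list[Row]:
    """Rename one column, e.g. to unify naming across sources."""
    old, new = params["old"], params["new"]
    return [{(new if k == old else k): v for k, v in row.items()} for row in rows]


def run_task(rows: list[Row], steps: list[tuple[str, dict]]) -> list[Row]:
    """Apply the selected functions in order; each step is (name, params)."""
    for name, params in steps:
        rows = ETL_FUNCTIONS[name](rows, params)
    return rows


source = [
    {"cust_id": 1, "nm": "Ann"},
    {"cust_id": 1, "nm": "Ann"},
    {"cust_id": 2, "nm": "Bob"},
]
task = [
    ("drop_duplicates", {"key": "cust_id"}),
    ("rename_column", {"old": "nm", "new": "name"}),
]
clean = run_task(source, task)
print(clean)
```

Because every function shares one signature, a task is just an ordered list of names and parameters, which is also easy to store as metadata.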
5. Data Warehouse and OLAP
A data warehouse is “a subject-oriented, integrated, time-variant,
nonvolatile collection of data in support of management decisions”.
The data warehouse component provides a general data warehouse
environment in which users can create and maintain their own data
warehouse according to their needs, carry out data analysis and
processing, and prepare data for data mining tasks.
The data warehouse in MSMiner consists of a number of subjects. When
a data warehouse is created, users define several subject areas
according to the needs of the application, and the system helps them
extract the data for each subject and model it with a star schema.
On this basis, the data warehouse implements the multidimensional
data cube and OLAP, and provides a valid data source for data mining
and decision-making. The final results can be displayed with
visualization tools.
The data warehouse in MSMiner is modeled with a star schema. At the
subject's request, the system extracts data from source tables or
views and builds multiple fact tables through data extraction,
transformation and loading. A star schema consists of one fact table
and several dimension tables related to it; the fact table contains
multiple dimensions and measures. A dimension is a particular
perspective from which the data is viewed, such as the time
dimension, the distribution dimension, the product dimension and so
on. A measure is the actual quantity being analyzed. Each dimension
table describes one dimension and its values, and each dimension
consists of several levels. For example, a time dimension may be
divided into three levels (year, season, and month), each
corresponding to a different level of query detail. One or several
star schemas form a subject, which is the basic unit of the data
warehouse.
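A star schema of this kind can be sketched with Python's built-in `sqlite3` module. The table and column names (`dim_time`, `dim_product`, `fact_sales`, `amount`) and the sample rows are invented for illustration; MSMiner's actual schema is not given in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (
    time_id INTEGER PRIMARY KEY,
    year    INTEGER,   -- levels of the time dimension, coarse to fine
    season  INTEGER,
    month   INTEGER
);
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    category   TEXT,
    name       TEXT
);
CREATE TABLE fact_sales (
    time_id    INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL    -- the measure
);
""")
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?, ?)",
                 [(1, 2002, 1, 1), (2, 2002, 1, 2), (3, 2002, 2, 4)])
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "fish", "hairtail"), (2, "fish", "mackerel")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (2, 1, 20.0), (2, 2, 5.0), (3, 2, 7.5)])

# Join the fact table to the product dimension and total the measure.
totals = conn.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(totals)
```

The fact table holds only keys and measures, while descriptive attributes and level hierarchies live in the dimension tables.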
OLAP can be implemented in two ways: by building a dedicated
multidimensional database system (MOLAP), or by simulating
multidimensional data in a relational database (ROLAP). MSMiner
supports ROLAP, based on the star schema. The star structure with its
dimension tables simulates the multidimensional data cube: the
dimensions and measures of the cube come from the dimensions and
measures of the star schema. When an OLAP operation is executed on
the data cube, the multidimensional analysis module translates the
request into SQL statements, queries the fact tables, and then
presents the results in multidimensional form.
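The translation from an OLAP request to SQL can be illustrated with a small query builder. This is a sketch under assumptions: the schema (`fact_sales`, `dim_time`), the level list `TIME_LEVELS`, and the function `build_query` are hypothetical and only show the general ROLAP idea, not MSMiner's own translator.

```python
import sqlite3
from typing import Optional

# Hypothetical levels of the time dimension, ordered coarse to fine.
TIME_LEVELS = ["year", "season", "month"]


def build_query(depth: int, slice_on: Optional[dict] = None) -> tuple[str, list]:
    """Translate an OLAP request into SQL.

    depth:    number of time levels to group by (1 = year only);
              lowering it is a roll up, raising it is a drill down.
    slice_on: optional {level: value} filter, i.e. a slice.
    """
    levels = TIME_LEVELS[:depth]
    cols = ", ".join(f"t.{lv}" for lv in levels)
    sql = (f"SELECT {cols}, SUM(f.amount) FROM fact_sales AS f "
           f"JOIN dim_time AS t ON t.time_id = f.time_id")
    params: list = []
    if slice_on:
        for level in slice_on:
            if level not in TIME_LEVELS:  # only known column names reach the SQL
                raise ValueError(f"unknown level: {level}")
        sql += " WHERE " + " AND ".join(f"t.{lv} = ?" for lv in slice_on)
        params = list(slice_on.values())
    sql += f" GROUP BY {cols} ORDER BY {cols}"
    return sql, params


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER,
                       season INTEGER, month INTEGER);
CREATE TABLE fact_sales (time_id INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?, ?)",
                 [(1, 2002, 1, 1), (2, 2002, 1, 2), (3, 2002, 2, 4), (4, 2003, 1, 1)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [(1, 10.0), (2, 20.0), (3, 7.5), (4, 2.5)])

# Roll up to the year level.
by_year = conn.execute(*build_query(depth=1)).fetchall()
# Drill down to seasons, sliced on year 2002.
by_season_2002 = conn.execute(*build_query(depth=2, slice_on={"year": 2002})).fetchall()
print(by_year, by_season_2002)
```

Roll up and drill down differ only in how many levels appear in the `GROUP BY` clause, which is why a relational engine can simulate the cube.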
At present the system supports the standard OLAP operations, such as
slice, dice, roll up, drill down and pivot. The results can be
displayed in many forms, such as cross-tabulation tables, bar charts,
pie charts or other kinds of graphical output.
The results of OLAP operations and the data in fact tables can serve
as data sources for the data mining subsystem, and they can help with
some of the preparation work for data mining.
6. Data Mining
The data mining subsystem organizes and executes data mining tasks in
an object-oriented form, and it obtains its data sources from the
data warehouse. It integrates many kinds of data mining algorithms
and is flexibly extensible.
The data mining subsystem mainly comprises the algorithm manager, the
task manager and the task processing engine. The algorithm manager is
in charge of registering new data mining algorithms and unregistering
existing ones; registered algorithms can be called by data mining
tasks. The task manager provides a task wizard that helps users
select the data source and mining algorithm and build the
corresponding task model. The task processing engine schedules and
executes tasks, and achieves high efficiency by using multithreading.
After the results have been interpreted and evaluated, they are
stored in the data warehouse and can be visualized or exported to
files.
The basic architecture of the data mining subsystem is shown in
Figure 3.

6.1 Expandable Algorithm Base
In this subsystem, the core data mining algorithms are implemented as
dynamic link libraries (DLLs). We define a set of standard interfaces
for the DLLs that embed data mining algorithms. Any algorithm that is
encapsulated according to those interfaces can be integrated into the
system conveniently, so users can easily develop their own data
mining algorithm modules. In addition, the subsystem ships with some
built-in algorithms that users can apply directly to data mining
tasks.
At present the system provides many algorithms, such as decision
trees, back-propagation, SVM, fuzzy clustering, SOM, multiple
regression analysis, case-based reasoning (CBR), association rule
discovery and so on. They are applied to classification, prediction,
clustering, data reduction, etc. Users may develop new algorithms and
add them to the algorithm base, after which data mining tasks can
call them flexibly.
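The register and unregister behaviour of an algorithm manager can be sketched as follows. The interface `MiningAlgorithm`, the class `AlgorithmManager`, and the toy `ColumnMeans` algorithm are hypothetical; MSMiner's real interfaces are defined for DLLs and are not specified in the text.

```python
from typing import Protocol


class MiningAlgorithm(Protocol):
    """Hypothetical standard interface every algorithm module implements."""
    name: str

    def run(self, data: list[list[float]], params: dict) -> object: ...


class AlgorithmManager:
    """Keeps the algorithm base: register, unregister, and call by name."""

    def __init__(self) -> None:
        self._algorithms: dict[str, MiningAlgorithm] = {}

    def register(self, algorithm: MiningAlgorithm) -> None:
        if algorithm.name in self._algorithms:
            raise ValueError(f"already registered: {algorithm.name}")
        self._algorithms[algorithm.name] = algorithm

    def unregister(self, name: str) -> None:
        del self._algorithms[name]

    def names(self) -> list[str]:
        return sorted(self._algorithms)

    def call(self, name: str, data: list[list[float]], params: dict) -> object:
        return self._algorithms[name].run(data, params)


class ColumnMeans:
    """Toy stand-in for a real algorithm such as clustering or regression."""
    name = "column_means"

    def run(self, data: list[list[float]], params: dict) -> object:
        return [sum(col) / len(col) for col in zip(*data)]


manager = AlgorithmManager()
manager.register(ColumnMeans())
means = manager.call("column_means", [[1.0, 2.0], [3.0, 4.0]], {})
print(means)
```

A task only needs the algorithm's registered name and its parameters, so new modules can be added without changing the task engine.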
6.2 Data Mining Task Wizard
A data mining task consists of several mining steps. Each step
corresponds to a data mining algorithm module (a DLL) and requires
some parameters. To make mining tasks easier to build, the system
provides a mining task wizard. Through the wizard, users select the
mining algorithms step by step, set the parameters of the algorithms
and select the data sources. In this way, a suitable task model can
be constructed for any data mining algorithm provided by the
subsystem.
6.3 Flexible Task Scheduling
The subsystem provides a flexible task scheduling module. By
configuring a few parameters, users can have a task execute at a
designated time, or have it repeat at a specified interval.
For example, suppose a data mining task predicts the turnover of a
company. New data is added to the system every month, so the data
source of this task changes regularly. To always have an up-to-date
result, users can set the schedule parameters of the task (for
instance, execute on the last day of the month and repeat monthly),
and the task processing engine will run it automatically.
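The "last day of the month, repeat monthly" schedule from the example can be computed with the standard library. The function names and the default run hour below are assumptions made for this sketch.

```python
import calendar
from datetime import datetime


def last_day_run(year: int, month: int, hour: int = 23) -> datetime:
    """Run time on the last day of the given month."""
    last_day = calendar.monthrange(year, month)[1]  # number of days in month
    return datetime(year, month, last_day, hour)


def next_run(now: datetime, hour: int = 23) -> datetime:
    """Next 'last day of the month' run time strictly after `now`."""
    candidate = last_day_run(now.year, now.month, hour)
    if candidate > now:
        return candidate
    # This month's run has passed: move to the following month.
    if now.month == 12:
        return last_day_run(now.year + 1, 1, hour)
    return last_day_run(now.year, now.month + 1, hour)


print(next_run(datetime(2002, 2, 10)))
print(next_run(datetime(2002, 12, 31, 23, 30)))
```

A scheduler loop would call `next_run` after each execution to obtain the following run time, which handles months of different lengths and leap years.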
6.4 Efficient Task Processing
A mining system holds many tasks, and a mining task may include many
steps; both tasks and steps can execute concurrently. We use
multithreading to achieve this. The task processing engine checks all
tasks in the system, and when a task is due to be executed, the
system creates a new thread to process it. The steps within one task
are also executed concurrently where possible. The order of the steps
in a task may be constrained by dependencies, which commonly form a
directed acyclic graph (DAG).
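Concurrent execution of steps under DAG ordering constraints can be sketched as below. Note the substitutions: the text describes creating a new thread per task, while this example uses a thread pool from `concurrent.futures`, and the step names and the `run_dag` function are hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_dag(steps: dict, deps: dict, max_workers: int = 4) -> list:
    """Run steps concurrently while respecting dependencies.

    steps: name -> zero-argument callable
    deps:  name -> set of step names that must finish first
    Returns the step names in the order they were observed to finish.
    """
    remaining = {name: set(deps.get(name, ())) for name in steps}
    running = {}   # future -> step name
    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or running:
            # Submit every step whose prerequisites have all finished.
            ready = [name for name, waiting in remaining.items() if not waiting]
            for name in ready:
                del remaining[name]
                running[pool.submit(steps[name])] = name
            if not running:
                raise ValueError("cycle or unknown dependency in task steps")
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                future.result()  # re-raise any exception from the step
                finished.append(name)
                for waiting in remaining.values():
                    waiting.discard(name)
    return finished


log = []
names = ["load", "classify", "cluster", "report"]
steps = {name: (lambda n=name: log.append(n)) for name in names}
deps = {"classify": {"load"}, "cluster": {"load"}, "report": {"classify", "cluster"}}
order = run_dag(steps, deps)
print(order)
```

Here `classify` and `cluster` may run in parallel once `load` finishes, while `report` waits for both, so the relative order of the two middle steps can vary between runs.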
7. Application Example
We have applied MSMiner in many application areas, such as tax
deviation, analysis of fishery information, analysis of VIP (very
important person) customers for a telecom corporation, and so on. For
example, the fishing ground prediction system is a good application
example of CBR. This system has been applied to fishing center
prediction in the East China Sea. In 2002, it was awarded the second
prize of the National Science and Technology Progress Award.
|