Course Notes: Data Mining

I. Introduction

……

1、Major Issues in Data Mining

User Interaction: Presentation and visualization of data mining results

Efficiency and Scalability

Diversity of data types: complex types of data; Mining dynamic, networked, and global data repositories 

Data mining and society: Privacy-preserving data mining; Social impacts of data mining; Invisible data mining

II. Getting to Know Your Data

1、Types of Data Sets

Record: Relational records; Data matrix; Text documents; Transaction data

2、Important Characteristics of Structured Data

Dimensionality: Curse of dimensionality;

Sparsity: Only presence counts;

Resolution: Patterns depend on the scale;

Distribution: Centrality and dispersion 

3、Attributes (also called dimensions, features, or variables)

Types: Nominal; Ordinal; Binary (symmetric, asymmetric); Numeric (interval-scaled, ratio-scaled)

Discrete Attribute

Continuous Attribute

4、Basic Statistical Descriptions of Data

Data dispersion characteristics: median, max, min, quantiles, outliers, variance

Mean: Weighted arithmetic mean; Trimmed mean
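The two variants of the mean can be sketched as follows. This is a minimal illustration, not part of the course material: the function names and the sample data are hypothetical, and the trimmed mean here is assumed to cut the same fraction of values from each end of the sorted data.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


def trimmed_mean(values, proportion):
    """Mean after dropping a fraction `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical sample data, chosen only for illustration.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(weighted_mean([80, 90], [1, 3]))  # 87.5
print(trimmed_mean(salaries, 0.1))      # 55.6 (drops 30 and 110)
```

The trimmed mean is less sensitive to extreme values such as 110 than the plain mean.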

5、Measuring the Dispersion of Data

Quartiles: Q1 (25th percentile), Q3 (75th percentile)

Inter-quartile range (IQR): IQR = Q3 - Q1, the range covered by the middle 50% of the data

Five-number summary: min, Q1, median, Q3, max
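A short sketch of these measures using Python's standard library. Several quartile conventions exist and they can give slightly different values; the "inclusive" method is chosen here only as an assumption. The 1.5 × IQR outlier fence is a common rule of thumb and is not stated in these notes; the function names and data are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3 (rule of thumb)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


data = [1, 2, 3, 4, 5, 6, 7, 8, 40]  # hypothetical data with one extreme value

print(five_number_summary(data))  # (1, 3.0, 5.0, 7.0, 40)
print(iqr_outliers(data))         # [40]
```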

6、Graphic Displays of Basic Statistical Descriptions

7、Five Types of Data Analysis Plots

Boxplot Analysis

Histogram Analysis

Quantile Plot

Quantile-Quantile Plot(Q-Q Plot)

Scatter Plot
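As an illustration of what the quantile plot and the Q-Q plot display, the sketch below computes only the plotted points and leaves the drawing to any plotting library. It assumes the common convention that the i-th smallest value is paired with f_i = (i - 0.5) / n, and it handles only the equal-sample-size case for the Q-Q plot; the function names are hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n, for i = 1..n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


def qq_plot_points(sample_a, sample_b):
    """Pair corresponding quantiles of two samples of equal size."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch only handles samples of equal size")
    return list(zip(sorted(sample_a), sorted(sample_b)))


print(quantile_plot_points([40, 10, 30, 20]))
# [(0.125, 10), (0.375, 20), (0.625, 30), (0.875, 40)]
print(qq_plot_points([3, 1, 2], [20, 30, 10]))
# [(1, 10), (2, 20), (3, 30)]
```

In a Q-Q plot, points lying near the line y = x suggest that the two samples have similar distributions.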

8、Categorization of Visualization Methods

Pixel-oriented:

① The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows

② The color of each pixel reflects the corresponding value

③ For a dataset of m dimensions, create m windows on the screen, one for each dimension
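The three points above can be sketched in code. This is only an illustration under stated assumptions: each window is a grid filled row by row, records keep their input order, and "color" is reduced to a gray level from 0 to 255 obtained by min-max scaling within each dimension. The function name and the data are hypothetical.

```python
def pixel_windows(records, width):
    """Build one window per dimension; each cell holds a gray level in 0..255.

    The i-th record occupies the same position in every window.
    """
    m = len(records[0])  # number of dimensions
    windows = []
    for d in range(m):
        column = [record[d] for record in records]
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1  # avoid division by zero for a constant column
        levels = [round(255 * (v - lo) / span) for v in column]
        # Lay the pixels out row by row, `width` pixels per row.
        rows = [levels[i:i + width] for i in range(0, len(levels), width)]
        windows.append(rows)
    return windows


records = [(1, 10), (2, 30), (3, 20), (4, 40)]  # 4 records, m = 2 dimensions
for window in pixel_windows(records, width=2):
    print(window)
# [[0, 85], [170, 255]]
# [[0, 170], [85, 255]]
```

Comparing the same position across the windows shows how a single record's values vary from one dimension to the next.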

Parallel Coordinates: used to plot data with k attributes (k dimensions).

Geometric projection

Icon-based

Chernoff Faces

Stick Figures: a 5-piece stick figure

Hierarchical:

Dimensional Stacking

Worlds-within-Worlds

Tree-Map

Infocube

9、Similarity and Dissimilarity

① Data matrix

② Dissimilarity matrix
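To make the relation between the two structures concrete: a data matrix holds n objects with p attributes each (n × p), while a dissimilarity matrix holds one distance per pair of objects (n × n, symmetric, zeros on the diagonal). The sketch below derives the second from the first. Euclidean distance is used only as an assumed measure, since the notes do not name one here, and the function name is hypothetical.

```python
import math


def dissimilarity_matrix(data):
    """Turn an n x p data matrix into an n x n matrix of pairwise distances."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):  # compute the lower triangle, then mirror it
            distance = math.dist(data[i], data[j])  # Euclidean distance (assumption)
            d[i][j] = d[j][i] = distance
    return d


data_matrix = [(0, 0), (3, 4), (6, 8)]  # 3 objects, 2 numeric attributes
for row in dissimilarity_matrix(data_matrix):
    print(row)
```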

Proximity Measure of Nominal Attributes

a. Simple matching

b. Use a large number of binary attributes: create a new binary attribute for each of the nominal states
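Both approaches can be sketched briefly. For simple matching, the usual formula is d(i, j) = (p - m) / p, where p is the total number of attributes and m is the number of attributes on which the two objects match. The second function shows the binary encoding of a single nominal value. The function names and example values are hypothetical.

```python
def simple_matching_dissimilarity(a, b):
    """d(i, j) = (p - m) / p for two objects described by nominal attributes."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    p = len(a)                                  # total number of attributes
    m = sum(1 for x, y in zip(a, b) if x == y)  # number of matching attributes
    return (p - m) / p


def to_binary_attributes(value, states):
    """Encode one nominal value as one binary attribute per possible state."""
    return [1 if value == state else 0 for state in states]


obj1 = ("red", "round", "small")
obj2 = ("red", "square", "small")
print(simple_matching_dissimilarity(obj1, obj2))  # one mismatch out of three

print(to_binary_attributes("green", ["red", "green", "blue"]))  # [0, 1, 0]
```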

Standardizing Numeric Data: z-score
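A minimal sketch of z-score standardization, z = (x - mean) / standard deviation. The population standard deviation is used here as an assumption; some treatments use the sample standard deviation or the mean absolute deviation instead. The function name and data are hypothetical.

```python
import statistics


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    if std == 0:
        raise ValueError("all values are identical; z-score is undefined")
    return [(x - mean) / std for x in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
print(z_scores(data))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

A negative z-score means the value lies below the mean, a positive one above it.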