Course Notes: Data Mining
I. Introduction
……
1、Major Issues in Data Mining
User Interaction: Presentation and visualization of data mining results
Efficiency and Scalability
Diversity of data types: Handling complex types of data; Mining dynamic, networked, and global data repositories
Data mining and society: Privacy-preserving data mining; Social impacts of data mining; Invisible data mining
II. Getting to Know Your Data
1、Types of Data Sets
Record: Relational records; Data matrix; Text documents; Transaction data
2、 Important Characteristics of Structured Data
Dimensionality: Curse of dimensionality;
Sparsity: Only presence counts;
Resolution: Patterns depend on the scale;
Distribution: Centrality and dispersion
3、Attributes (also called dimensions, features, or variables)
Types: Nominal; Ordinal; Binary (symmetric, asymmetric); Numeric: interval-scaled, ratio-scaled
Discrete Attribute
Continuous Attribute
4、Basic Statistical Descriptions of Data
Data dispersion characteristics: median, max, min, quantiles, outliers, variance
Mean: Weighted arithmetic mean; Trimmed mean
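A minimal sketch of the two mean variants listed above, using only the Python standard library. The function names, the `trim_fraction` parameter, and the sample values are assumptions made for illustration; the notes do not prescribe a particular implementation.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction of the values from each end."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of values dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Illustrative values only (not taken from the course material).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(weighted_mean([80, 90], [1, 3]))  # second value counts three times as much
print(trimmed_mean(salaries, 0.1))      # drops the single lowest and highest value
```

The trimmed mean is less sensitive to extreme values such as the 110 in the sample, which is the usual reason for preferring it over the plain mean.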
5、Measuring the Dispersion of Data
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range (IQR): the range covered by the middle 50% of the data, IQR = Q3 - Q1
Five-number summary: min, Q1, median, Q3, max
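The quantities above can be sketched with the standard library's `statistics.quantiles`. Quartile conventions differ between textbooks and tools; the `method="inclusive"` choice here is an assumption, and the helper names are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a sequence of numbers."""
    ordered = sorted(values)
    # n=4 yields the three cut points that split the data into quarters.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def interquartile_range(values):
    """IQR = Q3 - Q1, the spread of the middle 50% of the data."""
    _, q1, _, q3, _ = five_number_summary(values)
    return q3 - q1


data = [9, 1, 5, 3, 7, 2, 8, 4, 6]
print(five_number_summary(data))
print(interquartile_range(data))
```

With a different quartile convention (for example `method="exclusive"`), Q1 and Q3 may shift slightly on small samples, so results should be compared only under the same convention.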
6、Graphic Displays of Basic Statistical Descriptions
7、Five Types of Data Analysis Plots
Boxplot Analysis
Histogram Analysis
Quantile Plot
Quantile-Quantile Plot(Q-Q Plot)
Scatter Plot
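As one concrete case among the plots listed, a quantile plot pairs each sorted value x_i with a fraction f_i. The sketch below assumes the common textbook convention f_i = (i - 0.5) / n, which is not stated in these notes; the function name is hypothetical, and drawing the points is left to any plotting tool.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs, where roughly 100 * f_i percent of the
    data are less than or equal to x_i."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


points = quantile_plot_points([40, 10, 30, 20])
for f, x in points:
    print(f, x)
```

A Q-Q plot can be built from the same idea by computing these pairs for two samples and plotting the matching quantiles against each other.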
8、 Categorization of visualization methods
Pixel-oriented:
① The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows
② The color of each pixel reflects the corresponding value
③ For a dataset of m dimensions, create m windows on the screen, one for each dimension
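The three points above can be sketched as a function that builds one window per dimension. This is an illustration under stated assumptions: a grey level from 0 to 255 stands in for pixel color, records are laid out row by row (real tools often use other layouts), and the name `pixel_windows` and its `width` parameter are hypothetical.

```python
def pixel_windows(records, width):
    """Build one window (a 2D grid of grey levels) per dimension.

    Record number r occupies the same grid position in every window.
    """
    m = len(records[0])  # number of dimensions
    windows = []
    for dim in range(m):
        column = [record[dim] for record in records]
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1  # avoid division by zero for constant columns
        # Scale each value to 0..255; this number plays the role of the color.
        shades = [round(255 * (v - lo) / span) for v in column]
        # Row-major layout: fill each window 'width' pixels per row.
        rows = [shades[i:i + width] for i in range(0, len(shades), width)]
        windows.append(rows)
    return windows


records = [(1, 10.0), (2, 30.0), (3, 20.0), (4, 40.0)]
windows = pixel_windows(records, width=2)
for window in windows:
    print(window)
```

Because each record keeps the same position across windows, comparing the windows side by side shows how the dimensions vary together.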
Parallel Coordinates: used to plot data with k-dimensional attributes.
Geometric projection
Icon-based
Chernoff Faces:
Stick Figures: A 5-piece stick figure
Hierarchical:
Dimensional Stacking
Worlds-within-Worlds
Tree-Map
InfoCube
9、Similarity and Dissimilarity
① Data matrix
② Dissimilarity matrix
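To relate the two structures: a data matrix holds n objects by p attributes, while a dissimilarity matrix holds the pairwise distances between the n objects. The sketch below assumes Euclidean distance (`math.dist`) as the measure, which the notes do not specify, and stores only the lower triangle since the matrix is symmetric with zeros on the diagonal.

```python
import math


def dissimilarity_matrix(data_matrix, distance=math.dist):
    """Return the lower triangle of the n x n dissimilarity matrix.

    Row i contains d(i, 0), d(i, 1), ..., d(i, i).
    """
    n = len(data_matrix)
    return [
        [distance(data_matrix[i], data_matrix[j]) for j in range(i + 1)]
        for i in range(n)
    ]


data_matrix = [(0, 0), (3, 4), (6, 8)]  # n = 3 objects, p = 2 attributes
for row in dissimilarity_matrix(data_matrix):
    print(row)
```

Any other distance function taking two objects can be passed through the `distance` parameter.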
Proximity Measure of Nominal Attributes
a. Simple matching
b. Use a large number of binary attributes: create a new binary attribute for each nominal state
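Both options can be sketched briefly. For simple matching, the code assumes the usual formula d(i, j) = (p - m) / p, where p is the number of attributes and m the number of matches; the notes name the method but do not give the formula. The function names and sample values are hypothetical.

```python
def simple_matching_dissimilarity(a, b):
    """d(a, b) = (p - m) / p for two objects with nominal attributes."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    p = len(a)                                   # total attributes
    m = sum(1 for x, y in zip(a, b) if x == y)   # matching attributes
    return (p - m) / p


def to_binary_attributes(value, states):
    """Option b: one new binary attribute per nominal state."""
    return [1 if value == state else 0 for state in states]


print(simple_matching_dissimilarity(("red", "small", "round"),
                                    ("red", "large", "round")))
print(to_binary_attributes("green", ["red", "green", "blue"]))
```

After the binary encoding, the objects can be compared with any proximity measure defined for binary attributes.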
Standardizing Numeric Data: z-score
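A minimal sketch of z-score standardization, z = (x - mean) / standard deviation. Using the population standard deviation (`statistics.pstdev`) is an assumption; some treatments use the sample standard deviation or the mean absolute deviation instead.

```python
import statistics


def z_scores(values):
    """Standardize values to zero mean and unit (population) std deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; z-score is undefined")
    return [(x - mu) / sigma for x in values]


values = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data: mean 5, std deviation 2
print(z_scores(values))
```

Standardizing each attribute this way keeps attributes measured in large units from dominating a distance computation.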