Avoiding dimension reduction means preserving the original dimensionality of the data in a machine learning or statistical analysis. Dimension reduction, by contrast, is a technique for reducing the number of variables or features in a dataset while retaining as much information as possible. It is used to simplify data visualization, reduce computation time, and prevent overfitting in machine learning models.
There are two main types of dimension reduction techniques:
a. Feature Selection - selecting a subset of the original features from the dataset based on their importance or relevance to the problem at hand.
b. Feature Extraction - transforming the original features into a smaller set of new features that capture the most important information in the original features. A short sketch contrasting the two approaches is shown below.
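For concreteness, here is a minimal sketch contrasting the two approaches on R's built-in iris data. The variance-based filter and the use of prcomp are illustrative choices for this sketch, not the only options.

# Feature selection: keep a subset of the original columns,
# e.g. the two numeric features with the largest variance
X <- iris[, 1:4]
vars <- sapply(X, var)
selected <- X[, names(sort(vars, decreasing = TRUE))[1:2]]
head(selected)    # still original, interpretable features

# Feature extraction: build new features as combinations of all columns,
# e.g. the first two principal components
extracted <- prcomp(X, scale. = TRUE)$x[, 1:2]
head(extracted)   # new, derived features (PC1, PC2)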
The main reasons for using dimension reduction are to reduce the computational complexity of models, improve model performance, and make data visualization easier. There are several common dimension reduction techniques:
A. Principal Component Analysis (PCA): PCA is a statistical technique that reduces the dimensionality of a dataset by identifying the directions along which the data varies the most. PCA transforms the original features into a new set of uncorrelated features called principal components.
Example: Consider a dataset with features such as age, income, education level, and occupation. PCA can be used to identify the combinations of these features that contribute the most to the overall variance of the data.
Code :
# Generate sample data
set.seed(123)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- 2*x1 + 3*x2 + rnorm(100)
dat <- data.frame(x1, x2, x3)

# Perform PCA on the standardized variables
pca <- prcomp(dat, scale. = TRUE)
summary(pca)
Output :
Importance of components:
                          PC1    PC2     PC3
Standard deviation     1.3860 1.0207 0.19316
Proportion of Variance 0.6403 0.3473 0.01244
Cumulative Proportion  0.6403 0.9876 1.00000
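The summary shows that the first two components already explain about 99% of the variance, so the three original columns can be replaced by two derived ones. A minimal sketch of that projection, reusing the pca object fitted above:

# Keep only the first two principal components as the reduced dataset
dat_reduced <- pca$x[, 1:2]
head(dat_reduced)
dim(dat_reduced)   # 100 rows and 2 columns instead of 3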
B. Singular Value Decomposition (SVD): SVD is a matrix factorization technique used to reduce the dimensionality of a dataset by identifying its most important directions of variation. SVD decomposes a matrix into three components: a left singular matrix, a diagonal matrix of singular values, and a right singular matrix.
Example: SVD can be used to compress an image dataset. The singular values in the diagonal matrix indicate how much each component contributes to the overall variance of the data, so keeping only the largest ones retains most of the important information.
Code :
# Generate sample data
set.seed(123)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- 2*x1 + 3*x2 + rnorm(100)
dat <- data.frame(x1, x2, x3)

# Perform SVD on the scaled data
svd_dat <- svd(scale(dat))
u <- svd_dat$u   # left singular vectors
d <- svd_dat$d   # singular values
v <- svd_dat$v   # right singular vectors
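To actually reduce dimensionality with SVD, only the largest singular values and their vectors are kept. A minimal sketch of a rank-2 approximation, reusing the u, d and v computed above:

# Project the scaled data onto the first two right singular vectors
k <- 2
scores <- scale(dat) %*% v[, 1:k]   # reduced 100 x 2 representation
# Rank-2 approximation of the scaled data from the kept components
approx2 <- u[, 1:k] %*% diag(d[1:k]) %*% t(v[, 1:k])
mean((scale(dat) - approx2)^2)      # small reconstruction error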
C. t-distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a technique used for visualization of high-dimensional data. t-SNE reduces the dimensionality of the dataset while preserving the relationships between data points.
Example: Consider a dataset with high-dimensional features such as gene expression levels. t-SNE can be used to visualize the relationships between different genes and identify clusters of genes that are related to a particular biological process.
Code :
# Load sample data: the MNIST digits. Rtsne itself does not ship an 'mnist'
# dataset, so here it is loaded through the dslabs package as one possible source.
library(Rtsne)
library(dslabs)
mnist <- read_mnist()

# Subset data to 1000 observations
set.seed(123)
idx <- sample(nrow(mnist$train$images), 1000)
dat <- mnist$train$images[idx, ]

# Perform t-SNE (check_duplicates = FALSE in case identical images were sampled)
tsne <- Rtsne(dat, dims = 2, perplexity = 30, verbose = TRUE,
              check_duplicates = FALSE)
plot(tsne$Y)
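The unlabeled scatter plot is hard to interpret, so it helps to color each point by its digit label. A short follow-up sketch, assuming the dslabs-based mnist object loaded above:

# Color the 2-D embedding by the true digit label (0-9)
labels <- mnist$train$labels[idx]
plot(tsne$Y, col = labels + 1, pch = 19, cex = 0.5,
     xlab = "t-SNE 1", ylab = "t-SNE 2")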
D. Non-negative Matrix Factorization (NMF): NMF is a matrix factorization technique used to identify the most important features in a dataset. NMF decomposes a non-negative matrix into two non-negative matrices, where the columns of the first matrix (the basis matrix) represent the most important patterns in the data.
Example: NMF can be used to identify the most important topics in a dataset of text documents. The columns of the first (basis) matrix represent the most important topics in the documents.
Code :
# Generate sample data (NMF requires non-negative entries)
library(NMF)
set.seed(123)
dat <- matrix(abs(rnorm(100)), nrow = 10, ncol = 10)

# Perform NMF with rank 5
nmf_res <- nmf(dat, 5)
W <- basis(nmf_res)   # basis matrix (10 x 5)
H <- coef(nmf_res)    # coefficient matrix (5 x 10)
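Since the factorization approximates the original matrix, a quick sanity check is to compare dat with the product of the two factors. A minimal sketch reusing the W and H extracted above:

# W %*% H should approximate the original non-negative matrix
reconstruction <- W %*% H
mean((dat - reconstruction)^2)   # mean squared reconstruction error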
E. Independent Component Analysis (ICA): ICA is a statistical technique used to identify independent components in a dataset. ICA separates a dataset into independent sources by assuming that the sources are non-Gaussian and statistically independent.
Example: ICA can be used to separate a dataset of EEG signals into independent sources of brain activity, recovering the underlying components from the mixed signals recorded by the electrodes.
Code :
# cereals_num is assumed to be the numeric cereals data frame built earlier in
# the post, with 12 numeric columns plus label and classification columns
library(ica)
library(ggplot2)

ica_fit <- icafast(cereals_num[, 1:12], 2,
                   center = TRUE, maxit = 100,
                   tol = 1e-6)

ica_cereals_num <- data.frame(
  ICA1 = ica_fit$S[, 1],   # estimated independent components
  ICA2 = ica_fit$S[, 2],
  label = cereals_num$label,
  classification = cereals_num$classification
)

ggplot(ica_cereals_num,
       aes(x = ICA1, y = ICA2,
           label = label, col = classification)) +
  geom_point() +
  ggrepel::geom_text_repel(cex = 2.5)
Output : a scatter plot of the cereals on the first two independent components, with points colored by classification and labeled via ggrepel.