Avoiding dimension reduction means preserving the original dimensionality of the data in a machine learning or statistical analysis. Dimension reduction, by contrast, is a technique for reducing the number of variables or features in a dataset while retaining as much information as possible. It is used to simplify data visualization, reduce computation time, and prevent overfitting in machine learning models.
There are two main types of dimension reduction techniques:
a. Feature Selection - selecting a subset of the original features from the dataset based on their importance or relevance to the problem at hand.
b. Feature Extraction - transforming the original features into a smaller set of new features that capture the most important information in the original features. A short sketch contrasting the two approaches is shown below.
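For concreteness, here is a minimal sketch contrasting the two approaches on R's built-in iris data. The variance-based filter and the use of prcomp are illustrative choices for this sketch, not the only options.

# Feature selection: keep a subset of the original columns,
# e.g. the two numeric features with the largest variance
X <- iris[, 1:4]
vars <- sapply(X, var)
selected <- X[, names(sort(vars, decreasing = TRUE))[1:2]]
head(selected)    # still original, interpretable features

# Feature extraction: build new features as combinations of all columns,
# e.g. the first two principal components
extracted <- prcomp(X, scale. = TRUE)$x[, 1:2]
head(extracted)   # new, derived features (PC1, PC2)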
The main reasons for using dimension reduction are to reduce the computational complexity of models, improve model performance, and make data visualization easier. There are several common dimension reduction techniques:
A. Principal Component Analysis (PCA): PCA is a statistical technique that reduces the dimensionality of a dataset by identifying the directions along which the data varies the most. PCA transforms the original features into a new set of uncorrelated features called principal components.
Example: Consider a dataset with features such as age, income, education level, and occupation. PCA can be used to identify the combinations of these features that contribute the most to the overall variance of the data.
Code :
# Generate sample data
set.seed(123)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- 2*x1 + 3*x2 + rnorm(100)
dat <- data.frame(x1, x2, x3)

# Perform PCA on the standardized variables
pca <- prcomp(dat, scale. = TRUE)
summary(pca)
Output :
Importance of components:
                          PC1    PC2     PC3
Standard deviation     1.3860 1.0207 0.19316
Proportion of Variance 0.6403 0.3473 0.01244
Cumulative Proportion  0.6403 0.9876 1.00000
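The summary shows that the first two components already explain about 99% of the variance, so the three original columns can be replaced by two derived ones. A minimal sketch of that projection, reusing the pca object fitted above:

# Keep only the first two principal components as the reduced dataset
dat_reduced <- pca$x[, 1:2]
head(dat_reduced)
dim(dat_reduced)   # 100 rows and 2 columns instead of 3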
B. Singular Value Decomposition (SVD): SVD is a matrix factorization technique used to reduce the dimensionality of a dataset by identifying its most important directions of variation. SVD decomposes a matrix into three components: a left singular matrix, a diagonal matrix of singular values, and a right singular matrix.
Example: SVD can be used to compress an image dataset. The singular values in the diagonal matrix indicate how much each component contributes to the overall variance of the data, so keeping only the largest ones retains most of the important information.
Code :
# Generate sample data
set.seed(123)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- 2*x1 + 3*x2 + rnorm(100)
dat <- data.frame(x1, x2, x3)

# Perform SVD on the scaled data
svd_dat <- svd(scale(dat))
u <- svd_dat$u   # left singular vectors
d <- svd_dat$d   # singular values
v <- svd_dat$v   # right singular vectors
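To actually reduce dimensionality with SVD, only the largest singular values and their vectors are kept. A minimal sketch of a rank-2 approximation, reusing the u, d and v computed above:

# Project the scaled data onto the first two right singular vectors
k <- 2
scores <- scale(dat) %*% v[, 1:k]   # reduced 100 x 2 representation
# Rank-2 approximation of the scaled data from the kept components
approx2 <- u[, 1:k] %*% diag(d[1:k]) %*% t(v[, 1:k])
mean((scale(dat) - approx2)^2)      # small reconstruction error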
C. t-distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a technique used for visualization of high-dimensional data. t-SNE reduces the dimensionality of the dataset while preserving the relationships between data points.
Example: Consider a dataset with high-dimensional features such as gene expression levels. t-SNE can be used to visualize the relationships between different genes and identify clusters of genes that are related to a particular biological process.
Code :
# Load sample data: the MNIST digits. Rtsne itself does not ship an 'mnist'
# dataset, so here it is loaded through the dslabs package as one possible source.
library(Rtsne)
library(dslabs)
mnist <- read_mnist()

# Subset data to 1000 observations
set.seed(123)
idx <- sample(nrow(mnist$train$images), 1000)
dat <- mnist$train$images[idx, ]

# Perform t-SNE (check_duplicates = FALSE in case identical images were sampled)
tsne <- Rtsne(dat, dims = 2, perplexity = 30, verbose = TRUE,
              check_duplicates = FALSE)
plot(tsne$Y)
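The unlabeled scatter plot is hard to interpret, so it helps to color each point by its digit label. A short follow-up sketch, assuming the dslabs-based mnist object loaded above:

# Color the 2-D embedding by the true digit label (0-9)
labels <- mnist$train$labels[idx]
plot(tsne$Y, col = labels + 1, pch = 19, cex = 0.5,
     xlab = "t-SNE 1", ylab = "t-SNE 2")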
D. Non-negative Matrix Factorization (NMF): NMF is a matrix factorization technique used to identify the most important features in a dataset. NMF decomposes a non-negative matrix into two non-negative matrices, where the columns of the first matrix (the basis matrix) represent the most important patterns in the data.
Example: NMF can be used to identify the most important topics in a dataset of text documents. The columns of the first (basis) matrix represent the most important topics in the documents.
Code :
# Generate sample data (NMF requires non-negative entries)
library(NMF)
set.seed(123)
dat <- matrix(abs(rnorm(100)), nrow = 10, ncol = 10)

# Perform NMF with rank 5
nmf_res <- nmf(dat, 5)
W <- basis(nmf_res)   # basis matrix (10 x 5)
H <- coef(nmf_res)    # coefficient matrix (5 x 10)
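Since the factorization approximates the original matrix, a quick sanity check is to compare dat with the product of the two factors. A minimal sketch reusing the W and H extracted above:

# W %*% H should approximate the original non-negative matrix
reconstruction <- W %*% H
mean((dat - reconstruction)^2)   # mean squared reconstruction error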
E. Independent Component Analysis (ICA): ICA is a statistical technique used to identify independent components in a dataset. ICA separates a dataset into independent sources by assuming that the sources are non-Gaussian and statistically independent.
Example: ICA can be used to separate a dataset of EEG signals into independent sources of brain activity, recovering the underlying components from the mixed signals recorded by the electrodes.
Code :
# cereals_num is assumed to be the numeric cereals data frame built earlier in
# the post, with 12 numeric columns plus label and classification columns
library(ica)
library(ggplot2)

ica_fit <- icafast(cereals_num[, 1:12], 2,
                   center = TRUE, maxit = 100,
                   tol = 1e-6)

ica_cereals_num <- data.frame(
  ICA1 = ica_fit$S[, 1],   # estimated independent components
  ICA2 = ica_fit$S[, 2],
  label = cereals_num$label,
  classification = cereals_num$classification
)

ggplot(ica_cereals_num,
       aes(x = ICA1, y = ICA2,
           label = label, col = classification)) +
  geom_point() +
  ggrepel::geom_text_repel(cex = 2.5)
Output : a scatter plot of the cereals on the first two independent components, with points colored by classification and labeled via ggrepel.