Introduction
In many high-dimensional bioinformatics settings (\(p \gg n\)), penalized regression methods (Ridge, Lasso, Elastic Net, SCAD, MCP) help control overfitting and allow variable selection. This R Markdown:
1. Demonstrates a simulation model with correlated predictors.
2. Defines five penalized methods with cross-validation to select \(\lambda\).
3. Compares performance via TPR, FDR, MSE of coefficients, and predictive error.
Please ensure you have these packages installed:
# install.packages("glmnet")
# install.packages("ncvreg")
# install.packages("MASS")
1. Simulation Model
We generate a design matrix \(\mathbf{X}\in \mathbb{R}^{n\times p}\) with AR(1) correlation \(\rho\). We construct a sparse \(\beta_{\text{true}}\) (first n_inform entries non-zero), then compute \(y = X \, \beta_{\text{true}} + \epsilon\).
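Concretely, the AR(1) structure used below sets
\[
\Sigma_{ij} = \rho^{|i-j|}, \qquad 1 \le i, j \le p,
\]
so the correlation between two predictors decays geometrically with the distance between their indices.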
##########################################
# 1) Load required packages
##########################################
library(glmnet)
library(ncvreg)
library(MASS)
##########################################
# 2) Simulation function
##########################################
simulate_data <- function(n = 100, p = 200, n_inform = 10, rho = 0.3, sigma = 1, seed = NULL) {
  # Generates:
  #   X: n-by-p design matrix with AR(1) correlation rho
  #   y: response vector
  #   beta_true: sparse true coefficient vector
  if (!is.null(seed)) set.seed(seed)
  # 2.1) AR(1) covariance matrix: Sigma[i, j] = rho^|i - j|
  Sigma <- outer(1:p, 1:p, function(i, j) rho^abs(i - j))
  # 2.2) Generate X ~ N(0, Sigma)
  X <- MASS::mvrnorm(n = n, mu = rep(0, p), Sigma = Sigma)
  # 2.3) Construct sparse beta_true: first n_inform entries drawn
  #      from +/- Uniform(1, 2), the rest exactly zero
  beta_true <- numeric(p)
  beta_true[1:n_inform] <- runif(n_inform, min = 1, max = 2) * sample(c(1, -1), n_inform, replace = TRUE)
  # 2.4) Generate y = X %*% beta_true + noise
  noise <- rnorm(n, mean = 0, sd = sigma)
  y <- drop(X %*% beta_true + noise)
  list(X = X, y = y, beta_true = beta_true)
}
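2. Performance Metrics
The main loop in Section 4 calls evaluate_performance. A minimal sketch of that helper, consistent with how it is called there and with the metric definitions in the Interpretation notes of Section 5 (assuming a variable counts as "selected" whenever its estimated coefficient is exactly non-zero):
##########################################
# 3) Performance evaluation
##########################################
evaluate_performance <- function(beta_hat, beta_true, X, y) {
  # MSE of the estimated coefficients against the truth
  mse_beta <- mean((beta_hat - beta_true)^2)
  # Selected (non-zero estimated) vs. truly informative variables
  selected <- which(beta_hat != 0)
  truth <- which(beta_true != 0)
  # TPR: fraction of truly non-zero coefficients discovered
  tpr <- length(intersect(selected, truth)) / length(truth)
  # FDR: fraction of selected coefficients that are false positives
  # (defined as 0 when nothing is selected)
  fdr <- if (length(selected) > 0) {
    length(setdiff(selected, truth)) / length(selected)
  } else {
    0
  }
  # Training-set predictive MSE on y
  mse_pred <- mean((y - drop(X %*% beta_hat))^2)
  list(mse_beta = mse_beta, tpr = tpr, fdr = fdr, mse_pred = mse_pred)
}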
3. Fitting Methods
We fit Ridge, Lasso, and Elastic Net via glmnet, and SCAD and MCP via ncvreg. All methods use cross-validation to select the best \(\lambda\).
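For reference, glmnet minimizes the elastic-net objective
\[
\min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^\top \beta\bigr)^2 + \lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\Bigr],
\]
so \(\alpha = 0\) is Ridge, \(\alpha = 1\) is Lasso, and \(\alpha = \alpha_{\text{EN}}\) is the Elastic Net. SCAD and MCP replace this convex penalty with folded-concave penalties that apply less shrinkage to large coefficients, which tends to reduce false discoveries.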
##########################################
# 4) Fitting each method
##########################################
fit_all_methods <- function(X, y, alpha_EN = 0.5, folds = 5) {
  # Ridge (alpha = 0)
  cv_ridge <- cv.glmnet(X, y, alpha = 0, nfolds = folds)
  ridge_coef <- as.numeric(coef(cv_ridge, s = "lambda.min"))[-1]  # drop intercept
  # Lasso (alpha = 1)
  cv_lasso <- cv.glmnet(X, y, alpha = 1, nfolds = folds)
  lasso_coef <- as.numeric(coef(cv_lasso, s = "lambda.min"))[-1]
  # Elastic Net (alpha = alpha_EN)
  cv_en <- cv.glmnet(X, y, alpha = alpha_EN, nfolds = folds)
  en_coef <- as.numeric(coef(cv_en, s = "lambda.min"))[-1]
  # SCAD
  cv_scad <- cv.ncvreg(X, y, family = "gaussian", penalty = "SCAD", nfolds = folds)
  scad_coef <- coef(cv_scad, lambda = cv_scad$lambda.min)[-1]
  # MCP
  cv_mcp <- cv.ncvreg(X, y, family = "gaussian", penalty = "MCP", nfolds = folds)
  mcp_coef <- coef(cv_mcp, lambda = cv_mcp$lambda.min)[-1]
  list(
    ridge = ridge_coef,
    lasso = lasso_coef,
    en = en_coef,
    scad = scad_coef,
    mcp = mcp_coef
  )
}
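As a quick check of the wrapper on a single simulated dataset (illustrative; not part of the simulation study), the returned coefficient vectors can be used to count how many variables each method selects:
# One-off fit: count non-zero coefficients per method
sim <- simulate_data(n = 100, p = 200, n_inform = 10, rho = 0.3, seed = 42)
fits <- fit_all_methods(sim$X, sim$y)
sapply(fits, function(b) sum(b != 0))  # Ridge keeps everything; the others are sparse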
4. Main Simulation Loop
We repeatedly:
- Generate data.
- Fit each method.
- Evaluate performance.
- Store results in all_results, then average in summary_list.
##########################################
# 5) Main simulation loop
##########################################
simulate_and_compare <- function(n_sim = 50,
                                 n = 100, p = 200, n_inform = 10,
                                 rho = 0.3, sigma = 1, alpha_EN = 0.5,
                                 folds = 5, seed_start = 1234) {
  method_list <- c("ridge", "lasso", "en", "scad", "mcp")
  # Pre-allocate one data frame of metrics per method
  all_results <- list()
  for (m in method_list) {
    all_results[[m]] <- data.frame(mse_beta = numeric(n_sim),
                                   tpr = numeric(n_sim),
                                   fdr = numeric(n_sim),
                                   mse_pred = numeric(n_sim))
  }
  # Loop over n_sim replicates
  for (i in seq_len(n_sim)) {
    sim_data <- simulate_data(n = n, p = p, n_inform = n_inform,
                              rho = rho, sigma = sigma, seed = seed_start + i)
    X <- sim_data$X
    y <- sim_data$y
    beta_true <- sim_data$beta_true
    coefs <- fit_all_methods(X, y, alpha_EN = alpha_EN, folds = folds)
    # Evaluate each method
    for (m in method_list) {
      perf <- evaluate_performance(coefs[[m]], beta_true, X, y)
      all_results[[m]]$mse_beta[i] <- perf$mse_beta
      all_results[[m]]$tpr[i]      <- perf$tpr
      all_results[[m]]$fdr[i]      <- perf$fdr
      all_results[[m]]$mse_pred[i] <- perf$mse_pred
    }
  }
  # Summarize: average each metric over the replicates
  summary_list <- list()
  for (m in method_list) {
    summary_list[[m]] <- apply(all_results[[m]], 2, mean)
  }
  list(raw = all_results, summary = summary_list)
}
5. Run the Simulation
Here we run 30 simulations with \(n=100, p=200\), correlation \(\rho=0.3\), n_inform = 10 truly non-zero coefficients, and the default \(\alpha_{\text{EN}}=0.5\). We then print the average results.
##########################################
# 6) Run the simulation and compare
##########################################
res <- simulate_and_compare(
  n_sim = 30,
  n = 100,
  p = 200,
  n_inform = 10,
  rho = 0.3,
  sigma = 1,
  alpha_EN = 0.5,
  folds = 5,
  seed_start = 1000
)
##########################################
# 7) Inspect and print final results
##########################################
cat("Average performance metrics across simulations:\n")
## Average performance metrics across simulations:
for (m in names(res$summary)) {
  cat("\nMethod:", m, "\n")
  print(res$summary[[m]])
}
##
## Method: ridge
## mse_beta tpr fdr mse_pred
## 0.09539526 1.00000000 0.95000000 11.15929727
##
## Method: lasso
## mse_beta tpr fdr mse_pred
## 0.005301336 1.000000000 0.753635840 0.592791896
##
## Method: en
## mse_beta tpr fdr mse_pred
## 0.007489721 1.000000000 0.804833607 0.494146244
##
## Method: scad
## mse_beta tpr fdr mse_pred
## 0.000703822 1.000000000 0.181614682 0.857284133
##
## Method: mcp
## mse_beta tpr fdr mse_pred
## 0.0006805814 1.0000000000 0.0355218855 0.8881671510
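For a side-by-side view, the per-method summaries can also be stacked into one matrix (a convenience step, not part of the loop above):
# Methods as rows, metrics as columns
round(do.call(rbind, res$summary), 4)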
Interpretation:
- mse_beta: how close \(\hat{\beta}\) is to \(\beta_{\text{true}}\).
- tpr: fraction of truly non-zero coefficients discovered.
- fdr: fraction of selected coefficients that are false positives.
- mse_pred: training-set MSE on \(y\).
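In symbols, writing \(S = \{j : \beta_{\text{true},j} \neq 0\}\) for the true support and \(\hat{S} = \{j : \hat{\beta}_j \neq 0\}\) for the selected set,
\[
\mathrm{TPR} = \frac{|\hat{S} \cap S|}{|S|}, \qquad \mathrm{FDR} = \frac{|\hat{S} \setminus S|}{|\hat{S}|},
\]
with FDR taken as 0 when \(\hat{S}\) is empty.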
Experiment with larger n_sim, or different p, rho, or alpha_EN, to see how correlation and penalty structure affect each method’s performance.
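One possible sweep over correlation levels (illustrative; a small n_sim keeps the run short):
# Re-run the comparison for several AR(1) correlations
for (r in c(0.1, 0.5, 0.9)) {
  res_r <- simulate_and_compare(n_sim = 10, rho = r)
  cat("\nrho =", r, "\n")
  print(round(do.call(rbind, res_r$summary), 4))
}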
Summary
This R Markdown file:
- Generates correlated high-dimensional data with a sparse truth.
- Implements Ridge, Lasso, Elastic Net, SCAD, and MCP using CV for \(\lambda\).
- Evaluates each method’s performance (MSE of \(\beta\), TPR, FDR, predictive MSE) over multiple simulations.
By adjusting parameters (e.g., n, p, rho, alpha_EN), you can see how each penalty handles feature selection, bias, and predictive accuracy in different settings.