Introduction

In many high-dimensional bioinformatics settings (\(p \gg n\)), penalized regression methods (Ridge, Lasso, Elastic Net, SCAD, MCP) help control overfitting and allow variable selection. This R Markdown document:

  1. Demonstrates a simulation model with correlated predictors.
  2. Defines five penalized methods with cross-validation to select \(\lambda\).
  3. Compares performance via TPR, FDR, MSE of coefficients, and predictive error.

Please ensure you have these packages installed:

# install.packages("glmnet")
# install.packages("ncvreg")
# install.packages("MASS")

1. Simulation Model

We generate a design matrix \(\mathbf{X}\in \mathbb{R}^{n\times p}\) with AR(1) correlation \(\rho\). We construct a sparse \(\beta_{\text{true}}\) (only the first n_inform entries are non-zero), then compute \(y = X \, \beta_{\text{true}} + \epsilon\) with i.i.d. noise \(\epsilon_i \sim N(0, \sigma^2)\).

##########################################
# 1) Load required packages
##########################################
library(glmnet)
library(ncvreg)
library(MASS)

##########################################
# 2) Simulation function
##########################################
simulate_data <- function(n = 100, p = 200, n_inform = 10, rho = 0.3, sigma = 1, seed = NULL) {
  # Generates:
  #   X: n-by-p design matrix with correlation rho
  #   y: response vector
  #   beta_true: sparse true coefficient vector
  
  if (!is.null(seed)) set.seed(seed)
  
  # 2.1) AR(1) covariance matrix
  Sigma <- outer(1:p, 1:p, function(i, j) rho^abs(i-j))
  
  # 2.2) Generate X ~ N(0, Sigma)
  X <- MASS::mvrnorm(n = n, mu = rep(0, p), Sigma = Sigma)
  
  # 2.3) Construct sparse beta_true
  beta_true <- numeric(p)
  beta_true[1:n_inform] <- runif(n_inform, min = 1, max = 2) * sample(c(1, -1), n_inform, replace = TRUE)
  
  # 2.4) Generate y = X * beta_true + noise
  noise <- rnorm(n, mean = 0, sd = sigma)
  y <- drop(X %*% beta_true + noise)
  
  list(X = X, y = y, beta_true = beta_true)
}
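
As a quick sanity check (an illustrative snippet, not part of the study itself), we can simulate one small dataset and confirm the dimensions and the number of informative coefficients:

# Quick check of the simulator: design matrix dimensions and sparsity of beta_true
chk <- simulate_data(n = 50, p = 100, n_inform = 5, rho = 0.3, sigma = 1, seed = 1)
dim(chk$X)                # 50 x 100
sum(chk$beta_true != 0)   # 5 non-zero coefficients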

2. Performance Metrics

We evaluate four metrics:

- Coefficient MSE: \(\|\hat{\beta} - \beta_{\text{true}}\|^2 / p\), computed directly as the mean of the squared differences.
- TPR (true positive rate): the fraction of truly non-zero coefficients that are selected, \(\mathrm{TP} / (\mathrm{TP} + \mathrm{FN})\).
- FDR (false discovery rate): the fraction of selected coefficients that are actually zero, \(\mathrm{FP} / (\mathrm{TP} + \mathrm{FP})\).
- Prediction MSE: \(\|y - X\hat{\beta}\|^2 / n\), computed in-sample on the training data.

##########################################
# 3) Metrics for performance
##########################################
evaluate_performance <- function(beta_est, beta_true, X, y) {
  # 3.1) MSE on coefficient estimation
  mse_beta <- mean((beta_est - beta_true)^2)
  
  # 3.2) TPR and FDR for variable selection
  selected <- which(abs(beta_est) > 1e-6)
  true_nonzero <- which(abs(beta_true) > 1e-9)
  
  tp <- length(intersect(selected, true_nonzero))
  fp <- length(setdiff(selected, true_nonzero))
  fn <- length(setdiff(true_nonzero, selected))
  
  tpr <- if ((tp + fn) == 0) 0 else tp / (tp + fn)
  fdr <- if ((tp + fp) == 0) 0 else fp / (tp + fp)
  
  # 3.3) Prediction error
  yhat <- X %*% beta_est
  mse_pred <- mean((y - yhat)^2)
  
  list(mse_beta = mse_beta, tpr = tpr, fdr = fdr, mse_pred = mse_pred)
}
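
As an intuition check (illustrative only), plugging the true coefficients into evaluate_performance on the same simulated data should give zero coefficient MSE, TPR = 1, FDR = 0, and a prediction MSE close to \(\sigma^2\):

# Oracle check: evaluating beta_true against itself on its own data
chk <- simulate_data(n = 100, p = 200, n_inform = 10, sigma = 1, seed = 42)
evaluate_performance(chk$beta_true, chk$beta_true, chk$X, chk$y)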

3. Fitting Methods

We fit Ridge, Lasso, and Elastic Net via glmnet, and SCAD and MCP via ncvreg. All methods use k-fold cross-validation to select \(\lambda\) (the lambda.min rule); the SCAD and MCP concavity parameter \(\gamma\) is left at its ncvreg default.

##########################################
# 4) Fitting each method
##########################################
fit_all_methods <- function(X, y, alpha_EN = 0.5, folds = 5) {
  
  # Ridge (alpha = 0); coef() returns the intercept first, so [-1] drops it
  cv_ridge <- cv.glmnet(X, y, alpha = 0, nfolds = folds)
  ridge_coef <- as.numeric(coef(cv_ridge, s = "lambda.min"))[-1]
  
  # Lasso (alpha=1)
  cv_lasso <- cv.glmnet(X, y, alpha = 1, nfolds = folds)
  lasso_coef <- as.numeric(coef(cv_lasso, s = "lambda.min"))[-1]
  
  # Elastic Net (alpha=alpha_EN)
  cv_en <- cv.glmnet(X, y, alpha = alpha_EN, nfolds = folds)
  en_coef <- as.numeric(coef(cv_en, s = "lambda.min"))[-1]
  
  # SCAD
  cv_scad <- cv.ncvreg(X, y, family = "gaussian", penalty = "SCAD", nfolds = folds)
  scad_coef <- coef(cv_scad, lambda = cv_scad$lambda.min)[-1]
  
  # MCP
  cv_mcp <- cv.ncvreg(X, y, family = "gaussian", penalty = "MCP", nfolds = folds)
  mcp_coef <- coef(cv_mcp, lambda = cv_mcp$lambda.min)[-1]
  
  list(
    ridge = ridge_coef,
    lasso = lasso_coef,
    en    = en_coef,
    scad  = scad_coef,
    mcp   = mcp_coef
  )
}
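
On a single simulated dataset (an illustrative run whose exact numbers vary with the random CV folds), we can count how many coefficients each method keeps, which previews the sparsity differences seen later:

# Fit all five methods once and count non-zero coefficients per method;
# Ridge keeps all p coefficients, the sparse penalties keep far fewer
chk    <- simulate_data(n = 100, p = 200, n_inform = 10, seed = 7)
coefs1 <- fit_all_methods(chk$X, chk$y, alpha_EN = 0.5, folds = 5)
sapply(coefs1, function(b) sum(abs(b) > 1e-6))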

4. Main Simulation Loop

We repeatedly:

  1. Generate data.
  2. Fit each method.
  3. Evaluate performance.
  4. Store results in all_results, then average in summary_list.
##########################################
# 5) Main simulation loop
##########################################
simulate_and_compare <- function(n_sim = 50,
                                 n = 100, p = 200, n_inform = 10,
                                 rho = 0.3, sigma = 1, alpha_EN = 0.5,
                                 folds = 5, seed_start = 1234) {
  
  method_list <- c("ridge", "lasso", "en", "scad", "mcp")
  all_results <- list()
  for (m in method_list) {
    all_results[[m]] <- data.frame(mse_beta = numeric(n_sim),
                                   tpr = numeric(n_sim),
                                   fdr = numeric(n_sim),
                                   mse_pred = numeric(n_sim))
  }
  
  # Loop over n_sim replicates
  for (i in seq_len(n_sim)) {
    sim_data <- simulate_data(n = n, p = p, n_inform = n_inform,
                              rho = rho, sigma = sigma, seed = seed_start + i)
    X <- sim_data$X
    y <- sim_data$y
    beta_true <- sim_data$beta_true
    
    coefs <- fit_all_methods(X, y, alpha_EN = alpha_EN, folds = folds)
    
    # Evaluate each method
    for (m in method_list) {
      perf <- evaluate_performance(coefs[[m]], beta_true, X, y)
      all_results[[m]]$mse_beta[i] <- perf$mse_beta
      all_results[[m]]$tpr[i]      <- perf$tpr
      all_results[[m]]$fdr[i]      <- perf$fdr
      all_results[[m]]$mse_pred[i] <- perf$mse_pred
    }
  }
  
  # Summarize
  summary_list <- list()
  for (m in method_list) {
    summary_list[[m]] <- apply(all_results[[m]], 2, mean)
  }
  
  list(raw = all_results, summary = summary_list)
}

5. Run the Simulation

Here we run 30 simulation replicates with \(n=100\), \(p=200\), AR(1) correlation \(\rho=0.3\), n_inform = 10 truly non-zero coefficients, and Elastic Net mixing parameter \(\alpha_{\text{EN}}=0.5\). We then print the average results.

##########################################
# 6) Run the simulation and compare
##########################################
res <- simulate_and_compare(
  n_sim     = 30,
  n         = 100, 
  p         = 200,
  n_inform  = 10,
  rho       = 0.3,
  sigma     = 1,
  alpha_EN  = 0.5,
  folds     = 5,
  seed_start= 1000
)

##########################################
# 7) Inspect and print final results
##########################################
cat("Average performance metrics across simulations:\n")
## Average performance metrics across simulations:
for (m in names(res$summary)) {
  cat("\nMethod:", m, "\n")
  print(res$summary[[m]])
}
## 
## Method: ridge 
##    mse_beta         tpr         fdr    mse_pred 
##  0.09539526  1.00000000  0.95000000 11.15929727 
## 
## Method: lasso 
##    mse_beta         tpr         fdr    mse_pred 
## 0.005301336 1.000000000 0.753635840 0.592791896 
## 
## Method: en 
##    mse_beta         tpr         fdr    mse_pred 
## 0.007489721 1.000000000 0.804833607 0.494146244 
## 
## Method: scad 
##    mse_beta         tpr         fdr    mse_pred 
## 0.000703822 1.000000000 0.181614682 0.857284133 
## 
## Method: mcp 
##     mse_beta          tpr          fdr     mse_pred 
## 0.0006805814 1.0000000000 0.0355218855 0.8881671510

Interpretation:
- mse_beta: mean squared distance between \(\hat{\beta}\) and \(\beta_{\text{true}}\).
- tpr: fraction of truly non-zero coefficients discovered.
- fdr: fraction of selected coefficients that are false positives.
- mse_pred: in-sample (training) MSE of the predictions \(X\hat{\beta}\) against \(y\).

Experiment with larger n_sim, or different p, rho, or alpha_EN to see how correlation and penalty structure affect each method’s performance.
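
For example, a minimal sketch along these lines (parameter values are illustrative, and n_sim is kept small to limit runtime) sweeps over a grid of correlations and collects the summaries:

# Illustrative sweep over the AR(1) correlation rho
rho_grid  <- c(0.1, 0.5, 0.9)
sweep_res <- lapply(rho_grid, function(r) {
  simulate_and_compare(n_sim = 10, n = 100, p = 200, n_inform = 10,
                       rho = r, sigma = 1, alpha_EN = 0.5, folds = 5,
                       seed_start = 2000)$summary
})
names(sweep_res) <- paste0("rho = ", rho_grid)
sweep_res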


Summary

This R Markdown file:

  1. Generates correlated high-dimensional data with a sparse truth.
  2. Implements Ridge, Lasso, Elastic Net, SCAD, MCP using CV for \(\lambda\).
  3. Evaluates each method’s performance (MSE of \(\beta\), TPR, FDR, predictive MSE) over multiple simulations.

By adjusting parameters (e.g., n, p, rho, alpha_EN), you can visualize how each penalty handles feature selection, bias, and predictive accuracy in different settings.
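
As a starting point for such visual comparisons, a short sketch (assuming the res object created in Section 5) turns the per-replicate results stored in res$raw into side-by-side boxplots:

# Boxplots of per-replicate FDR and coefficient MSE for each method
fdr_mat <- sapply(res$raw, function(d) d$fdr)       # n_sim x 5 matrix
mse_mat <- sapply(res$raw, function(d) d$mse_beta)  # n_sim x 5 matrix
op <- par(mfrow = c(1, 2))
boxplot(fdr_mat, main = "FDR by method", ylab = "FDR")
boxplot(mse_mat, main = "Coefficient MSE by method", ylab = "MSE")
par(op)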