--- title: "Overview of the ksm package" author: "Leo Belzile" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{Overview of the ksmpackage} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 5, fig.height = 3, fig.align='center', global.par = TRUE ) ``` The `ksm` package provides functionalities for the kernel density estimation for random symmetric positive definite matrices. It provides C++ wrappers for estimating the density, with additional functionalities for estimating the optimal bandwidth according to the least square cross validation (LSCV) and the leave-one-out cross (LCV) validation criteria. Three kernels are implemented: the Gaussian (`smnorm`), log-Gaussian (`smlnorm`) and Wishart (`Wishart`). ## Evaluation of the kernel and bandwidth optimization ```{r} set.seed(2025) library(ksm) d <- 4L # Equicorrelation matrix S <- diag(rep(0.5, d)) + matrix(0.5, d, d) # Generate simulated data from an inverse Wishart (d by d by n array). samp <- rinvWishart(n = 100, df = 10, S) dim(samp) # Optimize the bandwidth for kernel using leave-one-out cross validation b <- bandwidth_optim( x = samp, kernel = "smlnorm", criterion = "lcv") # Evaluate kernel with a new psd matrix data point newsamp <- rinvWishart(n = 2, df = 10, S = S) # Kernel density (default to log scale) kdens_symmat( x = newsamp, xs = samp, kernel = "smlnorm", b = b) ``` We can also consider stationary data and exclude observations at lag $h$ to account for serial dependence. This is illustrated below with a time series of realized variance covariance matrix between two assets over 250 days. ```{r} data(realvar, package = "ksm") dim(realvar) # Select suitable lag for least square cross validation h <- ceiling(dim(realvar)[3]^0.25) # Calculate optimal bandwidth bopt <- bandwidth_optim( x = realvar, kernel = "Wishart", criterion = "lscv", h = h) # Evaluate at multiple values of bandwidth bseq <- seq(0.5 * bopt, 2 * bopt, length.out = 101) lscv <- ksm::lscv_kdens_symmat( x = realvar, b = bseq, h = h, kernel = "Wishart") with(lscv, plot(x = b, y = lscv, type = "l", panel.first = { abline(v = bopt, lty = 2, col = "grey") }, xlab = "bandwidth", ylab = "least square cross validation" ) ) ``` ## Integrating over positive definite matrices We can also make integration in dimensions $d \in \{2, 3\}$ to evaluate numerically the performance of kernels on simulated data. The function `integrate_spd` performs a transformation of inputs for numerical integration with routines from the `cubature` package. This is useful for checking correct implementation: below, we check that the density of a Wishart integrates to unity over the space of positive definite matrices. Different methods for integration are provided: we recommend in particular `cuhre`, `suave` or `hcubature`. The `lb` argument controls the smaller value for the eigenvalues, which sometimes returns matrices that are close to being non-positive definite. If such an error mistake arises, consider setting `lb` to a small value. ```{r} # Check that Wishart density integrates to unity (int <- integrate_spd( dim = 2L, neval = 1e5L, f = function(x, S){ dWishart(x, df = 10, S = S, log = FALSE)}, S = diag(2),lb = 0, method = "hcubature")) isTRUE(all.equal(int$integral, 1, tol = 2*int$error)) ``` The main use for the `integrate_spd` function is to calculate the integrated squared error for data simulated from some density. We do this here for different models that generate independent and identically distributed matrices from a mixture of Wishart densities. The function here relies on predefined models from functions `simu_rdens` and `simu_fdens` used in the simulation studies of the paper for the independent and identically distributed scenario. ```{r} # Integrated squared error ISE <- function( S, x, bandwidth, model = 1:6, kernel = c("Wishart", "smlnorm", "smnorm"), ... ) { model <- match.arg(as.character(model), choices = 1:6) kernel <- match.arg(kernel) dim <- ncol(x) # Compute the squared difference between the kernel and the target density ( ksm::kdens_symmat( x = S, xs = x, b = bandwidth, kernel = kernel, log = FALSE ) - ksm::simu_fdens(S, model = model, d = dim) )^2 } # Calculate numerical integral d <- 2L xs <- simu_rdens(n = 100, model = 2, d = d) b <- bandwidth_optim( x = xs, criterion = "lcv", kernel = "smnorm") # Computing ISE ISE( S = xs[,,1, drop = FALSE], x = xs, bandwidth = b, model = 2, kernel = "smnorm") # Next integrate over the space of psd matrices ISE_int <- try( ksm::integrate_spd( f = function(S) { ISE( S, x = xs, kernel = "smnorm", # (only for lcv) bandwidth = b, model = 2 ) }, dim = d, method = "hcubature", lb = 0, ub = Inf, neval = 1e6L ), silent = TRUE ) ISE_int ```