
cross-validation with folds of size < 5 #10

Closed
rdiaz02 opened this issue Feb 13, 2024 · 5 comments

rdiaz02 commented Feb 13, 2024

This is not a bug per se, but I think an unnecessary limitation.

If we try to run KernelKnnCV with a data set and a number of folds such that some folds have fewer than 5 observations, we get an error. It comes from the check currently at line 64 of kernelknnCV.R:

if (!all(unlist(lapply(n_folds, length)) > 5)) stop('Each fold has less than 5 observations. Consider decreasing the number of folds or increasing the size of the data.')

What is the rationale? This precludes leave-one-out cross-validation, but also any cross-validation with, well, folds of fewer than 5 observations.

mlampros (Owner) commented

It has been a while since I implemented the 'KernelKnnCV' R function, and normally I add exceptions to the code whenever I receive errors when testing the functions.
This function internally uses two custom cross-validation functions, i.e. regr_folds and class_folds.

I'm willing to make changes, but for which kind of dataset would you use fewer than 5 observations per fold with these custom internal functions?

If you intend to use leave-one-out cross-validation, my suggestion would be to use a resampling R package and write a few lines of code around the 'KernelKnn' function. That way I think you will be better placed to debug potential errors.
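
For example, something along these lines (a rough, untested sketch; it assumes the usual KernelKnn(data, TEST_data, y, k, method, regression) arguments and uses MASS::Boston purely for illustration):

    ## leave-one-out with plain KernelKnn (regression)
    library(KernelKnn)
    data(Boston, package = "MASS")
    X <- as.matrix(Boston[, -ncol(Boston)])   # predictors
    y <- Boston$medv                          # response

    loo_preds <- sapply(seq_len(nrow(X)), function(i) {
      KernelKnn(data = X[-i, , drop = FALSE],
                TEST_data = X[i, , drop = FALSE],
                y = y[-i], k = 5,
                method = "euclidean", regression = TRUE)
    })
    mean((loo_preds - y)^2)                   # leave-one-out MSE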

For new functionality, a pull request is welcome.

rdiaz02 (Author) commented Feb 14, 2024

Thanks for your reply. I am using a bunch of different data sets. I am not specifically interested in leave-one-out per se (I avoid it if I can) but with large m (using m instead of k to avoid ambiguities 😅), m-fold cross-validation becomes leave-one-out when m = sample size.

My suggestion would be to only trigger an error if any of the folds is empty. In fact, the easiest would be to allow any number of folds up to nrow of the data (i.e., only trigger an error if folds > nrow(X)). I am using your code (regr_folds now) with folds equal to the sample size, and it works (see below). I can make a PR with this minimal check if you want.
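
A minimal version of that check could look like this (just a sketch; it keeps the n_folds object of the existing check and assumes data and folds are still in scope, as in the current function):

    ## only complain when a fold would be empty
    if (folds > nrow(data))
      stop('The number of folds should not exceed the number of observations.')
    if (any(unlist(lapply(n_folds, length)) == 0))
      stop('At least one fold is empty. Consider decreasing the number of folds.')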

As for your suggestion to use a resampling package, I am not sure I understand. I am not specifically interested in leave-one-out per se, but I want to be able to use large m-fold cross-validation and default to leave-one-out when m = nrow(X). It is actually fairly simple to write a few lines of code to do this and, for example, run a grid search over values of k.
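
For instance, a bare-bones version of that grid search might be (a sketch, assuming the same KernelKnn arguments as above and using Boston only for illustration):

    ## pick k by leave-one-out MSE over a small grid
    library(KernelKnn)
    data(Boston, package = "MASS")
    X <- as.matrix(Boston[, -ncol(Boston)])
    y <- Boston$medv

    ks <- 1:15
    loo_mse <- sapply(ks, function(kk) {
      preds <- sapply(seq_len(nrow(X)), function(i)
        KernelKnn(data = X[-i, , drop = FALSE],
                  TEST_data = X[i, , drop = FALSE],
                  y = y[-i], k = kk,
                  method = "euclidean", regression = TRUE))
      mean((preds - y)^2)
    })
    ks[which.min(loo_mse)]   # k with the smallest leave-one-out MSE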


This is what I am actually doing:
a) Set folds <- min(folds, nrow(data)) right before calling KernelKnnCV. In other words, use the folds I give unless it is larger than nrow(data), in which case it becomes leave-one-out (see the sketch after this list);
b) Modify KernelKnnCV to comment out the line with the error for folds with fewer than 5 observations.
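
Roughly, for step a) (a hypothetical wrapper, run_knn_cv, just to illustrate; it assumes a KernelKnnCV from which the 5-observation check has been removed, as in step b)):

    run_knn_cv <- function(data, y, m, k = 5) {
      m <- min(m, nrow(data))               # cap the number of folds at nrow(data)
      KernelKnnCV(data = data, y = y, k = k,
                  folds = m, regression = TRUE)
    }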

mlampros (Owner) commented

> I can make a PR with this minimal check if you want
> b) modify KernelKnnCV to comment out the line with the error for folds with fewer than 5 observations

In case of a PR, it would be nice to add a test case based on one of the existing datasets of the KernelKnn R package, both to test the modified code and to serve as a reference for future users of the package.
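
The test could take roughly this shape (a sketch only; the dataset shipped with the package and the structure of the returned object should be double-checked, here I assume the Boston data and a preds element with one set of predictions per fold):

    testthat::test_that("KernelKnnCV runs with folds of fewer than 5 observations", {
      data(Boston, package = "KernelKnn")
      X <- Boston[, -ncol(Boston)]
      y <- Boston$medv
      res <- KernelKnnCV(data = X, y = y, k = 5,
                         folds = nrow(X), regression = TRUE)   # leave-one-out
      testthat::expect_length(res$preds, nrow(X))
    })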

rdiaz02 (Author) commented Feb 14, 2024

Perfect. I'll try to do it and add a test. I would leave a stop condition for folds > nrow(X). (Not immediately, though, since I am currently swamped with ... the analysis of some data 😅.)

mlampros (Owner) commented

I'll close the issue for now; feel free to open a pull request to adjust the code.
