question about the biggest N #37
Comments
Thanks for your interest! When bigKRLS runs out of RAM, it should switch the
computation to disk (i.e., use swap). However, those calculations are
considerably slower, and it is unlikely you would find the speed trade-off
tolerable with 16 GB of RAM. There aren't many hyperparameters, but they
matter somewhat; I recommend fitting at N = 3,000 to benchmark your machine
and then increasing N, keeping the quadratic memory footprint in mind.
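As a back-of-envelope illustration of that quadratic footprint (a sketch, not bigKRLS internals; the 8 GB at N = 14,000 figure comes from the paper quoted below):

```python
import math

# Back-of-envelope extrapolation: if memory use grows as N^2, one
# benchmark point is enough to estimate other sizes. The anchor point
# (8 GB at N = 14,000) is the figure reported in the paper.

def scaled_memory_gb(benchmark_n, benchmark_gb, target_n):
    """Extrapolate memory use, assuming it grows as N^2."""
    return benchmark_gb * (target_n / benchmark_n) ** 2

def max_n(benchmark_n, benchmark_gb, ram_gb):
    """Largest N whose extrapolated footprint fits in ram_gb."""
    return int(benchmark_n * math.sqrt(ram_gb / benchmark_gb))

print(scaled_memory_gb(14_000, 8.0, 26_000))  # ~27.6 GB for the full sample
print(max_n(14_000, 8.0, 16.0))               # ~19,798 with 16 GB of RAM
```

By this rough scaling, the full N = 26,000 sample would exceed 16 GB of RAM in-core, consistent with the advice above.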
On Mon, Jan 20, 2020 at 3:11 PM StevenLi-DS ***@***.***> wrote:
Thanks for the package! I'm trying to use it in my study with a sample
size around 26,000 (10 covariates). However, the following paper, "Messy
Data, Robust Inference? Navigating Obstacles to Inference with bigKRLS,"
states:
"bigKRLS can handle datasets up to approximately N = 14,000 on a personal
machine before reaching the 8 GB cutoff."
Thus, I'm concerned about whether I should continue. Does this mean the
program will stop running if I fit a dataset with N > 14,000? I have a
laptop with 16 GB RAM. Will it be OK?
--
Pete Mohanty, PhD
Data Scientist at Google
Close but not quite. Saving the full model output involves several N × N
matrices (such as the variance-covariance matrix). I am away from my laptop,
but I anticipate the output is proportional to 5N^2.
On Mon, Jan 20, 2020 at 8:36 PM StevenLi-DS ***@***.***> wrote:
I tried with N = 5,000, and save.bigKRLS() generated a folder of files
that takes 1.5 GB.
Assume my total N = 25,000; then I would need 1.5*(25000/5000)^2 = 37.5
GB of RAM to run the model, right?
@rdrr1990 any hint?
Hi Steven,
I'm not quite certain about the source of the error, but it doesn't look like an issue with the bigKRLS package. I spent a little while digging through LAPACK's documentation (http://www.netlib.org/lapack/explore-html/d2/d8a/group__double_s_yeigen_gaeed8a131adf56eaa2a9e5b1e0cce5718.html), and it looks like the error is raised by LAPACK's eigenvalue solver. The error code specifically seems to come from this check:
IF( n.GT.0 .AND. vu.LE.vl )
info = -8
The conditions mean that the order of the matrix is greater than zero and the upper bound on the eigenvalue solver is less than the lower bound. Those bounds aren't something that we set or even have the option to set (via Armadillo or Rcpp), so this is either a bug with Armadillo or a numerical/memory problem caused by an oversized input matrix. My guess is the latter, but it's difficult to be sure without a reproducible example. As an experiment, you might try running the bigKRLS estimation routine on a random subset of your data at a size that will clearly run (say, n=10,000 or so). If the estimation routine runs, then the issue is likely related to the size of the input matrix. If you see an error, though, then there's probably some other issue going on.
- Robert
--
Postdoctoral Fellow, Perry World House, University of Pennsylvania
Website: https://rbshaffer.github.io/
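The subset experiment suggested above can be sketched as follows (the index-drawing step only; `random_subset_indices` is a hypothetical helper, and the actual fit would be R's bigKRLS() on the subsetted rows):

```python
import random

# Draw a reproducible random subset of row indices at a size known to
# run (n = 10,000 in the suggestion above), to test whether the error
# reproduces at a smaller, clearly feasible size.
def random_subset_indices(n_total, n_subset, seed=42):
    """Return sorted, unique row indices for a random subset."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), n_subset))

idx = random_subset_indices(26_000, 10_000)
print(len(idx))  # 10000 rows to pass to the estimation routine
```

If the fit succeeds on the subset, size is the likely culprit; if the same error appears, something else is going on.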
|
Hi @rbshaffer. Thanks for the reply. I thought it was due to my sample size. I've tried with a sample of my dataset with more than N = 13,000, which had no issue at all. I will try it again and see if there is anyway I can share the dataset. |