How to correctly load model matrices into a learner (e.g., a Random Forest)? #125

Open
DanielTakeshi opened this issue Aug 27, 2016 · 2 comments

@DanielTakeshi

BIDMach version: 1037ae7 (August 12)
BIDMat version: 1383cb4ccf3933a8175073b8eab9819be7e252bf (August 12)
OS: Linux (it's on "stout")

Here's the problem setup. I have run Random Forests on some training data. At the end of my script, I call the model's save method to save the model matrices:

// 'a' and 'c' are the training data and labels, respectively
val (nn, opts) = RandomForest.learner(a, c);

// Set a bunch of options, such as opts.seed = 0 to set a random seed
opts.batchSize = 1000
opts.depth = 20
opts.gain = 0.001f
opts.ntrees = 10
opts.nsamps = 24
opts.nnodes = 2500000
opts.nbits = 16
opts.ncats = 100;
opts.regression = true;
opts.seed = 0
opts.useGPU = true;

nn.train
nn.model.save("seed_0/")

This creates four files, since RFs have four model matrices (ctrees.fmat.lz4, ftrees.imat.lz4, itrees.imat.lz4, vtrees.imat.lz4). Now, in a separate script, I want to create a new Random Forest that loads these model matrices so that it doesn't have to train. Here's what an example script might look like, with file names removed for privacy:

// Test data
val ta = loadFMat("...")
val tc = loadFMat("...")

// Train data
val a = loadFMat("...")
val c = loadFMat("...")

// Establish the Random Forest with same parameters as before.
val (nn, opts) = RandomForest.learner(a, c); 
opts.batchSize = 1000
opts.depth = 20
opts.gain = 0.001f
opts.ntrees = 10
opts.nsamps = 24
opts.nnodes = 2500000
opts.nbits = 16
opts.ncats = 100;
opts.regression = true;
opts.useGPU = true;

// IMPORTANT: I am trying to load the model matrices from the correct directory.
// My main concern: is this command in the correct spot? Am I missing
// anything I need to call in addition to this?
nn.model.load("seed_0/")

// This prediction does not work.
val model = nn.model.asInstanceOf[RandomForest]
val (mm, mopts) = RandomForest.predictor(model, ta);
mopts.batchSize = 1000
mm.predict

I have the training data there even though I don't think it's needed. I have it there because I am trying to keep everything consistent with the original script that ran training. I'm assuming that if the Random Forest got trained with tree depth 20, then here we should also have a tree depth of 20 if we're going to be loading the model matrices, and so on.

Unfortunately, running the above (with the appropriate data, but I think any data will do), I get:

model: BIDMach.models.RandomForest = BIDMach.models.RandomForest@234d5408
mm: BIDMach.Learner = Learner(BIDMach.datasources.MatSource@1be2bc0,BIDMach.models.RandomForest@234d5408,null,null,BIDMach.datasinks.MatSink@6c2a4b24,BIDMach.models.RandomForest$PredOpts@4cab5ff6)
mopts: BIDMach.models.RandomForest.PredOpts = BIDMach.models.RandomForest$PredOpts@4cab5ff6
mopts.batchSize: Int = 1000
java.lang.NullPointerException
  at BIDMach.models.RandomForest.init(RandomForest.scala:247)
  at BIDMach.Learner.predict(Learner.scala:199)
  ... 54 elided

This error happens in the init method, implying that the Random Forest has to be initialized somehow. Initialization happens automatically when you call the train method, but I don't know how to trigger it without calling train. Are there any example scripts that do this? I couldn't find any by searching. The Random Forest's 'load' method looks like it "returns" a Random Forest model, but I cannot simply do:

val model = nn.model.load("seed_0/")

Do you have some advice? I'm currently working through this issue so hopefully I can find how to do it, but even if I do, it would be great to have confirmation that I'm doing the steps the way it's supposed to work.

Final (somewhat unrelated) comment: the Random Forest model matrices all have dimension (opts.nnodes, opts.ntrees). Therefore, to combine the trees from several trained Random Forests for use on a test set, we have to concatenate the matrices horizontally (not vertically), which adds columns. The Random Forest code doesn't seem to have a method for that, but I can do it offline myself.
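The horizontal concatenation described above could be sketched as follows. This assumes BIDMat's `\` operator for horizontal concatenation, and "seed_1/" is a hypothetical second model directory that does not appear in this thread. Whether the forest also needs opts.ntrees updated to the new column count is not verified here.

```scala
import BIDMat.{FMat, IMat}
import BIDMat.MatFunctions._

// Load the same model matrix from two separately trained forests.
// "seed_1/" is a hypothetical directory used only for illustration.
val itreesA = loadIMat("seed_0/itrees.imat.lz4")
val itreesB = loadIMat("seed_1/itrees.imat.lz4")

// Each matrix has one column per tree, so `\` appends the trees of the
// second forest as extra columns while the row count stays the same.
val itrees = itreesA \ itreesB

// The float-valued matrix is combined the same way.
val ctrees = loadFMat("seed_0/ctrees.fmat.lz4") \ loadFMat("seed_1/ctrees.fmat.lz4")
```

The ftrees and vtrees matrices would be combined in the same manner before the four results are handed to the model.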

-Daniel

@DanielTakeshi

I think I have a workaround!

First, some background: the original training script, which does not load any model matrices, achieves the following test set performance:

R.F. AUC = 0.7105, MAE = 0.2538

That script also saves the model matrices using what I outlined earlier. Therefore, it logically follows that for the loading process to work, I should be able to set up a new script, create the same Random Forest (w/same parameters), load those model matrices, and do the prediction, and get the same ROC AUC and the same MAE, WITHOUT having to do any training.

The idea is this: I use the same new script as in my previous post, but after setting the 'opts', I set opts.depth = 1 so that training finishes with minimal delay. So unfortunately I still need to train, but only to force initialization. Then I manually load the matrices, set the depth back to the larger value, and predict, and I get the same AUC and MAE. To be explicit, here's what the new script looks like:

val (nn, opts) = RandomForest.learner(a, c); 
// Set up a bunch of opts, to follow the same parameters as we did in the original training script
// NEW! Let's set opts.depth=1 and train, so we force initialization.
opts.depth = 1
nn.train

// Next, let's explicitly load.
nn.modelmats(0) = loadIMat("seed_0/itrees.imat.lz4");
nn.modelmats(1) = loadIMat("seed_0/ftrees.imat.lz4");
nn.modelmats(2) = loadIMat("seed_0/vtrees.imat.lz4");
nn.modelmats(3) = loadFMat("seed_0/ctrees.fmat.lz4");

// I have to use the above four lines because the following does NOT work:
// nn.model.load("seed_0/")

// Reset opts.depth to be the value we had in training (here it was 20).
opts.depth = 20

// Now do prediction as normal and output statistics:
val model = nn.model.asInstanceOf[RandomForest]
val (mm, mopts) = RandomForest.predictor(model, ta);
mopts.batchSize = 1000
mm.predict
val pc = FMat(mm.preds(0)) / 100.0
val rc = roc2(pc, tc, 1-tc, 1000)
println("R.F. AUC = %5.4f, MAE = %5.4f" format (mean(rc).dv, mean(abs(pc - tc)).dv));

The nn.model.load() method doesn't work because it only assigns the loaded matrices to the itrees, ftrees, vtrees, and ctrees variables; it never registers them as the model matrices:

override def load(fname:String) = {
    itrees = loadIMat(fname+"itrees.imat.lz4");
    ftrees = loadIMat(fname+"ftrees.imat.lz4");
    vtrees = loadIMat(fname+"vtrees.imat.lz4");
    ctrees = loadFMat(fname+"ctrees.fmat.lz4");
}

Line 225 of RandomForest.scala (as of today) performs the correct assignment:

setmodelmats(Array(itrees, ftrees, vtrees, ctrees));

So I need to do that myself.
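To make the difference concrete, here is a toy stand-in in plain Scala. ToyModel, ToyForest, readMat, loadFieldsOnly, and loadAndRegister are hypothetical names invented for this illustration; none of them are BIDMach classes or methods. It only mirrors the pattern described above: filling the variables is not the same as registering them through setmodelmats.

```scala
// Toy base class that keeps a registered array of "model matrices".
class ToyModel {
  private var mats: Array[Array[Int]] = null
  def setmodelmats(a: Array[Array[Int]]): Unit = { mats = a }
  def modelmats: Array[Array[Int]] = mats
}

class ToyForest extends ToyModel {
  var itrees: Array[Int] = null
  var ftrees: Array[Int] = null

  // Pretend file reader: returns fixed data instead of touching disk.
  private def readMat(name: String): Array[Int] = Array(name.length)

  // Mirrors the load quoted above: fills the variables, registers nothing.
  def loadFieldsOnly(dir: String): Unit = {
    itrees = readMat(dir + "itrees")
    ftrees = readMat(dir + "ftrees")
  }

  // Same load, followed by the registration step.
  def loadAndRegister(dir: String): Unit = {
    loadFieldsOnly(dir)
    setmodelmats(Array(itrees, ftrees))
  }
}

val broken = new ToyForest
broken.loadFieldsOnly("seed_0/")   // modelmats is still null afterwards

val fixed = new ToyForest
fixed.loadAndRegister("seed_0/")   // modelmats now holds both arrays
```

Anything that later reads modelmats sees nothing in the first case, which is consistent with the NullPointerException reported in the first post.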

@DanielTakeshi

DanielTakeshi commented Aug 28, 2016

Correction to the above: the better way is to do the following:

nn.model.setmodelmats(Array(matrix0, matrix1, matrix2, matrix3))

This way you can avoid calling the train method entirely.

Edit: here matrix0, etc., are the matrices loaded from the files that store the trained model matrices from the earlier run.
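Putting the correction together, a full prediction script might look like the sketch below. It is meant for the BIDMach shell (which preloads the BIDMat/BIDMach imports) and has not been verified here. It assumes a, c, and ta are loaded as in the first post, and it passes the matrices in the same order as the setmodelmats call quoted from RandomForest.scala.

```scala
// Build the learner and set the same options used for training.
val (nn, opts) = RandomForest.learner(a, c)
opts.batchSize = 1000
opts.depth = 20
opts.gain = 0.001f
opts.ntrees = 10
opts.nsamps = 24
opts.nnodes = 2500000
opts.nbits = 16
opts.ncats = 100
opts.regression = true
opts.useGPU = true

// Load the four saved model matrices.
val itrees = loadIMat("seed_0/itrees.imat.lz4")
val ftrees = loadIMat("seed_0/ftrees.imat.lz4")
val vtrees = loadIMat("seed_0/vtrees.imat.lz4")
val ctrees = loadFMat("seed_0/ctrees.fmat.lz4")

// Register them as the model matrices; no call to nn.train is made.
nn.model.setmodelmats(Array[Mat](itrees, ftrees, vtrees, ctrees))

// Predict on the test data as in the earlier posts.
val model = nn.model.asInstanceOf[RandomForest]
val (mm, mopts) = RandomForest.predictor(model, ta)
mopts.batchSize = 1000
mm.predict
```

The explicit Array[Mat] type is a precaution so that the mixed IMat and FMat values are passed as a single array of the common base type.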
