From 7f906035f2af0fa7a1e32809ef7e7ef9a9eaac83 Mon Sep 17 00:00:00 2001
From: Harsha Vardhan Simhadri
Date: Sat, 19 Oct 2019 21:13:03 -0700
Subject: [PATCH] updated README.md

---
 README.md                                   | 27 ++++---
 examples/pytorch/Bonsai/README.md           |  3 +-
 examples/pytorch/FastCells/README.md        | 86 +++++++++++----------
 pytorch/README.md                           | 47 +++++++----
 pytorch/edgeml_pytorch/trainer/fastmodel.py | 66 +++++++++++++---
 tf/README.md                                | 10 +--
 6 files changed, 152 insertions(+), 87 deletions(-)

diff --git a/README.md b/README.md
index 504fcc54a..07f420e77 100644
--- a/README.md
+++ b/README.md
@@ -26,7 +26,7 @@ offline.
 A tool that adapts models trained by above algorithms to be inferred by fixed point arithmetic.
 - **SeeDot**: Floating-point to fixed-point quantization tool.
-Applications demonstrating usecases of these algorithms.
+Applications demonstrating use cases of these algorithms, such as [GesturePod](/docs/publications).
 ### Organization
 - The `tf` directory contains the `edgeml_tf` package which specifies these architectures in TensorFlow,
@@ -41,16 +41,18 @@ Please see install/run instructions in the README pages within these directories
 ### Details and project pages
 For details, please see our
- [project page](https://microsoft.github.io/EdgeML/) and
- [Microsoft Research page](https://www.microsoft.com/en-us/research/project/resource-efficient-ml-for-the-edge-and-endpoint-iot-devices/).
-our ICML'17 publications on [Bonsai](docs/publications/Bonsai.pdf) and
-[ProtoNN](docs/publications/ProtoNN.pdf) algorithms,
-NeurIPS'18 publications on [EMI-RNN](docs/publications/emi-rnn-nips18.pdf) and
-[FastGRNN](docs/publications/FastGRNN.pdf),
-and PLDI'19 publication on [SeeDot](docs/publications/SeeDot.pdf).
-
-
-Checkout the [ELL](https://github.com/Microsoft/ELL) project which can
+ [project page](https://microsoft.github.io/EdgeML/),
+ [Microsoft Research page](https://www.microsoft.com/en-us/research/project/resource-efficient-ml-for-the-edge-and-endpoint-iot-devices/),
+the ICML'17 publications on [Bonsai](/docs/publications/Bonsai.pdf) and
+[ProtoNN](/docs/publications/ProtoNN.pdf) algorithms,
+the NeurIPS'18 publications on [EMI-RNN](/docs/publications/emi-rnn-nips18.pdf) and
+[FastGRNN](/docs/publications/FastGRNN.pdf),
+the PLDI'19 publication on [SeeDot compiler](/docs/publications/SeeDot.pdf),
+the UIST'19 publication on [GesturePod](/docs/publications/ICane-UIST19.pdf),
+and the NeurIPS'19 publication on [S-RNN](/docs/publications/SRNN.pdf).
+
+
+Also check out the [ELL](https://github.com/Microsoft/ELL) project which can
 provide optimized binaries for some of the ONNX models trained by this library.
 ### Contributors:
@@ -75,7 +77,8 @@ If you use software from this library in your work, please use the BibTex entry
 ```
 @software{edgeml01,
    author = {{Dennis, Don Kurian and Gaurkar, Yash and Gopinath, Sridhar and Gupta, Chirag and
-      Kumar, Ashish and Kusupati, Aditya and Lovett, Chris and Patil, Shishir G and Simhadri, Harsha Vardhan}},
+      Jain, Moksh and Kumar, Ashish and Kusupati, Aditya and Lovett, Chris
+      and Patil, Shishir G and Simhadri, Harsha Vardhan}},
    title = {{EdgeML: Machine Learning for resource-constrained edge devices}},
    url = {https://github.com/Microsoft/EdgeML},
    version = {0.2},
diff --git a/examples/pytorch/Bonsai/README.md b/examples/pytorch/Bonsai/README.md
index 5a80a88bf..60b9c312a 100644
--- a/examples/pytorch/Bonsai/README.md
+++ b/examples/pytorch/Bonsai/README.md
@@ -7,7 +7,8 @@ use-case on the USPS10 public dataset.
`edgeml_pytorch.graph.bonsai` implements the Bonsai prediction graph in pytorch. The three-phase training routine for Bonsai is decoupled from the forward graph to facilitate a plug and play behaviour wherein Bonsai can be combined with or -used as a final layer classifier for other architectures (RNNs, CNNs). +used as a final layer classifier for other architectures (RNNs, CNNs). +See `edgeml_pytorch.trainer.bonsaiTrainer` for 3-phase training. Note that `bonsai_example.py` assumes that data is in a specific format. It is assumed that train and test data is contained in two files, `train.npy` and diff --git a/examples/pytorch/FastCells/README.md b/examples/pytorch/FastCells/README.md index f3d8c3474..abdfb20e2 100644 --- a/examples/pytorch/FastCells/README.md +++ b/examples/pytorch/FastCells/README.md @@ -1,34 +1,36 @@ # EdgeML FastCells on a sample public dataset -This directory includes example notebook and general execution script of -FastCells (FastRNN & FastGRNN) developed as part of EdgeML along with modified +This directory includes example notebooks and scripts of +FastCells (FastRNN & FastGRNN) along with modified UGRNN, GRU and LSTM to support the LSQ training routine. -Also, we include a sample cleanup and use-case on the USPS10 public dataset. - -`edgeml_pytorch.graph.rnn` implements the custom RNN cells of **FastRNN** ([`FastRNNCell`](../../pytorch_edgeml/graph/rnn.py#L226)) and **FastGRNN** ([`FastGRNNCell`](../../pytorch_edgeml/graph/rnn.py#L80)) with -multiple additional features like Low-Rank parameterisation, custom -non-linearities etc., Similar to Bonsai and ProtoNN, the three-phase training -routine for FastRNN and FastGRNN is decoupled from the custom cells to -facilitate a plug and play behaviour of the custom RNN cells in other -architectures (NMT, Encoder-Decoder etc.,) in place of the inbuilt `RNNCell`, `GRUCell`, `BasicLSTMCell` etc., -`edgeml_pytorch.graph.rnn` also contains modified RNN cells of **UGRNN** ([`UGRNNLRCell`](../../pytorch_edgeml/graph/rnn.py#L742)), -**GRU** ([`GRULRCell`](../../edgeml/graph/rnn.py#L565)) and **LSTM** ([`LSTMLRCell`](../../pytorch_edgeml/graph/rnn.py#L369)). These cells also can be substituted for FastCells where ever feasible. - -`edgeml_pytorch.graph.rnn` also contains fully wrapped RNNs which are equivalent to `nn.LSTM` and `nn.GRU`. Implemented cells: -**FastRNN** ([`FastRNN`](../../pytorch_edgeml/graph/rnn.py#L968)), **FastGRNN** ([`FastGRNN`](../../pytorch_edgeml/graph/rnn.py#L993)), **UGRNN** ([`UGRNN`](../../edgeml_pytorch/graph/rnn.py#L945)), **GRU** ([`GRU`](../../edgeml/graph/rnn.py#L922)) and **LSTM** ([`LSTM`](../../pytorch_edgeml/graph/rnn.py#L899)). - -Note that all the cells and wrappers (when used independently from `fastcell_example.py` or `edgeml_pytorch.trainer.fastTrainer`) take in data in a batch first format ie., [batchSize, timeSteps, inputDims] by default but it can also support [timeSteps, batchSize, inputDims] format by setting `batch_first` argument to False when used. `fast_example.py` automatically takes care it while assuming the standard format between tf, c++ and pytorch. +There is also a sample cleanup and train/test script for the USPS10 public dataset. + +[`edgeml_pytorch.graph.rnn`](../../../pytorch/pytorch_edgeml/graph/rnn.py) +provides two RNN cells **FastRNNCell** and **FastGRNNCell** with additional +features like low-rank parameterisation and custom non-linearities. 
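As a rough usage sketch (constructor arguments mirrored from the `FastGRNN` calls in
`fastmodel.py` later in this patch; defaults, argument order and return shapes may differ),
a low-rank FastGRNN layer could be set up as follows:

```python
import torch
from edgeml_pytorch.graph.rnn import FastGRNN

# Unrolled FastGRNN over a full sequence, with rank-16 W and rank-24 U factors.
rnn = FastGRNN(32, 100,
               gate_nonlinearity="sigmoid", update_nonlinearity="tanh",
               wRank=16, uRank=24, batch_first=True)

x = torch.randn(8, 50, 32)   # [batchSize, timeSteps, inputDims]
out = rnn(x)                 # per-time-step hidden states
```

Setting `batch_first=False` instead makes the same layer consume
`[timeSteps, batchSize, inputDims]` input, as described below.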
Akin to
+Bonsai and ProtoNN, the three-phase training routine for FastRNN and FastGRNN
+is decoupled from the custom cells to facilitate a plug and play behaviour of
+the custom RNN cells in other architectures (NMT, Encoder-Decoder etc.).
+Additionally, numerically equivalent CUDA-based implementations FastRNNCuda
+and FastGRNNCuda are provided for faster training.
+`edgeml_pytorch.graph.rnn` also contains modified RNN cells of **UGRNNCell**,
+**GRUCell**, and **LSTMCell**, which can be substituted for Fast(G)RNN,
+as well as unrolled RNNs which are equivalent to `nn.LSTM` and `nn.GRU`.
+
+Note that all the cells and wrappers, when used independently from `fastcell_example.py`
+or `edgeml_pytorch.trainer.fastTrainer`, take in data in a batch first format, i.e.,
+[batchSize, timeSteps, inputDims] by default, but can also support [timeSteps,
+batchSize, inputDims] format if `batch_first` argument is set to False.
+`fast_example.py` automatically adjusts to the correct format across tf, c++ and pytorch.
 For training FastCells, `edgeml_pytorch.trainer.fastTrainer` implements the three-phase
-FastCell training routine in PyTorch. A simple example,
-`examples/fastcell_example.py` is provided to illustrate its usage.
-
-Note that `fastcell_example.py` assumes that data is in a specific format. It
-is assumed that train and test data is contained in two files, `train.npy` and
-`test.npy`. Each containing a 2D numpy array of dimension `[numberOfExamples,
+FastCell training routine in PyTorch. A simple example `fastcell_example.py` is provided
+to illustrate its usage. Note that `fastcell_example.py` assumes that data is in a specific format.
+It is assumed that train and test data is contained in two files, `train.npy` and
+`test.npy`, each containing a 2D numpy array of dimension `[numberOfExamples,
 numberOfFeatures]`. numberOfFeatures is `timesteps x inputDims`, flattened
-across timestep dimension. So the input of 1st timestep followed by second and
-so on. For an N-Class problem, we assume the labels are integers from 0
+across timestep dimension with the input of the first time step followed by the second
+and so on. For an N-Class problem, we assume the labels are integers from 0
 through N-1. Lastly, the training data, `train.npy`, is assumed to be well shuffled
 as the training routine doesn't shuffle internally.
@@ -36,9 +38,8 @@ as the training routine doesn't shuffle internally.
 ## Download and clean up sample dataset
-We will be testing out the validation of the code by using the USPS dataset.
-The download and cleanup of the dataset to match the above-mentioned format is
-done by the script [fetch_usps.py](fetch_usps.py) and
+To validate the code with the USPS dataset, first download and process the dataset to match
+the required format using the scripts [fetch_usps.py](fetch_usps.py) and
 [process_usps.py](process_usps.py)
 ```
 python fetch_usps.py
 python process_usps.py
 ```
+Note: Even though usps10 is not a time-series dataset, it can be regarded as a time-series
+dataset where each time step sees a new row. So the number of timesteps = 16 and inputDims = 16.
 ## Sample command for FastCells on USPS10
-The following sample run on usps10 should validate your library:
-
-Note: Even though usps10 is not a time-series dataset, it can be assumed as, a time-series where each row is coming in at one single time.
-So the number of timesteps = 16 and inputDims = 16
+The following is a sample run on usps10:
 ```bash
 python fastcell_example.py -dir usps10/ -id 16 -hd 32
 ```
-This command should give you a final output screen which reads roughly similar to (might not be exact numbers due to various version mismatches):
+This command should give you a final output that reads roughly similar to
+(might not be exact numbers due to various version mismatches):
 ```
 Maximum Test accuracy at compressed model size(including early stopping): 0.9407075 at Epoch: 262
 Final Test Accuracy: 0.93721974
 Non-Zeros: 1932 Model Size: 7.546875 KB hasSparse: False
 ```
-`usps10/` directory will now have a consolidated results file called `FastRNNResults.txt` or `FastGRNNResults.txt` depending on the choice of the RNN cell.
-A directory `FastRNNResults` or `FastGRNNResults` with the corresponding models with each run of the code on the `usps10` dataset.
+`usps10/` directory will now have a consolidated results file called `FastRNNResults.txt` or
+`FastGRNNResults.txt` depending on the choice of the RNN cell. A directory `FastRNNResults` or
+`FastGRNNResults` is also created with the corresponding models for each run of the code on the `usps10` dataset.
-Note that the scalars like `alpha`, `beta`, `zeta` and `nu` are all before the application of the sigmoid function over them.
+Note that the scalars like `alpha`, `beta`, `zeta` and `nu` correspond to the values before
+the application of the sigmoid function.
 ## Byte Quantization(Q) for model compression
-If you wish to quantize the generated model to use byte quantized integers use `quantizeFastModels.py`. Usage Instructions:
+If you wish to quantize the generated model, use `quantizeFastModels.py`. Usage Instructions:
 ```
 python quantizeFastModels.py -h
 ```
-This will generate quantized models with a suffix of `q` before every param stored in a new directory `QuantizedFastModel` inside the model directory.
-One can use this model further on edge devices.
+This will generate quantized models with a suffix of `q` before every param stored in a
+new directory `QuantizedFastModel` inside the model directory.
-Note that the scalars like `qalpha`, `qbeta`, `qzeta` and `qnu` are all after the application of the sigmoid function over them and quantization, they can be directly plugged into the inference pipleines.
+Note that the scalars like `qalpha`, `qbeta`, `qzeta` and `qnu` correspond to values
+after the application of the sigmoid function and quantization;
+they can be directly plugged into the inference pipelines.
 Copyright (c) Microsoft Corporation. All rights reserved.
- Licensed under the MIT license.
diff --git a/pytorch/README.md b/pytorch/README.md
index 13f253f69..3cfac80b1 100644
--- a/pytorch/README.md
+++ b/pytorch/README.md
@@ -1,24 +1,39 @@
 ## Edge Machine Learning: Pytorch Library
-This directory includes PyTorch implementations of various techniques and
-algorithms developed as part of EdgeML. Currently, the following algorithms are
-available in Tensorflow:
-
-1. [Bonsai](/docs/publications/Bonsai.pdf)
-2. S-RNN
-3. [FastRNN & FastGRNN](/docs/publications/FastGRNN.pdf)
-4. [ProtoNN](/docs/publications/ProtoNN.pdf)
-
-The PyTorch graphs for these algoriths are packaged as `edgeml_pytorch.graph`.
-Trainers for these algorithms are in `edgeml_pytorch.trainer`.
-Usage directions and examples for these algorithms are provided in
-`$EDGEML_ROOT/examples/pytorch` directory.
To get started with any
-of the provided algorithms, please follow the notebooks in the the
-`examples/pytorch` directory.
+This package includes PyTorch implementations of the following algorithms and training
+techniques developed as part of EdgeML. The PyTorch graphs for the forward/backward
+pass of these algorithms are packaged as `edgeml_pytorch.graph` and the trainers
+for these algorithms are in `edgeml_pytorch.trainer`.

-## Installation

+1. [Bonsai](/docs/publications/Bonsai.pdf): `edgeml_pytorch.graph.bonsai` implements
+   the Bonsai prediction graph. The three-phase training routine for Bonsai is decoupled
+   from the forward graph to facilitate a plug and play behaviour wherein Bonsai can be
+   combined with or used as a final layer classifier for other architectures (RNNs, CNNs).
+   See `edgeml_pytorch.trainer.bonsaiTrainer` for 3-phase training.
+2. [ProtoNN](/docs/publications/ProtoNN.pdf): `edgeml_pytorch.graph.protoNN` implements the
+   ProtoNN prediction functions. The training routine for ProtoNN is decoupled from the forward
+   graph to facilitate a plug and play behaviour wherein ProtoNN can be combined with or used
+   as a final layer classifier for other architectures (RNNs, CNNs). The training routine is
+   implemented in `edgeml_pytorch.trainer.protoNNTrainer`.
+3. [FastRNN & FastGRNN](/docs/publications/FastGRNN.pdf): `edgeml_pytorch.graph.rnn` provides
+   various RNN cells --- including new cells `FastRNNCell` and `FastGRNNCell` as well as
+   `UGRNNCell`, `GRUCell`, and `LSTMCell` --- with features like low-rank parameterisation
+   of weight matrices and custom non-linearities. Akin to Bonsai and ProtoNN, the three-phase
+   training routine for FastRNN and FastGRNN is decoupled from the custom cells to enable plug and
+   play behaviour of the custom RNN cells in other architectures (NMT, Encoder-Decoder etc.).
+   Additionally, numerically equivalent CUDA-based implementations `FastRNNCUDACell` and
+   `FastGRNNCUDACell` are provided for faster training.
+   `edgeml_pytorch.graph.rnn.Fast(G)RNN(CUDA)` provides unrolled RNNs equivalent to `nn.LSTM` and `nn.GRU`.
+   `edgeml_pytorch.trainer.fastmodel` presents a sample multi-layer RNN
+   multi-class classifier model.
+4. [S-RNN](/docs/publications/SRNN.pdf): `edgeml_pytorch.graph.rnn.SRNN2` implements a
+   2-layer SRNN network which can be instantiated with a choice of RNN cell. The training
+   routine for SRNN is in `edgeml_pytorch.trainer.srnnTrainer`.
+
+Usage directions and example notebooks for this package are provided [here](/examples/pytorch).
+## Installation
+
 It is highly recommended that EdgeML be installed in a virtual environment.
Please create a new virtual environment using your environment manager ([virtualenv](https://virtualenv.pypa.io/en/stable/userguide/#usage) or
diff --git a/pytorch/edgeml_pytorch/trainer/fastmodel.py b/pytorch/edgeml_pytorch/trainer/fastmodel.py
index a55c0bd75..f8baa52af 100644
--- a/pytorch/edgeml_pytorch/trainer/fastmodel.py
+++ b/pytorch/edgeml_pytorch/trainer/fastmodel.py
@@ -38,6 +38,7 @@ def __init__(self, rnn_name, input_dim, num_layers, hidden_units_list,
         self.linear = linear
         self.batch_first = batch_first
         self.apply_softmax = apply_softmax
+        self.rnn_name = rnn_name
         if self.linear:
             if not self.num_classes:
@@ -57,6 +58,18 @@ def __init__(self, rnn_name, input_dim, num_layers, hidden_units_list,
                 batch_first = self.batch_first)
             for l in range(self.num_layers)])
+        if rnn_name == "FastGRNNCUDA":
+            RNN_ = getattr(getattr(getattr(__import__('edgeml_pytorch'), 'graph'), 'rnn'), 'FastGRNN')
+            self.rnn_list_ = nn.ModuleList([
+                RNN_(self.input_dim if l==0 else self.hidden_units_list[l-1],
+                     self.hidden_units_list[l],
+                     gate_nonlinearity=self.gate_nonlinearity,
+                     update_nonlinearity=self.update_nonlinearity,
+                     wRank=self.wRank_list[l], uRank=self.uRank_list[l],
+                     wSparsity=self.wSparsity_list[l],
+                     uSparsity=self.uSparsity_list[l],
+                     batch_first = self.batch_first)
+                for l in range(self.num_layers)])
         # The linear layer is a fully connected layer that maps from hidden state space
         # to number of expected keywords
         if self.linear:
@@ -66,16 +79,30 @@ def __init__(self, rnn_name, input_dim, num_layers, hidden_units_list,
     def sparsify(self):
         for rnn in self.rnn_list:
-            rnn.cell.sparsify()
+            if self.rnn_name == "FastGRNNCUDA":
+                rnn.to(torch.device("cpu"))
+                rnn.sparsify()
+                rnn.to(torch.device("cuda"))
+            else:
+                rnn.cell.sparsify()
     def sparsifyWithSupport(self):
         for rnn in self.rnn_list:
-            rnn.cell.sparsifyWithSupport()
+            if self.rnn_name == "FastGRNNCUDA":
+                rnn.to(torch.device("cpu"))
+                rnn.sparsifyWithSupport()
+                rnn.to(torch.device("cuda"))
+            else:
+                rnn.cell.sparsifyWithSupport()
     def get_model_size(self):
         total_size = 4 * self.hidden_units_list[self.num_layers-1] * self.num_classes
+        print(self.rnn_name)
         for rnn in self.rnn_list:
-            total_size += rnn.cell.get_model_size()
+            if self.rnn_name == "FastGRNNCUDA":
+                total_size += rnn.get_model_size()
+            else:
+                total_size += rnn.cell.get_model_size()
         return total_size
     def normalize(self, mean, std):
@@ -130,15 +157,32 @@ def forward(self, input):
             input = (input - self.mean) / self.std
         rnn_in = input
-        for l in range(self.num_layers):
-            rnn = self.rnn_list[l]
-            model_output = rnn(rnn_in, hiddenState=self.hidden_states[l])
-            self.hidden_states[l] = model_output.detach()[-1, :, :]
+        if self.rnn_name == "FastGRNNCUDA":
             if self.tracking:
-                weights = rnn.getVars()
-                model_output = onnx_exportable_rnn(rnn_in, weights,
-                                                   rnn.cell, output=model_output)
-            rnn_in = model_output
+                for l in range(self.num_layers):
+                    print("Layer: ", l)
+                    rnn_ = self.rnn_list_[l]
+                    model_output = rnn_(rnn_in, hiddenState=self.hidden_states[l])
+                    self.hidden_states[l] = model_output.detach()[-1, :, :]
+                    weights = self.rnn_list[l].getVars()
+                    weights = [weight.clone() for weight in weights]
+                    model_output = onnx_exportable_rnn(rnn_in, weights, rnn_.cell, output=model_output)
+                    rnn_in = model_output
+            else:
+                for l in range(self.num_layers):
+                    rnn = self.rnn_list[l]
+                    model_output = rnn(rnn_in, hiddenState=self.hidden_states[l])
+                    self.hidden_states[l] = model_output.detach()[-1, :, :]
+                    rnn_in = model_output
+        else:
+            for l in range(self.num_layers):
+                rnn = self.rnn_list[l]
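+                # Default (non-CUDA) path: run each layer over the full sequence, cache its
+                # final hidden state (detached) for the next forward call and, when tracking
+                # is enabled, re-emit the output through onnx_exportable_rnn for ONNX export.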
+                model_output = rnn(rnn_in, hiddenState=self.hidden_states[l])
+                self.hidden_states[l] = model_output.detach()[-1, :, :]
+                if self.tracking:
+                    weights = rnn.getVars()
+                    model_output = onnx_exportable_rnn(rnn_in, weights, rnn.cell, output=model_output)
+                rnn_in = model_output
         if self.linear:
             model_output = self.hidden2keyword(model_output[-1, :, :])
diff --git a/tf/README.md b/tf/README.md
index 83494456f..68f8a39b3 100644
--- a/tf/README.md
+++ b/tf/README.md
@@ -9,12 +9,10 @@ available in Tensorflow:
 3. [FastRNN & FastGRNN](/docs/publications/FastGRNN.pdf)
 4. [ProtoNN](/docs/publications/ProtoNN.pdf)
-The TensorFlow compute graphs for these algoriths are packaged as
-`edgeml_tf.graph`. Trainers for these algorithms are in `edgeml_tf.trainer`.
-Usage directions and examples for these algorithms are provided in
- `$EDGEML_ROOT/examples/tf` directory.
-To get started with any of the provided algorithms, please follow
-the notebooks in the `examples/tf` directory.
+The TensorFlow compute graphs for these algorithms are packaged as `edgeml_tf.graph`
+and trainers are in `edgeml_tf.trainer`. Usage directions and example notebooks for
+these algorithms are provided in the [examples/tf directory](/examples/tf).
+
 ## Installation
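To make the `fastmodel.py` changes above easier to follow, here is a stripped-down,
illustrative sketch of the stacked-RNN plus linear-classifier pattern that file implements.
It omits the FastGRNNCUDA path, input normalization, ONNX tracking and sparsification,
the class name is hypothetical, and the real model exposes many more options:

```python
import torch
import torch.nn as nn
from edgeml_pytorch.graph.rnn import FastGRNN

class StackedFastGRNNClassifier(nn.Module):
    """Hypothetical minimal version of the model in edgeml_pytorch.trainer.fastmodel."""
    def __init__(self, input_dim, hidden_units_list, num_classes):
        super().__init__()
        dims = [input_dim] + hidden_units_list
        # One unrolled FastGRNN per layer; layer l consumes layer l-1's outputs.
        self.rnn_list = nn.ModuleList(
            [FastGRNN(dims[l], dims[l + 1], batch_first=False)
             for l in range(len(hidden_units_list))])
        # Maps the last hidden state to class scores ("keywords" in fastmodel.py).
        self.hidden2keyword = nn.Linear(hidden_units_list[-1], num_classes)

    def forward(self, x):
        # x: [timeSteps, batchSize, inputDims] (time-major, matching the
        # [-1, :, :] last-time-step indexing used in fastmodel.py).
        for rnn in self.rnn_list:
            x = rnn(x)
        return self.hidden2keyword(x[-1, :, :])
```

Each layer's per-time-step outputs feed the next layer, and only the final time step of the
last layer reaches the classifier, mirroring the `hidden2keyword(model_output[-1, :, :])`
call in the patch.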