
Dev #17 (Merged)

Merged 12 commits on Oct 26, 2023
1 change: 1 addition & 0 deletions .Rbuildignore
@@ -16,3 +16,4 @@ _test
^inst/.auctex-auto$
README_cache
^.covrignore$
REVIEW.md
15 changes: 8 additions & 7 deletions .github/CONTRIBUTING.md
@@ -1,8 +1,7 @@
# Contributing to pangoling

This outlines how to propose a change to pangoling.
For more detailed info about contributing to this, and other tidyverse packages, please see the
[**development contributing guide**](https://rstd.io/tidy-contrib).
This outlines how to propose a change to pangoling.
For a detailed discussion on contributing to this and other tidyverse packages, please see the [development contributing guide](https://rstd.io/tidy-contrib) and our [code review principles](https://code-review.tidyverse.org/).

## Fixing typos

@@ -15,6 +14,7 @@ You can find the `.R` file that generates the `.Rd` by reading the comment in th
If you want to make a bigger change, it's a good idea to first file an issue and make sure someone from the team agrees that it’s needed.
If you’ve found a bug, please file an issue that illustrates the bug with a minimal
[reprex](https://www.tidyverse.org/help/#reprex) (this will also help you write a unit test, if needed).
See our guide on [how to create a great issue](https://code-review.tidyverse.org/issues/) for more advice.

### Pull request process

@@ -40,8 +40,9 @@ If you’ve found a bug, please file an issue that illustrates the bug with a mi
* We use [testthat](https://cran.r-project.org/package=testthat) for unit tests.
Contributions with test cases included are easier to accept.

## Code of Conduct
## Code of conduct

Please note that this package is released with a [Contributor
Code of Conduct](https://ropensci.org/code-of-conduct/).
By contributing to this project, you agree to abide by its terms.

Please note that the pangoling project is released with a
[Contributor Code of Conduct](CODE_OF_CONDUCT.md). By contributing to this
project you agree to abide by its terms.
16 changes: 16 additions & 0 deletions .github/ISSUE_TEMPLATE/issue_template.md
@@ -0,0 +1,16 @@
---
name: Bug report or feature request
about: Describe a bug you've seen or make a case for a new feature
---

Please briefly describe your problem and what output you expect. If you have a question, please don't use this form. Instead, ask on <https://github.com/bnicenboim/pangoling/discussions>.

Please include a minimal reproducible example (AKA a reprex). If you've never heard of a [reprex](http://reprex.tidyverse.org/) before, start by reading <https://www.tidyverse.org/help/#reprex>.

For more advice on how to write a great issue, see <https://code-review.tidyverse.org/issues/>.

Brief description of the problem

```r
# insert reprex here
```
1 change: 1 addition & 0 deletions .gitignore
@@ -11,3 +11,4 @@ README_cache
/inst/REFERENCES.bib.sav
/pangoling.Rproj
/vignettes/articles/intro-gpt2_cache/
REVIEW.md
7 changes: 5 additions & 2 deletions DESCRIPTION
@@ -1,14 +1,17 @@
Package: pangoling
Type: Package
Title: Access to Large Language Model Predictions
Version: 0.0.0.9006
Version: 0.0.0.9007
Authors@R: c(
person("Bruno", "Nicenboim",
email = "[email protected]",
role = c( "aut","cre"),
comment = c(ORCID = "0000-0002-5176-3943")),
person("Chris", "Emmerly", role = "ctb"),
person("Giovanni", "Cassani", role = "ctb"))
person("Giovanni", "Cassani", role = "ctb"),
person("Lisa", "Levinson", role = "rev"),
person("Utku", "Turk", role = "rev")
)
Description: Access to word predictability using large language (transformer) models.
URL: https://bruno.nicenboim.me/pangoling, https://github.com/bnicenboim/pangoling
BugReports: https://github.com/bnicenboim/pangoling/issues
1 change: 1 addition & 0 deletions NAMESPACE
@@ -14,6 +14,7 @@ export(masked_preload)
export(masked_tokens_tbl)
export(ntokens)
export(perplexity_calc)
export(set_cache_folder)
export(tokenize_lst)
export(transformer_vocab)
importFrom(memoise,memoise)
10 changes: 8 additions & 2 deletions NEWS.md
@@ -20,6 +20,12 @@
* Requires correct version of R.

# pangoling 0.0.0.9006
* `causal_lp` get a `l_contexts` argument
* checkpoints work for causal models (not yet for masked models)
* `causal_lp` gets an `l_contexts` argument.
* Checkpoints work for causal models (not yet for masked models).
* rOpenSci badge added.


# pangoling 0.0.0.9007
* `set_cache_folder()` function added.
* Message when the package loads.
* New troubleshooting vignette.
40 changes: 40 additions & 0 deletions R/tr_utils.R
@@ -365,3 +365,43 @@ num_to_token <- function(x, tkzr) {
tkzr$convert_ids_to_tokens(x)
})
}

#' Set Cache Folder for HuggingFace Transformers
#'
#' This function sets the cache directory for HuggingFace transformers. If a
#' path is given, the function checks that the directory exists and then sets
#' the `TRANSFORMERS_CACHE` environment variable to this path. If no path is
#' provided, the function checks a number of environment variables for an
#' existing cache directory. If none of them is set, it informs the user about
#' the default cache directory.
#'
#' @param path Character string, the path to set as the cache directory. If
#'   `NULL`, the function will look for the cache directory in a number of
#'   environment variables. Default is `NULL`.
#'
#' @return Nothing is returned; this function is called for its side effect of
#'   setting the `TRANSFORMERS_CACHE` environment variable, or of providing
#'   information to the user.
#' @export
#'
#' @examples
#' \dontrun{
#' set_cache_folder("~/new_cache_dir")
#' }
#' @seealso \url{https://huggingface.co/docs/transformers/installation?highlight=transformers_cache#cache-setup}
#' @references HuggingFace Transformers: \url{https://huggingface.co/transformers/index.html}
#' @family general functions
set_cache_folder <- function(path = NULL) {
  if (!is.null(path)) {
    if (!dir.exists(path)) stop2("Folder '", path, "' doesn't exist.")
    # Set the variables in the embedded Python session so that
    # `transformers` picks them up.
    reticulate::py_run_string(
      paste0("import os\n",
             "os.environ['TRANSFORMERS_CACHE'] = '", path, "'\n",
             "os.environ['HF_HOME'] = '", path, "'"))
  }
  path <- c(Sys.getenv("TRANSFORMERS_CACHE"),
            Sys.getenv("HUGGINGFACE_HUB_CACHE"),
            Sys.getenv("HF_HOME"),
            Sys.getenv("XDG_CACHE_HOME"))
  # First non-empty value, or "" if none of the variables is set:
  path <- c(path[path != ""], "")[1]
  if (path != "") {
    message_verbose("Pretrained models and tokenizers are downloaded and locally cached at '", path, "'.")
  } else {
    message_verbose("By default, pretrained models are downloaded and locally cached at ~/.cache/huggingface/hub; on Windows, at C:\\Users\\username\\.cache\\huggingface\\hub. This location is controlled by the shell environment variable TRANSFORMERS_CACHE.\n\nFor changing the shell environment variables that affect the cache folder, see https://huggingface.co/docs/transformers/installation?highlight=transformers_cache#cache-setup")
  }
}
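The fallback order that `set_cache_folder()` walks through mirrors the environment variables honored by the Python `transformers` library. As a minimal sketch of the same resolution logic (the helper name and the hard-coded default below are illustrative, not part of the package):

```python
import os

# Variables checked in the same order as set_cache_folder(),
# from most specific to most general.
_CACHE_VARS = ("TRANSFORMERS_CACHE", "HUGGINGFACE_HUB_CACHE",
               "HF_HOME", "XDG_CACHE_HOME")


def resolve_hf_cache(env=None):
    """Return the first cache-related variable that is set, or the
    transformers default location when none of them is."""
    env = os.environ if env is None else env
    for var in _CACHE_VARS:
        value = env.get(var, "")
        if value:
            return value
    # None set: fall back to the documented default.
    return os.path.expanduser("~/.cache/huggingface/hub")


print(resolve_hf_cache({"HF_HOME": "/data/hf"}))  # /data/hf
```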
8 changes: 8 additions & 0 deletions R/zzz.R
@@ -6,6 +6,7 @@ torch <- NULL

#' @noRd
.onLoad <- function(libname, pkgname) {

# This will instruct reticulate to immediately try to configure the
# active Python environment, installing any required Python packages
# as necessary.
@@ -38,3 +39,10 @@ torch <- NULL

invisible()
}

.onAttach <- function(libname, pkgname) {
  packageStartupMessage(
    pkgname, " version ", utils::packageVersion(pkgname),
    "\nAn introduction to the package can be found at <https://bruno.nicenboim.me/pangoling/articles/>.",
    "\nNotice that pretrained models and tokenizers are downloaded from <https://huggingface.co/> the first time they are used.",
    "\nTo change the cache folder, use:\n  set_cache_folder(my_new_path)")
}
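The startup message points users at `set_cache_folder()`; the same effect can be obtained before R even starts by exporting the variables in the shell. A sketch assuming a POSIX shell (the directory name is just an example):

```shell
# Use a custom Hugging Face cache for everything launched from this shell;
# transformers reads these variables when it is imported.
export TRANSFORMERS_CACHE="$HOME/hf_cache"
export HF_HOME="$HOME/hf_cache"
echo "$TRANSFORMERS_CACHE"
```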


5 changes: 5 additions & 0 deletions README.Rmd
@@ -76,6 +76,11 @@ df_sent
> DOI: [10.5281/zenodo.7637526](https://zenodo.org/badge/latestdoi/497831295),
> <https://github.com/bnicenboim/pangoling>.

## How to contribute

See the [Contributing guidelines](.github/CONTRIBUTING.md).


## Code of conduct

Please note that this package is released with a [Contributor
Expand Down
18 changes: 11 additions & 7 deletions README.md
@@ -16,7 +16,7 @@ public.](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostat
Review](https://badges.ropensci.org/575_status.svg)](https://github.com/ropensci/software-review/issues/575)
<!-- badges: end -->

`pangoling`[^1] is an R package for estimating the log-probabilities of
`pangoling`\[1\] is an R package for estimating the log-probabilities of
words in a given context using transformer models. The package provides
an interface for utilizing pre-trained transformer models (such as GPT-2
or BERT) to obtain word probabilities. These log-probabilities are often
Expand All @@ -28,7 +28,7 @@ The package is mostly a wrapper of the python package
[`transformers`](https://pypi.org/project/transformers/) to process data
in a convenient format.

## Important! Limitations and bias
## Important\! Limitations and bias

The training data of the most popular models (such as GPT-2) haven’t
been released, so one cannot inspect it. It’s clear that the data
@@ -94,11 +94,11 @@ as follows:
``` r
df_sent <- df_sent |>
mutate(lp = causal_lp(word, .by = sent_n))
#> Processing using causal model 'gpt2'...
#> Processing 1 batch(es) of 10 tokens.
#> Processing using causal model ''...
#> Processing a batch of size 1 with 10 tokens.
#> Processing a batch of size 1 with 9 tokens.
#> Text id: 1
#> `The apple doesn't fall far from the tree.`
#> Processing 1 batch(es) of 9 tokens.
#> Text id: 2
#> `Don't judge a book by its cover.`
df_sent
@@ -125,10 +125,14 @@
## How to cite

> Nicenboim B (2023). *pangoling: Access to language model predictions
> in R*. R package version 0.0.0.9005, DOI:
> in R*. R package version 0.0.0.9007, DOI:
> [10.5281/zenodo.7637526](https://zenodo.org/badge/latestdoi/497831295),
> <https://github.com/bnicenboim/pangoling>.

## How to contribute

See the [Contributing guidelines](.github/CONTRIBUTING.md).

## Code of conduct

Please note that this package is released with a [Contributor Code of
@@ -142,7 +146,7 @@ Another R package that acts as a wrapper for
[`text`](https://r-text.org/). However, `text` is more general, and its
focus is on Natural Language Processing and Machine Learning.

[^1]: The logo of the package was created with [stable
1. The logo of the package was created with [stable
diffusion](https://huggingface.co/spaces/stabilityai/stable-diffusion)
and the R package
[hexSticker](https://github.com/GuangchuangYu/hexSticker).
1 change: 1 addition & 0 deletions _pkgdown.yml
Original file line number Diff line number Diff line change
@@ -26,6 +26,7 @@ articles:
contents:
- '`articles/intro-gpt2`'
- '`articles/intro-bert`'
- '`articles/troubleshooting`'

reference:
- title: About
2 changes: 1 addition & 1 deletion man/causal_config.Rd

2 changes: 1 addition & 1 deletion man/causal_lp.Rd
2 changes: 1 addition & 1 deletion man/causal_lp_mats.Rd
2 changes: 1 addition & 1 deletion man/causal_next_tokens_tbl.Rd
2 changes: 1 addition & 1 deletion man/causal_preload.Rd
2 changes: 1 addition & 1 deletion man/causal_tokens_lp_tbl.Rd
2 changes: 2 additions & 0 deletions man/pangoling-package.Rd
4 changes: 4 additions & 0 deletions man/perplexity_calc.Rd
34 changes: 34 additions & 0 deletions man/set_cache_folder.Rd

6 changes: 4 additions & 2 deletions tests/testthat/test-tr_causal.R
@@ -32,18 +32,20 @@ test_that("empty or small strings", {
expect_equal(lp_small, c(NA_real_, "It" = NA_real_))
})


if(0){
#long inputs require too much memory
test_that("long input work", {
skip_if_no_python_stuff()
long0 <- paste(rep("x", 1022), collapse = " ")
long <- paste(rep("x", 1024), collapse = " ")
longer <- paste(rep("x", 1025), collapse = " ")
lp_long0 <- causal_tokens_lp_tbl(c(long0, long, longer), add_special_tokens = TRUE, batch_size = 3, model = "sshleifer/tiny-gpt2")
lp_long0 <- causal_tokens_lp_tbl(texts = c(long0, long, longer), add_special_tokens = TRUE, batch_size = 3, model = "sshleifer/tiny-gpt2")
skip_on_os("windows") #the following just doesn't work on windows,
# but it's not that important
lp_long1 <- causal_tokens_lp_tbl(c(long0, long, longer), add_special_tokens = TRUE, batch_size = 1, model = "sshleifer/tiny-gpt2")
expect_equal(lp_long0, lp_long1)
})
}

test_that("errors work", {
skip_if_no_python_stuff()
2 changes: 2 additions & 0 deletions tests/testthat/test-tr_utils.R
@@ -19,3 +19,5 @@ test_that("messages work", {
options(pangoling.verbose = FALSE)
expect_no_message(causal_preload())
})

message("TEST cache")
Binary file added vignettes/articles/conda.png
2 changes: 2 additions & 0 deletions vignettes/articles/intro-bert.Rmd
@@ -30,6 +30,8 @@ Notice the following potential pitfall. This would be a **bad** approach for mak
```{r}
masked_tokens_tbl("The apple doesn't fall far from the [MASK]")
```
(The pretrained models and tokenizers will be downloaded from https://huggingface.co/ the first time they are used.)


The most common predictions are punctuation marks, because BERT uses the left *and* right context. In this case, the right context indicates that the mask is the final *token* of the sentence.
More expected results are obtained in the following way:
Expand Down
2 changes: 2 additions & 0 deletions vignettes/articles/intro-gpt2.Rmd
@@ -38,6 +38,8 @@ tic()
toc()
```

(The pretrained models and tokenizers will be downloaded from https://huggingface.co/ the first time they are used.)

The most likely continuation is "tree", which makes sense.
The first time a model is run, it will download some files that will be available for subsequent runs. However, every time we start a new R session and we run a model, it will take some time to store it in memory. Next runs in the same session are much faster. We can also preload a model with `causal_preload()`.

Expand Down
Binary file added vignettes/articles/python.png