diff --git a/.Rbuildignore b/.Rbuildignore index 234d50d..25f8038 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -16,3 +16,4 @@ _test ^inst/.auctex-auto$ README_cache ^.covrignore$ +REVIEW.md diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md index 084cd18..16f342b 100644 --- a/.github/CONTRIBUTING.md +++ b/.github/CONTRIBUTING.md @@ -1,8 +1,7 @@ # Contributing to pangoling -This outlines how to propose a change to pangoling. -For more detailed info about contributing to this, and other tidyverse packages, please see the -[**development contributing guide**](https://rstd.io/tidy-contrib). +This outlines how to propose a change to pangoling. +For a detailed discussion on contributing to this and other tidyverse packages, please see the [development contributing guide](https://rstd.io/tidy-contrib) and our [code review principles](https://code-review.tidyverse.org/). ## Fixing typos @@ -15,6 +14,7 @@ You can find the `.R` file that generates the `.Rd` by reading the comment in th If you want to make a bigger change, it's a good idea to first file an issue and make sure someone from the team agrees that it’s needed. If you’ve found a bug, please file an issue that illustrates the bug with a minimal [reprex](https://www.tidyverse.org/help/#reprex) (this will also help you write a unit test, if needed). +See our guide on [how to create a great issue](https://code-review.tidyverse.org/issues/) for more advice. ### Pull request process @@ -40,8 +40,9 @@ If you’ve found a bug, please file an issue that illustrates the bug with a mi * We use [testthat](https://cran.r-project.org/package=testthat) for unit tests. Contributions with test cases included are easier to accept. -## Code of Conduct +## Code of conduct + +Please note that this package is released with a [Contributor +Code of Conduct](https://ropensci.org/code-of-conduct/). +By contributing to this project, you agree to abide by its terms. 
-Please note that the pangoling project is released with a -[Contributor Code of Conduct](CODE_OF_CONDUCT.md). By contributing to this -project you agree to abide by its terms. diff --git a/.github/ISSUE_TEMPLATE/issue_template.md b/.github/ISSUE_TEMPLATE/issue_template.md new file mode 100644 index 0000000..02e6b96 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/issue_template.md @@ -0,0 +1,16 @@ +--- +name: Bug report or feature request +about: Describe a bug you've seen or make a case for a new feature +--- + +Please briefly describe your problem and what output you expect. If you have a question, please don't use this form. Instead, ask on . + +Please include a minimal reproducible example (AKA a reprex). If you've never heard of a [reprex](http://reprex.tidyverse.org/) before, start by reading . + +For more advice on how to write a great issue, see . + +Brief description of the problem + +```r +# insert reprex here +``` diff --git a/.gitignore b/.gitignore index b2eabcb..cef324d 100644 --- a/.gitignore +++ b/.gitignore @@ -11,3 +11,4 @@ README_cache /inst/REFERENCES.bib.sav /pangoling.Rproj /vignettes/articles/intro-gpt2_cache/ +REVIEW.md diff --git a/DESCRIPTION b/DESCRIPTION index 0e988a1..1665749 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,14 +1,17 @@ Package: pangoling Type: Package Title: Access to Large Language Model Predictions -Version: 0.0.0.9006 +Version: 0.0.0.9007 Authors@R: c( person("Bruno", "Nicenboim", email = "bruno.nicenboim@gmail.com", role = c( "aut","cre"), comment = c(ORCID = "0000-0002-5176-3943")), person("Chris", "Emmerly", role = "ctb"), - person("Giovanni", "Cassani", role = "ctb")) + person("Giovanni", "Cassani", role = "ctb"), + person("Lisa", "Levinson", role = "rev"), + person("Utku", "Turk", role = "rev") + ) Description: Access to word predictability using large language (transformer) models. 
URL: https://bruno.nicenboim.me/pangoling, https://github.com/bnicenboim/pangoling BugReports: https://github.com/bnicenboim/pangoling/issues diff --git a/NAMESPACE b/NAMESPACE index 880ae8f..5754acf 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -14,6 +14,7 @@ export(masked_preload) export(masked_tokens_tbl) export(ntokens) export(perplexity_calc) +export(set_cache_folder) export(tokenize_lst) export(transformer_vocab) importFrom(memoise,memoise) diff --git a/NEWS.md b/NEWS.md index 09a4e17..26d69f5 100644 --- a/NEWS.md +++ b/NEWS.md @@ -20,6 +20,12 @@ * Requires correct version of R. # pangoling 0.0.0.9006 -* `causal_lp` get a `l_contexts` argument -* checkpoints work for causal models (not yet for masked models) +* `causal_lp` gets an `l_contexts` argument. +* Checkpoints work for causal models (not yet for masked models). * Ropensci badge added. + + +# pangoling 0.0.0.9007 +* `set_cache_folder()` function added. +* Message when the package loads. +* New troubleshooting vignette. diff --git a/R/tr_utils.R b/R/tr_utils.R index 6f15455..25e6b2b 100644 --- a/R/tr_utils.R +++ b/R/tr_utils.R @@ -365,3 +365,43 @@ num_to_token <- function(x, tkzr) { tkzr$convert_ids_to_tokens(x) }) } + +#' Set Cache Folder for HuggingFace Transformers +#' +#' This function sets the cache directory for HuggingFace transformers. If a path is given, the function checks if the directory exists and then sets the `TRANSFORMERS_CACHE` environment variable to this path. +#' If no path is provided, the function checks for the existing cache directory in a number of environment variables. +#' If none of these environment variables are set, it provides the user with information on the default cache directory. +#' +#' @param path Character string, the path to set as the cache directory. If NULL, the function will look for the cache directory in a number of environment variables. Default is NULL.
+#' +#' @return Nothing is returned; this function is called for its side effect of setting the `TRANSFORMERS_CACHE` environment variable, or for providing information to the user. +#' @export +#' +#' @examples +#' \dontrun{ +#' set_cache_folder("~/new_cache_dir") +#' } +#' @seealso \url{https://huggingface.co/docs/transformers/installation?highlight=transformers_cache#cache-setup} +#' @references HuggingFace Transformers: \url{https://huggingface.co/transformers/index.html} +#' @family general functions +set_cache_folder <- function(path = NULL){ + if(!is.null(path)){ + if(!dir.exists(path)) stop2("Folder '", path, "' doesn't exist.") + reticulate::py_run_string(paste0("import os\nos.environ['TRANSFORMERS_CACHE']='",path,"'")) + reticulate::py_run_string(paste0("import os\nos.environ['HF_HOME']='",path,"'")) +  } + path <- c(Sys.getenv("TRANSFORMERS_CACHE"), + Sys.getenv("HUGGINGFACE_HUB_CACHE"), + Sys.getenv("HF_HOME"), + Sys.getenv("XDG_CACHE_HOME")) + + path <- c(path[path != ""], "")[1] # first non-empty value, or "" if none is set + if(path != ""){ + message_verbose("Pretrained models and tokenizers are downloaded and locally cached at '", path,"'.") + } else { + message_verbose("By default pretrained models are downloaded and locally cached at: ~/.cache/huggingface/hub. This default can be overridden with the shell environment variable TRANSFORMERS_CACHE. On Windows, the default directory is C:\\Users\\username\\.cache\\huggingface\\hub. + +For changing the shell environment variables that affect the cache folder see https://huggingface.co/docs/transformers/installation?highlight=transformers_cache#cache-setup") + } + +} diff --git a/R/zzz.R b/R/zzz.R index 851ef7c..d44e208 100644 --- a/R/zzz.R +++ b/R/zzz.R @@ -6,6 +6,7 @@ torch <- NULL #' @noRd .onLoad <- function(libname, pkgname) { + # This will instruct reticulate to immediately try to configure the # active Python environment, installing any required Python packages # as necessary.
@@ -38,3 +39,10 @@ torch <- NULL invisible() } + +.onAttach <- function(libname, pkgname) { + packageStartupMessage(pkgname, " version ", packageVersion(pkgname),"\nAn introduction to the package can be found in \n Notice that pretrained models and tokenizers are downloaded from https://huggingface.co/ the first time they are used. For changing the cache folder use:\n +set_cache_folder(my_new_path)") +} + + diff --git a/README.Rmd b/README.Rmd index b43e4f1..394016a 100644 --- a/README.Rmd +++ b/README.Rmd @@ -76,6 +76,11 @@ df_sent > DOI: [10.5281/zenodo.7637526](https://zenodo.org/badge/latestdoi/497831295), > . +## How to contribute + +See the [Contributing guidelines](.github/CONTRIBUTING.md). + + ## Code of conduct Please note that this package is released with a [Contributor diff --git a/README.md b/README.md index fe7a396..f7c80a4 100644 --- a/README.md +++ b/README.md @@ -16,7 +16,7 @@ public.](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostat Review](https://badges.ropensci.org/575_status.svg)](https://github.com/ropensci/software-review/issues/575) -`pangoling`[^1] is an R package for estimating the log-probabilities of +`pangoling`\[1\] is an R package for estimating the log-probabilities of words in a given context using transformer models. The package provides an interface for utilizing pre-trained transformer models (such as GPT-2 or BERT) to obtain word probabilities. These log-probabilities are often @@ -28,7 +28,7 @@ The package is mostly a wrapper of the python package [`transformers`](https://pypi.org/project/transformers/) to process data in a convenient format. -## Important! Limitations and bias +## Important\! Limitations and bias The training data of the most popular models (such as GPT-2) haven’t been released, so one cannot inspect it. It’s clear that the data @@ -94,11 +94,11 @@ as follows: ``` r df_sent <- df_sent |> mutate(lp = causal_lp(word, .by = sent_n)) -#> Processing using causal model 'gpt2'... 
-#> Processing 1 batch(es) of 10 tokens. +#> Processing using causal model ''... +#> Processing a batch of size 1 with 10 tokens. +#> Processing a batch of size 1 with 9 tokens. #> Text id: 1 #> `The apple doesn't fall far from the tree.` -#> Processing 1 batch(es) of 9 tokens. #> Text id: 2 #> `Don't judge a book by its cover.` df_sent @@ -125,10 +125,14 @@ df_sent ## How to cite > Nicenboim B (2023). *pangoling: Access to language model predictions -> in R*. R package version 0.0.0.9005, DOI: +> in R*. R package version 0.0.0.9007, DOI: > [10.5281/zenodo.7637526](https://zenodo.org/badge/latestdoi/497831295), > . +## How to contribute + +See the [Contributing guidelines](.github/CONTRIBUTING.md). + ## Code of conduct Please note that this package is released with a [Contributor Code of @@ -142,7 +146,7 @@ Another R package that act as a wrapper for [`text`](https://r-text.org//) However, `text` is more general, and its focus is on Natural Language Processing and Machine Learning. -[^1]: The logo of the package was created with [stable +1. The logo of the package was created with [stable diffusion](https://huggingface.co/spaces/stabilityai/stable-diffusion) and the R package [hexSticker](https://github.com/GuangchuangYu/hexSticker). 
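The env-var fallback performed by the new `set_cache_folder()` in `R/tr_utils.R` above can be sketched in Python (the language of the underlying `transformers` package). This is a hedged illustration of the lookup order the R code checks, not the exact resolution logic of `transformers` itself; the helper name is hypothetical:

```python
import os

def guess_hf_cache_dir() -> str:
    """Mirror the env-var fallback order checked by set_cache_folder():
    the first non-empty variable wins; otherwise return the default
    cache location documented by Hugging Face."""
    for var in ("TRANSFORMERS_CACHE", "HUGGINGFACE_HUB_CACHE",
                "HF_HOME", "XDG_CACHE_HOME"):
        value = os.environ.get(var, "")
        if value:
            return value
    # Default documented location: ~/.cache/huggingface/hub
    return os.path.join(os.path.expanduser("~"), ".cache",
                        "huggingface", "hub")
```

With none of the variables set, the sketch falls through to `~/.cache/huggingface/hub`, matching the message printed by `set_cache_folder()` when it finds no cache-related variable.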
diff --git a/_pkgdown.yml b/_pkgdown.yml index 0ca6a41..f2a5c98 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -26,6 +26,7 @@ articles: contents: - '`articles/intro-gpt2`' - '`articles/intro-bert`' + - '`articles/troubleshooting`' reference: - title: About diff --git a/man/causal_config.Rd b/man/causal_config.Rd index dedef83..b651c38 100644 --- a/man/causal_config.Rd +++ b/man/causal_config.Rd @@ -13,7 +13,7 @@ causal_config( \arguments{ \item{model}{Name of a pre-trained model or folder.} -\item{checkpoint}{folder of a checkpoint.} +\item{checkpoint}{Folder of a checkpoint.} \item{config_model}{List with other arguments that control how the model from Hugging Face is accessed.} diff --git a/man/causal_lp.Rd b/man/causal_lp.Rd index 28200e6..d69eaa1 100644 --- a/man/causal_lp.Rd +++ b/man/causal_lp.Rd @@ -31,7 +31,7 @@ all punctuation that stands alone in a token.} \item{model}{Name of a pre-trained model or folder.} -\item{checkpoint}{folder of a checkpoint.} +\item{checkpoint}{Folder of a checkpoint.} \item{add_special_tokens}{Whether to include special tokens. It has the same default as the diff --git a/man/causal_lp_mats.Rd b/man/causal_lp_mats.Rd index 31edf82..3b5cb8b 100644 --- a/man/causal_lp_mats.Rd +++ b/man/causal_lp_mats.Rd @@ -22,7 +22,7 @@ causal_lp_mats( \item{model}{Name of a pre-trained model or folder.} -\item{checkpoint}{folder of a checkpoint.} +\item{checkpoint}{Folder of a checkpoint.} \item{add_special_tokens}{Whether to include special tokens. It has the same default as the diff --git a/man/causal_next_tokens_tbl.Rd b/man/causal_next_tokens_tbl.Rd index 783277f..386e682 100644 --- a/man/causal_next_tokens_tbl.Rd +++ b/man/causal_next_tokens_tbl.Rd @@ -18,7 +18,7 @@ causal_next_tokens_tbl( \item{model}{Name of a pre-trained model or folder.} -\item{checkpoint}{folder of a checkpoint.} +\item{checkpoint}{Folder of a checkpoint.} \item{add_special_tokens}{Whether to include special tokens. 
It has the same default as the diff --git a/man/causal_preload.Rd b/man/causal_preload.Rd index 6e7a9fe..0477d1d 100644 --- a/man/causal_preload.Rd +++ b/man/causal_preload.Rd @@ -15,7 +15,7 @@ causal_preload( \arguments{ \item{model}{Name of a pre-trained model or folder.} -\item{checkpoint}{folder of a checkpoint.} +\item{checkpoint}{Folder of a checkpoint.} \item{add_special_tokens}{Whether to include special tokens. It has the same default as the diff --git a/man/causal_tokens_lp_tbl.Rd b/man/causal_tokens_lp_tbl.Rd index 87929ca..53da4f6 100644 --- a/man/causal_tokens_lp_tbl.Rd +++ b/man/causal_tokens_lp_tbl.Rd @@ -20,7 +20,7 @@ causal_tokens_lp_tbl( \item{model}{Name of a pre-trained model or folder.} -\item{checkpoint}{folder of a checkpoint.} +\item{checkpoint}{Folder of a checkpoint.} \item{add_special_tokens}{Whether to include special tokens. It has the same default as the diff --git a/man/pangoling-package.Rd b/man/pangoling-package.Rd index e5b5eb6..5986772 100644 --- a/man/pangoling-package.Rd +++ b/man/pangoling-package.Rd @@ -26,6 +26,8 @@ Other contributors: \itemize{ \item Chris Emmerly [contributor] \item Giovanni Cassani [contributor] + \item Lisa Levinson [reviewer] + \item Utku Turk [reviewer] } } diff --git a/man/perplexity_calc.Rd b/man/perplexity_calc.Rd index 3d5df0c..b965fe9 100644 --- a/man/perplexity_calc.Rd +++ b/man/perplexity_calc.Rd @@ -34,4 +34,8 @@ perplexity_calc(probs, log.p = FALSE) lprobs <- log(probs) perplexity_calc(lprobs, log.p = TRUE) } +\seealso{ +Other general functions: +\code{\link{set_cache_folder}()} +} \concept{general functions} diff --git a/man/set_cache_folder.Rd b/man/set_cache_folder.Rd new file mode 100644 index 0000000..62754bf --- /dev/null +++ b/man/set_cache_folder.Rd @@ -0,0 +1,34 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/tr_utils.R +\name{set_cache_folder} +\alias{set_cache_folder} +\title{Set Cache Folder for HuggingFace Transformers} +\usage{ 
+set_cache_folder(path = NULL) +} +\arguments{ +\item{path}{Character string, the path to set as the cache directory. If NULL, the function will look for the cache directory in a number of environment variables. Default is NULL.} +} +\value{ +Nothing is returned; this function is called for its side effect of setting the \code{TRANSFORMERS_CACHE} environment variable, or for providing information to the user. +} +\description{ +This function sets the cache directory for HuggingFace transformers. If a path is given, the function checks if the directory exists and then sets the \code{TRANSFORMERS_CACHE} environment variable to this path. +If no path is provided, the function checks for the existing cache directory in a number of environment variables. +If none of these environment variables are set, it provides the user with information on the default cache directory. +} +\examples{ +\dontrun{ +set_cache_folder("~/new_cache_dir") +} +} +\references{ +HuggingFace Transformers: \url{https://huggingface.co/transformers/index.html} +} +\seealso{ +\url{https://huggingface.co/docs/transformers/installation?highlight=transformers_cache#cache-setup} + +Other general functions: +\code{\link{perplexity_calc}()} +} +\concept{general functions} diff --git a/tests/testthat/test-tr_causal.R b/tests/testthat/test-tr_causal.R index 976b58c..094842f 100644 --- a/tests/testthat/test-tr_causal.R +++ b/tests/testthat/test-tr_causal.R @@ -32,18 +32,20 @@ test_that("empty or small strings", { expect_equal(lp_small, c(NA_real_, "It" = NA_real_)) }) - +if (FALSE) { + # long inputs require too much memory test_that("long input work", { skip_if_no_python_stuff() long0 <- paste(rep("x", 1022), collapse = " ") long <- paste(rep("x", 1024), collapse = " ") longer <- paste(rep("x", 1025), collapse = " ") - lp_long0 <- causal_tokens_lp_tbl(c(long0, long, longer), add_special_tokens = TRUE, batch_size = 3, model = "sshleifer/tiny-gpt2") + lp_long0 <- causal_tokens_lp_tbl(texts = c(long0, long, longer),
add_special_tokens = TRUE, batch_size = 3, model = "sshleifer/tiny-gpt2") skip_on_os("windows") #the following just doesn't work on windows, # but it's not that important lp_long1 <- causal_tokens_lp_tbl(c(long0, long, longer), add_special_tokens = TRUE, batch_size = 1, model = "sshleifer/tiny-gpt2") expect_equal(lp_long0, lp_long1) }) +} test_that("errors work", { skip_if_no_python_stuff() diff --git a/tests/testthat/test-tr_utils.R b/tests/testthat/test-tr_utils.R index abef2d8..58bb2c9 100644 --- a/tests/testthat/test-tr_utils.R +++ b/tests/testthat/test-tr_utils.R @@ -19,3 +19,5 @@ test_that("messages work", { options(pangoling.verbose = FALSE) expect_no_message(causal_preload()) }) + +message("TEST cache") diff --git a/vignettes/articles/conda.png b/vignettes/articles/conda.png new file mode 100644 index 0000000..c886cab Binary files /dev/null and b/vignettes/articles/conda.png differ diff --git a/vignettes/articles/intro-bert.Rmd b/vignettes/articles/intro-bert.Rmd index d628711..8a0b40e 100644 --- a/vignettes/articles/intro-bert.Rmd +++ b/vignettes/articles/intro-bert.Rmd @@ -30,6 +30,8 @@ Notice the following potential pitfall. This would be a **bad** approach for mak ```{r} masked_tokens_tbl("The apple doesn't fall far from the [MASK]") ``` +(The pretrained models and tokenizers will be downloaded from https://huggingface.co/ the first time they are used.) + The most common predictions are punctuation marks, because BERT uses the left *and* right context. In this case, the right context indicates that the mask is the final *token* of the sentence. More expected results are obtained in the following way: diff --git a/vignettes/articles/intro-gpt2.Rmd b/vignettes/articles/intro-gpt2.Rmd index 4fc8488..f76d1af 100644 --- a/vignettes/articles/intro-gpt2.Rmd +++ b/vignettes/articles/intro-gpt2.Rmd @@ -38,6 +38,8 @@ tic() toc() ``` +(The pretrained models and tokenizers will be downloaded from https://huggingface.co/ the first time they are used.) 
+ The most likely continuation is "tree", which makes sense. The first time a model is run, it will download some files that will be available for subsequent runs. However, every time we start a new R session and we run a model, it will take some time to store it in memory. Next runs in the same session are much faster. We can also preload a model with `causal_preload()`. diff --git a/vignettes/articles/python.png b/vignettes/articles/python.png new file mode 100644 index 0000000..c9621d4 Binary files /dev/null and b/vignettes/articles/python.png differ diff --git a/vignettes/articles/troubleshooting.Rmd b/vignettes/articles/troubleshooting.Rmd new file mode 100644 index 0000000..864914c --- /dev/null +++ b/vignettes/articles/troubleshooting.Rmd @@ -0,0 +1,87 @@ +--- +title: "Troubleshooting the use of Python in R" +bibliography: '`r system.file("REFERENCES.bib", package="pangoling")`' +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +This vignette provides guidance on troubleshooting issues related to Python module dependencies when using R. + +## Module not found error in RStudio + +You might encounter an error message similar to this: + +``` +Error in py_run_string_impl(code, local, convert) : + ModuleNotFoundError: No module named 'torch' +Run `reticulate::py_last_error()` for details. +``` + +`pangoling` (like many other R packages) relies on `reticulate` for Python functionality. Even if the package installation seems successful, the issue may be that `reticulate` is not loading the correct Python environment. By default, R should use a *conda* environment named `r-reticulate` when managing these configurations automatically.
+ +One can verify this with `py_config()`; this is the output on a Linux computer where the conda environment wasn't loaded: + +```{r} +library(reticulate) +``` +```{r, eval= FALSE} +py_config() +``` +```{r} +#> python: /usr/local/bin/python +#> libpython: /usr/lib/python3.10/config-3.10-x86_64-linux-gnu/libpython3.10.so +#> pythonhome: //usr://usr +#> version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] +#> numpy: [NOT FOUND] +#> +#> NOTE: Python version was forced by RETICULATE_PYTHON_FALLBACK +``` + + +One can nevertheless configure RStudio to load the correct conda environment (`r-reticulate`) by default by following these steps: + +1. In RStudio, navigate to the "Tools" menu. + +2. Click on "Global Options" in the "Tools" menu. + +3. Click on Python. + +```{r, echo=FALSE} +knitr::include_graphics("python.png") +``` + +4. Click on Select... + +5. Click on the Conda Environment tab. + +6. Click on the r-reticulate path. + +```{r, echo=FALSE} +knitr::include_graphics("conda.png") +``` + +7. Click on the Select button on the bottom. + +The selected path should now appear when using `py_config()`: + +```{r} +py_config() +``` + + +## HTTPSConnectionPool error + +A `causal_` or `masked_` command throws an error that starts as follows: + +``` +Error in py_run_string_impl(code, local, convert) : +requests.exceptions.SSLError: HTTPSConnectionPool(host='huggingface.co', port=443): +``` + +The first time a model is run, it will download some files that will be available for subsequent runs. So if there is no internet connection (or the Hugging Face website is down) during the first run, one will experience this problem. Afterwards, it is possible to use `pangoling` without an internet connection.
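Related to the HTTPSConnectionPool section of the troubleshooting vignette: once the files are cached, the underlying Python libraries can be told explicitly not to touch the network. A minimal sketch using environment variables documented by `transformers` and `huggingface_hub`; setting them before the model is loaded (e.g., from R with `Sys.setenv()`) is an assumption about the user's workflow, not a `pangoling` API:

```python
import os

# Documented opt-in offline switches for the Hugging Face libraries.
# With the model already cached, these make loading use local files only
# instead of attempting (and possibly timing out on) a network request.
os.environ["TRANSFORMERS_OFFLINE"] = "1"  # transformers: cached files only
os.environ["HF_HUB_OFFLINE"] = "1"        # huggingface_hub: no HTTP calls

print(os.environ["TRANSFORMERS_OFFLINE"], os.environ["HF_HUB_OFFLINE"])
```

Both variables must be set before the relevant Python modules load a model; changing them mid-session may have no effect on objects already created.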