
Dev #17 (Merged)

Merged 12 commits on Oct 26, 2023
1 change: 1 addition & 0 deletions .Rbuildignore
@@ -16,3 +16,4 @@ _test
^inst/.auctex-auto$
README_cache
^.covrignore$
REVIEW.md
15 changes: 8 additions & 7 deletions .github/CONTRIBUTING.md
@@ -1,8 +1,7 @@
# Contributing to pangoling

This outlines how to propose a change to pangoling.
For more detailed info about contributing to this, and other tidyverse packages, please see the
[**development contributing guide**](https://rstd.io/tidy-contrib).
This outlines how to propose a change to pangoling.
For a detailed discussion on contributing to this and other tidyverse packages, please see the [development contributing guide](https://rstd.io/tidy-contrib) and our [code review principles](https://code-review.tidyverse.org/).

## Fixing typos

@@ -15,6 +14,7 @@ You can find the `.R` file that generates the `.Rd` by reading the comment in th
If you want to make a bigger change, it's a good idea to first file an issue and make sure someone from the team agrees that it’s needed.
If you’ve found a bug, please file an issue that illustrates the bug with a minimal
[reprex](https://www.tidyverse.org/help/#reprex) (this will also help you write a unit test, if needed).
See our guide on [how to create a great issue](https://code-review.tidyverse.org/issues/) for more advice.

### Pull request process

@@ -40,8 +40,9 @@ If you’ve found a bug, please file an issue that illustrates the bug with a mi
* We use [testthat](https://cran.r-project.org/package=testthat) for unit tests.
Contributions with test cases included are easier to accept.

## Code of Conduct
## Code of conduct

Please note that this package is released with a [Contributor
Code of Conduct](https://ropensci.org/code-of-conduct/).
By contributing to this project, you agree to abide by its terms.

Please note that the pangoling project is released with a
[Contributor Code of Conduct](CODE_OF_CONDUCT.md). By contributing to this
project you agree to abide by its terms.
16 changes: 16 additions & 0 deletions .github/ISSUE_TEMPLATE/issue_template.md
@@ -0,0 +1,16 @@
---
name: Bug report or feature request
about: Describe a bug you've seen or make a case for a new feature
---

Please briefly describe your problem and what output you expect. If you have a question, please don't use this form. Instead, ask on <https://github.com/bnicenboim/pangoling/discussions>.

Please include a minimal reproducible example (AKA a reprex). If you've never heard of a [reprex](http://reprex.tidyverse.org/) before, start by reading <https://www.tidyverse.org/help/#reprex>.

For more advice on how to write a great issue, see <https://code-review.tidyverse.org/issues/>.

Brief description of the problem

```r
# insert reprex here
```
1 change: 1 addition & 0 deletions .gitignore
@@ -11,3 +11,4 @@ README_cache
/inst/REFERENCES.bib.sav
/pangoling.Rproj
/vignettes/articles/intro-gpt2_cache/
REVIEW.md
7 changes: 5 additions & 2 deletions DESCRIPTION
@@ -1,14 +1,17 @@
Package: pangoling
Type: Package
Title: Access to Large Language Model Predictions
Version: 0.0.0.9006
Version: 0.0.0.9007
Authors@R: c(
person("Bruno", "Nicenboim",
email = "[email protected]",
role = c( "aut","cre"),
comment = c(ORCID = "0000-0002-5176-3943")),
person("Chris", "Emmerly", role = "ctb"),
person("Giovanni", "Cassani", role = "ctb"))
person("Giovanni", "Cassani", role = "ctb"),
person("Lisa", "Levinson", role = "rev"),
person("Utku", "Turk", role = "rev")
)
Description: Access to word predictability using large language (transformer) models.
URL: https://bruno.nicenboim.me/pangoling, https://github.com/bnicenboim/pangoling
BugReports: https://github.com/bnicenboim/pangoling/issues
1 change: 1 addition & 0 deletions NAMESPACE
@@ -14,6 +14,7 @@ export(masked_preload)
export(masked_tokens_tbl)
export(ntokens)
export(perplexity_calc)
export(set_cache_folder)
export(tokenize_lst)
export(transformer_vocab)
importFrom(memoise,memoise)
10 changes: 8 additions & 2 deletions NEWS.md
@@ -20,6 +20,12 @@
* Requires correct version of R.

# pangoling 0.0.0.9006
* `causal_lp` get a `l_contexts` argument
* checkpoints work for causal models (not yet for masked models)
* `causal_lp` gets an `l_contexts` argument.
* Checkpoints work for causal models (not yet for masked models).
* rOpenSci badge added.


# pangoling 0.0.0.9007
* `set_cache_folder()` function added.
* Message when the package loads.
* New troubleshooting vignette.
40 changes: 40 additions & 0 deletions R/tr_utils.R
@@ -365,3 +365,43 @@ num_to_token <- function(x, tkzr) {
tkzr$convert_ids_to_tokens(x)
})
}

#' Set Cache Folder for HuggingFace Transformers
#'
#' This function sets the cache directory for HuggingFace transformers. If a
#' path is given, the function checks that the directory exists and then sets
#' the `TRANSFORMERS_CACHE` environment variable to this path. If no path is
#' provided, the function checks a number of environment variables for an
#' existing cache directory. If none of them is set, it informs the user about
#' the default cache directory.
#'
#' @param path Character string, the path to set as the cache directory. If
#'   `NULL`, the function will look for the cache directory in a number of
#'   environment variables. Default is `NULL`.
#'
#' @return Nothing is returned; this function is called for its side effect of
#'   setting the `TRANSFORMERS_CACHE` environment variable, or of providing
#'   information to the user.
#' @export
#'
#' @examples
#' \dontrun{
#' set_cache_folder("~/new_cache_dir")
#' }
#' @seealso \url{https://huggingface.co/docs/transformers/installation?highlight=transformers_cache#cache-setup}
#' @references HuggingFace Transformers: \url{https://huggingface.co/transformers/index.html}
#' @family general functions
set_cache_folder <- function(path = NULL) {
  if (!is.null(path)) {
    if (!dir.exists(path)) stop2("Folder '", path, "' doesn't exist.")
    # Set the variables in the embedded Python session so that
    # `transformers` picks them up.
    reticulate::py_run_string(
      paste0("import os\n",
             "os.environ['TRANSFORMERS_CACHE'] = '", path, "'\n",
             "os.environ['HF_HOME'] = '", path, "'"))
  }
  path <- c(Sys.getenv("TRANSFORMERS_CACHE"),
            Sys.getenv("HUGGINGFACE_HUB_CACHE"),
            Sys.getenv("HF_HOME"),
            Sys.getenv("XDG_CACHE_HOME"))
  # First non-empty value, or "" if none of the variables is set:
  path <- c(path[path != ""], "")[1]
  if (path != "") {
    message_verbose("Pretrained models and tokenizers are downloaded and locally cached at '", path, "'.")
  } else {
    message_verbose("By default, pretrained models are downloaded and locally cached at ~/.cache/huggingface/hub; on Windows, at C:\\Users\\username\\.cache\\huggingface\\hub. This location is controlled by the shell environment variable TRANSFORMERS_CACHE.\n\nFor changing the shell environment variables that affect the cache folder, see https://huggingface.co/docs/transformers/installation?highlight=transformers_cache#cache-setup")
  }
}
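The fallback order that `set_cache_folder()` walks through mirrors the environment variables honored by the Python `transformers` library. As a minimal sketch of the same resolution logic (the helper name and the hard-coded default below are illustrative, not part of the package):

```python
import os

# Variables checked in the same order as set_cache_folder(),
# from most specific to most general.
_CACHE_VARS = ("TRANSFORMERS_CACHE", "HUGGINGFACE_HUB_CACHE",
               "HF_HOME", "XDG_CACHE_HOME")


def resolve_hf_cache(env=None):
    """Return the first cache-related variable that is set, or the
    transformers default location when none of them is."""
    env = os.environ if env is None else env
    for var in _CACHE_VARS:
        value = env.get(var, "")
        if value:
            return value
    # None set: fall back to the documented default.
    return os.path.expanduser("~/.cache/huggingface/hub")


print(resolve_hf_cache({"HF_HOME": "/data/hf"}))  # /data/hf
```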
8 changes: 8 additions & 0 deletions R/zzz.R
@@ -6,6 +6,7 @@ torch <- NULL

#' @noRd
.onLoad <- function(libname, pkgname) {

# This will instruct reticulate to immediately try to configure the
# active Python environment, installing any required Python packages
# as necessary.
@@ -38,3 +39,10 @@ torch <- NULL

invisible()
}

.onAttach <- function(libname, pkgname) {
  packageStartupMessage(
    pkgname, " version ", utils::packageVersion(pkgname),
    "\nAn introduction to the package can be found at <https://bruno.nicenboim.me/pangoling/articles/>.",
    "\nNotice that pretrained models and tokenizers are downloaded from <https://huggingface.co/> the first time they are used.",
    "\nTo change the cache folder, use:\n  set_cache_folder(my_new_path)")
}
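The startup message points users at `set_cache_folder()`; the same effect can be obtained before R even starts by exporting the variables in the shell. A sketch assuming a POSIX shell (the directory name is just an example):

```shell
# Use a custom Hugging Face cache for everything launched from this shell;
# transformers reads these variables when it is imported.
export TRANSFORMERS_CACHE="$HOME/hf_cache"
export HF_HOME="$HOME/hf_cache"
echo "$TRANSFORMERS_CACHE"
```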


5 changes: 5 additions & 0 deletions README.Rmd
@@ -76,6 +76,11 @@ df_sent
> DOI: [10.5281/zenodo.7637526](https://zenodo.org/badge/latestdoi/497831295),
> <https://github.com/bnicenboim/pangoling>.

## How to contribute

See the [Contributing guidelines](.github/CONTRIBUTING.md).


## Code of conduct

Please note that this package is released with a [Contributor
Expand Down
18 changes: 11 additions & 7 deletions README.md
@@ -16,7 +16,7 @@ public.](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostat
Review](https://badges.ropensci.org/575_status.svg)](https://github.com/ropensci/software-review/issues/575)
<!-- badges: end -->

`pangoling`[^1] is an R package for estimating the log-probabilities of
`pangoling`\[1\] is an R package for estimating the log-probabilities of
words in a given context using transformer models. The package provides
an interface for utilizing pre-trained transformer models (such as GPT-2
or BERT) to obtain word probabilities. These log-probabilities are often
Expand All @@ -28,7 +28,7 @@ The package is mostly a wrapper of the python package
[`transformers`](https://pypi.org/project/transformers/) to process data
in a convenient format.

## Important! Limitations and bias
## Important\! Limitations and bias

The training data of the most popular models (such as GPT-2) haven’t
been released, so one cannot inspect it. It’s clear that the data
@@ -94,11 +94,11 @@ as follows:
``` r
df_sent <- df_sent |>
mutate(lp = causal_lp(word, .by = sent_n))
#> Processing using causal model 'gpt2'...
#> Processing 1 batch(es) of 10 tokens.
#> Processing using causal model ''...
#> Processing a batch of size 1 with 10 tokens.
#> Processing a batch of size 1 with 9 tokens.
#> Text id: 1
#> `The apple doesn't fall far from the tree.`
#> Processing 1 batch(es) of 9 tokens.
#> Text id: 2
#> `Don't judge a book by its cover.`
df_sent
@@ -125,10 +125,14 @@
## How to cite

> Nicenboim B (2023). *pangoling: Access to language model predictions
> in R*. R package version 0.0.0.9005, DOI:
> in R*. R package version 0.0.0.9007, DOI:
> [10.5281/zenodo.7637526](https://zenodo.org/badge/latestdoi/497831295),
> <https://github.com/bnicenboim/pangoling>.

## How to contribute

See the [Contributing guidelines](.github/CONTRIBUTING.md).

## Code of conduct

Please note that this package is released with a [Contributor Code of
@@ -142,7 +146,7 @@ Another R package that acts as a wrapper for
[`text`](https://r-text.org/). However, `text` is more general, and its
focus is on Natural Language Processing and Machine Learning.

[^1]: The logo of the package was created with [stable
1. The logo of the package was created with [stable
diffusion](https://huggingface.co/spaces/stabilityai/stable-diffusion)
and the R package
[hexSticker](https://github.com/GuangchuangYu/hexSticker).
1 change: 1 addition & 0 deletions _pkgdown.yml
Original file line number Diff line number Diff line change
@@ -26,6 +26,7 @@ articles:
contents:
- '`articles/intro-gpt2`'
- '`articles/intro-bert`'
- '`articles/troubleshooting`'

reference:
- title: About
2 changes: 1 addition & 1 deletion man/causal_config.Rd

2 changes: 1 addition & 1 deletion man/causal_lp.Rd
2 changes: 1 addition & 1 deletion man/causal_lp_mats.Rd
2 changes: 1 addition & 1 deletion man/causal_next_tokens_tbl.Rd
2 changes: 1 addition & 1 deletion man/causal_preload.Rd
2 changes: 1 addition & 1 deletion man/causal_tokens_lp_tbl.Rd
2 changes: 2 additions & 0 deletions man/pangoling-package.Rd
4 changes: 4 additions & 0 deletions man/perplexity_calc.Rd
34 changes: 34 additions & 0 deletions man/set_cache_folder.Rd

6 changes: 4 additions & 2 deletions tests/testthat/test-tr_causal.R
@@ -32,18 +32,20 @@ test_that("empty or small strings", {
expect_equal(lp_small, c(NA_real_, "It" = NA_real_))
})


if(0){
#long inputs require too much memory
test_that("long input work", {
skip_if_no_python_stuff()
long0 <- paste(rep("x", 1022), collapse = " ")
long <- paste(rep("x", 1024), collapse = " ")
longer <- paste(rep("x", 1025), collapse = " ")
lp_long0 <- causal_tokens_lp_tbl(c(long0, long, longer), add_special_tokens = TRUE, batch_size = 3, model = "sshleifer/tiny-gpt2")
lp_long0 <- causal_tokens_lp_tbl(texts = c(long0, long, longer), add_special_tokens = TRUE, batch_size = 3, model = "sshleifer/tiny-gpt2")
skip_on_os("windows") #the following just doesn't work on windows,
# but it's not that important
lp_long1 <- causal_tokens_lp_tbl(c(long0, long, longer), add_special_tokens = TRUE, batch_size = 1, model = "sshleifer/tiny-gpt2")
expect_equal(lp_long0, lp_long1)
})
}

test_that("errors work", {
skip_if_no_python_stuff()
2 changes: 2 additions & 0 deletions tests/testthat/test-tr_utils.R
@@ -19,3 +19,5 @@ test_that("messages work", {
options(pangoling.verbose = FALSE)
expect_no_message(causal_preload())
})

message("TEST cache")
Binary file added vignettes/articles/conda.png
2 changes: 2 additions & 0 deletions vignettes/articles/intro-bert.Rmd
@@ -30,6 +30,8 @@ Notice the following potential pitfall. This would be a **bad** approach for mak
```{r}
masked_tokens_tbl("The apple doesn't fall far from the [MASK]")
```
(The pretrained models and tokenizers will be downloaded from https://huggingface.co/ the first time they are used.)


The most common predictions are punctuation marks, because BERT uses the left *and* right context. In this case, the right context indicates that the mask is the final *token* of the sentence.
More expected results are obtained in the following way:
Expand Down
2 changes: 2 additions & 0 deletions vignettes/articles/intro-gpt2.Rmd
@@ -38,6 +38,8 @@ tic()
toc()
```

(The pretrained models and tokenizers will be downloaded from https://huggingface.co/ the first time they are used.)

The most likely continuation is "tree", which makes sense.
The first time a model is run, it will download some files that will be available for subsequent runs. However, every time we start a new R session and we run a model, it will take some time to store it in memory. Next runs in the same session are much faster. We can also preload a model with `causal_preload()`.

Expand Down
Binary file added vignettes/articles/python.png