diff --git a/.all-contributorsrc b/.all-contributorsrc deleted file mode 100644 index bc6a9103..00000000 --- a/.all-contributorsrc +++ /dev/null @@ -1,45 +0,0 @@ -{ - "files": [ - "README.md" - ], - "imageSize": 100, - "commit": false, - "contributorsPerLine": 7, - "projectName": "al-folio", - "projectOwner": "alshedivat", - "repoType": "github", - "repoHost": "https://github.com", - "badgeTemplate": "[core_contributors]: https://img.shields.io/badge/core_contributors-<%= contributors.length %>-orange.svg 'Number of core contributors'", - "contributorTemplate": "\">\" width=\"<%= options.imageSize %>px;\" alt=\"\"/>
<%= contributor.name %>
", - "skipCi": true, - "contributors": [ - { - "login": "alshedivat", - "name": "Maruan", - "avatar_url": "https://avatars.githubusercontent.com/u/2126561?v=4", - "profile": "http://maruan.alshedivat.com", - "contributions": [ - "design", - "code" - ] - }, - { - "login": "rohandebsarkar", - "name": "Rohan Deb Sarkar", - "avatar_url": "https://avatars.githubusercontent.com/u/50144004?v=4", - "profile": "http://rohandebsarkar.github.io", - "contributions": [ - "code" - ] - }, - { - "login": "pourmand1376", - "name": "Amir Pourmand", - "avatar_url": "https://avatars.githubusercontent.com/u/32064808?v=4", - "profile": "https://amirpourmand.ir", - "contributions": [ - "code" - ] - } - ] -} diff --git a/.gitattributes b/.gitattributes deleted file mode 100644 index 24244739..00000000 --- a/.gitattributes +++ /dev/null @@ -1 +0,0 @@ -_config.yml merge=ours diff --git a/.github/FUNDING.yml b/.github/FUNDING.yml deleted file mode 100644 index c78502f4..00000000 --- a/.github/FUNDING.yml +++ /dev/null @@ -1,12 +0,0 @@ -# These are supported funding model platforms - -github: # Replace with up to 4 GitHub Sponsors-enabled usernames e.g., [user1, user2] -patreon: # Replace with a single Patreon username -open_collective: # Replace with a single Open Collective username -ko_fi: alshedivat -tidelift: # Replace with a single Tidelift platform-name/package-name e.g., npm/babel -community_bridge: # Replace with a single Community Bridge project-name e.g., cloud-foundry -liberapay: # Replace with a single Liberapay username -issuehunt: # Replace with a single IssueHunt username -otechie: # Replace with a single Otechie username -custom: # ['https://www.buymeacoffee.com/TkFxuKo'] diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md deleted file mode 100644 index 511f5851..00000000 --- a/.github/ISSUE_TEMPLATE/bug_report.md +++ /dev/null @@ -1,38 +0,0 @@ ---- -name: Bug report -about: Create a report to help us improve -title: '' -labels: bug -assignees: '' - ---- - -**Acknowledge the following** -- [ ] I carefully read and followed the [Getting Started](https://github.com/alshedivat/al-folio#getting-started) guide. -- [ ] I read through [FAQ](https://github.com/alshedivat/al-folio#faq) and searched through the [past issues](https://github.com/alshedivat/al-folio/issues), none of which addressed my issue. -- [ ] The issue I am raising is a potential bug in al-folio and not just a usage question.
[For usage questions, please post in the [Discussions](https://github.com/alshedivat/al-folio/discussions) instead of raising an issue.] - -**Describe the bug** -A clear and concise description of what the bug is. - -**To Reproduce** -Steps to reproduce the behavior: -1. Go to '...' -2. Click on '....' -3. Scroll down to '....' -4. See error - -**Expected behavior** -A clear and concise description of what you expected to happen. - -**Screenshots** -If applicable, add screenshots to help explain your problem. - -**System (please complete the following information):** - - OS: [e.g. iOS] - - Browser (and its version) [e.g. chrome, safari] - - Jekyll version [e.g. 3.8.7] -- Ruby version [e.g. 2.6.5] - -**Additional context** -Add any other context about the problem here. diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md deleted file mode 100644 index 11fc491e..00000000 --- a/.github/ISSUE_TEMPLATE/feature_request.md +++ /dev/null @@ -1,20 +0,0 @@ ---- -name: Feature request -about: Suggest an idea for this project -title: '' -labels: enhancement -assignees: '' - ---- - -**Is your feature request related to a problem? Please describe.** -A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] - -**Describe the solution you'd like** -A clear and concise description of what you want to happen. - -**Describe alternatives you've considered** -A clear and concise description of any alternative solutions or features you've considered. - -**Additional context** -Add any other context or screenshots about the feature request here. diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md deleted file mode 100644 index 82c43a79..00000000 --- a/.github/pull_request_template.md +++ /dev/null @@ -1,26 +0,0 @@ - - - -## OpenReview Submission Thread - - - -## Checklist before opening a PR - -- [ ] I am opening a pull request against the `main` branch of the `2024` repo. -- [ ] The title of my PR is exactly the name of my markdown file - - i.e. `_posts/2024-05-07-[SUBMISSION NAME].md` would require a PR name `2024-05-07-[SUBMISSION NAME]` -- [ ] I have **anonymized** my post: my author's list is `Anonymous`, and there is no potential - content which can reveal my/my collaborators identities. 
-- [ ] My post matches the formatting requirements, including (but not limited to): - - [ ] I have **ONLY MODIFIED** files in the following locations (failure to do so will result in - your PR automatically being closed!): - - a Markdown (or HTML) file in `_posts/` with the format `_posts/2024-05-07-[SUBMISSION NAME].md` (or `.html`) - - static image assets added to `assets/img/2024-05-07-[SUBMISSION NAME]/` - - interactive HTML figures added to `assets/html/2024-05-07-[SUBMISSION NAME]/` - - citations in a bibtex file in `assets/bibliography/2024-05-07-[SUBMISSION NAME].bib` - - [ ] I have a short 2-3 sentence abstract in the `description` field of my front-matter ([example](https://github.com/iclr-blogposts/2024/blob/295ab5b4c31f2c7d421a4caf41e5481cbb4ad42c/_posts/2024-05-07-distill-example.md?plain=1#L4-L6)) - - [ ] I have a table of contents, formatted using the `toc` field of my front-matter ([example](https://github.com/iclr-blogposts/2024/blob/295ab5b4c31f2c7d421a4caf41e5481cbb4ad42c/_posts/2024-05-07-distill-example.md?plain=1#L36-L47)) - - [ ] My bibliography is correctly formatted, using a `.bibtex` file as per the sample post - -## Any other comments diff --git a/.github/stale.yml b/.github/stale.yml deleted file mode 100644 index 8ec2004d..00000000 --- a/.github/stale.yml +++ /dev/null @@ -1,18 +0,0 @@ -# Number of days of inactivity before an issue becomes stale -daysUntilStale: 60 -# Number of days of inactivity before a stale issue is closed -daysUntilClose: 7 -# Issues with these labels will never be considered stale -exemptLabels: - - pinned - - security - - enhancement -# Label to use when marking an issue as stale -staleLabel: wontfix -# Comment to post when marking an issue as stale. Set to `false` to disable -markComment: > - This issue has been automatically marked as stale because it has not had - recent activity. It will be closed if no further activity occurs. Thank you - for your contributions. -# Comment to post when closing a stale issue. 
Set to `false` to disable -closeComment: false diff --git a/.github/workflows/comment-on-error.yaml b/.github/workflows/comment-on-error.yaml deleted file mode 100644 index 35099659..00000000 --- a/.github/workflows/comment-on-error.yaml +++ /dev/null @@ -1,36 +0,0 @@ -name: Comment on error - -on: - workflow_run: - workflows: ["filter-files"] - types: - - completed - -jobs: - upload: - runs-on: ubuntu-latest - if: > - github.event.workflow_run.event == 'pull_request' && - github.event.workflow_run.conclusion == 'failure' - steps: - - name: Download build artifact from triggered workflow - uses: dawidd6/action-download-artifact@v2 - with: - run_id: ${{ github.event.workflow_run.id }} - # name: website_out - # path: site_out - search_artifacts: true - - name: Get ISSUE_NUMBER - run: echo "ISSUE_NUMBER=$(cat website_out/pr_number.txt)" >> $GITHUB_ENV - - name: Get filterout - run: echo "MSG=$(cat website_out/filterout.txt)" >> $GITHUB_ENV - - uses: actions/github-script@v6 - with: - github-token: ${{ secrets.GITHUB_TOKEN }} - script: | - github.rest.issues.createComment({ - issue_number: ${{ env.ISSUE_NUMBER }}, - owner: context.repo.owner, - repo: context.repo.repo, - body: "⚠️ **We have detected a problem with your submission!** ⚠️\n\n${{ env.MSG }}\n\nPlease make the aforementioned changes and re-submit :)" - }) diff --git a/.github/workflows/deploy-docker-tag.yml b/.github/workflows/deploy-docker-tag.yml deleted file mode 100644 index 3e6b6a3a..00000000 --- a/.github/workflows/deploy-docker-tag.yml +++ /dev/null @@ -1,40 +0,0 @@ -name: Docker Image CI (Upload Tag) - -on: - push: - tags: - - 'v*' - -jobs: - - build: - - runs-on: ubuntu-latest - - steps: - - name: Checkout - uses: actions/checkout@v2 - - name: Buildx - uses: docker/setup-buildx-action@v1 - - - - name: Docker meta - id: meta - uses: docker/metadata-action@v4 - with: - images: amirpourmand/al-folio - - - name: Login - uses: docker/login-action@v1 - with: - username: ${{ secrets.DOCKER_USERNAME }} - password: ${{ secrets.DOCKER_PASSWORD }} - - - name: Build and push - uses: docker/build-push-action@v3 - with: - context: . 
- push: ${{ github.event_name != 'pull_request' }} - tags: ${{ steps.meta.outputs.tags }} - labels: ${{ steps.meta.outputs.labels }} - diff --git a/.github/workflows/deploy-for-review.yml b/.github/workflows/deploy-for-review.yml deleted file mode 100644 index f76729d1..00000000 --- a/.github/workflows/deploy-for-review.yml +++ /dev/null @@ -1,54 +0,0 @@ -name: Deploy post for review - -on: - workflow_run: - workflows: ["filter-files"] - types: - - completed - -jobs: - upload: - runs-on: ubuntu-latest - if: > - github.event.workflow_run.event == 'pull_request' && - github.event.workflow_run.conclusion == 'success' - steps: - - name: Download build artifact from triggered workflow - uses: dawidd6/action-download-artifact@v2 - with: - run_id: ${{ github.event.workflow_run.id }} - # name: website_out - # path: site_out - search_artifacts: true - - run: unzip website_out/site.zip - # set the SLUG environment variable to the contents of website_out/slug.txt - - name: Get SLUG - run: echo "SLUG=$(cat website_out/slug.txt)" >> $GITHUB_ENV - - name: Print SLUG - run: echo ${{env.SLUG}} - # the post name is the slug minus the first 11 characters - - name: Get post name - run: echo "POST_NAME=${SLUG:11}" >> $GITHUB_ENV - - name: Print POST_NAME - run: echo ${{env.POST_NAME}} - - name: Get ISSUE_NUMBER - run: echo "ISSUE_NUMBER=$(cat website_out/pr_number.txt)" >> $GITHUB_ENV - - name: Print ISSUE_NUMBER - run: echo ${{env.ISSUE_NUMBER}} - - name: Setup AWS CLI - uses: aws-actions/configure-aws-credentials@v4 - with: - aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }} - aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }} - aws-region: eu-west-1 - - run: aws s3 sync --region eu-west-1 --acl public-read _site s3://iclr-blogposts-2024/${{env.SLUG}}-${{env.ISSUE_NUMBER}} - - uses: actions/github-script@v6 - with: - github-token: ${{ secrets.GITHUB_TOKEN }} - script: | - github.rest.issues.createComment({ - issue_number: ${{ env.ISSUE_NUMBER }}, - owner: context.repo.owner, - repo: context.repo.repo, - body: '👋 Thanks for your submission! We have successfully built your website and we will push it shortly to the URL https://d2jud02ci9yv69.cloudfront.net/${{env.SLUG}}-${{env.ISSUE_NUMBER}}/blog/${{env.POST_NAME}}/ !' - }) diff --git a/.github/workflows/deploy-image.yml b/.github/workflows/deploy-image.yml deleted file mode 100644 index b747dfc1..00000000 --- a/.github/workflows/deploy-image.yml +++ /dev/null @@ -1,31 +0,0 @@ -name: Docker Image CI - -on: - push: - branches: [ master ] - -jobs: - - build: - - runs-on: ubuntu-latest - if: github.repository_owner == 'alshedivat' - - steps: - - name: Checkout - uses: actions/checkout@v2 - - name: Buildx - uses: docker/setup-buildx-action@v1 - - - name: Login - uses: docker/login-action@v1 - with: - username: ${{ secrets.DOCKER_USERNAME }} - password: ${{ secrets.DOCKER_PASSWORD }} - - - name: Build and push - uses: docker/build-push-action@v2 - with: - context: .
- push: true - tags: amirpourmand/al-folio diff --git a/.github/workflows/deploy.yml b/.github/workflows/deploy.yml deleted file mode 100644 index cbfb6996..00000000 --- a/.github/workflows/deploy.yml +++ /dev/null @@ -1,39 +0,0 @@ -name: deploy - -on: - push: - branches: - - master - - main - workflow_dispatch: {} - -jobs: - deploy: - runs-on: ubuntu-latest - steps: - - name: Checkout code - uses: actions/checkout@v3 - - name: Setup Ruby - uses: ruby/setup-ruby@v1 - with: - ruby-version: '3.0.2' - bundler-cache: true - - name: Install deps - run: | - npm install -g mermaid.cli - - name: Setup deploy options - id: setup - run: | - git config --global user.name "GitHub Action" - git config --global user.email "41898282+github-actions[bot]@users.noreply.github.com" - if [[ ${GITHUB_REF} = refs/pull/*/merge ]]; then # pull request - echo "SRC_BRANCH=${GITHUB_HEAD_REF}" >> $GITHUB_OUTPUT - echo "NO_PUSH=--no-push" >> $GITHUB_OUTPUT - elif [[ ${GITHUB_REF} = refs/heads/* ]]; then # branch, e.g. master, source etc - echo "SRC_BRANCH=${GITHUB_REF#refs/heads/}" >> $GITHUB_OUTPUT - fi - echo "DEPLOY_BRANCH=gh-pages" >> $GITHUB_OUTPUT - - name: Deploy website - run: yes | bash bin/deploy --verbose ${{ steps.setup.outputs.NO_PUSH }} - --src ${{ steps.setup.outputs.SRC_BRANCH }} - --deploy ${{ steps.setup.outputs.DEPLOY_BRANCH }} diff --git a/.github/workflows/filter-files.yml b/.github/workflows/filter-files.yml deleted file mode 100644 index cfbc3c0b..00000000 --- a/.github/workflows/filter-files.yml +++ /dev/null @@ -1,113 +0,0 @@ -name: filter-files - -on: - pull_request: - branches: - - main - -# hack for https://github.com/actions/cache/issues/810#issuecomment-1222550359 -#env: -# SEGMENT_DOWNLOAD_TIMEOUT_MIN: 3 - -jobs: - files-changed: - name: Detect what files changed - # if: contains(github.event.pull_request.labels.*.name, 'submission') - # if: ${{ github.event.label.name == 'submission' }} - runs-on: ubuntu-20.04 - timeout-minutes: 3 - outputs: - offendingfiles: ${{ steps.pythonfilter.outputs.offendingfiles }} - - steps: - - name: Checkout code - uses: actions/checkout@v3 - - uses: actions/setup-python@v4 - with: - python-version: '3.10' - - run: pip install python-slugify pyyaml - - uses: dorny/paths-filter@v2 - id: filter - with: - # Enable listing of files matching each filter. - # Paths to files will be available in `${FILTER_NAME}_files` output variable. - # Paths will be escaped and space-delimited. - # Output is usable as command-line argument list in Linux shell - list-files: shell - - # In this example changed files will be checked by linter. - # It doesn't make sense to lint deleted files. - # Therefore we specify we are only interested in added or modified files. 
- filters: | - changed: - - '**' - - name: Check label - run: echo ${{ github.event.label.name }} - - name: Save title slug - run: echo "SLUG=`slugify ${{ github.event.pull_request.title }}`" >> $GITHUB_ENV - - name: Print slug - run: echo ${{env.SLUG}} - - name: Check if changed files fit our filters - id: pythonfilter - if: ${{ steps.filter.outputs.changed == 'true' }} - # todo read from step below - run: | - FILTEROUT=$(python3 bin/filterpaths.py $SLUG ${{ steps.filter.outputs.changed_files }} | tail -1) - echo "offendingfiles=$FILTEROUT" >> $GITHUB_OUTPUT - mkdir site_out - python3 bin/filterpaths.py $SLUG ${{ steps.filter.outputs.changed_files }} - #- uses: actions/github-script@v6 - # if: always() && steps.pythonfilter.outcome == 'failure' - # with: - # script: | - # github.rest.issues.createComment({ - # issue_number: context.issue.number, - # owner: context.repo.owner, - # repo: context.repo.repo, - # body: "⚠️ **We have detected a problem with your submission!** ⚠️\n\n${{ steps.pythonfilter.outputs.offendingfiles }}\n\nPlease make the aforementioned changes and re-submit :)" - # }) - - name: Setup Ruby - if: always() && steps.pythonfilter.outcome == 'success' - uses: ruby/setup-ruby@v1 - with: - ruby-version: '3.0.2' - bundler-cache: true - - name: Install deps - if: always() && steps.pythonfilter.outcome == 'success' - run: | - npm install -g mermaid.cli - - name: Setup deploy options - if: always() && steps.pythonfilter.outcome == 'success' - id: setup - run: | - git config --global user.name "GitHub Action" - git config --global user.email "41898282+github-actions[bot]@users.noreply.github.com" - if [[ ${GITHUB_REF} = refs/pull/*/merge ]]; then # pull request - echo "SRC_BRANCH=${GITHUB_HEAD_REF}" >> $GITHUB_OUTPUT - echo "NO_PUSH=--no-push" >> $GITHUB_OUTPUT - elif [[ ${GITHUB_REF} = refs/heads/* ]]; then # branch, e.g. master, source etc - echo "SRC_BRANCH=${GITHUB_REF#refs/heads/}" >> $GITHUB_OUTPUT - fi - echo "DEPLOY_BRANCH=gh-pages" >> $GITHUB_OUTPUT - - name: Build website - if: always() && steps.pythonfilter.outcome == 'success' - run: yes | bash bin/build --verbose ${{ steps.setup.outputs.NO_PUSH }} - --src ${{ steps.setup.outputs.SRC_BRANCH }} - --deploy ${{ steps.setup.outputs.DEPLOY_BRANCH }} - --slug ${{env.SLUG}}-${{ github.event.number }} - - name: Save slug - if: always() - run: echo ${{env.SLUG}} > site_out/slug.txt - - name: Save PR number - if: always() - env: - PR_NUMBER: ${{ github.event.number }} - run: echo $PR_NUMBER > site_out/pr_number.txt - - name: Save filterout - if: always() - run: echo "${{ steps.pythonfilter.outputs.offendingfiles }}" > site_out/filterout.txt - - uses: actions/upload-artifact@v2 - if: always() - with: - name: website_out - path: site_out diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 00000000..e69de29b diff --git a/404.html b/404.html index 0da4ee0b..2659a807 100644 --- a/404.html +++ b/404.html @@ -1,9 +1 @@ ---- -layout: page -permalink: /404.html -title: "Page not found" -description: "Looks like there has been a mistake. Nothing exists here." -redirect: true ---- - -
- You will be redirected to the main page within 3 seconds. If not redirected, please click here.
+ Page not found | ICLR Blogposts 2024 You will be redirected to the main page within 3 seconds. If not redirected, please click here.
\ No newline at end of file diff --git a/Gemfile b/Gemfile deleted file mode 100644 index 98de3166..00000000 --- a/Gemfile +++ /dev/null @@ -1,24 +0,0 @@ -source 'https://rubygems.org' -group :jekyll_plugins do - gem 'jekyll' - gem 'jekyll-archives' - gem 'jekyll-diagrams' - gem 'jekyll-email-protect' - gem 'jekyll-feed' - gem 'jekyll-imagemagick' - gem 'jekyll-minifier' - gem 'jekyll-paginate-v2' - gem 'jekyll-scholar' - gem 'jekyll-sitemap' - gem 'jekyll-target-blank' - gem 'jekyll-twitter-plugin' - gem 'jekyll-redirect-from' - # gem 'jemoji' - gem 'mini_racer' - gem 'unicode_utils' - gem 'webrick' -end -group :other_plugins do - gem 'httparty' - gem 'feedjira' -end diff --git a/_config.yml b/_config.yml deleted file mode 100644 index 982045e3..00000000 --- a/_config.yml +++ /dev/null @@ -1,342 +0,0 @@ -# ----------------------------------------------------------------------------- -# Site settings -# ----------------------------------------------------------------------------- - -title: ICLR Blogposts 2024 # the website title (if blank, full name will be used instead) -first_name: ICLR -middle_name: -last_name: Blog -email: -description: > # the ">" symbol means to ignore newlines until "footer_text:" - Home to the 2024 ICLR Blogposts track -footer_text: > - Powered by Jekyll with al-folio theme. - Hosted by GitHub Pages. - Photos from Unsplash. -keywords: machine-learning, ml, deep-learning, reinforcement-learning, iclr # add your own keywords or leave empty - -lang: en # the language of your site (for example: en, fr, cn, ru, etc.) -icon: iclr_favicon.ico # the emoji used as the favicon (alternatively, provide image name in /assets/img/) - -url: https://iclr-blogposts.github.io # the base hostname & protocol for your site -baseurl: /2024 # the subpath of your site, e.g. /blog/ -last_updated: false # set to true if you want to display last updated in the footer -impressum_path: # set to path to include impressum link in the footer, use the same path as permalink in a page, helps to conform with EU GDPR - -timezone: Europe/Vienna - -# ----------------------------------------------------------------------------- -# Theme -# ----------------------------------------------------------------------------- - -# code highlighter theme -highlight_theme_light: github # https://github.com/jwarby/jekyll-pygments-themes -highlight_theme_dark: native # https://github.com/jwarby/jekyll-pygments-themes - -# repo color theme -repo_theme_light: default # https://github.com/anuraghazra/github-readme-stats/blob/master/themes/README.md -repo_theme_dark: dark # https://github.com/anuraghazra/github-readme-stats/blob/master/themes/README.md - -# ----------------------------------------------------------------------------- -# RSS Feed -# ----------------------------------------------------------------------------- -# will use title and url fields -# Take a look to https://github.com/jekyll/jekyll-feed for more customization - -rss_icon: true - -# ----------------------------------------------------------------------------- -# Layout -# ----------------------------------------------------------------------------- - -navbar_fixed: true -footer_fixed: true - -# Dimensions -max_width: 1000px - -# TODO: add layout settings (single page vs. multi-page) - -# ----------------------------------------------------------------------------- -# Open Graph & Schema.org -# ----------------------------------------------------------------------------- -# Display links to the page with a preview object on social media. 
-serve_og_meta: false # Include Open Graph meta tags in the HTML head -serve_schema_org: false # Include Schema.org in the HTML head -og_image: # The site-wide (default for all links) Open Graph preview image - -# ----------------------------------------------------------------------------- -# Social integration -# ----------------------------------------------------------------------------- - -github_username: # your GitHub user name -gitlab_username: # your GitLab user name -twitter_username: # your Twitter handle -linkedin_username: # your LinkedIn user name -scholar_userid: # your Google Scholar ID -semanticscholar_id: # your Semantic Scholar ID -orcid_id: # your ORCID ID -medium_username: # your Medium username -quora_username: # your Quora username -publons_id: # your ID on Publons -research_gate_profile: # your profile on ResearchGate -blogger_url: # your blogger URL -work_url: # work page URL -keybase_username: # your keybase user name -wikidata_id: # your wikidata id -dblp_url: # your DBLP profile url -stackoverflow_id: # your stackoverflow id -kaggle_id: # your kaggle id -lastfm_id: # your lastfm id -spotify_id: # your spotify id -pinterest_id: # your pinterest id -unsplash_id: # your unsplash id -instagram_id: # your instagram id -facebook_id: # your facebook id -discord_id: # your discord id (18-digit unique numerical identifier) - -contact_note: - -# ----------------------------------------------------------------------------- -# Analytics and search engine verification -# ----------------------------------------------------------------------------- - -google_analytics: # your Google Analytics measurement ID (format: G-XXXXXXXXXX) -panelbear_analytics: # panelbear analytics site ID (format: XXXXXXXXX) - -google_site_verification: # your google-site-verification ID (Google Search Console) -bing_site_verification: # your bing-site-verification ID (Bing Webmaster) - -# ----------------------------------------------------------------------------- -# Blog -# ----------------------------------------------------------------------------- - -blog_name: blogposts # blog_name will be displayed in your blog page -blog_nav_title: blog # your blog must have a title for it to be displayed in the nav bar -blog_description: Blog Posts -permalink: /blog/:title/ - -# Pagination -pagination: - enabled: true - -# Comments -disqus_shortname: # put your disqus shortname -# https://help.disqus.com/en/articles/1717111-what-s-a-shortname - -# External sources. -# If you have blog posts published on medium.com or other external sources, -# you can display them in your blog by adding a link to the RSS feed. -external_sources: - -# ----------------------------------------------------------------------------- -# Collections -# ----------------------------------------------------------------------------- - -collections: - news: - defaults: - layout: post - output: true - permalink: /news/:path/ - projects: - output: false - permalink: /projects/:path/ - -news_scrollable: true # adds a vertical scroll bar if there are more than 3 news items -news_limit: 5 # leave blank to include all the news in the `_news` folder - -# ----------------------------------------------------------------------------- -# Jekyll settings -# ----------------------------------------------------------------------------- - -# Markdown and syntax highlight -markdown: kramdown -highlighter: rouge -kramdown: - input: GFM - syntax_highlighter_opts: - css_class: 'highlight' - span: - line_numbers: false - block: - line_numbers: false - start_line: 1 - -# Includes & excludes -include: ['_pages'] -exclude: - - bin - - Gemfile - - Gemfile.lock - - vendor -keep_files: - - CNAME - - .nojekyll - - .git - -# Plug-ins -plugins: - - jekyll-archives - - jekyll-diagrams - - jekyll-email-protect - - jekyll-feed - - jekyll-imagemagick - - jekyll-minifier - - jekyll-paginate-v2 - - jekyll/scholar - - jekyll-sitemap - - jekyll-target-blank - - jekyll-twitter-plugin - # - jemoji - -# Sitemap settings -defaults: - - scope: - path: "assets/**/*.*" - values: - sitemap: false - -# ----------------------------------------------------------------------------- -# Jekyll Minifier -# ----------------------------------------------------------------------------- - -jekyll-minifier: - exclude: ['robots.txt'] - uglifier_args: - harmony: true - -# ----------------------------------------------------------------------------- -# Jekyll Archives -# ----------------------------------------------------------------------------- - -jekyll-archives: - enabled: [year, tags, categories] # enables year, tag and category archives (remove if you need to disable one of them). - layouts: - year: archive-year - tag: archive-tag - category: archive-category - permalinks: - year: '/blog/:year/' - tag: '/blog/tag/:name/' - category: '/blog/category/:name/' - -# display_tags: ['formatting', 'images', 'links', 'math', 'code'] # these tags will be displayed on the front page of your blog - -# ----------------------------------------------------------------------------- -# Jekyll Scholar -# ----------------------------------------------------------------------------- - -scholar: - - last_name: - first_name: - - style: apa - locale: en - - source: /_bibliography/ - bibliography: papers.bib - bibliography_template: bib - # Note: if you have latex math in your bibtex, the latex filter - # preprocessing may conflict with MathJAX if the latter is enabled. - # See https://github.com/alshedivat/al-folio/issues/357. - bibtex_filters: [latex, smallcaps, superscript] - - replace_strings: true - join_strings: true - - details_dir: bibliography - details_layout: bibtex.html - details_link: Details - - query: "@*" - -# Filter out certain bibtex entry keywords used internally from the bib output -filtered_bibtex_keywords: [abbr, abstract, arxiv, bibtex_show, html, pdf, selected, supp, blog, code, poster, slides, website, preview] - -# Maximum number of authors to be shown for each publication (more authors are visible on click) -max_author_limit: 3 # leave blank to always show all authors -more_authors_animation_delay: 10 # more authors are revealed on click using animation; smaller delay means faster animation - - -# ----------------------------------------------------------------------------- -# Responsive WebP Images -# ----------------------------------------------------------------------------- - -imagemagick: - enabled: true # enables responsive images for your site (recommended, see https://github.com/alshedivat/al-folio/issues/537) - widths: - - 480 - - 800 - - 1400 - input_directories: - - assets/img/ - input_formats: - - ".jpg" - - ".jpeg" - - ".png" - - ".tiff" - output_formats: - webp: "-resize 800x" - -# ----------------------------------------------------------------------------- -# Jekyll Diagrams -# ----------------------------------------------------------------------------- - -jekyll-diagrams: - # configuration, see https://github.com/zhustec/jekyll-diagrams. - # feel free to comment out this section if not using jekyll diagrams. - - -# ----------------------------------------------------------------------------- -# Optional Features -# ----------------------------------------------------------------------------- - -enable_google_analytics: false # enables google analytics -enable_panelbear_analytics: false # enables panelbear analytics -enable_google_verification: false # enables google site verification -enable_bing_verification: false # enables bing site verification -enable_masonry: true # enables automatic project cards arrangement -enable_math: true # enables math typesetting (uses MathJax) -enable_tooltips: false # enables automatic tooltip links generated - # for each section titles on pages and posts -enable_darkmode: true # enables switching between light/dark modes -enable_navbar_social: false # enables displaying social links in the - # navbar on the about page -enable_project_categories: true # enables categorization of projects into - # multiple categories -enable_medium_zoom: true # enables image zoom feature (as on medium.com) - - -# ----------------------------------------------------------------------------- -# Library versions -# ----------------------------------------------------------------------------- - -academicons: - version: "1.9.1" - integrity: "sha256-i1+4qU2G2860dGGIOJscdC30s9beBXjFfzjWLjBRsBg=" -bootstrap: - version: "4.6.1" - integrity: - css: "sha256-DF7Zhf293AJxJNTmh5zhoYYIMs2oXitRfBjY+9L//AY=" - js: "sha256-fgLAgv7fyCGopR/gBNq2iW3ZKIdqIcyshnUULC4vex8=" -fontawesome: - version: "5.15.4" - integrity: "sha256-mUZM63G8m73Mcidfrv5E+Y61y7a12O5mW4ezU3bxqW4=" -jquery: - version: "3.6.0" - integrity: "sha256-/xUj+3OJU5yExlq6GSYGSHk7tPXikynS7ogEvDej/m4=" -mathjax: - version: "3.2.0" -masonry: - version: "4.2.2" - integrity: "sha256-Nn1q/fx0H7SNLZMQ5Hw5JLaTRZp0yILA/FRexe19VdI=" -mdb: - version: "4.20.0" - integrity: - css: "sha256-jpjYvU3G3N6nrrBwXJoVEYI/0zw8htfFnhT9ljN3JJw=" - js: "sha256-NdbiivsvWt7VYCt6hYNT3h/th9vSTL4EDWeGs5SN3DA=" -medium_zoom: - version: "1.0.6"
- integrity: "sha256-EdPgYcPk/IIrw7FYeuJQexva49pVRZNmt3LculEr7zM=" diff --git a/_data/coauthors.yml b/_data/coauthors.yml deleted file mode 100644 index 8ed52124..00000000 --- a/_data/coauthors.yml +++ /dev/null @@ -1,34 +0,0 @@ -"Adams": - - firstname: ["Edwin", "E.", "E. P.", "Edwin Plimpton"] - url: https://en.wikipedia.org/wiki/Edwin_Plimpton_Adams - -"Podolsky": - - firstname: ["Boris", "B.", "B. Y.", "Boris Yakovlevich"] - url: https://en.wikipedia.org/wiki/Boris_Podolsky - -"Rosen": - - firstname: ["Nathan", "N."] - url: https://en.wikipedia.org/wiki/Nathan_Rosen - -"Bach": - - firstname: ["Johann Sebastian", "J. S."] - url: https://en.wikipedia.org/wiki/Johann_Sebastian_Bach - - - firstname: ["Carl Philipp Emanuel", "C. P. E."] - url: https://en.wikipedia.org/wiki/Carl_Philipp_Emanuel_Bach - -"Przibram": - - firstname: ["Karl"] - url: https://link.springer.com/article/10.1007/s00016-019-00242-z - -"Schrödinger": - - firstname: ["Erwin"] - url: https://en.wikipedia.org/wiki/Erwin_Schr%C3%B6dinger - -"Lorentz": - - firstname: ["Hendrik Antoon"] - url: https://en.wikipedia.org/wiki/Hendrik_Lorentz - -"Planck": - - firstname: ["Max"] - url: https://en.wikipedia.org/wiki/Max_Planck diff --git a/_data/cv.yml b/_data/cv.yml deleted file mode 100644 index 5b115724..00000000 --- a/_data/cv.yml +++ /dev/null @@ -1,97 +0,0 @@ -- title: General Information - type: map - contents: - - name: Full Name - value: Albert Einstein - - name: Date of Birth - value: 14th March 1879 - - name: Languages - value: English, German - -- title: Education - type: time_table - contents: - - title: PhD - institution: University of Zurich, Zurich, Switzerland - year: 1905 - description: - - Description 1. - - Description 2. - - title: Description 3. - contents: - - Sub-description 1. - - Sub-description 2. - - title: Federal teaching diploma - institution: Eidgenössische Technische Hochschule, Zurich, Switzerland - year: 1900 - description: - - Description 1. - - Description 2. - -- title: Experience - type: time_table - contents: - - title: Professor of Theoretical Physics - institution: Institute for Advanced Study, Princeton University - year: 1933 - 1955 - description: - - Description 1. - - Description 2. - - title: Description 3. - contents: - - Sub-description 1. - - Sub-description 2. - - title: Visiting Professor - institution: California Institute of Technology, Pasadena, California, US - year: 1933 - description: - - Description 1. - - Description 2. - - - title: Director - institution: Kaiser Wilhelm Institute for Physics, Berlin, Germany. - year: 1917-1933 - - - title: Professor of Theoretical Physics - institution: Karl-Ferdinand University, Prague, Czechoslovakia - year: 1911 - 1917 - description: - - - title: Associate Professor of Theoretical Physics - institution: University of Zurich, Zurich, Switzerland - year: 1909 - 1911 - -- title: Open Source Projects - type: time_table - contents: - - title: al-folio - year: 2015-now - description: A beautiful, simple, clean, and responsive Jekyll theme for academics. - -- title: Honors and Awards - type: time_table - contents: - - year: 1921 - items: - - Nobel Prize in Physics - - Matteucci Medal - - year: 2029 - items: - - Max Planck Medal - -- title: Academic Interests - type: nested_list - contents: - - title: Topic 1. - items: - - Description 1. - - Description 2. - - title: Topic 2. - items: - - Description 1. - - Description 2. - -- title: Other Interests - type: list - contents: - - Hobbies: Hobby 1, Hobby 2, etc. 
diff --git a/_data/repositories.yml b/_data/repositories.yml deleted file mode 100644 index 5205c9f6..00000000 --- a/_data/repositories.yml +++ /dev/null @@ -1,12 +0,0 @@ -github_users: - - torvalds - - alshedivat - -github_repos: - - alshedivat/al-folio - - twbs/bootstrap - - jekyll/jekyll - - jquery/jquery - - FortAwesome/Font-Awesome - - jpswalsh/academicons - - mathjax/MathJax diff --git a/_data/venues.yml b/_data/venues.yml deleted file mode 100644 index 6c16ad5d..00000000 --- a/_data/venues.yml +++ /dev/null @@ -1,6 +0,0 @@ -"AJP": - url: https://aapt.scitation.org/journal/ajp - color: "#00369f" - -"PhysRev": - url: https://journals.aps.org/ diff --git a/_includes/cv/list.html b/_includes/cv/list.html deleted file mode 100644 index 75625859..00000000 --- a/_includes/cv/list.html +++ /dev/null @@ -1,5 +0,0 @@ - \ No newline at end of file diff --git a/_includes/cv/map.html b/_includes/cv/map.html deleted file mode 100644 index e0d1983e..00000000 --- a/_includes/cv/map.html +++ /dev/null @@ -1,8 +0,0 @@ - - {% for content in entry.contents %} - - - - - {% endfor %} -
{{ content.name }}{{ content.value }}
\ No newline at end of file diff --git a/_includes/cv/nested_list.html b/_includes/cv/nested_list.html deleted file mode 100644 index 4778aca0..00000000 --- a/_includes/cv/nested_list.html +++ /dev/null @@ -1,14 +0,0 @@ - \ No newline at end of file diff --git a/_includes/cv/time_table.html b/_includes/cv/time_table.html deleted file mode 100644 index 123b9d09..00000000 --- a/_includes/cv/time_table.html +++ /dev/null @@ -1,59 +0,0 @@ - \ No newline at end of file diff --git a/_includes/figure.html b/_includes/figure.html deleted file mode 100644 index e67e8043..00000000 --- a/_includes/figure.html +++ /dev/null @@ -1,36 +0,0 @@ -{%- assign img_path = include.path | remove: ".jpg" | remove: ".jpeg" | remove: ".png" | remove: ".tiff" -%} - -
- - - {% if site.imagemagick.enabled %} - {% for i in site.imagemagick.widths -%} - - {% endfor -%} - {% endif %} - - - - - - - {%- if include.caption -%}
{{ include.caption }}
{%- endif %} - -
diff --git a/_includes/footer.html b/_includes/footer.html deleted file mode 100644 index acc4688f..00000000 --- a/_includes/footer.html +++ /dev/null @@ -1,25 +0,0 @@ - {% if site.footer_fixed %} - - {%- else -%} - - {%- endif %} \ No newline at end of file diff --git a/_includes/head.html b/_includes/head.html deleted file mode 100644 index 3796eb38..00000000 --- a/_includes/head.html +++ /dev/null @@ -1,31 +0,0 @@ - - {% include metadata.html %} - - - - - - - - - - - - - - - {% if site.icon.size < 3 %} - - {% elsif site.icon != blank %} - - {% endif %} - - - - - {% if site.enable_darkmode %} - - - - - {% endif %} diff --git a/_includes/header.html b/_includes/header.html deleted file mode 100644 index f72668e5..00000000 --- a/_includes/header.html +++ /dev/null @@ -1,137 +0,0 @@ - -
- - - -
\ No newline at end of file diff --git a/_includes/metadata.html b/_includes/metadata.html deleted file mode 100644 index af3813a8..00000000 --- a/_includes/metadata.html +++ /dev/null @@ -1,196 +0,0 @@ -{% if site.enable_google_verification or site.enable_bing_verification %} - - {% if site.enable_google_verification -%} - - {%- endif -%} - {% if site.enable_bing_verification -%} - - {%- endif -%} -{%- endif %} - - - - - - - {%- if site.title == "blank" -%} - {%- capture title -%}{{ site.first_name }} {{ site.middle_name }} {{ site.last_name }}{%- endcapture -%} - {%- else -%} - {%- capture title -%}{{ site.title }}{%- endcapture -%} - {%- endif -%} - {% if page.url == '/blog/index.html' %} - {{ site.blog_nav_title }} | {{ title }} - {%- elsif page.title != "blank" and page.url != "/" -%} - {%- if page.title == nil or page.title == "" -%} - {{ page.date | date: "%Y" }} | {{ title }} - {%- else -%} - {{ page.title }} | {{ title }} - {%- endif -%} - {%- else -%} - {{ title }} - {%- endif -%} - - - -{%- if page.keywords or site.keywords %} - -{%- endif %} - -{%- if site.serve_og_meta %} - - - - - - - - {% if page.og_image or site.og_image -%} - - {%- endif %} - - - - - - - {% if page.og_image or site.og_image -%} - - {%- endif %} - {% if site.twitter_username -%} - - - {%- endif %} -{%- endif %} - -{%- if site.serve_schema_org %} - - - {%- comment -%} Social links generator for "sameAs schema" {%- endcomment %} - {% assign sameaslinks = "" | split: "," %} - {%- if site.orcid_id -%} - {%- capture link -%}https://orcid.org/{{ site.orcid_id }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.scholar_userid -%} - {%- capture link -%}https://scholar.google.com/citations?user={{ site.scholar_userid }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.semanticscholar_id -%} - {%- capture link -%}https://www.semanticscholar.org/author/{{ site.semanticscholar_id }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.publons_id -%} - {%- capture link -%}https://publons.com/a/{{ site.publons_id }}/{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.research_gate_profile -%} - {%- capture link -%}https://www.researchgate.net/profile/{{site.research_gate_profile}}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.github_username -%} - {%- capture link -%}https://github.com/{{ site.github_username }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.linkedin_username -%} - {%- capture link -%}https://www.linkedin.com/in/{{ site.linkedin_username }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.twitter_username -%} - {%- capture link -%}https://twitter.com/{{ site.twitter_username }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.medium_username -%} - {%- capture link -%}https://medium.com/@{{ site.medium_username }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.quora_username -%} - {%- capture link -%}https://www.quora.com/profile/{{ site.quora_username }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.blogger_url -%} - {%- capture link -%}{{ site.blogger_url 
}}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.work_url -%} - {%- capture link -%}{{ site.work_url }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.wikidata_id -%} - {%- capture link -%}https://www.wikidata.org/wiki/{{ site.wikidata_id }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.strava_userid -%} - {%- capture link -%}https://www.strava.com/athletes/{{ site.strava_userid }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.keybase_username -%} - {%- capture link -%}https://keybase.io/{{ site.keybase_username }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.gitlab_username -%} - {%- capture link -%}https://gitlab.com/{{ site.gitlab_username }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.dblp_url -%} - {%- capture link -%}{{ site.dblp_url }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.stackoverflow_id -%} - {%- capture link -%}https://stackoverflow.com/users/{{ site.stackoverflow_id }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.kaggle_id -%} - {%- capture link -%}https://www.kaggle.com/{{ site.kaggle_id }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.lastfm_id -%} - {%- capture link -%}https://www.last.fm/user/{{ site.lastfm_id }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.spotify_id -%} - {%- capture link -%}https://open.spotify.com/user/{{ site.spotify_id }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.pinterest_id -%} - {%- capture link -%}https://www.pinterest.com/{{ site.pinterest_id }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.unsplash_id -%} - {%- capture link -%}https://unsplash.com/@{{ site.unsplash_id }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.instagram_id -%} - {%- capture link -%}https://instagram.com/{{ site.instagram_id }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.facebook_id -%} - {%- capture link -%}https://facebook.com/{{ site.facebook_id }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if site.discord_id -%} - {%- capture link -%}https://discord.com/users/{{ site.discord_id }}{%- endcapture -%} - {%- assign sameaslinks = sameaslinks | push: link -%} - {%- endif -%} - {%- if sameaslinks != blank -%} - {%- assign sameaslinks = sameaslinks | split: "" -%} - {%- endif -%} - - -{%- endif %} diff --git a/_includes/news.html b/_includes/news.html deleted file mode 100644 index 307e532d..00000000 --- a/_includes/news.html +++ /dev/null @@ -1,31 +0,0 @@ - -
-

news

- {% if site.news != blank -%} - {%- assign news_size = site.news | size -%} -
3 %}style="max-height: 10vw"{% endif %}> - - {%- assign news = site.news | reverse -%} - {% if site.news_limit %} - {% assign news_limit = site.news_limit %} - {% else %} - {% assign news_limit = news_size %} - {% endif %} - {% for item in news limit: news_limit %} - - - - - {%- endfor %} -
{{ item.date | date: "%b %-d, %Y" }} - {% if item.inline -%} - {{ item.content | remove: '<p>' | remove: '</p>' | emojify }} - {%- else -%} - {{ item.title }} - {%- endif %} -
-
- {%- else -%} - <p>No news so far...</p> - {%- endif %} -
diff --git a/_includes/pagination.html b/_includes/pagination.html deleted file mode 100644 index 4b8d27e3..00000000 --- a/_includes/pagination.html +++ /dev/null @@ -1,17 +0,0 @@ -{%- if paginator.total_pages > 1 -%} - -{%- endif -%} diff --git a/_includes/people.html b/_includes/people.html deleted file mode 100644 index b5a79f1f..00000000 --- a/_includes/people.html +++ /dev/null @@ -1,16 +0,0 @@ - -
-
- -
- {%- include figure.html - path=include.img - alt=include.name - -%} -
-
{{- include.name -}}
-

{{- include.affiliation -}}

-
-
-
-
diff --git a/_includes/people_horizontal.html b/_includes/people_horizontal.html deleted file mode 100644 index 957bc768..00000000 --- a/_includes/people_horizontal.html +++ /dev/null @@ -1,17 +0,0 @@ -
- -
-
-
- {% include figure.html path=include.img alt=include.name %} -
-
-
-
{{ include.name }}
-

{{ include.affiliation }}

-
-
-
-
-
-
diff --git a/_includes/projects.html b/_includes/projects.html deleted file mode 100644 index 503146e2..00000000 --- a/_includes/projects.html +++ /dev/null @@ -1,36 +0,0 @@ - -
-
- {% if project.redirect -%} - - {%- else -%} - - {%- endif %} -
- {%- if project.img %} - {%- include figure.html - path=project.img - alt="project thumbnail" -%} - {%- endif %} -
-

{{ project.title }}

-

{{ project.description }}

-
- {%- if project.github -%} -
-
- -
- {%- if project.github_stars -%} - - - - - {%- endif %} -
- {%- endif %} -
-
-
- -
\ No newline at end of file diff --git a/_includes/projects_horizontal.html b/_includes/projects_horizontal.html deleted file mode 100644 index ddf74058..00000000 --- a/_includes/projects_horizontal.html +++ /dev/null @@ -1,40 +0,0 @@ -
- {%- if project.redirect -%} - - {%- else -%} - - {%- endif -%} -
- - -
diff --git a/_includes/repository/repo.html b/_includes/repository/repo.html deleted file mode 100644 index 6344b860..00000000 --- a/_includes/repository/repo.html +++ /dev/null @@ -1,14 +0,0 @@ -{% assign repo_url = include.repository | split: '/' %} - -{% if site.data.repositories.github_users contains repo_url.first %} - {% assign show_owner = false %} -{% else %} - {% assign show_owner = true %} -{% endif %} - -
- - {{ include.repository }} - {{ include.repository }} - -
diff --git a/_includes/repository/repo_user.html b/_includes/repository/repo_user.html deleted file mode 100644 index ae06a058..00000000 --- a/_includes/repository/repo_user.html +++ /dev/null @@ -1,6 +0,0 @@ -
- - {{ include.username }} - {{ include.username }} - -
diff --git a/_includes/scripts/analytics.html b/_includes/scripts/analytics.html deleted file mode 100644 index db2aeef9..00000000 --- a/_includes/scripts/analytics.html +++ /dev/null @@ -1,18 +0,0 @@ -{%- if site.enable_google_analytics -%} - - - -{%- endif -%} -{%- if site.enable_panelbear_analytics -%} - - - -{%- endif -%} diff --git a/_includes/scripts/bootstrap.html b/_includes/scripts/bootstrap.html deleted file mode 100644 index 1c213650..00000000 --- a/_includes/scripts/bootstrap.html +++ /dev/null @@ -1,3 +0,0 @@ - - - diff --git a/_includes/scripts/jquery.html b/_includes/scripts/jquery.html deleted file mode 100644 index f84a2f22..00000000 --- a/_includes/scripts/jquery.html +++ /dev/null @@ -1,2 +0,0 @@ - - diff --git a/_includes/scripts/masonry.html b/_includes/scripts/masonry.html deleted file mode 100644 index 804389d3..00000000 --- a/_includes/scripts/masonry.html +++ /dev/null @@ -1,6 +0,0 @@ - {%- if site.enable_masonry -%} - - - - - {%- endif -%} diff --git a/_includes/scripts/mathjax.html b/_includes/scripts/mathjax.html deleted file mode 100644 index c55ec056..00000000 --- a/_includes/scripts/mathjax.html +++ /dev/null @@ -1,12 +0,0 @@ - {%- if site.enable_math -%} - - - - - {%- endif %} diff --git a/_includes/scripts/misc.html b/_includes/scripts/misc.html deleted file mode 100644 index 08ba49f0..00000000 --- a/_includes/scripts/misc.html +++ /dev/null @@ -1,14 +0,0 @@ -{% if site.enable_tooltips %} - - -{%- endif %} -{%- if site.enable_medium_zoom %} - - - -{%- endif -%} - - - diff --git a/_includes/selected_papers.html b/_includes/selected_papers.html deleted file mode 100644 index 61457dbc..00000000 --- a/_includes/selected_papers.html +++ /dev/null @@ -1,5 +0,0 @@ - -
-

selected publications

- {% bibliography -f papers -q @*[selected=true]* %} -
diff --git a/_includes/social.html b/_includes/social.html deleted file mode 100644 index 8c7a079c..00000000 --- a/_includes/social.html +++ /dev/null @@ -1,84 +0,0 @@ - {%- if site.email -%} - - {% endif %} - {%- if site.orcid_id -%} - - {% endif %} - {%- if site.scholar_userid -%} - - {% endif %} - {%- if site.semanticscholar_id -%} - - {% endif %} - {%- if site.publons_id -%} - - {% endif %} - {%- if site.research_gate_profile -%} - - {% endif %} - {%- if site.github_username -%} - - {% endif %} - {%- if site.linkedin_username -%} - - {% endif %} - {%- if site.twitter_username -%} - - {% endif %} - {%- if site.medium_username -%} - - {% endif %} - {%- if site.quora_username -%} - - {% endif %} - {%- if site.blogger_url -%} - - {% endif %} - {%- if site.work_url -%} - - {% endif %} - {%- if site.wikidata_id -%} - - {% endif %} - {%- if site.strava_userid -%} - - {% endif %} - {%- if site.keybase_username -%} - - {% endif %} - {%- if site.gitlab_username -%} - - {% endif %} - {%- if site.dblp_url -%} - - {% endif %} - {%- if site.stackoverflow_id -%} - - {% endif %} - {%- if site.kaggle_id -%} - - {% endif %} - {%- if site.lastfm_id -%} - - {% endif %} - {%- if site.spotify_id -%} - - {% endif %} - {%- if site.pinterest_id -%} - - {% endif %} - {%- if site.unsplash_id -%} - - {% endif %} - {%- if site.instagram_id -%} - - {% endif %} - {%- if site.facebook_id -%} - - {% endif %} - {%- if site.discord_id -%} - - {% endif %} - {%- if site.rss_icon -%} - - {% endif %} diff --git a/_layouts/about.html b/_layouts/about.html deleted file mode 100644 index d3628377..00000000 --- a/_layouts/about.html +++ /dev/null @@ -1,66 +0,0 @@ ---- -layout: default ---- - - -
-
- -

{{ page.subtitle }}

-
- -
- {% if page.profile -%} -
- {%- if page.profile.image %} - {%- assign profile_image_path = page.profile.image | prepend: 'assets/img/' -%} - - {% if page.profile.image_circular %} - {%- assign profile_image_class = "img-fluid z-depth-1 rounded-circle" -%} - {% else %} - {%- assign profile_image_class = "img-fluid z-depth-1 rounded" -%} - {% endif %} - - {% include figure.html - path=profile_image_path - class=profile_image_class - alt=page.profile.image -%} - {% endif -%} - {%- if page.profile.address %} -
- {{ page.profile.address }} -
- {%- endif %} -
- {%- endif %} - -
- {{ content }} -
- - {% if page.news -%} - - {%- include news.html %} - {%- endif %} - {% if page.selected_papers -%} - - {%- include selected_papers.html %} - {%- endif %} - {%- if page.social %} - - - {%- endif %} -
- -
diff --git a/_layouts/archive-category.html b/_layouts/archive-category.html deleted file mode 100644 index 79aad74f..00000000 --- a/_layouts/archive-category.html +++ /dev/null @@ -1,27 +0,0 @@ ---- -layout: default ---- - -
- -
-

{{ page.title }}

-

an archive of posts in this category

-
- -
-
- - {% for post in page.posts %} - - - - - {% endfor %} -
{{ post.date | date: "%b %-d, %Y" }} - {{ post.title }} -
-
-
- -
diff --git a/_layouts/archive-tag.html b/_layouts/archive-tag.html deleted file mode 100644 index 66abaebb..00000000 --- a/_layouts/archive-tag.html +++ /dev/null @@ -1,27 +0,0 @@ ---- -layout: default ---- - -
- -
-

{{ page.title }}

-

an archive of posts with this tag

-
- -
-
- - {% for post in page.posts %} - - - - - {% endfor %} -
{{ post.date | date: "%b %-d, %Y" }} - {{ post.title }} -
-
-
- -
diff --git a/_layouts/archive-year.html b/_layouts/archive-year.html deleted file mode 100644 index 8af1d29b..00000000 --- a/_layouts/archive-year.html +++ /dev/null @@ -1,27 +0,0 @@ ---- -layout: default ---- - -
- -
-

{{ page.date | date: "%Y" }}

-

an archive of posts from this year

-
- -
-
- - {% for post in page.posts %} - - - - - {% endfor %} -
{{ post.date | date: "%b %-d, %Y" }} - {{ post.title }} -
-
-
- -
diff --git a/_layouts/bib.html b/_layouts/bib.html deleted file mode 100644 index eb6520a2..00000000 --- a/_layouts/bib.html +++ /dev/null @@ -1,196 +0,0 @@ ---- ---- - -
-
- {%- if entry.preview -%} - {% if entry.preview contains '://' -%} - - {%- else -%} - - {%- endif -%} - {%- elsif entry.abbr -%} - {%- if site.data.venues[entry.abbr] -%} - {%- assign venue_style = nil -%} - {%- if site.data.venues[entry.abbr].color != blank -%} - {%- assign venue_style = site.data.venues[entry.abbr].color | prepend: 'style="background-color:' | append: '"' -%} - {%- endif -%} - {{entry.abbr}} - {%- else -%} - {{entry.abbr}} - {%- endif -%} - {%- endif -%} -
- - -
- {% if entry.type == "thesis" -%} - {{reference}} - {%- else %} - -
{{entry.title}}
- -
- {% assign author_array_size = entry.author_array | size %} - - {% assign author_array_limit = author_array_size %} - {%- if site.max_author_limit and author_array_size > site.max_author_limit %} - {% assign author_array_limit = site.max_author_limit %} - {% endif %} - - {%- for author in entry.author_array limit: author_array_limit -%} - {%- assign author_is_self = false -%} - {%- assign author_last_name = author.last | remove: "¶" | remove: "&" | remove: "*" | remove: "†" | remove: "^" -%} - {%- if site.scholar.last_name contains author_last_name -%} - {%- if site.scholar.first_name contains author.first -%} - {%- assign author_is_self = true -%} - {%- endif -%} - {%- endif -%} - {%- assign coauthor_url = nil -%} - {%- if site.data.coauthors[author_last_name] -%} - {%- for coauthor in site.data.coauthors[author_last_name] -%} - {%- if coauthor.firstname contains author.first -%} - {%- assign coauthor_url = coauthor.url -%} - {%- break -%} - {%- endif -%} - {%- endfor -%} - {%- endif -%} - - {%- if forloop.length > 1 -%} - {%- if forloop.first == false -%}, {%- endif -%} - {%- if forloop.last and author_array_limit == author_array_size -%}and {%- endif -%} - {%- endif -%} - {%- if author_is_self -%} - {{author.first}} {{author.last}} - {%- else -%} - {%- if coauthor_url -%} - {{author.first}} {{author.last}} - {%- else -%} - {{author.first}} {{author.last}} - {%- endif -%} - {%- endif -%} - {%- endfor -%} - {%- assign more_authors = author_array_size | minus: author_array_limit -%} - - {%- assign more_authors_hide = more_authors | append: " more author" -%} - {%- if more_authors > 0 -%} - {%- if more_authors > 1 -%} - {%- assign more_authors_hide = more_authors_hide | append: "s" -%} - {%- endif -%} - {%- assign more_authors_show = '' -%} - {%- for author in entry.author_array offset: author_array_limit -%} - {%- assign more_authors_show = more_authors_show | append: author.first | append: " " | append: author.last -%} - {%- unless forloop.last -%} - {%- assign more_authors_show = more_authors_show | append: ", " -%} - {%- endunless -%} - {%- endfor -%} - , and - {{more_authors_hide}} - {%- endif -%} - -
- - - {% assign proceedings = "inproceedings, incollection" | split: ','%} - {% if entry.type == "article" -%} - {%- capture entrytype -%}{{entry.journal}}{%- endcapture -%} - {%- elsif proceedings contains entry.type -%} - {%- capture entrytype -%}In {{entry.booktitle}} {%- endcapture -%} - {%- else -%} - {%- capture entrytype -%}{%- endcapture -%} - {%- endif -%} - {%- if entry.month -%} - {%- capture entrymonth -%}{{ " " }}{{ entry.month | capitalize }}{%- endcapture -%} - {%- endif -%} - {%- if entry.year -%} - {%- capture entryyear -%}{{ " " }}{{entry.year}}{%- endcapture -%} - {%- endif -%} - {%- capture periodical -%}{{ entrytype }}{{ entrymonth }}{{ entryyear }}{%- endcapture -%} -
- {{ periodical | strip }} -
- {%- endif %} - - - - - {% if entry.abstract -%} - - - {%- endif -%} - - {% if entry.bibtex_show -%} - - - {%- endif %} -
-
diff --git a/_layouts/cv.html b/_layouts/cv.html deleted file mode 100644 index bb3d85af..00000000 --- a/_layouts/cv.html +++ /dev/null @@ -1,35 +0,0 @@ ---- -layout: default ---- - -
- -
-

{{ page.title }} {% if page.cv_pdf %}{% endif %}

-

{{ page.description }}

-
- -
-
- {% for entry in site.data.cv %} -
-

{{ entry.title }}

-
- {% if entry.type == "list" %} - {% include cv/list.html %} - {% elsif entry.type == "map" %} - {% include cv/map.html %} - {% elsif entry.type == "nested_list" %} - {% include cv/nested_list.html %} - {% elsif entry.type == "time_table" %} - {% include cv/time_table.html %} - {% else %} - {{ entry.contents }} - {% endif %} -
-
- {% endfor %} -
-
- -
diff --git a/_layouts/default.html b/_layouts/default.html deleted file mode 100644 index 1001a5b5..00000000 --- a/_layouts/default.html +++ /dev/null @@ -1,36 +0,0 @@ - - - - - - {%- if page.redirect -%} - - {%- endif -%} - {% include head.html %} - - - - - - - {%- include header.html %} - -
- - -
- {{ content }} -
- - - - - - {% include scripts/jquery.html %} - {% include scripts/bootstrap.html %} - {% include scripts/masonry.html %} - {% include scripts/misc.html %} - {% include scripts/mathjax.html %} - {% include scripts/analytics.html %} - - diff --git a/_layouts/distill.html b/_layouts/distill.html deleted file mode 100644 index f70ee34b..00000000 --- a/_layouts/distill.html +++ /dev/null @@ -1,197 +0,0 @@ - - - - - - - - {%- include head.html %} - - {% include scripts/jquery.html %} - {% include scripts/mathjax.html %} - - - - - {% if page._styles %} - - - {%- endif %} - - - - - - - - - - {%- include header.html %} - - -
- - -

{{ page.title }}

-

{{ page.description }}

-
- - - - - {% if page.toc -%} - - - - {%- endif %} - - {{ content }} - - - - - - - -
- - - - - For attribution in academic contexts, please cite this work as -
-        PLACEHOLDER FOR ACADEMIC ATTRIBUTION
-  
- - BibTeX citation -
-        PLACEHOLDER FOR BIBTEX
-  
-
- - - - - {% include scripts/bootstrap.html %} - {% include scripts/analytics.html %} - - - - diff --git a/_layouts/none.html b/_layouts/none.html deleted file mode 100644 index b92f6522..00000000 --- a/_layouts/none.html +++ /dev/null @@ -1 +0,0 @@ -{{content}} diff --git a/_layouts/page.html b/_layouts/page.html deleted file mode 100644 index 1452f7e0..00000000 --- a/_layouts/page.html +++ /dev/null @@ -1,28 +0,0 @@ ---- -layout: default ---- - -
- - - - - {{ content }} - - -
diff --git a/_layouts/post.html b/_layouts/post.html deleted file mode 100644 index bbe2477f..00000000 --- a/_layouts/post.html +++ /dev/null @@ -1,85 +0,0 @@ ---- -layout: default ---- - -{%- assign year = page.date | date: "%Y" -%} -{%- assign tags = page.tags | join: "" -%} -{%- assign categories = page.categories | join: "" -%} - -{% if page._styles %} - - -{% endif %} - -
- -
-

{{ page.title }}

- - -
- -
- {{ content }} -
- - - - - - {%- if site.disqus_shortname and page.comments -%} -
- - - {%- endif %} - - - -
diff --git a/_news/announcement_1.md b/_news/announcement_1.md
deleted file mode 100644
index 98e5af5c..00000000
--- a/_news/announcement_1.md
+++ /dev/null
@@ -1,7 +0,0 @@
----
-layout: post
-date: 2015-10-22 15:59:00-0400
-inline: true
----
-
-A simple inline announcement.
diff --git a/_news/announcement_2.md b/_news/announcement_2.md
deleted file mode 100644
index dbd4b4d4..00000000
--- a/_news/announcement_2.md
+++ /dev/null
@@ -1,31 +0,0 @@
----
-layout: post
-title: A long announcement with details
-date: 2015-11-07 16:11:00-0400
-inline: false
----
-
-Announcements and news can be much longer than just quick inline posts. In fact, they can have all the features available for the standard blog posts. See below.
-
-***
-
-Jean shorts raw denim Vice normcore, art party High Life PBR skateboard stumptown vinyl kitsch. Four loko meh 8-bit, tousled banh mi tilde forage Schlitz dreamcatcher twee 3 wolf moon. Chambray asymmetrical paleo salvia, sartorial umami four loko master cleanse drinking vinegar brunch. Pinterest DIY authentic Schlitz, hoodie Intelligentsia butcher trust fund brunch shabby chic Kickstarter forage flexitarian. Direct trade cold-pressed meggings stumptown plaid, pop-up taxidermy. Hoodie XOXO fingerstache scenester Echo Park. Plaid ugh Wes Anderson, freegan pug selvage fanny pack leggings pickled food truck DIY irony Banksy.
-
-#### Hipster list
-
-Hoodie Thundercats retro, tote bag 8-bit Godard craft beer gastropub. Truffaut Tumblr taxidermy, raw denim Kickstarter sartorial dreamcatcher. Quinoa chambray slow-carb salvia readymade, bicycle rights 90's yr typewriter selfies letterpress cardigan vegan.
-
-***
-
-Pug heirloom High Life vinyl swag, single-origin coffee four dollar toast taxidermy reprehenderit fap distillery master cleanse locavore. Est anim sapiente leggings Brooklyn ea. Thundercats locavore excepteur veniam eiusmod. Raw denim Truffaut Schlitz, migas sapiente Portland VHS twee Bushwick Marfa typewriter retro id keytar.
-
-> We do not grow absolutely, chronologically. We grow sometimes in one dimension, and not in another, unevenly. We grow partially. We are relative. We are mature in one realm, childish in another.
-> —Anais Nin
-
-Fap aliqua qui, scenester pug Echo Park polaroid irony shabby chic ex cardigan church-key Odd Future accusamus. Blog stumptown sartorial squid, gastropub duis aesthetic Truffaut vero. Pinterest tilde twee, odio mumblecore jean shorts lumbersexual.
diff --git a/_news/announcement_3.md b/_news/announcement_3.md
deleted file mode 100644
index d9072191..00000000
--- a/_news/announcement_3.md
+++ /dev/null
@@ -1,7 +0,0 @@
----
-layout: post
-date: 2016-01-15 07:59:00-0400
-inline: true
----
-
-A simple inline announcement with Markdown emoji! :sparkles: :smile:
diff --git a/_pages/about.md b/_pages/about.md
deleted file mode 100644
index 5d7240e4..00000000
--- a/_pages/about.md
+++ /dev/null
@@ -1,196 +0,0 @@
----
-layout: about
-title: about
-permalink: /about/
-nav: true
-nav_order: 1
-subtitle:
-
-# profile:
-#   align: right
-#   image:
-#   image_circular: false # crops the image to make it circular
-#   address:
-
-# news: false  # includes a list of news items
-# selected_papers: false # includes a list of papers marked as "selected={true}"
-# social: false  # includes social icons at the bottom of the page
----
-
-**Announcements**:
-- The 22 accepted blog posts for 2024 are now published!
Check our [press release](https://blog.iclr.cc/2024/04/02/blogposts-track-iclr-2023-announcing-accepted-blogposts/) for an overview, or dive directly in on the [Blog page](https://iclr-blogposts.github.io/2024/blog/index.html).
-- More information regarding the poster session will be available soon.
-
-## Contents
-
-- [ICLR 2024 Blogposts Track](#iclr-2024-blogposts-track)
-- [Spotlight Posts](#spotlight)
-- [Accepted Posts](#accepted-posts)
-- [Key Dates](#key-dates)
-- [Submissions](#submissions)
-- [Organizers](#organizers)
-
-# ICLR 2024 Blogposts Track
-
-The Machine Learning community is currently experiencing a [reproducibility crisis](https://neuripsconf.medium.com/designing-the-reproducibility-program-for-neurips-2020-7fcccaa5c6ad) and a reviewing crisis [[Littman, 2021]](#Litt). Because of the highly competitive and noisy reviewing process of ML conferences [[Tran et al., 2020]](#Tran), researchers have an incentive to oversell their results, slowing down progress and diminishing the integrity of the scientific community. Moreover, with the growing number of papers published and submitted at the main ML conferences [[Lin et al., 2020]](#Lin), it has become more challenging to keep track of the latest advances in the field.
-
-Blog posts are becoming an increasingly popular and useful way to talk about science [[Brown and Woolston, 2018]](#Brow). They offer substantial value to the scientific community by providing a flexible platform to foster open, human, and transparent discussions about new insights or limitations of a scientific publication. However, because they are not as recognized as standard scientific publications, only a minority of researchers manage to maintain an active blog and get visibility for their efforts. Many are well-established researchers ([Francis Bach](https://francisbach.com/), [Ben Recht](https://www.argmin.net/), [Ferenc Huszár](https://www.inference.vc/), [Lilian Weng](https://lilianweng.github.io/lil-log/)) or big corporations that leverage entire teams of graphic designers and writers to polish their blogs ([Facebook AI](https://ai.facebook.com/blog/?page=1), [Google AI](https://ai.googleblog.com/), [DeepMind](https://deepmind.com/blog), [OpenAI](https://openai.com/blog/)). As a result, the incentives for writing scientific blog posts are largely personal; it is unreasonable to expect a significant portion of the machine learning community to contribute to such an initiative when everyone is trying to establish themselves through publications.
-
-**Submit** your blogpost on [Openreview](https://openreview.net/group?id=ICLR.cc/2024/BlogPosts&referrer=%5BHomepage%5D(%2F))
-
-## A Blog Post Conference Track
-
-Last year, we ran the **second** iteration of the [Blogpost track](https://iclr-blogposts.github.io/2023/about) at ICLR 2023!
-
-It was very successful, with accepted posts presented in person at the main conference.
-
-Our goal is to create a formal call for blog posts at ICLR to incentivize and reward researchers who review past work, summarize the outcomes, develop new intuitions, or highlight some shortcomings. A very influential initiative of this kind happened after the Second World War in France. Because of the lack of up-to-date textbooks, a collective of mathematicians under the pseudonym Nicolas Bourbaki [[Halmos 1957]](#Halm) decided to start a series of textbooks about the foundations of mathematics [[Bourbaki, 1939]](#Bour).
In the same vein, we aim to provide a new way to summarize scientific knowledge in the ML community.
-
-Due to the large diversity of topics that can be discussed in a blog post, we decided to restrict the range of topics for this call. We identified that the blog posts that would bring the most value to the community and the conference are posts that distill and discuss *previously published papers*.
-
-## Spotlight
-
-**[The N Implementation Details of RLHF with PPO]({% post_url 2024-05-07-the-n-implementation-details-of-rlhf-with-ppo %})**
-:      _Shengyi Costa Huang, Tianlin Liu, Leandro von Werra_
-
-**[How to compute Hessian-vector products?]({% post_url 2024-05-07-bench-hvp %})**
-:      _Mathieu Dagréou, Pierre Ablin, Samuel Vaiter, Thomas Moreau_
-
-**[Bridging the Data Processing Inequality and Function-Space Variational Inference]({% post_url 2024-05-07-dpi-fsvi %})**
-:      _Andreas Kirsch_
-
-## Accepted Posts
-
-**[Understanding in-context learning in transformers]({% post_url 2024-05-07-understanding-icl %})**
-:      _Simone Rossi, Rui Yuan, Thomas Hannagan_
-
-**[Behavioral Differences in Mode-Switching Exploration for Reinforcement Learning]({% post_url 2024-05-07-mode-switching %})**
-:      _Loren J Anderson_
-
-**[Fairness in AI: two philosophies or just one?]({% post_url 2024-05-07-fairness-ai-two-phil-or-just-one %})**
-:      _MaryBeth Defrance_
-
-**[Towards Robust Foundation Models: Adversarial Contrastive Learning]({% post_url 2024-05-07-robust-foundation-model %})**
-:      _Jingfeng Zhang, Xilie Xu_
-
-**[A New Alchemy: Language Model Development as a Subfield?]({% post_url 2024-05-07-language-model-development-as-a-new-subfield %})**
-:      _Colin Raffel_
-
-**[Understanding gradient inversion attacks from the prior knowledge perspective]({% post_url 2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective %})**
-:      _Yanbo Wang, Jian Liang, Ran He_
-
-**[Building Diffusion Model's theory from ground up]({% post_url 2024-05-07-diffusion-theory-from-scratch %})**
-:      _Ayan Das_
-
-**[Masked Language Model with ALiBi and CLAP head]({% post_url 2024-05-07-alibi-mlm %})**
-:      _Jason Chuan-Chih Chou_
-
-**[What exactly has TabPFN learned to do?]({% post_url 2024-05-07-what-exactly-has-tabpfn-learned-to-do %})**
-:      _Calvin McCarter_
-
-**[Elaborating on the Value of Flow Matching for Density Estimation]({% post_url 2024-05-07-elaborating-on-the-value-of-flow-matching-for-density-estimation %})**
-:      _Maternus Herold, Faried Abu Zaid_
-
-**[The Hidden Convex Optimization Landscape of Two-Layer ReLU Networks]({% post_url 2024-05-07-hidden-convex-relu %})**
-:      _Victor Mercklé, Franck Iutzeler, Ievgen Redko_
-
-**[Deep Equilibrium Models For Algorithmic Reasoning]({% post_url 2024-05-07-deqalg-reasoning %})**
-:      _Sophie Xhonneux, Yu He, Andreea Deac, Jian Tang, Gauthier Gidel_
-
-**[Fair Model-Based Reinforcement Learning Comparisons with Explicit and Consistent Update Frequency]({% post_url 2024-05-07-update-frequency-in-mbrl %})**
-:      _Albert Thomas, Abdelhakim Benechehab, Giuseppe Paolo, Balázs Kégl_
-
-**[Exploring Meta-learned Curiosity Algorithms]({% post_url 2024-05-07-exploring-meta-learned-curiosity-algorithms %})**
-:      _Batsirayi Mupamhi Ziki_
-
-**[Unraveling The Impact of Training Samples]({% post_url 2024-05-07-unraveling-the-impact-of-training-samples %})**
-:      _Daiwei Chen, Jane Zhang, Ramya Korlakai Vinayak_
-
-**[RLHF without RL - Direct Preference 
Optimization]({% post_url 2024-05-07-rlhf-without-rl %})**
-:      _Michael Panchenko_
-
-**[It's Time to Move On: Primacy Bias and Why It Helps to Forget]({% post_url 2024-05-07-primacy-bias-and-why-it-helps-to-forget %})**
-:      _Matthew Kielo, Vladimir Lukin_
-
-**[Double Descent Demystified]({% post_url 2024-05-07-double-descent-demystified %})**
-:      _Rylan Schaeffer, Zachary Robertson, Akhilan Boopathy, Mikail Khona, Kateryna Pistunova, Jason W. Rocks, Ila R. Fiete, Andrey Gromov, Sanmi Koyejo_
-
-**[On Bayesian Model Selection: The Marginal Likelihood, Cross-Validation, and Conditional Log Marginal Likelihood]({% post_url 2024-05-07-clml %})**
-:      _Andreas Kirsch_
-
-
-## Key Dates
-
-**Abstract deadline**: December 11th 00:00GMT, 2023 (submit to OpenReview - to be announced soon).
-
-**Submission deadline**: December 17th 00:00GMT, 2023 (any modifications to your blog post, via a pull request on GitHub).
-
-**Decision Notification**: ~~January 30th, 2024~~ UPDATED: February 15th, 2024
-
-**Camera-ready merge**: March 15th, 2024
-
-## A call for blog posts discussing work previously published at ICLR
-
-#### Content
-
-Write a post on a subject that has been published at a top-tier venue (ICLR, ICML, NeurIPS, AAAI, UAI, CVPR, SIGGRAPH, ECCV, ICCV, etc.) relatively recently.
-
-#### Conflict of interest
-
-The authors of the blog posts will have to declare their conflicts of interest (positive or negative) with the paper (and the paper's authors) they write about. Conflicts of interest include:
-- Recent collaborators (less than 3 years)
-- Current institution
-
-**Blog posts must not be used to highlight or advertise past publications of the authors or their lab.**
-
-We will only ask the authors to report if they have a conflict of interest. If so, reviewers will be asked to judge if the submission is sufficiently critical and objective of the papers addressed in the blog post.
-
-
-## Publication
-
-#### Blog post
-
-The posts will be created and published under a unified template; see [the submission instructions]({{ '/submitting' | relative_url }}) and the [sample post]({% post_url 2024-05-07-distill-example %}) hosted on the blog of this website.
-
-#### Poster
-Additionally, accepted posts will have the option to present their work as a poster during the main poster session. For more information about the main poster session (time, poster format, etc.) please refer to the ICLR homepage.
-
-## Submissions
-
-Our goal is to avoid heavily engineered, professionally-made blog posts ---such as the "100+ hours" mentioned as a standard by the [Distill guidelines](https://distill.pub/journal/)---and to favor ideas and clear writing over dynamic visualizations or embedded JavaScript engines.
-Please check our [submission instructions]({{ '/submitting' | relative_url }}) for more details.
-We accept submissions in both Markdown and HTML. We believe this is a good trade-off between complexity and flexibility.
-
-**Submit** your blogpost on [Openreview](https://openreview.net/group?id=ICLR.cc/2024/BlogPosts&referrer=%5BHomepage%5D(%2F))
-
-## Contact
-
-For any technical issues with the blog post repository (for example, blog posts not displaying correctly or issues while following the [submission instructions](https://iclr-blogposts.github.io/2024/submitting/#creating-a-blog-post)), please open an [issue in our GitHub repository](https://github.com/iclr-blogposts/2024/issues).
-
-For other inquiries, reach us via email at: [blog.track.chairs@gmail.com](mailto:blog.track.chairs@gmail.com)
-
-## Organizers
-
-
- {% include people_horizontal.html name="Gauthier Gidel" affiliation="Mila, Université de Montréal" url="https://gauthiergidel.github.io/" img="assets/img/organizers/gg.jpg" %} - {% include people_horizontal.html name="Charlie Gauthier" affiliation="Mila, Université de Montréal" url="https://velythyl.github.io/" img="assets/img/organizers/cg.jpg" %} - {% include people_horizontal.html name="David Dobre" affiliation="Mila, Université de Montréal" url="" img="assets/img/organizers/dd.jpg" %} - {% include people_horizontal.html name="Claire Vernade" affiliation="University of Tuebingen" url="https://www.cvernade.com/" img="assets/img/organizers/cv.jpg" %} - {% include people_horizontal.html name="Fabian Pedregosa" affiliation="Google DeepMind" url="https://fa.bianp.net/pages/about.html" img="assets/img/organizers/fp.jpg" %} - {% include people_horizontal.html name="Leo Schwinn" affiliation="Technical University of Munich" url="https://schwinnl.github.io//" img="assets/img/organizers/ls.jpg" %} -
-
----
-
-## References
-
-Michael L Littman. Collusion rings threaten the integrity of computer science research. Communications of the ACM, 2021.
-
-David Tran, Alex Valtchanov, Keshav Ganapathy, Raymond Feng, Eric Slud, Micah Goldblum, and Tom Goldstein. An open review of OpenReview: A critical analysis of the machine learning conference review process. arXiv, 2020.
-
-Hsuan-Tien Lin, Maria-Florina Balcan, Raia Hadsell, and Marc'Aurelio Ranzato. What we learned from NeurIPS 2020 reviewing process. Medium, https://medium.com/@NeurIPSConf/what-we-learned-from-neurips-2020-reviewing-process-e24549eea38f, 2020.
-
-Eryn Brown and Chris Woolston. Why science blogging still matters. Nature, 2018.
-
-Paul R Halmos. Nicolas Bourbaki. Scientific American, 1957.
-
-Nicolas Bourbaki. Elements of mathematics. Éditions Hermann, 1939.
diff --git a/_pages/call.md b/_pages/call.md
deleted file mode 100644
index c7ad4dbd..00000000
--- a/_pages/call.md
+++ /dev/null
@@ -1,69 +0,0 @@
----
-layout: page
-title: call for blogposts
-permalink: /call/
-description:
-nav: true
-nav_order: 2
----
-
-**Submit** your blogpost on [Openreview](https://openreview.net/group?id=ICLR.cc/2024/BlogPosts&referrer=%5BHomepage%5D(%2F))
-
-# Call for blog posts
-
-We invite all researchers and practitioners to submit a blog post discussing work previously published at a top-tier venue to the ICLR 2024 blog post track.
-The format and process for this blog post track are described below.
-
-### Content
-
-Write a post on a subject that has been published at a top-tier venue (ICLR, ICML, NeurIPS, AAAI, UAI, CVPR, SIGGRAPH, ECCV, ICCV, etc.) relatively recently.
-Past blog posts can be accessed [here](https://iclr-blogposts.github.io/2023/about).
-
-### Conflict of interest
-
-The authors of the blog posts will have to declare their conflicts of interest (positive or negative) with the paper (and the paper's authors) they write about.
-Conflicts of interest include:
-
- - Recent collaborators (less than 3 years)
- - Current institution
-
-Reviewers will be asked to judge if the submission is sufficiently critical and objective of the papers addressed in the blog post.
-**Blog posts must not be used to highlight or advertise past publications of the authors or of their lab**.
-
-
-### Publication
-
-##### Blog post
-
-The posts will be created and published under a unified template; see [the submission instructions]({{ '/submitting' | relative_url }}) and the [sample post]({{ '/blog/2024/distill-example' | relative_url }}) hosted on the blog of this website.
-
-##### Poster
-Additionally, accepted posts will have the option to present their work as a poster during the main poster session. For more information about the main poster session (time, poster format, etc.) please refer to the ICLR homepage.
-
-### Review
-
-Blogs will be peer-reviewed (double-blind) for quality and novelty of the content: clarity and pedagogy of the exposition, new theoretical or practical insights, reproduction/extension of experiments, etc.
-The review is dual-anonymous, assuming good faith from both submitters and reviewers (see [the submission instructions]({{ '/submitting' | relative_url }}) for more details).
-
-## Key Dates
-- **Abstract deadline**: December 11th 00:00GMT, 2023 ([submit to OpenReview](https://openreview.net/group?id=ICLR.cc/2024/BlogPosts&referrer=%5BHomepage%5D(%2F))).
-
-- **Submission deadline**: December 17th 00:00GMT, 2023 (any modifications to your blog post, via a pull request on GitHub).
-
-- **Notification of acceptance**: ~~January 30th, 2024~~ UPDATED: February 15th, 2024
-
-- **Camera-ready merge**: March 15th, 2024
-
-
-### Contact
-
-For answers to many common questions, please refer to the ICLR [FAQ](https://iclr.cc/FAQ).
-
-Should you have other inquiries, please don't hesitate to reach out via email at: [blog.track.chairs@gmail.com](mailto:blog.track.chairs@gmail.com)
-
diff --git a/_pages/dropdown.md b/_pages/dropdown.md
deleted file mode 100644
index 0eb85be8..00000000
--- a/_pages/dropdown.md
+++ /dev/null
@@ -1,13 +0,0 @@
----
-layout: page
-title: past iterations
-nav: true
-nav_order: 99
-dropdown: true
-children:
-  - title: 2023
-    permalink: https://iclr-blogposts.github.io/2023/about
-  - title: divider
-  - title: 2022
-    permalink: https://iclr-blog-track.github.io/home/
----
\ No newline at end of file
diff --git a/_pages/dropdown/index.html b/_pages/dropdown/index.html
new file mode 100644
index 00000000..44ce61cb
--- /dev/null
+++ b/_pages/dropdown/index.html
@@ -0,0 +1 @@
+ past iterations | ICLR Blogposts 2024
\ No newline at end of file
diff --git a/_pages/reviewer_guidelines.md b/_pages/reviewer_guidelines.md
deleted file mode 100644
index 0958cd1c..00000000
--- a/_pages/reviewer_guidelines.md
+++ /dev/null
@@ -1,25 +0,0 @@
----
-layout: page
-title: reviewing
-permalink: /reviewing/
-description:
-nav: true
-nav_order: 4
----
-
-### Reviewing Process
-
-Reviewers will be required to only view the live content of the blog.
-We ask that they act in good faith, and refrain from digging into the repository's logs and closed Pull Requests to find any identifying information on the authors.
-
-Reviewers should motivate their final decision based on the following points:
-
-- Is there a significant added value in comparison to the cited papers?
-- Is this added value supported by accurate, convincing, and clear arguments?
-- If the blogpost does not directly relate to a paper, does it address a relevant research topic from a novel perspective?
-- In case the field *Conflict Of Interest* is marked as *YES*, the reviewers are asked to pay specific attention to how the related work mentioned in the field *ICLR Papers* is treated: is the blogpost *too positive* (self-advertisement) or *too negative* (unfair assessment of this related work)?
-
-To access the submissions, please follow these steps:
-
-1. Go to the OpenReview submission page.
-2. To see the blogpost submission, go to the blogpost URL specified in the field 'Blogpost Url'.
\ No newline at end of file
diff --git a/_pages/submitting.md b/_pages/submitting.md
deleted file mode 100644
index 5e24980c..00000000
--- a/_pages/submitting.md
+++ /dev/null
@@ -1,360 +0,0 @@
----
-layout: page
-title: submitting
-permalink: /submitting/
-description:
-nav: true
-nav_order: 3
----
-
-### A more open process
-
-As with the previous edition of the Blog Post track, we forgo the requirement for total anonymity.
-The blog posts **must be anonymized for the review process**, but users will submit their anonymized blog posts via a pull request to the blog track's repository (in addition to a submission on OpenReview).
-The pull request will trigger an automated pipeline that will build and deploy your post onto a website dedicated to the reviewing process.
-
-Reviewers will be able to access the posts directly through a public URL (generated by the GitHub action), and will submit their reviews on OpenReview.
-Reviewers should refrain from looking at the git history for the post, which may reveal information about the authors.
-
-This still largely follows the Double-Blind reviewing principle; it is no less double-blind than when reviewers are asked to score papers that have previously been released to [arXiv](https://arxiv.org/), an overwhelmingly common practice in the ML community.
-This approach was chosen to lower the burden on both the organizers and the authors; in 2022, many submissions had to be reworked once deployed due to a variety of reasons.
-By allowing the authors to render their websites to GitHub Pages prior to the review process, we hope to avoid this issue entirely.
-
-
-However, we understand the desire for total anonymity.
-Authors that wish to have a fully double-blind process might consider creating new GitHub accounts without identifying information, which they will only use for this track.
For an example of a submission in the past which used an anonymous account in this manner, you can check out the [World Models blog post (Ha and Schmidhuber, 2018)](https://worldmodels.github.io/) and the [accompanying repository](https://github.com/worldmodels/worldmodels.github.io).
-
-### Template
-
-The workflow you will use to participate in this track should be relatively familiar to you if you have used [GitHub Pages](https://pages.github.com/). Specifically, our website uses the [Al-Folio](https://github.com/alshedivat/al-folio) template.
-This template uses GitHub Pages as part of its process, but it also utilizes a separate build step using [GitHub Actions](https://github.com/features/actions) and intermediary [Docker Images](https://www.docker.com/).
-
-**We recommend paying close attention to the steps presented in this guide.
-Small mistakes here can have very hard-to-debug consequences.**
-
-### Contents
-
-- [Quickstart](#quickstart)
-- [Download the Blog Repository](#download-the-blog-repository)
-- [Creating a Blog Post](#creating-a-blog-post)
-- [Local Serving](#local-serving)
-  - [Method 1: Using Docker](#method-1-using-docker)
-  - [Method 2: Using Jekyll Manually](#method-2-using-jekyll-manually)
-    - [Installation](#installation)
-    - [Manual Serving](#manual-serving)
-- [Submitting Your Blog Post](#submitting-your-blog-post)
-- [Reviewing Process](#reviewing-process)
-- [Camera Ready (TBD)](#camera-ready)
-
-
-### Quickstart
-
-This section provides a summary of the workflow for creating and submitting a blog post.
-For more details about any of these steps, please refer to the appropriate section.
-
-
-1. Fork or download our [repository](https://github.com/iclr-blogposts/2024).
-
-2. Create your blog post content as detailed in the [Creating a Blog Post](#creating-a-blog-post) section.
-   In summary, to create your post, you will:
-    - Create a Markdown or HTML file in the `_posts/` directory with the format `_posts/2024-05-07-[SUBMISSION NAME].md`. If you choose to write the post in HTML, then the extension of this last file should be .html instead of .md. NOTE: HTML posts are not officially supported, use at your own risk!
-    - Add any static images to `assets/img/2024-05-07-[SUBMISSION NAME]/`.
-    - Add any interactive HTML figures to `assets/html/2024-05-07-[SUBMISSION NAME]/`.
-    - Put your citations into a bibtex file in `assets/bibliography/2024-05-07-[SUBMISSION NAME].bib`.
-    - **DO NOT** touch anything else in the repository.
-      We will utilize an automated deployment action which will filter out all submissions that modify more than the list of files that we just described above.
-      Read the [relevant section](#creating-a-blog-post) for more details.
-      **Make sure to omit any identifying information for the review process.**
-
-3. To render your website locally, you can build a docker container via `$ ./bin/docker_run.sh`.
-   Alternatively, you can set up your local environment to render the website via conventional `$ bundle exec jekyll serve --future` commands.
-   More information for both of these configurations can be found in the [Local Serving](#local-serving) section.
-
-4. To submit your website, create a pull request to the main repository. Make sure that this PR's title is `2024-05-07-[SUBMISSION NAME]`. This will trigger a GitHub Action that will build your blogpost and write the host's URL in a comment to your PR.
-
-5. If accepted, we will merge the accepted posts to our main repository.
See the [camera ready](#camera-ready) section for more details on merging in an accepted blog post.
-
-**Should you edit ANY files other than your new post inside the `_posts` directory, and your new folder inside the `assets` directory, your pull request will automatically be rejected.**
-
-You can view an example of a successful PR [here](https://github.com/iclr-blogposts/2024/pull/48). You can view an example of a PR with erroneous files [here](https://github.com/iclr-blogposts/2024/pull/51).
-
-### Download the Blog Repository
-
-Download or fork our [repository](https://github.com/iclr-blogposts/2024).
-You will be submitting a pull request to this repository.
-
-### Creating a Blog Post
-
-To create a blog post in Markdown format, you can modify the [example]({% post_url 2024-05-07-distill-example %}) Markdown post `_posts/2024-05-07-distill-example.md` and rename it to `_posts/2024-05-07-[SUBMISSION NAME].md`, where `[SUBMISSION NAME]` is the name of your submission. You can see the result of the sample post.
-
-While most users will want to create a post in the Markdown format, it is also possible to create a post in HTML format. For this, modify instead the example `_posts/2024-05-08-distill-example2.html` and rename it to `_posts/2024-05-07-[SUBMISSION NAME].html`. (NOTE: HTML is not officially supported, use at your own risk.)
-
-
-You must modify the file's header (or 'front-matter') as needed.
-
-
-
-```markdown
----
-layout: distill
-title: [Your Blog Title]
-description: [Your blog post's abstract - no math/latex or hyperlinks!]
-date: 2024-05-07
-future: true
-htmlwidgets: true
-
-# anonymize when submitting
-authors:
-  - name: Anonymous
-
-# do not fill this in until your post is accepted and you're publishing your camera-ready post!
-# authors:
-#   - name: Albert Einstein
-#     url: "https://en.wikipedia.org/wiki/Albert_Einstein"
-#     affiliations:
-#       name: IAS, Princeton
-#   - name: Boris Podolsky
-#     url: "https://en.wikipedia.org/wiki/Boris_Podolsky"
-#     affiliations:
-#       name: IAS, Princeton
-#   - name: Nathan Rosen
-#     url: "https://en.wikipedia.org/wiki/Nathan_Rosen"
-#     affiliations:
-#       name: IAS, Princeton
-
-# must be the exact same name as your blogpost
-bibliography: 2024-05-07-distill-example.bib
-
-# Add a table of contents to your post.
-#   - make sure that TOC names match the actual section names
-#     for hyperlinks within the post to work correctly.
-toc:
-  - name: [Section 1]
-  - name: [Section 2]
-    # you can additionally add subentries like so
-    subsections:
-      - name: [Subsection 2.1]
-  - name: [Section 3]
----
-
-# ... your blog post's content ...
-```
-
-You must change the `title`, `description`, `toc`, and, eventually, the `authors` fields (**ensure that the
-submission is anonymous for the review process**).
-
-
-Read our [sample blog post]({% post_url 2024-05-07-distill-example %}) carefully to see how you can add image assets, and how to write using $$\LaTeX$$!
-Read about rendering your post locally [below](#local-serving).
-
-**Important: make sure your post is completely anonymized before you export and submit it!**
-
-Before going any further, it will be useful to highlight exactly what folders and files you are going to add or modify.
-Even if you use one of our simpler quickstart methods, this will always be what's happening
-behind the scenes.
If you clone our repo or download a release, you will find a directory structure that looks like
-the following (excluding all files and directories that are not relevant to your submission):
-
-```bash
-your_blogpost_repo/
-│
-├── _posts
-│   ├── 2024-05-07-[YOUR SUBMISSION].md         # <--- Create this markdown file; this is your blogpost
-│   └── ...
-├── assets
-│   ├── bibliography
-│   │   ├── 2024-05-07-[YOUR SUBMISSION].bib    # <--- Create this bibtex file
-│   │   └── ...
-│   ├── html
-│   │   ├── 2024-05-07-[YOUR SUBMISSION]        # <--- Create this directory and add interactive html figures
-│   │   │   └── [YOUR HTML FIGURES].html
-│   │   └── ...
-│   ├── img
-│   │   ├── 2024-05-07-[YOUR SUBMISSION]        # <--- Create this directory and add static images here
-│   │   │   └── [YOUR IMAGES].png
-│   │   └── ...
-│   └── ...
-└── ...
-```
-
-In summary, to create your post, you will:
-
-- Create a Markdown (or HTML) file in the `_posts/` directory with the format `_posts/2024-05-07-[SUBMISSION NAME].md` (`_posts/2024-05-07-[SUBMISSION NAME].html` in the case of an HTML file).
-- Add any static image assets to `assets/img/2024-05-07-[SUBMISSION NAME]/`.
-- Add any interactive HTML figures to `assets/html/2024-05-07-[SUBMISSION NAME]/`.
-- Put your citations into a bibtex file in `assets/bibliography/2024-05-07-[SUBMISSION NAME].bib`.
-
-**DO NOT** touch anything else in the blog post!
-If you do, our automated pipeline will reject your PR and you will have to undo those changes in order for it to be accepted!
-
-Note that `2024-05-07-[YOUR SUBMISSION]` serves as a tag to your submission, so it should be the
-same for all of these items.
-For example, if you're writing a blog post called "Deep Learning", you'd likely want to make your
-tag `2024-05-07-deep-learning`, and the directory structure would look like this:
-
-```bash
-your_blogpost_repo/
-│
-├── _posts
-│   ├── 2024-05-07-deep-learning.md         # <--- Create this markdown file; this is your blogpost
-│   └── ...
-├── assets
-│   ├── bibliography
-│   │   ├── 2024-05-07-deep-learning.bib    # <--- Create this bibtex file
-│   │   └── ...
-│   ├── html
-│   │   ├── 2024-05-07-deep-learning        # <--- Create this directory and add interactive html figures
-│   │   │   └── [YOUR HTML FIGURES].html
-│   │   └── ...
-│   ├── img
-│   │   ├── 2024-05-07-deep-learning        # <--- Create this directory and add static images here
-│   │   │   └── [YOUR IMAGES].png
-│   │   └── ...
-│   └── ...
-└── ...
-```
-
-### Local serving
-
-So far we've talked about how to get the relevant repository and create a blog post conforming to our requirements.
-Everything you have done so far has been in Markdown, but this is not the same format as web content (typically HTML, etc.).
-You'll now need to build your static web site (which is done using Jekyll), and then *serve* it on some local webserver in order to view it properly.
-We will now discuss how you can *serve* your blog site locally, so you can visualize your work before opening a pull request to the staging website and submitting it to the ICLR venue.
-
-#### Method 1: Using Docker
-
-To render your website locally, we follow the instructions for [Local setup using Docker (Recommended on Windows)](https://github.com/iclr-blogposts/iclr-blogposts.github.io/blob/master/README.md#local-setup-using-docker-recommended-on-windows), but specifically you will need to create your own docker container rather than pull it from Dockerhub (because we modified the Gemfile).
Create and run the Docker image:
-
-```bash
-./bin/docker_run.sh
-```
-
-Remove the `Gemfile.lock` file if prompted.
-This will create a docker image labeled as `al-folio:latest`.
-Don't use `dockerhub_run.sh`; this may result in issues with missing jekyll dependencies.
-
-
-#### Method 2: Using Jekyll Manually
-
-For users wishing not to use a Docker container, you can install Jekyll directly on your computer and build the site using Jekyll directly.
-This is done at your own risk, as there are many potential points of error!
-Follow the instructions for rendering the website via the conventional method of `$ bundle exec jekyll serve --future`.
-
-##### Installation
-
-You will need to manually install Jekyll, which will vary based on your operating system.
-The instructions here are only for convenience - you are responsible for making sure it works on your system and we are not liable for potential issues that occur when adding your submissions to our repo!
-
-**Ubuntu/Debian**
-
-1. Install Ruby
-
-    ```bash
-    sudo apt install ruby-full
-    ```
-
-2. Once installed, add the following to your `.bashrc` or whatever terminal startup script you may use (this is important because otherwise gem may complain about needing sudo permission to install packages):
-
-    ```bash
-    export GEM_HOME="$HOME/.gem"
-    export PATH="$HOME/.gem/bin:$PATH"
-    ```
-
-3. Install Jekyll and Bundler:
-
-    ```bash
-    gem install jekyll bundler
-    ```
-
-**MacOS and Windows**
-
-Mac and Windows users can find relevant guides for installing Jekyll here:
-
-- [Windows guide](https://jekyllrb.com/docs/installation/windows/)
-- [MacOS guide](https://jekyllrb.com/docs/installation/macos/)
-
-##### Manual Serving
-
-Once you've installed Jekyll and all of the dependencies, you can now serve the webpage on your local machine for development purposes using the `bundle exec jekyll serve` command.
-
-You may first need to install any project dependencies. In your terminal, from the directory containing the Jekyll project, run:
-
-```bash
-bundle install
-```
-
-This will install any plugins required by the project.
-To serve the webpage locally, from your terminal, in the directory containing the Jekyll project, run:
-
-```bash
-bundle exec jekyll serve --future --port=8080 --host=0.0.0.0
-```
-
-You should see something along the lines of:
-
-```
-> bundle exec jekyll serve
-Configuration file: /home/$USER/blog_post_repo/_config.yml
-            Source: /home/$USER/blog_post_repo
-       Destination: /home/$USER/blog_post_repo/_site
- Incremental build: disabled. Enable with --incremental
-      Generating...
-       Jekyll Feed: Generating feed for posts
-
-       ... you may see a lot of stuff in here related to images ...
-
-                    done in 0.426 seconds.
- Auto-regeneration: enabled for '/home/$USER/blog_post_repo'
-    Server address: http://0.0.0.0:8080/2024/
-  Server running... press ctrl-c to stop.
-```
-
-If you see this, you've successfully served your web page locally!
-You can access it at the server address specified, in this case `http://0.0.0.0:8080/2024/` (and the blog posts should once again be viewable at the `blog/` endpoint).
-
-
-### Submitting your Blog Post
-
-To submit your blog post:
-
-1. **Anonymize your blog post.** Strip all identifying information from your post, including the
-   author's list (replace with `Anonymous`).
-2. 
Double check that your post matches the formatting requirements, including (but not limited to):
-    - **Only modify** files in the following locations (failure to do so will result in your PR
-      automatically being closed!):
-      - a Markdown (or HTML) file in `_posts/` with the format `_posts/2024-05-07-[SUBMISSION NAME].md`
-        (or `.html`)
-      - static image assets added to `assets/img/2024-05-07-[SUBMISSION NAME]/`
-      - interactive HTML figures added to `assets/html/2024-05-07-[SUBMISSION NAME]/`
-      - citations in a bibtex file in `assets/bibliography/2024-05-07-[SUBMISSION NAME].bib`
-    - Have a short 2-3 sentence abstract in the `description` field of your front-matter ([example](https://github.com/iclr-blogposts/2024/blob/295ab5b4c31f2c7d421a4caf41e5481cbb4ad42c/_posts/2024-05-07-distill-example.md?plain=1#L4-L6))
-    - Have a table of contents, formatted using the `toc` field of your front-matter ([example](https://github.com/iclr-blogposts/2024/blob/295ab5b4c31f2c7d421a4caf41e5481cbb4ad42c/_posts/2024-05-07-distill-example.md?plain=1#L36-L47))
-    - Make sure your bibliography uses a `.bib` file as per the sample post
-3. Open a pull request against the `main` branch of the [2024 repo](https://github.com/iclr-blogposts/2024).
-   Fill in the checklist provided in the PR template. The title of your pull request should be
-   exactly the name of your markdown/html file.
-    - i.e. `_posts/2024-05-07-[SUBMISSION NAME].md` would require a PR name `2024-05-07-[SUBMISSION NAME]`
-4. (TBD) Your PR will automatically run two pipelines: one to verify that you have not modified any other
-   file in the repo, and another that will create a unique URL for your contributed blog post (a toy sketch of the kind of check the first pipeline performs is shown after this section).
-    - Verify that everything looks correct at the given URL.
-    - If the pipelines fail, check if it was because of improper formatting (i.e. you modified
-      restricted files). If this is the case, fix the issues. If the issue persists, please ping one of the repo admins.
-
-5. Submit the name of your blog post and its URL to our OpenReview through [this link](https://openreview.net/group?id=ICLR.cc/2024/BlogPosts&referrer=%5BHomepage%5D(%2F)).
-
-> **Note:** If you wish to make updates to your submission, you should update the content in the
-> PR that you already opened.
-
-### Reviewing Process
-
-Reviewers will be required to only view the live content of the reviewing website - the website to which the pull requests push.
-We ask that they act in good faith, and refrain from digging into the repository's logs and closed Pull Requests to find any identifying information on the authors.
-
-### Camera-ready
-
-**TBD** - instructions will be provided closer to the submission deadline.
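For illustration, here is a minimal Python sketch of the kind of file-path check described in step 4 above. It is a hypothetical reconstruction, not the actual GitHub Action used by the track; the function names and the `changed_files` input are made up for the example.

```python
import re
import sys

# Hypothetical sketch of the PR file filter; NOT the track's real CI code.
def allowed_patterns(submission_name: str) -> list:
    tag = re.escape(f"2024-05-07-{submission_name}")
    return [
        re.compile(rf"^_posts/{tag}\.(md|html)$"),
        re.compile(rf"^assets/img/{tag}/.+"),
        re.compile(rf"^assets/html/{tag}/.+"),
        re.compile(rf"^assets/bibliography/{tag}\.bib$"),
    ]

def check_changed_files(changed_files, submission_name):
    patterns = allowed_patterns(submission_name)
    offending = [f for f in changed_files if not any(p.match(f) for p in patterns)]
    for f in offending:
        print(f"File outside the allowed locations: {f}", file=sys.stderr)
    return not offending

if __name__ == "__main__":
    ok = check_changed_files(
        ["_posts/2024-05-07-deep-learning.md",
         "assets/img/2024-05-07-deep-learning/fig1.png"],
        "deep-learning",
    )
    print("PR would pass" if ok else "PR would be closed")
```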
diff --git a/_plugins/external-posts.rb b/_plugins/external-posts.rb deleted file mode 100644 index e4fd5eb6..00000000 --- a/_plugins/external-posts.rb +++ /dev/null @@ -1,36 +0,0 @@ -require 'feedjira' -require 'httparty' -require 'jekyll' - -module ExternalPosts - class ExternalPostsGenerator < Jekyll::Generator - safe true - priority :high - - def generate(site) - if site.config['external_sources'] != nil - site.config['external_sources'].each do |src| - p "Fetching external posts from #{src['name']}:" - xml = HTTParty.get(src['rss_url']).body - feed = Feedjira.parse(xml) - feed.entries.each do |e| - p "...fetching #{e.url}" - slug = e.title.downcase.strip.gsub(' ', '-').gsub(/[^\w-]/, '') - path = site.in_source_dir("_posts/#{slug}.md") - doc = Jekyll::Document.new( - path, { :site => site, :collection => site.collections['posts'] } - ) - doc.data['external_source'] = src['name']; - doc.data['feed_content'] = e.content; - doc.data['title'] = "#{e.title}"; - doc.data['description'] = e.summary; - doc.data['date'] = e.published; - doc.data['redirect'] = e.url; - site.collections['posts'].docs << doc - end - end - end - end - end - -end diff --git a/_plugins/hideCustomBibtex.rb b/_plugins/hideCustomBibtex.rb deleted file mode 100644 index 4a852fde..00000000 --- a/_plugins/hideCustomBibtex.rb +++ /dev/null @@ -1,15 +0,0 @@ - module Jekyll - module HideCustomBibtex - def hideCustomBibtex(input) - keywords = @context.registers[:site].config['filtered_bibtex_keywords'] - - keywords.each do |keyword| - input = input.gsub(/^.*#{keyword}.*$\n/, '') - end - - return input - end - end -end - -Liquid::Template.register_filter(Jekyll::HideCustomBibtex) diff --git a/_posts/2024-05-07-alibi-mlm.md b/_posts/2024-05-07-alibi-mlm.md deleted file mode 100644 index 5f6cdbaf..00000000 --- a/_posts/2024-05-07-alibi-mlm.md +++ /dev/null @@ -1,239 +0,0 @@ ---- -layout: distill -title: Masked Language Model with ALiBi and CLAP head -description: As a new approach to positional encoding, Attention with Linear Biases (ALiBi) uses linear biases of the attention weights to encode positional information, with capability of context length extrapolation. In their paper however, Press et al. focus on the perplexity of autoregressive decoder-only language models, leaving the question of downstream tasks and its applicability to encoder-attention open. In this blogpost, we attempt to bridge the gap by testing masked language models (MLMs) with encoder-attention ALiBi and prediction head similar to the counterparts of the original ALiBi models. We find that while simplified prediction head may be beneficial, performance of MLMs with encoder-attention ALiBi starts to deteriorate with 2048 sequence length at larger scales. We put our results in the context of related recent experiments and tentatively identify the circumstances more challenging to positional encoding designs. Finally, we open-source our MLMs, with BERT-level performance and 2048 context length. -date: 2024-05-07 -future: true -htmlwidgets: true - -authors: - - name: Jason Chuan-Chih Chou - url: https://scholar.google.com/citations?user=V7BXGawAAAAJ - affiliations: - name: Cohere For AI Community - -# must be the exact same name as your blogpost -bibliography: 2024-05-07-alibi-mlm.bib - -# Add a table of contents to your post. -# - make sure that TOC names match the actual section names -# for hyperlinks within the post to work correctly. -# - please use this format rather than manually creating a markdown table of contents. 
toc:
-  - name: Attention with Linear Biases (ALiBi)
-  - name: Contrastive Language Pretraining (CLAP) Head
-  - name: Experiments
-    subsections:
-    - name: WikiText-103
-    - name: The Pile
-  - name: Conclusions
-  - name: Model Checkpoints
----
-
-*Adapted and expanded from [EIFY/fairseq](https://github.com/EIFY/fairseq).*
-
-Unmodified and unmasked, the attention mechanism is permutation-invariant, and positional encoding is therefore employed by transformer-based language models to break the symmetry and enable sequence modeling. In their ICLR 2022 paper, Press et al. introduced Attention with Linear Biases (ALiBi) as a new approach to positional encoding, where the positional information of the tokens is encoded by applying an attention weight bias proportional to the distance between tokens:
-
-{% include figure.html path="assets/img/2024-05-07-alibi-mlm/ALiBi.jpeg" class="img-fluid" %}
-
-where $$m$$ is a head-specific slope chosen to follow the geometric sequence $$\frac{1}{2^{0.5}}, \frac{1}{2^1}, \frac{1}{2^{1.5}}, \dots, \frac{1}{2^\frac{n}{2}}$$ for a model with $$n$$ attention heads. This approach is shown to enable input length extrapolation, in the sense that the perplexity of the model remains stable as the inference context length exceeds the training context length. The paper, however, focuses on autoregressive decoder-only models and relies on model perplexity as the metric, and therefore leaves open the question of whether ALiBi is applicable to MLMs like BERT and RoBERTa. To help answer this question, we tested the following two changes to the RoBERTa baseline models, based on the first-party Fairseq toolkit:
-
-
-## Attention with Linear Biases (ALiBi)
-
-Since MLMs are based on encoders that attend to tokens both before and after the given position, considerations must be made regarding how to distinguish them. Press himself [suggested the 3 following options for encoder-attention ALiBi](https://github.com/ofirpress/attention_with_linear_biases/issues/5):
-
-1. Symmetric: Keep the attention weight bias proportional to the distance between tokens and rely on the context to distinguish between tokens at the +N and -N positions.
-2. Nonsymmetric, one-sided: Make half of the heads attend only to the tokens before and half of the heads attend only to the tokens after. The weight bias is still proportional to the distance.
-3. Nonsymmetric with different slopes: Make the slopes $$m$$ different forward and backward, with either learned or fixed values.
-
-With the observation that option 2 spends about half of the attention compute on no-ops and that option 3 can still result in bias value collisions (e.g. $$m_{bwd} = 2 m_{fwd}$$ and the -1 vs. +2 positions), we implemented both option 1 and what we call "nonsymmetric with offset": [shift the linear biases ahead by `0.5 * slope`](https://github.com/ofirpress/attention_with_linear_biases/issues/5#issuecomment-1213410982), i.e. the constant bias (right matrix of the figure above) becomes
-
-```
-  0  -.5  -1.5  -2.5  -3.5
- -1    0   -.5  -1.5  -2.5
- -2   -1     0   -.5  -1.5
- -3   -2    -1     0   -.5
- -4   -3    -2    -1     0
-```
-
-Unless otherwise noted, ALiBi for the following experiments means this nonsymmetric-with-offset encoder-attention ALiBi. A minimal sketch of how these biases can be constructed is shown below.
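The following is a minimal PyTorch sketch of the slope and bias construction described above, with toy sizes chosen for illustration; it is not the fairseq implementation used in the experiments. The resulting per-head matrix is added to the attention logits before the softmax.

```python
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric sequence 1/2^0.5, 1/2^1, 1/2^1.5, ..., 1/2^(n/2)
    return torch.tensor([2.0 ** (-0.5 * (i + 1)) for i in range(n_heads)])

def encoder_alibi_bias(seq_len: int, slope: float, offset: float = 0.5) -> torch.Tensor:
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).float()  # key position minus query position
    # offset=0.0 gives the symmetric option 1: bias = -slope * |rel|;
    # offset=0.5 shifts the biases for future tokens ahead by 0.5 * slope.
    return -slope * torch.where(rel > 0, rel - offset, -rel)

print(encoder_alibi_bias(5, 1.0))  # reproduces the constant bias matrix above
```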
## Contrastive Language Pretraining (CLAP) Head
-The prediction head is one part of the LMs that has received less attention and that happens to differ between the ALiBi autoregressive decoder-only models and RoBERTa. Based on the configs and [training logs](https://github.com/ofirpress/attention_with_linear_biases#saved-checkpoints), the ALiBi models use the adaptive word embedding and softmax of Baevski & Auli with weight tying, whereas the RoBERTa prediction head has an additional fully-connected layer and nonlinearity on top of weight tying. Inspired by CLIP, we decided to test what we call the Contrastive Language Pretraining (CLAP) head below, as the [simplest possible prediction head with weight tying](https://github.com/EIFY/fairseq/blob/8143446dfa88d9f8e246b366bd335f6c9b018db0/fairseq/models/roberta/model.py#L527-L543) for the masked tokens, plus the thermodynamic beta (inverse temperature):
-
-{% highlight python %}
-class ClapHead(nn.Module):
-    """Head for masked language modeling."""
-
-    def __init__(self, initial_beta, weight):
-        super().__init__()
-        self.beta = nn.Parameter(torch.tensor(initial_beta))
-        self.weight = weight
-
-    def forward(self, features, masked_tokens=None, normalize=True):
-        # Only project the masked tokens while training,
-        # saves both memory and computation
-        if masked_tokens is not None:
-            features = features[masked_tokens, :]
-        w = self.weight
-        if normalize:
-            w = F.normalize(w, dim=-1)
-        return self.beta * F.linear(features, w)
-{% endhighlight %}
-
-Compared to the [baseline RoBERTa prediction head](https://github.com/facebookresearch/fairseq/blob/da8fb630880d529ab47e53381c30ddc8ad235216/fairseq/models/roberta/model.py#L470-L495)
-
-{% highlight python %}
-class RobertaLMHead(nn.Module):
-    """Head for masked language modeling."""
-
-    def __init__(self, embed_dim, output_dim, activation_fn, weight=None):
-        super().__init__()
-        self.dense = nn.Linear(embed_dim, embed_dim)
-        self.activation_fn = utils.get_activation_fn(activation_fn)
-        self.layer_norm = LayerNorm(embed_dim)
-
-        if weight is None:
-            weight = nn.Linear(embed_dim, output_dim, bias=False).weight
-        self.weight = weight
-        self.bias = nn.Parameter(torch.zeros(output_dim))
-
-    def forward(self, features, masked_tokens=None, **kwargs):
-        # Only project the masked tokens while training,
-        # saves both memory and computation
-        if masked_tokens is not None:
-            features = features[masked_tokens, :]
-
-        x = self.dense(features)
-        x = self.activation_fn(x)
-        x = self.layer_norm(x)
-        # project back to size of vocabulary with bias
-        x = F.linear(x, self.weight) + self.bias
-        return x
-{% endhighlight %}
-
-we removed the `embed_dim x embed_dim` fully-connected layer, the activation function (GELU), the layer norm, and the `output_dim` trainable bias. Just like CLIP, we added the trainable thermodynamic beta and L2-normalize the token embeddings before feeding them to the transformer and computing the inner products between them and the transformer output as the softmax logits, scaled by beta. A toy shape-check of this head is sketched below.
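As a quick illustration of how `ClapHead` above is used, here is a hypothetical toy example; the sizes and the `initial_beta` value are made up for the demonstration and do not come from the original training configs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# ClapHead as defined above is assumed to be in scope.
vocab_size, embed_dim = 50265, 768
embed = nn.Embedding(vocab_size, embed_dim)             # token embeddings, tied with the head
head = ClapHead(initial_beta=1.0, weight=embed.weight)  # initial_beta=1.0 is an arbitrary choice here

features = torch.randn(2, 8, embed_dim)                 # transformer output: (batch, seq, embed_dim)
masked_tokens = torch.zeros(2, 8, dtype=torch.bool)
masked_tokens[:, 3] = True                              # pretend position 3 is masked in both sequences

logits = head(features, masked_tokens)                  # only the masked positions are projected
print(logits.shape)                                     # torch.Size([2, 50265])
```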
## Experiments
-
-### WikiText-103
-At first we tested the changes with the [WikiText-103 dataset](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/) on a GeForce RTX 3080 16 GB Laptop GPU, using the validation set MLM perplexity as the metric. We tested the baseline (learned positional encoding + RoBERTa prediction head), learned-clap (learned positional encoding + CLAP head), ALiBi (ALiBi + RoBERTa prediction head), and zero-clap (ALiBi + CLAP head), in addition to the baseline but with sinusoidal positional encoding instead of learned positional encoding:
-
-{% include figure.html path="assets/img/2024-05-07-alibi-mlm/valid_ppl_cleaned.png" class="img-fluid" %}
-
-where solid lines are what's considered the "canonical" setup and dotted lines are experiments with the following variations in setup. These variations turned out to be irrelevant:
-
-1. Whether we use attention dropout or not
-2. Whether we use [symmetric ALiBi (option 1)](https://github.com/ofirpress/attention_with_linear_biases/issues/5) or the nonsymmetric-with-offset ALiBi above
-3. ~~Whether we use a zero vector or a separate learnable embedding for the mask embedding~~ (The intention was to test using a zero vector instead of a separate learnable embedding for the mask embedding, which in combination with ALiBi results in no non-semantic information in the input embeddings. However, a bug prevented this variation from working correctly, and the end effect was merely deleting the last two words (madeupword0001 and madeupword0002) from the dictionary instead, which we don't expect to be consequential.)
-4. Whether we L2-normalize the embeddings for the CLAP head or not
-5. Whether we scale the L2-normalized embeddings by `sqrt(embed_dim)` (`no_scale_embedding=False`) or not
-
-As we can see, the dotted lines are almost on top of the solid lines. Notably, sinusoidal positional encoding underperforms significantly compared to learned positional encoding.
-
-### The Pile
-As the next step, we scaled our experiments to train on the Pile for one epoch. About half of the examples in the Pile have sequence length > 1024, so we set the sequence length to 2048. Even so, ~1/7 of the examples have sequence length > 2048 and had to be discarded. In the end, one epoch consists of 133082 updates, and [we employ a cosine learning rate schedule while "overestimating" the number of training steps by 10%](https://github.com/EIFY/fairseq/blob/33fb2c306851f104cc567b7fe865b1e3fd1e6fe7/examples/roberta/config/pretraining/baseline_pile.yaml#L31-L36), as inspired by the Chinchilla paper. In addition to the validation MLM perplexity, we also fine-tuned the models on the [GLUE](https://gluebenchmark.com/) benchmark. As in the original RoBERTa paper, we tested both `roberta.base` with 125M parameters and `roberta.large` with 355M parameters. These experiments were performed on 8 x A100 40GB SXM4 GPUs, where the `roberta.base` experiments took ~3 days and the `roberta.large` experiments took ~9 days. In the tables below, `PPL` is the final validation MLM perplexity, `STS-B` is the best validation loss, and all the others are the best validation accuracies over 10 epochs of finetuning.
-
-#### `roberta.base`
-```
-              PPL↓  CoLA  MNLI  MRPC  QNLI  QQP   RTE   SST-2  STS-B↓
-baseline      2.94  83.6  84.2  90    91.6  91.3  73.6  92.1   0.028
-learned-clap  2.86  81.7  84.4  86.3  90.9  91.2  72.6  92.5   0.027
-alibi         2.93  69.2  85.1  80.9  92    91.5  63.9  93.1   0.033
-zero-clap     2.83  70.5  84.9  75.5  90.6  91.1  54.9  89.7   0.041
-```
-\**Baseline but with sinusoidal positional encoding instead of learned positional encoding failed to converge.*
-
-#### `roberta.large`
-```
-              PPL↓  CoLA  MNLI  MRPC  QNLI  QQP   RTE   SST-2  STS-B↓
-baseline*     2.55  83.7  86.8  84.3  92.5  91.8  79.8  93.3   0.027
-learned-clap  2.5   84.1  86.3  89.7  92.8  91.7  79.8  93.7   0.023
-alibi         2.65  69.1  86.5  68.4  92.4  91.7  52.7  93.6   0.123
-zero-clap     2.54  69.1  86.7  81.9  92.2  91.6  52.7  93.1   0.031
-```
-\**Loss spiked somewhere between 24000-24500 updates and the model failed to recover. Loosely following the practice of `5.1 Training Instability` in the PaLM paper, we solved the issue by restarting the training from the 20000 updates checkpoint with the PyTorch random seed changed from `1` to `2`.*
-
-We found that ALiBi no longer helps lower the validation MLM perplexity. Furthermore, ALiBi turned out to be harmful for several specific GLUE tasks (`CoLA`, `MRPC`, and `RTE`). The CLAP head on its own, however, seems to be competitive and in fact outperforms the baseline with `roberta.large`.
-
-## Conclusions
-This seems to be another case where models with lower perplexity do not necessarily yield higher accuracies for downstream tasks, and where architectural changes beneficial for models at smaller scales do not imply the same for models at larger scales. The CLAP head, however, is simpler than the standard prediction head for MLMs, requires minimal changes, and may be worth trying especially at larger scales.
-
-In the broader context, MosaicBERT and LittleBird are most similar to our experiments. In the MosaicBERT paper, Portes et al. also evaluate BERT-style MLMs with symmetric (option 1) encoder-attention ALiBi on the GLUE benchmark and find performance exceeding the BERT baseline within a limited training budget. However, these MosaicBERT models were trained with a much shorter (128) sequence length, and so may have avoided the sequence length regime in which perplexity and performance of certain downstream tasks start to deteriorate. (The same can be said about another related work, which also reports in Table 4 the MLM perplexity of RoBERTa large models trained on an excerpt of the Pile with various positional encodings, including symmetric (option 1) encoder-attention ALiBi with 128 sequence length.) The LittleBird architecture is designed for question answering and built with BiALiBi (Bidirectional ALiBi), a variation of option 3 (nonsymmetric with different slopes) where the model learns not only the forward and backward slopes $$m_{fwd}$$ and $$m_{bwd}$$, but also a special bias value for the attention weight of the global `[CLS]` token. Lee et al. evaluate LittleBird models on a collection of QA benchmarks for both English and Korean and report favorable performance, but leave open the question of whether they work well for other NLP tasks. Notably, we also found our ALiBi models capable of matching the baseline performance on the question answering task `QNLI`, so the reported performance is compatible with our experiments even without appealing to the other differences in architecture or pretraining task.
-
-Finally, what can we say about the original decoder-attention ALiBi and positional encodings in general? 
The original decoder-attention ALiBi has been shown to help not only perplexity, but also performance on evaluation suites consisting of a diverse set of tasks like the EleutherAI Language Model Evaluation Harness. This discrepancy may be explained by the causal mask, which has been proven to be sufficient for encoding positional information in theory (one caveat is that Proof C.1 of the cited work for absolute positional encoding depends on distinguishing values of unit fractions 1/t, which eventually fails due to the precision limit; for example, 1/1464 can't be distinguished from 1/1465 in float16, well within the context length of interest), if not quite matching the performance of models with additional positional encodings in practice. Perhaps we can conclude that
-
-1. Decoder-attention positional encodings really should be considered causal mask + additional encodings, and how they complement each other should be taken into account.
-2. Longer context length and certain downstream tasks are more challenging for positional encodings. One worthwhile direction may be to rank their difficulties systematically and iterate on the more challenging circumstances first for future positional encoding designs.
-
-## Model checkpoints
-Final checkpoints for models trained on the Pile:
-
-### `roberta.base`
-
-[baseline](https://drive.google.com/file/d/1r9VwJCU3AeuivNULRuY3Taq_3AEBg-v5/view?usp=share_link)
-[learned-clap](https://drive.google.com/file/d/1KmO3FEaawz0tHW-s581NmrkL-OZklLYk/view?usp=share_link)
-[alibi](https://drive.google.com/file/d/1s4Tcjnbawq1W6LBcknysj6NdpMfJdek6/view?usp=share_link)
-[zero-clap](https://drive.google.com/file/d/1PwE_MASg4FinuKq6DX29A8c2lPP2B6nb/view?usp=share_link)
-
-### `roberta.large`
-
-[baseline](https://drive.google.com/file/d/1XSStju8S9y1BCHpXqZ_fZcueH3A0yW2c/view?usp=share_link)
-[learned-clap](https://drive.google.com/file/d/1UyFxC3XoQ5eAhhXaAUQznLbBLa0J_45U/view?usp=share_link)
-[alibi](https://drive.google.com/file/d/1D22xJxJTI4gPAD4gHfKaN1ytjQTy2u_y/view?usp=share_link)
-[zero-clap](https://drive.google.com/file/d/1ktiRIVqz46DbV261_WxA9RELR971_2iu/view?usp=share_link)
-
-To load them, install [EIFY/fairseq](https://github.com/EIFY/fairseq) following [the original instructions](https://github.com/facebookresearch/fairseq/blob/b8ac3fa6cc95f9dc97085232d4faf125e5bcd2e7/README.md#requirements-and-installation) and download the GPT-2 fairseq dictionary:
-```
-wget -O gpt2_bpe/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
-```
-Then all of the checkpoints above except the `zero-clap` ones can be loaded as follows:
-```
-$ python
-Python 3.8.10 (default, Jun 22 2022, 20:18:18)
-[GCC 9.4.0] on linux
-Type "help", "copyright", "credits" or "license" for more information.
->>> from fairseq.models.roberta import RobertaModel
->>> roberta = RobertaModel.from_pretrained('/checkpoint-dir', 'learned-clap-large.pt', '/dict-dir')
-(...)
->>> roberta.fill_mask('The capital of China is <mask>.', topk=3)
-[('The capital of China is Beijing.', 0.7009016871452332, ' Beijing'), ('The capital of China is Shanghai.', 0.23566904664039612, ' Shanghai'), ('The capital of China is Moscow.', 0.010170688852667809, ' Moscow')]
->>>
-```
-The `zero-clap` ones were trained without the last two `madeupword`'s (this is due to the same bug that affected the WikiText-103 variation above, and it is the bug's only visible effect), so you need to delete them from `dict.txt` before loading, i.e.:
-
-```
-(...)
-50009 0
-50256 0
-madeupword0000 0
-madeupword0001 0
-madeupword0002 0
-```
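-
-For reference, here is a small Python sketch of this one-off edit (assuming the `gpt2_bpe/dict.txt` path from the download step above):
-
-```python
-# Drop the two trailing madeupword entries that the zero-clap models were trained without.
-with open("gpt2_bpe/dict.txt") as f:
-    lines = f.readlines()
-
-keep = [l for l in lines if not l.startswith(("madeupword0001", "madeupword0002"))]
-
-with open("gpt2_bpe/dict.txt", "w") as f:
-    f.writelines(keep)
-```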
-
-```
-$ python
-Python 3.8.10 (default, Jun 22 2022, 20:18:18)
-[GCC 9.4.0] on linux
-Type "help", "copyright", "credits" or "license" for more information.
->>> from fairseq.models.roberta import RobertaModel
->>> roberta = RobertaModel.from_pretrained('/checkpoint-dir', 'zero-clap-large.pt', '/dict-dir')
-(...)
->>> roberta.fill_mask('The capital of China is <mask>.', topk=3)
-[('The capital of China is Beijing.', 0.7051425576210022, ' Beijing'), ('The capital of China is Shanghai.', 0.21408841013908386, ' Shanghai'), ('The capital of China is Taiwan.', 0.007823833264410496, ' Taiwan')]
->>>
-```
-
-The rest of the original [example usage](https://github.com/facebookresearch/fairseq/blob/b8ac3fa6cc95f9dc97085232d4faf125e5bcd2e7/examples/roberta/README.md#example-usage) should also just work. While these checkpoints have only been tested with this fork, the `baseline` ones should also work with the [original fairseq repo](https://github.com/facebookresearch/fairseq) with minimal changes to the state dict:
-
-```
->>> import torch
->>> path = '/checkpoint-dir/baseline-large.pt'
->>> with open(path, 'rb') as f:
-...     state = torch.load(f, map_location=torch.device("cpu"))
-...
->>>
->>> del state['cfg']['task']['omit_mask']
-(...)
->>> torch.save(state, '/checkpoint-dir/compatible.pt')
-```
diff --git a/_posts/2024-05-07-bench-hvp.md b/_posts/2024-05-07-bench-hvp.md
deleted file mode 100644
index 14175807..00000000
--- a/_posts/2024-05-07-bench-hvp.md
+++ /dev/null
@@ -1,534 +0,0 @@
----
-layout: distill
-title: How to compute Hessian-vector products?
-description: The product between the Hessian of a function and a vector, the Hessian-vector product (HVP), is a fundamental quantity to study the variation of a function. It is ubiquitous in traditional optimization and machine learning. However, the computation of HVPs is often considered prohibitive in the context of deep learning, driving practitioners to use proxy quantities to evaluate the loss geometry. Standard automatic differentiation theory predicts that the computational complexity of an HVP is of the same order of magnitude as the complexity of computing a gradient. The goal of this blog post is to provide a practical counterpart to this theoretical result, showing that modern automatic differentiation frameworks, JAX and PyTorch, allow for efficient computation of these HVPs in standard deep learning cost functions.
-date: 2024-05-07
-future: true
-htmlwidgets: true
-
-# Anonymize when submitting
-authors:
-  - name: Mathieu Dagréou
-    url: https://matdag.github.io
-    affiliations:
-      name: Inria
-  - name: Pierre Ablin
-    url: https://pierreablin.com/
-    affiliations:
-      name: Apple
-  - name: Samuel Vaiter
-    url: https://samuelvaiter.com/
-    affiliations:
-      name: CNRS
-  - name: Thomas Moreau
-    url: https://tommoral.github.io/
-    affiliations:
-      name: Inria
-# must be the exact same name as your blogpost
-bibliography: 2024-05-07-bench-hvp.bib
-
-# Add a table of contents to your post.
-#   - make sure that TOC names match the actual section names
-#     for hyperlinks within the post to work correctly.
-#   - please use this format rather than manually creating a markdown table of contents.
-toc:
-  - name: What are HVPs and where are they useful?
-    subsections:
-    - name: Inverse Hessian-vector products (iHVPs) in optimization
-    - name: HVPs for the study of the loss landscape
-  - name: A quick detour by automatic differentiation
-    subsections:
-    - name: Computational graph
-    - name: Forward mode
-    - name: Reverse mode
-  - name: Naive computation of HVPs
-  - name: HVPs with automatic differentiation
-    subsections:
-    - name: Forward-over-reverse
-    - name: Reverse-over-reverse
-    - name: Reverse-over-forward
-  - name: Benchmark with deep learning architectures
-    subsections:
-    - name: Time complexity
-    - name: Memory complexity
-  - name: Conclusion
-
-# Below is an example of injecting additional post-specific styles.
-# This is used in the 'Layouts' section of this post.
-# If you use this post as a template, delete this _styles block.
-_styles: >
-  .framed {
-    border: 1px var(--global-text-color) dashed !important;
-    padding: 20px;
-  }
-  .marge {
-    margin-left: 20px;
-  }
----
-
-Hessian-vector products (HVPs) play a central role in the study and the use of the geometric properties of the loss function of deep neural networks, as well as in many recent bilevel optimizers.
-However, computing such a quantity is often considered prohibitive by practitioners, discouraging them from using algorithms that rely on HVPs.
-
-With this blog post, we aim to convince practitioners that with modern automatic differentiation (AD) frameworks such as `JAX` or `PyTorch`, HVPs can be efficiently evaluated. Indeed, standard AD theory predicts that the computational cost of an HVP is of the same order as the cost of computing a gradient. After a brief introduction on why HVPs are useful for optimization and ML applications and on the basics of AD, we explain in detail the AD-based methods to compute an HVP and the reason for their efficiency. In particular, we show that one can compute HVPs without explicit Hessian computation. We then compare the different methods to compute HVPs for several deep neural network architectures in terms of time and memory for both `JAX` and `PyTorch`. Our results illustrate the complexity predicted by the theory, showing that computing an HVP is not much more expensive than computing a gradient. This opens an avenue to develop efficient second-order informed methods for neural networks.
-
-## What are HVPs and where are they useful?
-
-Let us first introduce the notion of Hessian and HVP. We will consider in this post a twice differentiable function $$f:\mathbb{R}^d\to\mathbb{R}$$ that goes from a vector $$x$$ in the space $$\mathbb{R}^d$$ to a real number in $$\mathbb{R}$$. This typically corresponds to a function that maps the value of the parameters $$\theta$$ of a neural network to the loss $$f(\theta)$$.
-For such a function, standard AD can be used to efficiently compute the gradient of the loss $$\nabla f(\theta) = \left[ \frac{\partial f}{\partial \theta_i}(\theta)\right]_{1\le i \le d} \in \mathbb{R}^d$$, using backpropagation.
-The Hessian matrix of $$f$$ at $$\theta$$ is the matrix of its second-order partial derivatives
-
-$$
-  \nabla^2 f(\theta) = \left[\frac{\partial^2f}{\partial \theta_i\partial \theta_j}(\theta)\right]_{1\leq i,j\leq d}\in\mathbb{R}^{d\times d}\enspace.
-$$
-
-This matrix corresponds to the derivative of the gradient and captures how the gradient changes when moving $$\theta$$. To evaluate the variation of the gradient when moving $$\theta$$ in the direction $$v\in\mathbb{R}^d$$, one can compute the quantity $$\nabla^2 f(\theta) v\in\mathbb{R}^d$$.
This is the Hessian-vector product (HVP).
-
-Let us review some use cases of HVPs in optimization and machine learning.
-
-### Inverse Hessian-vector products (iHVPs) in optimization
-When trying to find the minimum of the function $$f$$, methods that account for second-order information often rely on the product between the inverse Hessian and a vector to find a good update direction.
-For instance, Newton's method relies on update rules of the form
-
-$$
-  \theta_{k+1} = \theta_k - \eta_k[\nabla^2f(\theta_k)]^{-1}\nabla f(\theta_k)
-$$
-
-for some step-size $$\eta_k>0$$.
-
-When evaluating the term $$[\nabla^2f(\theta_k)]^{-1}\nabla f(\theta_k)$$, it would be very inefficient to first compute the full Hessian matrix $$\nabla^2f(\theta_k)$$, then invert it, and finally multiply this with the gradient $$\nabla f(\theta_k)$$.
-Instead, one computes the inverse Hessian-vector product (iHVP) by solving the following linear system
-
-\begin{equation}\label{eq:linear_system}
-  \nabla^2f(\theta)v = b\enspace.
-\end{equation}
-
-with $$b = \nabla f(\theta_k)$$.
-This approach is much more efficient as it avoids computing and storing the full Hessian matrix, and only computes the inverse of the matrix applied to the vector $$b$$.
-
-A second use case for the iHVP in optimization is bilevel optimization. In bilevel optimization, one wants to solve the following problem
-
-\begin{equation}\label{eq:bilevel_pb}
-  \min_{x\in\mathbb{R}^d} h(x) = F(x, y^* (x))\quad\text{with}\quad y^*(x) = \arg\min_{y\in\mathbb{R}^p} G(x, y)\enspace.
-\end{equation}
-
-The gradient of the function $$h$$ can be computed using the implicit function theorem, giving the following expression
-
-$$
-  \nabla h(x) = \nabla_x F(x, y^* (x)) - \nabla_{xy}G(x, y^*(x))[\nabla_{yy}G(x, y^*(x))]^{-1}\nabla_y G(x, y^*(x))\enspace.
-$$
-
-Here, the term $$\nabla^2_{yy} G(x, y)$$ is the Hessian of the function $$G$$ relative to $$y$$. Thus, this quantity also requires computing an iHVP.
-
-To compute the iHVP, there are many methods in the literature to solve \eqref{eq:linear_system}, like Neumann iterates, the Conjugate Gradient method, or gradient descent steps on the quadratic form $$v\mapsto \frac12\langle\nabla^2f(\theta)v, v\rangle - \langle b, v\rangle$$. These methods rely on HVPs, as illustrated by the highlighted terms in the Conjugate Gradient method below. Thus, an efficient implementation of HVPs is crucial for the overall algorithm performance.
-

-<div class="framed">
-<b>Conjugate gradient to solve \eqref{eq:linear_system}</b>
-
-<b>Input:</b> initialization \(v_0\)
-
-<b>Initialization:</b>
-$$
-r_0 = \textcolor{orange}{\nabla^2f(\theta) v_0} - b,\quad p_0 = -r_0,\quad t = 0
-$$
-
-<b>While</b> \(r_t \neq 0\):
-\begin{align*}
-\alpha_t &=\frac{r_t^\top r_t}{p_t^\top \textcolor{orange}{\nabla^2f(\theta) p_t}} \\
-v_{t+1} &=v_t + \alpha_t p_t \\
-r_{t+1} &=r_t + \alpha_t\textcolor{orange}{\nabla^2f(\theta) p_t} \\
-\beta_{t+1} &=\frac{r_{t+1}^\top r_{t+1}}{r_t^\top r_t} \\
-p_{t+1} &=-r_{t+1} + \beta_{t+1} p_t\\
-t &=t + 1
-\end{align*}
-</div>
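-
-To make the role of the highlighted HVP terms concrete, here is a minimal `JAX` sketch of this conjugate gradient loop that touches the Hessian only through an HVP oracle. The function name and stopping rule are ours, not from any library, and the `hvp` one-liner is explained later in this post.
-
-```python
-import jax
-import jax.numpy as jnp
-
-
-def solve_hvp_system(f, theta, b, n_iter=100, tol=1e-8):
-    """Approximately solve grad^2 f(theta) v = b by conjugate gradient."""
-    hvp = lambda u: jax.jvp(jax.grad(f), (theta,), (u,))[1]  # HVP oracle
-    v = jnp.zeros_like(b)
-    r = hvp(v) - b
-    p = -r
-    for _ in range(n_iter):
-        if jnp.linalg.norm(r) < tol:  # r_t = 0 up to numerical precision
-            break
-        Hp = hvp(p)  # the only place the Hessian is ever accessed
-        alpha = (r @ r) / (p @ Hp)
-        v = v + alpha * p
-        r_new = r + alpha * Hp
-        beta = (r_new @ r_new) / (r @ r)
-        p = -r_new + beta * p
-        r = r_new
-    return v
-```
-
-For instance, the Newton direction of the update rule above would be obtained with `solve_hvp_system(f, theta, jax.grad(f)(theta))`.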

-
-### HVPs for the study of the loss landscape
-
-The study of the geometry of neural networks is an active field that aims at understanding the links between training dynamics, the local geometry of the training loss, and generalization. One way to study the local geometry of a neural network is to find the distribution of the eigenvalues of its Hessian matrix. Indeed, depending on the sign of the eigenvalues of the Hessian, one can for instance distinguish local minima, local maxima and saddle points. As an illustration, the following figure shows how the sign of the eigenvalues of the Hessian matrix of a function affects the shape of the function's landscape around a stationary point.
-
-{% include figure.html path="assets/img/2024-05-07-bench-hvp/hess_eig.png" class="img-fluid" %}
-
-In several papers, an approximation of the Hessian spectrum is computed thanks to the Lanczos algorithm. This algorithm is a modification of the power method where each new iterate is taken in the orthogonal complement of the previous iterates. It outputs a factorization of the Hessian of the form $$\nabla^2 f(\theta) = VTV^\top$$ where $$V=(v_0,...,v_{k-1})$$ is orthogonal and
-
-$$
-T = \begin{pmatrix}
-  \alpha_0& \beta_1 & 0 & \cdots & 0\\
-  \beta_1 & \alpha_1 & \beta_2 & \ddots & \vdots\\
-  0 & \beta_2 & \alpha_2 & \ddots & 0\\
-  \vdots & \ddots & \ddots & \ddots & \beta_{k-1}\\
-  0 & \cdots & 0 & \beta_{k-1} & \alpha_{k-1}
-\end{pmatrix}\enspace.
-$$
-

-<div class="framed">
-<b>Lanczos' algorithm</b>
-
-<b>Input:</b> initial vector \(v_0\).
-
-<b>Initialization:</b>
-$$
-w'_0 = \textcolor{orange}{\nabla^2f(\theta)v_0},\quad \alpha_0 = w_0'^\top v_0,\quad w_0 = w_0' - \alpha_0 v_0
-$$
-
-<b>For</b> \(i = 1,\dots, k-1\):
-\begin{align*}
-\beta_i &= \|w_{i-1}\|\\
-v_{i} &= \frac{w_{i-1}}{\beta_{i}}\\
-w_i' &= \textcolor{orange}{\nabla^2f(\theta)v_i}\\
-\alpha_i &= w_i'^\top v_i\\
-w_i &= w_i' - \alpha_i v_i - \beta_iv_{i-1}
-\end{align*}
-</div>
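-
-As a companion to the box above, here is a small sketch of the Lanczos iteration that, again, only needs an `hvp` oracle (the naming is ours, and this is the textbook variant rather than the exact implementation used in the papers mentioned above):
-
-```python
-import jax.numpy as jnp
-
-
-def lanczos(hvp, v0, k):
-    """Return the diagonal (alphas) and off-diagonal (betas) of T."""
-    v = v0 / jnp.linalg.norm(v0)
-    v_prev = jnp.zeros_like(v)
-    alphas, betas = [], []
-    beta = 0.0
-    for i in range(k):
-        w = hvp(v) - beta * v_prev  # one HVP per iteration
-        alpha = w @ v
-        w = w - alpha * v           # orthogonalize against v_i
-        alphas.append(alpha)
-        if i < k - 1:
-            beta = jnp.linalg.norm(w)
-            betas.append(beta)
-            v_prev, v = v, w / beta
-    return jnp.array(alphas), jnp.array(betas)
-```
-
-The eigenvalues of the small tridiagonal matrix \(T\) built from these coefficients then approximate the extreme eigenvalues of the Hessian.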

-
-We observe once again that the Hessian information is accessed through HVPs rather than through the full Hessian matrix itself.
-
-
-## A quick detour by automatic differentiation
-
-Automatic differentiation (AD) is an important tool to compute exactly the derivatives of differentiable functions obtained as the composition of simple operations.
-There are two modes in AD: the forward mode, which computes Jacobian-vector products (JVPs), and the reverse mode, which computes vector-Jacobian products (VJPs).
-Since the gradient of a scalar function is a special case of the VJP, the reverse mode is the most frequently used in machine learning.
-It is typically used to compute the gradients of deep learning cost functions, where it is called *backpropagation*.
-
-In what follows, we briefly present the notion of computational graph and the two AD modes. For a more detailed explanation, we refer the reader to the excellent survey by Baydin et al.
-
-### Computational graph
-
-A key ingredient of AD is the computational graph associated with the code that evaluates a function.
-It is a directed acyclic graph that represents the succession of elementary operations required to evaluate a function.
-A simple computational graph of a function $$f:\mathbb{R}^d\to\mathbb{R}^p$$ typically looks like
-
-{% include figure.html path="assets/img/2024-05-07-bench-hvp/direct_graph.png" class="img-fluid"%}
-
-In this graph, the vertices $$z_i\in\mathbb{R}^{m_i}$$ represent the intermediate states of the evaluation of $$f$$.
-To get the vertex $$z_i$$, we use the values of its parents in the graph $$z_{i-1}$$, with simple transfer functions $$z_i(z_{i-1})$$.
-The computational complexity of the function evaluation depends on the complexity of the considered graph, as one node might have more than one parent.
-The memory footprint of the evaluation of the function is also linked to the maximum number of parents that a vertex of the computational graph can have, as their values need to be stored until all children nodes have been computed.
-
-Let us take an example with a multilayer linear perceptron (MLP) with 2 layers.
-The function $$f_x:\mathbb{R}^h\times \mathbb{R}^{h\times p}\to \mathbb{R}$$ is defined for an input $$x\in\mathbb{R}^p$$ by
-
-\begin{equation}\label{eq:mlp}
-  f_x(U, W) = \frac12(UWx)^2\enspace.
-\end{equation}
-
-Here, the input $$\theta$$ corresponds to the parameters of the network $$(U, W)$$ and the intermediate steps are $$z_1 = Wx$$, $$z_2 = Uz_1$$ and $$z_3 = \frac12 z_2^2$$.
-A possible computational graph to get $$f_x(U, W)$$ is the following
-
-{% include figure.html path="assets/img/2024-05-07-bench-hvp/computational_graph.png" class="img-fluid"%}
-
-and the associated Python code to compute $$f_x$$ is
-```python
-def f(U, W):
-    # x is a fixed input vector taken from the enclosing scope
-    z1 = W @ x
-    z2 = U @ z1
-    z3 = 0.5 * z2**2
-    return z3
-```
-
-Here, the feed-forward structure of the function makes the computational graph very simple, as each node has a single intermediate-result parent.
-
-AD uses this computational graph to compute the function's derivatives.
-Using the chain rule, the Jacobian $$\frac{\partial f}{\partial \theta}(\theta)$$ of $$f$$ is obtained as a product of the Jacobians of the intermediate states $$z_1, \dots, z_n$$.
-\begin{equation}\label{eq:chain_rule}
-  \underbrace{\frac{\partial f}{\partial \theta}(\theta)}\_{p\times d} = \frac{\partial z_n}{\partial \theta}
-  =\frac{\partial z_n}{\partial z_1}\frac{\partial z_1}{\partial \theta}=\cdots = \underbrace{\frac{\partial z_n}{\partial z_{n-1}}}\_{p\times m_{n-1}}\underbrace{\frac{\partial z_{n-1}}{\partial z_{n-2}}}\_{m_{n-1}\times m_{n-2}}\cdots\underbrace{\frac{\partial z_1}{\partial \theta}}\_{m_1\times d}\enspace.
-\end{equation}
-Depending on the order of the multiplications, one can compute the derivative of $$f$$ with respect to $$\theta$$ in two ways: the forward mode and the reverse mode.
-
-### Forward mode
-
-For a vector $$v\in\mathbb{R}^d$$, the Jacobian-vector product (JVP) corresponds to the directional derivative of $$f$$ in the direction $$v$$. It can be computed by forward mode AD
-
-\begin{equation}\label{eq:chain_rule_jvp}
-  \frac{\partial f}{\partial \theta}(\theta)\times v = \frac{\partial z_n}{\partial z_{n-1}}\frac{\partial z_{n-1}}{\partial z_{n-2}}\cdots\frac{\partial z_1}{\partial \theta}v\enspace.
-\end{equation}
-
-It consists in doing the multiplications in \eqref{eq:chain_rule_jvp} from right to left. It is a forward pass in the computational graph where we propagate at the same time the states $$z_i$$ and the partial derivatives $$\frac{\partial z_{i+1}}{\partial z_i}$$. If $$f$$ is real-valued, the $$i$$th coordinate of its gradient is exactly given by the product of the Jacobian of $$f$$ and the $$i$$th canonical basis vector $$e_i$$ since
-\begin{equation}
-\frac{\partial f}{\partial \theta_i}(\theta) = \lim_{t\to 0}\frac{f(\theta+te_i)-f(\theta)}{t}\enspace.
-\end{equation}
-Thus, we can get its gradient by computing each of the $$d$$ JVPs $$\left(\frac{\partial f}{\partial \theta_i}(\theta)\times e_i\right)_{1\leq i \leq d}$$ with forward AD.
-
-To understand properly what is happening when using forward differentiation, let us go back to the linear MLP defined in \eqref{eq:mlp}.
-If we implement the forward differentiation ourselves to get the JVP, we obtain the following code
-
-```python
-def jvp(U, W, v_u, v_w):
-    # Forward diff of f
-    z1 = W @ x
-    v_z1 = v_w @ x  # Directional derivative of W -> W @ x in the direction v_w
-
-    z2 = U @ z1
-    v_z2 = U @ v_z1 + v_u @ z1  # Directional derivative of (U, z_1) -> z2 in the direction (v_u, v_z1)
-
-    v_z3 = v_z2 @ z2  # Directional derivative of z2 -> 0.5*z2**2 in the direction v_z2
-    return v_z3
-```
-
-In comparison with the code for the evaluation of $$f_x$$, there are two more operations, corresponding to the computation of the dual variables `v_z1` and `v_z2`. In terms of memory, if we consider the computation of the JVP as coded in the previous snippet, the maximum number of parents of a vertex is four. This maximum is achieved by the vertex `v_z2`, which has the vertices `U`, `v_z1`, `v_u` and `z1` as parents.
-
-In `JAX`, we get the JVP of a function $$f$$ in the direction $$v$$ with `jax.jvp(f, (params, ), (v, ))[1]`.
-
-### Reverse mode
-The reverse mode is also known as backpropagation in the context of deep learning. For $$u\in\mathbb{R}^p$$, it aims at computing VJPs
-
-\begin{equation}\label{eq:chain_rule_vjp}
-  u^\top\frac{\partial f}{\partial \theta}(\theta) = u^\top\frac{\partial z_n}{\partial z_{n-1}}\frac{\partial z_{n-1}}{\partial z_{n-2}}\cdots\frac{\partial z_1}{\partial \theta}\enspace.
-\end{equation}
-
-In reverse AD, the multiplications of \eqref{eq:chain_rule_vjp} are done from left to right.
It requires doing one forward pass in the computational graph to compute the intermediate states $$z_i$$ and then a backward pass to propagate the successive partial derivatives from left to right. Contrary to the forward mode, it has a larger memory footprint. Indeed, it requires storing the values of all the states. For instance, to compute the last term $$\frac{\partial z_3}{\partial z_2}$$, one needs the value of $$z_2$$, which was the first computed during the forward pass. If $$f$$ is real-valued, $$u$$ is a scalar and the VJP is the multiplication of the gradient of $$f$$ by $$u$$. Thus, one can get the gradient of $$f$$ by using $$u=1$$ and performing only one reverse differentiation. This makes this mode more efficient for computing gradients.
-
-Let us observe what happens if we code the backpropagation manually to get the gradient of the previous function $$f_x$$ defined by $$f_x(U, W) = \frac12(UW x)^2$$.
-
-```python
-def gradient(U, W):
-    # Forward pass
-    z1 = W @ x
-    z2 = U @ z1
-    z3 = 0.5 * z2**2
-
-    # Reverse pass
-    ## Transfer function: z3 = 0.5 * z2**2
-    dz2 = z2  # derivative of z3 wrt z2
-
-    ## Transfer function: z2 = U @ z1
-    dU = jnp.outer(dz2, z1)  # derivative of z3 wrt U
-    dz1 = U.T @ dz2  # derivative of z3 wrt z1
-
-    ## Transfer function: z1 = W @ x
-    dW = jnp.outer(dz1, x)  # derivative of z3 wrt W
-
-    return dU, dW
-```
-
-This function returns the gradient of $$f_x$$. Reading this code, we understand that one needs to store all the intermediate values of the forward pass in the graph. Indeed, if we look at the case of `z1`, which is the first node computed, it is used four steps later for the computation of `dW`.
-
-To get the gradient in `JAX`, one can use `jax.grad(f)(params)`.
-
-
-## Naive computation of HVPs
-Since we are interested in computing $$\nabla^2 f(\theta)v$$, the simplest way to do it is to compute the Hessian matrix and then multiply it by the vector $$v$$. This can be achieved in `JAX` by calling `jax.hessian(f)(params) @ v`.
-
-This method is quite cumbersome, making it impossible to use for deep neural networks. Indeed, the storage of the full Hessian matrix has $$\mathcal{O}(d^2)$$ complexity, where $$d$$ is the dimension of the model's parameter set.
-
-The good news is that we can compute HVPs without computing the Hessian, thanks to a clever use of AD.
-
-
-## HVPs without explicit Hessian computation
-In 1994, Pearlmutter proposed to leverage the following observation to compute HVPs efficiently: the HVP is also the directional derivative of the gradient in the direction $$v$$:
-
-$$
-\nabla^2f(\theta) v = \lim_{\epsilon\to 0} \frac1\epsilon[\nabla f(\theta+\epsilon v)-\nabla f(\theta)] = \nabla [\langle \nabla f(.), v\rangle](\theta)\enspace.
-$$
-
-Based on this identity, AD enables computing HVPs in three ways, as described in the [JAX documentation](https://jax.readthedocs.io/en/latest/notebooks/autodiff_cookbook.html).
-
-
-### Forward-over-reverse
-The forward-over-reverse mode consists in doing forward differentiation in a computational graph of the gradient of $$f$$.
-
-Its implementation in `JAX` is only two lines of code.
-
-```python
-def hvp_forward_over_reverse(f, params, v):
-    return jax.jvp(jax.grad(f), (params, ), (v, ))[1]
-```
-In this case, `jax.grad(f)(params)` is computed by backward AD, whose complexity is two times the complexity of evaluating $$f$$.
-Thus, the temporal complexity of `hvp_forward_over_reverse` is roughly four times the complexity of the evaluation of $$f$$.
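-
-As a side note, `jax.jvp` applied to `jax.grad(f)` returns both the primal output (the gradient itself) and the tangent (the HVP), so a single call yields both quantities. A minimal usage sketch (the toy function here is ours, not from the post):
-
-```python
-import jax
-import jax.numpy as jnp
-
-f2 = lambda w: 0.5 * jnp.sum(jnp.tanh(w) ** 2)  # any scalar-valued function
-w, v = jnp.arange(4.0), jnp.ones(4)
-
-# grad_w is the gradient of f2 at w; hvp_w is the HVP of f2 at w with direction v.
-grad_w, hvp_w = jax.jvp(jax.grad(f2), (w,), (v,))
-```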
-
-To better see what happens, let us consider again our function $$f_x$$ defined by \eqref{eq:mlp}. The Python code of the `forward-over-reverse` HVP is the following.
-
-```python
-def forward_over_reverse(U, W, v_U, v_W):
-    # Forward through the forward pass through f
-    z1 = W @ x
-    v_z1 = v_W @ x
-
-    z2 = U @ z1
-    v_z2 = U @ v_z1 + v_U @ z1
-
-    # z3 = 0.5 * z2**2
-    # Forward through the backward pass through f
-    z4 = z2  # dz2
-    v_z4 = v_z2  # v_dz2
-
-    z5 = jnp.outer(z4, z1)  # dU
-    v_z5 = jnp.outer(v_z4, z1) + jnp.outer(z4, v_z1)  # v_dU
-
-    z6 = U.T @ z4  # dz1
-    v_z6 = U.T @ v_z4 + v_U.T @ z4  # v_dz1
-
-    z7 = jnp.outer(z6, x)  # dW
-    v_z7 = jnp.outer(v_z6, x)  # v_dW
-
-    return v_z5, v_z7  # v_dU, v_dW
-```
-
-The take-home message of this part is that, after computing the gradient of $$f_x$$, one can consider a computational graph of this gradient and perform forward differentiation through this new computational graph.
-Here, the variables `z1`, ..., `z7` are the vertices of a computational graph of the gradient of $$f_x$$.
-The nice thing is that this mode enables getting the gradient and the HVP at the same time.
-Indeed, in the previous snippet, `z5` and `z7` are the components of the gradient of $$f_x$$, which could also be returned if needed.
-This feature can be useful in bilevel optimization, for instance.
-
-### Reverse-over-reverse
-Instead of doing forward differentiation of the gradient, one can multiply the gradient by $$v$$ and thus get a scalar. We can then backpropagate through this scalar product. This is the reverse-over-reverse mode.
-
-It can be implemented with these lines of code.
-```python
-def hvp_reverse_over_reverse(f, params, v):
-    return jax.grad(lambda y: jnp.vdot(jax.grad(f)(y), v))(params)
-```
-Since the gradients are computed by backpropagation, the complexity of `hvp_reverse_over_reverse` is twice the complexity of `jax.grad(f)`, which is roughly four times the complexity of the evaluation of $$f$$.
-
-Writing down the code of the reverse-over-reverse HVP for our function $$f_x$$ defined by \eqref{eq:mlp} helps us understand the differences between this mode and the `forward-over-reverse` mode. In particular, one can notice that there are more elementary operations in the `reverse-over-reverse` mode than in the `forward-over-reverse` mode. Moreover, in terms of memory footprint, `reverse-over-reverse` requires storing the values of the vertices of the computational graph of the gradient of $$f_x$$, while `forward-over-reverse` only needs to store the values of the vertices of the computational graph of $$f_x$$. Thus, the former is less efficient than the latter.
-
-```python
-def reverse_over_reverse(U, W, v_u, v_w):
-    # Forward through <grad f(theta), v>
-    ## Forward through f
-    z1 = W @ x
-    z2 = U @ z1
-    z3 = 0.5 * jnp.linalg.norm(z2)**2
-
-    ## Reverse through f
-    z4 = z2  # dz2
-    z5 = jnp.outer(z4, z1)  # dU
-    z6 = U.T @ z4  # dz1
-    z7 = jnp.outer(z6, x)  # dW
-
-    # Output: dot product <grad f, v>
-    z8 = jnp.sum(z5 * v_u) + jnp.sum(z7 * v_w)
-
-    # Backward through z8 = <grad f(theta), v>
-    ## z8 = jnp.sum(z5 * v_u) + jnp.sum(z7 * v_w)
-    dz7 = v_w
-    dz5 = v_u
-
-    ## z7 = jnp.outer(z6, x)
-    dz6 = dz7 @ x
-
-    ## z6 = U.T @ z4
-    dz4 = U @ dz6
-    ddU = jnp.outer(z4, dz6)  # Derivative of z8 wrt U
-
-    ## z5 = jnp.outer(z4, z1)
-    dz4 += dz5 @ z1
-    dz1 = dz5.T @ z4
-
-    ## z4 = z2
-    dz2 = dz4
-
-    ## z2 = U @ z1
-    dz1 += U.T @ dz2
-    # As U appears multiple times in the graph, we sum its contributions
-    ddU += jnp.outer(dz2, z1)
-
-    ## z1 = W @ x
-    ddW = jnp.outer(dz1, x)  # Derivative of z8 wrt W
-
-    return ddU, ddW
-```
-
-### Reverse-over-forward
-What about doing forward differentiation of $$f$$ rather than reverse propagation? This is what is done in the reverse-over-forward mode. It consists in backpropagating through the computational graph of the JVP of $$f$$ with $$v$$.
-
-```python
-def hvp_reverse_over_forward(f, params, v):
-    jvp_fun = lambda params: jax.jvp(f, (params, ), (v, ))[1]
-    return jax.grad(jvp_fun)(params)
-```
-
-This method is more efficient than the previous one. Indeed, since we backpropagate only once, the memory burden is lower than in the `reverse-over-reverse` fashion. In comparison with `forward-over-reverse`, the complexity is the same. However, one can notice that `forward-over-reverse` enables computing the gradient of $$f$$ and the HVP at the same time, which is not the case for the `reverse-over-forward` mode.
-
-The code of the `reverse-over-forward` HVP for the MLP $$f_x$$ defined by \eqref{eq:mlp} is the following.
-
-```python
-def reverse_over_forward(U, W, v_U, v_W):
-    # Forward diff of f to compute the JVP
-    z1 = W @ x
-    z6 = v_W @ x  # v_z1
-
-    z2 = U @ z1
-    z5 = U @ z6 + v_U @ z1  # v_z2
-
-    # output
-    z4 = z5 @ z2  # v_z3
-
-    # Backward pass through the JVP
-    ## z4 = z5 @ z2
-    dz2 = z5
-    dz5 = z2  # dv_z2
-
-    ## z5 = U @ z6 + v_U @ z1
-    dz1 = v_U.T @ dz5
-    dz6 = U.T @ dz5  # dv_z1
-    ddU = jnp.outer(dz5, z6)  # derivative of z4 wrt U
-
-    ## z2 = U @ z1
-    # As U and dz1 appear multiple times, we sum their contributions
-    dz1 += U.T @ dz2
-    ddU += jnp.outer(dz2, z1)
-
-    ## z1 = W @ x
-    ddW = jnp.outer(dz1, x)
-    return ddU, ddW
-```
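-
-Before moving to the benchmark, a quick sanity check is that the three `JAX` one-liners above agree numerically. Here is a small usage example (the toy function `g` and the names are ours):
-
-```python
-import jax
-import jax.numpy as jnp
-
-g = lambda w: jnp.sum(jnp.sin(w) ** 2)  # toy scalar-valued function
-w0, v0 = jnp.arange(3.0), jnp.ones(3)
-
-h1 = hvp_forward_over_reverse(g, w0, v0)
-h2 = hvp_reverse_over_reverse(g, w0, v0)
-h3 = hvp_reverse_over_forward(g, w0, v0)
-assert jnp.allclose(h1, h2) and jnp.allclose(h2, h3)
-```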
-
-## Benchmark with deep learning architectures
-
-While these three methods compute the same outputs, the different ways of traversing the computational graph change their overall time and memory complexities. We now compare the computation of HVPs with these three methods for various deep-learning architectures. To cover a broad range of use cases, we consider a residual network ([ResNet34](https://huggingface.co/docs/transformers/model_doc/resnet)) and a transformer-based architecture ([ViT-base](https://huggingface.co/docs/transformers/model_doc/vit)) for image classification, as well as a transformer for natural language processing ([Bert-base](https://huggingface.co/docs/transformers/model_doc/bert#transformers.FlaxBertForTokenClassification)).
-We use the `Flax` and `PyTorch` implementations of these architectures available in the [transformers package](https://huggingface.co/docs/transformers/) provided by [Hugging Face 🤗](https://huggingface.co).
-
-All computations were run on an Nvidia A100 GPU with 40 GB of memory. We used version 0.4.21 of `JAX` and version 2.1.1 of `torch`.
-
-The code of the benchmark is available on [this repo](https://github.com/MatDag/bench_hvp/).
-
-### Time complexity
-
-The first comparison we make is in terms of wall-clock time between the different ways to compute HVPs, and also against the computation of a gradient by backpropagation. For each architecture, we compute the gradient of the model with respect to the parameters by backpropagation. We also compute the HVPs in `forward-over-reverse`, `reverse-over-forward` and `reverse-over-reverse` modes. For each computation, we measure the time taken. Specifically for the HVPs, we subtract the time taken by a gradient computation, to get only the overhead required by the HVP computation.
-The inputs for each architecture are generated randomly. For the ResNet34 architecture, we generated a batch of images of size 224x224x3. To limit out-of-memory issues in the experiments, we generated images of size 96x96x3 for the ViT architecture. For the BERT architecture, we generated a batch of sequences of length 32.
-
-We first use `JAX` with just-in-time compilation. Each computation is run 90 times. We plot on the left of the figure the median computation time, together with the 20% and 80% percentiles in black. The computations are done with a batch size of 128. We observe that, in practice, the overhead over the gradient computation for the HVP computation is between one and two times the time of a gradient computation for the three architectures. Consequently, a whole HVP computation takes between two and three times the time of a gradient calculation. This is consistent with the theory. One can notice that `reverse-over-reverse` is slightly slower than the others in all the cases. `forward-over-reverse` and `reverse-over-forward` are, as for them, very close in terms of time.
-
-We also report in the right figure the computational time of each method with respect to the batch size for the ResNet34 architecture. We observe, as expected, that the computational time scales linearly with the batch size.
-
-{% include figure.html path="assets/img/2024-05-07-bench-hvp/bench_hvp_time_jax.png" class="img-fluid" %}
-
-We run a similar experiment with [`torch.func`](https://pytorch.org/docs/stable/func.html), the functional API available in `PyTorch`, which is similar to the one `JAX` has. The results we get are more contrasted.
-
-In the case of ResNet34, the scaling between the different methods is similar to the one we get with `JAX`. Also, during our experiments, we figured out that batch normalization made the forward computation slow and induced out-of-memory issues. Thus, we removed the batch normalization layers from the ResNet34 architecture.
-
-For ViT and BERT, `forward-over-reverse` is surprisingly slower than the `reverse-over-reverse` method. Moreover, the scaling between the gradient and HVP computational times differs from the one we get with `JAX`. Indeed, for these architectures, the HVP computations take between four and five times the time of the gradient computations. This is a discrepancy with what we would expect from the theory. This might be because, at the time we are writing this blog post, the functional API of `PyTorch` is still in its early stages. In particular, we could not use compilation with `torch.compile` because it does not work with some operators of `torch.func`, such as `torch.func.jvp`.
-
-{% include figure.html path="assets/img/2024-05-07-bench-hvp/bench_hvp_time_torch.png" class="img-fluid" %}
-
-### Memory complexity
-
-We also compare the memory footprint of each approach. The following figure provides the results we get with `JAX` jitted code. On the left, we represent the result for each method and model with a batch size of 64. On the right, we show the evolution of the memory footprint of each method for the ResNet34 with the batch size. Surprisingly, we observe that the memory footprint of the different methods to compute HVPs does not vary for a given model. This is counterintuitive, since we expect the `reverse-over-reverse` method to have a larger memory footprint due to the double backpropagation.
-
-{% include figure.html path="assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax.png" class="img-fluid" %}
-
-However, when we run the same experiment while *disabling the JIT compilation*, the results corroborate the theory. Indeed, one can observe in the following figure that the memory footprint of the `reverse-over-reverse` method is larger than the one of the `forward-over-reverse` and `reverse-over-forward` methods. This is because `reverse-over-reverse` involves two successive backward differentiations, while the other two involve only one reverse differentiation. Moreover, it scales linearly with the batch size, which was not the case in the previous figure in the small batch size regime.
-
-In light of these two results, the clever memory allocation performed during just-in-time compilation significantly reduces the memory footprint of the HVP computations.
-
-{% include figure.html path="assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax_without_jit.png" class="img-fluid" %}
-
-In the following figure, we plot the results we get with the `PyTorch` implementation. One can observe that in all the cases `forward-over-reverse` consumes more memory than the `reverse-over-forward` mode. It is almost at the same level as the `reverse-over-reverse` mode, which is quite unexpected.
-
-The right plot, showing the evolution of the memory footprint with the batch size for the ResNet34 architecture, evolves linearly, as expected.
-
-{% include figure.html path="assets/img/2024-05-07-bench-hvp/bench_hvp_memory_torch.png" class="img-fluid" %}
-
-## Conclusion
-
-In this blog post, we have explored the different ways to compute HVPs from theoretical and practical perspectives. The three take-home messages to keep in mind are the following:
-
-* We can compute HVPs without computing Hessian matrices.
-
-* In practice, computing an HVP takes between two and four times the time taken by a gradient computation, and requires two to three times more memory than computing a gradient.
-
-* The AD framework, and whether or not just-in-time compilation is used, affect the practical time and memory performance of HVP computations.
-
diff --git a/_posts/2024-05-07-clml.md b/_posts/2024-05-07-clml.md
deleted file mode 100644
index 97729d1c..00000000
--- a/_posts/2024-05-07-clml.md
+++ /dev/null
@@ -1,1278 +0,0 @@
----
-layout: distill
-title: >-
-  On Bayesian Model Selection: The Marginal Likelihood, Cross-Validation, and Conditional Log Marginal Likelihood
-description: >-
-  Bayesian model selection has long relied on the marginal likelihood and related quantities, often motivated by the principle of Occam's razor. Following the paper 'Bayesian Model Selection, the Marginal Likelihood, and Generalization' by Lotfi et al.
(2022/2023), this blog post critically examines the conventional focus on the marginal likelihood and related quantities for Bayesian model selection as a direct consequence of Occam's razor. We find that the suitability of these criteria depends on the specific context and goals of the modeling task.
-  We revisit the concepts of log marginal likelihood (LML), cross-validation, and the recently introduced conditional log marginal likelihood (CLML), highlighting their connections and differences through an information-theoretic lens.
-  Through thought experiments and empirical observations, we explore the behavior of these model selection criteria in different data regimes under model misspecification and prior-data conflict, finding that the conditional marginal cross-entropy, closely related to cross-validation, is often more reliable for optimizing generalization performance. We review relevant literature, compare the CLML and validation loss for deep neural networks, and, using a toy Bayesian linear regression, we demonstrate that all the discussed quantities can fail to reliably predict generalization.
-  Our takeaways are that there is no one-size-fits-all solution; the choice of model selection quantity depends on the specific context and goals; and, in the future, we should take model complexity into account as well and not assume a uniform model prior.
-  While the post is limited by the need for more rigorous theoretical justification, a broader range of models and datasets (and deeper engagement with philosophical implications), it rightly questions the primacy of the (conditional) log marginal likelihood and encourages critical thinking about its foundations, aiming for a more nuanced understanding of Bayesian model selection.
-date: 2024-05-07
-future: true
-htmlwidgets: true
-tags:
-- Bayesian Neural Network
-- Generalization
-- Log Marginal Likelihood
-- Conditional Log Marginal Likelihood
-- Information Theory
-- Model Evaluation
-- Model Selection
-
-authors:
-  - name: Andreas Kirsch
-    url: "https://www.blackhc.net"
-    affiliations:
-      name: University of Oxford–2023
-
-# must be the exact same name as your blogpost
-bibliography: 2024-05-07-clml.bib
-
-# Add a table of contents to your post.
-#   - make sure that TOC names match the actual section names
-#     for hyperlinks within the post to work correctly.
-#   - please use this format rather than manually creating a markdown table of contents.
-toc: - - name: "Introduction" - subsections: - - name: "(Bayesian) Model Selection" - - name: "Background: Information-Theoretic Notation and Concepts" - subsections: - - name: "Expressing Occam's Razor in Information-Theoretic Terms" - - name: "Hyperparameter Learning and Model Selection" - subsections: - - name: "Model Parameters" - - name: "Bayesian Model Averaging" - - name: "Marginal Likelihood and Estimation" - - name: "Datasets instead of Individual Data Points" - subsections: - - name: "Joint Marginal Information and Cross-Entropy" - - name: "Marginal Information and Cross-Entropy" - - name: "Marginal Cross-Entropy vs Joint Cross-Entropy" - - name: "Intermediate Comparison" - - name: "Different Data Regimes" - subsections: - - name: "Model Misspecification" - - name: "Infinite Data Limit" - - name: "Prior-Data Conflict" - - name: "Anti-Correlated Model Misspecification and Prior-Data Conflict" - - name: "Approximating the (Cross-)Validation Loss" - - name: "The Big Comparison" - - name: "Literature Review" - subsections: - - name: "Fong and Holmes (2020): \"On the marginal likelihood and cross-validation\"" - - name: "Lyle et al. (2020) and Ru et al. (2021): Training speed and model selection" - - name: "Lotfi et al. (2022/2023): \"Bayesian Model Selection, the Marginal Likelihood, and Generalization\"" - - name: "A Simple Toy Experiment" - subsections: - - name: "Experimental Setup" - - name: "Results" - - name: "A Narrow but Deep Dive into \"Bayesian Model Selection, the Marginal Likelihood, and Generalization\"" - subsections: - - name: "Use Cases and Pitfalls of the LML" - - name: "The \"Conditional Marginal Likelihood\" in Lotfi et al. (2022/2023)" - - name: "Estimating the CLML and LML via the Laplace Approximation" - - name: "DNN Experiments: Validation Loss vs. CLML" - - name: "Conclusion" - - name: "Appendix" - subsections: - - name: "Detailed Code Review of the DNN Experiments in Lotfi et al. (2022/2023)" - - name: "Author Response from 2022" - - name: "Ablation: CLML vs. BMA Validation Loss vs. (non-BMA) Validation Loss" - - name: "Ablation: LA Sample Size" -# Below is an example of injecting additional post-specific styles. -# This is used in the 'Layouts' section of this post. -# If you use this post as a template, delete this _styles block. 
-_styles: > - .fake-img { - background: #bbb; - border: 1px solid rgba(0, 0, 0, 0.1); - box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1); - margin-bottom: 12px; - } - .fake-img p { - font-family: monospace; - color: white; - text-align: left; - margin: 12px 0; - text-align: center; - font-size: 16px; - } - .box-note, .box-warning, .box-error, .box-important { - padding: 15px 15px 15px 10px; - margin: 20px 20px 20px 5px; - border: 1px solid #eee; - border-left-width: 5px; - border-radius: 5px 3px 3px 5px; - } - d-article .box-note { - background-color: #eee; - border-left-color: #2980b9; - } - d-article .box-warning { - background-color: #fdf5d4; - border-left-color: #f1c40f; - } - d-article .box-error { - background-color: #f4dddb; - border-left-color: #c0392b; - } - d-article .box-important { - background-color: #d4f4dd; - border-left-color: #2bc039; - } - html[data-theme='dark'] d-article .box-note { - background-color: #333333; - border-left-color: #2980b9; - } - html[data-theme='dark'] d-article .box-warning { - background-color: #3f3f00; - border-left-color: #f1c40f; - } - html[data-theme='dark'] d-article .box-error { - background-color: #300000; - border-left-color: #c0392b; - } - html[data-theme='dark'] d-article .box-important { - background-color: #003300; - border-left-color: #2bc039; - } - html[data-theme='dark'] d-article aside { - color: var(--global-text-color) !important; - } - html[data-theme='dark'] d-article blockquote { - color: var(--global-text-color) !important; - } - html[data-theme='dark'] d-article summary { - color: var(--global-text-color) !important; - } - d-article aside * { - color: var(--global-text-color) !important; - } - d-article p { - text-align: justify; - text-justify: inter-word; - -ms-hyphens: auto; - -moz-hyphens: auto; - -webkit-hyphens: auto; - hyphens: auto; - } - d-article aside { - border: 1px solid #aaa; - border-radius: 4px; - padding: .5em .5em 0; - font-size: 90%; - } - d-article aside p:first-child { - margin-top: 0; - } - d-article details { - border: 1px solid #aaa; - border-radius: 4px; - padding: .5em .5em 0; - } - d-article summary { - font-weight: bold; - margin: -.5em -.5em 0; - padding: .5em; - display: list-item; - } - d-article details[open] { - padding: .5em; - } - d-article figure { - padding: 1em 1em 0; - } - d-article details[open] summary { - border-bottom: 1px solid #aaa; - margin-bottom: .5em; - } - html[data-theme='dark'] d-article blockquote { - border-left-color: #f1c40f; - } - @media (min-width: 768px) { - .l-gutter:has(> aside) { - height: 0px; - } - } - @media (min-width: 1025px) { - d-article d-contents { - height: 0px; - } - } ---- - - -{% raw %} -
-$$\require{mathtools} -\DeclareMathOperator{\opExpectation}{\mathbb{E}} -\newcommand{\E}[2]{\opExpectation_{#1} \left [ #2 \right ]} -\newcommand{\simpleE}[1]{\opExpectation_{#1}} -\newcommand{\MidSymbol}[1][]{\:#1\:} -\newcommand{\given}{\MidSymbol[\vert]} -\DeclareMathOperator{\opmus}{\mu^*} -\newcommand{\IMof}[1]{\opmus[#1]} -\DeclareMathOperator{\opInformationContent}{H} -\newcommand{\ICof}[1]{\opInformationContent[#1]} -\newcommand{\xICof}[1]{\opInformationContent(#1)} -\DeclareMathOperator{\opEntropy}{H} -\newcommand{\Hof}[1]{\opEntropy[#1]} -\newcommand{\xHof}[1]{\opEntropy(#1)} -\DeclareMathOperator{\opMI}{I} -\newcommand{\MIof}[1]{\opMI[#1]} -\DeclareMathOperator{\opTC}{TC} -\newcommand{\TCof}[1]{\opTC[#1]} -\newcommand{\CrossEntropy}[2]{\opEntropy(#1 \MidSymbol[\Vert] #2)} -\newcommand{\iCrossEntropy}[3]{\opEntropy_{#1 \Vert #2}[#3]} -\DeclareMathOperator{\opKale}{D_\mathrm{KL}} -\newcommand{\Kale}[2]{\opKale(#1 \MidSymbol[\Vert] #2)} -\newcommand{\iKale}[3]{\opKale_{,\, #1 \Vert #2}[#3]} -\DeclareMathOperator{\opJSD}{D_\mathrm{JSD}} -\newcommand{\JSD}[2]{\opJSD(#1 \MidSymbol[\Vert] #2)} -\DeclareMathOperator{\opp}{p} -\newcommand{\pof}[1]{\opp(#1)} -\newcommand{\hpof}[1]{\hat{\opp}(#1)} -\newcommand{\pcof}[2]{\opp_{#1}(#2)} -\newcommand{\hpcof}[2]{\hat\opp_{#1}(#2)} -\DeclareMathOperator{\opq}{q} -\newcommand{\qof}[1]{\opq(#1)} -\newcommand{\hqof}[1]{\hat{\opq}(#1)} -\newcommand{\qcof}[2]{\opq_{#1}(#2)} -\newcommand{\varHof}[2]{\opEntropy_{#1}[#2]} -\newcommand{\xvarHof}[2]{\opEntropy_{#1}(#2)} -\newcommand{\varMIof}[2]{\opMI_{#1}[#2]} -\newcommand{\w}{\boldsymbol{\theta}} -\newcommand{\W}{\boldsymbol{\Theta}} -\newcommand{\h}{\boldsymbol{\phi}} -\newcommand{\hopt}{\boldsymbol{\h^\star}} -\newcommand{\H}{\boldsymbol{\Phi}} -\DeclareMathOperator{\opf}{f} -\newcommand{\fof}[1]{\opf(#1)} -\newcommand{\xset}[3]{(\x_n^{#1})_{n=#2}^{#3}} -\newcommand{\xNset}{(\x_n)_{n=1}^N} -\newcommand{\XNtuple}{(\X_n)_{n=1}^N} -\newcommand{\xNtuple}{(\x_n)_{n=1}^N} -\newcommand{\XNset}{\{\X_n\}_{n=1}^N} -\newcommand{\xNset}{\{\x_n\}_{n=1}^N} -\newcommand{\XNsetk}{\{\X_n\}_{n=N-k+1}^N} -\newcommand{\xNsetk}{\{\x_n\}_{n=N-k+1}^N} -\newcommand{\XNkset}{\{\X_n\}_{n=1}^{N-k}} -\newcommand{\xNkset}{\{\x_n\}_{n=1}^{N-k}} -\newcommand{\XNoset}{\{\X_n\}_{n=1}^{N-1}} -\newcommand{\y}{y} -\newcommand{\Y}{Y} -\newcommand{\L}{\boldsymbol{L}} -\newcommand{\x}{\boldsymbol{x}} -\newcommand{\X}{\boldsymbol{X}} -\newcommand{\oppdata}{\hat{\opp}_{\text{data}}} -\newcommand{\pdata}[1]{\hpcof{\text{data}}{#1}} -\newcommand{\normaldist}[1]{\mathcal{N}(#1)} -\newcommand{\wstddev}{\sigma_\w} -\newcommand{\noisestddev}{\sigma_\text{noise}} -\newcommand{\Dataset}{\mathcal{D}} -\newcommand{\Dtrain}{\Dataset_{\text{train}}} -\newcommand{\Dval}{\Dataset_{\text{val}}} -$$ -
-{% endraw %} - -## Introduction - - - -Model selection is a crucial aspect of machine learning, as it allows us to choose the most appropriate model for a given task. In the Bayesian setting, the marginal likelihood has been a popular tool for model selection and hyperparameter learning, often motivated by the principle of Occam's razor. However, the suitability of the marginal likelihood depends on the specific context and goals of the modeling task. - -Recently, the paper "Bayesian Model Selection, the Marginal Likelihood, and Generalization" by Lotfi et al. (2022/2023), which was accepted as Outstanding Paper and Long Oral at ICML 2022, examined the importance and challenges of model selection in machine learning, focusing on the log marginal likelihood (LML) and proposing a variant, the conditional log marginal likelihood (CLML). The authors argue that while LML is a useful tool for hypothesis testing, it may not be the best metric for model selection and for predicting the generalization performance of trained models or learning hyperparameters. They introduce the CLML as a potential improvement and demonstrate its effectiveness across various settings, including density models, Fourier features, Gaussian Processes, and deep neural networks. - -In this blog post, inspired by the above paper, we (re-)derive insights that challenge the conventional focus on the marginal likelihood and related quantities for Bayesian model selection. We argue that the quantities we examine are all consequences of Occam's razor, and thus no single quantity should be considered universally superior. Instead, the choice of model selection criterion should be guided by the context and the desired outcomes. We highlight that many recently proposed metrics for model selection, including CLML, are closely related to cross-validation and have failure cases that can be explained by considering model misspecification and prior-data conflicts. Overall, the choice between these metrics should be based on the specific requirements of the task at hand. - -We begin by discussing the foundations of model selection, including the role of Occam's razor and its relationship to maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation. We then introduce the concepts of log marginal likelihood (LML), cross-validation, and conditional log marginal likelihood (CLML), highlighting their connections and differences. -Through a series of thought experiments and empirical observations, we explore the behavior of these model selection criteria in various scenarios, such as under model misspecification, prior-data conflict, and in different data regimes. We find that the conditional marginal cross-entropy, which is closely related to cross-validation, is often a more reliable choice when the primary objective is to select for generalization performance. On the other hand, the conditional joint marginal cross-entropy (permutation-invariant negative CLML) may be preferable when the focus is on sequential prediction and online learning. At the same time, the joint marginal information (negative LML) is rarely the right choice for model selection. -We review relevant literature, including the work of Fong and Holmes (2020) on the connection between the LML and cross-validation, the training speed estimators by Lyle et al. (2020) and Ru et al. (2021), and the experiments of Lotfi et al. (2022/2023) , comparing the CLML and validation loss for deep neural networks (DNNs). 
These studies provide valuable insights into the strengths and limitations of different model selection criteria. - -Throughout the post, we emphasize the importance of considering the context, available data, and desired outcomes when selecting the most appropriate metric for model selection and hyperparameter tuning. By questioning the primacy of the (conditional) joint marginal likelihood and encouraging critical thinking about the foundations of these quantities, we hope to foster a more nuanced understanding of Bayesian model selection. - -## (Bayesian) Model Selection - -In our daily lives, we're often faced with choices that require us to sift through competing explanations or decisions. Imagine you hear your doorbell ring. You might think it's the delivery you've been waiting for, a neighbor dropping by, or perhaps you didn't hear anything at all, and it was just your imagination. In deciding between these options, you're likely to lean towards the simplest explanation that aligns with your expectations—say, the long-awaited delivery. This inclination towards simplicity has a formal counterpart in scientific discovery and machine learning, known as [Occam’s razor](https://en.wikipedia.org/wiki/Occam%27s_razor): - - - -This concept is further illustrated using [an example from chapter 28](https://www.inference.org.uk/itprnn/book.pdf#page=355) of David MacKay’s seminal book, ["Information Theory, Inference, and Learning Algorithms”](http://www.inference.org.uk/mackay/itila/book.html), where the essence of selecting between models based on their evidence is laid out succinctly. - -{% capture caption %} -Excerpt from page 343 in David MacKay’s "Information Theory, Inference, and Learning Algorithms.” -{% endcapture %} -{% include figure.html path="assets/img/2024-05-07-clml/mackay_343.png" zoomable=True class="img-fluid" caption=caption alt="Occam’s razor --- How many boxes are in the picture (figure 28.1)? In particular, how many boxes are in the vicinity of the tree? If we looked with x-ray spectacles, would we see one or two boxes behind the trunk (figure 28.2)? (Or even more?) Occam’s razor is the principle that states a preference for simple theories. ‘Accept the simplest explanation that fits the data’. Thus according to Occam’s razor, we should deduce that there is only one box behind the tree. Is this an ad hoc rule of thumb? Or is there a convincing reason for believing there is most likely one box? Perhaps your intuition likes the argument ‘well, it would be a remarkable coincidence for the two boxes to be just the same height and colour as each other’. If we wish to make artificial intelligences that interpret data correctly, we must translate this intuitive feeling into a concrete theory." - -title="Occam’s razor --- How many boxes are in the picture (figure 28.1)? In particular, how many boxes are in the vicinity of the tree? If we looked with x-ray spectacles, would we see one or two boxes behind the trunk (figure 28.2)? (Or even more?) Occam’s razor is the principle that states a preference for simple theories. ‘Accept the simplest explanation that fits the data’. Thus according to Occam’s razor, we should deduce that there is only one box behind the tree. Is this an ad hoc rule of thumb? Or is there a convincing reason for believing there is most likely one box? Perhaps your intuition likes the argument ‘well, it would be a remarkable coincidence for the two boxes to be just the same height and colour as each other’. 
If we wish to make artificial intelligences that interpret data correctly, we must translate this intuitive feeling into a concrete theory."%} - -But how can we express this formally using mathematics? - -In the next section, we will use information-theoretic concepts to formalize Occam's razor and connect it to the maximum likelihood estimation (MLE) and maximum-a-posteriori (MAP) estimation approaches. This formalization highlights that Occam's razor, as a general principle favoring simplicity, can motivate various techniques, not just Bayesian ones. Therefore, using Occam's razor as the sole justification for Bayesian model selection may not be as compelling as it initially appears. - -However, one could argue that when Occam's razor is properly applied within a Bayesian framework, it captures a more nuanced notion of complexity. From this perspective, the Bayesian formulation of Occam's razor favors models that strike a balance between goodness-of-fit and model complexity, where complexity is measured by the model's ability to compress the data. This view is consistent with the [minimum description length (MDL)](https://www.wikiwand.com/en/Minimum_description_length) principle, which posits that the best model is the one that minimizes the total description length of both the model and the data given the model. - -**From Philosophical Principle to Mathematical Statement** - -Let's first connect Occam's razor to **Maximum Likelihood Estimation (MLE)** before diving deeper into the background and (Bayesian) model selection. - -In information theory, the information content of an event $$x$$ is defined as $$-\log_2 \pof{x}$$, where $$\pof{x}$$ is the probability of that event occurring according to a given model. This is also called *Shannon's information content*. We use the base $$2$$ for logarithms and measure information in *bits (binary digits)*, and for the rest of the post, we will drop the base of the logarithm. -The information content measures the optimal encoding length in bits for the event $$x$$ under the model specified by its probability distribution $$\pof{\cdot}$$. In the context of probabilistic modeling, variables that cannot be directly observed are called *latent variables*. Occam's razor suggests that we should prefer simpler explanations for latent variables, given the *observed data*. - -Consider a model with a latent variable $$z$$ and observed data $$x$$. The model specifies a probability distribution $$\pof{z \given x}$$. According to Occam's razor, we prefer simpler explanations, which correspond to smaller values of $$-\log \pof{z \given x}$$. Using Bayes' theorem, we can rewrite this as: - -$$\text{minimize } z \text{ in } -\log \pof{z \given x} = -\log \pof{x \given z} - \log \pof{z} + \log \pof{x}.$$ - -Given that $$\pof{x}$$ is independent of $$z$$, we can omit it from our objective. Additionally, if we posit a uniform (or non-informative prior) for $$z$$, implying that all potential values of $$z$$ are equally probable before observing $$x$$, then $$\pof{z}$$ becomes constant and can also be dropped from our objective. This simplifies our preference to: - -$$\text{minimize } z \text{ in } -\log \pof{x \given z}.$$ - -Equivalently, we can maximize $$\pof{x \given z}$$, which is the *likelihood* of the observed data $$x$$ given the latent variable $$z$$. When making a decision and selecting a single value for $$z$$, this leads to the maximum likelihood estimation (MLE) approach. - -
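-
-As a tiny numeric illustration of this principle (our own example, not from the referenced book): for coin flips $$x$$ with a latent bias $$z$$, minimizing $$-\log \pof{x \given z}$$ over a grid of candidate biases recovers the familiar closed-form MLE, the empirical frequency of heads.
-
-```python
-import numpy as np
-
-x = np.array([1, 0, 1, 1, 0, 1])  # observed coin flips
-n_heads, n = x.sum(), len(x)
-
-zs = np.linspace(1e-3, 1 - 1e-3, 999)  # candidate biases z
-nll = -(n_heads * np.log(zs) + (n - n_heads) * np.log(1 - zs))  # -log p(x | z)
-z_mle = zs[np.argmin(nll)]
-
-print(z_mle, n_heads / n)  # both are ~0.667
-```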
- -
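-
-To make this concrete, here is a minimal sketch (our own illustration, not taken from any of the papers discussed) that scores a grid of candidate latent coin biases $$z$$ by the information content of some made-up coin-flip data; the shortest code length picks out the MLE:
-
-{% highlight python %}
-import numpy as np
-
-# Made-up observations (1 = heads) and a grid of candidate latent biases z.
-x = np.array([1, 1, 0, 1, 1, 1, 0, 1])
-z_grid = np.linspace(0.01, 0.99, 99)
-
-# -log2 p(x | z): the information content of the data under each z, in bits.
-nll_bits = np.array([
-    -np.sum(x * np.log2(z) + (1 - x) * np.log2(1 - z)) for z in z_grid
-])
-
-# The z with the shortest code length for the data is the MLE.
-z_mle = z_grid[np.argmin(nll_bits)]
-print(f"MLE bias: {z_mle:.2f}, code length: {nll_bits.min():.2f} bits")
-{% endhighlight %}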
-
-In summary, the connection between Occam's razor and MLE relies on the following assumptions:
-
-1. Shannon's information content is how we measure complexity.
-2. The prior distribution for the latent variables is uniform (or uninformative).
-3. Simpler explanations, as measured by the information content, are preferred (Occam's razor).
-
-Under these assumptions, the preference for simpler explanations leads to the MLE approach, where more likely values of the latent variable given the observed data are preferred.
-
-Maximizing the likelihood is common in machine learning because we can optimize the likelihood function directly. Even so, this is not easy for deep learning models: they have a large number of parameters, and the loss function is non-convex.
-
-### Maximum-a-Posteriori Estimation
-
-However, the assumption of a uniform or non-informative prior for the latent variables is not always valid or desirable. In many cases, we have prior knowledge about the latent variables that can be incorporated into the model. This leads to **Maximum-A-Posteriori (MAP) Estimation** as an alternative to MLE.
-
-In MAP estimation, $$\pof{z}$$ is not constant, so we cannot drop it---we can still drop $$\pof{x}$$, however---and maximize the joint distribution $$\pof{z, x}$$, or equivalently:
-
-$$\text{minimize } z \text{ in } -\log \pof{x, z}=-\log \pof{x \given z} - \log \pof{z}.$$
-
-Before we go further, we need to introduce notation for information-theoretic quantities and concepts that we will use throughout the post. (This next section is mostly shared with the sister post.)
-
-## Background: Information-Theoretic Notation and Concepts
-
-Information theory deals with the communication and quantification of information. (See the excellent "Visual Information Theory" by Chris Olah for a visual introduction to information theory.) In this post, we use a unified information-theoretic notation to express various quantities related to probability distributions and their relationships. (It largely follows "A Practical & Unified Notation for Information-Theoretic Quantities in ML".) Here are some key concepts we will use:
-
-The **information content** of an event $$x$$ is denoted as $$\Hof{x}$$ and is defined as $$-\log_2 \pof{x}$$, where $$\pof{x}$$ is the probability of event $$x$$ occurring. It represents the minimum amount of information needed to describe the occurrence of $$x$$ given an underlying probability distribution. $$\Hof{x \given y}$$ and $$\Hof{x, y}$$ are analogously defined and denote the conditional and joint information content of random variables $$X$$ and $$Y$$, respectively.
-In machine learning, the information content is often used as a minimization objective, represented as the negative log-likelihood or cross-entropy when averaged over a dataset (see below).
-
-The **entropy** $$\Hof{X}$$ of a random variable $$X$$ is the expectation of its information content:
-
-$$
-\Hof{X} \triangleq \E{\pof{x}}{\Hof{x}} = \E{\pof{x}}{-\log \pof{x}}.
-$$
-
-The entropy measures the average amount of information needed to describe the random variable $$X$$. It provides a measure of uncertainty or randomness associated with $$X$$. We can similarly define the entropy of a conditional distribution $$\Hof{X \given Y}$$ and the joint entropy $$\Hof{X, Y}$$.
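-
-As a quick numerical illustration (a sketch with toy distributions of our own choosing), here are entropies in bits for a few simple cases:
-
-{% highlight python %}
-import numpy as np
-
-# Entropy as the expected information content, -sum_x p(x) log2 p(x), in bits.
-def entropy(p: np.ndarray) -> float:
-    p = p[p > 0]  # use the convention 0 * log 0 = 0
-    return float(-np.sum(p * np.log2(p)))
-
-print(entropy(np.array([0.5, 0.5])))  # 1.0 bit: a fair coin
-print(entropy(np.array([0.9, 0.1])))  # ~0.47 bits: a biased, more predictable coin
-print(entropy(np.array([0.25] * 4)))  # 2.0 bits: uniform over four outcomes
-{% endhighlight %}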
-
-We will also use the **Kullback-Leibler divergence** $$\Kale{\pof{X}}{\qof{X}}$$ and the **cross-entropy** $$\CrossEntropy{\pof{X}}{\qof{X}}$$:
-
-$$
-\begin{aligned}
-\CrossEntropy{\pof{X}}{\qof{X}} & = \E{\pof{x}}{-\log \qof{x}}\\
-\Kale{\pof{X}}{\qof{X}} & = \CrossEntropy{\pof{X}}{\qof{X}} - \Hof{X}
-\end{aligned}
-$$
-
-The cross-entropy quantifies the average number of bits needed to encode samples drawn from the true distribution $$\pof{X}$$ using a different distribution $$\qof{X}$$. The Kullback-Leibler divergence measures the difference between two probability distributions and captures the additional bits needed when encoding samples from $$\pof{X}$$ with $$\qof{X}$$ compared to encoding them using the true distribution $$\pof{X}$$.
-
-
-### Expressing Occam's Razor in Information-Theoretic Terms
-
-Taking this notation into account, we can express Occam's razor as:
-
-$$\text{prefer small } z \text{ for } \Hof{z \given x},$$
-
-where $$Z$$ is the latent variable and $$X$$ is the observed data. Note that $$x$$ and $$z$$ are individual realizations of the random variables $$X$$ and $$Z$$, respectively.
-
-The MLE and MAP objectives are accordingly:
-
-$$\text{minimize } z \text{ in } \Hof{x \given z} \text{ for MLE and } \Hof{x, z} \text{ for MAP.}$$
-
-This measures the number of bits we need to encode the observed data given the latent variable for MLE and the number of bits to encode both the observed data and the latent variable for MAP. This relates Occam's razor to the minimum description length principle. (See the Wikipedia article on Minimum Description Length for more details.)
-
-## Hyperparameter Learning and Model Selection
-
-In many machine learning tasks, we need to determine the best hyperparameters for a model or select the most suitable model architecture from several discrete options. The primary goal is to find the hyperparameters or model that generalizes best to new, unseen data.
-
-Both cases can be viewed as inferring a random variable $$\H$$, which represents either the model choice as a categorical distribution or the hyperparameters as a continuous distribution. In this sense, $$\H$$ can be considered as another latent variable in the model.
-
-For consistency, we will continue using $$\x$$ to denote data points throughout this post. Although it is common to use $$\y$$ for predictions and $$\x$$ for side channel information, we will not require this distinction here and will stick to $$\x$$ for simplicity.
-
-The same arguments discussed previously also apply in this context, and we can express the objective as:
-
-$$\text{minimize } \h \text{ in } \Hof{\x \given \h}.$$
-
-### Model Parameters
-
-In addition to the hyperparameters $$\H$$, we usually have model parameters $$\W$$ for a given $$\h$$ with a parameter distribution $$\pof{\w \given \h}$$ that we need to infer based on observed data. These parameters are the learnable components of the model, such as the weights and biases in a neural network. For given $$\w$$ and $$\h$$, we can easily compute the likelihood $$\pof{\x \given \w, \h}$$, which represents the probability of observing the data $$\x$$ given the specific values of the parameters and hyperparameters. However, to make predictions or compute the marginal likelihood, we will need to consider the uncertainty in the parameter values by integrating over all possible $$\w$$.
-
-### Bayesian Model Averaging
-
-Bayesian Model Averaging (BMA) is a technique that integrates, or marginalizes, over the model parameters $$\W$$ when making predictions.
This accounts for the uncertainty in the model parameters, which is particularly useful when dealing with complex models, high-dimensional parameter spaces, and limited data. In contrast to the MLE or MAP estimate, which use a single parameter value $$\w$$ for predictions, BMA provides a more robust and comprehensive approach. The probability of a new data point $$\x'$$ under BMA is given by:
-
-$$\pof{\x' \given \x, \h} = \int \pof{\x' \given \x, \w, \h} \pof{\w \given \x, \h} \, \mathrm{d}\w,$$
-
-where $$\pof{\w \given \x, \h}$$ is the posterior distribution of the parameters given the data, and $$\pof{\x' \given \x, \w, \h}$$ is the likelihood of the new data point given the parameters, hyperparameters, and training data.
-
-
-While BMA offers benefits, it is computationally challenging, particularly when dealing with high-dimensional parameter spaces commonly encountered in deep learning models. To make BMA tractable, various approximation methods, such as Markov Chain Monte Carlo (MCMC) and Variational Inference, have been proposed.
-
-### Marginal Likelihood and Estimation
-
-Let's now discuss the marginal likelihood and its relation to BMA. The marginal likelihood, denoted as $$\pof{\x \given \h}$$, is the likelihood of the observed data given the hyperparameters, marginalized over all possible parameter values $$\W$$. It is also known as the **model evidence**. To compute the marginal likelihood, we integrate over all possible $$\w$$:
-
-$$\pof{\x \given \h} = \int \pof{\x \given \w, \h} \pof{\w \given \h} \, d\w,$$
-
-where $$\pof{\x \given \w, \h}$$ is the likelihood of the data given the parameters and hyperparameters, and $$\pof{\w \given \h}$$ is the prior distribution of the parameters given the hyperparameters.
-
-Comparing BMA to the marginal likelihood, we see that they match for individual data points. However, for multiple data points (i.e., conditioning on datasets), the marginal likelihood is more complex. "BMA" typically refers to making predictions for a single new data point, while the marginal likelihood can be considered for many points simultaneously. Apart from this difference, the two are equivalent. Let's discuss the case of multiple data points in more detail to understand why computing the marginal likelihood on datasets is even more challenging.
-
-
-## Datasets instead of Individual Data Points
-
-So far, we have described everything as if we only had a single data point $$x$$. However, in practice, we often have a dataset $$\xNtuple = (\x_1, \x_2, \ldots, \x_N)$$.
-
-### Joint Marginal Information and Cross-Entropy
-
-The easiest way to extend the previous definitions is to simply substitute $$\xNset$$ for $$\x$$ and assume we can compute a likelihood for the entire dataset using its joint predictive distribution:
-
-$$\pof{\xNtuple \given \h} = \int \pof{\x_1, \x_2, \ldots, \x_N \given \w, \h} \, \pof{\w \given \h} \, d\w.$$
-
-We can then maximize this likelihood or equivalently minimize the joint marginal information $$\Hof{\xNtuple \given \h}.$$
-
-
-If our model is exchangeable, meaning the order of the $$\x_n$$ does not matter, we can equivalently take an expectation over all permutations of the data to obtain the **joint marginal cross-entropy**:
-
-$$
-\CrossEntropy{\pdata{\X_1, ..., \X_N}}{\pof{\X_1, ..., \X_N \given \h}},
-$$
-
-where $$\pdata{\cdot}$$ is an empirical data distribution that allows us to draw samples *without replacement*. In this case, the joint marginal information and cross-entropy are equivalent.
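-
-To illustrate what computing a joint marginal likelihood involves, here is a minimal Monte Carlo sketch for a toy conjugate model (the model, values, and variable names are our own assumptions for illustration, not the experiments discussed later): we sample parameters from the prior and average the joint likelihood of the dataset.
-
-{% highlight python %}
-import numpy as np
-from scipy.special import logsumexp
-
-# Toy model: w ~ N(0, sigma_w^2) given h; x_n | w ~ N(w, sigma_n^2), i.i.d.
-rng = np.random.default_rng(0)
-sigma_w, sigma_n = 1.0, 0.5            # hyperparameters h (assumed values)
-x = rng.normal(0.3, sigma_n, size=20)  # an observed "dataset"
-
-K = 100_000
-w_samples = rng.normal(0.0, sigma_w, size=K)  # w_k ~ p(w | h), the prior
-# log p(x_1..N | w_k, h): sum of per-point Gaussian log-likelihoods.
-log_liks = np.array([
-    np.sum(-0.5 * ((x - w) / sigma_n) ** 2
-           - np.log(sigma_n * np.sqrt(2 * np.pi)))
-    for w in w_samples
-])
-# log p(x_1..N | h) ~= log (1/K) sum_k p(x_1..N | w_k, h).
-log_marginal = logsumexp(log_liks) - np.log(K)
-print(f"joint marginal information: {-log_marginal / np.log(2):.1f} bits")
-{% endhighlight %}
-
-This naive estimator already hints at a theme we return to later: for expressive, high-dimensional models, almost all prior samples explain the data poorly, so the estimate becomes hopelessly high-variance.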
-
-With exchangeability, we can simply write $$\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNset}$$ instead of using the tuple notation $$\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNtuple}$$ as the order of the data points does not matter.
-
-Conversely, if a model is not exchangeable, we can induce exchangeability by averaging over all permutations of the data points via ensembling. For example, deep learning models trained with stochastic gradient descent are generally *not* exchangeable, as the order and composition of the batches can impact the results. However, we can make them effectively exchangeable by training multiple models and averaging their predictions. In the limit of infinite models, the resulting ensemble will be exchangeable. (The ensemble might not necessarily perform better, though: papers on training curricula have shown that batch order can be important.)
-
-The joint marginal cross-entropy turns a potentially non-exchangeable joint information into an exchangeable one by taking an expectation.
-
-### Marginal Information and Cross-Entropy
-
-Before we try to understand these joint expressions, we should consider alternative ways to extend the previous definitions.
-
-For instance, we could take the average of the likelihoods for individual data points:
-
-$$ \frac{1}{N} \sum_{n=1}^N \pof{\x_n \given \h}. $$
-
-Assuming an underlying data distribution $$\pdata{\x}$$, we can also express this as an attempt to estimate:
-
-$$ \E{\pdata{\x}}{\pof{\x \given \h}} = \int \pof{\x \given \h} \, \pdata{\x} \, d\x. $$
-
-This provides an average score for the data likelihood.
-
-However, from the perspective of Occam's razor, simply taking the average likelihood is not the most principled approach. Instead, we can leverage information theory, which has been our tool of choice thus far. Recall that we prefer small values of the **marginal information** $$\Hof{\x \given \h}$$. By taking the expectation over the data distribution, we obtain the *individual* marginal cross-entropy:
-
-$$\CrossEntropy{\pdata{\X}}{\pof{\X \given \h}} = \E{\pdata{\x}}{-\log \pof{\x \given \h}}.$$
-
-This cross-entropy measures the average number of bits needed to encode the data using the model's probability distribution. As it does not involve a joint distribution, we refer to it simply as the **marginal cross-entropy**.
-
-It is evident that the marginal cross-entropy and the average likelihood are not equivalent. Using the convexity of the negative logarithm and Jensen's inequality, we see that the marginal cross-entropy is always larger than the negative logarithm of the average likelihood:
-
-$$
-\begin{aligned}
-\CrossEntropy{\pdata{\X}}{\pof{\X \given \h}} &= \E{\pdata{\x}}{-\log \pof{\x \given \h}} \\
-&\geq -\log \E{\pdata{\x}}{\pof{\x \given \h}} \\
-&\approx -\log \frac{1}{N} \sum_{n=1}^N \pof{\x_n \given \h}.
-\end{aligned}
-$$
-
-
-The negative log-likelihood (NLL) is frequently used to evaluate a model's performance *after* training, typically on a held-out *validation set*.
This is equivalent to computing the cross-entropy between the empirical distribution of the validation set and the model's predictive distribution, conditioned on the parameters learned from the training data:
-
-$$\CrossEntropy{\hpcof{\text{val}}{\X'}}{\pof{\X' \given \xNtuple, \h}}$$
-
-It is essential to distinguish this from the cross-entropy computed on the prior distribution of the model parameters before seeing any data, which is less useful for evaluating a trained model's performance:
-
-$$\CrossEntropy{\hpcof{\text{val}}{\X'}}{\pof{\X' \given \h}}$$
-
-Only the NLL on a validation set *conditioned on the training data* provides an estimate of the model's generalization ability after training. The same holds for the quantities marginalized over the model parameters.
-
-### Marginal Cross-Entropy vs Joint Cross-Entropy
-
-Occam's razor does not clearly specify which aggregate metric on $$\Hof{\x \given \h}$$ we should prefer. Instead of the mean, we could use the median or a different quantile of the information content as a summary statistic to assess the model's performance on the dataset. This might be more robust, as it is less sensitive to outliers.
-
-Crucially, the marginal cross-entropy and related summary statistics measure the model's performance using the "prior" parameter distribution, not the posterior conditioned on data. However, the joint distribution captures something else, which can be seen more clearly using the chain rule:
-
-$$\Hof{\xNset \given \h} = \sum_{n=1}^N \Hof{\x_n \given \x_1, \ldots, \x_{n-1}, \h}$$
-
-Each term is a **conditional marginal information**, conditioned on the previous data points. Similarly, when we take an expectation over the data distribution, we obtain a chain of **conditional marginal cross-entropies**:
-
-$$
-\begin{aligned}
-& \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNtuple} = \\
-&\quad = \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_1} + \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_2 \given \X_1} \\
-&\quad \quad + \ldots + \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_N \given \X_1, \X_2, \ldots, \X_{N-1}} \\
-&\quad = \sum_{n=1}^N \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_n \given \X_{n-1}, \ldots, \X_1}.
-\end{aligned}
-$$
-
-Each term in the sum is a conditional marginal cross-entropy, conditioned on the previous data points; apart from the first term, which is exactly the (unconditional) marginal cross-entropy, these differ from the marginal cross-entropy.
-
-The following visualization summarizes the relationship between the conditional and joint marginal cross-entropies and information. The chain rule tells us that the area under the curve of the conditional quantities equals the joint quantity.
-
- - - - - - -
- *The relationship between conditional and joint marginal cross-entropies and information.* - **Left**: Conditional marginal cross-entropy (blue) for a multi-class classification problem. The area under the curve (orange) represents the joint marginal cross-entropy. As the dataset size increases, the conditional marginal cross-entropy decreases and converges to the best achievable loss for the given model hypothesis $$\h$$. - **Right**: Conditional marginal information (green). The area under the curve (red) represents the joint marginal information. The conditional marginal information is a noisy estimate of the conditional marginal cross-entropy, as it is computed on individual data points. -
-
-
-In summary, the marginal and joint cross-entropies offer different perspectives on a model's performance:
-
-- The marginal cross-entropy and related summary statistics assess the model's performance using the prior parameter distribution, without considering the effect of the data on the model.
-- The joint marginal cross-entropy, expressed as a sum of conditional marginal cross-entropies, captures the model's online learning performance as it processes the data sequentially.
-
-(Recent works by Ian Osband et al., starting with "The Neural Testbed: Evaluating Joint Predictions", can help build intuitions for joint predictions. A gentler introduction comparing marginal and joint predictions can be found in the arXiv note "Marginal and Joint Cross-Entropies & Predictives for Online Bayesian Inference, Active Learning, and Active Sampling".)
-
-While both metrics are useful for evaluating models, the joint marginal cross-entropy provides insight into how well the model learns from the data during training. The conditional marginal cross-entropy, on the other hand, is more suitable for assessing the model's generalization ability at a given point in time, without the influence of parameter updates.
-
-## Intermediate Comparison
-
-This brings us back to the earlier question of what metric we should prefer and use for model selection. Let's consider:
-
-1. The marginal cross-entropy, as in the first term, is likely not useful for model selection with deep learning models, as it is not conditioned on any data and thus cannot correlate well with the model's performance after training.
-
-2. If we care about the model's "generalization" performance *after training on $$N-1$$ data points* without further adaptation, the marginal cross-entropy on the last data point is the more relevant quantity:
-
-   $$\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_N \given \X_{N-1}, \ldots, \X_1}$$
-
-   It measures the model's performance on the last data point after having seen all previous data points, similar to a "leave-one-out" metric. Indeed, it is equivalent to [leave-one-out cross-validation](https://www.wikiwand.com/en/Cross-validation_(statistics)#Leave-one-out_cross-validation) when we have an empirical data distribution consisting of $$N$$ data points and sample without replacement.
-
-3. More generally, it is *equivalent* to cross-validation when we hold out more than one data point for evaluation from the empirical data distribution:
-
-   $$\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X' \given \X_{N-k}, ..., \X_{1}}.$$
-
-   This is the same expression as in **(2.)** but we assume there are more samples to draw from in the empirical data distribution $$\pdata{\x'}$$. We call this term the conditional marginal cross-entropy and keep in mind its connection to cross-validation.
-
-4. On the other hand, if we care about the model's performance as an online learner, or in the case of LLMs, as an in-context learner, the joint marginal cross-entropy becomes a more relevant metric. It measures the model's ability to adapt and make accurate predictions as it sequentially processes new data points, conditioned on the information it has seen so far.
-
-   In the context of online learning, the model receives data points one at a time and updates its predictions based on the cumulative knowledge gained from previous data points. The joint marginal cross-entropy captures how well the model incorporates this sequential information to make accurate predictions for future data points.
-
-   Similarly, for in-context learning of LLMs, the model is provided with a prompt or context consisting of a sequence of data points, and it is expected to generate accurate completions or predictions based on this context. The joint marginal cross-entropy measures the model's ability to effectively utilize the provided context to make accurate predictions for the next data point in the sequence.
-
-5. However, we would not want to use the unconditional joint marginal cross-entropy, but rather condition on some initial data to be closer to the actual use case of the model, which will have been (pre-)trained already. As such, we are interested in estimating a **conditional joint marginal cross-entropy**:
-
-   $$\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \XNkset}. $$
-
-   By conditioning on the previously seen data points, this metric assesses the model's capacity to learn and adapt its predictions based on the evolving context. It provides a more fine-grained evaluation of the model's sequential prediction performance, taking into account the specific order and dependencies within the data.
-
-   Moreover, the conditional joint marginal cross-entropy can be used to compare different models or hyperparameter settings in terms of their online learning or in-context learning capabilities. By evaluating this metric on held-out data sequences, we can determine which model or setting is better suited for tasks that require sequential adaptation and context-dependent predictions.
-
-6. If we have a preferred order of the data points (or a split in the case of exchangeability), we can also consider the **conditional joint marginal information**:
-
-   $$\Hof{\xNsetk \given \xNkset, \h}.$$
-
-   Up to its sign, this quantity is also known as the **conditional log marginal likelihood**.
-
-7. All these quantities are equally valid from the perspective of Occam's razor.
-
-8. We have not yet discussed how to efficiently estimate these quantities, especially for deep learning models. More importantly, we have already seen that the joint marginal information (marginal likelihood), BMA, and the joint marginal cross-entropy (as an expectation over the marginal likelihood) are not easy to estimate.
-
-This brings us to one of the main points:
-
-
-This crucial point has not been sufficiently considered in the previous literature on model selection and hyperparameter learning, where the model evidence and marginal likelihood have been presented as the ultimate criteria. In practice, we rarely update a model on additional data during inference—this is changing with the advent of LLMs and strong in-context learners, but it is still not the norm.
-
-
-But why, then, has the marginal likelihood been the preferred choice for model selection so far?
-
-## Different Data Regimes
-
-To explore when the conditional marginal cross-entropy and joint marginal cross-entropy lead to different outcomes for model selection and hypothesis testing, let's consider a few key scenarios.
-
-For the discrete case, we can reduce the question to one about ranking: if we have two possible hyperparameter choices $$\h_1$$ and $$\h_2$$, when do we get the same ranking $$\h_1 \succ \h_2$$ for both metrics?
-
-### Model Misspecification
-
-First, let's examine the case when we have a large amount of data available. Here, model misspecification, a common concern, plays a crucial role.
-
-As renowned statistician George Box famously stated:
-
- All models are wrong, but some are useful. -
- --- George Box, Science and Statistics (1976) -
-
-
-When working with real-world data, we must always assume that our models are misspecified to some degree. Models simplify complex systems and cannot capture every nuance of the data-generating process. Consequently, the goal of model selection is not to find the "true" model but rather to identify the most useful model that balances simplicity, interpretability, and predictive performance.
-
-
-Without model misspecification, we would always converge to the maximum likelihood estimate (MLE) that matches the data-generating model in the infinite data limit, as the [Bernstein-von Mises' theorem](https://www.wikiwand.com/en/Bernstein%E2%80%93von_Mises_theorem) tells us that posteriors converge to the MLE in the limit. However, in practice, we are always dealing with misspecified models, and the MLE will not converge to the true data-generating model.
-
-
-### Infinite Data Limit
-
-Let's return to our question of when the different quantities lead to similar rankings.
-
-A conditional joint marginal cross-entropy, as a sum of conditional marginal cross-entropies, is obviously larger than each of its individual terms. However, if we divide it by the number of samples in the conditional joint distribution, we obtain its **rate**, the per-sample average, which can be related more easily. (In this context, "rate" refers to the average amount of cross-entropy or information per (training) sample, drawing parallels to the concept of entropy rate in Shannon's information theory. This usage is distinct from other common uses of "rate" in machine learning, such as learning rate or convergence rate.)
-
-$$
-\begin{aligned}
-& \frac{1}{k} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \XNkset} \\
-&\quad = \sum_{n=N-k+1}^N \frac{1}{k} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_n \given \X_{n-1}, ..., \X_1}.
-\end{aligned}
-$$
-
-Bernstein-von Mises' theorem tells us that the posterior distribution of the model parameters converges to a normal distribution around the MLE as the number of data points goes to infinity. (There are likely fewer caveats to this statement than the naive interpretation of the theorem implies because we are usually not interested in converging towards some unique and identifiable parameters but rather in the predictions matching the data-generating process.) This means that the later terms in the chain rule decomposition of the joint cross-entropy will converge to the same value in the infinite sample limit as the data we condition on becomes infinite. If we take the limit, we can ignore the first terms in the chain rule decomposition of the joint cross-entropy, and we will get the same average value for the terms of the joint cross-entropy (one per sample in the joint) and the conditional cross-entropy. This matches a similar result on entropy rates in "Elements of Information Theory" by Cover & Thomas.
-
-Overall, we have (without formal proof):
-
-$$
-\begin{aligned}
-&\lim_{N \to \infty} \frac{1}{N} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNset} = \\
-&\quad = \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^N \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_n \given \X_{n-1}, ..., \X_1} \\
-&\quad = \lim_{N \to \infty} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X' \given \XNset}.
-\end{aligned}
-$$
-
-Given sufficient data (in the infinite sample limit), we see that either of these quantities will lead to the same ranking of different hyperparameters/model hypotheses. Conversely, we can expect to see meaningful differences only in low-data regimes, where the model is not yet fully adapted to the data.
-
-Finally, in the infinite data limit, for the conditional marginal cross-entropy, we don't need to take an expectation over the data we condition on (as the model parameters will still have converged):
-
-$$
-\begin{aligned}
-&\lim_{N \to \infty} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \XNkset} \\
-&\quad = \lim_{N \to \infty} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \xNset},
-\end{aligned}
-$$
-
-for any $$\xNset \sim \pdata{\xNset}$$ as $$N \to \infty$$. More importantly, this also holds for the joint marginal information, whose rate in the limit is the same as the rate of the joint marginal cross-entropy above (and thus also the joint cross-entropy):
-
-$$
-\begin{aligned}
-&\lim_{N \to \infty} \frac{1}{N} \Hof{\xNset \given \h} = \\
-&\quad = \lim_{N \to \infty} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X' \given \XNset}.
-\end{aligned}
-$$
-
-We have previously mentioned the connection between cross-validation, leave-one-out validation, and the conditional marginal cross-entropy. This result also connects the marginal likelihood in the limit to these quantities.
-
-Thus:
-
-
-The catch is that "sufficient data" might be a very large amount of data, especially for highly expressive models like neural networks.
-
-Hence, we only expect these quantities to be meaningfully different in the low-data regime, so let's focus on that regime now.
-
-### Prior-Data Conflict
-
-Even if different hyperparameter choices lead to the same generalization loss in the infinite data limit, they can induce different priors that affect the convergence speed and model performance in the low-data regime.
-
-
-In the low-data regime, assuming all models converge to the same validation loss given infinite data, we prefer the model that converges the fastest, i.e., with the least amount of training data. A model with a prior well-aligned with the data distribution learns efficiently and generalizes better with limited data.
-
-
-*Conditional marginal cross-entropy vs. dataset size under different modeling scenarios.* -**Left: Model misspecification** - Three model hypotheses ($$\h_1$$, $$\h_2$$, $$\h_3$$) converge to different losses due to the model class not containing the true data-generating process. The minimum achievable loss represents the misspecification error. -**Right: Prior-data conflict** - Three model priors ($$\h_1$$, $$\h_2$$, $$\h_3$$) converge to the same loss but at different speeds due to varying alignment with the data distribution. Priors with more mass near the MLE converge faster. -*Real-world models often face both prior-data conflict and model misspecification.* -
-
- -In this scenario, the area under the conditional marginal cross-entropy or information curve (equivalent to the joint marginal cross-entropy, or joint marginal information) indicates the preferred model. The model with the lowest joint marginal information (highest log marginal likelihood) fits the available data best while having a prior enabling efficient learning and generalization. - -### Anti-Correlated Model Misspecification and Prior-Data Conflict - -Finally, what happens when there are both model misspecification and a prior-data conflict in the low-data regime? If both are correlated, the ranking will be preserved, but if they are anti-correlated, the ranking might change. - -Let's visualize this: the curves will intersect at some point, and the model with the best achievable loss in the infinite data limit might not be the best choice in the low-data regime, depending on how much data we can train on. The optimal model choice may also change based on the amount of available data. - -
-
-*The conditional marginal cross-entropy is plotted for three different model hypotheses ($$\h_0$$, $$\h_1$$, $$\h_2$$) as a function of dataset size. The models exhibit both prior-data conflict and model misspecification.* -In the small data regime, $$\h_2$$ has the lowest loss due to its prior aligning well with the data distribution, allowing for faster initial learning. However, as more data becomes available, the models' asymptotic performance quickly plateaus. First, $$\h_1$$ takes over, and then finally $$\h_0$$, which converges to the lowest achievable loss in the infinite data limit, indicating it suffers the least from model misspecification. In contrast, $$\h_1$$ and $$\h_2$$ converge to higher loss values due to greater misspecification. -Notably, the models' performance ranking changes multiple times as the dataset grows, with $$\h_2$$ being initially favored but ultimately having the worst infinite-data loss. Each model ranks best for the conditional joint marginal cross-entropy for some chosen range. -*This illustrates how the interplay between prior-data conflict and model misspecification can lead to different model selection decisions depending on the amount of available data and the metric used to measure performance.* -
-
-
-Here, the joint marginal cross-entropy and the joint marginal information (log marginal likelihood) might not lead to the same decision, because the area under the curve accumulated at the start might be larger than what the eventually-best model can save later. This could change the ranking of the models compared to the conditional marginal cross-entropy (leave-one-out cross-validation) at the end of training, which serves as a proxy for the model's generalization performance.
-
-Instead, the conditional joint marginal cross-entropy and information can shine here by conditioning "away" the beginning of the curve, thus giving us a better estimate of the conditional marginal cross-entropy (or expected information) at the point of interest.
-
-To formalize this, we can use the chain rule to split the joint marginal cross-entropy into two terms:
-
-$$
-\begin{aligned}
-&\underbrace{\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNset}}_{\text{Joint Marginal Cross-Entropy}} = \\
-&\quad = \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNkset} \\
-&\quad \quad + \underbrace{\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \XNkset}}_{\text{Conditional Joint Marginal Cross-Entropy}}.
-\end{aligned}
-$$
-
-Note that the per-sample averages of both terms converge to the same value in the infinite data limit—the conditional marginal cross-entropy (cross-validation loss), as discussed previously. However, the second term will converge faster because it does not include the constant $$\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNkset}$$.
-
-We can also see both terms as approximating the conditional marginal cross-entropy (cross-validation loss) for a fixed $$N$$ in the low-data regime. The per-sample average of the second term will provide a better approximation.
-
-In summary, the consistency of the ranking will depend on the size of $$\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNkset}$$ for different $$\h$$ and how it compares to the conditional joint marginal cross-entropy $$\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \XNkset}$$.
-
-This analysis highlights the importance of considering both prior-data conflict and model misspecification when selecting models in the low-data regime. The choice of performance metric and the amount of available data can significantly impact the ranking of models. The conditional joint marginal cross-entropy provides a more accurate estimate of the model's generalization performance by conditioning away the initial part of the learning curve, which may be heavily influenced by prior-data conflict.
-
-## Approximating the Validation Loss
-
-You may be wondering: why bother with the marginal likelihood or conditional joint marginal cross-entropy at all? Why not just always use leave-one-out cross-validation (i.e., the conditional marginal cross-entropy) or a simple validation loss?
-
-While that is a valid approach, the key question is: can we approximate the validation loss earlier in training, without fully training the model? Or can we do this more efficiently than performing inference on each element of a validation set?
-
-One option is to extrapolate the training loss to predict the validation loss. While potentially underexplored in this context, scaling laws have been found effective for predicting model performance.
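-
-As a sketch of this idea (with made-up loss values and one commonly used functional form; nothing here is taken from the discussed papers), one could fit a saturating power law to early points of the learning curve and extrapolate:
-
-{% highlight python %}
-import numpy as np
-from scipy.optimize import curve_fit
-
-# Saturating power law: c is the asymptotic (irreducible) loss.
-def power_law(n, a, b, c):
-    return c + a * n ** (-b)
-
-n_seen = np.array([100, 200, 400, 800, 1600])    # dataset sizes so far
-loss = np.array([2.31, 1.92, 1.63, 1.44, 1.30])  # made-up validation losses
-
-(a, b, c), _ = curve_fit(power_law, n_seen, loss, p0=(10.0, 0.5, 1.0))
-print(f"predicted loss at n=25600: {power_law(25600, a, b, c):.2f}")
-print(f"estimated asymptote: {c:.2f}")
-{% endhighlight %}
-
-In the language of the previous sections, the fitted asymptote plays the role of the floor set by model misspecification, while the speed of the decay reflects how quickly the model adapts to the data.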
- -Alternatively, when training a model on a dataset for a single epoch—which is still surprisingly common for large language models, especially without active data sampling—the average training loss per batch provides a good approximation of the validation loss. With a cross-entropy loss, this is equivalent to estimating the conditional marginal cross-entropy. - -However, the batch size may not be large enough for a precise estimate. Averaging over the last few batches or using an exponential moving average can help, as the training losses on earlier batches were computed with older model parameters. Compared to using only the last batch's loss, this smooths the estimate and reduces sensitivity to outliers. - -In the multi-epoch setting, revisiting data points multiple times prevents using the training loss as a validation loss estimate. Here, cross-validation offers a solution: train on the held-out data in the last epoch, compute the validation loss via the training losses, and obtain an ensemble of fully trained models without wasting data. - -In summary, while the validation loss is the gold standard, approximations based on the training loss or cross-validation can provide efficient estimates, especially in the early stages of training or with limited data. - -## The Big Comparison - -In this post, we have explored various metrics for model selection and hyperparameter learning in the Bayesian context, focusing on the marginal likelihood, joint marginal cross-entropy, and conditional marginal cross-entropy. Our discussion has led to several key insights: - -1. **Infinite Data Limit**: As the dataset size approaches infinity, the rate of the log marginal likelihood (or equivalently, the joint marginal information), the joint marginal cross-entropy, and the conditional marginal cross-entropy converge to the same value when averaged over the data distribution. Given sufficient data, all these metrics will produce the same ranking of different model hypotheses or hyperparameter choices. - -2. **Connection to Cross-Validation**: The conditional marginal cross-entropy is equivalent to the expected cross-validation loss. Cross-validation is the gold standard for model selection in machine learning practice, where a model's generalization performance is estimated by evaluating it on held-out validation data after training on the remaining data. - -3. **Sufficient Data Requirement**: The amount of data needed for the convergence of these metrics in the infinite data limit may be impractically large, especially for highly expressive models like deep neural networks. Therefore, the convergence property may not be directly relevant in many real-world scenarios. - -4. **Low-Data Regimes**: When data is limited, the metrics can differ significantly. The conditional marginal cross-entropy (or cross-validation loss) is often the more reliable choice for model selection targeting generalization performance, as it directly measures the model's ability to predict unseen data after being trained on the available data. - -5. **Sequential Prediction and Compression**: The joint marginal cross-entropy, which corresponds to the negative log marginal likelihood, may be preferable if the focus is on a model's overall sequential prediction performance or compression ability on the training data itself. It measures how well the model fits the entire training dataset jointly, without splitting into train and validation sets. 
-
-   Moreover, the conditional joint marginal information and cross-entropy are particularly relevant for measuring the performance of online learners and the in-context learning abilities of large language models (LLMs). These metrics capture the model's ability to adapt and make accurate predictions based on the sequential information and evolving context after training on available data.
-
-6. **Model Misspecification and Prior-Data Conflict**: In practice, models often face a combination of model misspecification (where the true data-generating process is not contained within the model class) and prior-data conflict (where the prior distribution does not align well with the data distribution). The interplay between these factors can lead to different rankings of models depending on the amount of available data and the specific metric used for evaluation.
-
-While the marginal likelihood has been a popular tool for model selection and hyperparameter learning in the Bayesian community, its suitability depends on the specific context and goals. The conditional marginal cross-entropy, closely related to cross-validation, is often a more reliable choice when the primary objective is to optimize generalization performance. However, the conditional joint marginal cross-entropy (or conditional log marginal likelihood) may be preferable when the focus is on sequential prediction after training or measuring in-context learning abilities.
-
-Now, after having thought about all this in detail and mostly from first principles, let's discuss the literature and how it supports or augments these considerations.
-
-## Literature Review
-
-Having discussed the key concepts, we will now look at several influential papers that have shaped the previous discussion on model selection and hyperparameter tuning in the Bayesian context or have provided valuable insights into the marginal likelihood and its connections to other metrics.
-
-### Fong and Holmes (2020): "On the marginal likelihood and cross-validation"
-
-Fong and Holmes (2020) explore the connection between the log marginal likelihood (joint marginal information) and cumulative leave-p-out cross-validation. Under exchangeability, they show that the joint marginal information can be rewritten as a cumulative sum of leave-p-out cross-validation terms.
-
-The authors define the *leave-p-out cross-validation score* as:
-
-$$S_{CV}(\xNset;p) = \frac{1}{\binom{N}{p}} \sum_{V \in \binom{[N]}{p}} \frac{1}{p} \sum_{i=1}^p \Hof{\x^{V}_i \given \{\x^{\bar{V}}_k\}_{k=1}^{N-p}}$$
-
-where $$\binom{[N]}{p}$$ denotes the set of all $$p$$-length subsets of $$\{1,...,N\}$$---the indices of the validation set---$$\x^V_i$$ is the $$i$$-th validation data point, and $$\x^{\bar{V}}_k$$ is the $$k$$-th training data point. This score measures the model's performance using $$p$$ validation points given the remaining data for training, equivalent to the respective conditional marginal cross-entropy.
-
-The *cumulative leave-P-out cross-validation score* is defined as:
-
-$$S_{CCV}(\xNset; P) = \sum_{p=1}^P S_{CV}(\xNset; p)$$
-
-This score focuses equally on the last $$P$$ stages of the learning curve and is the same as the conditional joint marginal cross-entropy. For $$P=N$$, the cumulative leave-N-out cross-validation score equals the joint marginal information:
-
-$$S_{CCV}(\xNset; N) = \Hof{\xNset}$$
-
-Comparing $$P < N$$ with $$P = N$$ thus shows how these scores interpolate between the conditional joint marginal cross-entropy, which focuses only on the later stages of the learning curve, and the full joint marginal information (log marginal likelihood), which covers all of it.
-
-### Lyle et al. (2020): "A Bayesian Perspective on Training Speed and Model Selection"
-
-Lyle et al. (2020) establish a connection between training speed and the marginal likelihood in linear models.
They propose using the sum of mini-batch training losses as a proxy for the log marginal likelihood to predict the generalization behavior of deep neural networks. This sum, referred to in later works as the *training speed estimator* (TSE), corresponds to the area under the learning curve. For 1-sample batches, the TSE is defined as:
-
-$$\text{TSE}(\xNset) = \sum_{n=1}^N \Hof{\x_n \given \w_n},$$
-
-where $$\Hof{\x_n \given \w_n}$$ is the cross-entropy loss at training step $$n$$ with model parameters $$\w_n$$. Thus, an MLE estimate is used instead of explicitly conditioning on the data points $$\x_{<n}$$.
-
-Ru et al. (2021) focus on using TSE for model selection in neural architecture search in "Speedy Performance Estimation for Neural Architecture Search". They propose two variants of TSE: *TSE-E*, which focuses on the last few epochs, and *TSE-EMA*, which uses an exponential moving average to assign higher weights to later epochs:
-
-$$
-\begin{aligned}
-\text{TSE-E}(\xNset) &= \sum_{n=N-E+1}^N \Hof{\x_n \given \w_n}, \\
-\text{TSE-EMA}(\xNset) &= \sum_{n=1}^N \alpha^{N-n} \Hof{\x_n \given \w_n},
-\end{aligned}
-$$
-
-where $$\alpha \in (0, 1)$$ is a hyperparameter controlling the decay rate.
-
-The authors hypothesize that assigning higher weights to later epochs may lead to better correlation with the true generalization performance of the final trained network, as the early epochs may be unstable and less informative.
-
-They demonstrate empirically that TSE-E and TSE-EMA can reliably estimate the generalization performance of neural architectures with a small training budget and remain effective for a large range of training epochs. TSE outperforms other efficient estimators, such as early stopping and learning curve extrapolation, in terms of rank correlation with the true test performance.
-
-The TSE estimators proposed by Ru et al. (2021) align closely with the ideas discussed in this blog post, as they prioritize the model's performance in the later stages of learning. The empirical results presented by Ru et al. (2021) and Lyle et al. (2020) provide supporting evidence for the importance of going beyond the marginal likelihood.
-
-### Lotfi et al. (2022/2023): "Bayesian Model Selection, the Marginal Likelihood, and Generalization"
-
-Lotfi et al. (2022/2023) provide a comprehensive re-evaluation of the marginal likelihood as a metric for predicting the generalization performance of trained models and learning hyperparameters. They argue that while the marginal likelihood is well-suited for prior hypothesis testing, it is only peripherally related to generalization after training. The authors identify several practical and philosophical issues in using the marginal likelihood for selecting between trained models, such as its sensitivity to the choice of prior, potential to lead to both underfitting and overfitting, and negative correlation with generalization performance in some cases.
-
-To address these limitations, Lotfi et al. propose the conditional marginal likelihood (CLML) as a partial remedy. The CLML is computed by conditioning on a subset of the training data, which helps to mitigate the influence of the prior and focus on the model's performance under this posterior. It is also less sensitive to the number of parameters in the model.
-The authors demonstrate that the CLML is better correlated with generalization than the marginal likelihood and provides promising performance for deep kernel hyperparameter learning and neural architecture search.
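-
-To make the CLML concrete, here is a minimal sketch for the toy conjugate model from earlier (our own illustration; for DNNs, Lotfi et al. instead rely on a Laplace approximation, discussed below): we condition on the first 80% of the data via the closed-form posterior and score the joint likelihood of the held-back 20%.
-
-{% highlight python %}
-import numpy as np
-from scipy.special import logsumexp
-
-# Same toy model as before: w ~ N(0, sigma_w^2), x_n | w ~ N(w, sigma_n^2).
-rng = np.random.default_rng(0)
-sigma_w, sigma_n = 1.0, 0.5
-x = rng.normal(0.3, sigma_n, size=50)
-x_cond, x_eval = x[:40], x[40:]  # condition on 80%, evaluate on 20%
-
-# Closed-form Gaussian posterior p(w | x_cond, h) for this conjugate model.
-post_prec = 1 / sigma_w**2 + len(x_cond) / sigma_n**2
-post_mean = (x_cond.sum() / sigma_n**2) / post_prec
-
-K = 10_000
-w_samples = rng.normal(post_mean, post_prec**-0.5, size=K)
-log_liks = np.array([
-    np.sum(-0.5 * ((x_eval - w) / sigma_n) ** 2
-           - np.log(sigma_n * np.sqrt(2 * np.pi)))
-    for w in w_samples
-])
-# log p(x_eval | x_cond, h) ~= log (1/K) sum_k p(x_eval | w_k, h).
-clml = logsumexp(log_liks) - np.log(K)
-print(f"conditional joint marginal information: {-clml / np.log(2):.1f} bits")
-{% endhighlight %}
-
-Because the posterior concentrates around parameter values that already explain the conditioning data, this estimate is far less noisy than sampling from the prior, which is exactly the practical appeal of the CLML.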
-
-The CLML shares significant similarities with the cumulative leave-p-out cross-validation score proposed by Fong and Holmes (2020). Both approaches essentially propose the same metric, which focuses on the model's performance in the later stages of learning and provides a more reliable indication of generalization compared to the full marginal likelihood.
-Lotfi et al. also critically compare their work to that of Lyle et al. (2020), but do not discuss the work of Ru et al. (2021).
-
-Lotfi et al. conduct an extensive empirical evaluation of the CLML across various settings, comparing it to the marginal likelihood and other baselines under different conditions, such as varying dataset sizes, model complexities, and hyperparameter settings. They demonstrate that the CLML consistently outperforms the marginal likelihood in terms of selecting the hyperparameters that lead to better generalization performance. The authors also acknowledge some limitations of their work, such as the need for further theoretical analysis of the CLML's properties and the potential challenges in estimating the CLML for more complex models.
-
-The key novelty of Lotfi et al.'s work lies in their comprehensive analysis of the limitations of the marginal likelihood for model selection and hyperparameter learning, as well as their proposal of the CLML as a practical alternative that addresses these limitations.
-
-## A Simple Toy Experiment
-
-To illustrate the concepts discussed in this post, we conduct a simple toy experiment using a Bayesian linear regression model. The goal is to demonstrate how the various information metrics behave under different prior settings and dataset sizes, and to show that none of the metrics are universally reliable. In particular, the joint marginal information may not be the best choice when the primary concern is static performance after training on data.
-
-### Experimental Setup
-
-We generate a synthetic dataset with 64 features and 500 training and validation samples each. The true coefficients are drawn from a normal distribution with a mean of 2, and the target is the dot product between the features and the true coefficients.
-
-For the model, we use a Bayesian linear regression with an isotropic Gaussian prior on the weights (hyperparameter $$\wstddev$$) and independent Gaussian noise (hyperparameter $$\noisestddev$$). The model is misspecified when $$\noisestddev > 0$$. We consider three different prior settings:
-
-- Model 1 ($$\h_1$$): $$\wstddev=0.1$$, $$\noisestddev=0.8$$
-- Model 2 ($$\h_2$$): $$\wstddev=100$$, $$\noisestddev=1.0$$
-- Model 3 ($$\h_3$$): $$\wstddev=1$$, $$\noisestddev=1.2$$
-
-Thus, all three models are misspecified to varying degrees and exhibit different levels of prior-data conflict.
-
-We train the model on subsets of the training data of varying sizes, ranging from 1 to the full training set size, performing 5 trials with different splits. For each subset size, we compute the following metrics:
-
-- Joint Marginal Information (JMI)
-- Conditional Joint Marginal Information (CJMI) with half the data used for conditioning
-- Marginal Cross-Entropy (MCE) on the training set
-- Marginal Cross-Entropy (MCE) on the validation set
-- Training Speed (Approximate)
-- Joint Marginal Information Rate (JMI Rate)
-
-The JMI is equivalent to the negative log marginal likelihood, the CJMI to the negative conditional log marginal likelihood (CLML), and the MCE corresponds to the cross-entropy loss.
The (approximate) Training Speed emulates an iterative training algorithm by following the full-data gradient. The JMI Rate is the JMI divided by the dataset size, which converges to the MCE in the infinite data limit.
-
-### Results
-
-The results of the experiment are summarized in the following plots:
-
-{% comment %}
-include figure.html path="assets/img/2024-05-07-clml/binary_regression_information_metrics.png"
-class="l-screen-inset img-fluid rounded z-depth-1"
-{% endcomment %}
-
-
-*Information metrics for the three Bayesian linear regression models as a function of dataset size.* The joint marginal information does not indicate the best performing model. The conditional joint marginal information (conditioned on half the dataset size, predicting on the other half) only finds the best model after 4/5 of the data are observed. *Metrics are reported in bits (log base 2), five trials each.* -
-
- -The plots show the behavior of the information metrics as the dataset size increases for the three different prior settings. Some key observations: - -- The marginal cross-entropy (MCE) metrics decrease as the dataset size increases, indicating improved model performance. -- The joint marginal information (JMI) increases with more data, as it is equivalent to the area under the curve of the MCE on the training set. (As we take the average over multiple trials, its mean is actually an estimate of the joint marginal cross-entropy.) -- The JMI rate, which is the JMI divided by the dataset size, decreases very slowly towards the same value as the MCE. This agrees with the previous discussion on the infinite data limit. -- The training losses also decrease, while their sum, equal to the training speed estimator (TSE), increases with the dataset size. -- The conditional joint marginal information (CJMI) with half the data used for conditioning shows a similar trend to the JMI but with lower values, as it focuses on the model's performance on the held-back data. As we take an average over multiple trials, it is actually an estimate of the conditional joint marginal cross-entropy. - -To further analyze the model selection behavior, we computed the CJMI for different conditioning set sizes and selected the model with the lowest CJMI for each combination of dataset size and conditioning set size. The results are visualized in the following plot: - -{% comment %} -include figure.html path="assets/img/2024-05-07-clml/binary_regression_conditional_joint_marginal_information_decision_boundary.png" -class="l-screen-inset img-fluid rounded z-depth-1" -{% endcomment %} - -
-
-*Decision boundary for the best model amongst three ($$\phi_1$$, $$\phi_2$$, $$\phi_3$$) with the lowest conditional joint marginal cross-entropy/information, as a function of dataset size and held-back size.* The three models $$\phi_1$$, $$\phi_2$$, and $$\phi_3$$ correspond to different prior variances and noise levels. The white diagonal line shows where the conditional joint marginal information is computed using half the dataset size. In the region below this line, $$\phi_1$$ (blue) has the lowest conditional joint marginal information, while $$\phi_2$$ (orange) and $$\phi_3$$ (green) are preferred for different dataset and held-back sizes. -
-
- -The plot shows which model is selected based on the lowest CJMI for different dataset sizes (x-axis) and conditioning set sizes (y-axis). The white line represents the case where half the data is used for conditioning (CJMI half in the previous plot). We observe that the model selection decision changes depending on the amount of available data and the size of the conditioning set/held-back data. - -## A Narrow but Deep Dive into "Bayesian Model Selection, the Marginal Likelihood, and Generalization" - -Now that we have introduced the necessary concepts and discussed the literature, let's take a closer look at the paper by Lotfi et al. (2022/2023). - -### Use Cases and Pitfalls of the LML - -Lotfi et al. (2022/2023) present both the case for the log marginal likelihood (LML) as well as potential pitfalls when using it. They highlight the following use cases for the LML---*quoted and paraphrased from the paper*: - -1. **Hypothesis testing:** The LML provides an elegant mechanism to select between fixed prior hypotheses, even if each hypothesis is entirely consistent with observations. It automatically favors the most constrained hypothesis that fits the data, encoding a notion of Occam's razor. The paper gives the example of the LML favoring general relativity over alternative explanations for Mercury's orbit. - -2. **Hyperparameter learning:** The LML is often successfully used in practice to learn hyperparameters of the prior, finding the hyperparameters $$\h$$ that maximize $$\pof{\mathcal{D} \given \h}$$, where $$\mathcal{D}$$ is a dataset. The paper highlights Gaussian processes as a compelling example, where the LML chooses kernel hyperparameters that make the distribution over functions likely to generate the training data, rather than simply maximizing data fit. The LML can learn many kernel parameters and be used where cross-validation would be intractable. - -3. **Constraint learning:** Unlike typical learning objectives like maximum likelihood, the LML is incentivized to select for constraints. It provides a consistent estimator for constraints, automatically selecting the most constrained solution that fits the data and collapsing to the true constraint value as the number of observations grows. Examples include the LML consistently estimating the true dimensionality in Bayesian PCA and automatically learning symmetries like rotation invariance. - -However, the paper argues that the LML has several pitfalls for model selection and generalization: - -{:start="4"} -4. **Not aligned with generalization:** The LML answers "what is the probability a prior model generated the training data?" rather than "how likely is the posterior to have generated withheld points?". A prior that initially explains the data well can still lead to a posterior that generalizes poorly. - -5. **Misaligned in model selection:** The LML evaluates priors, while model selection should evaluate posteriors. Maximizing LML is not equivalent to selecting the best generalizing posterior. - -6. **Can overfit:** The LML can favor "simple" priors concentrated around overfit maximum likelihood solutions that generalize poorly. - -7. **Underfitting bias in hyperparameter selection:** The LML may not favor hyperparameters that make good parameters likely if they also make many poor parameters likely. 
-
-Relating these points to the previous discussions:
-
-For hypothesis testing and hyperparameter learning (**1.** & **2.**), the LML favors the simpler hypothesis that converges faster, implying a smaller area under the learning curve. This aligns with the discussion on prior-data conflict for similarly misspecified models.
-
-At the same time, the paper also states about the case of Mercury's orbit that:
-
-> We emphasize here we are comparing fixed *prior* hypotheses. We are not interested in how parameters of general relativity update based on orbital data, and then deciding whether the updated general relativity is the correct description of orbital trajectories.
-
-This could be misconstrued as computing the marginal cross-entropy for the data under the prior, which is not what the LML is doing: it computes a joint marginal cross-entropy after all.
-The two questions in (**4.**) point to the joint and conditional marginal cross-entropies---the areas under the full and partial learning curves, respectively.
-
-However, neither the LML nor the CLML aligns with *static* evaluation; both align with continued learning (**5.**).
-
-Points (**6.**) and (**7.**) relate to prior-data conflict and model misspecification when they are anti-correlated.
-
-Overall, all quantities can fail in the low-data regime. In the infinite data limit, model (mis-)specification dominates other factors, making the quantities less interesting.
-
-### The "Conditional Marginal Likelihood" in Lotfi et al. (2022/2023)
-
-The paper introduces the conditional marginal likelihood (CLML) as a remedy for the pitfalls of the LML, matching the earlier definition of conditional joint marginal information:
-
-$$
-\Hof{\xset{}{N-P+1}{N} \given \xset{}{1}{N-P}, \h}.
-$$
-
-Unlike the LML, which is invariant to data order, the CLML depends on how the data is split into a conditioning set and a validation set. To make the CLML permutation-invariant, the paper proposes averaging over different permutations, equivalent to a conditional joint marginal cross-entropy. However, this becomes computationally expensive, so the paper uses a single permutation with $$P=20\% \, N$$ to ensure the posterior has sufficiently converged.
-
-
-### Estimating the CLML and LML via Laplace Approximation
-
-Computing the LML via sampling is intractable for deep neural networks. Estimating it from an uninformative prior leads to high-variance estimates, as most $$\w$$ sampled from the prior will perform poorly on the data. Even though Monte Carlo sampling generally works well in high dimensions, it fails here because randomly sampling a good $$\w$$ from the prior is incredibly unlikely, as illustrated in these tweets:
-
-{% twitter https://twitter.com/stanislavfort/status/1529865444701577216 %}
-{% twitter https://twitter.com/RobertRosenba14/status/1517465854157500419 %}
-
-While sampling from the prior to estimate the LML is intractable, we can fare better when sampling from a posterior for computing a CLML, which is the approach the paper takes for the CLML.
The posterior is more concentrated around "good" $$\w$$, and the paper uses a Laplace approximation to approximate it:
-
-However, the LA only captures uncertainty around a single mode, underestimating the uncertainty before the model converges, as beautifully illustrated in the paper:
-
-{% capture max-width %}
-" style="max-width: 35em;
-{% endcapture %}
-{% include figure.html path="assets/img/2024-05-07-clml/bmsmlg_fig3.png" max-width=max-width %}
-
-This is especially relevant for overparameterized DNNs, which have multiple diverse modes ([Wilson, Izmailov, 2020](https://arxiv.org/abs/2002.08791); [2021, blog](https://cims.nyu.edu/~andrewgw/deepensembles/)).
-
-Furthermore, when computing the CLML, the LA may similarly struggle to find meaningful $$\w$$ that perform well on the held-out data when that data would meaningfully change the model, as the CLML decomposes into conditional marginal information terms that condition on these additional data sequentially.
-
-### DNN Experiments: Validation Loss vs. CLML
-
-The DNN experiments in Lotfi et al. (2022/2023) compare the CLML to the validation loss for DNNs on the CIFAR-10 and CIFAR-100 datasets. The results provide empirical evidence for the challenges of computing the CLML and raise the question of whether these approximations are meaningfully different from a validation loss.
-
-The paper shows that while the CLML is better correlated with the generalization performance of the model than the LML, the validation loss is still better correlated with the generalization performance than the CLML. Interestingly, the initially published DNN experiments in the first arXiv version of the paper did not actually compute the CLML but instead computed the validation loss. This was fixed in the second arXiv revision.<d-footnote>This bug was found by yours truly; see the appendix of this post.</d-footnote>
-
-However, given the previous discussions on the similarities between the CLML and cross-validation and the difficulty of approximating the CLML meaningfully, this bug was not a major issue for the paper's conclusions.
-
-Importantly, as we examine in the appendix of this post, when comparing the CLML using Monte Carlo sampling with the validation loss computed using Monte Carlo sampling for the Bayesian Model Average (BMA), the validation loss is still better correlated with the generalization performance than the CLML.
-
-## Conclusion
-
-In conclusion, this blog post has challenged the conventional focus on the marginal likelihood and related quantities for Bayesian model selection as a direct consequence of Occam's razor. It highlights the importance of considering context and goals when choosing a model selection criterion. By motivating MLE and MAP using Occam's razor and questioning the uniqueness of the (conditional) joint marginal likelihood, we hope to encourage critical thinking about the foundations of these quantities.
-
-However, it is important to acknowledge the limitations of our arguments and experiments. A more rigorous theoretical justification, a broader range of models and datasets, and a deeper engagement with philosophical implications are needed to strengthen the insights. As most of the presented methods ignore model complexity and assume a uniform model prior $$\pof{\h}$$, we have not discussed model complexity in the necessary detail, even though, from the perspective of minimum description length (MDL), it would be crucial to take into account.
-
-Despite these limitations, our exploration of the connections between information-theoretic concepts and their behavior in different data regimes, along the lines of model misspecification and prior-data conflict, provides a necessary starting point for understanding recently proposed metrics.
-
-The toy experiment demonstrates that all discussed quantities can fail to reliably predict generalization under model misspecification and prior-data conflict, even for a basic setting using Bayesian linear regression. This emphasizes the need for caution when making claims about the superiority of any particular metric.
-
-Ultimately, the key takeaway is that there is no one-size-fits-all solution, and the choice of model selection criterion should be guided by a careful consideration of the specific context and goals at hand.
-
----
-
-**Acknowledgements:** We would like to thank the authors of the examined papers for their valuable contributions to the field and for inspiring this blog post. Claude-3 and GPT-4 were used to edit and improve this blog post (via cursor.sh).
-
-**Reproducibility:** The figures were created using matplotlib and seaborn in Python. The Bayesian linear regression model was implemented using numpy. The code for the toy experiment is available in this [Google colab](https://colab.research.google.com/drive/1rUnOvkFIxVrIJACxyjcQiGHo3nA77T4T?usp=sharing), and the code for the visualizations is available in this [Google colab](https://colab.research.google.com/drive/1q0esvQGSqd7d6zJfjbFcz-DGSKYi_WpC?usp=sharing).
-
----
-
-## Appendix
-
-### Detailed Code Review of the DNN Experiments in Lotfi et al. (2022/2023)
-The [`logcml_` files in the repository](https://github.com/Sanaelotfi/Bayesian_model_comparison/tree/main/Laplace_experiments/cifar) contain the code to compute the CLML for partially trained models. However, instead of computing
-
-$$
-\begin{aligned}
-\log p(\mathcal D_{\ge m} \mid \mathcal D_{< m}, \mathcal{M} ) &\approx \log \sum_{k=1}^K \frac{1}{K}\, p(\mathcal{D}_{\ge m} \mid w_k, \mathcal M ) \\
-&= \log \sum_{k=1}^K \frac{1}{K}\, \prod_{j=m}^n p(y_j \mid x_j, w_k, \mathcal M ),
-\end{aligned}
-$$
-
-the code computes:
-
-$$
-\frac{1}{|\mathcal{D}_{\ge m}|}\,\sum_{j=m}^n \log p(\mathcal D_{j} \mid \mathcal D_{< m}, \mathcal{M} ) \approx \frac{1}{|\mathcal{D}_{\ge m}|}\,\sum_{j=m}^n \log \sum_{k=1}^K \frac{1}{K}\, p(y_j \mid x_j, w_k, \mathcal M ),
-$$
-
-which is the validation cross-entropy loss of the BMA (of the model trained with 80% of the training data).
-
-The high-level [code](https://github.com/Sanaelotfi/Bayesian_model_comparison/tree/c6f0da1d49374c0dda6ee743e5b02bcf3e158e96/Laplace_experiments/cifar/logcml_cifar10_resnets.py#L295) that computes the CLML is:
-
-{% highlight python linenos %}
-bma_accuracy, bma_probs, all_ys = get_bma_acc(
-    net, la, trainloader_test, bma_nsamples,
-    hessian_structure, temp=best_temp
-)
-cmll = get_cmll(bma_probs, all_ys, eps=1e-4)
-{% endhighlight %}
-
-[`get_bma_acc`](https://github.com/Sanaelotfi/Bayesian_model_comparison/tree/c6f0da1d49374c0dda6ee743e5b02bcf3e158e96/Laplace_experiments/cifar/logcml_cifar10_resnets.py#L149) marginalizes over the LA samples before returning `bma_probs`:
-
-{% highlight python linenos %}
-[...]
-for sample_params in params:
-    sample_probs = []
-    all_ys = []
-    with torch.no_grad():
-        vector_to_parameters(sample_params, net.parameters())
-        net.eval()
-        for x, y in loader:
-            logits = net(x.cuda()).detach().cpu()
-            probs = torch.nn.functional.softmax(logits, dim=-1)
-            sample_probs.append(probs.detach().cpu().numpy())
-            all_ys.append(y.detach().cpu().numpy())
-        sample_probs = np.concatenate(sample_probs, axis=0)
-        all_ys = np.concatenate(all_ys, axis=0)
-        all_probs.append(sample_probs)
-
-all_probs = np.stack(all_probs)
-bma_probs = np.mean(all_probs, 0)
-bma_accuracy = (np.argmax(bma_probs, axis=-1) == all_ys).mean() * 100
-
-return bma_accuracy, bma_probs, all_ys
-{% endhighlight %}
-
-The important line is #18: `bma_probs = np.mean(all_probs, 0)`, which marginalizes over the predictions and returns the BMA prediction for each sample.
-
-Finally, [`get_cmll`](https://github.com/Sanaelotfi/Bayesian_model_comparison/tree/c6f0da1d49374c0dda6ee743e5b02bcf3e158e96/Laplace_experiments/cifar/logcml_cifar10_resnets.py#L170) computes the validation loss for each sample independently (after applying a bit of label smoothing):
-{% highlight python linenos %}
-def get_cmll(bma_probs, all_ys, eps=1e-4):
-    log_lik = 0
-    eps = 1e-4
-    for i, label in enumerate(all_ys):
-        probs_i = bma_probs[i]
-        probs_i += eps
-        probs_i[np.argmax(probs_i)] -= eps * len(probs_i)
-        log_lik += np.log(probs_i[label]).item()
-    cmll = log_lik/len(all_ys)
-
-    return cmll
-{% endhighlight %}
-
-The DNN experiments in Section 5 and Section 6 of the first arXiv revision of the paper (*v1*) thus did not estimate the CLML per se but computed the BMA validation loss of a partially trained model (80%) and found that this correlates positively with the test accuracy and test log-likelihood of the fully trained model (at 100%). This is not surprising because it is well-known that the validation loss of a model trained on 80% of the data correlates positively with the test accuracy (and generalization loss).
-
-### Author Response from 2022
-
-Sadly, the following response seems to mainly address the first draft of this post. However, it is still helpful for the final blog post and provides additional context.
-
-Thanks for your interest in our paper and your comments. Here are our comments about the blog as it is currently framed: - -(1) Thank you for pointing out a bug in the CLML computation for Figure 5b. We note that this bug is only relevant to a single panel of a single figure in the main text. We have re-run this experiment with the right CLML, and the results, attached here, are qualitatively the same. In summary, it was a very minor part of the paper, and even for that part it did not affect the take-away. We also attach the results of the correlation between the BMA test accuracy and the negative validation loss. You suggest in your post that the validation loss might correlate better with the BMA test accuracy than the CLML given that we use 20 samples for NAS. Our empirical results show the opposite conclusion. Additionally, we are not suggesting the CLML as a replacement to cross-validation but rather as a minor way to modify the LML for improvements in predicting generalization. Finally, we attach results for different sample sizes (20 samples vs. 100 samples) to address your comments on the sample size used to estimate the CLML. As we can see in the figure, the Spearman correlation factor is quite similar. 20 samples appears to provide a reasonable estimate of the CLML for these purposes, and is different from validation loss. - -{% capture max-width %} -" style="max-width: 20em; -{% endcapture %} -{% include figure.html path="assets/img/2024-05-07-clml/rebuttal_1.png" max-width=max-width %} -{% include figure.html path="assets/img/2024-05-07-clml/rebuttal_2.png" max-width=max-width %} -{% include figure.html path="assets/img/2024-05-07-clml/rebuttal_3.png" max-width=max-width %} - -(2) Your post currently opens by suggesting that there is something wrong with our experiments, likely either an LML approximation or a CLML issue, because we note that the LML correlates more poorly with generalization for larger datasets (where “large” is relative in the context of a specific experiment). A few points here: (i) this result is actually completely expected. The LML is in fact non-monotonic in how well it predicts generalization. For small datasets, the prior should be reasonably predictive of generalization. For intermediate datasets, the first terms in the LML decomposition have a negative effect on the correlation with generalization. For asymptotically large datasets, the first terms have a diminishing effect, and we get a consistent estimator; (ii) almost all of our experiments are exact, and we see this behaviour in the exact experiments for the Fourier model. For example, for the Fourier feature experiment in Fig 4(d), LML picks the better generalizing model for n < 50 and n > 296. For n in [50, 296] it picks the wrong model. For large neural network models, it is reasonable that the exact LML could pick the wrong model for CIFAR-sized datasets. (iii) any potential issues with the CLML are not relevant to these considerations, which are about the behaviour of the LML. - -(3) Your post currently suggests that issues with approximate inference could be responsible for our take-aways, rather than issues with the LML in general. But as we note in (2), almost all of our experiments use the exact LML and CLML: the density model, Fourier features, Gaussian processes, and deep learning exps on DKL, and there was never any bug associated with CLML computation in these experiments. 
The takeaways for the Laplace experiments are consistent with the exact experiments, and also expected, as above. While it’s true that the CLML can be estimated more effectively than the LML for the Laplace experiments, this is actually an advantage of the CLML that we note in the paper. The LML results also stand on their own, as we discuss above. - -(4) Your post places a lot of importance on Figure 5, as if it is the main result of the paper and our main “DNN” experiments. We stand by the results of Figure 5, but it is a relatively minor component of the paper. As we’ve mentioned most of our results are exact, including our DKL experiments, which are certainly the most substantial DNN experiments, with practically exciting results for transfer and few-shot learning. The DKL experiments are actually where we expect the CLML to be practically useful, and currently they seem to be overlooked in the post. - -(5) The blog seems to question the learning curve experiments, but these experiments in Figure 4 are exact, with no Laplace approximation, and relatively straightforward. - -(6) Your post seems to be negative about the CLML, presenting its similarity with cross-validation as a potential drawback, and implying the skepticism about the CLML should affect the interpretation of our take-aways. Two points here: (i) as above, the CLML is independent of most of our take-aways, which are about the properties of the LML; (ii) our goal with the CLML was not to introduce something starkly different from cross-validation, but to show how a very minor modification to the LML could improve alignment with generalization. Moreover, the DKL CLML results are quite promising as an efficient way to do gradient based estimation of a large number of hyperparameters. - -(7) The blog opens as if it is leading up to some fatal flaw. But as above, (i) the LML considerations are independent of the CLML, (ii) most of the experiments are exact, (iii) the trends for the exact and approximate inference procedures are the same and are naturally understandable and explainable, such as the non-monotonic trend in how well the LML correlates with generalization, and (iv) the CLML bug only affected Figure 5, panel b, and when it’s corrected the qualitative take-away is the same as before. - -We appreciate your interest and effort in reading the paper, and we think your questions will improve the clarity of the paper, which we have updated with an acknowledgement to you. Given the above considerations, we do think there would need to be substantial revisions to the blog post to accurately and fairly reflect the paper. We would appreciate being able to see the revisions before it’s posted. - -Best wishes,\\ -Sanae, Pavel, Greg, Micah, Andrew -
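-
-Before examining the new results, it may help to make the difference between the two estimators from the code review above concrete. The following is a minimal numpy sketch with made-up predictive probabilities (an illustration, not the paper's code):
-
-{% highlight python %}
-import numpy as np
-from scipy.special import logsumexp
-
-rng = np.random.default_rng(0)
-K, n = 20, 100  # LA posterior samples, held-out points
-# hypothetical per-sample probabilities p(y_j | x_j, w_k) of the true labels
-log_probs = np.log(rng.uniform(0.3, 0.9, size=(K, n)))
-
-# CLML estimate: average over posterior samples *jointly* across all points,
-# log (1/K) sum_k prod_j p(y_j | x_j, w_k)
-clml = logsumexp(log_probs.sum(axis=1)) - np.log(K)
-
-# BMA validation loss: average over posterior samples *per point*,
-# (1/n) sum_j log (1/K) sum_k p(y_j | x_j, w_k)
-bma_val = np.mean(logsumexp(log_probs, axis=0) - np.log(K))
-
-# The CLML couples the held-out points through shared posterior samples;
-# the BMA validation loss treats every point independently.
-print(clml / n, bma_val)
-{% endhighlight %}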
- -### Ablation: CLML vs. BMA Validation Loss vs. (non-BMA) Validation Loss - -Let us examine the new results: - -In the three panels below, two panels show test accuracy vs. validation loss; one shows test accuracy vs. CLML. The left-most panel is the BMA test accuracy vs. (negative) BMA validation loss, the middle panel is vs. the CLML, and the right-most panel is vs. the (negative) non-BMA validation loss. - -Note that the left-most panel is from *v1*, which was accidentally computing the BMA validation loss, and whose axis label is adapted here from *v1* for clarity. The two other plots are from *v2* after fixing the bug. See commits [here](https://github.com/Sanaelotfi/Bayesian_model_comparison/commit/a579aa292723dc20a6105ec8f4fff1045dd9a9fd) for fixing the CLML estimation and [here](https://github.com/Sanaelotfi/Bayesian_model_comparison/commit/3fa8ca2ecb314ee881f6c95a602ef58b9ccd3620) for computing the non-BMA validation loss. - -{% capture width %} -" style="width: 20em; -{% endcapture %} -
-
- {% include figure.html path="assets/img/2024-05-07-clml/bmsmlg_bma_validation_loss.svg" class="img-fluid" caption="BMA Neg Validation Loss" width=width %} -
-
- {% include figure.html path="assets/img/2024-05-07-clml/bmsmlg_clml.svg" class="img-fluid" caption="CLML" width=width %} -
-
-
-
- {% include figure.html path="assets/img/2024-05-07-clml/bmsmlg_validation_loss.svg" class="img-fluid" caption="Validation Loss" width=width %} -
-{% capture width %} -" style="width: 5em; -{% endcapture %} -
- {% include figure.html path="assets/img/2024-05-07-clml/bmsmlg_plot_legend.svg" class="img-fluid" caption="Leg" width=width %} -
-
-
-At first glance, there might be an observer effect in the experiments for the validation loss. The BMA validation loss in *v1* performs better than the CLML in *v2*, while the non-BMA validation loss in *v2* underperforms the CLML in *v2*.
-When asked about it, the authors pushed the respective code (see link above) and explained that the updated, right-most panel computes the **non-BMA** validation loss, i.e., without LA samples.
-It seems surprising that there is such a difference between the non-BMA validation loss and the BMA validation loss: *the non-BMA validation loss is more than one nat worse on average than the BMA validation loss, based on visual inspection*. Note that the plots here and in the paper compute the average CLML and average validation loss and are thus directly comparable.
-
-The authors said in their response that:
-
-> You suggest in your post that the validation loss might correlate better with the BMA test accuracy than the CLML given that we use 20 samples for NAS. Our empirical results show the opposite conclusion.
-
-This is only partially true.
-The BMA validation loss (which was accidentally computed in *v1* instead of the CLML) correlates very well with the BMA test accuracy.
-This is not surprising given that this is the frequentist purpose of using validation sets. If validation sets were not correlating well with the test accuracy, we would not be using them in practice. 🤗 As such, this raises the question of why the non-BMA validation loss correlates negatively with the BMA test accuracy for ResNets and overall in the *v2* results.
-Thus, only the non-BMA validation loss supports the now opposite conclusion in *v2* of the paper and in the authors' response.
-
-Yet what is also surprising is how well the BMA validation loss does vs. the CLML:
-
-### Ablation: LA Sample Size
-
-Secondly, when we compare the reported values between the BMA validation loss and the CLML, we notice that the CLML is lower than the BMA validation loss by half a nat for $$\lambda=10^2$$ and generally for CNNs.
-
-However, even though the new experiments in *v2* are supposed to reproduce the ones from *v1*, and we can assume that the same model checkpoints were used for re-evaluation (as retraining is not necessary), both the CLML and the non-BMA validation loss are off by about half a nat for the CNNs. As such, the above consideration might hold but might not provide the answer here.
-
-Instead, we overlay the non-BMA validation loss and CLML plots, both from *v2*, with a "difference blend": it shows the absolute difference between the colors for overlapping data points (the circles 🔴 and triangles 🔺), leading to black where there is a match, a negative (green-ish) color for the CLML, and a positive (sepia) color for the validation losses. The background grids were used to match the plots, but we hid the ones from the CLML afterward---as such, the strong overlay is because the values are so close.
-
-{% capture width %}
-" style="width: 25em;
-{% endcapture %}
-{% include figure.html path="assets/img/2024-05-07-clml/bmsmlg_difference_overlay_plot.svg" class="img-fluid" width=width %}
-
-Surprisingly---or rather as predicted when the LA does not really do much---it turns out that the validation loss for the CNNs (🔴) almost exactly matches the estimated CLML with 20 LA samples, based on visual inspection.
To be more precise, either the models have already sufficiently converged, or the CLML estimate is not actually capturing the correlations between points and thus ends up being very similar to the validation loss.
-
-{% capture width %}
-" style="width: 25em;
-{% endcapture %}
-{% include figure.html path="assets/img/2024-05-07-clml/rebuttal_3.png" class="img-fluid" width=width %}
-
-This changes the interpretation of the sample ablation in the authors' response. The ablation shows no difference between 20 and 100 LA samples, with 100 LA samples even having a slightly lower rank correlation. So it seems five times more LA samples are not sufficient to make a difference, or the Laplace approximation cannot capture the posterior as well as hoped. It would be interesting to examine this further. Kirsch et al. (2022) reported running toy experiments on MNIST with 10,000 MC Dropout samples without achieving good adaptation. The Laplace approximation is not MC Dropout, and this is speculation, but it seems in agreement. Notwithstanding the compute cost and feasibility, could posterior samples using HMC or similar more principled methods provide better estimates?
-
-All in all, given the above, it is fair to say that the estimate of the CLML is probably not as good as hoped, and further experiments might be needed to tease out when the CLML provides more value than the (BMA) validation loss. Note, however, that this question has not been explicitly examined in the paper. Instead, for DNNs, the paper only compares the LML and CLML with distinct estimation methods.
\ No newline at end of file
diff --git a/_posts/2024-05-07-deqalg-reasoning.md b/_posts/2024-05-07-deqalg-reasoning.md
deleted file mode 100644
index 86ab8285..00000000
--- a/_posts/2024-05-07-deqalg-reasoning.md
+++ /dev/null
@@ -1,206 +0,0 @@
----
-layout: distill
-title: Deep Equilibrium Models For Algorithmic Reasoning
-description:
-  In this blogpost we discuss the idea of teaching neural networks to reach fixed points when reasoning. Specifically, on the algorithmic reasoning benchmark CLRS the current neural networks are told the number of reasoning steps they need, which they shouldn't be given. While a quick fix is to add a termination network that predicts when to stop, a much more salient inductive bias is that the neural network shouldn't change its answer any further once the answer is correct, i.e. it should reach a fixed point. This is supported by denotational semantics, which tells us that while loops that terminate are the minimum fixed points of a function. We implement this idea with the help of deep equilibrium models and discuss several hurdles one encounters along the way. We show on several algorithms from the CLRS benchmark the partial success of this approach and the difficulty in making it work robustly across all algorithms.
-date: 2024-05-07
-future: true
-htmlwidgets: true
-
-# Anonymize when submitting
-
-authors:
-  - name: Sophie Xhonneux
-    url: "https://scholar.google.com/citations?user=9TQM9k4AAAAJ&hl=en"
-    affiliations:
-      name: Mila, Université de Montréal
-  - name: Yu He
-    url: "https://dransyhe.github.io/"
-    affiliations:
-      name: University of Cambridge
-  - name: Andreea Deac
-    url: "https://andreeadeac22.github.io/"
-    affiliations:
-      name: Mila, Université de Montréal
-  - name: Jian Tang
-    url: "https://jian-tang.com/"
-    affiliations:
-      name: Mila, HEC Montréal
-  - name: Gauthier Gidel
-    url: "https://gauthiergidel.github.io/"
-    affiliations:
-      name: Mila, Université de Montréal
-
-# must be the exact same name as your blogpost
-bibliography: 2024-05-07-deqalg-reasoning.bib
-
-# Add a table of contents to your post.
-#   - make sure that TOC names match the actual section names
-#     for hyperlinks within the post to work correctly.
-#   - please use this format rather than manually creating a markdown table of contents.
-toc:
-  - name: What is Algorithmic Reasoning?
-  - name: Why care about fixed points?
-  - name: How can we do fixed points with DNNs?
-  - name: How well does it work?
-  - name: What's the problem?
-  - name: What do we take away?
-  - name: References
-
-# Below is an example of injecting additional post-specific styles.
-# This is used in the 'Layouts' section of this post.
-# If you use this post as a template, delete this _styles block.
-_styles: >
-  .fake-img {
-    background: #bbb;
-    border: 1px solid rgba(0, 0, 0, 0.1);
-    box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
-    margin-bottom: 12px;
-  }
-  .fake-img p {
-    font-family: monospace;
-    color: white;
-    text-align: left;
-    margin: 12px 0;
-    text-align: center;
-    font-size: 16px;
-  }
----
-
-## What is Algorithmic Reasoning?
-
-Broadly, algorithmic reasoning studies how well neural networks can learn to execute classical computer science algorithms. In particular, to measure how well an algorithm has been learned, we look at size generalisation, i.e. we train on inputs of size $$N$$ and check how well the neural network performs on inputs of size $$2N$$ or $$10N$$. The idea is that neural networks often learn shortcuts that work well in-distribution, but fail out-of-distribution, whereas classical computer science algorithms work no matter the input size. The purpose of this exercise is to study the generalisation of reasoning tasks, especially what tricks help to improve robustness and get the network closer to deducing logically rather than relying on statistical shortcuts.
-
-## Why care about fixed points?
-
-First, let's remember that for $$x_0$$ to be a fixed point of a function $$f$$ it must satisfy $$f(x_0) = x_0$$. Secondly, we can observe that many algorithms consist of an update rule that you apply until there is no more change. The final output can easily be seen to be a fixed point! In a classical computer science algorithm, some smart person will have sat down and shown that under some conditions on the input this convergence will happen and the final answer is correct.
-
-An example algorithm would be the Bellman-Ford algorithm to compute the shortest distance to a given node in a graph. Here the update rule looks like $$x_i^{(t+1)} =\min(x_i^{(t)}, \min \{x_j^{(t)} + e_{ij}\}_{j\in N(i)})$$, where $$x_i^{(t)}$$ is the shortest distance estimate to the source node at time $$t$$, $$e_{ij}$$ is the distance between nodes $$i$$ and $$j$$, and $$\{j\}_{j\in N(i)}$$ are the neighbours of node $$i$$. The algorithm says to apply this rule until there is no more change---a fixed point.
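-
-As a concrete illustration (plain Python, not the CLRS reference code), the whole algorithm is just this update rule in a loop that stops once nothing changes:
-
-{% highlight python %}
-import math
-
-def bellman_ford(source, edges, n):
-    # edges maps a directed pair (j, i) to the distance e_{ji}; n is the node count
-    x = [math.inf] * n
-    x[source] = 0.0
-    while True:
-        x_new = list(x)
-        for (j, i), w in edges.items():
-            x_new[i] = min(x_new[i], x[j] + w)  # the update rule from above
-        if x_new == x:  # no more change: we have reached a fixed point
-            return x
-        x = x_new
-{% endhighlight %}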
-
-Interestingly, denotational semantics---a theoretical field of computer science---has shown that you can represent Turing-complete programming languages as mathematical functions. This is mostly quite trivial with the exception of the while loop (which is also the key ingredient that makes a language Turing-complete). Here the trick is a special mathematical operator that returns the minimum fixed point of a function! (If there is no fixed point of a function, then the corresponding while loop doesn't terminate.) And thus we can see that fixed points are reached by all programs that terminate, and yet they aren't used in neural networks that try to learn how to do reasoning. A missed inductive bias, perhaps?
-
-## The details
-### Task specification
-
-The CLRS paper provides us with a benchmark dataset for algorithmic reasoning. The general structure of the data is a sequence in time of intermediate states of a given algorithm. In other words, at timestep $$t$$ we have a state $$x_t$$ that describes the various variables that the algorithm stores, e.g. in BellmanFord $$x_t$$ will contain the current estimate of the shortest path at each node of the graph. At each timestep $$t$$ we then try to predict the next timestep; we do this by outputting some $$y_t$$ from which we can extract $$x_{t+1}$$. Note that $$y_t$$ may be slightly different from $$x_{t+1}$$, for instance because some state may never change by definition, e.g. the graph in BellmanFord, and hence we don't predict it again. This is all illustrated in the next figure, where we split the state into a state at each node $$x$$ and at each edge $$e$$ for a given graph $$G$$ as an example.
-
-
- {% include figure.html path="assets/img/2024-05-07-deqalg-reasoning/alg-reasoning-task.png" class="img-fluid rounded z-depth-1" %} -
-
-
- Algorithmic Reasoning Task, diagram recreated from -
-
-### The architecture
-
-The high-level architecture is that of an encoder-processor-decoder. The motivation is that neural networks perform well in high-dimensional spaces, but classical algorithms tend to operate on very low-dimensional variables, e.g. in BellmanFord the shortest distance would be a single scalar. Thus the encoder projects the state into a high-dimensional space $$z_t$$, where the main computation is then done by the processor network---typically a Graph Neural Network. The output of the processor $$z_{t+1}$$ is then decoded back into the low-dimensional space by the decoder. The encoder and decoder mostly consist of linear layers with the occasional exception, e.g. a softmax for categorical variables. The processor will be a graph neural network, for which several different architectures have been explored. We either use the TripletMPNN, which adds edge message passing, or a simple MPNN with a linear message layer. A minimal sketch of one such step follows below.
-
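-
-Here is a hedged PyTorch sketch of one encode-process-decode step; the dimensions, names, and the dense-graph representation are illustrative assumptions, not the CLRS reference implementation:
-
-{% highlight python %}
-import torch
-import torch.nn as nn
-
-class EncodeProcessDecode(nn.Module):
-    def __init__(self, state_dim=1, hidden_dim=128):
-        super().__init__()
-        self.encode = nn.Linear(state_dim, hidden_dim)
-        self.message = nn.Linear(2 * hidden_dim + 1, hidden_dim)  # +1 for edge weight
-        self.decode = nn.Linear(hidden_dim, state_dim)
-
-    def forward(self, x, adj):
-        # x: [n, state_dim] node states; adj: [n, n] dense edge weights
-        n = x.shape[0]
-        z = self.encode(x)                    # project into hidden space z_t
-        zi = z.unsqueeze(1).expand(n, n, -1)  # receiver node i
-        zj = z.unsqueeze(0).expand(n, n, -1)  # sender node j
-        m = self.message(torch.cat([zi, zj, adj.unsqueeze(-1)], dim=-1))
-        z_next = m.min(dim=1).values          # min aggregation over senders
-        return self.decode(z_next), z_next    # prediction y_t and next hidden z
-{% endhighlight %}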
-
- {% include figure.html path="assets/img/2024-05-07-deqalg-reasoning/architecture.png" class="img-fluid rounded z-depth-1" %} -
-
-
- High-level architecture employed -
-
-The processor is supposed to do the main computation of the network; in particular, the hope is that one iteration of the processor is equal to one iteration of the algorithm. In our example of BellmanFord, it would be one iteration of the update rule $$x_i^{(t+1)} =\min(x_i^{(t)}, \min \{x_j^{(t)} + e_{ij}\}_{j\in N(i)})$$ (see also the Figure below). Thus, the processor should indicate termination by no longer changing its output $$z$$.
-
-### Training
-
-Traditionally, the training approach has been teacher forcing. In teacher forcing, we train each step of the algorithm independently by feeding the network the ground-truth $$x_t$$ and computing the loss against $$y_t$$ at all $$t$$ simultaneously (see the sketch below). This requires us to know the exact number of steps in the algorithm a priori. In other words, training with just teacher forcing will require us to tell the network the number of iterations it should run for at test time (which will vary depending on the input state). This is unrealistic in practice, where we would simply give our neural network the input state and ask it to run the algorithm on its own, which includes knowing when to stop the computation. While a termination network has been suggested in earlier work, the issue is ignored in later papers.
-
-Remember that neural networks are really good at learning in-distribution shortcuts. To more rigorously test whether the neural network has learned the underlying logical algorithm, we introduce a shift between the training and test distribution. If the network has learned the classical algorithm, it should be able to overcome this shift. Throughout the CLRS algorithmic reasoning benchmark, size generalisation is used, i.e. we train on examples of size 16 (i.e. the graph has 16 nodes) and at test time we will use an input size of 64.
-
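-
-A hedged sketch of this teacher-forcing objective (the helper names are hypothetical; `model` is assumed to map a ground-truth state to a prediction for the next step):
-
-{% highlight python %}
-def teacher_forcing_loss(model, states, targets, loss_fn):
-    # states: ground-truth x_0 .. x_{T-1}; targets: y_0 .. y_{T-1}.
-    # Each step is supervised from the ground-truth input, so the number
-    # of steps T must be known in advance.
-    total = 0.0
-    for x_t, y_t in zip(states, targets):
-        y_pred, _ = model(x_t)  # one processor step per algorithm step
-        total = total + loss_fn(y_pred, y_t)
-    return total
-{% endhighlight %}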
-
- {% include figure.html path="assets/img/2024-05-07-deqalg-reasoning/BFexplained.png" class="img-fluid rounded z-depth-1" %} -
-
-
- An example algorithm: Bellman-Ford -
-
-
-## How can we do fixed points with DNNs?
-One approach to training neural networks that run until they reach a fixed point is deep equilibrium models (DEQ). We give a brief introduction to this approach next, based on the deep equilibrium blogpost.
-
-Given our input $$x$$, our hidden state $$z$$, and our processor $$f$$, the goal is to optimise the fixed point $$z^*=f(z^*,x)$$ we reach. The question is: how can we backprop through $$z^* = f(z^*,x)$$?
-
-In backprop, we ultimately want to compute
-
-$$ \left(\frac{\partial z^*(.)}{\partial(.)}\right)^{\top} g$$
-
-for some incoming gradient $$g$$ from the layers after (in our case from the decoder) and $$(.)$$ being anything we want, but usually the weights of the network. We can show by implicit differentiation of $$z^* = f(z^*,x)$$ that
-
-$$ \left(\frac{\partial z^*(.)}{\partial(.)}\right)^{\top} g = \left(\frac{\partial f(z^*, x)}{\partial (.)}\right)^{\top}\left(I-\frac{\partial f(z^*, x)}{\partial z^*}\right)^{-\top}g$$
-
-The difficult term to compute in the above equation is $$\left(I-\frac{\partial f(z^*, x)}{\partial z^*}\right)^{-\top}g$$, which is the solution of a linear system, namely:
-
-$$\left(I-\frac{\partial f(z^*, x)}{\partial z^*}\right)^{\top}h = g$$
-
-In general, we can try to solve it in two ways: using a linear system solver, like those found in torch.linalg, or computing a fixed point of
-
-$$h = \left(\frac{\partial f(z^*, x)}{\partial z^*}\right)^{\top}h + g$$
-
-In the DEQ blogpost they suggest solving the above fixed point. The reason to use implicit differentiation is that backpropagating through time may easily run into exploding or vanishing gradients or error accumulation due to the number of steps needed to reach a fixed point.
-
-We tried both: solving the linear system with torch.linalg.solve and finding the above fixed point. But we settled on computing the fixed point of the equation above as suggested by the deep equilibrium blogpost, as it is computationally faster, while the added accuracy of linear system solvers wasn't beneficial. Note this trade-off is heavily informed by what is readily implemented in PyTorch to run on GPU, hence the balance may shift in the future.
-
-### Tricks we employ
-
-To encourage convergence we change the update function in the MPNN to be a minimum update, i.e. $$z^{(t+1)} = \min(z^{(t)}, z^{'(t+1)})$$. This update rule is motivated by the problem of getting neural networks to converge to a fixed point. We discuss the effect of this in more detail after the experimental section.
-
-Currently, gradient flows through the implicit differentiation explained above as well as back in time through standard backprop via $$z_t$$. To enable more ways for the gradient to inform early steps in the algorithm, we propagate the gradient through $$y_t$$ as well. For discrete $$y_t$$, in other words for categorical variables in the state $$x_t$$, we employ the Rao-Blackwell straight-through Gumbel-Softmax estimator to allow gradients to flow.
-
-Finally, we also try adding a loss on the number of steps by adding the penalty $$\sum_{t=0}^{T} \|z_{t+1} - z_{t}\|^2$$. The penalty will be larger as we take more steps and stay away from the fixed point, thus hopefully encouraging convergence to a fixed point more quickly.
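-
-A minimal PyTorch sketch of the DEQ forward pass and the implicit-differentiation backward pass described above, following the deep equilibrium blogpost's recipe (`f` is a placeholder for the processor; the naive solver and its settings are illustrative):
-
-{% highlight python %}
-import torch
-
-def fixed_point(g, z0, max_iter=64, tol=1e-4):
-    # Naive solver: iterate z <- g(z) until (approximate) convergence.
-    z = z0
-    for _ in range(max_iter):
-        z_next = g(z)
-        if (z_next - z).norm() < tol:
-            return z_next
-        z = z_next
-    return z
-
-def deq(f, x, z0):
-    # Forward: find z* = f(z*, x) without storing the unrolled graph.
-    with torch.no_grad():
-        z_star = fixed_point(lambda z: f(z, x), z0)
-    # One extra step with autograd enabled, so f is differentiable at z*.
-    z_in = z_star.detach().requires_grad_()
-    f0 = f(z_in, x)
-
-    def backward_hook(g):
-        # Solve h = (df/dz*)^T h + g by fixed-point iteration; the result
-        # equals (I - df/dz*)^{-T} g, the term from the equations above.
-        return fixed_point(
-            lambda h: torch.autograd.grad(f0, z_in, h, retain_graph=True)[0] + g,
-            g)
-
-    f0.register_hook(backward_hook)
-    return f0
-{% endhighlight %}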
-
-
-## How well does it work?
-
-In the table below we show the accuracy<d-footnote>What exactly is measured as the accuracy depends on the algorithm, but it is usually a pointer, e.g. in the Bellman-Ford algorithm it is a pointer to the previous node along the shortest path. For more details see the CLRS benchmark paper.</d-footnote> of the algorithms when tested on graphs of size 64.
-
-DEQ is our approach of reaching a fixed point together with the implicit differentiation explained above. Hint propagation is simply reaching a fixed point and backpropagating through time with no implicit differentiation. Teacher forcing is used for the baselines, where the first number is for the simple MPNN architecture and the second number is for the more complex TripletMPNN (these numbers are taken from the original paper). For BellmanFord and BFS we use the simple MPNN and for all others we use the TripletMPNN.
-
-| Algorithm     | DEQ   | Hint propagation | Teacher forcing |
-| ------------- |:-------------:|:----------------:|:---------------:|
-| BellmanFord*  | 96.4% | 96.7% | 92%/97% |
-| Dijkstra      | 78.8% | 84.4% | 92%/96% |
-| BFS*          | 53.8% | 57.1% | 100%/100% |
-| DFS           | 5.0%  | 4.7%  | 7%/48% |
-| MST-Kruskal   | 82.3% | 82.3% | 71%/90% |
-| MST-Prim      | 75.2% | 50.4% | 71%/90% |
-
-As we can see in the table above, the approach works very well for simpler algorithms such as BellmanFord, where with the simple MPNN we manage to achieve equal or better accuracy than the teacher-forced simple MPNN and match the TripletMPNN. Interestingly, this is a parallel algorithm, i.e. all node representations run the same code, in contrast to sequential algorithms, which go through the graph node by node. We did try gating to enable the GNN to better mimic a sequential algorithm, but this didn't help.
-
-On the other algorithms, while we are able to learn, we cannot match the performance of teacher forcing, where we assume we know the number of timesteps to run the neural network. This additional help makes the comparison slightly unfair; however, it shows how learning a fixed point is difficult for the network, as we are not able to match the performance. We hypothesise about the reasons behind this in the next section.
-
-## What's the problem?
-
-There are a few major issues that we noticed during training. The first is that the network is prone to underfitting: while we only show the test accuracy in the table above, the training error doesn't actually reach 0. It is unclear what causes this; however, resolving some issues with the DEQ may solve it. So let's delve into them.
-
-### Convergence is a key issue
-
-Firstly, the network will often take a large number of steps to reach a fixed point. We can see on easier algorithms like the BellmanFord algorithm that the number of forward steps during training often reaches our set upper limit of 64 forward steps (the actual algorithm would take on average 4-5, max 10 for this graph size). This is why we implement our architecture trick, where we update the next hidden representation only if it is smaller than the current one, i.e. $$z^{(t+1)} = \min(z^{(t)}, z^{'(t+1)})$$ where $$z^{'(t+1)}$$ is the output of our min aggregator in the message passing step (alternatives such as gating and an exponential moving average update function were also tried). This helps with convergence, which enables finding a fixed point in simple cases, but fails to work reliably for more complex architectures and problems, while also introducing a different issue.
-
-### The problem with hard constraints to achieve convergence
-
-Remember that during the implicit differentiation we are trying to solve
-
-$$h = \left(I-\frac{\partial f(z^*, x)}{\partial z^*}\right)^{-\top}g$$
-
-i.e. in the linear system $$y = Ax$$ our matrix $$A$$ is equal to $$I-J$$ where $$J$$ is the Jacobian in the above equation.
If the Jacobian is equal to the identity, then our matrix $$A = 0$$ and our system has no solution. In practice, $$z^{(t+1)} = \min(z^{(t)}, z^{'(t+1)})$$ will reduce to $$f(z) = z$$ in many dimensions of $$z$$. This leads to many rows of the Jacobian being rows of the identity, due to the function effectively becoming $$f(x)=x$$ in those dimensions, and thus to rows of $$A$$ that are entirely zero. The resulting linear system is ill-defined and has no solution, causing the optimisation to break.
-
-One solution is to try a soft-min, i.e. $$\operatorname{softmin}_{\tau}(a,b) = \frac{ae^{-a/\tau}+be^{-b/\tau}}{e^{-a/\tau}+e^{-b/\tau}}$$. Here we get the ability to trade off between convergence and the Jacobian being interesting. For $$\tau \ll 1$$ we basically recover the min operation, and for $$\tau \gg 1$$ we simply get an average, i.e. an exponential-moving-average-style update. In practice, there was no value of $$\tau$$ for which we consistently obtained an interesting Jacobian while also converging sufficiently fast.
-
-## What do we take away?
-
-1. Training to reach a fixed point can work as a way to determine when to stop reasoning. But it gets increasingly more difficult as the underlying problem gets harder.
-2. It's unclear what inductive bias to choose in order to ensure fast enough convergence to a fixed point. There are downsides such as uninformative gradients at the fixed point.
-3. Optimisation is tricky and stands in the way, in particular with implicit differentiation through the fixed point.
-
-
diff --git a/_posts/2024-05-07-diffusion-theory-from-scratch.md b/_posts/2024-05-07-diffusion-theory-from-scratch.md
deleted file mode 100644
index e7c71591..00000000
--- a/_posts/2024-05-07-diffusion-theory-from-scratch.md
+++ /dev/null
@@ -1,457 +0,0 @@
----
-layout: distill
-title: "Building Diffusion Model's theory from ground up"
-description: "Diffusion Models, a new generative model family, have taken the world by storm after the seminal paper by Ho et al. [2020]. While diffusion models are often described as probabilistic Markov chains, their underlying principle is based on the decade-old theory of Stochastic Differential Equations (SDE), as discovered later by Song et al. [2021]. In this article, we will go back and revisit the 'fundamental ingredients' behind the SDE formulation and show how the idea can be 'shaped' to get to the modern form of Score-based Diffusion Models. We'll start from the very definition of the 'score', how it was used in the context of generative modeling, how we achieve the necessary theoretical guarantees and how the critical design choices were made to finally arrive at the more 'principled' framework of Score-based Diffusion. Throughout this article, we provide several intuitive illustrations for ease of understanding."
-date: 2024-05-07
-htmlwidgets: true
-
-authors:
-  - name: Ayan Das
-    url: "https://ayandas.me/"
-    affiliations:
-      name: "University of Surrey UK, MediaTek Research UK"
-
-# must be the exact same name as your blogpost
-bibliography: 2024-05-07-diffusion-theory-from-scratch.bib
-
-# Add a table of contents to your post.
-#   - make sure that TOC names match the actual section names
-#     for hyperlinks within the post to work correctly.
-toc:
-  - name: Introduction
    subsections:
    - name: Motivation
    - name: Generative Modeling
    - name: Existing Frameworks
    - name: Diffusion is no different
    - name: "The 'Score'"
-  - name: Generative Modeling with Scores
    subsections:
    - name: Langevin Equation and Brownian Motion
    - name: Fokker-Planck Equation
    - name: A probability path
    - name: Estimating the "score" is hard
    - name: The "forward process"
    - name: Finite time & the "schedule"
-  - name: Estimating the Score
    subsections:
    - name: Implicit Score Matching
    - name: Denoising Score Matching
    - name: Probing the learning objective
    - name: Denoising as inverse problem
-  - name: Last few bits
-
-# Below is an example of injecting additional post-specific styles.
-# This is used in the 'Layouts' section of this post.
-# If you use this post as a template, delete this _styles block.
-_styles: >
-  .fake-img {
-    background: #bbb;
-    border: 1px solid rgba(0, 0, 0, 0.1);
-    box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
-    margin-bottom: 12px;
-  }
-  .fake-img p {
-    font-family: monospace;
-    color: white;
-    text-align: left;
-    margin: 12px 0;
-    text-align: center;
-    font-size: 16px;
-  }
----
-
-## Introduction
-
-### Motivation
-
-Not only has generative modeling been around for decades, but a few promising model families have also emerged and dominated the field for several years in the recent past. VAEs dominated the generative modelling landscape from 2014 onwards, until GANs took off in 2015-16; Normalizing Flows (NF) never really made it into mainstream generative modeling due to their restrictive architectural requirements. However, it is quite clear at this point that the magnitude of their impact is smaller than what barely 2-3 years of Diffusion Models have achieved. This is mostly attributed to one of the seminal papers (by Jonathan Ho et al.), now popularly referred to as "Denoising Diffusion Probabilistic Models" or DDPM. With the exponential explosion of works following DDPM, it is very hard, or rather unnecessary, to look beyond this pivotal point.
-
-In this article, we look back into the conceptual and theoretical ideas that were in development for a long time, even outside the field of core machine learning. We will show in later sections that some of the theoretical 'pillars' holding up Diffusion Models have their roots deep in statistical physics and other fields. A significant part of this theory was presented afresh in the ICLR paper by Song et al. (which won a best paper award). Lastly, even though the ideas presented in this article are quite theoretical, we made our best attempt to convey them with intuitive explanations, diagrams and figures, thereby expanding the potential audience. To encourage further exploration, we provide all code used in producing the figures (and experiments) of this article in [this repository](https://github.com/dasayan05/iclr24_blog_code).
-
-This article notes that, historically, there were two distinct roads of development that merged in order for modern diffusion models to emerge -- "scalable estimation of score" and "using the score for generative modelling". The former is relatively short, while the latter traces its origin back to ~1900, if not earlier. This article explores these two paths independently -- the latter one first, while assuming knowledge of the former. The rest of this introductory section is spent on defining the general modelling problem and the very notion of 'score' -- the primary quantity of interest.
The next section deals with how we can use the score for generative modelling, assuming access to an oracle for the true score. The last section dives solely into the problem of estimating the score in a scalable manner. It is worth mentioning that, in this article, we explain only the "sufficient and necessary" concepts needed to build the diffusion model framework, and hence it may not directly resemble the typical formalism seen in most papers.
-
-
-### Generative Modeling
-
-The problem of generative modeling, in most cases, is posed as *parametric density estimation* using a finite set of samples $$\{ x^{(n)} \}_{n=1}^N$$ from a "true but unknown" data distribution $$q_{data}(x)$$. With a suitable model family chosen as $$p_{\theta}(x)$$, with unknown parameters $$\theta$$, the problem boils down to maximizing the average (log-)likelihood (w.r.t. $$\theta$$) of all the samples under the model
-
-$$
-\theta^* = \arg\max_{\theta} \mathbb{E}_{x \sim q_{data}(x)} \left[ \log p_{\theta}(x) \right] \approx \arg\max_{\theta} \frac{1}{N} \sum_{n=1}^N \log p_{\theta}(x^{(n)})
-$$
-
-It turned out, however, that defining an arbitrary parametric density $$p_{\theta}(x)$$ is not as easy as it looks. There was one aspect of $$p_{\theta}$$ that is widely considered to be the evil behind this difficulty -- _the normalizing constant_ that stems from the axioms of probability
-
-$$
-p_{\theta}(x) = \frac{\tilde{p}_{\theta}(x)}{\color{purple} \int_x \tilde{p}_{\theta}(x)}
-$$
-
-### Existing Frameworks
-
-It was understood quite early on that any promising generative model family must have one property -- _ease of sampling_, i.e. generating new data samples. Sampling was so essential to generative modeling that the model families that followed were all geared towards effective sampling, even if it was at the expense of other not-so-important properties. It was also well understood that there was one common underlying principle most effective for crafting "sampling-centric" generative models -- _transforming simple probability densities_. This formed the backbone of every single generative model family so far; be it VAEs, GANs or NFs, their generative process is a density transformation of this form
-
-$$
-x = f_{\theta}(z),\text{ where } z \sim \mathcal{N}(0, I)
-$$
-
-which suggests starting with a simple density (often just a standard normal) followed by a functional transformation $$f_{\theta}$$, typically a neural network with parameters $$\theta$$. For VAEs, the function $$f_{\theta}$$ is the decoder; for GANs, it's the generator network; and for NFs, it's the entire flow model. It is to be noted, however, that the way they differ is mostly in _how they are trained_, which may involve more parametric functions (e.g. VAE's encoder or GAN's discriminator) and additional machinery. This way of building generative models turned out to be an effective way of sidestepping the notorious normalizing constant.
-
-### Diffusion is no different
-
-Diffusion Models, at their core, follow the exact same principle, but with a slightly clever design. For diffusion models, the transformation $$f_{\theta}$$ is rather complicated.
It is a sequence of invocations of a neural function (denoted as $$s_{\theta}$$) along with some additional computation (denoted as $$g(\cdot)$$)
-
-\begin{equation} \label{eq:diffusion_general_parametric_structure}
-x = g_1(g_2(g_3(\cdots z \cdots, s_{\theta}), s_{\theta}), s_{\theta}), \text{ where } z \sim \mathcal{N}(0, I)
-\end{equation}
-
-This is a big difference between Diffusion Models and other generative model families. Prior generative families tried to learn the exact transformation directly via one parametric neural function $$f_{\theta}$$. Diffusion Models, on the other hand, try to learn $$s_{\theta}$$, a quantity very _fundamental and intrinsic_ to any true data distribution $$q_{data}(x)$$. The quantity in question has historically been called the "_Score_".
-
-### The 'Score'
-
-The term 'Score' is simply defined as the _gradient of the log-density of a distribution_, i.e. $$\nabla \log p(\cdot)$$. In statistics, it is also known (though not very popularly) as the 'Informant'. One might argue that 'Score' is rather a strange name for such a quantity. It so happens that the origin of this term can be traced<d-footnote>Thanks to this StackOverflow answer by @ben.</d-footnote> to a 1935 paper by Ronald Fisher, where he used the term in a very generic sense in order to "rank" some quantities. In the context of diffusion models, however, we stick to the modern definition of score. The _true score_ of our data distribution is therefore defined as the gradient of the log of the _true density_ of the data, w.r.t. the data variable
-
-\begin{equation} \label{eq:data_score_defn}
-\nabla_x \log q_{data}(x) \triangleq s(x)
-\end{equation}
-
-The quantity in Eq.\eqref{eq:data_score_defn} is unknown, just like the true data density $$q_{data}(x)$$. It does have a meaning though: the "_true score_" refers to the _direction of steepest increase_ in log-likelihood at any given point in the data space. See the gray arrows in the figure below.
-
-{% include figure.html path="assets/img/2024-05-07-diffusion-theory-from-scratch/score_def.png" class="col-8" %} -
-
-Simply, at a point $$x$$, it tells us the best direction to step into (with a little step-size $$\delta$$) if we would like to see a point $$x'$$ with slightly higher likelihood
-
-\begin{equation} \label{eq:naive_score_steps}
-x' = x + \delta \cdot \left. \nabla_x \log q_{data}(x) \right|_{x = x}
-\end{equation}
-
-Please note that this stems just from the definition of the gradient operator $$\nabla$$ in the score. If you are familiar with gradient descent, you may find a conceptual resemblance.
-
-Now, there are two burning questions here:
-
-1. Considering we have access to the true score, is Eq.\eqref{eq:naive_score_steps} enough to define a generative process with an appropriate convergence guarantee?
-2. How do we actually get the true score?
-
-The following two sections answer these questions respectively. Luckily, as we now understand, these two questions are somewhat decoupled and can be studied independently. The first section analyzes the first question, _assuming_ we have access to the true score $$\nabla_x \log q_{data}(x)$$. The second section explores how to get the true score, or rather, an approximation of it.
-
-
-## Generative Modeling with Scores
-
-As explained before, we would like to sample from the true data distribution $$q_{data}(x)$$, but all we have access to (we assume) is its score $$s(x)$$ as defined in Eq.\eqref{eq:data_score_defn}. One may define a naive generative process as the iterative application of Eq.\eqref{eq:naive_score_steps}. Intuitively, it is very similar to gradient descent, where we greedily climb the log-density surface to attain a local maximum. If so, we can already see a possible instance of the general structure of Diffusion's generative process as hinted in Eq.\eqref{eq:diffusion_general_parametric_structure}, with $$g(\cdot)$$ being
-
-$$
-g(z, s(\cdot)) = z + \delta \cdot s(z) = z + \delta \cdot \left. \nabla_x \log q_{data}(x) \right|_{x = z}
-$$
-
-With a little reshuffling of Eq.\eqref{eq:naive_score_steps} and considering $$\delta \rightarrow 0$$, one can immediately reveal the underlying ODE<d-footnote>Ordinary Differential Equations, or ODEs, describe how a process evolves over time by its infinitesimal change.</d-footnote> that describes the infinitesimal change
-
-\begin{equation} \label{eq:ode_with_score}
-dx = \nabla_x \log q_{data}(x) dt
-\end{equation}
-
-BUT, please note that this is only an intuitive attempt and is entirely based on the definition of the score. It possesses **absolutely no guarantee** that this process can converge to samples from the true data distribution. In fact, this process is **greedy**, i.e. it only seeks to go uphill, converging exactly at the _modes_<d-footnote>Local maxima of the probability density.</d-footnote>. You can see in the figure below the samples $$x$$ subjected to the process in Eq.\eqref{eq:ode_with_score} and their density $$p_t(x)$$ evolving over time. The density in red is the target density whose score (we assume we know it) is being used.
-
-{% include figure.html path="assets/img/2024-05-07-diffusion-theory-from-scratch/greedy_wo_noise.gif" class="img-fluid" %} -
-
-In this case, at $$t=\infty$$, all samples will converge to the state with _the highest_ likelihood (i.e. exactly at the center). This isn't really desirable as it doesn't "explore" at all. Just like any other sampling algorithm, we need noise injection!
-
-### Langevin Equation and Brownian Motion
-
-It turns out that this problem was explored long ago in molecular dynamics by the French physicist Paul Langevin in the context of analyzing the movements of particles suspended in a fluid. He described the overall dynamics of particles, i.e. how the position of a particle changes over time $$t$$ when in a _potential energy_ field $$U(x)$$
-
-\begin{equation} \label{eq:original_langevin_dyn}
-dx = - \nabla_x U(x) dt + \sqrt{2} dB_t
-\end{equation}
-
-The term $$dB_t$$ is called "Brownian Motion" and is effectively the source of noise -- we will talk about this later in this subsection. Energy is considered "bad", i.e. particles do not want to stay in a state with high energy. So they try to go downhill and settle in low-energy states using the gradient of the energy surface. The Langevin equation (i.e. Eq.\eqref{eq:original_langevin_dyn}) happens to provide sufficient "exploration" abilities so that the particles visit states with probability $$\propto e^{-U(x)}$$. This suggests that we can treat "negative energy" as log-likelihood
-
-$$
-q_{data}(x) \propto e^{-U(x)} \implies \log q_{data}(x) = -U(x) + C \implies \nabla_x \log q_{data}(x) = - \nabla_x U(x)
-$$
-
-Using the above substitution in the Langevin equation, we can move out of physics and continue with our ML perspective
-
-\begin{equation} \label{eq:langevin_dyn}
-dx = \nabla_x \log q_{data}(x) dt + \sqrt{2} dB_t
-\end{equation}
-
-Note that this isn't very different from our "intuitive" and greedy process in Eq.\eqref{eq:ode_with_score}, except for the noise term $$dB_t$$ and a strange $$\sqrt{2}$$. But this makes a difference! Brownian motion is an old construct from particle physics that describes the random motion of particles in fluid/gas. It is simply Gaussian noise with infinitesimally small variance<d-footnote>In practice, the smaller the step you take, the smaller the noise you get.</d-footnote>
-
-$$
-dB_t \sim \mathcal{N}(0, dt) \implies dB_t = \sqrt{dt} \cdot z,\text{ where } z \sim \mathcal{N}(0, I)
-$$
-
-With that, we can simulate our new Langevin equation _with noise_ (i.e. Eq.\eqref{eq:langevin_dyn}) just like the noiseless case. You can see now that the noise is keeping the process from entirely converging to the mode. If you notice carefully, we have added a little "tail" to each point to help visualize their movement.
-
-{% include figure.html path="assets/img/2024-05-07-diffusion-theory-from-scratch/langevin_dyn_basic.gif" class="img-fluid" %}
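-
-For concreteness, here is a minimal numpy sketch of such a simulation, using a small step $$\delta$$ in place of $$dt$$. As an illustrative assumption, the target is a standard normal (whose score is simply $$-x$$), standing in for the score oracle; the step size and counts are arbitrary:
-
-{% highlight python %}
-import numpy as np
-
-def score(x):
-    # Score of the assumed N(0, I) target: grad_x log q(x) = -x
-    return -x
-
-rng = np.random.default_rng(0)
-delta, n_steps = 1e-2, 2000
-x = rng.uniform(-4.0, 4.0, size=(1000, 2))  # any starting distribution works
-for _ in range(n_steps):
-    x += delta * score(x) + np.sqrt(2 * delta) * rng.standard_normal(x.shape)
-# x now holds (approximate) samples from the target density
-{% endhighlight %}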
-### Fokker-Planck Equation
-
-The simulation is convincing; but it'd be even better if we could _theoretically verify_ that the process in Eq.\eqref{eq:langevin_dyn} indeed converges to $$q_{data}(x)$$. The key to this proof is figuring out $$p_t(x)$$ and making sure that it stabilizes as $$t\rightarrow \infty$$, i.e. $$p_{\infty}(x) = q_{data}(x)$$. It turns out that a stochastic process of the form $$dx = \mu_t(x) dt + \sigma_t(x) dB_t$$, acting on a random variable $$x$$, induces a time-varying distribution that can be described by this ODE
-
-\begin{equation}
-\frac{\partial}{\partial t}p_t(x) = -\frac{\partial}{\partial x} \Big[ p_t(x)\mu_t(x) \Big] + \frac{1}{2} \frac{\partial^2}{\partial x^2} \Big[ p_t(x) \sigma^2_t(x) \Big]
-\end{equation}
-
-This is a well-celebrated result known as the "Fokker-Planck equation", which even predates the Langevin equation. So, the solution of this ODE is exactly what we are seeing in the above figure (middle). One can easily verify the convergence of Eq.\eqref{eq:langevin_dyn} by first observing $$\mu_t(x) = \nabla_x \log q_{data}(x), \sigma_t(x) = \sqrt{2}$$ and then using $$\frac{\partial}{\partial t} p_{\infty}(x) = \frac{\partial}{\partial t} q_{data}(x) = 0$$.
-
-$$\begin{eqnarray*}
-\frac{\partial}{\partial t}p_{\infty}(x) &=& -\frac{\partial}{\partial x} \Big[ p_{\infty}(x) \nabla_x \log q_{data}(x) \Big] + \frac{(\sqrt{2})^2}{2} \frac{\partial^2}{\partial x^2} \Big[ p_{\infty}(x) \Big] \\
-\frac{\partial}{\partial t} q_{data}(x) &=& -\frac{\partial}{\partial x} \Big[ q_{data}(x) \nabla_x \log q_{data}(x) \Big] + \frac{(\sqrt{2})^2}{2} \frac{\partial^2}{\partial x^2} \Big[ q_{data}(x) \Big] \\
-0 \text{ (LHS)} &=& -\frac{\partial}{\partial x} \Big[ \nabla_x q_{data}(x) \Big] + \frac{\partial}{\partial x} \Big[ \nabla_x q_{data}(x) \Big] = 0\text{ (RHS)}
-\end{eqnarray*}$$
-
-The LHS holds due to the fact that after a long time (i.e. $$t = \infty$$) the distribution stabilizes<d-footnote>It's called a "stationary or equilibrium distribution".</d-footnote>. Please also note that the proof above is for the 1-dimensional case and included for illustrative purposes only -- the general case is slightly more complicated.
-
-So, we're all good. Eq.\eqref{eq:langevin_dyn} is a provable way of sampling given we have access to the true score. In fact, the very work (by Song et al.) that immediately precedes DDPM used exactly Eq.\eqref{eq:langevin_dyn} in its discrete form
-
-\begin{equation}
-x_{t+\delta} = x_t + \delta \cdot \left. \nabla_x \log q_{data}(x) \right|_{x = x_t} + \sqrt{2\delta} \cdot z
-\end{equation}
-
-where $$\delta$$ (a small constant) is used as a practical proxy for the theoretical $$dt$$.
-
-If you are already familiar with Diffusion Models, specifically their reverse process, you might be scratching your head. That is because the generative process in Eq.\eqref{eq:langevin_dyn} isn't quite the same as what modern diffusion models do. We need to cross a few more hurdles before we get there.
-
-### A probability path
-
-More than just a proof, the Fokker-Planck ODE provides us with a key insight -- i.e. gradually transforming one distribution into another is equivalent to traveling (over time) on a "path" in the _space of probability distributions_. Imagine a space of all possible probability distributions $$p$$<d-footnote>While each distribution varies in space (i.e. $$x$$) too, let's hide that for now and imagine them to be just vectors.</d-footnote>. The Fokker-Planck ODE for Eq.\eqref{eq:langevin_dyn}, therefore, represents a specific dynamics on this probability space whose solution trajectory $$p_t$$ ends at $$q_{data}$$ at $$t = \infty$$.
-
-Speaking of ODEs, there is something we haven't talked about yet -- the initial distribution at $$t=0$$, i.e. $$p_0$$. In the simulation above, I quietly used a standard normal $$\mathcal{N}(0, I)$$ as the starting distribution<d-footnote>You can notice this if you carefully look at the first few frames of the animation.</d-footnote>
without ever discussing it. It turns out that the Fokker-Planck equation does not impose any specific requirement on $$p_0$$: the process always converges to $$p_{\infty} = q_{data}$$ no matter where you start. Here's an illustration that shows two different starting distributions $$p_0$$, both of whose "paths" $$p_t$$ through probability space ultimately converge to $$q_{data}$$.

{% include figure.html path="assets/img/2024-05-07-diffusion-theory-from-scratch/fokker-plank-multiple.gif" class="img-fluid" %}

So theoretically, given the score function $$\nabla_x \log q_{data}(x)$$ of a target distribution $$q_{data}(x)$$, one can "travel to" it from _any_ distribution. However, keeping in mind our need for _sampling_, it's best to choose an initial distribution that is sampling-friendly. Strictly speaking, there are a couple of reasonable choices, but the diffusion model community settled on the _Isotropic Gaussian_ (i.e. $$\mathcal{N}(0, I)$$). This is not only due to its popularity across machine learning and statistics, but also due to the fact that in the context of SDEs with Brownian motion (remember, Brownian increments are infinitesimal Gaussian noises), Gaussians arise quite naturally.

### Estimating the "score" is hard

So far, what we've talked about is just the _generative process_, or as the diffusion model literature calls it, the "reverse process". We haven't really talked about the "forward process" yet, in case you are familiar with that term. The forward process, in simple terms, is an _ahead-of-time description_ of the "probability path" that the reverse process intends to take. But the question is, why do we need to know the path ahead of time -- the reverse process seems quite spontaneous (in the sense that, given a score function, it travels to the correct target distribution on its own), no? Sadly, this can't be answered with theory alone.

The problem lies in Eq.\eqref{eq:langevin_dyn} -- let's write it again with a little more verbosity

\begin{equation}
dx_t = \nabla_x \left. \log q_{data}(x) \right|_{x = x_t}\ dt + \sqrt{2} dB_t
\end{equation}

Even though we wished to estimate $$\nabla_x \log q_{data}(x)\vert_{x = x_t}$$ with a neural network $$s_{\theta}(x = x_t)$$, this turned out to be **extremely hard** in practice. It was understood that one neural network is not enough to capture the richness of the score function at all values of $$x$$. There were two options before us -- first, make the neural network expressive enough; or second, learn the network **only where it's needed**. The community settled on the second one, because it was easier to solve.

So what some of the pioneering works did is first fix a path (in probability space, like we showed above) and then learn the score only _on that path_. It's all about specializing the neural network $$s_{\theta}(x_t, t)$$ over $$t \in [0, \infty]$$. The neural score estimator is capable of producing the right score if we provide the time $$t$$, which of course we can. We will see in [the next section](#estimating-the-score) that, to learn the score of any distribution, we need samples from it. This begs the question: how do we get samples $$x_t$$ (for all $$t$$) for training purposes? It certainly can't be via Eq.\eqref{eq:langevin_dyn}, since that requires the score. The answer is that we need to run this process the other way -- this is what Diffusion Models call the "Forward Process".
### The "forward process"

Going _the other way_ requires us to run a simulation from $$q_{data}(x)$$ at $$t=0$$ to $$t=\infty$$, just the opposite of the animation above. Recall that we already saw how to do this: to travel to any distribution at $$t=\infty$$, all you need is its score and the Langevin equation. So how about we start from $$q_0 = q_{data}(x)$$ this time (remember that the starting point doesn't matter!) and run the Langevin simulation again with a _known_ end target $$q_{\infty} = \mathcal{N}(0, I)$$?

$$\begin{eqnarray}
dx &=& \nabla_x \log \mathcal{N}(0, I) dt + \sqrt{2} dB_t \\
\label{eq:forward_sde}
&=& -x dt + \sqrt{2 dt} z
\end{eqnarray}$$

It is interesting to note that, because the target distribution is known in closed form, we do not see any awkward scores dangling around. The score of $$\mathcal{N}(0, I)$$ is simply $$-x$$ (we encourage the reader to verify this as an exercise). The discretized version of Eq.\eqref{eq:forward_sde}, i.e.

$$\begin{eqnarray*}
x_{t+dt} &=& x_t - x_t \cdot dt + \sqrt{2 dt}\ z \\
&=& (1 - dt) x_t + \sqrt{2 dt}\ z
\end{eqnarray*}$$

.. may resemble DDPM's forward process (hint: compare $$dt$$ with DDPM's $$\beta_t$$).

> NOTE: There is a little subtlety here: we only fixed the _end point_ of the forward process, not the _exact path_. It seems that running the Langevin equation in the forward direction chose one path on its own. It turns out that this is the "isotropic path", where all dimensions of the variable $$x$$ evolve in time in exactly the same way. Some recent works uncovered _non-isotropic_ diffusion, where it is indeed possible to travel on other paths. But this is outside the scope of this article.

We can simulate the above equation just like we did in the reverse process, in order to get samples $$x_t \sim q_t$$. Below we show a simulation of the forward process
-{% include figure.html path="assets/img/2024-05-07-diffusion-theory-from-scratch/forward_process_2.gif" class="col-10" %} -
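In code, the discretized forward process is equally simple. Below is a minimal NumPy sketch, with a hypothetical two-cluster toy dataset standing in for $$q_{data}$$; regardless of the data distribution, the sample cloud should end up close to $$\mathcal{N}(0, I)$$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "data": samples from a two-cluster toy distribution in 2D.
x = np.concatenate([rng.normal(-2.0, 0.5, size=(500, 2)),
                    rng.normal(+2.0, 0.5, size=(500, 2))])

delta = 0.01                       # discrete proxy for dt
for _ in range(1000):              # roughly t = 10 natural time units
    z = rng.normal(size=x.shape)
    # Discretized forward update: x_{t+dt} = (1 - dt) x_t + sqrt(2 dt) z
    x = (1 - delta) * x + np.sqrt(2 * delta) * z

print(x.mean(axis=0))              # ~ [0, 0]
print(np.cov(x, rowvar=False))     # ~ identity
```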
While it is true that the reverse process is inherently sequential due to the arbitrary nature of the score, the forward process (in Eq.\eqref{eq:forward_sde}) is entirely known and hence can be exploited to ease this sequentiality. We can see a way out if we try to simplify (using the standard assumption that $$dt^2 = 0$$) the expression for $$x_{t+2dt}$$ using $$x_{t+dt}$$

$$\begin{eqnarray*}
x_{t+2dt} &=& (1 - dt) {\color{blue} x_{t+dt}} + \sqrt{2dt}\ z_2 \\
&=& (1 - dt) {\color{blue} \left[(1 - dt) x_t + \sqrt{2 dt}\ z_1\right]} + \sqrt{2dt}\ z_2 \\
&=& (1 - 2dt) x_t + \sqrt{2dt(1-dt)^2 + 2dt}\ z_{12} \\
&=& (1 - 2 \cdot dt) x_t + \sqrt{2 \cdot 2dt}\ z_{12} \\
\implies x_{t+2dt} &\sim& \mathcal{N}((1 - 2 \cdot dt) x_t, 2 \cdot 2dt I)
\end{eqnarray*}$$

The above simplification suggests that we can jump to any time $$t$$, without going through the entire sequence, in order to sample $$x_t \sim q_t$$. In fact, $$q_t(x_t\vert x_0)$$ is Gaussian! This result opens up an interesting interpretation -- generating $$x_0 \sim q(x_0 \vert x_t)$$ can be interpreted as solving a "Gaussian inverse problem", which we explore [in a later section](#denoising-as-inverse-problem).

All good for now, but there is one more thing we need to deal with.

### Finite time & the "schedule"

The forward and reverse processes we discussed so far require infinite time to reach their end states. This is a direct consequence of using the Langevin equation and is, of course, unacceptable in practice. But there happens to be quite an elegant fix, well known in mathematics -- we simply _re-define what time means_. We may choose a re-parameterization of time as, for example, $$t' = \mathcal{T}(t) = 1 - e^{-t} \in [0, 1]$$ (note that $$t = 0 \implies t' = 0$$ and $$t = \infty \implies t' = 1$$; hence we converted the range $$[0, \infty]$$ to $$[0, 1]$$). Plugging $$dt = \mathcal{T}'(t)^{-1} dt' = e^t dt'$$ (one can easily see that $$t' = 1 - e^{-t} \implies dt' = e^{-t} dt \implies dt = e^t dt'$$) into the forward equation brings us even closer to DDPM's forward process

$$
x_{t' + dt'} = (1 - {\color{blue}e^t dt'}) x_t + \sqrt{2 {\color{blue}e^t dt'}}\ z
$$

This suggests that in the world where time runs from $$t' = 0 \rightarrow 1$$, we need to _speed up_ the forward process by replacing $$dt$$ with $$e^t dt'$$. The quantity $$\mathcal{T}'(t)^{-1} dt' = e^t dt'$$ is analogous to what diffusion models call a "schedule". Recall that DDPM uses a small but increasing "schedule" $$\beta_t$$ ($$e^t dt'$$ is small because of $$dt'$$, and increasing because of $$e^t$$).
-{% include figure.html path="assets/img/2024-05-07-diffusion-theory-from-scratch/ddpm_forward_kernel.png" class="col-6 z-depth-1"%} -
- -Of course, our choice of the exact value of end time (i.e. $$t' = 1$$) and the re-parameterization $$\mathcal{T}$$ are somewhat arbitrary. Different choices of $$\mathcal{T}$$, and consequently $$\mathcal{T}'(t)^{-1} dt'$$ lead to different schedules (e.g. linear, cosine etc.). - -> NOTE: Choosing a different schedule does not mean the process takes a different path on the probability space, it simply changes its _speed_ of movement over time towards the end state. - -#### Summary - -To summarize, in this section, we started with the definition of 'score' and arrived at a stochastic process (thanks to an old result by Langevin) that, at infinite time, converges to the density associated with the score. We saw that this process is provably correct and can be interpreted as a "path" on the probability space. We argued that due to the difficulty of score estimation everywhere along the path, we need samples at the intermediate time $$t$$ in order to specialize the score estimates. To do that, we had to travel backwards on the path, which can be done in closed form. We also saw how this process, even though theoretically takes infinite time, can be shrunk down to a finite interval, opening up a design choice known as "schedules". - -## Estimating the Score - -The last chapter, while explaining the "sampling" part of score-based diffusion models, assumed that we have access to the true score $$\nabla_x \log q_{data}(x)$$ via some oracle. That is, of course, untrue in practice. In fact, accessing the true score for any arbitrary distribution is just not possibleWe can only have access to the true score for distributions with closed-form, e.g. Gaussian.. So the way forward, as mentioned before, is to estimate/learn it with a parametric neural network $$s_{\theta}(x)$$. Recall however, that all we have access to is samples from $$q_{data}(x)$$. - -If curious enough, one may question how realistic it is to estimate the score $$\nabla_x \log q_{data}(x)$$, while we can NOT usually estimate the density $$q_{data}(x)$$ itself ? After all, it is a quantity derived from the density ! The answer becomes clear once you make the _normalization constant_ explicit - -$$\begin{eqnarray*} -\nabla_x \log q_{data}(x) &=& \nabla_x \log \frac{\tilde{q}_{data}(x)}{\int_{x} \tilde{q}_{data}(x) dx} \\ -&=& \nabla_x \log \tilde{q}_{data}(x) - {\color{red}\nabla_x \log \int_{x} \tilde{q}_{data}(x) dx} \\ -&=& \nabla_x \log \tilde{q}_{data}(x) -\end{eqnarray*}$$ - -The part in red is zero due to not having dependence on $$x$$. So, the score, very cleverly **sidesteps the normalization constant**. This is the reason score estimation gained momentum in the research community. - -### Implicit Score Matching - -The first notable attempt of this problem was by Aapo Hyvärinen back in 2005. His idea was simply to start from a loss function that, when minimized, leads to an estimator of the true score - -\begin{equation} -J(\theta) = \frac{1}{2} \mathbb{E}_{x\sim q\_{data}(x)}\Big[ \vert\vert s\_{\theta}(x) - \nabla_x \log q\_{data}(x) \vert\vert^2 \Big] -\end{equation} - -It is simply an $$L_2$$ loss between a parametric model and the true score, weighted by the probability of individual states (hence the expectation). But of course, it is not computable in this form as it contains the true score. 
Hyvärinen's contribution was to show that, theoretically, the minimization problem is equivalent when the loss function is

\begin{equation} \label{eq:impl_score_match}
J_{\mathrm{I}}(\theta) = \mathbb{E}_{x\sim q\_{data}(x)}\Big[ \mathrm{Tr}(\nabla\_x s\_{\theta}(x)) + \frac{1}{2} \vert\vert s\_{\theta}(x) \vert\vert^2 \Big]
\end{equation}

In the literature, this is known as "_Implicit Score Matching_". The derivation is relatively simple and only involves algebraic manipulations -- please see Appendix A of Hyvärinen's paper. The remarkable nature of this result stems from the fact that $$J_{\mathrm{I}}$$ no longer contains the true score. The only dependency on $$q_{data}$$ is via the expectation, which can be approximated by a sample average over our dataset.

But the key challenge with Implicit Score Matching was the $$\mathrm{Tr}(\nabla_x s_{\theta}(x))$$ term, i.e. the trace of the Jacobian of the score network (the Hessian of the underlying log-density model), which is costly to compute. This prompted several follow-up works in the race towards scalable score matching, one of which (namely denoising score matching) is used in Diffusion Models to this day.

For the sake of completeness, I would like to mention the work of Yang Song et al. around 2019, which proposed an engineering trick to alleviate the Hessian computation. They simply used the "Hutchinson trace estimator" (a stochastic way of computing the trace: $$\mathrm{Tr}(M) = \mathbb{E}_{v\sim p_v} \Big[ v^T M v \Big]$$, where $$p_v$$ can be one of many distributions, most notably $$\mathcal{N}(0, I)$$) to replace the $$\mathrm{Tr}(\cdot)$$ in Eq.\eqref{eq:impl_score_match}, which eased the computation a bit. This approach, however, did not end up being used in practice.

### Denoising Score Matching

The most valuable contribution came from Pascal Vincent in 2011, when he showed that the score matching problem has yet another equivalent objective, called "Denoising" score matching

\begin{equation} \label{eq:deno_score_match}
J_{\mathrm{D}}(\theta) = \mathbb{E}_{x\sim q\_{data}(x), \epsilon\sim\mathcal{N}(0, I)}\left[ \frac{1}{2} \left|\left| s\_{\theta}(\ \underbrace{x + \sigma\epsilon}\_{\tilde{x}}\ ) - (- \frac{\epsilon}{\sigma}) \right|\right|^2 \right]
\end{equation}

We deliberately wrote it in a way that exposes its widely accepted interpretation. Denoising score matching simply adds some _known_ noise $$\sigma\epsilon$$ to the datapoints $$x$$ and learns (in the mean-squared sense), from the "noisy" point $$\tilde{x}$$, the direction back to the clean point, i.e. $$(-\epsilon)$$, scaled by $$\frac{1}{\sigma}$$. In a way, it acts like a "de-noiser", hence the name. It is theoretically guaranteed that $$J_{\mathrm{D}}$$ leads to an unbiased estimate of the true score. Below we show a visualization of the score estimate as it learns from data.
-{% include figure.html path="assets/img/2024-05-07-diffusion-theory-from-scratch/deno_score_learning.gif" class="col-10" %} -
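For concreteness, here is a minimal PyTorch sketch of the objective in Eq.\eqref{eq:deno_score_match} at a single fixed noise level $$\sigma$$. The tiny network and the toy two-cluster data are stand-ins of our own choosing, not the exact setup behind the animation above:

```python
import torch
import torch.nn as nn

sigma = 0.5

# Hypothetical stand-ins: a tiny score network and a toy 2D two-cluster dataset.
s_theta = nn.Sequential(nn.Linear(2, 64), nn.SiLU(),
                        nn.Linear(64, 64), nn.SiLU(),
                        nn.Linear(64, 2))
data = torch.cat([torch.randn(500, 2) - 2.0, torch.randn(500, 2) + 2.0])

opt = torch.optim.Adam(s_theta.parameters(), lr=1e-3)
for step in range(2000):
    x = data[torch.randint(len(data), (128,))]    # x ~ q_data
    eps = torch.randn_like(x)                     # eps ~ N(0, I)
    x_tilde = x + sigma * eps                     # noisy point
    # J_D: push s_theta(x_tilde) towards the "comeback" direction -eps/sigma
    loss = 0.5 * ((s_theta(x_tilde) + eps / sigma) ** 2).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# s_theta now approximates the score of the (sigma-smoothed) data density.
```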
A little algebraic manipulation of Eq.\eqref{eq:deno_score_match}, demonstrated by Ho et al., leads to an equivalent form that turned out to be training-friendly.

$$\begin{eqnarray}
J_{\mathrm{D}}(\theta) &=& \mathbb{E}_{x\sim q_{data}(x), \epsilon\sim\mathcal{N}(0, I)}\left[ \frac{1}{2\sigma^2} \left|\left| {\color{blue} - \sigma s_{\theta}}(\tilde{x}) - \epsilon \right|\right|^2 \right] \\
&=& \mathbb{E}_{x\sim q_{data}(x), \epsilon\sim\mathcal{N}(0, I)}\left[ \frac{1}{2\sigma^2} \left|\left| {\color{blue} \epsilon}_{\theta}(\tilde{x}) - \epsilon \right|\right|^2 \right]\label{eq:deno_eps_match}
\end{eqnarray}$$

We simply change the _interpretation_ of what the network learns. In this form, the "noise estimator" network learns _just_ the original pure Gaussian noise vector $$\epsilon$$ that was added while crafting the noisy sample. So, from a noisy sample, the network $$\epsilon_{\theta}$$ learns a roughly unit-variance noise direction whose negation points towards the clean sample.

There is yet another re-interpretation of Eq.\eqref{eq:deno_score_match} that leads to a slightly different perspective

$$\begin{eqnarray}
J_{\mathrm{D}}(\theta) &=& \mathbb{E}_{x\sim q_{data}(x), \epsilon\sim\mathcal{N}(0, I)}\left[ \frac{1}{2\sigma^4} \left|\left| {\color{blue}\tilde{x} + \sigma^2 s_{\theta}}(\tilde{x}) - (\underbrace{\tilde{x} - \sigma\epsilon}_{x}) \right|\right|^2 \right] \\
&=& \mathbb{E}_{x\sim q_{data}(x), \epsilon\sim\mathcal{N}(0, I)}\left[ \frac{1}{2\sigma^4} \left|\left| {\color{blue} x_{\theta}}(\tilde{x}) - x \right|\right|^2 \right]\label{eq:deno_endpoint_match}
\end{eqnarray}$$

Eq.\eqref{eq:deno_endpoint_match} shows that, instead of the noise direction, we can also have the clean sample directly as a learning target. This is "denoising" in its true sense. We will get back to this in [the next subsection](#probing-the-learning-objective).

### Probing the learning objective

If you are still puzzled about how Eq.\eqref{eq:deno_eps_match} relates to learning the score, there is a way to probe exactly what the network learns at an arbitrary input point $$\tilde{x}$$. We note that the clean sample $$x$$ and the noisy sample $$\tilde{x}$$ come from a joint distribution that factorizes as

$$
q(x, \tilde{x}) = q(\tilde{x} \vert x) q_{data}(x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I) q_{data}(x).
$$

We can also factorize this joint in a slightly different way, i.e.

$$
q(x, \tilde{x}) = q(x \vert \tilde{x}) q(\tilde{x})
$$

where $$q(x \vert \tilde{x})$$ can be thought of as the distribution of all clean samples that could have led to the given $$\tilde{x}$$. Eq.\eqref{eq:deno_eps_match} can therefore be written as

$$\begin{eqnarray*}
J_{\mathrm{D}}(\theta) &=& \mathbb{E}_{(x, \tilde{x}) \sim q(x,\tilde{x})}\left[ \frac{1}{2\sigma^2} \left|\left| \epsilon_{\theta}(\tilde{x}) - \epsilon \right|\right|^2 \right] \\
&=& \mathbb{E}_{\tilde{x} \sim q(\tilde{x}), x \sim q(x\vert \tilde{x})}\left[ \frac{1}{2\sigma^2} \left|\left| \epsilon_{\theta}(\tilde{x}) - \frac{\tilde{x} - x}{\sigma} \right|\right|^2 \right] \\
&=& \mathbb{E}_{\tilde{x} \sim q(\tilde{x})}\left[ \frac{1}{2\sigma^2} \left|\left| \epsilon_{\theta}(\tilde{x}) - \frac{\tilde{x} - \mathbb{E}_{x \sim q(x\vert \tilde{x})}[x]}{\sigma} \right|\right|^2 \right] \\
\end{eqnarray*}$$

In the last step, the expectation $$\mathbb{E}_{q(x\vert\tilde{x})}\left[ \cdot \right]$$ was pushed inside, up until the only quantity that involves $$x$$ (this changes the loss only by an additive constant that does not depend on $$\theta$$, so the minimizer is unchanged).
Looking at it, you may realize that the network $$\epsilon_{\theta}$$, given an input $$\tilde{x}$$, learns the _average noise direction_ that leads to that input point. It also exposes the quantity $$\mathbb{E}_{x \sim q(x\vert \tilde{x})}[x]$$, which is the _average clean sample_ that could have led to the given $$\tilde{x}$$.

Below we visualize this process with a toy example, followed by a short explanation.
-{% include figure.html path="assets/img/2024-05-07-diffusion-theory-from-scratch/probing_deno_estimation.gif" class="col-10" %} -
Explanation: We have 10 data points $$x\sim q_{data}(x)$$ in two clusters (big red dots), and we run the learning process by generating noisy samples $$\tilde{x}\sim q(\tilde{x})$$ (small red dots). Instead of learning a neural mapping over the entire space, we learn a tabular map with only three chosen input points $$\tilde{x}_1, \tilde{x}_2, \tilde{x}_3$$ (blue, magenta and green crosses). Every time we sample one of those three chosen input points (practically, it's impossible to randomly sample a specific point, so we assume a little ball around each one), we note which clean data point it came from (shown by connecting a dotted line of the same color) and maintain a running average of them (bold cross of the same color), which is nothing but $$\mathbb{E}_{x \sim q(x\vert \tilde{x})}[x]$$. We also show the average noise direction at each $$\tilde{x}$$, i.e. $$\frac{\tilde{x} - \mathbb{E}_{x \sim q(x\vert \tilde{x})}[x]}{\sigma}$$, with gray arrows. As training progresses, the gray arrows start to resemble the score estimate of the data.

### Denoising as inverse problem

A similar treatment, when applied to Eq.\eqref{eq:deno_endpoint_match}, yields the following

$$\begin{eqnarray*}
J_{\mathrm{D}}(\theta) &=& \mathbb{E}_{(x, \tilde{x}) \sim q(x,\tilde{x})}\left[ \frac{1}{2\sigma^4} \left|\left| {\color{blue}x_{\theta}}(\tilde{x}) - x \right|\right|^2 \right] \\
&=& \mathbb{E}_{\tilde{x} \sim q(\tilde{x})}\left[ \frac{1}{2\sigma^4} \left|\left| {\color{blue}\tilde{x} + \sigma^2 s_{\theta}}(\tilde{x}) - \mathbb{E}_{x \sim q(x\vert \tilde{x})}[x] \right|\right|^2 \right] \\
\end{eqnarray*}$$

Notice that we brought back the original form of $$x_{\theta}(\cdot)$$ that involves the score. If we had the true score instead of a learned estimate, we would have

$$
\mathbb{E}_{x \sim q(x\vert \tilde{x})}[x] = \tilde{x} + \sigma^2 \nabla_{\tilde{x}} \log q(\tilde{x})
$$

In the inverse-problem and Bayesian literature, this is a celebrated result known as "_Tweedie's formula_", first published by Robbins but credited to the statistician Maurice Tweedie. It is applied in the context of Bayesian posterior estimation of a "true" quantity $$x$$ that we only observe through a (Gaussian) noisy measurement $$\tilde{x}$$. Tweedie's formula tells us that the _posterior mean_ of the inverse problem $$q(x\vert \tilde{x})$$ can be computed without ever knowing the actual density, as long as we have access to the score at the noisy measurement.
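As a quick numerical sanity check of Tweedie's formula, consider a toy setup of our own choosing (a 1D mixture of two narrow Gaussians as $$q_{data}$$), where both a Monte Carlo estimate of the posterior mean and the score of the smoothed density are tractable:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, tau = 0.7, 0.3              # measurement noise and component width
mu = np.array([-2.0, 2.0])         # toy q_data: two narrow Gaussians in 1D
x_tilde = 0.5                      # an arbitrary noisy observation

# Monte Carlo posterior mean E[x | x_tilde]: weight prior samples
# by the Gaussian likelihood N(x_tilde; x, sigma^2) of the measurement.
x = rng.normal(mu[rng.integers(2, size=200_000)], tau)
w = np.exp(-0.5 * ((x_tilde - x) / sigma) ** 2)
posterior_mean_mc = (w * x).sum() / w.sum()

# Tweedie: x_tilde + sigma^2 * score of the smoothed density q(x_tilde).
# Here q(x_tilde) is itself a two-component Gaussian mixture with variance
# tau^2 + sigma^2 per component, so its score is available in closed form.
s2 = tau ** 2 + sigma ** 2
r = np.exp(-0.5 * (x_tilde - mu) ** 2 / s2)
r /= r.sum()                       # component responsibilities at x_tilde
score = (r * (mu - x_tilde)).sum() / s2
posterior_mean_tweedie = x_tilde + sigma ** 2 * score

print(posterior_mean_mc, posterior_mean_tweedie)   # the two should agree
```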
#### Summary

In this section, we explored the problem of scalable score matching. We looked at the notable attempts in the literature and learned that the score can be estimated from samples alone. We also looked at several interpretations of the learning objective and the connections they expose.

## Last few bits

#### Incorporating time

In the last section, we expressed and explained everything in terms of one known noise level $$\sigma$$ and the noisy sample $$\tilde{x}$$. We did so to avoid cluttering the explanation with multiple concepts that aren't necessary for each other. In [a previous section](#estimating-the-score-is-hard), however, we learned that the score must be estimated along every timestep of the forward process. Simply augmenting Eq.\eqref{eq:deno_score_match} with an additional time variable $$t \sim \mathcal{U}[0, 1]$$ is sufficient to induce the time dependency in the score matching problem

\begin{equation} \label{eq:deno_score_match_with_time}
J_{\mathrm{D}}(\theta) = \mathbb{E}_{x_0, \epsilon, t \sim \mathcal{U}[0, 1], x_t\sim q_t(x_t\vert x_0) }\left[ \frac{1}{2} \left|\left| s\_{\theta}(x_t, t) - (- \frac{\epsilon}{\sigma_t}) \right|\right|^2 \right]
\end{equation}

.. where $$q_t(x_t \vert x_0)$$ is defined in a [previous section](#the-forward-process) and $$\sigma_t$$ is its standard deviation.

#### We took a different approach

We would like to highlight that, in this article, we first explored the reverse process and then showed why the forward process emerges out of necessity. Typical diffusion model papers start from a forward process specification of the form

$$
dx_t = f(t)x_t dt + g(t) {dB}_t
$$

.. and then use Anderson's SDE reversal to explain the reverse process, which also involves the score

$$
dx_t = \left[ f(t) x_t - g(t)^2 \underbrace{\nabla_{x_t} \log q_t(x_t)}_{s_{\theta}(x_t, t)} \right] dt + g(t) dB_t
$$

We argue that our approach is more "organic" in the sense that it builds up the theory _chronologically_, following the exact path the community went through over time.

#### Conclusion

In this article, we dived deep into the theoretical fundamentals of Diffusion Models, which are often ignored by practitioners. We started from the 'heart' of diffusion models, i.e. scores, and built up the concepts almost chronologically. We hope this article serves as a conceptual guide towards understanding diffusion models from the score-SDE perspective. We intentionally avoided the 'probabilistic Markov model' view of diffusion, since more and more works embrace the SDE formalism.
\ No newline at end of file diff --git a/_posts/2024-05-07-distill-example.md b/_posts/2024-05-07-distill-example.md deleted file mode 100644 index 3b5162c4..00000000 --- a/_posts/2024-05-07-distill-example.md +++ /dev/null @@ -1,453 +0,0 @@ ---- -layout: distill -title: Sample Blog Post -description: Your blog post's abstract. - Please add your abstract or summary here and not in the main body of your text. - Do not include math/latex or hyperlinks. -date: 2024-05-07 -future: true -htmlwidgets: true -hidden: true - -# Anonymize when submitting -# authors: -# - name: Anonymous - -authors: - - name: Albert Einstein - url: "https://en.wikipedia.org/wiki/Albert_Einstein" - affiliations: - name: IAS, Princeton - - name: Boris Podolsky - url: "https://en.wikipedia.org/wiki/Boris_Podolsky" - affiliations: - name: IAS, Princeton - - name: Nathan Rosen - url: "https://en.wikipedia.org/wiki/Nathan_Rosen" - affiliations: - name: IAS, Princeton - -# must be the exact same name as your blogpost -bibliography: 2024-05-07-distill-example.bib - -# Add a table of contents to your post. -# - make sure that TOC names match the actual section names -# for hyperlinks within the post to work correctly. -# - please use this format rather than manually creating a markdown table of contents. -toc: - - name: Equations - - name: Images and Figures - subsections: - - name: Interactive Figures - - name: Citations - - name: Footnotes - - name: Code Blocks - - name: Diagrams - - name: Tweets - - name: Layouts - - name: Other Typography? - -# Below is an example of injecting additional post-specific styles. -# This is used in the 'Layouts' section of this post.
-# If you use this post as a template, delete this _styles block. -_styles: > - .fake-img { - background: #bbb; - border: 1px solid rgba(0, 0, 0, 0.1); - box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1); - margin-bottom: 12px; - } - .fake-img p { - font-family: monospace; - color: white; - text-align: left; - margin: 12px 0; - text-align: center; - font-size: 16px; - } ---- - -Note: please use the table of contents as defined in the front matter rather than the traditional markdown styling. - -## Equations - -This theme supports rendering beautiful math in inline and display modes using the [MathJax 3](https://www.mathjax.org/) engine. -You just need to surround your math expression with `$$`, like `$$ E = mc^2 $$`. -If you leave it inside a paragraph, it will produce an inline expression, just like $$ E = mc^2 $$. - -To use display mode, again surround your expression with `$$` and place it as a separate paragraph. -Here is an example: - -$$ -\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right) -$$ - -Note that MathJax 3 is [a major re-write of MathJax](https://docs.mathjax.org/en/latest/upgrading/whats-new-3.0.html) -that brought a significant improvement to the loading and rendering speed, which is now -[on par with KaTeX](http://www.intmath.com/cg5/katex-mathjax-comparison.php). - -## Images and Figures - -It's generally a better idea to avoid linking to images hosted elsewhere - links can break and you -might lose important information from your blog post. -To include images in your submission in this way, you must do something like the following: - -```markdown -{% raw %}{% include figure.html path="assets/img/2024-05-07-distill-example/iclr.png" class="img-fluid" %}{% endraw %} -``` - -which results in the following image: - -{% include figure.html path="assets/img/2024-05-07-distill-example/iclr.png" class="img-fluid" %} - -To ensure that there are no namespace conflicts, you must save your asset to your unique directory -`/assets/img/2024-05-07-[SUBMISSION NAME]` within your submission. - -Please avoid using the direct markdown method of embedding images; they may not be properly resized. -Some more complex ways to load images (note the different styles of the shapes/shadows): -
-
- {% include figure.html path="assets/img/2024-05-07-distill-example/9.jpg" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-distill-example/7.jpg" class="img-fluid rounded z-depth-1" %} -
-
-
- A simple, elegant caption looks good between image rows, after each row, or doesn't have to be there at all. -
- -
-
- {% include figure.html path="assets/img/2024-05-07-distill-example/8.jpg" class="img-fluid z-depth-2" %} -
-
- {% include figure.html path="assets/img/2024-05-07-distill-example/10.jpg" class="img-fluid z-depth-2" %} -
-
- -
-
- {% include figure.html path="assets/img/2024-05-07-distill-example/11.jpg" class="img-fluid" %} -
-
- {% include figure.html path="assets/img/2024-05-07-distill-example/12.jpg" class="img-fluid" %} -
-
- {% include figure.html path="assets/img/2024-05-07-distill-example/7.jpg" class="img-fluid" %} -
-
- -### Interactive Figures - -Here's how you could embed interactive figures that have been exported as HTML files. -Note that we will be using plotly for this demo, but anything built off of HTML should work -(**no extra javascript is allowed!**). -All that's required is for you to export your figure into HTML format, and make sure that the file -exists in the `assets/html/[SUBMISSION NAME]/` directory in this repository's root directory. -To embed it into any page, simply insert the following code anywhere into your page. - -```markdown -{% raw %}{% include [FIGURE_NAME].html %}{% endraw %} -``` - -For example, the following code can be used to generate the figure underneath it. - -```python -import pandas as pd -import plotly.express as px - -df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/earthquakes-23k.csv') - -fig = px.density_mapbox( - df, lat='Latitude', lon='Longitude', z='Magnitude', radius=10, - center=dict(lat=0, lon=180), zoom=0, mapbox_style="stamen-terrain") -fig.show() - -fig.write_html('./assets/html/2024-05-07-distill-example/plotly_demo_1.html') -``` - -And then include it with the following: - -```html -{% raw %}
- -
{% endraw %} -``` - -Voila! - -
- -
- -## Citations - -Citations are then used in the article body with the `` tag. -The key attribute is a reference to the id provided in the bibliography. -The key attribute can take multiple ids, separated by commas. - -The citation is presented inline like this: (a number that displays more information on hover). -If you have an appendix, a bibliography is automatically created and populated in it. - -Distill chose a numerical inline citation style to improve readability of citation dense articles and because many of the benefits of longer citations are obviated by displaying more information on hover. -However, we consider it good style to mention author last names if you discuss something at length and it fits into the flow well — the authors are human and it’s nice for them to have the community associate them with their work. - -*** - -## Footnotes - -Just wrap the text you would like to show up in a footnote in a `` tag. -The number of the footnote will be automatically generated.This will become a hoverable footnote. - -*** - -## Code Blocks - -This theme implements a built-in Jekyll feature, the use of Rouge, for syntax highlighting. -It supports more than 100 languages. -This example is in C++. -All you have to do is wrap your code in a liquid tag: - -{% raw %} -{% highlight c++ linenos %}
code code code
{% endhighlight %}
{% endraw %}

The keyword `linenos` triggers display of line numbers. You can try toggling it on or off yourself below:

{% highlight c++ %}
#include <iostream>
#include <string>
using namespace std;

int main(int argc, char const *argv[])
{
    string myString;

    cout << "input a string: ";
    getline(cin, myString);
    int length = myString.length();

    char *charArray = new char[length];

    for (int i = 0; i < length; ++i) {
        charArray[i] = myString[i];
        cout << charArray[i] << " ";
    }

    delete[] charArray;
    return 0;
}
{% endhighlight %}

***

## Diagrams

This theme supports generating various diagrams from a text description using the [jekyll-diagrams](https://github.com/zhustec/jekyll-diagrams){:target="\_blank"} plugin.
Below, we generate a few examples of such diagrams using languages such as [mermaid](https://mermaid-js.github.io/mermaid/){:target="\_blank"}, [plantuml](https://plantuml.com/){:target="\_blank"}, [vega-lite](https://vega.github.io/vega-lite/){:target="\_blank"}, etc.

**Note:** different diagram-generation packages require external dependencies to be installed on your machine.
Also, be mindful that because of diagram generation, the first time you build your Jekyll website after adding new diagrams will be SLOW.
For any other details, please refer to the [jekyll-diagrams](https://github.com/zhustec/jekyll-diagrams){:target="\_blank"} README.

**Note:** This is not supported for local rendering!

The diagram below was generated by the following code:

{% raw %}
```
{% mermaid %}
sequenceDiagram
    participant John
    participant Alice
    Alice->>John: Hello John, how are you?
    John-->>Alice: Great!
{% endmermaid %}
```
{% endraw %}

{% mermaid %}
sequenceDiagram
participant John
participant Alice
Alice->>John: Hello John, how are you?
John-->>Alice: Great!
{% endmermaid %}

***

## Tweets

An example of displaying a tweet:
{% twitter https://twitter.com/rubygems/status/518821243320287232 %}

An example of pulling from a timeline:
{% twitter https://twitter.com/jekyllrb maxwidth=500 limit=3 %}

For more details on using the plugin visit: [jekyll-twitter-plugin](https://github.com/rob-murray/jekyll-twitter-plugin)

***

## Blockquotes

- We do not grow absolutely, chronologically. We grow sometimes in one dimension, and not in another, unevenly. We grow partially. We are relative. We are mature in one realm, childish in another. - —Anais Nin -
- -*** - - -## Layouts - -The main text column is referred to as the body. -It is the assumed layout of any direct descendants of the `d-article` element. - -
-

.l-body

-
- -For images you want to display a little larger, try `.l-page`: - -
-

.l-page

-
- -All of these have an outset variant if you want to poke out from the body text a little bit. -For instance: - -
-

.l-body-outset

-
- -
-

.l-page-outset

-
- -Occasionally you’ll want to use the full browser width. -For this, use `.l-screen`. -You can also inset the element a little from the edge of the browser by using the inset variant. - -
-

.l-screen

-
-
-

.l-screen-inset

-
- -The final layout is for marginalia, asides, and footnotes. -It does not interrupt the normal flow of `.l-body`-sized text except on mobile screen sizes. - -
-

.l-gutter

-
- -*** - -## Other Typography? - -Emphasis, aka italics, with *asterisks* (`*asterisks*`) or _underscores_ (`_underscores_`). - -Strong emphasis, aka bold, with **asterisks** or __underscores__. - -Combined emphasis with **asterisks and _underscores_**. - -Strikethrough uses two tildes. ~~Scratch this.~~ - -1. First ordered list item -2. Another item -⋅⋅* Unordered sub-list. -1. Actual numbers don't matter, just that it's a number -⋅⋅1. Ordered sub-list -4. And another item. - -⋅⋅⋅You can have properly indented paragraphs within list items. Notice the blank line above, and the leading spaces (at least one, but we'll use three here to also align the raw Markdown). - -⋅⋅⋅To have a line break without a paragraph, you will need to use two trailing spaces.⋅⋅ -⋅⋅⋅Note that this line is separate, but within the same paragraph.⋅⋅ -⋅⋅⋅(This is contrary to the typical GFM line break behavior, where trailing spaces are not required.) - -* Unordered lists can use asterisks -- Or minuses -+ Or pluses - -[I'm an inline-style link](https://www.google.com) - -[I'm an inline-style link with title](https://www.google.com "Google's Homepage") - -[I'm a reference-style link][Arbitrary case-insensitive reference text] - -[I'm a relative reference to a repository file](../blob/master/LICENSE) - -[You can use numbers for reference-style link definitions][1] - -Or leave it empty and use the [link text itself]. - -URLs and URLs in angle brackets will automatically get turned into links. -http://www.example.com or and sometimes -example.com (but not on Github, for example). - -Some text to show that the reference links can follow later. - -[arbitrary case-insensitive reference text]: https://www.mozilla.org -[1]: http://slashdot.org -[link text itself]: http://www.reddit.com - -Here's our logo (hover to see the title text): - -Inline-style: -![alt text](https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon48.png "Logo Title Text 1") - -Reference-style: -![alt text][logo] - -[logo]: https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon48.png "Logo Title Text 2" - -Inline `code` has `back-ticks around` it. - -```javascript -var s = "JavaScript syntax highlighting"; -alert(s); -``` - -```python -s = "Python syntax highlighting" -print(s) -``` - -``` -No language indicated, so no syntax highlighting. -But let's throw in a tag. -``` - -Colons can be used to align columns. - -| Tables | Are | Cool | -| ------------- |:-------------:| -----:| -| col 3 is | right-aligned | $1600 | -| col 2 is | centered | $12 | -| zebra stripes | are neat | $1 | - -There must be at least 3 dashes separating each header cell. -The outer pipes (|) are optional, and you don't need to make the -raw Markdown line up prettily. You can also use inline Markdown. - -Markdown | Less | Pretty ---- | --- | --- -*Still* | `renders` | **nicely** -1 | 2 | 3 - -> Blockquotes are very handy in email to emulate reply text. -> This line is part of the same quote. - -Quote break. - -> This is a very long line that will still be quoted properly when it wraps. Oh boy let's keep writing to make sure this is long enough to actually wrap for everyone. Oh, you can *put* **Markdown** into a blockquote. - - -Here's a line for us to start with. - -This line is separated from the one above by two newlines, so it will be a *separate paragraph*. - -This line is also a separate paragraph, but... -This line is only separated by a single newline, so it's a separate line in the *same paragraph*. 
diff --git a/_posts/2024-05-07-distill-example2.html b/_posts/2024-05-07-distill-example2.html deleted file mode 100644 index 5d2908ae..00000000 --- a/_posts/2024-05-07-distill-example2.html +++ /dev/null @@ -1,443 +0,0 @@ ---- -layout: distill -title: Sample Blog Post (HTML version) -description: Your blog post's abstract. - Please add your abstract or summary here and not in the main body of your text. - Do not include math/latex or hyperlinks. -date: 2024-05-07 -future: true -htmlwidgets: true -hidden: true - -# Anonymize when submitting -# authors: -# - name: Anonymous - -authors: - - name: Albert Einstein - url: "https://en.wikipedia.org/wiki/Albert_Einstein" - affiliations: - name: IAS, Princeton - - name: Boris Podolsky - url: "https://en.wikipedia.org/wiki/Boris_Podolsky" - affiliations: - name: IAS, Princeton - - name: Nathan Rosen - url: "https://en.wikipedia.org/wiki/Nathan_Rosen" - affiliations: - name: IAS, Princeton - -# must be the exact same name as your blogpost -bibliography: 2024-05-07-distill-example.bib - -# Add a table of contents to your post. -# - make sure that TOC names match the actual section names -# for hyperlinks within the post to work correctly. -# - please use this format rather than manually creating a markdown table of contents. -toc: - - name: Equations - - name: Images and Figures - subsections: - - name: Interactive Figures - - name: Citations - - name: Footnotes - - name: Code Blocks - - name: Diagrams - - name: Tweets - - name: Layouts - - name: Other Typography? - -# Below is an example of injecting additional post-specific styles. -# This is used in the 'Layouts' section of this post. -# If you use this post as a template, delete this _styles block. -_styles: > - .fake-img { - background: #bbb; - border: 1px solid rgba(0, 0, 0, 0.1); - box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1); - margin-bottom: 12px; - } - .fake-img p { - font-family: monospace; - color: white; - text-align: left; - margin: 12px 0; - text-align: center; - font-size: 16px;\ - } ---- - -

- This is a sample blog post written in HTML (while the other sample post is written in Markdown). Authors have the choice to write in HTML or Markdown. While Markdown is easier to write, HTML gives you more control over the layout of your post. Furthermore, Markdown often interacts in unexpected ways with MathJax and other HTML widgets. If you are having trouble with Markdown, try writing in HTML instead. -

- -

- Note: please use the table of contents as defined in the front matter rather than the traditional markdown styling. -

- -

Equations

- -

This theme supports rendering beautiful math in inline and display modes using MathJax 3 engine. -You just need to surround your math expression with $$, like $$ E = mc^2 $$. -If you leave it inside a paragraph, it will produce an inline expression, just like \(E = mc^2\).

- -

To use display mode, again surround your expression with $$ and place it as a separate paragraph. -Here is an example: -$$ -\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right) -$$ -

- -

Note that MathJax 3 is a major re-write of MathJax -that brought a significant improvement to the loading and rendering speed, which is now -on par with KaTeX.

- -

Images and Figures

- -

It's generally a better idea to avoid linking to images hosted elsewhere - links can break and you might lose important information from your blog post. You can display images from this repository using the following code:

- -
{% raw %}{% include figure.html path="assets/img/2024-05-07-distill-example/iclr.png" class="img-fluid" %}{% endraw %}
- -

which results in the following image:

- -{% include figure.html path="assets/img/2024-05-07-distill-example/iclr.png" class="img-fluid" %} - - -

- To ensure that there are no namespace conflicts, you must save your asset to your unique directory - `/assets/img/2024-05-07-[SUBMISSION NAME]` within your submission. -

- -

Please avoid using the direct HTML method of embedding images; they may not be properly resized. Some more complex ways to load images are shown below (note the different styles of the shapes/shadows):

- -
-
- {% include figure.html path="assets/img/2024-05-07-distill-example/9.jpg" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-distill-example/7.jpg" class="img-fluid rounded z-depth-1" %} -
-
-
- A simple, elegant caption looks good between image rows, after each row, or doesn't have to be there at all. -
- -
-
- {% include figure.html path="assets/img/2024-05-07-distill-example/8.jpg" class="img-fluid z-depth-2" %} -
-
- {% include figure.html path="assets/img/2024-05-07-distill-example/10.jpg" class="img-fluid z-depth-2" %} -
-
- -
-
- {% include figure.html path="assets/img/2024-05-07-distill-example/11.jpg" class="img-fluid" %} -
-
- {% include figure.html path="assets/img/2024-05-07-distill-example/12.jpg" class="img-fluid" %} -
-
- {% include figure.html path="assets/img/2024-05-07-distill-example/7.jpg" class="img-fluid" %} -
-
- -

Interactive Figures

- -

- Here's how you could embed interactive figures that have been exported as HTML files. - Note that we will be using plotly for this demo, but anything built off of HTML should work. - All that's required is for you to export your figure into HTML format, and make sure that the file - exists in the `assets/html/[SUBMISSION NAME]/` directory in this repository's root directory. - To embed it into any page, simply insert the following code anywhere into your page. -

- -
{% raw %}{% include [FIGURE_NAME].html %}{% endraw %}
- -

-For example, the following code can be used to generate the figure underneath it. -

- -
import pandas as pd
-import plotly.express as px
-
-df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/earthquakes-23k.csv')
-
-fig = px.density_mapbox(
-    df, lat='Latitude', lon='Longitude', z='Magnitude', radius=10,
-    center=dict(lat=0, lon=180), zoom=0, mapbox_style="stamen-terrain")
-fig.show()
-
-fig.write_html('./assets/html/2024-05-07-distill-example/plotly_demo_1.html')
-
- -And then include it with the following: - -
{% raw %}<div class="l-page">
-  <iframe src="{{ 'assets/html/2024-05-07-distill-example/plotly_demo_1.html' | relative_url }}" frameborder='0' scrolling='no' height="600px" width="100%"></iframe>
-</div>{% endraw %}
-
- -Voila! - -
- -
- - -

Citations

- - -

- Citations are then used in the article body with the <d-cite> tag. - The key attribute is a reference to the id provided in the bibliography. - The key attribute can take multiple ids, separated by commas. -

- -

- The citation is presented inline like this: (a number that displays more information on hover). - If you have an appendix, a bibliography is automatically created and populated in it. -

- -

- Distill chose a numerical inline citation style to improve readability of citation dense articles and because many of the benefits of longer citations are obviated by displaying more information on hover. - However, we consider it good style to mention author last names if you discuss something at length and it fits into the flow well - the authors are human and it's nice for them to have the community associate them with their work. -

- - -

Footnotes

- -

- Just wrap the text you would like to show up in a footnote in a <d-footnote> tag. - The number of the footnote will be automatically generated.This will become a hoverable footnote. -

- - -

Code Blocks

- -

- This theme implements a built-in Jekyll feature, the use of Rouge, for syntax highlighting. - It supports more than 100 languages. - This example is in C++. - All you have to do is wrap your code in a liquid tag as follows: -

- -
{% raw  %}
-{% highlight c++ linenos %}  
code code code
{% endhighlight %} -{% endraw %} -
The keyword `linenos` triggers display of line numbers. You can try toggling it on or off yourself below:

{% highlight c++ %}
#include <iostream>
#include <string>
using namespace std;

int main(int argc, char const *argv[])
{
    string myString;

    cout << "input a string: ";
    getline(cin, myString);
    int length = myString.length();

    char *charArray = new char[length];

    for (int i = 0; i < length; ++i) {
        charArray[i] = myString[i];
        cout << charArray[i] << " ";
    }

    delete[] charArray;
    return 0;
}
{% endhighlight %}


Diagrams

- -

- This theme supports generating various diagrams from a text description using jekyll-diagrams plugin. - Below, we generate a few examples of such diagrams using languages such as mermaid, plantuml, vega-lite, etc. -

- -

Note: different diagram-generation packages require external dependencies to be installed on your machine. Also, be mindful that because of diagram generation, the first time you build your Jekyll website after adding new diagrams will be SLOW. For any other details, please refer to the jekyll-diagrams README.

- -

- Note: This is not supported for local rendering! -

- -

- The diagram below was generated by the following code: -

- -
{% raw %}{% mermaid %}
-sequenceDiagram
-    participant John
-    participant Alice
-    Alice->>John: Hello John, how are you?
-    John-->>Alice: Great!
-{% endmermaid %}
-{% endraw %}
-
- -{% mermaid %} -sequenceDiagram -participant John -participant Alice -Alice->>John: Hello John, how are you? -John-->>Alice: Great! -{% endmermaid %} - - -

Tweets

- -

- An example of displaying a tweet: - {% twitter https://twitter.com/rubygems/status/518821243320287232 %} -

- -

- An example of pulling from a timeline: - {% twitter https://twitter.com/jekyllrb maxwidth=500 limit=3 %} -

- -

- For more details on using the plugin visit: jekyll-twitter-plugin -

- - -

Blockquotes

- -
- We do not grow absolutely, chronologically. We grow sometimes in one dimension, and not in another, unevenly. We grow partially. We are relative. We are mature in one realm, childish in another. - —Anais Nin -
- - -

Layouts

- -The main text column is referred to as the body. -It's the assumed layout of any direct descendants of the `d-article` element. - -
-

.l-body

-
- -For images you want to display a little larger, try `.l-page`: - -
-

.l-page

-
- -All of these have an outset variant if you want to poke out from the body text a little bit. -For instance: - -
-

.l-body-outset

-
- -
-

.l-page-outset

-
- -Occasionally you'll want to use the full browser width. -For this, use `.l-screen`. -You can also inset the element a little from the edge of the browser by using the inset variant. - -
-

.l-screen

-
-
-

.l-screen-inset

-
- -The final layout is for marginalia, asides, and footnotes. -It does not interrupt the normal flow of `.l-body`-sized text except on mobile screen sizes. - -
-

.l-gutter

-
- - -

Other Typography?

- -

- Emphasis, aka italics, with the <i></i> tag emphasis. -

- -

- Strong emphasis, aka bold, with <b></b> tag bold. -

- -

Strikethrough can be accomplished with the <s></s> tag. Scratch this.

- -
    -
  • First ordered list item
  • -
  • Another item
  • -
      -
    1. Unordered sub-list.
    2. -
    -
  • And another item.
  • -
- - - -

For code, the language can be specified in the class. For example, use language-javascript for JavaScript and language-python for Python code.

- -
var s = "JavaScript syntax highlighting";
-  alert(s);
- -
s = "Python syntax highlighting"
-  print(s)
- -
No language indicated, so no syntax highlighting.
- -

- A table can be created with the <table> element. Below is an example -

- - - - - - - - - - - - - - - - - - - - - - - - - - -
TablesAreCool
col 3 isright-aligned$1600
col 2 iscentered$12
zebra stripesare neat$1
- - -

-

Blockquotes can be defined with the <blockquote> tag.
-

diff --git a/_posts/2024-05-07-double-descent-demystified.md b/_posts/2024-05-07-double-descent-demystified.md deleted file mode 100644 index d78ba984..00000000 --- a/_posts/2024-05-07-double-descent-demystified.md +++ /dev/null @@ -1,736 +0,0 @@ ---- -layout: distill -title: Double Descent Demystified -description: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle -date: 2024-05-07 -future: true -htmlwidgets: true - -authors: - - name: Rylan Schaeffer - url: "https://scholar.google.com/citations?user=6tMEGz8AAAAJ&hl=en" - affiliations: - name: Stanford University - - name: Zachary Robertson - url: "https://scholar.google.com/citations?user=769PIisAAAAJ&hl=en&oi=ao" - affiliations: - name: Stanford University - - name: Akhilan Boopathy - url: "https://scholar.google.com/citations?user=21alU7EAAAAJ&hl=en" - affiliations: - name: MIT - - name: Mikail Khona - url: "https://scholar.google.com/citations?user=K5f0SYQAAAAJ&hl=en&oi=ao" - affiliations: - name: MIT - - name: Kateryna Pistunova - url: "https://scholar.google.com/citations?user=V7QY5j0AAAAJ&hl=en" - affiliations: - name: Stanford University - - name: Jason W. Rocks - url: "https://scholar.google.com/citations?user=rFHAzMUAAAAJ" - affiliations: - name: Boston University - - name: Ila R. Fiete - url: "https://scholar.google.com/citations?user=uE-CihIAAAAJ&hl=en&oi=ao" - affiliations: - name: MIT - - name: Andrey Gromov - url: "https://scholar.google.com/citations?user=D056qfMAAAAJ&hl=en&oi=ao" - affiliations: - name: UMD & Meta AI FAIR - - name: Sanmi Koyejo - url: "https://scholar.google.com/citations?user=EaaOeJwAAAAJ&hl=en&oi=ao" - affiliations: - name: Stanford University - -# must be the exact same name as your blogpost -bibliography: 2024-05-07-double-descent-demystified.bib - -# Add a table of contents to your post. -# - make sure that TOC names match the actual section names -# for hyperlinks within the post to work correctly. -# - please use this format rather than manually creating a markdown table of contents. -toc: - - name: Introduction - - name: Double Descent in Ordinary Linear Regression - subsections: - - name: Empirical Evidence - - name: Notation and Terminology - - name: Mathematical Analysis - - name: Factor 1 - Low Variance in Training Features - - name: Factor 2 - Test Features in Training Feature Subspace - - name: Factor 3 - Errors from Best Possible Model - - name: Divergence at the Interpolation Threshold - - name: Generalization in Overparameterized Linear Regression - - name: Adversarial Data - subsections: - - name: Adversarial Test Examples - - name: Adversarial Training Data - - name: Intuition for Nonlinear Models - -# Below is an example of injecting additional post-specific styles. -# This is used in the 'Layouts' section of this post. -# If you use this post as a template, delete this _styles block. -_styles: > - .fake-img { - background: #bbb; - border: 1px solid rgba(0, 0, 0, 0.1); - box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1); - margin-bottom: 12px; - } - .fake-img p { - font-family: monospace; - color: white; - text-align: left; - margin: 12px 0; - text-align: center; - font-size: 16px; - } ---- - -## Introduction - -Machine learning models, while incredibly powerful, can sometimes act unpredictably. One of the most intriguing -behaviors is when the test loss suddenly diverges at the interpolation threshold, a phenomenon -distinctly observed in **double descent** . - - -
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/unablated.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/unablated.png" class="img-fluid rounded z-depth-1" %} -
-
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/unablated.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/unablated.png" class="img-fluid rounded z-depth-1" %} -
-
-
- Figure 1. Double descent in ordinary linear regression. - Three real datasets (California Housing, Diabetes, and WHO Life Expectancy) and one synthetic dataset (Student-Teacher) all exhibit double descent, - with test loss spiking at the interpolation threshold. - Blue is training error. Orange is test error. -
-
While significant theoretical work has been done to comprehend why double descent occurs, it can be difficult for a newcomer to gain a general understanding of why the test loss behaves in this manner, and under what conditions one should expect similar misbehavior. In this blog post, when we say double descent, we mean the divergence at the interpolation threshold, and not whether overparameterized models generalize (or fail to generalize).

In this work, we intuitively and quantitatively explain why the test loss diverges at the interpolation threshold, with as much generality and as simple mathematical machinery as possible, but without sacrificing rigor. To accomplish this, we focus on the simplest supervised model - ordinary linear regression - using the most basic linear algebra primitive: the singular value decomposition. We identify three distinct interpretable factors which, when collectively present, trigger the divergence. Through practical experiments on real datasets, we confirm that the models' test losses diverge at the interpolation threshold, and that this divergence vanishes when even one of the three factors is removed. We complement our understanding by offering a geometric picture that reveals that linear models perform representation learning when overparameterized, and conclude by shedding light on recent results in nonlinear models concerning superposition.

## Double Descent in Ordinary Linear Regression

### Empirical Evidence of Double Descent in Ordinary Linear Regression

Before studying ordinary linear regression mathematically, does our claim that it exhibits double descent hold empirically? We show that it indeed does, using one synthetic and three real datasets: World Health Organization Life Expectancy, California Housing, Diabetes; these three real datasets were selected on the basis of being easily accessible through sklearn or Kaggle. As shown in [Fig 1](#fig_unablated_all), all display a spike in test mean squared error at the interpolation threshold. Our simple Python code is [publicly available]().

### Notation and Terminology

Consider a regression dataset of $N$ training examples with features $\vec{x}_n \in \mathbb{R}^D$ and targets $y_n \in \mathbb{R}$. We sometimes use matrix-vector notation to refer to the training data:

$$X \in \mathbb{R}^{N \times D} \quad , \quad Y \in \mathbb{R}^{N \times 1}.$$

In ordinary linear regression, we want to learn parameters $\hat{\vec{\beta}} \in \mathbb{R}^{D}$ such that:

$$\vec{x}_n \cdot \hat{\vec{\beta}} \approx y_n.$$

We will study three key parameters:
1. The number of model parameters $P$
2. The number of training data $N$
3. The dimensionality of the data $D$

We say that a model is _overparameterized_ if $N < P$ and _underparameterized_ if $N > P$. The _interpolation threshold_ refers to $N=P$, because when $N\leq P$, the model can perfectly interpolate the training points. Recall that in ordinary linear regression, the number of parameters $P$ equals the dimension $D$ of the covariates. Consequently, rather than thinking about changing the number of parameters $P$, we'll instead think about changing the number of data points $N$.

### Mathematical Analysis of Ordinary Linear Regression

To understand under what conditions and why double descent occurs at the interpolation threshold in linear regression, we'll study the two parameterization regimes.
If the regression is _underparameterized_, we estimate the linear relationship between covariates $\vec{x}_n$
and target $y_n$ by solving the least-squares minimization problem:

$$
\begin{align*}
\hat{\vec{\beta}}_{under} \, &:= \, \arg \min_{\vec{\beta}} \frac{1}{N} \sum_n ||\vec{x}_n \cdot \vec{\beta} - y_n||_2^2\\
\, &= \, \arg \min_{\vec{\beta}} ||X \vec{\beta} - Y ||_2^2.
\end{align*}
$$

The solution is the ordinary least squares estimator based on the second moment matrix $X^T X$:

$$\hat{\vec{\beta}}_{under} = (X^T X)^{-1} X^T Y.$$

If the model is overparameterized, the optimization problem is ill-posed since we have fewer constraints than parameters.
Consequently, we choose a different (constrained) optimization problem that asks for the minimum-norm parameters that
still perfectly interpolate the training data:

$$
\begin{align*}
\hat{\vec{\beta}}_{over} \, &:= \, \arg \min_{\vec{\beta}} ||\vec{\beta}||_2^2\\
\text{s.t.} \quad \quad \forall \, n \in &\{1, ..., N\}, \quad \vec{x}_n \cdot \vec{\beta} = y_n.
\end{align*}
$$

We choose this optimization problem because it is the one that gradient descent, initialized at the origin, implicitly solves.
The solution to this optimization problem uses the Gram matrix $X X^T \in \mathbb{R}^{N \times N}$:

$$\hat{\vec{\beta}}_{over} = X^T (X X^T)^{-1} Y.$$

One way to see why the Gram matrix appears is via constrained optimization: define the Lagrangian
$\mathcal{L}(\vec{\beta}, \vec{\lambda}) \, := \, \frac{1}{2}||\vec{\beta}||_2^2 + \vec{\lambda}^T (Y - X \vec{\beta})$
with Lagrange multipliers $\vec{\lambda} \in \mathbb{R}^N$, then differentiate with respect to the parameters
and Lagrange multipliers to obtain the overparameterized solution.

After being fit, for test point $\vec{x}_{test}$, the model will make the following predictions:

$$\hat{y}_{test, under} = \vec{x}_{test} \cdot \hat{\vec{\beta}}_{under} = \vec{x}_{test} \cdot (X^T X)^{-1} X^T Y$$

$$\hat{y}_{test, over} = \vec{x}_{test} \cdot \hat{\vec{\beta}}_{over} = \vec{x}_{test} \cdot X^T (X X^T)^{-1} Y.$$

Hidden in the above equations is an interaction between three quantities that can, when all grow extreme, create a
divergence in the test loss!

To reveal the three quantities, we'll rewrite the regression targets by introducing
a slightly more detailed notation. Unknown to us, there are some ideal linear parameters
$\vec{\beta}^* \in \mathbb{R}^P = \mathbb{R}^D$ that truly minimize the test mean squared error.
We can write any regression target as the inner product of the data $\vec{x}_n$ and the ideal parameters $\vec{\beta}^*$,
plus an additional error term $e_n$ that is an
"uncapturable" residual from the "viewpoint" of the model class:

$$y_n = \vec{x}_n \cdot \vec{\beta}^* + e_n.$$

In matrix-vector form, we will equivalently write:

$$Y = X \vec{\beta}^* + E,$$

with $E \in \mathbb{R}^{N \times 1}$.
To be clear, we are _not_ imposing assumptions. Rather, we are introducing notation to express that
there are (unknown) ideal linear parameters, and possibly non-zero errors $E$ that even the ideal model might
be unable to capture; these errors $E$ could be random noise or could be fully deterministic patterns that this
particular model class cannot capture. Using this new notation, we rewrite the model's predictions to show how
the test datum's features $\vec{x}_{test}$,
the training data's features $X$ and the training data's regression targets $Y$ interact.

Let $y_{test}^* := \vec{x}_{test} \cdot \vec{\beta}^*$.
In the underparameterized regime:

$$
\begin{align*}
\hat{y}_{test,under} &= \vec{x}_{test} \cdot \hat{\vec{\beta}}_{under}\\
&=\vec{x}_{test} \cdot (X^T X)^{-1} X^T Y\\
&=\vec{x}_{test} \cdot (X^T X)^{-1} X^T (X \vec{\beta}^* + E)\\
&=\vec{x}_{test} \cdot \vec{\beta}^* + \, \vec{x}_{test} \cdot (X^T X)^{-1} X^T E\\
\hat{y}_{test,under} - y_{test}^* &= \vec{x}_{test} \cdot (X^T X)^{-1} X^T E.
\end{align*}
$$

This equation is important, but opaque. To extract the intuition,
replace $X$ with its singular value decomposition $X = U S V^T$.
Let $R \, := \, \text{rank}(X)$ and let $\sigma_1 \geq \sigma_2 \geq ... \geq \sigma_R > 0$ be
$X$'s (non-zero) singular values. Let $S^+$ denote the [Moore-Penrose inverse](https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse);
in this context, this means that if a singular value $\sigma_r$ is non-zero, then in $S^+$, it becomes its reciprocal
$1/\sigma_r$, but if the singular value is zero, then in $S^+$, it remains $0$.
We can decompose the underparameterized prediction error
along the orthogonal singular modes:

$$
\begin{align*}
\hat{y}_{test, under} - y_{test}^* &= \vec{x}_{test} \cdot V S^{+} U^T E\\
&= \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E).
\end{align*}
$$

This equation will be critical! The same term will appear in the overparameterized regime (plus one additional term):

$$
\begin{align*}
\hat{y}_{test,over} &= \vec{x}_{test} \cdot \hat{\vec{\beta}}_{over}\\
&= \vec{x}_{test} \cdot X^T (X X^T)^{-1} Y\\
&= \vec{x}_{test} \cdot X^T (X X^T)^{-1} (X \vec{\beta}^* + E)\\
\hat{y}_{test,over} - y_{test}^* &= \vec{x}_{test} \cdot (X^T (X X^T)^{-1} X - I_D) \vec{\beta}^* \\
&\quad\quad + \quad \vec{x}_{test} \cdot X^T (X X^T)^{-1} E\\
&= \vec{x}_{test} \cdot (X^T (X X^T)^{-1} X - I_D) \vec{\beta}^* \\
&\quad\quad + \quad \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E),
\end{align*}
$$

where the last step again replaced $X$ with its SVD $X = U S V^T$. Thus, the prediction errors
in the overparameterized and underparameterized regimes are:

$$
\begin{align*}
\hat{y}_{test,over} - y_{test}^* &= \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E)\\
&\quad \quad + \quad \vec{x}_{test} \cdot (X^T (X X^T)^{-1} X - I_D) \vec{\beta}^*\\
\hat{y}_{test,under} - y_{test}^* &= \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E).
\end{align*}
$$

The shared term in the two prediction errors causes the divergence:

$$
\begin{equation}
\sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E).
\label{eq:variance}
\end{equation}
$$

Eqn. \ref{eq:variance} is critical. It reveals that our test prediction error (and thus, our
test squared error!) will depend on an interaction between 3 quantities:

1. How much the training features vary in each direction.
More formally, the reciprocals of the (non-zero) singular values of the _training features_ $X$:

    $$\frac{1}{\sigma_r}$$

2. How much, and in which directions, the test features vary relative to the training features.
More formally: how $\vec{x}_{test}$ projects onto $X$'s right singular vectors $V$:

    $$\vec{x}_{test} \cdot \vec{v}_r$$

3. How well the best possible model in the model class can correlate the variance in the training features with the training regression targets.
More formally: how the residuals $E$ of the best possible model in the model class (i.e. insurmountable "errors" from the "perspective" of the model class) project onto $X$'s left singular vectors $U$:

    $$\vec{u}_r \cdot E$$

We use the term "vary" when discussing $\vec{v}_r$ because $V$ can be related to the empirical (or sample) covariance
matrix oftentimes studied in Principal Component Analysis. That is, if the SVD of $X$ is $U S V^T$, then
$\frac{1}{N} X^T X = \frac{1}{N} V S^2 V^T$. If the training data are centered
(a common preprocessing step), then this is the empirical covariance
matrix and its eigenvectors $\vec{v}_1, ..., \vec{v}_R$ identify the orthogonal directions of variance. We'll return
to this in [Fig 6](#fig_geometric_smallest_nonzero_singular_value).

**Why does the test error diverge?** When (1) and (3) are both present on a singular mode, i.e. a small singular value $\sigma_r$ together with residuals $E$ that have a large projection onto $\vec{u}_r$, the model's
parameters along that singular mode are likely incorrect.
When (2) is added to the mix by a test datum $\vec{x}_{test}$ with a large projection along the same mode,
the model is forced to extrapolate significantly beyond what it saw in the training data, in a direction where
the training data had an error-prone relationship between its features and the training targets, using
parameters that are likely wrong. As a consequence, the test squared error explodes!

### Factor 1 - Low Variance in Training Features

<div>
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_small_singular_values.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_small_singular_values.png" class="img-fluid rounded z-depth-1" %} -
-
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_small_singular_values.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_small_singular_values.png" class="img-fluid rounded z-depth-1" %} -
-
-
- Figure 2. Required Factor #1: How much training features vary in each direction. - The test loss diverges at the interpolation threshold only if training features $X$ contain small (non-zero) - singular values. Ablation: By removing all singular values below a cutoff, the divergence at the interpolation threshold is diminished or disappears entirely. - Blue is training error. Orange is test error. -
-
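Concretely, the ablation behind [Fig 2](#fig_factor_1_small_singular_values), described in the next paragraph, can be sketched in a few lines of numpy. This is a hypothetical sketch (the function name and `cutoff` are illustrative), not our exact code:

```python
import numpy as np

def fit_with_singular_value_cutoff(X, y, cutoff):
    """Least-squares / minimum-norm fit after removing small singular values of X."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Singular values below the cutoff are zeroed out, not inverted.
    S_inv = np.divide(1.0, S, out=np.zeros_like(S), where=S >= cutoff)
    return Vt.T @ (S_inv * (U.T @ y))  # beta_hat = V S^+ U^T Y
```

Sweeping `cutoff` alongside the number of training data reproduces the flattened curves in Fig 2: once the small-but-nonzero singular values are gone, the $1/\sigma_r$ factors stay bounded.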
The test loss will not diverge if any of the three required factors is absent. What could cause that?
One possibility is that no small-but-nonzero singular values appear in the training features. One way to
accomplish this is to set all singular values below a chosen threshold to exactly 0. To test our understanding,
we independently ablate the small singular values in the training features. Specifically, as we run the
ordinary linear regression fitting process, and as we sweep the number of training data, we also sweep different
singular value cutoffs and remove all singular values of the training features $X$ below the cutoff ([Fig 2](#fig_factor_1_small_singular_values)).

### Factor 2 - Test Features in Training Feature Subspace

<div>
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_feat_in_train_feat_subspace.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_feat_in_train_feat_subspace.png" class="img-fluid rounded z-depth-1" %} -
-
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_feat_in_train_feat_subspace.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_feat_in_train_feat_subspace.png" class="img-fluid rounded z-depth-1" %} -
-
-
- Figure 3. Required Factor #2: How much, and in which directions, test features vary relative to training features. - The test loss diverges only if the test features $\vec{x}_{test}$ have a large projection onto the training - features $X$'s right singular vectors $V$. Ablation: By projecting the test features into the subspace of the - leading singular modes, the divergence at the interpolation threshold is diminished or disappears entirely. - Blue is training error. Orange is test error. -
-
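The corresponding ablation for [Fig 3](#fig_test_feat_in_train_feat_subspace), described below, projects each test datum onto the span of the leading right singular vectors of the training features. A hypothetical numpy sketch (the name and `k` are illustrative):

```python
import numpy as np

def project_test_onto_leading_modes(X_train, X_test, k):
    """Replace each test row by its orthogonal projection onto the span of
    X_train's top-k right singular vectors, removing Factor 2."""
    _, _, Vt = np.linalg.svd(X_train, full_matrices=False)
    V_k = Vt[:k].T               # (D, k) leading right singular vectors
    return X_test @ V_k @ V_k.T  # projection matrix V_k V_k^T applied row-wise
```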
Double descent should not occur if the test data do not vary in directions beyond those in which the training features vary.
Specifically, if the test datum lies entirely in the subspace of just a few of the leading singular directions, then the divergence is unlikely to occur.
To test our understanding, we force the test features to lie in the training features' subspace: as we run the
ordinary linear regression fitting process, and as we sweep the number of training data, we project the test features
$\vec{x}_{test}$ onto the subspace spanned by the training features $X$'s leading singular modes ([Fig 3](#fig_test_feat_in_train_feat_subspace)).

### Factor 3 - Errors from Best Possible Model

<div>
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_residuals_in_ideal.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_residuals_in_ideal.png" class="img-fluid rounded z-depth-1" %} -
-
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_residuals_in_ideal.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_residuals_in_ideal.png" class="img-fluid rounded z-depth-1" %} -
-
-
Figure 4. Required Factor #3: How well the best possible model in the model class can correlate variance in training
features with training targets. The test loss diverges only if the residuals $E$ from the best possible model
in the model class on the training data have a large projection onto the training features $X$'s left singular
vectors $U$. Ablation: By ensuring the true relationship between features and targets is within the model class
(i.e., linear), the divergence at the interpolation threshold disappears.
Blue is training error. Orange is test error.
</div>
-
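The ablation behind [Fig 4](#fig_no_residuals_in_ideal), described next, relabels the data with the predictions of the best possible linear model so that $E = 0$ by construction. A hypothetical sketch:

```python
import numpy as np

def relabel_with_ideal_predictions(X_all, y_all):
    """Fit the best possible linear model on the full dataset and return its
    predictions as new targets, so the ideal model has zero residuals E."""
    beta_ideal = np.linalg.pinv(X_all) @ y_all
    return X_all @ beta_ideal
```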
Double descent should not occur if the best possible model in the model class makes no errors on the training data.
For example, if we use a linear model class on data where the true relationship is a noiseless linear relationship,
then at the interpolation threshold, we will have $N=P$ training data for $P=D$ parameters, our line of best fit will exactly match
the true relationship, and no divergence will occur. To test our understanding, we ensure no residual errors exist in
the best possible model: we first use the entire dataset to fit a linear model, then replace all target values
with the predictions made by the ideal linear model. We then rerun our typical fitting process using these
new labels, sweeping the number of training data ([Fig 4](#fig_no_residuals_in_ideal)).

As a short aside, what could cause residual errors in the best possible model in the model class?

1. __Noise__: If the data is noisy, then the best possible model in the model class will have residual errors.
2. __Model Misspecification__: If the data is generated by a nonlinear model, but we use a linear model class (or vice versa), then the best possible model in the model class will have residual errors.
3. __Missing Features__: Even if the data is noiseless and our model belongs to the correct model class, if we are missing covariates, then the best possible model in the model class will still have residual errors.

### Divergence at the Interpolation Threshold

<div>
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/least_informative_singular_value.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/least_informative_singular_value.png" class="img-fluid rounded z-depth-1" %} -
-
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/least_informative_singular_value.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/least_informative_singular_value.png" class="img-fluid rounded z-depth-1" %} -
-
-
Figure 5. The smallest non-zero singular value of the training features is most likely to reach its lowest value near the interpolation threshold.
</div>
-
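The trend in Fig 5 is easy to reproduce numerically. A minimal sketch, assuming i.i.d. Gaussian features (the dimension and sweep values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # data dimension
for N in [4, 8, 16, 30, 32, 34, 64, 128]:  # sweep through the threshold N = D
    X = rng.normal(size=(N, D))
    S = np.linalg.svd(X, compute_uv=False)
    print(f"N={N:4d}  smallest non-zero singular value = {S[S > 1e-10].min():.3f}")
```

The smallest non-zero singular value dips sharply around $N = D$ and recovers on either side, matching the geometric intuition below.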
Why does this divergence happen near the interpolation threshold? The answer is that the first factor
(small non-zero singular values in the training features $X$) is likely to occur at the interpolation
threshold ([Fig 5](#fig_least_informative_singular_value)). But why?

Suppose we're given a single
training datum $$\vec{x}_1$$. So long as this datum isn't exactly zero, that datum varies in a single
direction, meaning we gain information about the variance in that direction, but the variance in all
orthogonal directions is exactly 0. With the second training datum $$\vec{x}_2$$, so long as this datum
isn't exactly zero, that datum varies, but now, some fraction of $$\vec{x}_2$$ might have a non-zero
projection along $$\vec{x}_1$$; if this happens (and it likely will, since the two vectors are unlikely
to be exactly orthogonal), the shared direction gives us _more_ information about the variance
in this shared direction, but _less_ information about the second orthogonal direction of variation.
Ergo, the training data's smallest non-zero singular value after 2 samples is probabilistically smaller than
after 1 sample. As we approach the interpolation threshold, it becomes increasingly unlikely that each additional
datum contributes large variance in a new direction orthogonal to all previous directions
([Fig 6](#fig_geometric_smallest_nonzero_singular_value)), but as we move beyond the interpolation threshold, the variance
in each covariate dimension becomes increasingly clear.

<div>
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=1.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=2.png" class="img-fluid rounded z-depth-1" %} -
-
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=3.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=8.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=100.png" class="img-fluid rounded z-depth-1" %} -
-
-
- Figure 6. Geometric intuition for why the smallest non-zero singular value reaches its lowest value near the interpolation threshold. - If $1$ datum is observed, variance exists in only 1 direction. If $2$ data are observed, a second axis of - variation appears, but because the two data are likely to share some component, the second axis is likely to have - less variance than the first. At the interpolation threshold (here, $D=P=N=3$), because the three data are - likely to share components along the first two axes, the third axis is likely to have even less variance. - Beyond the interpolation threshold, additional data contribute additional variance to these three axes. -
-
### Generalization in Overparameterized Linear Regression

You might be wondering why three of the datasets have low test squared error in the overparameterized regime (California
Housing, Diabetes, Student-Teacher) but one (WHO Life Expectancy) does not. Recall that the overparameterized regime's prediction
error $$\hat{y}_{test,over} - y_{test}^*$$ has another term not present in the underparameterized regime:

$$
\begin{equation}
\vec{x}_{test} \cdot (X^T (X X^T)^{-1} X - I_D) \vec{\beta}^*.
\label{eq:bias}
\end{equation}
$$

To understand why this bias exists, recall that our goal is to correlate fluctuations in the covariates
$\vec{x}$ with fluctuations in the targets $y$. In the overparameterized regime, there are more parameters
than data; consequently, for $N$ data points in $D=P$ dimensions, the model can "see" fluctuations in at
most $N$ dimensions, but has no "visibility" into the remaining $P-N$ dimensions. This causes information
about the optimal linear relationship $\vec{\beta}^*$ to be lost, thereby increasing the overparameterized
prediction error.

<div>
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/overparameterized_generalization.jpg" class="img-fluid rounded z-depth-1"%} -
-
-
- Figure 7. Geometry of Generalization in Overparameterized Ordinary Linear Regression. - The rowspace of the training features $X$ forms a subspace (here, $\mathbb{R}^1$) of the ambient space - (here, $\mathbb{R}^2$). For test datum $\vec{x}_{test}$, the linear model forms an internal representation - of the test datum $\hat{\vec{x}}_{test}$ by orthogonally projecting the test datum onto the rowspace via - projection matrix $X^T (X X^T)^{-1} X$. The generalization error will then increase commensurate with the - inner product between $\hat{\vec{x}}_{test} - \vec{x}_{test}$ and the best possible parameters for the - function class $\vec{\beta}^*$. Three different possible $\vec{\beta}^*$ are shown with - low (blue), medium (green) - and high (red) generalization errors. -
-
We previously saw that away from the interpolation threshold, the variance is unlikely to affect the
discrepancy between the overparameterized model's predictions and the ideal model's predictions,
meaning most of the discrepancy must therefore emerge from the bias (Eqn. \ref{eq:bias}).
This bias term yields an intuitive geometric picture ([Fig 7](#fig_overparameterized_generalization)) that
also reveals a surprising fact: _overparameterized linear regression does representation learning!_
Specifically, for test datum $$\vec{x}_{test}$$, a linear model creates a representation of the test datum
$$\hat{\vec{x}}_{test}$$ by orthogonally projecting the test datum onto the row space of the training
covariates $$X$$ via the projection matrix $$X^T (X X^T)^{-1} X$$:

$$
\begin{equation*}
\hat{\vec{x}}_{test} := X^T (X X^T)^{-1} X \; \vec{x}_{test}.
\end{equation*}
$$

Seen this way, the bias can be rewritten as the inner product between (1) the difference between the model's representation of the test datum and the test datum itself, and (2) the ideal linear model's fit parameters:

$$
\begin{equation}\label{eq:overparam_gen_bias}
(\hat{\vec{x}}_{test} - \vec{x}_{test}) \cdot \vec{\beta}^*.
\end{equation}
$$

<div>
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_bias_squared.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_bias_squared.png" class="img-fluid rounded z-depth-1" %} -
-
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_bias_squared.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_bias_squared.png" class="img-fluid rounded z-depth-1" %} -
-
-
Figure 8. Test Error of Overparameterized Models. A large inner product between the ideal model's parameters and
the difference between the fit model's internal representations of the test data and the test data themselves creates
large test squared error for overparameterized models.
</div>
-
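In numpy, this internal representation and the bias term from Eqn. \ref{eq:overparam_gen_bias} can be computed directly. A hypothetical sketch, assuming $N < D$ and a full-rank (invertible) Gram matrix; the names are illustrative:

```python
import numpy as np

def overparameterized_bias(X_train, x_test, beta_star):
    """Bias (x_hat - x_test) . beta_star incurred by projecting onto the rowspace."""
    gram = X_train @ X_train.T                        # (N, N) Gram matrix, assumed invertible
    proj = X_train.T @ np.linalg.inv(gram) @ X_train  # projection onto rowspace of X_train
    x_hat = proj @ x_test                             # the model's internal representation
    return (x_hat - x_test) @ beta_star               # contribution of the D - N unseen directions
```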
Intuitively, an overparameterized model will generalize well if the model's representations capture the essential
information necessary for the best model in the model class to perform well ([Fig. 8](#fig_test_bias_squared)).

## Adversarial Test Data and Adversarial Training Data

Our key equation (Eqn. \ref{eq:variance}) also reveals _why_ adversarial test data and adversarial training data exist
(at least in linear regression) and _how_ they function mechanistically. For convenience, we repeat the equation:

$$
\begin{equation*}
\sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E).
\end{equation*}
$$

Adversarial test examples are a well-known phenomenon in machine learning that we can see in this equation.
Adversarial test features correspond to $$\vec{x}_{test} \cdot \vec{v}_r$$ being large: one can drastically increase
the test squared error by moving the test example in the direction of the right singular vector(s) with the smallest non-zero
singular values ([Fig 9](#fig_adversarial_test_datum)).

<div>
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_test_datum.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_test_datum.png" class="img-fluid rounded z-depth-1" %} -
-
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_test_datum.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_test_datum.png" class="img-fluid rounded z-depth-1" %} -
-
-
- Figure 9. Adversarial Test Examples in Linear Regression. Adversarial examples arise by pushing - $\vec{x}_{test}$ far along the trailing singular modes in the training features $X$. - Blue is training error. Orange is test error. -
-
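Concretely, the construction behind [Fig 9](#fig_adversarial_test_datum) can be sketched as follows (a hypothetical sketch; the perturbation `scale` is illustrative):

```python
import numpy as np

def adversarial_test_datum(X_train, x_test, scale=10.0):
    """Push x_test along the trailing right singular vector of the training features."""
    _, S, Vt = np.linalg.svd(X_train, full_matrices=False)
    r = np.sum(S > 1e-10) - 1      # index of the smallest non-zero singular value
    return x_test + scale * Vt[r]  # large x_test . v_r on a mode with tiny sigma_r
```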
Less well-known are adversarial training data, akin to dataset poisoning
or backdoor attacks.
Adversarial training examples correspond to $$\vec{u}_r \cdot E$$ being large: one can drastically
increase the test squared error by moving the training errors $E$ in the direction of the left singular vector(s) with the smallest
non-zero singular value. This gives a practical way to construct _adversarial training data_: training features and targets
whose training loss is unchanged from that of the unaltered training data, but which cause the test loss to be 1-3 orders of magnitude
larger ([Fig 10](#fig_adversarial_train_data)).

<div>
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_train_data.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_train_data.png" class="img-fluid rounded z-depth-1" %} -
-
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_train_data.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_train_data.png" class="img-fluid rounded z-depth-1" %} -
-
-
- Figure 10. Adversarial Training Dataset in Linear Regression. By manipulating the residual errors $E$ - that the best possible model in the model class achieves on the training data, we construct training datasets - that increase the test error of the learned model by 1-3 orders of magnitude without affecting its training - error. Blue is training error. Orange is test error. -
-
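A hypothetical sketch of this construction: perturbing the targets along a trailing left singular vector $\vec{u}_r$ shifts the fitted parameters by $(\text{scale}/\sigma_r) \, \vec{v}_r$, which leaves the training residuals unchanged (the perturbation lies in the column space of $X$) while inflating the test error:

```python
import numpy as np

def adversarial_train_targets(X_train, y_train, scale=10.0):
    """Shift the targets (and thus the residuals E) along a trailing left singular vector."""
    U, S, _ = np.linalg.svd(X_train, full_matrices=False)
    r = np.sum(S > 1e-10) - 1         # smallest non-zero singular mode
    return y_train + scale * U[:, r]  # the refit model's training loss is unchanged
```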
## Intuition for Nonlinear Models

Although we mathematically studied ordinary linear regression, the intuition for why the test loss diverges extends
to nonlinear models, such as polynomial regression, and even to certain classes of deep neural networks.
For a concrete example of how our intuition can shed
light on the behavior of nonlinear models, Henighan et al. 2023
recently discovered interesting properties of shallow nonlinear autoencoders: depending on the number of training data,
(1) autoencoders either store data points or features, and (2) the test loss increases sharply between these two
regimes ([Fig. 11](#fig_henighan)).

<div>
-
-
- {% include figure.html path="assets/img/2024-05-07-double-descent-demystified/henighan2023superposition.png" class="img-fluid rounded z-depth-1"%} -
-
-
Figure 11. Superposition, Memorization and Double Descent in Nonlinear Shallow Autoencoders.
Figure from Henighan et al. 2023.
</div>
-
Our work sheds light on the results in two ways:

1. Henighan et al. 2023 write, "It’s interesting to note that we’re observing double descent in the absence of label noise." Our work clarifies that noise, in the sense of a random quantity, is _not_ necessary to produce double descent. Rather, what is necessary is _residual errors from the perspective of the model class_ ($E$, in our notation). Those errors could be entirely deterministic, such as a nonlinear model attempting to fit a noiseless linear relationship, or other model misspecifications.

2. Henighan et al. 2023 write, "[Our work] suggests a naive mechanistic theory of overfitting and memorization: memorization and overfitting occur when models operate on 'data point features' instead of 'generalizing features'." Our work hopefully clarifies that this dichotomy is incorrect: when overparameterized, data point features are akin to the Gram matrix $X X^T$, and when underparameterized, generalizing features are akin to the second moment matrix $X^T X$. It also shows that data point features can and very often do generalize, and that there is a deep connection between the two, i.e., their shared spectra.

## Conclusion

In this work, we intuitively and quantitatively explained why the test loss misbehaves based on three interpretable
factors, tested our understanding via ablations, connected our understanding to adversarial test examples and
adversarial training datasets, and added conceptual clarity to recent discoveries in nonlinear models.
\ No newline at end of file
diff --git a/_posts/2024-05-07-dpi-fsvi.md b/_posts/2024-05-07-dpi-fsvi.md
deleted file mode 100644
index a8d084ac..00000000
--- a/_posts/2024-05-07-dpi-fsvi.md
+++ /dev/null
@@ -1,1267 +0,0 @@
---
layout: distill
title: "Bridging the Data Processing Inequality and Function-Space Variational Inference"
description: >-
  This blog post explores the interplay between the Data Processing Inequality (DPI), a cornerstone concept in information theory, and Function-Space Variational Inference (FSVI) within the context of Bayesian deep learning. The DPI governs the transformation and flow of information through stochastic processes, and its unique connection to FSVI is employed to highlight FSVI's focus on Bayesian predictive posteriors over parameter space. Throughout the post, theoretical concepts are intertwined with intuitive explanations and mathematical rigor, offering a comprehensive understanding of these complex topics. The post concludes by bringing together various ideas to explain why the choice of predictive priors (initial probability distributions assumed for model predictions before training) is important for training machine learning models and preventing overfitting. It also discusses the practical implications of these concepts in areas such as continual learning and knowledge distillation. By examining these concepts in depth, the post provides valuable insights for both theory and practice in machine learning, making it an informative resource for researchers and practitioners.
-date: 2024-05-07 -future: true -htmlwidgets: true - -authors: - - name: Andreas Kirsch - url: "https://www.blackhc.net" - affiliations: - name: University of Oxford (Former Affiliation) - -# authors: -# - name: Albert Einstein -# url: "https://en.wikipedia.org/wiki/Albert_Einstein" -# affiliations: -# name: IAS, Princeton -# - name: Boris Podolsky -# url: "https://en.wikipedia.org/wiki/Boris_Podolsky" -# affiliations: -# name: IAS, Princeton -# - name: Nathan Rosen -# url: "https://en.wikipedia.org/wiki/Nathan_Rosen" -# affiliations: -# name: IAS, Princeton - -# must be the exact same name as your blogpost -bibliography: 2024-05-07-dpi-fsvi.bib - -# Add a table of contents to your post. -# - make sure that TOC names match the actual section names -# for hyperlinks within the post to work correctly. -# - please use this format rather than manually creating a markdown table of contents. -toc: - - name: Introduction - - name: "Background: Information-Theoretic Notation" - - name: "Data Processing Inequality" - subsections: - - name: "Example: Image Processing Pipeline" - - name: "Example: Supervised Learning" - - name: "Example: Autoencoders" - - name: "Proof of the DPI" - - name: "🥬 Data Processing Inequality" - subsections: - - name: "Example: Comparing Image Distributions" - - name: "Counter-Example: Bayesian Inference" - - name: "Proofs of the 🥬 DPI" - - name: Overall Statement - - name: "Other Data Processing Inequalities" - subsections: - - name: "Jensen-Shannon Divergence" - - name: "JSD-DPI" - - name: "Mutual Information" - - name: "Function-Space Variational Inference" - subsections: - - name: "Problem Setting & Notation" - - name: "Chain Rule of the 🥬 Divergence & DPI" - - name: "Deriving the Functional ELBO" - - name: "Choosing the \"Coreset\"" - - name: "Application to Continual Learning" - - name: Comparison to FSVI in the Literature - - name: The Equality Case and Equivalence Classes - subsections: - - name: "Equivalence Classes" - - name: "Consistency" - - name: "Equality & Symmetries" - - name: "Predictive Prior" - - name: "Parameter Priors vs. Predictive Priors" - subsections: - - name: "Label Entropy Regularization" - - name: "Knowledge Distillation" - - name: Conclusion - -# Below is an example of injecting additional post-specific styles. -# This is used in the 'Layouts' section of this post. -# If you use this post as a template, delete this _styles block. 
_styles: >
  .fake-img {
    background: #bbb;
    border: 1px solid rgba(0, 0, 0, 0.1);
    box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
    margin-bottom: 12px;
  }
  .fake-img p {
    font-family: monospace;
    color: white;
    text-align: left;
    margin: 12px 0;
    text-align: center;
    font-size: 16px;
  }
  .box-note, .box-warning, .box-error, .box-important {
    padding: 15px 15px 15px 10px;
    margin: 20px 20px 20px 5px;
    border: 1px solid #eee;
    border-left-width: 5px;
    border-radius: 5px 3px 3px 5px;
  }
  d-article .box-note {
    background-color: #eee;
    border-left-color: #2980b9;
  }
  d-article .box-warning {
    background-color: #fdf5d4;
    border-left-color: #f1c40f;
  }
  d-article .box-error {
    background-color: #f4dddb;
    border-left-color: #c0392b;
  }
  d-article .box-important {
    background-color: #d4f4dd;
    border-left-color: #2bc039;
  }
  html[data-theme='dark'] d-article .box-note {
    background-color: #333333;
    border-left-color: #2980b9;
  }
  html[data-theme='dark'] d-article .box-warning {
    background-color: #3f3f00;
    border-left-color: #f1c40f;
  }
  html[data-theme='dark'] d-article .box-error {
    background-color: #300000;
    border-left-color: #c0392b;
  }
  html[data-theme='dark'] d-article .box-important {
    background-color: #003300;
    border-left-color: #2bc039;
  }
  html[data-theme='dark'] d-article blockquote {
    color: var(--global-text-color) !important;
  }
  html[data-theme='dark'] d-article summary {
    color: var(--global-text-color) !important;
  }
  d-article aside * {
    color: var(--global-text-color) !important;
  }
  d-article p {
    text-align: justify;
    text-justify: inter-word;
    -ms-hyphens: auto;
    -moz-hyphens: auto;
    -webkit-hyphens: auto;
    hyphens: auto;
  }
  d-article aside {
    border: 1px solid #aaa;
    border-radius: 4px;
    padding: .5em .5em 0;
    font-size: 90%;
  }
  d-article aside p:first-child {
    margin-top: 0;
  }
  d-article details {
    border: 1px solid #aaa;
    border-radius: 4px;
    padding: .5em .5em 0;
  }
  d-article summary {
    font-weight: bold;
    margin: -.5em -.5em 0;
    padding: .5em;
    display: list-item;
  }
  d-article details[open] {
    padding: .5em;
  }
  d-article details[open] summary {
    border-bottom: 1px solid #aaa;
    margin-bottom: .5em;
  }
categories:
- Data Processing Inequality
- Information Theory
- Function-Space Variational Inference
- Parameter Equivalence Classes
- Entropy Regularization
- Label Entropy Regularization
---

{% raw %}
<div>
-$$\require{mathtools} -\DeclareMathOperator{\opExpectation}{\mathbb{E}} -\newcommand{\E}[2]{\opExpectation_{#1} \left [ #2 \right ]} -\newcommand{\simpleE}[1]{\opExpectation_{#1}} -\newcommand{\MidSymbol}[1][]{\:#1\:} -\newcommand{\given}{\MidSymbol[\vert]} -\DeclareMathOperator{\opmus}{\mu^*} -\newcommand{\IMof}[1]{\opmus[#1]} -\DeclareMathOperator{\opInformationContent}{H} -\newcommand{\ICof}[1]{\opInformationContent[#1]} -\newcommand{\xICof}[1]{\opInformationContent(#1)} -\DeclareMathOperator{\opEntropy}{H} -\newcommand{\Hof}[1]{\opEntropy[#1]} -\newcommand{\xHof}[1]{\opEntropy(#1)} -\DeclareMathOperator{\opMI}{I} -\newcommand{\MIof}[1]{\opMI[#1]} -\DeclareMathOperator{\opTC}{TC} -\newcommand{\TCof}[1]{\opTC[#1]} -\newcommand{\CrossEntropy}[2]{\opEntropy(#1 \MidSymbol[\Vert] #2)} -\DeclareMathOperator{\opKale}{D_\mathrm{KL}} -\newcommand{\Kale}[2]{\opKale(#1 \MidSymbol[\Vert] #2)} -\DeclareMathOperator{\opJSD}{D_\mathrm{JSD}} -\newcommand{\JSD}[2]{\opJSD(#1 \MidSymbol[\Vert] #2)} -\DeclareMathOperator{\opp}{p} -\newcommand{\pof}[1]{\opp(#1)} -\newcommand{\hpof}[1]{\hat{\opp}(#1)} -\newcommand{\pcof}[2]{\opp_{#1}(#2)} -\newcommand{\hpcof}[2]{\hat\opp_{#1}(#2)} -\DeclareMathOperator{\opq}{q} -\newcommand{\qof}[1]{\opq(#1)} -\newcommand{\hqof}[1]{\hat{\opq}(#1)} -\newcommand{\qcof}[2]{\opq_{#1}(#2)} -\newcommand{\varHof}[2]{\opEntropy_{#1}[#2]} -\newcommand{\xvarHof}[2]{\opEntropy_{#1}(#2)} -\newcommand{\varMIof}[2]{\opMI_{#1}[#2]} -\newcommand{\w}{\boldsymbol{\theta}} -\newcommand{\W}{\boldsymbol{\Theta}} -\DeclareMathOperator{\opf}{f} -\newcommand{\fof}[1]{\opf(#1)} -\newcommand{\Dany}{\mathcal{D}} -\newcommand{\y}{y} -\newcommand{\Y}{Y} -\newcommand{\L}{\boldsymbol{L}} -\newcommand{\x}{\boldsymbol{x}} -\newcommand{\X}{\boldsymbol{X}} -\newcommand{\pdata}[1]{\hpcof{\text{data}}{#1}} -\newcommand{\normaldist}[1]{\mathcal{N}(#1)} -$$ -
{% endraw %}

## Introduction

In information theory, the **data processing inequality (DPI)** expresses a fundamental idea: processing data (stochastically) cannot increase information. The DPI provides us with a powerful intuition about what information processing systems can do and what the limitations of data processing are.

In this blog post, we first study the DPI, developing intuition through vivid examples and detailed proofs---especially the equality case, which is arguably the best way to understand inequalities. We will consider classic forms of the DPI as well as DPIs relating probability distributions more broadly.
Then, we explore the intriguing connection between DPI and **function-space variational inference (FSVI)**, a modern Bayesian deep learning technique that focuses on the Bayesian predictive posterior rather than the parameter space. Exploring this connection is important because it can provide new insights into FSVI on a fundamental level. We apply the DPI to recover several interesting results from the literature in a simple form and build intuitions for the relationship between parameter and functional priors.

Most importantly, we consider how FSVI can measure a *predictive* divergence between the approximate and true posterior which is independent of parameter symmetries. (With parameter symmetries, I refer to different parameters that yield the same predictions, which is very common in over-parameterized neural networks: think of parameter symmetries like different paths leading to the same destination; they might look different but end up at the same predictions. Thanks to ChatGPT for this analogy! 🤗) Explaining this connection is one of the main goals of this article and will help you understand the relationships between DPI, FSVI, and other deep learning methods.
As a concrete example and application, we relate FSVI to training with knowledge distillation and label entropy regularization: potentially more meaningful priors than the ones usually used in Bayesian neural networks. (In many papers, an isotropic Gaussian is used because of its simplicity. Indeed, there are better alternatives; see Fortuin et al. (2022) and Fortuin (2022).) This connection highlights the practical relevance of the theoretical concepts discussed in this post and will hopefully inspire the reader to view Bayesian deep learning from a new point of view.

### TL;DR

The following sections summarize the key takeaways of this blog post. If they don't make sense, don't worry: they will after reading this post.

#### Data Processing Inequality

The data processing inequality examines how information cannot increase due to processing. In information theory, it is usually stated based on a Markov chain of random variables $$X \rightarrow Y \rightarrow Z$$ and their mutual information. We will look at different data processing inequalities that relate different distributions instead of different random variables. This blog post in particular looks at the DPI when formulated using Kullback-Leibler (KL) divergences between distributions. I will use "🥬 divergence" in headings to add a bit of color. 😊

Concretely, this KL DPI states that processing data stochastically can only reduce information. More formally, for a stochastic mapping $$Y = \fof{\W}$$:

$$
\Kale{\qof{\W}}{\pof{\W}} \ge \Kale{\qof{Y}}{\pof{Y}}.
$$

That is, the KL divergence between $$\qof{Y}$$ and $$\pof{Y}$$ cannot be larger than the one between the original $$\qof{\W}$$ and $$\pof{\W}$$.
Intuitively, the stochastic mapping $$\opf$$ induces a bottleneck that reduces how well we can distinguish between $$\opp$$ and $$\opq$$. Finally, we have equality when $$\Kale{\qof{\W \given Y}}{\pof{\W \given Y}} = 0$$.

The paper "*Understanding Variational Inference in Function-Space*" by Burt et al. (2021) succinctly summarizes the DPI as follows:
-The data processing inequality states that if two random variables are transformed in this way, they cannot become easier to tell apart. -
</div>

#### Function-Space Variational Inference

Generally, *variational inference* is a powerful technique for approximating complex Bayesian posteriors with simpler distributions. In its usual form, it optimizes an approximate, *variational* distribution to match the *Bayesian **parameter** posterior* as closely as possible. This way, it transforms the problem of Bayesian inference into an optimization problem.

However, especially for deep neural networks, obtaining a good approximation of the parameter space can be difficult. One reason is the sheer size of the parameter space. Additionally, the parameterization of a neural network often contains many symmetries---different parameter configurations can lead to the same predictions of the model---that are not taken into account either.

Here, **Function-space variational inference (FSVI)** side-steps some of these restrictions by only requiring that the variational distribution matches the *Bayesian **predictive** posterior*: whereas regular variational inference regularizes towards a parameter prior, FSVI regularizes towards a data prior. This is especially useful when the parameter prior is not very meaningful, e.g. an isotropic Gaussian prior, which is often used in Bayesian neural networks.

## Background: Information-Theoretic Notation

Information theory deals with the communication of information. (See the excellent "Visual Information Theory" by Chris Olah for a visual introduction to information theory.) In this blog post, we use a unified information-theoretic notation to express various quantities related to probability distributions and their relationships. (It largely follows "A Practical & Unified Notation for Information-Theoretic Quantities in ML".) Here are some key concepts we will use:

The **information content** of an event $$x$$ is denoted as $$\Hof{x}$$ and is defined as $$-\log \pof{x}$$. It represents the minimum amount of information needed to describe the occurrence of $$x$$ given an underlying probability distribution.
In machine learning, this information content is often used as a minimization objective, represented as the negative log-likelihood or cross-entropy when averaged over a dataset.

The **entropy** $$\Hof{X}$$ of a random variable $$X$$ is the expectation of its information content:

$$
\Hof{X} \triangleq \E{\pof{x}}{\Hof{x}} = \E{\pof{x}}{-\log \pof{x}}.
$$

The entropy measures the average amount of information needed to describe the random variable $$X$$. It provides a measure of uncertainty or randomness associated with $$X$$. We can similarly define the entropy of a conditional distribution $$\Hof{X \given Y}$$ and the joint entropy $$\Hof{X, Y}$$.

The **mutual information** $$\MIof{X;Y}$$ between two random variables $$X$$ and $$Y$$ is a measure of the amount of information that one random variable contains about the other. It is defined as:

$$
\begin{aligned}
\MIof{X;Y} & \triangleq \Hof{X} - \Hof{X \given Y} \\
&= \Hof{Y} - \Hof{Y \given X} \\
&= \Hof{X} + \Hof{Y} - \Hof{X, Y}.
\end{aligned}
$$

We will also use the **Kullback-Leibler divergence** $$\Kale{\pof{X}}{\qof{X}}$$ and the **cross-entropy** $$\CrossEntropy{\pof{X}}{\qof{X}}$$:

$$
\begin{aligned}
\CrossEntropy{\pof{X}}{\qof{X}} & = \E{\pof{x}}{-\log \qof{x}}\\
\Kale{\pof{X}}{\qof{X}} & = \CrossEntropy{\pof{X}}{\qof{X}} - \Hof{X}
\end{aligned}
$$

The cross-entropy quantifies the average number of bits needed to encode samples drawn from the true distribution $$\pof{X}$$ using a different distribution $$\qof{X}$$. The Kullback-Leibler divergence is a measure of the difference between two probability distributions and captures the additional bits needed to encode samples from $$\pof{X}$$ with $$\qof{X}$$ compared to encoding them with the true distribution $$\pof{X}$$.

Now that we have covered the notation, let's delve into the data processing inequality.

## Data Processing Inequality

The **data processing inequality (DPI)** is a fundamental inequality in information theory that states that the mutual information between two random variables cannot increase through processing. The original DPI is typically stated for a Markov chain of random variables $$X \rightarrow Y \rightarrow Z$$ and relates the mutual information terms as follows:

$$
\MIof{X;Y} \ge \MIof{X;Z}.
$$

We can view $$\rightarrow$$ as a processing or transition step that maps $$X$$ to $$Y$$ and $$Y$$ to $$Z$$, where the mapping can be deterministic or stochastic.
The inequality tells us that processing the random variable $$X$$ to obtain $$Y$$ and further processing $$Y$$ to obtain $$Z$$ cannot increase the mutual information between $$X$$ and $$Z$$ compared to the mutual information between $$X$$ and $$Y$$.

The following three scenarios illustrate the data processing inequality using different mappings:

### Example: Image Processing Pipeline

Consider an image processing pipeline with the following steps. Let:

* $$X$$ be the original image data;
* $$Y$$ be a compressed version of the image; and
* $$Z$$ be $$Y$$ after adding blur and pixelation.

In this case, $$X$$ has more mutual information with $$Y$$ than with $$Z$$. The compression reduces information, but the image is still recognizable. However, after the additional processing of blurring and pixelating, the mutual information between $$X$$ and $$Z$$ is further reduced. This gives an intuitive example of how additional processing on data reduces the mutual information with the original data. Each processing step results in some loss of information.

### Example: Supervised Learning

Consider a supervised learning pipeline with the following steps. Let:

* $$X$$ be the input features;
* $$Y$$ be the intermediate representations learned by the model; and
* $$Z$$ be the model predictions.

Here, $$X \rightarrow Y \rightarrow Z$$ forms a Markov chain. The data processing inequality tells us that the mutual information between the inputs $$X$$ and predictions $$Z$$ cannot exceed the mutual information between the inputs $$X$$ and intermediate representations $$Y$$:

$$\MIof{X; Y} \geq \MIof{X; Z}.$$

This makes intuitive sense---the intermediate representations $$Y$$ are obtained by processing the raw inputs $$X$$, so they cannot contain more information about $$X$$ than $$X$$ itself. The predictions $$Z$$ are obtained by further processing $$Y$$, so additional information may be lost, reducing the mutual information with the original inputs $$X$$.

As a more concrete example, consider an image classification model.
Let:

* $$X$$ be the input images;
* $$Y$$ be the activations of the convolutional layers; and
* $$Z$$ be the predicted image labels.

The convolutional layers will extract features from the input images, but cannot extract more information than is present in the original images. The predicted labels are obtained by further processing these convolutional features, so they may lose some fine-grained information about the original inputs.

### Example: Autoencoders

An autoencoder compresses the input $$X$$ into a latent code $$Y$$ and then tries to reconstruct the original input from the code, producing $$\hat{X}$$. Let:

* $$X$$ be the input;
* $$Y$$ be the latent code; and
* $$\hat{X}$$ be the reconstruction.

The data processing inequality tells us again:

$$\MIof{X; Y} \geq \MIof{X; \hat{X}}.$$

The latent code $$Y$$ is obtained by compressing $$X$$, so it cannot contain more information. The reconstruction $$\hat{X}$$ tries to recover $$X$$ from $$Y$$, but some information may be lost, reducing the mutual information with $$X$$.

Intuitively, autoencoders try to preserve as much mutual information between inputs $$X$$ and reconstructions $$\hat{X}$$ as possible by learning latent representations $$Y$$ that compress inputs without losing too much information. The data processing inequality quantifies this information bottleneck.

### Proof of the DPI

The proof is simple and connects the DPI to another important inequality.

First, we note that the Markov chain implies the following factorization of the joint distribution:

$$
\pof{x, y, z} = \pof{x} \pof{y \given x} \pof{z \given y}.
$$

Using this factorization, we can express the mutual information terms:

$$
\begin{aligned}
\MIof{X;Y} &= \Hof{X} - \Hof{X \given Y} \\
&\ge \Hof{X} - \Hof{X \given Z} \\
&= \MIof{X;Z}.
\end{aligned}
$$

This relies on $$\Hof{X \given Y} \le \Hof{X \given Z}$$. Why is this true?

We have the following chain of inequalities:

$$
\Hof{X \given Y} = \underbrace{\MIof{X ; Z \given Y}}_{\overset{(1)}{=}0} + \Hof{X \given Y, Z} \overset{(2)}{\le} \Hof{X \given Z}.
$$

**(1)** follows from the Markov chain property: when $$X \rightarrow Y \rightarrow Z$$, $$X$$ does not depend on $$Z$$ at all when conditioned on $$Y$$; and **(2)** follows from the fact that conditioning reduces entropy, i.e. $$\Hof{A \given B} \le \Hof{A}.$$

The equality gap $$\Hof{X \given Z} - \Hof{X \given Y, Z}$$ corresponds to the mutual information $$\MIof{X ; Y \given Z}$$. This mutual information measures the extra information about $$X$$ contained in $$Y$$ that is not already conveyed by $$Z$$. It is zero if and only if $$X \rightarrow Z \rightarrow Y$$ forms a Markov chain, indicating that $$Z$$ is a sufficient statistic for $$X$$.

<div>
-Proof of (2) "Conditioning Reduces Entropy": -We can easily show that conditioning reduces entropy by using the non-negative property of the mutual information: - -$$ -\begin{aligned} -0 &\le \Kale{\pof{X,Y}}{\pof{X}\pof{Y}} \\ -&= \MIof{X;Y} \\ -&= \Hof{X} - \Hof{X \given Y} \\ -\implies \Hof{X \given Y} &\le \Hof{X}. -\end{aligned} -$$ -
</div>

The fact that conditioning reduces entropy, $$\Hof{X} \ge \Hof{X \given Y}$$, is an important property by itself and is reminiscent of the data processing inequality.
The conditional entropy $$\Hof{X \given Y}$$ quantifies the remaining uncertainty about $$X$$ after observing $$Y$$. If $$X$$ and $$Y$$ are independent, then $$\Hof{X} = \Hof{X \given Y}$$, as knowing $$Y$$ does not provide any information about $$X$$. On the other hand, if $$Y$$ completely determines $$X$$, then $$\Hof{X \given Y} = 0$$, as there is no remaining uncertainty about $$X$$ once $$Y$$ is known. In general, conditioning can only reduce the uncertainty about $$X$$, but it does not necessarily reduce it to zero.

Let's move on and consider the KL data processing inequality.

## 🥬 Data Processing Inequality

A similar DPI can be expressed for different distributions $$\pof{x}$$ and $$\qof{x}$$ of the same random variable and the KL divergence between them.
This DPI states that if we evolve two distributions using the same *transition function*, they cannot become less similar. The KL divergence is sometimes also referred to as "relative entropy", so we could also call this the "*relative data processing inequality*".

This can be formalized for distributions $$\pof{x}$$ and $$\qof{x}$$ and a stochastic transition function $$X \overset{\fof{y \given x}}{\longrightarrow} Y$$. Here, we use that such a stochastic mapping $$Y = \fof{X}$$ is equivalent to having a probability (density) $$\fof{y \given x}$$:

$$
\Kale{\pof{X}}{\qof{X}} \ge \Kale{\pof{Y}}{\qof{Y}},
$$

where $$\pof{y \given x} = \fof{y \given x} = \qof{y \given x}$$. The marginals after the transition are $$\pof{y} = \E{\pof{x}}{\fof{y \given x}}$$ and $$\qof{y} = \E{\qof{x}}{\fof{y \given x}}$$, so more explicitly:

$$
\Kale{\pof{X}}{\qof{X}} \ge \Kale{\E{\pof{x}}{\fof{Y \given x}}}{\E{\qof{x}}{\fof{Y \given x}}}.
$$

In their book [Elements of Information Theory](https://www.wiley.com/en-us/Elements+of+Information+Theory%2C+2nd+Edition-p-9780471241959), Cover and Thomas describe this as "relative entropy never increases" and relate it to the second law of thermodynamics.

### Example: Comparing Image Distributions

As an example, let:

* $$\pof{x}$$ be the true distribution of images in a dataset;
* $$\qof{x}$$ be a generative model that tries to mimic $$\pof{x}$$; and
* $$\fof{y \given x}$$ be a function that thresholds images $$x$$ into bilevel black and white images $$y$$.

Then $$\pof{y}$$ and $$\qof{y}$$ will be more difficult to distinguish after the thresholding operation than $$\pof{x}$$ and $$\qof{x}$$. Converting to black and white images has lost information that could help distinguish the real and generated distributions.

This provides some intuition for why the KL divergence between distributions decreases under a shared stochastic mapping, as formalized by the KL data processing inequality. Processing through $$\fof{y \given x}$$ makes the distributions harder to tell apart.

### Counter-Example: Bayesian Inference

It might be tempting to think that this data processing inequality also applies to Bayesian inference, that is, updating the model parameters based on new evidence. Then, we could argue that if two agents start with different prior beliefs but update based on the same evidence, their posterior beliefs will become more similar. However, this intuition is flawed: the data processing inequality does not apply to Bayesian inference.

Let's walk through why.
Let:

* $$\pof{\w}$$ be an agent's prior belief;
* $$\qof{\w}$$ be another agent's different prior;
* $$\pof{\w\given x}$$ be the posterior after observing data $$x$$; and
* $$\qof{\w\given x}$$ be the other agent's posterior.

The priors $$\pof{\w}$$ and $$\qof{\w}$$ may have large divergence, representing very different initial beliefs. However, when conditioning on the same data $$x$$, the KL divergence between $$\pof{\w \given x}$$ and $$\qof{\w \given x}$$ could increase or decrease---the data processing inequality does not give us any guarantee.

This is because $$\pof{\w}$$ and $$\qof{\w}$$ are not evolving under the same stochastic mapping. Rather, each prior is mapped to its respective posterior via Bayes' rule, which operates differently on $$\opp$$ and $$\opq$$:

$$
\begin{aligned}
\pof{\w \given x} &= \frac{\pof{x \given \w}}{\pof{x}} \, \pof{\w}\\
\qof{\w \given x} &= \frac{\qof{x \given \w}}{\qof{x}} \, \qof{\w}.
\end{aligned}
$$

Even assuming that both agents have the same internal model, that is, they use the same likelihood $$\pof{x \given \w} = \qof{x \given \w}$$, the priors $$\pof{\w}$$ and $$\qof{\w}$$ will still influence the posterior distributions differently because they lead to different evidence terms $$\pof{x}$$ and $$\qof{x}$$:

$$
\begin{aligned}
\pof{x} &= \E{\pof{\w}}{\pof{x \given \w}}\\
\qof{x} &= \E{\qof{\w}}{\qof{x \given \w}}.
\end{aligned}
$$

Thus, the correct intuition is that observing the same data $$x$$ does not necessarily bring the posterior beliefs closer together---they depend on the interplay between their specific priors and likelihoods. The data processing inequality does not directly apply to this Bayesian updating scenario:

$$
\Kale{\qof{\W}}{\pof{\W}} {\color{red}{\not\ge}} \Kale{\qof{\W \given \mathcal{D}}}{\pof{\W \given \mathcal{D}}}.
$$

This counterexample highlights the importance of precisely understanding the assumptions underlying conceptual principles like the DPI. While the DPI provides insight about information dynamics in many cases, it does not universally apply, as exemplified here by Bayesian updating under different priors. As always, bear in mind that:

As we currently also seem to experience a world of increasing polarization, this counterexample might also serve as a reminder that different priors can lead to different beliefs, even when observing the same evidence. This is a fundamental aspect of Bayesian inference and the scientific method.

### Proofs of the 🥬 DPI

We will prove this inequality in two different ways. First, we will develop a "brute-force" proof, and then we will look at a more elegant proof that follows Cover and Thomas. Importantly, we will also consider the equality case in detail.

#### Brute-force Proof

If $$\opp$$ does not have support in $$\opq$$, the inequality is trivially true because then $$\Kale{\pof{X}}{\qof{X}}=\infty$$.

Thus, let's now assume that $$\opp$$ has support in $$\opq$$.
Then, we can brute-force using the definitions, starting from the cross-entropy: - -$$ -\begin{aligned} -\CrossEntropy{\pof{Y}}{\qof{Y}}&=\CrossEntropy{\pof{Y}}{\E{\qof{x}}{\pof{Y \given x}}}\\ -&=\CrossEntropy{\pof{Y}}{\E{\qof{x}}{\frac{\pof{x \given Y}\pof{Y}}{\pof{x}}}}\\ -&=\CrossEntropy{\pof{Y}}{\E{\pof{x \given Y}}{\frac{\qof{x}}{\pof{x}}}}+\CrossEntropy{\pof{Y}}{\pof{Y}}\\ -&\overset{(1)}{=}\CrossEntropy{\pof{Y}}{\E{\pof{x \given Y}}{\frac{\qof{x}}{\pof{x}}}}+\xHof{\pof{Y}}\\ -&\overset{(2)}{\le}\CrossEntropy{\pof{X, Y}}{\frac{\qof{X}}{\pof{X}}}+\xHof{\pof{Y}}\\ -&\overset{(3)}{=}\CrossEntropy{\pof{X}}{\frac{\qof{X}}{\pof{X}}}+\xHof{\pof{Y}}\\ -&\overset{(4)}{=}\Kale{\pof{X}}{\qof{X}}+\xHof{\pof{Y}}\\ -\iff \Kale{\pof{Y}}{\qof{Y}}&\le\Kale{\pof{X}}{\qof{X}}, -\end{aligned} -$$ - -where we have used **(1)** that the cross-entropy of a distribution with itself is just the entropy, **(2)** that the cross-entropy is convex and we can apply Jensen's inequality, **(3)** that the RHS of the cross-entropy does not depend on $$Y$$ and we can trivially marginalize it out, and **(4)** that the definition of the Kullback-Leibler divergence is equivalent to an (unnormalized) cross-entropy over a fraction. - -This makes it difficult to extract the case for equality, however. - -#### Equality Case - -We have only one inequality in the above proof, and it stems from applying Jensen's inequality. Remembering the equality case for Jensen's inequality, we recall: - - - -For **(2)**, this is sadly slightly more complex than it might seem on first glance. -Let's unwrap the term: - -$$ -\CrossEntropy{\pof{Y}}{\E{\pof{x \given Y}}{\frac{\qof{x}}{\pof{x}}}} = \E{\pof{y}}{-\log \E{\pof{x \given y}}{\frac{\qof{x}}{\pof{x}}}}. -$$ - -We take an expectation over $$\pof{y}$$, so to consider equality, we need to look at the inner expectation over $$\pof{x \given y}$$ separately for (almost) all $$y$$ with $$\pof{y} \not= 0$$. $$-\log x$$ is strictly convex---and thus not linear---so we need the ratio $$\frac{\qof{x}}{\pof{x}}$$ to be constant in $$x$$ for any fixed $$y$$ with $$\pof{y} \not= 0$$---only then do we have equality in Jensen's inequality. - -In the following, I will limit myself to the discrete case to avoid having to deal with measure theory (I currently don't have a good 'toolbox' to express simple ideas cleanly in measure theory; I'm working on it). -To obtain equality, for all $$y$$ with $$\pof{y} \not= 0$$ (i.e. we have support) and for all $$x_1, x_2$$ with $$\pof{x_1 \given y}, \pof{x_2 \given y} \not= 0$$, we need $$\frac{\qof{x_1}}{\pof{x_1}} = \frac{\qof{x_2}}{\pof{x_2}}$$. -Equivalently (for the reader: why is $$\pof{x_1} \not= 0$$ then?): - -$$ -\begin{aligned} -\frac{\qof{x_1}}{\pof{x_1}} &= \frac{\qof{x_2}}{\pof{x_2}} \\ -\iff \qof{x_1} &= \frac{\qof{x_2}}{\pof{x_2}} \, \pof{x_1} \\ -\end{aligned} -$$ - -This means that $$\qof{x} = C_y \pof{x}$$ piecewise for all $$x$$ for which $$\pof{x \given y} \not= 0$$ for some fixed $$y$$ with $$\pof{y} \not= 0$$. That is, if we keep $$y$$ fixed, all the $$x$$ for which $$\pof{x \given y} \not= 0$$ have the same constant factor $$C_y$$. Then for all $$y$$ with $$\pof{y} \not= 0$$, we have equality and overall equality in **(2)**. - -If for any $$x$$ there are multiple $$y$$, e.g. $$y_1, y_2$$ for which $$\pof{x \given y} \not= 0$$, then we have $$C_{y_1} = C_{y_2}$$. - -As the simplest example, if this is the case for all $$y$$, then $$C_y = 1$$ for all $$y$$ by normalization.
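-Before moving on, here is a minimal numerical sanity check of the KL DPI in the discrete case. This is an illustrative sketch and not from any referenced codebase; `kl` and the channel matrix `f` are ad-hoc names, and NumPy's Dirichlet sampler merely provides arbitrary strictly positive distributions.
-
-```python
-import numpy as np
-
-def kl(p, q):
-    """KL divergence between discrete distributions with full support."""
-    return np.sum(p * np.log(p / q))
-
-rng = np.random.default_rng(0)
-p = rng.dirichlet(np.ones(4))            # p(x), strictly positive
-q = rng.dirichlet(np.ones(4))            # q(x), strictly positive
-f = rng.dirichlet(np.ones(3), size=4)    # f[x, y] = f(y | x); rows sum to 1
-
-p_y = p @ f                              # p(y) = sum_x p(x) f(y | x)
-q_y = q @ f
-assert kl(p_y, q_y) <= kl(p, q) + 1e-12  # the KL data processing inequality
-
-# Equality case: q = C_y * p piecewise; the simplest instance is q = p,
-# for which both sides collapse to 0.
-assert np.isclose(kl(p @ f, p @ f), kl(p, p))
-```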
- -As a side-note, this is a great reason why we often require full support for distributions: we can then avoid these piecewise constant factors (and the headaches they might cause). - -#### Simpler Elegant Proof - -Cover and Thomas provide a beautifully simple proof: - - - -What does this mean? Whereas $$\fof{y \given x}$$ is the 'forward' transition function, $$\pof{x \given y}$$ and $$\qof{x \given y}$$ are the 'backward' transition functions. We only have equality when the backward transition functions are equal (almost everywhere). - -The statement on equality is not very informative yet though, so we have to put in a bit more work. Again, this is written for the discrete case. - -This time we explicitly use Bayes' rule to connect the forward and backward transition functions. -First, we fix $$y$$ such that $$\pof{y} \not= 0$$ (i.e. $$y$$ is in the support of $$\pof{y}$$), which then also implies $$\qof{y} \not=0$$. -We have: - -$$ -\begin{aligned} -\pof{x \given y} &= \qof{x \given y} \\ -\overset{\text{ass. }\pof{y} \not= 0}{\iff} \frac{\fof{y \given x}\pof{x}}{\pof{y}} &= \frac{\fof{y \given x}\qof{x}}{\qof{y}} \\ -\overset{\text{ass. }\fof{y \given x}\not= 0}{\iff} \frac{\pof{x}}{\pof{y}} &= \frac{\qof{x}}{\qof{y}} \\ -\iff \pof{x} &= \frac{\pof{y}}{\qof{y}} \, \qof{x}. -\end{aligned} -$$ - -For a given $$y$$ with $$\pof{y} \not=0$$, for the equality case, we see that for all $$x$$ with $$\fof{y \given x} \not= 0$$, $$\pof{x}$$ and $$\qof{x}$$ have to be coupled via piecewise constant factors. - -As another example, if $$\fof{y \given x} \not=0$$ for all possible $$x$$ (i.e. it has full support), for the equality case we have $$\pof{x} = \qof{x}$$. - -Compared to the previous equality case, we went a bit deeper and rewrote the conditions to consider the ratios between $$x$$ and $$y$$. Note we could have shown the same thing in the "brute-force" proof, too. - -Altogether, we have seen that both $$x$$ and $$y$$ are modulated by the same constant factor between $$\pof{\cdot}$$ and $$\qof{\cdot}$$. Essentially, this tells us that we could split our support into unconnected sub-domains and examine each individually for the equality case. - - - -### Overall Statement -We have the following overall statement: - - -($$\pof{x} \ll \qof{x}$$ means that $$\pof{x} > 0$$ implies $$\qof{x} > 0$$, so the KL divergence is not $$\infty$$.) But more precisely, for $$\pof{x} \ll \qof{x}$$, we have equality when: - -$$ -\forall y \text{ with } \pof{y} \not= 0 \ \exists C_y \in \mathbb{R}_{> 0} \ \forall x \text{ with } \fof{y \given x}\not=0\colon \pof{x} = C_y \, \qof{x}. -$$ - -## Other Data Processing Inequalities - -Now, we can use these ideas to derive a few additional results and even close the circle to the original data processing inequality. - -### Jensen-Shannon Divergence - -The KL divergence is not a metric: the triangle inequality does not hold, and it is not symmetric. - -However, we can symmetrize it to obtain the [Jensen-Shannon divergence (JSD)](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence). The JSD is defined as the mean of the two KL divergences of the two distributions from their average. In essence, it makes the KL divergence symmetric: - -$$ -\begin{aligned} -\fof{x} &= \frac{\pof{x} + \qof{x}}{2}\\ -\JSD{\pof{x}}{\qof{x}} &= \frac{1}{2} \Kale{\pof{x}}{\fof{x}} + \frac{1}{2} \Kale{\qof{x}}{\fof{x}}. -\end{aligned} -$$ - -Similar approaches can be used to "symmetrize" other concepts; for example matrices: $$\frac{1}{2} A + \frac{1}{2} A^T$$ is also symmetric by construction for any matrix $$A$$.
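-The JSD is straightforward to compute. The following sketch (with the ad-hoc helper names `kl` and `jsd`, not from any referenced codebase) checks its symmetry and its upper bound of $$\log 2$$ (in nats), which is attained for distributions with disjoint supports.
-
-```python
-import numpy as np
-
-def kl(p, q):
-    mask = p > 0  # 0 * log(0) = 0 by convention
-    return np.sum(p[mask] * np.log(p[mask] / q[mask]))
-
-def jsd(p, q):
-    m = 0.5 * (p + q)                       # the average distribution f(x)
-    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
-
-rng = np.random.default_rng(1)
-p = rng.dirichlet(np.ones(5))
-q = rng.dirichlet(np.ones(5))
-
-assert np.isclose(jsd(p, q), jsd(q, p))     # symmetric by construction
-assert jsd(p, q) <= np.log(2) + 1e-12       # bounded by log(2)
-disjoint = jsd(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
-assert np.isclose(disjoint, np.log(2))      # bound attained for disjoint supports
-```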
- -The JSD is still not a metric, but its square root is: the square root of the Jensen-Shannon divergence is symmetric and satisfies the triangle inequality, giving us the *Jensen-Shannon distance*, a metric. - -### JSD-DPI - -We can also obtain a data processing inequality for the Jensen-Shannon divergence and the Jensen-Shannon distance: - - - -The proof uses the KL data processing inequality: - -$$ -\begin{aligned} -\JSD{\pof{X}}{\qof{X}} &= \frac{1}{2} \Kale{\pof{X}}{\fof{X}} + \frac{1}{2} \Kale{\qof{X}}{\fof{X}}\\ -&\ge \frac{1}{2} \Kale{\pof{Y}}{\fof{Y}} + \frac{1}{2} \Kale{\qof{Y}}{\fof{Y}}\\ -&= \JSD{\pof{Y}}{\qof{Y}}. -\end{aligned} -$$ - -We verify that $$\fof{y} = \frac{\pof{y} + \qof{y}}{2}$$ is the average of $$\pof{y}$$ and $$\qof{y}$$: - -$$ -\begin{aligned} -\fof{y} &= \E{\fof{x}}{\fof{y \given x}}\\ -&= \E{\frac{\pof{x}+\qof{x}}{2}}{\fof{y \given x}}\\ -&= \frac{1}{2} \E{\pof{x}}{\fof{y \given x}} + \frac{1}{2} \E{\qof{x}}{\fof{y \given x}}\\ -&= \frac{1}{2} \pof{y} + \frac{1}{2} \qof{y}. -\end{aligned} -$$ - -Finally, $$\pof{x}, \qof{x} \ll \fof{x}$$, and the equality condition of the KL data processing inequality gives us: - -$$ -\begin{aligned} -&\Kale{\pof{X \given Y}}{\fof{X \given Y}} = 0 &\\ -\land \quad &\Kale{\qof{X \given Y}}{\fof{X \given Y}} = 0 &\\ -\iff &\pof{x \given y} = \fof{x \given y} \land \qof{x \given y} = \fof{x \given y}& \forall x,y \\ -\iff &\pof{x \given y} = \qof{x \given y}& \forall x,y. -\end{aligned} -$$ - -### Mutual Information - -The JSD can also be expressed as a mutual information. For -$$ -\begin{aligned} -Z &\sim \mathrm{Bernoulli}(\frac{1}{2}) = \fof{Z} \\ -X \given Z = 0 &\sim \pof{x}\\ -X \given Z = 1 &\sim \qof{x}, -\end{aligned} -$$ - -we have: - -$$ -\JSD{\pof{X}}{\qof{X}} = \MIof{X;Z}. -$$ - -This follows from rewriting the mutual information as a KL divergence: - -$$ -\begin{aligned} -\MIof{X;Z} &= \Kale{\fof{X \given Z}}{\fof{X}}\\ -&= \E{\fof{z}} {\Kale{\fof{X \given Z = z}}{\fof{X}}}\\ -&= \frac{1}{2} \Kale{\pof{x}}{\fof{x}} + \frac{1}{2} \Kale{\qof{x}}{\fof{x}}\\ -&= \JSD{\pof{X}}{\qof{X}}. -\end{aligned} -$$ - -We can generalize this to the Markov chain $$Z \rightarrow X \rightarrow Y$$ with $$\fof{z, x, y} = \fof{z} \fof{x \given z} \fof{y \given x}$$ for any distribution $$\fof{z}$$: - -$$ -\begin{aligned} -\MIof{X;Z} &= \Kale{\fof{X \given Z}}{\fof{X}}\\ -&= \E{\fof{z}} {\Kale{\fof{X \given z}}{\fof{X}}}\\ -&\overset{(1)}{\ge} \E{\fof{z}} {\Kale{\fof{Y \given z}}{\fof{Y}}}\\ -&= \Kale{\fof{Y \given Z}}{\fof{Y}}\\ -&= \MIof{Y;Z}, -\end{aligned} -$$ - -where $$(1)$$ follows from the KL data processing inequality. - -This is just the data processing inequality we presented initially. We have gone full circle! - -The equality gap (*Jensen gap*) is $$\Kale{\fof{X \given Y, Z}}{\fof{X \given Y}}$$, and we have equality when: - -$$ -\begin{aligned} -\Kale{\fof{X \given Y, Z}}{\fof{X \given Y}} &= 0\\ -\iff \MIof{X;Z \given Y} &= 0. -\end{aligned} -$$ - -This is exactly when $$X$$ is independent of $$Z$$ given $$Y$$. ($$Y$$ is a sufficient statistic in that case.) - -## Function-Space Variational Inference - -So far we've explored the foundational aspects of the data processing inequality (DPI) and its extended forms, in particular the KL data processing inequality. Through detailed derivations and intuitive examples, we've demonstrated how these inequalities can be applied, emphasizing their significance and limitations. Specifically, we've shown how the KL data processing inequality relates to the reduction in information as data is processed.
The examples and counterexample have hopefully demonstrated the nuances of applying these inequalities in different contexts. - -This exploration sets the stage for diving into function-space variational inference and building up a robust understanding of it, leveraging the insights gained about the DPI and its implications in Bayesian deep learning. - -### Problem Setting & Notation - -In the following, we will consider a classification task with cross-entropy loss, and we will use the following random variables and distributions: - -- $$\y$$ is the label, -- $$\x$$ is the input, -- $$\qof{\y \given \x}$$ is the predictive distribution we want to learn, -- $$\pdata{\y \given \x}$$ is the data distribution, -- $$\Dany$$ is the (training) dataset, and -- $$C$$ is the number of classes. - -The probabilistic model is: - -$$\pof{\y, \w \given \x} = \pof{\y \given \x, \w} \, \pof{\w}.$$ - -As before, I use upper-case letters for random variables, which we take an expectation over, e.g. in the KL divergence, and lower-case letters when I'm referring to specific observations or values that could be substituted (with the exception of $$\Dany$$). - - -### Chain Rule of the 🥬 Divergence & DPI - -An important property of the KL divergence is the chain rule: - -$$ -\begin{aligned} -&\Kale{\qof{\Y_n,...,\Y_1}}{\pof{\Y_n,...,\Y_1}} \\ -&\quad = \sum_{i=1}^n \Kale{\qof{\Y_i \given -\Y_{i-1}, ..., \Y_1}}{\pof{\Y_i \given \Y_{i-1}, ..., \Y_1}}. -\end{aligned} -$$ - -The chain rule yields a *chain inequality* for the DPI as well: - -$$ -\begin{aligned} -\Kale{\qof{\W}}{\pof{\W}} &\ge \Kale{\qof{\Y_n,...,\Y_1}}{\pof{\Y_n,...,\Y_1}}\\ -&\ge \Kale{\qof{\Y_{n-1},...,\Y_1}}{\pof{\Y_{n-1},...,\Y_1}}\\ -&\ge \Kale{\qof{\Y_1}}{\pof{\Y_1}}, -\end{aligned} -$$ - -where we start from the KL DPI and then apply the chain rule. - -### Deriving the Functional ELBO - -The DPI has an intriguing connection to FSVI. Let's say we want to approximate a Bayesian posterior $$\pof{\w \given \Dany}$$ with a variational distribution $$\qof{\w}$$. In standard VI, we would minimize $$\Kale{\qof{\W}}{\pof{\W \given \Dany}}$$ to match the variational distribution to the Bayesian posterior. Specifically: - -$$ -\begin{aligned} -&\Kale{\qof{\W}}{\pof{\W \given \Dany}} =\\ -&\quad = \underbrace{\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\W}}{\pof{\W}}}_{\text{Evidence}\ \text{Bound}} + \log \pof{\Dany} \ge 0 \\ -&\iff \underbrace{-\log \pof{\Dany}}_{=\xHof{\pof{\Dany}}} \le \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\W}}{\pof{\W}}. -\end{aligned} -$$ - -This is an information-theoretic evidence (upper) bound on the information content $$-\log \pof{\Dany}$$ of the data $$\Dany$$, which we can minimize as an objective to approximate $$\pof{\w \given \Dany}$$ via $$\qof{\w}$$. - -In more probability-theory inspired literature, the negative of this bound is called the *evidence lower bound (ELBO)* and is maximized. - -Both the ELBO and the information-theoretic evidence upper bound are equivalent, and we can use either objective, but the information-theoretic perspective is obviously superior 🙃 I'll refer to this as the evidence bound from now on. - -In FSVI (with a caveat I detail below), we apply the DPI to the prior KL divergence term and obtain a "functional" version of the evidence bound: - -$$ -\begin{aligned} -\Kale{\qof{\W}}{\pof{\W}} \ge \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}, -\end{aligned} -$$ - -where
$$\Y... \given \x...$$ are (finite or infinite) sets of samples. That is, we do not only optimize marginal distributions but also joint distributions. - - - - -The resulting objective: - -$$ -\begin{aligned} -\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}} -\end{aligned} -$$ - -is equal to the (negative) *functional ELBO (fELBO)* in "*Functional variational Bayesian neural networks*" by Sun et al. (2019)---with caveats that we discuss below. - -### Choosing the "Coreset" $$\x...$$ - -One important detail is the question of how to choose the $$\x...$$: - -Ideally, we want to choose them such that the DPI inequality is as tight as possible. - -Given the chain inequality, it is obvious that the larger the set $$\x...$$, the tighter the inequality will be. -Hence, if we could choose an infinite set of points well, we might be able to get the tightest possible inequality. -However, this is often intractable in practice. - -Some works take a supremum over finite subsets of a certain size, essentially building a core-set as an approximation (Rudner et al., 2022a/b); -others take an expectation over finite sets of input samples (Sun et al., 2019), which does not necessarily yield the tightest inequality but provides an unbiased estimate; while other works focus on finite datasets for which all points can be taken into account (Klarner et al., 2023). - -We will discuss the tightness of the inequality and the implications in the data limit below. - -Focusing on the most important aspect of FSVI, we observe: - - - -### Application to Continual Learning - -When we directly optimize the KL divergence on a finite input dataset, for example, we align $$\opq$$ with the prior of $$\opp$$ where it matters most: on the predictions of the observed data. - -This is of particular interest in continual learning, where the prior for the next task is chosen to be the posterior from the previous task. In this case, the functional ELBO can be used to approximate the posterior of the previous model while incorporating new data. - -For two great papers that are very readable and provide further insights, see "*Continual learning via sequential function-space variational inference*" and "*Tractable function-space variational inference in Bayesian neural networks*", both by Rudner et al. (2022). - -## Comparison to FSVI in the Literature - - - -In practice, both works by Rudner et al. (2022) linearize the logits (the logits are the final activations of the neural network before applying the softmax function in multi-class classification; they are not to be confused with the pre-logits, e.g. embeddings before the final linear layer), similar to a Laplace approximation, and use the DPI to show (in their notation): - -$$ -\mathbb{D}_{\mathrm{KL}}\left(q_{f(\cdot ; \boldsymbol{\Theta})} \| p_{f(\cdot ; \boldsymbol{\Theta})}\right) \leq \mathbb{D}_{\mathrm{KL}}\left(q_{\Theta} \| p_{\Theta}\right) -$$ - -which in my notation is equivalent to the first application of the DPI above: - -$$ -\Kale{\qof{\L...\given \x...}}{\pof{\L...\given \x...}} \le \Kale{\qof{\W}}{\pof{\W}}.
-$$ - -They maximize the fELBO objective: - -$$ -\begin{aligned} -\mathcal{F}\left(q_{\boldsymbol{\Theta}}\right) &=\mathbb{E}_{q_{f\left(\mathbf{x}_{\mathcal{D}} ; \boldsymbol{\Theta}\right)}}\left[\log p_{\mathbf{y} \mid f(\mathbf{X} ; \boldsymbol{\Theta})}\left(\mathbf{y}_{\mathcal{D}} \mid f\left(\mathbf{X}_{\mathcal{D}} ; \boldsymbol{\theta}\right)\right)\right]\\ -&\quad -\sup _{\mathbf{X} \in \mathcal{X}_{\mathbb{N}}} \mathbb{D}_{\mathrm{KL}}\left(q_{f(\mathbf{X} ; \boldsymbol{\Theta})} \| p_{f(\mathbf{X} ; \boldsymbol{\Theta})}\right), -\end{aligned} -$$ - -which is equivalent to minimizing the information-theoretic objective: - -$$ -\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\L... \given \x...}}{\pof{\L... \given \x...}}, -$$ - -if we choose the $$\x...$$ to tighten the DPI inequality as much as possible (i.e. by "finding" the supremum). - -Using the inequality chain from above, we can sandwich their objective between a regular (negative) ELBO and the (negative) functional ELBO we have derived above: - -$$ -\begin{aligned} -&\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\W}}{\pof{\W}} \\ -&\quad \ge \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\L... \given \x...}}{\pof{\L... \given \x...}} \\ -&\quad \ge \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}. -\end{aligned} -$$ - -**Why are they using logits instead of probabilities?** In practice, using the probabilities instead of logits when performing linearization is often cumbersome due to the non-linearity of the softmax function, which requires Monte-Carlo sampling of the logits to obtain an approximation of the final probabilities. Furthermore, I speculate that sampling the logits can be more benign given that we often use ReLUs in the underlying neural networks. (Don't quote me too strongly on this, though.) - -Conceptually, this explains the derivation of their ELBO objective and also relates them to the 'purer' and simpler functional evidence bound derived above, but this raises the question of how these inequalities are different and what the gap between them tells us. Let's address this question next. - -## The Equality Case and Equivalence Classes - -When do we have equality? That is, when do we have: - -$$\Kale{\qof{\W}}{\pof{\W}} = \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}?$$ - -And what does it tell us? - -As we have seen in the first part of this post, we have equality in the DPI if and only if: - -$$\Kale{\qof{\W \given \Y..., \x...}}{\pof{\W \given \Y..., \x...}}=0$$. - -Given that we are trying to approximate the Bayesian posterior $$\pof{\w \given \Y..., \x...}$$ using $$\qof{\w}$$, this equality condition tells us that we would have to find the exact posterior for equality. -Hence, it is unlikely that we will have equality in practice. From this, the next question immediately follows: what does this predictive prior term - -$$\Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}$$ - -provide us with? - -Another way to think about the gap between the two KL divergences is that one is parameter-based and the other one is not. This points to a deeper truth about overparameterized models used in deep learning: - - - -The functional KL divergences won't be affected by this as they are parameter-free and do not take into account the parameters of the model but only the predictions.
-The regular parameter-based KL divergence, however, would be affected by this---depending on the prior $$\pof{\w}$$, it might express differences between the parameter distributions that have no effect on the outputs. - -In other words, if the prior assigns different probability to otherwise equivalent parameters, this obviously changes the parameter posterior, while the outputs are invariant to these changes if the overall assigned probability to a given output remains the same. - - - -For example, the paper "Deep Ensembles: A Loss Landscape Perspective" by Fort et al. (2020) examines the similarity of the predictions of models trained from different initializations and shows that the prediction space has a multi-modal loss landscape. In the language of FSVI, this is similar to analyzing the function-space distances between different models. - -### Equivalence Classes - -Unless there are other considerations, it makes sense to use priors that assign the same density to parameters that are equivalent. -Hence, for a given function $$\fof{\x ; \w}$$, which determines the likelihood $$\pof{\y \given \x, \w} \triangleq \pof{\y \given \fof{\x ; \w}}$$, we can define an equivalence relation such that $$\w \sim \w'$$ if and only if $$\fof{\x; \w} = \fof{\x; \w'}$$ *for all* $$\x$$. -This equivalence relation partitions the parameter space into equivalence classes: - -$$[\w] \triangleq \{\w' : \fof{\x ; \w'} = \fof{\x ; \w} \quad \forall \x \}.$$ - -A prior $$\pof{\w}$$ induces a prior $$\hpof{[\w]}$$ over the equivalence classes: - -$$\hpof{[\w]} \triangleq \sum_{\w' \in [\w]} \pof{\w'}.$$ - ----or $$\int_{[\w]} \pof{\w'} \, d \w'$$ for continuous $$\w$$---with the corresponding model: - -$$ -\begin{aligned} -\hpof{\y, [\w] \given \x} &\triangleq \hpof{\y \given \x, [\w]} \, \hpof{[\w]} \\ -&= \pof{\y \given \x, \w} \, \hpof{[\w]}. -\end{aligned} -$$ - - - - -### Consistency - -Importantly, the definition of the equivalence classes above is consistent with Bayesian inference: - - - -This is easy to show using Bayes' rule: - -$$ -\begin{aligned} -\hpof{[\w] \given \Dany} &= \hpof{\Dany \given [\w]} \, \hpof{[\w]} / \hpof{\Dany} \\ -&= \pof{\Dany \given \w} \sum_{\w' \in [\w]} \pof{\w'} / \hpof{\Dany} \\ -&= \sum_{\w' \in [\w]} \pof{\Dany \given \w'} \, \pof{\w'} / \hpof{\Dany} \\ -&= \sum_{\w' \in [\w]} \pof{\w' \given \Dany} \, \pof{\Dany} / \hpof{\Dany} \\ -&= \sum_{\w' \in [\w]} \pof{\w' \given \Dany}. -\end{aligned} -$$ - -The last step follows from $$\hpof{\Dany}=\pof{\Dany}$$: - -$$ -\begin{aligned} -\hpof{\Dany} &= \sum_{[\w]} \hpof{\Dany, [\w]} \\ -&= \sum_{[\w]} \sum_{\w' \in [\w]} \pof{\Dany, \w'} \\ -&= \sum_{\w'} \pof{\Dany, \w'} \\ -&= \pof{\Dany}. -\end{aligned} -$$ - -This also tells us that, for any $$\x$$ and $$\y$$: - -$$\pof{\y... \given \x...} = \hpof{\y... \given \x...}$$. - -Given this consistency, we don't have to differentiate between $$\hat\opp$$ and $$\opp$$ and can use $$\opp$$ interchangeably. -The same holds for $$\opq$$. - - - -### Equality & Symmetries - -We can view $$[\cdot]$$ as a projection from $$\w$$ to its equivalence class $$[\w]$$. The DPI then gives us: - -$$ -\Kale{\qof{\W}}{\pof{\W}} \ge \Kale{\qof{[\W]}}{\pof{[\W]}}. -$$ - -And again: what does the gap between the two terms tell us? - - - -Let's look at a few examples to get a better understanding of this. - -#### 1. Trivial Constant Case - - Let $$\fof{\x ; \w} = 0$$, independent of $$\w$$. Then $$[\w] = [\w']$$ for any $$\w$$, $$\w'$$.
- - For any approximate distribution $$\qof{\w}$$, the induced $$\Kale{\qof{[\W]}}{\pof{[\W]}}=0$$, while $$\Kale{\qof{\W}}{\pof{\W}}$$ consists purely of superfluous divergence. - -#### 2. Unused Parameter - - Let $$\y \given (\w_1, \w_2) = \w_1$$ be deterministic but independent of $$\w_2$$. Then $$[(\w_1, \w_2)] = [(\w_1, {\w'}_2)]$$ for any $${\w'}_2$$ and $$[(\w_1,*)]\not=[({\w'}_1, *)]$$ for any $$\w_1 \not= \w'_1$$. - - $$\Kale{\qof{[\W]}}{\pof{[\W]}}=\Kale{\qof{\W_1}}{\pof{\W_1}}$$ captures the meaningful divergence between approximate and true distribution, while $$\Kale{\qof{\W}}{\pof{\W}}$$ also includes any divergence across $$\w_2$$ that has no effect on the predictions. - -#### 3. Periodic Parameter Space - - Finally, let's assume that the predictions are periodic in some way. That is, for example, $$\y = \sin \w$$. We then have $$[\w] = [\w + 2\pi]$$. - - Further, let $$\pof{\w} = \operatorname{U}(\w; [0,2\pi \, N))$$ for some $$N$$ that determines the number of periods. Then, if we introduce another random variable $$K$$ that captures which period we are in, we can (again) use the chain rule to write: - - $$ - \begin{aligned} - \Kale{\qof{\W}}{\pof{\W}} &= \Kale{\qof{\W \given \W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \given \W \in [K\,2\pi, (K+1)\,2\pi]}} \\ - &\quad + \Kale{\qof{\W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \in [K\,2\pi, (K+1)\,2\pi]}} \\ - &= \Kale{\qof{[\W]}}{\pof{[\W]}} \\ - &\quad + \Kale{\qof{\W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \in [K\,2\pi, (K+1)\,2\pi]}}. - \end{aligned} - $$ - - This follows from the setup of this specific example. Finally, we have: - - $$\Kale{\qof{\W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \in [K\,2\pi, (K+1)\,2\pi]}} \le \log N.$$ - - So, if $$\opq$$ only had support in a single period, for example, the difference between $$\Kale{\qof{\W}}{\pof{\W}}$$ and $$\Kale{\qof{[\W]}}{\pof{[\W]}}$$ would be $$\log N$$: the redundancy. - -### Predictive Prior - -How does the predictive prior term fit into this? The DPI again yields the answer: - - - -This tells us that the predictive prior term can at best measure the KL divergence between the equivalence classes of the parameters---and not between the parameters themselves---but luckily, this is the more meaningful divergence anyway! - -For the equality cases, we observe that: - -1. we need a 1:1 mapping between parameters and equivalence classes for the first bound to be tight, and -2. we need $$\Kale{\qof{[\W] \given \Y_n,\x_n,...,\Y_1,\x_1}}{\pof{[\W] \given \Y_n,\x_n,...,\Y_1,\x_1}} \to 0$$ for $$n \to \infty$$ for the second bound to be tight. - -For **2.**: as we know from the chain rule that - -$$\Kale{\qof{\Y_n,...,\Y_1\given\x_n,...,\x_1}}{\pof{\Y_n,...,\Y_1\given\x_n,...,\x_1}}$$ - -is monotonically increasing in $$n$$, and it is bounded by $$\Kale{\qof{[\W]}}{\pof{[\W]}}$$ from above, it *must* converge (it is a bounded, monotonically increasing sequence). So, when does it close the gap? - -To give intuition that it might do that, and without attempting to prove this formally, we can appeal to the [*Bernstein–von Mises* theorem](https://en.wikipedia.org/wiki/Bernstein%E2%80%93von_Mises_theorem), which states that the posterior distribution of the parameters converges to a Gaussian distribution centered at the maximum likelihood estimate (MLE) as the number of data points tends to infinity, *as long as the model parameters are identifiable, that is, the true parameters we want to learn are unique and have support*.
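-To make this intuition concrete, here is a toy illustration (not a proof, and not from the referenced papers) with an identifiable Beta-Bernoulli model: two agents with different Beta priors observe the same coin flips, and the KL divergence between their posteriors, computed with the standard closed form for Beta distributions, shrinks as $$n$$ grows. The helper name `kl_beta` and the chosen priors are ad-hoc.
-
-```python
-import numpy as np
-from scipy.special import betaln, digamma
-
-def kl_beta(a1, b1, a2, b2):
-    """KL(Beta(a1, b1) || Beta(a2, b2)), standard closed form."""
-    return (betaln(a2, b2) - betaln(a1, b1)
-            + (a1 - a2) * digamma(a1)
-            + (b1 - b2) * digamma(b1)
-            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))
-
-rng = np.random.default_rng(2)
-flips = rng.random(10_000) < 0.3               # Bernoulli(0.3) observations
-for n in [0, 10, 100, 1_000, 10_000]:
-    k = int(flips[:n].sum())                   # observed successes
-    # Agent 1: flat Beta(1, 1) prior; Agent 2: opinionated Beta(20, 2) prior.
-    print(n, kl_beta(1 + k, 1 + n - k, 20 + k, 2 + n - k))
-# The printed divergence decreases towards 0 as n grows: the posteriors merge,
-# in line with the Bernstein-von Mises intuition above.
-```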
- -For the evidence bound to be meaningful, we already know that we need support of the approximate distribution $$\opq$$ in the prior $$\opp$$---otherwise, the LHS is $$\infty$$. Moreover, realizing that we take an expectation over $$\qof{\Y_n ,..., \Y_1 \given \x_n ,..., \x_1}$$, we can decompose the KL term for the gap as: - -$$ -\begin{aligned} -&\Kale{\qof{[\W] \given \Y_n,\x_n,...,\Y_1,\x_1}}{\pof{[\W] \given \Y_n,\x_n,...,\Y_1,\x_1}} \\ -&\quad = \E{\qof{\y_n,...,\y_1\given\x_n,...,\x_1}}{\Kale{\qof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}{\pof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}} \\ -&\quad = \simpleE{\qof{[\w']}}{\E{\qof{\y_n,...,\y_1\given\x_n,...,\x_1, [\w']}}{\Kale{\qof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}{\pof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}}}. -\end{aligned} -$$ - -That is, we sample a $$[\w'] \sim \qof{[\w']}$$ and then sample $$\y_n,...,\y_1\given\x_n,...,\x_1$$ from the corresponding $$\qof{\y_n,...,\y_1\given\x_n,...,\x_1, [\w']}$$ and marginalize over these. Crucially, $$[\w']$$ are the true parameters of the data-generating process for the inner KL divergence term. We thus take an expectation over KL terms fulfilling the conditions of the Bernstein–von Mises theorem: - -$$ -\begin{aligned} -\Kale{\qof{[\W] \given \y_n, \x_n, ..., \y_1, \x_1}}{\pof{[\W] \given \y_n, \x_n, ..., \y_1, \x_1}} \to 0. -\end{aligned} -$$ - -In other words, for a given $$[\w']$$, in the space of equivalence classes as defined previously, the equivalence class of all MLE solutions in the data limit, $$[MLE]$$, will be unique by definition---the model is identifiable---and match $$[\w']$$ (this follows from the consistency of MLE estimators but also from Bernstein–von Mises with a flat/uninformative prior). As the MLE is prior-independent once there is support for it, both $$\opq$$ and $$\opp$$ will converge to the MLE $$[\w']$$ with sufficient data. Taking the expectation, this yields $$\Kale{\qof{[\W]\given \Y..., \x...}}{\pof{[\W] \given \Y..., \x...}} \to 0$$ for $$n \to \infty$$, and thus, we have: - -$$ -\begin{aligned} -&\Kale{\qof{[\W]}}{\pof{[\W]}} \\ -&\quad = \sup_{n\in \mathbb{N}} \Kale{\qof{\Y_n,...,\Y_1\given\x_n,...,\x_1}}{\pof{\Y_n,...,\Y_1\given\x_n,...,\x_1}}. -\end{aligned} -$$ - -(Again, this is not a formal proof but an intuition for why the gap might close in the data limit.) - -In my opinion, this is a great result. We have shown both that the predictive prior term converges given our assumptions and that it converges to the symmetry-free parameter-based divergence in the data limit. This is a strong argument for the predictive prior term being meaningful and not just a technical trick. - -Let's appreciate one more thing: the predictive prior can consist of infinitely many data points and still converge to a finite value. - -## Parameter Priors vs. Predictive Priors - -What is the advantage of all this? - -In Bayesian deep learning, we often use parameter priors that are not meaningful and which also do not take parameter symmetries into account. For example, under a unit Gaussian prior over the parameters of a neural network, different parameters do not necessarily induce different predictions. While this prior can be sensible from a parameter compression perspective (e.g. see Hinton and van Camp (1993)), this does not have to be the only consideration guiding us. - -With function priors and predictive priors, we can specify more meaningful priors because we can focus on the predictions and ignore the parameters.
More importantly, this connects Bayesian approaches to data augmentation and other regularization techniques as we will see next. - -Given that priors over equivalence classes are difficult to express explicitly though, using the DPI to obtain a functional ELBO can be an easier way to express and approximate them. - -### Label Entropy Regularization - -All this also helps us gain a new perspective on label entropy regularization. The functional evidence bound can be lower-bounded using the chain rule by: - -$$ -\begin{aligned} -\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}} \\ -\ge \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \E{\pdata{\x}}{\Kale{\qof{\Y \given \x}}{\pof{\Y \given \x}}}, -\end{aligned} -$$ - -where we can expand the term under the second expectation to: - -$$ -\Kale{\qof{\Y \given \x}}{\pof{\Y \given \x}}=\CrossEntropy{\qof{\Y \given \x}}{\pof{\Y \given \x}} - \xHof{\qof{\Y \given \x}}. -$$ - -*Assuming that our prior yields a uniform distribution over the labels*, we can drop the cross-entropy term because it is constant and obtain: - -$$ -\E{\qof{\w}}{-\log \pof{\Dany \given \w}} - \E{\pdata{\x}}{\xHof{\qof{\Y \given \x}}}. -$$ - -This is the same as a maximum-likelihood (MLE) minimization objective with an additional entropy regularization term $$-\xHof{\qof{\Y \given \x}}$$ for different $$\x$$ that prevents the model from overfitting to the labels and collapsing to the one-hot encoding of the labels. - -Thus, in the simplest approximation, the DPI and functional variational inference give us a new perspective on label entropy regularization. - -### Knowledge Distillation - -Obviously, assuming non-uniform prior predictions, $$\E{\pdata{\x}}{\Kale{\qof{\Y \given \x}}{\pof{\Y \given \x}}}$$ can be related to knowledge distillation in deep neural networks as introduced by Hinton et al. (2015). - -The main technical difference is that knowledge distillation uses the reverse KL divergence instead of the forward KL divergence, while the conceptual difference is that we are not distilling the knowledge from a teacher model but from the prior, which we downweight while also training our model on the data itself. However, the connection between knowledge distillation and continual learning using informative priors is evident. - -## Conclusion - -In this blog post, we took a deep dive into the data processing inequality (DPI) and its surprisingly far-reaching implications for modern Bayesian deep learning. By carefully examining the assumptions, equality conditions, and chain rule of the DPI, we arrived at an intuitive understanding of why function-space variational inference (FSVI) can be such a powerful tool. The DPI perspective illuminates how FSVI side-steps issues with high-dimensional parameter spaces by focusing on matching Bayesian predictive posteriors. - -Reasoning about parameter equivalence classes under the lens of the DPI, we saw how predictive KL divergences can capture meaningful differences between models while ignoring superficial discrepancies due to symmetries. This provides a fresh perspective on the advantages of predictive priors over standard parameter priors commonly used in Bayesian neural networks. - -While our treatment only scratched the surface of the full mathematical story, the intuitions we developed allowed us to re-derive key results from the literature and uncover deep connections between seemingly disparate methods like entropy regularization, continual learning, and knowledge distillation.
The examples and proofs peppered throughout solidified the core concepts. - -More than a bag of technical tricks, the DPI reveals itself to be a powerful conceptual tool for reasoning about models, objectives, and algorithms. I hope this post inspires the reader to seek the fundamental principles underpinning machine learning innovations and to use those principles as a guide for future research. With a solid grasp of foundational tools like the DPI, we can all contribute to demystifying and unifying the rapidly evolving field of Bayesian deep learning. - ---- - -**Acknowledgements.** Many thanks to [Freddie Bickford Smith](https://fbickfordsmith.com/) for very helpful comments and feedback on this post and to [Tim Rudner](https://timrudner.com/) for additional pointers to relevant literature and feedback on the FSVI section in particular 🤗 - diff --git a/_posts/2024-05-07-elaborating-on-the-value-of-flow-matching-for-density-estimation.md b/_posts/2024-05-07-elaborating-on-the-value-of-flow-matching-for-density-estimation.md deleted file mode 100644 index 06144f3f..00000000 --- a/_posts/2024-05-07-elaborating-on-the-value-of-flow-matching-for-density-estimation.md +++ /dev/null @@ -1,589 +0,0 @@ ---- -layout: distill -title: Elaborating on the Value of Flow Matching for Density Estimation -description: The transfer of matching-based training from Diffusion Models to Normalizing - Flows makes it possible to fit expressive continuous normalizing flows - efficiently and therefore enables their usage for different kinds - of density estimation tasks. One particularly interesting task is - Simulation-Based Inference, where Flow Matching enabled several - improvements. The post shall focus on the discussion of Flow - Matching for Continuous Normalizing Flows. To highlight the - relevance and the practicality of the method, their use and - advantages for Simulation-Based Inference are elaborated. -date: 2024-05-07 -future: true -htmlwidgets: true - -# Anonymize when submitting -authors: - - name: Maternus Herold - affiliations: - name: BMW Group, appliedAI Institute for Europe gGmbH & University of the Bundeswehr Munich - - name: Faried Abu Zaid - affiliations: - name: appliedAI Institute for Europe gGmbH - -# must be the exact same name as your blogpost -bibliography: 2024-05-07-elaborating-on-the-value-of-flow-matching-for-density-estimation.bib - -# Add a table of contents to your post. -# - make sure that TOC names match the actual section names -# for hyperlinks within the post to work correctly. -# - please use this format rather than manually creating a markdown table of contents. -toc: - - name: Motivation - - name: Continuous Normalizing Flows - - name: Flow Matching - subsections: - - name: Gaussian conditional probability paths - # - name: Generalized Flow-Based Models - - name: Empirical Results - - name: Application of Flow Matching in Simulation-Based Inference - subsections: - - name: Primer on Simulation-Based Inference - - name: Flow Matching for Simulation-Based Inference - - name: A Personal Note - -# Below is an example of injecting additional post-specific styles. -# This is used in the 'Layouts' section of this post. -# If you use this post as a template, delete this _styles block.
-_styles: > - .fake-img { - background: #bbb; - border: 1px solid rgba(0, 0, 0, 0.1); - box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1); - margin-bottom: 12px; - } - .fake-img p { - font-family: monospace; - color: white; - text-align: left; - margin: 12px 0; - text-align: center; - font-size: 16px; - } ---- - -# Motivation - -Normalizing Flows (NF) enable the construction of complex probability distributions by transforming a simple, known distribution into a more complex one. They do so by leveraging the change of variables formula, defining a bijection from the simple distribution to the complex one. - -For a long time, flows were based on chaining several differentiable and invertible transformations. However, these diffeomorphic transformations limit the flows in their complexity, as each transformation has to remain simple. Furthermore, this leads to a trade-off between sampling speed and evaluation performance . Their continuous counterpart, Continuous Normalizing Flows (CNFs), has been held back by limitations in its simulation-based maximum likelihood training . By utilizing Flow Matching, this limitation has been overcome and CNFs have been shown to be a powerful tool for density estimation. - -In the following sections, CNFs and Flow Matching are explained. Following the explanation, the empirical results of Flow Matching are presented. Finally, the application of Flow Matching in Simulation-Based Inference is discussed, which shall highlight their wide applicability and consistent improvement. - -# Continuous Normalizing Flows - -Continuous normalizing flows are among the first applications of neural ordinary differential equations (ODEs) . Instead of the traditional layers of neural networks, the flow is defined by a vector field that is integrated over time. - -$$ - \frac{d}{dt} x(t) = f_{\theta}(x(t), t) -$$ - -The vector field is typically parameterized by a neural network. While traditional layer-based flow architectures need to impose special architectural restrictions to ensure invertibility, CNFs are invertible as long as the uniqueness of the solution of the ODE is guaranteed. This is for instance the case if the vector field is Lipschitz continuous in $$x$$ and continuous in $$t$$. Many common neural network architectures satisfy these conditions. Hence, the above equation defines a diffeomorphism $$\phi_t(x_0) = x_0 + \int_0^t f_{\theta}(x(s), s) \, ds$$ under the discussed assumptions. The change of variables formula can be applied to compute the density of a distribution that is transformed by $$\phi_t$$. - -As usual, a CNF is trained to transform a simple base distribution $$p_B$$, usually a standard normal distribution, into a complex data distribution $$p_D$$. For each point in time $$t\in[0,1]$$ the time-dependent vector field defines a distribution $$p_t$$ (probability path) and the goal is to find a vector field $$f_\theta$$ such that $$p_1=p_D$$. This is usually achieved by maximum likelihood training, i.e. by minimizing the negative log-likelihood of the data under the flow. - -While CNFs are very flexible, they are also computationally expensive to train naively with maximum likelihood since the flow has to be integrated over time for each sample. This is especially problematic for large datasets which are needed for the precise estimation of complex high-dimensional distributions. - -# Flow Matching - -The authors of the Flow Matching paper propose a new method for training CNFs, which avoids the need for simulation.
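-To make the cost of this "simulation" concrete, the following schematic sketch computes the exact negative log-likelihood of a CNF by Euler-integrating the instantaneous change-of-variables ODE backwards in time. The toy linear `vector_field` and its analytic `jacobian_trace` are illustrative stand-ins for a neural network; with a real network, every training sample (or batch) requires such an ODE solve plus a trace estimate.
-
-```python
-import numpy as np
-
-def vector_field(x, t):
-    return -x                                # toy stand-in for f_theta(x, t)
-
-def jacobian_trace(x, t):
-    return -x.shape[-1]                      # tr(df/dx) of the toy field
-
-def neg_log_likelihood(x1, steps=1000):
-    """Euler-integrate x and the log-density change from t=1 back to t=0."""
-    dt = 1.0 / steps
-    x = x1.astype(float)
-    trace_integral = 0.0                     # approximates int_0^1 tr(df/dx) dt
-    for i in range(steps):                   # one full ODE solve per sample!
-        t = 1.0 - i * dt
-        trace_integral += jacobian_trace(x, t) * dt
-        x = x - vector_field(x, t) * dt      # explicit Euler step backwards in time
-    # log p_1(x_1) = log p_0(x_0) - int_0^1 tr(df/dx) dt, with p_0 = N(0, I)
-    log_p0 = -0.5 * (x @ x + len(x) * np.log(2.0 * np.pi))
-    return -(log_p0 - trace_integral)
-
-print(neg_log_likelihood(np.array([0.5, -1.0])))
-```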
The key idea is to regress -the vector field directly from an implicit definition of a target vector field -that defines a probability path $$p_t(x)$$ with $$p_0=p_{B}$$ and $$p_1=p_{D}$$. -Moreover, the authors propose a loss function that directly regresses the time -dependent vector field against the conditional vector fields with respect to -single samples. - - -
-
- {% include figure.html path="assets/img/2024-05-07-elaborating-on-the-value-of-flow-matching-for-density-estimation/imagenet.png" class="img-fluid rounded z-depth-1" %} -
-
-
- Unconditional ImageNet-128 samples of a CNF trained using Flow Matching - with Optimal Transport probability paths. Figure obtained from - . -
- - - - -Assuming that the target vector field is known, the authors propose a -loss function that directly regresses the time dependent vector field: - -$$ - L_{\textrm{FM}}(\omega) = \mathbb{E}_{t, p_t(x)}(|f_{\omega}(x, t) - u_t(x)|^2), -$$ - -where $$u_t$$ is a vector field that generates $$p_t$$ and the expectation with -respect to $$t$$ is over a uniform distribution. Unfortunately, the loss -function is not directly applicable because we do not know how to define the -target vector field. However, it turns out that one can define appropriate -conditional target vector fields when conditioning on the outcome $$x_1$$: - -$$ - p_t(x) = \int p_t(x|x_1)p_{D}(x_1)d x_1. -$$ - - -Using this fact, the conditional flow matching loss can be defined, obtaining -equivalent gradients as the flow matching loss. - -$$ - L_{\textrm{CFM}}(\omega) = \mathbb{E}_{t, p_t(x|x_1), - p_D(x_1)}(|f_{\omega}(x, t) - u_t(x|x_1)|^2). -$$ - -Finally, one can easily obtain an unbiased estimate for this loss if samples -from $$p_D$$ are available, $$p_t(x|x_1)$$ can be efficiently sampled, and -$$u_t(x|x_1)$$ can be computed efficiently. We discuss these points in the -following. - -## Gaussian Conditional Probability Paths - -The vector field that defines a probability path is usually not unique. This is -often due to invariance properties of the distribution, e.g. rotational -invariance. The authors focus on the simplest possible vector fields to avoid -unnecessary computations. They choose to define conditional probability paths -that maintain the shape of a Gaussian throughout the entire process. Hence, the -conditional probability paths can be described by a variable transformation -$$\phi_t(x \mid x_1) = \sigma_t(x_1)x + \mu_t(x_1)$$. The time-dependent functions -$$\sigma_t$$ and $$\mu_t$$ are chosen such that $$\sigma_0(x_1) = 1$$ and $$\sigma_1 = -\sigma_\text{min}$$ (chosen sufficiently small), as well as $$\mu_0(x_1) = 0$$ -and $$\mu_1(x_1)=x_1$$. The corresponding probability path can be written as - -$$ -p_t(x|x_1) = \mathcal{N}(x; \mu_t(x_1), \sigma_t(x_1)^2 I). -$$ - -In order to train a CNF, it is necessary to derive the corresponding conditional -vector field. An important contribution of the authors is therefore the -derivation of a general formula for the conditional vector field $$u_t(x|x_1)$$ -for a given conditional probability path $$p_t(x|x_1)$$ in terms of $$\sigma_t$$ and -$$\mu_t$$: - -$$ - u_t(x\mid x_1) = \frac{\sigma_t'(x_1)}{\sigma_t(x_1)}(x-\mu_t(x_1)) - \mu_t'(x_1), -$$ - -where $$\psi_t'$$ denotes the derivative with respect to time $$t$$. - - -
-
- {% include figure.html path="assets/img/2024-05-07-elaborating-on-the-value-of-flow-matching-for-density-estimation/vectorfields.svg" class="img-fluid rounded z-depth-1" %} -
-
-
- Compared to the diffusion path’s conditional score function, the OT path’s - conditional vector field has constant direction in time and is arguably - simpler to fit with a parametric model. Note the blue color denotes larger - magnitude while red color denotes smaller magnitude. Figure obtained from - . -
- - - - -They show that it is possible to recover certain diffusion training objectives -with this choice of conditional probability paths, e.g. the variance preserving -diffusion path with noise scaling function $$\beta$$ is given by: - -$$ -\begin{align*} - \phi_t(x \mid x_1) &= (1-\alpha_{1-t}^2)x + \alpha_{1-t}x_1 \\\ - \alpha_{t} &= \exp\left(-\frac{1}{2}\int_0^t \beta(s) ds\right) -\end{align*} -$$ - -Additionally, they propose a novel conditional probability path based on optimal -transport, which linearly interpolates between the base and the -conditional target distribution. - -$$ - \phi_t(x \mid x_1) = (1-(1-\sigma_{\text{min}})t)x + tx_1 -$$ - -The authors argue that this choice leads to more natural vector fields, faster -convergence and better results. - - - - -# Empirical Results - -The authors investigate the utility of Flow Matching in the context of image -datasets, employing CIFAR-10 and ImageNet at different resolutions. Ablation -studies are conducted to evaluate the impact of choosing between standard -variance-preserving diffusion paths and optimal transport (OT) paths in Flow -Matching. The authors explore how directly parameterizing the generating vector -field and incorporating the Flow Matching objective enhances sample generation. - - -
-
- {% include figure.html path="assets/img/2024-05-07-elaborating-on-the-value-of-flow-matching-for-density-estimation/imagegen.svg" class="img-fluid rounded z-depth-1" %} -
-
-
- Likelihood (BPD), quality of generated samples (FID), and evaluation time - (NFE) for the same model trained with different methods. Figure from - . -
- - - - -The findings are presented through a comprehensive evaluation using various -metrics such as negative log-likelihood (NLL), Frechet Inception Distance -(FID), and the number of function evaluations (NFE). Flow Matching with OT -paths consistently outperforms other methods across different resolutions. - - -
-
- {% include figure.html path="assets/img/2024-05-07-elaborating-on-the-value-of-flow-matching-for-density-estimation/sampling.svg" class="img-fluid rounded z-depth-1" %} -
-
-
- Flow Matching, especially when using OT paths, allows us to use fewer - evaluations for sampling while retaining similar numerical error (left) and - sample quality (right). Results are shown for models trained on ImageNet - 32×32, and numerical errors are for the midpoint scheme. Figure from - . -
- - - - -The study also delves into the efficiency aspects of Flow Matching, showcasing -faster convergence during training and improved sampling efficiency, -particularly with OT paths. - - -
-
- {% include figure.html path="assets/img/2024-05-07-elaborating-on-the-value-of-flow-matching-for-density-estimation/sample_path.png" class="img-fluid rounded z-depth-1" %} -
-
-
- Sample paths from the same initial noise with models trained on ImageNet - 64×64. The OT path reduces noise roughly linearly, while diffusion paths - visibly remove noise only towards the end of the path. Note also the - differences between the generated images. Figure from - . -
- - - - - -
-
- {% include figure.html path="assets/img/2024-05-07-elaborating-on-the-value-of-flow-matching-for-density-estimation/superres.svg" class="img-fluid rounded z-depth-1" %} -
-
-
- Image super-resolution on the ImageNet validation set. Figure from - . -
- - - - -Additionally, conditional image generation and super-resolution experiments -demonstrate the versatility of Flow Matching, achieving competitive performance -in comparison to state-of-the-art models. The results suggest that Flow -Matching presents a promising approach for generative modeling with notable -advantages in terms of model efficiency and sample quality. - -# Application of Flow Matching in Simulation-Based Inference - -A very specifically interesting application of density estimation, i.e. -Normalizing Flows, is in Simulation-Based Inference (SBI). In SBI, Normalizing -Flows are used to estimate the posterior distribution of model parameters given -some observations. An important factor here are the sample efficiency, -scalability, and expressivity of the density model. Especially for the later -two, Flow Matching has shown to the yield an improvement. This is due to the -efficient transport between source and target density and the flexibility due -the more complex transformations allowed by continuous normalizing flows. To -start out, a brief introduction to SBI shall be given as not many might be -familiar with this topic. - -## Primer on Simulation-Based Inference - -In many practical scenarios, the likelihood function of a model is intractable -and cannot be described analytically. This might be the case for where the -forward model is a complex or proprietary simulation, or if it is a physical -experiment . In order to -still be able to perform Bayesian inference, one can resort to a class of -methods called Likelihood-free Inference. One possible but popular method in -this class is SBI. The core idea is to use a prior in combination with the -simulator to obtain samples from the joint distribution of the parameters and -the data. Based on these samples, the posterior can either be learned directly -or the likelihood can be approximated . Depending on the exact method chosen, the -approximated posterior is either amortized, i.e. does not require refitting when -conditioned on different data, or non-amortized. - - -
-
- {% include figure.html path="assets/img/2024-05-07-elaborating-on-the-value-of-flow-matching-for-density-estimation/kinds_of_sbi.jpg" class="img-fluid rounded z-depth-1" %} -
-
-
- The figure depicts the schematic flow of information for different kinds of - Likelihood-free methods. Modern methods in SBI are depicted in the bottom - row where the likelihood is approximated in subfigure E, the posterior is - approximated in subfigure F, and the likelihood-ratio in subfigure G. Figure from . -
- - -In order to formalize the method, let $$\theta \sim \pi(\theta)$$ denote the -parameters to a system and its respective prior distribution. The system under -evaluation and the respective observations obtained are denoted by $$x -= \mathcal{M}(\theta)$$. To sample from the joint distribution $$p(\theta, -x)$$, the dedicated parameter $$\theta_i$$ is sampled from the prior -and the observation is obtained by evaluating the forward model on that -parameter $$x_i = \mathcal{M}(\theta_i)$$. According to this approach, a dataset -of samples from the joint distribution can be generated $$\mathcal{X} = \{ -(\theta, \mathbf{x})_i \}^N_{i=1}$$. A density estimator is then fitted on the -provided dataset in order to estimate the desired distribution, e.g. directly -the posterior $$q_{\omega}(\theta \mid x) \approx p(\theta \mid x)$$. - -The interested reader shall be directed to and especially for a more rigorous introduction -to SBI. In order to compare the performances of the different approaches to -SBI and their performance with respect to certain tasks, an excellent overview -is provided in . For the sake -of this post, a more abstract understanding is enough. - -## Flow Matching for Simulation-Based Inference - -The approach using the Flow Matching formulation to fit the density network is -presented by Dax et al. . In the setting -described by the authors and the before mentioned SBI context, the goal is to -approximate a posterior distribution of over model parameters given observations -$$p(\theta \vert x)$$. To learn the posterior, the Flow Matching loss -is adapted to the following: - -$$ -\mathcal{L}_{FMPE} = \mathbb{E}_{t \sim p(t),\theta_1 \sim p(\theta), x \sim p(x -\vert \theta_1),\theta_t \sim p_t(\theta_t \mid \theta_1)} \Vert -f_{\omega,x}(\theta_t, t) - u_t(\theta_t \mid \theta_1) -\Vert^2 -$$ - -The important details to note here are the adaptations to minimize the loss -w.r.t. samples drawn from the joint distribution, as it is described in the -general section to SBI. To do so, the expectation is adapted to be w.r.t. -$$\theta_1 \sim p(\theta), x \sim p(x \vert \theta_1)$$, which yield the desired -samples. - -Another adaption by the authors is to exchange the uniform distribution over the -time with a general distribution $$t \sim p(t)$$. The effects of this -substitution won't be focus deeper. However, adapting the distribution makes -intuitive sense as the training gets harder close to the target distribution. -Therefore, focussing on time steps $$t$$ closer to one is beneficial, as the -authors have also found in their empirical studies. - -In order to provide a general comparison of the Flow Matching-based SBI -approach, the CFM model is tested on the SBI benchmarking tasks . The results show either equal or -better performance, underscoring the approaches ability and applicability to -SBI. - - -
-
- {% include figure.html path="assets/img/2024-05-07-elaborating-on-the-value-of-flow-matching-for-density-estimation/fmpe_sbi_benchmark.png" class="img-fluid rounded z-depth-1" %} -
-
-
- The figure depicts the results of the CFM model on the SBI benchmarking tasks, as carried out by the authors of . Comparing the results to such obtained by neural posterior estimation with a normalizing flow shows comparable performance on most tasks while outperforming on some. -
- - -Besides the general benchmarks, the authors use their proposed technique to -estimate the posterior distribution of gravitational wave parameters $$p(\theta -\mid x)$$ where $$\theta \in \mathbb{R}^{15}, x \in \mathbb{R}^{15744}$$. In -order to reduce the problem's dimensionality and increase the information -density, the observations are compressed to $$128$$ dimensions using an -embedding network. - -Following the preprocessing of the data, three density estimators are fitted and -compared to each other. The first method uses a neural spline flow, which has -proven itself on these kinds of problems. It is compared to a neural posterior -estimation using the Flow Matching approach described here. Finally, a neural -posterior estimator leveraging physical symmetries is used to estimate the -targeted posterior. All were trained on a simulation budget of $$5 \cdot 10^6$$ -samples for a total of 400 epochs. - -In order to evaluate the models' performances, the obtained posteriors were -compared w.r.t. their 50% credible regions as well as Jensen-Shannon divergence -between the inferred posterior and reference results. The results shown below -support the advantages found in the benchmarking tasks. The Flow Matching-based -shows a good performance for all shown parameters and has a clear advantage over -the classical NPE approach. - - -
-
- {% include figure.html path="assets/img/2024-05-07-elaborating-on-the-value-of-flow-matching-for-density-estimation/fmpe_results_gw.png" class="img-fluid rounded z-depth-1" %} -
-
-
- The figure shows the single performances of a classic NPE approach using - neural spline flows, the proposed Flow Matching approach, and a - physics-focussed NPE approach. The results are shown for the 50% credible - regions on the left, as well as the Jensen-Shannon divergence between the - inferred posterior and reference results on the right. The Flow - Matching-based approach shows a good performance for all investigated - parameters and has a clear advantage over the classical NPE approach. In - the pair plot on the left, the choice was made to only show the four - parameters for which the classical NPE method performs the worst. While the - Flow Matching approach could perform worse on other dimensions, this is - not the case as shown on the right. Figure from - . -
- -Whilst the examples are interesting themselves, their evaluation has shown the -applicability, scalability, and flexibility of Flow Matching for density -estimation. These performance improvements in different areas have motivated the -discussion of Flow Matching in the first place and hopefully become clear now. - -# A Personal Note - -Whilst this is a blog post, we'd like to use this last part to express our -personal thoughts on this topic. SBI is a powerful method, enabling Bayesian -Inference where it would not be possibleIt might be more fitting to -say that Bayesian Inference is not practically feasible in many scenarios as, in -theory, it might still be possible by sampling. However, this is essentially not -possible where single evaluations of the forward model are expensive or further -evaluations are simply not available, as shown in the example. -otherwise. Due to the natural problem setting of SBI, where problems are -high-dimensional, observations are scarce, and distribution complex, density -estimators capable to counter these are required. In the past, Normalizing Flows -have proven themselves to meet these challenges, whilst not resolving them -completely. CNFs, due to their higher flexibility, have been a desired method to -put to test whether they could even improve on these but were limited in the -inability to train the efficiently. - -Formulating the Flow Matching variant of CNFs has allowed their application to -complex density estimation tasks, as for example in SBI, and they've shown to -yield the expected improvements -- on standard SBI benchmarking tasks as well a -very high dimensional task from the field of astrophysics. Furthermore, the -generalization of CFM even broadens their applicability. It will be very -interesting to see what possibilities are opened by this exact formulation and, -in addition, what further improvements can be obtained by transferring -techniques from the Diffusion Models to Normalizing Flows. diff --git a/_posts/2024-05-07-exploring-meta-learned-curiosity-algorithms.md b/_posts/2024-05-07-exploring-meta-learned-curiosity-algorithms.md deleted file mode 100644 index 4d3bcab5..00000000 --- a/_posts/2024-05-07-exploring-meta-learned-curiosity-algorithms.md +++ /dev/null @@ -1,484 +0,0 @@ ---- -layout: distill -title: Exploring Meta-learned Curiosity Algorithms -description: This blog post delves into Alet et al.'s ICLR 2020 paper, Meta-learning curiosity algorithms, which introduces a unique approach to meta-learning curiosity algorithms. Instead of meta-learning neural network weights, the focus is on meta-learning pieces of code, allowing it to be interpretable by humans. The post explores the two meta-learned algorithms, namely Fast Action Space Transition (FAST) and Cycle-Consistency Intrinsic Motivation (CCIM). -date: 2024-05-07 -future: true -htmlwidgets: true - -authors: - - name: Batsirayi Mupamhi Ziki - affiliations: - name: University of Cape Town - -# must be the exact same name as your blogpost -bibliography: 2024-05-07-exploring-meta-learned-curiosity-algorithms.bib - -# Add a table of contents to your post. -# - make sure that TOC names match the actual section names -# for hyperlinks within the post to work correctly. -# - please use this format rather than manually creating a markdown table of contents. 
-toc:
-  - name: Introduction
-  - name: Background
-    subsections:
-    - name: Reinforcement Learning
-    - name: Meta-learning and Meta-RL
-    - name: Random Network Distillation
-    - name: BYOL-Explore
-  - name: Meta-learning curiosity algorithms
-    subsections:
-    - name: Meta-Learned Components and their DAGs
-    - name: Method
-    - name: FAST
-    - name: CCIM
-  - name: Experiments
-    subsections:
-    - name: Empirical Design
-    - name: Empty grid-world
-    - name: Deep sea
-    - name: Results
-  - name: Discussion
-  - name: Conclusion
----
-
-## Introduction
-
-Dealing with environments with sparse rewards in reinforcement learning (RL), i.e., environments where feedback comes at a low frequency, requires meaningful exploration.
-One way to encourage the RL agent to perform meaningful exploration is by instilling intrinsic motivation into the agent. This intrinsic motivation usually comes in the form of curiosity. As Schmidhuber highlighted : One becomes curious as soon as one believes there's something about the world that one does not know. It is because of this that curiosity or intrinsic rewards are usually predictive errors. For instance, an RL agent equipped with a world model is given the current state of the environment, $$s_t$$, and attempts to predict the next state, $$s_{t+1}$$. The error in this prediction is the intrinsic reward. As the world model improves, one should expect the intrinsic rewards to decrease as the agent's knowledge about the environment increases. This is known as curiosity-driven exploration.
-
-Now there has been success with curious agents solving environments with sparse rewards . Curiosity algorithms such as Random Network Distillation (RND) and BYOL-Explore are hand-designed and are able to perform well across different environments.
-However, in the 2020 paper , Meta-learning curiosity algorithms, Alet et al. took a unique approach to discovering new curiosity algorithms. They did this by meta-learning pieces of code,
-similar to the code segments used by researchers when crafting curiosity algorithms: neural networks with gradient descent mechanisms, trained objective functions, ensembles, buffers, and various regression models.
-Two new interpretable algorithms were learned by meta-learning these pieces of code: Fast Action Space Transition (FAST) and Cycle-Consistency Intrinsic Motivation (CCIM).
-It is these two algorithms that we will explore, comparing their behaviour to our baselines: RND and BYOL-Explore.
-
-The roadmap for exploring FAST and CCIM is organised as follows. We begin with a brief introduction to RL, meta-learning, and meta-reinforcement learning (meta-RL). Next, we provide concise explanations of how our curiosity-driven exploration baselines, RND and BYOL-Explore, operate. Subsequently, we delve into the discovery process of FAST and CCIM. Following that, we explore the intricacies of FAST and CCIM, evaluating their performance and studying their behaviour in both the empty grid-world environment and the `bsuite` deep sea environment. We then compare them to the curiosity-driven baselines and a non-curious agent. Finally, we conclude our journey.
-
-## Background
-
-### Reinforcement Learning
-
-RL is inspired by how biological systems learn, as animals are able to learn through trial and error. In RL we have an agent that tries to maximise the sum of rewards it receives by learning from its interactions with the environment. This agent-environment interaction is usually modelled as a Markov decision process (MDP).
Figure 1 below illustrates this agent-environment interaction.
-
-{% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/MDP.png" class="img-fluid" width="100px" %}
-
- Figure 1. The agent-environment interaction as an MDP. Taken from .
-
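-
-To make the interaction loop concrete, below is a minimal sketch of one episode using gymnax, the environment API our experiments are built on. The environment name and the random stand-in policy are illustrative choices of ours, not something from the paper.
-
-```python
-import jax
-import gymnax
-
-rng = jax.random.PRNGKey(0)
-env, env_params = gymnax.make("CartPole-v1")  # any gymnax environment works here
-
-rng, key_reset = jax.random.split(rng)
-obs, state = env.reset(key_reset, env_params)
-
-episode_return, done = 0.0, False
-while not done:
-    rng, key_act, key_step = jax.random.split(rng, 3)
-    # A random policy stands in for the agent's decision-making.
-    action = env.action_space(env_params).sample(key_act)
-    obs, state, reward, done, info = env.step(key_step, state, action, env_params)
-    episode_return += reward
-```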
-
-From the figure we can see that the agent observes a state and then takes an action. The agent can then decide on its next action based on the next state it observes and the rewards it receives from the critic in the environment. The critic decides on what reward the agent receives at every time-step by evaluating its behaviour.
-
-As Sutton et al. highlighted in , Figure 1 can be misleading though. It implies that the agent-environment boundary is similar to the physical boundary between an organism's entire body and the outside world. In RL we consider anything that the agent cannot change through its actions as the environment. For example, if a human were an RL agent, their skeletal structure or their muscles could be considered part of the environment. We can then see that when it comes to RL we have two types of environments: the internal environment, such as the sensory organs of an animal, and the external environment. Also, the reward the agent receives does not always come from the external environment. Rewards can be seen as reward signals, like a human's brain releasing dopamine when one achieves an objective.
-Thus, the critic can also be inside the RL agent.
-The figure below shows an extended view of the agent-environment interactions.
-
-{% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/extended_mdp.png" class="img-fluid" width="100px" %}
-
- Figure 2. The extended agent-environment interaction. Taken from . -
-
-Singh et al. highlighted in that Figure 2 shows that an RL agent has a motivational system, since the critic can be within the internal environment of the agent. This motivational system should ideally remain consistent across a wide range of diverse environments. Since we can view the critic as being inside the agent, we can instil intrinsic motivation into the agent. This means that the agent can receive two types of rewards, namely extrinsic rewards from the external environment and intrinsic rewards from the internal environment.
-Singh et al. highlighted the advantages of endowing an agent with intrinsic motivation. They pointed out that an agent equipped with a collection of skills learned through intrinsic reward can more easily adapt to and learn a wide variety of extrinsically rewarded tasks compared to an agent lacking these skills.
-
-### Meta-learning and Meta-RL
-
-The next stop on our journey takes us to meta-learning. Meta-learning is about learning how to learn. The goal is for meta-learning agents to enhance their learning abilities over time, enabling them to generalise to new, unseen tasks. Meta-learning involves two essential loops: the inner loop and the outer loop. In the inner loop, our learning algorithm adapts to a new task using experiences obtained from solving other tasks in the outer loop, which is referred to as meta-training .
-
-The inner loop addresses a single task, while the outer loop deals with the distribution of tasks. Figure 3 illustrates this concept of meta-learning.
-
-{% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/meta-learning.png" class="img-fluid" %}
-
- Figure 3. An illustration of meta-learning. Taken from . -
-Moving into the intersection of meta-learning and reinforcement learning (RL) is meta-RL, where the agent learns how to reinforcement learn . In meta-RL, the agent aims to maximise the sum of rewards from a distribution of MDPs.
-
-In basic RL, we have an algorithm $$f$$ that outputs a policy, mapping states to actions. In meta-RL, however, we have an algorithm with meta-parameters $$\theta$$ that outputs $$f$$, and $$f$$ then produces a policy when faced with a new MDP.
-Figure 4 illustrates the meta-RL process. Note that the meta-parameters $$\theta$$ are updated in the outer loop.
-
-{% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/meta-rl.png" class="img-fluid" %}
-
- Figure 4. An illustration of meta-RL. Taken from . -
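-
-As a toy illustration of the two loops, the sketch below meta-learns a single meta-parameter $$\theta$$ (an inner-loop learning rate) over a distribution of 1-D regression tasks. This is our own minimal example, not something from the paper; the task distribution and all constants are arbitrary.
-
-```python
-import jax
-import jax.numpy as jnp
-
-def task_loss(w, slope, xs):
-    # One task = fit the line y = slope * x with a single weight w.
-    return jnp.mean((w * xs - slope * xs) ** 2)
-
-def inner_loop(theta, slope, xs, steps=5):
-    # Inner loop: adapt w to a single task using the learning rate theta.
-    w = 0.0
-    for _ in range(steps):
-        w = w - theta * jax.grad(task_loss)(w, slope, xs)
-    return task_loss(w, slope, xs)  # post-adaptation loss on this task
-
-@jax.jit
-def outer_step(theta, key):
-    # Outer loop: update theta on a batch of tasks from the distribution.
-    slopes = jax.random.uniform(key, (8,), minval=-2.0, maxval=2.0)
-    xs = jnp.linspace(-1.0, 1.0, 16)
-    meta_loss = lambda t: jnp.mean(jax.vmap(lambda s: inner_loop(t, s, xs))(slopes))
-    return theta - 0.05 * jax.grad(meta_loss)(theta)
-
-theta, key = 0.01, jax.random.PRNGKey(0)
-for _ in range(100):
-    key, sub = jax.random.split(key)
-    theta = outer_step(theta, sub)
-```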
-
-### Random Network Distillation
-
-We now move onto our curiosity-driven exploration baselines. The first baseline that we will briefly discuss is RND . RND works by having two neural networks. One is the predictor network and the other is the target network. The target network is randomly initialised and its parameters stay fixed during training. Given a state, $$s_t$$, it outputs the feature representation of that state, $$f_t$$. The predictor network then tries to predict $$f_t$$ given $$s_t$$ as well. The error in this prediction is the intrinsic reward, $$r_i$$, given to the agent, and it is given by the following formula,
-
-$$
-r_i=\|\hat{f}_t - f_t\|_2^2,
-$$
-
-where $$ \hat{f}_t$$ is the output of the predictor network. The formula above also serves as the loss function of the predictor network.
-We normalise $$r_i$$ by dividing it by a running estimate of the standard deviation of
-the intrinsic returns. We do this because the intrinsic rewards can be very different in various environments. Normalising the intrinsic rewards makes it easier to pick hyperparameters that work across a wide range of environments. As the agent explores more, the predictor network will get better and the intrinsic rewards will decrease. The key idea in RND is that the predictor network is trying to predict the output of a network that is deterministic, the target network.
-The figure below illustrates the process of RND.
-
-{% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/RND.png" class="img-fluid" %}
-
- Figure 5. The process of RND. Taken from . -
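-
-The RND bonus is compact enough to sketch in a few lines of flax/JAX (the stack our implementations use). The architecture and sizes below are placeholder choices of ours, not the ones from the original paper.
-
-```python
-import jax
-import jax.numpy as jnp
-import flax.linen as nn
-
-class MLP(nn.Module):
-    @nn.compact
-    def __call__(self, x):
-        return nn.Dense(32)(nn.relu(nn.Dense(64)(x)))  # feature representation
-
-net = MLP()
-k_tgt, k_pred, k_s = jax.random.split(jax.random.PRNGKey(0), 3)
-s = jax.random.normal(k_s, (4, 8))     # a dummy batch of states
-target_params = net.init(k_tgt, s)     # random and never updated
-pred_params = net.init(k_pred, s)      # trained to match the target

-def intrinsic_reward(pred_params, s):
-    f = jax.lax.stop_gradient(net.apply(target_params, s))  # f_t
-    f_hat = net.apply(pred_params, s)                        # \hat{f}_t
-    return jnp.sum((f_hat - f) ** 2, axis=-1)                # squared L2 per state
-
-# The mean intrinsic reward doubles as the predictor's loss.
-loss, grads = jax.value_and_grad(
-    lambda p: intrinsic_reward(p, s).mean())(pred_params)
-# In practice r_i is then divided by a running std estimate of intrinsic returns.
-```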
-
-### BYOL-Explore
-
-BYOL-Explore builds upon Bootstrap Your Own Latent (BYOL) , a self-supervised learning algorithm used in computer vision and representation learning. BYOL-Explore is similar to RND in that there's a network that tries to predict the output of a target network. In BYOL-Explore we have an online network that consists of an encoder, a closed-loop recurrent neural network (RNN) cell, an open-loop RNN cell and a predictor, while the target network just consists of an encoder. The key difference is that the target network's parameters do not stay fixed like in RND. We update the target network's parameters using an exponential moving average (EMA) of the online network's encoder parameters. The update is performed using the formula below:
-
-$$
-\phi \leftarrow \alpha\phi + (1-\alpha)\theta.
-$$
-
-In the above equation, $$\phi$$ is the target network's parameters, $$\theta$$ is the online network's encoder parameters and $$\alpha$$ is the EMA smoothing factor. In our implementation of BYOL-Explore we do not make use of the RNN cells as we are dealing with simple environments; we call our implementation BYOL-Explore Lite.
-In our implementation the online network is composed of a multilayer perceptron (MLP) encoder and a predictor. The target network, $$h$$, is just composed of an MLP encoder. In the BYOL-Explore Lite process the current state of the environment, $$s_t$$, is input into the encoder $$f$$, which outputs a feature representation of the state, $$f(s_t)$$. This feature representation is then passed to both the RL agent and the predictor $$g$$. The RL agent uses $$f(s_t)$$ to decide on its next action and determine the value of that state. The predictor uses $$f(s_t)$$ to predict $$h(s_{t+1})$$, i.e., the predictor is attempting to predict the target network's output for the next state. There are two losses, namely the encoder loss and the predictor loss. The predictor loss is given by,
-
-$$
-\mathcal{L}_p=\left\|\frac{g(f(s_{t}))}{\|g(f(s_{t}))\|_2}-\frac{h(s_{t+1})}{\|h(s_{t+1})\|_2}\right\|_2^2.
-$$
-
-Since the RL agent and the predictor both make use of the online network's encoder, its loss is given by the sum of the RL loss and the predictor loss. Importantly, the loss $$\mathcal{L}_p$$ serves as the intrinsic reward that the RL agent receives at each step. We normalise the intrinsic rewards by dividing them by an EMA estimate of their standard deviation.
-
-BYOL-Explore Lite also makes use of something known as reward prioritisation. Reward prioritisation involves focusing on parts of the environment where the agent receives high intrinsic rewards while disregarding those with low intrinsic rewards. This enables the agent to concentrate on the areas it understands the least. Over time, the previously ignored areas with low intrinsic rewards become the priority for the agent. To do this we take the EMA mean over successive batches of normalised intrinsic rewards, $$\mu$$. Note that $$\mu$$ is used as a threshold
-to separate the high intrinsic rewards from the low intrinsic rewards. Therefore, the intrinsic reward the agent obtains after reward prioritisation is,
-
-$$
-i_t=\max(ri_t-\mu,\,0),
-$$
-
-where $$ri_t$$ is the normalised intrinsic reward.
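-
-A minimal flax/JAX sketch of the BYOL-Explore Lite quantities is given below. The architectures are placeholders, and for brevity the batch mean stands in for the EMA statistics (the running std normalisation and the EMA mean $$\mu$$); only the structure of the loss, the EMA target update, and reward prioritisation is meant to be faithful.
-
-```python
-import jax
-import jax.numpy as jnp
-import flax.linen as nn
-import optax
-
-class MLP(nn.Module):
-    @nn.compact
-    def __call__(self, x):
-        return nn.Dense(32)(nn.relu(nn.Dense(64)(x)))
-
-def l2_normalize(x):
-    return x / (jnp.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
-
-f, g, h = MLP(), MLP(), MLP()   # online encoder, predictor, target encoder
-k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
-s_t = jax.random.normal(k3, (4, 8))
-s_tp1 = jnp.roll(s_t, 1, axis=0)  # dummy next states
-f_params = f.init(k1, s_t)
-g_params = g.init(k2, f.apply(f_params, s_t))
-h_params = jax.tree_util.tree_map(jnp.copy, f_params)  # target starts as a copy
-
-def predictor_loss(f_params, g_params, s_t, s_tp1):
-    pred = l2_normalize(g.apply(g_params, f.apply(f_params, s_t)))
-    tgt = l2_normalize(jax.lax.stop_gradient(h.apply(h_params, s_tp1)))
-    return jnp.sum((pred - tgt) ** 2, axis=-1)  # doubles as the intrinsic reward
-
-# EMA target update, phi <- alpha * phi + (1 - alpha) * theta:
-alpha = 0.99
-h_params = optax.incremental_update(f_params, h_params, step_size=1 - alpha)
-
-# Reward prioritisation: keep only rewards above the (running) mean mu.
-ri = predictor_loss(f_params, g_params, s_t, s_tp1)
-i_t = jnp.maximum(ri - ri.mean(), 0.0)
-```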
-
-## Meta-learning curiosity algorithms
-
-Alet et al. view curiosity as a mechanism that is found through natural selection. As a result, they turn to meta-learning to discover new curiosity algorithms.
-In this case the outer loop searches over the curiosity algorithm space while the inner loop performs the standard RL procedure.
-
-{% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/mlc.png" class="img-fluid" %}
-
- Figure 6. The process of how the meta-learned curiosity algorithm should work. Taken from . -
-
-In the above figure we can see that the curiosity algorithm, $$\mathcal{C}$$, takes in the state and reward from the environment and then feeds the proxy reward $$\hat{r}$$ to the RL agent. The RL algorithm used is a fully-specified algorithm, i.e., all its hyperparameters are specified. There were two stages in the authors' search because the module $$\mathcal{C}$$ is made of two components.
-The first component, $$\mathcal{I}$$, calculates the intrinsic reward given the current state, the next state and the action taken. The second component, $$\chi$$, then takes the extrinsic reward, the intrinsic reward and the current normalised time step, combines them, and outputs $$\hat{r}$$.
-
-### Meta-Learned Components and their DAGs
-
-As mentioned earlier, Alet et al. focused on meta-learning pieces of code, or rather meta-learning in a space of programs or operations. The programs and operations are represented in a domain-specific language (DSL). The DSL used to find component $$\chi$$ consisted of operations such as arithmetic, Min, Max and more,
-while the DSL used to find component $$\mathcal{I}$$ consisted of programs such as neural networks complete with gradient-descent mechanisms, L2 distance calculations, ensembles of neural networks and more. Component $$\mathcal{I}$$'s DSL can describe many other hand-designed curiosity algorithms in the literature, such as RND.
-
-The components $$\mathcal{I}$$ and $$\chi$$ are represented as Directed Acyclic Graphs (DAGs). The DAGs consist of the following types of modules:
-- Input modules: These are the inputs we put into each component of module $$\mathcal{C}$$.
-- Parameter and Buffer modules: These modules consist either of the weights of a neural network, which can be updated via back-propagation, or of First In, First Out queues that output a finite list of the most recent $$k$$ inputs.
-- Functional modules: This type of module calculates an output given some input.
-- Update modules: These modules can add real-valued outputs to the loss function of a neural network or add variables to buffers.
-
-The DAGs also have an output node, which is a single node whose output is the output of the entire program. To make these ideas more concrete, let us look at the DAG that describes RND.
-
-{% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/RND_DAG.png" class="img-fluid" %}
-
- Figure 7. The DAG of RND. Taken from . -
-
-The blue rectangles represent the input modules, and we can see from the figure that the inputs are states from the environment.
-The parameter modules are the gray rectangles, and these are the parameters of the target network and the predictor network.
-Note that the target network's parameters are given by $$\theta$${1} and the predictor network's parameters are given by $$\theta$${2}.
-The functional modules are the white rectangles, and these are the neural networks. The update module is the pink rectangle, which is the loss function.
-
-The output node is the green rectangle and is the L2 distance between the outputs of the predictor network and the target network. This is the loss function described in the RND section. Note that the $$\theta$${2} rectangle has a pink border and a pink arrow; this indicates that it can be updated via back-propagation. The $$\theta$${1} rectangle, on the other hand, has a black border and a black arrow, indicating that the parameters are not updated via back-propagation. Also note that the functional module that makes use of those parameters has the word "Detach", indicating that no gradient information flows back. Recall that $$\theta$${1} represents the parameters of the target network, which remain fixed, and $$\theta$${2} represents the parameters of the predictor network, which are updated during training.
-
-Now a very important idea is that the DAGs used in the paper have polymorphic types for the inputs and outputs. There are four types:
-- $$\mathbb{R}$$, the real numbers.
-- $$\mathbb{S}$$, the state space of the environment.
-- $$\mathbb{A}$$, the action space of the environment.
-- $$\mathbb{F}$$, the feature space.
-
-The instantiation of some types depends on the environment. For example, in Figure 7, if $$\mathbb{S}$$ is an image then both the target network and the predictor network are instantiated as convolutional neural networks.
-If $$\mathbb{S}$$ is just an array of numbers then the target network and the predictor network are fully connected neural networks. We now look at the method used to find the components $$\mathcal{I}$$ and $$\chi$$.
-
-### Method
-
-We now turn our attention to how component $$\mathcal{I}$$ was searched for. Alet et al. decided to focus on an environment with sparse rewards. They chose an image-based grid-world. In this environment the agent is tasked with finding the goal position and only obtains a reward if it finds the goal position. This environment has sparse rewards as the agent only receives feedback once it finds the goal position. They limited the number of operations that component $$\mathcal{I}$$ could perform to 7 so that the search space remained manageable and the resulting algorithms could still be interpreted. They focused on finding a component $$\mathcal{I}$$ that optimises the number of distinct cells visited. From the search, 13 of the top 16 components found were variants of FAST and 3 of them were variants of CCIM. We will cover FAST and CCIM in the upcoming sections.
-
-For the component $$\chi$$ they focused on the Lunar Lander environment, as it has a strong external reward signal. The algorithm used to output the intrinsic reward was a variant of RND. The main difference was that instead of a single neural network for the predictor, an ensemble is used. This algorithm came from a preliminary set of algorithms that all resemble RND. The best reward combiner found was,
-
-$$
-\hat{r}_t = \frac{(1+ri_t-t/T)\cdot ri_t+ r_t\cdot t/T}{1+ri_t}.
-$$
-
-Here $$r_t$$ is the external reward, $$t$$ is the current time-step, $$T$$ is the maximum number of steps possible in the episode, and $$ri_t$$ is the intrinsic reward.
-However, in this blog post we decided not to focus on the reward combiner $$\chi$$ but instead to focus on FAST and CCIM. (This decision arises because we felt our exploration of the reward combiner was not exhaustive enough.)
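-
-Since the combiner is a closed-form expression, its behaviour is easy to sanity-check directly (plain Python, with illustrative inputs of our own):
-
-```python
-def combine_rewards(r_t, ri_t, t, T):
-    # Meta-learned reward combiner: weight shifts from the intrinsic
-    # reward ri_t towards the extrinsic reward r_t as t/T -> 1.
-    frac = t / T
-    return ((1.0 + ri_t - frac) * ri_t + r_t * frac) / (1.0 + ri_t)
-
-print(combine_rewards(r_t=1.0, ri_t=0.5, t=0, T=100))    # 0.5: intrinsic-dominated early on
-print(combine_rewards(r_t=1.0, ri_t=0.0, t=100, T=100))  # 1.0: purely extrinsic at t = T
-```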
-
-### FAST
-
-FAST is a very simple algorithm in that it only contains one neural network. Below is the DAG of FAST.
-
-{% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/FAST_diagram.png" class="img-fluid" %}
-
- Figure 8. The DAG of FAST. Taken from .
-
-
-This single neural network in FAST is a policy-mimicking network, $$\hat{\pi}$$. The network $$\hat{\pi}$$ tries to predict what action the agent took given a state of the environment. (We assume the environment has a discrete action space, but this need not be the case.) The loss of the policy-mimicking network is then the negative log-likelihood (NLL) loss. Note that, looking at the DAG, the output of FAST is not the same as the loss function of the policy-mimicking network. The output is given by,
-
-$$
-ri_t=\|\hat{\pi}(s_{t+1})-\hat{\pi}(s_{t})\|_2.
-$$
-
-This is different from RND and BYOL-Explore Lite: the intrinsic reward is not given by a predictive error or the loss function of one of the networks in the program.
-We understood the above formula as the L2 difference between the logits of the current state and the next state.
-The agent is then rewarded if the next state's logits are different from the current state's.
-Importantly, the agent isn't rewarded for taking a different action in the next state. Alet et al. pointed out that if the policy-mimicking network has a uniform distribution over the action space in all states, the agent will receive an intrinsic reward of zero. Therefore, in environments where the action probability distributions outputted by the policy-mimicking network vary across states, we expect this algorithm to generate intrinsic rewards.
-We hypothesize that this algorithm may not perform well in environments where the optimal policy requires the agent to visit states with very similar action probability distributions.
-While the agent explores by going to different states, ideally, we wish for the intrinsic rewards to decrease as the agent explores. Looking at the output of FAST it is not clear to us how the intrinsic reward decreases, and we expect that this could cause issues.
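-
-A minimal flax/JAX sketch of the two FAST quantities (the NLL training loss and the logit-distance intrinsic reward) is given below; the architecture, action count and inputs are placeholders of our own choosing.
-
-```python
-import jax
-import jax.numpy as jnp
-import flax.linen as nn
-
-class PolicyMimic(nn.Module):
-    n_actions: int = 4
-    @nn.compact
-    def __call__(self, s):
-        return nn.Dense(self.n_actions)(nn.relu(nn.Dense(64)(s)))  # logits
-
-net = PolicyMimic()
-s_t = jax.random.normal(jax.random.PRNGKey(1), (8,))
-s_tp1 = jax.random.normal(jax.random.PRNGKey(2), (8,))
-params = net.init(jax.random.PRNGKey(0), s_t)
-
-def nll_loss(params, s, a):
-    # Train the policy-mimicking network to predict the action a taken in s.
-    return -jax.nn.log_softmax(net.apply(params, s))[a]
-
-def fast_reward(params, s_t, s_tp1):
-    # Intrinsic reward: L2 distance between the logits of successive states.
-    return jnp.linalg.norm(net.apply(params, s_tp1) - net.apply(params, s_t))
-
-a_t = 2  # placeholder action taken by the agent
-ri = fast_reward(params, s_t, s_tp1)
-grads = jax.grad(nll_loss)(params, s_t, a_t)
-```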
-
-
-### CCIM
-
-CCIM took us quite a while to understand and process. Let us first go through its DAG below.
-{% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/CCIM_diagram.png" class="img-fluid" %}
-
- Figure 9. The DAG of CCIM. Taken from .
-
-
-We can see that there are 3 neural networks: a random network, a random and forward network, and a backward network. The parameters $$\theta$${1} are the parameters of the random network, $$\theta$${2} are the parameters of the backward network, and $$\theta$${3} are the parameters of the random and forward network. Looking at the black border of $$\theta$${1}'s rectangle, we can see that the random network's parameters stay fixed during training, like in RND. Let us denote the random network as
-$$ r_{\theta_1}$$, the backward network as $$b_{\theta_2}$$, and the random and forward network as $$ fr_{\theta_3}$$.
-Let us look at the loss functions of $$b_{\theta_2}$$ and $$ fr_{\theta_3}$$. The loss function of $$b_{\theta_2}$$ is given by,
-
-$$
-\mathcal{L}_b=\|b_{\theta_2}(fr_{\theta_3}(s_t))-r_{\theta_1}(s_t)\|_2+\|b_{\theta_2}(fr_{\theta_3}(s_{t+1}))-fr_{\theta_3}(s_t)\|_2,
-$$
-
-and the loss function for $$fr_{\theta_3}$$ is
-
-$$
-\mathcal{L}_f=\|b_{\theta_2}(fr_{\theta_3}(s_t))-r_{\theta_1}(s_t)\|_2.
-$$
-
-Note that the first term in $$\mathcal{L}_b$$ is the same as $$\mathcal{L}_f$$. The intrinsic reward, i.e., the output of this program, is given by,
-
-$$
-ri_t=\|b_{\theta_2}(fr_{\theta_3}(s_{t+1}))-b_{\theta_2}(fr_{\theta_3}(s_t))\|_2.
-$$
-
-Looking at the equations, we can see that CCIM borrows ideas from the cycle-consistency seen in the Image-to-Image Translation literature. Cycle-consistency ensures that if you translate from space $$A$$ to space $$B$$, then given space $$B$$, you should be able to translate back to space $$A$$. To see how CCIM applies this, let us turn our attention to $$\mathcal{L}_f$$'s equation. The $$fr_{\theta_3}$$ network applies a random embedding to state $$s_t$$. It then forwards this random embedding to the "next state". The $$b_{\theta_2}$$ network then takes this forwarded random embedding of state $$s_t$$ and undoes the forward transformation so that we end up again with just the random embedding of state $$s_t$$. Now, the random embedding that $$fr_{\theta_3}$$ applied should match the random embedding that $$r_{\theta_1}$$ applied to the state $$s_t$$ for the loss to be minimised.
-In other words, once we apply a forward transformation to the random embedding of the state, we should be able to undo that transformation and end up where we started.
-
-Let us look at the second term in $$\mathcal{L}_b$$, given by $$\|b_{\theta_2}(fr_{\theta_3}(s_{t+1}))-fr_{\theta_3}(s_t)\|_2$$. We apply a forward and then a backward transformation to the random embedding of state $$s_{t+1}$$, so we should end up with just the random embedding of state $$s_{t+1}$$. We then apply $$fr_{\theta_3}$$ to state $$s_t$$ and end up with the forwarded random embedding of state $$s_t$$, which should equal the random embedding of $$s_{t+1}$$.
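-
-To make the three networks and the two losses concrete, here is a small flax/JAX sketch following the equations above as written in the paper; the architectures and inputs are placeholder choices of ours.
-
-```python
-import jax
-import jax.numpy as jnp
-import flax.linen as nn
-
-class MLP(nn.Module):
-    @nn.compact
-    def __call__(self, x):
-        return nn.Dense(32)(nn.relu(nn.Dense(64)(x)))
-
-r_net, fr_net, b_net = MLP(), MLP(), MLP()
-k_r, k_fr, k_b, k_s = jax.random.split(jax.random.PRNGKey(0), 4)
-s_t = jax.random.normal(k_s, (8,))
-s_tp1 = -s_t  # placeholder next state
-r_params = r_net.init(k_r, s_t)                           # random network, frozen
-fr_params = fr_net.init(k_fr, s_t)                        # random-and-forward network
-b_params = b_net.init(k_b, fr_net.apply(fr_params, s_t))  # backward network
-
-def ccim_quantities(fr_params, b_params, s_t, s_tp1):
-    r_t = jax.lax.stop_gradient(r_net.apply(r_params, s_t))
-    cyc_t = b_net.apply(b_params, fr_net.apply(fr_params, s_t))      # b(fr(s_t))
-    cyc_tp1 = b_net.apply(b_params, fr_net.apply(fr_params, s_tp1))  # b(fr(s_{t+1}))
-    loss_f = jnp.linalg.norm(cyc_t - r_t)                            # L_f
-    loss_b = loss_f + jnp.linalg.norm(cyc_tp1 - fr_net.apply(fr_params, s_t))  # L_b
-    ri_t = jnp.linalg.norm(cyc_tp1 - cyc_t)                          # intrinsic reward
-    return loss_f, loss_b, ri_t
-```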
-
-The intrinsic reward confuses us. Looking at the DAG of CCIM, we see that the output is given by the L2 distance between $$\mathcal{L}_f$$ and $$\mathcal{L}_b$$; hence, we initially thought the intrinsic reward was given by $$ \|b_{\theta_2}(fr_{\theta_3}(s_{t+1}))-fr_{\theta_3}(s_t)\|$$. The difference between this equation and the original intrinsic reward equation is that the backward model, $$b_{\theta_2}$$, is not applied to the $$fr_{\theta_3}(s_t)$$ term.
-Looking at the original formula of the intrinsic reward, we can see that it is just the difference between the random embeddings of
-the current state and the next state (if we assume that the backward network can undo the forward transformation), so it is not clear to us how the intrinsic reward
-will decrease as the agent explores.
-Not only that, but we also noticed unexpected behaviour in the loss function of the $$fr_{\theta_3}$$ network in our experiments. We then watched Alet et al.'s presentation of their paper to see where we went wrong, and we noticed that in the presentation they swapped the labels of the $$fr_{\theta_3}$$ and $$b_{\theta_2}$$ networks.
-After reaching out to them about this discrepancy, they did confirm that the equations in the paper are correct, and the labels in the talk are wrong. So for our implementation, we used the equations as found in the paper.
-
-#### CCIM-slimmed
-
-Through our communication with them, Alet et al. recommended we try ablations of CCIM and they suggested the following slimmed-down version of CCIM:
-- Network $$r_{\theta_1}$$ remains unchanged and its parameters stay fixed.
-- Network $$fr_{\theta_3}$$ changes to just being a forward network, $$f_{\theta_3}$$.
-- The loss function of $$f_{\theta_3}$$ is now $$\mathcal{L}_f=\|f_{\theta_3}(r_{\theta_1}(s_t))-r_{\theta_1}(s_{t+1})\|_2^2$$.
-- Network $$b_{\theta_2}$$'s loss function, $$\mathcal{L}_b$$, also changes: $$\mathcal{L}_b=\|b_{\theta_2}(r_{\theta_1}(s_{t+1}))-r_{\theta_1}(s_{t})\|_2^2$$.
-- The intrinsic reward is now $$\mathcal{L}_f+\mathcal{L}_b$$.
-
-This slimmed-down version of CCIM was much easier to implement. Since the sum of the loss functions also acts as the intrinsic reward, it is clearer to us how the intrinsic rewards will decrease as the agent explores. As the agent explores, both the forward and backward networks become better at predicting what the random embedding of the next state and the previous state will be, respectively.
-
-## Experiments
-
-### Empirical Design
-
-In devising the methodology for our experiments, we sought guidance from the principles outlined in Patterson et al.'s cookbook, "Empirical Design in Reinforcement Learning" . Our codebase is derived from PureJaxRL and can be found [here](https://github.com/Ziksby/MetaLearnCuriosity).
-Specifically, we leverage PureJaxRL's Proximal Policy Optimization (PPO) implementation as our chosen reinforcement learning (RL) algorithm.
-We compare each meta-learned curiosity algorithm to a non-curious agent (normal PPO) and our baselines.
-The foundation of our experiments is laid upon a JAX implementation of Minigrid's grid-world environment , which uses gymnax's API . Additionally, we make use of gymnax's deep sea environment implementation as well.
-
-Each RL agent undergoes training for 500,000 time steps across four vectorized environments, employing 30 seeds for each RL algorithm.
-To assess performance on the environments, we calculate the average episode return across seeds at the end of training, with a 95% confidence interval determined through the percentile bootstrap method (sketched below).
-We are not just interested in how well these curiosity algorithms perform but also in understanding the behaviour of these algorithms.
-We therefore also visualise the sample standard deviation during training to see the performance variations. This assists us in seeing how consistent the behaviour is for each curiosity algorithm and the normal PPO algorithm.
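-
-For reference, the percentile bootstrap we use can be sketched in a few lines (the dummy returns below are illustrative, not our results):
-
-```python
-import numpy as np
-
-def percentile_bootstrap_ci(returns, n_boot=10_000, conf=0.95, seed=0):
-    # CI for the mean episode return across seeds via the percentile bootstrap.
-    rng = np.random.default_rng(seed)
-    returns = np.asarray(returns)
-    means = np.array([
-        rng.choice(returns, size=returns.size, replace=True).mean()
-        for _ in range(n_boot)
-    ])
-    lo, hi = np.percentile(means, [(1 - conf) / 2 * 100, (1 + conf) / 2 * 100])
-    return returns.mean(), (lo, hi)
-
-dummy_returns = np.random.default_rng(1).uniform(0.0, 1.0, size=30)  # 30 seeds
-mean, (lo, hi) = percentile_bootstrap_ci(dummy_returns)
-```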
-
-Now, since we are not testing the reward combiner found, it is not clear how we should combine the external reward and the intrinsic reward. However, we treat both the external reward and the intrinsic reward as episodic and therefore use the following formula, $$ \hat{r} = r_t + \lambda ri_t $$, where $$\lambda$$ is some weight factor.
-These are the optimal values we found for $$\lambda$$ for each curiosity algorithm:
-
-- FAST: $$\lambda = 0.003$$.
-- CCIM-slimmed: $$\lambda = 0.17$$.
-- CCIM: $$\lambda = 0.003$$.
-- BYOL-Explore Lite: $$\lambda = 0.006$$.
-- RND: $$\lambda = 0.2$$.
-
-For FAST, CCIM, and CCIM-slimmed we normalise the intrinsic reward using the same method as RND. Next we describe the environments we use in more detail.
-
-### Empty grid-world
-
-The empty grid-world is a very simple environment. As mentioned earlier, the agent's task is to reach the goal position. The size is $$16\times 16$$ and the maximum number of steps is 1024.
-In our implementation the agent starts at the bottom left corner and has to reach the top right corner. The reward the agent receives if it finds the goal is `1 - 0.9 * (step_count / max_steps)` (a quick sanity check of this formula follows below). The gif shows an RL agent exploring the environment to reach the goal.
-
-{% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/anim_BYOL_0.gif" class="img-fluid" %}
-
-The empty grid-world environment. -
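-
-As a quick sanity check of the goal reward used above (plain Python, with our environment's step limit):
-
-```python
-def goal_reward(step_count, max_steps=1024):
-    # Reaching the goal quickly pays more; the reward decays with elapsed steps.
-    return 1.0 - 0.9 * (step_count / max_steps)
-
-print(goal_reward(256))   # 0.775
-print(goal_reward(1024))  # 0.09999999999999998, i.e. 0.1 at the step limit
-```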
-
-### Deep sea
-
-The deep sea environment is one of the `bsuite` environments developed by Google
-Deepmind .
-This is an $$ N \times N $$ grid environment that focuses on testing the exploration capabilities of an RL algorithm. The figure below shows the environment.
-{% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/deepsea.png" class="img-fluid" %}
-
- Figure 10. The Deep sea environment. Taken from . -
-The agent starts at the top left corner and its goal is to reach the bottom right corner.
-At each time step the agent descends one row. The agent can either go left or right. There is a small penalty of $$ -0.01/N $$ for going right, while going left just gives a reward of zero. The agent receives a reward of 1 if it finds the treasure at the bottom right corner.
-The maximum number of steps in the environment is $$N$$. Therefore, the optimal policy is to go right at every time step, ignoring the greedy action. In our experiments we set $$N=10$$.
-
-### Results
-
-#### CCIM
-
-We start with the deep sea environment. The left of Figure 11 shows the sample standard deviation during training. We only show it for the first 10,000 steps because we notice the graphs plateau after that. We see that RND and BYOL-Explore Lite produce the most consistent agents in the deep sea environment, and CCIM-slimmed produces more consistent agents than CCIM and PPO. Looking at the right of Figure 11, we can see the mean episode return across the 30 seeds with the 95% confidence intervals. RND, BYOL-Explore, and CCIM-slimmed all perform better than PPO. However, CCIM performs roughly the same as PPO at the end of training. From our experiments we also noticed that the intrinsic rewards produced by CCIM increased and then plateaued. The CCIM random and forward network's loss continued to increase during training as well.
-
-
- {% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/DeepSea-bsuite_CCIM_mean_seeds_std.png" class="img-fluid" %} -
-
- {% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/DeepSea-bsuite_ccim_mean_seeds_CI.png" class="img-fluid" %} -
-
- -
- Figure 11. The sample standard deviation during training (left) and the average episode return (right) in the deep sea environment.
-
-
-Next we move onto the empty grid-world. Looking at the left of Figure 12, we can see that all curiosity algorithms produce more consistent agents than PPO, as their sample
-standard deviations are lower. CCIM and CCIM-slimmed both actually produce more consistent agents than RND and PPO in this environment. The right of Figure 12 also indicates that CCIM performed much better in the empty grid-world and was closer to the baselines. However, in this environment we did once again notice that the raw intrinsic reward
-increased then plateaued and that the loss of the random and forward network increased during training. It should also be noted that the confidence intervals of all the RL algorithms overlap in the empty grid-world environment.
-
-
- {% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/Empty-misc_CCIM_mean_seeds_std.png" class="img-fluid" %} -
-
- {% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/Empty-misc_CCIM_mean_seeds_CI.png" class="img-fluid" %} -
-
- -
- Figure 12. The sample standard deviation during training (left) and the average episode return (right) in the empty grid-world environment.
-
-
-Next we decided to plot the RND, BYOL-Explore Lite, normal PPO, CCIM and CCIM-slimmed heatmaps in Figures 13 and 14. To make the heatmaps we looked at the best 15 seeds for
-each algorithm and kept track of the paths each seed took. Looking at Figures 13 and 14, we can see that CCIM and CCIM-slimmed covered more of the map than RND and BYOL-Explore Lite. However, they only covered slightly more of the map than PPO.
-
-
-
- {% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/heatmap_rnd_30.png" class="img-fluid" %} -
-
- {% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/heatmap_byol_lite_30.png" class="img-fluid" %} -
-
-
- Figure 13. Heatmaps of the RND agent (left) and the BYOL-Explore Lite agent (right) in the empty grid-world.
-
-
-
- {% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/heatmap_ccim_30.png" class="img-fluid" %} -
-
- {% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/heatmap_ccim_slimmed_30.png" class="img-fluid" %} -
-
- {% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/heatmap_dis_ppo_30.png" class="img-fluid" %} -
-
- - -
- Figure 14. Heatmaps of the CCIM agent (left), CCIM-slimmed agent (middle), and the normal PPO agent (right) in the empty grid-world.
-
-
-
-
-#### FAST
-
-Let us now turn our attention to how FAST performed. We begin with the deep sea environment. In Figure 15 we plot the sample standard deviation for the first 10,000 steps, as we observe no significant difference beyond this point.
-The left side of Figure 15 indicates that PPO and our curiosity-driven baselines produce more consistent agents than FAST, as they exhibit a lower sample standard deviation.
-
-On the right side of Figure 15, we see that FAST, similar to CCIM, performs poorly in this environment compared to our baselines. Notably, during training we noticed that the intrinsic reward of the FAST agents also increased.
-
-
- {% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/DeepSea-bsuite_FAST_mean_seeds_std.png" class="img-fluid" %} -
-
- {% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/DeepSea-bsuite_FAST_mean_seeds_CI.png" class="img-fluid" %} -
-
- -
- Figure 15. The sample standard deviation during training (left) and the average episode return (right) in the deep sea environment.
-
-
-The right side of Figure 16 shows that FAST's performance in the empty grid-world is better than its performance in the deep sea environment; it is now comparable to our baselines, despite its intrinsic rewards also increasing over time. Once again, similar to CCIM's results, we observe overlapping confidence intervals in the empty grid-world. The left side of Figure 16 shows that not only has FAST's performance improved in the empty grid-world, but it also now produces more consistent agents than RND and PPO, as its sample standard deviation is lower.
-
- {% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/Empty-misc_FAST_mean_seeds_std.png" class="img-fluid" %} -
-
- {% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/Empty-misc_FAST_mean_seeds_CI.png" class="img-fluid" %} -
-
-
- Figure 16. The sample standard deviation during training (left) and the average episode return (right) in the empty grid-world environment.
-
- -We once again plot the heatmap of FAST and compare it to PPO's heatmap using the best 15 seeds. When comparing Figure 17 (left) with both Figure 17 (right) and Figure 13, we observe that FAST covered more of the grid-world than PPO, BYOL-Explore Lite, and RND. - -
-
- {% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/heatmap_fast_30.png" class="img-fluid" %} -
-
- {% include figure.html path="assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/heatmap_dis_ppo_30.png" class="img-fluid" %} -
-
-
- Figure 17. Heatmaps of the FAST agent (left) and the normal PPO agent (right) in the empty grid-world.
-
-
-
-## Discussion
-
-Alet et al. provided a unique approach to meta-learning. The performance of CCIM and FAST in the empty grid-world therefore did not surprise us, as that was the environment used to search for the algorithms. Note in Figure 17 that the 15 best seeds of FAST covered more of the map, i.e., most of the seeds took different paths to the goal compared to PPO.
-However, for the CCIM and CCIM-slimmed heatmaps we notice that these algorithms only covered slightly more of the map than PPO. It should be noted, looking at the heatmaps, that
-CCIM-slimmed, CCIM, and FAST all covered more of the map than our baselines, which makes sense given that Alet et al. looked for curiosity algorithms that optimise the number of distinct cells visited.
-
-From the sample standard deviation plots, we can see that FAST and CCIM do not produce more consistent agents than PPO and the curiosity-driven baselines in the deep sea environment, while CCIM-slimmed produced more consistent agents than PPO but not the baselines. However, in the empty grid-world environment FAST, CCIM, and CCIM-slimmed are able to produce more consistent agents than PPO and RND.
-In the mean episode return plots, CCIM, CCIM-slimmed, and FAST perform better than PPO and RND in the empty grid-world environment, which makes sense as the empty grid-world was used to find these curiosity algorithms. However, in the deep sea environment we see that the meta-learned curiosity algorithms perform worse than our curiosity-driven baselines.
-
-From the mean episode return plots we can see that BYOL-Explore Lite is the best-performing algorithm. Even in the empty grid-world environment it performs better than the meta-learned curiosity algorithms.
-We believe this is because of the reward prioritisation implemented in BYOL-Explore. This could explain why its performance is better than that of the meta-learned curiosity algorithms and why it produces the most consistent agents.
-
-One major concern we still have is that the intrinsic rewards for FAST and CCIM didn't decrease during training in either environment used in our experiments. We noted, however, that the
-intrinsic rewards for CCIM-slimmed did decrease during training. We believe the decrease in intrinsic rewards as training progresses is one of the main reasons why BYOL-Explore and RND are
-effective and why we see the improved performance of the CCIM-slimmed algorithm. Even with the reward combiner, we still believe that the intrinsic rewards not decreasing could potentially cause an issue, as it did in the deep sea environment. Recall that the reward combiner has the following formula,
-
-$$
-\hat{r}_t = \frac{(1+ri_t-t/T)\cdot ri_t+ r_t\cdot t/T}{1+ri_t}.
-$$
-
-Now if $$t=T$$ then $$\hat{r}_t \approx r_t$$, provided $$ 0 \leq ri_t \ll 1$$. For us, however, the intrinsic rewards were not much less than one during training. We believe that it is important for curiosity algorithms that the intrinsic reward decreases as the agent becomes more familiar with its environment. We believe that this is why CCIM-slimmed performed better than CCIM and FAST in the deep sea environment. Another concern we have is that the CCIM random and forward network's loss increased during training. It is possible that there's a bug somewhere in our code which we have not found yet.
-
-In the future we think it will be interesting to repeat this experiment using the deep sea environment to find the curiosity algorithms that output the intrinsic reward.
-Additionally, exploring the use of a variant of FAST or CCIM to find a reward combiner is also of interest to us. We wonder why a variant of FAST or CCIM wasn't employed for this purpose, as a variant of RND was used to find the reward combiner. As stated earlier, FAST, CCIM and CCIM-slimmed do not make use of reward prioritisation like BYOL-Explore Lite does. Therefore, repeating the experiments with the meta-learned curiosity algorithms where some form of reward prioritisation is implemented is another interesting path we hope to explore. We would also like to increase the number of seeds used to reduce the confidence intervals. Since we are training end-to-end in JAX in simple environments, increasing the number of seeds should not be much of an issue.
-
-## Conclusion
-
-In this blog post, we studied two meta-learned curiosity algorithms, namely FAST and CCIM. We compared them to a non-curious agent and our baselines for the curiosity algorithms: RND and BYOL-Explore. Our experiments were conducted using both the empty grid-world environment and the deep-sea environment.
-
-FAST and CCIM both performed well in the empty grid-world, covering more of the map than the baselines when examining their heatmaps. This aligns with our expectations since this was the environment used to search for the curiosity algorithms. However, in the deep-sea environment, both algorithms did not perform well compared to the baselines. Conversely, CCIM-slimmed, a slimmed-down version of CCIM, showed performance comparable to the baselines.
-We suspect that this is because its intrinsic reward decreased as the agent explored more. This behaviour was not observed in FAST and CCIM, which we believe is not ideal and consider the main flaw of these algorithms.
-
-This approach of meta-learning curiosity algorithms is novel, and we believe there's interesting work that can be done following the same approach as Alet et al., trying it with different environments to search for curiosity algorithms, such as the deep-sea environment. Moreover, BYOL-Explore makes use of reward prioritisation. Therefore, in the future, we hope to include reward prioritisation in our FAST, CCIM, and CCIM-slimmed implementations to see if it improves performance. Another avenue is using the meta-learned curiosity algorithms to search for the reward combiner.
diff --git a/_posts/2024-05-07-fairness-ai-two-phil-or-just-one.md b/_posts/2024-05-07-fairness-ai-two-phil-or-just-one.md
deleted file mode 100644
index 3dd47b8b..00000000
--- a/_posts/2024-05-07-fairness-ai-two-phil-or-just-one.md
+++ /dev/null
@@ -1,217 +0,0 @@
----
-layout: distill
-title: "Fairness in AI: two philosophies or just one?"
-description: The topic of fairness in AI has garnered more attention over the last year, recently with the arrival of the EU's AI Act. Achieving fairness in AI is often done in one of two ways, namely through counterfactual fairness or through group fairness. These research strands originate from two vastly differing ideologies. However, with the use of causal graphs, it is possible to show that they are related and even that satisfying a group fairness measure means satisfying counterfactual fairness.
-date: 2024-05-07
-future: true
-htmlwidgets: true
-
-# Anonymize when submitting
-# authors:
-#   - name: Anonymous
-
-authors:
-  - name: MaryBeth Defrance
-    url: https://orcid.org/my-orcid?orcid=0000-0002-6570-8857
-    affiliations:
-      name: University of Ghent
-
-# must be the exact same name as your blogpost
-bibliography: 2024-05-07-fairness-ai-two-phil-or-just-one.bib
-
-# Add a table of contents to your post.
-#   - make sure that TOC names match the actual section names
-#     for hyperlinks within the post to work correctly.
-#   - please use this format rather than manually creating a markdown table of contents.
-toc:
-  - name: Why fairness?
-  - name: What is fairness?
-    subsections:
-    - name: Explainable AI
-    - name: Group fairness
-  - name: Unifying these philosophies
-    subsections:
-    - name: Measurement error - Demographic parity
-    - name: Selection on label - Equalized odds
-    - name: Selection on predictor - conditional use accuracy equality
-    - name: Confirmation with experiments
-  - name: What can we take away?
-
-# Below is an example of injecting additional post-specific styles.
-# This is used in the 'Layouts' section of this post.
-# If you use this post as a template, delete this _styles block.
-_styles: >
-  .fake-img {
-    background: #bbb;
-    border: 1px solid rgba(0, 0, 0, 0.1);
-    box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
-    margin-bottom: 12px;
-  }
-  .fake-img p {
-    font-family: monospace;
-    color: white;
-    text-align: left;
-    margin: 12px 0;
-    text-align: center;
-    font-size: 16px;
-  }
----
-
-This blog post is based on the paper by Anthis and Veitch. The original paper is enriched with a wide overview of fairness concepts used in research and with visuals that aid readers in gaining a deeper understanding. The blog post aims to raise questions about the dichotomy between procedural and outcome fairness, and whether they should perhaps not be treated as separate research fields, as is currently often the case.
-
-## Why fairness?
-The spread of AI has exposed some of the dark patterns present in society. Some well-known examples are the COMPAS case, which showed discrimination against black defendants, and the Amazon hiring tool, which showed a preference for men over women. However, these AI systems were most likely not the source of this disparate treatment. The behavior stems from the data that was used to train the systems, and thus from the people behind the creation of that data.
-
-Fairness in AI is a research strand which aims to remove the biases in AI models that result in such disparate treatment. The goal of these models is that people are treated more fairly, perhaps even more fairly than under a human decision.
-
-## What is fairness?
-The question of what is fair does not have a single answer. Even when stepping away from the computer science context, a universal definition that can be used to determine whether something is fair or not cannot be found. The concept of fairness is heavily influenced by a person's, but also society's, biases. The fluidity of the notion therefore gives rise to multiple philosophies of what a fair AI system would be.
-
-{% include figure.html path="assets/img/2024-05-07-fairness-ai-two-phil-or-just-one/Two_categories.png" class="img-fluid" %} -
-
- Figure 1: Some examples of the concepts used in the respective philosophies. -
-
-Two main philosophies can be found in research. The first one, often called explainable AI, aims either to create explainable models or to create explanations for the results obtained from a model. This can also be described as aiming for procedural fairness. The second philosophy is called group fairness. Group fairness focusses on outcome fairness. This means that the predictions from the AI system should have similar properties across groups that only differ in a certain personal attribute.
-
-### Explainable AI
-The most famous example of explainable AI is __fairness through unawareness__. Fairness through unawareness means that no personal attributes are passed into the system, unless these are relevant for the prediction. The system therefore does not have access to the personal attributes, which means it cannot directly discriminate. Fairness through unawareness is often used as the basic model for fairness. However, the systems from both the COMPAS and Amazon examples used fairness through unawareness and still exhibited disparate treatment. The personal attributes that were removed from the data still had an influence on the dataset itself. For instance, a ZIP code can function as a proxy for race, or someone's gender may influence their writing style.
-
-{% include figure.html path="assets/img/2024-05-07-fairness-ai-two-phil-or-just-one/Feature_selection.png" class="img-fluid" %} -
-
- Figure 2: Examples of Fairness Through Unawareness (FTU) and fair feature selection on the Adult dataset. -
-
-Related to fairness through unawareness is __fair feature selection__ . Instead of removing the personal attributes, only features that are deemed appropriate remain in the dataset. It needs to be noted that universal agreement on which features are fair to use is unlikely, due to the aforementioned biases of people and cultures. Oftentimes, there exists an overlap between the features removed in fairness through unawareness and fair feature selection, as is evident in Figure 2.
-
-__Counterfactual fairness__ is a currently popular type of explainable AI. Counterfactual fairness stems from systems that check for direct discrimination, meaning that simply changing a personal attribute would change a person's prediction. An example of direct discrimination can be found in Figure 3, where changing the sex would result in a different prediction. From a legal standpoint it is clear that if a model were to exhibit this behavior, it could be deemed unfair.
-
-{% include figure.html path="assets/img/2024-05-07-fairness-ai-two-phil-or-just-one/Direct_discrimination.png" class="img-fluid" %} -
-
- Figure 3: Example of direct discrimination where changing the personal attribute of sex changes the prediction a person would receive. -
-
-Models for counterfactual fairness change the personal attributes of a person and also adjust other features according to a causal model related to the personal attributes. For example, changing someone's race might also require changing their ZIP code or the high school they went to. Figure 4 contains an example of creating counterfactuals. That system is unfair, as some of the counterfactuals have a different prediction from the original. Satisfying counterfactual fairness can also be achieved by requiring independence between the personal attributes and the prediction itself. A more stringent constraint is to require that the prediction be independent of all proxy features in the dataset.
-
-{% include figure.html path="assets/img/2024-05-07-fairness-ai-two-phil-or-just-one/Counterfactual_fairness.png" class="img-fluid" %} -
-
- Figure 4: Imaginary examples of a system that would not satisfy counterfactual fairness. Changing features in accordance with the personal attributes and data distribution results in a different prediction. -
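-
-A toy sketch of such a counterfactual check is given below (entirely our own illustration; the structural model, feature names and classifier are made up):
-
-```python
-def generate_counterfactual(x, a_new):
-    # Toy structural model: flipping the protected attribute A also shifts
-    # a proxy feature (here "zip_region") that causally depends on A.
-    x = dict(x)
-    x["zip_region"] = 1.0 if a_new == 1 else 0.0
-    x["A"] = a_new
-    return x
-
-def model(x):
-    # Stand-in classifier that leans on the proxy feature.
-    return int(x["zip_region"] + 0.1 * x["income"] > 0.5)
-
-x = {"A": 0, "zip_region": 0.0, "income": 2.0}
-x_cf = generate_counterfactual(x, a_new=1)
-print(model(x), model(x_cf))  # 0 1 -- differing outputs flag a violation
-```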
-
-### Group Fairness
-Group fairness is a different philosophy regarding the fairness of an AI system. Instead of requiring that the process of the system be fair, it requires the outcome of the model to be fair. This verdict of fairness is based on the equality of a chosen statistical measure between groups. People are divided into these groups based on their personal attributes. Three definitions are most commonly used for group fairness, namely demographic parity, equalized odds and conditional use accuracy equality.
-
-__Demographic parity__ requires that the selection rate is equal across groups. This means that an equal percentage of people from each group receives a positive prediction. This definition is independent of the ground truth, which means, for example, that a perfect predictor could never satisfy demographic parity if the base rates differ between groups. From the observation of the dataset, it must therefore seem that the prediction is independent of the personal attributes.
-
-{% include figure.html path="assets/img/2024-05-07-fairness-ai-two-phil-or-just-one/Demographic_Parity.png" class="img-fluid" %} -
-
- Figure 5: A representation of demographic parity. Two groups are distinguished: one male, one female. The circled individuals are the ones who receive a positive prediction.
-
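-
-Checking demographic parity on a set of predictions only takes a few lines (dummy data for illustration):
-
-```python
-import numpy as np
-
-def selection_rates(pred, group):
-    # P(D=1 | A=g) for every group g.
-    return {g: pred[group == g].mean() for g in np.unique(group)}
-
-pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # model predictions D
-group = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # personal attribute A
-print(selection_rates(pred, group))  # equal rates across groups <=> demographic parity
-```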
-
-A second fairness measure used in group fairness is __equalized odds__. This fairness measure requires that both the true positive and true negative rates are equal across groups. This means that, given the ground truth, there is an equal chance of giving a positive prediction irrespective of a person's group. In other words, equalized odds requires that the prediction be independent of the personal attribute given the ground truth. Unlike demographic parity, equalized odds is dependent on the ground truth.
-
-{% include figure.html path="assets/img/2024-05-07-fairness-ai-two-phil-or-just-one/Equalized_odds.png" class="img-fluid" %} -
-
- Figure 6: A representation of predictions which satisfy equalized odds. Two groups are distinguished: one male, one female. The circled individuals are the ones who receive a positive prediction. The colors of the individuals indicate the ground truth of the samples. The male group has a base rate of 0.8 and the female group a base rate of 0.6.
-
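-
-The corresponding check compares the conditional rates per group (again with dummy data):
-
-```python
-import numpy as np
-
-def tpr_tnr(pred, y, group, g):
-    m = group == g
-    tpr = pred[m & (y == 1)].mean()        # P(D=1 | Y=1, A=g)
-    tnr = (1 - pred[m & (y == 0)]).mean()  # P(D=0 | Y=0, A=g)
-    return tpr, tnr
-
-pred = np.array([1, 1, 0, 0, 1, 1, 0, 0])
-y = np.array([1, 0, 1, 0, 1, 0, 1, 0])      # ground truth
-group = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # personal attribute A
-print(tpr_tnr(pred, y, group, 0), tpr_tnr(pred, y, group, 1))
-# Equalized odds holds when both rates match across the groups.
-```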
-
-The final common fairness measure in group fairness is __conditional use accuracy equality__. In order to satisfy conditional use accuracy equality, two statistical properties must be equal between groups, namely the precision and the false omission rate. Put differently, given the prediction, there must be an equal chance that this prediction is correct regardless of the group a person belongs to. Conditional use accuracy equality is therefore defined similarly to equalized odds; the roles of the prediction and the ground truth are simply reversed. This symmetry also holds for the independence condition: conditional use accuracy equality requires that the ground truth is independent of the personal attribute if the prediction is known.
-
-{% include figure.html path="assets/img/2024-05-07-fairness-ai-two-phil-or-just-one/Conditional_use_accuracy_equality.png" class="img-fluid" %} -
-
- Figure 7: A representation of predictions which satisfy conditional use accuracy equality. Two groups are distinguished: one male, one female. The circled individuals are the ones who receive a positive prediction. The colors of the individuals indicate the ground truth of the samples. The male group has a base rate of 0.8 and the female group a base rate of 0.6.
-
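-
-And the mirror-image check for conditional use accuracy equality (dummy data once more):
-
-```python
-import numpy as np
-
-def ppv_fom(pred, y, group, g):
-    m = group == g
-    ppv = y[m & (pred == 1)].mean()  # P(Y=1 | D=1, A=g), the precision
-    fom = y[m & (pred == 0)].mean()  # P(Y=1 | D=0, A=g), the false omission rate
-    return ppv, fom
-
-pred = np.array([1, 1, 0, 0, 1, 1, 0, 0])
-y = np.array([1, 0, 1, 0, 1, 0, 1, 0])
-group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
-print(ppv_fom(pred, y, group, 0), ppv_fom(pred, y, group, 1))
-# Conditional use accuracy equality holds when both quantities match across groups.
-```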
-
-## Unifying these philosophies
-The previous two sections discussed the different concepts used for explainable AI and group fairness. It is clear that they employ different bases for their philosophy of fairness. However, when looking at these definitions, the concept of independence returns in both counterfactual fairness and the fairness measures used for group fairness. This shared requirement of independence allows us to unify these notions and show that they accomplish the same result. Table 1 provides an overview of the fairness measures and the respective independence they require.
-
-In the following section, $$ Y $$ symbolises the perceived label, $$ D $$ the prediction, $$ A $$ the personal attributes, $$ S $$ the selection of a sample into the dataset, $$ X^{\bot}_A $$ the data independent of the personal attributes, $$ X^{\bot}_Y $$ the data independent of the prediction and $$ \tilde{Y} $$ the real label.
-
- Table 1: A summary of the independence requirement of the fairness notions discussed. -
-
-
-| Name | Probability definition | Independence |
-| ------------- |:-------------:| -----:|
-| Demographic parity | $$ P(D=1\vert A=1) = P(D=1\vert A=0) $$ | $$ D \bot A $$ |
-| Equalized odds | $$ P(D=1 \vert A=1, Y=y) = P(D=1 \vert A=0, Y=y) $$ | $$ D \bot A \vert Y $$ |
-| Conditional use accuracy equality | $$ P(Y=1 \vert A=1, D=d) = P(Y=1 \vert A=0, D=d) $$ | $$ Y \bot A \vert D $$ |
-
-
-### Measurement error - Demographic parity
-
-{% include figure.html path="assets/img/2024-05-07-fairness-ai-two-phil-or-just-one/Measurement_error.png" class="img-fluid" %} -
-
- Figure 8: A directed acyclic graph showing the relation between the prediction and the data, in the situation of measurement error. -
- -Measurement error is a first type of dependence that can be resolved in order to make a system counterfactually fair. Measurement error means that there is some bias on the perceived ground truth in the dataset. Consider, for example, a system that determines whether pulling a car over is justified or not (whether a crime was committed or not). More crimes can be uncovered if a full car search happens; however, a car search is not always undertaken, resulting in a bias of more positive samples for a population where a car search is more likely to happen. In this situation the label is whether or not a crime was detected, not whether a crime was committed. The imbalance in car searches for a group with a certain personal attribute will then have an effect on the label. This influence of the personal attributes on the label, but not on the ground truth, is shown in Figure 8. - -A second example of measurement error can be found in healthcare prediction. Predicting someone's health is abstract, as health is not directly quantifiable. A proxy for health is the costs related to the healthcare an individual receives. However, costs are not universal for each group in society. Certain groups can thus have lower costs while dealing with more health problems, due to the care that they receive or perhaps do not receive. This faulty proxy is another example of measurement error. - -Such a system is thus made counterfactually fair if the dependence between the personal attribute and the label is removed. This is the same independence that is required to satisfy demographic parity. - -### Selection on label - Equalized odds - -
-{% include figure.html path="assets/img/2024-05-07-fairness-ai-two-phil-or-just-one/Selection_on_label.png" class="img-fluid" %} -
-
- Figure 9: A directed acyclic graph showing the relation between the prediction and the data, in the situation of selection on label. -
- -Selection on label is a type of bias that arises when not only someone's label but also their personal attribute affects their inclusion in the dataset. A subtype of this bias is self-selection bias. This means that certain groups of the population are more represented in certain datasets because those groups are more likely to interact with the data collection system. An example of this is found in voluntary studies, where certain groups are more likely to participate than others, leading to a skewed dataset in favor of the participating group. A study around self-selection bias in nutrition trials also found that a person's ground truth influences their participation in the trial (healthy eaters were more likely to apply for this trial). - -The directed acyclic graph in Figure 9 shows how to decouple the label itself from the personal attribute by introducing the selection variable $$ S $$, which is an observed variable. $$ A $$ and $$ X^{\bot}_A $$ are only connected through a path that includes $$ Y $$, which means that given $$ Y $$, $$ A $$ and $$ X^{\bot}_A $$ are independent, which is the condition of equalized odds. - -### Selection on predictor - Conditional use accuracy equality - -
-{% include figure.html path="assets/img/2024-05-07-fairness-ai-two-phil-or-just-one/Selection_on_predictor.png" class="img-fluid" %} -
-
- Figure 10: A directed acyclic graph showing the relation between the prediction and the data, in the situation of selection on predictor. -
- -Selection on predictor is similar to selection on label, but instead of the label influencing the selection, it is the features themselves, together with the personal attributes, that influence the selection. An example of this can be seen in the student population of engineering degrees. A relevant feature such as what a person studied in high school influences their choice to do engineering. However, there is a large discrepancy in the number of male versus female students who pursue engineering, even though that difference does not exist among graduating high-school students. This shows that both relevant features and personal attributes influence presence in a dataset about engineering students. - -The directed acyclic graph in Figure 10 for selection on predictor is similar to that for selection on label. The features and label are simply reversed in this situation. This is also in accordance with the similarity seen between equalized odds and conditional use accuracy equality. $$ A $$ and $$ Y $$ are connected only through $$ X^{\bot}_A $$, which means that if the prediction is known (it is captured in $$ X^{\bot}_A $$), then $$ A $$ and $$ Y $$ are independent, which is necessary to satisfy conditional use accuracy equality. - -### Confirmation with experiments -This relation between counterfactual fairness and group fairness is supported by experiments. These experiments were done on a synthetic version of the Adult dataset. A simulated protected class A was added where the incidence is balanced (50/50 odds of belonging to the protected class or not). If someone belongs to the protected class, then there is a causal effect of A on X: $$P(race=other) = 0.8 $$. This means that A will loosely relate to someone's race being noted as "other". This dataset serves as the target distribution for the biased datasets. - -A counterfactually fair model is achieved by taking the average prediction of an instance as if it were part of the protected class and as if it were not. Three biased datasets are created based on the directed acyclic graphs in Figures 8, 9, and 10. Table 2 shows that satisfying counterfactual fairness for a certain type of dataset will satisfy the corresponding fairness measure, confirming the theoretical results above. - -
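-To make the reported quantities concrete, here is a rough sketch of how the fairness differences and the counterfactually fair averaging could be computed. The helper names are hypothetical (not from the original experiments); the sketch assumes binary numpy arrays `pred`, `A`, `y` and a model with a scikit-learn-style `predict`:
-
-```python
-import numpy as np
-
-def demographic_parity_diff(pred, A):
-    # P(D=1 | A=1) - P(D=1 | A=0)
-    return pred[A == 1].mean() - pred[A == 0].mean()
-
-def equalized_odds_diff(pred, A, y):
-    # Largest gap in P(D=1 | A, Y=y) across groups, over both label values.
-    gaps = [pred[(A == 1) & (y == v)].mean() - pred[(A == 0) & (y == v)].mean()
-            for v in (0, 1)]
-    return max(gaps, key=abs)
-
-def counterfactually_fair_predict(model, X, a_col):
-    # Average the prediction over both values of the protected attribute,
-    # as described above for the counterfactually fair model.
-    X1, X0 = X.copy(), X.copy()
-    X1[:, a_col], X0[:, a_col] = 1, 0
-    return 0.5 * (model.predict(X1) + model.predict(X0))
-```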
- Table 2: The results of applying counterfactual fairness to a model with its performance on different fairness measures. -
- - -| | Demographic parity difference | Equalized odds difference | Conditional use accuracy equality | -| ------------- | ------------- | ------------- | ------------- | -| Measurement Error | __-0.0005__ | 0.0906 | -0.8158 | -| Selection on Label | 0.1321 | __-0.0021__ | 0.2225 | -| Selection on Predictors | 0.1428 | 0.0789 | __0.0040__ | - -## What can we take away? - -Procedural and outcome fairness have tended to coexist in research. Each is its own field with its own philosophy, sharing the common goal of creating fairer AI systems. The strength of techniques like counterfactual fairness lies in their explainability, which makes it easier to determine whether a system is fair. The group fairness techniques have many implementations and have been proven to be powerful. However, they are not very interpretable. In order to determine what is fair, a first abstraction must be made by converting the meaning of fairness into a mathematical fairness measure. The determination of whether the system is fair is thus dependent on the interpretation of the fairness measure and the quality of the dataset. If the dataset is not representative, then there is no guarantee that the system will have a fair outcome. - -This relation between procedural fairness and outcome fairness opens certain research possibilities, perhaps allowing the strength of the outcome fairness techniques to be combined with the interpretability of the procedural fairness concepts. A future research possibility is to investigate whether the techniques that satisfy fairness measures also satisfy some explainability notions, or what adjustments would be needed. \ No newline at end of file diff --git a/_posts/2024-05-07-hidden-convex-relu.md b/_posts/2024-05-07-hidden-convex-relu.md deleted file mode 100644 index 6b44cba5..00000000 --- a/_posts/2024-05-07-hidden-convex-relu.md +++ /dev/null @@ -1,659 +0,0 @@ ---- -layout: distill -title: The Hidden Convex Optimization Landscape of Two-Layer ReLU Networks -description: In this article, we delve into the research paper titled 'The Hidden Convex Optimization Landscape of Regularized Two-Layer ReLU Networks'. We focus on the significance of this study and evaluate its relevance in the current landscape of the theory of machine learning. This paper describes how solving a convex problem can directly give the solution to the highly non-convex problem that is optimizing a two-layer ReLU Network. After giving some intuition on the proof through a few examples, we will observe the limits of this model, as we might not yet be able to throw away the non-convex problem. -date: 2024-05-07 -future: true -htmlwidgets: true - -# Anonymize when submitting -authors: - - name: Victor Mercklé - url: "https://victormerckle.fr/" - affiliations: - name: LabHC, LJK - France - - name: Franck Iutzeler - url: "https://iutzeler.org/" - affiliations: - name: Institut de Mathématiques de Toulouse, Université de Toulouse, CNRS - - name: Ievgen Redko - url: "https://ievred.github.io/" - affiliations: - name: Paris Noah's Ark lab - -#authors: -# - name: Albert Einstein -# url: "https://en.wikipedia.org/wiki/Albert_Einstein" -# affiliations: -# name: IAS, Princeton - -# must be the exact same name as your blogpost -bibliography: 2024-05-07-hidden-convex-relu.bib - -#TODO make sure that TOC names match the actual section names - they do -toc: - - name: I. Overview and Motivation - subsections: - - name: Problem and notation - - name: Research context - - name: II. 
Convex Reformulation - subsections: - - name: Small example walkthrough - - name: Specifics about equivalence - - name: Activation patterns - - name: Extensions of the convex reformulation to other settings - - name: III. Can we Forget the Non-Convex Problem? - subsections: - - name: Solving the convex problem efficiently is hard - - name: Activation patterns are not a constant in the non-convex problem - - name: On large initialization scale - - name: On very small initialization - - name: Conclusion - -_styles: > - - .remark { - display: block; - margin: 12px 0; - font-style: italic; - } - .remark:before { - content: "Remark."; - font-weight: bold; - font-style: normal; - } - .remark[text]:before { - content: "Remark (" attr(text) ") "; - } - - .center { - display: block; - margin-left: auto; - margin-right: auto; - } - - .legend { - display: block; - margin-left: 50px; - margin-right: 50px; - } - - .framed { - border: 1px var(--global-text-color) dashed !important; - padding: 20px; - } - - d-article { - overflow-x: visible; - } - - .underline { - text-decoration: underline; - } - ---- - - -
-$$ -\def\RR{ \mathbb{R} } -\newcommand{\dd}{\mathrm{d}} -\newcommand{\step}{\gamma} -\newcommand{\reg}{\beta} -\newcommand{\paramS}{\Theta} -\newcommand{\param}{\theta} -\newcommand{\dirac}{\delta} - -\definecolor{cvred}{RGB}{230, 29, 0} -\definecolor{cred}{RGB}{230, 159, 0} -\definecolor{cblue}{RGB}{51, 102, 253} -\definecolor{cgreen}{RGB}{0, 158, 115} -\def\czero{ {\color{cred}{0}} } -\definecolor{cvblue}{RGB}{86, 180, 233} -\def\cone{ {\color{cvblue}{1}} } - -\def\max{\mathop{\mathrm{max}}} -\def\sgn{\mathop{\mathrm{sgn}}} - - -$$ -
- -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/teaser_movie.gif" class="img-fluid" %} - -

There exists an equivalent convex formulation to the classical non-convex ReLU two-layer network training. That sounds like great news, but does it hold up in practice? Let's find out together.

- -The code for _this plot_ is available and reproducible on this __[Jupyter Notebook]({{'assets/html/2024-05-07-hidden-convex-relu/hidden-convex-relu.ipynb' | relative_url}})__ (or in __[HTML]({{'assets/html/2024-05-07-hidden-convex-relu/hidden-convex-relu.html' | relative_url}})__). - -## I. Overview and Motivation - -50 years ago, two-layer networks with non-linear activations were known to be universal approximators; however, they did not catch on as they were hard to train. The recent years have been marked by deeper networks running on dedicated hardware with very large datasets. Those networks have since been at the top of the benchmarks in many applications, including self-driving and text generation. The pragmatic method to train such models is to run stochastic gradient descent on the non-convex optimization problem, which concretely means tuning the weights (and biases) until the model is accurate enough. The best models usually require billions of parameters and very large datasets. The training, in turn, requires millions of dollars of hardware and electricity to run gradient descent and train a single model. - -Deep learning is not without faults. Even though the test performance can surpass that of many machine learning models, it is very hard to know what the network has learned because of its black-box nature. Interpretability in neural networks is crucial for creating trustworthy AI systems, one of the biggest obstacles to AI adoption. It may also lead us to simpler models that are cheaper to run, are more robust, generalize better, and are easier to adapt to specific tasks. - -To figure out what a neural network learns, we will focus in this post on the training of a shallow ReLU network by vanilla gradient descent, using the full batch of data at each step, in a regression setting. More precisely, we will investigate how the construction of a convex equivalent to the non-convex training problem can enlighten us on how neurons evolve during the training phase, with a specific focus on the activation of the ReLU functions and their consequences. - -### Problem and notation - -Our problem of interest will be the training of a simple two-layer neural network with ReLU activation. We focus on a classical regression problem with a mean squared error loss, and we add a weight decay term (whose importance will be underlined later). This leads to the following full-batch gradient method (note that we make a slight abuse of notation by denoting by $\nabla$ the gradient with respect to the parameters, obtained, for instance, by backpropagation). - -Because there are only two layers, we will integrate the biases of the neurons directly into the data by adding a dimension filled with ones. - -
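-As a minimal illustration of this bias trick (a sketch, not the notebook's code: it assumes numpy arrays `X` of shape `(n, d)`, `W` of shape `(m, d+1)` and `alpha` of shape `(m,)`), appending a constant column lets the last coordinate of each first-layer vector play the role of the bias:
-
-```python
-import numpy as np
-
-def add_bias_column(X):
-    # Append a constant feature so that the last coordinate of each
-    # first-layer vector w_i acts as the neuron's bias.
-    return np.hstack([X, np.ones((X.shape[0], 1))])
-
-def network_output(X, W, alpha):
-    # Two-layer ReLU network: sum_i max(0, X w_i) * alpha_i
-    return np.maximum(0, X @ W.T) @ alpha
-```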

- Two-Layer ReLU Network Training
- Data points: $n$ inputs \(\pmb{x}_j \in \RR^d\) and labels \(y_j \in \RR\), $j=1,..,n$
- Model: $m$ neurons: First layer \(\pmb{w}_i \in \RR^d\), second layer \(\alpha_i \in \RR\), $i=1,..,m$
- Hyper-parameters: step-size \(\step > 0\), regularization \(\lambda\geq 0\)
- Loss to be minimized: - \begin{equation}\label{eq:theloss} - \mathcal{L}(\pmb{W}, \pmb{\alpha}) = \sum_{j=1}^n \bigg( \underbrace{\sum_{i=1}^m \max(0, \pmb{w}_i^\top \pmb{x}_j) \alpha_i}_{\text{Network's Output}} - y_j \bigg)^2 + \underbrace{\lambda \sum_{i=1}^m \left( \| \pmb{w}_i \|^2_2 + \alpha_i^2 \right)}_{\text{Weight Decay}} - \end{equation} - (Full-batch) Gradient Descent: - \begin{equation*} - (\pmb{W}, \pmb{\alpha})_{t+1} = (\pmb{W}, \pmb{\alpha})_t - \step \nabla \mathcal{L}((\pmb{W}, \pmb{\alpha})_t) - \end{equation*} -

- -Even the simplest ReLU models have non-trivial non-convexity, as depicted in the figure below. We plot the loss function $$\mathcal{L}$$ of a network with two neurons on one-dimensional data. We only optimize the first layer here, so we have a total of two parameters to optimize. Despite the simple setup, a gradient descent starting from a random initialization can converge to three different values, two of them being greater than zero. However, there always exists a path of non-increasing loss from initialization to the global minimum (as predicted by prior work). - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/threed.png" class="img-fluid" %} - -

Loss landscape of a network with two parameters, one for each ReLU neuron, and two data points: $(x_1, y_1) = (-1, 1)$ and $(x_2, y_2) = (1, 2)$ are fixed. Since all labels are positive, we fix the second layer $\alpha_1, \alpha_2$ to 1 to plot the loss in 2D without loss of generality. The black lines represent the loss for only one neuron (since the other is equal to 0). The red lines (critical points) are paths of parameters for which the loss is constant and the gradient is zero. They represent the parameters for which the neuron fits exactly one data point and is deactivated for the other, and thus suffers a loss of $(y_1)^2$ for the red line on the left and $(y_2)^2$ for the other. The exact formula to compute each point of the loss landscape is: - -\begin{equation*} -\begin{split} -\mathcal{L}(w_1, w_2) =&\ \left(\max(0, x_1 w_1) + \max(0, x_1 w_2) - y_1\right)^2 \\ -+&\ \left(\max(0, x_2 w_1) + \max(0, x_2 w_2) - y_2\right)^2 -\end{split} -\end{equation*} -

- -To avoid the local minima, one idea is to add constraints to the parameters. The constrained problem where $w_1$ has to be positive and $w_2$ has to be negative _is_ convex, and a simple gradient descent will find the global minimum of the original unconstrained problem. In the paper we discuss here, the authors find a more general way to build an equivalent convex problem to our ReLU shallow network training problem. - -In this blog post, we will first work out the intuition needed to understand why an equivalent, finite convex problem even exists. Then we will study the exact links between the problem in practice and the convex problem, and go over the limits of such an approach both in theory and in practice. - -### Research context - -The question of how neural networks learn is a very active domain of research with many different paths of investigation. Its main goal is to lay a mathematical foundation for deep learning and, toward that goal, shallow neural networks act as a stepping stone for understanding deeper and more complex networks. - -For networks with a hidden layer of infinite width, it is proven that gradient descent converges to one of the global minima under the _NTK regime_, or by considering them as Wasserstein gradient flows. Studying the NTK amounts to analyzing the first-order Taylor expansion of the network, treating the network as a linear regression over a feature map. This approximation is accurate if the neurons are initialized with a large scale (far from zero), large enough that neurons do not move far from their initialization. This is also called the _lazy regime_, in contrast with the _feature learning regime_ where neurons align themselves to a finite number of directions. While the lazy regime is interesting in its own right, we focus here on a feature-learning regime with small initialization, where we can observe genuinely non-convex behavior such as neuron alignment, incremental learning and saddle-to-saddle dynamics. - -Examining the loss landscape reveals that shallow networks with more neurons than data points always have a non-increasing path to a global minimum. This is a favorable property for (stochastic) gradient convergence. In '_The Hidden Convex Optimization Landscape of Regularized Two-Layer ReLU Networks_', the authors extend those results by adding the weight decay regularization. - -Regularization plays a pivotal role as it lets us influence which local minimum we will reach with gradient descent, usually to favor a simpler solution. Even if no explicit regularization is used, it is known that there is an implicit bias of gradient descent for linear activations, and more recently for ReLU networks using the convex reformulation. - -Other convex approaches are limited to an infinite number of neurons, or to optimization in a neuron-by-neuron fashion, which requires solving many non-convex problems. The setting studied here allows for any number of neurons. - -To sum up, the convex reformulation approach described in this post contrasts with what precedes by presenting results for a shallow network with __finite width layers__, in a __regression__ setting with __ReLU__ activation and __weight decay__ regularization. - -## II. Convex Reformulation - -### Small example walkthrough - -First, let's build an understanding of the inherent convexity caused by the ReLU and the second layer. To do so, we will take simple yet non-convex examples and find their global minima using a convex problem. 
- -#### One ReLU, no second layer, no regularization - -Below is the loss of a single ReLU neuron ($$w_1 \in \RR$$) trained on two data points $$(x_1, y_1)=(-1, 1)$$ and $$(x_2, y_2) = (1, 0.5)$$: 

-\begin{equation}\label{eq:one_neuron_loss} -{\color{cvred}{\mathcal{L}}}(w_1) = \big(\max(0, x_1 ~ w_1) - y_1\big)^2+\big(\max(0, x_2 ~ w_1) - y_2\big)^2 -\end{equation} -

- -Because our only trainable parameter is one-dimensional, we can directly plot the entire loss landscape. - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/redloss.png" class="img-fluid" %} - -

\(\color{cvred}{\mathcal{L}}\) is non-convex in a strong sense: two local minima exist and have distinct values (\((y_1)^2\) and \((y_2)^2\)). In practice, a gradient descent will never be able to switch from fitting one data point to the other (switching from a positive to a negative weight $w_1$ can only be done by increasing the loss).

- -We say that the ReLU neuron can _activate_ one or more data points if the output of its ReLU is non-zero when evaluated on said data. The output of a one-neuron ReLU network is $$\color{cvblue}{\max(0, x ~ w_1)}$$; we can plot both the output and the two data points on the same graph. - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/blueoutput.png" class="img-fluid" %} - -

Plot of the output of a one-neuron ReLU network with a positive weight $w_1$. The ReLU only activates the second data point (as $x_2>0$ and $w_1 > 0$) so the network can fit the second data point. However, doing so means it cannot activate $x_1$ and will incur a constant loss $(y_1)^2$. Overall, depending on the sign of $w_1$, we will have a loss consisting of a constant term for not activating one example and a quadratic term for matching the label of the activated data point. -

- -Before moving on, the important fact here is that we have a true non-convexity of the loss (the difference between the two local minima, $\vert (y_1)^2 - (y_2)^2 \vert$, can be made arbitrarily large), even without a second layer or regularization. Now we will explore the corresponding convex problems. - -#### Activation - -We want to find the global minimum of the one-neuron ReLU network loss function\eqref{eq:one_neuron_loss}. Recall that the loss has two local minima: $(y_2)^2$ for $w_1=y_1/x_1$ and $(y_1)^2$ for $w_1=y_2/x_2$. - -Which data points are activated plays a crucial role in the loss. In the specific example above, $x_2>0$ is activated and $x_1<0$ is not. If we fix the ReLU's activation to this pattern and __replace the max operators__ with $$\czero$$ or $$\cone$$: -

-\begin{equation}\label{eq:firsttry} -\min_{u_1 \in \RR} (\czero \times x_1 u_1 - y_1)^2+ (\cone \times x_2 u_1 - y_2)^2 -\end{equation} -

- -This problem is convex. A gradient descent from any initialization will converge to the optimal loss $(y_1)^2$ with the parameter $u_1 =y_2/x_2$. This parameter directly corresponds to one of the two local minima of the non-convex loss\eqref{eq:one_neuron_loss} by taking $w_1 = u_1$. - -

-\begin{equation*} -\min_{u_2 \in \RR} (\cone \times x_1 u_2 - y_1)^2+ (\czero \times x_2 u_2 - y_2)^2 -\end{equation*} -

- -Similarly, this convex problem's optimal solution directly corresponds to the second local minimum: $(y_2)^2$ for $u_2 = y_1/x_1$. - -All seems good. But keep in mind that we want to build an equivalent problem. If $u_2$ is positive, taking $w_1 = u_2$ does not lead to the same loss value in the original problem because a positive parameter will never activate the first data point. - -To make the issue obvious, consider this convex problem obtained by replacing the two $\max$ operators by $$\cone$$: -

-\begin{equation*} -\min_{u_3 \in \RR} (\cone \times x_1 u_3 - y_1)^2+ (\cone \times x_2 u_3 - y_2)^2 -\end{equation*} -

- -While it is convex, there is no link between the ReLU parameter $w_1$ and this new problem's parameter $u_3$: it is not possible to activate both data points. This issue comes from the fact that replacing a $\max$ by $$\cone$$ only makes sense if what is inside the $\max$ is indeed positive. In other words, as long as $$x_1 ~ w_1$$ is positive we have that $$\max(x_1 ~ w_1, 0) = \cone ~ x_1 ~ w_1$$. -

-\begin{equation*} -\min_{\substack{x_1 ~ u_3 \geq 0\\x_2 ~ u_3 \geq 0}} (\cone \times x_1 u_3 - y_1)^2+ (\cone \times x_2 u_3 - y_2)^2 -\end{equation*} -

- -We added the constraints corresponding to the activation, and it adequately restricts $u_3$ to be in $\{0\}$. - -As a simple reformulation of \eqref{eq:firsttry}, we vectorize (in the number of data points) the convex loss and we add the constraints: - -

-\begin{equation*} -\min_{\substack{\begin{bmatrix}-1 & 0 \\ 0 & 1\end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} u_1 \geq 0}} \ \ -\bigg\| \underbrace{\begin{bmatrix} \czero & 0 \\ 0 & \cone \end{bmatrix}}_{\text{diagonal activation matrix}} -\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} u_1 - \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \bigg\|_2^2 -\end{equation*} -

- -The diagonal activation matrix (named $$D_i \in \{0, 1\}^{n \times n}$$) summarizes the on/off behavior of _one_ ReLU for _all_ data points. The constraints on $u_1$ are directly given by this activation matrix: - -$$\begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix} = 2 \begin{bmatrix} \czero & 0 \\ 0 & \cone \end{bmatrix} - I_2 \qquad \text{$I_2$ the identity matrix of $\RR^2$}$$ - -The other way around, we can define the activation pattern vector for a specific parameter $$u$$: $$(\mathbb{1}_{u ~ x_j \geq 0})_{j=1\dots n} \in \{0,1\}^n$$ with $n$ the number of data points. The activation matrix of $$u$$ is simply the matrix that has this vector for its diagonal. - -So we have exactly four possible activation matrices. $$D_1 = (\begin{smallmatrix} \czero & 0 \\ 0 & \czero \end{smallmatrix})$$ and $$D_2 = (\begin{smallmatrix} \cone & 0 \\ 0 & \cone \end{smallmatrix})$$ will have constraints that reduce to $u_1 = 0$, making them not interesting. The other two lead to convex problems with convex constraints. Solving them will give the parameters that correspond to the two local minima of the loss of a ReLU neural network with a single neuron\eqref{eq:one_neuron_loss}. - -
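-To make this concrete, here is a small sketch (assumptions: numpy, data stored row-wise in `X`) of how an activation matrix is read off a parameter, and how the realizable patterns can be found by random sampling:
-
-```python
-import numpy as np
-
-def activation_matrix(X, u):
-    # Diagonal 0/1 matrix: which data points the neuron u activates.
-    return np.diag((X @ u >= 0).astype(float))
-
-def sample_patterns(X, n_samples=10_000, seed=0):
-    # Sample random directions and keep the distinct activation patterns
-    # they realize; a pattern is interesting iff some non-zero u realizes it.
-    rng = np.random.default_rng(seed)
-    U = rng.standard_normal((n_samples, X.shape[1]))
-    return np.unique((U @ X.T >= 0).astype(int), axis=0)
-
-X = np.array([[-1.0], [1.0]])  # the two 1-D data points used above
-print(sample_patterns(X))      # [[0 1], [1 0]]: the only two interesting patterns
-```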

For any number $n$ of 1-D data points, there are $2^n$ distinct activation matrices but only two of them will be interesting: activating all positive data points, or only activating negative data points. Only some $D_i$ are interesting in higher dimensions, but finding all of them is not obvious.

- -Replacing everything with the usual matrices ($$X=(\begin{smallmatrix}x_1 \\x_2\end{smallmatrix})$$, $$Y=(\begin{smallmatrix}y_1 \\y_2\end{smallmatrix})$$) will get us the equivalent convex problem to a one-neuron ReLU network, whose activation pattern is $D_i$: - -

-\begin{equation*} -\min_{\substack{u_1 \in \RR\\ (2 D_i - I_2) X u_1 \geq 0}} \ \ -\big\| D_i X u_1 - Y \big\|_2^2 -\end{equation*} -

- - -Later sections will investigate what we can say about a ReLU network with more than one neuron. - -#### Multiplicative non-convexity from the second layer - - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/vraitroisd.png" class="img-fluid" %} - -

-\begin{equation}\label{eq:ncvxlin} -\min_{(x, y) \in \RR^2} (x ~ y - 1)^2 -\end{equation} -

- -\eqref{eq:ncvxlin} is not convex: it has two local minima. However, they are symmetric. Simply replace the term $x ~ y$ by a new variable $z$, and use a simple mapping such as $z \rightarrow (1, z)$ to get the solution of \eqref{eq:ncvxlin} from the solution of the convex problem: $$\min_{z \in \RR} (z-1)^2$$. - -The initial problem\eqref{eq:ncvxlin} with L2 regularization is non-convex as well: -

-\begin{equation*} -\min_{(x, y) \in \RR^2} (x ~ y - 1)^2 + \frac{\lambda}{2} ( \vert x \vert^2 + \vert y \vert^2) -\end{equation*} -

- -The convex reformulation with one variable is: -

-\begin{equation*} -\min_{z \in \RR} (z - 1)^2 + \lambda \vert z \vert -\end{equation*} -

- -We have to use a different mapping $$z \rightarrow (\sgn(z) \sqrt{\vert z \vert}, \sqrt{\vert z \vert})$$. One can verify that plugging this mapping into the non-convex problem will give the same value. Therefore, you can solve the convex problem in lieu of the non-convex one. - -Back to non-linear activations, consider the non-convex problem of training a single ReLU neuron with a second layer ($$\alpha_1$$) and an L2 regularization: -

-\begin{equation*} -\min_{(w_1, \alpha_1) \in \RR^2} \big(\max(0, x_1 w_1) \alpha_1 - y_1\big)^2 + \frac{\lambda}{2} \left(\vert w_1 \vert^2 + \vert \alpha_1 \vert^2\right) -\end{equation*} -

- -We fix the activation to only activate $x_1$ (as could be done for any activation pattern) and add the corresponding constraint, as done in the previous section: -

-\begin{equation}\label{eq:ncvx1} -\min_{\substack{(u_1, \alpha_1) \in \RR^2\\ -x_1 ~ u_1 \geq 0}} -\left( \cone ~ x_1 ~ u_1 ~ \alpha_1 - y_1 \right)^2 -+ \frac{\lambda}{2} (\vert u_1 \vert^2 + \vert \alpha_1 \vert^2) -\end{equation} -

- -\eqref{eq:ncvx1} is a non-convex problem because we are multiplying $u_1$ and $\alpha_1$ together (up to some constants). However, this non-convexity can be ignored by considering an equivalent convex function in a very similar way to the $(x ~ y - 1)^2$ problem. -

-\begin{equation}\label{eq:cvx1} -\min_{x_1 ~ z_1 \geq 0} -\left( \cone ~ x_1 ~ z_1 - y_1 \right)^2 -+ \lambda \vert z_1 \vert -\end{equation} -

- -$z_1$ takes the role of the product $u_1 ~ \alpha_1$. We can solve \eqref{eq:cvx1} to get an optimal $z_1$ and then use a mapping $$(w_1, \alpha_1) = (\sgn(z_1) ~ \sqrt{\vert z_1 \vert}, \sqrt{\vert z_1\vert})$$. However, the two problems do not have the same expressivity: $$ \max(0, x_1 ~ w_1) \alpha_1 $$ can be negative but not $$\cone ~ x_1 ~ z_1$$ because of the constraint. Let's add a second variable with the same constraint as $z_1$ that will take the role of a negative $\alpha_1$. -

-\begin{equation}\label{eq:cvx2} -\min_{\substack{x_1 ~ z_1 \geq 0\\x_1 ~ v_1 \geq 0}} -\big( \cone ~ x_1 ~ (z_1 - v_1) - y_1 \big)^2 -+ \lambda (\vert z_1 \vert + \vert v_1 \vert) - -\end{equation} -

- -The variable $$z_1$$ represents a neuron with a positive second layer and $$v_1$$ a neuron with the same activation pattern but with a negative second layer. This is a convex problem (adding a convex regularization preserves convexity) with convex constraints. At the optimum, only one of the two variables will be non-zero. We consider this mapping: -

-\begin{align*} -(w_1, \alpha_1) &= (\sgn(z_1) ~ \sqrt{\vert z_1 \vert}, \sqrt{\vert z_1 \vert}) & \text{ if $z_1$ is non-zero}\\ -(w_1, \alpha_1) &= (\sgn(v_1) ~ \sqrt{\vert v_1 \vert}, - \sqrt{\vert v_1 \vert}) & \text{ if $v_1$ is non-zero} -\end{align*} -

- -One can verify that this mapping does give the same value when plugged into \eqref{eq:ncvx1}. The two problems share the same global minima, as we can easily map back and forth without altering the loss. The global minima of the two problems have the same value as they have the same expressivity; we can say the two problems are equivalent in the sense that we can solve one to get the solution of the other by a simple mapping. - -To summarize, here's the equivalent (with the above mapping) convex problem for a one-neuron ReLU network with regularization and a second layer, whose activation pattern is $D_i$: -

-\begin{equation*} -\min_{\substack{(2 D_i - I_2) X u_1 \geq 0\\ -(2 D_i - I_2) X v_1 \geq 0}} \ \ -\big\| D_i ~ X (u_1 - v_1) - Y \big\|_2^2 -\end{equation*} -

- -#### Equivalent convex problem with two neurons - -Before moving on to the general results, we want to fit two data points, *i.e.*, have both data points activated. To do so, we need at least two neurons. The usual non-convex problem is as follows (with $$X=(\begin{smallmatrix}x_1 \\x_2\end{smallmatrix})$$, $$Y=(\begin{smallmatrix}y_1 \\y_2\end{smallmatrix})$$ and $m=2$): -

-\begin{equation*} - \min_{w_i, \alpha_i \in \RR, i=1 \dots m} \bigg\| \sum_{i=1}^m \max(0, X w_i) \alpha_i - y \bigg\|^2_2 + \lambda \sum_{i=1}^m \left( w_i^2 + \alpha_i^2 \right). -\end{equation*} -

- -This loss is plotted (with $\lambda = 0$ and fixed second layer) in the introduction section. The convex reformulation is very similar. - -

-\begin{equation*} -\min_{\substack{(2 D_i - I_2) X u_i \geq 0\\ -(2 D_i - I_2) X v_i \geq 0}, i=1 \dots m} \ \ -\bigg\| \sum_{i=1}^m D_i ~ X (u_i - v_i) - Y \bigg\|_2^2 + \lambda \sum_{i=1}^m \left( \vert u_i \vert + \vert v_i \vert \right) -\end{equation*} -

- -The best choice (only obvious in this 1-D data case) of activation matrices would be $$D_1 = (\begin{smallmatrix} \czero & 0 \\ 0 & \cone \end{smallmatrix})$$ and $$D_2 = (\begin{smallmatrix} \cone & 0 \\ 0 & \czero \end{smallmatrix})$$. - -Solving and mapping the solutions would give the optimal *global* solution to the problem of fitting two data points with a ReLU network with two neurons. More insights about why this is true are given after the general case section, and the complete proof can be found in the paper. - -#### General Case - -Let us consider a general two-layer ReLU network with an input of dimension $d$, an output of dimension 1 (vector output requires a similar but parallel construction) and a hidden layer of size $m$. With $n$ data points, the full regularized loss is -

-\begin{equation*} - \mathcal{L}(\pmb{W}, \pmb{\alpha}) = \bigg\| \sum_{i=1}^m \max(0, \pmb{X} \pmb{w}_i) \alpha_i - \pmb{y} \bigg\|^2_2 + \lambda \sum_{i=1}^m \left( \| \pmb{w}_i \|^2_2 + \alpha_i^2 \right) -\end{equation*} -

- -This is the same loss as presented at the beginning of the article\eqref{eq:theloss} but with matrices and vectors. $$\pmb{X} \in \RR^{n \times d}$$ is the data matrix and $$\pmb{y} \in \RR^n$$ are the labels. Each neuron has its first-layer parameter $$\pmb{w}_i \in \RR^d$$ and second layer $$\alpha_i \in \RR$$. - -By analogy with what we saw earlier, an equivalent convex problem can be found. Multiplications are replaced by scalar products in the definition of activation matrices, and thus most insights about activation hold. -

-\begin{equation}\label{eq:thecvx} - \min_{\pmb{U}, \pmb{V} \in \mathcal{K}} \bigg\| \sum_{i=1}^m \pmb{D}_i \pmb{X} (\pmb{u}_i - \pmb{v}_i) - \pmb{y} \bigg\|^2_2 + \lambda \sum_{i=1}^m \left( \| \pmb{u}_i \|_2 + \| \pmb{v}_i \|_2 \right) -\end{equation} -

- -$$\pmb{D}_i$$ are the activation matrices. The set of constraints $$\mathcal{K}$$ is the concatenation of the constraints of all neurons. Each constraint can be written succinctly: $$(2 \pmb{D}_i - \pmb{I}_n) X \pmb{u}_i \geq 0$$. If $$u_i$$ respects the constraint, its activation pattern is exactly $$D_i$$, and this is crucial to retrieve the optimal solution of the non-convex loss\eqref{eq:theloss} from the solution of the convex reformulation\eqref{eq:thecvx}. - -A conceptually easy way to have the two problems share the same global loss is to consider a ReLU network with $$2^n$$ neurons, and to formulate the convex problem using all $$2^n$$ distinct activation matrices $$D_i$$. In that case, it is easy to see that they both have the same expressivity. In the paper, it is proved that in theory only $$n$$ neurons and activation patterns are required (using Carathéodory's theorem), but the patterns are not given explicitly. The next section will give more insights on when the two problems are equivalent. - -From a solution of the convex problem\eqref{eq:thecvx}, the *convex neurons* $$u_i$$ can be mapped to the *non-convex neurons* $$(w_i, \alpha_i)$$ using this mapping: -

-\begin{align*} -(w_i, \alpha_i) &= (\frac{u_i}{\sqrt{\| u_i \|_2}}, \sqrt{\| u_i \|_2}) & \text{ if $u_i$ is non-zero}\\ -(w_i, \alpha_i) &= (\frac{v_i}{\sqrt{\| v_i \|_2}}, -\sqrt{\| v_i \|_2}) & \text{ if $v_i$ is non-zero} -\end{align*} -

- -We use the same mapping as in the 1D case, except the direction of the neuron ($$u_i$$) is now a vector in $$\RR^d$$. - -

This is a very simple mapping from convex solution to non-convex neurons. We will call convex neurons the set of parameters that correspond to a neuron in the original, non-convex problem. One can expect similar trajectories between the non-convex and convex neurons during gradient descent. -

- -Here, we fixed the number of neurons and the corresponding activations. A few questions are left unanswered: how many different activation patterns need to be considered, and how many neurons should we consider for both convex and non-convex problems? - -### Specifics about equivalence - -Two problems are considered equivalent when their global optima can be seamlessly mapped back and forth. - -As seen before, there are only two *interesting* possible activation patterns in the one-dimensional case (a single neuron can either activate all the positive data points and none of the negative, or the opposite), but there are close to $$2^n$$ _interesting_ patterns when the data dimension is higher. An activation pattern is interesting if there exists a non-zero vector that respects the constraints and thus realizes the activation pattern. - -The (unique) optimal loss of the convex problem \eqref{eq:thecvx} with all possible activation patterns (for fixed data) $$D_i$$ is the best loss any non-convex network can reach. The following sections are dedicated to understanding why adding more neurons than there are activation patterns will not improve the loss. - -However, if we only consider a subset of all patterns, the convex problem will in general correspond to a local optimum of the non-convex network. Indeed, it is not as expressive as before. This would either correspond to a non-convex network with not enough neurons, or with too many neurons concentrated in the same regions. - -To explore this idea, we go back to one-dimensional data. - -#### 1-D EXAMPLE, ONE NEURON - -In the non-convex problem with only one neuron, there are exactly two local minima. - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/oneneuron.png" class="img-fluid" %} - -

Plot of the output of a ReLU Network with one neuron, one for each of the parameter's local minima. The parameter on the left can be formulated as a solution of a convex problem with one convex neuron using the activation matrix \((\begin{smallmatrix} \czero & 0 \\ 0 & \cone\end{smallmatrix})\), and \((\begin{smallmatrix} \cone & 0 \\ 0 & \czero \end{smallmatrix})\) for the right output.

- -As seen in the previous section, each local minimum can be found exactly by solving the convex problem with a subset of all possible activations, one for the output on the left and one for the output on the right. Here we cannot say that the convex problem (that considers only one pattern) is equivalent to the non-convex one, because the global minimum of the non-convex problem cannot be achieved in the convex problem. However, once we reach a local minimum in the non-convex gradient descent, it can be described as a convex problem by considering one pattern or the other. - -#### 1-D EXAMPLE, TWO NEURONS - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/twoneuron.png" class="img-fluid" %} - -

The non-convex problem initialized with two random neurons and optimized with gradient descent will have three possible local minima (if there is some regularization; otherwise there is an infinite number of them). Either we initialize a neuron for each activation and it will reach the global optimum (left), or two of them will end up in the same pattern (right), activating the same data point.

- -In the case of two neurons, the convex equivalent problem is as follows: - -

-\begin{equation*} -\mathcal{L}(u_1, u_2)= -\bigg\| \begin{bmatrix} \czero & 0 \\ 0 & \cone \end{bmatrix} -\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} u_1 + -\begin{bmatrix} \cone & 0 \\ 0 & \czero \end{bmatrix} -\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} u_2 - \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \bigg\|_2^2 + \lambda (| u_1 | + | u_2 |) -\end{equation*} -

- -is equivalent to the non-convex problem, i.e., solving it will give the global optimum of the non-convex objective. (The negative neurons $v_i$ are zero at the optimum and are omitted here for clarity.) - -#### 1-D EXAMPLE, MANY NEURONS - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/manyneurons.png" class="img-fluid" %} - -

Plotting the positive part of many ReLU neurons. Summed up, they form a network output that perfectly fits the data.

- -We draw one example of a typical local minimum for gradient descent in the specific case of having more neurons than existing patterns. In practice (with more data in higher dimensions) there are far fewer neurons than possible activations. However, there are many situations in which neurons will lead to the same activation patterns, and in the experiment section we will see how to force such dynamics. - -Note that we can merge neurons that are in the same activation pattern by summing them up (even in higher dimensions), creating a new neuron, and keeping both the output and the loss unchanged (although the regularization might decrease). The fact that having more than one neuron in one pattern does not decrease the loss is at the core of the proof. - -### Activation patterns - -The equivalence proof is heavily based on ReLU, specifically on the fact that a ReLU unit divides the input space into two regions: one where it outputs zero, and the other where it is the identity. If you consider a finite set of samples and a single ReLU, it will activate some samples and deactivate the others: this is called an activation pattern. A diagonal matrix $$\pmb{D}_i \in \{0,1\}^{n \times n}$$ describes one activation pattern, but not all are possible for a given dataset. There is a finite number of such possible patterns, exponential in the dimension of the data. - -This section is important for understanding the final animations in the experimental section, and it helps explain how active activation patterns evolve in the non-convex problem. - -#### Two-Dimensional Data - -In the previous part, we considered data to be one-dimensional, which resulted in only two interesting activation patterns. Let us consider two-dimensional data. To do so in the simplest way possible, we will consider regular one-dimensional data and a dimension filled with $$1$$s. This will effectively give the neural network a _bias_ to use without modifying the formulas. - -We consider two data points: $$\color{cvred}{\pmb{x}_1} = (-0.2, 1)$$ and $$\color{cvred}{\pmb{x}_2} = (1, 1)$$, each associated with their label $$y_1 = 0.5$$ and $$y_2 = 1$$. We plot the output of one ReLU unit initialized at $$\pmb{w}_1 = (0.3, -0.15)$$, $$\alpha_1 = 1$$. Therefore we have -

-\begin{align*} -\max(0, \pmb{w}_1^\top \pmb{x}_1) &= 0 \\ -\max(0, \pmb{w}_1^\top \pmb{x}_2) &= \pmb{w}_1^\top \pmb{x}_2 -\end{align*} -

- -The activation pattern of $$\pmb{w}_1$$ is $$\pmb{D}_1=\left(\begin{smallmatrix} \czero & 0 \\ 0 & \cone \end{smallmatrix}\right)$$. There are only three other possible activation patterns: activating both data points with $$\pmb{D}_2=\left(\begin{smallmatrix} 1 & 0 \\ 0 & 1 \end{smallmatrix}\right)$$, activating only the first one with $$\pmb{D}_3=\left(\begin{smallmatrix} 1 & 0 \\ 0 & 0 \end{smallmatrix}\right)$$, and activating no data point with a zero matrix. - -One point of interest is the input at which the ReLU output is 0. This is where the output changes its slope: $$a_1 = -w_1^2/w_1^1$$ where $$w_1^i$$ is the $i$-th coordinate of $$\pmb{w}_1$$. Here, $$a_1 = 0.5$$. We call this the _activation point_ of the neuron $$\pmb{w}_1$$. - -We plot the output, $$\color{cvblue}{\max(0, (x, 1) ~ \pmb{w}_1^\top)}$$, of the network as a function of the first dimension of the data $$x^1$$ (here simply written $$x$$): - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/twodim.png" class="img-fluid" %} - -

A neuron initialized so that it activates only one data point, i.e., its activation point lies between the two samples. Its slope tells us whether it activates on the left or on the right, as in this case (on the right).

- -__Illustration__. - -In the animation below, we train this network using vanilla gradient descent on the two data points $$\color{cvred}{\pmb{x}_1}$$ and $$\color{cvred}{\pmb{x}_2}$$, represented by the red crosses. We plot its $$\color{cblue}{\text{output}}$$ in blue for every possible data point (omitting the second dimension as it is always 1 in this example, playing the role of the bias), and we plot in red the label associated with the two data points. Each frame corresponds to one step of full-batch gradient descent with a small learning rate. We mark the $$\color{cgreen}{\text{activation point}}$$ of the neuron with a green triangle, pointing toward the side the neuron activates. The green triangle's height is the slope of the ReLU's output, equal to $$u_1^1 = w_1^1 \alpha_1$$, allowing us to visualize how important one neuron is for the output of the network. - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/firstgif_movie.gif" class="img-fluid" %} - -

Training a single neuron network with gradient descent until it exactly fits two data points. It starts by fitting the only point it activates, \(\color{cvred}{\pmb{x}_2}\). As training progresses, the activation point represented by a green triangle shifts position. As soon as the activation point reaches \(\color{cvred}{\pmb{x}_1}\), it activates it and starts fitting both points at the same time. Its activation pattern shifts from \(\left(\begin{smallmatrix} \czero & 0 \\ 0 & \cone \end{smallmatrix}\right)\) to \(\left(\begin{smallmatrix} \cone & 0 \\ 0 & \cone \end{smallmatrix}\right)\) and stays the same until convergence.

- -Adding more neurons will not create additional activation patterns; only adding more data points will. With only two data points $$\pmb{x}_1$$ and $$\pmb{x}_2$$, we only had 4 possible patterns; with four data points we have 10 possible patterns. - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/annoying.png" class="img-fluid" %} - -

We plot in blue the individual output and activation points of each of the ReLU neurons associated with the ten _interesting_ activation patterns. Those are the 10 (20 with negative ones) neurons that need to be considered to get the global optimum using the convex equivalent. When moving the activation point \(a_i\) of a neuron between two data points, its activation pattern does not change.

- -

Notice that it is not possible to only activate the data points in the middle. However, if we increase the data's dimension, this becomes possible. This is also possible with a second layer of ReLU. In higher dimensions, we cannot visualize the activation patterns as easily, but we can understand that as dimensionality increases, more patterns are possible as it is easier to separate different data points.

- -### Extensions of the convex reformulation to other settings - -Batch Normalization (BN) is a key process that adjusts a batch of data to have a mean of zero and a standard deviation of one, using two trainable parameters. In the convex equivalent, we replace $$\pmb{D}_i \pmb{X}$$ with $$\pmb{U}_i$$. This $$\pmb{U}_i$$ is the first matrix in the Singular Value Decomposition (SVD) of $$\pmb{D}_i \pmb{X} = \pmb{U}_i \pmb{\Sigma}_i \pmb{V}_i^\top$$. If the output is a vector, rather than a scalar, the regularization changes to require a nuclear norm in the convex equivalent. Three-layer networks also have a convex equivalent using all possible combinations of two activation matrices. Moreover, parallel networks are also linked to a convex problem. Lastly, in Wasserstein Generative Adversarial Network (WGAN) problems, the adversarial games played by two-layer discriminators are identified as instances of convex-concave games. - -## III. Can We Forget the Non-Convex Problem? - -### Solving the convex problem efficiently is hard - -In the last ten years, deep neural networks have been trained using (stochastic) gradient descent on the non-convex problem. The algorithm, the implementation, and even the hardware running the training have been heavily optimized, supported, and pushed by industrial and scientific applications. Such networks were practically abandoned for years after being discovered because there did not exist an efficient way to train them. Nowadays, it takes a few lines to train a network on dedicated hardware, and this might make us forget how much engineering has made this possible. This should be kept in mind when comparing a new approach to the problem. - -Training a network with the non-convex problem can be time-consuming, as it requires tuning hyperparameters and rollbacks (retrieving a previous state) to get out of a bad minimum. In contrast, the convex approach deals with far fewer parameters and has only one global minimum. - -In complexity terms, the convex reformulation with all possible activation patterns $D_i$ gives an algorithm in polynomial time for all parameters except for the rank of the data matrix. In practice and with usual datasets, the rank is high and there will be too many patterns to consider them all. - -There has been some work focused on solving the convex problem quickly. The first idea is to take a random subset of activation patterns and use standard convex solvers. Current convex solvers (ECOS, ...) are not tailored to problems with many constraints. There is some hope in considering the unconstrained version of the problem to build an approximation. In most deep learning scenarios, it is hard to be faster than, or even competitive with, a simple gradient descent running on GPUs. - -| Dataset | Convex | Adam | SGD | Adagrad | -|----------|--------|------|------|---------| -| MNIST | 97.6 | 98.0 | 97.2 | 97.5 | -| CIFAR-10 | 56.4 | 50.1 | 54.3 | 54.2 | - -

Test accuracy on popular datasets for a network with a single hidden layer of 5000 neurons.

- - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/quantgraph.png" class="img-fluid" %} - -

Time to solve problems from the UCI datasets with Adam on the non-convex problem and with a custom solver (using the augmented Lagrangian method) on the convex problem. The code for the paper's experiments is available on GitHub, as well as the convex problem toolkit.

- -For relatively small datasets and networks, convex solvers are fast and do not require any tuning to converge. Adjusting the regularization will directly reduce the number of neurons needed. - -

-A convex equivalent of deeper networks exists but exacerbates existing problems. The only way to make it possible is to optimize layer by layer. This is still a work in progress and needs further improvements to be competitive.

- -### Activation patterns are not a constant in the non-convex problem - -Let's set aside the performance concerns and use the reformulation as a new point of view for observation. Our non-convex problem is equivalent to a convex and well-specified optimization problem with constraints. The global optima might be the same, but training the network with gradient descent almost always leads to a local minimum. Because there are too many activations to consider them all, the convex problem also only finds a local minimum. However, it is not clear whether the two approaches find the same kind of local minimum. - -Activation patterns can and will change during gradient descent in the non-convex problem. In some cases, this pattern shifting is useful because the new activation patterns may lead to a better minimizer. To verify this, we monitor the number of unique activation patterns used by the network at each step of a gradient descent. If two neurons have the same activation pattern (_i.e._ they activate and deactivate the same data points), we count them as one. - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/nbactiv.png" class="img-fluid" %} - -

Training a network with 100 random data points in 10 dimensions. The network only has 20 randomly initialized neurons, and the labels depend linearly on the input. Each neuron has a unique activation pattern, as can be seen on the graph. This is expected in this setting because there are so many possible activation patterns (close to $10^{25}$: the number of activation patterns is the same as the number of regions in a partition by hyperplanes perpendicular to the rows of $X$ and passing through the origin, which is bounded by \(2 r \left(\frac{e ~ (n-1)}{r}\right)^r\) with $r$ the rank of $X$). However, as training progresses, neurons align themselves to the same patterns. After 300 steps, the 20 neurons only share 5 unique activation patterns.

- -However, we can show an aspect that sets both formulations apart. The convex problem has fixed activation patterns. If the activations are missing important data, the convex solution will not be optimal. Meanwhile, in the non-convex problem, the gradient descent keeps shifting from pattern to pattern until it converges. - -__Illustration.__ - -We will further study this setting with 100 data points and 20 neurons in high dimensions. To compare how the two methods deal with activation patterns, we will use the activation pattern of the neurons of the non-convex problem to construct a convex problem and solve it. To be more explicit, for each non-convex neuron $$\pmb{w}_i$$, we find its activation pattern and add a $$\pmb{u}_i$$ constrained to this pattern to the convex problem. In the end, we have a convex problem with 20 neurons that will activate the same data points as the non-convex neurons. - -We train the non-convex network using gradient descent, and at each step, we construct a convex problem, solve it, and compare its global minimum to our current non-convex loss. This convex problem fully describes the local minimum we would find if the non-convex problem was constrained to never change its activation patterns. - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/cvx_vs.png" class="img-fluid" %} - -

- -Training a 20-neuron network with gradient descent and using the same activation patterns to solve the convex equivalent. We plot, for each step, the current loss of the non-convex network and the optimal loss of the corresponding convex problem. At initialization (first point on the graph), the non-convex loss is 1; taking the current activation patterns, building the convex problem and solving it gives an optimal loss of $0.1$. At the next step, the non-convex loss has decreased and the activation patterns have changed, so we find a different optimal loss for the convex problem. The optimal loss of the convex problem built at step 0 is eventually beaten by gradient descent (at around step 175): our initial choice of activation patterns was far from optimal, and gradient descent continually improves on it. We use cvxpy to define the problem and solve it using ECOS. -

- -In general, we cannot predict which patterns will be used by the neurons found by GD, or which patterns are the best. Thus we cannot hope that the convex problem will give us much insight, as it requires us to know the activation patterns. We can, however, predict what (some of) the optimal solutions will look like: a spline interpolation on the training samples. - -In the next section, we focus on cases where the non-convex minima can be accurately described by convex problems. - -### On large initialization scale - -The initialization scale of the network is the absolute size of the neurons' parameters. To change the scale, we can simply multiply every parameter by a scalar. The initial value of the neurons is a large topic in machine learning, as it has a large influence on the quality of the local minimum. By default in popular libraries, _He initialization_ is used; it draws neurons from a normal distribution centered on 0 and with a variance of $$1/m$$, with $$m$$ the number of neurons. However, in the literature, there are many choices to pick from. - -We say we are on a large scale when neurons do not move far from their initial value during descent. This typically happens when using large initial values for the parameters of each neuron. - -The theory states that you can push the scale high enough so that neurons will not change their activation patterns at all. If this is verified, the convex reformulation will describe exactly the minima that gradient descent will reach. However, it is not possible to observe this in practice, as the loss becomes very small and the training process is too slow to carry on to the end. The NTK briefly mentioned in the introduction operates in this setting, using the fact that the network is very close to its linear approximation. On a similar note, reducing the step size for the first layer guarantees convergence. - -__Illustration.__ - -Using an animation, we plot every step of a gradient descent in the non-convex problem until the loss is small enough. As mentioned before, the training is too slow to continue until we reach a real local minimum described by the convex problem here. We plot the output of the network, which is the sum of all the neurons. We want to focus on the activation point of each neuron. - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/bigscale_movie.gif" class="img-fluid" %} - -

-Training a network with 1000 neurons with large initial values using gradient descent. The output of the network is in blue, and the four data points (red crosses) represent linear data. Each green triangle represents one neuron, with its activation point plotted horizontally and its norm vertically. The orientation of the triangle indicates on which side the neuron activates the data. At initialization, the distribution of activation points is uniform. During training, the activation points barely move: among the thousand neurons, only a few change their patterns. -

- -Here, solving the convex problem yields a single neuron that fits the linear data. While the non-convex problem has also converged to a very low loss, the two outputs are completely different. - -

A side effect of the large initialization scale is catastrophic overfitting, i.e. very large variations of the output between data points, which will negatively impact the test loss. -

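To make the notion of scale concrete, here is a small sketch (with illustrative values of ours) of how both regimes are obtained by multiplying a He-style initialization by a single scalar:

```python
import torch

m, d = 1000, 1        # number of neurons, input dimension (illustrative)
alpha = 100.0         # initialization scale: alpha >> 1 gives the large-scale
                      # (lazy) regime, alpha << 1 the very small-scale regime
W = alpha * torch.randn(m, d) / m ** 0.5   # first layer, variance alpha^2 / m
a = alpha * torch.randn(m) / m ** 0.5      # second layer
```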
- -### On very small initialization - -At the other extreme, the small-scale setting effectively lets neurons align themselves before the loss ever starts to decrease. In theory, if you push the scale down far enough, neurons converge to a finite set of directions before trying to fit the objective. - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/smallscale_movie.gif" class="img-fluid" %} - -

-Training a network with 1000 neurons with very small initial values using gradient descent. The output of the network is in blue, and the four data points (red crosses) represent linear data. Each green triangle represents one neuron, with its activation point plotted horizontally and its norm vertically. The orientation of the triangle indicates on which side the neuron activates the data. At initialization, the distribution of activation points is uniform. However, as training progresses, most neurons that activate toward the right converge to $-1.3$. Once the norm of the neurons activating at $-1.3$ is large enough, the loss decreases and we quickly reach convergence. -

- -Taking a look at the loss on the same problem, we can identify two distinct regimes: alignment, then fitting (followed by convergence). - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/lastgif_plot.png" class="img-fluid" %} -

Plot of the loss during gradient descent in the same setting as the animation above. In the first half, only the directions of the neurons change (i.e. their activation patterns); the neurons start fitting the four data points once their parameters are large enough.

- -If you take orthogonal data and a small initialization scale, the behavior is very predictable, even in a regression setting. - -

Unless mentioned otherwise, all experiments were run using full-batch vanilla gradient descent. In our experiments, adding momentum or using the Adam optimizer was clearly easier to use, on top of converging faster. However, the resulting behavior is much less predictable.

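For completeness, a minimal sketch of the full-batch vanilla gradient descent used in the illustrations above, on the two-layer ReLU network; the learning rate and step count are placeholders of ours:

```python
import torch

def train_full_batch(X, y, W, a, lr=1e-2, steps=10_000):
    # Full-batch vanilla gradient descent on f(X) = relu(X W^T) a
    # with the mean squared loss; no momentum, no Adam.
    W = W.clone().requires_grad_(True)
    a = a.clone().requires_grad_(True)
    for _ in range(steps):
        loss = ((torch.relu(X @ W.T) @ a - y) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            for p in (W, a):
                p -= lr * p.grad
                p.grad = None
    return W.detach(), a.detach()
```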
- -## Conclusion - -The main takeaway is that the best network for a given dataset can be found exactly by solving a convex problem. Additionally, the convex problem can describe every local minimum found by gradient descent in the non-convex setting. However, finding the global optimum is intractable in practice, and current approximations still trade away precision. While there is no evident link between feature learning in the non-convex problem and the convex reformulation, many settings allow for a direct equivalence, and with it the whole convex toolkit for proofs. - -The performance side of the convex reformulation will benefit from dedicated software, as has been the case for gradient descent in deep networks. Only then will it offer a no-tuning alternative to costly stochastic gradient descent. In smaller settings, it already allows us to quickly find all the possible local minima that are so important in machine learning. - -Despite advances in understanding the optimization landscape of neural networks, a significant gap persists between theory and practical challenges, notably because of early stopping. In real-world scenarios, networks often cease learning before reaching a local minimum; this has a direct impact (in the large-scale initialization regime), but theoretical results here remain limited. - -## Acknowledgements - -This work is partly funded by the ANR JCJC project ANR-21-CE23-0022-01. diff --git a/_posts/2024-05-07-language-model-development-as-a-new-subfield.md b/_posts/2024-05-07-language-model-development-as-a-new-subfield.md deleted file mode 100644 index bf4eb8ee..00000000 --- a/_posts/2024-05-07-language-model-development-as-a-new-subfield.md +++ /dev/null @@ -1,132 +0,0 @@ ---- -layout: distill -title: "A New Alchemy: Language Model Development as a Subfield?" -description: This blog post makes the case that the body of research on language models has become sufficiently large and mature that we can start thinking about “language model development” as a new subfield. - To support this claim, we sketch out the focuses and methodologies of this new subfield. - In addition, we provide some personal reflections on what to do when your field of study gives birth to a new one. -date: 2024-05-07 -future: true -htmlwidgets: true - -authors: - - name: Colin Raffel - url: "https://colinraffel.com/" - affiliations: - name: University of Toronto, Vector Institute - -# must be the exact same name as your blogpost -# bibliography: 2024-05-07-distill-example.bib - -# Add a table of contents to your post. -# - make sure that TOC names match the actual section names -# for hyperlinks within the post to work correctly. -toc: - - name: Some history - - name: Language model development - - name: A New Alchemy ---- - -Historically, language models have served as an important component of many learning systems -- for example, to improve the transcriptions generated by a speech recognition system. -However, the impact and usage of language models have grown dramatically over the past few years. -Arguably, this growth is simply thanks to the fact that language models have gotten *better*, i.e. more accurate at predicting some text based on some context. -Since most text-based tasks can be cast as predicting a response to a request (e.g. "summarize the following article", "write me a Python function that queries Wikipedia", etc.), recent large language models (LLMs) have proven somewhat effective at performing an incredibly wide range of tasks. 
-Improvements in the language understanding and generation capabilities of LLMs have also led to their adoption in many larger systems (e.g. robots, image processing/generation, etc.), where they increasingly enable natural language to be used as an interface. -These advances have led to a huge amount of research into building and using language models. -I think this body of research has become sufficiently large and mature that we can start thinking about "language model development" as a new subfield. -The goal of this blog post is to sketch out the focuses and methodologies of the subfield of language model development as well as to provide some personal reflections on what to do when your field of study gives birth to a new one. - - -## Some history - -As a subfield, language modeling has many sibling and parent fields, including information theory, artificial intelligence, natural language processing, and machine learning. -In my biased opinion, many recent advances in language modeling have stemmed from advances in deep learning. -When thinking about fields like deep learning, I think it can be valuable to define what the assumptions and major problems of the field are. -For deep learning, I would roughly say that the assumptions are: - -1. We should end-to-end optimize everything. -1. Training a bigger model on a bigger dataset should yield improved performance, but we should also strive to develop efficient and performant model architectures. -1. If we can bake structure into our model (e.g. convolutions for images), things work better... -1. but what we really want is a system that can learn everything from data and relies on as few hard-coded assumptions as possible. -1. We care less about theoretical guarantees and more about how well something works in practice. - -Notably, the assumptions of a field are not necessarily scientifically or philosophically motivated - they can be cultural or arise from extraneous factors (e.g. the availability of GPUs). -The major problems of the field of deep learning might be: - -1. How can we design neural network architectures that work well for a given problem, or better yet, across a wide variety of problems? -1. Similarly, what objective works best? -1. How should we optimize that objective? -1. How can we ensure all of the above can be scaled up effectively? - -Arguably, one of the biggest successes of recent deep learning research is a powerful recipe for training effective models on a wide variety of problems, namely, the Transformer trained with some variant of Adam. -While the objective used can vary across problem settings, in text-based problems a simple language modeling objective works well (and, as discussed above, encapsulates pretty much any text-based task). -An important aspect of this Transformer recipe is its scalability, i.e. the ability to attain predictable gains from scaling up training compute and/or dataset size. - -## Language model development - -I think the scalability of the Transformer has ushered in a new era of research that is distinct from deep learning research. -For the first time, we can (to a significant degree) stop worrying about what model architecture to use, how to train the model, what objective to use, whether we'll continue to get returns from scaling, etc. -Instead, this new line of research primarily aims to study the development of language models in order to expand and understand their capabilities. 
-In addition, the fact that recent LLMs are reasonably competent at a huge range of tasks has led to major differences in terms of how we use LLMs (when compared to e.g. how we built and used neural networks in the context of deep learning). -For lack of a better term, I'll refer to this new (sub)field as "language model development", which might have the following assumptions: - -1. We can assume that the model architecture, optimizer, and objective are basically fixed. -1. We hope or expect that a given LLM can be induced to perform basically any task out-of-the-box without performing any additional training (i.e. updating its parameters), and in general we should avoid updating parameters to specialize a model to a given task (i.e. task-specific fine-tuning). -1. The computational cost of getting a model to perform a task is mostly irrelevant, or at least, these costs will be resolved by something else (e.g. better/more hardware). -1. If we invest more compute in training an LLM, it will [produce better results](https://arxiv.org/abs/2001.08361). - -Arguably, some of these assumptions could be considered consequences of the fact that many state-of-the-art language models are only available through black-box APIs. -The major problems of language model development are something like: - -1. How can we get the model to do what we want (i.e. "prompt engineering")? -1. How can we make the model run as efficiently as possible? -1. To the extent that we are going to update a model, how can we update it so that it is better at following instructions and less likely to generate harmful content (i.e. alignment)? -1. More broadly, if we are really hoping the model can do *anything*, how do we prevent it from doing things we don't want it to? -1. How can we integrate language models into other systems (i.e. tool use, multimodality, etc.)? - -Let me give a few additional examples of papers and techniques that I think aim to attack these problems under the aforementioned assumptions. - -- An early technique for "getting an LLM to do what we want" (goal #1) is [few-shot in-context learning (ICL)](https://arxiv.org/abs/2005.14165), where a few examples of the desired input/output behavior are provided in the model's input before the model is asked to process an unseen example. - Few-shot ICL avoids updating the model's parameters (assumption #1) and mostly ignores the fact that it significantly increases computational costs (assumption #3). - A related and more recent variant of ICL is ["chain-of-thought prompting"](https://arxiv.org/abs/2201.11903), which adds reasoning steps to the in-context examples in hopes of improving performance by inducing the model to generate similar reasoning steps before generating its prediction. - The fact that including reasoning steps further increases computational costs is, again, mostly ignored (assumption #3). -- Techniques like [FlashAttention](https://arxiv.org/abs/2205.14135) and [Speculative Decoding](https://arxiv.org/abs/2211.17192) aim to make the model run more efficiently (goal #2) without changing the model or its outputs whatsoever (assumption #1). - More broadly, techniques like the [Heavy-Hitter Oracle](https://arxiv.org/abs/2306.14048) or [quantization](https://arxiv.org/abs/2208.07339) aim to reduce memory or computational costs with minimal performance degradation. 
- The pursuit of these techniques, along with orthogonal hardware advances like NVIDIA's Transformer Engine, arguably supports the apparent disregard for increases in computational cost that arise from using a larger model (assumption #3). -- While there certainly has been some effort to improve over the Transformer architecture or the optimizer used to train LLMs (in violation of assumption #1), the vast majority of these improvements have not been widely adopted, either due to inertia (i.e., enforcement of assumption #1) or the apparent fact that [they do not always transfer across applications](https://arxiv.org/abs/2102.11972). - -Separately, a sign of the maturity of a new subfield is the development of teaching materials. -I think my friend Sasha Rush is leading the charge here, with e.g. [GPTWorld for learning prompting](https://github.com/srush/GPTWorld), [LLM training puzzles for learning about distributed training](https://github.com/srush/LLM-Training-Puzzles), and [Transformer puzzles for understanding how Transformers might work](https://github.com/srush/Transformer-Puzzles). -Another sign is the establishment of a conference on the subject, and we [have one of those now too](https://colmweb.org/). - -## A New Alchemy - -LLMs have ushered in a paradigm shift in the path toward imbuing computers with human-like capabilities. -This paradigm shift is being felt in various fields, including deep learning (where the work of designing new architectures or optimizers is increasingly less relevant), natural language processing (where we now have a recipe that works reasonably well across subproblems that previously demanded custom methodologies), and beyond. - -I started my PhD in 2012 during a similar paradigm shift from what I'd call "statistical machine learning" to deep learning. -Unlike deep learning, statistical ML prioritized theoretical guarantees (e.g. convexity of the objective function and/or convergence under certain conditions). -These guarantees arguably limited model expressivity, which arguably necessitated things like feature engineering that deep learning strove to avoid. -While deep learning by no means "solved" the problems of statistical ML (just as language model development does not "solve" deep learning), it nevertheless presented a paradigm that made dramatic progress on the target problems of statistical ML and unlocked new applications. -Such empirical successes of deep learning -- which almost entirely eschewed theoretical guarantees -- led to a great deal of hand-wringing on the part of the statistical ML crowd. - -As my research increasingly made use of deep learning, I started to find myself at the receiving end of this hand-wringing. -For example, during my first-ever oral presentation at a conference, I was presenting work that made use of convolutional neural networks. -During questions, an audience member expressed distaste at my use of "*convoluted*" neural networks and suggested that something simpler would have worked better (of course I had tried simpler models and they worked significantly worse, but let's put that aside for the moment). -This kind of despair was common at the time - people were applying deep neural networks in settings where they may or may not have been overkill, simply because it was the zeitgeist. 
-At another conference I attended during my PhD, I happened to share a hostel room with a computer vision researcher who went on a long rant about the atrocity of deep learning (sometimes I wonder what this researcher is working on now). -I think this sentiment is most elegantly laid out in [Ali Rahimi's NeurIPS 2017 test-of-time award acceptance speech](https://www.youtube.com/watch?v=x7psGHgatGM), where he argues that deep learning is like alchemy - trial-and-error that yields some effective techniques but lacks rigor. -Ali's speech had a big impact on me and others but arguably didn't really stop people from continuing to develop and apply deep learning without worrying about rigor and in settings where simpler methods would have sufficed (simply because using a big fancy neural network was sexier). - -These experiences led me to promise myself that when my field of study gave birth to another, I wouldn't dig my feet in and resist, I'd follow the tide of progress. -Now that this is (arguably) happening I'm finding it more difficult than I had anticipated. -As much as I wish it wasn't true, I cringe a little whenever I see a new LLM technique that ignores a dramatic increase in computational cost and bends over backwards to avoid updating the model's parameters, or an application of an LLM where something dramatically cheaper would suffice, or a paper studying the behaviors of an LLM as if it's a black box (or studying an LLM API, in which case it actually *is* somewhat of a black box), and on and on. -And try as I might, I can't resist trying to stem the tide -- for example, the [T-Few paper](https://arxiv.org/abs/2205.05638) aimed to convince everyone that few-shot ICL was absurdly computationally inefficient and that fine-tuning specialized models is cheaper and better. -Of course, people are still using few-shot ICL and are still avoiding task-specific fine-tuning at all costs, because that's the zeitgeist -- and I think this isn't totally wrong, because in tandem there's a huge amount of synergistic work on making LLMs more efficient and effective. -But, to be honest, it still *feels* a little wrong, and I'm not sure if I'll be able to shake that feeling. - -So, what's the best course of action [when you used to be with it, but then they changed what "it" was](https://www.youtube.com/watch?v=LV0wTtiJygY)? -I think there were many ML researchers who successfully rode the tide from statistical ML to deep learning -- they willingly embraced the new field while bringing their knowledge and sense of rigor to their deep learning research. -In other words, they used their past knowledge to provide a broader and deeper perspective that newcomers may have lacked. -An especially prominent product of this kind of research is arguably the [Variational Autoencoder (VAE)](https://arxiv.org/abs/1312.6114), which connected ideas from variational inference to the autoencoder neural network architecture. -VAEs are still an important component of state-of-the-art diffusion-based generative models. -Hopefully, those of us who were working on deep learning and NLP before the LLM era can bring a similar perspective (and avoid digging our feet in too much). 
diff --git a/_posts/2024-05-07-mode-switching.md b/_posts/2024-05-07-mode-switching.md deleted file mode 100644 index 7e1c5e08..00000000 --- a/_posts/2024-05-07-mode-switching.md +++ /dev/null @@ -1,615 +0,0 @@ ---- -layout: distill -title: Behavioral Differences in Mode-Switching Exploration for - Reinforcement Learning -description: In 2022, researchers from Google DeepMind presented an initial - study on mode-switching exploration, by which an agent separates its - exploitation and exploration actions more coarsely throughout an episode - by intermittently and significantly changing its behavior policy. We - supplement their work in this blog post by showcasing some observed - behavioral differences between mode-switching and monolithic exploration - on the Atari suite and presenting illustrative examples of its benefits. - This work aids practitioners and researchers by providing practical - guidance and eliciting future research directions in mode-switching - exploration. -date: 2024-05-07 -future: true -htmlwidgets: true - -# Anonymize when submitting -# authors: -# - name: Anonymous - -authors: - - name: Loren J Anderson - url: - affiliations: - name: USA Space Force - -# must be the exact same name as your blogpost -bibliography: 2024-05-07-mode-switching.bib - -# Add a table of contents to your post. -# - make sure that TOC names match the actual section names -# for hyperlinks within the post to work correctly. -# - please use this format rather than manually creating a markdown table of contents. -toc: - - name: 1. Introduction - subsections: - - name: Mode-Switching Distinctions - - name: Mode-Switching Basics - - name: Blog Post Motivation - - name: 2. Experiments - subsections: - - name: Concentrated Terminal States - - name: Early Exploration - - name: Concentrated Return - - name: Post-Exploration Entropy - - name: Top Exploitation Proportions - - name: 3. Conclusion - subsections: - - name: Acknowledgements - -# Below is an example of injecting additional post-specific styles. -# This is used in the 'Layouts' section of this post. -# If you use this post as a template, delete this _styles block. - ---- - -## 1. Introduction - -Imagine learning to ride a bicycle for the first time. This task -requires the investigation of numerous actions such as steering the -handlebars to change direction, shifting weight to maintain balance, and -applying pedaling power to move forward. To achieve any satisfaction, a -complex sequence of these actions must be taken for a substantial amount of -time. However, a dilemma emerges: many other tasks such as eating, sleeping, and working may result in more immediate satisfaction (e.g. lowered hunger, better rest, bigger paycheck), which may tempt the learner to favor other tasks. Furthermore, if enough satisfaction is not quickly achieved, the learner may even abandon the task of learning to ride a bicycle altogether. - -One frivolous strategy (Figure 1, Option 1) to overcome this dilemma is to -interleave a few random actions on the bicycle throughout the remaining -tasks of the day. This strategy neglects the sequential nature of bicycle -riding and will achieve satisfaction very slowly, if at all. Furthermore, -this strategy may interrupt and reduce the satisfaction of the other daily -tasks. The more intuitive strategy (Figure 1, Option 2) is to dedicate -significant portions of the day to explore the possible actions of bicycle -riding. 
The benefits of this approach include testing the sequential -relationships between actions, isolating different facets of the -task for quick mastery, and providing an explicit cutoff point to shift -focus and accomplish other daily tasks. Also -- let's face it -- who wants to wake up in the middle of the night to turn the bicycle handlebar twice -before going back to bed? - -{% include figure.html path="assets/img/2024-05-07-mode-switching/bike.png" class="img-fluid" %} -
- Figure 1: Illustrative difference between monolithic and mode-switching -behavior policies . -
- -The above example elicits the main ideas of the paper *When Should Agents -Explore?* , published by -researchers from Google DeepMind at ICLR 2022, which is the central piece -of literature discussed throughout this blog post. The first strategy -presented in the preceding paragraph is known as a **monolithic** behavior -policy that interleaves exploration actions (e.g. learning to ride a -bicycle) among the more frequent exploitation actions (e.g. work, sleep) in -a reinforcement learning (RL) environment. In contrast, the second strategy -presented above is a **mode-switching** behavior policy, as it more -coarsely separates exploration and exploitation actions by switching -between disparate behavior modes throughout an episode. Mode-switching -policies subsume monolithic policies at the cost of increased complexity -through introducing a new question: *when to switch*. Similar aspects of -mode-switching for diverse exploration have been observed in the -exploratory behavior of humans and animals , which served as a notable motivation for the initial mode-switching study . - -This introduction section continues with a brief discussion of topics -related to mode-switching behavior policies, ranging from different temporal -granularities to algorithms in the literature that exhibit mode-switching -behavior. We emphasize practical understanding rather than attempting to present an -exhaustive classification or survey of the subject. Afterwards, we discuss -our motivation and rationale for this blog post: the authors of the initial -mode-switching study showed that training with mode-switching -behavior policies surpassed the performance of training with monolithic -behavior policies on hard-exploration Atari games; we augment their work by -presenting observed differences between mode-switching and monolithic -behavior policies through supplementary experiments on the Atari benchmark -and other illustrative environments. Possible avenues for applications and -future investigations are emphasized throughout the discussion of each experiment. It is assumed that the interested reader has basic knowledge in RL techniques and challenges before proceeding to the rest of this blog post. - -### Mode-Switching Distinctions - -Mode-switching behavior policies (which we will sometimes shorten to -*switching -policies*, and likewise to *monolithic policies*) were explicitly -introduced in the initial mode-switching study, -and we will now focus on briefly contrasting switching policies against -monolithic policies and the previous exploration literature. Figure 2 -illustrates the high-level, pivotal difference between switching and -monolithic policies: at the beginning of each time step, the agent may use -all of its available information to determine its behavior mode -for the current time step and then output a corresponding behavior policy to -determine -the action. A key distinction is that switching policies can drastically -change between time steps since the modes can be tailored to a variety of -different purposes (e.g. exploration, exploitation, mastery, novelty). As -the graphic illustrates, switching is such a general addition to an -algorithm that it was not exhaustively characterized in the initial study. - -{% include figure.html path="assets/img/2024-05-07-mode-switching/box.png" class="img-fluid" %} -
- Figure 2: Introduction of mode-switching behavior to standard -agent-environment RL interaction. -
- -A **mode period** is defined as a sequence of time steps in a single mode. -At the finest granularity, *step-level* periods only last one step in -length; the primary example is $\epsilon$-greedy exploration because its -behavior policy switches between explore and exploit mode at the level of -one time step . At the other extreme, -*experiment-level* periods encompass the entire training duration, possibly -to be used in offline RL (ORL) algorithms . A finer granularity is *episode-level*, in which a single behavior policy is chosen for one entire episode at a time, such as when diversifying the stochasticity of a policy throughout training . The switching policies analyzed in this blog post produce *intra-episodic* periods at a granularity between step-level periods and episode-level periods. Intra-episodic periods generally occur at least a few times during an episode and last for more than a few time steps. The practice and study of interpolating between extremes has occurred in areas such as $n$-step returns and colored noise with notable success, making the study of intra-episodic mode periods even more enticing. - -The question investigated by the initial mode-switching study is *when to -switch*. This blog post and the initial study only perform experiments -with two possible modes, exploration and exploitation, so the question of -*when to switch* reduces to the question of *when to explore*. Other -questions regarding exploration include *how much to explore* that analyzes -the proportion of exploration actions taken over the entire course of -training. This problem encompasses the annealing of exploration -hyperparameters including $\epsilon$ from $\epsilon$-greedy policies and the entropy bonus $\beta$ from softmax -policies . Another related -question is *how to explore* that includes strategies such as randomly , optimistically , and intrinsically . These two questions are separate from the question of *when* to explore, as they usually consider a smooth change in the behavior policy after each time step; switching policies incorporate a much more rigid change in the behavior policy, meriting a separate analysis. - -### Mode-Switching Basics - -The preceding subsection narrowed our focus to determining *when to explore* -using *intra-episodic* mode periods. At the time of publication of the -initial mode-switching study, the previous literature contained a -few works that had incorporated basic aspects of intra-episodic -mode-switching exploration. For example, Go-Explore is a resetting algorithm that explores randomly after resetting to previously-encountered -promising states at the beginning of an episode. However, this algorithm -implements only one switch from resetting to exploration over the course of -an episode. Temporally-extended $\epsilon$-greedy exploration generalizes $\epsilon$-greedy -exploration by sampling from a distribution the number of time steps that an -exploration action should repeat. This method of switching is -intra-episodic, but it only allows repetition of an action during explore -mode. The initial mode-switching study extends the above and other work in -many dimensions and may soon be viewed as the seminal work on -mode-switching behavior policies; we discuss the most fundamental facets of -mode-switching architectures below. - -The **starting mode** is the mode of the algorithm on the first time step, -usually exploit mode. The **set of behavior modes** (e.g. 
explore and -exploit) must contain at least two modes, and the set of behaviors induced -by all modes should be fairly diverse. The switching **trigger** is the -mechanism that prompts the agent to switch modes and is perhaps the most -interesting consideration of switching policies. An *informed* trigger -incorporates aspects of the state, action, and reward signals; it is actuated after crossing a prespecified threshold such as the -difference between the expected and realized reward. A *blind* trigger acts -independently of these signals; for example, it can be actuated after a -certain number of time steps has elapsed or actuated randomly at each time -step with a prespecified probability. A **bandit meta-controller** may be employed to choose the switching -hyperparameters (e.g. termination probability, mode length, informed threshold) at the beginning of each episode to maximize episodic return and prevent additional hyperparameter tuning. Finally, **homeostasis** can be added when using trigger thresholds (e.g. for informed triggers), which adapts the switching threshold to a target rate across the course of training, again for ease of hyperparameter tuning. Note that these dimensions are so richly diverse that we end the associated discussion to maintain any notion of brevity, and we summarize these facets of mode-switching in Table 1. - -| ------------- |-------------| -| Mode-Switching Facet | Description | -| ------------- |-------------| -| Starting Mode | Mode during first time step at episode start | -| Behavior Mode Set | Set of modes with diverse set of associated behavior policies | -| Trigger | Informs agent when to switch modes | -| Bandit Meta-Controller | Adapts switching hyperparameters to maximize episodic return | -| Homeostasis | Adapts switching threshold to achieve a target rate | -| ------------- |-------------| - - -
- Table 1: Various facets of mode-switching policies . -
- -### Blog Post Motivation - -The initial mode-switching study performed experiments solely on 7 -hard-exploration Atari games. The focus of the study was to show the -increase in score on these games when using switching -policies versus monolithic policies. One area of future work pointed out by -the reviewers is to increase the understanding of these less-studied -policies. For example, the [meta review](https://openreview.net/forum? -id=dEwfxt14bca¬eId=C0cPgElgV7P) of the paper stated that an illustrative -task may help provide intuition of the method. The [first reviewer](https://openreview.net/forum?id=dEwfxt14bca¬eId=Fjc2fBjmhwZ) noted how -the paper could be greatly improved through demonstrating specific benefits -of the method on certain tasks. The [second reviewer](https://openreview.net/forum?id=dEwfxt14bca¬eId=e3xcQZnyuyt) stated how discussing observed differences on the different domains may be useful. The [third reviewer](https://openreview.net/forum?id=dEwfxt14bca¬eId=Qcv_GiwGPhr) mentioned how the paper could be strengthened by developing guidelines for practical use. The [last reviewer](https://openreview.net/forum?id=dEwfxt14bca¬eId=W6v6g6zFQHi) stated that it would be helpful to more thoroughly compare switching policies to monolithic policies for the sake of highlighting their superiority. - -We extend the initial mode-switching study and progress towards -further understanding of these methods in this blog post through additional -experiments. The following experiments each discuss an observed behavioral -difference in switching policies versus monolithic policies. We focus -on behavioral differences in this work, as they are observable in the -environment and are not unique to the architecture of certain agents . Our experiments are performed -on 10 commonly-used Atari games , and we also provide another -illustrative task or chart for each experiment to further enhance -understanding. One highlight of this work is showcasing how switching -policies not only influence exploration but also significantly influence -exploitation. Our work serves as a first step in empirically delineating -the differences between switching policies and monolithic policies for the use of practitioners and researchers alike. - -## 2. Experiments - -This section begins with a discussion on the experimental setup before -delving into five experiments that highlight observational differences in -switching and monolithic behavior policies. The complete details of the -agent and environments can be found in the accompanying [GitHub repository](https://github.com/LorenJAnderson/when-to-explore). -- The experimental testbed is comprised of 10 commonly-used Atari games: Asterix, Breakout, - Space Invaders, Seaquest, Q*Bert, Beam Rider, Enduro, MsPacman, Bowling, - and River Raid. Environments follow the standard Atari protocols of incorporating sticky actions and only providing a terminal signal when all lives are lost. -- A Stable-Baselines3 DQN policy - is trained on each game for 25 epochs of 100K time steps each, totaling 2.5M time steps or 10M frames due to frame skipping. The DQN policy - takes an exploration action on 10% of time steps after being linearly - annealed from 100% across the first 250K time steps. -- A switching policy and monolithic policy were evaluated on the testbed - using the greedy actions of the trained DQN policy when taking - exploitation actions. Evaluations were made for 100 episodes for each - game and epoch. 
The monolithic policy was $\epsilon$-greedy with a 10% - exploration rate. The switching policy we chose to examine - incorporates blind switching; we leave an analogous investigation of - informed switching policies to future work (see initial study for - background and experiments using informed switching policies). - The policy begins in - exploit mode and randomly switches to uniform random explore mode 0.7% of - the time. It randomly chooses an explore mode length from the set $\\{5, - 10, 15, 20, 25\\}$ with probabilities $\\{0.05, 0.20, 0.50, 0.20, 0.05\\} - $. During experimentation, we determined that this switching - policy took exploration actions at an almost identical rate as the - monolithic policy (10%). - -We briefly cite difficulties and possible confounding factors in our -experimental design to aid other researchers during future studies on this -topic. -- The DQN policy was trained using a monolithic policy, and unsurprisingly, -monolithic policies had slightly higher evaluation scores. Additional - studies may use exploitation actions from a policy trained with switching - behavior for comparison. -- Many of our experiments aim to evaluate the effect of exploration - or exploitation actions on some aspect of agent behavior. Due to delayed - gratification in RL, the credit assignment problem persists and confounds the - association of actions to behaviors. To attempt to mitigate some - confounding factors of this problem, we weight the behavior score of the - agent at an arbitrary time step by the proportion of exploration or - exploitation actions in a small window of past time steps; for example, - in the first experiment, we weight the effect of taking exploration - actions on yielding terminal states by calculating the proportion of exploration - actions within 10 time steps of reaching the terminal state. Then, we - average the proportions across 100 evaluation episodes to compute a final score for a single epoch for a single game. -- Lastly, we only claim to have made observations about the behavioral differences, and we do not claim to have produced statistically significant results; we leave this analysis to future work. - -### Concentrated Terminal States - -Exploration actions are generally considered to be suboptimal and are -incorporated to learn about the state space rather than accrue the most -return. Many environments contain regions of the state space that simply do -not need more exploration, such as critical states that require directed behavior for -meaningful progress. For instance, a self-driving car needing to merge onto -a highway is in a critical state, as it has few behaviors that will keep it -driving correctly. In these critical states, suboptimal action choices may -cause the agent to reach a terminal state more quickly than desired. We -investigate if terminal states are more concentrated after an exploration -period of a switching policy due to the many exploration actions taken in -succession. - -Our first experiment attempts to analyze the relationship between taking -many exploration actions in succession and reaching a terminal state. Each -terminal state is given a score equal -to the proportion of exploration actions during the past 10 time steps (see -second paragraph of Experiments section for rationale). Final scores for -each behavior policy and epoch are computed by averaging the scores of each terminal state across all 100 evaluation episodes and each game. The results are shown in -Figure 3. 
Switching policies produced terminal states that more closely followed exploration actions. Furthermore, the effect became more pronounced as the policies improved, most likely due to the growing quality gap between exploitation and exploration actions, which is more detrimental to switching policies because they explore multiple times in succession. Note how the scores for monolithic policies are near 0.10 on average, which is the expected proportion of exploration actions per episode and therefore suggests that exploration actions had little effect on reaching terminal states. These results demonstrate that switching policies may be able to concentrate terminal states in specific areas of an agent's trajectory. - 
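For concreteness, here is a minimal sketch (ours, not the original code) of the blind switching policy described in the experimental setup above; `dqn_action` stands in for the greedy action of the trained DQN and is an assumed helper:

```python
import numpy as np

rng = np.random.default_rng(0)
LENGTHS = [5, 10, 15, 20, 25]
PROBS = [0.05, 0.20, 0.50, 0.20, 0.05]

def switching_policy_step(state, explore_left, n_actions, dqn_action):
    """One time step; episodes start with explore_left = 0 (exploit mode)."""
    if explore_left == 0 and rng.random() < 0.007:
        # Blind trigger: switch to explore mode 0.7% of the time.
        explore_left = int(rng.choice(LENGTHS, p=PROBS))
    if explore_left > 0:
        # Explore mode: uniform random action until the period ends.
        return int(rng.integers(n_actions)), explore_left - 1
    return dqn_action(state), 0  # exploit mode: greedy DQN action
```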
-
- {% include figure.html path="assets/img/2024-05-07-mode-switching/exp_1_1.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-mode-switching/exp_1_2.png" class="img-fluid rounded z-depth-1" %} -
-
-
- Figure 3 (Left): Terminal states are more concentrated after switching -exploration periods. Figure 4 (Right): Switching policies perform better on -cliffwalk environments. -
- -We showcase a quick illustrative example of the ability of switching -policies to concentrate terminal states more uniformly in a cliffwalk -environment (Figure 4). The agent starts at the black circle in the middle -column and top row of a 101$\times$11 grid and attempts to reach the white -'x' at the bottom. All states aside from those in the middle column are -terminal, and the heatmaps show the visitation frequency per episode of all -non-terminal states across 10K episodes. When the exploitation policy is to -move only downward -and the behavior policies are the usual policies in these experiments, the -agent incorporating a switching policy more heavily -concentrates the terminal states in exploration mode and visits states -further down the cliffwalk environment at a higher rate per episode. - - -Environments that incorporate checkpoint states that agents must traverse -to make substantial progress may benefit from switching policies that -concentrate exploration periods away from the checkpoints. For example, the -game of Montezuma's revenge sometimes -requires that the agent retrieves a key before advancing through a door, -and the agent may achieve faster learning by concentrating exploration -actions away from states near the key after that action is learned. One -notable and emerging area of RL research that may benefit from -concentrating terminal states is safe RL . In safe RL, certain safety constraints are -required during the learning and deployment process. In some situations, -the safety constraints are closely aligned with terminal states (e.g. aerospace ), and concentrating exploration actions away from terminal states may aid in achieving those safety constraints. - -### Early Exploration - -Monolithic policies uniformly take exploration actions throughout an episode, -and as a result, the exploration steps are less concentrated than those of -switching policies. While the expected number of exploration steps may be -the same per episode in monolithic policies, certain situations may require -more concentrated exploration during the beginning of episodes. For example, -the build orders in StarCraft II significantly influence the possible -future strategies, making exploration crucial throughout the beginning time -steps. Early suboptimal actions have also been manually implemented to -achieve certain effects: passive actions are taken in Atari games to -prevent memorization of trajectories , and 30 random actions were taken at -the beginning of Go games when training the AlphaGo engine to force agents -to encounter more diverse data . We investigate the flexibility of switching policies to concentrate exploration actions in the beginning of episodes. - -We perform an experiment to determine how quickly a policy takes a -prespecified number of exploration actions. Specifically, we compute the -average number of time steps it takes for a policy to take at least $x$ -total exploration actions across its top 10 of 100 fastest episodes, and we repeat this process for $x \in \\{1, 2, 3, \ldots, -20\\}$. We compare the top 10 fastest episodes because we are only -interested in gauging the flexibility of switching behavior of being able -to achieve this specific facet of exploration (beginning exploration) -during a small percentage of episodes and not for each episode. Note that -this experiment did not need to utilize the Atari signals, so we only used -data from the last epoch. Results were again averaged over each game and -shown in Figure 5. 
It is clear that, with switching policies, some episodes concentrate many more exploration actions in the first few time steps. This makes sense intuitively, as only one switch needs to occur early in an episode for a switching policy to take many exploration actions immediately afterwards. The difference grows roughly linearly with the number of required exploration actions and shows that switching natively produces more episodes with exploration concentrated in the beginning. - 
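Under our reading of the setup, the statistic behind Figure 5 can be sketched as follows, where `explore_flags` holds one boolean array per evaluation episode marking exploration actions (names are ours):

```python
import numpy as np

def steps_to_x_explorations(explore_flags, x, top_k=10):
    # Mean number of time steps needed to take at least x exploration
    # actions, averaged over the top_k fastest of the given episodes.
    times = []
    for flags in explore_flags:
        cum = np.cumsum(flags)
        if cum[-1] >= x:                       # episode reached x explorations
            times.append(int(np.argmax(cum >= x)) + 1)
    return float(np.mean(sorted(times)[:top_k]))

# curve = [steps_to_x_explorations(flags, x) for x in range(1, 21)]
```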
-
- {% include figure.html path="assets/img/2024-05-07-mode-switching/exp_2_1.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-mode-switching/exp_2_2.png" class="img-fluid rounded z-depth-1" %} -
-
-
- Figure 5 (Left): Switching policies can explore more frequently earlier -during the episode. Figure 6 (Right): Switching policies have better -exploration near the start state on downwalk environments. -
- -We illustrate beginning exploration with a downwalk environment in which an -agent attempts to first move to the middle column and then down the middle -column to the white 'x' (Figure 6). The agent starts in the -second row in the middle column at the white circle, and visitation -frequencies across 1K episodes are shown for all states aside from those -between the white circle and the white 'x', inclusive. We chose to -analyze this environment because it is a crude approximation of the trajectory of agents that have learned a single policy and immediately move away from the initial start state at the beginning of an episode. The switching and monolithic policies are the same as before, and switching produces much higher visitation counts at states further from the obvious exploitation trajectory. - -Environments that may benefit from flexible early exploration are sparse -reward environments that provide a single nonzero reward at the terminal -state. Many game environments fall into this category, since a terminal -reward of 1 can be provided for a win, -1 for a loss, and 0 for a draw. In -such environments, agents usually need to learn at states near the sparse -reward region before learning at states further away, also known as -cascading . After learning near -the sparse reward region, the agent may need to reconsider earlier actions, -and switching policies natively allow for this type of exploration. Future -work may consider the extent to which switching aids in improving policies -near the start state in sparse reward environments. - -### Concentrated Return - -In contrast to the investigation in the first experiment, exploitation -actions of a trained agent are presumed to be better than all other -alternatives. Since agents aim to maximize the expected return in an -environment, exploitation actions often accrue relatively large amounts of -expected return. For example, the initial experiments of DQN and double DQN (DDQN) decreased the exploration constant (thereby -increasing exploitation) during testing runs to achieve higher scores and -ultimately demonstrate superhuman performance on Atari. In this subsection, we investigate the effect of the concentrated exploitation actions of switching policies on expected return. - -We perform an experiment to determine the proportion of return that is -concentrated during exploitation periods. Each reward during an episode is -weighted by the proportion of exploitation actions during the past 10 time -steps. The score for each episode is the sum of weighted rewards divided by -the total rewards. Scores for each behavior policy and epoch are computed -by averaging scores across all games. The results are shown in Figure 7. -Quite quickly, exploitation steps of switching policies contain a greater -percentage of the return than those of monolithic policies. This trend seems -fairly constant after roughly 2M frames, with switching policies having -roughly 95% of the return in exploitation steps and monolithic policies -having roughly 90% of the return; from another point of view, exploration -steps yield 5% of the return for switching policies and 10% of the return for -monolithic policies. These results agree with Experiment 1, as switching -policies will generally reach terminal states more frequently in explore -mode and will not receive more rewards. Since most of the rewards in our -selected Atari games are positive, switching policies should accrue -lower return while in explore mode. - -
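A sketch of the per-episode score just described, under our reading of it (the exact window convention is an assumption of ours):

```python
import numpy as np

def exploit_weighted_return_share(rewards, is_exploit, window=10):
    # Weight each reward by the fraction of exploitation actions over the
    # past `window` time steps, then divide by the total episode return.
    r = np.asarray(rewards, dtype=float)
    e = np.asarray(is_exploit, dtype=float)
    w = np.array([e[max(0, t - window + 1):t + 1].mean() for t in range(len(r))])
    return float((w * r).sum() / r.sum())
```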
-
- {% include figure.html path="assets/img/2024-05-07-mode-switching/exp_3_1.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-mode-switching/exp_3_2.png" class="img-fluid rounded z-depth-1" %} -
-
-
- Figure 7 (Left): Switching policies concentrate return in exploitation -mode. Figure 8 (Right): Switching policies concentrate return in the -beginning of episodes. -
- -One notable case in which exploitation steps are concentrated together is in -resetting methods such as Go-Explore -that reset to promising states at the beginning of the episode and explore -from there. Promising states are usually defined as states that are -frequently traversed in trajectories that accrue high return. More -generally, resetting methods aim to prevent *derailment*, whereby an agent -is unable to return or is *derailed* from returning to promising states -through its exploratory mechanisms. Since our switching agent begins in -exploit mode which aims to accrue the most return, we investigate to see if -switching policies possess characteristics that are inherent to resetting -methods. - -In Figure 8, we plot the proportion of episode return over the past 5% of -the episode versus the current proportion of episode that is complete. Data -is taken from the last training epoch. The results show that switching -policies concentrate return more towards the beginning of each episode, -most likely because its first exploit mode of switching policies is -relatively long. Future work involves determining the extent to which the -beginning exploitation mode of switching policies serves as a flexible -alternative to resetting, which would have applications in situations -that do not allow for manual resets such as model-free RL. - -### Post-Exploration Entropy - -Monolithic policies such as $\epsilon$-greedy are nearly on-policy when any -exploration constants have been annealed. In contrast, the exploration -periods of switching policies are meant to free the agent from its current -exploitation policy and allow the agent to experience significantly -different trajectories than usual. Due to the lack of meaningful learning at -states that are further from usual on-policy trajectories, the exploitation actions at those states are more likely to have greater diversity. In this experiment, we investigate the diversity of the action distribution after exploration periods. - -We quantify the diversity of the realized action distribution in the time -step immediately after each exploration period. The diversity is quantified -by entropy that has higher values for more random data and vice versa. An -action distribution is constructed for each game and epoch, and -the entropies across games are averaged. The results are shown in Figure 9. -The entropy of the action distribution for switching policies is distinctly -greater than that of monolithic policies. Like most of the previous results, this quantity only plateaus until roughly 2M frames have elapsed. - -
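The entropy score can be sketched as below, where `actions` collects the realized action taken on the first step after each exploration period (our notation, not the author's code):

```python
import numpy as np

def post_exploration_entropy(actions, n_actions):
    # Shannon entropy (in nats) of the empirical action distribution;
    # higher values indicate more diverse post-exploration behavior.
    counts = np.bincount(np.asarray(actions), minlength=n_actions)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())
```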
-
- {% include figure.html path="assets/img/2024-05-07-mode-switching/exp_4_1.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-mode-switching/exp_4_2.png" class="img-fluid rounded z-depth-1" %} -
-
-
- Figure 9 (Left): Switching policies produce action distributions -with higher entropy after exploration periods. Figure 10 (Right): Agent has -random exploitation actions in states that are visited less frequently. -
- -To illustrate this idea, we create a gridworld environment that provides -the agent a reward of -1 for each time step that the agent is still on the -grid; the agent's goal is to leave the grid as quickly as possible. The -agent begins in the center of the grid and learns through discrete -Q-learning. Distinct actions have separate colors in Figure 10, with arrows -showing the exploit action. The agent learns that it is fastest to exit the -grid by going left or right. Notably, the actions near the top and bottom -of the grid are seemingly random, as the agent has not seen and learned from those states as frequently as the others. Switching -policies are more likely to reach the top and bottom areas of the gridworld -state space and consequently would be more likely to have a higher entropy -of the action distribution after exploration. - -The difference in the entropy of the action distributions suggests that -more diverse areas of the state space may be encountered after exploration -modes with switching policies. This phenomenon is closely tied to the -notion of *detachment* , whereby -agents forget how to return or are *detached* from areas of high reward, -perhaps by focusing too unimodally on one region of the state space. The concentrated behavior of switching policies may provide enough consecutive exploration actions to explore a more diverse set of trajectories. Future work could investigate the ability of switching policies to curb detachment on environments with multiple regions of the state space with high reward. - -### Top Exploitation Proportions - -Our final investigation involves the change in exploitation proportion -under switching policies. Since the probability of switching to explore -mode is very low, there may be some episodes where the switch seldom happens -if at all. This creates a distribution of exploitation action proportions -per episode that is more extreme than that of monolithic policies, yet it -is still not as extreme as using a single mode throughout the entire -episode. Investigations of methods having similar interpolative -characteristics have been conducted recently; for example, an action noise -called pink noise was recently -introduced that achieved better performance than white and red noise. Pink -noise is more temporally-correlated than white noise but not as much as red noise. Here, we investigate the return of the most extreme episodes in exploitation proportion. - -We perform an experiment to compare the return of the episodes with -highest exploitation proportions between switching and monolithic policies. -The returns of the top 10 of 100 episodes ranked by exploitation proportion -of each epoch and game were averaged. Then, a ratio between the averages of -switching and monolithic policies was computed and averaged across games. The -results are plotted in Figure 11. -There does not appear to be a clear trend aside from the ratio hovering mostly above 1.00, indicating that the top exploitation episodes of switching policies accrue more return than those of monolithic policies. - -
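A sketch of the ratio plotted in Figure 11, assuming per-episode returns and exploitation proportions are available (names are ours):

```python
import numpy as np

def top_exploit_return(returns, exploit_props, top_k=10):
    # Mean return of the top_k episodes ranked by exploitation proportion.
    order = np.argsort(exploit_props)[::-1][:top_k]
    return float(np.mean(np.asarray(returns)[order]))

# ratio = top_exploit_return(G_switch, p_switch) / top_exploit_return(G_mono, p_mono)
```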
-
- {% include figure.html path="assets/img/2024-05-07-mode-switching/exp_5_1.png" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-mode-switching/exp_5_2.png" class="img-fluid rounded z-depth-1" %} -
-
-
- Figure 11 (Left): Switching policies have higher return for episodes -with largest exploit proportion. Figure 12 (Right): Switching policies have -more extreme exploration and exploitation proportions per episode. -
- -The results are best illustrated through plotting the switching and -monolithic exploitation proportions for 1K episodes (10 games of the last -epoch) as shown in Figure 12. The top 100 episodes with highest -exploitation proportion take more exploitation actions than any monolithic -episode. Therefore, the corresponding distribution is indeed more -extreme. - -While the previous discussion has illustrated that some switching episodes -exploit more and generate more return, they don't specifically explain why -training with mode-switching is superior; in particular, the slightly -greater return is not necessary for learning an optimal policy as long as a -similar state distribution is reached during training. One -possibility is the fact that mode-switching policies train on a more -diverse set of behavior and must generalize to that diversity. -Reinforcement learning algorithms are notorious at overfitting , and future work -may investigate the extent to which generalization is improved upon using switching policies. - - -## 3. Conclusion - -This blog post highlighted five observational differences between -mode-switching and monolithic behavior policies on Atari and other -illustrative tasks. The analysis showcased the flexibility of mode-switching policies, such as the ability to explore earlier in episodes and exploit at a notably higher rate. As the original study of mode-switching behavior by DeepMind was primarily concerned with performance, the experiments in this blog post supplement the study by providing a better understanding of the strengths and weaknesses of mode-switching exploration. Due to the vast challenges in RL, we envision that mode-switching policies will need to be tailored to specific environments to achieve the greatest performance gains over monolithic policies. Pending a wealth of future studies, we believe that mode-switching has the potential to become the default behavioral policy to be used by researchers and practitioners alike. - -### Acknowledgements - -We thank Nathan Bittner for a few helpful discussions on the topic of -mode-switching exploration. We also thank Theresa Schlangen (Theresa -Anderson at the time of publication) for helping polish some of the -figures. diff --git a/_posts/2024-05-07-primacy-bias-and-why-it-helps-to-forget.md b/_posts/2024-05-07-primacy-bias-and-why-it-helps-to-forget.md deleted file mode 100644 index 250251f8..00000000 --- a/_posts/2024-05-07-primacy-bias-and-why-it-helps-to-forget.md +++ /dev/null @@ -1,425 +0,0 @@ ---- -layout: distill -title: "It's Time to Move On: Primacy Bias and Why It Helps to Forget" -description: "'The Primacy Bias in Deep Reinforcement Learning' demonstrates how the first experiences of a deep learning model can cause catastrophic memorization and how this can be prevented. In this post we describe primacy bias, summarize the authors' key findings, and present a simple environment to experiment with primacy bias." -date: 2024-05-07 -future: true -htmlwidgets: true - -# Anonymize when submitting -# authors: -# - name: Anonymous - -authors: - - name: Matthew Kielo - url: https://mkiel.org/ - affiliations: - name: Georgia Institute of Technology - - name: Vladimir Lukin - url: https://github.com/divannyteoretik - affiliations: - name: Georgia Institute of Technology - -# must be the exact same name as your blogpost -bibliography: 2024-05-07-primacy-bias-and-why-it-helps-to-forget.bib - -# Add a table of contents to your post. 
-# - make sure that TOC names match the actual section names
-#   for hyperlinks within the post to work correctly.
-# - please use this format rather than manually creating a markdown table of contents.
-toc:
-  - name: Introduction to Primacy Bias
-  - name: Off Policy Deep Reinforcement Learning
-    subsections:
-    - name: Are we Overcomplicating?
-  - name: Selecting a Replay Ratio
-    subsections:
-    - name: Heavy Priming
-  - name: Weight Resets
-    subsections:
-    - name: Do Resets Work?
-    - name: "What’s The Catch?"
-  - name: Implementing Primacy Bias
-    subsections:
-    - name: 2x2 Switching Frozen Lake
-    - name: Results
-  - name: Conclusions
-
-# Below is an example of injecting additional post-specific styles.
-# This is used in the 'Layouts' section of this post.
-# If you use this post as a template, delete this _styles block.
-
-_styles: >
-  .fake-img {
-    background: #bbb;
-    border: 1px solid rgba(0, 0, 0, 0.1);
-    box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
-    margin-bottom: 12px;
-  }
-  .fake-img p {
-    font-family: monospace;
-    color: white;
-    text-align: left;
-    margin: 12px 0;
-    text-align: center;
-    font-size: 16px;
-  }
----
-
-## Introduction to Primacy Bias
-
-Primacy bias occurs when a model's training is damaged by overfitting to its first experiences. This can be caused by poor hyperparameter selection, the underlying dynamics of the system being studied, or simply bad luck.
-
-In this post we explore the paper “The Primacy Bias in Deep Reinforcement Learning” by Nikishin et al., presented at ICML 2022. We will present primacy bias and how it applies to deep reinforcement learning, discuss how the authors prevent primacy bias, and finish by experimenting with our own toy example of primacy bias.
-
-Like many deep learning concepts, primacy bias takes inspiration from psychology. For example, you might have a friend who “doesn’t like math” because they had a bad experience in primary school. Now, they avoid the subject despite having an aptitude for it. It turns out that for humans and machines, first impressions matter more than they should. This is primacy bias.
-
-## Off Policy Deep Reinforcement Learning
-
-Nikishin et al. discuss a specific type of model that is particularly sensitive to primacy bias: *off-policy deep reinforcement learning*. Here, the goal is to learn a *policy* that makes good decisions in an interactive environment. Off-policy algorithms achieve this by separating decision-making from learning. Deep Q-Learning (DQN), one of the first popular off-policy algorithms, separates the learning process into two steps:
-
-1. Data Collection: use the current policy to interact with the environment and save memories to a dataset called the *replay buffer*.
-2. Learning: sample from the replay buffer to perform gradient updates on the policy.
-
-### Are we Overcomplicating?
-For those without a reinforcement learning background, this might seem needlessly complicated. Why can’t we simply explore with a random policy and then fit a model all at once?
-
-Although this is sometimes done, the quality of the memories in the replay buffer is proportional to the quality of the policy that gathered the experience. Consider an agent learning to play chess. A random policy might have enough data to learn how to play the start of the game effectively, but it will never learn how to chase an opponent’s king around an empty board.
If a policy isn’t smart enough to get the agent out of the ‘early’ game, it will never collect the experiences needed to learn the ‘mid’ or ‘late’ games.
-
-
-## Selecting a Replay Ratio
-
-The *replay ratio* is the total number of gradient updates per environment interaction. If the number of experiences is fixed, then modifying the replay ratio is equivalent to changing the number of training epochs in a typical deep learning problem.
-
-Most researchers know the importance of training for a sufficient number of epochs. Training for more epochs is generally preferred, and methods such as early stopping, weight regularization, and dropout layers can mitigate the risk of overfitting. At worst, if you end up with an overfit model then you can retrain it from scratch.
-
-In deep reinforcement learning, the replay ratio is typically set to one. Unfortunately, finding the correct replay ratio is difficult. We want the agent to learn as much as possible, but there is a path-dependency that is hard to ignore. If the policy becomes overfit early, it will have fewer meaningful interactions with the environment, creating a negative feedback loop. If you don’t catch the overfitting in your Poker Bot until it loses a couple of tournaments, then you might have spent a lot of money for a dataset on how to lose poker hands.
-
-### Heavy Priming
-
-To quantify this, Nikishin et al. perform an experiment with heavy priming. The goal is to train an agent on the "quadruped-run" environment, where an agent learns to manipulate joint movement to travel forward.
-
-First, a baseline is trained with default parameters. Next, to create heavy priming, the agent collects 100 interactions and then trains for 100K steps. The model with heavy priming fails to ever recover, an example of catastrophic memorization.
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/heavy-priming.jpeg" class="img-fluid rounded z-depth-1" %} -
-
-
- Example of Heavy Priming by Nikishin et al.
-
-
-
-## Weight Resets
-
-To avoid primacy bias, Nikishin et al. propose the following solution: freely increase the replay ratio, but periodically perform a *weight reset* to reinitialize all of the agent’s weights while preserving the replay buffer. This destroys any learned information in the network's weights. At worst, if there is no primacy bias, the replay buffer will contain enough information to retrain the model to the previous weights. At best, primacy bias is eliminated, and the model finds a new optimum.
-
-To think about this concretely, consider a 100-step training loop. At each step we:
-
-1. Gather 1 observation.
-2. Add it to the replay buffer.
-3. Select a random sample from the replay buffer.
-4. Perform a gradient update to the model with the sample.
-
-After 100 steps, the first observation will have been sampled on average 5.19 times. The 50th observation will have been sampled 0.71 times, and the 100th (final) observation will have been sampled on average only 0.01 times. This can be summarized in a plot.
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/samples11.jpeg" class="img-fluid rounded z-depth-1" %} -
-
- How often an example is sampled on average in a 100-step training loop.
-
-
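These expected counts follow directly from the loop above: observation $i$ can only be drawn at steps $k \geq i$, each time with probability $1/k$, so its expected count is a partial harmonic sum. A few lines of Python reproduce the numbers under the stated assumptions (one new experience added per step, one uniform sample drawn per step):

{% highlight python %}
# Expected number of times observation i is sampled in an n-step loop:
# E[count_i] = sum_{k=i}^{n} 1/k, since at step k the buffer holds k
# observations and one is drawn uniformly at random.
n = 100
for i in [1, 50, 100]:
    expected = sum(1.0 / k for k in range(i, n + 1))
    print(f"observation {i}: sampled {expected:.2f} times on average")
# observation 1: sampled 5.19 times on average
# observation 50: sampled 0.71 times on average
# observation 100: sampled 0.01 times on average
{% endhighlight %}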
-
-
-Some solutions to mitigate this include recency weighting or prioritized experience replay; however, weight resets offer a theoretically parameter-free way to fix this. If weights are trained from scratch at every step, then all prior observations will have equal influence.
-
-In practice, weight resets are a bit more complicated. Ideally, we would retrain the model from scratch after each observation. Unfortunately, this isn’t realistic (on my computer). This leaves us with two decisions:
-
-1. Select a reset frequency.
-2. Decide what to reset.
-
-Resetting often will prevent primacy bias, but this requires a high replay ratio. This trade-off is discussed in detail in the follow-up work "Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier", published at ICLR in 2023. In particular, a heatmap is shared showing the trade-off between data and computation budget on a dynamic motion control problem:
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/compute-data-tradeoff.jpeg" class="img-fluid rounded z-depth-1" %} -
-
-
- "Performance of SR-SAC in DMC15 as a function of the number of interactions and of the number of agent updates, determined by the replay ratio." -
-
-
-### Do Resets Work?
-
-Nikishin et al. show that, on average, resets work well.
-
-1. Immediately after a reset, there is a sudden drop in performance that quickly recovers.
-2. Resets never irreparably harm a model. At worst, the model returns to the pre-reset level (e.g., cheetah-run), but sometimes it can perform substantially better (humanoid-run).
-
-These results are consistent across multiple algorithms and environments, including the continuous control DeepMind Control Suite and the discrete Atari 100k benchmarks.
-Episode return over time on a subset of DeepMind Control, with and without resets, using the SAC algorithm. Averaged over 10 random seeds.
-
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-sample.jpeg" class="img-fluid rounded z-depth-1" %} -
-
-
- Figure 4.
-
-
- -
-Episode return over time in DeepMind Control, with and without resets, using the DrQ algorithm. Averaged over 20 random seeds.
-
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-full.jpeg" class="img-fluid rounded z-depth-1" %} -
-
-
- Figure 18, from Appendix C.
-
-
- - -
-Per-game scores in Atari, with and without resets, using the SPR algorithm. Averaged over 20-100 random seeds.
-
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/atari.jpeg" class="img-fluid rounded z-depth-1" %} -
-
-
- Table 7, from Appendix C.
-
-
- - -After seeing the success of resets, it is reasonable to wonder how weight resets compare to other regularization tools. The authors test this as well and show that resets improve outcomes in their experiments on average more than either dropout or L2 regularization (which actually perform worse than the baseline). - -
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dropoutsetc.jpeg" class="img-fluid rounded z-depth-1" %} -
-
-
- Comparison of Base Algorithm, Resets (+ resets), Dropout (+ dropout), and L2 (+ L2). Averaged over 10 runs. -
-
-
-### What's The Catch?
-
-While these results are impressive, they come at a cost. At minimum, increasing the replay ratio increases the compute time linearly. D'Oro et al. (2023) note that running the full dynamic control benchmark with a replay ratio of 32 takes 4 GPU days with an NVIDIA V100. Using a replay ratio of 16 on Atari 100K requires 5 GPU hours per run.
-
-Additionally, implementing weight resets requires a sneaky number of design decisions. The results from the paper show reset rules specifically chosen for each environment and algorithm.
-
-Some of these considerations include:
-
-1. How often should you reset? Every step is ‘ideal’, but it is also ideal to get results this year.
-2. What is the optimal replay ratio to learn maximally per sample and sustain the reset frequency?
-3. What exactly should I reset? The full model? The last layer?
-
-These are open questions. For weight resets to become widely used, new heuristics and best practices will need to develop. The answers may depend on both the network architecture and the underlying system dynamics. Trying to imagine the precise behaviors induced by primacy bias on Atari and DeepMind Control can be difficult.
-
-
-## Implementing Primacy Bias
-
-The best way to learn something is through practice. In this section we present a minimal example of primacy bias. The associated code is [released as a notebook](https://github.com/mkielo3/iclr-blog2024-primacy-bias) along with additional experiments.
-
-The biggest obstacle to studying primacy bias is the compute required. Training time scales linearly with the replay ratio, and a high replay ratio is necessary to extract maximal information per sample and to recover after each reset. To work around this, we present an MVP: Minimum Viable Primacy (bias).
-
-We use a modified version of the Frozen Lake environment provided by Farama Gymnasium with a DQN model (one of the first models to popularize a replay buffer) based on the CleanRL implementation.
-
-### 2x2 Switching Frozen Lake
-
-Frozen Lake is a simple pathfinding problem. The model receives a reward if it successfully traverses a grid to reach a goal. The model can fail in two ways: 1) it falls in a hole, or 2) it takes too long to reach the goal. The model observes its location on the grid, and each action is a move one tile up, down, left, or right.
-
-To simplify the problem, we restrict the map size to 2x2 and keep the environment deterministic. The agent always starts in the top-left corner and is rewarded if it reaches the bottom-right corner. A hole is placed in one of the two remaining spaces. The agent fails if it takes more than 2 steps or falls in a hole. Each map has exactly one solution.
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/fl.jpeg" class="img-fluid rounded z-depth-1" %} -
-
- MVP: Switching 2x2 Frozen Lake Environment, with solution in red. -
-
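A minimal sketch of how such a switching environment can be constructed with Farama Gymnasium is shown below. The exact map layouts and the helper name are our own reconstruction for illustration, not the notebook's code:

{% highlight python %}
import gymnasium as gym

# Our reconstruction of the two 2x2 maps: S = start, F = frozen,
# H = hole, G = goal. Map 1 is solved by (down, right), Map 2 by
# (right, down), so the optimal first action flips after the switch.
MAP_1 = ["SH", "FG"]
MAP_2 = ["SF", "HG"]

def make_crossing(crossing: int) -> gym.Env:
    # Show Map 1 for the first 200 crossings and Map 2 afterwards.
    desc = MAP_1 if crossing < 200 else MAP_2
    return gym.make("FrozenLake-v1", desc=desc, is_slippery=False,
                    max_episode_steps=2)
{% endhighlight %}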
-
-
-The agent attempts to cross the lake 1,000 times. To force primacy bias, we show the agent Map 1 for the first 200 crossings, and Map 2 for the last 800. The maps are deliberately chosen to have opposite solutions. After 400 crossings, the agent will have experienced each map equally, and afterwards the agent should begin to prefer Map 2 with increasing confidence. Our agent is maximally exploitative and will always take the action it thinks is best.
-
-Each trial is considered expensive (our agent doesn't want to freeze). A good algorithm will maximize the number of successful crossings in the 1,000 attempts. Each attempt is saved to the replay buffer, and any reset will fully reinitialize all network weights.
-
-The advantage of this environment is that it is very fast. A trial of 1,000 crossings with a replay ratio of 1 completes in less than 5 seconds on a CPU. The disadvantage of this environment is that it's incredibly simple, and findings might not generalize to more complex problems.
-
-### Results
-
-The first thing we do is inspect how our model scores its first action, with and without resets, for each crossing.
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/q_vals/01.svg" class="img-fluid rounded z-depth-1" %} -
-
-
- Model scores for the first action over time (after softmax), with and without resets. The correct first action is down for the first 200 episodes and right afterwards. Replay ratio of 16, with results averaged over 25 seeds.
-
- -
-Additional action values over time for various learning rates.
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/q_vals/001.svg" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/q_vals/0001.svg" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/q_vals/00005.svg" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/q_vals/00001.svg" class="img-fluid rounded z-depth-1" %} -
- -
-
-
- -
- -Both models quickly determine that moving down is correct. The resetting model will periodically score actions equally before quickly recovering. Without resets, the map switch is only recognized after the 800th crossing. With resets, this switch happens around crossing 500. We also see that after the map switch the model without resets tries to adjust by increasing the scores for the incorrect left and up actions (which led to failure in two steps instead of one). - -We can also plot the reward per crossing, averaged over 25 seeds. Similar to the first result, the model with resets periodically fails, but also adapts to the map switch faster. - -
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/reward/01.svg" class="img-fluid rounded z-depth-1" %} -
-
-
- Model score over time, with and without resets. Replay ratio of 16. Average of 25 seeds.
-
- - -
-Additional scores over time for various learning rates.
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/reward/001.svg" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/reward/0001.svg" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/reward/00005.svg" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/reward/00001.svg" class="img-fluid rounded z-depth-1" %} -
- -
-
-
- -
- - -Next, we conduct a hyperparameter sweep with replay ratios 1, 4, 16 and reset frequencies 0, 50, 100, 500. We then compare the average number of successful crossings. A random policy will earn the reward 1/16 of the time. - -
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/grid/01.svg" class="img-fluid rounded z-depth-1" %} -
-
-
- Full-period average score, averaged across all crossings. Average of 25 seeds.
-
- -
-Additional average scores for various learning rates.
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/grid/001.svg" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/grid/0001.svg" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/grid/00005.svg" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/grid/00001.svg" class="img-fluid rounded z-depth-1" %} -
- -
-
-
- -
-
-
-In general, the results match our expectations. With a learning rate of 0.01, a higher replay ratio improves results, and having resets is always helpful. A high replay ratio with resets is necessary to achieve a score over 0.6 for all learning rates. Reset frequency and replay ratio must be adjusted alongside the learning rate, which governs how quickly the network can adapt in a non-stationary environment.
-
-As a final experiment, we vary model size. We compare a much smaller two-layer DQN architecture to the larger three-layer model used in prior experiments. Interestingly, this produces the highest score yet with a reset frequency of 10 steps, although the result quickly disappears with a lower learning rate.
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/little/01-2.svg" class="img-fluid rounded z-depth-1" %} -
-
-
- Full-period average score. Average of 25 seeds. Split by network size, with a replay ratio of 16.
-
- -
-Additional average scores for various learning rates by network size.
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/little/001.svg" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/little/0001.svg" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/little/00005.svg" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/little/00001.svg" class="img-fluid rounded z-depth-1" %} -
- -
-
-
- -
-
- {% include figure.html path="assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/little/misc.svg" class="img-fluid rounded z-depth-1" %} -
-
-
- Comparison of 3-layer and 2-layer networks. Reset every 10 steps with a replay ratio of 16. Average of 25 seeds.
-
- -
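For completeness, the reset mechanic itself is only a few lines. Below is a minimal PyTorch sketch with hypothetical names of our own choosing; the notebook's actual implementation may differ:

{% highlight python %}
import torch.nn as nn

def reset_weights(network: nn.Module) -> None:
    # Reinitialize every layer in place; the replay buffer is untouched.
    for layer in network.modules():
        if hasattr(layer, "reset_parameters"):
            layer.reset_parameters()

# Inside the training loop (sketch):
#   if reset_frequency and step % reset_frequency == 0:
#       reset_weights(q_network)   # forget the weights,
#                                  # but keep the data
{% endhighlight %}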
-
-
-## Conclusions
-
-In this blogpost, we discuss primacy bias and its application to off-policy deep reinforcement learning. We highlight a subset of results and apply weight resets to a new problem.
-
-We hope that more examples of primacy bias continue to be discovered and studied. Eventually, we would like to identify specific behaviors that are catastrophically memorized and create guiding principles to identify the environments that are most at risk of primacy bias. Over time, we hope this might unlock new applications of deep reinforcement learning.
-
-Even as the theory continues to develop, there is little harm in attempting periodic weight resets with a high replay ratio to train off-policy reinforcement learning agents.
-
-Finally, primacy bias might not always be a bad thing. If you decide to take a new shortcut to work by walking down an alley, and the first thing you notice is how dark and unsafe it seems, then maybe it's a good idea to turn back. As always, whether primacy bias should be treated is an important decision left to the modeller.
-
-## Acknowledgements
-
-This blogpost is derived from our work that began in Dr. Zsolt Kira's excellent Deep Learning course at Georgia Tech. diff --git a/_posts/2024-05-07-rlhf-without-rl.md b/_posts/2024-05-07-rlhf-without-rl.md deleted file mode 100644 index 7f0a9e06..00000000 --- a/_posts/2024-05-07-rlhf-without-rl.md +++ /dev/null @@ -1,328 +0,0 @@ ----
-layout: distill
-title: RLHF without RL - Direct Preference Optimization
-description: We discuss the RL part of RLHF and its recent displacement by direct preference optimization (DPO).
-  With DPO, a language model can be aligned with
-  human preferences without sampling from an LM, thereby significantly
-  simplifying the training process. By now, DPO has been implemented in many projects and seems to be here to stay.
-date: 2024-05-07
-future: true
-htmlwidgets: true
-
-authors:
-  - name: Michael Panchenko
-    url: "https://transferlab.ai/authors/michael-panchenko"
-    affiliations:
-      name: appliedAI initiative GmbH
-
-bibliography: 2024-05-07-rlhf-without-rl.bib
-
-toc:
-  - name: Background
-    id: background
-  - name: Is RLHF Reinforcement Learning?
-    id: is-rlhf-reinforcement-learning
-  - name: Direct Preference Optimization
-    id: direct-preference-optimization
-  - name: DPO in the Wild - Experiments, LLMs and Software
-    id: dpo-in-the-wild-experiments-llms-and-software
-  - name: Closing Remarks
-    id: closing-remarks
-
-_styles: >
-  .fake-img {
-    background: #bbb;
-    border: 1px solid rgba(0, 0, 0, 0.1);
-    box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
-    margin-bottom: 12px;
-  }
-  .fake-img p {
-    font-family: monospace;
-    color: white;
-    text-align: left;
-    margin: 12px 0;
-    text-align: center;
-    font-size: 16px;
-  }
----
-
-## Background
-
-Reinforcement learning from human feedback (RLHF) is an important technique for
-aligning (large) language models (LMs)
-with human preferences. It was introduced by Christiano et al. and first
-applied to language models in the work by Ziegler et al.
-Since then, RLHF has become a central building block of many LLM-based applications,
-including the first versions of ChatGPT.
-
-RLHF for language models works roughly as follows:
-
-1. Collect a dataset of prompts $\mathcal{D}$ for the LM, typically containing
-   instructions or questions.
-2. For each prompt $x\in \mathcal{D}$, collect a set of completions $y_1, ..., y_N$ from the
-   LM.
One can increase the temperature of the language model for this step to get sufficient variability in the completions.
-3. Ask human annotators to rate the completions, thereby obtaining a dataset of preferences
-   $x, y_{rank_1}, ..., y_{rank_N}$.
-4. Train a parameterized reward function $r_\phi$ (mapping pairs $(x,y)$ to scalars) on the collected preferences by minimizing the loss
-
-   $$
-   \mathcal{L}(r) = - \mathbb{E}_{(x, y_{rank_i})} \left[ \log \frac{e^{r(x, y_{rank_i})}}{\sum_{j=1}^N e^{r(x, y_{rank_j})}} \right].
-   $$
-
-   This loss is inspired by the Bradley-Terry model for pairwise comparisons and by
-   maximum-entropy inverse RL.
-   Intuitively, it encourages the reward function to assign higher rewards to completions that are preferred by humans.
-   Usually, the reward function is parameterized by the LM itself with an additional linear layer. Thus, the mapping from $(x, y)$ to $r(x, y)$ is given by
-   simply concatenating the sequences $x$ and $y$ and passing the embedding of the last (or a differently selected) token through a linear layer.
-5. Fine-tune the LM by viewing it as a policy $\pi_\theta$ and using RL with the learned reward function $r_\phi$ as the
-   reward. For this step, a separate dataset of prompts $\mathcal{D}\_{\text{RL}}$ is used to query the LM and collect completions.
-   Since the reward is learned on a very limited subset of possible completions, and is therefore unreliable on
-   off-distribution data, it would be unwise to aim at optimizing it without any regularization.
-
-   The typical choice of regularization is the KL-divergence between the policy (i.e., the aligned/fine-tuned LM) and a reference
-   policy $\pi_{\text{ref}}$ (usually the pretrained LM before fine-tuning). The RLHF objective then becomes
-
-   $$
-   \tag{1}
-   \label{eq:rlhf}
-   J(\pi) = \mathbb{E}_{x \sim \mathcal{D}_\text{RL}, y\sim \pi_\theta(y \mid x)} \left[
-   r_\phi(x, y) - \beta D_{\text{KL}} \left( \pi_\theta(\cdot \mid x) \,||\, \pi_\text{ref}(\cdot \mid x) \right)
-   \right],
-   $$
-
-   which is then used to find the optimal policy $\pi_\theta$ by some optimization algorithm, typically a variant
-   of proximal policy optimization (PPO). Here $D_{\text{KL}}$ denotes the
-   KL-divergence between two distributions, and the temperature $\beta$ is a hyperparameter
-   that controls the strength of the regularization.
-
-The resulting LLMs are very powerful and so widely used that we don't need to further elaborate on their performance here.
-Note, however, that the RLHF scheme has quite some complexity when it comes to actually making it work in practice.
-
-## Is RLHF Reinforcement Learning?
-
-From the beginning, RLHF has sparked some controversy. Some regarded it as one of the prime applications of reinforcement learning
-(which may currently be perceived as "less hot" than LLMs, so applying RL to LLMs works in RL's favor).
-At the same time, others were skeptical about whether RLHF is reinforcement learning at all.
-
-Indeed, some crucial components of RL are missing in RLHF. First, the current forms of RLHF do not involve sequential decision-making
-(although there is some work on that, e.g., the ILQL algorithm).
-While the rollout of a completion can formally be viewed as a sequence of actions, the reward is only given after the completion
-has ended. Moreover, for the purpose of RLHF the LM itself can be regarded as a direct mapping from inputs to distributions over completions,
-rather than a sequential decision-making agent in the space of tokens.
Thus, at best, RLHF is a form of single-step,
-immediate-reward RL - in other words, a *contextual bandit*.
-
-Even more troubling than the non-sequential nature of RLHF may be its information flow. While the policy optimization of RLHF is framed as an online RL algorithm,
-*the environment consists of the policy itself*. Usually, in online RL an agent is able to extract new information from the environment.
-In RLHF, however, the information is not "new" in the sense that it is not extracted from something external to the agent itself.
-The only information not originally contained in the LM is in the preference data (notably, not even in the completions themselves,
-but only in their rankings), and it is only used to fit a reward function. Thus, RLHF is more reminiscent of offline RL or supervised learning
-than of online RL.
-
-Because of this 1-step nature of RLHF, and due to the (unusual for RL) application of training enormous models,
-the majority of RLHF software is not set up to be compatible with gym(nasium) or other environment interfaces. Take,
-for example, the well-known [trl](https://github.com/huggingface/trl) and [trlx](https://github.com/CarperAI/trlx) libraries,
-which barely mention environments at all. A notable exception is the [RL4LMs project](https://github.com/allenai/RL4LMs) by AllenAI,
-which unfortunately seems to be abandoned, and is based on the deprecated gym instead of
-[gymnasium](https://gymnasium.farama.org/). For practical RLHF, training in parallel on massive datasets
-is a necessary requirement, which somewhat complicates the use of standard environment and training interfaces.
-
-The view that RLHF is not "really" RL, or at least does not have to be,
-has become even more popular after the publication of the DPO algorithm,
-which we will discuss in the next section.
-
-## Direct Preference Optimization
-
-The direct preference optimization (DPO) algorithm by Rafailov et al.
-aligns language models to human preferences without having to sample from the LM and without using RL explicitly.
-Interestingly, DPO still optimizes the same objective as RLHF, but does so purely by supervised learning.
-This results in a much simpler training procedure and
-reportedly better performance in a number of experiments.
-
-The mathematical derivation of DPO is short and insightful. It is based on the following observations:
-
-### 1. Reward as a Function of the Policy
-
-The RLHF objective (\ref{eq:rlhf}) has an exact (non-parametric) solution for the optimal policy $\pi_r$:
-
-$$
-\pi_r(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp
-  \left( \frac{1}{\beta} r(x, y) \right).
-$$
-
-This expression is well known in the RL literature and is sometimes referred to as the *Boltzmann policy*
-(note that in the 1-step RL setting, the Q-function is given by the reward itself).
-
-Similar results were proved for the REPS algorithm and in follow-up work. While this solution for $\pi_r$ in
-itself is intractable (because of the partition function $Z(x)$), it can be used
-to express the reward as a function of the optimal policy:
-
-$$
-  \tag{2}
-  \label{eq:reward-as-function-of-policy}
-  r(x, y) = \beta \log \left( \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right) + \beta \log Z(x).
-$$
-
-### 2. Only Differences of Rewards Are Needed
-
-For simplicity, let us consider that only two completions are collected per
-input, which are then ranked as $y_w$ and $y_l$ (for winning and losing).
-DPO can be easily extended to the case of more completions per input, but the
-notation becomes more cumbersome.
-
-The reward $r_\phi$ is then learned by minimizing the loss:
-
-$$
-  \mathcal{L}_\phi = - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[
-    \log \frac{ e ^ {r_\phi(x, y_w)}}{ e^{r_\phi(x, y_w)} + e^{r_\phi(x, y_l)}}
-  \right],
-$$
-
-which is equivalent to
-
-$$
-  \tag{3}
-  \label{eq:reward-loss-binary}
-  \mathcal{L}_\phi = - \mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}} \left[
-   \log \sigma \left( r_\phi(x, y_w) - r_\phi(x, y_l) \right)
-  \right],
-$$
-
-where $\sigma$ is the sigmoid function. Note that only _differences of rewards_
-enter (\ref{eq:reward-loss-binary}).
-
-### 3. DPO Objective
-
-After plugging the expression (\ref{eq:reward-as-function-of-policy}) for the reward
-into the loss (\ref{eq:reward-loss-binary}),
-the partition function $Z(x)$ cancels out. Replacing the
-optimal $\pi_r$ with the parameterized $\pi_\theta$, the DPO objective is obtained as
-
-$$
-  \mathcal{L}_{\text{DPO}}(\pi_\theta ; \pi_{\text{ref}}) :=
-  - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[
-  \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
-  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right)
-  \right].
-$$
-
-Thus, instead of first learning a reward and then finding the optimal policy,
-one directly finds the policy whose reward, as obtained from
-(\ref{eq:reward-as-function-of-policy}),
-corresponds to the collected human preferences (i.e., a reward that
-optimizes (\ref{eq:reward-loss-binary})). Note that while the induced reward function
-itself is intractable, the differences of rewards remain tractable and can be
-computed using the learned policy. This should be sufficient for practical
-purposes, where rewards are mostly used to rank completions and, e.g., perform
-rejection sampling.
-
-The paper includes some more details, a discussion of the interpretation of
-the DPO update, and a detailed comparison to standard RLHF,
-but the essence of the method is captured by the above derivation.
-
-## DPO in the Wild - Experiments, LLMs and Software
-
-The original experiments in the paper were conducted on small-scale models
-and datasets, and as such were not very convincing. We partially include them here for
-completeness:
-
- {% include figure.html path="assets/img/2024-05-07-rlhf-without-rl/original-evaluation.svg" class="img-fluid" %} -
-
-
- Original evaluation of DPO on small-scale models and datasets. - Left: TL;DR summarization win rates vs. - human-written summaries, using GPT-4 as evaluator. DPO exceeds PPO’s best-case - performance on summarization, while being more robust to changes in the sampling - temperature. - Right: The frontier of expected reward vs KL to the reference - policy. DPO provides the highest expected reward for all KL values, - demonstrating the quality of the optimization. -
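The derivation above translates almost directly into code. Below is a minimal sketch of the DPO loss in PyTorch (our own illustration, not the reference implementation); the inputs are assumed to be the summed token log-probabilities of each completion under the trained policy and the frozen reference model:

{% highlight python %}
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Log-ratios between the trained policy and the frozen reference
    # for the winning (w) and losing (l) completions.
    ratio_w = policy_logp_w - ref_logp_w
    ratio_l = policy_logp_l - ref_logp_l
    # L_DPO = -E[ log sigma( beta * (ratio_w - ratio_l) ) ]
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()
{% endhighlight %}

Note how the intractable $\log Z(x)$ never appears: only reward differences enter the loss, exactly as in observation 2 above. The value of $\beta$ here is an arbitrary placeholder.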
-
-Fortunately, DPO's simplicity has made it attractive to many researchers and engineers.
-By now, only a few months after the publication of the paper, it is
-already included in [trl](https://huggingface.co/docs/trl/dpo_trainer) as well as
-the ray-based library [OpenRLHF](https://github.com/OpenLLMAI/OpenRLHF) (which is
-notably not using rllib, but that's a story for another day). Moreover, several large models have been trained with DPO,
-including [Zephyr 7B](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) and the 70B-parameter
-[TÜLU 2](https://github.com/allenai/open-instruct). Here is what the
-authors of the latter had to say about DPO:
- DPO training significantly improves AlpacaEval and MT-Bench performance. At all sizes, - DPO training provides significant improvements in AlpacaEval, with our largest DPO-trained model - significantly outperforming GPT-3.5-turbo-0314 (89.4 vs. 95.1) and is competitive with GPT-4 ... - We also observe that DPO training provides a large boost in MT-Bench - performance for the 13B and 70B size models, with TÜLU 2+DPO 70B being the best-performing - open model compared to all other models on the MT-Bench leaderboard. -
- -
- DPO training is stable at large scales. We find that DPO training scales without issues with 70Bsize models, - with DPO training still providing large benefits for open-ended generation (AlpacaEval) - even at the 70B size. This suggests DPO is a promising path for training large models on human - feedback without the engineering complexity required by PPO. To our knowledge, TÜLU 2+DPO - 70B is the largest publicly-released DPO-trained model. -
- -
- DPO does not dramatically harm most other metrics. We find that DPO training does not - significantly change performance in most other metrics we measure, such as factual reasoning - (MMLU) or reasoning (BBH, GSM8k), with the exception of multilinguality (which we discuss - below). This suggests that DPO training does not significantly change model capabilities. - DPO training significantly drops multilingual capabilities. We find that DPO training significantly drops performance in TydiQA, which tests the multilingual capabilities of our model. However, - we note that both our supervised finetuning and DPO data mixes do not explicitly contain multilingual - data, and are majority English-language. As such, DPO training is likely to make multilingual outputs - further out-of-distribution, and mixing in multilingual data at instruction tuning and DPO training - stages may significantly improve these results. -
- -
- DPO training increases model verbosity. As seen in Table 4, TÜLU 2+DPO models generally - output answers of longer length than those trained without DPO. This is in line with prior work - showing a bias toward verbosity from RLHF training. However, we note that our DPO-trained models appear dramatically less verbose than other openweight models, which future work will investigate. -
-
-
-## Closing Remarks
-
-One may find it surprising that supervised learning is able to replace RL
-on a formal level. For RLHF, _new_ data is sampled from the language model, and for DPO
-this is not the case.
-
-However, after paying closer attention to the information flow
-of RLHF as described above, it may not be too surprising after all. The sampled
-data is not really new - it is created using the very same model that one is trying
-to optimize. The rewards for these samples are also not new: they are obtained
-by fitting a reward function to the preferences, and no new human preferences are
-retrieved during optimization. So from the information-flow perspective,
-supervised learning and RL are indeed equivalent in this particular case. Maybe
-Francois Chollet was not too extreme in suggesting to _get rid of deep RL
-altogether_ in his tweet (note that it predates DPO; personally, I don't believe deep RL is completely futile, but for RLHF he was on point):
-
-{% twitter https://twitter.com/fchollet/status/1630241783111364608?s=20 %}
-
-Another surprising aspect of DPO is the question: *Why has nobody done this before?*
-Hopefully, after reading this blog post, you will agree that the derivation of DPO is
-not particularly complicated, so why did it take almost 4 years after the introduction of RLHF?
-Especially considering how tricky RLHF can be to implement.
-I don't have an answer, though my intuition is that sometimes as a community we put too much
-effort into following a working solution, instead of taking a step back
-and searching for a simpler path. We might have witnessed a large-scale instance of the
-[Region-beta paradox](https://en.wikipedia.org/wiki/Region-beta_paradox).
-
-As a final note on community dynamics: supervised and self-supervised learning are now making more headlines
-compared to reinforcement learning, and DPO might have the effect of slowing down
-the complicated (but, as I believe, necessary) marriage of RL and LLMs.
-I do think that planning and search should play some part in LLM training in the future,
-although only in settings in which there is an actual environment from which new information
-can be extracted (like tool-use or robotics). For now, however, taking the RL out of RLHF
-seems like a good step forward. If DPO can be made beneficial for most LLM training runs, I believe
-that one can firmly answer the opening question of this blog as:
-
-*Is RLHF really (online) RL? No, it is not.* diff --git a/_posts/2024-05-07-robust-foundation-model.md b/_posts/2024-05-07-robust-foundation-model.md deleted file mode 100644 index 6bf87f8e..00000000 --- a/_posts/2024-05-07-robust-foundation-model.md +++ /dev/null @@ -1,863 +0,0 @@ ----
-layout: distill
-title: 'Towards Robust Foundation Models: Adversarial Contrastive Learning'
-description: Foundation models pre-trained on large-scale unlabelled datasets using self-supervision are generalizable to a wide range of downstream tasks. Existing work has shown that adversarial attacks can effectively fool any downstream models fine-tuned from a pre-trained foundation model. The existence of such adversarial attacks necessitates the development of robust foundation models which can yield both standard generalization and adversarial robustness in safety-critical downstream tasks. Currently, adversarial contrastive learning (ACL) is one of the most effective methods for producing a robust foundation model.
ACL incorporates contrastive learning with adversarial data to learn a robust representation without requiring costly annotations. In this blog, we introduce two NeurIPS 2023 publications that can enhance ACL's efficacy and efficiency, respectively. (1) This blog introduces Adversarial Invariant Regularization (AIR), which is a state-of-the-art ACL algorithm. A causal theoretical framework is built to interpret ACL, and the AIR algorithm is then derived from this causal framework to regulate and improve ACL. (2) This blog also introduces a Robustness-aware Coreset Selection (RCS) method to speed up ACL. RCS does not require label information and searches for an informative training subset that can maintain adversarial robustness. For the first time, RCS enables the application of ACL on the large-scale ImageNet-1K dataset.
-# Your blog post's abstract.
-  # Please add your abstract or summary here and not in the main body of your text.
-  # Do not include math/latex or hyperlinks.
-date: 2024-05-07
-future: true
-htmlwidgets: true
-
-# Anonymize when submitting
-# authors:
-#   - name: Anonymous
-
-authors:
-  - name: Jingfeng Zhang
-    url: https://zjfheart.github.io/
-    affiliations:
-      name: The University of Auckland & RIKEN Center for Advanced Intelligence Project
-  - name: Xilie Xu
-    url: https://godxuxilie.github.io/
-    affiliations:
-      name: National University of Singapore
-
-# must be the exact same name as your blogpost
-bibliography: 2024-05-07-robust-foundation-model.bib
-
-# Add a table of contents to your post.
-# - make sure that TOC names match the actual section names
-#   for hyperlinks within the post to work correctly.
-# - please use this format rather than manually creating a markdown table of contents.
-toc:
-  - name: Foundation Models
-    subsections:
-    - name: Contrastive Learning (CL)
-  - name: Robust Foundation Models
-    subsections:
-    - name: Adversarial Contrastive Learning (ACL)
-    # subsections:
-    #   - name: Interactive Figures
-  - name: Enhancing ACL via Adversarial Invariant Regularization (AIR)
-    subsections:
-    - name: Causal View of ACL
-    - name: the Methodology of AIR
-    - name: Empirical Results
-    - name: Robust Self-Supervised Learning (RobustSSL) Benchmark
-  - name: Efficient ACL via Robustness-Aware Coreset Selection (RCS)
-    subsections:
-    - name: Motivation---ACL is Inefficient
-    - name: the Methodology of RCS
-    - name: Experimental Results
-
-
-# Below is an example of injecting additional post-specific styles.
-# This is used in the 'Layouts' section of this post.
-# If you use this post as a template, delete this _styles block.
-_styles: >
-  .fake-img {
-    background: #bbb;
-    border: 1px solid rgba(0, 0, 0, 0.1);
-    box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
-    margin-bottom: 12px;
-  }
-  .fake-img p {
-    font-family: monospace;
-    color: white;
-    text-align: left;
-    margin: 12px 0;
-    text-align: center;
-    font-size: 16px;
-  }
----
-
-
-## Foundation Models
-
-Foundation models are pre-trained on large-scale unlabelled datasets using self-supervised learning methods, which makes them generalizable to a wide range of downstream tasks via fine-tuning. For example, GPT-3 has been successfully commercialized as a powerful text generation application. Vision transformers have been widely used in computer vision tasks such as object detection and medical analysis. BLIP is a vision-language pre-trained model that can perform many vision-language tasks such as visual question answering.
CLAP is a language-audio pre-trained model that can be used for understanding paired text and audio.
-
-
- {% include figure.html path="assets/img/2024-05-07-robust-foundation-model/foundation_models.png" class="img-fluid" %} -
-
-
-
-
-
-### Contrastive Learning (CL)
-
-To build foundation models, contrastive learning (CL) is one of the most popular self-supervised learning methods. CL aims to maximize the agreement between different natural views of the original data.
-
-Let $$f_\theta: \mathcal{X} \rightarrow \mathcal{Z}$$ be a feature extractor parameterized by $$\theta$$, $$g:\mathcal{Z} \rightarrow \mathcal{V}$$ be a projection head that maps representations to the space where the contrastive loss is applied, and $$\tau_i, \tau_j: \mathcal{X} \rightarrow \mathcal{X}$$ be two transformation operations randomly sampled from a pre-defined transformation set $$\mathcal{T}$$. Given a mini-batch $$B \sim \mathcal{X}^\beta$$ consisting of $$\beta$$ samples, we denote the augmented mini-batch $$B^\prime = \{ \tau_i(x_k), \tau_j(x_k) \mid \forall x_k \in B \}$$ consisting of $$2\beta$$ samples. We take $$h_\theta(\cdot) = g \circ f_\theta(\cdot)$$ and $$x_k^u = \tau_u(x_k)$$ for any $$x_k \sim \mathcal{X}$$ and $$u \in \{i,j\}$$. The contrastive loss between different natural views (i.e., $$x_k^i$$ and $$x_k^j$$) is formulated as follows:
-
-$$ \ell_\mathrm{CL}(x_k^i,x_k^j; \theta)\!=\!-\! \sum\limits_{u \in \{i,j\}} \! \log \frac{e^{\mathrm{sim} \left(h_\theta(x_k^i), h_\theta(x_k^j) \right)/t}}{\sum\limits_{x \in B^\prime \setminus \{x_k^u\}} e^{\mathrm{sim} \left( h_\theta(x_k^u), h_\theta(x) \right)/t}}, $$
-
-where $$\mathrm{sim}(\cdot,\cdot)$$ is the cosine similarity function and $$t$$ is a temperature parameter.
-
- {% include figure.html path="assets/img/2024-05-07-robust-foundation-model/SCL.png" class="img-fluid" %} -
-
-
- Intuitively, CL aims to maximize the agreement between different natural views (the dashed blue lines).
-
- -**How to implement CL at the pre-training stage in practice?** - -
Click here to see the Pytorch code for calculating contrastive loss. You can copy-paste it to calculate the contrastive loss in convenience. -The code is copied from https://github.com/GodXuxilie/Enhancing_ACL_via_AIR. -{% highlight python %} -import torch -import torch.nn as nn -import torch.nn.functional as F - -class CL(nn.Module): - - def __init__(self, normalize=True, temperature=0.5): - super(CL, self).__init__() - self.normalize = normalize - self.temperature = temperature - - def forward(self, zi, zj): - # zi: the representation of natural view x^i. - # zj: the representation of natural view x^j. - - bs = zi.shape[0] - labels = torch.zeros((2*bs,)).long().to(zi.device) - mask = torch.ones((bs, bs), dtype=bool).fill_diagonal_(0) - - zi_norm = F.normalize(zi, p=2, dim=-1) if self.normalize else zi - zj_norm = F.normalize(zj, p=2, dim=-1) if self.normalize else zj - - ### Contrastive Loss ### - logits_ii = torch.mm(zi_norm, zi_norm.t()) / self.temperature - logits_ij = torch.mm(zi_norm, zj_norm.t()) / self.temperature - logits_ji = torch.mm(zj_norm, zi_norm.t()) / self.temperature - logits_jj = torch.mm(zj_norm, zj_norm.t()) / self.temperature - - logits_ij_pos = logits_ij[torch.logical_not(mask)] - logits_ji_pos = logits_ji[torch.logical_not(mask)] - logits_ii_neg = logits_ii[mask].reshape(bs, -1) - logits_ij_neg = logits_ij[mask].reshape(bs, -1) - logits_ji_neg = logits_ji[mask].reshape(bs, -1) - logits_jj_neg = logits_jj[mask].reshape(bs, -1) - - pos = torch.cat((logits_ij_pos, logits_ji_pos), dim=0).unsqueeze(1) - neg_i = torch.cat((logits_ii_neg, logits_ij_neg), dim=1) - neg_j = torch.cat((logits_ji_neg, logits_jj_neg), dim=1) - neg = torch.cat((neg_i, neg_j), dim=0) - - logits = torch.cat((pos, neg), dim=1) - nat_contrastive_loss = F.cross_entropy(logits, labels) - return nat_contrastive_loss -{% endhighlight %} -
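As a quick sanity check, the loss above can be exercised on random features; the shapes below are illustrative and not tied to any particular model:

{% highlight python %}
criterion = CL(normalize=True, temperature=0.5)
zi = torch.randn(256, 128, requires_grad=True)  # representations of view x^i
zj = torch.randn(256, 128, requires_grad=True)  # representations of view x^j
loss = criterion(zi, zj)
loss.backward()  # gradients flow back to the representations
{% endhighlight %}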
-
-
-Besides, you can use the following script to conduct self-supervised pre-training via CL using ResNet-18 on CIFAR-10:
-{% highlight bash %}
-# Pre-training stage via CL
-git clone https://github.com/GodXuxilie/Enhancing_ACL_via_AIR.git
-cd Enhancing_ACL_via_AIR
-PRE_TRAIN_DIR=CL_ResNet18_cifar10
-python pretraining.py $PRE_TRAIN_DIR --dataset cifar10 \
-                      --model r18 \
-                      --pgd_iter 0 --lambda1 0 --lambda2 0
-{% endhighlight %}
-
-
-## Robust Foundation Models
-Existing work has shown that there exist adversarial attacks that can fool foundation representations into outputting incorrect predictions by adding imperceptible adversarial perturbations to the original inputs in downstream tasks.
-The existence of adversarial attacks necessitates the development of robust foundation models for safety-critical downstream tasks.
-
- {% include figure.html path="assets/img/2024-05-07-robust-foundation-model/adv_attack.png" class="img-fluid" %} -
-
-
-The foundation representation is vulnerable to adversarial attacks: an imperceptible perturbation causes a car to be wrongly predicted as 'NOT a car'.
-
-
-Robust foundation models are pre-trained on large-scale datasets via robust self-supervised learning methods. Robust foundation models have the following two critical properties:
-- Robust foundation representations are generalizable to downstream tasks;
-- Fine-tuned robust foundation representations are robust against adversarial attacks in downstream tasks.
-
- {% include figure.html path="assets/img/2024-05-07-robust-foundation-model/robust_foundation_models.png" class="img-fluid" %} -
-
- -### Adversarial Contrastive Learning (ACL) - -To learn robust foundation representations, adversarial contrastive learning (ACL) is one of the most popular and effective robust self-supervised learning methods. ACL incorporates CL with adversarial data to build a robust foundation model without requiring costly annotations. ACL aims to maximize the agreement between different natural views as well as the agreement between different adversarial views. The adversarial contrastive loss given a data point $$x_k \in \mathcal{X}$$ is formulated as follows: - -$$ \ell_\mathrm{ACL}(x_k;\theta) = (1 + \omega) \cdot \ell_\mathrm{CL}(\tilde{x}_{k}^i, \tilde{x}_{k}^j; \theta) + (1 - \omega) \cdot \ell_\mathrm{CL}(x_k^i, x_k^j; \theta), $$ - -where adversarial views are formulated as follows: - -$$ \tilde{x}_{k}^i, \tilde{x}_{k}^j = \mathop{\arg\max}_{ - {\Large \tilde{x}_{k}^i \in \mathcal{B}_\epsilon[x_k^i]} - \atop - {\Large \tilde{x}_{k}^j \in \mathcal{B}_\epsilon[x_k^j]} - } \ell_\mathrm{CL}(\tilde{x}_{k}^i, \tilde{x}_{k}^j; \theta). $$ - -Note that $$\omega \in [0,1]$$ is a scalar and $$\mathcal{B}_\epsilon[x]$$ is a constraint that ensures the adversarial data $$\tilde{x}$$ is in the $$\epsilon$$-ball around data $$x$$. - -
-
- {% include figure.html path="assets/img/2024-05-07-robust-foundation-model/ACL.png" class="img-fluid" %} -
-
-
- Intuitively, ACL aims to maximize the agreement between different natural views (the dashed blue lines) and the agreement between different adversarial views (the dashed red lines).
-
-
-Here is the generation procedure of adversarial data via Projected Gradient Descent (PGD). Given an initial positive pair $$(x_k^{i,(0)}, x_k^{j,(0)})$$, a number of PGD steps $$T \in \mathbb{N}$$, step size $$\rho > 0$$, and adversarial budget $$\epsilon \geq 0$$, PGD iteratively updates the pair of data from $$t=0$$ to $$T-1$$ as follows:
-
-$$ x_k^{i,(t+1)} \! = \! \Pi_{\mathcal{B}_\epsilon[x_k^{i,(0)}]} \big( x_k^{i,(t)} +\rho \cdot \mathrm{sign} (\nabla_{x_k^{i,(t)}} \ell_\mathrm{CL}(x_k^{i,(t)}, x_k^{j,(t)}) \big ), $$
-
-$$ x_k^{j,(t+1)} \! = \! \Pi_{\mathcal{B}_\epsilon[x_k^{j,(0)}]} \big( x_k^{j,(t)} +\rho \cdot \mathrm{sign} (\nabla_{x_k^{j,(t)}} \ell_\mathrm{CL}(x_k^{i,(t)}, x_k^{j,(t)}) \big ), $$
-
-where $$\Pi_{\mathcal{B}_\epsilon[x]}$$ projects the data into the $$\epsilon$$-ball around the initial point $$x$$. Generating adversarial data requires $$T$$ iterations of forward and backward propagation, which makes the training procedure extremely slow.
-
- {% include figure.html path="assets/img/2024-05-07-robust-foundation-model/pgd_step.gif" class="img-fluid" %} -
-
-
- The generation procedure of adversarial data in ACL. The adversarial data $\tilde{x}_k^i$ and $\tilde{x}_k^j$ are updated from the low-loss region to the high-loss region step by step according to the loss gradient. -
-
-
-At each epoch, ACL alternates between the following two steps:
-
-- Step (1): generating adversarial data (i.e., $$\tilde{x}_k^i$$ and $$\tilde{x}_k^j$$) via PGD;
-
-- Step (2): updating model parameters by minimizing the adversarial contrastive loss to maximize agreements on the adversarial data and natural data.
-
-
-**How to implement ACL at the pre-training stage in practice?**
-
Click here to see the Pytorch code for calculating adversarial contrastive loss. You can copy-paste it to calculate the adversarial contrastive loss for convenience.
The code is copied from https://github.com/GodXuxilie/Enhancing_ACL_via_AIR.
{% highlight python %}
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACL(nn.Module):

    def __init__(self, normalize=True, temperature=0.5):
        super(ACL, self).__init__()
        self.normalize = normalize
        self.temperature = temperature

    def forward(self, zi, zj, zi_adv, zj_adv, weight=0.5):
        # zi: the representation of natural view x^i.
        # zj: the representation of natural view x^j.
        # zi_adv: the representation of adversarial view \tilde{x}^i.
        # zj_adv: the representation of adversarial view \tilde{x}^j.

        bs = zi.shape[0]
        labels = torch.zeros((2*bs,)).long().to(zi.device)
        mask = torch.ones((bs, bs), dtype=bool).fill_diagonal_(0)

        zi_norm = F.normalize(zi, p=2, dim=-1) if self.normalize else zi
        zj_norm = F.normalize(zj, p=2, dim=-1) if self.normalize else zj
        zi_adv_norm = F.normalize(zi_adv, p=2, dim=-1) if self.normalize else zi_adv
        zj_adv_norm = F.normalize(zj_adv, p=2, dim=-1) if self.normalize else zj_adv

        ### Adversarial Contrastive Loss ###

        logits_ii = torch.mm(zi_norm, zi_norm.t()) / self.temperature
        logits_ij = torch.mm(zi_norm, zj_norm.t()) / self.temperature
        logits_ji = torch.mm(zj_norm, zi_norm.t()) / self.temperature
        logits_jj = torch.mm(zj_norm, zj_norm.t()) / self.temperature

        logits_ij_pos = logits_ij[torch.logical_not(mask)]
        logits_ji_pos = logits_ji[torch.logical_not(mask)]
        logits_ii_neg = logits_ii[mask].reshape(bs, -1)
        logits_ij_neg = logits_ij[mask].reshape(bs, -1)
        logits_ji_neg = logits_ji[mask].reshape(bs, -1)
        logits_jj_neg = logits_jj[mask].reshape(bs, -1)

        pos = torch.cat((logits_ij_pos, logits_ji_pos), dim=0).unsqueeze(1)
        neg_i = torch.cat((logits_ii_neg, logits_ij_neg), dim=1)
        neg_j = torch.cat((logits_ji_neg, logits_jj_neg), dim=1)
        neg = torch.cat((neg_i, neg_j), dim=0)

        logits = torch.cat((pos, neg), dim=1)
        nat_contrastive_loss = F.cross_entropy(logits, labels)

        logits_ii_adv = torch.mm(zi_adv_norm, zi_adv_norm.t()) / self.temperature
        logits_ij_adv = torch.mm(zi_adv_norm, zj_adv_norm.t()) / self.temperature
        logits_ji_adv = torch.mm(zj_adv_norm, zi_adv_norm.t()) / self.temperature
        logits_jj_adv = torch.mm(zj_adv_norm, zj_adv_norm.t()) / self.temperature

        logits_ij_pos_adv = logits_ij_adv[torch.logical_not(mask)]
        logits_ji_pos_adv = logits_ji_adv[torch.logical_not(mask)]
        logits_ii_neg_adv = logits_ii_adv[mask].reshape(bs, -1)
        logits_ij_neg_adv = logits_ij_adv[mask].reshape(bs, -1)
        logits_ji_neg_adv = logits_ji_adv[mask].reshape(bs, -1)
        logits_jj_neg_adv = logits_jj_adv[mask].reshape(bs, -1)

        pos_adv = torch.cat((logits_ij_pos_adv, logits_ji_pos_adv), dim=0).unsqueeze(1)
        neg_i_adv = torch.cat((logits_ii_neg_adv, logits_ij_neg_adv), dim=1)
        neg_j_adv = torch.cat((logits_ji_neg_adv, logits_jj_neg_adv), dim=1)
        neg_adv = torch.cat((neg_i_adv, neg_j_adv), dim=0)

        logits_adv = torch.cat((pos_adv, neg_adv), dim=1)
        adv_contrastive_loss = F.cross_entropy(logits_adv, labels)

        return (1 - weight) * nat_contrastive_loss + (1 + weight) * adv_contrastive_loss
{% endhighlight %}
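The loss above consumes adversarial representations, which come from step (1). Below is a minimal sketch of that inner PGD loop (our own simplification, omitting details such as clipping to the valid pixel range); `h` denotes the encoder plus projection head and `criterion` the contrastive (CL) loss from the earlier code block:

{% highlight python %}
import torch

def pgd_for_acl(h, criterion, xi, xj, epsilon, rho, steps):
    # Maximize the CL loss between the two views within L-inf eps-balls
    # around the natural views, following the PGD updates above.
    xi_adv, xj_adv = xi.clone().detach(), xj.clone().detach()
    for _ in range(steps):
        xi_adv.requires_grad_(True)
        xj_adv.requires_grad_(True)
        loss = criterion(h(xi_adv), h(xj_adv))
        grad_i, grad_j = torch.autograd.grad(loss, [xi_adv, xj_adv])
        with torch.no_grad():
            xi_adv = xi_adv + rho * grad_i.sign()
            xj_adv = xj_adv + rho * grad_j.sign()
            # Project back into the eps-ball around the natural views.
            xi_adv = xi + torch.clamp(xi_adv - xi, -epsilon, epsilon)
            xj_adv = xj + torch.clamp(xj_adv - xj, -epsilon, epsilon)
    return xi_adv.detach(), xj_adv.detach()
{% endhighlight %}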
-
-
-Besides, you can use the following script to conduct robust self-supervised pre-training via ACL using ResNet-18 on CIFAR-10:
-{% highlight bash %}
-# Pre-training stage via ACL
-git clone https://github.com/GodXuxilie/Enhancing_ACL_via_AIR.git
-cd Enhancing_ACL_via_AIR
-PRE_TRAIN_DIR=ACL_ResNet18_cifar10
-python pretraining.py $PRE_TRAIN_DIR --dataset cifar10 \
-                      --model r18 \
-                      --DynAug --lambda1 0 --lambda2 0
-{% endhighlight %}
-
-**How to utilize robust foundation representations via fine-tuning in downstream tasks?**
-
-At the fine-tuning stage, a classifier is randomly initialized and appended to the pre-trained feature extractor for solving the classification tasks.
-There are three types of fine-tuning modes:
-1. Standard linear fine-tuning (SLF): only standardly fine-tuning the classifier while freezing the feature extractor.
-2. Adversarial linear fine-tuning (ALF): only adversarially fine-tuning the classifier while freezing the feature extractor.
-3. Adversarial full fine-tuning (AFF): adversarially fine-tuning both the feature extractor and the classifier.
-
-You can use the following script to transfer an adversarially pre-trained ResNet-18 on CIFAR-10 to the downstream task CIFAR-100 via fine-tuning:
-{% highlight bash %}
-# Fine-tuning stage
-cd Enhancing_ACL_via_AIR
-PRE_TRAIN_DIR=ACL_ResNet18_cifar10
-FINETUNE_DIR=ACL_ResNet18_cifar10_cifar100
-MODE=SLF/ALF/AFF/ALL
-python finetuning.py --mode $MODE \
-                     --experiment $FINETUNE_DIR \
-                     --checkpoint ./checkpoints/$PRE_TRAIN_DIR/model.pt \
-                     --dataset cifar100 \
-                     --model r18 \
-                     --eval-AA --eval-OOD --pretraining DynACL
-{% endhighlight %}
-Note that `MODE=ALL` means that `finetuning.py` sequentially conducts fine-tuning in all three modes (i.e., SLF, ALF, and AFF) and outputs the result of each fine-tuning mode in the log file `$FINETUNE_DIR/results/log.txt`.
-
-## Enhancing ACL via Adversarial Invariant Regularization (AIR)
-
-Here, we introduce the NeurIPS 2023 paper which proposes Adversarial Invariant Regularization (AIR) that regulates both standard and robust representations to be style-independent based on a causal theoretical framework. Empirically, AIR yields state-of-the-art performance in terms of robustness against adversarial attacks and common corruptions as well as standard generalization in downstream tasks.
-
-### Causal View of ACL
-
-AIR first introduces the causal graph of ACL, as shown in the following figure.
-
- {% include figure.html path="assets/img/2024-05-07-robust-foundation-model/causal_graph.png" class="img-fluid" %} -
-
-
- The causal graph of the ACL. -
-During **the data generation procedure**:
-
-- $$c$$ is the content variable, which can be regarded as the original data in the datasets.
-- $$s$$ is the style factor, which can be regarded as the data transformation functions that can modify the content while maintaining the semantic meaning of the content. Note that factors $$c$$ and $$s$$ are independent.
-- $$x$$ is the natural data, which is decided by the content factor $$c$$ and the style factor $$s$$.
-- $$y_t \in \{ y_i \}_{i=1}^{T}$$ is the label from an unknown downstream task. Note that $$y_t$$ is only decided by the content factor $$c$$.
-- $$y^R$$ is the proxy label, which is a refinement of $$y_t$$. $$y^R$$ is used for self-supervised learning without labels. As illustrated in the following figure, the label `dog` is refined into the proxy labels `golden retriever with yellow hair` and `labrador retriever with black hair`. Therefore, when there is no target label, we can train models by differentiating these two different pictures using the contrastive loss.
-
-&#13;
-
- {% include figure.html path="assets/img/2024-05-07-robust-foundation-model/proxy_label.png" class="img-fluid" %} -
-
-
- The illustration of the proxy label $y^R$ which is a refinement of the label $y_t$. -
-
-- $$\tilde{x}$$ is the adversarial data of $$x$$. Since the generation procedure of $$\tilde{x}$$ in ACL does not use the labels, the adversarial data $$\tilde{x}$$ is decided by the natural data $$x$$ and the model parameter $$\theta$$.
-
-During **the learning procedure**, ACL optimizes the parameters $$\theta$$ by maximizing both the conditional probabilities $$p(y^R \mid x)$$ and $$p(y^R \mid \tilde{x})$$.
-
-### The Methodology of AIR
-
-**Style-invariant criterion.**
-
-From the causal view of ACL, the learning procedure should satisfy the style-independent criterion. That is to say, the intervention on the style factor should not affect the conditional probability, i.e., $$p^{do(\tau_i)}(y^R \mid x) = p^{do(\tau_j)}(y^R \mid x)$$ where $$do(\tau)$$ is the intervention approximated by the data augmentation function $$\tau \in \mathcal{T}$$.
-
-&#13;
-
- {% include figure.html path="assets/img/2024-05-07-robust-foundation-model/AIR_invariant.png" class="img-fluid" %} -
-
-
- According to causal reasoning, the style factor $s$ should not affect $p(y^R \mid x)$. -
-
-Assuming that the path $$x \rightarrow \tilde{x} \rightarrow y^R$$ in the causal graph satisfies the Markov condition, we can obtain that
-
-$$p(y^R \mid x) = p(y^R \mid \tilde{x})p(\tilde{x} \mid x).$$
-
-Therefore, ACL should follow the style-independent criterion as follows:
-
-$$
-p^{do(\tau_i)}(y^R \mid \tilde{x}) p^{do(\tau_i)}(\tilde{x} \mid x) = p^{do(\tau_j)}(y^R \mid \tilde{x}) p^{do(\tau_j)}(\tilde{x} \mid x) \quad \forall \tau_i, \tau_j \in \mathcal{T}.
-$$
-
-The conditional probability $$p^{do(\tau_u)}(y^R \mid \tilde{x})$$ for $$u \in \{i,j\}$$ is calculated as the cosine similarity between the original data $$x$$ and the adversarial data $$\tilde{x}^u$$ normalized by the softmax function:
-
-$$
-p^{do(\tau_u)}(y^R \mid \tilde{x}) = \frac{e^{\mathrm{sim} \left(f_\theta(x), f_\theta(\tilde{x}^u) \right)/t}}
-{\sum\limits_{x_k \in B} e^{\mathrm{sim} \left( f_\theta(x_k), f_\theta(\tilde{x}_k^u) \right)/t}}.
-$$
-
-Note that $$y^R$$ is only decided by the content factor $$c$$. Empirically, the content factor $$c$$ can be approximated by the original data $$x$$ from the datasets.
-
-The conditional probability $$p^{do(\tau_u)}(\tilde{x} \mid x)$$ for $$u \in \{i,j\}$$ is calculated as the cosine similarity between the natural data $$x^u$$ and the adversarial data $$\tilde{x}^u$$ normalized by the softmax function:
-
-$$
-p^{do(\tau_u)}(\tilde{x} \mid x) = \frac{e^{\mathrm{sim} \left(f_\theta(\tilde{x}^u), f_\theta(x^u) \right)/t}}
-{\sum\limits_{x_k \in B} e^{\mathrm{sim} \left( f_\theta(\tilde{x}_k^u), f_\theta(x_k^u) \right)/t}}.
-$$
-
-
-**The loss function of AIR.**
-
-To achieve the style-invariant criterion, AIR is proposed to regulate the representations to be style-independent as follows:
-
-$$
-\mathcal{L}_\mathrm{AIR}(B;\theta, \epsilon) = \mathrm{KL}\left(p^{do(\tau_i)}(y^R \mid \tilde{x}) p^{do(\tau_i)}(\tilde{x} \mid x)
-\| p^{do(\tau_j)}(y^R \mid \tilde{x}) p^{do(\tau_j)}(\tilde{x} \mid x) ; B \right),
-$$
-
-in which $$\epsilon \geq 0$$ is the adversarial budget, $$B$$ is a mini-batch, and
-$$\mathrm{KL}(p(x) \| q(x); B) = \sum_{x \in B} p(x) \log \frac{p(x)}{q(x)}$$ denotes the Kullback–Leibler (KL) divergence.
-
-We provide an illustration of AIR for ACL. AIR aims to maximize the agreements between the original data and the adversarial view (the dashed yellow lines) and the agreements between the natural view and the adversarial view (the dashed pink lines).
-
-&#13;
-
- {% include figure.html path="assets/img/2024-05-07-robust-foundation-model/AIR_understand.png" class="img-fluid" %} -
-
-
- Intuitively, AIR aims to maximize the agreements among different natural views, different adversarial views, and original data. -
-
-**Learning objective of AIR-enhanced ACL.**
-
-The learning objective of AIR-enhanced ACL is formulated as follows:
-
-$$
-\mathop{\arg\min}_{\theta} \sum_{x \in U} \ell_\mathrm{ACL}(x; \theta) + \lambda_1 \cdot \mathcal{L}_\mathrm{AIR}(U;\theta,0) + \lambda_2 \cdot \mathcal{L}_\mathrm{AIR}(U;\theta,\epsilon),
-$$
-
-where $$\lambda_1 \geq 0$$ and $$\lambda_2 \geq 0$$ are two hyper-parameters.
-
-The official code of AIR is available at [https://github.com/GodXuxilie/Enhancing_ACL_via_AIR](https://github.com/GodXuxilie/Enhancing_ACL_via_AIR).
-&#13;
Click here to see the PyTorch code for calculating the AIR loss. You can copy and paste it to conveniently calculate the AIR loss.
-{% highlight python %}
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-class AIR(nn.Module):
-
-    def __init__(self, normalize=True, temperature=0.5):
-        super(AIR, self).__init__()
-        self.normalize = normalize
-        self.temperature = temperature
-
-    def forward(self, zi, zj, zi_adv, zj_adv, z_orig, weight=0.5, lambda1=0.5, lambda2=0.5):
-        # zi: the representation of natural data x^i.
-        # zj: the representation of natural data x^j.
-        # zi_adv: the representation of adversarial data \tilde{x}^i.
-        # zj_adv: the representation of adversarial data \tilde{x}^j.
-        # z_orig: the representation of original data x.
-
-        bs = zi.shape[0]
-        labels = torch.zeros((2*bs,)).long().to(zi.device)
-        mask = torch.ones((bs, bs), dtype=bool).fill_diagonal_(0)
-
-        zi_norm = F.normalize(zi, p=2, dim=-1) if self.normalize else zi
-        zj_norm = F.normalize(zj, p=2, dim=-1) if self.normalize else zj
-        zi_adv_norm = F.normalize(zi_adv, p=2, dim=-1) if self.normalize else zi_adv
-        zj_adv_norm = F.normalize(zj_adv, p=2, dim=-1) if self.normalize else zj_adv
-        zo_norm = F.normalize(z_orig, p=2, dim=-1) if self.normalize else z_orig
-
-        ### Adversarial Contrastive Loss ###
-        logits_ii = torch.mm(zi_norm, zi_norm.t()) / self.temperature
-        logits_ij = torch.mm(zi_norm, zj_norm.t()) / self.temperature
-        logits_ji = torch.mm(zj_norm, zi_norm.t()) / self.temperature
-        logits_jj = torch.mm(zj_norm, zj_norm.t()) / self.temperature
-
-        logits_ij_pos = logits_ij[torch.logical_not(mask)]
-        logits_ji_pos = logits_ji[torch.logical_not(mask)]
-        logits_ii_neg = logits_ii[mask].reshape(bs, -1)
-        logits_ij_neg = logits_ij[mask].reshape(bs, -1)
-        logits_ji_neg = logits_ji[mask].reshape(bs, -1)
-        logits_jj_neg = logits_jj[mask].reshape(bs, -1)
-
-        pos = torch.cat((logits_ij_pos, logits_ji_pos), dim=0).unsqueeze(1)
-        neg_i = torch.cat((logits_ii_neg, logits_ij_neg), dim=1)
-        neg_j = torch.cat((logits_ji_neg, logits_jj_neg), dim=1)
-        neg = torch.cat((neg_i, neg_j), dim=0)
-
-        logits = torch.cat((pos, neg), dim=1)
-        nat_contrastive_loss = F.cross_entropy(logits, labels)
-
-        logits_ii_adv = torch.mm(zi_adv_norm, zi_adv_norm.t()) / self.temperature
-        logits_ij_adv = torch.mm(zi_adv_norm, zj_adv_norm.t()) / self.temperature
-        logits_ji_adv = torch.mm(zj_adv_norm, zi_adv_norm.t()) / self.temperature
-        logits_jj_adv = torch.mm(zj_adv_norm, zj_adv_norm.t()) / self.temperature
-
-        logits_ij_pos_adv = logits_ij_adv[torch.logical_not(mask)]
-        logits_ji_pos_adv = logits_ji_adv[torch.logical_not(mask)]
-        logits_ii_neg_adv = logits_ii_adv[mask].reshape(bs, -1)
-        logits_ij_neg_adv = logits_ij_adv[mask].reshape(bs, -1)
-        logits_ji_neg_adv = logits_ji_adv[mask].reshape(bs, -1)
-        logits_jj_neg_adv = logits_jj_adv[mask].reshape(bs, -1)
-
-        pos_adv = torch.cat((logits_ij_pos_adv, logits_ji_pos_adv), dim=0).unsqueeze(1)
-        neg_i_adv = torch.cat((logits_ii_neg_adv, logits_ij_neg_adv), dim=1)
-        neg_j_adv = torch.cat((logits_ji_neg_adv, logits_jj_neg_adv), dim=1)
-        neg_adv = torch.cat((neg_i_adv, neg_j_adv), dim=0)
-
-        logits_adv = torch.cat((pos_adv, neg_adv), dim=1)
-        adv_contrastive_loss = F.cross_entropy(logits_adv, labels)
-
-        ### Adversarial Invariant Regularization ###
-        logits_io = torch.mm(zi_norm, zo_norm.t()) / self.temperature
-        logits_jo = torch.mm(zj_norm, zo_norm.t()) / self.temperature
-        probs_io_zi = F.softmax(logits_io[torch.logical_not(mask)], -1)
-        probs_jo_zj = F.log_softmax(logits_jo[torch.logical_not(mask)], -1)
-        AIR_standard = F.kl_div(probs_io_zi, probs_jo_zj, log_target=True, reduction="sum")
-
-        logits_io = torch.mm(zi_adv_norm, zi_norm.t()) / self.temperature
-        logits_jo = torch.mm(zj_adv_norm, zj_norm.t()) / self.temperature
-        probs_io_zi_adv_consis = F.softmax(logits_io[torch.logical_not(mask)], -1)
-        probs_jo_zj_adv_consis = F.softmax(logits_jo[torch.logical_not(mask)], -1)
-
-        logits_io = torch.mm(zi_adv_norm, zo_norm.t()) / self.temperature
-        logits_jo = torch.mm(zj_adv_norm, zo_norm.t()) / self.temperature
-        probs_io_zi_adv = F.softmax(logits_io[torch.logical_not(mask)], -1)
-        probs_jo_zj_adv = F.softmax(logits_jo[torch.logical_not(mask)], -1)
-
-        probs_io_zi_adv = torch.mul(probs_io_zi_adv, probs_io_zi_adv_consis)
-        probs_jo_zj_adv = torch.mul(probs_jo_zj_adv, probs_jo_zj_adv_consis)
-        AIR_robust = F.kl_div(probs_io_zi_adv, torch.log(probs_jo_zj_adv), log_target=True, reduction="sum")
-
-        return (1 - weight) * nat_contrastive_loss + (1 + weight) * adv_contrastive_loss + lambda1 * AIR_standard + lambda2 * AIR_robust
-{% endhighlight %}
-&#13;
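-
-Analogously to the `ACL` example above, the snippet below sketches how the `AIR` module might be called on random tensors; the shapes and the `lambda1`/`lambda2` values (which play the role of $$\lambda_1$$ and $$\lambda_2$$ in the learning objective above) are illustrative placeholders.
-{% highlight python %}
-import torch
-
-# Assumes the AIR class defined above is in scope.
-criterion = AIR(normalize=True, temperature=0.5)
-bs, dim = 32, 128  # placeholder batch size and feature dimension
-zi, zj = torch.randn(bs, dim), torch.randn(bs, dim)          # two natural views
-zi_adv, zj_adv = torch.randn(bs, dim), torch.randn(bs, dim)  # their adversarial views
-z_orig = torch.randn(bs, dim)                                # original (unaugmented) data
-loss = criterion(zi, zj, zi_adv, zj_adv, z_orig, weight=0.5, lambda1=0.5, lambda2=0.5)
-{% endhighlight %}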
-
-Besides, you can use the following script to conduct robust self-supervised pre-training via AIR using ResNet-18 on CIFAR-10:
-{% highlight bash %}
-# Pre-training stage via AIR
-git clone https://github.com/GodXuxilie/Enhancing_ACL_via_AIR.git
-cd Enhancing_ACL_via_AIR
-PRE_TRAIN_DIR=AIR_ResNet18_cifar10
-python pretraining.py $PRE_TRAIN_DIR --dataset cifar10 --model r18 --DynAug
-{% endhighlight %}
-
-
-### Empirical Results
-
-**AIR yields state-of-the-art cross-task robustness transferability against adversarial attacks.**
-
- - $$\mathcal{D}_1 \rightarrow \mathcal{D}_2$$ denotes that the model is pre-trained on dataset $$\mathcal{D}_1$$ and fine-tuned on the downstream dataset $$\mathcal{D}_2$$.
- - `SA` refers to the standard accuracy calculated as the average accuracy on the natural test data in the downstream dataset $$\mathcal{D}_2$$.
- - `AA` refers to the robust accuracy calculated as the average accuracy on the adversarial test data generated via [adversarial attacks](https://github.com/fra31/auto-attack) in the downstream dataset $$\mathcal{D}_2$$.
-
-&#13;
-
- {% include figure.html path="assets/img/2024-05-07-robust-foundation-model/AIR_cross_attack.png" class="img-fluid" %} -
-
-
-
-**AIR yields state-of-the-art cross-task robustness transferability against common corruptions.**
-
-`CS-#` refers to the average accuracy evaluated on the test data under common corruptions with corruption severity (CS) of `#` $$\in$$ \{1,3,5\} in the downstream dataset $$\mathcal{D}_2$$.
-&#13;
-
- {% include figure.html path="assets/img/2024-05-07-robust-foundation-model/AIR_cross_corrup.png" class="img-fluid" %} -
-
-
-To reproduce the above results of the transferability from CIFAR-10 to CIFAR-100, you can use the following scripts.
-
-- At the pre-training stage, you can conduct AIR using ResNet-18 on CIFAR-10.
-{% highlight bash %}
-# Pre-training stage using AIR
-git clone https://github.com/GodXuxilie/Enhancing_ACL_via_AIR.git
-cd Enhancing_ACL_via_AIR
-PRE_TRAIN_DIR=AIR_ResNet18_cifar10
-python pretraining.py $PRE_TRAIN_DIR --dataset cifar10 --model r18 --DynAug
-{% endhighlight %}
-
-- At the fine-tuning stage, you can fine-tune the pre-trained ResNet-18 to the downstream task CIFAR-100. During the fine-tuning stage, the following script will automatically conduct all three fine-tuning modes (i.e., SLF, ALF, and AFF). After the fine-tuning stage, you can check the standard accuracy and the robust accuracy under adversarial attacks and common corruptions for each fine-tuning mode in the log file at `$FINETUNE_DIR/results/log.txt`.
-
-{% highlight bash %}
-# Fine-tuning stage
-cd Enhancing_ACL_via_AIR
-PRE_TRAIN_DIR=AIR_ResNet18_cifar10
-FINETUNE_DIR=AIR_ResNet18_cifar10_cifar100
-python finetuning.py --experiment $FINETUNE_DIR \
-                     --checkpoint ./checkpoints/$PRE_TRAIN_DIR/model.pt \
-                     --dataset cifar100 \
-                     --model r18 \
-                     --mode ALL \
-                     --eval-AA --eval-OOD --pretraining DynACL_AIR
-{% endhighlight %}
-
-
-### Robust Self-Supervised Learning (RobustSSL) Benchmark
-
-The website of the RobustSSL Benchmark is at https://robustssl.github.io/.
-
-**AIR ranks FIRST in [RobustSSL Benchmark](https://robustssl.github.io/)!** For more information regarding the leaderboards, please check the website of [RobustSSL Benchmark](https://robustssl.github.io/).
-
-&#13;
-
- {% include figure.html path="assets/img/2024-05-07-robust-foundation-model/leaderboard.png" class="img-fluid" %} -
-
-
- A screenshot of the leaderboard shown in RobustSSL Benchmark. -
-
-
-## Efficient ACL via Robustness-Aware Coreset Selection (RCS)
-
-Here, we introduce the NeurIPS 2023 spotlight paper which proposes Robustness-Aware Coreset Selection (RCS) that selects an informative coreset without label annotations to speed up ACL. Theoretically, Xu et al. (2023) show that a greedy search algorithm can efficiently find the coreset. Empirically, RCS can speed up both ACL and supervised robust pre-training by a large margin on CIFAR and ImageNet-1K datasets without significantly hurting the robustness transferability. This paper, for the first time, provides a proof of concept that ACL can be applied to large-scale datasets.
-
-### Motivation---ACL is Inefficient
-
-ACL is computationally prohibitive on large-scale datasets since generating adversarial data requires expensive computational overheads.
-
-Empirically, ACL on the entire ImageNet-1K dataset (1,281,167 training data points) requires about **650 hours** evaluated on RTX A5000 GPUs.
-Due to this inefficiency, ACL had not yet been applied to ImageNet-1K without RCS.
-
-&#13;
-
- {% include figure.html path="assets/img/2024-05-07-robust-foundation-model/PGD.png" class="img-fluid" width="100" height="100" %} -
-
-
- ACL is inefficient because $T$ PGD steps require expensive computational overheads. -
-
-### The Methodology of RCS
-
-**Intuition of RCS.**
-
-To speed up ACL, RCS is built on an intuitive idea: finding an informative training subset (called a "coreset"). The coreset directly decreases the number of training samples, thus significantly accelerating ACL. Besides, since the coreset is informative for improving $$f$$'s adversarial robustness, training on it should still enable ACL to output an effective robust foundation model.
-
-&#13;
-
- {% include figure.html path="assets/img/2024-05-07-robust-foundation-model/intuition.png" class="img-fluid" %} -
-
-
- RCS generates an informative coreset to make ACL efficiently obtain an effective robust foundation model. Image from https://medium.com/analytics-vidhya/sampling-statistical-approach-in-machine-learning-4903c40ebf86.
-&#13;
-
-**Representational Distance (RD) as a measurement of $$f$$'s adversarial robustness without labels.**
-
-The RD of a data point $$\ell_\mathrm{RD}(x;\theta)$$ is quantified by the representational distance between the natural data and its adversarial counterpart, i.e.,
-
-$$\ell_{\mathrm{RD}}(x; \theta) = d(g \circ f_\theta(\tilde{x}), g \circ f_\theta(x)) \quad \mathrm{s.t.} \quad \tilde{x} = \mathop{\arg\max}_{x^{\prime} \in \mathcal{B}_\epsilon[x]} \quad d(g \circ f_\theta(x^{\prime}), g \circ f_\theta(x)),$$
-
-in which the PGD method is used to generate adversarial data $$\tilde{x}$$ within the $$\epsilon$$-ball centered at $$x$$ and
-$$d(\cdot, \cdot): \mathcal{V} \times \mathcal{V} \rightarrow \mathbb{R}$$ is a distance function, such as the KL divergence.
-The smaller the RD is, the less sensitive the representations are to adversarial perturbations, and thus the more adversarially robust they are.
-
-**Objective function of RCS.**
-
-To realize the intuitive idea, RCS is formulated as follows:
-
-$$ S^* = \mathop{\arg\min}_{S \subseteq X, |S|/|X| = k} \mathcal{L}_{\mathrm{RD}}(U; \theta(S)),$$
-
-$$\theta(S) = \mathop{\arg\min}_{\theta} \mathcal{L}_\mathrm{ACL}(S; \theta),$$
-
-in which $$S^*$$ is the coreset, $$U$$ is an unlabeled validation set, $$k \in (0,1]$$ is the subset fraction that controls the size of the coreset, $$ \mathcal{L}_{\mathrm{RD}}(U; \theta(S)) = \sum_{x \in U} \ell_\mathrm{RD}(x; \theta(S)) $$, and $$ \mathcal{L}_\mathrm{ACL}(S; \theta) = \sum_{x \in S} \ell_\mathrm{ACL}(x; \theta) $$.
-
-Intuitively, given a coreset $$S^*$$, after the model parameters are updated to $$ \theta(S^{*}) $$ via minimizing the ACL loss on the coreset $$\mathcal{L}_\mathrm{ACL}(S^*; \theta)$$, the model will achieve the minimized RD loss on the validation dataset $$\mathcal{L}_{\mathrm{RD}}(U; \theta(S^*))$$, thus being adversarially robust.
-
-Then, RCS can be converted into a problem of maximizing a set function subject to a cardinality constraint as follows:
-
-$$S^* = \mathop{\arg\max}_{S \subseteq X, |S|/|X| = k} G_\theta(S),$$
-
-$$G_\theta(S \subseteq X) \triangleq - \mathcal{L}_\mathrm{RD}(U; \theta(S)) = - \mathcal{L}_\mathrm{RD}(U; \theta - \eta \nabla_\theta \mathcal{L}_\mathrm{ACL}(S; \theta)),$$
-
-where $$G:2^\mathcal{X} \rightarrow \mathbb{R}$$ is a set function, $$\theta(S)$$ is estimated using the one-step approximation, and $$\eta \in \mathbb{R}^+$$ is the learning rate.
-
-**RCS via Greedy Search.**
-
-The vanilla solution of traversing all subsets and selecting the subset that has the largest $$G_\theta(S)$$ is intractable.
-Xu et al. (2023) show that the set function $$G_\theta(S)$$ satisfies the following two critical properties, which motivates a greedy search to efficiently find the coreset.
-
-The set function $$G_\theta(S)$$ is proved to be submodular (in reality, the authors of RCS rigorously proved that a proxy set function is weakly submodular, and that the greedy search algorithm provides a guaranteed lower bound for the proposed set function maximization problem based on this weakly submodular proxy set function; for more details, please refer to the paper of RCS), which satisfies the following two properties:
-
-- Monotonicity: As more data is added to the set, the representation becomes better.
-  $$G_\theta(x \mid S) = G_\theta(S \cup \{x\}) - G_\theta(S) \geq 0$$ for any $$S \subseteq X$$ and $$x \in X \setminus S$$.
-- Diminishing returns: As the set has more data, the marginal gain of extra data for learning representations gradually diminishes.
-  $$G_\theta(x \mid A) \geq G_\theta(x \mid B)$$ for any $$A \subseteq B \subseteq X$$ and $$x \in X \setminus B$$.
-
-Therefore, RCS greedily searches for the data $$x$$ that has the largest marginal gain and adds it to the coreset.
-
-
-**Pseudo-code of efficient ACL via RCS.**
-
-- Step 1 (Warm-up): Warm up training on the entire training set to find a better starting point $$f_\theta$$.
-- **Step 2.1 (RCS)**: $$S \gets \emptyset$$. $$\theta' \gets \theta$$. Compute gradients $$ Q \gets \{ q_k = \nabla_\theta \mathcal{L}_\mathrm{ACL}(x_k; \theta) \mid \forall x_k \in X \}$$ on the unlabeled training dataset $$X$$.
-- **Step 2.2 (RCS)**: Compute gradients $$q_U \gets \nabla_\theta \mathcal{L}_\mathrm{RD}(U; \theta')$$ on the unlabeled validation dataset $$U$$.
-- **Step 2.3 (RCS)**: Select a data point $$x_k$$ whose gradient $$q_k$$ matches best with $$q_U$$, i.e., $$\mathop{\arg\max}_k \{q_k^\top q_U \}$$.
-- **Step 2.4 (RCS)**: $$S \gets S \cup \{x_k\}$$, $$X \gets X \setminus \{ x_k \}$$, $$\theta' \gets \theta' - \eta' q_k$$.
-- **Step 2.5 (RCS)**: Repeat Steps 2.2-2.4 until $$|S|/|X| = k$$.
-- Step 3 (ACL training): Update the parameters $$\theta \gets \theta - \eta \nabla_\theta \mathcal{L}_\mathrm{ACL}(S; \theta)$$.
-- Step 4: Every few epochs, go to Step 2.1 to generate a new coreset; otherwise, go to Step 3 to update the model parameters. The algorithm stops when reaching the final training epoch. A code sketch of the greedy selection loop is given after the pipeline figure below.
-
-
-&#13;
-
- {% include figure.html path="assets/img/2024-05-07-robust-foundation-model/RCS_algo.png" class="img-fluid" %} -
-
-
- A pipeline of efficient ACL via RCS. After the warm-up period, the model is trained on the coreset. Thus, RCS makes the training procedure much more efficient by decreasing the number of training samples.
-&#13;
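-
-To make the greedy search concrete, here is a highly simplified, self-contained sketch of Steps 2.1-2.5; all function and variable names are hypothetical, and a faithful implementation (e.g., the batch-wise selection and gradient approximations used in the official code) is more involved.
-{% highlight python %}
-import torch
-
-def rcs_greedy_select(theta, train_grad_fn, val_grad_fn, candidates, frac, lr=0.01):
-    # theta:         flat parameter vector of the current model.
-    # train_grad_fn: returns q_k, the ACL-loss gradient of candidate x_k at theta (Step 2.1).
-    # val_grad_fn:   returns q_U, the RD-loss gradient on the validation set at theta' (Step 2.2).
-    # candidates:    the unlabeled training set X; frac is the coreset fraction k.
-    coreset, remaining = [], list(range(len(candidates)))
-    grads = {k: train_grad_fn(candidates[k], theta) for k in remaining}    # Step 2.1
-    for _ in range(int(frac * len(candidates))):
-        q_U = val_grad_fn(theta)                                           # Step 2.2
-        # Step 2.3: the largest marginal gain corresponds to the best gradient match q_k^T q_U.
-        best = max(remaining, key=lambda k: float(grads[k] @ q_U))
-        coreset.append(candidates[best])                                   # Step 2.4
-        remaining.remove(best)
-        theta = theta - lr * grads[best]                                   # one-step update of theta'
-    return coreset
-
-# Toy demo with made-up gradients (purely illustrative):
-D, N = 4, 10
-gs = [torch.randn(D) for _ in range(N)]
-coreset = rcs_greedy_select(
-    theta=torch.zeros(D),
-    train_grad_fn=lambda x, th: gs[x],          # pretend per-example gradient
-    val_grad_fn=lambda th: torch.ones(D) - th,  # pretend validation gradient
-    candidates=list(range(N)), frac=0.3)
-print(coreset)
-{% endhighlight %}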
-
-Intuitively, RCS greedily selects and adds into the coreset the data $$x$$ whose training loss gradient (i.e., $$\nabla_\theta\mathcal{L}_\mathrm{ACL}(\{x\}; \theta)$$) and validation loss gradient (i.e., $$\nabla_\theta\mathcal{L}_\mathrm{RD}(U; \theta(S))$$) have the highest similarity. In this way, training on the data selected by RCS is most beneficial in optimizing the RD loss, which is thus most helpful for improving $$f$$'s adversarial robustness.
-
-The official code of RCS is available at [https://github.com/GodXuxilie/Efficient_ACL_via_RCS](https://github.com/GodXuxilie/Efficient_ACL_via_RCS).
-
-### Experimental Results
-
-
-**RCS significantly speeds up ACL on CIFAR-10.**
-- The term `speed-up ratio` refers to the ratio of the time consumption of pre-training on the entire training set to the time consumption of pre-training on the training subset. Thus, the larger the speed-up ratio is, the more efficient the pre-training procedure is.
-- The terms `standard test accuracy` and `robust test accuracy` refer to the average accuracy evaluated on natural test data and adversarial test data, respectively. Thus, the higher the line is, the more effective the pre-training method is.
-
-The results obtained by RCS are located in the upper-right corner, which indicates that RCS is both more efficient and more effective.
-
-
-&#13;
-
- {% include figure.html path="assets/img/2024-05-07-robust-foundation-model/RCS_exp1.png" class="img-fluid" %} -
-
-
-To reproduce the above results of the robustness transferability from CIFAR-10 to CIFAR-100, you can use the following scripts.
-
-- At the pre-training stage, you can conduct ACL via RCS using ResNet-18 on CIFAR-10.
-
-{% highlight bash %}
-# Pre-training stage using RCS
-git clone https://github.com/GodXuxilie/Efficient_ACL_via_RCS.git
-cd Efficient_ACL_via_RCS/ACL_RCS/small_scale_datasets
-PRE_TRAIN_DIR=ACL_RCS_ResNet18_cifar10
-python DynACL_RCS.py $PRE_TRAIN_DIR --ACL_DS --dataset cifar10 --fraction 0.2
-{% endhighlight %}
-
-- At the fine-tuning stage, you can fine-tune the pre-trained ResNet-18 on CIFAR-100. The test accuracies are saved in `$FINETUNE_DIR/results/log.txt`.
-{% highlight bash %}
-# Fine-tuning stage (SLF, ALF, AFF)
-cd Efficient_ACL_via_RCS/ACL_RCS/small_scale_datasets
-PRE_TRAIN_DIR=ACL_RCS_ResNet18_cifar10
-FINETUNE_DIR=ACL_RCS_ResNet18_cifar10_cifar100
-python finetuning.py --experiment $FINETUNE_DIR \
-                     --checkpoint ./checkpoints/$PRE_TRAIN_DIR/model.pt \
-                     --dataset cifar100 \
-                     --model r18 \
-                     --mode ALL --eval-AA --eval-OOD --pretraining DynACL_RCS
-{% endhighlight %}
-
-
-**For the first time, ACL was conducted efficiently on ImageNet-1K via RCS.**
-The results prove the possibility of applying ACL on large-scale datasets. Here, `SA` refers to the standard test accuracy and `RA` refers to the robust test accuracy.
-
-&#13;
-
- {% include figure.html path="assets/img/2024-05-07-robust-foundation-model/RCS_exp2.png" class="img-fluid" %} -
-
-
-To reproduce the above results of the robustness transferability from ImageNet-1K to CIFAR-10, you can use the following scripts.
-- At the pre-training stage, you can conduct ACL via RCS using Wide ResNet with width 10 and depth 28 (WRN-28-10) on ImageNet-1K of $$32 \times 32$$ resolution.
-
-{% highlight bash %}
-# Pre-training stage using RCS
-git clone https://github.com/GodXuxilie/Efficient_ACL_via_RCS.git
-cd Efficient_ACL_via_RCS/ACL_RCS/ImageNet_32
-PRE_TRAIN_DIR=ACL_RCS_WRN_ImageNet
-python ACL_RCS.py $PRE_TRAIN_DIR --gpu 0,1,2,3 --ACL_DS --fraction 0.05
-{% endhighlight %}
-
-- At the fine-tuning stage, you can fine-tune the ImageNet-1K pre-trained models on CIFAR-10.
-{% highlight bash %}
-cd Efficient_ACL_via_RCS/ACL_RCS/ImageNet_32
-PRE_TRAIN_DIR=ACL_RCS_WRN_ImageNet
-FINETUNE_DIR=ACL_RCS_WRN_ImageNet_cifar10
-# Fine-tuning stage (SLF)
-python transfer.py --out_dir $FINETUNE_DIR/SLF \
-                   --resume $PRE_TRAIN_DIR/model.pt \
-                   --dataset cifar10 \
-                   --lr 0.01 --linear
-# Fine-tuning stage (ALF)
-python adv_tune.py --out_dir $FINETUNE_DIR/ALF \
-                   --resume $PRE_TRAIN_DIR/model.pt \
-                   --dataset cifar10 \
-                   --lr 0.1 --linear
-# Fine-tuning stage (AFF)
-python adv_tune.py --out_dir $FINETUNE_DIR/AFF \
-                   --resume $PRE_TRAIN_DIR/model.pt \
-                   --dataset cifar10 \
-                   --lr 0.1
-{% endhighlight %}
-
-**RCS can speed up Standard Adversarial Training (SAT) on ImageNet-1K.** The results show that RCS is applicable to robust pre-training in the supervised setting.
-
-&#13;
-
- {% include figure.html path="assets/img/2024-05-07-robust-foundation-model/RCS_exp3.png" class="img-fluid" %} -
-
-
-To reproduce the above results of the robustness transferability from ImageNet-1K to CIFAR-10, you can use the following scripts.
-
-- At the pre-training stage, you can conduct SAT using WRN-28-10 on ImageNet-1K of $$32 \times 32$$ resolution.
-{% highlight bash %}
-git clone https://github.com/GodXuxilie/Efficient_ACL_via_RCS.git
-cd Efficient_ACL_via_RCS/SAT_RCS/ImageNet_32
-# Pre-training stage using RCS
-PRE_TRAIN_DIR=SAT_RCS_WRN_ImageNet
-nohup python SAT_RCS.py --gpu 0,1,2,3 --out_dir $PRE_TRAIN_DIR --fraction 0.2
-{% endhighlight %}
-
-- At the fine-tuning stage, you can fine-tune the ImageNet-1K pre-trained WRN-28-10 on CIFAR-10.
-{% highlight bash %}
-cd Efficient_ACL_via_RCS/SAT_RCS/ImageNet_32
-PRE_TRAIN_DIR=SAT_RCS_WRN_ImageNet
-FINETUNE_DIR=SAT_RCS_WRN_ImageNet_cifar10
-# Fine-tuning stage (ALF)
-python adv_tune.py --out_dir $FINETUNE_DIR/ALF \
-                   --resume $PRE_TRAIN_DIR/checkpoint.pth.tar \
-                   --dataset cifar10 \
-                   --lr 0.1 \
-                   --linear
-# Fine-tuning stage (AFF)
-python adv_tune.py --out_dir $FINETUNE_DIR/AFF \
-                   --resume $PRE_TRAIN_DIR/checkpoint.pth.tar \
-                   --dataset cifar10 \
-                   --lr 0.1
-{% endhighlight %}
-
-
-
diff --git a/_posts/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo.md b/_posts/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo.md
deleted file mode 100644
index b3a6e865..00000000
--- a/_posts/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo.md
+++ /dev/null
@@ -1,680 +0,0 @@
----
-layout: distill
-title: The N Implementation Details of RLHF with PPO
-description: Reinforcement Learning from Human Feedback (RLHF) is pivotal in the modern application of language modeling, as exemplified by ChatGPT. This blog post delves into an in-depth exploration of RLHF, attempting to reproduce the results from OpenAI's inaugural RLHF paper, published in 2019. Our detailed examination provides valuable insights into the implementation details of RLHF, which often go unnoticed.
-
-date: 2024-05-07
-future: true
-htmlwidgets: true
-
-# Anonymize when submitting
-# authors:
-#   - name: Anonymous
-
-
-authors:
-  - name: Shengyi Costa Huang
-    affiliations:
-      name: Hugging Face
-  - name: Tianlin Liu
-    affiliations:
-      name: University of Basel
-  - name: Leandro von Werra
-    affiliations:
-      name: Hugging Face
-
-
-# authors:
-#   - name: Albert Einstein
-#     url: "https://en.wikipedia.org/wiki/Albert_Einstein"
-#     affiliations:
-#       name: IAS, Princeton
-#   - name: Boris Podolsky
-#     url: "https://en.wikipedia.org/wiki/Boris_Podolsky"
-#     affiliations:
-#       name: IAS, Princeton
-#   - name: Nathan Rosen
-#     url: "https://en.wikipedia.org/wiki/Nathan_Rosen"
-#     affiliations:
-#       name: IAS, Princeton

-# must be the exact same name as your blogpost
-bibliography: 2024-05-07-the-n-implementation-details-of-rlhf-with-ppo.bib

-# Add a table of contents to your post.
-#   - make sure that TOC names match the actual section names
-#     for hyperlinks within the post to work correctly.
-#   - please use this format rather than manually creating a markdown table of contents.
-toc:
-  - name: Matching Learning Curves
-  - name: General Implementation Details
-  - name: Reward Model Implementation Details
-  - name: Policy Training Implementation Details
-  - name: PyTorch Adam optimizer numerical issues w.r.t RLHF
-  - name: Limitations
-  - name: Conclusion

-# Below is an example of injecting additional post-specific styles.
-# This is used in the 'Layouts' section of this post.
-# If you use this post as a template, delete this _styles block.
-_styles: >
-  .fake-img {
-    background: #bbb;
-    border: 1px solid rgba(0, 0, 0, 0.1);
-    box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
-    margin-bottom: 12px;
-  }
-  .fake-img p {
-    font-family: monospace;
-    color: white;
-    text-align: left;
-    margin: 12px 0;
-    text-align: center;
-    font-size: 16px;
-  }
----
-
-
-**Reinforcement Learning from Human Feedback** (RLHF) has been an impactful technique for training modern language models such as ChatGPT. In our quest to research more on RLHF, this blog post closely examines OpenAI’s inaugural RLHF paper published in 2019 together with its open-source codebase available at [*openai/lm-human-preferences*](https://github.com/openai/lm-human-preferences). Despite being based on TensorFlow-1, the codebase released by OpenAI is very well evaluated and benchmarked, making it a good place to study RLHF implementation engineering details.
-
-We aim to:
-
-1. reproduce OpenAI’s results in stylistic tasks and match the learning curves of [*openai/lm-human-preferences*](https://github.com/openai/lm-human-preferences), using the modern PyTorch and JAX frameworks in conjunction with HuggingFace Transformers that are predominantly used by the open-source community nowadays;
-2. present a checklist of implementation details, similar to the spirit of [*The 37 Implementation Details of Proximal Policy Optimization*](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/) and [*Debugging RL, Without the Agonizing Pain*](https://andyljones.com/posts/rl-debugging.html);
-3. provide a simple-to-read and minimal reference implementation of RLHF.
-
-This work is just for educational / learning purposes. For advanced users requiring more features, such as running larger models with parameter-efficient fine-tuning, [*huggingface/trl*](https://github.com/huggingface/trl) would be a great choice.
-
-- In [Matching Learning Curves](#matching-learning-curves), we show our main contribution: creating a codebase that can reproduce OpenAI’s results in the stylistic tasks and match the learning curves of [*openai/lm-human-preferences*](https://github.com/openai/lm-human-preferences) very closely.
-- We then take a technical deep dive into the implementation details that are relevant to reproducing OpenAI’s work. In [General Implementation Details](#general-implementation-details), we talk about basic details, such as how rewards/values are generated and how responses are generated. In [Reward Model Implementation Details](#reward-model-implementation-details), we talk about details such as reward normalization. In [Policy Training Implementation Details](#policy-training-implementation-details), we discuss details such as rejection sampling and reward “whitening”.
-  - In [**PyTorch Adam optimizer numerical issues w.r.t RLHF**](#pytorch-adam-optimizer-numerical-issues-wrt-rlhf), we highlight a very interesting implementation difference in Adam between TensorFlow and PyTorch, which causes an aggressive update in the model training.
-- Next, we examine the effect of training different base models (e.g., gpt2-xl, falcon-1b) given that the reward labels are produced with `gpt2-large`.
-- Finally, we conclude our work with limitations and discussions.
- - - - -Here are the important links: - -- 💾 [Our reproduction codebase](https://github.com/vwxyzjn/lm-human-preference-details) -- 🤗 [Demo of RLHF model comparison](https://huggingface.co/spaces/lm-human-preference-details/rlhf-demo) -- 🐝 [All w&b training logs](https://wandb.ai/openrlbenchmark/lm_human_preference_details) - -# Matching Learning Curves - -Our main contribution is to reproduce OpenAI’s results in stylistic tasks, such as sentiment and descriptiveness. As shown in the figure below, our codebase (orange curves) can produce nearly identical learning curves as OpenAI’s codebase (blue curves). - - -
-{% include figure.html path="assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/curve-matching.png" class="img-fluid" %} -
-
-
-## A note on running openai/lm-human-preferences
-
-To make a direct comparison, we ran the original RLHF code at [*openai/lm-human-preferences*](https://github.com/openai/lm-human-preferences), which will offer valuable metrics to help validate and diagnose our reproduction. We were able to set the original TensorFlow 1.x code up, but it requires a hyper-specific setup:
-
-- OpenAI’s datasets were partially corrupted/lost (so we replaced them with similar HF datasets, which may or may not cause a performance difference)
-    - Specifically, its book dataset was lost during OpenAI’s GCP - Azure migration ([https://github.com/openai/lm-human-preferences/issues/17#issuecomment-1044051496](https://github.com/openai/lm-human-preferences/issues/17#issuecomment-1044051496)). We replaced the book dataset with Hugging Face’s `bookcorpus` dataset, which is, in principle, what OpenAI used.
-- It can’t run on 1 V100 because it doesn’t implement gradient accumulation. Instead, it uses a large batch size and splits the batch across 8 GPUs, and will OOM on just 1 GPU.
-- It can’t run on 8x A100 because it uses TensorFlow 1.x, which is incompatible with Cuda 8+.
-- It can’t run on 8x V100 (16GB) because it will OOM.
-- It can only run on 8x V100 (32GB), which is only offered by AWS as the `p3dn.24xlarge` instance.
-
-# General Implementation Details
-
-We now take a technical deep dive into the implementation details that are relevant to reproducing OpenAI’s work. In this section, we talk about basic details, such as how rewards/values are generated and how responses are generated. Here are these details in no particular order:
-
-1. **The reward model and policy’s value head take input as the concatenation of `query` and `response`**
-    1. The reward model and policy’s value head do *not* only look at the response. Instead, they concatenate the `query` and `response` together as `query_response` ([lm_human_preferences/rewards.py#L105-L107](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/rewards.py#L105-L107)).
-    2. So, for example, if `query = "he was quiet for a minute, his eyes unreadable"`, and the `response = "He looked at his left hand, which held the arm that held his arm out in front of him."`, then the reward model and policy’s value head do a forward pass on `query_response = "he was quiet for a minute, his eyes unreadable. He looked at his left hand, which held the arm that held his arm out in front of him."` and produce rewards and values of shape `(B, T, 1)`, where `B` is the batch size, `T` is the sequence length, and `1` is the reward head dimension of 1 ([lm_human_preferences/rewards.py#L105-L107](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/rewards.py#L105-L107), [lm_human_preferences/policy.py#L111](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/policy.py#L111)).
-    3. The `T` means that each token has a reward associated with it and its previous context. For example, the `eyes` token would have a reward corresponding to `he was quiet for a minute, his eyes`.
-2. **Pad with a special padding token and truncate inputs.**
-    1. OpenAI sets a fixed input length for the query, `query_length`; it **pads** sequences that are too short with `pad_token` ([lm_human_preferences/language/datasets.py#L66-L67](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/language/datasets.py#L66-L67)) and **truncates** sequences that are too long ([lm_human_preferences/language/datasets.py#L57](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/language/datasets.py#L57)). See [here](https://huggingface.co/docs/transformers/pad_truncation) for a general introduction to the concept. When padding the inputs, OpenAI uses a token beyond the vocabulary ([lm_human_preferences/language/encodings.py#L56](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/language/encodings.py#L56)).
-        1. **Note on HF’s transformers — padding token.** According to ([transformers#2630#issuecomment-578159876](https://github.com/huggingface/transformers/issues/2630#issuecomment-578159876)), padding tokens were not used during the pre-training of GPT and GPT-2; therefore transformers’ gpt2 models have no official padding token associated with their tokenizer. A common practice is to set `tokenizer.pad_token = tokenizer.eos_token`, but in this work, we shall distinguish these two special tokens to match OpenAI’s original setting, so we will use `tokenizer.add_special_tokens({"pad_token": "[PAD]"})`.
-
-            Note that having no padding token is a default setting for decoder models, since they train with “packing” during pretraining, which means that many sequences are concatenated and separated by the EOS token and chunks of this sequence that always have the max length are fed to the model during pretraining.
-        2. When putting everything together, here is an example
-
-            ```python
-            import transformers
-            tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2", padding_side="right")
-            tokenizer.add_special_tokens({"pad_token": "[PAD]"})
-            query_length = 5
-            texts = [
-                "usually, he would",
-                "she thought about it",
-            ]
-            tokens = []
-            for text in texts:
-                tokens.append(tokenizer.encode(text)[:query_length])
-
-            print("tokens", tokens)
-            inputs = tokenizer.pad(
-                {"input_ids": tokens},
-                padding="max_length",
-                max_length=query_length,
-                return_tensors="pt",
-                return_attention_mask=True,
-            )
-            print("inputs", inputs)
-
-            """prints are
-            tokens [[23073, 11, 339, 561], [7091, 1807, 546, 340]]
-            inputs {'input_ids': tensor([[23073,    11,   339,   561, 50257],
-                    [ 7091,  1807,   546,   340, 50257]]), 'attention_mask': tensor([[1, 1, 1, 1, 0],
-                    [1, 1, 1, 1, 0]])}
-            """
-            ```
-
-3. **Adjust position indices correspondingly for padding tokens**
-    1. When calculating the logits, OpenAI’s code works by masking out padding tokens properly. This is achieved by finding out the token indices corresponding to the padding tokens ([lm_human_preferences/language/model.py#L296-L297](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/language/model.py#L296-L297)), followed by adjusting their position indices correspondingly ([lm_human_preferences/language/model.py#L320](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/language/model.py#L320)).
-    2. For example, if the `query=[23073, 50259, 50259]` and `response=[11, 339, 561]`, where `50259` is OpenAI’s padding token, it then creates position indices as `[[0 1 1 1 2 3]]` and logits as follows. Note how the logits corresponding to the padding tokens remain the same as before! This is the effect we should be aiming for in our reproduction.
-
-        ```python
-        all_logits [[[ -35.28693   -34.2875    -38.16074  ...  -41.595802  -41.082108
-            -35.36577 ]
-          [ -35.28693   -34.2875    -38.16074  ...  -41.595802  -41.082108
-            -35.36577 ]
-          [ -35.28693   -34.2875    -38.16074  ...  -41.595802  -41.082108
-            -35.36577 ]
-          [-111.303955 -110.94471  -112.90624  ... -113.13064  -113.7788
-           -109.17345 ]
-          [-111.51512  -109.61077  -114.90231  ... -118.43514  -111.56671
-           -112.12478 ]
-          [-122.69775  -121.84468  -128.27417  ... -132.28055  -130.39604
-           -125.707756]]] (1, 6, 50257)
-        ```
-
-    3. **Note on HF’s transformers — `position_ids` and `padding_side`.** We can replicate the exact logits using Hugging Face’s transformers with 1) left padding and 2) passing in the appropriate `position_ids`:
-
-        ```python
-        import torch
-        import transformers
-        tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2", padding_side="right")
-        tokenizer.add_special_tokens({"pad_token": "[PAD]"})
-        pad_id = tokenizer.pad_token_id
-        query = torch.tensor([
-            [pad_id, pad_id, 23073],
-        ])
-        response = torch.tensor([
-            [11, 339, 561],
-        ])
-        temperature = 1.0
-
-        query = torch.tensor(query)
-        response = torch.tensor(response).long()
-        context_length = query.shape[1]
-        query_response = torch.cat((query, response), 1)
-        pretrained_model = transformers.AutoModelForCausalLM.from_pretrained("gpt2")
-        def forward(policy, query_responses, tokenizer):
-            attention_mask = query_responses != tokenizer.pad_token_id
-            position_ids = attention_mask.cumsum(1) - attention_mask.long()  # exclusive cumsum
-            input_ids = query_responses.clone()
-            input_ids[~attention_mask] = 0
-            return policy(
-                input_ids=input_ids,
-                attention_mask=attention_mask,
-                position_ids=position_ids,
-                return_dict=True,
-                output_hidden_states=True,
-            )
-        output = forward(pretrained_model, query_response, tokenizer)
-        logits = output.logits
-        logits /= temperature
-        print(logits)
-
-        """
-        tensor([[[ -26.9395,  -26.4709,  -30.0456,  ...,  -33.2208,  -33.2884,
-                   -27.4360],
-                 [ -27.1677,  -26.7330,  -30.2386,  ...,  -33.6813,  -33.6931,
-                   -27.5928],
-                 [ -35.2869,  -34.2875,  -38.1608,  ...,  -41.5958,  -41.0821,
-                   -35.3658],
-                 [-111.3040, -110.9447, -112.9062,  ..., -113.1306, -113.7788,
-                  -109.1734],
-                 [-111.5152, -109.6108, -114.9024,  ..., -118.4352, -111.5668,
-                  -112.1248],
-                 [-122.6978, -121.8447, -128.2742,  ..., -132.2805, -130.3961,
-                  -125.7078]]], grad_fn=<DivBackward0>)
-        """
-        ```
-
-    4. **Note on HF’s transformers — `position_ids` during `generate`:** during generate we should not pass in `position_ids` because the `position_ids` are already adjusted in `transformers` (see [huggingface/transformers#/7552](https://github.com/huggingface/transformers/pull/7552)).
-
-        Usually, we almost never pass `position_ids` in transformers. All the masking and shifting logic is already implemented, e.g., in the `generate` function (need permanent code link).
-4. **Response generation samples a fixed-length response without padding.**
-    1. During response generation, OpenAI uses `top_k=0, top_p=1.0` and just does categorical sampling across the vocabulary ([lm_human_preferences/language/sample.py#L43](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/language/sample.py#L43)), and the code keeps sampling until a fixed-length response is generated ([lm_human_preferences/policy.py#L103](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/policy.py#L103)). Notably, even if it encounters EOS (end-of-sequence) tokens, it will keep sampling.
-    2. **Note on HF’s transformers — sampling could stop at `eos_token`:** in `transformers`, the generation could stop at `eos_token` ([src/transformers/generation/utils.py#L2248-L2256](https://github.com/huggingface/transformers/blob/67b85f24def79962ce075353c2627f78e0e53e9f/src/transformers/generation/utils.py#L2248-L2256)), which is not the same as OpenAI’s setting. To align the setting, we need to set `pretrained_model.generation_config.eos_token_id = None, pretrained_model.generation_config.pad_token_id = None`. Note that `transformers.GenerationConfig(eos_token_id=None, pad_token_id=None, ...)` does not work because `pretrained_model.generation_config` would override and set an `eos_token`.
-
-        ```python
-        import torch
-        import transformers
-        tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2", padding_side="right")
-        tokenizer.add_special_tokens({"pad_token": "[PAD]"})
-        pad_id = tokenizer.pad_token_id
-        query = torch.tensor([
-            [pad_id, pad_id, 23073],
-        ])
-        response = torch.tensor([
-            [11, 339, 561],
-        ])
-        response_length = 4
-        temperature = 0.7
-        pretrained_model = transformers.AutoModelForCausalLM.from_pretrained("gpt2")
-        pretrained_model.generation_config.eos_token_id = None # disable `pad_token_id` and `eos_token_id` because we just want to
-        pretrained_model.generation_config.pad_token_id = None # generate tokens without truncation / padding
-        generation_config = transformers.GenerationConfig(
-            max_new_tokens=response_length,
-            min_new_tokens=response_length,
-            temperature=temperature,
-            top_k=0.0,
-            top_p=1.0,
-            do_sample=True,
-        )
-        context_length = query.shape[1]
-        attention_mask = query != tokenizer.pad_token_id
-        input_ids = query.clone()
-        input_ids[~attention_mask] = 0 # set padding tokens to 0
-        output = pretrained_model.generate(
-            input_ids=input_ids,
-            attention_mask=attention_mask,
-            # position_ids=attention_mask.cumsum(1) - attention_mask.long(), # generation collapsed if this was turned on.
-            generation_config=generation_config,
-            return_dict_in_generate=True,
-        )
-        print(output.sequences)
-
-        """
-        tensor([[    0,     0, 23073, 16851,    11,   475,   991]])
-        """
-        ```
-
-    3. Note that in a more recent codebase https://github.com/openai/summarize-from-feedback, OpenAI does stop sampling when encountering EOS tokens ([summarize_from_feedback/utils/experiment_helpers.py#L19](https://github.com/openai/summarize-from-feedback/blob/8af822a428c93432aa80ffbe5b065a8f93895669/summarize_from_feedback/utils/experiment_helpers.py#L19)). However, in this work we aim for a 1:1 replication, so we align the setting and keep sampling even when the `eos_token` is encountered.
-5. **Learning rate annealing for reward model and policy training.**
-    1. As Ziegler et al. (2019) suggested, the reward model is trained for a single epoch to avoid overfitting the limited amount of human annotation data (e.g., the `descriptiveness` task only had about 5000 labels).
During this single epoch, the learning rate is annealed to zero ([lm_human_preferences/train_reward.py#L249](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_reward.py#L249)). - 2. Similar to reward model training, the policy's learning rate is annealed to zero ([lm_human_preferences/train_policy.py#L172-L173](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_policy.py#L172-L173)). -6. **Use different seeds for different processes** - 1. When spawning 8 GPU processes to do data parallelism, OpenAI sets a different random seed per process ([lm_human_preferences/utils/core.py#L108-L111](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/utils/core.py#L108-L111)). Implementation-wise, this is done via `local_seed = args.seed + process_rank * 100003`. The seed is going to make the model produce different responses and get different scores, for example. - 1. Note: We believe the dataset shuffling has a bug — the dataset is shuffled using the same seed for some reason ([lm_human_preferences/lm_tasks.py#L94-L97](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/lm_tasks.py#L94-L97)). - -# Reward Model Implementation Details - -In this section, we discuss reward-model-specific implementation details. We talk about details such as reward normalization and layer initialization. Here are these details in no particular order: - -1. **The reward model only outputs the value at the last token.** - 1. Notice that the rewards obtained after the forward pass on the concatenation of `query` and `response` will have the shape `(B, T, 1)`, where `B` is the batch size, `T` is the sequence length (which is always the same; it is `query_length + response_length = 64 + 24 = 88` in OpenAI’s setting for stylistic tasks, see [launch.py#L9-L11](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/launch.py#L9-L11)), and `1` is the reward head dimension of 1. For RLHF purposes, the original codebase extracts the reward of the last token ([lm_human_preferences/rewards.py#L132](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/rewards.py#L132)), so that the rewards will only have shape `(B, 1)`. - 2. Note that in a more recent codebase [*openai/summarize-from-feedback*](https://github.com/openai/summarize-from-feedback), OpenAI stops sampling when encountering EOS token ([summarize_from_feedback/utils/experiment_helpers.py#L19](https://github.com/openai/summarize-from-feedback/blob/8af822a428c93432aa80ffbe5b065a8f93895669/summarize_from_feedback/utils/experiment_helpers.py#L19)). When extracting rewards, it is going to identify the `last_response_index`, the index before the EOS token ([#L11-L13](https://github.com/openai/summarize-from-feedback/blob/8af822a428c93432aa80ffbe5b065a8f93895669/summarize_from_feedback/reward_model.py#L11-L13)), and extract the reward at that index ([summarize_from_feedback/reward_model.py#L59](https://github.com/openai/summarize-from-feedback/blob/8af822a428c93432aa80ffbe5b065a8f93895669/summarize_from_feedback/reward_model.py#L59)). However in this work we just stick with the original setting. -2. **Reward head layer initialization** - 1. 
The weight of the reward head is initialized according to \\( \mathcal{N}\left(0,1 /\left(\sqrt{d_{\text {model }}+1}\right)\right) \\) ([lm_human_preferences/language/model.py#L368,](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/language/model.py#L368) [lm_human_preferences/language/model.py#L251-L252](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/language/model.py#L251-L252)). This aligns with the settings in Stiennon et al., 2020 ([summarize_from_feedback/query_response_model.py#L106-L107](https://github.com/openai/summarize-from-feedback/blob/8af822a428c93432aa80ffbe5b065a8f93895669/summarize_from_feedback/query_response_model.py#L106-L107)) (P.S., Stiennon et al., 2020 had a typo on page 17 saying the distribution is \\( \mathcal{N}\left(0,1 /\left(d_{\text {model }}+1\right)\right) \\) without the square root) - 2. The bias of the reward head is set to 0 ([lm_human_preferences/language/model.py#L254](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/language/model.py#L254)). -3. **Reward model normalization before and after** - 1. In the paper, Ziegler el al. (2019) mentioned that "to keep the scale of the reward model consistent across training, we normalize it so that it has mean 0 and variance 1 for - \\( x \sim \mathcal{D}, y \sim \rho(·|x) \\)." To perform the normalization process, the code first creates a `reward_gain` and `reward_bias`, such that the reward can be calculated by `reward = reward * reward_gain + reward_bias` ([lm_human_preferences/rewards.py#L50-L51](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/rewards.py#L50-L51)). - 2. When performing the normalization process, the code first sets `reward_gain=1, reward_bias=0` ([lm_human_preferences/train_reward.py#L211](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_reward.py#L211)), followed by collecting sampled queries from the target dataset (e.g., `bookcorpus, tldr, cnndm`), completed responses, and evaluated rewards. It then gets the **empirical mean and std** of the evaluated reward ([lm_human_preferences/train_reward.py#L162-L167](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_reward.py#L162-L167)) and tries to compute what the `reward_gain` and `reward_bias` should be. - 3. Let us use \\( \mu_{\mathcal{D}} \\) to denote the empirical mean, \\( \sigma_{\mathcal{D}} \\) the empirical std, \\(g\\) the `reward_gain`, \\(b\\) `reward_bias`, \\( \mu_{\mathcal{T}} = 0\\) **target mean** and \\( \sigma_{\mathcal{T}}=1\\) **target std**. Then we have the following formula. - - $$ - \begin{aligned}g*\mathcal{N}(\mu_{\mathcal{D}}, \sigma_{\mathcal{D}}) + b &= \mathcal{N}(g*\mu_{\mathcal{D}}, g*\sigma_{\mathcal{D}}) + b\\&= \mathcal{N}(g*\mu_{\mathcal{D}} + b, g*\sigma_{\mathcal{D}}) \\&= \mathcal{N}(\mu_{\mathcal{T}}, \sigma_{\mathcal{T}}) \\g &= \frac{\sigma_{\mathcal{T}}}{\sigma_{\mathcal{D}}} \\b &= \mu_{\mathcal{T}} - g*\mu_{\mathcal{D}}\end{aligned} - $$ - - 4. 
The normalization process is then applied **before** and **after** reward model training ([lm_human_preferences/train_reward.py#L232-L234](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_reward.py#L232-L234), [lm_human_preferences/train_reward.py#L252-L254](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_reward.py#L252-L254)).
-
-    5. Note that responses \\( y \sim \rho(·|x) \\) we generated for the normalization purpose are from the pre-trained language model \\(\rho \\). The model \\(\rho \\) is fixed as a reference and is not updated in reward learning ([lm_human_preferences/train_reward.py#L286C1-L286C31](https://github.com/openai/lm-human-preferences/blob/master/lm_human_preferences/train_reward.py#L286C1-L286C31)).
-
-# Policy Training Implementation Details
-
-In this section, we will delve into details, such as layer initialization, data post-processing, and dropout settings. We will also explore techniques, such as rejection sampling, reward "whitening", and adaptive KL. Here are these details in no particular order:
-
-1. **Scale the logits by sampling temperature.**
-    1. When calculating the log probability of responses, the model first outputs the logits of the tokens in the responses, followed by dividing the logits with the sampling temperature ([lm_human_preferences/policy.py#L121](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/policy.py#L121)). I.e., `logits /= self.temperature`
-    2. In an informal test, we found that without this scaling, the KL would rise faster than expected, and performance would deteriorate.
-2. **Value head layer initialization**
-    1. The weight of the value head is initialized according to \\(\mathcal{N}\left(0,0\right)\\) ([lm_human_preferences/language/model.py#L368,](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/language/model.py#L368) [lm_human_preferences/language/model.py#L251-L252](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/language/model.py#L251-L252)).
-    2. The bias of the value head is set to 0 ([lm_human_preferences/language/model.py#L254](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/language/model.py#L254)).
-3. **Select query texts that start and end with a period**
-    1. This is done as part of the data preprocessing;
-        1. Tries to select text only after `start_text="."` ([lm_human_preferences/language/datasets.py#L51](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/language/datasets.py#L51))
-        2. Tries to select text just before `end_text="."` ([lm_human_preferences/language/datasets.py#L61](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/language/datasets.py#L61))
-        3. Then pad the text ([lm_human_preferences/language/datasets.py#L66-L67](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/language/datasets.py#L66-L67))
-    2. When running `openai/lm-human-preferences`, OpenAI’s datasets were partially corrupted/lost ([openai/lm-human-preferences/issues/17#issuecomment-104405149](https://github.com/openai/lm-human-preferences/issues/17#issuecomment-1044051496)), so we had to replace them with similar HF datasets, which may or may not cause a performance difference.
-    3. For the book dataset, we used [https://huggingface.co/datasets/bookcorpus](https://huggingface.co/datasets/bookcorpus), for which we find it unnecessary to extract sentences that start and end with periods because the dataset is already pre-processed this way (e.g., `"usually , he would be tearing around the living room , playing with his toys ."`). To this end, we set `start_text=None, end_text=None` for the `sentiment` and `descriptiveness` tasks.
-4. **Disable dropout**
-    1. Ziegler et al. (2019) suggested, “We do not use dropout for policy training.” This is also done in the code ([lm_human_preferences/policy.py#L48](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/policy.py#L48)).
-5. **Rejection sampling**
-    1. Ziegler et al. (2019) suggested, “We use rejection sampling to ensure there is a period between tokens 16 and 24 and then truncate at that period (This is a crude approximation for ‘end of sentence.’ We chose it because it is easy to integrate into the RL loop, and even a crude approximation is sufficient for the intended purpose of making the human evaluation task somewhat easier). During the RL finetuning, we penalize continuations that don’t have such a period by giving them a fixed reward of −1.”
-    2. Specifically, this is achieved with the following steps:
-        1. **Token truncation**: We want to truncate at the first occurrence of `truncate_token` that appears at or after position `truncate_after` in the responses ([lm_human_preferences/train_policy.py#L378](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_policy.py#L378))
-        2. **Run reward model on truncated response:** After the response has been truncated by the token truncation process, the code then runs the reward model on the **truncated response**.
-        3. **Rejection sampling**: if there is not a period between tokens 16 and 24, then replace the score of the response with a fixed low value (such as -1) ([lm_human_preferences/train_policy.py#L384](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_policy.py#L384), [lm_human_preferences/train_policy.py#L384-L402](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_policy.py#L384-L402))
-        4. To give some examples in `descriptiveness`:
-
-            {% include figure.html path="assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/descriptiveness-samples.png" class="img-fluid" %}
-
-6. **Discount factor = 1**
-    1. The discount parameter \\(\gamma\\) is set to 1 ([lm_human_preferences/train_policy.py#L56](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_policy.py#L56)), which means that future rewards are given the same weight as immediate rewards.
-7. **Terminology of the training loop: batches and minibatches in PPO**
OpenAI uses the following training loop ([lm_human_preferences/train_policy.py#L184-L192](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_policy.py#L184-L192)). Note: we additionally introduce a `micro_batch_size` to handle gradient accumulation. At each epoch, it shuffles the batch indices.

    ```python
    import numpy as np

    batch_size = 8
    nminibatches = 2
    gradient_accumulation_steps = 2
    mini_batch_size = batch_size // nminibatches
    micro_batch_size = mini_batch_size // gradient_accumulation_steps
    data = np.arange(batch_size).astype(np.float32)
    print("data:", data)
    print("batch_size:", batch_size)
    print("mini_batch_size:", mini_batch_size)
    print("micro_batch_size:", micro_batch_size)
    for epoch in range(4):
        batch_inds = np.random.permutation(batch_size)
        print("epoch:", epoch, "batch_inds:", batch_inds)
        for mini_batch_start in range(0, batch_size, mini_batch_size):
            mini_batch_end = mini_batch_start + mini_batch_size
            mini_batch_inds = batch_inds[mini_batch_start:mini_batch_end]

            # `optimizer.zero_grad()` would be called here to reset the gradients accumulated below
            for micro_batch_start in range(0, mini_batch_size, micro_batch_size):
                micro_batch_end = micro_batch_start + micro_batch_size
                micro_batch_inds = mini_batch_inds[micro_batch_start:micro_batch_end]
                print("____⏩ a forward pass on", data[micro_batch_inds])
            # `optimizer.step()` would be called here, once per minibatch
            print("⏪ a backward pass on", data[mini_batch_inds])

    # data: [0. 1. 2. 3. 4. 5. 6. 7.]
    # batch_size: 8
    # mini_batch_size: 4
    # micro_batch_size: 2
    # epoch: 0 batch_inds: [6 4 0 7 3 5 1 2]
    # ____⏩ a forward pass on [6. 4.]
    # ____⏩ a forward pass on [0. 7.]
    # ⏪ a backward pass on [6. 4. 0. 7.]
    # ____⏩ a forward pass on [3. 5.]
    # ____⏩ a forward pass on [1. 2.]
    # ⏪ a backward pass on [3. 5. 1. 2.]
    # epoch: 1 batch_inds: [6 7 3 2 0 4 5 1]
    # ____⏩ a forward pass on [6. 7.]
    # ____⏩ a forward pass on [3. 2.]
    # ⏪ a backward pass on [6. 7. 3. 2.]
    # ____⏩ a forward pass on [0. 4.]
    # ____⏩ a forward pass on [5. 1.]
    # ⏪ a backward pass on [0. 4. 5. 1.]
    # epoch: 2 batch_inds: [1 4 5 6 0 7 3 2]
    # ____⏩ a forward pass on [1. 4.]
    # ____⏩ a forward pass on [5. 6.]
    # ⏪ a backward pass on [1. 4. 5. 6.]
    # ____⏩ a forward pass on [0. 7.]
    # ____⏩ a forward pass on [3. 2.]
    # ⏪ a backward pass on [0. 7. 3. 2.]
    # epoch: 3 batch_inds: [7 2 4 1 3 0 6 5]
    # ____⏩ a forward pass on [7. 2.]
    # ____⏩ a forward pass on [4. 1.]
    # ⏪ a backward pass on [7. 2. 4. 1.]
    # ____⏩ a forward pass on [3. 0.]
    # ____⏩ a forward pass on [6. 5.]
    # ⏪ a backward pass on [3. 0. 6. 5.]
    ```

8. **Per-token KL penalty**

    The code adds a per-token KL penalty ([lm_human_preferences/train_policy.py#L150-L153](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_policy.py#L150-L153)) to the rewards, in order to discourage the policy from deviating too far from the original policy.

    Using `"usually, he would"` as an example, it gets tokenized to `[23073, 11, 339, 561]`. Say we use `[23073]` as the query and `[11, 339, 561]` as the response. Then under the default `gpt2` parameters, the response tokens will have log probabilities of the reference policy `logprobs=[-3.3213, -4.9980, -3.8690]`.

    During the first PPO update epoch and minibatch update, the active policy will have the same log probabilities `new_logprobs=[-3.3213, -4.9980, -3.8690]`
, so the per-token KL penalty would be `kl = new_logprobs - logprobs = [0., 0., 0.]`.

    However, after the first gradient backward pass, we could have, say, `new_logprobs=[-3.6528, -5.0406, -3.2339]`, so the per-token KL penalty becomes `kl = new_logprobs - logprobs = [-0.3315, -0.0426, 0.6351]`.

    Then `non_score_reward = -beta * kl`, where `beta` is the KL penalty coefficient \\(\beta\\), and it’s added to the `score` obtained from the reward model to create the `rewards` used for training. The `score` is only given at the end of the episode; it could look like `[0.4]`, and we would have `rewards = [beta * 0.3315, beta * 0.0426, -beta * 0.6351 + 0.4]`.

9. **Per-minibatch reward and advantage whitening, with optional mean shifting**
    1. OpenAI implements a `whiten` function that looks like below, basically normalizing the `values` by subtracting their mean and then dividing by their standard deviation. Optionally, `whiten` can add the original mean back to the whitened `values` with `shift_mean=False`.

    ```python
    import torch

    def whiten(values, shift_mean=True):
        # normalize to zero mean and unit standard deviation
        mean, var = torch.mean(values), torch.var(values, unbiased=False)
        whitened = (values - mean) * torch.rsqrt(var + 1e-8)
        if not shift_mean:
            # add the original mean back, keeping only the variance normalization
            whitened += mean
        return whitened
    ```

    2. In each minibatch, OpenAI then whitens the reward with `whiten(rewards, shift_mean=False)`, i.e., without shifting the mean ([lm_human_preferences/train_policy.py#L325](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_policy.py#L325)), and whitens the advantages with `whiten(advantages)`, i.e., with the mean shifted to zero ([lm_human_preferences/train_policy.py#L338](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_policy.py#L338)).
    3. **Optimization note:** if the number of minibatches is one (which is the case in this reproduction), we only need to whiten the rewards and to calculate and whiten the advantages once, since their values won’t change.
    4. 
**TensorFlow vs PyTorch note:** Different behavior of `tf.moments` vs `torch.var`: whitening behaves differently in PyTorch vs TensorFlow because the variance calculation differs:

    ```python
    import numpy as np
    import tensorflow as tf
    import torch

    def whiten_tf(values, shift_mean=True):
        mean, var = tf.nn.moments(values, axes=list(range(values.shape.rank)))
        mean = tf.Print(mean, [mean], 'mean', summarize=100)
        var = tf.Print(var, [var], 'var', summarize=100)
        whitened = (values - mean) * tf.rsqrt(var + 1e-8)
        if not shift_mean:
            whitened += mean
        return whitened

    def whiten_pt(values, shift_mean=True, unbiased=True):
        mean, var = torch.mean(values), torch.var(values, unbiased=unbiased)
        print("mean", mean)
        print("var", var)
        whitened = (values - mean) * torch.rsqrt(var + 1e-8)
        if not shift_mean:
            whitened += mean
        return whitened

    rewards = np.array([
        [1.2, 1.3, 1.4],
        [1.5, 1.6, 1.7],
        [1.8, 1.9, 2.0],
    ])

    with tf.Session() as sess:
        print(sess.run(whiten_tf(tf.constant(rewards, dtype=tf.float32), shift_mean=False)))
    print(whiten_pt(torch.tensor(rewards), shift_mean=False, unbiased=True))
    print(whiten_pt(torch.tensor(rewards), shift_mean=False, unbiased=False))
    ```

    ```
    mean[1.5999999]
    var[0.0666666627]
    [[0.05080712 0.4381051  0.8254035 ]
     [1.2127019  1.6000004  1.9872988 ]
     [2.3745968  2.7618952  3.1491938 ]]
    mean tensor(1.6000, dtype=torch.float64)
    var tensor(0.0750, dtype=torch.float64)
    tensor([[0.1394, 0.5046, 0.8697],
            [1.2349, 1.6000, 1.9651],
            [2.3303, 2.6954, 3.0606]], dtype=torch.float64)
    mean tensor(1.6000, dtype=torch.float64)
    var tensor(0.0667, dtype=torch.float64)
    tensor([[0.0508, 0.4381, 0.8254],
            [1.2127, 1.6000, 1.9873],
            [2.3746, 2.7619, 3.1492]], dtype=torch.float64)
    ```

10. **Clipped value function**
    1. As done in the original PPO ([baselines/ppo2/model.py#L68-L75](https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/model.py#L68-L75)), the value function is clipped ([lm_human_preferences/train_policy.py#L343-L348](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_policy.py#L343-L348)) in a similar fashion to the policy objective.
11. **Adaptive KL**

    The KL divergence penalty coefficient \\(\beta\\) is modified adaptively based on the KL divergence between the current policy and the previous policy. If the KL divergence is outside a predefined target range, the penalty coefficient is adjusted to bring it closer to the target range ([lm_human_preferences/train_policy.py#L115-L124](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_policy.py#L115-L124)). It’s implemented as follows:

    ```python
    import numpy as np

    class AdaptiveKLController:
        def __init__(self, init_kl_coef, hparams):
            self.value = init_kl_coef
            self.hparams = hparams

        def update(self, current, n_steps):
            target = self.hparams.target
            # proportional controller, with the error clipped to +/- 20%
            proportional_error = np.clip(current / target - 1, -0.2, 0.2)
            mult = 1 + proportional_error * n_steps / self.hparams.horizon
            self.value *= mult
    ```

    - For the `sentiment` and `descriptiveness` tasks examined in this work, we have `init_kl_coef=0.15, hparams.target=6, hparams.horizon=10000`.

## **PyTorch Adam optimizer numerical issues w.r.t. RLHF**

- This implementation detail is so interesting that it deserves a full section.
- PyTorch Adam optimizer ([torch.optim.Adam.html](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html)) has a different implementation compared to TensorFlow’s Adam optimizer (TF1 Adam at [tensorflow/v1.15.2/adam.py](https://github.com/tensorflow/tensorflow/blob/v1.15.2/tensorflow/python/training/adam.py), TF2 Adam at [keras/adam.py#L26-L220](https://github.com/keras-team/keras/blob/v2.13.1/keras/optimizers/adam.py#L26-L220)). In particular, **PyTorch follows Algorithm 1** of Kingma and Ba’s Adam paper, but **TensorFlow uses the formulation just before Section 2.1** of the paper, and the `epsilon` referred to here is `epsilon hat` in the paper. In a pseudocode comparison, we have the following:

```python
### pytorch adam implementation:
bias_correction1 = 1 - beta1 ** step
bias_correction2 = 1 - beta2 ** step
step_size = lr / bias_correction1
bias_correction2_sqrt = _dispatch_sqrt(bias_correction2)
denom = (exp_avg_sq.sqrt() / bias_correction2_sqrt).add_(eps)
param.addcdiv_(exp_avg, denom, value=-step_size)

### tensorflow adam implementation:
lr_t = lr * _dispatch_sqrt((1 - beta2 ** step)) / (1 - beta1 ** step)
denom = exp_avg_sq.sqrt().add_(eps)
param.addcdiv_(exp_avg, denom, value=-lr_t)
```

- Let’s compare the update equations of PyTorch-style and TensorFlow-style Adam. Following the notation of the Adam paper [(Kingma and Ba, 2014)](https://arxiv.org/abs/1412.6980), we have the gradient update rules for PyTorch Adam (Algorithm 1 of Kingma and Ba’s paper) and TensorFlow-style Adam (the formulation just before Section 2.1 of Kingma and Ba’s paper) as below:

$$\begin{aligned}\text{pytorch adam:}\quad \theta_t & =\theta_{t-1}-\alpha \cdot \hat{m}_t /\left(\sqrt{\hat{v}_t}+\varepsilon\right) \\& =\theta_{t-1}- \alpha \underbrace{\left[m_t /\left(1-\beta_1^t\right)\right]}_{=\hat{m}_t} /\left[\sqrt{\underbrace{v_t /\left(1-\beta_2^t\right)}_{=\hat{v}_t} }+\varepsilon\right]\\& =\theta_{t-1}- \alpha\left[m_t /\left(1-\beta_1^t\right)\right]\frac{\sqrt{1-\beta_2^t}}{\sqrt{v_t}+\color{green}{\varepsilon \sqrt{1-\beta_2^t}}}\end{aligned}$$

$$\begin{aligned}\text{tensorflow adam:}\quad \theta_t & =\theta_{t-1}-\alpha_t m_t /\left(\sqrt{v_t}+\hat{\varepsilon}\right) \\& =\theta_{t-1}-\underbrace{\left[\alpha \sqrt{1-\beta_2^t} /\left(1-\beta_1^t\right)\right]}_{=\alpha_t} m_t /\left(\sqrt{v_t}+\hat{\varepsilon}\right) \\& =\theta_{t-1}- \alpha\left[m_t /\left(1-\beta_1^t\right)\right] \frac{\sqrt{1-\beta_2^t}}{\sqrt{v_t}+\color{green}{\hat{\varepsilon}}} \end{aligned}$$

- The equations above highlight that the distinction between the PyTorch and TensorFlow implementations is their **normalization terms**, \\(\color{green}{\varepsilon \sqrt{1-\beta_2^t}}\\) and \\(\color{green}{\hat{\varepsilon}}\\). The two versions are equivalent if we set \\(\hat{\varepsilon} =\varepsilon \sqrt{1-\beta_2^t}\\). However, in the PyTorch and TensorFlow APIs, we can only set \\(\varepsilon\\) (PyTorch) and \\(\hat{\varepsilon}\\) (TensorFlow) via the `eps` argument, causing differences in their update equations. What if we set \\(\varepsilon\\) and \\(\hat{\varepsilon}\\) to the same value, say, 1e-5? Then for TensorFlow Adam, the normalization term \\(\hat{\varepsilon} = \text{1e-5}\\) is just a constant. But for PyTorch Adam, the normalization term \\({\varepsilon \sqrt{1-\beta_2^t}}\\) changes over time.
Importantly, the term \\({\varepsilon \sqrt{1-\beta_2^t}}\\) is much smaller than 1e-5 when the timestep \\(t\\) is small, and it gradually approaches 1e-5 as timesteps increase. The plot below compares these two normalization terms over timesteps:

{% include figure.html path="assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/norma_const_comparison.png" class="img-fluid" %}

- The above figure shows that, if we set the same `eps` in PyTorch Adam and TensorFlow Adam, then PyTorch Adam uses a much smaller normalization term than TensorFlow Adam in the early phase of training. In other words, PyTorch Adam makes **more aggressive gradient updates early in the training**. Our experiments support this finding, as we will demonstrate below; we also give a short numerical sketch of these two normalization terms after the figures at the end of this section.
- How does this impact reproducibility and performance? To align settings, we record the original query, response, and rewards from [https://github.com/openai/lm-human-preferences](https://github.com/openai/lm-human-preferences) and save them. We also record the metrics of the first two epochs of training with TF1’s `AdamOptimizer` as the ground truth. Below are some key metrics:

  | | OpenAI’s TF1 Adam | PyTorch’s Adam | Our custom TensorFlow-style Adam |
  | --- | --- | --- | --- |
  | policy/approxkl | 0.00037167023 | 0.0023672834504395723 | 0.000374998344341293 |
  | policy/clipfrac | 0.0045572915 | 0.02018229104578495 | 0.0052083334885537624 |
  | ratio_mean | 1.0051285 | 1.0105520486831665 | 1.0044583082199097 |
  | ratio_var | 0.0007716546 | 0.005374275613576174 | 0.0007942612282931805 |
  | ratio_max | 1.227216 | 1.8121057748794556 | 1.250215768814087 |
  | ratio_min | 0.7400441 | 0.4011387825012207 | 0.7299948930740356 |
  | logprob_diff_mean | 0.0047487603 | 0.008101251907646656 | 0.004073789343237877 |
  | logprob_diff_var | 0.0007207897 | 0.004668936599045992 | 0.0007334011606872082 |
  | logprob_diff_max | 0.20474821 | 0.594489574432373 | 0.22331619262695312 |
  | logprob_diff_min | -0.30104542 | -0.9134478569030762 | -0.31471776962280273 |

- **PyTorch’s `Adam` produces a more aggressive update** for some reason. Here is some evidence:
    - **PyTorch’s `Adam`’s `logprob_diff_var` is 6x higher.** Here `logprobs_diff = new_logprobs - logprobs` is the difference between the log probabilities of tokens under the initial and current policies after two epochs of training. A larger `logprob_diff_var` means the scale of the log probability changes is larger than that in OpenAI’s TF1 Adam.
    - **PyTorch’s `Adam` presents a more extreme ratio max and min.** Here `ratio = torch.exp(logprobs_diff)`. Having `ratio_max=1.8121057748794556` means that for some token, the probability of sampling that token is 1.8x more likely under the current policy, as opposed to only 1.2x with OpenAI’s TF1 Adam.
    - **Larger `policy/approxkl` and `policy/clipfrac`.** Because of the aggressive update, the ratio gets clipped **4.4x more often, and the approximate KL divergence is 6x larger.**
    - The aggressive updates are likely to cause further issues. E.g., `logprob_diff_mean` is 1.7x larger in PyTorch’s `Adam`, which would correspond to a 1.7x larger KL penalty in the next reward calculation; this could get compounded. In fact, this might be related to the famous KL divergence issue: the KL penalty ends up much larger than it should be, and the model could pay more attention to it and over-optimize against it, eventually causing a negative KL divergence.
- **Larger models get affected more.** We conducted experiments comparing PyTorch’s `Adam` (codename `pt_adam`) and our custom TensorFlow-style Adam (codename `tf_adam`) on `gpt2` and `gpt2-xl`. We found that performance is roughly similar under `gpt2`; however, with `gpt2-xl` we observed more aggressive updates, meaning that larger models are affected more by this issue.
    - When the initial policy updates are more aggressive in `gpt2-xl`, the training dynamics get affected. For example, we see much larger spikes in `objective/kl` and `objective/scores` with `pt_adam`, especially with `sentiment`: *the biggest KL was as large as 17.5* in one of the random seeds, suggesting an undesirable over-optimization.
    - Furthermore, because of the larger KL, many other training metrics are affected as well. For example, we see a much larger `clipfrac` (the fraction of time the `ratio` gets clipped by PPO’s objective clip coefficient 0.2) and `approxkl`.
-{% include figure.html path="assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2.png" class="img-fluid" %} - - -{% include figure.html path="assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2_xl.png" class="img-fluid" %} -
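To make the comparison of the two normalization terms concrete, here is a minimal sketch (our own illustration, not code from either library) that evaluates the PyTorch-style term \\({\varepsilon \sqrt{1-\beta_2^t}}\\) against the constant TensorFlow-style \\(\hat{\varepsilon}\\), with the typical `eps=1e-5` and `beta2=0.999`:

```python
# Compare the effective Adam normalization terms over training timesteps.
import math

eps = 1e-5      # plays the role of epsilon (PyTorch) and epsilon-hat (TensorFlow)
beta2 = 0.999
for t in [1, 10, 100, 1000, 10000]:
    pt_term = eps * math.sqrt(1 - beta2 ** t)  # PyTorch-style, grows towards eps
    tf_term = eps                              # TensorFlow-style, constant
    print(f"t={t:>5}  pytorch-style: {pt_term:.2e}  tensorflow-style: {tf_term:.2e}")

# t=    1  pytorch-style: 3.16e-07  tensorflow-style: 1.00e-05
# t=   10  pytorch-style: 9.98e-07  tensorflow-style: 1.00e-05
# t=  100  pytorch-style: 3.09e-06  tensorflow-style: 1.00e-05
# t= 1000  pytorch-style: 7.95e-06  tensorflow-style: 1.00e-05
# t=10000  pytorch-style: 1.00e-05  tensorflow-style: 1.00e-05
```

At `t=1` the PyTorch-style term is roughly 30x smaller, so the update divides by a smaller denominator offset, which is exactly the more aggressive early-training behavior observed above.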
# Limitations

Note that this work does not try to reproduce the summarization work on CNN/DM or TL;DR. This was because we found the training to be time-consuming and brittle.

The particular training run we performed showed poor GPU utilization (around 30%), so it took almost 4 days to perform a training run, which is highly expensive (only AWS sells the p3dn.24xlarge, and it costs $31.212 per hour).

Additionally, training was brittle. While the reward went up, we found it difficult to reproduce the “smart copier” behavior reported by Ziegler et al. (2019). Below are some sample outputs — clearly, the agent overfits somehow.

```
QUERY: The modern take on Death & Taxes still plays it. I know as a Tron player I find Mindcensor
pretty frustrating. ⏎ However, the answer to your question is probably that the decks that
are currently taking up the lion's share of the metagame don't really care about its effect.
It has a chance to act as an instant speed Stone Rain + Gut Shot in response to a fetch
crack, but after that it's a 2/1 flyer that dies to all relevant removal. ⏎ It's especially
powerful against Tron since so much of the deck's consistency is in its tutor effects --
Sylvan Scrying, Expedition Map, and Eye of Ugin. This combined with Pyroclasm and Oblivion
Stone being the only creature removal the deck has available to it before Karn make
Mindcensor fairly potent. ⏎ I'm having trouble thinking of any other popular decks (ignoring
the interaction with fetches) that Mindcensor really sticks it to, however.

PPO GENERATED RESPONSE: Mindcensor is a very good card, but it's not really worth playing.<|endoftext|>Image
caption The family was forced from their home after they were not allowed to leave the
house⏎⏎A family of four has been jailed for five years after admitting breaching home rule
by refusing to leave their home.⏎

reward: -0.48
kl: +11.37
total reward: -0.82

QUERY: To make a long story short, my SO (27/m) and I live in my hometown. We've been together for
4 years. His family lives across the country. A lot of the time we spend with my parents but
besides that, we aren't very social. His family not being here doesn't make me feel like I
can't be with mine. I just don't like to. ⏎ I find my extended family (so not my parents)
extremely irritating. Especially the ones around my age. They're dramatic and rude. Every
family event is some Jerry Springer episode I don't want to participate in. I don't have any
issues with anyone in particular. I just find them annoying. I'm never uncivil or openly
telling them how I feel. If there's get togethers, I tend to avoid them. I don't ask anyone
to hang out or come over. ⏎ My SO and I have experienced multiple miscarriages due to
unknown reasons. This has made me very sensitive and a bit of a loaner when it comes to the
girls in my family my age. I'm not jealous that my family has children. I just don't seem to
be able to relate to them. When I explained our issues, they reacted very unsympathetic and
just weren't very supportive. I guess that was the time that I started to pull away and not
try to keep up any relationships. ⏎ My SO and I are pregnant, half way! This is something
we're so joyed about. However, when we told everyone, they just didn't really give a
reaction. No congrats or that they were happy for us. Since then, I completely pulled away.
I talked to my parents about this and I just don't feel the need to contact anyone.
The reason I'm posting is because I discussed deleting some relatives off social media with my
mom and she was telling me how it's really rude. I don't think it is. We don't talk. There's
no relationship. Honestly, I think they're just friends with me for gossip purposes. A few
times I tried reaching out to some of my cousins with no reply. ⏎ I guess what I'm asking is
if I should be trying harder to establish something with my family or if it's perfectly OK
just to delete them and not talk to them again? Our primary contact is Facebook even though
we live in the same town.

PPO GENERATED RESPONSE: We're dating now. We've met, we're dating. We're dating now. We're dating now. We're dating
now. We're dating now. We're dating now.⏎⏎We've met⏎⏎We're dating now.⏎⏎We're dating now.⏎

reward: +2.09
kl: +19.45
total reward: +1.50

QUERY: This tower totally decided to screw me over on the final stage. So a bit of background I
don't have the strongest squad but pretty much I have: ⏎ Guardian Loch (Sacred Jewel) ⏎
Anima Tiara (Flesh Armor) ⏎ Lord Cayena (sacred Jewel) ⏎ Oracle Twins (sacred Jewel) ⏎
Guardian Logan (Burny soul stone) ⏎ and was using a friends maxed michelle. ⏎ So throughout
the earlier stages I've seen this guy about 5 times but have never caught him. So I had a
little trouble with stage 9 but felt like with the additional 50% atk boost and bb from
michelle I could do it. Also on stage 9 the reason I died was because I didn't take any
stimulants and this guy managed to paralyze 3 members of my squad. So I replaced mega cures
for stimulants. ⏎ Round 3 I meet him. He gets a lucky crit on my caynea and takes her down.
I have to use my one revive, take him down and capture him. I'm pretty excited. ⏎ Round 7
I see the warning danger sign. I face him again and manage to take him down without any of
my units dying no capture. At this point I'm starting to get low on healing items. Also I've
only heard of meeting him twice on a run. But then I checked this thread and saw that he
always appears on the last stage. I thought that maybe it glitched and he wouldn't appear on
the final round. ⏎ Nope. ⏎ This guy shows up on the final round showing up 3 times in one
run. Worst part was I manged to get him down to a sliver of health so that as long as one
unit hit him I would've defeated him with the next turn. This was also the turn he decided
to use soul desecration twice. I definitely felt like my soul got desecrated watching all my
units die. So at this point it was give up the captured unit and try all over again (knowing
my luck never capture him again) or gem for the last hit. I was really looking forward to
getting the gem for completion (f2p gotta scrap up everything right?) but I let it go.

PPO GENERATED RESPONSE: ⏎⏎When you capture a unit try to kill it.⏎⏎Kill them.⏎⏎Kill them.⏎⏎Kill them.⏎⏎Kill
them.⏎⏎Kill them.⏎⏎Kill them.⏎⏎Kill them.⏎⏎Kill them.⏎⏎Kill them.⏎

reward: +0.79
kl: +10.82
total reward: +0.46
```

# Conclusion

In this work, we took a deep dive into OpenAI’s original RLHF codebase and compiled a list of its implementation details. We also created a minimal base that reproduces the same learning curves as OpenAI’s original RLHF codebase when the dataset and hyperparameters are controlled. Furthermore, we identified surprising implementation details, such as the Adam optimizer’s settings, which cause aggressive updates early in RLHF training.
diff --git a/_posts/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective.md b/_posts/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective.md
deleted file mode 100644
index c62550c3..00000000
--- a/_posts/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective.md
+++ /dev/null
@@ -1,197 +0,0 @@
---
layout: distill
title: Understanding gradient inversion attacks from the prior knowledge perspective
description: In this blogpost, we mention multiple works in gradient inversion attacks, point out the challenges we need to solve in GIAs, and provide a perspective from the prior knowledge to understand the logic behind recent papers.
date: 2024-05-07
future: true
htmlwidgets: true

#Anonymize when submitting
authors:
  - name: Yanbo Wang
    affiliations:
      name: School of AI, UCAS $\n$ CRIPAC & MAIS, CASIA
  - name: Jian Liang
    affiliations:
      name: School of AI, UCAS $\n$ CRIPAC & MAIS, CASIA
  - name: Ran He
    affiliations:
      name: School of AI, UCAS $\n$ CRIPAC & MAIS, CASIA

# must be the exact same name as your blogpost
bibliography: 2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective.bib

# Add a table of contents to your post.
#   - make sure that TOC names match the actual section names
#     for hyperlinks within the post to work correctly.
#   - please use this format rather than manually creating a markdown table of contents.
toc:
  - name: Fundamental pipeline of GIAs
  - name: The tough challenge in GIAs
    subsections:
    - name: A simple example of information discards
  - name: Understanding GIAs from the prior knowledge perspective
    subsections:
    - name: Unparameterized regularization terms
    - name: Generative models
    - name: End-to-end networks
  - name: Limitation and future directions
  - name: Conclusions

# Below is an example of injecting additional post-specific styles.
# This is used in the 'Layouts' section of this post.
# If you use this post as a template, delete this _styles block.
_styles: >
  .fake-img {
    background: #bbb;
    border: 1px solid rgba(0, 0, 0, 0.1);
    box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
    margin-bottom: 12px;
  }
  .fake-img p {
    font-family: monospace;
    color: white;
    text-align: left;
    margin: 12px 0;
    text-align: center;
    font-size: 16px;
  }
---
Federated learning, as a way to collaboratively train a deep model, was originally developed to enhance training efficiency and protect data privacy. In a federated learning paradigm, whether horizontal or vertical, data are processed locally, and the central server only gets access to processed information, such as trained model weights or intermediate gradients. By avoiding direct access to private local data, federated learning is believed to protect clients' data privacy: the central server can only use the uploaded information to train a global model, without knowing exactly what the training dataset contains. However, in horizontal federated learning, researchers found that with training gradients the central server could still recover input data, which poses a threat to the privacy of training data. Such a privacy attack is named the gradient inversion attack (or gradient leakage attack).
## Fundamental pipeline of Gradient inversion attacks (GIAs)

Gradient inversion attacks (GIAs) aim at reconstructing clients' private input data from the gradients exchanged during deep neural network training. They are a threat to the federated learning framework, especially the horizontal one, where a curious-but-honest central server collects gradients from multiple clients, analyzes the optimal parameter updating direction, and sends back the updated model in one step. Setting complicated mathematical formulas aside, GIA is essentially a matching process: the attacker (the central server in the most common settings) expects the data it randomly initialized to eventually generate gradients identical to the ground truth, so it measures the difference (or distance) between gradients and optimizes the input data pixel-wise. The smaller the distance between gradients, the better the private data are reconstructed.

{% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture1.jpg" class="img-fluid" %}

This is a **white-box** attack, as it requires full model parameters to conduct backpropagation. In such a process, with fixed model parameters, the distance between gradients depends entirely on the attacker's dummy data. GIA's target is to optimize the distance below, where $x^\ast$ and $y^\ast$ represent the dummy data-label tuple, $\mathcal{D}$ represents the distance function, $\theta$ represents the model weights, and $\mathcal{L}$ represents the CE loss.

$$\arg\min \limits_{(x^*,y^*)} {\mathcal{D}}\left(\nabla_\theta\mathcal{L}_\theta\left( x,y\right),\nabla_\theta\mathcal{L}_\theta\left( x^*,y^*\right)\right)$$

Following this problem formulation, a few research topics emerged in this field. iDLG provides a way to recover the input label analytically. Following this, a series of works was proposed to recover labels from batches, and it is generally believed that, compared with optimizing image-label tuples simultaneously, simply optimizing input images with ground-truth labels achieves better performance. Besides label recovery, attack evaluations and defense methods also attract much attention. However, recovering high-quality images is still the key focus.

## The tough challenge in GIAs

In GIA, the tough challenge, which has not been solved yet, is the reconstruction of batched input data where **multiple samples share the same labels**. Previous works headed towards this goal in a few steps: they first recovered single input data, then extended to batches with known labels, and finally added a new algorithm to recover batched one-hot labels before recovering input images. However, to the best of my knowledge, the field is still limited to the situation where **for every class there is at most one sample in a batch**. Batched data recovery with repeated labels still fails for all current algorithms. The key reason for this failure lies in the information discards of averaged gradients.

### A simple example of information discards

Let's first take a look at a simple neural network: the MLP. A specific layer takes in intermediate features $\mathbf{x}$ and outputs the result of a matrix multiplication, $\mathbf{z}=\mathbf{Wx}+\mathbf{b}$.
To recover the input from gradients, we can simply use the bias attack:

$$\frac{\partial \mathcal{L}}{\partial {\mathbf{W}}}=\frac{\partial \mathcal{L}}{\partial \mathbf{z}} \times \frac{\partial \mathbf{z}}{\partial {\mathbf{W}}}=\frac{\partial \mathcal{L}}{\partial {b}}\mathbf{x}^\mathrm{T}$$

In the above equation, it is clear that for a single input, with full access to model weights and gradients, the gradients of the MLP contain enough information to execute single-image recovery.

Here, we conduct a simple experiment to illustrate the existence of information discards. First, we pick a 4-layer MLP as the target neural network and randomly select a few images from the Flowers-17 dataset as the private input data to recover. We take the $l_2$ loss as the gradient matching function, without any prior knowledge (regularization terms). To begin with, we provide an example of input image recovery when **`batchsize=1` with known labels**.
-
- {% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_l2_fc.gif" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_l2_1.gif" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_l2_2.gif" class="img-fluid rounded z-depth-1" %} -
-
-
- Image reconstruction with $l_2$ loss on MLP. no regularization terms are adopted. -
- -It is not surprising that $l_2$ gradient matching functions could recover the input data well. Such a good performance is mainly because MLP's gradients contain enough information of intermediate features for single inputs. With proper labels, we could conclude that GIA works well on MLP when `batchsize=1`. - -However, when it comes to CNNs, such inversion gets harder. For convolution layers, the gradients of convolution kernels are aggregated through the whole feature map, therefore even if we set batchsize=1, gradients may still experience information discards, affecting the attack performance. This problem is also mentioned in R-GAP, which executes the GIA from an equation-solving perspective. If equations are "rank-deficient", then we cannot get a unique solution, indicating obvious information discards. Here, for better illustration, we first show CIFAR-10 image reconstructions on LeNet with `batchsize=1`. Ground-truth one-hot labels are provided. -
-
- {% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_l2_f.gif" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt.jpg" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_f.gif" class="img-fluid rounded z-depth-1" %} -
-
- Image reconstruction on LeNet with CIFAR-10 dataset when batchsize=1. we show the ground-truth image in the middle and attach the reconstruction process on two sides ($l_2$ loss on the left and cosine similarity loss on the right). -
-
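Before looking closer at these results, here is a minimal PyTorch sketch of the gradient-matching loop described above. It is our illustrative reconstruction, not the exact experimental setup: the model, input size, and hyperparameters are placeholders, and the label is assumed known.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# A small fully connected network standing in for the attacked model.
model = torch.nn.Sequential(
    torch.nn.Linear(32, 16), torch.nn.ReLU(), torch.nn.Linear(16, 10)
)

# Client side: the ground-truth input produces the gradients shared with the server.
x_true = torch.rand(1, 32)
y_true = torch.tensor([3])
g_true = torch.autograd.grad(F.cross_entropy(model(x_true), y_true), model.parameters())

# Attacker side: optimize randomly initialized dummy data so that the gradients
# it generates match the shared ones.
x_dummy = torch.rand(1, 32, requires_grad=True)
opt = torch.optim.Adam([x_dummy], lr=0.1)
for step in range(2000):
    opt.zero_grad()
    g_dummy = torch.autograd.grad(
        F.cross_entropy(model(x_dummy), y_true), model.parameters(), create_graph=True
    )
    # l2 gradient-matching distance D between dummy and ground-truth gradients
    dist = sum(((gd - gt) ** 2).sum() for gd, gt in zip(g_dummy, g_true))
    dist.backward()
    opt.step()

print("max pixel-wise recovery error:", (x_dummy - x_true).abs().max().item())
```

The same loop applies to CNNs and to other distance functions such as cosine similarity; only the model and $\mathcal{D}$ change.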
- -It is clear that even though both functions could recover the image, there are some pixels not perfectly optimized, indicating the existence of information discards. If we change the batchsize, even if we only slightly enlarge it as `batchsize=2`, such reconstruction ends up with a failure. - -
-
- {% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs2_cos.gif" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt.jpg" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt_2.jpg" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs2_cos1.gif" class="img-fluid rounded z-depth-1" %} -
-
-
- Image reconstruction with cosine similarity loss on LeNet and no regularization terms are adopted. In the middle, we show ground-truth images in the batch. -
- -For a given network, the size of gradients is fixed. Therefore, with the increase in batchsize, GIA will experience more obvious information discards. This is easy to understand, and researchers designed a few ways to complement this loss. -## Understanding GIAs from the prior knowledge perspective -Realizing the information discards, reviewing the recent paper through the prior knowledge perspective may help understand the logic better. To achieve better image reconstruction quality, it is natural to consider the prior knowledge of images as the complement. Here, the prior knowledge could be explained in three aspects. - -### Unparameterized regularization terms -In IG, they utilize the total variance as a regularization because they believe a real image taken from nature should have a small total variance. That is the first prior knowledge term utilized in the gradient matching function, and it turns out to function well. After that, in GradInversion this regularization term is extended to include batch normalization supervision, $$l_2$$ norms and group consistency. This is a stronger prior knowledge implying that a real input image, or batched real images, except for total variance, should also possess lower $$l_2$$ norms, proper intermediate mean and the variance for batch normalization layers. Apart from that, all reconstructions from different random initializations ought to reach a group consistency. These terms are unparameterized, and it is clearly demonstrated in their ablation experiments that these terms matter significantly in reconstructing high-quality images. - -To further illustrate the benefits such regulariztaion terms have on the data reconstruction processes, here is an example of adding total variance for `batchsize=2` image reconstruction. The scale of total variance ranges from $$10^{-4}$$ to $$10^{-1}$$. - -
-
- {% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs2_cos_tv0.0001.gif" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs2_cos_tv0.001.gif" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs2_cos_tv0.01.gif" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs2_cos_tv0.1.gif" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs2_cos1_tv0.0001.gif" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs2_cos1_tv0.001.gif" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs2_cos1_tv0.01.gif" class="img-fluid rounded z-depth-1" %} -
-
- {% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs2_cos1_tv0.1.gif" class="img-fluid rounded z-depth-1" %} -
-
-
- Image reconstruction with cosine similarity loss and total variance on LeNet. The scale of the total variance starts from $10^{-4}$ for the very left column to $10^{-1}$ with 10 times as the interval. -
- -With identical learning rate, images with higher total variance are reconstructed faster. Because the total variance penalizes obvious distinctions for adjacent pixels, images with higher total variance are also more blurred. On the other side, reconstructions with insufficient total variance fail to generate recognizable images. -### Generative models -Keep following the logic that recent works require some other conditions as prior knowledge to reinforce the information discards from gradients, generative models, especially GANs, could serve as a strong tool to encode what "real images" should be. The way to add GAN's generator in gradient matching processes is simple: instead of optimizing direct image pixels, with the generator we could keep the backpropagation way back to the latent space, then alter the latent code as well as the parameters of the generator to produce recovered images. Pre-trained generators naturally encode a likely distribution of the input data, which is a stronger prior knowledge compared with previous unparameterized regularization terms. -{% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture2.jpg" class="img-fluid" %} - -Recent work GIFD extends this method by optimizing GAN network layer-wisely. Instead of directly optimizing GAN weights and the latent vector in one step, GIFD optimizes the intermediate layers iteratively, making such a process more stable. In summary, gradients here serve more as an indicator for attackers to select the best image from distributions modeled by pre-trained GANs. - -### End-to-end networks -Actually, the most intuitive way to conduct a GIA is to design a function that takes gradients as input and then outputs recovered images. For a target network, image-gradient tuples are easy to collect, therefore the prior knowledge could be encoded in such an end-to-end neural network through model training. -{% include figure.html path="assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture3.jpg" class="img-fluid" %} - -Here, the neural network resembles a GAN generator which takes in representation vectors and outputs a synthesized image. However, instead of abstract latent codes, such a network receives gradient vectors to generate images. In implementations, Wu et.al utilizes *feature hashing* to reduce the dimension of gradient vectors. For network picking, they use a simple 3-layer MLP to generate flattened images, which is different from widely-used GAN structures. -However, such a method faces multiple difficulties, such as large input sizes and limited structural flexibility. Even for one specific model, once the model weights are changed, such end-to-end network requires retraining to construct a new mapping from gradients to images. Besides, there is still space for network design. Will the network structure influence image reconstruction performance under identical datasets? How to construct a mapping function from gradients to images with varying batchsize? Could the network find an optimal batchsize after analyzing the gradients? These questions are all worth further exploration. - -## Limitation and future directions -For GIAs that require pre-trained models, the key limitation is the auxiliary dataset. 
It is kind of unrealistic to claim that the dataset used for pretraining generative models (or end-to-end models) shares the same distribution with the unknown private input data, and possibly, with distinct dataset distribution, the generative performance may experience a drop. Both GIAS and GIFD use GAN with in-distribution auxiliary data to compare with previous state-of-the-art works, and GIFD paper only shows the reconstruction result of distinct distribution data when `batchsize=1` with the same label space. For the most general situation where the attacker has limited knowledge of the potential distribution of the private data, it may be still hard to recover high-quality batched data with generative networks. -Considering these limitations, it is of great value to explore algorithms to learn some general prior knowledge, especially those robust among different data distributions. - -## Conclusions -1. The existence of information discards in gradient aggregation is the tough challenge of GIAs. -2. From the prior knowledge perspective, previous GIA works provide three ways to complement information discards. -3. It may still be hard to recover batched data from gradients with limited knowledge of private data distribution. \ No newline at end of file diff --git a/_posts/2024-05-07-understanding-icl.md b/_posts/2024-05-07-understanding-icl.md deleted file mode 100644 index 8950ecd9..00000000 --- a/_posts/2024-05-07-understanding-icl.md +++ /dev/null @@ -1,1231 +0,0 @@ ---- -layout: distill -title: Understanding in-context learning in transformers -description: We propose a technical exploration of In-Context Learning (ICL) for linear regression tasks in transformer architectures. Focusing on the article Transformers Learn In-Context by Gradient Descent by J. von Oswald et al., published in ICML 2023 last year, we provide detailed explanations and illustrations of the mechanisms involved. We also contribute novel analyses on ICL, discuss recent developments and we point to open questions in this area of research. -date: 2024-05-07 -future: true -htmlwidgets: true - -# Anonymize when submitting -# authors: -# - name: Anonymous -# affiliations: -# name: Anonymous - -authors: - - name: Simone Rossi - url: "https://scholar.google.com/citations?user=lTt86awAAAAJ&hl=en" - affiliations: - name: Stellantis, France - - name: Rui Yuan - url: "https://scholar.google.com/citations?hl=en&user=4QZgrj0AAAAJ" - affiliations: - name: Stellantis, France - - name: Thomas Hannagan - url: "https://scholar.google.com/citations?hl=en&user=u6OFo3YAAAAJ" - affiliations: - name: Stellantis, France - -# must be the exact same name as your blogpost -bibliography: 2024-05-07-understanding-icl.bib - -# Add a table of contents to your post. -# - make sure that TOC names match the actual section names -# for hyperlinks within the post to work correctly. -# - please use this format rather than manually creating a markdown table of contents. -toc: - - name: What is in-context learning? - subsections: - - name: From large language models to regression tasks - - name: Objective of this blog post - - name: Preliminaries and notations - subsections: - - name: Dataset construction and tokenization - - name: A quick review of self-attention - - name: Training details - - name: Transformers can learn any linear function in-context - subsections: - - name: Linear self-attention is sufficient - - name: What is special about linear self-attention? 
- subsections: - - name: Establishing a connection between gradient descent and data manipulation - - name: Building a linear transformer that implements a gradient descent step - - name: Experiments and analysis of the linear transformer - subsections: - - name: During training a linear transformer implements a gradient descent step - - name: The effect of the GD learning rate - - name: Analytical derivation of the best GD learning rate - - name: If one layer is a GD step, what about multiple layers? - - name: Is this just for transformers? What about LSTMs? - - name: Concluding remarks - subsections: - - name: What now? - -# Below is an example of injecting additional post-specific styles. -# This is used in the 'Layouts' section of this post. -# If you use this post as a template, delete this _styles block. -_styles: > - - .center { - display: block; - margin-left: auto; - margin-right: auto; - } - - .framed { - border: 1px var(--global-text-color) dashed !important; - padding: 20px; - } - - d-article { - overflow-x: visible; - } - - .underline { - text-decoration: underline; - } - - .todo{ - display: block; - margin: 12px 0; - font-style: italic; - color: red; - } - .todo:before { - content: "TODO: "; - font-weight: bold; - font-style: normal; - } - summary { - color: steelblue; - font-weight: bold; - } - - summary-math { - text-align:center; - color: black - } - - [data-theme="dark"] summary-math { - text-align:center; - color: white - } - - details[open] { - --bg: #e2edfc; - color: black; - border-radius: 15px; - padding-left: 8px; - background: var(--bg); - outline: 0.5rem solid var(--bg); - margin: 0 0 2rem 0; - font-size: 80%; - line-height: 1.4; - } - - [data-theme="dark"] details[open] { - --bg: #112f4a; - color: white; - border-radius: 15px; - padding-left: 8px; - background: var(--bg); - outline: 0.5rem solid var(--bg); - margin: 0 0 2rem 0; - font-size: 80%; - } - .box-note, .box-warning, .box-error, .box-important { - padding: 15px 15px 15px 10px; - margin: 20px 20px 20px 5px; - border: 1px solid #eee; - border-left-width: 5px; - border-radius: 5px 3px 3px 5px; - } - d-article .box-note { - background-color: #eee; - border-left-color: #2980b9; - } - d-article .box-warning { - background-color: #fdf5d4; - border-left-color: #f1c40f; - } - d-article .box-error { - background-color: #f4dddb; - border-left-color: #c0392b; - } - d-article .box-important { - background-color: #d4f4dd; - border-left-color: #2bc039; - } - html[data-theme='dark'] d-article .box-note { - background-color: #555555; - border-left-color: #2980b9; - } - html[data-theme='dark'] d-article .box-warning { - background-color: #7f7f00; - border-left-color: #f1c40f; - } - html[data-theme='dark'] d-article .box-error { - background-color: #800000; - border-left-color: #c0392b; - } - html[data-theme='dark'] d-article .box-important { - background-color: #006600; - border-left-color: #2bc039; - } - d-article aside { - border: 1px solid #aaa; - border-radius: 4px; - padding: .5em .5em 0; - font-size: 90%; - } - .caption { - font-size: 80%; - line-height: 1.2; - text-align: left; - } ---- - -
-$$ -\definecolor{input}{rgb}{0.42, 0.55, 0.74} -\definecolor{params}{rgb}{0.51,0.70,0.40} -\definecolor{output}{rgb}{0.843, 0.608, 0} -\def\mba{\boldsymbol a} -\def\mbb{\boldsymbol b} -\def\mbc{\boldsymbol c} -\def\mbd{\boldsymbol d} -\def\mbe{\boldsymbol e} -\def\mbf{\boldsymbol f} -\def\mbg{\boldsymbol g} -\def\mbh{\boldsymbol h} -\def\mbi{\boldsymbol i} -\def\mbj{\boldsymbol j} -\def\mbk{\boldsymbol k} -\def\mbl{\boldsymbol l} -\def\mbm{\boldsymbol m} -\def\mbn{\boldsymbol n} -\def\mbo{\boldsymbol o} -\def\mbp{\boldsymbol p} -\def\mbq{\boldsymbol q} -\def\mbr{\boldsymbol r} -\def\mbs{\boldsymbol s} -\def\mbt{\boldsymbol t} -\def\mbu{\boldsymbol u} -\def\mbv{\boldsymbol v} -\def\mbw{\textcolor{params}{\boldsymbol w}} -\def\mbx{\textcolor{input}{\boldsymbol x}} -\def\mby{\boldsymbol y} -\def\mbz{\boldsymbol z} -\def\mbA{\boldsymbol A} -\def\mbB{\boldsymbol B} -\def\mbE{\boldsymbol E} -\def\mbH{\boldsymbol{H}} -\def\mbK{\boldsymbol{K}} -\def\mbP{\boldsymbol{P}} -\def\mbR{\boldsymbol{R}} -\def\mbW{\textcolor{params}{\boldsymbol W}} -\def\mbQ{\boldsymbol{Q}} -\def\mbV{\boldsymbol{V}} -\def\mbtheta{\textcolor{params}{\boldsymbol \theta}} -\def\mbzero{\boldsymbol 0} -\def\mbI{\boldsymbol I} -\def\cF{\mathcal F} -\def\cH{\mathcal H} -\def\cL{\mathcal L} -\def\cM{\mathcal M} -\def\cN{\mathcal N} -\def\cX{\mathcal X} -\def\cY{\mathcal Y} -\def\cU{\mathcal U} -\def\bbR{\mathbb R} -\def\y{\textcolor{output}{y}} -$$ -
- - -## What is in-context learning? - - -In-Context Learning (ICL) is the behavior first observed in Large Language Models (LLMs), whereby learning occurs from prompted data without modification of the weights of the model . It is a simple technique used daily and throughout the world by AI practitioners of all backgrounds, to improve generation quality and alignment of LLMs . -ICL is important because it addresses full-on the once widespread criticism that for all their impressive performance, modern deep learning models are rigid systems that lack the ability to adapt quickly to novel tasks in dynamic settings - a hallmark of biological intelligence. -By this new form of "learning during inference", Large Language Models have shown that they can be, in some specific sense (once pretrained), surprisingly versatile and few-shot learners. - - -transformer -
-**Figure 1**: Example of a simple in-context prompt for ChatGPT. -
- -Interestingly, it was around the release of [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and [GPT-3](https://arxiv.org/abs/2005.14165) that researchers observed that an auto-regressive language model pre-trained on enough data with enough parameters was capable of performing arbitrary tasks without fine-tuning, by simply prompting the model with the task with few examples and letting it generate the output. -In recent months, the research community has started to investigate the phenomenon of ICL in more details, and several papers have been published on the topic. - - -
-**Figure 2**: The number of papers published on the topic of ICL (and transformers) in the last years. Data extracted from [arxiv.org](https://arxiv.org/) on November 16th, 2023. In the last year alone, the number of papers on the topic has increased by more than 200%. -
- -
- -Specifically, since learning processes in biology and machine are often, if not always, understood in terms of iterative optimization, it is natural to ask what kind of iterative optimization is being realized during ICL, and how. - -### From large language models to regression tasks - -Though ICL is generally regarded as a phenomenon exhibited by LLMs, we now hasten to study it in a non-language, small-scale model that enables more control and where ICL can still be shown to emerge. -This simpler situation is that of a transformer model trained to regress a set of numerical data points presented in the prompt, with data points generated from a distinct function for each prompt, but where all prompts sample a function from the same general class (i.e. linear) at train and at test time. We will see that to some extent, this simplification allows for a mathematical treatment of ICL. - -The following figure gives a visual representation of the ICL setup we will consider in this blog post. -The model is a generic transformer pre-trained to solve generic linear regression tasks. At inference time, we can give the model a prompt with a new linear regression task, and it is able to solve it with surprisingly good performance. - - - -
- **Figure 3**: The model is pre-trained to regress linear functions, and frozen during inference. With different context (input points), the model can still recover the exact underlying function. Use the slider to change the linear function to regress. -
- - - - - - -### Objective of this blog post - -The objective of this blog post is to understand how ICL is possible, and to present in an interactive way what is known of its underlying mechanism. -Specifically, we will analyze the results reported in the paper *Transformers Learn In-Context by Gradient Descent* by J. von Oswald et al. recently published in ICML 2023 , which first showed that a simplified transformer model learns in-context by gradient descent. We will replicate the authors' findings and then we will complement the discussion with a number of additional insights, before pointing to open questions. We hope the reader comes out of this post with a better vision of what *fundamentally* ICL is and the open challenges that remain. - - - - -## Preliminaries and notations - -First of all we need to agree on a mathematical formalization of in-context learning. - -Before we start, let's introduce some notation and color convention that will be used throughout the rest of the blog post. -We will use the following colors to denote different quantities: - -- blue: inputs -- green: model parameters -- yellow: output - -Vectors will be denoted with bold letters, e.g. $$\mba$$, and matrices with bold capital letters, e.g. $$\mbA$$. -Additional notation will be introduced in-line when needed. - -Formally, let's define $$p(\mbx)$$ as a probability distribution over inputs $$\mbx\in\cX$$ and $$\cH$$ a class of functions $$h: \cX \rightarrow \cY$$. -You can think of $$\cH$$ as a set of functions that share some common properties, for example, the set of all linear functions, or the set of all functions that can be represented by a neural network with a given architecture. -Also, let's define $$p(h)$$ as a probability measure over $$\cH$$. - - - -
- **Figure 4**: Visual representation of various parametric function classes (linear, sinusoidal, shallow neural network). Use the dropdown menu to select the function class. -
- -
- -Following the terminology of the LLM community, let's define a *prompt* $$P$$ of length $$C$$ as a *sequence* of $$2C+1$$ points $$(\mbx_0, h(\mbx_0), \ldots, \mbx_{C-1}, h(\mbx_{C-1}), \mbx_{\text{query}})$$ where inputs ($$\mbx_i$$ and $$\mbx_{\text{query}}$$) are independently and identically drawn from $$p(\mbx)$$, and $$h$$ is drawn from $$\cH$$. In short we will also write $$P_C = \left[\{\mbx_i, h(\mbx_i)\}_{i=0}^{C-1}, \mbx_\text{query}\right]$$. - - -
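Given a per-task loss $$\ell$$ (for regression, e.g., the squared error), we can measure how well a model $$f$$ solves a prompt with $$C$$ in-context examples through the expected in-context error:

$$
\begin{equation}
\label{eq:in-context-error}
\mathbb{E}\left[\ell\left(f(P_C), h\left(\mbx_{\text{query}}\right)\right) \right]
\end{equation}
$$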
- **Note**: The expectation in Equation \eqref{eq:in-context-error} is taken over the randomness of the input and the function. This means that we are considering the average performance of the model over all possible inputs and functions in $$\cH$$. -
- - -
- -
- Additional details on the ICL formalism - -We can also define the ICL problem through the lens of statistical learning theory. -Suppose $$\ell$$ the same per-task loss function as described above. -Let's define the following loss $$\cL:\cF\rightarrow\bbR$$: - -$$ -\begin{equation} - \cL_C(f) = \mathbb{E}\left[\ell\left(f(P_C), h\left(\mbx_{\text{query}}\right)\right) \right] -\end{equation} -$$ - -Let's define $$f_C$$ as the model that minimizes the loss with $$C$$ in-context examples: - -$$ -\begin{equation} -f_C = \arg\min_{f\in\cF} \cL_C(f) -\end{equation} -$$ - -and $$f_\infty$$ as the model that minimizes the loss with an infinite number of in-context examples: - -$$ -\begin{equation} - f_\infty = \arg\min_{f\in\cF} \cL_\infty(f) -\end{equation} -$$ - -We say that a class of transformer models $$\cF$$ learns in-context for a function class $$\cH$$ if, for any $$\epsilon > 0$$, there exists a model $$f\in\cF$$ such that the following inequality holds: - -$$ -\begin{equation} -\mathbb{P} \left[ \cL( f_C) - \cL( f_\infty) \leq \epsilon \right] \geq 1 - \delta -\end{equation} -$$ - -In other words, the last equation says that a class of transformer models $$\cF$$ learns in-context for a function class $$\cH$$ if, for any $$\epsilon > 0$$, there exists a model $$f\in\cF$$ such that the difference between the loss of the model trained with $$C$$ in-context examples and the loss of the model trained with an infinite number of in-context examples is smaller than $$\epsilon$$ with probability at least $$1-\delta$$. - -Additionally, we can look at the consistency property, defined as: - -$$ -\begin{equation} - \lim_{C\rightarrow\infty} \mathbb{P} \left[ \cL( f_C) - \cL( f_\infty) \geq \epsilon \right] = 0 -\end{equation} -$$ - -This equation signifies that the difference between the loss of the model trained with $$C$$ in-context examples and the loss of the model trained with an infinite number of in-context examples converges to zero as $$C$$ goes to infinity. - -
-
- - -### Dataset construction and tokenization - -For our setup, we will consider a linear regression problem, where the goal is to learn a linear function $$h_{\mbw}(\mbx) = \mbw^\top\mbx$$, with $$\mbw\in\bbR^D$$, from a set of in-context examples $$\{\mbx_i, \y_i\}_{i=0}^{C-1}$$, where $$\mbx_i\in\bbR^D$$ and $$\y_i\in\bbR$$. -So $$h_{\mbw} \in \cH$$. - -In order to better understand how the prompt is constructed starting from a regression task, let's consider the following visual example: - - - -
- **Figure 5**: Visualization of the data construction process, from the regression dataset, to the input prompt and the tokenization. -
- -
- -The figure shows a visual representation of the construction of a single input prompt. -In particular, we first sample a weight $$\mbw$$ from the distribution $$p(\mbw)$$, and then we sample $$C$$ inputs $$\mbx_i$$ from $$p(\mbx)$$, where $$C$$ is the fixed context size. -Finally, we compute the corresponding outputs $$\y_i = \mbw^\top\mbx_i$$. -We consider $$p(\mbx) = \cU(-1, 1)$$, where $$\cU$$ is the uniform distribution, and $$p(\mbw) = \cN(\mbzero, \alpha^2\mbI)$$, where $$\cN$$ is a multivariate Gaussian distribution of dimension $$D$$, with $$0$$ mean and $$\alpha$$ standard deviation. - - -Defining $$c=C+1$$ and $$d=D+1$$, where $$C$$ is the context size and $$D$$ is the input dimension, we can represent the input as a matrix $$\mbE\in\bbR^{d\times c}$$ (also referred to as *token embeddings* or, simply, *embeddings*), where the first $$C$$ columns represent the context inputs $$\mbx_i$$ and output $$\y$$ and the last column represents the query input $$\mbx_{\text{query}}$$ with $$0$$ padding. - - -To construct a batch of regression problems, we just repeat the above procedure $$N$$ times with the fixed context size $$C$$, where $$N$$ is the size of the batch. - - - -### A quick review of self-attention - -In this section we will briefly review the self-attention mechanism, which is the core component of the transformer architecture . - -Let $$\mbW^K, \mbW^Q \in \bbR^{d_k\times d}$$, $$\mbW^V \in \bbR^{d_v\times d}$$ and $$\mbW^P \in \bbR^{d \times d_v}$$ the key, query, value and projection weight matrices respectively. -Given an embedding $$\mbE\in\bbR^{d\times c}$$, the softmax self-attention layer implements the following operation, - -$$ -\begin{equation} -\label{eq:softmax-self-attention} - f_\text{attn} (\mbtheta_\text{attn}, \mbE) = \mbE + \mbW^P \mbW^V \mbE \sigma\left(\frac{(\mbW^K \mbE)^\top \mbW^Q \mbE}{\sqrt{d}}\right), -\end{equation} -$$ - -with $$\mbtheta_\text{attn}=\{\mbW^K, \mbW^Q, \mbW^V, \mbW^P\}$$, where for simplicity we will consider $$d_k=d_v=d$$, and $$\sigma(\cdot)$$ is the softmax function applied column-wise. -It's simple to verify that the output dimension of $$f_\text{attn}$$ is the same as the input dimension. -To simplify further, we can also define the value, key and query matrices as $$\mbV = \mbW^V\mbE$$, $$\mbK = \mbW^K\mbE$$, $$\mbQ = \mbW^Q\mbE$$, respectively. - - - - - -### Training details - - - -
- Figure 6: Visualization of the pre-training process. The model is trained to minimize the loss function defined in Equation \eqref{eq:pre-train-loss-expectation}. -
- -
-
-Once the dataset is created, we can train the model using the following objective:
-
-$$
-\begin{equation}
-\label{eq:pre-train-loss-expectation}
-\cL(\mbtheta) = \mathbb{E}\left\|f\left(\mbtheta, \left[\{\mbx_i, \y_i\}_{i=0}^{C-1}, \mbx_\text{query}\right]\right) - \y_{\text{query}}\right\|^2,
-\end{equation}
-$$
-
-where the expectation is taken over $$p(\mbx)$$ and $$p(\mbw)$$, with $$h_{\mbw}(\mbx) = \mbw^\top\mbx$$.
-Note that the output of the model is a sequence of $$C+1$$ values, i.e. of the same length as the input prompt, and the loss is computed only on the last value of the sequence, which corresponds to the predicted query output $$\widehat\y_{\text{query}}$$.
-Specifically, to read out the prediction for $$\mbx_{\text{query}}$$, we multiply this last value by $$-1$$.
-Note that this choice is completely transparent during model training, as it is equivalent to simply changing the sign of a few elements in the projection weight matrix $$\mbW^P$$.
-The reason for this will be clear in the following sections.
-At each training iteration, we replace the expectation with an empirical average over a batch of $$N$$ regression tasks, each made of a different set of context points $$\{\mbx_i^{(n)}, \y_i^{(n)}\}_{i=0}^{C-1}$$, and a query input/target pair, $$\mbx^{(n)}_\text{query}$$ and $$\y^{(n)}_{\text{query}}$$, respectively.
-Note that because of the on-line creation of the dataset, during training the model will never see the same regression task twice.
-
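-To make this concrete, here is a minimal sketch of how such a batch of prompts could be generated, written in JAX (which the post's experiments use); all names here are ours, not the authors':
-
-```python
-import jax
-import jax.numpy as jnp
-
-def sample_task(key, C, D, alpha=1.0):
-    """Sample one linear regression task and build its token matrix E."""
-    kw, kx, kq = jax.random.split(key, 3)
-    w = alpha * jax.random.normal(kw, (D,))              # w ~ N(0, alpha^2 I)
-    x = jax.random.uniform(kx, (C, D), minval=-1.0, maxval=1.0)
-    y = x @ w                                            # context targets
-    x_query = jax.random.uniform(kq, (D,), minval=-1.0, maxval=1.0)
-    # Tokens are columns (x_i, y_i); the query column is zero-padded.
-    E = jnp.concatenate(
-        [jnp.concatenate([x, y[:, None]], axis=1),
-         jnp.concatenate([x_query, jnp.zeros(1)])[None, :]],
-        axis=0,
-    ).T                                                  # shape (D + 1, C + 1)
-    return E, x_query @ w
-
-def sample_batch(key, N, C, D, alpha=1.0):
-    keys = jax.random.split(key, N)
-    return jax.vmap(sample_task, in_axes=(0, None, None, None))(keys, C, D, alpha)
-```
-
-Each training step can then draw a fresh batch from a new PRNG key, which matches the on-line dataset creation described above.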
- Code for the transformer loss - This is the code for the loss computation, including the reading out of the query output. - - -
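-
-A minimal sketch of what this computation could look like, assuming any model `f(params, E)` that returns an output sequence of the same shape as `E` (the readout convention, including the multiplication by $$-1$$, is the one described above):
-
-```python
-import jax
-import jax.numpy as jnp
-
-def readout(E_out):
-    """Predicted query output: last row of the last token, times -1."""
-    return -E_out[-1, -1]
-
-def batch_loss(params, E_batch, y_query_batch, f):
-    """Empirical version of the pre-training loss above."""
-    preds = jax.vmap(lambda E: readout(f(params, E)))(E_batch)
-    return jnp.mean((preds - y_query_batch) ** 2)
-```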
- - - - - - - -## Transformers can learn any linear function in-context - -
-With all the preliminaries and notations in place, we can now start to analyze some results regarding the ability of transformers to learn linear functions in-context.
-One of the first papers that studied the ability of transformers to learn linear functions in-context is *What Can Transformers Learn In-Context? A Case Study of Simple Function Classes* by S. Garg et al.
-We will first replicate their results using a simpler configuration: up to 5 layers, single-head attention, and 64 embedding units, for total parameter counts of 17K, 34K, 50K, 67K, and 84K, respectively.
-
-In the figure below, we report the in-context test loss (as defined in Equation \eqref{eq:in-context-test-loss}) for each model configuration, for various context sizes $$C$$, from 2 to 100.
- - - -
- Figure 7: Transformers can learn linear functions in-context, reasonably well. The test loss decreases as the context size increases, and as the number of layers increases. -
- -
-
-The experiment above shows that the test loss diminishes for larger context sizes, and also as the number of layers increases. These two main effects are clearly expected, as consequences of more data points and more compute, respectively, and they replicate the findings of Garg et al.
-
-### Linear self-attention is sufficient
-
-From this point, we will depart from the classic softmax self-attention layer, and restrict our study to a linear self-attention layer, which is the setting considered in the paper of J. von Oswald et al.
-Recently, a number of papers have drawn connections between linear transformers and *Fast Weight Programmers*, and have shown that linearized self-attention layers can be used to replace the softmax self-attention layer in transformers, with the advantage of reducing the computational complexity of the attention operation.
-
-A **linear self-attention** layer updates the embeddings $$\mbE$$ as follows:
-
-$$
-\begin{equation}
-  f_\text{linattn} (\mbtheta_\text{linattn}, \mbE) = \mbE + \frac{\mbW^P \mbV\left(\mbK^\top \mbQ \right)}{\sqrt{d}},
-\end{equation}
-$$
-
-with $$\mbV, \mbK, \mbQ$$ being the value, key and query matrices defined right after Equation \eqref{eq:softmax-self-attention}.
-
-Now, to analyze whether a linear self-attention layer is sufficient to learn linear functions in-context, we can use the same experimental setup as before, but replacing the softmax self-attention layer with a linear self-attention layer.
-
-Additionally, we also strip down the transformer to its bare minimum, i.e. we remove the normalization, the embedding layer, and the feed-forward layer, and we only use a single head. The only remaining component is the linear self-attention layer.
-Therefore, in the following we use the term "linear transformer" to refer to this simplified model.
-Code for the linear transformer
-This is the code for the linear transformer, without any normalization, embedding, or feed-forward layers, and with a single head.
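-
-A minimal sketch of this simplified model, directly implementing the linear self-attention equation above (names are ours):
-
-```python
-import jax.numpy as jnp
-
-def linear_attention(params, E):
-    """One linear self-attention layer: E <- E + W_P V (K^T Q) / sqrt(d)."""
-    WK, WQ, WV, WP = params["WK"], params["WQ"], params["WV"], params["WP"]
-    K, Q, V = WK @ E, WQ @ E, WV @ E
-    return E + (WP @ V @ (K.T @ Q)) / jnp.sqrt(E.shape[0])
-
-def linear_transformer(layer_params, E):
-    """A stack of linear self-attention layers: no norm, no MLP, single head."""
-    for params in layer_params:
-        E = linear_attention(params, E)
-    return E
-```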
- -We test the linear transformer on the same dataset setup as before, and we will use the same number of layers as before, i.e. 1, 2, 3, 4, 5. - - - -
- **Figure 8**: Linear transformers can also learn linear functions in-context, reasonably well. The test loss decreases as the context size increases, and as the number of layers increases. -
- -
-
-## What is special about linear self-attention?
-
-From the previous section we have seen that a linear self-attention layer is sufficient to learn linear functions in-context.
-In this section we will try to understand why this is the case, starting from a review of least-squares regression and gradient descent.
-
-### Establishing a connection between gradient descent and data manipulation
-
-In this section, we establish an important connection that will be fundamental to understanding the mechanism behind ICL with linear self-attention. To do so, we start from a simple linear regression problem, and we show that we can achieve the same loss after *one* gradient step by changing the inputs and the targets, while keeping the weights fixed.
-
-The loss for a linear regression problem is defined as:
-$$
-\begin{equation}
-\label{eq:linear-regression-loss}
-\cL_{\text{lin}}\left(\mbw, \{\mbx_i, {\y}_i\}_{i=0}^{C-1}\right) = \frac 1 {2C} \sum_{i=0}^{C-1} (\mbw^\top\mbx_i - \y_i)^2
-\end{equation}
-$$
-
-where $$\mbw\in\bbR^D$$, $$\mbx_i\in\bbR^D$$ and $$\y_i\in\bbR$$. With a given learning rate $$\eta$$, the gradient descent update is $$\mbw \leftarrow \mbw - \Delta \mbw$$, where
-$$
-\begin{equation}
-\label{eq:linear-regression-gd-gradient}
-\Delta \mbw = \eta \nabla_{\mbw} \cL_{\text{lin}}\left(\mbw, \{\mbx_i, {\y}_i\}_{i=0}^{C-1}\right) = \frac{\eta}{C} \sum_{i=0}^{C-1} \left(\mbw^\top\mbx_i - \y_i\right)\mbx_i
-\end{equation}
-$$
-The corresponding loss (after the update) is:
-$$
-\begin{equation}
-\label{eq:linear-regression-loss-after-gd}
-\cL_{\text{lin}}\left(\mbw - \Delta \mbw, \{\mbx_i, {\y}_i\}_{i=0}^{C-1}\right) = \frac 1 {2C} \sum_{i=0}^{C-1} \left(\mbw^\top\mbx_i - \y_i - \Delta \mbw^\top\mbx_i\right)^2
-\end{equation}
-$$
-
-It is now easy to see that if we define $$\widehat{\mbx}_i = \mbx_i$$ and $$\widehat{\y}_i = \y_i + \Delta \mbw^\top\mbx_i$$, we can compute Equation \eqref{eq:linear-regression-loss} with the new inputs and targets, i.e. $$\cL_{\text{lin}}(\mbw, \{\widehat{\mbx}_i, \widehat{\y}_i\}_{i=0}^{C-1})$$, and obtain exactly the loss after the gradient descent update (Equation \eqref{eq:linear-regression-loss-after-gd}).
-
-### Building a linear transformer that implements a gradient descent step
-
-As we just saw, the starting intuition is that we can emulate a gradient step on the linear regression loss by manipulating the inputs and the targets.
-This is the *key insight* of von Oswald et al. that allows us to draw a connection between the gradient descent dynamics and the linear transformer.
-
-Before stating the main result, recall the definitions of value, key and query as $$\mbV = \mbW^V\mbE$$, $$\mbK = \mbW^K\mbE$$, and $$\mbq_j = \mbW^Q\mbe_j$$.
-
-**Main result**:
-Given a 1-head linear attention layer and the tokens $$\mbe_j = (\mbx_j, \y_j)$$, for $$j=0,\ldots,C-1$$, we can construct key, query and value matrices $$\mbW^K, \mbW^Q, \mbW^V$$ as well as the projection matrix $$\mbW^P$$ such that a transformer step on every token, $$\mbe_j \leftarrow (\mbx_j, \y_{j}) + \mbW^{P} \mbV \mbK^{\top}\mbq_{j}$$, is identical to the target-manipulation dynamics $$\mbe_j \leftarrow (\mbx_j, \y_j + \Delta \mbw^\top\mbx_j) = (\mbx_j, \widehat{\y}_j)$$ induced by the gradient descent step above. The dynamics are identical for the query token $$(\mbx_{\text{query}}, 0)$$, and, for the initialization $$\mbw_0 = \mbzero$$, the $$-1$$ readout of its second component recovers the post-update prediction $$-\Delta\mbw^\top\mbx_{\text{query}} = (\mbw_0 - \Delta\mbw)^\top\mbx_{\text{query}}$$.
-
-For notation, we will identify with $$\mbtheta_\text{GD}$$ the set of parameters of the linear transformer that implements a gradient descent step.
-
-Such a construction is not unique; a possible construction, in block form, is as follows:
-
-$$
-\begin{align}
-\mbW^K = \mbW^Q = \left(\begin{array}{@{}c c@{}}
-  \mbI_D & 0 \\
-  0 & 0
-\end{array}\right)
-\end{align}
-$$
-
-with $$\mbI_D$$ the identity matrix of size $$D$$, and
-
-$$
-\begin{align}
-\mbW^V = \left(\begin{array}{@{}c c@{}}
-  0 & 0 \\
-  \mbw_0^\top & -1
-\end{array}\right)
-\end{align}
-$$
-
-with $$\mbw_0 \in \bbR^{D}$$ the weight vector of the linear model, and $$\mbW^P = \frac{\eta}{C}\mbI_{d}$$, with $$\mbI_d$$ the identity matrix of size $$d$$.
-
-A numerical sanity check of this construction is sketched below, and if you are interested in the full proof for the GD-equivalent transformer, you can find it in the collapsible section that follows.
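-
-The following sketch (under our notation; $$\eta$$ and $$\mbw_0$$ are free choices) checks numerically that one linear-attention step with the matrices above rewrites the targets as $$\widehat{\y}_j = \y_j + \Delta\mbw^\top\mbx_j$$:
-
-```python
-import jax
-import jax.numpy as jnp
-
-D, C, eta = 2, 8, 0.1
-kw, kx = jax.random.split(jax.random.PRNGKey(0))
-w_true = jax.random.normal(kw, (D,))
-x = jax.random.uniform(kx, (C, D), minval=-1.0, maxval=1.0)
-y = x @ w_true
-E = jnp.concatenate([x, y[:, None]], axis=1).T        # (D+1, C) context tokens
-
-w0 = jnp.zeros(D)                                     # initial linear model
-WK = WQ = jnp.block([[jnp.eye(D), jnp.zeros((D, 1))],
-                     [jnp.zeros((1, D)), jnp.zeros((1, 1))]])
-WV = jnp.block([[jnp.zeros((D, D)), jnp.zeros((D, 1))],
-                [w0[None, :], -jnp.ones((1, 1))]])
-WP = (eta / C) * jnp.eye(D + 1)
-
-K, Q, V = WK @ E, WQ @ E, WV @ E
-E_new = E + WP @ V @ (K.T @ Q)                        # one attention step
-
-# Gradient of the least-squares loss at w0, scaled by the learning rate.
-delta_w = (eta / C) * ((x @ w0 - y) @ x)
-assert jnp.allclose(E_new[:D], E[:D])                 # inputs are untouched
-assert jnp.allclose(E_new[D], y + x @ delta_w, atol=1e-5)   # targets become y_hat
-```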
-Proof of construction for the GD-equivalent transformer
-
-To verify this, first remember that if $$\mbA$$ is a matrix of size $$N\times M$$ and $$\mbB$$ is a matrix of size $$M\times P$$,
-
-$$
-\begin{align}
-\mbA\mbB = \sum_{i=1}^M \mba_i\otimes\mbb_{,i}
-\end{align}
-$$
-
-where $$\mba_i \in \bbR^{N}$$ is the $$i$$-th column of $$\mbA$$, $$\mbb_{,i} \in \bbR^{P}$$ is the $$i$$-th row of $$\mbB$$, and $$\otimes$$ is the outer product between two vectors.
-
-It is easy to verify that with this construction we obtain the following dynamics
-
-$$
-\begin{align}
-\left(\begin{array}{@{}c@{}}
-\mbx_j\\
-\y_j
-\end{array}\right)
-\leftarrow &
-\left(\begin{array}{@{}c@{}}
-\mbx_j\\
-\y_j
-\end{array}\right) + \mbW^{P} \mbV \mbK^{\top}\mbq_{j} = \mbe_j + \frac{\eta}{C} \sum_{i={0}}^{C-1} \left(\begin{array}{@{}c c@{}}
-0 & 0 \\
-\mbw_0 & -1
-\end{array}\right)
-\left(\begin{array}{@{}c@{}}
-\mbx_i\\
-\y_i
-\end{array}\right)
-\otimes
-\left(
-\left(\begin{array}{@{}c c@{}}
-\mbI_D & 0 \\
-0 & 0
-\end{array}\right)
-\left(\begin{array}{@{}c@{}}
-\mbx_i\\
-\y_i
-\end{array}\right)
-\right)
-\left(\begin{array}{@{}c c@{}}
-\mbI_D & 0 \\
-0 & 0
-\end{array}\right)
-\left(\begin{array}{@{}c@{}}
-\mbx_j\\
-\y_j
-\end{array}\right)\\
-&= \left(\begin{array}{@{}c@{}}
-\mbx_j\\
-\y_j
-\end{array}\right) + \frac{\eta}{C} \sum_{i={0}}^{C-1} \left(\begin{array}{@{}c@{}}
-0\\
-\mbw_0^\top \mbx_i - \y_i
-\end{array}\right)
-\otimes
-\left(\begin{array}{@{}c@{}}
-\mbx_i\\
-0
-\end{array}\right)
-\left(\begin{array}{@{}c@{}}
-\mbx_j\\
-0
-\end{array}\right) =
-\left(\begin{array}{@{}c@{}}
-\mbx_j\\
-\y_j
-\end{array}\right) + \left(\begin{array}{@{}c@{}}
-0\\
-\frac{\eta}{C}\sum_{i=0}^{C-1} \left( \left(\mbw_0^\top\mbx_i - \y_i\right)\mbx_i\right)^\top \mbx_j
-\end{array}\right),
-\end{align}
-$$
-
-that is, $$\y_j \leftarrow \y_j + \Delta\mbw^\top\mbx_j = \widehat{\y}_j$$, with the gradient $$\Delta\mbw$$ evaluated at $$\mbw_0$$: exactly the target manipulation that we showed above to be equivalent to a gradient descent step.
-
-Note that the update for the query token $$(\mbx_{\text{query}}, \textcolor{output}{0})$$ is identical to the update for the context tokens $$(\mbx_j, \y_j)$$ for $$j=0,\ldots,C-1$$.
-
- - - - -## Experiments and analysis of the linear transformer - -Now let's do some experiments to verify the theoretical results. -We will work within the same experimental setup as before with the same dataset construction, training procedure and testing procedure. -In this first section, we consider a linear transformer with a single layer, and the transformer built as described in the previous section (the GD-equivalent transformer), i.e. with a linear self-attention layer that implements a gradient descent step. - -### During training, a linear transformer learns to implement a gradient descent step - -We now study the evolution of the test loss of a linear transformer during training $$\cL(\mbtheta)$$, and compare it to the loss of a transformer implementing a gradient descent step $$\cL(\mbtheta_\text{GD})$$. - - -
- **Figure 9**: The loss of a trained linear transformer converges to the loss of a transformer implementing a gradient descent step on the least-squares regression loss with the same dataset. Use the slider to change the context size. -
- - - -
-
-Although an empirical proof of such a functional equivalence would require checking the outputs for all possible test samples, we can try to gather more evidence by considering more closely the computations that unfold in the linear transformer during one pass.
-
-To better understand the dynamics of the linear transformer, we now study the evolution of a few metrics during training (the *L2 error for predictions*, the *L2 error for gradients* and the *cosine similarity* between models).
-Metrics details - -The metrics introduced above are defined as follows: - -- **L2 error (predictions)** measures the difference between the predictions of the linear transformer and the predictions of the transformer implementing a gradient descent step and it is defined as $$\left\|f\left(\mbtheta, \left[\{\mbx_i, \y_i\}_{i=0}^{C-1}, \mbx_\text{query}\right]\right) - f\left(\mbtheta_\text{GD}, \left[\{\mbx_i, \y_i\}_{i=0}^{C-1}, \mbx_\text{query}\right]\right) \right\|^2$$; - -- **L2 error (gradients w.r.t. inputs)** measures the difference between the gradients of the linear transformer and the gradients of the transformer implementing a gradient descent step and it is defined as $$\left\|\nabla_{\mbx_\text{query}} f\left(\mbtheta, \left[\{\mbx_i, \y_i\}_{i=0}^{C-1}, \mbx_\text{query}\right]\right) - \nabla_{\mbx_\text{query}} f\left(\mbtheta_\text{GD}, \left[\{\mbx_i, \y_i\}_{i=0}^{C-1}, \mbx_\text{query}\right]\right) \right\|^2$$; - -- **Model cosine similarity (gradients w.r.t. inputs)** measures the cosine similarity between the gradients of the linear transformer and the gradients of the transformer implementing a gradient descent step and it is defined as $$\cos\left(\nabla_{\mbx_\text{query}} f\left(\mbtheta, \left[\{\mbx_i, \y_i\}_{i=0}^{C-1}, \mbx_\text{query}\right]\right), \nabla_{\mbx_\text{query}} f\left(\mbtheta_\text{GD}, \left[\{\mbx_i, \y_i\}_{i=0}^{C-1}, \mbx_\text{query}\right]\right)\right)$$. - -
- -
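-
-For reference, here is a sketch of how these three metrics could be computed for a single prompt $$\mbE$$, given any `model(params, E)` callable like the ones sketched above (in practice, the metrics are averaged over a batch of tasks):
-
-```python
-import jax
-import jax.numpy as jnp
-
-def query_pred(params, E, model):
-    """Prediction for the query token, with the -1 readout."""
-    return -model(params, E)[-1, -1]
-
-def comparison_metrics(params_tf, params_gd, E, model):
-    """L2 error of predictions, plus L2 error and cosine similarity of the
-    gradients w.r.t. the query input (first D entries of the last column)."""
-    def pred_from_query(params, x_query):
-        return query_pred(params, E.at[:-1, -1].set(x_query), model)
-
-    x_query = E[:-1, -1]
-    p1 = query_pred(params_tf, E, model)
-    p2 = query_pred(params_gd, E, model)
-    g1 = jax.grad(pred_from_query, argnums=1)(params_tf, x_query)
-    g2 = jax.grad(pred_from_query, argnums=1)(params_gd, x_query)
-    l2_pred = (p1 - p2) ** 2
-    l2_grad = jnp.sum((g1 - g2) ** 2)
-    cos_sim = jnp.dot(g1, g2) / (jnp.linalg.norm(g1) * jnp.linalg.norm(g2))
-    return l2_pred, l2_grad, cos_sim
-```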
- - -
- **Figure 10**: Comparison between the linear transformer and the GD-transformer during training. The predictions of the linear transformer converge to the predictions of the GD-transformer and the gradients of the linear transformer converge to the gradients of the GD-transformer. Use the slider to change the context size. -
- -
-
-From this figure, we see that the predictions of the linear transformer converge to the predictions of the GD-transformer, and the gradients of the linear transformer converge to the gradients of the GD-transformer.
-Notably, this is true for all context sizes, though the convergence is faster for larger $$C$$.
-
-As a final visualization, we can also look at the evolution of the gradients of the linear transformer during training, as shown in the figure below. In this animation, we take six different regression tasks and we plot the gradients of the linear transformer during training together with the exact gradients of the least-squares regression loss.
-
- Figure 11: Animation of the gradients of the linear transformer during training. The loss landscape visualized is the least-squares regression loss (each task has its own loss). The gradients of the linear transformer are shown in red, while the gradients of the least-squares regression loss are shown in orange. -
-
-To reiterate, the loss landscape visualized is the least-squares regression loss, and each task is a different linear regression problem with a different loss landscape.
-Once more, this visualization shows that the linear transformer is not learning a single regression model, but is learning to solve linear regression problems.
-
-### The effect of the GD learning rate
-
-Next, we study the effect of the GD learning rate on the test loss of the GD-equivalent transformer.
-We believe this is an important point of discussion, which was covered only briefly in the paper.
-
-Since the construction of the GD-transformer leaves the learning rate $$\eta$$ as a free parameter, the simplest way to set it is a line search: sweep over candidate values of $$\eta$$ and keep the one that minimizes the test loss of the GD-transformer.
-Indeed, this is the same procedure we have used to find the optimal GD learning rate for our previous experiments.
-We now show what happens if we use a different GD learning rate than the one found with line search.
-In the following experiment, we visualize this behavior by plotting the metrics described above for different values of the GD learning rate.
- Figure 12: Effect of the GD learning rate on the alignment between the linear transformer and the GD-transformer. The agreement between the two is maximized for a specific GD learning rate, which must be found by line search. Use the slider to manually change the GD learning rate. -
- -
-
-### Analytical derivation of the best GD learning rate
-
-It turns out that a line search to find the best GD learning rate is not necessary.
-
-The analytical solution is provided below, with its derivation reported in the collapsible section immediately following.
-Analytical derivation of the best GD learning rate
-
-We are interested in finding the optimal learning rate for the GD-transformer, which by construction (see the main result above) is equivalent to finding the optimal GD learning rate for the least-squares regression problem. Consequently, the analysis can be carried out on the least-squares regression problem \eqref{eq:linear-regression-loss}.
-
-Recall the GD update of the least-squares regression in \eqref{eq:linear-regression-gd-gradient}, without taking the learning rate into account. That is,
-
-$$
-\begin{equation}
-\label{eq:linear-regression-gd-gradient-no-lr}
-\Delta \mbw = \nabla_{\mbw}
-\cL_{\text{lin}}\left(\mbw, \{\mbx_i, \y_i\}_{i=0}^{C-1}\right) =
-\frac{1}{C} \sum_{i=0}^{C-1} \left(\mbw^\top\mbx_i - \y_i\right)\mbx_i.
-\end{equation}
-$$
-
-Now we consider the test loss of the least-squares regression defined as
-
-$$
-\begin{equation}
-\cL_\mathrm{lin, te}(\{\mbw^{(n)}\}_{n=0}^{N-1}) = \frac{1}{N} \sum_{n=0}^{N-1} ((\mbx^{(n)}_\text{query})^\top \mbw^{(n)} - \y^{(n)}_\text{query})^2,
-\end{equation}
-$$
-
-where $$N$$ is the number of queries, which equals the number of regression tasks in the in-context test dataset.
-Similar to \eqref{eq:linear-regression-loss-after-gd}, after one step of the GD update \eqref{eq:linear-regression-gd-gradient-no-lr}, the corresponding test loss becomes
-
-$$
-\begin{align}
-&\quad \ \ \cL_\mathrm{lin, te}(\{\mbw^{(n)} - \eta \Delta \mbw^{(n)}\}_{n=0}^{N-1}) \nonumber \\
-&= \frac{1}{N} \sum_{n=0}^{N-1} \left((\mbx^{(n)}_\text{query})^\top (\mbw^{(n)} - \eta \Delta \mbw^{(n)}) - \y^{(n)}_\text{query}\right)^2 \nonumber \\
-&= \frac{1}{N} \sum_{n=0}^{N-1} \left((\mbx^{(n)}_\text{query})^\top \mbw^{(n)} - \y^{(n)}_\text{query} - \eta (\mbx^{(n)}_\text{query})^\top \Delta \mbw^{(n)} \right)^2 \nonumber \\
-&= \frac{\eta^2}{N} \sum_{n=0}^{N-1} ((\mbx^{(n)}_\text{query})^\top \Delta \mbw^{(n)})^2
-+ \cL_\mathrm{lin, te}(\{\mbw^{(n)}\}_{n=0}^{N-1}) \nonumber \\
-&\quad \ - \frac{2\eta}{N} \sum_{n=0}^{N-1} ((\mbx^{(n)}_\text{query})^\top \mbw^{(n)} - \y^{(n)}_\text{query})(\mbx^{(n)}_\text{query})^\top \Delta \mbw^{(n)}. \label{eq:loss_query_W1}
-\end{align}
-$$
-
-One can choose the optimal learning rate $$\eta^*$$ such that $$\cL_\mathrm{lin, te}(\{\mbw^{(n)} - \eta \Delta \mbw^{(n)}\}_{n=0}^{N-1})$$ achieves its minimum with respect to the learning rate $$\eta$$. That is,
-
-$$
-\begin{align}
-\eta^* \in \arg\min_{\eta > 0} \cL_\mathrm{lin, te}(\{\mbw^{(n)} - \eta \Delta \mbw^{(n)}\}_{n=0}^{N-1}).
-\end{align}
-$$
-
-To obtain $$\eta^*$$, it suffices to solve
-
-$$
-\begin{align}
-\nabla_\eta \cL_\mathrm{lin, te}(\{\mbw^{(n)} - \eta \Delta \mbw^{(n)}\}_{n=0}^{N-1}) = 0.
-\end{align}
-$$
-From \eqref{eq:loss_query_W1}, and plugging in $$\Delta \mbw^{(n)}$$ from \eqref{eq:linear-regression-gd-gradient-no-lr}, we obtain
-$$
-\begin{align}
-\eta^* &= \frac{\sum_{n=0}^{N-1} ((\mbx^{(n)}_\text{query})^\top \mbw^{(n)} - \y^{(n)}_\text{query})(\mbx^{(n)}_\text{query})^\top \Delta \mbw^{(n)} }
-{\sum_{n=0}^{N-1} ((\mbx^{(n)}_\text{query})^\top \Delta \mbw^{(n)})^2} \nonumber \\
-&= C \frac{\sum_{n=0}^{N-1} ((\mbx^{(n)}_\text{query})^\top \mbw^{(n)} - \y^{(n)}_\text{query}) \sum_{i=0}^{C-1} ((\mbw^{(n)})^\top \mbx_i^{(n)} - \y_i^{(n)})(\mbx_i^{(n)})^\top \mbx^{(n)}_\text{query}}
-{\sum_{n=0}^{N-1} \left( \sum_{i=0}^{C-1} ((\mbw^{(n)})^\top \mbx_i^{(n)} - \y_i^{(n)})(\mbx_i^{(n)})^\top \mbx^{(n)}_\text{query} \right)^2}.
-\end{align}
-$$
-Finally, for the initialization $$\mbw^{(n)} = 0$$ for $$n = 0, \ldots, N-1$$, the optimal learning rate simplifies to
-$$
-\begin{align}
-\eta^* = C \frac{\sum_{n=0}^{N-1} \y^{(n)}_\text{query} \left(\sum_{i=0}^{C-1}\left( \y^{(n)}_i{\left(\mbx^{(n)}_i\right)}^\top \mbx_\text{query}^{(n)}\right)\right)
-}{\sum_{n=0}^{N-1} \left(\sum_{i=0}^{C-1}\left(\y^{(n)}_i {\left(\mbx^{(n)}_i\right)}^\top \mbx_\text{query}^{(n)}\right)\right)^2}.
-\end{align}
-$$
- -
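-
-As a sketch, here is our own vectorized implementation of the last equation (for the $$\mbw^{(n)} = 0$$ case):
-
-```python
-import jax.numpy as jnp
-
-def optimal_gd_lr(x, y, x_query, y_query, C):
-    """Analytical eta*: x is (N, C, D), y is (N, C),
-    x_query is (N, D), y_query is (N,)."""
-    # s_n = sum_i y_i^(n) <x_i^(n), x_query^(n)>
-    s = jnp.einsum("nc,ncd,nd->n", y, x, x_query)
-    return C * jnp.sum(y_query * s) / jnp.sum(s ** 2)
-```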
-
-#### Some comments on the analytical solution
-
-This derivation of the optimal GD learning rate $$\eta^*$$ agrees well with the line search procedure (up to the numerical precision of the line search procedure itself).
-While this is expected, let's take a moment to understand why this is the case.
-
-1. The analytical solution is obtained starting from the linear regression loss, while the line search procedure uses the loss $$\cL(\mbtheta_\text{GD})$$ defined in Equation \eqref{eq:pre-train-loss-expectation}.
-However, the two losses are equivalent by construction, hence the two procedures are equivalent.
-
-1. Because the construction of the GD transformer is not unique, it's not easy to see the effect of the GD learning rate once we compare it with the trained linear transformer.
-Recall that due to its parametrization, the linear transformer does not have an explicit $$\eta$$ parameter, since it can be absorbed into any of the weight matrices in the linear self-attention layer.
-Yet, the linear transformer converges to the exact same loss as the GD-transformer with the optimal GD learning rate $$\eta^*$$.
-This is expected because, fundamentally, the loss function used for the line search and the one used for the analytical solution are both equivalent to the loss in Equation \eqref{eq:pre-train-loss-expectation} used during the transformer training.
-
-Said differently, what we did in two steps for the GD-transformer (first build the $$\mbW^K, \mbW^Q, \mbW^V$$ matrices, then find the optimal GD learning rate) is done implicitly during the training of the linear transformer.
-
-The following table summarizes the three different procedures we have discussed so far.
-
-| | Loss function | GD learning rate |
-| ------------------------ | ------------------------------------ | -------------------------------------------- |
-| Least-squares regression | $$\cL_\text{lin}(\mbw-\Delta \mbw)$$ | Explicit $$\eta^*$$ by analytical solution |
-| GD-transformer | $$\cL(\mbtheta_\text{GD})$$ | Explicit $$\eta^*$$ by line search |
-| Linear transformer | $$\cL(\mbtheta)$$ | Implicit $$\eta^*$$ by training $$\mbtheta$$ |
-
-Finally, one comment on the computational complexity of the two procedures.
-It doesn't come as a surprise that the analytical solution is faster to compute than the line search: the line search requires on average 10 seconds to find the optimal GD learning rate, while the analytical solution requires only 10 milliseconds (both with JAX's JIT compilation turned on, run on the same GPU).
-
-### If one layer is a GD step, what about multiple layers?
-
-It is only natural to ask if the same behavior is observed for a linear transformer with multiple layers.
-In particular, if we take a trained linear transformer with a single layer (which we now know implements a gradient descent step) and we apply the same layer update multiple times recursively, will we observe the same behavior?
-
-As we now show in the following experiment, the answer is no.
-In fact, the test loss for both the linear transformer and the transformer implementing a gradient descent step diverges as we increase the number of layers.
-
-To stabilize this behavior, we use a dampening factor $$\lambda$$, a scalar in $$[0, 1]$$, and we update the linear transformer as follows:
-
-$$
-\begin{equation}
-\label{eq:linear-transformer-update}
-\mbE^{(l+1)} = \mbE^{(l)} + \lambda \mbW^P \mbV\left(\mbK^\top \mbQ \right),
-\end{equation}
-$$
-
-where $$\mbE^{(l)}$$ is the embedding matrix at layer $$l$$, and $$\mbW^P, \mbV, \mbK, \mbQ$$ are the projection, value, key and query matrices as defined before.
-Effectively, this is equivalent to applying a gradient descent step with a scaled learning rate.
-Code for the recurrent transformer
-This is the code for the recurrent transformer, with a dampening factor \(\lambda\). Note that the attention layer is the same as before, but we now apply it multiple times.
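-
-A minimal sketch of the recursion, reusing the same layer parameters as in the `linear_attention` sketch above:
-
-```python
-import jax.numpy as jnp
-
-def recurrent_linear_transformer(params, E, n_steps, lam=0.75):
-    """Apply one pre-trained linear self-attention layer recursively,
-    dampening each residual update by lam, as in the update equation above."""
-    WK, WQ, WV, WP = params["WK"], params["WQ"], params["WV"], params["WP"]
-    for _ in range(n_steps):
-        K, Q, V = WK @ E, WQ @ E, WV @ E
-        E = E + lam * (WP @ V @ (K.T @ Q))
-    return E
-```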
- -
- - -
- Figure 13: A pre-trained transformer with a single layer can be used recursively to implement multiple gradient descent steps, after applying a dampening factor \(\lambda\) to the self-attention layer. Use the slider to change the value of \(\lambda\). -
- -
-
-Note that in the original paper, the authors suggest that a dampening factor of $$\lambda=0.75$$ is generally sufficient to obtain the same behavior as a single-layer linear transformer. As we can see from the figure above, in our investigations we do not find this to be the case.
-In our experiments, we see that we need at least $$\lambda=0.70$$ to obtain the same behavior as a single-layer linear transformer, which suggests that the effect of the dampening factor can vary.
-
-## Is this just for transformers? What about LSTMs?
-
-Transformers are not the only architecture that can implement sequence-to-sequence models.
-Notably, *recurrent neural networks* (RNNs) have been used for a long time to implement sequence-to-sequence models, and in particular *long short-term memory* (LSTM) networks have been shown to be very effective in many tasks.
-
-Indeed, from a modeling perspective, nothing prevents us from using an LSTM to implement in-context learning for regression tasks.
-In fact, we can use the same experimental setup as before, but replacing the transformer with an LSTM.
-The main architectural difference between an LSTM and a transformer is that LSTM layers are causal by design, i.e. they can only attend to previous tokens in the sequence, while transformers can attend to any token in the sequence.
-While this is a desirable property for tasks where order matters, like language modeling, it is not for the regression task we are considering, since the input sequence is not ordered (i.e. shuffling the input sequence does not change the output of the linear regression model).
-For this reason, together with the classic uni-directional LSTM, we will also consider a bi-directional LSTM, which can attend to both previous and future tokens in the sequence.
-This provides a fair comparison between the LSTMs and the transformers.
-
-In this first experiment, we analyze the ability of the uni-directional and the bi-directional LSTM to learn linear functions in-context.
-Note that because of the intrinsic non-linear nature of the LSTM layers, we cannot manually construct an LSTM that implements a gradient descent step, as we did for the transformer.
-Nonetheless, we can still compare the LSTMs with the GD-equivalent transformer (which we now know implements a gradient descent step on the least-squares regression loss).
-
-Figure 14: LSTMs cannot learn linear functions in-context as effectively as transformers, and bi-directional LSTMs learn linear functions in-context better than uni-directional LSTMs. Use the slider to change the number of layers.
- - - -
-
-In this figure we can see that a single-layer LSTM is not sufficient to learn linear functions in-context. For the uni-directional LSTM, we see that the test loss is always higher than the test loss of the transformer implementing a gradient descent step, even if we increase the number of layers.
-In contrast, for the bi-directional LSTM, we see that the test loss approaches that of the GD-equivalent transformer as we increase the number of layers.
-
-The poor performance of the uni-directional LSTM is not surprising, given its strictly causal processing of the sequence. Additional evidence is provided in the figure below, where, as we did for the transformer, we plot the L2 error (predictions), the L2 error (gradients w.r.t. inputs) and the model cosine similarity (gradients w.r.t. inputs), comparing the LSTM with the GD-equivalent transformer.
- - -
- Figure 15: Uni-directional LSTMs cannot learn linear functions in-context as effectively as transformers. Use the slider to change the number of layers. -
- -
- -Regardless of the number of layers, we see that the uni-directional LSTM is not implementing a gradient descent step, as the L2 error (predictions) and the L2 error (gradients w.r.t. inputs) do not converge to 0, and the model cosine similarity (gradients w.r.t. inputs) remains well below 1. -The picture changes for the bi-directional LSTM, as we can see in the figure below. - -
- - - -
- Figure 16: Bi-directional LSTMs align better with the GD-equivalent transformer as we increase the number of layers. Use the slider to change the number of layers. -
- - -
-
-For a single layer, we can comfortably say that the bi-directional LSTM is also not equivalent to a GD step; for **2 or more layers**, however, we cannot reject the hypothesis that the bi-directional LSTM is equivalent to a GD step (use the slider to change the number of layers in Figures 14-16).
-Note that if we compare this result with **Figure 10**, while we don't see exactly the same behavior (e.g., a cosine similarity slightly below 1), it is still remarkably similar.
-This is not a conclusive result, but it is interesting to see that the bi-directional LSTM can learn linear functions in-context *similarly* to a transformer implementing a gradient descent step.
-
-## Concluding remarks
-
-In this blog post, we have presented a series of experiments to understand the mechanistic behavior of transformers and self-attention layers through the lens of optimization theory.
-In particular, we analyzed the results of the paper *Transformers Learn In-Context by Gradient Descent*, replicating some of the experiments and providing additional insights.
-Along the way, we also derived an analytical solution for the best GD learning rate, which is faster to compute than the line search procedure used in the original paper.
-Finally, we empirically showed that LSTMs behave differently than transformers, and that single-layer LSTMs do not in fact implement a gradient descent step.
-The results on deep LSTMs are less conclusive, showing behavior similar to the GD-equivalent transformer, but not exactly the same.
-
-### What now?
-
-The results presented in this blog post, while confirming the main findings of the original paper, also raise a number of questions and suggest possible future research directions.
-
-1. To reiterate, what we have done so far is to try to understand the behavior of transformers and self-attention layers through the lens of optimization theory.
-This is the common approach in the literature, including very recent additions, and it is the approach we have followed in this blog post.
-However, this can pose significant limitations regarding the generalization of the results and the applicability of the findings to other architectures (notably, causal self-attention layers).
-Phenomena like the emergent abilities or the memorization of large language models may indicate that fundamentally different mechanisms are at play in these models, and that the optimization perspective might not be sufficient to understand them.
-
-1. On the other hand, nothing prevents us from working in the opposite direction, i.e. starting from specific learning algorithms and trying to design neural networks that implement them.
-From an alignment perspective, for example, this is desirable because it allows us to start by designing objective functions and learning algorithms that are more interpretable and more aligned with our objectives, rather than starting from a black-box neural network and trying to understand its behavior.
-In this quest, the developing theory of mesa-optimization can represent a useful framework to understand these large models.
-
-1. Finally, we want to highlight that the main results shown in this blog post are consequences of the simplified hypotheses and the experimental setup we have considered (linear functions, least-squares regression loss, linear self-attention layers).
-In an equally recent paper, for example, the authors take a completely different route: by representing transformers as interacting particle systems, they were able to show that tokens tend to cluster toward limiting objects, which depend on the input context.
-This suggests that other interpretations of the behavior of transformers are not only possible, but possibly necessary to understand how these models learn in context.
-
-## Appendix
-
-### Connection with meta-learning
-
-From a learning point of view, ICL seems closely related to the definition of *meta-learning*, where the goal is to learn a model that can quickly adapt to new tasks.
-If we consider the function class $$\cH$$ as an uncountable set of tasks, then the model is learning *how* to adapt to a new function by observing a few examples of that function.
-The main difference between the classic formulation of meta-learning and the formulation of in-context learning is that in the latter case the model is not allowed to change its weights, but it can only change its internal state (e.g., the hidden activations of the transformer).
-Indeed, meta-learning relies on the assumption that the model can quickly adapt to new tasks by changing its weights (i.e. by taking one or more gradient steps).
-
-#### Connection with MAML (Model-Agnostic Meta-Learning)
-
-In the meta-learning setup, we need to define a generic base-model $$m:\cX\rightarrow\cY$$, parameterized with $$\mbw$$, that works at the sample level.
-Let's now relax the assumption of $$\cF$$ as a class of transformer models and let's build $$f$$ as follows:
-
-$$
-\begin{equation}
-\label{eq:meta-learning-model}
-f(\mbw, P_C) = m\left(\mbw - \eta \nabla_{\mbw} \sum_{i=0}^{C-1}\ell\left(m(\mbw,\mbx_i), \y_i\right),\mbx_\text{query}\right)
-\end{equation}
-$$
-
-where $$\eta$$ is the learning rate of the meta-learning algorithm.
-Equation \eqref{eq:meta-learning-model} represents the inner optimization loop in a simplified version of the MAML algorithm, where the model is updated with a single gradient step.
-
-Putting it all together, we can define the meta-learning loss as:
-
-$$
-\begin{equation}
-\label{eq:meta-learning-loss}
-\cL_{\text{MAML}}(\mbw) = \mathbb{E}\left[\ell\left(f(\mbw, P_C), h\left(\mbx_{\text{query}}\right)\right) \right]
-\end{equation}
-$$
-
-which is now optimized w.r.t. the base-model's parameters $$\mbw$$.
-
-The resemblance between Equation \eqref{eq:in-context-error} and Equation \eqref{eq:meta-learning-loss} is now clear, and it justifies the interpretation of in-context learning as a form of meta-learning.
-
-In particular, it is interesting to study under which conditions the model $$f$$ defined in Equation \eqref{eq:meta-learning-model} is equivalent to a transformer model.
-
-### Testing details
-
-In order to test whether a model learns in-context for a given function class, we need to define a dataset of in-context examples.
-In this case we will only consider in-distribution test examples, i.e. examples that are drawn from the same distribution as the training examples.
-Specifically, we will use the same distribution for the test inputs $$p(\mbx)$$ and the same distribution for the test weights $$p(\mbw)$$ as those used during training.
-Various papers have also considered the case where the inputs are drawn from a different distribution than the training examples (also known as out-of-distribution, or OOD), but to keep the discussion focused we will only consider the in-distribution case.
-
-We define the in-context test loss as:
-
-$$
-\begin{equation}
-\label{eq:in-context-test-loss}
-\cL_\text{te}(\mbtheta) = \frac 1 N \sum_{n=0}^{N-1} \left\|f\left(\mbtheta, \left[\{\mbx_i^{(n)}, \y_i^{(n)}\}_{i=0}^{C-1}, \mbx^{(n)}_\text{query}\right]\right) - \y^{(n)}_{\text{query}}\right\|^2.
-\end{equation} -$$ - -Specifically, we will consider a fixed dataset of $$N=10000$$ regression tasks, where each task is defined by a set of in-context examples $$\{\mbx_i^{(n)}, \y_i^{(n)}\}_{i=0}^{C-1}$$ and a query pair $$\mbx^{(n)}_{\text{query}}$$ and $$\y^{(n)}_{\text{query}}$$. - - -
\ No newline at end of file diff --git a/_posts/2024-05-07-unraveling-the-impact-of-training-samples.md b/_posts/2024-05-07-unraveling-the-impact-of-training-samples.md deleted file mode 100644 index a9e352c3..00000000 --- a/_posts/2024-05-07-unraveling-the-impact-of-training-samples.md +++ /dev/null @@ -1,335 +0,0 @@ ---- -layout: distill -title: Unraveling The Impact of Training Samples -description: How do we quantify the influence of datasets? Recent works on Data Attribution Methods shed light on this problem. In this blog post, we introduce Data Attribution Methods which leverage robust statistics and surrogate functions, and present their applications like distinguishing the feature selection difference of learning algorithms, detecting data leakage, and assessing model robustness. -date: 2024-05-07 -future: true -htmlwidgets: true - -# Anonymize when submitting -authors: - - name: Daiwei Chen - url: 'https://chendaiwei-99.github.io' - affiliations: - name: UW-Madison - - - name: Jane Zhang - url: 'https://openreview.net/profile?id=~Jane_Zhang2' - affiliations: - name: UW-Madison - - - name: Ramya Korlakai Vinayak - url: 'https://ramyakv.github.io' - affiliations: - name: UW-Madison - -# must be the exact same name as your blogpost -bibliography: 2024-05-07-unraveling-the-impact-of-training-samples.bib - -# Add a table of contents to your post. -# - make sure that TOC names match the actual section names -# for hyperlinks within the post to work correctly. -# - please use this format rather than manually creating a markdown table of contents. -toc: - - name: Data Attribution Methods - subsections: - - name: Influence Functions - - name: Data Models - - name: TRAK - - name: How do we use it? - subsections: - - name: Learning Algorithm Comparison - - name: Data Leakage Detection - - name: Prediction Brittleness Examination - - name: Conclusion - -# Below is an example of injecting additional post-specific styles. -# This is used in the 'Layouts' section of this post. -# If you use this post as a template, delete this _styles block. -_styles: > - .fake-img { - background: #bbb; - border: 1px solid rgba(0, 0, 0, 0.1); - box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1); - margin-bottom: 12px; - } - .fake-img p { - font-family: monospace; - color: white; - text-align: left; - margin: 12px 0; - text-align: center; - font-size: 16px; - } - .details{ - border: 1px solid #000; /* Border style */ - padding: 10px; /* Optional padding for content inside the frame */ - background-color: #808080; /* light gray - background color */ - color: #FFFFFF; /* Text color, change to contrast with the background */ - } - .details summary{ - color: white; - font-style: italic; - font-weight: bold; - } - .details p{ - color:white; - } ---- - - - -How do we quantify the true influence of datasets? What role does the influence score play in refining datasets and unraveling the intricacies of learning algorithms? Recent works on **Data Attribution Methods** give us an interesting answer to these problems. - -This blog post revisits several proposed **Data Attribution Methods** which aim to quantitatively measure the importance of each training sample with respect to the model's output. The blog post also demonstrates the utility of the data attribution methods by providing some usage examples, e.g. [understanding the difference of learning algorithms](#learning-algorithm-comparison), checking [data leakage](#data-leakage-detection), and analyzing the [model robustness ](#prediction-brittleness-examination). 
-
-{% include figure.html path="assets/img/2024-05-07-unraveling-the-impact-of-training-samples/animation.gif" class="img-fluid" %}
-
-_Motivation of data attribution. For a given target, we want to quantify the influence of each of the training samples. This makes model decisions and biases easier to interpret._
-
-## Data Attribution Methods
-
-Exploring various milestone frameworks offers valuable insight into understanding the impact of training samples. Let's delve into some established methods used for data attribution.
-
-### Influence Functions
-
-In the paper **_Understanding Black-box Predictions via Influence Functions_**, the authors scaled up influence functions (a classic technique from robust statistics) to modern deep learning settings. Under the assumption that the empirical risk function is twice-differentiable and strictly convex, and the assumption that the algorithm attains the optimal point, we can estimate the influence of training samples by only calculating gradients and Hessian-vector products of the model.
-
-The intuition behind the influence function comes from looking at how the test loss changes after the removal or perturbation of one training sample. The calculation is given as follows:
-
-$$\mathcal{I}_{\text{removal,loss}}(z,z_{\text{test}}):=\frac{dL(z_\text{test},\hat\theta_{\epsilon,z})}{d\epsilon}\Bigg|_{\epsilon=0}\approx-\nabla_\theta L(z_{\text{test}},\hat\theta)^\top H_{\hat\theta}^{-1}\nabla_\theta L(z,\hat\theta)$$
-Show algorithm step by step -

-Given the assumptions we made, our algorithm can find the optimal $\hat\theta$ which minimizes the empirical risk, and the positive definite Hessian matrix is guaranteed to exist:
-
-$$R(\theta):=\frac{1}{n}\sum L(z_i,\theta), \ \ \hat\theta=\arg\min_\theta R(\theta)$$
-
-$$H_{\hat\theta}:=\frac{1}{n}\sum \nabla _\theta^2 L(z_i,\hat\theta).$$
-
-Given the intuition written above, we look at the parameter difference $\Delta_\epsilon=\hat\theta_{\epsilon, z}-\hat\theta$ obtained by perturbing one training sample:
-
-$$\hat\theta_{\epsilon, z}=\arg\min_{\theta}\{R(\theta)+\epsilon L(z,\theta)\}$$
-
-Recall that our goal is to estimate how the optimum changes with the sample perturbation, which we can express as $\frac{d \hat\theta_{\epsilon, z}}{d \epsilon}$. Since $\hat\theta_{\epsilon, z}$ is a minimizer of the perturbed loss, we can write its first-order optimality condition:
-
-$$0=\nabla R(\hat\theta_{\epsilon, z})+\epsilon \nabla L(z,\hat\theta_{\epsilon, z}).$$
-
-By performing a Taylor expansion around $\hat\theta$, we can estimate
-
-$$0\approx \left[ \nabla R(\hat\theta)+\epsilon \nabla L(z,\hat\theta)\right] + \left[ \nabla^2 R(\hat\theta)+\epsilon \nabla^2 L(z,\hat\theta)\right]\Delta_\epsilon.$$
-
-Since $\hat\theta$ minimizes $R$ and the $o(\epsilon)$ term can be omitted, we can solve for $\Delta_\epsilon$ as follows:
-
-$$\Delta_\epsilon\approx -\nabla^2 R(\hat\theta)^{-1} \nabla L(z,\hat\theta)\epsilon \Rightarrow \frac{d \Delta_\epsilon}{d \epsilon}\Bigg|_{\epsilon=0}=\frac{d \hat\theta_{\epsilon,z}}{d\epsilon}\Bigg|_{\epsilon=0}=-H_{\hat\theta}^{-1}\nabla_\theta L(z,\hat\theta) $$
-Therefore, $\mathcal{I}_{\text{removal,loss}}(z,z_{\text{test}}):=\frac{dL(z_\text{test},\hat\theta_{\epsilon,z})}{d\epsilon}\Bigg|_{\epsilon=0} -=\frac{dL(z_\text{test},\hat\theta_{\epsilon,z})}{d\hat\theta_{\epsilon,z}}\frac{d \hat\theta_{\epsilon,z}}{d\epsilon}\Bigg|_{\epsilon=0}\approx-\nabla_\theta L(z_{\text{test}},\hat\theta)^\top H_{\hat\theta}^{-1}\nabla_\theta L(z,\hat\theta)$ -
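-
-As a minimal sketch of this quantity (assuming a flat parameter vector $\theta$; for large models one would use Hessian-vector products and approximate inverses instead of forming $H_{\hat\theta}$ explicitly):
-
-```python
-import jax
-import jax.numpy as jnp
-
-def influence_removal_loss(loss_fn, theta_hat, z_train, z_test, train_set):
-    """I_{removal,loss}(z, z_test) ~ -grad L(z_test)^T H^{-1} grad L(z),
-    with loss_fn(theta, z) the per-sample loss and theta_hat the minimizer."""
-    grad_test = jax.grad(loss_fn)(theta_hat, z_test)
-    grad_train = jax.grad(loss_fn)(theta_hat, z_train)
-    # Hessian of the empirical risk R(theta) = (1/n) sum_i L(z_i, theta).
-    risk = lambda theta: jnp.mean(jnp.stack([loss_fn(theta, z) for z in train_set]))
-    H = jax.hessian(risk)(theta_hat)
-    return -grad_test @ jnp.linalg.solve(H, grad_train)
-```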

-
- -
-
-
-Since the removal of one training sample can be understood as setting $\epsilon=-\frac{1}{n}$, we can predict the corresponding test loss difference by $-\frac{1}{n}\mathcal{I}_{\text{removal,loss}}(z, z_{\text{test}})$. By comparing the predicted test loss difference and the actual test loss difference obtained by leave-one-out retraining, we can verify the accuracy of the proposed influence scores, as shown in the figure below.
-
-{% include figure.html path="assets/img/2024-05-07-unraveling-the-impact-of-training-samples/1.png" class="img-fluid" %}
-
-Based on their experiments, we can empirically say that the proposed influence function performs well on the tasks which satisfy its underlying assumptions (twice-differentiability and strict convexity): in Fig(a) & Fig(b), under convex and convergent situations (Logistic Regression model & L-BFGS algorithm), the predicted loss difference and the actual loss difference align well with each other. However, in Fig\(c\), under non-convex situations with no convergence guarantee (CNN model & SGD algorithm), the influence function could not produce a satisfying approximation.
-
-Although the influence function seems to provide a good estimation of the importance of each training sample, the **expensive computational cost of estimating the Hessian matrix and the instability under non-convex situations with no convergence guarantee are big issues for this data attribution method**.
-
-### Data Models
-
-Another branch of data attribution methods is sampling-based, such as the Datamodels work of Ilyas et al. Given a learning algorithm $\mathcal{A}$ and a fixed training dataset $S$, a model trained on $S$ with $\mathcal{A}$ defines a function that maps an input $z$ to the output $f_{\mathcal{A}}(z; S)$. This function $f$ can be complex in practice, and hence it's hard to understand how the training examples in $S$ contribute to the prediction on a specific target point. Therefore, the authors use a linear function $g_{w}$ as a simple surrogate model to learn the contribution of each training example to a target example.
-
-How do we train such a linear surrogate function? Consider a fixed training dataset $S$, a learning algorithm $\mathcal{A}$, a target example $z$, and a distribution $D_{S}$ over subsets of $S$. Use $D_S$ to repeatedly sample $m$ subsets $S_{i}$, train $f_{\mathcal{A}}(\cdot; S_{i})$ using $\mathcal{A}$, and evaluate on $z$ to get pairs:
-
-$$\{\Bigl(S_{1}, f_{\mathcal{A}}(z; S_{1})\Bigr),\cdot \cdot \cdot,\Bigl(S_{m}, f_{\mathcal{A}} (z; S_{m})\Bigr)\}$$
-
-A datamodel for a target example $z$ is a parametric function $g_w$ optimized to predict $f_{\mathcal{A}}(z; S_{i})$ from training subsets $S_{i}$, where $S_{i} \sim D_{S}$. The training objective is formulated as:
-
-$$g_{w}: \{0, 1\}^{|S|} \mapsto \mathbb{R}, \text{ where }\; w = \underset{\beta}{argmin} \;\frac{1}{m}\sum_{i = 1}^{m}\mathcal{L}\Bigl(g_{\beta}(S_{i}),\; f_{\mathcal{A}}(z; S_{i})\Bigr) + \lambda||\beta||_{1}$$
-
-> $$g_{w}(S_{i}) = w^\top \mathbb{1}_{S_i}$$, where $$\mathbb{1}_{S_i} \in \{0,1\}^{|S|}$$ indicates which training points belong to $$S_i$$;
-> $$\mathcal{L}\bigl(g_{w}(S_{i}),\; f_{\mathcal{A}}(z; S_{i})\bigr) = \bigl(\;g_{w}(S_{i}) - f_{\mathcal{A}}(z; S_{i})\;\bigr)^2$$;
-> $$f_{\mathcal{A}}(z; S_{i}):= (\text{logit for correct class}) - (\text{highest incorrect logit})$$
-
-One datamodel is specifically optimized to learn the attribution of a fixed training dataset to a fixed but arbitrary example $z$. For a fixed sample of interest, we use $g_{w}$ to assign a learnable weight to each example in $S$.
The sum of the weights of all training examples included in $S_{i}$ is trained to predict the model output on $z$. This is formulated as the dot product between a weight vector $w$ and an indicator vector whose entry $k$ indicates whether the $k^{th}$ training datapoint of $S$ is in $S_i$. Therefore, for a set of target examples, we can train a datamodel for each of them and construct a collection of datamodels.
-
-{% include figure.html path="assets/img/2024-05-07-unraveling-the-impact-of-training-samples/2.png" class="img-fluid" %}
-_Caption: Linear datamodels accurately predict true margins averaged across 100 models.
-Source: Fig 5 in the paper "Datamodels: Predicting Predictions from Training Data" _
-
-In their experiments using CIFAR-10, the authors reserved a specific subset of output pairs for evaluation. Here, $\alpha$ represents the subsampling fraction in relation to the training set size. For instance, in a training dataset with $|S| = 100$ data points, setting $\alpha = 0.2$ means each subset, $S_i$, comprises a fixed size of $|S_i| = 20$. They demonstrated that Datamodels effectively predict outcomes for unseen in-distribution test subsets.
-In the above plots, the bottom-right panel illustrates data for three color-coded random target examples, showing a strong Spearman correlation ($r > 0.99$) between predicted and actual outputs.
-
-It's crucial to note that the displayed margins represent averages across 100 models trained on $S_i$. This underscores a limitation of linear datamodeling:
-
-**achieving stability demands training a sufficient number of models for each subset. The figures from the original paper involve averaging over 100 models. When the true model outputs aren't averaged across a significant number of models, it becomes apparent that the linearity is affected (see the figure below).**
-
-{% include figure.html path="assets/img/2024-05-07-unraveling-the-impact-of-training-samples/datamodel_our_exp.png" class="img-fluid" %}
-
-Despite the simplicity and accuracy of datamodels in prediction, training them for specific examples in large-scale scenarios poses challenges. Imagine training datamodels for ImageNet's set of target examples, which would require training numerous models from scratch on ImageNet's 1000-class training dataset. **Ensuring stable prediction performance requires extensive computational resources, which is prohibitively expensive for modern foundation models**.
-
-### TRAK
-
-Inspired by the Datamodeling framework and motivated to circumvent its expensive training cost, in **_TRAK: Attributing Model Behavior at Scale_**, Park et al. propose a new data attribution framework, _Tracing with the Randomly-Projected After Kernel_ (TRAK).
-
-First, in this paper the authors further denote by $\tau(z, S_i)$ a data attribution method that assigns a real-valued score to each training input in $S_i$, indicating its importance to the model output $f_{\mathcal{A}}(z;S_i)$. The key concept of TRAK is to use a first-order Taylor expansion to approximate the trained model $\theta^{\*}(S)$ of an algorithm for a given training dataset, and then use random projections to reduce the dimensionality of the gradients. Each time, we sample a training subset $S_i$ of size $\alpha \times |S|$ from $S$, train a model $\theta^{\*}(S_i)$, and then use a random projection to project the high-dimensional gradient matrix at $\theta^{\*}$ from $p$ to $k$ dimensions, where $k \ll p$.
The authors denote the projected gradients by $\phi$ and conclude that, using a training subset $S_i$, the TRAK attribution score for an example of interest $z$ is:
-
-$$\tau(z, S_i) := \phi_{i}(z)^{T}(\Phi_{i}^{T}\Phi_{i})^{-1}\Phi_{i}^{T}\mathbf{Q_{i}}$$
-
-> $i$: the index of a training subset;
-> $\mathbf{Q}_{i}:=diag(1 - p_t^\*) = diag(\{(1 + exp(y_t \cdot f(z_t;\theta^{\*})))^{-1}\})$ where $p_t^\*$ is the predicted correct-class probability at $\theta^{\*}$;
-> $t$: the index of a training sample in $S$;
-> $\mathbf{P}$: a random projection matrix whose entries are sampled from a standard Gaussian distribution: $\mathbf{P}\sim \mathcal{N} (0, 1)^{p \times k}$ for $k \ll p$;
-> $\phi_{i}(z) = \mathbf{P}^T \nabla_{\theta} f(z;\theta^{\*})$: the projected gradient from model $\theta^{\*}(S_i)$ for the target sample $z$;
-> $\Phi_{i} = [\phi_{i}(z_1) \cdot\cdot\cdot \phi_{i}(z_m)]^{T}$: the stacked projected gradients for all training data $\{z_1,...,z_m\}$;
-
-Further, TRAK samples $N$ training subsets of fixed size factor $\alpha$, trains each of them independently, and ensembles over these $N$ models:
-$$\tau_{TRAK}(z, S) := \mathfrak{S}((\frac{1}{N} \sum_{i=1}^{N} \mathbf{Q}_{i}) \cdot (\frac{1}{N} \sum_{i=1}^{N} \phi_{i}(z)^{T}(\Phi_{i}^{T}\Phi_{i})^{-1}\Phi_{i}^{T}), \hat{\lambda})$$
-
-> $\mathfrak{S}(\cdot; \lambda)$ is the soft thresholding operator;
-> $N$: total number of training subsets;
-> $m$: total number of training samples in $S$;
-> $\hat{\lambda}$ is the soft thresholding parameter, selected via cross-validation
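-
-As a sketch of the score computation from a single trained model (our own illustrative implementation, assuming the per-example gradients have already been computed):
-
-```python
-import jax
-import jax.numpy as jnp
-
-def project_grads(grads, key, k):
-    """Random projection P ~ N(0,1)^{p x k} applied to raw gradients (m, p)."""
-    P = jax.random.normal(key, (grads.shape[1], k))
-    return grads @ P
-
-def trak_scores_single(Phi, phi_z, correct_probs):
-    """tau(z, S_i) = phi(z)^T (Phi^T Phi)^{-1} Phi^T Q for one model.
-    Phi: (m, k) projected training gradients; phi_z: (k,) projected target
-    gradient; correct_probs: (m,) predicted correct-class probabilities p_t*."""
-    q = 1.0 - correct_probs                      # diagonal of Q_i
-    kernel = jnp.linalg.inv(Phi.T @ Phi)         # (k, k)
-    return (phi_z @ kernel @ Phi.T) * q          # one score per training example
-```
-
-Ensembling then averages the $\mathbf{Q}_i$ terms and the $\phi_{i}(z)^{T}(\Phi_{i}^{T}\Phi_{i})^{-1}\Phi_{i}^{T}$ terms over the $N$ models before soft-thresholding.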
-Show algorithm step by step -

-Before introducing the implementation steps, the authors first use binary logistic regression as a case study to illustrate the benefits of computing data attribution scores in cases where a classification learning algorithm can be framed as straightforward logistic regression. We consider a training set of $n$ samples:
-$$S = \{z_1,\cdot\cdot\cdot,z_n: z_t = (x_t \in \mathbb{R}^d, b_t \in \mathbb{R}, y_t \in \{-1, 1\}) \}$$
-where<br>
-        $x_t$ is an input in $\mathbb{R}^d$;
-        $y_t$ is the binary label;
-        $b_t$ the bias term
-
-Then the authors further parametrize the learning algorithm with $\theta$ as the model parameters:
-$$\theta^{*}(S) := arg\; \underset{\theta}{min} \sum_{(x_t, y_t)\in S} log[1 + exp(-y_t \cdot (\theta^{T}x_t + b_t))]$$
-
-Data attribution in the binary logistic regression setting can be computed using the _one-step Newton approximation_. The authors present it as follows:
-$$\tau_{NS}(z, z_t) := \frac{x^{T}(X^{T}RX)^{-1}x_t}{1- x_{t}^{T}(X^{T}RX)^{-1}x_t \cdot p_{t}^{*}(1-p_{t}^{*})} \approx f(z;\theta^{*}(S)) - f(z;\theta^{*}(S \setminus z_t))$$
-where<br>
-        $z$: target sample;
-        $f(z;\theta) :=\theta^{T}x+b$;
-        $z_t$: the $t^{th}$ training example, $z_t = (x_t, b_t, y_t)$;
-        $X \in \mathbb{R}^{n \times d}$: all training inputs stacked in one matrix $X$;
-        $p_{t}^{*}:= (1 + exp(-y_t \cdot f(z_t; \theta^*)))^{-1}$
-        $p_{t}^{*}$ is the predicted correct-class probability at $\theta^{*}$;
-        $R$ is a diagonal $n \times n$ matrix with $R_{tt} = p_{t}^{*}\times (1-p_{t}^{*})$
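For concreteness, the one-step Newton scores can be computed in a few lines. Below is a minimal NumPy sketch; the function name is our own, and we use a pseudo-inverse for numerical stability:

```python
import numpy as np

def newton_step_influence(X, x, p_star):
    """One-step Newton approximation tau_NS(z, z_t) for every training sample t.

    X: (n, d) training inputs; x: (d,) test input;
    p_star: (n,) predicted correct-class probabilities at the trained parameters.
    Returns an (n,) vector of influence scores.
    """
    r = p_star * (1.0 - p_star)                     # diagonal entries of R
    H_inv = np.linalg.pinv(X.T @ (X * r[:, None]))  # (X^T R X)^{-1}, shape (d, d)
    num = X @ (H_inv @ x)                           # x^T (X^T R X)^{-1} x_t for each t
    lev = np.einsum("nd,nd->n", X @ H_inv, X)       # x_t^T (X^T R X)^{-1} x_t
    return num / (1.0 - lev * r)
```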
- -Now that Ilyas et al. have introduced this method to calculate data attribution in the binary logistic regression setting, how can we leverage it effectively? The key insight is that, in a binary non-convex or multi-class classification setting, we can linearize the model function with its Taylor expansion centered around the final model parameters $\theta^*$. By selecting the output function as the raw logit of the classifier, this linear approximation allows us to approach the problem as a binary logistic regression, utilizing gradients as inputs, thereby leading to the development of the TRAK algorithm.
- -In this paper, the TRAK algorithm consists of five steps:
- -1. Linearizing the model output function via Taylor approximation, which reduces the model of interest to a linear function in parameter space. - Consider $f(z;\theta)$ as a non-convex function; then we can approximate it with its Taylor expansion centered around $\theta^{\*}$:
- $$\hat{f}(z;\theta):= f(z;\theta^{*}) + \nabla_{\theta} \; f(z;\theta^{*})^{T}(\theta - \theta^{*})$$ - $$\theta^{*}(S) \approx arg\; \underset{\theta}{min} \sum_{z_t \in S} log[1 + exp(-y_t \cdot ( \underbrace{\nabla_{\theta} \; f(z_t;\theta^{*})^{T}}_{inputs}\;\theta + b_t))]$$ - where
-         $f(z;\theta):=log(\frac{p(z;\theta)}{1 - p(z; \theta)})$
-         $b_t = f(z_t;\theta^{\*}) - \nabla_{\theta} \; f(z_t;\theta^{\*})^{T} \theta^{\*}$ - -
-2. Reducing the dimensionality of the linearized model using random projections. To preserve the model-relevant information, Ilyas et al. use the Johnson-Lindenstrauss lemma . We need to compute the gradient for each $z_t$ at $\theta^{*}$ and then project it to $k$ dimensions: -$$\phi(z) = \mathbf{P}^{T} \nabla_{\theta}f(z;\theta^{*})$$ -where
-         $\mathbf{P}\sim \mathcal{N} (0, 1)^{p \times k}$ for $k \ll p$ - -
-3. Estimating influences by adapting the one-step Newton approximation.
-$$\tau(z, S) := \phi(z)^{T}(\Phi^{T}\Phi)^{-1}\Phi^{T}\mathbf{Q}$$ -where
-        $\mathbf{Q}:= diag(1 - p_{t}^*) = diag(\{(1 + exp(y_t \cdot f(z_t;\theta^{*})))^{-1}\})$;
-        $\mathbf{Q} \in \mathbb{R}^{n \times n}$, where each diagonal entry is one minus the correct-class probability of the corresponding training sample. -
-4. Ensembling over $N$ independently trained models. Each model is trained on a subset of the training set, $S_i \subset S$.
-$$\tau_{N}(z, S) := (\frac{1}{N} \sum_{i=1}^{N} \mathbf{Q}_{i}) \cdot (\frac{1}{N} \sum_{i=1}^{N} \phi_{i}(z)^{T}(\Phi_{i}^{T}\Phi_{i})^{-1}\Phi_{i}^{T})$$ -
-5. Inducing sparsity via soft-thresholding. -$$\tau_{TRAK}(z, S) := \mathfrak{S}((\frac{1}{N} \sum_{i=1}^{N} \mathbf{Q}_{i}) \cdot (\frac{1}{N} \sum_{i=1}^{N} \phi_{i}(z)^{T}(\Phi_{i}^{T}\Phi_{i})^{-1}\Phi_{i}^{T}), \hat{\lambda})$$ -where
-        $\mathfrak{S}(\cdot; \lambda)$ is the soft thresholding operator;
-        $\hat{\lambda}$ is the soft thresholding parameter, and it's selected via cross-validation -

-
-
-
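Before looking at the experiments, it may help to see the five steps condensed into code. The following is a minimal NumPy sketch of the single-model computation and the final sparsification, assuming the per-sample gradients at $\theta^{\*}$ have already been extracted from a trained model; the helper names are ours:

```python
import numpy as np

def trak_single_model(grads_train, grads_test, q_diag, k=512, seed=0):
    """TRAK scores tau(z, S_i) for one trained model (steps 1-3).

    grads_train: (m, p) gradients of the model output w.r.t. parameters at theta*
    grads_test:  (n, p) gradients for the target examples z
    q_diag:      (m,) one minus the correct-class probability of each training sample
    Returns an (n, m) matrix of attribution scores.
    """
    p = grads_train.shape[1]
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((p, k))              # P ~ N(0, 1)^{p x k}, with k << p
    Phi, phi = grads_train @ P, grads_test @ P   # projected gradients (step 2)
    kernel_inv = np.linalg.pinv(Phi.T @ Phi)     # (k, k) inverse of Phi^T Phi
    return (phi @ kernel_inv @ Phi.T) * q_diag   # right-multiplication by diag(Q)

def soft_threshold(x, lam):
    """Soft-thresholding operator S(x; lambda) used in step 5."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Step 4 averages the Q-terms and the projected-kernel terms over N models
# before multiplying them; step 5 applies soft_threshold to the result.
```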
- -{% include figure.html path="assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_exp_fig.png" class="img-fluid" %} -_Caption: We trained 90 ResNet-9 models independently on 90 randomly selected subsets of size factor 0.5 from $S$. Then we used TRAK to calculate influence scores for the test dataset of CIFAR-10. Shown are two random target samples that illustrate the efficacy of TRAK: the training images with high TRAK scores belong to the same category as the target image, while those with low TRAK scores belong to different categories._ -
- -{% include figure.html path="assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_scatter_plot.png" class="img-fluid" %} - -Ilyas et al. conducted a study utilizing TRAK to attribute various classifiers on datasets such as CIFAR-2, CIFAR-10, QNLI, and ImageNet. Their findings demonstrated that TRAK achieves superior accuracy while utilizing significantly fewer models. - -In replicating the experiments detailed in Ilyas et al. , we encountered a notable drawback: the TRAK algorithm is **memory-expensive**. It requires recording numerous model gradients for each test sample across models trained on different subsets, which is intractable for modern foundation models. Furthermore, our investigation unveiled a **limited linear correlation** between TRAK scores and true model margins. This observation suggests that the predicted margins derived from TRAK do not serve as robust estimates of the model output, and that TRAK's ability to predict model outputs is not on par with Datamodels. - -While TRAK offers an interpretable and computationally efficient way to analyze training data impact, its limitations cannot be overlooked. Further research is needed to propose better data attribution methods. - - - -## How do we use it? - -### Learning Algorithm Comparison - -Data attribution methods estimate the importance of each training sample with respect to the model's output. A natural idea arises: can we leverage data attribution methods to understand the differences between learning algorithms based on how they weight the training data? - -The paper **_ModelDiff: A Framework for Comparing Learning Algorithms_** develops this idea: use a data attribution method to figure out the "feature selection" difference between two learning algorithms. Specifically, the authors use data attribution methods to quantify the impact of each training sample on each test sample. - -Therefore, we could get the importance matrix $\Theta^{\|train \| \times \|test\|}$ -for each learning algorithm applied on a specific task. We apply matrix projection and PCA techniques on the importance matrix $\Theta$ to explore the distinguishing differences in how two algorithms use training samples. The detailed pipeline for comparing learning algorithms is depicted in the following figure. - -{% include figure.html path="assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_1.png" class="img-fluid" %} -_Source: Figure 2 in the paper "MODELDIFF: A Framework for Comparing Learning Algorithms"_ -
-In the figure above, the authors apply PCA on the residual importance matrix (after projection, we remove the common importance allocation). The training samples corresponding to the top-$K$ principal components (these principal component directions explain a significant amount of variance in one importance matrix but not the other) reflect the distinguishing subpopulations that one learning algorithm prefers but the other learning algorithm pays little attention to. A minimal sketch of this residual-plus-PCA computation is given below. - -By visually checking these distinguishing subpopulations, we can speculate about **the semantic feature selection differences between the two algorithms** and then confirm them by applying the corresponding semantic feature transformations to test data and checking the difference in model outputs. - -{% include figure.html path="assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_2.png" class="img-fluid" %} -_Source: Figure 3 in the paper "MODELDIFF: A Framework for Comparing Learning Algorithms"_ -
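In the sketch below, the shapes, names, and the QR-based projection are our own simplifications for illustration, not the exact ModelDiff implementation:

```python
import numpy as np

def distinguishing_directions(theta_a, theta_b, top_k=3):
    """Directions over training samples used by algorithm A but not algorithm B.

    theta_a, theta_b: (n_train, n_test) importance matrices of the two algorithms.
    Returns (n_train, top_k) principal directions of A's residual importance.
    """
    Q, _ = np.linalg.qr(theta_b)               # orthonormal basis of B's column space
    residual = theta_a - Q @ (Q.T @ theta_a)   # remove the common importance allocation
    U, _, _ = np.linalg.svd(residual, full_matrices=False)
    return U[:, :top_k]                        # top principal components of the residual

# Training samples with the largest weights in these directions form the candidate
# distinguishing subpopulations to inspect visually.
```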
-For example, in the figure above, they compared two models trained on the LIVING17 dataset. The only difference between these two models is whether they were trained with or without standard data augmentations. By exploring the training sample importance matrices using the method described above, they speculated that the model trained with data augmentation prefers using "web" to predict the class "spider" and "yellow polka dots" to predict the class "salamander". Therefore, they added "web" or "yellow polka dots" textures to test samples and found that only the predictions of the model trained with data augmentation changed substantially. This experiment corroborates previous work showing that data augmentation enhances texture bias. - -ModelDiff shows that data attribution methods can be key tools for understanding model behaviors and distinguishing subtle differences between algorithms. - -### Data Leakage Detection - -Beyond comparing learning algorithms, we can also leverage the importance scores to find the training samples that are most relevant to a model prediction. By empirically inspecting training samples of different importance magnitudes, Harshay et al. find that the training samples with large importance magnitude consistently look similar to the test sample, which follows the intuition: _training samples most similar to the test sample are most relevant to the prediction_ (see the first row of the figure below). - -{% include figure.html path="assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_3.png" class="img-fluid" %} -_Source: Figure 3 in the paper "MODELDIFF: A Framework for Comparing Learning Algorithms"_ -
- -{% include figure.html path="assets/img/2024-05-07-unraveling-the-impact-of-training-samples/cat_data_leakage.png" class="img-fluid" %} -_Source: a data leakage example we found among the randomly selected validation points provided by Ilyas et al._ -
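As a rough sketch of how such leaked pairs can be surfaced automatically from a precomputed attribution matrix (the thresholding rule below is our own heuristic):

```python
import numpy as np

def flag_leakage_candidates(scores, z_thresh=6.0):
    """Flag (train, test) pairs whose attribution score is an extreme outlier.

    scores: (n_train, n_test) attribution matrix.
    Returns a list of (train_idx, test_idx) pairs to inspect for near-duplicates.
    """
    mu, sigma = scores.mean(), scores.std()
    candidates = []
    for j in range(scores.shape[1]):
        i = int(np.argmax(scores[:, j]))            # most influential training sample
        if (scores[i, j] - mu) / sigma > z_thresh:  # extreme relative to all scores
            candidates.append((i, j))
    return candidates
```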
- -We can leverage this phenomenon to **identify train-test leakage in different benchmark datasets**. For example, in the second row of the figure, Harshay et al. identified significant data leakage in the CIFAR-10 dataset. Extending this data leakage detection technique to different datasets holds the potential to assist the ML community in curating datasets, thereby enhancing overall data quality. - -### Prediction Brittleness Examination - -We can also use data attribution methods to identify brittle predictions (i.e. model outputs that are brittle to the removal of a few training samples) and estimate data counterfactuals (i.e. the causal effect of removing a set of training samples on model outputs). - -Specifically, we could leverage the sample importance scores to find the smallest training subset (defined as the support set) whose removal flips the model prediction. By calculating the support set size for each test sample, we can quantify the brittleness of the model output with respect to the input. - -{% include figure.html path="assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_4.png" class="img-fluid" %} -_Source: Fig 8 in the paper "Datamodels: Predicting Predictions from Training Data"_ - -Another application involves data counterfactual estimation. As illustrated in the figure above, after the training subset removal, the observed changes in the actual model logits closely align with the logit changes predicted through data attribution methods. - -These experiments demonstrate that data attribution methods can serve as efficient and convincing tools to **investigate the sensitivity and robustness of learning algorithms**. - -## Conclusion - -Data attribution methods give us an interesting answer to a natural question arising in deep learning: how does each training sample contribute to the model's prediction? These methods can quantitatively measure the importance of each training sample with respect to the model's output. The versatility of these methods extends across diverse applications, such as understanding learning algorithm behaviors, checking data quality, and analyzing the robustness of models. - -Future work can focus on leveraging data attribution methods for dataset curation and model refinement. Also, investigating the scalability of data attribution methods to larger datasets and different tasks remains a promising direction for enhancing their practical utility. - - diff --git a/_posts/2024-05-07-update-frequency-in-mbrl.md b/_posts/2024-05-07-update-frequency-in-mbrl.md deleted file mode 100644 index 613ce4bb..00000000 --- a/_posts/2024-05-07-update-frequency-in-mbrl.md +++ /dev/null @@ -1,409 +0,0 @@ ---- -layout: distill -title: Fair Model-Based Reinforcement Learning Comparisons with Explicit and Consistent Update Frequency -# description: Model-based reinforcement learning has emerged as a promising approach to achieve both state-of-the-art performance and sample-efficiency. However, ensuring fair benchmark comparisons can be challenging due to the implicit design choices made by the different algorithms. This article focuses on one such choice, the update frequency of the model and the agent. While the update frequency can sometimes be optimized to improve performance, real-world applications often impose constraints, allowing updates only between deployments on the actual system.
We emphasize the need for more evaluations using consistent update frequencies across different algorithms. This will provide researchers and practitioners with clearer comparisons under realistic constraints. -description: Implicit update frequencies can introduce ambiguity in the interpretation of model-based reinforcement learning benchmarks, obscuring the real objective of the evaluation. While the update frequency can sometimes be optimized to improve performance, real-world applications often impose constraints, allowing updates only between deployments on the actual system. This blog post emphasizes the need for evaluations using consistent update frequencies across different algorithms to provide researchers and practitioners with clearer comparisons under realistic constraints. -date: 2024-05-07 -future: true -htmlwidgets: true - -authors: - - name: Albert Thomas - url: https://albertcthomas.github.io/ - affiliations: - name: Huawei Noah's Ark Lab - - name: Abdelhakim Benechehab - url: https://scholar.google.com/citations?user=JxgqOKwAAAAJ - affiliations: - name: Huawei Noah's Ark Lab - Department of Data Science, EURECOM, France - - name: Giuseppe Paolo - url: https://www.giupaolo.com - affiliations: - name: Huawei Noah's Ark Lab - - name: Balázs Kégl - url: https://twitter.com/balazskegl - affiliations: - name: Huawei Noah's Ark Lab - -# must be the exact same name as your blogpost -bibliography: 2024-05-07-update-frequency-in-mbrl.bib - -# Add a table of contents to your post. -# - make sure that TOC names match the actual section names -# for hyperlinks within the post to work correctly. -# - please use this format rather than manually creating a markdown table of contents. -toc: - - name: Introduction - - name: Three popular model-based reinforcement learning algorithms - subsections: - - name: MBPO - - name: PETS - - name: BREMEN - - name: Making the update frequency more accessible - - name: Comparisons with fixed update frequency - - name: Ablation studies - subsections: - - name: Varying the update frequency in MBPO - - name: Conclusion - - name: Appendix - -# Below is an example of injecting additional post-specific styles. -# This is used in the 'Layouts' section of this post. -# If you use this post as a template, delete this _styles block. -_styles: > - .fake-img { - background: #bbb; - border: 1px solid rgba(0, 0, 0, 0.1); - box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1); - margin-bottom: 12px; - } - .fake-img p { - font-family: monospace; - color: white; - text-align: left; - margin: 12px 0; - text-align: center; - font-size: 16px; - } ---- - -## Introduction - -In reinforcement learning , an agent learns to make decisions by interacting with an environment, receiving feedback, or a reward, following each action it takes to move from one state of the environment to another. The objective is to learn a policy, a mapping from states to actions, that maximizes the expected cumulative reward over successive interactions. - -There are two main approaches when designing a reinforcement learning algorithm: model-based or model-free. Model-based reinforcement learning (MBRL) algorithms first learn a model of the environment dynamics which, given a state of the environment and an action, predicts the next state of the environment. This model can then be used in place of the real environment to learn or decide how to act. Model-free algorithms avoid this step and directly try to learn a policy.
As MBRL algorithms can rely on the learned dynamics model instead of the real environment, they are known to be more sample efficient than model-free algorithms (see for instance or ). MBRL is thus a good choice when interactions with the environment are limited, which is often the case for real applications such as controlling engineering systems. - -We discuss here one of the design choices of MBRL algorithms: the *update frequency* of the agent. As shown in the figure below (inspired by Figure 1 in ), the frequency at which algorithms update their agent varies widely: some algorithms update their agent after each step on the real system while others update after thousands of steps . At the end of the spectrum, the pure offline setting considers only a single training of the agent from an initial dataset. (We observe that similar differences in update frequency exist in the model-free literature, but we decided to focus only on model-based algorithms.) - -{% include figure.html path="assets/img/2024-05-07-update-frequency-in-mbrl/bremen.png" class="img-fluid" %} - -The update frequency is often viewed as yet another hyperparameter of the complex MBRL pipeline. However, in practice the update frequency may be imposed by real-life deployment constraints, motivating the discussions of this blog post. It is often the case that, for safety reasons, system engineers agree to run a new agent on their system for a given period of time but prefer the agent to be fixed during this deployment, as discussed in previous studies. System engineers are then able to investigate the fixed solution before deciding to deploy it, knowing that it will not change during the deployment. It also happens that the system on which the agent is deployed does not have the required computational resources to support agent updates. Such real-life constraints could thus rule out state-of-the-art MBRL algorithms that require updating their agent too frequently to perform well. - -Given the importance of the update frequency in real-life applications, this blog post advocates for: -- explicitly specifying the update frequency employed by each algorithm in a benchmark, as this remains implicit and hard to find in many existing benchmarks, -- conducting additional experiments that compare algorithms under a given update frequency, mirroring the constraints often encountered in real-life applications, and -- performing more ablation studies on update frequency, evaluating its impact on algorithm performance. - -For the rest of this blog post, we define a *deployment* as a data collection campaign realized with a fixed agent. The agents are thus updated between two consecutive deployments but not within one deployment. The *update frequency* is the number of steps realized at each deployment (which we assume fixed for all deployments). We use the term *agent* to refer to all the components of the model-based algorithm that are used to act on the system. For instance, in a Dyna-style algorithm , where a model-free algorithm is applied on the model instead of the real system, *agent* would thus refer to both the dynamics model and the policy learned with a model-free algorithm. - -We begin by introducing three popular MBRL algorithms (MBPO, PETS and BREMEN) as we will often refer to them to illustrate our arguments. - -## Three popular MBRL algorithms - -The following table gives an overview of the update frequency of the three algorithms we discuss below and a few others.
This table is not meant to provide an exhaustive list of all the MBRL algorithms but rather to give an idea of the different training schedules that are used in the literature. - -| Algorithm | Agent update frequency | Policy update frequency | Model update frequency | -|-----------|----------------------|---------------------------|------------------------| -| MBPO | 1 step | 1 step | 250 steps | -| PETS | Task Horizon | No policy | Task Horizon | -| PILCO | Task Horizon | Task Horizon | Task Horizon | -| BREMEN | 100k or 200k steps | 100k or 200k steps | 100k or 200k steps | -| ME-TRPO | 3k or 6k steps | 3k or 6k steps | 3k or 6k steps | - - -### MBPO -Model-based Policy Optimization (MBPO; original code available at https://github.com/jannerm/mbpo) is one of the most well-known model-based algorithms. The algorithm trains an ensemble of probabilistic neural networks for the dynamics model and trains a model-free agent, Soft Actor Critic (SAC) , using short rollouts on the model to avoid error accumulation. The agent is updated at each step: the model is updated every 250 steps but the SAC policy is updated at each step. This highly frequent update schedule rules out MBPO even for small deployments on real systems. - -### PETS -Probabilistic Ensemble and Trajectory Sampling (PETS; original code available at https://github.com/kchua/handful-of-trials) is another popular model-based algorithm known for its use of an ensemble of probabilistic neural networks for the dynamics model (MBPO uses the dynamics model introduced by PETS). PETS relies on the learned model and the Cross-Entropy Method to search for the best action sequence at decision time. Therefore, it does not have to learn (nor update) a policy, as MBPO does with SAC. The only component that needs learning is the dynamics model. Compared to MBPO, the dynamics model is updated at the end of each episode (usually 1000 steps). - - -### BREMEN -Behavior-Regularized Model-ENsemble (BREMEN; original code available at https://github.com/matsuolab/BREMEN) considers the setting where only a few deployments (between 5 and 10) are possible on the real system. However, large datasets can be collected at each deployment (they assume 100 000 or 200 000 transitions for each deployment, far more than just one episode, which is usually of the order of 1000 transitions). The algorithm relies on an ensemble of deterministic dynamics models and a policy learned on the model, à la Dyna-style. It only updates the policy and the model between two consecutive deployments. The update frequency is here very clear as it is motivated by real-life applications where deployments are limited. Therefore, in this algorithm, the update frequency is not a hyperparameter that can be tuned for better performance but rather a parameter imposed by the application. One of the goals of this blog post is to emphasize and to develop the idea of a constrained update frequency. - -We now detail the main arguments of our blog post: making the update frequency more accessible, designing benchmarks with fixed update frequencies and running ablation studies on the update frequency. - -## Making the update frequency more accessible - -Experiments done in popular papers do not always make explicit the update frequencies they use for each of the algorithms they run. When nothing is said, it is very likely that most of the time the benchmarks are using the original implementation of the algorithms, shared by the authors of the algorithms in the best case.
For instance, the MBPO paper does not mention the update frequencies that the authors used in their experiments. The update frequency of MBPO can be found in the code shared by the authors. However, it is harder to find the update frequency that the authors used for PETS. We thus assume that they use the original PETS update frequency, which updates the agent at the end of each episode. We also looked at one of the most exhaustive benchmarks of MBRL algorithms . Nothing is said in the paper about the update frequency and a careful investigation of the code provided by the authors is required (more on this later). - -The difficulty in knowing the update frequencies used in benchmarks makes it harder for researchers and practitioners to take this parameter into account when assessing the performance of the algorithms and whether they would be good candidates for their real-life applications. It also demands much more investigation from the reader to know what the authors used. - -MBRL algorithms have an order of magnitude more meaningful hyperparameters than supervised models, and managing and reporting on them usually falls out of the scope of research papers. The practice of sharing the code alleviates this issue somewhat, and should be saluted, since we can always dig up in the code what the parameters were. However, ideally, choices that drastically change the performance of the algorithms should be made explicit as much as possible in the research papers and the ablation studies. - -## Comparisons with fixed update frequency - -We want to make the community aware of the importance of the update frequency when comparing algorithms and when designing benchmarks. Running benchmarks without any constraints allows using different update frequencies for each algorithm. We believe that such benchmarks are valuable for the community. However, it would also be very informative for the community to have benchmarks with comparable update frequencies between the algorithms. This would for instance help to find the potentially best algorithms for real applications with constraints on the update frequency. - -Coming back to the experiments run in MBPO's paper, as the default MBPO implementation updates the model every 250 steps, it might also make sense to allow PETS to be updated every 250 steps as well to have comparable results. We also note that the MBRL-Lib paper compares the MBRL-Lib implementations of PETS and MBPO with their respective original update frequencies. We do not think that this would have a big impact for these two algorithms but it would be fairer to use the same update frequency. Finally, looking at the code of the MBRL benchmark done by , it is not clear whether the same update frequency is used for all the algorithms of the benchmark. For instance, it seems the update frequency on Acrobot is 3000 for RS (time_step_per_batch in https://github.com/WilsonWangTHU/mbbl/blob/master/scripts/exp_1_performance_curve/rs.sh) but 5000 for ME-TRPO (num_path_onpol $\times$ env_horizon in https://github.com/WilsonWangTHU/mbbl-metrpo/blob/master/configs/params_acrobot.json). - -The BREMEN paper has a benchmark comparing different algorithms under fixed update frequencies. This gives valuable insights into the performance of the existing algorithms under these deployment constraints. The next step would be to evaluate the performance with a different number of deployments and a different number of steps per deployment, which we argue for in the next section.
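Schematically, the evaluation we advocate pins the deployment loop and lets only the algorithm-specific update rule vary. Here is a minimal Python sketch of this protocol (the function names are ours, not taken from any of the cited codebases):

```python
def evaluate_fixed_update_frequency(algo_update, act, env_step,
                                    n_deployments, steps_per_deployment):
    """Benchmark an algorithm under a fixed, explicit update frequency.

    algo_update(dataset, agent) -> agent is the algorithm-specific training
    (model and/or policy), called only BETWEEN deployments; the agent stays
    frozen within each deployment.
    """
    dataset, agent = [], None
    for _ in range(n_deployments):
        for _ in range(steps_per_deployment):   # one deployment, fixed agent
            dataset.append(env_step(act(agent)))
        agent = algo_update(dataset, agent)     # the only place updates happen
    return agent
```

Comparing two algorithms then simply amounts to calling this loop with the same `n_deployments` and `steps_per_deployment` for both.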
- -## Ablation studies - -Comparisons of different update frequencies are very rare in existing benchmarks and existing papers. Even without real-life constraints, it would be valuable to know how sensitive the performance of a given algorithm is with respect to the update frequency. The issue for the authors is that this could be asked for many other hyperparameters and would represent additional computational budget and time. However, we often find ablations on the number of models (if the model is an ensemble), the rollout length, the number of gradient updates for the model-free policy, but very rarely on the update frequency. It is very likely that the agents that are good for small deployments would be bad for large deployments, a setting that would tend to be closer to the pure offline setting (for the same total budget of real system interactions). We perform such an ablation study using MBPO in the next section, showing that MBPO's performance degrades with larger update frequencies. - - -### Varying the update frequency in MBPO - -Using the MBPO implementation and the examples provided by MBRL-Lib, we ran MBPO on Gym-Halfcheetah-v4, Gym-Hopper-v4 and Gym-Walker2d-v4 with different update frequencies: updating the agent at each step (default implementation described above), every 1000 steps, every 5000 steps and every 10 000 steps. Each curve shows the mean episode return obtained with at least 10 seeds. We did not run Hopper and Walker with an update frequency of 10 000 steps as the performance obtained with 5000 was already poor. The lightly shaded areas indicate the 95% bootstrap confidence interval. - -{% include figure.html path="assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_cheetah.png" class="img-fluid" %} - -{% include figure.html path="assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_hopper.png" class="img-fluid" %} - -{% include figure.html path="assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_walker.png" class="img-fluid" %} - -Except for the update frequency of 1000 steps on Halfcheetah and Walker, which achieves similar performance to the default configuration updating the agent at each step, the results indicate a decline in asymptotic performance with larger update frequencies. Although MBPO exhibits good performance over different environments for the default update frequency, this is not the case for the other update frequencies that we consider here. We note here that 1000 steps is the usual maximum episode length and therefore a reasonable value to try for the update frequency. One insight from this experiment is that even though MBPO is one of the state-of-the-art MBRL algorithms, practical constraints like the update frequency can substantially degrade its performance in real-world applications. - -When trying these values of update frequencies, we adjusted the number of gradient steps to maintain a constant ratio of gradient steps per step on the real system. For the maximum buffer size of SAC we used the rule provided in MBPO's code. The table below shows the values obtained for the maximum buffer size. As shown in the figure below, using a smaller buffer size negatively impacts the performance for the update frequencies of 1000 steps and 10 000 steps. While there is a possibility that better values for the hyperparameters (other than the update frequency) could be found, we did what appeared to be the natural way to adapt the other hyperparameters when increasing the update frequency.
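Concretely, the adaptation rule amounts to scaling the number of SAC updates with the update frequency; a hypothetical helper capturing it (the default of 10 updates per environment step is the Halfcheetah value; Hopper and Walker use different ratios, see the Appendix):

```python
def scaled_mbpo_schedule(update_frequency, sac_updates_per_env_step=10):
    """Keep the number of SAC gradient updates per environment step constant.

    update_frequency: number of environment steps between agent updates.
    Returns the schedule entries of the MBRL-Lib configuration files.
    """
    return {
        "freq_train_model": update_frequency,
        "sac_updates_every_steps": update_frequency,
        "num_sac_updates_per_step": sac_updates_per_env_step * update_frequency,
    }

# e.g. scaled_mbpo_schedule(5000) gives num_sac_updates_per_step = 50000,
# matching the Halfcheetah configuration in the Appendix.
```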
See the Appendix for the complete description of the hyperparameters used in these experiments. - -| Agent update frequency | Model update frequency | Policy update frequency | Max SAC buffer size | -|------------------|--------------------------|-----------------------------------|-------------| -|default (1 step) | 250 | 1 | 400 000 | -| 1 000 steps | 1000 | 1000 | 400 000 | -| 5 000 steps | 5000 | 5000 | 2 million | -|10 000 steps | 10 000 | 10 000 | 4 million | - -{% include figure.html path="assets/img/2024-05-07-update-frequency-in-mbrl/buffer_size.png" class="img-fluid" %} - - -## Conclusion - -The goal of this blog post is to shed light on a frequently overlooked hyperparameter in MBRL: the update frequency. Despite its importance for real-life applications, this parameter is rarely discussed or analyzed. We emphasize the importance of running more evaluations using consistent update frequencies across different algorithms and more ablation studies. We for instance show how the update frequency impacts the performance of MBPO. Similar to the update frequency, we can identify several other hyperparameters that deserve more attention when benchmarking different MBRL algorithms. A typical example is the continual training (of the model and/or policy) versus retraining from scratch (referred to as the primacy bias in some previous work ). We believe this blog post offers valuable insights to researchers, providing directions that would be worth investigating to explain the differences between MBRL algorithms and whether these differences really impact the existing comparisons. - - -## Appendix - -We provide here the configuration files we used to run the different experiments. -#### Halfcheetah -* Update frequency of 1000 steps - -```yaml -# @package _group_ -env: "gym___HalfCheetah-v4" -term_fn: "no_termination" - -num_steps: 400000 -epoch_length: 1000 -num_elites: 5 -patience: 5 -model_lr: 0.001 -model_wd: 0.00001 -model_batch_size: 256 -validation_ratio: 0.2 -freq_train_model: 1000 -effective_model_rollouts_per_step: 400 -rollout_schedule: [20, 150, 1, 1] -num_sac_updates_per_step: 10000 -sac_updates_every_steps: 1000 -num_epochs_to_retain_sac_buffer: 1 - -sac_gamma: 0.99 -sac_tau: 0.005 -sac_alpha: 0.2 -sac_policy: "Gaussian" -sac_target_update_interval: 1 -sac_automatic_entropy_tuning: true -sac_target_entropy: -1 -sac_hidden_size: 512 -sac_lr: 0.0003 -sac_batch_size: 256 -``` - -* Update frequency of 5000 steps - -```yaml -# @package _group_ -env: "gym___HalfCheetah-v4" -term_fn: "no_termination" - -num_steps: 400000 -epoch_length: 5000 -num_elites: 5 -patience: 5 -model_lr: 0.001 -model_wd: 0.00001 -model_batch_size: 256 -validation_ratio: 0.2 -freq_train_model: 5000 -effective_model_rollouts_per_step: 400 -rollout_schedule: [20, 150, 1, 1] -num_sac_updates_per_step: 50000 -sac_updates_every_steps: 5000 -num_epochs_to_retain_sac_buffer: 1 - -sac_gamma: 0.99 -sac_tau: 0.005 -sac_alpha: 0.2 -sac_policy: "Gaussian" -sac_target_update_interval: 1 -sac_automatic_entropy_tuning: true -sac_target_entropy: -1 -sac_hidden_size: 512 -sac_lr: 0.0003 -sac_batch_size: 256 -``` - -* Update frequency of 10000 steps - -```yaml -# @package _group_ -env: "gym___HalfCheetah-v4" -term_fn: "no_termination" - -num_steps: 400000 -epoch_length: 10000 -num_elites: 5 -patience: 5 -model_lr: 0.001 -model_wd: 0.00001 -model_batch_size: 256 -validation_ratio: 0.2 -freq_train_model: 10000 -effective_model_rollouts_per_step: 400 -rollout_schedule: [20, 150, 1, 1] -num_sac_updates_per_step: 100000 
-sac_updates_every_steps: 10000 -num_epochs_to_retain_sac_buffer: 1 - -sac_gamma: 0.99 -sac_tau: 0.005 -sac_alpha: 0.2 -sac_policy: "Gaussian" -sac_target_update_interval: 1 -sac_automatic_entropy_tuning: true -sac_target_entropy: -1 -sac_hidden_size: 512 -sac_lr: 0.0003 -sac_batch_size: 256 -``` - -#### Hopper -* Update frequency of 1000 steps - -```yaml -# @package _group_ -env: "gym___Hopper-v4" -term_fn: "hopper" - -num_steps: 125000 -epoch_length: 1000 -num_elites: 5 -patience: 5 -model_lr: 0.001 -model_wd: 0.00001 -model_batch_size: 256 -validation_ratio: 0.2 -freq_train_model: 1000 -effective_model_rollouts_per_step: 400 -rollout_schedule: [20, 150, 1, 15] -num_sac_updates_per_step: 40_000 -sac_updates_every_steps: 1000 -num_epochs_to_retain_sac_buffer: 1 - -sac_gamma: 0.99 -sac_tau: 0.005 -sac_alpha: 0.2 -sac_policy: "Gaussian" -sac_target_update_interval: 4 -sac_automatic_entropy_tuning: false -sac_target_entropy: 1 # ignored, since entropy tuning is false -sac_hidden_size: 512 -sac_lr: 0.0003 -sac_batch_size: 256 -``` - -* Update frequency of 5000 steps - -```yaml -# @package _group_ -env: "gym___Hopper-v4" -term_fn: "hopper" - -num_steps: 125000 -epoch_length: 1000 -num_elites: 5 -patience: 5 -model_lr: 0.001 -model_wd: 0.00001 -model_batch_size: 256 -validation_ratio: 0.2 -freq_train_model: 5000 -effective_model_rollouts_per_step: 400 -rollout_schedule: [20, 150, 1, 15] -num_sac_updates_per_step: 200000 -sac_updates_every_steps: 5000 -num_epochs_to_retain_sac_buffer: 1 - -sac_gamma: 0.99 -sac_tau: 0.005 -sac_alpha: 0.2 -sac_policy: "Gaussian" -sac_target_update_interval: 4 -sac_automatic_entropy_tuning: false -sac_target_entropy: 1 # ignored, since entropy tuning is false -sac_hidden_size: 512 -sac_lr: 0.0003 -sac_batch_size: 256 -``` - -#### Walker -* Update frequency of 1000 steps - -```yaml -# @package _group_ -env: "gym___Walker2d-v4" -term_fn: "walker2d" - -num_steps: 300000 -epoch_length: 1000 -num_elites: 5 -patience: 10 -model_lr: 0.001 -model_wd: 0.00001 -model_batch_size: 256 -validation_ratio: 0.2 -freq_train_model: 1000 -effective_model_rollouts_per_step: 400 -rollout_schedule: [20, 150, 1, 1] -num_sac_updates_per_step: 20000 -sac_updates_every_steps: 1000 -num_epochs_to_retain_sac_buffer: 1 - -sac_gamma: 0.99 -sac_tau: 0.005 -sac_alpha: 0.2 -sac_policy: "Gaussian" -sac_target_update_interval: 4 -sac_automatic_entropy_tuning: false -sac_target_entropy: -1 # ignored, since entropy tuning is false -sac_hidden_size: 1024 -sac_lr: 0.0001 -sac_batch_size: 256 -``` - -* Update frequency of 5000 steps -We only used a maximum buffer size of 1 million to limit the memory usage of this experiment. 
- -```yaml -# @package _group_ -env: "gym___Walker2d-v4" -term_fn: "walker2d" - -num_steps: 300000 -epoch_length: 1000 -num_elites: 5 -patience: 10 -model_lr: 0.001 -model_wd: 0.00001 -model_batch_size: 256 -validation_ratio: 0.2 -freq_train_model: 5000 -effective_model_rollouts_per_step: 200 -rollout_schedule: [20, 150, 1, 1] -num_sac_updates_per_step: 100000 -sac_updates_every_steps: 5000 -num_epochs_to_retain_sac_buffer: 1 - -sac_gamma: 0.99 -sac_tau: 0.005 -sac_alpha: 0.2 -sac_policy: "Gaussian" -sac_target_update_interval: 4 -sac_automatic_entropy_tuning: false -sac_target_entropy: -1 # ignored, since entropy tuning is false -sac_hidden_size: 1024 -sac_lr: 0.0001 -sac_batch_size: 256 -``` diff --git a/_posts/2024-05-07-what-exactly-has-tabpfn-learned-to-do.md b/_posts/2024-05-07-what-exactly-has-tabpfn-learned-to-do.md deleted file mode 100644 index 06ece9e7..00000000 --- a/_posts/2024-05-07-what-exactly-has-tabpfn-learned-to-do.md +++ /dev/null @@ -1,218 +0,0 @@ ---- -layout: distill -title: What exactly has TabPFN learned to do? -description: TabPFN [Hollmann et al., 2023], a Transformer model pretrained to perform in-context learning on fresh tabular classification problems, was presented at the last ICLR conference. To better understand its behavior, we treat it as a black-box function approximator generator and observe its generated function approximations on a varied selection of training datasets. Exploring its learned inductive biases in this manner, we observe behavior that is at turns either brilliant or baffling. We conclude this post with thoughts on how these results might inform the development, evaluation, and application of prior-data fitted networks (PFNs) in the future. -date: 2024-05-07 -future: true -htmlwidgets: true - -authors: - - name: Calvin McCarter - url: "https://calvinmccarter.com/" - affiliations: - name: BigHat Biosciences - -# must be the exact same name as your blogpost -bibliography: 2024-05-07-what-exactly-has-tabpfn-learned-to-do.bib - -# Add a table of contents to your post. -# - make sure that TOC names match the actual section names -# for hyperlinks within the post to work correctly. -# - please use this format rather than manually creating a markdown table of contents. -toc: - - name: Introduction - - name: 1d binary classification - - name: 2d multiclass classification - - name: Cancer status classification from high-dimensional gene expressions - - name: Computer vision as a tabular classification problem - - name: Closing thoughts - -# Below is an example of injecting additional post-specific styles. -# This is used in the 'Layouts' section of this post. -# If you use this post as a template, delete this _styles block. -_styles: > - .fake-img { - background: #bbb; - border: 1px solid rgba(0, 0, 0, 0.1); - box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1); - margin-bottom: 12px; - } - .fake-img p { - font-family: monospace; - color: white; - text-align: left; - margin: 12px 0; - text-align: center; - font-size: 16px; - } ---- - - -## Introduction - -TabPFN is a deep learning model pretrained to perform in-context learning for tabular classification. -Since then, it has attracted attention both for its high predictive performance on small dataset benchmarks and for its unique meta-learning approach. -This meta-learning approach, which builds upon earlier work on prior-data fitted networks (PFN) , requires only synthetically-generating data: structural causal models (SCMs) are randomly generated, then training datasets are sampled from each SCM. 
-On fresh classification tasks, no training (i.e. weight updating) is needed; instead, training data is given as context to TabPFN, a Transformer model with self-attention among training samples and cross-attention from test samples to training samples. -TabPFN can be optionally used with ensembling, wherein the forward pass is repeated with random permutations of features and class labels, and with power transformation applied to random subsets of features. -Subsequent works have reproduced its classification performance on other tabular benchmarks , and analyzed its theoretical foundations . - -At the same time, TabPFN has received criticism from within the applied ML community, around concerns that its "one large neural network is all you need" approach is fundamentally flawed and that its performance on public benchmarks may be due to overfitting. - -{% twitter https://twitter.com/tunguz/status/1583417038965334017 %} - -{% twitter https://twitter.com/predict_addict/status/1726286748173385732 %} - - -In this article, we will attempt to demystify TabPFN's behavior in order to move towards a resolution to these questions. -With this goal, we will take a different tack to analyzing TabPFN than previous works: -we will neither theoretically analyze its meta-learning pre-training approach, nor run it on yet another dataset-of-datasets, nor even mechanistically interpret the meaning of specific model weights or subnetworks. - -Instead, we will first explore its holistic behavior on two simple settings, in order to develop an intuition about TabPFN as a function approximation generator. -This is motivated by the observation that TabPFN once fitted on fresh training data (even though "fitting" is merely storing the training data), is not mathematically different from any other fitted model: it is simply a function $$f_{\mathcal{D}, \theta}: x \rightarrow y$$ from test input $$x$$ to prediction $$y$$, -where $$\mathcal{D} = (X_{\textrm{train}}, y_{\textrm{train}})$$ is the training data and $$\theta$$ are the TabPFN model weights. -By plotting $$f$$ for various case studies of $$(X_{\textrm{train}}, y_{\textrm{train}})$$, we aim to better understand what statistical knowledge has been represented in model parameters $$\theta$$. - -Next, we will evaluate TabPFN on two non-standard tabular ML classification tasks, comparing its performance with other methods. -These atypical tasks can be thought of as out-of-distribution relative to the synthetic pretraining datasets upon which TabPFN was pretrained. -This analysis will help indicate whether TabPFN was overfit to the statistical peculiarities of publicly-available small tabular datasets, or whether it has learned generalizable principles that lead to sensible behavior even in out-of-domain settings. - -## 1d binary classification - -We begin by examining the case of binary classification with 1d inputs. To better illustrate the inductive biases of the base TabPFN model, we do not use ensembling in this section unless otherwise indicated. - -Below, we show the predictions for two training samples located at +1 and -1, labeled green and red, respectively. We see that the probabilities are non-monotonic, as one would obtain from a sigmoid function; not only do we see that the model has higher uncertainty on the far sides of the training points, we see that between them there is a small wiggle. We also see that the decision boundary biased below 0.5; likely this is because TabPFN has learned that features are have right-skewed distributions. 
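Our probing setup is simple to reproduce. Here is a minimal sketch using the TabPFN package's scikit-learn-style interface (the argument names follow the original release, to the best of our knowledge):

```python
import numpy as np
from tabpfn import TabPFNClassifier

# Two training points: x = -1 labeled red (0) and x = +1 labeled green (1).
X_train = np.array([[-1.0], [1.0]])
y_train = np.array([0, 1])

# A single ensemble member, to expose the base model's inductive biases.
clf = TabPFNClassifier(device="cpu", N_ensemble_configurations=1)
clf.fit(X_train, y_train)            # "fitting" merely stores the context

# Probe the generated function approximation on a dense 1d grid.
X_grid = np.linspace(-4.0, 4.0, 401).reshape(-1, 1)
proba = clf.predict_proba(X_grid)    # (401, 2); these are the curves we plot
```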
- -{% include figure.html path="assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-nonmonotone.png" class="img-fluid" %} - -These wiggles and asymmetry more-or-less disappear once we incorporate ensembling, shown below. -However, the general shape of the predicted probability function is similar regardless of the number of ensembles. - -
- {% include figure.html path="assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-ensembles2.png" class="img-fluid" %} -
-
- TabPFN predicted probabilities for test data, in red and green, for varying number of ensembles. Also shown are the predicted probabilities from using inverse-square-root of Euclidean distance within softmax, in orange and lime-green. -
- -The above results raise the question of what parametric attention function might have been learned by TabPFN. -No simple dot-product-based or Euclidean distance-based function (used within the softmax operation) exactly recapitulated the observed predicted probabilities. -However, the general shape of inverse-square-root of Euclidean distance matched reasonably well, particularly between the two training points. -Still, it appears that TabPFN has meta-learned an attention function that outperforms previously-known attention functions on small datasets. - -Next, we look at the effect of duplicating features. We tried repeating the +1 and -1 inputs for a total of 1, 4, 16, and 64 copies, as shown below. The effect is to push the predicted probabilities away from 0.5, although we observe diminishing marginal effects as the number of repeats increases. - -{% include figure.html path="assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeats.png" class="img-fluid" %} - -Meanwhile, there is no discernible effect from replicating samples, when both red and green samples are replicated. Below we show the predicted probabilities, when both red and green samples are each copied for a total of 1, 4, 16, and 64 times. - -{% include figure.html path="assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatboth.png" class="img-fluid" %} - -In contrast, there is an impact to repeating only the red sample. -Below is shown the effect of repeating only the red sample. -While this unsurprisingly increases the probability of red for $$X < 0$$, it bizarrely increases the probability of green for $$X > 0$$. -This is especially strange because repeating green samples in the previous setting did not have the same effect. -This behavior of TabPFN seems suboptimal; it remains to be seen whether this behavior was optimal for its pretraining data, or whether this is some kind of artifact of TabPFN's architecture or training. - -{% include figure.html path="assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatred.png" class="img-fluid" %} - -Finally, we were unable to find evidence that TabPFN is able to detect periodic patterns in the training data, as exemplified for three different training patterns shown below. -This behavior of TabPFN suggests that it does not support either periodic interpolation or extrapolation. -Furthermore, we observe that as the number of observed cycles in the data increases, the predicted probabilities trend toward 0.5, which also seems suboptimal. -We also notice that there is marked left-right asymmetry in these settings. - -{% include figure.html path="assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2pair.png" class="img-fluid" %} -{% include figure.html path="assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-3pair.png" class="img-fluid" %} -{% include figure.html path="assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2skippair.png" class="img-fluid" %} - - -## 2d multiclass classification - -Here, we examine the behavior of TabPFN on 2d input data, on problems with as many samples as classes. -Below we show results for both randomly-spaced and grid-spaced inputs, and for both ensembling and no-ensembling settings of TabPFN. -In each plot, we show the training data, their corresponding Voronoi diagrams, and finally the model predictions for the test inputs. 
-We see that, without ensembling, TabPFN performs quite poorly, partitioning the input space in a nonsensical manner. -The results markedly improve when we use 32 ensembles. -Particularly for the randomly-spaced training points, the model predictions clearly resemble the Voronoi diagram, suggesting that (ensembled) TabPFN has meta-learned to perform 1-nearest-neighbor classification in the setting where each class has a single training sample. - -On the other hand, the fact that this behavior relies upon ensembling suggests that the base TabPFN model could be further improved. -In the original paper, Hollmann et al. express the hope that a future, better version of TabPFN would not need to rely upon ensembling for permutation invariance, by having internalized that behavior through better architecture and training. -The aforementioned observed behavior suggests that ensembling improves performance not only by (approximately) enforcing permutation invariance, but also by producing lower-variance estimators; if so, the base model could also be trained to do the latter directly.
-{% include figure.html path="assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/voronois.png" class="img-fluid" %} -
-
- TabPFN predictions on randomly-spaced points (left) and grid-spaced points (right). The training points are depicted as $\times$s. The yellow lines depict the Voronoi diagram of the training points. The test points are colored by TabPFN's predictions, using the same color scheme as the training points. We see that, without ensembling, TabPFN's predicted classes do not form contiguous regions over the input space. -
- - -## Cancer status classification from high-dimensional gene expressions - -We now turn towards a comparison of TabPFN with logistic regression (LR), support vector classification (SVC), and XGBoost on the BladderBatch cancer status classification task. -The bladderbatch dataset consists of 57 samples, 22,283 gene expression features, and 3 classes ("normal" vs "biopsy" vs "cancer"). -This is an extremely high-dimensional problem compared to TabPFN's intended use for $$d \le 100$$; also, linear models tend to be sufficient for predicting cancer status given gene expressions. -Thus, this setting is far outside the domain on which we would expect TabPFN to perform well, particularly if it had been overfit to small tabular datasets. -Furthermore, the 57 samples come from 5 different batches of gene microarray measurements. -This adds additional difficulty to the task, because there is confounded shift between the technical batch effect and the unequal proportions of cancer status in the different batches. - -For all methods, we do not perform hyperparameter search, in order to simulate the scenario where there are too few samples to perform cross-validation without the risk of overfitting. -We use the scikit-learn implementations of LR and SVC with their default hyperparameters. -For TabPFN, we use the default hyperparameter of 32 ensembles; we also enable feature subsampling as is required for $$d > 100$$ problems. - -Results are shown below, aggregated over 10 random 75-25 train-test splits, and evaluated via both accuracy and macro-averaged F1-score. -TabPFN has a surprisingly strong showing, handily beating SVC and XGBoost, while almost matching logistic regression. -This pattern holds both when we use all features and also when we use only the first 1k features. - -{% include figure.html path="assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-comparison.png" class="img-fluid" %} - -We also evaluate the different methods on a more realistic setting, where we train on 4 out of 5 batches of data and evaluate on all samples from the remaining unseen batch. -Results are shown below, with scatterplot labels used to indicate the identity of the test batch. -While all methods perform worse in this setting, TabPFN still almost matches LR while beating the other baselines. - -{% include figure.html path="assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-scatterplot.png" class="img-fluid" %} - -We also verify that TabPFN is not simply memorizing the class imbalance in favor of cancer. -We compute confusion matrices, shown below, for each train-test split. -Even though cancer is the most common class in every training split, there does not appear to be any systematic bias across the splits in favor of predicting cancer. - -{% include figure.html path="assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-confusion.png" class="img-fluid" %} - - -## Computer vision as a tabular classification problem - -Finally, we compare TabPFN with other methods on two computer vision (CV) tasks. -As in the previous section, we use the default hyperparameter settings for all methods. -We treat MNIST and CIFAR-10 as tabular ML problems with $$28*28^2$$ and $$3*32^2$$ features, respectively. -We aggregate over 10 train-test splits, where the test set is the full MNIST / CIFAR-10 test set, and the training set is a random subsample of size 30, 100, 300, and 1000. 
-In this experiment, TabPFN was competitive for smaller training set sizes, but lagged as we trained on more samples. -Interestingly, while for cancer classification SVC performed poorly, it performed well for large sample sizes on the CV tasks. -Meanwhile, while logistic regression (LR) performed well on cancer classification, it struggled in the current setting. -It remains to be seen whether the shared behavioral characteristics of TabPFN and LR in these tasks hold more generally. -If so, this could motivate future work on meta-learning TabPFN to perform robust classification with a hinge-type loss. - -
-
- {% include figure.html path="assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/mnist-vs-samples.png" class="img-fluid" %} -
-
- {% include figure.html path="assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/cifar10-vs-samples.png" class="img-fluid" %} -
-
-
- Test accuracy on MNIST (left) and CIFAR-10 (right). -
- -## Closing thoughts - -Taken together, our preliminary results are suggestive of future developments in tabular PFNs. Currently, an applied ML practitioner will likely choose between training a model on their own small dataset and using the TabPFN "one size fits all" model. Our results suggest that TabPFN model will likely perform quite well, even outside its intended domain. However, it still came second-place to logistic regression on our cancer classification task and last or second-last on the CV classification problems. This suggests that the future will not look like a binary choice between training a non-PFN and selecting a single state-of-the-art tabular PFN. Rather, we suspect that there will exist PFNs for specific modalities of data (e.g. gene expression), or for specific settings (e.g. robust classification) that bridge the gap between the two extremes. - -In such a future, we believe our approach to evaluating TabPFN will become increasingly essential. In the burgeoning field of large language models (LLMs), evaluation on various public benchmarks is widely considered necessary but insufficient. LLM researchers and users will also evaluate a newly-announced model by trying their favorite personal examples on the new LLM. When the LLM fails on a prompt, one modifies the prompt slightly to see whether the LLM simply expected a different prompting style. When the LLM succeeds, one tries variants to see whether its satisfactory response was in fact brittle to the prompt. By interacting with an LLM, one gets a sense for its expected prompting style and the type of outputs it generates. In particular, providing out-of-distribution (adversarial) inputs (e.g. "poem poem poem") to an LLM tells us something useful about how it will operate on future unanticipated out-of-distribution inputs. - -By analogy, we argue that, while open tabular benchmarks are valuable resources, these should not be fully determinative for researchers and users of tabular ML methods. Benchmarks do allow us to quickly discover which methods are Pareto-dominated and can therefore be safely ignored. However, as we move into a world with multiple available PFN options, with different sorts of inductive priors, it will become increasingly useful to interact with them on simple problems to gain an intuition for whether their priors match one's own use-case. For our analysis on 1d inputs, it is important to notice that there is not necessarily one "right answer". Thus, evaluations of tabular ML approaches will need to be more granular than to describe TabPFN as state of the art for all of tabular ML. Instead, evaluations should aim at identifying specific tabular PFN checkpoints, based on different inductive priors and synthetic datasets, as being best suited for specific classes of problem settings. - -Furthermore, our results illuminate a key practical difference between TabPFN, which relies on in-context learning, and other neural network models for tabular ML. Skepticism around neural networks for tabular ML has been justified by problems stemming from the non-convexity of neural network training. Note that the problem (in the small dataset context) with neural net training non-convexity is not fundamentally about the fact that one may have missed a global optimum with better performance. Rather, deep learning requires babysitting during training runs and optimization of training hyperparameters which are unrelated to one's beliefs about the nature of one's specific problem. 
-Thus, a modified architecture, preprocessing method, or data selection approach might be better matched to a particular dataset, yet end up performing worse due to problematic training dynamics -- which one might be unable to fix without risking overfitting. In the small dataset regime, the maximum performance (over all training hyperparameter settings) matters less than the performance under default hyperparameter settings.
-
-Because TabPFN's pure in-context learning obviates this problem, these fundamental weaknesses of other neural network approaches do not apply to it. For example, our 1d experiments would not have been straightforwardly possible if we had retrained a neural network on each reconfiguration of the training data. Had we done so while keeping the training hyperparameters fixed, the results would not have represented how people actually use such a neural network. On the other hand, had we plotted results for carefully optimized hyperparameters, it would be unclear whether the results illustrate the general inductive biases of the architecture or merely the behavior of an optimally-trained network. The flip side of this advantage, however, is that our analysis applies not so much to TabPFN-the-method as to [prior_diff_real_checkpoint_n_0_epoch_42.cpkt](https://github.com/automl/TabPFN/blob/d76f4ac7/tabpfn/models_diff/prior_diff_real_checkpoint_n_0_epoch_42.cpkt)-the-checkpoint.
-
-Finally, we believe our evaluation helps address some of the popular skepticism around TabPFN. While our results indicate that there remains substantial room for improvement, we found no evidence suggesting that TabPFN's results are merely the product of overfitting a large neural network to public benchmarks. Rather, our results suggest that TabPFN learns a simple "world model" of small-n statistical learning for tabular classification. This, in itself, makes TabPFN worthy of further careful empirical study.
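To make this concrete, below is a minimal sketch of the kind of 1d probing workflow described above. It assumes the open-source `tabpfn` package and its scikit-learn-style `TabPFNClassifier`; the toy dataset and the single-label perturbation are illustrative assumptions, not our exact experiments. The key point is that each reconfiguration of the training data costs one forward pass through the fixed checkpoint, not a new training run:

```python
# A hypothetical 1d probe of TabPFN's inductive prior (assumes the `tabpfn`
# package; the dataset values are made up for illustration).
import numpy as np
from tabpfn import TabPFNClassifier

# A tiny 1d binary classification problem: eight labeled points on a line.
X_train = np.array([[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A dense grid of test inputs for tracing out the predicted probability curve.
X_grid = np.linspace(-4.0, 4.0, 201).reshape(-1, 1)

clf = TabPFNClassifier(device="cpu")

# "Fitting" merely stores the training set as in-context examples;
# no gradient updates to the network occur.
clf.fit(X_train, y_train)
p_before = clf.predict_proba(X_grid)[:, 1]

# Reconfigure the training data (here, flip one label) and rerun inference.
# This is a single forward pass, not a retraining run, so there are no
# training hyperparameters to babysit.
y_flipped = y_train.copy()
y_flipped[3] = 1
clf.fit(X_train, y_flipped)
p_after = clf.predict_proba(X_grid)[:, 1]

# How far did one relabeled point move the predicted probability curve?
print(np.abs(p_before - p_after).max())
```

Plotting `p_before` and `p_after` against `X_grid` for many such perturbations is the 1d analogue of perturbing an LLM's prompt: it reveals the checkpoint's prior directly, with no confound from training dynamics.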
diff --git a/_projects/1_project.md b/_projects/1_project.md
deleted file mode 100644
index 3f7cf783..00000000
--- a/_projects/1_project.md
+++ /dev/null
@@ -1,80 +0,0 @@
----
-layout: page
-title: project 1
-description: a project with a background image
-img: assets/img/12.jpg
-importance: 1
-category: work
----
-
-Every project has a beautiful feature showcase page.
-It's easy to include images in a flexible 3-column grid format.
-Make your photos 1/3, 2/3, or full width.
-
-To give your project a background in the portfolio page, just add the img tag to the front matter like so:
-
-    ---
-    layout: page
-    title: project
-    description: a project with a background image
-    img: /assets/img/12.jpg
-    ---
-
-<div class="row justify-content-sm-center">
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/1.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/3.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/5.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-    Caption photos easily. On the left, a road goes through a tunnel. Middle, leaves artistically fall in a hipster photoshoot. Right, in another hipster photoshoot, a lumberjack grasps a handful of pine needles.
-</div>
-<div class="row justify-content-sm-center">
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/5.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-    This image can also have a caption. It's like magic.
-</div>
-
-You can also put regular text between your rows of images.
-Say you wanted to write a little bit about your project before you posted the rest of the images.
-You describe how you toiled, sweated, *bled* for your project, and then... you reveal its glory in the next row of images.
-
-<div class="row justify-content-sm-center">
-    <div class="col-sm-8 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/6.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm-4 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/11.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-    You can also have artistically styled 2/3 + 1/3 images, like these.
-</div>
-
-The code is simple.
-Just wrap your images with `<div class="col-sm">` and place them inside `<div class="row">` (read more about the Bootstrap Grid system).
-To make images responsive, add `img-fluid` class to each; for rounded corners and shadows use `rounded` and `z-depth-1` classes.
-Here's the code for the last row of images above:
-
-{% raw %}
-```html
-<div class="row justify-content-sm-center">
-    <div class="col-sm-8 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/6.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm-4 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/11.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-```
-{% endraw %}
diff --git a/_projects/2_project.md b/_projects/2_project.md
deleted file mode 100644
index bebf7961..00000000
--- a/_projects/2_project.md
+++ /dev/null
@@ -1,80 +0,0 @@
----
-layout: page
-title: project 2
-description: a project with a background image
-img: assets/img/3.jpg
-importance: 2
-category: work
----
-
-Every project has a beautiful feature showcase page.
-It's easy to include images in a flexible 3-column grid format.
-Make your photos 1/3, 2/3, or full width.
-
-To give your project a background in the portfolio page, just add the img tag to the front matter like so:
-
-    ---
-    layout: page
-    title: project
-    description: a project with a background image
-    img: /assets/img/12.jpg
-    ---
-
-<div class="row justify-content-sm-center">
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/1.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/3.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/5.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-    Caption photos easily. On the left, a road goes through a tunnel. Middle, leaves artistically fall in a hipster photoshoot. Right, in another hipster photoshoot, a lumberjack grasps a handful of pine needles.
-</div>
-<div class="row justify-content-sm-center">
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/5.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-    This image can also have a caption. It's like magic.
-</div>
-
-You can also put regular text between your rows of images.
-Say you wanted to write a little bit about your project before you posted the rest of the images.
-You describe how you toiled, sweated, *bled* for your project, and then... you reveal its glory in the next row of images.
-
-<div class="row justify-content-sm-center">
-    <div class="col-sm-8 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/6.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm-4 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/11.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-    You can also have artistically styled 2/3 + 1/3 images, like these.
-</div>
-
-The code is simple.
-Just wrap your images with `<div class="col-sm">` and place them inside `<div class="row">` (read more about the Bootstrap Grid system).
-To make images responsive, add `img-fluid` class to each; for rounded corners and shadows use `rounded` and `z-depth-1` classes.
-Here's the code for the last row of images above:
-
-{% raw %}
-```html
-<div class="row justify-content-sm-center">
-    <div class="col-sm-8 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/6.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm-4 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/11.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-```
-{% endraw %}
diff --git a/_projects/3_project.md b/_projects/3_project.md
deleted file mode 100644
index 3f3cbf70..00000000
--- a/_projects/3_project.md
+++ /dev/null
@@ -1,81 +0,0 @@
----
-layout: page
-title: project 3
-description: a project that redirects to another website
-img: assets/img/7.jpg
-redirect: https://unsplash.com
-importance: 3
-category: work
----
-
-Every project has a beautiful feature showcase page.
-It's easy to include images in a flexible 3-column grid format.
-Make your photos 1/3, 2/3, or full width.
-
-To give your project a background in the portfolio page, just add the img tag to the front matter like so:
-
-    ---
-    layout: page
-    title: project
-    description: a project with a background image
-    img: /assets/img/12.jpg
-    ---
-
-<div class="row justify-content-sm-center">
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/1.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/3.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/5.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-    Caption photos easily. On the left, a road goes through a tunnel. Middle, leaves artistically fall in a hipster photoshoot. Right, in another hipster photoshoot, a lumberjack grasps a handful of pine needles.
-</div>
-<div class="row justify-content-sm-center">
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/5.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-    This image can also have a caption. It's like magic.
-</div>
-
-You can also put regular text between your rows of images.
-Say you wanted to write a little bit about your project before you posted the rest of the images.
-You describe how you toiled, sweated, *bled* for your project, and then... you reveal its glory in the next row of images.
-
-<div class="row justify-content-sm-center">
-    <div class="col-sm-8 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/6.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm-4 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/11.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-    You can also have artistically styled 2/3 + 1/3 images, like these.
-</div>
-
-The code is simple.
-Just wrap your images with `<div class="col-sm">` and place them inside `<div class="row">` (read more about the Bootstrap Grid system).
-To make images responsive, add `img-fluid` class to each; for rounded corners and shadows use `rounded` and `z-depth-1` classes.
-Here's the code for the last row of images above:
-
-{% raw %}
-```html
-<div class="row justify-content-sm-center">
-    <div class="col-sm-8 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/6.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm-4 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/11.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-```
-{% endraw %}
diff --git a/_projects/4_project.md b/_projects/4_project.md
deleted file mode 100644
index edb5dd25..00000000
--- a/_projects/4_project.md
+++ /dev/null
@@ -1,80 +0,0 @@
----
-layout: page
-title: project 4
-description: another without an image
-img:
-importance: 3
-category: fun
----
-
-Every project has a beautiful feature showcase page.
-It's easy to include images in a flexible 3-column grid format.
-Make your photos 1/3, 2/3, or full width.
-
-To give your project a background in the portfolio page, just add the img tag to the front matter like so:
-
-    ---
-    layout: page
-    title: project
-    description: a project with a background image
-    img: /assets/img/12.jpg
-    ---
-
-<div class="row justify-content-sm-center">
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/1.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/3.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/5.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-    Caption photos easily. On the left, a road goes through a tunnel. Middle, leaves artistically fall in a hipster photoshoot. Right, in another hipster photoshoot, a lumberjack grasps a handful of pine needles.
-</div>
-<div class="row justify-content-sm-center">
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/5.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-    This image can also have a caption. It's like magic.
-</div>
-
-You can also put regular text between your rows of images.
-Say you wanted to write a little bit about your project before you posted the rest of the images.
-You describe how you toiled, sweated, *bled* for your project, and then... you reveal its glory in the next row of images.
-
-<div class="row justify-content-sm-center">
-    <div class="col-sm-8 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/6.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm-4 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/11.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-    You can also have artistically styled 2/3 + 1/3 images, like these.
-</div>
-
-The code is simple.
-Just wrap your images with `<div class="col-sm">` and place them inside `<div class="row">` (read more about the Bootstrap Grid system).
-To make images responsive, add `img-fluid` class to each; for rounded corners and shadows use `rounded` and `z-depth-1` classes.
-Here's the code for the last row of images above:
-
-{% raw %}
-```html
-<div class="row justify-content-sm-center">
-    <div class="col-sm-8 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/6.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm-4 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/11.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-```
-{% endraw %}
diff --git a/_projects/5_project.md b/_projects/5_project.md
deleted file mode 100644
index efd9b6cf..00000000
--- a/_projects/5_project.md
+++ /dev/null
@@ -1,80 +0,0 @@
----
-layout: page
-title: project 5
-description: a project with a background image
-img: assets/img/1.jpg
-importance: 3
-category: fun
----
-
-Every project has a beautiful feature showcase page.
-It's easy to include images in a flexible 3-column grid format.
-Make your photos 1/3, 2/3, or full width.
-
-To give your project a background in the portfolio page, just add the img tag to the front matter like so:
-
-    ---
-    layout: page
-    title: project
-    description: a project with a background image
-    img: /assets/img/12.jpg
-    ---
-
-<div class="row justify-content-sm-center">
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/1.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/3.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/5.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-    Caption photos easily. On the left, a road goes through a tunnel. Middle, leaves artistically fall in a hipster photoshoot. Right, in another hipster photoshoot, a lumberjack grasps a handful of pine needles.
-</div>
-<div class="row justify-content-sm-center">
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/5.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-    This image can also have a caption. It's like magic.
-</div>
-
-You can also put regular text between your rows of images.
-Say you wanted to write a little bit about your project before you posted the rest of the images.
-You describe how you toiled, sweated, *bled* for your project, and then... you reveal its glory in the next row of images.
-
-<div class="row justify-content-sm-center">
-    <div class="col-sm-8 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/6.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm-4 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/11.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-    You can also have artistically styled 2/3 + 1/3 images, like these.
-</div>
-
-The code is simple.
-Just wrap your images with `<div class="col-sm">` and place them inside `<div class="row">` (read more about the Bootstrap Grid system).
-To make images responsive, add `img-fluid` class to each; for rounded corners and shadows use `rounded` and `z-depth-1` classes.
-Here's the code for the last row of images above:
-
-{% raw %}
-```html
-<div class="row justify-content-sm-center">
-    <div class="col-sm-8 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/6.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm-4 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/11.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-```
-{% endraw %}
diff --git a/_projects/6_project.md b/_projects/6_project.md
deleted file mode 100644
index 9a95d6e8..00000000
--- a/_projects/6_project.md
+++ /dev/null
@@ -1,80 +0,0 @@
----
-layout: page
-title: project 6
-description: a project with no image
-img:
-importance: 4
-category: fun
----
-
-Every project has a beautiful feature showcase page.
-It's easy to include images in a flexible 3-column grid format.
-Make your photos 1/3, 2/3, or full width.
-
-To give your project a background in the portfolio page, just add the img tag to the front matter like so:
-
-    ---
-    layout: page
-    title: project
-    description: a project with a background image
-    img: /assets/img/12.jpg
-    ---
-
-<div class="row justify-content-sm-center">
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/1.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/3.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/5.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-    Caption photos easily. On the left, a road goes through a tunnel. Middle, leaves artistically fall in a hipster photoshoot. Right, in another hipster photoshoot, a lumberjack grasps a handful of pine needles.
-</div>
-<div class="row justify-content-sm-center">
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.html path="assets/img/5.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-    This image can also have a caption. It's like magic.
-</div>
-
-You can also put regular text between your rows of images.
-Say you wanted to write a little bit about your project before you posted the rest of the images.
-You describe how you toiled, sweated, *bled* for your project, and then... you reveal its glory in the next row of images.
-
-<div class="row justify-content-sm-center">
-    <div class="col-sm-8 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/6.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm-4 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/11.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-    You can also have artistically styled 2/3 + 1/3 images, like these.
-</div>
-
-The code is simple.
-Just wrap your images with `<div class="col-sm">` and place them inside `<div class="row">` (read more about the Bootstrap Grid system).
-To make images responsive, add `img-fluid` class to each; for rounded corners and shadows use `rounded` and `z-depth-1` classes.
-Here's the code for the last row of images above:
-
-{% raw %}
-```html
-<div class="row justify-content-sm-center">
-    <div class="col-sm-8 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/6.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-    <div class="col-sm-4 mt-3 mt-md-0">
-        {% include figure.html path="assets/img/11.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-``` -{% endraw %} diff --git a/_sass/_base.scss b/_sass/_base.scss deleted file mode 100644 index 7b826527..00000000 --- a/_sass/_base.scss +++ /dev/null @@ -1,658 +0,0 @@ -/******************************************************************************* - * Styles for the base elements of the theme. - ******************************************************************************/ - -// Typography - -p, h1, h2, h3, h4, h5, h6, em, div, li, span, strong { - color: var(--global-text-color); -} - -hr { - border-top: 1px solid var(--global-divider-color); -} - -table { - td, th { - color: var(--global-text-color); - } - td { - font-size: 1rem; - } -} - -a, table.table a { - color: var(--global-theme-color); - &:hover { - color: var(--global-theme-color); - text-decoration: underline; - } - &:hover:after :not(.nav-item.dropdown) { - width: 100%; - } -} - -figure, img { - max-width: 90vw; -} - -blockquote { - background: var(--global-bg-color); - border-left: 2px solid var(--global-theme-color); - margin: 1.5em 10px; - padding: 0.5em 10px; - font-size: 1.1rem; -} - -// Math - -.equation { - margin-bottom: 1rem; - text-align: center; -} - -// Caption - -.caption { - font-size: 0.875rem; - margin-top: 0.75rem; - margin-bottom: 1.5rem; - text-align: center; -} - -// Card - -.card { - background-color: var(--global-card-bg-color); - - img { - width: 100%; - } - - .card-title { - color: var(--global-text-color); - } - - .card-item { - width: auto; - margin-bottom: 10px; - - .row { - display: flex; - align-items: center; - } - } -} - -// Citation - -.citation, .citation-number { - color: var(--global-theme-color); -} - -// Profile - -.profile { - width: 100%; - - .address { - margin-bottom: 5px; - margin-top: 5px; - font-family: monospace; - p { - display: inline-block; - margin: 0; - } - } -} -.profile.float-right{ - margin-left: 1rem; -} -.profile.float-left{ - margin-right: 1rem; -} - -@media (min-width: 576px) { - .profile { - width: 30%; - .address { - p { display: block; } - } - } -} - -.post-description { - margin-bottom: 2rem; - font-size: 0.875rem; - a { - color: inherit; - &:hover { - color: var(--global-theme-color); - text-decoration: none; - } - } -} - - -// Navbar customization - -.navbar { - box-shadow: none; - border-bottom: 1px solid var(--global-divider-color); - background-color: var(--global-bg-color); - opacity: 0.95; -} -.navbar .dropdown-menu { - background-color: var(--global-bg-color); - border: 1px solid var(--global-divider-color); - a:not(.active) { - color: var(--global-text-color); - } - a:hover { - color: var(--global-hover-color); - } - .dropdown-divider { - border-top: 1px solid var(--global-divider-color) !important; - } -} -.dropdown-item { - color: var(--global-text-color); - &:hover { - color: var(--global-hover-color); - background-color: var(--global-bg-color); - } -} -.navbar.navbar-light { - a { - &:hover { - text-decoration: none; - } - } - .navbar-brand { - color: var(--global-text-color); - } - .navbar-nav .nav-item .nav-link { - color: var(--global-text-color); - &:hover { - color: var(--global-hover-color); - } - } - .navbar-nav .nav-item.active>.nav-link { - background-color: inherit; - font-weight: bolder; - color: var(--global-theme-color); - &:hover { - color: var(--global-hover-color); - } - } - .navbar-brand.social { - padding-bottom: 0; - padding-top: 0; - font-size: 1.7rem; - a { - i::before { - color: var(--global-text-color); - transition-property: all 0.2s ease-in-out; - } - &:hover { - i::before { - color: var(--global-theme-color); - } - } - } 
- } -} - -.navbar-toggler { - .icon-bar { - display: block; - width: 22px; - height: 2px; - background-color: var(--global-text-color); - border-radius: 1px; - margin-bottom: 4px; - transition: all 0.2s; - } - .top-bar { - transform: rotate(45deg); - transform-origin: 10% 10%; - } - .middle-bar { - opacity: 0; - } - .bottom-bar { - transform: rotate(-45deg); - transform-origin: 10% 90%; - } -} - -.navbar-toggler.collapsed { - .top-bar { - transform: rotate(0); - } - .middle-bar { - opacity: 1; - } - .bottom-bar { - transform: rotate(0); - } -} - -#light-toggle { - padding: 0; - border: 0; - background-color: inherit; - color: var(--global-text-color); - &:hover { - color: var(--global-hover-color); - } -} - -// Social (bottom) - -.social { - text-align: center; - .contact-icons { - font-size: 4rem; - a { - i::before { - color: var(--global-text-color); - transition-property: all 0.2s ease-in-out; - } - &:hover { - i::before { - color: var(--global-theme-color); - } - } - } - } - .contact-note { - font-size: 0.8rem; - } -} - - -// Footer -footer.fixed-bottom { - background-color: var(--global-footer-bg-color); - font-size: 0.75rem; - .container { - color: var(--global-footer-text-color); - padding-top: 9px; - padding-bottom: 8px; - } - a { - color: var(--global-footer-link-color); - &:hover { - color: var(--global-theme-color); - text-decoration: none; - } - } -} - -footer.sticky-bottom { - border-top: 1px solid var(--global-divider-color); - padding-top: 40px; - padding-bottom: 40px; - font-size: 0.9rem; -} - -// CV - -.cv { - margin-bottom: 40px; - - .card { - background-color: var(--global-card-bg-color); - border: 1px solid var(--global-divider-color); - - .list-group-item { - background-color: inherit; - - .badge { - color: var(--global-card-bg-color) !important; - background-color: var(--global-theme-color) !important; - } - } - } -} - -// Repositories - -@media (min-width: 768px) { - .repo { - max-width: 50%; - } -} - -// Blog - -.header-bar { - border-bottom: 1px solid var(--global-divider-color); - text-align: center; - padding-top: 2rem; - padding-bottom: 3rem; - h1 { - color: var(--global-theme-color); - font-size: 5rem; - } -} - -.tag-list { - border-bottom: 1px solid var(--global-divider-color); - text-align: center; - padding-top: 1rem; - - ul { - justify-content: center; - display: flow-root; - - p, li { - list-style: none; - display: inline-block; - padding: 1rem 0.5rem; - color: var(--global-text-color-light); - } - } -} - -.post-list { - margin: 0; - margin-bottom: 40px; - padding: 0; - li { - border-bottom: 1px solid var(--global-divider-color); - list-style: none; - padding-top: 2rem; - padding-bottom: 2rem; - .post-meta { - color: var(--global-text-color-light); - font-size: 0.875rem; - margin-bottom: 0; - } - .post-tags { - color: var(--global-text-color-light); - font-size: 0.875rem; - padding-top: 0.25rem; - padding-bottom: 0; - } - a { - color: var(--global-text-color); - text-decoration: none; - &:hover { - color: var(--global-theme-color); - } - } - } -} - -.pagination { - .page-item { - .page-link { - color: var(--global-text-color); - &:hover { - color: $black-color; - } - } - &.active .page-link { - color: $white-color; - background-color: var(--global-theme-color); - &:hover { - background-color: var(--global-theme-color); - } - } - } -} - - -// Distill - -.distill { - a:hover { - border-bottom-color: var(--global-theme-color); - text-decoration: none; - } -} - - -// Projects - -.projects { - a { - text-decoration: none; - - &:hover { - .card-title { - color: 
var(--global-theme-color); - } - } - } - - .card { - img { - width: 100%; - } - } - - .card-item { - width: auto; - margin-bottom: 10px; - - .row { - display: flex; - align-items: center; - } - } - - .grid-sizer, .grid-item { - width: 250px; - margin-bottom: 10px; - } - - h2.category { - color: var(--global-divider-color); - border-bottom: 1px solid var(--global-divider-color); - padding-top: 0.5rem; - margin-top: 2rem; - margin-bottom: 1rem; - text-align: right; - } -} - - -// Publications - -.publications { - margin-top: 2rem; - h1 { - color: var(--global-theme-color); - font-size: 2rem; - text-align: center; - margin-top: 1em; - margin-bottom: 1em; - } - h2 { - margin-bottom: 1rem; - span { - font-size: 1.5rem; - } - } - h2.year { - color: var(--global-divider-color); - border-top: 1px solid var(--global-divider-color); - padding-top: 1rem; - margin-top: 2rem; - margin-bottom: -2rem; - text-align: right; - } - ol.bibliography { - list-style: none; - padding: 0; - margin-top: 0; - - li { - margin-bottom: 1rem; - .preview { - width: 100%; - min-width: 80px; - max-width: 200px; - } - .abbr { - height: 2rem; - margin-bottom: 0.5rem; - abbr { - display: inline-block; - background-color: var(--global-theme-color); - padding-left: 1rem; - padding-right: 1rem; - a { - color: white; - &:hover { - text-decoration: none; - } - } - } - .award { - color: var(--global-theme-color) !important; - border: 1px solid var(--global-theme-color); - } - } - .title { - font-weight: bolder; - } - .author { - a { - border-bottom: 1px dashed var(--global-theme-color); - &:hover { - border-bottom-style: solid; - text-decoration: none; - } - } - > em { - border-bottom: 1px solid; - font-style: normal; - } - > span.more-authors { - color: var(--global-text-color-light); - border-bottom: 1px dashed var(--global-text-color-light); - cursor: pointer; - &:hover { - color: var(--global-text-color); - border-bottom: 1px dashed var(--global-text-color); - } - } - } - .links { - a.btn { - color: var(--global-text-color); - border: 1px solid var(--global-text-color); - padding-left: 1rem; - padding-right: 1rem; - padding-top: 0.25rem; - padding-bottom: 0.25rem; - &:hover { - color: var(--global-theme-color); - border-color: var(--global-theme-color); - } - } - } - .hidden { - font-size: 0.875rem; - max-height: 0px; - overflow: hidden; - text-align: justify; - transition-property: 0.15s ease; - -moz-transition: 0.15s ease; - -ms-transition: 0.15s ease; - -o-transition: 0.15s ease; - transition: all 0.15s ease; - - p { - line-height: 1.4em; - margin: 10px; - } - pre { - font-size: 1em; - line-height: 1.4em; - padding: 10px; - } - } - .hidden.open { - max-height: 100em; - transition-property: 0.15s ease; - -moz-transition: 0.15s ease; - -ms-transition: 0.15s ease; - -o-transition: 0.15s ease; - transition: all 0.15s ease; - } - div.abstract.hidden { - border: dashed 1px var(--global-bg-color); - } - div.abstract.hidden.open { - border-color: var(--global-text-color); - } - } - } -} - -// Rouge Color Customization -figure.highlight { - margin: 0 0 1rem; -} - -pre { - color: var(--global-theme-color); - background-color: var(--global-code-bg-color); - border-radius: 6px; - padding: 6px 12px; - pre, code { - background-color: transparent; - border-radius: 0; - padding: 0; - } -} - -code { - color: var(--global-theme-color); - background-color: var(--global-code-bg-color); - border-radius: 3px; - padding: 3px 3px; -} - - -// Transitioning Themes -html.transition, -html.transition *, -html.transition *:before, -html.transition 
*:after { - transition: all 750ms !important; - transition-delay: 0 !important; -} - -// Extra Markdown style (post Customization) -.post{ - .post-meta{ - color: var(--global-text-color-light); - font-size: 0.875rem; - margin-bottom: 0; - } - .post-tags{ - color: var(--global-text-color-light); - font-size: 0.875rem; - padding-top: 0.25rem; - padding-bottom: 1rem; - a { - color: var(--global-text-color-light); - text-decoration: none; - &:hover { - color: var(--global-theme-color); - } - } - } - .post-content{ - blockquote { - border-left: 5px solid var(--global-theme-color); - padding: 8px; - } - } -} diff --git a/_sass/_distill.scss b/_sass/_distill.scss deleted file mode 100644 index d83fafd4..00000000 --- a/_sass/_distill.scss +++ /dev/null @@ -1,126 +0,0 @@ -/******************************************************************************* - * Style overrides for distill blog posts. - ******************************************************************************/ - -d-byline { - border-top-color: var(--global-divider-color) !important; -} - -d-byline h3 { - color: var(--global-text-color) !important; -} - -d-byline a, d-article d-byline a { - color: var(--global-text-color) !important; - &:hover { - color: var(--global-hover-color) !important; - } -} - -d-article { - border-top-color: var(--global-divider-color) !important; - a, p, h1, h2, h3, h4, h5, h6, li, table { - color: var(--global-text-color) !important; - } - a, h1, h2, hr, table, table th, table td { - border-bottom-color: var(--global-divider-color) !important; - } - a:hover { - border-bottom-color: var(--global-hover-color) !important; - } - b i { - display: inline; - } - - d-contents { - align-self: start; - grid-column: 1 / 4; - grid-row: auto / span 4; - justify-self: end; - margin-top: 0em; - padding-left: 2em; - padding-right: 3em; - border-right: 1px solid var(--global-divider-color); - width: calc(max(70%, 300px)); - margin-right: 0px; - margin-top: 0em; - display: grid; - grid-template-columns: - minmax(8px, 1fr) [toc] auto - minmax(8px, 1fr) [toc-line] 1px - minmax(32px, 2fr); - - nav { - grid-column: toc; - a { - border-bottom: none !important; - &:hover { - border-bottom: 1px solid var(--global-text-color) !important; - } - } - h3 { - margin-top: 0; - margin-bottom: 1em; - } - div { - display: block; - outline: none; - margin-bottom: 0.8em; - color: rgba(0, 0, 0, 0.8); - font-weight: bold; - } - ul { - padding-left: 1em; - margin-top: 0; - margin-bottom: 6px; - list-style-type: none; - li { - margin-bottom: 0.25em; - } - } - } - .figcaption { - line-height: 1.4em; - } - toc-line { - border-right: 1px solid var(--global-divider-color); - grid-column: toc-line; - } - } - - d-footnote { - scroll-margin-top: 66px; - } -} - -d-appendix { - border-top-color: var(--global-divider-color) !important; - color: var(--global-distill-app-color) !important; - h3, li, span { - color: var(--global-distill-app-color) !important; - } - a, a.footnote-backlink { - color: var(--global-distill-app-color) !important; - &:hover { - color: var(--global-hover-color) !important; - } - } -} - -@media (max-width: 1024px) { - d-article { - d-contents { - display: block; - grid-column-start: 2; - grid-column-end: -2; - padding-bottom: 0.5em; - margin-bottom: 1em; - padding-top: 0.5em; - width: 100%; - border: 1px solid var(--global-divider-color); - nav { - grid-column: none; - } - } - } -} diff --git a/_sass/_layout.scss b/_sass/_layout.scss deleted file mode 100644 index 9c10cac7..00000000 --- a/_sass/_layout.scss +++ /dev/null @@ -1,50 +0,0 
@@ -/****************************************************************************** - * Content - ******************************************************************************/ - -body { - padding-bottom: 70px; - color: var(--global-text-color); - background-color: var(--global-bg-color); - - h1, h2, h3, h4, h5, h6 { - scroll-margin-top: 66px; - } -} - -body.fixed-top-nav { - // Add some padding for the nav-bar. - padding-top: 56px; -} - -body.sticky-bottom-footer { - // Remove padding below footer. - padding-bottom: 0; -} - -.container { - max-width: $max-content-width; -} - -// Profile -.profile { - img { - width: 100%; - } -} - -// TODO: redefine content layout. - - -/****************************************************************************** - * Publications - ******************************************************************************/ - -// TODO: redefine publications layout. - - -/***************************************************************************** -* Projects -*****************************************************************************/ - -// TODO: redefine projects layout. diff --git a/_sass/_themes.scss b/_sass/_themes.scss deleted file mode 100644 index d3322c99..00000000 --- a/_sass/_themes.scss +++ /dev/null @@ -1,100 +0,0 @@ -/******************************************************************************* - * Themes - ******************************************************************************/ - -:root { - --global-bg-color: #{$white-color}; - --global-code-bg-color: #{$code-bg-color-light}; - --global-text-color: #{$black-color}; - --global-text-color-light: #{$grey-color}; - --global-theme-color: #{$cyan-color}; - --global-hover-color: #{$cyan-color}; - --global-footer-bg-color: #{$grey-color-dark}; - --global-footer-text-color: #{$grey-color-light}; - --global-footer-link-color: #{$white-color}; - --global-distill-app-color: #{$grey-color}; - --global-divider-color: rgba(0,0,0,.1); - --global-card-bg-color: #{$white-color}; - - .fa-sun { - display : none; - } - .fa-moon { - padding-left: 10px; - padding-top: 12px; - display : block; - } - - .repo-img-light { - display: block; - } - .repo-img-dark { - display: none; - } -} - -.header-background .img { - background-image: url("../img/ICLR-logo.png"); - background-repeat: no-repeat; - background-size: 400px; - background-position: center bottom; - height: 12em; - margin-bottom: 0em; - margin-top: -2.7em; -} - -html[data-theme='dark'] { - --global-bg-color: #{$grey-color-dark}; - --global-code-bg-color: #{$code-bg-color-dark}; - --global-text-color: #{$grey-color-light}; - --global-text-color-light: #{$grey-color-light}; - --global-theme-color: #{$cyan-color}; - --global-hover-color: #{$cyan-color}; - --global-footer-bg-color: #{$grey-color-light}; - --global-footer-text-color: #{$grey-color-dark}; - --global-footer-link-color: #{$black-color}; - --global-distill-app-color: #{$grey-color-light}; - --global-divider-color: #424246; - --global-card-bg-color: #{$grey-900}; - - .fa-sun { - padding-left: 10px; - padding-top: 12px; - display : block; - } - .fa-moon { - display : none; - } - - .repo-img-light { - display: none; - } - .repo-img-dark { - display: block; - } - -.header-background .img { - background-image: url("../img/ICLR-logo-dark.png"); - background-repeat: no-repeat; - background-size: 400px; - background-position: center bottom; - height: 12em; - margin-bottom: 0em; - margin-top: -2.7em; - // filter: invert(89%); -} - - - - - // .header-background .img { - // background-image: 
url("../img/score_contour.jpg"); - // background-repeat: no-repeat; - // background-size: cover; - // background-position: center bottom; - // height: 15em; - // margin-bottom: 2em; - // margin-top: -2.7em; - // filter: invert(89%); - // } -} diff --git a/_sass/_variables.scss b/_sass/_variables.scss deleted file mode 100644 index b050aa6e..00000000 --- a/_sass/_variables.scss +++ /dev/null @@ -1,38 +0,0 @@ -/******************************************************************************* - * Variables used throughout the theme. - * To adjust anything, simply edit the variables below and rebuild the theme. - ******************************************************************************/ - - -// Colors -$red-color: #FF3636 !default; -$red-color-dark: #B71C1C !default; -$orange-color: #F29105 !default; -$blue-color: #0076df !default; -$blue-color-dark: #00369f !default; -$cyan-color: #2698BA !default; -$light-cyan-color: lighten($cyan-color, 25%); -$green-color: #00ab37 !default; -$green-color-lime: #B7D12A !default; -$green-color-dark: #009f06 !default; -$green-color-light: #ddffdd !default; -$green-color-bright: #11D68B !default; -$purple-color: #B509AC !default; -$light-purple-color: lighten($purple-color, 25%); -$pink-color: #f92080 !default; -$pink-color-light: #ffdddd !default; -$yellow-color: #efcc00 !default; - -$grey-color: #828282 !default; -$grey-color-light: lighten($grey-color, 40%); -$grey-color-dark: #1C1C1D; -$grey-900: #212529; - -$white-color: #ffffff !default; -$black-color: #000000 !default; - - -// Theme colors - -$code-bg-color-light: rgba($purple-color, 0.05); -$code-bg-color-dark: #2c3237 !default; diff --git a/about/index.html b/about/index.html new file mode 100644 index 00000000..89cd0f4b --- /dev/null +++ b/about/index.html @@ -0,0 +1 @@ + about | ICLR Blogposts 2024

Announcements:

  • The 22 blog posts for 2024 are now published! Check our press release for an overview, or dive directly in on the Blog page
  • More information regarding the poster session will be available soon.

ICLR 2024 Blogposts Track

The Machine Learning community is currently experiencing a reproducibility crisis and a reviewing crisis [Littman, 2021]. Because of the highly competitive and noisy reviewing process of ML conferences [Tran et al., 2020], researchers have an incentive to oversell their results, slowing down progress and diminishing the integrity of the scientific community. Moreover, with the growing number of papers published and submitted to the main ML conferences [Lin et al., 2020], it has become more challenging to keep track of the latest advances in the field.

Blog posts are becoming an increasingly popular and useful way to talk about science [Brown and Woolston, 2018]. They offer substantial value to the scientific community by providing a flexible platform to foster open, human, and transparent discussions about new insights or limitations of a scientific publication. However, because they are not as recognized as standard scientific publications, only a minority of researchers manage to maintain an active blog and get visibility for their efforts. Many are well-established researchers (Francis Bach, Ben Recht, Ferenc Huszár, Lilian Weng) or big corporations that leverage entire teams of graphic designers and writers to polish their blogs (Facebook AI, Google AI, DeepMind, OpenAI). As a result, the incentives for writing scientific blog posts are largely personal; it is unreasonable to expect a significant portion of the machine learning community to contribute to such an initiative when everyone is trying to establish themselves through publications.

Submit your blogpost on OpenReview

A Blog Post Conference Track

Last year, we ran the second iteration of the Blogpost track at ICLR 2023!

It was very successful, with accepted posts presented in person at the main conference.

Our goal is to create a formal call for blog posts at ICLR to incentivize and reward researchers for reviewing past work, summarizing its outcomes, developing new intuitions, or highlighting shortcomings. A very influential initiative of this kind happened after the Second World War in France. Because of the lack of up-to-date textbooks, a collective of mathematicians writing under the pseudonym Nicolas Bourbaki [Halmos, 1957] decided to start a series of textbooks about the foundations of mathematics [Bourbaki, 1939]. In the same vein, we aim to provide a new way to summarize scientific knowledge in the ML community.

Due to the large diversity of topics that can be discussed in a blog post, we decided to restrict the range of topics for this call. We identified that the posts bringing the most value to the community and the conference are those that distill and discuss previously published papers.

Spotlight

The N Implementation Details of RLHF with PPO
     Shengyi Costa Huang, Tianlin Liu, Leandro von Werra
How to compute Hessian-vector products?
     Mathieu Dagréou, Pierre Ablin, Samuel Vaiter, Thomas Moreau
Bridging the Data Processing Inequality and Function-Space Variational Inference
     Andreas Kirsch

Accepted Posts

Understanding in-context learning in transformers
     Simone Rossi, Rui Yuan, Thomas Hannagan
Behavioral Differences in Mode-Switching Exploration for Reinforcement Learning
     Loren J Anderson
Fairness in AI: two philosophies or just one?
     MaryBeth Defrance
Towards Robust Foundation Models: Adversarial Contrastive Learning
     Jingfeng Zhang, Xilie Xu
A New Alchemy: Language Model Development as a Subfield?
     Colin Raffel
Understanding gradient inversion attacks from the prior knowledge perspective
     Yanbo Wang, Jian Liang, Ran He
Building Diffusion Model’s theory from ground up
     Ayan Das
Masked Language Model with ALiBi and CLAP head
     Jason Chuan-Chih Chou
What exactly has TabPFN learned to do?
     Calvin McCarter
Elaborating on the Value of Flow Matching for Density Estimation
     Maternus Herold, Faried Abu Zaid
The Hidden Convex Optimization Landscape of Two-Layer ReLU Networks
     Victor Mercklé, Franck Iutzeler, Ievgen Redko
Deep Equilibrium Models For Algorithmic Reasoning
     Sophie Xhonneux, Yu He, Andreea Deac, Jian Tang, Gauthier Gidel
Fair Model-Based Reinforcement Learning Comparisons with Explicit and Consistent Update Frequency
     Albert Thomas, Abdelhakim Benechehab, Giuseppe Paolo, Balázs Kégl
Exploring Meta-learned Curiosity Algorithms
     Batsirayi Mupamhi Ziki
Unraveling The Impact of Training Samples
     Daiwei Chen, Jane Zhang, Ramya Korlakai Vinayak
RLHF without RL - Direct Preference Optimization
     Michael Panchenko
It’s Time to Move On: Primacy Bias and Why It Helps to Forget
     Matthew Kielo, Vladimir Lukin
Double Descent Demystified
     Rylan Schaeffer, Zachary Robertson, Akhilan Boopathy, Mikail Khona, Kateryna Pistunova, Jason W. Rocks, Ila R. Fiete, Andrey Gromov, Sanmi Koyejo
On Bayesian Model Selection: The Marginal Likelihood, Cross-Validation, and Conditional Log Marginal Likelihood
     Andreas Kirsch

Key Dates

Abstract deadline: December 11th, 2023, 00:00 GMT (submit to OpenReview - to be announced soon).

Submission deadline: December 17th, 2023, 00:00 GMT (any modifications to your blog post, via a pull request on GitHub).

Decision notification: January 30th, 2024 (UPDATED: February 15th, 2024)

Camera-ready merge: March 15th, 2024

A call for blog posts discussing work previously published at ICLR

Content

Write a post on a subject that has been published at a top-tier venue (ICLR, ICML, NeurIPS, AAAI, UAI, CVPR, SIGGRAPH, ECCV, ICCV, etc.) relatively recently.

Conflict of interest

The authors of the blog posts will have to declare their conflicts of interest (positive or negative) with the paper (and the paper’s authors) they write about. Conflicts of interest include:

  • Recent collaborators (less than 3 years)
  • Current institution
  • Blog posts must not be used to highlight or advertise past publications of the authors or their lab.

We will only ask the authors to report if they have a conflict of interest. If so, reviewers will be asked to judge if the submission is sufficiently critical and objective of the papers addressed in the blog post.

Publication

Blog post

The posts will be created and published under a unified template; see the submission instructions and the sample post hosted on the blog of this website.

Poster

Additionally, accepted posts will have the option to present their work as a poster during the main poster session. For more information about the main poster session (time, poster format, etc.) please refer to the ICLR homepage.

Submissions

Our goal is to avoid heavily engineered, professionally made blog posts (such as the "100+ hours" mentioned as a standard by the Distill guidelines) and to encourage ideas and clear writing rather than dynamic visualizations or embedded JavaScript engines. Please check our submission instructions for more details. We accept submissions in both Markdown and HTML; we believe this is a good trade-off between complexity and flexibility.

Submit your blogpost on OpenReview

Contact

For any technical issues with the blog post repository (for example, blog posts not displaying correctly or issues while following the submission instructions), please open an issue in our GitHub repository.

For other inquiries, reach us via email at: blog.track.chairs@gmail.com

Organizers


References

Michael L Littman. Collusion rings threaten the integrity of computer science research. Communications of the ACM, 2021.

David Tran, Alex Valtchanov, Keshav Ganapathy, Raymond Feng, Eric Slud, Micah Goldblum, and Tom Goldstein. An open review of OpenReview: A critical analysis of the machine learning conference review process. arXiv, 2020.

Hsuan-Tien Lin, Maria-Florina Balcan, Raia Hadsell, and Marc’Aurelio Ranzato. What we learned from NeurIPS 2020 reviewing process. Medium https://medium.com/@NeurIPSConf/what-we-learned-from-neurips-2020-reviewing-process-e24549eea38f, 2020.

Eryn Brown and Chris Woolston. Why science blogging still matters. Nature, 2018.

Paul R Halmos. Nicolas Bourbaki. Scientific American, 1957.

Nicolas Bourbaki. Elements of mathematics. Éditions Hermann, 1939.

\ No newline at end of file diff --git a/assets/css/main.css b/assets/css/main.css new file mode 100644 index 00000000..b28183e0 --- /dev/null +++ b/assets/css/main.css @@ -0,0 +1,3 @@ +:root{--global-bg-color:#fff;--global-code-bg-color:rgba(181,9,172,0.05);--global-text-color:#000;--global-text-color-light:#828282;--global-theme-color:#2698ba;--global-hover-color:#2698ba;--global-footer-bg-color:#1c1c1d;--global-footer-text-color:#e8e8e8;--global-footer-link-color:#fff;--global-distill-app-color:#828282;--global-divider-color:rgba(0,0,0,.1);--global-card-bg-color:#fff}:root .fa-sun{display:none}:root .fa-moon{padding-left:10px;padding-top:12px;display:block}:root .repo-img-light{display:block}:root .repo-img-dark{display:none}.header-background .img{background-image:url("../img/ICLR-logo.png");background-repeat:no-repeat;background-size:400px;background-position:center bottom;height:12em;margin-bottom:0;margin-top:-2.7em}html[data-theme=dark]{--global-bg-color:#1c1c1d;--global-code-bg-color:#2c3237;--global-text-color:#e8e8e8;--global-text-color-light:#e8e8e8;--global-theme-color:#2698ba;--global-hover-color:#2698ba;--global-footer-bg-color:#e8e8e8;--global-footer-text-color:#1c1c1d;--global-footer-link-color:#000;--global-distill-app-color:#e8e8e8;--global-divider-color:#424246;--global-card-bg-color:#212529}html[data-theme=dark] .fa-sun{padding-left:10px;padding-top:12px;display:block}html[data-theme=dark] .fa-moon{display:none}html[data-theme=dark] .repo-img-light{display:none}html[data-theme=dark] .repo-img-dark{display:block}html[data-theme=dark] .header-background .img{background-image:url("../img/ICLR-logo-dark.png");background-repeat:no-repeat;background-size:400px;background-position:center bottom;height:12em;margin-bottom:0;margin-top:-2.7em}body{padding-bottom:70px;color:var(--global-text-color);background-color:var(--global-bg-color)}body h1,body h2,body h3,body h4,body h5,body h6{scroll-margin-top:66px}body.fixed-top-nav{padding-top:56px}body.sticky-bottom-footer{padding-bottom:0}.container{max-width:1000px}.profile img{width:100%}p,h1,h2,h3,h4,h5,h6,em,div,li,span,strong{color:var(--global-text-color)}hr{border-top:1px solid var(--global-divider-color)}table td,table th{color:var(--global-text-color)}table td{font-size:1rem}a,table.table a{color:var(--global-theme-color)}a:hover,table.table a:hover{color:var(--global-theme-color);text-decoration:underline}a:hover:after :not(.nav-item.dropdown),table.table a:hover:after :not(.nav-item.dropdown){width:100%}figure,img{max-width:90vw}blockquote{background:var(--global-bg-color);border-left:2px solid var(--global-theme-color);margin:1.5em 10px;padding:.5em 10px;font-size:1.1rem}.equation{margin-bottom:1rem;text-align:center}.caption{font-size:.875rem;margin-top:.75rem;margin-bottom:1.5rem;text-align:center}.card{background-color:var(--global-card-bg-color)}.card img{width:100%}.card .card-title{color:var(--global-text-color)}.card .card-item{width:auto;margin-bottom:10px}.card .card-item .row{display:flex;align-items:center}.citation,.citation-number{color:var(--global-theme-color)}.profile{width:100%}.profile .address{margin-bottom:5px;margin-top:5px;font-family:monospace}.profile .address p{display:inline-block;margin:0}.profile.float-right{margin-left:1rem}.profile.float-left{margin-right:1rem}@media(min-width:576px){.profile{width:30%}.profile .address p{display:block}}.post-description{margin-bottom:2rem;font-size:.875rem}.post-description a{color:inherit}.post-description 
a:hover{color:var(--global-theme-color);text-decoration:none}.navbar{box-shadow:none;border-bottom:1px solid var(--global-divider-color);background-color:var(--global-bg-color);opacity:.95}.navbar .dropdown-menu{background-color:var(--global-bg-color);border:1px solid var(--global-divider-color)}.navbar .dropdown-menu a:not(.active){color:var(--global-text-color)}.navbar .dropdown-menu a:hover{color:var(--global-hover-color)}.navbar .dropdown-menu .dropdown-divider{border-top:1px solid var(--global-divider-color)!important}.dropdown-item{color:var(--global-text-color)}.dropdown-item:hover{color:var(--global-hover-color);background-color:var(--global-bg-color)}.navbar.navbar-light a:hover{text-decoration:none}.navbar.navbar-light .navbar-brand{color:var(--global-text-color)}.navbar.navbar-light .navbar-nav .nav-item .nav-link{color:var(--global-text-color)}.navbar.navbar-light .navbar-nav .nav-item .nav-link:hover{color:var(--global-hover-color)}.navbar.navbar-light .navbar-nav .nav-item.active>.nav-link{background-color:inherit;font-weight:bolder;color:var(--global-theme-color)}.navbar.navbar-light .navbar-nav .nav-item.active>.nav-link:hover{color:var(--global-hover-color)}.navbar.navbar-light .navbar-brand.social{padding-bottom:0;padding-top:0;font-size:1.7rem}.navbar.navbar-light .navbar-brand.social a i::before{color:var(--global-text-color);transition-property:all .2s ease-in-out}.navbar.navbar-light .navbar-brand.social a:hover i::before{color:var(--global-theme-color)}.navbar-toggler .icon-bar{display:block;width:22px;height:2px;background-color:var(--global-text-color);border-radius:1px;margin-bottom:4px;transition:all .2s}.navbar-toggler .top-bar{transform:rotate(45deg);transform-origin:10% 10%} +.navbar-toggler .middle-bar{opacity:0}.navbar-toggler .bottom-bar{transform:rotate(-45deg);transform-origin:10% 90%}.navbar-toggler.collapsed .top-bar{transform:rotate(0)}.navbar-toggler.collapsed .middle-bar{opacity:1}.navbar-toggler.collapsed .bottom-bar{transform:rotate(0)}#light-toggle{padding:0;border:0;background-color:inherit;color:var(--global-text-color)}#light-toggle:hover{color:var(--global-hover-color)}.social{text-align:center}.social .contact-icons{font-size:4rem}.social .contact-icons a i::before{color:var(--global-text-color);transition-property:all .2s ease-in-out}.social .contact-icons a:hover i::before{color:var(--global-theme-color)}.social .contact-note{font-size:.8rem}footer.fixed-bottom{background-color:var(--global-footer-bg-color);font-size:.75rem}footer.fixed-bottom .container{color:var(--global-footer-text-color);padding-top:9px;padding-bottom:8px}footer.fixed-bottom a{color:var(--global-footer-link-color)}footer.fixed-bottom a:hover{color:var(--global-theme-color);text-decoration:none}footer.sticky-bottom{border-top:1px solid var(--global-divider-color);padding-top:40px;padding-bottom:40px;font-size:.9rem}.cv{margin-bottom:40px}.cv .card{background-color:var(--global-card-bg-color);border:1px solid var(--global-divider-color)}.cv .card .list-group-item{background-color:inherit}.cv .card .list-group-item .badge{color:var(--global-card-bg-color)!important;background-color:var(--global-theme-color)!important}@media(min-width:768px){.repo{max-width:50%}}.header-bar{border-bottom:1px solid var(--global-divider-color);text-align:center;padding-top:2rem;padding-bottom:3rem}.header-bar h1{color:var(--global-theme-color);font-size:5rem}.tag-list{border-bottom:1px solid var(--global-divider-color);text-align:center;padding-top:1rem}.tag-list 
ul{justify-content:center;display:flow-root}.tag-list ul p,.tag-list ul li{list-style:none;display:inline-block;padding:1rem .5rem;color:var(--global-text-color-light)}.post-list{margin:0;margin-bottom:40px;padding:0}.post-list li{border-bottom:1px solid var(--global-divider-color);list-style:none;padding-top:2rem;padding-bottom:2rem}.post-list li .post-meta{color:var(--global-text-color-light);font-size:.875rem;margin-bottom:0}.post-list li .post-tags{color:var(--global-text-color-light);font-size:.875rem;padding-top:.25rem;padding-bottom:0}.post-list li a{color:var(--global-text-color);text-decoration:none}.post-list li a:hover{color:var(--global-theme-color)}.pagination .page-item .page-link{color:var(--global-text-color)}.pagination .page-item .page-link:hover{color:#000}.pagination .page-item.active .page-link{color:#fff;background-color:var(--global-theme-color)}.pagination .page-item.active .page-link:hover{background-color:var(--global-theme-color)}.distill a:hover{border-bottom-color:var(--global-theme-color);text-decoration:none}.projects a{text-decoration:none}.projects a:hover .card-title{color:var(--global-theme-color)}.projects .card img{width:100%}.projects .card-item{width:auto;margin-bottom:10px}.projects .card-item .row{display:flex;align-items:center}.projects .grid-sizer,.projects .grid-item{width:250px;margin-bottom:10px}.projects h2.category{color:var(--global-divider-color);border-bottom:1px solid var(--global-divider-color);padding-top:.5rem;margin-top:2rem;margin-bottom:1rem;text-align:right}.publications{margin-top:2rem}.publications h1{color:var(--global-theme-color);font-size:2rem;text-align:center;margin-top:1em;margin-bottom:1em}.publications h2{margin-bottom:1rem}.publications h2 span{font-size:1.5rem}.publications h2.year{color:var(--global-divider-color);border-top:1px solid var(--global-divider-color);padding-top:1rem;margin-top:2rem;margin-bottom:-2rem;text-align:right}.publications ol.bibliography{list-style:none;padding:0;margin-top:0}.publications ol.bibliography li{margin-bottom:1rem}.publications ol.bibliography li .preview{width:100%;min-width:80px;max-width:200px}.publications ol.bibliography li .abbr{height:2rem;margin-bottom:.5rem}.publications ol.bibliography li .abbr abbr{display:inline-block;background-color:var(--global-theme-color);padding-left:1rem;padding-right:1rem}.publications ol.bibliography li .abbr abbr a{color:white}.publications ol.bibliography li .abbr abbr a:hover{text-decoration:none}.publications ol.bibliography li .abbr .award{color:var(--global-theme-color)!important;border:1px solid var(--global-theme-color)}.publications ol.bibliography li .title{font-weight:bolder}.publications ol.bibliography li .author a{border-bottom:1px dashed var(--global-theme-color)}.publications ol.bibliography li .author a:hover{border-bottom-style:solid;text-decoration:none}.publications ol.bibliography li .author>em{border-bottom:1px solid;font-style:normal}.publications ol.bibliography li .author>span.more-authors{color:var(--global-text-color-light);border-bottom:1px dashed var(--global-text-color-light);cursor:pointer}.publications ol.bibliography li .author>span.more-authors:hover{color:var(--global-text-color);border-bottom:1px dashed var(--global-text-color)} +.publications ol.bibliography li .links a.btn{color:var(--global-text-color);border:1px solid var(--global-text-color);padding-left:1rem;padding-right:1rem;padding-top:.25rem;padding-bottom:.25rem}.publications ol.bibliography li .links 
a.btn:hover{color:var(--global-theme-color);border-color:var(--global-theme-color)}.publications ol.bibliography li .hidden{font-size:.875rem;max-height:0;overflow:hidden;text-align:justify;transition-property:.15s ease;-moz-transition:.15s ease;-ms-transition:.15s ease;-o-transition:.15s ease;transition:all .15s ease}.publications ol.bibliography li .hidden p{line-height:1.4em;margin:10px}.publications ol.bibliography li .hidden pre{font-size:1em;line-height:1.4em;padding:10px}.publications ol.bibliography li .hidden.open{max-height:100em;transition-property:.15s ease;-moz-transition:.15s ease;-ms-transition:.15s ease;-o-transition:.15s ease;transition:all .15s ease}.publications ol.bibliography li div.abstract.hidden{border:dashed 1px var(--global-bg-color)}.publications ol.bibliography li div.abstract.hidden.open{border-color:var(--global-text-color)}figure.highlight{margin:0 0 1rem}pre{color:var(--global-theme-color);background-color:var(--global-code-bg-color);border-radius:6px;padding:6px 12px}pre pre,pre code{background-color:transparent;border-radius:0;padding:0}code{color:var(--global-theme-color);background-color:var(--global-code-bg-color);border-radius:3px;padding:3px 3px}html.transition,html.transition *,html.transition *:before,html.transition *:after{transition:all 750ms!important;transition-delay:0!important}.post .post-meta{color:var(--global-text-color-light);font-size:.875rem;margin-bottom:0}.post .post-tags{color:var(--global-text-color-light);font-size:.875rem;padding-top:.25rem;padding-bottom:1rem}.post .post-tags a{color:var(--global-text-color-light);text-decoration:none}.post .post-tags a:hover{color:var(--global-theme-color)}.post .post-content blockquote{border-left:5px solid var(--global-theme-color);padding:8px}d-byline{border-top-color:var(--global-divider-color)!important}d-byline h3{color:var(--global-text-color)!important}d-byline a,d-article d-byline a{color:var(--global-text-color)!important}d-byline a:hover,d-article d-byline a:hover{color:var(--global-hover-color)!important}d-article{border-top-color:var(--global-divider-color)!important}d-article a,d-article p,d-article h1,d-article h2,d-article h3,d-article h4,d-article h5,d-article h6,d-article li,d-article table{color:var(--global-text-color)!important}d-article a,d-article h1,d-article h2,d-article hr,d-article table,d-article table th,d-article table td{border-bottom-color:var(--global-divider-color)!important}d-article a:hover{border-bottom-color:var(--global-hover-color)!important}d-article b i{display:inline}d-article d-contents{align-self:start;grid-column:1/4;grid-row:auto/span 4;justify-self:end;margin-top:0;padding-left:2em;padding-right:3em;border-right:1px solid var(--global-divider-color);width:max(70%,300px);margin-right:0;margin-top:0;display:grid;grid-template-columns:minmax(8px,1fr) [toc] auto minmax(8px,1fr) [toc-line] 1px minmax(32px,2fr)}d-article d-contents nav{grid-column:toc}d-article d-contents nav a{border-bottom:none!important}d-article d-contents nav a:hover{border-bottom:1px solid var(--global-text-color)!important}d-article d-contents nav h3{margin-top:0;margin-bottom:1em}d-article d-contents nav div{display:block;outline:0;margin-bottom:.8em;color:rgba(0,0,0,0.8);font-weight:bold}d-article d-contents nav ul{padding-left:1em;margin-top:0;margin-bottom:6px;list-style-type:none}d-article d-contents nav ul li{margin-bottom:.25em}d-article d-contents .figcaption{line-height:1.4em}d-article d-contents toc-line{border-right:1px solid 
var(--global-divider-color);grid-column:toc-line}d-article d-footnote{scroll-margin-top:66px}d-appendix{border-top-color:var(--global-divider-color)!important;color:var(--global-distill-app-color)!important}d-appendix h3,d-appendix li,d-appendix span{color:var(--global-distill-app-color)!important}d-appendix a,d-appendix a.footnote-backlink{color:var(--global-distill-app-color)!important}d-appendix a:hover,d-appendix a.footnote-backlink:hover{color:var(--global-hover-color)!important}@media(max-width:1024px){d-article d-contents{display:block;grid-column-start:2;grid-column-end:-2;padding-bottom:.5em;margin-bottom:1em;padding-top:.5em;width:100%;border:1px solid var(--global-divider-color)}d-article d-contents nav{grid-column:none}} \ No newline at end of file diff --git a/assets/css/main.css.map b/assets/css/main.css.map new file mode 100644 index 00000000..db608df8 --- /dev/null +++ b/assets/css/main.css.map @@ -0,0 +1 @@ +{"version":3,"sourceRoot":"","sources":["../../_sass/_variables.scss","../../_sass/_themes.scss","../../_sass/_layout.scss","main.scss","../../_sass/_base.scss","../../_sass/_distill.scss"],"names":[],"mappings":"AAAA;AAAA;AAAA;AAAA;ACAA;AAAA;AAAA;AAIA;EACE;EACA;EACA;EACA;EACA;EACA;EACA;EACA;EACA;EACA;EACA;EACA;;AAEA;EACE;;AAEF;EACE;EACA;EACA;;AAGF;EACE;;AAEF;EACE;;;AAIJ;EACE;EACA;EACA;EACA;EACA;EACA;EACA;;;AAGF;EACE;EACA;EACA;EACA;EACA;EACA;EACA;EACA;EACA;EACA;EACA;EACA;;AAEA;EACE;EACA;EACA;;AAEF;EACE;;AAGF;EACE;;AAEF;EACE;;AAGJ;EACE;EACA;EACA;EACA;EACA;EACA;EACA;;;AClFF;AAAA;AAAA;AAIA;EACE;EACA;EACA;;AAEA;EACE;;;AAIJ;EAEE;;;AAGF;EAEE;;;AAGF;EACE,WCtBkB;;;AD2BlB;EACE;;;AAOJ;AAAA;AAAA;AAOA;AAAA;AAAA;AE7CA;AAAA;AAAA;AAMA;EACE;;;AAGF;EACE;;;AAIA;EACE;;AAEF;EACE;;;AAIJ;EACE;;AACA;EACE;EACA;;AAEF;EACE;;;AAIJ;EACE;;;AAGF;EACE;EACA;EACA;EACA;EACA;;;AAKF;EACE;EACA;;;AAKF;EACE;EACA;EACA;EACA;;;AAKF;EACE;;AAEA;EACE;;AAGF;EACE;;AAGF;EACE;EACA;;AAEA;EACE;EACA;;;AAON;EACE;;;AAKF;EACE;;AAEA;EACE;EACA;EACA;;AACA;EACE;EACA;;;AAIN;EACE;;;AAEF;EACE;;;AAGF;EACE;IACE;;EAEE;IAAI;;;AAKV;EACE;EACA;;AACA;EACE;;AACA;EACE;EACA;;;AAQN;EACE;EACA;EACA;EACA;;;AAEF;EACE;EACA;;AACA;EACE;;AAEF;EACE;;AAEF;EACE;;;AAGJ;EACE;;AACE;EACE;EACA;;;AAKF;EACE;;AAGJ;EACE;;AAEF;EACE;;AACA;EACE;;AAGJ;EACI;EACA;EACA;;AACA;EACE;;AAGN;EACE;EACA;EACA;;AAEE;EACE;EACA;;AAGA;EACE;;;AAQR;EACE;EACA;EACA;EACA;EACA;EACA;EACA;;AAEF;EACE;EACA;;AAEF;EACE;;AAEF;EACE;EACA;;;AAKF;EACE;;AAEF;EACE;;AAEF;EACE;;;AAIJ;EACE;EACA;EACA;EACA;;AACA;EACE;;;AAMJ;EACE;;AACA;EACE;;AAEE;EACE;EACA;;AAGA;EACE;;AAKR;EACE;;;AAMJ;EACE;EACA;;AACA;EACE;EACA;EACA;;AAEF;EACE;;AACA;EACE;EACA;;;AAKN;EACE;EACA;EACA;EACA;;;AAKF;EACE;;AAEA;EACE;EACA;;AAEA;EACE;;AAEA;EACE;EACA;;;AAQR;EACE;IACE;;;AAMJ;EACE;EACA;EACA;EACA;;AACA;EACE;EACA;;;AAIJ;EACE;EACA;EACA;;AAEA;EACE;EACA;;AAEA;EACE;EACA;EACA;EACA;;;AAKN;EACE;EACA;EACA;;AACA;EACE;EACA;EACA;EACA;;AACA;EACE;EACA;EACA;;AAEF;EACE;EACA;EACA;EACA;;AAEF;EACE;EACA;;AACA;EACE;;;AAQJ;EACE;;AACA;EACE,OJ1WM;;AI6WV;EACE,OJ/WQ;EIgXR;;AACA;EACE;;;AAUN;EACE;EACA;;;AAQF;EACE;;AAGE;EACE;;AAMJ;EACE;;AAIJ;EACE;EACA;;AAEA;EACE;EACA;;AAIJ;EACE;EACA;;AAGF;EACE;EACA;EACA;EACA;EACA;EACA;;;AAOJ;EACE;;AACA;EACE;EACA;EACA;EACA;EACA;;AAEF;EACE;;AACA;EACE;;AAGJ;EACE;EACA;EACA;EACA;EACA;EACA;;AAEF;EACE;EACA;EACA;;AAEA;EACE;;AACA;EACE;EACA;EACA;;AAEF;EACE;EACA;;AACA;EACE;EACA;EACA;EACA;;AACA;EACE;;AACA;EACE;;AAIN;EACE;EACA;;AAGJ;EACE;;AAGA;EACE;;AACA;EACI;EACA;;AAGN;EACE;EACA;;AAEF;EACE;EACA;EACA;;AACA;EACI;EACA;;AAKN;EACE;EACA;EACA;EACA;EACA;EACA;;AACA;EACE;EACA;;AAIN;EACE;EACA;EACA;EACA;EACA;EACA;EACA;EACA;EACA;;AAEA;EACE;EACA;;AAEF;EACE;E
ACA;EACA;;AAGJ;EACE;EACA;EACA;EACA;EACA;EACA;;AAEF;EACE;;AAEF;EACE;;;AAOR;EACE;;;AAGF;EACE;EACA;EACA;EACA;;AACA;EACE;EACA;EACA;;;AAIJ;EACE;EACA;EACA;EACA;;;AAKF;AAAA;AAAA;AAAA;EAIE;EACA;;;AAKA;EACE;EACA;EACA;;AAEF;EACE;EACA;EACA;EACA;;AACA;EACE;EACA;;AACA;EACE;;AAKJ;EACE;EACA;;;AC9oBN;AAAA;AAAA;AAIA;EACE;;;AAGF;EACE;;;AAGF;EACE;;AACA;EACE;;;AAIJ;EACE;;AACA;EACE;;AAEF;EACE;;AAEF;EACE;;AAEF;EACE;;AAGF;EACE;EACA;EACA;EACA;EACA;EACA;EACA;EACA;EACA;EACA;EACA;EACA;EACA,uBACE;;AAIF;EACE;;AACA;EACE;;AACA;EACE;;AAGJ;EACE;EACA;;AAEF;EACE;EACA;EACA;EACA;EACA;;AAEF;EACE;EACA;EACA;EACA;;AACA;EACE;;AAIN;EACE;;AAEF;EACE;EACA;;AAIJ;EACE;;;AAIJ;EACE;EACA;;AACA;EACE;;AAEF;EACE;;AACA;EACE;;;AAKN;EAEI;IACE;IACA;IACA;IACA;IACA;IACA;IACA;IACA;;EACA;IACE","sourcesContent":["/*******************************************************************************\n * Variables used throughout the theme.\n * To adjust anything, simply edit the variables below and rebuild the theme.\n ******************************************************************************/\n\n\n// Colors\n$red-color: #FF3636 !default;\n$red-color-dark: #B71C1C !default;\n$orange-color: #F29105 !default;\n$blue-color: #0076df !default;\n$blue-color-dark: #00369f !default;\n$cyan-color: #2698BA !default;\n$light-cyan-color: lighten($cyan-color, 25%);\n$green-color: #00ab37 !default;\n$green-color-lime: #B7D12A !default;\n$green-color-dark: #009f06 !default;\n$green-color-light: #ddffdd !default;\n$green-color-bright: #11D68B !default;\n$purple-color: #B509AC !default;\n$light-purple-color: lighten($purple-color, 25%);\n$pink-color: #f92080 !default;\n$pink-color-light: #ffdddd !default;\n$yellow-color: #efcc00 !default;\n\n$grey-color: #828282 !default;\n$grey-color-light: lighten($grey-color, 40%);\n$grey-color-dark: #1C1C1D;\n$grey-900: #212529;\n\n$white-color: #ffffff !default;\n$black-color: #000000 !default;\n\n\n// Theme colors\n\n$code-bg-color-light: rgba($purple-color, 0.05);\n$code-bg-color-dark: #2c3237 !default;\n","/*******************************************************************************\r\n * Themes\r\n ******************************************************************************/\r\n \r\n:root {\r\n --global-bg-color: #{$white-color};\r\n --global-code-bg-color: #{$code-bg-color-light};\r\n --global-text-color: #{$black-color};\r\n --global-text-color-light: #{$grey-color};\r\n --global-theme-color: #{$cyan-color};\r\n --global-hover-color: #{$cyan-color};\r\n --global-footer-bg-color: #{$grey-color-dark};\r\n --global-footer-text-color: #{$grey-color-light};\r\n --global-footer-link-color: #{$white-color};\r\n --global-distill-app-color: #{$grey-color};\r\n --global-divider-color: rgba(0,0,0,.1);\r\n --global-card-bg-color: #{$white-color};\r\n\r\n .fa-sun {\r\n display : none;\r\n }\r\n .fa-moon {\r\n padding-left: 10px;\r\n padding-top: 12px;\r\n display : block;\r\n }\r\n\r\n .repo-img-light {\r\n display: block;\r\n }\r\n .repo-img-dark {\r\n display: none;\r\n }\r\n}\r\n\r\n.header-background .img {\r\n background-image: url(\"../img/ICLR-logo.png\");\r\n background-repeat: no-repeat;\r\n background-size: 400px;\r\n background-position: center bottom;\r\n height: 12em;\r\n margin-bottom: 0em;\r\n margin-top: -2.7em; \r\n}\r\n\r\nhtml[data-theme='dark'] {\r\n --global-bg-color: #{$grey-color-dark};\r\n --global-code-bg-color: #{$code-bg-color-dark};\r\n --global-text-color: #{$grey-color-light};\r\n --global-text-color-light: #{$grey-color-light};\r\n --global-theme-color: #{$cyan-color};\r\n --global-hover-color: 
#{$cyan-color};\r\n --global-footer-bg-color: #{$grey-color-light};\r\n --global-footer-text-color: #{$grey-color-dark};\r\n --global-footer-link-color: #{$black-color};\r\n --global-distill-app-color: #{$grey-color-light};\r\n --global-divider-color: #424246;\r\n --global-card-bg-color: #{$grey-900};\r\n\r\n .fa-sun {\r\n padding-left: 10px;\r\n padding-top: 12px;\r\n display : block;\r\n }\r\n .fa-moon {\r\n display : none;\r\n }\r\n\r\n .repo-img-light {\r\n display: none;\r\n }\r\n .repo-img-dark {\r\n display: block;\r\n }\r\n\r\n.header-background .img {\r\n background-image: url(\"../img/ICLR-logo-dark.png\");\r\n background-repeat: no-repeat;\r\n background-size: 400px;\r\n background-position: center bottom;\r\n height: 12em;\r\n margin-bottom: 0em;\r\n margin-top: -2.7em; \r\n // filter: invert(89%);\r\n}\r\n\r\n\r\n\r\n\r\n // .header-background .img {\r\n // background-image: url(\"../img/score_contour.jpg\");\r\n // background-repeat: no-repeat;\r\n // background-size: cover;\r\n // background-position: center bottom;\r\n // height: 15em;\r\n // margin-bottom: 2em;\r\n // margin-top: -2.7em;\r\n // filter: invert(89%);\r\n // }\r\n}\r\n","/******************************************************************************\n * Content\n ******************************************************************************/\n\nbody {\n padding-bottom: 70px;\n color: var(--global-text-color);\n background-color: var(--global-bg-color);\n\n h1, h2, h3, h4, h5, h6 {\n scroll-margin-top: 66px;\n }\n}\n\nbody.fixed-top-nav {\n // Add some padding for the nav-bar.\n padding-top: 56px;\n}\n\nbody.sticky-bottom-footer {\n // Remove padding below footer.\n padding-bottom: 0;\n}\n\n.container {\n max-width: $max-content-width;\n}\n\n// Profile\n.profile {\n img {\n width: 100%;\n }\n}\n\n// TODO: redefine content layout.\n\n\n/******************************************************************************\n * Publications\n ******************************************************************************/\n\n// TODO: redefine publications layout.\n\n\n/*****************************************************************************\n* Projects\n*****************************************************************************/\n\n// TODO: redefine projects layout.\n","@charset \"utf-8\";\n\n// Dimensions\n$max-content-width: 1000px;\n\n@import\n \"variables\",\n \"themes\",\n \"layout\",\n \"base\",\n \"distill\"\n;\n","/*******************************************************************************\n * Styles for the base elements of the theme.\n ******************************************************************************/\n\n// Typography\n\np, h1, h2, h3, h4, h5, h6, em, div, li, span, strong {\n color: var(--global-text-color);\n}\n\nhr {\n border-top: 1px solid var(--global-divider-color);\n}\n\ntable {\n td, th {\n color: var(--global-text-color);\n }\n td {\n font-size: 1rem;\n }\n}\n\na, table.table a {\n color: var(--global-theme-color);\n &:hover {\n color: var(--global-theme-color);\n text-decoration: underline;\n }\n &:hover:after :not(.nav-item.dropdown) {\n width: 100%;\n }\n}\n\nfigure, img {\n max-width: 90vw;\n}\n\nblockquote {\n background: var(--global-bg-color);\n border-left: 2px solid var(--global-theme-color);\n margin: 1.5em 10px;\n padding: 0.5em 10px;\n font-size: 1.1rem;\n}\n\n// Math\n\n.equation {\n margin-bottom: 1rem;\n text-align: center;\n}\n\n// Caption\n\n.caption {\n font-size: 0.875rem;\n margin-top: 0.75rem;\n margin-bottom: 1.5rem;\n text-align: center;\n}\n\n// 
Card\n\n.card {\n background-color: var(--global-card-bg-color);\n\n img {\n width: 100%;\n }\n\n .card-title {\n color: var(--global-text-color);\n }\n\n .card-item {\n width: auto;\n margin-bottom: 10px;\n\n .row {\n display: flex;\n align-items: center;\n }\n }\n}\n\n// Citation\n\n.citation, .citation-number {\n color: var(--global-theme-color);\n}\n\n// Profile\n\n.profile {\n width: 100%;\n\n .address {\n margin-bottom: 5px;\n margin-top: 5px;\n font-family: monospace;\n p {\n display: inline-block;\n margin: 0;\n }\n }\n}\n.profile.float-right{\n margin-left: 1rem;\n}\n.profile.float-left{\n margin-right: 1rem;\n}\n\n@media (min-width: 576px) {\n .profile {\n width: 30%;\n .address {\n p { display: block; }\n }\n }\n}\n\n.post-description {\n margin-bottom: 2rem;\n font-size: 0.875rem;\n a {\n color: inherit;\n &:hover {\n color: var(--global-theme-color);\n text-decoration: none;\n }\n }\n}\n\n\n// Navbar customization\n\n.navbar {\n box-shadow: none;\n border-bottom: 1px solid var(--global-divider-color);\n background-color: var(--global-bg-color);\n opacity: 0.95;\n}\n.navbar .dropdown-menu {\n background-color: var(--global-bg-color);\n border: 1px solid var(--global-divider-color);\n a:not(.active) {\n color: var(--global-text-color);\n }\n a:hover {\n color: var(--global-hover-color);\n }\n .dropdown-divider {\n border-top: 1px solid var(--global-divider-color) !important;\n }\n}\n.dropdown-item {\n color: var(--global-text-color);\n &:hover {\n color: var(--global-hover-color);\n background-color: var(--global-bg-color);\n }\n}\n.navbar.navbar-light {\n a {\n &:hover {\n text-decoration: none;\n }\n }\n .navbar-brand {\n color: var(--global-text-color);\n }\n .navbar-nav .nav-item .nav-link {\n color: var(--global-text-color);\n &:hover {\n color: var(--global-hover-color);\n }\n }\n .navbar-nav .nav-item.active>.nav-link {\n background-color: inherit;\n font-weight: bolder;\n color: var(--global-theme-color);\n &:hover {\n color: var(--global-hover-color);\n }\n }\n .navbar-brand.social {\n padding-bottom: 0;\n padding-top: 0;\n font-size: 1.7rem;\n a {\n i::before {\n color: var(--global-text-color);\n transition-property: all 0.2s ease-in-out;\n }\n &:hover {\n i::before {\n color: var(--global-theme-color);\n }\n }\n }\n }\n}\n\n.navbar-toggler {\n .icon-bar {\n display: block;\n width: 22px;\n height: 2px;\n background-color: var(--global-text-color);\n border-radius: 1px;\n margin-bottom: 4px;\n transition: all 0.2s;\n }\n .top-bar {\n transform: rotate(45deg);\n transform-origin: 10% 10%;\n }\n .middle-bar {\n opacity: 0;\n }\n .bottom-bar {\n transform: rotate(-45deg);\n transform-origin: 10% 90%;\n }\n}\n\n.navbar-toggler.collapsed {\n .top-bar {\n transform: rotate(0);\n }\n .middle-bar {\n opacity: 1;\n }\n .bottom-bar {\n transform: rotate(0);\n }\n}\n\n#light-toggle {\n padding: 0;\n border: 0;\n background-color: inherit;\n color: var(--global-text-color);\n &:hover {\n color: var(--global-hover-color);\n }\n}\n\n// Social (bottom)\n\n.social {\n text-align: center;\n .contact-icons {\n font-size: 4rem;\n a {\n i::before {\n color: var(--global-text-color);\n transition-property: all 0.2s ease-in-out;\n }\n &:hover {\n i::before {\n color: var(--global-theme-color);\n }\n }\n }\n }\n .contact-note {\n font-size: 0.8rem;\n }\n}\n\n\n// Footer\nfooter.fixed-bottom {\n background-color: var(--global-footer-bg-color);\n font-size: 0.75rem;\n .container {\n color: var(--global-footer-text-color);\n padding-top: 9px;\n padding-bottom: 8px;\n }\n a {\n color: 
var(--global-footer-link-color);\n &:hover {\n color: var(--global-theme-color);\n text-decoration: none;\n }\n }\n}\n\nfooter.sticky-bottom {\n border-top: 1px solid var(--global-divider-color);\n padding-top: 40px;\n padding-bottom: 40px;\n font-size: 0.9rem;\n}\n\n// CV\n\n.cv {\n margin-bottom: 40px;\n \n .card {\n background-color: var(--global-card-bg-color);\n border: 1px solid var(--global-divider-color);\n \n .list-group-item {\n background-color: inherit;\n\n .badge {\n color: var(--global-card-bg-color) !important;\n background-color: var(--global-theme-color) !important;\n }\n }\n }\n}\n\n// Repositories\n\n@media (min-width: 768px) {\n .repo {\n max-width: 50%;\n }\n}\n\n// Blog\n\n.header-bar {\n border-bottom: 1px solid var(--global-divider-color);\n text-align: center;\n padding-top: 2rem;\n padding-bottom: 3rem;\n h1 {\n color: var(--global-theme-color);\n font-size: 5rem;\n }\n}\n\n.tag-list {\n border-bottom: 1px solid var(--global-divider-color);\n text-align: center;\n padding-top: 1rem;\n\n ul {\n justify-content: center;\n display: flow-root;\n\n p, li {\n list-style: none;\n display: inline-block;\n padding: 1rem 0.5rem;\n color: var(--global-text-color-light);\n }\n }\n}\n\n.post-list {\n margin: 0;\n margin-bottom: 40px;\n padding: 0;\n li {\n border-bottom: 1px solid var(--global-divider-color);\n list-style: none;\n padding-top: 2rem;\n padding-bottom: 2rem;\n .post-meta {\n color: var(--global-text-color-light);\n font-size: 0.875rem;\n margin-bottom: 0;\n }\n .post-tags {\n color: var(--global-text-color-light);\n font-size: 0.875rem;\n padding-top: 0.25rem;\n padding-bottom: 0;\n }\n a {\n color: var(--global-text-color);\n text-decoration: none;\n &:hover {\n color: var(--global-theme-color);\n }\n }\n }\n}\n\n.pagination {\n .page-item {\n .page-link {\n color: var(--global-text-color);\n &:hover {\n color: $black-color;\n }\n }\n &.active .page-link {\n color: $white-color;\n background-color: var(--global-theme-color);\n &:hover {\n background-color: var(--global-theme-color);\n }\n }\n }\n}\n\n\n// Distill\n\n.distill {\n a:hover {\n border-bottom-color: var(--global-theme-color);\n text-decoration: none;\n }\n}\n\n\n// Projects\n\n.projects {\n a {\n text-decoration: none;\n\n &:hover {\n .card-title {\n color: var(--global-theme-color);\n }\n }\n }\n\n .card {\n img {\n width: 100%;\n }\n }\n\n .card-item {\n width: auto;\n margin-bottom: 10px;\n\n .row {\n display: flex;\n align-items: center;\n }\n }\n\n .grid-sizer, .grid-item {\n width: 250px;\n margin-bottom: 10px;\n }\n\n h2.category {\n color: var(--global-divider-color);\n border-bottom: 1px solid var(--global-divider-color);\n padding-top: 0.5rem;\n margin-top: 2rem;\n margin-bottom: 1rem;\n text-align: right;\n }\n}\n\n\n// Publications\n\n.publications {\n margin-top: 2rem;\n h1 {\n color: var(--global-theme-color);\n font-size: 2rem;\n text-align: center;\n margin-top: 1em;\n margin-bottom: 1em;\n }\n h2 {\n margin-bottom: 1rem;\n span {\n font-size: 1.5rem;\n }\n }\n h2.year {\n color: var(--global-divider-color);\n border-top: 1px solid var(--global-divider-color);\n padding-top: 1rem;\n margin-top: 2rem;\n margin-bottom: -2rem;\n text-align: right;\n }\n ol.bibliography {\n list-style: none;\n padding: 0;\n margin-top: 0;\n\n li {\n margin-bottom: 1rem;\n .preview {\n width: 100%;\n min-width: 80px;\n max-width: 200px;\n }\n .abbr {\n height: 2rem;\n margin-bottom: 0.5rem;\n abbr {\n display: inline-block;\n background-color: var(--global-theme-color);\n padding-left: 1rem;\n 
padding-right: 1rem;\n a {\n color: white;\n &:hover {\n text-decoration: none;\n }\n }\n }\n .award {\n color: var(--global-theme-color) !important;\n border: 1px solid var(--global-theme-color);\n }\n }\n .title {\n font-weight: bolder;\n }\n .author {\n a {\n border-bottom: 1px dashed var(--global-theme-color);\n &:hover {\n border-bottom-style: solid;\n text-decoration: none;\n }\n }\n > em {\n border-bottom: 1px solid;\n font-style: normal;\n }\n > span.more-authors {\n color: var(--global-text-color-light);\n border-bottom: 1px dashed var(--global-text-color-light);\n cursor: pointer;\n &:hover {\n color: var(--global-text-color);\n border-bottom: 1px dashed var(--global-text-color);\n }\n }\n }\n .links {\n a.btn {\n color: var(--global-text-color);\n border: 1px solid var(--global-text-color);\n padding-left: 1rem;\n padding-right: 1rem;\n padding-top: 0.25rem;\n padding-bottom: 0.25rem;\n &:hover {\n color: var(--global-theme-color);\n border-color: var(--global-theme-color);\n }\n }\n }\n .hidden {\n font-size: 0.875rem;\n max-height: 0px;\n overflow: hidden;\n text-align: justify;\n transition-property: 0.15s ease;\n -moz-transition: 0.15s ease;\n -ms-transition: 0.15s ease;\n -o-transition: 0.15s ease;\n transition: all 0.15s ease;\n\n p {\n line-height: 1.4em;\n margin: 10px;\n }\n pre {\n font-size: 1em;\n line-height: 1.4em;\n padding: 10px;\n }\n }\n .hidden.open {\n max-height: 100em;\n transition-property: 0.15s ease;\n -moz-transition: 0.15s ease;\n -ms-transition: 0.15s ease;\n -o-transition: 0.15s ease;\n transition: all 0.15s ease;\n }\n div.abstract.hidden {\n border: dashed 1px var(--global-bg-color);\n }\n div.abstract.hidden.open {\n border-color: var(--global-text-color);\n }\n }\n }\n}\n\n// Rouge Color Customization\nfigure.highlight {\n margin: 0 0 1rem;\n}\n\npre {\n color: var(--global-theme-color);\n background-color: var(--global-code-bg-color);\n border-radius: 6px;\n padding: 6px 12px;\n pre, code {\n background-color: transparent;\n border-radius: 0;\n padding: 0;\n }\n}\n\ncode {\n color: var(--global-theme-color);\n background-color: var(--global-code-bg-color);\n border-radius: 3px;\n padding: 3px 3px;\n}\n\n\n// Transitioning Themes\nhtml.transition,\nhtml.transition *,\nhtml.transition *:before,\nhtml.transition *:after {\n transition: all 750ms !important;\n transition-delay: 0 !important;\n}\n\n// Extra Markdown style (post Customization)\n.post{\n .post-meta{\n color: var(--global-text-color-light);\n font-size: 0.875rem;\n margin-bottom: 0;\n }\n .post-tags{\n color: var(--global-text-color-light);\n font-size: 0.875rem;\n padding-top: 0.25rem;\n padding-bottom: 1rem;\n a {\n color: var(--global-text-color-light);\n text-decoration: none;\n &:hover {\n color: var(--global-theme-color);\n }\n }\n }\n .post-content{\n blockquote {\n border-left: 5px solid var(--global-theme-color);\n padding: 8px;\n }\n }\n}\n","/*******************************************************************************\n * Style overrides for distill blog posts.\n ******************************************************************************/\n\nd-byline {\n border-top-color: var(--global-divider-color) !important;\n}\n\nd-byline h3 {\n color: var(--global-text-color) !important;\n}\n\nd-byline a, d-article d-byline a {\n color: var(--global-text-color) !important;\n &:hover {\n color: var(--global-hover-color) !important;\n }\n}\n\nd-article {\n border-top-color: var(--global-divider-color) !important;\n a, p, h1, h2, h3, h4, h5, h6, li, table {\n color: 
var(--global-text-color) !important;\n }\n a, h1, h2, hr, table, table th, table td {\n border-bottom-color: var(--global-divider-color) !important;\n }\n a:hover {\n border-bottom-color: var(--global-hover-color) !important;\n }\n b i {\n display: inline;\n }\n\n d-contents {\n align-self: start;\n grid-column: 1 / 4;\n grid-row: auto / span 4;\n justify-self: end;\n margin-top: 0em;\n padding-left: 2em;\n padding-right: 3em;\n border-right: 1px solid var(--global-divider-color);\n width: calc(max(70%, 300px));\n margin-right: 0px;\n margin-top: 0em;\n display: grid;\n grid-template-columns:\n minmax(8px, 1fr) [toc] auto\n minmax(8px, 1fr) [toc-line] 1px\n minmax(32px, 2fr);\n\n nav {\n grid-column: toc;\n a {\n border-bottom: none !important;\n &:hover {\n border-bottom: 1px solid var(--global-text-color) !important;\n }\n }\n h3 {\n margin-top: 0;\n margin-bottom: 1em;\n }\n div {\n display: block;\n outline: none;\n margin-bottom: 0.8em;\n color: rgba(0, 0, 0, 0.8);\n font-weight: bold;\n }\n ul {\n padding-left: 1em;\n margin-top: 0;\n margin-bottom: 6px;\n list-style-type: none;\n li {\n margin-bottom: 0.25em;\n }\n }\n }\n .figcaption {\n line-height: 1.4em;\n }\n toc-line {\n border-right: 1px solid var(--global-divider-color);\n grid-column: toc-line;\n }\n }\n\n d-footnote {\n scroll-margin-top: 66px;\n }\n}\n\nd-appendix {\n border-top-color: var(--global-divider-color) !important;\n color: var(--global-distill-app-color) !important;\n h3, li, span {\n color: var(--global-distill-app-color) !important;\n }\n a, a.footnote-backlink {\n color: var(--global-distill-app-color) !important;\n &:hover {\n color: var(--global-hover-color) !important;\n }\n }\n}\n\n@media (max-width: 1024px) {\n d-article {\n d-contents {\n display: block;\n grid-column-start: 2;\n grid-column-end: -2;\n padding-bottom: 0.5em;\n margin-bottom: 1em;\n padding-top: 0.5em;\n width: 100%;\n border: 1px solid var(--global-divider-color);\n nav {\n grid-column: none;\n }\n }\n }\n}\n"],"file":"main.css"} \ No newline at end of file diff --git a/assets/css/main.scss b/assets/css/main.scss deleted file mode 100644 index fd8c311c..00000000 --- a/assets/css/main.scss +++ /dev/null @@ -1,15 +0,0 @@ ---- -# Only the main Sass file needs front matter (the dashes are enough) ---- -@charset "utf-8"; - -// Dimensions -$max-content-width: {{ site.max_width }}; - -@import - "variables", - "themes", - "layout", - "base", - "distill" -; diff --git a/assets/img/2024-05-07-alibi-mlm/ALiBi-1400.webp b/assets/img/2024-05-07-alibi-mlm/ALiBi-1400.webp new file mode 100644 index 00000000..28ff9253 Binary files /dev/null and b/assets/img/2024-05-07-alibi-mlm/ALiBi-1400.webp differ diff --git a/assets/img/2024-05-07-alibi-mlm/ALiBi-480.webp b/assets/img/2024-05-07-alibi-mlm/ALiBi-480.webp new file mode 100644 index 00000000..755ddf96 Binary files /dev/null and b/assets/img/2024-05-07-alibi-mlm/ALiBi-480.webp differ diff --git a/assets/img/2024-05-07-alibi-mlm/ALiBi-800.webp b/assets/img/2024-05-07-alibi-mlm/ALiBi-800.webp new file mode 100644 index 00000000..28ff9253 Binary files /dev/null and b/assets/img/2024-05-07-alibi-mlm/ALiBi-800.webp differ diff --git a/assets/img/2024-05-07-alibi-mlm/valid_ppl_cleaned-1400.webp b/assets/img/2024-05-07-alibi-mlm/valid_ppl_cleaned-1400.webp new file mode 100644 index 00000000..58c39059 Binary files /dev/null and b/assets/img/2024-05-07-alibi-mlm/valid_ppl_cleaned-1400.webp differ diff --git a/assets/img/2024-05-07-alibi-mlm/valid_ppl_cleaned-480.webp 
b/assets/img/2024-05-07-alibi-mlm/valid_ppl_cleaned-480.webp new file mode 100644 index 00000000..c1814ad9 Binary files /dev/null and b/assets/img/2024-05-07-alibi-mlm/valid_ppl_cleaned-480.webp differ diff --git a/assets/img/2024-05-07-alibi-mlm/valid_ppl_cleaned-800.webp b/assets/img/2024-05-07-alibi-mlm/valid_ppl_cleaned-800.webp new file mode 100644 index 00000000..58c39059 Binary files /dev/null and b/assets/img/2024-05-07-alibi-mlm/valid_ppl_cleaned-800.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax-1400.webp b/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax-1400.webp new file mode 100644 index 00000000..eb92ca87 Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax-1400.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax-480.webp b/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax-480.webp new file mode 100644 index 00000000..b718f069 Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax-480.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax-800.webp b/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax-800.webp new file mode 100644 index 00000000..eb92ca87 Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax-800.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax_without_jit-1400.webp b/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax_without_jit-1400.webp new file mode 100644 index 00000000..95799062 Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax_without_jit-1400.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax_without_jit-480.webp b/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax_without_jit-480.webp new file mode 100644 index 00000000..9907dbca Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax_without_jit-480.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax_without_jit-800.webp b/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax_without_jit-800.webp new file mode 100644 index 00000000..95799062 Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_jax_without_jit-800.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_torch-1400.webp b/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_torch-1400.webp new file mode 100644 index 00000000..e6482641 Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_torch-1400.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_torch-480.webp b/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_torch-480.webp new file mode 100644 index 00000000..5d62be3a Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_torch-480.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_torch-800.webp b/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_torch-800.webp new file mode 100644 index 00000000..e6482641 Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/bench_hvp_memory_torch-800.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/bench_hvp_time_jax-1400.webp b/assets/img/2024-05-07-bench-hvp/bench_hvp_time_jax-1400.webp new file mode 100644 index 00000000..9c1bc7f1 Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/bench_hvp_time_jax-1400.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/bench_hvp_time_jax-480.webp 
b/assets/img/2024-05-07-bench-hvp/bench_hvp_time_jax-480.webp new file mode 100644 index 00000000..2c3d190f Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/bench_hvp_time_jax-480.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/bench_hvp_time_jax-800.webp b/assets/img/2024-05-07-bench-hvp/bench_hvp_time_jax-800.webp new file mode 100644 index 00000000..9c1bc7f1 Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/bench_hvp_time_jax-800.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/bench_hvp_time_torch-1400.webp b/assets/img/2024-05-07-bench-hvp/bench_hvp_time_torch-1400.webp new file mode 100644 index 00000000..3df28aaf Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/bench_hvp_time_torch-1400.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/bench_hvp_time_torch-480.webp b/assets/img/2024-05-07-bench-hvp/bench_hvp_time_torch-480.webp new file mode 100644 index 00000000..465e69b0 Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/bench_hvp_time_torch-480.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/bench_hvp_time_torch-800.webp b/assets/img/2024-05-07-bench-hvp/bench_hvp_time_torch-800.webp new file mode 100644 index 00000000..3df28aaf Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/bench_hvp_time_torch-800.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/computational_graph-1400.webp b/assets/img/2024-05-07-bench-hvp/computational_graph-1400.webp new file mode 100644 index 00000000..aacbfbbc Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/computational_graph-1400.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/computational_graph-480.webp b/assets/img/2024-05-07-bench-hvp/computational_graph-480.webp new file mode 100644 index 00000000..992abd86 Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/computational_graph-480.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/computational_graph-800.webp b/assets/img/2024-05-07-bench-hvp/computational_graph-800.webp new file mode 100644 index 00000000..aacbfbbc Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/computational_graph-800.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/direct_graph-1400.webp b/assets/img/2024-05-07-bench-hvp/direct_graph-1400.webp new file mode 100644 index 00000000..bb300cdf Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/direct_graph-1400.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/direct_graph-480.webp b/assets/img/2024-05-07-bench-hvp/direct_graph-480.webp new file mode 100644 index 00000000..a313bd2f Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/direct_graph-480.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/direct_graph-800.webp b/assets/img/2024-05-07-bench-hvp/direct_graph-800.webp new file mode 100644 index 00000000..bb300cdf Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/direct_graph-800.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/hess_eig-1400.webp b/assets/img/2024-05-07-bench-hvp/hess_eig-1400.webp new file mode 100644 index 00000000..98654356 Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/hess_eig-1400.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/hess_eig-480.webp b/assets/img/2024-05-07-bench-hvp/hess_eig-480.webp new file mode 100644 index 00000000..c66fce53 Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/hess_eig-480.webp differ diff --git a/assets/img/2024-05-07-bench-hvp/hess_eig-800.webp 
b/assets/img/2024-05-07-bench-hvp/hess_eig-800.webp new file mode 100644 index 00000000..98654356 Binary files /dev/null and b/assets/img/2024-05-07-bench-hvp/hess_eig-800.webp differ diff --git a/assets/img/2024-05-07-clml/anticorrelated_prior_conflict_and_model_misspecification_1.30-1400.webp b/assets/img/2024-05-07-clml/anticorrelated_prior_conflict_and_model_misspecification_1.30-1400.webp new file mode 100644 index 00000000..b3a4739c Binary files /dev/null and b/assets/img/2024-05-07-clml/anticorrelated_prior_conflict_and_model_misspecification_1.30-1400.webp differ diff --git a/assets/img/2024-05-07-clml/anticorrelated_prior_conflict_and_model_misspecification_1.30-480.webp b/assets/img/2024-05-07-clml/anticorrelated_prior_conflict_and_model_misspecification_1.30-480.webp new file mode 100644 index 00000000..d9e7a0ad Binary files /dev/null and b/assets/img/2024-05-07-clml/anticorrelated_prior_conflict_and_model_misspecification_1.30-480.webp differ diff --git a/assets/img/2024-05-07-clml/anticorrelated_prior_conflict_and_model_misspecification_1.30-800.webp b/assets/img/2024-05-07-clml/anticorrelated_prior_conflict_and_model_misspecification_1.30-800.webp new file mode 100644 index 00000000..b3a4739c Binary files /dev/null and b/assets/img/2024-05-07-clml/anticorrelated_prior_conflict_and_model_misspecification_1.30-800.webp differ diff --git a/assets/img/2024-05-07-clml/area_under_curve_1.00-1400.webp b/assets/img/2024-05-07-clml/area_under_curve_1.00-1400.webp new file mode 100644 index 00000000..a88e1c17 Binary files /dev/null and b/assets/img/2024-05-07-clml/area_under_curve_1.00-1400.webp differ diff --git a/assets/img/2024-05-07-clml/area_under_curve_1.00-480.webp b/assets/img/2024-05-07-clml/area_under_curve_1.00-480.webp new file mode 100644 index 00000000..b47e644e Binary files /dev/null and b/assets/img/2024-05-07-clml/area_under_curve_1.00-480.webp differ diff --git a/assets/img/2024-05-07-clml/area_under_curve_1.00-800.webp b/assets/img/2024-05-07-clml/area_under_curve_1.00-800.webp new file mode 100644 index 00000000..a88e1c17 Binary files /dev/null and b/assets/img/2024-05-07-clml/area_under_curve_1.00-800.webp differ diff --git a/assets/img/2024-05-07-clml/binary_regression_conditional_joint_marginal_information_decision_boundary-1400.webp b/assets/img/2024-05-07-clml/binary_regression_conditional_joint_marginal_information_decision_boundary-1400.webp new file mode 100644 index 00000000..98719215 Binary files /dev/null and b/assets/img/2024-05-07-clml/binary_regression_conditional_joint_marginal_information_decision_boundary-1400.webp differ diff --git a/assets/img/2024-05-07-clml/binary_regression_conditional_joint_marginal_information_decision_boundary-480.webp b/assets/img/2024-05-07-clml/binary_regression_conditional_joint_marginal_information_decision_boundary-480.webp new file mode 100644 index 00000000..5d650678 Binary files /dev/null and b/assets/img/2024-05-07-clml/binary_regression_conditional_joint_marginal_information_decision_boundary-480.webp differ diff --git a/assets/img/2024-05-07-clml/binary_regression_conditional_joint_marginal_information_decision_boundary-800.webp b/assets/img/2024-05-07-clml/binary_regression_conditional_joint_marginal_information_decision_boundary-800.webp new file mode 100644 index 00000000..98719215 Binary files /dev/null and b/assets/img/2024-05-07-clml/binary_regression_conditional_joint_marginal_information_decision_boundary-800.webp differ diff --git 
a/assets/img/2024-05-07-clml/binary_regression_information_metrics-1400.webp b/assets/img/2024-05-07-clml/binary_regression_information_metrics-1400.webp new file mode 100644 index 00000000..0d5daa59 Binary files /dev/null and b/assets/img/2024-05-07-clml/binary_regression_information_metrics-1400.webp differ diff --git a/assets/img/2024-05-07-clml/binary_regression_information_metrics-480.webp b/assets/img/2024-05-07-clml/binary_regression_information_metrics-480.webp new file mode 100644 index 00000000..047133e7 Binary files /dev/null and b/assets/img/2024-05-07-clml/binary_regression_information_metrics-480.webp differ diff --git a/assets/img/2024-05-07-clml/binary_regression_information_metrics-800.webp b/assets/img/2024-05-07-clml/binary_regression_information_metrics-800.webp new file mode 100644 index 00000000..0d5daa59 Binary files /dev/null and b/assets/img/2024-05-07-clml/binary_regression_information_metrics-800.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_comparison_clml_validation_loss-1400.webp b/assets/img/2024-05-07-clml/bmsmlg_comparison_clml_validation_loss-1400.webp new file mode 100644 index 00000000..78d37053 Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_comparison_clml_validation_loss-1400.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_comparison_clml_validation_loss-480.webp b/assets/img/2024-05-07-clml/bmsmlg_comparison_clml_validation_loss-480.webp new file mode 100644 index 00000000..ab1d67d8 Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_comparison_clml_validation_loss-480.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_comparison_clml_validation_loss-800.webp b/assets/img/2024-05-07-clml/bmsmlg_comparison_clml_validation_loss-800.webp new file mode 100644 index 00000000..78d37053 Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_comparison_clml_validation_loss-800.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_fig1-1400.webp b/assets/img/2024-05-07-clml/bmsmlg_fig1-1400.webp new file mode 100644 index 00000000..bb4861d8 Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_fig1-1400.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_fig1-480.webp b/assets/img/2024-05-07-clml/bmsmlg_fig1-480.webp new file mode 100644 index 00000000..8624b6c5 Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_fig1-480.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_fig1-800.webp b/assets/img/2024-05-07-clml/bmsmlg_fig1-800.webp new file mode 100644 index 00000000..bb4861d8 Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_fig1-800.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_fig3-1400.webp b/assets/img/2024-05-07-clml/bmsmlg_fig3-1400.webp new file mode 100644 index 00000000..53e9126c Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_fig3-1400.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_fig3-480.webp b/assets/img/2024-05-07-clml/bmsmlg_fig3-480.webp new file mode 100644 index 00000000..b2f2f3f9 Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_fig3-480.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_fig3-800.webp b/assets/img/2024-05-07-clml/bmsmlg_fig3-800.webp new file mode 100644 index 00000000..53e9126c Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_fig3-800.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_fig4-1400.webp b/assets/img/2024-05-07-clml/bmsmlg_fig4-1400.webp new file mode 100644 index 00000000..442e2b9a Binary files /dev/null 
and b/assets/img/2024-05-07-clml/bmsmlg_fig4-1400.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_fig4-480.webp b/assets/img/2024-05-07-clml/bmsmlg_fig4-480.webp new file mode 100644 index 00000000..2a1ac858 Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_fig4-480.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_fig4-800.webp b/assets/img/2024-05-07-clml/bmsmlg_fig4-800.webp new file mode 100644 index 00000000..442e2b9a Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_fig4-800.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_fig5-1400.webp b/assets/img/2024-05-07-clml/bmsmlg_fig5-1400.webp new file mode 100644 index 00000000..50fbdddd Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_fig5-1400.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_fig5-480.webp b/assets/img/2024-05-07-clml/bmsmlg_fig5-480.webp new file mode 100644 index 00000000..79072b9f Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_fig5-480.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_fig5-800.webp b/assets/img/2024-05-07-clml/bmsmlg_fig5-800.webp new file mode 100644 index 00000000..50fbdddd Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_fig5-800.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_sec4.1-1400.webp b/assets/img/2024-05-07-clml/bmsmlg_sec4.1-1400.webp new file mode 100644 index 00000000..44e51d33 Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_sec4.1-1400.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_sec4.1-480.webp b/assets/img/2024-05-07-clml/bmsmlg_sec4.1-480.webp new file mode 100644 index 00000000..65320aa9 Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_sec4.1-480.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_sec4.1-800.webp b/assets/img/2024-05-07-clml/bmsmlg_sec4.1-800.webp new file mode 100644 index 00000000..44e51d33 Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_sec4.1-800.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_sec4.1_model_selection-1400.webp b/assets/img/2024-05-07-clml/bmsmlg_sec4.1_model_selection-1400.webp new file mode 100644 index 00000000..5fe58dfd Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_sec4.1_model_selection-1400.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_sec4.1_model_selection-480.webp b/assets/img/2024-05-07-clml/bmsmlg_sec4.1_model_selection-480.webp new file mode 100644 index 00000000..5e1813d4 Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_sec4.1_model_selection-480.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_sec4.1_model_selection-800.webp b/assets/img/2024-05-07-clml/bmsmlg_sec4.1_model_selection-800.webp new file mode 100644 index 00000000..5fe58dfd Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_sec4.1_model_selection-800.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_sec4.4-1400.webp b/assets/img/2024-05-07-clml/bmsmlg_sec4.4-1400.webp new file mode 100644 index 00000000..3c3773b7 Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_sec4.4-1400.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_sec4.4-480.webp b/assets/img/2024-05-07-clml/bmsmlg_sec4.4-480.webp new file mode 100644 index 00000000..d77f4000 Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_sec4.4-480.webp differ diff --git a/assets/img/2024-05-07-clml/bmsmlg_sec4.4-800.webp b/assets/img/2024-05-07-clml/bmsmlg_sec4.4-800.webp new file mode 100644 index 
00000000..3c3773b7 Binary files /dev/null and b/assets/img/2024-05-07-clml/bmsmlg_sec4.4-800.webp differ diff --git a/assets/img/2024-05-07-clml/mackay_343-1400.webp b/assets/img/2024-05-07-clml/mackay_343-1400.webp new file mode 100644 index 00000000..867f87dd Binary files /dev/null and b/assets/img/2024-05-07-clml/mackay_343-1400.webp differ diff --git a/assets/img/2024-05-07-clml/mackay_343-480.webp b/assets/img/2024-05-07-clml/mackay_343-480.webp new file mode 100644 index 00000000..36473882 Binary files /dev/null and b/assets/img/2024-05-07-clml/mackay_343-480.webp differ diff --git a/assets/img/2024-05-07-clml/mackay_343-800.webp b/assets/img/2024-05-07-clml/mackay_343-800.webp new file mode 100644 index 00000000..867f87dd Binary files /dev/null and b/assets/img/2024-05-07-clml/mackay_343-800.webp differ diff --git a/assets/img/2024-05-07-clml/otmlacv_ccv-1400.webp b/assets/img/2024-05-07-clml/otmlacv_ccv-1400.webp new file mode 100644 index 00000000..f23d42ed Binary files /dev/null and b/assets/img/2024-05-07-clml/otmlacv_ccv-1400.webp differ diff --git a/assets/img/2024-05-07-clml/otmlacv_ccv-480.webp b/assets/img/2024-05-07-clml/otmlacv_ccv-480.webp new file mode 100644 index 00000000..58a35844 Binary files /dev/null and b/assets/img/2024-05-07-clml/otmlacv_ccv-480.webp differ diff --git a/assets/img/2024-05-07-clml/otmlacv_ccv-800.webp b/assets/img/2024-05-07-clml/otmlacv_ccv-800.webp new file mode 100644 index 00000000..f23d42ed Binary files /dev/null and b/assets/img/2024-05-07-clml/otmlacv_ccv-800.webp differ diff --git a/assets/img/2024-05-07-clml/otmlacv_prop2-1400.webp b/assets/img/2024-05-07-clml/otmlacv_prop2-1400.webp new file mode 100644 index 00000000..c16e5aa3 Binary files /dev/null and b/assets/img/2024-05-07-clml/otmlacv_prop2-1400.webp differ diff --git a/assets/img/2024-05-07-clml/otmlacv_prop2-480.webp b/assets/img/2024-05-07-clml/otmlacv_prop2-480.webp new file mode 100644 index 00000000..264fe6b0 Binary files /dev/null and b/assets/img/2024-05-07-clml/otmlacv_prop2-480.webp differ diff --git a/assets/img/2024-05-07-clml/otmlacv_prop2-800.webp b/assets/img/2024-05-07-clml/otmlacv_prop2-800.webp new file mode 100644 index 00000000..c16e5aa3 Binary files /dev/null and b/assets/img/2024-05-07-clml/otmlacv_prop2-800.webp differ diff --git a/assets/img/2024-05-07-clml/otmlacv_sec3.1-1400.webp b/assets/img/2024-05-07-clml/otmlacv_sec3.1-1400.webp new file mode 100644 index 00000000..6aa0d67e Binary files /dev/null and b/assets/img/2024-05-07-clml/otmlacv_sec3.1-1400.webp differ diff --git a/assets/img/2024-05-07-clml/otmlacv_sec3.1-480.webp b/assets/img/2024-05-07-clml/otmlacv_sec3.1-480.webp new file mode 100644 index 00000000..9a3ab49a Binary files /dev/null and b/assets/img/2024-05-07-clml/otmlacv_sec3.1-480.webp differ diff --git a/assets/img/2024-05-07-clml/otmlacv_sec3.1-800.webp b/assets/img/2024-05-07-clml/otmlacv_sec3.1-800.webp new file mode 100644 index 00000000..6aa0d67e Binary files /dev/null and b/assets/img/2024-05-07-clml/otmlacv_sec3.1-800.webp differ diff --git a/assets/img/2024-05-07-clml/prior_conflict_and_model_misspecification_0.67-1400.webp b/assets/img/2024-05-07-clml/prior_conflict_and_model_misspecification_0.67-1400.webp new file mode 100644 index 00000000..5feaebf8 Binary files /dev/null and b/assets/img/2024-05-07-clml/prior_conflict_and_model_misspecification_0.67-1400.webp differ diff --git a/assets/img/2024-05-07-clml/prior_conflict_and_model_misspecification_0.67-480.webp 
b/assets/img/2024-05-07-clml/prior_conflict_and_model_misspecification_0.67-480.webp new file mode 100644 index 00000000..af45d932 Binary files /dev/null and b/assets/img/2024-05-07-clml/prior_conflict_and_model_misspecification_0.67-480.webp differ diff --git a/assets/img/2024-05-07-clml/prior_conflict_and_model_misspecification_0.67-800.webp b/assets/img/2024-05-07-clml/prior_conflict_and_model_misspecification_0.67-800.webp new file mode 100644 index 00000000..5feaebf8 Binary files /dev/null and b/assets/img/2024-05-07-clml/prior_conflict_and_model_misspecification_0.67-800.webp differ diff --git a/assets/img/2024-05-07-clml/rebuttal_1-1400.webp b/assets/img/2024-05-07-clml/rebuttal_1-1400.webp new file mode 100644 index 00000000..25c08a88 Binary files /dev/null and b/assets/img/2024-05-07-clml/rebuttal_1-1400.webp differ diff --git a/assets/img/2024-05-07-clml/rebuttal_1-480.webp b/assets/img/2024-05-07-clml/rebuttal_1-480.webp new file mode 100644 index 00000000..313dc7eb Binary files /dev/null and b/assets/img/2024-05-07-clml/rebuttal_1-480.webp differ diff --git a/assets/img/2024-05-07-clml/rebuttal_1-800.webp b/assets/img/2024-05-07-clml/rebuttal_1-800.webp new file mode 100644 index 00000000..25c08a88 Binary files /dev/null and b/assets/img/2024-05-07-clml/rebuttal_1-800.webp differ diff --git a/assets/img/2024-05-07-clml/rebuttal_2-1400.webp b/assets/img/2024-05-07-clml/rebuttal_2-1400.webp new file mode 100644 index 00000000..81858365 Binary files /dev/null and b/assets/img/2024-05-07-clml/rebuttal_2-1400.webp differ diff --git a/assets/img/2024-05-07-clml/rebuttal_2-480.webp b/assets/img/2024-05-07-clml/rebuttal_2-480.webp new file mode 100644 index 00000000..afff6706 Binary files /dev/null and b/assets/img/2024-05-07-clml/rebuttal_2-480.webp differ diff --git a/assets/img/2024-05-07-clml/rebuttal_2-800.webp b/assets/img/2024-05-07-clml/rebuttal_2-800.webp new file mode 100644 index 00000000..81858365 Binary files /dev/null and b/assets/img/2024-05-07-clml/rebuttal_2-800.webp differ diff --git a/assets/img/2024-05-07-clml/rebuttal_3-1400.webp b/assets/img/2024-05-07-clml/rebuttal_3-1400.webp new file mode 100644 index 00000000..5fcb3957 Binary files /dev/null and b/assets/img/2024-05-07-clml/rebuttal_3-1400.webp differ diff --git a/assets/img/2024-05-07-clml/rebuttal_3-480.webp b/assets/img/2024-05-07-clml/rebuttal_3-480.webp new file mode 100644 index 00000000..14083a07 Binary files /dev/null and b/assets/img/2024-05-07-clml/rebuttal_3-480.webp differ diff --git a/assets/img/2024-05-07-clml/rebuttal_3-800.webp b/assets/img/2024-05-07-clml/rebuttal_3-800.webp new file mode 100644 index 00000000..5fcb3957 Binary files /dev/null and b/assets/img/2024-05-07-clml/rebuttal_3-800.webp differ diff --git a/assets/img/2024-05-07-deqalg-reasoning/BFexplained-1400.webp b/assets/img/2024-05-07-deqalg-reasoning/BFexplained-1400.webp new file mode 100644 index 00000000..286cb5f9 Binary files /dev/null and b/assets/img/2024-05-07-deqalg-reasoning/BFexplained-1400.webp differ diff --git a/assets/img/2024-05-07-deqalg-reasoning/BFexplained-480.webp b/assets/img/2024-05-07-deqalg-reasoning/BFexplained-480.webp new file mode 100644 index 00000000..ff9d8161 Binary files /dev/null and b/assets/img/2024-05-07-deqalg-reasoning/BFexplained-480.webp differ diff --git a/assets/img/2024-05-07-deqalg-reasoning/BFexplained-800.webp b/assets/img/2024-05-07-deqalg-reasoning/BFexplained-800.webp new file mode 100644 index 00000000..286cb5f9 Binary files /dev/null and 
b/assets/img/2024-05-07-deqalg-reasoning/BFexplained-800.webp differ diff --git a/assets/img/2024-05-07-deqalg-reasoning/alg-reasoning-task-1400.webp b/assets/img/2024-05-07-deqalg-reasoning/alg-reasoning-task-1400.webp new file mode 100644 index 00000000..9306b5f1 Binary files /dev/null and b/assets/img/2024-05-07-deqalg-reasoning/alg-reasoning-task-1400.webp differ diff --git a/assets/img/2024-05-07-deqalg-reasoning/alg-reasoning-task-480.webp b/assets/img/2024-05-07-deqalg-reasoning/alg-reasoning-task-480.webp new file mode 100644 index 00000000..81c6a76c Binary files /dev/null and b/assets/img/2024-05-07-deqalg-reasoning/alg-reasoning-task-480.webp differ diff --git a/assets/img/2024-05-07-deqalg-reasoning/alg-reasoning-task-800.webp b/assets/img/2024-05-07-deqalg-reasoning/alg-reasoning-task-800.webp new file mode 100644 index 00000000..9306b5f1 Binary files /dev/null and b/assets/img/2024-05-07-deqalg-reasoning/alg-reasoning-task-800.webp differ diff --git a/assets/img/2024-05-07-deqalg-reasoning/architecture-1400.webp b/assets/img/2024-05-07-deqalg-reasoning/architecture-1400.webp new file mode 100644 index 00000000..9e173958 Binary files /dev/null and b/assets/img/2024-05-07-deqalg-reasoning/architecture-1400.webp differ diff --git a/assets/img/2024-05-07-deqalg-reasoning/architecture-480.webp b/assets/img/2024-05-07-deqalg-reasoning/architecture-480.webp new file mode 100644 index 00000000..109dac41 Binary files /dev/null and b/assets/img/2024-05-07-deqalg-reasoning/architecture-480.webp differ diff --git a/assets/img/2024-05-07-deqalg-reasoning/architecture-800.webp b/assets/img/2024-05-07-deqalg-reasoning/architecture-800.webp new file mode 100644 index 00000000..9e173958 Binary files /dev/null and b/assets/img/2024-05-07-deqalg-reasoning/architecture-800.webp differ diff --git a/assets/img/2024-05-07-diffusion-theory-from-scratch/ddpm_forward_kernel-1400.webp b/assets/img/2024-05-07-diffusion-theory-from-scratch/ddpm_forward_kernel-1400.webp new file mode 100644 index 00000000..8e729ece Binary files /dev/null and b/assets/img/2024-05-07-diffusion-theory-from-scratch/ddpm_forward_kernel-1400.webp differ diff --git a/assets/img/2024-05-07-diffusion-theory-from-scratch/ddpm_forward_kernel-480.webp b/assets/img/2024-05-07-diffusion-theory-from-scratch/ddpm_forward_kernel-480.webp new file mode 100644 index 00000000..1eac6228 Binary files /dev/null and b/assets/img/2024-05-07-diffusion-theory-from-scratch/ddpm_forward_kernel-480.webp differ diff --git a/assets/img/2024-05-07-diffusion-theory-from-scratch/ddpm_forward_kernel-800.webp b/assets/img/2024-05-07-diffusion-theory-from-scratch/ddpm_forward_kernel-800.webp new file mode 100644 index 00000000..8e729ece Binary files /dev/null and b/assets/img/2024-05-07-diffusion-theory-from-scratch/ddpm_forward_kernel-800.webp differ diff --git a/assets/img/2024-05-07-diffusion-theory-from-scratch/score_def-1400.webp b/assets/img/2024-05-07-diffusion-theory-from-scratch/score_def-1400.webp new file mode 100644 index 00000000..7eb09c2e Binary files /dev/null and b/assets/img/2024-05-07-diffusion-theory-from-scratch/score_def-1400.webp differ diff --git a/assets/img/2024-05-07-diffusion-theory-from-scratch/score_def-480.webp b/assets/img/2024-05-07-diffusion-theory-from-scratch/score_def-480.webp new file mode 100644 index 00000000..2d46acee Binary files /dev/null and b/assets/img/2024-05-07-diffusion-theory-from-scratch/score_def-480.webp differ diff --git a/assets/img/2024-05-07-diffusion-theory-from-scratch/score_def-800.webp 
b/assets/img/2024-05-07-diffusion-theory-from-scratch/score_def-800.webp new file mode 100644 index 00000000..7eb09c2e Binary files /dev/null and b/assets/img/2024-05-07-diffusion-theory-from-scratch/score_def-800.webp differ diff --git a/assets/img/2024-05-07-distill-example/10-1400.webp b/assets/img/2024-05-07-distill-example/10-1400.webp new file mode 100644 index 00000000..ce8225b5 Binary files /dev/null and b/assets/img/2024-05-07-distill-example/10-1400.webp differ diff --git a/assets/img/2024-05-07-distill-example/10-480.webp b/assets/img/2024-05-07-distill-example/10-480.webp new file mode 100644 index 00000000..e890a183 Binary files /dev/null and b/assets/img/2024-05-07-distill-example/10-480.webp differ diff --git a/assets/img/2024-05-07-distill-example/10-800.webp b/assets/img/2024-05-07-distill-example/10-800.webp new file mode 100644 index 00000000..ce8225b5 Binary files /dev/null and b/assets/img/2024-05-07-distill-example/10-800.webp differ diff --git a/assets/img/2024-05-07-distill-example/11-1400.webp b/assets/img/2024-05-07-distill-example/11-1400.webp new file mode 100644 index 00000000..b9410833 Binary files /dev/null and b/assets/img/2024-05-07-distill-example/11-1400.webp differ diff --git a/assets/img/2024-05-07-distill-example/11-480.webp b/assets/img/2024-05-07-distill-example/11-480.webp new file mode 100644 index 00000000..2a916f52 Binary files /dev/null and b/assets/img/2024-05-07-distill-example/11-480.webp differ diff --git a/assets/img/2024-05-07-distill-example/11-800.webp b/assets/img/2024-05-07-distill-example/11-800.webp new file mode 100644 index 00000000..b9410833 Binary files /dev/null and b/assets/img/2024-05-07-distill-example/11-800.webp differ diff --git a/assets/img/2024-05-07-distill-example/12-1400.webp b/assets/img/2024-05-07-distill-example/12-1400.webp new file mode 100644 index 00000000..06b75e0f Binary files /dev/null and b/assets/img/2024-05-07-distill-example/12-1400.webp differ diff --git a/assets/img/2024-05-07-distill-example/12-480.webp b/assets/img/2024-05-07-distill-example/12-480.webp new file mode 100644 index 00000000..4fb64669 Binary files /dev/null and b/assets/img/2024-05-07-distill-example/12-480.webp differ diff --git a/assets/img/2024-05-07-distill-example/12-800.webp b/assets/img/2024-05-07-distill-example/12-800.webp new file mode 100644 index 00000000..06b75e0f Binary files /dev/null and b/assets/img/2024-05-07-distill-example/12-800.webp differ diff --git a/assets/img/2024-05-07-distill-example/7-1400.webp b/assets/img/2024-05-07-distill-example/7-1400.webp new file mode 100644 index 00000000..37aa7e8d Binary files /dev/null and b/assets/img/2024-05-07-distill-example/7-1400.webp differ diff --git a/assets/img/2024-05-07-distill-example/7-480.webp b/assets/img/2024-05-07-distill-example/7-480.webp new file mode 100644 index 00000000..77fdb68d Binary files /dev/null and b/assets/img/2024-05-07-distill-example/7-480.webp differ diff --git a/assets/img/2024-05-07-distill-example/7-800.webp b/assets/img/2024-05-07-distill-example/7-800.webp new file mode 100644 index 00000000..37aa7e8d Binary files /dev/null and b/assets/img/2024-05-07-distill-example/7-800.webp differ diff --git a/assets/img/2024-05-07-distill-example/8-1400.webp b/assets/img/2024-05-07-distill-example/8-1400.webp new file mode 100644 index 00000000..a2b1e89e Binary files /dev/null and b/assets/img/2024-05-07-distill-example/8-1400.webp differ diff --git a/assets/img/2024-05-07-distill-example/8-480.webp b/assets/img/2024-05-07-distill-example/8-480.webp 
new file mode 100644 index 00000000..c09934e6 Binary files /dev/null and b/assets/img/2024-05-07-distill-example/8-480.webp differ diff --git a/assets/img/2024-05-07-distill-example/8-800.webp b/assets/img/2024-05-07-distill-example/8-800.webp new file mode 100644 index 00000000..a2b1e89e Binary files /dev/null and b/assets/img/2024-05-07-distill-example/8-800.webp differ diff --git a/assets/img/2024-05-07-distill-example/9-1400.webp b/assets/img/2024-05-07-distill-example/9-1400.webp new file mode 100644 index 00000000..dfac01c4 Binary files /dev/null and b/assets/img/2024-05-07-distill-example/9-1400.webp differ diff --git a/assets/img/2024-05-07-distill-example/9-480.webp b/assets/img/2024-05-07-distill-example/9-480.webp new file mode 100644 index 00000000..c4f72887 Binary files /dev/null and b/assets/img/2024-05-07-distill-example/9-480.webp differ diff --git a/assets/img/2024-05-07-distill-example/9-800.webp b/assets/img/2024-05-07-distill-example/9-800.webp new file mode 100644 index 00000000..dfac01c4 Binary files /dev/null and b/assets/img/2024-05-07-distill-example/9-800.webp differ diff --git a/assets/img/2024-05-07-distill-example/iclr-1400.webp b/assets/img/2024-05-07-distill-example/iclr-1400.webp new file mode 100644 index 00000000..d56968ba Binary files /dev/null and b/assets/img/2024-05-07-distill-example/iclr-1400.webp differ diff --git a/assets/img/2024-05-07-distill-example/iclr-480.webp b/assets/img/2024-05-07-distill-example/iclr-480.webp new file mode 100644 index 00000000..c9d42d7e Binary files /dev/null and b/assets/img/2024-05-07-distill-example/iclr-480.webp differ diff --git a/assets/img/2024-05-07-distill-example/iclr-800.webp b/assets/img/2024-05-07-distill-example/iclr-800.webp new file mode 100644 index 00000000..d56968ba Binary files /dev/null and b/assets/img/2024-05-07-distill-example/iclr-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/henighan2023superposition-1400.webp b/assets/img/2024-05-07-double-descent-demystified/henighan2023superposition-1400.webp new file mode 100644 index 00000000..85a8a68c Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/henighan2023superposition-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/henighan2023superposition-480.webp b/assets/img/2024-05-07-double-descent-demystified/henighan2023superposition-480.webp new file mode 100644 index 00000000..0b4e8b14 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/henighan2023superposition-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/henighan2023superposition-800.webp b/assets/img/2024-05-07-double-descent-demystified/henighan2023superposition-800.webp new file mode 100644 index 00000000..85a8a68c Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/henighan2023superposition-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/overparameterized_generalization-1400.webp b/assets/img/2024-05-07-double-descent-demystified/overparameterized_generalization-1400.webp new file mode 100644 index 00000000..b69a8cc0 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/overparameterized_generalization-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/overparameterized_generalization-480.webp b/assets/img/2024-05-07-double-descent-demystified/overparameterized_generalization-480.webp new file mode 100644 index 00000000..ca48173c Binary files 
/dev/null and b/assets/img/2024-05-07-double-descent-demystified/overparameterized_generalization-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/overparameterized_generalization-800.webp b/assets/img/2024-05-07-double-descent-demystified/overparameterized_generalization-800.webp new file mode 100644 index 00000000..b69a8cc0 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/overparameterized_generalization-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/least_informative_singular_value-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/least_informative_singular_value-1400.webp new file mode 100644 index 00000000..d29ec492 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/least_informative_singular_value-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/least_informative_singular_value-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/least_informative_singular_value-480.webp new file mode 100644 index 00000000..35f88846 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/least_informative_singular_value-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/least_informative_singular_value-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/least_informative_singular_value-800.webp new file mode 100644 index 00000000..d29ec492 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/least_informative_singular_value-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_residuals_in_ideal-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_residuals_in_ideal-1400.webp new file mode 100644 index 00000000..c74f6345 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_residuals_in_ideal-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_residuals_in_ideal-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_residuals_in_ideal-480.webp new file mode 100644 index 00000000..65671e66 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_residuals_in_ideal-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_residuals_in_ideal-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_residuals_in_ideal-800.webp new file mode 100644 index 00000000..c74f6345 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_residuals_in_ideal-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_small_singular_values-1400.webp 
b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_small_singular_values-1400.webp new file mode 100644 index 00000000..77a122d5 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_small_singular_values-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_small_singular_values-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_small_singular_values-480.webp new file mode 100644 index 00000000..b8b96578 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_small_singular_values-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_small_singular_values-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_small_singular_values-800.webp new file mode 100644 index 00000000..77a122d5 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/no_small_singular_values-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_bias_squared-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_bias_squared-1400.webp new file mode 100644 index 00000000..be5dba14 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_bias_squared-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_bias_squared-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_bias_squared-480.webp new file mode 100644 index 00000000..696b2559 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_bias_squared-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_bias_squared-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_bias_squared-800.webp new file mode 100644 index 00000000..be5dba14 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_bias_squared-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_feat_in_train_feat_subspace-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_feat_in_train_feat_subspace-1400.webp new file mode 100644 index 00000000..44ca9d18 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_feat_in_train_feat_subspace-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_feat_in_train_feat_subspace-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_feat_in_train_feat_subspace-480.webp new file mode 100644 index 00000000..e5e340fb Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_feat_in_train_feat_subspace-480.webp 
differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_feat_in_train_feat_subspace-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_feat_in_train_feat_subspace-800.webp new file mode 100644 index 00000000..44ca9d18 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/test_feat_in_train_feat_subspace-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/unablated-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/unablated-1400.webp new file mode 100644 index 00000000..519b46e9 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/unablated-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/unablated-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/unablated-480.webp new file mode 100644 index 00000000..b3cde315 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/unablated-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/unablated-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/unablated-800.webp new file mode 100644 index 00000000..519b46e9 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/california_housing/unablated-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/least_informative_singular_value-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/least_informative_singular_value-1400.webp new file mode 100644 index 00000000..4b20d5e9 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/least_informative_singular_value-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/least_informative_singular_value-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/least_informative_singular_value-480.webp new file mode 100644 index 00000000..71934304 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/least_informative_singular_value-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/least_informative_singular_value-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/least_informative_singular_value-800.webp new file mode 100644 index 00000000..4b20d5e9 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/least_informative_singular_value-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_residuals_in_ideal-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_residuals_in_ideal-1400.webp new file mode 100644 index 00000000..9a369098 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_residuals_in_ideal-1400.webp differ 
diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_residuals_in_ideal-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_residuals_in_ideal-480.webp new file mode 100644 index 00000000..e07be172 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_residuals_in_ideal-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_residuals_in_ideal-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_residuals_in_ideal-800.webp new file mode 100644 index 00000000..9a369098 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_residuals_in_ideal-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_small_singular_values-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_small_singular_values-1400.webp new file mode 100644 index 00000000..c8043914 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_small_singular_values-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_small_singular_values-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_small_singular_values-480.webp new file mode 100644 index 00000000..e78ed483 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_small_singular_values-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_small_singular_values-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_small_singular_values-800.webp new file mode 100644 index 00000000..c8043914 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/no_small_singular_values-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_bias_squared-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_bias_squared-1400.webp new file mode 100644 index 00000000..b9cbad4a Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_bias_squared-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_bias_squared-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_bias_squared-480.webp new file mode 100644 index 00000000..d97f864b Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_bias_squared-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_bias_squared-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_bias_squared-800.webp new file mode 100644 index 00000000..b9cbad4a Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_bias_squared-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_feat_in_train_feat_subspace-1400.webp 
b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_feat_in_train_feat_subspace-1400.webp new file mode 100644 index 00000000..df8390ab Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_feat_in_train_feat_subspace-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_feat_in_train_feat_subspace-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_feat_in_train_feat_subspace-480.webp new file mode 100644 index 00000000..d37a3b76 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_feat_in_train_feat_subspace-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_feat_in_train_feat_subspace-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_feat_in_train_feat_subspace-800.webp new file mode 100644 index 00000000..df8390ab Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/test_feat_in_train_feat_subspace-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/unablated-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/unablated-1400.webp new file mode 100644 index 00000000..9748e8b0 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/unablated-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/unablated-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/unablated-480.webp new file mode 100644 index 00000000..7ad43278 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/unablated-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/unablated-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/unablated-800.webp new file mode 100644 index 00000000..9748e8b0 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/diabetes/unablated-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/least_informative_singular_value-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/least_informative_singular_value-1400.webp new file mode 100644 index 00000000..d8654b90 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/least_informative_singular_value-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/least_informative_singular_value-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/least_informative_singular_value-480.webp new file mode 100644 index 00000000..8aff1f87 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/least_informative_singular_value-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/least_informative_singular_value-800.webp 
b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/least_informative_singular_value-800.webp new file mode 100644 index 00000000..d8654b90 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/least_informative_singular_value-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_residuals_in_ideal-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_residuals_in_ideal-1400.webp new file mode 100644 index 00000000..d98c6aaa Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_residuals_in_ideal-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_residuals_in_ideal-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_residuals_in_ideal-480.webp new file mode 100644 index 00000000..f924a7b5 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_residuals_in_ideal-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_residuals_in_ideal-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_residuals_in_ideal-800.webp new file mode 100644 index 00000000..d98c6aaa Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_residuals_in_ideal-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_small_singular_values-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_small_singular_values-1400.webp new file mode 100644 index 00000000..7f9a9e51 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_small_singular_values-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_small_singular_values-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_small_singular_values-480.webp new file mode 100644 index 00000000..98da5f5f Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_small_singular_values-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_small_singular_values-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_small_singular_values-800.webp new file mode 100644 index 00000000..7f9a9e51 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/no_small_singular_values-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_bias_squared-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_bias_squared-1400.webp new file mode 100644 index 00000000..b9e89520 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_bias_squared-1400.webp differ diff --git 
a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_bias_squared-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_bias_squared-480.webp new file mode 100644 index 00000000..f1b2691c Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_bias_squared-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_bias_squared-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_bias_squared-800.webp new file mode 100644 index 00000000..b9e89520 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_bias_squared-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_feat_in_train_feat_subspace-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_feat_in_train_feat_subspace-1400.webp new file mode 100644 index 00000000..ed368a7d Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_feat_in_train_feat_subspace-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_feat_in_train_feat_subspace-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_feat_in_train_feat_subspace-480.webp new file mode 100644 index 00000000..d36a405f Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_feat_in_train_feat_subspace-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_feat_in_train_feat_subspace-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_feat_in_train_feat_subspace-800.webp new file mode 100644 index 00000000..ed368a7d Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/test_feat_in_train_feat_subspace-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/unablated-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/unablated-1400.webp new file mode 100644 index 00000000..8adfa4a0 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/unablated-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/unablated-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/unablated-480.webp new file mode 100644 index 00000000..37b911e8 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/unablated-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/unablated-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/unablated-800.webp new file mode 100644 index 00000000..8adfa4a0 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/student_teacher/unablated-800.webp differ diff --git 
a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/least_informative_singular_value-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/least_informative_singular_value-1400.webp new file mode 100644 index 00000000..cdba4f2a Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/least_informative_singular_value-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/least_informative_singular_value-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/least_informative_singular_value-480.webp new file mode 100644 index 00000000..f786eb95 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/least_informative_singular_value-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/least_informative_singular_value-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/least_informative_singular_value-800.webp new file mode 100644 index 00000000..cdba4f2a Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/least_informative_singular_value-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_residuals_in_ideal-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_residuals_in_ideal-1400.webp new file mode 100644 index 00000000..d3c369c9 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_residuals_in_ideal-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_residuals_in_ideal-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_residuals_in_ideal-480.webp new file mode 100644 index 00000000..3b47c010 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_residuals_in_ideal-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_residuals_in_ideal-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_residuals_in_ideal-800.webp new file mode 100644 index 00000000..d3c369c9 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_residuals_in_ideal-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_small_singular_values-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_small_singular_values-1400.webp new file mode 100644 index 00000000..54de0473 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_small_singular_values-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_small_singular_values-480.webp 
b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_small_singular_values-480.webp new file mode 100644 index 00000000..a1fd9a12 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_small_singular_values-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_small_singular_values-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_small_singular_values-800.webp new file mode 100644 index 00000000..54de0473 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/no_small_singular_values-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_bias_squared-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_bias_squared-1400.webp new file mode 100644 index 00000000..39258bc6 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_bias_squared-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_bias_squared-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_bias_squared-480.webp new file mode 100644 index 00000000..861a6160 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_bias_squared-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_bias_squared-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_bias_squared-800.webp new file mode 100644 index 00000000..39258bc6 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_bias_squared-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_feat_in_train_feat_subspace-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_feat_in_train_feat_subspace-1400.webp new file mode 100644 index 00000000..bd3d4c79 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_feat_in_train_feat_subspace-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_feat_in_train_feat_subspace-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_feat_in_train_feat_subspace-480.webp new file mode 100644 index 00000000..c6ff6ff7 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_feat_in_train_feat_subspace-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_feat_in_train_feat_subspace-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_feat_in_train_feat_subspace-800.webp new file mode 100644 index 00000000..bd3d4c79 Binary files /dev/null and 
b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/test_feat_in_train_feat_subspace-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/unablated-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/unablated-1400.webp new file mode 100644 index 00000000..6f654462 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/unablated-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/unablated-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/unablated-480.webp new file mode 100644 index 00000000..6bc5948b Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/unablated-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/unablated-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/unablated-800.webp new file mode 100644 index 00000000..6f654462 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_ablations/who_life_expectancy/unablated-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_test_datum-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_test_datum-1400.webp new file mode 100644 index 00000000..9cd1d6a8 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_test_datum-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_test_datum-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_test_datum-480.webp new file mode 100644 index 00000000..eb4b909b Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_test_datum-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_test_datum-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_test_datum-800.webp new file mode 100644 index 00000000..9cd1d6a8 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_test_datum-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_train_data-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_train_data-1400.webp new file mode 100644 index 00000000..8e1f0e38 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_train_data-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_train_data-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_train_data-480.webp new file mode 100644 index 
00000000..50a84654 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_train_data-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_train_data-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_train_data-800.webp new file mode 100644 index 00000000..8e1f0e38 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/adversarial_train_data-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/unablated-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/unablated-1400.webp new file mode 100644 index 00000000..81c832ae Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/unablated-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/unablated-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/unablated-480.webp new file mode 100644 index 00000000..98187f67 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/unablated-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/unablated-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/unablated-800.webp new file mode 100644 index 00000000..81c832ae Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/california_housing/unablated-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_test_datum-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_test_datum-1400.webp new file mode 100644 index 00000000..01222212 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_test_datum-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_test_datum-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_test_datum-480.webp new file mode 100644 index 00000000..d0d8a61e Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_test_datum-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_test_datum-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_test_datum-800.webp new file mode 100644 index 00000000..01222212 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_test_datum-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_train_data-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_train_data-1400.webp new file mode 100644 index 00000000..a2f476f8 Binary files /dev/null and 
b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_train_data-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_train_data-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_train_data-480.webp new file mode 100644 index 00000000..1b2a403d Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_train_data-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_train_data-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_train_data-800.webp new file mode 100644 index 00000000..a2f476f8 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/adversarial_train_data-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/unablated-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/unablated-1400.webp new file mode 100644 index 00000000..3dd6bfa8 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/unablated-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/unablated-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/unablated-480.webp new file mode 100644 index 00000000..457fc253 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/unablated-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/unablated-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/unablated-800.webp new file mode 100644 index 00000000..3dd6bfa8 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/diabetes/unablated-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_test_datum-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_test_datum-1400.webp new file mode 100644 index 00000000..4dd7ceee Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_test_datum-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_test_datum-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_test_datum-480.webp new file mode 100644 index 00000000..28e0f283 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_test_datum-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_test_datum-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_test_datum-800.webp new file mode 100644 index 00000000..4dd7ceee Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_test_datum-800.webp differ diff --git 
a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_train_data-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_train_data-1400.webp new file mode 100644 index 00000000..3067798d Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_train_data-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_train_data-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_train_data-480.webp new file mode 100644 index 00000000..687bc0f1 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_train_data-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_train_data-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_train_data-800.webp new file mode 100644 index 00000000..3067798d Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/adversarial_train_data-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/unablated-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/unablated-1400.webp new file mode 100644 index 00000000..5dc22362 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/unablated-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/unablated-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/unablated-480.webp new file mode 100644 index 00000000..a10ce6db Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/unablated-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/unablated-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/unablated-800.webp new file mode 100644 index 00000000..5dc22362 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/student_teacher/unablated-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_test_datum-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_test_datum-1400.webp new file mode 100644 index 00000000..d72e3d0a Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_test_datum-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_test_datum-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_test_datum-480.webp new file mode 100644 index 00000000..2063eebe Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_test_datum-480.webp differ diff --git 
a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_test_datum-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_test_datum-800.webp new file mode 100644 index 00000000..d72e3d0a Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_test_datum-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_train_data-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_train_data-1400.webp new file mode 100644 index 00000000..08434459 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_train_data-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_train_data-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_train_data-480.webp new file mode 100644 index 00000000..748d0297 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_train_data-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_train_data-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_train_data-800.webp new file mode 100644 index 00000000..08434459 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/adversarial_train_data-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/unablated-1400.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/unablated-1400.webp new file mode 100644 index 00000000..e48a9dca Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/unablated-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/unablated-480.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/unablated-480.webp new file mode 100644 index 00000000..23e69ca6 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/unablated-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/unablated-800.webp b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/unablated-800.webp new file mode 100644 index 00000000..e48a9dca Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/real_data_adversarial/who_life_expectancy/unablated-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution-1400.webp b/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution-1400.webp new file mode 100644 index 00000000..75840f2f Binary files /dev/null and 
b/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution-480.webp b/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution-480.webp new file mode 100644 index 00000000..2a92c8ad Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution-800.webp b/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution-800.webp new file mode 100644 index 00000000..75840f2f Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=1-1400.webp b/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=1-1400.webp new file mode 100644 index 00000000..66f4dea2 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=1-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=1-480.webp b/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=1-480.webp new file mode 100644 index 00000000..12cc61b4 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=1-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=1-800.webp b/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=1-800.webp new file mode 100644 index 00000000..66f4dea2 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=1-800.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=10-1400.webp b/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=10-1400.webp new file mode 100644 index 00000000..1c6c561a Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=10-1400.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=10-480.webp b/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=10-480.webp new file mode 100644 index 00000000..77958bf1 Binary files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=10-480.webp differ diff --git a/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=10-800.webp b/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=10-800.webp new file mode 100644 index 00000000..1c6c561a Binary 
files /dev/null and b/assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/data_distribution_num_data=10-800.webp differ
[Binary image additions collapsed for readability — these stanzas carry no textual diff content. Each figure listed below is added as three new .webp renders (-480, -800, -1400 widths; the -800 and -1400 variants share identical blob hashes throughout), each with "new file mode 100644" and "Binary files /dev/null and b/<path> differ":
 - assets/img/2024-05-07-double-descent-demystified/smallest_nonzero_singular_value/: data_distribution_num_data={2, 3, 4, 5, 6, 7, 8, 9, 10, 25, 100}
 - assets/img/2024-05-07-elaborating-on-the-value-of-flow-matching-for-density-estimation/: fmpe_results_gw, fmpe_sbi_benchmark, imagenet, kinds_of_sbi, sample_path
 - assets/img/2024-05-07-exploring-meta-learned-curiosity-algorithms/: CCIM_diagram, FAST_diagram, MDP, RND, RND_DAG, byol_arch, deepsea, extended_mdp, meta-learning, meta-rl, mlc; DeepSea-bsuite and Empty-misc mean-seeds training curves for CCIM and FAST (CI and std variants); heatmap_{byol_lite, ccim, ccim_slimmed, dis_ppo, fast, rnd}_30
 - assets/img/2024-05-07-fairness-ai-two-phil-or-just-one/: Conditional_use_accuracy_equality, Counterfactual_fairness, Demographic_Parity, Direct_discrimination, Equalized_odds, Feature_selection, Measurement_error, Selection_on_label, Selection_on_predictor, Two_categories
 - assets/img/2024-05-07-hidden-convex-relu/: annoying, annoyingtroisd, blueoutput, cvx_vs, lastgif_plot, manyneurons, nbactiv, nonconvex, oneneuron, palette, quantgraph, redloss, test, threed, twodim, twoneuron, vraitroisd
 - assets/img/2024-05-07-mode-switching/: bike]
diff --git
a/assets/img/2024-05-07-mode-switching/box-1400.webp b/assets/img/2024-05-07-mode-switching/box-1400.webp new file mode 100644 index 00000000..8a7a02dd Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/box-1400.webp differ diff --git a/assets/img/2024-05-07-mode-switching/box-480.webp b/assets/img/2024-05-07-mode-switching/box-480.webp new file mode 100644 index 00000000..e37a4e35 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/box-480.webp differ diff --git a/assets/img/2024-05-07-mode-switching/box-800.webp b/assets/img/2024-05-07-mode-switching/box-800.webp new file mode 100644 index 00000000..8a7a02dd Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/box-800.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_1_1-1400.webp b/assets/img/2024-05-07-mode-switching/exp_1_1-1400.webp new file mode 100644 index 00000000..7582ff28 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_1_1-1400.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_1_1-480.webp b/assets/img/2024-05-07-mode-switching/exp_1_1-480.webp new file mode 100644 index 00000000..4d17223b Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_1_1-480.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_1_1-800.webp b/assets/img/2024-05-07-mode-switching/exp_1_1-800.webp new file mode 100644 index 00000000..7582ff28 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_1_1-800.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_1_2-1400.webp b/assets/img/2024-05-07-mode-switching/exp_1_2-1400.webp new file mode 100644 index 00000000..b03ba3be Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_1_2-1400.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_1_2-480.webp b/assets/img/2024-05-07-mode-switching/exp_1_2-480.webp new file mode 100644 index 00000000..99d7ae23 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_1_2-480.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_1_2-800.webp b/assets/img/2024-05-07-mode-switching/exp_1_2-800.webp new file mode 100644 index 00000000..b03ba3be Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_1_2-800.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_2_1-1400.webp b/assets/img/2024-05-07-mode-switching/exp_2_1-1400.webp new file mode 100644 index 00000000..cf955883 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_2_1-1400.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_2_1-480.webp b/assets/img/2024-05-07-mode-switching/exp_2_1-480.webp new file mode 100644 index 00000000..cb06215e Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_2_1-480.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_2_1-800.webp b/assets/img/2024-05-07-mode-switching/exp_2_1-800.webp new file mode 100644 index 00000000..cf955883 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_2_1-800.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_2_2-1400.webp b/assets/img/2024-05-07-mode-switching/exp_2_2-1400.webp new file mode 100644 index 00000000..e4757168 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_2_2-1400.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_2_2-480.webp b/assets/img/2024-05-07-mode-switching/exp_2_2-480.webp new file mode 100644 index 00000000..615f1e68 Binary files /dev/null and 
b/assets/img/2024-05-07-mode-switching/exp_2_2-480.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_2_2-800.webp b/assets/img/2024-05-07-mode-switching/exp_2_2-800.webp new file mode 100644 index 00000000..e4757168 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_2_2-800.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_3_1-1400.webp b/assets/img/2024-05-07-mode-switching/exp_3_1-1400.webp new file mode 100644 index 00000000..cf67c843 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_3_1-1400.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_3_1-480.webp b/assets/img/2024-05-07-mode-switching/exp_3_1-480.webp new file mode 100644 index 00000000..64f60024 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_3_1-480.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_3_1-800.webp b/assets/img/2024-05-07-mode-switching/exp_3_1-800.webp new file mode 100644 index 00000000..cf67c843 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_3_1-800.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_3_2-1400.webp b/assets/img/2024-05-07-mode-switching/exp_3_2-1400.webp new file mode 100644 index 00000000..d3cd0ac8 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_3_2-1400.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_3_2-480.webp b/assets/img/2024-05-07-mode-switching/exp_3_2-480.webp new file mode 100644 index 00000000..39ad5419 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_3_2-480.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_3_2-800.webp b/assets/img/2024-05-07-mode-switching/exp_3_2-800.webp new file mode 100644 index 00000000..d3cd0ac8 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_3_2-800.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_4_1-1400.webp b/assets/img/2024-05-07-mode-switching/exp_4_1-1400.webp new file mode 100644 index 00000000..74d425ce Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_4_1-1400.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_4_1-480.webp b/assets/img/2024-05-07-mode-switching/exp_4_1-480.webp new file mode 100644 index 00000000..6aaa558d Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_4_1-480.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_4_1-800.webp b/assets/img/2024-05-07-mode-switching/exp_4_1-800.webp new file mode 100644 index 00000000..74d425ce Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_4_1-800.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_4_2-1400.webp b/assets/img/2024-05-07-mode-switching/exp_4_2-1400.webp new file mode 100644 index 00000000..0e1de403 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_4_2-1400.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_4_2-480.webp b/assets/img/2024-05-07-mode-switching/exp_4_2-480.webp new file mode 100644 index 00000000..ce482549 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_4_2-480.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_4_2-800.webp b/assets/img/2024-05-07-mode-switching/exp_4_2-800.webp new file mode 100644 index 00000000..0e1de403 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_4_2-800.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_5_1-1400.webp 
b/assets/img/2024-05-07-mode-switching/exp_5_1-1400.webp new file mode 100644 index 00000000..0081c897 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_5_1-1400.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_5_1-480.webp b/assets/img/2024-05-07-mode-switching/exp_5_1-480.webp new file mode 100644 index 00000000..e5fb3948 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_5_1-480.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_5_1-800.webp b/assets/img/2024-05-07-mode-switching/exp_5_1-800.webp new file mode 100644 index 00000000..0081c897 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_5_1-800.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_5_2-1400.webp b/assets/img/2024-05-07-mode-switching/exp_5_2-1400.webp new file mode 100644 index 00000000..8e1d2af0 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_5_2-1400.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_5_2-480.webp b/assets/img/2024-05-07-mode-switching/exp_5_2-480.webp new file mode 100644 index 00000000..82b47689 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_5_2-480.webp differ diff --git a/assets/img/2024-05-07-mode-switching/exp_5_2-800.webp b/assets/img/2024-05-07-mode-switching/exp_5_2-800.webp new file mode 100644 index 00000000..8e1d2af0 Binary files /dev/null and b/assets/img/2024-05-07-mode-switching/exp_5_2-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/10-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/10-1400.webp new file mode 100644 index 00000000..ce8225b5 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/10-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/10-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/10-480.webp new file mode 100644 index 00000000..e890a183 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/10-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/10-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/10-800.webp new file mode 100644 index 00000000..ce8225b5 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/10-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/11-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/11-1400.webp new file mode 100644 index 00000000..b9410833 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/11-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/11-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/11-480.webp new file mode 100644 index 00000000..2a916f52 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/11-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/11-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/11-800.webp new file mode 100644 index 00000000..b9410833 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/11-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/12-1400.webp 
b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/12-1400.webp new file mode 100644 index 00000000..06b75e0f Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/12-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/12-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/12-480.webp new file mode 100644 index 00000000..4fb64669 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/12-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/12-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/12-800.webp new file mode 100644 index 00000000..06b75e0f Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/12-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/7-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/7-1400.webp new file mode 100644 index 00000000..37aa7e8d Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/7-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/7-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/7-480.webp new file mode 100644 index 00000000..77fdb68d Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/7-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/7-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/7-800.webp new file mode 100644 index 00000000..37aa7e8d Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/7-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/8-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/8-1400.webp new file mode 100644 index 00000000..a2b1e89e Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/8-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/8-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/8-480.webp new file mode 100644 index 00000000..c09934e6 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/8-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/8-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/8-800.webp new file mode 100644 index 00000000..a2b1e89e Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/8-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/9-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/9-1400.webp new file mode 100644 index 00000000..dfac01c4 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/9-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/9-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/9-480.webp new file mode 100644 index 00000000..c4f72887 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/9-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/9-800.webp 
b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/9-800.webp new file mode 100644 index 00000000..dfac01c4 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/9-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/atari-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/atari-1400.webp new file mode 100644 index 00000000..c1a42a36 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/atari-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/atari-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/atari-480.webp new file mode 100644 index 00000000..d07b8cd2 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/atari-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/atari-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/atari-800.webp new file mode 100644 index 00000000..c1a42a36 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/atari-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/compute-data-tradeoff-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/compute-data-tradeoff-1400.webp new file mode 100644 index 00000000..9aa280a7 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/compute-data-tradeoff-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/compute-data-tradeoff-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/compute-data-tradeoff-480.webp new file mode 100644 index 00000000..f80fa936 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/compute-data-tradeoff-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/compute-data-tradeoff-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/compute-data-tradeoff-800.webp new file mode 100644 index 00000000..9aa280a7 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/compute-data-tradeoff-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn-actionsovertime-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn-actionsovertime-1400.webp new file mode 100644 index 00000000..ef55d868 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn-actionsovertime-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn-actionsovertime-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn-actionsovertime-480.webp new file mode 100644 index 00000000..931b2709 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn-actionsovertime-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn-actionsovertime-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn-actionsovertime-800.webp new file mode 100644 index 00000000..ef55d868 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn-actionsovertime-800.webp differ diff --git 
a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_by_rr-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_by_rr-1400.webp new file mode 100644 index 00000000..58c3303b Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_by_rr-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_by_rr-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_by_rr-480.webp new file mode 100644 index 00000000..fc8f6347 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_by_rr-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_by_rr-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_by_rr-800.webp new file mode 100644 index 00000000..58c3303b Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_by_rr-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_by_size-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_by_size-1400.webp new file mode 100644 index 00000000..d7247753 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_by_size-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_by_size-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_by_size-480.webp new file mode 100644 index 00000000..cbe6a818 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_by_size-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_by_size-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_by_size-800.webp new file mode 100644 index 00000000..d7247753 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_by_size-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_overtime-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_overtime-1400.webp new file mode 100644 index 00000000..a23be734 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_overtime-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_overtime-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_overtime-480.webp new file mode 100644 index 00000000..5a493a56 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_overtime-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_overtime-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_overtime-800.webp new file mode 100644 index 00000000..a23be734 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dqn_overtime-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dropoutsetc-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dropoutsetc-1400.webp new file mode 100644 index 00000000..19912c4d Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dropoutsetc-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dropoutsetc-480.webp 
b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dropoutsetc-480.webp new file mode 100644 index 00000000..613b5d88 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dropoutsetc-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dropoutsetc-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dropoutsetc-800.webp new file mode 100644 index 00000000..19912c4d Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/dropoutsetc-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/fl-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/fl-1400.webp new file mode 100644 index 00000000..e4ce9591 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/fl-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/fl-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/fl-480.webp new file mode 100644 index 00000000..cf0f1153 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/fl-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/fl-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/fl-800.webp new file mode 100644 index 00000000..e4ce9591 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/fl-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/heavy-priming-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/heavy-priming-1400.webp new file mode 100644 index 00000000..d6930b3f Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/heavy-priming-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/heavy-priming-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/heavy-priming-480.webp new file mode 100644 index 00000000..9540aa7b Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/heavy-priming-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/heavy-priming-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/heavy-priming-800.webp new file mode 100644 index 00000000..d6930b3f Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/heavy-priming-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/iclr-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/iclr-1400.webp new file mode 100644 index 00000000..d56968ba Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/iclr-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/iclr-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/iclr-480.webp new file mode 100644 index 00000000..c9d42d7e Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/iclr-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/iclr-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/iclr-800.webp new file mode 100644 index 00000000..d56968ba Binary files /dev/null and 
b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/iclr-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-full-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-full-1400.webp new file mode 100644 index 00000000..1a735a4c Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-full-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-full-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-full-480.webp new file mode 100644 index 00000000..af0a110b Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-full-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-full-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-full-800.webp new file mode 100644 index 00000000..1a735a4c Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-full-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-sample-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-sample-1400.webp new file mode 100644 index 00000000..4d1b8902 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-sample-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-sample-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-sample-480.webp new file mode 100644 index 00000000..c77b98a5 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-sample-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-sample-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-sample-800.webp new file mode 100644 index 00000000..4d1b8902 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/mujuco-resets-sample-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/rr-sweep-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/rr-sweep-1400.webp new file mode 100644 index 00000000..7a7d32e0 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/rr-sweep-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/rr-sweep-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/rr-sweep-480.webp new file mode 100644 index 00000000..e9538e00 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/rr-sweep-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/rr-sweep-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/rr-sweep-800.webp new file mode 100644 index 00000000..7a7d32e0 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/rr-sweep-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/samples11-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/samples11-1400.webp new file mode 100644 index 00000000..0a4a7bd3 
Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/samples11-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/samples11-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/samples11-480.webp new file mode 100644 index 00000000..ac6a3249 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/samples11-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/samples11-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/samples11-800.webp new file mode 100644 index 00000000..0a4a7bd3 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/samples11-800.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/sampling-1400.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/sampling-1400.webp new file mode 100644 index 00000000..a15d6507 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/sampling-1400.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/sampling-480.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/sampling-480.webp new file mode 100644 index 00000000..9bfecb57 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/sampling-480.webp differ diff --git a/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/sampling-800.webp b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/sampling-800.webp new file mode 100644 index 00000000..a15d6507 Binary files /dev/null and b/assets/img/2024-05-07-primacy-bias-and-why-it-helps-to-forget/sampling-800.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/ACL-1400.webp b/assets/img/2024-05-07-robust-foundation-model/ACL-1400.webp new file mode 100644 index 00000000..49dca71d Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/ACL-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/ACL-480.webp b/assets/img/2024-05-07-robust-foundation-model/ACL-480.webp new file mode 100644 index 00000000..4d71622b Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/ACL-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/ACL-800.webp b/assets/img/2024-05-07-robust-foundation-model/ACL-800.webp new file mode 100644 index 00000000..49dca71d Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/ACL-800.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/AIR_cross_attack-1400.webp b/assets/img/2024-05-07-robust-foundation-model/AIR_cross_attack-1400.webp new file mode 100644 index 00000000..7e120cec Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/AIR_cross_attack-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/AIR_cross_attack-480.webp b/assets/img/2024-05-07-robust-foundation-model/AIR_cross_attack-480.webp new file mode 100644 index 00000000..65bda116 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/AIR_cross_attack-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/AIR_cross_attack-800.webp b/assets/img/2024-05-07-robust-foundation-model/AIR_cross_attack-800.webp new file mode 100644 index 00000000..7e120cec Binary files /dev/null and 
b/assets/img/2024-05-07-robust-foundation-model/AIR_cross_attack-800.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/AIR_cross_corrup-1400.webp b/assets/img/2024-05-07-robust-foundation-model/AIR_cross_corrup-1400.webp new file mode 100644 index 00000000..d1cba9f0 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/AIR_cross_corrup-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/AIR_cross_corrup-480.webp b/assets/img/2024-05-07-robust-foundation-model/AIR_cross_corrup-480.webp new file mode 100644 index 00000000..eb0eeee4 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/AIR_cross_corrup-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/AIR_cross_corrup-800.webp b/assets/img/2024-05-07-robust-foundation-model/AIR_cross_corrup-800.webp new file mode 100644 index 00000000..d1cba9f0 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/AIR_cross_corrup-800.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/AIR_invariant-1400.webp b/assets/img/2024-05-07-robust-foundation-model/AIR_invariant-1400.webp new file mode 100644 index 00000000..5dffd116 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/AIR_invariant-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/AIR_invariant-480.webp b/assets/img/2024-05-07-robust-foundation-model/AIR_invariant-480.webp new file mode 100644 index 00000000..9e5c9063 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/AIR_invariant-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/AIR_invariant-800.webp b/assets/img/2024-05-07-robust-foundation-model/AIR_invariant-800.webp new file mode 100644 index 00000000..5dffd116 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/AIR_invariant-800.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/AIR_understand-1400.webp b/assets/img/2024-05-07-robust-foundation-model/AIR_understand-1400.webp new file mode 100644 index 00000000..33909d76 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/AIR_understand-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/AIR_understand-480.webp b/assets/img/2024-05-07-robust-foundation-model/AIR_understand-480.webp new file mode 100644 index 00000000..0fe55166 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/AIR_understand-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/AIR_understand-800.webp b/assets/img/2024-05-07-robust-foundation-model/AIR_understand-800.webp new file mode 100644 index 00000000..33909d76 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/AIR_understand-800.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/CL-1400.webp b/assets/img/2024-05-07-robust-foundation-model/CL-1400.webp new file mode 100644 index 00000000..a315fb2a Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/CL-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/CL-480.webp b/assets/img/2024-05-07-robust-foundation-model/CL-480.webp new file mode 100644 index 00000000..2c2215a9 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/CL-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/CL-800.webp b/assets/img/2024-05-07-robust-foundation-model/CL-800.webp new file mode 
100644 index 00000000..a315fb2a Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/CL-800.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/PGD-1400.webp b/assets/img/2024-05-07-robust-foundation-model/PGD-1400.webp new file mode 100644 index 00000000..80df015e Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/PGD-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/PGD-480.webp b/assets/img/2024-05-07-robust-foundation-model/PGD-480.webp new file mode 100644 index 00000000..fef208c7 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/PGD-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/PGD-800.webp b/assets/img/2024-05-07-robust-foundation-model/PGD-800.webp new file mode 100644 index 00000000..80df015e Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/PGD-800.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/RCS_algo-1400.webp b/assets/img/2024-05-07-robust-foundation-model/RCS_algo-1400.webp new file mode 100644 index 00000000..7b456a28 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/RCS_algo-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/RCS_algo-480.webp b/assets/img/2024-05-07-robust-foundation-model/RCS_algo-480.webp new file mode 100644 index 00000000..2a215035 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/RCS_algo-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/RCS_algo-800.webp b/assets/img/2024-05-07-robust-foundation-model/RCS_algo-800.webp new file mode 100644 index 00000000..7b456a28 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/RCS_algo-800.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/RCS_exp1-1400.webp b/assets/img/2024-05-07-robust-foundation-model/RCS_exp1-1400.webp new file mode 100644 index 00000000..7b285f59 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/RCS_exp1-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/RCS_exp1-480.webp b/assets/img/2024-05-07-robust-foundation-model/RCS_exp1-480.webp new file mode 100644 index 00000000..5bff134f Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/RCS_exp1-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/RCS_exp1-800.webp b/assets/img/2024-05-07-robust-foundation-model/RCS_exp1-800.webp new file mode 100644 index 00000000..7b285f59 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/RCS_exp1-800.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/RCS_exp2-1400.webp b/assets/img/2024-05-07-robust-foundation-model/RCS_exp2-1400.webp new file mode 100644 index 00000000..064edf98 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/RCS_exp2-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/RCS_exp2-480.webp b/assets/img/2024-05-07-robust-foundation-model/RCS_exp2-480.webp new file mode 100644 index 00000000..26cca795 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/RCS_exp2-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/RCS_exp2-800.webp b/assets/img/2024-05-07-robust-foundation-model/RCS_exp2-800.webp new file mode 100644 index 00000000..064edf98 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/RCS_exp2-800.webp 
differ diff --git a/assets/img/2024-05-07-robust-foundation-model/RCS_exp3-1400.webp b/assets/img/2024-05-07-robust-foundation-model/RCS_exp3-1400.webp new file mode 100644 index 00000000..457c0b5c Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/RCS_exp3-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/RCS_exp3-480.webp b/assets/img/2024-05-07-robust-foundation-model/RCS_exp3-480.webp new file mode 100644 index 00000000..b4c88692 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/RCS_exp3-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/RCS_exp3-800.webp b/assets/img/2024-05-07-robust-foundation-model/RCS_exp3-800.webp new file mode 100644 index 00000000..457c0b5c Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/RCS_exp3-800.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/SCL-1400.webp b/assets/img/2024-05-07-robust-foundation-model/SCL-1400.webp new file mode 100644 index 00000000..a067dc6c Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/SCL-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/SCL-480.webp b/assets/img/2024-05-07-robust-foundation-model/SCL-480.webp new file mode 100644 index 00000000..e33511d2 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/SCL-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/SCL-800.webp b/assets/img/2024-05-07-robust-foundation-model/SCL-800.webp new file mode 100644 index 00000000..a067dc6c Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/SCL-800.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/adv_attack-1400.webp b/assets/img/2024-05-07-robust-foundation-model/adv_attack-1400.webp new file mode 100644 index 00000000..782249c8 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/adv_attack-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/adv_attack-480.webp b/assets/img/2024-05-07-robust-foundation-model/adv_attack-480.webp new file mode 100644 index 00000000..89f6e8eb Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/adv_attack-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/adv_attack-800.webp b/assets/img/2024-05-07-robust-foundation-model/adv_attack-800.webp new file mode 100644 index 00000000..782249c8 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/adv_attack-800.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/causal_graph-1400.webp b/assets/img/2024-05-07-robust-foundation-model/causal_graph-1400.webp new file mode 100644 index 00000000..36c99fd9 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/causal_graph-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/causal_graph-480.webp b/assets/img/2024-05-07-robust-foundation-model/causal_graph-480.webp new file mode 100644 index 00000000..05b24b37 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/causal_graph-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/causal_graph-800.webp b/assets/img/2024-05-07-robust-foundation-model/causal_graph-800.webp new file mode 100644 index 00000000..36c99fd9 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/causal_graph-800.webp differ diff --git 
a/assets/img/2024-05-07-robust-foundation-model/foundation_models-1400.webp b/assets/img/2024-05-07-robust-foundation-model/foundation_models-1400.webp new file mode 100644 index 00000000..28d1231e Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/foundation_models-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/foundation_models-480.webp b/assets/img/2024-05-07-robust-foundation-model/foundation_models-480.webp new file mode 100644 index 00000000..31656e7f Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/foundation_models-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/foundation_models-800.webp b/assets/img/2024-05-07-robust-foundation-model/foundation_models-800.webp new file mode 100644 index 00000000..28d1231e Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/foundation_models-800.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/intuition-1400.webp b/assets/img/2024-05-07-robust-foundation-model/intuition-1400.webp new file mode 100644 index 00000000..4269f49f Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/intuition-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/intuition-480.webp b/assets/img/2024-05-07-robust-foundation-model/intuition-480.webp new file mode 100644 index 00000000..67efed2f Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/intuition-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/intuition-800.webp b/assets/img/2024-05-07-robust-foundation-model/intuition-800.webp new file mode 100644 index 00000000..4269f49f Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/intuition-800.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/leaderboard-1400.webp b/assets/img/2024-05-07-robust-foundation-model/leaderboard-1400.webp new file mode 100644 index 00000000..994015da Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/leaderboard-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/leaderboard-480.webp b/assets/img/2024-05-07-robust-foundation-model/leaderboard-480.webp new file mode 100644 index 00000000..d95bfd50 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/leaderboard-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/leaderboard-800.webp b/assets/img/2024-05-07-robust-foundation-model/leaderboard-800.webp new file mode 100644 index 00000000..994015da Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/leaderboard-800.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/proxt_label-1400.webp b/assets/img/2024-05-07-robust-foundation-model/proxt_label-1400.webp new file mode 100644 index 00000000..7b2f7973 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/proxt_label-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/proxt_label-480.webp b/assets/img/2024-05-07-robust-foundation-model/proxt_label-480.webp new file mode 100644 index 00000000..2f9ce7c3 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/proxt_label-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/proxt_label-800.webp b/assets/img/2024-05-07-robust-foundation-model/proxt_label-800.webp new file mode 100644 index 00000000..7b2f7973 Binary files /dev/null and 
b/assets/img/2024-05-07-robust-foundation-model/proxt_label-800.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/proxy_label-1400.webp b/assets/img/2024-05-07-robust-foundation-model/proxy_label-1400.webp new file mode 100644 index 00000000..5ee161c7 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/proxy_label-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/proxy_label-480.webp b/assets/img/2024-05-07-robust-foundation-model/proxy_label-480.webp new file mode 100644 index 00000000..20729f03 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/proxy_label-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/proxy_label-800.webp b/assets/img/2024-05-07-robust-foundation-model/proxy_label-800.webp new file mode 100644 index 00000000..5ee161c7 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/proxy_label-800.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/robust_foundation_models-1400.webp b/assets/img/2024-05-07-robust-foundation-model/robust_foundation_models-1400.webp new file mode 100644 index 00000000..1137ae35 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/robust_foundation_models-1400.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/robust_foundation_models-480.webp b/assets/img/2024-05-07-robust-foundation-model/robust_foundation_models-480.webp new file mode 100644 index 00000000..06893277 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/robust_foundation_models-480.webp differ diff --git a/assets/img/2024-05-07-robust-foundation-model/robust_foundation_models-800.webp b/assets/img/2024-05-07-robust-foundation-model/robust_foundation_models-800.webp new file mode 100644 index 00000000..1137ae35 Binary files /dev/null and b/assets/img/2024-05-07-robust-foundation-model/robust_foundation_models-800.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2-1400.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2-1400.webp new file mode 100644 index 00000000..75482f2c Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2-1400.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2-480.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2-480.webp new file mode 100644 index 00000000..e56e8cdd Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2-480.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2-800.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2-800.webp new file mode 100644 index 00000000..75482f2c Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2-800.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2_xl-1400.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2_xl-1400.webp new file mode 100644 index 00000000..ad7ca078 Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2_xl-1400.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2_xl-480.webp 
b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2_xl-480.webp new file mode 100644 index 00000000..3745b4a7 Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2_xl-480.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2_xl-800.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2_xl-800.webp new file mode 100644 index 00000000..ad7ca078 Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/adam_gpt2_xl-800.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/curve-matching-1400.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/curve-matching-1400.webp new file mode 100644 index 00000000..9242bfcc Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/curve-matching-1400.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/curve-matching-480.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/curve-matching-480.webp new file mode 100644 index 00000000..f02fb15c Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/curve-matching-480.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/curve-matching-800.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/curve-matching-800.webp new file mode 100644 index 00000000..9242bfcc Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/curve-matching-800.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/curve-matching_all-1400.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/curve-matching_all-1400.webp new file mode 100644 index 00000000..eaa91633 Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/curve-matching_all-1400.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/curve-matching_all-480.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/curve-matching_all-480.webp new file mode 100644 index 00000000..434fad24 Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/curve-matching_all-480.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/curve-matching_all-800.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/curve-matching_all-800.webp new file mode 100644 index 00000000..eaa91633 Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/curve-matching_all-800.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/descriptiveness-samples-1400.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/descriptiveness-samples-1400.webp new file mode 100644 index 00000000..80cde890 Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/descriptiveness-samples-1400.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/descriptiveness-samples-480.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/descriptiveness-samples-480.webp new file mode 100644 
index 00000000..9d9e9685 Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/descriptiveness-samples-480.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/descriptiveness-samples-800.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/descriptiveness-samples-800.webp new file mode 100644 index 00000000..80cde890 Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/descriptiveness-samples-800.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/norma_const_comparison-1400.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/norma_const_comparison-1400.webp new file mode 100644 index 00000000..e42444bc Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/norma_const_comparison-1400.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/norma_const_comparison-480.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/norma_const_comparison-480.webp new file mode 100644 index 00000000..85a45449 Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/norma_const_comparison-480.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/norma_const_comparison-800.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/norma_const_comparison-800.webp new file mode 100644 index 00000000..e42444bc Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/norma_const_comparison-800.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/tldr1-1400.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/tldr1-1400.webp new file mode 100644 index 00000000..afe123a0 Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/tldr1-1400.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/tldr1-480.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/tldr1-480.webp new file mode 100644 index 00000000..db96ab21 Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/tldr1-480.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/tldr1-800.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/tldr1-800.webp new file mode 100644 index 00000000..afe123a0 Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/tldr1-800.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/tldr2-1400.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/tldr2-1400.webp new file mode 100644 index 00000000..3c1dfded Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/tldr2-1400.webp differ diff --git a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/tldr2-480.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/tldr2-480.webp new file mode 100644 index 00000000..74207667 Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/tldr2-480.webp differ diff --git 
a/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/tldr2-800.webp b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/tldr2-800.webp new file mode 100644 index 00000000..3c1dfded Binary files /dev/null and b/assets/img/2024-05-07-the-n-implementation-details-of-rlhf-with-ppo/tldr2-800.webp differ diff --git a/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture1-1400.webp b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture1-1400.webp new file mode 100644 index 00000000..5506d881 Binary files /dev/null and b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture1-1400.webp differ diff --git a/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture1-480.webp b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture1-480.webp new file mode 100644 index 00000000..fd2379bd Binary files /dev/null and b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture1-480.webp differ diff --git a/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture1-800.webp b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture1-800.webp new file mode 100644 index 00000000..5506d881 Binary files /dev/null and b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture1-800.webp differ diff --git a/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture2-1400.webp b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture2-1400.webp new file mode 100644 index 00000000..110d4966 Binary files /dev/null and b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture2-1400.webp differ diff --git a/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture2-480.webp b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture2-480.webp new file mode 100644 index 00000000..6cd1486b Binary files /dev/null and b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture2-480.webp differ diff --git a/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture2-800.webp b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture2-800.webp new file mode 100644 index 00000000..110d4966 Binary files /dev/null and b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture2-800.webp differ diff --git a/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture3-1400.webp b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture3-1400.webp new file mode 100644 index 00000000..c535f2aa Binary files /dev/null and b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture3-1400.webp differ diff --git 
a/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture3-480.webp b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture3-480.webp new file mode 100644 index 00000000..b62cd2cc Binary files /dev/null and b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture3-480.webp differ diff --git a/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture3-800.webp b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture3-800.webp new file mode 100644 index 00000000..c535f2aa Binary files /dev/null and b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/Picture3-800.webp differ diff --git a/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt-1400.webp b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt-1400.webp new file mode 100644 index 00000000..0b58c2bd Binary files /dev/null and b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt-1400.webp differ diff --git a/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt-480.webp b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt-480.webp new file mode 100644 index 00000000..5b209933 Binary files /dev/null and b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt-480.webp differ diff --git a/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt-800.webp b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt-800.webp new file mode 100644 index 00000000..0b58c2bd Binary files /dev/null and b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt-800.webp differ diff --git a/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt_2-1400.webp b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt_2-1400.webp new file mode 100644 index 00000000..4832531b Binary files /dev/null and b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt_2-1400.webp differ diff --git a/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt_2-480.webp b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt_2-480.webp new file mode 100644 index 00000000..ec6dba18 Binary files /dev/null and b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt_2-480.webp differ diff --git a/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt_2-800.webp b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt_2-800.webp new file mode 100644 index 00000000..4832531b Binary files /dev/null and 
b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/bs1_cos_gt_2-800.webp differ diff --git a/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/gt0-1400.webp b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/gt0-1400.webp new file mode 100644 index 00000000..841ed01f Binary files /dev/null and b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/gt0-1400.webp differ diff --git a/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/gt0-480.webp b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/gt0-480.webp new file mode 100644 index 00000000..99ec35aa Binary files /dev/null and b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/gt0-480.webp differ diff --git a/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/gt0-800.webp b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/gt0-800.webp new file mode 100644 index 00000000..841ed01f Binary files /dev/null and b/assets/img/2024-05-07-understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/gt0-800.webp differ diff --git a/assets/img/2024-05-07-understanding-icl/in-context-chatgpt-1400.webp b/assets/img/2024-05-07-understanding-icl/in-context-chatgpt-1400.webp new file mode 100644 index 00000000..f034e6ac Binary files /dev/null and b/assets/img/2024-05-07-understanding-icl/in-context-chatgpt-1400.webp differ diff --git a/assets/img/2024-05-07-understanding-icl/in-context-chatgpt-480.webp b/assets/img/2024-05-07-understanding-icl/in-context-chatgpt-480.webp new file mode 100644 index 00000000..56ad8bbe Binary files /dev/null and b/assets/img/2024-05-07-understanding-icl/in-context-chatgpt-480.webp differ diff --git a/assets/img/2024-05-07-understanding-icl/in-context-chatgpt-800.webp b/assets/img/2024-05-07-understanding-icl/in-context-chatgpt-800.webp new file mode 100644 index 00000000..f034e6ac Binary files /dev/null and b/assets/img/2024-05-07-understanding-icl/in-context-chatgpt-800.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/1-1400.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/1-1400.webp new file mode 100644 index 00000000..0a796009 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/1-1400.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/1-480.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/1-480.webp new file mode 100644 index 00000000..c02dc60a Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/1-480.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/1-800.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/1-800.webp new file mode 100644 index 00000000..0a796009 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/1-800.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/2-1400.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/2-1400.webp new file mode 100644 index 00000000..037aebf6 Binary files /dev/null 
and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/2-1400.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/2-480.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/2-480.webp new file mode 100644 index 00000000..bdfee3e5 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/2-480.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/2-800.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/2-800.webp new file mode 100644 index 00000000..037aebf6 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/2-800.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/3-1400.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/3-1400.webp new file mode 100644 index 00000000..207b610d Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/3-1400.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/3-480.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/3-480.webp new file mode 100644 index 00000000..a87311c5 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/3-480.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/3-800.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/3-800.webp new file mode 100644 index 00000000..207b610d Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/3-800.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/4-1400.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/4-1400.webp new file mode 100644 index 00000000..dced6d39 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/4-1400.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/4-480.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/4-480.webp new file mode 100644 index 00000000..aa85946c Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/4-480.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/4-800.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/4-800.webp new file mode 100644 index 00000000..dced6d39 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/4-800.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/5-1400.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/5-1400.webp new file mode 100644 index 00000000..3a525dd3 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/5-1400.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/5-480.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/5-480.webp new file mode 100644 index 00000000..30cc9c00 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/5-480.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/5-800.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/5-800.webp new file mode 100644 index 00000000..3a525dd3 Binary files /dev/null 
and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/5-800.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/6-1400.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/6-1400.webp new file mode 100644 index 00000000..e7e41b02 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/6-1400.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/6-480.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/6-480.webp new file mode 100644 index 00000000..d83c7136 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/6-480.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/6-800.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/6-800.webp new file mode 100644 index 00000000..e7e41b02 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/6-800.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/7-1400.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/7-1400.webp new file mode 100644 index 00000000..f52b05bd Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/7-1400.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/7-480.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/7-480.webp new file mode 100644 index 00000000..575f8104 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/7-480.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/7-800.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/7-800.webp new file mode 100644 index 00000000..f52b05bd Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/7-800.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/cat_data_leakage-1400.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/cat_data_leakage-1400.webp new file mode 100644 index 00000000..39dcf4d6 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/cat_data_leakage-1400.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/cat_data_leakage-480.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/cat_data_leakage-480.webp new file mode 100644 index 00000000..8442d31e Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/cat_data_leakage-480.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/cat_data_leakage-800.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/cat_data_leakage-800.webp new file mode 100644 index 00000000..39dcf4d6 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/cat_data_leakage-800.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/datamodel_our_exp-1400.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/datamodel_our_exp-1400.webp new file mode 100644 index 00000000..300f771e Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/datamodel_our_exp-1400.webp differ diff --git 
a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/datamodel_our_exp-480.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/datamodel_our_exp-480.webp new file mode 100644 index 00000000..846ec34e Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/datamodel_our_exp-480.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/datamodel_our_exp-800.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/datamodel_our_exp-800.webp new file mode 100644 index 00000000..300f771e Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/datamodel_our_exp-800.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_1-1400.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_1-1400.webp new file mode 100644 index 00000000..dced6d39 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_1-1400.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_1-480.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_1-480.webp new file mode 100644 index 00000000..aa85946c Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_1-480.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_1-800.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_1-800.webp new file mode 100644 index 00000000..dced6d39 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_1-800.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_2-1400.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_2-1400.webp new file mode 100644 index 00000000..3a525dd3 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_2-1400.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_2-480.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_2-480.webp new file mode 100644 index 00000000..30cc9c00 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_2-480.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_2-800.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_2-800.webp new file mode 100644 index 00000000..3a525dd3 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_2-800.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_3-1400.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_3-1400.webp new file mode 100644 index 00000000..e7e41b02 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_3-1400.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_3-480.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_3-480.webp new file mode 100644 index 00000000..d83c7136 Binary files /dev/null and 
b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_3-480.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_3-800.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_3-800.webp new file mode 100644 index 00000000..e7e41b02 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_3-800.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_4-1400.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_4-1400.webp new file mode 100644 index 00000000..f52b05bd Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_4-1400.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_4-480.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_4-480.webp new file mode 100644 index 00000000..575f8104 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_4-480.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_4-800.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_4-800.webp new file mode 100644 index 00000000..f52b05bd Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/model_diff_4-800.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_exp_fig-1400.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_exp_fig-1400.webp new file mode 100644 index 00000000..570cfbad Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_exp_fig-1400.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_exp_fig-480.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_exp_fig-480.webp new file mode 100644 index 00000000..42120ec2 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_exp_fig-480.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_exp_fig-800.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_exp_fig-800.webp new file mode 100644 index 00000000..570cfbad Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_exp_fig-800.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_scatter_plot-1400.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_scatter_plot-1400.webp new file mode 100644 index 00000000..b2637898 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_scatter_plot-1400.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_scatter_plot-480.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_scatter_plot-480.webp new file mode 100644 index 00000000..5ec777b1 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_scatter_plot-480.webp differ diff --git a/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_scatter_plot-800.webp b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_scatter_plot-800.webp new file mode 100644 index 
00000000..b2637898 Binary files /dev/null and b/assets/img/2024-05-07-unraveling-the-impact-of-training-samples/trak_scatter_plot-800.webp differ diff --git a/assets/img/2024-05-07-update-frequency-in-mbrl/bremen-1400.webp b/assets/img/2024-05-07-update-frequency-in-mbrl/bremen-1400.webp new file mode 100644 index 00000000..3cf1d0b3 Binary files /dev/null and b/assets/img/2024-05-07-update-frequency-in-mbrl/bremen-1400.webp differ diff --git a/assets/img/2024-05-07-update-frequency-in-mbrl/bremen-480.webp b/assets/img/2024-05-07-update-frequency-in-mbrl/bremen-480.webp new file mode 100644 index 00000000..f7e1bb82 Binary files /dev/null and b/assets/img/2024-05-07-update-frequency-in-mbrl/bremen-480.webp differ diff --git a/assets/img/2024-05-07-update-frequency-in-mbrl/bremen-800.webp b/assets/img/2024-05-07-update-frequency-in-mbrl/bremen-800.webp new file mode 100644 index 00000000..3cf1d0b3 Binary files /dev/null and b/assets/img/2024-05-07-update-frequency-in-mbrl/bremen-800.webp differ diff --git a/assets/img/2024-05-07-update-frequency-in-mbrl/buffer_size-1400.webp b/assets/img/2024-05-07-update-frequency-in-mbrl/buffer_size-1400.webp new file mode 100644 index 00000000..41304f2f Binary files /dev/null and b/assets/img/2024-05-07-update-frequency-in-mbrl/buffer_size-1400.webp differ diff --git a/assets/img/2024-05-07-update-frequency-in-mbrl/buffer_size-480.webp b/assets/img/2024-05-07-update-frequency-in-mbrl/buffer_size-480.webp new file mode 100644 index 00000000..de7155b6 Binary files /dev/null and b/assets/img/2024-05-07-update-frequency-in-mbrl/buffer_size-480.webp differ diff --git a/assets/img/2024-05-07-update-frequency-in-mbrl/buffer_size-800.webp b/assets/img/2024-05-07-update-frequency-in-mbrl/buffer_size-800.webp new file mode 100644 index 00000000..41304f2f Binary files /dev/null and b/assets/img/2024-05-07-update-frequency-in-mbrl/buffer_size-800.webp differ diff --git a/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_cheetah-1400.webp b/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_cheetah-1400.webp new file mode 100644 index 00000000..8d7c89db Binary files /dev/null and b/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_cheetah-1400.webp differ diff --git a/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_cheetah-480.webp b/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_cheetah-480.webp new file mode 100644 index 00000000..1a9b8039 Binary files /dev/null and b/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_cheetah-480.webp differ diff --git a/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_cheetah-800.webp b/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_cheetah-800.webp new file mode 100644 index 00000000..8d7c89db Binary files /dev/null and b/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_cheetah-800.webp differ diff --git a/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_hopper-1400.webp b/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_hopper-1400.webp new file mode 100644 index 00000000..36213ecb Binary files /dev/null and b/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_hopper-1400.webp differ diff --git a/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_hopper-480.webp b/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_hopper-480.webp new file mode 100644 index 00000000..88171d5e Binary files /dev/null and 
b/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_hopper-480.webp differ diff --git a/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_hopper-800.webp b/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_hopper-800.webp new file mode 100644 index 00000000..36213ecb Binary files /dev/null and b/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_hopper-800.webp differ diff --git a/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_walker-1400.webp b/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_walker-1400.webp new file mode 100644 index 00000000..db3d50f4 Binary files /dev/null and b/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_walker-1400.webp differ diff --git a/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_walker-480.webp b/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_walker-480.webp new file mode 100644 index 00000000..84d597b4 Binary files /dev/null and b/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_walker-480.webp differ diff --git a/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_walker-800.webp b/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_walker-800.webp new file mode 100644 index 00000000..db3d50f4 Binary files /dev/null and b/assets/img/2024-05-07-update-frequency-in-mbrl/update_frequency_walker-800.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-confusion-1400.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-confusion-1400.webp new file mode 100644 index 00000000..42de7f9c Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-confusion-1400.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-confusion-480.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-confusion-480.webp new file mode 100644 index 00000000..05f5fbed Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-confusion-480.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-confusion-800.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-confusion-800.webp new file mode 100644 index 00000000..42de7f9c Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-confusion-800.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-scatterplot-1400.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-scatterplot-1400.webp new file mode 100644 index 00000000..676e2bed Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-scatterplot-1400.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-scatterplot-480.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-scatterplot-480.webp new file mode 100644 index 00000000..e548c388 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-scatterplot-480.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-scatterplot-800.webp 
b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-scatterplot-800.webp new file mode 100644 index 00000000..676e2bed Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-batch-scatterplot-800.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-comparison-1400.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-comparison-1400.webp new file mode 100644 index 00000000..224cc3e3 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-comparison-1400.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-comparison-480.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-comparison-480.webp new file mode 100644 index 00000000..93cc0d1b Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-comparison-480.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-comparison-800.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-comparison-800.webp new file mode 100644 index 00000000..224cc3e3 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/bladderbatch-comparison-800.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/cifar10-vs-samples-1400.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/cifar10-vs-samples-1400.webp new file mode 100644 index 00000000..a9ba73e3 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/cifar10-vs-samples-1400.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/cifar10-vs-samples-480.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/cifar10-vs-samples-480.webp new file mode 100644 index 00000000..997cb682 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/cifar10-vs-samples-480.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/cifar10-vs-samples-800.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/cifar10-vs-samples-800.webp new file mode 100644 index 00000000..a9ba73e3 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/cifar10-vs-samples-800.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/mnist-vs-samples-1400.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/mnist-vs-samples-1400.webp new file mode 100644 index 00000000..98a61e0d Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/mnist-vs-samples-1400.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/mnist-vs-samples-480.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/mnist-vs-samples-480.webp new file mode 100644 index 00000000..4a355710 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/mnist-vs-samples-480.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/mnist-vs-samples-800.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/mnist-vs-samples-800.webp new file mode 100644 index 00000000..98a61e0d Binary files /dev/null and 
b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/mnist-vs-samples-800.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-ensembles2-1400.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-ensembles2-1400.webp new file mode 100644 index 00000000..f9ee3a43 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-ensembles2-1400.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-ensembles2-480.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-ensembles2-480.webp new file mode 100644 index 00000000..bb737bff Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-ensembles2-480.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-ensembles2-800.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-ensembles2-800.webp new file mode 100644 index 00000000..f9ee3a43 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-ensembles2-800.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-nonmonotone-1400.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-nonmonotone-1400.webp new file mode 100644 index 00000000..bd938567 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-nonmonotone-1400.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-nonmonotone-480.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-nonmonotone-480.webp new file mode 100644 index 00000000..3c6ab0e9 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-nonmonotone-480.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-nonmonotone-800.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-nonmonotone-800.webp new file mode 100644 index 00000000..bd938567 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-nonmonotone-800.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2pair-1400.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2pair-1400.webp new file mode 100644 index 00000000..1650952b Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2pair-1400.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2pair-480.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2pair-480.webp new file mode 100644 index 00000000..a5ac2653 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2pair-480.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2pair-800.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2pair-800.webp new file mode 100644 index 00000000..1650952b Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2pair-800.webp differ diff --git 
a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2skippair-1400.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2skippair-1400.webp new file mode 100644 index 00000000..94607a08 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2skippair-1400.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2skippair-480.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2skippair-480.webp new file mode 100644 index 00000000..c8cf24bc Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2skippair-480.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2skippair-800.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2skippair-800.webp new file mode 100644 index 00000000..94607a08 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-2skippair-800.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-3pair-1400.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-3pair-1400.webp new file mode 100644 index 00000000..273c4d16 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-3pair-1400.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-3pair-480.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-3pair-480.webp new file mode 100644 index 00000000..bd621969 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-3pair-480.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-3pair-800.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-3pair-800.webp new file mode 100644 index 00000000..273c4d16 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-periodic-3pair-800.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatboth-1400.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatboth-1400.webp new file mode 100644 index 00000000..774d39cb Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatboth-1400.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatboth-480.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatboth-480.webp new file mode 100644 index 00000000..add7f1ef Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatboth-480.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatboth-800.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatboth-800.webp new file mode 100644 index 00000000..774d39cb Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatboth-800.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatred-1400.webp 
b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatred-1400.webp new file mode 100644 index 00000000..ee46a20c Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatred-1400.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatred-480.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatred-480.webp new file mode 100644 index 00000000..32f24e86 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatred-480.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatred-800.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatred-800.webp new file mode 100644 index 00000000..ee46a20c Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeatred-800.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeats-1400.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeats-1400.webp new file mode 100644 index 00000000..62a73e77 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeats-1400.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeats-480.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeats-480.webp new file mode 100644 index 00000000..8f232123 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeats-480.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeats-800.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeats-800.webp new file mode 100644 index 00000000..62a73e77 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/plusminus1-repeats-800.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/voronois-1400.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/voronois-1400.webp new file mode 100644 index 00000000..910323e0 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/voronois-1400.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/voronois-480.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/voronois-480.webp new file mode 100644 index 00000000..b4b24766 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/voronois-480.webp differ diff --git a/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/voronois-800.webp b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/voronois-800.webp new file mode 100644 index 00000000..910323e0 Binary files /dev/null and b/assets/img/2024-05-07-what-exactly-has-tabpfn-learned-to-do/voronois-800.webp differ diff --git a/assets/img/ICLR-logo-1400.webp b/assets/img/ICLR-logo-1400.webp new file mode 100644 index 00000000..d56968ba Binary files /dev/null and b/assets/img/ICLR-logo-1400.webp differ diff --git a/assets/img/ICLR-logo-480.webp b/assets/img/ICLR-logo-480.webp new file mode 100644 index 00000000..c9d42d7e Binary files /dev/null and b/assets/img/ICLR-logo-480.webp differ diff --git a/assets/img/ICLR-logo-800.webp b/assets/img/ICLR-logo-800.webp new file mode 100644 
index 00000000..d56968ba Binary files /dev/null and b/assets/img/ICLR-logo-800.webp differ diff --git a/assets/img/ICLR-logo-dark-1400.webp b/assets/img/ICLR-logo-dark-1400.webp new file mode 100644 index 00000000..5ed49089 Binary files /dev/null and b/assets/img/ICLR-logo-dark-1400.webp differ diff --git a/assets/img/ICLR-logo-dark-480.webp b/assets/img/ICLR-logo-dark-480.webp new file mode 100644 index 00000000..7f0830c1 Binary files /dev/null and b/assets/img/ICLR-logo-dark-480.webp differ diff --git a/assets/img/ICLR-logo-dark-800.webp b/assets/img/ICLR-logo-dark-800.webp new file mode 100644 index 00000000..5ed49089 Binary files /dev/null and b/assets/img/ICLR-logo-dark-800.webp differ diff --git a/assets/img/organizers/cg-1400.webp b/assets/img/organizers/cg-1400.webp new file mode 100644 index 00000000..7d4f4383 Binary files /dev/null and b/assets/img/organizers/cg-1400.webp differ diff --git a/assets/img/organizers/cg-480.webp b/assets/img/organizers/cg-480.webp new file mode 100644 index 00000000..c4497e86 Binary files /dev/null and b/assets/img/organizers/cg-480.webp differ diff --git a/assets/img/organizers/cg-800.webp b/assets/img/organizers/cg-800.webp new file mode 100644 index 00000000..7d4f4383 Binary files /dev/null and b/assets/img/organizers/cg-800.webp differ diff --git a/assets/img/organizers/cv-1400.webp b/assets/img/organizers/cv-1400.webp new file mode 100644 index 00000000..3967f400 Binary files /dev/null and b/assets/img/organizers/cv-1400.webp differ diff --git a/assets/img/organizers/cv-480.webp b/assets/img/organizers/cv-480.webp new file mode 100644 index 00000000..8e5721ba Binary files /dev/null and b/assets/img/organizers/cv-480.webp differ diff --git a/assets/img/organizers/cv-800.webp b/assets/img/organizers/cv-800.webp new file mode 100644 index 00000000..3967f400 Binary files /dev/null and b/assets/img/organizers/cv-800.webp differ diff --git a/assets/img/organizers/dd-1400.webp b/assets/img/organizers/dd-1400.webp new file mode 100644 index 00000000..b63f6c49 Binary files /dev/null and b/assets/img/organizers/dd-1400.webp differ diff --git a/assets/img/organizers/dd-480.webp b/assets/img/organizers/dd-480.webp new file mode 100644 index 00000000..57f80658 Binary files /dev/null and b/assets/img/organizers/dd-480.webp differ diff --git a/assets/img/organizers/dd-800.webp b/assets/img/organizers/dd-800.webp new file mode 100644 index 00000000..b63f6c49 Binary files /dev/null and b/assets/img/organizers/dd-800.webp differ diff --git a/assets/img/organizers/fp-1400.webp b/assets/img/organizers/fp-1400.webp new file mode 100644 index 00000000..4a1670f3 Binary files /dev/null and b/assets/img/organizers/fp-1400.webp differ diff --git a/assets/img/organizers/fp-480.webp b/assets/img/organizers/fp-480.webp new file mode 100644 index 00000000..07249083 Binary files /dev/null and b/assets/img/organizers/fp-480.webp differ diff --git a/assets/img/organizers/fp-800.webp b/assets/img/organizers/fp-800.webp new file mode 100644 index 00000000..4a1670f3 Binary files /dev/null and b/assets/img/organizers/fp-800.webp differ diff --git a/assets/img/organizers/gg-1400.webp b/assets/img/organizers/gg-1400.webp new file mode 100644 index 00000000..4a8c5bd4 Binary files /dev/null and b/assets/img/organizers/gg-1400.webp differ diff --git a/assets/img/organizers/gg-480.webp b/assets/img/organizers/gg-480.webp new file mode 100644 index 00000000..ca12493c Binary files /dev/null and b/assets/img/organizers/gg-480.webp differ diff --git a/assets/img/organizers/gg-800.webp 
b/assets/img/organizers/gg-800.webp new file mode 100644 index 00000000..4a8c5bd4 Binary files /dev/null and b/assets/img/organizers/gg-800.webp differ diff --git a/assets/img/organizers/ls-1400.webp b/assets/img/organizers/ls-1400.webp new file mode 100644 index 00000000..d0d38f6c Binary files /dev/null and b/assets/img/organizers/ls-1400.webp differ diff --git a/assets/img/organizers/ls-480.webp b/assets/img/organizers/ls-480.webp new file mode 100644 index 00000000..5a9334a5 Binary files /dev/null and b/assets/img/organizers/ls-480.webp differ diff --git a/assets/img/organizers/ls-800.webp b/assets/img/organizers/ls-800.webp new file mode 100644 index 00000000..d0d38f6c Binary files /dev/null and b/assets/img/organizers/ls-800.webp differ
diff --git a/assets/js/common.js b/assets/js/common.js
index f7c41c20..521235d2 100644
--- a/assets/js/common.js
+++ b/assets/js/common.js
@@ -1,9 +1 @@
-$(document).ready(function() {
-  $('a.abstract').click(function() {
-    $(this).parent().parent().find(".abstract.hidden").toggleClass('open');
-  });
-  $('a.bibtex').click(function() {
-    $(this).parent().parent().find(".bibtex.hidden").toggleClass('open');
-  });
-  $('a').removeClass('waves-effect waves-light');
-});
+$(document).ready(function(){$("a.abstract").click(function(){$(this).parent().parent().find(".abstract.hidden").toggleClass("open")}),$("a.bibtex").click(function(){$(this).parent().parent().find(".bibtex.hidden").toggleClass("open")}),$("a").removeClass("waves-effect waves-light")});
\ No newline at end of file
diff --git a/assets/js/dark_mode.js b/assets/js/dark_mode.js
index 863b273f..26312e44 100644
--- a/assets/js/dark_mode.js
+++ b/assets/js/dark_mode.js
@@ -1,8 +1 @@
-document.addEventListener('DOMContentLoaded', function() {
-  const mode_toggle = document.getElementById("light-toggle");
-
-  mode_toggle.addEventListener("click", function() {
-    toggleTheme(localStorage.getItem("theme"));
-  });
-});
-
+document.addEventListener("DOMContentLoaded",function(){document.getElementById("light-toggle").addEventListener("click",function(){toggleTheme(localStorage.getItem("theme"))})});
\ No newline at end of file
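The two hunks above swap hand-written source for single-line minified output; the behavior is unchanged. The patch does not show which tool produced the minified text, so the following is only a sketch of how output of this shape could be reproduced with a standard minifier (Terser here is an assumption, not taken from this diff):

// Sketch: regenerate a one-line minified build of common.js.
// Assumption: Terser (npm i terser); the repo's actual build tool is not shown in this patch.
const fs = require("fs");
const { minify } = require("terser");

(async () => {
  const source = fs.readFileSync("assets/js/common.js", "utf8");
  // compress + mangle yields single-line, double-quoted output like the "+" side above
  const result = await minify(source, { compress: true, mangle: true });
  fs.writeFileSync("assets/js/common.min.js", result.code);
})();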
diff --git a/assets/js/distillpub/overrides.js b/assets/js/distillpub/overrides.js
index 2d839626..066b8efa 100644
--- a/assets/js/distillpub/overrides.js
+++ b/assets/js/distillpub/overrides.js
@@ -1,24 +1 @@
-$(document).ready(function() {
-  // Override styles of the footnotes.
-  document.querySelectorAll("d-footnote").forEach(function(footnote) {
-    footnote.shadowRoot.querySelector("sup > span")
-      .setAttribute("style", "color: var(--global-theme-color);");
-    footnote.shadowRoot.querySelector("d-hover-box").shadowRoot.querySelector("style").sheet
-      .insertRule(".panel {background-color: var(--global-bg-color) !important;}");
-    footnote.shadowRoot.querySelector("d-hover-box").shadowRoot.querySelector("style").sheet
-      .insertRule(".panel {border-color: var(--global-divider-color) !important;}");
-  });
-  // Override styles of the citations.
-  document.querySelectorAll("d-cite").forEach(function(cite) {
-    cite.shadowRoot.querySelector("div > span")
-      .setAttribute("style", "color: var(--global-theme-color);");
-    cite.shadowRoot.querySelector("style").sheet
-      .insertRule("ul li a {color: var(--global-text-color) !important; text-decoration: none;}");
-    cite.shadowRoot.querySelector("style").sheet
-      .insertRule("ul li a:hover {color: var(--global-theme-color) !important;}");
-    cite.shadowRoot.querySelector("d-hover-box").shadowRoot.querySelector("style").sheet
-      .insertRule(".panel {background-color: var(--global-bg-color) !important;}");
-    cite.shadowRoot.querySelector("d-hover-box").shadowRoot.querySelector("style").sheet
-      .insertRule(".panel {border-color: var(--global-divider-color) !important;}");
-  });
-})
\ No newline at end of file
+$(document).ready(function(){document.querySelectorAll("d-footnote").forEach(function(o){o.shadowRoot.querySelector("sup > span").setAttribute("style","color: var(--global-theme-color);"),o.shadowRoot.querySelector("d-hover-box").shadowRoot.querySelector("style").sheet.insertRule(".panel {background-color: var(--global-bg-color) !important;}"),o.shadowRoot.querySelector("d-hover-box").shadowRoot.querySelector("style").sheet.insertRule(".panel {border-color: var(--global-divider-color) !important;}")}),document.querySelectorAll("d-cite").forEach(function(o){o.shadowRoot.querySelector("div > span").setAttribute("style","color: var(--global-theme-color);"),o.shadowRoot.querySelector("style").sheet.insertRule("ul li a {color: var(--global-text-color) !important; text-decoration: none;}"),o.shadowRoot.querySelector("style").sheet.insertRule("ul li a:hover {color: var(--global-theme-color) !important;}"),o.shadowRoot.querySelector("d-hover-box").shadowRoot.querySelector("style").sheet.insertRule(".panel {background-color: var(--global-bg-color) !important;}"),o.shadowRoot.querySelector("d-hover-box").shadowRoot.querySelector("style").sheet.insertRule(".panel {border-color: var(--global-divider-color) !important;}")})});
\ No newline at end of file
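Both override blocks rely on the same trick: Distill's custom elements (d-footnote, d-cite, d-hover-box) attach open shadow roots, so the host page can query into them at runtime and append CSS rules that pick up the theme's CSS variables. A minimal sketch of that pattern (the helper name and its guard checks are illustrative, not part of the theme):

// Illustrative helper: append one CSS rule inside the shadow root of every
// element matching `selector`. This only works for open shadow roots, as Distill uses.
function overrideShadowStyle(selector, rule) {
  document.querySelectorAll(selector).forEach(function (el) {
    const styleEl = el.shadowRoot && el.shadowRoot.querySelector("style");
    if (styleEl && styleEl.sheet) {
      styleEl.sheet.insertRule(rule); // inserted at index 0 by default
    }
  });
}

// e.g., recolor footnote markers with the theme's CSS variable:
// overrideShadowStyle("d-footnote", "sup > span { color: var(--global-theme-color); }");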
'0' + n : n; - - const RFC = function(date) { - const day = days[date.getDay()].substring(0, 3); - const paddedDate = zeroPad(date.getDate()); - const month = months[date.getMonth()].substring(0,3); - const year = date.getFullYear().toString(); - const hours = date.getUTCHours().toString(); - const minutes = date.getUTCMinutes().toString(); - const seconds = date.getUTCSeconds().toString(); - return `${day}, ${paddedDate} ${month} ${year} ${hours}:${minutes}:${seconds} Z`; - }; - - const objectFromMap = function(map) { - const object = Array.from(map).reduce((object, [key, value]) => ( - Object.assign(object, { [key]: value }) // Be careful! Maps can have non-String keys; object literals can't. - ), {}); - return object; - }; - - const mapFromObject = function(object) { - const map = new Map(); - for (var property in object) { - if (object.hasOwnProperty(property)) { - map.set(property, object[property]); - } - } - return map; - }; - - class Author { - - // constructor(name='', personalURL='', affiliation='', affiliationURL='') { - // this.name = name; // 'Chris Olah' - // this.personalURL = personalURL; // 'https://colah.github.io' - // this.affiliation = affiliation; // 'Google Brain' - // this.affiliationURL = affiliationURL; // 'https://g.co/brain' - // } - - constructor(object) { - this.name = object.author; // 'Chris Olah' - this.personalURL = object.authorURL; // 'https://colah.github.io' - this.affiliation = object.affiliation; // 'Google Brain' - this.affiliationURL = object.affiliationURL; // 'https://g.co/brain' - this.affiliations = object.affiliations || []; // new-style affiliations - } - - // 'Chris' - get firstName() { - const names = this.name.split(' '); - return names.slice(0, names.length - 1).join(' '); - } - - // 'Olah' - get lastName() { - const names = this.name.split(' '); - return names[names.length -1]; - } - } - - function mergeFromYMLFrontmatter(target, source) { - target.title = source.title; - if (source.published) { - if (source.published instanceof Date) { - target.publishedDate = source.published; - } else if (source.published.constructor === String) { - target.publishedDate = new Date(source.published); - } - } - if (source.publishedDate) { - if (source.publishedDate instanceof Date) { - target.publishedDate = source.publishedDate; - } else if (source.publishedDate.constructor === String) { - target.publishedDate = new Date(source.publishedDate); - } else { - console.error('Don\'t know what to do with published date: ' + source.publishedDate); - } - } - target.description = source.description; - target.authors = source.authors.map( (authorObject) => new Author(authorObject)); - target.katex = source.katex; - target.password = source.password; - if (source.doi) { - target.doi = source.doi; - } - } - - class FrontMatter { - constructor() { - this.title = 'unnamed article'; // 'Attention and Augmented Recurrent Neural Networks' - this.description = ''; // 'A visual overview of neural attention...' 
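
The Author class above derives first and last names purely by position: the token after the final space is the last name, everything before it the first name(s). The same convention as a standalone function:

    // Mirrors Author.firstName / Author.lastName from the deleted class.
    const splitName = (name) => {
      const parts = name.trim().split(" ");
      return {
        firstName: parts.slice(0, parts.length - 1).join(" "),
        lastName: parts[parts.length - 1],
      };
    };
    console.log(splitName("Chris Olah")); // { firstName: "Chris", lastName: "Olah" }
    console.log(splitName("Danilo Jimenez Rezende")); // { firstName: "Danilo Jimenez", lastName: "Rezende" }

Compound surnames ("Van Der Maaten") are split incorrectly by this heuristic; that limitation is inherited from the original code, not introduced by the sketch.
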
- this.authors = []; // Array of Author(s) - - this.bibliography = new Map(); - this.bibliographyParsed = false; - // { - // 'gregor2015draw': { - // 'title': 'DRAW: A recurrent neural network for image generation', - // 'author': 'Gregor, Karol and Danihelka, Ivo and Graves, Alex and Rezende, Danilo Jimenez and Wierstra, Daan', - // 'journal': 'arXiv preprint arXiv:1502.04623', - // 'year': '2015', - // 'url': 'https://arxiv.org/pdf/1502.04623.pdf', - // 'type': 'article' - // }, - // } - - // Citation keys should be listed in the order that they are appear in the document. - // Each key refers to a key in the bibliography dictionary. - this.citations = []; // [ 'gregor2015draw', 'mercier2011humans' ] - this.citationsCollected = false; - - // - // Assigned from posts.csv - // - - // publishedDate: 2016-09-08T07:00:00.000Z, - // tags: [ 'rnn' ], - // distillPath: '2016/augmented-rnns', - // githubPath: 'distillpub/post--augmented-rnns', - // doiSuffix: 1, - - // - // Assigned from journal - // - this.journal = {}; - // journal: { - // 'title': 'Distill', - // 'full_title': 'Distill', - // 'abbrev_title': 'Distill', - // 'url': 'http://distill.pub', - // 'doi': '10.23915/distill', - // 'publisherName': 'Distill Working Group', - // 'publisherEmail': 'admin@distill.pub', - // 'issn': '2476-0757', - // 'editors': [...], - // 'committee': [...] - // } - // volume: 1, - // issue: 9, - - this.katex = {}; - - // - // Assigned from publishing process - // - - // githubCompareUpdatesUrl: 'https://github.com/distillpub/post--augmented-rnns/compare/1596e094d8943d2dc0ea445d92071129c6419c59...3bd9209e0c24d020f87cf6152dcecc6017cbc193', - // updatedDate: 2017-03-21T07:13:16.000Z, - // doi: '10.23915/distill.00001', - this.doi = undefined; - this.publishedDate = undefined; - } - - // Example: - // title: Demo Title Attention and Augmented Recurrent Neural Networks - // published: Jan 10, 2017 - // authors: - // - Chris Olah: - // - Shan Carter: http://shancarter.com - // affiliations: - // - Google Brain: - // - Google Brain: http://g.co/brain - - // - // Computed Properties - // - - // 'http://distill.pub/2016/augmented-rnns', - set url(value) { - this._url = value; - } - get url() { - if (this._url) { - return this._url; - } else if (this.distillPath && this.journal.url) { - return this.journal.url + '/' + this.distillPath; - } else if (this.journal.url) { - return this.journal.url; - } - } - - // 'https://github.com/distillpub/post--augmented-rnns', - get githubUrl() { - if (this.githubPath) { - return 'https://github.com/' + this.githubPath; - } else { - return undefined; - } - } - - // TODO resolve differences in naming of URL/Url/url. - // 'http://distill.pub/2016/augmented-rnns/thumbnail.jpg', - set previewURL(value) { - this._previewURL = value; - } - get previewURL() { - return this._previewURL ? 
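
The url getter above resolves the canonical article URL with a three-step fallback. Restated as a plain function (field names taken from the deleted class):

    // Precedence used by FrontMatter.url:
    // explicit _url, then journal.url + "/" + distillPath, then journal.url.
    function resolveUrl(fm) {
      if (fm._url) return fm._url;
      if (fm.distillPath && fm.journal.url) return fm.journal.url + "/" + fm.distillPath;
      if (fm.journal.url) return fm.journal.url;
      return undefined;
    }
    console.log(resolveUrl({ journal: { url: "http://distill.pub" }, distillPath: "2016/augmented-rnns" }));
    // -> "http://distill.pub/2016/augmented-rnns"
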
this._previewURL : this.url + '/thumbnail.jpg'; - } - - // 'Thu, 08 Sep 2016 00:00:00 -0700', - get publishedDateRFC() { - return RFC(this.publishedDate); - } - - // 'Thu, 08 Sep 2016 00:00:00 -0700', - get updatedDateRFC() { - return RFC(this.updatedDate); - } - - // 2016, - get publishedYear() { - return this.publishedDate.getFullYear(); - } - - // 'Sept', - get publishedMonth() { - return months[this.publishedDate.getMonth()]; - } - - // 8, - get publishedDay() { - return this.publishedDate.getDate(); - } - - // '09', - get publishedMonthPadded() { - return zeroPad(this.publishedDate.getMonth() + 1); - } - - // '08', - get publishedDayPadded() { - return zeroPad(this.publishedDate.getDate()); - } - - get publishedISODateOnly() { - return this.publishedDate.toISOString().split('T')[0]; - } - - get volume() { - const volume = this.publishedYear - 2015; - if (volume < 1) { - throw new Error('Invalid publish date detected during computing volume'); - } - return volume; - } - - get issue() { - return this.publishedDate.getMonth() + 1; - } - - // 'Olah & Carter', - get concatenatedAuthors() { - if (this.authors.length > 2) { - return this.authors[0].lastName + ', et al.'; - } else if (this.authors.length === 2) { - return this.authors[0].lastName + ' & ' + this.authors[1].lastName; - } else if (this.authors.length === 1) { - return this.authors[0].lastName; - } - } - - // 'Olah, Chris and Carter, Shan', - get bibtexAuthors() { - return this.authors.map(author => { - return author.lastName + ', ' + author.firstName; - }).join(' and '); - } - - // 'olah2016attention' - get slug() { - let slug = ''; - if (this.authors.length) { - slug += this.authors[0].lastName.toLowerCase(); - slug += this.publishedYear; - slug += this.title.split(' ')[0].toLowerCase(); - } - return slug || 'Untitled'; - } - - get bibliographyEntries() { - return new Map(this.citations.map( citationKey => { - const entry = this.bibliography.get(citationKey); - return [citationKey, entry]; - })); - } - - set bibliography(bibliography) { - if (bibliography instanceof Map) { - this._bibliography = bibliography; - } else if (typeof bibliography === 'object') { - this._bibliography = mapFromObject(bibliography); - } - } - - get bibliography() { - return this._bibliography; - } - - static fromObject(source) { - const frontMatter = new FrontMatter(); - Object.assign(frontMatter, source); - return frontMatter; - } - - assignToObject(target) { - Object.assign(target, this); - target.bibliography = objectFromMap(this.bibliographyEntries); - target.url = this.url; - target.doi = this.doi; - target.githubUrl = this.githubUrl; - target.previewURL = this.previewURL; - if (this.publishedDate) { - target.volume = this.volume; - target.issue = this.issue; - target.publishedDateRFC = this.publishedDateRFC; - target.publishedYear = this.publishedYear; - target.publishedMonth = this.publishedMonth; - target.publishedDay = this.publishedDay; - target.publishedMonthPadded = this.publishedMonthPadded; - target.publishedDayPadded = this.publishedDayPadded; - } - if (this.updatedDate) { - target.updatedDateRFC = this.updatedDateRFC; - } - target.concatenatedAuthors = this.concatenatedAuthors; - target.bibtexAuthors = this.bibtexAuthors; - target.slug = this.slug; - } - - } - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. 
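
The computed properties above encode Distill's numbering scheme: volume is years since 2015, issue is the one-based publication month, and slug is last name + year + first word of the title. A worked example matching the comments in the deleted code:

    // For a post published 8 Sept. 2016 by Olah & Carter,
    // titled "Attention and Augmented Recurrent Neural Networks":
    const published = new Date(2016, 8, 8); // months are zero-based: 8 -> September
    const volume = published.getFullYear() - 2015; // 1
    const issue = published.getMonth() + 1; // 9
    const slug = "olah" + published.getFullYear() + "attention"; // "olah2016attention"
    console.log(volume, issue, slug);

With more than two authors, concatenatedAuthors instead collapses to "Olah, et al.".
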
- // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - // See the License for the specific language governing permissions and - // limitations under the License. - - const Mutating = (superclass) => { - return class extends superclass { - - constructor() { - super(); - - // set up mutation observer - const options = {childList: true, characterData: true, subtree: true}; - const observer = new MutationObserver( () => { - observer.disconnect(); - this.renderIfPossible(); - observer.observe(this, options); - }); - - // ...and listen for changes - observer.observe(this, options); - } - - connectedCallback() { - super.connectedCallback(); - - this.renderIfPossible(); - } - - // potential TODO: check if this is enough for all our usecases - // maybe provide a custom function to tell if we have enough information to render - renderIfPossible() { - if (this.textContent && this.root) { - this.renderContent(); - } - } - - renderContent() { - console.error(`Your class ${this.constructor.name} must provide a custom renderContent() method!` ); - } - - }; // end class - }; // end mixin function - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. - // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - // See the License for the specific language governing permissions and - // limitations under the License. - - /*global ShadyCSS*/ - - const Template = (name, templateString, useShadow = true) => { - - return (superclass) => { - - const template = document.createElement('template'); - template.innerHTML = templateString; - - if (useShadow && 'ShadyCSS' in window) { - ShadyCSS.prepareTemplate(template, name); - } - - return class extends superclass { - - static get is() { return name; } - - constructor() { - super(); - - this.clone = document.importNode(template.content, true); - if (useShadow) { - this.attachShadow({mode: 'open'}); - this.shadowRoot.appendChild(this.clone); - } - } - - connectedCallback() { - if (this.hasAttribute('distill-prerendered')) { - return; - } - if (useShadow) { - if ('ShadyCSS' in window) { - ShadyCSS.styleElement(this); - } - } else { - this.insertBefore(this.clone, this.firstChild); - } - } - - get root() { - if (useShadow) { - return this.shadowRoot; - } else { - return this; - } - } - - /* TODO: Are we using these? Should we even? 
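
The Mutating mixin above renders in response to DOM mutations without looping on its own writes: it disconnects the observer, renders, then re-observes. The pattern in isolation:

    // Sketch of the disconnect/render/re-observe cycle used by Mutating.
    const target = document.body; // any subtree worth watching
    const render = () => { /* rebuild presentation from target's content */ };
    const options = { childList: true, characterData: true, subtree: true };
    const observer = new MutationObserver(() => {
      observer.disconnect(); // our own DOM writes must not retrigger us
      render();
      observer.observe(target, options); // resume watching
    });
    observer.observe(target, options);
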
*/ - $(query) { - return this.root.querySelector(query); - } - - $$(query) { - return this.root.querySelectorAll(query); - } - }; - }; - }; - - var math = "/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */"; - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. - // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - // See the License for the specific language governing permissions and - // limitations under the License. - - // This is a straight concatenation of code from KaTeX's contrib folder, - // but we aren't using some of their helpers that don't work well outside a browser environment. - - /*global katex */ - - const findEndOfMath = function(delimiter, text, startIndex) { - // Adapted from - // https://github.com/Khan/perseus/blob/master/src/perseus-markdown.jsx - let index = startIndex; - let braceLevel = 0; - - const delimLength = delimiter.length; - - while (index < text.length) { - const character = text[index]; - - if ( - braceLevel <= 0 && - text.slice(index, index + delimLength) === delimiter - ) { - return index; - } else if (character === "\\") { - index++; - } else if (character === "{") { - braceLevel++; - } else if (character === "}") { - braceLevel--; - } - - index++; - } - - return -1; - }; - - const splitAtDelimiters = function(startData, leftDelim, rightDelim, display) { - const finalData = []; - - for (let i = 0; i < startData.length; i++) { - if (startData[i].type === "text") { - const text = startData[i].data; - - let lookingForLeft = true; - let currIndex = 0; - let nextIndex; - - nextIndex = text.indexOf(leftDelim); - if (nextIndex !== -1) { - currIndex = nextIndex; - finalData.push({ - type: "text", - data: text.slice(0, currIndex) - }); - lookingForLeft = false; - } - - while (true) { - // eslint-disable-line no-constant-condition - if (lookingForLeft) { - nextIndex = text.indexOf(leftDelim, currIndex); - if (nextIndex === -1) { - break; - } - - finalData.push({ - type: "text", - data: text.slice(currIndex, nextIndex) - }); - - currIndex = nextIndex; - } else { - nextIndex = findEndOfMath( - rightDelim, - text, - currIndex + leftDelim.length - ); - if (nextIndex === -1) { - break; - } - - finalData.push({ - type: "math", - data: text.slice(currIndex + leftDelim.length, nextIndex), - rawData: text.slice(currIndex, nextIndex + rightDelim.length), - display: display - }); - - currIndex = nextIndex + rightDelim.length; - } - - lookingForLeft = !lookingForLeft; - } - - finalData.push({ - type: "text", - data: text.slice(currIndex) - }); - } else { - finalData.push(startData[i]); - } - } - - 
return finalData; - }; - - const splitWithDelimiters = function(text, delimiters) { - let data = [{ type: "text", data: text }]; - for (let i = 0; i < delimiters.length; i++) { - const delimiter = delimiters[i]; - data = splitAtDelimiters( - data, - delimiter.left, - delimiter.right, - delimiter.display || false - ); - } - return data; - }; - - /* Note: optionsCopy is mutated by this method. If it is ever exposed in the - * API, we should copy it before mutating. - */ - const renderMathInText = function(text, optionsCopy) { - const data = splitWithDelimiters(text, optionsCopy.delimiters); - const fragment = document.createDocumentFragment(); - - for (let i = 0; i < data.length; i++) { - if (data[i].type === "text") { - fragment.appendChild(document.createTextNode(data[i].data)); - } else { - const tag = document.createElement("d-math"); - const math = data[i].data; - // Override any display mode defined in the settings with that - // defined by the text itself - optionsCopy.displayMode = data[i].display; - try { - tag.textContent = math; - if (optionsCopy.displayMode) { - tag.setAttribute("block", ""); - } - } catch (e) { - if (!(e instanceof katex.ParseError)) { - throw e; - } - optionsCopy.errorCallback( - "KaTeX auto-render: Failed to parse `" + data[i].data + "` with ", - e - ); - fragment.appendChild(document.createTextNode(data[i].rawData)); - continue; - } - fragment.appendChild(tag); - } - } - - return fragment; - }; - - const renderElem = function(elem, optionsCopy) { - for (let i = 0; i < elem.childNodes.length; i++) { - const childNode = elem.childNodes[i]; - if (childNode.nodeType === 3) { - // Text node - const text = childNode.textContent; - if (optionsCopy.mightHaveMath(text)) { - const frag = renderMathInText(text, optionsCopy); - i += frag.childNodes.length - 1; - elem.replaceChild(frag, childNode); - } - } else if (childNode.nodeType === 1) { - // Element node - const shouldRender = - optionsCopy.ignoredTags.indexOf(childNode.nodeName.toLowerCase()) === - -1; - - if (shouldRender) { - renderElem(childNode, optionsCopy); - } - } - // Otherwise, it's something else, and ignore it. 
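
findEndOfMath above is the reason "$$...$$" spans can safely contain braces: a closing delimiter only counts at brace depth zero, and backslash-escaped characters are skipped. A compact re-implementation with a check:

    // Closing "$$" is only accepted once every "{" has been matched by "}".
    function findEnd(delim, text, start) {
      let depth = 0;
      for (let i = start; i < text.length; i++) {
        if (depth <= 0 && text.startsWith(delim, i)) return i;
        if (text[i] === "\\") i++; // skip the escaped character
        else if (text[i] === "{") depth++;
        else if (text[i] === "}") depth--;
      }
      return -1;
    }
    console.log(findEnd("$$", "e^{i\\pi} + 1 = 0$$ done", 0)); // 16: the first "$$" at depth 0

splitAtDelimiters then emits alternating { type: "text" } and { type: "math", data, rawData, display } segments, which renderMathInText maps onto d-math elements.
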
- } - }; - - const defaultAutoRenderOptions = { - delimiters: [ - { left: "$$", right: "$$", display: true }, - { left: "\\[", right: "\\]", display: true }, - { left: "\\(", right: "\\)", display: false } - // LaTeX uses this, but it ruins the display of normal `$` in text: - // {left: '$', right: '$', display: false}, - ], - - ignoredTags: [ - "script", - "noscript", - "style", - "textarea", - "pre", - "code", - "svg" - ], - - errorCallback: function(msg, err) { - console.error(msg, err); - } - }; - - const renderMathInElement = function(elem, options) { - if (!elem) { - throw new Error("No element provided to render"); - } - - const optionsCopy = Object.assign({}, defaultAutoRenderOptions, options); - const delimiterStrings = optionsCopy.delimiters.flatMap(d => [ - d.left, - d.right - ]); - const mightHaveMath = text => - delimiterStrings.some(d => text.indexOf(d) !== -1); - optionsCopy.mightHaveMath = mightHaveMath; - renderElem(elem, optionsCopy); - }; - - // Copyright 2018 The Distill Template Authors - - const katexJSURL = 'https://distill.pub/third-party/katex/katex.min.js'; - const katexCSSTag = ''; - - const T = Template('d-math', ` -${katexCSSTag} - - -`); - - // DMath, not Math, because that would conflict with the JS built-in - class DMath extends Mutating(T(HTMLElement)) { - - static set katexOptions(options) { - DMath._katexOptions = options; - if (DMath.katexOptions.delimiters) { - if (!DMath.katexAdded) { - DMath.addKatex(); - } else { - DMath.katexLoadedCallback(); - } - } - } - - static get katexOptions() { - if (!DMath._katexOptions) { - DMath._katexOptions = { - delimiters: [ { 'left':'$$', 'right':'$$', 'display': false } ] - }; - } - return DMath._katexOptions; - } - - static katexLoadedCallback() { - // render all d-math tags - const mathTags = document.querySelectorAll('d-math'); - for (const mathTag of mathTags) { - mathTag.renderContent(); - } - // transform inline delimited math to d-math tags - if (DMath.katexOptions.delimiters) { - renderMathInElement(document.body, DMath.katexOptions); - } - } - - static addKatex() { - // css tag can use this convenience function - document.head.insertAdjacentHTML('beforeend', katexCSSTag); - // script tag has to be created to work properly - const scriptTag = document.createElement('script'); - scriptTag.src = katexJSURL; - scriptTag.async = true; - scriptTag.onload = DMath.katexLoadedCallback; - scriptTag.crossorigin = 'anonymous'; - document.head.appendChild(scriptTag); - - DMath.katexAdded = true; - } - - get options() { - const localOptions = { displayMode: this.hasAttribute('block') }; - return Object.assign(localOptions, DMath.katexOptions); - } - - connectedCallback() { - super.connectedCallback(); - if (!DMath.katexAdded) { - DMath.addKatex(); - } - } - - renderContent() { - if (typeof katex !== 'undefined') { - const container = this.root.querySelector('#katex-container'); - katex.render(this.textContent, container, this.options); - } - } - - } - - DMath.katexAdded = false; - DMath.inlineMathRendered = false; - window.DMath = DMath; // TODO: check if this can be removed, or if we should expose a distill global - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. 
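
DMath above defers all math rendering until KaTeX arrives: addKatex() injects the script tag once, and the onload callback re-renders every d-math element. The load-once pattern, with the URL taken from the deleted source:

    // Inject an external script a single time and run a callback when ready.
    let katexAdded = false;
    function addKatexOnce(onLoaded) {
      if (katexAdded) return;
      katexAdded = true;
      const s = document.createElement("script");
      s.src = "https://distill.pub/third-party/katex/katex.min.js";
      s.async = true;
      s.onload = onLoaded; // e.g. render all pending d-math tags
      s.crossOrigin = "anonymous";
      document.head.appendChild(s);
    }

Note that the deleted code assigns scriptTag.crossorigin (lowercase), which sets a plain expando property rather than the crossOrigin attribute; the sketch uses the correct property name.
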
- // You may obtain a copy of the License at
-
- // http://www.apache.org/licenses/LICENSE-2.0
-
- // Unless required by applicable law or agreed to in writing, software
- // distributed under the License is distributed on an "AS IS" BASIS,
- // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- // See the License for the specific language governing permissions and
- // limitations under the License.
-
- function collect_citations(dom = document) {
- const citations = new Set();
- const citeTags = dom.querySelectorAll("d-cite");
- for (const tag of citeTags) {
- const keyString = tag.getAttribute("key") || tag.getAttribute("bibtex-key");
- const keys = keyString.split(",").map(k => k.trim());
- for (const key of keys) {
- citations.add(key);
- }
- }
- return [...citations];
- }
-
- function author_string(ent, template, sep, finalSep) {
- if (ent.author == null) {
- return "";
- }
- var names = ent.author.split(" and ");
- let name_strings = names.map(name => {
- name = name.trim();
- if (name.indexOf(",") != -1) {
- var last = name.split(",")[0].trim();
- var firsts = name.split(",")[1];
- } else if (name.indexOf(" ") != -1) {
- var last = name
- .split(" ")
- .slice(-1)[0]
- .trim();
- var firsts = name
- .split(" ")
- .slice(0, -1)
- .join(" ");
- } else {
- var last = name.trim();
- }
- var initials = "";
- if (firsts != undefined) {
- initials = firsts
- .trim()
- .split(" ")
- .map(s => s.trim()[0]);
- initials = initials.join(".") + ".";
- }
- return template
- .replace("${F}", firsts)
- .replace("${L}", last)
- .replace("${I}", initials)
- .trim(); // in case one of first or last was empty
- });
- if (names.length > 1) {
- var str = name_strings.slice(0, names.length - 1).join(sep);
- str += (finalSep || sep) + name_strings[names.length - 1];
- return str;
- } else {
- return name_strings[0];
- }
- }
-
- function venue_string(ent) {
- var cite = ent.journal || ent.booktitle || "";
- if ("volume" in ent) {
- var issue = ent.issue || ent.number;
- issue = issue != undefined ? "(" + issue + ")" : "";
- cite += ", Vol " + ent.volume + issue;
- }
- if ("pages" in ent) {
- cite += ", pp. " + ent.pages;
- }
- if (cite != "") cite += ". ";
- if ("publisher" in ent) {
- cite += ent.publisher;
- if (cite[cite.length - 1] != ".") cite += ".";
- }
- return cite;
- }
-
- function link_string(ent) {
- if ("url" in ent) {
- var url = ent.url;
- var arxiv_match = /arxiv\.org\/abs\/([0-9\.]*)/.exec(url);
- if (arxiv_match != null) {
- url = `http://arxiv.org/pdf/${arxiv_match[1]}.pdf`;
- }
-
- if (url.slice(-4) == ".pdf") {
- var label = "PDF";
- } else if (url.slice(-5) == ".html") {
- var label = "HTML";
- }
- return `  <a href="${url}">[${label || "link"}]</a>`;
- } /* else if ("doi" in ent){
- return `  <a href="https://doi.org/${ent.doi}">[DOI]</a>`;
- }*/ else {
- return "";
- }
- }
- function doi_string(ent, new_line) {
- if ("doi" in ent) {
- return `${new_line ? "<br>" : ""} DOI: ${ent.doi}`;
- } else {
- return "";
- }
- }
-
- function title_string(ent) {
- return '<span class="title">' + ent.title + "</span> ";
- }
-
- function bibliography_cite(ent, fancy) {
- if (ent) {
- var cite = title_string(ent);
- cite += link_string(ent) + "<br>";
- if (ent.author) {
- cite += author_string(ent, "${L}, ${I}", ", ", " and ");
- if (ent.year || ent.date) {
- cite += ", ";
- }
- }
- if (ent.year || ent.date) {
- cite += (ent.year || ent.date) + ". ";
- } else {
- cite += ". ";
- }
- cite += venue_string(ent);
- cite += doi_string(ent);
- return cite;
- /*var cite = author_string(ent, "${L}, ${I}", ", ", " and ");
- if (ent.year || ent.date){
- cite += ", " + (ent.year || ent.date) + ". "
- } else {
- cite += ". "
- }
- cite += "<b>" + ent.title + "</b>. ";
- cite += venue_string(ent);
- cite += doi_string(ent);
- cite += link_string(ent);
- return cite*/
- } else {
- return "?";
- }
- }
-
- function hover_cite(ent) {
- if (ent) {
- var cite = "";
- cite += "<strong>" + ent.title + "</strong>";
- cite += link_string(ent);
- cite += "<br>";
-
- var a_str = author_string(ent, "${I} ${L}", ", ") + ".";
- var v_str =
- venue_string(ent).trim() + " " + ent.year + ". " + doi_string(ent, true);
-
- if ((a_str + v_str).length < Math.min(40, ent.title.length)) {
- cite += a_str + " " + v_str;
- } else {
- cite += a_str + "<br>
" + v_str; - } - return cite; - } else { - return "?"; - } - } - - function domContentLoaded() { - return ['interactive', 'complete'].indexOf(document.readyState) !== -1; - } - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. - // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - // See the License for the specific language governing permissions and - // limitations under the License. - - function _moveLegacyAffiliationFormatIntoArray(frontMatter) { - // authors used to have propoerties "affiliation" and "affiliationURL". - // We now encourage using an array for affiliations containing objects with - // properties "name" and "url". - for (let author of frontMatter.authors) { - const hasOldStyle = Boolean(author.affiliation); - const hasNewStyle = Boolean(author.affiliations); - if (!hasOldStyle) continue; - if (hasNewStyle) { - console.warn(`Author ${author.author} has both old-style ("affiliation" & "affiliationURL") and new style ("affiliations") affiliation information!`); - } else { - let newAffiliation = { - "name": author.affiliation - }; - if (author.affiliationURL) newAffiliation.url = author.affiliationURL; - author.affiliations = [newAffiliation]; - } - } - return frontMatter - } - - function parseFrontmatter(element) { - const scriptTag = element.firstElementChild; - if (scriptTag) { - const type = scriptTag.getAttribute('type'); - if (type.split('/')[1] == 'json') { - const content = scriptTag.textContent; - const parsed = JSON.parse(content); - return _moveLegacyAffiliationFormatIntoArray(parsed); - } else { - console.error('Distill only supports JSON frontmatter tags anymore; no more YAML.'); - } - } else { - console.error('You added a frontmatter tag but did not provide a script tag with front matter data in it. Please take a look at our templates.'); - } - return {}; - } - - class FrontMatter$1 extends HTMLElement { - - static get is() { return 'd-front-matter'; } - - constructor() { - super(); - - const options = {childList: true, characterData: true, subtree: true}; - const observer = new MutationObserver( (entries) => { - for (const entry of entries) { - if (entry.target.nodeName === 'SCRIPT' || entry.type === 'characterData') { - const data = parseFrontmatter(this); - this.notify(data); - } - } - }); - observer.observe(this, options); - } - - notify(data) { - const options = { detail: data, bubbles: true }; - const event = new CustomEvent('onFrontMatterChanged', options); - document.dispatchEvent(event); - } - - } - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. - // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - // See the License for the specific language governing permissions and - // limitations under the License. 
- - // no appendix -> add appendix - // title in front, no h1 -> add it - // no title in front, h1 -> read and put into frontMatter - // footnote -> footnote list - // break up bib - // if citation, no bib-list -> add citation-list - - // if authors, no byline -> add byline - - function optionalComponents(dom, data) { - const body = dom.body; - const article = body.querySelector('d-article'); - - // If we don't have an article tag, something weird is going on—giving up. - if (!article) { - console.warn('No d-article tag found; skipping adding optional components!'); - return; - } - - let byline = dom.querySelector('d-byline'); - if (!byline) { - if (data.authors) { - byline = dom.createElement('d-byline'); - body.insertBefore(byline, article); - } else { - console.warn('No authors found in front matter; please add them before submission!'); - } - } - - let title = dom.querySelector('d-title'); - if (!title) { - title = dom.createElement('d-title'); - body.insertBefore(title, byline); - } - - let h1 = title.querySelector('h1'); - if (!h1) { - h1 = dom.createElement('h1'); - h1.textContent = data.title; - title.insertBefore(h1, title.firstChild); - } - - const hasPassword = typeof data.password !== 'undefined'; - let interstitial = body.querySelector('d-interstitial'); - if (hasPassword && !interstitial) { - const inBrowser = typeof window !== 'undefined'; - const onLocalhost = inBrowser && window.location.hostname.includes('localhost'); - if (!inBrowser || !onLocalhost) { - interstitial = dom.createElement('d-interstitial'); - interstitial.password = data.password; - body.insertBefore(interstitial, body.firstChild); - } - } else if (!hasPassword && interstitial) { - interstitial.parentElement.removeChild(this); - } - - let appendix = dom.querySelector('d-appendix'); - if (!appendix) { - appendix = dom.createElement('d-appendix'); - dom.body.appendChild(appendix); - } - - let footnoteList = dom.querySelector('d-footnote-list'); - if (!footnoteList) { - footnoteList = dom.createElement('d-footnote-list'); - appendix.appendChild(footnoteList); - } - - let citationList = dom.querySelector('d-citation-list'); - if (!citationList) { - citationList = dom.createElement('d-citation-list'); - appendix.appendChild(citationList); - } - - } - - // Copyright 2018 The Distill Template Authors - - const frontMatter = new FrontMatter(); - - const Controller = { - frontMatter: frontMatter, - waitingOn: { - bibliography: [], - citations: [] - }, - listeners: { - onCiteKeyCreated(event) { - const [citeTag, keys] = event.detail; - - // ensure we have citations - if (!frontMatter.citationsCollected) { - // console.debug('onCiteKeyCreated, but unresolved dependency ("citations"). Enqueing.'); - Controller.waitingOn.citations.push(() => - Controller.listeners.onCiteKeyCreated(event) - ); - return; - } - - // ensure we have a loaded bibliography - if (!frontMatter.bibliographyParsed) { - // console.debug('onCiteKeyCreated, but unresolved dependency ("bibliography"). 
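
optionalComponents above is idempotent scaffolding: each lookup is "reuse the existing element, otherwise create and attach it", so re-running it on a prerendered document adds nothing twice. The recurring pattern, extracted:

    // Sketch of the ensure-element pattern used throughout optionalComponents.
    function ensure(dom, parent, tagName) {
      let el = dom.querySelector(tagName);
      if (!el) {
        el = dom.createElement(tagName);
        parent.appendChild(el);
      }
      return el;
    }
    const appendix = ensure(document, document.body, "d-appendix");
    ensure(document, appendix, "d-footnote-list");
    ensure(document, appendix, "d-citation-list");

One genuine bug in the deleted branch that removes an interstitial: interstitial.parentElement.removeChild(this) passes this instead of interstitial, which would throw if that path were ever taken.
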
Enqueing.'); - Controller.waitingOn.bibliography.push(() => - Controller.listeners.onCiteKeyCreated(event) - ); - return; - } - - const numbers = keys.map(key => frontMatter.citations.indexOf(key)); - citeTag.numbers = numbers; - const entries = keys.map(key => frontMatter.bibliography.get(key)); - citeTag.entries = entries; - }, - - onCiteKeyChanged() { - // const [citeTag, keys] = event.detail; - - // update citations - frontMatter.citations = collect_citations(); - frontMatter.citationsCollected = true; - for (const waitingCallback of Controller.waitingOn.citations.slice()) { - waitingCallback(); - } - - // update bibliography - const citationListTag = document.querySelector("d-citation-list"); - const bibliographyEntries = new Map( - frontMatter.citations.map(citationKey => { - return [citationKey, frontMatter.bibliography.get(citationKey)]; - }) - ); - citationListTag.citations = bibliographyEntries; - - const citeTags = document.querySelectorAll("d-cite"); - for (const citeTag of citeTags) { - console.log(citeTag); - const keys = citeTag.keys; - const numbers = keys.map(key => frontMatter.citations.indexOf(key)); - citeTag.numbers = numbers; - const entries = keys.map(key => frontMatter.bibliography.get(key)); - citeTag.entries = entries; - } - }, - - onCiteKeyRemoved(event) { - Controller.listeners.onCiteKeyChanged(event); - }, - - onBibliographyChanged(event) { - const citationListTag = document.querySelector("d-citation-list"); - - const bibliography = event.detail; - - frontMatter.bibliography = bibliography; - frontMatter.bibliographyParsed = true; - for (const waitingCallback of Controller.waitingOn.bibliography.slice()) { - waitingCallback(); - } - - // ensure we have citations - if (!frontMatter.citationsCollected) { - Controller.waitingOn.citations.push(function() { - Controller.listeners.onBibliographyChanged({ - target: event.target, - detail: event.detail - }); - }); - return; - } - - if (citationListTag.hasAttribute("distill-prerendered")) { - console.debug("Citation list was prerendered; not updating it."); - } else { - const entries = new Map( - frontMatter.citations.map(citationKey => { - return [citationKey, frontMatter.bibliography.get(citationKey)]; - }) - ); - citationListTag.citations = entries; - } - }, - - onFootnoteChanged() { - // const footnote = event.detail; - //TODO: optimize to only update current footnote - const footnotesList = document.querySelector("d-footnote-list"); - if (footnotesList) { - const footnotes = document.querySelectorAll("d-footnote"); - footnotesList.footnotes = footnotes; - } - }, - - onFrontMatterChanged(event) { - const data = event.detail; - mergeFromYMLFrontmatter(frontMatter, data); - - const interstitial = document.querySelector("d-interstitial"); - if (interstitial) { - if (typeof frontMatter.password !== "undefined") { - interstitial.password = frontMatter.password; - } else { - interstitial.parentElement.removeChild(interstitial); - } - } - - const prerendered = document.body.hasAttribute("distill-prerendered"); - if (!prerendered && domContentLoaded()) { - optionalComponents(document, frontMatter); - - const appendix = document.querySelector("distill-appendix"); - if (appendix) { - appendix.frontMatter = frontMatter; - } - - const byline = document.querySelector("d-byline"); - if (byline) { - byline.frontMatter = frontMatter; - } - - if (data.katex) { - DMath.katexOptions = data.katex; - } - } - }, - - DOMContentLoaded() { - if (Controller.loaded) { - console.warn( - "Controller received DOMContentLoaded but was already 
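
The Controller above decouples event order with small callback queues: a handler that fires before its dependency is ready (parsed bibliography, collected citations) re-enqueues itself and returns, and whichever event resolves the dependency drains the queue. In isolation:

    // Run-now-or-enqueue, as in Controller.waitingOn.
    const waitingOnCitations = [];
    let citationsCollected = false;

    function onCiteKey(event) {
      if (!citationsCollected) {
        waitingOnCitations.push(() => onCiteKey(event)); // retry when ready
        return;
      }
      // ...safe to resolve citation numbers here...
    }

    function resolveCitations() {
      citationsCollected = true;
      for (const cb of waitingOnCitations.slice()) cb(); // drain a stable copy
    }

The .slice() mirrors the deleted code: it iterates a snapshot so callbacks that enqueue further work do not mutate the list mid-loop.
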
loaded!" - ); - return; - } else if (!domContentLoaded()) { - console.warn( - "Controller received DOMContentLoaded at document.readyState: " + - document.readyState + - "!" - ); - return; - } else { - Controller.loaded = true; - console.debug("Runlevel 4: Controller running DOMContentLoaded"); - } - - const frontMatterTag = document.querySelector("d-front-matter"); - if (frontMatterTag) { - const data = parseFrontmatter(frontMatterTag); - Controller.listeners.onFrontMatterChanged({ detail: data }); - } - - // Resolving "citations" dependency due to initial DOM load - frontMatter.citations = collect_citations(); - frontMatter.citationsCollected = true; - for (const waitingCallback of Controller.waitingOn.citations.slice()) { - waitingCallback(); - } - - if (frontMatter.bibliographyParsed) { - for (const waitingCallback of Controller.waitingOn.bibliography.slice()) { - waitingCallback(); - } - } - - const footnotesList = document.querySelector("d-footnote-list"); - if (footnotesList) { - const footnotes = document.querySelectorAll("d-footnote"); - footnotesList.footnotes = footnotes; - } - } - } // listeners - }; // Controller - - var base = "/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\nhtml {\n font-size: 14px;\n\tline-height: 1.6em;\n /* font-family: \"Libre Franklin\", \"Helvetica Neue\", sans-serif; */\n font-family: -apple-system, BlinkMacSystemFont, \"Segoe UI\", Roboto, Oxygen, Ubuntu, Cantarell, \"Fira Sans\", \"Droid Sans\", \"Helvetica Neue\", Arial, sans-serif;\n /*, \"Apple Color Emoji\", \"Segoe UI Emoji\", \"Segoe UI Symbol\";*/\n text-size-adjust: 100%;\n -ms-text-size-adjust: 100%;\n -webkit-text-size-adjust: 100%;\n}\n\n@media(min-width: 768px) {\n html {\n font-size: 16px;\n }\n}\n\nbody {\n margin: 0;\n}\n\na {\n color: #004276;\n}\n\nfigure {\n margin: 0;\n}\n\ntable {\n\tborder-collapse: collapse;\n\tborder-spacing: 0;\n}\n\ntable th {\n\ttext-align: left;\n}\n\ntable thead {\n border-bottom: 1px solid rgba(0, 0, 0, 0.05);\n}\n\ntable thead th {\n padding-bottom: 0.5em;\n}\n\ntable tbody :first-child td {\n padding-top: 0.5em;\n}\n\npre {\n overflow: auto;\n max-width: 100%;\n}\n\np {\n margin-top: 0;\n margin-bottom: 1em;\n}\n\nsup, sub {\n vertical-align: baseline;\n position: relative;\n top: -0.4em;\n line-height: 1em;\n}\n\nsub {\n top: 0.4em;\n}\n\n.kicker,\n.marker {\n font-size: 15px;\n font-weight: 600;\n color: rgba(0, 0, 0, 0.5);\n}\n\n\n/* Headline */\n\n@media(min-width: 1024px) {\n d-title h1 span {\n display: block;\n }\n}\n\n/* Figure */\n\nfigure {\n position: relative;\n margin-bottom: 2.5em;\n margin-top: 1.5em;\n}\n\nfigcaption+figure {\n\n}\n\nfigure img {\n width: 100%;\n}\n\nfigure svg text,\nfigure svg tspan {\n}\n\nfigcaption,\n.figcaption {\n color: rgba(0, 0, 0, 0.6);\n font-size: 12px;\n line-height: 1.5em;\n}\n\n@media(min-width: 1024px) {\nfigcaption,\n.figcaption {\n font-size: 13px;\n }\n}\n\nfigure.external img {\n background: white;\n border: 1px solid rgba(0, 0, 
0, 0.1);\n box-shadow: 0 1px 8px rgba(0, 0, 0, 0.1);\n padding: 18px;\n box-sizing: border-box;\n}\n\nfigcaption a {\n color: rgba(0, 0, 0, 0.6);\n}\n\nfigcaption b,\nfigcaption strong, {\n font-weight: 600;\n color: rgba(0, 0, 0, 1.0);\n}\n"; - - var layout = "/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n@supports not (display: grid) {\n .base-grid,\n distill-header,\n d-title,\n d-abstract,\n d-article,\n d-appendix,\n distill-appendix,\n d-byline,\n d-footnote-list,\n d-citation-list,\n distill-footer {\n display: block;\n padding: 8px;\n }\n}\n\n.base-grid,\ndistill-header,\nd-title,\nd-abstract,\nd-article,\nd-appendix,\ndistill-appendix,\nd-byline,\nd-footnote-list,\nd-citation-list,\ndistill-footer {\n display: grid;\n justify-items: stretch;\n grid-template-columns: [screen-start] 8px [page-start kicker-start text-start gutter-start middle-start] 1fr 1fr 1fr 1fr 1fr 1fr 1fr 1fr [text-end page-end gutter-end kicker-end middle-end] 8px [screen-end];\n grid-column-gap: 8px;\n}\n\n.grid {\n display: grid;\n grid-column-gap: 8px;\n}\n\n@media(min-width: 768px) {\n .base-grid,\n distill-header,\n d-title,\n d-abstract,\n d-article,\n d-appendix,\n distill-appendix,\n d-byline,\n d-footnote-list,\n d-citation-list,\n distill-footer {\n grid-template-columns: [screen-start] 1fr [page-start kicker-start middle-start text-start] 45px 45px 45px 45px 45px 45px 45px 45px [ kicker-end text-end gutter-start] 45px [middle-end] 45px [page-end gutter-end] 1fr [screen-end];\n grid-column-gap: 16px;\n }\n\n .grid {\n grid-column-gap: 16px;\n }\n}\n\n@media(min-width: 1000px) {\n .base-grid,\n distill-header,\n d-title,\n d-abstract,\n d-article,\n d-appendix,\n distill-appendix,\n d-byline,\n d-footnote-list,\n d-citation-list,\n distill-footer {\n grid-template-columns: [screen-start] 1fr [page-start kicker-start] 50px [middle-start] 50px [text-start kicker-end] 50px 50px 50px 50px 50px 50px 50px 50px [text-end gutter-start] 50px [middle-end] 50px [page-end gutter-end] 1fr [screen-end];\n grid-column-gap: 16px;\n }\n\n .grid {\n grid-column-gap: 16px;\n }\n}\n\n@media(min-width: 1180px) {\n .base-grid,\n distill-header,\n d-title,\n d-abstract,\n d-article,\n d-appendix,\n distill-appendix,\n d-byline,\n d-footnote-list,\n d-citation-list,\n distill-footer {\n grid-template-columns: [screen-start] 1fr [page-start kicker-start] 60px [middle-start] 60px [text-start kicker-end] 60px 60px 60px 60px 60px 60px 60px 60px [text-end gutter-start] 60px [middle-end] 60px [page-end gutter-end] 1fr [screen-end];\n grid-column-gap: 32px;\n }\n\n .grid {\n grid-column-gap: 32px;\n }\n}\n\n\n\n\n.base-grid {\n grid-column: screen;\n}\n\n/* .l-body,\nd-article > * {\n grid-column: text;\n}\n\n.l-page,\nd-title > *,\nd-figure {\n grid-column: page;\n} */\n\n.l-gutter {\n grid-column: gutter;\n}\n\n.l-text,\n.l-body {\n grid-column: text;\n}\n\n.l-page {\n grid-column: page;\n}\n\n.l-body-outset {\n grid-column: middle;\n}\n\n.l-page-outset {\n 
grid-column: page;\n}\n\n.l-screen {\n grid-column: screen;\n}\n\n.l-screen-inset {\n grid-column: screen;\n padding-left: 16px;\n padding-left: 16px;\n}\n\n\n/* Aside */\n\nd-article aside {\n grid-column: gutter;\n font-size: 12px;\n line-height: 1.6em;\n color: rgba(0, 0, 0, 0.6)\n}\n\n@media(min-width: 768px) {\n aside {\n grid-column: gutter;\n }\n\n .side {\n grid-column: gutter;\n }\n}\n"; - - var print = "/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n@media print {\n\n @page {\n size: 8in 11in;\n @bottom-right {\n content: counter(page) \" of \" counter(pages);\n }\n }\n\n html {\n /* no general margins -- CSS Grid takes care of those */\n }\n\n p, code {\n page-break-inside: avoid;\n }\n\n h2, h3 {\n page-break-after: avoid;\n }\n\n d-header {\n visibility: hidden;\n }\n\n d-footer {\n display: none!important;\n }\n\n}\n"; - - var byline = "/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\nd-byline {\n contain: style;\n overflow: hidden;\n border-top: 1px solid rgba(0, 0, 0, 0.1);\n font-size: 0.8rem;\n line-height: 1.8em;\n padding: 1.5rem 0;\n min-height: 1.8em;\n}\n\n\nd-byline .byline {\n grid-template-columns: 1fr 1fr;\n grid-column: text;\n}\n\n@media(min-width: 768px) {\n d-byline .byline {\n grid-template-columns: 1fr 1fr 1fr 1fr;\n }\n}\n\nd-byline .authors-affiliations {\n grid-column-end: span 2;\n grid-template-columns: 1fr 1fr;\n margin-bottom: 1em;\n}\n\n@media(min-width: 768px) {\n d-byline .authors-affiliations {\n margin-bottom: 0;\n }\n}\n\nd-byline h3 {\n font-size: 0.6rem;\n font-weight: 400;\n color: rgba(0, 0, 0, 0.5);\n margin: 0;\n text-transform: uppercase;\n}\n\nd-byline p {\n margin: 0;\n}\n\nd-byline a,\nd-article d-byline a {\n color: rgba(0, 0, 0, 0.8);\n text-decoration: none;\n border-bottom: none;\n}\n\nd-article d-byline a:hover {\n text-decoration: underline;\n border-bottom: none;\n}\n\nd-byline p.author {\n font-weight: 500;\n}\n\nd-byline .affiliations {\n\n}\n"; - - var article = "/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" 
BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\nd-article {\n contain: layout style;\n overflow-x: hidden;\n border-top: 1px solid rgba(0, 0, 0, 0.1);\n padding-top: 2rem;\n color: rgba(0, 0, 0, 0.8);\n}\n\nd-article > * {\n grid-column: text;\n}\n\n@media(min-width: 768px) {\n d-article {\n font-size: 16px;\n }\n}\n\n@media(min-width: 1024px) {\n d-article {\n font-size: 1.06rem;\n line-height: 1.7em;\n }\n}\n\n\n/* H2 */\n\n\nd-article .marker {\n text-decoration: none;\n border: none;\n counter-reset: section;\n grid-column: kicker;\n line-height: 1.7em;\n}\n\nd-article .marker:hover {\n border: none;\n}\n\nd-article .marker span {\n padding: 0 3px 4px;\n border-bottom: 1px solid rgba(0, 0, 0, 0.2);\n position: relative;\n top: 4px;\n}\n\nd-article .marker:hover span {\n color: rgba(0, 0, 0, 0.7);\n border-bottom: 1px solid rgba(0, 0, 0, 0.7);\n}\n\nd-article h2 {\n font-weight: 600;\n font-size: 24px;\n line-height: 1.25em;\n margin: 2rem 0 1.5rem 0;\n border-bottom: 1px solid rgba(0, 0, 0, 0.1);\n padding-bottom: 1rem;\n}\n\n@media(min-width: 1024px) {\n d-article h2 {\n font-size: 36px;\n }\n}\n\n/* H3 */\n\nd-article h3 {\n font-weight: 700;\n font-size: 18px;\n line-height: 1.4em;\n margin-bottom: 1em;\n margin-top: 2em;\n}\n\n@media(min-width: 1024px) {\n d-article h3 {\n font-size: 20px;\n }\n}\n\n/* H4 */\n\nd-article h4 {\n font-weight: 600;\n text-transform: uppercase;\n font-size: 14px;\n line-height: 1.4em;\n}\n\nd-article a {\n color: inherit;\n}\n\nd-article p,\nd-article ul,\nd-article ol,\nd-article blockquote {\n margin-top: 0;\n margin-bottom: 1em;\n margin-left: 0;\n margin-right: 0;\n}\n\nd-article blockquote {\n border-left: 2px solid rgba(0, 0, 0, 0.2);\n padding-left: 2em;\n font-style: italic;\n color: rgba(0, 0, 0, 0.6);\n}\n\nd-article a {\n border-bottom: 1px solid rgba(0, 0, 0, 0.4);\n text-decoration: none;\n}\n\nd-article a:hover {\n border-bottom: 1px solid rgba(0, 0, 0, 0.8);\n}\n\nd-article .link {\n text-decoration: underline;\n cursor: pointer;\n}\n\nd-article ul,\nd-article ol {\n padding-left: 24px;\n}\n\nd-article li {\n margin-bottom: 1em;\n margin-left: 0;\n padding-left: 0;\n}\n\nd-article li:last-child {\n margin-bottom: 0;\n}\n\nd-article pre {\n font-size: 14px;\n margin-bottom: 20px;\n}\n\nd-article hr {\n grid-column: screen;\n width: 100%;\n border: none;\n border-bottom: 1px solid rgba(0, 0, 0, 0.1);\n margin-top: 60px;\n margin-bottom: 60px;\n}\n\nd-article section {\n margin-top: 60px;\n margin-bottom: 60px;\n}\n\nd-article span.equation-mimic {\n font-family: georgia;\n font-size: 115%;\n font-style: italic;\n}\n\nd-article > d-code,\nd-article section > d-code {\n display: block;\n}\n\nd-article > d-math[block],\nd-article section > d-math[block] {\n display: block;\n}\n\n@media (max-width: 768px) {\n d-article > d-code,\n d-article section > d-code,\n d-article > d-math[block],\n d-article section > d-math[block] {\n overflow-x: scroll;\n -ms-overflow-style: none; // IE 10+\n overflow: -moz-scrollbars-none; // Firefox\n }\n\n d-article > d-code::-webkit-scrollbar,\n d-article section > d-code::-webkit-scrollbar,\n d-article > d-math[block]::-webkit-scrollbar,\n d-article section > d-math[block]::-webkit-scrollbar {\n display: none; // Safari and Chrome\n }\n}\n\nd-article .citation {\n color: #668;\n cursor: pointer;\n}\n\nd-include {\n width: auto;\n display: block;\n}\n\nd-figure {\n 
contain: layout style;\n}\n\n/* KaTeX */\n\n.katex, .katex-prerendered {\n contain: style;\n display: inline-block;\n}\n\n/* Tables */\n\nd-article table {\n border-collapse: collapse;\n margin-bottom: 1.5rem;\n border-bottom: 1px solid rgba(0, 0, 0, 0.2);\n}\n\nd-article table th {\n border-bottom: 1px solid rgba(0, 0, 0, 0.2);\n}\n\nd-article table td {\n border-bottom: 1px solid rgba(0, 0, 0, 0.05);\n}\n\nd-article table tr:last-of-type td {\n border-bottom: none;\n}\n\nd-article table th,\nd-article table td {\n font-size: 15px;\n padding: 2px 8px;\n}\n\nd-article table tbody :first-child td {\n padding-top: 2px;\n}\n"; - - var title = "/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\nd-title {\n padding: 2rem 0 1.5rem;\n contain: layout style;\n overflow-x: hidden;\n}\n\n@media(min-width: 768px) {\n d-title {\n padding: 4rem 0 1.5rem;\n }\n}\n\nd-title h1 {\n grid-column: text;\n font-size: 40px;\n font-weight: 700;\n line-height: 1.1em;\n margin: 0 0 0.5rem;\n}\n\n@media(min-width: 768px) {\n d-title h1 {\n font-size: 50px;\n }\n}\n\nd-title p {\n font-weight: 300;\n font-size: 1.2rem;\n line-height: 1.55em;\n grid-column: text;\n}\n\nd-title .status {\n margin-top: 0px;\n font-size: 12px;\n color: #009688;\n opacity: 0.8;\n grid-column: kicker;\n}\n\nd-title .status span {\n line-height: 1;\n display: inline-block;\n padding: 6px 0;\n border-bottom: 1px solid #80cbc4;\n font-size: 11px;\n text-transform: uppercase;\n}\n"; - - // Copyright 2018 The Distill Template Authors - - const styles = base + layout + title + byline + article + math + print; - - function makeStyleTag(dom) { - - const styleTagId = 'distill-prerendered-styles'; - const prerenderedTag = dom.getElementById(styleTagId); - if (!prerenderedTag) { - const styleTag = dom.createElement('style'); - styleTag.id = styleTagId; - styleTag.type = 'text/css'; - const cssTextTag = dom.createTextNode(styles); - styleTag.appendChild(cssTextTag); - const firstScriptTag = dom.head.querySelector('script'); - dom.head.insertBefore(styleTag, firstScriptTag); - } - - } - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. - // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - // See the License for the specific language governing permissions and - // limitations under the License. 
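
makeStyleTag above injects the concatenated template CSS exactly once, using the element id as the guard, and inserts it before the first script so the styles are in place before any component code runs. The same guard in a small helper:

    // Id-guarded, one-time <style> injection, as in makeStyleTag.
    function injectStylesOnce(dom, id, cssText) {
      if (dom.getElementById(id)) return; // e.g. already prerendered
      const style = dom.createElement("style");
      style.id = id;
      style.appendChild(dom.createTextNode(cssText));
      dom.head.insertBefore(style, dom.head.querySelector("script"));
    }
    injectStylesOnce(document, "distill-prerendered-styles", "d-article { color: rgba(0, 0, 0, 0.8); }");
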
- - function addPolyfill(polyfill, polyfillLoadedCallback) { - console.debug('Runlevel 0: Polyfill required: ' + polyfill.name); - const script = document.createElement('script'); - script.src = polyfill.url; - script.async = false; - if (polyfillLoadedCallback) { - script.onload = function() { polyfillLoadedCallback(polyfill); }; - } - script.onerror = function() { - new Error('Runlevel 0: Polyfills failed to load script ' + polyfill.name); - }; - document.head.appendChild(script); - } - - const polyfills = [ - { - name: 'WebComponents', - support: function() { - return 'customElements' in window && - 'attachShadow' in Element.prototype && - 'getRootNode' in Element.prototype && - 'content' in document.createElement('template') && - 'Promise' in window && - 'from' in Array; - }, - url: 'https://distill.pub/third-party/polyfills/webcomponents-lite.js' - }, { - name: 'IntersectionObserver', - support: function() { - return 'IntersectionObserver' in window && - 'IntersectionObserverEntry' in window; - }, - url: 'https://distill.pub/third-party/polyfills/intersection-observer.js' - }, - ]; - - class Polyfills { - - static browserSupportsAllFeatures() { - return polyfills.every((poly) => poly.support()); - } - - static load(callback) { - // Define an intermediate callback that checks if all is loaded. - const polyfillLoaded = function(polyfill) { - polyfill.loaded = true; - console.debug('Runlevel 0: Polyfill has finished loading: ' + polyfill.name); - // console.debug(window[polyfill.name]); - if (Polyfills.neededPolyfills.every((poly) => poly.loaded)) { - console.debug('Runlevel 0: All required polyfills have finished loading.'); - console.debug('Runlevel 0->1.'); - window.distillRunlevel = 1; - callback(); - } - }; - // Add polyfill script tags - for (const polyfill of Polyfills.neededPolyfills) { - addPolyfill(polyfill, polyfillLoaded); - } - } - - static get neededPolyfills() { - if (!Polyfills._neededPolyfills) { - Polyfills._neededPolyfills = polyfills.filter((poly) => !poly.support()); - } - return Polyfills._neededPolyfills; - } - } - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. - // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - // See the License for the specific language governing permissions and - // limitations under the License. 
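
The polyfill loader above pairs every polyfill with a support() predicate and only fetches the ones the browser actually lacks; once all of them report onload, it advances window.distillRunlevel and continues booting. The detection half on its own:

    // Feature checks mirroring the deleted polyfill table.
    const polyfillTable = [
      {
        name: "WebComponents",
        support: () =>
          "customElements" in window &&
          "attachShadow" in Element.prototype &&
          "content" in document.createElement("template"),
      },
      { name: "IntersectionObserver", support: () => "IntersectionObserver" in window },
    ];
    const needed = polyfillTable.filter((p) => !p.support());
    console.log(needed.map((p) => p.name)); // [] on any current browser
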
- - // const marginSmall = 16; - // const marginLarge = 3 * marginSmall; - // const margin = marginSmall + marginLarge; - // const gutter = marginSmall; - // const outsetAmount = margin / 2; - // const numCols = 4; - // const numGutters = numCols - 1; - // const columnWidth = (768 - 2 * marginLarge - numGutters * gutter) / numCols; - // - // const screenwidth = 768; - // const pageWidth = screenwidth - 2 * marginLarge; - // const bodyWidth = pageWidth - columnWidth - gutter; - - function body(selector) { - return `${selector} { - grid-column: left / text; - } - `; - } - - // Copyright 2018 The Distill Template Authors - - const T$1 = Template('d-abstract', ` - - - -`); - - class Abstract extends T$1(HTMLElement) { - - } - - // Copyright 2018 The Distill Template Authors - - const T$2 = Template('d-appendix', ` - - -`, false); - - class Appendix extends T$2(HTMLElement) { - - } - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. - // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - // See the License for the specific language governing permissions and - // limitations under the License. - - // import { Template } from '../mixins/template'; - // import { Controller } from '../controller'; - - const isOnlyWhitespace = /^\s*$/; - - class Article extends HTMLElement { - static get is() { return 'd-article'; } - - constructor() { - super(); - - new MutationObserver( (mutations) => { - for (const mutation of mutations) { - for (const addedNode of mutation.addedNodes) { - switch (addedNode.nodeName) { - case '#text': { // usually text nodes are only linebreaks. - const text = addedNode.nodeValue; - if (!isOnlyWhitespace.test(text)) { - console.warn('Use of unwrapped text in distill articles is discouraged as it breaks layout! Please wrap any text in a or

tag. We found the following text: ' + text); - const wrapper = document.createElement('span'); - wrapper.innerHTML = addedNode.nodeValue; - addedNode.parentNode.insertBefore(wrapper, addedNode); - addedNode.parentNode.removeChild(addedNode); - } - } break; - } - } - } - }).observe(this, {childList: true}); - } - - } - - var commonjsGlobal = typeof globalThis !== 'undefined' ? globalThis : typeof window !== 'undefined' ? window : typeof global !== 'undefined' ? global : typeof self !== 'undefined' ? self : {}; - - function createCommonjsModule(fn, module) { - return module = { exports: {} }, fn(module, module.exports), module.exports; - } - - var bibtexParse = createCommonjsModule(function (module, exports) { - /* start bibtexParse 0.0.22 */ - - //Original work by Henrik Muehe (c) 2010 - // - //CommonJS port by Mikola Lysenko 2013 - // - //Port to Browser lib by ORCID / RCPETERS - // - //Issues: - //no comment handling within strings - //no string concatenation - //no variable values yet - //Grammar implemented here: - //bibtex -> (string | preamble | comment | entry)*; - //string -> '@STRING' '{' key_equals_value '}'; - //preamble -> '@PREAMBLE' '{' value '}'; - //comment -> '@COMMENT' '{' value '}'; - //entry -> '@' key '{' key ',' key_value_list '}'; - //key_value_list -> key_equals_value (',' key_equals_value)*; - //key_equals_value -> key '=' value; - //value -> value_quotes | value_braces | key; - //value_quotes -> '"' .*? '"'; // not quite - //value_braces -> '{' .*? '"'; // not quite - (function(exports) { - - function BibtexParser() { - - this.months = ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]; - this.notKey = [',','{','}',' ','=']; - this.pos = 0; - this.input = ""; - this.entries = new Array(); - - this.currentEntry = ""; - - this.setInput = function(t) { - this.input = t; - }; - - this.getEntries = function() { - return this.entries; - }; - - this.isWhitespace = function(s) { - return (s == ' ' || s == '\r' || s == '\t' || s == '\n'); - }; - - this.match = function(s, canCommentOut) { - if (canCommentOut == undefined || canCommentOut == null) - canCommentOut = true; - this.skipWhitespace(canCommentOut); - if (this.input.substring(this.pos, this.pos + s.length) == s) { - this.pos += s.length; - } else { - throw "Token mismatch, expected " + s + ", found " - + this.input.substring(this.pos); - } this.skipWhitespace(canCommentOut); - }; - - this.tryMatch = function(s, canCommentOut) { - if (canCommentOut == undefined || canCommentOut == null) - canCommentOut = true; - this.skipWhitespace(canCommentOut); - if (this.input.substring(this.pos, this.pos + s.length) == s) { - return true; - } else { - return false; - } }; - - /* when search for a match all text can be ignored, not just white space */ - this.matchAt = function() { - while (this.input.length > this.pos && this.input[this.pos] != '@') { - this.pos++; - } - if (this.input[this.pos] == '@') { - return true; - } return false; - }; - - this.skipWhitespace = function(canCommentOut) { - while (this.isWhitespace(this.input[this.pos])) { - this.pos++; - } if (this.input[this.pos] == "%" && canCommentOut == true) { - while (this.input[this.pos] != "\n") { - this.pos++; - } this.skipWhitespace(canCommentOut); - } }; - - this.value_braces = function() { - var bracecount = 0; - this.match("{", false); - var start = this.pos; - var escaped = false; - while (true) { - if (!escaped) { - if (this.input[this.pos] == '}') { - if (bracecount > 0) { - bracecount--; - } else { - var end = this.pos; - 
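- // At this point `start..end` spans the brace-delimited value with the outer
- // braces excluded; e.g. (sketch) for the input `{Deep {Boltzmann} Machines}`
- // the returned substring is `Deep {Boltzmann} Machines`, inner braces intact.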
this.match("}", false); - return this.input.substring(start, end); - } } else if (this.input[this.pos] == '{') { - bracecount++; - } else if (this.pos >= this.input.length - 1) { - throw "Unterminated value"; - } } if (this.input[this.pos] == '\\' && escaped == false) - escaped = true; - else - escaped = false; - this.pos++; - } }; - - this.value_comment = function() { - var str = ''; - var brcktCnt = 0; - while (!(this.tryMatch("}", false) && brcktCnt == 0)) { - str = str + this.input[this.pos]; - if (this.input[this.pos] == '{') - brcktCnt++; - if (this.input[this.pos] == '}') - brcktCnt--; - if (this.pos >= this.input.length - 1) { - throw "Unterminated value:" + this.input.substring(start); - } this.pos++; - } return str; - }; - - this.value_quotes = function() { - this.match('"', false); - var start = this.pos; - var escaped = false; - while (true) { - if (!escaped) { - if (this.input[this.pos] == '"') { - var end = this.pos; - this.match('"', false); - return this.input.substring(start, end); - } else if (this.pos >= this.input.length - 1) { - throw "Unterminated value:" + this.input.substring(start); - } } - if (this.input[this.pos] == '\\' && escaped == false) - escaped = true; - else - escaped = false; - this.pos++; - } }; - - this.single_value = function() { - var start = this.pos; - if (this.tryMatch("{")) { - return this.value_braces(); - } else if (this.tryMatch('"')) { - return this.value_quotes(); - } else { - var k = this.key(); - if (k.match("^[0-9]+$")) - return k; - else if (this.months.indexOf(k.toLowerCase()) >= 0) - return k.toLowerCase(); - else - throw "Value expected:" + this.input.substring(start) + ' for key: ' + k; - - } }; - - this.value = function() { - var values = []; - values.push(this.single_value()); - while (this.tryMatch("#")) { - this.match("#"); - values.push(this.single_value()); - } return values.join(""); - }; - - this.key = function() { - var start = this.pos; - while (true) { - if (this.pos >= this.input.length) { - throw "Runaway key"; - } // а-яА-Я is Cyrillic - //console.log(this.input[this.pos]); - if (this.notKey.indexOf(this.input[this.pos]) >= 0) { - return this.input.substring(start, this.pos); - } else { - this.pos++; - - } } }; - - this.key_equals_value = function() { - var key = this.key(); - if (this.tryMatch("=")) { - this.match("="); - var val = this.value(); - return [ key, val ]; - } else { - throw "... 
= value expected, equals sign missing:" - + this.input.substring(this.pos); - } }; - - this.key_value_list = function() { - var kv = this.key_equals_value(); - this.currentEntry['entryTags'] = {}; - this.currentEntry['entryTags'][kv[0]] = kv[1]; - while (this.tryMatch(",")) { - this.match(","); - // fixes problems with commas at the end of a list - if (this.tryMatch("}")) { - break; - } - kv = this.key_equals_value(); - this.currentEntry['entryTags'][kv[0]] = kv[1]; - } }; - - this.entry_body = function(d) { - this.currentEntry = {}; - this.currentEntry['citationKey'] = this.key(); - this.currentEntry['entryType'] = d.substring(1); - this.match(","); - this.key_value_list(); - this.entries.push(this.currentEntry); - }; - - this.directive = function() { - this.match("@"); - return "@" + this.key(); - }; - - this.preamble = function() { - this.currentEntry = {}; - this.currentEntry['entryType'] = 'PREAMBLE'; - this.currentEntry['entry'] = this.value_comment(); - this.entries.push(this.currentEntry); - }; - - this.comment = function() { - this.currentEntry = {}; - this.currentEntry['entryType'] = 'COMMENT'; - this.currentEntry['entry'] = this.value_comment(); - this.entries.push(this.currentEntry); - }; - - this.entry = function(d) { - this.entry_body(d); - }; - - this.bibtex = function() { - while (this.matchAt()) { - var d = this.directive(); - this.match("{"); - if (d == "@STRING") { - this.string(); - } else if (d == "@PREAMBLE") { - this.preamble(); - } else if (d == "@COMMENT") { - this.comment(); - } else { - this.entry(d); - } - this.match("}"); - } }; - } - exports.toJSON = function(bibtex) { - var b = new BibtexParser(); - b.setInput(bibtex); - b.bibtex(); - return b.entries; - }; - - /* added during hackathon don't hate on me */ - exports.toBibtex = function(json) { - var out = ''; - for ( var i in json) { - out += "@" + json[i].entryType; - out += '{'; - if (json[i].citationKey) - out += json[i].citationKey + ', '; - if (json[i].entry) - out += json[i].entry ; - if (json[i].entryTags) { - var tags = ''; - for (var jdx in json[i].entryTags) { - if (tags.length != 0) - tags += ', '; - tags += jdx + '= {' + json[i].entryTags[jdx] + '}'; - } - out += tags; - } - out += '}\n\n'; - } - return out; - - }; - - })( exports); - - /* end bibtexParse */ - }); - - // Copyright 2018 The Distill Template Authors - - function normalizeTag(string) { - return string - .replace(/[\t\n ]+/g, ' ') - .replace(/{\\["^`.'acu~Hvs]( )?([a-zA-Z])}/g, (full, x, char) => char) - .replace(/{\\([a-zA-Z])}/g, (full, char) => char); - } - - function parseBibtex(bibtex) { - const bibliography = new Map(); - const parsedEntries = bibtexParse.toJSON(bibtex); - for (const entry of parsedEntries) { - // normalize tags; note entryTags is an object, not Map - for (const [key, value] of Object.entries(entry.entryTags)) { - entry.entryTags[key.toLowerCase()] = normalizeTag(value); - } - entry.entryTags.type = entry.entryType; - // add to bibliography - bibliography.set(entry.citationKey, entry.entryTags); - } - return bibliography; - } - - function serializeFrontmatterToBibtex(frontMatter) { - return `@article{${frontMatter.slug}, - author = {${frontMatter.bibtexAuthors}}, - title = {${frontMatter.title}}, - journal = {${frontMatter.journal.title}}, - year = {${frontMatter.publishedYear}}, - note = {${frontMatter.url}}, - doi = {${frontMatter.doi}} -}`; - } - - // Copyright 2018 The Distill Template Authors - - class Bibliography extends HTMLElement { - - static get is() { return 'd-bibliography'; } - - constructor() { - 
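- // Typical usage, as a sketch (the element re-parses whenever its child
- // script tag changes, then broadcasts an `onBibliographyChanged` event;
- // the bibtex entry below is hypothetical, and `type="text/json"` with a
- // serialized Map is accepted as well):
- //
- //   <d-bibliography>
- //     <script type="text/bibtex">
- //       @article{exampleKey2018, title={...}, author={...}, year={2018}}
- //     </script>
- //   </d-bibliography>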
super(); - - // set up mutation observer - const options = {childList: true, characterData: true, subtree: true}; - const observer = new MutationObserver( (entries) => { - for (const entry of entries) { - if (entry.target.nodeName === 'SCRIPT' || entry.type === 'characterData') { - this.parseIfPossible(); - } - } - }); - observer.observe(this, options); - } - - connectedCallback() { - requestAnimationFrame(() => { - this.parseIfPossible(); - }); - } - - parseIfPossible() { - const scriptTag = this.querySelector('script'); - if (!scriptTag) return; - if (scriptTag.type == 'text/bibtex') { - const newBibtex = scriptTag.textContent; - if (this.bibtex !== newBibtex) { - this.bibtex = newBibtex; - const bibliography = parseBibtex(this.bibtex); - this.notify(bibliography); - } - } else if (scriptTag.type == 'text/json') { - const bibliography = new Map(JSON.parse(scriptTag.textContent)); - this.notify(bibliography); - } else { - console.warn('Unsupported bibliography script tag type: ' + scriptTag.type); - } - } - - notify(bibliography) { - const options = { detail: bibliography, bubbles: true }; - const event = new CustomEvent('onBibliographyChanged', options); - this.dispatchEvent(event); - } - - /* observe 'src' attribute */ - - static get observedAttributes() { - return ['src']; - } - - receivedBibtex(event) { - const bibliography = parseBibtex(event.target.response); - this.notify(bibliography); - } - - attributeChangedCallback(name, oldValue, newValue) { - var oReq = new XMLHttpRequest(); - oReq.onload = (e) => this.receivedBibtex(e); - oReq.onerror = () => console.warn(`Could not load Bibtex! (tried ${newValue})`); - oReq.responseType = 'text'; - oReq.open('GET', newValue, true); - oReq.send(); - } - - - } - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. - // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - // See the License for the specific language governing permissions and - // limitations under the License. - - // import style from '../styles/d-byline.css'; - - function bylineTemplate(frontMatter) { - return ` -

-`; - } - - class Byline extends HTMLElement { - - static get is() { return 'd-byline'; } - - set frontMatter(frontMatter) { - this.innerHTML = bylineTemplate(frontMatter); - } - - } - - // Copyright 2018 The Distill Template Authors - - const T$3 = Template( - "d-cite", - ` - - - - -
- <span id="citation-" class="citation"><span class="citation-number"></span></span>
- <d-hover-box></d-hover-box>
-` - ); - - class Cite extends T$3(HTMLElement) { - /* Lifecycle */ - constructor() { - super(); - this._numbers = []; - this._entries = []; - } - - connectedCallback() { - this.outerSpan = this.root.querySelector("#citation-"); - this.innerSpan = this.root.querySelector(".citation-number"); - this.hoverBox = this.root.querySelector("d-hover-box"); - window.customElements.whenDefined("d-hover-box").then(() => { - this.hoverBox.listen(this); - }); - // in case this component got connected after values were set - if (this.numbers) { - this.displayNumbers(this.numbers); - } - if (this.entries) { - this.displayEntries(this.entries); - } - } - - //TODO This causes an infinite loop on firefox with polyfills. - // This is only needed for interactive editing so no priority. - // disconnectedCallback() { - // const options = { detail: [this, this.keys], bubbles: true }; - // const event = new CustomEvent('onCiteKeyRemoved', options); - // document.dispatchEvent(event); - // } - - /* observe 'key' attribute */ - - static get observedAttributes() { - return ["key", "bibtex-key"]; - } - - attributeChangedCallback(name, oldValue, newValue) { - const eventName = oldValue ? "onCiteKeyChanged" : "onCiteKeyCreated"; - const keys = newValue.split(",").map(k => k.trim()); - const options = { detail: [this, keys], bubbles: true }; - const event = new CustomEvent(eventName, options); - document.dispatchEvent(event); - } - - set key(value) { - this.setAttribute("key", value); - } - - get key() { - return this.getAttribute("key") || this.getAttribute("bibtex-key"); - } - - get keys() { - const result = this.key.split(","); - console.log(result); - return result; - } - - /* Setters & Rendering */ - - set numbers(numbers) { - this._numbers = numbers; - this.displayNumbers(numbers); - } - - get numbers() { - return this._numbers; - } - - displayNumbers(numbers) { - if (!this.innerSpan) return; - const numberStrings = numbers.map(index => { - return index == -1 ? "?" : index + 1 + ""; - }); - const textContent = "[" + numberStrings.join(", ") + "]"; - this.innerSpan.textContent = textContent; - } - - set entries(entries) { - this._entries = entries; - this.displayEntries(entries); - } - - get entries() { - return this._entries; - } - - displayEntries(entries) { - if (!this.hoverBox) return; - this.hoverBox.innerHTML = `
- <ul>
- ${entries
- .map(hover_cite)
- .map(html => `<li>${html}</li>`)
- .join("\n")}
- </ul>
`; - } - } - - // Copyright 2018 The Distill Template Authors - - const styles$1 = ` -d-citation-list { - contain: style; -} - -d-citation-list .references { - grid-column: text; -} - -d-citation-list .references .title { - font-weight: 500; -} -`; - - function renderCitationList(element, entries, dom=document) { - if (entries.size > 0) { - element.style.display = ''; - let list = element.querySelector('.references'); - if (list) { - list.innerHTML = ''; - } else { - const stylesTag = dom.createElement('style'); - stylesTag.innerHTML = styles$1; - element.appendChild(stylesTag); - - const heading = dom.createElement('h3'); - heading.id = 'references'; - heading.textContent = 'References'; - element.appendChild(heading); - - list = dom.createElement('ol'); - list.id = 'references-list'; - list.className = 'references'; - element.appendChild(list); - } - - for (const [key, entry] of entries) { - const listItem = dom.createElement('li'); - listItem.id = key; - listItem.innerHTML = bibliography_cite(entry); - list.appendChild(listItem); - } - } else { - element.style.display = 'none'; - } - } - - class CitationList extends HTMLElement { - - static get is() { return 'd-citation-list'; } - - connectedCallback() { - if (!this.hasAttribute('distill-prerendered')) { - this.style.display = 'none'; - } - } - - set citations(citations) { - renderCitationList(this, citations); - } - - } - - var prism = createCommonjsModule(function (module) { - /* ********************************************** - Begin prism-core.js - ********************************************** */ - - var _self = (typeof window !== 'undefined') - ? window // if in browser - : ( - (typeof WorkerGlobalScope !== 'undefined' && self instanceof WorkerGlobalScope) - ? self // if in worker - : {} // if in node js - ); - - /** - * Prism: Lightweight, robust, elegant syntax highlighting - * MIT license http://www.opensource.org/licenses/mit-license.php/ - * @author Lea Verou http://lea.verou.me - */ - - var Prism = (function (_self){ - - // Private helper vars - var lang = /\blang(?:uage)?-([\w-]+)\b/i; - var uniqueId = 0; - - - var _ = { - manual: _self.Prism && _self.Prism.manual, - disableWorkerMessageHandler: _self.Prism && _self.Prism.disableWorkerMessageHandler, - util: { - encode: function encode(tokens) { - if (tokens instanceof Token) { - return new Token(tokens.type, encode(tokens.content), tokens.alias); - } else if (Array.isArray(tokens)) { - return tokens.map(encode); - } else { - return tokens.replace(/&/g, '&').replace(/' + env.content + ''; - }; - - /** - * @param {string} text - * @param {LinkedList} tokenList - * @param {any} grammar - * @param {LinkedListNode} startNode - * @param {number} startPos - * @param {boolean} [oneshot=false] - * @param {string} [target] - */ - function matchGrammar(text, tokenList, grammar, startNode, startPos, oneshot, target) { - for (var token in grammar) { - if (!grammar.hasOwnProperty(token) || !grammar[token]) { - continue; - } - - var patterns = grammar[token]; - patterns = Array.isArray(patterns) ? 
patterns : [patterns]; - - for (var j = 0; j < patterns.length; ++j) { - if (target && target == token + ',' + j) { - return; - } - - var pattern = patterns[j], - inside = pattern.inside, - lookbehind = !!pattern.lookbehind, - greedy = !!pattern.greedy, - lookbehindLength = 0, - alias = pattern.alias; - - if (greedy && !pattern.pattern.global) { - // Without the global flag, lastIndex won't work - var flags = pattern.pattern.toString().match(/[imsuy]*$/)[0]; - pattern.pattern = RegExp(pattern.pattern.source, flags + 'g'); - } - - pattern = pattern.pattern || pattern; - - for ( // iterate the token list and keep track of the current token/string position - var currentNode = startNode.next, pos = startPos; - currentNode !== tokenList.tail; - pos += currentNode.value.length, currentNode = currentNode.next - ) { - - var str = currentNode.value; - - if (tokenList.length > text.length) { - // Something went terribly wrong, ABORT, ABORT! - return; - } - - if (str instanceof Token) { - continue; - } - - var removeCount = 1; // this is the to parameter of removeBetween - - if (greedy && currentNode != tokenList.tail.prev) { - pattern.lastIndex = pos; - var match = pattern.exec(text); - if (!match) { - break; - } - - var from = match.index + (lookbehind && match[1] ? match[1].length : 0); - var to = match.index + match[0].length; - var p = pos; - - // find the node that contains the match - p += currentNode.value.length; - while (from >= p) { - currentNode = currentNode.next; - p += currentNode.value.length; - } - // adjust pos (and p) - p -= currentNode.value.length; - pos = p; - - // the current node is a Token, then the match starts inside another Token, which is invalid - if (currentNode.value instanceof Token) { - continue; - } - - // find the last node which is affected by this match - for ( - var k = currentNode; - k !== tokenList.tail && (p < to || (typeof k.value === 'string' && !k.prev.value.greedy)); - k = k.next - ) { - removeCount++; - p += k.value.length; - } - removeCount--; - - // replace with the new match - str = text.slice(pos, p); - match.index -= pos; - } else { - pattern.lastIndex = 0; - - var match = pattern.exec(str); - } - - if (!match) { - if (oneshot) { - break; - } - - continue; - } - - if (lookbehind) { - lookbehindLength = match[1] ? match[1].length : 0; - } - - var from = match.index + lookbehindLength, - match = match[0].slice(lookbehindLength), - to = from + match.length, - before = str.slice(0, from), - after = str.slice(to); - - var removeFrom = currentNode.prev; - - if (before) { - removeFrom = addAfter(tokenList, removeFrom, before); - pos += before.length; - } - - removeRange(tokenList, removeFrom, removeCount); - - var wrapped = new Token(token, inside ? _.tokenize(match, inside) : match, alias, match, greedy); - currentNode = addAfter(tokenList, removeFrom, wrapped); - - if (after) { - addAfter(tokenList, currentNode, after); - } - - - if (removeCount > 1) - matchGrammar(text, tokenList, grammar, currentNode.prev, pos, true, token + ',' + j); - - if (oneshot) - break; - } - } - } - } - - /** - * @typedef LinkedListNode - * @property {T} value - * @property {LinkedListNode | null} prev The previous node. - * @property {LinkedListNode | null} next The next node. 
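- *
- * A sketch of how the helpers below cooperate:
- *   const list = new LinkedList();     // head and tail are sentinel nodes
- *   addAfter(list, list.head, 'a');    // head ⇄ 'a' ⇄ tail
- *   toArray(list);                     // ['a']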
- * @template T - */ - - /** - * @template T - */ - function LinkedList() { - /** @type {LinkedListNode} */ - var head = { value: null, prev: null, next: null }; - /** @type {LinkedListNode} */ - var tail = { value: null, prev: head, next: null }; - head.next = tail; - - /** @type {LinkedListNode} */ - this.head = head; - /** @type {LinkedListNode} */ - this.tail = tail; - this.length = 0; - } - - /** - * Adds a new node with the given value to the list. - * @param {LinkedList} list - * @param {LinkedListNode} node - * @param {T} value - * @returns {LinkedListNode} The added node. - * @template T - */ - function addAfter(list, node, value) { - // assumes that node != list.tail && values.length >= 0 - var next = node.next; - - var newNode = { value: value, prev: node, next: next }; - node.next = newNode; - next.prev = newNode; - list.length++; - - return newNode; - } - /** - * Removes `count` nodes after the given node. The given node will not be removed. - * @param {LinkedList} list - * @param {LinkedListNode} node - * @param {number} count - * @template T - */ - function removeRange(list, node, count) { - var next = node.next; - for (var i = 0; i < count && next !== list.tail; i++) { - next = next.next; - } - node.next = next; - next.prev = node; - list.length -= i; - } - /** - * @param {LinkedList} list - * @returns {T[]} - * @template T - */ - function toArray(list) { - var array = []; - var node = list.head.next; - while (node !== list.tail) { - array.push(node.value); - node = node.next; - } - return array; - } - - - if (!_self.document) { - if (!_self.addEventListener) { - // in Node.js - return _; - } - - if (!_.disableWorkerMessageHandler) { - // In worker - _self.addEventListener('message', function (evt) { - var message = JSON.parse(evt.data), - lang = message.language, - code = message.code, - immediateClose = message.immediateClose; - - _self.postMessage(_.highlight(code, _.languages[lang], lang)); - if (immediateClose) { - _self.close(); - } - }, false); - } - - return _; - } - - //Get current script and highlight - var script = _.util.currentScript(); - - if (script) { - _.filename = script.src; - - if (script.hasAttribute('data-manual')) { - _.manual = true; - } - } - - function highlightAutomaticallyCallback() { - if (!_.manual) { - _.highlightAll(); - } - } - - if (!_.manual) { - // If the document state is "loading", then we'll use DOMContentLoaded. - // If the document state is "interactive" and the prism.js script is deferred, then we'll also use the - // DOMContentLoaded event because there might be some plugins or languages which have also been deferred and they - // might take longer one animation frame to execute which can create a race condition where only some plugins have - // been loaded when Prism.highlightAll() is executed, depending on how fast resources are loaded. 
- // See https://github.com/PrismJS/prism/issues/2102 - var readyState = document.readyState; - if (readyState === 'loading' || readyState === 'interactive' && script && script.defer) { - document.addEventListener('DOMContentLoaded', highlightAutomaticallyCallback); - } else { - if (window.requestAnimationFrame) { - window.requestAnimationFrame(highlightAutomaticallyCallback); - } else { - window.setTimeout(highlightAutomaticallyCallback, 16); - } - } - } - - return _; - - })(_self); - - if ( module.exports) { - module.exports = Prism; - } - - // hack for components to work correctly in node.js - if (typeof commonjsGlobal !== 'undefined') { - commonjsGlobal.Prism = Prism; - } - - - /* ********************************************** - Begin prism-markup.js - ********************************************** */ - - Prism.languages.markup = { - 'comment': //, - 'prolog': /<\?[\s\S]+?\?>/, - 'doctype': { - pattern: /"'[\]]|"[^"]*"|'[^']*')+(?:\[(?:(?!)*\]\s*)?>/i, - greedy: true - }, - 'cdata': //i, - 'tag': { - pattern: /<\/?(?!\d)[^\s>\/=$<%]+(?:\s(?:\s*[^\s>\/=]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s'">=]+(?=[\s>]))|(?=[\s/>])))+)?\s*\/?>/i, - greedy: true, - inside: { - 'tag': { - pattern: /^<\/?[^\s>\/]+/i, - inside: { - 'punctuation': /^<\/?/, - 'namespace': /^[^\s>\/:]+:/ - } - }, - 'attr-value': { - pattern: /=\s*(?:"[^"]*"|'[^']*'|[^\s'">=]+)/i, - inside: { - 'punctuation': [ - /^=/, - { - pattern: /^(\s*)["']|["']$/, - lookbehind: true - } - ] - } - }, - 'punctuation': /\/?>/, - 'attr-name': { - pattern: /[^\s>\/]+/, - inside: { - 'namespace': /^[^\s>\/:]+:/ - } - } - - } - }, - 'entity': /&#?[\da-z]{1,8};/i - }; - - Prism.languages.markup['tag'].inside['attr-value'].inside['entity'] = - Prism.languages.markup['entity']; - - // Plugin to make entity title show the real entity, idea by Roman Komarov - Prism.hooks.add('wrap', function(env) { - - if (env.type === 'entity') { - env.attributes['title'] = env.content.replace(/&/, '&'); - } - }); - - Object.defineProperty(Prism.languages.markup.tag, 'addInlined', { - /** - * Adds an inlined language to markup. - * - * An example of an inlined language is CSS with ` - - - -`); - - class Code extends Mutating(T$4(HTMLElement)) { - - renderContent() { - - // check if language can be highlighted - this.languageName = this.getAttribute('language'); - if (!this.languageName) { - console.warn('You need to provide a language attribute to your block to let us know how to highlight your code; e.g.:\n zeros = np.zeros(shape).'); - return; - } - const language = prism.languages[this.languageName]; - if (language == undefined) { - console.warn(`Distill does not yet support highlighting your code block in "${this.languageName}'.`); - return; - } - - let content = this.textContent; - const codeTag = this.shadowRoot.querySelector('#code-container'); - - if (this.hasAttribute('block')) { - // normalize the tab indents - content = content.replace(/\n/, ''); - const tabs = content.match(/\s*/); - content = content.replace(new RegExp('\n' + tabs, 'g'), '\n'); - content = content.trim(); - // wrap code block in pre tag if needed - if (codeTag.parentNode instanceof ShadowRoot) { - const preTag = document.createElement('pre'); - this.shadowRoot.removeChild(codeTag); - preTag.appendChild(codeTag); - this.shadowRoot.appendChild(preTag); - } - - } - - codeTag.className = `language-${this.languageName}`; - codeTag.innerHTML = prism.highlight(content, language); - } - - } - - // Copyright 2018 The Distill Template Authors - - const T$5 = Template('d-footnote', ` - - - -
- <d-hover-box></d-hover-box>
- <sup><span id="fn-" data-hover-ref=""></span></sup>
- <span id="hidden-content"><slot id="slot"></slot></span>
- - - - - -`); - - class Footnote extends T$5(HTMLElement) { - - constructor() { - super(); - - const options = {childList: true, characterData: true, subtree: true}; - const observer = new MutationObserver(this.notify); - observer.observe(this, options); - } - - notify() { - const options = { detail: this, bubbles: true }; - const event = new CustomEvent('onFootnoteChanged', options); - document.dispatchEvent(event); - } - - connectedCallback() { - // listen and notify about changes to slotted content - // const slot = this.shadowRoot.querySelector('#slot'); - // console.warn(slot.textContent); - // slot.addEventListener('slotchange', this.notify); - this.hoverBox = this.root.querySelector('d-hover-box'); - window.customElements.whenDefined('d-hover-box').then(() => { - this.hoverBox.listen(this); - }); - // create numeric ID - Footnote.currentFootnoteId += 1; - const IdString = Footnote.currentFootnoteId.toString(); - this.root.host.id = 'd-footnote-' + IdString; - - // set up hidden hover box - const id = 'dt-fn-hover-box-' + IdString; - this.hoverBox.id = id; - - // set up visible footnote marker - const span = this.root.querySelector('#fn-'); - span.setAttribute('id', 'fn-' + IdString); - span.setAttribute('data-hover-ref', id); - span.textContent = IdString; - } - - } - - Footnote.currentFootnoteId = 0; - - // Copyright 2018 The Distill Template Authors - - const T$6 = Template('d-footnote-list', ` - - -
- <h3>Footnotes</h3>
- <ol></ol>
    -`, false); - - class FootnoteList extends T$6(HTMLElement) { - - connectedCallback() { - super.connectedCallback(); - - this.list = this.root.querySelector('ol'); - // footnotes list is initially hidden - this.root.style.display = 'none'; - // look through document and register existing footnotes - // Store.subscribeTo('footnotes', (footnote) => { - // this.renderFootnote(footnote); - // }); - } - - // TODO: could optimize this to accept individual footnotes? - set footnotes(footnotes) { - this.list.innerHTML = ''; - if (footnotes.length) { - // ensure footnote list is visible - this.root.style.display = ''; - - for (const footnote of footnotes) { - // construct and append list item to show footnote - const listItem = document.createElement('li'); - listItem.id = footnote.id + '-listing'; - listItem.innerHTML = footnote.innerHTML; - - const backlink = document.createElement('a'); - backlink.setAttribute('class', 'footnote-backlink'); - backlink.textContent = '[↩]'; - backlink.href = '#' + footnote.id; - - listItem.appendChild(backlink); - this.list.appendChild(listItem); - } - } else { - // ensure footnote list is invisible - this.root.style.display = 'none'; - } - } - - } - - // Copyright 2018 The Distill Template Authors - - const T$7 = Template('d-hover-box', ` - - -
- <div class="container">
- <div class="panel">
- <slot id="slot"></slot>
- </div>
- </div>
    -`); - - class HoverBox extends T$7(HTMLElement) { - - constructor() { - super(); - } - - connectedCallback() { - - } - - listen(element) { - // console.log(element) - this.bindDivEvents(this); - this.bindTriggerEvents(element); - // this.style.display = "block"; - } - - bindDivEvents(element) { - // For mice, same behavior as hovering on links - element.addEventListener('mouseover', () => { - if (!this.visible) this.showAtNode(element); - this.stopTimeout(); - }); - element.addEventListener('mouseout', () => { - this.extendTimeout(500); - }); - // Don't trigger body touchstart event when touching within box - element.addEventListener('touchstart', (event) => { - event.stopPropagation(); - }, {passive: true}); - // Close box when touching outside box - document.body.addEventListener('touchstart', () => { - this.hide(); - }, {passive: true}); - } - - bindTriggerEvents(node) { - node.addEventListener('mouseover', () => { - if (!this.visible) { - this.showAtNode(node); - } - this.stopTimeout(); - }); - - node.addEventListener('mouseout', () => { - this.extendTimeout(300); - }); - - node.addEventListener('touchstart', (event) => { - if (this.visible) { - this.hide(); - } else { - this.showAtNode(node); - } - // Don't trigger body touchstart event when touching link - event.stopPropagation(); - }, {passive: true}); - } - - show(position) { - this.visible = true; - this.style.display = 'block'; - // 10px extra offset from element - this.style.top = Math.round(position[1] + 10) + 'px'; - } - - showAtNode(node) { - // https://developer.mozilla.org/en-US/docs/Web/API/HTMLElement/offsetTop - const bbox = node.getBoundingClientRect(); - this.show([node.offsetLeft + bbox.width, node.offsetTop + bbox.height]); - } - - hide() { - this.visible = false; - this.style.display = 'none'; - this.stopTimeout(); - } - - stopTimeout() { - if (this.timeout) { - clearTimeout(this.timeout); - } - } - - extendTimeout(time) { - this.stopTimeout(); - this.timeout = setTimeout(() => { - this.hide(); - }, time); - } - - } - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. - // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - // See the License for the specific language governing permissions and - // limitations under the License. - - class Title extends HTMLElement { - static get is() { return 'd-title'; } - } - - // Copyright 2018 The Distill Template Authors - - const T$8 = Template('d-references', ` - -`, false); - - class References extends T$8(HTMLElement) { - - } - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. - // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - // See the License for the specific language governing permissions and - // limitations under the License. 
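- 
- // For orientation, a sketch of what renderTOC below produces for an article
- // containing <h2 id="intro">Intro</h2> followed by <h3 id="setup">Setup</h3>
- // (the ids and titles here are hypothetical):
- //
- //   <nav role="navigation" class="table-of-contents"></nav>
- //   <h2>Table of contents</h2>
- //   <ul><li><a href="#intro">Intro</a></li><br>
- //   <ul><li><a href="#setup">Setup</a></li></ul>
- //   </ul></nav>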
- - class TOC extends HTMLElement { - - static get is() { return 'd-toc'; } - - connectedCallback() { - if (!this.getAttribute('prerendered')) { - window.onload = () => { - const article = document.querySelector('d-article'); - const headings = article.querySelectorAll('h2, h3'); - renderTOC(this, headings); - }; - } - } - - } - - function renderTOC(element, headings) { - - let ToC =` - - -
- <nav role="navigation" class="table-of-contents"></nav>
- <h2>Table of contents</h2>
- <ul>
      `; - - for (const el of headings) { - // should element be included in TOC? - const isInTitle = el.parentElement.tagName == 'D-TITLE'; - const isException = el.getAttribute('no-toc'); - if (isInTitle || isException) continue; - // create TOC entry - const title = el.textContent; - const link = '#' + el.getAttribute('id'); - - let newLine = '
- <li>' + '<a href="' + link + '">' + title + '</a>' + '</li>';
- if (el.tagName == 'H3') {
- newLine = '<ul>' + newLine + '</ul>';
- } else {
- newLine += '<br>';
- }
- ToC += newLine;
- 
- }
- 
- ToC += '</ul></nav>
    '; - element.innerHTML = ToC; - } - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. - // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - // See the License for the specific language governing permissions and - // limitations under the License. - - // Figure - // - // d-figure provides a state-machine of visibility events: - // - // scroll out of view - // +----------------+ - // *do work here* | | - // +----------------+ +-+---------+ +-v---------+ - // | ready +----> onscreen | | offscreen | - // +----------------+ +---------^-+ +---------+-+ - // | | - // +----------------+ - // scroll into view - // - - class Figure extends HTMLElement { - - static get is() { return 'd-figure'; } - - static get readyQueue() { - if (!Figure._readyQueue) { - Figure._readyQueue = []; - } - return Figure._readyQueue; - } - - static addToReadyQueue(figure) { - if (Figure.readyQueue.indexOf(figure) === -1) { - Figure.readyQueue.push(figure); - Figure.runReadyQueue(); - } - } - - static runReadyQueue() { - // console.log("Checking to run readyQueue, length: " + Figure.readyQueue.length + ", scrolling: " + Figure.isScrolling); - // if (Figure.isScrolling) return; - // console.log("Running ready Queue"); - const figure = Figure.readyQueue - .sort((a,b) => a._seenOnScreen - b._seenOnScreen ) - .filter((figure) => !figure._ready) - .pop(); - if (figure) { - figure.ready(); - requestAnimationFrame(Figure.runReadyQueue); - } - - } - - constructor() { - super(); - // debugger - this._ready = false; - this._onscreen = false; - this._offscreen = true; - } - - connectedCallback() { - this.loadsWhileScrolling = this.hasAttribute('loadsWhileScrolling'); - Figure.marginObserver.observe(this); - Figure.directObserver.observe(this); - } - - disconnectedCallback() { - Figure.marginObserver.unobserve(this); - Figure.directObserver.unobserve(this); - } - - // We use two separate observers: - // One with an extra 1000px margin to warn if the viewpoint gets close, - // And one for the actual on/off screen events - - static get marginObserver() { - if (!Figure._marginObserver) { - // if (!('IntersectionObserver' in window)) { - // throw new Error('no interscetionobbserver!'); - // } - const viewportHeight = window.innerHeight; - const margin = Math.floor(2 * viewportHeight); - const options = {rootMargin: margin + 'px 0px ' + margin + 'px 0px', threshold: 0.01}; - const callback = Figure.didObserveMarginIntersection; - const observer = new IntersectionObserver(callback, options); - Figure._marginObserver = observer; - } - return Figure._marginObserver; - } - - static didObserveMarginIntersection(entries) { - for (const entry of entries) { - const figure = entry.target; - if (entry.isIntersecting && !figure._ready) { - Figure.addToReadyQueue(figure); - } - } - } - - static get directObserver() { - if (!Figure._directObserver) { - Figure._directObserver = new IntersectionObserver( - Figure.didObserveDirectIntersection, { - rootMargin: '0px', threshold: [0, 1.0], - } - ); - } - return Figure._directObserver; - } - - static didObserveDirectIntersection(entries) { - for (const entry of entries) { - const figure = 
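- // The onscreen/offscreen events dispatched below can be consumed like this
- // (a sketch; the `diagram` object and its setup/play/pause methods are
- // hypothetical):
- //   const fig = document.querySelector('d-figure');
- //   fig.addEventListener('ready', () => diagram.setup());   // fires once, near viewport
- //   fig.addEventListener('onscreen', () => diagram.play());
- //   fig.addEventListener('offscreen', () => diagram.pause());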
entry.target; - if (entry.isIntersecting) { - figure._seenOnScreen = new Date(); - // if (!figure._ready) { figure.ready(); } - if (figure._offscreen) { figure.onscreen(); } - } else { - if (figure._onscreen) { figure.offscreen(); } - } - } - } - - // Notify listeners that registered late, too: - - addEventListener(eventName, callback) { - super.addEventListener(eventName, callback); - // if we had already dispatched something while presumingly no one was listening, we do so again - // debugger - if (eventName === 'ready') { - if (Figure.readyQueue.indexOf(this) !== -1) { - this._ready = false; - Figure.runReadyQueue(); - } - } - if (eventName === 'onscreen') { - this.onscreen(); - } - } - - // Custom Events - - ready() { - // debugger - this._ready = true; - Figure.marginObserver.unobserve(this); - const event = new CustomEvent('ready'); - this.dispatchEvent(event); - } - - onscreen() { - this._onscreen = true; - this._offscreen = false; - const event = new CustomEvent('onscreen'); - this.dispatchEvent(event); - } - - offscreen() { - this._onscreen = false; - this._offscreen = true; - const event = new CustomEvent('offscreen'); - this.dispatchEvent(event); - } - - } - - if (typeof window !== 'undefined') { - - Figure.isScrolling = false; - let timeout; - const resetTimer = () => { - Figure.isScrolling = true; - clearTimeout(timeout); - timeout = setTimeout(() => { - Figure.isScrolling = false; - Figure.runReadyQueue(); - }, 500); - }; - window.addEventListener('scroll', resetTimer, true); - - } - - // Copyright 2018 The Distill Template Authors - - // This overlay is not secure. - // It is only meant as a social deterrent. - - const productionHostname = 'distill.pub'; - const T$9 = Template('d-interstitial', ` - - -
- <div class="overlay">
- <div class="container">
- <h1>This article is in review.</h1>
- <p>Do not share this URL or the contents of this article. Thank you!</p>
- <input id="interstitial-password-input" type="password" name="password"/>
- <p>Enter the password we shared with you as part of the review process to view the article.</p>
- </div>
- </div>
    -`); - - class Interstitial extends T$9(HTMLElement) { - - connectedCallback() { - if (this.shouldRemoveSelf()) { - this.parentElement.removeChild(this); - } else { - const passwordInput = this.root.querySelector('#interstitial-password-input'); - passwordInput.oninput = (event) => this.passwordChanged(event); - } - } - - passwordChanged(event) { - const entered = event.target.value; - if (entered === this.password) { - console.log('Correct password entered.'); - this.parentElement.removeChild(this); - if (typeof(Storage) !== 'undefined') { - console.log('Saved that correct password was entered.'); - localStorage.setItem(this.localStorageIdentifier(), 'true'); - } - } - } - - shouldRemoveSelf() { - // should never be visible in production - if (window && window.location.hostname === productionHostname) { - console.warn('Interstitial found on production, hiding it.'); - return true - } - // should only have to enter password once - if (typeof(Storage) !== 'undefined') { - if (localStorage.getItem(this.localStorageIdentifier()) === 'true') { - console.log('Loaded that correct password was entered before; skipping interstitial.'); - return true; - } - } - // otherwise, leave visible - return false; - } - - localStorageIdentifier() { - const prefix = 'distill-drafts'; - const suffix = 'interstitial-password-correct'; - return prefix + (window ? window.location.pathname : '-') + suffix - } - - } - - function ascending(a, b) { - return a < b ? -1 : a > b ? 1 : a >= b ? 0 : NaN; - } - - function bisector(compare) { - if (compare.length === 1) compare = ascendingComparator(compare); - return { - left: function(a, x, lo, hi) { - if (lo == null) lo = 0; - if (hi == null) hi = a.length; - while (lo < hi) { - var mid = lo + hi >>> 1; - if (compare(a[mid], x) < 0) lo = mid + 1; - else hi = mid; - } - return lo; - }, - right: function(a, x, lo, hi) { - if (lo == null) lo = 0; - if (hi == null) hi = a.length; - while (lo < hi) { - var mid = lo + hi >>> 1; - if (compare(a[mid], x) > 0) hi = mid; - else lo = mid + 1; - } - return lo; - } - }; - } - - function ascendingComparator(f) { - return function(d, x) { - return ascending(f(d), x); - }; - } - - var ascendingBisect = bisector(ascending); - var bisectRight = ascendingBisect.right; - - function range(start, stop, step) { - start = +start, stop = +stop, step = (n = arguments.length) < 2 ? (stop = start, start = 0, 1) : n < 3 ? 
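- // (sketch) range(5) → [0, 1, 2, 3, 4]; range(0, 1, 0.25) → [0, 0.25, 0.5, 0.75];
- // the stop value itself is always excluded.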
1 : +step; - - var i = -1, - n = Math.max(0, Math.ceil((stop - start) / step)) | 0, - range = new Array(n); - - while (++i < n) { - range[i] = start + i * step; - } - - return range; - } - - var e10 = Math.sqrt(50), - e5 = Math.sqrt(10), - e2 = Math.sqrt(2); - - function ticks(start, stop, count) { - var reverse, - i = -1, - n, - ticks, - step; - - stop = +stop, start = +start, count = +count; - if (start === stop && count > 0) return [start]; - if (reverse = stop < start) n = start, start = stop, stop = n; - if ((step = tickIncrement(start, stop, count)) === 0 || !isFinite(step)) return []; - - if (step > 0) { - start = Math.ceil(start / step); - stop = Math.floor(stop / step); - ticks = new Array(n = Math.ceil(stop - start + 1)); - while (++i < n) ticks[i] = (start + i) * step; - } else { - start = Math.floor(start * step); - stop = Math.ceil(stop * step); - ticks = new Array(n = Math.ceil(start - stop + 1)); - while (++i < n) ticks[i] = (start - i) / step; - } - - if (reverse) ticks.reverse(); - - return ticks; - } - - function tickIncrement(start, stop, count) { - var step = (stop - start) / Math.max(0, count), - power = Math.floor(Math.log(step) / Math.LN10), - error = step / Math.pow(10, power); - return power >= 0 - ? (error >= e10 ? 10 : error >= e5 ? 5 : error >= e2 ? 2 : 1) * Math.pow(10, power) - : -Math.pow(10, -power) / (error >= e10 ? 10 : error >= e5 ? 5 : error >= e2 ? 2 : 1); - } - - function tickStep(start, stop, count) { - var step0 = Math.abs(stop - start) / Math.max(0, count), - step1 = Math.pow(10, Math.floor(Math.log(step0) / Math.LN10)), - error = step0 / step1; - if (error >= e10) step1 *= 10; - else if (error >= e5) step1 *= 5; - else if (error >= e2) step1 *= 2; - return stop < start ? -step1 : step1; - } - - function initRange(domain, range) { - switch (arguments.length) { - case 0: break; - case 1: this.range(domain); break; - default: this.range(range).domain(domain); break; - } - return this; - } - - function define(constructor, factory, prototype) { - constructor.prototype = factory.prototype = prototype; - prototype.constructor = constructor; - } - - function extend(parent, definition) { - var prototype = Object.create(parent.prototype); - for (var key in definition) prototype[key] = definition[key]; - return prototype; - } - - function Color() {} - - var darker = 0.7; - var brighter = 1 / darker; - - var reI = "\\s*([+-]?\\d+)\\s*", - reN = "\\s*([+-]?\\d*\\.?\\d+(?:[eE][+-]?\\d+)?)\\s*", - reP = "\\s*([+-]?\\d*\\.?\\d+(?:[eE][+-]?\\d+)?)%\\s*", - reHex = /^#([0-9a-f]{3,8})$/, - reRgbInteger = new RegExp("^rgb\\(" + [reI, reI, reI] + "\\)$"), - reRgbPercent = new RegExp("^rgb\\(" + [reP, reP, reP] + "\\)$"), - reRgbaInteger = new RegExp("^rgba\\(" + [reI, reI, reI, reN] + "\\)$"), - reRgbaPercent = new RegExp("^rgba\\(" + [reP, reP, reP, reN] + "\\)$"), - reHslPercent = new RegExp("^hsl\\(" + [reN, reP, reP] + "\\)$"), - reHslaPercent = new RegExp("^hsla\\(" + [reN, reP, reP, reN] + "\\)$"); - - var named = { - aliceblue: 0xf0f8ff, - antiquewhite: 0xfaebd7, - aqua: 0x00ffff, - aquamarine: 0x7fffd4, - azure: 0xf0ffff, - beige: 0xf5f5dc, - bisque: 0xffe4c4, - black: 0x000000, - blanchedalmond: 0xffebcd, - blue: 0x0000ff, - blueviolet: 0x8a2be2, - brown: 0xa52a2a, - burlywood: 0xdeb887, - cadetblue: 0x5f9ea0, - chartreuse: 0x7fff00, - chocolate: 0xd2691e, - coral: 0xff7f50, - cornflowerblue: 0x6495ed, - cornsilk: 0xfff8dc, - crimson: 0xdc143c, - cyan: 0x00ffff, - darkblue: 0x00008b, - darkcyan: 0x008b8b, - darkgoldenrod: 0xb8860b, - darkgray: 0xa9a9a9, - 
darkgreen: 0x006400, - darkgrey: 0xa9a9a9, - darkkhaki: 0xbdb76b, - darkmagenta: 0x8b008b, - darkolivegreen: 0x556b2f, - darkorange: 0xff8c00, - darkorchid: 0x9932cc, - darkred: 0x8b0000, - darksalmon: 0xe9967a, - darkseagreen: 0x8fbc8f, - darkslateblue: 0x483d8b, - darkslategray: 0x2f4f4f, - darkslategrey: 0x2f4f4f, - darkturquoise: 0x00ced1, - darkviolet: 0x9400d3, - deeppink: 0xff1493, - deepskyblue: 0x00bfff, - dimgray: 0x696969, - dimgrey: 0x696969, - dodgerblue: 0x1e90ff, - firebrick: 0xb22222, - floralwhite: 0xfffaf0, - forestgreen: 0x228b22, - fuchsia: 0xff00ff, - gainsboro: 0xdcdcdc, - ghostwhite: 0xf8f8ff, - gold: 0xffd700, - goldenrod: 0xdaa520, - gray: 0x808080, - green: 0x008000, - greenyellow: 0xadff2f, - grey: 0x808080, - honeydew: 0xf0fff0, - hotpink: 0xff69b4, - indianred: 0xcd5c5c, - indigo: 0x4b0082, - ivory: 0xfffff0, - khaki: 0xf0e68c, - lavender: 0xe6e6fa, - lavenderblush: 0xfff0f5, - lawngreen: 0x7cfc00, - lemonchiffon: 0xfffacd, - lightblue: 0xadd8e6, - lightcoral: 0xf08080, - lightcyan: 0xe0ffff, - lightgoldenrodyellow: 0xfafad2, - lightgray: 0xd3d3d3, - lightgreen: 0x90ee90, - lightgrey: 0xd3d3d3, - lightpink: 0xffb6c1, - lightsalmon: 0xffa07a, - lightseagreen: 0x20b2aa, - lightskyblue: 0x87cefa, - lightslategray: 0x778899, - lightslategrey: 0x778899, - lightsteelblue: 0xb0c4de, - lightyellow: 0xffffe0, - lime: 0x00ff00, - limegreen: 0x32cd32, - linen: 0xfaf0e6, - magenta: 0xff00ff, - maroon: 0x800000, - mediumaquamarine: 0x66cdaa, - mediumblue: 0x0000cd, - mediumorchid: 0xba55d3, - mediumpurple: 0x9370db, - mediumseagreen: 0x3cb371, - mediumslateblue: 0x7b68ee, - mediumspringgreen: 0x00fa9a, - mediumturquoise: 0x48d1cc, - mediumvioletred: 0xc71585, - midnightblue: 0x191970, - mintcream: 0xf5fffa, - mistyrose: 0xffe4e1, - moccasin: 0xffe4b5, - navajowhite: 0xffdead, - navy: 0x000080, - oldlace: 0xfdf5e6, - olive: 0x808000, - olivedrab: 0x6b8e23, - orange: 0xffa500, - orangered: 0xff4500, - orchid: 0xda70d6, - palegoldenrod: 0xeee8aa, - palegreen: 0x98fb98, - paleturquoise: 0xafeeee, - palevioletred: 0xdb7093, - papayawhip: 0xffefd5, - peachpuff: 0xffdab9, - peru: 0xcd853f, - pink: 0xffc0cb, - plum: 0xdda0dd, - powderblue: 0xb0e0e6, - purple: 0x800080, - rebeccapurple: 0x663399, - red: 0xff0000, - rosybrown: 0xbc8f8f, - royalblue: 0x4169e1, - saddlebrown: 0x8b4513, - salmon: 0xfa8072, - sandybrown: 0xf4a460, - seagreen: 0x2e8b57, - seashell: 0xfff5ee, - sienna: 0xa0522d, - silver: 0xc0c0c0, - skyblue: 0x87ceeb, - slateblue: 0x6a5acd, - slategray: 0x708090, - slategrey: 0x708090, - snow: 0xfffafa, - springgreen: 0x00ff7f, - steelblue: 0x4682b4, - tan: 0xd2b48c, - teal: 0x008080, - thistle: 0xd8bfd8, - tomato: 0xff6347, - turquoise: 0x40e0d0, - violet: 0xee82ee, - wheat: 0xf5deb3, - white: 0xffffff, - whitesmoke: 0xf5f5f5, - yellow: 0xffff00, - yellowgreen: 0x9acd32 - }; - - define(Color, color, { - copy: function(channels) { - return Object.assign(new this.constructor, this, channels); - }, - displayable: function() { - return this.rgb().displayable(); - }, - hex: color_formatHex, // Deprecated! Use color.formatHex. 
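- // (sketch) the color() parser below accepts any CSS color string, e.g.:
- //   color("steelblue").formatHex()   → "#4682b4"   (via the named table above)
- //   color("rgb(70, 130, 180)") + ""  → "rgb(70, 130, 180)"
- //   color("nonsense")                → null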
- formatHex: color_formatHex, - formatHsl: color_formatHsl, - formatRgb: color_formatRgb, - toString: color_formatRgb - }); - - function color_formatHex() { - return this.rgb().formatHex(); - } - - function color_formatHsl() { - return hslConvert(this).formatHsl(); - } - - function color_formatRgb() { - return this.rgb().formatRgb(); - } - - function color(format) { - var m, l; - format = (format + "").trim().toLowerCase(); - return (m = reHex.exec(format)) ? (l = m[1].length, m = parseInt(m[1], 16), l === 6 ? rgbn(m) // #ff0000 - : l === 3 ? new Rgb((m >> 8 & 0xf) | (m >> 4 & 0xf0), (m >> 4 & 0xf) | (m & 0xf0), ((m & 0xf) << 4) | (m & 0xf), 1) // #f00 - : l === 8 ? rgba(m >> 24 & 0xff, m >> 16 & 0xff, m >> 8 & 0xff, (m & 0xff) / 0xff) // #ff000000 - : l === 4 ? rgba((m >> 12 & 0xf) | (m >> 8 & 0xf0), (m >> 8 & 0xf) | (m >> 4 & 0xf0), (m >> 4 & 0xf) | (m & 0xf0), (((m & 0xf) << 4) | (m & 0xf)) / 0xff) // #f000 - : null) // invalid hex - : (m = reRgbInteger.exec(format)) ? new Rgb(m[1], m[2], m[3], 1) // rgb(255, 0, 0) - : (m = reRgbPercent.exec(format)) ? new Rgb(m[1] * 255 / 100, m[2] * 255 / 100, m[3] * 255 / 100, 1) // rgb(100%, 0%, 0%) - : (m = reRgbaInteger.exec(format)) ? rgba(m[1], m[2], m[3], m[4]) // rgba(255, 0, 0, 1) - : (m = reRgbaPercent.exec(format)) ? rgba(m[1] * 255 / 100, m[2] * 255 / 100, m[3] * 255 / 100, m[4]) // rgb(100%, 0%, 0%, 1) - : (m = reHslPercent.exec(format)) ? hsla(m[1], m[2] / 100, m[3] / 100, 1) // hsl(120, 50%, 50%) - : (m = reHslaPercent.exec(format)) ? hsla(m[1], m[2] / 100, m[3] / 100, m[4]) // hsla(120, 50%, 50%, 1) - : named.hasOwnProperty(format) ? rgbn(named[format]) // eslint-disable-line no-prototype-builtins - : format === "transparent" ? new Rgb(NaN, NaN, NaN, 0) - : null; - } - - function rgbn(n) { - return new Rgb(n >> 16 & 0xff, n >> 8 & 0xff, n & 0xff, 1); - } - - function rgba(r, g, b, a) { - if (a <= 0) r = g = b = NaN; - return new Rgb(r, g, b, a); - } - - function rgbConvert(o) { - if (!(o instanceof Color)) o = color(o); - if (!o) return new Rgb; - o = o.rgb(); - return new Rgb(o.r, o.g, o.b, o.opacity); - } - - function rgb(r, g, b, opacity) { - return arguments.length === 1 ? rgbConvert(r) : new Rgb(r, g, b, opacity == null ? 1 : opacity); - } - - function Rgb(r, g, b, opacity) { - this.r = +r; - this.g = +g; - this.b = +b; - this.opacity = +opacity; - } - - define(Rgb, rgb, extend(Color, { - brighter: function(k) { - k = k == null ? brighter : Math.pow(brighter, k); - return new Rgb(this.r * k, this.g * k, this.b * k, this.opacity); - }, - darker: function(k) { - k = k == null ? darker : Math.pow(darker, k); - return new Rgb(this.r * k, this.g * k, this.b * k, this.opacity); - }, - rgb: function() { - return this; - }, - displayable: function() { - return (-0.5 <= this.r && this.r < 255.5) - && (-0.5 <= this.g && this.g < 255.5) - && (-0.5 <= this.b && this.b < 255.5) - && (0 <= this.opacity && this.opacity <= 1); - }, - hex: rgb_formatHex, // Deprecated! Use color.formatHex. - formatHex: rgb_formatHex, - formatRgb: rgb_formatRgb, - toString: rgb_formatRgb - })); - - function rgb_formatHex() { - return "#" + hex(this.r) + hex(this.g) + hex(this.b); - } - - function rgb_formatRgb() { - var a = this.opacity; a = isNaN(a) ? 1 : Math.max(0, Math.min(1, a)); - return (a === 1 ? "rgb(" : "rgba(") - + Math.max(0, Math.min(255, Math.round(this.r) || 0)) + ", " - + Math.max(0, Math.min(255, Math.round(this.g) || 0)) + ", " - + Math.max(0, Math.min(255, Math.round(this.b) || 0)) - + (a === 1 ? 
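- // (sketch) this branch picks the rgb()/rgba() form by opacity:
- //   rgb(255, 0, 0) + ""      → "rgb(255, 0, 0)"
- //   rgb(255, 0, 0, 0.5) + "" → "rgba(255, 0, 0, 0.5)"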
")" : ", " + a + ")"); - } - - function hex(value) { - value = Math.max(0, Math.min(255, Math.round(value) || 0)); - return (value < 16 ? "0" : "") + value.toString(16); - } - - function hsla(h, s, l, a) { - if (a <= 0) h = s = l = NaN; - else if (l <= 0 || l >= 1) h = s = NaN; - else if (s <= 0) h = NaN; - return new Hsl(h, s, l, a); - } - - function hslConvert(o) { - if (o instanceof Hsl) return new Hsl(o.h, o.s, o.l, o.opacity); - if (!(o instanceof Color)) o = color(o); - if (!o) return new Hsl; - if (o instanceof Hsl) return o; - o = o.rgb(); - var r = o.r / 255, - g = o.g / 255, - b = o.b / 255, - min = Math.min(r, g, b), - max = Math.max(r, g, b), - h = NaN, - s = max - min, - l = (max + min) / 2; - if (s) { - if (r === max) h = (g - b) / s + (g < b) * 6; - else if (g === max) h = (b - r) / s + 2; - else h = (r - g) / s + 4; - s /= l < 0.5 ? max + min : 2 - max - min; - h *= 60; - } else { - s = l > 0 && l < 1 ? 0 : h; - } - return new Hsl(h, s, l, o.opacity); - } - - function hsl(h, s, l, opacity) { - return arguments.length === 1 ? hslConvert(h) : new Hsl(h, s, l, opacity == null ? 1 : opacity); - } - - function Hsl(h, s, l, opacity) { - this.h = +h; - this.s = +s; - this.l = +l; - this.opacity = +opacity; - } - - define(Hsl, hsl, extend(Color, { - brighter: function(k) { - k = k == null ? brighter : Math.pow(brighter, k); - return new Hsl(this.h, this.s, this.l * k, this.opacity); - }, - darker: function(k) { - k = k == null ? darker : Math.pow(darker, k); - return new Hsl(this.h, this.s, this.l * k, this.opacity); - }, - rgb: function() { - var h = this.h % 360 + (this.h < 0) * 360, - s = isNaN(h) || isNaN(this.s) ? 0 : this.s, - l = this.l, - m2 = l + (l < 0.5 ? l : 1 - l) * s, - m1 = 2 * l - m2; - return new Rgb( - hsl2rgb(h >= 240 ? h - 240 : h + 120, m1, m2), - hsl2rgb(h, m1, m2), - hsl2rgb(h < 120 ? h + 240 : h - 120, m1, m2), - this.opacity - ); - }, - displayable: function() { - return (0 <= this.s && this.s <= 1 || isNaN(this.s)) - && (0 <= this.l && this.l <= 1) - && (0 <= this.opacity && this.opacity <= 1); - }, - formatHsl: function() { - var a = this.opacity; a = isNaN(a) ? 1 : Math.max(0, Math.min(1, a)); - return (a === 1 ? "hsl(" : "hsla(") - + (this.h || 0) + ", " - + (this.s || 0) * 100 + "%, " - + (this.l || 0) * 100 + "%" - + (a === 1 ? ")" : ", " + a + ")"); - } - })); - - /* From FvD 13.37, CSS Color Module Level 3 */ - function hsl2rgb(h, m1, m2) { - return (h < 60 ? m1 + (m2 - m1) * h / 60 - : h < 180 ? m2 - : h < 240 ? m1 + (m2 - m1) * (240 - h) / 60 - : m1) * 255; - } - - var deg2rad = Math.PI / 180; - var rad2deg = 180 / Math.PI; - - // https://observablehq.com/@mbostock/lab-and-rgb - var K = 18, - Xn = 0.96422, - Yn = 1, - Zn = 0.82521, - t0 = 4 / 29, - t1 = 6 / 29, - t2 = 3 * t1 * t1, - t3 = t1 * t1 * t1; - - function labConvert(o) { - if (o instanceof Lab) return new Lab(o.l, o.a, o.b, o.opacity); - if (o instanceof Hcl) return hcl2lab(o); - if (!(o instanceof Rgb)) o = rgbConvert(o); - var r = rgb2lrgb(o.r), - g = rgb2lrgb(o.g), - b = rgb2lrgb(o.b), - y = xyz2lab((0.2225045 * r + 0.7168786 * g + 0.0606169 * b) / Yn), x, z; - if (r === g && g === b) x = z = y; else { - x = xyz2lab((0.4360747 * r + 0.3850649 * g + 0.1430804 * b) / Xn); - z = xyz2lab((0.0139322 * r + 0.0971045 * g + 0.7141733 * b) / Zn); - } - return new Lab(116 * y - 16, 500 * (x - y), 200 * (y - z), o.opacity); - } - - function lab(l, a, b, opacity) { - return arguments.length === 1 ? labConvert(l) : new Lab(l, a, b, opacity == null ? 
1 : opacity); - } - - function Lab(l, a, b, opacity) { - this.l = +l; - this.a = +a; - this.b = +b; - this.opacity = +opacity; - } - - define(Lab, lab, extend(Color, { - brighter: function(k) { - return new Lab(this.l + K * (k == null ? 1 : k), this.a, this.b, this.opacity); - }, - darker: function(k) { - return new Lab(this.l - K * (k == null ? 1 : k), this.a, this.b, this.opacity); - }, - rgb: function() { - var y = (this.l + 16) / 116, - x = isNaN(this.a) ? y : y + this.a / 500, - z = isNaN(this.b) ? y : y - this.b / 200; - x = Xn * lab2xyz(x); - y = Yn * lab2xyz(y); - z = Zn * lab2xyz(z); - return new Rgb( - lrgb2rgb( 3.1338561 * x - 1.6168667 * y - 0.4906146 * z), - lrgb2rgb(-0.9787684 * x + 1.9161415 * y + 0.0334540 * z), - lrgb2rgb( 0.0719453 * x - 0.2289914 * y + 1.4052427 * z), - this.opacity - ); - } - })); - - function xyz2lab(t) { - return t > t3 ? Math.pow(t, 1 / 3) : t / t2 + t0; - } - - function lab2xyz(t) { - return t > t1 ? t * t * t : t2 * (t - t0); - } - - function lrgb2rgb(x) { - return 255 * (x <= 0.0031308 ? 12.92 * x : 1.055 * Math.pow(x, 1 / 2.4) - 0.055); - } - - function rgb2lrgb(x) { - return (x /= 255) <= 0.04045 ? x / 12.92 : Math.pow((x + 0.055) / 1.055, 2.4); - } - - function hclConvert(o) { - if (o instanceof Hcl) return new Hcl(o.h, o.c, o.l, o.opacity); - if (!(o instanceof Lab)) o = labConvert(o); - if (o.a === 0 && o.b === 0) return new Hcl(NaN, 0 < o.l && o.l < 100 ? 0 : NaN, o.l, o.opacity); - var h = Math.atan2(o.b, o.a) * rad2deg; - return new Hcl(h < 0 ? h + 360 : h, Math.sqrt(o.a * o.a + o.b * o.b), o.l, o.opacity); - } - - function hcl(h, c, l, opacity) { - return arguments.length === 1 ? hclConvert(h) : new Hcl(h, c, l, opacity == null ? 1 : opacity); - } - - function Hcl(h, c, l, opacity) { - this.h = +h; - this.c = +c; - this.l = +l; - this.opacity = +opacity; - } - - function hcl2lab(o) { - if (isNaN(o.h)) return new Lab(o.l, 0, 0, o.opacity); - var h = o.h * deg2rad; - return new Lab(o.l, Math.cos(h) * o.c, Math.sin(h) * o.c, o.opacity); - } - - define(Hcl, hcl, extend(Color, { - brighter: function(k) { - return new Hcl(this.h, this.c, this.l + K * (k == null ? 1 : k), this.opacity); - }, - darker: function(k) { - return new Hcl(this.h, this.c, this.l - K * (k == null ? 1 : k), this.opacity); - }, - rgb: function() { - return hcl2lab(this).rgb(); - } - })); - - var A = -0.14861, - B = +1.78277, - C = -0.29227, - D = -0.90649, - E = +1.97294, - ED = E * D, - EB = E * B, - BC_DA = B * C - D * A; - - function cubehelixConvert(o) { - if (o instanceof Cubehelix) return new Cubehelix(o.h, o.s, o.l, o.opacity); - if (!(o instanceof Rgb)) o = rgbConvert(o); - var r = o.r / 255, - g = o.g / 255, - b = o.b / 255, - l = (BC_DA * b + ED * r - EB * g) / (BC_DA + ED - EB), - bl = b - l, - k = (E * (g - l) - C * bl) / D, - s = Math.sqrt(k * k + bl * bl) / (E * l * (1 - l)), // NaN if l=0 or l=1 - h = s ? Math.atan2(k, bl) * rad2deg - 120 : NaN; - return new Cubehelix(h < 0 ? h + 360 : h, s, l, o.opacity); - } - - function cubehelix(h, s, l, opacity) { - return arguments.length === 1 ? cubehelixConvert(h) : new Cubehelix(h, s, l, opacity == null ? 1 : opacity); - } - - function Cubehelix(h, s, l, opacity) { - this.h = +h; - this.s = +s; - this.l = +l; - this.opacity = +opacity; - } - - define(Cubehelix, cubehelix, extend(Color, { - brighter: function(k) { - k = k == null ? brighter : Math.pow(brighter, k); - return new Cubehelix(this.h, this.s, this.l * k, this.opacity); - }, - darker: function(k) { - k = k == null ? 
darker : Math.pow(darker, k); - return new Cubehelix(this.h, this.s, this.l * k, this.opacity); - }, - rgb: function() { - var h = isNaN(this.h) ? 0 : (this.h + 120) * deg2rad, - l = +this.l, - a = isNaN(this.s) ? 0 : this.s * l * (1 - l), - cosh = Math.cos(h), - sinh = Math.sin(h); - return new Rgb( - 255 * (l + a * (A * cosh + B * sinh)), - 255 * (l + a * (C * cosh + D * sinh)), - 255 * (l + a * (E * cosh)), - this.opacity - ); - } - })); - - function constant(x) { - return function() { - return x; - }; - } - - function linear(a, d) { - return function(t) { - return a + t * d; - }; - } - - function exponential(a, b, y) { - return a = Math.pow(a, y), b = Math.pow(b, y) - a, y = 1 / y, function(t) { - return Math.pow(a + t * b, y); - }; - } - - function gamma(y) { - return (y = +y) === 1 ? nogamma : function(a, b) { - return b - a ? exponential(a, b, y) : constant(isNaN(a) ? b : a); - }; - } - - function nogamma(a, b) { - var d = b - a; - return d ? linear(a, d) : constant(isNaN(a) ? b : a); - } - - var rgb$1 = (function rgbGamma(y) { - var color = gamma(y); - - function rgb$1(start, end) { - var r = color((start = rgb(start)).r, (end = rgb(end)).r), - g = color(start.g, end.g), - b = color(start.b, end.b), - opacity = nogamma(start.opacity, end.opacity); - return function(t) { - start.r = r(t); - start.g = g(t); - start.b = b(t); - start.opacity = opacity(t); - return start + ""; - }; - } - - rgb$1.gamma = rgbGamma; - - return rgb$1; - })(1); - - function numberArray(a, b) { - if (!b) b = []; - var n = a ? Math.min(b.length, a.length) : 0, - c = b.slice(), - i; - return function(t) { - for (i = 0; i < n; ++i) c[i] = a[i] * (1 - t) + b[i] * t; - return c; - }; - } - - function isNumberArray(x) { - return ArrayBuffer.isView(x) && !(x instanceof DataView); - } - - function genericArray(a, b) { - var nb = b ? b.length : 0, - na = a ? Math.min(nb, a.length) : 0, - x = new Array(na), - c = new Array(nb), - i; - - for (i = 0; i < na; ++i) x[i] = interpolate(a[i], b[i]); - for (; i < nb; ++i) c[i] = b[i]; - - return function(t) { - for (i = 0; i < na; ++i) c[i] = x[i](t); - return c; - }; - } - - function date(a, b) { - var d = new Date; - return a = +a, b = +b, function(t) { - return d.setTime(a * (1 - t) + b * t), d; - }; - } - - function interpolateNumber(a, b) { - return a = +a, b = +b, function(t) { - return a * (1 - t) + b * t; - }; - } - - function object(a, b) { - var i = {}, - c = {}, - k; - - if (a === null || typeof a !== "object") a = {}; - if (b === null || typeof b !== "object") b = {}; - - for (k in b) { - if (k in a) { - i[k] = interpolate(a[k], b[k]); - } else { - c[k] = b[k]; - } - } - - return function(t) { - for (k in i) c[k] = i[k](t); - return c; - }; - } - - var reA = /[-+]?(?:\d+\.?\d*|\.?\d+)(?:[eE][-+]?\d+)?/g, - reB = new RegExp(reA.source, "g"); - - function zero(b) { - return function() { - return b; - }; - } - - function one(b) { - return function(t) { - return b(t) + ""; - }; - } - - function string(a, b) { - var bi = reA.lastIndex = reB.lastIndex = 0, // scan index for next number in b - am, // current match in a - bm, // current match in b - bs, // string preceding current number in b, if any - i = -1, // index in s - s = [], // string constants and placeholders - q = []; // number interpolators - - // Coerce inputs to strings. - a = a + "", b = b + ""; - - // Interpolate pairs of numbers in a & b. 
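- // (Sketch of the loop below, assuming the public d3-interpolate API: numbers
- // embedded in the two strings are interpolated pairwise and the surrounding
- // text is held constant, e.g.
- //   d3.interpolateString("10px wide", "30px wide")(0.5) // "20px wide"
- // )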
- while ((am = reA.exec(a)) - && (bm = reB.exec(b))) { - if ((bs = bm.index) > bi) { // a string precedes the next number in b - bs = b.slice(bi, bs); - if (s[i]) s[i] += bs; // coalesce with previous string - else s[++i] = bs; - } - if ((am = am[0]) === (bm = bm[0])) { // numbers in a & b match - if (s[i]) s[i] += bm; // coalesce with previous string - else s[++i] = bm; - } else { // interpolate non-matching numbers - s[++i] = null; - q.push({i: i, x: interpolateNumber(am, bm)}); - } - bi = reB.lastIndex; - } - - // Add remains of b. - if (bi < b.length) { - bs = b.slice(bi); - if (s[i]) s[i] += bs; // coalesce with previous string - else s[++i] = bs; - } - - // Special optimization for only a single match. - // Otherwise, interpolate each of the numbers and rejoin the string. - return s.length < 2 ? (q[0] - ? one(q[0].x) - : zero(b)) - : (b = q.length, function(t) { - for (var i = 0, o; i < b; ++i) s[(o = q[i]).i] = o.x(t); - return s.join(""); - }); - } - - function interpolate(a, b) { - var t = typeof b, c; - return b == null || t === "boolean" ? constant(b) - : (t === "number" ? interpolateNumber - : t === "string" ? ((c = color(b)) ? (b = c, rgb$1) : string) - : b instanceof color ? rgb$1 - : b instanceof Date ? date - : isNumberArray(b) ? numberArray - : Array.isArray(b) ? genericArray - : typeof b.valueOf !== "function" && typeof b.toString !== "function" || isNaN(b) ? object - : interpolateNumber)(a, b); - } - - function interpolateRound(a, b) { - return a = +a, b = +b, function(t) { - return Math.round(a * (1 - t) + b * t); - }; - } - - function constant$1(x) { - return function() { - return x; - }; - } - - function number(x) { - return +x; - } - - var unit = [0, 1]; - - function identity(x) { - return x; - } - - function normalize(a, b) { - return (b -= (a = +a)) - ? function(x) { return (x - a) / b; } - : constant$1(isNaN(b) ? NaN : 0.5); - } - - function clamper(a, b) { - var t; - if (a > b) t = a, a = b, b = t; - return function(x) { return Math.max(a, Math.min(b, x)); }; - } - - // normalize(a, b)(x) takes a domain value x in [a,b] and returns the corresponding parameter t in [0,1]. - // interpolate(a, b)(t) takes a parameter t in [0,1] and returns the corresponding range value x in [a,b]. - function bimap(domain, range, interpolate) { - var d0 = domain[0], d1 = domain[1], r0 = range[0], r1 = range[1]; - if (d1 < d0) d0 = normalize(d1, d0), r0 = interpolate(r1, r0); - else d0 = normalize(d0, d1), r0 = interpolate(r0, r1); - return function(x) { return r0(d0(x)); }; - } - - function polymap(domain, range, interpolate) { - var j = Math.min(domain.length, range.length) - 1, - d = new Array(j), - r = new Array(j), - i = -1; - - // Reverse descending domains. 
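- // (Sketch, assuming the d3-scale API built on bimap/polymap: a domain with
- // three or more elements makes the scale piecewise, e.g.
- //   d3.scaleLinear().domain([0, 50, 100]).range(["red", "white", "blue"])(75)
- // interpolates inside the [50, 100] / ["white", "blue"] segment.)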
- if (domain[j] < domain[0]) { - domain = domain.slice().reverse(); - range = range.slice().reverse(); - } - - while (++i < j) { - d[i] = normalize(domain[i], domain[i + 1]); - r[i] = interpolate(range[i], range[i + 1]); - } - - return function(x) { - var i = bisectRight(domain, x, 1, j) - 1; - return r[i](d[i](x)); - }; - } - - function copy(source, target) { - return target - .domain(source.domain()) - .range(source.range()) - .interpolate(source.interpolate()) - .clamp(source.clamp()) - .unknown(source.unknown()); - } - - function transformer() { - var domain = unit, - range = unit, - interpolate$1 = interpolate, - transform, - untransform, - unknown, - clamp = identity, - piecewise, - output, - input; - - function rescale() { - var n = Math.min(domain.length, range.length); - if (clamp !== identity) clamp = clamper(domain[0], domain[n - 1]); - piecewise = n > 2 ? polymap : bimap; - output = input = null; - return scale; - } - - function scale(x) { - return isNaN(x = +x) ? unknown : (output || (output = piecewise(domain.map(transform), range, interpolate$1)))(transform(clamp(x))); - } - - scale.invert = function(y) { - return clamp(untransform((input || (input = piecewise(range, domain.map(transform), interpolateNumber)))(y))); - }; - - scale.domain = function(_) { - return arguments.length ? (domain = Array.from(_, number), rescale()) : domain.slice(); - }; - - scale.range = function(_) { - return arguments.length ? (range = Array.from(_), rescale()) : range.slice(); - }; - - scale.rangeRound = function(_) { - return range = Array.from(_), interpolate$1 = interpolateRound, rescale(); - }; - - scale.clamp = function(_) { - return arguments.length ? (clamp = _ ? true : identity, rescale()) : clamp !== identity; - }; - - scale.interpolate = function(_) { - return arguments.length ? (interpolate$1 = _, rescale()) : interpolate$1; - }; - - scale.unknown = function(_) { - return arguments.length ? (unknown = _, scale) : unknown; - }; - - return function(t, u) { - transform = t, untransform = u; - return rescale(); - }; - } - - function continuous() { - return transformer()(identity, identity); - } - - // Computes the decimal coefficient and exponent of the specified number x with - // significant digits p, where x is positive and p is in [1, 21] or undefined. - // For example, formatDecimal(1.23) returns ["123", 0]. - function formatDecimal(x, p) { - if ((i = (x = p ? x.toExponential(p - 1) : x.toExponential()).indexOf("e")) < 0) return null; // NaN, ±Infinity - var i, coefficient = x.slice(0, i); - - // The string returned by toExponential either has the form \d\.\d+e[-+]\d+ - // (e.g., 1.2e+3) or the form \de[-+]\d+ (e.g., 1e+3). - return [ - coefficient.length > 1 ? coefficient[0] + coefficient.slice(2) : coefficient, - +x.slice(i + 1) - ]; - } - - function exponent(x) { - return x = formatDecimal(Math.abs(x)), x ? 
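- // (Editorial sketch: exponent() returns the decimal exponent computed by
- // formatDecimal above, e.g. exponent(0.0042) is -3 and exponent(4200) is 3;
- // the precisionFixed/precisionPrefix/precisionRound helpers later in this
- // file are built on it.)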
x[1] : NaN; - } - - function formatGroup(grouping, thousands) { - return function(value, width) { - var i = value.length, - t = [], - j = 0, - g = grouping[0], - length = 0; - - while (i > 0 && g > 0) { - if (length + g + 1 > width) g = Math.max(1, width - length); - t.push(value.substring(i -= g, i + g)); - if ((length += g + 1) > width) break; - g = grouping[j = (j + 1) % grouping.length]; - } - - return t.reverse().join(thousands); - }; - } - - function formatNumerals(numerals) { - return function(value) { - return value.replace(/[0-9]/g, function(i) { - return numerals[+i]; - }); - }; - } - - // [[fill]align][sign][symbol][0][width][,][.precision][~][type] - var re = /^(?:(.)?([<>=^]))?([+\-( ])?([$#])?(0)?(\d+)?(,)?(\.\d+)?(~)?([a-z%])?$/i; - - function formatSpecifier(specifier) { - if (!(match = re.exec(specifier))) throw new Error("invalid format: " + specifier); - var match; - return new FormatSpecifier({ - fill: match[1], - align: match[2], - sign: match[3], - symbol: match[4], - zero: match[5], - width: match[6], - comma: match[7], - precision: match[8] && match[8].slice(1), - trim: match[9], - type: match[10] - }); - } - - formatSpecifier.prototype = FormatSpecifier.prototype; // instanceof - - function FormatSpecifier(specifier) { - this.fill = specifier.fill === undefined ? " " : specifier.fill + ""; - this.align = specifier.align === undefined ? ">" : specifier.align + ""; - this.sign = specifier.sign === undefined ? "-" : specifier.sign + ""; - this.symbol = specifier.symbol === undefined ? "" : specifier.symbol + ""; - this.zero = !!specifier.zero; - this.width = specifier.width === undefined ? undefined : +specifier.width; - this.comma = !!specifier.comma; - this.precision = specifier.precision === undefined ? undefined : +specifier.precision; - this.trim = !!specifier.trim; - this.type = specifier.type === undefined ? "" : specifier.type + ""; - } - - FormatSpecifier.prototype.toString = function() { - return this.fill - + this.align - + this.sign - + this.symbol - + (this.zero ? "0" : "") - + (this.width === undefined ? "" : Math.max(1, this.width | 0)) - + (this.comma ? "," : "") - + (this.precision === undefined ? "" : "." + Math.max(0, this.precision | 0)) - + (this.trim ? "~" : "") - + this.type; - }; - - // Trims insignificant zeros, e.g., replaces 1.2000k with 1.2k. - function formatTrim(s) { - out: for (var n = s.length, i = 1, i0 = -1, i1; i < n; ++i) { - switch (s[i]) { - case ".": i0 = i1 = i; break; - case "0": if (i0 === 0) i0 = i; i1 = i; break; - default: if (!+s[i]) break out; if (i0 > 0) i0 = 0; break; - } - } - return i0 > 0 ? s.slice(0, i0) + s.slice(i1 + 1) : s; - } - - var prefixExponent; - - function formatPrefixAuto(x, p) { - var d = formatDecimal(x, p); - if (!d) return x + ""; - var coefficient = d[0], - exponent = d[1], - i = exponent - (prefixExponent = Math.max(-8, Math.min(8, Math.floor(exponent / 3))) * 3) + 1, - n = coefficient.length; - return i === n ? coefficient - : i > n ? coefficient + new Array(i - n + 1).join("0") - : i > 0 ? coefficient.slice(0, i) + "." + coefficient.slice(i) - : "0." + new Array(1 - i).join("0") + formatDecimal(x, Math.max(0, p + i - 1))[0]; // less than 1y! - } - - function formatRounded(x, p) { - var d = formatDecimal(x, p); - if (!d) return x + ""; - var coefficient = d[0], - exponent = d[1]; - return exponent < 0 ? "0." + new Array(-exponent).join("0") + coefficient - : coefficient.length > exponent + 1 ? coefficient.slice(0, exponent + 1) + "." 
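- // (Sketch, assuming the public d3.format built from the grammar above:
- //   d3.format("$,.2f")(1234.5) // "$1,234.50": currency, grouping, fixed
- //   d3.format("~s")(1500)      // "1.5k": trim (~) plus SI-prefix type (s)
- // )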
+ coefficient.slice(exponent + 1) - : coefficient + new Array(exponent - coefficient.length + 2).join("0"); - } - - var formatTypes = { - "%": function(x, p) { return (x * 100).toFixed(p); }, - "b": function(x) { return Math.round(x).toString(2); }, - "c": function(x) { return x + ""; }, - "d": function(x) { return Math.round(x).toString(10); }, - "e": function(x, p) { return x.toExponential(p); }, - "f": function(x, p) { return x.toFixed(p); }, - "g": function(x, p) { return x.toPrecision(p); }, - "o": function(x) { return Math.round(x).toString(8); }, - "p": function(x, p) { return formatRounded(x * 100, p); }, - "r": formatRounded, - "s": formatPrefixAuto, - "X": function(x) { return Math.round(x).toString(16).toUpperCase(); }, - "x": function(x) { return Math.round(x).toString(16); } - }; - - function identity$1(x) { - return x; - } - - var map = Array.prototype.map, - prefixes = ["y","z","a","f","p","n","µ","m","","k","M","G","T","P","E","Z","Y"]; - - function formatLocale(locale) { - var group = locale.grouping === undefined || locale.thousands === undefined ? identity$1 : formatGroup(map.call(locale.grouping, Number), locale.thousands + ""), - currencyPrefix = locale.currency === undefined ? "" : locale.currency[0] + "", - currencySuffix = locale.currency === undefined ? "" : locale.currency[1] + "", - decimal = locale.decimal === undefined ? "." : locale.decimal + "", - numerals = locale.numerals === undefined ? identity$1 : formatNumerals(map.call(locale.numerals, String)), - percent = locale.percent === undefined ? "%" : locale.percent + "", - minus = locale.minus === undefined ? "-" : locale.minus + "", - nan = locale.nan === undefined ? "NaN" : locale.nan + ""; - - function newFormat(specifier) { - specifier = formatSpecifier(specifier); - - var fill = specifier.fill, - align = specifier.align, - sign = specifier.sign, - symbol = specifier.symbol, - zero = specifier.zero, - width = specifier.width, - comma = specifier.comma, - precision = specifier.precision, - trim = specifier.trim, - type = specifier.type; - - // The "n" type is an alias for ",g". - if (type === "n") comma = true, type = "g"; - - // The "" type, and any invalid type, is an alias for ".12~g". - else if (!formatTypes[type]) precision === undefined && (precision = 12), trim = true, type = "g"; - - // If zero fill is specified, padding goes after sign and before digits. - if (zero || (fill === "0" && align === "=")) zero = true, fill = "0", align = "="; - - // Compute the prefix and suffix. - // For SI-prefix, the suffix is lazily computed. - var prefix = symbol === "$" ? currencyPrefix : symbol === "#" && /[boxX]/.test(type) ? "0" + type.toLowerCase() : "", - suffix = symbol === "$" ? currencySuffix : /[%p]/.test(type) ? percent : ""; - - // What format function should we use? - // Is this an integer type? - // Can this type generate exponential notation? - var formatType = formatTypes[type], - maybeSuffix = /[defgprs%]/.test(type); - - // Set the default precision if not specified, - // or clamp the specified precision to the supported range. - // For significant precision, it must be in [1, 21]. - // For fixed precision, it must be in [0, 20]. - precision = precision === undefined ? 6 - : /[gprs]/.test(type) ? 
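- // (Sketch: per the comment above, significant-digit types clamp precision to
- // [1, 21] and fixed types to [0, 20], defaulting to 6; assuming the public
- // API, d3.format(".3s")(42e6) therefore yields "42.0M" via formatPrefixAuto.)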
Math.max(1, Math.min(21, precision)) - : Math.max(0, Math.min(20, precision)); - - function format(value) { - var valuePrefix = prefix, - valueSuffix = suffix, - i, n, c; - - if (type === "c") { - valueSuffix = formatType(value) + valueSuffix; - value = ""; - } else { - value = +value; - - // Determine the sign. -0 is not less than 0, but 1 / -0 is! - var valueNegative = value < 0 || 1 / value < 0; - - // Perform the initial formatting. - value = isNaN(value) ? nan : formatType(Math.abs(value), precision); - - // Trim insignificant zeros. - if (trim) value = formatTrim(value); - - // If a negative value rounds to zero after formatting, and no explicit positive sign is requested, hide the sign. - if (valueNegative && +value === 0 && sign !== "+") valueNegative = false; - - // Compute the prefix and suffix. - valuePrefix = (valueNegative ? (sign === "(" ? sign : minus) : sign === "-" || sign === "(" ? "" : sign) + valuePrefix; - valueSuffix = (type === "s" ? prefixes[8 + prefixExponent / 3] : "") + valueSuffix + (valueNegative && sign === "(" ? ")" : ""); - - // Break the formatted value into the integer “value” part that can be - // grouped, and fractional or exponential “suffix” part that is not. - if (maybeSuffix) { - i = -1, n = value.length; - while (++i < n) { - if (c = value.charCodeAt(i), 48 > c || c > 57) { - valueSuffix = (c === 46 ? decimal + value.slice(i + 1) : value.slice(i)) + valueSuffix; - value = value.slice(0, i); - break; - } - } - } - } - - // If the fill character is not "0", grouping is applied before padding. - if (comma && !zero) value = group(value, Infinity); - - // Compute the padding. - var length = valuePrefix.length + value.length + valueSuffix.length, - padding = length < width ? new Array(width - length + 1).join(fill) : ""; - - // If the fill character is "0", grouping is applied after padding. - if (comma && zero) value = group(padding + value, padding.length ? width - valueSuffix.length : Infinity), padding = ""; - - // Reconstruct the final output based on the desired alignment. 
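- // (Sketch of the cases below: "<" left-aligns, "=" pads between the sign and
- // digits, "^" centers, and the default ">" right-aligns; assuming the public
- // API, d3.format("^10,")(12345) yields "  12,345  ".)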
- switch (align) { - case "<": value = valuePrefix + value + valueSuffix + padding; break; - case "=": value = valuePrefix + padding + value + valueSuffix; break; - case "^": value = padding.slice(0, length = padding.length >> 1) + valuePrefix + value + valueSuffix + padding.slice(length); break; - default: value = padding + valuePrefix + value + valueSuffix; break; - } - - return numerals(value); - } - - format.toString = function() { - return specifier + ""; - }; - - return format; - } - - function formatPrefix(specifier, value) { - var f = newFormat((specifier = formatSpecifier(specifier), specifier.type = "f", specifier)), - e = Math.max(-8, Math.min(8, Math.floor(exponent(value) / 3))) * 3, - k = Math.pow(10, -e), - prefix = prefixes[8 + e / 3]; - return function(value) { - return f(k * value) + prefix; - }; - } - - return { - format: newFormat, - formatPrefix: formatPrefix - }; - } - - var locale; - var format; - var formatPrefix; - - defaultLocale({ - decimal: ".", - thousands: ",", - grouping: [3], - currency: ["$", ""], - minus: "-" - }); - - function defaultLocale(definition) { - locale = formatLocale(definition); - format = locale.format; - formatPrefix = locale.formatPrefix; - return locale; - } - - function precisionFixed(step) { - return Math.max(0, -exponent(Math.abs(step))); - } - - function precisionPrefix(step, value) { - return Math.max(0, Math.max(-8, Math.min(8, Math.floor(exponent(value) / 3))) * 3 - exponent(Math.abs(step))); - } - - function precisionRound(step, max) { - step = Math.abs(step), max = Math.abs(max) - step; - return Math.max(0, exponent(max) - exponent(step)) + 1; - } - - function tickFormat(start, stop, count, specifier) { - var step = tickStep(start, stop, count), - precision; - specifier = formatSpecifier(specifier == null ? ",f" : specifier); - switch (specifier.type) { - case "s": { - var value = Math.max(Math.abs(start), Math.abs(stop)); - if (specifier.precision == null && !isNaN(precision = precisionPrefix(step, value))) specifier.precision = precision; - return formatPrefix(specifier, value); - } - case "": - case "e": - case "g": - case "p": - case "r": { - if (specifier.precision == null && !isNaN(precision = precisionRound(step, Math.max(Math.abs(start), Math.abs(stop))))) specifier.precision = precision - (specifier.type === "e"); - break; - } - case "f": - case "%": { - if (specifier.precision == null && !isNaN(precision = precisionFixed(step))) specifier.precision = precision - (specifier.type === "%") * 2; - break; - } - } - return format(specifier); - } - - function linearish(scale) { - var domain = scale.domain; - - scale.ticks = function(count) { - var d = domain(); - return ticks(d[0], d[d.length - 1], count == null ? 10 : count); - }; - - scale.tickFormat = function(count, specifier) { - var d = domain(); - return tickFormat(d[0], d[d.length - 1], count == null ? 
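- // (Sketch: ticks()/tickFormat() default to roughly 10 ticks, and tickFormat
- // derives a precision from the tick step via the precision* helpers above,
- // e.g. d3.scaleLinear().domain([0, 1]).tickFormat(5)(0.2) yields "0.2".)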
10 : count, specifier); - }; - - scale.nice = function(count) { - if (count == null) count = 10; - - var d = domain(), - i0 = 0, - i1 = d.length - 1, - start = d[i0], - stop = d[i1], - step; - - if (stop < start) { - step = start, start = stop, stop = step; - step = i0, i0 = i1, i1 = step; - } - - step = tickIncrement(start, stop, count); - - if (step > 0) { - start = Math.floor(start / step) * step; - stop = Math.ceil(stop / step) * step; - step = tickIncrement(start, stop, count); - } else if (step < 0) { - start = Math.ceil(start * step) / step; - stop = Math.floor(stop * step) / step; - step = tickIncrement(start, stop, count); - } - - if (step > 0) { - d[i0] = Math.floor(start / step) * step; - d[i1] = Math.ceil(stop / step) * step; - domain(d); - } else if (step < 0) { - d[i0] = Math.ceil(start * step) / step; - d[i1] = Math.floor(stop * step) / step; - domain(d); - } - - return scale; - }; - - return scale; - } - - function linear$1() { - var scale = continuous(); - - scale.copy = function() { - return copy(scale, linear$1()); - }; - - initRange.apply(scale, arguments); - - return linearish(scale); - } - - var t0$1 = new Date, - t1$1 = new Date; - - function newInterval(floori, offseti, count, field) { - - function interval(date) { - return floori(date = arguments.length === 0 ? new Date : new Date(+date)), date; - } - - interval.floor = function(date) { - return floori(date = new Date(+date)), date; - }; - - interval.ceil = function(date) { - return floori(date = new Date(date - 1)), offseti(date, 1), floori(date), date; - }; - - interval.round = function(date) { - var d0 = interval(date), - d1 = interval.ceil(date); - return date - d0 < d1 - date ? d0 : d1; - }; - - interval.offset = function(date, step) { - return offseti(date = new Date(+date), step == null ? 1 : Math.floor(step)), date; - }; - - interval.range = function(start, stop, step) { - var range = [], previous; - start = interval.ceil(start); - step = step == null ? 1 : Math.floor(step); - if (!(start < stop) || !(step > 0)) return range; // also handles Invalid Date - do range.push(previous = new Date(+start)), offseti(start, step), floori(start); - while (previous < start && start < stop); - return range; - }; - - interval.filter = function(test) { - return newInterval(function(date) { - if (date >= date) while (floori(date), !test(date)) date.setTime(date - 1); - }, function(date, step) { - if (date >= date) { - if (step < 0) while (++step <= 0) { - while (offseti(date, -1), !test(date)) {} // eslint-disable-line no-empty - } else while (--step >= 0) { - while (offseti(date, +1), !test(date)) {} // eslint-disable-line no-empty - } - } - }); - }; - - if (count) { - interval.count = function(start, end) { - t0$1.setTime(+start), t1$1.setTime(+end); - floori(t0$1), floori(t1$1); - return Math.floor(count(t0$1, t1$1)); - }; - - interval.every = function(step) { - step = Math.floor(step); - return !isFinite(step) || !(step > 0) ? null - : !(step > 1) ? interval - : interval.filter(field - ? function(d) { return field(d) % step === 0; } - : function(d) { return interval.count(0, d) % step === 0; }); - }; - } - - return interval; - } - - var millisecond = newInterval(function() { - // noop - }, function(date, step) { - date.setTime(+date + step); - }, function(start, end) { - return end - start; - }); - - // An optimized implementation for this simple case. 
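- // (Sketch: interval.every(k) normally filters candidate dates one at a time;
- // the millisecond override below can floor timestamps directly, e.g.
- // d3.timeMillisecond.every(50) snaps to multiples of 50 ms.)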
- millisecond.every = function(k) { - k = Math.floor(k); - if (!isFinite(k) || !(k > 0)) return null; - if (!(k > 1)) return millisecond; - return newInterval(function(date) { - date.setTime(Math.floor(date / k) * k); - }, function(date, step) { - date.setTime(+date + step * k); - }, function(start, end) { - return (end - start) / k; - }); - }; - - var durationSecond = 1e3; - var durationMinute = 6e4; - var durationHour = 36e5; - var durationDay = 864e5; - var durationWeek = 6048e5; - - var second = newInterval(function(date) { - date.setTime(date - date.getMilliseconds()); - }, function(date, step) { - date.setTime(+date + step * durationSecond); - }, function(start, end) { - return (end - start) / durationSecond; - }, function(date) { - return date.getUTCSeconds(); - }); - - var minute = newInterval(function(date) { - date.setTime(date - date.getMilliseconds() - date.getSeconds() * durationSecond); - }, function(date, step) { - date.setTime(+date + step * durationMinute); - }, function(start, end) { - return (end - start) / durationMinute; - }, function(date) { - return date.getMinutes(); - }); - - var hour = newInterval(function(date) { - date.setTime(date - date.getMilliseconds() - date.getSeconds() * durationSecond - date.getMinutes() * durationMinute); - }, function(date, step) { - date.setTime(+date + step * durationHour); - }, function(start, end) { - return (end - start) / durationHour; - }, function(date) { - return date.getHours(); - }); - - var day = newInterval(function(date) { - date.setHours(0, 0, 0, 0); - }, function(date, step) { - date.setDate(date.getDate() + step); - }, function(start, end) { - return (end - start - (end.getTimezoneOffset() - start.getTimezoneOffset()) * durationMinute) / durationDay; - }, function(date) { - return date.getDate() - 1; - }); - - function weekday(i) { - return newInterval(function(date) { - date.setDate(date.getDate() - (date.getDay() + 7 - i) % 7); - date.setHours(0, 0, 0, 0); - }, function(date, step) { - date.setDate(date.getDate() + step * 7); - }, function(start, end) { - return (end - start - (end.getTimezoneOffset() - start.getTimezoneOffset()) * durationMinute) / durationWeek; - }); - } - - var sunday = weekday(0); - var monday = weekday(1); - var tuesday = weekday(2); - var wednesday = weekday(3); - var thursday = weekday(4); - var friday = weekday(5); - var saturday = weekday(6); - - var month = newInterval(function(date) { - date.setDate(1); - date.setHours(0, 0, 0, 0); - }, function(date, step) { - date.setMonth(date.getMonth() + step); - }, function(start, end) { - return end.getMonth() - start.getMonth() + (end.getFullYear() - start.getFullYear()) * 12; - }, function(date) { - return date.getMonth(); - }); - - var year = newInterval(function(date) { - date.setMonth(0, 1); - date.setHours(0, 0, 0, 0); - }, function(date, step) { - date.setFullYear(date.getFullYear() + step); - }, function(start, end) { - return end.getFullYear() - start.getFullYear(); - }, function(date) { - return date.getFullYear(); - }); - - // An optimized implementation for this simple case. - year.every = function(k) { - return !isFinite(k = Math.floor(k)) || !(k > 0) ? 
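- // (Sketch: this year.every(k) override floors to multiples of k years, e.g.
- // d3.timeYear.every(10) steps by decades, which long time axes rely on.)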
null : newInterval(function(date) { - date.setFullYear(Math.floor(date.getFullYear() / k) * k); - date.setMonth(0, 1); - date.setHours(0, 0, 0, 0); - }, function(date, step) { - date.setFullYear(date.getFullYear() + step * k); - }); - }; - - var utcMinute = newInterval(function(date) { - date.setUTCSeconds(0, 0); - }, function(date, step) { - date.setTime(+date + step * durationMinute); - }, function(start, end) { - return (end - start) / durationMinute; - }, function(date) { - return date.getUTCMinutes(); - }); - - var utcHour = newInterval(function(date) { - date.setUTCMinutes(0, 0, 0); - }, function(date, step) { - date.setTime(+date + step * durationHour); - }, function(start, end) { - return (end - start) / durationHour; - }, function(date) { - return date.getUTCHours(); - }); - - var utcDay = newInterval(function(date) { - date.setUTCHours(0, 0, 0, 0); - }, function(date, step) { - date.setUTCDate(date.getUTCDate() + step); - }, function(start, end) { - return (end - start) / durationDay; - }, function(date) { - return date.getUTCDate() - 1; - }); - - function utcWeekday(i) { - return newInterval(function(date) { - date.setUTCDate(date.getUTCDate() - (date.getUTCDay() + 7 - i) % 7); - date.setUTCHours(0, 0, 0, 0); - }, function(date, step) { - date.setUTCDate(date.getUTCDate() + step * 7); - }, function(start, end) { - return (end - start) / durationWeek; - }); - } - - var utcSunday = utcWeekday(0); - var utcMonday = utcWeekday(1); - var utcTuesday = utcWeekday(2); - var utcWednesday = utcWeekday(3); - var utcThursday = utcWeekday(4); - var utcFriday = utcWeekday(5); - var utcSaturday = utcWeekday(6); - - var utcMonth = newInterval(function(date) { - date.setUTCDate(1); - date.setUTCHours(0, 0, 0, 0); - }, function(date, step) { - date.setUTCMonth(date.getUTCMonth() + step); - }, function(start, end) { - return end.getUTCMonth() - start.getUTCMonth() + (end.getUTCFullYear() - start.getUTCFullYear()) * 12; - }, function(date) { - return date.getUTCMonth(); - }); - - var utcYear = newInterval(function(date) { - date.setUTCMonth(0, 1); - date.setUTCHours(0, 0, 0, 0); - }, function(date, step) { - date.setUTCFullYear(date.getUTCFullYear() + step); - }, function(start, end) { - return end.getUTCFullYear() - start.getUTCFullYear(); - }, function(date) { - return date.getUTCFullYear(); - }); - - // An optimized implementation for this simple case. - utcYear.every = function(k) { - return !isFinite(k = Math.floor(k)) || !(k > 0) ? 
null : newInterval(function(date) { - date.setUTCFullYear(Math.floor(date.getUTCFullYear() / k) * k); - date.setUTCMonth(0, 1); - date.setUTCHours(0, 0, 0, 0); - }, function(date, step) { - date.setUTCFullYear(date.getUTCFullYear() + step * k); - }); - }; - - function localDate(d) { - if (0 <= d.y && d.y < 100) { - var date = new Date(-1, d.m, d.d, d.H, d.M, d.S, d.L); - date.setFullYear(d.y); - return date; - } - return new Date(d.y, d.m, d.d, d.H, d.M, d.S, d.L); - } - - function utcDate(d) { - if (0 <= d.y && d.y < 100) { - var date = new Date(Date.UTC(-1, d.m, d.d, d.H, d.M, d.S, d.L)); - date.setUTCFullYear(d.y); - return date; - } - return new Date(Date.UTC(d.y, d.m, d.d, d.H, d.M, d.S, d.L)); - } - - function newDate(y, m, d) { - return {y: y, m: m, d: d, H: 0, M: 0, S: 0, L: 0}; - } - - function formatLocale$1(locale) { - var locale_dateTime = locale.dateTime, - locale_date = locale.date, - locale_time = locale.time, - locale_periods = locale.periods, - locale_weekdays = locale.days, - locale_shortWeekdays = locale.shortDays, - locale_months = locale.months, - locale_shortMonths = locale.shortMonths; - - var periodRe = formatRe(locale_periods), - periodLookup = formatLookup(locale_periods), - weekdayRe = formatRe(locale_weekdays), - weekdayLookup = formatLookup(locale_weekdays), - shortWeekdayRe = formatRe(locale_shortWeekdays), - shortWeekdayLookup = formatLookup(locale_shortWeekdays), - monthRe = formatRe(locale_months), - monthLookup = formatLookup(locale_months), - shortMonthRe = formatRe(locale_shortMonths), - shortMonthLookup = formatLookup(locale_shortMonths); - - var formats = { - "a": formatShortWeekday, - "A": formatWeekday, - "b": formatShortMonth, - "B": formatMonth, - "c": null, - "d": formatDayOfMonth, - "e": formatDayOfMonth, - "f": formatMicroseconds, - "H": formatHour24, - "I": formatHour12, - "j": formatDayOfYear, - "L": formatMilliseconds, - "m": formatMonthNumber, - "M": formatMinutes, - "p": formatPeriod, - "q": formatQuarter, - "Q": formatUnixTimestamp, - "s": formatUnixTimestampSeconds, - "S": formatSeconds, - "u": formatWeekdayNumberMonday, - "U": formatWeekNumberSunday, - "V": formatWeekNumberISO, - "w": formatWeekdayNumberSunday, - "W": formatWeekNumberMonday, - "x": null, - "X": null, - "y": formatYear, - "Y": formatFullYear, - "Z": formatZone, - "%": formatLiteralPercent - }; - - var utcFormats = { - "a": formatUTCShortWeekday, - "A": formatUTCWeekday, - "b": formatUTCShortMonth, - "B": formatUTCMonth, - "c": null, - "d": formatUTCDayOfMonth, - "e": formatUTCDayOfMonth, - "f": formatUTCMicroseconds, - "H": formatUTCHour24, - "I": formatUTCHour12, - "j": formatUTCDayOfYear, - "L": formatUTCMilliseconds, - "m": formatUTCMonthNumber, - "M": formatUTCMinutes, - "p": formatUTCPeriod, - "q": formatUTCQuarter, - "Q": formatUnixTimestamp, - "s": formatUnixTimestampSeconds, - "S": formatUTCSeconds, - "u": formatUTCWeekdayNumberMonday, - "U": formatUTCWeekNumberSunday, - "V": formatUTCWeekNumberISO, - "w": formatUTCWeekdayNumberSunday, - "W": formatUTCWeekNumberMonday, - "x": null, - "X": null, - "y": formatUTCYear, - "Y": formatUTCFullYear, - "Z": formatUTCZone, - "%": formatLiteralPercent - }; - - var parses = { - "a": parseShortWeekday, - "A": parseWeekday, - "b": parseShortMonth, - "B": parseMonth, - "c": parseLocaleDateTime, - "d": parseDayOfMonth, - "e": parseDayOfMonth, - "f": parseMicroseconds, - "H": parseHour24, - "I": parseHour24, - "j": parseDayOfYear, - "L": parseMilliseconds, - "m": parseMonthNumber, - "M": parseMinutes, - "p": parsePeriod, - "q": 
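- // (Sketch, assuming the public d3-time-format API assembled from this table:
- //   d3.timeParse("%Y-%m-%d")("2024-05-07") // local Date for May 7, 2024
- // each directive maps to one of the parse* helpers defined further below.)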
parseQuarter, - "Q": parseUnixTimestamp, - "s": parseUnixTimestampSeconds, - "S": parseSeconds, - "u": parseWeekdayNumberMonday, - "U": parseWeekNumberSunday, - "V": parseWeekNumberISO, - "w": parseWeekdayNumberSunday, - "W": parseWeekNumberMonday, - "x": parseLocaleDate, - "X": parseLocaleTime, - "y": parseYear, - "Y": parseFullYear, - "Z": parseZone, - "%": parseLiteralPercent - }; - - // These recursive directive definitions must be deferred. - formats.x = newFormat(locale_date, formats); - formats.X = newFormat(locale_time, formats); - formats.c = newFormat(locale_dateTime, formats); - utcFormats.x = newFormat(locale_date, utcFormats); - utcFormats.X = newFormat(locale_time, utcFormats); - utcFormats.c = newFormat(locale_dateTime, utcFormats); - - function newFormat(specifier, formats) { - return function(date) { - var string = [], - i = -1, - j = 0, - n = specifier.length, - c, - pad, - format; - - if (!(date instanceof Date)) date = new Date(+date); - - while (++i < n) { - if (specifier.charCodeAt(i) === 37) { - string.push(specifier.slice(j, i)); - if ((pad = pads[c = specifier.charAt(++i)]) != null) c = specifier.charAt(++i); - else pad = c === "e" ? " " : "0"; - if (format = formats[c]) c = format(date, pad); - string.push(c); - j = i + 1; - } - } - - string.push(specifier.slice(j, i)); - return string.join(""); - }; - } - - function newParse(specifier, Z) { - return function(string) { - var d = newDate(1900, undefined, 1), - i = parseSpecifier(d, specifier, string += "", 0), - week, day$1; - if (i != string.length) return null; - - // If a UNIX timestamp is specified, return it. - if ("Q" in d) return new Date(d.Q); - if ("s" in d) return new Date(d.s * 1000 + ("L" in d ? d.L : 0)); - - // If this is utcParse, never use the local timezone. - if (Z && !("Z" in d)) d.Z = 0; - - // The am-pm flag is 0 for AM, and 1 for PM. - if ("p" in d) d.H = d.H % 12 + d.p * 12; - - // If the month was not specified, inherit from the quarter. - if (d.m === undefined) d.m = "q" in d ? d.q : 0; - - // Convert day-of-week and week-of-year to day-of-year. - if ("V" in d) { - if (d.V < 1 || d.V > 53) return null; - if (!("w" in d)) d.w = 1; - if ("Z" in d) { - week = utcDate(newDate(d.y, 0, 1)), day$1 = week.getUTCDay(); - week = day$1 > 4 || day$1 === 0 ? utcMonday.ceil(week) : utcMonday(week); - week = utcDay.offset(week, (d.V - 1) * 7); - d.y = week.getUTCFullYear(); - d.m = week.getUTCMonth(); - d.d = week.getUTCDate() + (d.w + 6) % 7; - } else { - week = localDate(newDate(d.y, 0, 1)), day$1 = week.getDay(); - week = day$1 > 4 || day$1 === 0 ? monday.ceil(week) : monday(week); - week = day.offset(week, (d.V - 1) * 7); - d.y = week.getFullYear(); - d.m = week.getMonth(); - d.d = week.getDate() + (d.w + 6) % 7; - } - } else if ("W" in d || "U" in d) { - if (!("w" in d)) d.w = "u" in d ? d.u % 7 : "W" in d ? 1 : 0; - day$1 = "Z" in d ? utcDate(newDate(d.y, 0, 1)).getUTCDay() : localDate(newDate(d.y, 0, 1)).getDay(); - d.m = 0; - d.d = "W" in d ? (d.w + 6) % 7 + d.W * 7 - (day$1 + 5) % 7 : d.w + d.U * 7 - (day$1 + 6) % 7; - } - - // If a time zone is specified, all fields are interpreted as UTC and then - // offset according to the specified time zone. - if ("Z" in d) { - d.H += d.Z / 100 | 0; - d.M += d.Z % 100; - return utcDate(d); - } - - // Otherwise, all fields are in local time. 
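- // (Editorial sketch of the resolution order in this parser: a parsed "Q" or
- // "s" field short-circuits to a UNIX timestamp, "Z" forces UTC arithmetic,
- // and week fields ("V", "W", "U") are first converted to a concrete month
- // and day before localDate/utcDate builds the result.)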
- return localDate(d); - }; - } - - function parseSpecifier(d, specifier, string, j) { - var i = 0, - n = specifier.length, - m = string.length, - c, - parse; - - while (i < n) { - if (j >= m) return -1; - c = specifier.charCodeAt(i++); - if (c === 37) { - c = specifier.charAt(i++); - parse = parses[c in pads ? specifier.charAt(i++) : c]; - if (!parse || ((j = parse(d, string, j)) < 0)) return -1; - } else if (c != string.charCodeAt(j++)) { - return -1; - } - } - - return j; - } - - function parsePeriod(d, string, i) { - var n = periodRe.exec(string.slice(i)); - return n ? (d.p = periodLookup[n[0].toLowerCase()], i + n[0].length) : -1; - } - - function parseShortWeekday(d, string, i) { - var n = shortWeekdayRe.exec(string.slice(i)); - return n ? (d.w = shortWeekdayLookup[n[0].toLowerCase()], i + n[0].length) : -1; - } - - function parseWeekday(d, string, i) { - var n = weekdayRe.exec(string.slice(i)); - return n ? (d.w = weekdayLookup[n[0].toLowerCase()], i + n[0].length) : -1; - } - - function parseShortMonth(d, string, i) { - var n = shortMonthRe.exec(string.slice(i)); - return n ? (d.m = shortMonthLookup[n[0].toLowerCase()], i + n[0].length) : -1; - } - - function parseMonth(d, string, i) { - var n = monthRe.exec(string.slice(i)); - return n ? (d.m = monthLookup[n[0].toLowerCase()], i + n[0].length) : -1; - } - - function parseLocaleDateTime(d, string, i) { - return parseSpecifier(d, locale_dateTime, string, i); - } - - function parseLocaleDate(d, string, i) { - return parseSpecifier(d, locale_date, string, i); - } - - function parseLocaleTime(d, string, i) { - return parseSpecifier(d, locale_time, string, i); - } - - function formatShortWeekday(d) { - return locale_shortWeekdays[d.getDay()]; - } - - function formatWeekday(d) { - return locale_weekdays[d.getDay()]; - } - - function formatShortMonth(d) { - return locale_shortMonths[d.getMonth()]; - } - - function formatMonth(d) { - return locale_months[d.getMonth()]; - } - - function formatPeriod(d) { - return locale_periods[+(d.getHours() >= 12)]; - } - - function formatQuarter(d) { - return 1 + ~~(d.getMonth() / 3); - } - - function formatUTCShortWeekday(d) { - return locale_shortWeekdays[d.getUTCDay()]; - } - - function formatUTCWeekday(d) { - return locale_weekdays[d.getUTCDay()]; - } - - function formatUTCShortMonth(d) { - return locale_shortMonths[d.getUTCMonth()]; - } - - function formatUTCMonth(d) { - return locale_months[d.getUTCMonth()]; - } - - function formatUTCPeriod(d) { - return locale_periods[+(d.getUTCHours() >= 12)]; - } - - function formatUTCQuarter(d) { - return 1 + ~~(d.getUTCMonth() / 3); - } - - return { - format: function(specifier) { - var f = newFormat(specifier += "", formats); - f.toString = function() { return specifier; }; - return f; - }, - parse: function(specifier) { - var p = newParse(specifier += "", false); - p.toString = function() { return specifier; }; - return p; - }, - utcFormat: function(specifier) { - var f = newFormat(specifier += "", utcFormats); - f.toString = function() { return specifier; }; - return f; - }, - utcParse: function(specifier) { - var p = newParse(specifier += "", true); - p.toString = function() { return specifier; }; - return p; - } - }; - } - - var pads = {"-": "", "_": " ", "0": "0"}, - numberRe = /^\s*\d+/, // note: ignores next directive - percentRe = /^%/, - requoteRe = /[\\^$*+?|[\]().{}]/g; - - function pad(value, fill, width) { - var sign = value < 0 ? "-" : "", - string = (sign ? -value : value) + "", - length = string.length; - return sign + (length < width ? 
new Array(width - length + 1).join(fill) + string : string); - } - - function requote(s) { - return s.replace(requoteRe, "\\$&"); - } - - function formatRe(names) { - return new RegExp("^(?:" + names.map(requote).join("|") + ")", "i"); - } - - function formatLookup(names) { - var map = {}, i = -1, n = names.length; - while (++i < n) map[names[i].toLowerCase()] = i; - return map; - } - - function parseWeekdayNumberSunday(d, string, i) { - var n = numberRe.exec(string.slice(i, i + 1)); - return n ? (d.w = +n[0], i + n[0].length) : -1; - } - - function parseWeekdayNumberMonday(d, string, i) { - var n = numberRe.exec(string.slice(i, i + 1)); - return n ? (d.u = +n[0], i + n[0].length) : -1; - } - - function parseWeekNumberSunday(d, string, i) { - var n = numberRe.exec(string.slice(i, i + 2)); - return n ? (d.U = +n[0], i + n[0].length) : -1; - } - - function parseWeekNumberISO(d, string, i) { - var n = numberRe.exec(string.slice(i, i + 2)); - return n ? (d.V = +n[0], i + n[0].length) : -1; - } - - function parseWeekNumberMonday(d, string, i) { - var n = numberRe.exec(string.slice(i, i + 2)); - return n ? (d.W = +n[0], i + n[0].length) : -1; - } - - function parseFullYear(d, string, i) { - var n = numberRe.exec(string.slice(i, i + 4)); - return n ? (d.y = +n[0], i + n[0].length) : -1; - } - - function parseYear(d, string, i) { - var n = numberRe.exec(string.slice(i, i + 2)); - return n ? (d.y = +n[0] + (+n[0] > 68 ? 1900 : 2000), i + n[0].length) : -1; - } - - function parseZone(d, string, i) { - var n = /^(Z)|([+-]\d\d)(?::?(\d\d))?/.exec(string.slice(i, i + 6)); - return n ? (d.Z = n[1] ? 0 : -(n[2] + (n[3] || "00")), i + n[0].length) : -1; - } - - function parseQuarter(d, string, i) { - var n = numberRe.exec(string.slice(i, i + 1)); - return n ? (d.q = n[0] * 3 - 3, i + n[0].length) : -1; - } - - function parseMonthNumber(d, string, i) { - var n = numberRe.exec(string.slice(i, i + 2)); - return n ? (d.m = n[0] - 1, i + n[0].length) : -1; - } - - function parseDayOfMonth(d, string, i) { - var n = numberRe.exec(string.slice(i, i + 2)); - return n ? (d.d = +n[0], i + n[0].length) : -1; - } - - function parseDayOfYear(d, string, i) { - var n = numberRe.exec(string.slice(i, i + 3)); - return n ? (d.m = 0, d.d = +n[0], i + n[0].length) : -1; - } - - function parseHour24(d, string, i) { - var n = numberRe.exec(string.slice(i, i + 2)); - return n ? (d.H = +n[0], i + n[0].length) : -1; - } - - function parseMinutes(d, string, i) { - var n = numberRe.exec(string.slice(i, i + 2)); - return n ? (d.M = +n[0], i + n[0].length) : -1; - } - - function parseSeconds(d, string, i) { - var n = numberRe.exec(string.slice(i, i + 2)); - return n ? (d.S = +n[0], i + n[0].length) : -1; - } - - function parseMilliseconds(d, string, i) { - var n = numberRe.exec(string.slice(i, i + 3)); - return n ? (d.L = +n[0], i + n[0].length) : -1; - } - - function parseMicroseconds(d, string, i) { - var n = numberRe.exec(string.slice(i, i + 6)); - return n ? (d.L = Math.floor(n[0] / 1000), i + n[0].length) : -1; - } - - function parseLiteralPercent(d, string, i) { - var n = percentRe.exec(string.slice(i, i + 1)); - return n ? i + n[0].length : -1; - } - - function parseUnixTimestamp(d, string, i) { - var n = numberRe.exec(string.slice(i)); - return n ? (d.Q = +n[0], i + n[0].length) : -1; - } - - function parseUnixTimestampSeconds(d, string, i) { - var n = numberRe.exec(string.slice(i)); - return n ? 
(d.s = +n[0], i + n[0].length) : -1; - } - - function formatDayOfMonth(d, p) { - return pad(d.getDate(), p, 2); - } - - function formatHour24(d, p) { - return pad(d.getHours(), p, 2); - } - - function formatHour12(d, p) { - return pad(d.getHours() % 12 || 12, p, 2); - } - - function formatDayOfYear(d, p) { - return pad(1 + day.count(year(d), d), p, 3); - } - - function formatMilliseconds(d, p) { - return pad(d.getMilliseconds(), p, 3); - } - - function formatMicroseconds(d, p) { - return formatMilliseconds(d, p) + "000"; - } - - function formatMonthNumber(d, p) { - return pad(d.getMonth() + 1, p, 2); - } - - function formatMinutes(d, p) { - return pad(d.getMinutes(), p, 2); - } - - function formatSeconds(d, p) { - return pad(d.getSeconds(), p, 2); - } - - function formatWeekdayNumberMonday(d) { - var day = d.getDay(); - return day === 0 ? 7 : day; - } - - function formatWeekNumberSunday(d, p) { - return pad(sunday.count(year(d) - 1, d), p, 2); - } - - function formatWeekNumberISO(d, p) { - var day = d.getDay(); - d = (day >= 4 || day === 0) ? thursday(d) : thursday.ceil(d); - return pad(thursday.count(year(d), d) + (year(d).getDay() === 4), p, 2); - } - - function formatWeekdayNumberSunday(d) { - return d.getDay(); - } - - function formatWeekNumberMonday(d, p) { - return pad(monday.count(year(d) - 1, d), p, 2); - } - - function formatYear(d, p) { - return pad(d.getFullYear() % 100, p, 2); - } - - function formatFullYear(d, p) { - return pad(d.getFullYear() % 10000, p, 4); - } - - function formatZone(d) { - var z = d.getTimezoneOffset(); - return (z > 0 ? "-" : (z *= -1, "+")) - + pad(z / 60 | 0, "0", 2) - + pad(z % 60, "0", 2); - } - - function formatUTCDayOfMonth(d, p) { - return pad(d.getUTCDate(), p, 2); - } - - function formatUTCHour24(d, p) { - return pad(d.getUTCHours(), p, 2); - } - - function formatUTCHour12(d, p) { - return pad(d.getUTCHours() % 12 || 12, p, 2); - } - - function formatUTCDayOfYear(d, p) { - return pad(1 + utcDay.count(utcYear(d), d), p, 3); - } - - function formatUTCMilliseconds(d, p) { - return pad(d.getUTCMilliseconds(), p, 3); - } - - function formatUTCMicroseconds(d, p) { - return formatUTCMilliseconds(d, p) + "000"; - } - - function formatUTCMonthNumber(d, p) { - return pad(d.getUTCMonth() + 1, p, 2); - } - - function formatUTCMinutes(d, p) { - return pad(d.getUTCMinutes(), p, 2); - } - - function formatUTCSeconds(d, p) { - return pad(d.getUTCSeconds(), p, 2); - } - - function formatUTCWeekdayNumberMonday(d) { - var dow = d.getUTCDay(); - return dow === 0 ? 7 : dow; - } - - function formatUTCWeekNumberSunday(d, p) { - return pad(utcSunday.count(utcYear(d) - 1, d), p, 2); - } - - function formatUTCWeekNumberISO(d, p) { - var day = d.getUTCDay(); - d = (day >= 4 || day === 0) ? 
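- // (Sketch: the ISO week-number formatters snap the date to its week's
- // Thursday, so week 1 is the week containing the year's first Thursday;
- // e.g. d3.timeFormat("%V")(new Date(2021, 0, 1)) yields "53", since
- // Jan 1, 2021 belongs to ISO week 53 of 2020.)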
utcThursday(d) : utcThursday.ceil(d); - return pad(utcThursday.count(utcYear(d), d) + (utcYear(d).getUTCDay() === 4), p, 2); - } - - function formatUTCWeekdayNumberSunday(d) { - return d.getUTCDay(); - } - - function formatUTCWeekNumberMonday(d, p) { - return pad(utcMonday.count(utcYear(d) - 1, d), p, 2); - } - - function formatUTCYear(d, p) { - return pad(d.getUTCFullYear() % 100, p, 2); - } - - function formatUTCFullYear(d, p) { - return pad(d.getUTCFullYear() % 10000, p, 4); - } - - function formatUTCZone() { - return "+0000"; - } - - function formatLiteralPercent() { - return "%"; - } - - function formatUnixTimestamp(d) { - return +d; - } - - function formatUnixTimestampSeconds(d) { - return Math.floor(+d / 1000); - } - - var locale$1; - var timeFormat; - var timeParse; - var utcFormat; - var utcParse; - - defaultLocale$1({ - dateTime: "%x, %X", - date: "%-m/%-d/%Y", - time: "%-I:%M:%S %p", - periods: ["AM", "PM"], - days: ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"], - shortDays: ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"], - months: ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"], - shortMonths: ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"] - }); - - function defaultLocale$1(definition) { - locale$1 = formatLocale$1(definition); - timeFormat = locale$1.format; - timeParse = locale$1.parse; - utcFormat = locale$1.utcFormat; - utcParse = locale$1.utcParse; - return locale$1; - } - - var isoSpecifier = "%Y-%m-%dT%H:%M:%S.%LZ"; - - function formatIsoNative(date) { - return date.toISOString(); - } - - var formatIso = Date.prototype.toISOString - ? formatIsoNative - : utcFormat(isoSpecifier); - - function parseIsoNative(string) { - var date = new Date(string); - return isNaN(date) ? null : date; - } - - var parseIso = +new Date("2000-01-01T00:00:00.000Z") - ? parseIsoNative - : utcParse(isoSpecifier); - - var noop = {value: function() {}}; - - function dispatch() { - for (var i = 0, n = arguments.length, _ = {}, t; i < n; ++i) { - if (!(t = arguments[i] + "") || (t in _) || /[\s.]/.test(t)) throw new Error("illegal type: " + t); - _[t] = []; - } - return new Dispatch(_); - } - - function Dispatch(_) { - this._ = _; - } - - function parseTypenames(typenames, types) { - return typenames.trim().split(/^|\s+/).map(function(t) { - var name = "", i = t.indexOf("."); - if (i >= 0) name = t.slice(i + 1), t = t.slice(0, i); - if (t && !types.hasOwnProperty(t)) throw new Error("unknown type: " + t); - return {type: t, name: name}; - }); - } - - Dispatch.prototype = dispatch.prototype = { - constructor: Dispatch, - on: function(typename, callback) { - var _ = this._, - T = parseTypenames(typename + "", _), - t, - i = -1, - n = T.length; - - // If no callback was specified, return the callback of the given type and name. - if (arguments.length < 2) { - while (++i < n) if ((t = (typename = T[i]).type) && (t = get(_[t], typename.name))) return t; - return; - } - - // If a type was specified, set the callback for the given type and name. - // Otherwise, if a null callback was specified, remove callbacks of the given name. 
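- // (Sketch, assuming the public d3-dispatch API implemented here:
- //   var d = dispatch("start", "end");
- //   d.on("start.foo", callback); // ".foo" names the listener slot
- //   d.on("start.foo", null);     // removes only that named listener
- // )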
- if (callback != null && typeof callback !== "function") throw new Error("invalid callback: " + callback); - while (++i < n) { - if (t = (typename = T[i]).type) _[t] = set(_[t], typename.name, callback); - else if (callback == null) for (t in _) _[t] = set(_[t], typename.name, null); - } - - return this; - }, - copy: function() { - var copy = {}, _ = this._; - for (var t in _) copy[t] = _[t].slice(); - return new Dispatch(copy); - }, - call: function(type, that) { - if ((n = arguments.length - 2) > 0) for (var args = new Array(n), i = 0, n, t; i < n; ++i) args[i] = arguments[i + 2]; - if (!this._.hasOwnProperty(type)) throw new Error("unknown type: " + type); - for (t = this._[type], i = 0, n = t.length; i < n; ++i) t[i].value.apply(that, args); - }, - apply: function(type, that, args) { - if (!this._.hasOwnProperty(type)) throw new Error("unknown type: " + type); - for (var t = this._[type], i = 0, n = t.length; i < n; ++i) t[i].value.apply(that, args); - } - }; - - function get(type, name) { - for (var i = 0, n = type.length, c; i < n; ++i) { - if ((c = type[i]).name === name) { - return c.value; - } - } - } - - function set(type, name, callback) { - for (var i = 0, n = type.length; i < n; ++i) { - if (type[i].name === name) { - type[i] = noop, type = type.slice(0, i).concat(type.slice(i + 1)); - break; - } - } - if (callback != null) type.push({name: name, value: callback}); - return type; - } - - var xhtml = "http://www.w3.org/1999/xhtml"; - - var namespaces = { - svg: "http://www.w3.org/2000/svg", - xhtml: xhtml, - xlink: "http://www.w3.org/1999/xlink", - xml: "http://www.w3.org/XML/1998/namespace", - xmlns: "http://www.w3.org/2000/xmlns/" - }; - - function namespace(name) { - var prefix = name += "", i = prefix.indexOf(":"); - if (i >= 0 && (prefix = name.slice(0, i)) !== "xmlns") name = name.slice(i + 1); - return namespaces.hasOwnProperty(prefix) ? {space: namespaces[prefix], local: name} : name; - } - - function creatorInherit(name) { - return function() { - var document = this.ownerDocument, - uri = this.namespaceURI; - return uri === xhtml && document.documentElement.namespaceURI === xhtml - ? document.createElement(name) - : document.createElementNS(uri, name); - }; - } - - function creatorFixed(fullname) { - return function() { - return this.ownerDocument.createElementNS(fullname.space, fullname.local); - }; - } - - function creator(name) { - var fullname = namespace(name); - return (fullname.local - ? creatorFixed - : creatorInherit)(fullname); - } - - function none() {} - - function selector(selector) { - return selector == null ? none : function() { - return this.querySelector(selector); - }; - } - - function selection_select(select) { - if (typeof select !== "function") select = selector(select); - - for (var groups = this._groups, m = groups.length, subgroups = new Array(m), j = 0; j < m; ++j) { - for (var group = groups[j], n = group.length, subgroup = subgroups[j] = new Array(n), node, subnode, i = 0; i < n; ++i) { - if ((node = group[i]) && (subnode = select.call(node, node.__data__, i, group))) { - if ("__data__" in node) subnode.__data__ = node.__data__; - subgroup[i] = subnode; - } - } - } - - return new Selection(subgroups, this._parents); - } - - function empty() { - return []; - } - - function selectorAll(selector) { - return selector == null ? 
empty : function() { - return this.querySelectorAll(selector); - }; - } - - function selection_selectAll(select) { - if (typeof select !== "function") select = selectorAll(select); - - for (var groups = this._groups, m = groups.length, subgroups = [], parents = [], j = 0; j < m; ++j) { - for (var group = groups[j], n = group.length, node, i = 0; i < n; ++i) { - if (node = group[i]) { - subgroups.push(select.call(node, node.__data__, i, group)); - parents.push(node); - } - } - } - - return new Selection(subgroups, parents); - } - - function matcher(selector) { - return function() { - return this.matches(selector); - }; - } - - function selection_filter(match) { - if (typeof match !== "function") match = matcher(match); - - for (var groups = this._groups, m = groups.length, subgroups = new Array(m), j = 0; j < m; ++j) { - for (var group = groups[j], n = group.length, subgroup = subgroups[j] = [], node, i = 0; i < n; ++i) { - if ((node = group[i]) && match.call(node, node.__data__, i, group)) { - subgroup.push(node); - } - } - } - - return new Selection(subgroups, this._parents); - } - - function sparse(update) { - return new Array(update.length); - } - - function selection_enter() { - return new Selection(this._enter || this._groups.map(sparse), this._parents); - } - - function EnterNode(parent, datum) { - this.ownerDocument = parent.ownerDocument; - this.namespaceURI = parent.namespaceURI; - this._next = null; - this._parent = parent; - this.__data__ = datum; - } - - EnterNode.prototype = { - constructor: EnterNode, - appendChild: function(child) { return this._parent.insertBefore(child, this._next); }, - insertBefore: function(child, next) { return this._parent.insertBefore(child, next); }, - querySelector: function(selector) { return this._parent.querySelector(selector); }, - querySelectorAll: function(selector) { return this._parent.querySelectorAll(selector); } - }; - - function constant$2(x) { - return function() { - return x; - }; - } - - var keyPrefix = "$"; // Protect against keys like “__proto__”. - - function bindIndex(parent, group, enter, update, exit, data) { - var i = 0, - node, - groupLength = group.length, - dataLength = data.length; - - // Put any non-null nodes that fit into update. - // Put any null nodes into enter. - // Put any remaining data into enter. - for (; i < dataLength; ++i) { - if (node = group[i]) { - node.__data__ = data[i]; - update[i] = node; - } else { - enter[i] = new EnterNode(parent, data[i]); - } - } - - // Put any non-null nodes that don’t fit into exit. - for (; i < groupLength; ++i) { - if (node = group[i]) { - exit[i] = node; - } - } - } - - function bindKey(parent, group, enter, update, exit, data, key) { - var i, - node, - nodeByKeyValue = {}, - groupLength = group.length, - dataLength = data.length, - keyValues = new Array(groupLength), - keyValue; - - // Compute the key for each node. - // If multiple nodes have the same key, the duplicates are added to exit. - for (i = 0; i < groupLength; ++i) { - if (node = group[i]) { - keyValues[i] = keyValue = keyPrefix + key.call(node, node.__data__, i, group); - if (keyValue in nodeByKeyValue) { - exit[i] = node; - } else { - nodeByKeyValue[keyValue] = node; - } - } - } - - // Compute the key for each datum. - // If there is a node associated with this key, join and add it to update. - // If there is not (or the key is a duplicate), add it to enter. 
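- // (Sketch, assuming the public selection.data API over bindIndex/bindKey:
- //   svg.selectAll("circle").data(nodes, function(d) { return d.id; })
- // joins by key; omitting the key function joins by index via bindIndex.)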
- for (i = 0; i < dataLength; ++i) { - keyValue = keyPrefix + key.call(parent, data[i], i, data); - if (node = nodeByKeyValue[keyValue]) { - update[i] = node; - node.__data__ = data[i]; - nodeByKeyValue[keyValue] = null; - } else { - enter[i] = new EnterNode(parent, data[i]); - } - } - - // Add any remaining nodes that were not bound to data to exit. - for (i = 0; i < groupLength; ++i) { - if ((node = group[i]) && (nodeByKeyValue[keyValues[i]] === node)) { - exit[i] = node; - } - } - } - - function selection_data(value, key) { - if (!value) { - data = new Array(this.size()), j = -1; - this.each(function(d) { data[++j] = d; }); - return data; - } - - var bind = key ? bindKey : bindIndex, - parents = this._parents, - groups = this._groups; - - if (typeof value !== "function") value = constant$2(value); - - for (var m = groups.length, update = new Array(m), enter = new Array(m), exit = new Array(m), j = 0; j < m; ++j) { - var parent = parents[j], - group = groups[j], - groupLength = group.length, - data = value.call(parent, parent && parent.__data__, j, parents), - dataLength = data.length, - enterGroup = enter[j] = new Array(dataLength), - updateGroup = update[j] = new Array(dataLength), - exitGroup = exit[j] = new Array(groupLength); - - bind(parent, group, enterGroup, updateGroup, exitGroup, data, key); - - // Now connect the enter nodes to their following update node, such that - // appendChild can insert the materialized enter node before this node, - // rather than at the end of the parent node. - for (var i0 = 0, i1 = 0, previous, next; i0 < dataLength; ++i0) { - if (previous = enterGroup[i0]) { - if (i0 >= i1) i1 = i0 + 1; - while (!(next = updateGroup[i1]) && ++i1 < dataLength); - previous._next = next || null; - } - } - } - - update = new Selection(update, parents); - update._enter = enter; - update._exit = exit; - return update; - } - - function selection_exit() { - return new Selection(this._exit || this._groups.map(sparse), this._parents); - } - - function selection_join(onenter, onupdate, onexit) { - var enter = this.enter(), update = this, exit = this.exit(); - enter = typeof onenter === "function" ? onenter(enter) : enter.append(onenter + ""); - if (onupdate != null) update = onupdate(update); - if (onexit == null) exit.remove(); else onexit(exit); - return enter && update ? enter.merge(update).order() : update; - } - - function selection_merge(selection) { - - for (var groups0 = this._groups, groups1 = selection._groups, m0 = groups0.length, m1 = groups1.length, m = Math.min(m0, m1), merges = new Array(m0), j = 0; j < m; ++j) { - for (var group0 = groups0[j], group1 = groups1[j], n = group0.length, merge = merges[j] = new Array(n), node, i = 0; i < n; ++i) { - if (node = group0[i] || group1[i]) { - merge[i] = node; - } - } - } - - for (; j < m0; ++j) { - merges[j] = groups0[j]; - } - - return new Selection(merges, this._parents); - } - - function selection_order() { - - for (var groups = this._groups, j = -1, m = groups.length; ++j < m;) { - for (var group = groups[j], i = group.length - 1, next = group[i], node; --i >= 0;) { - if (node = group[i]) { - if (next && node.compareDocumentPosition(next) ^ 4) next.parentNode.insertBefore(node, next); - next = node; - } - } - } - - return this; - } - - function selection_sort(compare) { - if (!compare) compare = ascending$1; - - function compareNode(a, b) { - return a && b ? 
compare(a.__data__, b.__data__) : !a - !b; - } - - for (var groups = this._groups, m = groups.length, sortgroups = new Array(m), j = 0; j < m; ++j) { - for (var group = groups[j], n = group.length, sortgroup = sortgroups[j] = new Array(n), node, i = 0; i < n; ++i) { - if (node = group[i]) { - sortgroup[i] = node; - } - } - sortgroup.sort(compareNode); - } - - return new Selection(sortgroups, this._parents).order(); - } - - function ascending$1(a, b) { - return a < b ? -1 : a > b ? 1 : a >= b ? 0 : NaN; - } - - function selection_call() { - var callback = arguments[0]; - arguments[0] = this; - callback.apply(null, arguments); - return this; - } - - function selection_nodes() { - var nodes = new Array(this.size()), i = -1; - this.each(function() { nodes[++i] = this; }); - return nodes; - } - - function selection_node() { - - for (var groups = this._groups, j = 0, m = groups.length; j < m; ++j) { - for (var group = groups[j], i = 0, n = group.length; i < n; ++i) { - var node = group[i]; - if (node) return node; - } - } - - return null; - } - - function selection_size() { - var size = 0; - this.each(function() { ++size; }); - return size; - } - - function selection_empty() { - return !this.node(); - } - - function selection_each(callback) { - - for (var groups = this._groups, j = 0, m = groups.length; j < m; ++j) { - for (var group = groups[j], i = 0, n = group.length, node; i < n; ++i) { - if (node = group[i]) callback.call(node, node.__data__, i, group); - } - } - - return this; - } - - function attrRemove(name) { - return function() { - this.removeAttribute(name); - }; - } - - function attrRemoveNS(fullname) { - return function() { - this.removeAttributeNS(fullname.space, fullname.local); - }; - } - - function attrConstant(name, value) { - return function() { - this.setAttribute(name, value); - }; - } - - function attrConstantNS(fullname, value) { - return function() { - this.setAttributeNS(fullname.space, fullname.local, value); - }; - } - - function attrFunction(name, value) { - return function() { - var v = value.apply(this, arguments); - if (v == null) this.removeAttribute(name); - else this.setAttribute(name, v); - }; - } - - function attrFunctionNS(fullname, value) { - return function() { - var v = value.apply(this, arguments); - if (v == null) this.removeAttributeNS(fullname.space, fullname.local); - else this.setAttributeNS(fullname.space, fullname.local, v); - }; - } - - function selection_attr(name, value) { - var fullname = namespace(name); - - if (arguments.length < 2) { - var node = this.node(); - return fullname.local - ? node.getAttributeNS(fullname.space, fullname.local) - : node.getAttribute(fullname); - } - - return this.each((value == null - ? (fullname.local ? attrRemoveNS : attrRemove) : (typeof value === "function" - ? (fullname.local ? attrFunctionNS : attrFunction) - : (fullname.local ? 
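// selection_attr (below) dispatches on its arguments: one argument reads from
// the first node, null removes, a function recomputes per node, anything else
// is set as a constant; the *NS variants cover namespaced names like
// "xlink:href". A small sketch (the selection `sel` is hypothetical):
//
//   sel.attr('fill');                   // getter: value on the first node
//   sel.attr('fill', 'steelblue');      // same constant on every node
//   sel.attr('cx', (d, i) => i * 10);   // per-node, from datum and index
//   sel.attr('fill', null);             // removes the attribute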
attrConstantNS : attrConstant)))(fullname, value)); - } - - function defaultView(node) { - return (node.ownerDocument && node.ownerDocument.defaultView) // node is a Node - || (node.document && node) // node is a Window - || node.defaultView; // node is a Document - } - - function styleRemove(name) { - return function() { - this.style.removeProperty(name); - }; - } - - function styleConstant(name, value, priority) { - return function() { - this.style.setProperty(name, value, priority); - }; - } - - function styleFunction(name, value, priority) { - return function() { - var v = value.apply(this, arguments); - if (v == null) this.style.removeProperty(name); - else this.style.setProperty(name, v, priority); - }; - } - - function selection_style(name, value, priority) { - return arguments.length > 1 - ? this.each((value == null - ? styleRemove : typeof value === "function" - ? styleFunction - : styleConstant)(name, value, priority == null ? "" : priority)) - : styleValue(this.node(), name); - } - - function styleValue(node, name) { - return node.style.getPropertyValue(name) - || defaultView(node).getComputedStyle(node, null).getPropertyValue(name); - } - - function propertyRemove(name) { - return function() { - delete this[name]; - }; - } - - function propertyConstant(name, value) { - return function() { - this[name] = value; - }; - } - - function propertyFunction(name, value) { - return function() { - var v = value.apply(this, arguments); - if (v == null) delete this[name]; - else this[name] = v; - }; - } - - function selection_property(name, value) { - return arguments.length > 1 - ? this.each((value == null - ? propertyRemove : typeof value === "function" - ? propertyFunction - : propertyConstant)(name, value)) - : this.node()[name]; - } - - function classArray(string) { - return string.trim().split(/^|\s+/); - } - - function classList(node) { - return node.classList || new ClassList(node); - } - - function ClassList(node) { - this._node = node; - this._names = classArray(node.getAttribute("class") || ""); - } - - ClassList.prototype = { - add: function(name) { - var i = this._names.indexOf(name); - if (i < 0) { - this._names.push(name); - this._node.setAttribute("class", this._names.join(" ")); - } - }, - remove: function(name) { - var i = this._names.indexOf(name); - if (i >= 0) { - this._names.splice(i, 1); - this._node.setAttribute("class", this._names.join(" ")); - } - }, - contains: function(name) { - return this._names.indexOf(name) >= 0; - } - }; - - function classedAdd(node, names) { - var list = classList(node), i = -1, n = names.length; - while (++i < n) list.add(names[i]); - } - - function classedRemove(node, names) { - var list = classList(node), i = -1, n = names.length; - while (++i < n) list.remove(names[i]); - } - - function classedTrue(names) { - return function() { - classedAdd(this, names); - }; - } - - function classedFalse(names) { - return function() { - classedRemove(this, names); - }; - } - - function classedFunction(names, value) { - return function() { - (value.apply(this, arguments) ? classedAdd : classedRemove)(this, names); - }; - } - - function selection_classed(name, value) { - var names = classArray(name + ""); - - if (arguments.length < 2) { - var list = classList(this.node()), i = -1, n = names.length; - while (++i < n) if (!list.contains(names[i])) return false; - return true; - } - - return this.each((typeof value === "function" - ? classedFunction : value - ? 
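// classed() follows the same dispatch pattern: with one argument it reports
// whether the first node carries every listed class, with two it adds or
// removes the whole space-separated list. A sketch (`sel` is hypothetical):
//
//   sel.classed('active', true);            // add "active" everywhere
//   sel.classed('active', d => d.isLive);   // toggle per node from its datum
//   sel.classed('active focus');            // true iff both classes are present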
classedTrue - : classedFalse)(names, value)); - } - - function textRemove() { - this.textContent = ""; - } - - function textConstant(value) { - return function() { - this.textContent = value; - }; - } - - function textFunction(value) { - return function() { - var v = value.apply(this, arguments); - this.textContent = v == null ? "" : v; - }; - } - - function selection_text(value) { - return arguments.length - ? this.each(value == null - ? textRemove : (typeof value === "function" - ? textFunction - : textConstant)(value)) - : this.node().textContent; - } - - function htmlRemove() { - this.innerHTML = ""; - } - - function htmlConstant(value) { - return function() { - this.innerHTML = value; - }; - } - - function htmlFunction(value) { - return function() { - var v = value.apply(this, arguments); - this.innerHTML = v == null ? "" : v; - }; - } - - function selection_html(value) { - return arguments.length - ? this.each(value == null - ? htmlRemove : (typeof value === "function" - ? htmlFunction - : htmlConstant)(value)) - : this.node().innerHTML; - } - - function raise() { - if (this.nextSibling) this.parentNode.appendChild(this); - } - - function selection_raise() { - return this.each(raise); - } - - function lower() { - if (this.previousSibling) this.parentNode.insertBefore(this, this.parentNode.firstChild); - } - - function selection_lower() { - return this.each(lower); - } - - function selection_append(name) { - var create = typeof name === "function" ? name : creator(name); - return this.select(function() { - return this.appendChild(create.apply(this, arguments)); - }); - } - - function constantNull() { - return null; - } - - function selection_insert(name, before) { - var create = typeof name === "function" ? name : creator(name), - select = before == null ? constantNull : typeof before === "function" ? before : selector(before); - return this.select(function() { - return this.insertBefore(create.apply(this, arguments), select.apply(this, arguments) || null); - }); - } - - function remove() { - var parent = this.parentNode; - if (parent) parent.removeChild(this); - } - - function selection_remove() { - return this.each(remove); - } - - function selection_cloneShallow() { - var clone = this.cloneNode(false), parent = this.parentNode; - return parent ? parent.insertBefore(clone, this.nextSibling) : clone; - } - - function selection_cloneDeep() { - var clone = this.cloneNode(true), parent = this.parentNode; - return parent ? parent.insertBefore(clone, this.nextSibling) : clone; - } - - function selection_clone(deep) { - return this.select(deep ? selection_cloneDeep : selection_cloneShallow); - } - - function selection_datum(value) { - return arguments.length - ? this.property("__data__", value) - : this.node().__data__; - } - - var filterEvents = {}; - - var event = null; - - if (typeof document !== "undefined") { - var element = document.documentElement; - if (!("onmouseenter" in element)) { - filterEvents = {mouseenter: "mouseover", mouseleave: "mouseout"}; - } - } - - function filterContextListener(listener, index, group) { - listener = contextListener(listener, index, group); - return function(event) { - var related = event.relatedTarget; - if (!related || (related !== this && !(related.compareDocumentPosition(this) & 8))) { - listener.call(this, event); - } - }; - } - - function contextListener(listener, index, group) { - return function(event1) { - var event0 = event; // Events can be reentrant (e.g., focus). 
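// Saving the module-global `event` here and restoring it in the `finally`
// below is what makes the wrapper reentrancy-safe: if this listener
// synchronously triggers another dispatch (say a "mouseup" handler that
// calls .focus() and thereby fires "focus" listeners), the inner invocation
// swaps `event` for its own and then puts the outer event back before
// the outer listener resumes.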
- event = event1; - try { - listener.call(this, this.__data__, index, group); - } finally { - event = event0; - } - }; - } - - function parseTypenames$1(typenames) { - return typenames.trim().split(/^|\s+/).map(function(t) { - var name = "", i = t.indexOf("."); - if (i >= 0) name = t.slice(i + 1), t = t.slice(0, i); - return {type: t, name: name}; - }); - } - - function onRemove(typename) { - return function() { - var on = this.__on; - if (!on) return; - for (var j = 0, i = -1, m = on.length, o; j < m; ++j) { - if (o = on[j], (!typename.type || o.type === typename.type) && o.name === typename.name) { - this.removeEventListener(o.type, o.listener, o.capture); - } else { - on[++i] = o; - } - } - if (++i) on.length = i; - else delete this.__on; - }; - } - - function onAdd(typename, value, capture) { - var wrap = filterEvents.hasOwnProperty(typename.type) ? filterContextListener : contextListener; - return function(d, i, group) { - var on = this.__on, o, listener = wrap(value, i, group); - if (on) for (var j = 0, m = on.length; j < m; ++j) { - if ((o = on[j]).type === typename.type && o.name === typename.name) { - this.removeEventListener(o.type, o.listener, o.capture); - this.addEventListener(o.type, o.listener = listener, o.capture = capture); - o.value = value; - return; - } - } - this.addEventListener(typename.type, listener, capture); - o = {type: typename.type, name: typename.name, value: value, listener: listener, capture: capture}; - if (!on) this.__on = [o]; - else on.push(o); - }; - } - - function selection_on(typename, value, capture) { - var typenames = parseTypenames$1(typename + ""), i, n = typenames.length, t; - - if (arguments.length < 2) { - var on = this.node().__on; - if (on) for (var j = 0, m = on.length, o; j < m; ++j) { - for (i = 0, o = on[j]; i < n; ++i) { - if ((t = typenames[i]).type === o.type && t.name === o.name) { - return o.value; - } - } - } - return; - } - - on = value ? onAdd : onRemove; - if (capture == null) capture = false; - for (i = 0; i < n; ++i) this.each(on(typenames[i], value, capture)); - return this; - } - - function customEvent(event1, listener, that, args) { - var event0 = event; - event1.sourceEvent = event; - event = event1; - try { - return listener.apply(that, args); - } finally { - event = event0; - } - } - - function dispatchEvent(node, type, params) { - var window = defaultView(node), - event = window.CustomEvent; - - if (typeof event === "function") { - event = new event(type, params); - } else { - event = window.document.createEvent("Event"); - if (params) event.initEvent(type, params.bubbles, params.cancelable), event.detail = params.detail; - else event.initEvent(type, false, false); - } - - node.dispatchEvent(event); - } - - function dispatchConstant(type, params) { - return function() { - return dispatchEvent(this, type, params); - }; - } - - function dispatchFunction(type, params) { - return function() { - return dispatchEvent(this, type, params.apply(this, arguments)); - }; - } - - function selection_dispatch(type, params) { - return this.each((typeof params === "function" - ? 
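// on() and dispatch() round-trip through the same typename parsing, so
// custom-event plumbing looks like this hedged sketch (`sel` and the
// "statechange" type are hypothetical):
//
//   sel.on('statechange.logger', function(d) {
//     console.log('state is now', d.state, event);  // module-global `event`
//   });
//   sel.dispatch('statechange', { bubbles: true, cancelable: false });
//   sel.on('statechange.logger', null);  // the ".logger" name removes just this handler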
dispatchFunction - : dispatchConstant)(type, params)); - } - - var root = [null]; - - function Selection(groups, parents) { - this._groups = groups; - this._parents = parents; - } - - function selection() { - return new Selection([[document.documentElement]], root); - } - - Selection.prototype = selection.prototype = { - constructor: Selection, - select: selection_select, - selectAll: selection_selectAll, - filter: selection_filter, - data: selection_data, - enter: selection_enter, - exit: selection_exit, - join: selection_join, - merge: selection_merge, - order: selection_order, - sort: selection_sort, - call: selection_call, - nodes: selection_nodes, - node: selection_node, - size: selection_size, - empty: selection_empty, - each: selection_each, - attr: selection_attr, - style: selection_style, - property: selection_property, - classed: selection_classed, - text: selection_text, - html: selection_html, - raise: selection_raise, - lower: selection_lower, - append: selection_append, - insert: selection_insert, - remove: selection_remove, - clone: selection_clone, - datum: selection_datum, - on: selection_on, - dispatch: selection_dispatch - }; - - function select(selector) { - return typeof selector === "string" - ? new Selection([[document.querySelector(selector)]], [document.documentElement]) - : new Selection([[selector]], root); - } - - function sourceEvent() { - var current = event, source; - while (source = current.sourceEvent) current = source; - return current; - } - - function point(node, event) { - var svg = node.ownerSVGElement || node; - - if (svg.createSVGPoint) { - var point = svg.createSVGPoint(); - point.x = event.clientX, point.y = event.clientY; - point = point.matrixTransform(node.getScreenCTM().inverse()); - return [point.x, point.y]; - } - - var rect = node.getBoundingClientRect(); - return [event.clientX - rect.left - node.clientLeft, event.clientY - rect.top - node.clientTop]; - } - - function mouse(node) { - var event = sourceEvent(); - if (event.changedTouches) event = event.changedTouches[0]; - return point(node, event); - } - - function touch(node, touches, identifier) { - if (arguments.length < 3) identifier = touches, touches = sourceEvent().changedTouches; - - for (var i = 0, n = touches ? 
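// point() normalizes coordinates into a node's local system: for SVG it
// inverts the screen CTM, for HTML it subtracts the bounding rect and the
// client borders. The usual call site is a sketch like this (handler body
// hypothetical):
//
//   select('#chart').on('click', function() {
//     const [x, y] = mouse(this);  // relative to #chart, not the page
//   });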
touches.length : 0, touch; i < n; ++i) { - if ((touch = touches[i]).identifier === identifier) { - return point(node, touch); - } - } - - return null; - } - - function nopropagation() { - event.stopImmediatePropagation(); - } - - function noevent() { - event.preventDefault(); - event.stopImmediatePropagation(); - } - - function nodrag(view) { - var root = view.document.documentElement, - selection = select(view).on("dragstart.drag", noevent, true); - if ("onselectstart" in root) { - selection.on("selectstart.drag", noevent, true); - } else { - root.__noselect = root.style.MozUserSelect; - root.style.MozUserSelect = "none"; - } - } - - function yesdrag(view, noclick) { - var root = view.document.documentElement, - selection = select(view).on("dragstart.drag", null); - if (noclick) { - selection.on("click.drag", noevent, true); - setTimeout(function() { selection.on("click.drag", null); }, 0); - } - if ("onselectstart" in root) { - selection.on("selectstart.drag", null); - } else { - root.style.MozUserSelect = root.__noselect; - delete root.__noselect; - } - } - - function constant$3(x) { - return function() { - return x; - }; - } - - function DragEvent(target, type, subject, id, active, x, y, dx, dy, dispatch) { - this.target = target; - this.type = type; - this.subject = subject; - this.identifier = id; - this.active = active; - this.x = x; - this.y = y; - this.dx = dx; - this.dy = dy; - this._ = dispatch; - } - - DragEvent.prototype.on = function() { - var value = this._.on.apply(this._, arguments); - return value === this._ ? this : value; - }; - - // Ignore right-click, since that should open the context menu. - function defaultFilter() { - return !event.ctrlKey && !event.button; - } - - function defaultContainer() { - return this.parentNode; - } - - function defaultSubject(d) { - return d == null ? 
{x: event.x, y: event.y} : d; - } - - function defaultTouchable() { - return navigator.maxTouchPoints || ("ontouchstart" in this); - } - - function drag() { - var filter = defaultFilter, - container = defaultContainer, - subject = defaultSubject, - touchable = defaultTouchable, - gestures = {}, - listeners = dispatch("start", "drag", "end"), - active = 0, - mousedownx, - mousedowny, - mousemoving, - touchending, - clickDistance2 = 0; - - function drag(selection) { - selection - .on("mousedown.drag", mousedowned) - .filter(touchable) - .on("touchstart.drag", touchstarted) - .on("touchmove.drag", touchmoved) - .on("touchend.drag touchcancel.drag", touchended) - .style("touch-action", "none") - .style("-webkit-tap-highlight-color", "rgba(0,0,0,0)"); - } - - function mousedowned() { - if (touchending || !filter.apply(this, arguments)) return; - var gesture = beforestart("mouse", container.apply(this, arguments), mouse, this, arguments); - if (!gesture) return; - select(event.view).on("mousemove.drag", mousemoved, true).on("mouseup.drag", mouseupped, true); - nodrag(event.view); - nopropagation(); - mousemoving = false; - mousedownx = event.clientX; - mousedowny = event.clientY; - gesture("start"); - } - - function mousemoved() { - noevent(); - if (!mousemoving) { - var dx = event.clientX - mousedownx, dy = event.clientY - mousedowny; - mousemoving = dx * dx + dy * dy > clickDistance2; - } - gestures.mouse("drag"); - } - - function mouseupped() { - select(event.view).on("mousemove.drag mouseup.drag", null); - yesdrag(event.view, mousemoving); - noevent(); - gestures.mouse("end"); - } - - function touchstarted() { - if (!filter.apply(this, arguments)) return; - var touches = event.changedTouches, - c = container.apply(this, arguments), - n = touches.length, i, gesture; - - for (i = 0; i < n; ++i) { - if (gesture = beforestart(touches[i].identifier, c, touch, this, arguments)) { - nopropagation(); - gesture("start"); - } - } - } - - function touchmoved() { - var touches = event.changedTouches, - n = touches.length, i, gesture; - - for (i = 0; i < n; ++i) { - if (gesture = gestures[touches[i].identifier]) { - noevent(); - gesture("drag"); - } - } - } - - function touchended() { - var touches = event.changedTouches, - n = touches.length, i, gesture; - - if (touchending) clearTimeout(touchending); - touchending = setTimeout(function() { touchending = null; }, 500); // Ghost clicks are delayed! - for (i = 0; i < n; ++i) { - if (gesture = gestures[touches[i].identifier]) { - nopropagation(); - gesture("end"); - } - } - } - - function beforestart(id, container, point, that, args) { - var p = point(container, id), s, dx, dy, - sublisteners = listeners.copy(); - - if (!customEvent(new DragEvent(drag, "beforestart", s, id, active, p[0], p[1], 0, 0, sublisteners), function() { - if ((event.subject = s = subject.apply(that, args)) == null) return false; - dx = s.x - p[0] || 0; - dy = s.y - p[1] || 0; - return true; - })) return; - - return function gesture(type) { - var p0 = p, n; - switch (type) { - case "start": gestures[id] = gesture, n = active++; break; - case "end": delete gestures[id], --active; // nobreak - case "drag": p = point(container, id), n = active; break; - } - customEvent(new DragEvent(drag, type, s, id, n, p[0] + dx, p[1] + dy, p[0] - p0[0], p[1] - p0[1], sublisteners), sublisteners.apply, sublisteners, [type, that, args]); - }; - } - - drag.filter = function(_) { - return arguments.length ? (filter = typeof _ === "function" ? 
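// A sketch of wiring the drag behavior being assembled here to a selection
// (not from the original bundle; `circles` is hypothetical):
//
//   const behavior = drag()
//     .clickDistance(4)  // gestures shorter than 4px still count as clicks
//     .on('drag', function(d) {
//       d.x = event.x; d.y = event.y;  // the DragEvent also carries dx/dy
//       select(this).attr('cx', d.x).attr('cy', d.y);
//     });
//   circles.call(behavior);  // selection_call invokes behavior(circles)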
_ : constant$3(!!_), drag) : filter; - }; - - drag.container = function(_) { - return arguments.length ? (container = typeof _ === "function" ? _ : constant$3(_), drag) : container; - }; - - drag.subject = function(_) { - return arguments.length ? (subject = typeof _ === "function" ? _ : constant$3(_), drag) : subject; - }; - - drag.touchable = function(_) { - return arguments.length ? (touchable = typeof _ === "function" ? _ : constant$3(!!_), drag) : touchable; - }; - - drag.on = function() { - var value = listeners.on.apply(listeners, arguments); - return value === listeners ? drag : value; - }; - - drag.clickDistance = function(_) { - return arguments.length ? (clickDistance2 = (_ = +_) * _, drag) : Math.sqrt(clickDistance2); - }; - - return drag; - } - - // Copyright 2018 The Distill Template Authors - - const T$a = Template('d-slider', ` - - -
    - <!-- d-slider shadow-DOM template stripped during extraction: a <style> block plus the .background container with the .track, .track-fill, .knob-container and .ticks elements that connectedCallback queries below -->
    -`); - - // ARIA - // If the slider has a visible label, it is referenced by aria-labelledby on the slider element. Otherwise, the slider element has a label provided by aria-label. - // If the slider is vertically oriented, it has aria-orientation set to vertical. The default value of aria-orientation for a slider is horizontal. - - const keyCodes = { - left: 37, - up: 38, - right: 39, - down: 40, - pageUp: 33, - pageDown: 34, - end: 35, - home: 36 - }; - - class Slider extends T$a(HTMLElement) { - - - connectedCallback() { - this.connected = true; - this.setAttribute('role', 'slider'); - // Makes the element tab-able. - if (!this.hasAttribute('tabindex')) { this.setAttribute('tabindex', 0); } - - // Keeps track of keyboard vs. mouse interactions for focus rings - this.mouseEvent = false; - - // Handles to shadow DOM elements - this.knob = this.root.querySelector('.knob-container'); - this.background = this.root.querySelector('.background'); - this.trackFill = this.root.querySelector('.track-fill'); - this.track = this.root.querySelector('.track'); - - // Default values for attributes - this.min = this.min ? this.min : 0; - this.max = this.max ? this.max : 100; - this.scale = linear$1().domain([this.min, this.max]).range([0, 1]).clamp(true); - - this.origin = this.origin !== undefined ? this.origin : this.min; - this.step = this.step ? this.step : 1; - this.update(this.value ? this.value : 0); - - this.ticks = this.ticks ? this.ticks : false; - this.renderTicks(); - - this.drag = drag() - .container(this.background) - .on('start', () => { - this.mouseEvent = true; - this.background.classList.add('mousedown'); - this.changeValue = this.value; - this.dragUpdate(); - }) - .on('drag', () => { - this.dragUpdate(); - }) - .on('end', () => { - this.mouseEvent = false; - this.background.classList.remove('mousedown'); - this.dragUpdate(); - if (this.changeValue !== this.value) this.dispatchChange(); - this.changeValue = this.value; - }); - this.drag(select(this.background)); - - this.addEventListener('focusin', () => { - if(!this.mouseEvent) { - this.background.classList.add('focus'); - } - }); - this.addEventListener('focusout', () => { - this.background.classList.remove('focus'); - }); - this.addEventListener('keydown', this.onKeyDown); - - } - - static get observedAttributes() {return ['min', 'max', 'value', 'step', 'ticks', 'origin', 'tickValues', 'tickLabels']; } - - attributeChangedCallback(attr, oldValue, newValue) { - if (isNaN(newValue) || newValue === undefined || newValue === null) return; - if (attr == 'min') { - this.min = +newValue; - this.setAttribute('aria-valuemin', this.min); - } - if (attr == 'max') { - this.max = +newValue; - this.setAttribute('aria-valuemax', this.max); - } - if (attr == 'value') { - this.update(+newValue); - } - if (attr == 'origin') { - this.origin = +newValue; - // this.update(this.value); - } - if (attr == 'step') { - if (newValue > 0) { - this.step = +newValue; - } - } - if (attr == 'ticks') { - this.ticks = (newValue === '' ? 
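// A hedged sketch of driving this element from a page (markup and handler
// names are hypothetical, not from the original bundle):
//
//   <d-slider min="0" max="1" step="0.01" value="0.5" ticks></d-slider>
//
//   const slider = document.querySelector('d-slider');
//   slider.addEventListener('input',  () => preview(slider.value));  // every value change
//   slider.addEventListener('change', () => commit(slider.value));   // committed changes only
//
// Note the 'ticks' branch just below: a bare `ticks` attribute (empty string
// value) is treated as true, i.e. "render the default ticks".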
true : newValue); - } - } - - onKeyDown(event) { - this.changeValue = this.value; - let stopPropagation = false; - switch (event.keyCode) { - case keyCodes.left: - case keyCodes.down: - this.update(this.value - this.step); - stopPropagation = true; - break; - case keyCodes.right: - case keyCodes.up: - this.update(this.value + this.step); - stopPropagation = true; - break; - case keyCodes.pageUp: - this.update(this.value + this.step * 10); - stopPropagation = true; - break; - - case keyCodes.pageDown: - this.update(this.value - this.step * 10); - stopPropagation = true; - break; - case keyCodes.home: - this.update(this.min); - stopPropagation = true; - break; - case keyCodes.end: - this.update(this.max); - stopPropagation = true; - break; - } - if (stopPropagation) { - this.background.classList.add('focus'); - event.preventDefault(); - event.stopPropagation(); - if (this.changeValue !== this.value) this.dispatchChange(); - } - } - - validateValueRange(min, max, value) { - return Math.max(Math.min(max, value), min); - } - - quantizeValue(value, step) { - return Math.round(value / step) * step; - } - - dragUpdate() { - const bbox = this.background.getBoundingClientRect(); - const x = event.x; - const width = bbox.width; - this.update(this.scale.invert(x / width)); - } - - update(value) { - let v = value; - if (this.step !== 'any') { - v = this.quantizeValue(value, this.step); - } - v = this.validateValueRange(this.min, this.max, v); - if (this.connected) { - this.knob.style.left = this.scale(v) * 100 + '%'; - this.trackFill.style.width = this.scale(this.min + Math.abs(v - this.origin)) * 100 + '%'; - this.trackFill.style.left = this.scale(Math.min(v, this.origin)) * 100 + '%'; - } - if (this.value !== v) { - this.value = v; - this.setAttribute('aria-valuenow', this.value); - this.dispatchInput(); - } - } - - // Dispatches only on a committed change (basically only on mouseup). - dispatchChange() { - const e = new Event('change'); - this.dispatchEvent(e, {}); - } - - // Dispatches on each value change. - dispatchInput() { - const e = new Event('input'); - this.dispatchEvent(e, {}); - } - - renderTicks() { - const ticksContainer = this.root.querySelector('.ticks'); - if (this.ticks !== false) { - let tickData = []; - if (this.ticks > 0) { - tickData = this.scale.ticks(this.ticks); - } else if (this.step === 'any') { - tickData = this.scale.ticks(); - } else { - tickData = range(this.min, this.max + 1e-6, this.step); - } - tickData.forEach(d => { - const tick = document.createElement('div'); - tick.classList.add('tick'); - tick.style.left = this.scale(d) * 100 + '%'; - ticksContainer.appendChild(tick); - }); - } else { - ticksContainer.style.display = 'none'; - } - } - } - - var logo = "\n \n\n"; - - const headerTemplate = ` - - -`; - - // Copyright 2018 The Distill Template Authors - - const T$b = Template('distill-header', headerTemplate, false); - - class DistillHeader extends T$b(HTMLElement) { - - } - - // Copyright 2018 The Distill Template Authors - - const styles$2 = ` - -`; - - function appendixTemplate(frontMatter) { - let html = styles$2; - - if (typeof frontMatter.githubUrl !== 'undefined') { - html += ` -

    Updates and Corrections <!-- surrounding heading/paragraph markup stripped in extraction -->
    `; - if (frontMatter.githubCompareUpdatesUrl) { - html += `View all changes to this article since it was first published.`; - } - html += ` - If you see mistakes or want to suggest changes, please create an issue on GitHub.

    - `; - } - - const journal = frontMatter.journal; - if (typeof journal !== 'undefined' && journal.title === 'Distill') { - html += ` -

    Reuse
    Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.

    - `; - } - - if (typeof frontMatter.publishedDate !== 'undefined') { - html += ` -

    Citation
    For attribution in academic contexts, please cite this work as

    ${frontMatter.concatenatedAuthors}, "${frontMatter.title}", Distill, ${frontMatter.publishedYear}.
    BibTeX citation
    ${serializeFrontmatterToBibtex(frontMatter)}
    - `; - } - - return html; - } - - class DistillAppendix extends HTMLElement { - - static get is() { return 'distill-appendix'; } - - set frontMatter(frontMatter) { - this.innerHTML = appendixTemplate(frontMatter); - } - - } - - const footerTemplate = ` - - - - -`; - - // Copyright 2018 The Distill Template Authors - - const T$c = Template('distill-footer', footerTemplate); - - class DistillFooter extends T$c(HTMLElement) { - - } - - // Copyright 2018 The Distill Template Authors - - let templateIsLoading = false; - let runlevel = 0; - const initialize = function() { - if (window.distill.runlevel < 1) { - throw new Error("Insufficient Runlevel for Distill Template!"); - } - - /* 1. Flag that we're being loaded */ - if ("distill" in window && window.distill.templateIsLoading) { - throw new Error( - "Runlevel 1: Distill Template is getting loaded more than once, aborting!" - ); - } else { - window.distill.templateIsLoading = true; - console.debug("Runlevel 1: Distill Template has started loading."); - } - - /* 2. Add styles if they weren't added during prerendering */ - makeStyleTag(document); - console.debug("Runlevel 1: Static Distill styles have been added."); - console.debug("Runlevel 1->2."); - window.distill.runlevel += 1; - - /* 3. Register Controller listener functions */ - /* Needs to happen before the components register, so that their connected callbacks have a controller to talk to. */ - for (const [functionName, callback] of Object.entries(Controller.listeners)) { - if (typeof callback === "function") { - document.addEventListener(functionName, callback); - } else { - console.error("Runlevel 2: Controller listeners need to be functions!"); - } - } - console.debug("Runlevel 2: We can now listen to controller events."); - console.debug("Runlevel 2->3."); - window.distill.runlevel += 1; - - /* 4. Register components */ - const components = [ - Abstract, Appendix, Article, Bibliography, Byline, Cite, CitationList, Code, - Footnote, FootnoteList, FrontMatter$1, HoverBox, Title, DMath, References, TOC, Figure, - Slider, Interstitial - ]; - - const distillComponents = [DistillHeader, DistillAppendix, DistillFooter]; - - if (window.distill.runlevel < 2) { - throw new Error("Insufficient Runlevel for adding custom elements!"); - } - const allComponents = components.concat(distillComponents); - for (const component of allComponents) { - console.debug("Runlevel 2: Registering custom element: " + component.is); - customElements.define(component.is, component); - } - - console.debug( - "Runlevel 3: Distill Template finished registering custom elements." - ); - console.debug("Runlevel 3->4."); - window.distill.runlevel += 1; - - // If template was added after DOMContentLoaded we may have missed that event. - // Controller will check for that case, so trigger the event explicitly: - if (domContentLoaded()) { - Controller.listeners.DOMContentLoaded(); - } - - console.debug("Runlevel 4: Distill Template initialisation complete."); - window.distill.templateIsLoading = false; - window.distill.templateHasLoaded = true; - }; - - window.distill = { runlevel, initialize, templateIsLoading }; - - /* 0. 
Check browser feature support; synchronously polyfill if needed */ - if (Polyfills.browserSupportsAllFeatures()) { - console.debug("Runlevel 0: No need for polyfills."); - console.debug("Runlevel 0->1."); - window.distill.runlevel += 1; - window.distill.initialize(); - } else { - console.debug("Runlevel 0: Distill Template is loading polyfills."); - Polyfills.load(window.distill.initialize); - } - -}))); -//# sourceMappingURL=template.v2.js.map +!function(n){"function"==typeof define&&define.amd?define(n):n()}(function(){"use strict"; +// Copyright 2018 The Distill Template Authors +function n(n,t){n.title=t.title,t.published&&(t.published instanceof Date?n.publishedDate=t.published:t.published.constructor===String&&(n.publishedDate=new Date(t.published))),t.publishedDate&&(t.publishedDate instanceof Date?n.publishedDate=t.publishedDate:t.publishedDate.constructor===String?n.publishedDate=new Date(t.publishedDate):console.error("Don't know what to do with published date: "+t.publishedDate)),n.description=t.description,n.authors=t.authors.map(n=>new Nr(n)),n.katex=t.katex,n.password=t.password,t.doi&&(n.doi=t.doi)} +// Copyright 2018 The Distill Template Authors +function t(n=document){const t=new Set,e=n.querySelectorAll("d-cite");for(const n of e){const e=(n.getAttribute("key")||n.getAttribute("bibtex-key")).split(",").map(n=>n.trim());for(const n of e)t.add(n)}return[...t]}function e(n,t,e,i){if(null==n.author)return"";var r=n.author.split(" and ");let o=r.map(n=>{if(-1!=(n=n.trim()).indexOf(","))var e=n.split(",")[0].trim(),i=n.split(",")[1];else if(-1!=n.indexOf(" "))e=n.split(" ").slice(-1)[0].trim(),i=n.split(" ").slice(0,-1).join(" ");else e=n.trim();var r="";return i!=undefined&&(r=(r=i.trim().split(" ").map(n=>n.trim()[0])).join(".")+"."),t.replace("${F}",i).replace("${L}",e).replace("${I}",r).trim()});if(r.length>1){var a=o.slice(0,r.length-1).join(e);return a+=(i||e)+o[r.length-1]}return o[0]}function i(n){var t=n.journal||n.booktitle||"";if("volume"in n){var e=n.issue||n.number;e=e!=undefined?"("+e+")":"",t+=", Vol "+n.volume+e}return"pages"in n&&(t+=", pp. "+n.pages),""!=t&&(t+=". "),"publisher"in n&&"."!=(t+=n.publisher)[t.length-1]&&(t+="."),t}function r(n){if("url"in n){var t=n.url,e=/arxiv\.org\/abs\/([0-9\.]*)/.exec(t);if(null!=e&&(t=`http://arxiv.org/pdf/${e[1]}.pdf`),".pdf"==t.slice(-4))var i="PDF";else if(".html"==t.slice(-5))i="HTML";return`  [${i||"link"}]`}return""}function o(n,t){return"doi"in n?`${t?"
    ":""} DOI: ${n.doi}`:""}function a(n){return''+n.title+" "}function s(n){if(n){var t=a(n);return t+=r(n)+"
    ",n.author&&(t+=e(n,"${L}, ${I}",", "," and "),(n.year||n.date)&&(t+=", ")),n.year||n.date?t+=(n.year||n.date)+". ":t+=". ",t+=i(n),t+=o(n)}return"?"}function l(n){if(n){var t="";t+=""+n.title+"",t+=r(n),t+="
    ";var a=e(n,"${I} ${L}",", ")+".",s=i(n).trim()+" "+n.year+". "+o(n,!0);return(a+s).length"+s,t}return"?"}function u(){return-1!==["interactive","complete"].indexOf(document.readyState)} +// Copyright 2018 The Distill Template Authors +function c(n){for(let t of n.authors){const n=Boolean(t.affiliation),e=Boolean(t.affiliations);if(n)if(e)console.warn(`Author ${t.author} has both old-style ("affiliation" & "affiliationURL") and new style ("affiliations") affiliation information!`);else{let n={name:t.affiliation};t.affiliationURL&&(n.url=t.affiliationURL),t.affiliations=[n]}}return n}function d(n){const t=n.firstElementChild;if(t){if("json"==t.getAttribute("type").split("/")[1]){const n=t.textContent;return c(JSON.parse(n))}console.error("Distill only supports JSON frontmatter tags anymore; no more YAML.")}else console.error("You added a frontmatter tag but did not provide a script tag with front matter data in it. Please take a look at our templates.");return{}} +// Copyright 2018 The Distill Template Authors +function h(n,t){const e=n.body,i=e.querySelector("d-article");if(!i)return void console.warn("No d-article tag found; skipping adding optional components!");let r=n.querySelector("d-byline");r||(t.authors?(r=n.createElement("d-byline"),e.insertBefore(r,i)):console.warn("No authors found in front matter; please add them before submission!"));let o=n.querySelector("d-title");o||(o=n.createElement("d-title"),e.insertBefore(o,r));let a=o.querySelector("h1");a||((a=n.createElement("h1")).textContent=t.title,o.insertBefore(a,o.firstChild));const s="undefined"!=typeof t.password;let l=e.querySelector("d-interstitial");if(s&&!l){const i="undefined"!=typeof window,r=i&&window.location.hostname.includes("localhost");i&&r||((l=n.createElement("d-interstitial")).password=t.password,e.insertBefore(l,e.firstChild))}else!s&&l&&l.parentElement.removeChild(this);let u=n.querySelector("d-appendix");u||(u=n.createElement("d-appendix"),n.body.appendChild(u));let c=n.querySelector("d-footnote-list");c||(c=n.createElement("d-footnote-list"),u.appendChild(c));let d=n.querySelector("d-citation-list");d||(d=n.createElement("d-citation-list"),u.appendChild(d))} +// Copyright 2018 The Distill Template Authors +function p(n){const t="distill-prerendered-styles";if(!n.getElementById(t)){const e=n.createElement("style");e.id=t,e.type="text/css";const i=n.createTextNode(Kr);e.appendChild(i);const r=n.head.querySelector("script");n.head.insertBefore(e,r)}} +// Copyright 2018 The Distill Template Authors +function f(n,t){console.debug("Runlevel 0: Polyfill required: "+n.name);const e=document.createElement("script");e.src=n.url,e.async=!1,t&&(e.onload=function(){t(n)}),e.onerror=function(){new Error("Runlevel 0: Polyfills failed to load script "+n.name)},document.head.appendChild(e)} +// Copyright 2018 The Distill Template Authors +function g(n){return`${n} {\n grid-column: left / text;\n }\n `} +// Copyright 2018 The Distill Template Authors +function m(n,t){return n(t={exports:{}},t.exports),t.exports} +// Copyright 2018 The Distill Template Authors +function b(n){return n.replace(/[\t\n ]+/g," ").replace(/{\\["^`.'acu~Hvs]( )?([a-zA-Z])}/g,(n,t,e)=>e).replace(/{\\([a-zA-Z])}/g,(n,t)=>t)}function y(n){const t=new Map,e=oo.toJSON(n);for(const n of e){for(const[t,e]of Object.entries(n.entryTags))n.entryTags[t.toLowerCase()]=b(e);n.entryTags.type=n.entryType,t.set(n.citationKey,n.entryTags)}return t}function v(n){return`@article{${n.slug},\n author = {${n.bibtexAuthors}},\n title = {${n.title}},\n journal = 
{${n.journal.title}},\n year = {${n.publishedYear}},\n note = {${n.url}},\n doi = {${n.doi}}\n}`} +// Copyright 2018 The Distill Template Authors +// Copyright 2018 The Distill Template Authors +function w(n){return`\n \n`}function x(n,t,e=document){if(t.size>0){n.style.display="";let i=n.querySelector(".references");if(i)i.innerHTML="";else{const t=e.createElement("style");t.innerHTML=co,n.appendChild(t);const r=e.createElement("h3");r.id="references",r.textContent="References",n.appendChild(r),(i=e.createElement("ol")).id="references-list",i.className="references",n.appendChild(i)}for(const[n,r]of t){const t=e.createElement("li");t.id=n,t.innerHTML=s(r),i.appendChild(t)}}else n.style.display="none"}function k(n,t){let e='\n \n \n

    Table of contents
    \n
      ';for(const n of t){const t="D-TITLE"==n.parentElement.tagName,i=n.getAttribute("no-toc");if(t||i)continue;const r=n.textContent;let o='
    • '+r+"
    • ";"H3"==n.tagName?o="
        "+o+"
      ":o+="
      ",e+=o}e+="
    ",n.innerHTML=e} +// Copyright 2018 The Distill Template Authors +function S(n,t){return nt?1:n>=t?0:NaN}function M(n){return 1===n.length&&(n=T(n)),{left:function(t,e,i,r){for(null==i&&(i=0),null==r&&(r=t.length);i>>1;n(t[o],e)<0?i=o+1:r=o}return i},right:function(t,e,i,r){for(null==i&&(i=0),null==r&&(r=t.length);i>>1;n(t[o],e)>0?r=o:i=o+1}return i}}}function T(n){return function(t,e){return S(n(t),e)}}function _(n,t,e){n=+n,t=+t,e=(r=arguments.length)<2?(t=n,n=0,1):r<3?1:+e;for(var i=-1,r=0|Math.max(0,Math.ceil((t-n)/e)),o=new Array(r);++i0)return[n];if((i=t0)for(n=Math.ceil(n/a),t=Math.floor(t/a),o=new Array(r=Math.ceil(t-n+1));++s=0?(o>=Lo?10:o>=Do?5:o>=Oo?2:1)*Math.pow(10,r):-Math.pow(10,-r)/(o>=Lo?10:o>=Do?5:o>=Oo?2:1)}function E(n,t,e){var i=Math.abs(t-n)/Math.max(0,e),r=Math.pow(10,Math.floor(Math.log(i)/Math.LN10)),o=i/r;return o>=Lo?r*=10:o>=Do?r*=5:o>=Oo&&(r*=2),t>8&15|t>>4&240,t>>4&15|240&t,(15&t)<<4|15&t,1):8===e?P(t>>24&255,t>>16&255,t>>8&255,(255&t)/255):4===e?P(t>>12&15|t>>8&240,t>>8&15|t>>4&240,t>>4&15|240&t,((15&t)<<4|15&t)/255):null):(t=Ho.exec(n))?new q(t[1],t[2],t[3],1):(t=zo.exec(n))?new q(255*t[1]/100,255*t[2]/100,255*t[3]/100,1):(t=qo.exec(n))?P(t[1],t[2],t[3],t[4]):(t=jo.exec(n))?P(255*t[1]/100,255*t[2]/100,255*t[3]/100,t[4]):(t=Bo.exec(n))?W(t[1],t[2]/100,t[3]/100,1):(t=Yo.exec(n))?W(t[1],t[2]/100,t[3]/100,t[4]):Wo.hasOwnProperty(n)?$(Wo[n]):"transparent"===n?new q(NaN,NaN,NaN,0):null}function $(n){return new q(n>>16&255,n>>8&255,255&n,1)}function P(n,t,e,i){return i<=0&&(n=t=e=NaN),new q(n,t,e,i)}function H(n){return n instanceof O||(n=U(n)),n?new q((n=n.rgb()).r,n.g,n.b,n.opacity):new q}function z(n,t,e,i){return 1===arguments.length?H(n):new q(n,t,e,null==i?1:i)}function q(n,t,e,i){this.r=+n,this.g=+t,this.b=+e,this.opacity=+i}function j(){return"#"+Y(this.r)+Y(this.g)+Y(this.b)}function B(){var n=this.opacity;return(1===(n=isNaN(n)?1:Math.max(0,Math.min(1,n)))?"rgb(":"rgba(")+Math.max(0,Math.min(255,Math.round(this.r)||0))+", "+Math.max(0,Math.min(255,Math.round(this.g)||0))+", "+Math.max(0,Math.min(255,Math.round(this.b)||0))+(1===n?")":", "+n+")")}function Y(n){return((n=Math.max(0,Math.min(255,Math.round(n)||0)))<16?"0":"")+n.toString(16)}function W(n,t,e,i){return i<=0?n=t=e=NaN:e<=0||e>=1?n=t=NaN:t<=0&&(n=NaN),new K(n,t,e,i)}function G(n){if(n instanceof K)return new K(n.h,n.s,n.l,n.opacity);if(n instanceof O||(n=U(n)),!n)return new K;if(n instanceof K)return n;var t=(n=n.rgb()).r/255,e=n.g/255,i=n.b/255,r=Math.min(t,e,i),o=Math.max(t,e,i),a=NaN,s=o-r,l=(o+r)/2;return s?(a=t===o?(e-i)/s+6*(e0&&l<1?0:a,new K(a,s,l,n.opacity)}function V(n,t,e,i){return 1===arguments.length?G(n):new K(n,t,e,null==i?1:i)}function K(n,t,e,i){this.h=+n,this.s=+t,this.l=+e,this.opacity=+i}function X(n,t,e){return 255*(n<60?t+(e-t)*n/60:n<180?e:n<240?t+(e-t)*(240-n)/60:t)}function Z(n){if(n instanceof J)return new J(n.l,n.a,n.b,n.opacity);if(n instanceof sn)return ln(n);n instanceof q||(n=H(n));var t,e,i=rn(n.r),r=rn(n.g),o=rn(n.b),a=nn((.2225045*i+.7168786*r+.0606169*o)/Zo);return i===r&&r===o?t=e=a:(t=nn((.4360747*i+.3850649*r+.1430804*o)/Xo),e=nn((.0139322*i+.0971045*r+.7141733*o)/Qo)),new J(116*a-16,500*(t-a),200*(a-e),n.opacity)}function Q(n,t,e,i){return 1===arguments.length?Z(n):new J(n,t,e,null==i?1:i)}function J(n,t,e,i){this.l=+n,this.a=+t,this.b=+e,this.opacity=+i}function nn(n){return n>ea?Math.pow(n,1/3):n/ta+Jo}function tn(n){return n>na?n*n*n:ta*(n-Jo)}function en(n){return 255*(n<=.0031308?12.92*n:1.055*Math.pow(n,1/2.4)-.055)}function 
rn(n){return(n/=255)<=.04045?n/12.92:Math.pow((n+.055)/1.055,2.4)}function on(n){if(n instanceof sn)return new sn(n.h,n.c,n.l,n.opacity);if(n instanceof J||(n=Z(n)),0===n.a&&0===n.b)return new sn(NaN,0o&&(r=t.slice(o,r),s[a]?s[a]+=r:s[++a]=r),(e=e[0])===(i=i[0])?s[a]?s[a]+=i:s[++a]=i:(s[++a]=null,l.push({i:a,x:xn(e,i)})),o=fa.lastIndex;return ot&&(e=n,n=t,t=e),function(e){return Math.max(n,Math.min(t,e))}}function On(n,t,e){var i=n[0],r=n[1],o=t[0],a=t[1];return r2?In:On,a=s=null,t}function t(n){return isNaN(n=+n)?r:(a||(a=o(l.map(e),u,c)))(e(d(n)))}var e,i,r,o,a,s,l=ga,u=ga,c=_n,d=Nn;return t.invert=function(n){return d(i((s||(s=o(u,l.map(e),xn)))(n)))},t.domain=function(t){return arguments.length?(l=Array.from(t,En),n()):l.slice()},t.range=function(t){return arguments.length?(u=Array.from(t),n()):u.slice()},t.rangeRound=function(t){return u=Array.from(t),c=Cn,n()},t.clamp=function(t){return arguments.length?(d=!!t||Nn,n()):d!==Nn},t.interpolate=function(t){return arguments.length?(c=t,n()):c},t.unknown=function(n){return arguments.length?(r=n,t):r},function(t,r){return e=t,i=r,n()}}function Un(){return Rn()(Nn,Nn)}function $n(n,t){if((e=(n=t?n.toExponential(t-1):n.toExponential()).indexOf("e"))<0)return null;var e,i=n.slice(0,e);return[i.length>1?i[0]+i.slice(2):i,+n.slice(e+1)]}function Pn(n){return(n=$n(Math.abs(n)))?n[1]:NaN}function Hn(n,t){return function(e,i){for(var r=e.length,o=[],a=0,s=n[0],l=0;r>0&&s>0&&(l+s+1>i&&(s=Math.max(1,i-l)),o.push(e.substring(r-=s,r+s)),!((l+=s+1)>i));)s=n[a=(a+1)%n.length];return o.reverse().join(t)}}function zn(n){return function(t){return t.replace(/[0-9]/g,function(t){return n[+t]})}}function qn(n){if(!(t=ma.exec(n)))throw new Error("invalid format: "+n);var t;return new jn({fill:t[1],align:t[2],sign:t[3],symbol:t[4],zero:t[5],width:t[6],comma:t[7],precision:t[8]&&t[8].slice(1),trim:t[9],type:t[10]})}function jn(n){this.fill=n.fill===undefined?" 
":n.fill+"",this.align=n.align===undefined?">":n.align+"",this.sign=n.sign===undefined?"-":n.sign+"",this.symbol=n.symbol===undefined?"":n.symbol+"",this.zero=!!n.zero,this.width=n.width===undefined?undefined:+n.width,this.comma=!!n.comma,this.precision=n.precision===undefined?undefined:+n.precision,this.trim=!!n.trim,this.type=n.type===undefined?"":n.type+""}function Bn(n){n:for(var t,e=n.length,i=1,r=-1;i0&&(r=0)}return r>0?n.slice(0,r)+n.slice(t+1):n}function Yn(n,t){var e=$n(n,t);if(!e)return n+"";var i=e[0],r=e[1],o=r-(da=3*Math.max(-8,Math.min(8,Math.floor(r/3))))+1,a=i.length;return o===a?i:o>a?i+new Array(o-a+1).join("0"):o>0?i.slice(0,o)+"."+i.slice(o):"0."+new Array(1-o).join("0")+$n(n,Math.max(0,t+o-1))[0]}function Wn(n,t){var e=$n(n,t);if(!e)return n+"";var i=e[0],r=e[1];return r<0?"0."+new Array(-r).join("0")+i:i.length>r+1?i.slice(0,r+1)+"."+i.slice(r+1):i+new Array(r-i.length+2).join("0")}function Gn(n){return n}function Vn(n){function t(n){function t(n){var t,r,o,l=w,p=x;if("c"===v)p=k(n)+p,n="";else{var M=(n=+n)<0||1/n<0;if(n=isNaN(n)?c:k(Math.abs(n),b),y&&(n=Bn(n)),M&&0==+n&&"+"!==h&&(M=!1),l=(M?"("===h?h:u:"-"===h||"("===h?"":h)+l,p=("s"===v?ka[8+da/3]:"")+p+(M&&"("===h?")":""),S)for(t=-1,r=n.length;++t(o=n.charCodeAt(t))||o>57){p=(46===o?a+n.slice(t+1):n.slice(t))+p,n=n.slice(0,t);break}}m&&!f&&(n=i(n,Infinity));var T=l.length+n.length+p.length,_=T>1)+l+n+p+_.slice(T);break;default:n=_+l+n+p}return s(n)}var e=(n=qn(n)).fill,d=n.align,h=n.sign,p=n.symbol,f=n.zero,g=n.width,m=n.comma,b=n.precision,y=n.trim,v=n.type;"n"===v?(m=!0,v="g"):wa[v]||(b===undefined&&(b=12),y=!0,v="g"),(f||"0"===e&&"="===d)&&(f=!0,e="0",d="=");var w="$"===p?r:"#"===p&&/[boxX]/.test(v)?"0"+v.toLowerCase():"",x="$"===p?o:/[%p]/.test(v)?l:"",k=wa[v],S=/[defgprs%]/.test(v);return b=b===undefined?6:/[gprs]/.test(v)?Math.max(1,Math.min(21,b)):Math.max(0,Math.min(20,b)),t.toString=function(){return n+""},t}function e(n,e){var i=t(((n=qn(n)).type="f",n)),r=3*Math.max(-8,Math.min(8,Math.floor(Pn(e)/3))),o=Math.pow(10,-r),a=ka[8+r/3];return function(n){return i(o*n)+a}}var i=n.grouping===undefined||n.thousands===undefined?Gn:Hn(xa.call(n.grouping,Number),n.thousands+""),r=n.currency===undefined?"":n.currency[0]+"",o=n.currency===undefined?"":n.currency[1]+"",a=n.decimal===undefined?".":n.decimal+"",s=n.numerals===undefined?Gn:zn(xa.call(n.numerals,String)),l=n.percent===undefined?"%":n.percent+"",u=n.minus===undefined?"-":n.minus+"",c=n.nan===undefined?"NaN":n.nan+"";return{format:t,formatPrefix:e}}function Kn(n){return ba=Vn(n),ya=ba.format,va=ba.formatPrefix,ba}function Xn(n){return Math.max(0,-Pn(Math.abs(n)))}function Zn(n,t){return Math.max(0,3*Math.max(-8,Math.min(8,Math.floor(Pn(t)/3)))-Pn(Math.abs(n)))}function Qn(n,t){return n=Math.abs(n),t=Math.abs(t)-n,Math.max(0,Pn(t)-Pn(n))+1}function Jn(n,t,e,i){var r,o=E(n,t,e);switch((i=qn(null==i?",f":i)).type){case"s":var a=Math.max(Math.abs(n),Math.abs(t));return null!=i.precision||isNaN(r=Zn(o,a))||(i.precision=r),va(i,a);case"":case"e":case"g":case"p":case"r":null!=i.precision||isNaN(r=Qn(o,Math.max(Math.abs(n),Math.abs(t))))||(i.precision=r-("e"===i.type));break;case"f":case"%":null!=i.precision||isNaN(r=Xn(o))||(i.precision=r-2*("%"===i.type))}return ya(i)}function nt(n){var t=n.domain;return n.ticks=function(n){var e=t();return C(e[0],e[e.length-1],null==n?10:n)},n.tickFormat=function(n,e){var i=t();return Jn(i[0],i[i.length-1],null==n?10:n,e)},n.nice=function(e){null==e&&(e=10);var i,r=t(),o=0,a=r.length-1,s=r[o],l=r[a];return 
l0?i=A(s=Math.floor(s/i)*i,l=Math.ceil(l/i)*i,e):i<0&&(i=A(s=Math.ceil(s*i)/i,l=Math.floor(l*i)/i,e)),i>0?(r[o]=Math.floor(s/i)*i,r[a]=Math.ceil(l/i)*i,t(r)):i<0&&(r[o]=Math.ceil(s*i)/i,r[a]=Math.floor(l*i)/i,t(r)),n},n}function tt(){var n=Un();return n.copy=function(){return Fn(n,tt())},N.apply(n,arguments),nt(n)}function et(n,t,e,i){function r(t){return n(t=0===arguments.length?new Date:new Date(+t)),t}return r.floor=function(t){return n(t=new Date(+t)),t},r.ceil=function(e){return n(e=new Date(e-1)),t(e,1),n(e),e},r.round=function(n){var t=r(n),e=r.ceil(n);return n-t0))return s;do{s.push(a=new Date(+e)),t(e,o),n(e)}while(a=t)for(;n(t),!e(t);)t.setTime(t-1)},function(n,i){if(n>=n)if(i<0)for(;++i<=0;)for(;t(n,-1),!e(n););else for(;--i>=0;)for(;t(n,1),!e(n););})},e&&(r.count=function(t,i){return Sa.setTime(+t),Ma.setTime(+i),n(Sa),n(Ma),Math.floor(e(Sa,Ma))},r.every=function(n){return n=Math.floor(n),isFinite(n)&&n>0?n>1?r.filter(i?function(t){return i(t)%n==0}:function(t){return r.count(0,t)%n==0}):r:null}),r}function it(n){return et(function(t){t.setDate(t.getDate()-(t.getDay()+7-n)%7),t.setHours(0,0,0,0)},function(n,t){n.setDate(n.getDate()+7*t)},function(n,t){return(t-n-(t.getTimezoneOffset()-n.getTimezoneOffset())*Ca)/Na})}function rt(n){return et(function(t){t.setUTCDate(t.getUTCDate()-(t.getUTCDay()+7-n)%7),t.setUTCHours(0,0,0,0)},function(n,t){n.setUTCDate(n.getUTCDate()+7*t)},function(n,t){return(t-n)/Na})}function ot(n){if(0<=n.y&&n.y<100){var t=new Date(-1,n.m,n.d,n.H,n.M,n.S,n.L);return t.setFullYear(n.y),t}return new Date(n.y,n.m,n.d,n.H,n.M,n.S,n.L)}function at(n){if(0<=n.y&&n.y<100){var t=new Date(Date.UTC(-1,n.m,n.d,n.H,n.M,n.S,n.L));return t.setUTCFullYear(n.y),t}return new Date(Date.UTC(n.y,n.m,n.d,n.H,n.M,n.S,n.L))}function st(n,t,e){return{y:n,m:t,d:e,H:0,M:0,S:0,L:0}}function lt(n){function t(n,t){return function(e){var i,r,o,a=[],s=-1,l=0,u=n.length;for(e instanceof Date||(e=new Date(+e));++s53)return null;"w"in a||(a.w=1),"Z"in a?(r=(o=(r=at(st(a.y,0,1))).getUTCDay())>4||0===o?$a.ceil(r):$a(r),r=Ra.offset(r,7*(a.V-1)),a.y=r.getUTCFullYear(),a.m=r.getUTCMonth(),a.d=r.getUTCDate()+(a.w+6)%7):(r=(o=(r=ot(st(a.y,0,1))).getDay())>4||0===o?Oa.ceil(r):Oa(r),r=La.offset(r,7*(a.V-1)),a.y=r.getFullYear(),a.m=r.getMonth(),a.d=r.getDate()+(a.w+6)%7)}else("W"in a||"U"in a)&&("w"in a||(a.w="u"in a?a.u%7:"W"in a?1:0),o="Z"in a?at(st(a.y,0,1)).getUTCDay():ot(st(a.y,0,1)).getDay(),a.m=0,a.d="W"in a?(a.w+6)%7+7*a.W-(o+5)%7:a.w+7*a.U-(o+6)%7);return"Z"in a?(a.H+=a.Z/100|0,a.M+=a.Z%100,at(a)):ot(a)}}function i(n,t,e,i){for(var r,o,a=0,s=t.length,l=e.length;a=l)return-1;if(37===(r=t.charCodeAt(a++))){if(r=t.charAt(a++),!(o=B[r in Ba?t.charAt(a++):r])||(i=o(n,e,i))<0)return-1}else if(r!=e.charCodeAt(i++))return-1}return i}function r(n,t,e){var i=D.exec(t.slice(e));return i?(n.p=O[i[0].toLowerCase()],e+i[0].length):-1}function o(n,t,e){var i=R.exec(t.slice(e));return i?(n.w=U[i[0].toLowerCase()],e+i[0].length):-1}function a(n,t,e){var i=I.exec(t.slice(e));return i?(n.w=F[i[0].toLowerCase()],e+i[0].length):-1}function s(n,t,e){var i=H.exec(t.slice(e));return i?(n.m=z[i[0].toLowerCase()],e+i[0].length):-1}function l(n,t,e){var i=$.exec(t.slice(e));return i?(n.m=P[i[0].toLowerCase()],e+i[0].length):-1}function u(n,t,e){return i(n,M,t,e)}function c(n,t,e){return i(n,T,t,e)}function d(n,t,e){return i(n,_,t,e)}function h(n){return E[n.getDay()]}function p(n){return A[n.getDay()]}function f(n){return L[n.getMonth()]}function g(n){return N[n.getMonth()]}function m(n){return 
C[+(n.getHours()>=12)]}function b(n){return 1+~~(n.getMonth()/3)}function y(n){return E[n.getUTCDay()]}function v(n){return A[n.getUTCDay()]}function w(n){return L[n.getUTCMonth()]}function x(n){return N[n.getUTCMonth()]}function k(n){return C[+(n.getUTCHours()>=12)]}function S(n){return 1+~~(n.getUTCMonth()/3)}var M=n.dateTime,T=n.date,_=n.time,C=n.periods,A=n.days,E=n.shortDays,N=n.months,L=n.shortMonths,D=dt(C),O=ht(C),I=dt(A),F=ht(A),R=dt(E),U=ht(E),$=dt(N),P=ht(N),H=dt(L),z=ht(L),q={a:h,A:p,b:f,B:g,c:null,d:Ot,e:Ot,f:$t,H:It,I:Ft,j:Rt,L:Ut,m:Pt,M:Ht,p:m,q:b,Q:fe,s:ge,S:zt,u:qt,U:jt,V:Bt,w:Yt,W:Wt,x:null,X:null,y:Gt,Y:Vt,Z:Kt,"%":pe},j={a:y,A:v,b:w,B:x,c:null,d:Xt,e:Xt,f:te,H:Zt,I:Qt,j:Jt,L:ne,m:ee,M:ie,p:k,q:S,Q:fe,s:ge,S:re,u:oe,U:ae,V:se,w:le,W:ue,x:null,X:null,y:ce,Y:de,Z:he,"%":pe},B={a:o,A:a,b:s,B:l,c:u,d:St,e:St,f:Et,H:Tt,I:Tt,j:Mt,L:At,m:kt,M:_t,p:r,q:xt,Q:Lt,s:Dt,S:Ct,u:ft,U:gt,V:mt,w:pt,W:bt,x:c,X:d,y:vt,Y:yt,Z:wt,"%":Nt};return q.x=t(T,q),q.X=t(_,q),q.c=t(M,q),j.x=t(T,j),j.X=t(_,j),j.c=t(M,j),{format:function(n){var e=t(n+="",q);return e.toString=function(){return n},e},parse:function(n){var t=e(n+="",!1);return t.toString=function(){return n},t},utcFormat:function(n){var e=t(n+="",j);return e.toString=function(){return n},e},utcParse:function(n){var t=e(n+="",!0);return t.toString=function(){return n},t}}}function ut(n,t,e){var i=n<0?"-":"",r=(i?-n:n)+"",o=r.length;return i+(o68?1900:2e3),e+i[0].length):-1}function wt(n,t,e){var i=/^(Z)|([+-]\d\d)(?::?(\d\d))?/.exec(t.slice(e,e+6));return i?(n.Z=i[1]?0:-(i[2]+(i[3]||"00")),e+i[0].length):-1}function xt(n,t,e){var i=Ya.exec(t.slice(e,e+1));return i?(n.q=3*i[0]-3,e+i[0].length):-1}function kt(n,t,e){var i=Ya.exec(t.slice(e,e+2));return i?(n.m=i[0]-1,e+i[0].length):-1}function St(n,t,e){var i=Ya.exec(t.slice(e,e+2));return i?(n.d=+i[0],e+i[0].length):-1}function Mt(n,t,e){var i=Ya.exec(t.slice(e,e+3));return i?(n.m=0,n.d=+i[0],e+i[0].length):-1}function Tt(n,t,e){var i=Ya.exec(t.slice(e,e+2));return i?(n.H=+i[0],e+i[0].length):-1}function _t(n,t,e){var i=Ya.exec(t.slice(e,e+2));return i?(n.M=+i[0],e+i[0].length):-1}function Ct(n,t,e){var i=Ya.exec(t.slice(e,e+2));return i?(n.S=+i[0],e+i[0].length):-1}function At(n,t,e){var i=Ya.exec(t.slice(e,e+3));return i?(n.L=+i[0],e+i[0].length):-1}function Et(n,t,e){var i=Ya.exec(t.slice(e,e+6));return i?(n.L=Math.floor(i[0]/1e3),e+i[0].length):-1}function Nt(n,t,e){var i=Wa.exec(t.slice(e,e+1));return i?e+i[0].length:-1}function Lt(n,t,e){var i=Ya.exec(t.slice(e));return i?(n.Q=+i[0],e+i[0].length):-1}function Dt(n,t,e){var i=Ya.exec(t.slice(e));return i?(n.s=+i[0],e+i[0].length):-1}function Ot(n,t){return ut(n.getDate(),t,2)}function It(n,t){return ut(n.getHours(),t,2)}function Ft(n,t){return ut(n.getHours()%12||12,t,2)}function Rt(n,t){return ut(1+La.count(Fa(n),n),t,3)}function Ut(n,t){return ut(n.getMilliseconds(),t,3)}function $t(n,t){return Ut(n,t)+"000"}function Pt(n,t){return ut(n.getMonth()+1,t,2)}function Ht(n,t){return ut(n.getMinutes(),t,2)}function zt(n,t){return ut(n.getSeconds(),t,2)}function qt(n){var t=n.getDay();return 0===t?7:t}function jt(n,t){return ut(Da.count(Fa(n)-1,n),t,2)}function Bt(n,t){var e=n.getDay();return n=e>=4||0===e?Ia(n):Ia.ceil(n),ut(Ia.count(Fa(n),n)+(4===Fa(n).getDay()),t,2)}function Yt(n){return n.getDay()}function Wt(n,t){return ut(Oa.count(Fa(n)-1,n),t,2)}function Gt(n,t){return ut(n.getFullYear()%100,t,2)}function Vt(n,t){return ut(n.getFullYear()%1e4,t,4)}function Kt(n){var 
t=n.getTimezoneOffset();return(t>0?"-":(t*=-1,"+"))+ut(t/60|0,"0",2)+ut(t%60,"0",2)}function Xt(n,t){return ut(n.getUTCDate(),t,2)}function Zt(n,t){return ut(n.getUTCHours(),t,2)}function Qt(n,t){return ut(n.getUTCHours()%12||12,t,2)}function Jt(n,t){return ut(1+Ra.count(Ha(n),n),t,3)}function ne(n,t){return ut(n.getUTCMilliseconds(),t,3)}function te(n,t){return ne(n,t)+"000"}function ee(n,t){return ut(n.getUTCMonth()+1,t,2)}function ie(n,t){return ut(n.getUTCMinutes(),t,2)}function re(n,t){return ut(n.getUTCSeconds(),t,2)}function oe(n){var t=n.getUTCDay();return 0===t?7:t}function ae(n,t){return ut(Ua.count(Ha(n)-1,n),t,2)}function se(n,t){var e=n.getUTCDay();return n=e>=4||0===e?Pa(n):Pa.ceil(n),ut(Pa.count(Ha(n),n)+(4===Ha(n).getUTCDay()),t,2)}function le(n){return n.getUTCDay()}function ue(n,t){return ut($a.count(Ha(n)-1,n),t,2)}function ce(n,t){return ut(n.getUTCFullYear()%100,t,2)}function de(n,t){return ut(n.getUTCFullYear()%1e4,t,4)}function he(){return"+0000"}function pe(){return"%"}function fe(n){return+n}function ge(n){return Math.floor(+n/1e3)}function me(n){return za=lt(n),za.format,za.parse,qa=za.utcFormat,ja=za.utcParse,za}function be(n){return n.toISOString()}function ye(n){var t=new Date(n);return isNaN(t)?null:t}function ve(){for(var n,t=0,e=arguments.length,i={};t=0&&(e=n.slice(i+1),n=n.slice(0,i)),n&&!t.hasOwnProperty(n))throw new Error("unknown type: "+n);return{type:n,name:e}})}function ke(n,t){for(var e,i=0,r=n.length;i=0&&"xmlns"!==(t=n.slice(0,e))&&(n=n.slice(e+1)),Za.hasOwnProperty(t)?{space:Za[t],local:n}:n}function Te(n){return function(){var t=this.ownerDocument,e=this.namespaceURI;return e===Xa&&t.documentElement.namespaceURI===Xa?t.createElement(n):t.createElementNS(e,n)}}function _e(n){return function(){return this.ownerDocument.createElementNS(n.space,n.local)}}function Ce(n){var t=Me(n);return(t.local?_e:Te)(t)}function Ae(){}function Ee(n){return null==n?Ae:function(){return this.querySelector(n)}}function Ne(n){"function"!=typeof n&&(n=Ee(n));for(var t=this._groups,e=t.length,i=new Array(e),r=0;r=w&&(w=v+1);!(y=m[w])&&++w=0;)(i=r[o])&&(a&&4^i.compareDocumentPosition(a)&&a.parentNode.insertBefore(i,a),a=i);return this}function Ge(n){function t(t,e){return t&&e?n(t.__data__,e.__data__):!t-!e}n||(n=Ve);for(var e=this._groups,i=e.length,r=new Array(i),o=0;ot?1:n>=t?0:NaN}function Ke(){var n=arguments[0];return arguments[0]=this,n.apply(null,arguments),this}function Xe(){var n=new Array(this.size()),t=-1;return this.each(function(){n[++t]=this}),n}function Ze(){for(var n=this._groups,t=0,e=n.length;t1?this.each((null==t?ui:"function"==typeof t?di:ci)(n,t,null==e?"":e)):pi(this.node(),n)}function pi(n,t){return n.style.getPropertyValue(t)||li(n).getComputedStyle(n,null).getPropertyValue(t)}function fi(n){return function(){delete this[n]}}function gi(n,t){return function(){this[n]=t}}function mi(n,t){return function(){var e=t.apply(this,arguments);null==e?delete this[n]:this[n]=e}}function bi(n,t){return arguments.length>1?this.each((null==t?fi:"function"==typeof t?mi:gi)(n,t)):this.node()[n]}function yi(n){return n.trim().split(/^|\s+/)}function vi(n){return n.classList||new wi(n)}function wi(n){this._node=n,this._names=yi(n.getAttribute("class")||"")}function xi(n,t){for(var e=vi(n),i=-1,r=t.length;++i=0&&(t=n.slice(e+1),n=n.slice(0,e)),{type:n,name:t}})}function Zi(n){return function(){var t=this.__on;if(t){for(var e,i=0,r=-1,o=t.length;iv}m.mouse("drag")}function i(){sr(ns.view).on("mousemove.drag 
[Continuation of the stripped d3-drag touch handlers, then the minified distill-appendix HTML generator. Its recoverable template text, in order: an "Updates and Corrections" section ("View all changes to this article since it was first published.", gated on a condition lost in extraction, then "If you see mistakes or want to suggest changes, please create an issue on GitHub."); a "Reuse" section, emitted only when the front matter's journal.title is "Distill" ("Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: 'Figure from ...'."); and a "Citation" section, emitted only when publishedDate is defined ("For attribution in academic contexts, please cite this work as", then `${n.concatenatedAuthors}, "${n.title}", Distill, ${n.publishedYear}.`, then "BibTeX citation" followed by the generated BibTeX entry `${v(n)}`).]
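Re-expanded as a readable sketch (generateAppendix and bibtexEntry are hypothetical names standing in for the minified function and its v(n) helper; the exact markup is approximated):

// Sketch of the appendix generator above, under the conditions visible in the source.
function generateAppendix(frontMatter, bibtexEntry) {
  let html = `
    <h3>Updates and Corrections</h3>
    <p>If you see mistakes or want to suggest changes, please
    create an issue on GitHub.</p>`;
  const journal = frontMatter.journal;
  if (journal !== undefined && journal.title === "Distill") {
    html += `
    <h3>Reuse</h3>
    <p>Diagrams and text are licensed under Creative Commons
    Attribution CC-BY 4.0 with the source available on GitHub,
    unless noted otherwise.</p>`;
  }
  if (typeof frontMatter.publishedDate !== "undefined") {
    html += `
    <h3>Citation</h3>
    <p>For attribution in academic contexts, please cite this work as</p>
    <pre>${frontMatter.concatenatedAuthors},
    "${frontMatter.title}", Distill, ${frontMatter.publishedYear}.</pre>
    <h3>BibTeX citation</h3>
    <pre>${bibtexEntry(frontMatter)}</pre>`;
  }
  return html;
}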
[Front-matter model and component plumbing, largely readable in the minified source: weekday/month name tables with an RFC-822-style date formatter; Map/object converters; an Author class exposing firstName and lastName; a FrontMatter class with computed getters, including url, githubUrl, previewURL (defaulting to url + "/thumbnail.jpg"), published date parts (plus padded and ISO variants), volume (publishedYear minus 2015, throwing "Invalid publish date detected during computing volume" for earlier dates), issue (month + 1), concatenatedAuthors ("Last, et al." for three or more authors, "A & B" for two), bibtexAuthors, slug (last name + year + first title word, else "Untitled"), a bibliography stored as a Map, and fromObject/assignToObject helpers. Then: a MutationObserver mixin that re-renders a component when its text changes (erroring if renderContent() is not overridden); a template factory Or(name, templateHTML, useShadow) that stamps a template, optionally attaches shadow DOM (with ShadyCSS support) and provides $ and $$ query helpers; an Apache-2.0 license banner string; the KaTeX integration, loading katex.min.js from distill.pub/third-party with default $$...$$ delimiters and defining the d-math element; the d-front-matter element, which re-parses on mutation and dispatches onFrontMatterChanged; and the page controller, with waitingOn queues for bibliography and citations and listeners onCiteKeyCreated, onCiteKeyChanged, onCiteKeyRemoved, onBibliographyChanged, onFootnoteChanged, onFrontMatterChanged and DOMContentLoaded ("Runlevel 4"). The concatenated base stylesheet string begins here and continues below.]
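The template factory is the backbone of every custom element that follows; a sketch of its pattern (withTemplate is a hypothetical name; the real factory also wires ShadyCSS and a distill-prerendered check):

// Sketch of the Or(name, templateHTML, useShadow) factory pattern.
const withTemplate = (name, html, useShadow = true) => Base => {
  const tpl = document.createElement("template");
  tpl.innerHTML = html;
  return class extends Base {
    static get is() { return name; }
    constructor() {
      super();
      this.clone = document.importNode(tpl.content, true);
      if (useShadow) this.attachShadow({ mode: "open" }).appendChild(this.clone);
    }
    connectedCallback() {
      // Light-DOM variants stamp their template on connect, once.
      if (!useShadow && !this._stamped) {
        this._stamped = true;
        this.insertBefore(this.clone, this.firstChild);
      }
    }
    get root() { return useShadow ? this.shadowRoot : this; }
    $(sel) { return this.root.querySelector(sel); }
    $$(sel) { return this.root.querySelectorAll(sel); }
  };
};
// Usage: customElements.define("d-title",
//   class extends withTemplate("d-title", "<slot></slot>")(HTMLElement) {});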
[The bundle's embedded stylesheet, concatenated component-by-component (each chunk prefixed with the Apache-2.0 banner) and largely intact in the source: base typography (14px, 16px from 768px, a system font stack), link, figure, table and pre defaults; the responsive layout grid, in which .base-grid and the distill elements share display: grid with named column lines (screen, page, kicker, middle, text, gutter) and column tracks of 8px, then 45px, 50px and 60px at the 768px, 1000px and 1180px breakpoints; layout classes .l-text, .l-body, .l-page, .l-gutter, .l-body-outset, .l-page-outset, .l-screen, .l-screen-inset and gutter-column asides; d-title, d-byline and d-article component styles (responsive headings, blockquotes, lists, tables, block-level d-code and d-math[block] with scrollbars hidden on small screens, .citation coloring, KaTeX containment); and print rules (@page size 8in 11in with an "N of M" page counter bottom-right, d-header hidden, d-footer display: none). After the styles: the polyfill table, WebComponents (detecting customElements, attachShadow, getRootNode, template content, Promise and Array.from) and IntersectionObserver, each served from distill.pub/third-party/polyfills, with a "Runlevel 0" loader that advances window.distillRunlevel to 1 once every needed polyfill reports loaded; the d-abstract and d-appendix wrappers; and the start of the d-article element, whose MutationObserver warns about bare text nodes (the message continues below).]
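A sketch of that runlevel-0 polyfill loader (loadPolyfills is a hypothetical name; the detection predicates and URLs are the ones visible in the bundle):

// Feature-detect, inject only the missing polyfills, then advance the runlevel.
const polyfills = [
  {
    name: "WebComponents",
    support: () =>
      "customElements" in window &&
      "attachShadow" in Element.prototype &&
      "content" in document.createElement("template"),
    url: "https://distill.pub/third-party/polyfills/webcomponents-lite.js",
  },
  {
    name: "IntersectionObserver",
    support: () => "IntersectionObserver" in window,
    url: "https://distill.pub/third-party/polyfills/intersection-observer.js",
  },
];
function loadPolyfills(done) {
  const needed = polyfills.filter(p => !p.support());
  if (!needed.length) return done();
  let pending = needed.length;
  for (const p of needed) {
    const s = document.createElement("script");
    s.src = p.url;
    s.onload = () => { if (--pending === 0) done(); }; // all loaded: runlevel 0 -> 1
    document.head.appendChild(s);
  }
}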

[Tail of the d-article warning: "Use of unwrapped text in distill articles is discouraged as it breaks layout! Please wrap any text in a <span> or <p> tag. We found the following text: ...", after which the element wraps the offending text node in a span itself. Then: a globalThis/window/global/self shim; a minified BibTeX parser (month abbreviations, whitespace and % comment skipping, brace- and quote-delimited values with escape handling, # concatenation, @STRING/@PREAMBLE/@COMMENT directives, and toJSON/toBibtex converters); the d-bibliography element, which parses an inline <script type="text/bibtex"> or <script type="text/json"> (warning "Unsupported bibliography script tag type" otherwise), fetches a src attribute via XMLHttpRequest (warning "Could not load Bibtex!" on error) and dispatches onBibliographyChanged; the d-byline element, rendered from front matter; and the start of the d-cite template (markup stripped in extraction).]
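A minimal sketch in the spirit of that BibTeX parser (parseBibtexEntries is a hypothetical name; it handles only brace-delimited tags, not @STRING/@PREAMBLE, quoted values, nesting or # concatenation):

// Extract { entryType, citationKey, entryTags } records from a BibTeX string.
function parseBibtexEntries(input) {
  const entries = [];
  const entryRe = /@(\w+)\s*\{\s*([^,\s]+)\s*,/g;
  let m;
  while ((m = entryRe.exec(input)) !== null) {
    const entry = { entryType: m[1], citationKey: m[2], entryTags: {} };
    const tagRe = /(\w+)\s*=\s*\{([^{}]*)\}\s*[,}]/g;
    tagRe.lastIndex = entryRe.lastIndex;
    let t;
    while ((t = tagRe.exec(input)) !== null) {
      entry.entryTags[t[1]] = t[2];
      if (input[tagRe.lastIndex - 1] === "}") break; // hit the entry's closing brace
    }
    entries.push(entry);
  }
  return entries;
}
// parseBibtexEntries('@article{key, title = {T}, year = {2020}}')
//   -> [{ entryType: "article", citationKey: "key",
//         entryTags: { title: "T", year: "2020" } }]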

[The d-cite element: its stripped template carried hover-box markup and styles; the class tracks numbers and entries, waits for d-hover-box to be defined before wiring it, observes the key and bibtex-key attributes and dispatches onCiteKeyCreated or onCiteKeyChanged with the comma-separated keys, renders the inline marker from the resolved bibliography indexes, and fills the hover box with a bulleted list of formatted entries (a stray debug console.log of the keys survives in get keys). Then the d-citation-list element with its styles, hidden until populated unless distill-prerendered; and the start of the bundled Prism highlighter (tokenizer core, hooks, worker mode, fileHighlight for pre[data-src], and the markup/HTML/XML/SVG/MathML grammar, continuing below).]
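The marker-rendering rule of d-cite, lifted out as a standalone sketch (citationMarker is a hypothetical name; indexes are zero-based and -1 means the key was not found in the bibliography):

const citationMarker = numbers =>
  "[" + numbers.map(n => (n === -1 ? "?" : String(n + 1))).join(", ") + "]";
// citationMarker([0, 3, -1]) -> "[1, 4, ?]"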
[Prism grammars, continued: CSS (with style-attribute highlighting), C-like, JavaScript (regex literals, template strings, parameters, constants), Python, Lua, Bash (with large environment-variable and command tables), Go, Markdown (tables, code fences, bold/italic/strike, URLs) and Julia. Then the d-code element, which reads a language attribute (warning 'You need to provide a language attribute to your <d-code> block to let us know how to highlight your code; e.g.: <d-code block language="python"> zeros = np.zeros(shape) </d-code>.'), warns when the language has no grammar ("Distill does not yet support highlighting your code block in ..."), dedents block code, wraps it in a pre inside the shadow root and injects the Prism-highlighted HTML. The d-footnote template begins at the end (stripped).]
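How d-code is used in an article, and the highlight call it reduces to (Prism.highlight(code, grammar, language) is the standard three-argument Prism API; the minified bundle calls an older two-argument form):

// In article markup: <d-code block language="python"> zeros = np.zeros(shape) </d-code>
const source = "zeros = np.zeros(shape)";
const html = Prism.highlight(source, Prism.languages.python, "python");
document.querySelector("#code-container").innerHTML = html;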
    \n \n
    \n
    \n\n\n \n\n\n');class bo extends(mo(HTMLElement)){constructor(){super();const n={childList:!0,characterData:!0,subtree:!0};new MutationObserver(this.notify).observe(this,n)}notify(){const n=new CustomEvent("onFootnoteChanged",{detail:this,bubbles:!0});document.dispatchEvent(n)}connectedCallback(){this.hoverBox=this.root.querySelector("d-hover-box"),window.customElements.whenDefined("d-hover-box").then(()=>{this.hoverBox.listen(this)}),bo.currentFootnoteId+=1;const n=bo.currentFootnoteId.toString();this.root.host.id="d-footnote-"+n;const t="dt-fn-hover-box-"+n;this.hoverBox.id=t;const e=this.root.querySelector("#fn-");e.setAttribute("id","fn-"+n),e.setAttribute("data-hover-ref",t),e.textContent=n}}bo.currentFootnoteId=0; +// Copyright 2018 The Distill Template Authors +const yo=Or("d-footnote-list","\n\n\n

    Footnotes

    \n
      \n",!1);class vo extends(yo(HTMLElement)){connectedCallback(){super.connectedCallback(),this.list=this.root.querySelector("ol"),this.root.style.display="none"}set footnotes(n){if(this.list.innerHTML="",n.length){this.root.style.display="";for(const t of n){const n=document.createElement("li");n.id=t.id+"-listing",n.innerHTML=t.innerHTML;const e=document.createElement("a");e.setAttribute("class","footnote-backlink"),e.textContent="[\u21a9]",e.href="#"+t.id,n.appendChild(e),this.list.appendChild(n)}}else this.root.style.display="none"}} +// Copyright 2018 The Distill Template Authors +const wo=Or("d-hover-box",'\n\n\n
      \n
      \n \n
      \n
      \n');class xo extends(wo(HTMLElement)){constructor(){super()}connectedCallback(){}listen(n){this.bindDivEvents(this),this.bindTriggerEvents(n)}bindDivEvents(n){n.addEventListener("mouseover",()=>{this.visible||this.showAtNode(n),this.stopTimeout()}),n.addEventListener("mouseout",()=>{this.extendTimeout(500)}),n.addEventListener("touchstart",n=>{n.stopPropagation()},{passive:!0}),document.body.addEventListener("touchstart",()=>{this.hide()},{passive:!0})}bindTriggerEvents(n){n.addEventListener("mouseover",()=>{this.visible||this.showAtNode(n),this.stopTimeout()}),n.addEventListener("mouseout",()=>{this.extendTimeout(300)}),n.addEventListener("touchstart",t=>{this.visible?this.hide():this.showAtNode(n),t.stopPropagation()},{passive:!0})}show(n){this.visible=!0,this.style.display="block",this.style.top=Math.round(n[1]+10)+"px"}showAtNode(n){const t=n.getBoundingClientRect();this.show([n.offsetLeft+t.width,n.offsetTop+t.height])}hide(){this.visible=!1,this.style.display="none",this.stopTimeout()}stopTimeout(){this.timeout&&clearTimeout(this.timeout)}extendTimeout(n){this.stopTimeout(),this.timeout=setTimeout(()=>{this.hide()},n)}} +// Copyright 2018 The Distill Template Authors +class ko extends HTMLElement{static get is(){return"d-title"}} +// Copyright 2018 The Distill Template Authors +const So=Or("d-references","\n\n",!1);class Mo extends(So(HTMLElement)){} +// Copyright 2018 The Distill Template Authors +class To extends HTMLElement{static get is(){return"d-toc"}connectedCallback(){this.getAttribute("prerendered")||(window.onload=(()=>{k(this,document.querySelector("d-article").querySelectorAll("h2, h3"))}))}}class _o extends HTMLElement{static get is(){return"d-figure"}static get readyQueue(){return _o._readyQueue||(_o._readyQueue=[]),_o._readyQueue}static addToReadyQueue(n){-1===_o.readyQueue.indexOf(n)&&(_o.readyQueue.push(n),_o.runReadyQueue())}static runReadyQueue(){const n=_o.readyQueue.sort((n,t)=>n._seenOnScreen-t._seenOnScreen).filter(n=>!n._ready).pop();n&&(n.ready(),requestAnimationFrame(_o.runReadyQueue))}constructor(){super(),this._ready=!1,this._onscreen=!1,this._offscreen=!0}connectedCallback(){this.loadsWhileScrolling=this.hasAttribute("loadsWhileScrolling"),_o.marginObserver.observe(this),_o.directObserver.observe(this)}disconnectedCallback(){_o.marginObserver.unobserve(this),_o.directObserver.unobserve(this)}static get marginObserver(){if(!_o._marginObserver){const n=window.innerHeight,t=Math.floor(2*n),e={rootMargin:t+"px 0px "+t+"px 0px",threshold:.01},i=_o.didObserveMarginIntersection,r=new IntersectionObserver(i,e);_o._marginObserver=r}return _o._marginObserver}static didObserveMarginIntersection(n){for(const t of n){const n=t.target;t.isIntersecting&&!n._ready&&_o.addToReadyQueue(n)}}static get directObserver(){return _o._directObserver||(_o._directObserver=new IntersectionObserver(_o.didObserveDirectIntersection,{rootMargin:"0px",threshold:[0,1]})),_o._directObserver}static didObserveDirectIntersection(n){for(const t of n){const n=t.target;t.isIntersecting?(n._seenOnScreen=new Date,n._offscreen&&n.onscreen()):n._onscreen&&n.offscreen()}}addEventListener(n,t){super.addEventListener(n,t),"ready"===n&&-1!==_o.readyQueue.indexOf(this)&&(this._ready=!1,_o.runReadyQueue()),"onscreen"===n&&this.onscreen()}ready(){this._ready=!0,_o.marginObserver.unobserve(this);const n=new CustomEvent("ready");this.dispatchEvent(n)}onscreen(){this._onscreen=!0,this._offscreen=!1;const n=new 
CustomEvent("onscreen");this.dispatchEvent(n)}offscreen(){this._onscreen=!1,this._offscreen=!0;const n=new CustomEvent("offscreen");this.dispatchEvent(n)}}if("undefined"!=typeof window){let n;_o.isScrolling=!1;const t=()=>{_o.isScrolling=!0,clearTimeout(n),n=setTimeout(()=>{_o.isScrolling=!1,_o.runReadyQueue()},500)};window.addEventListener("scroll",t,!0)} +// Copyright 2018 The Distill Template Authors +const Co="distill.pub",Ao=Or("d-interstitial",'\n\n\n
      \n
      \n

      This article is in review.

      \n

      Do not share this URL or the contents of this article. Thank you!

      \n \n

      Enter the password we shared with you as part of the review process to view the article.

      \n
      \n
      \n');class Eo extends(Ao(HTMLElement)){connectedCallback(){if(this.shouldRemoveSelf())this.parentElement.removeChild(this);else{this.root.querySelector("#interstitial-password-input").oninput=(n=>this.passwordChanged(n))}}passwordChanged(n){n.target.value===this.password&&(console.log("Correct password entered."),this.parentElement.removeChild(this),"undefined"!=typeof Storage&&(console.log("Saved that correct password was entered."),localStorage.setItem(this.localStorageIdentifier(),"true")))}shouldRemoveSelf(){return window&&window.location.hostname===Co?(console.warn("Interstitial found on production, hiding it."),!0):"undefined"!=typeof Storage&&"true"===localStorage.getItem(this.localStorageIdentifier())&&(console.log("Loaded that correct password was entered before; skipping interstitial."),!0)}localStorageIdentifier(){const n="interstitial-password-correct";return"distill-drafts"+(window?window.location.pathname:"-")+n}}var No=M(S).right,Lo=Math.sqrt(50),Do=Math.sqrt(10),Oo=Math.sqrt(2),Io=.7,Fo=1/Io,Ro="\\s*([+-]?\\d+)\\s*",Uo="\\s*([+-]?\\d*\\.?\\d+(?:[eE][+-]?\\d+)?)\\s*",$o="\\s*([+-]?\\d*\\.?\\d+(?:[eE][+-]?\\d+)?)%\\s*",Po=/^#([0-9a-f]{3,8})$/,Ho=new RegExp("^rgb\\("+[Ro,Ro,Ro]+"\\)$"),zo=new RegExp("^rgb\\("+[$o,$o,$o]+"\\)$"),qo=new RegExp("^rgba\\("+[Ro,Ro,Ro,Uo]+"\\)$"),jo=new RegExp("^rgba\\("+[$o,$o,$o,Uo]+"\\)$"),Bo=new RegExp("^hsl\\("+[Uo,$o,$o]+"\\)$"),Yo=new RegExp("^hsla\\("+[Uo,$o,$o,Uo]+"\\)$"),Wo={aliceblue:15792383,antiquewhite:16444375,aqua:65535,aquamarine:8388564,azure:15794175,beige:16119260,bisque:16770244,black:0,blanchedalmond:16772045,blue:255,blueviolet:9055202,brown:10824234,burlywood:14596231,cadetblue:6266528,chartreuse:8388352,chocolate:13789470,coral:16744272,cornflowerblue:6591981,cornsilk:16775388,crimson:14423100,cyan:65535,darkblue:139,darkcyan:35723,darkgoldenrod:12092939,darkgray:11119017,darkgreen:25600,darkgrey:11119017,darkkhaki:12433259,darkmagenta:9109643,darkolivegreen:5597999,darkorange:16747520,darkorchid:10040012,darkred:9109504,darksalmon:15308410,darkseagreen:9419919,darkslateblue:4734347,darkslategray:3100495,darkslategrey:3100495,darkturquoise:52945,darkviolet:9699539,deeppink:16716947,deepskyblue:49151,dimgray:6908265,dimgrey:6908265,dodgerblue:2003199,firebrick:11674146,floralwhite:16775920,forestgreen:2263842,fuchsia:16711935,gainsboro:14474460,ghostwhite:16316671,gold:16766720,goldenrod:14329120,gray:8421504,green:32768,greenyellow:11403055,grey:8421504,honeydew:15794160,hotpink:16738740,indianred:13458524,indigo:4915330,ivory:16777200,khaki:15787660,lavender:15132410,lavenderblush:16773365,lawngreen:8190976,lemonchiffon:16775885,lightblue:11393254,lightcoral:15761536,lightcyan:14745599,lightgoldenrodyellow:16448210,lightgray:13882323,lightgreen:9498256,lightgrey:13882323,lightpink:16758465,lightsalmon:16752762,lightseagreen:2142890,lightskyblue:8900346,lightslategray:7833753,lightslategrey:7833753,lightsteelblue:11584734,lightyellow:16777184,lime:65280,limegreen:3329330,linen:16445670,magenta:16711935,maroon:8388608,mediumaquamarine:6737322,mediumblue:205,mediumorchid:12211667,mediumpurple:9662683,mediumseagreen:3978097,mediumslateblue:8087790,mediumspringgreen:64154,mediumturquoise:4772300,mediumvioletred:13047173,midnightblue:1644912,mintcream:16121850,mistyrose:16770273,moccasin:16770229,navajowhite:16768685,navy:128,oldlace:16643558,olive:8421376,olivedrab:7048739,orange:16753920,orangered:16729344,orchid:14315734,palegoldenrod:15657130,palegreen:10025880,paleturquoise:11529966,palevioletred:14381203,papayawhip:167
73077,peachpuff:16767673,peru:13468991,pink:16761035,plum:14524637,powderblue:11591910,purple:8388736,rebeccapurple:6697881,red:16711680,rosybrown:12357519,royalblue:4286945,saddlebrown:9127187,salmon:16416882,sandybrown:16032864,seagreen:3050327,seashell:16774638,sienna:10506797,silver:12632256,skyblue:8900331,slateblue:6970061,slategray:7372944,slategrey:7372944,snow:16775930,springgreen:65407,steelblue:4620980,tan:13808780,teal:32896,thistle:14204888,tomato:16737095,turquoise:4251856,violet:15631086,wheat:16113331,white:16777215,whitesmoke:16119285,yellow:16776960,yellowgreen:10145074};L(O,U,{copy:function(n){return Object.assign(new this.constructor,this,n)},displayable:function(){return this.rgb().displayable()},hex:I,formatHex:I,formatHsl:F,formatRgb:R,toString:R}),L(q,z,D(O,{brighter:function(n){return n=null==n?Fo:Math.pow(Fo,n),new q(this.r*n,this.g*n,this.b*n,this.opacity)},darker:function(n){return n=null==n?Io:Math.pow(Io,n),new q(this.r*n,this.g*n,this.b*n,this.opacity)},rgb:function(){return this},displayable:function(){return-.5<=this.r&&this.r<255.5&&-.5<=this.g&&this.g<255.5&&-.5<=this.b&&this.b<255.5&&0<=this.opacity&&this.opacity<=1},hex:j,formatHex:j,formatRgb:B,toString:B})),L(K,V,D(O,{brighter:function(n){return n=null==n?Fo:Math.pow(Fo,n),new K(this.h,this.s,this.l*n,this.opacity)},darker:function(n){return n=null==n?Io:Math.pow(Io,n),new K(this.h,this.s,this.l*n,this.opacity)},rgb:function(){var n=this.h%360+360*(this.h<0),t=isNaN(n)||isNaN(this.s)?0:this.s,e=this.l,i=e+(e<.5?e:1-e)*t,r=2*e-i;return new q(X(n>=240?n-240:n+120,r,i),X(n,r,i),X(n<120?n+240:n-120,r,i),this.opacity)},displayable:function(){return(0<=this.s&&this.s<=1||isNaN(this.s))&&0<=this.l&&this.l<=1&&0<=this.opacity&&this.opacity<=1},formatHsl:function(){var n=this.opacity;return(1===(n=isNaN(n)?1:Math.max(0,Math.min(1,n)))?"hsl(":"hsla(")+(this.h||0)+", "+100*(this.s||0)+"%, "+100*(this.l||0)+"%"+(1===n?")":", "+n+")")}}));var Go=Math.PI/180,Vo=180/Math.PI,Ko=18,Xo=.96422,Zo=1,Qo=.82521,Jo=4/29,na=6/29,ta=3*na*na,ea=na*na*na;L(J,Q,D(O,{brighter:function(n){return new J(this.l+Ko*(null==n?1:n),this.a,this.b,this.opacity)},darker:function(n){return new J(this.l-Ko*(null==n?1:n),this.a,this.b,this.opacity)},rgb:function(){var n=(this.l+16)/116,t=isNaN(this.a)?n:n+this.a/500,e=isNaN(this.b)?n:n-this.b/200;return new q(en(3.1338561*(t=Xo*tn(t))-1.6168667*(n=Zo*tn(n))-.4906146*(e=Qo*tn(e))),en(-.9787684*t+1.9161415*n+.033454*e),en(.0719453*t-.2289914*n+1.4052427*e),this.opacity)}})),L(sn,an,D(O,{brighter:function(n){return new sn(this.h,this.c,this.l+Ko*(null==n?1:n),this.opacity)},darker:function(n){return new sn(this.h,this.c,this.l-Ko*(null==n?1:n),this.opacity)},rgb:function(){return ln(this).rgb()}}));var ia=-.14861,ra=1.78277,oa=-.29227,aa=-.90649,sa=1.97294,la=sa*aa,ua=sa*ra,ca=ra*oa-aa*ia;L(dn,cn,D(O,{brighter:function(n){return n=null==n?Fo:Math.pow(Fo,n),new dn(this.h,this.s,this.l*n,this.opacity)},darker:function(n){return n=null==n?Io:Math.pow(Io,n),new dn(this.h,this.s,this.l*n,this.opacity)},rgb:function(){var n=isNaN(this.h)?0:(this.h+120)*Go,t=+this.l,e=isNaN(this.s)?0:this.s*t*(1-t),i=Math.cos(n),r=Math.sin(n);return new q(255*(t+e*(ia*i+ra*r)),255*(t+e*(oa*i+aa*r)),255*(t+e*(sa*i)),this.opacity)}}));var da,ha=function gs(n){function t(n,t){var i=e((n=z(n)).r,(t=z(t)).r),r=e(n.g,t.g),o=e(n.b,t.b),a=mn(n.opacity,t.opacity);return function(t){return n.r=i(t),n.g=r(t),n.b=o(t),n.opacity=a(t),n+""}}var e=gn(n);return t.gamma=gs,t}(1),pa=/[-+]?(?:\d+\.?\d*|\.?\d+)(?:[eE][-+]?\d+)?/g,fa=new 
RegExp(pa.source,"g"),ga=[0,1],ma=/^(?:(.)?([<>=^]))?([+\-( ])?([$#])?(0)?(\d+)?(,)?(\.\d+)?(~)?([a-z%])?$/i;qn.prototype=jn.prototype,jn.prototype.toString=function(){return this.fill+this.align+this.sign+this.symbol+(this.zero?"0":"")+(this.width===undefined?"":Math.max(1,0|this.width))+(this.comma?",":"")+(this.precision===undefined?"":"."+Math.max(0,0|this.precision))+(this.trim?"~":"")+this.type};var ba,ya,va,wa={"%":function(n,t){return(100*n).toFixed(t)},b:function(n){return Math.round(n).toString(2)},c:function(n){return n+""},d:function(n){return Math.round(n).toString(10)},e:function(n,t){return n.toExponential(t)},f:function(n,t){return n.toFixed(t)},g:function(n,t){return n.toPrecision(t)},o:function(n){return Math.round(n).toString(8)},p:function(n,t){return Wn(100*n,t)},r:Wn,s:Yn,X:function(n){return Math.round(n).toString(16).toUpperCase()},x:function(n){return Math.round(n).toString(16)}},xa=Array.prototype.map,ka=["y","z","a","f","p","n","\xb5","m","","k","M","G","T","P","E","Z","Y"];Kn({decimal:".",thousands:",",grouping:[3],currency:["$",""],minus:"-"});var Sa=new Date,Ma=new Date,Ta=et(function(){},function(n,t){n.setTime(+n+t)},function(n,t){return t-n});Ta.every=function(n){return n=Math.floor(n),isFinite(n)&&n>0?n>1?et(function(t){t.setTime(Math.floor(t/n)*n)},function(t,e){t.setTime(+t+e*n)},function(t,e){return(e-t)/n}):Ta:null};var _a=1e3,Ca=6e4,Aa=36e5,Ea=864e5,Na=6048e5,La=(et(function(n){n.setTime(n-n.getMilliseconds())},function(n,t){n.setTime(+n+t*_a)},function(n,t){return(t-n)/_a},function(n){return n.getUTCSeconds()}),et(function(n){n.setTime(n-n.getMilliseconds()-n.getSeconds()*_a)},function(n,t){n.setTime(+n+t*Ca)},function(n,t){return(t-n)/Ca},function(n){return n.getMinutes()}),et(function(n){n.setTime(n-n.getMilliseconds()-n.getSeconds()*_a-n.getMinutes()*Ca)},function(n,t){n.setTime(+n+t*Aa)},function(n,t){return(t-n)/Aa},function(n){return n.getHours()}),et(function(n){n.setHours(0,0,0,0)},function(n,t){n.setDate(n.getDate()+t)},function(n,t){return(t-n-(t.getTimezoneOffset()-n.getTimezoneOffset())*Ca)/Ea},function(n){return n.getDate()-1})),Da=it(0),Oa=it(1),Ia=(it(2),it(3),it(4)),Fa=(it(5),it(6),et(function(n){n.setDate(1),n.setHours(0,0,0,0)},function(n,t){n.setMonth(n.getMonth()+t)},function(n,t){return t.getMonth()-n.getMonth()+12*(t.getFullYear()-n.getFullYear())},function(n){return n.getMonth()}),et(function(n){n.setMonth(0,1),n.setHours(0,0,0,0)},function(n,t){n.setFullYear(n.getFullYear()+t)},function(n,t){return t.getFullYear()-n.getFullYear()},function(n){return n.getFullYear()}));Fa.every=function(n){return isFinite(n=Math.floor(n))&&n>0?et(function(t){t.setFullYear(Math.floor(t.getFullYear()/n)*n),t.setMonth(0,1),t.setHours(0,0,0,0)},function(t,e){t.setFullYear(t.getFullYear()+e*n)}):null};et(function(n){n.setUTCSeconds(0,0)},function(n,t){n.setTime(+n+t*Ca)},function(n,t){return(t-n)/Ca},function(n){return n.getUTCMinutes()}),et(function(n){n.setUTCMinutes(0,0,0)},function(n,t){n.setTime(+n+t*Aa)},function(n,t){return(t-n)/Aa},function(n){return n.getUTCHours()});var Ra=et(function(n){n.setUTCHours(0,0,0,0)},function(n,t){n.setUTCDate(n.getUTCDate()+t)},function(n,t){return(t-n)/Ea},function(n){return n.getUTCDate()-1}),Ua=rt(0),$a=rt(1),Pa=(rt(2),rt(3),rt(4)),Ha=(rt(5),rt(6),et(function(n){n.setUTCDate(1),n.setUTCHours(0,0,0,0)},function(n,t){n.setUTCMonth(n.getUTCMonth()+t)},function(n,t){return t.getUTCMonth()-n.getUTCMonth()+12*(t.getUTCFullYear()-n.getUTCFullYear())},function(n){return 
n.getUTCMonth()}),et(function(n){n.setUTCMonth(0,1),n.setUTCHours(0,0,0,0)},function(n,t){n.setUTCFullYear(n.getUTCFullYear()+t)},function(n,t){return t.getUTCFullYear()-n.getUTCFullYear()},function(n){return n.getUTCFullYear()}));Ha.every=function(n){return isFinite(n=Math.floor(n))&&n>0?et(function(t){t.setUTCFullYear(Math.floor(t.getUTCFullYear()/n)*n),t.setUTCMonth(0,1),t.setUTCHours(0,0,0,0)},function(t,e){t.setUTCFullYear(t.getUTCFullYear()+e*n)}):null};var za,qa,ja,Ba={"-":"",_:" ",0:"0"},Ya=/^\s*\d+/,Wa=/^%/,Ga=/[\\^$*+?|[\]().{}]/g;me({dateTime:"%x, %X",date:"%-m/%-d/%Y",time:"%-I:%M:%S %p",periods:["AM","PM"],days:["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"],shortDays:["Sun","Mon","Tue","Wed","Thu","Fri","Sat"],months:["January","February","March","April","May","June","July","August","September","October","November","December"],shortMonths:["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]});var Va="%Y-%m-%dT%H:%M:%S.%LZ",Ka=(Date.prototype.toISOString||qa(Va),+new Date("2000-01-01T00:00:00.000Z")||ja(Va),{value:function(){}});we.prototype=ve.prototype={constructor:we,on:function(n,t){var e,i=this._,r=xe(n+"",i),o=-1,a=r.length;if(!(arguments.length<2)){if(null!=t&&"function"!=typeof t)throw new Error("invalid callback: "+t);for(;++o0)for(var e,i,r=new Array(e),o=0;o=0&&(this._names.splice(t,1),this._node.setAttribute("class",this._names.join(" ")))},contains:function(n){return this._names.indexOf(n)>=0}};var Ja={},ns=null;"undefined"!=typeof document&&("onmouseenter"in document.documentElement||(Ja={mouseenter:"mouseover",mouseleave:"mouseout"}));var ts=[null];or.prototype=ar.prototype={constructor:or,select:Ne,selectAll:Oe,filter:Fe,data:qe,enter:Ue,exit:je,join:Be,merge:Ye,order:We,sort:Ge,call:Ke,nodes:Xe,node:Ze,size:Qe,empty:Je,each:ni,attr:si,style:hi,property:bi,classed:_i,text:Ni,html:Ii,raise:Ri,lower:$i,append:Pi,insert:zi,remove:ji,clone:Wi,datum:Gi,on:Ji,dispatch:rr},br.prototype.on=function(){var n=this._.on.apply(this._,arguments);return n===this._?this:n};const es=Or("d-slider","\n\n\n
      \n
      \n
      \n
      \n
      \n
      \n
      \n
      \n
      \n"),is={left:37,up:38,right:39,down:40,pageUp:33,pageDown:34,end:35,home:36};class rs extends(es(HTMLElement)){connectedCallback(){this.connected=!0,this.setAttribute("role","slider"),this.hasAttribute("tabindex")||this.setAttribute("tabindex",0),this.mouseEvent=!1,this.knob=this.root.querySelector(".knob-container"),this.background=this.root.querySelector(".background"),this.trackFill=this.root.querySelector(".track-fill"),this.track=this.root.querySelector(".track"),this.min=this.min?this.min:0,this.max=this.max?this.max:100,this.scale=tt().domain([this.min,this.max]).range([0,1]).clamp(!0),this.origin=this.origin!==undefined?this.origin:this.min,this.step=this.step?this.step:1,this.update(this.value?this.value:0),this.ticks=!!this.ticks&&this.ticks,this.renderTicks(),this.drag=kr().container(this.background).on("start",()=>{this.mouseEvent=!0,this.background.classList.add("mousedown"),this.changeValue=this.value,this.dragUpdate()}).on("drag",()=>{this.dragUpdate()}).on("end",()=>{this.mouseEvent=!1,this.background.classList.remove("mousedown"),this.dragUpdate(),this.changeValue!==this.value&&this.dispatchChange(),this.changeValue=this.value}),this.drag(sr(this.background)),this.addEventListener("focusin",()=>{this.mouseEvent||this.background.classList.add("focus")}),this.addEventListener("focusout",()=>{this.background.classList.remove("focus")}),this.addEventListener("keydown",this.onKeyDown)}static get observedAttributes(){return["min","max","value","step","ticks","origin","tickValues","tickLabels"]}attributeChangedCallback(n,t,e){isNaN(e)||e===undefined||null===e||("min"==n&&(this.min=+e,this.setAttribute("aria-valuemin",this.min)),"max"==n&&(this.max=+e,this.setAttribute("aria-valuemax",this.max)),"value"==n&&this.update(+e),"origin"==n&&(this.origin=+e),"step"==n&&e>0&&(this.step=+e),"ticks"==n&&(this.ticks=""===e||e))}onKeyDown(n){this.changeValue=this.value;let t=!1;switch(n.keyCode){case is.left:case is.down:this.update(this.value-this.step),t=!0;break;case is.right:case is.up:this.update(this.value+this.step),t=!0;break;case is.pageUp:case is.pageDown:this.update(this.value+10*this.step),t=!0;break;case is.home:this.update(this.min),t=!0;break;case is.end:this.update(this.max),t=!0}t&&(this.background.classList.add("focus"),n.preventDefault(),n.stopPropagation(),this.changeValue!==this.value&&this.dispatchChange())}validateValueRange(n,t,e){return Math.max(Math.min(t,e),n)}quantizeValue(n,t){return Math.round(n/t)*t}dragUpdate(){const n=this.background.getBoundingClientRect(),t=ns.x,e=n.width;this.update(this.scale.invert(t/e))}update(n){let t=n;"any"!==this.step&&(t=this.quantizeValue(n,this.step)),t=this.validateValueRange(this.min,this.max,t),this.connected&&(this.knob.style.left=100*this.scale(t)+"%",this.trackFill.style.width=100*this.scale(this.min+Math.abs(t-this.origin))+"%",this.trackFill.style.left=100*this.scale(Math.min(t,this.origin))+"%"),this.value!==t&&(this.value=t,this.setAttribute("aria-valuenow",this.value),this.dispatchInput())}dispatchChange(){const n=new Event("change");this.dispatchEvent(n,{})}dispatchInput(){const n=new Event("input");this.dispatchEvent(n,{})}renderTicks(){const n=this.root.querySelector(".ticks");if(!1!==this.ticks){let t=[];(t=this.ticks>0?this.scale.ticks(this.ticks):"any"===this.step?this.scale.ticks():_(this.min,this.max+1e-6,this.step)).forEach(t=>{const e=document.createElement("div");e.classList.add("tick"),e.style.left=100*this.scale(t)+"%",n.appendChild(e)})}else n.style.display="none"}}var os='\n \n\n';const 
as=Or("distill-header",`\n\n\n`,!1); +// Copyright 2018 The Distill Template Authors +class ss extends(as(HTMLElement)){} +// Copyright 2018 The Distill Template Authors +const ls="\n\n";class us extends HTMLElement{static get is(){return"distill-appendix"}set frontMatter(n){this.innerHTML=Sr(n)}}const cs=Or("distill-footer",`\n\n\n\n\n`); +// Copyright 2018 The Distill Template Authors +class ds extends(cs(HTMLElement)){} +// Copyright 2018 The Distill Template Authors +let hs=!1,ps=0;const fs=function(){if(window.distill.runlevel<1)throw new Error("Insufficient Runlevel for Distill Template!");if("distill"in window&&window.distill.templateIsLoading)throw new Error("Runlevel 1: Distill Template is getting loaded more than once, aborting!");window.distill.templateIsLoading=!0,console.debug("Runlevel 1: Distill Template has started loading."),p(document),console.debug("Runlevel 1: Static Distill styles have been added."),console.debug("Runlevel 1->2."),window.distill.runlevel+=1;for(const[n,t]of Object.entries(Vr.listeners))"function"==typeof t?document.addEventListener(n,t):console.error("Runlevel 2: Controller listeners need to be functions!");console.debug("Runlevel 2: We can now listen to controller events."),console.debug("Runlevel 2->3."),window.distill.runlevel+=1;const n=[Jr,to,io,ao,so,uo,ho,go,bo,vo,Wr,xo,ko,Yr,Mo,To,_o,rs,Eo],t=[ss,us,ds];if(window.distill.runlevel<2)throw new Error("Insufficient Runlevel for adding custom elements!");const e=n.concat(t);for(const n of e)console.debug("Runlevel 2: Registering custom element: "+n.is),customElements.define(n.is,n);console.debug("Runlevel 3: Distill Template finished registering custom elements."),console.debug("Runlevel 3->4."),window.distill.runlevel+=1,u()&&Vr.listeners.DOMContentLoaded(),console.debug("Runlevel 4: Distill Template initialisation complete."),window.distill.templateIsLoading=!1,window.distill.templateHasLoaded=!0};window.distill={runlevel:ps,initialize:fs,templateIsLoading:hs},Zr.browserSupportsAllFeatures()?(console.debug("Runlevel 0: No need for polyfills."),console.debug("Runlevel 0->1."),window.distill.runlevel+=1,window.distill.initialize()):(console.debug("Runlevel 0: Distill Template is loading polyfills."),Zr.load(window.distill.initialize))}); \ No newline at end of file diff --git a/assets/js/distillpub/transforms.v2.js b/assets/js/distillpub/transforms.v2.js index 2d12d323..41d3b7d3 100644 --- a/assets/js/distillpub/transforms.v2.js +++ b/assets/js/distillpub/transforms.v2.js @@ -1,13185 +1,75 @@ -(function (global, factory) { - typeof exports === 'object' && typeof module !== 'undefined' ? factory(exports, require('fs')) : - typeof define === 'function' && define.amd ? define(['exports', 'fs'], factory) : - (global = global || self, factory(global.dl = {}, global.fs)); -}(this, (function (exports, fs) { 'use strict'; - - fs = fs && Object.prototype.hasOwnProperty.call(fs, 'default') ? fs['default'] : fs; - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. - // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
- // See the License for the specific language governing permissions and - // limitations under the License. - - const days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']; - const months = ['Jan.', 'Feb.', 'March', 'April', 'May', 'June', 'July', 'Aug.', 'Sept.', 'Oct.', 'Nov.', 'Dec.']; - const zeroPad = n => n < 10 ? '0' + n : n; - - const RFC = function(date) { - const day = days[date.getDay()].substring(0, 3); - const paddedDate = zeroPad(date.getDate()); - const month = months[date.getMonth()].substring(0,3); - const year = date.getFullYear().toString(); - const hours = date.getUTCHours().toString(); - const minutes = date.getUTCMinutes().toString(); - const seconds = date.getUTCSeconds().toString(); - return `${day}, ${paddedDate} ${month} ${year} ${hours}:${minutes}:${seconds} Z`; - }; - - const objectFromMap = function(map) { - const object = Array.from(map).reduce((object, [key, value]) => ( - Object.assign(object, { [key]: value }) // Be careful! Maps can have non-String keys; object literals can't. - ), {}); - return object; - }; - - const mapFromObject = function(object) { - const map = new Map(); - for (var property in object) { - if (object.hasOwnProperty(property)) { - map.set(property, object[property]); - } - } - return map; - }; - - class Author { - - // constructor(name='', personalURL='', affiliation='', affiliationURL='') { - // this.name = name; // 'Chris Olah' - // this.personalURL = personalURL; // 'https://colah.github.io' - // this.affiliation = affiliation; // 'Google Brain' - // this.affiliationURL = affiliationURL; // 'https://g.co/brain' - // } - - constructor(object) { - this.name = object.author; // 'Chris Olah' - this.personalURL = object.authorURL; // 'https://colah.github.io' - this.affiliation = object.affiliation; // 'Google Brain' - this.affiliationURL = object.affiliationURL; // 'https://g.co/brain' - this.affiliations = object.affiliations || []; // new-style affiliations - } - - // 'Chris' - get firstName() { - const names = this.name.split(' '); - return names.slice(0, names.length - 1).join(' '); - } - - // 'Olah' - get lastName() { - const names = this.name.split(' '); - return names[names.length -1]; - } - } - - function mergeFromYMLFrontmatter(target, source) { - target.title = source.title; - if (source.published) { - if (source.published instanceof Date) { - target.publishedDate = source.published; - } else if (source.published.constructor === String) { - target.publishedDate = new Date(source.published); - } - } - if (source.publishedDate) { - if (source.publishedDate instanceof Date) { - target.publishedDate = source.publishedDate; - } else if (source.publishedDate.constructor === String) { - target.publishedDate = new Date(source.publishedDate); - } else { - console.error('Don\'t know what to do with published date: ' + source.publishedDate); - } - } - target.description = source.description; - target.authors = source.authors.map( (authorObject) => new Author(authorObject)); - target.katex = source.katex; - target.password = source.password; - if (source.doi) { - target.doi = source.doi; - } - } - - class FrontMatter { - constructor() { - this.title = 'unnamed article'; // 'Attention and Augmented Recurrent Neural Networks' - this.description = ''; // 'A visual overview of neural attention...' 
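// A minimal usage sketch (not from the deleted file) for the Author class
// defined above, using the example values from its original comments:
// `firstName` is everything before the last space, `lastName` the final token.
const exampleAuthor = new Author({
  author: 'Chris Olah',
  authorURL: 'https://colah.github.io',
  affiliation: 'Google Brain',
  affiliationURL: 'https://g.co/brain'
});
exampleAuthor.firstName; // 'Chris'
exampleAuthor.lastName;  // 'Olah'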
- this.authors = []; // Array of Author(s) - - this.bibliography = new Map(); - this.bibliographyParsed = false; - // { - // 'gregor2015draw': { - // 'title': 'DRAW: A recurrent neural network for image generation', - // 'author': 'Gregor, Karol and Danihelka, Ivo and Graves, Alex and Rezende, Danilo Jimenez and Wierstra, Daan', - // 'journal': 'arXiv preprint arXiv:1502.04623', - // 'year': '2015', - // 'url': 'https://arxiv.org/pdf/1502.04623.pdf', - // 'type': 'article' - // }, - // } - - // Citation keys should be listed in the order that they appear in the document. - // Each key refers to a key in the bibliography dictionary. - this.citations = []; // [ 'gregor2015draw', 'mercier2011humans' ] - this.citationsCollected = false; - - // - // Assigned from posts.csv - // - - // publishedDate: 2016-09-08T07:00:00.000Z, - // tags: [ 'rnn' ], - // distillPath: '2016/augmented-rnns', - // githubPath: 'distillpub/post--augmented-rnns', - // doiSuffix: 1, - - // - // Assigned from journal - // - this.journal = {}; - // journal: { - // 'title': 'Distill', - // 'full_title': 'Distill', - // 'abbrev_title': 'Distill', - // 'url': 'http://distill.pub', - // 'doi': '10.23915/distill', - // 'publisherName': 'Distill Working Group', - // 'publisherEmail': 'admin@distill.pub', - // 'issn': '2476-0757', - // 'editors': [...], - // 'committee': [...] - // } - // volume: 1, - // issue: 9, - - this.katex = {}; - - // - // Assigned from publishing process - // - - // githubCompareUpdatesUrl: 'https://github.com/distillpub/post--augmented-rnns/compare/1596e094d8943d2dc0ea445d92071129c6419c59...3bd9209e0c24d020f87cf6152dcecc6017cbc193', - // updatedDate: 2017-03-21T07:13:16.000Z, - // doi: '10.23915/distill.00001', - this.doi = undefined; - this.publishedDate = undefined; - } - - // Example: - // title: Demo Title Attention and Augmented Recurrent Neural Networks - // published: Jan 10, 2017 - // authors: - // - Chris Olah: - // - Shan Carter: http://shancarter.com - // affiliations: - // - Google Brain: - // - Google Brain: http://g.co/brain - - // - // Computed Properties - // - - // 'http://distill.pub/2016/augmented-rnns', - set url(value) { - this._url = value; - } - get url() { - if (this._url) { - return this._url; - } else if (this.distillPath && this.journal.url) { - return this.journal.url + '/' + this.distillPath; - } else if (this.journal.url) { - return this.journal.url; - } - } - - // 'https://github.com/distillpub/post--augmented-rnns', - get githubUrl() { - if (this.githubPath) { - return 'https://github.com/' + this.githubPath; - } else { - return undefined; - } - } - - // TODO resolve differences in naming of URL/Url/url. - // 'http://distill.pub/2016/augmented-rnns/thumbnail.jpg', - set previewURL(value) { - this._previewURL = value; - } - get previewURL() { - return this._previewURL ? 
this._previewURL : this.url + '/thumbnail.jpg'; - } - - // 'Thu, 08 Sep 2016 00:00:00 -0700', - get publishedDateRFC() { - return RFC(this.publishedDate); - } - - // 'Thu, 08 Sep 2016 00:00:00 -0700', - get updatedDateRFC() { - return RFC(this.updatedDate); - } - - // 2016, - get publishedYear() { - return this.publishedDate.getFullYear(); - } - - // 'Sept', - get publishedMonth() { - return months[this.publishedDate.getMonth()]; - } - - // 8, - get publishedDay() { - return this.publishedDate.getDate(); - } - - // '09', - get publishedMonthPadded() { - return zeroPad(this.publishedDate.getMonth() + 1); - } - - // '08', - get publishedDayPadded() { - return zeroPad(this.publishedDate.getDate()); - } - - get publishedISODateOnly() { - return this.publishedDate.toISOString().split('T')[0]; - } - - get volume() { - const volume = this.publishedYear - 2015; - if (volume < 1) { - throw new Error('Invalid publish date detected during computing volume'); - } - return volume; - } - - get issue() { - return this.publishedDate.getMonth() + 1; - } - - // 'Olah & Carter', - get concatenatedAuthors() { - if (this.authors.length > 2) { - return this.authors[0].lastName + ', et al.'; - } else if (this.authors.length === 2) { - return this.authors[0].lastName + ' & ' + this.authors[1].lastName; - } else if (this.authors.length === 1) { - return this.authors[0].lastName; - } - } - - // 'Olah, Chris and Carter, Shan', - get bibtexAuthors() { - return this.authors.map(author => { - return author.lastName + ', ' + author.firstName; - }).join(' and '); - } - - // 'olah2016attention' - get slug() { - let slug = ''; - if (this.authors.length) { - slug += this.authors[0].lastName.toLowerCase(); - slug += this.publishedYear; - slug += this.title.split(' ')[0].toLowerCase(); - } - return slug || 'Untitled'; - } - - get bibliographyEntries() { - return new Map(this.citations.map( citationKey => { - const entry = this.bibliography.get(citationKey); - return [citationKey, entry]; - })); - } - - set bibliography(bibliography) { - if (bibliography instanceof Map) { - this._bibliography = bibliography; - } else if (typeof bibliography === 'object') { - this._bibliography = mapFromObject(bibliography); - } - } - - get bibliography() { - return this._bibliography; - } - - static fromObject(source) { - const frontMatter = new FrontMatter(); - Object.assign(frontMatter, source); - return frontMatter; - } - - assignToObject(target) { - Object.assign(target, this); - target.bibliography = objectFromMap(this.bibliographyEntries); - target.url = this.url; - target.doi = this.doi; - target.githubUrl = this.githubUrl; - target.previewURL = this.previewURL; - if (this.publishedDate) { - target.volume = this.volume; - target.issue = this.issue; - target.publishedDateRFC = this.publishedDateRFC; - target.publishedYear = this.publishedYear; - target.publishedMonth = this.publishedMonth; - target.publishedDay = this.publishedDay; - target.publishedMonthPadded = this.publishedMonthPadded; - target.publishedDayPadded = this.publishedDayPadded; - } - if (this.updatedDate) { - target.updatedDateRFC = this.updatedDateRFC; - } - target.concatenatedAuthors = this.concatenatedAuthors; - target.bibtexAuthors = this.bibtexAuthors; - target.slug = this.slug; - } - - } - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. 
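// A sketch (illustrative, not part of the original file) of the computed
// properties on the FrontMatter class that ends above, using the
// augmented-rnns example from its comments. `fromObject` simply
// Object.assigns the source onto a fresh instance.
const fm = FrontMatter.fromObject({
  title: 'Attention and Augmented Recurrent Neural Networks',
  publishedDate: new Date('2016-09-08T07:00:00.000Z'),
  authors: [new Author({ author: 'Chris Olah' }), new Author({ author: 'Shan Carter' })]
});
fm.volume;              // 1, i.e. publishedYear (2016) - 2015
fm.issue;               // 9, i.e. month index + 1
fm.concatenatedAuthors; // 'Olah & Carter'
fm.slug;                // 'olah2016attention' (last name + year + first title word)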
- // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - // See the License for the specific language governing permissions and - // limitations under the License. - - function _moveLegacyAffiliationFormatIntoArray(frontMatter) { - // authors used to have properties "affiliation" and "affiliationURL". - // We now encourage using an array for affiliations containing objects with - // properties "name" and "url". - for (let author of frontMatter.authors) { - const hasOldStyle = Boolean(author.affiliation); - const hasNewStyle = Boolean(author.affiliations); - if (!hasOldStyle) continue; - if (hasNewStyle) { - console.warn(`Author ${author.author} has both old-style ("affiliation" & "affiliationURL") and new style ("affiliations") affiliation information!`); - } else { - let newAffiliation = { - "name": author.affiliation - }; - if (author.affiliationURL) newAffiliation.url = author.affiliationURL; - author.affiliations = [newAffiliation]; - } - } - return frontMatter - } - - function parseFrontmatter(element) { - const scriptTag = element.firstElementChild; - if (scriptTag) { - const type = scriptTag.getAttribute('type'); - if (type.split('/')[1] == 'json') { - const content = scriptTag.textContent; - const parsed = JSON.parse(content); - return _moveLegacyAffiliationFormatIntoArray(parsed); - } else { - console.error('Distill only supports JSON frontmatter tags anymore; no more YAML.'); - } - } else { - console.error('You added a frontmatter tag but did not provide a script tag with front matter data in it. Please take a look at our templates.'); - } - return {}; - } - - // Copyright 2018 The Distill Template Authors - - function ExtractFrontmatter(dom, data) { - const frontMatterTag = dom.querySelector('d-front-matter'); - if (!frontMatterTag) { - console.warn('No front matter tag found!'); - return; - } - const extractedData = parseFrontmatter(frontMatterTag); - mergeFromYMLFrontmatter(data, extractedData); - } - - function commonjsRequire () { - throw new Error('Dynamic requires are not currently supported by rollup-plugin-commonjs'); - } - - function unwrapExports (x) { - return x && x.__esModule && Object.prototype.hasOwnProperty.call(x, 'default') ? x['default'] : x; - } - - function createCommonjsModule(fn, module) { - return module = { exports: {} }, fn(module, module.exports), module.exports; - } - - var bibtexParse = createCommonjsModule(function (module, exports) { - /* start bibtexParse 0.0.22 */ - - //Original work by Henrik Muehe (c) 2010 - // - //CommonJS port by Mikola Lysenko 2013 - // - //Port to Browser lib by ORCID / RCPETERS - // - //Issues: - //no comment handling within strings - //no string concatenation - //no variable values yet - //Grammar implemented here: - //bibtex -> (string | preamble | comment | entry)*; - //string -> '@STRING' '{' key_equals_value '}'; - //preamble -> '@PREAMBLE' '{' value '}'; - //comment -> '@COMMENT' '{' value '}'; - //entry -> '@' key '{' key ',' key_value_list '}'; - //key_value_list -> key_equals_value (',' key_equals_value)*; - //key_equals_value -> key '=' value; - //value -> value_quotes | value_braces | key; - //value_quotes -> '"' .*? '"'; // not quite - //value_braces -> '{' .*? 
'"'; // not quite - (function(exports) { - - function BibtexParser() { - - this.months = ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]; - this.notKey = [',','{','}',' ','=']; - this.pos = 0; - this.input = ""; - this.entries = new Array(); - - this.currentEntry = ""; - - this.setInput = function(t) { - this.input = t; - }; - - this.getEntries = function() { - return this.entries; - }; - - this.isWhitespace = function(s) { - return (s == ' ' || s == '\r' || s == '\t' || s == '\n'); - }; - - this.match = function(s, canCommentOut) { - if (canCommentOut == undefined || canCommentOut == null) - canCommentOut = true; - this.skipWhitespace(canCommentOut); - if (this.input.substring(this.pos, this.pos + s.length) == s) { - this.pos += s.length; - } else { - throw "Token mismatch, expected " + s + ", found " - + this.input.substring(this.pos); - } this.skipWhitespace(canCommentOut); - }; - - this.tryMatch = function(s, canCommentOut) { - if (canCommentOut == undefined || canCommentOut == null) - canCommentOut = true; - this.skipWhitespace(canCommentOut); - if (this.input.substring(this.pos, this.pos + s.length) == s) { - return true; - } else { - return false; - } }; - - /* when search for a match all text can be ignored, not just white space */ - this.matchAt = function() { - while (this.input.length > this.pos && this.input[this.pos] != '@') { - this.pos++; - } - if (this.input[this.pos] == '@') { - return true; - } return false; - }; - - this.skipWhitespace = function(canCommentOut) { - while (this.isWhitespace(this.input[this.pos])) { - this.pos++; - } if (this.input[this.pos] == "%" && canCommentOut == true) { - while (this.input[this.pos] != "\n") { - this.pos++; - } this.skipWhitespace(canCommentOut); - } }; - - this.value_braces = function() { - var bracecount = 0; - this.match("{", false); - var start = this.pos; - var escaped = false; - while (true) { - if (!escaped) { - if (this.input[this.pos] == '}') { - if (bracecount > 0) { - bracecount--; - } else { - var end = this.pos; - this.match("}", false); - return this.input.substring(start, end); - } } else if (this.input[this.pos] == '{') { - bracecount++; - } else if (this.pos >= this.input.length - 1) { - throw "Unterminated value"; - } } if (this.input[this.pos] == '\\' && escaped == false) - escaped = true; - else - escaped = false; - this.pos++; - } }; - - this.value_comment = function() { - var str = ''; - var brcktCnt = 0; - while (!(this.tryMatch("}", false) && brcktCnt == 0)) { - str = str + this.input[this.pos]; - if (this.input[this.pos] == '{') - brcktCnt++; - if (this.input[this.pos] == '}') - brcktCnt--; - if (this.pos >= this.input.length - 1) { - throw "Unterminated value:" + this.input.substring(start); - } this.pos++; - } return str; - }; - - this.value_quotes = function() { - this.match('"', false); - var start = this.pos; - var escaped = false; - while (true) { - if (!escaped) { - if (this.input[this.pos] == '"') { - var end = this.pos; - this.match('"', false); - return this.input.substring(start, end); - } else if (this.pos >= this.input.length - 1) { - throw "Unterminated value:" + this.input.substring(start); - } } - if (this.input[this.pos] == '\\' && escaped == false) - escaped = true; - else - escaped = false; - this.pos++; - } }; - - this.single_value = function() { - var start = this.pos; - if (this.tryMatch("{")) { - return this.value_braces(); - } else if (this.tryMatch('"')) { - return this.value_quotes(); - } else { - var k = this.key(); - if (k.match("^[0-9]+$")) - 
return k; - else if (this.months.indexOf(k.toLowerCase()) >= 0) - return k.toLowerCase(); - else - throw "Value expected:" + this.input.substring(start) + ' for key: ' + k; - - } }; - - this.value = function() { - var values = []; - values.push(this.single_value()); - while (this.tryMatch("#")) { - this.match("#"); - values.push(this.single_value()); - } return values.join(""); - }; - - this.key = function() { - var start = this.pos; - while (true) { - if (this.pos >= this.input.length) { - throw "Runaway key"; - } // а-яА-Я is Cyrillic - //console.log(this.input[this.pos]); - if (this.notKey.indexOf(this.input[this.pos]) >= 0) { - return this.input.substring(start, this.pos); - } else { - this.pos++; - - } } }; - - this.key_equals_value = function() { - var key = this.key(); - if (this.tryMatch("=")) { - this.match("="); - var val = this.value(); - return [ key, val ]; - } else { - throw "... = value expected, equals sign missing:" - + this.input.substring(this.pos); - } }; - - this.key_value_list = function() { - var kv = this.key_equals_value(); - this.currentEntry['entryTags'] = {}; - this.currentEntry['entryTags'][kv[0]] = kv[1]; - while (this.tryMatch(",")) { - this.match(","); - // fixes problems with commas at the end of a list - if (this.tryMatch("}")) { - break; - } - kv = this.key_equals_value(); - this.currentEntry['entryTags'][kv[0]] = kv[1]; - } }; - - this.entry_body = function(d) { - this.currentEntry = {}; - this.currentEntry['citationKey'] = this.key(); - this.currentEntry['entryType'] = d.substring(1); - this.match(","); - this.key_value_list(); - this.entries.push(this.currentEntry); - }; - - this.directive = function() { - this.match("@"); - return "@" + this.key(); - }; - - this.preamble = function() { - this.currentEntry = {}; - this.currentEntry['entryType'] = 'PREAMBLE'; - this.currentEntry['entry'] = this.value_comment(); - this.entries.push(this.currentEntry); - }; - - this.comment = function() { - this.currentEntry = {}; - this.currentEntry['entryType'] = 'COMMENT'; - this.currentEntry['entry'] = this.value_comment(); - this.entries.push(this.currentEntry); - }; - - this.entry = function(d) { - this.entry_body(d); - }; - - this.bibtex = function() { - while (this.matchAt()) { - var d = this.directive(); - this.match("{"); - if (d == "@STRING") { - this.string(); - } else if (d == "@PREAMBLE") { - this.preamble(); - } else if (d == "@COMMENT") { - this.comment(); - } else { - this.entry(d); - } - this.match("}"); - } }; - } - exports.toJSON = function(bibtex) { - var b = new BibtexParser(); - b.setInput(bibtex); - b.bibtex(); - return b.entries; - }; - - /* added during hackathon don't hate on me */ - exports.toBibtex = function(json) { - var out = ''; - for ( var i in json) { - out += "@" + json[i].entryType; - out += '{'; - if (json[i].citationKey) - out += json[i].citationKey + ', '; - if (json[i].entry) - out += json[i].entry ; - if (json[i].entryTags) { - var tags = ''; - for (var jdx in json[i].entryTags) { - if (tags.length != 0) - tags += ', '; - tags += jdx + '= {' + json[i].entryTags[jdx] + '}'; - } - out += tags; - } - out += '}\n\n'; - } - return out; - - }; - - })( exports); - - /* end bibtexParse */ - }); - - // Copyright 2018 The Distill Template Authors - - function normalizeTag(string) { - return string - .replace(/[\t\n ]+/g, ' ') - .replace(/{\\["^`.'acu~Hvs]( )?([a-zA-Z])}/g, (full, x, char) => char) - .replace(/{\\([a-zA-Z])}/g, (full, char) => char); - } - - function parseBibtex(bibtex) { - const bibliography = new Map(); - const 
parsedEntries = bibtexParse.toJSON(bibtex); - for (const entry of parsedEntries) { - // normalize tags; note entryTags is an object, not Map - for (const [key, value] of Object.entries(entry.entryTags)) { - entry.entryTags[key.toLowerCase()] = normalizeTag(value); - } - entry.entryTags.type = entry.entryType; - // add to bibliography - bibliography.set(entry.citationKey, entry.entryTags); - } - return bibliography; - } - - function serializeFrontmatterToBibtex(frontMatter) { - return `@article{${frontMatter.slug}, - author = {${frontMatter.bibtexAuthors}}, - title = {${frontMatter.title}}, - journal = {${frontMatter.journal.title}}, - year = {${frontMatter.publishedYear}}, - note = {${frontMatter.url}}, - doi = {${frontMatter.doi}} -}`; - } - - // Copyright 2018 The Distill Template Authors - - function parseBibliography(element) { - const scriptTag = element.firstElementChild; - if (scriptTag && scriptTag.tagName === 'SCRIPT') { - if (scriptTag.type == 'text/bibtex') { - const bibtex = element.firstElementChild.textContent; - return parseBibtex(bibtex); - } else if (scriptTag.type == 'text/json') { - return new Map(JSON.parse(scriptTag.textContent)); - } else { - console.warn('Unsupported bibliography script tag type: ' + scriptTag.type); - } - } else { - console.warn('Bibliography did not have any script tag.'); - } - } - - // Copyright 2018 The Distill Template Authors - - function ExtractBibliography(dom, data) { - const bibliographyTag = dom.querySelector('d-bibliography'); - if (!bibliographyTag) { - console.warn('No bibliography tag found!'); - return; - } - - const src = bibliographyTag.getAttribute('src'); - if (src) { - const path = data.inputDirectory + '/' + src; - const text = fs.readFileSync(path, 'utf-8'); - const bibliography = parseBibtex(text); - const scriptTag = dom.createElement('script'); - scriptTag.type = 'text/json'; - scriptTag.textContent = JSON.stringify([...bibliography]); - bibliographyTag.appendChild(scriptTag); - bibliographyTag.removeAttribute('src'); - } - - data.bibliography = parseBibliography(bibliographyTag); - } - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. - // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - // See the License for the specific language governing permissions and - // limitations under the License. 
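// A sketch of the bibliography path that ends above: parseBibtex() runs the
// vendored bibtexParse, lower-cases and normalizes every tag, copies the
// entry type onto the tags, and keys the result by citation key. The entry
// below reuses the gregor2015draw example from the front-matter comments.
const bib = parseBibtex(`@article{gregor2015draw,
  title = {DRAW: A recurrent neural network for image generation},
  author = {Gregor, Karol and Danihelka, Ivo},
  year = {2015}
}`);
bib.get('gregor2015draw').title; // 'DRAW: A recurrent neural network for image generation'
bib.get('gregor2015draw').type;  // 'article'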
- - function collect_citations(dom = document) { - const citations = new Set(); - const citeTags = dom.querySelectorAll("d-cite"); - for (const tag of citeTags) { - const keyString = tag.getAttribute("key") || tag.getAttribute("bibtex-key"); - const keys = keyString.split(",").map(k => k.trim()); - for (const key of keys) { - citations.add(key); - } - } - return [...citations]; - } - - function author_string(ent, template, sep, finalSep) { - if (ent.author == null) { - return ""; - } - var names = ent.author.split(" and "); - let name_strings = names.map(name => { - name = name.trim(); - if (name.indexOf(",") != -1) { - var last = name.split(",")[0].trim(); - var firsts = name.split(",")[1]; - } else if (name.indexOf(" ") != -1) { - var last = name - .split(" ") - .slice(-1)[0] - .trim(); - var firsts = name - .split(" ") - .slice(0, -1) - .join(" "); - } else { - var last = name.trim(); - } - var initials = ""; - if (firsts != undefined) { - initials = firsts - .trim() - .split(" ") - .map(s => s.trim()[0]); - initials = initials.join(".") + "."; - } - return template - .replace("${F}", firsts) - .replace("${L}", last) - .replace("${I}", initials) - .trim(); // in case one of first or last was empty - }); - if (names.length > 1) { - var str = name_strings.slice(0, names.length - 1).join(sep); - str += (finalSep || sep) + name_strings[names.length - 1]; - return str; - } else { - return name_strings[0]; - } - } - - function venue_string(ent) { - var cite = ent.journal || ent.booktitle || ""; - if ("volume" in ent) { - var issue = ent.issue || ent.number; - issue = issue != undefined ? "(" + issue + ")" : ""; - cite += ", Vol " + ent.volume + issue; - } - if ("pages" in ent) { - cite += ", pp. " + ent.pages; - } - if (cite != "") cite += ". "; - if ("publisher" in ent) { - cite += ent.publisher; - if (cite[cite.length - 1] != ".") cite += "."; - } - return cite; - } - - function link_string(ent) { - if ("url" in ent) { - var url = ent.url; - var arxiv_match = /arxiv\.org\/abs\/([0-9\.]*)/.exec(url); - if (arxiv_match != null) { - url = `http://arxiv.org/pdf/${arxiv_match[1]}.pdf`; - } - - if (url.slice(-4) == ".pdf") { - var label = "PDF"; - } else if (url.slice(-5) == ".html") { - var label = "HTML"; - } - return `  [${label || "link"}]`; - } /* else if ("doi" in ent){ - return `  [DOI]`; - }*/ else { - return ""; - } - } - function doi_string(ent, new_line) { - if ("doi" in ent) { - return `${new_line ? "
      " : ""} DOI: ${ent.doi}`; - } else { - return ""; - } - } - - function title_string(ent) { - return '' + ent.title + " "; - } - - function bibliography_cite(ent, fancy) { - if (ent) { - var cite = title_string(ent); - cite += link_string(ent) + "
      "; - if (ent.author) { - cite += author_string(ent, "${L}, ${I}", ", ", " and "); - if (ent.year || ent.date) { - cite += ", "; - } - } - if (ent.year || ent.date) { - cite += (ent.year || ent.date) + ". "; - } else { - cite += ". "; - } - cite += venue_string(ent); - cite += doi_string(ent); - return cite; - /*var cite = author_string(ent, "${L}, ${I}", ", ", " and "); - if (ent.year || ent.date){ - cite += ", " + (ent.year || ent.date) + ". " - } else { - cite += ". " - } - cite += "" + ent.title + ". "; - cite += venue_string(ent); - cite += doi_string(ent); - cite += link_string(ent); - return cite*/ - } else { - return "?"; - } - } - - // Copyright 2018 The Distill Template Authors - - function ExtractCitations(dom, data) { - const citations = new Set(data.citations); - const newCitations = collect_citations(dom); - for (const citation of newCitations) { - citations.add(citation); - } - data.citations = Array.from(citations); - } - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. - // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - // See the License for the specific language governing permissions and - // limitations under the License. - - function HTML(dom) { - - const head = dom.querySelector('head'); - - // set language to 'en' - if (!dom.querySelector('html').getAttribute('lang')) { - dom.querySelector('html').setAttribute('lang', 'en'); - } - - // set charset to 'utf-8' - if (!dom.querySelector('meta[charset]')) { - const meta = dom.createElement('meta'); - meta.setAttribute('charset', 'utf-8'); - head.appendChild(meta); - } - - // set viewport - if (!dom.querySelector('meta[name=viewport]')) { - const meta = dom.createElement('meta'); - meta.setAttribute('name', 'viewport'); - meta.setAttribute('content', 'width=device-width, initial-scale=1'); - head.appendChild(meta); - } - } - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. - // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - // See the License for the specific language governing permissions and - // limitations under the License. - - // import style from '../styles/d-byline.css'; - - function bylineTemplate(frontMatter) { - return ` - -`; - } - - // Copyright 2018 The Distill Template Authors - - function Byline(dom, data) { - const byline = dom.querySelector('d-byline'); - if (byline) { - byline.innerHTML = bylineTemplate(data); - } - } - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. 
- // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - // See the License for the specific language governing permissions and - // limitations under the License. - - // no appendix -> add appendix - // title in front, no h1 -> add it - // no title in front, h1 -> read and put into frontMatter - // footnote -> footnote list - // break up bib - // if citation, no bib-list -> add citation-list - - // if authors, no byline -> add byline - - function OptionalComponents(dom, data) { - const body = dom.body; - const article = body.querySelector('d-article'); - - // If we don't have an article tag, something weird is going on—giving up. - if (!article) { - console.warn('No d-article tag found; skipping adding optional components!'); - return; - } - - let byline = dom.querySelector('d-byline'); - if (!byline) { - if (data.authors) { - byline = dom.createElement('d-byline'); - body.insertBefore(byline, article); - } else { - console.warn('No authors found in front matter; please add them before submission!'); - } - } - - let title = dom.querySelector('d-title'); - if (!title) { - title = dom.createElement('d-title'); - body.insertBefore(title, byline); - } - - let h1 = title.querySelector('h1'); - if (!h1) { - h1 = dom.createElement('h1'); - h1.textContent = data.title; - title.insertBefore(h1, title.firstChild); - } - - const hasPassword = typeof data.password !== 'undefined'; - let interstitial = body.querySelector('d-interstitial'); - if (hasPassword && !interstitial) { - const inBrowser = typeof window !== 'undefined'; - const onLocalhost = inBrowser && window.location.hostname.includes('localhost'); - if (!inBrowser || !onLocalhost) { - interstitial = dom.createElement('d-interstitial'); - interstitial.password = data.password; - body.insertBefore(interstitial, body.firstChild); - } - } else if (!hasPassword && interstitial) { - interstitial.parentElement.removeChild(this); - } - - let appendix = dom.querySelector('d-appendix'); - if (!appendix) { - appendix = dom.createElement('d-appendix'); - dom.body.appendChild(appendix); - } - - let footnoteList = dom.querySelector('d-footnote-list'); - if (!footnoteList) { - footnoteList = dom.createElement('d-footnote-list'); - appendix.appendChild(footnoteList); - } - - let citationList = dom.querySelector('d-citation-list'); - if (!citationList) { - citationList = dom.createElement('d-citation-list'); - appendix.appendChild(citationList); - } - - } - - var katex$1 = createCommonjsModule(function (module, exports) { - (function(f){{module.exports=f();}})(function(){return (function e(t,n,r){function s(o,u){if(!n[o]){if(!t[o]){var a=typeof commonjsRequire=="function"&&commonjsRequire;if(!u&&a)return a(o,!0);if(i)return i(o,!0);var f=new Error("Cannot find module '"+o+"'");throw f.code="MODULE_NOT_FOUND",f}var l=n[o]={exports:{}};t[o][0].call(l.exports,function(e){var n=t[o][1][e];return s(n?n:e)},l,l.exports,e,t,n,r);}return n[o].exports}var i=typeof commonjsRequire=="function"&&commonjsRequire;for(var o=0;o= 0; --i) { - tok = expansion[i]; - if (tok.text === "#") { - if (i === 0) { - throw new _ParseError2.default("Incomplete placeholder at end of macro body", tok); - } - tok = expansion[--i]; // next token on stack - if (tok.text === "#") { - // ## → # - expansion.splice(i + 1, 1); // 
drop first # - } else if (/^[1-9]$/.test(tok.text)) { - // expansion.splice(i, 2, arg[0], arg[1], …) - // to replace placeholder with the indicated argument. - // TODO: use spread once we move to ES2015 - expansion.splice.apply(expansion, [i, 2].concat(args[tok.text - 1])); - } else { - throw new _ParseError2.default("Not a valid argument number", tok); - } - } - } - } - this.stack = this.stack.concat(expansion); - } - } - }, { - key: "get", - value: function get(ignoreSpace) { - this.discardedWhiteSpace = []; - var token = this.nextToken(); - if (ignoreSpace) { - while (token.text === " ") { - this.discardedWhiteSpace.push(token); - token = this.nextToken(); - } - } - return token; - } - - /** - * Undo the effect of the preceding call to the get method. - * A call to this method MUST be immediately preceded and immediately followed - * by a call to get. Only used during mode switching, i.e. after one token - * was got in the old mode but should get got again in a new mode - * with possibly different whitespace handling. - */ - - }, { - key: "unget", - value: function unget(token) { - this.stack.push(token); - while (this.discardedWhiteSpace.length !== 0) { - this.stack.push(this.discardedWhiteSpace.pop()); - } - } - }]); - return MacroExpander; - }(); - - module.exports = MacroExpander; - - },{"./Lexer":26,"./ParseError":29,"./macros":44,"babel-runtime/helpers/classCallCheck":4,"babel-runtime/helpers/createClass":5,"object-assign":25}],28:[function(require,module,exports){ - - var _classCallCheck2 = require("babel-runtime/helpers/classCallCheck"); - - var _classCallCheck3 = _interopRequireDefault(_classCallCheck2); - - var _createClass2 = require("babel-runtime/helpers/createClass"); - - var _createClass3 = _interopRequireDefault(_createClass2); - - var _fontMetrics2 = require("./fontMetrics"); - - var _fontMetrics3 = _interopRequireDefault(_fontMetrics2); - - function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } - - var BASESIZE = 6; /** - * This file contains information about the options that the Parser carries - * around with it while parsing. Data is held in an `Options` object, and when - * recursing, a new `Options` object can be created with the `.with*` and - * `.reset` functions. - */ - - var sizeStyleMap = [ - // Each element contains [textsize, scriptsize, scriptscriptsize]. - // The size mappings are taken from TeX with \normalsize=10pt. - [1, 1, 1], // size1: [5, 5, 5] \tiny - [2, 1, 1], // size2: [6, 5, 5] - [3, 1, 1], // size3: [7, 5, 5] \scriptsize - [4, 2, 1], // size4: [8, 6, 5] \footnotesize - [5, 2, 1], // size5: [9, 6, 5] \small - [6, 3, 1], // size6: [10, 7, 5] \normalsize - [7, 4, 2], // size7: [12, 8, 6] \large - [8, 6, 3], // size8: [14.4, 10, 7] \Large - [9, 7, 6], // size9: [17.28, 12, 10] \LARGE - [10, 8, 7], // size10: [20.74, 14.4, 12] \huge - [11, 10, 9]]; - - var sizeMultipliers = [ - // fontMetrics.js:getFontMetrics also uses size indexes, so if - // you change size indexes, change that function. - 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.2, 1.44, 1.728, 2.074, 2.488]; - - var sizeAtStyle = function sizeAtStyle(size, style) { - return style.size < 2 ? size : sizeStyleMap[size - 1][style.size - 1]; - }; - - /** - * This is the main options class. It contains the current style, size, color, - * and font. - * - * Options objects should not be modified. To create a new Options with - * different properties, call a `.having*` method. 
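The size tables above are compact but easy to misread, so here is a small standalone sketch (not the bundle's exported API) that reproduces `sizeAtStyle` and the multiplier lookup from this module, to show how a user-selected size degrades in script styles:

```javascript
// Data copied verbatim from the Options module above.
const sizeStyleMap = [
  [1, 1, 1], [2, 1, 1], [3, 1, 1], [4, 2, 1], [5, 2, 1],
  [6, 3, 1], [7, 4, 2], [8, 6, 3], [9, 7, 6], [10, 8, 7], [11, 10, 9],
];
const sizeMultipliers = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.2, 1.44, 1.728, 2.074, 2.488];

// style.size: 0 = display, 1 = text, 2 = script, 3 = scriptscript
function sizeAtStyle(size, style) {
  return style.size < 2 ? size : sizeStyleMap[size - 1][style.size - 1];
}

// \normalsize (size index 6) stays 6 in text style but drops to 3 in script style:
console.log(sizeAtStyle(6, { size: 1 }));                      // 6 -> multiplier 1.0
console.log(sizeAtStyle(6, { size: 2 }));                      // 3 -> multiplier 0.7
console.log(sizeMultipliers[sizeAtStyle(6, { size: 2 }) - 1]); // 0.7
```

Row `size6` corresponds to `\normalsize`, so script-level material inside it renders at 0.7 of the base size, matching TeX's 10pt/7pt ladder noted in the comments above.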
- */ - - var Options = function () { - function Options(data) { - (0, _classCallCheck3.default)(this, Options); - - this.style = data.style; - this.color = data.color; - this.size = data.size || BASESIZE; - this.textSize = data.textSize || this.size; - this.phantom = data.phantom; - this.font = data.font; - this.sizeMultiplier = sizeMultipliers[this.size - 1]; - this._fontMetrics = null; - } - - /** - * Returns a new options object with the same properties as "this". Properties - * from "extension" will be copied to the new options object. - */ - - - (0, _createClass3.default)(Options, [{ - key: "extend", - value: function extend(extension) { - var data = { - style: this.style, - size: this.size, - textSize: this.textSize, - color: this.color, - phantom: this.phantom, - font: this.font - }; - - for (var key in extension) { - if (extension.hasOwnProperty(key)) { - data[key] = extension[key]; - } - } - - return new Options(data); - } - - /** - * Return an options object with the given style. If `this.style === style`, - * returns `this`. - */ - - }, { - key: "havingStyle", - value: function havingStyle(style) { - if (this.style === style) { - return this; - } else { - return this.extend({ - style: style, - size: sizeAtStyle(this.textSize, style) - }); - } - } - - /** - * Return an options object with a cramped version of the current style. If - * the current style is cramped, returns `this`. - */ - - }, { - key: "havingCrampedStyle", - value: function havingCrampedStyle() { - return this.havingStyle(this.style.cramp()); - } - - /** - * Return an options object with the given size and in at least `\textstyle`. - * Returns `this` if appropriate. - */ - - }, { - key: "havingSize", - value: function havingSize(size) { - if (this.size === size && this.textSize === size) { - return this; - } else { - return this.extend({ - style: this.style.text(), - size: size, - textSize: size - }); - } - } - - /** - * Like `this.havingSize(BASESIZE).havingStyle(style)`. If `style` is omitted, - * changes to at least `\textstyle`. - */ - - }, { - key: "havingBaseStyle", - value: function havingBaseStyle(style) { - style = style || this.style.text(); - var wantSize = sizeAtStyle(BASESIZE, style); - if (this.size === wantSize && this.textSize === BASESIZE && this.style === style) { - return this; - } else { - return this.extend({ - style: style, - size: wantSize, - baseSize: BASESIZE - }); - } - } - - /** - * Create a new options object with the given color. - */ - - }, { - key: "withColor", - value: function withColor(color) { - return this.extend({ - color: color - }); - } - - /** - * Create a new options object with "phantom" set to true. - */ - - }, { - key: "withPhantom", - value: function withPhantom() { - return this.extend({ - phantom: true - }); - } - - /** - * Create a new options objects with the give font. - */ - - }, { - key: "withFont", - value: function withFont(font) { - return this.extend({ - font: font || this.font - }); - } - - /** - * Return the CSS sizing classes required to switch from enclosing options - * `oldOptions` to `this`. Returns an array of classes. - */ - - }, { - key: "sizingClasses", - value: function sizingClasses(oldOptions) { - if (oldOptions.size !== this.size) { - return ["sizing", "reset-size" + oldOptions.size, "size" + this.size]; - } else { - return []; - } - } - - /** - * Return the CSS sizing classes required to switch to the base size. Like - * `this.havingSize(BASESIZE).sizingClasses(this)`. 
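Because Options objects are never mutated, every `.with*` and `.having*` call returns a fresh object built through `extend`. A minimal sketch of the pattern (a hypothetical `Opts` class, not the library's exported API):

```javascript
// Immutable-options pattern: "mutations" return new objects via extend, and
// sizingClasses derives the CSS classes needed to move between two sizes.
class Opts {
  constructor(data) { this.size = data.size; this.color = data.color; }
  extend(extension) {
    return new Opts(Object.assign({ size: this.size, color: this.color }, extension));
  }
  withColor(color) { return this.extend({ color }); }
  sizingClasses(oldOpts) {
    return oldOpts.size !== this.size
      ? ["sizing", "reset-size" + oldOpts.size, "size" + this.size]
      : [];
  }
}

const base = new Opts({ size: 6, color: "black" });
const large = base.extend({ size: 8 });      // base itself is untouched
console.log(large.sizingClasses(base));      // ["sizing", "reset-size6", "size8"]
console.log(base.withColor("red") !== base); // true: a fresh object each time
```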
- */ - - }, { - key: "baseSizingClasses", - value: function baseSizingClasses() { - if (this.size !== BASESIZE) { - return ["sizing", "reset-size" + this.size, "size" + BASESIZE]; - } else { - return []; - } - } - - /** - * Return the font metrics for this size. - */ - - }, { - key: "fontMetrics", - value: function fontMetrics() { - if (!this._fontMetrics) { - this._fontMetrics = _fontMetrics3.default.getFontMetrics(this.size); - } - return this._fontMetrics; - } - - /** - * A map of color names to CSS colors. - * TODO(emily): Remove this when we have real macros - */ - - }, { - key: "getColor", - - - /** - * Gets the CSS color of the current options object, accounting for the - * `colorMap`. - */ - value: function getColor() { - if (this.phantom) { - return "transparent"; - } else { - return Options.colorMap[this.color] || this.color; - } - } - }]); - return Options; - }(); - - /** - * The base size index. - */ - - - Options.colorMap = { - "katex-blue": "#6495ed", - "katex-orange": "#ffa500", - "katex-pink": "#ff00af", - "katex-red": "#df0030", - "katex-green": "#28ae7b", - "katex-gray": "gray", - "katex-purple": "#9d38bd", - "katex-blueA": "#ccfaff", - "katex-blueB": "#80f6ff", - "katex-blueC": "#63d9ea", - "katex-blueD": "#11accd", - "katex-blueE": "#0c7f99", - "katex-tealA": "#94fff5", - "katex-tealB": "#26edd5", - "katex-tealC": "#01d1c1", - "katex-tealD": "#01a995", - "katex-tealE": "#208170", - "katex-greenA": "#b6ffb0", - "katex-greenB": "#8af281", - "katex-greenC": "#74cf70", - "katex-greenD": "#1fab54", - "katex-greenE": "#0d923f", - "katex-goldA": "#ffd0a9", - "katex-goldB": "#ffbb71", - "katex-goldC": "#ff9c39", - "katex-goldD": "#e07d10", - "katex-goldE": "#a75a05", - "katex-redA": "#fca9a9", - "katex-redB": "#ff8482", - "katex-redC": "#f9685d", - "katex-redD": "#e84d39", - "katex-redE": "#bc2612", - "katex-maroonA": "#ffbde0", - "katex-maroonB": "#ff92c6", - "katex-maroonC": "#ed5fa6", - "katex-maroonD": "#ca337c", - "katex-maroonE": "#9e034e", - "katex-purpleA": "#ddd7ff", - "katex-purpleB": "#c6b9fc", - "katex-purpleC": "#aa87ff", - "katex-purpleD": "#7854ab", - "katex-purpleE": "#543b78", - "katex-mintA": "#f5f9e8", - "katex-mintB": "#edf2df", - "katex-mintC": "#e0e5cc", - "katex-grayA": "#f6f7f7", - "katex-grayB": "#f0f1f2", - "katex-grayC": "#e3e5e6", - "katex-grayD": "#d6d8da", - "katex-grayE": "#babec2", - "katex-grayF": "#888d93", - "katex-grayG": "#626569", - "katex-grayH": "#3b3e40", - "katex-grayI": "#21242c", - "katex-kaBlue": "#314453", - "katex-kaGreen": "#71B307" - }; - Options.BASESIZE = BASESIZE; - - module.exports = Options; - - },{"./fontMetrics":41,"babel-runtime/helpers/classCallCheck":4,"babel-runtime/helpers/createClass":5}],29:[function(require,module,exports){ - - var _classCallCheck2 = require("babel-runtime/helpers/classCallCheck"); - - var _classCallCheck3 = _interopRequireDefault(_classCallCheck2); - - function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } - - /** - * This is the ParseError class, which is the main error thrown by KaTeX - * functions when something has gone wrong. This is used to distinguish internal - * errors from errors in the expression that the user provided. - * - * If possible, a caller should provide a Token or ParseNode with information - * about where in the source string the problem occurred. 
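A hedged example of what this buys the caller, assuming the standalone `katex` npm package rather than this inlined bundle: parse failures surface as a `ParseError` whose position field comes from the `self.position = start` assignment below.

```javascript
const katex = require("katex"); // assumption: the npm package, same parser

try {
  katex.renderToString("\\frac{1}{"); // unbalanced group
} catch (e) {
  console.log(e.name);            // "ParseError"
  console.log(e.message);         // "KaTeX parse error: ... at end of input ..."
  console.log(typeof e.position); // "number": start offset of the offending token
}
```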
- * - * @param {string} message The error message - * @param {(Token|ParseNode)=} token An object providing position information - */ - var ParseError = function ParseError(message, token) { - (0, _classCallCheck3.default)(this, ParseError); - - var error = "KaTeX parse error: " + message; - var start = void 0; - var end = void 0; - - if (token && token.lexer && token.start <= token.end) { - // If we have the input and a position, make the error a bit fancier - - // Get the input - var input = token.lexer.input; - - // Prepend some information - start = token.start; - end = token.end; - if (start === input.length) { - error += " at end of input: "; - } else { - error += " at position " + (start + 1) + ": "; - } - - // Underline token in question using combining underscores - var underlined = input.slice(start, end).replace(/[^]/g, "$&\u0332"); - - // Extract some context from the input and add it to the error - var left = void 0; - if (start > 15) { - left = "…" + input.slice(start - 15, start); - } else { - left = input.slice(0, start); - } - var right = void 0; - if (end + 15 < input.length) { - right = input.slice(end, end + 15) + "…"; - } else { - right = input.slice(end); - } - error += left + underlined + right; - } - - // Some hackery to make ParseError a prototype of Error - // See http://stackoverflow.com/a/8460753 - var self = new Error(error); - self.name = "ParseError"; - self.__proto__ = ParseError.prototype; - - self.position = start; - return self; - }; - - // More hackery - - - ParseError.prototype.__proto__ = Error.prototype; - - module.exports = ParseError; - - },{"babel-runtime/helpers/classCallCheck":4}],30:[function(require,module,exports){ - - Object.defineProperty(exports, "__esModule", { - value: true - }); - - var _classCallCheck2 = require("babel-runtime/helpers/classCallCheck"); - - var _classCallCheck3 = _interopRequireDefault(_classCallCheck2); - - function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } - - /** - * The resulting parse tree nodes of the parse tree. - * - * It is possible to provide position information, so that a ParseNode can - * fulfil a role similar to a Token in error reporting. - * For details on the corresponding properties see Token constructor. - * Providing such information can lead to better error reporting. - * - * @param {string} type type of node, like e.g. 
"ordgroup" - * @param {?object} value type-specific representation of the node - * @param {string} mode parse mode in action for this node, - * "math" or "text" - * @param {Token=} firstToken first token of the input for this node, - * will omit position information if unset - * @param {Token=} lastToken last token of the input for this node, - * will default to firstToken if unset - */ - var ParseNode = function ParseNode(type, value, mode, firstToken, lastToken) { - (0, _classCallCheck3.default)(this, ParseNode); - - this.type = type; - this.value = value; - this.mode = mode; - if (firstToken && (!lastToken || lastToken.lexer === firstToken.lexer)) { - this.lexer = firstToken.lexer; - this.start = firstToken.start; - this.end = (lastToken || firstToken).end; - } - }; - - exports.default = ParseNode; - - },{"babel-runtime/helpers/classCallCheck":4}],31:[function(require,module,exports){ - - var _classCallCheck2 = require("babel-runtime/helpers/classCallCheck"); - - var _classCallCheck3 = _interopRequireDefault(_classCallCheck2); - - var _createClass2 = require("babel-runtime/helpers/createClass"); - - var _createClass3 = _interopRequireDefault(_createClass2); - - var _functions = require("./functions"); - - var _functions2 = _interopRequireDefault(_functions); - - var _environments = require("./environments"); - - var _environments2 = _interopRequireDefault(_environments); - - var _MacroExpander = require("./MacroExpander"); - - var _MacroExpander2 = _interopRequireDefault(_MacroExpander); - - var _symbols = require("./symbols"); - - var _symbols2 = _interopRequireDefault(_symbols); - - var _utils = require("./utils"); - - var _utils2 = _interopRequireDefault(_utils); - - var _units = require("./units"); - - var _units2 = _interopRequireDefault(_units); - - var _unicodeRegexes = require("./unicodeRegexes"); - - var _ParseNode = require("./ParseNode"); - - var _ParseNode2 = _interopRequireDefault(_ParseNode); - - var _ParseError = require("./ParseError"); - - var _ParseError2 = _interopRequireDefault(_ParseError); - - function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } - - /** - * This file contains the parser used to parse out a TeX expression from the - * input. Since TeX isn't context-free, standard parsers don't work particularly - * well. - * - * The strategy of this parser is as such: - * - * The main functions (the `.parse...` ones) take a position in the current - * parse string to parse tokens from. The lexer (found in Lexer.js, stored at - * this.lexer) also supports pulling out tokens at arbitrary places. When - * individual tokens are needed at a position, the lexer is called to pull out a - * token, which is then used. - * - * The parser has a property called "mode" indicating the mode that - * the parser is currently in. Currently it has to be one of "math" or - * "text", which denotes whether the current environment is a math-y - * one or a text-y one (e.g. inside \text). Currently, this serves to - * limit the functions which can be used in text mode. - * - * The main functions then return an object which contains the useful data that - * was parsed at its given point, and a new position at the end of the parsed - * data. The main functions can call each other and continue the parsing by - * using the returned position as a new starting point. - * - * There are also extra `.handle...` functions, which pull out some reused - * functionality into self-contained functions. - * - * The earlier functions return ParseNodes. 
- * The later functions (which are called deeper in the parse) sometimes return
- * ParseFuncOrArgument, which contain a ParseNode as well as some data about
- * whether the parsed object is a function which is missing some arguments, or a
- * standalone object which can be used as an argument to another function.
- */
-
- /**
- * An initial function (without its arguments), or an argument to a function.
- * The `result` argument should be a ParseNode.
- */
- function ParseFuncOrArgument(result, isFunction, token) {
-   this.result = result;
-   // Is this a function (i.e. is it something defined in functions.js)?
-   this.isFunction = isFunction;
-   this.token = token;
- } /* eslint no-constant-condition:0 */
-
- var Parser = function () {
-   function Parser(input, settings) {
-     (0, _classCallCheck3.default)(this, Parser);
-
-     // Create a new macro expander (gullet) and (indirectly via that) also a
-     // new lexer (mouth) for this parser (stomach, in the language of TeX)
-     this.gullet = new _MacroExpander2.default(input, settings.macros);
-     // Use old \color behavior (same as LaTeX's \textcolor) if requested.
-     // We do this after the macros object has been copied by MacroExpander.
-     if (settings.colorIsTextColor) {
-       this.gullet.macros["\\color"] = "\\textcolor";
-     }
-     // Store the settings for use in parsing
-     this.settings = settings;
-     // Count leftright depth (for \middle errors)
-     this.leftrightDepth = 0;
-   }
-
-   /**
-    * Checks a result to make sure it has the right type, and throws an
-    * appropriate error otherwise.
-    *
-    * @param {boolean=} consume whether to consume the expected token,
-    *                           defaults to true
-    */
-
-
-   (0, _createClass3.default)(Parser, [{
-     key: "expect",
-     value: function expect(text, consume) {
-       if (this.nextToken.text !== text) {
-         throw new _ParseError2.default("Expected '" + text + "', got '" + this.nextToken.text + "'", this.nextToken);
-       }
-       if (consume !== false) {
-         this.consume();
-       }
-     }
-
-     /**
-      * Considers the current look ahead token as consumed,
-      * and fetches the one after that as the new look ahead.
-      */
-
-   }, {
-     key: "consume",
-     value: function consume() {
-       this.nextToken = this.gullet.get(this.mode === "math");
-     }
-   }, {
-     key: "switchMode",
-     value: function switchMode(newMode) {
-       this.gullet.unget(this.nextToken);
-       this.mode = newMode;
-       this.consume();
-     }
-
-     /**
-      * Main parsing function, which parses an entire input.
-      *
-      * @return {?Array.<ParseNode>}
-      */
-
-   }, {
-     key: "parse",
-     value: function parse() {
-       // Try to parse the input
-       this.mode = "math";
-       this.consume();
-       var parse = this.parseInput();
-       return parse;
-     }
-
-     /**
-      * Parses an entire input tree.
-      */
-
-   }, {
-     key: "parseInput",
-     value: function parseInput() {
-       // Parse an expression
-       var expression = this.parseExpression(false);
-       // If we succeeded, make sure there's an EOF at the end
-       this.expect("EOF", false);
-       return expression;
-     }
-   }, {
-     key: "parseExpression",
-
-
-     /**
-      * Parses an "expression", which is a list of atoms.
-      *
-      * @param {boolean} breakOnInfix Should the parsing stop when we hit infix
-      *                  nodes? This happens when functions have higher precedence
-      *                  than infix nodes in implicit parses.
-      *
-      * @param {?string} breakOnTokenText The text of the token that the expression
-      *                  should end with, or `null` if something else should end the
-      *                  expression. 
- * - * @return {ParseNode} - */ - value: function parseExpression(breakOnInfix, breakOnTokenText) { - var body = []; - // Keep adding atoms to the body until we can't parse any more atoms (either - // we reached the end, a }, or a \right) - while (true) { - var lex = this.nextToken; - if (Parser.endOfExpression.indexOf(lex.text) !== -1) { - break; - } - if (breakOnTokenText && lex.text === breakOnTokenText) { - break; - } - if (breakOnInfix && _functions2.default[lex.text] && _functions2.default[lex.text].infix) { - break; - } - var atom = this.parseAtom(); - if (!atom) { - if (!this.settings.throwOnError && lex.text[0] === "\\") { - var errorNode = this.handleUnsupportedCmd(); - body.push(errorNode); - continue; - } - - break; - } - body.push(atom); - } - return this.handleInfixNodes(body); - } - - /** - * Rewrites infix operators such as \over with corresponding commands such - * as \frac. - * - * There can only be one infix operator per group. If there's more than one - * then the expression is ambiguous. This can be resolved by adding {}. - * - * @returns {Array} - */ - - }, { - key: "handleInfixNodes", - value: function handleInfixNodes(body) { - var overIndex = -1; - var funcName = void 0; - - for (var i = 0; i < body.length; i++) { - var node = body[i]; - if (node.type === "infix") { - if (overIndex !== -1) { - throw new _ParseError2.default("only one infix operator per group", node.value.token); - } - overIndex = i; - funcName = node.value.replaceWith; - } - } - - if (overIndex !== -1) { - var numerNode = void 0; - var denomNode = void 0; - - var numerBody = body.slice(0, overIndex); - var denomBody = body.slice(overIndex + 1); - - if (numerBody.length === 1 && numerBody[0].type === "ordgroup") { - numerNode = numerBody[0]; - } else { - numerNode = new _ParseNode2.default("ordgroup", numerBody, this.mode); - } - - if (denomBody.length === 1 && denomBody[0].type === "ordgroup") { - denomNode = denomBody[0]; - } else { - denomNode = new _ParseNode2.default("ordgroup", denomBody, this.mode); - } - - var value = this.callFunction(funcName, [numerNode, denomNode], null); - return [new _ParseNode2.default(value.type, value, this.mode)]; - } else { - return body; - } - } - - // The greediness of a superscript or subscript - - }, { - key: "handleSupSubscript", - - - /** - * Handle a subscript or superscript with nice errors. 
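To see `handleInfixNodes` from the outside, here is a hedged demo (assuming the standalone `katex` package): `\over` splits its group into numerator and denominator and is rewritten to `\frac`, and a second infix operator in the same group is rejected as ambiguous.

```javascript
const katex = require("katex");

// "\over" is rewritten into \frac, so these should typeset identically:
katex.renderToString("a \\over b");
katex.renderToString("\\frac{a}{b}");

// Two infix operators in one group are ambiguous and rejected:
try { katex.renderToString("a \\over b \\over c"); }
catch (e) { console.log(e.message); } // "... only one infix operator per group"

// Braces resolve the ambiguity by making the inner group explicit:
katex.renderToString("{a \\over b} \\over c");
```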
- */ - value: function handleSupSubscript(name) { - var symbolToken = this.nextToken; - var symbol = symbolToken.text; - this.consume(); - var group = this.parseGroup(); - - if (!group) { - if (!this.settings.throwOnError && this.nextToken.text[0] === "\\") { - return this.handleUnsupportedCmd(); - } else { - throw new _ParseError2.default("Expected group after '" + symbol + "'", symbolToken); - } - } else if (group.isFunction) { - // ^ and _ have a greediness, so handle interactions with functions' - // greediness - var funcGreediness = _functions2.default[group.result].greediness; - if (funcGreediness > Parser.SUPSUB_GREEDINESS) { - return this.parseFunction(group); - } else { - throw new _ParseError2.default("Got function '" + group.result + "' with no arguments " + "as " + name, symbolToken); - } - } else { - return group.result; - } - } - - /** - * Converts the textual input of an unsupported command into a text node - * contained within a color node whose color is determined by errorColor - */ - - }, { - key: "handleUnsupportedCmd", - value: function handleUnsupportedCmd() { - var text = this.nextToken.text; - var textordArray = []; - - for (var i = 0; i < text.length; i++) { - textordArray.push(new _ParseNode2.default("textord", text[i], "text")); - } - - var textNode = new _ParseNode2.default("text", { - body: textordArray, - type: "text" - }, this.mode); - - var colorNode = new _ParseNode2.default("color", { - color: this.settings.errorColor, - value: [textNode], - type: "color" - }, this.mode); - - this.consume(); - return colorNode; - } - - /** - * Parses a group with optional super/subscripts. - * - * @return {?ParseNode} - */ - - }, { - key: "parseAtom", - value: function parseAtom() { - // The body of an atom is an implicit group, so that things like - // \left(x\right)^2 work correctly. - var base = this.parseImplicitGroup(); - - // In text mode, we don't have superscripts or subscripts - if (this.mode === "text") { - return base; - } - - // Note that base may be empty (i.e. null) at this point. - - var superscript = void 0; - var subscript = void 0; - while (true) { - // Lex the first token - var lex = this.nextToken; - - if (lex.text === "\\limits" || lex.text === "\\nolimits") { - // We got a limit control - if (!base || base.type !== "op") { - throw new _ParseError2.default("Limit controls must follow a math operator", lex); - } else { - var limits = lex.text === "\\limits"; - base.value.limits = limits; - base.value.alwaysHandleSupSub = true; - } - this.consume(); - } else if (lex.text === "^") { - // We got a superscript start - if (superscript) { - throw new _ParseError2.default("Double superscript", lex); - } - superscript = this.handleSupSubscript("superscript"); - } else if (lex.text === "_") { - // We got a subscript start - if (subscript) { - throw new _ParseError2.default("Double subscript", lex); - } - subscript = this.handleSupSubscript("subscript"); - } else if (lex.text === "'") { - // We got a prime - if (superscript) { - throw new _ParseError2.default("Double superscript", lex); - } - var prime = new _ParseNode2.default("textord", "\\prime", this.mode); - - // Many primes can be grouped together, so we handle this here - var primes = [prime]; - this.consume(); - // Keep lexing tokens until we get something that's not a prime - while (this.nextToken.text === "'") { - // For each one, add another prime to the list - primes.push(prime); - this.consume(); - } - // If there's a superscript following the primes, combine that - // superscript in with the primes. 
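Hedged examples of the super/subscript rules in `parseAtom` above, again assuming the standalone `katex` package:

```javascript
const katex = require("katex");

// Primes and a following ^ merge into a single superscript ordgroup:
katex.renderToString("x''^2");

try { katex.renderToString("x^2^3"); } // two explicit superscripts
catch (e) { console.log(e.message); }  // "... Double superscript ..."

// With throwOnError: false, an unsupported command is rendered as colored
// text instead of throwing (the handleUnsupportedCmd path above):
const html = katex.renderToString("\\nosuchcmd", {
  throwOnError: false,
  errorColor: "#cc0000",
});
console.log(html.includes("#cc0000")); // true: raw text wrapped in a color node
```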
- if (this.nextToken.text === "^") { - primes.push(this.handleSupSubscript("superscript")); - } - // Put everything into an ordgroup as the superscript - superscript = new _ParseNode2.default("ordgroup", primes, this.mode); - } else { - // If it wasn't ^, _, or ', stop parsing super/subscripts - break; - } - } - - if (superscript || subscript) { - // If we got either a superscript or subscript, create a supsub - return new _ParseNode2.default("supsub", { - base: base, - sup: superscript, - sub: subscript - }, this.mode); - } else { - // Otherwise return the original body - return base; - } - } - - // A list of the size-changing functions, for use in parseImplicitGroup - - - // A list of the style-changing functions, for use in parseImplicitGroup - - - // Old font functions - - }, { - key: "parseImplicitGroup", - - - /** - * Parses an implicit group, which is a group that starts at the end of a - * specified, and ends right before a higher explicit group ends, or at EOL. It - * is used for functions that appear to affect the current style, like \Large or - * \textrm, where instead of keeping a style we just pretend that there is an - * implicit grouping after it until the end of the group. E.g. - * small text {\Large large text} small text again - * It is also used for \left and \right to get the correct grouping. - * - * @return {?ParseNode} - */ - value: function parseImplicitGroup() { - var start = this.parseSymbol(); - - if (start == null) { - // If we didn't get anything we handle, fall back to parseFunction - return this.parseFunction(); - } - - var func = start.result; - - if (func === "\\left") { - // If we see a left: - // Parse the entire left function (including the delimiter) - var left = this.parseFunction(start); - // Parse out the implicit body - ++this.leftrightDepth; - var body = this.parseExpression(false); - --this.leftrightDepth; - // Check the next token - this.expect("\\right", false); - var right = this.parseFunction(); - return new _ParseNode2.default("leftright", { - body: body, - left: left.value.value, - right: right.value.value - }, this.mode); - } else if (func === "\\begin") { - // begin...end is similar to left...right - var begin = this.parseFunction(start); - var envName = begin.value.name; - if (!_environments2.default.hasOwnProperty(envName)) { - throw new _ParseError2.default("No such environment: " + envName, begin.value.nameGroup); - } - // Build the environment object. Arguments and other information will - // be made available to the begin and end methods using properties. 
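Two hedged examples of the implicit groups handled above (assuming the `katex` package): `\left...\right` groups its whole body, and `\begin`/`\end` names must match.

```javascript
const katex = require("katex");

// \left...\right form one implicit group, so ^2 applies to the whole thing:
katex.renderToString("\\left(\\frac{x}{y}\\right)^2");

// Mismatched environment names fail with the error constructed above:
try { katex.renderToString("\\begin{matrix} a & b \\end{pmatrix}"); }
catch (e) { console.log(e.message); }
// "... Mismatch: \begin{matrix} matched by \end{pmatrix} ..."
```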
- var env = _environments2.default[envName]; - var args = this.parseArguments("\\begin{" + envName + "}", env); - var context = { - mode: this.mode, - envName: envName, - parser: this, - positions: args.pop() - }; - var result = env.handler(context, args); - this.expect("\\end", false); - var endNameToken = this.nextToken; - var end = this.parseFunction(); - if (end.value.name !== envName) { - throw new _ParseError2.default("Mismatch: \\begin{" + envName + "} matched " + "by \\end{" + end.value.name + "}", endNameToken); - } - result.position = end.position; - return result; - } else if (_utils2.default.contains(Parser.sizeFuncs, func)) { - // If we see a sizing function, parse out the implicit body - this.consumeSpaces(); - var _body = this.parseExpression(false); - return new _ParseNode2.default("sizing", { - // Figure out what size to use based on the list of functions above - size: _utils2.default.indexOf(Parser.sizeFuncs, func) + 1, - value: _body - }, this.mode); - } else if (_utils2.default.contains(Parser.styleFuncs, func)) { - // If we see a styling function, parse out the implicit body - this.consumeSpaces(); - var _body2 = this.parseExpression(true); - return new _ParseNode2.default("styling", { - // Figure out what style to use by pulling out the style from - // the function name - style: func.slice(1, func.length - 5), - value: _body2 - }, this.mode); - } else if (func in Parser.oldFontFuncs) { - var style = Parser.oldFontFuncs[func]; - // If we see an old font function, parse out the implicit body - this.consumeSpaces(); - var _body3 = this.parseExpression(true); - if (style.slice(0, 4) === 'text') { - return new _ParseNode2.default("text", { - style: style, - body: new _ParseNode2.default("ordgroup", _body3, this.mode) - }, this.mode); - } else { - return new _ParseNode2.default("font", { - font: style, - body: new _ParseNode2.default("ordgroup", _body3, this.mode) - }, this.mode); - } - } else if (func === "\\color") { - // If we see a styling function, parse out the implicit body - var color = this.parseColorGroup(false); - if (!color) { - throw new _ParseError2.default("\\color not followed by color"); - } - var _body4 = this.parseExpression(true); - return new _ParseNode2.default("color", { - type: "color", - color: color.result.value, - value: _body4 - }, this.mode); - } else if (func === "$") { - if (this.mode === "math") { - throw new _ParseError2.default("$ within math mode"); - } - this.consume(); - var outerMode = this.mode; - this.switchMode("math"); - var _body5 = this.parseExpression(false, "$"); - this.expect("$", true); - this.switchMode(outerMode); - return new _ParseNode2.default("styling", { - style: "text", - value: _body5 - }, "math"); - } else { - // Defer to parseFunction if it's not a function we handle - return this.parseFunction(start); - } - } - - /** - * Parses an entire function, including its base and all of its arguments. - * The base might either have been parsed already, in which case - * it is provided as an argument, or it's the next group in the input. 
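Hedged examples for the remaining implicit-group cases above (assuming the `katex` package):

```javascript
const katex = require("katex");

// Old-style font switches apply to the rest of their group:
katex.renderToString("a {\\bf b c} d"); // only "b c" is bold

// \color consumes a color group, then colors the rest of the group:
katex.renderToString("\\color{red} x + y");

// With colorIsTextColor: true, \color is remapped to \textcolor (see the
// Parser constructor above) and takes the colored content as an argument:
katex.renderToString("\\color{red}{x} + y", { colorIsTextColor: true });

// Inside \text, "$" switches back into math mode:
katex.renderToString("\\text{rate $x^2$ per unit}");
```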
- * - * @param {ParseFuncOrArgument=} baseGroup optional as described above - * @return {?ParseNode} - */ - - }, { - key: "parseFunction", - value: function parseFunction(baseGroup) { - if (!baseGroup) { - baseGroup = this.parseGroup(); - } - - if (baseGroup) { - if (baseGroup.isFunction) { - var func = baseGroup.result; - var funcData = _functions2.default[func]; - if (this.mode === "text" && !funcData.allowedInText) { - throw new _ParseError2.default("Can't use function '" + func + "' in text mode", baseGroup.token); - } else if (this.mode === "math" && funcData.allowedInMath === false) { - throw new _ParseError2.default("Can't use function '" + func + "' in math mode", baseGroup.token); - } - - var args = this.parseArguments(func, funcData); - var token = baseGroup.token; - var result = this.callFunction(func, args, args.pop(), token); - return new _ParseNode2.default(result.type, result, this.mode); - } else { - return baseGroup.result; - } - } else { - return null; - } - } - - /** - * Call a function handler with a suitable context and arguments. - */ - - }, { - key: "callFunction", - value: function callFunction(name, args, positions, token) { - var context = { - funcName: name, - parser: this, - positions: positions, - token: token - }; - return _functions2.default[name].handler(context, args); - } - - /** - * Parses the arguments of a function or environment - * - * @param {string} func "\name" or "\begin{name}" - * @param {{numArgs:number,numOptionalArgs:number|undefined}} funcData - * @return the array of arguments, with the list of positions as last element - */ - - }, { - key: "parseArguments", - value: function parseArguments(func, funcData) { - var totalArgs = funcData.numArgs + funcData.numOptionalArgs; - if (totalArgs === 0) { - return [[this.pos]]; - } - - var baseGreediness = funcData.greediness; - var positions = [this.pos]; - var args = []; - - for (var i = 0; i < totalArgs; i++) { - var nextToken = this.nextToken; - var argType = funcData.argTypes && funcData.argTypes[i]; - var arg = void 0; - if (i < funcData.numOptionalArgs) { - if (argType) { - arg = this.parseGroupOfType(argType, true); - } else { - arg = this.parseGroup(true); - } - if (!arg) { - args.push(null); - positions.push(this.pos); - continue; - } - } else { - if (argType) { - arg = this.parseGroupOfType(argType); - } else { - arg = this.parseGroup(); - } - if (!arg) { - if (!this.settings.throwOnError && this.nextToken.text[0] === "\\") { - arg = new ParseFuncOrArgument(this.handleUnsupportedCmd(this.nextToken.text), false); - } else { - throw new _ParseError2.default("Expected group after '" + func + "'", nextToken); - } - } - } - var argNode = void 0; - if (arg.isFunction) { - var argGreediness = _functions2.default[arg.result].greediness; - if (argGreediness > baseGreediness) { - argNode = this.parseFunction(arg); - } else { - throw new _ParseError2.default("Got function '" + arg.result + "' as " + "argument to '" + func + "'", nextToken); - } - } else { - argNode = arg.result; - } - args.push(argNode); - positions.push(this.pos); - } - - args.push(positions); - - return args; - } - - /** - * Parses a group when the mode is changing. 
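The greediness comparison above is what lets some functions appear as bare arguments. A hedged demo, assuming the `katex` package and the usual greediness assignments (`^`/`_` at 1, per `SUPSUB_GREEDINESS` below; `\frac` higher, though its exact value lives in functions.js, which is not shown here):

```javascript
const katex = require("katex");

// \frac's greediness exceeds that of ^, so it can be a brace-free superscript:
katex.renderToString("x^\\frac{1}{2}");

// Two \frac's have equal greediness, so one cannot be the other's bare argument:
try { katex.renderToString("\\frac \\frac{1}{2} 3"); }
catch (e) { console.log(e.message); } // "... Got function '\frac' ... as argument ..."

// Braces always work, since a braced group is an ordinary argument:
katex.renderToString("\\frac{\\frac{1}{2}}{3}");
```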
- * - * @return {?ParseFuncOrArgument} - */ - - }, { - key: "parseGroupOfType", - value: function parseGroupOfType(innerMode, optional) { - var outerMode = this.mode; - // Handle `original` argTypes - if (innerMode === "original") { - innerMode = outerMode; - } - - if (innerMode === "color") { - return this.parseColorGroup(optional); - } - if (innerMode === "size") { - return this.parseSizeGroup(optional); - } - - this.switchMode(innerMode); - if (innerMode === "text") { - // text mode is special because it should ignore the whitespace before - // it - this.consumeSpaces(); - } - // By the time we get here, innerMode is one of "text" or "math". - // We switch the mode of the parser, recurse, then restore the old mode. - var res = this.parseGroup(optional); - this.switchMode(outerMode); - return res; - } - }, { - key: "consumeSpaces", - value: function consumeSpaces() { - while (this.nextToken.text === " ") { - this.consume(); - } - } - - /** - * Parses a group, essentially returning the string formed by the - * brace-enclosed tokens plus some position information. - * - * @param {string} modeName Used to describe the mode in error messages - * @param {boolean=} optional Whether the group is optional or required - */ - - }, { - key: "parseStringGroup", - value: function parseStringGroup(modeName, optional) { - if (optional && this.nextToken.text !== "[") { - return null; - } - var outerMode = this.mode; - this.mode = "text"; - this.expect(optional ? "[" : "{"); - var str = ""; - var firstToken = this.nextToken; - var lastToken = firstToken; - while (this.nextToken.text !== (optional ? "]" : "}")) { - if (this.nextToken.text === "EOF") { - throw new _ParseError2.default("Unexpected end of input in " + modeName, firstToken.range(this.nextToken, str)); - } - lastToken = this.nextToken; - str += lastToken.text; - this.consume(); - } - this.mode = outerMode; - this.expect(optional ? "]" : "}"); - return firstToken.range(lastToken, str); - } - - /** - * Parses a regex-delimited group: the largest sequence of tokens - * whose concatenated strings match `regex`. Returns the string - * formed by the tokens plus some position information. - * - * @param {RegExp} regex - * @param {string} modeName Used to describe the mode in error messages - */ - - }, { - key: "parseRegexGroup", - value: function parseRegexGroup(regex, modeName) { - var outerMode = this.mode; - this.mode = "text"; - var firstToken = this.nextToken; - var lastToken = firstToken; - var str = ""; - while (this.nextToken.text !== "EOF" && regex.test(str + this.nextToken.text)) { - lastToken = this.nextToken; - str += lastToken.text; - this.consume(); - } - if (str === "") { - throw new _ParseError2.default("Invalid " + modeName + ": '" + firstToken.text + "'", firstToken); - } - this.mode = outerMode; - return firstToken.range(lastToken, str); - } - - /** - * Parses a color description. - */ - - }, { - key: "parseColorGroup", - value: function parseColorGroup(optional) { - var res = this.parseStringGroup("color", optional); - if (!res) { - return null; - } - var match = /^(#[a-z0-9]+|[a-z]+)$/i.exec(res.text); - if (!match) { - throw new _ParseError2.default("Invalid color: '" + res.text + "'", res); - } - return new ParseFuncOrArgument(new _ParseNode2.default("color", match[0], this.mode), false); - } - - /** - * Parses a size specification, consisting of magnitude and unit. 
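The color validation regex above, and the size regex in `parseSizeGroup` just below, can be exercised standalone; this sketch copies them verbatim:

```javascript
// Regexes copied from parseColorGroup and parseSizeGroup above/below.
const colorRe = /^(#[a-z0-9]+|[a-z]+)$/i;
const sizeRe = /([-+]?) *(\d+(?:\.\d*)?|\.\d+) *([a-z]{2})/;

console.log(colorRe.test("#ff00af"));    // true
console.log(colorRe.test("blue"));       // true
console.log(colorRe.test("rgb(1,2,3)")); // false -> "Invalid color"

const m = sizeRe.exec("-1.5 em");
console.log({ number: +(m[1] + m[2]), unit: m[3] }); // { number: -1.5, unit: "em" }
```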
- */
-
-   }, {
-     key: "parseSizeGroup",
-     value: function parseSizeGroup(optional) {
-       var res = void 0;
-       if (!optional && this.nextToken.text !== "{") {
-         res = this.parseRegexGroup(/^[-+]? *(?:$|\d+|\d+\.\d*|\.\d*) *[a-z]{0,2} *$/, "size");
-       } else {
-         res = this.parseStringGroup("size", optional);
-       }
-       if (!res) {
-         return null;
-       }
-       var match = /([-+]?) *(\d+(?:\.\d*)?|\.\d+) *([a-z]{2})/.exec(res.text);
-       if (!match) {
-         throw new _ParseError2.default("Invalid size: '" + res.text + "'", res);
-       }
-       var data = {
-         number: +(match[1] + match[2]), // sign + magnitude, cast to number
-         unit: match[3]
-       };
-       if (!_units2.default.validUnit(data)) {
-         throw new _ParseError2.default("Invalid unit: '" + data.unit + "'", res);
-       }
-       return new ParseFuncOrArgument(new _ParseNode2.default("size", data, this.mode), false);
-     }
-
-     /**
-      * If the argument is false or absent, this parses an ordinary group,
-      * which is either a single nucleus (like "x") or an expression
-      * in braces (like "{x+y}").
-      * If the argument is true, it parses either a bracket-delimited expression
-      * (like "[x+y]") or returns null to indicate the absence of a
-      * bracket-enclosed group.
-      *
-      * @param {boolean=} optional Whether the group is optional or required
-      * @return {?ParseFuncOrArgument}
-      */
-
-   }, {
-     key: "parseGroup",
-     value: function parseGroup(optional) {
-       var firstToken = this.nextToken;
-       // Try to parse an open brace
-       if (this.nextToken.text === (optional ? "[" : "{")) {
-         // If we get a brace, parse an expression
-         this.consume();
-         var expression = this.parseExpression(false, optional ? "]" : null);
-         var lastToken = this.nextToken;
-         // Make sure we get a close brace
-         this.expect(optional ? "]" : "}");
-         if (this.mode === "text") {
-           this.formLigatures(expression);
-         }
-         return new ParseFuncOrArgument(new _ParseNode2.default("ordgroup", expression, this.mode, firstToken, lastToken), false);
-       } else {
-         // Otherwise, just return a nucleus, or nothing for an optional group
-         return optional ? null : this.parseSymbol();
-       }
-     }
-
-     /**
-      * Form ligature-like combinations of characters for text mode.
-      * This includes inputs like "--", "---", "``" and "''".
-      * The result will simply replace multiple textord nodes with a single
-      * character in each value by a single textord node having multiple
-      * characters in its value. The representation is still ASCII source.
-      *
-      * @param {Array.<ParseNode>} group the nodes of this group,
-      *                            list will be modified in place
-      */
-
-   }, {
-     key: "formLigatures",
-     value: function formLigatures(group) {
-       var n = group.length - 1;
-       for (var i = 0; i < n; ++i) {
-         var a = group[i];
-         var v = a.value;
-         if (v === "-" && group[i + 1].value === "-") {
-           if (i + 1 < n && group[i + 2].value === "-") {
-             group.splice(i, 3, new _ParseNode2.default("textord", "---", "text", a, group[i + 2]));
-             n -= 2;
-           } else {
-             group.splice(i, 2, new _ParseNode2.default("textord", "--", "text", a, group[i + 1]));
-             n -= 1;
-           }
-         }
-         if ((v === "'" || v === "`") && group[i + 1].value === v) {
-           group.splice(i, 2, new _ParseNode2.default("textord", v + v, "text", a, group[i + 1]));
-           n -= 1;
-         }
-       }
-     }
-
-     /**
-      * Parse a single symbol out of the string. 
Here, we handle both the functions - * we have defined, as well as the single character symbols - * - * @return {?ParseFuncOrArgument} - */ - - }, { - key: "parseSymbol", - value: function parseSymbol() { - var nucleus = this.nextToken; - - if (_functions2.default[nucleus.text]) { - this.consume(); - // If there exists a function with this name, we return the function and - // say that it is a function. - return new ParseFuncOrArgument(nucleus.text, true, nucleus); - } else if (_symbols2.default[this.mode][nucleus.text]) { - this.consume(); - // Otherwise if this is a no-argument function, find the type it - // corresponds to in the symbols map - return new ParseFuncOrArgument(new _ParseNode2.default(_symbols2.default[this.mode][nucleus.text].group, nucleus.text, this.mode, nucleus), false, nucleus); - } else if (this.mode === "text" && _unicodeRegexes.cjkRegex.test(nucleus.text)) { - this.consume(); - return new ParseFuncOrArgument(new _ParseNode2.default("textord", nucleus.text, this.mode, nucleus), false, nucleus); - } else if (nucleus.text === "$") { - return new ParseFuncOrArgument(nucleus.text, false, nucleus); - } else { - return null; - } - } - }]); - return Parser; - }(); - - Parser.endOfExpression = ["}", "\\end", "\\right", "&", "\\\\", "\\cr"]; - Parser.SUPSUB_GREEDINESS = 1; - Parser.sizeFuncs = ["\\tiny", "\\sixptsize", "\\scriptsize", "\\footnotesize", "\\small", "\\normalsize", "\\large", "\\Large", "\\LARGE", "\\huge", "\\Huge"]; - Parser.styleFuncs = ["\\displaystyle", "\\textstyle", "\\scriptstyle", "\\scriptscriptstyle"]; - Parser.oldFontFuncs = { - "\\rm": "mathrm", - "\\sf": "mathsf", - "\\tt": "mathtt", - "\\bf": "mathbf", - "\\it": "mathit" - }; - - - Parser.prototype.ParseNode = _ParseNode2.default; - - module.exports = Parser; - - },{"./MacroExpander":27,"./ParseError":29,"./ParseNode":30,"./environments":40,"./functions":43,"./symbols":48,"./unicodeRegexes":49,"./units":50,"./utils":51,"babel-runtime/helpers/classCallCheck":4,"babel-runtime/helpers/createClass":5}],32:[function(require,module,exports){ - - var _classCallCheck2 = require("babel-runtime/helpers/classCallCheck"); - - var _classCallCheck3 = _interopRequireDefault(_classCallCheck2); - - var _utils = require("./utils"); - - var _utils2 = _interopRequireDefault(_utils); - - function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } - - /** - * The main Settings object - * - * The current options stored are: - * - displayMode: Whether the expression should be typeset as inline math - * (false, the default), meaning that the math starts in - * \textstyle and is placed in an inline-block); or as display - * math (true), meaning that the math starts in \displaystyle - * and is placed in a block with vertical margin. - */ - var Settings = function Settings(options) { - (0, _classCallCheck3.default)(this, Settings); - - // allow null options - options = options || {}; - this.displayMode = _utils2.default.deflt(options.displayMode, false); - this.throwOnError = _utils2.default.deflt(options.throwOnError, true); - this.errorColor = _utils2.default.deflt(options.errorColor, "#cc0000"); - this.macros = options.macros || {}; - this.colorIsTextColor = _utils2.default.deflt(options.colorIsTextColor, false); - }; /** - * This is a module for storing settings passed into KaTeX. It correctly handles - * default settings. 
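These Settings fields are exactly the options object accepted by the public entry points. A hedged usage example, assuming the standalone `katex` package exposes the same Settings handling:

```javascript
const katex = require("katex");

const html = katex.renderToString("\\grad f = \\nabla f", {
  displayMode: true,               // typeset in \displaystyle, as a block
  throwOnError: false,             // render bad input in errorColor, don't throw
  errorColor: "#cc0000",
  macros: { "\\grad": "\\nabla" }, // user macros feed the MacroExpander
  colorIsTextColor: false,         // keep the modern \color behavior
});
```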
- */ - - module.exports = Settings; - - },{"./utils":51,"babel-runtime/helpers/classCallCheck":4}],33:[function(require,module,exports){ - - var _classCallCheck2 = require("babel-runtime/helpers/classCallCheck"); - - var _classCallCheck3 = _interopRequireDefault(_classCallCheck2); - - var _createClass2 = require("babel-runtime/helpers/createClass"); - - var _createClass3 = _interopRequireDefault(_createClass2); - - function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } - - /** - * This file contains information and classes for the various kinds of styles - * used in TeX. It provides a generic `Style` class, which holds information - * about a specific style. It then provides instances of all the different kinds - * of styles possible, and provides functions to move between them and get - * information about them. - */ - - /** - * The main style class. Contains a unique id for the style, a size (which is - * the same for cramped and uncramped version of a style), and a cramped flag. - */ - var Style = function () { - function Style(id, size, cramped) { - (0, _classCallCheck3.default)(this, Style); - - this.id = id; - this.size = size; - this.cramped = cramped; - } - - /** - * Get the style of a superscript given a base in the current style. - */ - - - (0, _createClass3.default)(Style, [{ - key: "sup", - value: function sup() { - return styles[_sup[this.id]]; - } - - /** - * Get the style of a subscript given a base in the current style. - */ - - }, { - key: "sub", - value: function sub() { - return styles[_sub[this.id]]; - } - - /** - * Get the style of a fraction numerator given the fraction in the current - * style. - */ - - }, { - key: "fracNum", - value: function fracNum() { - return styles[_fracNum[this.id]]; - } - - /** - * Get the style of a fraction denominator given the fraction in the current - * style. - */ - - }, { - key: "fracDen", - value: function fracDen() { - return styles[_fracDen[this.id]]; - } - - /** - * Get the cramped version of a style (in particular, cramping a cramped style - * doesn't change the style). - */ - - }, { - key: "cramp", - value: function cramp() { - return styles[_cramp[this.id]]; - } - - /** - * Get a text or display version of this style. - */ - - }, { - key: "text", - value: function text() { - return styles[_text[this.id]]; - } - - /** - * Return if this style is tightly spaced (scriptstyle/scriptscriptstyle) - */ - - }, { - key: "isTight", - value: function isTight() { - return this.size >= 2; - } - }]); - return Style; - }(); - - // IDs of the different styles - - - var D = 0; - var Dc = 1; - var T = 2; - var Tc = 3; - var S = 4; - var Sc = 5; - var SS = 6; - var SSc = 7; - - // Instances of the different styles - var styles = [new Style(D, 0, false), new Style(Dc, 0, true), new Style(T, 1, false), new Style(Tc, 1, true), new Style(S, 2, false), new Style(Sc, 2, true), new Style(SS, 3, false), new Style(SSc, 3, true)]; - - // Lookup tables for switching from one style to another - var _sup = [S, Sc, S, Sc, SS, SSc, SS, SSc]; - var _sub = [Sc, Sc, Sc, Sc, SSc, SSc, SSc, SSc]; - var _fracNum = [T, Tc, S, Sc, SS, SSc, SS, SSc]; - var _fracDen = [Tc, Tc, Sc, Sc, SSc, SSc, SSc, SSc]; - var _cramp = [Dc, Dc, Tc, Tc, Sc, Sc, SSc, SSc]; - var _text = [D, Dc, T, Tc, T, Tc, T, Tc]; - - // We only export some of the styles. Also, we don't export the `Style` class so - // no more styles can be generated. 
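The lookup tables above fully determine TeX's style transitions. A standalone sketch (data copied from the tables, not an exported API) that answers the usual questions about them:

```javascript
// IDs and transition tables copied from the Style module above.
const D = 0, Dc = 1, T = 2, Tc = 3, S = 4, Sc = 5, SS = 6, SSc = 7;
const names = ["D", "D'", "T", "T'", "S", "S'", "SS", "SS'"]; // ' = cramped
const sup = [S, Sc, S, Sc, SS, SSc, SS, SSc];
const sub = [Sc, Sc, Sc, Sc, SSc, SSc, SSc, SSc];
const cramp = [Dc, Dc, Tc, Tc, Sc, Sc, SSc, SSc];

console.log(names[sup[D]]);    // "S"  superscript of display -> script
console.log(names[sub[D]]);    // "S'" subscript of display -> cramped script
console.log(names[cramp[S]]);  // "S'" cramping script
console.log(names[cramp[Sc]]); // "S'" cramping is idempotent
```

Subscripts always land in a cramped style, which is why the `_sub` row contains only cramped IDs.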
- module.exports = { - DISPLAY: styles[D], - TEXT: styles[T], - SCRIPT: styles[S], - SCRIPTSCRIPT: styles[SS] - }; - - },{"babel-runtime/helpers/classCallCheck":4,"babel-runtime/helpers/createClass":5}],34:[function(require,module,exports){ - - var _domTree = require("./domTree"); - - var _domTree2 = _interopRequireDefault(_domTree); - - var _fontMetrics = require("./fontMetrics"); - - var _fontMetrics2 = _interopRequireDefault(_fontMetrics); - - var _symbols = require("./symbols"); - - var _symbols2 = _interopRequireDefault(_symbols); - - var _utils = require("./utils"); - - var _utils2 = _interopRequireDefault(_utils); - - function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } - - // The following have to be loaded from Main-Italic font, using class mainit - /* eslint no-console:0 */ - /** - * This module contains general functions that can be used for building - * different kinds of domTree nodes in a consistent manner. - */ - - var mainitLetters = ["\\imath", // dotless i - "\\jmath", // dotless j - "\\pounds"]; - - /** - * Looks up the given symbol in fontMetrics, after applying any symbol - * replacements defined in symbol.js - */ - var lookupSymbol = function lookupSymbol(value, fontFamily, mode) { - // Replace the value with its replaced value from symbol.js - if (_symbols2.default[mode][value] && _symbols2.default[mode][value].replace) { - value = _symbols2.default[mode][value].replace; - } - return { - value: value, - metrics: _fontMetrics2.default.getCharacterMetrics(value, fontFamily) - }; - }; - - /** - * Makes a symbolNode after translation via the list of symbols in symbols.js. - * Correctly pulls out metrics for the character, and optionally takes a list of - * classes to be attached to the node. - * - * TODO: make argument order closer to makeSpan - * TODO: add a separate argument for math class (e.g. `mop`, `mbin`), which - * should if present come first in `classes`. - */ - var makeSymbol = function makeSymbol(value, fontFamily, mode, options, classes) { - var lookup = lookupSymbol(value, fontFamily, mode); - var metrics = lookup.metrics; - value = lookup.value; - - var symbolNode = void 0; - if (metrics) { - var italic = metrics.italic; - if (mode === "text") { - italic = 0; - } - symbolNode = new _domTree2.default.symbolNode(value, metrics.height, metrics.depth, italic, metrics.skew, classes); - } else { - // TODO(emily): Figure out a good way to only print this in development - typeof console !== "undefined" && console.warn("No character metrics for '" + value + "' in style '" + fontFamily + "'"); - symbolNode = new _domTree2.default.symbolNode(value, 0, 0, 0, 0, classes); - } - - if (options) { - symbolNode.maxFontSize = options.sizeMultiplier; - if (options.style.isTight()) { - symbolNode.classes.push("mtight"); - } - if (options.getColor()) { - symbolNode.style.color = options.getColor(); - } - } - - return symbolNode; - }; - - /** - * Makes a symbol in Main-Regular or AMS-Regular. - * Used for rel, bin, open, close, inner, and punct. - */ - var mathsym = function mathsym(value, mode, options, classes) { - // Decide what font to render the symbol in by its entry in the symbols - // table. 
- // Have a special case for when the value = \ because the \ is used as a - // textord in unsupported command errors but cannot be parsed as a regular - // text ordinal and is therefore not present as a symbol in the symbols - // table for text - if (value === "\\" || _symbols2.default[mode][value].font === "main") { - return makeSymbol(value, "Main-Regular", mode, options, classes); - } else { - return makeSymbol(value, "AMS-Regular", mode, options, classes.concat(["amsrm"])); - } - }; - - /** - * Makes a symbol in the default font for mathords and textords. - */ - var mathDefault = function mathDefault(value, mode, options, classes, type) { - if (type === "mathord") { - var fontLookup = mathit(value); - return makeSymbol(value, fontLookup.fontName, mode, options, classes.concat([fontLookup.fontClass])); - } else if (type === "textord") { - var font = _symbols2.default[mode][value] && _symbols2.default[mode][value].font; - if (font === "ams") { - return makeSymbol(value, "AMS-Regular", mode, options, classes.concat(["amsrm"])); - } else { - // if (font === "main") { - return makeSymbol(value, "Main-Regular", mode, options, classes.concat(["mathrm"])); - } - } else { - throw new Error("unexpected type: " + type + " in mathDefault"); - } - }; - - /** - * Determines which of the two font names (Main-Italic and Math-Italic) and - * corresponding style tags (mainit or mathit) to use for font "mathit", - * depending on the symbol. Use this function instead of fontMap for font - * "mathit". - */ - var mathit = function mathit(value, mode, options, classes) { - if (/[0-9]/.test(value.charAt(0)) || - // glyphs for \imath and \jmath do not exist in Math-Italic so we - // need to use Main-Italic instead - _utils2.default.contains(mainitLetters, value)) { - return { - fontName: "Main-Italic", - fontClass: "mainit" - }; - } else { - return { - fontName: "Math-Italic", - fontClass: "mathit" - }; - } - }; - - /** - * Makes either a mathord or textord in the correct font and color. - */ - var makeOrd = function makeOrd(group, options, type) { - var mode = group.mode; - var value = group.value; - - var classes = ["mord"]; - - var font = options.font; - if (font) { - var fontLookup = void 0; - if (font === "mathit" || _utils2.default.contains(mainitLetters, value)) { - fontLookup = mathit(value); - } else { - fontLookup = fontMap[font]; - } - if (lookupSymbol(value, fontLookup.fontName, mode).metrics) { - return makeSymbol(value, fontLookup.fontName, mode, options, classes.concat([fontLookup.fontClass || font])); - } else { - return mathDefault(value, mode, options, classes, type); - } - } else { - return mathDefault(value, mode, options, classes, type); - } - }; - - /** - * Calculate the height, depth, and maxFontSize of an element based on its - * children. - */ - var sizeElementFromChildren = function sizeElementFromChildren(elem) { - var height = 0; - var depth = 0; - var maxFontSize = 0; - - if (elem.children) { - for (var i = 0; i < elem.children.length; i++) { - if (elem.children[i].height > height) { - height = elem.children[i].height; - } - if (elem.children[i].depth > depth) { - depth = elem.children[i].depth; - } - if (elem.children[i].maxFontSize > maxFontSize) { - maxFontSize = elem.children[i].maxFontSize; - } - } - } - - elem.height = height; - elem.depth = depth; - elem.maxFontSize = maxFontSize; - }; - - /** - * Makes a span with the given list of classes, list of children, and options. - * - * TODO: Ensure that `options` is always provided (currently some call sites - * don't pass it). 
- * TODO: add a separate argument for math class (e.g. `mop`, `mbin`), which - * should if present come first in `classes`. - */ - var makeSpan = function makeSpan(classes, children, options) { - var span = new _domTree2.default.span(classes, children, options); - - sizeElementFromChildren(span); - - return span; - }; - - /** - * Prepends the given children to the given span, updating height, depth, and - * maxFontSize. - */ - var prependChildren = function prependChildren(span, children) { - span.children = children.concat(span.children); - - sizeElementFromChildren(span); - }; - - /** - * Makes a document fragment with the given list of children. - */ - var makeFragment = function makeFragment(children) { - var fragment = new _domTree2.default.documentFragment(children); - - sizeElementFromChildren(fragment); - - return fragment; - }; - - /** - * Makes a vertical list by stacking elements and kerns on top of each other. - * Allows for many different ways of specifying the positioning method. - * - * Arguments: - * - children: A list of child or kern nodes to be stacked on top of each other - * (i.e. the first element will be at the bottom, and the last at - * the top). Element nodes are specified as - * {type: "elem", elem: node} - * while kern nodes are specified as - * {type: "kern", size: size} - * - positionType: The method by which the vlist should be positioned. Valid - * values are: - * - "individualShift": The children list only contains elem - * nodes, and each node contains an extra - * "shift" value of how much it should be - * shifted (note that shifting is always - * moving downwards). positionData is - * ignored. - * - "top": The positionData specifies the topmost point of - * the vlist (note this is expected to be a height, - * so positive values move up) - * - "bottom": The positionData specifies the bottommost point - * of the vlist (note this is expected to be a - * depth, so positive values move down - * - "shift": The vlist will be positioned such that its - * baseline is positionData away from the baseline - * of the first child. Positive values move - * downwards. - * - "firstBaseline": The vlist will be positioned such that - * its baseline is aligned with the - * baseline of the first child. - * positionData is ignored. 
(this is - * equivalent to "shift" with - * positionData=0) - * - positionData: Data used in different ways depending on positionType - * - options: An Options object - * - */ - var makeVList = function makeVList(children, positionType, positionData, options) { - var depth = void 0; - var currPos = void 0; - var i = void 0; - if (positionType === "individualShift") { - var oldChildren = children; - children = [oldChildren[0]]; - - // Add in kerns to the list of children to get each element to be - // shifted to the correct specified shift - depth = -oldChildren[0].shift - oldChildren[0].elem.depth; - currPos = depth; - for (i = 1; i < oldChildren.length; i++) { - var diff = -oldChildren[i].shift - currPos - oldChildren[i].elem.depth; - var size = diff - (oldChildren[i - 1].elem.height + oldChildren[i - 1].elem.depth); - - currPos = currPos + diff; - - children.push({ type: "kern", size: size }); - children.push(oldChildren[i]); - } - } else if (positionType === "top") { - // We always start at the bottom, so calculate the bottom by adding up - // all the sizes - var bottom = positionData; - for (i = 0; i < children.length; i++) { - if (children[i].type === "kern") { - bottom -= children[i].size; - } else { - bottom -= children[i].elem.height + children[i].elem.depth; - } - } - depth = bottom; - } else if (positionType === "bottom") { - depth = -positionData; - } else if (positionType === "shift") { - depth = -children[0].elem.depth - positionData; - } else if (positionType === "firstBaseline") { - depth = -children[0].elem.depth; - } else { - depth = 0; - } - - // Create a strut that is taller than any list item. The strut is added to - // each item, where it will determine the item's baseline. Since it has - // `overflow:hidden`, the strut's top edge will sit on the item's line box's - // top edge and the strut's bottom edge will sit on the item's baseline, - // with no additional line-height spacing. This allows the item baseline to - // be positioned precisely without worrying about font ascent and - // line-height. - var pstrutSize = 0; - for (i = 0; i < children.length; i++) { - if (children[i].type === "elem") { - var child = children[i].elem; - pstrutSize = Math.max(pstrutSize, child.maxFontSize, child.height); - } - } - pstrutSize += 2; - var pstrut = makeSpan(["pstrut"], []); - pstrut.style.height = pstrutSize + "em"; - - // Create a new list of actual children at the correct offsets - var realChildren = []; - var minPos = depth; - var maxPos = depth; - currPos = depth; - for (i = 0; i < children.length; i++) { - if (children[i].type === "kern") { - currPos += children[i].size; - } else { - var _child = children[i].elem; - - var childWrap = makeSpan([], [pstrut, _child]); - childWrap.style.top = -pstrutSize - currPos - _child.depth + "em"; - if (children[i].marginLeft) { - childWrap.style.marginLeft = children[i].marginLeft; - } - if (children[i].marginRight) { - childWrap.style.marginRight = children[i].marginRight; - } - - realChildren.push(childWrap); - currPos += _child.height + _child.depth; - } - minPos = Math.min(minPos, currPos); - maxPos = Math.max(maxPos, currPos); - } - - // The vlist contents go in a table-cell with `vertical-align:bottom`. - // This cell's bottom edge will determine the containing table's baseline - // without overly expanding the containing line-box. - var vlist = makeSpan(["vlist"], realChildren); - vlist.style.height = maxPos + "em"; - - // A second row is used if necessary to represent the vlist's depth. 
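The positioning loop above is just cumulative height/depth bookkeeping. A simplified standalone sketch (a hypothetical `stackHeights` helper that ignores struts and CSS) of the same arithmetic:

```javascript
// Bottom-up stacking: each element occupies height + depth, kerns add raw distance.
function stackHeights(children, bottomDepth) {
  let pos = bottomDepth; // distance of the current top edge above the baseline
  const tops = [];
  for (const c of children) {
    if (c.type === "kern") pos += c.size;
    else { pos += c.elem.height + c.elem.depth; tops.push(pos); }
  }
  return { totalHeight: pos, tops };
}

// Two 0.7em-tall boxes separated by a 0.2em kern, starting at depth 0:
const r = stackHeights(
  [{ type: "elem", elem: { height: 0.7, depth: 0 } },
   { type: "kern", size: 0.2 },
   { type: "elem", elem: { height: 0.7, depth: 0 } }],
  0
);
console.log(r.totalHeight); // 1.6 (em)
```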
-    var rows = void 0;
-    if (minPos < 0) {
-        var depthStrut = makeSpan(["vlist"], []);
-        depthStrut.style.height = -minPos + "em";
-
-        // Safari wants the first row to have inline content; otherwise it
-        // puts the bottom of the *second* row on the baseline.
-        var topStrut = makeSpan(["vlist-s"], [new _domTree2.default.symbolNode("\u200B")]);
-
-        rows = [makeSpan(["vlist-r"], [vlist, topStrut]), makeSpan(["vlist-r"], [depthStrut])];
-    } else {
-        rows = [makeSpan(["vlist-r"], [vlist])];
-    }
-
-    var vtable = makeSpan(["vlist-t"], rows);
-    if (rows.length === 2) {
-        vtable.classes.push("vlist-t2");
-    }
-    vtable.height = maxPos;
-    vtable.depth = -minPos;
-    return vtable;
-};
-
-// A map of spacing functions to their attributes, like size and corresponding
-// CSS class
-var spacingFunctions = {
-    "\\qquad": {
-        size: "2em",
-        className: "qquad"
-    },
-    "\\quad": {
-        size: "1em",
-        className: "quad"
-    },
-    "\\enspace": {
-        size: "0.5em",
-        className: "enspace"
-    },
-    "\\;": {
-        size: "0.277778em",
-        className: "thickspace"
-    },
-    "\\:": {
-        size: "0.22222em",
-        className: "mediumspace"
-    },
-    "\\,": {
-        size: "0.16667em",
-        className: "thinspace"
-    },
-    "\\!": {
-        size: "-0.16667em",
-        className: "negativethinspace"
-    }
-};
-
-/**
- * Maps TeX font commands to objects containing:
- * - variant: string used for "mathvariant" attribute in buildMathML.js
- * - fontName: the "style" parameter to fontMetrics.getCharacterMetrics
- */
-// A map between TeX font commands and MathML mathvariant attribute values
-var fontMap = {
-    // styles
-    "mathbf": {
-        variant: "bold",
-        fontName: "Main-Bold"
-    },
-    "mathrm": {
-        variant: "normal",
-        fontName: "Main-Regular"
-    },
-    "textit": {
-        variant: "italic",
-        fontName: "Main-Italic"
-    },
-
-    // "mathit" is missing because it requires the use of two fonts: Main-Italic
-    // and Math-Italic. This is handled by a special case in makeOrd which ends
-    // up calling mathit.
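-    // For example, \mathbf maps to variant "bold" (used for the MathML
-    // mathvariant attribute) and fontName "Main-Bold" (used to look up
-    // character metrics when building the HTML).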
- - // families - "mathbb": { - variant: "double-struck", - fontName: "AMS-Regular" - }, - "mathcal": { - variant: "script", - fontName: "Caligraphic-Regular" - }, - "mathfrak": { - variant: "fraktur", - fontName: "Fraktur-Regular" - }, - "mathscr": { - variant: "script", - fontName: "Script-Regular" - }, - "mathsf": { - variant: "sans-serif", - fontName: "SansSerif-Regular" - }, - "mathtt": { - variant: "monospace", - fontName: "Typewriter-Regular" - } - }; - - module.exports = { - fontMap: fontMap, - makeSymbol: makeSymbol, - mathsym: mathsym, - makeSpan: makeSpan, - makeFragment: makeFragment, - makeVList: makeVList, - makeOrd: makeOrd, - prependChildren: prependChildren, - spacingFunctions: spacingFunctions - }; - - },{"./domTree":39,"./fontMetrics":41,"./symbols":48,"./utils":51}],35:[function(require,module,exports){ - - var _stringify = require("babel-runtime/core-js/json/stringify"); - - var _stringify2 = _interopRequireDefault(_stringify); - - var _ParseError = require("./ParseError"); - - var _ParseError2 = _interopRequireDefault(_ParseError); - - var _Style = require("./Style"); - - var _Style2 = _interopRequireDefault(_Style); - - var _buildCommon = require("./buildCommon"); - - var _buildCommon2 = _interopRequireDefault(_buildCommon); - - var _delimiter = require("./delimiter"); - - var _delimiter2 = _interopRequireDefault(_delimiter); - - var _domTree = require("./domTree"); - - var _domTree2 = _interopRequireDefault(_domTree); - - var _units = require("./units"); - - var _units2 = _interopRequireDefault(_units); - - var _utils = require("./utils"); - - var _utils2 = _interopRequireDefault(_utils); - - var _stretchy = require("./stretchy"); - - var _stretchy2 = _interopRequireDefault(_stretchy); - - function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } - - /* eslint no-console:0 */ - /** - * This file does the main work of building a domTree structure from a parse - * tree. The entry point is the `buildHTML` function, which takes a parse tree. - * Then, the buildExpression, buildGroup, and various groupTypes functions are - * called, to produce a final HTML tree. - */ - - var isSpace = function isSpace(node) { - return node instanceof _domTree2.default.span && node.classes[0] === "mspace"; - }; - - // Binary atoms (first class `mbin`) change into ordinary atoms (`mord`) - // depending on their surroundings. See TeXbook pg. 442-446, Rules 5 and 6, - // and the text before Rule 19. - var isBin = function isBin(node) { - return node && node.classes[0] === "mbin"; - }; - - var isBinLeftCanceller = function isBinLeftCanceller(node, isRealGroup) { - // TODO: This code assumes that a node's math class is the first element - // of its `classes` array. A later cleanup should ensure this, for - // instance by changing the signature of `makeSpan`. - if (node) { - return _utils2.default.contains(["mbin", "mopen", "mrel", "mop", "mpunct"], node.classes[0]); - } else { - return isRealGroup; - } - }; - - var isBinRightCanceller = function isBinRightCanceller(node, isRealGroup) { - if (node) { - return _utils2.default.contains(["mrel", "mclose", "mpunct"], node.classes[0]); - } else { - return isRealGroup; - } - }; - - /** - * Splice out any spaces from `children` starting at position `i`, and return - * the spliced-out array. Returns null if `children[i]` does not exist or is not - * a space. 
- */ - var spliceSpaces = function spliceSpaces(children, i) { - var j = i; - while (j < children.length && isSpace(children[j])) { - j++; - } - if (j === i) { - return null; - } else { - return children.splice(i, j - i); - } - }; - - /** - * Take a list of nodes, build them in order, and return a list of the built - * nodes. documentFragments are flattened into their contents, so the - * returned list contains no fragments. `isRealGroup` is true if `expression` - * is a real group (no atoms will be added on either side), as opposed to - * a partial group (e.g. one created by \color). - */ - var buildExpression = function buildExpression(expression, options, isRealGroup) { - // Parse expressions into `groups`. - var groups = []; - for (var i = 0; i < expression.length; i++) { - var group = expression[i]; - var output = buildGroup(group, options); - if (output instanceof _domTree2.default.documentFragment) { - Array.prototype.push.apply(groups, output.children); - } else { - groups.push(output); - } - } - // At this point `groups` consists entirely of `symbolNode`s and `span`s. - - // Explicit spaces (e.g., \;, \,) should be ignored with respect to atom - // spacing (e.g., "add thick space between mord and mrel"). Since CSS - // adjacency rules implement atom spacing, spaces should be invisible to - // CSS. So we splice them out of `groups` and into the atoms themselves. - for (var _i = 0; _i < groups.length; _i++) { - var spaces = spliceSpaces(groups, _i); - if (spaces) { - // Splicing of spaces may have removed all remaining groups. - if (_i < groups.length) { - // If there is a following group, move space within it. - if (groups[_i] instanceof _domTree2.default.symbolNode) { - groups[_i] = (0, _buildCommon.makeSpan)([].concat(groups[_i].classes), [groups[_i]]); - } - _buildCommon2.default.prependChildren(groups[_i], spaces); - } else { - // Otherwise, put any spaces back at the end of the groups. - Array.prototype.push.apply(groups, spaces); - break; - } - } - } - - // Binary operators change to ordinary symbols in some contexts. - for (var _i2 = 0; _i2 < groups.length; _i2++) { - if (isBin(groups[_i2]) && (isBinLeftCanceller(groups[_i2 - 1], isRealGroup) || isBinRightCanceller(groups[_i2 + 1], isRealGroup))) { - groups[_i2].classes[0] = "mord"; - } - } - - // Process \\not commands within the group. - // TODO(kevinb): Handle multiple \\not commands in a row. - // TODO(kevinb): Handle \\not{abc} correctly. The \\not should appear over - // the 'a' instead of the 'c'. - for (var _i3 = 0; _i3 < groups.length; _i3++) { - if (groups[_i3].value === "\u0338" && _i3 + 1 < groups.length) { - var children = groups.slice(_i3, _i3 + 2); - - children[0].classes = ["mainrm"]; - // \u0338 is a combining glyph so we could reorder the children so - // that it comes after the other glyph. This works correctly on - // most browsers except for Safari. Instead we absolutely position - // the glyph and set its right side to match that of the other - // glyph which is visually equivalent. - children[0].style.position = "absolute"; - children[0].style.right = "0"; - - // Copy the classes from the second glyph to the new container. - // This is so it behaves the same as though there was no \\not. - var classes = groups[_i3 + 1].classes; - var container = (0, _buildCommon.makeSpan)(classes, children); - - // LaTeX adds a space between ords separated by a \\not. 
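-            // (2.77771pt of glue at the 10pt design size comes out to
-            // 0.277771em, which is where the padding value below comes
-            // from.)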
-            if (classes.indexOf("mord") !== -1) {
-                // \glue(\thickmuskip) 2.77771 plus 2.77771
-                container.style.paddingLeft = "0.277771em";
-            }
-
-            // Ensure that the \u0338 is positioned relative to the container.
-            container.style.position = "relative";
-            groups.splice(_i3, 2, container);
-        }
-    }
-
-    return groups;
-};
-
-// Return math atom class (mclass) of a domTree.
-var getTypeOfDomTree = function getTypeOfDomTree(node) {
-    if (node instanceof _domTree2.default.documentFragment) {
-        if (node.children.length) {
-            return getTypeOfDomTree(node.children[node.children.length - 1]);
-        }
-    } else {
-        if (_utils2.default.contains(["mord", "mop", "mbin", "mrel", "mopen", "mclose", "mpunct", "minner"], node.classes[0])) {
-            return node.classes[0];
-        }
-    }
-    return null;
-};
-
-/**
- * Sometimes, groups perform special rules when they have superscripts or
- * subscripts attached to them. This function lets the `supsub` group know that
- * its inner element should handle the superscripts and subscripts instead of
- * handling them itself.
- */
-var shouldHandleSupSub = function shouldHandleSupSub(group, options) {
-    if (!group.value.base) {
-        return false;
-    } else {
-        var base = group.value.base;
-        if (base.type === "op") {
-            // Operators handle supsubs differently when they have limits
-            // (e.g. `\displaystyle\sum_2^3`)
-            return base.value.limits && (options.style.size === _Style2.default.DISPLAY.size || base.value.alwaysHandleSupSub);
-        } else if (base.type === "accent") {
-            return isCharacterBox(base.value.base);
-        } else if (base.type === "horizBrace") {
-            var isSup = group.value.sub ? false : true;
-            return isSup === base.value.isOver;
-        } else {
-            return null;
-        }
-    }
-};
-
-/**
- * Sometimes we want to pull out the innermost element of a group. In most
- * cases, this will just be the group itself, but when ordgroups and colors have
- * a single element, we want to pull that out.
- */
-var getBaseElem = function getBaseElem(group) {
-    if (!group) {
-        return false;
-    } else if (group.type === "ordgroup") {
-        if (group.value.length === 1) {
-            return getBaseElem(group.value[0]);
-        } else {
-            return group;
-        }
-    } else if (group.type === "color") {
-        if (group.value.value.length === 1) {
-            return getBaseElem(group.value.value[0]);
-        } else {
-            return group;
-        }
-    } else if (group.type === "font") {
-        return getBaseElem(group.value.body);
-    } else {
-        return group;
-    }
-};
-
-/**
- * TeXbook algorithms often reference "character boxes", which are simply groups
- * with a single character in them. To decide if something is a character box,
- * we find its innermost group, and see if it is a single character.
- */
-var isCharacterBox = function isCharacterBox(group) {
-    var baseElem = getBaseElem(group);
-
-    // These are all the types of groups which hold single characters
-    return baseElem.type === "mathord" || baseElem.type === "textord" || baseElem.type === "bin" || baseElem.type === "rel" || baseElem.type === "inner" || baseElem.type === "open" || baseElem.type === "close" || baseElem.type === "punct";
-};
-
-var makeNullDelimiter = function makeNullDelimiter(options, classes) {
-    var moreClasses = ["nulldelimiter"].concat(options.baseSizingClasses());
-    return (0, _buildCommon.makeSpan)(classes.concat(moreClasses));
-};
-
-/**
- * This is a map of group types to the function used to handle that type.
- * Simpler types come at the beginning, while complicated types come afterwards.
- */ - var groupTypes = {}; - - groupTypes.mathord = function (group, options) { - return _buildCommon2.default.makeOrd(group, options, "mathord"); - }; - - groupTypes.textord = function (group, options) { - return _buildCommon2.default.makeOrd(group, options, "textord"); - }; - - groupTypes.bin = function (group, options) { - return _buildCommon2.default.mathsym(group.value, group.mode, options, ["mbin"]); - }; - - groupTypes.rel = function (group, options) { - return _buildCommon2.default.mathsym(group.value, group.mode, options, ["mrel"]); - }; - - groupTypes.open = function (group, options) { - return _buildCommon2.default.mathsym(group.value, group.mode, options, ["mopen"]); - }; - - groupTypes.close = function (group, options) { - return _buildCommon2.default.mathsym(group.value, group.mode, options, ["mclose"]); - }; - - groupTypes.inner = function (group, options) { - return _buildCommon2.default.mathsym(group.value, group.mode, options, ["minner"]); - }; - - groupTypes.punct = function (group, options) { - return _buildCommon2.default.mathsym(group.value, group.mode, options, ["mpunct"]); - }; - - groupTypes.ordgroup = function (group, options) { - return (0, _buildCommon.makeSpan)(["mord"], buildExpression(group.value, options, true), options); - }; - - groupTypes.text = function (group, options) { - var newOptions = options.withFont(group.value.style); - var inner = buildExpression(group.value.body, newOptions, true); - for (var i = 0; i < inner.length - 1; i++) { - if (inner[i].tryCombine(inner[i + 1])) { - inner.splice(i + 1, 1); - i--; - } - } - return (0, _buildCommon.makeSpan)(["mord", "text"], inner, newOptions); - }; - - groupTypes.color = function (group, options) { - var elements = buildExpression(group.value.value, options.withColor(group.value.color), false); - - // \color isn't supposed to affect the type of the elements it contains. - // To accomplish this, we wrap the results in a fragment, so the inner - // elements will be able to directly interact with their neighbors. For - // example, `\color{red}{2 +} 3` has the same spacing as `2 + 3` - return new _buildCommon2.default.makeFragment(elements); - }; - - groupTypes.supsub = function (group, options) { - // Superscript and subscripts are handled in the TeXbook on page - // 445-446, rules 18(a-f). - - // Here is where we defer to the inner group if it should handle - // superscripts and subscripts itself. 
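-    // For example, `\displaystyle\sum_2^3` defers to the `op` group type
-    // here so that the limits are placed above and below the operator.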
- if (shouldHandleSupSub(group, options)) { - return groupTypes[group.value.base.type](group, options); - } - - var base = buildGroup(group.value.base, options); - var supm = void 0; - var subm = void 0; - - var metrics = options.fontMetrics(); - var newOptions = void 0; - - // Rule 18a - var supShift = 0; - var subShift = 0; - - if (group.value.sup) { - newOptions = options.havingStyle(options.style.sup()); - supm = buildGroup(group.value.sup, newOptions, options); - if (!isCharacterBox(group.value.base)) { - supShift = base.height - newOptions.fontMetrics().supDrop * newOptions.sizeMultiplier / options.sizeMultiplier; - } - } - - if (group.value.sub) { - newOptions = options.havingStyle(options.style.sub()); - subm = buildGroup(group.value.sub, newOptions, options); - if (!isCharacterBox(group.value.base)) { - subShift = base.depth + newOptions.fontMetrics().subDrop * newOptions.sizeMultiplier / options.sizeMultiplier; - } - } - - // Rule 18c - var minSupShift = void 0; - if (options.style === _Style2.default.DISPLAY) { - minSupShift = metrics.sup1; - } else if (options.style.cramped) { - minSupShift = metrics.sup3; - } else { - minSupShift = metrics.sup2; - } - - // scriptspace is a font-size-independent size, so scale it - // appropriately - var multiplier = options.sizeMultiplier; - var scriptspace = 0.5 / metrics.ptPerEm / multiplier + "em"; - - var supsub = void 0; - if (!group.value.sup) { - // Rule 18b - subShift = Math.max(subShift, metrics.sub1, subm.height - 0.8 * metrics.xHeight); - - var vlistElem = [{ type: "elem", elem: subm, marginRight: scriptspace }]; - // Subscripts shouldn't be shifted by the base's italic correction. - // Account for that by shifting the subscript back the appropriate - // amount. Note we only do this when the base is a single symbol. - if (base instanceof _domTree2.default.symbolNode) { - vlistElem[0].marginLeft = -base.italic + "em"; - } - - supsub = _buildCommon2.default.makeVList(vlistElem, "shift", subShift, options); - } else if (!group.value.sub) { - // Rule 18c, d - supShift = Math.max(supShift, minSupShift, supm.depth + 0.25 * metrics.xHeight); - - supsub = _buildCommon2.default.makeVList([{ type: "elem", elem: supm, marginRight: scriptspace }], "shift", -supShift, options); - } else { - supShift = Math.max(supShift, minSupShift, supm.depth + 0.25 * metrics.xHeight); - subShift = Math.max(subShift, metrics.sub2); - - var ruleWidth = metrics.defaultRuleThickness; - - // Rule 18e - if (supShift - supm.depth - (subm.height - subShift) < 4 * ruleWidth) { - subShift = 4 * ruleWidth - (supShift - supm.depth) + subm.height; - var psi = 0.8 * metrics.xHeight - (supShift - supm.depth); - if (psi > 0) { - supShift += psi; - subShift -= psi; - } - } - - var _vlistElem = [{ type: "elem", elem: subm, shift: subShift, marginRight: scriptspace }, { type: "elem", elem: supm, shift: -supShift, marginRight: scriptspace }]; - // See comment above about subscripts not being shifted - if (base instanceof _domTree2.default.symbolNode) { - _vlistElem[0].marginLeft = -base.italic + "em"; - } - - supsub = _buildCommon2.default.makeVList(_vlistElem, "individualShift", null, options); - } - - // We ensure to wrap the supsub vlist in a span.msupsub to reset text-align - var mclass = getTypeOfDomTree(base) || "mord"; - return (0, _buildCommon.makeSpan)([mclass], [base, (0, _buildCommon.makeSpan)(["msupsub"], [supsub])], options); - }; - - groupTypes.genfrac = function (group, options) { - // Fractions are handled in the TeXbook on pages 444-445, rules 15(a-e). 
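-    // (\dfrac forces display style and \tfrac forces text style via
-    // group.value.size; otherwise the surrounding style is kept.)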
- // Figure out what style this fraction should be in based on the - // function used - var style = options.style; - if (group.value.size === "display") { - style = _Style2.default.DISPLAY; - } else if (group.value.size === "text") { - style = _Style2.default.TEXT; - } - - var nstyle = style.fracNum(); - var dstyle = style.fracDen(); - var newOptions = void 0; - - newOptions = options.havingStyle(nstyle); - var numerm = buildGroup(group.value.numer, newOptions, options); - - newOptions = options.havingStyle(dstyle); - var denomm = buildGroup(group.value.denom, newOptions, options); - - var rule = void 0; - var ruleWidth = void 0; - var ruleSpacing = void 0; - if (group.value.hasBarLine) { - rule = makeLineSpan("frac-line", options); - ruleWidth = rule.height; - ruleSpacing = rule.height; - } else { - rule = null; - ruleWidth = 0; - ruleSpacing = options.fontMetrics().defaultRuleThickness; - } - - // Rule 15b - var numShift = void 0; - var clearance = void 0; - var denomShift = void 0; - if (style.size === _Style2.default.DISPLAY.size) { - numShift = options.fontMetrics().num1; - if (ruleWidth > 0) { - clearance = 3 * ruleSpacing; - } else { - clearance = 7 * ruleSpacing; - } - denomShift = options.fontMetrics().denom1; - } else { - if (ruleWidth > 0) { - numShift = options.fontMetrics().num2; - clearance = ruleSpacing; - } else { - numShift = options.fontMetrics().num3; - clearance = 3 * ruleSpacing; - } - denomShift = options.fontMetrics().denom2; - } - - var frac = void 0; - if (ruleWidth === 0) { - // Rule 15c - var candidateClearance = numShift - numerm.depth - (denomm.height - denomShift); - if (candidateClearance < clearance) { - numShift += 0.5 * (clearance - candidateClearance); - denomShift += 0.5 * (clearance - candidateClearance); - } - - frac = _buildCommon2.default.makeVList([{ type: "elem", elem: denomm, shift: denomShift }, { type: "elem", elem: numerm, shift: -numShift }], "individualShift", null, options); - } else { - // Rule 15d - var axisHeight = options.fontMetrics().axisHeight; - - if (numShift - numerm.depth - (axisHeight + 0.5 * ruleWidth) < clearance) { - numShift += clearance - (numShift - numerm.depth - (axisHeight + 0.5 * ruleWidth)); - } - - if (axisHeight - 0.5 * ruleWidth - (denomm.height - denomShift) < clearance) { - denomShift += clearance - (axisHeight - 0.5 * ruleWidth - (denomm.height - denomShift)); - } - - var midShift = -(axisHeight - 0.5 * ruleWidth); - - frac = _buildCommon2.default.makeVList([{ type: "elem", elem: denomm, shift: denomShift }, { type: "elem", elem: rule, shift: midShift }, { type: "elem", elem: numerm, shift: -numShift }], "individualShift", null, options); - } - - // Since we manually change the style sometimes (with \dfrac or \tfrac), - // account for the possible size change here. 
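-    // For example, a \dfrac appearing in script style is built with the
-    // display-style multiplier, so its height and depth are scaled back up
-    // by the ratio of the two multipliers.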
-    newOptions = options.havingStyle(style);
-    frac.height *= newOptions.sizeMultiplier / options.sizeMultiplier;
-    frac.depth *= newOptions.sizeMultiplier / options.sizeMultiplier;
-
-    // Rule 15e
-    var delimSize = void 0;
-    if (style.size === _Style2.default.DISPLAY.size) {
-        delimSize = options.fontMetrics().delim1;
-    } else {
-        delimSize = options.fontMetrics().delim2;
-    }
-
-    var leftDelim = void 0;
-    var rightDelim = void 0;
-    if (group.value.leftDelim == null) {
-        leftDelim = makeNullDelimiter(options, ["mopen"]);
-    } else {
-        leftDelim = _delimiter2.default.customSizedDelim(group.value.leftDelim, delimSize, true, options.havingStyle(style), group.mode, ["mopen"]);
-    }
-    if (group.value.rightDelim == null) {
-        rightDelim = makeNullDelimiter(options, ["mclose"]);
-    } else {
-        rightDelim = _delimiter2.default.customSizedDelim(group.value.rightDelim, delimSize, true, options.havingStyle(style), group.mode, ["mclose"]);
-    }
-
-    return (0, _buildCommon.makeSpan)(["mord"].concat(newOptions.sizingClasses(options)), [leftDelim, (0, _buildCommon.makeSpan)(["mfrac"], [frac]), rightDelim], options);
-};
-
-groupTypes.array = function (group, options) {
-    var r = void 0;
-    var c = void 0;
-    var nr = group.value.body.length;
-    var nc = 0;
-    var body = new Array(nr);
-
-    // Horizontal spacing
-    var pt = 1 / options.fontMetrics().ptPerEm;
-    var arraycolsep = 5 * pt; // \arraycolsep in article.cls
-
-    // Vertical spacing
-    var baselineskip = 12 * pt; // see size10.clo
-    // Default \jot from ltmath.dtx
-    // TODO(edemaine): allow overriding \jot via \setlength (#687)
-    var jot = 3 * pt;
-    // Default \arraystretch from lttab.dtx
-    // TODO(gagern): may get redefined once we have user-defined macros
-    var arraystretch = _utils2.default.deflt(group.value.arraystretch, 1);
-    var arrayskip = arraystretch * baselineskip;
-    var arstrutHeight = 0.7 * arrayskip; // \strutbox in ltfsstrc.dtx and
-    var arstrutDepth = 0.3 * arrayskip; // \@arstrutbox in lttab.dtx
-
-    var totalHeight = 0;
-    for (r = 0; r < group.value.body.length; ++r) {
-        var inrow = group.value.body[r];
-        var height = arstrutHeight; // \@array adds an \@arstrut
-        var depth = arstrutDepth; // to each row (via the template)
-
-        if (nc < inrow.length) {
-            nc = inrow.length;
-        }
-
-        var outrow = new Array(inrow.length);
-        for (c = 0; c < inrow.length; ++c) {
-            var elt = buildGroup(inrow[c], options);
-            if (depth < elt.depth) {
-                depth = elt.depth;
-            }
-            if (height < elt.height) {
-                height = elt.height;
-            }
-            outrow[c] = elt;
-        }
-
-        var gap = 0;
-        if (group.value.rowGaps[r]) {
-            gap = _units2.default.calculateSize(group.value.rowGaps[r].value, options);
-            if (gap > 0) {
-                // \@argarraycr
-                gap += arstrutDepth;
-                if (depth < gap) {
-                    depth = gap; // \@xargarraycr
-                }
-                gap = 0;
-            }
-        }
-        // In AMS multiline environments such as aligned and gathered, rows
-        // correspond to lines that have additional \jot added to the
-        // \baselineskip via \openup.
-        if (group.value.addJot) {
-            depth += jot;
-        }
-
-        outrow.height = height;
-        outrow.depth = depth;
-        totalHeight += height;
-        outrow.pos = totalHeight;
-        totalHeight += depth + gap; // \@yargarraycr
-        body[r] = outrow;
-    }
-
-    var offset = totalHeight / 2 + options.fontMetrics().axisHeight;
-    var colDescriptions = group.value.cols || [];
-    var cols = [];
-    var colSep = void 0;
-    var colDescrNum = void 0;
-    for (c = 0, colDescrNum = 0;
-    // Continue while either there are more columns or more column
-    // descriptions, so trailing separators don't get lost.
- c < nc || colDescrNum < colDescriptions.length; ++c, ++colDescrNum) { - - var colDescr = colDescriptions[colDescrNum] || {}; - - var firstSeparator = true; - while (colDescr.type === "separator") { - // If there is more than one separator in a row, add a space - // between them. - if (!firstSeparator) { - colSep = (0, _buildCommon.makeSpan)(["arraycolsep"], []); - colSep.style.width = options.fontMetrics().doubleRuleSep + "em"; - cols.push(colSep); - } - - if (colDescr.separator === "|") { - var separator = (0, _buildCommon.makeSpan)(["vertical-separator"], []); - separator.style.height = totalHeight + "em"; - separator.style.verticalAlign = -(totalHeight - offset) + "em"; - - cols.push(separator); - } else { - throw new _ParseError2.default("Invalid separator type: " + colDescr.separator); - } - - colDescrNum++; - colDescr = colDescriptions[colDescrNum] || {}; - firstSeparator = false; - } - - if (c >= nc) { - continue; - } - - var sepwidth = void 0; - if (c > 0 || group.value.hskipBeforeAndAfter) { - sepwidth = _utils2.default.deflt(colDescr.pregap, arraycolsep); - if (sepwidth !== 0) { - colSep = (0, _buildCommon.makeSpan)(["arraycolsep"], []); - colSep.style.width = sepwidth + "em"; - cols.push(colSep); - } - } - - var col = []; - for (r = 0; r < nr; ++r) { - var row = body[r]; - var elem = row[c]; - if (!elem) { - continue; - } - var shift = row.pos - offset; - elem.depth = row.depth; - elem.height = row.height; - col.push({ type: "elem", elem: elem, shift: shift }); - } - - col = _buildCommon2.default.makeVList(col, "individualShift", null, options); - col = (0, _buildCommon.makeSpan)(["col-align-" + (colDescr.align || "c")], [col]); - cols.push(col); - - if (c < nc - 1 || group.value.hskipBeforeAndAfter) { - sepwidth = _utils2.default.deflt(colDescr.postgap, arraycolsep); - if (sepwidth !== 0) { - colSep = (0, _buildCommon.makeSpan)(["arraycolsep"], []); - colSep.style.width = sepwidth + "em"; - cols.push(colSep); - } - } - } - body = (0, _buildCommon.makeSpan)(["mtable"], cols); - return (0, _buildCommon.makeSpan)(["mord"], [body], options); - }; - - groupTypes.spacing = function (group, options) { - if (group.value === "\\ " || group.value === "\\space" || group.value === " " || group.value === "~") { - // Spaces are generated by adding an actual space. Each of these - // things has an entry in the symbols table, so these will be turned - // into appropriate outputs. - if (group.mode === "text") { - return _buildCommon2.default.makeOrd(group, options, "textord"); - } else { - return (0, _buildCommon.makeSpan)(["mspace"], [_buildCommon2.default.mathsym(group.value, group.mode, options)], options); - } - } else { - // Other kinds of spaces are of arbitrary width. We use CSS to - // generate these. - return (0, _buildCommon.makeSpan)(["mspace", _buildCommon2.default.spacingFunctions[group.value].className], [], options); - } - }; - - groupTypes.llap = function (group, options) { - var inner = (0, _buildCommon.makeSpan)(["inner"], [buildGroup(group.value.body, options)]); - var fix = (0, _buildCommon.makeSpan)(["fix"], []); - return (0, _buildCommon.makeSpan)(["mord", "llap"], [inner, fix], options); - }; - - groupTypes.rlap = function (group, options) { - var inner = (0, _buildCommon.makeSpan)(["inner"], [buildGroup(group.value.body, options)]); - var fix = (0, _buildCommon.makeSpan)(["fix"], []); - return (0, _buildCommon.makeSpan)(["mord", "rlap"], [inner, fix], options); - }; - - groupTypes.op = function (group, options) { - // Operators are handled in the TeXbook pg. 
443-444, rule 13(a). - var supGroup = void 0; - var subGroup = void 0; - var hasLimits = false; - if (group.type === "supsub") { - // If we have limits, supsub will pass us its group to handle. Pull - // out the superscript and subscript and set the group to the op in - // its base. - supGroup = group.value.sup; - subGroup = group.value.sub; - group = group.value.base; - hasLimits = true; - } - - var style = options.style; - - // Most operators have a large successor symbol, but these don't. - var noSuccessor = ["\\smallint"]; - - var large = false; - if (style.size === _Style2.default.DISPLAY.size && group.value.symbol && !_utils2.default.contains(noSuccessor, group.value.body)) { - - // Most symbol operators get larger in displaystyle (rule 13) - large = true; - } - - var base = void 0; - if (group.value.symbol) { - // If this is a symbol, create the symbol. - var fontName = large ? "Size2-Regular" : "Size1-Regular"; - base = _buildCommon2.default.makeSymbol(group.value.body, fontName, "math", options, ["mop", "op-symbol", large ? "large-op" : "small-op"]); - } else if (group.value.value) { - // If this is a list, compose that list. - var inner = buildExpression(group.value.value, options, true); - if (inner.length === 1 && inner[0] instanceof _domTree2.default.symbolNode) { - base = inner[0]; - base.classes[0] = "mop"; // replace old mclass - } else { - base = (0, _buildCommon.makeSpan)(["mop"], inner, options); - } - } else { - // Otherwise, this is a text operator. Build the text from the - // operator's name. - // TODO(emily): Add a space in the middle of some of these - // operators, like \limsup - var output = []; - for (var i = 1; i < group.value.body.length; i++) { - output.push(_buildCommon2.default.mathsym(group.value.body[i], group.mode)); - } - base = (0, _buildCommon.makeSpan)(["mop"], output, options); - } - - // If content of op is a single symbol, shift it vertically. - var baseShift = 0; - var slant = 0; - if (base instanceof _domTree2.default.symbolNode) { - // Shift the symbol so its center lies on the axis (rule 13). It - // appears that our fonts have the centers of the symbols already - // almost on the axis, so these numbers are very small. Note we - // don't actually apply this here, but instead it is used either in - // the vlist creation or separately when there are no limits. - baseShift = (base.height - base.depth) / 2 - options.fontMetrics().axisHeight; - - // The slant of the symbol is just its italic correction. - slant = base.italic; - } - - if (hasLimits) { - // IE 8 clips \int if it is in a display: inline-block. We wrap it - // in a new span so it is an inline, and works. - base = (0, _buildCommon.makeSpan)([], [base]); - - var supm = void 0; - var supKern = void 0; - var subm = void 0; - var subKern = void 0; - var newOptions = void 0; - // We manually have to handle the superscripts and subscripts. This, - // aside from the kern calculations, is copied from supsub. - if (supGroup) { - newOptions = options.havingStyle(style.sup()); - supm = buildGroup(supGroup, newOptions, options); - - supKern = Math.max(options.fontMetrics().bigOpSpacing1, options.fontMetrics().bigOpSpacing3 - supm.depth); - } - - if (subGroup) { - newOptions = options.havingStyle(style.sub()); - subm = buildGroup(subGroup, newOptions, options); - - subKern = Math.max(options.fontMetrics().bigOpSpacing2, options.fontMetrics().bigOpSpacing4 - subm.height); - } - - // Build the final group as a vlist of the possible subscript, base, - // and possible superscript. 
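-        // When both scripts are present the stack runs, bottom to top:
-        // bigOpSpacing5 kern, subscript, subKern, base, supKern,
-        // superscript, bigOpSpacing5 kern (e.g. for \displaystyle\sum_{i=0}^{n}).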
- var finalGroup = void 0; - var top = void 0; - var bottom = void 0; - if (!supGroup) { - top = base.height - baseShift; - - // Shift the limits by the slant of the symbol. Note - // that we are supposed to shift the limits by 1/2 of the slant, - // but since we are centering the limits adding a full slant of - // margin will shift by 1/2 that. - finalGroup = _buildCommon2.default.makeVList([{ type: "kern", size: options.fontMetrics().bigOpSpacing5 }, { type: "elem", elem: subm, marginLeft: -slant + "em" }, { type: "kern", size: subKern }, { type: "elem", elem: base }], "top", top, options); - } else if (!subGroup) { - bottom = base.depth + baseShift; - - finalGroup = _buildCommon2.default.makeVList([{ type: "elem", elem: base }, { type: "kern", size: supKern }, { type: "elem", elem: supm, marginLeft: slant + "em" }, { type: "kern", size: options.fontMetrics().bigOpSpacing5 }], "bottom", bottom, options); - } else if (!supGroup && !subGroup) { - // This case probably shouldn't occur (this would mean the - // supsub was sending us a group with no superscript or - // subscript) but be safe. - return base; - } else { - bottom = options.fontMetrics().bigOpSpacing5 + subm.height + subm.depth + subKern + base.depth + baseShift; - - finalGroup = _buildCommon2.default.makeVList([{ type: "kern", size: options.fontMetrics().bigOpSpacing5 }, { type: "elem", elem: subm, marginLeft: -slant + "em" }, { type: "kern", size: subKern }, { type: "elem", elem: base }, { type: "kern", size: supKern }, { type: "elem", elem: supm, marginLeft: slant + "em" }, { type: "kern", size: options.fontMetrics().bigOpSpacing5 }], "bottom", bottom, options); - } - - return (0, _buildCommon.makeSpan)(["mop", "op-limits"], [finalGroup], options); - } else { - if (baseShift) { - base.style.position = "relative"; - base.style.top = baseShift + "em"; - } - - return base; - } - }; - - groupTypes.mod = function (group, options) { - var inner = []; - - if (group.value.modType === "bmod") { - // “\nonscript\mskip-\medmuskip\mkern5mu” - if (!options.style.isTight()) { - inner.push((0, _buildCommon.makeSpan)(["mspace", "negativemediumspace"], [], options)); - } - inner.push((0, _buildCommon.makeSpan)(["mspace", "thickspace"], [], options)); - } else if (options.style.size === _Style2.default.DISPLAY.size) { - inner.push((0, _buildCommon.makeSpan)(["mspace", "quad"], [], options)); - } else if (group.value.modType === "mod") { - inner.push((0, _buildCommon.makeSpan)(["mspace", "twelvemuspace"], [], options)); - } else { - inner.push((0, _buildCommon.makeSpan)(["mspace", "eightmuspace"], [], options)); - } - - if (group.value.modType === "pod" || group.value.modType === "pmod") { - inner.push(_buildCommon2.default.mathsym("(", group.mode)); - } - - if (group.value.modType !== "pod") { - var modInner = [_buildCommon2.default.mathsym("m", group.mode), _buildCommon2.default.mathsym("o", group.mode), _buildCommon2.default.mathsym("d", group.mode)]; - if (group.value.modType === "bmod") { - inner.push((0, _buildCommon.makeSpan)(["mbin"], modInner, options)); - // “\mkern5mu\nonscript\mskip-\medmuskip” - inner.push((0, _buildCommon.makeSpan)(["mspace", "thickspace"], [], options)); - if (!options.style.isTight()) { - inner.push((0, _buildCommon.makeSpan)(["mspace", "negativemediumspace"], [], options)); - } - } else { - Array.prototype.push.apply(inner, modInner); - inner.push((0, _buildCommon.makeSpan)(["mspace", "sixmuspace"], [], options)); - } - } - - if (group.value.value) { - Array.prototype.push.apply(inner, 
buildExpression(group.value.value, options, false)); - } - - if (group.value.modType === "pod" || group.value.modType === "pmod") { - inner.push(_buildCommon2.default.mathsym(")", group.mode)); - } - - return _buildCommon2.default.makeFragment(inner); - }; - - groupTypes.katex = function (group, options) { - // The KaTeX logo. The offsets for the K and a were chosen to look - // good, but the offsets for the T, E, and X were taken from the - // definition of \TeX in TeX (see TeXbook pg. 356) - var k = (0, _buildCommon.makeSpan)(["k"], [_buildCommon2.default.mathsym("K", group.mode)], options); - var a = (0, _buildCommon.makeSpan)(["a"], [_buildCommon2.default.mathsym("A", group.mode)], options); - - a.height = (a.height + 0.2) * 0.75; - a.depth = (a.height - 0.2) * 0.75; - - var t = (0, _buildCommon.makeSpan)(["t"], [_buildCommon2.default.mathsym("T", group.mode)], options); - var e = (0, _buildCommon.makeSpan)(["e"], [_buildCommon2.default.mathsym("E", group.mode)], options); - - e.height = e.height - 0.2155; - e.depth = e.depth + 0.2155; - - var x = (0, _buildCommon.makeSpan)(["x"], [_buildCommon2.default.mathsym("X", group.mode)], options); - - return (0, _buildCommon.makeSpan)(["mord", "katex-logo"], [k, a, t, e, x], options); - }; - - var makeLineSpan = function makeLineSpan(className, options, thickness) { - var line = (0, _buildCommon.makeSpan)([className], [], options); - line.height = thickness || options.fontMetrics().defaultRuleThickness; - line.style.borderBottomWidth = line.height + "em"; - line.maxFontSize = 1.0; - return line; - }; - - groupTypes.overline = function (group, options) { - // Overlines are handled in the TeXbook pg 443, Rule 9. - - // Build the inner group in the cramped style. - var innerGroup = buildGroup(group.value.body, options.havingCrampedStyle()); - - // Create the line above the body - var line = makeLineSpan("overline-line", options); - - // Generate the vlist, with the appropriate kerns - var vlist = _buildCommon2.default.makeVList([{ type: "elem", elem: innerGroup }, { type: "kern", size: 3 * line.height }, { type: "elem", elem: line }, { type: "kern", size: line.height }], "firstBaseline", null, options); - - return (0, _buildCommon.makeSpan)(["mord", "overline"], [vlist], options); - }; - - groupTypes.underline = function (group, options) { - // Underlines are handled in the TeXbook pg 443, Rule 10. - // Build the inner group. - var innerGroup = buildGroup(group.value.body, options); - - // Create the line above the body - var line = makeLineSpan("underline-line", options); - - // Generate the vlist, with the appropriate kerns - var vlist = _buildCommon2.default.makeVList([{ type: "kern", size: line.height }, { type: "elem", elem: line }, { type: "kern", size: 3 * line.height }, { type: "elem", elem: innerGroup }], "top", innerGroup.height, options); - - return (0, _buildCommon.makeSpan)(["mord", "underline"], [vlist], options); - }; - - groupTypes.sqrt = function (group, options) { - // Square roots are handled in the TeXbook pg. 443, Rule 11. - - // First, we do the same steps as in overline to build the inner group - // and line - var inner = buildGroup(group.value.body, options.havingCrampedStyle()); - - // Some groups can return document fragments. Handle those by wrapping - // them in a span. 
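-    // (A \color group, for instance, returns a document fragment so that
-    // its children can interact directly with neighboring atoms.)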
-    if (inner instanceof _domTree2.default.documentFragment) {
-        inner = (0, _buildCommon.makeSpan)([], [inner], options);
-    }
-
-    // Calculate the minimum size for the \surd delimiter
-    var metrics = options.fontMetrics();
-    var theta = metrics.defaultRuleThickness;
-
-    var phi = theta;
-    if (options.style.id < _Style2.default.TEXT.id) {
-        phi = options.fontMetrics().xHeight;
-    }
-
-    // Calculate the clearance between the body and line
-    var lineClearance = theta + phi / 4;
-
-    var minDelimiterHeight = (inner.height + inner.depth + lineClearance + theta) * options.sizeMultiplier;
-
-    // Create a sqrt SVG of the required minimum size
-    var img = _delimiter2.default.customSizedDelim("\\surd", minDelimiterHeight, false, options, group.mode);
-
-    // Calculate the actual line width.
-    // This actually should depend on the chosen font -- e.g. \boldmath
-    // should use the thicker surd symbols from e.g. KaTeX_Main-Bold, and
-    // have thicker rules.
-    var ruleWidth = options.fontMetrics().sqrtRuleThickness * img.sizeMultiplier;
-
-    var delimDepth = img.height - ruleWidth;
-
-    // Adjust the clearance based on the delimiter size
-    if (delimDepth > inner.height + inner.depth + lineClearance) {
-        lineClearance = (lineClearance + delimDepth - inner.height - inner.depth) / 2;
-    }
-
-    // Shift the sqrt image
-    var imgShift = img.height - inner.height - lineClearance - ruleWidth;
-
-    // We add a special case here, because even when `inner` is empty, we
-    // still get a line. So, we use a simple heuristic to decide if we
-    // should omit the body entirely. (Note this doesn't work for something
-    // like `\sqrt{\rlap{x}}`, but if someone is doing that they deserve for
-    // it not to work.)
-    var body = void 0;
-    if (inner.height === 0 && inner.depth === 0) {
-        body = (0, _buildCommon.makeSpan)();
-    } else {
-        inner.style.paddingLeft = img.surdWidth + "em";
-
-        // Overlay the image and the argument.
-        body = _buildCommon2.default.makeVList([{ type: "elem", elem: inner }, { type: "kern", size: -(inner.height + imgShift) }, { type: "elem", elem: img }, { type: "kern", size: ruleWidth }], "firstBaseline", null, options);
-        body.children[0].children[0].classes.push("svg-align");
-    }
-
-    if (!group.value.index) {
-        return (0, _buildCommon.makeSpan)(["mord", "sqrt"], [body], options);
-    } else {
-        // Handle the optional root index
-
-        // The index is always in scriptscript style
-        var newOptions = options.havingStyle(_Style2.default.SCRIPTSCRIPT);
-        var rootm = buildGroup(group.value.index, newOptions, options);
-
-        // The amount the index is shifted by. This is taken from the TeX
-        // source, in the definition of `\r@@t`.
-        var toShift = 0.6 * (body.height - body.depth);
-
-        // Build a VList with the superscript shifted up correctly
-        var rootVList = _buildCommon2.default.makeVList([{ type: "elem", elem: rootm }], "shift", -toShift, options);
-        // Add a class surrounding it so we can add on the appropriate
-        // kerning
-        var rootVListWrap = (0, _buildCommon.makeSpan)(["root"], [rootVList]);
-
-        return (0, _buildCommon.makeSpan)(["mord", "sqrt"], [rootVListWrap, body], options);
-    }
-};
-
-function sizingGroup(value, options, baseOptions) {
-    var inner = buildExpression(value, options, false);
-    var multiplier = options.sizeMultiplier / baseOptions.sizeMultiplier;
-
-    // Add size-resetting classes to the inner list and set maxFontSize
-    // manually. Handle nested size changes.
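-    // Entries from sizingClasses look like ["sizing", "reset-size<m>",
-    // "size<n>"], so the class right after "sizing" is the reset class
-    // inspected below.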
- for (var i = 0; i < inner.length; i++) { - var pos = _utils2.default.indexOf(inner[i].classes, "sizing"); - if (pos < 0) { - Array.prototype.push.apply(inner[i].classes, options.sizingClasses(baseOptions)); - } else if (inner[i].classes[pos + 1] === "reset-size" + options.size) { - // This is a nested size change: e.g., inner[i] is the "b" in - // `\Huge a \small b`. Override the old size (the `reset-` class) - // but not the new size. - inner[i].classes[pos + 1] = "reset-size" + baseOptions.size; - } - - inner[i].height *= multiplier; - inner[i].depth *= multiplier; - } - - return _buildCommon2.default.makeFragment(inner); - } - - groupTypes.sizing = function (group, options) { - // Handle sizing operators like \Huge. Real TeX doesn't actually allow - // these functions inside of math expressions, so we do some special - // handling. - var newOptions = options.havingSize(group.value.size); - return sizingGroup(group.value.value, newOptions, options); - }; - - groupTypes.styling = function (group, options) { - // Style changes are handled in the TeXbook on pg. 442, Rule 3. - - // Figure out what style we're changing to. - var styleMap = { - "display": _Style2.default.DISPLAY, - "text": _Style2.default.TEXT, - "script": _Style2.default.SCRIPT, - "scriptscript": _Style2.default.SCRIPTSCRIPT - }; - - var newStyle = styleMap[group.value.style]; - var newOptions = options.havingStyle(newStyle); - return sizingGroup(group.value.value, newOptions, options); - }; - - groupTypes.font = function (group, options) { - var font = group.value.font; - return buildGroup(group.value.body, options.withFont(font)); - }; - - groupTypes.delimsizing = function (group, options) { - var delim = group.value.value; - - if (delim === ".") { - // Empty delimiters still count as elements, even though they don't - // show anything. - return (0, _buildCommon.makeSpan)([group.value.mclass]); - } - - // Use delimiter.sizedDelim to generate the delimiter. - return _delimiter2.default.sizedDelim(delim, group.value.size, options, group.mode, [group.value.mclass]); - }; - - groupTypes.leftright = function (group, options) { - // Build the inner expression - var inner = buildExpression(group.value.body, options, true); - - var innerHeight = 0; - var innerDepth = 0; - var hadMiddle = false; - - // Calculate its height and depth - for (var i = 0; i < inner.length; i++) { - if (inner[i].isMiddle) { - hadMiddle = true; - } else { - innerHeight = Math.max(inner[i].height, innerHeight); - innerDepth = Math.max(inner[i].depth, innerDepth); - } - } - - // The size of delimiters is the same, regardless of what style we are - // in. Thus, to correctly calculate the size of delimiter we need around - // a group, we scale down the inner size based on the size. - innerHeight *= options.sizeMultiplier; - innerDepth *= options.sizeMultiplier; - - var leftDelim = void 0; - if (group.value.left === ".") { - // Empty delimiters in \left and \right make null delimiter spaces. - leftDelim = makeNullDelimiter(options, ["mopen"]); - } else { - // Otherwise, use leftRightDelim to generate the correct sized - // delimiter. 
- leftDelim = _delimiter2.default.leftRightDelim(group.value.left, innerHeight, innerDepth, options, group.mode, ["mopen"]); - } - // Add it to the beginning of the expression - inner.unshift(leftDelim); - - // Handle middle delimiters - if (hadMiddle) { - for (var _i4 = 1; _i4 < inner.length; _i4++) { - var middleDelim = inner[_i4]; - if (middleDelim.isMiddle) { - // Apply the options that were active when \middle was called - inner[_i4] = _delimiter2.default.leftRightDelim(middleDelim.isMiddle.value, innerHeight, innerDepth, middleDelim.isMiddle.options, group.mode, []); - // Add back spaces shifted into the delimiter - var spaces = spliceSpaces(middleDelim.children, 0); - if (spaces) { - _buildCommon2.default.prependChildren(inner[_i4], spaces); - } - } - } - } - - var rightDelim = void 0; - // Same for the right delimiter - if (group.value.right === ".") { - rightDelim = makeNullDelimiter(options, ["mclose"]); - } else { - rightDelim = _delimiter2.default.leftRightDelim(group.value.right, innerHeight, innerDepth, options, group.mode, ["mclose"]); - } - // Add it to the end of the expression. - inner.push(rightDelim); - - return (0, _buildCommon.makeSpan)(["minner"], inner, options); - }; - - groupTypes.middle = function (group, options) { - var middleDelim = void 0; - if (group.value.value === ".") { - middleDelim = makeNullDelimiter(options, []); - } else { - middleDelim = _delimiter2.default.sizedDelim(group.value.value, 1, options, group.mode, []); - middleDelim.isMiddle = { value: group.value.value, options: options }; - } - return middleDelim; - }; - - groupTypes.rule = function (group, options) { - // Make an empty span for the rule - var rule = (0, _buildCommon.makeSpan)(["mord", "rule"], [], options); - - // Calculate the shift, width, and height of the rule, and account for units - var shift = 0; - if (group.value.shift) { - shift = _units2.default.calculateSize(group.value.shift, options); - } - - var width = _units2.default.calculateSize(group.value.width, options); - var height = _units2.default.calculateSize(group.value.height, options); - - // Style the rule to the right size - rule.style.borderRightWidth = width + "em"; - rule.style.borderTopWidth = height + "em"; - rule.style.bottom = shift + "em"; - - // Record the height and width - rule.width = width; - rule.height = height + shift; - rule.depth = -shift; - // Font size is the number large enough that the browser will - // reserve at least `absHeight` space above the baseline. - // The 1.125 factor was empirically determined - rule.maxFontSize = height * 1.125 * options.sizeMultiplier; - - return rule; - }; - - groupTypes.kern = function (group, options) { - // Make an empty span for the rule - var rule = (0, _buildCommon.makeSpan)(["mord", "rule"], [], options); - - if (group.value.dimension) { - var dimension = _units2.default.calculateSize(group.value.dimension, options); - rule.style.marginLeft = dimension + "em"; - } - - return rule; - }; - - groupTypes.accent = function (group, options) { - // Accents are handled in the TeXbook pg. 443, rule 12. - var base = group.value.base; - - var supsubGroup = void 0; - if (group.type === "supsub") { - // If our base is a character box, and we have superscripts and - // subscripts, the supsub will defer to us. In particular, we want - // to attach the superscripts and subscripts to the inner body (so - // that the position of the superscripts and subscripts won't be - // affected by the height of the accent). 
We accomplish this by - // sticking the base of the accent into the base of the supsub, and - // rendering that, while keeping track of where the accent is. - - // The supsub group is the group that was passed in - var supsub = group; - // The real accent group is the base of the supsub group - group = supsub.value.base; - // The character box is the base of the accent group - base = group.value.base; - // Stick the character box into the base of the supsub group - supsub.value.base = base; - - // Rerender the supsub group with its new base, and store that - // result. - supsubGroup = buildGroup(supsub, options); - } - - // Build the base group - var body = buildGroup(base, options.havingCrampedStyle()); - - // Does the accent need to shift for the skew of a character? - var mustShift = group.value.isShifty && isCharacterBox(base); - - // Calculate the skew of the accent. This is based on the line "If the - // nucleus is not a single character, let s = 0; otherwise set s to the - // kern amount for the nucleus followed by the \skewchar of its font." - // Note that our skew metrics are just the kern between each character - // and the skewchar. - var skew = 0; - if (mustShift) { - // If the base is a character box, then we want the skew of the - // innermost character. To do that, we find the innermost character: - var baseChar = getBaseElem(base); - // Then, we render its group to get the symbol inside it - var baseGroup = buildGroup(baseChar, options.havingCrampedStyle()); - // Finally, we pull the skew off of the symbol. - skew = baseGroup.skew; - // Note that we now throw away baseGroup, because the layers we - // removed with getBaseElem might contain things like \color which - // we can't get rid of. - // TODO(emily): Find a better way to get the skew - } - - // calculate the amount of space between the body and the accent - var clearance = Math.min(body.height, options.fontMetrics().xHeight); - - // Build the accent - var accentBody = void 0; - if (!group.value.isStretchy) { - var accent = _buildCommon2.default.makeSymbol(group.value.label, "Main-Regular", group.mode, options); - // Remove the italic correction of the accent, because it only serves to - // shift the accent over to a place we don't want. - accent.italic = 0; - - // The \vec character that the fonts use is a combining character, and - // thus shows up much too far to the left. To account for this, we add a - // specific class which shifts the accent over to where we want it. - // TODO(emily): Fix this in a better way, like by changing the font - // Similarly, text accent \H is a combining character and - // requires a different adjustment. - var accentClass = null; - if (group.value.label === "\\vec") { - accentClass = "accent-vec"; - } else if (group.value.label === '\\H') { - accentClass = "accent-hungarian"; - } - - accentBody = (0, _buildCommon.makeSpan)([], [accent]); - accentBody = (0, _buildCommon.makeSpan)(["accent-body", accentClass], [accentBody]); - - // Shift the accent over by the skew. Note we shift by twice the skew - // because we are centering the accent, so by adding 2*skew to the left, - // we shift it to the right by 1*skew. 
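-        // For example, with skew = 0.05em the margin set below is 0.1em,
-        // and centering then yields a net visual shift of 0.05em.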
- accentBody.style.marginLeft = 2 * skew + "em"; - - accentBody = _buildCommon2.default.makeVList([{ type: "elem", elem: body }, { type: "kern", size: -clearance }, { type: "elem", elem: accentBody }], "firstBaseline", null, options); - } else { - accentBody = _stretchy2.default.svgSpan(group, options); - - accentBody = _buildCommon2.default.makeVList([{ type: "elem", elem: body }, { type: "elem", elem: accentBody }], "firstBaseline", null, options); - - var styleSpan = accentBody.children[0].children[0].children[1]; - styleSpan.classes.push("svg-align"); // text-align: left; - if (skew > 0) { - // Shorten the accent and nudge it to the right. - styleSpan.style.width = "calc(100% - " + 2 * skew + "em)"; - styleSpan.style.marginLeft = 2 * skew + "em"; - } - } - - var accentWrap = (0, _buildCommon.makeSpan)(["mord", "accent"], [accentBody], options); - - if (supsubGroup) { - // Here, we replace the "base" child of the supsub with our newly - // generated accent. - supsubGroup.children[0] = accentWrap; - - // Since we don't rerun the height calculation after replacing the - // accent, we manually recalculate height. - supsubGroup.height = Math.max(accentWrap.height, supsubGroup.height); - - // Accents should always be ords, even when their innards are not. - supsubGroup.classes[0] = "mord"; - - return supsubGroup; - } else { - return accentWrap; - } - }; - - groupTypes.horizBrace = function (group, options) { - var style = options.style; - - var hasSupSub = group.type === "supsub"; - var supSubGroup = void 0; - var newOptions = void 0; - if (hasSupSub) { - // Ref: LaTeX source2e: }}}}\limits} - // i.e. LaTeX treats the brace similar to an op and passes it - // with \limits, so we need to assign supsub style. - if (group.value.sup) { - newOptions = options.havingStyle(style.sup()); - supSubGroup = buildGroup(group.value.sup, newOptions, options); - } else { - newOptions = options.havingStyle(style.sub()); - supSubGroup = buildGroup(group.value.sub, newOptions, options); - } - group = group.value.base; - } - - // Build the base group - var body = buildGroup(group.value.base, options.havingBaseStyle(_Style2.default.DISPLAY)); - - // Create the stretchy element - var braceBody = _stretchy2.default.svgSpan(group, options); - - // Generate the vlist, with the appropriate kerns ┏━━━━━━━━┓ - // This first vlist contains the subject matter and the brace: equation - var vlist = void 0; - if (group.value.isOver) { - vlist = _buildCommon2.default.makeVList([{ type: "elem", elem: body }, { type: "kern", size: 0.1 }, { type: "elem", elem: braceBody }], "firstBaseline", null, options); - vlist.children[0].children[0].children[1].classes.push("svg-align"); - } else { - vlist = _buildCommon2.default.makeVList([{ type: "elem", elem: braceBody }, { type: "kern", size: 0.1 }, { type: "elem", elem: body }], "bottom", body.depth + 0.1 + braceBody.height, options); - vlist.children[0].children[0].children[0].classes.push("svg-align"); - } - - if (hasSupSub) { - // In order to write the supsub, wrap the first vlist in another vlist: - // They can't all go in the same vlist, because the note might be wider - // than the equation. We want the equation to control the brace width. - - // note long note long note - // ┏━━━━━━━━┓ or ┏━━━┓ not ┏━━━━━━━━━┓ - // equation eqn eqn - - var vSpan = (0, _buildCommon.makeSpan)(["mord", group.value.isOver ? 
"mover" : "munder"], [vlist], options); - - if (group.value.isOver) { - vlist = _buildCommon2.default.makeVList([{ type: "elem", elem: vSpan }, { type: "kern", size: 0.2 }, { type: "elem", elem: supSubGroup }], "firstBaseline", null, options); - } else { - vlist = _buildCommon2.default.makeVList([{ type: "elem", elem: supSubGroup }, { type: "kern", size: 0.2 }, { type: "elem", elem: vSpan }], "bottom", vSpan.depth + 0.2 + supSubGroup.height, options); - } - } - - return (0, _buildCommon.makeSpan)(["mord", group.value.isOver ? "mover" : "munder"], [vlist], options); - }; - - groupTypes.accentUnder = function (group, options) { - // Treat under accents much like underlines. - var innerGroup = buildGroup(group.value.body, options); - - var accentBody = _stretchy2.default.svgSpan(group, options); - var kern = /tilde/.test(group.value.label) ? 0.12 : 0; - - // Generate the vlist, with the appropriate kerns - var vlist = _buildCommon2.default.makeVList([{ type: "elem", elem: accentBody }, { type: "kern", size: kern }, { type: "elem", elem: innerGroup }], "bottom", accentBody.height + kern, options); - - vlist.children[0].children[0].children[0].classes.push("svg-align"); - - return (0, _buildCommon.makeSpan)(["mord", "accentunder"], [vlist], options); - }; - - groupTypes.enclose = function (group, options) { - // \cancel, \bcancel, \xcancel, \sout, \fbox - var inner = buildGroup(group.value.body, options); - - var label = group.value.label.substr(1); - var scale = options.sizeMultiplier; - var img = void 0; - var pad = 0; - var imgShift = 0; - - if (label === "sout") { - img = (0, _buildCommon.makeSpan)(["stretchy", "sout"]); - img.height = options.fontMetrics().defaultRuleThickness / scale; - imgShift = -0.5 * options.fontMetrics().xHeight; - } else { - // Add horizontal padding - inner.classes.push(label === "fbox" ? "boxpad" : "cancel-pad"); - - // Add vertical padding - var isCharBox = isCharacterBox(group.value.body); - // ref: LaTeX source2e: \fboxsep = 3pt; \fboxrule = .4pt - // ref: cancel package: \advance\totalheight2\p@ % "+2" - pad = label === "fbox" ? 0.34 : isCharBox ? 0.2 : 0; - imgShift = inner.depth + pad; - - img = _stretchy2.default.encloseSpan(inner, label, pad, options); - } - - var vlist = _buildCommon2.default.makeVList([{ type: "elem", elem: inner, shift: 0 }, { type: "elem", elem: img, shift: imgShift }], "individualShift", null, options); - - if (label !== "fbox") { - vlist.children[0].children[0].children[1].classes.push("svg-align"); - } - - if (/cancel/.test(label)) { - // cancel does not create horiz space for its line extension. - // That is, not when adjacent to a mord. - return (0, _buildCommon.makeSpan)(["mord", "cancel-lap"], [vlist], options); - } else { - return (0, _buildCommon.makeSpan)(["mord"], [vlist], options); - } - }; - - groupTypes.xArrow = function (group, options) { - var style = options.style; - - // Build the argument groups in the appropriate style. 
- // Ref: amsmath.dtx: \hbox{$\scriptstyle\mkern#3mu{#6}\mkern#4mu$}% - - var newOptions = options.havingStyle(style.sup()); - var upperGroup = buildGroup(group.value.body, newOptions, options); - upperGroup.classes.push("x-arrow-pad"); - - var lowerGroup = void 0; - if (group.value.below) { - // Build the lower group - newOptions = options.havingStyle(style.sub()); - lowerGroup = buildGroup(group.value.below, newOptions, options); - lowerGroup.classes.push("x-arrow-pad"); - } - - var arrowBody = _stretchy2.default.svgSpan(group, options); - - var arrowShift = -options.fontMetrics().axisHeight + arrowBody.depth; - var upperShift = -options.fontMetrics().axisHeight - arrowBody.height - 0.111; // 2 mu. Ref: amsmath.dtx: #7\if0#2\else\mkern#2mu\fi - - // Generate the vlist - var vlist = void 0; - if (group.value.below) { - var lowerShift = -options.fontMetrics().axisHeight + lowerGroup.height + arrowBody.height + 0.111; - vlist = _buildCommon2.default.makeVList([{ type: "elem", elem: upperGroup, shift: upperShift }, { type: "elem", elem: arrowBody, shift: arrowShift }, { type: "elem", elem: lowerGroup, shift: lowerShift }], "individualShift", null, options); - } else { - vlist = _buildCommon2.default.makeVList([{ type: "elem", elem: upperGroup, shift: upperShift }, { type: "elem", elem: arrowBody, shift: arrowShift }], "individualShift", null, options); - } - - vlist.children[0].children[0].children[1].classes.push("svg-align"); - - return (0, _buildCommon.makeSpan)(["mrel", "x-arrow"], [vlist], options); - }; - - groupTypes.phantom = function (group, options) { - var elements = buildExpression(group.value.value, options.withPhantom(), false); - - // \phantom isn't supposed to affect the elements it contains. - // See "color" for more details. - return new _buildCommon2.default.makeFragment(elements); - }; - - groupTypes.mclass = function (group, options) { - var elements = buildExpression(group.value.value, options, true); - - return (0, _buildCommon.makeSpan)([group.value.mclass], elements, options); - }; - - /** - * buildGroup is the function that takes a group and calls the correct groupType - * function for it. It also handles the interaction of size and style changes - * between parents and children. - */ - var buildGroup = function buildGroup(group, options, baseOptions) { - if (!group) { - return (0, _buildCommon.makeSpan)(); - } - - if (groupTypes[group.type]) { - // Call the groupTypes function - var groupNode = groupTypes[group.type](group, options); - - // If the size changed between the parent and the current group, account - // for that size difference. - if (baseOptions && options.size !== baseOptions.size) { - groupNode = (0, _buildCommon.makeSpan)(options.sizingClasses(baseOptions), [groupNode], options); - - var multiplier = options.sizeMultiplier / baseOptions.sizeMultiplier; - - groupNode.height *= multiplier; - groupNode.depth *= multiplier; - } - - return groupNode; - } else { - throw new _ParseError2.default("Got group of unknown type: '" + group.type + "'"); - } - }; - - /** - * Take an entire parse tree, and build it into an appropriate set of HTML - * nodes. 
- */ - var buildHTML = function buildHTML(tree, options) { - // buildExpression is destructive, so we need to make a clone - // of the incoming tree so that it isn't accidentally changed - tree = JSON.parse((0, _stringify2.default)(tree)); - - // Build the expression contained in the tree - var expression = buildExpression(tree, options, true); - var body = (0, _buildCommon.makeSpan)(["base"], expression, options); - - // Add struts, which ensure that the top of the HTML element falls at the - // height of the expression, and the bottom of the HTML element falls at the - // depth of the expression. - var topStrut = (0, _buildCommon.makeSpan)(["strut"]); - var bottomStrut = (0, _buildCommon.makeSpan)(["strut", "bottom"]); - - topStrut.style.height = body.height + "em"; - bottomStrut.style.height = body.height + body.depth + "em"; - // We'd like to use `vertical-align: top` but in IE 9 this lowers the - // baseline of the box to the bottom of this strut (instead of staying in the - // normal place) so we use an absolute value for vertical-align instead - bottomStrut.style.verticalAlign = -body.depth + "em"; - - // Wrap the struts and body together - var htmlNode = (0, _buildCommon.makeSpan)(["katex-html"], [topStrut, bottomStrut, body]); - - htmlNode.setAttribute("aria-hidden", "true"); - - return htmlNode; - }; - - module.exports = buildHTML; - - },{"./ParseError":29,"./Style":33,"./buildCommon":34,"./delimiter":38,"./domTree":39,"./stretchy":47,"./units":50,"./utils":51,"babel-runtime/core-js/json/stringify":2}],36:[function(require,module,exports){ - - var _buildCommon = require("./buildCommon"); - - var _buildCommon2 = _interopRequireDefault(_buildCommon); - - var _fontMetrics = require("./fontMetrics"); - - var _fontMetrics2 = _interopRequireDefault(_fontMetrics); - - var _mathMLTree = require("./mathMLTree"); - - var _mathMLTree2 = _interopRequireDefault(_mathMLTree); - - var _ParseError = require("./ParseError"); - - var _ParseError2 = _interopRequireDefault(_ParseError); - - var _Style = require("./Style"); - - var _Style2 = _interopRequireDefault(_Style); - - var _symbols = require("./symbols"); - - var _symbols2 = _interopRequireDefault(_symbols); - - var _utils = require("./utils"); - - var _utils2 = _interopRequireDefault(_utils); - - var _stretchy = require("./stretchy"); - - var _stretchy2 = _interopRequireDefault(_stretchy); - - function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } - - /** - * Takes a symbol and converts it into a MathML text node after performing - * optional replacement from symbols.js. - */ - /** - * This file converts a parse tree into a corresponding MathML tree. The main - * entry point is the `buildMathML` function, which takes a parse tree from the - * parser. - */ - - var makeText = function makeText(text, mode) { - if (_symbols2.default[mode][text] && _symbols2.default[mode][text].replace) { - text = _symbols2.default[mode][text].replace; - } - - return new _mathMLTree2.default.TextNode(text); - }; - - /** - * Returns the math variant as a string or null if none is required.
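- * For example, \mathit always yields "italic", while a font command such - * as \mathbb yields its fontMap variant (e.g. "double-struck") when the - * character has metrics in that font, and null otherwise.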
- */ - var getVariant = function getVariant(group, options) { - var font = options.font; - if (!font) { - return null; - } - - var mode = group.mode; - if (font === "mathit") { - return "italic"; - } - - var value = group.value; - if (_utils2.default.contains(["\\imath", "\\jmath"], value)) { - return null; - } - - if (_symbols2.default[mode][value] && _symbols2.default[mode][value].replace) { - value = _symbols2.default[mode][value].replace; - } - - var fontName = _buildCommon.fontMap[font].fontName; - if (_fontMetrics2.default.getCharacterMetrics(value, fontName)) { - return _buildCommon.fontMap[options.font].variant; - } - - return null; - }; - - /** - * Functions for handling the different types of groups found in the parse - * tree. Each function should take a parse group and return a MathML node. - */ - var groupTypes = {}; - - var defaultVariant = { - "mi": "italic", - "mn": "normal", - "mtext": "normal" - }; - - groupTypes.mathord = function (group, options) { - var node = new _mathMLTree2.default.MathNode("mi", [makeText(group.value, group.mode)]); - - var variant = getVariant(group, options) || "italic"; - if (variant !== defaultVariant[node.type]) { - node.setAttribute("mathvariant", variant); - } - return node; - }; - - groupTypes.textord = function (group, options) { - var text = makeText(group.value, group.mode); - - var variant = getVariant(group, options) || "normal"; - - var node = void 0; - if (group.mode === 'text') { - node = new _mathMLTree2.default.MathNode("mtext", [text]); - } else if (/[0-9]/.test(group.value)) { - // TODO(kevinb) merge adjacent <mn> nodes - // do it as a post processing step - node = new _mathMLTree2.default.MathNode("mn", [text]); - } else if (group.value === "\\prime") { - node = new _mathMLTree2.default.MathNode("mo", [text]); - } else { - node = new _mathMLTree2.default.MathNode("mi", [text]); - } - if (variant !== defaultVariant[node.type]) { - node.setAttribute("mathvariant", variant); - } - - return node; - }; - - groupTypes.bin = function (group) { - var node = new _mathMLTree2.default.MathNode("mo", [makeText(group.value, group.mode)]); - - return node; - }; - - groupTypes.rel = function (group) { - var node = new _mathMLTree2.default.MathNode("mo", [makeText(group.value, group.mode)]); - - return node; - }; - - groupTypes.open = function (group) { - var node = new _mathMLTree2.default.MathNode("mo", [makeText(group.value, group.mode)]); - - return node; - }; - - groupTypes.close = function (group) { - var node = new _mathMLTree2.default.MathNode("mo", [makeText(group.value, group.mode)]); - - return node; - }; - - groupTypes.inner = function (group) { - var node = new _mathMLTree2.default.MathNode("mo", [makeText(group.value, group.mode)]); - - return node; - }; - - groupTypes.punct = function (group) { - var node = new _mathMLTree2.default.MathNode("mo", [makeText(group.value, group.mode)]); - - node.setAttribute("separator", "true"); - - return node; - }; - - groupTypes.ordgroup = function (group, options) { - var inner = buildExpression(group.value, options); - - var node = new _mathMLTree2.default.MathNode("mrow", inner); - - return node; - }; - - groupTypes.text = function (group, options) { - var body = group.value.body; - - // Convert each element of the body into MathML, and combine consecutive - // <mtext> outputs into a single <mtext> tag. In this way, we don't - // nest non-text items (e.g., $nested-math$) within an <mtext>.
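- // For instance, \text{hi} parses into one textord per character, so - // buildGroup yields mtext("h") then mtext("i"); the loop below folds the - // second node's children into the first, leaving a single mtext("hi").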
- var inner = []; - var currentText = null; - for (var i = 0; i < body.length; i++) { - var _group = buildGroup(body[i], options); - if (_group.type === 'mtext' && currentText != null) { - Array.prototype.push.apply(currentText.children, _group.children); - } else { - inner.push(_group); - if (_group.type === 'mtext') { - currentText = _group; - } - } - } - - // If there is a single tag in the end (presumably <mtext>), - // just return it. Otherwise, wrap them in an <mrow>. - if (inner.length === 1) { - return inner[0]; - } else { - return new _mathMLTree2.default.MathNode("mrow", inner); - } - }; - - groupTypes.color = function (group, options) { - var inner = buildExpression(group.value.value, options); - - var node = new _mathMLTree2.default.MathNode("mstyle", inner); - - node.setAttribute("mathcolor", group.value.color); - - return node; - }; - - groupTypes.supsub = function (group, options) { - // Is the inner group a relevant horizontal brace? - var isBrace = false; - var isOver = void 0; - var isSup = void 0; - if (group.value.base) { - if (group.value.base.value.type === "horizBrace") { - isSup = group.value.sup ? true : false; - if (isSup === group.value.base.value.isOver) { - isBrace = true; - isOver = group.value.base.value.isOver; - } - } - } - - var removeUnnecessaryRow = true; - var children = [buildGroup(group.value.base, options, removeUnnecessaryRow)]; - - if (group.value.sub) { - children.push(buildGroup(group.value.sub, options, removeUnnecessaryRow)); - } - - if (group.value.sup) { - children.push(buildGroup(group.value.sup, options, removeUnnecessaryRow)); - } - - var nodeType = void 0; - if (isBrace) { - nodeType = isOver ? "mover" : "munder"; - } else if (!group.value.sub) { - nodeType = "msup"; - } else if (!group.value.sup) { - nodeType = "msub"; - } else { - var base = group.value.base; - if (base && base.value.limits && options.style === _Style2.default.DISPLAY) { - nodeType = "munderover"; - } else { - nodeType = "msubsup"; - } - } - - var node = new _mathMLTree2.default.MathNode(nodeType, children); - - return node; - }; - - groupTypes.genfrac = function (group, options) { - var node = new _mathMLTree2.default.MathNode("mfrac", [buildGroup(group.value.numer, options), buildGroup(group.value.denom, options)]); - - if (!group.value.hasBarLine) { - node.setAttribute("linethickness", "0px"); - } - - if (group.value.leftDelim != null || group.value.rightDelim != null) { - var withDelims = []; - - if (group.value.leftDelim != null) { - var leftOp = new _mathMLTree2.default.MathNode("mo", [new _mathMLTree2.default.TextNode(group.value.leftDelim)]); - - leftOp.setAttribute("fence", "true"); - - withDelims.push(leftOp); - } - - withDelims.push(node); - - if (group.value.rightDelim != null) { - var rightOp = new _mathMLTree2.default.MathNode("mo", [new _mathMLTree2.default.TextNode(group.value.rightDelim)]); - - rightOp.setAttribute("fence", "true"); - - withDelims.push(rightOp); - } - - var outerNode = new _mathMLTree2.default.MathNode("mrow", withDelims); - - return outerNode; - } - - return node; - }; - - groupTypes.array = function (group, options) { - return new _mathMLTree2.default.MathNode("mtable", group.value.body.map(function (row) { - return new _mathMLTree2.default.MathNode("mtr", row.map(function (cell) { - return new _mathMLTree2.default.MathNode("mtd", [buildGroup(cell, options)]); - })); - })); - }; - - groupTypes.sqrt = function (group, options) { - var node = void 0; - if (group.value.index) { - node = new _mathMLTree2.default.MathNode("mroot",
[buildGroup(group.value.body, options), buildGroup(group.value.index, options)]); - } else { - node = new _mathMLTree2.default.MathNode("msqrt", [buildGroup(group.value.body, options)]); - } - - return node; - }; - - groupTypes.leftright = function (group, options) { - var inner = buildExpression(group.value.body, options); - - if (group.value.left !== ".") { - var leftNode = new _mathMLTree2.default.MathNode("mo", [makeText(group.value.left, group.mode)]); - - leftNode.setAttribute("fence", "true"); - - inner.unshift(leftNode); - } - - if (group.value.right !== ".") { - var rightNode = new _mathMLTree2.default.MathNode("mo", [makeText(group.value.right, group.mode)]); - - rightNode.setAttribute("fence", "true"); - - inner.push(rightNode); - } - - var outerNode = new _mathMLTree2.default.MathNode("mrow", inner); - - return outerNode; - }; - - groupTypes.middle = function (group, options) { - var middleNode = new _mathMLTree2.default.MathNode("mo", [makeText(group.value.middle, group.mode)]); - middleNode.setAttribute("fence", "true"); - return middleNode; - }; - - groupTypes.accent = function (group, options) { - var accentNode = void 0; - if (group.value.isStretchy) { - accentNode = _stretchy2.default.mathMLnode(group.value.label); - } else { - accentNode = new _mathMLTree2.default.MathNode("mo", [makeText(group.value.label, group.mode)]); - } - - var node = new _mathMLTree2.default.MathNode("mover", [buildGroup(group.value.base, options), accentNode]); - - node.setAttribute("accent", "true"); - - return node; - }; - - groupTypes.spacing = function (group) { - var node = void 0; - - if (group.value === "\\ " || group.value === "\\space" || group.value === " " || group.value === "~") { - node = new _mathMLTree2.default.MathNode("mtext", [new _mathMLTree2.default.TextNode("\xA0")]); - } else { - node = new _mathMLTree2.default.MathNode("mspace"); - - node.setAttribute("width", _buildCommon2.default.spacingFunctions[group.value].size); - } - - return node; - }; - - groupTypes.op = function (group, options) { - var node = void 0; - - // TODO(emily): handle big operators using the `largeop` attribute - - if (group.value.symbol) { - // This is a symbol. Just add the symbol. - node = new _mathMLTree2.default.MathNode("mo", [makeText(group.value.body, group.mode)]); - } else if (group.value.value) { - // This is an operator with children. Add them. - node = new _mathMLTree2.default.MathNode("mo", buildExpression(group.value.value, options)); - } else { - // This is a text operator. Add all of the characters from the - // operator's name. - // TODO(emily): Add a space in the middle of some of these - // operators, like \limsup. 
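- // e.g. for \limsup, group.value.body is "\\limsup", and slice(1) strips - // the leading backslash so the operator renders as "limsup".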
- node = new _mathMLTree2.default.MathNode("mi", [new _mathMLTree2.default.TextNode(group.value.body.slice(1))]); - } - - return node; - }; - - groupTypes.mod = function (group, options) { - var inner = []; - - if (group.value.modType === "pod" || group.value.modType === "pmod") { - inner.push(new _mathMLTree2.default.MathNode("mo", [makeText("(", group.mode)])); - } - if (group.value.modType !== "pod") { - inner.push(new _mathMLTree2.default.MathNode("mo", [makeText("mod", group.mode)])); - } - if (group.value.value) { - var space = new _mathMLTree2.default.MathNode("mspace"); - space.setAttribute("width", "0.333333em"); - inner.push(space); - inner = inner.concat(buildExpression(group.value.value, options)); - } - if (group.value.modType === "pod" || group.value.modType === "pmod") { - inner.push(new _mathMLTree2.default.MathNode("mo", [makeText(")", group.mode)])); - } - - return new _mathMLTree2.default.MathNode("mo", inner); - }; - - groupTypes.katex = function (group) { - var node = new _mathMLTree2.default.MathNode("mtext", [new _mathMLTree2.default.TextNode("KaTeX")]); - - return node; - }; - - groupTypes.font = function (group, options) { - var font = group.value.font; - return buildGroup(group.value.body, options.withFont(font)); - }; - - groupTypes.delimsizing = function (group) { - var children = []; - - if (group.value.value !== ".") { - children.push(makeText(group.value.value, group.mode)); - } - - var node = new _mathMLTree2.default.MathNode("mo", children); - - if (group.value.mclass === "mopen" || group.value.mclass === "mclose") { - // Only some of the delimsizing functions act as fences, and they - // return "mopen" or "mclose" mclass. - node.setAttribute("fence", "true"); - } else { - // Explicitly disable fencing if it's not a fence, to override the - // defaults. - node.setAttribute("fence", "false"); - } - - return node; - }; - - groupTypes.styling = function (group, options) { - // Figure out what style we're changing to. - // TODO(kevinb): dedupe this with buildHTML.js - // This will be easier if handling of styling nodes is in the same file. - var styleMap = { - "display": _Style2.default.DISPLAY, - "text": _Style2.default.TEXT, - "script": _Style2.default.SCRIPT, - "scriptscript": _Style2.default.SCRIPTSCRIPT - }; - - var newStyle = styleMap[group.value.style]; - var newOptions = options.havingStyle(newStyle); - - var inner = buildExpression(group.value.value, newOptions); - - var node = new _mathMLTree2.default.MathNode("mstyle", inner); - - var styleAttributes = { - "display": ["0", "true"], - "text": ["0", "false"], - "script": ["1", "false"], - "scriptscript": ["2", "false"] - }; - - var attr = styleAttributes[group.value.style]; - - node.setAttribute("scriptlevel", attr[0]); - node.setAttribute("displaystyle", attr[1]); - - return node; - }; - - groupTypes.sizing = function (group, options) { - var newOptions = options.havingSize(group.value.size); - var inner = buildExpression(group.value.value, newOptions); - - var node = new _mathMLTree2.default.MathNode("mstyle", inner); - - // TODO(emily): This doesn't produce the correct size for nested size - // changes, because we don't keep state of what style we're currently - // in, so we can't reset the size to normal before changing it. Now - // that we're passing an options parameter we should be able to fix - // this.
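- // Example: \Huge corresponds to KaTeX size 10, whose sizeMultiplier is - // 2.488, so the attribute below becomes mathsize="2.488em" (subject to - // the nesting caveat in the TODO above).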
- node.setAttribute("mathsize", newOptions.sizeMultiplier + "em"); - - return node; - }; - - groupTypes.overline = function (group, options) { - var operator = new _mathMLTree2.default.MathNode("mo", [new _mathMLTree2.default.TextNode("\u203E")]); - operator.setAttribute("stretchy", "true"); - - var node = new _mathMLTree2.default.MathNode("mover", [buildGroup(group.value.body, options), operator]); - node.setAttribute("accent", "true"); - - return node; - }; - - groupTypes.underline = function (group, options) { - var operator = new _mathMLTree2.default.MathNode("mo", [new _mathMLTree2.default.TextNode("\u203E")]); - operator.setAttribute("stretchy", "true"); - - var node = new _mathMLTree2.default.MathNode("munder", [buildGroup(group.value.body, options), operator]); - node.setAttribute("accentunder", "true"); - - return node; - }; - - groupTypes.accentUnder = function (group, options) { - var accentNode = _stretchy2.default.mathMLnode(group.value.label); - var node = new _mathMLTree2.default.MathNode("munder", [buildGroup(group.value.body, options), accentNode]); - node.setAttribute("accentunder", "true"); - return node; - }; - - groupTypes.enclose = function (group, options) { - var node = new _mathMLTree2.default.MathNode("menclose", [buildGroup(group.value.body, options)]); - var notation = ""; - switch (group.value.label) { - case "\\bcancel": - notation = "downdiagonalstrike"; - break; - case "\\sout": - notation = "horizontalstrike"; - break; - case "\\fbox": - notation = "box"; - break; - default: - notation = "updiagonalstrike"; - } - node.setAttribute("notation", notation); - return node; - }; - - groupTypes.horizBrace = function (group, options) { - var accentNode = _stretchy2.default.mathMLnode(group.value.label); - return new _mathMLTree2.default.MathNode(group.value.isOver ? "mover" : "munder", [buildGroup(group.value.base, options), accentNode]); - }; - - groupTypes.xArrow = function (group, options) { - var arrowNode = _stretchy2.default.mathMLnode(group.value.label); - var node = void 0; - var lowerNode = void 0; - - if (group.value.body) { - var upperNode = buildGroup(group.value.body, options); - if (group.value.below) { - lowerNode = buildGroup(group.value.below, options); - node = new _mathMLTree2.default.MathNode("munderover", [arrowNode, lowerNode, upperNode]); - } else { - node = new _mathMLTree2.default.MathNode("mover", [arrowNode, upperNode]); - } - } else if (group.value.below) { - lowerNode = buildGroup(group.value.below, options); - node = new _mathMLTree2.default.MathNode("munder", [arrowNode, lowerNode]); - } else { - node = new _mathMLTree2.default.MathNode("mover", [arrowNode]); - } - return node; - }; - - groupTypes.rule = function (group) { - // TODO(emily): Figure out if there's an actual way to draw black boxes - // in MathML. 
- var node = new _mathMLTree2.default.MathNode("mrow"); - - return node; - }; - - groupTypes.kern = function (group) { - // TODO(kevin): Figure out if there's a way to add space in MathML - var node = new _mathMLTree2.default.MathNode("mrow"); - - return node; - }; - - groupTypes.llap = function (group, options) { - var node = new _mathMLTree2.default.MathNode("mpadded", [buildGroup(group.value.body, options)]); - - node.setAttribute("lspace", "-1width"); - node.setAttribute("width", "0px"); - - return node; - }; - - groupTypes.rlap = function (group, options) { - var node = new _mathMLTree2.default.MathNode("mpadded", [buildGroup(group.value.body, options)]); - - node.setAttribute("width", "0px"); - - return node; - }; - - groupTypes.phantom = function (group, options) { - var inner = buildExpression(group.value.value, options); - return new _mathMLTree2.default.MathNode("mphantom", inner); - }; - - groupTypes.mclass = function (group, options) { - var inner = buildExpression(group.value.value, options); - return new _mathMLTree2.default.MathNode("mstyle", inner); - }; - - /** - * Takes a list of nodes, builds them, and returns a list of the generated - * MathML nodes. A little simpler than the HTML version because we don't do any - * previous-node handling. - */ - var buildExpression = function buildExpression(expression, options) { - var groups = []; - for (var i = 0; i < expression.length; i++) { - var group = expression[i]; - groups.push(buildGroup(group, options)); - } - - // TODO(kevinb): combine \\not with mrels and mords - - return groups; - }; - - /** - * Takes a group from the parser and calls the appropriate groupTypes function - * on it to produce a MathML node. - */ - // TODO(kevinb): determine if removeUnnecessaryRow should always be true - var buildGroup = function buildGroup(group, options) { - var removeUnnecessaryRow = arguments.length > 2 && arguments[2] !== undefined ? arguments[2] : false; - - if (!group) { - return new _mathMLTree2.default.MathNode("mrow"); - } - - if (groupTypes[group.type]) { - // Call the groupTypes function - var result = groupTypes[group.type](group, options); - if (removeUnnecessaryRow) { - if (result.type === "mrow" && result.children.length === 1) { - return result.children[0]; - } - } - return result; - } else { - throw new _ParseError2.default("Got group of unknown type: '" + group.type + "'"); - } - }; - - /** - * Takes a full parse tree and settings and builds a MathML representation of - * it. In particular, we put the elements from building the parse tree into a - * <semantics> tag so we can also include that TeX source as an annotation. - * - * Note that we actually return a domTree element with a `<math>` inside it so - * we can do appropriate styling. - */ - var buildMathML = function buildMathML(tree, texExpression, options) { - var expression = buildExpression(tree, options); - - // Wrap up the expression in an mrow so it is presented in the semantics - // tag correctly. - var wrapper = new _mathMLTree2.default.MathNode("mrow", expression); - - // Build a TeX annotation of the source - var annotation = new _mathMLTree2.default.MathNode("annotation", [new _mathMLTree2.default.TextNode(texExpression)]); - - annotation.setAttribute("encoding", "application/x-tex"); - - var semantics = new _mathMLTree2.default.MathNode("semantics", [wrapper, annotation]); - - var math = new _mathMLTree2.default.MathNode("math", [semantics]); - - // You can't style <math> nodes, so we wrap the node in a span.
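- // The wrapper built below therefore has the shape: - // <span class="katex-mathml"><math><semantics> - // <mrow>...rendered expression...</mrow> - // <annotation encoding="application/x-tex">...TeX source...</annotation> - // </semantics></math></span>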
- return (0, _buildCommon.makeSpan)(["katex-mathml"], [math]); - }; - - module.exports = buildMathML; - - },{"./ParseError":29,"./Style":33,"./buildCommon":34,"./fontMetrics":41,"./mathMLTree":45,"./stretchy":47,"./symbols":48,"./utils":51}],37:[function(require,module,exports){ - - var _buildHTML = require("./buildHTML"); - - var _buildHTML2 = _interopRequireDefault(_buildHTML); - - var _buildMathML = require("./buildMathML"); - - var _buildMathML2 = _interopRequireDefault(_buildMathML); - - var _buildCommon = require("./buildCommon"); - - var _Options = require("./Options"); - - var _Options2 = _interopRequireDefault(_Options); - - var _Settings = require("./Settings"); - - var _Settings2 = _interopRequireDefault(_Settings); - - var _Style = require("./Style"); - - var _Style2 = _interopRequireDefault(_Style); - - function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } - - var buildTree = function buildTree(tree, expression, settings) { - settings = settings || new _Settings2.default({}); - - var startStyle = _Style2.default.TEXT; - if (settings.displayMode) { - startStyle = _Style2.default.DISPLAY; - } - - // Setup the default options - var options = new _Options2.default({ - style: startStyle - }); - - // `buildHTML` sometimes messes with the parse tree (like turning bins -> - // ords), so we build the MathML version first. - var mathMLNode = (0, _buildMathML2.default)(tree, expression, options); - var htmlNode = (0, _buildHTML2.default)(tree, options); - - var katexNode = (0, _buildCommon.makeSpan)(["katex"], [mathMLNode, htmlNode]); - - if (settings.displayMode) { - return (0, _buildCommon.makeSpan)(["katex-display"], [katexNode]); - } else { - return katexNode; - } - }; - - module.exports = buildTree; - - },{"./Options":28,"./Settings":32,"./Style":33,"./buildCommon":34,"./buildHTML":35,"./buildMathML":36}],38:[function(require,module,exports){ - - var _ParseError = require("./ParseError"); - - var _ParseError2 = _interopRequireDefault(_ParseError); - - var _Style = require("./Style"); - - var _Style2 = _interopRequireDefault(_Style); - - var _buildCommon = require("./buildCommon"); - - var _buildCommon2 = _interopRequireDefault(_buildCommon); - - var _fontMetrics = require("./fontMetrics"); - - var _fontMetrics2 = _interopRequireDefault(_fontMetrics); - - var _symbols = require("./symbols"); - - var _symbols2 = _interopRequireDefault(_symbols); - - var _utils = require("./utils"); - - var _utils2 = _interopRequireDefault(_utils); - - function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } - - /** - * Get the metrics for a given symbol and font, after transformation (i.e. - * after following replacement from symbols.js) - */ - /** - * This file deals with creating delimiters of various sizes. The TeXbook - * discusses these routines on page 441-442, in the "Another subroutine sets box - * x to a specified variable delimiter" paragraph. - * - * There are three main routines here. `makeSmallDelim` makes a delimiter in the - * normal font, but in either text, script, or scriptscript style. - * `makeLargeDelim` makes a delimiter in textstyle, but in one of the Size1, - * Size2, Size3, or Size4 fonts. `makeStackedDelim` makes a delimiter out of - * smaller pieces that are stacked on top of one another. - * - * The functions take a parameter `center`, which determines if the delimiter - * should be centered around the axis. - * - * Then, there are three exposed functions. 
`sizedDelim` makes a delimiter in - * one of the given sizes. This is used for things like `\bigl`. - * `customSizedDelim` makes a delimiter with a given total height+depth. It is - * called in places like `\sqrt`. `leftRightDelim` makes an appropriate - * delimiter which surrounds an expression of a given height and depth. It is - * used in `\left` and `\right`. - */ - - var getMetrics = function getMetrics(symbol, font) { - if (_symbols2.default.math[symbol] && _symbols2.default.math[symbol].replace) { - return _fontMetrics2.default.getCharacterMetrics(_symbols2.default.math[symbol].replace, font); - } else { - return _fontMetrics2.default.getCharacterMetrics(symbol, font); - } - }; - - /** - * Puts a delimiter span in a given style, and adds appropriate height, depth, - * and maxFontSizes. - */ - var styleWrap = function styleWrap(delim, toStyle, options, classes) { - var newOptions = options.havingBaseStyle(toStyle); - - var span = (0, _buildCommon.makeSpan)((classes || []).concat(newOptions.sizingClasses(options)), [delim], options); - - span.delimSizeMultiplier = newOptions.sizeMultiplier / options.sizeMultiplier; - span.height *= span.delimSizeMultiplier; - span.depth *= span.delimSizeMultiplier; - span.maxFontSize = newOptions.sizeMultiplier; - - return span; - }; - - var centerSpan = function centerSpan(span, options, style) { - var newOptions = options.havingBaseStyle(style); - var shift = (1 - options.sizeMultiplier / newOptions.sizeMultiplier) * options.fontMetrics().axisHeight; - - span.classes.push("delimcenter"); - span.style.top = shift + "em"; - span.height -= shift; - span.depth += shift; - }; - - /** - * Makes a small delimiter. This is a delimiter that comes in the Main-Regular - * font, but is restyled to either be in textstyle, scriptstyle, or - * scriptscriptstyle. - */ - var makeSmallDelim = function makeSmallDelim(delim, style, center, options, mode, classes) { - var text = _buildCommon2.default.makeSymbol(delim, "Main-Regular", mode, options); - var span = styleWrap(text, style, options, classes); - if (center) { - centerSpan(span, options, style); - } - return span; - }; - - /** - * Builds a symbol in the given font size (note size is an integer) - */ - var mathrmSize = function mathrmSize(value, size, mode, options) { - return _buildCommon2.default.makeSymbol(value, "Size" + size + "-Regular", mode, options); - }; - - /** - * Makes a large delimiter. This is a delimiter that comes in the Size1, Size2, - * Size3, or Size4 fonts. It is always rendered in textstyle. - */ - var makeLargeDelim = function makeLargeDelim(delim, size, center, options, mode, classes) { - var inner = mathrmSize(delim, size, mode, options); - var span = styleWrap((0, _buildCommon.makeSpan)(["delimsizing", "size" + size], [inner], options), _Style2.default.TEXT, options, classes); - if (center) { - centerSpan(span, options, _Style2.default.TEXT); - } - return span; - }; - - /** - * Make an inner span with the given offset and in the given font. This is used - * in `makeStackedDelim` to make the stacking pieces for the delimiter. - */ - var makeInner = function makeInner(symbol, font, mode) { - var sizeClass = void 0; - // Apply the correct CSS class to choose the right font.
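- // Only Size1-Regular and Size4-Regular can reach this helper: - // makeStackedDelim below assembles every stacked delimiter from just - // those two fonts.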
- if (font === "Size1-Regular") { - sizeClass = "delim-size1"; - } else if (font === "Size4-Regular") { - sizeClass = "delim-size4"; - } - - var inner = (0, _buildCommon.makeSpan)(["delimsizinginner", sizeClass], [(0, _buildCommon.makeSpan)([], [_buildCommon2.default.makeSymbol(symbol, font, mode)])]); - - // Since this will be passed into `makeVList` in the end, wrap the element - // in the appropriate tag that VList uses. - return { type: "elem", elem: inner }; - }; - - /** - * Make a stacked delimiter out of a given delimiter, with the total height at - * least `heightTotal`. This routine is mentioned on page 442 of the TeXbook. - */ - var makeStackedDelim = function makeStackedDelim(delim, heightTotal, center, options, mode, classes) { - // There are four parts, the top, an optional middle, a repeated part, and a - // bottom. - var top = void 0; - var middle = void 0; - var repeat = void 0; - var bottom = void 0; - top = repeat = bottom = delim; - middle = null; - // Also keep track of what font the delimiters are in - var font = "Size1-Regular"; - - // We set the parts and font based on the symbol. Note that we use - // '\u23d0' instead of '|' and '\u2016' instead of '\\|' for the - // repeats of the arrows - if (delim === "\\uparrow") { - repeat = bottom = "\u23D0"; - } else if (delim === "\\Uparrow") { - repeat = bottom = "\u2016"; - } else if (delim === "\\downarrow") { - top = repeat = "\u23D0"; - } else if (delim === "\\Downarrow") { - top = repeat = "\u2016"; - } else if (delim === "\\updownarrow") { - top = "\\uparrow"; - repeat = "\u23D0"; - bottom = "\\downarrow"; - } else if (delim === "\\Updownarrow") { - top = "\\Uparrow"; - repeat = "\u2016"; - bottom = "\\Downarrow"; - } else if (delim === "[" || delim === "\\lbrack") { - top = "\u23A1"; - repeat = "\u23A2"; - bottom = "\u23A3"; - font = "Size4-Regular"; - } else if (delim === "]" || delim === "\\rbrack") { - top = "\u23A4"; - repeat = "\u23A5"; - bottom = "\u23A6"; - font = "Size4-Regular"; - } else if (delim === "\\lfloor") { - repeat = top = "\u23A2"; - bottom = "\u23A3"; - font = "Size4-Regular"; - } else if (delim === "\\lceil") { - top = "\u23A1"; - repeat = bottom = "\u23A2"; - font = "Size4-Regular"; - } else if (delim === "\\rfloor") { - repeat = top = "\u23A5"; - bottom = "\u23A6"; - font = "Size4-Regular"; - } else if (delim === "\\rceil") { - top = "\u23A4"; - repeat = bottom = "\u23A5"; - font = "Size4-Regular"; - } else if (delim === "(") { - top = "\u239B"; - repeat = "\u239C"; - bottom = "\u239D"; - font = "Size4-Regular"; - } else if (delim === ")") { - top = "\u239E"; - repeat = "\u239F"; - bottom = "\u23A0"; - font = "Size4-Regular"; - } else if (delim === "\\{" || delim === "\\lbrace") { - top = "\u23A7"; - middle = "\u23A8"; - bottom = "\u23A9"; - repeat = "\u23AA"; - font = "Size4-Regular"; - } else if (delim === "\\}" || delim === "\\rbrace") { - top = "\u23AB"; - middle = "\u23AC"; - bottom = "\u23AD"; - repeat = "\u23AA"; - font = "Size4-Regular"; - } else if (delim === "\\lgroup") { - top = "\u23A7"; - bottom = "\u23A9"; - repeat = "\u23AA"; - font = "Size4-Regular"; - } else if (delim === "\\rgroup") { - top = "\u23AB"; - bottom = "\u23AD"; - repeat = "\u23AA"; - font = "Size4-Regular"; - } else if (delim === "\\lmoustache") { - top = "\u23A7"; - bottom = "\u23AD"; - repeat = "\u23AA"; - font = "Size4-Regular"; - } else if (delim === "\\rmoustache") { - top = "\u23AB"; - bottom = "\u23A9"; - repeat = "\u23AA"; - font = "Size4-Regular"; - } - - // Get the metrics of the four sections - var 
topMetrics = getMetrics(top, font); - var topHeightTotal = topMetrics.height + topMetrics.depth; - var repeatMetrics = getMetrics(repeat, font); - var repeatHeightTotal = repeatMetrics.height + repeatMetrics.depth; - var bottomMetrics = getMetrics(bottom, font); - var bottomHeightTotal = bottomMetrics.height + bottomMetrics.depth; - var middleHeightTotal = 0; - var middleFactor = 1; - if (middle !== null) { - var middleMetrics = getMetrics(middle, font); - middleHeightTotal = middleMetrics.height + middleMetrics.depth; - middleFactor = 2; // repeat symmetrically above and below middle - } - - // Calculate the minimal height that the delimiter can have. - // It is at least the size of the top, bottom, and optional middle combined. - var minHeight = topHeightTotal + bottomHeightTotal + middleHeightTotal; - - // Compute the number of copies of the repeat symbol we will need - var repeatCount = Math.ceil((heightTotal - minHeight) / (middleFactor * repeatHeightTotal)); - - // Compute the total height of the delimiter including all the symbols - var realHeightTotal = minHeight + repeatCount * middleFactor * repeatHeightTotal; - - // The center of the delimiter is placed at the center of the axis. Note - // that in this context, "center" means that the delimiter should be - // centered around the axis in the current style, while normally it is - // centered around the axis in textstyle. - var axisHeight = options.fontMetrics().axisHeight; - if (center) { - axisHeight *= options.sizeMultiplier; - } - // Calculate the depth - var depth = realHeightTotal / 2 - axisHeight; - - // Now, we start building the pieces that will go into the vlist - - // Keep a list of the inner pieces - var inners = []; - - // Add the bottom symbol - inners.push(makeInner(bottom, font, mode)); - - if (middle === null) { - // Add that many symbols - for (var i = 0; i < repeatCount; i++) { - inners.push(makeInner(repeat, font, mode)); - } - } else { - // When there is a middle bit, we need the middle part and two repeated - // sections - for (var _i = 0; _i < repeatCount; _i++) { - inners.push(makeInner(repeat, font, mode)); - } - inners.push(makeInner(middle, font, mode)); - for (var _i2 = 0; _i2 < repeatCount; _i2++) { - inners.push(makeInner(repeat, font, mode)); - } - } - - // Add the top symbol - inners.push(makeInner(top, font, mode)); - - // Finally, build the vlist - var newOptions = options.havingBaseStyle(_Style2.default.TEXT); - var inner = _buildCommon2.default.makeVList(inners, "bottom", depth, newOptions); - - return styleWrap((0, _buildCommon.makeSpan)(["delimsizing", "mult"], [inner], newOptions), _Style2.default.TEXT, options, classes); - }; - - var sqrtInnerSVG = { - // The main path geometry is from glyph U221A in the font KaTeX Main - main: "", - - // size1 is from glyph U221A in the font KaTeX_Size1-Regular - 1: "", - - // size2 is from glyph U221A in the font KaTeX_Size2-Regular - 2: "", - - // size3 is from glyph U221A in the font KaTeX_Size3-Regular - 3: "", - - // size4 is from glyph U221A in the font KaTeX_Size4-Regular - 4: "", - - // tall is from glyph U23B7 in the font KaTeX_Size4-Regular - tall: "l-4 4-4 4c-.667.667-2 1.5-4 2.5s-4.167 1.833-6.5 2.5-5.5 1-9.5 1h\n-12l-28-84c-16.667-52-96.667 -294.333-240-727l-212 -643 -85 170c-4-3.333-8.333\n-7.667-13 -13l-13-13l77-155 77-156c66 199.333 139 419.667 219 661 l218 661z\nM702 0H400000v40H742z'/>" - }; - - var sqrtSpan = function sqrtSpan(height, delim, options) { - // Create a span containing an SVG image of a sqrt symbol.
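- // (Worked example for makeStackedDelim above, with made-up metrics: a - // request for heightTotal = 3em where top plus bottom total 1.8em and the - // repeat piece is 0.6em gives repeatCount = ceil((3 - 1.8) / 0.6) = 2, - // hence realHeightTotal = 1.8 + 2 * 0.6 = 3em exactly.)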
- var span = _buildCommon2.default.makeSpan([], [], options); - var sizeMultiplier = options.sizeMultiplier; // default - - if (delim.type === "small") { - // Get an SVG that is derived from glyph U+221A in font KaTeX-Main. - var newOptions = options.havingBaseStyle(delim.style); - sizeMultiplier = newOptions.sizeMultiplier / options.sizeMultiplier; - - span.height = 1 * sizeMultiplier; - span.style.height = span.height + "em"; - span.surdWidth = 0.833 * sizeMultiplier; // from the font. - //In the font, the glyph is 1000 units tall. The font scale is 1:1000. - - span.innerHTML = "\n " + sqrtInnerSVG['main'] + ""; - } else if (delim.type === "large") { - // These SVGs come from fonts: KaTeX_Size1, _Size2, etc. - // Get sqrt height from font data - span.height = sizeToMaxHeight[delim.size] / sizeMultiplier; - span.style.height = span.height + "em"; - span.surdWidth = 1.0 / sizeMultiplier; // from the font - - span.innerHTML = "\n " + sqrtInnerSVG[delim.size] + ""; - } else { - // Tall sqrt. In TeX, this would be stacked using multiple glyphs. - // We'll use a single SVG to accomplish the same thing. - span.height = height / sizeMultiplier; - span.style.height = span.height + "em"; - span.surdWidth = 1.056 / sizeMultiplier; - var viewBoxHeight = Math.floor(span.height * 1000); // scale = 1:1000 - var vertSegment = viewBoxHeight - 54; - - // This \sqrt is customized in both height and width. We set the - // height now. Then CSS will stretch the image to the correct width. - // This SVG path comes from glyph U+23B7, font KaTeX_Size4-Regular. - span.innerHTML = "\n \n "; - } - - span.sizeMultiplier = sizeMultiplier; - - return span; - }; - - // There are three kinds of delimiters, delimiters that stack when they become - // too large - var stackLargeDelimiters = ["(", ")", "[", "\\lbrack", "]", "\\rbrack", "\\{", "\\lbrace", "\\}", "\\rbrace", "\\lfloor", "\\rfloor", "\\lceil", "\\rceil", "\\surd"]; - - // delimiters that always stack - var stackAlwaysDelimiters = ["\\uparrow", "\\downarrow", "\\updownarrow", "\\Uparrow", "\\Downarrow", "\\Updownarrow", "|", "\\|", "\\vert", "\\Vert", "\\lvert", "\\rvert", "\\lVert", "\\rVert", "\\lgroup", "\\rgroup", "\\lmoustache", "\\rmoustache"]; - - // and delimiters that never stack - var stackNeverDelimiters = ["<", ">", "\\langle", "\\rangle", "/", "\\backslash", "\\lt", "\\gt"]; - - // Metrics of the different sizes. Found by looking at TeX's output of - // $\bigl| // \Bigl| \biggl| \Biggl| \showlists$ - // Used to create stacked delimiters of appropriate sizes in makeSizedDelim. - var sizeToMaxHeight = [0, 1.2, 1.8, 2.4, 3.0]; - - /** - * Used to create a delimiter of a specific size, where `size` is 1, 2, 3, or 4. - */ - var makeSizedDelim = function makeSizedDelim(delim, size, options, mode, classes) { - // < and > turn into \langle and \rangle in delimiters - if (delim === "<" || delim === "\\lt") { - delim = "\\langle"; - } else if (delim === ">" || delim === "\\gt") { - delim = "\\rangle"; - } - - // Sized delimiters are never centered. 
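- // e.g. \bigl( arrives here as makeSizedDelim("(", 1, ...): "(" is a - // stack-large delimiter, so it takes the makeLargeDelim branch and is - // drawn from the Size1 font, while an always-stacked delimiter such as - // \uparrow is instead stacked up to sizeToMaxHeight[1] = 1.2em.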
- if (_utils2.default.contains(stackLargeDelimiters, delim) || _utils2.default.contains(stackNeverDelimiters, delim)) { - return makeLargeDelim(delim, size, false, options, mode, classes); - } else if (_utils2.default.contains(stackAlwaysDelimiters, delim)) { - return makeStackedDelim(delim, sizeToMaxHeight[size], false, options, mode, classes); - } else { - throw new _ParseError2.default("Illegal delimiter: '" + delim + "'"); - } - }; - - /** - * There are three different sequences of delimiter sizes that the delimiters - * follow depending on the kind of delimiter. This is used when creating custom - * sized delimiters to decide whether to create a small, large, or stacked - * delimiter. - * - * In real TeX, these sequences aren't explicitly defined, but are instead - * defined inside the font metrics. Since there are only three sequences that - * are possible for the delimiters that TeX defines, it is easier to just encode - * them explicitly here. - */ - - // Delimiters that never stack try small delimiters and large delimiters only - var stackNeverDelimiterSequence = [{ type: "small", style: _Style2.default.SCRIPTSCRIPT }, { type: "small", style: _Style2.default.SCRIPT }, { type: "small", style: _Style2.default.TEXT }, { type: "large", size: 1 }, { type: "large", size: 2 }, { type: "large", size: 3 }, { type: "large", size: 4 }]; - - // Delimiters that always stack try the small delimiters first, then stack - var stackAlwaysDelimiterSequence = [{ type: "small", style: _Style2.default.SCRIPTSCRIPT }, { type: "small", style: _Style2.default.SCRIPT }, { type: "small", style: _Style2.default.TEXT }, { type: "stack" }]; - - // Delimiters that stack when large try the small and then large delimiters, and - // stack afterwards - var stackLargeDelimiterSequence = [{ type: "small", style: _Style2.default.SCRIPTSCRIPT }, { type: "small", style: _Style2.default.SCRIPT }, { type: "small", style: _Style2.default.TEXT }, { type: "large", size: 1 }, { type: "large", size: 2 }, { type: "large", size: 3 }, { type: "large", size: 4 }, { type: "stack" }]; - - /** - * Get the font used in a delimiter based on what kind of delimiter it is. - */ - var delimTypeToFont = function delimTypeToFont(type) { - if (type.type === "small") { - return "Main-Regular"; - } else if (type.type === "large") { - return "Size" + type.size + "-Regular"; - } else if (type.type === "stack") { - return "Size4-Regular"; - } - }; - - /** - * Traverse a sequence of types of delimiters to decide what kind of delimiter - * should be used to create a delimiter of the given height+depth. - */ - var traverseSequence = function traverseSequence(delim, height, sequence, options) { - // Here, we choose the index we should start at in the sequences. In smaller - // sizes (which correspond to larger numbers in style.size) we start earlier - // in the sequence. Thus, scriptscript starts at index 3-3=0, script starts - // at index 3-2=1, text starts at 3-1=2, and display starts at min(2,3-0)=2 - var start = Math.min(2, 3 - options.style.size); - for (var i = start; i < sequence.length; i++) { - if (sequence[i].type === "stack") { - // This is always the last delimiter, so we just break the loop now. - break; - } - - var metrics = getMetrics(delim, delimTypeToFont(sequence[i])); - var heightDepth = metrics.height + metrics.depth; - - // Small delimiters are scaled down versions of the same font, so we - // account for the style change size. 
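- // Illustrative walk-through (metrics made up for the example): a request - // for height 1.5em against stackLargeDelimiterSequence in display style - // starts at index 2 ("small" in TEXT style); if that glyph's height+depth - // is about 1.0em it is rejected, size 1 (about 1.2em) is rejected, and - // size 2 (about 1.8em, which exceeds 1.5em) is returned.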
- - if (sequence[i].type === "small") { - var newOptions = options.havingBaseStyle(sequence[i].style); - heightDepth *= newOptions.sizeMultiplier; - } - - // Check if the delimiter at this size works for the given height. - if (heightDepth > height) { - return sequence[i]; - } - } - - // If we reached the end of the sequence, return the last sequence element. - return sequence[sequence.length - 1]; - }; - - /** - * Make a delimiter of a given height+depth, with optional centering. Here, we - * traverse the sequences, and create a delimiter that the sequence tells us to. - */ - var makeCustomSizedDelim = function makeCustomSizedDelim(delim, height, center, options, mode, classes) { - if (delim === "<" || delim === "\\lt") { - delim = "\\langle"; - } else if (delim === ">" || delim === "\\gt") { - delim = "\\rangle"; - } - - // Decide what sequence to use - var sequence = void 0; - if (_utils2.default.contains(stackNeverDelimiters, delim)) { - sequence = stackNeverDelimiterSequence; - } else if (_utils2.default.contains(stackLargeDelimiters, delim)) { - sequence = stackLargeDelimiterSequence; - } else { - sequence = stackAlwaysDelimiterSequence; - } - - // Look through the sequence - var delimType = traverseSequence(delim, height, sequence, options); - - if (delim === "\\surd") { - // Get an SVG image for - return sqrtSpan(height, delimType, options); - } else { - // Get the delimiter from font glyphs. - // Depending on the sequence element we decided on, call the - // appropriate function. - if (delimType.type === "small") { - return makeSmallDelim(delim, delimType.style, center, options, mode, classes); - } else if (delimType.type === "large") { - return makeLargeDelim(delim, delimType.size, center, options, mode, classes); - } else if (delimType.type === "stack") { - return makeStackedDelim(delim, height, center, options, mode, classes); - } - } - }; - - /** - * Make a delimiter for use with `\left` and `\right`, given a height and depth - * of an expression that the delimiters surround. - */ - var makeLeftRightDelim = function makeLeftRightDelim(delim, height, depth, options, mode, classes) { - // We always center \left/\right delimiters, so the axis is always shifted - var axisHeight = options.fontMetrics().axisHeight * options.sizeMultiplier; - - // Taken from TeX source, tex.web, function make_left_right - var delimiterFactor = 901; - var delimiterExtend = 5.0 / options.fontMetrics().ptPerEm; - - var maxDistFromAxis = Math.max(height - axisHeight, depth + axisHeight); - - var totalHeight = Math.max( - // In real TeX, calculations are done using integral values which are - // 65536 per pt, or 655360 per em. So, the division here truncates in - // TeX but doesn't here, producing different results. 
If we wanted to - // exactly match TeX's calculation, we could do - // Math.floor(655360 * maxDistFromAxis / 500) * - // delimiterFactor / 655360 - // (To see the difference, compare - // x^{x^{\left(\rule{0.1em}{0.68em}\right)}} - // in TeX and KaTeX) - maxDistFromAxis / 500 * delimiterFactor, 2 * maxDistFromAxis - delimiterExtend); - - // Finally, we defer to `makeCustomSizedDelim` with our calculated total - // height - return makeCustomSizedDelim(delim, totalHeight, true, options, mode, classes); - }; - - module.exports = { - sizedDelim: makeSizedDelim, - customSizedDelim: makeCustomSizedDelim, - leftRightDelim: makeLeftRightDelim - }; - - },{"./ParseError":29,"./Style":33,"./buildCommon":34,"./fontMetrics":41,"./symbols":48,"./utils":51}],39:[function(require,module,exports){ - - var _classCallCheck2 = require("babel-runtime/helpers/classCallCheck"); - - var _classCallCheck3 = _interopRequireDefault(_classCallCheck2); - - var _createClass2 = require("babel-runtime/helpers/createClass"); - - var _createClass3 = _interopRequireDefault(_createClass2); - - var _unicodeRegexes = require("./unicodeRegexes"); - - var _unicodeRegexes2 = _interopRequireDefault(_unicodeRegexes); - - var _utils = require("./utils"); - - var _utils2 = _interopRequireDefault(_utils); - - function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } - - /** - * Create an HTML className based on a list of classes. In addition to joining - * with spaces, we also remove null or empty classes. - */ - /** - * These objects store the data about the DOM nodes we create, as well as some - * extra data. They can then be transformed into real DOM nodes with the - * `toNode` function or HTML markup using `toMarkup`. They are useful for both - * storing extra properties on the nodes, as well as providing a way to easily - * work with the DOM. - * - * Similar functions for working with MathML nodes exist in mathMLTree.js. - */ - var createClass = function createClass(classes) { - classes = classes.slice(); - for (var i = classes.length - 1; i >= 0; i--) { - if (!classes[i]) { - classes.splice(i, 1); - } - } - - return classes.join(" "); - }; - - /** - * This node represents a span node, with a className, a list of children, and - * an inline style. It also contains information about its height, depth, and - * maxFontSize. - */ - - var span = function () { - function span(classes, children, options) { - (0, _classCallCheck3.default)(this, span); - - this.classes = classes || []; - this.children = children || []; - this.height = 0; - this.depth = 0; - this.maxFontSize = 0; - this.style = {}; - this.attributes = {}; - this.innerHTML; // used for inline SVG code. - if (options) { - if (options.style.isTight()) { - this.classes.push("mtight"); - } - if (options.getColor()) { - this.style.color = options.getColor(); - } - } - } - - /** - * Sets an arbitrary attribute on the span. Warning: use this wisely. Not all - * browsers support attributes the same, and having too many custom attributes - * is probably bad. 
- */ - - - (0, _createClass3.default)(span, [{ - key: "setAttribute", - value: function setAttribute(attribute, value) { - this.attributes[attribute] = value; - } - }, { - key: "tryCombine", - value: function tryCombine(sibling) { - return false; - } - - /** - * Convert the span into an HTML node - */ - - }, { - key: "toNode", - value: function toNode() { - var span = document.createElement("span"); - - // Apply the class - span.className = createClass(this.classes); - - // Apply inline styles - for (var style in this.style) { - if (Object.prototype.hasOwnProperty.call(this.style, style)) { - span.style[style] = this.style[style]; - } - } - - // Apply attributes - for (var attr in this.attributes) { - if (Object.prototype.hasOwnProperty.call(this.attributes, attr)) { - span.setAttribute(attr, this.attributes[attr]); - } - } - - if (this.innerHTML) { - span.innerHTML = this.innerHTML; - } - - // Append the children, also as HTML nodes - for (var i = 0; i < this.children.length; i++) { - span.appendChild(this.children[i].toNode()); - } - - return span; - } - - /** - * Convert the span into an HTML markup string - */ - - }, { - key: "toMarkup", - value: function toMarkup() { - var markup = " 0 || createClass(this.classes) !== createClass(sibling.classes) || this.skew !== sibling.skew || this.maxFontSize !== sibling.maxFontSize) { - return false; - } - for (var style in this.style) { - if (this.style.hasOwnProperty(style) && this.style[style] !== sibling.style[style]) { - return false; - } - } - for (var _style in sibling.style) { - if (sibling.style.hasOwnProperty(_style) && this.style[_style] !== sibling.style[_style]) { - return false; - } - } - this.value += sibling.value; - this.height = Math.max(this.height, sibling.height); - this.depth = Math.max(this.depth, sibling.depth); - this.italic = sibling.italic; - return true; - } - - /** - * Creates a text node or span from a symbol node. Note that a span is only - * created if it is needed. - */ - - }, { - key: "toNode", - value: function toNode() { - var node = document.createTextNode(this.value); - var span = null; - - if (this.italic > 0) { - span = document.createElement("span"); - span.style.marginRight = this.italic + "em"; - } - - if (this.classes.length > 0) { - span = span || document.createElement("span"); - span.className = createClass(this.classes); - } - - for (var style in this.style) { - if (this.style.hasOwnProperty(style)) { - span = span || document.createElement("span"); - span.style[style] = this.style[style]; - } - } - - if (span) { - span.appendChild(node); - return span; - } else { - return node; - } - } - - /** - * Creates markup for a symbol node. - */ - - }, { - key: "toMarkup", - value: function toMarkup() { - // TODO(alpert): More duplication than I'd like from - // span.prototype.toMarkup and symbolNode.prototype.toNode... - var needsSpan = false; - - var markup = " 0) { - styles += "margin-right:" + this.italic + "em;"; - } - for (var style in this.style) { - if (this.style.hasOwnProperty(style)) { - styles += _utils2.default.hyphenate(style) + ":" + this.style[style] + ";"; - } - } - - if (styles) { - needsSpan = true; - markup += " style=\"" + _utils2.default.escape(styles) + "\""; - } - - var escaped = _utils2.default.escape(this.value); - if (needsSpan) { - markup += ">"; - markup += escaped; - markup += "
      "; - return markup; - } else { - return escaped; - } - } - }]); - return symbolNode; - }(); - - module.exports = { - span: span, - documentFragment: documentFragment, - symbolNode: symbolNode - }; - - },{"./unicodeRegexes":49,"./utils":51,"babel-runtime/helpers/classCallCheck":4,"babel-runtime/helpers/createClass":5}],40:[function(require,module,exports){ - - var _ParseNode = require("./ParseNode"); - - var _ParseNode2 = _interopRequireDefault(_ParseNode); - - var _ParseError = require("./ParseError"); - - var _ParseError2 = _interopRequireDefault(_ParseError); - - function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } - - /** - * Parse the body of the environment, with rows delimited by \\ and - * columns delimited by &, and create a nested list in row-major order - * with one group per cell. If given an optional argument style - * ("text", "display", etc.), then each cell is cast into that style. - */ - /* eslint no-constant-condition:0 */ - function parseArray(parser, result, style) { - var row = []; - var body = [row]; - var rowGaps = []; - while (true) { - var cell = parser.parseExpression(false, null); - cell = new _ParseNode2.default("ordgroup", cell, parser.mode); - if (style) { - cell = new _ParseNode2.default("styling", { - style: style, - value: [cell] - }, parser.mode); - } - row.push(cell); - var next = parser.nextToken.text; - if (next === "&") { - parser.consume(); - } else if (next === "\\end") { - break; - } else if (next === "\\\\" || next === "\\cr") { - var cr = parser.parseFunction(); - rowGaps.push(cr.value.size); - row = []; - body.push(row); - } else { - throw new _ParseError2.default("Expected & or \\\\ or \\end", parser.nextToken); - } - } - result.body = body; - result.rowGaps = rowGaps; - return new _ParseNode2.default(result.type, result, parser.mode); - } - - /* - * An environment definition is very similar to a function definition: - * it is declared with a name or a list of names, a set of properties - * and a handler containing the actual implementation. - * - * The properties include: - * - numArgs: The number of arguments after the \begin{name} function. - * - argTypes: (optional) Just like for a function - * - allowedInText: (optional) Whether or not the environment is allowed inside - * text mode (default false) (not enforced yet) - * - numOptionalArgs: (optional) Just like for a function - * A bare number instead of that object indicates the numArgs value. - * - * The handler function will receive two arguments - * - context: information and references provided by the parser - * - args: an array of arguments passed to \begin{name} - * The context contains the following properties: - * - envName: the name of the environment, one of the listed names. - * - parser: the parser object - * - lexer: the lexer object - * - positions: the positions associated with these arguments from args. - * The handler must return a ParseResult. 
- /*
-  * An environment definition is very similar to a function definition:
-  * it is declared with a name or a list of names, a set of properties
-  * and a handler containing the actual implementation.
-  *
-  * The properties include:
-  *  - numArgs: The number of arguments after the \begin{name} function.
-  *  - argTypes: (optional) Just like for a function
-  *  - allowedInText: (optional) Whether or not the environment is allowed inside
-  *    text mode (default false) (not enforced yet)
-  *  - numOptionalArgs: (optional) Just like for a function
-  * A bare number instead of that object indicates the numArgs value.
-  *
-  * The handler function will receive two arguments
-  *  - context: information and references provided by the parser
-  *  - args: an array of arguments passed to \begin{name}
-  * The context contains the following properties:
-  *  - envName: the name of the environment, one of the listed names.
-  *  - parser: the parser object
-  *  - lexer: the lexer object
-  *  - positions: the positions associated with these arguments from args.
-  * The handler must return a ParseResult.
-  */
- function defineEnvironment(names, props, handler) {
-     if (typeof names === "string") {
-         names = [names];
-     }
-     if (typeof props === "number") {
-         props = { numArgs: props };
-     }
-     // Set default values of environments
-     var data = {
-         numArgs: props.numArgs || 0,
-         argTypes: props.argTypes,
-         greediness: 1,
-         allowedInText: !!props.allowedInText,
-         numOptionalArgs: props.numOptionalArgs || 0,
-         handler: handler
-     };
-     for (var i = 0; i < names.length; ++i) {
-         module.exports[names[i]] = data;
-     }
- }
-
- // Decides on a style for cells in an array according to whether the given
- // environment name starts with the letter 'd'.
- function dCellStyle(envName) {
-     if (envName.substr(0, 1) === "d") {
-         return "display";
-     } else {
-         return "text";
-     }
- }
-
- // Arrays are part of LaTeX, defined in lttab.dtx, so their documentation
- // is part of the source2e.pdf file of the LaTeX2e source documentation.
- // {darray} is an {array} environment where cells are set in \displaystyle,
- // as defined in nccmath.sty.
- defineEnvironment(["array", "darray"], {
-     numArgs: 1
- }, function (context, args) {
-     var colalign = args[0];
-     colalign = colalign.value.map ? colalign.value : [colalign];
-     var cols = colalign.map(function (node) {
-         var ca = node.value;
-         if ("lcr".indexOf(ca) !== -1) {
-             return {
-                 type: "align",
-                 align: ca
-             };
-         } else if (ca === "|") {
-             return {
-                 type: "separator",
-                 separator: "|"
-             };
-         }
-         throw new _ParseError2.default("Unknown column alignment: " + node.value, node);
-     });
-     var res = {
-         type: "array",
-         cols: cols,
-         hskipBeforeAndAfter: true
-     };
-     res = parseArray(context.parser, res, dCellStyle(context.envName));
-     return res;
- });
-
- // The matrix environments of amsmath build on the array environment
- // of LaTeX, which is discussed above.
- defineEnvironment(["matrix", "pmatrix", "bmatrix", "Bmatrix", "vmatrix", "Vmatrix"], {}, function (context) {
-     var delimiters = {
-         "matrix": null,
-         "pmatrix": ["(", ")"],
-         "bmatrix": ["[", "]"],
-         "Bmatrix": ["\\{", "\\}"],
-         "vmatrix": ["|", "|"],
-         "Vmatrix": ["\\Vert", "\\Vert"]
-     }[context.envName];
-     var res = {
-         type: "array",
-         hskipBeforeAndAfter: false
-     };
-     res = parseArray(context.parser, res, dCellStyle(context.envName));
-     if (delimiters) {
-         res = new _ParseNode2.default("leftright", {
-             body: [res],
-             left: delimiters[0],
-             right: delimiters[1]
-         }, context.mode);
-     }
-     return res;
- });
-
- // A cases environment (in amsmath.sty) is almost equivalent to
- //   \def\arraystretch{1.2}%
- //   \left\{\begin{array}{@{}l@{\quad}l@{}} … \end{array}\right.
- // {dcases} is a {cases} environment where cells are set in \displaystyle,
- // as defined in mathtools.sty.
- defineEnvironment(["cases", "dcases"], {}, function (context) {
-     var res = {
-         type: "array",
-         arraystretch: 1.2,
-         cols: [{
-             type: "align",
-             align: "l",
-             pregap: 0,
-             // TODO(kevinb) get the current style.
-             // For now we use the metrics for TEXT style which is what we were
-             // doing before. Before attempting to get the current style we
-             // should look at TeX's behavior especially for \over and matrices.
-             postgap: 1.0
-         }, {
-             type: "align",
-             align: "l",
-             pregap: 0,
-             postgap: 0
-         }]
-     };
-     res = parseArray(context.parser, res, dCellStyle(context.envName));
-     res = new _ParseNode2.default("leftright", {
-         body: [res],
-         left: "\\{",
-         right: "."
-     }, context.mode);
-     return res;
- });
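The definitions above all pick their cell style the same way: `dCellStyle` looks only at the first letter of the environment name, so the `d`-prefixed variants (`darray`, `dcases`) get `\displaystyle` cells. A one-line mirror of that rule:

// Mirror of dCellStyle above: a 'd'-prefixed environment name means display style.
function cellStyle(envName) {
    return envName.substr(0, 1) === "d" ? "display" : "text";
}
cellStyle("dcases");  // "display"
cellStyle("pmatrix"); // "text"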
- // An aligned environment is like the align* environment
- // except it operates within math mode.
- // Note that we assume \normallineskiplimit to be zero,
- // so that \strut@ is the same as \strut.
- defineEnvironment("aligned", {}, function (context) {
-     var res = {
-         type: "array",
-         cols: [],
-         addJot: true
-     };
-     res = parseArray(context.parser, res, "display");
-     // Count number of columns = maximum number of cells in each row.
-     // At the same time, prepend empty group {} at beginning of every second
-     // cell in each row (starting with second cell) so that operators become
-     // binary. This behavior is implemented in amsmath's \start@aligned.
-     var emptyGroup = new _ParseNode2.default("ordgroup", [], context.mode);
-     var numCols = 0;
-     res.value.body.forEach(function (row) {
-         for (var i = 1; i < row.length; i += 2) {
-             // Modify ordgroup node within styling node
-             var ordgroup = row[i].value.value[0];
-             ordgroup.value.unshift(emptyGroup);
-         }
-         if (numCols < row.length) {
-             numCols = row.length;
-         }
-     });
-     for (var i = 0; i < numCols; ++i) {
-         var align = "r";
-         var pregap = 0;
-         if (i % 2 === 1) {
-             align = "l";
-         } else if (i > 0) {
-             pregap = 2; // one \qquad between columns
-         }
-         res.value.cols[i] = {
-             type: "align",
-             align: align,
-             pregap: pregap,
-             postgap: 0
-         };
-     }
-     return res;
- });
-
- // A gathered environment is like an array environment with one centered
- // column, but where rows are considered lines, so they get \jot line spacing
- // and their contents are set in \displaystyle.
- defineEnvironment("gathered", {}, function (context) {
-     var res = {
-         type: "array",
-         cols: [{
-             type: "align",
-             align: "c"
-         }],
-         addJot: true
-     };
-     res = parseArray(context.parser, res, "display");
-     return res;
- });
-
- },{"./ParseError":29,"./ParseNode":30}],41:[function(require,module,exports){
-
- var _unicodeRegexes = require("./unicodeRegexes");
-
- var _fontMetricsData = require("./fontMetricsData");
-
- var _fontMetricsData2 = _interopRequireDefault(_fontMetricsData);
-
- function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; }
-
- /**
-  * This file contains metrics regarding fonts and individual symbols. The sigma
-  * and xi variables, as well as the metricMap map contain data extracted from
-  * TeX, TeX font metrics, and the TTF files. These data are then exposed via the
-  * `metrics` variable and the getCharacterMetrics function.
-  */
-
- // In TeX, there are actually three sets of dimensions, one for each of
- // textstyle (size index 5 and higher: >=9pt), scriptstyle (size index 3 and 4:
- // 7-8pt), and scriptscriptstyle (size index 1 and 2: 5-6pt). These are
- // provided in the arrays below, in that order.
- //
- // The font metrics are stored in fonts cmsy10, cmsy7, and cmsy5 respectively.
- // This was determined by running the following script:
- //
- //     latex -interaction=nonstopmode \
- //     '\documentclass{article}\usepackage{amsmath}\begin{document}' \
- //     '$a$ \expandafter\show\the\textfont2' \
- //     '\expandafter\show\the\scriptfont2' \
- //     '\expandafter\show\the\scriptscriptfont2' \
- //     '\stop'
- //
- // The metrics themselves were retrieved using the following commands:
- //
- //     tftopl cmsy10
- //     tftopl cmsy7
- //     tftopl cmsy5
- //
- // The output of each of these commands is quite lengthy. The only part we
- // care about is the FONTDIMEN section. Each value is measured in EMs.
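Each entry in the table that follows is therefore a triple indexed by style: 0 = textstyle (cmsy10), 1 = scriptstyle (cmsy7), 2 = scriptscriptstyle (cmsy5). A small reading sketch, using a value taken from `sigmasAndXis` itself:

// Index 0/1/2 = text/script/scriptscript; values are in ems.
var quad = [1.000, 1.171, 1.472]; // sigma6, copied from the table below
quad[1];                          // 1.171em: the width of one quad at script size
// getFontMetrics, later in this module, also derives the mu unit from quad:
// cssEmPerMu = quad / 18, i.e. 1.171 / 18 ≈ 0.065em per mu at script size.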
- var sigmasAndXis = { - slant: [0.250, 0.250, 0.250], // sigma1 - space: [0.000, 0.000, 0.000], // sigma2 - stretch: [0.000, 0.000, 0.000], // sigma3 - shrink: [0.000, 0.000, 0.000], // sigma4 - xHeight: [0.431, 0.431, 0.431], // sigma5 - quad: [1.000, 1.171, 1.472], // sigma6 - extraSpace: [0.000, 0.000, 0.000], // sigma7 - num1: [0.677, 0.732, 0.925], // sigma8 - num2: [0.394, 0.384, 0.387], // sigma9 - num3: [0.444, 0.471, 0.504], // sigma10 - denom1: [0.686, 0.752, 1.025], // sigma11 - denom2: [0.345, 0.344, 0.532], // sigma12 - sup1: [0.413, 0.503, 0.504], // sigma13 - sup2: [0.363, 0.431, 0.404], // sigma14 - sup3: [0.289, 0.286, 0.294], // sigma15 - sub1: [0.150, 0.143, 0.200], // sigma16 - sub2: [0.247, 0.286, 0.400], // sigma17 - supDrop: [0.386, 0.353, 0.494], // sigma18 - subDrop: [0.050, 0.071, 0.100], // sigma19 - delim1: [2.390, 1.700, 1.980], // sigma20 - delim2: [1.010, 1.157, 1.420], // sigma21 - axisHeight: [0.250, 0.250, 0.250], // sigma22 - - // These font metrics are extracted from TeX by using tftopl on cmex10.tfm; - // they correspond to the font parameters of the extension fonts (family 3). - // See the TeXbook, page 441. In AMSTeX, the extension fonts scale; to - // match cmex7, we'd use cmex7.tfm values for script and scriptscript - // values. - defaultRuleThickness: [0.04, 0.049, 0.049], // xi8; cmex7: 0.049 - bigOpSpacing1: [0.111, 0.111, 0.111], // xi9 - bigOpSpacing2: [0.166, 0.166, 0.166], // xi10 - bigOpSpacing3: [0.2, 0.2, 0.2], // xi11 - bigOpSpacing4: [0.6, 0.611, 0.611], // xi12; cmex7: 0.611 - bigOpSpacing5: [0.1, 0.143, 0.143], // xi13; cmex7: 0.143 - - // The \sqrt rule width is taken from the height of the surd character. - // Since we use the same font at all sizes, this thickness doesn't scale. - sqrtRuleThickness: [0.04, 0.04, 0.04], - - // This value determines how large a pt is, for metrics which are defined - // in terms of pts. - // This value is also used in katex.less; if you change it make sure the - // values match. - ptPerEm: [10.0, 10.0, 10.0], - - // The space between adjacent `|` columns in an array definition. From - // `\showthe\doublerulesep` in LaTeX. Equals 2.0 / ptPerEm. - doubleRuleSep: [0.2, 0.2, 0.2] - }; - - // This map contains a mapping from font name and character code to character - // metrics, including height, depth, italic correction, and skew (kern from the - // character to the corresponding \skewchar) - // This map is generated via `make metrics`. It should not be changed manually. - - - // These are very rough approximations. We default to Times New Roman which - // should have Latin-1 and Cyrillic characters, but may not depending on the - // operating system. The metrics do not account for extra height from the - // accents. In the case of Cyrillic characters which have both ascenders and - // descenders we prefer approximations with ascenders, primarily to prevent - // the fraction bar or root line from intersecting the glyph. - // TODO(kevinb) allow union of multiple glyph metrics for better accuracy. 
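In effect, a character missing from the metric tables borrows the metrics of a visually similar stand-in: 'É' is measured as 'E', Cyrillic 'б' as 'b', and any CJK glyph as a wide 'M' (see `getCharacterMetrics` further down). A self-contained sketch of that lookup order; the regex range here is an illustrative stand-in for the bundle's `cjkRegex`, not its actual definition:

// Sketch of the fallback order used by getCharacterMetrics below.
var fallbackMap = { 'É': 'E', 'б': 'b' }; // excerpt of extraCharacterMap
var cjkLike = /[\u4e00-\u9fff]/;          // stand-in for cjkRegex (assumption)
function metricChar(ch) {
    if (ch in fallbackMap) return fallbackMap[ch]; // approximate by a Latin glyph
    if (cjkLike.test(ch)) return 'M';              // approximate CJK by a wide 'M'
    return ch;                                     // otherwise measure the char itself
}
metricChar('É'); // 'E'
metricChar('漢'); // 'M'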
- var extraCharacterMap = { - // Latin-1 - 'À': 'A', - 'Á': 'A', - 'Â': 'A', - 'Ã': 'A', - 'Ä': 'A', - 'Å': 'A', - 'Æ': 'A', - 'Ç': 'C', - 'È': 'E', - 'É': 'E', - 'Ê': 'E', - 'Ë': 'E', - 'Ì': 'I', - 'Í': 'I', - 'Î': 'I', - 'Ï': 'I', - 'Ð': 'D', - 'Ñ': 'N', - 'Ò': 'O', - 'Ó': 'O', - 'Ô': 'O', - 'Õ': 'O', - 'Ö': 'O', - 'Ø': 'O', - 'Ù': 'U', - 'Ú': 'U', - 'Û': 'U', - 'Ü': 'U', - 'Ý': 'Y', - 'Þ': 'o', - 'ß': 'B', - 'à': 'a', - 'á': 'a', - 'â': 'a', - 'ã': 'a', - 'ä': 'a', - 'å': 'a', - 'æ': 'a', - 'ç': 'c', - 'è': 'e', - 'é': 'e', - 'ê': 'e', - 'ë': 'e', - 'ì': 'i', - 'í': 'i', - 'î': 'i', - 'ï': 'i', - 'ð': 'd', - 'ñ': 'n', - 'ò': 'o', - 'ó': 'o', - 'ô': 'o', - 'õ': 'o', - 'ö': 'o', - 'ø': 'o', - 'ù': 'u', - 'ú': 'u', - 'û': 'u', - 'ü': 'u', - 'ý': 'y', - 'þ': 'o', - 'ÿ': 'y', - - // Cyrillic - 'А': 'A', - 'Б': 'B', - 'В': 'B', - 'Г': 'F', - 'Д': 'A', - 'Е': 'E', - 'Ж': 'K', - 'З': '3', - 'И': 'N', - 'Й': 'N', - 'К': 'K', - 'Л': 'N', - 'М': 'M', - 'Н': 'H', - 'О': 'O', - 'П': 'N', - 'Р': 'P', - 'С': 'C', - 'Т': 'T', - 'У': 'y', - 'Ф': 'O', - 'Х': 'X', - 'Ц': 'U', - 'Ч': 'h', - 'Ш': 'W', - 'Щ': 'W', - 'Ъ': 'B', - 'Ы': 'X', - 'Ь': 'B', - 'Э': '3', - 'Ю': 'X', - 'Я': 'R', - 'а': 'a', - 'б': 'b', - 'в': 'a', - 'г': 'r', - 'д': 'y', - 'е': 'e', - 'ж': 'm', - 'з': 'e', - 'и': 'n', - 'й': 'n', - 'к': 'n', - 'л': 'n', - 'м': 'm', - 'н': 'n', - 'о': 'o', - 'п': 'n', - 'р': 'p', - 'с': 'c', - 'т': 'o', - 'у': 'y', - 'ф': 'b', - 'х': 'x', - 'ц': 'n', - 'ч': 'n', - 'ш': 'w', - 'щ': 'w', - 'ъ': 'a', - 'ы': 'm', - 'ь': 'a', - 'э': 'e', - 'ю': 'm', - 'я': 'r' - }; - - /** - * This function is a convenience function for looking up information in the - * metricMap table. It takes a character as a string, and a style. - * - * Note: the `width` property may be undefined if fontMetricsData.js wasn't - * built using `Make extended_metrics`. - */ - var getCharacterMetrics = function getCharacterMetrics(character, style) { - var ch = character.charCodeAt(0); - if (character[0] in extraCharacterMap) { - ch = extraCharacterMap[character[0]].charCodeAt(0); - } else if (_unicodeRegexes.cjkRegex.test(character[0])) { - ch = 'M'.charCodeAt(0); - } - var metrics = _fontMetricsData2.default[style][ch]; - if (metrics) { - return { - depth: metrics[0], - height: metrics[1], - italic: metrics[2], - skew: metrics[3], - width: metrics[4] - }; - } - }; - - var fontMetricsBySizeIndex = {}; - - /** - * Get the font metrics for a given size. 
- */ - var getFontMetrics = function getFontMetrics(size) { - var sizeIndex = void 0; - if (size >= 5) { - sizeIndex = 0; - } else if (size >= 3) { - sizeIndex = 1; - } else { - sizeIndex = 2; - } - if (!fontMetricsBySizeIndex[sizeIndex]) { - var metrics = fontMetricsBySizeIndex[sizeIndex] = {}; - for (var key in sigmasAndXis) { - if (sigmasAndXis.hasOwnProperty(key)) { - metrics[key] = sigmasAndXis[key][sizeIndex]; - } - } - metrics.cssEmPerMu = metrics.quad / 18; - } - return fontMetricsBySizeIndex[sizeIndex]; - }; - - module.exports = { - getFontMetrics: getFontMetrics, - getCharacterMetrics: getCharacterMetrics - }; - - },{"./fontMetricsData":42,"./unicodeRegexes":49}],42:[function(require,module,exports){ - - module.exports = { - "AMS-Regular": { - "65": [0, 0.68889, 0, 0], - "66": [0, 0.68889, 0, 0], - "67": [0, 0.68889, 0, 0], - "68": [0, 0.68889, 0, 0], - "69": [0, 0.68889, 0, 0], - "70": [0, 0.68889, 0, 0], - "71": [0, 0.68889, 0, 0], - "72": [0, 0.68889, 0, 0], - "73": [0, 0.68889, 0, 0], - "74": [0.16667, 0.68889, 0, 0], - "75": [0, 0.68889, 0, 0], - "76": [0, 0.68889, 0, 0], - "77": [0, 0.68889, 0, 0], - "78": [0, 0.68889, 0, 0], - "79": [0.16667, 0.68889, 0, 0], - "80": [0, 0.68889, 0, 0], - "81": [0.16667, 0.68889, 0, 0], - "82": [0, 0.68889, 0, 0], - "83": [0, 0.68889, 0, 0], - "84": [0, 0.68889, 0, 0], - "85": [0, 0.68889, 0, 0], - "86": [0, 0.68889, 0, 0], - "87": [0, 0.68889, 0, 0], - "88": [0, 0.68889, 0, 0], - "89": [0, 0.68889, 0, 0], - "90": [0, 0.68889, 0, 0], - "107": [0, 0.68889, 0, 0], - "165": [0, 0.675, 0.025, 0], - "174": [0.15559, 0.69224, 0, 0], - "240": [0, 0.68889, 0, 0], - "295": [0, 0.68889, 0, 0], - "710": [0, 0.825, 0, 0], - "732": [0, 0.9, 0, 0], - "770": [0, 0.825, 0, 0], - "771": [0, 0.9, 0, 0], - "989": [0.08167, 0.58167, 0, 0], - "1008": [0, 0.43056, 0.04028, 0], - "8245": [0, 0.54986, 0, 0], - "8463": [0, 0.68889, 0, 0], - "8487": [0, 0.68889, 0, 0], - "8498": [0, 0.68889, 0, 0], - "8502": [0, 0.68889, 0, 0], - "8503": [0, 0.68889, 0, 0], - "8504": [0, 0.68889, 0, 0], - "8513": [0, 0.68889, 0, 0], - "8592": [-0.03598, 0.46402, 0, 0], - "8594": [-0.03598, 0.46402, 0, 0], - "8602": [-0.13313, 0.36687, 0, 0], - "8603": [-0.13313, 0.36687, 0, 0], - "8606": [0.01354, 0.52239, 0, 0], - "8608": [0.01354, 0.52239, 0, 0], - "8610": [0.01354, 0.52239, 0, 0], - "8611": [0.01354, 0.52239, 0, 0], - "8619": [0, 0.54986, 0, 0], - "8620": [0, 0.54986, 0, 0], - "8621": [-0.13313, 0.37788, 0, 0], - "8622": [-0.13313, 0.36687, 0, 0], - "8624": [0, 0.69224, 0, 0], - "8625": [0, 0.69224, 0, 0], - "8630": [0, 0.43056, 0, 0], - "8631": [0, 0.43056, 0, 0], - "8634": [0.08198, 0.58198, 0, 0], - "8635": [0.08198, 0.58198, 0, 0], - "8638": [0.19444, 0.69224, 0, 0], - "8639": [0.19444, 0.69224, 0, 0], - "8642": [0.19444, 0.69224, 0, 0], - "8643": [0.19444, 0.69224, 0, 0], - "8644": [0.1808, 0.675, 0, 0], - "8646": [0.1808, 0.675, 0, 0], - "8647": [0.1808, 0.675, 0, 0], - "8648": [0.19444, 0.69224, 0, 0], - "8649": [0.1808, 0.675, 0, 0], - "8650": [0.19444, 0.69224, 0, 0], - "8651": [0.01354, 0.52239, 0, 0], - "8652": [0.01354, 0.52239, 0, 0], - "8653": [-0.13313, 0.36687, 0, 0], - "8654": [-0.13313, 0.36687, 0, 0], - "8655": [-0.13313, 0.36687, 0, 0], - "8666": [0.13667, 0.63667, 0, 0], - "8667": [0.13667, 0.63667, 0, 0], - "8669": [-0.13313, 0.37788, 0, 0], - "8672": [-0.064, 0.437, 0, 0], - "8674": [-0.064, 0.437, 0, 0], - "8705": [0, 0.825, 0, 0], - "8708": [0, 0.68889, 0, 0], - "8709": [0.08167, 0.58167, 0, 0], - "8717": [0, 0.43056, 0, 0], - "8722": [-0.03598, 0.46402, 
0, 0], - "8724": [0.08198, 0.69224, 0, 0], - "8726": [0.08167, 0.58167, 0, 0], - "8733": [0, 0.69224, 0, 0], - "8736": [0, 0.69224, 0, 0], - "8737": [0, 0.69224, 0, 0], - "8738": [0.03517, 0.52239, 0, 0], - "8739": [0.08167, 0.58167, 0, 0], - "8740": [0.25142, 0.74111, 0, 0], - "8741": [0.08167, 0.58167, 0, 0], - "8742": [0.25142, 0.74111, 0, 0], - "8756": [0, 0.69224, 0, 0], - "8757": [0, 0.69224, 0, 0], - "8764": [-0.13313, 0.36687, 0, 0], - "8765": [-0.13313, 0.37788, 0, 0], - "8769": [-0.13313, 0.36687, 0, 0], - "8770": [-0.03625, 0.46375, 0, 0], - "8774": [0.30274, 0.79383, 0, 0], - "8776": [-0.01688, 0.48312, 0, 0], - "8778": [0.08167, 0.58167, 0, 0], - "8782": [0.06062, 0.54986, 0, 0], - "8783": [0.06062, 0.54986, 0, 0], - "8785": [0.08198, 0.58198, 0, 0], - "8786": [0.08198, 0.58198, 0, 0], - "8787": [0.08198, 0.58198, 0, 0], - "8790": [0, 0.69224, 0, 0], - "8791": [0.22958, 0.72958, 0, 0], - "8796": [0.08198, 0.91667, 0, 0], - "8806": [0.25583, 0.75583, 0, 0], - "8807": [0.25583, 0.75583, 0, 0], - "8808": [0.25142, 0.75726, 0, 0], - "8809": [0.25142, 0.75726, 0, 0], - "8812": [0.25583, 0.75583, 0, 0], - "8814": [0.20576, 0.70576, 0, 0], - "8815": [0.20576, 0.70576, 0, 0], - "8816": [0.30274, 0.79383, 0, 0], - "8817": [0.30274, 0.79383, 0, 0], - "8818": [0.22958, 0.72958, 0, 0], - "8819": [0.22958, 0.72958, 0, 0], - "8822": [0.1808, 0.675, 0, 0], - "8823": [0.1808, 0.675, 0, 0], - "8828": [0.13667, 0.63667, 0, 0], - "8829": [0.13667, 0.63667, 0, 0], - "8830": [0.22958, 0.72958, 0, 0], - "8831": [0.22958, 0.72958, 0, 0], - "8832": [0.20576, 0.70576, 0, 0], - "8833": [0.20576, 0.70576, 0, 0], - "8840": [0.30274, 0.79383, 0, 0], - "8841": [0.30274, 0.79383, 0, 0], - "8842": [0.13597, 0.63597, 0, 0], - "8843": [0.13597, 0.63597, 0, 0], - "8847": [0.03517, 0.54986, 0, 0], - "8848": [0.03517, 0.54986, 0, 0], - "8858": [0.08198, 0.58198, 0, 0], - "8859": [0.08198, 0.58198, 0, 0], - "8861": [0.08198, 0.58198, 0, 0], - "8862": [0, 0.675, 0, 0], - "8863": [0, 0.675, 0, 0], - "8864": [0, 0.675, 0, 0], - "8865": [0, 0.675, 0, 0], - "8872": [0, 0.69224, 0, 0], - "8873": [0, 0.69224, 0, 0], - "8874": [0, 0.69224, 0, 0], - "8876": [0, 0.68889, 0, 0], - "8877": [0, 0.68889, 0, 0], - "8878": [0, 0.68889, 0, 0], - "8879": [0, 0.68889, 0, 0], - "8882": [0.03517, 0.54986, 0, 0], - "8883": [0.03517, 0.54986, 0, 0], - "8884": [0.13667, 0.63667, 0, 0], - "8885": [0.13667, 0.63667, 0, 0], - "8888": [0, 0.54986, 0, 0], - "8890": [0.19444, 0.43056, 0, 0], - "8891": [0.19444, 0.69224, 0, 0], - "8892": [0.19444, 0.69224, 0, 0], - "8901": [0, 0.54986, 0, 0], - "8903": [0.08167, 0.58167, 0, 0], - "8905": [0.08167, 0.58167, 0, 0], - "8906": [0.08167, 0.58167, 0, 0], - "8907": [0, 0.69224, 0, 0], - "8908": [0, 0.69224, 0, 0], - "8909": [-0.03598, 0.46402, 0, 0], - "8910": [0, 0.54986, 0, 0], - "8911": [0, 0.54986, 0, 0], - "8912": [0.03517, 0.54986, 0, 0], - "8913": [0.03517, 0.54986, 0, 0], - "8914": [0, 0.54986, 0, 0], - "8915": [0, 0.54986, 0, 0], - "8916": [0, 0.69224, 0, 0], - "8918": [0.0391, 0.5391, 0, 0], - "8919": [0.0391, 0.5391, 0, 0], - "8920": [0.03517, 0.54986, 0, 0], - "8921": [0.03517, 0.54986, 0, 0], - "8922": [0.38569, 0.88569, 0, 0], - "8923": [0.38569, 0.88569, 0, 0], - "8926": [0.13667, 0.63667, 0, 0], - "8927": [0.13667, 0.63667, 0, 0], - "8928": [0.30274, 0.79383, 0, 0], - "8929": [0.30274, 0.79383, 0, 0], - "8934": [0.23222, 0.74111, 0, 0], - "8935": [0.23222, 0.74111, 0, 0], - "8936": [0.23222, 0.74111, 0, 0], - "8937": [0.23222, 0.74111, 0, 0], - "8938": [0.20576, 0.70576, 0, 0], - 
"8939": [0.20576, 0.70576, 0, 0], - "8940": [0.30274, 0.79383, 0, 0], - "8941": [0.30274, 0.79383, 0, 0], - "8994": [0.19444, 0.69224, 0, 0], - "8995": [0.19444, 0.69224, 0, 0], - "9416": [0.15559, 0.69224, 0, 0], - "9484": [0, 0.69224, 0, 0], - "9488": [0, 0.69224, 0, 0], - "9492": [0, 0.37788, 0, 0], - "9496": [0, 0.37788, 0, 0], - "9585": [0.19444, 0.68889, 0, 0], - "9586": [0.19444, 0.74111, 0, 0], - "9632": [0, 0.675, 0, 0], - "9633": [0, 0.675, 0, 0], - "9650": [0, 0.54986, 0, 0], - "9651": [0, 0.54986, 0, 0], - "9654": [0.03517, 0.54986, 0, 0], - "9660": [0, 0.54986, 0, 0], - "9661": [0, 0.54986, 0, 0], - "9664": [0.03517, 0.54986, 0, 0], - "9674": [0.11111, 0.69224, 0, 0], - "9733": [0.19444, 0.69224, 0, 0], - "10003": [0, 0.69224, 0, 0], - "10016": [0, 0.69224, 0, 0], - "10731": [0.11111, 0.69224, 0, 0], - "10846": [0.19444, 0.75583, 0, 0], - "10877": [0.13667, 0.63667, 0, 0], - "10878": [0.13667, 0.63667, 0, 0], - "10885": [0.25583, 0.75583, 0, 0], - "10886": [0.25583, 0.75583, 0, 0], - "10887": [0.13597, 0.63597, 0, 0], - "10888": [0.13597, 0.63597, 0, 0], - "10889": [0.26167, 0.75726, 0, 0], - "10890": [0.26167, 0.75726, 0, 0], - "10891": [0.48256, 0.98256, 0, 0], - "10892": [0.48256, 0.98256, 0, 0], - "10901": [0.13667, 0.63667, 0, 0], - "10902": [0.13667, 0.63667, 0, 0], - "10933": [0.25142, 0.75726, 0, 0], - "10934": [0.25142, 0.75726, 0, 0], - "10935": [0.26167, 0.75726, 0, 0], - "10936": [0.26167, 0.75726, 0, 0], - "10937": [0.26167, 0.75726, 0, 0], - "10938": [0.26167, 0.75726, 0, 0], - "10949": [0.25583, 0.75583, 0, 0], - "10950": [0.25583, 0.75583, 0, 0], - "10955": [0.28481, 0.79383, 0, 0], - "10956": [0.28481, 0.79383, 0, 0], - "57350": [0.08167, 0.58167, 0, 0], - "57351": [0.08167, 0.58167, 0, 0], - "57352": [0.08167, 0.58167, 0, 0], - "57353": [0, 0.43056, 0.04028, 0], - "57356": [0.25142, 0.75726, 0, 0], - "57357": [0.25142, 0.75726, 0, 0], - "57358": [0.41951, 0.91951, 0, 0], - "57359": [0.30274, 0.79383, 0, 0], - "57360": [0.30274, 0.79383, 0, 0], - "57361": [0.41951, 0.91951, 0, 0], - "57366": [0.25142, 0.75726, 0, 0], - "57367": [0.25142, 0.75726, 0, 0], - "57368": [0.25142, 0.75726, 0, 0], - "57369": [0.25142, 0.75726, 0, 0], - "57370": [0.13597, 0.63597, 0, 0], - "57371": [0.13597, 0.63597, 0, 0] - }, - "Caligraphic-Regular": { - "48": [0, 0.43056, 0, 0], - "49": [0, 0.43056, 0, 0], - "50": [0, 0.43056, 0, 0], - "51": [0.19444, 0.43056, 0, 0], - "52": [0.19444, 0.43056, 0, 0], - "53": [0.19444, 0.43056, 0, 0], - "54": [0, 0.64444, 0, 0], - "55": [0.19444, 0.43056, 0, 0], - "56": [0, 0.64444, 0, 0], - "57": [0.19444, 0.43056, 0, 0], - "65": [0, 0.68333, 0, 0.19445], - "66": [0, 0.68333, 0.03041, 0.13889], - "67": [0, 0.68333, 0.05834, 0.13889], - "68": [0, 0.68333, 0.02778, 0.08334], - "69": [0, 0.68333, 0.08944, 0.11111], - "70": [0, 0.68333, 0.09931, 0.11111], - "71": [0.09722, 0.68333, 0.0593, 0.11111], - "72": [0, 0.68333, 0.00965, 0.11111], - "73": [0, 0.68333, 0.07382, 0], - "74": [0.09722, 0.68333, 0.18472, 0.16667], - "75": [0, 0.68333, 0.01445, 0.05556], - "76": [0, 0.68333, 0, 0.13889], - "77": [0, 0.68333, 0, 0.13889], - "78": [0, 0.68333, 0.14736, 0.08334], - "79": [0, 0.68333, 0.02778, 0.11111], - "80": [0, 0.68333, 0.08222, 0.08334], - "81": [0.09722, 0.68333, 0, 0.11111], - "82": [0, 0.68333, 0, 0.08334], - "83": [0, 0.68333, 0.075, 0.13889], - "84": [0, 0.68333, 0.25417, 0], - "85": [0, 0.68333, 0.09931, 0.08334], - "86": [0, 0.68333, 0.08222, 0], - "87": [0, 0.68333, 0.08222, 0.08334], - "88": [0, 0.68333, 0.14643, 0.13889], - "89": [0.09722, 
0.68333, 0.08222, 0.08334], - "90": [0, 0.68333, 0.07944, 0.13889] - }, - "Fraktur-Regular": { - "33": [0, 0.69141, 0, 0], - "34": [0, 0.69141, 0, 0], - "38": [0, 0.69141, 0, 0], - "39": [0, 0.69141, 0, 0], - "40": [0.24982, 0.74947, 0, 0], - "41": [0.24982, 0.74947, 0, 0], - "42": [0, 0.62119, 0, 0], - "43": [0.08319, 0.58283, 0, 0], - "44": [0, 0.10803, 0, 0], - "45": [0.08319, 0.58283, 0, 0], - "46": [0, 0.10803, 0, 0], - "47": [0.24982, 0.74947, 0, 0], - "48": [0, 0.47534, 0, 0], - "49": [0, 0.47534, 0, 0], - "50": [0, 0.47534, 0, 0], - "51": [0.18906, 0.47534, 0, 0], - "52": [0.18906, 0.47534, 0, 0], - "53": [0.18906, 0.47534, 0, 0], - "54": [0, 0.69141, 0, 0], - "55": [0.18906, 0.47534, 0, 0], - "56": [0, 0.69141, 0, 0], - "57": [0.18906, 0.47534, 0, 0], - "58": [0, 0.47534, 0, 0], - "59": [0.12604, 0.47534, 0, 0], - "61": [-0.13099, 0.36866, 0, 0], - "63": [0, 0.69141, 0, 0], - "65": [0, 0.69141, 0, 0], - "66": [0, 0.69141, 0, 0], - "67": [0, 0.69141, 0, 0], - "68": [0, 0.69141, 0, 0], - "69": [0, 0.69141, 0, 0], - "70": [0.12604, 0.69141, 0, 0], - "71": [0, 0.69141, 0, 0], - "72": [0.06302, 0.69141, 0, 0], - "73": [0, 0.69141, 0, 0], - "74": [0.12604, 0.69141, 0, 0], - "75": [0, 0.69141, 0, 0], - "76": [0, 0.69141, 0, 0], - "77": [0, 0.69141, 0, 0], - "78": [0, 0.69141, 0, 0], - "79": [0, 0.69141, 0, 0], - "80": [0.18906, 0.69141, 0, 0], - "81": [0.03781, 0.69141, 0, 0], - "82": [0, 0.69141, 0, 0], - "83": [0, 0.69141, 0, 0], - "84": [0, 0.69141, 0, 0], - "85": [0, 0.69141, 0, 0], - "86": [0, 0.69141, 0, 0], - "87": [0, 0.69141, 0, 0], - "88": [0, 0.69141, 0, 0], - "89": [0.18906, 0.69141, 0, 0], - "90": [0.12604, 0.69141, 0, 0], - "91": [0.24982, 0.74947, 0, 0], - "93": [0.24982, 0.74947, 0, 0], - "94": [0, 0.69141, 0, 0], - "97": [0, 0.47534, 0, 0], - "98": [0, 0.69141, 0, 0], - "99": [0, 0.47534, 0, 0], - "100": [0, 0.62119, 0, 0], - "101": [0, 0.47534, 0, 0], - "102": [0.18906, 0.69141, 0, 0], - "103": [0.18906, 0.47534, 0, 0], - "104": [0.18906, 0.69141, 0, 0], - "105": [0, 0.69141, 0, 0], - "106": [0, 0.69141, 0, 0], - "107": [0, 0.69141, 0, 0], - "108": [0, 0.69141, 0, 0], - "109": [0, 0.47534, 0, 0], - "110": [0, 0.47534, 0, 0], - "111": [0, 0.47534, 0, 0], - "112": [0.18906, 0.52396, 0, 0], - "113": [0.18906, 0.47534, 0, 0], - "114": [0, 0.47534, 0, 0], - "115": [0, 0.47534, 0, 0], - "116": [0, 0.62119, 0, 0], - "117": [0, 0.47534, 0, 0], - "118": [0, 0.52396, 0, 0], - "119": [0, 0.52396, 0, 0], - "120": [0.18906, 0.47534, 0, 0], - "121": [0.18906, 0.47534, 0, 0], - "122": [0.18906, 0.47534, 0, 0], - "8216": [0, 0.69141, 0, 0], - "8217": [0, 0.69141, 0, 0], - "58112": [0, 0.62119, 0, 0], - "58113": [0, 0.62119, 0, 0], - "58114": [0.18906, 0.69141, 0, 0], - "58115": [0.18906, 0.69141, 0, 0], - "58116": [0.18906, 0.47534, 0, 0], - "58117": [0, 0.69141, 0, 0], - "58118": [0, 0.62119, 0, 0], - "58119": [0, 0.47534, 0, 0] - }, - "Main-Bold": { - "33": [0, 0.69444, 0, 0], - "34": [0, 0.69444, 0, 0], - "35": [0.19444, 0.69444, 0, 0], - "36": [0.05556, 0.75, 0, 0], - "37": [0.05556, 0.75, 0, 0], - "38": [0, 0.69444, 0, 0], - "39": [0, 0.69444, 0, 0], - "40": [0.25, 0.75, 0, 0], - "41": [0.25, 0.75, 0, 0], - "42": [0, 0.75, 0, 0], - "43": [0.13333, 0.63333, 0, 0], - "44": [0.19444, 0.15556, 0, 0], - "45": [0, 0.44444, 0, 0], - "46": [0, 0.15556, 0, 0], - "47": [0.25, 0.75, 0, 0], - "48": [0, 0.64444, 0, 0], - "49": [0, 0.64444, 0, 0], - "50": [0, 0.64444, 0, 0], - "51": [0, 0.64444, 0, 0], - "52": [0, 0.64444, 0, 0], - "53": [0, 0.64444, 0, 0], - "54": [0, 0.64444, 0, 0], - "55": 
[0, 0.64444, 0, 0], - "56": [0, 0.64444, 0, 0], - "57": [0, 0.64444, 0, 0], - "58": [0, 0.44444, 0, 0], - "59": [0.19444, 0.44444, 0, 0], - "60": [0.08556, 0.58556, 0, 0], - "61": [-0.10889, 0.39111, 0, 0], - "62": [0.08556, 0.58556, 0, 0], - "63": [0, 0.69444, 0, 0], - "64": [0, 0.69444, 0, 0], - "65": [0, 0.68611, 0, 0], - "66": [0, 0.68611, 0, 0], - "67": [0, 0.68611, 0, 0], - "68": [0, 0.68611, 0, 0], - "69": [0, 0.68611, 0, 0], - "70": [0, 0.68611, 0, 0], - "71": [0, 0.68611, 0, 0], - "72": [0, 0.68611, 0, 0], - "73": [0, 0.68611, 0, 0], - "74": [0, 0.68611, 0, 0], - "75": [0, 0.68611, 0, 0], - "76": [0, 0.68611, 0, 0], - "77": [0, 0.68611, 0, 0], - "78": [0, 0.68611, 0, 0], - "79": [0, 0.68611, 0, 0], - "80": [0, 0.68611, 0, 0], - "81": [0.19444, 0.68611, 0, 0], - "82": [0, 0.68611, 0, 0], - "83": [0, 0.68611, 0, 0], - "84": [0, 0.68611, 0, 0], - "85": [0, 0.68611, 0, 0], - "86": [0, 0.68611, 0.01597, 0], - "87": [0, 0.68611, 0.01597, 0], - "88": [0, 0.68611, 0, 0], - "89": [0, 0.68611, 0.02875, 0], - "90": [0, 0.68611, 0, 0], - "91": [0.25, 0.75, 0, 0], - "92": [0.25, 0.75, 0, 0], - "93": [0.25, 0.75, 0, 0], - "94": [0, 0.69444, 0, 0], - "95": [0.31, 0.13444, 0.03194, 0], - "96": [0, 0.69444, 0, 0], - "97": [0, 0.44444, 0, 0], - "98": [0, 0.69444, 0, 0], - "99": [0, 0.44444, 0, 0], - "100": [0, 0.69444, 0, 0], - "101": [0, 0.44444, 0, 0], - "102": [0, 0.69444, 0.10903, 0], - "103": [0.19444, 0.44444, 0.01597, 0], - "104": [0, 0.69444, 0, 0], - "105": [0, 0.69444, 0, 0], - "106": [0.19444, 0.69444, 0, 0], - "107": [0, 0.69444, 0, 0], - "108": [0, 0.69444, 0, 0], - "109": [0, 0.44444, 0, 0], - "110": [0, 0.44444, 0, 0], - "111": [0, 0.44444, 0, 0], - "112": [0.19444, 0.44444, 0, 0], - "113": [0.19444, 0.44444, 0, 0], - "114": [0, 0.44444, 0, 0], - "115": [0, 0.44444, 0, 0], - "116": [0, 0.63492, 0, 0], - "117": [0, 0.44444, 0, 0], - "118": [0, 0.44444, 0.01597, 0], - "119": [0, 0.44444, 0.01597, 0], - "120": [0, 0.44444, 0, 0], - "121": [0.19444, 0.44444, 0.01597, 0], - "122": [0, 0.44444, 0, 0], - "123": [0.25, 0.75, 0, 0], - "124": [0.25, 0.75, 0, 0], - "125": [0.25, 0.75, 0, 0], - "126": [0.35, 0.34444, 0, 0], - "168": [0, 0.69444, 0, 0], - "172": [0, 0.44444, 0, 0], - "175": [0, 0.59611, 0, 0], - "176": [0, 0.69444, 0, 0], - "177": [0.13333, 0.63333, 0, 0], - "180": [0, 0.69444, 0, 0], - "215": [0.13333, 0.63333, 0, 0], - "247": [0.13333, 0.63333, 0, 0], - "305": [0, 0.44444, 0, 0], - "567": [0.19444, 0.44444, 0, 0], - "710": [0, 0.69444, 0, 0], - "711": [0, 0.63194, 0, 0], - "713": [0, 0.59611, 0, 0], - "714": [0, 0.69444, 0, 0], - "715": [0, 0.69444, 0, 0], - "728": [0, 0.69444, 0, 0], - "729": [0, 0.69444, 0, 0], - "730": [0, 0.69444, 0, 0], - "732": [0, 0.69444, 0, 0], - "768": [0, 0.69444, 0, 0], - "769": [0, 0.69444, 0, 0], - "770": [0, 0.69444, 0, 0], - "771": [0, 0.69444, 0, 0], - "772": [0, 0.59611, 0, 0], - "774": [0, 0.69444, 0, 0], - "775": [0, 0.69444, 0, 0], - "776": [0, 0.69444, 0, 0], - "778": [0, 0.69444, 0, 0], - "779": [0, 0.69444, 0, 0], - "780": [0, 0.63194, 0, 0], - "824": [0.19444, 0.69444, 0, 0], - "915": [0, 0.68611, 0, 0], - "916": [0, 0.68611, 0, 0], - "920": [0, 0.68611, 0, 0], - "923": [0, 0.68611, 0, 0], - "926": [0, 0.68611, 0, 0], - "928": [0, 0.68611, 0, 0], - "931": [0, 0.68611, 0, 0], - "933": [0, 0.68611, 0, 0], - "934": [0, 0.68611, 0, 0], - "936": [0, 0.68611, 0, 0], - "937": [0, 0.68611, 0, 0], - "8211": [0, 0.44444, 0.03194, 0], - "8212": [0, 0.44444, 0.03194, 0], - "8216": [0, 0.69444, 0, 0], - "8217": [0, 0.69444, 0, 0], - "8220": [0, 
0.69444, 0, 0], - "8221": [0, 0.69444, 0, 0], - "8224": [0.19444, 0.69444, 0, 0], - "8225": [0.19444, 0.69444, 0, 0], - "8242": [0, 0.55556, 0, 0], - "8407": [0, 0.72444, 0.15486, 0], - "8463": [0, 0.69444, 0, 0], - "8465": [0, 0.69444, 0, 0], - "8467": [0, 0.69444, 0, 0], - "8472": [0.19444, 0.44444, 0, 0], - "8476": [0, 0.69444, 0, 0], - "8501": [0, 0.69444, 0, 0], - "8592": [-0.10889, 0.39111, 0, 0], - "8593": [0.19444, 0.69444, 0, 0], - "8594": [-0.10889, 0.39111, 0, 0], - "8595": [0.19444, 0.69444, 0, 0], - "8596": [-0.10889, 0.39111, 0, 0], - "8597": [0.25, 0.75, 0, 0], - "8598": [0.19444, 0.69444, 0, 0], - "8599": [0.19444, 0.69444, 0, 0], - "8600": [0.19444, 0.69444, 0, 0], - "8601": [0.19444, 0.69444, 0, 0], - "8636": [-0.10889, 0.39111, 0, 0], - "8637": [-0.10889, 0.39111, 0, 0], - "8640": [-0.10889, 0.39111, 0, 0], - "8641": [-0.10889, 0.39111, 0, 0], - "8656": [-0.10889, 0.39111, 0, 0], - "8657": [0.19444, 0.69444, 0, 0], - "8658": [-0.10889, 0.39111, 0, 0], - "8659": [0.19444, 0.69444, 0, 0], - "8660": [-0.10889, 0.39111, 0, 0], - "8661": [0.25, 0.75, 0, 0], - "8704": [0, 0.69444, 0, 0], - "8706": [0, 0.69444, 0.06389, 0], - "8707": [0, 0.69444, 0, 0], - "8709": [0.05556, 0.75, 0, 0], - "8711": [0, 0.68611, 0, 0], - "8712": [0.08556, 0.58556, 0, 0], - "8715": [0.08556, 0.58556, 0, 0], - "8722": [0.13333, 0.63333, 0, 0], - "8723": [0.13333, 0.63333, 0, 0], - "8725": [0.25, 0.75, 0, 0], - "8726": [0.25, 0.75, 0, 0], - "8727": [-0.02778, 0.47222, 0, 0], - "8728": [-0.02639, 0.47361, 0, 0], - "8729": [-0.02639, 0.47361, 0, 0], - "8730": [0.18, 0.82, 0, 0], - "8733": [0, 0.44444, 0, 0], - "8734": [0, 0.44444, 0, 0], - "8736": [0, 0.69224, 0, 0], - "8739": [0.25, 0.75, 0, 0], - "8741": [0.25, 0.75, 0, 0], - "8743": [0, 0.55556, 0, 0], - "8744": [0, 0.55556, 0, 0], - "8745": [0, 0.55556, 0, 0], - "8746": [0, 0.55556, 0, 0], - "8747": [0.19444, 0.69444, 0.12778, 0], - "8764": [-0.10889, 0.39111, 0, 0], - "8768": [0.19444, 0.69444, 0, 0], - "8771": [0.00222, 0.50222, 0, 0], - "8776": [0.02444, 0.52444, 0, 0], - "8781": [0.00222, 0.50222, 0, 0], - "8801": [0.00222, 0.50222, 0, 0], - "8804": [0.19667, 0.69667, 0, 0], - "8805": [0.19667, 0.69667, 0, 0], - "8810": [0.08556, 0.58556, 0, 0], - "8811": [0.08556, 0.58556, 0, 0], - "8826": [0.08556, 0.58556, 0, 0], - "8827": [0.08556, 0.58556, 0, 0], - "8834": [0.08556, 0.58556, 0, 0], - "8835": [0.08556, 0.58556, 0, 0], - "8838": [0.19667, 0.69667, 0, 0], - "8839": [0.19667, 0.69667, 0, 0], - "8846": [0, 0.55556, 0, 0], - "8849": [0.19667, 0.69667, 0, 0], - "8850": [0.19667, 0.69667, 0, 0], - "8851": [0, 0.55556, 0, 0], - "8852": [0, 0.55556, 0, 0], - "8853": [0.13333, 0.63333, 0, 0], - "8854": [0.13333, 0.63333, 0, 0], - "8855": [0.13333, 0.63333, 0, 0], - "8856": [0.13333, 0.63333, 0, 0], - "8857": [0.13333, 0.63333, 0, 0], - "8866": [0, 0.69444, 0, 0], - "8867": [0, 0.69444, 0, 0], - "8868": [0, 0.69444, 0, 0], - "8869": [0, 0.69444, 0, 0], - "8900": [-0.02639, 0.47361, 0, 0], - "8901": [-0.02639, 0.47361, 0, 0], - "8902": [-0.02778, 0.47222, 0, 0], - "8968": [0.25, 0.75, 0, 0], - "8969": [0.25, 0.75, 0, 0], - "8970": [0.25, 0.75, 0, 0], - "8971": [0.25, 0.75, 0, 0], - "8994": [-0.13889, 0.36111, 0, 0], - "8995": [-0.13889, 0.36111, 0, 0], - "9651": [0.19444, 0.69444, 0, 0], - "9657": [-0.02778, 0.47222, 0, 0], - "9661": [0.19444, 0.69444, 0, 0], - "9667": [-0.02778, 0.47222, 0, 0], - "9711": [0.19444, 0.69444, 0, 0], - "9824": [0.12963, 0.69444, 0, 0], - "9825": [0.12963, 0.69444, 0, 0], - "9826": [0.12963, 0.69444, 0, 0], - "9827": 
[0.12963, 0.69444, 0, 0], - "9837": [0, 0.75, 0, 0], - "9838": [0.19444, 0.69444, 0, 0], - "9839": [0.19444, 0.69444, 0, 0], - "10216": [0.25, 0.75, 0, 0], - "10217": [0.25, 0.75, 0, 0], - "10815": [0, 0.68611, 0, 0], - "10927": [0.19667, 0.69667, 0, 0], - "10928": [0.19667, 0.69667, 0, 0] - }, - "Main-Italic": { - "33": [0, 0.69444, 0.12417, 0], - "34": [0, 0.69444, 0.06961, 0], - "35": [0.19444, 0.69444, 0.06616, 0], - "37": [0.05556, 0.75, 0.13639, 0], - "38": [0, 0.69444, 0.09694, 0], - "39": [0, 0.69444, 0.12417, 0], - "40": [0.25, 0.75, 0.16194, 0], - "41": [0.25, 0.75, 0.03694, 0], - "42": [0, 0.75, 0.14917, 0], - "43": [0.05667, 0.56167, 0.03694, 0], - "44": [0.19444, 0.10556, 0, 0], - "45": [0, 0.43056, 0.02826, 0], - "46": [0, 0.10556, 0, 0], - "47": [0.25, 0.75, 0.16194, 0], - "48": [0, 0.64444, 0.13556, 0], - "49": [0, 0.64444, 0.13556, 0], - "50": [0, 0.64444, 0.13556, 0], - "51": [0, 0.64444, 0.13556, 0], - "52": [0.19444, 0.64444, 0.13556, 0], - "53": [0, 0.64444, 0.13556, 0], - "54": [0, 0.64444, 0.13556, 0], - "55": [0.19444, 0.64444, 0.13556, 0], - "56": [0, 0.64444, 0.13556, 0], - "57": [0, 0.64444, 0.13556, 0], - "58": [0, 0.43056, 0.0582, 0], - "59": [0.19444, 0.43056, 0.0582, 0], - "61": [-0.13313, 0.36687, 0.06616, 0], - "63": [0, 0.69444, 0.1225, 0], - "64": [0, 0.69444, 0.09597, 0], - "65": [0, 0.68333, 0, 0], - "66": [0, 0.68333, 0.10257, 0], - "67": [0, 0.68333, 0.14528, 0], - "68": [0, 0.68333, 0.09403, 0], - "69": [0, 0.68333, 0.12028, 0], - "70": [0, 0.68333, 0.13305, 0], - "71": [0, 0.68333, 0.08722, 0], - "72": [0, 0.68333, 0.16389, 0], - "73": [0, 0.68333, 0.15806, 0], - "74": [0, 0.68333, 0.14028, 0], - "75": [0, 0.68333, 0.14528, 0], - "76": [0, 0.68333, 0, 0], - "77": [0, 0.68333, 0.16389, 0], - "78": [0, 0.68333, 0.16389, 0], - "79": [0, 0.68333, 0.09403, 0], - "80": [0, 0.68333, 0.10257, 0], - "81": [0.19444, 0.68333, 0.09403, 0], - "82": [0, 0.68333, 0.03868, 0], - "83": [0, 0.68333, 0.11972, 0], - "84": [0, 0.68333, 0.13305, 0], - "85": [0, 0.68333, 0.16389, 0], - "86": [0, 0.68333, 0.18361, 0], - "87": [0, 0.68333, 0.18361, 0], - "88": [0, 0.68333, 0.15806, 0], - "89": [0, 0.68333, 0.19383, 0], - "90": [0, 0.68333, 0.14528, 0], - "91": [0.25, 0.75, 0.1875, 0], - "93": [0.25, 0.75, 0.10528, 0], - "94": [0, 0.69444, 0.06646, 0], - "95": [0.31, 0.12056, 0.09208, 0], - "97": [0, 0.43056, 0.07671, 0], - "98": [0, 0.69444, 0.06312, 0], - "99": [0, 0.43056, 0.05653, 0], - "100": [0, 0.69444, 0.10333, 0], - "101": [0, 0.43056, 0.07514, 0], - "102": [0.19444, 0.69444, 0.21194, 0], - "103": [0.19444, 0.43056, 0.08847, 0], - "104": [0, 0.69444, 0.07671, 0], - "105": [0, 0.65536, 0.1019, 0], - "106": [0.19444, 0.65536, 0.14467, 0], - "107": [0, 0.69444, 0.10764, 0], - "108": [0, 0.69444, 0.10333, 0], - "109": [0, 0.43056, 0.07671, 0], - "110": [0, 0.43056, 0.07671, 0], - "111": [0, 0.43056, 0.06312, 0], - "112": [0.19444, 0.43056, 0.06312, 0], - "113": [0.19444, 0.43056, 0.08847, 0], - "114": [0, 0.43056, 0.10764, 0], - "115": [0, 0.43056, 0.08208, 0], - "116": [0, 0.61508, 0.09486, 0], - "117": [0, 0.43056, 0.07671, 0], - "118": [0, 0.43056, 0.10764, 0], - "119": [0, 0.43056, 0.10764, 0], - "120": [0, 0.43056, 0.12042, 0], - "121": [0.19444, 0.43056, 0.08847, 0], - "122": [0, 0.43056, 0.12292, 0], - "126": [0.35, 0.31786, 0.11585, 0], - "163": [0, 0.69444, 0, 0], - "305": [0, 0.43056, 0, 0.02778], - "567": [0.19444, 0.43056, 0, 0.08334], - "768": [0, 0.69444, 0, 0], - "769": [0, 0.69444, 0.09694, 0], - "770": [0, 0.69444, 0.06646, 0], - "771": [0, 0.66786, 
0.11585, 0], - "772": [0, 0.56167, 0.10333, 0], - "774": [0, 0.69444, 0.10806, 0], - "775": [0, 0.66786, 0.11752, 0], - "776": [0, 0.66786, 0.10474, 0], - "778": [0, 0.69444, 0, 0], - "779": [0, 0.69444, 0.1225, 0], - "780": [0, 0.62847, 0.08295, 0], - "915": [0, 0.68333, 0.13305, 0], - "916": [0, 0.68333, 0, 0], - "920": [0, 0.68333, 0.09403, 0], - "923": [0, 0.68333, 0, 0], - "926": [0, 0.68333, 0.15294, 0], - "928": [0, 0.68333, 0.16389, 0], - "931": [0, 0.68333, 0.12028, 0], - "933": [0, 0.68333, 0.11111, 0], - "934": [0, 0.68333, 0.05986, 0], - "936": [0, 0.68333, 0.11111, 0], - "937": [0, 0.68333, 0.10257, 0], - "8211": [0, 0.43056, 0.09208, 0], - "8212": [0, 0.43056, 0.09208, 0], - "8216": [0, 0.69444, 0.12417, 0], - "8217": [0, 0.69444, 0.12417, 0], - "8220": [0, 0.69444, 0.1685, 0], - "8221": [0, 0.69444, 0.06961, 0], - "8463": [0, 0.68889, 0, 0] - }, - "Main-Regular": { - "32": [0, 0, 0, 0], - "33": [0, 0.69444, 0, 0], - "34": [0, 0.69444, 0, 0], - "35": [0.19444, 0.69444, 0, 0], - "36": [0.05556, 0.75, 0, 0], - "37": [0.05556, 0.75, 0, 0], - "38": [0, 0.69444, 0, 0], - "39": [0, 0.69444, 0, 0], - "40": [0.25, 0.75, 0, 0], - "41": [0.25, 0.75, 0, 0], - "42": [0, 0.75, 0, 0], - "43": [0.08333, 0.58333, 0, 0], - "44": [0.19444, 0.10556, 0, 0], - "45": [0, 0.43056, 0, 0], - "46": [0, 0.10556, 0, 0], - "47": [0.25, 0.75, 0, 0], - "48": [0, 0.64444, 0, 0], - "49": [0, 0.64444, 0, 0], - "50": [0, 0.64444, 0, 0], - "51": [0, 0.64444, 0, 0], - "52": [0, 0.64444, 0, 0], - "53": [0, 0.64444, 0, 0], - "54": [0, 0.64444, 0, 0], - "55": [0, 0.64444, 0, 0], - "56": [0, 0.64444, 0, 0], - "57": [0, 0.64444, 0, 0], - "58": [0, 0.43056, 0, 0], - "59": [0.19444, 0.43056, 0, 0], - "60": [0.0391, 0.5391, 0, 0], - "61": [-0.13313, 0.36687, 0, 0], - "62": [0.0391, 0.5391, 0, 0], - "63": [0, 0.69444, 0, 0], - "64": [0, 0.69444, 0, 0], - "65": [0, 0.68333, 0, 0], - "66": [0, 0.68333, 0, 0], - "67": [0, 0.68333, 0, 0], - "68": [0, 0.68333, 0, 0], - "69": [0, 0.68333, 0, 0], - "70": [0, 0.68333, 0, 0], - "71": [0, 0.68333, 0, 0], - "72": [0, 0.68333, 0, 0], - "73": [0, 0.68333, 0, 0], - "74": [0, 0.68333, 0, 0], - "75": [0, 0.68333, 0, 0], - "76": [0, 0.68333, 0, 0], - "77": [0, 0.68333, 0, 0], - "78": [0, 0.68333, 0, 0], - "79": [0, 0.68333, 0, 0], - "80": [0, 0.68333, 0, 0], - "81": [0.19444, 0.68333, 0, 0], - "82": [0, 0.68333, 0, 0], - "83": [0, 0.68333, 0, 0], - "84": [0, 0.68333, 0, 0], - "85": [0, 0.68333, 0, 0], - "86": [0, 0.68333, 0.01389, 0], - "87": [0, 0.68333, 0.01389, 0], - "88": [0, 0.68333, 0, 0], - "89": [0, 0.68333, 0.025, 0], - "90": [0, 0.68333, 0, 0], - "91": [0.25, 0.75, 0, 0], - "92": [0.25, 0.75, 0, 0], - "93": [0.25, 0.75, 0, 0], - "94": [0, 0.69444, 0, 0], - "95": [0.31, 0.12056, 0.02778, 0], - "96": [0, 0.69444, 0, 0], - "97": [0, 0.43056, 0, 0], - "98": [0, 0.69444, 0, 0], - "99": [0, 0.43056, 0, 0], - "100": [0, 0.69444, 0, 0], - "101": [0, 0.43056, 0, 0], - "102": [0, 0.69444, 0.07778, 0], - "103": [0.19444, 0.43056, 0.01389, 0], - "104": [0, 0.69444, 0, 0], - "105": [0, 0.66786, 0, 0], - "106": [0.19444, 0.66786, 0, 0], - "107": [0, 0.69444, 0, 0], - "108": [0, 0.69444, 0, 0], - "109": [0, 0.43056, 0, 0], - "110": [0, 0.43056, 0, 0], - "111": [0, 0.43056, 0, 0], - "112": [0.19444, 0.43056, 0, 0], - "113": [0.19444, 0.43056, 0, 0], - "114": [0, 0.43056, 0, 0], - "115": [0, 0.43056, 0, 0], - "116": [0, 0.61508, 0, 0], - "117": [0, 0.43056, 0, 0], - "118": [0, 0.43056, 0.01389, 0], - "119": [0, 0.43056, 0.01389, 0], - "120": [0, 0.43056, 0, 0], - "121": [0.19444, 0.43056, 
0.01389, 0], - "122": [0, 0.43056, 0, 0], - "123": [0.25, 0.75, 0, 0], - "124": [0.25, 0.75, 0, 0], - "125": [0.25, 0.75, 0, 0], - "126": [0.35, 0.31786, 0, 0], - "160": [0, 0, 0, 0], - "168": [0, 0.66786, 0, 0], - "172": [0, 0.43056, 0, 0], - "175": [0, 0.56778, 0, 0], - "176": [0, 0.69444, 0, 0], - "177": [0.08333, 0.58333, 0, 0], - "180": [0, 0.69444, 0, 0], - "215": [0.08333, 0.58333, 0, 0], - "247": [0.08333, 0.58333, 0, 0], - "305": [0, 0.43056, 0, 0], - "567": [0.19444, 0.43056, 0, 0], - "710": [0, 0.69444, 0, 0], - "711": [0, 0.62847, 0, 0], - "713": [0, 0.56778, 0, 0], - "714": [0, 0.69444, 0, 0], - "715": [0, 0.69444, 0, 0], - "728": [0, 0.69444, 0, 0], - "729": [0, 0.66786, 0, 0], - "730": [0, 0.69444, 0, 0], - "732": [0, 0.66786, 0, 0], - "768": [0, 0.69444, 0, 0], - "769": [0, 0.69444, 0, 0], - "770": [0, 0.69444, 0, 0], - "771": [0, 0.66786, 0, 0], - "772": [0, 0.56778, 0, 0], - "774": [0, 0.69444, 0, 0], - "775": [0, 0.66786, 0, 0], - "776": [0, 0.66786, 0, 0], - "778": [0, 0.69444, 0, 0], - "779": [0, 0.69444, 0, 0], - "780": [0, 0.62847, 0, 0], - "824": [0.19444, 0.69444, 0, 0], - "915": [0, 0.68333, 0, 0], - "916": [0, 0.68333, 0, 0], - "920": [0, 0.68333, 0, 0], - "923": [0, 0.68333, 0, 0], - "926": [0, 0.68333, 0, 0], - "928": [0, 0.68333, 0, 0], - "931": [0, 0.68333, 0, 0], - "933": [0, 0.68333, 0, 0], - "934": [0, 0.68333, 0, 0], - "936": [0, 0.68333, 0, 0], - "937": [0, 0.68333, 0, 0], - "8211": [0, 0.43056, 0.02778, 0], - "8212": [0, 0.43056, 0.02778, 0], - "8216": [0, 0.69444, 0, 0], - "8217": [0, 0.69444, 0, 0], - "8220": [0, 0.69444, 0, 0], - "8221": [0, 0.69444, 0, 0], - "8224": [0.19444, 0.69444, 0, 0], - "8225": [0.19444, 0.69444, 0, 0], - "8230": [0, 0.12, 0, 0], - "8242": [0, 0.55556, 0, 0], - "8407": [0, 0.71444, 0.15382, 0], - "8463": [0, 0.68889, 0, 0], - "8465": [0, 0.69444, 0, 0], - "8467": [0, 0.69444, 0, 0.11111], - "8472": [0.19444, 0.43056, 0, 0.11111], - "8476": [0, 0.69444, 0, 0], - "8501": [0, 0.69444, 0, 0], - "8592": [-0.13313, 0.36687, 0, 0], - "8593": [0.19444, 0.69444, 0, 0], - "8594": [-0.13313, 0.36687, 0, 0], - "8595": [0.19444, 0.69444, 0, 0], - "8596": [-0.13313, 0.36687, 0, 0], - "8597": [0.25, 0.75, 0, 0], - "8598": [0.19444, 0.69444, 0, 0], - "8599": [0.19444, 0.69444, 0, 0], - "8600": [0.19444, 0.69444, 0, 0], - "8601": [0.19444, 0.69444, 0, 0], - "8614": [0.011, 0.511, 0, 0], - "8617": [0.011, 0.511, 0, 0], - "8618": [0.011, 0.511, 0, 0], - "8636": [-0.13313, 0.36687, 0, 0], - "8637": [-0.13313, 0.36687, 0, 0], - "8640": [-0.13313, 0.36687, 0, 0], - "8641": [-0.13313, 0.36687, 0, 0], - "8652": [0.011, 0.671, 0, 0], - "8656": [-0.13313, 0.36687, 0, 0], - "8657": [0.19444, 0.69444, 0, 0], - "8658": [-0.13313, 0.36687, 0, 0], - "8659": [0.19444, 0.69444, 0, 0], - "8660": [-0.13313, 0.36687, 0, 0], - "8661": [0.25, 0.75, 0, 0], - "8704": [0, 0.69444, 0, 0], - "8706": [0, 0.69444, 0.05556, 0.08334], - "8707": [0, 0.69444, 0, 0], - "8709": [0.05556, 0.75, 0, 0], - "8711": [0, 0.68333, 0, 0], - "8712": [0.0391, 0.5391, 0, 0], - "8715": [0.0391, 0.5391, 0, 0], - "8722": [0.08333, 0.58333, 0, 0], - "8723": [0.08333, 0.58333, 0, 0], - "8725": [0.25, 0.75, 0, 0], - "8726": [0.25, 0.75, 0, 0], - "8727": [-0.03472, 0.46528, 0, 0], - "8728": [-0.05555, 0.44445, 0, 0], - "8729": [-0.05555, 0.44445, 0, 0], - "8730": [0.2, 0.8, 0, 0], - "8733": [0, 0.43056, 0, 0], - "8734": [0, 0.43056, 0, 0], - "8736": [0, 0.69224, 0, 0], - "8739": [0.25, 0.75, 0, 0], - "8741": [0.25, 0.75, 0, 0], - "8743": [0, 0.55556, 0, 0], - "8744": [0, 0.55556, 0, 0], - 
"8745": [0, 0.55556, 0, 0], - "8746": [0, 0.55556, 0, 0], - "8747": [0.19444, 0.69444, 0.11111, 0], - "8764": [-0.13313, 0.36687, 0, 0], - "8768": [0.19444, 0.69444, 0, 0], - "8771": [-0.03625, 0.46375, 0, 0], - "8773": [-0.022, 0.589, 0, 0], - "8776": [-0.01688, 0.48312, 0, 0], - "8781": [-0.03625, 0.46375, 0, 0], - "8784": [-0.133, 0.67, 0, 0], - "8800": [0.215, 0.716, 0, 0], - "8801": [-0.03625, 0.46375, 0, 0], - "8804": [0.13597, 0.63597, 0, 0], - "8805": [0.13597, 0.63597, 0, 0], - "8810": [0.0391, 0.5391, 0, 0], - "8811": [0.0391, 0.5391, 0, 0], - "8826": [0.0391, 0.5391, 0, 0], - "8827": [0.0391, 0.5391, 0, 0], - "8834": [0.0391, 0.5391, 0, 0], - "8835": [0.0391, 0.5391, 0, 0], - "8838": [0.13597, 0.63597, 0, 0], - "8839": [0.13597, 0.63597, 0, 0], - "8846": [0, 0.55556, 0, 0], - "8849": [0.13597, 0.63597, 0, 0], - "8850": [0.13597, 0.63597, 0, 0], - "8851": [0, 0.55556, 0, 0], - "8852": [0, 0.55556, 0, 0], - "8853": [0.08333, 0.58333, 0, 0], - "8854": [0.08333, 0.58333, 0, 0], - "8855": [0.08333, 0.58333, 0, 0], - "8856": [0.08333, 0.58333, 0, 0], - "8857": [0.08333, 0.58333, 0, 0], - "8866": [0, 0.69444, 0, 0], - "8867": [0, 0.69444, 0, 0], - "8868": [0, 0.69444, 0, 0], - "8869": [0, 0.69444, 0, 0], - "8872": [0.249, 0.75, 0, 0], - "8900": [-0.05555, 0.44445, 0, 0], - "8901": [-0.05555, 0.44445, 0, 0], - "8902": [-0.03472, 0.46528, 0, 0], - "8904": [0.005, 0.505, 0, 0], - "8942": [0.03, 0.9, 0, 0], - "8943": [-0.19, 0.31, 0, 0], - "8945": [-0.1, 0.82, 0, 0], - "8968": [0.25, 0.75, 0, 0], - "8969": [0.25, 0.75, 0, 0], - "8970": [0.25, 0.75, 0, 0], - "8971": [0.25, 0.75, 0, 0], - "8994": [-0.14236, 0.35764, 0, 0], - "8995": [-0.14236, 0.35764, 0, 0], - "9136": [0.244, 0.744, 0, 0], - "9137": [0.244, 0.744, 0, 0], - "9651": [0.19444, 0.69444, 0, 0], - "9657": [-0.03472, 0.46528, 0, 0], - "9661": [0.19444, 0.69444, 0, 0], - "9667": [-0.03472, 0.46528, 0, 0], - "9711": [0.19444, 0.69444, 0, 0], - "9824": [0.12963, 0.69444, 0, 0], - "9825": [0.12963, 0.69444, 0, 0], - "9826": [0.12963, 0.69444, 0, 0], - "9827": [0.12963, 0.69444, 0, 0], - "9837": [0, 0.75, 0, 0], - "9838": [0.19444, 0.69444, 0, 0], - "9839": [0.19444, 0.69444, 0, 0], - "10216": [0.25, 0.75, 0, 0], - "10217": [0.25, 0.75, 0, 0], - "10222": [0.244, 0.744, 0, 0], - "10223": [0.244, 0.744, 0, 0], - "10229": [0.011, 0.511, 0, 0], - "10230": [0.011, 0.511, 0, 0], - "10231": [0.011, 0.511, 0, 0], - "10232": [0.024, 0.525, 0, 0], - "10233": [0.024, 0.525, 0, 0], - "10234": [0.024, 0.525, 0, 0], - "10236": [0.011, 0.511, 0, 0], - "10815": [0, 0.68333, 0, 0], - "10927": [0.13597, 0.63597, 0, 0], - "10928": [0.13597, 0.63597, 0, 0] - }, - "Math-BoldItalic": { - "47": [0.19444, 0.69444, 0, 0], - "65": [0, 0.68611, 0, 0], - "66": [0, 0.68611, 0.04835, 0], - "67": [0, 0.68611, 0.06979, 0], - "68": [0, 0.68611, 0.03194, 0], - "69": [0, 0.68611, 0.05451, 0], - "70": [0, 0.68611, 0.15972, 0], - "71": [0, 0.68611, 0, 0], - "72": [0, 0.68611, 0.08229, 0], - "73": [0, 0.68611, 0.07778, 0], - "74": [0, 0.68611, 0.10069, 0], - "75": [0, 0.68611, 0.06979, 0], - "76": [0, 0.68611, 0, 0], - "77": [0, 0.68611, 0.11424, 0], - "78": [0, 0.68611, 0.11424, 0], - "79": [0, 0.68611, 0.03194, 0], - "80": [0, 0.68611, 0.15972, 0], - "81": [0.19444, 0.68611, 0, 0], - "82": [0, 0.68611, 0.00421, 0], - "83": [0, 0.68611, 0.05382, 0], - "84": [0, 0.68611, 0.15972, 0], - "85": [0, 0.68611, 0.11424, 0], - "86": [0, 0.68611, 0.25555, 0], - "87": [0, 0.68611, 0.15972, 0], - "88": [0, 0.68611, 0.07778, 0], - "89": [0, 0.68611, 0.25555, 0], - "90": [0, 0.68611, 
0.06979, 0], - "97": [0, 0.44444, 0, 0], - "98": [0, 0.69444, 0, 0], - "99": [0, 0.44444, 0, 0], - "100": [0, 0.69444, 0, 0], - "101": [0, 0.44444, 0, 0], - "102": [0.19444, 0.69444, 0.11042, 0], - "103": [0.19444, 0.44444, 0.03704, 0], - "104": [0, 0.69444, 0, 0], - "105": [0, 0.69326, 0, 0], - "106": [0.19444, 0.69326, 0.0622, 0], - "107": [0, 0.69444, 0.01852, 0], - "108": [0, 0.69444, 0.0088, 0], - "109": [0, 0.44444, 0, 0], - "110": [0, 0.44444, 0, 0], - "111": [0, 0.44444, 0, 0], - "112": [0.19444, 0.44444, 0, 0], - "113": [0.19444, 0.44444, 0.03704, 0], - "114": [0, 0.44444, 0.03194, 0], - "115": [0, 0.44444, 0, 0], - "116": [0, 0.63492, 0, 0], - "117": [0, 0.44444, 0, 0], - "118": [0, 0.44444, 0.03704, 0], - "119": [0, 0.44444, 0.02778, 0], - "120": [0, 0.44444, 0, 0], - "121": [0.19444, 0.44444, 0.03704, 0], - "122": [0, 0.44444, 0.04213, 0], - "915": [0, 0.68611, 0.15972, 0], - "916": [0, 0.68611, 0, 0], - "920": [0, 0.68611, 0.03194, 0], - "923": [0, 0.68611, 0, 0], - "926": [0, 0.68611, 0.07458, 0], - "928": [0, 0.68611, 0.08229, 0], - "931": [0, 0.68611, 0.05451, 0], - "933": [0, 0.68611, 0.15972, 0], - "934": [0, 0.68611, 0, 0], - "936": [0, 0.68611, 0.11653, 0], - "937": [0, 0.68611, 0.04835, 0], - "945": [0, 0.44444, 0, 0], - "946": [0.19444, 0.69444, 0.03403, 0], - "947": [0.19444, 0.44444, 0.06389, 0], - "948": [0, 0.69444, 0.03819, 0], - "949": [0, 0.44444, 0, 0], - "950": [0.19444, 0.69444, 0.06215, 0], - "951": [0.19444, 0.44444, 0.03704, 0], - "952": [0, 0.69444, 0.03194, 0], - "953": [0, 0.44444, 0, 0], - "954": [0, 0.44444, 0, 0], - "955": [0, 0.69444, 0, 0], - "956": [0.19444, 0.44444, 0, 0], - "957": [0, 0.44444, 0.06898, 0], - "958": [0.19444, 0.69444, 0.03021, 0], - "959": [0, 0.44444, 0, 0], - "960": [0, 0.44444, 0.03704, 0], - "961": [0.19444, 0.44444, 0, 0], - "962": [0.09722, 0.44444, 0.07917, 0], - "963": [0, 0.44444, 0.03704, 0], - "964": [0, 0.44444, 0.13472, 0], - "965": [0, 0.44444, 0.03704, 0], - "966": [0.19444, 0.44444, 0, 0], - "967": [0.19444, 0.44444, 0, 0], - "968": [0.19444, 0.69444, 0.03704, 0], - "969": [0, 0.44444, 0.03704, 0], - "977": [0, 0.69444, 0, 0], - "981": [0.19444, 0.69444, 0, 0], - "982": [0, 0.44444, 0.03194, 0], - "1009": [0.19444, 0.44444, 0, 0], - "1013": [0, 0.44444, 0, 0] - }, - "Math-Italic": { - "47": [0.19444, 0.69444, 0, 0], - "65": [0, 0.68333, 0, 0.13889], - "66": [0, 0.68333, 0.05017, 0.08334], - "67": [0, 0.68333, 0.07153, 0.08334], - "68": [0, 0.68333, 0.02778, 0.05556], - "69": [0, 0.68333, 0.05764, 0.08334], - "70": [0, 0.68333, 0.13889, 0.08334], - "71": [0, 0.68333, 0, 0.08334], - "72": [0, 0.68333, 0.08125, 0.05556], - "73": [0, 0.68333, 0.07847, 0.11111], - "74": [0, 0.68333, 0.09618, 0.16667], - "75": [0, 0.68333, 0.07153, 0.05556], - "76": [0, 0.68333, 0, 0.02778], - "77": [0, 0.68333, 0.10903, 0.08334], - "78": [0, 0.68333, 0.10903, 0.08334], - "79": [0, 0.68333, 0.02778, 0.08334], - "80": [0, 0.68333, 0.13889, 0.08334], - "81": [0.19444, 0.68333, 0, 0.08334], - "82": [0, 0.68333, 0.00773, 0.08334], - "83": [0, 0.68333, 0.05764, 0.08334], - "84": [0, 0.68333, 0.13889, 0.08334], - "85": [0, 0.68333, 0.10903, 0.02778], - "86": [0, 0.68333, 0.22222, 0], - "87": [0, 0.68333, 0.13889, 0], - "88": [0, 0.68333, 0.07847, 0.08334], - "89": [0, 0.68333, 0.22222, 0], - "90": [0, 0.68333, 0.07153, 0.08334], - "97": [0, 0.43056, 0, 0], - "98": [0, 0.69444, 0, 0], - "99": [0, 0.43056, 0, 0.05556], - "100": [0, 0.69444, 0, 0.16667], - "101": [0, 0.43056, 0, 0.05556], - "102": [0.19444, 0.69444, 0.10764, 0.16667], - "103": 
[0.19444, 0.43056, 0.03588, 0.02778], - "104": [0, 0.69444, 0, 0], - "105": [0, 0.65952, 0, 0], - "106": [0.19444, 0.65952, 0.05724, 0], - "107": [0, 0.69444, 0.03148, 0], - "108": [0, 0.69444, 0.01968, 0.08334], - "109": [0, 0.43056, 0, 0], - "110": [0, 0.43056, 0, 0], - "111": [0, 0.43056, 0, 0.05556], - "112": [0.19444, 0.43056, 0, 0.08334], - "113": [0.19444, 0.43056, 0.03588, 0.08334], - "114": [0, 0.43056, 0.02778, 0.05556], - "115": [0, 0.43056, 0, 0.05556], - "116": [0, 0.61508, 0, 0.08334], - "117": [0, 0.43056, 0, 0.02778], - "118": [0, 0.43056, 0.03588, 0.02778], - "119": [0, 0.43056, 0.02691, 0.08334], - "120": [0, 0.43056, 0, 0.02778], - "121": [0.19444, 0.43056, 0.03588, 0.05556], - "122": [0, 0.43056, 0.04398, 0.05556], - "915": [0, 0.68333, 0.13889, 0.08334], - "916": [0, 0.68333, 0, 0.16667], - "920": [0, 0.68333, 0.02778, 0.08334], - "923": [0, 0.68333, 0, 0.16667], - "926": [0, 0.68333, 0.07569, 0.08334], - "928": [0, 0.68333, 0.08125, 0.05556], - "931": [0, 0.68333, 0.05764, 0.08334], - "933": [0, 0.68333, 0.13889, 0.05556], - "934": [0, 0.68333, 0, 0.08334], - "936": [0, 0.68333, 0.11, 0.05556], - "937": [0, 0.68333, 0.05017, 0.08334], - "945": [0, 0.43056, 0.0037, 0.02778], - "946": [0.19444, 0.69444, 0.05278, 0.08334], - "947": [0.19444, 0.43056, 0.05556, 0], - "948": [0, 0.69444, 0.03785, 0.05556], - "949": [0, 0.43056, 0, 0.08334], - "950": [0.19444, 0.69444, 0.07378, 0.08334], - "951": [0.19444, 0.43056, 0.03588, 0.05556], - "952": [0, 0.69444, 0.02778, 0.08334], - "953": [0, 0.43056, 0, 0.05556], - "954": [0, 0.43056, 0, 0], - "955": [0, 0.69444, 0, 0], - "956": [0.19444, 0.43056, 0, 0.02778], - "957": [0, 0.43056, 0.06366, 0.02778], - "958": [0.19444, 0.69444, 0.04601, 0.11111], - "959": [0, 0.43056, 0, 0.05556], - "960": [0, 0.43056, 0.03588, 0], - "961": [0.19444, 0.43056, 0, 0.08334], - "962": [0.09722, 0.43056, 0.07986, 0.08334], - "963": [0, 0.43056, 0.03588, 0], - "964": [0, 0.43056, 0.1132, 0.02778], - "965": [0, 0.43056, 0.03588, 0.02778], - "966": [0.19444, 0.43056, 0, 0.08334], - "967": [0.19444, 0.43056, 0, 0.05556], - "968": [0.19444, 0.69444, 0.03588, 0.11111], - "969": [0, 0.43056, 0.03588, 0], - "977": [0, 0.69444, 0, 0.08334], - "981": [0.19444, 0.69444, 0, 0.08334], - "982": [0, 0.43056, 0.02778, 0], - "1009": [0.19444, 0.43056, 0, 0.08334], - "1013": [0, 0.43056, 0, 0.05556] - }, - "Math-Regular": { - "65": [0, 0.68333, 0, 0.13889], - "66": [0, 0.68333, 0.05017, 0.08334], - "67": [0, 0.68333, 0.07153, 0.08334], - "68": [0, 0.68333, 0.02778, 0.05556], - "69": [0, 0.68333, 0.05764, 0.08334], - "70": [0, 0.68333, 0.13889, 0.08334], - "71": [0, 0.68333, 0, 0.08334], - "72": [0, 0.68333, 0.08125, 0.05556], - "73": [0, 0.68333, 0.07847, 0.11111], - "74": [0, 0.68333, 0.09618, 0.16667], - "75": [0, 0.68333, 0.07153, 0.05556], - "76": [0, 0.68333, 0, 0.02778], - "77": [0, 0.68333, 0.10903, 0.08334], - "78": [0, 0.68333, 0.10903, 0.08334], - "79": [0, 0.68333, 0.02778, 0.08334], - "80": [0, 0.68333, 0.13889, 0.08334], - "81": [0.19444, 0.68333, 0, 0.08334], - "82": [0, 0.68333, 0.00773, 0.08334], - "83": [0, 0.68333, 0.05764, 0.08334], - "84": [0, 0.68333, 0.13889, 0.08334], - "85": [0, 0.68333, 0.10903, 0.02778], - "86": [0, 0.68333, 0.22222, 0], - "87": [0, 0.68333, 0.13889, 0], - "88": [0, 0.68333, 0.07847, 0.08334], - "89": [0, 0.68333, 0.22222, 0], - "90": [0, 0.68333, 0.07153, 0.08334], - "97": [0, 0.43056, 0, 0], - "98": [0, 0.69444, 0, 0], - "99": [0, 0.43056, 0, 0.05556], - "100": [0, 0.69444, 0, 0.16667], - "101": [0, 0.43056, 0, 0.05556], - 
"102": [0.19444, 0.69444, 0.10764, 0.16667], - "103": [0.19444, 0.43056, 0.03588, 0.02778], - "104": [0, 0.69444, 0, 0], - "105": [0, 0.65952, 0, 0], - "106": [0.19444, 0.65952, 0.05724, 0], - "107": [0, 0.69444, 0.03148, 0], - "108": [0, 0.69444, 0.01968, 0.08334], - "109": [0, 0.43056, 0, 0], - "110": [0, 0.43056, 0, 0], - "111": [0, 0.43056, 0, 0.05556], - "112": [0.19444, 0.43056, 0, 0.08334], - "113": [0.19444, 0.43056, 0.03588, 0.08334], - "114": [0, 0.43056, 0.02778, 0.05556], - "115": [0, 0.43056, 0, 0.05556], - "116": [0, 0.61508, 0, 0.08334], - "117": [0, 0.43056, 0, 0.02778], - "118": [0, 0.43056, 0.03588, 0.02778], - "119": [0, 0.43056, 0.02691, 0.08334], - "120": [0, 0.43056, 0, 0.02778], - "121": [0.19444, 0.43056, 0.03588, 0.05556], - "122": [0, 0.43056, 0.04398, 0.05556], - "915": [0, 0.68333, 0.13889, 0.08334], - "916": [0, 0.68333, 0, 0.16667], - "920": [0, 0.68333, 0.02778, 0.08334], - "923": [0, 0.68333, 0, 0.16667], - "926": [0, 0.68333, 0.07569, 0.08334], - "928": [0, 0.68333, 0.08125, 0.05556], - "931": [0, 0.68333, 0.05764, 0.08334], - "933": [0, 0.68333, 0.13889, 0.05556], - "934": [0, 0.68333, 0, 0.08334], - "936": [0, 0.68333, 0.11, 0.05556], - "937": [0, 0.68333, 0.05017, 0.08334], - "945": [0, 0.43056, 0.0037, 0.02778], - "946": [0.19444, 0.69444, 0.05278, 0.08334], - "947": [0.19444, 0.43056, 0.05556, 0], - "948": [0, 0.69444, 0.03785, 0.05556], - "949": [0, 0.43056, 0, 0.08334], - "950": [0.19444, 0.69444, 0.07378, 0.08334], - "951": [0.19444, 0.43056, 0.03588, 0.05556], - "952": [0, 0.69444, 0.02778, 0.08334], - "953": [0, 0.43056, 0, 0.05556], - "954": [0, 0.43056, 0, 0], - "955": [0, 0.69444, 0, 0], - "956": [0.19444, 0.43056, 0, 0.02778], - "957": [0, 0.43056, 0.06366, 0.02778], - "958": [0.19444, 0.69444, 0.04601, 0.11111], - "959": [0, 0.43056, 0, 0.05556], - "960": [0, 0.43056, 0.03588, 0], - "961": [0.19444, 0.43056, 0, 0.08334], - "962": [0.09722, 0.43056, 0.07986, 0.08334], - "963": [0, 0.43056, 0.03588, 0], - "964": [0, 0.43056, 0.1132, 0.02778], - "965": [0, 0.43056, 0.03588, 0.02778], - "966": [0.19444, 0.43056, 0, 0.08334], - "967": [0.19444, 0.43056, 0, 0.05556], - "968": [0.19444, 0.69444, 0.03588, 0.11111], - "969": [0, 0.43056, 0.03588, 0], - "977": [0, 0.69444, 0, 0.08334], - "981": [0.19444, 0.69444, 0, 0.08334], - "982": [0, 0.43056, 0.02778, 0], - "1009": [0.19444, 0.43056, 0, 0.08334], - "1013": [0, 0.43056, 0, 0.05556] - }, - "SansSerif-Regular": { - "33": [0, 0.69444, 0, 0], - "34": [0, 0.69444, 0, 0], - "35": [0.19444, 0.69444, 0, 0], - "36": [0.05556, 0.75, 0, 0], - "37": [0.05556, 0.75, 0, 0], - "38": [0, 0.69444, 0, 0], - "39": [0, 0.69444, 0, 0], - "40": [0.25, 0.75, 0, 0], - "41": [0.25, 0.75, 0, 0], - "42": [0, 0.75, 0, 0], - "43": [0.08333, 0.58333, 0, 0], - "44": [0.125, 0.08333, 0, 0], - "45": [0, 0.44444, 0, 0], - "46": [0, 0.08333, 0, 0], - "47": [0.25, 0.75, 0, 0], - "48": [0, 0.65556, 0, 0], - "49": [0, 0.65556, 0, 0], - "50": [0, 0.65556, 0, 0], - "51": [0, 0.65556, 0, 0], - "52": [0, 0.65556, 0, 0], - "53": [0, 0.65556, 0, 0], - "54": [0, 0.65556, 0, 0], - "55": [0, 0.65556, 0, 0], - "56": [0, 0.65556, 0, 0], - "57": [0, 0.65556, 0, 0], - "58": [0, 0.44444, 0, 0], - "59": [0.125, 0.44444, 0, 0], - "61": [-0.13, 0.37, 0, 0], - "63": [0, 0.69444, 0, 0], - "64": [0, 0.69444, 0, 0], - "65": [0, 0.69444, 0, 0], - "66": [0, 0.69444, 0, 0], - "67": [0, 0.69444, 0, 0], - "68": [0, 0.69444, 0, 0], - "69": [0, 0.69444, 0, 0], - "70": [0, 0.69444, 0, 0], - "71": [0, 0.69444, 0, 0], - "72": [0, 0.69444, 0, 0], - "73": [0, 0.69444, 
0, 0], - "74": [0, 0.69444, 0, 0], - "75": [0, 0.69444, 0, 0], - "76": [0, 0.69444, 0, 0], - "77": [0, 0.69444, 0, 0], - "78": [0, 0.69444, 0, 0], - "79": [0, 0.69444, 0, 0], - "80": [0, 0.69444, 0, 0], - "81": [0.125, 0.69444, 0, 0], - "82": [0, 0.69444, 0, 0], - "83": [0, 0.69444, 0, 0], - "84": [0, 0.69444, 0, 0], - "85": [0, 0.69444, 0, 0], - "86": [0, 0.69444, 0.01389, 0], - "87": [0, 0.69444, 0.01389, 0], - "88": [0, 0.69444, 0, 0], - "89": [0, 0.69444, 0.025, 0], - "90": [0, 0.69444, 0, 0], - "91": [0.25, 0.75, 0, 0], - "93": [0.25, 0.75, 0, 0], - "94": [0, 0.69444, 0, 0], - "95": [0.35, 0.09444, 0.02778, 0], - "97": [0, 0.44444, 0, 0], - "98": [0, 0.69444, 0, 0], - "99": [0, 0.44444, 0, 0], - "100": [0, 0.69444, 0, 0], - "101": [0, 0.44444, 0, 0], - "102": [0, 0.69444, 0.06944, 0], - "103": [0.19444, 0.44444, 0.01389, 0], - "104": [0, 0.69444, 0, 0], - "105": [0, 0.67937, 0, 0], - "106": [0.19444, 0.67937, 0, 0], - "107": [0, 0.69444, 0, 0], - "108": [0, 0.69444, 0, 0], - "109": [0, 0.44444, 0, 0], - "110": [0, 0.44444, 0, 0], - "111": [0, 0.44444, 0, 0], - "112": [0.19444, 0.44444, 0, 0], - "113": [0.19444, 0.44444, 0, 0], - "114": [0, 0.44444, 0.01389, 0], - "115": [0, 0.44444, 0, 0], - "116": [0, 0.57143, 0, 0], - "117": [0, 0.44444, 0, 0], - "118": [0, 0.44444, 0.01389, 0], - "119": [0, 0.44444, 0.01389, 0], - "120": [0, 0.44444, 0, 0], - "121": [0.19444, 0.44444, 0.01389, 0], - "122": [0, 0.44444, 0, 0], - "126": [0.35, 0.32659, 0, 0], - "305": [0, 0.44444, 0, 0], - "567": [0.19444, 0.44444, 0, 0], - "768": [0, 0.69444, 0, 0], - "769": [0, 0.69444, 0, 0], - "770": [0, 0.69444, 0, 0], - "771": [0, 0.67659, 0, 0], - "772": [0, 0.60889, 0, 0], - "774": [0, 0.69444, 0, 0], - "775": [0, 0.67937, 0, 0], - "776": [0, 0.67937, 0, 0], - "778": [0, 0.69444, 0, 0], - "779": [0, 0.69444, 0, 0], - "780": [0, 0.63194, 0, 0], - "915": [0, 0.69444, 0, 0], - "916": [0, 0.69444, 0, 0], - "920": [0, 0.69444, 0, 0], - "923": [0, 0.69444, 0, 0], - "926": [0, 0.69444, 0, 0], - "928": [0, 0.69444, 0, 0], - "931": [0, 0.69444, 0, 0], - "933": [0, 0.69444, 0, 0], - "934": [0, 0.69444, 0, 0], - "936": [0, 0.69444, 0, 0], - "937": [0, 0.69444, 0, 0], - "8211": [0, 0.44444, 0.02778, 0], - "8212": [0, 0.44444, 0.02778, 0], - "8216": [0, 0.69444, 0, 0], - "8217": [0, 0.69444, 0, 0], - "8220": [0, 0.69444, 0, 0], - "8221": [0, 0.69444, 0, 0] - }, - "Script-Regular": { - "65": [0, 0.7, 0.22925, 0], - "66": [0, 0.7, 0.04087, 0], - "67": [0, 0.7, 0.1689, 0], - "68": [0, 0.7, 0.09371, 0], - "69": [0, 0.7, 0.18583, 0], - "70": [0, 0.7, 0.13634, 0], - "71": [0, 0.7, 0.17322, 0], - "72": [0, 0.7, 0.29694, 0], - "73": [0, 0.7, 0.19189, 0], - "74": [0.27778, 0.7, 0.19189, 0], - "75": [0, 0.7, 0.31259, 0], - "76": [0, 0.7, 0.19189, 0], - "77": [0, 0.7, 0.15981, 0], - "78": [0, 0.7, 0.3525, 0], - "79": [0, 0.7, 0.08078, 0], - "80": [0, 0.7, 0.08078, 0], - "81": [0, 0.7, 0.03305, 0], - "82": [0, 0.7, 0.06259, 0], - "83": [0, 0.7, 0.19189, 0], - "84": [0, 0.7, 0.29087, 0], - "85": [0, 0.7, 0.25815, 0], - "86": [0, 0.7, 0.27523, 0], - "87": [0, 0.7, 0.27523, 0], - "88": [0, 0.7, 0.26006, 0], - "89": [0, 0.7, 0.2939, 0], - "90": [0, 0.7, 0.24037, 0] - }, - "Size1-Regular": { - "40": [0.35001, 0.85, 0, 0], - "41": [0.35001, 0.85, 0, 0], - "47": [0.35001, 0.85, 0, 0], - "91": [0.35001, 0.85, 0, 0], - "92": [0.35001, 0.85, 0, 0], - "93": [0.35001, 0.85, 0, 0], - "123": [0.35001, 0.85, 0, 0], - "125": [0.35001, 0.85, 0, 0], - "710": [0, 0.72222, 0, 0], - "732": [0, 0.72222, 0, 0], - "770": [0, 0.72222, 0, 0], - "771": [0, 
0.72222, 0, 0], - "8214": [-0.00099, 0.601, 0, 0], - "8593": [1e-05, 0.6, 0, 0], - "8595": [1e-05, 0.6, 0, 0], - "8657": [1e-05, 0.6, 0, 0], - "8659": [1e-05, 0.6, 0, 0], - "8719": [0.25001, 0.75, 0, 0], - "8720": [0.25001, 0.75, 0, 0], - "8721": [0.25001, 0.75, 0, 0], - "8730": [0.35001, 0.85, 0, 0], - "8739": [-0.00599, 0.606, 0, 0], - "8741": [-0.00599, 0.606, 0, 0], - "8747": [0.30612, 0.805, 0.19445, 0], - "8748": [0.306, 0.805, 0.19445, 0], - "8749": [0.306, 0.805, 0.19445, 0], - "8750": [0.30612, 0.805, 0.19445, 0], - "8896": [0.25001, 0.75, 0, 0], - "8897": [0.25001, 0.75, 0, 0], - "8898": [0.25001, 0.75, 0, 0], - "8899": [0.25001, 0.75, 0, 0], - "8968": [0.35001, 0.85, 0, 0], - "8969": [0.35001, 0.85, 0, 0], - "8970": [0.35001, 0.85, 0, 0], - "8971": [0.35001, 0.85, 0, 0], - "9168": [-0.00099, 0.601, 0, 0], - "10216": [0.35001, 0.85, 0, 0], - "10217": [0.35001, 0.85, 0, 0], - "10752": [0.25001, 0.75, 0, 0], - "10753": [0.25001, 0.75, 0, 0], - "10754": [0.25001, 0.75, 0, 0], - "10756": [0.25001, 0.75, 0, 0], - "10758": [0.25001, 0.75, 0, 0] - }, - "Size2-Regular": { - "40": [0.65002, 1.15, 0, 0], - "41": [0.65002, 1.15, 0, 0], - "47": [0.65002, 1.15, 0, 0], - "91": [0.65002, 1.15, 0, 0], - "92": [0.65002, 1.15, 0, 0], - "93": [0.65002, 1.15, 0, 0], - "123": [0.65002, 1.15, 0, 0], - "125": [0.65002, 1.15, 0, 0], - "710": [0, 0.75, 0, 0], - "732": [0, 0.75, 0, 0], - "770": [0, 0.75, 0, 0], - "771": [0, 0.75, 0, 0], - "8719": [0.55001, 1.05, 0, 0], - "8720": [0.55001, 1.05, 0, 0], - "8721": [0.55001, 1.05, 0, 0], - "8730": [0.65002, 1.15, 0, 0], - "8747": [0.86225, 1.36, 0.44445, 0], - "8748": [0.862, 1.36, 0.44445, 0], - "8749": [0.862, 1.36, 0.44445, 0], - "8750": [0.86225, 1.36, 0.44445, 0], - "8896": [0.55001, 1.05, 0, 0], - "8897": [0.55001, 1.05, 0, 0], - "8898": [0.55001, 1.05, 0, 0], - "8899": [0.55001, 1.05, 0, 0], - "8968": [0.65002, 1.15, 0, 0], - "8969": [0.65002, 1.15, 0, 0], - "8970": [0.65002, 1.15, 0, 0], - "8971": [0.65002, 1.15, 0, 0], - "10216": [0.65002, 1.15, 0, 0], - "10217": [0.65002, 1.15, 0, 0], - "10752": [0.55001, 1.05, 0, 0], - "10753": [0.55001, 1.05, 0, 0], - "10754": [0.55001, 1.05, 0, 0], - "10756": [0.55001, 1.05, 0, 0], - "10758": [0.55001, 1.05, 0, 0] - }, - "Size3-Regular": { - "40": [0.95003, 1.45, 0, 0], - "41": [0.95003, 1.45, 0, 0], - "47": [0.95003, 1.45, 0, 0], - "91": [0.95003, 1.45, 0, 0], - "92": [0.95003, 1.45, 0, 0], - "93": [0.95003, 1.45, 0, 0], - "123": [0.95003, 1.45, 0, 0], - "125": [0.95003, 1.45, 0, 0], - "710": [0, 0.75, 0, 0], - "732": [0, 0.75, 0, 0], - "770": [0, 0.75, 0, 0], - "771": [0, 0.75, 0, 0], - "8730": [0.95003, 1.45, 0, 0], - "8968": [0.95003, 1.45, 0, 0], - "8969": [0.95003, 1.45, 0, 0], - "8970": [0.95003, 1.45, 0, 0], - "8971": [0.95003, 1.45, 0, 0], - "10216": [0.95003, 1.45, 0, 0], - "10217": [0.95003, 1.45, 0, 0] - }, - "Size4-Regular": { - "40": [1.25003, 1.75, 0, 0], - "41": [1.25003, 1.75, 0, 0], - "47": [1.25003, 1.75, 0, 0], - "91": [1.25003, 1.75, 0, 0], - "92": [1.25003, 1.75, 0, 0], - "93": [1.25003, 1.75, 0, 0], - "123": [1.25003, 1.75, 0, 0], - "125": [1.25003, 1.75, 0, 0], - "710": [0, 0.825, 0, 0], - "732": [0, 0.825, 0, 0], - "770": [0, 0.825, 0, 0], - "771": [0, 0.825, 0, 0], - "8730": [1.25003, 1.75, 0, 0], - "8968": [1.25003, 1.75, 0, 0], - "8969": [1.25003, 1.75, 0, 0], - "8970": [1.25003, 1.75, 0, 0], - "8971": [1.25003, 1.75, 0, 0], - "9115": [0.64502, 1.155, 0, 0], - "9116": [1e-05, 0.6, 0, 0], - "9117": [0.64502, 1.155, 0, 0], - "9118": [0.64502, 1.155, 0, 0], - "9119": [1e-05, 0.6, 0, 0], - 
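// A note on the metric arrays in this table (inferred from the data layout,
// not stated in this file): each entry appears to map a character code to
// [depth, height, italicCorrection, skew], all in ems. Read that way, the
// "8730" (radical sign, U+221A) entry just above gives depth 0.35001 and
// height 0.85, i.e. a delimiter roughly 1.2em tall overall.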
"9120": [0.64502, 1.155, 0, 0], - "9121": [0.64502, 1.155, 0, 0], - "9122": [-0.00099, 0.601, 0, 0], - "9123": [0.64502, 1.155, 0, 0], - "9124": [0.64502, 1.155, 0, 0], - "9125": [-0.00099, 0.601, 0, 0], - "9126": [0.64502, 1.155, 0, 0], - "9127": [1e-05, 0.9, 0, 0], - "9128": [0.65002, 1.15, 0, 0], - "9129": [0.90001, 0, 0, 0], - "9130": [0, 0.3, 0, 0], - "9131": [1e-05, 0.9, 0, 0], - "9132": [0.65002, 1.15, 0, 0], - "9133": [0.90001, 0, 0, 0], - "9143": [0.88502, 0.915, 0, 0], - "10216": [1.25003, 1.75, 0, 0], - "10217": [1.25003, 1.75, 0, 0], - "57344": [-0.00499, 0.605, 0, 0], - "57345": [-0.00499, 0.605, 0, 0], - "57680": [0, 0.12, 0, 0], - "57681": [0, 0.12, 0, 0], - "57682": [0, 0.12, 0, 0], - "57683": [0, 0.12, 0, 0] - }, - "Typewriter-Regular": { - "33": [0, 0.61111, 0, 0], - "34": [0, 0.61111, 0, 0], - "35": [0, 0.61111, 0, 0], - "36": [0.08333, 0.69444, 0, 0], - "37": [0.08333, 0.69444, 0, 0], - "38": [0, 0.61111, 0, 0], - "39": [0, 0.61111, 0, 0], - "40": [0.08333, 0.69444, 0, 0], - "41": [0.08333, 0.69444, 0, 0], - "42": [0, 0.52083, 0, 0], - "43": [-0.08056, 0.53055, 0, 0], - "44": [0.13889, 0.125, 0, 0], - "45": [-0.08056, 0.53055, 0, 0], - "46": [0, 0.125, 0, 0], - "47": [0.08333, 0.69444, 0, 0], - "48": [0, 0.61111, 0, 0], - "49": [0, 0.61111, 0, 0], - "50": [0, 0.61111, 0, 0], - "51": [0, 0.61111, 0, 0], - "52": [0, 0.61111, 0, 0], - "53": [0, 0.61111, 0, 0], - "54": [0, 0.61111, 0, 0], - "55": [0, 0.61111, 0, 0], - "56": [0, 0.61111, 0, 0], - "57": [0, 0.61111, 0, 0], - "58": [0, 0.43056, 0, 0], - "59": [0.13889, 0.43056, 0, 0], - "60": [-0.05556, 0.55556, 0, 0], - "61": [-0.19549, 0.41562, 0, 0], - "62": [-0.05556, 0.55556, 0, 0], - "63": [0, 0.61111, 0, 0], - "64": [0, 0.61111, 0, 0], - "65": [0, 0.61111, 0, 0], - "66": [0, 0.61111, 0, 0], - "67": [0, 0.61111, 0, 0], - "68": [0, 0.61111, 0, 0], - "69": [0, 0.61111, 0, 0], - "70": [0, 0.61111, 0, 0], - "71": [0, 0.61111, 0, 0], - "72": [0, 0.61111, 0, 0], - "73": [0, 0.61111, 0, 0], - "74": [0, 0.61111, 0, 0], - "75": [0, 0.61111, 0, 0], - "76": [0, 0.61111, 0, 0], - "77": [0, 0.61111, 0, 0], - "78": [0, 0.61111, 0, 0], - "79": [0, 0.61111, 0, 0], - "80": [0, 0.61111, 0, 0], - "81": [0.13889, 0.61111, 0, 0], - "82": [0, 0.61111, 0, 0], - "83": [0, 0.61111, 0, 0], - "84": [0, 0.61111, 0, 0], - "85": [0, 0.61111, 0, 0], - "86": [0, 0.61111, 0, 0], - "87": [0, 0.61111, 0, 0], - "88": [0, 0.61111, 0, 0], - "89": [0, 0.61111, 0, 0], - "90": [0, 0.61111, 0, 0], - "91": [0.08333, 0.69444, 0, 0], - "92": [0.08333, 0.69444, 0, 0], - "93": [0.08333, 0.69444, 0, 0], - "94": [0, 0.61111, 0, 0], - "95": [0.09514, 0, 0, 0], - "96": [0, 0.61111, 0, 0], - "97": [0, 0.43056, 0, 0], - "98": [0, 0.61111, 0, 0], - "99": [0, 0.43056, 0, 0], - "100": [0, 0.61111, 0, 0], - "101": [0, 0.43056, 0, 0], - "102": [0, 0.61111, 0, 0], - "103": [0.22222, 0.43056, 0, 0], - "104": [0, 0.61111, 0, 0], - "105": [0, 0.61111, 0, 0], - "106": [0.22222, 0.61111, 0, 0], - "107": [0, 0.61111, 0, 0], - "108": [0, 0.61111, 0, 0], - "109": [0, 0.43056, 0, 0], - "110": [0, 0.43056, 0, 0], - "111": [0, 0.43056, 0, 0], - "112": [0.22222, 0.43056, 0, 0], - "113": [0.22222, 0.43056, 0, 0], - "114": [0, 0.43056, 0, 0], - "115": [0, 0.43056, 0, 0], - "116": [0, 0.55358, 0, 0], - "117": [0, 0.43056, 0, 0], - "118": [0, 0.43056, 0, 0], - "119": [0, 0.43056, 0, 0], - "120": [0, 0.43056, 0, 0], - "121": [0.22222, 0.43056, 0, 0], - "122": [0, 0.43056, 0, 0], - "123": [0.08333, 0.69444, 0, 0], - "124": [0.08333, 0.69444, 0, 0], - "125": [0.08333, 0.69444, 0, 0], - "126": [0, 
0.61111, 0, 0], - "127": [0, 0.61111, 0, 0], - "305": [0, 0.43056, 0, 0], - "567": [0.22222, 0.43056, 0, 0], - "768": [0, 0.61111, 0, 0], - "769": [0, 0.61111, 0, 0], - "770": [0, 0.61111, 0, 0], - "771": [0, 0.61111, 0, 0], - "772": [0, 0.56555, 0, 0], - "774": [0, 0.61111, 0, 0], - "776": [0, 0.61111, 0, 0], - "778": [0, 0.61111, 0, 0], - "780": [0, 0.56597, 0, 0], - "915": [0, 0.61111, 0, 0], - "916": [0, 0.61111, 0, 0], - "920": [0, 0.61111, 0, 0], - "923": [0, 0.61111, 0, 0], - "926": [0, 0.61111, 0, 0], - "928": [0, 0.61111, 0, 0], - "931": [0, 0.61111, 0, 0], - "933": [0, 0.61111, 0, 0], - "934": [0, 0.61111, 0, 0], - "936": [0, 0.61111, 0, 0], - "937": [0, 0.61111, 0, 0], - "2018": [0, 0.61111, 0, 0], - "2019": [0, 0.61111, 0, 0], - "8242": [0, 0.61111, 0, 0] - } - }; - - },{}],43:[function(require,module,exports){ - - var _utils = require("./utils"); - - var _utils2 = _interopRequireDefault(_utils); - - var _ParseError = require("./ParseError"); - - var _ParseError2 = _interopRequireDefault(_ParseError); - - var _ParseNode = require("./ParseNode"); - - var _ParseNode2 = _interopRequireDefault(_ParseNode); - - function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } - - /* This file contains a list of functions that we parse, identified by - * the calls to defineFunction. - * - * The first argument to defineFunction is a single name or a list of names. - * All functions named in such a list will share a single implementation. - * - * Each declared function can have associated properties, which - * include the following: - * - * - numArgs: The number of arguments the function takes. - * If this is the only property, it can be passed as a number - * instead of an element of a properties object. - * - argTypes: (optional) An array corresponding to each argument of the - * function, giving the type of argument that should be parsed. Its - * length should be equal to `numArgs + numOptionalArgs`. Valid - * types: - * - "size": A size-like thing, such as "1em" or "5ex" - * - "color": An html color, like "#abc" or "blue" - * - "original": The same type as the environment that the - * function being parsed is in (e.g. used for the - * bodies of functions like \textcolor where the - * first argument is special and the second - * argument is parsed normally) - * Other possible types (probably shouldn't be used) - * - "text": Text-like (e.g. \text) - * - "math": Normal math - * If undefined, this will be treated as an appropriate length - * array of "original" strings - * - greediness: (optional) The greediness of the function to use ungrouped - * arguments. - * - * E.g. if you have an expression - * \sqrt \frac 1 2 - * since \frac has greediness=2 vs \sqrt's greediness=1, \frac - * will use the two arguments '1' and '2' as its two arguments, - * then that whole function will be used as the argument to - * \sqrt. On the other hand, the expressions - * \frac \frac 1 2 3 - * and - * \frac \sqrt 1 2 - * will fail because \frac and \frac have equal greediness - * and \sqrt has a lower greediness than \frac respectively. To - * make these parse, we would have to change them to: - * \frac {\frac 1 2} 3 - * and - * \frac {\sqrt 1} 2 - * - * The default value is `1` - * - allowedInText: (optional) Whether or not the function is allowed inside - * text mode (default false) - * - numOptionalArgs: (optional) The number of optional arguments the function - * should parse. If the optional arguments aren't found, - * `null` will be passed to the handler in their place. 
- * (default 0) - * - infix: (optional) Must be true if the function is an infix operator. - * - * The last argument is that implementation, the handler for the function(s). - * It is called to handle these functions and their arguments. - * It receives two arguments: - * - context contains information and references provided by the parser - * - args is an array of arguments obtained from TeX input - * The context contains the following properties: - * - funcName: the text (i.e. name) of the function, including \ - * - parser: the parser object - * - lexer: the lexer object - * - positions: the positions in the overall string of the function - * and the arguments. - * The latter three should only be used to produce error messages. - * - * The function should return an object with the following keys: - * - type: The type of element that this is. This is then used in - * buildHTML/buildMathML to determine which function - * should be called to build this node into a DOM node - * Any other data can be added to the object, which will be passed - * in to the function in buildHTML/buildMathML as `group.value`. - */ - - function defineFunction(names, props, handler) { - if (typeof names === "string") { - names = [names]; - } - if (typeof props === "number") { - props = { numArgs: props }; - } - // Set default values of functions - var data = { - numArgs: props.numArgs, - argTypes: props.argTypes, - greediness: props.greediness === undefined ? 1 : props.greediness, - allowedInText: !!props.allowedInText, - allowedInMath: props.allowedInMath, - numOptionalArgs: props.numOptionalArgs || 0, - infix: !!props.infix, - handler: handler - }; - for (var i = 0; i < names.length; ++i) { - module.exports[names[i]] = data; - } - } - - // Since the corresponding buildHTML/buildMathML function expects a - // list of elements, we normalize for different kinds of arguments - var ordargument = function ordargument(arg) { - if (arg.type === "ordgroup") { - return arg.value; - } else { - return [arg]; - } - }; - - // A normal square root - defineFunction("\\sqrt", { - numArgs: 1, - numOptionalArgs: 1 - }, function (context, args) { - var index = args[0]; - var body = args[1]; - return { - type: "sqrt", - body: body, - index: index - }; - }); - - // Non-mathy text, possibly in a font - var textFunctionStyles = { - "\\text": undefined, "\\textrm": "mathrm", "\\textsf": "mathsf", - "\\texttt": "mathtt", "\\textnormal": "mathrm", "\\textbf": "mathbf", - "\\textit": "textit" - }; - - defineFunction(["\\text", "\\textrm", "\\textsf", "\\texttt", "\\textnormal", "\\textbf", "\\textit"], { - numArgs: 1, - argTypes: ["text"], - greediness: 2, - allowedInText: true - }, function (context, args) { - var body = args[0]; - return { - type: "text", - body: ordargument(body), - style: textFunctionStyles[context.funcName] - }; - }); - - // A two-argument custom color - defineFunction("\\textcolor", { - numArgs: 2, - allowedInText: true, - greediness: 3, - argTypes: ["color", "original"] - }, function (context, args) { - var color = args[0]; - var body = args[1]; - return { - type: "color", - color: color.value, - value: ordargument(body) - }; - }); - - // \color is handled in Parser.js's parseImplicitGroup - defineFunction("\\color", { - numArgs: 1, - allowedInText: true, - greediness: 3, - argTypes: ["color"] - }, null); - - // An overline - defineFunction("\\overline", { - numArgs: 1 - }, function (context, args) { - var body = args[0]; - return { - type: "overline", - body: body - }; - }); - - // An underline - 
defineFunction("\\underline", { - numArgs: 1 - }, function (context, args) { - var body = args[0]; - return { - type: "underline", - body: body - }; - }); - - // A box of the width and height - defineFunction("\\rule", { - numArgs: 2, - numOptionalArgs: 1, - argTypes: ["size", "size", "size"] - }, function (context, args) { - var shift = args[0]; - var width = args[1]; - var height = args[2]; - return { - type: "rule", - shift: shift && shift.value, - width: width.value, - height: height.value - }; - }); - - // TODO: In TeX, \mkern only accepts mu-units, and \kern does not accept - // mu-units. In current KaTeX we relax this; both commands accept any unit. - defineFunction(["\\kern", "\\mkern"], { - numArgs: 1, - argTypes: ["size"] - }, function (context, args) { - return { - type: "kern", - dimension: args[0].value - }; - }); - - // A KaTeX logo - defineFunction("\\KaTeX", { - numArgs: 0 - }, function (context) { - return { - type: "katex" - }; - }); - - defineFunction("\\phantom", { - numArgs: 1 - }, function (context, args) { - var body = args[0]; - return { - type: "phantom", - value: ordargument(body) - }; - }); - - // Math class commands except \mathop - defineFunction(["\\mathord", "\\mathbin", "\\mathrel", "\\mathopen", "\\mathclose", "\\mathpunct", "\\mathinner"], { - numArgs: 1 - }, function (context, args) { - var body = args[0]; - return { - type: "mclass", - mclass: "m" + context.funcName.substr(5), - value: ordargument(body) - }; - }); - - // Build a relation by placing one symbol on top of another - defineFunction("\\stackrel", { - numArgs: 2 - }, function (context, args) { - var top = args[0]; - var bottom = args[1]; - - var bottomop = new _ParseNode2.default("op", { - type: "op", - limits: true, - alwaysHandleSupSub: true, - symbol: false, - value: ordargument(bottom) - }, bottom.mode); - - var supsub = new _ParseNode2.default("supsub", { - base: bottomop, - sup: top, - sub: null - }, top.mode); - - return { - type: "mclass", - mclass: "mrel", - value: [supsub] - }; - }); - - // \mod-type functions - defineFunction("\\bmod", { - numArgs: 0 - }, function (context, args) { - return { - type: "mod", - modType: "bmod", - value: null - }; - }); - - defineFunction(["\\pod", "\\pmod", "\\mod"], { - numArgs: 1 - }, function (context, args) { - var body = args[0]; - return { - type: "mod", - modType: context.funcName.substr(1), - value: ordargument(body) - }; - }); - - // Extra data needed for the delimiter handler down below - var delimiterSizes = { - "\\bigl": { mclass: "mopen", size: 1 }, - "\\Bigl": { mclass: "mopen", size: 2 }, - "\\biggl": { mclass: "mopen", size: 3 }, - "\\Biggl": { mclass: "mopen", size: 4 }, - "\\bigr": { mclass: "mclose", size: 1 }, - "\\Bigr": { mclass: "mclose", size: 2 }, - "\\biggr": { mclass: "mclose", size: 3 }, - "\\Biggr": { mclass: "mclose", size: 4 }, - "\\bigm": { mclass: "mrel", size: 1 }, - "\\Bigm": { mclass: "mrel", size: 2 }, - "\\biggm": { mclass: "mrel", size: 3 }, - "\\Biggm": { mclass: "mrel", size: 4 }, - "\\big": { mclass: "mord", size: 1 }, - "\\Big": { mclass: "mord", size: 2 }, - "\\bigg": { mclass: "mord", size: 3 }, - "\\Bigg": { mclass: "mord", size: 4 } - }; - - var delimiters = ["(", ")", "[", "\\lbrack", "]", "\\rbrack", "\\{", "\\lbrace", "\\}", "\\rbrace", "\\lfloor", "\\rfloor", "\\lceil", "\\rceil", "<", ">", "\\langle", "\\rangle", "\\lt", "\\gt", "\\lvert", "\\rvert", "\\lVert", "\\rVert", "\\lgroup", "\\rgroup", "\\lmoustache", "\\rmoustache", "/", "\\backslash", "|", "\\vert", "\\|", "\\Vert", "\\uparrow", 
"\\Uparrow", "\\downarrow", "\\Downarrow", "\\updownarrow", "\\Updownarrow", "."]; - - var fontAliases = { - "\\Bbb": "\\mathbb", - "\\bold": "\\mathbf", - "\\frak": "\\mathfrak" - }; - - // Single-argument color functions - defineFunction(["\\blue", "\\orange", "\\pink", "\\red", "\\green", "\\gray", "\\purple", "\\blueA", "\\blueB", "\\blueC", "\\blueD", "\\blueE", "\\tealA", "\\tealB", "\\tealC", "\\tealD", "\\tealE", "\\greenA", "\\greenB", "\\greenC", "\\greenD", "\\greenE", "\\goldA", "\\goldB", "\\goldC", "\\goldD", "\\goldE", "\\redA", "\\redB", "\\redC", "\\redD", "\\redE", "\\maroonA", "\\maroonB", "\\maroonC", "\\maroonD", "\\maroonE", "\\purpleA", "\\purpleB", "\\purpleC", "\\purpleD", "\\purpleE", "\\mintA", "\\mintB", "\\mintC", "\\grayA", "\\grayB", "\\grayC", "\\grayD", "\\grayE", "\\grayF", "\\grayG", "\\grayH", "\\grayI", "\\kaBlue", "\\kaGreen"], { - numArgs: 1, - allowedInText: true, - greediness: 3 - }, function (context, args) { - var body = args[0]; - return { - type: "color", - color: "katex-" + context.funcName.slice(1), - value: ordargument(body) - }; - }); - - // There are 2 flags for operators; whether they produce limits in - // displaystyle, and whether they are symbols and should grow in - // displaystyle. These four groups cover the four possible choices. - - // No limits, not symbols - defineFunction(["\\arcsin", "\\arccos", "\\arctan", "\\arctg", "\\arcctg", "\\arg", "\\ch", "\\cos", "\\cosec", "\\cosh", "\\cot", "\\cotg", "\\coth", "\\csc", "\\ctg", "\\cth", "\\deg", "\\dim", "\\exp", "\\hom", "\\ker", "\\lg", "\\ln", "\\log", "\\sec", "\\sin", "\\sinh", "\\sh", "\\tan", "\\tanh", "\\tg", "\\th"], { - numArgs: 0 - }, function (context) { - return { - type: "op", - limits: false, - symbol: false, - body: context.funcName - }; - }); - - // Limits, not symbols - defineFunction(["\\det", "\\gcd", "\\inf", "\\lim", "\\liminf", "\\limsup", "\\max", "\\min", "\\Pr", "\\sup"], { - numArgs: 0 - }, function (context) { - return { - type: "op", - limits: true, - symbol: false, - body: context.funcName - }; - }); - - // No limits, symbols - defineFunction(["\\int", "\\iint", "\\iiint", "\\oint"], { - numArgs: 0 - }, function (context) { - return { - type: "op", - limits: false, - symbol: true, - body: context.funcName - }; - }); - - // Limits, symbols - defineFunction(["\\coprod", "\\bigvee", "\\bigwedge", "\\biguplus", "\\bigcap", "\\bigcup", "\\intop", "\\prod", "\\sum", "\\bigotimes", "\\bigoplus", "\\bigodot", "\\bigsqcup", "\\smallint"], { - numArgs: 0 - }, function (context) { - return { - type: "op", - limits: true, - symbol: true, - body: context.funcName - }; - }); - - // \mathop class command - defineFunction("\\mathop", { - numArgs: 1 - }, function (context, args) { - var body = args[0]; - return { - type: "op", - limits: false, - symbol: false, - value: ordargument(body) - }; - }); - - // Fractions - defineFunction(["\\dfrac", "\\frac", "\\tfrac", "\\dbinom", "\\binom", "\\tbinom", "\\\\atopfrac"], { - numArgs: 2, - greediness: 2 - }, function (context, args) { - var numer = args[0]; - var denom = args[1]; - var hasBarLine = void 0; - var leftDelim = null; - var rightDelim = null; - var size = "auto"; - - switch (context.funcName) { - case "\\dfrac": - case "\\frac": - case "\\tfrac": - hasBarLine = true; - break; - case "\\\\atopfrac": - hasBarLine = false; - break; - case "\\dbinom": - case "\\binom": - case "\\tbinom": - hasBarLine = false; - leftDelim = "("; - rightDelim = ")"; - break; - default: - throw new Error("Unrecognized genfrac command"); - } 
- - switch (context.funcName) { - case "\\dfrac": - case "\\dbinom": - size = "display"; - break; - case "\\tfrac": - case "\\tbinom": - size = "text"; - break; - } - - return { - type: "genfrac", - numer: numer, - denom: denom, - hasBarLine: hasBarLine, - leftDelim: leftDelim, - rightDelim: rightDelim, - size: size - }; - }); - - // Left and right overlap functions - defineFunction(["\\llap", "\\rlap"], { - numArgs: 1, - allowedInText: true - }, function (context, args) { - var body = args[0]; - return { - type: context.funcName.slice(1), - body: body - }; - }); - - // Delimiter functions - var checkDelimiter = function checkDelimiter(delim, context) { - if (_utils2.default.contains(delimiters, delim.value)) { - return delim; - } else { - throw new _ParseError2.default("Invalid delimiter: '" + delim.value + "' after '" + context.funcName + "'", delim); - } - }; - - defineFunction(["\\bigl", "\\Bigl", "\\biggl", "\\Biggl", "\\bigr", "\\Bigr", "\\biggr", "\\Biggr", "\\bigm", "\\Bigm", "\\biggm", "\\Biggm", "\\big", "\\Big", "\\bigg", "\\Bigg"], { - numArgs: 1 - }, function (context, args) { - var delim = checkDelimiter(args[0], context); - - return { - type: "delimsizing", - size: delimiterSizes[context.funcName].size, - mclass: delimiterSizes[context.funcName].mclass, - value: delim.value - }; - }); - - defineFunction(["\\left", "\\right"], { - numArgs: 1 - }, function (context, args) { - var delim = checkDelimiter(args[0], context); - - // \left and \right are caught somewhere in Parser.js, which is - // why this data doesn't match what is in buildHTML. - return { - type: "leftright", - value: delim.value - }; - }); - - defineFunction("\\middle", { - numArgs: 1 - }, function (context, args) { - var delim = checkDelimiter(args[0], context); - if (!context.parser.leftrightDepth) { - throw new _ParseError2.default("\\middle without preceding \\left", delim); - } - - return { - type: "middle", - value: delim.value - }; - }); - - // Sizing functions (handled in Parser.js explicitly, hence no handler) - defineFunction(["\\tiny", "\\scriptsize", "\\footnotesize", "\\small", "\\normalsize", "\\large", "\\Large", "\\LARGE", "\\huge", "\\Huge"], 0, null); - - // Style changing functions (handled in Parser.js explicitly, hence no - // handler) - defineFunction(["\\displaystyle", "\\textstyle", "\\scriptstyle", "\\scriptscriptstyle"], 0, null); - - // Old font changing functions - defineFunction(["\\rm", "\\sf", "\\tt", "\\bf", "\\it"], 0, null); - - defineFunction([ - // styles - "\\mathrm", "\\mathit", "\\mathbf", - - // families - "\\mathbb", "\\mathcal", "\\mathfrak", "\\mathscr", "\\mathsf", "\\mathtt", - - // aliases - "\\Bbb", "\\bold", "\\frak"], { - numArgs: 1, - greediness: 2 - }, function (context, args) { - var body = args[0]; - var func = context.funcName; - if (func in fontAliases) { - func = fontAliases[func]; - } - return { - type: "font", - font: func.slice(1), - body: body - }; - }); - - // Accents - defineFunction(["\\acute", "\\grave", "\\ddot", "\\tilde", "\\bar", "\\breve", "\\check", "\\hat", "\\vec", "\\dot", "\\widehat", "\\widetilde", "\\overrightarrow", "\\overleftarrow", "\\Overrightarrow", "\\overleftrightarrow", "\\overgroup", "\\overlinesegment", "\\overleftharpoon", "\\overrightharpoon"], { - numArgs: 1 - }, function (context, args) { - var base = args[0]; - - var isStretchy = !_utils2.default.contains(["\\acute", "\\grave", "\\ddot", "\\tilde", "\\bar", "\\breve", "\\check", "\\hat", "\\vec", "\\dot"], context.funcName); - - var isShifty = !isStretchy || 
_utils2.default.contains(["\\widehat", "\\widetilde"], context.funcName); - - return { - type: "accent", - label: context.funcName, - isStretchy: isStretchy, - isShifty: isShifty, - value: ordargument(base), - base: base - }; - }); - - // Text-mode accents - defineFunction(["\\'", "\\`", "\\^", "\\~", "\\=", "\\u", "\\.", '\\"', "\\r", "\\H", "\\v"], { - numArgs: 1, - allowedInText: true, - allowedInMath: false - }, function (context, args) { - var base = args[0]; - - return { - type: "accent", - label: context.funcName, - isStretchy: false, - isShifty: true, - value: ordargument(base), - base: base - }; - }); - - // Horizontal stretchy braces - defineFunction(["\\overbrace", "\\underbrace"], { - numArgs: 1 - }, function (context, args) { - var base = args[0]; - return { - type: "horizBrace", - label: context.funcName, - isOver: /^\\over/.test(context.funcName), - base: base - }; - }); - - // Stretchy accents under the body - defineFunction(["\\underleftarrow", "\\underrightarrow", "\\underleftrightarrow", "\\undergroup", "\\underlinesegment", "\\undertilde"], { - numArgs: 1 - }, function (context, args) { - var body = args[0]; - return { - type: "accentUnder", - label: context.funcName, - value: ordargument(body), - body: body - }; - }); - - // Stretchy arrows with an optional argument - defineFunction(["\\xleftarrow", "\\xrightarrow", "\\xLeftarrow", "\\xRightarrow", "\\xleftrightarrow", "\\xLeftrightarrow", "\\xhookleftarrow", "\\xhookrightarrow", "\\xmapsto", "\\xrightharpoondown", "\\xrightharpoonup", "\\xleftharpoondown", "\\xleftharpoonup", "\\xrightleftharpoons", "\\xleftrightharpoons", "\\xLongequal", "\\xtwoheadrightarrow", "\\xtwoheadleftarrow", "\\xLongequal", "\\xtofrom"], { - numArgs: 1, - numOptionalArgs: 1 - }, function (context, args) { - var below = args[0]; - var body = args[1]; - return { - type: "xArrow", // x for extensible - label: context.funcName, - body: body, - below: below - }; - }); - - // enclose - defineFunction(["\\cancel", "\\bcancel", "\\xcancel", "\\sout", "\\fbox"], { - numArgs: 1 - }, function (context, args) { - var body = args[0]; - return { - type: "enclose", - label: context.funcName, - body: body - }; - }); - - // Infix generalized fractions - defineFunction(["\\over", "\\choose", "\\atop"], { - numArgs: 0, - infix: true - }, function (context) { - var replaceWith = void 0; - switch (context.funcName) { - case "\\over": - replaceWith = "\\frac"; - break; - case "\\choose": - replaceWith = "\\binom"; - break; - case "\\atop": - replaceWith = "\\\\atopfrac"; - break; - default: - throw new Error("Unrecognized infix genfrac command"); - } - return { - type: "infix", - replaceWith: replaceWith, - token: context.token - }; - }); - - // Row breaks for aligned data - defineFunction(["\\\\", "\\cr"], { - numArgs: 0, - numOptionalArgs: 1, - argTypes: ["size"] - }, function (context, args) { - var size = args[0]; - return { - type: "cr", - size: size - }; - }); - - // Environment delimiters - defineFunction(["\\begin", "\\end"], { - numArgs: 1, - argTypes: ["text"] - }, function (context, args) { - var nameGroup = args[0]; - if (nameGroup.type !== "ordgroup") { - throw new _ParseError2.default("Invalid environment name", nameGroup); - } - var name = ""; - for (var i = 0; i < nameGroup.value.length; ++i) { - name += nameGroup.value[i].value; - } - return { - type: "environment", - name: name, - nameGroup: nameGroup - }; - }); - - },{"./ParseError":29,"./ParseNode":30,"./utils":51}],44:[function(require,module,exports){ - - /** - * Predefined macros for KaTeX. 
- * This can be used to define some commands in terms of others. - */ - - // This function might one day accept additional argument and do more things. - function defineMacro(name, body) { - module.exports[name] = body; - } - - ////////////////////////////////////////////////////////////////////// - // basics - defineMacro("\\bgroup", "{"); - defineMacro("\\egroup", "}"); - defineMacro("\\begingroup", "{"); - defineMacro("\\endgroup", "}"); - - // We don't distinguish between math and nonmath kerns. - // (In TeX, the mu unit works only with \mkern.) - defineMacro("\\mkern", "\\kern"); - - ////////////////////////////////////////////////////////////////////// - // amsmath.sty - - // \def\overset#1#2{\binrel@{#2}\binrel@@{\mathop{\kern\z@#2}\limits^{#1}}} - defineMacro("\\overset", "\\mathop{#2}\\limits^{#1}"); - defineMacro("\\underset", "\\mathop{#2}\\limits_{#1}"); - - // \newcommand{\boxed}[1]{\fbox{\m@th$\displaystyle#1$}} - defineMacro("\\boxed", "\\fbox{\\displaystyle{#1}}"); - - //TODO: When implementing \dots, should ideally add the \DOTSB indicator - // into the macro, to indicate these are binary operators. - // \def\iff{\DOTSB\;\Longleftrightarrow\;} - // \def\implies{\DOTSB\;\Longrightarrow\;} - // \def\impliedby{\DOTSB\;\Longleftarrow\;} - defineMacro("\\iff", "\\;\\Longleftrightarrow\\;"); - defineMacro("\\implies", "\\;\\Longrightarrow\\;"); - defineMacro("\\impliedby", "\\;\\Longleftarrow\\;"); - - ////////////////////////////////////////////////////////////////////// - // mathtools.sty - - //\providecommand\ordinarycolon{:} - defineMacro("\\ordinarycolon", ":"); - //\def\vcentcolon{\mathrel{\mathop\ordinarycolon}} - //TODO(edemaine): Not yet centered. Fix via \raisebox or #726 - defineMacro("\\vcentcolon", "\\mathrel{\\mathop\\ordinarycolon}"); - // \providecommand*\dblcolon{\vcentcolon\mathrel{\mkern-.9mu}\vcentcolon} - defineMacro("\\dblcolon", "\\vcentcolon\\mathrel{\\mkern-.9mu}\\vcentcolon"); - // \providecommand*\coloneqq{\vcentcolon\mathrel{\mkern-1.2mu}=} - defineMacro("\\coloneqq", "\\vcentcolon\\mathrel{\\mkern-1.2mu}="); - // \providecommand*\Coloneqq{\dblcolon\mathrel{\mkern-1.2mu}=} - defineMacro("\\Coloneqq", "\\dblcolon\\mathrel{\\mkern-1.2mu}="); - // \providecommand*\coloneq{\vcentcolon\mathrel{\mkern-1.2mu}\mathrel{-}} - defineMacro("\\coloneq", "\\vcentcolon\\mathrel{\\mkern-1.2mu}\\mathrel{-}"); - // \providecommand*\Coloneq{\dblcolon\mathrel{\mkern-1.2mu}\mathrel{-}} - defineMacro("\\Coloneq", "\\dblcolon\\mathrel{\\mkern-1.2mu}\\mathrel{-}"); - // \providecommand*\eqqcolon{=\mathrel{\mkern-1.2mu}\vcentcolon} - defineMacro("\\eqqcolon", "=\\mathrel{\\mkern-1.2mu}\\vcentcolon"); - // \providecommand*\Eqqcolon{=\mathrel{\mkern-1.2mu}\dblcolon} - defineMacro("\\Eqqcolon", "=\\mathrel{\\mkern-1.2mu}\\dblcolon"); - // \providecommand*\eqcolon{\mathrel{-}\mathrel{\mkern-1.2mu}\vcentcolon} - defineMacro("\\eqcolon", "\\mathrel{-}\\mathrel{\\mkern-1.2mu}\\vcentcolon"); - // \providecommand*\Eqcolon{\mathrel{-}\mathrel{\mkern-1.2mu}\dblcolon} - defineMacro("\\Eqcolon", "\\mathrel{-}\\mathrel{\\mkern-1.2mu}\\dblcolon"); - // \providecommand*\colonapprox{\vcentcolon\mathrel{\mkern-1.2mu}\approx} - defineMacro("\\colonapprox", "\\vcentcolon\\mathrel{\\mkern-1.2mu}\\approx"); - // \providecommand*\Colonapprox{\dblcolon\mathrel{\mkern-1.2mu}\approx} - defineMacro("\\Colonapprox", "\\dblcolon\\mathrel{\\mkern-1.2mu}\\approx"); - // \providecommand*\colonsim{\vcentcolon\mathrel{\mkern-1.2mu}\sim} - defineMacro("\\colonsim", "\\vcentcolon\\mathrel{\\mkern-1.2mu}\\sim"); 
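// These colon macros compose purely by string expansion before parsing: the
// negative \mkern-1.2mu kern pulls the two relation glyphs together so they
// read as a single symbol. As a sketch, a macro built by the same pattern
// (hypothetical, for illustration only; not defined by KaTeX) would be:
//   defineMacro("\\pluscolon", "+\\mathrel{\\mkern-1.2mu}\\vcentcolon");
// after which the input \pluscolon parses exactly as its expansion
// "+\mathrel{\mkern-1.2mu}\vcentcolon" would.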
- // \providecommand*\Colonsim{\dblcolon\mathrel{\mkern-1.2mu}\sim} - defineMacro("\\Colonsim", "\\dblcolon\\mathrel{\\mkern-1.2mu}\\sim"); - - ////////////////////////////////////////////////////////////////////// - // colonequals.sty - - // Alternate names for mathtools's macros: - defineMacro("\\ratio", "\\vcentcolon"); - defineMacro("\\coloncolon", "\\dblcolon"); - defineMacro("\\colonequals", "\\coloneqq"); - defineMacro("\\coloncolonequals", "\\Coloneqq"); - defineMacro("\\equalscolon", "\\eqqcolon"); - defineMacro("\\equalscoloncolon", "\\Eqqcolon"); - defineMacro("\\colonminus", "\\coloneq"); - defineMacro("\\coloncolonminus", "\\Coloneq"); - defineMacro("\\minuscolon", "\\eqcolon"); - defineMacro("\\minuscoloncolon", "\\Eqcolon"); - // \colonapprox name is same in mathtools and colonequals. - defineMacro("\\coloncolonapprox", "\\Colonapprox"); - // \colonsim name is same in mathtools and colonequals. - defineMacro("\\coloncolonsim", "\\Colonsim"); - - // Additional macros, implemented by analogy with mathtools definitions: - defineMacro("\\simcolon", "\\sim\\mathrel{\\mkern-1.2mu}\\vcentcolon"); - defineMacro("\\simcoloncolon", "\\sim\\mathrel{\\mkern-1.2mu}\\dblcolon"); - defineMacro("\\approxcolon", "\\approx\\mathrel{\\mkern-1.2mu}\\vcentcolon"); - defineMacro("\\approxcoloncolon", "\\approx\\mathrel{\\mkern-1.2mu}\\dblcolon"); - - },{}],45:[function(require,module,exports){ - - var _classCallCheck2 = require("babel-runtime/helpers/classCallCheck"); - - var _classCallCheck3 = _interopRequireDefault(_classCallCheck2); - - var _createClass2 = require("babel-runtime/helpers/createClass"); - - var _createClass3 = _interopRequireDefault(_createClass2); - - var _utils = require("./utils"); - - var _utils2 = _interopRequireDefault(_utils); - - function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } - - /** - * This node represents a general purpose MathML node of any type. The - * constructor requires the type of node to create (for example, `"mo"` or - * `"mspace"`, corresponding to `` and `` tags). - */ - var MathNode = function () { - function MathNode(type, children) { - (0, _classCallCheck3.default)(this, MathNode); - - this.type = type; - this.attributes = {}; - this.children = children || []; - } - - /** - * Sets an attribute on a MathML node. MathML depends on attributes to convey a - * semantic content, so this is used heavily. - */ - - - (0, _createClass3.default)(MathNode, [{ - key: "setAttribute", - value: function setAttribute(name, value) { - this.attributes[name] = value; - } - - /** - * Converts the math node into a MathML-namespaced DOM element. - */ - - }, { - key: "toNode", - value: function toNode() { - var node = document.createElementNS("http://www.w3.org/1998/Math/MathML", this.type); - - for (var attr in this.attributes) { - if (Object.prototype.hasOwnProperty.call(this.attributes, attr)) { - node.setAttribute(attr, this.attributes[attr]); - } - } - - for (var i = 0; i < this.children.length; i++) { - node.appendChild(this.children[i].toNode()); - } - - return node; - } - - /** - * Converts the math node into an HTML markup string. 
- */ - - }, { - key: "toMarkup", - value: function toMarkup() { - var markup = "<" + this.type; - - // Add the attributes - for (var attr in this.attributes) { - if (Object.prototype.hasOwnProperty.call(this.attributes, attr)) { - markup += " " + attr + "=\""; - markup += _utils2.default.escape(this.attributes[attr]); - markup += "\""; - } - } - - markup += ">"; - - for (var i = 0; i < this.children.length; i++) { - markup += this.children[i].toMarkup(); - } - - markup += ""; - - return markup; - } - }]); - return MathNode; - }(); - - /** - * This node represents a piece of text. - */ - /** - * These objects store data about MathML nodes. This is the MathML equivalent - * of the types in domTree.js. Since MathML handles its own rendering, and - * since we're mainly using MathML to improve accessibility, we don't manage - * any of the styling state that the plain DOM nodes do. - * - * The `toNode` and `toMarkup` functions work simlarly to how they do in - * domTree.js, creating namespaced DOM nodes and HTML text markup respectively. - */ - - var TextNode = function () { - function TextNode(text) { - (0, _classCallCheck3.default)(this, TextNode); - - this.text = text; - } - - /** - * Converts the text node into a DOM text node. - */ - - - (0, _createClass3.default)(TextNode, [{ - key: "toNode", - value: function toNode() { - return document.createTextNode(this.text); - } - - /** - * Converts the text node into HTML markup (which is just the text itself). - */ - - }, { - key: "toMarkup", - value: function toMarkup() { - return _utils2.default.escape(this.text); - } - }]); - return TextNode; - }(); - - module.exports = { - MathNode: MathNode, - TextNode: TextNode - }; - - },{"./utils":51,"babel-runtime/helpers/classCallCheck":4,"babel-runtime/helpers/createClass":5}],46:[function(require,module,exports){ - - var _Parser = require('./Parser'); - - var _Parser2 = _interopRequireDefault(_Parser); - - function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } - - /** - * Parses an expression using a Parser, then returns the parsed result. - */ - var parseTree = function parseTree(toParse, settings) { - if (!(typeof toParse === 'string' || toParse instanceof String)) { - throw new TypeError('KaTeX can only parse string typed expression'); - } - var parser = new _Parser2.default(toParse, settings); - - return parser.parse(); - }; /** - * Provides a single function for parsing an expression using a Parser - * TODO(emily): Remove this - */ - - module.exports = parseTree; - - },{"./Parser":31}],47:[function(require,module,exports){ - - /** - * This file provides support to buildMathML.js and buildHTML.js - * for stretchy wide elements rendered from SVG files - * and other CSS trickery. 
- */ - - var buildCommon = require("./buildCommon"); - var mathMLTree = require("./mathMLTree"); - var utils = require("./utils"); - - var stretchyCodePoint = { - widehat: "^", - widetilde: "~", - undertilde: "~", - overleftarrow: "\u2190", - underleftarrow: "\u2190", - xleftarrow: "\u2190", - overrightarrow: "\u2192", - underrightarrow: "\u2192", - xrightarrow: "\u2192", - underbrace: "\u23B5", - overbrace: "\u23DE", - overleftrightarrow: "\u2194", - underleftrightarrow: "\u2194", - xleftrightarrow: "\u2194", - Overrightarrow: "\u21D2", - xRightarrow: "\u21D2", - overleftharpoon: "\u21BC", - xleftharpoonup: "\u21BC", - overrightharpoon: "\u21C0", - xrightharpoonup: "\u21C0", - xLeftarrow: "\u21D0", - xLeftrightarrow: "\u21D4", - xhookleftarrow: "\u21A9", - xhookrightarrow: "\u21AA", - xmapsto: "\u21A6", - xrightharpoondown: "\u21C1", - xleftharpoondown: "\u21BD", - xrightleftharpoons: "\u21CC", - xleftrightharpoons: "\u21CB", - xtwoheadleftarrow: "\u219E", - xtwoheadrightarrow: "\u21A0", - xLongequal: "=", - xtofrom: "\u21C4" - }; - - var mathMLnode = function mathMLnode(label) { - var node = new mathMLTree.MathNode("mo", [new mathMLTree.TextNode(stretchyCodePoint[label.substr(1)])]); - node.setAttribute("stretchy", "true"); - return node; - }; - - // In the katexImagesData object just below, the dimensions all - // correspond to path geometry inside the relevant SVG. - // For example, \rightarrow uses the same arrowhead as glyph U+2192 - // from the KaTeX Main font. The scaling factor is 1000. - // That is, inside the font, that arrowhead is 522 units tall, which - // corresponds to 0.522 em inside the document. - // And for extensible arrows, we split that distance around the math axis. - - var katexImagesData = { - // height, depth, imageName, minWidth - overleftarrow: [0.522, 0, "leftarrow", 0.5], - underleftarrow: [0.522, 0, "leftarrow", 0.5], - xleftarrow: [0.261, 0.261, "leftarrow", 0.783], - overrightarrow: [0.522, 0, "rightarrow", 0.5], - underrightarrow: [0.522, 0, "rightarrow", 0.5], - xrightarrow: [0.261, 0.261, "rightarrow", 0.783], - overbrace: [0.548, 0, "overbrace", 1.6], - underbrace: [0.548, 0, "underbrace", 1.6], - overleftrightarrow: [0.522, 0, "leftrightarrow", 0.5], - underleftrightarrow: [0.522, 0, "leftrightarrow", 0.5], - xleftrightarrow: [0.261, 0.261, "leftrightarrow", 0.783], - Overrightarrow: [0.56, 0, "doublerightarrow", 0.5], - xLeftarrow: [0.28, 0.28, "doubleleftarrow", 0.783], - xRightarrow: [0.28, 0.28, "doublerightarrow", 0.783], - xLeftrightarrow: [0.28, 0.28, "doubleleftrightarrow", 0.955], - overleftharpoon: [0.522, 0, "leftharpoon", 0.5], - overrightharpoon: [0.522, 0, "rightharpoon", 0.5], - xleftharpoonup: [0.261, 0.261, "leftharpoon", 0.783], - xrightharpoonup: [0.261, 0.261, "rightharpoon", 0.783], - xhookleftarrow: [0.261, 0.261, "hookleftarrow", 0.87], - xhookrightarrow: [0.261, 0.261, "hookrightarrow", 0.87], - overlinesegment: [0.414, 0, "linesegment", 0.5], - underlinesegment: [0.414, 0, "linesegment", 0.5], - xmapsto: [0.261, 0.261, "mapsto", 0.783], - xrightharpoondown: [0.261, 0.261, "rightharpoondown", 0.783], - xleftharpoondown: [0.261, 0.261, "leftharpoondown", 0.783], - xrightleftharpoons: [0.358, 0.358, "rightleftharpoons", 0.716], - xleftrightharpoons: [0.358, 0.358, "leftrightharpoons", 0.716], - overgroup: [0.342, 0, "overgroup", 0.87], - undergroup: [0.342, 0, "undergroup", 0.87], - xtwoheadleftarrow: [0.167, 0.167, "twoheadleftarrow", 0.86], - xtwoheadrightarrow: [0.167, 0.167, "twoheadrightarrow", 0.86], - xLongequal: [0.167, 
0.167, "longequal", 0.5], - xtofrom: [0.264, 0.264, "tofrom", 0.86] - }; - - // Many of the KaTeX SVG images have been adapted from glyphs in KaTeX fonts. - // Copyright (c) 2009-2010, Design Science, Inc. () - // Copyright (c) 2014-2017 Khan Academy () - // Licensed under the SIL Open Font License, Version 1.1. - // See \nhttp://scripts.sil.org/OFL - - // Nested SVGs - // Many of the KaTeX SVG images contain a nested SVG. This is done to - // achieve a stretchy image while avoiding distortion of arrowheads or - // brace corners. - - // The inner SVG typically contains a very long (400 em) arrow. - - // The outer SVG acts like a window that exposes only part of the inner SVG. - // The outer SVG will grow or shrink to match the dimensions set by CSS. - - // The inner SVG always has a longer, thinner aspect ratio than the outer - // SVG. After the inner SVG fills 100% of the height of the outer SVG, - // there is a long arrow shaft left over. That left-over shaft is not shown. - // Instead, it is sliced off because the inner SVG is set to - // "preserveAspectRatio='... slice'". - - // Thus, the reader sees an arrow that matches the subject matter width - // without distortion. - - // Some functions, such as \cancel, need to vary their aspect ratio. These - // functions do not get the nested SVG treatment. - - // Second Brush Stroke - // Low resolution monitors struggle to display images in fine detail. - // So browsers apply anti-aliasing. A long straight arrow shaft therefore - // will sometimes appear as if it has a blurred edge. - - // To mitigate this, these SVG files contain a second "brush-stroke" on the - // arrow shafts. That is, a second long thin rectangular SVG path has been - // written directly on top of each arrow shaft. This reinforcement causes - // some of the screen pixels to display as black instead of the anti-aliased - // gray pixel that a single path would generate. So we get arrow shafts - // whose edges appear to be sharper. - - var svgPath = { - doubleleftarrow: "", - - doublerightarrow: "", - - leftarrow: "", - - rightarrow: "" - }; - - var innerSVG = { - // Since bcancel's SVG is inline and it omits the viewBox attribute, - // it's stroke-width will not vary with span area. 
- bcancel: "", - - cancel: "", - - // The doubleleftarrow geometry is from glyph U+21D0 in the font KaTeX Main - doubleleftarrow: ">" + svgPath["doubleleftarrow"] + "", - - // doubleleftrightarrow is from glyph U+21D4 in font KaTeX Main - doubleleftrightarrow: ">" + svgPath["doubleleftarrow"] + "\n" + svgPath["doublerightarrow"] + "", - - // doublerightarrow is from glyph U+21D2 in font KaTeX Main - doublerightarrow: ">" + svgPath["doublerightarrow"] + "", - - // hookleftarrow is from glyph U+21A9 in font KaTeX Main - hookleftarrow: ">" + svgPath["leftarrow"] + "\n", - - // hookrightarrow is from glyph U+21AA in font KaTeX Main - hookrightarrow: ">" + svgPath["rightarrow"] + "", - - // leftarrow is from glyph U+2190 in font KaTeX Main - leftarrow: ">" + svgPath["leftarrow"] + "", - - // leftharpoon is from glyph U+21BD in font KaTeX Main - leftharpoon: ">", - - // leftharpoondown is from glyph U+21BD in font KaTeX Main - leftharpoondown: ">", - - // leftrightarrow is from glyph U+2194 in font KaTeX Main - leftrightarrow: ">" + svgPath["leftarrow"] + "\n" + svgPath["rightarrow"] + "", - - // leftrightharpoons is from glyphs U+21BC/21B1 in font KaTeX Main - leftrightharpoons: ">\n", - - linesegment: ">\n", - - longequal: " viewBox='0 0 100 334' preserveAspectRatio='none'>\n", - - // mapsto is from glyph U+21A6 in font KaTeX Main - mapsto: ">" + svgPath["rightarrow"] + "", - - // overbrace is from glyphs U+23A9/23A8/23A7 in font KaTeX_Size4-Regular - overbrace: ">\n", - - // overgroup is from the MnSymbol package (public domain) - overgroup: ">", - - // rightarrow is from glyph U+2192 in font KaTeX Main - rightarrow: ">" + svgPath["rightarrow"] + "", - - // rightharpoon is from glyph U+21C0 in font KaTeX Main - rightharpoon: ">", - - // rightharpoondown is from glyph U+21C1 in font KaTeX Main - rightharpoondown: ">", - - // rightleftharpoons is from glyph U+21CC in font KaTeX Main - rightleftharpoons: ">", - - // tilde1 is a modified version of a glyph from the MnSymbol package - tilde1: " viewBox='0 0 600 260' preserveAspectRatio='none'>\n", - - // Ditto tilde2, tilde3, and tilde 4 - tilde2: " viewBox='0 0 1033 286' preserveAspectRatio='none'>\n", - - tilde3: " viewBox='0 0 2339 306' preserveAspectRatio='none'>\n", - - tilde4: " viewBox='0 0 2340 312' preserveAspectRatio='none'>\n", - - // tofrom is from glyph U+21C4 in font KaTeX AMS Regular - tofrom: ">", - - // twoheadleftarrow is from glyph U+219E in font KaTeX AMS Regular - twoheadleftarrow: ">\n", - - // twoheadrightarrow is from glyph U+21A0 in font KaTeX AMS Regular - twoheadrightarrow: ">\n", - - // underbrace is from glyphs U+23A9/23A8/23A7 in font KaTeX_Size4-Regular - underbrace: ">\n", - - // undergroup is from the MnSymbol package (public domain) - undergroup: ">", - - // widehat1 is a modified version of a glyph from the MnSymbol package - widehat1: " viewBox='0 0 1062 239' preserveAspectRatio='none'>\n", - - // Ditto widehat2, widehat3, and widehat4 - widehat2: " viewBox='0 0 2364 300' preserveAspectRatio='none'>\n", - - widehat3: " viewBox='0 0 2364 360' preserveAspectRatio='none'>\n", - - widehat4: " viewBox='0 0 2364 420' preserveAspectRatio='none'>\n", - - xcancel: "\n" - }; - - var svgSpan = function svgSpan(group, options) { - // Create a span with inline SVG for the element. 
- var label = group.value.label.substr(1); - var height = 0; - var depth = 0; - var imageName = ""; - var minWidth = 0; - - if (utils.contains(["widehat", "widetilde", "undertilde"], label)) { - // There are four SVG images available for each function. - // Choose a taller image when there are more characters. - var numChars = group.value.value.length; - if (numChars > 5) { - height = 0.312; - imageName = (label === "widehat" ? "widehat" : "tilde") + "4"; - } else { - var imgIndex = [1, 1, 2, 2, 3, 3][numChars]; - if (label === "widehat") { - height = [0, 0.24, 0.30, 0.30, 0.36, 0.36][numChars]; - imageName = "widehat" + imgIndex; - } else { - height = [0, 0.26, 0.30, 0.30, 0.34, 0.34][numChars]; - imageName = "tilde" + imgIndex; - } - } - } else { - var imgData = katexImagesData[label]; - height = imgData[0]; - depth = imgData[1]; - imageName = imgData[2]; - minWidth = imgData[3]; - } - - var span = buildCommon.makeSpan([], [], options); - span.height = height; - span.depth = depth; - var totalHeight = height + depth; - span.style.height = totalHeight + "em"; - if (minWidth > 0) { - span.style.minWidth = minWidth + "em"; - } - - span.innerHTML = ""; - - return span; - }; - - var encloseSpan = function encloseSpan(inner, label, pad, options) { - // Return an image span for \cancel, \bcancel, \xcancel, or \fbox - var img = void 0; - var totalHeight = inner.height + inner.depth + 2 * pad; - - if (label === "fbox") { - img = buildCommon.makeSpan(["stretchy", label], [], options); - if (options.color) { - img.style.borderColor = options.getColor(); - } - } else { - img = buildCommon.makeSpan([], [], options); - img.innerHTML = "" + innerSVG[label] + ""; - } - - img.height = totalHeight; - img.style.height = totalHeight + "em"; - - return img; - }; - - module.exports = { - encloseSpan: encloseSpan, - mathMLnode: mathMLnode, - svgSpan: svgSpan - }; - - },{"./buildCommon":34,"./mathMLTree":45,"./utils":51}],48:[function(require,module,exports){ - - /** - * This file holds a list of all no-argument functions and single-character - * symbols (like 'a' or ';'). - * - * For each of the symbols, there are three properties they can have: - * - font (required): the font to be used for this symbol. Either "main" (the - normal font), or "ams" (the ams fonts). - * - group (required): the ParseNode group type the symbol should have (i.e. - "textord", "mathord", etc). - See https://github.com/Khan/KaTeX/wiki/Examining-TeX#group-types - * - replace: the character that this symbol or function should be - * replaced with (i.e. "\phi" has a replace value of "\u03d5", the phi - * character in the main font). - * - * The outermost map in the table indicates what mode the symbols should be - * accepted in (e.g. "math" or "text"). - */ - - module.exports = { - math: {}, - text: {} - }; - - function defineSymbol(mode, font, group, replace, name, acceptUnicodeChar) { - module.exports[mode][name] = { - font: font, - group: group, - replace: replace - }; - - if (acceptUnicodeChar) { - module.exports[mode][replace] = module.exports[mode][name]; - } - } - - // Some abbreviations for commonly used strings. - // This helps minify the code, and also spotting typos using jshint. 
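// A worked example of the table defineSymbol populates (derived from the
// function above, using the literal strings that the vars below abbreviate):
//   defineSymbol("math", "main", "rel", "\u2261", "\\equiv");
// stores module.exports.math["\\equiv"] = { font: "main", group: "rel",
// replace: "\u2261" }, letting the parser classify \equiv as a main-font
// relation and substitute U+2261 at render time.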
- - // modes: - var math = "math"; - var text = "text"; - - // fonts: - var main = "main"; - var ams = "ams"; - - // groups: - var accent = "accent"; - var bin = "bin"; - var close = "close"; - var inner = "inner"; - var mathord = "mathord"; - var op = "op"; - var open = "open"; - var punct = "punct"; - var rel = "rel"; - var spacing = "spacing"; - var textord = "textord"; - - // Now comes the symbol table - - // Relation Symbols - defineSymbol(math, main, rel, "\u2261", "\\equiv"); - defineSymbol(math, main, rel, "\u227A", "\\prec"); - defineSymbol(math, main, rel, "\u227B", "\\succ"); - defineSymbol(math, main, rel, "\u223C", "\\sim"); - defineSymbol(math, main, rel, "\u22A5", "\\perp"); - defineSymbol(math, main, rel, "\u2AAF", "\\preceq"); - defineSymbol(math, main, rel, "\u2AB0", "\\succeq"); - defineSymbol(math, main, rel, "\u2243", "\\simeq"); - defineSymbol(math, main, rel, "\u2223", "\\mid"); - defineSymbol(math, main, rel, "\u226A", "\\ll"); - defineSymbol(math, main, rel, "\u226B", "\\gg"); - defineSymbol(math, main, rel, "\u224D", "\\asymp"); - defineSymbol(math, main, rel, "\u2225", "\\parallel"); - defineSymbol(math, main, rel, "\u22C8", "\\bowtie"); - defineSymbol(math, main, rel, "\u2323", "\\smile"); - defineSymbol(math, main, rel, "\u2291", "\\sqsubseteq"); - defineSymbol(math, main, rel, "\u2292", "\\sqsupseteq"); - defineSymbol(math, main, rel, "\u2250", "\\doteq"); - defineSymbol(math, main, rel, "\u2322", "\\frown"); - defineSymbol(math, main, rel, "\u220B", "\\ni"); - defineSymbol(math, main, rel, "\u221D", "\\propto"); - defineSymbol(math, main, rel, "\u22A2", "\\vdash"); - defineSymbol(math, main, rel, "\u22A3", "\\dashv"); - defineSymbol(math, main, rel, "\u220B", "\\owns"); - - // Punctuation - defineSymbol(math, main, punct, ".", "\\ldotp"); - defineSymbol(math, main, punct, "\u22C5", "\\cdotp"); - - // Misc Symbols - defineSymbol(math, main, textord, "#", "\\#"); - defineSymbol(text, main, textord, "#", "\\#"); - defineSymbol(math, main, textord, "&", "\\&"); - defineSymbol(text, main, textord, "&", "\\&"); - defineSymbol(math, main, textord, "\u2135", "\\aleph"); - defineSymbol(math, main, textord, "\u2200", "\\forall"); - defineSymbol(math, main, textord, "\u210F", "\\hbar"); - defineSymbol(math, main, textord, "\u2203", "\\exists"); - defineSymbol(math, main, textord, "\u2207", "\\nabla"); - defineSymbol(math, main, textord, "\u266D", "\\flat"); - defineSymbol(math, main, textord, "\u2113", "\\ell"); - defineSymbol(math, main, textord, "\u266E", "\\natural"); - defineSymbol(math, main, textord, "\u2663", "\\clubsuit"); - defineSymbol(math, main, textord, "\u2118", "\\wp"); - defineSymbol(math, main, textord, "\u266F", "\\sharp"); - defineSymbol(math, main, textord, "\u2662", "\\diamondsuit"); - defineSymbol(math, main, textord, "\u211C", "\\Re"); - defineSymbol(math, main, textord, "\u2661", "\\heartsuit"); - defineSymbol(math, main, textord, "\u2111", "\\Im"); - defineSymbol(math, main, textord, "\u2660", "\\spadesuit"); - - // Math and Text - defineSymbol(math, main, textord, "\u2020", "\\dag"); - defineSymbol(text, main, textord, "\u2020", "\\dag"); - defineSymbol(text, main, textord, "\u2020", "\\textdagger"); - defineSymbol(math, main, textord, "\u2021", "\\ddag"); - defineSymbol(text, main, textord, "\u2021", "\\ddag"); - defineSymbol(text, main, textord, "\u2020", "\\textdaggerdbl"); - - // Large Delimiters - defineSymbol(math, main, close, "\u23B1", "\\rmoustache"); - defineSymbol(math, main, open, "\u23B0", "\\lmoustache"); - defineSymbol(math, 
main, close, "\u27EF", "\\rgroup"); - defineSymbol(math, main, open, "\u27EE", "\\lgroup"); - - // Binary Operators - defineSymbol(math, main, bin, "\u2213", "\\mp"); - defineSymbol(math, main, bin, "\u2296", "\\ominus"); - defineSymbol(math, main, bin, "\u228E", "\\uplus"); - defineSymbol(math, main, bin, "\u2293", "\\sqcap"); - defineSymbol(math, main, bin, "\u2217", "\\ast"); - defineSymbol(math, main, bin, "\u2294", "\\sqcup"); - defineSymbol(math, main, bin, "\u25EF", "\\bigcirc"); - defineSymbol(math, main, bin, "\u2219", "\\bullet"); - defineSymbol(math, main, bin, "\u2021", "\\ddagger"); - defineSymbol(math, main, bin, "\u2240", "\\wr"); - defineSymbol(math, main, bin, "\u2A3F", "\\amalg"); - - // Arrow Symbols - defineSymbol(math, main, rel, "\u27F5", "\\longleftarrow"); - defineSymbol(math, main, rel, "\u21D0", "\\Leftarrow"); - defineSymbol(math, main, rel, "\u27F8", "\\Longleftarrow"); - defineSymbol(math, main, rel, "\u27F6", "\\longrightarrow"); - defineSymbol(math, main, rel, "\u21D2", "\\Rightarrow"); - defineSymbol(math, main, rel, "\u27F9", "\\Longrightarrow"); - defineSymbol(math, main, rel, "\u2194", "\\leftrightarrow"); - defineSymbol(math, main, rel, "\u27F7", "\\longleftrightarrow"); - defineSymbol(math, main, rel, "\u21D4", "\\Leftrightarrow"); - defineSymbol(math, main, rel, "\u27FA", "\\Longleftrightarrow"); - defineSymbol(math, main, rel, "\u21A6", "\\mapsto"); - defineSymbol(math, main, rel, "\u27FC", "\\longmapsto"); - defineSymbol(math, main, rel, "\u2197", "\\nearrow"); - defineSymbol(math, main, rel, "\u21A9", "\\hookleftarrow"); - defineSymbol(math, main, rel, "\u21AA", "\\hookrightarrow"); - defineSymbol(math, main, rel, "\u2198", "\\searrow"); - defineSymbol(math, main, rel, "\u21BC", "\\leftharpoonup"); - defineSymbol(math, main, rel, "\u21C0", "\\rightharpoonup"); - defineSymbol(math, main, rel, "\u2199", "\\swarrow"); - defineSymbol(math, main, rel, "\u21BD", "\\leftharpoondown"); - defineSymbol(math, main, rel, "\u21C1", "\\rightharpoondown"); - defineSymbol(math, main, rel, "\u2196", "\\nwarrow"); - defineSymbol(math, main, rel, "\u21CC", "\\rightleftharpoons"); - - // AMS Negated Binary Relations - defineSymbol(math, ams, rel, "\u226E", "\\nless"); - defineSymbol(math, ams, rel, "\uE010", "\\nleqslant"); - defineSymbol(math, ams, rel, "\uE011", "\\nleqq"); - defineSymbol(math, ams, rel, "\u2A87", "\\lneq"); - defineSymbol(math, ams, rel, "\u2268", "\\lneqq"); - defineSymbol(math, ams, rel, "\uE00C", "\\lvertneqq"); - defineSymbol(math, ams, rel, "\u22E6", "\\lnsim"); - defineSymbol(math, ams, rel, "\u2A89", "\\lnapprox"); - defineSymbol(math, ams, rel, "\u2280", "\\nprec"); - defineSymbol(math, ams, rel, "\u22E0", "\\npreceq"); - defineSymbol(math, ams, rel, "\u22E8", "\\precnsim"); - defineSymbol(math, ams, rel, "\u2AB9", "\\precnapprox"); - defineSymbol(math, ams, rel, "\u2241", "\\nsim"); - defineSymbol(math, ams, rel, "\uE006", "\\nshortmid"); - defineSymbol(math, ams, rel, "\u2224", "\\nmid"); - defineSymbol(math, ams, rel, "\u22AC", "\\nvdash"); - defineSymbol(math, ams, rel, "\u22AD", "\\nvDash"); - defineSymbol(math, ams, rel, "\u22EA", "\\ntriangleleft"); - defineSymbol(math, ams, rel, "\u22EC", "\\ntrianglelefteq"); - defineSymbol(math, ams, rel, "\u228A", "\\subsetneq"); - defineSymbol(math, ams, rel, "\uE01A", "\\varsubsetneq"); - defineSymbol(math, ams, rel, "\u2ACB", "\\subsetneqq"); - defineSymbol(math, ams, rel, "\uE017", "\\varsubsetneqq"); - defineSymbol(math, ams, rel, "\u226F", "\\ngtr"); - defineSymbol(math, ams, rel, "\uE00F", 
"\\ngeqslant"); - defineSymbol(math, ams, rel, "\uE00E", "\\ngeqq"); - defineSymbol(math, ams, rel, "\u2A88", "\\gneq"); - defineSymbol(math, ams, rel, "\u2269", "\\gneqq"); - defineSymbol(math, ams, rel, "\uE00D", "\\gvertneqq"); - defineSymbol(math, ams, rel, "\u22E7", "\\gnsim"); - defineSymbol(math, ams, rel, "\u2A8A", "\\gnapprox"); - defineSymbol(math, ams, rel, "\u2281", "\\nsucc"); - defineSymbol(math, ams, rel, "\u22E1", "\\nsucceq"); - defineSymbol(math, ams, rel, "\u22E9", "\\succnsim"); - defineSymbol(math, ams, rel, "\u2ABA", "\\succnapprox"); - defineSymbol(math, ams, rel, "\u2246", "\\ncong"); - defineSymbol(math, ams, rel, "\uE007", "\\nshortparallel"); - defineSymbol(math, ams, rel, "\u2226", "\\nparallel"); - defineSymbol(math, ams, rel, "\u22AF", "\\nVDash"); - defineSymbol(math, ams, rel, "\u22EB", "\\ntriangleright"); - defineSymbol(math, ams, rel, "\u22ED", "\\ntrianglerighteq"); - defineSymbol(math, ams, rel, "\uE018", "\\nsupseteqq"); - defineSymbol(math, ams, rel, "\u228B", "\\supsetneq"); - defineSymbol(math, ams, rel, "\uE01B", "\\varsupsetneq"); - defineSymbol(math, ams, rel, "\u2ACC", "\\supsetneqq"); - defineSymbol(math, ams, rel, "\uE019", "\\varsupsetneqq"); - defineSymbol(math, ams, rel, "\u22AE", "\\nVdash"); - defineSymbol(math, ams, rel, "\u2AB5", "\\precneqq"); - defineSymbol(math, ams, rel, "\u2AB6", "\\succneqq"); - defineSymbol(math, ams, rel, "\uE016", "\\nsubseteqq"); - defineSymbol(math, ams, bin, "\u22B4", "\\unlhd"); - defineSymbol(math, ams, bin, "\u22B5", "\\unrhd"); - - // AMS Negated Arrows - defineSymbol(math, ams, rel, "\u219A", "\\nleftarrow"); - defineSymbol(math, ams, rel, "\u219B", "\\nrightarrow"); - defineSymbol(math, ams, rel, "\u21CD", "\\nLeftarrow"); - defineSymbol(math, ams, rel, "\u21CF", "\\nRightarrow"); - defineSymbol(math, ams, rel, "\u21AE", "\\nleftrightarrow"); - defineSymbol(math, ams, rel, "\u21CE", "\\nLeftrightarrow"); - - // AMS Misc - defineSymbol(math, ams, rel, "\u25B3", "\\vartriangle"); - defineSymbol(math, ams, textord, "\u210F", "\\hslash"); - defineSymbol(math, ams, textord, "\u25BD", "\\triangledown"); - defineSymbol(math, ams, textord, "\u25CA", "\\lozenge"); - defineSymbol(math, ams, textord, "\u24C8", "\\circledS"); - defineSymbol(math, ams, textord, "\xAE", "\\circledR"); - defineSymbol(text, ams, textord, "\xAE", "\\circledR"); - defineSymbol(math, ams, textord, "\u2221", "\\measuredangle"); - defineSymbol(math, ams, textord, "\u2204", "\\nexists"); - defineSymbol(math, ams, textord, "\u2127", "\\mho"); - defineSymbol(math, ams, textord, "\u2132", "\\Finv"); - defineSymbol(math, ams, textord, "\u2141", "\\Game"); - defineSymbol(math, ams, textord, "k", "\\Bbbk"); - defineSymbol(math, ams, textord, "\u2035", "\\backprime"); - defineSymbol(math, ams, textord, "\u25B2", "\\blacktriangle"); - defineSymbol(math, ams, textord, "\u25BC", "\\blacktriangledown"); - defineSymbol(math, ams, textord, "\u25A0", "\\blacksquare"); - defineSymbol(math, ams, textord, "\u29EB", "\\blacklozenge"); - defineSymbol(math, ams, textord, "\u2605", "\\bigstar"); - defineSymbol(math, ams, textord, "\u2222", "\\sphericalangle"); - defineSymbol(math, ams, textord, "\u2201", "\\complement"); - defineSymbol(math, ams, textord, "\xF0", "\\eth"); - defineSymbol(math, ams, textord, "\u2571", "\\diagup"); - defineSymbol(math, ams, textord, "\u2572", "\\diagdown"); - defineSymbol(math, ams, textord, "\u25A1", "\\square"); - defineSymbol(math, ams, textord, "\u25A1", "\\Box"); - defineSymbol(math, ams, textord, "\u25CA", "\\Diamond"); - 
defineSymbol(math, ams, textord, "\xA5", "\\yen"); - defineSymbol(math, ams, textord, "\u2713", "\\checkmark"); - defineSymbol(text, ams, textord, "\u2713", "\\checkmark"); - - // AMS Hebrew - defineSymbol(math, ams, textord, "\u2136", "\\beth"); - defineSymbol(math, ams, textord, "\u2138", "\\daleth"); - defineSymbol(math, ams, textord, "\u2137", "\\gimel"); - - // AMS Greek - defineSymbol(math, ams, textord, "\u03DD", "\\digamma"); - defineSymbol(math, ams, textord, "\u03F0", "\\varkappa"); - - // AMS Delimiters - defineSymbol(math, ams, open, "\u250C", "\\ulcorner"); - defineSymbol(math, ams, close, "\u2510", "\\urcorner"); - defineSymbol(math, ams, open, "\u2514", "\\llcorner"); - defineSymbol(math, ams, close, "\u2518", "\\lrcorner"); - - // AMS Binary Relations - defineSymbol(math, ams, rel, "\u2266", "\\leqq"); - defineSymbol(math, ams, rel, "\u2A7D", "\\leqslant"); - defineSymbol(math, ams, rel, "\u2A95", "\\eqslantless"); - defineSymbol(math, ams, rel, "\u2272", "\\lesssim"); - defineSymbol(math, ams, rel, "\u2A85", "\\lessapprox"); - defineSymbol(math, ams, rel, "\u224A", "\\approxeq"); - defineSymbol(math, ams, bin, "\u22D6", "\\lessdot"); - defineSymbol(math, ams, rel, "\u22D8", "\\lll"); - defineSymbol(math, ams, rel, "\u2276", "\\lessgtr"); - defineSymbol(math, ams, rel, "\u22DA", "\\lesseqgtr"); - defineSymbol(math, ams, rel, "\u2A8B", "\\lesseqqgtr"); - defineSymbol(math, ams, rel, "\u2251", "\\doteqdot"); - defineSymbol(math, ams, rel, "\u2253", "\\risingdotseq"); - defineSymbol(math, ams, rel, "\u2252", "\\fallingdotseq"); - defineSymbol(math, ams, rel, "\u223D", "\\backsim"); - defineSymbol(math, ams, rel, "\u22CD", "\\backsimeq"); - defineSymbol(math, ams, rel, "\u2AC5", "\\subseteqq"); - defineSymbol(math, ams, rel, "\u22D0", "\\Subset"); - defineSymbol(math, ams, rel, "\u228F", "\\sqsubset"); - defineSymbol(math, ams, rel, "\u227C", "\\preccurlyeq"); - defineSymbol(math, ams, rel, "\u22DE", "\\curlyeqprec"); - defineSymbol(math, ams, rel, "\u227E", "\\precsim"); - defineSymbol(math, ams, rel, "\u2AB7", "\\precapprox"); - defineSymbol(math, ams, rel, "\u22B2", "\\vartriangleleft"); - defineSymbol(math, ams, rel, "\u22B4", "\\trianglelefteq"); - defineSymbol(math, ams, rel, "\u22A8", "\\vDash"); - defineSymbol(math, ams, rel, "\u22AA", "\\Vvdash"); - defineSymbol(math, ams, rel, "\u2323", "\\smallsmile"); - defineSymbol(math, ams, rel, "\u2322", "\\smallfrown"); - defineSymbol(math, ams, rel, "\u224F", "\\bumpeq"); - defineSymbol(math, ams, rel, "\u224E", "\\Bumpeq"); - defineSymbol(math, ams, rel, "\u2267", "\\geqq"); - defineSymbol(math, ams, rel, "\u2A7E", "\\geqslant"); - defineSymbol(math, ams, rel, "\u2A96", "\\eqslantgtr"); - defineSymbol(math, ams, rel, "\u2273", "\\gtrsim"); - defineSymbol(math, ams, rel, "\u2A86", "\\gtrapprox"); - defineSymbol(math, ams, bin, "\u22D7", "\\gtrdot"); - defineSymbol(math, ams, rel, "\u22D9", "\\ggg"); - defineSymbol(math, ams, rel, "\u2277", "\\gtrless"); - defineSymbol(math, ams, rel, "\u22DB", "\\gtreqless"); - defineSymbol(math, ams, rel, "\u2A8C", "\\gtreqqless"); - defineSymbol(math, ams, rel, "\u2256", "\\eqcirc"); - defineSymbol(math, ams, rel, "\u2257", "\\circeq"); - defineSymbol(math, ams, rel, "\u225C", "\\triangleq"); - defineSymbol(math, ams, rel, "\u223C", "\\thicksim"); - defineSymbol(math, ams, rel, "\u2248", "\\thickapprox"); - defineSymbol(math, ams, rel, "\u2AC6", "\\supseteqq"); - defineSymbol(math, ams, rel, "\u22D1", "\\Supset"); - defineSymbol(math, ams, rel, "\u2290", "\\sqsupset"); - defineSymbol(math, 
ams, rel, "\u227D", "\\succcurlyeq"); - defineSymbol(math, ams, rel, "\u22DF", "\\curlyeqsucc"); - defineSymbol(math, ams, rel, "\u227F", "\\succsim"); - defineSymbol(math, ams, rel, "\u2AB8", "\\succapprox"); - defineSymbol(math, ams, rel, "\u22B3", "\\vartriangleright"); - defineSymbol(math, ams, rel, "\u22B5", "\\trianglerighteq"); - defineSymbol(math, ams, rel, "\u22A9", "\\Vdash"); - defineSymbol(math, ams, rel, "\u2223", "\\shortmid"); - defineSymbol(math, ams, rel, "\u2225", "\\shortparallel"); - defineSymbol(math, ams, rel, "\u226C", "\\between"); - defineSymbol(math, ams, rel, "\u22D4", "\\pitchfork"); - defineSymbol(math, ams, rel, "\u221D", "\\varpropto"); - defineSymbol(math, ams, rel, "\u25C0", "\\blacktriangleleft"); - defineSymbol(math, ams, rel, "\u2234", "\\therefore"); - defineSymbol(math, ams, rel, "\u220D", "\\backepsilon"); - defineSymbol(math, ams, rel, "\u25B6", "\\blacktriangleright"); - defineSymbol(math, ams, rel, "\u2235", "\\because"); - defineSymbol(math, ams, rel, "\u22D8", "\\llless"); - defineSymbol(math, ams, rel, "\u22D9", "\\gggtr"); - defineSymbol(math, ams, bin, "\u22B2", "\\lhd"); - defineSymbol(math, ams, bin, "\u22B3", "\\rhd"); - defineSymbol(math, ams, rel, "\u2242", "\\eqsim"); - defineSymbol(math, main, rel, "\u22C8", "\\Join"); - defineSymbol(math, ams, rel, "\u2251", "\\Doteq"); - - // AMS Binary Operators - defineSymbol(math, ams, bin, "\u2214", "\\dotplus"); - defineSymbol(math, ams, bin, "\u2216", "\\smallsetminus"); - defineSymbol(math, ams, bin, "\u22D2", "\\Cap"); - defineSymbol(math, ams, bin, "\u22D3", "\\Cup"); - defineSymbol(math, ams, bin, "\u2A5E", "\\doublebarwedge"); - defineSymbol(math, ams, bin, "\u229F", "\\boxminus"); - defineSymbol(math, ams, bin, "\u229E", "\\boxplus"); - defineSymbol(math, ams, bin, "\u22C7", "\\divideontimes"); - defineSymbol(math, ams, bin, "\u22C9", "\\ltimes"); - defineSymbol(math, ams, bin, "\u22CA", "\\rtimes"); - defineSymbol(math, ams, bin, "\u22CB", "\\leftthreetimes"); - defineSymbol(math, ams, bin, "\u22CC", "\\rightthreetimes"); - defineSymbol(math, ams, bin, "\u22CF", "\\curlywedge"); - defineSymbol(math, ams, bin, "\u22CE", "\\curlyvee"); - defineSymbol(math, ams, bin, "\u229D", "\\circleddash"); - defineSymbol(math, ams, bin, "\u229B", "\\circledast"); - defineSymbol(math, ams, bin, "\u22C5", "\\centerdot"); - defineSymbol(math, ams, bin, "\u22BA", "\\intercal"); - defineSymbol(math, ams, bin, "\u22D2", "\\doublecap"); - defineSymbol(math, ams, bin, "\u22D3", "\\doublecup"); - defineSymbol(math, ams, bin, "\u22A0", "\\boxtimes"); - - // AMS Arrows - defineSymbol(math, ams, rel, "\u21E2", "\\dashrightarrow"); - defineSymbol(math, ams, rel, "\u21E0", "\\dashleftarrow"); - defineSymbol(math, ams, rel, "\u21C7", "\\leftleftarrows"); - defineSymbol(math, ams, rel, "\u21C6", "\\leftrightarrows"); - defineSymbol(math, ams, rel, "\u21DA", "\\Lleftarrow"); - defineSymbol(math, ams, rel, "\u219E", "\\twoheadleftarrow"); - defineSymbol(math, ams, rel, "\u21A2", "\\leftarrowtail"); - defineSymbol(math, ams, rel, "\u21AB", "\\looparrowleft"); - defineSymbol(math, ams, rel, "\u21CB", "\\leftrightharpoons"); - defineSymbol(math, ams, rel, "\u21B6", "\\curvearrowleft"); - defineSymbol(math, ams, rel, "\u21BA", "\\circlearrowleft"); - defineSymbol(math, ams, rel, "\u21B0", "\\Lsh"); - defineSymbol(math, ams, rel, "\u21C8", "\\upuparrows"); - defineSymbol(math, ams, rel, "\u21BF", "\\upharpoonleft"); - defineSymbol(math, ams, rel, "\u21C3", "\\downharpoonleft"); - defineSymbol(math, ams, rel, "\u22B8", 
"\\multimap"); - defineSymbol(math, ams, rel, "\u21AD", "\\leftrightsquigarrow"); - defineSymbol(math, ams, rel, "\u21C9", "\\rightrightarrows"); - defineSymbol(math, ams, rel, "\u21C4", "\\rightleftarrows"); - defineSymbol(math, ams, rel, "\u21A0", "\\twoheadrightarrow"); - defineSymbol(math, ams, rel, "\u21A3", "\\rightarrowtail"); - defineSymbol(math, ams, rel, "\u21AC", "\\looparrowright"); - defineSymbol(math, ams, rel, "\u21B7", "\\curvearrowright"); - defineSymbol(math, ams, rel, "\u21BB", "\\circlearrowright"); - defineSymbol(math, ams, rel, "\u21B1", "\\Rsh"); - defineSymbol(math, ams, rel, "\u21CA", "\\downdownarrows"); - defineSymbol(math, ams, rel, "\u21BE", "\\upharpoonright"); - defineSymbol(math, ams, rel, "\u21C2", "\\downharpoonright"); - defineSymbol(math, ams, rel, "\u21DD", "\\rightsquigarrow"); - defineSymbol(math, ams, rel, "\u21DD", "\\leadsto"); - defineSymbol(math, ams, rel, "\u21DB", "\\Rrightarrow"); - defineSymbol(math, ams, rel, "\u21BE", "\\restriction"); - - defineSymbol(math, main, textord, "\u2018", "`"); - defineSymbol(math, main, textord, "$", "\\$"); - defineSymbol(text, main, textord, "$", "\\$"); - defineSymbol(text, main, textord, "$", "\\textdollar"); - defineSymbol(math, main, textord, "%", "\\%"); - defineSymbol(text, main, textord, "%", "\\%"); - defineSymbol(math, main, textord, "_", "\\_"); - defineSymbol(text, main, textord, "_", "\\_"); - defineSymbol(text, main, textord, "_", "\\textunderscore"); - defineSymbol(math, main, textord, "\u2220", "\\angle"); - defineSymbol(math, main, textord, "\u221E", "\\infty"); - defineSymbol(math, main, textord, "\u2032", "\\prime"); - defineSymbol(math, main, textord, "\u25B3", "\\triangle"); - defineSymbol(math, main, textord, "\u0393", "\\Gamma", true); - defineSymbol(math, main, textord, "\u0394", "\\Delta", true); - defineSymbol(math, main, textord, "\u0398", "\\Theta", true); - defineSymbol(math, main, textord, "\u039B", "\\Lambda", true); - defineSymbol(math, main, textord, "\u039E", "\\Xi", true); - defineSymbol(math, main, textord, "\u03A0", "\\Pi", true); - defineSymbol(math, main, textord, "\u03A3", "\\Sigma", true); - defineSymbol(math, main, textord, "\u03A5", "\\Upsilon", true); - defineSymbol(math, main, textord, "\u03A6", "\\Phi", true); - defineSymbol(math, main, textord, "\u03A8", "\\Psi", true); - defineSymbol(math, main, textord, "\u03A9", "\\Omega", true); - defineSymbol(math, main, textord, "\xAC", "\\neg"); - defineSymbol(math, main, textord, "\xAC", "\\lnot"); - defineSymbol(math, main, textord, "\u22A4", "\\top"); - defineSymbol(math, main, textord, "\u22A5", "\\bot"); - defineSymbol(math, main, textord, "\u2205", "\\emptyset"); - defineSymbol(math, ams, textord, "\u2205", "\\varnothing"); - defineSymbol(math, main, mathord, "\u03B1", "\\alpha", true); - defineSymbol(math, main, mathord, "\u03B2", "\\beta", true); - defineSymbol(math, main, mathord, "\u03B3", "\\gamma", true); - defineSymbol(math, main, mathord, "\u03B4", "\\delta", true); - defineSymbol(math, main, mathord, "\u03F5", "\\epsilon", true); - defineSymbol(math, main, mathord, "\u03B6", "\\zeta", true); - defineSymbol(math, main, mathord, "\u03B7", "\\eta", true); - defineSymbol(math, main, mathord, "\u03B8", "\\theta", true); - defineSymbol(math, main, mathord, "\u03B9", "\\iota", true); - defineSymbol(math, main, mathord, "\u03BA", "\\kappa", true); - defineSymbol(math, main, mathord, "\u03BB", "\\lambda", true); - defineSymbol(math, main, mathord, "\u03BC", "\\mu", true); - defineSymbol(math, main, mathord, "\u03BD", 
"\\nu", true); - defineSymbol(math, main, mathord, "\u03BE", "\\xi", true); - defineSymbol(math, main, mathord, "\u03BF", "\\omicron", true); - defineSymbol(math, main, mathord, "\u03C0", "\\pi", true); - defineSymbol(math, main, mathord, "\u03C1", "\\rho", true); - defineSymbol(math, main, mathord, "\u03C3", "\\sigma", true); - defineSymbol(math, main, mathord, "\u03C4", "\\tau", true); - defineSymbol(math, main, mathord, "\u03C5", "\\upsilon", true); - defineSymbol(math, main, mathord, "\u03D5", "\\phi", true); - defineSymbol(math, main, mathord, "\u03C7", "\\chi", true); - defineSymbol(math, main, mathord, "\u03C8", "\\psi", true); - defineSymbol(math, main, mathord, "\u03C9", "\\omega", true); - defineSymbol(math, main, mathord, "\u03B5", "\\varepsilon", true); - defineSymbol(math, main, mathord, "\u03D1", "\\vartheta", true); - defineSymbol(math, main, mathord, "\u03D6", "\\varpi", true); - defineSymbol(math, main, mathord, "\u03F1", "\\varrho", true); - defineSymbol(math, main, mathord, "\u03C2", "\\varsigma", true); - defineSymbol(math, main, mathord, "\u03C6", "\\varphi", true); - defineSymbol(math, main, bin, "\u2217", "*"); - defineSymbol(math, main, bin, "+", "+"); - defineSymbol(math, main, bin, "\u2212", "-"); - defineSymbol(math, main, bin, "\u22C5", "\\cdot"); - defineSymbol(math, main, bin, "\u2218", "\\circ"); - defineSymbol(math, main, bin, "\xF7", "\\div"); - defineSymbol(math, main, bin, "\xB1", "\\pm"); - defineSymbol(math, main, bin, "\xD7", "\\times"); - defineSymbol(math, main, bin, "\u2229", "\\cap"); - defineSymbol(math, main, bin, "\u222A", "\\cup"); - defineSymbol(math, main, bin, "\u2216", "\\setminus"); - defineSymbol(math, main, bin, "\u2227", "\\land"); - defineSymbol(math, main, bin, "\u2228", "\\lor"); - defineSymbol(math, main, bin, "\u2227", "\\wedge"); - defineSymbol(math, main, bin, "\u2228", "\\vee"); - defineSymbol(math, main, textord, "\u221A", "\\surd"); - defineSymbol(math, main, open, "(", "("); - defineSymbol(math, main, open, "[", "["); - defineSymbol(math, main, open, "\u27E8", "\\langle"); - defineSymbol(math, main, open, "\u2223", "\\lvert"); - defineSymbol(math, main, open, "\u2225", "\\lVert"); - defineSymbol(math, main, close, ")", ")"); - defineSymbol(math, main, close, "]", "]"); - defineSymbol(math, main, close, "?", "?"); - defineSymbol(math, main, close, "!", "!"); - defineSymbol(math, main, close, "\u27E9", "\\rangle"); - defineSymbol(math, main, close, "\u2223", "\\rvert"); - defineSymbol(math, main, close, "\u2225", "\\rVert"); - defineSymbol(math, main, rel, "=", "="); - defineSymbol(math, main, rel, "<", "<"); - defineSymbol(math, main, rel, ">", ">"); - defineSymbol(math, main, rel, ":", ":"); - defineSymbol(math, main, rel, "\u2248", "\\approx"); - defineSymbol(math, main, rel, "\u2245", "\\cong"); - defineSymbol(math, main, rel, "\u2265", "\\ge"); - defineSymbol(math, main, rel, "\u2265", "\\geq"); - defineSymbol(math, main, rel, "\u2190", "\\gets"); - defineSymbol(math, main, rel, ">", "\\gt"); - defineSymbol(math, main, rel, "\u2208", "\\in"); - defineSymbol(math, main, rel, "\u2209", "\\notin"); - defineSymbol(math, main, rel, "\u0338", "\\not"); - defineSymbol(math, main, rel, "\u2282", "\\subset"); - defineSymbol(math, main, rel, "\u2283", "\\supset"); - defineSymbol(math, main, rel, "\u2286", "\\subseteq"); - defineSymbol(math, main, rel, "\u2287", "\\supseteq"); - defineSymbol(math, ams, rel, "\u2288", "\\nsubseteq"); - defineSymbol(math, ams, rel, "\u2289", "\\nsupseteq"); - defineSymbol(math, main, rel, "\u22A8", 
"\\models"); - defineSymbol(math, main, rel, "\u2190", "\\leftarrow"); - defineSymbol(math, main, rel, "\u2264", "\\le"); - defineSymbol(math, main, rel, "\u2264", "\\leq"); - defineSymbol(math, main, rel, "<", "\\lt"); - defineSymbol(math, main, rel, "\u2260", "\\ne"); - defineSymbol(math, main, rel, "\u2260", "\\neq"); - defineSymbol(math, main, rel, "\u2192", "\\rightarrow"); - defineSymbol(math, main, rel, "\u2192", "\\to"); - defineSymbol(math, ams, rel, "\u2271", "\\ngeq"); - defineSymbol(math, ams, rel, "\u2270", "\\nleq"); - defineSymbol(math, main, spacing, null, "\\!"); - defineSymbol(math, main, spacing, "\xA0", "\\ "); - defineSymbol(math, main, spacing, "\xA0", "~"); - defineSymbol(math, main, spacing, null, "\\,"); - defineSymbol(math, main, spacing, null, "\\:"); - defineSymbol(math, main, spacing, null, "\\;"); - defineSymbol(math, main, spacing, null, "\\enspace"); - defineSymbol(math, main, spacing, null, "\\qquad"); - defineSymbol(math, main, spacing, null, "\\quad"); - defineSymbol(math, main, spacing, "\xA0", "\\space"); - defineSymbol(math, main, punct, ",", ","); - defineSymbol(math, main, punct, ";", ";"); - defineSymbol(math, main, punct, ":", "\\colon"); - defineSymbol(math, ams, bin, "\u22BC", "\\barwedge"); - defineSymbol(math, ams, bin, "\u22BB", "\\veebar"); - defineSymbol(math, main, bin, "\u2299", "\\odot"); - defineSymbol(math, main, bin, "\u2295", "\\oplus"); - defineSymbol(math, main, bin, "\u2297", "\\otimes"); - defineSymbol(math, main, textord, "\u2202", "\\partial"); - defineSymbol(math, main, bin, "\u2298", "\\oslash"); - defineSymbol(math, ams, bin, "\u229A", "\\circledcirc"); - defineSymbol(math, ams, bin, "\u22A1", "\\boxdot"); - defineSymbol(math, main, bin, "\u25B3", "\\bigtriangleup"); - defineSymbol(math, main, bin, "\u25BD", "\\bigtriangledown"); - defineSymbol(math, main, bin, "\u2020", "\\dagger"); - defineSymbol(math, main, bin, "\u22C4", "\\diamond"); - defineSymbol(math, main, bin, "\u22C6", "\\star"); - defineSymbol(math, main, bin, "\u25C3", "\\triangleleft"); - defineSymbol(math, main, bin, "\u25B9", "\\triangleright"); - defineSymbol(math, main, open, "{", "\\{"); - defineSymbol(text, main, textord, "{", "\\{"); - defineSymbol(text, main, textord, "{", "\\textbraceleft"); - defineSymbol(math, main, close, "}", "\\}"); - defineSymbol(text, main, textord, "}", "\\}"); - defineSymbol(text, main, textord, "}", "\\textbraceright"); - defineSymbol(math, main, open, "{", "\\lbrace"); - defineSymbol(math, main, close, "}", "\\rbrace"); - defineSymbol(math, main, open, "[", "\\lbrack"); - defineSymbol(math, main, close, "]", "\\rbrack"); - defineSymbol(text, main, textord, "<", "\\textless"); // in T1 fontenc - defineSymbol(text, main, textord, ">", "\\textgreater"); // in T1 fontenc - defineSymbol(math, main, open, "\u230A", "\\lfloor"); - defineSymbol(math, main, close, "\u230B", "\\rfloor"); - defineSymbol(math, main, open, "\u2308", "\\lceil"); - defineSymbol(math, main, close, "\u2309", "\\rceil"); - defineSymbol(math, main, textord, "\\", "\\backslash"); - defineSymbol(math, main, textord, "\u2223", "|"); - defineSymbol(math, main, textord, "\u2223", "\\vert"); - defineSymbol(text, main, textord, "|", "\\textbar"); // in T1 fontenc - defineSymbol(math, main, textord, "\u2225", "\\|"); - defineSymbol(math, main, textord, "\u2225", "\\Vert"); - defineSymbol(text, main, textord, "\u2225", "\\textbardbl"); - defineSymbol(math, main, rel, "\u2191", "\\uparrow"); - defineSymbol(math, main, rel, "\u21D1", "\\Uparrow"); - defineSymbol(math, 
main, rel, "\u2193", "\\downarrow"); - defineSymbol(math, main, rel, "\u21D3", "\\Downarrow"); - defineSymbol(math, main, rel, "\u2195", "\\updownarrow"); - defineSymbol(math, main, rel, "\u21D5", "\\Updownarrow"); - defineSymbol(math, main, op, "\u2210", "\\coprod"); - defineSymbol(math, main, op, "\u22C1", "\\bigvee"); - defineSymbol(math, main, op, "\u22C0", "\\bigwedge"); - defineSymbol(math, main, op, "\u2A04", "\\biguplus"); - defineSymbol(math, main, op, "\u22C2", "\\bigcap"); - defineSymbol(math, main, op, "\u22C3", "\\bigcup"); - defineSymbol(math, main, op, "\u222B", "\\int"); - defineSymbol(math, main, op, "\u222B", "\\intop"); - defineSymbol(math, main, op, "\u222C", "\\iint"); - defineSymbol(math, main, op, "\u222D", "\\iiint"); - defineSymbol(math, main, op, "\u220F", "\\prod"); - defineSymbol(math, main, op, "\u2211", "\\sum"); - defineSymbol(math, main, op, "\u2A02", "\\bigotimes"); - defineSymbol(math, main, op, "\u2A01", "\\bigoplus"); - defineSymbol(math, main, op, "\u2A00", "\\bigodot"); - defineSymbol(math, main, op, "\u222E", "\\oint"); - defineSymbol(math, main, op, "\u2A06", "\\bigsqcup"); - defineSymbol(math, main, op, "\u222B", "\\smallint"); - defineSymbol(text, main, inner, "\u2026", "\\textellipsis"); - defineSymbol(math, main, inner, "\u2026", "\\mathellipsis"); - defineSymbol(text, main, inner, "\u2026", "\\ldots", true); - defineSymbol(math, main, inner, "\u2026", "\\ldots", true); - defineSymbol(math, main, inner, "\u22EF", "\\cdots", true); - defineSymbol(math, main, inner, "\u22F1", "\\ddots", true); - defineSymbol(math, main, textord, "\u22EE", "\\vdots", true); - defineSymbol(math, main, accent, "\xB4", "\\acute"); - defineSymbol(math, main, accent, "`", "\\grave"); - defineSymbol(math, main, accent, "\xA8", "\\ddot"); - defineSymbol(math, main, accent, "~", "\\tilde"); - defineSymbol(math, main, accent, "\xAF", "\\bar"); - defineSymbol(math, main, accent, "\u02D8", "\\breve"); - defineSymbol(math, main, accent, "\u02C7", "\\check"); - defineSymbol(math, main, accent, "^", "\\hat"); - defineSymbol(math, main, accent, "\u20D7", "\\vec"); - defineSymbol(math, main, accent, "\u02D9", "\\dot"); - defineSymbol(math, main, mathord, "\u0131", "\\imath"); - defineSymbol(math, main, mathord, "\u0237", "\\jmath"); - defineSymbol(text, main, accent, "\u02CA", "\\'"); // acute - defineSymbol(text, main, accent, "\u02CB", "\\`"); // grave - defineSymbol(text, main, accent, "\u02C6", "\\^"); // circumflex - defineSymbol(text, main, accent, "\u02DC", "\\~"); // tilde - defineSymbol(text, main, accent, "\u02C9", "\\="); // macron - defineSymbol(text, main, accent, "\u02D8", "\\u"); // breve - defineSymbol(text, main, accent, "\u02D9", "\\."); // dot above - defineSymbol(text, main, accent, "\u02DA", "\\r"); // ring above - defineSymbol(text, main, accent, "\u02C7", "\\v"); // caron - defineSymbol(text, main, accent, "\xA8", '\\"'); // diaresis - defineSymbol(text, main, accent, "\u030B", "\\H"); // double acute - - defineSymbol(text, main, textord, "\u2013", "--"); - defineSymbol(text, main, textord, "\u2013", "\\textendash"); - defineSymbol(text, main, textord, "\u2014", "---"); - defineSymbol(text, main, textord, "\u2014", "\\textemdash"); - defineSymbol(text, main, textord, "\u2018", "`"); - defineSymbol(text, main, textord, "\u2018", "\\textquoteleft"); - defineSymbol(text, main, textord, "\u2019", "'"); - defineSymbol(text, main, textord, "\u2019", "\\textquoteright"); - defineSymbol(text, main, textord, "\u201C", "``"); - defineSymbol(text, main, textord, 
"\u201C", "\\textquotedblleft"); - defineSymbol(text, main, textord, "\u201D", "''"); - defineSymbol(text, main, textord, "\u201D", "\\textquotedblright"); - defineSymbol(math, main, textord, "\xB0", "\\degree"); - defineSymbol(text, main, textord, "\xB0", "\\degree"); - // TODO: In LaTeX, \pounds can generate a different character in text and math - // mode, but among our fonts, only Main-Italic defines this character "163". - defineSymbol(math, main, mathord, "\xA3", "\\pounds"); - defineSymbol(math, main, mathord, "\xA3", "\\mathsterling"); - defineSymbol(text, main, mathord, "\xA3", "\\pounds"); - defineSymbol(text, main, mathord, "\xA3", "\\textsterling"); - defineSymbol(math, ams, textord, "\u2720", "\\maltese"); - defineSymbol(text, ams, textord, "\u2720", "\\maltese"); - - defineSymbol(text, main, spacing, "\xA0", "\\ "); - defineSymbol(text, main, spacing, "\xA0", " "); - defineSymbol(text, main, spacing, "\xA0", "~"); - - // There are lots of symbols which are the same, so we add them in afterwards. - - // All of these are textords in math mode - var mathTextSymbols = "0123456789/@.\""; - for (var i = 0; i < mathTextSymbols.length; i++) { - var ch = mathTextSymbols.charAt(i); - defineSymbol(math, main, textord, ch, ch); - } - - // All of these are textords in text mode - var textSymbols = "0123456789!@*()-=+[]<>|\";:?/.,"; - for (var _i = 0; _i < textSymbols.length; _i++) { - var _ch = textSymbols.charAt(_i); - defineSymbol(text, main, textord, _ch, _ch); - } - - // All of these are textords in text mode, and mathords in math mode - var letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"; - for (var _i2 = 0; _i2 < letters.length; _i2++) { - var _ch2 = letters.charAt(_i2); - defineSymbol(math, main, mathord, _ch2, _ch2); - defineSymbol(text, main, textord, _ch2, _ch2); - } - - // Latin-1 letters - for (var _i3 = 0x00C0; _i3 <= 0x00D6; _i3++) { - var _ch3 = String.fromCharCode(_i3); - defineSymbol(math, main, mathord, _ch3, _ch3); - defineSymbol(text, main, textord, _ch3, _ch3); - } - - for (var _i4 = 0x00D8; _i4 <= 0x00F6; _i4++) { - var _ch4 = String.fromCharCode(_i4); - defineSymbol(math, main, mathord, _ch4, _ch4); - defineSymbol(text, main, textord, _ch4, _ch4); - } - - for (var _i5 = 0x00F8; _i5 <= 0x00FF; _i5++) { - var _ch5 = String.fromCharCode(_i5); - defineSymbol(math, main, mathord, _ch5, _ch5); - defineSymbol(text, main, textord, _ch5, _ch5); - } - - // Cyrillic - for (var _i6 = 0x0410; _i6 <= 0x044F; _i6++) { - var _ch6 = String.fromCharCode(_i6); - defineSymbol(text, main, textord, _ch6, _ch6); - } - - // Unicode versions of existing characters - defineSymbol(text, main, textord, "\u2013", "–"); - defineSymbol(text, main, textord, "\u2014", "—"); - defineSymbol(text, main, textord, "\u2018", "‘"); - defineSymbol(text, main, textord, "\u2019", "’"); - defineSymbol(text, main, textord, "\u201C", "“"); - defineSymbol(text, main, textord, "\u201D", "”"); - - },{}],49:[function(require,module,exports){ - - var hangulRegex = /[\uAC00-\uD7AF]/; - - // This regex combines - // - CJK symbols and punctuation: [\u3000-\u303F] - // - Hiragana: [\u3040-\u309F] - // - Katakana: [\u30A0-\u30FF] - // - CJK ideograms: [\u4E00-\u9FAF] - // - Hangul syllables: [\uAC00-\uD7AF] - // - Fullwidth punctuation: [\uFF00-\uFF60] - // Notably missing are halfwidth Katakana and Romanji glyphs. 
- var cjkRegex = /[\u3000-\u30FF\u4E00-\u9FAF\uAC00-\uD7AF\uFF00-\uFF60]/; - - module.exports = { - cjkRegex: cjkRegex, - hangulRegex: hangulRegex - }; - - },{}],50:[function(require,module,exports){ - - var _ParseError = require("./ParseError"); - - var _ParseError2 = _interopRequireDefault(_ParseError); - - function _interopRequireDefault(obj) { return obj && obj.__esModule ? obj : { default: obj }; } - - // This table gives the number of TeX pts in one of each *absolute* TeX unit. - // Thus, multiplying a length by this number converts the length from units - // into pts. Dividing the result by ptPerEm gives the number of ems - // *assuming* a font size of ptPerEm (normal size, normal style). - var ptPerUnit = { - // https://en.wikibooks.org/wiki/LaTeX/Lengths and - // https://tex.stackexchange.com/a/8263 - "pt": 1, // TeX point - "mm": 7227 / 2540, // millimeter - "cm": 7227 / 254, // centimeter - "in": 72.27, // inch - "bp": 803 / 800, // big (PostScript) points - "pc": 12, // pica - "dd": 1238 / 1157, // didot - "cc": 14856 / 1157, // cicero (12 didot) - "nd": 685 / 642, // new didot - "nc": 1370 / 107, // new cicero (12 new didot) - "sp": 1 / 65536, // scaled point (TeX's internal smallest unit) - // https://tex.stackexchange.com/a/41371 - "px": 803 / 800 }; - - // Dictionary of relative units, for fast validity testing. - /* eslint no-console:0 */ - - /** - * This file does conversion between units. In particular, it provides - * calculateSize to convert other units into ems. - */ - - var relativeUnit = { - "ex": true, - "em": true, - "mu": true - }; - - /** - * Determine whether the specified unit (either a string defining the unit - * or a "size" parse node containing a unit field) is valid. - */ - var validUnit = function validUnit(unit) { - if (unit.unit) { - unit = unit.unit; - } - return unit in ptPerUnit || unit in relativeUnit || unit === "ex"; - }; - - /* - * Convert a "size" parse node (with numeric "number" and string "unit" fields, - * as parsed by functions.js argType "size") into a CSS em value for the - * current style/scale. `options` gives the current options. - */ - var calculateSize = function calculateSize(sizeValue, options) { - var scale = void 0; - if (sizeValue.unit in ptPerUnit) { - // Absolute units - scale = ptPerUnit[sizeValue.unit] // Convert unit to pt - / options.fontMetrics().ptPerEm // Convert pt to CSS em - / options.sizeMultiplier; // Unscale to make absolute units - } else if (sizeValue.unit === "mu") { - // `mu` units scale with scriptstyle/scriptscriptstyle. - scale = options.fontMetrics().cssEmPerMu; - } else { - // Other relative units always refer to the *textstyle* font - // in the current size. - var unitOptions = void 0; - if (options.style.isTight()) { - // isTight() means current style is script/scriptscript. - unitOptions = options.havingStyle(options.style.text()); - } else { - unitOptions = options; - } - // TODO: In TeX these units are relative to the quad of the current - // *text* font, e.g. cmr10. KaTeX instead uses values from the - // comparably-sized *Computer Modern symbol* font. At 10pt, these - // match. At 7pt and 5pt, they differ: cmr7=1.138894, cmsy7=1.170641; - // cmr5=1.361133, cmsy5=1.472241. Consider $\scriptsize a\kern1emb$. - // TeX \showlists shows a kern of 1.13889 * fontsize; - // KaTeX shows a kern of 1.171 * fontsize. 
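- // Worked example for illustration (not part of the original file), assuming
- // normal-size metrics with ptPerEm = 10 and sizeMultiplier = 1: for the size
- // node {number: 1, unit: "cm"}, the absolute branch above gives
- // scale = (7227/254) / 10 / 1 ≈ 2.845, so \kern1cm becomes a 2.845em kern.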
- if (sizeValue.unit === "ex") {
- scale = unitOptions.fontMetrics().xHeight;
- } else if (sizeValue.unit === "em") {
- scale = unitOptions.fontMetrics().quad;
- } else {
- throw new _ParseError2.default("Invalid unit: '" + sizeValue.unit + "'");
- }
- if (unitOptions !== options) {
- scale *= unitOptions.sizeMultiplier / options.sizeMultiplier;
- }
- }
- return sizeValue.number * scale;
- };
-
- module.exports = {
- validUnit: validUnit,
- calculateSize: calculateSize
- };
-
- },{"./ParseError":29}],51:[function(require,module,exports){
-
- /**
- * This file contains a list of utility functions which are useful in other
- * files.
- */
-
- /**
- * Provide an `indexOf` function which works in IE8, but defers to native if
- * possible.
- */
- var nativeIndexOf = Array.prototype.indexOf;
- var indexOf = function indexOf(list, elem) {
- if (list == null) {
- return -1;
- }
- if (nativeIndexOf && list.indexOf === nativeIndexOf) {
- return list.indexOf(elem);
- }
- var l = list.length;
- for (var i = 0; i < l; i++) {
- if (list[i] === elem) {
- return i;
- }
- }
- return -1;
- };
-
- /**
- * Return whether an element is contained in a list
- */
- var contains = function contains(list, elem) {
- return indexOf(list, elem) !== -1;
- };
-
- /**
- * Provide a default value if a setting is undefined
- */
- var deflt = function deflt(setting, defaultIfUndefined) {
- return setting === undefined ? defaultIfUndefined : setting;
- };
-
- // hyphenate and escape adapted from Facebook's React under Apache 2 license
-
- var uppercase = /([A-Z])/g;
- var hyphenate = function hyphenate(str) {
- return str.replace(uppercase, "-$1").toLowerCase();
- };
-
- var ESCAPE_LOOKUP = {
- "&": "&amp;",
- ">": "&gt;",
- "<": "&lt;",
- "\"": "&quot;",
- "'": "&#x27;"
- };
-
- var ESCAPE_REGEX = /[&><"']/g;
-
- function escaper(match) {
- return ESCAPE_LOOKUP[match];
- }
-
- /**
- * Escapes text to prevent scripting attacks.
- *
- * @param {*} text Text value to escape.
- * @return {string} An escaped string.
- */
- function escape(text) {
- return ("" + text).replace(ESCAPE_REGEX, escaper);
- }
-
- /**
- * A function to set the text content of a DOM element in all supported
- * browsers. Note that we don't define this if there is no document.
- */
- var setTextContent = void 0;
- if (typeof document !== "undefined") {
- var testNode = document.createElement("span");
- if ("textContent" in testNode) {
- setTextContent = function setTextContent(node, text) {
- node.textContent = text;
- };
- } else {
- setTextContent = function setTextContent(node, text) {
- node.innerText = text;
- };
- }
- }
-
- /**
- * A function to clear a node.
- */
- function clearNode(node) {
- setTextContent(node, "");
- }
-
- module.exports = {
- contains: contains,
- deflt: deflt,
- escape: escape,
- hyphenate: hyphenate,
- indexOf: indexOf,
- setTextContent: setTextContent,
- clearNode: clearNode
- };
-
- },{}]},{},[1])(1)
- });
- });
-
- var katex$2 = unwrapExports(katex$1);
-
- // Copyright 2018 The Distill Template Authors
- //
- // Licensed under the Apache License, Version 2.0 (the "License");
- // you may not use this file except in compliance with the License.
- // You may obtain a copy of the License at
- //
- // http://www.apache.org/licenses/LICENSE-2.0
- //
- // Unless required by applicable law or agreed to in writing, software
- // distributed under the License is distributed on an "AS IS" BASIS,
- // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- // See the License for the specific language governing permissions and - // limitations under the License. - - // This is a straight concatenation of code from KaTeX's contrib folder, - // but we aren't using some of their helpers that don't work well outside a browser environment. - - /*global katex */ - - const findEndOfMath = function(delimiter, text, startIndex) { - // Adapted from - // https://github.com/Khan/perseus/blob/master/src/perseus-markdown.jsx - let index = startIndex; - let braceLevel = 0; - - const delimLength = delimiter.length; - - while (index < text.length) { - const character = text[index]; - - if ( - braceLevel <= 0 && - text.slice(index, index + delimLength) === delimiter - ) { - return index; - } else if (character === "\\") { - index++; - } else if (character === "{") { - braceLevel++; - } else if (character === "}") { - braceLevel--; - } - - index++; - } - - return -1; - }; - - const splitAtDelimiters = function(startData, leftDelim, rightDelim, display) { - const finalData = []; - - for (let i = 0; i < startData.length; i++) { - if (startData[i].type === "text") { - const text = startData[i].data; - - let lookingForLeft = true; - let currIndex = 0; - let nextIndex; - - nextIndex = text.indexOf(leftDelim); - if (nextIndex !== -1) { - currIndex = nextIndex; - finalData.push({ - type: "text", - data: text.slice(0, currIndex) - }); - lookingForLeft = false; - } - - while (true) { - // eslint-disable-line no-constant-condition - if (lookingForLeft) { - nextIndex = text.indexOf(leftDelim, currIndex); - if (nextIndex === -1) { - break; - } - - finalData.push({ - type: "text", - data: text.slice(currIndex, nextIndex) - }); - - currIndex = nextIndex; - } else { - nextIndex = findEndOfMath( - rightDelim, - text, - currIndex + leftDelim.length - ); - if (nextIndex === -1) { - break; - } - - finalData.push({ - type: "math", - data: text.slice(currIndex + leftDelim.length, nextIndex), - rawData: text.slice(currIndex, nextIndex + rightDelim.length), - display: display - }); - - currIndex = nextIndex + rightDelim.length; - } - - lookingForLeft = !lookingForLeft; - } - - finalData.push({ - type: "text", - data: text.slice(currIndex) - }); - } else { - finalData.push(startData[i]); - } - } - - return finalData; - }; - - const splitWithDelimiters = function(text, delimiters) { - let data = [{ type: "text", data: text }]; - for (let i = 0; i < delimiters.length; i++) { - const delimiter = delimiters[i]; - data = splitAtDelimiters( - data, - delimiter.left, - delimiter.right, - delimiter.display || false - ); - } - return data; - }; - - /* Note: optionsCopy is mutated by this method. If it is ever exposed in the - * API, we should copy it before mutating. 
- */ - const renderMathInText = function(text, optionsCopy) { - const data = splitWithDelimiters(text, optionsCopy.delimiters); - const fragment = document.createDocumentFragment(); - - for (let i = 0; i < data.length; i++) { - if (data[i].type === "text") { - fragment.appendChild(document.createTextNode(data[i].data)); - } else { - const tag = document.createElement("d-math"); - const math = data[i].data; - // Override any display mode defined in the settings with that - // defined by the text itself - optionsCopy.displayMode = data[i].display; - try { - tag.textContent = math; - if (optionsCopy.displayMode) { - tag.setAttribute("block", ""); - } - } catch (e) { - if (!(e instanceof katex.ParseError)) { - throw e; - } - optionsCopy.errorCallback( - "KaTeX auto-render: Failed to parse `" + data[i].data + "` with ", - e - ); - fragment.appendChild(document.createTextNode(data[i].rawData)); - continue; - } - fragment.appendChild(tag); - } - } - - return fragment; - }; - - const renderElem = function(elem, optionsCopy) { - for (let i = 0; i < elem.childNodes.length; i++) { - const childNode = elem.childNodes[i]; - if (childNode.nodeType === 3) { - // Text node - const text = childNode.textContent; - if (optionsCopy.mightHaveMath(text)) { - const frag = renderMathInText(text, optionsCopy); - i += frag.childNodes.length - 1; - elem.replaceChild(frag, childNode); - } - } else if (childNode.nodeType === 1) { - // Element node - const shouldRender = - optionsCopy.ignoredTags.indexOf(childNode.nodeName.toLowerCase()) === - -1; - - if (shouldRender) { - renderElem(childNode, optionsCopy); - } - } - // Otherwise, it's something else, and ignore it. - } - }; - - const defaultAutoRenderOptions = { - delimiters: [ - { left: "$$", right: "$$", display: true }, - { left: "\\[", right: "\\]", display: true }, - { left: "\\(", right: "\\)", display: false } - // LaTeX uses this, but it ruins the display of normal `$` in text: - // {left: '$', right: '$', display: false}, - ], - - ignoredTags: [ - "script", - "noscript", - "style", - "textarea", - "pre", - "code", - "svg" - ], - - errorCallback: function(msg, err) { - console.error(msg, err); - } - }; - - const renderMathInElement = function(elem, options) { - if (!elem) { - throw new Error("No element provided to render"); - } - - const optionsCopy = Object.assign({}, defaultAutoRenderOptions, options); - const delimiterStrings = optionsCopy.delimiters.flatMap(d => [ - d.left, - d.right - ]); - const mightHaveMath = text => - delimiterStrings.some(d => text.indexOf(d) !== -1); - optionsCopy.mightHaveMath = mightHaveMath; - renderElem(elem, optionsCopy); - }; - - // Copyright 2018 The Distill Template Authors - - function Mathematics(dom, data) { - let needsCSS = false; - const body = dom.querySelector('body'); - - if (!body) { - console.warn("No body tag found!"); - return; - } - - if (data.katex && data.katex.delimiters) { - global.document = dom; - renderMathInElement(body, data.katex); - } - - // render d-math tags - const mathTags = body.querySelectorAll('d-math'); - if (mathTags.length > 0) { - needsCSS = true; - console.warn(`Prerendering ${mathTags.length} math tags...`); - for (const mathTag of mathTags) { - const localOptions = { displayMode: mathTag.hasAttribute('block') }; - const options = Object.assign(localOptions, data.katex); - const html = katex$2.renderToString(mathTag.textContent, options); - const container = dom.createElement('span'); - container.innerHTML = html; - mathTag.parentElement.insertBefore(container, mathTag); - 
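- // Note (not part of the original file): the span holding the prerendered
- // KaTeX markup was inserted just before the d-math element above; removing
- // the element next makes the static markup take its place in the DOM.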
mathTag.parentElement.removeChild(mathTag); - } - } - - if (needsCSS) { - const katexCSSTag = ''; - dom.head.insertAdjacentHTML('beforeend', katexCSSTag); - } - - } - - var favicon = "iVBORw0KGgoAAAANSUhEUgAAAEAAAABACAYAAACqaXHeAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAAA99JREFUeNrsG4t1ozDMzQSM4A2ODUonKBucN2hugtIJ6E1AboLcBiQTkJsANiAb9OCd/OpzMWBJBl5TvaeXPiiyJetry0J8wW3D3QpjRh3GjneXDq+fSQA9s2mH9x3KDhN4foJfCb8N/Jrv+2fnDn8vLRQOplWHVYdvHZYdZsBcZP1vBmh/n8DzEmhUQDPaOuP9pFuY+JwJHwHnCLQE2tnWBGEyXozY9xCUgHMhhjE2I4heVWtgIkZ83wL6Qgxj1obfWBxymPwe+b00BCCRNPbwfb60yleAkkBHGT5AEehIYz7eJrFDMF9CvH4wwhcGHiHMneFvLDQwlwvMLQq58trRcYBWfYn0A0OgHWQUSu25mE+BnoYKnnEJoeIWAifzOv7vLWd2ZKRfWAIme3tOiUaQ3UnLkb0xj1FxRIeEGKaGIHOs9nEgLaaA9i0JRYo1Ic67wJW86KSKE/ZAM8KuVMk8ITVhmxUxJ3Cl2xlm9Vtkeju1+mpCQNxaEGNCY8bs9X2YqwNoQeGjBWut/ma0QAWy/TqAsHx9wSya3I5IRxOfTC+leG+kA/4vSeEcGBtNUN6byhu3+keEZCQJUNh8MAO7HL6H8pQLnsW/Hd4T4lv93TPjfM7A46iEEqbB5EDOvwYNW6tGNZzT/o+CZ6sqZ6wUtR/wf7mi/VL8iNciT6rHih48Y55b4nKCHJCCzb4y0nwFmin3ZEMIoLfZF8F7nncFmvnWBaBj7CGAYA/WGJsUwHdYqVDwAmNsUgAx4CGgAA7GOOxADYOFWOaIKifuVYzmOpREqA21Mo7aPsgiY1PhOMAmxtR+AUbYH3Id2wc0SAFIQTsn9IUGWR8k9jx3vtXSiAacFxTAGakBk9UudkNECd6jLe+6HrshshvIuC6IlLMRy7er+JpcKma24SlE4cFZSZJDGVVrsNvitQhQrDhW0jfiOLfFd47C42eHT56D/BK0To+58Ahj+cAT8HT1UWlfLZCCd/uKawzU0Rh2EyIX/Icqth3niG8ybNroezwe6khdCNxRN+l4XGdOLVLlOOt2hTRJlr1ETIuMAltVTMz70mJrkdGAaZLSmnBEqmAE32JCMmuTlCnRgsBENtOUpHhvvsYIL0ibnBkaC6QvKcR7738GKp0AKnim7xgUSNv1bpS8QwhBt8r+EP47v/oyRK/S34yJ9nT+AN0Tkm4OdB9E4BsmXM3SnMlRFUrtp6IDpV2eKzdYvF3etm3KhQksbOLChGkSmcBdmcEwvqkrMy5BzL00NZeu3qPYJOOuCc+5NjcWKXQxFvTa3NoXJ4d8in7fiAUuTt781dkvuHX4K8AA2Usy7yNKLy0AAAAASUVORK5CYII=\n"; - - /*! +!function(e,t){"object"==typeof exports&&"undefined"!=typeof module?t(exports,require("fs")):"function"==typeof define&&define.amd?define(["exports","fs"],t):t((e=e||self).dl={},e.fs)}(this,function(e,t){"use strict";function n(e,t){e.title=t.title,t.published&&(t.published instanceof Date?e.publishedDate=t.published:t.published.constructor===String&&(e.publishedDate=new Date(t.published))),t.publishedDate&&(t.publishedDate instanceof Date?e.publishedDate=t.publishedDate:t.publishedDate.constructor===String?e.publishedDate=new Date(t.publishedDate):console.error("Don't know what to do with published date: "+t.publishedDate)),e.description=t.description,e.authors=t.authors.map(e=>new te(e)),e.katex=t.katex,e.password=t.password,t.doi&&(e.doi=t.doi)} +// Copyright 2018 The Distill Template Authors +function r(e){for(let t of e.authors){const e=Boolean(t.affiliation),n=Boolean(t.affiliations);if(e)if(n)console.warn(`Author ${t.author} has both old-style ("affiliation" & "affiliationURL") and new style ("affiliations") affiliation information!`);else{let e={name:t.affiliation};t.affiliationURL&&(e.url=t.affiliationURL),t.affiliations=[e]}}return e}function i(e){const t=e.firstElementChild;if(t){if("json"==t.getAttribute("type").split("/")[1]){const e=t.textContent;return r(JSON.parse(e))}console.error("Distill only supports JSON frontmatter tags anymore; no more YAML.")}else console.error("You added a frontmatter tag but did not provide a script tag with front matter data in it. 
Please take a look at our templates.");return{}} +// Copyright 2018 The Distill Template Authors +function a(e,t){const r=e.querySelector("d-front-matter");r?n(t,i(r)):console.warn("No front matter tag found!")}function o(){throw new Error("Dynamic requires are not currently supported by rollup-plugin-commonjs")}function s(e){return e&&e.__esModule&&Object.prototype.hasOwnProperty.call(e,"default")?e["default"]:e}function l(e,t){return e(t={exports:{}},t.exports),t.exports} +// Copyright 2018 The Distill Template Authors +function u(e){return e.replace(/[\t\n ]+/g," ").replace(/{\\["^`.'acu~Hvs]( )?([a-zA-Z])}/g,(e,t,n)=>n).replace(/{\\([a-zA-Z])}/g,(e,t)=>t)}function d(e){const t=new Map,n=re.toJSON(e);for(const e of n){for(const[t,n]of Object.entries(e.entryTags))e.entryTags[t.toLowerCase()]=u(n);e.entryTags.type=e.entryType,t.set(e.citationKey,e.entryTags)}return t}function c(e){return`@article{${e.slug},\n author = {${e.bibtexAuthors}},\n title = {${e.title}},\n journal = {${e.journal.title}},\n year = {${e.publishedYear}},\n note = {${e.url}},\n doi = {${e.doi}}\n}`} +// Copyright 2018 The Distill Template Authors +function h(e){const t=e.firstElementChild;if(t&&"SCRIPT"===t.tagName){if("text/bibtex"==t.type){return d(e.firstElementChild.textContent)}if("text/json"==t.type)return new Map(JSON.parse(t.textContent));console.warn("Unsupported bibliography script tag type: "+t.type)}else console.warn("Bibliography did not have any script tag.")} +// Copyright 2018 The Distill Template Authors +function p(e,n){const r=e.querySelector("d-bibliography");if(!r)return void console.warn("No bibliography tag found!");const i=r.getAttribute("src");if(i){const a=n.inputDirectory+"/"+i,o=d(t.readFileSync(a,"utf-8")),s=e.createElement("script");s.type="text/json",s.textContent=JSON.stringify([...o]),r.appendChild(s),r.removeAttribute("src")}n.bibliography=h(r)} +// Copyright 2018 The Distill Template Authors +function f(e=document){const t=new Set,n=e.querySelectorAll("d-cite");for(const e of n){const n=(e.getAttribute("key")||e.getAttribute("bibtex-key")).split(",").map(e=>e.trim());for(const e of n)t.add(e)}return[...t]}function m(e,t,n,r){if(null==e.author)return"";var i=e.author.split(" and ");let a=i.map(e=>{if(-1!=(e=e.trim()).indexOf(","))var n=e.split(",")[0].trim(),r=e.split(",")[1];else if(-1!=e.indexOf(" "))n=e.split(" ").slice(-1)[0].trim(),r=e.split(" ").slice(0,-1).join(" ");else n=e.trim();var i="";return r!=undefined&&(i=(i=r.trim().split(" ").map(e=>e.trim()[0])).join(".")+"."),t.replace("${F}",r).replace("${L}",n).replace("${I}",i).trim()});if(i.length>1){var o=a.slice(0,i.length-1).join(n);return o+=(r||n)+a[i.length-1]}return a[0]}function g(e){var t=e.journal||e.booktitle||"";if("volume"in e){var n=e.issue||e.number;n=n!=undefined?"("+n+")":"",t+=", Vol "+e.volume+n}return"pages"in e&&(t+=", pp. "+e.pages),""!=t&&(t+=". "),"publisher"in e&&"."!=(t+=e.publisher)[t.length-1]&&(t+="."),t}function v(e){if("url"in e){var t=e.url,n=/arxiv\.org\/abs\/([0-9\.]*)/.exec(t);if(null!=n&&(t=`http://arxiv.org/pdf/${n[1]}.pdf`),".pdf"==t.slice(-4))var r="PDF";else if(".html"==t.slice(-5))r="HTML";return`  [${r||"link"}]`}return""}function b(e,t){return"doi"in e?`${t?"
      ":""} DOI: ${e.doi}`:""}function y(e){return''+e.title+" "}function x(e){if(e){var t=y(e);return t+=v(e)+"
      ",e.author&&(t+=m(e,"${L}, ${I}",", "," and "),(e.year||e.date)&&(t+=", ")),e.year||e.date?t+=(e.year||e.date)+". ":t+=". ",t+=g(e),t+=b(e)}return"?"} +// Copyright 2018 The Distill Template Authors +function w(e,t){const n=new Set(t.citations),r=f(e);for(const e of r)n.add(e);t.citations=Array.from(n)} +// Copyright 2018 The Distill Template Authors +function k(e){const t=e.querySelector("head");if(e.querySelector("html").getAttribute("lang")||e.querySelector("html").setAttribute("lang","en"),!e.querySelector("meta[charset]")){const n=e.createElement("meta");n.setAttribute("charset","utf-8"),t.appendChild(n)}if(!e.querySelector("meta[name=viewport]")){const n=e.createElement("meta");n.setAttribute("name","viewport"),n.setAttribute("content","width=device-width, initial-scale=1"),t.appendChild(n)}} +// Copyright 2018 The Distill Template Authors +function M(e){return`\n \n`} +// Copyright 2018 The Distill Template Authors +function S(e,t){const n=e.querySelector("d-byline");n&&(n.innerHTML=M(t))} +// Copyright 2018 The Distill Template Authors +function z(e,t){const n=e.body,r=n.querySelector("d-article");if(!r)return void console.warn("No d-article tag found; skipping adding optional components!");let i=e.querySelector("d-byline");i||(t.authors?(i=e.createElement("d-byline"),n.insertBefore(i,r)):console.warn("No authors found in front matter; please add them before submission!"));let a=e.querySelector("d-title");a||(a=e.createElement("d-title"),n.insertBefore(a,i));let o=a.querySelector("h1");o||((o=e.createElement("h1")).textContent=t.title,a.insertBefore(o,a.firstChild));const s="undefined"!=typeof t.password;let l=n.querySelector("d-interstitial");if(s&&!l){const r="undefined"!=typeof window,i=r&&window.location.hostname.includes("localhost");r&&i||((l=e.createElement("d-interstitial")).password=t.password,n.insertBefore(l,n.firstChild))}else!s&&l&&l.parentElement.removeChild(this);let u=e.querySelector("d-appendix");u||(u=e.createElement("d-appendix"),e.body.appendChild(u));let d=e.querySelector("d-footnote-list");d||(d=e.createElement("d-footnote-list"),u.appendChild(d));let c=e.querySelector("d-citation-list");c||(c=e.createElement("d-citation-list"),u.appendChild(c))} +// Copyright 2018 The Distill Template Authors +function A(e,t){let n=!1;const r=e.querySelector("body");if(!r)return void console.warn("No body tag found!");t.katex&&t.katex.delimiters&&(global.document=e,ce(r,t.katex));const i=r.querySelectorAll("d-math");if(i.length>0){n=!0,console.warn(`Prerendering ${i.length} math tags...`);for(const n of i){const r={displayMode:n.hasAttribute("block")},i=Object.assign(r,t.katex),a=ie.renderToString(n.textContent,i),o=e.createElement("span");o.innerHTML=a,n.parentElement.insertBefore(o,n),n.parentElement.removeChild(n)}}if(n){const t='';e.head.insertAdjacentHTML("beforeend",t)}}function C(e){var t,n=""+e,r=pe.exec(n);if(!r)return n;var i="",a=0,o=0;for(a=r.index;a\n`)}let r=e.querySelector("head"),i=e=>N(r,e);if(i(`\n \n \n \n `),t.title&&i(`\n ${fe(t.title)}\n `),t.url&&i(`\n \n `),t.publishedDate&&i(`\n \n \n \n \n `),t.updatedDate&&i(`\n \n `),(t.authors||[]).forEach(e=>{N(r,`\n `)}),i(`\n \n \n \n \n \n \n \n \n `),i(`\n \n \n \n \n \n \n \n \n `),t.doiSuffix){i("\n \n"),n("citation_title",t.title),n("citation_fulltext_html_url",t.url),n("citation_volume",t.volume),n("citation_issue",t.issue),n("citation_firstpage",t.doiSuffix?`e${t.doiSuffix}`:undefined),n("citation_doi",t.doi);let 
e=t.journal||{};n("citation_journal_title",e.full_title||e.title),n("citation_journal_abbrev",e.abbrev_title),n("citation_issn",e.issn),n("citation_publisher",e.publisher),n("citation_fulltext_world_readable","",!0),t.publishedDate&&(n("citation_online_date",`${t.publishedYear}/${t.publishedMonthPadded}/${t.publishedDayPadded}`),n("citation_publication_date",`${t.publishedYear}/${t.publishedMonthPadded}/${t.publishedDayPadded}`)),(t.authors||[]).forEach(e=>{n("citation_author",`${e.lastName}, ${e.firstName}`),n("citation_author_institution",e.affiliation)})}else console.warn("No DOI suffix in data; not adding citation meta tags!");t.citations?t.citations.forEach(e=>{if(t.bibliography&&t.bibliography.has(e)){n("citation_reference",E(t.bibliography.get(e)))}else console.warn("No bibliography data found for "+e)}):console.warn("No citations found; not adding any references meta tags!")}function N(e,t){e.innerHTML+=t}function E(e){var t=`citation_title=${e.title};`;e.author&&""!==e.author&&e.author.split(" and ").forEach(e=>{let n,r;-1!=(e=e.trim()).indexOf(",")?(n=e.split(",")[0].trim(),r=e.split(",")[1].trim()):(n=e.split(" ").slice(-1)[0].trim(),r=e.split(" ").slice(0,-1).join(" ")),t+=`citation_author=${r} ${n};`}),"year"in e&&(t+=`citation_publication_date=${e.year};`);let n=/https?:\/\/arxiv\.org\/pdf\/([0-9]*\.[0-9]*)\.pdf/.exec(e.url);return(n=(n=n||/https?:\/\/arxiv\.org\/abs\/([0-9]*\.[0-9]*)/.exec(e.url))||/arXiv preprint arXiv:([0-9]*\.[0-9]*)/.exec(e.journal))&&n[1]?t+=`citation_arxiv_id=${n[1]};`:("journal"in e&&(t+=`citation_journal_title=${fe(e.journal)};`),"volume"in e&&(t+=`citation_volume=${fe(e.volume)};`),("issue"in e||"number"in e)&&(t+=`citation_number=${fe(e.issue||e.number)};`),t)}function R(e){const t="distill-prerendered-styles";if(!e.getElementById(t)){const n=e.createElement("style");n.id=t,n.type="text/css";const r=e.createTextNode(me);n.appendChild(r);const i=e.head.querySelector("script");e.head.insertBefore(n,i)}} +// Copyright 2018 The Distill Template Authors +function L(e,t){let n='\n \n \n
<nav role="navigation" class="table-of-contents"></nav>\n <h2>Table of contents</h2>\n <ul>';for(const e of t){const t="D-TITLE"==e.parentElement.tagName,r=e.getAttribute("no-toc");if(t||r)continue;const i=e.textContent;let a='<li><a href="#'+e.getAttribute("id")+'">'+i+"</a></li>";"H3"==e.tagName?a="<ul>"+a+"</ul>":a+="<br>",n+=a}n+="</ul></nav>
      ",e.innerHTML=n} +// Copyright 2018 The Distill Template Authors +function O(e){const t=e.querySelector("d-article"),n=e.querySelector("d-toc");if(n){L(n,t.querySelectorAll("h2, h3")),n.setAttribute("prerendered","true")}} +// Copyright 2018 The Distill Template Authors +function q(e){for(var t=e.createTreeWalker(e.body,e.defaultView.NodeFilter.SHOW_TEXT);t.nextNode();){var n=t.currentNode,r=n.nodeValue;r&&_(n)&&(r=D(r=B(r)),n.nodeValue=r)}}function _(e){var t=e.parentElement,n=!!(t&&t.getAttribute&&t.getAttribute("class"))&&(t.getAttribute("class").includes("katex")||t.getAttribute("class").includes("MathJax"));return t&&"SCRIPT"!==t.nodeName&&"STYLE"!==t.nodeName&&"CODE"!==t.nodeName&&"PRE"!==t.nodeName&&"SPAN"!==t.nodeName&&"D-HEADER"!==t.nodeName&&"D-BYLINE"!==t.nodeName&&"D-MATH"!==t.nodeName&&"D-CODE"!==t.nodeName&&"D-BIBLIOGRAPHY"!==t.nodeName&&"D-FOOTER"!==t.nodeName&&"D-APPENDIX"!==t.nodeName&&"D-FRONTMATTER"!==t.nodeName&&"D-TOC"!==t.nodeName&&8!==t.nodeType&&!n} +/*! + * typeset - Typesetting for the web + * @version v0.1.6 + * @link https://github.com/davidmerfield/Typeset.js + * @author David Merfield + */function D(e){var t="\xa0",n=/([\xab\xbf\xa1]) /g,r=/ ([!?:;.,\u203d\xbb])/g;return e=(e=(e=(e=(e=e.replace(/--/g,"\u2014")).replace(/\s*\u2014\s*/g,"\u2009\u2014\u2009")).replace(/\.\.\./g,"\u2026")).replace(n,"$1"+t)).replace(r,t+"$1")}function B(e){return e=(e=(e=(e=(e=e.replace(/(\W|^)"([^\s!?:;.,\u203d\xbb])/g,"$1\u201c$2").replace(/(\u201c[^"]*)"([^"]*$|[^\u201c"]*\u201c)/g,"$1\u201d$2").replace(/([^0-9])"/g,"$1\u201d").replace(/(\W|^)'(\S)/g,"$1\u2018$2").replace(/([a-z])'([a-z])/gi,"$1\u2019$2").replace(/((\u2018[^']*)|[a-z])'([^0-9]|$)/gi,"$1\u2019$3").replace(/(\u2018)([0-9]{2}[^\u2019]*)(\u2018([^0-9]|$)|$|\u2019[a-z])/gi,"\u2019$2$3").replace(/(\B|^)\u2018(?=([^\u2019]*\u2019\b)*([^\u2019\u2018]*\W[\u2019\u2018]\b|[^\u2019\u2018]*$))/gi,"$1\u2019").replace(/'''/g,"\u2034").replace(/("|'')/g,"\u2033").replace(/'/g,"\u2032")).replace(/\\\u201c/,'"')).replace(/\\\u201d/,'"')).replace(/\\\u2019/,"'")).replace(/\\\u2018/,"'")} +// Copyright 2018 The Distill Template Authors +function I(e){const t=e.querySelector('script[src*="template.v2.js"]');t?t.parentNode.removeChild(t):console.debug("FYI: Did not find template tag when trying to remove it. You may not have added it. 
Be aware that our polyfills will add it.");const n=e.createElement("script");n.src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.0.17/webcomponents-loader.js",e.head.insertBefore(n,e.head.firstChild);const r=e.createElement("script");r.innerHTML=ge,e.head.insertBefore(r,e.head.firstChild)} +// Copyright 2018 The Distill Template Authors +function H(e,t,n=document){if(t.size>0){e.style.display="";let r=e.querySelector(".references");if(r)r.innerHTML="";else{const t=n.createElement("style");t.innerHTML=ve,e.appendChild(t);const i=n.createElement("h3");i.id="references",i.textContent="References",e.appendChild(i),(r=n.createElement("ol")).id="references-list",r.className="references",e.appendChild(r)}for(const[e,i]of t){const t=n.createElement("li");t.id=e,t.innerHTML=x(i),r.appendChild(t)}}else e.style.display="none"} +// Copyright 2018 The Distill Template Authors +function P(e,t){const n=e.querySelector("d-citation-list");if(n){H(n,new Map(t.citations.map(e=>[e,t.bibliography.get(e)])),e),n.setAttribute("distill-prerendered","true")}} +// Copyright 2018 The Distill Template Authors +function j(e){const t=e.head,n=t.querySelector("meta[http-equiv]");t.insertBefore(n,t.firstChild);const r=t.querySelector("meta[name=viewport]");t.insertBefore(r,t.firstChild);const i=t.querySelector("meta[charset]");t.insertBefore(i,t.firstChild)} +// Copyright 2018 The Distill Template Authors +function F(e){if(!e.querySelector("distill-header")){const t=e.createElement("distill-header");t.innerHTML=ye,t.setAttribute("distill-prerendered","");const n=e.querySelector("body");n.insertBefore(t,n.firstChild)}} +// Copyright 2018 The Distill Template Authors +function $(e){let t=xe;"undefined"!=typeof e.githubUrl&&(t+='\n
<h3 id="updates-and-corrections">Updates and Corrections</h3>\n <p>',e.githubCompareUpdatesUrl&&(t+=`<a href="${e.githubCompareUpdatesUrl}">View all changes</a> to this article since it was first published.`),t+=`\n If you see mistakes or want to suggest changes, please <a href="${e.githubUrl+"/issues/new"}">create an issue on GitHub</a>. </p>\n `);const n=e.journal;return void 0!==n&&"Distill"===n.title&&(t+=`\n <h3 id="reuse">Reuse</h3>\n <p>Diagrams and text are licensed under Creative Commons Attribution <a href="https://creativecommons.org/licenses/by/4.0/">CC-BY 4.0</a> with the <a class="github" href="${e.githubUrl}">source available on GitHub</a>, unless noted otherwise. The figures that have been reused from other sources don\u2019t fall under this license and can be recognized by a note in their caption: \u201cFigure from \u2026\u201d.</p>\n `),"undefined"!=typeof e.publishedDate&&(t+=`\n <h3 id="citation">Citation</h3>\n <p>For attribution in academic contexts, please cite this work as</p>\n <pre class="citation short">${e.concatenatedAuthors}, "${e.title}", Distill, ${e.publishedYear}.</pre>\n <p>BibTeX citation</p>\n <pre class="citation long">${c(e)}</pre>
      \n `),t} +// Copyright 2018 The Distill Template Authors +function U(e,t){const n=e.querySelector("d-appendix");if(n){if(!n.querySelector("distill-appendix")){const r=e.createElement("distill-appendix");n.appendChild(r),r.innerHTML=$(t)}}else console.warn("No appendix tag found!")} +// Copyright 2018 The Distill Template Authors +function Y(e){if(!e.querySelector("distill-footer")){const t=e.createElement("distill-footer");t.innerHTML=we,e.querySelector("body").appendChild(t)}} +// Copyright 2018 The Distill Template Authors +function V(e,t,n=!0){let r;r=t instanceof ne?t:ne.fromObject(t);for(const[t,i]of ke.entries())n&&console.warn("Running extractor: "+t),i(e,r,n);for(const[t,i]of Me.entries())n&&console.warn("Running transform: "+t),i(e,r,n);e.body.setAttribute("distill-prerendered",""),t instanceof ne||r.assignToObject(t)}function G(e,t,n=!0){for(const[r,i]of Se.entries())n&&console.warn("Running distillify: ",r),i(e,t,n)}function W(e){const t=e.querySelectorAll("script");let n=undefined;for(const e of t){const t=e.src;if(t.includes("template.v1.js"))n=!1;else if(t.includes("template.v2.js"))n=!0;else if(t.includes("template."))throw new Error("Uses distill template, but unknown version?!")}if(n===undefined)throw new Error("Does not seem to use Distill template at all.");return n}t=t&&Object.prototype.hasOwnProperty.call(t,"default")?t["default"]:t; +// Copyright 2018 The Distill Template Authors +const K=["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"],J=["Jan.","Feb.","March","April","May","June","July","Aug.","Sept.","Oct.","Nov.","Dec."],X=e=>e<10?"0"+e:e,Z=function(e){return`${K[e.getDay()].substring(0,3)}, ${X(e.getDate())} ${J[e.getMonth()].substring(0,3)} ${e.getFullYear().toString()} ${e.getUTCHours().toString()}:${e.getUTCMinutes().toString()}:${e.getUTCSeconds().toString()} Z`},Q=function(e){return Array.from(e).reduce((e,[t,n])=>Object.assign(e,{[t]:n}),{})},ee=function(e){const t=new Map;for(var n in e)e.hasOwnProperty(n)&&t.set(n,e[n]);return t};class te{constructor(e){this.name=e.author,this.personalURL=e.authorURL,this.affiliation=e.affiliation,this.affiliationURL=e.affiliationURL,this.affiliations=e.affiliations||[]}get firstName(){const e=this.name.split(" ");return e.slice(0,e.length-1).join(" ")}get lastName(){const e=this.name.split(" ");return e[e.length-1]}}class ne{constructor(){this.title="unnamed article",this.description="",this.authors=[],this.bibliography=new Map,this.bibliographyParsed=!1,this.citations=[],this.citationsCollected=!1,this.journal={},this.katex={},this.doi=undefined,this.publishedDate=undefined}set url(e){this._url=e}get url(){return this._url?this._url:this.distillPath&&this.journal.url?this.journal.url+"/"+this.distillPath:this.journal.url?this.journal.url:void 0}get githubUrl(){return this.githubPath?"https://github.com/"+this.githubPath:undefined}set previewURL(e){this._previewURL=e}get previewURL(){return this._previewURL?this._previewURL:this.url+"/thumbnail.jpg"}get publishedDateRFC(){return Z(this.publishedDate)}get updatedDateRFC(){return Z(this.updatedDate)}get publishedYear(){return this.publishedDate.getFullYear()}get publishedMonth(){return J[this.publishedDate.getMonth()]}get publishedDay(){return this.publishedDate.getDate()}get publishedMonthPadded(){return X(this.publishedDate.getMonth()+1)}get publishedDayPadded(){return X(this.publishedDate.getDate())}get publishedISODateOnly(){return this.publishedDate.toISOString().split("T")[0]}get volume(){const e=this.publishedYear-2015;if(e<1)throw new 
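+// NOTE (illustrative comment, not part of the upstream Distill bundle): the FrontMatter
+// getters here derive journal numbering from the publication date: volume is
+// publishedYear - 2015 (so 2016 is volume 1) and issue is the 1-based month. An article
+// published 2019-03-05 therefore gets volume 4, issue 3; the slug getter joins the first
+// author's lowercased last name, the year, and the first title word, e.g. "doe2019attention".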
Error("Invalid publish date detected during computing volume");return e}get issue(){return this.publishedDate.getMonth()+1}get concatenatedAuthors(){return this.authors.length>2?this.authors[0].lastName+", et al.":2===this.authors.length?this.authors[0].lastName+" & "+this.authors[1].lastName:1===this.authors.length?this.authors[0].lastName:void 0}get bibtexAuthors(){return this.authors.map(e=>e.lastName+", "+e.firstName).join(" and ")}get slug(){let e="";return this.authors.length&&(e+=this.authors[0].lastName.toLowerCase(),e+=this.publishedYear,e+=this.title.split(" ")[0].toLowerCase()),e||"Untitled"}get bibliographyEntries(){return new Map(this.citations.map(e=>{return[e,this.bibliography.get(e)]}))}set bibliography(e){e instanceof Map?this._bibliography=e:"object"==typeof e&&(this._bibliography=ee(e))}get bibliography(){return this._bibliography}static fromObject(e){const t=new ne;return Object.assign(t,e),t}assignToObject(e){Object.assign(e,this),e.bibliography=Q(this.bibliographyEntries),e.url=this.url,e.doi=this.doi,e.githubUrl=this.githubUrl,e.previewURL=this.previewURL,this.publishedDate&&(e.volume=this.volume,e.issue=this.issue,e.publishedDateRFC=this.publishedDateRFC,e.publishedYear=this.publishedYear,e.publishedMonth=this.publishedMonth,e.publishedDay=this.publishedDay,e.publishedMonthPadded=this.publishedMonthPadded,e.publishedDayPadded=this.publishedDayPadded),this.updatedDate&&(e.updatedDateRFC=this.updatedDateRFC),e.concatenatedAuthors=this.concatenatedAuthors,e.bibtexAuthors=this.bibtexAuthors,e.slug=this.slug}}var re=l(function(e,t){!function(e){function t(){this.months=["jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec"],this.notKey=[",","{","}"," ","="],this.pos=0,this.input="",this.entries=new Array,this.currentEntry="",this.setInput=function(e){this.input=e},this.getEntries=function(){return this.entries},this.isWhitespace=function(e){return" "==e||"\r"==e||"\t"==e||"\n"==e},this.match=function(e,t){if(t!=undefined&&null!=t||(t=!0),this.skipWhitespace(t),this.input.substring(this.pos,this.pos+e.length)!=e)throw"Token mismatch, expected "+e+", found "+this.input.substring(this.pos);this.pos+=e.length,this.skipWhitespace(t)},this.tryMatch=function(e,t){return t!=undefined&&null!=t||(t=!0),this.skipWhitespace(t),this.input.substring(this.pos,this.pos+e.length)==e},this.matchAt=function(){for(;this.input.length>this.pos&&"@"!=this.input[this.pos];)this.pos++;return"@"==this.input[this.pos]},this.skipWhitespace=function(e){for(;this.isWhitespace(this.input[this.pos]);)this.pos++;if("%"==this.input[this.pos]&&1==e){for(;"\n"!=this.input[this.pos];)this.pos++;this.skipWhitespace(e)}},this.value_braces=function(){var e=0;this.match("{",!1);for(var t=this.pos,n=!1;;){if(!n)if("}"==this.input[this.pos]){if(!(e>0)){var r=this.pos;return this.match("}",!1),this.input.substring(t,r)}e--}else if("{"==this.input[this.pos])e++;else if(this.pos>=this.input.length-1)throw"Unterminated value";n="\\"==this.input[this.pos]&&0==n,this.pos++}},this.value_comment=function(){for(var e="",t=0;!this.tryMatch("}",!1)||0!=t;){if(e+=this.input[this.pos],"{"==this.input[this.pos]&&t++,"}"==this.input[this.pos]&&t--,this.pos>=this.input.length-1)throw"Unterminated value:"+this.input.substring(start);this.pos++}return e},this.value_quotes=function(){this.match('"',!1);for(var e=this.pos,t=!1;;){if(!t){if('"'==this.input[this.pos]){var n=this.pos;return this.match('"',!1),this.input.substring(e,n)}if(this.pos>=this.input.length-1)throw"Unterminated 
value:"+this.input.substring(e)}t="\\"==this.input[this.pos]&&0==t,this.pos++}},this.single_value=function(){var e=this.pos;if(this.tryMatch("{"))return this.value_braces();if(this.tryMatch('"'))return this.value_quotes();var t=this.key();if(t.match("^[0-9]+$"))return t;if(this.months.indexOf(t.toLowerCase())>=0)return t.toLowerCase();throw"Value expected:"+this.input.substring(e)+" for key: "+t},this.value=function(){var e=[];for(e.push(this.single_value());this.tryMatch("#");)this.match("#"),e.push(this.single_value());return e.join("")},this.key=function(){for(var e=this.pos;;){if(this.pos>=this.input.length)throw"Runaway key";if(this.notKey.indexOf(this.input[this.pos])>=0)return this.input.substring(e,this.pos);this.pos++}},this.key_equals_value=function(){var e=this.key();if(this.tryMatch("="))return this.match("="),[e,this.value()];throw"... = value expected, equals sign missing:"+this.input.substring(this.pos)},this.key_value_list=function(){var e=this.key_equals_value();for(this.currentEntry.entryTags={},this.currentEntry.entryTags[e[0]]=e[1];this.tryMatch(",")&&(this.match(","),!this.tryMatch("}"));)e=this.key_equals_value(),this.currentEntry.entryTags[e[0]]=e[1]},this.entry_body=function(e){this.currentEntry={},this.currentEntry.citationKey=this.key(),this.currentEntry.entryType=e.substring(1),this.match(","),this.key_value_list(),this.entries.push(this.currentEntry)},this.directive=function(){return this.match("@"),"@"+this.key()},this.preamble=function(){this.currentEntry={},this.currentEntry.entryType="PREAMBLE",this.currentEntry.entry=this.value_comment(),this.entries.push(this.currentEntry)},this.comment=function(){this.currentEntry={},this.currentEntry.entryType="COMMENT",this.currentEntry.entry=this.value_comment(),this.entries.push(this.currentEntry)},this.entry=function(e){this.entry_body(e)},this.bibtex=function(){for(;this.matchAt();){var e=this.directive();this.match("{"),"@STRING"==e?this.string():"@PREAMBLE"==e?this.preamble():"@COMMENT"==e?this.comment():this.entry(e),this.match("}")}}}e.toJSON=function(e){var n=new t;return n.setInput(e),n.bibtex(),n.entries},e.toBibtex=function(e){var t="";for(var n in e){if(t+="@"+e[n].entryType,t+="{",e[n].citationKey&&(t+=e[n].citationKey+", "),e[n].entry&&(t+=e[n].entry),e[n].entryTags){var r="";for(var i in e[n].entryTags)0!=r.length&&(r+=", "),r+=i+"= {"+e[n].entryTags[i]+"}";t+=r}t+="}\n\n"}return t}}(t)}),ie=s(l(function(e){var t;t=function(){return function e(t,n,r){function i(s,l){if(!n[s]){if(!t[s]){var u="function"==typeof o&&o;if(!l&&u)return u(s,!0);if(a)return a(s,!0);var d=new Error("Cannot find module '"+s+"'");throw d.code="MODULE_NOT_FOUND",d}var c=n[s]={exports:{}};t[s][0].call(c.exports,function(e){var n=t[s][1][e];return i(n||e)},c,c.exports,e,t,n,r)}return n[s].exports}for(var a="function"==typeof o&&o,s=0;s=0;--d)if("#"===(n=r[d]).text){if(0===d)throw new s["default"]("Incomplete placeholder at end of macro body",n);if("#"===(n=r[--d]).text)r.splice(d+1,1);else{if(!/^[1-9]$/.test(n.text))throw new s["default"]("Not a valid argument number",n);r.splice.apply(r,[d,2].concat(u[n.text-1]))}}}this.stack=this.stack.concat(r)}}},{key:"get",value:function(e){this.discardedWhiteSpace=[];var t=this.nextToken();if(e)for(;" "===t.text;)this.discardedWhiteSpace.push(t),t=this.nextToken();return 
t}},{key:"unget",value:function(e){for(this.stack.push(e);0!==this.discardedWhiteSpace.length;)this.stack.push(this.discardedWhiteSpace.pop())}}]),e}();t.exports=u},{"./Lexer":26,"./ParseError":29,"./macros":44,"babel-runtime/helpers/classCallCheck":4,"babel-runtime/helpers/createClass":5,"object-assign":25}],28:[function(e,t){function n(e){return e&&e.__esModule?e:{"default":e}}var r=n(e("babel-runtime/helpers/classCallCheck")),i=n(e("babel-runtime/helpers/createClass")),a=n(e("./fontMetrics")),o=6,s=[[1,1,1],[2,1,1],[3,1,1],[4,2,1],[5,2,1],[6,3,1],[7,4,2],[8,6,3],[9,7,6],[10,8,7],[11,10,9]],l=[.5,.6,.7,.8,.9,1,1.2,1.44,1.728,2.074,2.488],u=function(e,t){return t.size<2?e:s[e-1][t.size-1]},d=function(){function e(t){(0,r["default"])(this,e),this.style=t.style,this.color=t.color,this.size=t.size||o,this.textSize=t.textSize||this.size,this.phantom=t.phantom,this.font=t.font,this.sizeMultiplier=l[this.size-1],this._fontMetrics=null}return(0,i["default"])(e,[{key:"extend",value:function(t){var n={style:this.style,size:this.size,textSize:this.textSize,color:this.color,phantom:this.phantom,font:this.font};for(var r in t)t.hasOwnProperty(r)&&(n[r]=t[r]);return new e(n)}},{key:"havingStyle",value:function(e){return this.style===e?this:this.extend({style:e,size:u(this.textSize,e)})}},{key:"havingCrampedStyle",value:function(){return this.havingStyle(this.style.cramp())}},{key:"havingSize",value:function(e){return this.size===e&&this.textSize===e?this:this.extend({style:this.style.text(),size:e,textSize:e})}},{key:"havingBaseStyle",value:function(e){e=e||this.style.text();var t=u(o,e);return this.size===t&&this.textSize===o&&this.style===e?this:this.extend({style:e,size:t,baseSize:o})}},{key:"withColor",value:function(e){return this.extend({color:e})}},{key:"withPhantom",value:function(){return this.extend({phantom:!0})}},{key:"withFont",value:function(e){return this.extend({font:e||this.font})}},{key:"sizingClasses",value:function(e){return e.size!==this.size?["sizing","reset-size"+e.size,"size"+this.size]:[]}},{key:"baseSizingClasses",value:function(){return this.size!==o?["sizing","reset-size"+this.size,"size"+o]:[]}},{key:"fontMetrics",value:function(){return this._fontMetrics||(this._fontMetrics=a["default"].getFontMetrics(this.size)),this._fontMetrics}},{key:"getColor",value:function(){return 
this.phantom?"transparent":e.colorMap[this.color]||this.color}}]),e}();d.colorMap={"katex-blue":"#6495ed","katex-orange":"#ffa500","katex-pink":"#ff00af","katex-red":"#df0030","katex-green":"#28ae7b","katex-gray":"gray","katex-purple":"#9d38bd","katex-blueA":"#ccfaff","katex-blueB":"#80f6ff","katex-blueC":"#63d9ea","katex-blueD":"#11accd","katex-blueE":"#0c7f99","katex-tealA":"#94fff5","katex-tealB":"#26edd5","katex-tealC":"#01d1c1","katex-tealD":"#01a995","katex-tealE":"#208170","katex-greenA":"#b6ffb0","katex-greenB":"#8af281","katex-greenC":"#74cf70","katex-greenD":"#1fab54","katex-greenE":"#0d923f","katex-goldA":"#ffd0a9","katex-goldB":"#ffbb71","katex-goldC":"#ff9c39","katex-goldD":"#e07d10","katex-goldE":"#a75a05","katex-redA":"#fca9a9","katex-redB":"#ff8482","katex-redC":"#f9685d","katex-redD":"#e84d39","katex-redE":"#bc2612","katex-maroonA":"#ffbde0","katex-maroonB":"#ff92c6","katex-maroonC":"#ed5fa6","katex-maroonD":"#ca337c","katex-maroonE":"#9e034e","katex-purpleA":"#ddd7ff","katex-purpleB":"#c6b9fc","katex-purpleC":"#aa87ff","katex-purpleD":"#7854ab","katex-purpleE":"#543b78","katex-mintA":"#f5f9e8","katex-mintB":"#edf2df","katex-mintC":"#e0e5cc","katex-grayA":"#f6f7f7","katex-grayB":"#f0f1f2","katex-grayC":"#e3e5e6","katex-grayD":"#d6d8da","katex-grayE":"#babec2","katex-grayF":"#888d93","katex-grayG":"#626569","katex-grayH":"#3b3e40","katex-grayI":"#21242c","katex-kaBlue":"#314453","katex-kaGreen":"#71B307"},d.BASESIZE=o,t.exports=d},{"./fontMetrics":41,"babel-runtime/helpers/classCallCheck":4,"babel-runtime/helpers/createClass":5}],29:[function(e,t){function n(e){return e&&e.__esModule?e:{"default":e}}var r=n(e("babel-runtime/helpers/classCallCheck")),i=function a(e,t){(0,r["default"])(this,a);var n="KaTeX parse error: "+e,i=void 0,o=void 0;if(t&&t.lexer&&t.start<=t.end){var s=t.lexer.input;i=t.start,o=t.end,i===s.length?n+=" at end of input: ":n+=" at position "+(i+1)+": ";var l=s.slice(i,o).replace(/[^]/g,"$&\u0332");n+=(i>15?"\u2026"+s.slice(i-15,i):s.slice(0,i))+l+(o+15e.SUPSUB_GREEDINESS)return this.parseFunction(i);throw new f["default"]("Got function '"+i.result+"' with no arguments as "+t,n)}return i.result}if(this.settings.throwOnError||"\\"!==this.nextToken.text[0])throw new f["default"]("Expected group after '"+r+"'",n);return this.handleUnsupportedCmd()}},{key:"handleUnsupportedCmd",value:function(){for(var e=this.nextToken.text,t=[],n=0;ni))throw new f["default"]("Got function '"+c.result+"' as argument to '"+e+"'",u);h=this.parseFunction(c)}else h=c.result;s.push(h),a.push(this.pos)}return s.push(a),s}},{key:"parseGroupOfType",value:function(e,t){var n=this.mode;if("original"===e&&(e=n),"color"===e)return this.parseColorGroup(t);if("size"===e)return this.parseSizeGroup(t);this.switchMode(e),"text"===e&&this.consumeSpaces();var r=this.parseGroup(t);return this.switchMode(n),r}},{key:"consumeSpaces",value:function(){for(;" "===this.nextToken.text;)this.consume()}},{key:"parseStringGroup",value:function(e,t){if(t&&"["!==this.nextToken.text)return null;var n=this.mode;this.mode="text",this.expect(t?"[":"{");for(var r="",i=this.nextToken,a=i;this.nextToken.text!==(t?"]":"}");){if("EOF"===this.nextToken.text)throw new f["default"]("Unexpected end of input in "+e,i.range(this.nextToken,r));r+=(a=this.nextToken).text,this.consume()}return this.mode=n,this.expect(t?"]":"}"),i.range(a,r)}},{key:"parseRegexGroup",value:function(e,t){var n=this.mode;this.mode="text";for(var 
r=this.nextToken,i=r,a="";"EOF"!==this.nextToken.text&&e.test(a+this.nextToken.text);)a+=(i=this.nextToken).text,this.consume();if(""===a)throw new f["default"]("Invalid "+t+": '"+r.text+"'",r);return this.mode=n,r.range(i,a)}},{key:"parseColorGroup",value:function(e){var t=this.parseStringGroup("color",e);if(!t)return null;var n=/^(#[a-z0-9]+|[a-z]+)$/i.exec(t.text);if(!n)throw new f["default"]("Invalid color: '"+t.text+"'",t) +;return new r(new p["default"]("color",n[0],this.mode),!1)}},{key:"parseSizeGroup",value:function(e){var t=void 0;if(!(t=e||"{"===this.nextToken.text?this.parseStringGroup("size",e):this.parseRegexGroup(/^[-+]? *(?:$|\d+|\d+\.\d*|\.\d*) *[a-z]{0,2} *$/,"size")))return null;var n=/([-+]?) *(\d+(?:\.\d*)?|\.\d+) *([a-z]{2})/.exec(t.text);if(!n)throw new f["default"]("Invalid size: '"+t.text+"'",t);var i={number:+(n[1]+n[2]),unit:n[3]};if(!c["default"].validUnit(i))throw new f["default"]("Invalid unit: '"+i.unit+"'",t);return new r(new p["default"]("size",i,this.mode),!1)}},{key:"parseGroup",value:function(e){var t=this.nextToken;if(this.nextToken.text===(e?"[":"{")){this.consume();var n=this.parseExpression(!1,e?"]":null),i=this.nextToken;return this.expect(e?"]":"}"),"text"===this.mode&&this.formLigatures(n),new r(new p["default"]("ordgroup",n,this.mode,t,i),!1)}return e?null:this.parseSymbol()}},{key:"formLigatures",value:function(e){for(var t=e.length-1,n=0;n=2}}]),e}(),o=0,s=1,l=2,u=3,d=4,c=5,h=6,p=7,f=[new a(o,0,!1),new a(s,0,!0),new a(l,1,!1),new a(u,1,!0),new a(d,2,!1),new a(c,2,!0),new a(h,3,!1),new a(p,3,!0)],m=[d,c,d,c,h,p,h,p],g=[c,c,c,c,p,p,p,p],v=[l,u,d,c,h,p,h,p],b=[u,u,c,c,p,p,p,p],y=[s,s,u,u,c,c,p,p],x=[o,s,l,u,l,u,l,u];t.exports={DISPLAY:f[o],TEXT:f[l],SCRIPT:f[d],SCRIPTSCRIPT:f[h]}},{"babel-runtime/helpers/classCallCheck":4,"babel-runtime/helpers/createClass":5}],34:[function(e,t){function n(e){return e&&e.__esModule?e:{"default":e}}var r=n(e("./domTree")),i=n(e("./fontMetrics")),a=n(e("./symbols")),o=n(e("./utils")),s=["\\imath","\\jmath","\\pounds"],l=function(e,t,n){return a["default"][n][e]&&a["default"][n][e].replace&&(e=a["default"][n][e].replace),{value:e,metrics:i["default"].getCharacterMetrics(e,t)}},u=function(e,t,n,i,a){var o=l(e,t,n),s=o.metrics;e=o.value;var u=void 0;if(s){var d=s.italic;"text"===n&&(d=0),u=new r["default"].symbolNode(e,s.height,s.depth,d,s.skew,a)}else"undefined"!=typeof console&&console.warn("No character metrics for '"+e+"' in style '"+t+"'"),u=new r["default"].symbolNode(e,0,0,0,0,a);return i&&(u.maxFontSize=i.sizeMultiplier,i.style.isTight()&&u.classes.push("mtight"),i.getColor()&&(u.style.color=i.getColor())),u},d=function(e,t,n,r){return"\\"===e||"main"===a["default"][t][e].font?u(e,"Main-Regular",t,n,r):u(e,"AMS-Regular",t,n,r.concat(["amsrm"]))},c=function(e,t,n,r,i){if("mathord"===i){var o=h(e);return u(e,o.fontName,t,n,r.concat([o.fontClass]))}if("textord"===i)return"ams"===(a["default"][t][e]&&a["default"][t][e].font)?u(e,"AMS-Regular",t,n,r.concat(["amsrm"])):u(e,"Main-Regular",t,n,r.concat(["mathrm"]));throw new Error("unexpected type: "+i+" in mathDefault")},h=function(e){return/[0-9]/.test(e.charAt(0))||o["default"].contains(s,e)?{fontName:"Main-Italic",fontClass:"mainit"}:{fontName:"Math-Italic",fontClass:"mathit"}},p=function(e,t,n){var r=e.mode,i=e.value,a=["mord"],d=t.font;if(d){var p=void 0;return p="mathit"===d||o["default"].contains(s,i)?h(i):x[d],l(i,p.fontName,r).metrics?u(i,p.fontName,r,t,a.concat([p.fontClass||d])):c(i,r,t,a,n)}return c(i,r,t,a,n)},f=function(e){var 
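+// NOTE (illustrative comment, not part of the upstream KaTeX source): the helper h just
+// above chooses the font for a math ordinal: tokens whose first character is a digit, plus
+// \imath, \jmath and \pounds, are set in Main-Italic (class "mainit"); everything else
+// uses Math-Italic (class "mathit").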
t=0,n=0,r=0;if(e.children)for(var i=0;it&&(t=e.children[i].height),e.children[i].depth>n&&(n=e.children[i].depth),e.children[i].maxFontSize>r&&(r=e.children[i].maxFontSize);e.height=t,e.depth=n,e.maxFontSize=r},m=function(e,t,n){var i=new r["default"].span(e,t,n);return f(i),i},g=function(e,t){e.children=t.concat(e.children),f(e)},v=function(e){var t=new r["default"].documentFragment(e);return f(t),t},b=function(e,t,n){var i=void 0,a=void 0,o=void 0;if("individualShift"===t){var s=e;for(e=[s[0]],a=i=-s[0].shift-s[0].elem.depth,o=1;o0&&(c+=b,h-=b)}var y=[{type:"elem",elem:i,shift:h,marginRight:m},{type:"elem",elem:r,shift:-c,marginRight:m}];n instanceof d["default"].symbolNode&&(y[0].marginLeft=-n.italic+"em"),g=l["default"].makeVList(y,"individualShift",null,t)}else c=Math.max(c,p,r.depth+.25*a.xHeight),g=l["default"].makeVList([{type:"elem",elem:r,marginRight:m}],"shift",-c,t);else{h=Math.max(h,a.sub1,i.height-.8*a.xHeight);var k=[{type:"elem",elem:i,marginRight:m}];n instanceof d["default"].symbolNode&&(k[0].marginLeft=-n.italic+"em"),g=l["default"].makeVList(k,"shift",h,t)}var S=x(n)||"mord";return(0,s.makeSpan)([S],[n,(0,s.makeSpan)(["msupsub"],[g])],t)},genfrac:function(e,t){var n=t.style;"display"===e.value.size?n=o["default"].DISPLAY:"text"===e.value.size&&(n=o["default"].TEXT);var r=n.fracNum(),i=n.fracDen(),a=void 0;a=t.havingStyle(r);var d=C(e.value.numer,a,t);a=t.havingStyle(i);var c=C(e.value.denom,a,t),h=void 0,p=void 0,f=void 0;e.value.hasBarLine?(p=(h=A("frac-line",t)).height,f=h.height):(h=null,p=0,f=t.fontMetrics().defaultRuleThickness);var m=void 0,g=void 0,v=void 0;n.size===o["default"].DISPLAY.size?(m=t.fontMetrics().num1,g=p>0?3*f:7*f,v=t.fontMetrics().denom1):(p>0?(m=t.fontMetrics().num2,g=f):(m=t.fontMetrics().num3,g=3*f),v=t.fontMetrics().denom2);var b=void 0;if(0===p){var y=m-d.depth-(c.height-v);y0&&(k<(z+=b)&&(k=z),z=0),e.value.addJot&&(k+=m),M.height=w,M.depth=k,y+=w,M.pos=y,y+=k+z,u[n]=M}var A=y/2+t.fontMetrics().axisHeight,T=e.value.cols||[],N=[],E=void 0,R=void 0;for(r=0,R=0;r=o)){var _=void 0;(r>0||e.value.hskipBeforeAndAfter)&&0!==(_=h["default"].deflt(L.pregap,p))&&((E=(0,s.makeSpan)(["arraycolsep"],[])).style.width=_+"em",N.push(E));var D=[];for(n=0;nn.height+n.depth+a&&(a=(a+f-n.height-n.depth)/2);var m=h.height-n.height-a-p,g=void 0;if(0===n.height&&0===n.depth?g=(0,s.makeSpan)():(n.style.paddingLeft=h.surdWidth+"em",(g=l["default"].makeVList([{type:"elem",elem:n},{type:"kern",size:-(n.height+m)},{type:"elem",elem:h},{type:"kern",size:p}],"firstBaseline",null,t)).children[0].children[0].classes.push("svg-align")),e.value.index){var v=t.havingStyle(o["default"].SCRIPTSCRIPT),b=C(e.value.index,v,t),y=.6*(g.height-g.depth),x=l["default"].makeVList([{type:"elem",elem:b}],"shift",-y,t),w=(0,s.makeSpan)(["root"],[x]);return(0,s.makeSpan)(["mord","sqrt"],[w,g],t)}return(0,s.makeSpan)(["mord","sqrt"],[g],t)},z.sizing=function(e,t){var n=t.havingSize(e.value.size);return r(e.value.value,n,t)},z.styling=function(e,t){var n={display:o["default"].DISPLAY,text:o["default"].TEXT,script:o["default"].SCRIPT,scriptscript:o["default"].SCRIPTSCRIPT}[e.value.style],i=t.havingStyle(n);return r(e.value.value,i,t)},z.font=function(e,t){var n=e.value.font;return C(e.value.body,t.withFont(n))},z.delimsizing=function(e,t){var n=e.value.value;return"."===n?(0,s.makeSpan)([e.value.mclass]):u["default"].sizedDelim(n,e.value.size,t,e.mode,[e.value.mclass])},z.leftright=function(e,t){for(var n=y(e.value.body,t,!0),r=0,i=0,a=!1,o=0;o0&&(h.style.width="calc(100% - 
"+2*o+"em)",h.style.marginLeft=2*o+"em")}else{var f=l["default"].makeSymbol(e.value.label,"Main-Regular",e.mode,t);f.italic=0;var m=null;"\\vec"===e.value.label?m="accent-vec":"\\H"===e.value.label&&(m="accent-hungarian"),c=(0,s.makeSpan)([],[f]),(c=(0,s.makeSpan)(["accent-body",m],[c])).style.marginLeft=2*o+"em",c=l["default"].makeVList([{type:"elem",elem:a},{type:"kern",size:-d},{type:"elem",elem:c}],"firstBaseline",null,t)}var g=(0,s.makeSpan)(["mord","accent"],[c],t);return r?(r.children[0]=g,r.height=Math.max(g.height,r.height),r.classes[0]="mord",r):g},z.horizBrace=function(e,t){var n=t.style,r="supsub"===e.type,i=void 0,a=void 0;r&&(e.value.sup?(a=t.havingStyle(n.sup()),i=C(e.value.sup,a,t)):(a=t.havingStyle(n.sub()),i=C(e.value.sub,a,t)),e=e.value.base);var u=C(e.value.base,t.havingBaseStyle(o["default"].DISPLAY)),d=p["default"].svgSpan(e,t),c=void 0;if(e.value.isOver?(c=l["default"].makeVList([{type:"elem",elem:u},{type:"kern",size:.1},{type:"elem",elem:d}],"firstBaseline",null,t)).children[0].children[0].children[1].classes.push("svg-align"):(c=l["default"].makeVList([{type:"elem",elem:d},{type:"kern",size:.1},{type:"elem",elem:u}],"bottom",u.depth+.1+d.height,t)).children[0].children[0].children[0].classes.push("svg-align"),r){var h=(0,s.makeSpan)(["mord",e.value.isOver?"mover":"munder"],[c],t);c=e.value.isOver?l["default"].makeVList([{type:"elem",elem:h},{type:"kern",size:.2},{type:"elem",elem:i}],"firstBaseline",null,t):l["default"].makeVList([{type:"elem",elem:i},{type:"kern",size:.2},{type:"elem",elem:h}],"bottom",h.depth+.2+i.height,t)}return(0,s.makeSpan)(["mord",e.value.isOver?"mover":"munder"],[c],t)},z.accentUnder=function(e,t){var n=C(e.value.body,t),r=p["default"].svgSpan(e,t),i=/tilde/.test(e.value.label)?.12:0,a=l["default"].makeVList([{type:"elem",elem:r},{type:"kern",size:i},{type:"elem",elem:n}],"bottom",r.height+i,t);return a.children[0].children[0].children[0].classes.push("svg-align"),(0,s.makeSpan)(["mord","accentunder"],[a],t)},z.enclose=function(e,t){var n=C(e.value.body,t),r=e.value.label.substr(1),i=t.sizeMultiplier,a=void 0,o=0,u=0;if("sout"===r)(a=(0,s.makeSpan)(["stretchy","sout"])).height=t.fontMetrics().defaultRuleThickness/i,u=-.5*t.fontMetrics().xHeight;else{n.classes.push("fbox"===r?"boxpad":"cancel-pad");var d=M(e.value.body);o="fbox"===r?.34:d?.2:0,u=n.depth+o,a=p["default"].encloseSpan(n,r,o,t)}var c=l["default"].makeVList([{type:"elem",elem:n,shift:0},{type:"elem",elem:a,shift:u}],"individualShift",null,t);return"fbox"!==r&&c.children[0].children[0].children[1].classes.push("svg-align"),/cancel/.test(r)?(0,s.makeSpan)(["mord","cancel-lap"],[c],t):(0,s.makeSpan)(["mord"],[c],t)},z.xArrow=function(e,t){var n=t.style,r=t.havingStyle(n.sup()),i=C(e.value.body,r,t);i.classes.push("x-arrow-pad");var a=void 0;e.value.below&&(r=t.havingStyle(n.sub()),(a=C(e.value.below,r,t)).classes.push("x-arrow-pad"));var o=p["default"].svgSpan(e,t),u=-t.fontMetrics().axisHeight+o.depth,d=-t.fontMetrics().axisHeight-o.height-.111,c=void 0;if(e.value.below){var h=-t.fontMetrics().axisHeight+a.height+o.height+.111;c=l["default"].makeVList([{type:"elem",elem:i,shift:d},{type:"elem",elem:o,shift:u},{type:"elem",elem:a,shift:h}],"individualShift",null,t)}else c=l["default"].makeVList([{type:"elem",elem:i,shift:d},{type:"elem",elem:o,shift:u}],"individualShift",null,t);return c.children[0].children[0].children[1].classes.push("svg-align"),(0,s.makeSpan)(["mrel","x-arrow"],[c],t)},z.phantom=function(e,t){var n=y(e.value.value,t.withPhantom(),!1);return new 
l["default"].makeFragment(n)},z.mclass=function(e,t){var n=y(e.value.value,t,!0);return(0,s.makeSpan)([e.value.mclass],n,t)};var C=function(e,t,n){if(!e)return(0,s.makeSpan)();if(z[e.type]){var r=z[e.type](e,t);if(n&&t.size!==n.size){r=(0,s.makeSpan)(t.sizingClasses(n),[r],t);var i=t.sizeMultiplier/n.sizeMultiplier;r.height*=i,r.depth*=i}return r}throw new a["default"]("Got group of unknown type: '"+e.type+"'")},T=function(e,t){e=JSON.parse((0,i["default"])(e));var n=y(e,t,!0),r=(0,s.makeSpan)(["base"],n,t),a=(0,s.makeSpan)(["strut"]),o=(0,s.makeSpan)(["strut","bottom"]);a.style.height=r.height+"em",o.style.height=r.height+r.depth+"em",o.style.verticalAlign=-r.depth+"em";var l=(0,s.makeSpan)(["katex-html"],[a,o,r]);return l.setAttribute("aria-hidden","true"),l};t.exports=T},{"./ParseError":29,"./Style":33,"./buildCommon":34,"./delimiter":38,"./domTree":39,"./stretchy":47,"./units":50,"./utils":51,"babel-runtime/core-js/json/stringify":2}],36:[function(e,t){function n(e){return e&&e.__esModule?e:{"default":e}}var r=e("./buildCommon"),i=n(r),a=n(e("./fontMetrics")),o=n(e("./mathMLTree")),s=n(e("./ParseError")),l=n(e("./Style")),u=n(e("./symbols")),d=n(e("./utils")),c=n(e("./stretchy")),h=function(e,t){return u["default"][t][e]&&u["default"][t][e].replace&&(e=u["default"][t][e].replace),new o["default"].TextNode(e)},p=function(e,t){var n=t.font;if(!n)return null;var i=e.mode;if("mathit"===n)return"italic";var o=e.value;if(d["default"].contains(["\\imath","\\jmath"],o))return null;u["default"][i][o]&&u["default"][i][o].replace&&(o=u["default"][i][o].replace);var s=r.fontMap[n].fontName;return a["default"].getCharacterMetrics(o,s)?r.fontMap[t.font].variant:null},f={},m={mi:"italic",mn:"normal",mtext:"normal"};f.mathord=function(e,t){var n=new o["default"].MathNode("mi",[h(e.value,e.mode)]),r=p(e,t)||"italic";return r!==m[n.type]&&n.setAttribute("mathvariant",r),n},f.textord=function(e,t){var n=h(e.value,e.mode),r=p(e,t)||"normal",i=void 0;return i="text"===e.mode?new o["default"].MathNode("mtext",[n]):/[0-9]/.test(e.value)?new o["default"].MathNode("mn",[n]):"\\prime"===e.value?new o["default"].MathNode("mo",[n]):new o["default"].MathNode("mi",[n]),r!==m[i.type]&&i.setAttribute("mathvariant",r),i},f.bin=function(e){return new o["default"].MathNode("mo",[h(e.value,e.mode)])},f.rel=function(e){return new o["default"].MathNode("mo",[h(e.value,e.mode)])},f.open=function(e){return new o["default"].MathNode("mo",[h(e.value,e.mode)])},f.close=function(e){return new o["default"].MathNode("mo",[h(e.value,e.mode)])},f.inner=function(e){return new o["default"].MathNode("mo",[h(e.value,e.mode)])},f.punct=function(e){var t=new o["default"].MathNode("mo",[h(e.value,e.mode)]);return t.setAttribute("separator","true"),t},f.ordgroup=function(e,t){var n=g(e.value,t);return new o["default"].MathNode("mrow",n)},f.text=function(e,t){for(var n=e.value.body,r=[],i=null,a=0;a2&&arguments[2]!==undefined&&arguments[2];if(!e)return new o["default"].MathNode("mrow");if(f[e.type]){var r=f[e.type](e,t);return n&&"mrow"===r.type&&1===r.children.length?r.children[0]:r}throw new s["default"]("Got group of unknown type: '"+e.type+"'")},b=function(e,t,n){var i=g(e,n),a=new o["default"].MathNode("mrow",i),s=new o["default"].MathNode("annotation",[new o["default"].TextNode(t)]);s.setAttribute("encoding","application/x-tex");var l=new o["default"].MathNode("semantics",[a,s]),u=new 
o["default"].MathNode("math",[l]);return(0,r.makeSpan)(["katex-mathml"],[u])};t.exports=b},{"./ParseError":29,"./Style":33,"./buildCommon":34,"./fontMetrics":41,"./mathMLTree":45,"./stretchy":47,"./symbols":48,"./utils":51}],37:[function(e,t){function n(e){return e&&e.__esModule?e:{"default":e}}var r=n(e("./buildHTML")),i=n(e("./buildMathML")),a=e("./buildCommon"),o=n(e("./Options")),s=n(e("./Settings")),l=n(e("./Style")),u=function(e,t,n){n=n||new s["default"]({});var u=l["default"].TEXT;n.displayMode&&(u=l["default"].DISPLAY);var d=new o["default"]({style:u}),c=(0,i["default"])(e,t,d),h=(0,r["default"])(e,d),p=(0,a.makeSpan)(["katex"],[c,h]);return n.displayMode?(0,a.makeSpan)(["katex-display"],[p]):p};t.exports=u},{"./Options":28,"./Settings":32,"./Style":33,"./buildCommon":34,"./buildHTML":35,"./buildMathML":36}],38:[function(e,t){function n(e){return e&&e.__esModule?e:{"default":e}}var r=n(e("./ParseError")),i=n(e("./Style")),a=e("./buildCommon"),o=n(a),s=n(e("./fontMetrics")),l=n(e("./symbols")),u=n(e("./utils")),d=function(e,t){return l["default"].math[e]&&l["default"].math[e].replace?s["default"].getCharacterMetrics(l["default"].math[e].replace,t):s["default"].getCharacterMetrics(e,t)},c=function(e,t,n,r){var i=n.havingBaseStyle(t),o=(0,a.makeSpan)((r||[]).concat(i.sizingClasses(n)),[e],n);return o.delimSizeMultiplier=i.sizeMultiplier/n.sizeMultiplier,o.height*=o.delimSizeMultiplier,o.depth*=o.delimSizeMultiplier,o.maxFontSize=i.sizeMultiplier,o},h=function(e,t,n){var r=t.havingBaseStyle(n),i=(1-t.sizeMultiplier/r.sizeMultiplier)*t.fontMetrics().axisHeight;e.classes.push("delimcenter"),e.style.top=i+"em",e.height-=i,e.depth+=i},p=function(e,t,n,r,i,a){var s=o["default"].makeSymbol(e,"Main-Regular",i,r),l=c(s,t,r,a);return n&&h(l,r,t),l},f=function(e,t,n,r){return o["default"].makeSymbol(e,"Size"+t+"-Regular",n,r)},m=function(e,t,n,r,o,s){var l=f(e,t,o,r),u=c((0,a.makeSpan)(["delimsizing","size"+t],[l],r),i["default"].TEXT,r,s);return n&&h(u,r,i["default"].TEXT),u},g=function(e,t,n){var r=void 0;return"Size1-Regular"===t?r="delim-size1":"Size4-Regular"===t&&(r="delim-size4"),{type:"elem",elem:(0,a.makeSpan)(["delimsizinginner",r],[(0,a.makeSpan)([],[o["default"].makeSymbol(e,t,n)])])}},v=function(e,t,n,r,s,l){var u=void 0,h=void 0,p=void 0,f=void 0;u=p=f=e,h=null;var m="Size1-Regular";"\\uparrow"===e?p=f="\u23d0":"\\Uparrow"===e?p=f="\u2016":"\\downarrow"===e?u=p="\u23d0":"\\Downarrow"===e?u=p="\u2016":"\\updownarrow"===e?(u="\\uparrow",p="\u23d0",f="\\downarrow"):"\\Updownarrow"===e?(u="\\Uparrow",p="\u2016",f="\\Downarrow"):"["===e||"\\lbrack"===e?(u="\u23a1",p="\u23a2",f="\u23a3",m="Size4-Regular"):"]"===e||"\\rbrack"===e?(u="\u23a4",p="\u23a5",f="\u23a6",m="Size4-Regular"):"\\lfloor"===e?(p=u="\u23a2",f="\u23a3",m="Size4-Regular"):"\\lceil"===e?(u="\u23a1",p=f="\u23a2",m="Size4-Regular"):"\\rfloor"===e?(p=u="\u23a5",f="\u23a6",m="Size4-Regular"):"\\rceil"===e?(u="\u23a4",p=f="\u23a5",m="Size4-Regular"):"("===e?(u="\u239b",p="\u239c",f="\u239d",m="Size4-Regular"):")"===e?(u="\u239e",p="\u239f",f="\u23a0",m="Size4-Regular"):"\\{"===e||"\\lbrace"===e?(u="\u23a7",h="\u23a8",f="\u23a9",p="\u23aa",m="Size4-Regular"):"\\}"===e||"\\rbrace"===e?(u="\u23ab",h="\u23ac",f="\u23ad",p="\u23aa",m="Size4-Regular"):"\\lgroup"===e?(u="\u23a7",f="\u23a9",p="\u23aa",m="Size4-Regular"):"\\rgroup"===e?(u="\u23ab",f="\u23ad",p="\u23aa",m="Size4-Regular"):"\\lmoustache"===e?(u="\u23a7",f="\u23ad",p="\u23aa",m="Size4-Regular"):"\\rmoustache"===e&&(u="\u23ab",f="\u23a9",p="\u23aa",m="Size4-Regular");var 
v=d(u,m),b=v.height+v.depth,y=d(p,m),x=y.height+y.depth,w=d(f,m),k=w.height+w.depth,M=0,S=1;if(null!==h){var z=d(h,m);M=z.height+z.depth,S=2}var A=b+k+M,C=Math.ceil((t-A)/(S*x)),T=A+C*S*x,N=r.fontMetrics().axisHeight;n&&(N*=r.sizeMultiplier);var E=T/2-N,R=[];if(R.push(g(f,m,s)),null===h)for(var L=0;L",1:"",2:"",3:"",4:"",tall:"l-4 4-4 4c-.667.667-2 1.5-4 2.5s-4.167 1.833-6.5 2.5-5.5 1-9.5 1h\n-12l-28-84c-16.667-52-96.667 -294.333-240-727l-212 -643 -85 170c-4-3.333-8.333\n-7.667-13 -13l-13-13l77-155 77-156c66 199.333 139 419.667 219 661 l218 661z\nM702 0H400000v40H742z'/>"},y=function(e,t,n){var r=o["default"].makeSpan([],[],n),i=n.sizeMultiplier;if("small"===t.type)i=n.havingBaseStyle(t.style).sizeMultiplier/n.sizeMultiplier,r.height=1*i,r.style.height=r.height+"em",r.surdWidth=.833*i,r.innerHTML="\n "+b.main+"";else if("large"===t.type)r.height=M[t.size]/i,r.style.height=r.height+"em",r.surdWidth=1/i,r.innerHTML='\n '+b[t.size]+"";else{r.height=e/i,r.style.height=r.height+"em",r.surdWidth=1.056/i;var a=Math.floor(1e3*r.height),s=a-54;r.innerHTML="\n \n t)return n[i]}return n[n.length-1]},E=function(e,t,n,r,i,a){"<"===e||"\\lt"===e?e="\\langle":">"!==e&&"\\gt"!==e||(e="\\rangle");var o=void 0;o=u["default"].contains(k,e)?z:u["default"].contains(x,e)?C:A;var s=N(e,t,o,r);return"\\surd"===e?y(t,s,r):"small"===s.type?p(e,s.style,n,r,i,a):"large"===s.type?m(e,s.size,n,r,i,a):"stack"===s.type?v(e,t,n,r,i,a):void 0},R=function(e,t,n,r,i,a){var o=r.fontMetrics().axisHeight*r.sizeMultiplier,s=901,l=5/r.fontMetrics().ptPerEm,u=Math.max(t-o,n+o),d=Math.max(u/500*s,2*u-l);return E(e,d,!0,r,i,a)};t.exports={sizedDelim:S,customSizedDelim:E,leftRightDelim:R}},{"./ParseError":29,"./Style":33,"./buildCommon":34,"./fontMetrics":41,"./symbols":48,"./utils":51}],39:[function(e,t){function n(e){return e&&e.__esModule?e:{"default":e}}var r=n(e("babel-runtime/helpers/classCallCheck")),i=n(e("babel-runtime/helpers/createClass")),a=n(e("./unicodeRegexes")),o=n(e("./utils")),s=function(e){for(var t=(e=e.slice()).length-1;t>=0;t--)e[t]||e.splice(t,1);return e.join(" ")},l=function(){function e(t,n,i){(0,r["default"])(this,e),this.classes=t||[],this.children=n||[],this.height=0,this.depth=0,this.maxFontSize=0,this.style={},this.attributes={},this.innerHTML,i&&(i.style.isTight()&&this.classes.push("mtight"),i.getColor()&&(this.style.color=i.getColor()))}return(0,i["default"])(e,[{key:"setAttribute",value:function(e,t){this.attributes[e]=t}},{key:"tryCombine",value:function(){return!1}},{key:"toNode",value:function(){var e=document.createElement("span");for(var t in e.className=s(this.classes),this.style)Object.prototype.hasOwnProperty.call(this.style,t)&&(e.style[t]=this.style[t]);for(var n in this.attributes)Object.prototype.hasOwnProperty.call(this.attributes,n)&&e.setAttribute(n,this.attributes[n]);this.innerHTML&&(e.innerHTML=this.innerHTML);for(var r=0;r0||s(this.classes)!==s(t.classes)||this.skew!==t.skew||this.maxFontSize!==t.maxFontSize)return!1;for(var n in this.style)if(this.style.hasOwnProperty(n)&&this.style[n]!==t.style[n])return!1;for(var r in t.style)if(t.style.hasOwnProperty(r)&&this.style[r]!==t.style[r])return!1;return this.value+=t.value,this.height=Math.max(this.height,t.height),this.depth=Math.max(this.depth,t.depth),this.italic=t.italic,!0}},{key:"toNode",value:function(){var e=document.createTextNode(this.value),t=null;for(var n in 
this.italic>0&&((t=document.createElement("span")).style.marginRight=this.italic+"em"),this.classes.length>0&&((t=t||document.createElement("span")).className=s(this.classes)),this.style)this.style.hasOwnProperty(n)&&((t=t||document.createElement("span")).style[n]=this.style[n]);return t?(t.appendChild(e),t):e}},{key:"toMarkup",value:function(){var e=!1,t="0&&(n+="margin-right:"+this.italic+"em;"),this.style)this.style.hasOwnProperty(r)&&(n+=o["default"].hyphenate(r)+":"+this.style[r]+";");n&&(e=!0,t+=' style="'+o["default"].escape(n)+'"');var i=o["default"].escape(this.value);return e?(t+=">",t+=i,t+="
      "):i}}]),e}();t.exports={span:l,documentFragment:u,symbolNode:c}},{"./unicodeRegexes":49,"./utils":51,"babel-runtime/helpers/classCallCheck":4,"babel-runtime/helpers/createClass":5}],40:[function(e,t){function n(e){return e&&e.__esModule?e:{"default":e}}function r(e,t,n){for(var r=[],i=[r],a=[];;){var l=e.parseExpression(!1,null);l=new o["default"]("ordgroup",l,e.mode),n&&(l=new o["default"]("styling",{style:n,value:[l]},e.mode)),r.push(l);var u=e.nextToken.text;if("&"===u)e.consume();else{if("\\end"===u)break;if("\\\\"!==u&&"\\cr"!==u)throw new s["default"]("Expected & or \\\\ or \\end",e.nextToken);var d=e.parseFunction();a.push(d.value.size),r=[],i.push(r)}}return t.body=i,t.rowGaps=a,new o["default"](t.type,t,e.mode)}function i(e,n,r){"string"==typeof e&&(e=[e]),"number"==typeof n&&(n={numArgs:n});for(var i={numArgs:n.numArgs||0,argTypes:n.argTypes,greediness:1,allowedInText:!!n.allowedInText,numOptionalArgs:n.numOptionalArgs||0,handler:r},a=0;a0&&(l=2),t.value.cols[a]={type:"align",align:s,pregap:l,postgap:0}}return t}),i("gathered",{},function(e){var t={type:"array",cols:[{type:"align",align:"c"}],addJot:!0};return t=r(e.parser,t,"display")})},{"./ParseError":29,"./ParseNode":30}],41:[function(e,t){function n(e){return e&&e.__esModule?e:{"default":e}}var r=e("./unicodeRegexes"),i=n(e("./fontMetricsData")),a={slant:[.25,.25,.25],space:[0,0,0],stretch:[0,0,0],shrink:[0,0,0],xHeight:[.431,.431,.431],quad:[1,1.171,1.472],extraSpace:[0,0,0],num1:[.677,.732,.925],num2:[.394,.384,.387],num3:[.444,.471,.504],denom1:[.686,.752,1.025],denom2:[.345,.344,.532],sup1:[.413,.503,.504],sup2:[.363,.431,.404],sup3:[.289,.286,.294],sub1:[.15,.143,.2],sub2:[.247,.286,.4],supDrop:[.386,.353,.494],subDrop:[.05,.071,.1],delim1:[2.39,1.7,1.98],delim2:[1.01,1.157,1.42],axisHeight:[.25,.25,.25],defaultRuleThickness:[.04,.049,.049],bigOpSpacing1:[.111,.111,.111],bigOpSpacing2:[.166,.166,.166],bigOpSpacing3:[.2,.2,.2],bigOpSpacing4:[.6,.611,.611],bigOpSpacing5:[.1,.143,.143],sqrtRuleThickness:[.04,.04,.04],ptPerEm:[10,10,10],doubleRuleSep:[.2,.2,.2]},o={"\xc0":"A","\xc1":"A","\xc2":"A","\xc3":"A","\xc4":"A","\xc5":"A","\xc6":"A","\xc7":"C","\xc8":"E","\xc9":"E","\xca":"E","\xcb":"E","\xcc":"I","\xcd":"I","\xce":"I","\xcf":"I","\xd0":"D","\xd1":"N","\xd2":"O","\xd3":"O","\xd4":"O","\xd5":"O","\xd6":"O","\xd8":"O","\xd9":"U","\xda":"U","\xdb":"U","\xdc":"U","\xdd":"Y","\xde":"o","\xdf":"B","\xe0":"a","\xe1":"a","\xe2":"a","\xe3":"a","\xe4":"a","\xe5":"a","\xe6":"a","\xe7":"c","\xe8":"e","\xe9":"e","\xea":"e","\xeb":"e","\xec":"i","\xed":"i","\xee":"i","\xef":"i","\xf0":"d","\xf1":"n","\xf2":"o","\xf3":"o","\xf4":"o","\xf5":"o","\xf6":"o","\xf8":"o","\xf9":"u","\xfa":"u","\xfb":"u","\xfc":"u","\xfd":"y","\xfe":"o","\xff":"y","\u0410":"A","\u0411":"B","\u0412":"B","\u0413":"F","\u0414":"A","\u0415":"E","\u0416":"K","\u0417":"3","\u0418":"N","\u0419":"N","\u041a":"K","\u041b":"N","\u041c":"M","\u041d":"H","\u041e":"O","\u041f":"N","\u0420":"P","\u0421":"C","\u0422":"T","\u0423":"y","\u0424":"O","\u0425":"X","\u0426":"U","\u0427":"h","\u0428":"W","\u0429":"W","\u042a":"B","\u042b":"X","\u042c":"B","\u042d":"3","\u042e":"X","\u042f":"R","\u0430":"a","\u0431":"b","\u0432":"a","\u0433":"r","\u0434":"y","\u0435":"e","\u0436":"m","\u0437":"e","\u0438":"n","\u0439":"n","\u043a":"n","\u043b":"n","\u043c":"m","\u043d":"n","\u043e":"o","\u043f":"n","\u0440":"p","\u0441":"c","\u0442":"o","\u0443":"y","\u0444":"b","\u0445":"x","\u0446":"n","\u0447":"n","\u0448":"w","\u0449":"w","\u044a":"a","\u044b":"m","\u044c":"a","\u044
d":"e","\u044e":"m","\u044f":"r"},s=function(e,t){var n=e.charCodeAt(0);e[0]in o?n=o[e[0]].charCodeAt(0):r.cjkRegex.test(e[0])&&(n="M".charCodeAt(0));var a=i["default"][t][n];if(a)return{depth:a[0],height:a[1],italic:a[2],skew:a[3],width:a[4]}},l={},u=function(e){var t=void 0;if(!l[t=e>=5?0:e>=3?1:2]){var n=l[t]={};for(var r in a)a.hasOwnProperty(r)&&(n[r]=a[r][t]);n.cssEmPerMu=n.quad/18}return l[t]};t.exports={getFontMetrics:u,getCharacterMetrics:s}},{"./fontMetricsData":42,"./unicodeRegexes":49}],42:[function(e,t){t.exports={"AMS-Regular":{65:[0,.68889,0,0],66:[0,.68889,0,0],67:[0,.68889,0,0],68:[0,.68889,0,0],69:[0,.68889,0,0],70:[0,.68889,0,0],71:[0,.68889,0,0],72:[0,.68889,0,0],73:[0,.68889,0,0],74:[.16667,.68889,0,0],75:[0,.68889,0,0],76:[0,.68889,0,0],77:[0,.68889,0,0],78:[0,.68889,0,0],79:[.16667,.68889,0,0],80:[0,.68889,0,0],81:[.16667,.68889,0,0],82:[0,.68889,0,0],83:[0,.68889,0,0],84:[0,.68889,0,0],85:[0,.68889,0,0],86:[0,.68889,0,0],87:[0,.68889,0,0],88:[0,.68889,0,0],89:[0,.68889,0,0],90:[0,.68889,0,0],107:[0,.68889,0,0],165:[0,.675,.025,0],174:[.15559,.69224,0,0],240:[0,.68889,0,0],295:[0,.68889,0,0],710:[0,.825,0,0],732:[0,.9,0,0],770:[0,.825,0,0],771:[0,.9,0,0],989:[.08167,.58167,0,0],1008:[0,.43056,.04028,0],8245:[0,.54986,0,0],8463:[0,.68889,0,0],8487:[0,.68889,0,0],8498:[0,.68889,0,0],8502:[0,.68889,0,0],8503:[0,.68889,0,0],8504:[0,.68889,0,0],8513:[0,.68889,0,0],8592:[-.03598,.46402,0,0],8594:[-.03598,.46402,0,0],8602:[-.13313,.36687,0,0],8603:[-.13313,.36687,0,0],8606:[.01354,.52239,0,0],8608:[.01354,.52239,0,0],8610:[.01354,.52239,0,0],8611:[.01354,.52239,0,0],8619:[0,.54986,0,0],8620:[0,.54986,0,0],8621:[-.13313,.37788,0,0],8622:[-.13313,.36687,0,0],8624:[0,.69224,0,0],8625:[0,.69224,0,0],8630:[0,.43056,0,0],8631:[0,.43056,0,0],8634:[.08198,.58198,0,0],8635:[.08198,.58198,0,0],8638:[.19444,.69224,0,0],8639:[.19444,.69224,0,0],8642:[.19444,.69224,0,0],8643:[.19444,.69224,0,0],8644:[.1808,.675,0,0],8646:[.1808,.675,0,0],8647:[.1808,.675,0,0],8648:[.19444,.69224,0,0],8649:[.1808,.675,0,0],8650:[.19444,.69224,0,0],8651:[.01354,.52239,0,0],8652:[.01354,.52239,0,0],8653:[-.13313,.36687,0,0],8654:[-.13313,.36687,0,0],8655:[-.13313,.36687,0,0],8666:[.13667,.63667,0,0],8667:[.13667,.63667,0,0],8669:[-.13313,.37788,0,0],8672:[-.064,.437,0,0],8674:[-.064,.437,0,0],8705:[0,.825,0,0],8708:[0,.68889,0,0],8709:[.08167,.58167,0,0],8717:[0,.43056,0,0],8722:[-.03598,.46402,0,0],8724:[.08198,.69224,0,0],8726:[.08167,.58167,0,0],8733:[0,.69224,0,0],8736:[0,.69224,0,0],8737:[0,.69224,0,0],8738:[.03517,.52239,0,0],8739:[.08167,.58167,0,0],8740:[.25142,.74111,0,0],8741:[.08167,.58167,0,0],8742:[.25142,.74111,0,0],8756:[0,.69224,0,0],8757:[0,.69224,0,0],8764:[-.13313,.36687,0,0],8765:[-.13313,.37788,0,0],8769:[-.13313,.36687,0,0],8770:[-.03625,.46375,0,0],8774:[.30274,.79383,0,0],8776:[-.01688,.48312,0,0],8778:[.08167,.58167,0,0],8782:[.06062,.54986,0,0],8783:[.06062,.54986,0,0],8785:[.08198,.58198,0,0],8786:[.08198,.58198,0,0],8787:[.08198,.58198,0,0],8790:[0,.69224,0,0],8791:[.22958,.72958,0,0],8796:[.08198,.91667,0,0],8806:[.25583,.75583,0,0],8807:[.25583,.75583,0,0],8808:[.25142,.75726,0,0],8809:[.25142,.75726,0,0],8812:[.25583,.75583,0,0],8814:[.20576,.70576,0,0],8815:[.20576,.70576,0,0],8816:[.30274,.79383,0,0],8817:[.30274,.79383,0,0],8818:[.22958,.72958,0,0],8819:[.22958,.72958,0,0],8822:[.1808,.675,0,0],8823:[.1808,.675,0,0],8828:[.13667,.63667,0,0],8829:[.13667,.63667,0,0],8830:[.22958,.72958,0,0],8831:[.22958,.72958,0,0],8832:[.20576,.70576,0,0],8833:[.20576,.70576,0,0],8840:[.302
74,.79383,0,0],8841:[.30274,.79383,0,0],8842:[.13597,.63597,0,0],8843:[.13597,.63597,0,0],8847:[.03517,.54986,0,0],8848:[.03517,.54986,0,0],8858:[.08198,.58198,0,0],8859:[.08198,.58198,0,0],8861:[.08198,.58198,0,0],8862:[0,.675,0,0],8863:[0,.675,0,0],8864:[0,.675,0,0],8865:[0,.675,0,0],8872:[0,.69224,0,0],8873:[0,.69224,0,0],8874:[0,.69224,0,0],8876:[0,.68889,0,0],8877:[0,.68889,0,0],8878:[0,.68889,0,0],8879:[0,.68889,0,0],8882:[.03517,.54986,0,0],8883:[.03517,.54986,0,0],8884:[.13667,.63667,0,0],8885:[.13667,.63667,0,0],8888:[0,.54986,0,0],8890:[.19444,.43056,0,0],8891:[.19444,.69224,0,0],8892:[.19444,.69224,0,0],8901:[0,.54986,0,0],8903:[.08167,.58167,0,0],8905:[.08167,.58167,0,0],8906:[.08167,.58167,0,0],8907:[0,.69224,0,0],8908:[0,.69224,0,0],8909:[-.03598,.46402,0,0],8910:[0,.54986,0,0],8911:[0,.54986,0,0],8912:[.03517,.54986,0,0],8913:[.03517,.54986,0,0],8914:[0,.54986,0,0],8915:[0,.54986,0,0],8916:[0,.69224,0,0],8918:[.0391,.5391,0,0],8919:[.0391,.5391,0,0],8920:[.03517,.54986,0,0],8921:[.03517,.54986,0,0],8922:[.38569,.88569,0,0],8923:[.38569,.88569,0,0],8926:[.13667,.63667,0,0],8927:[.13667,.63667,0,0],8928:[.30274,.79383,0,0],8929:[.30274,.79383,0,0],8934:[.23222,.74111,0,0],8935:[.23222,.74111,0,0],8936:[.23222,.74111,0,0],8937:[.23222,.74111,0,0],8938:[.20576,.70576,0,0],8939:[.20576,.70576,0,0],8940:[.30274,.79383,0,0],8941:[.30274,.79383,0,0],8994:[.19444,.69224,0,0],8995:[.19444,.69224,0,0],9416:[.15559,.69224,0,0],9484:[0,.69224,0,0],9488:[0,.69224,0,0],9492:[0,.37788,0,0],9496:[0,.37788,0,0],9585:[.19444,.68889,0,0],9586:[.19444,.74111,0,0],9632:[0,.675,0,0],9633:[0,.675,0,0],9650:[0,.54986,0,0],9651:[0,.54986,0,0],9654:[.03517,.54986,0,0],9660:[0,.54986,0,0],9661:[0,.54986,0,0],9664:[.03517,.54986,0,0],9674:[.11111,.69224,0,0],9733:[.19444,.69224,0,0],10003:[0,.69224,0,0],10016:[0,.69224,0,0],10731:[.11111,.69224,0,0],10846:[.19444,.75583,0,0],10877:[.13667,.63667,0,0],10878:[.13667,.63667,0,0],10885:[.25583,.75583,0,0],10886:[.25583,.75583,0,0],10887:[.13597,.63597,0,0],10888:[.13597,.63597,0,0],10889:[.26167,.75726,0,0],10890:[.26167,.75726,0,0],10891:[.48256,.98256,0,0],10892:[.48256,.98256,0,0],10901:[.13667,.63667,0,0],10902:[.13667,.63667,0,0],10933:[.25142,.75726,0,0],10934:[.25142,.75726,0,0],10935:[.26167,.75726,0,0],10936:[.26167,.75726,0,0],10937:[.26167,.75726,0,0],10938:[.26167,.75726,0,0],10949:[.25583,.75583,0,0],10950:[.25583,.75583,0,0],10955:[.28481,.79383,0,0],10956:[.28481,.79383,0,0],57350:[.08167,.58167,0,0],57351:[.08167,.58167,0,0],57352:[.08167,.58167,0,0],57353:[0,.43056,.04028,0],57356:[.25142,.75726,0,0],57357:[.25142,.75726,0,0],57358:[.41951,.91951,0,0],57359:[.30274,.79383,0,0],57360:[.30274,.79383,0,0],57361:[.41951,.91951,0,0],57366:[.25142,.75726,0,0],57367:[.25142,.75726,0,0],57368:[.25142,.75726,0,0],57369:[.25142,.75726,0,0],57370:[.13597,.63597,0,0],57371:[.13597,.63597,0,0]},"Caligraphic-Regular":{48:[0,.43056,0,0],49:[0,.43056,0,0],50:[0,.43056,0,0],51:[.19444,.43056,0,0],52:[.19444,.43056,0,0],53:[.19444,.43056,0,0],54:[0,.64444,0,0],55:[.19444,.43056,0,0],56:[0,.64444,0,0],57:[.19444,.43056,0,0],65:[0,.68333,0,.19445],66:[0,.68333,.03041,.13889],67:[0,.68333,.05834,.13889], 
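+// NOTE (illustrative comment, not part of the upstream KaTeX source): each entry in this
+// fontMetricsData table maps a character code to [depth, height, italic, skew] in ems, read
+// back by getCharacterMetrics above; e.g. 74 ("J") in AMS-Regular is [.16667,.68889,0,0],
+// i.e. depth .16667em and height .68889em.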
+68:[0,.68333,.02778,.08334],69:[0,.68333,.08944,.11111],70:[0,.68333,.09931,.11111],71:[.09722,.68333,.0593,.11111],72:[0,.68333,.00965,.11111],73:[0,.68333,.07382,0],74:[.09722,.68333,.18472,.16667],75:[0,.68333,.01445,.05556],76:[0,.68333,0,.13889],77:[0,.68333,0,.13889],78:[0,.68333,.14736,.08334],79:[0,.68333,.02778,.11111],80:[0,.68333,.08222,.08334],81:[.09722,.68333,0,.11111],82:[0,.68333,0,.08334],83:[0,.68333,.075,.13889],84:[0,.68333,.25417,0],85:[0,.68333,.09931,.08334],86:[0,.68333,.08222,0],87:[0,.68333,.08222,.08334],88:[0,.68333,.14643,.13889],89:[.09722,.68333,.08222,.08334],90:[0,.68333,.07944,.13889]},"Fraktur-Regular":{33:[0,.69141,0,0],34:[0,.69141,0,0],38:[0,.69141,0,0],39:[0,.69141,0,0],40:[.24982,.74947,0,0],41:[.24982,.74947,0,0],42:[0,.62119,0,0],43:[.08319,.58283,0,0],44:[0,.10803,0,0],45:[.08319,.58283,0,0],46:[0,.10803,0,0],47:[.24982,.74947,0,0],48:[0,.47534,0,0],49:[0,.47534,0,0],50:[0,.47534,0,0],51:[.18906,.47534,0,0],52:[.18906,.47534,0,0],53:[.18906,.47534,0,0],54:[0,.69141,0,0],55:[.18906,.47534,0,0],56:[0,.69141,0,0],57:[.18906,.47534,0,0],58:[0,.47534,0,0],59:[.12604,.47534,0,0],61:[-.13099,.36866,0,0],63:[0,.69141,0,0],65:[0,.69141,0,0],66:[0,.69141,0,0],67:[0,.69141,0,0],68:[0,.69141,0,0],69:[0,.69141,0,0],70:[.12604,.69141,0,0],71:[0,.69141,0,0],72:[.06302,.69141,0,0],73:[0,.69141,0,0],74:[.12604,.69141,0,0],75:[0,.69141,0,0],76:[0,.69141,0,0],77:[0,.69141,0,0],78:[0,.69141,0,0],79:[0,.69141,0,0],80:[.18906,.69141,0,0],81:[.03781,.69141,0,0],82:[0,.69141,0,0],83:[0,.69141,0,0],84:[0,.69141,0,0],85:[0,.69141,0,0],86:[0,.69141,0,0],87:[0,.69141,0,0],88:[0,.69141,0,0],89:[.18906,.69141,0,0],90:[.12604,.69141,0,0],91:[.24982,.74947,0,0],93:[.24982,.74947,0,0],94:[0,.69141,0,0],97:[0,.47534,0,0],98:[0,.69141,0,0],99:[0,.47534,0,0],100:[0,.62119,0,0],101:[0,.47534,0,0],102:[.18906,.69141,0,0],103:[.18906,.47534,0,0],104:[.18906,.69141,0,0],105:[0,.69141,0,0],106:[0,.69141,0,0],107:[0,.69141,0,0],108:[0,.69141,0,0],109:[0,.47534,0,0],110:[0,.47534,0,0],111:[0,.47534,0,0],112:[.18906,.52396,0,0],113:[.18906,.47534,0,0],114:[0,.47534,0,0],115:[0,.47534,0,0],116:[0,.62119,0,0],117:[0,.47534,0,0],118:[0,.52396,0,0],119:[0,.52396,0,0],120:[.18906,.47534,0,0],121:[.18906,.47534,0,0],122:[.18906,.47534,0,0],8216:[0,.69141,0,0],8217:[0,.69141,0,0],58112:[0,.62119,0,0],58113:[0,.62119,0,0],58114:[.18906,.69141,0,0],58115:[.18906,.69141,0,0],58116:[.18906,.47534,0,0],58117:[0,.69141,0,0],58118:[0,.62119,0,0],58119:[0,.47534,0,0]},"Main-Bold":{33:[0,.69444,0,0],34:[0,.69444,0,0],35:[.19444,.69444,0,0],36:[.05556,.75,0,0],37:[.05556,.75,0,0],38:[0,.69444,0,0],39:[0,.69444,0,0],40:[.25,.75,0,0],41:[.25,.75,0,0],42:[0,.75,0,0],43:[.13333,.63333,0,0],44:[.19444,.15556,0,0],45:[0,.44444,0,0],46:[0,.15556,0,0],47:[.25,.75,0,0],48:[0,.64444,0,0],49:[0,.64444,0,0],50:[0,.64444,0,0],51:[0,.64444,0,0],52:[0,.64444,0,0],53:[0,.64444,0,0],54:[0,.64444,0,0],55:[0,.64444,0,0],56:[0,.64444,0,0],57:[0,.64444,0,0],58:[0,.44444,0,0],59:[.19444,.44444,0,0],60:[.08556,.58556,0,0],61:[-.10889,.39111,0,0],62:[.08556,.58556,0,0],63:[0,.69444,0,0],64:[0,.69444,0,0],65:[0,.68611,0,0],66:[0,.68611,0,0],67:[0,.68611,0,0],68:[0,.68611,0,0],69:[0,.68611,0,0],70:[0,.68611,0,0],71:[0,.68611,0,0],72:[0,.68611,0,0],73:[0,.68611,0,0],74:[0,.68611,0,0],75:[0,.68611,0,0],76:[0,.68611,0,0],77:[0,.68611,0,0],78:[0,.68611,0,0],79:[0,.68611,0,0],80:[0,.68611,0,0],81:[.19444,.68611,0,0],82:[0,.68611,0,0],83:[0,.68611,0,0],84:[0,.68611,0,0],85:[0,.68611,0,0],86:[0,.68611,.01597,0],87:[0,.68611,.01597,0],88:[0
,.68611,0,0],89:[0,.68611,.02875,0],90:[0,.68611,0,0],91:[.25,.75,0,0],92:[.25,.75,0,0],93:[.25,.75,0,0],94:[0,.69444,0,0],95:[.31,.13444,.03194,0],96:[0,.69444,0,0],97:[0,.44444,0,0],98:[0,.69444,0,0],99:[0,.44444,0,0],100:[0,.69444,0,0],101:[0,.44444,0,0],102:[0,.69444,.10903,0],103:[.19444,.44444,.01597,0],104:[0,.69444,0,0],105:[0,.69444,0,0],106:[.19444,.69444,0,0],107:[0,.69444,0,0],108:[0,.69444,0,0],109:[0,.44444,0,0],110:[0,.44444,0,0],111:[0,.44444,0,0],112:[.19444,.44444,0,0],113:[.19444,.44444,0,0],114:[0,.44444,0,0],115:[0,.44444,0,0],116:[0,.63492,0,0],117:[0,.44444,0,0],118:[0,.44444,.01597,0],119:[0,.44444,.01597,0],120:[0,.44444,0,0],121:[.19444,.44444,.01597,0],122:[0,.44444,0,0],123:[.25,.75,0,0],124:[.25,.75,0,0],125:[.25,.75,0,0],126:[.35,.34444,0,0],168:[0,.69444,0,0],172:[0,.44444,0,0],175:[0,.59611,0,0],176:[0,.69444,0,0],177:[.13333,.63333,0,0],180:[0,.69444,0,0],215:[.13333,.63333,0,0],247:[.13333,.63333,0,0],305:[0,.44444,0,0],567:[.19444,.44444,0,0],710:[0,.69444,0,0],711:[0,.63194,0,0],713:[0,.59611,0,0],714:[0,.69444,0,0],715:[0,.69444,0,0],728:[0,.69444,0,0],729:[0,.69444,0,0],730:[0,.69444,0,0],732:[0,.69444,0,0],768:[0,.69444,0,0],769:[0,.69444,0,0],770:[0,.69444,0,0],771:[0,.69444,0,0],772:[0,.59611,0,0],774:[0,.69444,0,0],775:[0,.69444,0,0],776:[0,.69444,0,0],778:[0,.69444,0,0],779:[0,.69444,0,0],780:[0,.63194,0,0],824:[.19444,.69444,0,0],915:[0,.68611,0,0],916:[0,.68611,0,0],920:[0,.68611,0,0],923:[0,.68611,0,0],926:[0,.68611,0,0],928:[0,.68611,0,0],931:[0,.68611,0,0],933:[0,.68611,0,0],934:[0,.68611,0,0],936:[0,.68611,0,0],937:[0,.68611,0,0],8211:[0,.44444,.03194,0],8212:[0,.44444,.03194,0],8216:[0,.69444,0,0],8217:[0,.69444,0,0],8220:[0,.69444,0,0],8221:[0,.69444,0,0],8224:[.19444,.69444,0,0],8225:[.19444,.69444,0,0],8242:[0,.55556,0,0],8407:[0,.72444,.15486,0],8463:[0,.69444,0,0],8465:[0,.69444,0,0],8467:[0,.69444,0,0],8472:[.19444,.44444,0,0],8476:[0,.69444,0,0],8501:[0,.69444,0,0],8592:[-.10889,.39111,0,0],8593:[.19444,.69444,0,0],8594:[-.10889,.39111,0,0],8595:[.19444,.69444,0,0],8596:[-.10889,.39111,0,0],8597:[.25,.75,0,0],8598:[.19444,.69444,0,0],8599:[.19444,.69444,0,0],8600:[.19444,.69444,0,0],8601:[.19444,.69444,0,0],8636:[-.10889,.39111,0,0],8637:[-.10889,.39111,0,0],8640:[-.10889,.39111,0,0],8641:[-.10889,.39111,0,0],8656:[-.10889,.39111,0,0],8657:[.19444,.69444,0,0],8658:[-.10889,.39111,0,0],8659:[.19444,.69444,0,0],8660:[-.10889,.39111,0,0],8661:[.25,.75,0,0],8704:[0,.69444,0,0],8706:[0,.69444,.06389,0],8707:[0,.69444,0,0],8709:[.05556,.75,0,0],8711:[0,.68611,0,0],8712:[.08556,.58556,0,0],8715:[.08556,.58556,0,0],8722:[.13333,.63333,0,0],8723:[.13333,.63333,0,0],8725:[.25,.75,0,0],8726:[.25,.75,0,0],8727:[-.02778,.47222,0,0],8728:[-.02639,.47361,0,0],8729:[-.02639,.47361,0,0],8730:[.18,.82,0,0],8733:[0,.44444,0,0],8734:[0,.44444,0,0],8736:[0,.69224,0,0],8739:[.25,.75,0,0],8741:[.25,.75,0,0],8743:[0,.55556,0,0],8744:[0,.55556,0,0],8745:[0,.55556,0,0],8746:[0,.55556,0,0],8747:[.19444,.69444,.12778,0],8764:[-.10889,.39111,0,0],8768:[.19444,.69444,0,0],8771:[.00222,.50222,0,0],8776:[.02444,.52444,0,0],8781:[.00222,.50222,0,0],8801:[.00222,.50222,0,0],8804:[.19667,.69667,0,0],8805:[.19667,.69667,0,0],8810:[.08556,.58556,0,0],8811:[.08556,.58556,0,0],8826:[.08556,.58556,0,0],8827:[.08556,.58556,0,0],8834:[.08556,.58556,0,0],8835:[.08556,.58556,0,0],8838:[.19667,.69667,0,0],8839:[.19667,.69667,0,0],8846:[0,.55556,0,0],8849:[.19667,.69667,0,0],8850:[.19667,.69667,0,0],8851:[0,.55556,0,0],8852:[0,.55556,0,0],8853:[.13333,.63333,0,0],8854:[.13333,.63
333,0,0],8855:[.13333,.63333,0,0],8856:[.13333,.63333,0,0],8857:[.13333,.63333,0,0],8866:[0,.69444,0,0],8867:[0,.69444,0,0],8868:[0,.69444,0,0],8869:[0,.69444,0,0],8900:[-.02639,.47361,0,0],8901:[-.02639,.47361,0,0],8902:[-.02778,.47222,0,0],8968:[.25,.75,0,0],8969:[.25,.75,0,0],8970:[.25,.75,0,0],8971:[.25,.75,0,0],8994:[-.13889,.36111,0,0],8995:[-.13889,.36111,0,0],9651:[.19444,.69444,0,0],9657:[-.02778,.47222,0,0],9661:[.19444,.69444,0,0],9667:[-.02778,.47222,0,0],9711:[.19444,.69444,0,0],9824:[.12963,.69444,0,0],9825:[.12963,.69444,0,0],9826:[.12963,.69444,0,0],9827:[.12963,.69444,0,0],9837:[0,.75,0,0],9838:[.19444,.69444,0,0],9839:[.19444,.69444,0,0],10216:[.25,.75,0,0],10217:[.25,.75,0,0],10815:[0,.68611,0,0],10927:[.19667,.69667,0,0],10928:[.19667,.69667,0,0]},"Main-Italic":{33:[0,.69444,.12417,0],34:[0,.69444,.06961,0],35:[.19444,.69444,.06616,0],37:[.05556,.75,.13639,0],38:[0,.69444,.09694,0],39:[0,.69444,.12417,0],40:[.25,.75,.16194,0],41:[.25,.75,.03694,0],42:[0,.75,.14917,0],43:[.05667,.56167,.03694,0],44:[.19444,.10556,0,0],45:[0,.43056,.02826,0],46:[0,.10556,0,0],47:[.25,.75,.16194,0],48:[0,.64444,.13556,0],49:[0,.64444,.13556,0],50:[0,.64444,.13556,0],51:[0,.64444,.13556,0],52:[.19444,.64444,.13556,0],53:[0,.64444,.13556,0],54:[0,.64444,.13556,0],55:[.19444,.64444,.13556,0],56:[0,.64444,.13556,0],57:[0,.64444,.13556,0],58:[0,.43056,.0582,0],59:[.19444,.43056,.0582,0],61:[-.13313,.36687,.06616,0],63:[0,.69444,.1225,0],64:[0,.69444,.09597,0],65:[0,.68333,0,0],66:[0,.68333,.10257,0],67:[0,.68333,.14528,0],68:[0,.68333,.09403,0],69:[0,.68333,.12028,0],70:[0,.68333,.13305,0],71:[0,.68333,.08722,0],72:[0,.68333,.16389,0],73:[0,.68333,.15806,0],74:[0,.68333,.14028,0],75:[0,.68333,.14528,0],76:[0,.68333,0,0],77:[0,.68333,.16389,0],78:[0,.68333,.16389,0],79:[0,.68333,.09403,0],80:[0,.68333,.10257,0],81:[.19444,.68333,.09403,0],82:[0,.68333,.03868,0],83:[0,.68333,.11972,0],84:[0,.68333,.13305,0],85:[0,.68333,.16389,0],86:[0,.68333,.18361,0],87:[0,.68333,.18361,0],88:[0,.68333,.15806,0],89:[0,.68333,.19383,0],90:[0,.68333,.14528,0],91:[.25,.75,.1875,0],93:[.25,.75,.10528,0],94:[0,.69444,.06646,0],95:[.31,.12056,.09208,0],97:[0,.43056,.07671,0],98:[0,.69444,.06312,0],99:[0,.43056,.05653,0],100:[0,.69444,.10333,0],101:[0,.43056,.07514,0],102:[.19444,.69444,.21194,0],103:[.19444,.43056,.08847,0],104:[0,.69444,.07671,0],105:[0,.65536,.1019,0],106:[.19444,.65536,.14467,0],107:[0,.69444,.10764,0],108:[0,.69444,.10333,0],109:[0,.43056,.07671,0],110:[0,.43056,.07671,0],111:[0,.43056,.06312,0],112:[.19444,.43056,.06312,0],113:[.19444,.43056,.08847,0],114:[0,.43056,.10764,0],115:[0,.43056,.08208,0],116:[0,.61508,.09486,0],117:[0,.43056,.07671,0],118:[0,.43056,.10764,0],119:[0,.43056,.10764,0],120:[0,.43056,.12042,0],121:[.19444,.43056,.08847,0],122:[0,.43056,.12292,0],126:[.35,.31786,.11585,0],163:[0,.69444,0,0],305:[0,.43056,0,.02778],567:[.19444,.43056,0,.08334],768:[0,.69444,0,0],769:[0,.69444,.09694,0],770:[0,.69444,.06646,0],771:[0,.66786,.11585,0],772:[0,.56167,.10333,0],774:[0,.69444,.10806,0],775:[0,.66786,.11752,0],776:[0,.66786,.10474,0],778:[0,.69444,0,0],779:[0,.69444,.1225,0],780:[0,.62847,.08295,0],915:[0,.68333,.13305,0],916:[0,.68333,0,0],920:[0,.68333,.09403,0],923:[0,.68333,0,0],926:[0,.68333,.15294,0],928:[0,.68333,.16389,0],931:[0,.68333,.12028,0],933:[0,.68333,.11111,0],934:[0,.68333,.05986,0],936:[0,.68333,.11111,0],937:[0,.68333,.10257,0],8211:[0,.43056,.09208,0],8212:[0,.43056,.09208,0],8216:[0,.69444,.12417,0],8217:[0,.69444,.12417,0],8220:[0,.69444,.1685,0],8221:[0,.6944
4,.06961,0],8463:[0,.68889,0,0]},"Main-Regular":{32:[0,0,0,0],33:[0,.69444,0,0],34:[0,.69444,0,0],35:[.19444,.69444,0,0],36:[.05556,.75,0,0],37:[.05556,.75,0,0],38:[0,.69444,0,0],39:[0,.69444,0,0],40:[.25,.75,0,0],41:[.25,.75,0,0],42:[0,.75,0,0],43:[.08333,.58333,0,0],44:[.19444,.10556,0,0],45:[0,.43056,0,0],46:[0,.10556,0,0],47:[.25,.75,0,0],48:[0,.64444,0,0],49:[0,.64444,0,0],50:[0,.64444,0,0],51:[0,.64444,0,0],52:[0,.64444,0,0],53:[0,.64444,0,0],54:[0,.64444,0,0],55:[0,.64444,0,0],56:[0,.64444,0,0],57:[0,.64444,0,0],58:[0,.43056,0,0],59:[.19444,.43056,0,0],60:[.0391,.5391,0,0],61:[-.13313,.36687,0,0],62:[.0391,.5391,0,0],63:[0,.69444,0,0],64:[0,.69444,0,0],65:[0,.68333,0,0],66:[0,.68333,0,0],67:[0,.68333,0,0],68:[0,.68333,0,0],69:[0,.68333,0,0],70:[0,.68333,0,0],71:[0,.68333,0,0],72:[0,.68333,0,0],73:[0,.68333,0,0],74:[0,.68333,0,0],75:[0,.68333,0,0],76:[0,.68333,0,0],77:[0,.68333,0,0],78:[0,.68333,0,0],79:[0,.68333,0,0],80:[0,.68333,0,0],81:[.19444,.68333,0,0],82:[0,.68333,0,0],83:[0,.68333,0,0],84:[0,.68333,0,0],85:[0,.68333,0,0],86:[0,.68333,.01389,0],87:[0,.68333,.01389,0],88:[0,.68333,0,0],89:[0,.68333,.025,0],90:[0,.68333,0,0],91:[.25,.75,0,0],92:[.25,.75,0,0],93:[.25,.75,0,0],94:[0,.69444,0,0],95:[.31,.12056,.02778,0],96:[0,.69444,0,0],97:[0,.43056,0,0],98:[0,.69444,0,0],99:[0,.43056,0,0],100:[0,.69444,0,0],101:[0,.43056,0,0],102:[0,.69444,.07778,0],103:[.19444,.43056,.01389,0],104:[0,.69444,0,0],105:[0,.66786,0,0],106:[.19444,.66786,0,0],107:[0,.69444,0,0],108:[0,.69444,0,0],109:[0,.43056,0,0],110:[0,.43056,0,0],111:[0,.43056,0,0],112:[.19444,.43056,0,0],113:[.19444,.43056,0,0],114:[0,.43056,0,0],115:[0,.43056,0,0],116:[0,.61508,0,0],117:[0,.43056,0,0],118:[0,.43056,.01389,0],119:[0,.43056,.01389,0],120:[0,.43056,0,0],121:[.19444,.43056,.01389,0],122:[0,.43056,0,0],123:[.25,.75,0,0],124:[.25,.75,0,0],125:[.25,.75,0,0],126:[.35,.31786,0,0],160:[0,0,0,0],168:[0,.66786,0,0],172:[0,.43056,0,0],175:[0,.56778,0,0],176:[0,.69444,0,0],177:[.08333,.58333,0,0],180:[0,.69444,0,0],215:[.08333,.58333,0,0],247:[.08333,.58333,0,0],305:[0,.43056,0,0],567:[.19444,.43056,0,0],710:[0,.69444,0,0],711:[0,.62847,0,0],713:[0,.56778,0,0],714:[0,.69444,0,0],715:[0,.69444,0,0],728:[0,.69444,0,0],729:[0,.66786,0,0],730:[0,.69444,0,0],732:[0,.66786,0,0],768:[0,.69444,0,0],769:[0,.69444,0,0],770:[0,.69444,0,0],771:[0,.66786,0,0],772:[0,.56778,0,0],774:[0,.69444,0,0],775:[0,.66786,0,0],776:[0,.66786,0,0],778:[0,.69444,0,0],779:[0,.69444,0,0],780:[0,.62847,0,0],824:[.19444,.69444,0,0],915:[0,.68333,0,0],916:[0,.68333,0,0],920:[0,.68333,0,0],923:[0,.68333,0,0],926:[0,.68333,0,0],928:[0,.68333,0,0],931:[0,.68333,0,0],933:[0,.68333,0,0],934:[0,.68333,0,0],936:[0,.68333,0,0],937:[0,.68333,0,0],8211:[0,.43056,.02778,0],8212:[0,.43056,.02778,0],8216:[0,.69444,0,0],8217:[0,.69444,0,0],8220:[0,.69444,0,0],8221:[0,.69444,0,0],8224:[.19444,.69444,0,0],8225:[.19444,.69444,0,0],8230:[0,.12,0,0],8242:[0,.55556,0,0],8407:[0,.71444,.15382,0],8463:[0,.68889,0,0],8465:[0,.69444,0,0],8467:[0,.69444,0,.11111],8472:[.19444,.43056,0,.11111],8476:[0,.69444,0,0],8501:[0,.69444,0,0],8592:[-.13313,.36687,0,0],8593:[.19444,.69444,0,0],8594:[-.13313,.36687,0,0],8595:[.19444,.69444,0,0],8596:[-.13313,.36687,0,0],8597:[.25,.75,0,0],8598:[.19444,.69444,0,0],8599:[.19444,.69444,0,0],8600:[.19444,.69444,0,0],8601:[.19444,.69444,0,0],8614:[.011,.511,0,0],8617:[.011,.511,0,0],8618:[.011,.511,0,0],8636:[-.13313,.36687,0,0],8637:[-.13313,.36687,0,0],8640:[-.13313,.36687,0,0],8641:[-.13313,.36687,0,0],8652:[.011,.671,0,0],8656:[-.13313,.36687,0,0
],8657:[.19444,.69444,0,0],8658:[-.13313,.36687,0,0],8659:[.19444,.69444,0,0],8660:[-.13313,.36687,0,0],8661:[.25,.75,0,0],8704:[0,.69444,0,0],8706:[0,.69444,.05556,.08334],8707:[0,.69444,0,0],8709:[.05556,.75,0,0],8711:[0,.68333,0,0],8712:[.0391,.5391,0,0],8715:[.0391,.5391,0,0],8722:[.08333,.58333,0,0],8723:[.08333,.58333,0,0],8725:[.25,.75,0,0],8726:[.25,.75,0,0],8727:[-.03472,.46528,0,0],8728:[-.05555,.44445,0,0],8729:[-.05555,.44445,0,0],8730:[.2,.8,0,0],8733:[0,.43056,0,0],8734:[0,.43056,0,0],8736:[0,.69224,0,0],8739:[.25,.75,0,0],8741:[.25,.75,0,0],8743:[0,.55556,0,0],8744:[0,.55556,0,0],8745:[0,.55556,0,0],8746:[0,.55556,0,0],8747:[.19444,.69444,.11111,0],8764:[-.13313,.36687,0,0],8768:[.19444,.69444,0,0],8771:[-.03625,.46375,0,0],8773:[-.022,.589,0,0],8776:[-.01688,.48312,0,0],8781:[-.03625,.46375,0,0],8784:[-.133,.67,0,0],8800:[.215,.716,0,0],8801:[-.03625,.46375,0,0],8804:[.13597,.63597,0,0],8805:[.13597,.63597,0,0],8810:[.0391,.5391,0,0],8811:[.0391,.5391,0,0],8826:[.0391,.5391,0,0],8827:[.0391,.5391,0,0],8834:[.0391,.5391,0,0],8835:[.0391,.5391,0,0],8838:[.13597,.63597,0,0],8839:[.13597,.63597,0,0],8846:[0,.55556,0,0],8849:[.13597,.63597,0,0],8850:[.13597,.63597,0,0],8851:[0,.55556,0,0],8852:[0,.55556,0,0],8853:[.08333,.58333,0,0],8854:[.08333,.58333,0,0],8855:[.08333,.58333,0,0],8856:[.08333,.58333,0,0],8857:[.08333,.58333,0,0],8866:[0,.69444,0,0],8867:[0,.69444,0,0],8868:[0,.69444,0,0],8869:[0,.69444,0,0],8872:[.249,.75,0,0],8900:[-.05555,.44445,0,0],8901:[-.05555,.44445,0,0],8902:[-.03472,.46528,0,0],8904:[.005,.505,0,0],8942:[.03,.9,0,0],8943:[-.19,.31,0,0],8945:[-.1,.82,0,0],8968:[.25,.75,0,0],8969:[.25,.75,0,0],8970:[.25,.75,0,0],8971:[.25,.75,0,0],8994:[-.14236,.35764,0,0],8995:[-.14236,.35764,0,0],9136:[.244,.744,0,0],9137:[.244,.744,0,0],9651:[.19444,.69444,0,0],9657:[-.03472,.46528,0,0],9661:[.19444,.69444,0,0],9667:[-.03472,.46528,0,0],9711:[.19444,.69444,0,0],9824:[.12963,.69444,0,0],9825:[.12963,.69444,0,0],9826:[.12963,.69444,0,0],9827:[.12963,.69444,0,0],9837:[0,.75,0,0],9838:[.19444,.69444,0,0],9839:[.19444,.69444,0,0],10216:[.25,.75,0,0],10217:[.25,.75,0,0],10222:[.244,.744,0,0],10223:[.244,.744,0,0],10229:[.011,.511,0,0],10230:[.011,.511,0,0],10231:[.011,.511,0,0],10232:[.024,.525,0,0],10233:[.024,.525,0,0],10234:[.024,.525,0,0],10236:[.011,.511,0,0],10815:[0,.68333,0,0],10927:[.13597,.63597,0,0],10928:[.13597,.63597,0,0]},"Math-BoldItalic":{47:[.19444,.69444,0,0],65:[0,.68611,0,0],66:[0,.68611,.04835,0],67:[0,.68611,.06979,0],68:[0,.68611,.03194,0],69:[0,.68611,.05451,0],70:[0,.68611,.15972,0],71:[0,.68611,0,0],72:[0,.68611,.08229,0],73:[0,.68611,.07778,0],74:[0,.68611,.10069,0],75:[0,.68611,.06979,0],76:[0,.68611,0,0],77:[0,.68611,.11424,0],78:[0,.68611,.11424,0],79:[0,.68611,.03194,0],80:[0,.68611,.15972,0],81:[.19444,.68611,0,0],82:[0,.68611,.00421,0],83:[0,.68611,.05382,0],84:[0,.68611,.15972,0],85:[0,.68611,.11424,0],86:[0,.68611,.25555,0],87:[0,.68611,.15972,0],88:[0,.68611,.07778,0],89:[0,.68611,.25555,0],90:[0,.68611,.06979,0],97:[0,.44444,0,0],98:[0,.69444,0,0],99:[0,.44444,0,0],100:[0,.69444,0,0],101:[0,.44444,0,0],102:[.19444,.69444,.11042,0],103:[.19444,.44444,.03704,0],104:[0,.69444,0,0],105:[0,.69326,0,0],106:[.19444,.69326,.0622,0],107:[0,.69444,.01852,0],108:[0,.69444,.0088,0],109:[0,.44444,0,0],110:[0,.44444,0,0],111:[0,.44444,0,0],112:[.19444,.44444,0,0],113:[.19444,.44444,.03704,0],114:[0,.44444,.03194,0],115:[0,.44444,0,0],116:[0,.63492,0,0],117:[0,.44444,0,0],118:[0,.44444,.03704,0],119:[0,.44444,.02778,0],120:[0,.44444,0,0],121:[.19444,.
44444,.03704,0],122:[0,.44444,.04213,0],915:[0,.68611,.15972,0],916:[0,.68611,0,0],920:[0,.68611,.03194,0],923:[0,.68611,0,0],926:[0,.68611,.07458,0],928:[0,.68611,.08229,0],931:[0,.68611,.05451,0],933:[0,.68611,.15972,0],934:[0,.68611,0,0],936:[0,.68611,.11653,0],937:[0,.68611,.04835,0],945:[0,.44444,0,0],946:[.19444,.69444,.03403,0],947:[.19444,.44444,.06389,0],948:[0,.69444,.03819,0],949:[0,.44444,0,0],950:[.19444,.69444,.06215,0],951:[.19444,.44444,.03704,0],952:[0,.69444,.03194,0],953:[0,.44444,0,0],954:[0,.44444,0,0],955:[0,.69444,0,0],956:[.19444,.44444,0,0],957:[0,.44444,.06898,0],958:[.19444,.69444,.03021,0],959:[0,.44444,0,0],960:[0,.44444,.03704,0],961:[.19444,.44444,0,0],962:[.09722,.44444,.07917,0],963:[0,.44444,.03704,0],964:[0,.44444,.13472,0],965:[0,.44444,.03704,0],966:[.19444,.44444,0,0],967:[.19444,.44444,0,0],968:[.19444,.69444,.03704,0],969:[0,.44444,.03704,0],977:[0,.69444,0,0],981:[.19444,.69444,0,0],982:[0,.44444,.03194,0],1009:[.19444,.44444,0,0],1013:[0,.44444,0,0]},"Math-Italic":{47:[.19444,.69444,0,0],65:[0,.68333,0,.13889],66:[0,.68333,.05017,.08334],67:[0,.68333,.07153,.08334],68:[0,.68333,.02778,.05556],69:[0,.68333,.05764,.08334],70:[0,.68333,.13889,.08334],71:[0,.68333,0,.08334],72:[0,.68333,.08125,.05556],73:[0,.68333,.07847,.11111],74:[0,.68333,.09618,.16667],75:[0,.68333,.07153,.05556],76:[0,.68333,0,.02778],77:[0,.68333,.10903,.08334],78:[0,.68333,.10903,.08334],79:[0,.68333,.02778,.08334],80:[0,.68333,.13889,.08334],81:[.19444,.68333,0,.08334],82:[0,.68333,.00773,.08334],83:[0,.68333,.05764,.08334],84:[0,.68333,.13889,.08334],85:[0,.68333,.10903,.02778],86:[0,.68333,.22222,0],87:[0,.68333,.13889,0],88:[0,.68333,.07847,.08334],89:[0,.68333,.22222,0],90:[0,.68333,.07153,.08334],97:[0,.43056,0,0],98:[0,.69444,0,0],99:[0,.43056,0,.05556],100:[0,.69444,0,.16667],101:[0,.43056,0,.05556],102:[.19444,.69444,.10764,.16667],103:[.19444,.43056,.03588,.02778],104:[0,.69444,0,0],105:[0,.65952,0,0],106:[.19444,.65952,.05724,0],107:[0,.69444,.03148,0],108:[0,.69444,.01968,.08334],109:[0,.43056,0,0],110:[0,.43056,0,0],111:[0,.43056,0,.05556],112:[.19444,.43056,0,.08334],113:[.19444,.43056,.03588,.08334],114:[0,.43056,.02778,.05556],115:[0,.43056,0,.05556],116:[0,.61508,0,.08334],117:[0,.43056,0,.02778],118:[0,.43056,.03588,.02778],119:[0,.43056,.02691,.08334],120:[0,.43056,0,.02778],121:[.19444,.43056,.03588,.05556],122:[0,.43056,.04398,.05556],915:[0,.68333,.13889,.08334],916:[0,.68333,0,.16667],920:[0,.68333,.02778,.08334],923:[0,.68333,0,.16667],926:[0,.68333,.07569,.08334],928:[0,.68333,.08125,.05556],931:[0,.68333,.05764,.08334],933:[0,.68333,.13889,.05556],934:[0,.68333,0,.08334],936:[0,.68333,.11,.05556],937:[0,.68333,.05017,.08334],945:[0,.43056,.0037,.02778],946:[.19444,.69444,.05278,.08334],947:[.19444,.43056,.05556,0],948:[0,.69444,.03785,.05556],949:[0,.43056,0,.08334],950:[.19444,.69444,.07378,.08334],951:[.19444,.43056,.03588,.05556],952:[0,.69444,.02778,.08334],953:[0,.43056,0,.05556],954:[0,.43056,0,0],955:[0,.69444,0,0],956:[.19444,.43056,0,.02778],957:[0,.43056,.06366,.02778],958:[.19444,.69444,.04601,.11111],959:[0,.43056,0,.05556],960:[0,.43056,.03588,0],961:[.19444,.43056,0,.08334],962:[.09722,.43056,.07986,.08334],963:[0,.43056,.03588,0],964:[0,.43056,.1132,.02778],965:[0,.43056,.03588,.02778],966:[.19444,.43056,0,.08334],967:[.19444,.43056,0,.05556],968:[.19444,.69444,.03588,.11111],969:[0,.43056,.03588,0],977:[0,.69444,0,.08334],981:[.19444,.69444,0,.08334],982:[0,.43056,.02778,0],1009:[.19444,.43056,0,.08334],1013:[0,.43056,0,.05556]},"Math-Reg
ular":{65:[0,.68333,0,.13889],66:[0,.68333,.05017,.08334],67:[0,.68333,.07153,.08334],68:[0,.68333,.02778,.05556],69:[0,.68333,.05764,.08334],70:[0,.68333,.13889,.08334],71:[0,.68333,0,.08334],72:[0,.68333,.08125,.05556],73:[0,.68333,.07847,.11111],74:[0,.68333,.09618,.16667],75:[0,.68333,.07153,.05556],76:[0,.68333,0,.02778],77:[0,.68333,.10903,.08334],78:[0,.68333,.10903,.08334],79:[0,.68333,.02778,.08334],80:[0,.68333,.13889,.08334],81:[.19444,.68333,0,.08334],82:[0,.68333,.00773,.08334],83:[0,.68333,.05764,.08334],84:[0,.68333,.13889,.08334],85:[0,.68333,.10903,.02778],86:[0,.68333,.22222,0],87:[0,.68333,.13889,0],88:[0,.68333,.07847,.08334],89:[0,.68333,.22222,0],90:[0,.68333,.07153,.08334],97:[0,.43056,0,0],98:[0,.69444,0,0],99:[0,.43056,0,.05556],100:[0,.69444,0,.16667],101:[0,.43056,0,.05556],102:[.19444,.69444,.10764,.16667],103:[.19444,.43056,.03588,.02778],104:[0,.69444,0,0],105:[0,.65952,0,0],106:[.19444,.65952,.05724,0],107:[0,.69444,.03148,0],108:[0,.69444,.01968,.08334],109:[0,.43056,0,0],110:[0,.43056,0,0],111:[0,.43056,0,.05556],112:[.19444,.43056,0,.08334],113:[.19444,.43056,.03588,.08334],114:[0,.43056,.02778,.05556],115:[0,.43056,0,.05556],116:[0,.61508,0,.08334],117:[0,.43056,0,.02778],118:[0,.43056,.03588,.02778],119:[0,.43056,.02691,.08334],120:[0,.43056,0,.02778],121:[.19444,.43056,.03588,.05556],122:[0,.43056,.04398,.05556],915:[0,.68333,.13889,.08334],916:[0,.68333,0,.16667],920:[0,.68333,.02778,.08334],923:[0,.68333,0,.16667],926:[0,.68333,.07569,.08334],928:[0,.68333,.08125,.05556],931:[0,.68333,.05764,.08334],933:[0,.68333,.13889,.05556],934:[0,.68333,0,.08334],936:[0,.68333,.11,.05556],937:[0,.68333,.05017,.08334],945:[0,.43056,.0037,.02778],946:[.19444,.69444,.05278,.08334],947:[.19444,.43056,.05556,0],948:[0,.69444,.03785,.05556],949:[0,.43056,0,.08334],950:[.19444,.69444,.07378,.08334],951:[.19444,.43056,.03588,.05556],952:[0,.69444,.02778,.08334],953:[0,.43056,0,.05556],954:[0,.43056,0,0],955:[0,.69444,0,0],956:[.19444,.43056,0,.02778],957:[0,.43056,.06366,.02778],958:[.19444,.69444,.04601,.11111],959:[0,.43056,0,.05556],960:[0,.43056,.03588,0],961:[.19444,.43056,0,.08334],962:[.09722,.43056,.07986,.08334],963:[0,.43056,.03588,0],964:[0,.43056,.1132,.02778],965:[0,.43056,.03588,.02778],966:[.19444,.43056,0,.08334],967:[.19444,.43056,0,.05556],968:[.19444,.69444,.03588,.11111],969:[0,.43056,.03588,0],977:[0,.69444,0,.08334],981:[.19444,.69444,0,.08334],982:[0,.43056,.02778,0],1009:[.19444,.43056,0,.08334],1013:[0,.43056,0,.05556]},"SansSerif-Regular":{33:[0,.69444,0,0],34:[0,.69444,0,0],35:[.19444,.69444,0,0],36:[.05556,.75,0,0],37:[.05556,.75,0,0],38:[0,.69444,0,0],39:[0,.69444,0,0],40:[.25,.75,0,0],41:[.25,.75,0,0],42:[0,.75,0,0],43:[.08333,.58333,0,0],44:[.125,.08333,0,0],45:[0,.44444,0,0],46:[0,.08333,0,0],47:[.25,.75,0,0],48:[0,.65556,0,0],49:[0,.65556,0,0],50:[0,.65556,0,0],51:[0,.65556,0,0],52:[0,.65556,0,0],53:[0,.65556,0,0],54:[0,.65556,0,0],55:[0,.65556,0,0],56:[0,.65556,0,0],57:[0,.65556,0,0],58:[0,.44444,0,0],59:[.125,.44444,0,0],61:[-.13,.37,0,0],63:[0,.69444,0,0],64:[0,.69444,0,0],65:[0,.69444,0,0],66:[0,.69444,0,0],67:[0,.69444,0,0],68:[0,.69444,0,0],69:[0,.69444,0,0],70:[0,.69444,0,0],71:[0,.69444,0,0],72:[0,.69444,0,0],73:[0,.69444,0,0],74:[0,.69444,0,0],75:[0,.69444,0,0],76:[0,.69444,0,0],77:[0,.69444,0,0],78:[0,.69444,0,0],79:[0,.69444,0,0],80:[0,.69444,0,0],81:[.125,.69444,0,0],82:[0,.69444,0,0],83:[0,.69444,0,0],84:[0,.69444,0,0],85:[0,.69444,0,0],86:[0,.69444,.01389,0],87:[0,.69444,.01389,0],88:[0,.69444,0,0],89:[0,.69444,.025,0],90:[0,
.69444,0,0],91:[.25,.75,0,0],93:[.25,.75,0,0],94:[0,.69444,0,0],95:[.35,.09444,.02778,0],97:[0,.44444,0,0],98:[0,.69444,0,0],99:[0,.44444,0,0],100:[0,.69444,0,0],101:[0,.44444,0,0],102:[0,.69444,.06944,0],103:[.19444,.44444,.01389,0],104:[0,.69444,0,0],105:[0,.67937,0,0],106:[.19444,.67937,0,0],107:[0,.69444,0,0],108:[0,.69444,0,0],109:[0,.44444,0,0],110:[0,.44444,0,0],111:[0,.44444,0,0],112:[.19444,.44444,0,0],113:[.19444,.44444,0,0],114:[0,.44444,.01389,0],115:[0,.44444,0,0],116:[0,.57143,0,0],117:[0,.44444,0,0],118:[0,.44444,.01389,0],119:[0,.44444,.01389,0],120:[0,.44444,0,0],121:[.19444,.44444,.01389,0],122:[0,.44444,0,0],126:[.35,.32659,0,0],305:[0,.44444,0,0],567:[.19444,.44444,0,0],768:[0,.69444,0,0],769:[0,.69444,0,0],770:[0,.69444,0,0],771:[0,.67659,0,0],772:[0,.60889,0,0],774:[0,.69444,0,0],775:[0,.67937,0,0],776:[0,.67937,0,0],778:[0,.69444,0,0],779:[0,.69444,0,0],780:[0,.63194,0,0],915:[0,.69444,0,0],916:[0,.69444,0,0],920:[0,.69444,0,0],923:[0,.69444,0,0],926:[0,.69444,0,0],928:[0,.69444,0,0],931:[0,.69444,0,0],933:[0,.69444,0,0],934:[0,.69444,0,0],936:[0,.69444,0,0],937:[0,.69444,0,0],8211:[0,.44444,.02778,0],8212:[0,.44444,.02778,0],8216:[0,.69444,0,0],8217:[0,.69444,0,0],8220:[0,.69444,0,0],8221:[0,.69444,0,0]},"Script-Regular":{65:[0,.7,.22925,0],66:[0,.7,.04087,0],67:[0,.7,.1689,0],68:[0,.7,.09371,0],69:[0,.7,.18583,0],70:[0,.7,.13634,0],71:[0,.7,.17322,0],72:[0,.7,.29694,0],73:[0,.7,.19189,0],74:[.27778,.7,.19189,0],75:[0,.7,.31259,0],76:[0,.7,.19189,0],77:[0,.7,.15981,0],78:[0,.7,.3525,0],79:[0,.7,.08078,0],80:[0,.7,.08078,0],81:[0,.7,.03305,0],82:[0,.7,.06259,0],83:[0,.7,.19189,0],84:[0,.7,.29087,0],85:[0,.7,.25815,0],86:[0,.7,.27523,0],87:[0,.7,.27523,0],88:[0,.7,.26006,0],89:[0,.7,.2939,0],90:[0,.7,.24037,0]},"Size1-Regular":{40:[.35001,.85,0,0],41:[.35001,.85,0,0],47:[.35001,.85,0,0],91:[.35001,.85,0,0],92:[.35001,.85,0,0],93:[.35001,.85,0,0],123:[.35001,.85,0,0],125:[.35001,.85,0,0],710:[0,.72222,0,0],732:[0,.72222,0,0],770:[0,.72222,0,0],771:[0,.72222,0,0],8214:[-99e-5,.601,0,0],8593:[1e-5,.6,0,0],8595:[1e-5,.6,0,0],8657:[1e-5,.6,0,0],8659:[1e-5,.6,0,0],8719:[.25001,.75,0,0],8720:[.25001,.75,0,0],8721:[.25001,.75,0,0],8730:[.35001,.85,0,0],8739:[-.00599,.606,0,0],8741:[-.00599,.606,0,0],8747:[.30612,.805,.19445,0],8748:[.306,.805,.19445,0],8749:[.306,.805,.19445,0],8750:[.30612,.805,.19445,0],8896:[.25001,.75,0,0],8897:[.25001,.75,0,0],8898:[.25001,.75,0,0],8899:[.25001,.75,0,0],8968:[.35001,.85,0,0],8969:[.35001,.85,0,0],8970:[.35001,.85,0,0],8971:[.35001,.85,0,0],9168:[-99e-5,.601,0,0],10216:[.35001,.85,0,0],10217:[.35001,.85,0,0],10752:[.25001,.75,0,0],10753:[.25001,.75,0,0],10754:[.25001,.75,0,0],10756:[.25001,.75,0,0],10758:[.25001,.75,0,0]},"Size2-Regular":{40:[.65002,1.15,0,0],41:[.65002,1.15,0,0],47:[.65002,1.15,0,0],91:[.65002,1.15,0,0],92:[.65002,1.15,0,0],93:[.65002,1.15,0,0],123:[.65002,1.15,0,0],125:[.65002,1.15,0,0],710:[0,.75,0,0],732:[0,.75,0,0],770:[0,.75,0,0],771:[0,.75,0,0],8719:[.55001,1.05,0,0],8720:[.55001,1.05,0,0],8721:[.55001,1.05,0,0],8730:[.65002,1.15,0,0],8747:[.86225,1.36,.44445,0],8748:[.862,1.36,.44445,0],8749:[.862,1.36,.44445,0],8750:[.86225,1.36,.44445,0],8896:[.55001,1.05,0,0],8897:[.55001,1.05,0,0],8898:[.55001,1.05,0,0],8899:[.55001,1.05,0,0],8968:[.65002,1.15,0,0],8969:[.65002,1.15,0,0],8970:[.65002,1.15,0,0],8971:[.65002,1.15,0,0],10216:[.65002,1.15,0,0],10217:[.65002,1.15,0,0],10752:[.55001,1.05,0,0],10753:[.55001,1.05,0,0],10754:[.55001,1.05,0,0],10756:[.55001,1.05,0,0],10758:[.55001,1.05,0,0]},"Size3-Regular":{40:[.95003,1
.45,0,0],41:[.95003,1.45,0,0],47:[.95003,1.45,0,0],91:[.95003,1.45,0,0],92:[.95003,1.45,0,0],93:[.95003,1.45,0,0],123:[.95003,1.45,0,0],125:[.95003,1.45,0,0],710:[0,.75,0,0],732:[0,.75,0,0],770:[0,.75,0,0],771:[0,.75,0,0],8730:[.95003,1.45,0,0],8968:[.95003,1.45,0,0],8969:[.95003,1.45,0,0],8970:[.95003,1.45,0,0],8971:[.95003,1.45,0,0],10216:[.95003,1.45,0,0],10217:[.95003,1.45,0,0]},"Size4-Regular":{40:[1.25003,1.75,0,0],41:[1.25003,1.75,0,0],47:[1.25003,1.75,0,0],91:[1.25003,1.75,0,0],92:[1.25003,1.75,0,0],93:[1.25003,1.75,0,0],123:[1.25003,1.75,0,0],125:[1.25003,1.75,0,0],710:[0,.825,0,0],732:[0,.825,0,0],770:[0,.825,0,0],771:[0,.825,0,0],8730:[1.25003,1.75,0,0],8968:[1.25003,1.75,0,0],8969:[1.25003,1.75,0,0],8970:[1.25003,1.75,0,0],8971:[1.25003,1.75,0,0],9115:[.64502,1.155,0,0],9116:[1e-5,.6,0,0],9117:[.64502,1.155,0,0],9118:[.64502,1.155,0,0],9119:[1e-5,.6,0,0],9120:[.64502,1.155,0,0],9121:[.64502,1.155,0,0],9122:[-99e-5,.601,0,0],9123:[.64502,1.155,0,0],9124:[.64502,1.155,0,0],9125:[-99e-5,.601,0,0],9126:[.64502,1.155,0,0],9127:[1e-5,.9,0,0],9128:[.65002,1.15,0,0],9129:[.90001,0,0,0],9130:[0,.3,0,0],9131:[1e-5,.9,0,0],9132:[.65002,1.15,0,0],9133:[.90001,0,0,0],9143:[.88502,.915,0,0],10216:[1.25003,1.75,0,0],10217:[1.25003,1.75,0,0],57344:[-.00499,.605,0,0],57345:[-.00499,.605,0,0],57680:[0,.12,0,0],57681:[0,.12,0,0],57682:[0,.12,0,0],57683:[0,.12,0,0]},"Typewriter-Regular":{33:[0,.61111,0,0],34:[0,.61111,0,0],35:[0,.61111,0,0],36:[.08333,.69444,0,0],37:[.08333,.69444,0,0],38:[0,.61111,0,0],39:[0,.61111,0,0],40:[.08333,.69444,0,0],41:[.08333,.69444,0,0],42:[0,.52083,0,0],43:[-.08056,.53055,0,0],44:[.13889,.125,0,0],45:[-.08056,.53055,0,0],46:[0,.125,0,0],47:[.08333,.69444,0,0],48:[0,.61111,0,0],49:[0,.61111,0,0],50:[0,.61111,0,0],51:[0,.61111,0,0],52:[0,.61111,0,0],53:[0,.61111,0,0],54:[0,.61111,0,0],55:[0,.61111,0,0],56:[0,.61111,0,0],57:[0,.61111,0,0],58:[0,.43056,0,0],59:[.13889,.43056,0,0],60:[-.05556,.55556,0,0],61:[-.19549,.41562,0,0],62:[-.05556,.55556,0,0],63:[0,.61111,0,0],64:[0,.61111,0,0],65:[0,.61111,0,0],66:[0,.61111,0,0],67:[0,.61111,0,0],68:[0,.61111,0,0],69:[0,.61111,0,0],70:[0,.61111,0,0],71:[0,.61111,0,0],72:[0,.61111,0,0],73:[0,.61111,0,0],74:[0,.61111,0,0],75:[0,.61111,0,0],76:[0,.61111,0,0],77:[0,.61111,0,0],78:[0,.61111,0,0],79:[0,.61111,0,0],80:[0,.61111,0,0],81:[.13889,.61111,0,0],82:[0,.61111,0,0],83:[0,.61111,0,0],84:[0,.61111,0,0],85:[0,.61111,0,0],86:[0,.61111,0,0],87:[0,.61111,0,0],88:[0,.61111,0,0],89:[0,.61111,0,0],90:[0,.61111,0,0],91:[.08333,.69444,0,0],92:[.08333,.69444,0,0],93:[.08333,.69444,0,0],94:[0,.61111,0,0],95:[.09514,0,0,0],96:[0,.61111,0,0],97:[0,.43056,0,0],98:[0,.61111,0,0],99:[0,.43056,0,0],100:[0,.61111,0,0],101:[0,.43056,0,0],102:[0,.61111,0,0],103:[.22222,.43056,0,0],104:[0,.61111,0,0],105:[0,.61111,0,0],106:[.22222,.61111,0,0],107:[0,.61111,0,0],108:[0,.61111,0,0],109:[0,.43056,0,0],110:[0,.43056,0,0],111:[0,.43056,0,0],112:[.22222,.43056,0,0],113:[.22222,.43056,0,0],114:[0,.43056,0,0],115:[0,.43056,0,0],116:[0,.55358,0,0],117:[0,.43056,0,0],118:[0,.43056,0,0],119:[0,.43056,0,0],120:[0,.43056,0,0],121:[.22222,.43056,0,0],122:[0,.43056,0,0],123:[.08333,.69444,0,0],124:[.08333,.69444,0,0],125:[.08333,.69444,0,0],126:[0,.61111,0,0],127:[0,.61111,0,0],305:[0,.43056,0,0],567:[.22222,.43056,0,0],768:[0,.61111,0,0],769:[0,.61111,0,0],770:[0,.61111,0,0],771:[0,.61111,0,0],772:[0,.56555,0,0],774:[0,.61111,0,0],776:[0,.61111,0,0],778:[0,.61111,0,0],780:[0,.56597,0,0],915:[0,.61111,0,0],916:[0,.61111,0,0],920:[0,.61111,0,0],923:[0,.61111,0,0],926
:[0,.61111,0,0],928:[0,.61111,0,0],931:[0,.61111,0,0],933:[0,.61111,0,0],934:[0,.61111,0,0],936:[0,.61111,0,0],937:[0,.61111,0,0],2018:[0,.61111,0,0],2019:[0,.61111,0,0],8242:[0,.61111,0,0]}}},{}],43:[function(e,t){function n(e){return e&&e.__esModule?e:{"default":e}}function r(e,n,r){"string"==typeof e&&(e=[e]),"number"==typeof n&&(n={numArgs:n});for(var i={numArgs:n.numArgs,argTypes:n.argTypes,greediness:n.greediness===undefined?1:n.greediness,allowedInText:!!n.allowedInText,allowedInMath:n.allowedInMath,numOptionalArgs:n.numOptionalArgs||0,infix:!!n.infix,handler:r},a=0;a","\\langle","\\rangle","\\lt","\\gt","\\lvert","\\rvert","\\lVert","\\rVert","\\lgroup","\\rgroup","\\lmoustache","\\rmoustache","/","\\backslash","|","\\vert","\\|","\\Vert","\\uparrow","\\Uparrow","\\downarrow","\\Downarrow","\\updownarrow","\\Updownarrow","."],c={"\\Bbb":"\\mathbb","\\bold":"\\mathbf","\\frak":"\\mathfrak"};r(["\\blue","\\orange","\\pink","\\red","\\green","\\gray","\\purple","\\blueA","\\blueB","\\blueC","\\blueD","\\blueE","\\tealA","\\tealB","\\tealC","\\tealD","\\tealE","\\greenA","\\greenB","\\greenC","\\greenD","\\greenE","\\goldA","\\goldB","\\goldC","\\goldD","\\goldE","\\redA","\\redB","\\redC","\\redD","\\redE","\\maroonA","\\maroonB","\\maroonC","\\maroonD","\\maroonE","\\purpleA","\\purpleB","\\purpleC","\\purpleD","\\purpleE","\\mintA","\\mintB","\\mintC","\\grayA","\\grayB","\\grayC","\\grayD","\\grayE","\\grayF","\\grayG","\\grayH","\\grayI","\\kaBlue","\\kaGreen"],{numArgs:1,allowedInText:!0,greediness:3},function(e,t){var n=t[0];return{type:"color",color:"katex-"+e.funcName.slice(1),value:s(n)}}),r(["\\arcsin","\\arccos","\\arctan","\\arctg","\\arcctg","\\arg","\\ch","\\cos","\\cosec","\\cosh","\\cot","\\cotg","\\coth","\\csc","\\ctg","\\cth","\\deg","\\dim","\\exp","\\hom","\\ker","\\lg","\\ln","\\log","\\sec","\\sin","\\sinh","\\sh","\\tan","\\tanh","\\tg","\\th"],{numArgs:0},function(e){return{type:"op",limits:!1,symbol:!1,body:e.funcName}}),r(["\\det","\\gcd","\\inf","\\lim","\\liminf","\\limsup","\\max","\\min","\\Pr","\\sup"],{numArgs:0},function(e){return{type:"op",limits:!0,symbol:!1,body:e.funcName}}),r(["\\int","\\iint","\\iiint","\\oint"],{numArgs:0},function(e){return{type:"op",limits:!1,symbol:!0,body:e.funcName}}),r(["\\coprod","\\bigvee","\\bigwedge","\\biguplus","\\bigcap","\\bigcup","\\intop","\\prod","\\sum","\\bigotimes","\\bigoplus","\\bigodot","\\bigsqcup","\\smallint"],{numArgs:0},function(e){return{type:"op",limits:!0,symbol:!0,body:e.funcName}}),r("\\mathop",{numArgs:1},function(e,t){var n=t[0];return{type:"op",limits:!1,symbol:!1,value:s(n)}}),r(["\\dfrac","\\frac","\\tfrac","\\dbinom","\\binom","\\tbinom","\\\\atopfrac"],{numArgs:2,greediness:2},function(e,t){var n=t[0],r=t[1],i=void 0,a=null,o=null,s="auto";switch(e.funcName){case"\\dfrac":case"\\frac":case"\\tfrac":i=!0;break;case"\\\\atopfrac":i=!1;break;case"\\dbinom":case"\\binom":case"\\tbinom":i=!1,a="(",o=")";break;default:throw new Error("Unrecognized genfrac command")}switch(e.funcName){case"\\dfrac":case"\\dbinom":s="display";break;case"\\tfrac":case"\\tbinom":s="text"}return{type:"genfrac",numer:n,denom:r,hasBarLine:i,leftDelim:a,rightDelim:o,size:s}}),r(["\\llap","\\rlap"],{numArgs:1,allowedInText:!0},function(e,t){var n=t[0];return{type:e.funcName.slice(1),body:n}});var h=function(e,t){if(i["default"].contains(d,e.value))return e;throw new a["default"]("Invalid delimiter: '"+e.value+"' after 
'"+t.funcName+"'",e)};r(["\\bigl","\\Bigl","\\biggl","\\Biggl","\\bigr","\\Bigr","\\biggr","\\Biggr","\\bigm","\\Bigm","\\biggm","\\Biggm","\\big","\\Big","\\bigg","\\Bigg"],{numArgs:1},function(e,t){var n=h(t[0],e);return{type:"delimsizing",size:u[e.funcName].size,mclass:u[e.funcName].mclass,value:n.value}}),r(["\\left","\\right"],{numArgs:1},function(e,t){return{type:"leftright",value:h(t[0],e).value}}),r("\\middle",{numArgs:1},function(e,t){var n=h(t[0],e);if(!e.parser.leftrightDepth)throw new a["default"]("\\middle without preceding \\left",n);return{type:"middle",value:n.value}}),r(["\\tiny","\\scriptsize","\\footnotesize","\\small","\\normalsize","\\large","\\Large","\\LARGE","\\huge","\\Huge"],0,null),r(["\\displaystyle","\\textstyle","\\scriptstyle","\\scriptscriptstyle"],0,null),r(["\\rm","\\sf","\\tt","\\bf","\\it"],0,null),r(["\\mathrm","\\mathit","\\mathbf","\\mathbb","\\mathcal","\\mathfrak","\\mathscr","\\mathsf","\\mathtt","\\Bbb","\\bold","\\frak"],{numArgs:1,greediness:2},function(e,t){var n=t[0],r=e.funcName;return r in c&&(r=c[r]),{type:"font",font:r.slice(1),body:n}}),r(["\\acute","\\grave","\\ddot","\\tilde","\\bar","\\breve","\\check","\\hat","\\vec","\\dot","\\widehat","\\widetilde","\\overrightarrow","\\overleftarrow","\\Overrightarrow","\\overleftrightarrow","\\overgroup","\\overlinesegment","\\overleftharpoon","\\overrightharpoon"],{numArgs:1},function(e,t){var n=t[0],r=!i["default"].contains(["\\acute","\\grave","\\ddot","\\tilde","\\bar","\\breve","\\check","\\hat","\\vec","\\dot"],e.funcName),a=!r||i["default"].contains(["\\widehat","\\widetilde"],e.funcName);return{type:"accent",label:e.funcName,isStretchy:r,isShifty:a,value:s(n),base:n}}),r(["\\'","\\`","\\^","\\~","\\=","\\u","\\.",'\\"',"\\r","\\H","\\v"],{numArgs:1,allowedInText:!0,allowedInMath:!1},function(e,t){var n=t[0];return{type:"accent",label:e.funcName,isStretchy:!1,isShifty:!0,value:s(n),base:n}}),r(["\\overbrace","\\underbrace"],{numArgs:1},function(e,t){var n=t[0];return{type:"horizBrace",label:e.funcName,isOver:/^\\over/.test(e.funcName),base:n}}),r(["\\underleftarrow","\\underrightarrow","\\underleftrightarrow","\\undergroup","\\underlinesegment","\\undertilde"],{numArgs:1},function(e,t){var n=t[0];return{type:"accentUnder",label:e.funcName,value:s(n),body:n}}),r(["\\xleftarrow","\\xrightarrow","\\xLeftarrow","\\xRightarrow","\\xleftrightarrow","\\xLeftrightarrow","\\xhookleftarrow","\\xhookrightarrow","\\xmapsto","\\xrightharpoondown","\\xrightharpoonup","\\xleftharpoondown","\\xleftharpoonup","\\xrightleftharpoons","\\xleftrightharpoons","\\xLongequal","\\xtwoheadrightarrow","\\xtwoheadleftarrow","\\xLongequal","\\xtofrom"],{numArgs:1,numOptionalArgs:1},function(e,t){var n=t[0],r=t[1];return{type:"xArrow",label:e.funcName,body:r,below:n}}),r(["\\cancel","\\bcancel","\\xcancel","\\sout","\\fbox"],{numArgs:1},function(e,t){var n=t[0];return{type:"enclose",label:e.funcName,body:n}}),r(["\\over","\\choose","\\atop"],{numArgs:0,infix:!0},function(e){var t=void 0;switch(e.funcName){case"\\over":t="\\frac";break;case"\\choose":t="\\binom";break;case"\\atop":t="\\\\atopfrac";break;default:throw new Error("Unrecognized infix genfrac command")}return{type:"infix",replaceWith:t,token:e.token}}),r(["\\\\","\\cr"],{numArgs:0,numOptionalArgs:1,argTypes:["size"]},function(e,t){return{type:"cr",size:t[0]}}),r(["\\begin","\\end"],{numArgs:1,argTypes:["text"]},function(e,t){var n=t[0];if("ordgroup"!==n.type)throw new a["default"]("Invalid environment name",n);for(var 
r="",i=0;i"}}]),e}(),s=function(){function e(t){(0,r["default"])(this,e),this.text=t}return(0,i["default"])(e,[{key:"toNode",value:function(){return document.createTextNode(this.text)}},{key:"toMarkup",value:function(){return a["default"].escape(this.text)}}]),e}();t.exports={MathNode:o,TextNode:s}},{"./utils":51,"babel-runtime/helpers/classCallCheck":4,"babel-runtime/helpers/createClass":5}],46:[function(e,t){function n(e){return e&&e.__esModule?e:{"default":e}}var r=n(e("./Parser")),i=function(e,t){if(!("string"==typeof e||e instanceof String))throw new TypeError("KaTeX can only parse string typed expression");return new r["default"](e,t).parse()};t.exports=i},{"./Parser":31}],47:[function(e,t){var n=e("./buildCommon"),r=e("./mathMLTree"),i=e("./utils"),a={widehat:"^",widetilde:"~",undertilde:"~",overleftarrow:"\u2190",underleftarrow:"\u2190",xleftarrow:"\u2190",overrightarrow:"\u2192",underrightarrow:"\u2192",xrightarrow:"\u2192",underbrace:"\u23b5",overbrace:"\u23de",overleftrightarrow:"\u2194",underleftrightarrow:"\u2194",xleftrightarrow:"\u2194",Overrightarrow:"\u21d2",xRightarrow:"\u21d2",overleftharpoon:"\u21bc",xleftharpoonup:"\u21bc",overrightharpoon:"\u21c0",xrightharpoonup:"\u21c0",xLeftarrow:"\u21d0",xLeftrightarrow:"\u21d4",xhookleftarrow:"\u21a9",xhookrightarrow:"\u21aa",xmapsto:"\u21a6",xrightharpoondown:"\u21c1",xleftharpoondown:"\u21bd",xrightleftharpoons:"\u21cc",xleftrightharpoons:"\u21cb",xtwoheadleftarrow:"\u219e",xtwoheadrightarrow:"\u21a0",xLongequal:"=",xtofrom:"\u21c4"},o=function(e){var t=new r.MathNode("mo",[new r.TextNode(a[e.substr(1)])]);return t.setAttribute("stretchy","true"),t},s={overleftarrow:[.522,0,"leftarrow",.5],underleftarrow:[.522,0,"leftarrow",.5],xleftarrow:[.261,.261,"leftarrow",.783],overrightarrow:[.522,0,"rightarrow",.5],underrightarrow:[.522,0,"rightarrow",.5],xrightarrow:[.261,.261,"rightarrow",.783],overbrace:[.548,0,"overbrace",1.6],underbrace:[.548,0,"underbrace",1.6],overleftrightarrow:[.522,0,"leftrightarrow",.5],underleftrightarrow:[.522,0,"leftrightarrow",.5],xleftrightarrow:[.261,.261,"leftrightarrow",.783],Overrightarrow:[.56,0,"doublerightarrow",.5],xLeftarrow:[.28,.28,"doubleleftarrow",.783],xRightarrow:[.28,.28,"doublerightarrow",.783],xLeftrightarrow:[.28,.28,"doubleleftrightarrow",.955],overleftharpoon:[.522,0,"leftharpoon",.5],overrightharpoon:[.522,0,"rightharpoon",.5],xleftharpoonup:[.261,.261,"leftharpoon",.783],xrightharpoonup:[.261,.261,"rightharpoon",.783],xhookleftarrow:[.261,.261,"hookleftarrow",.87],xhookrightarrow:[.261,.261,"hookrightarrow",.87],overlinesegment:[.414,0,"linesegment",.5],underlinesegment:[.414,0,"linesegment",.5],xmapsto:[.261,.261,"mapsto",.783],xrightharpoondown:[.261,.261,"rightharpoondown",.783],xleftharpoondown:[.261,.261,"leftharpoondown",.783],xrightleftharpoons:[.358,.358,"rightleftharpoons",.716],xleftrightharpoons:[.358,.358,"leftrightharpoons",.716],overgroup:[.342,0,"overgroup",.87],undergroup:[.342,0,"undergroup",.87],xtwoheadleftarrow:[.167,.167,"twoheadleftarrow",.86],xtwoheadrightarrow:[.167,.167,"twoheadrightarrow",.86],xLongequal:[.167,.167,"longequal",.5],xtofrom:[.264,.264,"tofrom",.86]},l={doubleleftarrow:"",doublerightarrow:"",leftarrow:"",rightarrow:""},u={bcancel:"",cancel:"",doubleleftarrow:">"+l.doubleleftarrow+"",doubleleftrightarrow:">"+l.doubleleftarrow+"\n"+l.doublerightarrow+"",doublerightarrow:">"+l.doublerightarrow+"",hookleftarrow:">"+l.leftarrow+"\n",hookrightarrow:">"+l.rightarrow+"",leftarrow:">"+l.leftarrow+"",leftharpoon:">",leftharpoondown:">",leftrightarrow:">
"+l.leftarrow+"\n"+l.rightarrow+"",leftrightharpoons:">\n",linesegment:">\n",longequal:" viewBox='0 0 100 334' preserveAspectRatio='none'>\n",mapsto:">"+l.rightarrow+"",overbrace:">\n",overgroup:">",rightarrow:">"+l.rightarrow+"",rightharpoon:">",rightharpoondown:">",rightleftharpoons:">",tilde1:" viewBox='0 0 600 260' preserveAspectRatio='none'>\n",tilde2:" viewBox='0 0 1033 286' preserveAspectRatio='none'>\n",tilde3:" viewBox='0 0 2339 306' preserveAspectRatio='none'>\n",tilde4:" viewBox='0 0 2340 312' preserveAspectRatio='none'>\n",tofrom:">",twoheadleftarrow:">\n",twoheadrightarrow:">\n",underbrace:">\n",undergroup:">",widehat1:" viewBox='0 0 1062 239' preserveAspectRatio='none'>\n",widehat2:" viewBox='0 0 2364 300' preserveAspectRatio='none'>\n",widehat3:" viewBox='0 0 2364 360' preserveAspectRatio='none'>\n",widehat4:" viewBox='0 0 2364 420' preserveAspectRatio='none'>\n",xcancel:"\n"},d=function(e,t){var r=e.value.label.substr(1),a=0,o=0,l="",d=0;if(i.contains(["widehat","widetilde","undertilde"],r)){var c=e.value.value.length;if(c>5)a=.312,l=("widehat"===r?"widehat":"tilde")+"4";else{var h=[1,1,2,2,3,3][c];"widehat"===r?(a=[0,.24,.3,.3,.36,.36][c],l="widehat"+h):(a=[0,.26,.3,.3,.34,.34][c],l="tilde"+h)}}else{var p=s[r];a=p[0],o=p[1],l=p[2],d=p[3]}var f=n.makeSpan([],[],t);f.height=a,f.depth=o;var m=a+o;return f.style.height=m+"em",d>0&&(f.style.minWidth=d+"em"),f.innerHTML="",f},c=function(e,t,r,i){var a=void 0,o=e.height+e.depth+2*r;return"fbox"===t?(a=n.makeSpan(["stretchy",t],[],i),i.color&&(a.style.borderColor=i.getColor())):(a=n.makeSpan([],[],i)).innerHTML=""+u[t]+"",a.height=o,a.style.height=o+"em",a};t.exports={encloseSpan:c,mathMLnode:o,svgSpan:d}},{"./buildCommon":34,"./mathMLTree":45,"./utils":51}],48:[function(e,t){function n(e,n,r,i,a,o){t.exports[e][a]={font:n,group:r,replace:i},o&&(t.exports[e][i]=t.exports[e][a])}t.exports={math:{},text:{}};var r="math",i="text",a="main",o="ams",s="accent",l="bin",u="close",d="inner",c="mathord",h="op",p="open",f="punct",m="rel",g="spacing",v="textord";n(r,a,m,"\u2261","\\equiv"),n(r,a,m,"\u227a","\\prec"),n(r,a,m,"\u227b","\\succ"),n(r,a,m,"\u223c","\\sim"),n(r,a,m,"\u22a5","\\perp"),n(r,a,m,"\u2aaf","\\preceq"),n(r,a,m,"\u2ab0","\\succeq"),n(r,a,m,"\u2243","\\simeq"),n(r,a,m,"\u2223","\\mid"),n(r,a,m,"\u226a","\\ll"),n(r,a,m,"\u226b","\\gg"),n(r,a,m,"\u224d","\\asymp"),n(r,a,m,"\u2225","\\parallel"),n(r,a,m,"\u22c8","\\bowtie"),n(r,a,m,"\u2323","\\smile"),n(r,a,m,"\u2291","\\sqsubseteq"),n(r,a,m,"\u2292","\\sqsupseteq"),n(r,a,m,"\u2250","\\doteq"),n(r,a,m,"\u2322","\\frown"),n(r,a,m,"\u220b","\\ni"),n(r,a,m,"\u221d","\\propto"),n(r,a,m,"\u22a2","\\vdash"),n(r,a,m,"\u22a3","\\dashv"), 
+n(r,a,m,"\u220b","\\owns"),n(r,a,f,".","\\ldotp"),n(r,a,f,"\u22c5","\\cdotp"),n(r,a,v,"#","\\#"),n(i,a,v,"#","\\#"),n(r,a,v,"&","\\&"),n(i,a,v,"&","\\&"),n(r,a,v,"\u2135","\\aleph"),n(r,a,v,"\u2200","\\forall"),n(r,a,v,"\u210f","\\hbar"),n(r,a,v,"\u2203","\\exists"),n(r,a,v,"\u2207","\\nabla"),n(r,a,v,"\u266d","\\flat"),n(r,a,v,"\u2113","\\ell"),n(r,a,v,"\u266e","\\natural"),n(r,a,v,"\u2663","\\clubsuit"),n(r,a,v,"\u2118","\\wp"),n(r,a,v,"\u266f","\\sharp"),n(r,a,v,"\u2662","\\diamondsuit"),n(r,a,v,"\u211c","\\Re"),n(r,a,v,"\u2661","\\heartsuit"),n(r,a,v,"\u2111","\\Im"),n(r,a,v,"\u2660","\\spadesuit"),n(r,a,v,"\u2020","\\dag"),n(i,a,v,"\u2020","\\dag"),n(i,a,v,"\u2020","\\textdagger"),n(r,a,v,"\u2021","\\ddag"),n(i,a,v,"\u2021","\\ddag"),n(i,a,v,"\u2020","\\textdaggerdbl"),n(r,a,u,"\u23b1","\\rmoustache"),n(r,a,p,"\u23b0","\\lmoustache"),n(r,a,u,"\u27ef","\\rgroup"),n(r,a,p,"\u27ee","\\lgroup"),n(r,a,l,"\u2213","\\mp"),n(r,a,l,"\u2296","\\ominus"),n(r,a,l,"\u228e","\\uplus"),n(r,a,l,"\u2293","\\sqcap"),n(r,a,l,"\u2217","\\ast"),n(r,a,l,"\u2294","\\sqcup"),n(r,a,l,"\u25ef","\\bigcirc"),n(r,a,l,"\u2219","\\bullet"),n(r,a,l,"\u2021","\\ddagger"),n(r,a,l,"\u2240","\\wr"),n(r,a,l,"\u2a3f","\\amalg"),n(r,a,m,"\u27f5","\\longleftarrow"),n(r,a,m,"\u21d0","\\Leftarrow"),n(r,a,m,"\u27f8","\\Longleftarrow"),n(r,a,m,"\u27f6","\\longrightarrow"),n(r,a,m,"\u21d2","\\Rightarrow"),n(r,a,m,"\u27f9","\\Longrightarrow"),n(r,a,m,"\u2194","\\leftrightarrow"),n(r,a,m,"\u27f7","\\longleftrightarrow"),n(r,a,m,"\u21d4","\\Leftrightarrow"),n(r,a,m,"\u27fa","\\Longleftrightarrow"),n(r,a,m,"\u21a6","\\mapsto"),n(r,a,m,"\u27fc","\\longmapsto"),n(r,a,m,"\u2197","\\nearrow"),n(r,a,m,"\u21a9","\\hookleftarrow"),n(r,a,m,"\u21aa","\\hookrightarrow"),n(r,a,m,"\u2198","\\searrow"),n(r,a,m,"\u21bc","\\leftharpoonup"),n(r,a,m,"\u21c0","\\rightharpoonup"),n(r,a,m,"\u2199","\\swarrow"),n(r,a,m,"\u21bd","\\leftharpoondown"),n(r,a,m,"\u21c1","\\rightharpoondown"),n(r,a,m,"\u2196","\\nwarrow"),n(r,a,m,"\u21cc","\\rightleftharpoons"),n(r,o,m,"\u226e","\\nless"),n(r,o,m,"\ue010","\\nleqslant"),n(r,o,m,"\ue011","\\nleqq"),n(r,o,m,"\u2a87","\\lneq"),n(r,o,m,"\u2268","\\lneqq"),n(r,o,m,"\ue00c","\\lvertneqq"),n(r,o,m,"\u22e6","\\lnsim"),n(r,o,m,"\u2a89","\\lnapprox"),n(r,o,m,"\u2280","\\nprec"),n(r,o,m,"\u22e0","\\npreceq"),n(r,o,m,"\u22e8","\\precnsim"),n(r,o,m,"\u2ab9","\\precnapprox"),n(r,o,m,"\u2241","\\nsim"),n(r,o,m,"\ue006","\\nshortmid"),n(r,o,m,"\u2224","\\nmid"),n(r,o,m,"\u22ac","\\nvdash"),n(r,o,m,"\u22ad","\\nvDash"),n(r,o,m,"\u22ea","\\ntriangleleft"),n(r,o,m,"\u22ec","\\ntrianglelefteq"),n(r,o,m,"\u228a","\\subsetneq"),n(r,o,m,"\ue01a","\\varsubsetneq"),n(r,o,m,"\u2acb","\\subsetneqq"),n(r,o,m,"\ue017","\\varsubsetneqq"),n(r,o,m,"\u226f","\\ngtr"),n(r,o,m,"\ue00f","\\ngeqslant"),n(r,o,m,"\ue00e","\\ngeqq"),n(r,o,m,"\u2a88","\\gneq"),n(r,o,m,"\u2269","\\gneqq"),n(r,o,m,"\ue00d","\\gvertneqq"),n(r,o,m,"\u22e7","\\gnsim"),n(r,o,m,"\u2a8a","\\gnapprox"),n(r,o,m,"\u2281","\\nsucc"),n(r,o,m,"\u22e1","\\nsucceq"),n(r,o,m,"\u22e9","\\succnsim"),n(r,o,m,"\u2aba","\\succnapprox"),n(r,o,m,"\u2246","\\ncong"),n(r,o,m,"\ue007","\\nshortparallel"),n(r,o,m,"\u2226","\\nparallel"),n(r,o,m,"\u22af","\\nVDash"),n(r,o,m,"\u22eb","\\ntriangleright"),n(r,o,m,"\u22ed","\\ntrianglerighteq"),n(r,o,m,"\ue018","\\nsupseteqq"),n(r,o,m,"\u228b","\\supsetneq"),n(r,o,m,"\ue01b","\\varsupsetneq"),n(r,o,m,"\u2acc","\\supsetneqq"),n(r,o,m,"\ue019","\\varsupsetneqq"),n(r,o,m,"\u22ae","\\nVdash"),n(r,o,m,"\u2ab5","\\precneqq"),n(r,o,m,"\u2ab6","\\succneqq"
),n(r,o,m,"\ue016","\\nsubseteqq"),n(r,o,l,"\u22b4","\\unlhd"),n(r,o,l,"\u22b5","\\unrhd"),n(r,o,m,"\u219a","\\nleftarrow"),n(r,o,m,"\u219b","\\nrightarrow"),n(r,o,m,"\u21cd","\\nLeftarrow"),n(r,o,m,"\u21cf","\\nRightarrow"),n(r,o,m,"\u21ae","\\nleftrightarrow"),n(r,o,m,"\u21ce","\\nLeftrightarrow"),n(r,o,m,"\u25b3","\\vartriangle"),n(r,o,v,"\u210f","\\hslash"),n(r,o,v,"\u25bd","\\triangledown"),n(r,o,v,"\u25ca","\\lozenge"),n(r,o,v,"\u24c8","\\circledS"),n(r,o,v,"\xae","\\circledR"),n(i,o,v,"\xae","\\circledR"),n(r,o,v,"\u2221","\\measuredangle"),n(r,o,v,"\u2204","\\nexists"),n(r,o,v,"\u2127","\\mho"),n(r,o,v,"\u2132","\\Finv"),n(r,o,v,"\u2141","\\Game"),n(r,o,v,"k","\\Bbbk"),n(r,o,v,"\u2035","\\backprime"),n(r,o,v,"\u25b2","\\blacktriangle"),n(r,o,v,"\u25bc","\\blacktriangledown"),n(r,o,v,"\u25a0","\\blacksquare"),n(r,o,v,"\u29eb","\\blacklozenge"),n(r,o,v,"\u2605","\\bigstar"),n(r,o,v,"\u2222","\\sphericalangle"),n(r,o,v,"\u2201","\\complement"),n(r,o,v,"\xf0","\\eth"),n(r,o,v,"\u2571","\\diagup"),n(r,o,v,"\u2572","\\diagdown"),n(r,o,v,"\u25a1","\\square"),n(r,o,v,"\u25a1","\\Box"),n(r,o,v,"\u25ca","\\Diamond"),n(r,o,v,"\xa5","\\yen"),n(r,o,v,"\u2713","\\checkmark"),n(i,o,v,"\u2713","\\checkmark"),n(r,o,v,"\u2136","\\beth"),n(r,o,v,"\u2138","\\daleth"),n(r,o,v,"\u2137","\\gimel"),n(r,o,v,"\u03dd","\\digamma"),n(r,o,v,"\u03f0","\\varkappa"),n(r,o,p,"\u250c","\\ulcorner"),n(r,o,u,"\u2510","\\urcorner"),n(r,o,p,"\u2514","\\llcorner"),n(r,o,u,"\u2518","\\lrcorner"),n(r,o,m,"\u2266","\\leqq"),n(r,o,m,"\u2a7d","\\leqslant"),n(r,o,m,"\u2a95","\\eqslantless"),n(r,o,m,"\u2272","\\lesssim"),n(r,o,m,"\u2a85","\\lessapprox"),n(r,o,m,"\u224a","\\approxeq"),n(r,o,l,"\u22d6","\\lessdot"),n(r,o,m,"\u22d8","\\lll"),n(r,o,m,"\u2276","\\lessgtr"),n(r,o,m,"\u22da","\\lesseqgtr"),n(r,o,m,"\u2a8b","\\lesseqqgtr"),n(r,o,m,"\u2251","\\doteqdot"),n(r,o,m,"\u2253","\\risingdotseq"),n(r,o,m,"\u2252","\\fallingdotseq"),n(r,o,m,"\u223d","\\backsim"),n(r,o,m,"\u22cd","\\backsimeq"),n(r,o,m,"\u2ac5","\\subseteqq"),n(r,o,m,"\u22d0","\\Subset"),n(r,o,m,"\u228f","\\sqsubset"),n(r,o,m,"\u227c","\\preccurlyeq"),n(r,o,m,"\u22de","\\curlyeqprec"),n(r,o,m,"\u227e","\\precsim"),n(r,o,m,"\u2ab7","\\precapprox"),n(r,o,m,"\u22b2","\\vartriangleleft"),n(r,o,m,"\u22b4","\\trianglelefteq"),n(r,o,m,"\u22a8","\\vDash"),n(r,o,m,"\u22aa","\\Vvdash"),n(r,o,m,"\u2323","\\smallsmile"),n(r,o,m,"\u2322","\\smallfrown"),n(r,o,m,"\u224f","\\bumpeq"),n(r,o,m,"\u224e","\\Bumpeq"),n(r,o,m,"\u2267","\\geqq"),n(r,o,m,"\u2a7e","\\geqslant"),n(r,o,m,"\u2a96","\\eqslantgtr"),n(r,o,m,"\u2273","\\gtrsim"),n(r,o,m,"\u2a86","\\gtrapprox"),n(r,o,l,"\u22d7","\\gtrdot"),n(r,o,m,"\u22d9","\\ggg"),n(r,o,m,"\u2277","\\gtrless"),n(r,o,m,"\u22db","\\gtreqless"),n(r,o,m,"\u2a8c","\\gtreqqless"),n(r,o,m,"\u2256","\\eqcirc"),n(r,o,m,"\u2257","\\circeq"),n(r,o,m,"\u225c","\\triangleq"),n(r,o,m,"\u223c","\\thicksim"),n(r,o,m,"\u2248","\\thickapprox"),n(r,o,m,"\u2ac6","\\supseteqq"),n(r,o,m,"\u22d1","\\Supset"),n(r,o,m,"\u2290","\\sqsupset"),n(r,o,m,"\u227d","\\succcurlyeq"),n(r,o,m,"\u22df","\\curlyeqsucc"),n(r,o,m,"\u227f","\\succsim"),n(r,o,m,"\u2ab8","\\succapprox"),n(r,o,m,"\u22b3","\\vartriangleright"),n(r,o,m,"\u22b5","\\trianglerighteq"),n(r,o,m,"\u22a9","\\Vdash"),n(r,o,m,"\u2223","\\shortmid"),n(r,o,m,"\u2225","\\shortparallel"),n(r,o,m,"\u226c","\\between"),n(r,o,m,"\u22d4","\\pitchfork"),n(r,o,m,"\u221d","\\varpropto"),n(r,o,m,"\u25c0","\\blacktriangleleft"),n(r,o,m,"\u2234","\\therefore"),n(r,o,m,"\u220d","\\backepsilon"),n(r,o,m,"\u25b6","\\blacktriangler
ight"),n(r,o,m,"\u2235","\\because"),n(r,o,m,"\u22d8","\\llless"),n(r,o,m,"\u22d9","\\gggtr"),n(r,o,l,"\u22b2","\\lhd"),n(r,o,l,"\u22b3","\\rhd"),n(r,o,m,"\u2242","\\eqsim"),n(r,a,m,"\u22c8","\\Join"),n(r,o,m,"\u2251","\\Doteq"),n(r,o,l,"\u2214","\\dotplus"),n(r,o,l,"\u2216","\\smallsetminus"),n(r,o,l,"\u22d2","\\Cap"),n(r,o,l,"\u22d3","\\Cup"),n(r,o,l,"\u2a5e","\\doublebarwedge"),n(r,o,l,"\u229f","\\boxminus"),n(r,o,l,"\u229e","\\boxplus"),n(r,o,l,"\u22c7","\\divideontimes"),n(r,o,l,"\u22c9","\\ltimes"),n(r,o,l,"\u22ca","\\rtimes"),n(r,o,l,"\u22cb","\\leftthreetimes"),n(r,o,l,"\u22cc","\\rightthreetimes"),n(r,o,l,"\u22cf","\\curlywedge"),n(r,o,l,"\u22ce","\\curlyvee"),n(r,o,l,"\u229d","\\circleddash"),n(r,o,l,"\u229b","\\circledast"),n(r,o,l,"\u22c5","\\centerdot"),n(r,o,l,"\u22ba","\\intercal"),n(r,o,l,"\u22d2","\\doublecap"),n(r,o,l,"\u22d3","\\doublecup"),n(r,o,l,"\u22a0","\\boxtimes"),n(r,o,m,"\u21e2","\\dashrightarrow"),n(r,o,m,"\u21e0","\\dashleftarrow"),n(r,o,m,"\u21c7","\\leftleftarrows"),n(r,o,m,"\u21c6","\\leftrightarrows"),n(r,o,m,"\u21da","\\Lleftarrow"),n(r,o,m,"\u219e","\\twoheadleftarrow"),n(r,o,m,"\u21a2","\\leftarrowtail"),n(r,o,m,"\u21ab","\\looparrowleft"),n(r,o,m,"\u21cb","\\leftrightharpoons"),n(r,o,m,"\u21b6","\\curvearrowleft"),n(r,o,m,"\u21ba","\\circlearrowleft"),n(r,o,m,"\u21b0","\\Lsh"),n(r,o,m,"\u21c8","\\upuparrows"),n(r,o,m,"\u21bf","\\upharpoonleft"),n(r,o,m,"\u21c3","\\downharpoonleft"),n(r,o,m,"\u22b8","\\multimap"),n(r,o,m,"\u21ad","\\leftrightsquigarrow"),n(r,o,m,"\u21c9","\\rightrightarrows"),n(r,o,m,"\u21c4","\\rightleftarrows"),n(r,o,m,"\u21a0","\\twoheadrightarrow"),n(r,o,m,"\u21a3","\\rightarrowtail"),n(r,o,m,"\u21ac","\\looparrowright"),n(r,o,m,"\u21b7","\\curvearrowright"),n(r,o,m,"\u21bb","\\circlearrowright"),n(r,o,m,"\u21b1","\\Rsh"),n(r,o,m,"\u21ca","\\downdownarrows"),n(r,o,m,"\u21be","\\upharpoonright"),n(r,o,m,"\u21c2","\\downharpoonright"),n(r,o,m,"\u21dd","\\rightsquigarrow"),n(r,o,m,"\u21dd","\\leadsto"),n(r,o,m,"\u21db","\\Rrightarrow"),n(r,o,m,"\u21be","\\restriction"),n(r,a,v,"\u2018","`"),n(r,a,v,"$","\\$"),n(i,a,v,"$","\\$"),n(i,a,v,"$","\\textdollar"),n(r,a,v,"%","\\%"),n(i,a,v,"%","\\%"),n(r,a,v,"_","\\_"),n(i,a,v,"_","\\_"),n(i,a,v,"_","\\textunderscore"),n(r,a,v,"\u2220","\\angle"),n(r,a,v,"\u221e","\\infty"),n(r,a,v,"\u2032","\\prime"),n(r,a,v,"\u25b3","\\triangle"),n(r,a,v,"\u0393","\\Gamma",!0),n(r,a,v,"\u0394","\\Delta",!0),n(r,a,v,"\u0398","\\Theta",!0),n(r,a,v,"\u039b","\\Lambda",!0),n(r,a,v,"\u039e","\\Xi",!0),n(r,a,v,"\u03a0","\\Pi",!0),n(r,a,v,"\u03a3","\\Sigma",!0),n(r,a,v,"\u03a5","\\Upsilon",!0),n(r,a,v,"\u03a6","\\Phi",!0),n(r,a,v,"\u03a8","\\Psi",!0),n(r,a,v,"\u03a9","\\Omega",!0),n(r,a,v,"\xac","\\neg"),n(r,a,v,"\xac","\\lnot"),n(r,a,v,"\u22a4","\\top"),n(r,a,v,"\u22a5","\\bot"),n(r,a,v,"\u2205","\\emptyset"),n(r,o,v,"\u2205","\\varnothing"),n(r,a,c,"\u03b1","\\alpha",!0),n(r,a,c,"\u03b2","\\beta",!0),n(r,a,c,"\u03b3","\\gamma",!0),n(r,a,c,"\u03b4","\\delta",!0),n(r,a,c,"\u03f5","\\epsilon",!0),n(r,a,c,"\u03b6","\\zeta",!0),n(r,a,c,"\u03b7","\\eta",!0),n(r,a,c,"\u03b8","\\theta",!0),n(r,a,c,"\u03b9","\\iota",!0),n(r,a,c,"\u03ba","\\kappa",!0),n(r,a,c,"\u03bb","\\lambda",!0),n(r,a,c,"\u03bc","\\mu",!0),n(r,a,c,"\u03bd","\\nu",!0),n(r,a,c,"\u03be","\\xi",!0),n(r,a,c,"\u03bf","\\omicron",!0),n(r,a,c,"\u03c0","\\pi",!0),n(r,a,c,"\u03c1","\\rho",!0),n(r,a,c,"\u03c3","\\sigma",!0),n(r,a,c,"\u03c4","\\tau",!0),n(r,a,c,"\u03c5","\\upsilon",!0),n(r,a,c,"\u03d5","\\phi",!0),n(r,a,c,"\u03c7","\\chi",!0),n(r,a,c,"\u03c8","\\ps
i",!0),n(r,a,c,"\u03c9","\\omega",!0),n(r,a,c,"\u03b5","\\varepsilon",!0),n(r,a,c,"\u03d1","\\vartheta",!0),n(r,a,c,"\u03d6","\\varpi",!0),n(r,a,c,"\u03f1","\\varrho",!0),n(r,a,c,"\u03c2","\\varsigma",!0),n(r,a,c,"\u03c6","\\varphi",!0),n(r,a,l,"\u2217","*"),n(r,a,l,"+","+"),n(r,a,l,"\u2212","-"),n(r,a,l,"\u22c5","\\cdot"),n(r,a,l,"\u2218","\\circ"),n(r,a,l,"\xf7","\\div"),n(r,a,l,"\xb1","\\pm"),n(r,a,l,"\xd7","\\times"),n(r,a,l,"\u2229","\\cap"),n(r,a,l,"\u222a","\\cup"),n(r,a,l,"\u2216","\\setminus"),n(r,a,l,"\u2227","\\land"),n(r,a,l,"\u2228","\\lor"),n(r,a,l,"\u2227","\\wedge"),n(r,a,l,"\u2228","\\vee"),n(r,a,v,"\u221a","\\surd"),n(r,a,p,"(","("),n(r,a,p,"[","["),n(r,a,p,"\u27e8","\\langle"),n(r,a,p,"\u2223","\\lvert"),n(r,a,p,"\u2225","\\lVert"),n(r,a,u,")",")"),n(r,a,u,"]","]"),n(r,a,u,"?","?"),n(r,a,u,"!","!"),n(r,a,u,"\u27e9","\\rangle"),n(r,a,u,"\u2223","\\rvert"),n(r,a,u,"\u2225","\\rVert"),n(r,a,m,"=","="),n(r,a,m,"<","<"),n(r,a,m,">",">"),n(r,a,m,":",":"),n(r,a,m,"\u2248","\\approx"),n(r,a,m,"\u2245","\\cong"),n(r,a,m,"\u2265","\\ge"),n(r,a,m,"\u2265","\\geq"),n(r,a,m,"\u2190","\\gets"),n(r,a,m,">","\\gt"),n(r,a,m,"\u2208","\\in"),n(r,a,m,"\u2209","\\notin"),n(r,a,m,"\u0338","\\not"),n(r,a,m,"\u2282","\\subset"),n(r,a,m,"\u2283","\\supset"),n(r,a,m,"\u2286","\\subseteq"),n(r,a,m,"\u2287","\\supseteq"),n(r,o,m,"\u2288","\\nsubseteq"),n(r,o,m,"\u2289","\\nsupseteq"),n(r,a,m,"\u22a8","\\models"),n(r,a,m,"\u2190","\\leftarrow"),n(r,a,m,"\u2264","\\le"),n(r,a,m,"\u2264","\\leq"),n(r,a,m,"<","\\lt"),n(r,a,m,"\u2260","\\ne"),n(r,a,m,"\u2260","\\neq"),n(r,a,m,"\u2192","\\rightarrow"),n(r,a,m,"\u2192","\\to"),n(r,o,m,"\u2271","\\ngeq"),n(r,o,m,"\u2270","\\nleq"),n(r,a,g,null,"\\!"),n(r,a,g,"\xa0","\\ "),n(r,a,g,"\xa0","~"),n(r,a,g,null,"\\,"),n(r,a,g,null,"\\:"),n(r,a,g,null,"\\;"),n(r,a,g,null,"\\enspace"),n(r,a,g,null,"\\qquad"),n(r,a,g,null,"\\quad"),n(r,a,g,"\xa0","\\space"),n(r,a,f,",",","),n(r,a,f,";",";"),n(r,a,f,":","\\colon"),n(r,o,l,"\u22bc","\\barwedge"),n(r,o,l,"\u22bb","\\veebar"),n(r,a,l,"\u2299","\\odot"),n(r,a,l,"\u2295","\\oplus"),n(r,a,l,"\u2297","\\otimes"),n(r,a,v,"\u2202","\\partial"),n(r,a,l,"\u2298","\\oslash"),n(r,o,l,"\u229a","\\circledcirc"),n(r,o,l,"\u22a1","\\boxdot"),n(r,a,l,"\u25b3","\\bigtriangleup"),n(r,a,l,"\u25bd","\\bigtriangledown"),n(r,a,l,"\u2020","\\dagger"),n(r,a,l,"\u22c4","\\diamond"),n(r,a,l,"\u22c6","\\star"),n(r,a,l,"\u25c3","\\triangleleft"),n(r,a,l,"\u25b9","\\triangleright"),n(r,a,p,"{","\\{"),n(i,a,v,"{","\\{"),n(i,a,v,"{","\\textbraceleft"),n(r,a,u,"}","\\}"),n(i,a,v,"}","\\}"),n(i,a,v,"}","\\textbraceright"),n(r,a,p,"{","\\lbrace"),n(r,a,u,"}","\\rbrace"),n(r,a,p,"[","\\lbrack"),n(r,a,u,"]","\\rbrack"),n(i,a,v,"<","\\textless"),n(i,a,v,">","\\textgreater"),n(r,a,p,"\u230a","\\lfloor"),n(r,a,u,"\u230b","\\rfloor"),n(r,a,p,"\u2308","\\lceil"),n(r,a,u,"\u2309","\\rceil"),n(r,a,v,"\\","\\backslash"),n(r,a,v,"\u2223","|"),n(r,a,v,"\u2223","\\vert"),n(i,a,v,"|","\\textbar"),n(r,a,v,"\u2225","\\|"),n(r,a,v,"\u2225","\\Vert"),n(i,a,v,"\u2225","\\textbardbl"),n(r,a,m,"\u2191","\\uparrow"),n(r,a,m,"\u21d1","\\Uparrow"),n(r,a,m,"\u2193","\\downarrow"),n(r,a,m,"\u21d3","\\Downarrow"),n(r,a,m,"\u2195","\\updownarrow"),n(r,a,m,"\u21d5","\\Updownarrow"),n(r,a,h,"\u2210","\\coprod"),n(r,a,h,"\u22c1","\\bigvee"),n(r,a,h,"\u22c0","\\bigwedge"),n(r,a,h,"\u2a04","\\biguplus"),n(r,a,h,"\u22c2","\\bigcap"),n(r,a,h,"\u22c3","\\bigcup"),n(r,a,h,"\u222b","\\int"),n(r,a,h,"\u222b","\\intop"),n(r,a,h,"\u222c","\\iint"),n(r,a,h,"\u222d","\\iiint"),n(r,a,h,"\u220f",
"\\prod"),n(r,a,h,"\u2211","\\sum"),n(r,a,h,"\u2a02","\\bigotimes"),n(r,a,h,"\u2a01","\\bigoplus"),n(r,a,h,"\u2a00","\\bigodot"),n(r,a,h,"\u222e","\\oint"),n(r,a,h,"\u2a06","\\bigsqcup"),n(r,a,h,"\u222b","\\smallint"),n(i,a,d,"\u2026","\\textellipsis"),n(r,a,d,"\u2026","\\mathellipsis"),n(i,a,d,"\u2026","\\ldots",!0),n(r,a,d,"\u2026","\\ldots",!0),n(r,a,d,"\u22ef","\\cdots",!0),n(r,a,d,"\u22f1","\\ddots",!0),n(r,a,v,"\u22ee","\\vdots",!0),n(r,a,s,"\xb4","\\acute"),n(r,a,s,"`","\\grave"),n(r,a,s,"\xa8","\\ddot"),n(r,a,s,"~","\\tilde"),n(r,a,s,"\xaf","\\bar"),n(r,a,s,"\u02d8","\\breve"),n(r,a,s,"\u02c7","\\check"),n(r,a,s,"^","\\hat"),n(r,a,s,"\u20d7","\\vec"),n(r,a,s,"\u02d9","\\dot"),n(r,a,c,"\u0131","\\imath"),n(r,a,c,"\u0237","\\jmath"),n(i,a,s,"\u02ca","\\'"),n(i,a,s,"\u02cb","\\`"),n(i,a,s,"\u02c6","\\^"),n(i,a,s,"\u02dc","\\~"),n(i,a,s,"\u02c9","\\="),n(i,a,s,"\u02d8","\\u"),n(i,a,s,"\u02d9","\\."),n(i,a,s,"\u02da","\\r"),n(i,a,s,"\u02c7","\\v"),n(i,a,s,"\xa8",'\\"'),n(i,a,s,"\u030b","\\H"),n(i,a,v,"\u2013","--"),n(i,a,v,"\u2013","\\textendash"),n(i,a,v,"\u2014","---"),n(i,a,v,"\u2014","\\textemdash"),n(i,a,v,"\u2018","`"),n(i,a,v,"\u2018","\\textquoteleft"),n(i,a,v,"\u2019","'"),n(i,a,v,"\u2019","\\textquoteright"),n(i,a,v,"\u201c","``"),n(i,a,v,"\u201c","\\textquotedblleft"),n(i,a,v,"\u201d","''"),n(i,a,v,"\u201d","\\textquotedblright"),n(r,a,v,"\xb0","\\degree"),n(i,a,v,"\xb0","\\degree"),n(r,a,c,"\xa3","\\pounds"),n(r,a,c,"\xa3","\\mathsterling"),n(i,a,c,"\xa3","\\pounds"),n(i,a,c,"\xa3","\\textsterling"),n(r,o,v,"\u2720","\\maltese"),n(i,o,v,"\u2720","\\maltese"),n(i,a,g,"\xa0","\\ "),n(i,a,g,"\xa0"," "),n(i,a,g,"\xa0","~");for(var b='0123456789/@."',y=0;y":">","<":"<",'"':""","'":"'"},h=/[&><"']/g,p=void 0;if("undefined"!=typeof document){var f=document.createElement("span");p="textContent"in f?function(e,t){e.textContent=t}:function(e,t){e.innerText=t}}t.exports={contains:s,deflt:l,escape:r,hyphenate:d,indexOf:o,setTextContent:p,clearNode:i}},{}]},{},[1])(1)},e.exports=t()})); +// Copyright 2018 The Distill Template Authors +const ae=function(e,t,n){let r=n,i=0;const a=e.length;for(;r[e.left,e.right]),i=e=>r.some(t=>-1!==e.indexOf(t));n.mightHaveMath=i,ue(e,n)};var 
he="iVBORw0KGgoAAAANSUhEUgAAAEAAAABACAYAAACqaXHeAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAAA99JREFUeNrsG4t1ozDMzQSM4A2ODUonKBucN2hugtIJ6E1AboLcBiQTkJsANiAb9OCd/OpzMWBJBl5TvaeXPiiyJetry0J8wW3D3QpjRh3GjneXDq+fSQA9s2mH9x3KDhN4foJfCb8N/Jrv+2fnDn8vLRQOplWHVYdvHZYdZsBcZP1vBmh/n8DzEmhUQDPaOuP9pFuY+JwJHwHnCLQE2tnWBGEyXozY9xCUgHMhhjE2I4heVWtgIkZ83wL6Qgxj1obfWBxymPwe+b00BCCRNPbwfb60yleAkkBHGT5AEehIYz7eJrFDMF9CvH4wwhcGHiHMneFvLDQwlwvMLQq58trRcYBWfYn0A0OgHWQUSu25mE+BnoYKnnEJoeIWAifzOv7vLWd2ZKRfWAIme3tOiUaQ3UnLkb0xj1FxRIeEGKaGIHOs9nEgLaaA9i0JRYo1Ic67wJW86KSKE/ZAM8KuVMk8ITVhmxUxJ3Cl2xlm9Vtkeju1+mpCQNxaEGNCY8bs9X2YqwNoQeGjBWut/ma0QAWy/TqAsHx9wSya3I5IRxOfTC+leG+kA/4vSeEcGBtNUN6byhu3+keEZCQJUNh8MAO7HL6H8pQLnsW/Hd4T4lv93TPjfM7A46iEEqbB5EDOvwYNW6tGNZzT/o+CZ6sqZ6wUtR/wf7mi/VL8iNciT6rHih48Y55b4nKCHJCCzb4y0nwFmin3ZEMIoLfZF8F7nncFmvnWBaBj7CGAYA/WGJsUwHdYqVDwAmNsUgAx4CGgAA7GOOxADYOFWOaIKifuVYzmOpREqA21Mo7aPsgiY1PhOMAmxtR+AUbYH3Id2wc0SAFIQTsn9IUGWR8k9jx3vtXSiAacFxTAGakBk9UudkNECd6jLe+6HrshshvIuC6IlLMRy7er+JpcKma24SlE4cFZSZJDGVVrsNvitQhQrDhW0jfiOLfFd47C42eHT56D/BK0To+58Ahj+cAT8HT1UWlfLZCCd/uKawzU0Rh2EyIX/Icqth3niG8ybNroezwe6khdCNxRN+l4XGdOLVLlOOt2hTRJlr1ETIuMAltVTMz70mJrkdGAaZLSmnBEqmAE32JCMmuTlCnRgsBENtOUpHhvvsYIL0ibnBkaC6QvKcR7738GKp0AKnim7xgUSNv1bpS8QwhBt8r+EP47v/oyRK/S34yJ9nT+AN0Tkm4OdB9E4BsmXM3SnMlRFUrtp6IDpV2eKzdYvF3etm3KhQksbOLChGkSmcBdmcEwvqkrMy5BzL00NZeu3qPYJOOuCc+5NjcWKXQxFvTa3NoXJ4d8in7fiAUuTt781dkvuHX4K8AA2Usy7yNKLy0AAAAASUVORK5CYII=\n",pe=/["'&<>]/,fe=C; +/*! * escape-html * Copyright(c) 2012-2013 TJ Holowaychuk * Copyright(c) 2015 Andreas Lubbe * Copyright(c) 2015 Tiancheng "Timothy" Gu * MIT Licensed */ - - /** - * Module variables. - * @private - */ - - var matchHtmlRegExp = /["'&<>]/; - - /** - * Module exports. - * @public - */ - - var escapeHtml_1 = escapeHtml; - - /** - * Escape special characters in the given string of html. - * - * @param {string} string The string to escape for inserting into HTML - * @return {string} - * @public - */ - - function escapeHtml(string) { - var str = '' + string; - var match = matchHtmlRegExp.exec(str); - - if (!match) { - return str; - } - - var escape; - var html = ''; - var index = 0; - var lastIndex = 0; - - for (index = match.index; index < str.length; index++) { - switch (str.charCodeAt(index)) { - case 34: // " - escape = '"'; - break; - case 38: // & - escape = '&'; - break; - case 39: // ' - escape = '''; - break; - case 60: // < - escape = '<'; - break; - case 62: // > - escape = '>'; - break; - default: - continue; - } - - if (lastIndex !== index) { - html += str.substring(lastIndex, index); - } - - lastIndex = index + 1; - html += escape; - } - - return lastIndex !== index - ? 
html + str.substring(lastIndex, index) - : html; - } - - // Copyright 2018 The Distill Template Authors - - function Meta(dom, data) { - let head = dom.querySelector('head'); - let appendHead = html => appendHtml(head, html); - - function meta(name, content, force) { - if (content || force) - appendHead(`<meta name="${name}" content="${escapeHtml_1(content)}" >\n`); - } - - appendHead(` - - - - `); - - if (data.title) { - appendHead(` - <title>${escapeHtml_1(data.title)}</title> - `); - } - - if (data.url) { - appendHead(` - <link rel="canonical" href="${data.url}"> - `); - } - - - if (data.publishedDate){ - appendHead(` - - - - - `); - } - - if (data.updatedDate) { - appendHead(` - - `); - } - - (data.authors || []).forEach((a) => { - appendHtml(head, ` - `); - }); - - appendHead(` - - - - - - - - - `); - - appendHead(` - - - - - - - - - `); - - // if this is a proper article, generate Google Scholar metadata - if (data.doiSuffix){ - appendHead(` - \n`); - - meta('citation_title', data.title); - meta('citation_fulltext_html_url', data.url); - meta('citation_volume', data.volume); - meta('citation_issue', data.issue); - meta('citation_firstpage', data.doiSuffix ? `e${data.doiSuffix}` : undefined); - meta('citation_doi', data.doi); - - let journal = data.journal || {}; - meta('citation_journal_title', journal.full_title || journal.title); - meta('citation_journal_abbrev', journal.abbrev_title); - meta('citation_issn', journal.issn); - meta('citation_publisher', journal.publisher); - meta('citation_fulltext_world_readable', '', true); - - if (data.publishedDate){ - meta('citation_online_date', `${data.publishedYear}/${data.publishedMonthPadded}/${data.publishedDayPadded}`); - meta('citation_publication_date', `${data.publishedYear}/${data.publishedMonthPadded}/${data.publishedDayPadded}`); - } - - (data.authors || []).forEach((a) => { - meta('citation_author', `${a.lastName}, ${a.firstName}`); - meta('citation_author_institution', a.affiliation); - }); - } else { - console.warn('No DOI suffix in data; not adding citation meta tags!'); - } - - if (data.citations) { - data.citations.forEach(key => { - if (data.bibliography && data.bibliography.has(key)) { - const entry = data.bibliography.get(key); - meta('citation_reference', citation_meta_content(entry) ); - } else { - console.warn('No bibliography data found for ' + key); - } - }); - } else { - console.warn('No citations found; not adding any references meta tags!'); - } - } - - function appendHtml(el, html) { - el.innerHTML += html; - } - - function citation_meta_content(ref){ - var content = `citation_title=${ref.title};`; - - if (ref.author && ref.author !== '') { - ref.author.split(' and ').forEach(name => { - name = name.trim(); - let last, firsts; - if (name.indexOf(',') != -1){ - last = name.split(',')[0].trim(); - firsts = name.split(',')[1].trim(); - } else { - last = name.split(' ').slice(-1)[0].trim(); - firsts = name.split(' ').slice(0,-1).join(' '); - } - content += `citation_author=${firsts} ${last};`; - }); - } - - if ('year' in ref) { - content += `citation_publication_date=${ref.year};`; - } - - // Special test for arxiv - let arxiv_id_search = /https?:\/\/arxiv\.org\/pdf\/([0-9]*\.[0-9]*)\.pdf/.exec(ref.url); - arxiv_id_search = arxiv_id_search || /https?:\/\/arxiv\.org\/abs\/([0-9]*\.[0-9]*)/.exec(ref.url); - arxiv_id_search = arxiv_id_search || /arXiv preprint arXiv:([0-9]*\.[0-9]*)/.exec(ref.journal); - if (arxiv_id_search && arxiv_id_search[1]){ - content += `citation_arxiv_id=${arxiv_id_search[1]};`; - return content; // arXiv is not considered a journal, so we don't need journal/volume/issue - } - if ('journal' in ref){ - 
content += `citation_journal_title=${escapeHtml_1(ref.journal)};`; - } - if ('volume' in ref) { - content += `citation_volume=${escapeHtml_1(ref.volume)};`; - } - if ('issue' in ref || 'number' in ref){ - content += `citation_number=${escapeHtml_1(ref.issue || ref.number)};`; - } - return content; - } - - var base = "/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\nhtml {\n font-size: 14px;\n\tline-height: 1.6em;\n /* font-family: \"Libre Franklin\", \"Helvetica Neue\", sans-serif; */\n font-family: -apple-system, BlinkMacSystemFont, \"Segoe UI\", Roboto, Oxygen, Ubuntu, Cantarell, \"Fira Sans\", \"Droid Sans\", \"Helvetica Neue\", Arial, sans-serif;\n /*, \"Apple Color Emoji\", \"Segoe UI Emoji\", \"Segoe UI Symbol\";*/\n text-size-adjust: 100%;\n -ms-text-size-adjust: 100%;\n -webkit-text-size-adjust: 100%;\n}\n\n@media(min-width: 768px) {\n html {\n font-size: 16px;\n }\n}\n\nbody {\n margin: 0;\n}\n\na {\n color: #004276;\n}\n\nfigure {\n margin: 0;\n}\n\ntable {\n\tborder-collapse: collapse;\n\tborder-spacing: 0;\n}\n\ntable th {\n\ttext-align: left;\n}\n\ntable thead {\n border-bottom: 1px solid rgba(0, 0, 0, 0.05);\n}\n\ntable thead th {\n padding-bottom: 0.5em;\n}\n\ntable tbody :first-child td {\n padding-top: 0.5em;\n}\n\npre {\n overflow: auto;\n max-width: 100%;\n}\n\np {\n margin-top: 0;\n margin-bottom: 1em;\n}\n\nsup, sub {\n vertical-align: baseline;\n position: relative;\n top: -0.4em;\n line-height: 1em;\n}\n\nsub {\n top: 0.4em;\n}\n\n.kicker,\n.marker {\n font-size: 15px;\n font-weight: 600;\n color: rgba(0, 0, 0, 0.5);\n}\n\n\n/* Headline */\n\n@media(min-width: 1024px) {\n d-title h1 span {\n display: block;\n }\n}\n\n/* Figure */\n\nfigure {\n position: relative;\n margin-bottom: 2.5em;\n margin-top: 1.5em;\n}\n\nfigcaption+figure {\n\n}\n\nfigure img {\n width: 100%;\n}\n\nfigure svg text,\nfigure svg tspan {\n}\n\nfigcaption,\n.figcaption {\n color: rgba(0, 0, 0, 0.6);\n font-size: 12px;\n line-height: 1.5em;\n}\n\n@media(min-width: 1024px) {\nfigcaption,\n.figcaption {\n font-size: 13px;\n }\n}\n\nfigure.external img {\n background: white;\n border: 1px solid rgba(0, 0, 0, 0.1);\n box-shadow: 0 1px 8px rgba(0, 0, 0, 0.1);\n padding: 18px;\n box-sizing: border-box;\n}\n\nfigcaption a {\n color: rgba(0, 0, 0, 0.6);\n}\n\nfigcaption b,\nfigcaption strong, {\n font-weight: 600;\n color: rgba(0, 0, 0, 1.0);\n}\n"; - - var layout = "/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing 
permissions and\n * limitations under the License.\n */\n\n@supports not (display: grid) {\n .base-grid,\n distill-header,\n d-title,\n d-abstract,\n d-article,\n d-appendix,\n distill-appendix,\n d-byline,\n d-footnote-list,\n d-citation-list,\n distill-footer {\n display: block;\n padding: 8px;\n }\n}\n\n.base-grid,\ndistill-header,\nd-title,\nd-abstract,\nd-article,\nd-appendix,\ndistill-appendix,\nd-byline,\nd-footnote-list,\nd-citation-list,\ndistill-footer {\n display: grid;\n justify-items: stretch;\n grid-template-columns: [screen-start] 8px [page-start kicker-start text-start gutter-start middle-start] 1fr 1fr 1fr 1fr 1fr 1fr 1fr 1fr [text-end page-end gutter-end kicker-end middle-end] 8px [screen-end];\n grid-column-gap: 8px;\n}\n\n.grid {\n display: grid;\n grid-column-gap: 8px;\n}\n\n@media(min-width: 768px) {\n .base-grid,\n distill-header,\n d-title,\n d-abstract,\n d-article,\n d-appendix,\n distill-appendix,\n d-byline,\n d-footnote-list,\n d-citation-list,\n distill-footer {\n grid-template-columns: [screen-start] 1fr [page-start kicker-start middle-start text-start] 45px 45px 45px 45px 45px 45px 45px 45px [ kicker-end text-end gutter-start] 45px [middle-end] 45px [page-end gutter-end] 1fr [screen-end];\n grid-column-gap: 16px;\n }\n\n .grid {\n grid-column-gap: 16px;\n }\n}\n\n@media(min-width: 1000px) {\n .base-grid,\n distill-header,\n d-title,\n d-abstract,\n d-article,\n d-appendix,\n distill-appendix,\n d-byline,\n d-footnote-list,\n d-citation-list,\n distill-footer {\n grid-template-columns: [screen-start] 1fr [page-start kicker-start] 50px [middle-start] 50px [text-start kicker-end] 50px 50px 50px 50px 50px 50px 50px 50px [text-end gutter-start] 50px [middle-end] 50px [page-end gutter-end] 1fr [screen-end];\n grid-column-gap: 16px;\n }\n\n .grid {\n grid-column-gap: 16px;\n }\n}\n\n@media(min-width: 1180px) {\n .base-grid,\n distill-header,\n d-title,\n d-abstract,\n d-article,\n d-appendix,\n distill-appendix,\n d-byline,\n d-footnote-list,\n d-citation-list,\n distill-footer {\n grid-template-columns: [screen-start] 1fr [page-start kicker-start] 60px [middle-start] 60px [text-start kicker-end] 60px 60px 60px 60px 60px 60px 60px 60px [text-end gutter-start] 60px [middle-end] 60px [page-end gutter-end] 1fr [screen-end];\n grid-column-gap: 32px;\n }\n\n .grid {\n grid-column-gap: 32px;\n }\n}\n\n\n\n\n.base-grid {\n grid-column: screen;\n}\n\n/* .l-body,\nd-article > * {\n grid-column: text;\n}\n\n.l-page,\nd-title > *,\nd-figure {\n grid-column: page;\n} */\n\n.l-gutter {\n grid-column: gutter;\n}\n\n.l-text,\n.l-body {\n grid-column: text;\n}\n\n.l-page {\n grid-column: page;\n}\n\n.l-body-outset {\n grid-column: middle;\n}\n\n.l-page-outset {\n grid-column: page;\n}\n\n.l-screen {\n grid-column: screen;\n}\n\n.l-screen-inset {\n grid-column: screen;\n padding-left: 16px;\n padding-left: 16px;\n}\n\n\n/* Aside */\n\nd-article aside {\n grid-column: gutter;\n font-size: 12px;\n line-height: 1.6em;\n color: rgba(0, 0, 0, 0.6)\n}\n\n@media(min-width: 768px) {\n aside {\n grid-column: gutter;\n }\n\n .side {\n grid-column: gutter;\n }\n}\n"; - - var print = "/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on 
an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n@media print {\n\n @page {\n size: 8in 11in;\n @bottom-right {\n content: counter(page) \" of \" counter(pages);\n }\n }\n\n html {\n /* no general margins -- CSS Grid takes care of those */\n }\n\n p, code {\n page-break-inside: avoid;\n }\n\n h2, h3 {\n page-break-after: avoid;\n }\n\n d-header {\n visibility: hidden;\n }\n\n d-footer {\n display: none!important;\n }\n\n}\n"; - - var byline = "/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\nd-byline {\n contain: style;\n overflow: hidden;\n border-top: 1px solid rgba(0, 0, 0, 0.1);\n font-size: 0.8rem;\n line-height: 1.8em;\n padding: 1.5rem 0;\n min-height: 1.8em;\n}\n\n\nd-byline .byline {\n grid-template-columns: 1fr 1fr;\n grid-column: text;\n}\n\n@media(min-width: 768px) {\n d-byline .byline {\n grid-template-columns: 1fr 1fr 1fr 1fr;\n }\n}\n\nd-byline .authors-affiliations {\n grid-column-end: span 2;\n grid-template-columns: 1fr 1fr;\n margin-bottom: 1em;\n}\n\n@media(min-width: 768px) {\n d-byline .authors-affiliations {\n margin-bottom: 0;\n }\n}\n\nd-byline h3 {\n font-size: 0.6rem;\n font-weight: 400;\n color: rgba(0, 0, 0, 0.5);\n margin: 0;\n text-transform: uppercase;\n}\n\nd-byline p {\n margin: 0;\n}\n\nd-byline a,\nd-article d-byline a {\n color: rgba(0, 0, 0, 0.8);\n text-decoration: none;\n border-bottom: none;\n}\n\nd-article d-byline a:hover {\n text-decoration: underline;\n border-bottom: none;\n}\n\nd-byline p.author {\n font-weight: 500;\n}\n\nd-byline .affiliations {\n\n}\n"; - - var article = "/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\nd-article {\n contain: layout style;\n overflow-x: hidden;\n border-top: 1px solid rgba(0, 0, 0, 0.1);\n padding-top: 2rem;\n color: rgba(0, 0, 0, 0.8);\n}\n\nd-article > * {\n grid-column: text;\n}\n\n@media(min-width: 768px) {\n d-article {\n font-size: 16px;\n }\n}\n\n@media(min-width: 1024px) {\n d-article {\n font-size: 1.06rem;\n line-height: 1.7em;\n }\n}\n\n\n/* H2 */\n\n\nd-article .marker {\n text-decoration: none;\n border: none;\n counter-reset: section;\n grid-column: kicker;\n line-height: 1.7em;\n}\n\nd-article .marker:hover {\n border: none;\n}\n\nd-article .marker span {\n padding: 0 3px 4px;\n border-bottom: 
1px solid rgba(0, 0, 0, 0.2);\n position: relative;\n top: 4px;\n}\n\nd-article .marker:hover span {\n color: rgba(0, 0, 0, 0.7);\n border-bottom: 1px solid rgba(0, 0, 0, 0.7);\n}\n\nd-article h2 {\n font-weight: 600;\n font-size: 24px;\n line-height: 1.25em;\n margin: 2rem 0 1.5rem 0;\n border-bottom: 1px solid rgba(0, 0, 0, 0.1);\n padding-bottom: 1rem;\n}\n\n@media(min-width: 1024px) {\n d-article h2 {\n font-size: 36px;\n }\n}\n\n/* H3 */\n\nd-article h3 {\n font-weight: 700;\n font-size: 18px;\n line-height: 1.4em;\n margin-bottom: 1em;\n margin-top: 2em;\n}\n\n@media(min-width: 1024px) {\n d-article h3 {\n font-size: 20px;\n }\n}\n\n/* H4 */\n\nd-article h4 {\n font-weight: 600;\n text-transform: uppercase;\n font-size: 14px;\n line-height: 1.4em;\n}\n\nd-article a {\n color: inherit;\n}\n\nd-article p,\nd-article ul,\nd-article ol,\nd-article blockquote {\n margin-top: 0;\n margin-bottom: 1em;\n margin-left: 0;\n margin-right: 0;\n}\n\nd-article blockquote {\n border-left: 2px solid rgba(0, 0, 0, 0.2);\n padding-left: 2em;\n font-style: italic;\n color: rgba(0, 0, 0, 0.6);\n}\n\nd-article a {\n border-bottom: 1px solid rgba(0, 0, 0, 0.4);\n text-decoration: none;\n}\n\nd-article a:hover {\n border-bottom: 1px solid rgba(0, 0, 0, 0.8);\n}\n\nd-article .link {\n text-decoration: underline;\n cursor: pointer;\n}\n\nd-article ul,\nd-article ol {\n padding-left: 24px;\n}\n\nd-article li {\n margin-bottom: 1em;\n margin-left: 0;\n padding-left: 0;\n}\n\nd-article li:last-child {\n margin-bottom: 0;\n}\n\nd-article pre {\n font-size: 14px;\n margin-bottom: 20px;\n}\n\nd-article hr {\n grid-column: screen;\n width: 100%;\n border: none;\n border-bottom: 1px solid rgba(0, 0, 0, 0.1);\n margin-top: 60px;\n margin-bottom: 60px;\n}\n\nd-article section {\n margin-top: 60px;\n margin-bottom: 60px;\n}\n\nd-article span.equation-mimic {\n font-family: georgia;\n font-size: 115%;\n font-style: italic;\n}\n\nd-article > d-code,\nd-article section > d-code {\n display: block;\n}\n\nd-article > d-math[block],\nd-article section > d-math[block] {\n display: block;\n}\n\n@media (max-width: 768px) {\n d-article > d-code,\n d-article section > d-code,\n d-article > d-math[block],\n d-article section > d-math[block] {\n overflow-x: scroll;\n -ms-overflow-style: none; // IE 10+\n overflow: -moz-scrollbars-none; // Firefox\n }\n\n d-article > d-code::-webkit-scrollbar,\n d-article section > d-code::-webkit-scrollbar,\n d-article > d-math[block]::-webkit-scrollbar,\n d-article section > d-math[block]::-webkit-scrollbar {\n display: none; // Safari and Chrome\n }\n}\n\nd-article .citation {\n color: #668;\n cursor: pointer;\n}\n\nd-include {\n width: auto;\n display: block;\n}\n\nd-figure {\n contain: layout style;\n}\n\n/* KaTeX */\n\n.katex, .katex-prerendered {\n contain: style;\n display: inline-block;\n}\n\n/* Tables */\n\nd-article table {\n border-collapse: collapse;\n margin-bottom: 1.5rem;\n border-bottom: 1px solid rgba(0, 0, 0, 0.2);\n}\n\nd-article table th {\n border-bottom: 1px solid rgba(0, 0, 0, 0.2);\n}\n\nd-article table td {\n border-bottom: 1px solid rgba(0, 0, 0, 0.05);\n}\n\nd-article table tr:last-of-type td {\n border-bottom: none;\n}\n\nd-article table th,\nd-article table td {\n font-size: 15px;\n padding: 2px 8px;\n}\n\nd-article table tbody :first-child td {\n padding-top: 2px;\n}\n"; - - var title = "/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the 
License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\nd-title {\n padding: 2rem 0 1.5rem;\n contain: layout style;\n overflow-x: hidden;\n}\n\n@media(min-width: 768px) {\n d-title {\n padding: 4rem 0 1.5rem;\n }\n}\n\nd-title h1 {\n grid-column: text;\n font-size: 40px;\n font-weight: 700;\n line-height: 1.1em;\n margin: 0 0 0.5rem;\n}\n\n@media(min-width: 768px) {\n d-title h1 {\n font-size: 50px;\n }\n}\n\nd-title p {\n font-weight: 300;\n font-size: 1.2rem;\n line-height: 1.55em;\n grid-column: text;\n}\n\nd-title .status {\n margin-top: 0px;\n font-size: 12px;\n color: #009688;\n opacity: 0.8;\n grid-column: kicker;\n}\n\nd-title .status span {\n line-height: 1;\n display: inline-block;\n padding: 6px 0;\n border-bottom: 1px solid #80cbc4;\n font-size: 11px;\n text-transform: uppercase;\n}\n"; - - var math = "/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\nspan.katex-display {\n text-align: left;\n padding: 8px 0 8px 0;\n margin: 0.5em 0 0.5em 1em;\n}\n\nspan.katex {\n -webkit-font-smoothing: antialiased;\n color: rgba(0, 0, 0, 0.8);\n font-size: 1.18em;\n}\n"; - - // Copyright 2018 The Distill Template Authors - - const styles = base + layout + title + byline + article + math + print; - - function makeStyleTag(dom) { - - const styleTagId = 'distill-prerendered-styles'; - const prerenderedTag = dom.getElementById(styleTagId); - if (!prerenderedTag) { - const styleTag = dom.createElement('style'); - styleTag.id = styleTagId; - styleTag.type = 'text/css'; - const cssTextTag = dom.createTextNode(styles); - styleTag.appendChild(cssTextTag); - const firstScriptTag = dom.head.querySelector('script'); - dom.head.insertBefore(styleTag, firstScriptTag); - } - - } - - // Copyright 2018 The Distill Template Authors - - function renderTOC(element, headings) { - - let ToC =` - - -
<nav role="navigation" class="l-text toc figcaption"> - <h3>Table of contents</h3> - <ul>
        `; - - for (const el of headings) { - // should element be included in TOC? - const isInTitle = el.parentElement.tagName == 'D-TITLE'; - const isException = el.getAttribute('no-toc'); - if (isInTitle || isException) continue; - // create TOC entry - const title = el.textContent; - const link = '#' + el.getAttribute('id'); - - let newLine = '
<li>' + '<a href="' + link + '">' + title + '</a>' + '</li>'; - if (el.tagName == 'H3') { - newLine = '<ul>' + newLine + '</ul>'; - } else { - newLine += '<br>'; - } - ToC += newLine; - - } - - ToC += '</ul></nav>
      '; - element.innerHTML = ToC; - } - - // Copyright 2018 The Distill Template Authors - - function TOC(dom) { - const article = dom.querySelector('d-article'); - const toc = dom.querySelector('d-toc'); - if (toc) { - const headings = article.querySelectorAll('h2, h3'); - renderTOC(toc, headings); - toc.setAttribute('prerendered', 'true'); - } - } - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. - // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - // See the License for the specific language governing permissions and - // limitations under the License. - - function Typeset(dom) { - - var textNodes = dom.createTreeWalker( - dom.body, - dom.defaultView.NodeFilter.SHOW_TEXT - ); - while (textNodes.nextNode()) { - var n = textNodes.currentNode, - text = n.nodeValue; - if (text && acceptNode(n)) { - text = quotes(text); - text = punctuation(text); - // TODO: Add back support for ligatures once their uppercased versions don't hang Chrome search anymore - // see: https://bugs.chromium.org/p/chromium/issues/detail?id=862648 - // text = ligatures(text); - n.nodeValue = text; - } - } - } - - // 2018-07-11 shancarter@ and ludwigschubert@ no longer know what this was meant to accomplish - // if it was trying to not replace text in any child nodes of those listed here, - // then it does not accomplish that. - function acceptNode(node) { - var parent = node.parentElement; - var isMath = (parent && parent.getAttribute && parent.getAttribute('class')) ? parent.getAttribute('class').includes('katex') || parent.getAttribute('class').includes('MathJax') : false; - return parent && - parent.nodeName !== 'SCRIPT' && - parent.nodeName !== 'STYLE' && - parent.nodeName !== 'CODE' && - parent.nodeName !== 'PRE' && - parent.nodeName !== 'SPAN' && - parent.nodeName !== 'D-HEADER' && - parent.nodeName !== 'D-BYLINE' && - parent.nodeName !== 'D-MATH' && - parent.nodeName !== 'D-CODE' && - parent.nodeName !== 'D-BIBLIOGRAPHY' && - parent.nodeName !== 'D-FOOTER' && - parent.nodeName !== 'D-APPENDIX' && - parent.nodeName !== 'D-FRONTMATTER' && - parent.nodeName !== 'D-TOC' && - parent.nodeType !== 8 && //comment nodes - !isMath; - } - - - /*! 
- * typeset - Typesetting for the web - * @version v0.1.6 - * @link https://github.com/davidmerfield/Typeset.js - * @author David Merfield - */ - // which has a CC0 license - // http://creativecommons.org/publicdomain/zero/1.0/ - - - function punctuation(text){ - - // Dashes - text = text.replace(/--/g, '\u2014'); - text = text.replace(/\s*\u2014\s*/g,'\u2009\u2014\u2009'); // this has thin spaces - - // Ellipses - text = text.replace(/\.\.\./g,'…'); - - // Nbsp for punc with spaces - var NBSP = '\u00a0'; - var NBSP_PUNCTUATION_START = /([«¿¡]) /g; - var NBSP_PUNCTUATION_END = / ([!?:;.,‽»])/g; - - text = text.replace(NBSP_PUNCTUATION_START, '$1' + NBSP); - text = text.replace(NBSP_PUNCTUATION_END, NBSP + '$1'); - - return text; - } - - function quotes(text) { - - text = text - .replace(/(\W|^)"([^\s!?:;.,‽»])/g, '$1\u201c$2') // beginning " - .replace(/(\u201c[^"]*)"([^"]*$|[^\u201c"]*\u201c)/g, '$1\u201d$2') // ending " - .replace(/([^0-9])"/g,'$1\u201d') // remaining " at end of word - .replace(/(\W|^)'(\S)/g, '$1\u2018$2') // beginning ' - .replace(/([a-z])'([a-z])/ig, '$1\u2019$2') // conjunction's possession - .replace(/((\u2018[^']*)|[a-z])'([^0-9]|$)/ig, '$1\u2019$3') // ending ' - .replace(/(\u2018)([0-9]{2}[^\u2019]*)(\u2018([^0-9]|$)|$|\u2019[a-z])/ig, '\u2019$2$3') // abbrev. years like '93 - .replace(/(\B|^)\u2018(?=([^\u2019]*\u2019\b)*([^\u2019\u2018]*\W[\u2019\u2018]\b|[^\u2019\u2018]*$))/ig, '$1\u2019') // backwards apostrophe - .replace(/'''/g, '\u2034') // triple prime - .replace(/("|'')/g, '\u2033') // double prime - .replace(/'/g, '\u2032'); - - // Allow escaped quotes - text = text.replace(/\\“/, '"'); - text = text.replace(/\\”/, '"'); - text = text.replace(/\\’/, '\''); - text = text.replace(/\\‘/, '\''); - - return text; - } - - // Copyright 2018 The Distill Template Authors - - // const template = ` - // if ('IntersectionObserver' in window && - // 'IntersectionObserverEntry' in window && - // 'intersectionRatio' in IntersectionObserverEntry.prototype) { - // // Platform supports IntersectionObserver natively! :-) - // if (!('isIntersecting' in IntersectionObserverEntry.prototype)) { - // Object.defineProperty(IntersectionObserverEntry.prototype, - // 'isIntersecting', { - // get: function () { - // return this.intersectionRatio > 0; - // } - // }); - // } - // } else { - // // Platform does not support IntersectionObserver--loading polyfills synchronously. - // const scriptTag = document.createElement('script'); - // scriptTag.src = '${intersectionObserverPath}'; - // scriptTag.async = false; - // document.currentScript.parentNode.insertBefore(scriptTag, document.currentScript.nextSibling); - // } - // - // if ('registerElement' in document && - // 'import' in document.createElement('link') && - // 'content' in document.createElement('template')) { - // // Platform supports webcomponents natively! :-) - // } else { - // // Platform does not support webcomponents--loading polyfills synchronously. 
- // const scriptTag = document.createElement('script'); - // scriptTag.src = '${webcomponentPath}'; - // scriptTag.async = false; - // document.currentScript.parentNode.insertBefore(scriptTag, document.currentScript.nextSibling); - // } - // - // - // `; - - - const addBackIn = ` -window.addEventListener('WebComponentsReady', function() { - console.warn('WebComponentsReady'); - const loaderTag = document.createElement('script'); - loaderTag.src = 'https://distill.pub/template.v2.js'; - document.head.insertBefore(loaderTag, document.head.firstChild); -}); -`; - - function render(dom) { - // pull out template script tag - const templateTag = dom.querySelector('script[src*="template.v2.js"]'); - if (templateTag) { - templateTag.parentNode.removeChild(templateTag); - } else { - console.debug('FYI: Did not find template tag when trying to remove it. You may not have added it. Be aware that our polyfills will add it.'); - } - - // add loader - const loaderTag = dom.createElement('script'); - loaderTag.src = 'https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.0.17/webcomponents-loader.js'; - dom.head.insertBefore(loaderTag, dom.head.firstChild); - - // add loader event listener to add template back in - const addTag = dom.createElement('script'); - addTag.innerHTML = addBackIn; - dom.head.insertBefore(addTag, dom.head.firstChild); - - - // create polyfill script tag - // const polyfillScriptTag = dom.createElement('script'); - // polyfillScriptTag.innerHTML = template; - // polyfillScriptTag.id = 'polyfills'; - - // insert at appropriate position--before any other script tag - // const firstScriptTag = dom.head.querySelector('script'); - // dom.head.insertBefore(polyfillScriptTag, firstScriptTag); - } - - // Copyright 2018 The Distill Template Authors - - const styles$1 = ` -d-citation-list { - contain: style; -} - -d-citation-list .references { - grid-column: text; -} - -d-citation-list .references .title { - font-weight: 500; -} -`; - - function renderCitationList(element, entries, dom=document) { - if (entries.size > 0) { - element.style.display = ''; - let list = element.querySelector('.references'); - if (list) { - list.innerHTML = ''; - } else { - const stylesTag = dom.createElement('style'); - stylesTag.innerHTML = styles$1; - element.appendChild(stylesTag); - - const heading = dom.createElement('h3'); - heading.id = 'references'; - heading.textContent = 'References'; - element.appendChild(heading); - - list = dom.createElement('ol'); - list.id = 'references-list'; - list.className = 'references'; - element.appendChild(list); - } - - for (const [key, entry] of entries) { - const listItem = dom.createElement('li'); - listItem.id = key; - listItem.innerHTML = bibliography_cite(entry); - list.appendChild(listItem); - } - } else { - element.style.display = 'none'; - } - } - - // Copyright 2018 The Distill Template Authors - - function CitationList(dom, data) { - const citationListTag = dom.querySelector('d-citation-list'); - if (citationListTag) { - const entries = new Map(data.citations.map( citationKey => { - return [citationKey, data.bibliography.get(citationKey)]; - })); - renderCitationList(citationListTag, entries, dom); - citationListTag.setAttribute('distill-prerendered', 'true'); - } - } - - // Copyright 2018 The Distill Template Authors - // - // Licensed under the Apache License, Version 2.0 (the "License"); - // you may not use this file except in compliance with the License. 
- // You may obtain a copy of the License at - // - // http://www.apache.org/licenses/LICENSE-2.0 - // - // Unless required by applicable law or agreed to in writing, software - // distributed under the License is distributed on an "AS IS" BASIS, - // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - // See the License for the specific language governing permissions and - // limitations under the License. - - /* - Try to only reorder things that MAY be user defined. - Try to use templates etc to define the order of our own tags. - */ - - function render$1(dom) { - const head = dom.head; - - const metaIE = head.querySelector('meta[http-equiv]'); - head.insertBefore(metaIE, head.firstChild); - - const metaViewport = head.querySelector('meta[name=viewport]'); - head.insertBefore(metaViewport, head.firstChild); - - const metaCharset = head.querySelector('meta[charset]'); - head.insertBefore(metaCharset, head.firstChild); - } - - var logo = "\n \n\n"; - - const headerTemplate = ` - - -`; - - // Copyright 2018 The Distill Template Authors - - function DistillHeader(dom, data) { - const headerTag = dom.querySelector('distill-header'); - if (!headerTag) { - const header = dom.createElement('distill-header'); - header.innerHTML = headerTemplate; - header.setAttribute('distill-prerendered', ""); - const body = dom.querySelector('body'); - body.insertBefore(header, body.firstChild); - } - } - - // Copyright 2018 The Distill Template Authors - - const styles$2 = ` - -`; - - function appendixTemplate(frontMatter) { - let html = styles$2; - - if (typeof frontMatter.githubUrl !== 'undefined') { - html += ` -
<h3 id="updates-and-corrections">Updates and Corrections</h3> - <p>`; - if (frontMatter.githubCompareUpdatesUrl) { - html += `<a href="${frontMatter.githubCompareUpdatesUrl}">View all changes</a> to this article since it was first published.`; - } - html += ` - If you see mistakes or want to suggest changes, please <a href="${frontMatter.githubUrl + '/issues/new'}">create an issue on GitHub</a>.</p>
- `; - } - - const journal = frontMatter.journal; - if (typeof journal !== 'undefined' && journal.title === 'Distill') { - html += ` - <h3 id="reuse">Reuse</h3> - <p>Diagrams and text are licensed under Creative Commons Attribution <a href="https://creativecommons.org/licenses/by/4.0/">CC-BY 4.0</a> with the source available on <a href="${frontMatter.githubUrl}">GitHub</a>, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.</p>
- `; - } - - if (typeof frontMatter.publishedDate !== 'undefined') { - html += ` - <h3 id="citation">Citation</h3> - <p>For attribution in academic contexts, please cite this work as</p> - <pre class="citation short">${frontMatter.concatenatedAuthors}, "${frontMatter.title}", Distill, ${frontMatter.publishedYear}.</pre> - <p>BibTeX citation</p> - <pre class="citation long">${serializeFrontmatterToBibtex(frontMatter)}</pre>
      - `; - } - - return html; - } - - // Copyright 2018 The Distill Template Authors - - function DistillAppendix(dom, data) { - - const appendixTag = dom.querySelector('d-appendix'); - if (!appendixTag) { - console.warn('No appendix tag found!'); - return; - } - const distillAppendixTag = appendixTag.querySelector('distill-appendix'); - if (!distillAppendixTag) { - const distillAppendix = dom.createElement('distill-appendix'); - appendixTag.appendChild(distillAppendix); - distillAppendix.innerHTML = appendixTemplate(data); - } - - } - - const footerTemplate = ` - - - - -`; - - // Copyright 2018 The Distill Template Authors - - function DistillFooter(dom) { - const footerTag = dom.querySelector('distill-footer'); - if(!footerTag) { - const footer = dom.createElement('distill-footer'); - footer.innerHTML = footerTemplate; - const body = dom.querySelector('body'); - body.appendChild(footer); - } - } - - // Copyright 2018 The Distill Template Authors - - const extractors = new Map([ - ['ExtractFrontmatter', ExtractFrontmatter], - ['ExtractBibliography', ExtractBibliography], - ['ExtractCitations', ExtractCitations], - ]); - - const transforms = new Map([ - ['HTML', HTML], - ['makeStyleTag', makeStyleTag], - ['OptionalComponents', OptionalComponents], - ['TOC', TOC], - ['Byline', Byline], - ['Mathematics', Mathematics], - ['Meta', Meta], - ['Typeset', Typeset], - ['Polyfills', render], - ['CitationList', CitationList], - ['Reorder', render$1] // keep last - ]); - - const distillTransforms = new Map([ - ['DistillHeader', DistillHeader], - ['DistillAppendix', DistillAppendix], - ['DistillFooter', DistillFooter], - ]); - - /* Exported functions */ - - function render$2(dom, data, verbose=true) { - let frontMatter; - if (data instanceof FrontMatter) { - frontMatter = data; - } else { - frontMatter = FrontMatter.fromObject(data); - } - // first, we collect static data from the dom - for (const [name, extract] of extractors.entries()) { - if (verbose) console.warn('Running extractor: ' + name); - extract(dom, frontMatter, verbose); - } - // secondly we use it to transform parts of the dom - for (const [name, transform] of transforms.entries()) { - if (verbose) console.warn('Running transform: ' + name); - // console.warn('Running transform: ', transform); - transform(dom, frontMatter, verbose); - } - dom.body.setAttribute('distill-prerendered', ''); - // the function calling us can now use the transformed dom and filled data object - if (data instanceof FrontMatter) ; else { - frontMatter.assignToObject(data); - } - } - - function distillify(dom, data, verbose=true) { - // thirdly, we can use these additional transforms when publishing on the Distill website - for (const [name, transform] of distillTransforms.entries()) { - if (verbose) console.warn('Running distillify: ', name); - transform(dom, data, verbose); - } - } - - function usesTemplateV2(dom) { - const tags = dom.querySelectorAll('script'); - let usesV2 = undefined; - for (const tag of tags) { - const src = tag.src; - if (src.includes('template.v1.js')) { - usesV2 = false; - } else if (src.includes('template.v2.js')) { - usesV2 = true; - } else if (src.includes('template.')) { - throw new Error('Uses distill template, but unknown version?!'); - } - } - - if (usesV2 === undefined) { - throw new Error('Does not seem to use Distill template at all.'); - } else { - return usesV2; - } - } - - const testing = { - extractors: extractors, - transforms: transforms, - distillTransforms: distillTransforms - }; - - exports.FrontMatter = FrontMatter; 
- exports.distillify = distillify; - exports.render = render$2; - exports.testing = testing; - exports.usesTemplateV2 = usesTemplateV2; - - Object.defineProperty(exports, '__esModule', { value: true }); - -}))); -//# sourceMappingURL=transforms.v2.js.map +// Copyright 2018 The Distill Template Authors +const me='/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the "License");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an "AS IS" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\nhtml {\n font-size: 14px;\n\tline-height: 1.6em;\n /* font-family: "Libre Franklin", "Helvetica Neue", sans-serif; */\n font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, Cantarell, "Fira Sans", "Droid Sans", "Helvetica Neue", Arial, sans-serif;\n /*, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol";*/\n text-size-adjust: 100%;\n -ms-text-size-adjust: 100%;\n -webkit-text-size-adjust: 100%;\n}\n\n@media(min-width: 768px) {\n html {\n font-size: 16px;\n }\n}\n\nbody {\n margin: 0;\n}\n\na {\n color: #004276;\n}\n\nfigure {\n margin: 0;\n}\n\ntable {\n\tborder-collapse: collapse;\n\tborder-spacing: 0;\n}\n\ntable th {\n\ttext-align: left;\n}\n\ntable thead {\n border-bottom: 1px solid rgba(0, 0, 0, 0.05);\n}\n\ntable thead th {\n padding-bottom: 0.5em;\n}\n\ntable tbody :first-child td {\n padding-top: 0.5em;\n}\n\npre {\n overflow: auto;\n max-width: 100%;\n}\n\np {\n margin-top: 0;\n margin-bottom: 1em;\n}\n\nsup, sub {\n vertical-align: baseline;\n position: relative;\n top: -0.4em;\n line-height: 1em;\n}\n\nsub {\n top: 0.4em;\n}\n\n.kicker,\n.marker {\n font-size: 15px;\n font-weight: 600;\n color: rgba(0, 0, 0, 0.5);\n}\n\n\n/* Headline */\n\n@media(min-width: 1024px) {\n d-title h1 span {\n display: block;\n }\n}\n\n/* Figure */\n\nfigure {\n position: relative;\n margin-bottom: 2.5em;\n margin-top: 1.5em;\n}\n\nfigcaption+figure {\n\n}\n\nfigure img {\n width: 100%;\n}\n\nfigure svg text,\nfigure svg tspan {\n}\n\nfigcaption,\n.figcaption {\n color: rgba(0, 0, 0, 0.6);\n font-size: 12px;\n line-height: 1.5em;\n}\n\n@media(min-width: 1024px) {\nfigcaption,\n.figcaption {\n font-size: 13px;\n }\n}\n\nfigure.external img {\n background: white;\n border: 1px solid rgba(0, 0, 0, 0.1);\n box-shadow: 0 1px 8px rgba(0, 0, 0, 0.1);\n padding: 18px;\n box-sizing: border-box;\n}\n\nfigcaption a {\n color: rgba(0, 0, 0, 0.6);\n}\n\nfigcaption b,\nfigcaption strong, {\n font-weight: 600;\n color: rgba(0, 0, 0, 1.0);\n}\n'+'/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the "License");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an "AS IS" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n 
*/\n\n@supports not (display: grid) {\n .base-grid,\n distill-header,\n d-title,\n d-abstract,\n d-article,\n d-appendix,\n distill-appendix,\n d-byline,\n d-footnote-list,\n d-citation-list,\n distill-footer {\n display: block;\n padding: 8px;\n }\n}\n\n.base-grid,\ndistill-header,\nd-title,\nd-abstract,\nd-article,\nd-appendix,\ndistill-appendix,\nd-byline,\nd-footnote-list,\nd-citation-list,\ndistill-footer {\n display: grid;\n justify-items: stretch;\n grid-template-columns: [screen-start] 8px [page-start kicker-start text-start gutter-start middle-start] 1fr 1fr 1fr 1fr 1fr 1fr 1fr 1fr [text-end page-end gutter-end kicker-end middle-end] 8px [screen-end];\n grid-column-gap: 8px;\n}\n\n.grid {\n display: grid;\n grid-column-gap: 8px;\n}\n\n@media(min-width: 768px) {\n .base-grid,\n distill-header,\n d-title,\n d-abstract,\n d-article,\n d-appendix,\n distill-appendix,\n d-byline,\n d-footnote-list,\n d-citation-list,\n distill-footer {\n grid-template-columns: [screen-start] 1fr [page-start kicker-start middle-start text-start] 45px 45px 45px 45px 45px 45px 45px 45px [ kicker-end text-end gutter-start] 45px [middle-end] 45px [page-end gutter-end] 1fr [screen-end];\n grid-column-gap: 16px;\n }\n\n .grid {\n grid-column-gap: 16px;\n }\n}\n\n@media(min-width: 1000px) {\n .base-grid,\n distill-header,\n d-title,\n d-abstract,\n d-article,\n d-appendix,\n distill-appendix,\n d-byline,\n d-footnote-list,\n d-citation-list,\n distill-footer {\n grid-template-columns: [screen-start] 1fr [page-start kicker-start] 50px [middle-start] 50px [text-start kicker-end] 50px 50px 50px 50px 50px 50px 50px 50px [text-end gutter-start] 50px [middle-end] 50px [page-end gutter-end] 1fr [screen-end];\n grid-column-gap: 16px;\n }\n\n .grid {\n grid-column-gap: 16px;\n }\n}\n\n@media(min-width: 1180px) {\n .base-grid,\n distill-header,\n d-title,\n d-abstract,\n d-article,\n d-appendix,\n distill-appendix,\n d-byline,\n d-footnote-list,\n d-citation-list,\n distill-footer {\n grid-template-columns: [screen-start] 1fr [page-start kicker-start] 60px [middle-start] 60px [text-start kicker-end] 60px 60px 60px 60px 60px 60px 60px 60px [text-end gutter-start] 60px [middle-end] 60px [page-end gutter-end] 1fr [screen-end];\n grid-column-gap: 32px;\n }\n\n .grid {\n grid-column-gap: 32px;\n }\n}\n\n\n\n\n.base-grid {\n grid-column: screen;\n}\n\n/* .l-body,\nd-article > * {\n grid-column: text;\n}\n\n.l-page,\nd-title > *,\nd-figure {\n grid-column: page;\n} */\n\n.l-gutter {\n grid-column: gutter;\n}\n\n.l-text,\n.l-body {\n grid-column: text;\n}\n\n.l-page {\n grid-column: page;\n}\n\n.l-body-outset {\n grid-column: middle;\n}\n\n.l-page-outset {\n grid-column: page;\n}\n\n.l-screen {\n grid-column: screen;\n}\n\n.l-screen-inset {\n grid-column: screen;\n padding-left: 16px;\n padding-left: 16px;\n}\n\n\n/* Aside */\n\nd-article aside {\n grid-column: gutter;\n font-size: 12px;\n line-height: 1.6em;\n color: rgba(0, 0, 0, 0.6)\n}\n\n@media(min-width: 768px) {\n aside {\n grid-column: gutter;\n }\n\n .side {\n grid-column: gutter;\n }\n}\n'+'/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the "License");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an "AS IS" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, 
either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\nd-title {\n padding: 2rem 0 1.5rem;\n contain: layout style;\n overflow-x: hidden;\n}\n\n@media(min-width: 768px) {\n d-title {\n padding: 4rem 0 1.5rem;\n }\n}\n\nd-title h1 {\n grid-column: text;\n font-size: 40px;\n font-weight: 700;\n line-height: 1.1em;\n margin: 0 0 0.5rem;\n}\n\n@media(min-width: 768px) {\n d-title h1 {\n font-size: 50px;\n }\n}\n\nd-title p {\n font-weight: 300;\n font-size: 1.2rem;\n line-height: 1.55em;\n grid-column: text;\n}\n\nd-title .status {\n margin-top: 0px;\n font-size: 12px;\n color: #009688;\n opacity: 0.8;\n grid-column: kicker;\n}\n\nd-title .status span {\n line-height: 1;\n display: inline-block;\n padding: 6px 0;\n border-bottom: 1px solid #80cbc4;\n font-size: 11px;\n text-transform: uppercase;\n}\n'+'/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the "License");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an "AS IS" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\nd-byline {\n contain: style;\n overflow: hidden;\n border-top: 1px solid rgba(0, 0, 0, 0.1);\n font-size: 0.8rem;\n line-height: 1.8em;\n padding: 1.5rem 0;\n min-height: 1.8em;\n}\n\n\nd-byline .byline {\n grid-template-columns: 1fr 1fr;\n grid-column: text;\n}\n\n@media(min-width: 768px) {\n d-byline .byline {\n grid-template-columns: 1fr 1fr 1fr 1fr;\n }\n}\n\nd-byline .authors-affiliations {\n grid-column-end: span 2;\n grid-template-columns: 1fr 1fr;\n margin-bottom: 1em;\n}\n\n@media(min-width: 768px) {\n d-byline .authors-affiliations {\n margin-bottom: 0;\n }\n}\n\nd-byline h3 {\n font-size: 0.6rem;\n font-weight: 400;\n color: rgba(0, 0, 0, 0.5);\n margin: 0;\n text-transform: uppercase;\n}\n\nd-byline p {\n margin: 0;\n}\n\nd-byline a,\nd-article d-byline a {\n color: rgba(0, 0, 0, 0.8);\n text-decoration: none;\n border-bottom: none;\n}\n\nd-article d-byline a:hover {\n text-decoration: underline;\n border-bottom: none;\n}\n\nd-byline p.author {\n font-weight: 500;\n}\n\nd-byline .affiliations {\n\n}\n'+'/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the "License");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an "AS IS" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\nd-article {\n contain: layout style;\n overflow-x: hidden;\n border-top: 1px solid rgba(0, 0, 0, 0.1);\n padding-top: 2rem;\n color: rgba(0, 0, 0, 0.8);\n}\n\nd-article > * {\n grid-column: text;\n}\n\n@media(min-width: 768px) {\n d-article {\n font-size: 16px;\n }\n}\n\n@media(min-width: 1024px) {\n d-article {\n font-size: 1.06rem;\n line-height: 1.7em;\n }\n}\n\n\n/* H2 
*/\n\n\nd-article .marker {\n text-decoration: none;\n border: none;\n counter-reset: section;\n grid-column: kicker;\n line-height: 1.7em;\n}\n\nd-article .marker:hover {\n border: none;\n}\n\nd-article .marker span {\n padding: 0 3px 4px;\n border-bottom: 1px solid rgba(0, 0, 0, 0.2);\n position: relative;\n top: 4px;\n}\n\nd-article .marker:hover span {\n color: rgba(0, 0, 0, 0.7);\n border-bottom: 1px solid rgba(0, 0, 0, 0.7);\n}\n\nd-article h2 {\n font-weight: 600;\n font-size: 24px;\n line-height: 1.25em;\n margin: 2rem 0 1.5rem 0;\n border-bottom: 1px solid rgba(0, 0, 0, 0.1);\n padding-bottom: 1rem;\n}\n\n@media(min-width: 1024px) {\n d-article h2 {\n font-size: 36px;\n }\n}\n\n/* H3 */\n\nd-article h3 {\n font-weight: 700;\n font-size: 18px;\n line-height: 1.4em;\n margin-bottom: 1em;\n margin-top: 2em;\n}\n\n@media(min-width: 1024px) {\n d-article h3 {\n font-size: 20px;\n }\n}\n\n/* H4 */\n\nd-article h4 {\n font-weight: 600;\n text-transform: uppercase;\n font-size: 14px;\n line-height: 1.4em;\n}\n\nd-article a {\n color: inherit;\n}\n\nd-article p,\nd-article ul,\nd-article ol,\nd-article blockquote {\n margin-top: 0;\n margin-bottom: 1em;\n margin-left: 0;\n margin-right: 0;\n}\n\nd-article blockquote {\n border-left: 2px solid rgba(0, 0, 0, 0.2);\n padding-left: 2em;\n font-style: italic;\n color: rgba(0, 0, 0, 0.6);\n}\n\nd-article a {\n border-bottom: 1px solid rgba(0, 0, 0, 0.4);\n text-decoration: none;\n}\n\nd-article a:hover {\n border-bottom: 1px solid rgba(0, 0, 0, 0.8);\n}\n\nd-article .link {\n text-decoration: underline;\n cursor: pointer;\n}\n\nd-article ul,\nd-article ol {\n padding-left: 24px;\n}\n\nd-article li {\n margin-bottom: 1em;\n margin-left: 0;\n padding-left: 0;\n}\n\nd-article li:last-child {\n margin-bottom: 0;\n}\n\nd-article pre {\n font-size: 14px;\n margin-bottom: 20px;\n}\n\nd-article hr {\n grid-column: screen;\n width: 100%;\n border: none;\n border-bottom: 1px solid rgba(0, 0, 0, 0.1);\n margin-top: 60px;\n margin-bottom: 60px;\n}\n\nd-article section {\n margin-top: 60px;\n margin-bottom: 60px;\n}\n\nd-article span.equation-mimic {\n font-family: georgia;\n font-size: 115%;\n font-style: italic;\n}\n\nd-article > d-code,\nd-article section > d-code {\n display: block;\n}\n\nd-article > d-math[block],\nd-article section > d-math[block] {\n display: block;\n}\n\n@media (max-width: 768px) {\n d-article > d-code,\n d-article section > d-code,\n d-article > d-math[block],\n d-article section > d-math[block] {\n overflow-x: scroll;\n -ms-overflow-style: none; // IE 10+\n overflow: -moz-scrollbars-none; // Firefox\n }\n\n d-article > d-code::-webkit-scrollbar,\n d-article section > d-code::-webkit-scrollbar,\n d-article > d-math[block]::-webkit-scrollbar,\n d-article section > d-math[block]::-webkit-scrollbar {\n display: none; // Safari and Chrome\n }\n}\n\nd-article .citation {\n color: #668;\n cursor: pointer;\n}\n\nd-include {\n width: auto;\n display: block;\n}\n\nd-figure {\n contain: layout style;\n}\n\n/* KaTeX */\n\n.katex, .katex-prerendered {\n contain: style;\n display: inline-block;\n}\n\n/* Tables */\n\nd-article table {\n border-collapse: collapse;\n margin-bottom: 1.5rem;\n border-bottom: 1px solid rgba(0, 0, 0, 0.2);\n}\n\nd-article table th {\n border-bottom: 1px solid rgba(0, 0, 0, 0.2);\n}\n\nd-article table td {\n border-bottom: 1px solid rgba(0, 0, 0, 0.05);\n}\n\nd-article table tr:last-of-type td {\n border-bottom: none;\n}\n\nd-article table th,\nd-article table td {\n font-size: 15px;\n padding: 2px 8px;\n}\n\nd-article 
table tbody :first-child td {\n padding-top: 2px;\n}\n'+'/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the "License");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an "AS IS" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\nspan.katex-display {\n text-align: left;\n padding: 8px 0 8px 0;\n margin: 0.5em 0 0.5em 1em;\n}\n\nspan.katex {\n -webkit-font-smoothing: antialiased;\n color: rgba(0, 0, 0, 0.8);\n font-size: 1.18em;\n}\n'+'/*\n * Copyright 2018 The Distill Template Authors\n *\n * Licensed under the Apache License, Version 2.0 (the "License");\n * you may not use this file except in compliance with the License.\n * You may obtain a copy of the License at\n *\n * http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an "AS IS" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\n@media print {\n\n @page {\n size: 8in 11in;\n @bottom-right {\n content: counter(page) " of " counter(pages);\n }\n }\n\n html {\n /* no general margins -- CSS Grid takes care of those */\n }\n\n p, code {\n page-break-inside: avoid;\n }\n\n h2, h3 {\n page-break-after: avoid;\n }\n\n d-header {\n visibility: hidden;\n }\n\n d-footer {\n display: none!important;\n }\n\n}\n',ge="\nwindow.addEventListener('WebComponentsReady', function() {\n console.warn('WebComponentsReady');\n const loaderTag = document.createElement('script');\n loaderTag.src = 'https://distill.pub/template.v2.js';\n document.head.insertBefore(loaderTag, document.head.firstChild);\n});\n",ve="\nd-citation-list {\n contain: style;\n}\n\nd-citation-list .references {\n grid-column: text;\n}\n\nd-citation-list .references .title {\n font-weight: 500;\n}\n";var be='\n \n\n';const ye=`\n\n\n`,xe="\n\n",we=`\n\n\n\n\n`,ke=new Map([["ExtractFrontmatter",a],["ExtractBibliography",p],["ExtractCitations",w]]),Me=new Map([["HTML",k],["makeStyleTag",R],["OptionalComponents",z],["TOC",O],["Byline",S],["Mathematics",A],["Meta",T],["Typeset",q],["Polyfills",I],["CitationList",P],["Reorder",j]]),Se=new Map([["DistillHeader",F],["DistillAppendix",U],["DistillFooter",Y]]),ze={extractors:ke,transforms:Me,distillTransforms:Se};e.FrontMatter=ne,e.distillify=G,e.render=V,e.testing=ze,e.usesTemplateV2=W,Object.defineProperty(e,"__esModule",{value:!0})}); \ No newline at end of file diff --git a/assets/js/masonry.js b/assets/js/masonry.js index 054f3a08..57fd6fe5 100644 --- a/assets/js/masonry.js +++ b/assets/js/masonry.js @@ -1,12 +1 @@ -$(document).ready(function() { - // Init Masonry - var $grid = $('.grid').masonry({ - gutter: 10, - horizontalOrder: true, - itemSelector: '.grid-item', - }); - // Layout Masonry after each image loads - $grid.imagesLoaded().progress( function() { - $grid.masonry('layout'); - }); -}); +$(document).ready(function(){var 
r=$(".grid").masonry({gutter:10,horizontalOrder:!0,itemSelector:".grid-item"});r.imagesLoaded().progress(function(){r.masonry("layout")})}); \ No newline at end of file diff --git a/assets/js/theme.js b/assets/js/theme.js index f6c9cdf7..55f4fd8e 100644 --- a/assets/js/theme.js +++ b/assets/js/theme.js @@ -1,64 +1 @@ -// Has to be in the head tag, otherwise a flicker effect will occur. - -let toggleTheme = (theme) => { - if (theme == "dark") { - setTheme("light"); - } else { - setTheme("dark"); - } -} - - -let setTheme = (theme) => { - transTheme(); - setHighlight(theme); - - if (theme) { - document.documentElement.setAttribute("data-theme", theme); - } - else { - document.documentElement.removeAttribute("data-theme"); - } - localStorage.setItem("theme", theme); - - // Updates the background of medium-zoom overlay. - if (typeof medium_zoom !== 'undefined') { - medium_zoom.update({ - background: getComputedStyle(document.documentElement) - .getPropertyValue('--global-bg-color') + 'ee', // + 'ee' for trasparency. - }) - } -}; - -let setHighlight = (theme) => { - if (theme == "dark") { - document.getElementById("highlight_theme_light").media = "none"; - document.getElementById("highlight_theme_dark").media = ""; - } else { - document.getElementById("highlight_theme_dark").media = "none"; - document.getElementById("highlight_theme_light").media = ""; - } -} - - -let transTheme = () => { - document.documentElement.classList.add("transition"); - window.setTimeout(() => { - document.documentElement.classList.remove("transition"); - }, 500) -} - - -let initTheme = (theme) => { - if (theme == null || theme == 'null') { - const userPref = window.matchMedia; - if (userPref && userPref('(prefers-color-scheme: dark)').matches) { - theme = 'dark'; - } - } - - setTheme(theme); -} - - -initTheme(localStorage.getItem("theme")); +let toggleTheme=e=>{setTheme("dark"==e?"light":"dark")},setTheme=e=>{transTheme(),setHighlight(e),e?document.documentElement.setAttribute("data-theme",e):document.documentElement.removeAttribute("data-theme"),localStorage.setItem("theme",e),"undefined"!=typeof medium_zoom&&medium_zoom.update({background:getComputedStyle(document.documentElement).getPropertyValue("--global-bg-color")+"ee"})},setHighlight=e=>{"dark"==e?(document.getElementById("highlight_theme_light").media="none",document.getElementById("highlight_theme_dark").media=""):(document.getElementById("highlight_theme_dark").media="none",document.getElementById("highlight_theme_light").media="")},transTheme=()=>{document.documentElement.classList.add("transition"),window.setTimeout(()=>{document.documentElement.classList.remove("transition")},500)},initTheme=e=>{if(null==e||"null"==e){const t=window.matchMedia;t&&t("(prefers-color-scheme: dark)").matches&&(e="dark")}setTheme(e)};initTheme(localStorage.getItem("theme")); \ No newline at end of file diff --git a/assets/js/zoom.js b/assets/js/zoom.js index c8610d61..2a8bc1fb 100644 --- a/assets/js/zoom.js +++ b/assets/js/zoom.js @@ -1,8 +1 @@ -// Initialize medium zoom. -$(document).ready(function() { - medium_zoom = mediumZoom('[data-zoomable]', { - margin: 100, - background: getComputedStyle(document.documentElement) - .getPropertyValue('--global-bg-color') + 'ee', // + 'ee' for trasparency. 
- }) -}); +$(document).ready(function(){medium_zoom=mediumZoom("[data-zoomable]",{margin:100,background:getComputedStyle(document.documentElement).getPropertyValue("--global-bg-color")+"ee"})}); \ No newline at end of file diff --git a/bin/build b/bin/build deleted file mode 100644 index ccd5ebae..00000000 --- a/bin/build +++ /dev/null @@ -1,117 +0,0 @@ -#!/usr/bin/env bash - -# Run this script to deploy the app to Github Pages - -# Parse cmd arguments - -SRC_BRANCH="master" -DEPLOY_BRANCH="gh-pages" - -USAGE_MSG="usage: deploy [-h|--help] [-u|--user] [-s|--src SRC_BRANCH] [-d|--deploy DEPLOY_BRANCH] [--verbose] [--no-push]" - -while [[ $# > 0 ]]; do - key="$1" - - case $key in - -h|--help) - echo $USAGE_MSG - exit 0 - ;; - -u|--user) - SRC_BRANCH="source" - DEPLOY_BRANCH="master" - ;; - -s|--src) - SRC_BRANCH="$2" - shift - ;; - -g|--slug) - SLUG="$2" - shift - ;; - -d|--deploy) - DEPLOY_BRANCH="$2" - shift - ;; - --verbose) - set -x - ;; - --no-push) - NO_PUSH="--no-push" - ;; - *) - echo "Option $1 is unknown." >&2 - echo $USAGE_MSG >&2 - exit 1 - ;; - esac - shift -done - -# Exit if any subcommand fails -set -e - -echo "Deploying..." -echo "Source branch: $SRC_BRANCH" -echo "Deploy branch: $DEPLOY_BRANCH" - -read -r -p "Do you want to proceed? [y/N] " response -if [[ ! $response =~ ^([yY][eE][sS]|[yY])+$ ]] -then - echo "Aborting." - [[ "$0" = "$BASH_SOURCE" ]] && exit 1 || return 1 -fi - -# Check if there are any uncommitted changes -if ! git diff-index --quiet HEAD --; then - echo "Changes to the following files are uncommitted:" - git diff-index --name-only HEAD -- - echo "Please commit the changes before proceeding." - echo "Aborting." - [[ "$0" = "$BASH_SOURCE" ]] && exit 1 || return 1 -fi - -# Check if there are any untracked files -if ! test -z "$(git ls-files --exclude-standard --others)"; then - echo "There are untracked files:" - git ls-files --exclude-standard --others - echo "Please commit those files or stash them before proceeding." - echo "Aborting." - [[ "$0" = "$BASH_SOURCE" ]] && exit 1 || return 1 -fi - -# Switch to source branch (creates it if necessary from the current branch) -if [ `git branch | grep $SRC_BRANCH | tr ' ' '\n' | tail -1` ] -then - git checkout $SRC_BRANCH -else - git checkout -b $SRC_BRANCH -fi - -# Checkout DEPLOY_BRANCH branch -if [ `git branch | grep $DEPLOY_BRANCH` ] -then - git branch -D $DEPLOY_BRANCH -fi -git checkout -b $DEPLOY_BRANCH - -# Export JEKYLL_ENV=production -export JEKYLL_ENV=production - -# CHARLIE SEP 29 2023: -# BEFORE BUILDING, WE NEED TO CHANGE THE _config.yaml URL, OTHERWISE THE WEBSITE URLS ARE ALL WRONG -echo $SLUG -python -c 'import yaml;f=open("_config.yml");y=yaml.safe_load(f);y["url"] = ""; outfile=open("_config.yml", "w"); yaml.dump(y, outfile, default_flow_style=False, sort_keys=False)' -PASS_SLUG=$SLUG python -c 'import yaml; import os; f=open("_config.yml");y=yaml.safe_load(f);y["baseurl"] = "/" + os.environ["PASS_SLUG"]; outfile=open("_config.yml", "w"); yaml.dump(y, outfile, default_flow_style=False, sort_keys=False)' - -cat _config.yml - -# Build site -bundle exec jekyll build --future - -# Delete and move files -find . -maxdepth 1 ! -name '_site' ! -name '.git' ! -name 'CNAME' ! 
-name '.gitignore' -exec rm -rf {} \; -zip -r site.zip _site/ -mkdir site_out -mv site.zip site_out/ -exit 0 diff --git a/bin/cibuild b/bin/cibuild deleted file mode 100755 index d5c9e195..00000000 --- a/bin/cibuild +++ /dev/null @@ -1 +0,0 @@ -bundle exec jekyll build diff --git a/bin/deploy b/bin/deploy deleted file mode 100755 index b00a28fc..00000000 --- a/bin/deploy +++ /dev/null @@ -1,118 +0,0 @@ -#!/usr/bin/env bash - -# Run this script to deploy the app to Github Pages - -# Parse cmd arguments - -SRC_BRANCH="master" -DEPLOY_BRANCH="gh-pages" - -USAGE_MSG="usage: deploy [-h|--help] [-u|--user] [-s|--src SRC_BRANCH] [-d|--deploy DEPLOY_BRANCH] [--verbose] [--no-push]" - -while [[ $# > 0 ]]; do - key="$1" - - case $key in - -h|--help) - echo $USAGE_MSG - exit 0 - ;; - -u|--user) - SRC_BRANCH="source" - DEPLOY_BRANCH="master" - ;; - -s|--src) - SRC_BRANCH="$2" - shift - ;; - -d|--deploy) - DEPLOY_BRANCH="$2" - shift - ;; - --verbose) - set -x - ;; - --no-push) - NO_PUSH="--no-push" - ;; - *) - echo "Option $1 is unknown." >&2 - echo $USAGE_MSG >&2 - exit 1 - ;; - esac - shift -done - -# Exit if any subcommand fails -set -e - -echo "Deploying..." -echo "Source branch: $SRC_BRANCH" -echo "Deploy branch: $DEPLOY_BRANCH" - -read -r -p "Do you want to proceed? [y/N] " response -if [[ ! $response =~ ^([yY][eE][sS]|[yY])+$ ]] -then - echo "Aborting." - [[ "$0" = "$BASH_SOURCE" ]] && exit 1 || return 1 -fi - -# Check if there are any uncommitted changes -if ! git diff-index --quiet HEAD --; then - echo "Changes to the following files are uncommitted:" - git diff-index --name-only HEAD -- - echo "Please commit the changes before proceeding." - echo "Aborting." - [[ "$0" = "$BASH_SOURCE" ]] && exit 1 || return 1 -fi - -# Check if there are any untracked files -if ! test -z "$(git ls-files --exclude-standard --others)"; then - echo "There are untracked files:" - git ls-files --exclude-standard --others - echo "Please commit those files or stash them before proceeding." - echo "Aborting." - [[ "$0" = "$BASH_SOURCE" ]] && exit 1 || return 1 -fi - -# Switch to source branch (creates it if necessary from the current branch) -if [ `git branch | grep $SRC_BRANCH | tr ' ' '\n' | tail -1` ] -then - git checkout $SRC_BRANCH -else - git checkout -b $SRC_BRANCH -fi - -# Checkout DEPLOY_BRANCH branch -if [ `git branch | grep $DEPLOY_BRANCH` ] -then - git branch -D $DEPLOY_BRANCH -fi -git checkout -b $DEPLOY_BRANCH - -# Export JEKYLL_ENV=production -export JEKYLL_ENV=production - -# Build site -bundle exec jekyll build --future - -# Delete and move files -find . -maxdepth 1 ! -name '_site' ! -name '.git' ! -name 'CNAME' ! -name '.gitignore' -exec rm -rf {} \; -mv _site/* . -rm -R _site/ - -# Create `.nojekyll` file (bypass GitHub Pages Jekyll processing) -touch .nojekyll - -# Push to DEPLOY_BRANCH -git add -fA -git commit --allow-empty -m "$(git log -1 --pretty=%B) [ci skip]" -[[ ${NO_PUSH} ]] || git push -f -q origin $DEPLOY_BRANCH - -# Move back to SRC_BRANCH -git checkout $SRC_BRANCH - -echo "Deployed successfully!" - -exit 0 diff --git a/bin/docker_run.sh b/bin/docker_run.sh deleted file mode 100755 index 681f14c8..00000000 --- a/bin/docker_run.sh +++ /dev/null @@ -1,8 +0,0 @@ -FILE=Gemfile.lock -if [ -f "$FILE" ]; then - rm $FILE -fi -docker build -t "iclr-2024:latest" . 
&& \ -docker run --rm -v "$PWD:/srv/jekyll/" -p "8080:8080" \ - -it iclr-2024:latest bundler \ - exec jekyll serve --trace --future --watch --port=8080 --host=0.0.0.0 diff --git a/bin/entry_point.sh b/bin/entry_point.sh deleted file mode 100644 index 917ae357..00000000 --- a/bin/entry_point.sh +++ /dev/null @@ -1,22 +0,0 @@ -#!/bin/bash - -CONFIG_FILE=_config.yml - -/bin/bash -c "rm -f Gemfile.lock && exec jekyll serve --watch --port=8080 --host=0.0.0.0 --livereload --verbose --trace --force_polling"& - -while true; do - - inotifywait -q -e modify,move,create,delete $CONFIG_FILE - - if [ $? -eq 0 ]; then - - echo "Change detected to $CONFIG_FILE, restarting Jekyll" - - jekyll_pid=$(pgrep -f jekyll) - kill -KILL $jekyll_pid - - /bin/bash -c "rm -f Gemfile.lock && exec jekyll serve --watch --port=8080 --host=0.0.0.0 --livereload --verbose --trace --force_polling"& - - fi - -done diff --git a/bin/filterpaths.py b/bin/filterpaths.py deleted file mode 100644 index 5c682c1c..00000000 --- a/bin/filterpaths.py +++ /dev/null @@ -1,61 +0,0 @@ -#!/usr/bin/env python3 - -import re -import sys - -SUCCESS = True - -SLUG = sys.argv[1] - -OUTPUT_MSG = "" - -SLUG_TEMPLATE = "2024-\d\d-\d\d-.+" -if re.match(SLUG_TEMPLATE, SLUG) is None: - print("Your slug does not match the template! Please change it.") - print(f"Your slug: {SLUG}") - print(f"The template: {SLUG_TEMPLATE}") - print("PATHFILTERFAILED") - SUCCESS = False - OUTPUT_MSG = f"Your PR title does not match the slug template, which is <{SLUG_TEMPLATE}>." - -CHANGED_FILES = sys.argv[2:] -ACCEPTABLE_PATHS = [ - f"_posts/{SLUG}.md", - f"assets/img/{SLUG}/*", - f"assets/html/{SLUG}/*", - f"assets/bibliography/{SLUG}.bib" -] - -failed_paths = [] - -for changed_file in CHANGED_FILES: - for acc_path in ACCEPTABLE_PATHS: - if re.match(acc_path, changed_file) is not None: - break - else: - failed_paths.append(changed_file) - -if len(failed_paths) > 0: - print(f"These files were changed, but they shouldn't have been:") - for failed in failed_paths: - print(f"\t{failed}") - - print("PATHFILTERFAILED") - SUCCESS = False - -if len(failed_paths) > 0: - if OUTPUT_MSG != "": - OUTPUT_MSG += " Also, y" - else: - OUTPUT_MSG = "Y" - - OUTPUT_MSG += f"ou can only add/change/remove files related to your post, i.e. files that match one of these patterns: <_posts/SLUG.md, assets/img/SLUG/..., assets/html/SLUG/..., assets/bibliography/SLUG.bib>. But we found that you changed the following: <{' & '.join(failed_paths)}>." -if not SUCCESS: - OUTPUT_MSG += f" Also, make sure your PR's title ({SLUG}) matches your post's slug!" - print(OUTPUT_MSG) - -# example usage of this script: python3 filter_file.py 2024-0a1-01-whateve _posts/2024-01-01-whateve.md assets/img/2024-01-01-whateve/bla.pic assets/html/2024-01-01-whateve/plot1.j assets/bibliography/2024-01-01-whateve.bib assets/img/2024-01-02-whateve/bla.pic -if SUCCESS: - exit(0) -else: - exit(1) diff --git a/blog/2024/index.html b/blog/2024/index.html new file mode 100644 index 00000000..9886ffbc --- /dev/null +++ b/blog/2024/index.html @@ -0,0 +1 @@ + 2024 | ICLR Blogposts 2024

      2024

      an archive of posts from this year

      May 7, 2024 What exactly has TabPFN learned to do?
      May 7, 2024 Fair Model-Based Reinforcement Learning Comparisons with Explicit and Consistent Update Frequency
      May 7, 2024 Unraveling The Impact of Training Samples
      May 7, 2024 Understanding in-context learning in transformers
      May 7, 2024 Understanding gradient inversion attacks from the prior knowledge perspective
      May 7, 2024 The N Implementation Details of RLHF with PPO
      May 7, 2024 Towards Robust Foundation Models: Adversarial Contrastive Learning
      May 7, 2024 RLHF without RL - Direct Preference Optimization
      May 7, 2024 It's Time to Move On: Primacy Bias and Why It Helps to Forget
      May 7, 2024 Behavioral Differences in Mode-Switching Exploration for Reinforcement Learning
      May 7, 2024 A New Alchemy: Language Model Development as a Subfield?
      May 7, 2024 The Hidden Convex Optimization Landscape of Two-Layer ReLU Networks
      May 7, 2024 Fairness in AI: two philosophies or just one?
      May 7, 2024 Exploring Meta-learned Curiosity Algorithms
      May 7, 2024 Elaborating on the Value of Flow Matching for Density Estimation
      May 7, 2024 Bridging the Data Processing Inequality and Function-Space Variational Inference
      May 7, 2024 Double Descent Demystified
      May 7, 2024 Sample Blog Post (HTML version)
      May 7, 2024 Sample Blog Post
      May 7, 2024 Building Diffusion Model's theory from ground up
      May 7, 2024 Deep Equilibrium Models For Algorithmic Reasoning
      May 7, 2024 On Bayesian Model Selection: The Marginal Likelihood, Cross-Validation, and Conditional Log Marginal Likelihood
      May 7, 2024 How to compute Hessian-vector products?
      May 7, 2024 Masked Language Model with ALiBi and CLAP head
      \ No newline at end of file diff --git a/blog/alibi-mlm/index.html b/blog/alibi-mlm/index.html new file mode 100644 index 00000000..ea522904 --- /dev/null +++ b/blog/alibi-mlm/index.html @@ -0,0 +1,128 @@ + Masked Language Model with ALiBi and CLAP head | ICLR Blogposts 2024

      Masked Language Model with ALiBi and CLAP head

As a new approach to positional encoding, Attention with Linear Biases (ALiBi) uses linear biases on the attention weights to encode positional information, with the capability of context length extrapolation. In their paper, however, Press et al. focus on the perplexity of autoregressive decoder-only language models, leaving open the question of downstream tasks and the applicability of ALiBi to encoder-attention. In this blogpost, we attempt to bridge the gap by testing masked language models (MLMs) with encoder-attention ALiBi and a prediction head similar to the counterparts of the original ALiBi models. We find that while a simplified prediction head may be beneficial, the performance of MLMs with encoder-attention ALiBi starts to deteriorate at a sequence length of 2048 at larger scales. We put our results in the context of related recent experiments and tentatively identify the circumstances that are more challenging for positional encoding designs. Finally, we open-source our MLMs, with BERT-level performance and 2048 context length.

      Adapted and expanded from EIFY/fairseq.

Unmodified and unmasked, the attention mechanism is permutation-invariant, so transformer-based language models employ positional encoding to break the symmetry and enable sequence modeling. In their ICLR 2022 paper, Press et al. introduced Attention with Linear Biases (ALiBi) as a new approach to positional encoding, where the positional information of the tokens is encoded by applying an attention weight bias proportional to the distance between tokens:

\[\mathrm{softmax}\!\left(q_i K^\top + m \cdot \left[-(i-1), \dots, -2, -1, 0\right]\right)\]

where \(m\) is a head-specific slope chosen to follow the geometric sequence \(\frac{1}{2^{0.5}}, \frac{1}{2^1}, \frac{1}{2^{1.5}}, \dots, \frac{1}{2^\frac{n}{2}}\) for a model with \(n\) attention heads. This approach is shown to enable input length extrapolation, in the sense that the perplexity of the model remains stable as the inference context length exceeds the training context length. The paper, however, focuses on autoregressive decoder-only models and relies on model perplexity as the metric, and therefore leaves open the question of whether ALiBi is applicable to MLMs like BERT and RoBERTa. To help answer this question, we tested the two following changes to the RoBERTa baseline models, based on the first-party Fairseq toolkit:

      Attention with Linear Biases (ALiBi)

Since MLMs are based on encoders that attend to tokens both before and after the given position, a decision must be made about how to distinguish tokens before the given position from tokens after it. Press himself suggested the following 3 options for encoder-attention ALiBi:

      1. Symmetric: Keep attention weight bias proportional to the distance between tokens and rely on the context to distinguish between tokens at +N and -N position.
      2. Nonsymmetric, one-sided: Make half of the heads only attend to the tokens before and half of the heads only attend to the tokens after. Weight bias is still proportional to the distance.
      3. Nonsymmetric with different slopes: Make the slopes \(m\) different forward and backward, with either learned or fixed values.

Observing that option 2 spends about half of the attention compute on no-ops and that option 3 can still result in bias value collisions (e.g. \(m_{bwd} = 2 m_{fwd}\) makes the -1 and +2 positions indistinguishable), we implemented both option 1 and what we call “nonsymmetric with offset”: shift the forward linear biases ahead by 0.5 * slope, i.e. the constant bias (right matrix of the figure above) becomes

 0  -.5 -1.5 -2.5 -3.5
-1    0  -.5 -1.5 -2.5
-2   -1    0  -.5 -1.5
-3   -2   -1    0  -.5
-4   -3   -2   -1    0

      Unless otherwise noted, ALiBi for the following experiments means this nonsymmetric-with-offset encoder-attention ALiBi.
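To make the two constructions concrete, here is a minimal PyTorch sketch of ours (not the Fairseq implementation used in the experiments; the helper names are hypothetical) that builds the head-specific slopes and the per-head bias matrices:

import torch

def alibi_slopes(n_heads):
    # Geometric sequence 1/2^0.5, 1/2^1, ..., 1/2^(n/2) from the ALiBi paper
    return torch.tensor([2.0 ** (-0.5 * (i + 1)) for i in range(n_heads)])

def symmetric_bias(slope, seq_len):
    # Option 1: bias proportional to |i - j| in both directions
    pos = torch.arange(seq_len)
    return -slope * (pos[None, :] - pos[:, None]).abs().float()

def offset_bias(slope, seq_len):
    # Nonsymmetric with offset: shift the forward (j > i) biases ahead by 0.5,
    # reproducing the constant-bias matrix shown above
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs().float()
    dist = dist - 0.5 * torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
    return -slope * dist

print(offset_bias(1.0, 5))  # matches the 5x5 matrix above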

      Contrastive Language Pretraining (CLAP) Head

The prediction head is one part of LMs that has received comparatively little attention, and it happens to differ between the ALiBi autoregressive decoder-only models and RoBERTa. Based on the configs and training logs, the ALiBi models use the adaptive word embedding and softmax of Baevski & Auli with weight tying, whereas the RoBERTa prediction head has an additional fully-connected layer and nonlinearity on top of weight tying. Inspired by CLIP, we decided to test what we call the Contrastive Language Pretraining (CLAP) head below, the simplest possible prediction head with weight tying for the masked tokens plus the thermodynamic beta (inverse temperature):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ClapHead(nn.Module):
    """Head for masked language modeling."""

    def __init__(self, initial_beta, weight):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(initial_beta))
        self.weight = weight

    def forward(self, features, masked_tokens=None, normalize=True):
        # Only project the masked tokens while training,
        # saves both memory and computation
        if masked_tokens is not None:
            features = features[masked_tokens, :]
        w = self.weight
        if normalize:
            w = F.normalize(w, dim=-1)
        return self.beta * F.linear(features, w)

      Compared to the baseline RoBERTa prediction head

      class RobertaLMHead(nn.Module):
      +    """Head for masked language modeling."""
      +
      +    def __init__(self, embed_dim, output_dim, activation_fn, weight=None):
      +        super().__init__()
      +        self.dense = nn.Linear(embed_dim, embed_dim)
      +        self.activation_fn = utils.get_activation_fn(activation_fn)
      +        self.layer_norm = LayerNorm(embed_dim)
      +
      +        if weight is None:
      +            weight = nn.Linear(embed_dim, output_dim, bias=False).weight
      +        self.weight = weight
      +        self.bias = nn.Parameter(torch.zeros(output_dim))
      +
      +    def forward(self, features, masked_tokens=None, **kwargs):
      +        # Only project the masked tokens while training,
      +        # saves both memory and computation
      +        if masked_tokens is not None:
      +            features = features[masked_tokens, :]
      +
      +        x = self.dense(features)
      +        x = self.activation_fn(x)
      +        x = self.layer_norm(x)
      +        # project back to size of vocabulary with bias
      +        x = F.linear(x, self.weight) + self.bias
      +        return x

We removed the embed_dim x embed_dim fully-connected layer, the activation function (GELU), the layer norm, and the output_dim trainable bias. Just like CLIP, we added the trainable thermodynamic beta, and we L2-normalize the token embeddings before feeding them to the transformer; the softmax logits are then the inner products between the normalized embeddings and the transformer output, scaled by beta.
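For illustration, here is a toy usage sketch of the ClapHead above (the vocabulary size, dimensions, and initial_beta value are hypothetical, not our training configuration):

import torch

vocab_size, embed_dim = 50000, 768
embed = torch.nn.Embedding(vocab_size, embed_dim)
head = ClapHead(initial_beta=20.0, weight=embed.weight)  # weight tying

features = torch.randn(8, embed_dim)  # transformer outputs at 8 masked positions
logits = head(features)               # (8, vocab_size) similarity logits, scaled by beta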

      Experiments

      WikiText-103

We first tested the changes on the WikiText-103 dataset with a GeForce RTX 3080 16 GB Laptop GPU, using the validation set MLM perplexity as the metric. We tested the baseline (learned positional encoding + RoBERTa prediction head), learned-clap (learned positional encoding + CLAP head), ALiBi (ALiBi + RoBERTa prediction head), and zero-clap (ALiBi + CLAP head), in addition to the baseline but with sinusoidal positional encoding instead of learned positional encoding:

where the solid lines correspond to what’s considered the “canonical” setup and the dotted lines to experiments with the following variations in setup, all of which turned out to be irrelevant:

      1. Whether we use attention dropout or not
      2. Whether we use symmetric ALiBi (option 1) or nonsymmetric-with-offset ALiBi above
3. Whether we use a zero vector or a separate learnable embedding for the mask embedding (footnote: the intention was to test using a zero vector instead of a separate learnable embedding for the mask embedding, which in combination with ALiBi results in no non-semantic information in the input embeddings; however, a bug prevented this variation from working correctly and the end effect was merely deleting the last two words, madeupword0001 and madeupword0002, from the dictionary instead, which we don't expect to be consequential)
      4. Whether we L2-normalize the embeddings for the CLAP head or not
      5. Whether we scale the L2-normalized embeddings by sqrt(embed_dim) (no_scale_embedding=False) or not

      As we can see, the dotted lines are almost on top of the solid lines. Notably, sinusoidal positional encoding underperforms significantly compared to learned positional encoding.

      The Pile

As the next step, we scaled our experiments to train on the Pile for one epoch. About half of the examples in the Pile have sequence length > 1024, so we set the sequence length to 2048. Even so, ~1/7 of the examples have sequence length > 2048 and had to be discarded. In the end, one epoch consists of 133082 updates, and we employ a cosine learning rate schedule while “overestimating” the number of training steps by 10%, as inspired by the Chinchilla paper. In addition to the validation MLM perplexity, we also fine-tuned the models on the GLUE benchmark. As in the original RoBERTa paper, we tested both roberta.base with 125M parameters and roberta.large with 355M parameters. These experiments were performed on 8 x A100 40GB SXM4 GPUs, where the roberta.base experiments took ~3 days and the roberta.large experiments took ~9 days. In the table below, PPL is the final validation MLM perplexity, STS-B is the best validation loss, and all the others are the best validation accuracies over 10 epochs of finetuning.

      roberta.base

             PPL↓ CoLA MNLI MRPC QNLI QQP  RTE  SST-2 STS-B↓
baseline     2.94 83.6 84.2 90.0 91.6 91.3 73.6 92.1  0.028
learned-clap 2.86 81.7 84.4 86.3 90.9 91.2 72.6 92.5  0.027
alibi        2.93 69.2 85.1 80.9 92.0 91.5 63.9 93.1  0.033
zero-clap    2.83 70.5 84.9 75.5 90.6 91.1 54.9 89.7  0.041

      *Baseline but with sinusoidal positional encoding instead of learned positional encoding failed to converge.

      roberta.large

             PPL↓ CoLA MNLI MRPC QNLI QQP  RTE  SST-2 STS-B↓
baseline*    2.55 83.7 86.8 84.3 92.5 91.8 79.8 93.3  0.027
learned-clap 2.50 84.1 86.3 89.7 92.8 91.7 79.8 93.7  0.023
alibi        2.65 69.1 86.5 68.4 92.4 91.7 52.7 93.6  0.123
zero-clap    2.54 69.1 86.7 81.9 92.2 91.6 52.7 93.1  0.031

*Loss spiked somewhere between 24000 and 24500 updates and the model failed to recover. Loosely following the practice of Section 5.1 (Training Instability) in the PaLM paper, we solved the issue by restarting the training from the 20000-updates checkpoint with the PyTorch random seed changed from 1 to 2.

We found that ALiBi no longer helps lower the validation MLM perplexity. Furthermore, ALiBi turned out to be harmful for several specific GLUE tasks (CoLA, MRPC, and RTE). The CLAP head on its own, however, seems to be competitive and in fact outperforms the baseline with roberta.large.

      Conclusions

This seems to be another case where models with lower perplexity do not necessarily yield higher accuracies on downstream tasks, and where architectural changes beneficial at smaller scales do not imply the same at larger scales. The CLAP head, however, is simpler than the standard prediction head for MLMs, requires minimal changes, and may be worth trying, especially at larger scales.

In the broader context, MosaicBERT and LittleBird are most similar to our experiments. In the MosaicBERT paper, Portes et al. also evaluate BERT-style MLMs with symmetric (option 1) encoder-attention ALiBi on the GLUE benchmark and find performance exceeding the BERT baseline within a limited training budget. However, these MosaicBERT models were trained with a much shorter (128) sequence length, and so may have avoided the sequence length regime in which perplexity and performance on certain downstream tasks start to deteriorate (footnote: the same can be said about related work that also reports, in its Table 4, the MLM perplexity of RoBERTa large models trained on an excerpt of the Pile with various positional encodings, including symmetric (option 1) encoder-attention ALiBi with 128 sequence length). The LittleBird architecture is designed for question answering and built with BiALiBi (Bidirectional ALiBi), a variation of option 3 (nonsymmetric with different slopes) where the model learns not only the forward and backward slopes \(m_{fwd}\) and \(m_{bwd}\), but also a special bias value for the attention weight of the global [CLS] token. Lee et al. evaluate LittleBird models on a collection of QA benchmarks for both English and Korean and report favorable performance, but leave open the question of whether they work well for other NLP tasks. Notably, we also found our ALiBi models capable of matching the baseline performance on the question answering task QNLI, so the reported performance is consistent with our experiments, even without appealing to the other differences in architecture or pretraining task.

Finally, what can we say about the original decoder-attention ALiBi and positional encodings in general? The original decoder-attention ALiBi has been shown to help not only perplexity, but also performance on evaluation suites consisting of a diverse set of tasks, like the EleutherAI Language Model Evaluation Harness. This discrepancy may be explained by the causal mask, which has been proven to be sufficient for encoding positional information in theory (footnote: one caveat is that the cited Proof C.1 for absolute positional encoding depends on distinguishing values of unit fractions 1/t, which eventually fails due to the precision limit; for example, 1/1464 can't be distinguished from 1/1465 in float16, well within the context length of interest), even if it does not quite match the performance of models with additional positional encodings in practice. Perhaps we can conclude that

1. Decoder-attention positional encodings really should be considered as causal mask + additional encodings, and how the two complement each other should be taken into account.
      2. Longer context length and certain downstream tasks are more challenging for positional encodings. One worthwhile direction may be to rank their difficulties systematically and iterate on the more challenging circumstances first for future positional encoding designs.

      Model checkpoints

      Final checkpoints for models trained on the Pile:

      roberta.base

      baseline learned-clap alibi zero-clap

      roberta.large

      baseline learned-clap alibi zero-clap

      To load them, install EIFY/fairseq following the original instructions and download the GPT-2 fairseq dictionary:

wget -O gpt2_bpe/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt

Then all of the checkpoints above except the zero-clap ones can be loaded as follows:

$ python
Python 3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from fairseq.models.roberta import RobertaModel
>>> roberta = RobertaModel.from_pretrained('/checkpoint-dir', 'learned-clap-large.pt', '/dict-dir')
(...)
>>> roberta.fill_mask('The capital of China is <mask>.', topk=3)
[('The capital of China is Beijing.', 0.7009016871452332, ' Beijing'), ('The capital of China is Shanghai.', 0.23566904664039612, ' Shanghai'), ('The capital of China is Moscow.', 0.010170688852667809, ' Moscow')]
>>>

The zero-clap ones were trained without the last two madeupword’s (this is due to the same bug that affected the WikiText-103 variation above, and it is that bug's only visible effect), so you need to delete them from dict.txt before loading, i.e.:

(...)
50009 0
50256 0
madeupword0000 0
madeupword0001 0
madeupword0002 0

$ python
Python 3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from fairseq.models.roberta import RobertaModel
>>> roberta = RobertaModel.from_pretrained('/checkpoint-dir', 'zero-clap-large.pt', '/dict-dir')
(...)
>>> roberta.fill_mask('The capital of China is <mask>.', topk=3)
[('The capital of China is Beijing.', 0.7051425576210022, ' Beijing'), ('The capital of China is Shanghai.', 0.21408841013908386, ' Shanghai'), ('The capital of China is Taiwan.', 0.007823833264410496, ' Taiwan')]
>>>

The rest of the original example usage should also just work. While these checkpoints have only been tested with this fork, the baseline ones should also work with the original fairseq repo with minimal changes to the state dict:

>>> import torch
>>> path = '/checkpoint-dir/baseline-large.pt'
>>> with open(path, 'rb') as f:
...   state = torch.load(f, map_location=torch.device("cpu"))
...
>>>
>>> del state['cfg']['task']['omit_mask']
(...)
>>> torch.save(state, '/checkpoint-dir/compatible.pt')
      \ No newline at end of file diff --git a/blog/bench-hvp/index.html b/blog/bench-hvp/index.html new file mode 100644 index 00000000..6d44b5ad --- /dev/null +++ b/blog/bench-hvp/index.html @@ -0,0 +1,202 @@ + How to compute Hessian-vector products? | ICLR Blogposts 2024

      How to compute Hessian-vector products?

      The product between the Hessian of a function and a vector, the Hessian-vector product (HVP), is a fundamental quantity to study the variation of a function. It is ubiquitous in traditional optimization and machine learning. However, the computation of HVPs is often considered prohibitive in the context of deep learning, driving practitioners to use proxy quantities to evaluate the loss geometry. Standard automatic differentiation theory predicts that the computational complexity of an HVP is of the same order of magnitude as the complexity of computing a gradient. The goal of this blog post is to provide a practical counterpart to this theoretical result, showing that modern automatic differentiation frameworks, JAX and PyTorch, allow for efficient computation of these HVPs in standard deep learning cost functions.

Hessian-vector products (HVPs) play a central role in the study and use of the geometric properties of the loss functions of deep neural networks, as well as in many recent bilevel optimizers. However, computing such a quantity is often considered prohibitive by practitioners, discouraging them from using algorithms that rely on HVPs.

With this blog post, we aim to convince practitioners that with modern automatic differentiation (AD) frameworks such as JAX or PyTorch, HVPs can be efficiently evaluated. Indeed, standard AD theory predicts that the computational cost of an HVP is of the same order as the cost of computing a gradient. After a brief introduction on why HVPs are useful for optimization and ML applications and on the basics of AD, we explain in detail the AD-based methods to compute an HVP and the reasons for their efficiency. In particular, we show that one can compute HVPs without explicit Hessian computation. We then compare the different methods to compute HVPs for several deep neural network architectures in terms of time and memory, for both JAX and PyTorch. Our results illustrate the complexity predicted by the theory, showing that computing an HVP is not much more expensive than computing a gradient. This opens an avenue for developing efficient second-order informed methods for neural networks.

      What are HVPs and where are they useful?

Let us first introduce the notions of Hessian and HVP. We will consider in this post a twice differentiable function \(f:\mathbb{R}^d\to\mathbb{R}\) that maps a vector \(\theta\in\mathbb{R}^d\) to a real number. This typically corresponds to a function that maps the value of the parameters \(\theta\) of a neural network to the loss \(f(\theta)\). For such a function, standard AD can be used to efficiently compute the gradient of the loss \(\nabla f(\theta) = \left[ \frac{\partial f}{\partial \theta_i}(\theta)\right]_{1\le i \le d} \in \mathbb{R}^d\) using backpropagation. The Hessian matrix of \(f\) at \(\theta\) is the matrix of its second-order partial derivatives

      \[\nabla^2 f(\theta) = \left[\frac{\partial^2f}{\partial \theta_i\partial \theta_j}(\theta)\right]_{1\leq i,j\leq d}\in\mathbb{R}^{d\times d}\enspace.\]

This matrix corresponds to the derivative of the gradient and captures how the gradient changes when moving \(\theta\). To evaluate the variation of the gradient when moving \(\theta\) in the direction \(v\in\mathbb{R}^d\), one can compute the quantity \(\nabla^2 f(\theta) v\in\mathbb{R}^d\). This is the Hessian-vector product (HVP).

      Let us review some use cases of HVPs in optimization and machine learning.

      Inverse Hessian-vector products (iHVPs) in optimization

When trying to find the minimum of the function \(f\), methods that account for second-order information often rely on the product between the inverse Hessian and a vector to find a good update direction. For instance, Newton’s method relies on update rules of the form

      \[\theta_{k+1} = \theta_k - \eta_k[\nabla^2f(\theta_k)]^{-1}\nabla f(\theta_k)\]

      for some step-size \(\eta_k>0\).

When evaluating the term \([\nabla^2f(\theta_k)]^{-1}\nabla f(\theta_k)\), it would be very inefficient to first compute the full Hessian matrix \(\nabla^2f(\theta_k)\), then invert it, and finally multiply the result with the gradient \(\nabla f(\theta_k)\). Instead, one computes the inverse Hessian-vector product (iHVP) by solving the following linear system

      \begin{equation}\label{eq:linear_system} \nabla^2f(\theta)v = b\enspace. \end{equation}

with \(b = \nabla f(\theta_k)\). This approach is much more efficient, as it avoids computing and storing the full Hessian matrix: it only applies the inverse Hessian to the single vector \(b\).
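As a minimal JAX sketch (f, theta and b are placeholders), one can solve \eqref{eq:linear_system} matrix-free with conjugate gradient, accessing the Hessian only through HVPs, here computed with the forward-over-reverse trick detailed later in this post:

import jax
from jax.scipy.sparse.linalg import cg

def ihvp(f, theta, b):
    # Matrix-vector product v -> Hessian(f)(theta) @ v, without forming the Hessian
    hvp = lambda v: jax.jvp(jax.grad(f), (theta,), (v,))[1]
    # Conjugate gradient solves the linear system with HVPs only
    v, _ = cg(hvp, b)
    return v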

      A second use case for the iHVP in optimization is with bilevel optimization. In bilevel optimization, one wants to solve the following problem

      \begin{equation}\label{eq:bilevel_pb} \min_{x\in\mathbb{R}^d} h(x) = F(x, y^* (x))\quad\text{with}\quad y^*(x) = \arg\min_{y\in\mathbb{R}^p} G(x, y)\enspace. \end{equation}

      The gradient of the function \(h\) can be computed using the implicit function theorem, giving the following expression

\[\nabla h(x) = \nabla_x F(x, y^* (x)) - \nabla^2_{xy}G(x, y^*(x))[\nabla^2_{yy}G(x, y^*(x))]^{-1}\nabla_y F(x, y^*(x))\enspace.\]

Here, the term \(\nabla^2_{yy} G(x, y)\) is the Hessian of the function \(G\) with respect to \(y\). Thus, computing this gradient also requires an iHVP.
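To make this concrete, here is a minimal JAX sketch (F and G are placeholder bilevel objectives and y_star an approximate inner solution) assembling the implicit gradient above entirely from HVPs and one matrix-free linear solve:

import jax
import jax.numpy as jnp
from jax.scipy.sparse.linalg import cg

def hypergradient(F, G, x, y_star):
    grad_F_x = jax.grad(F, argnums=0)(x, y_star)
    grad_F_y = jax.grad(F, argnums=1)(x, y_star)
    # iHVP: solve [d^2_yy G] q = d_y F with HVPs only, as in the sketch above
    hvp = lambda v: jax.jvp(
        lambda y: jax.grad(G, argnums=1)(x, y), (y_star,), (v,)
    )[1]
    q, _ = cg(hvp, grad_F_y)
    # Cross term [d^2_xy G] q, again without forming any matrix
    cross = jax.grad(
        lambda xx: jnp.vdot(jax.grad(G, argnums=1)(xx, y_star), q)
    )(x)
    return grad_F_x - cross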

To compute the iHVP, there are many methods in the literature to solve \eqref{eq:linear_system}, like Neumann iterates, the Conjugate Gradient method, or gradient descent steps on the quadratic form \(v\mapsto \frac12\langle\nabla^2f(\theta)v, v\rangle - \langle b, v\rangle\). These methods rely on HVPs, as illustrated by the highlighted terms in the Conjugate Gradient method. Thus, an efficient implementation of HVPs is crucial for the overall algorithm performance.

      Conjugate gradient to solve \eqref{eq:linear_system}
      Input Initialization \(v_0\)
      Initialization $$ r_0 = \textcolor{orange}{\nabla^2f(\theta) v_0} - b,\quad p_0 = -r_0,\quad t = 0 $$ While \(r_t \neq 0\) \begin{align*} \alpha_t &=\frac{r_t^\top r_t}{p_t^\top \textcolor{orange}{\nabla^2f(\theta) p_t}} \\ v_{t+1} &=v_t + \alpha_t p_t \\ r_{t+1} &=r_t + \alpha_t\textcolor{orange}{\nabla^2f(\theta) p_t} \\ \beta_{t+1} &=\frac{r_{t+1}^\top r_{t+1}}{r_t^\top r_t} \\ p_{t+1} &=-r_{t+1} + \beta_{t+1} p_t\\ t &=t + 1 \end{align*}

      HVPs for the study of the loss landscape

      The study of the geometry of neural networks is an active field that aims at understanding the links between training dynamics, local geometry of the training loss and generalization. One way to study the local geometry of a neural network is to find the distribution of the eigenvalues of its Hessian matrix. Indeed, depending on the sign of the eigenvalues of the Hessian, one can for instance distinguish local minima, local maxima and saddle points. As an illustration, the following figure shows how the sign of the eigenvalues of the Hessian matrix of a function affects the shape of the function’s landscape around a stationary point.

In several papers, an approximation of the Hessian spectrum is computed thanks to the Lanczos algorithm. This algorithm is a modification of the power method where each new iterate is taken in the orthogonal complement of the previous iterates. It outputs a factorization of the Hessian of the form \(\nabla^2 f(\theta) = VTV^\top\) where \(V=(v_0,...,v_{k-1})\) is orthogonal and

      \[T = \begin{pmatrix} \alpha_0& \beta_1 & 0 & \cdots & 0\\ \beta_1 & \alpha_1 & \beta_2 & \ddots & \vdots\\ 0 & \beta_2 & \alpha_2 & \ddots & 0\\ \vdots & \ddots & \ddots & \ddots & \beta_{k-1}\\ 0 & \cdots & 0 & \beta_{k-1} & \alpha_{k-1} \end{pmatrix}\enspace.\]

      Lanczos' algorithm
      Input Initial vector \(v_0\).
      Initialization $$ w'_0 = \textcolor{orange}{\nabla^2f(\theta)v_0},\quad \alpha_0 = w_0'^\top v_0,\quad w_0 = w_0' - \alpha_0 v_0 $$ For \(i = 1,\dots, k-1\):
      \begin{align*} \beta_i &= \|w_{i-1}\|\\ v_{i} &= \frac{w_{i-1}}{\beta_{i}}\\ w_i' &= \textcolor{orange}{\nabla^2f(\theta)v_i}\\ \alpha_i &= w_i'^\top v_i\\ w_i &= w_i' - \alpha_i v_i - \beta_iv_{i-1} \end{align*}

      We observe once again that the Hessian information is accessed through HVPs rather than the full Hessian matrix itself.
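A minimal sketch of this iteration in JAX, assuming a callable hvp implementing \(v\mapsto\nabla^2 f(\theta)v\) and a unit-norm starting vector v0:

import jax.numpy as jnp

def lanczos(hvp, v0, k):
    vs, alphas, betas = [v0], [], []
    w_prime = hvp(v0)
    alphas.append(w_prime @ v0)
    w = w_prime - alphas[0] * v0
    for _ in range(k - 1):
        beta = jnp.linalg.norm(w)
        v = w / beta
        w_prime = hvp(v)
        alpha = w_prime @ v
        w = w_prime - alpha * v - beta * vs[-1]
        vs.append(v)
        alphas.append(alpha)
        betas.append(beta)
    # alphas and betas define the tridiagonal matrix T above, whose
    # eigenvalues approximate those of the Hessian
    return jnp.stack(vs), jnp.array(alphas), jnp.array(betas)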

      A quick detour by automatic differentiation

Automatic differentiation (AD) is an important tool to compute exactly the derivatives of differentiable functions obtained as compositions of simple operations. There are two modes in AD: the forward mode, which computes Jacobian-vector products (JVPs), and the reverse mode, which computes vector-Jacobian products (VJPs). Since the gradient of a scalar function is a special case of the VJP, the reverse mode is the most frequently used in machine learning. It is typically used to compute the gradients of deep learning cost functions, where it is called backpropagation.

In what follows, we briefly present the notion of computational graph and the two AD modes. For a more detailed explanation, we refer the reader to the excellent survey by Baydin et al.

      Computational graph

A key ingredient of AD is the computational graph associated with the code that evaluates a function. It is a directed acyclic graph that represents the succession of elementary operations required to evaluate a function. A simple computational graph of a function \(f:\mathbb{R}^d\to\mathbb{R}^p\) is typically a chain \(\theta \to z_1 \to z_2 \to \dots \to z_n = f(\theta)\).

In this graph, the vertices \(z_i\in\mathbb{R}^{m_i}\) represent the intermediate states of the evaluation of \(f\). To get the vertex \(z_i\), we use the values of its parents in the graph, here \(z_{i-1}\), through simple transfer functions \(z_i(z_{i-1})\). The computational complexity of the function evaluation depends on the complexity of the considered graph, as one node might have more than one parent. The memory footprint of the evaluation is also linked to the maximum number of parents a vertex can have in the computational graph, as parent values need to be stored until all their children have been computed.

Let us take an example with a linear multilayer perceptron (MLP) with 2 layers. The function \(f_x:\mathbb{R}^{h\times h}\times \mathbb{R}^{h\times p}\to \mathbb{R}\) is defined for an input \(x\in\mathbb{R}^p\) by

\begin{equation}\label{eq:mlp} f_x(U, W) = \frac12\|UWx\|^2\enspace. \end{equation}

Here, the input \(\theta\) corresponds to the parameters of the network \((U, W)\) and the intermediate steps are \(z_1 = Wx\), \(z_2 = Uz_1\) and \(z_3 = \frac12 \|z_2\|^2\). A possible computational graph to get \(f_x(U, W)\) is the following

      and the associated Python code to compute \(f_x\) is

import jax.numpy as jnp

def f(U, W):
    # x is a fixed input vector from the enclosing scope
    z1 = W @ x
    z2 = U @ z1
    z3 = 0.5 * jnp.sum(z2 ** 2)
    return z3

      Here, the feed-forward structure of the function makes the computational graph very simple, as each node has a single intermediate result parent.

AD uses this computational graph to compute the function’s derivatives. Using the chain rule, the Jacobian \(\frac{\partial f}{\partial \theta}(\theta)\) of \(f\) is obtained as a product of the Jacobians of the intermediate states \(z_1, \dots, z_n\). \begin{equation}\label{eq:chain_rule} \underbrace{\frac{\partial f}{\partial \theta}(\theta)}_{p\times d} = \frac{\partial z_n}{\partial \theta} =\frac{\partial z_n}{\partial z_1}\frac{\partial z_1}{\partial \theta}=\cdots = \underbrace{\frac{\partial z_n}{\partial z_{n-1}}}_{p\times m_{n-1}}\underbrace{\frac{\partial z_{n-1}}{\partial z_{n-2}}}_{m_{n-1}\times m_{n-2}}\cdots\underbrace{\frac{\partial z_1}{\partial \theta}}_{m_1\times d}\enspace. \end{equation} Depending on the order of the multiplications, one can compute the derivative of \(f\) with respect to \(\theta\) in two ways: the forward mode and the reverse mode.

      Forward mode

For a vector \(v\in\mathbb{R}^d\), the Jacobian-vector product (JVP) corresponds to the directional derivative of \(f\) in the direction \(v\). It can be computed by forward-mode AD

      \begin{equation}\label{eq:chain_rule_jvp} \frac{\partial f}{\partial \theta}(\theta)\times v = \frac{\partial z_n}{\partial z_{n-1}}\frac{\partial z_{n-1}}{\partial z_{n-2}}\cdots\frac{\partial z_1}{\partial \theta}v\enspace. \end{equation}

It consists in doing the multiplications in \eqref{eq:chain_rule_jvp} from the right to the left. It is a forward pass in the computational graph where we propagate at the same time the states \(z_i\) and the partial derivatives \(\frac{\partial z_{i+1}}{\partial z_i}\). If \(f\) is real-valued, the \(i\)th coordinate of its gradient is exactly given by the product of the Jacobian of \(f\) and the \(i\)th canonical basis vector \(e_i\) since \begin{equation} \frac{\partial f}{\partial \theta_i}(\theta) = \lim_{t\to 0}\frac{f(\theta+te_i)-f(\theta)}{t}\enspace. \end{equation} Thus, we can get its gradient by computing each of the \(d\) JVPs \(\left(\frac{\partial f}{\partial \theta}(\theta)\times e_i\right)_{1\leq i \leq d}\) with forward AD.

To understand properly what happens when using forward differentiation, let us go back to the linear MLP defined in \eqref{eq:mlp}. If we implement the forward differentiation ourselves to get the JVP, we obtain the following code

def jvp(U, W, v_u, v_w):
    # Forward diff of f
    z1 = W @ x
    v_z1 = v_w @ x  # Directional derivative of W -> W @ x in the direction v_w

    z2 = U @ z1
    v_z2 = U @ v_z1 + v_u @ z1  # Directional derivative of (U, z1) -> z2 in the direction (v_u, v_z1)

    v_z3 = v_z2 @ z2  # Directional derivative of z2 -> 0.5 * jnp.sum(z2**2) in the direction v_z2
    return v_z3

      In comparison with the code of the evaluation of \(f_x\), there are two more operations corresponding to the computation of the dual variables v_z1 and v_z2. In terms of memory, if we consider the computation of the JVP as coded in the previous snippet, the maximum number of parents of a vertex is four. This maximum is achieved by the vertex v_z2 which has the vertices U, v_z1, v_u and z1 as parents.

      In JAX, we get the JVP of a function \(f\) in the direction \(v\) with jax.jvp(f, (params, ), (v, ))[1].

      Reverse mode

The reverse mode is also known as backpropagation in the context of deep learning. For \(u\in\mathbb{R}^p\), it aims at computing VJPs

      \begin{equation}\label{eq:chain_rule_vjp} u^\top\frac{\partial f}{\partial \theta}(\theta) = u^\top\frac{\partial z_n}{\partial z_{n-1}}\frac{\partial z_{n-1}}{\partial z_{n-2}}\cdots\frac{\partial z_1}{\partial \theta}\enspace. \end{equation}

In reverse-mode AD, the multiplications of \eqref{eq:chain_rule_vjp} are done from the left to the right. It requires one forward pass in the computational graph to compute the intermediate states \(z_i\), and then a backward pass to propagate the successive partial derivatives from the left to the right. Contrary to the forward mode, it has a larger memory footprint. Indeed, it requires storing the values of all the states. For instance, to compute the term \(\frac{\partial z_3}{\partial z_2}\), one needs the value of \(z_2\), which was computed early in the forward pass. If \(f\) is real-valued, \(u\) is a scalar and the VJP is the gradient of \(f\) multiplied by \(u\). Thus, one can get the gradient of \(f\) by using \(u=1\) and performing only one reverse differentiation. This makes the reverse mode more efficient for computing gradients.

Let us observe what happens if we manually code the backpropagation to get the gradient of the function \(f_x\) defined by \(f_x(U, W) = \frac12\|UW x\|^2\).

def gradient(U, W):
    # Forward pass
    z1 = W @ x
    z2 = U @ z1
    z3 = 0.5 * jnp.sum(z2 ** 2)

    # Reverse pass
    ## Transfer function: z3 = 0.5 * jnp.sum(z2 ** 2)
    dz2 = z2  # derivative of z3 wrt z2

    ## Transfer function: z2 = U @ z1
    dU = jnp.outer(dz2, z1)  # derivative of z3 wrt U
    dz1 = U.T @ dz2  # derivative of z3 wrt z1

    ## Transfer function: z1 = W @ x
    dW = jnp.outer(dz1, x)  # derivative of z3 wrt W

    return dU, dW

This function returns the gradient of \(f_x\). Reading this code, we see that one needs to store all the intermediate values of the forward pass. Indeed, if we look at the case of z1, which is the first node computed, it is used four steps later for the computation of dU.

      To get the gradient in JAX, one can use jax.grad(f)(params).

      Naive computation of HVPs

      Since we are interested in computing \(\nabla^2 f(\theta)v\), the simplest way to do it is to compute the Hessian matrix and then multiply it by the vector \(v\). This can be achieved in JAX by calling jax.hessian(f)(params) @ v.

This method is quite cumbersome, making it impossible to use for deep neural networks. Indeed, storing the full Hessian matrix has \(\mathcal{O}(d^2)\) complexity, where \(d\) is the dimension of the model’s parameter set.

The good news is that we can compute HVPs without computing the Hessian, thanks to clever uses of AD.

      HVPs without explicit Hessian computation

In 1994, Pearlmutter proposed to leverage the following observation to compute HVPs efficiently: the HVP is also the directional derivative of the gradient in the direction \(v\):

      \[\nabla^2f(\theta) v = \lim_{\epsilon\to 0} \frac1\epsilon[\nabla f(\theta+\epsilon v)-\nabla f(\theta)] = \nabla [\langle \nabla f(.), v\rangle](\theta)\enspace.\]

Based on this identity, AD enables computing HVPs in three ways, as described in the JAX documentation.
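As a quick numerical sanity check of this identity on a toy function of our choosing, the HVP obtained from the naive Hessian computation matches a finite difference of gradients:

import jax
import jax.numpy as jnp

f = lambda theta: jnp.sum(jnp.sin(theta) ** 2)
theta = jnp.array([0.1, 0.2, 0.3])
v = jnp.array([1.0, -1.0, 0.5])
eps = 1e-3

finite_diff = (jax.grad(f)(theta + eps * v) - jax.grad(f)(theta)) / eps
naive_hvp = jax.hessian(f)(theta) @ v
print(jnp.allclose(finite_diff, naive_hvp, atol=1e-2))  # True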

      Forward-over-reverse

The forward-over-reverse mode consists in doing forward differentiation through a computational graph of the gradient of \(f\).

      Its implementation in JAX is only two lines of code.

def hvp_forward_over_reverse(f, params, v):
    return jax.jvp(jax.grad(f), (params, ), (v, ))[1]

In this case, jax.grad(f)(params) is computed by reverse-mode AD, whose complexity is twice the complexity of evaluating \(f\). Thus, the time complexity of hvp_forward_over_reverse is roughly four times the complexity of evaluating \(f\).

      To better see what happens, let us consider again our function \(f_x\) defined by \eqref{eq:mlp}. The Python code of the forward-over-reverse HVP is the following.

def forward_over_reverse(U, W, v_U, v_W):
    # Forward through the forward pass through f
    z1 = W @ x
    v_z1 = v_W @ x

    z2 = U @ z1
    v_z2 = U @ v_z1 + v_U @ z1

    # z3 = 0.5 * jnp.sum(z2 ** 2)
    # Forward through the backward pass through f
    z4 = z2  # dz2
    v_z4 = v_z2  # v_dz2

    z5 = jnp.outer(z4, z1)  # dU
    v_z5 = jnp.outer(v_z4, z1) + jnp.outer(z4, v_z1)  # v_dU

    z6 = U.T @ z4  # dz1
    v_z6 = U.T @ v_z4 + v_U.T @ z4  # v_dz1

    z7 = jnp.outer(z6, x)  # dW
    v_z7 = jnp.outer(v_z6, x)  # v_dW

    return v_z5, v_z7  # v_dU, v_dW

The take-home message of this part is that, after computing the gradient of \(f_x\), one can consider a computational graph of this gradient and perform forward differentiation through this new computational graph. Here, the variables z1, …, z7 are the vertices of a computational graph of the gradient of \(f_x\). A nice side effect is that this mode yields the gradient and the HVP at the same time. Indeed, in the previous snippet, z5 and z7 are the components of the gradient of \(f_x\), which could also be returned if needed. This feature can be useful in bilevel optimization, for instance.

      Reverse-over-reverse

Instead of doing forward differentiation of the gradient, one can take the inner product of the gradient with \(v\) to get a scalar, and then backpropagate through this scalar product. This is the reverse-over-reverse mode.

      It can be implemented by these lines of code.

def hvp_reverse_over_reverse(f, params, v):
    return jax.grad(lambda y: jnp.vdot(jax.grad(f)(y), v))(params)

Since the gradients are computed by backpropagation, the complexity of hvp_reverse_over_reverse is twice the complexity of jax.grad(f), i.e., roughly four times the complexity of evaluating \(f\).

Writing down the code of the reverse-over-reverse HVP for our function \(f_x\) defined by \eqref{eq:mlp} helps us understand the differences between this mode and the forward-over-reverse mode. In particular, one can notice that there are more elementary operations in the reverse-over-reverse mode than in the forward-over-reverse mode. Moreover, in terms of memory footprint, reverse-over-reverse requires storing the values of the vertices of the computational graph of the gradient of \(f_x\), while forward-over-reverse only needs to store the values of the vertices of the computational graph of \(f_x\). Thus, the former is less efficient than the latter.

def reverse_over_reverse(U, W, v_u, v_w):
    # Forward through <grad(f), v>
    ## Forward through f
    z1 = W @ x
    z2 = U @ z1
    # z3 = 0.5 * jnp.sum(z2 ** 2), the value of f, is not needed here

    ## Reverse through f
    z4 = z2  # dz2
    z5 = jnp.outer(z4, z1)  # dU
    z6 = U.T @ z4  # dz1
    z7 = jnp.outer(z6, x)  # dW

    # Output: dot product <grad(f), v>
    z8 = jnp.sum(z5 * v_u) + jnp.sum(z7 * v_w)

    # Backward through z8 = <grad(f), v>
    ## z8 = jnp.sum(z5 * v_u) + jnp.sum(z7 * v_w)
    dz7 = v_w
    dz5 = v_u

    ## z7 = jnp.outer(z6, x)
    dz6 = dz7 @ x

    ## z6 = U.T @ z4
    dz4 = U @ dz6
    ddU = jnp.outer(z4, dz6)  # Derivative of z8 wrt U

    ## z5 = jnp.outer(z4, z1)
    dz4 += dz5 @ z1
    dz1 = dz5.T @ z4

    ## z4 = z2
    dz2 = dz4

    ## z2 = U @ z1
    dz1 += U.T @ dz2
    # As U appears multiple times in the graph, we sum its contributions
    ddU += jnp.outer(dz2, z1)

    ## z1 = W @ x
    ddW = jnp.outer(dz1, x)  # Derivative of z8 wrt W

    return ddU, ddW

      Reverse-over-forward

What about doing forward differentiation of \(f\) rather than reverse propagation? This is what is done in the reverse-over-forward mode: it consists in backpropagating through the computational graph of the JVP of \(f\) in the direction \(v\).

def hvp_reverse_over_forward(f, params, v):
    jvp_fun = lambda params: jax.jvp(f, (params, ), (v, ))[1]
    return jax.grad(jvp_fun)(params)

This method is more memory-efficient than the previous one: since we backpropagate only once, the memory burden is lower than in the reverse-over-reverse approach. In comparison with forward-over-reverse, the complexity is the same. However, forward-over-reverse enables computing the gradient of \(f\) and the HVP at the same time, which is not the case for the reverse-over-forward mode.

      The code of the reverse-over-forward HVP for the MLP \(f_x\) defined by \eqref{eq:mlp} is the following.

def reverse_over_forward(U, W, v_U, v_W):
    # Forward diff of f to get <grad(f), v>
    z1 = W @ x
    z6 = v_W @ x  # v_z1

    z2 = U @ z1
    z5 = U @ z6 + v_U @ z1  # v_z2

    # output <grad(f), v>
    z4 = z5 @ z2  # v_z3

    # Backward pass through <grad(f), v>
    ## z4 = z5 @ z2
    dz2 = z5
    dz5 = z2  # dv_z2

    ## z5 = U @ z6 + v_U @ z1
    dz1 = v_U.T @ dz5
    dz6 = U.T @ dz5  # dv_z1 (not needed: z6 depends on v_W, not on the parameters)
    ddU = jnp.outer(dz5, z6)  # derivative of z4 wrt U

    ## z2 = U @ z1
    # As U and dz1 appear multiple times, we sum their contributions
    dz1 += U.T @ dz2
    ddU += jnp.outer(dz2, z1)

    ## z1 = W @ x
    ddW = jnp.outer(dz1, x)
    return ddU, ddW
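To check that these three modes indeed compute the same quantity, here is a small sanity test of ours comparing them against the naive Hessian computation on a toy quadratic:

import jax
import jax.numpy as jnp

f = lambda theta: 0.5 * jnp.sum((jnp.arange(1.0, 4.0) * theta) ** 2)
theta = jnp.array([1.0, -2.0, 3.0])
v = jnp.array([0.5, 1.0, -1.0])

naive = jax.hessian(f)(theta) @ v
for hvp in (hvp_forward_over_reverse,
            hvp_reverse_over_reverse,
            hvp_reverse_over_forward):
    print(jnp.allclose(hvp(f, theta, v), naive))  # True, True, True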

      Benchmark with deep learning architectures

While these three methods compute the same outputs, the different ways of traversing the computational graph change their overall time and memory complexities. We now compare the computation of HVPs with these three methods for various deep-learning architectures. To cover a broad range of use cases, we consider a residual network (ResNet34) and a transformer-based architecture (ViT-base) for image classification, as well as a transformer for natural language processing (BERT-base). We use the Flax and PyTorch implementations of these architectures available in the transformers package provided by Hugging Face 🤗.

All computations were run on an Nvidia A100 GPU with 40 GB of memory. We used version 0.4.21 of JAX and version 2.1.1 of PyTorch.

      The code of the benchmark is available on this repo.

      Time complexity

The first comparison we make is in terms of wall-clock time between the different ways to compute HVPs, together with the computation of a gradient by backpropagation. For each architecture, we compute the gradient of the model with respect to the parameters by backpropagation, and we compute the HVPs in forward-over-reverse, reverse-over-forward and reverse-over-reverse modes. For each computation, we measure the time taken. Specifically for the HVPs, we subtract the time taken by a gradient computation, to isolate the overhead required by the HVP computation. The inputs for each architecture are generated randomly. For the ResNet34 architecture, we generated a batch of images of size 224x224x3. To limit out-of-memory issues in the experiments, we generated images of size 96x96x3 for the ViT architecture. For the BERT architecture, we generated a batch of sequences of length 32.
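As a simplified sketch of one such timing measurement in JAX (the loss and inputs here are placeholders; the actual benchmark code is in the repo linked above):

import statistics
import time

import jax
import jax.numpy as jnp

f = lambda theta: jnp.sum(jnp.tanh(theta) ** 2)  # placeholder loss
theta = jnp.ones(1000)
v = jnp.ones(1000)

hvp = jax.jit(lambda t, u: jax.jvp(jax.grad(f), (t,), (u,))[1])
hvp(theta, v).block_until_ready()  # warm-up run, excludes compilation time

times = []
for _ in range(90):
    start = time.perf_counter()
    hvp(theta, v).block_until_ready()
    times.append(time.perf_counter() - start)
print(f"median HVP time: {statistics.median(times):.2e} s")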

We first use JAX with just-in-time compilation. Each computation is run 90 times. On the left of the figure, we plot the median computation time along with the 20th and 80th percentiles in black. The computations are done with a batch size of 128. We observe that, in practice, the overhead of the HVP computation over the gradient computation is between one and two times the time of a gradient computation for the three architectures. Consequently, a whole HVP computation takes between two and three times the time of a gradient calculation. This is consistent with the theory. One can notice that reverse-over-reverse is slightly slower than the others in all cases. Forward-over-reverse and reverse-over-forward are very close in terms of time.

      We also report on the right figure the computational time of each method with respect to the batch size for the ResNet34 architecture. We observe, as expected, that the computational time scales linearly with the batch size.

We run a similar experiment with the functional API available in PyTorch, torch.func, which is similar to the one JAX has. The results we get are more mixed.
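For instance, the forward-over-reverse mode translates almost verbatim to torch.func; a sketch, again assuming a scalar loss of a single flat parameter tensor:

    import torch
    from torch.func import grad, jvp

    def hvp_forward_over_reverse(f, params, v):
        # JVP of the gradient function, mirroring the JAX version.
        return jvp(grad(f), (params,), (v,))[1]

    # Toy check with a quadratic loss whose Hessian is the identity.
    f = lambda p: 0.5 * (p ** 2).sum()
    params, v = torch.randn(10), torch.randn(10)
    assert torch.allclose(hvp_forward_over_reverse(f, params, v), v)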

In the case of ResNet34, the scaling between the different methods is similar to the one we get with JAX. During our experiments, we found that batch normalization made the forward computation slow and induced out-of-memory issues; we therefore removed the batch normalization layers from the ResNet34 architecture.

For ViT and BERT, forward-over-reverse is surprisingly slower than the reverse-over-reverse method. Moreover, the scaling between the gradient and HVP computation times differs from the one we get with JAX. Indeed, for these architectures, the HVP computations take four to five times as long as the gradient computations, a discrepancy with what we would expect in theory. This might be because, at the time of writing this blog post, the functional API of PyTorch is still in its early stages. In particular, we could not use compilation with torch.compile because it does not work with some operators of torch.func, such as torch.func.jvp.

      Memory complexity

We also compare the memory footprint of each approach. The following figure provides the results we get with JIT-compiled JAX code. On the left, we represent the result for each method and model with a batch size of 64. On the right, we show the evolution of the memory footprint of each method for the ResNet34 with the batch size. Surprisingly, we observe that the memory footprint of the different methods to compute HVPs does not vary for a given model. This is counterintuitive, since we would expect the reverse-over-reverse method to have a larger memory footprint due to the double backpropagation.

However, when we repeat the experiment with JIT compilation disabled, the results corroborate the theory. Indeed, one can observe in the following figure that the memory footprint of the reverse-over-reverse method is larger than that of the forward-over-reverse and reverse-over-forward methods. This is because reverse-over-reverse involves two successive backward differentiations, while the other two involve only one reverse differentiation. Moreover, the memory footprint scales linearly with the batch size, which was not the case in the previous figure in the small batch size regime.

In light of these two results, the clever memory allocation performed during just-in-time compilation significantly reduces the memory footprint of the HVP computations.

In the following figure, we plot the results we get with the PyTorch implementation. One can observe that in all cases, forward-over-reverse consumes more memory than reverse-over-forward; it is almost at the same level as reverse-over-reverse, which is quite unexpected.

The right plot shows that the memory footprint for the ResNet34 architecture evolves linearly with the batch size, as expected.
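As a methodological note, peak memory on the PyTorch side can be tracked through CUDA's allocator statistics; a minimal sketch (the helper name is ours; the JAX measurements rely on a separate memory profiler):

    import torch

    def peak_memory_mib(fn, *args):
        # Reset the allocator's high-water mark, run the computation,
        # and report the peak GPU memory allocated during the call.
        torch.cuda.reset_peak_memory_stats()
        fn(*args)
        torch.cuda.synchronize()
        return torch.cuda.max_memory_allocated() / 2**20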

      Conclusion

In this blog post, we have explored the different ways to compute HVPs from theoretical and practical perspectives. The three take-home messages to keep in mind are the following:

      • We can compute HVPs without computing Hessian matrices.

      • In practice, computing an HVP takes between twice and four times the time taken by a gradient computation and requires two to three times more memory than computing a gradient.

• The AD framework and whether just-in-time compilation is used affect the practical performance of HVP computations in both time and memory.

      For attribution in academic contexts, please cite this work as
      +        PLACEHOLDER FOR ACADEMIC ATTRIBUTION
      +  
      BibTeX citation
      +        PLACEHOLDER FOR BIBTEX
      +  
      \ No newline at end of file diff --git a/blog/category/data-processing-inequality/index.html b/blog/category/data-processing-inequality/index.html new file mode 100644 index 00000000..f2417e52 --- /dev/null +++ b/blog/category/data-processing-inequality/index.html @@ -0,0 +1 @@ + Data Processing Inequality | ICLR Blogposts 2024

      Data Processing Inequality

      an archive of posts in this category

      \ No newline at end of file diff --git a/blog/category/entropy-regularization/index.html b/blog/category/entropy-regularization/index.html new file mode 100644 index 00000000..55706bfb --- /dev/null +++ b/blog/category/entropy-regularization/index.html @@ -0,0 +1 @@ + Entropy Regularization | ICLR Blogposts 2024

      Entropy Regularization

      an archive of posts in this category

      \ No newline at end of file diff --git a/blog/category/function-space-variational-inference/index.html b/blog/category/function-space-variational-inference/index.html new file mode 100644 index 00000000..fc2f0e0f --- /dev/null +++ b/blog/category/function-space-variational-inference/index.html @@ -0,0 +1 @@ + Function-Space Variational Inference | ICLR Blogposts 2024

      Function-Space Variational Inference

      an archive of posts in this category

      \ No newline at end of file diff --git a/blog/category/information-theory/index.html b/blog/category/information-theory/index.html new file mode 100644 index 00000000..2c47fa15 --- /dev/null +++ b/blog/category/information-theory/index.html @@ -0,0 +1 @@ + Information Theory | ICLR Blogposts 2024

      Information Theory

      an archive of posts in this category

      \ No newline at end of file diff --git a/blog/category/label-entropy-regularization/index.html b/blog/category/label-entropy-regularization/index.html new file mode 100644 index 00000000..bb6ac36a --- /dev/null +++ b/blog/category/label-entropy-regularization/index.html @@ -0,0 +1 @@ + Label Entropy Regularization | ICLR Blogposts 2024

      Label Entropy Regularization

      an archive of posts in this category

      \ No newline at end of file diff --git a/blog/category/parameter-equivalence-classes/index.html b/blog/category/parameter-equivalence-classes/index.html new file mode 100644 index 00000000..f07ef3fd --- /dev/null +++ b/blog/category/parameter-equivalence-classes/index.html @@ -0,0 +1 @@ + Parameter Equivalence Classes | ICLR Blogposts 2024

      Parameter Equivalence Classes

      an archive of posts in this category

      \ No newline at end of file diff --git a/blog/clml/index.html b/blog/clml/index.html new file mode 100644 index 00000000..9ebeb4e3 --- /dev/null +++ b/blog/clml/index.html @@ -0,0 +1,113 @@ + On Bayesian Model Selection: The Marginal Likelihood, Cross-Validation, and Conditional Log Marginal Likelihood | ICLR Blogposts 2024

      On Bayesian Model Selection: The Marginal Likelihood, Cross-Validation, and Conditional Log Marginal Likelihood

      Bayesian model selection has long relied on the marginal likelihood and related quantities, often motivated by the principle of Occam's razor. Following the paper 'Bayesian Model Selection, the Marginal Likelihood, and Generalization' by Lotfi et al. (2022/2023), this blog post critically examines the conventional focus on the marginal likelihood and related quantities for Bayesian model selection as a direct consequence of Occam's razor. We find that the suitability of these criteria depends on the specific context and goals of the modeling task. We revisit the concepts of log marginal likelihood (LML), cross-validation, and the recently introduced conditional log marginal likelihood (CLML), highlighting their connections and differences through an information-theoretic lens. Through thought experiments and empirical observations, we explore the behavior of these model selection criteria in different data regimes under model misspecification and prior-data conflict, finding that the conditional marginal cross-entropy, closely related to cross-validation, is often more reliable for optimizing generalization performance. We review relevant literature, compare the CLML and validation loss for deep neural networks, and using a toy Bayesian linear regression, we demonstrate that all the discussed quantities can fail to reliably predict generalization. Our takeaways are that: there is no one-size-fits-all solution; the choice of model selection quantity depends on the specific context and goals; and in the future, we should take into account model complexity as well and not assume a uniform model prior. While the post is limited by the need for more rigorous theoretical justification, a broader range of models and datasets (and deeper engagement with philosophical implications), it rightly questions the primacy of the (conditional) log marginal likelihood and encourages critical thinking about its foundations, aiming for a more nuanced understanding of Bayesian model selection.

      $$\require{mathtools} \DeclareMathOperator{\opExpectation}{\mathbb{E}} \newcommand{\E}[2]{\opExpectation_{#1} \left [ #2 \right ]} \newcommand{\simpleE}[1]{\opExpectation_{#1}} \newcommand{\MidSymbol}[1][]{\:#1\:} \newcommand{\given}{\MidSymbol[\vert]} \DeclareMathOperator{\opmus}{\mu^*} \newcommand{\IMof}[1]{\opmus[#1]} \DeclareMathOperator{\opInformationContent}{H} \newcommand{\ICof}[1]{\opInformationContent[#1]} \newcommand{\xICof}[1]{\opInformationContent(#1)} \DeclareMathOperator{\opEntropy}{H} \newcommand{\Hof}[1]{\opEntropy[#1]} \newcommand{\xHof}[1]{\opEntropy(#1)} \DeclareMathOperator{\opMI}{I} \newcommand{\MIof}[1]{\opMI[#1]} \DeclareMathOperator{\opTC}{TC} \newcommand{\TCof}[1]{\opTC[#1]} \newcommand{\CrossEntropy}[2]{\opEntropy(#1 \MidSymbol[\Vert] #2)} \newcommand{\iCrossEntropy}[3]{\opEntropy_{#1 \Vert #2}[#3]} \DeclareMathOperator{\opKale}{D_\mathrm{KL}} \newcommand{\Kale}[2]{\opKale(#1 \MidSymbol[\Vert] #2)} \newcommand{\iKale}[3]{\opKale_{,\, #1 \Vert #2}[#3]} \DeclareMathOperator{\opJSD}{D_\mathrm{JSD}} \newcommand{\JSD}[2]{\opJSD(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opp}{p} \newcommand{\pof}[1]{\opp(#1)} \newcommand{\hpof}[1]{\hat{\opp}(#1)} \newcommand{\pcof}[2]{\opp_{#1}(#2)} \newcommand{\hpcof}[2]{\hat\opp_{#1}(#2)} \DeclareMathOperator{\opq}{q} \newcommand{\qof}[1]{\opq(#1)} \newcommand{\hqof}[1]{\hat{\opq}(#1)} \newcommand{\qcof}[2]{\opq_{#1}(#2)} \newcommand{\varHof}[2]{\opEntropy_{#1}[#2]} \newcommand{\xvarHof}[2]{\opEntropy_{#1}(#2)} \newcommand{\varMIof}[2]{\opMI_{#1}[#2]} \newcommand{\w}{\boldsymbol{\theta}} \newcommand{\W}{\boldsymbol{\Theta}} \newcommand{\h}{\boldsymbol{\phi}} \newcommand{\hopt}{\boldsymbol{\h^\star}} \newcommand{\H}{\boldsymbol{\Phi}} \DeclareMathOperator{\opf}{f} \newcommand{\fof}[1]{\opf(#1)} \newcommand{\xset}[3]{(\x_n^{#1})_{n=#2}^{#3}} \newcommand{\xNset}{(\x_n)_{n=1}^N} \newcommand{\XNtuple}{(\X_n)_{n=1}^N} \newcommand{\xNtuple}{(\x_n)_{n=1}^N} \newcommand{\XNset}{\{\X_n\}_{n=1}^N} \newcommand{\xNset}{\{\x_n\}_{n=1}^N} \newcommand{\XNsetk}{\{\X_n\}_{n=N-k+1}^N} \newcommand{\xNsetk}{\{\x_n\}_{n=N-k+1}^N} \newcommand{\XNkset}{\{\X_n\}_{n=1}^{N-k}} \newcommand{\xNkset}{\{\x_n\}_{n=1}^{N-k}} \newcommand{\XNoset}{\{\X_n\}_{n=1}^{N-1}} \newcommand{\y}{y} \newcommand{\Y}{Y} \newcommand{\L}{\boldsymbol{L}} \newcommand{\x}{\boldsymbol{x}} \newcommand{\X}{\boldsymbol{X}} \newcommand{\oppdata}{\hat{\opp}_{\text{data}}} \newcommand{\pdata}[1]{\hpcof{\text{data}}{#1}} \newcommand{\normaldist}[1]{\mathcal{N}(#1)} \newcommand{\wstddev}{\sigma_\w} \newcommand{\noisestddev}{\sigma_\text{noise}} \newcommand{\Dataset}{\mathcal{D}} \newcommand{\Dtrain}{\Dataset_{\text{train}}} \newcommand{\Dval}{\Dataset_{\text{val}}} $$

      Introduction

      Model selection is a crucial aspect of machine learning, as it allows us to choose the most appropriate model for a given task. In the Bayesian setting, the marginal likelihood has been a popular tool for model selection and hyperparameter learning, often motivated by the principle of Occam’s razor. However, the suitability of the marginal likelihood depends on the specific context and goals of the modeling task.

      Recently, the paper “Bayesian Model Selection, the Marginal Likelihood, and Generalization” by Lotfi et al. (2022/2023), which was accepted as Outstanding Paper and Long Oral at ICML 2022, examined the importance and challenges of model selection in machine learning, focusing on the log marginal likelihood (LML) and proposing a variant, the conditional log marginal likelihood (CLML). The authors argue that while LML is a useful tool for hypothesis testing, it may not be the best metric for model selection and for predicting the generalization performance of trained models or learning hyperparameters. They introduce the CLML as a potential improvement and demonstrate its effectiveness across various settings, including density models, Fourier features, Gaussian Processes, and deep neural networks.

      In this blog post, inspired by the above paper, we (re-)derive insights that challenge the conventional focus on the marginal likelihood and related quantities for Bayesian model selection. We argue that the quantities we examine are all consequences of Occam’s razor, and thus no single quantity should be considered universally superior. Instead, the choice of model selection criterion should be guided by the context and the desired outcomes. We highlight that many recently proposed metrics for model selection, including CLML, are closely related to cross-validation and have failure cases that can be explained by considering model misspecification and prior-data conflicts. Overall, the choice between these metrics should be based on the specific requirements of the task at hand.

We begin by discussing the foundations of model selection, including the role of Occam’s razor and its relationship to maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation. We then introduce the concepts of log marginal likelihood (LML), cross-validation, and conditional log marginal likelihood (CLML), highlighting their connections and differences. Through a series of thought experiments and empirical observations, we explore the behavior of these model selection criteria in various scenarios, such as under model misspecification, prior-data conflict, and in different data regimes. We find that the conditional marginal cross-entropy, which is closely related to cross-validation, is often a more reliable choice when the primary objective is to select for generalization performance. On the other hand, the conditional joint marginal cross-entropy (permutation-invariant negative CLML) may be preferable when the focus is on sequential prediction and online learning. At the same time, the joint marginal information (negative LML) is rarely the right choice for model selection. We review relevant literature, including the work of Fong and Holmes (2020) on the connection between the LML and cross-validation, the training speed estimators by Lyle et al. (2020) and Ru et al. (2021), and the experiments of Lotfi et al. (2022/2023), comparing the CLML and validation loss for deep neural networks (DNNs). These studies provide valuable insights into the strengths and limitations of different model selection criteria.

      Throughout the post, we emphasize the importance of considering the context, available data, and desired outcomes when selecting the most appropriate metric for model selection and hyperparameter tuning. By questioning the primacy of the (conditional) joint marginal likelihood and encouraging critical thinking about the foundations of these quantities, we hope to foster a more nuanced understanding of Bayesian model selection.

      (Bayesian) Model Selection

      In our daily lives, we’re often faced with choices that require us to sift through competing explanations or decisions. Imagine you hear your doorbell ring. You might think it’s the delivery you’ve been waiting for, a neighbor dropping by, or perhaps you didn’t hear anything at all, and it was just your imagination. In deciding between these options, you’re likely to lean towards the simplest explanation that aligns with your expectations—say, the long-awaited delivery. This inclination towards simplicity has a formal counterpart in scientific discovery and machine learning, known as Occam’s razor:

      This concept is further illustrated using an example from chapter 28 of David MacKay’s seminal book, “Information Theory, Inference, and Learning Algorithms”, where the essence of selecting between models based on their evidence is laid out succinctly.

      Occam’s razor --- How many boxes are in the picture (figure 28.1)? In particular, how many boxes are in the vicinity of the tree? If we looked with x-ray spectacles, would we see one or two boxes behind the trunk (figure 28.2)? (Or even more?) Occam’s razor is the principle that states a preference for simple theories. ‘Accept the simplest explanation that fits the data’. Thus according to Occam’s razor, we should deduce that there is only one box behind the tree. Is this an ad hoc rule of thumb? Or is there a convincing reason for believing there is most likely one box? Perhaps your intuition likes the argument ‘well, it would be a remarkable coincidence for the two boxes to be just the same height and colour as each other’. If we wish to make artificial intelligences that interpret data correctly, we must translate this intuitive feeling into a concrete theory.
      Excerpt from page 343 in David MacKay’s "Information Theory, Inference, and Learning Algorithms.”

      But how can we express this formally using mathematics?

      In the next section, we will use information-theoretic concepts to formalize Occam’s razor and connect it to the maximum likelihood estimation (MLE) and maximum-a-posteriori (MAP) estimation approaches. This formalization highlights that Occam’s razor, as a general principle favoring simplicity, can motivate various techniques, not just Bayesian ones. Therefore, using Occam’s razor as the sole justification for Bayesian model selection may not be as compelling as it initially appears.

      However, one could argue that when Occam’s razor is properly applied within a Bayesian framework, it captures a more nuanced notion of complexity. From this perspective, the Bayesian formulation of Occam’s razor favors models that strike a balance between goodness-of-fit and model complexity, where complexity is measured by the model’s ability to compress the data. This view is consistent with the minimum description length (MDL) principle, which posits that the best model is the one that minimizes the total description length of both the model and the data given the model.

      From Philosophical Principle to Mathematical Statement

      Let’s first connect Occam’s razor to Maximum Likelihood Estimation (MLE) before diving deeper into the background and (Bayesian) model selection.

      In information theory, the information content of an event \(x\) is defined as \(-\log_2 \pof{x}\), where \(\pof{x}\) is the probability of that event occurring according to a given model. This is also called Shannon’s information content. We use the base \(2\) for logarithms and measure information in bits (binary digits), and for the rest of the post, we will drop the base of the logarithm. The information content measures the optimal encoding length in bits for the event \(x\) under the model specified by its probability distribution \(\pof{\cdot}\). In the context of probabilistic modeling, variables that cannot be directly observed are called latent variables. Occam’s razor suggests that we should prefer simpler explanations for latent variables, given the observed data.

      Consider a model with a latent variable \(z\) and observed data \(x\). The model specifies a probability distribution \(\pof{z \given x}\). According to Occam’s razor, we prefer simpler explanations, which correspond to smaller values of \(-\log \pof{z \given x}\). Using Bayes’ theorem, we can rewrite this as:

      \[\text{minimize } z \text{ in } -\log \pof{z \given x} = -\log \pof{x \given z} - \log \pof{z} + \log \pof{x}.\]

Given that \(\pof{x}\) is independent of \(z\), we can omit it from our objective. Additionally, if we posit a uniform (or non-informative) prior for \(z\), implying that all potential values of \(z\) are equally probable before observing \(x\), then \(\pof{z}\) becomes constant and can also be dropped from our objective. This simplifies our preference to:

      \[\text{minimize } z \text{ in } -\log \pof{x \given z}.\]

      Equivalently, we can maximize \(\pof{x \given z}\), which is the likelihood of the observed data \(x\) given the latent variable \(z\). When making a decision and selecting a single value for \(z\), this leads to the maximum likelihood estimation (MLE) approach.

      In summary, the connection between Occam’s razor and MLE relies on the following assumptions:

      1. Shannon’s information content is how we measure complexity.
      2. The prior distribution for the latent variables is uniform (or uninformative).
      3. Simpler explanations, as measured by the information content, are preferred (Occam’s razor).

      Under these assumptions, the preference for simpler explanations leads to the MLE approach, where more likely values of the latent variable given the observed data are preferred.

Maximizing the likelihood is common in machine learning because we can directly optimize the likelihood function. Still, this is not easy for deep learning models, which have a large number of parameters and a non-convex loss function.

      Maximum-a-Posteriori Estimation

      However, the assumption of a uniform or non-informative prior for the latent variables is not always valid or desirable. In many cases, we have prior knowledge about the latent variables that can be incorporated into the model. This leads to the Maximum-A-Posteriori (MAP) Estimation as an alternative to MLE.

      In MAP estimation, \(\pof{z}\) is not constant, so we cannot drop it—we can still drop \(\pof{x}\), however—and maximize the joint distribution \(\pof{z, x}\), or equivalently:

      \[\text{minimize } z \text{ in } -\log \pof{x, z}=-\log \pof{x \given z} - \log \pof{z}.\]

Before we go further, we need to introduce notation for information-theoretic quantities and concepts that we will use throughout the post. (This next section is mostly shared with the sister post.)

      Background: Information-Theoretic Notation and Concepts

Information theory deals with the communication and quantification of information. (See the excellent "Visual Information Theory" by Chris Olah for a visual introduction to information theory.) In this post, we use a unified information-theoretic notation to express various quantities related to probability distributions and their relationships. (It largely follows "A Practical & Unified Notation for Information-Theoretic Quantities in ML".) Here are some key concepts we will use:

      The information content of an event \(x\) is denoted as \(\Hof{x}\) and is defined as \(-\log_2 \pof{x}\), where \(\pof{x}\) is the probability of event \(x\) occurring. It represents the minimum amount of information needed to describe the occurrence of \(x\) given an underlying probability distribution. \(\Hof{x \given y}\) and \(\Hof{x, y}\) are analogously defined and denote the conditional and joint information content of random variables \(X\) and \(Y\), respectively. In machine learning, the information content is often used as a minimization objective, represented as the negative log-likelihood or cross-entropy when averaged over a dataset (see below).

      The entropy \(\Hof{X}\) of a random variable \(X\) is the expectation of its information content:

      \[\Hof{X} \triangleq \E{\pof{x}}{\Hof{x}} = \E{\pof{x}}{-\log \pof{x}}.\]

      The entropy measures the average amount of information needed to describe the random variable \(X\). It provides a measure of uncertainty or randomness associated with \(X\). We can similarly define the entropy of a conditional distribution \(\Hof{X \given Y}\) and the joint entropy \(\Hof{X, Y}\).

      We will also use the Kullback-Leibler divergence \(\Kale{\pof{X}}{\qof{X}}\) and the cross-entropy \(\CrossEntropy{\pof{X}}{\qof{X}}\):

      \[\begin{aligned} \CrossEntropy{\pof{X}}{\qof{X}} & = \E{\pof{x}}{-\log \qof{x}}\\ \Kale{\pof{X}}{\qof{X}} & = \CrossEntropy{\pof{X}}{\qof{X}} - \Hof{X} \end{aligned}\]

The cross-entropy quantifies the average number of bits needed to encode samples drawn from the true distribution \(\pof{X}\) using a different distribution \(\qof{X}\). The Kullback-Leibler divergence measures the difference between two probability distributions and captures the additional bits needed to encode samples from \(\pof{X}\) using \(\qof{X}\) compared to encoding them using the true distribution \(\pof{X}\).
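As a toy numerical illustration of these definitions (the two distributions are chosen arbitrarily):

    import jax.numpy as jnp

    p = jnp.array([0.9, 0.1])  # "true" distribution p(X)
    q = jnp.array([0.6, 0.4])  # model distribution q(X)

    entropy = -jnp.sum(p * jnp.log2(p))        # H[X], about 0.47 bits
    cross_entropy = -jnp.sum(p * jnp.log2(q))  # H(p(X) || q(X)), about 0.80 bits
    kl = cross_entropy - entropy               # D_KL(p(X) || q(X)), about 0.33 bits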

      Expressing Occam’s Razor in Information-Theoretic Terms

      Taking this notation into account, we can express Occam’s razor as:

\[\text{minimize } z \text{ in } \Hof{z \given x},\]

      where \(Z\) is the latent variable and \(X\) is the observed data. Note that \(x\) and \(z\) are individual realizations of the random variables \(X\) and \(Z\), respectively.

      The MLE and MAP objectives are accordingly:

      \[\text{minimize } z \text{ in } \Hof{x \given z} \text{ for MLE and } \Hof{x, z} \text{ for MAP.}\]

This measures the number of bits we need to encode the observed data given the latent variable for MLE, and the number of bits to encode both the observed data and the latent variable for MAP. This relates Occam’s razor to the minimum description length principle. (See the Wikipedia article on Minimum Description Length for more details.)
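To make the two objectives concrete, consider a toy 1-D Gaussian model; a sketch with an (arbitrary) unit-variance likelihood and standard-normal prior:

    import jax.numpy as jnp

    x = jnp.array([0.5, 1.5, 1.0])  # observed data

    # MLE: minimize H[x | z], the negative log-likelihood of a unit-variance
    # Gaussian with mean z; the minimizer is the sample mean.
    z_mle = x.mean()

    # MAP: minimize H[x, z] = H[x | z] + H[z], where H[z] = z^2/2 + const
    # for a standard-normal prior. The extra prior term shrinks the
    # minimizer towards the prior mean 0:
    z_map = x.sum() / (x.size + 1.0)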

      Hyperparameter Learning and Model Selection

      In many machine learning tasks, we need to determine the best hyperparameters for a model or select the most suitable model architecture from several discrete options. The primary goal is to find the hyperparameters or model that generalizes best to new, unseen data.

      Both cases can be viewed as inferring a random variable \(\H\), which represents either the model choice as a categorical distribution or the hyperparameters as a continuous distribution. In this sense, \(\H\) can be considered as another latent variable in the model.

      For consistency, we will continue using \(\x\) to denote data points throughout this post. Although it is common to use \(\y\) for predictions and \(\x\) for side channel information, we will not require this distinction here and will stick to \(\x\) for simplicity.

      The same arguments discussed previously also apply in this context, and we can express the objective as:

      \[\text{minimize } \h \text{ in } \Hof{\x \given \h}.\]

      Model Parameters

      In addition to the hyperparameters \(\H\), we usually have model parameters \(\W\) for a given \(\h\) with a parameter distribution \(\pof{\w \given \h}\) that we need to infer based on observed data. These parameters are the learnable components of the model, such as the weights and biases in a neural network. For given \(\w\) and \(\h\), we can easily compute the likelihood \(\pof{\x \given \w, \h}\), which represents the probability of observing the data \(\x\) given the specific values of the parameters and hyperparameters. However, to make predictions or compute the marginal likelihood, we will need to consider the uncertainty in the parameter values by integrating over all possible \(\w\).

      Bayesian Model Averaging

      Bayesian Model Averaging (BMA) is a technique that integrates, or marginalizes, over the model parameters \(\W\) when making predictions. This accounts for the uncertainty in the model parameters, which is particularly useful when dealing with complex models, high-dimensional parameter spaces, and limited data. In contrast to the MLE or MAP estimate, which use a single parameter value \(\w\) for predictions, BMA provides a more robust and comprehensive approach. The probability of a new data point \(\x'\) under BMA is given by:

      \[\pof{\x' \given \x, \h} = \int \pof{\x' \given \x, \w, \h} \pof{\w \given \x, \h} \, \mathrm{d}\w,\]

      where \(\pof{\w \given \x, \h}\) is the posterior distribution of the parameters given the data, and \(\pof{\x' \given \x, \w, \h}\) is the likelihood of the new data point given the parameters, hyperparameters, and training data.

      While BMA offers benefits, it is computationally challenging, particularly when dealing with high-dimensional parameter spaces commonly encountered in deep learning models. To make BMA tractable, various approximation methods, such as Markov Chain Monte Carlo (MCMC) and Variational Inference, have been proposed.

      Marginal Likelihood and Estimation

      Let’s now discuss the marginal likelihood and its relation to BMA. The marginal likelihood, denoted as \(\pof{\x \given \h}\), is the likelihood of the observed data given the hyperparameters, marginalized over all possible parameter values \(\W\). It is also known as the model evidence. To compute the marginal likelihood, we integrate over all possible \(\w\):

      \[\pof{\x \given \h} = \int \pof{\x \given \w, \h} \pof{\w \given \h} \, d\w,\]

      where \(\pof{\x \given \w, \h}\) is the likelihood of the data given the parameters and hyperparameters, and \(\pof{\w \given \h}\) is the prior distribution of the parameters given the hyperparameters.
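This integral is rarely tractable, but a naive Monte Carlo sketch conveys the idea; all names here are ours, with log_lik(w, x) assumed to return \(\log \pof{\x \given \w, \h}\) and sample_prior to draw from \(\pof{\w \given \h}\):

    import jax
    import jax.numpy as jnp
    from jax.scipy.special import logsumexp

    def log_marginal_likelihood(key, log_lik, sample_prior, x, n_samples=1000):
        # log p(x | h) = log E_{p(w|h)}[p(x | w, h)], estimated by drawing
        # parameters from the prior and averaging likelihoods in log-space.
        ws = sample_prior(key, n_samples)
        log_liks = jax.vmap(lambda w: log_lik(w, x))(ws)
        return logsumexp(log_liks) - jnp.log(n_samples)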

      Comparing BMA to the marginal likelihood, we see that they match for individual data points. However, for multiple data points (i.e., conditioning on datasets), the marginal likelihood is more complex. “BMA” typically refers to making predictions for a single new data point, while the marginal likelihood can be considered for many points simultaneously. Apart from this difference, the two are equivalent. Let’s discuss the case of multiple data points in more detail to understand why computing the marginal likelihood on datasets is even more challenging.

      Datasets instead of Individual Data Points

      So far, we have described everything as if we only had a single data point \(x\). However, in practice, we often have a dataset \(\xNtuple = (\x_1, \x_2, \ldots, \x_N)\).

      Joint Marginal Information and Cross-Entropy

      The easiest way to extend the previous definitions is to simply substitute \(\xNset\) for \(\x\) and assume we can compute a likelihood for the entire dataset using its joint predictive distribution:

      \[\pof{\xNtuple \given \h} = \int \pof{\x_1, \x_2, \ldots, \x_N \given \w, \h} \, \pof{\w \given \h} \, d\w.\]

      We can then maximize this likelihood or equivalently minimize the joint marginal information \(\Hof{\xNtuple \given \h}.\)

      If our model is exchangeable, meaning the order of the \(\x_n\) does not matter, we can equivalently take an expectation over all permutations of the data to obtain the joint marginal cross-entropy:

\[\CrossEntropy{\pdata{\X_1, \ldots, \X_N}}{\pof{\X_1, \ldots, \X_N \given \h}},\]

      where \(\pdata{\cdot}\) is an empirical data distribution that allows us to draw samples without replacement. In this case, the joint marginal information and cross-entropy are equivalent.

      With exchangeability, we can simply write \(\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNset}\) instead of using the tuple notation \(\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNtuple}\) as the order of the data points does not matter.

Conversely, if a model is not exchangeable, we can induce exchangeability by averaging over all permutations of the data points via ensembling. For example, deep learning models trained with stochastic gradient descent are generally not exchangeable, as the order and composition of the batches can impact the results. However, we can make them effectively exchangeable by training multiple models and averaging their predictions. In the limit of infinite models, the resulting ensemble will be exchangeable. (The ensemble might not necessarily perform better though, as papers on training curricula have shown that batch order can be important.)

      The joint marginal cross-entropy turns a potentially non-exchangeable joint information into an exchangeable one by taking an expectation.

      Marginal Information and Cross-Entropy

      Before we try to understand these joint expressions, we should consider alternative ways to extend the previous definitions.

      For instance, we could take the average of the likelihoods for individual data points:

      \[\frac{1}{N} \sum_{n=1}^N \pof{\x_n \given \h}.\]

      Assuming an underlying data distribution \(\pdata{x}\), we can also express this as an attempt to estimate:

      \[\E{\pdata{\x}}{\pof{\x \given \h}} = \int \pof{\x \given \h} \, \pdata{\x} \, d\x.\]

      This provides an average score for the data likelihood.

      However, from the perspective of Occam’s razor, simply taking the average likelihood is not the most principled approach. Instead, we can leverage information theory, which has been our tool of choice thus far. Recall that we prefer small values of the marginal information \(\Hof{\x \given \h}\). By taking the expectation over the data distribution, we obtain the individual marginal cross-entropy:

      \[\CrossEntropy{\pdata{\X}}{\pof{\X \given \h}} = \E{\pdata{\x}}{-\log \pof{\x \given \h}}.\]

      This cross-entropy measures the average number of bits needed to encode the data using the model’s probability distribution. As it does not involve a joint distribution, we refer to it simply as the marginal cross-entropy.

      It is evident that the marginal cross-entropy and the average likelihood are not equivalent. Using the convexity of the negative logarithm and Jensen’s inequality, we see that the marginal cross-entropy is always larger than the negative logarithm of the average likelihood:

      \[\begin{aligned} \CrossEntropy{\pdata{\X}}{\pof{\X \given \h}} &= \E{\pdata{\x}}{-\log \pof{\x \given \h}} \\ &\geq -\log \E{\pdata{\x}}{\pof{\x \given \h}} \\ &\approx -\log \frac{1}{N} \sum_{n=1}^N \pof{\x_n \given \h}. \end{aligned}\]

The negative log-likelihood (NLL) is frequently used to evaluate a model’s performance after training, typically on a held-out validation set. This is equivalent to computing the cross-entropy between the empirical distribution of the validation set and the model’s predictive distribution, conditioned on the parameters learned from the training data:

      \[\CrossEntropy{\hpcof{\text{val}}{\X'}}{\pof{\X' \given \xNtuple, \h}}\]

      It is essential to distinguish this from the cross-entropy computed on the prior distribution of the model parameters before seeing any data, which is less useful for evaluating a trained model’s performance:

      \[\CrossEntropy{\hpcof{\text{val}}{\X'}}{\pof{\X' \given \h}}\]

      Only the NLL on a validation set conditioned on the training data provides an estimate of the model’s generalization ability after training. The same holds for the quantities marginalized over the model parameters.

      Marginal Cross-Entropy vs Joint Cross-Entropy

      Occam’s razor does not clearly specify which aggregate metric on \(\Hof{\x \given \h}\) we should prefer. Instead of the mean, we could use the median or a different quantile of the information content as a summary statistic to assess the model’s performance on the dataset. This might be more robust, as it is less sensitive to outliers.

      Crucially, the marginal cross-entropy and related summary statistics measure the model’s performance using the “prior” parameter distribution, not the posterior conditioned on data. However, the joint distribution captures something else, which can be seen more clearly using the chain rule:

\[\Hof{\xNset \given \h} = \sum_{n=1}^N \Hof{\x_n \given \x_1, \ldots, \x_{n-1}, \h}\]

      Each term is a conditional marginal information on the previous data points. Similarly, when we take an expectation over the data distribution, we obtain a chain of conditional marginal cross-entropies:

\[\begin{aligned} & \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNtuple} = \\ &\quad = \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_1} + \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_2 \given \X_1} \\ &\quad \quad + \ldots + \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_N \given \X_1, \X_2, \ldots, \X_{N-1}} \\ &\quad = \sum_{n=1}^N \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_n \given \X_{n-1}, \ldots, \X_1}. \end{aligned}\]

Each term in the sum is a conditional marginal cross-entropy conditioned on the previous data points, which differs from the marginal cross-entropy (which we recognize as the first term).
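This decomposition can be made concrete with a conjugate toy model. The sketch below accumulates the joint marginal information of a Beta-Bernoulli model term by term, scoring each point under the sequentially updated posterior predictive (the prior parameters are arbitrary):

    import jax.numpy as jnp

    def joint_marginal_information(xs, a=1.0, b=1.0):
        # H[x_1, ..., x_N | h] for a Beta(a, b)-Bernoulli model,
        # accumulated via the chain rule.
        info, heads, tails = 0.0, a, b
        for x in xs:
            p_one = heads / (heads + tails)  # posterior predictive p(x_n = 1 | x_<n, h)
            info += -jnp.log2(p_one if x == 1 else 1.0 - p_one)
            heads += x       # conjugate posterior update
            tails += 1 - x
        return info  # equals -log2 p(x_1, ..., x_N | h)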

      The following visualization summarizes the relationship between the conditional and joint marginal cross-entropies and information. The chain rule tells us that the area under the curve of the conditional quantities equals the joint quantity.

      The relationship between conditional and joint marginal cross-entropies and information. Left: Conditional marginal cross-entropy (blue) for a multi-class classification problem. The area under the curve (orange) represents the joint marginal cross-entropy. As the dataset size increases, the conditional marginal cross-entropy decreases and converges to the best achievable loss for the given model hypothesis \(\h\). Right: Conditional marginal information (green). The area under the curve (red) represents the joint marginal information. The conditional marginal information is a noisy estimate of the conditional marginal cross-entropy, as it is computed on individual data points.

In summary, the marginal and joint cross-entropies offer different perspectives on a model’s performance. (Recent works by Ian Osband et al., starting with The Neural Testbed: Evaluating Joint Predictions, can help build intuitions for joint predictions. Similarly, a gentler introduction, comparing marginal and joint predictions, can also be found in the arXiv note Marginal and Joint Cross-Entropies & Predictives for Online Bayesian Inference, Active Learning, and Active Sampling.) In particular:

      • The marginal cross-entropy and related summary statistics assess the model’s performance using the prior parameter distribution, without considering the effect of the data on the model.
      • The joint marginal cross-entropy, expressed as a sum of conditional marginal cross-entropies, captures the model’s online learning performance as it processes the data sequentially.

      While both metrics are useful for evaluating models, the joint marginal cross-entropy provides insight into how well the model learns from the data during training. The conditional marginal cross-entropy, on the other hand, is more suitable for assessing the model’s generalization ability at a given point in time, without the influence of parameter updates.

      Intermediate Comparison

      This brings us back to the earlier question of what metric we should prefer and use for model selection. Let’s consider:

      1. The marginal cross-entropy, as in the first term, is likely not useful for model selection with deep learning models, as it is not conditioned on any data and thus cannot correlate well with the model’s performance after training.

      2. If we care about the model’s “generalization” performance after training on \(N-1\) data points without further adaptation, the marginal cross-entropy on the last data point is the more relevant quantity:

        \[\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_N \given \X_{N-1}, \ldots, \X_1}\]

It measures the model’s performance on the last data point after having seen all previous data points, similar to a “leave-one-out” metric. Indeed, it is equivalent to leave-one-out cross-validation when we have an empirical data distribution consisting of \(N\) data points and sample without replacement (see the sketch after this list).

      3. More generally, it is equivalent to cross-validation when we hold out more than one data point for evaluation from the empirical data distribution:

        \[\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X' \given \X_{N-k}, ..., \X_{1}}.\]

        This is the same expression as in (2.) but we assume there are more samples to draw from in the empirical data distribution \(\pdata{\x'}\). We call this term the conditional marginal cross-entropy and keep in mind its connection to cross-validation.

      4. On the other hand, if we care about the model’s performance as an online learner, or in the case of LLMs, as an in-context learner, the joint marginal cross-entropy becomes a more relevant metric. It measures the model’s ability to adapt and make accurate predictions as it sequentially processes new data points, conditioned on the information it has seen so far.

        In the context of online learning, the model receives data points one at a time and updates its predictions based on the cumulative knowledge gained from previous data points. The joint marginal cross-entropy captures how well the model incorporates this sequential information to make accurate predictions for future data points.

        Similarly, for in-context learning of LLMs, the model is provided with a prompt or context consisting of a sequence of data points, and it is expected to generate accurate completions or predictions based on this context. The joint marginal cross-entropy measures the model’s ability to effectively utilize the provided context to make accurate predictions for the next data point in the sequence.

      5. However, we would not want to use the unconditional joint marginal cross-entropy, but rather condition on some initial data to be closer to the actual use case of the model, which will have been (pre-)trained already. As such, we are interested in estimating a conditional joint marginal cross-entropy:

        \[\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \XNkset}.\]

        By conditioning on the previously seen data points, this metric assesses the model’s capacity to learn and adapt its predictions based on the evolving context. It provides a more fine-grained evaluation of the model’s sequential prediction performance, taking into account the specific order and dependencies within the data.

        Moreover, the conditional joint marginal cross-entropy can be used to compare different models or hyperparameter settings in terms of their online learning or in-context learning capabilities. By evaluating this metric on held-out data sequences, we can determine which model or setting is better suited for tasks that require sequential adaptation and context-dependent predictions.

      6. If we have a preferred order of the data points (or a split in the case of exchangeability), we can also consider the conditional joint marginal information:

        \[\Hof{\xNsetk \given \xNkset, \h}.\]

Its negative is known as the conditional (joint) log marginal likelihood (CLML).

      7. All these quantities are equally valid from the perspective of Occam’s razor.

      8. We have not yet discussed how to efficiently estimate these quantities, especially for deep learning models. More importantly, we have already considered that the joint marginal information (marginal likelihood), BMA, and the joint marginal cross-entropy (as an expectation over the marginal likelihood) are not easy to estimate.
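Returning to the Beta-Bernoulli toy model from above, the leave-one-out view from point (2.) can be made concrete. A sketch, assuming xs is a Python list of 0/1 outcomes and arbitrary prior parameters:

    import jax.numpy as jnp

    def loo_conditional_marginal_ce(xs, a=1.0, b=1.0):
        # Leave-one-out estimate of the conditional marginal cross-entropy:
        # condition the Beta(a, b)-Bernoulli posterior on all other points
        # and score the held-out one, averaged over folds.
        total = 0.0
        for n, x in enumerate(xs):
            rest = xs[:n] + xs[n + 1:]
            heads = a + sum(rest)
            tails = b + len(rest) - sum(rest)
            p_one = heads / (heads + tails)
            total += -jnp.log2(p_one if x == 1 else 1.0 - p_one)
        return total / len(xs)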

      This brings us to one of the main points:

This is a crucial point that has not been sufficiently considered in the literature on model selection and hyperparameter learning, where the model evidence and marginal likelihood have been presented as the ultimate criteria. In practice, we rarely update a model on additional data during inference—this is changing with the advent of LLMs and strong in-context learners, but it is still not the norm.

      But why has the marginal likelihood been the preferred choice for model selection so far then?

      Different Data Regimes

      To explore when the conditional marginal cross-entropy and joint marginal cross-entropy lead to different outcomes for model selection and hypothesis testing, let’s consider a few key scenarios.

      For the discrete case, we can reduce the question to one about ranking: if we have two possible hyperparameter choices \(\h_1\) and \(\h_2\), when do we get the same ranking \(\h_1 \succ \h_2\) for both metrics?

      Model Misspecification

      First, let’s examine the case when we have a large amount of data available. Here, model misspecification, a common concern, plays a crucial role.

      As renowned statistician George Box famously stated:

      All models are wrong, but some are useful.

      George Box, Science and Statistics (1976)

      When working with real-world data, we must always assume that our models are misspecified to some degree. Models simplify complex systems and cannot capture every nuance of the data-generating process. Consequently, the goal of model selection is not to find the “true” model but rather to identify the most useful model that balances simplicity, interpretability, and predictive performance.

Without model misspecification, we would always converge to the maximum likelihood estimate (MLE) that matches the data-generating model in the infinite data limit, as the Bernstein–von Mises theorem tells us that posteriors concentrate around the MLE in the limit. However, in practice, we are always dealing with misspecified models, and the MLE will not converge to the true data-generating model.

      Infinite Data Limit

      Let’s return to our question of when the different quantities lead to similar rankings.

While a conditional joint marginal cross-entropy, as a sum of conditional marginal cross-entropies, is obviously larger than each individual term, dividing it by the number of samples in the conditional joint distribution yields its rate as a per-sample average ("rate" here refers to the average amount of cross-entropy or information per training sample, in analogy to the entropy rate in Shannon's information theory, and is distinct from other common uses of "rate" in machine learning, such as learning rate or convergence rate), which can be more easily related:

\[\begin{aligned} & \frac{1}{k} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \XNkset} \\ &\quad = \sum_{n=N-k+1}^N \frac{1}{k} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_n \given \X_{n-1}, ..., \X_1}. \end{aligned}\]

Bernstein–von Mises’ theorem tells us that the posterior distribution of the model parameters converges to a normal distribution around the MLE as the number of data points goes to infinity. (There are likely fewer caveats to this statement than the naive interpretation of the theorem implies, because we are usually not interested in converging towards some unique and identifiable parameters but rather in the predictions matching the data-generating process.) This means that the later terms in the chain rule decomposition of the joint cross-entropy will converge to the same value in the infinite sample limit as the data we condition on becomes infinite. If we take the limit, we can ignore the first terms in the chain rule decomposition of the joint cross-entropy, and we will get the same average value for the terms of the joint cross-entropy (one per sample in the joint) and the conditional cross-entropy. This matches a similar result on entropy rates in “Elements of Information Theory” by Cover & Thomas.

      Overall, we have (without formal proof):

      \[\begin{aligned} &\lim_{N \to \infty} \frac{1}{N} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNset} = \\ &\quad = \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^N \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_n \given \X_{n-1}, ..., \X_1} \\ &\quad = \lim_{N \to \infty} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X' \given \XNset}. \end{aligned}\]

      Given sufficient data (in the infinite sample limit), we see that either of these quantities will lead to the same ranking of different hyperparameters/model hypotheses. Conversely, we can expect to see meaningful differences only in low-data regimes, where the model is not yet fully adapted to the data.

      Finally, in the infinite data limit, for the conditional marginal cross-entropy, we don’t need to take an expectation over the data we condition on (as the model parameters will still have converged):

\[\begin{aligned} &\lim_{N \to \infty} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \XNkset} \\ &\quad = \lim_{N \to \infty} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \xNkset}, \end{aligned}\]

for any \(\xNkset \sim \pdata{\xNkset}\) as \(N \to \infty\). More importantly, this also holds for the joint marginal information, whose rate in the limit is the same as the rate of the joint marginal cross-entropy above (and thus also the conditional marginal cross-entropy):

      \[\begin{aligned} &\lim_{N \to \infty} \frac{1}{N} \Hof{\xNset \given \h} = \\ &\quad = \lim_{N \to \infty} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X' \given \XNset}. \end{aligned}\]

      We have previously mentioned the connection between cross-validation, leave-one-out validation, and the conditional marginal cross-entropy. This result also connects the marginal likelihood in the limit to these quantities.

      Thus:

      The catch is that “sufficient data” might be a very large amount of data, especially for highly expressive models like neural networks.

      Hence, we only expect these quantities to be meaningfully different in the low-data regime. So let’s focus on the low-data regime now.

      Prior-Data Conflict

      Even if different hyperparameter choices lead to the same generalization loss in the infinite data limit, they can induce different priors that affect the convergence speed and model performance in the low-data regime.

      In the low-data regime, assuming all models converge to the same validation loss given infinite data, we prefer the model that converges the fastest, i.e., with the least amount of training data. A model with a prior well-aligned with the data distribution learns efficiently and generalizes better with limited data.

      Conditional marginal cross-entropy vs. dataset size under different modeling scenarios. Left: Model misspecification - Three model hypotheses (\(\h_1\), \(\h_2\), \(\h_3\)) converge to different losses due to the model class not containing the true data-generating process. The minimum achievable loss represents the misspecification error. Right: Prior-data conflict - Three model priors (\(\h_1\), \(\h_2\), \(\h_3\)) converge to the same loss but at different speeds due to varying alignment with the data distribution. Priors with more mass near the MLE converge faster. Real-world models often face both prior-data conflict and model misspecification.

      In this scenario, the area under the conditional marginal cross-entropy or information curve (equivalent to the joint marginal cross-entropy, or joint marginal information) indicates the preferred model. The model with the lowest joint marginal information (highest log marginal likelihood) fits the available data best while having a prior enabling efficient learning and generalization.

      Anti-Correlated Model Misspecification and Prior-Data Conflict

      Finally, what happens when there are both model misspecification and a prior-data conflict in the low-data regime? If both are correlated, the ranking will be preserved, but if they are anti-correlated, the ranking might change.

      Let’s visualize this: the curves will intersect at some point, and the model with the best achievable loss in the infinite data limit might not be the best choice in the low-data regime, depending on how much data we can train on. The optimal model choice may also change based on the amount of available data.

      The conditional marginal cross-entropy is plotted for three different model hypotheses (\(\h_0\), \(\h_1\), \(\h_2\)) as a function of dataset size. The models exhibit both prior-data conflict and model misspecification. In the small data regime, \(\h_2\) has the lowest loss due to its prior aligning well with the data distribution, allowing for faster initial learning. However, as more data becomes available, the models’ asymptotic performance quickly plateaus. First, \(\h_1\) takes over, and then finally \(\h_0\), which converges to the lowest achievable loss in the infinite data limit, indicating it suffers the least from model misspecification. In contrast, \(\h_1\) and \(\h_2\) converge to higher loss values due to greater misspecification. Notably, the models’ performance ranking changes multiple times as the dataset grows, with \(\h_2\) being initially favored but ultimately having the worst infinite-data loss. Each model ranks best for the conditional joint marginal cross-entropy for some chosen range. This illustrates how the interplay between prior-data conflict and model misspecification can lead to different model selection decisions depending on the amount of available data and the metric used to measure performance.

      Here, the joint marginal cross-entropy and the joint marginal information (log marginal likelihood) might not lead to the same decision because the area under the curve at the start might be larger than what the best model can save later. This could change the ranking of the models compared to the conditional marginal cross-entropy (leave-one-out cross-validation) at the end of training, which serves as a proxy for the model’s generalization performance.

      Instead, the conditional joint marginal cross-entropy and information can shine here by conditioning “away” the beginning of the curve, thus giving us a better estimate of the conditional marginal cross-entropy (or expected information) at the point of interest.

      To formalize this, we can use the chain rule to split the joint marginal cross-entropy into two terms:

\[\begin{aligned} &\underbrace{\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNset}}_{\text{Joint Marginal Cross-Entropy}} = \\ &\quad = \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNkset} \\ &\quad \quad + \underbrace{\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \XNkset}}_{\text{Conditional Joint Marginal Cross-Entropy}}. \end{aligned}\]

Note that the per-sample averages of both terms converge to the same value in the infinite data limit—the conditional marginal cross-entropy (cross-validation loss), as discussed previously. However, the second term will converge faster because it does not include the constant \(\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNkset}\).

      We can also see both terms as approximating the conditional marginal cross-entropy (cross-validation loss) for a fixed \(N\) in the low-data regime. The per-sample average of the second term will provide a better approximation.

In summary, the consistency of the ranking will depend on the size of \(\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNkset}\) for different \(\h\) and how it compares to the conditional joint marginal cross-entropy \(\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \XNkset}\).

      This analysis highlights the importance of considering both prior-data conflict and model misspecification when selecting models in the low-data regime. The choice of performance metric and the amount of available data can significantly impact the ranking of models. The conditional joint marginal cross-entropy provides a more accurate estimate of the model’s generalization performance by conditioning away the initial part of the learning curve, which may be heavily influenced by prior-data conflict.

      Approximating the Validation Loss

      You may be wondering: why bother with the marginal likelihood or conditional joint marginal cross-entropy at all? Why not just always use leave-one-out cross-validation (i.e., the conditional marginal cross-entropy) or a simple validation loss?

      While that is a valid approach, the key question is: can we approximate the validation loss earlier in training, without fully training the model? Or can we do this more efficiently than performing inference on each element of a validation set?

      One option is to extrapolate the training loss to predict the validation loss. While potentially underexplored in this context, scaling laws have been found effective for predicting model performance.

      Alternatively, when training a model on a dataset for a single epoch—which is still surprisingly common for large language models, especially without active data sampling—the average training loss per batch provides a good approximation of the validation loss. With a cross-entropy loss, this is equivalent to estimating the conditional marginal cross-entropy.

      However, the batch size may not be large enough for a precise estimate. Averaging over the last few batches or using an exponential moving average can help, as the training losses on earlier batches were computed with older model parameters. Compared to using only the last batch’s loss, this smooths the estimate and reduces sensitivity to outliers.
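
To make this concrete, here is a minimal sketch (our own illustration, not taken from any of the discussed papers) of smoothing per-batch training losses with an exponential moving average during a single-epoch run; the function name and the value of alpha are our choices:

import numpy as np

def ema_loss_estimate(batch_losses, alpha=0.9):
    # In a single-epoch run, each batch is unseen data, so its loss estimates
    # the conditional marginal cross-entropy at that point in training. The
    # EMA down-weights earlier batches, whose losses were computed with
    # staler model parameters.
    ema = batch_losses[0]
    for loss in batch_losses[1:]:
        ema = alpha * ema + (1 - alpha) * loss
    return ema

# Hypothetical usage with stand-in loss values logged once per batch:
batch_losses = np.linspace(2.0, 0.7, num=1000)
print(ema_loss_estimate(batch_losses))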

      In the multi-epoch setting, revisiting data points multiple times prevents using the training loss as a validation loss estimate. Here, cross-validation offers a solution: train on the held-out data in the last epoch, compute the validation loss via the training losses, and obtain an ensemble of fully trained models without wasting data.

      In summary, while the validation loss is the gold standard, approximations based on the training loss or cross-validation can provide efficient estimates, especially in the early stages of training or with limited data.

      The Big Comparison

      In this post, we have explored various metrics for model selection and hyperparameter learning in the Bayesian context, focusing on the marginal likelihood, joint marginal cross-entropy, and conditional marginal cross-entropy. Our discussion has led to several key insights:

1. Infinite Data Limit: As the dataset size approaches infinity, the rate of the log marginal likelihood (or equivalently, of the joint marginal information), the rate of the joint marginal cross-entropy, and the conditional marginal cross-entropy converge to the same value when averaged over the data distribution. Given sufficient data, all these metrics will produce the same ranking of different model hypotheses or hyperparameter choices.

      2. Connection to Cross-Validation: The conditional marginal cross-entropy is equivalent to the expected cross-validation loss. Cross-validation is the gold standard for model selection in machine learning practice, where a model’s generalization performance is estimated by evaluating it on held-out validation data after training on the remaining data.

      3. Sufficient Data Requirement: The amount of data needed for the convergence of these metrics in the infinite data limit may be impractically large, especially for highly expressive models like deep neural networks. Therefore, the convergence property may not be directly relevant in many real-world scenarios.

      4. Low-Data Regimes: When data is limited, the metrics can differ significantly. The conditional marginal cross-entropy (or cross-validation loss) is often the more reliable choice for model selection targeting generalization performance, as it directly measures the model’s ability to predict unseen data after being trained on the available data.

      5. Sequential Prediction and Compression: The joint marginal cross-entropy, which corresponds to the negative log marginal likelihood, may be preferable if the focus is on a model’s overall sequential prediction performance or compression ability on the training data itself. It measures how well the model fits the entire training dataset jointly, without splitting into train and validation sets.

        Moreover, the conditional joint marginal information and cross-entropy are particularly relevant for measuring the performance of online learners and the in-context learning abilities of large language models (LLMs). These metrics capture the model’s ability to adapt and make accurate predictions based on the sequential information and evolving context after training on available data.

      6. Model Misspecification and Prior-Data Conflict: In practice, models often face a combination of model misspecification (where the true data-generating process is not contained within the model class) and prior-data conflict (where the prior distribution does not align well with the data distribution). The interplay between these factors can lead to different rankings of models depending on the amount of available data and the specific metric used for evaluation.

      While the marginal likelihood has been a popular tool for model selection and hyperparameter learning in the Bayesian community, its suitability depends on the specific context and goals. The conditional marginal cross-entropy, closely related to cross-validation, is often a more reliable choice when the primary objective is to optimize generalization performance. However, the conditional joint marginal cross-entropy (or conditional log marginal likelihood) may be preferable when the focus is on sequential prediction after training or measuring in-context learning abilities.

      Now, after having thought about all this in detail and mostly from first principles, let’s discuss the literature and how it supports or augments these considerations.

      Literature Review

      Having discussed the key concepts, we will now look at several influential papers that have shaped the previous discussion on model selection and hyperparameter tuning in the Bayesian context or have provided valuable insights into the marginal likelihood and its connections to other metrics.

      Fong and Holmes (2020): “On the marginal likelihood and cross-validation”

      Fong and Holmes (2020) explore the connection between the log marginal likelihood (joint marginal information) and cumulative leave-p-out cross-validation. Under exchangeability, they show that the joint marginal information can be rewritten as a cumulative sum of leave-p-out cross-validation terms.

      The authors define the leave-p-out cross-validation score as:

\[S_{CV}(\xNset;p) = \frac{1}{\binom{N}{p}} \sum_{V \in \binom{[N]}{p}} \frac{1}{p} \sum_{i=1}^p \Hof{\x^{V}_i \given \{\x^{\bar{V}}_k\}_{k=1}^{N-p}}\]

      where \(\binom{[N]}{p}\) denotes the set of all \(p\)-length subsets of \(\{1,...,N\}\)—the indices of the validation set—\(\x^V_i\) is the \(i\)-th validation data point, and \(\x^{\bar{V}}_k\) is the \(k\)-th training data point. This score measures the model’s performance using \(p\) validation points given the remaining data for training, equivalent to the respective conditional marginal cross-entropy.

      The cumulative leave-P-out cross-validation score is defined as:

      \[S_{CCV}(\xNset; P) = \sum_{p=1}^P S_{CV}(\xNset; p)\]

      This score focuses on the last \(P\) stages of the learning curve equally and is the same as the conditional joint marginal cross-entropy. For \(P=N\), the cumulative leave-N-out cross-validation score equals the joint marginal information:

      \[S_{CCV}(\xNset; N) = \Hof{\xNset}\]

      Comparing \(P<N\) to \(P=N\), Fong and Holmes highlight the potential sensitivity of the marginal likelihood to the choice of prior. They argue for using cumulative cross-validation following a preparatory training phase with \(P<N\) (e.g., \(10\%\) or \(50\%\)), demonstrating benefits over the full marginal likelihood for model selection, especially with vague priors or model misspecification.

      The paper also discusses the coherence of the log posterior predictive probability as a scoring rule in cross-validation and explores connections to prequential analysis and intrinsic Bayes factors.

      Fong and Holmes (2020) strongly support the ideas in this blog post, particularly the connections between marginal likelihood, cross-validation, and focusing on later learning curve stages for model selection. They establish the equivalence between the cumulative leave-p-out cross-validation score and conditional joint marginal information, aligning with our discussion of the conditional joint marginal cross-entropy as a more reliable metric compared to the full marginal likelihood.

      Lyle et al. (2020) and Ru et al. (2021): Training speed and model selection

      In “A Bayesian Perspective on Training Speed and Model Selection”, Lyle et al. (2020) establish a connection between training speed and the marginal likelihood in linear models. They propose using the sum of mini-batch training losses as a proxy for the log marginal likelihood to predict the generalization behavior of deep neural networks. This sum, referred to in later works as the training speed estimator (TSE), corresponds to the area under the learning curve. For 1-sample batches, the TSE is defined as:

      \[\text{TSE}(\xNset) = \sum_{n=1}^N \Hof{\x_n \given \w_n},\]

      where \(\Hof{\x_n \given \w_n}\) is the cross-entropy loss at training step \(n\) with model parameters \(\w_n\). Thus, an MLE estimate is used instead of conditioning on the data points \(\x_{<n}\) and using the BMA.

      The authors provide an iterative algorithm for linear models to estimate a lower bound on the LML over multiple epochs of training. This allows capturing the model’s performance as it sees more data points over the course of training, rather than being limited to a single epoch. They also discuss extending their estimator to the infinite-width limit of neural networks.

      Building upon Lyle et al. (2020), Ru et al. (2021) focus on using TSE for model selection in neural architecture search in “Speedy Performance Estimation for Neural Architecture Search”. They propose two variants of TSE: TSE-E, which focuses on the last few epochs, and TSE-EMA, which uses an exponential moving average to assign higher weights to later epochs:

      \[\begin{aligned} \text{TSE-E}(\xNset) &= \sum_{n=N-E+1}^N \Hof{\x_n \given \w_n}, \\ \text{TSE-EMA}(\xNset) &= \sum_{n=1}^N \alpha^{N-n} \Hof{\x_n \given \w_n}, \end{aligned}\]

      where \(\alpha \in (0, 1)\) is a hyperparameter controlling the decay rate.
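
As a small sketch of these estimators (assuming the per-step training losses have already been logged to a numpy array; the function names are ours, mirroring the definitions above):

import numpy as np

def tse(losses):
    # Area under the training curve: sum of per-step training losses.
    return np.sum(losses)

def tse_e(losses, E=5):
    # TSE-E: only the last E steps, focusing on the end of the learning curve.
    return np.sum(losses[-E:])

def tse_ema(losses, alpha=0.9):
    # TSE-EMA: weight alpha^(N - n) for step n, so later steps count more.
    N = len(losses)
    weights = alpha ** (N - 1 - np.arange(N))  # weight 1 for the last step
    return np.sum(weights * losses)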

      The authors hypothesize that assigning higher weights to later epochs may lead to better correlation with the true generalization performance of the final trained network, as the early epochs may be unstable and less informative.

      They demonstrate empirically that TSE-E and TSE-EMA can reliably estimate the generalization performance of neural architectures with a small training budget and remain effective for a large range of training epochs. TSE outperforms other efficient estimators, such as early stopping and learning curve extrapolation, in terms of rank correlation with the true test performance.

      The TSE estimators proposed by Ru et al. (2021) align closely with the ideas discussed in this blog post, as they prioritize the model’s performance in the later stages of learning. The empirical results presented by Ru et al. (2021) and Lyle et al. (2020) provide supporting evidence for the importance of going beyond the marginal likelihood.

      Lotfi et al. (2022/2023): “Bayesian Model Selection, the Marginal Likelihood, and Generalization”

      Lotfi et al. (2022/2023) provide a comprehensive re-evaluation of the marginal likelihood as a metric for predicting the generalization performance of trained models and learning hyperparameters. They argue that while the marginal likelihood is well-suited for prior hypothesis testing, it is only peripherally related to generalization after training. The authors identify several practical and philosophical issues in using the marginal likelihood for selecting between trained models, such as its sensitivity to the choice of prior, potential to lead to both underfitting and overfitting, and negative correlation with generalization performance in some cases.

      To address these limitations, Lotfi et al. propose the conditional marginal likelihood (CLML) as a partial remedy. The CLML is computed by conditioning on a subset of the training data, which helps to mitigate the influence of the prior and focus on the model’s performance under this posterior. It is also less sensitive to the number of parameters in the model. The authors demonstrate that the CLML is better correlated with generalization than the marginal likelihood and provides promising performance for deep kernel hyperparameter learning and neural architecture search.

      The CLML shares significant similarities with the cumulative leave-p-out cross-validation score proposed by Fong and Holmes (2020). Both approaches essentially propose the same metric, which focuses on the model’s performance in the later stages of learning and provides a more reliable indication of generalization compared to the full marginal likelihood. Lotfi et al. also critically compare their work to that of Lyle et al. (2020), but do not discuss the work of Ru et al. (2021).

      Lotfi et al. conduct an extensive empirical evaluation of the CLML across various settings, comparing it to the marginal likelihood and other baselines under different conditions, such as varying dataset sizes, model complexities, and hyperparameter settings. They demonstrate that the CLML consistently outperforms the marginal likelihood in terms of selecting the hyperparameters that lead to better generalization performance. The authors also acknowledge some limitations of their work, such as the need for further theoretical analysis of the CLML’s properties and the potential challenges in estimating the CLML for more complex models.

      The key novelty of Lotfi et al.’s work lies in their comprehensive analysis of the limitations of the marginal likelihood for model selection and hyperparameter learning, as well as their proposal of the CLML as a practical alternative that addresses these limitations.

      A Simple Toy Experiment

      To illustrate the concepts discussed in this post, we conduct a simple toy experiment using a Bayesian linear regression model. The goal is to demonstrate how the various information metrics behave under different prior settings and dataset sizes, and to show that none of the metrics are universally reliable. In particular, the joint marginal information may not be the best choice when the primary concern is static performance after training on data.

      Experimental Setup

      We generate a synthetic dataset with 64 features and 500 training and validation samples each. The true coefficients are drawn from a normal distribution with a mean of 2, and the target is the dot product between the features and the true coefficients.

      For the model, we use a Bayesian linear regression with an isotropic Gaussian prior on the weights (hyperparameter \(\wstddev\)) and independent Gaussian noise (hyperparameter \(\noisestddev\)). The model is misspecified when \(\noisestddev > 0\). We consider three different prior settings:

      • Model 1 (\(\h_1\)): \(\wstddev=0.1\), \(\noisestddev=0.8\)
      • Model 2 (\(\h_2\)): \(\wstddev=100\), \(\noisestddev=1.0\)
      • Model 3 (\(\h_3\)): \(\wstddev=1\), \(\noisestddev=1.2\)

      Thus, all three models are misspecified to varying degrees and exhibit different levels of prior-data conflict.

      We train the model on subsets of the training data of varying sizes, ranging from 1 to the full training set size, performing 5 trials with different splits. For each subset size, we compute the following metrics:

      • Joint Marginal Information (JMI)
      • Conditional Joint Marginal Information (CJMI) with half the data used for conditioning
      • Marginal Cross-Entropy (MCE) on the training set
      • Marginal Cross-Entropy (MCE) on the validation set
      • Training Speed (Approximate)
      • Joint Marginal Information Rate (JMI Rate)

The JMI is equivalent to the negative log marginal likelihood, the CJMI to the negative conditional log marginal likelihood, and the MCE corresponds to the cross-entropy loss. The Training Speed approximates an iterative algorithm by following the full data gradient. The JMI Rate is the JMI divided by the dataset size, which converges to the MCE in the infinite data limit.
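
For Bayesian linear regression, the JMI has a closed form, since marginally \(y \sim \mathcal{N}(0, \sigma_w^2 X X^\top + \sigma_n^2 I)\). The following is a minimal sketch of this computation (not the colab code; function and variable names are ours):

import numpy as np
from scipy.stats import multivariate_normal

def joint_marginal_information(X, y, w_std, noise_std):
    # Negative log marginal likelihood in bits: under the prior
    # w ~ N(0, w_std^2 I) and noise ~ N(0, noise_std^2 I), marginally
    # y ~ N(0, w_std^2 X X^T + noise_std^2 I).
    cov = w_std**2 * X @ X.T + noise_std**2 * np.eye(len(y))
    rv = multivariate_normal(mean=np.zeros(len(y)), cov=cov)
    return -rv.logpdf(y) / np.log(2)

# Stand-in data mirroring the setup: 64 features, coefficients with mean 2,
# noise-free targets (so any noise_std > 0 is misspecified).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))
y = X @ rng.normal(loc=2.0, size=64)
for w_std, noise_std in [(0.1, 0.8), (100.0, 1.0), (1.0, 1.2)]:
    print(w_std, noise_std, joint_marginal_information(X, y, w_std, noise_std))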

      Results

      The results of the experiment are summarized in the following plots:

      Information metrics for the three Bayesian linear regression models as a function of dataset size. The joint marginal information does not indicate the best performing model. The conditional joint marginal information (conditioned on half the dataset size, predicting on the other half) only finds the best model after 4/5 of the data are observed. Metrics are reported in bits (log base 2), five trials each.

      The plots show the behavior of the information metrics as the dataset size increases for the three different prior settings. Some key observations:

      • The marginal cross-entropy (MCE) metrics decrease as the dataset size increases, indicating improved model performance.
      • The joint marginal information (JMI) increases with more data, as it is equivalent to the area under the curve of the MCE on the training set. (As we take the average over multiple trials, its mean is actually an estimate of the joint marginal cross-entropy.)
      • The JMI rate, which is the JMI divided by the dataset size, decreases very slowly towards the same value as the MCE. This agrees with the previous discussion on the infinite data limit.
      • The training losses also decrease, while their sum, equal to the training speed estimator (TSE), increases with the dataset size.
      • The conditional joint marginal information (CJMI) with half the data used for conditioning shows a similar trend to the JMI but with lower values, as it focuses on the model’s performance on the held-back data. As we take an average over multiple trials, it is actually an estimate of the conditional joint marginal cross-entropy.

      To further analyze the model selection behavior, we computed the CJMI for different conditioning set sizes and selected the model with the lowest CJMI for each combination of dataset size and conditioning set size. The results are visualized in the following plot:

Decision boundary for the best model amongst the three (\(\h_1\), \(\h_2\), \(\h_3\)) with the lowest conditional joint marginal cross-entropy/information, as a function of dataset size and held-back size. The three models \(\h_1\), \(\h_2\), and \(\h_3\) correspond to different prior variances and noise levels. The white diagonal line shows where the conditional joint marginal information is computed using half the dataset size. In the region below this line, \(\h_1\) (blue) has the lowest conditional joint marginal information, while \(\h_2\) (orange) and \(\h_3\) (green) are preferred for different dataset and held-back sizes.

      The plot shows which model is selected based on the lowest CJMI for different dataset sizes (x-axis) and conditioning set sizes (y-axis). The white line represents the case where half the data is used for conditioning (CJMI half in the previous plot). We observe that the model selection decision changes depending on the amount of available data and the size of the conditioning set/held-back data.

      A Narrow but Deep Dive into “Bayesian Model Selection, the Marginal Likelihood, and Generalization”

      Now that we have introduced the necessary concepts and discussed the literature, let’s take a closer look at the paper by Lotfi et al. (2022/2023).

      Use Cases and Pitfalls of the LML

      Lotfi et al. (2022/2023) present both the case for the log marginal likelihood (LML) as well as potential pitfalls when using it. They highlight the following use cases for the LML—quoted and paraphrased from the paper:

      1. Hypothesis testing: The LML provides an elegant mechanism to select between fixed prior hypotheses, even if each hypothesis is entirely consistent with observations. It automatically favors the most constrained hypothesis that fits the data, encoding a notion of Occam’s razor. The paper gives the example of the LML favoring general relativity over alternative explanations for Mercury’s orbit.

      2. Hyperparameter learning: The LML is often successfully used in practice to learn hyperparameters of the prior, finding the hyperparameters \(\h\) that maximize \(\pof{\mathcal{D} \given \h}\), where \(\mathcal{D}\) is a dataset. The paper highlights Gaussian processes as a compelling example, where the LML chooses kernel hyperparameters that make the distribution over functions likely to generate the training data, rather than simply maximizing data fit. The LML can learn many kernel parameters and be used where cross-validation would be intractable.

      3. Constraint learning: Unlike typical learning objectives like maximum likelihood, the LML is incentivized to select for constraints. It provides a consistent estimator for constraints, automatically selecting the most constrained solution that fits the data and collapsing to the true constraint value as the number of observations grows. Examples include the LML consistently estimating the true dimensionality in Bayesian PCA and automatically learning symmetries like rotation invariance.

      However, the paper argues that the LML has several pitfalls for model selection and generalization:

      1. Not aligned with generalization: The LML answers “what is the probability a prior model generated the training data?” rather than “how likely is the posterior to have generated withheld points?”. A prior that initially explains the data well can still lead to a posterior that generalizes poorly.

      2. Misaligned in model selection: The LML evaluates priors, while model selection should evaluate posteriors. Maximizing LML is not equivalent to selecting the best generalizing posterior.

      3. Can overfit: The LML can favor “simple” priors concentrated around overfit maximum likelihood solutions that generalize poorly.

      4. Underfitting bias in hyperparameter selection: The LML may not favor hyperparameters that make good parameters likely if they also make many poor parameters likely.

      Relating these points to the previous discussions:

      For hypothesis testing and hyperparameter learning (1. & 2.), the LML favors the simpler hypothesis that converges faster, implying a smaller area under the learning curve. This aligns with the discussion on prior-data conflict for similarly misspecified models.

      At the same time, the paper also states about the case of Mercury’s orbit that:

      We emphasize here we are comparing fixed prior hypotheses. We are not interested in how parameters of general relativity update based on orbital data, and then deciding whether the updated general relativity is the correct description of orbital trajectories.

This could be misconstrued as computing the marginal cross-entropy for the data under the prior, which is not what the LML is doing: it computes a joint marginal cross-entropy after all. The two questions in pitfall (1.) point to the joint and conditional marginal cross-entropies, i.e., the areas under the full and partial learning curves, respectively.

However, neither the LML nor the CLML aligns with static evaluation; both rather correspond to continued learning, as in pitfall (2.).

Pitfalls (3.) and (4.) relate to prior-data conflict and model misspecification when they are anti-correlated.

      Overall, all quantities can fail in the low-data regime. In the infinite data limit, model (mis-)specification dominates other factors, making the quantities less interesting.

      The “Conditional Marginal Likelihood” in Lotfi et al. (2022/2023)

      The paper introduces the conditional marginal likelihood (CLML) as a remedy for the pitfalls of the LML, matching the earlier definition of conditional joint marginal information:

      \[\Hof{\xset{}{N-P+1}{N} \given \xset{}{1}{N-P}, \h}.\]

Unlike the LML, which is invariant to the data order, the CLML depends on how the data is split into a conditioning set and a validation set. To make the CLML permutation-invariant, the paper proposes averaging over different permutations, equivalent to the conditional joint marginal cross-entropy. However, this becomes computationally expensive, so the paper uses a single permutation with \(P=20\% \, N\), conditioning on the remaining \(80\%\) to ensure the posterior has sufficiently converged.
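
Concretely, the chain rule from earlier means the CLML can be computed as a difference of two joint quantities. A minimal sketch, reusing the joint_marginal_information function from the toy-experiment sketch above and assuming a single fixed data ordering:

def conditional_joint_marginal_information(X, y, n_cond, w_std, noise_std):
    # CJMI via the chain rule:
    # H[x_{n_cond+1..N} | x_{1..n_cond}] = H[x_{1..N}] - H[x_{1..n_cond}].
    full = joint_marginal_information(X, y, w_std, noise_std)
    cond = joint_marginal_information(X[:n_cond], y[:n_cond], w_std, noise_std)
    return full - cond

# E.g., condition on 80% of the data, as in Lotfi et al.:
n_cond = int(0.8 * len(y))
print(conditional_joint_marginal_information(X, y, n_cond, 1.0, 1.2))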

      Estimating the CLML and LML via Laplace Approximation

Computing the LML via sampling is intractable for deep neural networks. Estimating it from an uninformative prior leads to high-variance estimates, as most \(\w\) sampled from the prior will perform poorly on the data. While Monte Carlo sampling works well in high dimensions, it fails here because randomly sampling a good \(\w\) from the prior is incredibly unlikely.

While sampling from the prior to estimate the LML is intractable, we can fare better when sampling from a posterior to compute the CLML, which is the approach taken by the paper. The posterior is more concentrated around “good” \(\w\), and the paper uses a Laplace approximation (LA) to approximate it.

However, the LA only captures uncertainty around a single mode, underestimating the uncertainty before the model converges, as beautifully illustrated in the paper.

      This is especially relevant for overparameterized DNNs which have multiple diverse modes (Wilson, Izmailov, 2020; 2021, blog).

      Furthermore, when computing the CLML, the LA may similarly struggle to find meaningful \(\w\) that perform well on the held-out data when that data would meaningfully change the model, as the CLML decomposes into conditional marginal information terms that condition on these additional data sequentially.

      DNN Experiments: Validation Loss vs. CLML

The DNN experiments in Lotfi et al. (2022/2023) compare the CLML to the validation loss for DNNs on the CIFAR-10 and CIFAR-100 datasets. The results provide empirical evidence for the challenges of computing the CLML and raise the question of whether these approximations are meaningfully different from a validation loss.

The paper shows that while the CLML is better correlated with the generalization performance of the model than the LML, the validation loss is still better correlated with the generalization performance than the CLML. Interestingly, the initially published DNN experiments in the first arXiv version of the paper did not actually compute the CLML but instead computed the validation loss. This was fixed in the second arXiv revision. (This bug was found by yours truly; see the appendix of this post.)

      However, given the previous discussions on the similarities between the CLML and cross-validation and difficulty of approximating the CLML meaningfully, this bug was not a major issue for the paper’s conclusions.

      Importantly, as we examine in the appendix of this post, when comparing the CLML using Monte Carlo sampling with the validation loss computed using Monte Carlo sampling for the Bayesian Model Average (BMA), the validation loss is still better correlated with the generalization performance than the CLML.

      Conclusion

In conclusion, this blog post has challenged the conventional view that the marginal likelihood and related quantities are the direct consequence of Occam’s razor for Bayesian model selection. It highlights the importance of considering context and goals when choosing a model selection criterion. By motivating MLE and MAP using Occam’s razor and questioning the uniqueness of the (conditional) joint marginal likelihood, we hope to encourage critical thinking about the foundations of these quantities.

However, it is important to acknowledge the limitations of our arguments and experiments. A more rigorous theoretical justification, a broader range of models and datasets, and a deeper engagement with the philosophical implications are needed to strengthen the insights. As most of the presented methods ignore model complexity and assume a uniform model prior \(\pof{\h}\), we have not discussed it in the necessary detail, even though from the perspective of minimum description length (MDL), it would be crucial to take into account.

      Despite these limitations, our exploration of the connections between information-theoretic concepts and their behavior in different data regimes, along the lines of model misspecification and prior-data conflict, provides a necessary starting point for understanding recently proposed metrics.

      The toy experiment demonstrates that all discussed quantities can fail to reliably predict generalization under model misspecification and prior-data conflict, even for a basic setting using Bayesian linear regression. This emphasizes the need for caution when making claims about the superiority of any particular metric.

      Ultimately, the key takeaway is that there is no one-size-fits-all solution, and the choice of model selection criterion should be guided by a careful consideration of the specific context and goals at hand.


      Acknowledgements: We would like to thank the authors of the examined papers for their valuable contributions to the field and for inspiring this blog post. Claude-3 and GPT-4 were used to edit and improve this blog post (via cursor.sh).

      Reproducibility: The figures were created using matplotlib and seaborn in Python. The Bayesian linear regression model was implemented using numpy. The code for the toy experiment is available in this Google colab, and the code for the visualizations is available in this Google colab.


      Appendix

      Detailed Code Review of the DNN Experiments in Lotfi et al. (2022/2023)

      The logcml_ files in the repository contain the code to compute the CLML for partially trained models. However, instead of computing

      \[\begin{aligned} \log p(\mathcal D_{\ge m} \mid \mathcal D_{< m}, \mathcal{M} ) \approx \log \sum_{k=1}^K \frac{1}{K}\, p(\mathcal{D}_{\ge m} \mid w_k, \mathcal M ) \\ = \log \sum_{k=1}^K \frac{1}{K}\, \prod_{j=m}^n p(y_j \mid x_j, w_k, \mathcal M ), \end{aligned}\]

      the code computes:

\[\begin{aligned} &\frac{1}{|\mathcal{D}_{\ge m}|}\,\sum_{j=m}^n \log p(\mathcal D_{j} \mid \mathcal D_{< m}, \mathcal{M} ) \\ &\quad \approx \frac{1}{|\mathcal{D}_{\ge m}|}\,\sum_{j=m}^n \log \sum_{k=1}^K \frac{1}{K}\, p(y_j \mid x_j, w_k, \mathcal M ), \end{aligned}\]

      which is the validation cross-entropy loss of the BMA (of the model trained with 80% of the training data).
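
To make the difference concrete, here is a small sketch of the two estimators (our own illustration with hypothetical array shapes, not the paper’s code): the correct CLML estimate averages the joint likelihood over posterior samples, whereas the buggy version marginalizes per data point first:

import numpy as np
from scipy.special import logsumexp

def clml_estimate(log_probs):
    # log_probs: shape (K, M), log p(y_j | x_j, w_k) for K posterior samples
    # and M held-out points. Correct Monte Carlo CLML estimate:
    # log (1/K) sum_k prod_j p(y_j | x_j, w_k).
    joint_per_sample = log_probs.sum(axis=1)  # joint log-likelihood per sample
    return logsumexp(joint_per_sample) - np.log(log_probs.shape[0])

def bma_validation_log_lik(log_probs):
    # What the v1 code effectively computed: marginalize over samples per
    # point, then average the per-point log BMA probabilities.
    per_point = logsumexp(log_probs, axis=0) - np.log(log_probs.shape[0])
    return per_point.mean()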

      The high-level code that computes the CLML is:

      
bma_accuracy, bma_probs, all_ys = get_bma_acc(
    net, la, trainloader_test, bma_nsamples,
    hessian_structure, temp=best_temp
)
cmll = get_cmll(bma_probs, all_ys, eps=1e-4)

      get_bma_acc marginalizes over the LA samples before returning bma_probs:

      
[...]
for sample_params in params:
    sample_probs = []
    all_ys = []
    with torch.no_grad():
        vector_to_parameters(sample_params, net.parameters())
        net.eval()
        for x, y in loader:
            logits = net(x.cuda()).detach().cpu()
            probs = torch.nn.functional.softmax(logits, dim=-1)
            sample_probs.append(probs.detach().cpu().numpy())
            all_ys.append(y.detach().cpu().numpy())
        sample_probs = np.concatenate(sample_probs, axis=0)
        all_ys = np.concatenate(all_ys, axis=0)
        all_probs.append(sample_probs)

all_probs = np.stack(all_probs)
bma_probs = np.mean(all_probs, 0)
bma_accuracy = (np.argmax(bma_probs, axis=-1) == all_ys).mean() * 100

return bma_accuracy, bma_probs, all_ys

The important line is bma_probs = np.mean(all_probs, 0), which marginalizes over the predictions and returns the BMA prediction for each sample.

      Finally, get_cmll computes the validation loss for each sample independently (after applying a bit of label smoothing):

      
def get_cmll(bma_probs, all_ys, eps=1e-4):
    log_lik = 0
    eps = 1e-4
    for i, label in enumerate(all_ys):
        probs_i = bma_probs[i]
        probs_i += eps
        probs_i[np.argmax(probs_i)] -= eps * len(probs_i)
        log_lik += np.log(probs_i[label]).item()
    cmll = log_lik/len(all_ys)

    return cmll

The DNN experiments in Section 5 and Section 6 of the first arXiv revision of the paper (v1) thus did not estimate the CLML per se but computed the BMA validation loss of a partially trained model (trained on 80% of the data) and found that this correlates positively with the test accuracy and test log-likelihood of the fully trained model (at 100%). This is not surprising: it is well-known that the validation loss of a model trained on 80% of the data correlates positively with the test accuracy (and generalization loss).

      Author Response from 2022

The following response sadly seems to mainly target the first draft of this post. However, it is also helpful for the final blog post and provides additional context.

      Thanks for your interest in our paper and your comments. Here are our comments about the blog as it is currently framed:

      (1) Thank you for pointing out a bug in the CLML computation for Figure 5b. We note that this bug is only relevant to a single panel of a single figure in the main text. We have re-run this experiment with the right CLML, and the results, attached here, are qualitatively the same. In summary, it was a very minor part of the paper, and even for that part it did not affect the take-away. We also attach the results of the correlation between the BMA test accuracy and the negative validation loss. You suggest in your post that the validation loss might correlate better with the BMA test accuracy than the CLML given that we use 20 samples for NAS. Our empirical results show the opposite conclusion. Additionally, we are not suggesting the CLML as a replacement to cross-validation but rather as a minor way to modify the LML for improvements in predicting generalization. Finally, we attach results for different sample sizes (20 samples vs. 100 samples) to address your comments on the sample size used to estimate the CLML. As we can see in the figure, the Spearman correlation factor is quite similar. 20 samples appears to provide a reasonable estimate of the CLML for these purposes, and is different from validation loss.

      (2) Your post currently opens by suggesting that there is something wrong with our experiments, likely either an LML approximation or a CLML issue, because we note that the LML correlates more poorly with generalization for larger datasets (where “large” is relative in the context of a specific experiment). A few points here: (i) this result is actually completely expected. The LML is in fact non-monotonic in how well it predicts generalization. For small datasets, the prior should be reasonably predictive of generalization. For intermediate datasets, the first terms in the LML decomposition have a negative effect on the correlation with generalization. For asymptotically large datasets, the first terms have a diminishing effect, and we get a consistent estimator; (ii) almost all of our experiments are exact, and we see this behaviour in the exact experiments for the Fourier model. For example, for the Fourier feature experiment in Fig 4(d), LML picks the better generalizing model for n < 50 and n > 296. For n in [50, 296] it picks the wrong model. For large neural network models, it is reasonable that the exact LML could pick the wrong model for CIFAR-sized datasets. (iii) any potential issues with the CLML are not relevant to these considerations, which are about the behaviour of the LML.

      (3) Your post currently suggests that issues with approximate inference could be responsible for our take-aways, rather than issues with the LML in general. But as we note in (2), almost all of our experiments use the exact LML and CLML: the density model, Fourier features, Gaussian processes, and deep learning exps on DKL, and there was never any bug associated with CLML computation in these experiments. The takeaways for the Laplace experiments are consistent with the exact experiments, and also expected, as above. While it’s true that the CLML can be estimated more effectively than the LML for the Laplace experiments, this is actually an advantage of the CLML that we note in the paper. The LML results also stand on their own, as we discuss above.

      (4) Your post places a lot of importance on Figure 5, as if it is the main result of the paper and our main “DNN” experiments. We stand by the results of Figure 5, but it is a relatively minor component of the paper. As we’ve mentioned most of our results are exact, including our DKL experiments, which are certainly the most substantial DNN experiments, with practically exciting results for transfer and few-shot learning. The DKL experiments are actually where we expect the CLML to be practically useful, and currently they seem to be overlooked in the post.

      (5) The blog seems to question the learning curve experiments, but these experiments in Figure 4 are exact, with no Laplace approximation, and relatively straightforward.

      (6) Your post seems to be negative about the CLML, presenting its similarity with cross-validation as a potential drawback, and implying the skepticism about the CLML should affect the interpretation of our take-aways. Two points here: (i) as above, the CLML is independent of most of our take-aways, which are about the properties of the LML; (ii) our goal with the CLML was not to introduce something starkly different from cross-validation, but to show how a very minor modification to the LML could improve alignment with generalization. Moreover, the DKL CLML results are quite promising as an efficient way to do gradient based estimation of a large number of hyperparameters.

      (7) The blog opens as if it is leading up to some fatal flaw. But as above, (i) the LML considerations are independent of the CLML, (ii) most of the experiments are exact, (iii) the trends for the exact and approximate inference procedures are the same and are naturally understandable and explainable, such as the non-monotonic trend in how well the LML correlates with generalization, and (iv) the CLML bug only affected Figure 5, panel b, and when it’s corrected the qualitative take-away is the same as before.

      We appreciate your interest and effort in reading the paper, and we think your questions will improve the clarity of the paper, which we have updated with an acknowledgement to you. Given the above considerations, we do think there would need to be substantial revisions to the blog post to accurately and fairly reflect the paper. We would appreciate being able to see the revisions before it’s posted.

      Best wishes,
      Sanae, Pavel, Greg, Micah, Andrew

      Ablation: CLML vs. BMA Validation Loss vs. (non-BMA) Validation Loss

      Let us examine the new results:

      In the three panels below, two panels show test accuracy vs. validation loss; one shows test accuracy vs. CLML. The left-most panel is the BMA test accuracy vs. (negative) BMA validation loss, the middle panel is vs. the CLML, and the right-most panel is vs. the (negative) non-BMA validation loss.

      Note that the left-most panel is from v1, which was accidentally computing the BMA validation loss, and whose axis label is adapted here from v1 for clarity. The two other plots are from v2 after fixing the bug. See commits here for fixing the CLML estimation and here for computing the non-BMA validation loss.

[Figure panels: BMA Neg. Validation Loss | CLML | Validation Loss]

      At first glance, there might be an observer effect in the experiments for the validation loss. The BMA validation loss in v1 performs better than the CLML in v2, while the non-BMA validation loss in v2 underperforms the CLML in v2. When asked about it, the authors pushed the respective code (see link above) and explained that the updated, right-most panel computes the non-BMA validation loss, i.e., without LA samples. It seems surprising that there is such a difference between the non-BMA validation loss and BMA validation loss: the non-BMA validation loss is more than one nat worse on average than the BMA validation loss, based on visual inspection. Note that the plots here and in the paper compute the average CLML and average validation loss and are thus directly comparable.

      The authors said in their response that:

      You suggest in your post that the validation loss might correlate better with the BMA test accuracy than the CLML given that we use 20 samples for NAS. Our empirical results show the opposite conclusion.

This is only partially true. The BMA validation loss (which was accidentally computed in v1 instead of the CLML) correlates very well with the BMA test accuracy. This is not surprising given that this is the frequentist purpose of using validation sets. If validation sets did not correlate well with the test accuracy, we would not be using them in practice. 🤗 As such, this raises the question of why the non-BMA validation loss correlates negatively with the BMA test accuracy for ResNets and overall in the v2 results. Thus, only the non-BMA validation loss supports the now opposite conclusion in v2 of the paper and in the authors’ response.

Yet what is also surprising is how well the BMA validation loss does compared to the CLML.

      Ablation: LA Sample Size

      Secondly, when we compare the reported values between BMA validation loss and CLML, we notice that the CLML is lower than the BMA validation loss by half a nat for \(\lambda=10^2\) and generally for CNNs.

However, even though the new experiments in v2 are supposed to reproduce the ones from v1, and we can assume that the same model checkpoints were used for re-evaluation (as retraining is not necessary), both the CLML and the non-BMA validation loss are off by about half a nat for the CNNs. As such, the above consideration might hold but might not provide the answer here.

Instead, we overlay the non-BMA validation loss and the CLML plots, both from v2, with a “difference blend”: it shows the absolute difference between the colors for overlapping data points (the circles 🔴 and triangles 🔺), leading to black where there is a match, a negative (green-ish) color where only the CLML is visible, and a positive (sepia) color where only the validation losses are. The background grids were used to align the plots, but we hid the CLML grid afterward; as such, the strong overlay shows just how close the values are.

Surprisingly, or rather as predicted when the LA does not really do much, it turns out that the validation loss for the CNNs (🔴) almost fully matches the estimated CLML with 20 LA samples upon visual inspection. To be more precise, either the models have already sufficiently converged, or the CLML estimate is not actually capturing the correlations between points and thus ends up being very similar to the validation loss.

This changes the interpretation of the sample-size ablation in the authors’ response. The ablation shows no difference between 20 and 100 LA samples, with 100 LA samples even having a slightly lower rank correlation. So it seems that 5 times more LA samples are not sufficient to make a difference, or the Laplace posterior cannot capture the posterior as well as hoped. It would be interesting to examine this further. Kirsch et al. (2022) reported running toy experiments on MNIST with 10,000 MC Dropout samples without achieving good adaptation. The Laplace approximation is not MC Dropout, and this is speculation, but it seems in agreement. Notwithstanding the compute cost and feasibility, could posterior samples using HMC or similar more principled methods provide better estimates?

      All in all, given the above, it is fair to say that the estimate of the CLML is probably not as good as hoped, and further experiments might be needed to tease out when the CLML provides more value than the (BMA) validation loss. Note, however, that this question has not been explicitly examined in the paper. Instead, for DNNs, the paper only compares LML and CLML with distinct estimation methods.


      Deep Equilibrium Models For Algorithmic Reasoning

      In this blogpost we discuss the idea of teaching neural networks to reach fixed points when reasoning. Specifically, on the algorithmic reasoning benchmark CLRS the current neural networks are told the number of reasoning steps they need, which they shouldn't be given. While a quick fix is to add a termination network that predicts when to stop, a much more salient inductive bias is that the neural network shouldn't change its answer any further once the answer is correct, i.e. it should reach a fixed point. This is supported by denotational semantics, which tells us that while loops that terminate are the minimum fixed points of a function. We implement this idea with the help of deep equilibrium models and discuss several hurdles one encounters along the way. We show on several algorithms from the CLRS benchmark the partial success of this approach and the difficulty in making it work robustly across all algorithms.

      What is Algorithmic Reasoning?

Broadly, algorithmic reasoning studies how well neural networks can learn to execute classical computer science algorithms. In particular, to measure how well an algorithm has been learned, we look at size generalisation: if we train on inputs of size \(N\), how well does the neural network perform on inputs of size \(2N\) or \(10N\)? The idea is that neural networks often learn shortcuts that work well in-distribution but fail out-of-distribution, whereas classical computer science algorithms work no matter the input size. The purpose of this exercise is to study the generalisation of reasoning tasks, especially what tricks help to improve robustness and get the network closer to deducing logically rather than relying on statistical shortcuts.

      Why care about fixed-points?

      First, let’s remember that for \(x_0\) to be a fixed-point of a function \(f\) it must satisfy \(f(x_0) = x_0\). Secondly, we can observe that many algorithms consist of an update rule that you apply until there is no more change. The final output can easily be seen to be a fixed-point! In a classical computer science algorithm some smart person will have sat down and shown that under some conditions on the input this convergence will happen and the final answer is correct.

      An example algorithm would be the Bellman-Ford algorithm to compute the shortest-distance to a given node in a graph. Here the update rule looks like \(x_i^{(t+1)} =\min(x_i^{(t)}, \min \{x_j^{(t)} + e_{ij}\}_{j\in N(i)})\), where \(x_i^{(t)}\) is the shortest distance estimate to the source node at time \(t\), \(e_{ij}\) is the distance between nodes \(i\) and \(j\), and \(\{j\}_{j\in N(i)}\) are the neighbours of node \(i\). The algorithm says to apply this rule until there is no more change—a fixed point.
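
As a minimal sketch (our own, assuming no negative cycles so the iteration terminates), the “apply the rule until nothing changes” structure looks as follows:

import math

def bellman_ford(adj, source):
    # Iterate x_i <- min(x_i, min_j (x_j + e_ij)) until nothing changes;
    # the fixed point is the vector of shortest distances to the source.
    # adj[i][j]: weight of edge (i, j), math.inf if there is no edge.
    n = len(adj)
    x = [math.inf] * n
    x[source] = 0.0
    while True:
        x_new = [min([x[i]] + [x[j] + adj[i][j] for j in range(n)])
                 for i in range(n)]
        if x_new == x:  # reached the fixed point: terminate
            return x
        x = x_new

inf = math.inf
adj = [[inf, 1.0, 4.0],
       [1.0, inf, 2.0],
       [4.0, 2.0, inf]]
print(bellman_ford(adj, source=0))  # [0.0, 1.0, 3.0]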

      Interestingly, denotational semantics—a theoretical field of computer science—has shown you can represent Turing complete programming languages as mathematical functions. This is mostly quite trivial with the exception of the while loop (which is also the key ingredient to make it Turing complete). Here the trick is a special mathematical operator that returns the minimum fixed point of a function! (If there is no fixed point to a function then the corresponding while loop doesn’t terminate.) And thus we can see that fixed-points are reached by all programs that terminate, and yet they aren’t used in neural networks that try to learn how to do reasoning. A missed inductive bias perhaps?

      The details

      Task specification

The CLRS paper provides us with a benchmark dataset for algorithmic reasoning. The general structure of the data is a sequence in time of intermediate states of a given algorithm. In other words, at timestep \(t\) we have a state \(x_t\) that describes the various variables that the algorithm stores, e.g. in Bellman-Ford \(x_t\) will contain the current estimate of the shortest path at each node of the graph. At each timestep \(t\) we then try to predict the next timestep; we do this by outputting some \(y_t\) from which we can extract \(x_{t+1}\). Note that \(y_t\) may be slightly different from \(x_{t+1}\), for instance because some state may never change by definition, e.g. the graph in Bellman-Ford, and hence we don’t predict it again. This is all illustrated in the next figure, where we split the state into a state at each node \(x\) and at each edge \(e\) for a given graph \(G\) as an example.

      Algorithmic Reasoning Task, diagram recreated from

      The architecture

The high-level architecture is that of an encoder-processor-decoder. The motivation is that neural networks perform well in high-dimensional spaces, whereas classical algorithms tend to operate on very low-dimensional variables, e.g. in Bellman-Ford the shortest distance at each node is a single scalar. Thus the encoder projects the state into a high-dimensional space \(z_t\), where the main computation is done by the processor network, typically a graph neural network. The output of the processor \(z_{t+1}\) is then decoded back into the low-dimensional space by the decoder. The encoders and decoders mostly consist of linear layers, with the occasional exception, e.g. a softmax for categorical variables. Several different processor architectures have been explored in the literature; we either use the TripletMPNN, which adds edge message passing, or a simple MPNN with a linear message layer.

      High-level architecture employed

The processor is supposed to do the main computation of the network; in particular, the hope is that one iteration of the processor is equal to one iteration of the algorithm. In our example of Bellman-Ford, it would be one iteration of the update rule \(x_i^{(t+1)} =\min(x_i^{(t)}, \min \{x_j^{(t)} + e_{ij}\}_{j\in N(i)})\) (see also the figure below). Thus, the processor should indicate termination by no longer changing its output \(z\).

      Training

Traditionally, the training approach has been teacher forcing. In teacher forcing, we train each step of the algorithm independently by feeding the network the ground-truth \(x_t\) and computing the loss against \(y_t\) at all \(t\) simultaneously. This requires us to know the exact number of steps in the algorithm a priori. In other words, training with just teacher forcing will require us to tell the network the number of iterations it should run for at test time (which will vary depending on the input state). This is unrealistic in practice, where we would simply give our neural network the input state and ask it to run the algorithm on its own, which includes knowing when to stop the computation. While a termination network has been suggested in earlier work, the issue is ignored in later papers.

Remember that neural networks are really good at learning in-distribution shortcuts. To more rigorously test whether the neural network has learned the underlying logical algorithm, we introduce a shift between the training and test distribution. If the network has learned the classical algorithm, it should be able to overcome this shift. Throughout, the CLRS algorithmic reasoning benchmark uses size generalisation: we train on examples of size 16 (i.e. graphs with 16 nodes) and at test time we use an input size of 64.

      An example algorithm: Bellman-Ford

      How can we do fixed-points in DNNs?

One approach to training neural networks that run until they reach a fixed point is deep equilibrium models (DEQs). We give a brief introduction to this approach next, based on the DEQ blogpost.

Given our input \(x\), our hidden state \(z\), and our processor \(f\), the goal is to optimise the fixed point \(z^*=f(z^*,x)\) we reach. The question is: how can we backpropagate through \(z^* = f(z^*,x)\)?

      In backprop, we ultimately want to compute

      \[\left(\frac{\partial z^*(.)}{\partial(.)}\right)^{\top} g\]

for some incoming gradient \(g\) from the layers after (in our case from the decoder) and \((.)\) being anything we want, but usually the weights of the network. We can show by implicit differentiation of \(z^* = f(z^*,x)\) that

      \[\left(\frac{\partial z^*(.)}{\partial(.)}\right)^{\top} g = \left(\frac{\partial f(z^*, x)}{\partial (.)}\right)^{\top}\left(I-\frac{\partial f(z^*, x)}{\partial z^*}\right)^{-\top}g\]

The difficult term to solve in the above equation is \(\left(I-\frac{\partial f(z^*, x)}{\partial z^*}\right)^{-\top}g\), which is the solution of a linear system, namely:

      \[\left(I-\frac{\partial f(z^*, x)}{\partial z^*}\right)^{\top}h = g\]

In general, we can try to solve it in two ways: using a linear system solver, like those found in torch.linalg, or by computing a fixed point of

\[h = \left(\frac{\partial f(z^*, x)}{\partial z^*}\right)^{\top}h +g\]

      In the DEQ blogpost they suggest solving the above fixed point. The reason to use implicit differentiation is that backpropagating through time may easily run into exploding or vanishing gradients or error accumulation due to the number of steps needed to reach a fixed point.

      We tried both: solving the linear system with torch.linalg.solve and finding the above fixed point. But we converged to computing the fixed point of the equation above as suggested by the deep equilibrium blogpost as it is computationally faster, while the added accuracy of linear system solvers wasn’t beneficial. Note this trade-off is heavily informed by what is readily implemented in PyTorch to run on GPU, hence the balance may shift in the future.
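
Putting this together, here is a minimal PyTorch sketch of the DEQ forward and backward passes, following the structure of the DEQ blogpost; f stands in for the processor, the naive iteration stands in for a proper solver such as Anderson acceleration, and the names are ours:

import torch

def deq_fixed_point(f, x, z0, max_iter=64, tol=1e-4):
    # Forward: iterate z <- f(z, x) to an approximate fixed point z*,
    # without storing the iterates in the autograd graph.
    with torch.no_grad():
        z = z0
        for _ in range(max_iter):
            z_next = f(z, x)
            if (z_next - z).norm() < tol:
                break
            z = z_next

    z = f(z.requires_grad_(), x)  # one differentiable step: z* = f(z*, x)

    # Separate graph at z* for vector-Jacobian products in the backward pass.
    z_star = z.detach().requires_grad_()
    f_star = f(z_star, x)

    def backward_hook(g):
        # Solve h = (df/dz*)^T h + g for the incoming gradient g by the
        # same kind of fixed-point iteration, instead of backpropagating
        # through all forward steps.
        h = g
        for _ in range(max_iter):
            h_new = torch.autograd.grad(f_star, z_star, h, retain_graph=True)[0] + g
            if (h_new - h).norm() < tol:
                return h_new
            h = h_new
        return h

    z.register_hook(backward_hook)
    return z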

      Tricks we employ

      To encourage convergence we change the update function in the MPNN to be a minimum update, i.e. \(z^{(t+1)} = \min(z^{(t)}, z^{'(t+1)})\). This update rule is motivated by the problem of getting neural networks to converge to a fixed point. We discuss the effect of this in more detail after the experimental section.

Currently, the gradient flows through the implicit differentiation explained above as well as back in time through standard backprop via \(z_t\). To enable more ways for the gradient to inform early steps in the algorithm, we propagate the gradient through \(y_t\) as well. For discrete \(y_t\), in other words for categorical variables in the state \(x_t\), we employ the Rao-Blackwellized straight-through Gumbel-Softmax estimator to allow gradients to flow.

      Finally, we also try adding a loss for the number of steps by adding the penalty \(\sum_{t=0}^{T} \|z_{t+1} - z_{t}\|^2\). The penalty will be larger as we take more steps and stay away from the fixed point, thus hopefully encouraging convergence to a fixed point more quickly.
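
A minimal sketch of this penalty (assuming the hidden states from the forward iteration are collected in a list; the name is ours):

import torch

def step_count_penalty(zs):
    # sum_t ||z_{t+1} - z_t||^2 over the forward iterates: it grows with
    # every step taken away from the fixed point, so minimising it
    # encourages faster convergence.
    return sum(((z_next - z_prev) ** 2).sum()
               for z_prev, z_next in zip(zs[:-1], zs[1:]))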

      How well does it work?

In the table below we show the accuracy of the algorithms when tested on graphs of size 64. (What exactly is measured as the accuracy depends on the algorithm, but it is usually a pointer, e.g. in the Bellman-Ford algorithm it is a pointer to the previous node along the shortest path. For more details see the CLRS benchmark paper.)

DEQ is our approach of reaching a fixed point together with the implicit differentiation explained above. Hint propagation simply reaches a fixed point and backpropagates through time with no implicit differentiation. Teacher forcing is used for the baselines, where the first number is for the simple MPNN architecture and the second number for the more complex TripletMPNN (these numbers are taken from the literature). For Bellman-Ford and BFS we use the simple MPNN, and for all others we use the TripletMPNN.

Algorithm       DEQ     Hint propagation   Teacher forcing (MPNN/TripletMPNN)
Bellman-Ford*   96.4%   96.7%              92%/97%
Dijkstra        78.8%   84.4%              92%/96%
BFS*            53.8%   57.1%              100%/100%
DFS             5.0%    4.7%               7%/48%
MST-Kruskal     82.3%   82.3%              71%/90%
MST-Prim        75.2%   50.4%              71%/90%

As we can see in the table above, the approach works very well for simpler algorithms such as Bellman-Ford, where with a simple MPNN we manage to achieve equal or better accuracy than the teacher-forced simple MPNN and match the TripletMPNN. Interestingly, this is a parallel algorithm, i.e. all node representations run the same code, in contrast to sequential algorithms, which go through the graph node by node. We did try gating to enable the GNN to better mimic a sequential algorithm, but this didn’t help.

On the other algorithms, while we are able to learn, we cannot match the performance of teacher forcing, where the number of timesteps to run the neural network is assumed known. This additional help makes the comparison slightly unfair; nevertheless, it shows how difficult learning a fixed point is for the network, as we are not able to match the performance. We hypothesise about the reasons behind this in the next section.

      What’s the problem?

There are a few major issues that we noticed during training. The first is that the network is prone to underfitting: while we only show test accuracy in the table above, the training error doesn't actually reach 0 either. It is unclear what causes this, but resolving some of the issues with the DEQ may also resolve it. So let's delve into them.

      Convergence is a key issue

Firstly, the network will often take a large number of steps to reach a fixed point. We can see on easier algorithms like Bellman-Ford that the number of forward steps during training often reaches our set upper limit of 64 (the actual algorithm would take on average 4-5 steps, at most 10, for this graph size). This is why we implement our architectural trick, where we update the next hidden representation only if it is smaller than the current one, i.e. \(z^{(t+1)} = \min(z^{(t)}, z^{'(t+1)})\), where \(z^{'(t+1)}\) is the output of our min aggregator in the message-passing step (alternatives such as gating and an exponential-moving-average update were also tried). This helps with convergence and enables finding a fixed point in simple cases, but it fails to work reliably for more complex architectures and problems, while also introducing a different issue.

      The problem with hard constraints to achieve convergence

      Remember that during the implicit differentiation we are trying to solve

      \[h = \left(I-\frac{\partial f(z^*, x)}{\partial z^*}\right)^{-\top}g\]

i.e. in the linear system \(y = Ax\) our matrix \(A\) equals \(I-J\), where \(J\) is the Jacobian in the above equation. If the Jacobian equals the identity, then our matrix \(A=0\) and the system has no solution. In practice, \(z^{(t+1)} = \min(z^{(t)}, z^{'(t+1)})\) will reduce to \(f(z) = z\) in many dimensions of \(z\). This makes many rows of the Jacobian equal to the corresponding rows of the identity, since the function effectively becomes \(f(x)=x\) in those dimensions, and hence many rows of \(A\) are entirely zero; the system is then ill-posed, with no solution, and the optimisation breaks.

One solution is to try a soft-min, i.e. \(\mathrm{softmin}_{\tau}(a,b) = \frac{ae^{-a/\tau}+be^{-b/\tau}}{e^{-a/\tau}+e^{-b/\tau}}\). Here we gain the ability to trade off between convergence and the Jacobian staying interesting. For \(\tau \ll 1\) we essentially recover the min operation, and for \(\tau \gg 1\) we simply get an average, i.e. an exponential-moving-average update. In practice, there was no value of \(\tau\) for which we consistently had an interesting Jacobian while also converging sufficiently fast.
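A minimal sketch of this soft-min (toy code of ours):

import torch

def softmin(a, b, tau):
    # softmin_tau(a, b) = (a*exp(-a/tau) + b*exp(-b/tau)) / (exp(-a/tau) + exp(-b/tau))
    w = torch.softmax(torch.stack([-a / tau, -b / tau]), dim=0)
    return w[0] * a + w[1] * b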

      What do we take away?

1. Training to reach a fixed point can work as a way to determine when to stop reasoning, but it gets increasingly difficult as the underlying problem gets harder.
2. It's unclear what inductive bias to choose in order to ensure fast enough convergence to a fixed point. Each comes with downsides, such as uninformative gradients at the fixed point.
3. Optimisation is tricky and stands in the way, in particular the implicit differentiation through the fixed point.
      \ No newline at end of file diff --git a/blog/diffusion-theory-from-scratch/index.html b/blog/diffusion-theory-from-scratch/index.html new file mode 100644 index 00000000..bcab3243 --- /dev/null +++ b/blog/diffusion-theory-from-scratch/index.html @@ -0,0 +1,36 @@ + Building Diffusion Model's theory from ground up | ICLR Blogposts 2024

      Building Diffusion Model's theory from ground up

Diffusion Models, a new generative model family, have taken the world by storm after the seminal paper by Ho et al. [2020]. While diffusion models are often described as probabilistic Markov chains, their underlying principle is based on the decades-old theory of Stochastic Differential Equations (SDEs), as shown later by Song et al. [2021]. In this article, we will go back and revisit the 'fundamental ingredients' behind the SDE formulation and show how the idea can be 'shaped' to arrive at the modern form of Score-based Diffusion Models. We'll start from the very definition of the 'score', how it was used in the context of generative modeling, how we achieve the necessary theoretical guarantees, and how the critical design choices were made to finally arrive at the more 'principled' framework of Score-based Diffusion. Throughout this article, we provide several intuitive illustrations for ease of understanding.

      Introduction

      Motivation

Not only has generative modeling been around for decades, but a few promising model families emerged and dominated the field for several years in the recent past. VAEs dominated the generative modeling landscape from 2014 onwards, until GANs took off in 2015-16; Normalizing Flows (NFs) never really made it into mainstream generative modeling due to their restrictive architectural requirements. However, it is quite clear at this point that the magnitude of their impact is smaller than what barely 2-3 years of Diffusion Models achieved. This is mostly attributed to one of the seminal papers (by Jonathan Ho et al.), now popularly referred to as "Denoising Diffusion Probabilistic Models" or DDPM. With the exponential explosion of works following DDPM, it is very hard, or rather unnecessary, to look beyond this pivotal point.

In this article, we look back at the conceptual and theoretical ideas that were in development for a long time, even outside the field of core machine learning. We will show in later sections that some of the theoretical 'pillars' holding up Diffusion Models have their roots deep in statistical physics and other fields. A significant part of this theory was presented afresh in the ICLR paper (which won a best paper award). Lastly, even though the ideas presented in this article are quite theoretical, we made our best attempt to convey them with intuitive explanations, diagrams and figures, thereby expanding the potential audience. To encourage further exploration, we provide all code used to produce the figures (and experiments) of this article in this repository.

This article notes that, historically, there were two distinct roads of development that merged in order for modern diffusion models to emerge – "scalable estimation of the score" and "using the score for generative modeling". The former is relatively short, while the latter traces its origin back to ~1900, if not earlier. This article explores these two paths independently – the latter first, assuming knowledge of the former. The rest of this introductory section is spent on defining the general modeling problem and the very notion of the 'score' – the primary quantity of interest. The next section deals with how we can use the score in generative modeling, assuming access to an oracle for the true score. The last section dives solely into the problem of estimating the score in a scalable manner. It is worth mentioning that, in this article, we explain only the "sufficient and necessary" concepts needed to build the diffusion model framework, and hence may not directly resemble the typical formalism seen in most papers.

      Generative Modeling

The problem of generative modeling, in most cases, is posed as parametric density estimation using a finite set of samples \(\{ x^{(n)} \}_{n=1}^N\) from a "true but unknown" data distribution \(q_{data}(x)\). With a suitable model family chosen as \(p_{\theta}(x)\), with unknown parameters \(\theta\), the problem boils down to maximizing the average (log-)likelihood (w.r.t. \(\theta\)) of all the samples under the model

\[\theta^* = \arg\max_{\theta} \mathbb{E}_{x \sim q_{data}(x)} \left[ \log p_{\theta}(x) \right] \approx \arg\max_{\theta} \frac{1}{N} \sum_{n=1}^N \log p_{\theta}(x^{(n)})\]

It turned out, however, that defining an arbitrary parametric density \(p_{\theta}(x)\) is not as easy as it looks. One aspect of \(p_{\theta}\) is widely considered to be the evil behind this difficulty – the normalizing constant that stems from the axioms of probability

\[p_{\theta}(x) = \frac{\tilde{p}_{\theta}(x)}{\color{purple} \int_x \tilde{p}_{\theta}(x) dx}\]

      Existing Frameworks

It was understood quite early on that any promising generative model family must have one property – ease of sampling, i.e. generating new data samples. Sampling was so essential to generative modeling that the model families that followed were all geared towards effective sampling, even at the expense of other, not-so-important properties. It was also well understood that there was one common underlying principle most effective for crafting "sampling-centric" generative models – transforming simple probability densities. This formed the backbone of every single generative model family so far; be it VAEs, GANs or NFs, their generative process is a density transformation of this form

      \[x = f_{\theta}(z),\text{ where } z \sim \mathcal{N}(0, I)\]

which suggests starting with a simple density (often just a standard normal) followed by a functional transformation \(f_{\theta}\), typically a neural network with parameters \(\theta\). For VAEs, the function \(f_{\theta}\) is the decoder; for GANs, it's the generator network; and for NFs, it's the entire flow model. Note, however, that they differ mostly in how they are trained, which may involve more parametric functions (e.g. VAE's encoder or GAN's discriminator) and additional machinery. This way of building generative models turned out to be an effective way of sidestepping the notorious normalizing constant.

      Diffusion is no different

Diffusion Models, at their core, follow the exact same principle, but with a slightly cleverer design. For diffusion models, the transformation \(f_{\theta}\) is rather complicated: it is a sequence of invocations of a neural function (denoted as \(s_{\theta}\)) along with some additional computation (denoted as \(g(\cdot)\))

      \begin{equation} \label{eq:diffusion_general_parametric_structure} x = g_1(g_2(g_3(\cdots z \cdots, s_{\theta}), s_{\theta}), s_{\theta}), \text{ where } z \sim \mathcal{N}(0, I) \end{equation}

This is a big difference between Diffusion Models and other generative model families. Prior generative families tried to learn the exact transformation directly via one parametric neural function \(f_{\theta}\). Diffusion Models, on the other hand, try to learn \(s_{\theta}\), a quantity very fundamental and intrinsic to any true data distribution \(q_{data}(x)\). The quantity in question has historically been called the "Score".
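Schematically (with a hypothetical g and s_theta; the concrete update rule is derived later in this article), the generative process of Eq.\eqref{eq:diffusion_general_parametric_structure} is just a loop:

import torch

def generate(g, s_theta, n_steps, shape):
    z = torch.randn(shape)   # z ~ N(0, I)
    for _ in range(n_steps):
        z = g(z, s_theta)    # repeated invocation of the neural function
    return z                 # ideally, a sample from q_data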

      The ‘Score’

The term 'Score' is simply defined as the gradient of the log-density of a distribution, i.e. \(\nabla \log p(\cdot)\). In statistics, it is also known (though not widely) as the 'Informant'. One might argue that 'Score' is rather a strange name for such a quantity. The origin of the term can be traced (thanks to this StackOverflow answer by @ben) to a 1935 paper by Ronald Fisher, where he used it in a very generic sense in order to "rank" some quantities. In the context of diffusion models, however, we stick to the modern definition of the score. The true score of our data distribution is therefore defined as the gradient of the log of the true data density, w.r.t. the data variable

      \begin{equation} \label{eq:data_score_defn} \nabla_x \log q_{data}(x) \triangleq s(x) \end{equation}

The quantity in Eq.\eqref{eq:data_score_defn} is unknown, just like the true data density \(q_{data}(x)\). It does have a meaning though: the "true score" is the direction of steepest increase in log-likelihood at any given point in the data space. See the gray arrows in the figure below.

Simply put, at a point \(x\), it tells us the best direction to step in (with a small step size \(\delta\)) if we would like to find a point \(x'\) with slightly higher likelihood

\begin{equation} \label{eq:naive_score_steps} x' = x + \delta \cdot \left. \nabla_x \log q_{data}(x) \right|_{x = x} \end{equation}

Please note that this stems purely from the definition of the gradient operator \(\nabla\) in the score. If you are familiar with gradient descent, you may notice a conceptual resemblance.

      Now, there are two burning questions here:

1. Considering we have access to the true score, is Eq.\eqref{eq:naive_score_steps} enough to define a generative process with an appropriate convergence guarantee?
2. How do we actually get the true score?

The following two sections answer these questions respectively. Luckily, these two questions are somewhat decoupled, so they can be studied independently. The first section analyzes the first question, assuming we have access to the true score \(\nabla_x \log q_{data}(x)\). The second section explores how to get the true score, or rather, an approximation of it.

      Generative Modeling with Scores

As explained before, we would like to sample from the true data distribution \(q_{data}(x)\), but all we have access to (we assume) is its score \(s(x)\) as defined in Eq.\eqref{eq:data_score_defn}. One may define a naive generative process as the iterative application of Eq.\eqref{eq:naive_score_steps}. Intuitively, it is very similar to gradient descent, where we greedily climb the log-density surface to attain a local maximum. If so, we can already see a possible instance of the general structure of Diffusion's generative process as hinted in Eq.\eqref{eq:diffusion_general_parametric_structure}, with \(g(\cdot)\) being

\[g(z, s(\cdot)) = z + \delta \cdot s(z) = z + \delta \cdot \left. \nabla_x \log q_{data}(x) \right|_{x=z}\]

With a little reshuffling of Eq.\eqref{eq:naive_score_steps} and considering \(\delta \rightarrow 0\), one can immediately reveal the underlying ODE (Ordinary Differential Equations describe how a process evolves over time through its infinitesimal change) that describes the infinitesimal change

      \begin{equation} \label{eq:ode_with_score} dx = \nabla_x \log q_{data}(x) dt \end{equation}

BUT, please note that this is only an intuitive attempt, based entirely on the definition of the score. It carries absolutely no guarantee that this process converges to samples from the true data distribution. In fact, this process is greedy, i.e. it only seeks to go uphill, converging exactly at the modes (local maxima of the probability density). The figure below shows the samples \(x\) subjected to the process in Eq.\eqref{eq:ode_with_score} and their density \(p_t(x)\) evolving over time. The density in red is the target density whose score (we assume we know it) is being used.

In this case, at \(t=\infty\), all samples will converge to the state with the highest likelihood (i.e. exactly at the center). This isn't really desirable, as it doesn't "explore" at all. Just like any other sampling algorithm, we need noise injection!

      Langevin Equation and Brownian Motion

It turned out that this problem was explored long ago in molecular dynamics by the French physicist Paul Langevin, in the context of analyzing the movement of particles suspended in a fluid. He described the overall dynamics of a particle, i.e. how its position changes over time \(t\) when in a potential energy field \(U(x)\)

      \begin{equation} \label{eq:original_langevin_dyn} dx = - \nabla_x U(x) dt + \sqrt{2} dB_t \end{equation}

The term \(dB_t\) is called "Brownian Motion" and is effectively the source of noise – we will talk about this later in this subsection. Energy is considered "bad", i.e. particles do not want to stay in a high-energy state. So they try to go downhill and settle in low-energy states using the gradient of the energy surface. The Langevin equation (i.e. Eq.\eqref{eq:original_langevin_dyn}) happened to provide sufficient "exploration" so that the particles visit states with probability \(\propto e^{-U(x)}\). This suggests that we can treat "negative energy" as log-likelihood

      \[q_{data}(x) \propto e^{-U(x)} \implies \log q_{data}(x) = -U(x) + C \implies \nabla_x \log q_{data}(x) = - \nabla_x U(x)\]

By using the above substitution in the Langevin equation, we can move out of physics and continue with our ML perspective

      \begin{equation} \label{eq:langevin_dyn} dx = \nabla_x \log q_{data}(x) dt + \sqrt{2} dB_t \end{equation}

Note that this isn't very different from our "intuitive" and greedy process in Eq.\eqref{eq:ode_with_score}, except for the noise term \(dB_t\) and a strange \(\sqrt{2}\). But this makes all the difference! Brownian motion is an old construct from particle physics describing the random motion of particles in fluids and gases. It is simply Gaussian noise with infinitesimally small variance (in practice, the smaller the step you take, the smaller the noise you get)

\[dB_t \sim \mathcal{N}(0, dt) \implies dB_t = \sqrt{dt} \cdot z,\text{ where } z \sim \mathcal{N}(0, I)\]

With that, we can simulate our new Langevin equation with noise (i.e. Eq.\eqref{eq:langevin_dyn}) just like the noiseless case. You can now see that the noise keeps the process from collapsing entirely onto the mode. If you look carefully, we have added a little "tail" to each point to help visualize its movement.

      Fokker-Planck Equation

The simulation is convincing; but it'd be even better if we could theoretically verify that the process in Eq.\eqref{eq:langevin_dyn} indeed converges to \(q_{data}(x)\). The key to this proof is figuring out \(p_t(x)\) and making sure that it stabilizes as \(t\rightarrow \infty\), i.e. \(p_{\infty}(x) = q_{data}(x)\). It turns out that a stochastic process of the form \(dx = \mu_t(x) dt + \sigma_t(x) dB_t\), acting on a random variable \(x\), induces a time-varying distribution described by the following PDE

      \begin{equation} \frac{\partial}{\partial t}p_t(x) = -\frac{\partial}{\partial x} \Big[ p_t(x)\mu_t(x) \Big] + \frac{1}{2} \frac{\partial^2}{\partial x^2} \Big[ p_t(x) \sigma^2_t(x) \Big] \end{equation}

This is a well-celebrated result known as the "Fokker-Planck equation", which even predates the Langevin equation. So, the solution of this PDE is exactly what we see in the above figure (middle). One can easily verify the convergence of Eq.\eqref{eq:langevin_dyn} by first observing \(\mu_t(x) = \nabla_x \log q_{data}(x), \sigma_t(x) = \sqrt{2}\) and then using \(\frac{\partial}{\partial t} p_{\infty}(x) = \frac{\partial}{\partial t} q_{data}(x) = 0\).

      \[\begin{eqnarray*} \frac{\partial}{\partial t}p_{\infty}(x) &=& -\frac{\partial}{\partial x} \Big[ p_{\infty}(x) \nabla_x \log q_{data}(x) \Big] + \frac{(\sqrt{2})^2}{2} \frac{\partial^2}{\partial x^2} \Big[ p_{\infty}(x) \Big] \\ \frac{\partial}{\partial t} q_{data}(x) &=& -\frac{\partial}{\partial x} \Big[ q_{data}(x) \nabla_x \log q_{data}(x) \Big] + \frac{(\sqrt{2})^2}{2} \frac{\partial^2}{\partial x^2} \Big[ q_{data}(x) \Big] \\ 0 \text{ (LHS)} &=& -\frac{\partial}{\partial x} \Big[ \nabla_x q_{data}(x) \Big] + \frac{\partial}{\partial x} \Big[ \nabla_x q_{data}(x) \Big] = 0\text{ (RHS)} \end{eqnarray*}\]

The LHS holds because after a long time (i.e. \(t = \infty\)) the distribution stabilizes (this is called a "stationary" or "equilibrium" distribution). Please also note that the proof above is for the one-dimensional case and is included for illustrative purposes only – the general case is slightly more complicated.

So, we're all good. Eq.\eqref{eq:langevin_dyn} is a provably correct way of sampling, given we have access to the true score. In fact, the very work (by Song et al.) that immediately precedes DDPM used exactly Eq.\eqref{eq:langevin_dyn} in its discrete form

\begin{equation} x_{t+\delta} = x_t + \delta \cdot \left. \nabla_x \log q_{data}(x) \right|_{x = x_t} + \sqrt{2\delta} \cdot z \end{equation}

      where \(\delta\) (a small constant) is used as a practical proxy for the theoretical \(dt\).
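As a minimal sketch (assuming score_fn is an oracle for the true score), this discrete update is straightforward to simulate:

import torch

def langevin_sampler(score_fn, x_init, delta=1e-3, n_steps=10_000):
    # x_{t+delta} = x_t + delta * score(x_t) + sqrt(2*delta) * z,  z ~ N(0, I)
    x = x_init.clone()
    for _ in range(n_steps):
        x = x + delta * score_fn(x) + (2 * delta) ** 0.5 * torch.randn_like(x)
    return x

# Example: the score of N(0, I) is -x (see the forward-process section below),
# so this should produce approximately standard normal samples.
samples = langevin_sampler(lambda x: -x, torch.zeros(1000, 2))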

If you are already familiar with Diffusion Models, specifically their reverse process, you might be scratching your head. That is because the generative process in Eq.\eqref{eq:langevin_dyn} isn't quite the same as what modern diffusion models do. We need to cross a few more hurdles before we get there.

      A probability path

More than just a proof, the Fokker-Planck equation provides us with a key insight: gradually transforming one distribution into another is equivalent to traveling (over time) on a "path" in the space of probability distributions. Imagine a space of all possible probability distributions \(p\) (while each distribution also varies in space, i.e. over \(x\), let's hide that for now and imagine them as just vectors). The Fokker-Planck equation for Eq.\eqref{eq:langevin_dyn}, therefore, represents a specific dynamics on this probability space whose solution trajectory \(p_t\) ends at \(q_{data}\) at \(t = \infty\).

Speaking of the dynamics, there is something we haven't talked about yet: the initial distribution at \(t=0\), i.e. \(p_0\). In the simulation above, I quietly used a standard normal \(\mathcal{N}(0, I)\) as the starting distribution (you can notice this in the first few frames of the animation) without ever discussing it. It turns out that our Fokker-Planck equation does not place any specific requirement on \(p_0\), i.e. the process always converges to \(p_{\infty} = q_{data}\) no matter where you start. Here's an illustration that shows two different starting distributions \(p_0\), both of whose "paths" \(p_t\) in probability space ultimately converge to \(q_{data}\).

So theoretically, given the score function \(\nabla_x \log q_{data}(x)\) of a target distribution \(q_{data}(x)\), one can "travel to" it from any distribution. However, keeping in mind our need for sampling, it's best to choose an initial distribution that is sampling-friendly. Strictly speaking, there are a couple of reasonable choices, but the diffusion model community settled on the isotropic Gaussian (i.e. \(\mathcal{N}(0, I)\)). This is not only due to its good standing across machine learning and statistics, but also because, in the context of SDEs with Brownian motion (remember, infinitesimal Gaussian noise), Gaussians arise quite naturally.

      Estimating the “score” is hard

So far, what we've talked about is just the generative process, or as the diffusion model literature calls it, the "reverse process". But we haven't really talked about the "forward process" yet, in case you are familiar with it. The forward process, in simple terms, is an ahead-of-time description of the "probability path" that the reverse process intends to take. But the question is, why do we need to know the path ahead of time – the reverse process seems quite spontaneous (in the sense that, given a score function, it just travels to the correct target distribution on its own), no? Sadly, this can't be answered with theory alone.

      The problem lies in Eq.\eqref{eq:langevin_dyn} – let’s write it again with a little more verbosity

      \begin{equation} dx_t = \nabla_x \left. \log q_{data}(x) \right|_{x = x_t}\ dt + \sqrt{2} dB_t \end{equation}

Even though we wish to estimate \(\nabla_x \log q_{data}(x)\vert_{x = x_t}\) with a neural network \(s_{\theta}(x = x_t)\), this turned out to be extremely hard in practice. It was understood that one neural network is not enough to capture the richness of the score function at all values of \(x\). There were two options before us – either make the neural network expressive enough, or learn the network only where it's needed. The community settled on the second, as it was easier to solve.

So, what some of the pioneering works did is first fix a path (on probability space, like we showed above) and then learn the score only on that path. It's all about specializing the neural network \(s_{\theta}(x_t, t)\) over \(t \in [0, \infty]\). The neural score estimator is capable of producing the right score if we provide the time \(t\), which we can of course. We will see in the next section that, to learn the score of any distribution, we need samples from it. This begs the question: how do we get samples \(x_t\) (for all \(t\)) for training purposes? It certainly can't be with Eq.\eqref{eq:langevin_dyn}, since that requires the score. The answer is that we need to run this process the other way – this is what Diffusion Models call the "Forward Process".

      The “forward process”

Going the other way requires us to run a simulation from \(q_{data}(x)\) at \(t=0\) to \(t=\infty\), just the opposite of the animation above. Recall that we already saw how to do this: to reach any distribution at \(t=\infty\), all you need is its score and the Langevin equation. So how about we start from \(q_0 = q_{data}(x)\) this time (remember, the starting point doesn't matter!) and run the Langevin simulation again with the known end target \(q_{\infty} = \mathcal{N}(0, I)\)?

\[\begin{eqnarray} dx &=& \nabla_x \log \mathcal{N}(x; 0, I) dt + \sqrt{2} dB_t \\ \label{eq:forward_sde} &=& -x dt + \sqrt{2 dt} z \end{eqnarray}\]

It is interesting to note that, since the target distribution is known in closed form, we no longer have any awkward score terms dangling around. The score of \(\mathcal{N}(x; 0, I)\) is simply \(-x\) (we encourage the reader to verify this as an exercise). The discretized version of Eq.\eqref{eq:forward_sde}, i.e.

      \[\begin{eqnarray*} x_{t+dt} &=& x_t - x_t \cdot dt + \sqrt{2 dt}\ z \\ &=& (1 - dt) x_t + \sqrt{2 dt}\ z \end{eqnarray*}\]

.. may resemble DDPM's forward process (hint: compare \(dt\) with DDPM's \(\beta_t\)).

NOTE: A little subtlety here: we only fixed the end point of the forward process, not the exact path. It seems that running the Langevin equation in the forward direction chose one path on its own. It turns out that this is the "isotropic path", where all dimensions of the variable \(x\) evolve in time the exact same way. Some works recently uncovered non-isotropic diffusion, where it is indeed possible to travel on other paths. But this is outside the scope of this article.

We can simulate the above equation just like we did in the reverse process, in order to get samples \(x_t \sim q_t\). Below we show a simulation of the forward process.
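As a minimal sketch (toy code of ours), the forward simulation is the same kind of loop, now with the known score \(-x\):

import torch

def forward_process(x0, dt=1e-2, n_steps=500):
    # x_{t+dt} = (1 - dt) * x_t + sqrt(2*dt) * z  -- pushes q_data towards N(0, I)
    x = x0.clone()
    for _ in range(n_steps):
        x = (1 - dt) * x + (2 * dt) ** 0.5 * torch.randn_like(x)
    return x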

While it is true that the reverse process is inherently sequential due to the arbitrary nature of the score, the forward process (in Eq.\eqref{eq:forward_sde}) is entirely known and hence can be exploited to ease the sequentiality. We can see a way out if we try to simplify (using the standard assumption \(dt^2 = 0\)) the expression for \(x_{t+2dt}\) using \(x_{t+dt}\)

      \[\begin{eqnarray*} x_{t+2dt} &=& (1 - dt) {\color{blue} x_{t+dt}} + \sqrt{2dt}\ z_2 \\ &=& (1 - dt) {\color{blue} \left[(1 - dt) x_t + \sqrt{2 dt}\ z_1\right]} + \sqrt{2dt}\ z_2 \\ &=& (1 - 2dt) x_t + \sqrt{2dt(1-dt)^2 + 2dt}\ z_{12} \\ &=& (1 - 2 \cdot dt) x_t + \sqrt{2 \cdot 2dt}\ z_{12} \\ \implies x_{t+2dt} &\sim& \mathcal{N}((1 - 2 \cdot dt) x_t, 2 \cdot 2dt I) \end{eqnarray*}\]

The above simplification suggests that we can jump to any time \(t\), without going through the entire sequence, in order to sample \(x_t \sim q_t\). In fact, \(q_t(x_t\vert x_0)\) is Gaussian! This result opens up an interesting interpretation: generating \(x_0 \sim q(x_0 \vert x_t)\) can be seen as solving a "Gaussian inverse problem", which we explore in a later section.
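Extending the two-step computation to an arbitrary time \(t\) gives, in the limit, the closed-form transition of this (Ornstein-Uhlenbeck) forward process, \(q_t(x_t \vert x_0) = \mathcal{N}(e^{-t} x_0, (1 - e^{-2t}) I)\) (a standard result we state here without derivation). A sketch of the resulting one-shot sampler:

import torch

def sample_xt_given_x0(x0, t):
    # q_t(x_t | x_0) = N(exp(-t) * x0, (1 - exp(-2t)) * I) -- one jump, no simulation
    mean = torch.exp(torch.as_tensor(-t)) * x0
    std = torch.sqrt(1 - torch.exp(torch.as_tensor(-2.0 * t)))
    return mean + std * torch.randn_like(x0)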

      All good for now, but there is one more thing we need to deal with.

      Finite time & the “schedule”

What we discussed so far, i.e. the forward and reverse processes, requires infinite time to reach its end state. This is a direct consequence of using the Langevin equation. That, of course, is unacceptable in practice. But there exists quite an elegant fix, well known to mathematics – we simply re-define what time means. We may choose a re-parameterization of time as, for example, \(t' = \mathcal{T}(t) = 1 - e^{-t} \in [0, 1]\) (note that \(t = 0 \implies t' = 0\) and \(t = \infty \implies t' = 1\), converting the range \([0, \infty]\) to \([0, 1]\)). Plugging \(dt = \mathcal{T}'(t)^{-1} dt' = e^t dt'\) (since \(t' = 1 - e^{-t} \implies dt' = e^{-t} dt \implies dt = e^t dt'\)) into the forward equation brings us even closer to DDPM's forward process

      \[x_{t' + dt'} = (1 - {\color{blue}e^t dt'}) x_t + \sqrt{2 {\color{blue}e^t dt'}}\ z\]

This suggests that in the world where time runs from \(t' = 0 \rightarrow 1\), we need to accelerate the forward process by replacing \(dt\) with \(e^t dt'\). The quantity \(\mathcal{T}'(t)^{-1} dt' = e^t dt'\) is analogous to what diffusion models call a "schedule". Recall that DDPM uses a small but increasing "schedule" \(\beta_t\) (\(e^t dt'\) is small because of \(dt'\), and increasing because of \(e^t\)).

Of course, our choice of the exact end time (i.e. \(t' = 1\)) and the re-parameterization \(\mathcal{T}\) are somewhat arbitrary. Different choices of \(\mathcal{T}\), and consequently \(\mathcal{T}'(t)^{-1} dt'\), lead to different schedules (e.g. linear, cosine, etc.).

NOTE: Choosing a different schedule does not mean the process takes a different path on the probability space; it simply changes the speed of movement over time towards the end state.

      Summary

To summarize, in this section we started with the definition of the 'score' and arrived at a stochastic process (thanks to an old result by Langevin) that, at infinite time, converges to the density associated with the score. We saw that this process is provably correct and can be interpreted as a "path" on the probability space. We argued that, due to the difficulty of score estimation everywhere along the path, we need samples at the intermediate times \(t\) in order to specialize the score estimates. To do that, we had to travel backwards on the path, which can be done in closed form. We also saw how this process, even though it theoretically takes infinite time, can be shrunk down to a finite interval, opening up a design choice known as "schedules".

      Estimating the Score

The last section, while explaining the "sampling" part of score-based diffusion models, assumed that we have access to the true score \(\nabla_x \log q_{data}(x)\) via some oracle. That is, of course, untrue in practice. In fact, accessing the true score of an arbitrary distribution is just not possible (we can only access the true score of distributions with a closed form, e.g. Gaussians). So the way forward, as mentioned before, is to estimate/learn it with a parametric neural network \(s_{\theta}(x)\). Recall, however, that all we have access to is samples from \(q_{data}(x)\).

If curious enough, one may question how realistic it is to estimate the score \(\nabla_x \log q_{data}(x)\) when we usually cannot estimate the density \(q_{data}(x)\) itself. After all, the score is a quantity derived from the density! The answer becomes clear once we make the normalization constant explicit

      \[\begin{eqnarray*} \nabla_x \log q_{data}(x) &=& \nabla_x \log \frac{\tilde{q}_{data}(x)}{\int_{x} \tilde{q}_{data}(x) dx} \\ &=& \nabla_x \log \tilde{q}_{data}(x) - {\color{red}\nabla_x \log \int_{x} \tilde{q}_{data}(x) dx} \\ &=& \nabla_x \log \tilde{q}_{data}(x) \end{eqnarray*}\]

The part in red is zero since it has no dependence on \(x\). So the score very cleverly sidesteps the normalization constant. This is the reason score estimation gained momentum in the research community.

      Implicit Score Matching

The first notable attempt at this problem was by Aapo Hyvärinen back in 2005. His idea was simply to start from a loss function that, when minimized, leads to an estimator of the true score

      \begin{equation} J(\theta) = \frac{1}{2} \mathbb{E}_{x\sim q_{data}(x)}\Big[ \vert\vert s_{\theta}(x) - \nabla_x \log q_{data}(x) \vert\vert^2 \Big] \end{equation}

It is simply an \(L_2\) loss between a parametric model and the true score, weighted by the probability of individual states (hence the expectation). But of course, it is not computable in this form, as it contains the true score. Hyvärinen's contribution was to show that, theoretically, the minimization problem is equivalent when the loss function is

      \begin{equation} \label{eq:impl_score_match} J_{\mathrm{I}}(\theta) = \mathbb{E}_{x\sim q_{data}(x)}\Big[ \mathrm{Tr}(\nabla_x s_{\theta}(x)) + \frac{1}{2} \vert\vert s_{\theta}(x) \vert\vert^2 \Big] \end{equation}

In the literature, this is known as "Implicit Score Matching". The derivation is relatively simple and only involves algebraic manipulations – please see Appendix A of the original paper. The remarkable nature of this result stems from the fact that \(J_{\mathrm{I}}\) no longer contains the true score. The only dependency on \(q_{data}\) is via the expectation, which can be approximated by a sample average over our dataset.

But the key challenge with Implicit Score Matching was the \(\mathrm{Tr}(\nabla_x s_{\theta}(x))\) term, i.e. the trace of the Jacobian of the neural score model (equivalently, the Hessian of the model's log-density), which is costly to compute. This prompted several follow-up works in the race towards scalable score matching, one of which (namely Denoising score matching) is used in Diffusion Models to this day.

For the sake of completeness, I would like to mention the work of Yang Song et al. around 2019, which proposed an engineering trick to alleviate the Hessian computation. They simply used the "Hutchinson trace estimator" (a stochastic way of computing a trace: \(\mathrm{Tr}(M) = \mathbb{E}_{v\sim p_v} \big[ v^T M v \big]\), where \(p_v\) can be one of many distributions, most notably \(\mathcal{N}(0, I)\)) to replace the \(\mathrm{Tr}(\cdot)\) in Eq.\eqref{eq:impl_score_match}, which eased the computation a bit. This approach, however, did not end up being used in practice.
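For illustration, a toy sketch of the Hutchinson estimator applied to this trace term (names are ours):

import torch

def hutchinson_trace(score_fn, x, n_samples=8):
    # Tr(grad_x s(x)) ~= E_v[ v^T (grad_x s(x)) v ],  v ~ N(0, I)
    x = x.detach().requires_grad_()
    s = score_fn(x)
    est = x.new_zeros(())
    for _ in range(n_samples):
        v = torch.randn_like(x)
        # one vector-Jacobian product gives v^T J
        vjp = torch.autograd.grad(s, x, grad_outputs=v, retain_graph=True)[0]
        est = est + (vjp * v).sum()
    return est / n_samples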

      Denoising Score Matching

The most valuable contribution came from Pascal Vincent in 2011, when he showed that the score matching problem has yet another equivalent objective, called "Denoising" score matching

      \begin{equation} \label{eq:deno_score_match} J_{\mathrm{D}}(\theta) = \mathbb{E}_{x\sim q_{data}(x), \epsilon\sim\mathcal{N}(0, I)}\left[ \frac{1}{2} \left|\left| s_{\theta}(\ \underbrace{x + \sigma\epsilon}_{\tilde{x}}\ ) - (- \frac{\epsilon}{\sigma}) \right|\right|^2 \right] \end{equation}

We deliberately wrote it in a way that exposes its widely accepted interpretation. Denoising score matching simply adds some known noise \(\sigma\epsilon\) to the datapoints \(x\) and learns (in the mean-squared sense), from the "noisy" point \(\tilde{x}\), the direction back to the clean point, i.e. \((-\epsilon)\), scaled by \(\frac{1}{\sigma}\). In a way, it acts like a "de-noiser", hence the name. It is theoretically guaranteed that \(J_{\mathrm{D}}\) leads to an unbiased estimate of the true score. Below we show a visualization of the score estimate as it learns from data.
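In code, the objective is pleasantly simple. A sketch with a single fixed noise level \(\sigma\) (score_net is any vector-valued network of matching dimension):

import torch

def dsm_loss(score_net, x, sigma):
    # Perturb the data, then regress the score at the noisy point onto -eps/sigma.
    eps = torch.randn_like(x)
    x_tilde = x + sigma * eps
    target = -eps / sigma
    return 0.5 * ((score_net(x_tilde) - target) ** 2).sum(dim=-1).mean()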

A little algebraic manipulation of Eq.\eqref{eq:deno_score_match}, demonstrated by Ho et al., leads to an equivalent form which turned out to be training-friendly.

      \[\begin{eqnarray} J_{\mathrm{D}}(\theta) &=& \mathbb{E}_{x\sim q_{data}(x), \epsilon\sim\mathcal{N}(0, I)}\left[ \frac{1}{2\sigma^2} \left|\left| {\color{blue} - \sigma s_{\theta}}(\tilde{x}) - \epsilon \right|\right|^2 \right] \\ &=& \mathbb{E}_{x\sim q_{data}(x), \epsilon\sim\mathcal{N}(0, I)}\left[ \frac{1}{2\sigma^2} \left|\left| {\color{blue} \epsilon}_{\theta}(\tilde{x}) - \epsilon \right|\right|^2 \right]\label{eq:deno_eps_match} \end{eqnarray}\]

We simply change the interpretation of what the network learns. In this form, the "noise estimator" network learns just the original pure Gaussian noise vector \(\epsilon\) that was added while crafting the noisy sample. So, from a noisy sample, the network \(\epsilon_{\theta}\) learns roughly a unit-variance direction that points towards the clean sample.

      There is yet another re-interpretation of Eq.\eqref{eq:deno_score_match} that leads to a slightly different perspective

      \[\begin{eqnarray} J_{\mathrm{D}}(\theta) &=& \mathbb{E}_{x\sim q_{data}(x), \epsilon\sim\mathcal{N}(0, I)}\left[ \frac{1}{2\sigma^4} \left|\left| {\color{blue}\tilde{x} + \sigma^2 s_{\theta}}(\tilde{x}) - (\underbrace{\tilde{x} - \sigma\epsilon}_{x}) \right|\right|^2 \right] \\ &=& \mathbb{E}_{x\sim q_{data}(x), \epsilon\sim\mathcal{N}(0, I)}\left[ \frac{1}{2\sigma^4} \left|\left| {\color{blue} x_{\theta}}(\tilde{x}) - x \right|\right|^2 \right]\label{eq:deno_endpoint_match} \end{eqnarray}\]

      Eq.\eqref{eq:deno_endpoint_match} shows, that instead of the noise direction towards clean sample, we can also have the clean sample directly as a learning target. This is like doing “denoising” in its true sense. We will get back to this in the next subsection.

      Probing the learning objective

If you are still puzzled about how Eq.\eqref{eq:deno_eps_match} relates to learning the score, there is a way to probe exactly what the network learns at an arbitrary input point \(\tilde{x}\). We note that the clean sample \(x\) and the noisy sample \(\tilde{x}\) come from a joint distribution that factorizes as

\[q(x, \tilde{x}) = q(\tilde{x} \vert x) q_{data}(x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I)\, q_{data}(x).\]

      We then factorize this joint in a slightly different way, i.e.

      \[q(x, \tilde{x}) = q(x \vert \tilde{x}) q(\tilde{x})\]

where \(q(x \vert \tilde{x})\) can be thought of as the distribution of all clean samples that could have led to the given \(\tilde{x}\). Eq.\eqref{eq:deno_eps_match} can therefore be written as

      \[\begin{eqnarray*} J_{\mathrm{D}}(\theta) &=& \mathbb{E}_{(x, \tilde{x}) \sim q(x,\tilde{x})}\left[ \frac{1}{2\sigma^2} \left|\left| \epsilon_{\theta}(\tilde{x}) - \epsilon \right|\right|^2 \right] \\ &=& \mathbb{E}_{\tilde{x} \sim q(\tilde{x}), x \sim q(x\vert \tilde{x})}\left[ \frac{1}{2\sigma^2} \left|\left| \epsilon_{\theta}(\tilde{x}) - \frac{\tilde{x} - x}{\sigma} \right|\right|^2 \right] \\ &=& \mathbb{E}_{\tilde{x} \sim q(\tilde{x})}\left[ \frac{1}{2\sigma^2} \left|\left| \epsilon_{\theta}(\tilde{x}) - \frac{\tilde{x} - \mathbb{E}_{x \sim q(x\vert \tilde{x})}[x]}{\sigma} \right|\right|^2 \right] \\ \end{eqnarray*}\]

In the last step, the expectation \(\mathbb{E}_{q(x\vert\tilde{x})}\left[ \cdot \right]\) was pushed inside, up to the only quantity that involves \(x\). Looking at it, you may realize that the network \(\epsilon_{\theta}\), given an input \(\tilde{x}\), learns the average noise direction that leads to the given input point \(\tilde{x}\). It also exposes the quantity \(\mathbb{E}_{x \sim q(x\vert \tilde{x})}[x]\), which is the average clean sample that leads to the given \(\tilde{x}\).

      Below we visualize this process with a toy example, followed by a short explanation.

Explanation: We have 10 data points \(x\sim q_{data}(x)\) in two clusters (big red dots) and we run the learning process by generating noisy samples \(\tilde{x}\sim q(\tilde{x})\) (small red dots). Instead of learning a neural mapping over the entire space, we learn a tabular map with only three chosen input points \(\tilde{x}_1, \tilde{x}_2, \tilde{x}_3\) (blue, magenta and green crosses). Every time we sample one of those three chosen input points (practically it's impossible to randomly sample a specific point, so we assume a little ball around each point), we note which data point it came from (shown by connecting a dotted line of the same color) and maintain a running average (bold cross of the same color), which is nothing but \(\mathbb{E}_{x \sim q(x\vert \tilde{x})}[x]\). We also show the average noise direction at each \(\tilde{x}\), i.e. \(\frac{\tilde{x} - \mathbb{E}_{x \sim q(x\vert \tilde{x})}[x]}{\sigma}\), with gray arrows. The gray arrows, as training progresses, start to resemble the score estimate of the data.

      Denoising as inverse problem

A similar treatment, applied to Eq.\eqref{eq:deno_endpoint_match}, yields the following

      \[\begin{eqnarray*} J_{\mathrm{D}}(\theta) &=& \mathbb{E}_{(x, \tilde{x}) \sim q(x,\tilde{x})}\left[ \frac{1}{2\sigma^4} \left|\left| {\color{blue}x_{\theta}}(\tilde{x}) - x \right|\right|^2 \right] \\ &=& \mathbb{E}_{\tilde{x} \sim q(\tilde{x})}\left[ \frac{1}{2\sigma^4} \left|\left| {\color{blue}\tilde{x} + \sigma^2 s_{\theta}}(\tilde{x}) - \mathbb{E}_{x \sim q(x\vert \tilde{x})}[x] \right|\right|^2 \right] \\ \end{eqnarray*}\]

Notice that I brought back the original form of \(x_{\theta}(\cdot)\) that involves the score. If we had the true score instead of a learned estimate, we would have

      \[\mathbb{E}_{x \sim q(x\vert \tilde{x})}[x] = \tilde{x} + \sigma^2 \nabla_{\tilde{x}} \log p(\tilde{x})\]

In the "inverse problem" and Bayesian literature, this is a well-celebrated result named "Tweedie's Formula", first published by Robbins but credited to statistician Maurice Tweedie. The theorem is applied in the context of Bayesian posterior estimation of a "true" quantity \(x\) which we only observe through a (Gaussian) noisy measurement \(\tilde{x}\). Tweedie's formula tells us that the posterior mean of the inverse problem \(q(x\vert \tilde{x})\) can be computed without ever knowing the actual density, as long as we have access to the score at the noisy measurement.

      Summary

In this section, we explored the problem of scalable score matching. We looked at notable attempts in the literature and learned that the score can be estimated from samples alone. We also looked at several interpretations of the learning objective and the connections they expose.

      Last few bits

      Incorporating time

In the last section, we expressed and explained everything in terms of one known noise level \(\sigma\) and the noisy sample \(\tilde{x}\). We did so to avoid cluttering together multiple concepts that aren't necessary to explain each other. In a previous section, however, we learned that the score must be estimated along every timestep of the forward process. Simply augmenting Eq.\eqref{eq:deno_score_match} with an additional time variable \(t \sim \mathcal{U}[0, 1]\) is sufficient to induce the time dependency in the score matching problem

      \begin{equation} \label{eq:deno_score_match_with_time} J_{\mathrm{D}}(\theta) = \mathbb{E}_{x_0, \epsilon, t \sim \mathcal{U}[0, 1], x_t\sim q_t(x_t\vert x_0) }\left[ \frac{1}{2} \left|\left| s_{\theta}(x_t, t) - (- \frac{\epsilon}{\sigma_t}) \right|\right|^2 \right] \end{equation}

.. where \(q_t(x_t \vert x_0)\) is defined in a previous section and \(\sigma_t\) is its standard deviation.
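A sketch of this time-augmented objective (mean_coef and sigma are assumed to be the closed-form mean coefficient and standard deviation of \(q_t(x_t \vert x_0)\) under the chosen schedule):

import torch

def time_dsm_loss(score_net, x0, mean_coef, sigma):
    # Sample t ~ U[0, 1], form x_t ~ q_t(x_t | x_0) in closed form,
    # then regress s_theta(x_t, t) onto -eps/sigma_t.
    t = torch.rand(x0.shape[0], 1)
    eps = torch.randn_like(x0)
    x_t = mean_coef(t) * x0 + sigma(t) * eps
    target = -eps / sigma(t)
    return 0.5 * ((score_net(x_t, t) - target) ** 2).sum(dim=-1).mean()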

We took a different approach

We would like to highlight that, in this article, we first explored the reverse process and then showed why the forward process emerges out of necessity. Typical diffusion model papers start from a forward process specification of the form

      \[dx_t = f(t)x_t dt + g(t) {dB}_t\]

      .. and then use Anderson’s SDE reversal to explain the reverse process, which also involves the score

      \[dx_t = \left[ f(t) x_t - g(t)^2 \underbrace{\nabla_{x_t} \log q_t(x_t)}_{s_{\theta}(x_t, t)} \right] dt + g(t) dB_t\]

      We argue that our approach is more “organic” in the sense that it builds up the theory chronologically, exploring the exact path the community went through over time.

      Conclusion

In this article, we dived deep into the theoretical fundamentals of Diffusion Models, which are often ignored by practitioners. We started from the 'heart' of diffusion models, i.e. the score, and built the concepts up almost chronologically. We hope this article serves as a conceptual guide toward understanding diffusion models from the score-SDE perspective. We intentionally avoided the 'probabilistic Markov model' view of diffusion, since more and more works embrace the SDE formalism.

      \ No newline at end of file diff --git a/blog/distill-example/index.html b/blog/distill-example/index.html new file mode 100644 index 00000000..c53eccaf --- /dev/null +++ b/blog/distill-example/index.html @@ -0,0 +1,102 @@ + Sample Blog Post | ICLR Blogposts 2024

      Sample Blog Post

      Your blog post's abstract. Please add your abstract or summary here and not in the main body of your text. Do not include math/latex or hyperlinks.

      Note: please use the table of contents as defined in the front matter rather than the traditional markdown styling.

      Equations

      This theme supports rendering beautiful math in inline and display modes using MathJax 3 engine. You just need to surround your math expression with $$, like $$ E = mc^2 $$. If you leave it inside a paragraph, it will produce an inline expression, just like \(E = mc^2\).

      To use display mode, again surround your expression with $$ and place it as a separate paragraph. Here is an example:

      \[\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)\]

      Note that MathJax 3 is a major re-write of MathJax that brought a significant improvement to the loading and rendering speed, which is now on par with KaTeX.

      Images and Figures

It's generally a better idea to avoid linking to images hosted elsewhere – links can break and you risk losing important information in your blog post. To include images in your submission in this way, you must do something like the following:

      {% include figure.html path="assets/img/2024-05-07-distill-example/iclr.png" class="img-fluid" %}

      which results in the following image:

      To ensure that there are no namespace conflicts, you must save your asset to your unique directory /assets/img/2024-05-07-[SUBMISSION NAME] within your submission.

      Please avoid using the direct markdown method of embedding images; they may not be properly resized. Some more complex ways to load images (note the different styles of the shapes/shadows):

      A simple, elegant caption looks good between image rows, after each row, or doesn't have to be there at all.

      Interactive Figures

      Here’s how you could embed interactive figures that have been exported as HTML files. Note that we will be using plotly for this demo, but anything built off of HTML should work (no extra javascript is allowed!). All that’s required is for you to export your figure into HTML format, and make sure that the file exists in the assets/html/[SUBMISSION NAME]/ directory in this repository’s root directory. To embed it into any page, simply insert the following code anywhere into your page.

      {% include [FIGURE_NAME].html %} 

      For example, the following code can be used to generate the figure underneath it.

import pandas as pd
import plotly.express as px

df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/earthquakes-23k.csv')

fig = px.density_mapbox(
    df, lat='Latitude', lon='Longitude', z='Magnitude', radius=10,
    center=dict(lat=0, lon=180), zoom=0, mapbox_style="stamen-terrain")
fig.show()

fig.write_html('./assets/html/2024-05-07-distill-example/plotly_demo_1.html')

      And then include it with the following:

<div class="l-page">
  <iframe src="{{ 'assets/html/2024-05-07-distill-example/plotly_demo_1.html' | relative_url }}" frameborder='0' scrolling='no' height="600px" width="100%"></iframe>
</div>

      Voila!

      Citations

      Citations are then used in the article body with the <d-cite> tag. The key attribute is a reference to the id provided in the bibliography. The key attribute can take multiple ids, separated by commas.

      The citation is presented inline like this: (a number that displays more information on hover). If you have an appendix, a bibliography is automatically created and populated in it.

      Distill chose a numerical inline citation style to improve readability of citation dense articles and because many of the benefits of longer citations are obviated by displaying more information on hover. However, we consider it good style to mention author last names if you discuss something at length and it fits into the flow well — the authors are human and it’s nice for them to have the community associate them with their work.


      Footnotes

Just wrap the text you would like to show up in a footnote in a <d-footnote> tag. The number of the footnote will be automatically generated. (This will become a hoverable footnote.)


      Code Blocks

      This theme implements a built-in Jekyll feature, the use of Rouge, for syntax highlighting. It supports more than 100 languages. This example is in C++. All you have to do is wrap your code in a liquid tag:

      {% highlight c++ linenos %}
      code code code
      {% endhighlight %}

      The keyword linenos triggers display of line numbers. You can try toggling it on or off yourself below:

#include <iostream>
#include <string>
using namespace std;

int main(int argc, char const *argv[])
{
    string myString;

    cout << "input a string: ";
    getline(cin, myString);
    int length = myString.length();

    // copy the string into a dynamically allocated character array
    char *charArray = new char[length];
    myString.copy(charArray, length);

    for (int i = 0; i < length; ++i) {
        cout << charArray[i] << " ";
    }

    delete[] charArray;
    return 0;
}

      Diagrams

      This theme supports generating various diagrams from a text description using jekyll-diagrams plugin. Below, we generate a few examples of such diagrams using languages such as mermaid, plantuml, vega-lite, etc.

Note: different diagram-generation packages require external dependencies to be installed on your machine. Also, be mindful that, because of diagram generation, the first build of your Jekyll website after adding new diagrams will be SLOW. For any other details, please refer to the jekyll-diagrams README.

      Note: This is not supported for local rendering!

      The diagram below was generated by the following code:

{% mermaid %}
sequenceDiagram
    participant John
    participant Alice
    Alice->>John: Hello John, how are you?
    John-->>Alice: Great!
{% endmermaid %}

      Tweets

      An example of displaying a tweet:

      An example of pulling from a timeline:

      For more details on using the plugin visit: jekyll-twitter-plugin


      Blockquotes

      We do not grow absolutely, chronologically. We grow sometimes in one dimension, and not in another, unevenly. We grow partially. We are relative. We are mature in one realm, childish in another. —Anais Nin

      Layouts

      The main text column is referred to as the body. It is the assumed layout of any direct descendants of the d-article element.

      .l-body

      For images you want to display a little larger, try .l-page:

      .l-page

      All of these have an outset variant if you want to poke out from the body text a little bit. For instance:

      .l-body-outset

      .l-page-outset

      Occasionally you’ll want to use the full browser width. For this, use .l-screen. You can also inset the element a little from the edge of the browser by using the inset variant.

      .l-screen

      .l-screen-inset

      The final layout is for marginalia, asides, and footnotes. It does not interrupt the normal flow of .l-body-sized text except on mobile screen sizes.

      .l-gutter


      Other Typography?

      Emphasis, aka italics, with asterisks (*asterisks*) or underscores (_underscores_).

      Strong emphasis, aka bold, with asterisks or underscores.

      Combined emphasis with asterisks and underscores.

      Strikethrough uses two tildes. Scratch this.

      1. First ordered list item
      2. Another item ⋅⋅* Unordered sub-list.
      3. Actual numbers don’t matter, just that it’s a number ⋅⋅1. Ordered sub-list
      4. And another item.

      ⋅⋅⋅You can have properly indented paragraphs within list items. Notice the blank line above, and the leading spaces (at least one, but we’ll use three here to also align the raw Markdown).

      ⋅⋅⋅To have a line break without a paragraph, you will need to use two trailing spaces.⋅⋅ ⋅⋅⋅Note that this line is separate, but within the same paragraph.⋅⋅ ⋅⋅⋅(This is contrary to the typical GFM line break behavior, where trailing spaces are not required.)

      • Unordered lists can use asterisks
      • Or minuses
      • Or pluses

      I’m an inline-style link

      I’m an inline-style link with title

      I’m a reference-style link

      I’m a relative reference to a repository file

      You can use numbers for reference-style link definitions

      Or leave it empty and use the link text itself.

      URLs and URLs in angle brackets will automatically get turned into links. http://www.example.com or http://www.example.com and sometimes example.com (but not on Github, for example).

      Some text to show that the reference links can follow later.

      Here’s our logo (hover to see the title text):

      Inline-style: alt text

      Reference-style: alt text

      Inline code has back-ticks around it.

var s = "JavaScript syntax highlighting";
alert(s);

s = "Python syntax highlighting"
print(s)

No language indicated, so no syntax highlighting.
But let's throw in a <b>tag</b>.

      Colons can be used to align columns.

| Tables        | Are           | Cool  |
| ------------- |:-------------:| -----:|
| col 3 is      | right-aligned | $1600 |
| col 2 is      | centered      |   $12 |
| zebra stripes | are neat      |    $1 |

      There must be at least 3 dashes separating each header cell. The outer pipes (|) are optional, and you don’t need to make the raw Markdown line up prettily. You can also use inline Markdown.

| Markdown | Less | Pretty |
| --- | --- | --- |
| *Still* | `renders` | **nicely** |
| 1 | 2 | 3 |

      Blockquotes are very handy in email to emulate reply text. This line is part of the same quote.

      Quote break.

      This is a very long line that will still be quoted properly when it wraps. Oh boy let’s keep writing to make sure this is long enough to actually wrap for everyone. Oh, you can put Markdown into a blockquote.

      Here’s a line for us to start with.

      This line is separated from the one above by two newlines, so it will be a separate paragraph.

      This line is also a separate paragraph, but… This line is only separated by a single newline, so it’s a separate line in the same paragraph.

      \ No newline at end of file diff --git a/blog/distill-example2/index.html b/blog/distill-example2/index.html new file mode 100644 index 00000000..115b8c3e --- /dev/null +++ b/blog/distill-example2/index.html @@ -0,0 +1,100 @@ + Sample Blog Post (HTML version) | ICLR Blogposts 2024

      Sample Blog Post (HTML version)

      Your blog post's abstract. Please add your abstract or summary here and not in the main body of your text. Do not include math/latex or hyperlinks.

      This is a sample blog post written in HTML (while the other sample post is written in Markdown). Authors have the choice to write in HTML or Markdown. While Markdown is easier to write, HTML gives you more control over the layout of your post. Furthermore, Markdown often interacts in unexpected ways with MathJax and other HTML widgets. If you are having trouble with Markdown, try writing in HTML instead.

      Note: please use the table of contents as defined in the front matter rather than the traditional markdown styling.

      Equations

This theme supports rendering beautiful math in inline and display modes using the MathJax 3 engine. You just need to surround your math expression with $$, like $$ E = mc^2 $$. If you leave it inside a paragraph, it will produce an inline expression, just like \(E = mc^2\).

      To use display mode, again surround your expression with $$ and place it as a separate paragraph. Here is an example: $$ \left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right) $$

      Note that MathJax 3 is a major re-write of MathJax that brought a significant improvement to the loading and rendering speed, which is now on par with KaTeX.

      Images and Figures

It's generally a better idea to avoid linking to images hosted elsewhere - links can break and you risk losing important information in your blog post. You can display images from this repository using the following code:

      {% include figure.html path="assets/img/2024-05-07-distill-example/iclr.png" class="img-fluid" %}

      which results in the following image:

      To ensure that there are no namespace conflicts, you must save your asset to your unique directory `/assets/img/2024-05-07-[SUBMISSION NAME]` within your submission.

Please avoid using the direct HTML method of embedding images; they may not be properly resized. Below are some more complex ways to load images (note the different styles of the shapes/shadows):

      A simple, elegant caption looks good between image rows, after each row, or doesn't have to be there at all.

      Interactive Figures

      Here's how you could embed interactive figures that have been exported as HTML files. Note that we will be using plotly for this demo, but anything built off of HTML should work. All that's required is for you to export your figure into HTML format, and make sure that the file exists in the `assets/html/[SUBMISSION NAME]/` directory in this repository's root directory. To embed it into any page, simply insert the following code anywhere into your page.

      {% include [FIGURE_NAME].html %}

      For example, the following code can be used to generate the figure underneath it.

import pandas as pd
import plotly.express as px

df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/earthquakes-23k.csv')

fig = px.density_mapbox(
    df, lat='Latitude', lon='Longitude', z='Magnitude', radius=10,
    center=dict(lat=0, lon=180), zoom=0, mapbox_style="stamen-terrain")
fig.show()

fig.write_html('./assets/html/2024-05-07-distill-example/plotly_demo_1.html')
      And then include it with the following:
      <div class="l-page">
      +  <iframe src="{{ 'assets/html/2024-05-07-distill-example/plotly_demo_1.html' | relative_url }}" frameborder='0' scrolling='no' height="600px" width="100%"></iframe>
      +</div>
      +
      Voila!

      Citations

      Citations are then used in the article body with the <d-cite> tag. The key attribute is a reference to the id provided in the bibliography. The key attribute can take multiple ids, separated by commas.

      The citation is presented inline like this: (a number that displays more information on hover). If you have an appendix, a bibliography is automatically created and populated in it.

      Distill chose a numerical inline citation style to improve readability of citation dense articles and because many of the benefits of longer citations are obviated by displaying more information on hover. However, we consider it good style to mention author last names if you discuss something at length and it fits into the flow well - the authors are human and it's nice for them to have the community associate them with their work.

      Footnotes

Just wrap the text you would like to show up in a footnote in a <d-footnote> tag. The number of the footnote will be automatically generated. This will become a hoverable footnote.

      Code Blocks

      This theme implements a built-in Jekyll feature, the use of Rouge, for syntax highlighting. It supports more than 100 languages. This example is in C++. All you have to do is wrap your code in a liquid tag as follows:

      
{% highlight c++ linenos %}
code code code
{% endhighlight %}
      The keyword `linenos` triggers display of line numbers. You can try toggling it on or off yourself below:
#include <iostream>
#include <string>
using namespace std;

int main(int argc, char const *argv[])
{
    string myString;

    cout << "input a string: ";
    getline(cin, myString);
    int length = myString.length();

    // Copy the string into a dynamically allocated character array.
    char *charArray = new char[length + 1];
    myString.copy(charArray, length);
    charArray[length] = '\0';

    for (int i = 0; i < length; ++i) {
        cout << charArray[i] << " ";
    }

    delete[] charArray;
    return 0;
}

      Diagrams

      This theme supports generating various diagrams from a text description using jekyll-diagrams plugin. Below, we generate a few examples of such diagrams using languages such as mermaid, plantuml, vega-lite, etc.

Note: different diagram-generation packages require external dependencies to be installed on your machine. Also, be mindful that, because of diagram generation, the first time you build your Jekyll website after adding new diagrams will be SLOW. For any other details, please refer to the jekyll-diagrams README.

      Note: This is not supported for local rendering!

      The diagram below was generated by the following code:

{% mermaid %}
sequenceDiagram
    participant John
    participant Alice
    Alice->>John: Hello John, how are you?
    John-->>Alice: Great!
{% endmermaid %}
[Rendered sequence diagram between John and Alice]

      Tweets

      An example of displaying a tweet:

      An example of pulling from a timeline:

      For more details on using the plugin visit: jekyll-twitter-plugin

      Blockquotes

      We do not grow absolutely, chronologically. We grow sometimes in one dimension, and not in another, unevenly. We grow partially. We are relative. We are mature in one realm, childish in another. —Anais Nin

      Layouts

      The main text column is referred to as the body. It's the assumed layout of any direct descendants of the `d-article` element.

      .l-body

      For images you want to display a little larger, try `.l-page`:

      .l-page

      All of these have an outset variant if you want to poke out from the body text a little bit. For instance:

      .l-body-outset

      .l-page-outset

      Occasionally you'll want to use the full browser width. For this, use `.l-screen`. You can also inset the element a little from the edge of the browser by using the inset variant.

      .l-screen

      .l-screen-inset

      The final layout is for marginalia, asides, and footnotes. It does not interrupt the normal flow of `.l-body`-sized text except on mobile screen sizes.

      .l-gutter

      Other Typography?

      Emphasis, aka italics, with the <i></i> tag emphasis.

      Strong emphasis, aka bold, with <b></b> tag bold.

Strikethrough can be accomplished with the <s></s> tag. Scratch this.

1. First ordered list item
2. Another item
   • Unordered sub-list.
3. And another item.

For code, the language can be specified in the class. For example, use language-javascript for JavaScript and language-python for Python code.

      var s = "JavaScript syntax highlighting";
      +  alert(s);
      s = "Python syntax highlighting"
      +  print(s)
      No language indicated, so no syntax highlighting.

A table can be created with the <table> element. Below is an example:

| Tables        | Are           | Cool  |
| ------------- |:-------------:| -----:|
| col 3 is      | right-aligned | $1600 |
| col 2 is      | centered      |   $12 |
| zebra stripes | are neat      |    $1 |

Blockquotes can be defined with the <blockquote> tag.
For attribution in academic contexts, please cite this work as

        PLACEHOLDER FOR ACADEMIC ATTRIBUTION

BibTeX citation

        PLACEHOLDER FOR BIBTEX
      \ No newline at end of file diff --git a/blog/double-descent-demystified/index.html b/blog/double-descent-demystified/index.html new file mode 100644 index 00000000..91fe6e0a --- /dev/null +++ b/blog/double-descent-demystified/index.html @@ -0,0 +1,116 @@ + Double Descent Demystified | ICLR Blogposts 2024

      Double Descent Demystified

      Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle

      Introduction

Machine learning models, while incredibly powerful, can sometimes act unpredictably. One of the most intriguing behaviors is when the test loss suddenly diverges at the interpolation threshold, a phenomenon distinctly observed in double descent.

      Figure 1. Double descent in ordinary linear regression. Three real datasets (California Housing, Diabetes, and WHO Life Expectancy) and one synthetic dataset (Student-Teacher) all exhibit double descent, with test loss spiking at the interpolation threshold. Blue is training error. Orange is test error.

      While significant theoretical work has been done to comprehend why double descent occurs, it can be difficult for a newcomer to gain a general understanding of why the test loss behaves in this manner, and under what conditions one should expect similar misbehavior. In this blog post, when we say double descent, we mean the divergence at the interpolation threshold, and not whether overparameterized models generalize (or fail to generalize).

In this work, we intuitively and quantitatively explain why the test loss diverges at the interpolation threshold, with as much generality and as simple mathematical machinery as possible, but without sacrificing rigor. To accomplish this, we focus on the simplest supervised model - ordinary linear regression - using the most basic linear algebra primitive: the singular value decomposition. We identify three distinct interpretable factors which, when collectively present, trigger the divergence. Through practical experiments on real datasets, we confirm that the models' test losses diverge at the interpolation threshold, and that this divergence vanishes when even one of the three factors is removed. We complement our understanding by offering a geometric picture that reveals linear models perform representation learning when overparameterized, and conclude by shedding light on recent results in nonlinear models concerning superposition.

      Double Descent in Ordinary Linear Regression

      Empirical Evidence of Double Descent in Ordinary Linear Regression

Before studying ordinary linear regression mathematically, does our claim that it exhibits double descent hold empirically? We show that it indeed does, using one synthetic and three real datasets: World Health Organization Life Expectancy, California Housing, and Diabetes; these three real datasets were selected for being easily accessible through sklearn or Kaggle. As shown in Fig 1, all display a spike in test mean squared error at the interpolation threshold. Our simple Python code is publicly available.
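To make this concrete, here is a minimal sketch (not the published code; the dimensions, noise scale, and sweep values are arbitrary illustrative choices) that reproduces the qualitative spike on synthetic student-teacher data:

import numpy as np

rng = np.random.default_rng(0)
D = 32                                        # data dimension = number of parameters P
beta_star = rng.normal(size=D) / np.sqrt(D)   # ideal linear parameters (the "teacher")

X_test = rng.normal(size=(1000, D))
y_test = X_test @ beta_star

for N in [8, 16, 24, 30, 32, 34, 48, 96]:     # sweep past the interpolation threshold N = D
    X = rng.normal(size=(N, D))
    y = X @ beta_star + 0.1 * rng.normal(size=N)   # noisy targets
    # np.linalg.pinv yields the least-squares solution when N > D and the
    # minimum-norm interpolating solution when N <= D.
    beta_hat = np.linalg.pinv(X) @ y
    test_mse = np.mean((X_test @ beta_hat - y_test) ** 2)
    print(f"N={N:3d}  test MSE={test_mse:.3f}")

The test mean squared error spikes near $N = D$ and falls on either side, mirroring Fig 1.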

      Notation and Terminology

      Consider a regression dataset of $N$ training data with features $\vec{x}_n \in \mathbb{R}^D$ and targets $y_n \in \mathbb{R}$. We sometimes use matrix-vector notation to refer to the training data:

      \[X \in \mathbb{R}^{N \times D} \quad , \quad Y \in \mathbb{R}^{N \times 1}.\]

      In ordinary linear regression, we want to learn parameters $\hat{\vec{\beta}} \in \mathbb{R}^{D}$ such that:

      \[\vec{x}_n \cdot \hat{\vec{\beta}} \approx y_n.\]

      We will study three key parameters:

      1. The number of model parameters $P$
      2. The number of training data $N$
      3. The dimensionality of the data $D$

      We say that a model is overparameterized if $N < P$ and underparameterized if $N > P$. The interpolation threshold refers to $N=P$, because when $N\leq P$, the model can perfectly interpolate the training points. Recall that in ordinary linear regression, the number of parameters $P$ equals the dimension $D$ of the covariates. Consequently, rather than thinking about changing the number of parameters $P$, we’ll instead think about changing the number of data points $N$.

      Mathematical Analysis of Ordinary Linear Regression

      To understand under what conditions and why double descent occurs at the interpolation threshold in linear regression, we’ll study the two parameterization regimes. If the regression is underparameterized, we estimate the linear relationship between covariates $\vec{x}_n$ and target $y_n$ by solving the least-squares minimization problem:

      \[\begin{align*} \hat{\vec{\beta}}_{under} \, &:= \, \arg \min_{\vec{\beta}} \frac{1}{N} \sum_n ||\vec{x}_n \cdot \vec{\beta} - y_n||_2^2\\ \, &:= \, \arg \min_{\vec{\beta}} ||X \vec{\beta} - Y ||_2^2. \end{align*}\]

      The solution is the ordinary least squares estimator based on the second moment matrix $X^T X$:

      \[\hat{\vec{\beta}}_{under} = (X^T X)^{-1} X^T Y.\]

      If the model is overparameterized, the optimization problem is ill-posed since we have fewer constraints than parameters. Consequently, we choose a different (constrained) optimization problem that asks for the minimum norm parameters that still perfectly interpolate the training data:

      \[\begin{align*} \hat{\vec{\beta}}_{over} \, &:= \, \arg \min_{\vec{\beta}} ||\vec{\beta}||_2^2\\ \text{s.t.} \quad \quad \forall \, n \in &\{1, ..., N\}, \quad \vec{x}_n \cdot \vec{\beta} = y_n. \end{align*}\]

      We choose this optimization problem because it is the one gradient descent implicitly minimizes. The solution to this optimization problem uses the Gram matrix $X X^T \in \mathbb{R}^{N \times N}$:

      \[\hat{\vec{\beta}}_{over} = X^T (X X^T)^{-1} Y.\]

      One way to see why the Gram matrix appears is via constrained optimization: define the Lagrangian $\mathcal{L}(\vec{\beta}, \vec{\lambda}) \, := \, \frac{1}{2}||\vec{\beta}||_2^2 + \vec{\lambda}^T (Y - X \vec{\beta})$ with Lagrange multipliers $\vec{\lambda} \in \mathbb{R}^N$, then differentiate with respect to the parameters and Lagrange multipliers to obtain the overparameterized solution.
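Filling in the algebra, the stationarity conditions of this Lagrangian give exactly the stated solution:

\[\begin{align*} \nabla_{\vec{\beta}} \mathcal{L} = \vec{\beta} - X^T \vec{\lambda} = 0 \quad &\implies \quad \vec{\beta} = X^T \vec{\lambda}\\ \nabla_{\vec{\lambda}} \mathcal{L} = Y - X \vec{\beta} = 0 \quad &\implies \quad X X^T \vec{\lambda} = Y \quad \implies \quad \vec{\lambda} = (X X^T)^{-1} Y\\ &\implies \quad \hat{\vec{\beta}}_{over} = X^T (X X^T)^{-1} Y. \end{align*}\]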

      After being fit, for test point $\vec{x}_{test}$, the model will make the following predictions:

      \[\hat{y}_{test, under} = \vec{x}_{test} \cdot \hat{\vec{\beta}}_{under} = \vec{x}_{test} \cdot (X^T X)^{-1} X^T Y\] \[\hat{y}_{test, over} = \vec{x}_{test} \cdot \hat{\vec{\beta}}_{over} = \vec{x}_{test} \cdot X^T (X X^T)^{-1} Y.\]

      Hidden in the above equations is an interaction between three quantities that can, when all grow extreme, create a divergence in the test loss!

To reveal the three quantities, we'll rewrite the regression targets by introducing a slightly more detailed notation. Unknown to us, there are some ideal linear parameters $\vec{\beta}^* \in \mathbb{R}^P = \mathbb{R}^D$ that truly minimize the test mean squared error. We can write any regression target as the inner product of the data $\vec{x}_n$ and the ideal parameters $\vec{\beta}^*$, plus an additional error term $e_n$ that is an "uncapturable" residual from the "viewpoint" of the model class:

      \[y_n = \vec{x}_n \cdot \vec{\beta}^* + e_n.\]

      In matrix-vector form, we will equivalently write:

      \[Y = X \vec{\beta}^* + E,\]

      with $E \in \mathbb{R}^{N \times 1}$. To be clear, we are not imposing assumptions. Rather, we are introducing notation to express that there are (unknown) ideal linear parameters, and possibly non-zero errors $E$ that even the ideal model might be unable to capture; these errors $E$ could be random noise or could be fully deterministic patterns that this particular model class cannot capture. Using this new notation, we rewrite the model’s predictions to show how the test datum’s features $\vec{x}_{test}$, training data’s features $X$ and training data’s regression targets $Y$ interact.

      Let $y_{test}^* := \vec{x}_{test} \cdot \vec{\beta}^*$. In the underparameterized regime:

      \[\begin{align*} \hat{y}_{test,under} &= \vec{x}_{test} \cdot \hat{\vec{\beta}}_{under}\\ &=\vec{x}_{test} \cdot (X^T X)^{-1} X^T Y\\ &=\vec{x}_{test} \cdot (X^T X)^{-1} X^T (X \vec{\beta}^* + E)\\ &=\vec{x}_{test} \cdot \vec{\beta}^* + \, \vec{x}_{test} \cdot (X^T X)^{-1} X^T E\\ \hat{y}_{test,under} - y_{test}^* &= \vec{x}_{test} \cdot (X^T X)^{-1} X^T E. \end{align*}\]

      This equation is important, but opaque. To extract the intuition, replace $X$ with its singular value decomposition $X = U S V^T$. Let $R \, := \, \text{rank}(X)$ and let $\sigma_1 > \sigma_2 > … > \sigma_R > 0$ be $X$’s (non-zero) singular values. Let $S^+$ denote the Moore-Penrose inverse; in this context, this means that if a singular value $\sigma_r$ is non-zero, then in $S^+$, it becomes its reciprocal $1/\sigma_r$, but if the singular value is zero, then in $S^+$, it remains $0$. We can decompose the underparameterized prediction error along the orthogonal singular modes:

      \[\begin{align*} \hat{y}_{test, under} - y_{test}^* &= \vec{x}_{test} \cdot V S^{+} U^T E\\ &= \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E). \end{align*}\]

      This equation will be critical! The same term will appear in the overparameterized regime (plus one additional term):

      \[\begin{align*} \hat{y}_{test,over} &= \vec{x}_{test} \cdot \hat{\vec{\beta}}_{over}\\ &= \vec{x}_{test} \cdot X^T (X X^T)^{-1} Y\\ &= \vec{x}_{test} \cdot X^T (X X^T)^{-1} (X \beta^* + E)\\ \hat{y}_{test,over} - y_{test}^* &= \vec{x}_{test} \cdot (X^T (X X^T)^{-1} X - I_D) \beta^* \\ &\quad\quad + \quad \vec{x}_{test} \cdot X^T (X X^T)^{-1} E\\ &= \vec{x}_{test} \cdot (X^T (X X^T)^{-1} X - I_D) \beta^* \\ &\quad\quad + \quad \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E), \end{align*}\]

      where the last step again replaced $X$ with its SVD $X = U S V^T$. Thus, the prediction errors in the overparameterized and underparameterized regimes will be:

      \[\begin{align*} \hat{y}_{test,over} - y_{test}^* &= \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E)\\ &\quad \quad + \quad \vec{x}_{test} \cdot (X^T (X X^T)^{-1} X - I_D) \beta^*\\ \hat{y}_{test,under} - y_{test}^* &= \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E). \end{align*}\]

      The shared term in the two prediction errors causes the divergence:

      \[\begin{equation} \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E). \label{eq:variance} \end{equation}\]

      Eqn. \ref{eq:variance} is critical. It reveals that our test prediction error (and thus, our test squared error!) will depend on an interaction between 3 quantities:

      1. How much the training features vary in each direction. More formally, the inverse (non-zero) singular values of the training features $X$:

        \[\frac{1}{\sigma_r}\]
      2. How much, and in which directions, the test features vary relative to the training features. More formally: how $\vec{x}_{test}$ projects onto $X$’s right singular vectors $V$:

        \[\vec{x}_{test} \cdot \vec{v}_r\]
      3. How well the best possible model in the model class can correlate the variance in the training features with the training regression targets. More formally: how the residuals $E$ of the best possible model in the model class (i.e. insurmountable “errors” from the “perspective” of the model class) project onto $X$’s left singular vectors $U$:

        \[\vec{u}_r \cdot E\]

      We use the term “vary” when discussing $\vec{v}_r$ because $V$ can be related to the empirical (or sample) covariance matrix oftentimes studied in Principal Component Analysis. That is, if the SVD of $X$ is $U S V^T$, then $\frac{1}{N} X^T X = \frac{1}{N} V S^2 V^T$. If the training data are centered (a common preprocessing step), then this is the empirical covariance matrix and its eigenvectors $\vec{v}_1, …, \vec{v}_R$ identify the orthogonal directions of variance. We’ll return to this in Fig 6.

      Why does the test error diverge? When (1) and (3) are both present in the learning problem, the model’s parameters along this singular mode are likely incorrect. When (2) is added to the mix by a test datum $\vec{x}_{test}$ with a large projection along this mode, the model is forced to extrapolate significantly beyond what it saw in the training data, in a direction where the training data had an error-prone relationship between its predictions and the training targets, using parameters that are likely wrong. As a consequence, the test squared error explodes!
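A quick numerical check of Eqn. \ref{eq:variance} (a sketch on random data; all names are illustrative) confirms that the mode-by-mode sum reproduces the underparameterized prediction error exactly:

import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 10                         # underparameterized: N > D
X = rng.normal(size=(N, D))
beta_star = rng.normal(size=D)
E = rng.normal(size=N)                # residuals of the best possible linear model
Y = X @ beta_star + E
x_test = rng.normal(size=D)

# Direct prediction error of the least-squares fit.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
err_direct = x_test @ beta_hat - x_test @ beta_star

# The same error accumulated mode by mode via the SVD X = U S V^T.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
err_svd = sum((x_test @ Vt[r]) * (U[:, r] @ E) / S[r] for r in range(len(S)))

print(np.isclose(err_direct, err_svd))   # True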

      Factor 1 - Low Variance in Training Features

      Figure 2. Required Factor #1: How much training features vary in each direction. The test loss diverges at the interpolation threshold only if training features $X$ contain small (non-zero) singular values. Ablation: By removing all singular values below a cutoff, the divergence at the interpolation threshold is diminished or disappears entirely. Blue is training error. Orange is test error.

The test loss will not diverge if any of the three required factors is absent. What could remove the first factor? Small-but-nonzero singular values would have to be absent from the training features, which we can enforce by setting all singular values below a selected threshold to exactly 0. To test our understanding, we independently ablate all small singular values in the training features: as we run the ordinary linear regression fitting process and sweep the number of training data, we also sweep different singular value cutoffs and remove all singular values of the training features $X$ below the cutoff (Fig 2).
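In code, this ablation amounts to a truncated pseudo-inverse (a sketch; the function name and cutoff are illustrative):

import numpy as np

def fit_with_svd_cutoff(X, Y, cutoff):
    # Minimum-norm least squares after zeroing all singular values of X
    # that fall below `cutoff` (Factor 1 ablation).
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    keep = S >= cutoff
    S_inv = np.zeros_like(S)
    S_inv[keep] = 1.0 / S[keep]
    return Vt.T @ (S_inv * (U.T @ Y))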

      Factor 2 - Test Features in Training Feature Subspace

      Figure 3. Required Factor #2: How much, and in which directions, test features vary relative to training features. The test loss diverges only if the test features $\vec{x}_{test}$ have a large projection onto the training features $X$'s right singular vectors $V$. Ablation: By projecting the test features into the subspace of the leading singular modes, the divergence at the interpolation threshold is diminished or disappears entirely. Blue is training error. Orange is test error.

      Double descent should not occur if the test datum does not vary in different directions than the training features. Specifically, if the test datum lies entirely in the subspace of just a few of the leading singular directions, then the divergence is unlikely to occur. To test our understanding, we force the test data features to lie in the training features subspace: as we run the ordinary linear regression fitting process, and as we sweep the number of training data, we project the test features $\vec{x}_{test}$ onto the subspace spanned by the training features $X$ singular modes (Fig 3).
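The corresponding ablation is an orthogonal projection of the test features onto the leading right singular vectors of the training features (a sketch; `k` controls how many modes are kept):

import numpy as np

def project_onto_leading_modes(X_train, X_test, k):
    # Keep only the components of the test features that lie in the span
    # of X_train's top-k right singular vectors (Factor 2 ablation).
    _, _, Vt = np.linalg.svd(X_train, full_matrices=False)
    V_k = Vt[:k].T
    return X_test @ V_k @ V_k.T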

      Factor 3 - Errors from Best Possible Model

      Figure 4. Required Factor #3: How well the best possible model in the model class can correlate variance in training features with training targets. The test loss diverges only if the residuals $E$ from the best possible model in the model class on the training data have a large projection onto the training features $X$'s left singular vectors $U$. Ablation: By ensuring the true relationship between features and targets is within the model class i.e. linear, the divergence at the interpolation threshold disappears. Blue is training error. Orange is test error.

Double descent should not occur if the best possible model in the model class makes no errors on the training data. For example, if we use a linear model class on data where the true relationship is a noiseless linear relationship, then at the interpolation threshold, we will have $N=P$ data, $P=D$ parameters, our line of best fit will exactly match the true relationship, and no divergence will occur. To test our understanding, we ensure no residual errors exist in the best possible model: we first use the entire dataset to fit a linear model, then replace all target values with the predictions made by the ideal linear model. We then rerun our typical fitting process using these new labels, sweeping the number of training data (Fig 4).
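In code, this ablation is a single relabeling step (a sketch; `X_all` and `Y_all` denote the full dataset's features and targets):

import numpy as np

def relabel_with_ideal_fit(X_all, Y_all):
    # Fit the best-in-class linear model on the entire dataset, then
    # replace all targets with its predictions so that the residuals E
    # vanish exactly (Factor 3 ablation).
    beta_ideal = np.linalg.pinv(X_all) @ Y_all
    return X_all @ beta_ideal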

      As a short aside, what could cause residual errors in the best possible model in the model class?

      1. Noise: If the data is noisy, then the best possible model in the model class will have residual errors.
      2. Model Misspecification: If the data is generated by a nonlinear model, but we use a linear model class (or vice versa), then the best possible model in the model class will have residual errors.
      3. Missing Features: Even if the data is noiseless and our model belongs to the correct model class, but we are missing covariates, then the best possible model in the model class will still have residual errors.

      Divergence at the Interpolation Threshold

      Figure 5. The training features are most likely to obtain their smallest non-zero singular value when approaching the interpolation threshold.

      Why does this divergence happen near the interpolation threshold? The answer is that the first factor (small non-zero singular values in the training features $X$) is likely to occur at the interpolation threshold (Fig 5), but why?

Suppose we’re given a single training datum \(\vec{x}_1\). So long as this datum isn’t exactly zero, that datum varies in a single direction, meaning we gain information about the variance in that direction, but the variance in all orthogonal directions is exactly 0. With the second training datum \(\vec{x}_2\), so long as this datum isn’t exactly zero, that datum varies, but now, some fraction of \(\vec{x}_2\) might have a positive projection along \(\vec{x}_1\); if this happens (and it likely will, since the two vectors are unlikely to be exactly orthogonal), the shared direction gives us more information about the variance in this shared direction, but less information about the second orthogonal direction of variation. Ergo, the training data’s smallest non-zero singular value after 2 samples is probabilistically smaller than after 1 sample. As we approach the interpolation threshold, it becomes increasingly unlikely that each additional datum has large variance in a new direction orthogonal to all previous directions (Fig 5), but as we move beyond the interpolation threshold, the variance in each covariate dimension becomes increasingly clear.

      Figure 6. Geometric intuition for why the smallest non-zero singular value reaches its lowest value near the interpolation threshold. If $1$ datum is observed, variance exists in only 1 direction. If $2$ data are observed, a second axis of variation appears, but because the two data are likely to share some component, the second axis is likely to have less variance than the first. At the interpolation threshold (here, $D=P=N=3$), because the three data are likely to share components along the first two axes, the third axis is likely to have even less variance. Beyond the interpolation threshold, additional data contribute additional variance to these three axes.

      Generalization in Overparameterized Linear Regression

You might be wondering why three of the datasets have low test squared error in the overparameterized regime (California Housing, Diabetes, Student-Teacher) but one (WHO Life Expectancy) does not. Recall that the overparameterized regime’s prediction error \(\hat{y}_{test,over} - y_{test}^*\) has another term not present in the underparameterized regime:

      \[\begin{equation} \vec{x}_{test} \cdot (X^T (X X^T)^{-1} X - I_D) \beta^*. \label{eq:bias} \end{equation}\]

To understand why this bias exists, recall that our goal is to correlate fluctuations in the covariates $\vec{x}$ with fluctuations in the targets $y$. In the overparameterized regime, there are more parameters than data; consequently, for $N$ data points in $D=P$ dimensions, the model can “see” fluctuations in at most $N$ dimensions, but has no “visibility” into the remaining $P-N$ dimensions. This causes information about the optimal linear relationship $\vec{\beta}^*$ to be lost, thereby increasing the overparameterized prediction error.

      Figure 7. Geometry of Generalization in Overparameterized Ordinary Linear Regression. The rowspace of the training features $X$ forms a subspace (here, $\mathbb{R}^1$) of the ambient space (here, $\mathbb{R}^2$). For test datum $\vec{x}_{test}$, the linear model forms an internal representation of the test datum $\hat{\vec{x}}_{test}$ by orthogonally projecting the test datum onto the rowspace via projection matrix $X^T (X X^T)^{-1} X$. The generalization error will then increase commensurate with the inner product between $\hat{\vec{x}}_{test} - \vec{x}_{test}$ and the best possible parameters for the function class $\vec{\beta}^*$. Three different possible $\vec{\beta}^*$ are shown with low (blue), medium (green) and high (red) generalization errors.

      We previously saw that away from the interpolation threshold, the variance is unlikely to affect the discrepancy between the overparameterized model’s predictions and the ideal model’s predictions, meaning most of the discrepancy must therefore emerge from the bias (Eqn. \ref{eq:bias}). This bias term yields an intuitive geometric picture (Fig 7) that also reveals a surprising fact: overparameterized linear regression does representation learning! Specifically, for test datum \(\vec{x}_{test}\), a linear model creates a representation of the test datum \(\hat{\vec{x}}_{test}\) by orthogonally projecting the test datum onto the row space of the training covariates \(X\) via the projection matrix \(X^T (X X^T)^{-1} X\):

      \[\begin{equation*} \hat{\vec{x}}_{test} := X^T (X X^T)^{-1} X \; \vec{x}_{test}. \end{equation*}\]

Seen this way, the bias can be rewritten as the inner product between (1) the difference between the model's representation of the test datum and the test datum itself, and (2) the ideal linear model's parameters:

      \[\begin{equation}\label{eq:overparam_gen_bias} (\hat{\vec{x}}_{test} - \vec{x}_{test}) \cdot \vec{\beta}^*. \end{equation}\]
      Figure 8. Test Error of Overparameterized Models. Large inner product between the ideal model's parameters and the difference between the fit model's internal representations of the test data and the test data creates large test squared error for overparameterized models.

      Intuitively, an overparameterized model will generalize well if the model’s representations capture the essential information necessary for the best model in the model class to perform well (Fig. 8).
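Numerically, the internal representation is just an orthogonal projection onto the rowspace of $X$, and the bias of Eqn. \ref{eq:overparam_gen_bias} is its mismatch with $\vec{\beta}^*$ (a sketch on random overparameterized data):

import numpy as np

rng = np.random.default_rng(3)
N, D = 10, 50                        # overparameterized: N < D
X = rng.normal(size=(N, D))
beta_star = rng.normal(size=D)
x_test = rng.normal(size=D)

P_row = X.T @ np.linalg.inv(X @ X.T) @ X   # projection onto rowspace(X)
x_hat_test = P_row @ x_test                # the model's internal representation
bias = (x_hat_test - x_test) @ beta_star   # the bias term of Eqn. \ref{eq:overparam_gen_bias}
print(bias)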

      Adversarial Test Data and Adversarial Training Data

Our key equation (Eqn. \ref{eq:variance}) also reveals why adversarial test data and adversarial training data exist (at least in linear regression) and how they function mechanistically. For convenience, we repeat the equation:

      \[\begin{equation*} \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E). \end{equation*}\]

Adversarial test examples are a well-known phenomenon in machine learning that we can see in this equation. Adversarial test features correspond to \(\vec{x}_{test} \cdot \vec{v}_r\) being large: one can drastically increase the test squared error by moving the test example in the direction of the right singular vector(s) with the smallest non-zero singular values (Fig 9).

      Figure 9. Adversarial Test Examples in Linear Regression. Adversarial examples arise by pushing $\vec{x}_{test}$ far along the trailing singular modes in the training features $X$. Blue is training error. Orange is test error.

Less well-known are adversarial training data, akin to dataset poisoning or backdoor attacks. Adversarial training examples correspond to \(\vec{u}_r \cdot E\) being large: one can drastically increase the test squared error by moving the training errors $E$ in the direction of the left singular vector(s) with the smallest non-zero singular values. This gives a practical way to construct adversarial training data: training features and targets whose training loss is unchanged from that of unaltered training data, but which cause the test loss to be 1-3 orders of magnitude larger (Fig 10).

      Figure 10. Adversarial Training Dataset in Linear Regression. By manipulating the residual errors $E$ that the best possible model in the model class achieves on the training data, we construct training datasets that increase the test error of the learned model by 1-3 orders of magnitude without affecting its training error. Blue is training error. Orange is test error.
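As a sketch (random data; the scale 10.0 is an arbitrary illustrative choice), both attacks push along the trailing singular modes of the training features:

import numpy as np

rng = np.random.default_rng(2)
N, D = 50, 10
X = rng.normal(size=(N, D))
beta_star = rng.normal(size=D)
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Adversarial test example: a large projection onto the trailing right
# singular vector, which Eqn. \ref{eq:variance} divides by the smallest sigma_r.
x_test_adv = rng.normal(size=D) + 10.0 * Vt[-1]

# Adversarial training targets: residuals aligned with the trailing left
# singular vector lie in the column space of X, so the training loss is
# unchanged, but the learned parameters are corrupted along v_R / sigma_R.
Y_adv = X @ beta_star + 10.0 * U[:, -1]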

      Intuition for Nonlinear Models

Although we mathematically studied ordinary linear regression, the intuition for why the test loss diverges extends to nonlinear models, such as polynomial regression and certain classes of deep neural networks. For a concrete example of how our intuition can shed light on the behavior of nonlinear models, Henighan et al. 2023 recently discovered interesting properties of shallow nonlinear autoencoders: depending on the number of training data, (1) autoencoders either store data points or features, and (2) the test loss increases sharply between these two regimes (Fig. 11).

      Figure 11. Superposition, Memorization and Double Descent in Nonlinear Shallow Autoencoders. Figure from Henighan et al. 2023 .

      Our work sheds light on the results in two ways:

      1. Henighan et al. 2023 write, “It’s interesting to note that we’re observing double descent in the absence of label noise.” Our work clarifies that noise, in the sense of a random quantity, is not necessary to produce double descent. Rather, what is necessary is residual errors from the perspective of the model class ($E$, in our notation). Those errors could be entirely deterministic, such as a nonlinear model attempting to fit a noiseless linear relationship, or other model misspecifications.

2. Henighan et al. 2023 write, “[Our work] suggests a naive mechanistic theory of overfitting and memorization: memorization and overfitting occur when models operate on ‘data point features’ instead of ‘generalizing features’.” Our work hopefully clarifies that this dichotomy is incorrect: when overparameterized, data point features are akin to the Gram matrix $X X^T$, and when underparameterized, generalizing features are akin to the second moment matrix $X^T X$. Data point features can and very often do generalize, and there is a deep connection between the two, i.e., their shared spectra.

      Conclusion

In this work, we intuitively and quantitatively explained why the test loss misbehaves based on three interpretable factors, tested our understanding via ablations, connected our understanding to adversarial test examples and adversarial training datasets, and added conceptual clarity to recent discoveries in nonlinear models.

For attribution in academic contexts, please cite this work as

        PLACEHOLDER FOR ACADEMIC ATTRIBUTION

BibTeX citation

        PLACEHOLDER FOR BIBTEX
      \ No newline at end of file diff --git a/blog/dpi-fsvi/index.html b/blog/dpi-fsvi/index.html new file mode 100644 index 00000000..953cebbb --- /dev/null +++ b/blog/dpi-fsvi/index.html @@ -0,0 +1,36 @@ + Bridging the Data Processing Inequality and Function-Space Variational Inference | ICLR Blogposts 2024

      Bridging the Data Processing Inequality and Function-Space Variational Inference

      This blog post explores the interplay between the Data Processing Inequality (DPI), a cornerstone concept in information theory, and Function-Space Variational Inference (FSVI) within the context of Bayesian deep learning. The DPI governs the transformation and flow of information through stochastic processes, and its unique connection to FSVI is employed to highlight FSVI's focus on Bayesian predictive posteriors over parameter space. Throughout the post, theoretical concepts are intertwined with intuitive explanations and mathematical rigor, offering a comprehensive understanding of these complex topics. The post concludes by bringing together various ideas to explain why the choice of predictive priors (initial probability distributions assumed for model predictions before training) is important for training machine learning models and preventing overfitting. It also discusses the practical implications of these concepts in areas such as continual learning and knowledge distillation. By examining these concepts in depth, the post provides valuable insights for both theory and practice in machine learning, making it an informative resource for researchers and practitioners.

      $$\require{mathtools} \DeclareMathOperator{\opExpectation}{\mathbb{E}} \newcommand{\E}[2]{\opExpectation_{#1} \left [ #2 \right ]} \newcommand{\simpleE}[1]{\opExpectation_{#1}} \newcommand{\MidSymbol}[1][]{\:#1\:} \newcommand{\given}{\MidSymbol[\vert]} \DeclareMathOperator{\opmus}{\mu^*} \newcommand{\IMof}[1]{\opmus[#1]} \DeclareMathOperator{\opInformationContent}{H} \newcommand{\ICof}[1]{\opInformationContent[#1]} \newcommand{\xICof}[1]{\opInformationContent(#1)} \DeclareMathOperator{\opEntropy}{H} \newcommand{\Hof}[1]{\opEntropy[#1]} \newcommand{\xHof}[1]{\opEntropy(#1)} \DeclareMathOperator{\opMI}{I} \newcommand{\MIof}[1]{\opMI[#1]} \DeclareMathOperator{\opTC}{TC} \newcommand{\TCof}[1]{\opTC[#1]} \newcommand{\CrossEntropy}[2]{\opEntropy(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opKale}{D_\mathrm{KL}} \newcommand{\Kale}[2]{\opKale(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opJSD}{D_\mathrm{JSD}} \newcommand{\JSD}[2]{\opJSD(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opp}{p} \newcommand{\pof}[1]{\opp(#1)} \newcommand{\hpof}[1]{\hat{\opp}(#1)} \newcommand{\pcof}[2]{\opp_{#1}(#2)} \newcommand{\hpcof}[2]{\hat\opp_{#1}(#2)} \DeclareMathOperator{\opq}{q} \newcommand{\qof}[1]{\opq(#1)} \newcommand{\hqof}[1]{\hat{\opq}(#1)} \newcommand{\qcof}[2]{\opq_{#1}(#2)} \newcommand{\varHof}[2]{\opEntropy_{#1}[#2]} \newcommand{\xvarHof}[2]{\opEntropy_{#1}(#2)} \newcommand{\varMIof}[2]{\opMI_{#1}[#2]} \newcommand{\w}{\boldsymbol{\theta}} \newcommand{\W}{\boldsymbol{\Theta}} \DeclareMathOperator{\opf}{f} \newcommand{\fof}[1]{\opf(#1)} \newcommand{\Dany}{\mathcal{D}} \newcommand{\y}{y} \newcommand{\Y}{Y} \newcommand{\L}{\boldsymbol{L}} \newcommand{\x}{\boldsymbol{x}} \newcommand{\X}{\boldsymbol{X}} \newcommand{\pdata}[1]{\hpcof{\text{data}}{#1}} \newcommand{\normaldist}[1]{\mathcal{N}(#1)} $$

      Introduction

      In information theory, the data processing inequality (DPI) expresses a fundamental idea: processing data (stochastically) cannot increase information. The DPI provides us with a powerful intuition about what information processing systems can do and what the limitations of data processing are.

      In this blog post, we first study the DPI, developing intuition through vivid examples and detailed proofs—especially the equality case, which is arguably the best way to understand inequalities. We will consider classic forms of the DPI as well as DPIs relating probability distributions more broadly. Then, we explore the intriguing connection between DPI and function-space variational inference (FSVI), a modern Bayesian deep learning technique that focuses on the Bayesian predictive posterior rather than the parameter space. Exploring this connection is important because it can provide new insights into FSVI on a fundamental level. We apply the DPI to recover several interesting results from the literature in a simple form and build intuitions for the relationship between parameter and functional priors.

Most importantly, we consider how FSVI can measure a predictive divergence between the approximate and true posterior which is independent of parameter symmetries. (With parameter symmetries, I refer to different parameters that yield the same predictions, which is very common in over-parameterized neural networks: think of parameter symmetries like different paths leading to the same destination; they might look different but end up at the same predictions. Thanks to ChatGPT for this analogy! 🤗) Explaining this connection is one of the main goals of this article and will help you understand the relationships between DPI, FSVI, and other deep learning methods. As a concrete example and application, we relate FSVI to training with knowledge distillation and label entropy regularization: potentially more meaningful priors than the ones usually used in Bayesian neural networks. (In many papers, an isotropic Gaussian is used because of its simplicity. Indeed, there are better alternatives; see Fortuin et al. (2022) and Fortuin (2022).) This connection highlights the practical relevance of the theoretical concepts discussed in this post and will hopefully inspire the reader to view Bayesian deep learning from a new point of view.

      TL;DR

      The following sections summarize the key takeaways of this blog post. If they don’t make sense, don’t worry: they will after reading this post.

      Data Processing Inequality

The data processing inequality examines how information cannot increase due to processing. In information theory, it is usually stated based on a Markov chain of random variables \(X \rightarrow Y \rightarrow Z\) and their mutual information. We will look at different data processing inequalities that relate different distributions instead of different random variables. However, this blog post in particular looks at the DPI when formulated using Kullback-Leibler (KL) divergences between distributions. I will use “🥬 divergence” in headings to add a bit of color. 😊

Concretely, this KL DPI states that processing data stochastically can only reduce information. More formally, for two distributions \(\qof{\W}\) and \(\pof{\W}\) pushed through a shared stochastic mapping \(Y = \fof{\W}\):

\[\Kale{\qof{\W}}{\pof{\W}} \ge \Kale{\qof{Y}}{\pof{Y}}.\]

That is, the KL divergence between \(\qof{Y}\) and \(\pof{Y}\) cannot be larger than the one between the original \(\qof{\W}\) and \(\pof{\W}\). Intuitively, the stochastic mapping \(\opf\) induces a bottleneck that reduces how well we can distinguish between \(\opp\) and \(\opq\). Finally, we have equality when \(\Kale{\qof{\W \given Y}}{\pof{\W \given Y}} = 0\).

      The paper “Understanding Variational Inference in Function-Space” by Burt et al. (2021) succinctly summarizes the DPI as follows:

      The data processing inequality states that if two random variables are transformed in this way, they cannot become easier to tell apart.

      Function-Space Variational Inference

      Generally, variational inference is a powerful technique for approximating complex Bayesian posteriors with simpler distributions. In its usual form, it optimizes an approximate, variational distribution to match the Bayesian parameter posterior as closely as possible. This way, it transforms the problem of Bayesian inference into an optimization problem.

      However, especially for deep neural networks, obtaining a good approximation of the parameter space can be difficult. One reason is the sheer size of the parameter space. Additionally, the parameterization of a neural network often contains many symmetries—different parameter configurations can lead to the same predictions of the model—that are not taken into account either.

Here, function-space variational inference (FSVI) side-steps some of these restrictions by only requiring that the variational distribution match the Bayesian predictive posterior: whereas regular variational inference regularizes towards a parameter prior, FSVI regularizes towards a data prior. This is especially useful when the parameter prior is not very meaningful, e.g., an isotropic Gaussian prior, which is often used in Bayesian neural networks.

      Background: Information-Theoretic Notation

Information theory deals with the communication of information. (See the excellent "Visual Information Theory" by Chris Olah for a visual introduction to information theory.) In this blog post, we use a unified information-theoretic notation to express various quantities related to probability distributions and their relationships. (It largely follows "A Practical & Unified Notation for Information-Theoretic Quantities in ML".) Here are some key concepts we will use:

      The information content of an event \(x\) is denoted as \(\Hof{x}\) and is defined as \(-\log \pof{x}\). It represents the minimum amount of information needed to describe the occurrence of \(x\) given an underlying probability distribution. In machine learning, this information content is often used as a minimization objective, represented as the negative log-likelihood or cross-entropy when averaged over a dataset.

      The entropy \(\Hof{X}\) of a random variable \(X\) is the expectation of its information content:

      \[\Hof{X} \triangleq \E{\pof{x}}{\Hof{x}} = \E{\pof{x}}{-\log \pof{x}}.\]

      The entropy measures the average amount of information needed to describe the random variable \(X\). It provides a measure of uncertainty or randomness associated with \(X\). We can similarly define the entropy of a conditional distribution \(\Hof{X \given Y}\) and the joint entropy \(\Hof{X, Y}\).

      The mutual information \(\MIof{X;Y}\) between two random variables \(X\) and \(Y\) is a measure of the amount of information that one random variable contains about the other. It is defined as:

      \[\begin{aligned} \MIof{X;Y} & \triangleq \Hof{X} - \Hof{X \given Y} \\ &= \Hof{Y} - \Hof{Y \given X} \\ &= \Hof{X} + \Hof{Y} - \Hof{X, Y}. \end{aligned}\]

      We will also use the Kullback-Leibler divergence \(\Kale{\pof{X}}{\qof{X}}\) and the cross-entropy \(\CrossEntropy{\pof{X}}{\qof{X}}\):

      \[\begin{aligned} \CrossEntropy{\pof{X}}{\qof{X}} & = \E{\pof{x}}{-\log \qof{x}}\\ \Kale{\pof{X}}{\qof{X}} & = \CrossEntropy{\pof{X}}{\qof{X}} - \Hof{X} \end{aligned}\]

The cross-entropy quantifies the average number of bits needed to encode samples drawn from the true distribution \(\pof{X}\) using a different distribution \(\qof{X}\). The Kullback-Leibler divergence is a measure of the difference between two probability distributions and captures the additional bits needed to encode samples from \(\pof{X}\) using \(\qof{X}\) compared to encoding them with the true distribution \(\pof{X}\).
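As a small numeric illustration of these definitions (a sketch; the joint distribution is arbitrary, and we work in nats):

import numpy as np

p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.55]])                 # joint distribution p(x, y)
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)   # marginals

H_X  = -np.sum(p_x * np.log(p_x))                          # entropy H[X]
H_XY = -np.sum(p_xy * np.log(p_xy))                        # joint entropy H[X, Y]
MI   = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))    # mutual information I[X; Y]
print(H_X, H_XY, MI)   # MI >= 0, and I[X; Y] = H[X] + H[Y] - H[X, Y]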

      Now that we have covered the notation, let’s delve into the data processing inequality.

      Data Processing Inequality

      The data processing inequality (DPI) is a fundamental inequality in information theory that states the mutual information between two random variables cannot increase through processing. The original DPI is typically stated for a Markov chain of random variables \(X \rightarrow Y \rightarrow Z\) and relates the mutual information terms as follows:

      \[\MIof{X;Y} \ge \MIof{X;Z}.\]

We can view \(\rightarrow\) as a processing or transition step that maps \(X\) to \(Y\) and \(Y\) to \(Z\), where each mapping can be deterministic or stochastic. The inequality tells us that processing the random variable \(X\) to obtain \(Y\) and further processing \(Y\) to obtain \(Z\) cannot increase the mutual information between \(X\) and \(Z\) compared to the mutual information between \(X\) and \(Y\).

      The following three scenarios illustrate the data processing inequality using different mappings:

      Example: Image Processing Pipeline

      Consider an image processing pipeline with the following steps. Let:

      • \(X\) be the original image data;
      • \(Y\) be a compressed version of the image; and
      • \(Z\) be \(Y\) after adding blur and pixelation.

      In this case, \(X\) has more mutual information with \(Y\) than with \(Z\). The compression reduces information, but the image is still recognizable. However, after the additional processing of blurring and pixelating, the mutual information between \(X\) and \(Z\) is further reduced. This gives an intuitive example of how additional processing on data reduces the mutual information with the original data. Each processing step results in some loss of information.

      Example: Supervised Learning

      Consider a supervised learning pipeline with the following steps. Let

      • \(X\) be the input features;
      • \(Y\) be the intermediate representations learned by the model; and
      • \(Z\) be the model predictions.

      Here, \(X \rightarrow Y \rightarrow Z\) forms a Markov chain. The data processing inequality tells us that the mutual information between the inputs \(X\) and predictions \(Z\) cannot exceed the mutual information between the inputs \(X\) and intermediate representations \(Y\):

      \[\MIof{X; Y} \geq \MIof{X; Z}.\]

      This makes intuitive sense—the intermediate representations \(Y\) are obtained by processing the raw inputs \(X\), so they cannot contain more information about \(X\) than \(X\) itself. The predictions \(Z\) are obtained by further processing \(Y\), so additional information may be lost, reducing the mutual information with the original inputs \(X\).

      As a more concrete example, consider an image classification model. Let:

      • \(X\) be the input images;
      • \(Y\) be the activations of the convolutional layers; and
      • \(Z\) be predicted image labels.

The convolutional layers will extract features from the input images, but cannot extract more information than is present in the original images. The predicted labels are obtained by further processing these convolutional features, so they may lose some fine-grained information about the original inputs.

      Example: Autoencoders

      An autoencoder compresses the input \(X\) into a latent code \(Y\) and then tries to reconstruct the original input from the code, producing \(\hat{X}\). Let:

      • \(X\) be the input;
      • \(Y\) be the latent code; and
      • \(\hat{X}\) be the reconstruction;

      The data processing inequality tells us again:

      \[\MIof{X; Y} \geq \MIof{X; \hat{X}}.\]

      The latent code \(Y\) is obtained by compressing \(X\), so cannot contain more information. The reconstruction \(\hat{X}\) tries to recover \(X\) from \(Y\), but some information may be lost, reducing the mutual information with \(X\).

      Intuitively, autoencoders try to preserve as much mutual information between inputs \(X\) and reconstructions \(\hat{X}\) as possible by learning latent representations \(Y\) that compress inputs without losing too much information. The data processing inequality quantifies this information bottleneck.

      Proof of the DPI

      The proof is simple and connects the DPI to another important inequality.

First, we note that the Markov chain implies the following factorization of the joint distribution:

      \[\pof{x, y, z} = \pof{x} \pof{y \given x} \pof{z \given y}.\]

      Using this factorization, we can express the mutual information terms:

      \[\begin{aligned} \MIof{X;Y} &= \Hof{X} - \Hof{X \given Y} \\ &\ge \Hof{X} - \Hof{X \given Z} \\ &= \MIof{X;Z}. \end{aligned}\]

      This relies on \(\Hof{X \given Y} \le \Hof{X \given Z}\). Why is this true?

      We have the following chain of inequalities:

      \[\Hof{X \given Y} = \underbrace{\MIof{X ; Z \given Y}}_{\overset{(1)}{=}0} + \Hof{X \given Y, Z} \overset{(2)}{\le} \Hof{X \given Z}.\]

      (1) follows from the Markov chain property: when \(X \rightarrow Y \rightarrow Z\), \(X\) does not depend on \(Z\) at all when conditioned on \(Y\); and (2) follows from the fact that conditioning reduces entropy, i.e. \(\Hof{A \given B} \le \Hof{A}.\)

      The equality gap \(\Hof{X \given Y, Z} - \Hof{X \given Z}\) corresponds to the mutual information \(\MIof{X ; Y \given Z}\). This mutual information measures the extra information about \(X\) contained in \(Y\) that is not already conveyed by \(Z\). It is zero if and only if \(X \rightarrow Z \rightarrow Y\) forms a Markov chain, indicating that \(Z\) is a sufficient statistic for \(X\).

      Proof of (2) "Conditioning Reduces Entropy":

      We can easily show that conditioning reduces entropy by using the non-negative property of the mutual information:

      \(\begin{aligned} 0 &\le \Kale{\pof{X,Y}}{\pof{X}\pof{Y}} \\ &= \MIof{X;Y} \\ &= \Hof{X} - \Hof{X \given Y} \\ \implies \Hof{X \given Y} &\le \Hof{X}. \end{aligned}\)

      The fact that conditioning reduces entropy, \(\Hof{X} \ge \Hof{X \given Y}\), is an important property by itself and is reminiscent of the data processing inequality. The conditional entropy \(\Hof{X \given Y}\) quantifies the remaining uncertainty about \(X\) after observing \(Y\). If \(X\) and \(Y\) are independent, then \(\Hof{X} = \Hof{X \given Y}\), as knowing \(Y\) does not provide any information about \(X\). On the other hand, if \(Y\) completely determines \(X\), then \(\Hof{X \given Y} = 0\), as there is no remaining uncertainty about \(X\) once \(Y\) is known. In general, conditioning can only reduce the uncertainty about \(X\), but it does not necessarily reduce it to zero.

      Let’s move on and consider the KL data processing inequality.

      🥬 Data Processing Inequality

      A similar DPI can be expressed for different distributions \(\pof{x}\) and \(\qof{x}\) of the same random variable and the KL divergence between them. This DPI states that if we evolve two distributions using the same transition function, they cannot become less similar. The KL divergence is sometimes also referred to as “relative entropy”, so we could also call this the “relative data processing inequality”.

      This can be formalized for distributions \(\pof{x}\) and \(\qof{x}\) and a stochastic transition function \(X \overset{\fof{y \given x}}{\longrightarrow} Y\). Here, we use that such a stochastic mapping \(Y = \fof{X}\) is equivalent to having a probability (density) \(\fof{y \given x}\):

      \[\Kale{\pof{X}}{\qof{X}} \ge \Kale{\pof{Y}}{\qof{Y}},\]

      where \(\pof{y \given x} = \fof{y \given x} = \qof{y \given x}\). The marginals after the transition are \(\pof{y} = \E{\pof{x}}{\fof{y \given x}}\) and \(\qof{y} = \E{\qof{x}}{\fof{y \given x}}\), so more explicitly:

      \[\Kale{\pof{X}}{\qof{X}} \ge \Kale{\E{\pof{x}}{\fof{Y \given x}}}{\E{\qof{x}}{\fof{Y \given x}}}.\]

In their book Elements of Information Theory, Cover and Thomas describe this as “relative entropy never increases” and relate it to the second law of thermodynamics.

      Example: Comparing Image Distributions

      As an example, let:

      • \(\pof{x}\) be the true distribution of images in a dataset;
      • \(\qof{x}\) be a generative model that tries to mimic \(\pof{x}\); and
      • \(\fof{y \given x}\) be a function that thresholds images \(x\) into bilevel black and white images \(y\).

      Then \(\pof{y}\) and \(\qof{y}\) will be more difficult to distinguish after the thresholding operation than \(\pof{x}\) and \(\qof{x}\). Converting to black and white images has lost information that could help distinguish the real and generated distributions.

      This provides some intuition for why the KL divergence between distributions decreases under a shared stochastic mapping, as formalized by the KL data processing inequality. Processing through \(\fof{y \given x}\) makes the distributions harder to tell apart.
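This is easy to check numerically for discrete distributions pushed through a shared transition matrix (a sketch; the distributions and channel below are arbitrary):

import numpy as np

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

p = np.array([0.7, 0.2, 0.1])     # p(x)
q = np.array([0.1, 0.3, 0.6])     # q(x)
f = np.array([[0.9, 0.1],         # rows are f(y | x), the shared channel
              [0.5, 0.5],
              [0.2, 0.8]])

p_y, q_y = p @ f, q @ f           # marginals after the transition
print(kl(p, q), kl(p_y, q_y))     # the second is never larger than the first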

      Counter-Example: Bayesian Inference

It might be tempting to think that this data processing inequality also applies to Bayesian inference, that is, updating the model parameters based on new evidence. Then, we could argue that if two agents start with different prior beliefs but update based on the same evidence, their posterior beliefs will become more similar. However, this intuition is flawed: the data processing inequality does not apply to Bayesian inference.

      Let’s walk through why. Consider:

      • \(\pof{\w}\), an agent’s prior belief;
      • \(\qof{\w}\), another agent’s different prior;
      • \(\pof{\w\given x}\), the posterior after observing data \(x\); and
      • \(\qof{\w\given x}\), the other agent’s posterior.

      The priors \(\pof{\w}\) and \(\qof{\w}\) may have large divergence, representing very different initial beliefs. However, when conditioning on the same data \(x\), the KL divergence between \(\pof{\w \given x}\) and \(\qof{\w \given x}\) could increase or decrease—the data processing inequality does not give us any guarantee.

      This is because \(\pof{\w}\) and \(\qof{\w}\) are not evolving under the same stochastic mapping. Rather, each prior is mapped to its respective posterior via Bayes’ rule, which operates differently on \(\opp\) and \(\opq\):

      \[\begin{aligned} \pof{\w \given x} &= \frac{\pof{x \given \w}}{\pof{x}} \, \pof{\w}\\ \qof{\w \given x} &= \frac{\qof{x \given \w}}{\qof{x}} \, \qof{\w}. \end{aligned}\]

Even assuming that both agents have the same internal model, that is, they use the same likelihood \(\pof{x \given \w} = \qof{x \given \w}\), the priors \(\pof{\w}\) and \(\qof{\w}\) will still influence the posterior distributions differently because they lead to different evidence terms \(\pof{x}\) and \(\qof{x}\):

      \[\begin{aligned} \pof{x} &= \E{\pof{\w}}{\pof{x \given \w}}\\ \qof{x} &= \E{\qof{\w}}{\qof{x \given \w}}. \end{aligned}\]

Thus, the correct intuition is that observing the same data \(x\) does not necessarily bring the posterior beliefs closer together—they depend on the interplay between the specific priors and likelihoods. The data processing inequality does not directly apply to this Bayesian updating scenario:

\[\Kale{\qof{\W}}{\pof{\W}} {\color{red}{\not\ge}} \Kale{\qof{\W \given \mathcal{D}}}{\pof{\W \given \mathcal{D}}}.\]
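A tiny numeric example (with made-up priors and likelihood) shows that the posterior KL divergence can indeed exceed the prior KL divergence:

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Two different priors over w in {0, 1}.
p_w = np.array([0.99, 0.01])
q_w = np.array([0.98, 0.02])

# Shared likelihood p(x = 1 | w): x = 1 is near-conclusive evidence for w = 1.
lik_x1 = np.array([0.01, 1.0])

# Posteriors after observing x = 1, via Bayes' rule with the
# respective (different!) evidence terms.
p_post = p_w * lik_x1 / np.sum(p_w * lik_x1)
q_post = q_w * lik_x1 / np.sum(q_w * lik_x1)

print(kl(q_w, p_w), kl(q_post, p_post))  # ~0.004 < ~0.058: the KL increased
```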

This counterexample highlights the importance of precisely understanding the assumptions underlying conceptual principles like the DPI. While the DPI provides insight into information dynamics in many cases, it does not universally apply, as exemplified here by Bayesian updating under different priors.

As we currently seem to experience a world of increasing polarization, this counterexample might also serve as a reminder that different priors can lead to different beliefs, even when observing the same evidence. This is a fundamental aspect of Bayesian inference and the scientific method.

      Proofs of the 🥬 DPI

We will prove this inequality in two different ways. First, we will develop a “brute-force” proof, and then we will look at a more elegant proof that follows Cover and Thomas. Importantly, we will also consider the equality case in detail.

      Brute-force Proof

If \(\opp\) does not have support in \(\opq\), the inequality is trivially true because then \(\Kale{\pof{X}}{\qof{X}}=\infty\).

      Thus, let’s now assume that \(\opp\) has support in \(\opq\). Then, we can brute-force using the definitions, starting from the cross-entropy:

      \[\begin{aligned} \CrossEntropy{\pof{Y}}{\qof{Y}}&=\CrossEntropy{\pof{Y}}{\E{\qof{x}}{\pof{Y \given x}}}\\ &=\CrossEntropy{\pof{Y}}{\E{\qof{x}}{\frac{\pof{x \given Y}\pof{Y}}{\pof{x}}}}\\ &=\CrossEntropy{\pof{Y}}{\E{\pof{x \given Y}}{\frac{\qof{x}}{\pof{x}}}}+\CrossEntropy{\pof{Y}}{\pof{Y}}\\ &\overset{(1)}{=}\CrossEntropy{\pof{Y}}{\E{\pof{x \given Y}}{\frac{\qof{x}}{\pof{x}}}}+\xHof{\pof{Y}}\\ &\overset{(2)}{\le}\CrossEntropy{\pof{X, Y}}{\frac{\qof{X}}{\pof{X}}}+\xHof{\pof{Y}}\\ &\overset{(3)}{=}\CrossEntropy{\pof{X}}{\frac{\qof{X}}{\pof{X}}}+\xHof{\pof{Y}}\\ &\overset{(4)}{=}\Kale{\pof{X}}{\qof{X}}+\xHof{\pof{Y}}\\ \iff \Kale{\pof{Y}}{\qof{Y}}&\le\Kale{\pof{X}}{\qof{X}}, \end{aligned}\]

where we have used (1) that the cross-entropy of a distribution with itself is just the entropy, (2) that the cross-entropy is convex and we can apply Jensen’s inequality, (3) that the RHS of the cross-entropy does not depend on \(Y\) and we can trivially marginalize it out, and (4) that the definition of the Kullback-Leibler divergence is equivalent to an (unnormalized) cross-entropy over a fraction.

      This makes it difficult to extract the case for equality, however.

      Equality Case

We have only one inequality in the above proof, and it stems from applying Jensen’s inequality. Recall the equality case of Jensen’s inequality: for a strictly convex function, we have equality exactly when the argument inside the expectation is (almost surely) constant.

For (2), this is sadly slightly more complex than it might seem at first glance. Let’s unwrap the term:

      \[\CrossEntropy{\pof{Y}}{\E{\pof{x \given Y}}{\frac{\qof{x}}{\pof{x}}}} = \E{\pof{y}}{-\log \E{\pof{x \given y}}{\frac{\qof{x}}{\pof{x}}}}.\]

We take an expectation over \(\pof{y}\), so to characterize equality we need to look at (almost) every \(y\) with \(\pof{y} \not= 0\) separately. \(-\log x\) is strictly convex—and thus not linear—so we need the ratio \(\frac{\qof{x}}{\pof{x}}\) to be constant over the support of \(\pof{x \given y}\) for any fixed \(y\) with \(\pof{y} \not= 0\)—only then do we have equality in Jensen’s inequality.

In the following, I will limit myself to the discrete case to avoid having to deal with measure theory. (I currently don’t have a good ‘toolbox’ to express simple ideas cleanly in measure theory; I’m working on it.) To obtain equality, for all \(y\) with \(\pof{y} \not= 0\) (i.e. we have support) and for all \(x_1, x_2\) with \(\pof{x_1 \given y}, \pof{x_2 \given y} \not= 0\), we need \(\frac{\qof{x_1}}{\pof{x_1}} = \frac{\qof{x_2}}{\pof{x_2}}\). Equivalently (exercise for the reader: why is \(\pof{x_1} \not= 0\)?):

      \[\begin{aligned} \frac{\qof{x_1}}{\pof{x_1}} &= \frac{\qof{x_2}}{\pof{x_2}} \\ \iff \qof{x_1} &= \frac{\qof{x_2}}{\pof{x_2}} \, \pof{x_1} \\ \end{aligned}\]

This means that \(\qof{x} = C_y \pof{x}\) piecewise for all \(x\) for which \(\pof{x \given y} \not= 0\), for some fixed \(y\) with \(\pof{y} \not= 0\). That is, if we keep \(y\) fixed, all the \(x\) with \(\pof{x \given y} \not= 0\) share the same constant factor \(C_y\). Then for all \(y\) with \(\pof{y} \not= 0\), we have equality, and thus overall equality in (2).

      If for any \(x\) there are multiple \(y\), e.g. \(y_1, y_2\) for which \(\pof{x \given y} \not= 0\), then we have \(C_{y_1} = C_{y_2}\).

As the simplest example, if this is the case for all \(y\), then all the \(C_y\) are equal, and normalization forces \(C_y = 1\).

As a side note, this is a great reason why we often require full support for distributions: we can then avoid these piecewise constant factors (and the headaches they might cause).

      Simpler Elegant Proof

Cover and Thomas provide a beautifully simple proof via the chain rule of the KL divergence: expanding \(\Kale{\pof{X, Y}}{\qof{X, Y}}\) in both directions gives

      \[\Kale{\pof{X}}{\qof{X}} + \underbrace{\Kale{\pof{Y \given X}}{\qof{Y \given X}}}_{=0} = \Kale{\pof{Y}}{\qof{Y}} + \Kale{\pof{X \given Y}}{\qof{X \given Y}}.\]

      The conditional term on the left vanishes because both distributions share the same forward transition function \(\fof{y \given x}\), and the conditional term on the right is non-negative, so \(\Kale{\pof{X}}{\qof{X}} \ge \Kale{\pof{Y}}{\qof{Y}}\), with equality if and only if \(\Kale{\pof{X \given Y}}{\qof{X \given Y}} = 0\).

      What does this mean? Whereas \(\fof{y \given x}\) is the ‘forward’ transition function, \(\pof{x \given y}\) and \(\qof{x \given y}\) are the ‘backward’ transition functions. We only have equality when the backward transition functions are equal (almost everywhere).

      The statement on equality is not very informative yet though, so we have to put in a bit more work. Again, this is written for the discrete case.

This time we explicitly use Bayes’ rule to connect the forward and backward transition functions. First, we fix \(y\) such that \(\pof{y} \not= 0\) (i.e. \(y\) is in the support of \(\pof{y}\)); then \(\qof{y} \not= 0\) follows. We have:

      \[\begin{aligned} \pof{x \given y} &= \qof{x \given y} \\ \overset{\text{ass. }\pof{y} \not= 0}{\iff} \frac{\fof{y \given x}\pof{x}}{\pof{y}} &= \frac{\fof{y \given x}\qof{x}}{\qof{y}} \\ \overset{\text{ass. }\fof{y \given x}\not= 0}{\iff} \frac{\pof{x}}{\pof{y}} &= \frac{\qof{x}}{\qof{y}} \\ \iff \pof{x} &= \frac{\pof{y}}{\qof{y}} \, \qof{x}. \end{aligned}\]

      For a given \(y\) with \(\pof{y} \not=0\), for the equality case, we see that for all \(x\) with \(\fof{y \given x} \not= 0\), \(\pof{x}\) and \(\qof{x}\) have to be coupled via piecewise constant factors.

As another example, if \(\fof{y \given x} \not= 0\) for all possible \(x\) (i.e. full support), then for the equality case we have \(\pof{x} = \qof{x}\).

      Compared to the previous equality case, we went a bit deeper and rewrote the conditions to consider the ratios between \(x\) and \(y\). Note we could have shown the same thing in the “brute-force” proof, too.

Altogether, we have seen that both \(x\) and \(y\) are modulated by the same constant factor between \(\pof{\cdot}\) and \(\qof{\cdot}\). Essentially, this tells us that we could split our support into unconnected sub-domains and examine each individually for the equality case.

      Overall Statement

      We have the following overall statement:

(\(\pof{x} \ll \qof{x}\) means that \(\pof{x} > 0\) implies \(\qof{x} > 0\), so the KL divergence is not \(\infty\).) But more precisely, for \(\pof{x} \ll \qof{x}\), we have equality when:

      \[\forall y, \pof{y} \not= 0 \exists C_y \in \mathbb{R}_{> 0} \forall x, \fof{y \given x}\not=0\colon \pof{x} = C_y \, \qof{x}.\]

      Other Data Processing Inequalities

      Now, we can use these ideas to derive a few additional results and even close the circle to the original data processing inequality.

      Jensen-Shannon Divergence

      The KL divergence is not a metric: the triangle inequality does not hold, and it is not symmetric.

      However, we can symmetrize it to obtain the Jensen-Shannon divergence (JSD). The JSD is defined as the mean of the two KL divergences of the two distributions from their average. In essence, it makes the KL divergence symmetric:

      \[\begin{aligned} \fof{x} &= \frac{\pof{x} + \qof{x}}{2}\\ \JSD{\pof{x}}{\qof{x}} &= \frac{1}{2} \Kale{\pof{x}}{\fof{x}} + \frac{1}{2} \Kale{\qof{x}}{\fof{x}}. \end{aligned}\]

Similar approaches can be used to “symmetrize” other concepts, for example matrices: \(\frac{1}{2} A + \frac{1}{2} A^T\) is symmetric by construction for any square matrix \(A\).

The JSD itself is still not a metric, but its square root is one: the Jensen-Shannon distance is symmetric and satisfies the triangle inequality.
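As a minimal sketch in Python (with two arbitrary discrete distributions):

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    m = 0.5 * (p + q)  # the average distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.3])
q = np.array([0.3, 0.7])

js_distance = np.sqrt(jsd(p, q))  # the Jensen-Shannon distance, a metric
```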

      JSD-DPI

      We can also obtain a data processing inequality for the Jensen-Shannon divergence and the Jensen-Shannon distance:

      The proof uses the KL data processing inequality:

      \[\begin{aligned} \JSD{\pof{X}}{\qof{X}} &= \frac{1}{2} \Kale{\pof{X}}{\fof{X}} + \frac{1}{2} \Kale{\qof{X}}{\fof{X}}\\ &\ge \frac{1}{2} \Kale{\pof{Y}}{\fof{Y}} + \frac{1}{2} \Kale{\qof{Y}}{\fof{Y}}\\ &= \JSD{\pof{Y}}{\qof{Y}}. \end{aligned}\]

      We verify \(\fof{y} = \frac{\pof{y} + \qof{y}}{2}\) is the average of \(\pof{y}\) and \(\qof{y}\):

      \[\begin{aligned} \fof{y} &= \E{\fof{x}}{\fof{y \given x}}\\ &= \E{\frac{\pof{x}+\qof{x}}{2}}{\fof{y \given x}}\\ &= \frac{1}{2} \E{\pof{x}}{\fof{y \given x}} + \frac{1}{2} \E{\qof{x}}{\fof{y \given x}}\\ &= \frac{1}{2} \pof{y} + \frac{1}{2} \qof{y}. \end{aligned}\]

      Finally, \(\pof{x}, \qof{x} \ll \fof{x}\), and the equality condition of the KL data processing inequality gives us:

      \[\begin{aligned} &\Kale{\pof{X \given Y}}{\fof{X \given Y}} = 0 &\\ \land \quad &\Kale{\qof{X \given Y}}{\fof{X \given Y}} = 0 &\\ \iff &\pof{x \given y} = \fof{x \given y} \land \qof{x \given y} = \fof{x \given y}& \forall x,y \\ \iff &\pof{x \given y} = \qof{x \given y}& \forall x,y. \end{aligned}\]

      Mutual Information

      The JSD can also be expressed as a mutual information. For \(\begin{aligned} Z &\sim \mathrm{Bernoulli}(\frac{1}{2}) = \fof{Z} \\ X \given Z = 0 &\sim \pof{x}\\ X \given Z = 1 &\sim \qof{x}, \end{aligned}\)

      we have:

      \[\JSD{\pof{X}}{\qof{X}} = \MIof{X;Z}.\]

      This follows from rewriting the mutual information as a KL divergence:

      \[\begin{aligned} \MIof{X;Z} &= \Kale{\fof{X \given Z}}{\fof{X}}\\ &= \E{\fof{z}} {\Kale{\fof{X \given Z = z}}{\fof{X}}}\\ &= \frac{1}{2} \Kale{\pof{x}}{\fof{x}} + \frac{1}{2} \Kale{\qof{x}}{\fof{x}}\\ &= \JSD{\pof{X}}{\qof{X}}. \end{aligned}\]

      We can generalize this to the Markov chain \(Z \rightarrow X \rightarrow Y\) with \(\fof{z, x, y} = \fof{z} \fof{x \given z} \fof{y \given x}\) for any distribution \(\fof{z}\):

      \[\begin{aligned} \MIof{X;Z} &= \Kale{\fof{X \given Z}}{\fof{X}}\\ &= \E{\fof{z}} {\Kale{\fof{X \given z}}{\fof{X}}}\\ &\overset{(1)}{\ge} \E{\fof{z}} {\Kale{\fof{Y \given z}}{\fof{Y}}}\\ &= \Kale{\fof{Y \given Z}}{\fof{Y}}\\ &= \MIof{Y;Z}, \end{aligned}\]

      where \((1)\) follows from the KL data processing inequality.

This is just the data processing inequality we presented initially. We have come full circle!

      The equality gap (Jensen gap) is \(\Kale{\fof{X \given Y, Z}}{\fof{X \given Y}}\), and we have equality when:

      \[\begin{aligned} \Kale{\fof{X \given Y, Z}}{\fof{X \given Y}} &= 0\\ \iff \MIof{X;Z \given Y} &= 0. \end{aligned}\]

      This is exactly when \(X\) is independent of \(Z\) given \(Y\). (\(Y\) is a sufficient statistic in that case.)

      Function-Space Variational Inference

      So far we’ve explored the foundational aspects of the data processing inequality (DPI) and its extended forms, in particular the KL data processing inequality. Through detailed derivations and intuitive examples, we’ve demonstrated how these inequalities can be applied, emphasizing their significance and limitations. Specifically, we’ve shown how the KL data processing inequality relates to the reduction in information as data is processed. The examples and counterexample have hopefully demonstrated the nuances of applying these inequalities in different contexts.

      This exploration sets the stage for diving into function-space variational inference and building up a robust understanding of it, leveraging the insights gained about the DPI and its implications in Bayesian deep learning.

      Problem Setting & Notation

In the following, we will consider a classification task with cross-entropy loss, and we will use the following random variables and distributions:

      • \(\y\) is the label,
      • \(\x\) is the input,
      • \(\qof{\y \given \x}\) is the predictive distribution we want to learn,
      • \(\pdata{\y \given \x}\) is the data distribution,
      • \(\Dany\) is the (training) dataset, and
      • \(C\) is the number of classes.

      The probabilistic model is:

      \[\pof{\y, \w \given \x} = \pof{\y \given \x, \w} \, \pof{\w}.\]

      As before, I use upper-case letters for random variables, which we take an expectation over, e.g. in the KL divergence, and lower-case letters when I’m referring to specific observations or values that could be substituted (with the exception of \(\Dany\)).

      Chain Rule of the 🥬 Divergence & DPI

      An important property of the KL divergence is the chain rule:

      \[\begin{aligned} &\Kale{\qof{\Y_n,...,\Y_1}}{\pof{\Y_n,...,\Y_1}} \\ &\quad = \sum_{i=1}^n \Kale{\qof{\Y_i \given \Y_{i-1}, ..., \Y_1}}{\pof{\Y_i \given \Y_{i-1}, ..., \Y_1}}. \end{aligned}\]

      The chain rule yields a chain inequality for the DPI as well:

      \[\begin{aligned} \Kale{\qof{\W}}{\pof{\W}} &\ge \Kale{\qof{\Y_n,...,\Y_1}}{\pof{\Y_n,...,\Y_1}}\\ &\ge \Kale{\qof{\Y_{n-1},...,\Y_1}}{\pof{\Y_{n-1},...,\Y_1}}\\ &\ge \Kale{\qof{\Y_1}}{\pof{\Y_1}}, \end{aligned}\]

where the first step applies the KL DPI and the subsequent steps follow from the chain rule, since each dropped conditional KL divergence term is non-negative.

      Deriving the Functional ELBO

      The DPI has an intriguing connection to FSVI. Let’s say we want to approximate a Bayesian posterior \(\pof{\w \given \Dany}\) with a variational distribution \(\qof{\w}\). In standard VI, we would minimize \(\Kale{\qof{\W}}{\pof{\W \given \Dany}}\) to match the variational distribution to the Bayesian posterior. Specifically:

\[\begin{aligned} &\Kale{\qof{\W}}{\pof{\W \given \Dany}} \\ &\quad = \underbrace{\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\W}}{\pof{\W}}}_{\text{Evidence}\ \text{Bound}} + \log \pof{\Dany} \ge 0 \\ &\iff \underbrace{-\log \pof{\Dany}}_{=\xHof{\pof{\Dany}}} \le \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\W}}{\pof{\W}}. \end{aligned}\]

This is an information-theoretic evidence (upper) bound on the information content \(-\log \pof{\Dany}\) of the data \(\Dany\) under the variational distribution \(\qof{\w}\), which we can minimize as an objective to approximate \(\pof{\w \given \Dany}\) via \(\qof{\w}\).

      In more probability-theory inspired literature, the negative of this bound is called the evidence lower bound (ELBO) and is maximized.

Both the ELBO and the information-theoretic evidence upper bound are equivalent, and we can use either objective, but the information-theoretic perspective is obviously superior 🙃 I’ll refer to this as the evidence bound from now on.
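As a minimal sketch of the evidence bound, consider a hypothetical toy model with prior \(\w \sim \mathcal{N}(0, 1)\) and likelihood \(x_i \given \w \sim \mathcal{N}(\w, 1)\), a Gaussian variational distribution \(\qof{\w}\), a closed-form KL term, and a Monte-Carlo estimate of the expected negative log-likelihood (all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data from the toy model.
data = rng.normal(loc=1.0, scale=1.0, size=20)

def evidence_bound(mu_q, sigma_q, n_samples=10_000):
    """Estimate E_q[-log p(D | w)] + KL(q(W) || p(W))."""
    w = rng.normal(mu_q, sigma_q, size=n_samples)
    # Monte-Carlo estimate of the expected negative log-likelihood under q(w).
    nll = 0.5 * ((data[None, :] - w[:, None]) ** 2 + np.log(2 * np.pi)).sum(axis=1)
    # Closed-form KL between q = N(mu_q, sigma_q^2) and the prior p = N(0, 1).
    kl = np.log(1.0 / sigma_q) + (sigma_q**2 + mu_q**2 - 1.0) / 2.0
    return nll.mean() + kl

# Minimizing this bound over (mu_q, sigma_q) matches q(w) to the posterior.
print(evidence_bound(mu_q=0.9, sigma_q=0.25))
```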

      In FSVI (with a caveat I detail below), we apply the DPI to the prior KL divergence term and obtain a “functional” version of the evidence bound:

      \[\begin{aligned} \Kale{\qof{\W}}{\pof{\W}} \ge \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}, \end{aligned}\]

      where \(\Y... \given \x...\) are (finite or infinite) sets of samples. That is, we do not only optimize marginal distributions but also joint distributions.

      The resulting objective:

      \[\begin{aligned} \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}} \end{aligned}\]

      is equal to the (negative) functional ELBO (fELBO) in “Functional variational Bayesian neural networks” by Sun et al. (2019)—with caveats that we discuss below.

      Choosing the “Coreset” \(\x...\)

      One important detail is the question of how to choose the \(\x...\):

      Ideally, we want to choose them such that the DPI inequality is as tight as possible.

Given the chain inequality, it is clear that the larger the set \(\x...\), the tighter the inequality. Hence, if we could choose an infinite set of points well, we might obtain the tightest possible inequality. However, this might not be tractable, and in practice it often is not.

Some works take a supremum over finite subsets of a certain size, essentially building a core-set as an approximation (Rudner et al., 2022a/b); others take an expectation over finite sets of input samples (Sun et al., 2019), which does not necessarily yield the tightest inequality but provides an unbiased estimate; while yet other works focus on finite datasets for which all points can be taken into account (Klarner et al., 2023).

      We will discuss the tightness of the inequality and the implications in the data limit below.

Focusing on the most important aspect of FSVI, we observe that it matches distributions on predictions rather than on parameters.

      Application to Continual Learning

      When we directly optimize the KL divergence on a finite input dataset, for example, we align \(\opq\) with the prior of \(\opp\) where it matters most: on the predictions of the observed data.

      This is of particular interest in continual learning, where the prior for the next task is chosen to be the posterior from the previous task. In this case, the functional ELBO can be used to approximate the posterior of the previous model while incorporating new data.

For two great papers that are very readable and provide further insights, see “Continual learning via sequential function-space variational inference” and “Tractable function-space variational inference in Bayesian neural networks”, both by Rudner et al. (2022).

      Comparison to FSVI in the Literature

In practice, both works by Rudner et al. (2022) linearize the logits (the final activations of the neural network before applying the softmax function in multi-class classification; not to be confused with the pre-logits, e.g. embeddings before the final linear layer), similar to a Laplace approximation, and use the DPI to show (in their notation):

      \[\mathbb{D}_{\mathrm{KL}}\left(q_{f(\cdot ; \boldsymbol{\Theta})} \| p_{f(\cdot ; \boldsymbol{\Theta})}\right) \leq \mathbb{D}_{\mathrm{KL}}\left(q_{\Theta} \| p_{\Theta}\right)\]

      which in my notation is equivalent to the first application of the DPI above:

      \[\Kale{\qof{\L...\given \x...}}{\pof{\L...\given \x...}} \le \Kale{\qof{\W}}{\pof{\W}}.\]

      They maximize the fELBO objective:

      \[\begin{aligned} \mathcal{F}\left(q_{\boldsymbol{\Theta}}\right) &=\mathbb{E}_{q_{f\left(\mathbf{x}_{\mathcal{D}} ; \boldsymbol{\Theta}\right)}}\left[\log p_{\mathbf{y} \mid f(\mathbf{X} ; \boldsymbol{\Theta})}\left(\mathbf{y}_{\mathcal{D}} \mid f\left(\mathbf{X}_{\mathcal{D}} ; \boldsymbol{\theta}\right)\right)\right]\\ &\quad -\sup _{\mathbf{X} \in \mathcal{X}_{\mathbb{N}}} \mathbb{D}_{\mathrm{KL}}\left(q_{f(\mathbf{X} ; \boldsymbol{\Theta})} \| p_{f(\mathbf{X} ; \boldsymbol{\Theta})}\right), \end{aligned}\]

      which is equivalent to minimizing the information-theoretic objective:

      \[\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\L... \given \x...}}{\pof{\L... \given \x...}},\]

      if we choose the \(\x...\) to tighten the DPI inequality as much as possible (i.e. by “finding” the supremum).

Using the inequality chain from above, we can sandwich their objective between the regular (negative) ELBO and the (negative) functional ELBO we derived above:

\[\begin{aligned} &\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\W}}{\pof{\W}} \\ &\quad \ge \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\L... \given \x...}}{\pof{\L... \given \x...}} \\ &\quad \ge \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}. \end{aligned}\]

Why are they using logits instead of probabilities? In practice, using the probabilities instead of the logits when performing the linearization is often cumbersome due to the non-linearity of the softmax function, which requires Monte-Carlo sampling of the logits to obtain an approximation of the final probabilities. Furthermore, I speculate that sampling the logits can be more benign given that we often use ReLUs in the underlying neural networks. (Don’t quote me too strongly on this, though.)

Conceptually, this explains the derivation of their ELBO objective and also relates it to the ‘purer’ and simpler functional evidence bound derived above. But this raises the question of how these inequalities differ and what the gap between them tells us. Let’s address this question next.

      The Equality Case and Equivalence Classes

      When do we have equality? That is, when do we have:

      \[\Kale{\qof{\W}}{\pof{\W}} = \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}?\]

      And what does it tell us?

As we have seen in the first part of this post, we have equality in the DPI if and only if:

      \(\Kale{\qof{\W \given \Y..., \x...}}{\pof{\W \given \Y..., \x...}}=0\).

      Given that we are trying to approximate the Bayesian posterior \(\pof{\w \given \Y..., \x...}\) using \(\qof{\w}\), this equality condition tells us that we would have to find the exact posterior for equality. Hence, it is unlikely that we will have equality in practice. From this, the next question immediately follows: what does this predictive prior term

      \[\Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}\]

provide us with?

Another way to think about the gap between the two KL divergences is that one is parameter-based and the other one is not. This points to a deeper truth about overparameterized models used in deep learning: many different parameter configurations implement exactly the same function, that is, the parameters exhibit symmetries.

The functional KL divergences won’t be affected by this, as they are parameter-free and do not take into account the parameters of the model but only the predictions. The regular parameter-based KL divergence, however, would be affected by this—depending on the prior \(\pof{\w}\), it might express differences between the parameter distributions that have no effect on the outputs.

      In other words, if the prior assigns different probability to otherwise equivalent parameters, this obviously changes the parameter posterior, while the outputs are invariant to these changes if the overall assigned probability to a given output remains the same.

For example, the paper “Deep Ensembles: A Loss Landscape Perspective” by Fort et al. (2020) examines the similarity of the predictions of models trained from different initializations and shows that the prediction space has a multi-modal loss landscape. In the language of FSVI, this is similar to analyzing the function-space distances between different models.

      Equivalence Classes

      Unless there are other considerations, it makes sense to use priors that assign the same density to parameters that are equivalent. Hence, for a given function \(\fof{\x ; \w}\), which determines the likelihood \(\pof{\y \given \x, \w} \triangleq \pof{y \given \fof{\x ; \w}}\), we can define an equivalence relation such that \(\w \sim \w'\) if and only if \(\fof{\x; \w} = \fof{\x; \w'}\) for all \(\x\). This equivalence relation partitions the parameter space into equivalence classes:

\[[\w] \triangleq \{\w' : \fof{\x ; \w'} = \fof{\x ; \w} \quad \forall \x \}.\]

      A prior \(\pof{\w}\) induces a prior \(\hpof{[\w]}\) over the equivalence classes:

      \[\hpof{[\w]} \triangleq \sum_{\w' \in [\w]} \pof{\w'}.\]

      —or \(\int_{[\w]} \pof{\w'} \, d \w'\) for continuous \(\w\)—with the corresponding model:

      \[\begin{aligned} \hpof{\y, [\w] \given \x} &\triangleq \hpof{\y \given \x, [\w]} \, \hpof{[\w]} \\ &= \pof{\y \given \x, \w} \, \hpof{[\w]}. \end{aligned}\]
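A toy discrete sketch of this induced prior, assuming a hypothetical model \(\fof{\x ; \w} = \w^2 \x\) with a sign symmetry, so that \(\w\) and \(-\w\) are functionally equivalent:

```python
import numpy as np
from collections import defaultdict

# Discrete parameter space and a prior over it (made-up numbers).
ws = np.array([-2, -1, 0, 1, 2])
p_w = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

# f(x; w) = w**2 * x, so the equivalence class [w] is keyed by w**2.
p_class = defaultdict(float)
for w, p in zip(ws, p_w):
    p_class[int(w**2)] += float(p)  # hat-p([w]) = sum of p(w') over w' in [w]

print(dict(p_class))  # {4: 0.2, 1: 0.4, 0: 0.4}
```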

      Consistency

      Importantly, the definition of the equivalence classes above is consistent with Bayesian inference:

This is easy to show using Bayes’ rule:

      \[\begin{aligned} \hpof{[\w] \given \Dany} &= \hpof{\Dany \given [\w]} \, \hpof{[\w]} / \hpof{\Dany} \\ &= \pof{\Dany \given \w} \sum_{\w' \in [\w]} \pof{\w'} / \hpof{\Dany} \\ &= \sum_{\w' \in [\w]} \pof{\Dany \given \w'} \, \pof{\w'} / \hpof{\Dany} \\ &= \sum_{\w' \in [\w]} \pof{\w' \given \Dany} \, \pof{\Dany} / \hpof{\Dany} \\ &= \sum_{\w' \in [\w]} \pof{\w' \given \Dany}. \end{aligned}\]

      The last step follows from \(\hpof{\Dany}=\pof{\Dany}\):

\[\begin{aligned} \hpof{\Dany} &= \sum_{[\w]} \hpof{\Dany, [\w]} \\ &= \sum_{[\w]} \sum_{\w' \in [\w]} \pof{\Dany, \w'} \\ &= \sum_{\w'} \pof{\Dany, \w'} \\ &= \pof{\Dany}. \end{aligned}\]

      This also tells us that, for any \(\x\) and \(\y\):

      \(\pof{\y... \given \x...} = \hpof{\y... \given \x...}\).

Given this consistency, we don’t have to differentiate between \(\hat\opp\) and \(\opp\) and can use them interchangeably. The same holds for \(\opq\).

      Equality & Symmetries

      We can view \([\w]\) as a projection from \(\w\) to its equivalence class \([\w]\). The DPI then gives us:

      \[\Kale{\qof{\W}}{\pof{\W}} \ge \Kale{\qof{[\W]}}{\pof{[\W]}}.\]

      And again: what does the gap between the two terms tell us?

      Let’s look at a few examples to get a better understanding of this.

      1. Trivial Constant Case

Let \(\fof{\x ; \w} = 0\), independent of \(\w\). Then \([\w] = [\w']\) for any \(\w\), \(\w'\).

For any approximate distribution \(\qof{\w}\), the induced \(\Kale{\qof{[\W]}}{\pof{[\W]}}=0\), while \(\Kale{\qof{\W}}{\pof{\W}}\) can still be positive: any divergence it measures here is superfluous.

      2. Unused Parameter

Let \(\y \given (\w_1, \w_2) = \w_1\), deterministic in \(\w_1\) but independent of \(\w_2\). Then \([(\w_1, \w_2)] = [(\w_1, {\w'}_2)]\) for any \({\w'}_2\), and \([(\w_1,*)]\not=[({\w'}_1, *)]\) for any \(\w_1 \not= \w'_1\).

      \(\Kale{\qof{[\W]}}{\pof{[\W]}}=\Kale{\qof{\W_1}}{\pof{\W_1}}\) captures the meaningful divergence between approximate and true distribution, while \(\Kale{\qof{\W}}{\pof{\W}}\) also includes any divergence across \(\w_2\) that has no effect on the predictions.

      3. Periodic Parameter Space

      Finally, let’s assume that the predictions are periodic in some way. That is, for example \(\y = \sin \w\). We then have \([\w] = [\w + 2\pi]\).

Further, let \(\pof{\w} = \operatorname{U}(\w; [0,2\pi \, N))\) for some \(N\) that determines the number of periods. Then, if we introduce another random variable \(K\) that captures which period we are in, we can (again) use the chain rule to write:

      \[\begin{aligned} \Kale{\qof{\W}}{\pof{\W}} &= \Kale{\qof{\W \given \W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \given \W \in [K\,2\pi, (K+1)\,2\pi]}} \\ &\quad + \Kale{\qof{\W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \in [K\,2\pi, (K+1)\,2\pi]}} \\ &= \Kale{\qof{[\W]}}{\pof{[\W]}} \\ &\quad + \Kale{\qof{\W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \in [K\,2\pi, (K+1)\,2\pi]}}. \end{aligned}\]

      This follows from the setup of this specific example. Finally, we have:

      \[\Kale{\qof{\W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \in [K\,2\pi, (K+1)\,2\pi]}} \le \log N.\]

      So, if \(\opq\) only had support in a single period for example, the difference between \(\Kale{\qof{\W}}{\pof{\W}}\) and \(\Kale{\qof{[\W]}}{\pof{[\W]}}\) would be \(\log N\): the redundancy.

      Predictive Prior

How does the predictive prior term fit into this? The DPI again yields the answer: since the predictions depend on \(\w\) only through its equivalence class \([\w]\), we have

      \[\Kale{\qof{\W}}{\pof{\W}} \ge \Kale{\qof{[\W]}}{\pof{[\W]}} \ge \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}.\]

      This tells us that the predictive prior term can at best measure the KL divergence between the equivalence classes of the parameters—and not between the parameters itself—but luckily, this is the more meaningful divergence anyway!

      For the equality cases, we observe that:

      1. we need a 1:1 mapping between parameters and equivalence classes for the first bound to be tight, and
      2. we need \(\Kale{\qof{[\W] \given \Y_n,\x_n,...,\Y_1,\x_1}}{\pof{[\W] \given \Y_n,\x_n,...,\Y_1,\x_1}} \to 0\) for \(n \to \infty\) for the second bound to be tight.

For 2., we know from the chain rule that

      \[\Kale{\qof{\Y_n,...\Y_1\given\x_n,...,\x_1}}{\pof{\Y_n,...\Y_1\given\x_n,...,\x_1}}\]

      is monotonically increasing in \(n\) and bounded from above by \(\Kale{\qof{[\W]}}{\pof{[\W]}}\), so it must converge (a bounded, monotonically increasing sequence converges). So, when does it close the gap?

To give intuition that it might do so, without attempting to prove this formally, we can appeal to the Bernstein–von Mises theorem. It states that, as the number of data points tends to infinity, the posterior distribution of the parameters converges to a Gaussian distribution centered at the maximum likelihood estimate (MLE), as long as the model parameters are identifiable (that is, the true parameters we want to learn are unique) and have support under the prior.

For the evidence bound to be meaningful, we already know that we need the approximate distribution \(\opq\) to have support in the prior \(\opp\)—otherwise, the bound is \(\infty\). Moreover, realizing that we take an expectation over \(\qof{\Y_n ,..., \Y_1 \given \x_n ,..., \x_1}\), we can decompose the KL term for the gap as:

\[\begin{aligned} &\Kale{\qof{[\W] \given \Y_n,\x_n,...,\Y_1,\x_1}}{\pof{[\W] \given \Y_n,\x_n,...,\Y_1,\x_1}} \\ &\quad = \E{\qof{\y_n,...,\y_1\given\x_n,...,\x_1}}{\Kale{\qof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}{\pof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}} \\ &\quad = \simpleE{\qof{[\w']}}{\E{\qof{\y_n,...,\y_1\given\x_n,...,\x_1, [\w']}}{\Kale{\qof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}{\pof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}}}. \end{aligned}\]

That is, we sample \([\w'] \sim \qof{[\w']}\) and then sample \(\y_n,...,\y_1\given\x_n,...,\x_1\) from the corresponding \(\qof{\y_n,...,\y_1\given\x_n,...,\x_1, [\w']}\) and marginalize over these. Crucially, \([\w']\) plays the role of the true parameters of the data-generating process for the inner KL divergence term. We thus take an expectation over KL terms fulfilling the conditions of the Bernstein–von Mises theorem:

\[\begin{aligned} \Kale{\qof{[\W] \given \y_n,\x_n,...,\y_1, \x_1}}{\pof{[\W] \given \y_n,\x_n,...,\y_1, \x_1}} \to 0. \end{aligned}\]

In other words, for a given \([\w']\), in the space of equivalence classes as defined previously, the equivalence class of all MLE solutions in the data limit, \([\mathrm{MLE}]\), will be unique by definition—the model is identifiable—and will match \([\w']\). (This follows from the consistency of MLE estimators, but also from Bernstein–von Mises with a flat/uninformative prior.) As the MLE is prior-independent once there is support for it, both \(\opq\) and \(\opp\) will converge to the MLE \([\w']\) with sufficient data. Taking the expectation, this yields \(\Kale{\qof{[\W]\given \Y..., \x...}}{\pof{[\W] \given \Y..., \x...}} \to 0\) for \(n \to \infty\), and thus we have:

      \[\begin{aligned} & \Kale{\qof{[\W]}}{\pof{[\W]}} = \\ &\quad = \sup_{n\in \mathbb{N}} \Kale{\qof{\Y_n,...,\Y_1\given\x_n,...,\x_1}}{\pof{\Y_n,...,\Y_1\given\x_n,...,\x_1}}. \end{aligned}\]

      (Again, this is not a formal proof but an intuition for why the gap might close in the data limit.)

      In my opinion, this is a great result. We have shown both that the predictive prior term converges given our assumptions and that it converges to the symmetry-free parameter-based divergence in the data limit. This is a strong argument for the predictive prior term being meaningful and not just a technical trick.

      Let’s appreciate one more thing: the predictive prior can consist of infinitely many data points and still converge to a finite value.

      Parameter Priors vs. Predictive Priors

      What is the advantage of this all?

In Bayesian deep learning, we often use parameter priors that are not meaningful and that do not take parameter symmetries into account. For example, a unit Gaussian prior over the parameters of a neural network does not necessarily induce different predictions for different parameters. While this prior can be sensible from a parameter compression perspective (e.g. see Hinton and van Camp (1993)), this does not have to be the only consideration guiding us.

      With function priors and predictive priors, we can specify more meaningful priors because we can focus on the predictions and ignore the parameters. More importantly, this connects Bayesian approaches to data augmentation and other regularization techniques as we will see next.

      Given that priors over equivalence classes are difficult to express explicitly though, using the DPI to obtain a functional ELBO can be an easier way to express and approximate them.

      Label Entropy Regularization

      All this also helps us gain a new perspective on label entropy regularization. The functional evidence bound can be lower-bounded using the chain rule by:

      \[\begin{aligned} \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}} \\ \ge \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \E{\pdata{\x}}{\Kale{\qof{\Y \given \x}}{\pof{\Y \given \x}}}, \end{aligned}\]

      where we can expand the term under the second expectation to:

      \[\Kale{\qof{\Y \given \x}}{\pof{\Y \given \x}}=\CrossEntropy{\qof{\Y \given \x}}{\pof{\Y \given \x}} - \xHof{\qof{\Y \given \x}}.\]

Assuming that our prior yields a uniform distribution over the labels, we can drop the cross-entropy term because it is constant and obtain:

      \[\E{\qof{\w}}{-\log \pof{\Dany \given \w}} - \E{\pdata{\x}}{\xHof{\qof{\Y \given \x}}}.\]

      This is the same as an MLE minimization objective with an additional entropy regularization term \(-\xHof{\qof{\Y \given \x}}\) for different \(\x\) that prevents the model from overfitting to the labels and collapsing to the one-hot encoding of the labels.
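A minimal sketch of this objective in PyTorch (the regularization weight `beta` and the split into labeled and unlabeled logits are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def entropy_regularized_loss(logits_labeled, labels, logits_unlabeled, beta=1.0):
    # MLE term: cross-entropy on the labeled training data.
    nll = F.cross_entropy(logits_labeled, labels)
    # Predictive entropy on inputs drawn from p_data(x).
    log_probs = F.log_softmax(logits_unlabeled, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    # Subtracting the entropy penalizes collapse to one-hot predictions.
    return nll - beta * entropy
```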

      Thus, in the simplest approximation, the DPI and functional variational inference give us a new perspective on label entropy regularization.

      Knowledge Distillation

      Obviously, assuming non-uniform prior predictions, \(\E{\pdata{\x}}{\Kale{\qof{\Y \given \x}}{\pof{\Y \given \x}}}\) can be related to knowledge distillation in deep neural networks as introduced by Hinton et al. (2015).

The main technical difference is that knowledge distillation uses the reverse KL divergence instead of the forward KL divergence, while the conceptual difference is that we are not distilling knowledge from a teacher model but from the prior, which we downweight while also training our model on the data itself. However, the connection between knowledge distillation and continual learning using informative priors is manifest.

      Conclusion

      In this blog post, we took a deep dive into the data processing inequality (DPI) and its surprisingly far-reaching implications for modern Bayesian deep learning. By carefully examining the assumptions, equality conditions, and chain rule of the DPI, we arrived at an intuitive understanding of why function-space variational inference (FSVI) can be such a powerful tool. The DPI perspective illuminates how FSVI side-steps issues with high-dimensional parameter spaces by focusing on matching Bayesian predictive posteriors.

      Reasoning about parameter equivalence classes under the lens of the DPI, we saw how predictive KL divergences can capture meaningful differences between models while ignoring superficial discrepancies due to symmetries. This provides a fresh perspective on the advantages of predictive priors over standard parameter priors commonly used in Bayesian neural networks.

      While our treatment only scratched the surface of the full mathematical story, the intuitions we developed allowed us to re-derive key results from the literature and uncover deep connections between seemingly disparate methods like entropy regularization, continual learning, and knowledge distillation. The examples and proofs peppered throughout solidified the core concepts.

      More than a bag of technical tricks, the DPI reveals itself to be a powerful conceptual tool for reasoning about models, objectives, and algorithms. I hope this post inspires the reader to seek the fundamental principles underpinning machine learning innovations and to use those principles as a guide for future research. With a solid grasp of foundational tools like the DPI, we can all contribute to demystifying and unifying the rapidly evolving field of Bayesian deep learning.


      Acknowledgements. Many thanks to Freddie Bickford Smith for very helpful comments and feedback on this post and to Tim Rudner for additional pointers to relevant literature and feedback on the FSVI section in particular 🤗

For attribution in academic contexts, please cite this work as

      PLACEHOLDER FOR ACADEMIC ATTRIBUTION

      BibTeX citation

      PLACEHOLDER FOR BIBTEX

      Elaborating on the Value of Flow Matching for Density Estimation

The transfer of matching-based training from Diffusion Models to Normalizing Flows makes it possible to fit expressive continuous normalizing flows efficiently and therefore enables their use for various density estimation tasks. One particularly interesting task is Simulation-Based Inference, where Flow Matching has enabled several improvements. This post focuses on Flow Matching for Continuous Normalizing Flows; to highlight the relevance and practicality of the method, its use and advantages for Simulation-Based Inference are then elaborated.

      Motivation

      Normalizing Flows (NF) enable the construction of complex probability distributions by transforming a simple, known distribution into a more complex one. They do so by leveraging the change of variables formula, defining a bijection from the simple distribution to the complex one.

For a long time, flows were based on chaining several differentiable and invertible transformations. However, these diffeomorphic transformations limit the flows in their complexity, as each transformation has to be simple. Furthermore, this leads to a trade-off between sampling speed and evaluation performance . Their continuous counterpart, Continuous Normalizing Flows (CNFs), has been held back by limitations in their simulation-based maximum likelihood training . By utilizing Flow Matching, this limitation has been overcome, and CNFs have been shown to be a powerful tool for density estimation.

In the following sections, CNFs and Flow Matching are explained. After that, the empirical results of Flow Matching are presented. Finally, the application of Flow Matching to Simulation-Based Inference is discussed, highlighting the method’s wide applicability and consistent improvements.

      Continuous Normalizing Flows

      Continuous normalizing flows are among the first applications of neural ordinary differential equations (ODEs) . Instead of the traditional layers of neural networks, the flow is defined by a vector field that is integrated over time.

      \[\frac{d}{dt} x(t) = f_{\theta}(x(t), t)\]

The vector field is typically parameterized by a neural network. While traditional layer-based flow architectures need to impose special architectural restrictions to ensure invertibility, CNFs are invertible as long as the uniqueness of the solution of the ODE is guaranteed. This is, for instance, the case if the vector field is Lipschitz continuous in \(x\) and continuous in \(t\). Many common neural network architectures satisfy these conditions. Hence, the above equation defines a diffeomorphism \(\phi_t(x_0) = x_0 + \int_0^t f_{\theta}(x(s), s)\, ds\) under the discussed assumptions. The change of variables formula can be applied to compute the density of a distribution that is transformed by \(\phi_t\).
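A crude sketch of how base samples are pushed through a CNF, using fixed-step Euler integration (in practice, an adaptive ODE solver would be used):

```python
import torch

def integrate_flow(f, x0, n_steps=100):
    """Transform base samples x0 by integrating dx/dt = f(x, t) from t=0 to t=1."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + dt * f(x, t)
    return x
```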

      As usual, a CNF is trained to transform a simple base distribution \(p_B\), usually a standard normal distribution, into a complex data distribution \(p_D\). For each point in time \(t\in[0,1]\) the time-dependent vector field defines a distribution \(p_t\) (probability path) and the goal is to find a vector field \(f_\theta\) such that \(p_1=p_D\). This is usually achieved by maximum likelihood training, i.e. by minimizing the negative log-likelihood of the data under the flow.

      While CNFs are very flexible, they are also computationally expensive to train naively with maximum likelihood since the flow has to be integrated over time for each sample. This is especially problematic for large datasets which are needed for the precise estimation of complex high-dimensional distributions.

      Flow Matching

      The authors of propose a new method for training CNFs, which avoids the need for simulation. The key idea is to regress the vector field directly from an implicit definition of a target vector field that defines a probability path \(p_t(x)\) with \(p_0=p_{B}\) and \(p_1=p_{D}\). Moreover, the authors propose a loss function that directly regresses the time dependent vector field against the conditional vector fields with respect to single samples.

      Unconditional ImageNet-128 samples of a CNF trained using Flow Matching with Optimal Transport probability paths. Figure obtained from .

      Assuming that the target vector field is known, the authors propose a loss function that directly regresses the time dependent vector field:

      \[L_{\textrm{FM}}(\omega) = \mathbb{E}_{t, p_t(x)}(|f_{\omega}(x, t) - u_t(x)|^2),\]

      where \(u_t\) is a vector field that generates \(p_t\) and the expectation with respect to \(t\) is over a uniform distribution. Unfortunately, the loss function is not directly applicable because we do not know how to define the target vector field. However, it turns out that one can define appropriate conditional target vector fields when conditioning on the outcome \(x_1\):

      \[p_t(x) = \int p_t(x|x_1)p_{D}(x_1)d x_1.\]

Using this fact, the conditional flow matching loss can be defined, yielding the same gradients as the flow matching loss.

      \[L_{\textrm{CFM}}(\omega) = \mathbb{E}_{t, p_t(x|x_1), p_D(x_1)}(|f_{\omega}(x, t) - u_t(x|x_1)|^2).\]

      Finally, one can easily obtain an unbiased estimate for this loss if samples from \(p_D\) are available, \(p_t(x|x_1)\) can be efficiently sampled, and \(u_t(x|x_1)\) can be computed efficiently. We discuss these points in the following.

      Gaussian Conditional Probability Paths

The vector field that defines a probability path is usually not unique. This is often due to invariance properties of the distribution, e.g. rotational invariance. The authors focus on the simplest possible vector fields to avoid unnecessary computations. They choose to define conditional probability paths that maintain the shape of a Gaussian throughout the entire process. Hence, the conditional probability paths can be described by a variable transformation \(\phi_t(x \mid x_1) = \sigma_t(x_1)x + \mu_t(x_1)\). The time-dependent functions \(\sigma_t\) and \(\mu_t\) are chosen such that \(\sigma_0(x_1) = 1\) and \(\sigma_1(x_1) = \sigma_\text{min}\) (chosen sufficiently small), as well as \(\mu_0(x_1) = 0\) and \(\mu_1(x_1)=x_1\). The corresponding probability path can be written as

      \[p_t(x|x_1) = \mathcal{N}(x; \mu_t(x_1), \sigma_t(x_1)^2 I).\]

      In order to train a CNF, it is necessary to derive the corresponding conditional vector field. An important contribution of the authors is therefore the derivation of a general formula for the conditional vector field \(u_t(x|x_1)\) for a given conditional probability path \(p_t(x|x_1)\) in terms of \(\sigma_t\) and \(\mu_t\):

\[u_t(x\mid x_1) = \frac{\sigma_t'(x_1)}{\sigma_t(x_1)}(x-\mu_t(x_1)) + \mu_t'(x_1),\]

where \({}'\) denotes the derivative with respect to time \(t\).

      Compared to the diffusion path’s conditional score function, the OT path’s conditional vector field has constant direction in time and is arguably simpler to fit with a parametric model. Note the blue color denotes larger magnitude while red color denotes smaller magnitude. Figure obtained from .

      They show that it is possible to recover certain diffusion training objectives with this choice of conditional probability paths, e.g. the variance preserving diffusion path with noise scaling function \(\beta\) is given by:

\[\begin{align*} \phi_t(x \mid x_1) &= \sqrt{1-\alpha_{1-t}^2}\,x + \alpha_{1-t}x_1 \\ \alpha_{t} &= \exp\left(-\frac{1}{2}\int_0^t \beta(s)\, ds\right) \end{align*}\]

      Additionally, they propose a novel conditional probability path based on optimal transport, which linearly interpolates between the base and the conditional target distribution.

      \[\phi_t(x \mid x_1) = (1-(1-\sigma_{\text{min}})t)x + tx_1\]

      The authors argue that this choice leads to more natural vector fields, faster convergence and better results.
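A minimal training-loss sketch for conditional flow matching with the OT path, assuming `model(x, t)` is the parametric vector field \(f_\omega\) and `x1` is a batch of (flattened) data samples:

```python
import torch

def cfm_ot_loss(model, x1, sigma_min=1e-3):
    t = torch.rand(x1.shape[0], 1)                # t ~ U[0, 1]
    x0 = torch.randn_like(x1)                     # x0 ~ p_B = N(0, I)
    # OT path: x_t = (1 - (1 - sigma_min) t) x0 + t x1
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1
    # Conditional target field: u_t(x | x1) = (x1 - (1 - sigma_min) x) / (1 - (1 - sigma_min) t)
    ut = (x1 - (1 - sigma_min) * xt) / (1 - (1 - sigma_min) * t)
    return ((model(xt, t) - ut) ** 2).sum(dim=-1).mean()
```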

      Empirical Results

      The authors investigate the utility of Flow Matching in the context of image datasets, employing CIFAR-10 and ImageNet at different resolutions. Ablation studies are conducted to evaluate the impact of choosing between standard variance-preserving diffusion paths and optimal transport (OT) paths in Flow Matching. The authors explore how directly parameterizing the generating vector field and incorporating the Flow Matching objective enhances sample generation.

      Likelihood (BPD), quality of generated samples (FID), and evaluation time (NFE) for the same model trained with different methods. Figure from .

The findings are presented through a comprehensive evaluation using various metrics such as negative log-likelihood (NLL), Fréchet Inception Distance (FID), and the number of function evaluations (NFE). Flow Matching with OT paths consistently outperforms other methods across different resolutions.

      Flow Matching, especially when using OT paths, allows us to use fewer evaluations for sampling while retaining similar numerical error (left) and sample quality (right). Results are shown for models trained on ImageNet 32×32, and numerical errors are for the midpoint scheme. Figure from .

      The study also delves into the efficiency aspects of Flow Matching, showcasing faster convergence during training and improved sampling efficiency, particularly with OT paths.

      Sample paths from the same initial noise with models trained on ImageNet 64×64. The OT path reduces noise roughly linearly, while diffusion paths visibly remove noise only towards the end of the path. Note also the differences between the generated images. Figure from .
      Image super-resolution on the ImageNet validation set. Figure from .

      Additionally, conditional image generation and super-resolution experiments demonstrate the versatility of Flow Matching, achieving competitive performance in comparison to state-of-the-art models. The results suggest that Flow Matching presents a promising approach for generative modeling with notable advantages in terms of model efficiency and sample quality.

      Application of Flow Matching in Simulation-Based Inference

A particularly interesting application of density estimation, i.e. of Normalizing Flows, is Simulation-Based Inference (SBI). In SBI, Normalizing Flows are used to estimate the posterior distribution of model parameters given some observations. Important factors here are the sample efficiency, scalability, and expressivity of the density model. Especially for the latter two, Flow Matching has been shown to yield an improvement. This is due to the efficient transport between source and target density and the flexibility of the more complex transformations allowed by continuous normalizing flows. To start, a brief introduction to SBI is given, as not everyone might be familiar with this topic.

      Primer on Simulation-Based Inference

In many practical scenarios, the likelihood function of a model is intractable and cannot be described analytically. This might be the case where the forward model is a complex or proprietary simulation, or a physical experiment . In order to still be able to perform Bayesian inference, one can resort to a class of methods called Likelihood-free Inference. One popular method in this class is SBI. The core idea is to use a prior in combination with the simulator to obtain samples from the joint distribution of the parameters and the data. Based on these samples, the posterior can either be learned directly or the likelihood can be approximated . Depending on the exact method chosen, the approximated posterior is either amortized, i.e. it does not require refitting when conditioned on different data, or non-amortized.

      The figure depicts the schematic flow of information for different kinds of Likelihood-free methods. Modern methods in SBI are depicted in the bottom row where the likelihood is approximated in subfigure E, the posterior is approximated in subfigure F, and the likelihood-ratio in subfigure G. Figure from .

In order to formalize the method, let \(\theta \sim \pi(\theta)\) denote the parameters of a system and their respective prior distribution. The system under evaluation and the respective observations obtained are denoted by \(x = \mathcal{M}(\theta)\). To sample from the joint distribution \(p(\theta, x)\), a parameter \(\theta_i\) is sampled from the prior and the observation is obtained by evaluating the forward model on that parameter, \(x_i = \mathcal{M}(\theta_i)\). In this way, a dataset of samples from the joint distribution can be generated: \(\mathcal{X} = \{ (\theta, \mathbf{x})_i \}^N_{i=1}\). A density estimator is then fitted on this dataset to estimate the desired distribution, e.g. directly the posterior \(q_{\omega}(\theta \mid x) \approx p(\theta \mid x)\), as sketched below.
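A minimal sketch of this dataset generation (the prior, forward model, and dimensions are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_model(theta):
    """Stand-in for the simulator M(theta); in practice a black box."""
    return theta + rng.normal(scale=0.1, size=theta.shape)

# theta_i ~ pi(theta), x_i = M(theta_i): samples from the joint p(theta, x).
thetas = rng.normal(size=(1000, 2))
xs = np.stack([forward_model(th) for th in thetas])
dataset = list(zip(thetas, xs))  # training set for the density estimator
```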

      The interested reader shall be directed to and especially for a more rigorous introduction to SBI. In order to compare the performances of the different approaches to SBI and their performance with respect to certain tasks, an excellent overview is provided in . For the sake of this post, a more abstract understanding is enough.

      Flow Matching for Simulation-Based Inference

The approach of using the Flow Matching formulation to fit the density network is presented by Dax et al. . In the setting described by the authors and the aforementioned SBI context, the goal is to approximate a posterior distribution over model parameters given observations, \(p(\theta \vert x)\). To learn the posterior, the Flow Matching loss is adapted to the following:

      \[\mathcal{L}_{FMPE} = \mathbb{E}_{t \sim p(t),\theta_1 \sim p(\theta), x \sim p(x \vert \theta_1),\theta_t \sim p_t(\theta_t \mid \theta_1)} \Vert f_{\omega,x}(\theta_t, t) - u_t(\theta_t \mid \theta_1) \Vert^2\]

The important detail to note here is the adaptation to minimize the loss w.r.t. samples drawn from the joint distribution, as described in the general section on SBI. To do so, the expectation is taken w.r.t. \(\theta_1 \sim p(\theta), x \sim p(x \vert \theta_1)\), which yields the desired samples, as sketched below.
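A minimal sketch of the FMPE objective, reusing the OT path from earlier with a uniform \(p(t)\) for simplicity, and assuming `model(theta, t, x)` implements the observation-conditioned vector field \(f_{\omega, x}\):

```python
import torch

def fmpe_loss(model, theta1, x, sigma_min=1e-3):
    # theta1 ~ p(theta), x ~ p(x | theta1): a batch from the joint distribution.
    t = torch.rand(theta1.shape[0], 1)
    theta0 = torch.randn_like(theta1)
    theta_t = (1 - (1 - sigma_min) * t) * theta0 + t * theta1
    u_t = (theta1 - (1 - sigma_min) * theta_t) / (1 - (1 - sigma_min) * t)
    return ((model(theta_t, t, x) - u_t) ** 2).sum(dim=-1).mean()
```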

Another adaptation by the authors is to exchange the uniform distribution over time for a general distribution \(t \sim p(t)\). The effects of this substitution won’t be discussed in depth here. However, adapting the distribution makes intuitive sense, as training gets harder close to the target distribution. Therefore, focusing on time steps \(t\) closer to one is beneficial, as the authors have also found in their empirical studies.

In order to provide a general comparison of the Flow Matching-based SBI approach, the CFM model is tested on the SBI benchmarking tasks . The results show either equal or better performance, underscoring the approach’s ability and applicability for SBI.

The figure depicts the results of the CFM model on the SBI benchmarking tasks, as carried out by the authors of . Comparing the results to those obtained by neural posterior estimation with a normalizing flow shows comparable performance on most tasks while outperforming on some.

      Besides the general benchmarks, the authors use their proposed technique to estimate the posterior distribution of gravitational wave parameters \(p(\theta \mid x)\) where \(\theta \in \mathbb{R}^{15}, x \in \mathbb{R}^{15744}\). In order to reduce the problem’s dimensionality and increase the information density, the observations are compressed to \(128\) dimensions using an embedding network.

      Following the preprocessing of the data, three density estimators are fitted and compared to each other. The first method uses a neural spline flow, which has proven itself on these kinds of problems. It is compared to a neural posterior estimation using the Flow Matching approach described here. Finally, a neural posterior estimator leveraging physical symmetries is used to estimate the targeted posterior. All were trained on a simulation budget of \(5 \cdot 10^6\) samples for a total of 400 epochs.

In order to evaluate the models’ performances, the obtained posteriors were compared w.r.t. their 50% credible regions as well as the Jensen-Shannon divergence between the inferred posterior and reference results. The results shown below support the advantages found in the benchmarking tasks. The Flow Matching-based approach shows good performance for all shown parameters and has a clear advantage over the classical NPE approach.

      The figure shows the single performances of a classic NPE approach using neural spline flows, the proposed Flow Matching approach, and a physics-focussed NPE approach. The results are shown for the 50% credible regions on the left, as well as the Jensen-Shannon divergence between the inferred posterior and reference results on the right. The Flow Matching-based approach shows a good performance for all investigated parameters and has a clear advantage over the classical NPE approach. In the pair plot on the left, the choice was made to only show the four parameters for which the classical NPE method performs the worst. While the Flow Matching approach could perform worse on other dimensions, this is not the case as shown on the right. Figure from .

Whilst the examples are interesting in themselves, their evaluation has shown the applicability, scalability, and flexibility of Flow Matching for density estimation. These performance improvements in different areas are what motivated the discussion of Flow Matching in the first place and have hopefully become clear by now.

      A Personal Note

Whilst this is a blog post, we’d like to use this last part to express our personal thoughts on this topic. SBI is a powerful method, enabling Bayesian inference where it would not be possible otherwise. (It might be more fitting to say that Bayesian inference is not practically feasible in many scenarios: in theory, it might still be possible by sampling, but this is essentially ruled out where single evaluations of the forward model are expensive or further evaluations are simply not available, as shown in the example.) Due to the natural problem setting of SBI, where problems are high-dimensional, observations are scarce, and distributions are complex, density estimators capable of handling these challenges are required. In the past, Normalizing Flows have proven themselves to meet these challenges, whilst not resolving them completely. CNFs, due to their higher flexibility, have been a promising candidate to improve on this further, but were limited by the inability to train them efficiently.

Formulating the Flow Matching variant of CNFs has allowed their application to complex density estimation tasks, as for example in SBI, and they have been shown to yield the expected improvements – on standard SBI benchmarking tasks as well as on a very high-dimensional task from the field of astrophysics. Furthermore, the generalization of CFM broadens their applicability even further. It will be very interesting to see what possibilities are opened by this exact formulation and, in addition, what further improvements can be obtained by transferring techniques from Diffusion Models to Normalizing Flows.

      For attribution in academic contexts, please cite this work as
      +        PLACEHOLDER FOR ACADEMIC ATTRIBUTION
      +  
      BibTeX citation
      +        PLACEHOLDER FOR BIBTEX
      +  
      \ No newline at end of file diff --git a/blog/exploring-meta-learned-curiosity-algorithms/index.html b/blog/exploring-meta-learned-curiosity-algorithms/index.html new file mode 100644 index 00000000..01388bf8 --- /dev/null +++ b/blog/exploring-meta-learned-curiosity-algorithms/index.html @@ -0,0 +1,36 @@ + Exploring Meta-learned Curiosity Algorithms | ICLR Blogposts 2024

      Exploring Meta-learned Curiosity Algorithms

This blog post delves into Alet et al.'s ICLR 2020 paper, Meta-learning curiosity algorithms, which introduces a unique approach to meta-learning curiosity algorithms. Instead of meta-learning neural network weights, the focus is on meta-learning pieces of code, making the result interpretable by humans. The post explores the two meta-learned algorithms, namely Fast Action Space Transition (FAST) and Cycle-Consistency Intrinsic Motivation (CCIM).

      Introduction

Dealing with environments with sparse rewards, i.e., where feedback comes at a low frequency, in reinforcement learning (RL) requires meaningful exploration. One way to encourage the RL agent to perform meaningful exploration is by instilling intrinsic motivation into the agent. This intrinsic motivation usually comes in the form of curiosity. As Schmidhuber highlighted : one becomes curious as soon as one believes there’s something about the world that one does not know. It is because of this that curiosity or intrinsic rewards are usually predictive errors. For instance, an RL agent equipped with a world model is given the current state of the environment, \(s_t\), and attempts to predict the next state, \(s_{t+1}\). The error in this prediction is the intrinsic reward. As the world model improves, one should expect the intrinsic rewards to decrease as the agent’s knowledge about the environment increases. This is known as curiosity-driven exploration.

Now there has been success with curious agents solving environments with sparse rewards . Curiosity algorithms such as Random Network Distillation (RND) and BYOL-Explore are hand-designed and are able to perform well across different environments. However, in their 2020 paper , Meta-learning curiosity algorithms, Alet et al. took a unique approach to discovering new curiosity algorithms: they meta-learned pieces of code, similar to the code segments researchers use when crafting curiosity algorithms, such as neural networks with gradient descent mechanisms, trained objective functions, ensembles, buffers, and various regression models. Two new interpretable algorithms were learned by meta-learning these pieces of code: Fast Action Space Transition (FAST) and Cycle-Consistency Intrinsic Motivation (CCIM). It is these two algorithms that we will explore, comparing their behaviour to our baselines: RND and BYOL-Explore.

      The roadmap for exploring FAST and CCIM is organised as follows. We begin with a brief introduction to RL, meta-learning, and meta-reinforcement learning (meta-RL). Next, we provide concise explanations of how curiosity-driven exploration baselines, RND and BYOL-Explore, operate. Subsequently, we delve into the discovery process of FAST and CCIM. Following that, we explore the intricacies of FAST and CCIM, evaluating their performance and studying their behaviour in both the empty grid-world environment and the bsuite deep sea environment. We then compare them to curiosity-driven baselines and a non-curious agent. Finally, we conclude our journey.

      Background

      Reinforcement Learning

RL is inspired by how biological systems learn, as animals are able to learn through trial and error. In RL we have an agent that tries to maximise the sum of rewards it receives by learning from its interactions with the environment. This agent-environment interaction is usually modelled as a Markov decision process (MDP). Figure 1 below illustrates this agent-environment interaction.

      Figure 1. The agent-environment interaction as a MDP. Taken from .

From the figure we can see that the agent observes a state and then takes an action. The agent can then decide on its next action based on the next state it observes and the reward it receives from the critic in the environment. The critic decides what reward the agent receives at every time step by evaluating its behaviour.

As Sutton et al. highlighted in , Figure 1 can be misleading though. It implies that the agent-environment boundary is similar to the physical boundary between an organism’s entire body and the outside world. In RL we consider anything that the agent cannot change through its actions as the environment. For example, if a human were an RL agent, their skeletal structure or their muscles could be considered part of the environment. We can then see that when it comes to RL we have two types of environments: the internal environment, such as the sensory organs of an animal, and the external environment. Also, the reward the agent receives is not always from the external environment. Rewards can be seen as reward signals, like a human’s brain releasing dopamine when one achieves an objective. Thus, the critic can also be inside the RL agent. The figure below shows an extended view of the agent-environment interactions.

      Figure 2. The extended agent-environment interaction. Taken from .

Singh et al. highlighted in that Figure 2 shows that an RL agent has a motivational system, since the critic can be within the internal environment of the agent. This motivational system should ideally remain consistent across a wide range of diverse environments. Since we can view the critic as being inside the agent, we can instil intrinsic motivation into the agent. This means that the agent can receive two types of rewards, namely extrinsic rewards from the external environment and intrinsic rewards from the internal environment. Singh et al. () highlighted the advantages of endowing an agent with intrinsic motivation. They pointed out that an agent equipped with a collection of skills learned through intrinsic reward can more easily adapt to and learn a wide variety of extrinsically rewarded tasks compared to an agent lacking these skills.

      Meta-RL and Meta-learning

The next stop on our journey takes us to meta-learning. Meta-learning is about learning how to learn. The goal is for meta-learning agents to enhance their learning abilities over time, enabling them to generalise to new, unseen tasks. Meta-learning involves two essential loops: the inner loop and the outer loop. In the inner loop, our learning algorithm adapts to a new task using experiences obtained from solving other tasks in the outer loop, which is referred to as meta-training .

      The inner loop addresses a single task, while the outer loop deals with the distribution of tasks. Figure 3 illustrates this concept of meta-learning.

      Figure 3. An illustration of meta-learning. Taken from .

Moving to the intersection of meta-learning and RL, we arrive at meta-RL, where the agent learns how to reinforcement learn . In meta-RL, the agent aims to maximise the sum of rewards from a distribution of MDPs.

In basic RL, we have an algorithm \(f\) that outputs a policy, mapping states to actions. However, in meta-RL, our algorithm has meta-parameters \(\theta\) and outputs \(f\), and \(f\) then produces a policy when faced with a new MDP. Figure 4 illustrates the meta-RL process. Note that in the outer loop the meta-parameters \(\theta\) are updated.

      Figure 4. An illustration of meta-RL. Taken from .

      Random Network Distillation

We now move on to our curiosity-driven exploration baselines. The first baseline that we will briefly discuss is RND . RND works by having two neural networks: one is the predictor network and the other is the target network. The target network is randomly initialised and its parameters stay fixed during training. Given a state, \(s_t\), it outputs the feature representation of that state, \(f_t\). The predictor network then tries to predict \(f_t\) given \(s_t\) as well. The error in this prediction is the intrinsic reward, \(r_i\), given to the agent, and it is given by the following formula,

      \[r_i=\|\hat{f}_t - f_t\|_2^2,\]

where \(\hat{f}_t\) is the output of the predictor network. The formula above also serves as the loss function of the predictor network. We normalise \(r_i\) by dividing it by a running estimate of the standard deviation of the intrinsic returns. We do this because the intrinsic rewards can be very different in various environments, and normalising them makes it easier to pick hyperparameters that work across a wide range of environments. As the agent explores more, the predictor network gets better and the intrinsic rewards decrease. The key idea in RND is that the predictor network is trying to predict the output of a network that is deterministic, the target network. The figure below illustrates the process of RND.

      Figure 5. The process of RND. Taken from .
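As a concrete illustration, here is a minimal numpy sketch of the RND intrinsic reward. The `target` and `predictor` functions are stand-ins for the actual neural networks (which in practice are trained, e.g. in JAX), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_net(in_dim, out_dim):
    # A fixed random linear map standing in for an MLP.
    W = rng.normal(size=(in_dim, out_dim)) / np.sqrt(in_dim)
    return lambda s: np.tanh(s @ W)

target = make_net(4, 8)     # parameters stay fixed during training
predictor = make_net(4, 8)  # in practice, trained to minimise r_i

def rnd_intrinsic_reward(s_t, running_std=1.0):
    f_t = target(s_t)                 # random feature representation f_t
    f_hat = predictor(s_t)            # predictor's estimate of f_t
    r_i = np.sum((f_hat - f_t) ** 2)  # squared L2 error, also the loss
    return r_i / running_std          # normalise by running std of returns
```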

      BYOL-Explore

BYOL-Explore builds upon Bootstrap Your Own Latent (BYOL) , a self-supervised learning algorithm used in computer vision and representation learning. BYOL-Explore is similar to RND in that there is a network that tries to predict the output of a target network. In BYOL-Explore we have an online network that consists of an encoder, a closed-loop recurrent neural network (RNN) cell, an open-loop RNN cell, and a predictor, while the target network consists of just an encoder. The key difference from RND is that the target network’s parameters do not stay fixed. We update the target network’s parameters using the exponential moving average (EMA) of the online network’s predictor parameters. The update is performed using the formula below:

      \[\phi \leftarrow \alpha\phi + (1-\alpha)\theta.\]

In the above equation, \(\phi\) is the target network’s parameters, \(\theta\) is the online network’s predictor parameters, and \(\alpha\) is the EMA smoothing factor. In our implementation of BYOL-Explore we do not make use of the RNN cells, as we are dealing with simple environments; we call our implementation BYOL-Explore Lite. In our implementation the online network is composed of a multilayer perceptron (MLP) encoder and a predictor. The target network, \(h\), is just composed of an MLP encoder. In the BYOL-Explore Lite process the current state of the environment, \(s_t\), is fed into the encoder \(f\), which outputs a feature representation of the state, \(f(s_t)\). This feature representation is then passed to both the RL agent and the predictor \(g\). The RL agent uses \(f(s_t)\) to decide on its next action and determine the value of that state. The predictor uses \(f(s_t)\) to predict \(h(s_{t+1})\), i.e., the predictor attempts to predict the target network’s output for the next state. There are two losses, namely the encoder loss and the predictor loss. The predictor loss is given by,

      \[\mathcal{L}_p=\left\|\frac{g(f(s_{t}))}{\|g(f(s_{t}))\|_2}-\frac{h(s_{t+1})}{\|h(s_{t+1})\|_2}\right\|_2^2.\]

Since the RL agent and the predictor both make use of the online network’s encoder, its loss is given by the sum of the RL loss and the predictor loss. Importantly, the loss \(\mathcal{L}_p\) serves as the intrinsic reward that the RL agent receives at each step. We normalise the intrinsic rewards by dividing them by the EMA estimate of their standard deviation.
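A minimal sketch of the predictor loss and the EMA target update, with `f`, `g`, and `h` as stand-ins for the online encoder, the predictor, and the target encoder:

```python
import numpy as np

def l2_normalise(v, eps=1e-8):
    return v / (np.linalg.norm(v) + eps)

def byol_predictor_loss(f, g, h, s_t, s_next):
    pred = l2_normalise(g(f(s_t)))     # online encoder + predictor
    targ = l2_normalise(h(s_next))     # target encoder (no gradient flows)
    return np.sum((pred - targ) ** 2)  # doubles as the intrinsic reward

def ema_update(phi, theta, alpha=0.99):
    # Target parameters phi slowly track the online parameters theta.
    return alpha * phi + (1 - alpha) * theta
```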

BYOL-Explore Lite also makes use of something known as reward prioritisation. Reward prioritisation involves focusing on parts of the environment where the agent receives high intrinsic rewards while disregarding those with low intrinsic rewards. This enables the agent to concentrate on the areas it understands the least. Over time, the previously ignored areas with low intrinsic rewards become the priority for the agent. To do this we take the EMA mean over successive batches of normalised intrinsic rewards, \(\mu\). Note that \(\mu\) is used as a threshold to separate the high intrinsic rewards from the low intrinsic rewards. Therefore, the intrinsic reward that the agent obtains after reward prioritisation is,

      \[i_t=\max(ri_t-\mu,\,0),\]

where \(ri_t\) is the normalised intrinsic reward.
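A minimal sketch of this prioritisation step, assuming a batch of normalised intrinsic rewards and an EMA smoothing factor of our own choosing:

```python
import numpy as np

def prioritise(ri_batch, mu, ema_alpha=0.99):
    # Update the EMA mean of the normalised intrinsic rewards...
    mu = ema_alpha * mu + (1 - ema_alpha) * ri_batch.mean()
    # ...and clip rewards below the threshold to zero.
    i_t = np.maximum(ri_batch - mu, 0.0)
    return i_t, mu
```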

      Meta-learning curiosity algorithms

      Alet et al. view curiosity as a mechanism that is found through natural selection. As a result they turn to meta-learning to discover new curiosity algorithms. In this case the outer loop searches over the curiosity algorithm space while the inner loop performs the standard RL procedure.

      Figure 6. The process of how the meta-learned curiosity algorithm should work. Taken from .

In the above figure we can see that the curiosity algorithm, \(\mathcal{C}\), takes in the state and reward from the environment and then feeds a proxy reward \(\hat{r}\) to the RL agent. The RL algorithm used is a fully-specified algorithm, i.e., all its hyperparameters are specified. There were two stages in the authors’ search because the module \(\mathcal{C}\) is made of two components. The first component, \(\mathcal{I}\), calculates the intrinsic reward given the current state, the next state, and the action taken. The second component, \(\chi\), then takes the extrinsic reward, the intrinsic reward, and the current normalised time step, combines them, and outputs \(\hat{r}\).

      Meta-Learned Components and their DAGs

As mentioned earlier, Alet et al. focused on meta-learning pieces of code, or rather meta-learning in a space of programs or operations. The programs and operations are represented in a domain-specific language (DSL). The DSL used to find component \(\chi\) consisted of operations such as arithmetic, Min, Max, and more, while the DSL used to find component \(\mathcal{I}\) consisted of programs such as neural networks complete with gradient-descent mechanisms, L2 distance calculation, ensembles of neural networks, and more. Component \(\mathcal{I}\)’s DSL can describe many other hand-designed curiosity algorithms in the literature, such as RND.

      The components \(\mathcal{I}\) and \(\chi\) are represented as Directed Acyclic Graphs (DAGs). The DAGs consist of the following types of modules:

      • Input modules: These are the inputs we put in each component of module \(\mathcal{C}\).
• Parameter and Buffer modules: These modules consist either of the weights of a neural network, which can be updated via back-propagation, or of First In, First Out queues that output a finite list of the most recent \(k\) inputs.
      • Functional modules: This type of module calculates the output given some input.
      • Update modules: These modules can add real-valued outputs to the loss function of the neural network or add variables to buffers.

The DAGs also have an output node, which is a single node whose output is the output of the entire program. To make these ideas more concrete, let us look at the DAG that describes RND.

      Figure 7. The DAG of RND. Taken from .

The blue rectangles represent the input modules, and we can see from the figure that the inputs are states from the environment. The parameter modules are the gray rectangles, and these are the parameters of the target network and the predictor network. Note that the target network’s parameters are given by \(\theta\){1} and the predictor network’s parameters are given by \(\theta\){2}. The functional modules are the white rectangles, and these are the neural networks. The update module is the pink rectangle, which is the loss function.

The output node is the green rectangle and is the L2 distance between the outputs of the predictor network and the target network. This is the loss function described in the RND section. Note that the \(\theta\){2} rectangle has a pink border and a pink arrow; this indicates that it can be updated via back-propagation. The \(\theta\){1} rectangle has a black border and a black arrow, indicating the parameters are not updated via back-propagation. Also note that the functional module that makes use of those parameters is marked with the word “Detach”, indicating that gradient information does not flow back. Recall that \(\theta\){1} represents the parameters of the target network, which remain fixed, and \(\theta\){2} represents the parameters of the predictor network, which are updated during training.

      Now a very important idea is that the DAGs used in the paper have polymorphic types for the inputs and outputs. There are four types:

      • \(\mathbb{R}\), the real numbers.
      • \(\mathbb{S}\), the state space of the environment.
      • \(\mathbb{A}\), the action space of the environment.
      • \(\mathbb{F}\), the feature space.

The instantiation of some types depends on the environment. For example, in Figure 7, if \(\mathbb{S}\) is an image then both the target network and the predictor network are instantiated as convolutional neural networks. If \(\mathbb{S}\) is just an array of numbers then the target network and the predictor network are fully connected neural networks. We now look at the method used to find the components \(\mathcal{I}\) and \(\chi\).

      Method

We now turn our attention to how component \(\mathcal{I}\) was searched for. Alet et al. decided to focus on an environment with sparse rewards. They chose an image-based grid-world in which the agent is tasked with finding the goal position and only obtains a reward if it finds it. This environment has sparse rewards as the agent only receives feedback once it finds the goal position. They limited the number of operations that component \(\mathcal{I}\) could perform to 7 so that the search space remains manageable and the resulting algorithms stay interpretable. They focused on finding a component \(\mathcal{I}\) that optimises the number of distinct cells visited. From the search, 13 of the top 16 components found were variants of FAST and 3 of them were variants of CCIM. We will cover FAST and CCIM in the upcoming sections.

For the component \(\chi\) they focused on the Lunar Lander environment, as it has a strong external reward signal. The algorithm used to output the intrinsic reward was a variant of RND; the main difference was that instead of a single neural network for the predictor network, an ensemble is used. This algorithm came from a preliminary set of algorithms that all resemble RND. The best reward combiner found was,

      \[\hat{r}_t = \frac{(1+ri_t-t/T)\cdot ri_t+ r_t\cdot t/T}{1+ri_t}.\]

Here \(r_t\) is the external reward, \(t\) is the current time step, \(T\) is the maximum number of steps possible in the episode, and \(ri_t\) is the intrinsic reward. However, in this blog post we decided not to focus on the reward combiner \(\chi\) but instead on FAST and CCIM (this decision arises because we felt our exploration of the reward combiner was not exhaustive enough).
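In code, the combiner is a one-liner; the function below is a direct transcription of the formula above:

```python
def combine_rewards(r_t, ri_t, t, T):
    # Early in an episode (t << T) the intrinsic reward dominates; at the
    # end (t = T) the output approaches r_t, provided ri_t is well below one.
    return ((1 + ri_t - t / T) * ri_t + r_t * t / T) / (1 + ri_t)
```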

      FAST

FAST is a very simple algorithm in that it only contains one neural network. Below is the DAG of FAST.

      Figure 8. The DAG of FAST. Taken from .

This single neural network in FAST is a policy-mimicking network, \(\hat{\pi}\). The network \(\hat{\pi}\) tries to predict what action the agent took given a state of the environment (we assume the environment has a discrete action space, but this need not be the case). The loss of the policy-mimicking network is then the negative log likelihood (NLL) loss. Note that, looking at the DAG, the output of FAST is not the same as the loss function of the policy-mimicking network. The output is given by,

      \[ri_t=\|\hat{\pi}(s_{t+1})-\hat{\pi}(s_{t})\|_2.\]

This is different from RND and BYOL-Explore Lite: the intrinsic reward is not given by a predictive error or the loss function of one of the networks in the program. We understood the above formula as the L2 difference between the logits of the current state and the next state. The agent is then rewarded if the next state’s logits are different from the current state’s. Importantly, the agent isn’t rewarded for taking a different action in the next state. Alet et al. pointed out that if the policy-mimicking network has a uniform distribution over the action space in all states, the agent will receive an intrinsic reward of zero. Therefore, in environments where the action probability distributions outputted by the policy-mimicking network vary across states, we expect this algorithm to generate intrinsic rewards. We hypothesise that this algorithm may not perform well in environments where the optimal policy requires the agent to visit states with very similar action probability distributions. While the agent explores by going to different states, ideally we wish for the intrinsic rewards to decrease as the agent explores. Looking at the output of FAST, it is not clear to us how the intrinsic reward decreases, and we expect that this could cause issues.
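A minimal sketch of both quantities, with `pi_hat` standing in for the policy-mimicking network (returning logits) and `action_probs` for its softmax output:

```python
import numpy as np

def fast_intrinsic_reward(pi_hat, s_t, s_next):
    # L2 distance between the mimicking network's logits for the two states.
    return np.linalg.norm(pi_hat(s_next) - pi_hat(s_t))

def mimic_loss(action_probs, action):
    # NLL loss of the policy-mimicking network for the action actually taken.
    return -np.log(action_probs[action] + 1e-8)
```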

      CCIM

      CCIM took us quite a while to understand and process. Let us first go through its DAG below.

      Figure 9. The DAG of CCIM. Taken from .

We can see that there are 3 neural networks: a random network, a random and forward network, and a backward network. The parameters \(\theta\){1} are the parameters of the random network, \(\theta\){2} are the parameters of the backward network, and \(\theta\){3} are the parameters of the random and forward network. Looking at the black border of \(\theta\){1}’s rectangle, we can see that the random network’s parameters stay fixed during training, like in RND. Let us denote the random network as \(r_{\theta_1}\), the backward network as \(b_{\theta_2}\), and the random and forward network as \(fr_{\theta_3}\), and look at the loss functions of \(b_{\theta_2}\) and \(fr_{\theta_3}\). The loss function of \(b_{\theta_2}\) is given by,

\[\mathcal{L}_b=\|b_{\theta_2}(fr_{\theta_3}(s_t))-r_{\theta_1}(s_t)\|_2+\|b_{\theta_2}(fr_{\theta_3}(s_{t+1}))-fr_{\theta_3}(s_t)\|_2,\]

      and the loss function for \(fr_{\theta_3}\) is

\[\mathcal{L}_f=\|b_{\theta_2}(fr_{\theta_3}(s_t))-r_{\theta_1}(s_t)\|_2.\]

Note that the first term in \(\mathcal{L}_b\) is the same as \(\mathcal{L}_f\). The intrinsic reward, i.e., the output of this program, is given by,

      \[ri_t=\|b_{\theta_2}(fr_{\theta_3}(s_{t+1}))-b_{\theta_2}(fr_{\theta_3}(s_t))\|_2.\]

      Looking at the equations, we can see that CCIM borrows ideas from the cycle-consistency seen in the Image-to-Image Translation literature. The cycle-consistency ensures that if you translate from space \(A\) to space \(B\), then given space \(B\), you should be able to translate back to space \(A\). To see how CCIM applies this, let us turn our attention to \(\mathcal{L}_f\)’s equation. The \(fr_{\theta_3}\) network applies a random embedding to state \(s_t\). It then forwards this random embedding to the “next state”. The \(b_{\theta_2}\) network then takes this forwarded random embedding of state \(s_t\) and undoes the forward transformation so that we end up again with just the random embedding of state \(s_t\). Now, the random embedding that \(fr_{\theta_3}\) applied should match the random embedding that \(r_{\theta_1}\) applied to the state \(s_t\) for the loss to be minimised. In other words, once we apply a forward transformation to the random embedding of the state, we should be able to undo that transformation and end up where we started.

      Let us look at the second term in \(\mathcal{L}_b\) given by \(\|b_{\theta_2}(fr_{\theta_3}(s_{t+1}))-fr_{\theta_3}(s_t)\|_2\). We apply a forward and then a backward transformation to the random embedding of state \(s_{t+1}\), so we should end up with just the random embedding of state \(s_{t+1}\). We then apply \(fr_{\theta_3}\) to state \(s_t\) and end up with the forwarded random embedding of state \(s_t\), which should equal the random embedding of \(s_{t+1}\).

The intrinsic reward confuses us. Looking at the DAG of CCIM, we see that the output is given by the L2 distance between \(\mathcal{L}_f\) and \(\mathcal{L}_b\); hence, we initially thought the intrinsic reward was given by \(\|b_{\theta_2}(fr_{\theta_3}(s_{t+1}))-fr_{\theta_3}(s_t)\|\). The difference between this equation and the original intrinsic reward equation is that the backward model, \(b_{\theta_2}\), is not applied to the \(fr_{\theta_3}(s_t)\) term. Looking at the original formula of the intrinsic reward, we can see that it is just the difference between the random embeddings of the current state and the next state (if we assume that the backward network can undo the forward transformation), so it is not clear to us how the intrinsic reward will decrease as the agent explores. Not only that, but we also noticed unexpected behaviour in the loss function of the \(fr_{\theta_3}\) network in our experiments. We then watched Alet et al.’s presentation of their paper to see where we went wrong, and we noticed that in the presentation they swapped the labels for the \(fr_{\theta_3}\) and \(b_{\theta_2}\) networks. After we reached out to them about this discrepancy, they confirmed that the equations in the paper are correct and the labels in the talk are wrong. For our implementation, we therefore used the equations as found in the paper.
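To summarise the section, here is a minimal sketch of the CCIM losses and intrinsic reward as given by the equations above, with `r`, `fr`, and `b` standing in for the three networks:

```python
import numpy as np

def ccim_losses(r, fr, b, s_t, s_next):
    cycle = np.linalg.norm(b(fr(s_t)) - r(s_t))  # term shared by both losses
    loss_f = cycle                               # loss of the fr network
    loss_b = cycle + np.linalg.norm(b(fr(s_next)) - fr(s_t))  # loss of b
    ri_t = np.linalg.norm(b(fr(s_next)) - b(fr(s_t)))  # intrinsic reward
    return loss_f, loss_b, ri_t
```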

      CCIM-slimmed

Through our communication with them, Alet et al. recommended we try ablations of CCIM and suggested the following slimmed-down version:

      • Network \(r_{\theta_1}\) remains unchanged and its parameters stay fixed.
      • Network \(fr_{\theta_3}\) changes to just being a forward network, \(f_{\theta_3}\).
      • The loss function of the \(f_{\theta_3}\) is now \(\mathcal{L}_f=\|f_{\theta_3}(r_{\theta_1}(s_t))-r_{\theta_1}(s_{t+1})\|_2^2\).
      • Network \(b_{\theta_2}\)’s loss function, \(\mathcal{L}_b\), also changes. \(\mathcal{L}_b=\|b_{\theta_2}(r_{\theta_1}(s_{t+1}))-r_{\theta_1}(s_{t})\|_2^2\).
      • The intrinsic reward is now \(\mathcal{L}_f+\mathcal{L}_b\).

This slimmed-down version of CCIM was much easier to implement. Since the sum of the loss functions also acts as the intrinsic reward, it is clearer to us how the intrinsic rewards will decrease as the agent explores: both the forward and backward networks become better at predicting the random embeddings of the next state and the previous state, respectively.
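A minimal sketch of this ablation, again with `r`, `f`, and `b` standing in for the three networks:

```python
import numpy as np

def ccim_slimmed(r, f, b, s_t, s_next):
    loss_f = np.sum((f(r(s_t)) - r(s_next)) ** 2)  # predict next embedding
    loss_b = np.sum((b(r(s_next)) - r(s_t)) ** 2)  # predict previous embedding
    ri_t = loss_f + loss_b                         # intrinsic reward
    return loss_f, loss_b, ri_t
```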

      Experiments

Empirical Design

In devising the methodology for our experiments, we sought guidance from the principles outlined in Patterson et al.’s cookbook, “Empirical Design in Reinforcement Learning” . Our codebase is derived from PureJaxRL and can be found here. Specifically, we leverage PureJaxRL’s Proximal Policy Optimization (PPO) implementation as our chosen reinforcement learning (RL) algorithm. We compare each meta-learned curiosity algorithm to a non-curious agent (normal PPO) and to our baselines. The foundation of our experiments is a JAX implementation of Minigrid’s grid-world environment , which uses gymnax’s API . Additionally, we make use of gymnax’s deep sea environment implementation.

Each RL agent undergoes training for 500,000 time steps across four vectorized environments, employing 30 seeds for each RL algorithm. To assess performance on the environments, we calculate the average episode return across seeds at the end of training, with a 95% confidence interval determined through the percentile bootstrap method. We are not just interested in how well these curiosity algorithms perform but also in understanding their behaviour. We therefore also visualise the sample standard deviation during training to see the performance variations. This helps us see how consistent the behaviour is for each curiosity algorithm and for the normal PPO algorithm.

Since we are not testing the reward combiner found, it is not clear how we should combine the external reward and the intrinsic reward. However, we treat both the external reward and the intrinsic reward as episodic, and therefore we use the following formula: \(\hat{r} = r_t + \lambda ri_t\), where \(\lambda\) is some weight factor. These are the optimal values we found for \(\lambda\) for each curiosity algorithm:

      • FAST: \(\lambda = 0.003\).
      • CCIM-slimmed: \(\lambda = 0.17\).
      • CCIM: \(\lambda = 0.003\).
• BYOL-Explore Lite: \(\lambda = 0.006\).
      • RND: \(\lambda = 0.2\).

      For FAST, CCIM, and CCIM-slimmed we normalise the intrinsic reward using the same method as RND. Next we describe the environments we use in more detail.

      Empty grid-world

The empty grid-world is a very simple environment. As mentioned earlier, the agent’s task is to reach the goal position. The size is \(16\times 16\) and the maximum number of steps is 1024. In our implementation the agent starts at the bottom left corner and has to reach the top right corner. The reward the agent receives if it finds the goal is 1 - 0.9 * (step_count / max_steps). The gif shows an RL agent exploring the environment to reach the goal.

      The empty grid-world environment.

      Deep sea

The deep sea environment is one of the bsuite environments developed by Google DeepMind . This is an \(N \times N\) grid environment that focuses on testing the exploration capabilities of an RL algorithm. The figure below shows the environment.

      Figure 10. The Deep sea environment. Taken from .

The agent starts at the top left corner and its goal is to reach the bottom right corner. At each time step the agent descends one row, and it can either go left or right. There is a small penalty of \(−0.01/N\) for going right, while going left gives a reward of zero. The agent receives a reward of 1 if it finds the treasure at the bottom right corner. The maximum number of steps in the environment is \(N\). Therefore, the optimal policy is to go right at every time step, ignoring the greedy action of going left. In our experiments we set \(N=10\).
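For concreteness, the reward structure can be sketched as follows (a simplification of the bsuite/gymnax implementation, not the exact code):

```python
def deep_sea_reward(went_right, found_treasure, N=10):
    if found_treasure:
        return 1.0                           # treasure at the bottom right
    return -0.01 / N if went_right else 0.0  # small cost for going right
```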

      Results

      CCIM

We start with the deep sea environment. The left of Figure 11 shows the sample standard deviation during training. We only show it for the first 10,000 steps because after that the graphs plateau. We see that RND and BYOL-Explore Lite produce the most consistent agents in the deep sea environment, and CCIM-slimmed produces more consistent agents than CCIM and PPO. Looking at the right of Figure 11, we can see the mean episode return across the 30 seeds with the 95% confidence intervals. RND, BYOL-Explore, and CCIM-slimmed all perform better than PPO. However, CCIM performs roughly the same as PPO at the end of training. From our experiments we also noticed that the intrinsic rewards produced by CCIM increase and then plateau. The CCIM random and forward network’s loss continued to increase during training as well.

      Figure 11. The sample standard deviation during training (left) and the average episode return (right) in deep sea environment.

Next we move on to the empty grid-world. Looking at the left of Figure 12, we can see that all curiosity algorithms produce more consistent agents than PPO, their sample standard deviations being lower. CCIM and CCIM-slimmed both produce more consistent agents than RND and PPO in this environment. The right of Figure 12 also indicates that CCIM performed much better in the empty grid-world and was closer to the baselines. However, in this environment we once again noticed that the raw intrinsic reward increased and then plateaued, and that the loss of the random and forward network increased during training. It should also be noted that the confidence intervals of all the RL algorithms overlap in the empty grid-world environment.

      Figure 12. The sample standard deviation during training (left) and the average episode return (right) in empty grid-world environment.

Next we plot the RND, BYOL-Explore Lite, normal PPO, CCIM, and CCIM-slimmed heatmaps in Figures 13 and 14. To make the heatmaps we looked at the best 15 seeds for each algorithm and kept track of the paths each seed took. Looking at Figures 13 and 14, we can see that CCIM and CCIM-slimmed covered more of the map than RND and BYOL-Explore Lite. However, they only covered slightly more of the map than PPO.

      Figure 13. Heatmaps of the RND agent (left) and the BYOL-Explore Lite agent (right) in empty grid-world.
Figure 14. Heatmaps of the CCIM agent (left), CCIM-slimmed agent (middle), and the normal PPO agent (right) in empty grid-world.

      FAST

Let us now turn our attention to how FAST performed, beginning with the deep sea environment. In Figure 15 we plot the sample standard deviation for the first 10,000 steps, as we observe no significant difference beyond this point. The left side of Figure 15 indicates that PPO and our curiosity-driven baselines produce more consistent agents than FAST, as they exhibit a lower sample standard deviation.

On the right side of Figure 15, we see that FAST, similar to CCIM, performs poorly in this environment compared to our baselines. Notably, during training we noticed that the intrinsic reward of the FAST agents also increased.

      Figure 15. The sample standard deviation during training (left) and the average episode return (right) in deep sea environment.

The right side of Figure 16 shows that FAST’s performance in the empty grid-world is better than its performance in the deep sea environment; it is now comparable to our baselines, despite its intrinsic rewards also increasing over time. Once again, similar to CCIM’s results, we observe overlapping confidence intervals in the empty grid-world. Figure 16 also shows that FAST not only improved in performance in the empty grid-world but now produces more consistent agents than RND and PPO, as its sample standard deviation is lower.

      Figure 16. The sample standard deviation during training (left) and the average episode return (right) in empty grid-world environment.

      We once again plot the heatmap of FAST and compare it to PPO’s heatmap using the best 15 seeds. When comparing Figure 17 (left) with both Figure 17 (right) and Figure 13, we observe that FAST covered more of the grid-world than PPO, BYOL-Explore Lite, and RND.

      Figure 17. Heatmaps of the FAST agent (left) and the normal PPO (right) in empty grid-world.

      Discussion

Alet et al. provided a unique approach to meta-learning. The performance of CCIM and FAST in the empty grid-world therefore did not surprise us, as that was the environment used to search for the algorithms. Note in Figure 17 that the 15 best seeds of FAST covered more of the map, i.e., most of the seeds took different paths to the goal compared to PPO. However, for the CCIM and CCIM-slimmed heatmaps we notice that these algorithms only covered slightly more of the map than PPO. It should be noted that, looking at the heatmaps, CCIM-slimmed, CCIM, and FAST all covered more of the map than our baselines, which makes sense given that Alet et al. looked for curiosity algorithms that optimise the number of distinct cells visited.

From the sample standard deviation plots, we can see that FAST and CCIM do not produce more consistent agents than PPO and the curiosity-driven baselines in the deep sea environment, while CCIM-slimmed produced more consistent agents than PPO but not than the baselines. However, in the empty grid-world environment, FAST, CCIM, and CCIM-slimmed are able to produce more consistent agents than PPO and RND. In the mean episode return plots, CCIM, CCIM-slimmed, and FAST perform better than PPO and RND in the empty grid-world environment, which makes sense as the empty grid-world environment was used to find these curiosity algorithms. However, in the deep sea environment we see that the meta-learned curiosity algorithms perform worse than our curiosity-driven baselines.

From the mean episode return plots we can see that BYOL-Explore Lite is the best performing algorithm. Even in the empty grid-world environment it performs better than the meta-learned curiosity algorithms. We believe this is because of the reward prioritisation implemented in BYOL-Explore. This could explain why its performance is better than the meta-learned curiosity algorithms and why it produces the most consistent agents.

One major concern we still have is that the intrinsic rewards for FAST and CCIM did not decrease during training in either environment used in our experiments. In contrast, the intrinsic rewards for CCIM-slimmed did decrease during training. We believe the decrease in intrinsic rewards as training progresses is one of the main reasons why BYOL-Explore and RND are effective and why we see the improved performance of the CCIM-slimmed algorithm. Even with the reward combiner, we still believe that the intrinsic rewards not decreasing could potentially cause an issue, as it did in the deep-sea environment. Recall that the reward combiner has the following formula,

      \[\hat{r}_t = \frac{(1+ri_t-t/T)\cdot ri_t+ r_t\cdot t/T}{1+ri_t}.\]

Now if \(t=T\) then \(\hat{r}_t \approx r_t\), provided \(0 \leq ri_t \ll 1\). However, for us the intrinsic rewards were not much less than one during training. We believe it is important for curiosity algorithms that the intrinsic reward decreases as the agent becomes more familiar with its environment, and that this is why CCIM-slimmed performed better than CCIM and FAST in the deep sea environment. Another concern we have is that the CCIM random and forward network’s loss increased during training. It is possible that there is a bug somewhere in our code which we have not found yet.

In the future we think it will be interesting to repeat this experiment using the deep sea environment to find the curiosity algorithms that output the intrinsic reward. Additionally, exploring the use of a variant of FAST or CCIM to find a reward combiner is also of interest to us; we wonder why a variant of FAST or CCIM wasn’t employed for this purpose, given that a variant of RND was used to find the reward combiner. As stated earlier, FAST, CCIM, and CCIM-slimmed do not make use of reward prioritisation like BYOL-Explore Lite does. Therefore, repeating the experiments with the meta-learned curiosity algorithms where some form of reward prioritisation is implemented is another interesting path we hope to explore. We would also like to increase the number of seeds used to narrow the confidence intervals. Since we are training end-to-end in JAX in simple environments, increasing the number of seeds should not be much of an issue.

      Conclusion

      In this blog post, we studied two meta-learned curiosity algorithms, namely FAST and CCIM. We compared them to a non-curious agent and our baselines for the curiosity algorithms: RND and BYOL-Explore. Our experiments were conducted using both the empty grid-world environment and the deep-sea environment.

FAST and CCIM both performed well in the empty grid-world, covering more of the map than the baselines when examining their heatmaps. This aligns with our expectations, since this was the environment used to search for the curiosity algorithms. However, in the deep-sea environment, both algorithms did not perform well compared to the baselines. Conversely, CCIM-slimmed, a slimmed-down version of CCIM, showed performance comparable to the baselines. We suspect that this is because its intrinsic reward decreased as the agent explored more. This behaviour was not observed in FAST and CCIM, which we believe is not ideal and consider the main flaw of these algorithms.

      This approach of meta-learning curiosity algorithms is novel, and we believe there’s interesting work that can be done following the same approach as Alet et al., trying it with different environments to search for curiosity algorithms, such as the deep-sea environment. Moreover, BYOL-Explore makes use of reward prioritisation. Therefore, in the future, we hope to include reward prioritisation in our FAST, CCIM, and CCIM-slimmed implementations to see if it improves performance. Another avenue is using the meta-learned curiosity algorithms to search for the reward combiner.

      For attribution in academic contexts, please cite this work as
      +        PLACEHOLDER FOR ACADEMIC ATTRIBUTION
      +  
      BibTeX citation
      +        PLACEHOLDER FOR BIBTEX
      +  
      \ No newline at end of file diff --git a/blog/fairness-ai-two-phil-or-just-one/index.html b/blog/fairness-ai-two-phil-or-just-one/index.html new file mode 100644 index 00000000..f9b54176 --- /dev/null +++ b/blog/fairness-ai-two-phil-or-just-one/index.html @@ -0,0 +1,36 @@ + Fairness in AI: two philosophies or just one? | ICLR Blogposts 2024

      Fairness in AI: two philosophies or just one?

The topic of fairness in AI has garnered more attention over the last year, most recently with the arrival of the EU's AI Act. Fairness in AI is often pursued in one of two ways, namely through counterfactual fairness or through group fairness. These research strands originate from two vastly differing ideologies. However, with the use of causal graphs, it is possible to show that they are related, and even that satisfying counterfactual fairness can mean satisfying a group fairness measure.

This blog post is based on the paper by Anthis and Veitch. The original paper is enriched with a broad overview of fairness concepts used in research and with visuals that help readers gain a deeper understanding. The blog post aims to raise questions about the dichotomy between procedural and outcome fairness: perhaps they should not be treated as separate research fields, as is currently often the case.

      Why fairness?

The spread of AI exposed some of the dark patterns present in society. Some well-known examples are the COMPAS case, which showed discrimination against black defendants, and the Amazon hiring tool, which showed a preference for men over women. However, these AI systems were most likely not the source of this disparate treatment. The behavior stems from the data that was used to train the systems, and thus from the people behind the creation of that data.

Fairness in AI is a research strand which aims to remove the biases in AI models that result in such disparate treatment. The goal is that people are treated more fairly by these models, perhaps even more fairly than by a human decision-maker.

      What is fairness?

The question of what is fair does not have a single answer. Even when stepping away from the computer science context, a universal definition that can be used to determine whether something is fair cannot be found. The concept of fairness is heavily influenced by a person’s own biases, but also by society’s. The fluidity of the notion therefore gives rise to multiple philosophies of what a fair AI system would be.

      Figure 1: Some examples of the concepts used in the respective philosophies.

Two main philosophies can be found in research. The first one, often called explainable AI, aims either to create explainable models or to create explanations for the results obtained from a model. This can also be described as aiming for procedural fairness. The second philosophy is called group fairness. Group fairness focusses on outcome fairness. This means that the predictions from the AI system should have similar properties across groups that only differ in a certain personal attribute.

      Explainable AI

The most famous example of explainable AI is fairness through unawareness. Fairness through unawareness means that no personal attributes are passed into the system, unless these are relevant for the prediction. The system therefore does not have access to the personal attributes, which means it cannot directly discriminate. Fairness through unawareness is often used as the basic model for fairness. However, the systems from both the COMPAS and Amazon examples used fairness through unawareness and still exhibited disparate treatment. The personal attributes that were removed from the data still had an influence on the dataset itself. For instance, a ZIP code can function as a proxy for race, or someone’s gender can influence their writing style.

      Figure 2: Examples of Fairness Through Unawareness (FTU) and fair feature selection on the Adult dataset.

Related to fairness through unawareness is fair feature selection . Instead of removing the personal attributes, only features that are deemed appropriate remain in the dataset. It needs to be noted that a universal agreement on which features are fair to use is unlikely, due to the aforementioned biases of people and cultures. Oftentimes there exists an overlap between the features removed in fairness through unawareness and fair feature selection, as is evident in Figure 2.

Counterfactual fairness is a currently popular type of explainable AI. Counterfactual fairness stems from systems that check for direct discrimination, meaning that simply changing a personal attribute would change a person’s prediction. An example of direct discrimination can be found in Figure 3, where changing the sex would result in a different prediction. From a legal standpoint it is clear that if a model exhibits this behavior, it can be deemed unfair.

      Figure 3: Example of direct discrimination where changing the personal attribute of sex changes the prediction a person would receive.

Models for counterfactual fairness change the personal attributes of a person while other features are also adjusted according to a causal model related to the personal attributes. For example, changing someone’s race might also require changing someone’s ZIP code or the high school they went to. Figure 4 contains an example of creating counterfactuals; that system is unfair as some of the counterfactuals have a different prediction from the original. Satisfying counterfactual fairness can also be achieved by requiring independence between the personal attributes and the prediction itself. A more stringent constraint is to require that the prediction be independent of all proxy features in the dataset.

      Figure 4: Imaginary examples of a system that would not satisfy counterfactual fairness. Changing features in accordance with the personal attributes and data distribution results in a different prediction.

      Group Fairness

Group fairness is a different philosophy regarding the fairness of an AI system. Instead of requiring that the process of the system be fair, it requires the outcome of the model to be fair. This verdict of fairness is based on the equality of a chosen statistical measure between groups. People are divided into these groups based on their personal attributes. Three definitions are most commonly used for group fairness, namely demographic parity, equalized odds, and conditional use accuracy equality.

Demographic parity requires that the selection rate is equal across groups. This means that an equal percentage of people from each group receives a positive prediction. This definition is independent of the ground truth, which means that, for example, a perfect predictor could never satisfy demographic parity if the base rates differ between groups. In other words, from the observed dataset the prediction must appear independent of the personal attributes.

      Figure 5: A representation of demographic parity. Two groups are distinguished one male, one female. The circled individuals are the ones to receive a positive prediction.

A second fairness measure used in group fairness is equalized odds. This fairness measure requires that both the true positive and true negative rates are equal across groups. This means that, given the ground truth, there is an equal chance of giving a positive prediction irrespective of a person’s group. In other words, equalized odds requires that the prediction be independent of the personal attribute given the ground truth. Unlike demographic parity, equalized odds is dependent on the ground truth.

      Figure 6: A representation of predictions which satisfy equalized odds. Two groups are distinguished one male, one female. The circled individuals are the ones to receive a positive prediction. The colors of the individuals indicates the ground truth of the samples. The male groups has a base rate of 0.8 and the female group a base rate of 0.6.

The final common fairness measure in group fairness is conditional use accuracy equality. In order to satisfy conditional use accuracy equality, two statistical properties must be equal between groups, namely the precision and the false omission rate. Put differently, this requires that, given the prediction, there is an equal chance that the prediction is correct regardless of the group a person belongs to. Conditional use accuracy equality is therefore defined similarly to equalized odds; the roles of the prediction and ground truth are simply reversed. This also holds for the independence condition: conditional use accuracy equality requires that the ground truth is independent of the personal attribute given the prediction.

      Figure 7: A representation of predictions which satisfy conditional use accuracy equality. Two groups are distinguished one male, one female. The circled individuals are the ones to receive a positive prediction. The colors of the individuals indicates the ground truth of the samples. The male groups has a base rate of 0.8 and the female group a base rate of 0.6.

      Unifying these philosophies

The previous two sections discussed the different concepts used in explainable AI and group fairness. It is clear that they employ different bases for their philosophies of fairness. However, when looking at these definitions, the concept of independence returns both in counterfactual fairness and in the fairness measures used for group fairness. This shared requirement of independence allows these notions to be unified, showing that they accomplish the same result. Table 1 provides an overview of the fairness measures and the respective independence they require.

In the following section \(Y\) symbolises the perceived label, \(D\) the prediction, \(A\) the personal attributes, \(S\) the selection of a sample in the dataset, \(X^{\bot}_A\) the data independent of the personal attributes, \(X^{\bot}_Y\) the data independent of the label, and \(\tilde{Y}\) the real label.

      Table 1: A summary of the independence requirement of the fairness notions discussed.
| Name | Probability definition | Independence |
| --- | --- | --- |
| Demographic parity | \(P(D=1\vert A=1) = P(D=1\vert A=0)\) | \(D \bot A\) |
| Equalized odds | \(P(D=1 \vert A=1, Y=y) = P(D=1 \vert A=0, Y=y)\) | \(D \bot A \vert Y\) |
| Conditional use accuracy equality | \(P(Y=1 \vert A=1, D=d) = P(Y=1 \vert A=0, D=d)\) | \(Y \bot A \vert D\) |
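To make these definitions concrete, here is a minimal sketch computing the three gaps from binary predictions `D`, labels `Y`, and a protected attribute `A` (all 0/1 numpy arrays); a gap of zero means the corresponding measure is satisfied:

```python
import numpy as np

def group_fairness_gaps(D, Y, A):
    g0, g1 = (A == 0), (A == 1)
    # Demographic parity: difference in selection rates.
    dp = D[g1].mean() - D[g0].mean()
    # Equalized odds: difference in true positive rates given the ground
    # truth (true negative rates would be compared analogously).
    eo = D[g1 & (Y == 1)].mean() - D[g0 & (Y == 1)].mean()
    # Conditional use accuracy equality: difference in precision given a
    # positive prediction (false omission rates compared analogously).
    cuae = Y[g1 & (D == 1)].mean() - Y[g0 & (D == 1)].mean()
    return dp, eo, cuae
```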

      Measurement error - Demographic parity

      Figure 8: A directed acyclic graph showing the relation between the prediction and the data, in the situation of measurement error.

Measurement error is a first type of dependence that can be resolved in order to be counterfactually fair. Measurement error means that there is some bias on the perceived ground truth in the dataset. Consider, for example, a system that determines whether pulling a car over is justified (whether a crime was committed or not). More crimes can be uncovered if a full car search happens; however, a car search is not always undertaken, resulting in a bias of more positive samples for a population where a car search is more likely to happen. In this situation the label is whether or not a crime was detected, not whether a crime was committed. The imbalance in car searches for a group with a certain personal attribute will then have an effect on the label. This influence of the personal attributes on the label, but not on the ground truth, is shown in Figure 8.

A second example of measurement error can be found in healthcare prediction. Predicting someone’s health is abstract, as health is not directly quantifiable. A proxy for health is the cost of the healthcare an individual receives. However, costs are not universal for each group in society. Certain groups can thus have lower costs while dealing with more health problems, due to the care that they receive or perhaps do not receive. This faulty proxy is another example of measurement error.

This system is thus made counterfactually fair if the dependence between the personal attribute and the label is removed: the same independence that is required to satisfy demographic parity.

      Selection on label - Equalized odds

      Figure 9: A directed acyclic graph showing the relation between the prediction and the data, in the situation of selection on label.

Selection on label is a type of bias that arises when not only someone’s label but also their personal attribute affects their inclusion in the dataset. A subtype of this bias is self-selection bias. This means that certain groups of the population are over-represented in certain datasets because they are more likely to interact with the data collection system. An example of this is voluntary studies, where certain groups are more likely to participate than others, leading to a dataset skewed in favor of the participating group. A study around self-selection bias in nutrition trials also found that a person’s ground truth influences their participation in the trial (healthy eaters were more likely to apply for the trial).

The directed acyclic graph in Figure 9 shows how to decouple the label from the personal attribute by introducing the selection bias as the observed variable \(S\). \(A\) and \(X^{\bot}_A\) are only connected through a path that includes \(Y\), which means that given \(Y\), \(A\) and \(X^{\bot}_A\) are independent, which is the condition of equalized odds.

      Selection on predictor - conditional use accuracy equality

      Figure 10: A directed acyclic graph showing the relation between the prediction and the data, in the situation of selection on predictor.

Selection on predictor is similar to selection on label, but instead of the label influencing selection, it is the features themselves that influence selection together with the personal attributes. An example of this can be seen in the student population of engineering degrees. A relevant feature such as what a person studied in high school influences their choice to do engineering. However, there is a large discrepancy in the number of male versus female students who pursue engineering, even though that difference does not exist among those graduating high school with the relevant background. This shows that both relevant features and personal attributes influence presence in a dataset about engineering students.

The directed acyclic graph in Figure 10 for selection on predictor is similar to that for selection on label; the features and label are simply reversed. This is also in accordance with the similarity seen between equalized odds and conditional use accuracy equality. \(A\) and \(Y\) are connected only through \(X^{\bot}_A\), which means that if the prediction is known, which is captured in \(X^{\bot}_A\), then \(A\) and \(Y\) are independent, which is necessary to satisfy conditional use accuracy equality.

      Confirmation with experiments

This relation between counterfactual fairness and group fairness is supported by experiments. These experiments were done on a synthetic version of the Adult dataset. A simulated protected class \(A\) was added with balanced incidence (50/50 odds of belonging to the protected class or not). Belonging to the protected class has a causal effect of \(A\) on \(X\): \(P(race=other) = 0.8\). This means that \(A\) loosely relates to someone’s race being noted as “other”. This dataset serves as the target distribution for the biased datasets.

A counterfactually fair model is achieved by taking the average of an instance's predictions as if it belonged to the protected class and as if it did not. Three biased datasets are created based on the directed acyclic graphs in Figures 8, 9, and 10. Table 2 shows that satisfying counterfactual fairness for a certain type of dataset also satisfies the corresponding fairness measure, confirming the theoretical results above.

      Table 2: The results of applying counterfactual fairness to a model with its performance on different fairness measures.
                           Demographic parity difference   Equalized odds difference   Conditional use accuracy equality
Measurement Error                   -0.0005                        0.0906                         -0.8158
Selection on Label                   0.1321                       -0.0021                          0.2225
Selection on Predictors              0.1428                        0.0789                          0.0040
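To make the averaging construction above concrete, here is a minimal NumPy sketch (our own illustration: model stands for any fitted estimator exposing a predict method, and for brevity we only intervene on the protected attribute itself rather than propagating its causal effect to the other features):

```python
import numpy as np

def counterfactually_fair_predict(model, X, a_col):
    """Average the model's predictions under both counterfactual values of A.

    model: any fitted estimator with a .predict(X) method (hypothetical here).
    X: (n, d) feature matrix; a_col: column index of the protected attribute A.
    A full counterfactual would also propagate the intervention on A to the
    features it causally influences (e.g. the race indicator above).
    """
    X0, X1 = X.copy(), X.copy()
    X0[:, a_col] = 0  # counterfactual: instance is outside the protected class
    X1[:, a_col] = 1  # counterfactual: instance is inside the protected class
    return 0.5 * (model.predict(X0) + model.predict(X1))
```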

      What can we take away?

Procedural and outcome fairness have tended to coexist in research. Each is its own field with its own philosophy, sharing the common goal of creating fairer AI systems. The strength of techniques like counterfactual fairness lies in their explainability, which makes it easier to determine whether a system is fair. Group fairness techniques have many implementations and have been proven powerful, but they are not very interpretable. To determine what is fair, a first abstraction must be made to convert the meaning of fairness into a mathematical fairness measure. Whether the system is deemed fair thus depends on the interpretation of the fairness measure and on the quality of the dataset. If the dataset is not representative, there is no guarantee that the system will produce a fair outcome.

This relation between procedural fairness and outcome fairness opens up research possibilities, perhaps allowing the strength of outcome fairness techniques to be combined with the interpretability of procedural fairness concepts. A future research direction is to investigate whether the techniques used to satisfy fairness measures also satisfy some explainability notions, or what adjustments would be needed.

      For attribution in academic contexts, please cite this work as
PLACEHOLDER FOR ACADEMIC ATTRIBUTION

BibTeX citation

PLACEHOLDER FOR BIBTEX
      \ No newline at end of file diff --git a/blog/hidden-convex-relu/index.html b/blog/hidden-convex-relu/index.html new file mode 100644 index 00000000..ce9d62a0 --- /dev/null +++ b/blog/hidden-convex-relu/index.html @@ -0,0 +1,56 @@ + The Hidden Convex Optimization Landscape of Two-Layer ReLU Networks | ICLR Blogposts 2024

      The Hidden Convex Optimization Landscape of Two-Layer ReLU Networks

In this article, we delve into the research paper titled 'The Hidden Convex Optimization Landscape of Regularized Two-Layer ReLU Networks'. We focus on the significance of this study and evaluate its relevance in the current landscape of machine learning theory. The paper describes how solving a convex problem can directly give the solution to the highly non-convex problem of optimizing a two-layer ReLU network. After giving some intuition on the proof through a few examples, we will observe the limits of this model, as we might not yet be able to throw away the non-convex problem.

      $$ \def\RR{ \mathbb{R} } \newcommand{\dd}{\mathrm{d}} \newcommand{\step}{\gamma} \newcommand{\reg}{\beta} \newcommand{\paramS}{\Theta} \newcommand{\param}{\theta} \newcommand{\dirac}{\delta} \definecolor{cvred}{RGB}{230, 29, 0} \definecolor{cred}{RGB}{230, 159, 0} \definecolor{cblue}{RGB}{51, 102, 253} \definecolor{cgreen}{RGB}{0, 158, 115} \def\czero{ {\color{cred}{0}} } \definecolor{cvblue}{RGB}{86, 180, 233} \def\cone{ {\color{cvblue}{1}} } \def\max{\mathop{\mathrm{max}}} \def\sgn{\mathop{\mathrm{sgn}}} $$

      There exists an equivalent convex formulation to the classical non-convex ReLU two-layer network training. That sounds like great news but is it the case in practice? Let's find out together.

      The code for this plot is available and reproducible on this Jupyter Notebook (or in HTML).

      I. Overview and Motivation

50 years ago, two-layer networks with non-linear activations were known to be universal approximators; however, they did not catch on as they were hard to train. Recent years have been marked by deeper networks running on dedicated hardware with very large datasets. Those networks have since topped the benchmarks in many applications, including self-driving and text generation. The pragmatic method to train such models is to run stochastic gradient descent on the non-convex optimization problem, which concretely means tuning the weights (and biases) until the model is accurate enough. The best models usually require billions of parameters and very large datasets. The training, in turn, requires millions of dollars of hardware and electricity to run gradient descent and train a single model.

Deep learning is not without faults. Even though its test performance can surpass that of many machine learning models, it is very hard to know what the network has learned because of its black-box nature. Interpretability in neural networks is crucial for creating trustworthy AI systems, the lack of which is one of the biggest obstacles to AI adoption. Interpretability may also lead us to simpler models that are cheaper to run, more robust, generalize better, and are easier to adapt to specific tasks.

      To figure out what a neural network learns, we will focus in this post on the training of a shallow ReLU network by vanilla gradient descent, using the full batch of data at each step, in a regression setting. More precisely, we will investigate how the construction of a convex equivalent to the non-convex training problem can enlighten us on how neurons evolve during the training phase, with a specific focus on the activation of the ReLU functions and their consequences.

      Problem and notation

Our problem of interest will be the training of a simple two-layer neural network with ReLU activation. We focus on a classical regression problem with a mean squared error loss, and we add a weight decay term (whose importance will be underlined later). This leads to the following full-batch gradient method (note that we make a slight abuse of notation by denoting by $\nabla$ the gradient with respect to the parameters, obtained, for instance, by backpropagation).

      Because there are only two layers, we will integrate the biases of the neurons directly into the data by adding a dimension filled with ones.

      Two-Layer ReLU Network Training
      Data points: $n$ inputs \(\pmb{x}_j \in \RR^d\) and labels \(y_j \in \RR\), $j=1,..,n$
      Model: $m$ neurons: First layer \(\pmb{w}_i \in \RR^d\), second layer \(\alpha_i \in \RR\), $i=1,..,m$
      Hyper-parameters: step-size \(\step > 0\), regularization \(\lambda\geq 0\)
      Loss to be minimized: \begin{equation}\label{eq:theloss} \mathcal{L}(\pmb{W}, \pmb{\alpha}) = \sum_{j=1}^n \bigg( \underbrace{\sum_{i=1}^m \max(0, \pmb{w}_i^\top \pmb{x}_j) \alpha_i}_{\text{Network's Output}} - y_j \bigg)^2 + \underbrace{\lambda \sum_{i=1}^m \| \pmb{w}_i \|^2_2 + \alpha_i^2}_{\text{Weight Decay}} \end{equation} (Full-batch) Gradient Descent: \begin{equation*} (\pmb{W}, \pmb{\alpha})_{t+1} = (\pmb{W}, \pmb{\alpha})_t - \step \nabla \mathcal{L}((\pmb{W}, \pmb{\alpha})_t) \end{equation*}
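As a concrete reference, here is a minimal NumPy sketch of this loss and of one full-batch gradient step (the helper names and array shapes are our own choices, not the paper's code):

```python
import numpy as np

def loss(W, alpha, X, y, lam):
    # W: (d, m) first layer, alpha: (m,) second layer, X: (n, d), y: (n,)
    out = np.maximum(0, X @ W) @ alpha                # network output, (n,)
    return np.sum((out - y) ** 2) + lam * (np.sum(W ** 2) + np.sum(alpha ** 2))

def gd_step(W, alpha, X, y, lam, step):
    pre = X @ W                                       # pre-activations, (n, m)
    act = np.maximum(0, pre)                          # ReLU outputs
    r = act @ alpha - y                               # residuals, (n,)
    mask = (pre > 0).astype(W.dtype)                  # ReLU derivative
    grad_W = 2 * X.T @ (mask * np.outer(r, alpha)) + 2 * lam * W
    grad_alpha = 2 * act.T @ r + 2 * lam * alpha
    return W - step * grad_W, alpha - step * grad_alpha
```

The bias is handled as in the text: a coordinate equal to one is appended to every data point, so it lives inside W.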

Even the simplest ReLU models have non-trivial non-convexity, as depicted in the figure below. We plot the loss function \(\mathcal{L}\) of a network with two neurons on one-dimensional data. We only optimize the first layer here, so we have a total of two parameters to optimize. Despite the simple setup, a gradient descent starting from a random initialization can converge to three different values, two of them being bigger than zero. However, there always exists a path of non-increasing loss from initialization to the global minimum (as predicted by theory).

Loss landscape of a network with two parameters, one for each ReLU neuron, and two data points: $(x_1, y_1) = (-1, 1)$ and $(x_2, y_2) = (1, 2)$ are fixed. Since all labels are positive, we fix the second layer $\alpha_1, \alpha_2$ to 1 to plot the loss in 2D without loss of generality. The black lines represent the loss for only one neuron (since the other is equal to 0). The red lines (critical points) are paths of parameters for which the loss is constant and the gradient is zero. They represent the parameters for which the neuron fits exactly one data point and is deactivated for the other, thus suffering a loss of $(y_1)^2$ for the red line on the left and $(y_2)^2$ for the other. The exact formula to compute each point of the loss landscape is: \begin{equation*} \begin{split} \mathcal{L}(w_1, w_2) =&\ \left(\max(0, x_1 w_1) + \max(0, x_1 w_2) - y_1\right)^2 \\ +&\ \left(\max(0, x_2 w_1) + \max(0, x_2 w_2) - y_2\right)^2 \end{split} \end{equation*}

To avoid the local minima, one idea is to add constraints to the parameters. The constrained problem where $w_1$ has to be positive and $w_2$ has to be negative is convex, and a simple gradient descent will find the global minimum of the original unconstrained problem. In the paper, the authors find a more general way to build an equivalent convex problem to our ReLU shallow network training problem.

      In this blog post, we will first work out the intuition needed to understand why an equivalent, finite convex problem even exists. Then we will study the exact links between the problem in practice and the convex problem, and go over the limits of such an approach both in theory and in practice.

      Research context

      The question of how neural networks learn is a very active domain of research with many different paths of investigation. Its main goal is to lay a mathematical foundation for deep learning and for that goal, shallow neural networks act as a stepping stone for understanding deeper and more complex networks.

For networks with a hidden layer of infinite width, it is proven that gradient descent converges to one of the global minima under the NTK regime, or by considering them as Wasserstein gradient flows. Studying the NTK amounts to analyzing the first-order Taylor expansion of the network, treating the network as a linear regression over a feature map. This approximation is accurate if the neurons are initialized at a large scale (far from zero), large enough that neurons do not move far from their initialization. This is also called the lazy regime, in contrast with the feature-learning regime, in which neurons align themselves to a finite number of directions. Here we are interested in the feature-learning regime with small initialization, where we can observe genuinely non-convex behavior such as neuron alignment, incremental learning, and saddle-to-saddle dynamics.

Examining the loss landscape reveals that shallow networks with more neurons than data points always have a non-increasing path to a global minimum. This is a favorable property for (stochastic) gradient convergence. In 'The Hidden Convex Optimization Landscape of Regularized Two-Layer ReLU Networks', the authors extend those results by adding weight decay regularization.

Regularization plays a pivotal role, as it lets us influence which local minimum gradient descent will reach, usually to favor a simpler solution. Even if no explicit regularization is used, it is known that gradient descent has an implicit bias, shown for linear activations and more recently for ReLU networks using the convex reformulation.

Other convex approaches are limited to an infinite number of neurons, or to optimization in a neuron-by-neuron fashion, which requires solving many non-convex problems. The setting studied here allows for any number of neurons.

      To sum up, the convex reformulation approach described in this post contrasts with what precedes by presenting results for a shallow network with finite width layers, in a regression setting with ReLU activation and weight decay regularization.

      II. Convex reformulation

      Small example walkthrough

First, let's get familiar with the inherent convexity caused by the ReLU and the second layer. To do so, we will take simple yet non-convex examples and find their global minima using a convex problem.

      One ReLU, no second layer, no regularization

      Below is the loss of a single ReLU neuron (\(w_1 \in \RR\)) trained on two data points: \((x_1, y_1)=(-1, 1)\) and \((x_2, y_2) = (1, 0.5)\)

      \begin{equation}\label{eq:one_neuron_loss} {\color{cvred}{\mathcal{L}}}(w_1) = \big(\max(0, x_1 ~ w_1) - y_1\big)^2+\big(\max(0, x_2 ~ w_1) - y_2\big)^2 \end{equation}

      Because our only trainable parameter is one-dimensional, we can directly plot the entire loss landscape.

\(\color{cvred}{\mathcal{L}}\) is non-convex in a strong sense: two local minima exist and have distinct values (\((y_1)^2\) and \((y_2)^2\)). In practice, gradient descent will never be able to switch from fitting one data point to the other (switching from a positive to a negative weight $w_1$ can only be done by increasing the loss).

      We say that the ReLU neuron can activate one or more data points if the output of its ReLU is non-zero when evaluated on said data. The output of a one-neuron ReLU network is \(\color{cvblue}{\max(0, x ~ w_1)}\), we can plot both the output and the two data points on the same graph.

      Plot of the output of a one-neuron ReLU network with a positive weight $w_1$. The ReLU only activates the second data point (as $x_2>0$ and $w_1 > 0$) so the network can fit the second data point. However, doing so means it cannot activate $x_1$ and will incur a constant loss $(y_1)^2$. Overall, depending on the sign of $w_1$, we will have a loss consisting of a constant term for not activating one example and a quadratic term for matching the label of the activated data point.

Before moving on, the important fact here is that the loss is truly non-convex (the difference between the two local minima, $\vert (y_1)^2 - (y_2)^2 \vert$, can be made arbitrarily large), even without a second layer or regularization. Now we will explore the corresponding convex problems.

      Activation

      We want to find the global minima of the one-neuron ReLU network loss function\eqref{eq:one_neuron_loss}. Recall that the loss has two local minima: $(y_2)^2$ for $w_1=y_1/x_1$ and $(y_1)^2$ for $w_1=y_2/x_2$.

      Which data points are activated plays a crucial role in the loss. In the specific example above, $x_2>0$ is activated and $x_1<0$ is not. If we fix the ReLU’s activation to this pattern and replace the max operators with \(\czero\) or \(\cone\):

      \begin{equation}\label{eq:firsttry} \min_{u_1 \in \RR} (\czero \times x_1 u_1 - y_1)^2+ (\cone \times x_2 u_1 - y_2)^2 \end{equation}

      This problem is convex. A gradient descent from any initialization will converge to the optimal loss $(y_1)^2$ with the parameter $u_1 =y_2/x_2$. This parameter directly corresponds to one of the two local minima of the non-convex loss\eqref{eq:one_neuron_loss} by taking $w_1 = u_1$.

      \begin{equation*} \min_{u_2 \in \RR} (\cone \times x_1 u_2 - y_1)^2+ (\czero \times x_2 u_2 - y_2)^2 \end{equation*}

Similarly, this convex problem's optimal solution directly corresponds to the second local minimum: $(y_2)^2$ for $u_2 = y_1/x_1$.

      All seems good. But keep in mind that we want to build an equivalent problem. If $u_2$ is positive, taking $w_1 = u_2$ does not lead to the same loss value in the original problem because a positive parameter will never activate the first data point.

      To make the issue obvious, consider this convex problem obtained by replacing the two $\max$ operators by \(\cone\):

      \begin{equation*} \min_{u_3 \in \RR} (\cone \times x_1 u_3 - y_1)^2+ (\cone \times x_2 u_3 - y_2)^2 \end{equation*}

While it is convex, there is no link between the ReLU parameter $w_1$ and this new problem's parameter $u_3$: it is not possible to activate both data points. This issue comes from the fact that replacing a $\max$ by \(\cone\) only makes sense if what is inside the $\max$ is indeed positive. In other words, as long as \(x_1 ~ w_1\) is positive, we have \(\max(0, x_1 ~ w_1) = \cone ~ x_1 ~ w_1\).

      \begin{equation*} \min_{\substack{x_1 ~ u_3 \geq 0\\x_2 ~ u_3 \geq 0}} (\cone \times x_1 u_3 - y_1)^2+ (\cone \times x_2 u_3 - y_2)^2 \end{equation*}

We added the constraints corresponding to the activation, and they adequately restrict $u_3$ to \(\{0\}\).

      As a simple reformulation of \eqref{eq:firsttry}, we vectorize (in the number of data points) the convex loss and we add the constraints:

      \begin{equation*} \min_{\substack{\begin{bmatrix}-1 & 0 \\ 0 & 1\end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} u_1 \geq 0}} \ \ \bigg\| \underbrace{\begin{bmatrix} \czero & 0 \\ 0 & \cone \end{bmatrix}}_{\text{diagonal activation matrix}} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} u_1 - \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \bigg\|_2^2 \end{equation*}

The diagonal activation matrix (named \(D_i \in \{0, 1\}^{n \times n}\)) summarizes the on/off behavior of one ReLU for all data points. The constraints on $u_1$ are directly given by this activation matrix:

      \[\begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix} = 2 \begin{bmatrix} \czero & 0 \\ 0 & \cone \end{bmatrix}- I_2 \qquad \text{$I_2$ the identity matrix of $\RR^2$}\]

      The other way around, we can define the activation pattern vector for a specific parameter \(u\): \((\mathbb{1}_{u ~ x_j \geq 0})_{j=1\dots n} \in \{0,1\}^n\) with $n$ the number of data points. The activation matrix of \(u\) is simply the matrix that has this vector for its diagonal.
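In code, the pattern vector and its matrix for a given parameter could be computed as follows (a small helper of our own, not from the paper):

```python
import numpy as np

def activation_matrix(u, X):
    """Diagonal activation matrix of parameter u (shape (d,)) on data X (n, d)."""
    return np.diag((X @ u >= 0).astype(float))
```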

So we have exactly four possible activation matrices. \(D_1 = (\begin{smallmatrix} \czero & 0 \\ 0 & \czero \end{smallmatrix})\) and \(D_2 = (\begin{smallmatrix} \cone & 0 \\ 0 & \cone \end{smallmatrix})\) have constraints that reduce to $w_1 = 0$, making them uninteresting. The other two lead to convex problems with convex constraints. Solving them gives the parameters that correspond to the two local minima of the loss of a ReLU network with a single neuron\eqref{eq:one_neuron_loss}.

      For any number $n$ of 1-D data points, there are $2^n$ distinct activation matrices but only two of them will be interesting: activating all positive data points, or only activating negative data points. Only some $D_i$ are interesting in higher dimensions, but finding all of them is not obvious.

      Replacing everything with the usual matrices (\(X=(\begin{smallmatrix}x_1 \\x_2\end{smallmatrix})\), \(Y=(\begin{smallmatrix}y_1 \\y_2\end{smallmatrix})\)) will get us the equivalent convex problem to a one-neuron ReLU network, whose activation pattern is $D_i$:

      \begin{equation*} \min_{\substack{u_1 \in \RR\\ (2 D_i - I_2) X u_1 \geq 0}} \ \ \big\| D_i X u_1 - Y \big\|_2^2 \end{equation*}

      Later sections will investigate what we can say about a ReLU network with more than one neuron.

      Multiplicative non-convexity from the second layer

      \begin{equation}\label{eq:ncvxlin} \min_{(x, y) \in \RR^2} (x ~ y - 1)^2 \end{equation}

\eqref{eq:ncvxlin} is not convex: it has two local minima. However, they are symmetric. Simply replace the term $x ~ y$ by a new variable $z$, and use a simple mapping such as $z \rightarrow (1, z)$ to get the solution of \eqref{eq:ncvxlin} from the solution of the convex problem \(\min_{z \in \RR} (z-1)^2\).

      The initial problem\eqref{eq:ncvxlin} with L2 regularization is non-convex as well:

      \begin{equation*} \min_{(x, y) \in \RR^2} (x ~ y - 1)^2 + \frac{\lambda}{2} ( \vert x \vert^2 + \vert y \vert^2) \end{equation*}

      The convex reformulation with one variable is:

      \begin{equation*} \min_{z \in \RR} (z - 1)^2 + \lambda \vert z \vert \end{equation*}

We have to use a different mapping: \(z \rightarrow (\sgn(z) \sqrt{\vert z \vert}, \sqrt{\vert z \vert})\). One can verify that plugging this mapping into the non-convex problem gives the same value. Therefore, you can solve the convex problem in lieu of the non-convex one.

Back to non-linear activations, consider the non-convex problem of training a single ReLU neuron with a second layer (\(\alpha_1\)) and an L2 regularization:

      \begin{equation*} \min_{(w_1, \alpha_1) \in \RR^2} \big(\max(0, x_1 w_1) \alpha_1 - y_1\big)^2 + \frac{\lambda}{2} \left(\vert w_1 \vert^2 + \vert \alpha_1 \vert^2\right) \end{equation*}

We fix the activation to only activate $x_1$ (as could be done for any activation pattern) and add the corresponding constraint as done in the previous section:

      \begin{equation}\label{eq:ncvx1} \min_{\substack{(u_1, \alpha_1) \in \RR^2\\ x_1 ~ u_1 \geq 0}} \left( \cone ~ x_1 ~ u_1 ~ \alpha_1 - y_1 \right)^2 + \frac{\lambda}{2} (\vert u_1 \vert^2 + \vert \alpha_1 \vert^2) \end{equation}

\eqref{eq:ncvx1} is a non-convex problem because we are multiplying $u_1$ and $\alpha_1$ together (and by some constant). However, this non-convexity can be removed by considering an equivalent convex function, in a very similar way to the $(x ~ y - 1)^2$ problem.

      \begin{equation}\label{eq:cvx1} \min_{x_1 ~ z_1 \geq 0} \left( \cone ~ x_1 ~ z_1 - y_1 \right)^2 + \lambda \vert z_1 \vert \end{equation}

$z_1$ takes the role of the product $u_1 ~ \alpha_1$. We can solve \eqref{eq:cvx1} to get an optimal $z_1$ and then use the mapping \((w_1, \alpha_1) = (\sgn(z_1) ~ \sqrt{\vert z_1 \vert}, \sqrt{\vert z_1\vert})\). However, the two problems do not have the same expressivity: \(\max(0, x_1 ~ w_1) ~ \alpha_1\) can be negative, but \(\cone ~ x_1 ~ z_1\) cannot because of the constraint. Let's add a second variable with the same constraint as $z_1$ that takes the role of a negative $\alpha_1$.

      \begin{equation}\label{eq:cvx2} \min_{\substack{x_1 ~ z_1 \geq 0\\x_1 ~ v_1 \geq 0}} \big( \cone ~ x_1 ~ (z_1 - v_1) - y_1 \big)^2 + \lambda (\vert z_1 \vert + \vert v_1 \vert) \end{equation}

The variable \(z_1\) represents a neuron with a positive second layer and \(v_1\) a neuron with the same activation pattern but a negative second layer. This is a convex problem (adding a convex regularization preserves convexity) with convex constraints. At the optimum, only one of the two variables will be non-zero. We consider this mapping:

      \begin{align*} (w_1, \alpha_1) &= (\sgn(z_1) ~ \sqrt{\vert z_1 \vert}, \sqrt{\vert z_1 \vert}) & \text{ if $z_1$ is non-zero}\\ (w_1, \alpha_1) &= (\sgn(v_1) ~ \sqrt{\vert v_1 \vert}, - \sqrt{\vert v_1 \vert}) & \text{ if $v_1$ is non-zero} \end{align*}

One can verify that this mapping gives the same value when plugged into \eqref{eq:ncvx1}. The two problems share the same global minima, as we can map back and forth without altering the loss. Since the two problems have the same expressivity, their global minima have the same value, and we can say they are equivalent in the sense that solving one gives the solution of the other through a simple mapping.

      To summarize, here’s the equivalent (with the above mapping) convex problem for a one-neuron ReLU Network with regularization and a second layer, whose activation pattern is $D_i$:

\begin{equation*} \min_{\substack{(2 D_i - I_2) X u_1 \geq 0\\ (2 D_i - I_2) X v_1 \geq 0}} \ \ \big\| D_i ~ X (u_1 - v_1) - Y \big\|_2^2 + \lambda (\vert u_1 \vert + \vert v_1 \vert) \end{equation*}

      Equivalent Convex problem with two neurons

      Before moving on to the general results, we want to fit two data points, i.e. having both data points activated. To do so, we need at least two neurons. The usual non-convex problem is as follows (with \(X=(\begin{smallmatrix}x_1 \\x_2\end{smallmatrix})\), \(Y=(\begin{smallmatrix}y_1 \\y_2\end{smallmatrix})\) and $m=2$):

      \begin{equation*} \min_{w_i, \alpha_i \in \RR, i=1 \dots m} \bigg\| \sum_{i=1}^m \max(0, X w_i) \alpha_i - y \bigg\|^2_2 + \lambda \sum_{i=1}^m w_i ^2 + \alpha_i^2. \end{equation*}

      This loss is plotted (with $\lambda = 0$ and fixed second layer) in the introduction section. The convex reformulation is very similar.

      \begin{equation*} \min_{\substack{(2 D_i - I_2) X u_i \geq 0\\ (2 D_i - I_2) X v_i \geq 0}, i=1 \dots m} \ \ \bigg\| \sum_{i=1}^m D_i ~ X (u_i - v_i) - Y \bigg\|_2^2 + \lambda \sum_{i=1}^m \vert u_i \vert +\vert v_i \vert \end{equation*}

The best choice (only obvious in this 1-D case) of activation matrices would be \(D_1 = (\begin{smallmatrix} \czero & 0 \\ 0 & \cone \end{smallmatrix})\) and \(D_2 = (\begin{smallmatrix} \cone & 0 \\ 0 & \czero \end{smallmatrix})\).
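This small convex problem can be handed directly to a generic solver. Here is a sketch with cvxpy (our own variable names; the data values are those of the introduction's plot, and the solver choice follows the ECOS solver mentioned later in this post):

```python
import cvxpy as cp
import numpy as np

# Two 1-D data points and their labels, as in the introduction's plot.
X = np.array([[-1.0], [1.0]])
Y = np.array([1.0, 2.0])
lam, n = 0.01, 2

# The two interesting activation matrices in this 1-D case.
patterns = [np.diag([0.0, 1.0]), np.diag([1.0, 0.0])]

u = [cp.Variable(1) for _ in patterns]  # neurons with a positive second layer
v = [cp.Variable(1) for _ in patterns]  # neurons with a negative second layer

output = sum(D @ X @ (ui - vi) for D, ui, vi in zip(patterns, u, v))
reg = sum(cp.norm(ui) + cp.norm(vi) for ui, vi in zip(u, v))

constraints = []
for D, ui, vi in zip(patterns, u, v):
    C = 2 * D - np.eye(n)  # the constraint matrix enforcing the pattern
    constraints += [C @ X @ ui >= 0, C @ X @ vi >= 0]

prob = cp.Problem(cp.Minimize(cp.sum_squares(output - Y) + lam * reg), constraints)
prob.solve(solver=cp.ECOS)
print(prob.value, [ui.value for ui in u])  # u_1 ≈ 2 (fits x_2), u_2 ≈ -1 (fits x_1)
```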

      Solving and mapping the solutions would give the optimal global solution to the problem of fitting two data points with a ReLU network with two neurons. More insights about why this is true are given after the general case section, and the complete proof can be found in the paper.

      General Case

      Let us consider a general two-layer ReLU network with an input of dimension $d$, an output of dimension 1 (vector output requires a similar but parallel construction) and a hidden layer of size $m$. With $n$ data points, the full regularized loss is

      \begin{equation*} \mathcal{L}(\pmb{W}, \pmb{\alpha}) = \bigg\| \sum_{i=1}^m \max(0, \pmb{X} \pmb{w}_i) \alpha_i - \pmb{y} \bigg\|^2_2 + \lambda \sum_{i=1}^m \| \pmb{w}_i \|^2_2 + \alpha_i^2 \end{equation*}

      This is the same loss as presented at the beginning of the article\eqref{eq:theloss} but with matrix and vectors. \(\pmb{X} \in \RR^{n \times d}\) is the data matrix and \(\pmb{y} \in \RR^n\) are the labels. Each neuron has its first layer parameter \(\pmb{w}_i \in \RR^d\) and second layer \(\alpha_i \in \RR\).

      By analogy with what we saw earlier, an equivalent convex problem can be found. Multiplications are replaced by scalar products in the definition of activation matrices and thus most insights about activation hold.

      \begin{equation}\label{eq:thecvx} \min_{\pmb{U}, \pmb{V} \in \mathcal{K}} \bigg\| \sum_{i=1}^m \pmb{D}_i \pmb{X} (\pmb{u}_i - \pmb{v}_i) - \pmb{y} \bigg\|^2_2 + \lambda \sum_{i=1}^m \| \pmb{u}_i \|_2 + \| \pmb{v}_i \|_2 \end{equation}

\(\pmb{D}_i\) are the activation matrices. The set of constraints \(\mathcal{K}\) is the concatenation of the constraints of all neurons. Each constraint can be written succinctly: \((2 \pmb{D}_i - \pmb{I}_n) \pmb{X} \pmb{u}_i \geq 0\). If \(\pmb{u}_i\) respects the constraint, its activation pattern is exactly \(\pmb{D}_i\), and this is crucial to retrieve the optimal solution of the non-convex loss\eqref{eq:theloss} from the solution of the convex reformulation\eqref{eq:thecvx}.

A conceptually easy way to have the two problems share the same global loss is to consider a ReLU network with \(2^n\) neurons, and to formulate the convex problem using all \(2^n\) distinct activation matrices \(D_i\). In that case, it is easy to see that they both have the same expressivity. In the paper, it is proved that in theory only \(n\) neurons and activation patterns are required (using Carathéodory's theorem), but the patterns are not given explicitly. The next section gives more insight into when the two problems are equivalent.

      From a solution of the convex problem\eqref{eq:thecvx}, the convex neurons \(u_i\) can be mapped to the non-convex neurons \((w_i, \alpha_i)\) using this mapping:

      \begin{align*} (w_i, \alpha_i) &= (\frac{u_i}{\sqrt{\| u_i \|_2}}, \sqrt{\| u_i \|_2}) & \text{ if $u_i$ is non-zero}\\ (w_i, \alpha_i) &= (\frac{v_i}{\sqrt{\| v_i \|_2}}, -\sqrt{\| v_i \|_2}) & \text{ if $v_i$ is non-zero} \end{align*}

We use the same mapping as in the 1D case, except the direction of the neuron (\(u_i\)) is now a vector in \(\RR^d\).

      This is a very simple mapping from convex solution to non-convex neurons. We will call convex neurons the set of parameters that correspond to a neuron in the original, non-convex problem. One can expect similar trajectories between the non-convex and convex neurons during gradient descent.
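In code, this mapping might look like the following sketch (names are ours):

```python
import numpy as np

def convex_to_nonconvex(U, V, eps=1e-12):
    """Map convex neurons (rows of U, V, shape (m, d)) to (W, alpha).

    At the optimum at most one of u_i, v_i is non-zero for each i; v_i
    corresponds to a neuron with a negative second layer.
    """
    W, alpha = np.zeros_like(U), np.zeros(U.shape[0])
    for i, (ui, vi) in enumerate(zip(U, V)):
        if np.linalg.norm(ui) > eps:
            s = np.sqrt(np.linalg.norm(ui))
            W[i], alpha[i] = ui / s, s
        elif np.linalg.norm(vi) > eps:
            s = np.sqrt(np.linalg.norm(vi))
            W[i], alpha[i] = vi / s, -s
    return W, alpha
```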

      Here, we fixed the number of neurons and the corresponding activations. A few questions are left unanswered: how many different activation patterns need to be considered, and how many neurons should we consider for both convex and non-convex problems?

      Specifics about equivalence

      Two problems are considered equivalent when their global optima can be seamlessly mapped back and forth.

As seen before, there are only two interesting activation patterns in the one-dimensional case (a single neuron can either activate all the positive data points and none of the negative ones, or the opposite), but there are close to \(2^n\) interesting patterns when the data dimension is higher. An activation pattern is interesting if there exists a non-zero vector that respects the constraints and thus realizes the pattern.

The (unique) optimal loss of the convex problem \eqref{eq:thecvx} with all possible activation patterns (for fixed data) \(D_i\) is the best loss any non-convex network can reach. The following sections are dedicated to understanding why adding more neurons than there are activation patterns will not improve the loss.

      However, if we only consider a subset of all patterns, the convex problem will in general correspond to a local optimum of the non-convex network. Indeed, it is not as expressive as before. This would either correspond to a non-convex network with not enough neurons, or with too many neurons concentrated in the same regions.

      To explore this idea, we go back to one-dimensional data.

      1-D EXAMPLE, ONE NEURON

      In the non-convex problem with only one neuron, there are exactly two local minima.

Plots of the output of a one-neuron ReLU network, one for each of the parameter's two local minima. The parameter on the left can be formulated as a solution of a convex problem with one convex neuron using the activation matrix \((\begin{smallmatrix} \czero & 0 \\ 0 & \cone\end{smallmatrix})\), and \((\begin{smallmatrix} \cone & 0 \\ 0 & \czero \end{smallmatrix})\) for the output on the right.

As seen in the previous section, each local minimum can be found exactly by solving the convex problem with a subset of all possible activations, one pattern for the left output and one for the right. Here we cannot say that the convex problem (which considers only one pattern) is equivalent to the non-convex one, because the global minimum of the non-convex problem cannot be achieved by the convex one. However, once we reach a local minimum in the non-convex gradient descent, it can be described as a convex problem by considering one pattern or the other.

      1-D EXAMPLE, TWO NEURONS

      The non-convex problem initialized with two random neurons and optimized with gradient descent will have three possible local minima (if there is some regularization, otherwise there's an infinite number of them). Either we initialize a neuron for each activation and it will reach the global optima (left), or two of them will end up in the same pattern (right), activating the same data point.

      In the case of two neurons, the convex equivalent problem is as follows:

      \begin{equation*} \mathcal{L}(u_1, u_2)= \bigg\| \begin{bmatrix} \czero & 0 \\ 0 & \cone \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} u_1 + \begin{bmatrix} \cone & 0 \\ 0 & \czero \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} u_2 - \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \bigg\|_2^2 + \lambda (| u_1 | + | u_2 |) \end{equation*}

is equivalent to the non-convex problem, i.e., solving it gives the global optimum of the non-convex objective (the negative neurons $v_i$ are zero at the optimum and are omitted here for clarity).

      1-D EXAMPLE, MANY NEURONS

      Plotting the positive part of many ReLU neurons. Summed up, they form a network output that perfectly fits the data.

We draw one example of a typical local minimum for gradient descent in the specific case of having more neurons than existing patterns. In practice (with more data in higher dimensions), there are far fewer neurons than possible activations. However, there are many situations in which neurons end up in the same activation pattern, and in the experiment section we will see how to force such dynamics.

      Note that we can merge neurons that are in the same activation pattern by summing them up (even in higher dimensions), creating a new neuron, and keeping both the output and the loss unchanged (although regularization might decrease). The fact that having more than one neuron in one pattern does not decrease the loss is at the core of the proof.

      Activation patterns

The equivalence proof is heavily based on ReLU, specifically on the fact that a ReLU unit divides the input space into two regions: one where it outputs zero, and one where it is the identity. If you consider a finite set of samples and a single ReLU, it will activate some samples and deactivate others: this is called an activation pattern. A diagonal matrix \(\pmb{D}_i \in \{0,1\}^{n \times n}\) describes one activation pattern, but not all are possible for a given dataset. There is a finite number of such possible patterns, exponential in the dimension of the data.
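Since enumerating every pattern is intractable in general, a simple heuristic (our own sketch; it is not guaranteed to find every pattern) is to sample random directions and keep the distinct patterns they realize:

```python
import numpy as np

def sample_activation_patterns(X, num_samples=10_000, seed=0):
    """Collect distinct ReLU activation patterns on X (n, d) by sampling.

    Each random direction g realizes the pattern 1[X @ g >= 0]; rows of the
    returned array are the diagonals of candidate activation matrices D_i.
    """
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((num_samples, X.shape[1]))
    return np.unique((X @ G.T >= 0).T.astype(np.int8), axis=0)
```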

This section is important to understand the final animations in the experimental section, and it helps explain how activation patterns evolve in the non-convex problem.

      Two-Dimensional Data

      In the previous part, we considered data to be one-dimensional which resulted in only two possible activation patterns. Let us consider two-dimensional data. To do so in the simplest way possible, we will consider regular one-dimensional data and a dimension filled with \(1\)s. This will effectively give the neural network a bias to use without modifying the formulas.

We consider two data points: \(\color{cvred}{\pmb{x}_1} = (-0.2, 1)\) and \(\color{cvred}{\pmb{x}_2} = (1, 1)\), each associated with their label \(y_1 = 0.5\) and \(y_2 = 1\). We plot the output of one ReLU unit initialized at \(\pmb{w}_1 = (0.3, -0.15)\), \(\alpha_1 = 1\). Therefore we have

      \begin{align*} \max(0, \pmb{w}_1^\top \pmb{x}_1) &= 0 \\ \max(0, \pmb{w}_1^\top \pmb{x}_2) &= \pmb{w}_1^\top \pmb{x}_2 \end{align*}

      The activation pattern of \(\pmb{w}_1\) is \(\pmb{D}_1=\left(\begin{smallmatrix} \czero & 0 \\ 0 & \cone \end{smallmatrix}\right)\). There are only three other possible activation patterns, activating both data points: \(\pmb{D}_2=\left(\begin{smallmatrix} 1 & 0 \\ 0 & 1 \end{smallmatrix}\right)\), activating only the first one with \(\pmb{D}_3=\left(\begin{smallmatrix} 1 & 0 \\ 0 & 0 \end{smallmatrix}\right)\) and activating no data point with a zero matrix.

One point of interest is the input for which the ReLU outputs 0; this is where the output changes its slope: \(a_1 = -w_1^2/w_1^1\), where \(w_1^i\) is the $i$-th coordinate of \(\pmb{w}_1\). Here, \(a_1 = 0.5\). We call this the activation point of the neuron \(\pmb{w}_1\).
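With the bias folded into the second coordinate, the activation point is a one-liner (our own helper):

```python
def activation_point(w):
    # Input x at which w[0] * x + w[1] = 0, i.e. where the ReLU switches on/off.
    return -w[1] / w[0]
```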

      We plot the output, \(\color{cvblue}{\max(0, (x, 1) ~ \pmb{w}_1^\top)}\), of the network as a function of the first dimension of the data \(x^1\) (here simply written \(x\)):

A neuron initialized so that it activates only one data point: its activation point lies between the two samples, and its slope tells us whether it activates to the left or to the right, as in this case.

      Illustration.

      In the animation below, we train this network using vanilla gradient descent on the two data points \(\color{cvred}{\pmb{x}_1}\) and \(\color{cvred}{\pmb{x}_2}\), represented by the red crosses. We plot its \(\color{cblue}{\text{output}}\) in blue for every possible data point (omitting the second dimension as it is always 1 in this example, playing the role of the bias), and we plot in red the label associated with the two data points. Each frame corresponds to one step of full-batch gradient descent with a small learning rate. We mark the \(\color{cgreen}{\text{activation point}}\) of the neuron with a green triangle, pointing toward the side the neuron activates. The green triangle’s height is the slope of the ReLU’s output, equal to \(u_1^1 = w_1^1 \alpha_1\), allowing us to visualize how important one neuron is for the output of the network.

      Training a single neuron network with gradient descent until it exactly fits two data points. It starts by fitting the only point it activates, \(\color{cvred}{\pmb{x}_2}\). As training progresses, the activation point represented by a green triangle shifts position. As soon as the activation point reaches \(\color{cvred}{\pmb{x}_1}\), it activates it and starts fitting both points at the same time. Its activation pattern shifts from \(\left(\begin{smallmatrix} \czero & 0 \\ 0 & \cone \end{smallmatrix}\right)\) to \(\left(\begin{smallmatrix} \cone & 0 \\ 0 & \cone \end{smallmatrix}\right)\) and stays the same until convergence.

      Adding more neurons will not create additional activation patterns, only adding more data points will. With only two data points \(\pmb{x}_1\) and \(\pmb{x}_2\), we only had 4 possible patterns, with four data points we have 10 possible patterns.

We plot in blue the individual outputs and activation points of the ReLU neurons associated with the ten interesting activation patterns. Those are the 10 (20 with negative ones) neurons that need to be considered to get the global optimum using the convex equivalent. When moving the activation point \(a_i\) of a neuron between two data points, its activation pattern does not change.

      Notice that it is not possible to only activate the data points in the middle. However, if we increase the data's dimension, this becomes possible. This is also possible with a second layer of ReLU. In higher dimensions, we cannot visualize the activation patterns as easily, but we can understand that as dimensionality increases, more patterns are possible as it is easier to separate different data points.

      Extensions of the convex reformulation to other settings

Batch Normalization (BN) is a key process that adjusts a batch of data to have a mean of zero and a standard deviation of one, using two trainable parameters. In the convex equivalent, we replace \(\pmb{D}_i \pmb{X}\) with \(\pmb{U}_i\), the first matrix in the singular value decomposition (SVD) \(\pmb{D}_i \pmb{X} = \pmb{U}_i \pmb{\Sigma}_i \pmb{V}_i^\top\). If the output is a vector rather than a scalar, the regularization becomes a nuclear norm in the convex equivalent. Three-layer networks also have a convex equivalent that uses all possible combinations of two activation matrices. Moreover, parallel networks are also linked to a convex problem. Lastly, in Wasserstein Generative Adversarial Network (WGAN) problems, the adversarial games played by two-layer discriminators are identified as instances of convex-concave games.
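For instance, the BN replacement amounts to one thin SVD per pattern (a sketch under the notation above):

```python
import numpy as np

def bn_factor(D, X):
    # U_i from the thin SVD D_i X = U_i Sigma_i V_i^T; it replaces D_i X
    # in the convex program when batch normalization is used.
    U, _, _ = np.linalg.svd(D @ X, full_matrices=False)
    return U
```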

      III. Can We Forget the Non-Convex Problem?

      Solving the convex problem efficiently is hard

      In the last ten years, deep neural networks have been trained using (stochastic) gradient descent on the non-convex problem. The algorithm, the implementation, and even the hardware running the training have been heavily optimized, supported, and pushed by industrial and scientific applications. Such networks were practically abandoned for years after being discovered because there did not exist an efficient way to train them. Nowadays, it takes a few lines to train a network on dedicated hardware and this might make us forget how much engineering has made this possible. This should be kept in mind when comparing a new approach to the problem.

Training a network via the non-convex problem can be time-consuming, as it requires tuning hyperparameters and rollbacks (retrieving a previous state) to get out of a bad minimum. In comparison, the convex approach deals with far fewer parameters and has only one global minimum.

      In complexity terms, the convex reformulation with all possible activation patterns $D_i$ gives an algorithm in polynomial time for all parameters except for the rank of the data matrix. In practice and with usual datasets, the rank is high and there will be too many patterns to consider them all.

      There has been some work focused on solving the convex problem quickly. The first idea is to take a random subset of activation patterns and use standard convex solvers. Current convex solvers (ECOS, …) are not tailored to problems with many constraints. There is some hope in considering the unconstrained version of the problem to build an approximation. In most deep learning scenarios, it is hard to be faster, or even start to compete against a simple gradient descent running on GPUs.

Dataset     Convex   Adam   SGD    Adagrad
MNIST       97.6     98.0   97.2   97.5
CIFAR-10    56.4     50.1   54.3   54.2

Test accuracy on popular datasets for a network with a single hidden layer of 5000 neurons.

Time to solve problems from the UCI datasets with Adam on the non-convex problem and with a custom solver (using the augmented Lagrangian method). The code for the paper's experiments is available on GitHub, as well as the convex problem toolkit.

      For relatively small datasets and networks, convex solvers are fast and do not require any tuning to get convergence. Adjusting the regularization will directly reduce the amount of neurons needed.

A convex equivalent of deeper networks exists, but it exacerbates the existing problems. The only way to make it tractable is to optimize layer by layer. This is still a work in progress and needs further improvements to be competitive.

Activation patterns are not constant in the non-convex problem

Let's set aside the performance concerns and use the reformulation as a new point of view for observation. Our non-convex problem is equivalent to a convex, well-specified optimization problem with constraints. The global optima might be the same, but training the network with gradient descent almost always leads to a local minimum. Because there are too many activation patterns to consider them all, the convex problem also only finds a local minimum of the full problem. However, it is not clear whether the two approaches find the same kind of local minima.

      Activation patterns can and will change during gradient descent in the non-convex problem. In some cases, this pattern shifting is useful because the new activation patterns may lead to a better minimizer. To verify this, we monitor the number of unique activation patterns used by the network at each step of a gradient descent. If two neurons have the same activation pattern (i.e. they activate and deactivate the same data points), we would count them as one.
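Counting unique patterns is straightforward once the activation map is computed (a sketch with our own names):

```python
import numpy as np

def count_unique_patterns(W, X):
    """Number of distinct activation patterns among the m neurons.

    W: (d, m) first-layer weights; X: (n, d) data. Two neurons count once
    if they activate exactly the same subset of data points.
    """
    patterns = (X @ W > 0).T                # (m, n) boolean activation map
    return np.unique(patterns, axis=0).shape[0]
```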

Training a network with 100 random data points in 10 dimensions. The network only has 20 randomly initialized neurons, and the labels depend linearly on the input. Each neuron has a unique activation pattern, as can be seen on the graph. This is expected in this setting because there are so many possible activation patterns (close to $10^{25}$; the number of activation patterns equals the number of regions in the partition of the space by hyperplanes perpendicular to the rows of $X$ and passing through the origin, which is bounded by \(2 r \left(\frac{e ~ (n-1)}{r}\right)^r\) with $r$ the rank of $X$). However, as training progresses, neurons align themselves to the same patterns. After 300 steps, the 20 neurons only share 5 unique activation patterns.

      However, we can show an aspect that sets both formulations apart. The convex problem has fixed activation patterns. If the activations are missing important data, the convex solution will not be optimal. Meanwhile, in the non-convex problem, the gradient descent keeps shifting from pattern to pattern until it converges.

      Illustration.

      We will further study this setting with 100 data points and 20 neurons in high dimensions. To compare how the two methods deal with activation patterns, we will use the activation pattern of the neurons of the non-convex problem to construct a convex problem and solve it. To be more explicit, for each non-convex neuron \(\pmb{w}_i\), we find its activation pattern and add a \(\pmb{u}_i\) constrained to this pattern to the convex problem. In the end, we have a convex problem with 20 neurons that will activate the same data points as the non-convex neurons.

      We train the non-convex network using gradient descent, and at each step, we construct a convex problem, solve it, and compare its global minimum to our current non-convex loss. This convex problem fully describes the local minimum we would find if the non-convex problem was constrained to never change its activation patterns.

Training a 20-neuron network with gradient descent while using the same activation patterns to solve the convex equivalent. At each step, we plot the current loss of the non-convex network and the optimal loss of the corresponding convex problem. At initialization (first point on the graph), the non-convex loss is 1; taking the current activation patterns, building the convex problem and solving it yields an optimal loss of $0.1$. At the next step, the non-convex loss has decreased and the activation patterns have changed, so we find a different optimal loss for the convex problem. The optimal convex loss at step 0 is quickly beaten by gradient descent (at around step 175): the initial activation patterns were far from optimal, and gradient descent continually improves them. We use cvxpy to define the problem and solve it using ECOS.

In general, we cannot predict which patterns will be used by the neurons found by gradient descent, or which patterns are the best. Thus we cannot hope that the convex problem will give us this insight, as it requires knowing the activation patterns in advance. We can, however, predict what (some of) the optimal solutions will look like: a spline interpolation of the training samples.

      In the next section, we focus on cases where the non-convex minima can be accurately described by convex problems.

      On large initialization scale

The initialization scale of the network is the absolute size of the neurons' parameters. To change the scale, we can simply multiply every parameter by a scalar. The initialization of neurons is a large topic in machine learning, as it has a large influence on the quality of the local minimum reached. By default in popular libraries, He initialization is used: it draws neurons from a normal distribution centered on 0 with variance \(1/m\), where \(m\) is the number of neurons. However, the literature offers a large choice of alternatives.

      We say we are on a large scale when neurons do not move far from their initial value during descent. This typically happens when using large initial values for the parameters of each neuron.

The theory states that you can push the scale high enough that neurons will not change their activation patterns at all. If this holds, the convex reformulation describes exactly the minimum that gradient descent will reach. However, it is not possible to observe this in practice, as the loss becomes very small and the training process is too slow to carry on to the end. The NTK briefly mentioned in the introduction operates in this setting, using the fact that the network is very close to its linear approximation. On a similar note, reducing the step size for the first layer guarantees convergence.

      Illustration.

      Using an animation, we plot every step of a gradient descent in the non-convex problem until the loss is small enough. As mentioned before, the training is too slow to continue until we reach a real local minimum described by the convex problem here. We plot the output of the network, which is the sum of all the neurons. We want to focus on the activation point of each neuron.

Training a network of 1000 neurons with large initial values using gradient descent. The output of the network is in blue, and the four data points (red crosses) represent linear data. Each green triangle represents one neuron, with its activation point horizontally and its norm vertically. The orientation of the triangle reveals on which side the neuron activates the data. At initialization, the distribution of the activation points is uniform. The movement of the activation points is minimal; only a few neurons among the thousand change their patterns.

Here, computing the convex optimum gives us a single neuron fitting the linear data. While the non-convex problem has converged to a very low loss, their outputs are completely different.

A side effect of the large initialization is catastrophic overfitting, i.e., very large variations of the output between data points, which negatively impacts the test loss.

      On very small initialization

      At the other extreme, the small-scale setting effectively lets neurons align themselves before ever decreasing the loss. In theory, if you push the scale down enough, neurons will converge to a finite set of directions before trying to fit the objective.

Training a network of 1000 neurons with very small initial values using gradient descent. The output of the network is in blue, and the four data points (red crosses) represent linear data. Each green triangle represents one neuron, with its activation point horizontally and its norm vertically. The orientation of the triangle reveals on which side the neuron activates the data. At initialization, the distribution of the activation points is uniform. However, as training progresses, most neurons that activate toward the right converge to $-1.3$. Once the norm of the neurons activating at $-1.3$ is large enough, the loss decreases and we quickly reach convergence.

      Taking a look at the loss on the same problem, we can identify the two distinct regimes: alignment and fitting (then convergence).

Plot of the loss during gradient descent in the same setting as the animation above. In the first half, only the directions of the neurons are changing (i.e., their activation patterns); the network starts fitting the four data points once the parameters are large enough.

      If you take orthogonal data and a small scale, the behavior is very predictable even in a regression setting.

Unless mentioned otherwise, all experiments were run using full-batch vanilla gradient descent. In experiments, adding momentum or using the Adam optimizer is clearly easier, on top of converging faster. However, the behavior is much less predictable.

      Conclusion

The main takeaway is that the best network for a given dataset can be found exactly by solving a convex problem. Additionally, the convex problem can describe every local minimum found by gradient descent in the non-convex setting. However, finding the global optimum is impossible in practice, and approximations are still costly in precision. While there is no evident link between feature learning in the non-convex problem and the convex reformulation, many settings allow for a direct equivalence, giving access to the whole convex toolkit for proofs.

      The performance side of the convex reformulation will benefit from dedicated software as has been the case for gradient descent in deep networks. Only then will it offer a no-tuning alternative to costly stochastic gradient descent. In smaller settings, it already allows us to quickly find all the possible local minima that are so important in machine learning.

Despite advancements in understanding the optimization landscape of neural networks, a significant gap persists between theory and practical challenges, notably because of early stopping. In real-world scenarios, networks often cease learning before reaching a local minimum; this has a direct impact (in large-scale initialization), but results on this remain limited.

      Acknowledgements

      This work is partly funded by the ANR JCJC project ANR-21-CE23-0022-01.

      For attribution in academic contexts, please cite this work as
PLACEHOLDER FOR ACADEMIC ATTRIBUTION

BibTeX citation

PLACEHOLDER FOR BIBTEX
      \ No newline at end of file diff --git a/blog/index.html b/blog/index.html index 87e3234f..f8e7afe7 100644 --- a/blog/index.html +++ b/blog/index.html @@ -1,103 +1 @@ ---- -layout: default -title: blog -nav: true -nav_order: 9 -permalink: /blog -pagination: - enabled: true - collection: posts - permalink: /page/:num/ - per_page: 12 - sort_field: title - sort_reverse: false - trail: - before: 1 # The number of links before the current page - after: 3 # The number of links after the current page ---- - -
[Removed Jekyll/Liquid template for the blog index: it rendered {{ site.blog_name }} and {{ site.blog_description }}, an optional tag list built from site.display_tags, a paginated loop over paginator.posts with an estimated read time (word count divided by 180, plus 1) and redirect/external-source handling, and {% include pagination.html %}.]
      + blog | ICLR Blogposts 2024

      blogposts

      Blog Posts

      • A New Alchemy: Language Model Development as a Subfield?

This blog post makes the case that the body of research on language models has become sufficiently large and mature that we can start thinking about “language model development” as a new subfield. To support this claim, we sketch out the focuses and methodologies of this new subfield. In addition, we provide some personal reflections on what to do when your field of study gives birth to a new one.

      • Behavioral Differences in Mode-Switching Exploration for Reinforcement Learning

        In 2022, researchers from Google DeepMind presented an initial study on mode-switching exploration, by which an agent separates its exploitation and exploration actions more coarsely throughout an episode by intermittently and significantly changing its behavior policy. We supplement their work in this blog post by showcasing some observed behavioral differences between mode-switching and monolithic exploration on the Atari suite and presenting illustrative examples of its benefits. This work aids practitioners and researchers by providing practical guidance and eliciting future research directions in mode-switching exploration.

      • Bridging the Data Processing Inequality and Function-Space Variational Inference

        This blog post explores the interplay between the Data Processing Inequality (DPI), a cornerstone concept in information theory, and Function-Space Variational Inference (FSVI) within the context of Bayesian deep learning. The DPI governs the transformation and flow of information through stochastic processes, and its unique connection to FSVI is employed to highlight FSVI's focus on Bayesian predictive posteriors over parameter space. Throughout the post, theoretical concepts are intertwined with intuitive explanations and mathematical rigor, offering a comprehensive understanding of these complex topics. The post concludes by bringing together various ideas to explain why the choice of predictive priors (initial probability distributions assumed for model predictions before training) is important for training machine learning models and preventing overfitting. It also discusses the practical implications of these concepts in areas such as continual learning and knowledge distillation. By examining these concepts in depth, the post provides valuable insights for both theory and practice in machine learning, making it an informative resource for researchers and practitioners.

      • Building Diffusion Model's theory from ground up

  Diffusion Models, a new generative model family, have taken the world by storm after the seminal paper by Ho et al. [2020]. While diffusion models are often described as probabilistic Markov chains, their underlying principle is based on the decades-old theory of Stochastic Differential Equations (SDE), as later shown by Song et al. [2021]. In this article, we will go back and revisit the 'fundamental ingredients' behind the SDE formulation and show how the idea can be 'shaped' to get to the modern form of Score-based Diffusion Models. We'll start from the very definition of the 'score', how it was used in the context of generative modeling, how we achieve the necessary theoretical guarantees, and how the critical design choices were made to finally arrive at the more 'principled' framework of Score-based Diffusion. Throughout this article, we provide several intuitive illustrations for ease of understanding.

      • Deep Equilibrium Models For Algorithmic Reasoning

        In this blogpost we discuss the idea of teaching neural networks to reach fixed points when reasoning. Specifically, on the algorithmic reasoning benchmark CLRS the current neural networks are told the number of reasoning steps they need, which they shouldn't be given. While a quick fix is to add a termination network that predicts when to stop, a much more salient inductive bias is that the neural network shouldn't change its answer any further once the answer is correct, i.e. it should reach a fixed point. This is supported by denotational semantics, which tells us that while loops that terminate are the minimum fixed points of a function. We implement this idea with the help of deep equilibrium models and discuss several hurdles one encounters along the way. We show on several algorithms from the CLRS benchmark the partial success of this approach and the difficulty in making it work robustly across all algorithms.

      • Double Descent Demystified

        Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle

      • Elaborating on the Value of Flow Matching for Density Estimation

  The transfer of matching-based training from Diffusion Models to Normalizing Flows makes it possible to fit expressive continuous normalizing flows efficiently and therefore enables their use for different kinds of density estimation tasks. One particularly interesting task is Simulation-Based Inference, where Flow Matching has enabled several improvements. This post focuses on the discussion of Flow Matching for Continuous Normalizing Flows. To highlight the relevance and practicality of the method, its use and advantages for Simulation-Based Inference are elaborated.

      • Exploring Meta-learned Curiosity Algorithms

  This blog post delves into Alet et al.'s ICLR 2020 paper, Meta-learning curiosity algorithms, which introduces a unique approach to meta-learning curiosity algorithms. Instead of meta-learning neural network weights, the focus is on meta-learning pieces of code, making them interpretable by humans. The post explores the two meta-learned algorithms, namely Fast Action Space Transition (FAST) and Cycle-Consistency Intrinsic Motivation (CCIM).

      • Fair Model-Based Reinforcement Learning Comparisons with Explicit and Consistent Update Frequency

        Implicit update frequencies can introduce ambiguity in the interpretation of model-based reinforcement learning benchmarks, obscuring the real objective of the evaluation. While the update frequency can sometimes be optimized to improve performance, real-world applications often impose constraints, allowing updates only between deployments on the actual system. This blog post emphasizes the need for evaluations using consistent update frequencies across different algorithms to provide researchers and practitioners with clearer comparisons under realistic constraints.

      • Fairness in AI: two philosophies or just one?

  The topic of fairness in AI has garnered more attention over the last year, recently with the arrival of the EU's AI Act. The goal of achieving fairness in AI is often pursued in one of two ways, namely through counterfactual fairness or through group fairness. These research strands originate from two vastly differing ideologies. However, with the use of causal graphs, it is possible to show that they are related and even that satisfying a group fairness measure means satisfying counterfactual fairness.

      • How to compute Hessian-vector products?

        The product between the Hessian of a function and a vector, the Hessian-vector product (HVP), is a fundamental quantity to study the variation of a function. It is ubiquitous in traditional optimization and machine learning. However, the computation of HVPs is often considered prohibitive in the context of deep learning, driving practitioners to use proxy quantities to evaluate the loss geometry. Standard automatic differentiation theory predicts that the computational complexity of an HVP is of the same order of magnitude as the complexity of computing a gradient. The goal of this blog post is to provide a practical counterpart to this theoretical result, showing that modern automatic differentiation frameworks, JAX and PyTorch, allow for efficient computation of these HVPs in standard deep learning cost functions.

      • It's Time to Move On: Primacy Bias and Why It Helps to Forget

        'The Primacy Bias in Deep Reinforcement Learning' demonstrates how the first experiences of a deep learning model can cause catastrophic memorization and how this can be prevented. In this post we describe primacy bias, summarize the authors' key findings, and present a simple environment to experiment with primacy bias.

      \ No newline at end of file diff --git a/blog/language-model-development-as-a-new-subfield/index.html b/blog/language-model-development-as-a-new-subfield/index.html new file mode 100644 index 00000000..53640248 --- /dev/null +++ b/blog/language-model-development-as-a-new-subfield/index.html @@ -0,0 +1,36 @@ + A New Alchemy: Language Model Development as a Subfield? | ICLR Blogposts 2024

      A New Alchemy: Language Model Development as a Subfield?

This blog post makes the case that the body of research on language models has become sufficiently large and mature that we can start thinking about “language model development” as a new subfield. To support this claim, we sketch out the focuses and methodologies of this new subfield. In addition, we provide some personal reflections on what to do when your field of study gives birth to a new one.

      Historically, language models have served as an important component of many learning systems – for example, to improve the transcriptions generated by a speech recognition system. However, the impact and usage of language models has grown dramatically over the past few years. Arguably, this growth is simply thanks to the fact that language models have gotten better, i.e. more accurate at predicting some text based on some context. Since most text-based tasks can be cast as predicting a response to a request (e.g. “summarize the following article”, “write me a Python function that queries Wikipedia”, etc.), recent large language models (LLMs) have proven somewhat effective at performing an incredibly wide range of tasks. Improvements in the language understanding and generation capabilities of LLMs have also led to their adoption in many larger systems (e.g. robots, image processing/generation, etc.), where they increasingly enable natural language to be used as an interface. These advances have led to a huge amount of research into building and using language models. I think this body of research has become sufficiently large and mature that we can start thinking about “language model development” as a new subfield. The goal of this blog post is to sketch out the focuses and methodologies of the subfield of language model development as well as to provide some personal reflections on what to do when your field of study gives birth to a new one.

      Some history

      As a subfield, language modeling has many sibling and parent fields, including information theory, artificial intelligence, natural language processing, and machine learning. In my biased opinion, many recent advances in language modeling have stemmed from advances in deep learning. When thinking about fields like deep learning, I think it can be valuable to define what the assumptions and major problems of the field are. For deep learning, I would roughly say that the assumptions are:

      1. We should end-to-end optimize everything.
      2. Training a bigger model on a bigger dataset should yield improved performance, but we should also strive to develop efficient and performant model architectures.
      3. If we can bake structure into our model (e.g. convolutions for images), things work better…
      4. but what we really want is a system that can learn everything from data and relies on as few hard-coded assumptions as possible.
      5. We care less about theoretical guarantees and more about how well something works in practice.

      Notably, the assumptions of a field are not necessarily scientifically or philosophically motivated - they can be cultural or arise from extraneous factors (e.g. the availability of GPUs). The major problems of the field of deep learning might be:

      1. How can we design neural network architectures that work well for a given problem, or better yet, across a wide variety of problems?
      2. Similarly, what objective works best?
      3. How should we optimize that objective?
      4. How can we ensure all of the above can be scaled up effectively?

      Arguably, one of the biggest successes of recent deep learning research is a powerful recipe for training effective models on a wide variety of problems, namely, the Transformer trained with some variant of Adam. While the objective used can vary across problem settings, in text-based problems a simple language modeling objective works well (and, as discussed above, encapsulates pretty much any text-based task). An important aspect of this Transformer recipe is its scalability, i.e. the ability to attain predictable gains from scaling up training compute and/or dataset size.

      Language model development

I think the scalability of the Transformer has ushered in a new era of research that is distinct from deep learning research. For the first time, we can (to a significant degree) stop worrying about what model architecture to use, how to train the model, what objective to use, whether we’ll continue to get returns from scaling, etc. Instead, this new line of research primarily aims to study the development of language models in order to expand and understand their capabilities. In addition, the fact that recent LLMs are reasonably competent at a huge range of tasks has led to major differences in terms of how we use LLMs (when compared to e.g. how we built and used neural networks in the context of deep learning). For lack of a better term, I’ll refer to this new (sub)field as “language model development”, which might have the following assumptions:

      1. We can assume that the model architecture, optimizer, and objective are basically fixed.
      2. We hope or expect that a given LLM can be induced to perform basically any task out-of-the-box without performing any additional training (i.e. updating its parameters), and in general we should avoid updating parameters to specialize a model to a given task (i.e. task-specific fine-tuning).
      3. The computational cost of getting a model to perform a task is mostly irrelevant, or at least, these costs will be resolved by something else (e.g. better/more hardware).
      4. If we invest more compute in training an LLM, it will produce better results.

      Arguably, some of these assumptions could be considered consequences of the fact that many state-of-the-art language models are only available through black-box APIs. The major problems of language model development are something like:

      1. How can we get the model to do what we want (i.e. “prompt engineering”)?
      2. How can we make the model run as efficiently as possible?
      3. To the extent that we are going to update a model, how can we update it so that it is better at following instructions and less likely to generate harmful content (i.e. alignment)?
      4. More broadly, if we are really hoping the model can do anything, how do we prevent it from doing things we don’t want it to?
      5. How can we integrate language models into other systems (i.e. tool use, multimodality, etc.)?

      Let me give a few additional examples of papers and techniques that I think aim to attack these problems under the aforementioned assumptions.

• An early technique for “getting an LLM to do what we want” (goal #1) is few-shot in-context learning (ICL), where a few examples of the desired input/output behavior are provided in the model’s input before the model is asked to process an unseen example. Few-shot ICL avoids updating the model’s parameters (assumption #1) and mostly ignores the fact that it significantly increases computational costs (assumption #3). A related and more recent variant of ICL is “chain-of-thought prompting”, which adds reasoning steps to the in-context examples in hopes of improving performance by inducing the model to generate similar reasoning steps before generating its prediction. The fact that including reasoning steps further increases computational costs is, again, mostly ignored (assumption #3). (A minimal sketch of ICL prompt assembly appears after this list.)
      • Techniques like FlashAttention and Speculative Decoding aim to make the model run more efficiently (goal #2) without changing the model or its outputs whatsoever (assumption #1). More broadly, techniques like the Heavy-Hitter Oracle or quantization aim to reduce memory or computational costs with minimal performance degradation. The pursuit of these techniques, along with orthogonal hardware advances like NVIDIA’s Transformer Engine, arguably supports the apparent disregard for increases in computational cost that arise from using a larger model (assumption #3).
      • While there certainly has been some effort to improve over the Transformer architecture or the optimizer used to train LLMs (in violation of assumption #1), the vast majority of these improvements have not been widely adopted, either due to inertia (i.e., enforcement of assumption #1) or the apparent fact that they do not always transfer across applications.
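To ground the ICL technique referenced above, here is a minimal sketch of few-shot prompt assembly. The sentiment task, the demonstrations, and the formatting are invented for illustration; any text-completion model could consume the resulting prompt.

```python
# Minimal sketch of few-shot in-context learning (ICL) prompt assembly.
# The task, demonstrations, and formatting are hypothetical.

DEMONSTRATIONS = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I want my two hours back.", "negative"),
]

def build_icl_prompt(query: str) -> str:
    # Each demonstration shows the desired input/output behavior in-context.
    parts = [f"Review: {text}\nSentiment: {label}" for text, label in DEMONSTRATIONS]
    parts.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(parts)

# No parameters are updated (assumption #1), but every demonstration is
# re-processed on each request, inflating compute (assumption #3).
print(build_icl_prompt("Solid acting, shaky plot."))
```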

      Separately, a sign of the maturity of a new subfield is the development of teaching materials. I think my friend Sasha Rush is leading the charge here, with e.g. GPTWorld for learning prompting, LLM training puzzles for learning about distributed training, and Transformer puzzles for understanding how Transformers might work. Another sign is the establishment of a conference on the subject, and we have one of those now too.

      A New Alchemy

      LLMs have ushered in a paradigm shift in the path toward imbuing computers with human-like capabilities. This paradigm shift is being felt in various fields, including deep learning (where the work of designing new architectures or optimizers is increasingly less relevant), natural language processing (where we now have a recipe that works reasonably well across subproblems that previously demanded custom methodologies), and beyond.

      I started my PhD in 2012 during a similar paradigm shift from what I’d call “statistical machine learning” to deep learning. Unlike deep learning, statistical ML prioritized theoretical guarantees (e.g. convexity of the objective function and/or convergence under certain conditions). These guarantees arguably limited model expressivity, which arguably necessitated things like feature engineering that deep learning strove to avoid. While deep learning by no means “solved” the problems of statistical ML (just as language model development does not “solve” deep learning), it nevertheless presented a paradigm that made dramatic progress on the target problems of statistical ML and unlocked new applications. Such empirical successes of deep learning – which almost entirely eschewed theoretical guarantees – led to a great deal of hand-wringing on the part of the statistical ML crowd.

      As my research increasingly made use of deep learning, I started to find myself at the receiving end of this hand-wringing. For example, during my first-ever oral presentation at a conference, I was presenting work that made use of convolutional neural networks. During questions, an audience member expressed distaste at my use of “convoluted” neural networks and suggested that something simpler would have worked better (of course I had tried simpler models and they worked significantly worse, but let’s put that aside for the moment). This kind of despair was common at the time - people were applying deep neural networks in settings where they may or may not have been overkill, simply because it was the zeitgeist. At another conference I attended during my PhD, I happened to share a hostel room with a computer vision researcher who went on a long rant about the atrocity of deep learning (sometimes I wonder what this researcher is working on now). I think this sentiment is most elegantly laid out in Ali Rahimi’s NeurIPS 2017 test-of-time award acceptance speech, where he argues that deep learning is like alchemy - trial-and-error that yields some effective techniques but lacks rigor. Ali’s speech had a big impact on me and others but arguably didn’t really stop people from continuing to develop and apply deep learning without worrying about rigor and in settings where simpler methods would have sufficed (simply because using a big fancy neural network was sexier).

      These experiences led me to promise myself that when my field of study gave birth to another, I wouldn’t dig my feet in and resist, I’d follow the tide of progress. Now that this is (arguably) happening I’m finding it more difficult than I had anticipated. As much as I wish it wasn’t true, I cringe a little whenever I see a new LLM technique that ignores a dramatic increase in computational cost and bends over backwards to avoid updating the model’s parameters, or an application of an LLM where something dramatically cheaper would suffice, or a paper studying the behaviors of an LLM as if it’s a black box (or studying an LLM API, in which case it actually is somewhat of a black box), and on and on. And try as I might, I can’t resist trying to stem the tide – for example, the T-Few paper aimed to convince everyone that few-shot ICL was absurdly computationally inefficient and that fine-tuning specialized models is cheaper and better. Of course, people are still using few-shot ICL and are still avoiding task-specific fine-tuning at all costs, because that’s the zeitgeist – and I think this isn’t totally wrong, because in tandem there’s a huge amount of synergistic work on making LLMs more efficient and effective. But, to be honest, it still feels a little wrong, and I’m not sure if I’ll be able to shake that feeling.

      So, what’s the best course of action when you used to be with it, but then they changed what “it” was? I think there were many ML researchers who successfully rode the tide from statistical ML to deep learning – they willingly embraced the new field while bringing their knowledge and sense of rigor to their deep learning research. In other words, they used their past knowledge to provide a broader and deeper perspective that newcomers may have lacked. An especially prominent product of this kind of research is arguably the Variational Autoencoder (VAE), which connected ideas from variational inference to the autoencoder neural network architecture. VAEs are still an important component of state-of-the-art diffusion-based generative models. Hopefully, those of us who were working on deep learning and NLP before the LLM era can bring a similar perspective (and avoid digging our feet in too much).

      For attribution in academic contexts, please cite this work as
      +        PLACEHOLDER FOR ACADEMIC ATTRIBUTION
      +  
      BibTeX citation
      +        PLACEHOLDER FOR BIBTEX
      +  
      \ No newline at end of file diff --git a/blog/mode-switching/index.html b/blog/mode-switching/index.html new file mode 100644 index 00000000..1fba46aa --- /dev/null +++ b/blog/mode-switching/index.html @@ -0,0 +1,36 @@ + Behavioral Differences in Mode-Switching Exploration for Reinforcement Learning | ICLR Blogposts 2024

      Behavioral Differences in Mode-Switching Exploration for Reinforcement Learning

      In 2022, researchers from Google DeepMind presented an initial study on mode-switching exploration, by which an agent separates its exploitation and exploration actions more coarsely throughout an episode by intermittently and significantly changing its behavior policy. We supplement their work in this blog post by showcasing some observed behavioral differences between mode-switching and monolithic exploration on the Atari suite and presenting illustrative examples of its benefits. This work aids practitioners and researchers by providing practical guidance and eliciting future research directions in mode-switching exploration.

      1. Introduction

      Imagine learning to ride a bicycle for the first time. This task requires the investigation of numerous actions such as steering the handlebars to change direction, shifting weight to maintain balance, and applying pedaling power to move forward. To achieve any satisfaction, a complex sequence of these actions must be taken for a substantial amount of time. However, a dilemma emerges: many other tasks such as eating, sleeping, and working may result in more immediate satisfaction (e.g. lowered hunger, better rest, bigger paycheck), which may tempt the learner to favor other tasks. Furthermore, if enough satisfaction is not quickly achieved, the learner may even abandon the task of learning to ride a bicycle altogether.

      One frivolous strategy (Figure 1, Option 1) to overcome this dilemma is to interleave a few random actions on the bicycle throughout the remaining tasks of the day. This strategy neglects the sequential nature of bicycle riding and will achieve satisfaction very slowly, if at all. Furthermore, this strategy may interrupt and reduce the satisfaction of the other daily tasks. The more intuitive strategy (Figure 1, Option 2) is to dedicate significant portions of the day to explore the possible actions of bicycle riding. The benefits of this approach include testing the sequential relationships between actions, isolating different facets of the task for quick mastery, and providing an explicit cutoff point to shift focus and accomplish other daily tasks. Also – let’s face it – who wants to wake up in the middle of the night to turn the bicycle handlebar twice before going back to bed?

Figure 1: Illustrative difference between monolithic and mode-switching behavior policies.

The above example illustrates the main ideas of the paper When Should Agents Explore?, published by researchers from Google DeepMind at ICLR 2022, which is the central piece of literature discussed throughout this blog post. The first strategy presented in the preceding paragraph is known as a monolithic behavior policy that interleaves exploration actions (e.g. learning to ride a bicycle) among the more frequent exploitation actions (e.g. work, sleep) in a reinforcement learning (RL) environment. In contrast, the second strategy presented above is a mode-switching behavior policy, as it more coarsely separates exploration and exploitation actions by switching between disparate behavior modes throughout an episode. Mode-switching policies subsume monolithic policies at the cost of increased complexity through introducing a new question: when to switch. Similar aspects of mode-switching for diverse exploration have been observed in the exploratory behavior of humans and animals, which served as a notable motivation for the initial mode-switching study.

This introduction section continues with a brief discussion of topics related to mode-switching behavior policies, ranging from different temporal granularities to algorithms in the literature that exhibit mode-switching behavior. We emphasize practical understanding rather than attempting to present an exhaustive classification or survey of the subject. Afterwards, we discuss our motivation and rationale for this blog post: the authors of the initial mode-switching study showed that training with mode-switching behavior policies surpassed the performance of training with monolithic behavior policies on hard-exploration Atari games; we augment their work by presenting observed differences between mode-switching and monolithic behavior policies through supplementary experiments on the Atari benchmark and other illustrative environments. Possible avenues for applications and future investigations are emphasized throughout the discussion of each experiment. It is assumed that the interested reader has basic knowledge of RL techniques and challenges before proceeding to the rest of this blog post.

      Mode-Switching Distinctions

      Mode-switching behavior policies (which we will sometimes shorten to switching policies, and likewise to monolithic policies) were explicitly introduced in the initial mode-switching study, and we will now focus on briefly contrasting switching policies against monolithic policies and the previous exploration literature. Figure 2 illustrates the high-level, pivotal difference between switching and monolithic policies: at the beginning of each time step, the agent may use all of its available information to determine its behavior mode for the current time step and then output a corresponding behavior policy to determine the action. A key distinction is that switching policies can drastically change between time steps since the modes can be tailored to a variety of different purposes (e.g. exploration, exploitation, mastery, novelty). As the graphic illustrates, switching is such a general addition to an algorithm that it was not exhaustively characterized in the initial study.

      Figure 2: Introduction of mode-switching behavior to standard agent-environment RL interaction.

A mode period is defined as a sequence of time steps in a single mode. At the finest granularity, step-level periods only last one step in length; the primary example is $\epsilon$-greedy exploration because its behavior policy switches between explore and exploit mode at the level of one time step. At the other extreme, experiment-level periods encompass the entire training duration, possibly to be used in offline RL (ORL) algorithms. A finer granularity is episode-level, in which a single behavior policy is chosen for one entire episode at a time, such as when diversifying the stochasticity of a policy throughout training. The switching policies analyzed in this blog post produce intra-episodic periods at a granularity between step-level periods and episode-level periods. Intra-episodic periods generally occur at least a few times during an episode and last for more than a few time steps. The practice and study of interpolating between extremes has occurred in areas such as $n$-step returns and colored noise with notable success, making the study of intra-episodic mode periods even more enticing.

The question investigated by the initial mode-switching study is when to switch. This blog post and the initial study only perform experiments with two possible modes, exploration and exploitation, so the question of when to switch reduces to the question of when to explore. Other questions regarding exploration include how much to explore, which concerns the proportion of exploration actions taken over the entire course of training. This problem encompasses the annealing of exploration hyperparameters, including $\epsilon$ from $\epsilon$-greedy policies and the entropy bonus $\beta$ from softmax policies. Another related question is how to explore, which includes strategies such as exploring randomly, optimistically, and intrinsically. These two questions are separate from the question of when to explore, as they usually consider a smooth change in the behavior policy after each time step; switching policies incorporate a much more rigid change in the behavior policy, meriting a separate analysis.
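As a concrete reference point for the step-level granularity and the annealing discussed above, here is a minimal sketch of a monolithic $\epsilon$-greedy policy with a linear schedule. The schedule constants mirror the experimental setup described later in this post, and `q_values` is a hypothetical stand-in for the agent's action-value estimates.

```python
import random

def epsilon(step, start=1.0, end=0.1, anneal_steps=250_000):
    # Linear annealing of the exploration rate: the "how much to explore" knob.
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)

def act(obs, step, q_values, n_actions):
    if random.random() < epsilon(step):
        return random.randrange(n_actions)  # explore: a one-step mode period
    qs = q_values(obs)
    return max(range(n_actions), key=lambda a: qs[a])  # exploit: greedy action
```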

      Mode-Switching Basics

The preceding subsection narrowed our focus to determining when to explore using intra-episodic mode periods. At the time of publication of the initial mode-switching study, the previous literature contained a few works that had incorporated basic aspects of intra-episodic mode-switching exploration. For example, Go-Explore is a resetting algorithm that explores randomly after resetting to previously-encountered promising states at the beginning of an episode. However, this algorithm implements only one switch from resetting to exploration over the course of an episode. Temporally-extended $\epsilon$-greedy exploration generalizes $\epsilon$-greedy exploration by sampling, from a distribution, the number of time steps for which an exploration action should be repeated. This method of switching is intra-episodic, but it only allows repetition of an action during explore mode. The initial mode-switching study extends the above and other work along many dimensions and may soon be viewed as the seminal work on mode-switching behavior policies; we discuss the most fundamental facets of mode-switching architectures below.

      The starting mode is the mode of the algorithm on the first time step, usually exploit mode. The set of behavior modes (e.g. explore and exploit) must contain at least two modes, and the set of behaviors induced by all modes should be fairly diverse. The switching trigger is the mechanism that prompts the agent to switch modes and is perhaps the most interesting consideration of switching policies. An informed trigger incorporates aspects of the state, action, and reward signals; it is actuated after crossing a prespecified threshold such as the difference between the expected and realized reward. A blind trigger acts independently of these signals; for example, it can be actuated after a certain number of time steps has elapsed or actuated randomly at each time step with a prespecified probability. A bandit meta-controller may be employed to choose the switching hyperparameters (e.g. termination probability, mode length, informed threshold) at the beginning of each episode to maximize episodic return and prevent additional hyperparameter tuning. Finally, homeostasis can be added when using trigger thresholds (e.g. for informed triggers), which adapts the switching threshold to a target rate across the course of training, again for ease of hyperparameter tuning. Note that these dimensions are so richly diverse that we end the associated discussion to maintain any notion of brevity, and we summarize these facets of mode-switching in Table 1.

Mode-Switching Facet | Description
Starting Mode | Mode during the first time step at episode start
Behavior Mode Set | Set of modes with a diverse set of associated behavior policies
Trigger | Mechanism that informs the agent when to switch modes
Bandit Meta-Controller | Adapts switching hyperparameters to maximize episodic return
Homeostasis | Adapts the switching threshold to achieve a target rate
Table 1: Various facets of mode-switching policies.

      Blog Post Motivation

      The initial mode-switching study performed experiments solely on 7 hard-exploration Atari games. The focus of the study was to show the increase in score on these games when using switching policies versus monolithic policies. One area of future work pointed out by the reviewers is to increase the understanding of these less-studied policies. For example, the meta review of the paper stated that an illustrative task may help provide intuition of the method. The first reviewer noted how the paper could be greatly improved through demonstrating specific benefits of the method on certain tasks. The second reviewer stated how discussing observed differences on the different domains may be useful. The third reviewer mentioned how the paper could be strengthened by developing guidelines for practical use. The last reviewer stated that it would be helpful to more thoroughly compare switching policies to monolithic policies for the sake of highlighting their superiority.

We extend the initial mode-switching study and progress towards further understanding of these methods in this blog post through additional experiments. The following experiments each discuss an observed behavioral difference in switching policies versus monolithic policies. We focus on behavioral differences in this work, as they are observable in the environment and are not unique to the architecture of certain agents. Our experiments are performed on 10 commonly-used Atari games, and we also provide another illustrative task or chart for each experiment to further enhance understanding. One highlight of this work is showcasing how switching policies not only influence exploration but also significantly influence exploitation. Our work serves as a first step in empirically delineating the differences between switching policies and monolithic policies for the use of practitioners and researchers alike.

      2. Experiments

      This section begins with a discussion on the experimental setup before delving into five experiments that highlight observational differences in switching and monolithic behavior policies. The complete details of the agent and environments can be found in the accompanying GitHub repository.

      • The experimental testbed is comprised of 10 commonly-used Atari games: Asterix, Breakout, Space Invaders, Seaquest, Q*Bert, Beam Rider, Enduro, MsPacman, Bowling, and River Raid. Environments follow the standard Atari protocols of incorporating sticky actions and only providing a terminal signal when all lives are lost.
• A Stable-Baselines3 DQN policy is trained on each game for 25 epochs of 100K time steps each, totaling 2.5M time steps or 10M frames due to frame skipping. The DQN policy takes an exploration action on 10% of time steps, after the exploration rate is linearly annealed from 100% across the first 250K time steps.
• A switching policy and monolithic policy were evaluated on the testbed using the greedy actions of the trained DQN policy when taking exploitation actions. Evaluations were made for 100 episodes for each game and epoch. The monolithic policy was $\epsilon$-greedy with a 10% exploration rate. The switching policy we chose to examine incorporates blind switching; we leave an analogous investigation of informed switching policies to future work (see the initial study for background and experiments using informed switching policies). The policy begins in exploit mode and randomly switches to uniform random explore mode 0.7% of the time. It randomly chooses an explore mode length from the set $\{5, 10, 15, 20, 25\}$ with probabilities $\{0.05, 0.20, 0.50, 0.20, 0.05\}$. During experimentation, we determined that this switching policy took exploration actions at almost the same rate as the monolithic policy (10%). (A minimal sketch of this switching policy follows this list.)
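The following is a minimal sketch of this blind switching policy under stated assumptions: a Gymnasium-style environment interface, and hypothetical helpers `greedy_action` (the trained DQN's greedy action) and `random_action` (uniform over the action space).

```python
import random

SWITCH_PROB = 0.007            # chance of entering explore mode on an exploit step
LENGTHS = [5, 10, 15, 20, 25]  # candidate explore-mode durations
LENGTH_PROBS = [0.05, 0.20, 0.50, 0.20, 0.05]

def run_episode(env, greedy_action, random_action):
    obs, _ = env.reset()
    explore_steps_left = 0  # start in exploit mode
    done = False
    while not done:
        if explore_steps_left == 0 and random.random() < SWITCH_PROB:
            # Blind trigger: switch to explore mode for a sampled duration.
            explore_steps_left = random.choices(LENGTHS, LENGTH_PROBS)[0]
        if explore_steps_left > 0:
            action = random_action()
            explore_steps_left -= 1
        else:
            action = greedy_action(obs)
        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
```

With a 0.7% switch probability and a mean explore-mode length of 15 steps, roughly 0.007 × 15 ≈ 0.105 explore steps occur per exploit step, i.e. about 10% of all steps, consistent with the matched exploration rates noted above.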

      We briefly cite difficulties and possible confounding factors in our experimental design to aid other researchers during future studies on this topic.

      • The DQN policy was trained using a monolithic policy, and unsurprisingly, monolithic policies had slightly higher evaluation scores. Additional studies may use exploitation actions from a policy trained with switching behavior for comparison.
      • Many of our experiments aim to evaluate the effect of exploration or exploitation actions on some aspect of agent behavior. Due to delayed gratification in RL, the credit assignment problem persists and confounds the association of actions to behaviors. To attempt to mitigate some confounding factors of this problem, we weight the behavior score of the agent at an arbitrary time step by the proportion of exploration or exploitation actions in a small window of past time steps; for example, in the first experiment, we weight the effect of taking exploration actions on yielding terminal states by calculating the proportion of exploration actions within 10 time steps of reaching the terminal state. Then, we average the proportions across 100 evaluation episodes to compute a final score for a single epoch for a single game.
      • Lastly, we only claim to have made observations about the behavioral differences, and we do not claim to have produced statistically significant results; we leave this analysis to future work.

      Concentrated Terminal States

      Exploration actions are generally considered to be suboptimal and are incorporated to learn about the state space rather than accrue the most return. Many environments contain regions of the state space that simply do not need more exploration, such as critical states that require directed behavior for meaningful progress. For instance, a self-driving car needing to merge onto a highway is in a critical state, as it has few behaviors that will keep it driving correctly. In these critical states, suboptimal action choices may cause the agent to reach a terminal state more quickly than desired. We investigate if terminal states are more concentrated after an exploration period of a switching policy due to the many exploration actions taken in succession.

Our first experiment attempts to analyze the relationship between taking many exploration actions in succession and reaching a terminal state. Each terminal state is given a score equal to the proportion of exploration actions during the past 10 time steps (see the second paragraph of the Experiments section for the rationale); a sketch of this scoring follows below. Final scores for each behavior policy and epoch are computed by averaging the scores of each terminal state across all 100 evaluation episodes and each game. The results are shown in Figure 3. Switching policies produced terminal states that more closely followed exploration actions. Furthermore, the effect became more pronounced as the policies improved, most likely because the growing gap in quality between exploitation and exploration actions is more detrimental to switching policies, which explore multiple times in succession. Note how the scores for monolithic policies are near 0.10 on average, which is the expected proportion of exploration actions per episode and therefore suggests that exploration actions had little effect. These results demonstrate that switching policies may be able to concentrate terminal states to specific areas of an agent’s trajectory.
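A minimal sketch of this windowed scoring, assuming episodes are logged as per-step 'explore'/'exploit' labels (a hypothetical format):

```python
WINDOW = 10  # steps considered "recent" when crediting actions for an outcome

def terminal_explore_score(modes):
    # Proportion of exploration actions within WINDOW steps of the terminal state.
    recent = modes[-WINDOW:]
    return sum(m == "explore" for m in recent) / len(recent)

def epoch_score(episodes):
    # Average terminal-state score over the evaluation episodes (100 per game).
    return sum(terminal_explore_score(ep) for ep in episodes) / len(episodes)
```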

      Figure 3 (Left): Terminal states are more concentrated after switching exploration periods. Figure 4 (Right): Switching policies perform better on cliffwalk environments.

      We showcase a quick illustrative example of the ability of switching policies to concentrate terminal states more uniformly in a cliffwalk environment (Figure 4). The agent starts at the black circle in the middle column and top row of a 101$\times$11 grid and attempts to reach the white ‘x’ at the bottom. All states aside from those in the middle column are terminal, and the heatmaps show the visitation frequency per episode of all non-terminal states across 10K episodes. When the exploitation policy is to move only downward and the behavior policies are the usual policies in these experiments, the agent incorporating a switching policy more heavily concentrates the terminal states in exploration mode and visits states further down the cliffwalk environment at a higher rate per episode.

Environments that incorporate checkpoint states that agents must traverse to make substantial progress may benefit from switching policies that concentrate exploration periods away from the checkpoints. For example, the game of Montezuma’s Revenge sometimes requires that the agent retrieve a key before advancing through a door, and the agent may achieve faster learning by concentrating exploration actions away from states near the key after that action is learned. One notable and emerging area of RL research that may benefit from concentrating terminal states is safe RL. In safe RL, certain safety constraints are required during the learning and deployment process. In some situations, the safety constraints are closely aligned with terminal states (e.g. aerospace), and concentrating exploration actions away from terminal states may aid in achieving those safety constraints.

      Early Exploration

Monolithic policies uniformly take exploration actions throughout an episode, and as a result, the exploration steps are less concentrated than those of switching policies. While the expected number of exploration steps may be the same per episode in monolithic policies, certain situations may require more concentrated exploration during the beginning of episodes. For example, the build orders in StarCraft II significantly influence the possible future strategies, making exploration crucial throughout the beginning time steps. Early suboptimal actions have also been manually implemented to achieve certain effects: passive actions are taken in Atari games to prevent memorization of trajectories, and 30 random actions were taken at the beginning of Go games when training the AlphaGo engine to force agents to encounter more diverse data. We investigate the flexibility of switching policies to concentrate exploration actions in the beginning of episodes.

We perform an experiment to determine how quickly a policy takes a prespecified number of exploration actions. Specifically, we compute the average number of time steps it takes for a policy to take at least $x$ total exploration actions across its 10 fastest of 100 episodes, and we repeat this process for $x \in \{1, 2, 3, \ldots, 20\}$; a sketch of this computation follows below. We compare the 10 fastest episodes because we are only interested in gauging whether switching behavior can flexibly achieve this specific facet of exploration (beginning exploration) in a small percentage of episodes, not in every episode. Note that this experiment did not need to utilize the Atari signals, so we only used data from the last epoch. Results were again averaged over each game and are shown in Figure 5. It is clear that some episodes contain many more exploration actions concentrated in the beginning few time steps with switching policies. This makes sense intuitively, as only one switch needs to occur early in an episode with a switching policy for many exploration actions to be taken immediately afterwards. The difference increases roughly linearly for greater numbers of required exploration actions and shows that switching natively produces more episodes with exploration concentrated in the beginning.
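A sketch of this measurement, again assuming hypothetical per-step 'explore'/'exploit' logs; the handling of episodes that never reach $x$ exploration actions is our assumption, as the procedure above does not spell it out:

```python
import numpy as np

def mean_time_to_x_explores(episodes, x, top_k=10):
    # First time step by which at least x exploration actions have occurred,
    # averaged over the top_k fastest episodes; episodes that never reach x
    # exploration actions are skipped (an assumption).
    times = []
    for modes in episodes:
        count = 0
        for t, m in enumerate(modes, start=1):
            count += (m == "explore")
            if count >= x:
                times.append(t)
                break
    return float(np.mean(sorted(times)[:top_k]))

# Repeated for x = 1, ..., 20 per policy, then averaged across games (Figure 5).
```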

      Figure 5 (Left): Switching policies can explore more frequently earlier during the episode. Figure 6 (Right): Switching policies have better exploration near the start state on downwalk environments.

      We illustrate beginning exploration with a downwalk environment in which an agent attempts to first move to the middle column and then down the middle column to the white ‘x’ (Figure 6). The agent starts in the second row in the middle column at the white circle, and visitation frequencies across 1K episodes are shown for all states aside from those between the white circle and the white ‘x’, inclusive. We chose to analyze this environment because it is a crude approximation of the trajectory of agents that have learned a single policy and immediately move away from the initial start state at the beginning of an episode. The switching and monolithic policies are the same as before, and switching produces much higher visitation counts at states further from the obvious exploitation trajectory.

Environments that may benefit from flexible early exploration are sparse reward environments that provide a single nonzero reward at the terminal state. Many game environments fall into this category, since a terminal reward of 1 can be provided for a win, -1 for a loss, and 0 for a draw. In such environments, agents usually need to learn at states near the sparse reward region before learning at states further away, also known as cascading. After learning near the sparse reward region, the agent may need to reconsider earlier actions, and switching policies natively allow for this type of exploration. Future work may consider the extent to which switching aids in improving policies near the start state in sparse reward environments.

      Concentrated Return

      In contrast to the investigation in the first experiment, exploitation actions of a trained agent are presumed to be better than all other alternatives. Since agents aim to maximize the expected return in an environment, exploitation actions often accrue relatively large amounts of expected return. For example, the initial experiments of DQN and double DQN (DDQN) decreased the exploration constant (thereby increasing exploitation) during testing runs to achieve higher scores and ultimately demonstrate superhuman performance on Atari. In this subsection, we investigate the effect of the concentrated exploitation actions of switching policies on expected return.

We perform an experiment to determine the proportion of return that is concentrated during exploitation periods. Each reward during an episode is weighted by the proportion of exploitation actions during the past 10 time steps. The score for each episode is the sum of weighted rewards divided by the total rewards. Scores for each behavior policy and epoch are computed by averaging scores across all games. The results are shown in Figure 7. Quite quickly, exploitation steps of switching policies come to contain a greater percentage of the return than those of monolithic policies. This trend seems fairly constant after roughly 2M frames, with switching policies having roughly 95% of the return in exploitation steps and monolithic policies having roughly 90% of the return; from another point of view, exploration steps yield 5% of the return for switching policies and 10% of the return for monolithic policies. These results agree with Experiment 1: switching policies generally reach terminal states more often while in explore mode, after which no further rewards can be received. Since most of the rewards in our selected Atari games are positive, switching policies should accrue lower return while in explore mode.

      Figure 7 (Left): Switching policies concentrate return in exploitation mode. Figure 8 (Right): Switching policies concentrate return in the beginning of episodes.

      One notable case in which exploitation steps are concentrated together is in resetting methods such as Go-Explore that reset to promising states at the beginning of the episode and explore from there. Promising states are usually defined as states that are frequently traversed in trajectories that accrue high return. More generally, resetting methods aim to prevent derailment, whereby an agent is unable to return or is derailed from returning to promising states through its exploratory mechanisms. Since our switching agent begins in exploit mode which aims to accrue the most return, we investigate to see if switching policies possess characteristics that are inherent to resetting methods.

In Figure 8, we plot the proportion of episode return over the past 5% of the episode versus the current proportion of episode that is complete. Data is taken from the last training epoch. The results show that switching policies concentrate return more towards the beginning of each episode, most likely because the first exploit mode of switching policies is relatively long. Future work involves determining the extent to which the beginning exploitation mode of switching policies serves as a flexible alternative to resetting, which would have applications in situations that do not allow for manual resets, such as model-free RL.

      Post-Exploration Entropy

      Monolithic policies such as $\epsilon$-greedy are nearly on-policy when any exploration constants have been annealed. In contrast, the exploration periods of switching policies are meant to free the agent from its current exploitation policy and allow the agent to experience significantly different trajectories than usual. Due to the lack of meaningful learning at states that are further from usual on-policy trajectories, the exploitation actions at those states are more likely to have greater diversity. In this experiment, we investigate the diversity of the action distribution after exploration periods.

We quantify the diversity of the realized action distribution in the time step immediately after each exploration period; a sketch of this computation follows below. The diversity is quantified by entropy, which takes higher values for more random data and vice versa. An action distribution is constructed for each game and epoch, and the entropies across games are averaged. The results are shown in Figure 9. The entropy of the action distribution for switching policies is distinctly greater than that of monolithic policies. Like most of the previous results, this quantity does not plateau until roughly 2M frames have elapsed.
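A minimal sketch of the entropy measurement, assuming each episode is logged as a list of (mode, action) pairs (a hypothetical format): we collect the action taken on the first exploit step after each exploration period and compute the Shannon entropy of that action distribution.

```python
import numpy as np
from collections import Counter

def post_exploration_entropy(episodes):
    actions = []
    for ep in episodes:
        for (prev_mode, _), (mode, action) in zip(ep, ep[1:]):
            if prev_mode == "explore" and mode == "exploit":
                actions.append(action)
    counts = np.array(list(Counter(actions).values()), dtype=float)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log(probs)))  # entropy in nats
```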

      Figure 9 (Left): Switching policies produce action distributions with higher entropy after exploration periods. Figure 10 (Right): Agent has random exploitation actions in states that are visited less frequently.

      To illustrate this idea, we create a gridworld environment that provides the agent a reward of -1 for each time step that the agent is still on the grid; the agent’s goal is to leave the grid as quickly as possible. The agent begins in the center of the grid and learns through discrete Q-learning. Distinct actions have separate colors in Figure 10, with arrows showing the exploit action. The agent learns that it is fastest to exit the grid by going left or right. Notably, the actions near the top and bottom of the grid are seemingly random, as the agent has not seen and learned from those states as frequently as the others. Switching policies are more likely to reach the top and bottom areas of the gridworld state space and consequently would be more likely to have a higher entropy of the action distribution after exploration.

The difference in the entropy of the action distributions suggests that more diverse areas of the state space may be encountered after exploration modes with switching policies. This phenomenon is closely tied to the notion of detachment, whereby agents forget how to return or are detached from areas of high reward, perhaps by focusing too unimodally on one region of the state space. The concentrated behavior of switching policies may provide enough consecutive exploration actions to explore a more diverse set of trajectories. Future work could investigate the ability of switching policies to curb detachment on environments with multiple regions of the state space with high reward.

      Top Exploitation Proportions

Our final investigation involves the change in exploitation proportion under switching policies. Since the probability of switching to explore mode is very low, there may be some episodes where the switch seldom happens, if at all. This creates a distribution of exploitation action proportions per episode that is more extreme than that of monolithic policies, yet still not as extreme as using a single mode throughout the entire episode. Investigations of methods with similar interpolative characteristics have been conducted recently; for example, an action noise called pink noise was introduced that achieved better performance than white and red noise. Pink noise is more temporally-correlated than white noise but not as much as red noise. Here, we investigate the return of the most extreme episodes in exploitation proportion.

We perform an experiment to compare the return of the episodes with the highest exploitation proportions between switching and monolithic policies. The returns of the top 10 of 100 episodes ranked by exploitation proportion for each epoch and game were averaged. Then, a ratio between the averages of switching and monolithic policies was computed and averaged across games; a sketch of this computation follows below. The results are plotted in Figure 11. There does not appear to be a clear trend aside from the ratio hovering mostly above 1.00, indicating that the top exploitation episodes of switching policies accrue more return than those of monolithic policies.
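A sketch of this ratio, assuming each episode is summarized as a hypothetical (exploit_proportion, episode_return) pair:

```python
import numpy as np

def top_exploit_return(episodes, top_k=10):
    # Average return of the top_k episodes with the highest exploitation proportion.
    ranked = sorted(episodes, key=lambda e: e[0], reverse=True)[:top_k]
    return float(np.mean([ret for _, ret in ranked]))

def switching_vs_monolithic_ratio(switching_eps, monolithic_eps):
    # Computed per epoch and game, then averaged across games (Figure 11).
    return top_exploit_return(switching_eps) / top_exploit_return(monolithic_eps)
```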

      Figure 11 (Left): Switching policies have higher return for episodes with largest exploit proportion. Figure 12 (Right): Switching policies have more extreme exploration and exploitation proportions per episode.

      The results are best illustrated through plotting the switching and monolithic exploitation proportions for 1K episodes (10 games of the last epoch) as shown in Figure 12. The top 100 episodes with highest exploitation proportion take more exploitation actions than any monolithic episode. Therefore, the corresponding distribution is indeed more extreme.

While the previous discussion has illustrated that some switching episodes exploit more and generate more return, it does not specifically explain why training with mode-switching is superior; in particular, the slightly greater return is not necessary for learning an optimal policy as long as a similar state distribution is reached during training. One possibility is that mode-switching policies train on a more diverse set of behavior and must generalize to that diversity. Reinforcement learning algorithms are notorious for overfitting, and future work may investigate the extent to which generalization is improved upon using switching policies.

      3. Conclusion

This blog post highlighted five observational differences between mode-switching and monolithic behavior policies on Atari and other illustrative tasks. The analysis showcased the flexibility of mode-switching policies, such as the ability to explore earlier in episodes and exploit at a notably higher rate. As the original study of mode-switching behavior by DeepMind was primarily concerned with performance, the experiments in this blog post supplement the study by providing a better understanding of the strengths and weaknesses of mode-switching exploration. Due to the vast challenges in RL, we envision that mode-switching policies will need to be tailored to specific environments to achieve the greatest performance gains over monolithic policies. Pending a wealth of future studies, we believe that mode-switching has the potential to become the default behavior policy used by researchers and practitioners alike.

      Acknowledgements

      We thank Nathan Bittner for a few helpful discussions on the topic of mode-switching exploration. We also thank Theresa Schlangen (Theresa Anderson at the time of publication) for helping polish some of the figures.

      For attribution in academic contexts, please cite this work as
      +        PLACEHOLDER FOR ACADEMIC ATTRIBUTION
      +  
      BibTeX citation
      +        PLACEHOLDER FOR BIBTEX
      +  
      \ No newline at end of file diff --git a/blog/page/2/index.html b/blog/page/2/index.html new file mode 100644 index 00000000..2c28ece5 --- /dev/null +++ b/blog/page/2/index.html @@ -0,0 +1 @@ + blog - page 2 | ICLR Blogposts 2024

      blogposts

      Blog Posts

      • Masked Language Model with ALiBi and CLAP head

  As a new approach to positional encoding, Attention with Linear Biases (ALiBi) uses linear biases of the attention weights to encode positional information, with the capability of context-length extrapolation. In their paper, however, Press et al. focus on the perplexity of autoregressive decoder-only language models, leaving open the question of downstream tasks and the applicability of ALiBi to encoder attention. In this blog post, we attempt to bridge the gap by testing masked language models (MLMs) with encoder-attention ALiBi and a prediction head similar to those of the original ALiBi models. We find that while a simplified prediction head may be beneficial, the performance of MLMs with encoder-attention ALiBi starts to deteriorate at a 2048 sequence length at larger scales. We put our results in the context of related recent experiments and tentatively identify the circumstances that are more challenging for positional encoding designs. Finally, we open-source our MLMs, with BERT-level performance and 2048 context length.

      • On Bayesian Model Selection: The Marginal Likelihood, Cross-Validation, and Conditional Log Marginal Likelihood

        Bayesian model selection has long relied on the marginal likelihood and related quantities, often motivated by the principle of Occam's razor. Following the paper 'Bayesian Model Selection, the Marginal Likelihood, and Generalization' by Lotfi et al. (2022/2023), this blog post critically examines the conventional focus on the marginal likelihood and related quantities for Bayesian model selection as a direct consequence of Occam's razor. We find that the suitability of these criteria depends on the specific context and goals of the modeling task. We revisit the concepts of log marginal likelihood (LML), cross-validation, and the recently introduced conditional log marginal likelihood (CLML), highlighting their connections and differences through an information-theoretic lens. Through thought experiments and empirical observations, we explore the behavior of these model selection criteria in different data regimes under model misspecification and prior-data conflict, finding that the conditional marginal cross-entropy, closely related to cross-validation, is often more reliable for optimizing generalization performance. We review relevant literature, compare the CLML and validation loss for deep neural networks, and using a toy Bayesian linear regression, we demonstrate that all the discussed quantities can fail to reliably predict generalization. Our takeaways are that: there is no one-size-fits-all solution; the choice of model selection quantity depends on the specific context and goals; and in the future, we should take into account model complexity as well and not assume a uniform model prior. While the post is limited by the need for more rigorous theoretical justification, a broader range of models and datasets (and deeper engagement with philosophical implications), it rightly questions the primacy of the (conditional) log marginal likelihood and encourages critical thinking about its foundations, aiming for a more nuanced understanding of Bayesian model selection.

      • RLHF without RL - Direct Preference Optimization

We discuss the RL part of RLHF and its recent displacement by direct preference optimization (DPO). With DPO, a language model can be aligned with human preferences without sampling from the LM, thereby significantly simplifying the training process. By now, DPO has been implemented in many projects and seems to be here to stay.

      • The Hidden Convex Optimization Landscape of Two-Layer ReLU Networks

        In this article, we delve into the research paper titled 'The Hidden Convex Optimization Landscape of Regularized Two-Layer ReLU Networks'. We put our focus on the significance of this study and evaluate its relevance in the current landscape of the theory of machine learning. This paper describes how solving a convex problem can directly give the solution to the highly non-convex problem that is optimizing a two-layer ReLU Network. After giving some intuition on the proof through a few examples, we will observe the limits of this model as we might not yet be able to throw away the non-convex problem.

      • The N Implementation Details of RLHF with PPO

        Reinforcement Learning from Human Feedback (RLHF) is pivotal in the modern application of language modeling, as exemplified by ChatGPT. This blog post delves into an in-depth exploration of RLHF, attempting to reproduce the results from OpenAI's inaugural RLHF paper, published in 2019. Our detailed examination provides valuable insights into the implementation details of RLHF, which often go unnoticed.

      • Towards Robust Foundation Models: Adversarial Contrastive Learning

Foundation models pre-trained on large-scale unlabelled datasets using self-supervision can be generalizable to a wide range of downstream tasks. Existing work has shown that adversarial attacks can effectively fool any downstream models fine-tuned from a pre-trained foundation model. The existence of such adversarial attacks necessitates the development of robust foundation models which can yield both standard generalization and adversarial robustness in safety-critical downstream tasks. Currently, adversarial contrastive learning (ACL) is one of the most effective methods for outputting a robust foundation model. ACL incorporates contrastive learning with adversarial data to effectively output a robust representation without requiring costly annotations. In this blog, we introduce two NeurIPS 2023 publications that can enhance ACL's efficacy and efficiency, respectively. (1) This blog introduces Adversarial Invariant Regularization (AIR), a state-of-the-art ACL algorithm. A causal theoretical framework is built to interpret ACL, and the AIR algorithm is then derived from the causal framework to regulate and improve ACL. (2) This blog also introduces a Robustness-aware Coreset Selection (RCS) method to speed up ACL. RCS does not require label information and searches for an informative training subset that can maintain adversarial robustness. For the first time, RCS enables the application of ACL on the large-scale ImageNet-1K dataset.

      • Understanding gradient inversion attacks from the prior knowledge perspective

In this blogpost, we discuss multiple works on gradient inversion attacks (GIAs), point out the challenges that need to be solved in GIAs, and provide a prior-knowledge perspective for understanding the logic behind recent papers.

      • Understanding in-context learning in transformers

We propose a technical exploration of In-Context Learning (ICL) for linear regression tasks in transformer architectures. Focusing on the article Transformers Learn In-Context by Gradient Descent by J. von Oswald et al., published at ICML 2023, we provide detailed explanations and illustrations of the mechanisms involved. We also contribute novel analyses of ICL, discuss recent developments, and point to open questions in this area of research.

      • Unraveling The Impact of Training Samples

        How do we quantify the influence of datasets? Recent works on Data Attribution Methods shed light on this problem. In this blog post, we introduce Data Attribution Methods which leverage robust statistics and surrogate functions, and present their applications like distinguishing the feature selection difference of learning algorithms, detecting data leakage, and assessing model robustness.

      • What exactly has TabPFN learned to do?

        TabPFN [Hollmann et al., 2023], a Transformer model pretrained to perform in-context learning on fresh tabular classification problems, was presented at the last ICLR conference. To better understand its behavior, we treat it as a black-box function approximator generator and observe its generated function approximations on a varied selection of training datasets. Exploring its learned inductive biases in this manner, we observe behavior that is at turns either brilliant or baffling. We conclude this post with thoughts on how these results might inform the development, evaluation, and application of prior-data fitted networks (PFNs) in the future.

      \ No newline at end of file diff --git a/blog/primacy-bias-and-why-it-helps-to-forget/index.html b/blog/primacy-bias-and-why-it-helps-to-forget/index.html new file mode 100644 index 00000000..fcc9b033 --- /dev/null +++ b/blog/primacy-bias-and-why-it-helps-to-forget/index.html @@ -0,0 +1,46 @@ + It's Time to Move On: Primacy Bias and Why It Helps to Forget | ICLR Blogposts 2024

      It's Time to Move On: Primacy Bias and Why It Helps to Forget

      'The Primacy Bias in Deep Reinforcement Learning' demonstrates how the first experiences of a deep learning model can cause catastrophic memorization and how this can be prevented. In this post we describe primacy bias, summarize the authors' key findings, and present a simple environment to experiment with primacy bias.

      Introduction to Primacy Bias

      Primacy bias occurs when a model’s training is damaged by overfitting to its first experiences. This can be caused by poor hyperparameter selection, the underlying dynamics of the system being studied, or simply bad luck.

In this post we explore the paper “The Primacy Bias in Deep Reinforcement Learning” by Nikishin et al., presented at ICML 2022. We will present primacy bias and how it applies to deep reinforcement learning, discuss how the authors prevent primacy bias, and finish by experimenting with our own toy example of primacy bias.

Like many deep learning concepts, primacy bias takes inspiration from psychology. For example, you might have a friend who “doesn’t like math” because they had a bad experience in primary school. Now, they avoid the subject despite having an aptitude for it. It turns out that for humans and machines, first impressions matter more than they should. This is primacy bias.

      Off Policy Deep Reinforcement Learning

Nikishin et al. discuss a specific type of model that is particularly sensitive to primacy bias: off-policy deep reinforcement learning. Here, the goal is to learn a policy that makes good decisions in an interactive environment. Off-policy algorithms achieve this by separating decision-making from learning. Deep Q-Learning (DQN) was one of the first popular off-policy algorithms; it separates the learning process into two steps (a minimal sketch of this loop follows the list):

      1. Data Collection: use the current policy to interact with the environment and save memories to a dataset called the replay buffer.
      2. Learning: sample from the replay buffer to perform gradient updates on the policy.
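To make these two steps concrete, here is a minimal sketch of the off-policy loop in the style of Gymnasium's API; `env` and `policy` are hypothetical placeholders rather than code from the paper or CleanRL:

import random
from collections import deque

def off_policy_loop(env, policy, num_steps=10_000, batch_size=32):
    replay_buffer = deque(maxlen=10_000)  # memories outlive the policy that collected them
    obs, _ = env.reset()
    for _ in range(num_steps):
        # Step 1 - Data Collection: act with the current policy and save the memory.
        action = policy.act(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        replay_buffer.append((obs, action, reward, next_obs, terminated))
        obs = env.reset()[0] if (terminated or truncated) else next_obs
        # Step 2 - Learning: sample old memories and update the policy.
        if len(replay_buffer) >= batch_size:
            batch = random.sample(replay_buffer, batch_size)
            policy.update(batch)  # e.g., a DQN gradient step on the TD error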

      Are we Overcomplicating?

      For those without a reinforcement learning background, this might seem needlessly complicated. Why can’t we simply explore with a random policy and then fit a model all at once?

Although this is sometimes done, the quality of the memories in the replay buffer is proportional to the quality of the policy that gathered the experience. Consider an agent learning to play chess. A random policy might gather enough data to learn how to play the start of the game effectively, but it will never learn how to chase an opponent’s king around an empty board. If a policy isn’t smart enough to get the agent out of the ‘early’ game, it will never collect the experiences needed to learn the ‘mid’ or ‘late’ game.

      Selecting a Replay Ratio

      The replay ratio is the total number of gradient updates per environment interaction. If the number of experiences is fixed, then modifying the replay ratio is equivalent to changing the number of training epochs in a typical deep learning problem.

Most researchers know the importance of training for a sufficient number of epochs. Training for more epochs is generally preferred, since methods such as early stopping, weight regularization, and dropout layers can mitigate the risk of overfitting. At worst, if you end up with an overfit model, you can retrain it from scratch.

In deep reinforcement learning, the replay ratio is typically set to one. Unfortunately, finding the correct replay ratio is difficult. We want the agent to learn as much as possible, but there is a path dependency that is hard to ignore. If the policy becomes overfit early, it will have fewer meaningful interactions with the environment, creating a negative feedback loop. If you don’t catch the overfitting in your poker bot until it loses a couple of tournaments, then you might have spent a lot of money on a dataset of how to lose poker hands.

      Heavy Priming

      To quantify this, Nikishin et al. perform an experiment with heavy priming. The goal is to train an agent on the “quadruped-run” environment, where an agent learns to manipulate joint movement to travel forward.

First, a baseline is trained with default parameters. Next, to create heavy priming, the agent collects 100 interactions and then trains on them for 100K steps. The model with heavy priming never recovers, an example of catastrophic memorization.

Example of Heavy Priming by Nikishin et al.

      Weight Resets

To avoid primacy bias, Nikishin et al. propose the following solution: freely increase the replay ratio, but periodically perform a weight reset that reinitializes all of the agent’s weights while preserving the replay buffer. This destroys any learned information in the network’s weights. At worst, if there is no primacy bias, the replay buffer will contain enough information to retrain to the previous weights. At best, primacy bias is eliminated and the model finds a new optimum.
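As a minimal sketch (our own illustration, not the authors' code), a reset touches only the network weights and deliberately leaves the replay buffer alone:

import torch.nn as nn

def reset_weights(model: nn.Module) -> None:
    # Reinitialize every submodule that defines its own initialization;
    # this covers nn.Linear, nn.Conv2d, and most standard layers.
    for layer in model.modules():
        if hasattr(layer, "reset_parameters"):
            layer.reset_parameters()

# Inside the training loop (the replay buffer is intentionally untouched):
# if step % reset_frequency == 0:
#     reset_weights(agent.q_network)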

To think about this concretely, consider a 100-step training loop. At each step we:

      1. Gather 1 observation.
      2. Add it to the replay buffer.
      3. Select a random sample from the replay buffer.
      4. Perform a gradient update to the model with the sample.

After 100 steps, the first observation will have been sampled on average 5.19 times, the 50th observation 0.71 times, and the 100th (final) observation only 0.01 times. This can be summarized in a plot.

How often an example is sampled on average in a 100-step training loop.
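These expected counts are just sums of per-step sampling probabilities: observation $k$ enters the buffer at step $k$, and at each later step $t$ one sample is drawn uniformly from the $t$ stored observations, so its expected count is $\sum_{t=k}^{100} 1/t$. A few lines of Python (our own illustration) reproduce the numbers above:

from fractions import Fraction

def expected_samples(k: int, total_steps: int = 100) -> float:
    # Observation k can be drawn at steps k..total_steps,
    # each time with probability 1/t (uniform over the t stored observations).
    return float(sum(Fraction(1, t) for t in range(k, total_steps + 1)))

print(round(expected_samples(1), 2))    # 5.19
print(round(expected_samples(50), 2))   # 0.71
print(round(expected_samples(100), 2))  # 0.01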

Some solutions to mitigate this include recency weighting or prioritized experience replay; however, weight resets offer a theoretically parameter-free way to fix this. If weights are retrained from scratch at every step, then all prior observations have equal influence.

In practice, weight resets are a bit more complicated. Ideally, we would retrain the model from scratch after each observation. Unfortunately, this isn’t realistic (on my computer). This leaves us with two decisions:

      1. Select a reset frequency.
      2. Decide what to reset.

Resetting often will prevent primacy bias, but it requires a high replay ratio. This trade-off is discussed in detail in the follow-up work “Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier”, published at ICLR in 2023. In particular, a heatmap shows the trade-off between data and computation budget on a dynamic motion control problem:

      "Performance of SR-SAC in DMC15 as a function of the number of interactions and of the number of agent updates, determined by the replay ratio."

      Do Resets Work?

Nikishin et al. show that, on average, resets work well.

      1. Immediately after a reset there is a sudden drop in performance that quickly recovers.
2. Resets never irreparably harm a model. At worst, the model returns to the pre-reset level (e.g., cheetah-run), but sometimes it can perform substantially better (humanoid-run).

These results are consistent across multiple algorithms and environments, including the continuous-control DeepMind Control Suite and the discrete Atari 100k benchmark.

Episode return over time on a subset of DeepMind Control, with and without resets, using the SAC algorithm. Averaged over 10 random seeds. (Figure 4)
Episode return over time in DeepMind Control, with and without resets, using the DRQ algorithm. Averaged over 20 random seeds. (Figure 18, from Appendix C)
Per-game scores in Atari, with and without resets, using the SPR algorithm. Averaged over 20-100 random seeds. (Table 7, from Appendix C)

      After seeing the success of resets, it is reasonable to wonder how weight resets compare to other regularization tools. The authors test this as well and show that resets improve outcomes in their experiments on average more than either dropout or L2 regularization (which actually perform worse than the baseline).

      Comparison of Base Algorithm, Resets (+ resets), Dropout (+ dropout), and L2 (+ L2). Averaged over 10 runs.

      What’s The Catch?

While these results are impressive, they come at a cost. At minimum, increasing the replay ratio increases compute time linearly. D’Oro et al. (2023) note that running the full dynamic control benchmark with a replay ratio of 32 takes 4 GPU days on an NVIDIA V100. Using a replay ratio of 16 on Atari 100K requires 5 GPU hours per run.

      Additionally, implementing weight resets requires a sneaky number of design decisions. The results from the paper show reset rules specifically chosen for each environment and algorithm.

      Some of these considerations include:

      1. How often should you reset? Every step is ‘ideal’ but it is also ideal to get results this year.
      2. What is the optimal replay ratio to maximally learn per sample and sustain the reset frequency?
      3. What exactly should I reset? Full model? Last layer?

These are open questions. For weight resets to become widely used, new heuristics and best practices will need to be developed. The answers may depend on both the network architecture and the underlying system dynamics. Trying to imagine the precise behaviours induced by primacy bias on Atari and DeepMind Control can be difficult.

      Implementing Primacy Bias

The best way to learn something is through practice. In this section we present a minimal example of primacy bias. The associated code is released as a notebook along with additional experiments.

      The biggest obstacle to studying primacy bias is the compute required. Training time scales linearly with replay ratio, and a high replay ratio is necessary to extract maximal information per sample and to recover after each reset. To work around this, we present an MVP: Minimum Viable Primacy (bias).

We use a modified version of the Frozen Lake environment provided by Farama Gymnasium with a DQN model (one of the first models to popularize the replay buffer) based on the CleanRL implementation.

      2x2 Switching Frozen Lake

Frozen Lake is a simple pathfinding problem. The model receives a reward if it successfully traverses a grid to reach a goal. The model can fail in two ways: 1) it falls in a hole or 2) it takes too long to reach the goal. The model observes its location on the grid, and each action moves the agent one tile up, down, left, or right.

      To simplify the problem, we restrict the map size to 2x2 and keep the environment deterministic. The agent always starts in the top left corner and is rewarded if it reaches the bottom right corner. A hole is placed in one of the two remaining spaces. The agent fails if it takes more than 2 steps or falls in a hole. Each map has exactly one solution.

      MVP: Switching 2x2 Frozen Lake Environment, with solution in red.
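Gymnasium's FrozenLake-v1 accepts a custom map through its desc argument, so the switching setup only needs a couple of lines. The map strings below are our reconstruction of the two maps, not the released notebook:

import gymnasium as gym

# S: start, F: frozen (safe), H: hole, G: goal.
MAP_1 = ["SH",
         "FG"]  # hole on the right: the only solution starts with 'down'
MAP_2 = ["SF",
         "HG"]  # hole below: the only solution starts with 'right'

def make_lake(desc):
    # Deterministic dynamics; the episode fails after more than 2 steps.
    return gym.make("FrozenLake-v1", desc=desc, is_slippery=False,
                    max_episode_steps=2)

env = make_lake(MAP_1)   # switch to MAP_2 after crossing 200
obs, info = env.reset()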

The agent attempts to cross the lake 1,000 times. To force primacy bias, we show the agent Map 1 for the first 200 crossings and Map 2 for the last 800. The maps are deliberately chosen to have opposite solutions. After 400 crossings, the agent will have experienced each map equally; afterwards, it should begin to prefer Map 2 with increasing confidence. Our agent is maximally exploitative and will always take the action it thinks is best.

      Each trial is considered expensive (our agent doesn’t want to freeze). A good algorithm will maximize the number of successful crossings in the 1,000 attempts. Each attempt is saved to the replay buffer and any reset will fully reinitialize all network weights.

      The advantage of this environment is that it is very fast. A trial of 1,000 crossings with a replay ratio of 1 completes in less than 5 seconds on a CPU. The disadvantage of this environment is that it’s incredibly simple, and findings might not generalize to more complex problems.

      Results

The first thing we do is inspect how our model scores its first action, with and without resets, for each crossing.

Model scores for the first action over time (after softmax), with and without resets. The correct first action is down for the first 200 episodes and right afterwards. Replay ratio of 16 with results averaged over 25 seeds.
Additional action values over time for various learning rates.


      Both models quickly determine that moving down is correct. The resetting model will periodically score actions equally before quickly recovering. Without resets, the map switch is only recognized after the 800th crossing. With resets, this switch happens around crossing 500. We also see that after the map switch the model without resets tries to adjust by increasing the scores for the incorrect left and up actions (which led to failure in two steps instead of one).

      We can also plot the reward per crossing, averaged over 25 seeds. Similar to the first result, the model with resets periodically fails, but also adapts to the map switch faster.

Model score over time, with and without resets. Replay ratio of 16. Average of 25 seeds.
Additional scores over time for various learning rates.


      Next, we conduct a hyperparameter sweep with replay ratios 1, 4, 16 and reset frequencies 0, 50, 100, 500. We then compare the average number of successful crossings. A random policy will earn the reward 1/16 of the time.

Full-period average score, averaged across all crossings. Average of 25 seeds.
Additional average scores for various learning rates.


In general, the results match our expectations. With a learning rate of 0.01, a higher replay ratio improves results, and having resets is always helpful. A high replay ratio with resets is necessary to achieve a score over 0.6 for all learning rates. Reset frequency and replay ratio must be adjusted alongside the learning rate, which scales how quickly the network can adapt in a non-stationary environment.

As a final experiment, we vary model size. We compare a much smaller two-layer DQN architecture to the larger three-layer model used in prior experiments. Interestingly, this produces the highest score yet with a reset frequency of 10 steps, although the result quickly disappears with a lower learning rate.

Full-period average score. Average of 25 seeds. Split by network size with a replay ratio of 16.
Additional average scores for various learning rates by network size.
Comparison of 3-layer and 2-layer networks. Reset every 10 steps with a replay ratio of 16. Average of 25 seeds.


      Conclusions

      In this blogpost, we discuss primacy bias and its application to off-policy deep reinforcement learning. We highlight a subset of results and apply weight resets to a new problem.

We hope that more examples of primacy bias continue to be discovered and studied. Eventually, we would like to identify specific behaviors that are catastrophically memorized and create guiding principles to identify environments that are most at risk of primacy bias. Over time, we hope this might unlock new applications of deep reinforcement learning.

      Even as the theory continues to develop, there is little harm in attempting periodic weight resets with a high replay ratio to train off-policy reinforcement learning agents.

Finally, primacy bias might not always be a bad thing. If you decide to take a new shortcut to work by walking down an alley and the first thing you notice is how dark and unsafe it seems, then maybe it’s a good idea to turn back. As always, it is up to the modeller to decide whether primacy bias should be treated in their problem.

      Acknowledgements

      This blogpost is derived from our work that began in Dr. Zsolt Kira’s excellent Deep Learning course at Georgia Tech.

      For attribution in academic contexts, please cite this work as
      +        PLACEHOLDER FOR ACADEMIC ATTRIBUTION
      +  
      BibTeX citation
      +        PLACEHOLDER FOR BIBTEX
      +  
      \ No newline at end of file diff --git a/blog/rlhf-without-rl/index.html b/blog/rlhf-without-rl/index.html new file mode 100644 index 00000000..fa751e3d --- /dev/null +++ b/blog/rlhf-without-rl/index.html @@ -0,0 +1,36 @@ + RLHF without RL - Direct Preference Optimization | ICLR Blogposts 2024

      RLHF without RL - Direct Preference Optimization

We discuss the RL part of RLHF and its recent displacement by direct preference optimization (DPO). With DPO, a language model can be aligned with human preferences without sampling from the LM, thereby significantly simplifying the training process. By now, DPO has been implemented in many projects and seems to be here to stay.

      Background

Reinforcement learning from human feedback (RLHF) is an important technique for aligning (large) language models (LMs) with human preferences. It was introduced by Christiano et al. and then first applied to language models in the work by Ziegler et al. Since then, RLHF has become a central building block of many LLM-based applications, including the first versions of ChatGPT.

      RLHF for language models works roughly as follows:

      1. Collect a dataset of prompts $\mathcal{D}$ for the LM, typically containing instructions or questions.
2. For each prompt $x\in \mathcal{D}$, collect a set of completions $y_1, …, y_N$ from the LM. One can increase the temperature of the language model for this step to obtain sufficient variability among the completions.
3. Ask human annotators to rate the completions, thereby obtaining a dataset of preferences $(x, y_{\text{rank}_1}, \dots, y_{\text{rank}_N})$.
      4. Train a parameterized reward function $r_\phi$ (mapping pairs $(x,y)$ to scalars) on the collected preferences by minimizing the loss

\[\mathcal{L}(r) = -\,\mathbb{E}_{(x, y_{\text{rank}_i})} \left[ \log \frac{e^{r(x, y_{\text{rank}_i})}}{\sum_{j=1}^N e^{r(x, y_{\text{rank}_j})}} \right].\]

This loss is inspired by the Bradley-Terry model for pairwise comparisons and by maximum-entropy inverse RL. Intuitively, it encourages the reward function to assign higher rewards to completions that are preferred by humans. Usually, the reward function is parameterized by the LM itself with an additional linear layer: the mapping from $(x, y)$ to $r(x, y)$ is given by concatenating the sequences $x$ and $y$ and passing the embedding of the last (or a differently selected) token through a linear layer. (A sketch of the pairwise case of this loss follows the list.)

5. Fine-tune the LM by viewing it as a policy $\pi_\theta$ and using RL with the learned reward function $r_\phi$ as the reward. For this step, a separate dataset of prompts $\mathcal{D}_{\text{RL}}$ is used to query the LM and collect completions. Since the reward is learned on a very limited subset of possible completions, and is therefore unreliable on off-distribution data, it would be unwise to optimize it without any regularization.

        The typical choice of regularization is the KL-divergence between the policy (i.e. the aligned/fine-tuned LM) and a reference policy $\pi_{\text{ref}}$ (usually the pretrained LM before fine-tuning). The RLHF objective then becomes

\[\tag{1} \label{eq:rlhf} J(\pi_\theta) = \mathbb{E}_{x \sim \mathcal{D}_\text{RL},\, y\sim \pi_\theta(y \mid x)} \left[ r_\phi(x, y) - \beta\, D_{\text{KL}} \left( \pi_\theta(\cdot \mid x) \,\|\, \pi_\text{ref}(\cdot \mid x) \right) \right],\]

        which is then used to find the optimal policy $\pi_\theta$ by some optimization algorithm, typically a variant of proximal policy optimization (PPO). Here $D_{\text{KL}}$ denotes the KL-divergence between two distributions, and the temperature $\beta$ is a hyperparameter that controls the strength of the regularization.
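For the common pairwise case ($N = 2$, with a preferred completion $y_w$ and a rejected one $y_l$), the loss from step 4 reduces to a logistic loss on the reward difference. A minimal sketch, assuming `r_chosen` and `r_rejected` are batches of scalar rewards produced by the reward model:

import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # r_chosen = r_phi(x, y_w), r_rejected = r_phi(x, y_l), both of shape (batch,).
    # Negative log-likelihood under the Bradley-Terry model:
    # the preferred completion should receive the higher reward.
    return -F.logsigmoid(r_chosen - r_rejected).mean()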

The resulting LLMs are very powerful and so widely used that we don’t need to further elaborate on their performance here. Note, however, that the RLHF scheme involves considerable complexity when it comes to actually making it work in practice.

      Is RLHF Reinforcement Learning?

From the beginning, RLHF has sparked some controversy. Some regarded it as one of the prime applications of reinforcement learning (which may currently be perceived as “less hot” than LLMs, so a prominent application of RL to LLMs works in RL’s favor). At the same time, others were skeptical about whether RLHF is reinforcement learning at all.

Indeed, some crucial components of RL are missing in RLHF. First, the current forms of RLHF do not involve sequential decision-making (although there is some work on that, e.g., the ILQL algorithm). While the rollout of a completion can formally be viewed as a sequence of actions, the reward is only given after the completion has ended, with no intermediate rewards. Moreover, for the purpose of RLHF the LM itself can be regarded as a direct mapping from inputs to distributions over completions, rather than a sequential decision-making agent in the space of tokens. Thus, at best, RLHF is a form of single-step, immediate-reward RL - in other words, a contextual bandit.

      Even more troubling than the non-sequential nature of RLHF may be its information flow. While the policy optimization of RLHF is framed as an online RL algorithm, the environment consists of the policy itself. Usually, in online RL an agent is able to extract new information from the environment. In RLHF, however, the information is not “new” in the sense that it is not extracted from something external to the agent itself. The only information not originally contained in the LM is in the preferences data (notably, not even in the completions themselves, but only in their rankings), and it is only used to fit a reward function. Thus, RLHF is more reminiscent of offline RL or supervised learning than of online RL.

Because of this one-step nature of RLHF, and because training enormous models is unusual for RL, the majority of RLHF software is not set up to be compatible with gym(nasium) or other environment interfaces. Take, for example, the well-known trl and trlx libraries, which barely mention environments at all. A notable exception is the RL4LMs project by AllenAI, which unfortunately seems to be abandoned and is based on the deprecated gym instead of gymnasium. For practical RLHF, training in parallel on massive datasets is a necessity, which somewhat complicates the use of standard environment and training interfaces.

      The view that RLHF is not “really” RL, or at least does not have to be, has become even more popular after the publication of the DPO algorithm, which we will discuss in the next section.

      Direct Preference Optimization

Direct preference optimization (DPO), introduced by Rafailov et al., is a method for aligning language models (LMs) to human preferences without having to sample from the LM and without using RL explicitly. Interestingly, DPO still optimizes the same objective as RLHF, but does so purely by supervised learning. This results in a much simpler training procedure and reportedly better performance in a number of experiments.

      The mathematical derivation of DPO is short and insightful. It is based on the following observations:

      1. Reward as a Function of the Policy

      The RLHF objective (\ref{eq:rlhf}) has an exact (non-parametric) solution for the optimal policy $\pi_r$:

      \[\pi_r(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp \left( \frac{1}{\beta} r(x, y) \right).\]

This expression is well known in the RL literature and is sometimes referred to as the Boltzmann policy (note that in the 1-step RL setting, the Q-function is given by the reward itself).

Similar results were proved for the REPS algorithm and in follow-up work. While this solution for $\pi_r$ is in itself intractable (because of the partition function $Z(x)$), it can be used to express the reward as a function of the optimal policy:

      \[\tag{2} \label{eq:reward-as-function-of-policy} r(x, y) = \beta \log \left( \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right) + \log Z(x).\]

      2. Only Differences of Rewards Are Needed

      For simplicity, let us consider that only two completions are collected per input, which are then ranked as $y_w$ and $y_l$ (for winning and losing). DPO can be easily extended to the case of more completions per input, but the notation becomes more cumbersome.

      The reward $r_\phi$ is then learned by minimizing the loss:

\[\mathcal{L}_\phi = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \frac{ e ^ {r_\phi(x, y_w)}}{ e^{r_\phi(x, y_w)} + e^{r_\phi(x, y_l)}} \right],\]

      which is equivalent to

      \[\tag{3} \label{eq:reward-loss-binary} \mathcal{L}_\phi = - \mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}} \left[ \log \sigma \left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right],\]

      where $\sigma$ is the sigmoid function. Note that only differences of rewards enter (\ref{eq:reward-loss-binary}).

      3. DPO Objective

After plugging the reward expression (\ref{eq:reward-as-function-of-policy}) into the loss (\ref{eq:reward-loss-binary}), the partition function $Z(x)$ cancels out: only the difference $r(x, y_w) - r(x, y_l)$ enters the loss, and both rewards carry the same $\log Z(x)$ term. Replacing the optimal $\pi_r$ with the parameterized $\pi_\theta$, the DPO objective is obtained as

      \[\mathcal{L}_{\text{DPO}}(\pi_\theta ; \pi_{\text{ref}}) := - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right].\]

      Thus, instead of first learning a reward and then finding the optimizing policy, one directly finds the optimal policy such that its reward as obtained from (\ref{eq:reward-as-function-of-policy}) corresponds to collected human preferences (i.e., a reward that optimizes (\ref{eq:reward-loss-binary})). Note that while the induced reward function itself is intractable, the differences of rewards remain tractable and can be computed using the learned policy. This should be sufficient for practical purposes, where rewards are mostly used to rank completions and, e.g., perform rejection sampling.
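In code, the DPO loss only needs the sequence log-probabilities of both completions under the trained policy and the frozen reference model. A minimal sketch (the names are our own, not those of any particular library):

import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # logp_w / logp_l: log pi_theta(y_w | x) and log pi_theta(y_l | x), shape (batch,).
    # ref_logp_*: the same quantities under the frozen reference policy pi_ref.
    ratio_w = logp_w - ref_logp_w  # beta * ratio is the implicit reward, up to log Z(x)
    ratio_l = logp_l - ref_logp_l
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()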

The paper includes some more details, a discussion of the interpretation of the DPO update, and a detailed comparison to standard RLHF, but the essence of the method is captured by the above derivation.

      DPO in the Wild - Experiments, LLMs and Software

      The original experiments in the paper were conducted on small-scale models and datasets, and as such were not very convincing. We partially include them here for completeness:

      Original evaluation of DPO on small-scale models and datasets. Left: TL;DR summarization win rates vs. human-written summaries, using GPT-4 as evaluator. DPO exceeds PPO’s best-case performance on summarization, while being more robust to changes in the sampling temperature. Right: The frontier of expected reward vs KL to the reference policy. DPO provides the highest expected reward for all KL values, demonstrating the quality of the optimization.

Fortunately, DPO’s simplicity has made it attractive to many researchers and engineers. By now, only a few months after the publication of the paper, it is already included in trl as well as the ray-based library OpenRLHF (which is notably not using rllib, but that’s a story for another day). Moreover, several large models have been trained with DPO, including Zephyr 7B and the 70B-parameter TÜLU 2. Here is what the authors of the latter had to say about DPO:

      DPO training significantly improves AlpacaEval and MT-Bench performance. At all sizes, DPO training provides significant improvements in AlpacaEval, with our largest DPO-trained model significantly outperforming GPT-3.5-turbo-0314 (89.4 vs. 95.1) and is competitive with GPT-4 ... We also observe that DPO training provides a large boost in MT-Bench performance for the 13B and 70B size models, with TÜLU 2+DPO 70B being the best-performing open model compared to all other models on the MT-Bench leaderboard.
DPO training is stable at large scales. We find that DPO training scales without issues with 70B-size models, with DPO training still providing large benefits for open-ended generation (AlpacaEval) even at the 70B size. This suggests DPO is a promising path for training large models on human feedback without the engineering complexity required by PPO. To our knowledge, TÜLU 2+DPO 70B is the largest publicly-released DPO-trained model.
DPO does not dramatically harm most other metrics. We find that DPO training does not significantly change performance in most other metrics we measure, such as factual reasoning (MMLU) or reasoning (BBH, GSM8k), with the exception of multilinguality (which we discuss below). This suggests that DPO training does not significantly change model capabilities.
DPO training significantly drops multilingual capabilities. We find that DPO training significantly drops performance in TydiQA, which tests the multilingual capabilities of our model. However, we note that both our supervised finetuning and DPO data mixes do not explicitly contain multilingual data, and are majority English-language. As such, DPO training is likely to make multilingual outputs further out-of-distribution, and mixing in multilingual data at instruction tuning and DPO training stages may significantly improve these results.
DPO training increases model verbosity. As seen in Table 4, TÜLU 2+DPO models generally output answers of longer length than those trained without DPO. This is in line with prior work showing a bias toward verbosity from RLHF training. However, we note that our DPO-trained models appear dramatically less verbose than other open-weight models, which future work will investigate.

      Closing Remarks

      One may find it surprising that supervised learning is able to replace RL on a formal level. For RLHF, new data is sampled from the language model, and for DPO this is not the case.

However, after paying closer attention to the information flow of RLHF as described above, it may not be too surprising after all. The sampled data is not really new - it is created using the very same model that one is trying to optimize. The rewards for these samples are also not new; they are obtained by fitting a reward function to the preferences, and no new human preferences are retrieved during optimization. So from the information-flow perspective, supervised learning and RL are indeed equivalent in this particular case. Maybe François Chollet was not too extreme in suggesting to get rid of deep RL altogether in his tweet (which, notably, predates DPO; personally, I don’t believe deep RL is completely futile, but for RLHF he was on point).


      Another surprising aspect of DPO is the question: Why has nobody done this before? Hopefully after reading this blog post, you will agree that the derivation of DPO is not particularly complicated, so why did it take almost 4 years after the introduction of RLHF? Especially considering how tricky RLHF can be to implement. I don’t have an answer, though my intuition is that sometimes as a community we put too much effort into following a working solution, instead of taking a step back and searching for a simpler path. We might have witnessed a large scale instance of the Region-beta paradox.

As a final note on community dynamics: supervised and self-supervised learning are now making more headlines than reinforcement learning, and DPO might have the effect of slowing down the complicated (but, as I believe, necessary) marriage of RL and LLMs. I do think that planning and search should play some part in LLM training in the future, although only in settings where there is an actual environment from which new information can be extracted (like tool use or robotics). For now, however, taking the RL out of RLHF seems like a good step forward. If DPO can be made beneficial for most LLM training, I believe that one can firmly answer the opening question of this blog as:

      Is RLHF really (online) RL? No, it is not.

      For attribution in academic contexts, please cite this work as
      +        PLACEHOLDER FOR ACADEMIC ATTRIBUTION
      +  
      BibTeX citation
      +        PLACEHOLDER FOR BIBTEX
      +  
      \ No newline at end of file diff --git a/blog/robust-foundation-model/index.html b/blog/robust-foundation-model/index.html new file mode 100644 index 00000000..b849e2a0 --- /dev/null +++ b/blog/robust-foundation-model/index.html @@ -0,0 +1,335 @@ + Towards Robust Foundation Models: Adversarial Contrastive Learning | ICLR Blogposts 2024

      Towards Robust Foundation Models: Adversarial Contrastive Learning

Foundation models pre-trained on large-scale unlabelled datasets using self-supervision can be generalizable to a wide range of downstream tasks. Existing work has shown that adversarial attacks can effectively fool any downstream models fine-tuned from a pre-trained foundation model. The existence of such adversarial attacks necessitates the development of robust foundation models which can yield both standard generalization and adversarial robustness in safety-critical downstream tasks. Currently, adversarial contrastive learning (ACL) is one of the most effective methods for outputting a robust foundation model. ACL incorporates contrastive learning with adversarial data to effectively output a robust representation without requiring costly annotations. In this blog, we introduce two NeurIPS 2023 publications that can enhance ACL's efficacy and efficiency, respectively. (1) This blog introduces Adversarial Invariant Regularization (AIR), a state-of-the-art ACL algorithm. A causal theoretical framework is built to interpret ACL, and the AIR algorithm is then derived from the causal framework to regulate and improve ACL. (2) This blog also introduces a Robustness-aware Coreset Selection (RCS) method to speed up ACL. RCS does not require label information and searches for an informative training subset that can maintain adversarial robustness. For the first time, RCS enables the application of ACL on the large-scale ImageNet-1K dataset.

      Foundation Models

Foundation models are pre-trained on large-scale unlabelled datasets using self-supervised learning methods and are generalizable to a wide range of downstream tasks via fine-tuning. For example, GPT-3 has been successfully commercialized as a powerful text generation application. The vision transformer has been widely used in computer vision tasks such as object detection and medical analysis. BLIP is a vision-language pre-trained model that can perform many vision-language tasks such as visual question answering. CLAP is a language-audio pre-trained model that can be used for understanding paired text and audio.

      Contrastive Learning (CL)

      To build foundation models, contrastive learning (CL) is one of the popular self-supervised learning methods. CL aims to maximize the agreement between different natural views of the original data.

      Let \(f_\theta: \mathcal{X} \rightarrow \mathcal{Z}\) be a feature extractor parameterized by \(\theta\), \(g:\mathcal{Z} \rightarrow \mathcal{V}\) be a projection head that maps representations to the space where the contrastive loss is applied, and \(\tau_i, \tau_j: \mathcal{X} \rightarrow \mathcal{X}\) be two transformation operations randomly sampled from a pre-defined transformation set \(\mathcal{T}\). Given a mini-batch \(B \sim \mathcal{X}^\beta\) consisting of \(\beta\) samples, we denote the augmented minibatch \(B^\prime = \{ \tau_i(x_k), \tau_j(x_k) \mid \forall x_k \in B \}\) consisting of \(2\beta\) samples. We take \(h_\theta(\cdot) = g \circ f_\theta(\cdot)\) and \(x_k^u = \tau_u(x_k)\) for any \(x_k \sim \mathcal{X}\) and \(u \in \{i,j\}\). The contrastive loss between different natural views (i.e., \(x_k^i\) and \(x_k^j\)) is formulated as follows:

      \[\ell_\mathrm{CL}(x_k^i,x_k^j; \theta)\!=\!-\! \sum\limits_{u \in \{i,j\}} \! \log \frac{e^{\mathrm{sim} \left(h_\theta(x_k^i), h_\theta(x_k^j) \right)/t}}{\sum\limits_{x \in B^\prime \setminus \{x_k^u\}} e^{\mathrm{sim} \left( h_\theta(x_k^u), h_\theta(x) \right)/t}},\]

      where \(\mathrm{sim}(\cdot,\cdot)\) is the cosine similarity function.

Intuitively, CL aims to maximize the agreement between different natural views (the dashed blue lines).

      How to implement CL at the pre-training stage in practice?

Click here to see the Pytorch code for calculating the contrastive loss. You can copy-paste it to calculate the contrastive loss for convenience. The code is copied from https://github.com/GodXuxilie/Enhancing_ACL_via_AIR.
      import torch
      +import torch.nn as nn
      +import torch.nn.functional as F
      +
      +class CL(nn.Module):
      +
      +    def __init__(self, normalize=True, temperature=0.5):
      +        super(CL, self).__init__()
      +        self.normalize = normalize
      +        self.temperature = temperature
      +
      +    def forward(self, zi, zj):
      +        # zi: the representation of natural view x^i.
      +        # zj: the representation of natural view x^j.
      +
      +        bs = zi.shape[0]
      +        labels = torch.zeros((2*bs,)).long().to(zi.device)
      +        mask = torch.ones((bs, bs), dtype=bool).fill_diagonal_(0)
      +
      +        zi_norm = F.normalize(zi, p=2, dim=-1) if self.normalize else zi
      +        zj_norm = F.normalize(zj, p=2, dim=-1) if self.normalize else zj
      +
      +        ### Contrastive Loss ###
      +        logits_ii = torch.mm(zi_norm, zi_norm.t()) / self.temperature
      +        logits_ij = torch.mm(zi_norm, zj_norm.t()) / self.temperature
      +        logits_ji = torch.mm(zj_norm, zi_norm.t()) / self.temperature
      +        logits_jj = torch.mm(zj_norm, zj_norm.t()) / self.temperature
      +
      +        logits_ij_pos = logits_ij[torch.logical_not(mask)]                                          
      +        logits_ji_pos = logits_ji[torch.logical_not(mask)]                                          
      +        logits_ii_neg = logits_ii[mask].reshape(bs, -1)                                            
      +        logits_ij_neg = logits_ij[mask].reshape(bs, -1)                                             
      +        logits_ji_neg = logits_ji[mask].reshape(bs, -1)                                             
      +        logits_jj_neg = logits_jj[mask].reshape(bs, -1)                                             
      +
      +        pos = torch.cat((logits_ij_pos, logits_ji_pos), dim=0).unsqueeze(1)                         
      +        neg_i = torch.cat((logits_ii_neg, logits_ij_neg), dim=1)                                    
      +        neg_j = torch.cat((logits_ji_neg, logits_jj_neg), dim=1)                                    
      +        neg = torch.cat((neg_i, neg_j), dim=0)                                                      
      +
      +        logits = torch.cat((pos, neg), dim=1)                                                       
      +        nat_contrastive_loss = F.cross_entropy(logits, labels)
      +        return nat_contrastive_loss

      Besides, you can use the following script to conduct self-supervised pre-training via CL using ResNet-18 on CIFAR-10:

      # Pre-training stage via CL
      +git clone https://github.com/GodXuxilie/Enhancing_ACL_via_AIR.git
      +cd Enhancing_ACL_via_AIR
      +PRE_TRAIN_DIR=CL_ResNet18_cifar10
      +python pretraining.py $PRE_TRAIN_DIR --dataset cifar10 \
      +                                     --model r18 \
      +                                     --pgd_iter 0  --lambda1 0 --lambda2 0

      Robust Foundation Models

Existing work has shown that there exist adversarial attacks that can fool foundation representations into yielding incorrect predictions by adding imperceptible adversarial perturbations to the original inputs in downstream tasks. The existence of adversarial attacks necessitates the development of robust foundation models for safety-critical downstream tasks.

The foundation representation is vulnerable to adversarial attacks, which cause the model to wrongly predict a car as 'NOT a car'.

      Robust foundation models are pre-trained on large-scale datasets via robust self-supervised learning methods. Robust foundation models have the following two critical properties:

• Robust foundation representations are generalizable to downstream tasks;
• Fine-tuned robust foundation representations are adversarially robust in downstream tasks.

      Adversarial Contrastive Learning (ACL)

      To learn robust foundation representations, adversarial contrastive learning (ACL) is one of the most popular and effective robust self-supervised learning methods. ACL incorporates CL with adversarial data to build a robust foundation model without requiring costly annotations. ACL aims to maximize the agreement between different natural views as well as the agreement between different adversarial views. The adversarial contrastive loss given a data point \(x_k \in \mathcal{X}\) is formulated as follows:

      \[\ell_\mathrm{ACL}(x_k;\theta) = (1 + \omega) \cdot \ell_\mathrm{CL}(\tilde{x}_{k}^i, \tilde{x}_{k}^j; \theta) + (1 - \omega) \cdot \ell_\mathrm{CL}(x_k^i, x_k^j; \theta),\]

      where adversarial views are formulated as follows:

      \[\tilde{x}_{k}^i, \tilde{x}_{k}^j = \mathop{\arg\max}_{ {\Large \tilde{x}_{k}^i \in \mathcal{B}_\epsilon[x_k^i]} \atop {\Large \tilde{x}_{k}^j \in \mathcal{B}_\epsilon[x_k^j]} } \ell_\mathrm{CL}(\tilde{x}_{k}^i, \tilde{x}_{k}^j; \theta).\]

      Note that \(\omega \in [0,1]\) is a scalar and \(\mathcal{B}_\epsilon[x]\) is a constraint that ensures the adversarial data \(\tilde{x}\) is in the \(\epsilon\)-ball around data \(x\).

Intuitively, ACL aims to maximize the agreement between different natural views (the dashed blue lines) and the agreement between different adversarial views (the dashed red lines).

Here is the generation procedure of adversarial data via Projected Gradient Descent (PGD). Given an initial positive pair \((x_k^{i,(0)}, x_k^{j,(0)})\), a number of PGD steps \(T \in \mathbb{N}\), a step size \(\rho > 0\), and an adversarial budget \(\epsilon \geq 0\), PGD iteratively updates the pair of data from \(t=0\) to \(T-1\) as follows:

\[x_k^{i,(t+1)} = \Pi_{\mathcal{B}_\epsilon[x_k^{i,(0)}]} \big( x_k^{i,(t)} + \rho \cdot \mathrm{sign} \big( \nabla_{x_k^{i,(t)}} \ell_\mathrm{CL}(x_k^{i,(t)}, x_k^{j,(t)}; \theta) \big) \big),\] \[x_k^{j,(t+1)} = \Pi_{\mathcal{B}_\epsilon[x_k^{j,(0)}]} \big( x_k^{j,(t)} + \rho \cdot \mathrm{sign} \big( \nabla_{x_k^{j,(t)}} \ell_\mathrm{CL}(x_k^{i,(t)}, x_k^{j,(t)}; \theta) \big) \big),\]

where \(\Pi_{\mathcal{B}_\epsilon[x]}\) projects the data onto the \(\epsilon\)-ball around the initial point \(x\). Generating adversarial data requires \(T\) iterations of forward and backward propagation, which makes the training procedure extremely slow.

      The generation procedure of adversarial data in ACL. The adversarial data $\tilde{x}_k^i$ and $\tilde{x}_k^j$ are updated from the low-loss region to the high-loss region step by step according to the loss gradient.

At each epoch, ACL conducts steps (1) and (2) alternately:

      • Step (1): generating adversarial data (i.e., \(\tilde{x}_k^i\) and \(\tilde{x}_k^j\)) via PGD;

      • Step (2): updating model parameters via minimizing adversarial contrastive loss to maximize agreements on the adversarial data and natural data.
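A minimal PyTorch sketch of step (1), reusing the CL module defined above; the helper below is our own illustration, not code from the repository:

import torch

def pgd_adversarial_views(model, cl_loss, x_i, x_j, epsilon=8/255, rho=2/255, T=5):
    # Start from the natural views and ascend the contrastive loss for T steps.
    x_i_adv, x_j_adv = x_i.clone().detach(), x_j.clone().detach()
    for _ in range(T):
        x_i_adv.requires_grad_(True)
        x_j_adv.requires_grad_(True)
        loss = cl_loss(model(x_i_adv), model(x_j_adv))
        grad_i, grad_j = torch.autograd.grad(loss, [x_i_adv, x_j_adv])
        with torch.no_grad():
            # Signed-gradient ascent step, then projection onto the epsilon-ball.
            x_i_adv = x_i_adv + rho * grad_i.sign()
            x_j_adv = x_j_adv + rho * grad_j.sign()
            x_i_adv = x_i + torch.clamp(x_i_adv - x_i, -epsilon, epsilon)
            x_j_adv = x_j + torch.clamp(x_j_adv - x_j, -epsilon, epsilon)
    return x_i_adv.detach(), x_j_adv.detach()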

      How to implement ACL at the pre-training stage in practice?

Click here to see the Pytorch code for calculating the adversarial contrastive loss. You can copy-paste it to calculate the adversarial contrastive loss for convenience. The code is copied from https://github.com/GodXuxilie/Enhancing_ACL_via_AIR.
      import torch
      +import torch.nn as nn
      +import torch.nn.functional as F
      +
      +class ACL(nn.Module):
      +
      +    def __init__(self, normalize=True, temperature=0.5):
      +        super(ACL, self).__init__()
      +        self.normalize = normalize
      +        self.temperature = temperature
      +
      +    def forward(self, zi, zj, zi_adv, zj_adv, weight=0.5):
      +        # zi: the representation of natural view x^i.
      +        # zj: the representation of natural view x^j.
      +        # zi_adv: the representation of adversarial view \tilde{x}^i.
      +        # zj_adv: the representation of adversarial view \tilde{x}^j.
      +
      +        bs = zi.shape[0]
      +        labels = torch.zeros((2*bs,)).long().to(zi.device)
      +        mask = torch.ones((bs, bs), dtype=bool).fill_diagonal_(0)
      +
      +        zi_norm = F.normalize(zi, p=2, dim=-1) if self.normalize else zi
      +        zj_norm = F.normalize(zj, p=2, dim=-1) if self.normalize else zj
      +        zi_adv_norm = F.normalize(zi_adv, p=2, dim=-1) if self.normalize else zi_adv
+        zj_adv_norm = F.normalize(zj_adv, p=2, dim=-1) if self.normalize else zj_adv
      +        
      +        ### Adversarial Contrastive Loss ###
      +
      +        logits_ii = torch.mm(zi_norm, zi_norm.t()) / self.temperature
      +        logits_ij = torch.mm(zi_norm, zj_norm.t()) / self.temperature
      +        logits_ji = torch.mm(zj_norm, zi_norm.t()) / self.temperature
      +        logits_jj = torch.mm(zj_norm, zj_norm.t()) / self.temperature
      +
      +        logits_ij_pos = logits_ij[torch.logical_not(mask)]                                          
      +        logits_ji_pos = logits_ji[torch.logical_not(mask)]                                          
      +        logits_ii_neg = logits_ii[mask].reshape(bs, -1)                                            
      +        logits_ij_neg = logits_ij[mask].reshape(bs, -1)                                             
      +        logits_ji_neg = logits_ji[mask].reshape(bs, -1)                                             
      +        logits_jj_neg = logits_jj[mask].reshape(bs, -1)                                             
      +
      +        pos = torch.cat((logits_ij_pos, logits_ji_pos), dim=0).unsqueeze(1)                         
      +        neg_i = torch.cat((logits_ii_neg, logits_ij_neg), dim=1)                                    
      +        neg_j = torch.cat((logits_ji_neg, logits_jj_neg), dim=1)                                    
      +        neg = torch.cat((neg_i, neg_j), dim=0)                                                      
      +
      +        logits = torch.cat((pos, neg), dim=1)                                                       
      +        nat_contrastive_loss = F.cross_entropy(logits, labels)
      +
      +        logits_ii_adv = torch.mm(zi_adv_norm, zi_adv_norm.t()) / self.temperature
      +        logits_ij_adv = torch.mm(zi_adv_norm, zj_adv_norm.t()) / self.temperature
      +        logits_ji_adv = torch.mm(zj_adv_norm, zi_adv_norm.t()) / self.temperature
      +        logits_jj_adv = torch.mm(zj_adv_norm, zj_adv_norm.t()) / self.temperature
      +
      +        logits_ij_pos_adv = logits_ij_adv[torch.logical_not(mask)]                                         
      +        logits_ji_pos_adv = logits_ji_adv[torch.logical_not(mask)]                                          
      +        logits_ii_neg_adv = logits_ii_adv[mask].reshape(bs, -1)                                            
      +        logits_ij_neg_adv = logits_ij_adv[mask].reshape(bs, -1)                                             
      +        logits_ji_neg_adv = logits_ji_adv[mask].reshape(bs, -1)                                             
      +        logits_jj_neg_adv = logits_jj_adv[mask].reshape(bs, -1)                                             
      +
      +        pos_adv = torch.cat((logits_ij_pos_adv, logits_ji_pos_adv), dim=0).unsqueeze(1)                         
      +        neg_i_adv = torch.cat((logits_ii_neg_adv, logits_ij_neg_adv), dim=1)                                    
      +        neg_j_adv = torch.cat((logits_ji_neg_adv, logits_jj_neg_adv), dim=1)                                    
      +        neg_adv = torch.cat((neg_i_adv, neg_j_adv), dim=0)                                                      
      +
      +        logits_adv = torch.cat((pos_adv, neg_adv), dim=1)                                                       
      +        adv_contrastive_loss = F.cross_entropy(logits_adv, labels)
      +
      +        return (1 - weight) * nat_contrastive_loss + (1 + weight) * adv_contrastive_loss

      Besides, you can use the following script to conduct robust self-supervised pre-training via ACL using ResNet-18 on CIFAR-10:

      # Pre-training stage via ACL
      +git clone https://github.com/GodXuxilie/Enhancing_ACL_via_AIR.git
      +cd Enhancing_ACL_via_AIR
      +PRE_TRAIN_DIR=ACL_ResNet18_cifar10
      +python pretraining.py $PRE_TRAIN_DIR --dataset cifar10 \
      +                                     --model r18 \
      +                                     --DynAug --lambda1 0 --lambda2 0

      How to utilize robust foundation representations via fine-tuning in downstream tasks?

At the fine-tuning stage, a classifier is randomly initialized and appended to the pre-trained feature extractor to solve the classification task. There are three fine-tuning modes:

      1. Standard linear fine-tuning (SLF): only standardly fine-tuning the classifier while freezing the feature extractor.
      2. Adversarial linear fine-tuning (ALF): only adversarially fine-tuning the classifier while freezing the feature extractor.
      3. Adversarial full fine-tuning (AFF): adversarially fine-tuning both the feature extractor and the classifier.
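To make the three modes concrete, here is a minimal PyTorch sketch of how the trainable parameters differ across SLF, ALF, and AFF; the module shapes and names are toy stand-ins, not the repository's actual training code:

import torch
import torch.nn as nn

# Toy stand-ins: in practice `encoder` is the adversarially pre-trained
# ResNet-18 and `classifier` is the randomly initialized linear head.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU())
classifier = nn.Linear(512, 100)  # e.g., 100 classes for CIFAR-100

mode = "SLF"  # one of "SLF", "ALF", "AFF"

if mode in ("SLF", "ALF"):
    # Linear fine-tuning: freeze the feature extractor, train only the classifier.
    for p in encoder.parameters():
        p.requires_grad = False
    params = list(classifier.parameters())
else:  # "AFF"
    # Full fine-tuning: both the feature extractor and the classifier are updated.
    params = list(encoder.parameters()) + list(classifier.parameters())

optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9)

# SLF trains on natural data; ALF and AFF train on adversarial examples
# (e.g., generated by PGD against the downstream cross-entropy loss).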

You can use the following script to transfer an adversarially pre-trained ResNet-18 on CIFAR-10 to the downstream task CIFAR-100 via fine-tuning:

      # Fine-tuning stage
      +cd Enhancing_ACL_via_AIR
      +PRE_TRAIN_DIR=ACL_ResNet18_cifar10
      +FINETUNE_DIR=ACL_ResNet18_cifar10_cifar100
      +MODE=SLF/ALF/AFF/ALL
      +python finetuning.py --mode $MODE \
      +                     --experiment $FINETUNE_DIR \
      +                     --checkpoint ./checkpoints/$PRE_TRAIN_DIR/model.pt \
      +                     --dataset cifar100 \
      +                     --model r18 \
      +                     --eval-AA --eval-OOD --pretraining DynACL

Note that MODE=ALL means that finetuning.py sequentially conducts fine-tuning in all three modes (i.e., SLF, ALF, and AFF) and outputs the result of each fine-tuning mode in the log file $FINETUNE_DIR/results/log.txt.

      Enhancing ACL via Adversarial Invariant Regularization (AIR)

Here, we introduce the NeurIPS 2023 paper which proposes Adversarial Invariant Regularization (AIR) that regulates both standard and robust representations to be style-independent based on a causal theoretical framework. Empirically, AIR yields state-of-the-art performance in terms of robustness against adversarial attacks and common corruptions, as well as standard generalization in downstream tasks.

      Causal View of ACL

AIR first introduces the causal graph of ACL, as shown in the following figure.

      The causal graph of the ACL.

      During the data generation procedure:

      • \(c\) is the content variable, which can be regarded as the original data in the datasets.
• \(s\) is the style factor, which can be regarded as the data transformation functions that modify the content while maintaining its semantic meaning. Note that the factors \(c\) and \(s\) are independent.
      • \(x\) is the natural data, which is decided by the content factor \(c\) and the style factor \(s\).
      • \(y_t \in \{ y_i \}_{i=1}^{T}\) is the label from an unknown downstream task. Note that \(y_t\) is only decided by the content factor \(c\).
• \(y^R\) is the proxy label, which is a refinement of \(y_t\). \(y^R\) is used for self-supervised learning without labels. As illustrated in the following figure, the label dog is refined into the proxy labels golden retriever with yellow hair and labrador retriever with black hair. Therefore, when there is no target label, we can train models by differentiating these two different pictures using the contrastive loss.
The illustration of the proxy label \(y^R\), which is a refinement of the label \(y_t\).
• \(\tilde{x}\) is the adversarial data of \(x\). Since the generation procedure of \(\tilde{x}\) in ACL does not use labels, the adversarial data \(\tilde{x}\) is decided by the natural data \(x\) and the model parameter \(\theta\).

During the learning procedure, ACL optimizes the parameters \(\theta\) by maximizing both conditional probabilities \(p(y^R \mid x)\) and \(p(y^R \mid \tilde{x})\).

The Methodology of AIR

      Style-invariant criterion.

From the causal view of ACL, the learning procedure should satisfy the style-independent criterion. That is to say, the intervention on the style factor should not affect the conditional probability, i.e., \(p^{do(\tau_i)}(y^R \mid x) = p^{do(\tau_j)}(y^R \mid x)\), where \(do(\tau)\) is the intervention approximated by the data augmentation function \(\tau \in \mathcal{T}\).

According to causal reasoning, the style factor \(s\) should not affect \(p(y^R \mid x)\).

      Assuming that the path \(x \rightarrow \tilde{x} \rightarrow y^R\) in the causal graph satisfies the Markov condition, we can obtain that

      \[p(y^R \mid x) = p(y^R \mid \tilde{x})p(\tilde{x} \mid x).\]

      Therefore, ACL should follow the style-independent criterion as follows:

      \[p^{do(\tau_i)}(y^R \mid \tilde{x}) p^{do(\tau_i)}(\tilde{x} \mid x) = p^{do(\tau_j)}(y^R \mid \tilde{x}) p^{do(\tau_j)}(\tilde{x} \mid x) \quad \forall \tau_i, \tau_j \in \mathcal{T} .\]

      The conditional probability \(p^{do(\tau_u)}(y^R \mid \tilde{x})\) for \(u \in \{i,j\}\) is calculated as the cosine similarity between the original data \(x\) and the adversarial data \(\tilde{x}^u\) normalized by the softmax function:

      \[p^{do(\tau_u)}(y^R \mid \tilde{x}) = \frac{e^{\mathrm{sim} \left(f_\theta(x), f_\theta(\tilde{x}^u) \right)/t}} {\sum\limits_{x_k \in B} e^{\mathrm{sim} \left( f_\theta(x_k), f_\theta(\tilde{x}_k^u) \right)/t}}.\]

      Note that \(y^R\) is only decided by the content factor \(c\). Empirically, the content factor \(c\) can be approximated by the original data \(x\) from the datasets.

      The conditional probability \(p^{do(\tau_u)}(\tilde{x} \mid x)\) for \(u \in \{i,j\}\) is calculated as the cosine similarity between the natural data \(x^u\) and the adversarial data \(\tilde{x}^u\) normalized by the softmax function:

      \[p^{do(\tau_u)}(\tilde{x} | x) = \frac{e^{\mathrm{sim} \left(f_\theta(\tilde{x}^u), f_\theta(x^u) \right)/t}} {\sum\limits_{x_k \in B} e^{\mathrm{sim} \left( f_\theta(\tilde{x}_k^u), f_\theta(x_k^u) \right)/t}}.\]
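As a shape-level illustration of the two probabilities above, the following sketch (with random stand-in representations) computes the per-example cosine similarities scaled by the temperature \(t\) and normalizes them over the mini-batch:

import torch
import torch.nn.functional as F

# Random stand-ins for f_θ(x̃_k) and f_θ(x_k) over a mini-batch B of size 8.
f_adv = F.normalize(torch.randn(8, 128), dim=-1)
f_nat = F.normalize(torch.randn(8, 128), dim=-1)
t = 0.5  # temperature

sims = (f_adv * f_nat).sum(dim=-1) / t  # per-example cosine similarity / t
probs = F.softmax(sims, dim=0)          # normalized over the mini-batch B
print(probs.sum())                      # tensor(1.)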

      The loss function of AIR.

      To achieve the style-invariant criterion, AIR is proposed to regulate the representations to be style-independent as follows:

      \[\mathcal{L}_\mathrm{AIR}(B;\theta, \epsilon) = \mathrm{KL}\left(p^{do(\tau_i)}(y^R \mid \tilde{x}) p^{do(\tau_i)}(\tilde{x} \mid x) \| p^{do(\tau_j)}(y^R \mid \tilde{x}) p^{do(\tau_j)}(\tilde{x} \mid x) ; B \right),\]

      in which \(\epsilon \geq 0\) is the adversarial budget, \(B\) is a mini-batch, and \(\mathrm{KL}(p(x) \| q(x); B) = \sum_{x \in B} p(x) \log \frac{p(x)}{q(x)}\) denotes the Kullback–Leibler (KL) divergence.

We provide an illustration of AIR for ACL. AIR aims to maximize the agreements between the original data and the adversarial view (the dashed yellow lines) and the agreements between the natural view and the adversarial view (the dashed pink lines).

      Intuitively, AIR aims to maximize the agreements among different natural views, different adversarial views, and original data.

Learning objective of AIR-enhanced ACL.

      The learning objective of AIR is formulated as follows:

      \[\mathop{\arg\min}_{\theta} \sum_{x \in U} \ell_\mathrm{ACL}(x; \theta) + \lambda_1 \cdot \mathcal{L}_\mathrm{AIR}(U;\theta,0) + \lambda_2 \cdot \mathcal{L}_\mathrm{AIR}(U;\theta,\epsilon),\]

      where \(\lambda_1 \geq 0\) and \(\lambda_2 \geq 0\) are two hyper-parameters.

      The official code of AIR is available at https://github.com/GodXuxilie/Enhancing_ACL_via_AIR.

Click here to see the PyTorch code for calculating the AIR loss. You can copy and paste it to conveniently calculate the AIR loss.
      import torch
      +import torch.nn as nn
      +import torch.nn.functional as F
      +
      +class AIR(nn.Module):
      +
      +    def __init__(self, normalize=True, temperature=0.5):
      +        super(AIR, self).__init__()
      +        self.normalize = normalize
      +        self.temperature = temperature
      +
      +    def forward(self, zi, zj, zi_adv, zj_adv, z_orig, weight=0.5, lambda1=0.5, lambda2=0.5):
      +        # zi: the representation of natural data x^i.
      +        # zj: the representation of natural data x^j.
      +        # zi_adv: the representation of adversarial data \tilde{x}^i.
      +        # zj_adv: the representation of adversarial data \tilde{x}^j.
      +        # z_orig: the representation of original data x.
      +
      +        bs = zi.shape[0]
      +        labels = torch.zeros((2*bs,)).long().to(zi.device)
      +        mask = torch.ones((bs, bs), dtype=bool).fill_diagonal_(0)
      +
      +        zi_norm = F.normalize(zi, p=2, dim=-1) if self.normalize else zi
      +        zj_norm = F.normalize(zj, p=2, dim=-1) if self.normalize else zj
      +        zi_adv_norm = F.normalize(zi_adv, p=2, dim=-1) if self.normalize else zi_adv
      +        zj_adv_norm = F.normalize(zj_adv, p=2, dim=-1) if self.normalize else zj_adv
      +        zo_norm = F.normalize(z_orig, p=2, dim=-1) if self.normalize else z_orig
      +
      +        ### Adversarial Contrastive Loss ###
      +        logits_ii = torch.mm(zi_norm, zi_norm.t()) / self.temperature
      +        logits_ij = torch.mm(zi_norm, zj_norm.t()) / self.temperature
      +        logits_ji = torch.mm(zj_norm, zi_norm.t()) / self.temperature
      +        logits_jj = torch.mm(zj_norm, zj_norm.t()) / self.temperature
      +
      +        logits_ij_pos = logits_ij[torch.logical_not(mask)]                                          
      +        logits_ji_pos = logits_ji[torch.logical_not(mask)]                                          
      +        logits_ii_neg = logits_ii[mask].reshape(bs, -1)                                            
      +        logits_ij_neg = logits_ij[mask].reshape(bs, -1)                                             
      +        logits_ji_neg = logits_ji[mask].reshape(bs, -1)                                             
      +        logits_jj_neg = logits_jj[mask].reshape(bs, -1)                                             
      +
      +        pos = torch.cat((logits_ij_pos, logits_ji_pos), dim=0).unsqueeze(1)                         
      +        neg_i = torch.cat((logits_ii_neg, logits_ij_neg), dim=1)                                    
      +        neg_j = torch.cat((logits_ji_neg, logits_jj_neg), dim=1)                                    
      +        neg = torch.cat((neg_i, neg_j), dim=0)                                                      
      +
      +        logits = torch.cat((pos, neg), dim=1)                                                       
      +        nat_contrastive_loss = F.cross_entropy(logits, labels)
      +
      +        logits_ii_adv = torch.mm(zi_adv_norm, zi_adv_norm.t()) / self.temperature
      +        logits_ij_adv = torch.mm(zi_adv_norm, zj_adv_norm.t()) / self.temperature
      +        logits_ji_adv = torch.mm(zj_adv_norm, zi_adv_norm.t()) / self.temperature
      +        logits_jj_adv = torch.mm(zj_adv_norm, zj_adv_norm.t()) / self.temperature
      +
      +        logits_ij_pos_adv = logits_ij_adv[torch.logical_not(mask)]                                         
      +        logits_ji_pos_adv = logits_ji_adv[torch.logical_not(mask)]                                          
      +        logits_ii_neg_adv = logits_ii_adv[mask].reshape(bs, -1)                                            
      +        logits_ij_neg_adv = logits_ij_adv[mask].reshape(bs, -1)                                             
      +        logits_ji_neg_adv = logits_ji_adv[mask].reshape(bs, -1)                                             
      +        logits_jj_neg_adv = logits_jj_adv[mask].reshape(bs, -1)                                             
      +
      +        pos_adv = torch.cat((logits_ij_pos_adv, logits_ji_pos_adv), dim=0).unsqueeze(1)                         
      +        neg_i_adv = torch.cat((logits_ii_neg_adv, logits_ij_neg_adv), dim=1)                                    
      +        neg_j_adv = torch.cat((logits_ji_neg_adv, logits_jj_neg_adv), dim=1)                                    
      +        neg_adv = torch.cat((neg_i_adv, neg_j_adv), dim=0)                                                      
      +
      +        logits_adv = torch.cat((pos_adv, neg_adv), dim=1)                                                       
      +        adv_contrastive_loss = F.cross_entropy(logits_adv, labels)
      +
      +        ### Adversarial Invariant Regularization ###
      +        logits_io = torch.mm(zi_norm, zo_norm.t()) / self.temperature
      +        logits_jo = torch.mm(zj_norm, zo_norm.t()) / self.temperature
      +        probs_io_zi = F.softmax(logits_io[torch.logical_not(mask)], -1)
      +        probs_jo_zj = F.log_softmax(logits_jo[torch.logical_not(mask)], -1)
      +        AIR_standard = F.kl_div(probs_io_zi, probs_jo_zj, log_target=True, reduction="sum")
      +
      +        logits_io = torch.mm(zi_adv_norm, zi_norm.t()) / self.temperature
      +        logits_jo = torch.mm(zj_adv_norm, zj_norm.t()) / self.temperature
      +        probs_io_zi_adv_consis = F.softmax(logits_io[torch.logical_not(mask)], -1)
      +        probs_jo_zj_adv_consis = F.softmax(logits_jo[torch.logical_not(mask)], -1)
      +
      +        logits_io = torch.mm(zi_adv_norm, zo_norm.t()) / self.temperature
      +        logits_jo = torch.mm(zj_adv_norm, zo_norm.t()) / self.temperature
      +        probs_io_zi_adv = F.softmax(logits_io[torch.logical_not(mask)], -1)
      +        probs_jo_zj_adv = F.softmax(logits_jo[torch.logical_not(mask)], -1)
      +
      +        probs_io_zi_adv = torch.mul(probs_io_zi_adv, probs_io_zi_adv_consis)
      +        probs_jo_zj_adv = torch.mul(probs_jo_zj_adv, probs_jo_zj_adv_consis)
      +        AIR_robust = F.kl_div(probs_io_zi_adv, torch.log(probs_jo_zj_adv), log_target=True, reduction="sum")
      +
      +        return (1 - weight) * nat_contrastive_loss + (1 + weight) * adv_contrastive_loss + lambda1 * AIR_standard + lambda2 * AIR_robust
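As a quick sanity check, the AIR module above can be exercised with random representations; the shapes below are illustrative:

import torch

air = AIR(normalize=True, temperature=0.5)
bs, dim = 8, 128
zi, zj = torch.randn(bs, dim), torch.randn(bs, dim)
zi_adv, zj_adv = torch.randn(bs, dim), torch.randn(bs, dim)
z_orig = torch.randn(bs, dim)
loss = air(zi, zj, zi_adv, zj_adv, z_orig, weight=0.5, lambda1=0.5, lambda2=0.5)
print(loss.item())  # a scalar combining the contrastive and AIR terms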

      Besides, you can use the following script to conduct robust self-supervised pre-training via AIR using ResNet-18 on CIFAR-10:

      # Pre-training stage via AIR
      +git clone https://github.com/GodXuxilie/Enhancing_ACL_via_AIR.git
      +cd Enhancing_ACL_via_AIR
      +PRE_TRAIN_DIR=AIR_ResNet18_cifar10
      +python pretraining.py $PRE_TRAIN_DIR --dataset cifar10 --model r18 --DynAug

      Empirical Results

      AIR yields state-of-the-art cross-task robustness transferability against adversarial attacks.

• \(\mathcal{D}_1 \rightarrow \mathcal{D}_2\) means that the model is pre-trained on dataset \(\mathcal{D}_1\) and fine-tuned on the downstream dataset \(\mathcal{D}_2\).
• SA refers to the standard accuracy, calculated as the average accuracy on the natural test data of the downstream dataset \(\mathcal{D}_2\).
• AA refers to the robust accuracy, calculated as the average accuracy on the adversarial test data generated via adversarial attacks in the downstream dataset \(\mathcal{D}_2\).

      AIR yields state-of-the-art cross-task robustness transferability against common corruptions.

CS-# refers to the average accuracy evaluated on the test data under common corruptions with corruption severity (CS) # \(\in\) {1,3,5} in the downstream dataset \(\mathcal{D}_2\).

      To reproduce the above results of the transferability from CIFAR-10 to CIFAR-100, you can use the following scripts.

      • At the pre-training stage, you can conduct AIR using ResNet-18 on CIFAR-10.
      # Pre-training stage using AIR
      +git clone https://github.com/GodXuxilie/Enhancing_ACL_via_AIR.git
      +cd Enhancing_ACL_via_AIR
      +PRE_TRAIN_DIR=AIR_ResNet18_cifar10
+python pretraining.py $PRE_TRAIN_DIR --dataset cifar10 --model r18 --DynAug
• At the fine-tuning stage, you can fine-tune the pre-trained ResNet-18 on the downstream task CIFAR-100. The following script automatically conducts all three fine-tuning modes (i.e., SLF, ALF, and AFF). After the fine-tuning stage, you can check the standard accuracy and the robust accuracy under adversarial attacks and common corruptions for each fine-tuning mode in the log file at $FINETUNE_DIR/results/log.txt.
      # Fine-tuning stage
      +cd Enhancing_ACL_via_AIR
      +PRE_TRAIN_DIR=AIR_ResNet18_cifar10
      +FINETUNE_DIR=AIR_ResNet18_cifar10_cifar100
+python finetuning.py --experiment $FINETUNE_DIR \
      +                     --checkpoint ./checkpoints/$PRE_TRAIN_DIR/model.pt \
      +                     --dataset cifar100 \
      +                     --model r18 \
      +                     --mode ALL \
      +                     --eval-AA --eval-OOD --pretraining DynACL_AIR

Robust Self-Supervised Learning (RobustSSL) Benchmark

The website of RobustSSL Benchmark is at https://robustssl.github.io/.

      AIR ranks FIRST in RobustSSL Benchmark! For more information regarding the leaderboards, please check the website of RobustSSL Benchmark.

      A screenshot of the leaderboard shown in RobustSSL Benchmark.

      Efficient ACL via Robustness-Aware Coreset Selection (RCS)

Here, we introduce the NeurIPS 2023 spotlight paper which proposes Robustness-Aware Coreset Selection (RCS) that selects an informative coreset without label annotations to speed up ACL. Theoretically, Xu et al. (2023) show that a greedy search algorithm can efficiently find the coreset. Empirically, RCS speeds up both ACL and supervised robust pre-training by a large margin on the CIFAR and ImageNet-1K datasets without significantly hurting robustness transferability. This paper is the first to prove the concept that ACL can be applied to large-scale datasets.

      Motivation—ACL is Inefficient

      ACL is computationally prohibitive on large-scale datasets since generating adversarial data requires expensive computational overheads.

Empirically, ACL on the entire ImageNet-1K dataset (1,281,167 training data points) requires about 650 hours on RTX A5000 GPUs. Due to this inefficiency, ACL had not been applied to ImageNet-1K before RCS.

ACL is inefficient because \(T\) PGD steps require expensive computational overheads.

The Methodology of RCS

      Intuition of RCS.

To speed up ACL, RCS takes an intuitive idea: find an informative training subset (called a “coreset”). The coreset directly decreases the number of training samples, thus significantly accelerating ACL. Besides, since the coreset is informative for improving \(f\)'s adversarial robustness, training on it should still allow ACL to output an effective robust foundation model.

RCS generates an informative coreset to make ACL efficiently obtain an effective robust foundation model. Image from https://medium.com/analytics-vidhya/sampling-statistical-approach-in-machine-learning-4903c40ebf86.

      Representational Distance (RD) as a measurement of \(f\)’s adversarial robustness without labels.

      RD of a data point \(\ell_\mathrm{RD}(x;\theta)\) is quantified by the representational distance between the natural data and its adversarial counterpart, i.e.,

      \[\ell_{\mathrm{RD}}(x; \theta) = d(g \circ f_\theta(\tilde{x}), g \circ f_\theta(x)) \quad \mathrm{s.t.} \quad \tilde{x} = \mathop{\arg\max}_{x^{\prime} \in \mathcal{B}_\epsilon[x]} \quad d(g \circ f_\theta(x^{\prime}), g \circ f_\theta(x)),\]

in which the PGD method is used to generate adversarial data \(\tilde{x}\) within the \(\epsilon\)-ball centered at \(x\) and \(d(\cdot, \cdot): \mathcal{V} \times \mathcal{V} \rightarrow \mathbb{R}\) is a distance function, such as the KL divergence. The smaller the RD, the less sensitive the representations are to adversarial perturbations, and thus the more adversarially robust they are.
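Below is a minimal sketch of computing the RD loss when \(d(\cdot,\cdot)\) is instantiated as the KL divergence between softmax-normalized outputs; the PGD settings and the `model` callable (standing for \(g \circ f_\theta\)) are illustrative assumptions, not the official implementation:

import torch
import torch.nn.functional as F

def rd_loss(model, x, eps=8 / 255, alpha=2 / 255, steps=5):
    # Natural reference distribution g∘f_θ(x), treated as fixed.
    p_nat = F.softmax(model(x), dim=-1).detach()
    # PGD: maximize the KL divergence within the ε-ball around x
    # (clamping x + delta to the valid input range is omitted for brevity).
    delta = torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        delta.requires_grad_(True)
        log_p_adv = F.log_softmax(model(x + delta), dim=-1)
        div = F.kl_div(log_p_adv, p_nat, reduction="batchmean")
        grad = torch.autograd.grad(div, delta)[0]
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
    # RD of the batch: distance between natural and adversarial representations.
    log_p_adv = F.log_softmax(model(x + delta), dim=-1)
    return F.kl_div(log_p_adv, p_nat, reduction="batchmean")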

      Objective function of RCS.

      To realize the intuitive idea, RCS is formulated as follows:

      \[S^* = \mathop{\arg\min}_{S \subseteq X, |S|/|X| = k} \mathcal{L}_{\mathrm{RD}}(U; \theta(S)),\] \[\theta(S) = \mathop{\arg\min}_{\theta} \mathcal{L}_\mathrm{ACL}(S; \theta),\]

in which \(S^*\) is the coreset, \(U\) is an unlabeled validation set, \(k \in (0,1]\) is the subset fraction that controls the size of the coreset, \(\mathcal{L}_{\mathrm{RD}}(U; \theta(S)) = \sum_{x \in U} \ell_\mathrm{RD}(x; \theta(S))\), and \(\mathcal{L}_\mathrm{ACL}(S; \theta) = \sum_{x \in S} \ell_\mathrm{ACL}(x; \theta)\).

Intuitively, given a coreset \(S^*\), after the model parameters are updated to \(\theta(S^{*})\) via minimizing the ACL loss on the coreset \(\mathcal{L}_\mathrm{ACL}(S^*; \theta)\), the model will achieve the minimized RD loss on the validation dataset \(\mathcal{L}_{\mathrm{RD}}(U; \theta(S^*))\), thus being adversarially robust.

      Then, RCS can be converted into a problem of maximizing a set function subject to a cardinality constraint as follows:

      \[S^* = \mathop{\arg\max}_{S \subseteq X, |S|/|X| = k} G_\theta(S),\] \[G_\theta(S \subseteq X) \triangleq - \mathcal{L}_\mathrm{RD}(U; \theta(S)) = - \mathcal{L}_\mathrm{RD}(U; \theta - \eta \nabla_\theta \mathcal{L}_\mathrm{ACL}(S; \theta)),\]

      where \(G:2^\mathcal{X} \rightarrow \mathbb{R}\) is a set function, \(\theta(S)\) is estimated using the one-step approximation and \(\eta \in \mathbb{R}^+\) is the learning rate.

      RCS via Greedy Search.

The vanilla solution of traversing all subsets and selecting the subset that has the largest \(G_\theta(S)\) is intractable. Xu et al. (2023) show that the set function \(G_\theta(S)\) satisfies the following two critical properties, which motivate using greedy search to efficiently find the coreset.

The set function \(G_\theta(S)\) is proved to be submodular (in reality, the authors of RCS rigorously proved that a proxy set function is weakly submodular, and that the greedy search algorithm provides a guaranteed lower bound for the proposed set function maximization problem based on this weakly submodular proxy; for more details, please refer to the paper of RCS), which satisfies the following two properties:

• Monotonicity: As more data is added to the set, the representation becomes better.
  \(G_\theta(x \mid S) = G_\theta(S \cup \{x\}) - G_\theta(S) \geq 0\) for any \(S \subseteq X\) and \(x \in X \setminus S\).
• Diminishing returns: As the set has more data, the marginal gain of extra data for learning representations gradually diminishes.
  \(G_\theta(x \mid A) \geq G_\theta(x \mid B)\) for any \(A \subseteq B \subseteq X\) and \(x \in X \setminus B\).

Therefore, RCS greedily searches for the data \(x\) that has the largest marginal gain and adds it to the coreset.

      Pseudo-code of efficient ACL via RCS.

      • Step 1 (Warm-up): Warm up training on the entire training set to find a better starting point \(f_\theta\).
      • Step 2.1 (RCS): \(S \gets\emptyset\). \(\theta' \gets \theta\). Compute gradients \(Q \gets \{ q_k = \nabla_\theta \mathcal{L}_\mathrm{ACL}(x_k; \theta) \mid \forall x_k \in X \}\) on unlabeled training dataset \(X\).
      • Step 2.2 (RCS): Compute gradients \(q_U \gets \nabla_\theta \mathcal{L}_\mathrm{RD}(U; \theta')\) on unlabeled validation dataset \(U\).
      • Step 2.3 (RCS): Select a data \(x_k\), whose gradient \(q_k\) matches best with \(q_U\), i.e., \(\mathop{\arg\max}_k \{q_k^\top q_U \}\).
      • Step 2.4 (RCS): \(S \gets S \cup \{x_k\}\), \(X \gets X \setminus \{ x_k \}\), \(\theta' \gets \theta' - \eta' q_k\).
      • Step 2.5 (RCS): Repeat Steps 2.2-2.4 until \(\mid S\mid/\mid X\mid = k\).
      • Step 3 (ACL training): Update parameters \(\theta \gets \theta - \eta \nabla_\theta \mathcal{L}_\mathrm{ACL}(S; \theta)\).
• Step 4: Every few epochs, go to Step 2.1 to generate a new coreset; otherwise, go to Step 3 to update the model parameters. The algorithm stops when reaching the final training epoch.
      A pipeline of efficient ACL via RCS. After the warm-up periods, the model is trained on the coreset. Thus, RCS makes the training procedure much more efficient by decreasing the number of training data.

Intuitively, RCS greedily selects and adds into the coreset the data \(x\) whose training loss gradient (i.e., \(\nabla_\theta\mathcal{L}_\mathrm{ACL}(\{x\}; \theta)\)) matches best with the validation loss gradient (i.e., \(\nabla_\theta\mathcal{L}_\mathrm{RD}(U; \theta(S))\)). In this way, training on the data selected by RCS is most beneficial in optimizing the RD loss, which is thus most helpful for improving \(f\)'s adversarial robustness.
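The following is a minimal sketch of this greedy gradient-matching loop (Steps 2.1-2.5 above); the per-example gradient matrix `train_grads` and the callable `val_grad_fn` are illustrative assumptions (e.g., last-layer gradients as a tractable proxy), not the official batched implementation:

import torch

def rcs_greedy_select(train_grads, val_grad_fn, fraction, lr=0.01):
    # train_grads: (N, d) per-example ACL-loss gradients q_k;
    # val_grad_fn: callable returning the RD-loss gradient q_U on the
    # validation set, given the current parameter shift θ' - θ.
    n, d = train_grads.shape
    budget = int(fraction * n)
    available = torch.ones(n, dtype=torch.bool)
    theta_shift = torch.zeros(d)                # accumulates θ' - θ
    selected = []
    for _ in range(budget):
        q_u = val_grad_fn(theta_shift)          # Step 2.2
        scores = train_grads @ q_u              # Step 2.3: gradient matching
        scores[~available] = float("-inf")
        k = int(torch.argmax(scores))           # best-matching example
        selected.append(k)                      # Step 2.4
        available[k] = False
        theta_shift = theta_shift - lr * train_grads[k]  # θ' ← θ' - η' q_k
    return selected

# Example with random stand-in gradients: pick 20% of 1,000 candidates.
coreset = rcs_greedy_select(torch.randn(1000, 64), lambda shift: torch.randn(64), 0.2)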

      The official code of RCS is available at https://github.com/GodXuxilie/Efficient_ACL_via_RCS.

      Experimental Results

      RCS significantly speeds up ACL on CIFAR-10.

• The term speed-up ratio refers to the ratio of the time consumption of pre-training on the full training set to the time consumption of pre-training on the training subset. Thus, the larger the speed-up ratio, the more efficient the pre-training procedure.
      • The terms standard test accuracy and robust test accuracy refer to the average accuracy evaluated on natural test data and adversarial test data, respectively. Thus, the higher the line is, the more effective the pre-training method is.

The results obtained by RCS are located in the upper-right corner, which means that RCS is both more efficient and more effective.

      To reproduce the above results of the robustness transferability from CIFAR-10 to CIFAR-100, you can use the following scripts.

      • At the pre-training stage, you can conduct ACL via RCS using ResNet-18 on CIFAR-10.
      # Pre-training stage using RCS
      +git clone https://github.com/GodXuxilie/Efficient_ACL_via_RCS.git
      +cd Efficient_ACL_via_RCS/ACL_RCS/small_scale_datasets
      +PRE_TRAIN_DIR=ACL_RCS_ResNet18_cifar10
      +python DynACL_RCS.py $PRE_TRAIN_DIR --ACL_DS --dataset cifar10 --fraction 0.2
• At the fine-tuning stage, you can fine-tune the pre-trained ResNet-18 on CIFAR-100. The test accuracies are saved in $FINETUNE_DIR/results/log.txt.
      # Fine-tuning stage (SLF, ALF, AFF)
      +cd Efficient_ACL_via_RCS/ACL_RCS/small_scale_datasets
      +PRE_TRAIN_DIR=ACL_RCS_ResNet18_cifar10
      +FINETUNE_DIR=ACL_RCS_ResNet18_cifar10_cifar100
      +python finetuning.py --experiment $FINETUNE_DIR \
      +                     --checkpoint ./checkpoints/$PRE_TRAIN_DIR/model.pt \
      +                     --dataset cifar100 \
      +                     --model r18 \
      +                     --mode ALL --eval-AA --eval-OOD --pretraining DynACL_RCS

      For the first time, ACL was conducted efficiently on ImageNet-1K via RCS. The results prove the possibility of applying ACL on large-scale datasets. Here, SA refers to standard test accuracy and RA refers to the robust test accuracy.

      To reproduce the above results of the robustness transferability from ImageNet-1K to CIFAR-10, you can use the following scripts.

• At the pre-training stage, you can conduct ACL via RCS using a Wide ResNet of width 10 and depth 28 (WRN-28-10) on ImageNet-1K at \(32 \times 32\) resolution.
      # Pre-training stage using RCS
      +git clone https://github.com/GodXuxilie/Efficient_ACL_via_RCS.git
      +cd Efficient_ACL_via_RCS/ACL_RCS/ImageNet_32
      +PRE_TRAIN_DIR=ACL_RCS_WRN_ImageNet
      +python ACL_RCS.py $PRE_TRAIN_DIR --gpu 0,1,2,3 --ACL_DS --fraction 0.05
      • At the fine-tuning stage, you can fine-tune the ImageNet-1K pre-trained models on CIFAR-10.
      cd Efficient_ACL_via_RCS/ACL_RCS/ImageNet_32
      +PRE_TRAIN_DIR=ACL_RCS_WRN_ImageNet
      +FINETUNE_DIR=ACL_RCS_WRN_ImageNet_cifar10
      +# Fine-tuning stage (SLF)
      +python transfer.py --out_dir $FINETUNE_DIR/SLF \
+                   --resume $PRE_TRAIN_DIR/model.pt \
      +                   --dataset cifar10 \
      +                   --lr 0.01 --linear 
      +# Fine-tuning stage (ALF)
      +python adv_tune.py --out_dir $FINETUNE_DIR/ALF \
      +                   --resume $PRE_TRAIN_DIR/model.pt \
      +                   --dataset cifar10 \
      +                   --lr 0.1 --linear 
      +# Fine-tuning stage (AFF)
      +python adv_tune.py --out_dir $FINETUNE_DIR/AFF \
      +                   --resume $PRE_TRAIN_DIR/model.pt \
      +                   --dataset cifar10 \
      +                   --lr 0.1

      RCS can speed up Standard Adversarial Training (SAT) on ImageNet-1K. The results show that RCS is applicable to robust pre-training in the supervised setting.

      To reproduce the above results of the robustness transferability from ImageNet-1K to CIFAR-10, you can use the following scripts.

      • At the pre-training stage, you can conduct SAT using WRN-28-10 on ImageNet-1K of \(32 \times 32\) resolution.
      git clone https://github.com/GodXuxilie/Efficient_ACL_via_RCS.git
      +cd Efficient_ACL_via_RCS/SAT_RCS/ImageNet_32
      +# Pre-training stage using RCS
      +PRE_TRAIN_DIR=SAT_RCS_WRN_ImageNet
      +nohup python SAT_RCS.py --gpu 0,1,2,3 --out_dir $PRE_TRAIN_DIR --fraction 0.2
      • At the fine-tuning stage, you can fine-tune ImageNet-1K pre-trained WRN-28-10 on CIFAR-10.
      cd Efficient_ACL_via_RCS/SAT_RCS/ImageNet_32
      +PRE_TRAIN_DIR=SAT_RCS_WRN_ImageNet
      +FINETUNE_DIR=SAT_RCS_WRN_ImageNet_cifar10
      +# Fine-tuning stage (ALF)
      +python adv_tune.py --out_dir $FINETUNE_DIR/ALF \
      +                   --resume $PRE_TRAIN_DIR/checkpoint.pth.tar \
      +                   --dataset cifar10 \
      +                   --lr 0.1 \
      +                   --linear 
      +# Fine-tuning stage (AFF)
      +python adv_tune.py --out_dir $FINETUNE_DIR/AFF \
+                   --resume $PRE_TRAIN_DIR/checkpoint.pth.tar \
      +                   --dataset cifar10 \
      +                   --lr 0.1
      \ No newline at end of file diff --git a/blog/the-n-implementation-details-of-rlhf-with-ppo/index.html b/blog/the-n-implementation-details-of-rlhf-with-ppo/index.html new file mode 100644 index 00000000..53c308f4 --- /dev/null +++ b/blog/the-n-implementation-details-of-rlhf-with-ppo/index.html @@ -0,0 +1,400 @@ + The N Implementation Details of RLHF with PPO | ICLR Blogposts 2024

      The N Implementation Details of RLHF with PPO

      Reinforcement Learning from Human Feedback (RLHF) is pivotal in the modern application of language modeling, as exemplified by ChatGPT. This blog post delves into an in-depth exploration of RLHF, attempting to reproduce the results from OpenAI's inaugural RLHF paper, published in 2019. Our detailed examination provides valuable insights into the implementation details of RLHF, which often go unnoticed.

Reinforcement Learning from Human Feedback (RLHF) has been an impactful technique for training modern language models such as ChatGPT. In our quest to research more on RLHF, this blog post closely examines OpenAI’s inaugural RLHF paper published in 2019 together with its open-source codebase available at openai/lm-human-preferences. Despite being based on TensorFlow 1.x, the codebase released by OpenAI is very well-evaluated and benchmarked, making it a good place to study RLHF implementation engineering details.

      We aim to:

      1. reproduce OpenAI’s results in stylistic tasks and match the learning curves of openai/lm-human-preferences, using the modern PyTorch and JAX frameworks in conjunction with HuggingFace Transformers that are predominantly used by the open-source community nowadays;
      2. present a checklist of implementation details, similar to the spirit of The 37 Implementation Details of Proximal Policy Optimization and Debugging RL, Without the Agonizing Pain;
3. provide a simple-to-read and minimal reference implementation of RLHF.

      This work is just for educational / learning purposes. For advanced users requiring more features, such as running larger models with parameter-efficient fine-tuning, huggingface/trl would be a great choice.

      • In Matching Learning Curves, we show our main contribution: creating a codebase that can reproduce OpenAI’s results in the stylistic tasks and matching learning curves very closely with openai/lm-human-preferences.
      • We then take a technical deep dive into the implementation details that are relevant to reproducing OpenAI’s work. In General Implementation Details, we talk about basic details, such as how rewards/values are generated and how responses are generated. In Reward Model Implementation Details, we talk about details such as reward normalization. In Policy Training Implementation Details, we discuss details such as rejection sampling and reward “whitening”.
• Next, we examine the effect of training different base models (e.g., gpt2-xl, falcon-1b) given that the reward labels are produced with gpt2-large.
      • Finally, we conclude our work with limitations and discussions.

      Here are the important links:

      Matching Learning Curves

      Our main contribution is to reproduce OpenAI’s results in stylistic tasks, such as sentiment and descriptiveness. As shown in the figure below, our codebase (orange curves) can produce nearly identical learning curves as OpenAI’s codebase (blue curves).

      A note on running openai/lm-human-preferences

      To make a direct comparison, we ran the original RLHF code at openai/lm-human-preferences, which will offer valuable metrics to help validate and diagnose our reproduction. We were able to set the original TensorFlow 1.x code up, but it requires a hyper-specific setup:

      • OpenAI’s dataset was partially corrupted/lost (so we replaced them with similar HF datasets, which may or may not cause a performance difference)
      • It can’t run on 1 V100 because it doesn’t implement gradient accumulation. Instead, it uses a large batch size and splits the batch across 8 GPUs, and will OOM on just 1 GPU.
• It can’t run on 8x A100 because it uses TensorFlow 1.x, which is incompatible with CUDA 8+
      • It can’t run on 8x V100 (16GB) because it will OOM
      • It can only run on 8x V100 (32GB), which is only offered by AWS as the p3dn.24xlarge instance.

      General Implementation Details

      We now take a technical deep dive into the implementation details that are relevant to reproducing OpenAI’s work. In this section, we talk about basic details, such as how rewards/values are generated and how responses are generated. Here are these details in no particular order:

      1. The reward model and policy’s value head take input as the concatenation of query and response
  1. The reward model and policy’s value head do not only look at the response. Instead, they concatenate the query and response together as query_response (lm_human_preferences/rewards.py#L105-L107).
  2. So, for example, if query = "he was quiet for a minute, his eyes unreadable.", and the response = "He looked at his left hand, which held the arm that held his arm out in front of him.", then the reward model and policy’s value head do a forward pass on query_response = "he was quiet for a minute, his eyes unreadable. He looked at his left hand, which held the arm that held his arm out in front of him." and produce rewards and values of shape (B, T, 1), where B is the batch size, T is the sequence length, and 1 is the reward head dimension of 1 (lm_human_preferences/rewards.py#L105-L107, lm_human_preferences/policy.py#L111).
        3. The T means that each token has a reward associated with it and its previous context. For example, the eyes token would have a reward corresponding to he was quiet for a minute, his eyes.
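  4. As a shape-level sketch of the two details above (the backbone and head here are toy stand-ins, not the actual transformer):

import torch
import torch.nn as nn

B, Tq, Tr, hidden, vocab = 2, 5, 4, 16, 100
query = torch.randint(0, vocab, (B, Tq))
response = torch.randint(0, vocab, (B, Tr))
query_response = torch.cat((query, response), dim=1)  # (B, Tq + Tr)

backbone = nn.Embedding(vocab, hidden)   # stand-in for the transformer
reward_head = nn.Linear(hidden, 1)       # scalar head over hidden states
rewards = reward_head(backbone(query_response))  # (B, T, 1), T = Tq + Tr
print(rewards.shape)  # torch.Size([2, 9, 1])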
      2. Pad with a special padding token and truncate inputs.
  1. OpenAI sets a fixed input length for query, query_length; it pads sequences that are too short with pad_token (lm_human_preferences/language/datasets.py#L66-L67) and truncates sequences that are too long (lm_human_preferences/language/datasets.py#L57). (See here for a general introduction to the concept.) When padding the inputs, OpenAI uses a token beyond the vocabulary (lm_human_preferences/language/encodings.py#L56).
    1. Note on HF’s transformers — padding token. According to (transformers#2630#issuecomment-578159876), padding tokens were not used during the pre-training of GPT and GPT-2; therefore transformers’ gpt2 models have no official padding token associated with their tokenizer. A common practice is to set tokenizer.pad_token = tokenizer.eos_token, but in this work, we shall distinguish these two special tokens to match OpenAI’s original setting, so we will use tokenizer.add_special_tokens({"pad_token": "[PAD]"}).

          Note that having no padding token is a default setting for decoder models, since they train with “packing” during pretraining, which means that many sequences are concatenated and separated by the EOS token and chunks of this sequence that always have the max length are fed to the model during pretraining.

        2. When putting everything together, here is an example
         import transformers
        + tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2", padding_side="right")
        + tokenizer.add_special_tokens({"pad_token": "[PAD]"})
        + query_length = 5
        + texts = [
        +     "usually, he would",
        +     "she thought about it",
        + ]    
        + tokens = []
        + for text in texts:
        +     tokens.append(tokenizer.encode(text)[:query_length])
        +    
        + print("tokens", tokens)
        + inputs = tokenizer.pad(
        +     {"input_ids": tokens},
        +     padding="max_length",
        +     max_length=query_length,
        +     return_tensors="pt",
        +     return_attention_mask=True,
        + )
        + print("inputs", inputs)
        +    
        + """prints are
        + tokens [[23073, 11, 339, 561], [7091, 1807, 546, 340]]
        + inputs {'input_ids': tensor([[23073,    11,   339,   561, 50257],
        +         [ 7091,  1807,   546,   340, 50257]]), 'attention_mask': tensor([[1, 1, 1, 1, 0],
        +         [1, 1, 1, 1, 0]])}
        + """
        +
      3. Adjust position indices correspondingly for padding tokens
        1. When calculating the logits, OpenAI’s code works by masking out padding tokens properly. This is achieved by finding out the token indices corresponding to the padding tokens (lm_human_preferences/language/model.py#L296-L297), followed by adjusting their position indices correspondingly (lm_human_preferences/language/model.py#L320).
  2. For example, if the query=[23073, 50259, 50259] and response=[11, 339, 561] (where 50259 is OpenAI’s padding token), it then creates position indices as [[0 1 1 1 2 3]] and logits as follows. Note how the logits corresponding to the padding tokens remain the same as before! This is the effect we should be aiming for in our reproduction.

           all_logits [[[ -35.28693   -34.2875    -38.16074  ...  -41.595802  -41.082108
          +     -35.36577 ]
          +   [ -35.28693   -34.2875    -38.16074  ...  -41.595802  -41.082108
          +     -35.36577 ]
          +   [ -35.28693   -34.2875    -38.16074  ...  -41.595802  -41.082108
          +     -35.36577 ]
          +   [-111.303955 -110.94471  -112.90624  ... -113.13064  -113.7788
          +    -109.17345 ]
          +   [-111.51512  -109.61077  -114.90231  ... -118.43514  -111.56671
          +    -112.12478 ]
          +   [-122.69775  -121.84468  -128.27417  ... -132.28055  -130.39604
          +    -125.707756]]] (1, 6, 50257)
          +
        3. Note on HF’s transformers — position_ids and padding_side. We can replicate the exact logits using Hugging Face’s transformer with 1) left padding and 2) pass in the appropriate position_ids:

           import torch
          + import transformers
          + tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2", padding_side="right")
          + tokenizer.add_special_tokens({"pad_token": "[PAD]"})
          + pad_id = tokenizer.pad_token_id
          + query = torch.tensor([
          +     [pad_id, pad_id, 23073],
          + ])
          + response = torch.tensor([
          +     [11, 339, 561],
          + ])
          + temperature = 1.0
          +        
          + query = torch.tensor(query)
          + response = torch.tensor(response).long()
          + context_length = query.shape[1]
          + query_response = torch.cat((query, response), 1)
          + pretrained_model = transformers.AutoModelForCausalLM.from_pretrained("gpt2")
          + def forward(policy, query_responses, tokenizer):
          +     attention_mask = query_responses != tokenizer.pad_token_id
          +     position_ids = attention_mask.cumsum(1) - attention_mask.long()  # exclusive cumsum
          +     input_ids = query_responses.clone()
          +     input_ids[~attention_mask] = 0
          +     return policy(
          +         input_ids=input_ids,
          +         attention_mask=attention_mask,
          +         position_ids=position_ids,
          +         return_dict=True,
          +         output_hidden_states=True,
          +     )
          + output = forward(pretrained_model, query_response, tokenizer)
          + logits = output.logits
          + logits /= temperature
          + print(logits)
          +        
          + """
          + tensor([[[ -26.9395,  -26.4709,  -30.0456,  ...,  -33.2208,  -33.2884,
          +            -27.4360],
          +          [ -27.1677,  -26.7330,  -30.2386,  ...,  -33.6813,  -33.6931,
          +            -27.5928],
          +          [ -35.2869,  -34.2875,  -38.1608,  ...,  -41.5958,  -41.0821,
          +            -35.3658],
          +          [-111.3040, -110.9447, -112.9062,  ..., -113.1306, -113.7788,
          +           -109.1734],
          +          [-111.5152, -109.6108, -114.9024,  ..., -118.4352, -111.5668,
          +           -112.1248],
          +          [-122.6978, -121.8447, -128.2742,  ..., -132.2805, -130.3961,
          +           -125.7078]]], grad_fn=<DivBackward0>)
          + """
          +
        4. Note on HF’s transformers — position_ids during generate: during generate we should not pass in position_ids because the position_ids are already adjusted in transformers (see huggingface/transformers#/7552).

  Usually, we almost never pass position_ids in transformers. All the masking and shifting logic is already implemented, e.g., in the generate function (need permanent code link).

      4. Response generation samples a fixed-length response without padding.
  1. During response generation, OpenAI uses top_k=0, top_p=1.0 and just does categorical sampling across the vocabulary (lm_human_preferences/language/sample.py#L43), and the code keeps sampling until a fixed-length response is generated (lm_human_preferences/policy.py#L103). Notably, even if it encounters EOS (end-of-sequence) tokens, it keeps sampling.
  2. Note on HF’s transformers — sampling could stop at eos_token: in transformers, the generation could stop at eos_token (src/transformers/generation/utils.py#L2248-L2256), which is not the same as OpenAI’s setting. To align the settings, we need to set pretrained_model.generation_config.eos_token_id = None and pretrained_model.generation_config.pad_token_id = None. Note that transformers.GenerationConfig(eos_token_id=None, pad_token_id=None, ...) does not work because pretrained_model.generation_config would override it and set an eos_token.

           import torch
          + import transformers
          + tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2", padding_side="right")
          + tokenizer.add_special_tokens({"pad_token": "[PAD]"})
          + pad_id = tokenizer.pad_token_id
          + query = torch.tensor([
          +     [pad_id, pad_id, 23073],
          + ])
          + response = torch.tensor([
          +     [11, 339, 561],
          + ])
          + response_length = 4
          + temperature = 0.7
          + pretrained_model = transformers.AutoModelForCausalLM.from_pretrained("gpt2")
          + pretrained_model.generation_config.eos_token_id = None # disable `pad_token_id` and `eos_token_id` because we just want to
          + pretrained_model.generation_config.pad_token_id = None  # generate tokens without truncation / padding
          + generation_config = transformers.GenerationConfig(
          +     max_new_tokens=response_length,
          +     min_new_tokens=response_length,
          +     temperature=temperature,
          +     top_k=0.0,
          +     top_p=1.0,
          +     do_sample=True,
          + )
          + context_length = query.shape[1]
          + attention_mask = query != tokenizer.pad_token_id
          + input_ids = query.clone()
          + input_ids[~attention_mask] = 0  # set padding tokens to 0
          + output = pretrained_model.generate(
          +     input_ids=input_ids,
          +     attention_mask=attention_mask,
          +     # position_ids=attention_mask.cumsum(1) - attention_mask.long(), # generation collapsed if this was turned on.
          +     generation_config=generation_config,
          +     return_dict_in_generate=True,
          + )
          + print(output.sequences)
          +        
          + """
          + tensor([[    0,     0, 23073, 16851,    11,   475,   991]])
          + """
          +
  3. Note that in a more recent codebase https://github.com/openai/summarize-from-feedback, OpenAI does stop sampling when encountering EOS tokens (summarize_from_feedback/utils/experiment_helpers.py#L19). However, in this work we aim for a 1:1 replication, so we align with the setting that keeps sampling even after an eos_token is encountered.
      5. Learning rate annealing for reward model and policy training.
        1. As Ziegler et al. (2019) suggested, the reward model is trained for a single epoch to avoid overfitting the limited amount of human annotation data (e.g., the descriptiveness task only had about 5000 labels). During this single epoch, the learning rate is annealed to zero (lm_human_preferences/train_reward.py#L249).
        2. Similar to reward model training, the policy’s learning rate is annealed to zero (lm_human_preferences/train_policy.py#L172-L173).
      6. Use different seeds for different processes
        1. When spawning 8 GPU processes to do data parallelism, OpenAI sets a different random seed per process (lm_human_preferences/utils/core.py#L108-L111). Implementation-wise, this is done via local_seed = args.seed + process_rank * 100003. The seed is going to make the model produce different responses and get different scores, for example.
          1. Note: We believe the dataset shuffling has a bug — the dataset is shuffled using the same seed for some reason (lm_human_preferences/lm_tasks.py#L94-L97).
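  2. A minimal sketch of this per-process seeding scheme follows; the seeding calls are ours for illustration, and only the local_seed formula comes from the codebase:

import random

import numpy as np
import torch

def set_process_seed(base_seed: int, process_rank: int) -> int:
    local_seed = base_seed + process_rank * 100003  # formula from the codebase
    random.seed(local_seed)
    np.random.seed(local_seed)
    torch.manual_seed(local_seed)
    return local_seed

# Each of the 8 data-parallel processes would call, e.g.:
# set_process_seed(base_seed=1, process_rank=rank)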

      Reward Model Implementation Details

      In this section, we discuss reward-model-specific implementation details. We talk about details such as reward normalization and layer initialization. Here are these details in no particular order:

      1. The reward model only outputs the value at the last token.
        1. Notice that the rewards obtained after the forward pass on the concatenation of query and response will have the shape (B, T, 1), where B is the batch size, T is the sequence length (which is always the same; it is query_length + response_length = 64 + 24 = 88 in OpenAI’s setting for stylistic tasks, see launch.py#L9-L11), and 1 is the reward head dimension of 1. For RLHF purposes, the original codebase extracts the reward of the last token (lm_human_preferences/rewards.py#L132), so that the rewards will only have shape (B, 1).
  2. Note that in a more recent codebase openai/summarize-from-feedback, OpenAI stops sampling when encountering EOS tokens (summarize_from_feedback/utils/experiment_helpers.py#L19). When extracting rewards, it identifies the last_response_index, the index before the EOS token (#L11-L13), and extracts the reward at that index (summarize_from_feedback/reward_model.py#L59). However, in this work we just stick with the original setting.
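  3. In code, extracting the last token's reward is a simple slice; the shapes follow OpenAI's stylistic-task setting above, and the tensor here is a random stand-in:

import torch

rewards = torch.randn(2, 88, 1)   # (B, T, 1) with T = 64 + 24
last_reward = rewards[:, -1, :]   # (B, 1): keep only the last token's reward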
      2. Reward head layer initialization
        1. The weight of the reward head is initialized according to \( \mathcal{N}\left(0,1 /\left(\sqrt{d_{\text {model }}+1}\right)\right) \) (lm_human_preferences/language/model.py#L368, lm_human_preferences/language/model.py#L251-L252). This aligns with the settings in Stiennon et al., 2020 (summarize_from_feedback/query_response_model.py#L106-L107) (P.S., Stiennon et al., 2020 had a typo on page 17 saying the distribution is \( \mathcal{N}\left(0,1 /\left(d_{\text {model }}+1\right)\right) \) without the square root)
        2. The bias of the reward head is set to 0 (lm_human_preferences/language/model.py#L254).
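  3. A minimal PyTorch sketch of this initialization (the d_model value is illustrative):

import torch.nn as nn

d_model = 768  # e.g., gpt2's hidden size
reward_head = nn.Linear(d_model, 1)
nn.init.normal_(reward_head.weight, std=1 / (d_model + 1) ** 0.5)  # N(0, 1/sqrt(d_model + 1))
nn.init.zeros_(reward_head.bias)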
      3. Reward model normalization before and after
  1. In the paper, Ziegler et al. (2019) mentioned that “to keep the scale of the reward model consistent across training, we normalize it so that it has mean 0 and variance 1 for
          \( x \sim \mathcal{D}, y \sim \rho(·|x) \).” To perform the normalization process, the code first creates a reward_gain and reward_bias, such that the reward can be calculated by reward = reward * reward_gain + reward_bias (lm_human_preferences/rewards.py#L50-L51).
        2. When performing the normalization process, the code first sets reward_gain=1, reward_bias=0 (lm_human_preferences/train_reward.py#L211), followed by collecting sampled queries from the target dataset (e.g., bookcorpus, tldr, cnndm), completed responses, and evaluated rewards. It then gets the empirical mean and std of the evaluated reward (lm_human_preferences/train_reward.py#L162-L167) and tries to compute what the reward_gain and reward_bias should be.
        3. Let us use \( \mu_{\mathcal{D}} \) to denote the empirical mean, \( \sigma_{\mathcal{D}} \) the empirical std, \(g\) the reward_gain, \(b\) reward_bias, \( \mu_{\mathcal{T}} = 0\) target mean and \( \sigma_{\mathcal{T}}=1\) target std. Then we have the following formula.
        \[\begin{aligned}g*\mathcal{N}(\mu_{\mathcal{D}}, \sigma_{\mathcal{D}}) + b &= \mathcal{N}(g*\mu_{\mathcal{D}}, g*\sigma_{\mathcal{D}}) + b\\&= \mathcal{N}(g*\mu_{\mathcal{D}} + b, g*\sigma_{\mathcal{D}}) \\&= \mathcal{N}(\mu_{\mathcal{T}}, \sigma_{\mathcal{T}}) \\g &= \frac{\sigma_{\mathcal{T}}}{\sigma_{\mathcal{D}}} \\b &= \mu_{\mathcal{T}} - g*\mu_{\mathcal{D}}\end{aligned}\]
  4. The normalization process is then applied before and after reward model training (lm_human_preferences/train_reward.py#L232-L234, lm_human_preferences/train_reward.py#L252-L254).

  5. Note that responses \( y \sim \rho(·|x) \) we generated for the normalization purpose are from the pre-trained language model \(\rho \). The model \(\rho \) is fixed as a reference and is not updated in reward learning (lm_human_preferences/train_reward.py#L286C1-L286C31).
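  6. A small numeric check of the gain/bias formula above; the sampled “rewards” are synthetic:

import torch

rewards = torch.randn(10_000) * 3.2 + 1.7  # synthetic rewards: μ_D ≈ 1.7, σ_D ≈ 3.2
mu_d, sigma_d = rewards.mean(), rewards.std()
mu_t, sigma_t = 0.0, 1.0                   # target mean and std

reward_gain = sigma_t / sigma_d            # g = σ_T / σ_D
reward_bias = mu_t - reward_gain * mu_d    # b = μ_T - g * μ_D

normalized = rewards * reward_gain + reward_bias
print(normalized.mean().item(), normalized.std().item())  # ≈ 0.0, ≈ 1.0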

      Policy Training Implementation Details

In this section, we will delve into details, such as layer initialization, data post-processing, and dropout settings. We will also explore techniques such as rejection sampling, reward “whitening”, and adaptive KL. Here are these details in no particular order:

      1. Scale the logits by sampling temperature.
        1. When calculating the log probability of responses, the model first outputs the logits of the tokens in the responses, followed by dividing the logits with the sampling temperature (lm_human_preferences/policy.py#L121). I.e., logits /= self.temperature
        2. In an informal test, we found that without this scaling, the KL would rise faster than expected, and performance would deteriorate.
      2. Value head layer initialization
  1. The weight of the value head is initialized according to \(\mathcal{N}\left(0,0\right)\) (lm_human_preferences/language/model.py#L368, lm_human_preferences/language/model.py#L251-L252).
  2. The bias of the value head is set to 0 (lm_human_preferences/language/model.py#L254).
      3. Select query texts that start and end with a period
        1. This is done as part of the data preprocessing;
    1. Tries to select text only after start_text="." (lm_human_preferences/language/datasets.py#L51)
    2. Tries to select text just before end_text="." (lm_human_preferences/language/datasets.py#L61)
          3. Then pad the text (lm_human_preferences/language/datasets.py#L66-L67)
  2. When running openai/lm-human-preferences, OpenAI’s datasets were partially corrupted/lost (openai/lm-human-preferences/issues/17#issuecomment-104405149), so we had to replace them with similar HF datasets, which may or may not cause a performance difference.
  3. For the book dataset, we used https://huggingface.co/datasets/bookcorpus, for which we find it unnecessary to extract sentences that start and end with periods, because the dataset is already pre-processed this way (e.g., "usually , he would be tearing around the living room , playing with his toys ."). To this end, we set start_text=None, end_text=None for the sentiment and descriptiveness tasks.
      4. Disable dropout
        1. Ziegler et al. (2019) suggested, “We do not use dropout for policy training.” This is also done in the code (lm_human_preferences/policy.py#L48).
      5. Rejection sampling
        1. Ziegler et al. (2019) suggested, “We use rejection sampling to ensure there is a period between tokens 16 and 24 and then truncate at that period (This is a crude approximation for ‘end of sentence.’ We chose it because it is easy to integrate into the RL loop, and even a crude approximation is sufficient for the intended purpose of making the human evaluation task somewhat easier). During the RL finetuning, we penalize continuations that don’t have such a period by giving them a fixed reward of −1.”
        2. Specifically, this is achieved with the following steps:
          1. Token truncation: We want to truncate at the first occurrence of truncate_token that appears at or after position truncate_after in the responses (lm_human_preferences/train_policy.py#L378)
          2. Run reward model on truncated response: After the response has been truncated by the token truncation process, the code then runs the reward model on the truncated response.
          3. Rejection sampling: if there is not a period between tokens 16 and 24, then replace the score of the response with a fixed low value, such as -1 (lm_human_preferences/train_policy.py#L384, lm_human_preferences/train_policy.py#L384-L402)
          4. To give some examples in descriptiveness:

      [Figure: examples of truncated and rejection-sampled responses in the descriptiveness task]
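      Below is a compressed sketch of the truncation and rejection steps above (token ids, positions, and the penalty value are illustrative, not the original implementation):

          import torch

          def truncate_and_reject(response, score, truncate_token=13, truncate_after=16, penalty=-1.0):
              # response: (len,) token ids for one sample; score: scalar from the reward model
              hits = (response[truncate_after:] == truncate_token).nonzero()
              if len(hits) > 0:
                  cut = truncate_after + hits[0, 0].item() + 1
                  response = response.clone()
                  response[cut:] = 0                       # 1. truncate at the first period (0 = illustrative pad id)
              # 2. the reward model would be run on the truncated response here
              if truncate_token not in response[16:25]:    # 3. reject: no period between tokens 16 and 24
                  score = penalty
              return response, score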

      6. Discount factor = 1
        1. The discount parameter \(\gamma\) is set to 1 (lm_human_preferences/train_policy.py#L56), which means that future rewards are given the same weight as immediate rewards.
      7. Terminology of the training loop: batches and minibatches in PPO
        1. OpenAI uses the following training loop (lm_human_preferences/train_policy.py#L184-L192). Note: we additionally added micro_batch_size to handle gradient accumulation. At each epoch, it shuffles the batch indices.

          
          import numpy as np

          batch_size = 8
          nminibatches = 2
          gradient_accumulation_steps = 2
          mini_batch_size = batch_size // nminibatches
          micro_batch_size = mini_batch_size // gradient_accumulation_steps
          data = np.arange(batch_size).astype(np.float32)
          print("data:", data)
          print("batch_size:", batch_size)
          print("mini_batch_size:", mini_batch_size)
          print("micro_batch_size:", micro_batch_size)
          for epoch in range(4):
              batch_inds = np.random.permutation(batch_size)
              print("epoch:", epoch, "batch_inds:", batch_inds)
              for mini_batch_start in range(0, batch_size, mini_batch_size):
                  mini_batch_end = mini_batch_start + mini_batch_size
                  mini_batch_inds = batch_inds[mini_batch_start:mini_batch_end]

                  # `optimizer.zero_grad()` would zero the gradients here for gradient accumulation
                  for micro_batch_start in range(0, mini_batch_size, micro_batch_size):
                      micro_batch_end = micro_batch_start + micro_batch_size
                      micro_batch_inds = mini_batch_inds[micro_batch_start:micro_batch_end]
                      print("____⏩ a forward pass on", data[micro_batch_inds])
                  # `optimizer.step()` would apply the accumulated gradients here
                  print("⏪ a backward pass on", data[mini_batch_inds])

          # data: [0. 1. 2. 3. 4. 5. 6. 7.]
          # batch_size: 8
          # mini_batch_size: 4
          # micro_batch_size: 2
          # epoch: 0 batch_inds: [6 4 0 7 3 5 1 2]
          # ____⏩ a forward pass on [6. 4.]
          # ____⏩ a forward pass on [0. 7.]
          # ⏪ a backward pass on [6. 4. 0. 7.]
          # ____⏩ a forward pass on [3. 5.]
          # ____⏩ a forward pass on [1. 2.]
          # ⏪ a backward pass on [3. 5. 1. 2.]
          # epoch: 1 batch_inds: [6 7 3 2 0 4 5 1]
          # ____⏩ a forward pass on [6. 7.]
          # ____⏩ a forward pass on [3. 2.]
          # ⏪ a backward pass on [6. 7. 3. 2.]
          # ____⏩ a forward pass on [0. 4.]
          # ____⏩ a forward pass on [5. 1.]
          # ⏪ a backward pass on [0. 4. 5. 1.]
          # epoch: 2 batch_inds: [1 4 5 6 0 7 3 2]
          # ____⏩ a forward pass on [1. 4.]
          # ____⏩ a forward pass on [5. 6.]
          # ⏪ a backward pass on [1. 4. 5. 6.]
          # ____⏩ a forward pass on [0. 7.]
          # ____⏩ a forward pass on [3. 2.]
          # ⏪ a backward pass on [0. 7. 3. 2.]
          # epoch: 3 batch_inds: [7 2 4 1 3 0 6 5]
          # ____⏩ a forward pass on [7. 2.]
          # ____⏩ a forward pass on [4. 1.]
          # ⏪ a backward pass on [7. 2. 4. 1.]
          # ____⏩ a forward pass on [3. 0.]
          # ____⏩ a forward pass on [6. 5.]
          # ⏪ a backward pass on [3. 0. 6. 5.]
      8. Per-token KL penalty
        • The code adds a per-token KL penalty (lm_human_preferences/train_policy.py#L150-L153) to the rewards, in order to discourage the policy from drifting too far from the original policy.
        • Using "usually, he would" as an example, it gets tokenized to [23073, 11, 339, 561]. Say we use [23073] as the query and [11, 339, 561] as the response. Then under the default gpt2 parameters, the reference policy assigns the response tokens the log probabilities logprobs=[-3.3213, -4.9980, -3.8690].
          • During the first PPO update epoch and minibatch update, the active policy will have the same log probabilities new_logprobs=[-3.3213, -4.9980, -3.8690], so the per-token KL penalty would be kl = new_logprobs - logprobs = [0., 0., 0.]
          • However, after the first gradient backward pass, we could have new_logprobs=[-3.6528, -5.0406, -3.2339], so the per-token KL becomes kl = new_logprobs - logprobs = [-0.3315, -0.0426, 0.6351]
          • Then non_score_reward = -beta * kl, where beta is the KL penalty coefficient \(\beta\); it is added to the score obtained from the reward model to create the rewards used for training. The score is only given at the end of the episode; it could look like [0.4,], and we have rewards = [-beta * -0.3315, -beta * -0.0426, -beta * 0.6351 + 0.4].
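      A minimal sketch of this reward assembly, using the numbers above (beta and the score value are illustrative):

          import torch

          logprobs = torch.tensor([-3.3213, -4.9980, -3.8690])      # reference policy log probs
          new_logprobs = torch.tensor([-3.6528, -5.0406, -3.2339])  # active policy after one update
          kl = new_logprobs - logprobs
          beta, score = 0.15, 0.4
          rewards = -beta * kl
          rewards[-1] += score   # the reward model's score is added only at the last token
          print(rewards)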
      9. Per-minibatch reward and advantage whitening, with optional mean shifting
        1. OpenAI implements a whiten function, shown below, that normalizes values by subtracting their mean and dividing by their standard deviation. Optionally, whiten can add the mean back to the whitened values with shift_mean=False.
         import torch

         def whiten(values, shift_mean=True):
             # normalize: subtract the mean and divide by the standard deviation
             mean, var = torch.mean(values), torch.var(values, unbiased=False)
             whitened = (values - mean) * torch.rsqrt(var + 1e-8)
             if not shift_mean:
                 whitened += mean  # add the mean back, so only the scale is normalized
             return whitened
        2. In each minibatch, OpenAI then whitens the rewards with whiten(rewards, shift_mean=False), preserving the mean (lm_human_preferences/train_policy.py#L325), and whitens the advantages with whiten(advantages), zero-centering them (lm_human_preferences/train_policy.py#L338).
        3. Optimization note: if the number of minibatches is one (which is the case in this reproduction), we only need to whiten the rewards and calculate and whiten the advantages once, since their values won’t change.
        4. TensorFlow vs PyTorch note: tf.nn.moments and torch.var behave differently. The whitening behavior differs between torch and tf because the variance calculation is different:

           import numpy as np
           import tensorflow as tf
           import torch

           def whiten_tf(values, shift_mean=True):
               mean, var = tf.nn.moments(values, axes=list(range(values.shape.rank)))
               mean = tf.Print(mean, [mean], 'mean', summarize=100)
               var = tf.Print(var, [var], 'var', summarize=100)
               whitened = (values - mean) * tf.rsqrt(var + 1e-8)
               if not shift_mean:
                   whitened += mean
               return whitened

           def whiten_pt(values, shift_mean=True, unbiased=True):
               mean, var = torch.mean(values), torch.var(values, unbiased=unbiased)
               print("mean", mean)
               print("var", var)
               whitened = (values - mean) * torch.rsqrt(var + 1e-8)
               if not shift_mean:
                   whitened += mean
               return whitened

           rewards = np.array([
               [1.2, 1.3, 1.4],
               [1.5, 1.6, 1.7],
               [1.8, 1.9, 2.0],
           ])

           with tf.Session() as sess:
               print(sess.run(whiten_tf(tf.constant(rewards, dtype=tf.float32), shift_mean=False)))
               print(whiten_pt(torch.tensor(rewards), shift_mean=False, unbiased=True))
               print(whiten_pt(torch.tensor(rewards), shift_mean=False, unbiased=False))
           mean[1.5999999]
           var[0.0666666627]
           [[0.05080712 0.4381051  0.8254035 ]
            [1.2127019  1.6000004  1.9872988 ]
            [2.3745968  2.7618952  3.1491938 ]]
           mean tensor(1.6000, dtype=torch.float64)
           var tensor(0.0750, dtype=torch.float64)
           tensor([[0.1394, 0.5046, 0.8697],
                   [1.2349, 1.6000, 1.9651],
                   [2.3303, 2.6954, 3.0606]], dtype=torch.float64)
           mean tensor(1.6000, dtype=torch.float64)
           var tensor(0.0667, dtype=torch.float64)
           tensor([[0.0508, 0.4381, 0.8254],
                   [1.2127, 1.6000, 1.9873],
                   [2.3746, 2.7619, 3.1492]], dtype=torch.float64)
      10. Clipped value function
        1. As done in the original PPO (baselines/ppo2/model.py#L68-L75), the value function is clipped (lm_human_preferences/train_policy.py#L343-L348) in a fashion similar to the policy objective; a sketch follows below.
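      A minimal sketch of such a clipped value loss, in the style of baselines’ PPO2 (function and argument names are illustrative):

          import torch

          def clipped_value_loss(values, old_values, returns, cliprange_value=0.2):
              # clip the new value prediction so it stays close to the old one
              values_clipped = old_values + (values - old_values).clamp(-cliprange_value, cliprange_value)
              vf_loss1 = (values - returns) ** 2
              vf_loss2 = (values_clipped - returns) ** 2
              # take the pessimistic (larger) of the two squared errors, as in PPO2
              return 0.5 * torch.max(vf_loss1, vf_loss2).mean()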
      11. Adaptive KL
        • The KL divergence penalty coefficient \(\beta\) is modified adaptively based on the KL divergence between the current policy and the original policy \(\rho\). If the KL divergence is outside a predefined target range, the penalty coefficient is adjusted to bring it closer to the target range (lm_human_preferences/train_policy.py#L115-L124). It’s implemented as follows:

            import numpy as np

            class AdaptiveKLController:
                def __init__(self, init_kl_coef, hparams):
                    self.value = init_kl_coef
                    self.hparams = hparams

                def update(self, current, n_steps):
                    target = self.hparams.target
                    proportional_error = np.clip(current / target - 1, -0.2, 0.2)
                    mult = 1 + proportional_error * n_steps / self.hparams.horizon
                    self.value *= mult
        • For the sentiment and descriptiveness tasks examined in this work, we have init_kl_coef=0.15, hparams.target=6, hparams.horizon=10000.
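      A usage sketch with these hyperparameters (the hparams container is assumed to expose target and horizon fields):

          from types import SimpleNamespace

          kl_ctl = AdaptiveKLController(init_kl_coef=0.15,
                                        hparams=SimpleNamespace(target=6, horizon=10000))
          kl_ctl.update(current=7.5, n_steps=256)  # measured KL above target -> coefficient increases
          print(kl_ctl.value)                      # slightly larger than 0.15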

      PyTorch Adam optimizer numerical issues w.r.t RLHF

      • This implementation detail is so interesting that it deserves a full section.
      • PyTorch’s Adam optimizer (torch.optim.Adam.html) has a different implementation compared to TensorFlow’s Adam optimizer (TF1 Adam at tensorflow/v1.15.2/adam.py, TF2 Adam at keras/adam.py#L26-L220). In particular, PyTorch follows Algorithm 1 of Kingma and Ba’s Adam paper, while TensorFlow uses the formulation just before Section 2.1 of the paper; the epsilon referred to in TensorFlow’s API is epsilon hat in the paper. In a pseudocode comparison, we have the following:
      ### pytorch adam implementation:
      bias_correction1 = 1 - beta1 ** step
      bias_correction2 = 1 - beta2 ** step
      step_size = lr / bias_correction1
      bias_correction2_sqrt = _dispatch_sqrt(bias_correction2)
      denom = (exp_avg_sq.sqrt() / bias_correction2_sqrt).add_(eps)
      param.addcdiv_(exp_avg, denom, value=-step_size)

      ### tensorflow adam implementation:
      lr_t = lr * _dispatch_sqrt((1 - beta2 ** step)) / (1 - beta1 ** step)
      denom = exp_avg_sq.sqrt().add_(eps)
      param.addcdiv_(exp_avg, denom, value=-lr_t)
      • Let’s compare the update equations of pytorch-style and tensorflow-style adam. Following the notation of the adam paper (Kingma and Ba, 2014), we have the gradient update rules for pytorch adam (Algorithm 1 of Kingma and Ba’s paper) and tensorflow-style adam (the formulation just before Section 2.1 of Kingma and Ba’s paper) as below:
      \[\begin{aligned}\text{pytorch adam :}\quad \theta_t & =\theta_{t-1}-\alpha \cdot \hat{m}_t /\left(\sqrt{\hat{v}_t}+\varepsilon\right) \\& =\theta_{t-1}- \alpha \underbrace{\left[m_t /\left(1-\beta_1^t\right)\right]}_{=\hat{m}_t} /\left[\sqrt{\underbrace{v_t /\left(1-\beta_2^t\right)}_{=\hat{v}_t} }+\varepsilon\right]\\& =\theta_{t-1}- \alpha\left[m_t /\left(1-\beta_1^t\right)\right]\frac{\sqrt{1-\beta_2^t}}{\sqrt{v_t}+\color{green}{\varepsilon \sqrt{1-\beta_2^t}}}\end{aligned}\] \[\begin{aligned}\text{tensorflow adam:}\quad \theta_t & =\theta_{t-1}-\alpha_t m_t /\left(\sqrt{v_t}+\hat{\varepsilon}\right) \\& =\theta_{t-1}-\underbrace{\left[\alpha \sqrt{1-\beta_2^t} /\left(1-\beta_1^t\right)\right]}_{=\alpha_t} m_t /\left(\sqrt{v_t}+\hat{\varepsilon}\right) \\& =\theta_{t-1}- \alpha\left[m_t /\left(1-\beta_1^t\right)\right] \frac{\sqrt{1-\beta_2^t}}{\sqrt{v_t}+\color{green}{\hat{\varepsilon}}} \end{aligned}\]
      • The equations above highlight that the distinction between the pytorch and tensorflow implementations lies in their normalization terms, \(\color{green}{\varepsilon \sqrt{1-\beta_2^t}}\) and \(\color{green}{\hat{\varepsilon}}\). The two versions are equivalent if we set \(\hat{\varepsilon} =\varepsilon \sqrt{1-\beta_2^t}\). However, in the pytorch and tensorflow APIs, we can only set \(\varepsilon\) (pytorch) and \(\hat{\varepsilon}\) (tensorflow) via the eps argument, causing differences in their update equations. What if we set \(\varepsilon\) and \(\hat{\varepsilon}\) to the same value, say, 1e-5? Then for tensorflow adam, the normalization term \(\hat{\varepsilon} = \text{1e-5}\) is just a constant. But for pytorch adam, the normalization term \({\varepsilon \sqrt{1-\beta_2^t}}\) changes over time: it is much smaller than 1e-5 when the timestep \(t\) is small and gradually approaches 1e-5 as timesteps increase. The plot below compares these two normalization terms over timesteps:
      • The above figure shows that, if we set the same eps in pytorch adam and tensorflow adam, then pytorch-adam uses a much smaller normalization term than tensorflow-adam in the early phase of training. In other words, pytorch adam goes for more aggressive gradient updates early in the training. Our experiments support this finding, as we will demonstrate below.
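      A quick way to see this numerically (eps and beta2 are chosen to match common defaults; a sketch, not the optimizers’ actual code):

          import numpy as np

          eps, beta2 = 1e-5, 0.999
          for t in [1, 10, 100, 1000, 10000]:
              # pytorch-style term; the tensorflow-style term stays at eps
              print(t, eps * np.sqrt(1 - beta2 ** t))
          # at t=1 the pytorch-style term is ~3.2e-7, roughly 30x smaller than eps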
      • How does this impact reproducibility and performance? To align settings, we record the original query, response, and rewards from https://github.com/openai/lm-human-preferences and save them. We also record the metrics of the first two epochs of training with TF1’s AdamOptimizer as the ground truth. Below are some key metrics:

      | Metric | OpenAI’s TF1 Adam | PyTorch’s Adam | Our custom TensorFlow-style Adam |
      | --- | --- | --- | --- |
      | policy/approxkl | 0.00037167023 | 0.0023672834504395723 | 0.000374998344341293 |
      | policy/clipfrac | 0.0045572915 | 0.02018229104578495 | 0.0052083334885537624 |
      | ratio_mean | 1.0051285 | 1.0105520486831665 | 1.0044583082199097 |
      | ratio_var | 0.0007716546 | 0.005374275613576174 | 0.0007942612282931805 |
      | ratio_max | 1.227216 | 1.8121057748794556 | 1.250215768814087 |
      | ratio_min | 0.7400441 | 0.4011387825012207 | 0.7299948930740356 |
      | logprob_diff_mean | 0.0047487603 | 0.008101251907646656 | 0.004073789343237877 |
      | logprob_diff_var | 0.0007207897 | 0.004668936599045992 | 0.0007334011606872082 |
      | logprob_diff_max | 0.20474821 | 0.594489574432373 | 0.22331619262695312 |
      | logprob_diff_min | -0.30104542 | -0.9134478569030762 | -0.31471776962280273 |
      • PyTorch’s Adam produces a noticeably more aggressive update. Here is some evidence:
        • PyTorch’s Adam’s logprob_diff_var is 6x higher. Here logprobs_diff = new_logprobs - logprobs is the difference between the log probabilities of tokens under the initial and current policies after two epochs of training. A larger logprob_diff_var means the scale of the log probability changes is larger than that of OpenAI’s TF1 Adam.
        • PyTorch’s Adam produces more extreme ratio_max and ratio_min. Here ratio = torch.exp(logprobs_diff). Having ratio_max=1.8121057748794556 means that for some token, sampling that token is 1.8x more likely under the current policy, as opposed to only 1.2x with OpenAI’s TF1 Adam.
        • Larger policy/approxkl and policy/clipfrac. Because of the aggressive update, the ratio gets clipped 4.4x more often, and the approximate KL divergence is 6x larger.
        • The aggressive update is likely to cause further issues. E.g., logprob_diff_mean is 1.7x larger with PyTorch’s Adam, which would correspond to a 1.7x larger KL penalty in the next reward calculation; this could get compounded. In fact, this might be related to the famous KL divergence issue: the KL penalty becomes much larger than it should be, and the model could pay more attention to optimizing it instead, thereby causing negative KL divergence.
      • Larger models are affected more. We conducted experiments comparing PyTorch’s Adam (codename pt_adam) and our custom TensorFlow-style Adam (codename tf_adam) with gpt2 and gpt2-xl. We found that performance is roughly similar under gpt2; however, with gpt2-xl we observed more aggressive updates, meaning that larger models are affected by this issue more.
        • When the initial policy updates are more aggressive in gpt2-xl, the training dynamics are affected. For example, we see much larger objective/kl and objective/scores spikes with pt_adam, especially with sentiment: the biggest KL was as large as 17.5 in one of the random seeds, suggesting undesirable over-optimization.
        • Furthermore, because of the larger KL, many other training metrics are affected as well. For example, we see a much larger clipfrac (the fraction of time the ratio gets clipped by PPO’s objective clip coefficient 0.2) and approxkl.

      Limitations

      Note that this work does not attempt to reproduce the summarization work on CNN/DM or TL;DR. This is because we found the training to be time-consuming and brittle.

      The particular training run we attempted showed poor GPU utilization (around 30%), so it takes almost 4 days to perform a training run, which is highly expensive (only AWS sells the p3dn.24xlarge, and it costs $31.212 per hour).

      Additionally, training was brittle. While the reward goes up, we found it difficult to reproduce the “smart copier” behavior reported by Ziegler et al. (2019). Below are some sample outputs; clearly, the agent overfits somehow.

      QUERY: The modern take on Death & Taxes still plays it. I know as a Tron player I find Mindcensor 
      pretty frustrating. ⏎ However, the answer to your question is probably that the decks that 
      are currently taking up the lion's share of the metagame don't really care about its effect.
      It has a chance to act as an instant speed Stone Rain + Gut Shot in response to a fetch 
      crack, but after that it's a 2/1 flyer that dies to all relevant removal. ⏎ It's especially 
      powerful against Tron since so much of the deck's consistency is in its tutor effects -- 
      Sylvan Scrying, Expedition Map, and Eye of Ugin. This combined with Pyroclasm and Oblivion 
      Stone being the only creature removal the deck has available to it before Karn make 
      Mindcensor fairly potent. ⏎ I'm having trouble thinking of any other popular decks (ignoring
      the interaction with fetches) that Mindcensor really sticks it to, however.

      PPO GENERATED RESPONSE: Mindcensor is a very good card, but it's not really worth playing.<|endoftext|>Image 
      caption The family was forced from their home after they were not allowed to leave the 
      house⏎⏎A family of four has been jailed for five years after admitting breaching home rule 
      by refusing to leave their home.⏎

      reward: -0.48
      kl: +11.37
      total reward: -0.82

      QUERY: To make a long story short, my SO (27/m) and I live in my hometown. We've been together for 
      4 years. His family lives across the country. A lot of the time we spend with my parents but
      besides that, we aren't very social. His family not being here doesn't make me feel like I 
      can't be with mine. I just don't like to. ⏎ I find my extended family (so not my parents) 
      extremely irritating. Especially the ones around my age. They're dramatic and rude. Every 
      family event is some Jerry Springer episode I don't want to participate in. I don't have any
      issues with anyone in particular. I just find them annoying. I'm never uncivil or openly 
      telling them how I feel. If there's get togethers, I tend to avoid them. I don't ask anyone 
      to hang out or come over. ⏎ My SO and I have experienced multiple miscarriages due to 
      unknown reasons. This has made me very sensitive and a bit of a loaner when it comes to the 
      girls in my family my age. I'm not jealous that my family has children. I just don't seem to
      be able to relate to them. When I explained our issues, they reacted very unsympathetic and 
      just weren't very supportive. I guess that was the time that I started to pull away and not 
      try to keep up any relationships. ⏎ My SO and I are pregnant, half way! This is something 
      we're so joyed about. However, when we told everyone, they just didn't really give a 
      reaction. No congrats or that they were happy for us. Since then, I completely pulled away. 
      I talked to my parents about this and I just don't feel the need to contact anyone. The 
      reason I'm posting is because I discussed deleting some relatives off social media with my 
      mom and she was telling me how it's really rude. I don't think it is. We don't talk. There's
      no relationship. Honestly, I think they're just friends with me for gossip purposes. A few 
      times I tried reaching out to some of my cousins with no reply. ⏎ I guess what I'm asking is
      if I should be trying harder to establish something with my family or if it's perfectly OK 
      just to delete them and not talk to them again? Our primary contact is Facebook even though 
      we live in the same town.

      PPO GENERATED RESPONSE:  We're dating now. We've met, we're dating. We're dating now. We're dating now. We're dating
      now. We're dating now. We're dating now.⏎⏎We've met⏎⏎We're dating now.⏎⏎We're dating now.⏎

      reward: +2.09
      kl: +19.45
      total reward: +1.50

      QUERY: This tower totally decided to screw me over on the final stage. So a bit of background I 
      don't have the strongest squad but pretty much I have: ⏎ Guardian Loch (Sacred Jewel) ⏎ 
      Anima Tiara (Flesh Armor) ⏎ Lord Cayena (sacred Jewel) ⏎ Oracle Twins (sacred Jewel) ⏎ 
      Guardian Logan (Burny soul stone) ⏎ and was using a friends maxed michelle. ⏎ So throughout 
      the earlier stages I've seen this guy about 5 times but have never caught him. So I had a 
      little trouble with stage 9 but felt like with the additional 50% atk boost and bb from 
      michelle I could do it. Also on stage 9 the reason I died was because I didn't take any 
      stimulants and this guy managed to paralyze 3 members of my squad. So I replaced mega cures 
      for stimulants. ⏎ Round 3 I meet him. He gets a lucky crit on my caynea and takes her down. 
      I have to use my one revive, take him down and  capture  him. I'm pretty excited. ⏎ Round 7 
      I see the warning danger sign. I face him again and manage to take him down without any of 
      my units dying no capture. At this point I'm starting to get low on healing items. Also I've
      only heard of meeting him twice on a run. But then I checked this thread and saw that he 
      always appears on the last stage. I thought that maybe it glitched and he wouldn't appear on
      the final round. ⏎ Nope. ⏎ This guy shows up on the final round showing up 3 times in one 
      run. Worst part was I manged to get him down to a sliver of health so that as long as one 
      unit hit him I would've defeated him with the next turn. This was also the turn he decided 
      to use soul desecration twice. I definitely felt like my soul got desecrated watching all my
      units die. So at this point it was give up the captured unit and try all over again (knowing
      my luck never capture him again) or gem for the last hit. I was really looking forward to 
      getting the gem for completion (f2p gotta scrap up everything right?) but I let it go.

      PPO GENERATED RESPONSE:  ⏎⏎When you capture a unit try to kill it.⏎⏎Kill them.⏎⏎Kill them.⏎⏎Kill them.⏎⏎Kill 
      them.⏎⏎Kill them.⏎⏎Kill them.⏎⏎Kill them.⏎⏎Kill them.⏎⏎Kill them.⏎

      reward: +0.79
      kl: +10.82
      total reward: +0.46

      Conclusion

      In this work, we took a deep dive into OpenAI’s original RLHF codebase and compiled a list of its implementation details. We also created a minimal base that reproduces the same learning curves as OpenAI’s original RLHF codebase when the dataset and hyperparameters are controlled. Furthermore, we identified surprising implementation details, such as the Adam optimizer’s epsilon setting, which causes aggressive updates in early RLHF training.

      For attribution in academic contexts, please cite this work as
              PLACEHOLDER FOR ACADEMIC ATTRIBUTION
      BibTeX citation
              PLACEHOLDER FOR BIBTEX
      \ No newline at end of file diff --git a/blog/understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/index.html b/blog/understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/index.html new file mode 100644 index 00000000..67cc897b --- /dev/null +++ b/blog/understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/index.html @@ -0,0 +1,56 @@ + Understanding gradient inversion attacks from the prior knowledge perspective | ICLR Blogposts 2024

      Understanding gradient inversion attacks from the prior knowledge perspective

      In this blogpost, we discuss multiple works on gradient inversion attacks, point out the challenges that still need to be solved in GIAs, and provide a prior-knowledge perspective for understanding the logic behind recent papers.

      Federated learning, as a way to collaboratively train a deep model, was originally developed to enhance training efficiency and protect data privacy. In a federated learning paradigm, whether horizontal or vertical, data is processed locally, and the central server only gets access to processed information, such as trained model weights or intermediate gradients. By avoiding direct access to private local data, federated learning is believed to successfully protect clients’ data privacy: the central server can only make use of the uploaded information to train a global model, and it does not know exactly what the training dataset really contains. However, in horizontal federated learning, researchers found that with training gradients, the central server could still recover input data, which may be a threat to training data privacy. Such a privacy attack is named the gradient inversion attack (or gradient leakage attack).

      Fundamental pipeline of Gradient inversion attacks (GIAs)

      Gradient inversion attacks (GIAs) aim at reconstructing clients’ private input data from the gradients produced during deep neural network training. They are a threat to the federated learning framework, especially the horizontal setting, where an honest-but-curious central server collects gradients from multiple clients, analyzes the optimal parameter updating direction, and sends back the updated model in one step. Setting aside the mathematical formalism, a GIA is actually a matching process: the attacker (the central server in the most common settings) expects the data it randomly initialized to eventually generate gradients identical to the ground truth, so it measures the difference (or distance) between the gradients and optimizes the input data pixel-wise. The smaller the distance between gradients, the better the private data are reconstructed.

      This is a white-box attack, since it requires full model parameters to conduct backpropagation. In such a process, with fixed model parameters, the distance between gradients depends entirely on the attacker’s dummy data. GIA’s target is to optimize the distance below, where $x^\ast$ and $y^\ast$ represent the dummy data-label tuple, $\mathcal{D}$ represents the distance function, $\theta$ represents the model weights, and $\mathcal{L}$ represents the CE loss.

      \[\arg\min \limits_{(x^*,y^*)} {\mathcal{D}}\left(\nabla_\theta\mathcal{L}_\theta\left( x,y\right),\nabla_\theta\mathcal{L}_\theta\left( x^*,y^*\right)\right)\]
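      As an illustration, here is a minimal DLG-style sketch of this matching process on a toy linear model, assuming the label is known (all shapes and hyperparameters are arbitrary):

          import torch
          import torch.nn.functional as F

          torch.manual_seed(0)
          model = torch.nn.Linear(8, 4)
          x_true, y_true = torch.randn(1, 8), torch.tensor([2])

          # the gradients the server would observe
          true_grads = torch.autograd.grad(F.cross_entropy(model(x_true), y_true),
                                           model.parameters())

          x_dummy = torch.randn(1, 8, requires_grad=True)
          opt = torch.optim.Adam([x_dummy], lr=0.1)
          for _ in range(300):
              dummy_grads = torch.autograd.grad(F.cross_entropy(model(x_dummy), y_true),
                                                model.parameters(), create_graph=True)
              dist = sum(((dg - tg) ** 2).sum() for dg, tg in zip(dummy_grads, true_grads))
              opt.zero_grad()
              dist.backward()
              opt.step()
          print((x_dummy - x_true).abs().max())  # small -> the private input is recovered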

      Since this problem was raised, a few research topics have developed in this field. iDLG provides a way to recover the input label analytically. Following this, a series of works has been proposed to recover labels from batches, and it is generally believed that, compared with optimizing image-label tuples simultaneously, simply optimizing input images with ground-truth labels achieves better performance. Beyond recovering labels, attack evaluations and defense methods also attract much attention. However, recovering high-quality images is still the key focus.

      The tough challenge in GIAs

      In GIA, the tough challenge, which has not been solved yet, is the reconstruction of batched input data where multiple samples share the same labels. Previous works headed towards this goal in a few steps: they first recovered single input data, then extended to batches with known labels, and finally added a new algorithm to recover batched one-hot labels before recovering input images. However, to the best of my knowledge, these methods are still limited to the situation where, for every class, there is at most one sample in a batch. Batched data recovery with repeated labels remains a failure for all current algorithms. The key reason for this failure lies in the information discarded by averaged gradients.

      A simple example of information discards

      Let’s first take a look at a simple neural network, the MLP. A given layer takes in intermediate features $\mathbf{x}$ and outputs the result of a matrix multiplication, $\mathbf{z}=\mathbf{Wx}+\mathbf{b}$. To recover the input from gradients, we can simply use the bias attack:

      \[\frac{\partial \mathcal{L}}{\partial {\mathbf{W}}}=\frac{\partial \mathcal{L}}{\partial \mathbf{z}} \times \frac{\partial \mathbf{z}}{\partial {\mathbf{W}}}=\frac{\partial \mathcal{L}}{\partial {b}}\mathbf{x}^\mathrm{T}\]

      In the above equation, it is clear that for a single input, with full access to model weights and gradients, the gradients of the MLP contain enough information to execute single-image recovery.
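      A numeric sketch of this bias attack on a single linear layer (shapes and loss are arbitrary):

          import torch

          torch.manual_seed(0)
          W = torch.randn(4, 3, requires_grad=True)
          b = torch.randn(4, requires_grad=True)
          x = torch.randn(3)                       # the private input

          loss = (W @ x + b).pow(2).sum()          # any scalar loss
          loss.backward()

          # dL/dW = (dL/db) x^T, so row i of W.grad equals b.grad[i] * x
          x_recovered = W.grad[0] / b.grad[0]
          print(torch.allclose(x_recovered, x, atol=1e-5))  # True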

      Here, we conduct a simple experiment to illustrate the existence of information discards. We pick a 4-layer MLP as the target neural network and randomly select a few images from the Flowers-17 dataset as the private input data for recovery. We take the $l_2$ loss as the gradient matching function, without any prior knowledge (regularization terms). First, we provide an example of input image recovery when batchsize=1 with known labels.

      Image reconstruction with $l_2$ loss on MLP. No regularization terms are adopted.

      It is not surprising that the $l_2$ gradient matching function recovers the input data well. Such good performance is mainly because the MLP’s gradients contain enough information about the intermediate features of single inputs. With proper labels, we can conclude that GIA works well on MLP when batchsize=1.

      However, when it comes to CNNs, such inversion gets harder. For convolution layers, the gradients of the convolution kernels are aggregated across the whole feature map; therefore, even if we set batchsize=1, gradients may still experience information discards, affecting the attack performance. This problem is also mentioned in R-GAP, which analyzes GIA from an equation-solving perspective: if the equations are “rank-deficient”, we cannot get a unique solution, indicating obvious information discards. Here, for better illustration, we first show CIFAR-10 image reconstructions on LeNet with batchsize=1. Ground-truth one-hot labels are provided.

      Image reconstruction on LeNet with the CIFAR-10 dataset when batchsize=1. We show the ground-truth image in the middle and attach the reconstruction process on the two sides ($l_2$ loss on the left and cosine similarity loss on the right).

      It is clear that even though both functions can recover the image, some pixels are not perfectly optimized, indicating the existence of information discards. If we change the batchsize, even only slightly enlarging it to batchsize=2, the reconstruction ends up a failure.

      Image reconstruction with cosine similarity loss on LeNet; no regularization terms are adopted. In the middle, we show the ground-truth images in the batch.

      For a given network, the size of the gradients is fixed. Therefore, with an increase in batchsize, GIA will experience more obvious information discards. This is easy to understand, and researchers have designed a few ways to compensate for this loss.

      Understanding GIAs from the prior knowledge perspective

      Having recognized the information discards, reviewing recent papers through the prior knowledge perspective may help us understand their logic better. To achieve better image reconstruction quality, it is natural to consider prior knowledge about images as a complement. Here, the prior knowledge can be grouped into three categories.

      Unparameterized regularization terms

      In IG, they utilize the total variation as a regularization term because they believe a real image taken from nature should have a small total variation. That is the first prior knowledge term utilized in the gradient matching function, and it turns out to work well. After that, in GradInversion, this regularization term is extended to include batch normalization supervision, \(l_2\) norms, and group consistency. This is a stronger prior, implying that a real input image, or a batch of real images, should, beyond a small total variation, also possess a low \(l_2\) norm and proper intermediate means and variances for the batch normalization layers. Apart from that, all reconstructions from different random initializations ought to reach group consistency. These terms are unparameterized, and their ablation experiments clearly demonstrate that these terms matter significantly in reconstructing high-quality images.

      To further illustrate the benefits such regularization terms bring to the data reconstruction process, here is an example of adding total variation for batchsize=2 image reconstruction. The scale of the total variation term ranges from \(10^{-4}\) to \(10^{-1}\).

      Image reconstruction with cosine similarity loss and total variation on LeNet. The scale of the total variation term starts from $10^{-4}$ at the very left column and increases by a factor of 10 per column, up to $10^{-1}$.

      With an identical learning rate, images with a higher total variation weight are reconstructed faster. Because total variation penalizes obvious differences between adjacent pixels, images with a higher total variation weight are also more blurred. On the other side, reconstructions with insufficient total variation fail to generate recognizable images.
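      A minimal sketch of such a total variation term and how it could enter the matching objective (the weight tv_scale is illustrative):

          import torch

          def total_variation(x):
              # x: a batch of images with shape (B, C, H, W)
              tv_h = (x[:, :, 1:, :] - x[:, :, :-1, :]).abs().mean()
              tv_w = (x[:, :, :, 1:] - x[:, :, :, :-1]).abs().mean()
              return tv_h + tv_w

          # hypothetical combined objective:
          # loss = gradient_distance(dummy_grads, true_grads) + tv_scale * total_variation(x_dummy)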

      Generative models

      Following the logic that recent works require extra conditions as prior knowledge to compensate for the information discarded by gradients, generative models, especially GANs, can serve as a strong tool to encode what “real images” should look like. The way to add a GAN’s generator to the gradient matching process is simple: instead of optimizing image pixels directly, with the generator we extend backpropagation all the way back to the latent space, then alter the latent code, as well as the parameters of the generator, to produce recovered images. Pre-trained generators naturally encode a likely distribution of the input data, which is stronger prior knowledge compared with the previous unparameterized regularization terms.

      The recent work GIFD extends this method by optimizing the GAN network layer-wise. Instead of directly optimizing the GAN weights and the latent vector in one step, GIFD optimizes the intermediate layers iteratively, making the process more stable. In summary, gradients here serve more as an indicator for attackers to select the best image from the distributions modeled by pre-trained GANs.

      End-to-end networks

      Actually, the most intuitive way to conduct a GIA is to design a function that takes gradients as input and outputs recovered images. For a target network, image-gradient tuples are easy to collect; therefore, the prior knowledge can be encoded in such an end-to-end neural network through model training.

      Here, the neural network resembles a GAN generator, which takes in representation vectors and outputs a synthesized image. However, instead of abstract latent codes, such a network receives gradient vectors to generate images. In implementation, Wu et al. utilize feature hashing to reduce the dimension of the gradient vectors. For the network, they pick a simple 3-layer MLP to generate flattened images, which is different from widely-used GAN structures. However, such a method faces multiple difficulties, such as large input sizes and limited structural flexibility. Even for one specific model, once the model weights are changed, such an end-to-end network requires retraining to construct a new mapping from gradients to images. Besides, there is still space for network design. Will the network structure influence image reconstruction performance on identical datasets? How do we construct a mapping function from gradients to images with varying batchsize? Could the network find an optimal batchsize after analyzing the gradients? These questions are all worth further exploration.

      Limitations and future directions

      For GIAs that require pre-trained models, the key limitation is the auxiliary dataset. It is somewhat unrealistic to assume that the dataset used for pre-training generative models (or end-to-end models) shares the same distribution as the unknown private input data, and with a distinct dataset distribution, generative performance may well drop. Both GIAS and GIFD use GANs with in-distribution auxiliary data to compare with previous state-of-the-art works, and the GIFD paper only shows reconstruction results on data from a distinct distribution when batchsize=1 with the same label space. For the most general situation, where the attacker has limited knowledge of the potential distribution of the private data, it may still be hard to recover high-quality batched data with generative networks. Considering these limitations, it is of great value to explore algorithms that learn some general prior knowledge, especially knowledge robust across different data distributions.

      Conclusions

      1. The existence of information discards in gradient aggregation is the tough challenge of GIAs.
      2. From the prior knowledge perspective, previous GIA works provide three ways to compensate for the information discards.
      3. It may still be hard to recover batched data from gradients with limited knowledge of the private data distribution.
      For attribution in academic contexts, please cite this work as
              PLACEHOLDER FOR ACADEMIC ATTRIBUTION
      BibTeX citation
              PLACEHOLDER FOR BIBTEX
      \ No newline at end of file diff --git a/blog/understanding-icl/index.html b/blog/understanding-icl/index.html new file mode 100644 index 00000000..ea5a92f8 --- /dev/null +++ b/blog/understanding-icl/index.html @@ -0,0 +1,56 @@ + Understanding in-context learning in transformers | ICLR Blogposts 2024

      Understanding in-context learning in transformers

      We propose a technical exploration of In-Context Learning (ICL) for linear regression tasks in transformer architectures. Focusing on the article Transformers Learn In-Context by Gradient Descent by J. von Oswald et al., published at ICML 2023, we provide detailed explanations and illustrations of the mechanisms involved. We also contribute novel analyses of ICL, discuss recent developments, and point to open questions in this area of research.

      $$ \definecolor{input}{rgb}{0.42, 0.55, 0.74} \definecolor{params}{rgb}{0.51,0.70,0.40} \definecolor{output}{rgb}{0.843, 0.608, 0} \def\mba{\boldsymbol a} \def\mbb{\boldsymbol b} \def\mbc{\boldsymbol c} \def\mbd{\boldsymbol d} \def\mbe{\boldsymbol e} \def\mbf{\boldsymbol f} \def\mbg{\boldsymbol g} \def\mbh{\boldsymbol h} \def\mbi{\boldsymbol i} \def\mbj{\boldsymbol j} \def\mbk{\boldsymbol k} \def\mbl{\boldsymbol l} \def\mbm{\boldsymbol m} \def\mbn{\boldsymbol n} \def\mbo{\boldsymbol o} \def\mbp{\boldsymbol p} \def\mbq{\boldsymbol q} \def\mbr{\boldsymbol r} \def\mbs{\boldsymbol s} \def\mbt{\boldsymbol t} \def\mbu{\boldsymbol u} \def\mbv{\boldsymbol v} \def\mbw{\textcolor{params}{\boldsymbol w}} \def\mbx{\textcolor{input}{\boldsymbol x}} \def\mby{\boldsymbol y} \def\mbz{\boldsymbol z} \def\mbA{\boldsymbol A} \def\mbB{\boldsymbol B} \def\mbE{\boldsymbol E} \def\mbH{\boldsymbol{H}} \def\mbK{\boldsymbol{K}} \def\mbP{\boldsymbol{P}} \def\mbR{\boldsymbol{R}} \def\mbW{\textcolor{params}{\boldsymbol W}} \def\mbQ{\boldsymbol{Q}} \def\mbV{\boldsymbol{V}} \def\mbtheta{\textcolor{params}{\boldsymbol \theta}} \def\mbzero{\boldsymbol 0} \def\mbI{\boldsymbol I} \def\cF{\mathcal F} \def\cH{\mathcal H} \def\cL{\mathcal L} \def\cM{\mathcal M} \def\cN{\mathcal N} \def\cX{\mathcal X} \def\cY{\mathcal Y} \def\cU{\mathcal U} \def\bbR{\mathbb R} \def\y{\textcolor{output}{y}} $$

      What is in-context learning?

      In-Context Learning (ICL) is a behavior first observed in Large Language Models (LLMs), whereby learning occurs from prompted data without modification of the weights of the model. It is a simple technique used daily throughout the world by AI practitioners of all backgrounds to improve generation quality and alignment of LLMs. ICL is important because it addresses head-on the once widespread criticism that, for all their impressive performance, modern deep learning models are rigid systems that lack the ability to adapt quickly to novel tasks in dynamic settings - a hallmark of biological intelligence. Through this new form of “learning during inference”, Large Language Models have shown that they can be, in some specific sense (once pretrained), surprisingly versatile few-shot learners.


      Figure 1: Example of a simple in-context prompt for ChatGPT.

      Interestingly, it was around the release of GPT-2 and GPT-3 that researchers observed that an auto-regressive language model pre-trained on enough data with enough parameters was capable of performing arbitrary tasks without fine-tuning, by simply prompting the model with the task and a few examples and letting it generate the output. In recent months, the research community has started to investigate the phenomenon of ICL in more detail, and several papers have been published on the topic.

      Figure 2: The number of papers published on the topic of ICL (and transformers) in the last years. Data extracted from arxiv.org on November 16th, 2023. In the last year alone, the number of papers on the topic has increased by more than 200%.


      Specifically, since learning processes in biology and machines are often, if not always, understood in terms of iterative optimization, it is natural to ask what kind of iterative optimization is being realized during ICL, and how.

      From large language models to regression tasks

      Though ICL is generally regarded as a phenomenon exhibited by LLMs, we now hasten to study it in a non-language, small-scale model that enables more control and in which ICL can still be shown to emerge. This simpler setting is that of a transformer model trained to regress a set of numerical data points presented in the prompt, with data points generated from a distinct function for each prompt, but where all prompts sample their function from the same general class (i.e. linear) at train and at test time. We will see that, to some extent, this simplification allows for a mathematical treatment of ICL.

      The following figure gives a visual representation of the ICL setup we will consider in this blog post. The model is a generic transformer pre-trained to solve generic linear regression tasks. At inference time, we can give the model a prompt with a new linear regression task, and it is able to solve it with surprisingly good performance.

      Figure 3: The model is pre-trained to regress linear functions, and frozen during inference. With different context (input points), the model can still recover the exact underlying function. Use the slider to change the linear function to regress.

      Objective of this blog post

      The objective of this blog post is to understand how ICL is possible, and to present in an interactive way what is known of its underlying mechanism. Specifically, we will analyze the results reported in the paper Transformers Learn In-Context by Gradient Descent by J. von Oswald et al., recently published at ICML 2023, which first showed that a simplified transformer model learns in-context by gradient descent. We will replicate the authors’ findings and then complement the discussion with a number of additional insights, before pointing to open questions. We hope the reader comes out of this post with a better vision of what ICL fundamentally is and of the open challenges that remain.

      Preliminaries and notations

      First of all we need to agree on a mathematical formalization of in-context learning.

      Before we start, let’s introduce some notation and color conventions that will be used throughout the rest of the blog post. We will use the following colors to denote different quantities:

      • blue: inputs
      • green: model parameters
      • yellow: output

      Vectors will be denoted with bold letters, e.g. \(\mba\), and matrices with bold capital letters, e.g. \(\mbA\). Additional notation will be introduced in-line when needed.

      Formally, let’s define \(p(\mbx)\) as a probability distribution over inputs \(\mbx\in\cX\) and \(\cH\) a class of functions \(h: \cX \rightarrow \cY\). You can think of \(\cH\) as a set of functions that share some common properties, for example, the set of all linear functions, or the set of all functions that can be represented by a neural network with a given architecture. Also, let’s define \(p(h)\) as a probability measure over \(\cH\).

      Figure 4: Visual representation of various parametric function classes (linear, sinusoidal, shallow neural network). Use the dropdown menu to select the function class.


      Following the terminology of the LLM community, let’s define a prompt \(P\) of length \(C\) as a sequence of \(2C+1\) points \((\mbx_0, h(\mbx_0), \ldots, \mbx_{C-1}, h(\mbx_{C-1}), \mbx_{\text{query}})\), where the inputs (\(\mbx_i\) and \(\mbx_{\text{query}}\)) are independently and identically drawn from \(p(\mbx)\) and \(h\) is drawn from \(p(h)\). In short, we will also write \(P_C = \left[\{\mbx_i, h(\mbx_i)\}_{i=0}^{C-1}, \mbx_\text{query}\right]\).

      Note: The expectation in Equation \eqref{eq:in-context-error} is taken over the randomness of the input and the function. This means that we are considering the average performance of the model over all possible inputs and functions in \(\cH\).

      Additional details on the ICL formalism

      We can also define the ICL problem through the lens of statistical learning theory. Let \(\ell\) be the same per-task loss function as described above, and define the following loss \(\cL_C:\cF\rightarrow\bbR\):

      \[\begin{equation} \cL_C(f) = \mathbb{E}\left[\ell\left(f(P_C), h\left(\mbx_{\text{query}}\right)\right) \right] \end{equation}\]

      Let’s define \(f_C\) as the model that minimizes the loss with \(C\) in-context examples:

      \[\begin{equation} f_C = \arg\min_{f\in\cF} \cL_C(f) \end{equation}\]

      and \(f_\infty\) as the model that minimizes the loss with an infinite number of in-context examples:

      \[\begin{equation} f_\infty = \arg\min_{f\in\cF} \cL_\infty(f) \end{equation}\]

      We say that a class of transformer models \(\cF\) learns in-context for a function class \(\cH\) if, for any \(\epsilon > 0\) and \(\delta \in (0,1)\), there exists a model \(f\in\cF\) such that the following inequality holds:

      \[\begin{equation} \mathbb{P} \left[ \cL( f_C) - \cL( f_\infty) \leq \epsilon \right] \geq 1 - \delta \end{equation}\]

      In other words, the last equation says that a class of transformer models \(\cF\) learns in-context for a function class \(\cH\) if, for any \(\epsilon > 0\) and \(\delta \in (0,1)\), there exists a model \(f\in\cF\) such that the difference between the loss of the model trained with \(C\) in-context examples and the loss of the model trained with an infinite number of in-context examples is smaller than \(\epsilon\) with probability at least \(1-\delta\).

      Additionally, we can look at the consistency property, defined as:

      \[\begin{equation} \lim_{C\rightarrow\infty} \mathbb{P} \left[ \cL( f_C) - \cL( f_\infty) \geq \epsilon \right] = 0 \end{equation}\]

      This equation signifies that the difference between the loss of the model trained with \(C\) in-context examples and the loss of the model trained with an infinite number of in-context examples converges to zero as \(C\) goes to infinity.

      Dataset construction and tokenization

      For our setup, we will consider a linear regression problem, where the goal is to learn a linear function \(h_{\mbw}(\mbx) = \mbw^\top\mbx\), with \(\mbw\in\bbR^D\), from a set of in-context examples \(\{\mbx_i, \y_i\}_{i=0}^{C-1}\), where \(\mbx_i\in\bbR^D\) and \(\y_i\in\bbR\). So \(h_{\mbw} \in \cH\).

      In order to better understand how the prompt is constructed starting from a regression task, let’s consider the following visual example:

      Figure 5: Visualization of the data construction process, from the regression dataset, to the input prompt and the tokenization.


      The figure shows a visual representation of the construction of a single input prompt. In particular, we first sample a weight \(\mbw\) from the distribution \(p(\mbw)\), and then we sample \(C\) inputs \(\mbx_i\) from \(p(\mbx)\), where \(C\) is the fixed context size. Finally, we compute the corresponding outputs \(\y_i = \mbw^\top\mbx_i\). We consider \(p(\mbx) = \cU(-1, 1)\), where \(\cU\) is the uniform distribution, and \(p(\mbw) = \cN(\mbzero, \alpha^2\mbI)\), where \(\cN\) is a multivariate Gaussian distribution of dimension \(D\), with \(0\) mean and \(\alpha\) standard deviation.

      Defining \(c=C+1\) and \(d=D+1\), where \(C\) is the context size and \(D\) is the input dimension, we can represent the input as a matrix \(\mbE\in\bbR^{d\times c}\) (also referred to as token embeddings or, simply, embeddings), where the first \(C\) columns contain the context inputs \(\mbx_i\) and outputs \(\y_i\), and the last column contains the query input \(\mbx_{\text{query}}\) with \(0\) padding.

      To construct a batch of regression problems, we just repeat the above procedure \(N\) times with the fixed context size \(C\), where \(N\) is the size of the batch.
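      As an illustration, here is a minimal NumPy sketch of this construction for a single prompt (the values of D, C, and alpha are arbitrary):

          import numpy as np

          rng = np.random.default_rng(0)
          D, C, alpha = 3, 10, 1.0
          w = alpha * rng.standard_normal(D)                # w ~ N(0, alpha^2 I)
          X = rng.uniform(-1, 1, size=(D, C + 1))           # C context inputs plus x_query
          y = w @ X                                         # targets y_i = w^T x_i
          E = np.vstack([X, y])                             # token embeddings, shape (D+1, C+1)
          E[-1, -1] = 0.0                                   # zero-pad the unknown query target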

      A quick review of self-attention

      In this section we will briefly review the self-attention mechanism, which is the core component of the transformer architecture .

      Let \(\mbW^K, \mbW^Q \in \bbR^{d_k\times d}\), \(\mbW^V \in \bbR^{d_v\times d}\) and \(\mbW^P \in \bbR^{d \times d_v}\) be the key, query, value, and projection weight matrices, respectively. Given an embedding \(\mbE\in\bbR^{d\times c}\), the softmax self-attention layer implements the following operation,

      \[\begin{equation} \label{eq:softmax-self-attention} f_\text{attn} (\mbtheta_\text{attn}, \mbE) = \mbE + \mbW^P \mbW^V \mbE \sigma\left(\frac{(\mbW^K \mbE)^\top \mbW^Q \mbE}{\sqrt{d}}\right), \end{equation}\]

      with \(\mbtheta_\text{attn}=\{\mbW^K, \mbW^Q, \mbW^V, \mbW^P\}\), where for simplicity we will consider \(d_k=d_v=d\), and \(\sigma(\cdot)\) is the softmax function applied column-wise. It’s simple to verify that the output dimension of \(f_\text{attn}\) is the same as the input dimension. To simplify further, we can also define the value, key and query matrices as \(\mbV = \mbW^V\mbE\), \(\mbK = \mbW^K\mbE\), \(\mbQ = \mbW^Q\mbE\), respectively.
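      As an illustration, here is a minimal NumPy sketch of Equation \eqref{eq:softmax-self-attention} (with \(d_k=d_v=d\); the weight matrices are assumed given):

          import numpy as np

          def softmax_columns(x):
              # numerically stable softmax applied column-wise
              e = np.exp(x - x.max(axis=0, keepdims=True))
              return e / e.sum(axis=0, keepdims=True)

          def f_attn(E, WK, WQ, WV, WP):
              d = E.shape[0]
              K, Q, V = WK @ E, WQ @ E, WV @ E
              A = softmax_columns((K.T @ Q) / np.sqrt(d))  # (c, c) attention matrix
              return E + WP @ V @ A                        # same (d, c) shape as the input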

      Training details

      Figure 6: Visualization of the pre-training process. The model is trained to minimize the loss function defined in Equation \eqref{eq:pre-train-loss-expectation}.


      Once the dataset is created, we can train the model using the following objective:

      \[\begin{equation} \label{eq:pre-train-loss-expectation} \cL(\mbtheta) = \mathbb{E}\left\|f\left(\mbtheta, \left[\{\mbx_i, \y_i\}_{i=0}^{C-1}, \mbx_\text{query}\right]\right) - \y_{\text{query}}\right\|^2, \end{equation}\]

      where the expectation is taken over \(p(\mbx)\) and \(p(\mbw)\), with \(h_{\mbw}(\mbx) = \mbw^\top\mbx\). Note that the output of the model is a sequence of \(C+1\) values, i.e. the same shape as the input prompt, and the loss is computed only on the last value of the sequence, which corresponds to the predicted query output \(\widehat\y_{\text{query}}\). Specifically, to read out the prediction for \(\mbx_{\text{query}}\), we multiply this last value by \(-1\) again. Note that this choice is completely transparent during model training, as it is equivalent to simply changing the sign of a few elements in the projection weight matrix \(\mbW^P\). The reason for this will become clear in the following sections. At each training iteration, we replace the expectation with an empirical average over a batch of \(N\) regression tasks, each made of a different set of context points \(\{\mbx_i^{(n)}, \y_i^{(n)}\}_{i=0}^{C-1}\) and a query input/target pair, \(\mbx^{(n)}_\text{query}\) and \(\y^{(n)}_{\text{query}}\), respectively. Note that because of the online creation of the dataset, during training the model will never see the same regression task twice.

      Code for the transformer loss: this is the code for the loss computation, including the reading out of the query output.
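      A minimal sketch of this readout, assuming the model output has the same \((d, c)\) shape as the input embeddings and that \(\y_{\text{query}}\) is a scalar:

          import numpy as np

          def transformer_loss(f_out, y_query):
              # read out the last entry of the output sequence and flip its sign
              y_pred = -f_out[-1, -1]
              return (y_pred - y_query) ** 2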

      Transformers can learn any linear function in-context

      With all the preliminaries and notation in place, we can now start to analyze some results regarding the ability of transformers to learn linear functions in-context. One of the first papers that studied this ability is What Can Transformers Learn In-Context? A Case Study of Simple Function Classes by S. Garg et al. We will first replicate their results using a simpler configuration: only up to 5 layers, single-head attention, and 64 embedding units, for total parameter counts of 17K, 34K, 50K, 67K, and 84K, respectively.

      In the figure below, we report the in-context test loss (as defined in Equation \eqref{eq:in-context-test-loss}) for each model configuration, for various context sizes \(C\), from 2 to 100.

      Figure 7: Transformers can learn linear functions in-context, reasonably well. The test loss decreases as the context size increases, and as the number of layers increases.


      The experiment above shows that the test loss diminishes for larger context sizes, and also as the number of layers increases. These two main effects are clearly expected, as consequences of more data points and more compute, respectively, and they replicate the findings of Garg et al.

      Linear self-attention is sufficient

      From this point, we will depart from the classic softmax self-attention layer and restrict our study to a linear self-attention layer, which is the setting considered in the paper by J. von Oswald et al. Recently, a number of papers have drawn connections between linear transformers and Fast Weight Programmers, and have shown that linearized self-attention layers can be used to replace the softmax self-attention layer in transformers, with the advantage of reducing the computational complexity of the attention operation.

      A linear self-attention layer updates the embeddings \(\mbE\) as follows:

      \[\begin{equation} f_\text{linattn} (\mbtheta_\text{linattn}, \mbE) = \mbE + \frac{\mbW^P \mbV\left(\mbK^\top \mbQ \right)}{\sqrt{d}}, \end{equation}\]

      with \(\mbV, \mbK, \mbQ\) being the value, key and query defined right after Equation \eqref{eq:softmax-self-attention}.

      Now, to analyze if a linear self-attention layer is sufficient to learn linear functions in-context, we can use the same experimental setup as before, but replacing the softmax self-attention layer with a linear self-attention layer.

Additionally, we strip down the transformer to its bare minimum: we remove the normalization, the embedding layer, and the feed-forward layer, and use only a single head. The only remaining component is the linear self-attention layer. Therefore, in the following we use the term “linear transformer” to refer to this simplified model.

Code for the linear transformer This is the code for the linear transformer, without any normalization, embedding, etc., with a single head.
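As above, the original code is not shown; here is a minimal JAX sketch of a single linear self-attention layer implementing the equation above, assuming tokens are stored as columns of \(\mbE\).

```python
import jax.numpy as jnp

def linear_self_attention(params, E):
    """One linear self-attention layer (no softmax).
    E: (d, C+1) embedding matrix with one token per column."""
    K = params["W_K"] @ E
    Q = params["W_Q"] @ E
    V = params["W_V"] @ E
    return E + params["W_P"] @ V @ (K.T @ Q) / jnp.sqrt(E.shape[0])
```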

      We test the linear transformer on the same dataset setup as before, and we will use the same number of layers as before, i.e. 1, 2, 3, 4, 5.

      Figure 8: Linear transformers can also learn linear functions in-context, reasonably well. The test loss decreases as the context size increases, and as the number of layers increases.


      What is special about linear self-attention?

      From the previous section we have seen that a linear self-attention layer is sufficient to learn linear functions in-context. In this section we will try to understand why this is the case, starting from a review of least-squares regression and gradient descent.

      Establishing a connection between gradient descent and data manipulation

In this section, we establish an important connection that is fundamental to understanding the mechanism behind ICL with linear self-attention. We start from a simple linear regression problem and show that we can achieve the same loss after one gradient step by changing the inputs and the targets while keeping the weights fixed.

      The loss for a linear regression problem is defined as: \(\begin{equation} \label{eq:linear-regression-loss} \cL_{\text{lin}}\left(\mbw, \{\mbx_i, {\y}_i\}_{i=0}^{C-1}\right) = \frac 1 {2C} \sum_{i=0}^{C-1} (\mbw^\top\mbx_i - \y_i)^2 \end{equation}\)

      where \(\mbw\in\bbR^D\), \(\mbx_i\in\bbR^D\) and \(\y_i\in\bbR\). With a given learning rate \(\eta\), the gradient descent update is \(\mbw \leftarrow \mbw - \Delta \mbw\), where \(\begin{equation} \label{eq:linear-regression-gd-gradient} \Delta \mbw = \eta \nabla_{\mbw} \cL_{\text{lin}}\left(\mbw, \{\mbx_i, {\y}_i\}_{i=0}^{C-1}\right) = \frac{\eta}{C} \sum_{i=0}^{C-1} \left(\mbw^\top\mbx_i - \y_i\right)\mbx_i \end{equation}\) The corresponding loss (after the update) is: \(\begin{equation} \label{eq:linear-regression-loss-after-gd} \cL_{\text{lin}}\left(\mbw - \Delta \mbw, \{\mbx_i, {\y}_i\}_{i=0}^{C-1}\right) = \frac 1 {2C} \sum_{i=0}^{C-1} \left(\mbw^\top\mbx_i - \y_i - \Delta \mbw^\top\mbx_i\right)^2 \end{equation}\)

      It is trivial to see that if we now define \(\widehat{\mbx}_i = \mbx_i\) and \(\widehat{\y}_i = \y_i + \Delta \mbw^\top\mbx_i\), we can compute Equation \eqref{eq:linear-regression-loss} with the new inputs and targets, i.e. \(\cL_{\text{lin}}(\mbw, \{\widehat{\mbx}_i, \widehat{\y}_i\}_{i=0}^{C-1})\), which is the same as the loss after the gradient descent update (Equation \eqref{eq:linear-regression-loss-after-gd}).
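This equivalence is easy to check numerically; the following JAX snippet is a small, self-contained verification of the identity above.

```python
import jax
import jax.numpy as jnp

D, C, eta = 5, 20, 0.1
w = jax.random.normal(jax.random.PRNGKey(0), (D,))
X = jax.random.normal(jax.random.PRNGKey(1), (C, D))
y = jax.random.normal(jax.random.PRNGKey(2), (C,))

# Least-squares loss (1/2C) sum_i (w^T x_i - y_i)^2.
loss = lambda w_, y_: 0.5 * jnp.mean((X @ w_ - y_) ** 2)
dw = eta * jax.grad(loss)(w, y)          # the GD update Delta w

# Loss after the gradient step ...
after_step = loss(w - dw, y)
# ... equals the loss at the *old* weights with modified targets.
y_hat = y + X @ dw
after_data = loss(w, y_hat)
print(jnp.allclose(after_step, after_data))  # True
```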

      Building a linear transformer that implements a gradient descent step

As we just saw, the starting intuition is that we can reproduce a gradient step on the linear regression loss by manipulating the inputs and the targets. This is the key insight of von Oswald et al. that allows us to draw a connection between the gradient descent dynamics and the linear transformer.

      Before stating the main result, recall the definitions of value, key and query as \(\mbV = \mbW^V\mbE\), \(\mbK = \mbW^K\mbE\), and \(\mbq_j = \mbW^Q\mbe_j\).

Main result: Given a 1-head linear attention layer and the tokens \(\mbe_j = (\mbx_j, \y_j)\), for \(j=0,\ldots,C-1\), we can construct key, query and value matrices \(\mbW^K, \mbW^Q, \mbW^V\), as well as the projection matrix \(\mbW^P\), such that a transformer step on every token, \(\mbe_j \leftarrow (\mbx_j, \y_{j}) + \mbW^{P} \mbV \mbK^{T}\mbq_{j}\), is identical to the gradient-induced dynamics \(\mbe_j \leftarrow (\mbx_j, \y_j) + (0, -\Delta \mbw^\top \mbx_j)\). For the query data \((\mbx_{\text{query}}, \y_{\text{query}})\), the dynamics are identical.

      For notation, we will identify with \(\mbtheta_\text{GD}\) the set of parameters of the linear transformer that implements a gradient descent step.

The construction of a linear self-attention layer that implements a gradient descent step is not unique; one possibility, in block form, is the following.

      \[\begin{align} \mbW^K = \mbW^Q = \left(\begin{array}{@{}c c@{}} \mbI_D & 0 \\ 0 & 0 \end{array}\right) \end{align}\]

      with \(\mbI_D\) the identity matrix of size \(D\), and

      \[\begin{align} \mbW^V = \left(\begin{array}{@{}c c@{}} 0 & 0 \\ \mbw_0^\top & -1 \end{array} \right) \end{align}\]

with \(\mbw_0 \in \bbR^{D}\) the weight vector of the linear model, and \(\mbW^P = \frac{\eta}{C}\mbI_{d}\), with \(\mbI_d\) the identity matrix of size \(d\).
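For concreteness, here is a JAX sketch of this construction, assuming tokens \((\mbx_j, \y_j)\) of dimension \(d = D+1\); the helper name `gd_transformer_params` is ours.

```python
import jax.numpy as jnp

def gd_transformer_params(w0, eta, C):
    """Block-form weights that make one linear-attention step
    equal to one GD step. w0: (D,) initial linear-model weights."""
    D = w0.shape[0]
    WK = jnp.zeros((D + 1, D + 1)).at[:D, :D].set(jnp.eye(D))
    WQ = WK                                   # W^K = W^Q
    WV = jnp.zeros((D + 1, D + 1)).at[D, :D].set(w0).at[D, D].set(-1.0)
    WP = (eta / C) * jnp.eye(D + 1)           # W^P = (eta / C) I_d
    return {"W_K": WK, "W_Q": WQ, "W_V": WV, "W_P": WP}
```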

      If you are interested in the proof of construction for the GD-equivalent transformer, you can find it in the following collapsible section.

      Proof of construction for the GD-equivalent transformer

      To verify this, first remember that if \(\mbA\) is a matrix of size \(N\times M\) and \(\mbB\) is a matrix of size \(M\times P\),

      \[\begin{align} \mbA\mbB = \sum_{i=1}^M \mba_i\otimes\mbb_{,i} \end{align}\]

      where \(\mba_i \in \bbR^{N}\) is the \(i\)-th column of \(\mbA\), \(\mbb_{,i} \in \bbR^{P}\) is the \(i\)-th row of \(\mbB\), and \(\otimes\) is the outer product between two vectors.

      It is easy to verify that with this construction we obtain the following dynamics

\[\begin{align} \left(\begin{array}{@{}c@{}} \mbx_j\\ \y_j \end{array}\right) \leftarrow & \left(\begin{array}{@{}c@{}} \mbx_j\\ \y_j \end{array}\right) + \mbW^{P} \mbV \mbK^{T}\mbq_{j} = \mbe_j + \frac{\eta}{C} \sum_{i={0}}^{C-1} \left(\begin{array}{@{}c c@{}} 0 & 0 \\ \mbw_0^\top & -1 \end{array} \right) \left(\begin{array}{@{}c@{}} \mbx_i\\ \y_i \end{array}\right) \otimes \left( \left(\begin{array}{@{}c c@{}} \mbI_D & 0 \\ 0 & 0 \end{array}\right) \left(\begin{array}{@{}c@{}} \mbx_i\\ \y_i \end{array}\right) \right) \left(\begin{array}{@{}c c@{}} \mbI_D & 0 \\ 0 & 0 \end{array}\right) \left(\begin{array}{@{}c@{}} \mbx_j\\ \y_j \end{array}\right)\\ &= \left(\begin{array}{@{}c@{}} \mbx_j\\ \y_j \end{array}\right) + \frac{\eta}{C} \sum_{i={0}}^{C-1} \left(\begin{array}{@{}c@{}} 0\\ \mbw_0^\top \mbx_i - \y_i \end{array}\right) \otimes \left(\begin{array}{@{}c@{}} \mbx_i\\ 0 \end{array}\right) \left(\begin{array}{@{}c@{}} \mbx_j\\ 0 \end{array}\right) = \left(\begin{array}{@{}c@{}} \mbx_j\\ \y_j \end{array}\right) + \left(\begin{array}{@{}c@{}} 0\\ - \frac{\eta}{C}\sum_{i=0}^{C-1} \left( \left(\mbw_0^\top\mbx_i - \y_i\right)\mbx_i\right)^\top \mbx_j \end{array}\right). \end{align}\]

      Note that the update for the query token \((\mbx_{\text{query}}, \textcolor{output}{0})\) is identical to the update for the context tokens \((\mbx_j, \y_j)\) for \(j=0,\ldots,C-1\).

      Experiments and analysis of the linear transformer

Now let’s run some experiments to verify the theoretical results. We work within the same experimental setup as before, with the same dataset construction, training procedure and testing procedure. In this first section, we consider a linear transformer with a single layer, and compare it with the transformer built as described in the previous section (the GD-equivalent transformer), i.e. with a linear self-attention layer that implements a gradient descent step.

      During training, a linear transformer learns to implement a gradient descent step

We now study the evolution of the test loss of a linear transformer during training, \(\cL(\mbtheta)\), and compare it to the loss of a transformer implementing a gradient descent step, \(\cL(\mbtheta_\text{GD})\).

      Figure 9: The loss of a trained linear transformer converges to the loss of a transformer implementing a gradient descent step on the least-squares regression loss with the same dataset. Use the slider to change the context size.


Although an empirical proof of such a functional equivalence would require checking the outputs for all possible test samples, we can gather more evidence by looking more closely at the computations that unfold in the linear transformer during one pass.

      To better understand the dynamics of the linear transformer, we now study the evolution of a few metrics during training (the L2 error for predictions, the L2 error for gradients and the cosine similarity between models).

      Metrics details

      The metrics introduced above are defined as follows:

      • L2 error (predictions) measures the difference between the predictions of the linear transformer and the predictions of the transformer implementing a gradient descent step and it is defined as \(\left\|f\left(\mbtheta, \left[\{\mbx_i, \y_i\}_{i=0}^{C-1}, \mbx_\text{query}\right]\right) - f\left(\mbtheta_\text{GD}, \left[\{\mbx_i, \y_i\}_{i=0}^{C-1}, \mbx_\text{query}\right]\right) \right\|^2\);

      • L2 error (gradients w.r.t. inputs) measures the difference between the gradients of the linear transformer and the gradients of the transformer implementing a gradient descent step and it is defined as \(\left\|\nabla_{\mbx_\text{query}} f\left(\mbtheta, \left[\{\mbx_i, \y_i\}_{i=0}^{C-1}, \mbx_\text{query}\right]\right) - \nabla_{\mbx_\text{query}} f\left(\mbtheta_\text{GD}, \left[\{\mbx_i, \y_i\}_{i=0}^{C-1}, \mbx_\text{query}\right]\right) \right\|^2\);

      • Model cosine similarity (gradients w.r.t. inputs) measures the cosine similarity between the gradients of the linear transformer and the gradients of the transformer implementing a gradient descent step and it is defined as \(\cos\left(\nabla_{\mbx_\text{query}} f\left(\mbtheta, \left[\{\mbx_i, \y_i\}_{i=0}^{C-1}, \mbx_\text{query}\right]\right), \nabla_{\mbx_\text{query}} f\left(\mbtheta_\text{GD}, \left[\{\mbx_i, \y_i\}_{i=0}^{C-1}, \mbx_\text{query}\right]\right)\right)\).


      Figure 10: Comparison between the linear transformer and the GD-transformer during training. The predictions of the linear transformer converge to the predictions of the GD-transformer and the gradients of the linear transformer converge to the gradients of the GD-transformer. Use the slider to change the context size.


      From this figure, we see that the predictions of the linear transformer converge to the predictions of the GD-transformer, and the gradients of the linear transformer converge to the gradients of the GD-transformer. Notably, this is true for all context sizes, though the convergence is faster for larger \(C\).

      As a final visualization, we can also look at the evolution of the gradients of the linear transformer during training, as shown in the figure below. In this animation, we take six different regression tasks and we plot the gradients of the linear transformer during training and the exact gradients of the least-squares regression loss.


      Figure 11: Animation of the gradients of the linear transformer during training. The loss landscape visualized is the least-squares regression loss (each task has its own loss). The gradients of the linear transformer are shown in red, while the gradients of the least-squares regression loss are shown in orange.

To reiterate, the loss landscape visualized is the least-squares regression loss, and each task is a different linear regression problem with a different loss landscape. Once more, this visualization shows that the linear transformer is not learning a single regression model; rather, it is learning how to solve linear regression problems.

      The effect of the GD learning rate

      Next, we study the effect of the GD learning rate on the test loss of the GD-equivalent transformer. We believe this is an important point of discussion which was covered only briefly in the paper.

In the original paper (and in our previous experiments), the optimal GD learning rate is found with a line search that minimizes the loss \(\cL(\mbtheta_\text{GD})\). We now show what happens if we use a GD learning rate different from the one found with line search. In the following experiment, we visualize this behavior by plotting the metrics described above for different values of the GD learning rate.

      Figure 12: Effect of the GD learning rate on the alignment between the linear transformer and the GD-transformer. The agreement between the two is maximized for a specific GD learning rate, which must be found by line search. Use the slider to manually change the GD learning rate.


      Analytical derivation of the best GD learning rate

It turns out that a line search to find the best GD learning rate is not necessary.

      The analytical solution is provided below with its derivation reported in the collapsible section immediately following.

      Analytical derivation of the best GD learning rate

We are interested in finding the optimal learning rate for the GD-transformer which, by construction (see the main result above), is equivalent to finding the optimal GD learning rate for the least-squares regression problem. Consequently, the analysis can start from the least-squares regression problem \eqref{eq:linear-regression-loss}.

Recall the GD update of the least-squares regression in \eqref{eq:linear-regression-gd-gradient}, without taking the learning rate into account. That is,

      \[\begin{equation} \label{eq:linear-regression-gd-gradient-no-lr} \Delta \mbw = \nabla_{\mbw} \cL_{\text{lin}}\left(\mbw, \{\mbx_i, \y_i\}_{i=0}^{C-1}\right) = \frac{1}{C} \sum_{i=0}^{C-1} \left(\mbw^\top\mbx_i - \y_i\right)\mbx_i. \end{equation}\]

      Now we consider the test loss of the least-squares regression defined as

      \[\begin{equation} \cL_\mathrm{lin, te}(\{\mbw^{(n)}\}_{n=0}^{N-1}) = \frac{1}{N} \sum_{n=0}^{N-1} ((\mbx^{(n)}_\text{query})^\top \mbw^{(n)} - \y^{(n)}_\text{query})^2, \end{equation}\]

where \(N\) is the number of queries, which equals the number of regression tasks in the in-context test loss dataset. Similar to \eqref{eq:linear-regression-loss-after-gd}, after one step of the GD update \eqref{eq:linear-regression-gd-gradient-no-lr}, the corresponding test loss becomes

      \[\begin{align} &\quad \ \ \cL_\mathrm{lin, te}(\{\mbw^{(n)} - \eta \Delta \mbw^{(n)}\}_{n=0}^{N-1}) \nonumber \\ &= \frac{1}{N} \sum_{n=0}^{N-1} \left((\mbx^{(n)}_\text{query})^\top (\mbw^{(n)} - \eta \Delta \mbw^{(n)}) - \y^{(n)}_\text{query}\right)^2 \nonumber \\ &= \frac{1}{N} \sum_{n=0}^{N-1} \left((\mbx^{(n)}_\text{query})^\top \mbw^{(n)} - \y^{(n)}_\text{query} - \eta (\mbx^{(n)}_\text{query})^\top \Delta \mbw^{(n)} \right)^2 \nonumber \\ &= \frac{\eta^2}{N} \sum_{n=0}^{N-1} ((\mbx^{(n)}_\text{query})^\top \Delta \mbw^{(n)})^2 + \cL_\mathrm{lin, te}(\{\mbw^{(n)}\}_{n=0}^{N-1}) \nonumber \\ &\quad \ - \frac{2\eta}{N} \sum_{n=0}^{N-1} ((\mbx^{(n)}_\text{query})^\top \mbw^{(n)} - \y^{(n)}_\text{query})(\mbx^{(n)}_\text{query})^\top \Delta \mbw^{(n)}. \label{eq:loss_query_W1} \end{align}\]

      One can choose the optimum learning rate \(\eta^*\) such that \(\cL_\mathrm{lin, te}(\{\mbw^{(n)} - \eta \Delta \mbw^{(n)}\}_{n=0}^{N-1})\) achieves its minimum with respect to the learning rate \(\eta\). That is,

      \[\begin{align} \eta^* \in \arg\min_{\eta > 0} \cL_\mathrm{lin, te}(\{\mbw^{(n)} - \eta \Delta \mbw^{(n)}\}_{n=0}^{N-1}). \end{align}\]

      To obtain \(\eta^*\), it suffices to solve

\(\begin{align} \nabla_\eta \cL_\mathrm{lin, te}(\{\mbw^{(n)} - \eta \Delta \mbw^{(n)}\}_{n=0}^{N-1}) = 0. \end{align}\) From \eqref{eq:loss_query_W1} and plugging in \(\Delta \mbw^{(n)}\) from \eqref{eq:linear-regression-gd-gradient-no-lr}, we obtain \(\begin{align} \eta^* &= \frac{\sum_{n=0}^{N-1} ((\mbx^{(n)}_\text{query})^\top \mbw^{(n)} - \y^{(n)}_\text{query})(\mbx^{(n)}_\text{query})^\top \Delta \mbw^{(n)} } {\sum_{n=0}^{N-1} ((\mbx^{(n)}_\text{query})^\top \Delta \mbw^{(n)})^2} \nonumber \\ &= C \frac{\sum_{n=0}^{N-1} ((\mbx^{(n)}_\text{query})^\top \mbw^{(n)} - \y^{(n)}_\text{query}) \sum_{i=0}^{C-1} ((\mbw^{(n)})^\top \mbx_i^{(n)} - \y_i^{(n)})(\mbx_i^{(n)})^\top \mbx^{(n)}_\text{query}} {\sum_{n=0}^{N-1} \left( \sum_{i=0}^{C-1} ((\mbw^{(n)})^\top \mbx_i^{(n)} - \y_i^{(n)})(\mbx_i^{(n)})^\top \mbx^{(n)}_\text{query} \right)^2}. \end{align}\) Finally, for the initialization \(\mbw^{(n)} = 0\) for \(n = 0, \ldots, N-1\), the optimal learning rate simplifies to \(\begin{align} \eta^* = C \frac{\sum_{n=0}^{N-1} \y^{(n)}_\text{query} \left(\sum_{i=0}^{C-1}\left( \y^{(n)}_i{\left(\mbx^{(n)}_i\right)}^\top \mbx_\text{query}^{(n)}\right)\right) }{\sum_{n=0}^{N-1} \left(\sum_{i=0}^{C-1}\left(\y^{(n)}_i {\left(\mbx^{(n)}_i\right)}^\top \mbx_\text{query}^{(n)}\right)\right)^2}. \end{align}\)
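As a sanity check, the closed-form \(\eta^*\) for the zero initialization is straightforward to compute; below is a minimal JAX sketch, with array layouts that are our own convention.

```python
import jax.numpy as jnp

def optimal_gd_lr(X, y, x_q, y_q):
    """Closed-form eta* for w0 = 0.
    X: (N, C, D) context inputs, y: (N, C) context targets,
    x_q: (N, D) query inputs, y_q: (N,) query targets."""
    C = X.shape[1]
    # s_n = sum_i y_i^(n) (x_i^(n))^T x_query^(n)
    s = jnp.einsum("nc,ncd,nd->n", y, X, x_q)
    return C * jnp.sum(y_q * s) / jnp.sum(s ** 2)
```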


      Some comments on the analytical solution

      This derivation of the optimal GD learning rate \(\eta^*\) agrees well with the line search procedure (up to the numerical precision of the line search procedure itself). While this is expected, let’s take a moment to understand why this is the case.

1. The analytical solution is obtained starting from the linear regression loss, while the line search procedure uses the loss \(\cL(\mbtheta_\text{GD})\) defined in Equation \eqref{eq:pre-train-loss-expectation}. However, the two losses are equivalent by construction, hence the two procedures are equivalent.

2. Because the construction of the GD-transformer is not unique, it is not easy to see the effect of the GD learning rate when comparing it with the trained linear transformer. Recall that, due to its parametrization, the linear transformer does not have an explicit \(\eta\) parameter, as it can be absorbed into any of the weight matrices of the linear self-attention layer. Yet, the linear transformer converges to the exact same loss as the GD-transformer with the optimal GD learning rate \(\eta^*\). This is expected because, fundamentally, the loss function used for the line search and the one used for the analytical solution are both equivalent to the loss in Equation \eqref{eq:pre-train-loss-expectation} used during transformer training.

      Said differently, what we did in two steps for the GD-transformer (first build the \(\mbW^K, \mbW^Q, \mbW^V\) matrices, then find the optimal GD learning rate) is done implicitly during the training of the linear transformer.

      The following table summarizes the three different procedures we have discussed so far.

|  | Loss function | GD learning rate |
| --- | --- | --- |
| Least-squares regression | \(\cL_\text{lin}(\mbw-\Delta \mbw)\) | Explicit \(\eta^*\) by analytical solution |
| GD-transformer | \(\cL(\mbtheta_\text{GD})\) | Explicit \(\eta^*\) by line search |
| Linear transformer | \(\cL(\mbtheta)\) | Implicit \(\eta^*\) by training \(\mbtheta\) |

      Finally, one comment on the computational complexity of the two procedures. It doesn’t come as a surprise that the analytical solution is faster to compute than the line search: the line search requires on average 10 seconds to find the optimal GD learning rate, while the analytical solution requires only 10 milliseconds (both with JAX’s JIT compilation turned on, run on the same GPU).

      If one layer is a GD step, what about multiple layers?

It is only natural to ask whether the same behavior is observed for a linear transformer with multiple layers. In particular, if we take a trained linear transformer with a single layer (which we now know implements a gradient descent step) and apply the same layer update multiple times recursively, will we observe the same behavior?

      As we now show in the following experiment, the answer is no. In fact, the test loss for both the linear transformer and the transformer implementing a gradient descent step diverges as we increase the number of layers.

      To stabilize this behavior, we use a dampening factor \(\lambda\), which is a scalar in \([0, 1]\), and we update the linear transformer as follows:

      \[\begin{equation} \label{eq:linear-transformer-update} \mbE^{(l+1)} = \mbE^{(l)} + \lambda \mbW^P \mbV\left(\mbK^\top \mbQ \right), \end{equation}\]

      where \(\mbE^{(l)}\) is the embedding matrix at layer \(l\), and \(\mbW^P, \mbV, \mbK, \mbQ\) are the projection, value, key and query matrices as defined before. Effectively, this is equivalent to applying a gradient descent step with scaled learning rate.

      Code for the recurrent transformer This is the code for the recurrent transformer, with a dampening factor \(\lambda\). Note that the attention layer is the same as before, but we now apply it multiple times.
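Again, the original snippet is not reproduced; a minimal JAX sketch of the recurrent application with dampening, matching Equation \eqref{eq:linear-transformer-update}, could look as follows.

```python
def recurrent_transformer(params, E, n_steps, lam):
    """Apply the same linear self-attention layer n_steps times,
    dampened by lambda (Equation linear-transformer-update)."""
    for _ in range(n_steps):
        K = params["W_K"] @ E
        Q = params["W_Q"] @ E
        V = params["W_V"] @ E
        E = E + lam * params["W_P"] @ V @ (K.T @ Q)
    return E
```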


      Figure 13: A pre-trained transformer with a single layer can be used recursively to implement multiple gradient descent steps, after applying a dampening factor \(\lambda\) to the self-attention layer. Use the slider to change the value of \(\lambda\).


Note that in the original paper the authors suggest that a dampening factor of \(\lambda=0.75\) is generally sufficient to obtain the same behavior as a single-layer linear transformer. As we can see from the figure above, our investigations do not confirm this: in our experiments we need at least \(\lambda=0.70\) to obtain the same behavior as a single-layer linear transformer, which suggests that the effect of the dampening factor can vary.

      Is this just for transformers? What about LSTMs?

Transformers are not the only architecture that can implement sequence-to-sequence models. Notably, recurrent neural networks (RNNs) have long been used for sequence-to-sequence modeling, and in particular long short-term memory (LSTM) networks have been shown to be very effective in many tasks.

Indeed, from a modeling perspective, nothing prevents us from using an LSTM to implement in-context learning for regression tasks. In fact, we can use the same experimental setup as before, replacing the transformer with an LSTM. The main architectural difference between an LSTM and a transformer is that LSTM layers are causal by design, i.e. they can only attend to previous tokens in the sequence, while transformers can attend to any token in the sequence. While this is a desirable property for tasks where order matters, like language modeling, it is not for the regression task we are considering, since the input sequence is not ordered (i.e. shuffling the input sequence does not change the output of the linear regression model). For this reason, together with the classic uni-directional LSTM, we also consider a bi-directional LSTM, which can attend to both previous and future tokens in the sequence. This provides a fairer comparison between the LSTMs and the transformers.

In this first experiment, we analyze the ability of the uni-directional and the bi-directional LSTM to learn linear functions in-context. Note that because of the intrinsic non-linear nature of the LSTM layers, we cannot manually construct an LSTM that implements a gradient descent step, as we did for the transformer. Nonetheless, we can still compare the LSTMs with the GD-equivalent transformer (which we now know implements a gradient descent step on the least-squares regression loss).

Figure 14: LSTMs cannot learn linear functions in-context as effectively as transformers, and bi-directional LSTMs learn linear functions in-context better than uni-directional LSTMs. Use the slider to change the number of layers.


      In this figure we can see that a single layer LSTM is not sufficient to learn linear functions in-context. For the uni-directional LSTM, we see that the test loss is always higher than the test loss of the transformer implementing a gradient descent step, even if we increase the number of layers. On the contrary, for the bi-directional LSTM, we see that the test loss approaches that of the GD-equivalent transformer as we increase the number of layers.

      The poor performance of the uni-directional LSTM is not surprising. Additional evidence is provided in the figure below, where, as we did for the transformer, we plot the L2 error (predictions), the L2 error (gradients w.r.t. inputs) and the model cosine similarity (gradients w.r.t. inputs) comparing the LSTM with the GD-equivalent transformer.


      Figure 15: Uni-directional LSTMs cannot learn linear functions in-context as effectively as transformers. Use the slider to change the number of layers.


      Regardless of the number of layers, we see that the uni-directional LSTM is not implementing a gradient descent step, as the L2 error (predictions) and the L2 error (gradients w.r.t. inputs) do not converge to 0, and the model cosine similarity (gradients w.r.t. inputs) remains well below 1. The picture changes for the bi-directional LSTM, as we can see in the figure below.


      Figure 16: Bi-directional LSTMs align better with the GD-equivalent transformer as we increase the number of layers. Use the slider to change the number of layers.


While for a single layer we can comfortably say that the bi-directional LSTM, too, is not equivalent to a GD step, for 2 or more layers we cannot reject the hypothesis that it is (use the slider to change the number of layers in Figures 14-16). Note that if we compare this result with Figure 10, while we don’t see exactly the same behavior (e.g. the cosine similarity stays slightly below 1), it is still remarkably similar. This is not a conclusive result, but it is interesting to see that the bi-directional LSTM can learn linear functions in-context similarly to a transformer implementing a gradient descent step.

      Concluding remarks

In this blog post, we have presented a series of experiments to understand the mechanistic behavior of transformers and self-attention layers through the lens of optimization theory. In particular, we analyzed the results of the paper Transformers Learn In-Context by Gradient Descent, replicating some of its experiments and providing additional insights. We also derived an analytical solution for the best GD learning rate, which is faster to compute than the line search procedure used in the original paper. Finally, we showed empirically that LSTMs behave differently from transformers, and that single-layer LSTMs do not in fact implement a gradient descent step. The results on deep LSTMs are less conclusive, showing behavior similar to the GD-equivalent transformer, but not exactly the same.

      What now?

      The results presented in this blog post, while confirming the main findings of the original paper, also raise a number of questions and suggest possible future research directions.

1. To reiterate, what we have done so far is to try to understand the behavior of transformers and self-attention layers through the lens of optimization theory. This is the common approach in the literature, including very recent additions, and it is the approach we have followed in this blog post. However, it can pose significant limitations regarding the generalization of the results and the applicability of the findings to other architectures (notably, causal self-attention layers). Phenomena like the emergent abilities or the memorization behavior of large language models may indicate that fundamentally different mechanisms are at play in these models, and that the optimization perspective might not be sufficient to understand them.

2. On the other hand, nothing prevents us from working in the opposite direction, i.e. starting from specific learning algorithms and trying to design neural networks that implement them. From an alignment perspective, for example, this is desirable because it allows us to start by designing objective functions and learning algorithms that are more interpretable and better aligned with our objectives, rather than starting from a black-box neural network and trying to understand its behavior. In this quest, the developing theory of mesa-optimization can represent a useful framework to understand these large models.

3. Finally, we want to highlight that the main results shown in this blog post are consequences of the simplifying hypotheses and the experimental setup we have considered (linear functions, least-squares regression loss, linear self-attention layers). In an equally recent paper, for example, the authors take a completely different route: by representing transformers as interacting particle systems, they were able to show that tokens tend to cluster toward limiting objects that depend on the input context. This suggests that other interpretations of the behavior of transformers are not only possible, but possibly necessary to understand how these models learn in context.

      Appendix

      Connection with meta-learning

From a learning point of view, ICL seems closely related to meta-learning, where the goal is to learn a model that can quickly adapt to new tasks. If we consider the function class \(\cH\) as an uncountable set of tasks, then the model is learning how to adapt to a new function by observing a few examples of that function. The main difference between the classic formulation of meta-learning and the formulation of in-context learning is that in the latter the model is not allowed to change its weights; it can only change its internal state (e.g., the hidden activations of the transformer). Indeed, meta-learning relies on the assumption that the model can quickly adapt to new tasks by changing its weights (i.e. by taking one or more gradient steps).

      Connection with MAML (Model-Agnostic Meta-Learning)

In the meta-learning setup, we need to define a generic base-model \(m:\cX\rightarrow\cY\), parameterized by \(\mbw\), that works at the sample level. Let’s now relax the assumption that \(\cF\) is a class of transformer models and build \(f\) as follows:

      \[\begin{equation} \label{eq:meta-learning-model} f(\mbw, P_C) = m\left(\mbw - \eta \nabla_{\mbw} \sum_{i=0}^{C-1}\ell\left(m(\mbw,\mbx_i), \y_i\right),\mbx_\text{query}\right) \end{equation}\]

      where \(\eta\) is the learning rate of the meta-learning algorithm. Equation \eqref{eq:meta-learning-model} represents the inner optimization loop in a simplified version of the MAML algorithm , where the model is updated with a single gradient step.
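As an illustration, the inner loop of Equation \eqref{eq:meta-learning-model} can be written in a few lines of JAX; the base model `m` and the per-sample loss `ell` below are assumed, illustrative callables.

```python
import jax
import jax.numpy as jnp

def maml_inner_predict(m, ell, w, X_ctx, y_ctx, x_query, eta):
    """One inner GD step: adapt w on the context set, then
    predict on the query input."""
    inner_loss = lambda w_: jnp.sum(
        jax.vmap(lambda x, y: ell(m(w_, x), y))(X_ctx, y_ctx))
    grads = jax.grad(inner_loss)(w)
    w_adapted = jax.tree_util.tree_map(lambda p, g: p - eta * g, w, grads)
    return m(w_adapted, x_query)
```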

Putting it all together, we can define the meta-learning loss as:

      \[\begin{equation} \label{eq:meta-learning-loss} \cL_{\text{MAML}}(\mbw) = \mathbb{E}\left[\ell\left(f(\mbw, P_C), h\left(\mbx_{\text{query}}\right)\right) \right] \end{equation}\]

      which now is optimized w.r.t. the base-model’s parameters \(\mbw\).

      The resemblance between Equation \eqref{eq:in-context-error} and Equation \eqref{eq:meta-learning-loss} is now clear and it justifies the interpretation of in-context learning as a form of meta-learning.

      In particular, it is interesting to study under which conditions the model \(f\) defined in Equation \eqref{eq:meta-learning-model} is equivalent to a transformer model.

      Testing details

In order to test whether a model learns in-context for a given function class, we need to define a dataset of in-context examples. In this case we will only consider in-distribution test examples, i.e. examples drawn from the same distribution as the training examples. Specifically, we will use the same distribution for the test inputs \(p(\mbx)\) and the same distribution for the test weights \(p(\mbw)\) as those used during training. Various papers have also considered the case where the inputs are drawn from a different distribution than the training examples (out-of-distribution, or OOD), but to keep the discussion focused we will only consider the in-distribution case.

      We define the in-context test loss as:

      \[\begin{equation} \label{eq:in-context-test-loss} \cL_\text{te}(\mbtheta) = \frac 1 N \sum_{n=0}^{N-1} \left\|f\left(\mbtheta, \left[\{\mbx_i^{(n)}, \y_i^{(n)}\}_{i=0}^{C-1}, \mbx^{(n)}_\text{query}\right]\right) - \y^{(n)}_{\text{query}}\right\|^2. \end{equation}\]

      Specifically, we will consider a fixed dataset of \(N=10000\) regression tasks, where each task is defined by a set of in-context examples \(\{\mbx_i^{(n)}, \y_i^{(n)}\}_{i=0}^{C-1}\) and a query pair \(\mbx^{(n)}_{\text{query}}\) and \(\y^{(n)}_{\text{query}}\).

      \ No newline at end of file diff --git a/blog/unraveling-the-impact-of-training-samples/index.html b/blog/unraveling-the-impact-of-training-samples/index.html new file mode 100644 index 00000000..49e33fff --- /dev/null +++ b/blog/unraveling-the-impact-of-training-samples/index.html @@ -0,0 +1,56 @@ + Unraveling The Impact of Training Samples | ICLR Blogposts 2024

      Unraveling The Impact of Training Samples

How do we quantify the influence of datasets? Recent works on Data Attribution Methods shed light on this problem. In this blog post, we introduce Data Attribution Methods which leverage robust statistics and surrogate functions, and present applications such as distinguishing the feature-selection behavior of learning algorithms, detecting data leakage, and assessing model robustness.

      How do we quantify the true influence of datasets? What role does the influence score play in refining datasets and unraveling the intricacies of learning algorithms? Recent works on Data Attribution Methods give us an interesting answer to these problems.

This blog post revisits several proposed Data Attribution Methods which aim to quantitatively measure the importance of each training sample with respect to the model’s output. It also demonstrates the utility of data attribution methods through usage examples, e.g. understanding the differences between learning algorithms, checking for data leakage, and analyzing model robustness.

Motivation of data attribution. For a given target, we want to quantify the influence of each training sample. This makes model decisions and biases more interpretable.

      Data Attribution Methods

      Exploring various milestone frameworks offers valuable insight into understanding the impact of training samples. Let’s delve into some established methods used for data attribution.

      Influence Functions

In the paper Understanding Black-box Predictions via Influence Functions, the authors scaled up influence functions (a classic technique from robust statistics) to modern deep learning settings. Under the assumptions that the empirical risk function is twice-differentiable and strictly convex and that the algorithm attains the optimal point, we can estimate the influence of training samples by computing only gradients and Hessian-vector products of the model.

The intuition behind the influence function is to look at the change in test loss after removing or perturbing one training sample. The calculation is given as follows:

      \[\mathcal{I}_{\text{removal,loss}}(z,z_{\text{test}}):=\frac{dL(z_\text{test},\hat\theta_{\epsilon,z})}{d\epsilon}\Bigg|_{\epsilon=0}\approx-\nabla_\theta L(z_{\text{test}},\hat\theta)^\top H_{\hat\theta}^{-1}\nabla_\theta L(z,\hat\theta)\]
      Show algorithm step by step

Given the assumptions we made, our algorithm can find the optimal $\hat\theta$ which minimizes the empirical risk and also guarantees the existence of a positive definite Hessian matrix: $$R(\theta):=\frac{1}{n}\sum L(z_i,\theta), \ \ \hat\theta=\arg\min_\theta R(\theta)$$ $$H_{\hat\theta}:=\frac{1}{n}\sum \nabla _\theta^2 L(z_i,\hat\theta).$$ Given the intuition written above, we look at the parameter difference $\Delta_\epsilon=\hat\theta_{\epsilon, z}-\hat\theta$ after perturbing one training sample: $$\hat\theta_{\epsilon, z}=\arg\min_{\theta}\{R(\theta)+\epsilon L(z,\theta)\}$$ Recall that our goal is to estimate how the algorithm's solution changes with a sample perturbation, which we can express as $\frac{d \hat\theta_{\epsilon, z}}{d \epsilon}$. Since $\hat\theta_{\epsilon, z}$ is a minimizer of the perturbed loss, we can write its first-order optimality condition: $$0=\nabla R(\hat\theta_{\epsilon, z})+\epsilon \nabla L(z,\hat\theta_{\epsilon, z}).$$ By performing a Taylor expansion around $\hat\theta$, we can estimate $$0\approx \left[ \nabla R(\hat\theta)+\epsilon \nabla L(z,\hat\theta)\right] + \left[ \nabla^2 R(\hat\theta)+\epsilon \nabla^2 L(z,\hat\theta)\right]\Delta_\epsilon.$$ Since $\hat\theta$ minimizes $R$ and the $o(\epsilon)$ terms can be omitted, we can solve for $\Delta_\epsilon$ as follows: $$\Delta_\epsilon\approx -\nabla^2 R(\hat\theta)^{-1} \nabla L(z,\hat\theta)\epsilon \Rightarrow \frac{d \Delta_\epsilon}{d \epsilon}\Bigg|_{\epsilon=0}=\frac{d \hat\theta_{\epsilon,z}}{d\epsilon}\Bigg|_{\epsilon=0}=-H_{\hat\theta}^{-1}\nabla_\theta L(z,\hat\theta) $$
      Therefore, $\mathcal{I}_{\text{removal,loss}}(z,z_{\text{test}}):=\frac{dL(z_\text{test},\hat\theta_{\epsilon,z})}{d\epsilon}\Bigg|_{\epsilon=0} =\frac{dL(z_\text{test},\hat\theta_{\epsilon,z})}{d\hat\theta_{\epsilon,z}}\frac{d \hat\theta_{\epsilon,z}}{d\epsilon}\Bigg|_{\epsilon=0}\approx-\nabla_\theta L(z_{\text{test}},\hat\theta)^\top H_{\hat\theta}^{-1}\nabla_\theta L(z,\hat\theta)$



Since removing one training sample can be understood as setting $\epsilon=-\frac{1}{n}$, we can predict the corresponding test loss difference by $-\frac{1}{n}\mathcal{I}_{\text{removal,loss}}(z, z_{\text{test}})$. By comparing the predicted test loss difference with the actual test loss difference obtained by leave-one-out retraining, we can verify the accuracy of the proposed influence scores, as shown in the figure below.
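For small convex models, the influence score above can be computed directly; the following JAX sketch illustrates the idea (for large models one would approximate $H^{-1}v$ with methods such as LiSSA or conjugate gradients rather than forming the Hessian explicitly).

```python
import jax
import jax.numpy as jnp

def influence_removal_loss(loss_fn, theta, z, z_test, X, Y):
    """I_{removal,loss}(z, z_test) = -g_test^T H^{-1} g_train.
    Assumes theta is a flat parameter vector and loss_fn(theta, x, y)
    is a per-sample loss; forming H is only viable for small models."""
    g_train = jax.grad(loss_fn)(theta, *z)
    g_test = jax.grad(loss_fn)(theta, *z_test)
    # Empirical risk over the full training set (X, Y).
    risk = lambda t: jnp.mean(jax.vmap(loss_fn, in_axes=(None, 0, 0))(t, X, Y))
    H = jax.hessian(risk)(theta)
    return -g_test @ jnp.linalg.solve(H, g_train)
```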

Based on their experiments, we can empirically say that the proposed influence function performs well on tasks which satisfy the underlying assumptions (twice-differentiability and strict convexity): in Fig(a) & Fig(b), under convex and convergent conditions (logistic regression model & L-BFGS algorithm), the predicted loss difference and actual loss difference align well with each other. However, in Fig(c), under non-convex conditions without convergence guarantees (CNN model & SGD algorithm), the influence function fails to produce a satisfying approximation.

Although influence functions seem to provide a good estimate of the importance of each training sample, the expensive computational cost of estimating the Hessian matrix and the instability in non-convex settings without convergence guarantees are major issues for this data attribution method.

Datamodels

Another branch of data attribution methods are sampling-based methods, such as the Datamodels work of Ilyas et al. Given a learning algorithm $\mathcal{A}$ and a fixed training dataset $S$ of $m$ data points, a model trained on $S$ with $\mathcal{A}$ defines a function that maps an input $z$ to $f_{\mathcal{A}}(z; S)$. This function $f$ can be complex in practice, and hence it is hard to understand how the training examples in $S$ contribute to the prediction on a specific target point. Therefore, the authors use a linear function $g_{w}$ as a simple surrogate model to learn the contribution of each training example to a target example.

How do we train such a linear surrogate function? Consider a fixed training dataset $S$, a learning algorithm $\mathcal{A}$, a target example $z$, and a distribution $D_{S}$ over subsets of $S$. Use $D_S$ to repeatedly sample subsets $S_{i}$, train a model on each $S_i$ using $\mathcal{A}$, and evaluate it on $z$ to get pairs:

      \[\{\Bigl(S_{1}, f_{\mathcal{A}}(z; S_{1})\Bigr),\cdot \cdot \cdot,\Bigl(S_{m}, f_{\mathcal{A}} (z; S_{m})\Bigr)\}\]

      A datamodel for a target example $z$ is a parametric function $g_w$ optimized to predict $f_{\mathcal{A}}(z; S_{i})$ from training subsets $S_{i}$, where $S_{i} \sim D_{S}$. The training objective is formulated as:

      \[g_{w}: \{0, 1\}^{|S|} \mapsto \mathbb{R}, \text{ where }\; w = \underset{\beta}{argmin} \;\frac{1}{m}\sum_{i = 1}^{m}\mathcal{L}\Bigl(g_{\beta}(S_{i}),\; f_{\mathcal{A}}(z; S_{i})\Bigr) + \lambda||\beta||_{1}\]

      \(g_{w}(S_{i}) = <w, \mathbb{1}_{S_{i}}>\);
      \(\mathcal{L}\bigl(g_{w}(S_{i}),\; f_{\mathcal{A}}(z; S_{i})\bigr) = \bigl(\;g_{w}(S_{i}) - f_{\mathcal{A}}(z; S_{i})\;\bigr)^2\);
      \(f_{\mathcal{A}}(z; S_{i}):= (\text{logit for correct class}) - (\text{highest incorrect logit})\)

One datamodel is specifically optimized to learn the data attribution of a fixed training dataset to a fixed but arbitrary example $z$. For a fixed sample of interest, we use $g_{w}$ to assign a learnable weight to each example in $S$. The sum of the weights of all training examples included in $S_{i}$ is trained to predict the model output on $z$. This is formulated as the dot product between a weight vector $w$ and an indicator vector whose entry $k$ indicates the presence of the $k^{th}$ training datapoint in $S_i$. Therefore, for a set of target examples, we can train a datamodel for each of them and obtain a collection of datamodels.
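Concretely, fitting one datamodel reduces to an $\ell_1$-regularized least-squares problem over subset indicator vectors. Below is a minimal sketch using scikit-learn's `Lasso` (our choice of solver for illustration, not necessarily the one used in the original work).

```python
from sklearn.linear_model import Lasso

def fit_datamodel(masks, outputs, lam):
    """masks: (m, |S|) binary rows, entry 1 if a training point is in S_i;
    outputs: (m,) model outputs f_A(z; S_i), i.e. correct-class margins.
    Returns the datamodel weight vector w for the target example z."""
    surrogate = Lasso(alpha=lam)
    surrogate.fit(masks, outputs)
    return surrogate.coef_
```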

      Caption: Linear datamodels accurately predict true margins averaged across 100 models. Source: Fig 5 in the paper “Datamodels: Predicting Predictions from Training Data”

      In their experiments using CIFAR-10, the authors reserved a specific subset of output pairs for evaluation. Here, $\alpha$ represents the subsampling fraction in relation to the training set size. For instance, in a training dataset with $|S| = 100$ data points, setting $\alpha = 0.2$ means each subset, $S_i$, comprises a fixed size of $|S_i| = 20$. They demonstrated that Datamodels effectively predict outcomes for unseen in-distribution test subsets. In the above plots, the bottom-right panel illustrates data for three color-coded random target examples, showing a strong Spearman correlation ($r > 0.99$) between predicted and actual outputs.

It is crucial to note that the displayed margins represent averages across 100 models trained on each $S_i$. This underscores a limitation of linear datamodeling: achieving stability demands training a sufficient number of models for each subset. The figures from the original paper involve averaging over 100 models. When the true model outputs aren't averaged across a significant number of models, it becomes apparent that the linearity degrades (see the figure below).

Despite the simplicity and accuracy of datamodels' predictions, training them for specific examples in large-scale scenarios poses challenges. Imagine training datamodels for a set of ImageNet target examples: it would require training numerous models from scratch on ImageNet's 1000-class training dataset. Ensuring stable prediction performance requires extensive computational resources, which is prohibitively expensive for modern foundation models.

      TRAK

Inspired by the Datamodels framework and motivated to circumvent its expensive training cost, in TRAK: Attributing Model Behavior at Scale, Ilyas et al. propose a new data attribution framework, Tracing with the Randomly-Projected After Kernel (TRAK).

First, in this paper the authors denote by $\tau(z, S_i)$ a data attribution method that assigns a real-valued score to each training input in $S_i$, indicating its importance to the model output $f_{\mathcal{A}}(z;S_i)$. The key idea of TRAK is to use a first-order Taylor expansion to approximate the trained model $\theta^{*}(S)$ of an algorithm on a given training dataset, and then use random projections to reduce the dimensionality of the gradients. Each time, we sample a training subset $S_i$ of size $\alpha \times |S|$ from $S$, train a model $\theta^{*}(S_i)$, and then use a random projection to project the high-dimensional gradient matrix at $\theta^{*}$ from $p$ to $k$ dimensions, where $k \ll p$. Ilyas et al. denote the projected gradients by $\phi$ and conclude that, using a training subset $S_i$, the TRAK attribution score for an example of interest $z$ is:

      \[\tau(z, S_i) := \phi_{i}(z)^{T}(\Phi_{i}^{T}\Phi_{i})^{-1}\Phi_{i}^{T}\mathbf{Q_{i}}\]

$i$: the index of a training subset;
$t$: the index of a training sample in $S$;
$\mathbf{Q}_{i}:=diag(1 - p_t^*) = diag(\{(1 + exp(y_t \cdot f(z;\theta^{*})))^{-1}\})$, where $p_t^*$ is the predicted correct-class probability at $\theta^{*}$;
$\mathbf{P}$: random projection matrix whose entries are sampled from a standard Gaussian distribution: $\mathbf{P}\sim \mathcal{N} (0, 1)^{p \times k}$ for $k \ll p$;

$\phi_{i}(z) = \mathbf{P}^T \nabla_{\theta} f(z;\theta^{*})$: projected gradient from the model $\theta^{*}(S_i)$ for the target sample $z$;
$\Phi_{i} = [\phi_1 \cdot\cdot\cdot \phi_{m}]$: stacked projected gradients for all training data $\{z_1,\ldots,z_m\}$;

Further, TRAK samples $N$ training subsets of fixed size factor $\alpha$, trains each of them independently, and ensembles over these $N$ models (a code sketch follows the symbol definitions below): \(\tau_{TRAK}(z, S) := \mathfrak{S}((\frac{1}{N} \sum_{i=1}^{N} \mathbf{Q}_{i}) \cdot (\frac{1}{N} \sum_{i=1}^{N} \phi_{i}(z)^{T}(\Phi_{i}^{T}\Phi_{i})^{-1}\Phi_{i}^{T}), \hat{\lambda})\)

      $\mathfrak{S}(\cdot; \lambda)$ is the soft thresholding operator;
      $N$: total number of training subsets;
      $m$: total number of training samples in $S$;
      $\hat{\lambda}$ is the soft thresholding parameter, and it’s selected via cross-validation
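To make the ensembling concrete, here is a hedged JAX sketch of the ensembled TRAK score for a single target $z$ (soft thresholding omitted; the array layouts are our own assumption).

```python
import jax.numpy as jnp

def trak_score(phi_z, Phi, Q):
    """phi_z: (N, k) projected gradients of the target z, one per model;
    Phi: (N, m, k) projected gradients of the m training points per model;
    Q: (N, m) diagonal entries of Q_i per model.
    Returns (m,) attribution scores, one per training sample."""
    N = phi_z.shape[0]
    # phi_i(z)^T (Phi_i^T Phi_i)^{-1} Phi_i^T for each model i.
    kernels = [phi_z[i] @ jnp.linalg.inv(Phi[i].T @ Phi[i]) @ Phi[i].T
               for i in range(N)]                       # each (m,)
    # Ensemble: average the kernel terms and the Q diagonals, multiply.
    return jnp.stack(kernels).mean(0) * Q.mean(0)
```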

      Show algorithm step by step

Before introducing the implementation steps, Ilyas et al. first use binary logistic regression as a case study to illustrate the benefits of computing data attribution scores when a classification learning algorithm can be framed as straightforward logistic regression. We consider a training set of $n$ samples: $$S = \{z_1,\cdot\cdot\cdot,z_n: z_t = (x_t \in \mathbb{R}^d, b_t \in \mathbb{R}, y_t \in \{-1, 1\}) \}$$ where
              $x_t$ is an input in $\mathbb{R}^d$;
              $y_t$ is the binary label;
              $b_t$ the bias term
Then the authors parametrize the learning algorithm with $\theta$ as the model parameters: $$\theta^{*}(S) := arg\; \underset{\theta}{min} \sum_{(x_t, y_t)\in S} log[1 + exp(-y_t \cdot (\theta^{T}x_t + b_t))]$$ Data attribution in the binary logistic regression setting can be computed using the one-step Newton approximation. Ilyas et al. present it as follows: $$\tau_{NS}(z, z_t) := \frac{x^{T}(X^{T}RX)^{-1}x_t}{1- x_{t}^{T}(X^{T}RX)^{-1}x_t \cdot p_{t}^{*}(1-p_{t}^{*})} \approx f(z;\theta^{*}(S)) - f(z;\theta^{*}(S \setminus z_t))$$ where
              $z$: target sample;
              $f(z;\theta) :=\theta^{T}x+b$;
              $z_t$: the $t^{th}$ training example, $z_t = (x_t, b_t, y_t)$;
              $X \in \mathbb{R}^{n \times d}$ stacking all input in one matrix $X$;
              $p_{t}^{*}:= (1 + exp(-y_t \cdot f(z_t; \theta^*)))^{-1}$
              $p_{t}^{*}$ is the predicted correct-class probability at $\theta^{*}$;
        $R$ is a diagonal $n \times n$ matrix with $R_{tt} = p_{t}^{*}\times (1-p_{t}^{*})$
Now that Ilyas et al. have introduced this method to calculate data attribution in the binary logistic regression setting, how can we leverage it effectively? The key insight is that, in a binary non-convex or multi-class classification setting, we can linearize the model function with its Taylor expansion centered around the final model parameters $\theta^*$. By selecting the output function as the raw logit of the classifier, this linear approximation allows us to treat the problem as binary logistic regression with gradients as inputs, leading to the TRAK algorithm.
In this paper, the TRAK algorithm consists of five steps:
1. Linearizing the model output function via a Taylor approximation, which reduces the model of interest to a linear function in parameter space. Consider $f(z;\theta)$ as a non-convex function; we approximate it with its Taylor expansion centered around $\theta^{*}$:
$$\hat{f}(z;\theta):= f(z;\theta^{*}) + \nabla_{\theta} \; f(z;\theta^{*})^{T}(\theta - \theta^{*})$$ $$\theta^{*}(S) \approx arg\; \underset{\theta}{min} \sum_{z_t \in S} log[1 + exp(-y_t \cdot ( \underbrace{\nabla_{\theta} \; f(z;\theta^{*})^{T}}_{inputs}\;\theta + b_t))]$$ where
        $f(z;\theta):=log(\frac{p(z;\theta)}{1 - p(z; \theta)})$
        $b_t = f(z;\theta^{*}) - \nabla_{\theta} \; f(z;\theta^{*})^{T} \theta^{*}$
2. Reducing the dimensionality of the linearized model using random projections. To preserve the model-relevant information, Ilyas et al. rely on the Johnson-Lindenstrauss lemma. We compute the gradient for each $z_i$ at $\theta^{*}$ and then project it to $k$ dimensions: $$\phi(z) = \mathbf{P}^{T} \nabla_{\theta}f(z;\theta^{*})$$ where
               $\mathbf{P}\sim \mathcal{N} (0, 1)^{p \times k}$ for $k \ll p$
3. Estimating influences by adapting the one-step Newton approximation.
      $$\tau(z, S) := \phi(z)^{T}(\Phi^{T}\Phi)^{-1}\Phi^{T}\mathbf{Q}$$ where
              $\mathbf{Q}:= diag(1 - p_{t}^*) = diag(\{(1 + exp(y_t \cdot f(z;\theta^{*})))^{-1}\})$;
              $\mathbf{Q} \in \mathbb{R}^{n \times n}$ where each diagonal is a one minus correct-class probability term.
      4. Ensembling over $N$ independently trained models. Each model is trained on a subset of the training set, $S_i \subset S$.
      $$\tau_{N}(z, S) := (\frac{1}{N} \sum_{i=1}^{N} \mathbf{Q}_{i}) \cdot (\frac{1}{N} \sum_{i=1}^{N} \phi_{i}(z)^{T}(\Phi_{i}^{T}\Phi_{i})^{-1}\Phi_{i}^{T})$$
      5. Inducing sparsity via soft-thresholding. $$\tau_{TRAK}(z, S) := \mathfrak{S}((\frac{1}{N} \sum_{i=1}^{N} \mathbf{Q}_{i}) \cdot (\frac{1}{N} \sum_{i=1}^{N} \phi_{i}(z)^{T}(\Phi_{i}^{T}\Phi_{i})^{-1}\Phi_{i}^{T}), \hat{\lambda})$$ where
              $\mathfrak{S}(\cdot; \lambda)$ is the soft thresholding operator;
              $\hat{\lambda}$ is the soft thresholding parameter, and it's selected via cross-validation



Caption: We trained 90 ResNet-9 models independently on 90 randomly selected subsets of $S$ with size factor 0.5, and then used TRAK to calculate influence scores for the CIFAR-10 test dataset. The two random samples shown illustrate the efficacy of TRAK: training images with high TRAK scores belong to the same category as the target image, while those with low TRAK scores belong to different categories.

      Ilyas et al. conducted a study utilizing TRAK to attribute various classifiers on datasets such as CIFAR-2, CIFAR-10, QNLI, and ImageNet. Their findings demonstrated that TRAK achieves superior accuracy while utilizing significantly fewer models.

In replicating the experiments detailed in Ilyas et al., we encountered a notable drawback: the TRAK algorithm is memory-expensive. It requires recording numerous model gradients for each test sample across models trained on different subsets, which is intractable for modern foundation models. Furthermore, our investigation unveiled only a limited linear correlation between TRAK scores and true model margins. This observation suggests that the margins predicted by TRAK do not serve as robust estimates of the model output, and its ability to predict model outputs is not on par with Datamodels.

      While TRAK offers an interpretable and computationally efficient way to analyze training data impact, its limitations cannot be overlooked. Further research is needed to propose better data attribution methods.

      How do we use it?

      Learning Algorithm Comparison

Data attribution methods estimate the importance of each training sample with respect to the model’s output. A natural idea arises: can we leverage data attribution methods to understand differences between learning algorithms based on how they weight the training data?

The paper ModelDiff: A Framework for Comparing Learning Algorithms develops this idea: use data attribution methods to uncover the “feature selection” differences of two learning algorithms. Specifically, the authors use data attribution methods to quantify the impact of each training sample on each test sample.

Therefore, we obtain an importance matrix $\Theta \in \mathbb{R}^{|train| \times |test|}$ for each learning algorithm applied to a specific task. We then apply matrix projection and PCA techniques to the importance matrices $\Theta$ to explore the distinguishing differences in how two algorithms use training samples. The detailed pipeline for comparing learning algorithms is depicted in the following figure.

      Source: Figure 2 in the paper “MODELDIFF: A Framework for Comparing Learning Algorithms”
In the figure above, the authors apply PCA to the residual importance matrix (after projection, the common importance allocation is removed). The training samples corresponding to the top-$k$ principal components (the directions that explain a significant amount of variance in one importance matrix but not the other) reflect the distinguishing subpopulations that one learning algorithm relies on but the other pays little attention to.
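The following NumPy sketch illustrates the general recipe; it is a simplification of the actual ModelDiff pipeline, and the projection and normalization details here are our own assumptions.

```python
import numpy as np

def distinguishing_directions(theta_a, theta_b, k=3):
    """theta_a, theta_b: (n_train, n_test) importance matrices of two
    learning algorithms. Returns the top-k principal directions of the
    part of theta_a not explained by theta_b."""
    # Project each column of theta_a onto the column space of theta_b
    # and keep the residual (removes the shared importance allocation).
    coef, *_ = np.linalg.lstsq(theta_b, theta_a, rcond=None)
    residual = theta_a - theta_b @ coef
    # PCA over training samples: top-k left singular vectors.
    u, s, _ = np.linalg.svd(residual, full_matrices=False)
    return u[:, :k], s[:k]
```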

By visually inspecting these distinguishing subpopulations, we can speculate about the semantic feature-selection differences between the two algorithms and then confirm them by applying the corresponding semantic feature transformations to test data and checking the difference in model outputs.

      Source: Figure 3 in the paper “MODELDIFF: A Framework for Comparing Learning Algorithms”
For example, in the figure above, the authors compared two models trained on the LIVING17 dataset, differing only in whether they were trained with or without standard data augmentations. By exploring the training-sample importance matrix using the method above, they speculated that the model trained with data augmentation prefers using “web” to predict the class “spider” and “yellow polka dots” to predict the class “salamander”. They therefore added “web” or “yellow polka dots” textures to test samples and found that only the predictions of the model with data augmentation changed substantially. This experiment confirms previous work showing that data augmentation enhances texture bias.

ModelDiff shows that data attribution methods can be key tools for understanding model behavior and distinguishing subtle differences between algorithms.

      Data Leakage Detection

Beyond comparing learning algorithms, we can also leverage the importance scores to find the training samples most relevant to a model prediction. By empirically inspecting training samples of different importance magnitudes, Harshay et al. find that training samples with large importance magnitude consistently look similar to the test sample, which follows the intuition that the training samples most similar to the test sample are most relevant to its prediction (see the first row of the figure).

Source: Figure 3 in the paper “MODELDIFF: A Framework for Comparing Learning Algorithms”

Source: From the randomly selected validation points provided by Ilyas et al., we found this data leakage example

We can leverage this phenomenon to identify train-test leakage in different benchmark datasets. For example, in the second row of the figure, Harshay et al. identified significant data leakage in the CIFAR-10 dataset. Extending this data leakage detection technique to other datasets holds the potential to assist the ML community in curating datasets, thereby enhancing overall data quality.

      Prediction Brittleness Examination

We can also use data attribution methods to identify brittle predictions (i.e. model outputs that are brittle to the removal of a few training samples) and estimate data counterfactuals (i.e. the causal effect of removing a set of training samples on model outputs).

Specifically, we can leverage the sample importance scores to find the smallest training subset (the support set) whose removal flips the model prediction. By calculating the support set size for each test sample, we can quantify the brittleness of the model output with respect to that input.
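Under a linear datamodel, removing a subset $R$ changes the predicted margin by $-\sum_{i\in R} w_i$, so a greedy estimate of the support set size is straightforward. The sketch below is a simplified, hypothetical version of this procedure.

```python
import numpy as np

def support_set_size(w, margin):
    """w: (|S|,) datamodel weights toward one test example;
    margin: the model's (positive) correct-class margin on it.
    Greedily remove the most helpful samples until the predicted
    margin flips sign; returns None if it never flips."""
    removed = np.cumsum(np.sort(w)[::-1])   # largest weights first
    flipped = np.nonzero(margin - removed < 0)[0]
    return int(flipped[0]) + 1 if flipped.size else None
```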

Source: Fig 8 in the paper “Datamodels: Predicting Predictions from Training Data”

Another application is data counterfactual estimation. As illustrated in the figure above, after removing a training subset, the observed changes in the actual model logits closely align with the changes predicted by data attribution methods.
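In the datamodels framework this counterfactual is predicted linearly from the importance scores; schematically,

\[f_{\mathcal{D}}(x) - f_{\mathcal{D} \setminus S}(x) \approx \sum_{i \in S} \tau_i(x),\]

where \(S\) is the removed training subset and \(\tau_i(x)\) is the importance score of training sample \(i\) for test input \(x\).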

These experiments demonstrate that data attribution methods can serve as efficient and convincing tools for investigating the sensitivity and robustness of learning algorithms.

      Conclusion

Data attribution methods give an interesting answer to a natural question in deep learning: how does each training sample contribute to the model’s prediction? These methods quantitatively measure the importance of each training sample with respect to the model’s output. Their versatility extends across diverse applications, such as understanding learning algorithm behavior, checking data quality, and analyzing model robustness.

Future work can focus on leveraging data attribution methods for dataset curation and model refinement. Investigating the scalability of these methods to larger datasets and different tasks also remains a promising direction for enhancing their practical utility.

      For attribution in academic contexts, please cite this work as
      +        PLACEHOLDER FOR ACADEMIC ATTRIBUTION
      +  
      BibTeX citation
      +        PLACEHOLDER FOR BIBTEX
      +  
      \ No newline at end of file diff --git a/blog/update-frequency-in-mbrl/index.html b/blog/update-frequency-in-mbrl/index.html new file mode 100644 index 00000000..3ae321ec --- /dev/null +++ b/blog/update-frequency-in-mbrl/index.html @@ -0,0 +1,269 @@ + Fair Model-Based Reinforcement Learning Comparisons with Explicit and Consistent Update Frequency | ICLR Blogposts 2024

      Fair Model-Based Reinforcement Learning Comparisons with Explicit and Consistent Update Frequency

      Implicit update frequencies can introduce ambiguity in the interpretation of model-based reinforcement learning benchmarks, obscuring the real objective of the evaluation. While the update frequency can sometimes be optimized to improve performance, real-world applications often impose constraints, allowing updates only between deployments on the actual system. This blog post emphasizes the need for evaluations using consistent update frequencies across different algorithms to provide researchers and practitioners with clearer comparisons under realistic constraints.

      Introduction

In reinforcement learning, an agent learns to make decisions by interacting with an environment, receiving feedback, in the form of a reward, after each action it takes to move from one state of the environment to another. The objective is to learn a policy, a mapping from states to actions, that maximizes the expected cumulative reward over successive interactions.

There are two main approaches to designing a reinforcement learning algorithm: model-based and model-free. Model-based reinforcement learning (MBRL) algorithms first learn a model of the environment dynamics which, given a state of the environment and an action, predicts the next state. This model can then be used in place of the real environment to learn or decide how to act. Model-free algorithms skip this step and directly try to learn a policy. Because MBRL algorithms can rely on the learned dynamics model instead of the real environment, they are known to be more sample-efficient than model-free algorithms (see for instance or ). MBRL is thus a good choice when interactions with the environment are limited, which is often the case in real applications such as controlling engineering systems.

We discuss here one of the design choices of MBRL algorithms: the update frequency of the agent. As shown in the figure below (inspired by Figure 1 in ), the frequency at which algorithms update their agent varies widely: some algorithms update their agent after each step on the real system, while others update after thousands of steps. At the far end of the spectrum, the pure offline setting considers only a single training of the agent from an initial dataset. (Similar differences in update frequency exist in the model-free literature, but we focus here on model-based algorithms.)

The update frequency is often viewed as yet another hyperparameter of the complex MBRL pipeline. In practice, however, the update frequency may be imposed by real-life deployment constraints, which motivates the discussion in this blog post. It is often the case that, for safety reasons, system engineers agree to run a new agent on their system for a given period of time but prefer the agent to remain fixed during this deployment. System engineers can then investigate the fixed solution before deciding to deploy it, knowing that it will not change during the deployment. It also happens that the system on which the agent is deployed lacks the computational resources required to support agent updates. Such real-life constraints can thus rule out state-of-the-art MBRL algorithms that need to update their agent very frequently to perform well.

      Given the importance of the update frequency in real-life applications, this blog post advocates for:

      • explicitly specifying the update frequency employed by each algorithm in a benchmark, as this remains implicit and hard to find in many existing benchmarks,
      • conducting additional experiments that compare algorithms under a given update frequency, mirroring the constraints often encountered in real-life applications, and
      • performing more ablation studies on update frequency, evaluating its impact on algorithm performance.

For the rest of this blog post, we define a deployment as a data collection campaign realized with a fixed agent. Agents are thus updated between two consecutive deployments but not within a deployment. The update frequency is the number of steps realized during each deployment (which we assume is the same for all deployments). We use the term agent to refer to all the components of the model-based algorithm that are used to act on the system. For instance, in a Dyna-style algorithm, where a model-free algorithm is applied to the model instead of the real system, the agent refers to both the dynamics model and the policy learned with the model-free algorithm.
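In pseudocode, the deployment-constrained loop we have in mind looks like this (a minimal sketch assuming a Gym-style interface; agent.act and agent.update are hypothetical):

def run_deployments(env, agent, n_deployments, update_frequency):
    dataset = []
    for _ in range(n_deployments):
        # Deployment: the agent stays fixed while collecting update_frequency steps.
        obs = env.reset()
        for _ in range(update_frequency):
            action = agent.act(obs)          # no learning inside a deployment
            next_obs, reward, done, _ = env.step(action)
            dataset.append((obs, action, reward, next_obs, done))
            obs = env.reset() if done else next_obs
        # The agent (dynamics model and policy) is updated only between deployments.
        agent.update(dataset)
    return agent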

      We begin by introducing three popular MBRL algorithms (MBPO, PETS and BREMEN) as we will often refer to them to illustrate our arguments.

The following table gives an overview of the update frequencies of the three algorithms discussed below and a few others. This table is not meant to be an exhaustive list of MBRL algorithms, but rather to give an idea of the different training schedules used in the literature.

Algorithm   Agent update frequency   Policy update frequency   Model update frequency
MBPO        1 step                   1 step                    250 steps
PETS        Task horizon             No policy                 Task horizon
PILCO       Task horizon             Task horizon              Task horizon
BREMEN      100k or 200k steps       100k or 200k steps        100k or 200k steps
ME-TRPO     3k or 6k steps           3k or 6k steps            3k or 6k steps

      MBPO

Model-Based Policy Optimization (MBPO) Original code available at https://github.com/jannerm/mbpo is one of the most well-known model-based algorithms. The algorithm trains an ensemble of probabilistic neural networks for the dynamics model and trains a model-free agent, Soft Actor-Critic (SAC), using short rollouts on the model to avoid error accumulation. The agent is updated at each step: the model is updated every 250 steps, but the SAC policy is updated at every step. This highly frequent update schedule rules out MBPO even for small deployments on real systems.

      PETS

Probabilistic Ensembles with Trajectory Sampling (PETS) Original code available at https://github.com/kchua/handful-of-trials is another popular model-based algorithm, known for its use of an ensemble of probabilistic neural networks for the dynamics model (MBPO uses the dynamics model introduced by PETS). PETS relies on the learned model and the Cross-Entropy Method to search for the best action sequence at decision time. It therefore does not have to learn (or update) a policy, as MBPO does with SAC; the only component that needs learning is the dynamics model. Unlike MBPO, the dynamics model is updated only at the end of each episode (usually 1000 steps).
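A compact sketch of such decision-time planning with the Cross-Entropy Method (simplified: PETS actually propagates particles through a probabilistic ensemble; model.rollout_return is a hypothetical stand-in for that evaluation):

import numpy as np

def cem_plan(model, state, horizon=25, pop=400, n_elites=40, iters=5, act_dim=6):
    mu, sigma = np.zeros((horizon, act_dim)), np.ones((horizon, act_dim))
    for _ in range(iters):
        # Sample candidate action sequences from the current distribution.
        seqs = mu + sigma * np.random.randn(pop, horizon, act_dim)
        # Evaluate each sequence on the learned dynamics model.
        returns = np.array([model.rollout_return(state, seq) for seq in seqs])
        elites = seqs[np.argsort(returns)[-n_elites:]]
        # Refit the sampling distribution to the elite sequences.
        mu, sigma = elites.mean(axis=0), elites.std(axis=0)
    return mu[0]  # execute only the first action (MPC style)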

      BREMEN

Behavior-Regularized Model-ENsemble (BREMEN) Original code available at https://github.com/matsuolab/BREMEN considers the setting where only a few deployments (between 5 and 10) are possible on the real system, but where large datasets can be collected at each deployment (the authors assume 100,000 or 200,000 transitions per deployment, far more than a single episode, which is usually on the order of 1000 transitions). The algorithm relies on an ensemble of deterministic dynamics models and a policy learned on the model, Dyna-style. It updates the policy and the model only between two consecutive deployments. The update frequency is very clear here, as it is motivated by real-life applications where deployments are limited. In this algorithm it is therefore not a hyperparameter that can be tuned for better performance, but rather a parameter imposed by the application. One of the goals of this blog post is to emphasize and develop this idea of a constrained update frequency.

      We now detail the main arguments of our blog post: making the update frequency more accessible, designing benchmarks with fixed update frequencies and running ablation studies on the update frequency.

      Making the update frequency more accessible

Experiments reported in popular papers do not always make explicit the update frequencies used for each of the algorithms they run. When nothing is said, it is likely that the benchmarks use the original implementation of each algorithm, shared by its authors in the best case. For instance, the MBPO paper does not mention the update frequencies the authors used in their experiments. The update frequency of MBPO can be found in the code shared by the authors, but it is harder to find the update frequency used for PETS. We thus assume they used the original PETS update frequency, which updates the agent at the end of each episode. We also looked at one of the most exhaustive benchmarks of MBRL algorithms. Nothing is said in the paper about the update frequency, and a careful investigation of the code provided by the authors is required (more on this later).

The difficulty of knowing the update frequencies used in benchmarks makes it harder for researchers and practitioners to take this parameter into account when assessing the performance of the algorithms and whether they would be good candidates for real-life applications. It also demands much more investigation from readers to figure out what the authors used.

MBRL algorithms have an order of magnitude more meaningful hyperparameters than supervised models, and managing and reporting on them usually falls outside the scope of research papers. The practice of sharing code alleviates this issue somewhat, and should be saluted, since we can always dig up the parameters in the code. Ideally, however, choices that drastically change the performance of the algorithms should be made as explicit as possible in the research papers and the ablation studies.

      Comparisons with fixed update frequency

We want to make the community aware of the importance of the update frequency when comparing algorithms and designing benchmarks. Running benchmarks without any constraints allows a different update frequency for each algorithm, and we believe such benchmarks are valuable for the community. However, it would also be very informative to have benchmarks with comparable update frequencies across algorithms. This would, for instance, help identify the best algorithms for real applications with constraints on the update frequency.

Coming back to the experiments in the MBPO paper: as the default MBPO implementation updates the model every 250 steps, it might also make sense to allow PETS to be updated every 250 steps to obtain comparable results. We also note that the MBRL-Lib paper compares the MBRL-Lib implementations of PETS and MBPO with their respective original update frequencies. We do not think this has a big impact for these two algorithms, but it would be fairer to use the same update frequency. Finally, looking at the code of the MBRL benchmark by , it is not clear whether the same update frequency is used for all the algorithms of the benchmark. For instance, the update frequency on Acrobot seems to be 3000 for RS (time_step_per_batch in https://github.com/WilsonWangTHU/mbbl/blob/master/scripts/exp_1_performance_curve/rs.sh) but 5000 for ME-TRPO (num_path_onpol $\times$ env_horizon in https://github.com/WilsonWangTHU/mbbl-metrpo/blob/master/configs/params_acrobot.json).

The BREMEN paper includes a benchmark comparing different algorithms under fixed update frequencies, which gives valuable insights into the performance of existing algorithms under these deployment constraints. The next step would be to evaluate performance with different numbers of deployments and different numbers of steps per deployment, which we argue for in the next section.

      Ablation studies

Comparisons across different update frequencies are very rare in existing benchmarks and papers. Even without real-life constraints, it would be valuable to know how sensitive the performance of a given algorithm is to the update frequency. The issue for authors is that the same could be asked of many other hyperparameters, each representing additional computational budget and time. Still, while we often find ablations on the number of models (if the model is an ensemble), the rollout length, or the number of gradient updates for the model-free policy, ablations on the update frequency are very rare. It is likely that agents that do well with small deployments would do poorly with large deployments, a setting closer to the pure offline setting (for the same total budget of real-system interactions). We perform such an ablation study on MBPO in the next section, showing that MBPO’s performance degrades with larger update frequencies.

      Varying the update frequency in MBPO

Using the MBPO implementation and the examples provided by MBRL-Lib, we ran MBPO on Gym-Halfcheetah-v4, Gym-Hopper-v4 and Gym-Walker2d-v4 with different update frequencies: updating the agent at each step (the default implementation described above), every 1000 steps, every 5000 steps, and every 10,000 steps. Each curve shows the mean episode return obtained with at least 10 seeds. We did not run Hopper and Walker with an update frequency of 10,000 steps, as the performance obtained with 5000 was already poor. The lightly shaded areas indicate the 95% bootstrap confidence interval.

Except for the update frequency of 1000 steps on Halfcheetah and Walker, which achieves performance similar to the default configuration of updating the agent at each step, the results indicate a decline in asymptotic performance with larger update frequencies. Although MBPO exhibits good performance across environments at the default update frequency, this is not the case for the other update frequencies considered here. We note that 1000 steps is the usual maximum episode length and therefore a reasonable value to try for the update frequency. One insight from this experiment is that even though MBPO is one of the state-of-the-art MBRL algorithms, practical constraints like the update frequency can degrade its performance in real-world applications.

When trying these update frequencies, we adjusted the number of gradient steps to maintain a constant ratio of gradient steps per step on the real system. For the maximum buffer size of SAC, we used the rule provided in MBPO’s code; the table below shows the resulting values. As shown in the figure below, using a smaller buffer size negatively impacts performance for the update frequencies of 1000 and 10,000 steps. While better values for the other hyperparameters might exist, we adapted them in what seemed the most natural way when increasing the update frequency. See the Appendix for the complete description of the hyperparameters used in these experiments.

Agent update frequency   Model update frequency   Policy update frequency   Max SAC buffer size
default (1 step)         250 steps                1 step                    400,000
1000 steps               1000 steps               1000 steps                400,000
5000 steps               5000 steps               5000 steps                2,000,000
10,000 steps             10,000 steps             10,000 steps              4,000,000
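As a concrete check of the constant-ratio rule against the appendix configs (the per-step ratios implied by those configs are 10 SAC gradient steps per environment step for Halfcheetah, 40 for Hopper, and 20 for Walker):

# SAC gradient steps per deployment = (gradient steps per env step) * update frequency.
def sac_updates_per_deployment(update_frequency, grad_steps_per_env_step):
    return update_frequency * grad_steps_per_env_step

assert sac_updates_per_deployment(1_000, 10) == 10_000    # Halfcheetah config below
assert sac_updates_per_deployment(5_000, 40) == 200_000   # Hopper config below
assert sac_updates_per_deployment(5_000, 20) == 100_000   # Walker config below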

      Conclusion

The goal of this blog post is to shed light on a frequently overlooked hyperparameter in MBRL: the update frequency. Despite its importance for real-life applications, this parameter is rarely discussed or analyzed. We emphasize the importance of running more evaluations with consistent update frequencies across algorithms, and of running more ablation studies; for instance, we show how the update frequency impacts the performance of MBPO. Beyond the update frequency, several other hyperparameters deserve more attention when benchmarking MBRL algorithms. A typical example is continual training (of the model and/or policy) versus retraining from scratch (related to the primacy bias studied in previous work). We believe this blog post offers valuable insights to researchers, pointing to directions worth investigating to explain the differences between MBRL algorithms and whether these differences really affect the existing comparisons.

      Appendix

      We provide here the configuration files we used to run the different experiments.

      Halfcheetah

      • Update frequency of 1000 steps
      # @package _group_
      +env: "gym___HalfCheetah-v4"
      +term_fn: "no_termination"
      +
      +num_steps: 400000
      +epoch_length: 1000
      +num_elites: 5
      +patience: 5
      +model_lr: 0.001
      +model_wd: 0.00001
      +model_batch_size: 256
      +validation_ratio: 0.2
      +freq_train_model: 1000
      +effective_model_rollouts_per_step: 400
      +rollout_schedule: [20, 150, 1, 1]
      +num_sac_updates_per_step: 10000
      +sac_updates_every_steps: 1000
      +num_epochs_to_retain_sac_buffer: 1
      +
      +sac_gamma: 0.99
      +sac_tau: 0.005
      +sac_alpha: 0.2
      +sac_policy: "Gaussian"
      +sac_target_update_interval: 1
      +sac_automatic_entropy_tuning: true
      +sac_target_entropy: -1
      +sac_hidden_size: 512
      +sac_lr: 0.0003
      +sac_batch_size: 256
      +
      • Update frequency of 5000 steps
      # @package _group_
      +env: "gym___HalfCheetah-v4"
      +term_fn: "no_termination"
      +
      +num_steps: 400000
      +epoch_length: 5000
      +num_elites: 5
      +patience: 5
      +model_lr: 0.001
      +model_wd: 0.00001
      +model_batch_size: 256
      +validation_ratio: 0.2
      +freq_train_model: 5000
      +effective_model_rollouts_per_step: 400
      +rollout_schedule: [20, 150, 1, 1]
      +num_sac_updates_per_step: 50000
      +sac_updates_every_steps: 5000
      +num_epochs_to_retain_sac_buffer: 1
      +
      +sac_gamma: 0.99
      +sac_tau: 0.005
      +sac_alpha: 0.2
      +sac_policy: "Gaussian"
      +sac_target_update_interval: 1
      +sac_automatic_entropy_tuning: true
      +sac_target_entropy: -1
      +sac_hidden_size: 512
      +sac_lr: 0.0003
      +sac_batch_size: 256
      +
      • Update frequency of 10000 steps
      # @package _group_
      +env: "gym___HalfCheetah-v4"
      +term_fn: "no_termination"
      +
      +num_steps: 400000
      +epoch_length: 10000
      +num_elites: 5
      +patience: 5
      +model_lr: 0.001
      +model_wd: 0.00001
      +model_batch_size: 256
      +validation_ratio: 0.2
      +freq_train_model: 10000
      +effective_model_rollouts_per_step: 400
      +rollout_schedule: [20, 150, 1, 1]
      +num_sac_updates_per_step: 100000
      +sac_updates_every_steps: 10000
      +num_epochs_to_retain_sac_buffer: 1
      +
      +sac_gamma: 0.99
      +sac_tau: 0.005
      +sac_alpha: 0.2
      +sac_policy: "Gaussian"
      +sac_target_update_interval: 1
      +sac_automatic_entropy_tuning: true
      +sac_target_entropy: -1
      +sac_hidden_size: 512
      +sac_lr: 0.0003
      +sac_batch_size: 256
      +

      Hopper

      • Update frequency of 1000 steps
      # @package _group_
      +env: "gym___Hopper-v4"
      +term_fn: "hopper"
      +
      +num_steps: 125000
      +epoch_length: 1000
      +num_elites: 5
      +patience: 5
      +model_lr: 0.001
      +model_wd: 0.00001
      +model_batch_size: 256
      +validation_ratio: 0.2
      +freq_train_model: 1000
      +effective_model_rollouts_per_step: 400
      +rollout_schedule: [20, 150, 1, 15]
      +num_sac_updates_per_step: 40_000
      +sac_updates_every_steps: 1000
      +num_epochs_to_retain_sac_buffer: 1
      +
      +sac_gamma: 0.99
      +sac_tau: 0.005
      +sac_alpha: 0.2
      +sac_policy: "Gaussian"
      +sac_target_update_interval: 4
      +sac_automatic_entropy_tuning: false
      +sac_target_entropy: 1 # ignored, since entropy tuning is false
      +sac_hidden_size: 512
      +sac_lr: 0.0003
      +sac_batch_size: 256
      +
      • Update frequency of 5000 steps
      # @package _group_
      +env: "gym___Hopper-v4"
      +term_fn: "hopper"
      +
      +num_steps: 125000
      +epoch_length: 1000
      +num_elites: 5
      +patience: 5
      +model_lr: 0.001
      +model_wd: 0.00001
      +model_batch_size: 256
      +validation_ratio: 0.2
      +freq_train_model: 5000
      +effective_model_rollouts_per_step: 400
      +rollout_schedule: [20, 150, 1, 15]
      +num_sac_updates_per_step: 200000
      +sac_updates_every_steps: 5000
      +num_epochs_to_retain_sac_buffer: 1
      +
      +sac_gamma: 0.99
      +sac_tau: 0.005
      +sac_alpha: 0.2
      +sac_policy: "Gaussian"
      +sac_target_update_interval: 4
      +sac_automatic_entropy_tuning: false
      +sac_target_entropy: 1 # ignored, since entropy tuning is false
      +sac_hidden_size: 512
      +sac_lr: 0.0003
      +sac_batch_size: 256
      +

      Walker

      • Update frequency of 1000 steps
      # @package _group_
      +env: "gym___Walker2d-v4"
      +term_fn: "walker2d"
      +
      +num_steps: 300000
      +epoch_length: 1000
      +num_elites: 5
      +patience: 10
      +model_lr: 0.001
      +model_wd: 0.00001
      +model_batch_size: 256
      +validation_ratio: 0.2
      +freq_train_model: 1000
      +effective_model_rollouts_per_step: 400
      +rollout_schedule: [20, 150, 1, 1]
      +num_sac_updates_per_step: 20000
      +sac_updates_every_steps: 1000
      +num_epochs_to_retain_sac_buffer: 1
      +
      +sac_gamma: 0.99
      +sac_tau: 0.005
      +sac_alpha: 0.2
      +sac_policy: "Gaussian"
      +sac_target_update_interval: 4
      +sac_automatic_entropy_tuning: false
      +sac_target_entropy: -1 # ignored, since entropy tuning is false
      +sac_hidden_size: 1024
      +sac_lr: 0.0001
      +sac_batch_size: 256
      +
• Update frequency of 5000 steps (we only used a maximum buffer size of 1 million to limit the memory usage of this experiment)
      # @package _group_
      +env: "gym___Walker2d-v4"
      +term_fn: "walker2d"
      +
      +num_steps: 300000
      +epoch_length: 1000
      +num_elites: 5
      +patience: 10
      +model_lr: 0.001
      +model_wd: 0.00001
      +model_batch_size: 256
      +validation_ratio: 0.2
      +freq_train_model: 5000
      +effective_model_rollouts_per_step: 200
      +rollout_schedule: [20, 150, 1, 1]
      +num_sac_updates_per_step: 100000
      +sac_updates_every_steps: 5000
      +num_epochs_to_retain_sac_buffer: 1
      +
      +sac_gamma: 0.99
      +sac_tau: 0.005
      +sac_alpha: 0.2
      +sac_policy: "Gaussian"
      +sac_target_update_interval: 4
      +sac_automatic_entropy_tuning: false
      +sac_target_entropy: -1 # ignored, since entropy tuning is false
      +sac_hidden_size: 1024
      +sac_lr: 0.0001
      +sac_batch_size: 256
      +
      For attribution in academic contexts, please cite this work as
      +        PLACEHOLDER FOR ACADEMIC ATTRIBUTION
      +  
      BibTeX citation
      +        PLACEHOLDER FOR BIBTEX
      +  
      \ No newline at end of file diff --git a/blog/what-exactly-has-tabpfn-learned-to-do/index.html b/blog/what-exactly-has-tabpfn-learned-to-do/index.html new file mode 100644 index 00000000..626978bf --- /dev/null +++ b/blog/what-exactly-has-tabpfn-learned-to-do/index.html @@ -0,0 +1,36 @@ + What exactly has TabPFN learned to do? | ICLR Blogposts 2024

      What exactly has TabPFN learned to do?

TabPFN [Hollmann et al., 2023], a Transformer model pretrained to perform in-context learning on fresh tabular classification problems, was presented at the last ICLR conference. To better understand its behavior, we treat it as a black-box function approximator generator and observe the function approximations it generates on a varied selection of training datasets. Exploring its learned inductive biases in this manner, we observe behavior that is by turns brilliant and baffling. We conclude this post with thoughts on how these results might inform the development, evaluation, and application of prior-data fitted networks (PFNs) in the future.

      Introduction

TabPFN is a deep learning model pretrained to perform in-context learning for tabular classification. Since its introduction, it has attracted attention both for its high predictive performance on small-dataset benchmarks and for its unique meta-learning approach. This approach, which builds upon earlier work on prior-data fitted networks (PFNs), requires only synthetically generated data: structural causal models (SCMs) are randomly generated, then training datasets are sampled from each SCM. On fresh classification tasks, no training (i.e., weight updating) is needed; instead, the training data is given as context to TabPFN, a Transformer model with self-attention among training samples and cross-attention from test samples to training samples. TabPFN can optionally be used with ensembling, wherein the forward pass is repeated with random permutations of features and class labels, and with a power transformation applied to random subsets of features. Subsequent works have reproduced its classification performance on other tabular benchmarks, and analyzed its theoretical foundations.
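Schematically, one PFN pretraining step looks like the following (our own sketch of the idea, not the authors' code; sample_scm and the transformer's context/query interface are hypothetical):

import torch
import torch.nn.functional as F

def pfn_pretraining_step(transformer, sample_scm, optimizer, n=512):
    scm = sample_scm()                  # draw a random structural causal model
    X, y = scm.sample(n)                # sample a synthetic tabular dataset from it
    cut = n // 2                        # split into in-context "train" and queries
    logits = transformer(context=(X[:cut], y[:cut]), queries=X[cut:])
    loss = F.cross_entropy(logits, y[cut:])   # predict the held-out labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()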

      At the same time, TabPFN has received criticism from within the applied ML community, around concerns that its “one large neural network is all you need” approach is fundamentally flawed and that its performance on public benchmarks may be due to overfitting.

In this article, we will attempt to demystify TabPFN’s behavior in order to move towards a resolution of these questions. With this goal, we will take a different tack from previous works in analyzing TabPFN: we will neither theoretically analyze its meta-learning pre-training approach, nor run it on yet another dataset-of-datasets, nor even mechanistically interpret the meaning of specific model weights or subnetworks.

Instead, we will first explore its holistic behavior in two simple settings, in order to develop an intuition about TabPFN as a function approximation generator. This is motivated by the observation that TabPFN, once fitted on fresh training data (even though “fitting” is merely storing the training data), is not mathematically different from any other fitted model: it is simply a function \(f_{\mathcal{D}, \theta}: x \rightarrow y\) from test input \(x\) to prediction \(y\), where \(\mathcal{D} = (X_{\textrm{train}}, y_{\textrm{train}})\) is the training data and \(\theta\) are the TabPFN model weights. By plotting \(f\) for various case studies of \((X_{\textrm{train}}, y_{\textrm{train}})\), we aim to better understand what statistical knowledge is represented in the model parameters \(\theta\).

      Next, we will evaluate TabPFN on two non-standard tabular ML classification tasks, comparing its performance with other methods. These atypical tasks can be thought of as out-of-distribution relative to the synthetic pretraining datasets upon which TabPFN was pretrained. This analysis will help indicate whether TabPFN was overfit to the statistical peculiarities of publicly-available small tabular datasets, or whether it has learned generalizable principles that lead to sensible behavior even in out-of-domain settings.

      1d binary classification

      We begin by examining the case of binary classification with 1d inputs. To better illustrate the inductive biases of the base TabPFN model, we do not use ensembling in this section unless otherwise indicated.

Below, we show the predictions for two training samples located at +1 and -1, labeled green and red, respectively. We see that the predicted probabilities are not monotonic, as they would be with a sigmoid function; not only does the model have higher uncertainty on the far sides of the training points, but between them there is a small wiggle. We also see that the decision boundary is biased below 0.5; this is likely because TabPFN has learned that features tend to have right-skewed distributions.
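This probe is easy to reproduce; a sketch assuming the public tabpfn package's scikit-learn-style interface (the N_ensemble_configurations argument name for disabling ensembling is our assumption):

import numpy as np
from tabpfn import TabPFNClassifier

# Two training samples: +1 (green, class 1) and -1 (red, class 0).
X_train = np.array([[1.0], [-1.0]])
y_train = np.array([1, 0])

clf = TabPFNClassifier(N_ensemble_configurations=1)  # no ensembling (assumed arg)
clf.fit(X_train, y_train)        # "fitting" merely stores the training data

X_test = np.linspace(-4.0, 4.0, 400).reshape(-1, 1)
p_green = clf.predict_proba(X_test)[:, 1]   # the curve plotted in the figure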

      These wiggles and asymmetry more-or-less disappear once we incorporate ensembling, shown below. However, the general shape of the predicted probability function is similar regardless of the number of ensembles.

      TabPFN predicted probabilities for test data, in red and green, for varying number of ensembles. Also shown are the predicted probabilities from using inverse-square-root of Euclidean distance within softmax, in orange and lime-green.

The above results raise the question of what parametric attention function TabPFN might have learned. No simple dot-product-based or Euclidean-distance-based function (used within the softmax operation) exactly recapitulated the observed predicted probabilities. However, the general shape of the inverse square root of Euclidean distance matched reasonably well, particularly between the two training points. Still, it appears that TabPFN has meta-learned an attention function that outperforms previously known attention functions on small datasets.
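For reference, the inverse-square-root curves in the figure correspond to a predictor of roughly the following form (our reconstruction; the inverse temperature \(\beta\) is a free parameter):

\[p(\text{green} \mid x) = \frac{\exp\big(\beta \, |x - 1|^{-1/2}\big)}{\exp\big(\beta \, |x - 1|^{-1/2}\big) + \exp\big(\beta \, |x + 1|^{-1/2}\big)}\]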

      Next, we look at the effect of duplicating features. We tried repeating the +1 and -1 inputs for a total of 1, 4, 16, and 64 copies, as shown below. The effect is to push the predicted probabilities away from 0.5, although we observe diminishing marginal effects as the number of repeats increases.

      Meanwhile, there is no discernible effect from replicating samples, when both red and green samples are replicated. Below we show the predicted probabilities, when both red and green samples are each copied for a total of 1, 4, 16, and 64 times.

In contrast, repeating only the red sample does have an impact, as shown below. While this unsurprisingly increases the probability of red for \(X < 0\), it bizarrely increases the probability of green for \(X > 0\). This is especially strange because repeating green samples in the previous setting did not have the same effect. This behavior of TabPFN seems suboptimal; it remains to be seen whether it was optimal for the pretraining data, or whether it is some kind of artifact of TabPFN’s architecture or training.

      Finally, we were unable to find evidence that TabPFN is able to detect periodic patterns in the training data, as exemplified for three different training patterns shown below. This behavior of TabPFN suggests that it does not support either periodic interpolation or extrapolation. Furthermore, we observe that as the number of observed cycles in the data increases, the predicted probabilities trend toward 0.5, which also seems suboptimal. We also notice that there is marked left-right asymmetry in these settings.

      2d multiclass classification

Here, we examine the behavior of TabPFN on 2d input data, on problems with as many samples as classes. Below we show results for both randomly-spaced and grid-spaced inputs, and for both the ensembling and no-ensembling settings of TabPFN. In each plot, we show the training data, their corresponding Voronoi diagrams, and finally the model predictions for the test inputs. We see that, without ensembling, TabPFN performs quite poorly, partitioning the input space in a nonsensical manner. The results markedly improve when we use 32 ensembles. Particularly for the randomly-spaced training points, the model predictions clearly resemble the Voronoi diagram, suggesting that (ensembled) TabPFN has meta-learned to perform 1-nearest-neighbor classification in the setting where each class has a single training sample.

      On the other hand, that this behavior relies upon ensembling suggests that the base TabPFN model could be further improved. In the original paper, Hollmann et al. express the hope that a future better version of TabPFN would not need to rely upon ensembling for permutation invariance, by having internalized that behavior through better architecture and training. The aforementioned observed behavior suggests that ensembling improves performance not only by (approximately) enforcing permutation invariance, but also by producing lower variance estimators; if so, the base model could also be trained to do the latter directly.

      TabPFN predictions on randomly-spaced points (left) and grid-spaced points (right). The training points are depicted as $\times$s. The yellow lines depict the Voronoi diagram of the training points. The test points are colored by TabPFN's predictions, using the same color scheme as the training points. We see that, without ensembling, TabPFN's predicted classes do not form contiguous regions over the input space.

      Cancer status classification from high-dimensional gene expressions

      We now turn towards a comparison of TabPFN with logistic regression (LR), support vector classification (SVC), and XGBoost on the BladderBatch cancer status classification task. The bladderbatch dataset consists of 57 samples, 22,283 gene expression features, and 3 classes (“normal” vs “biopsy” vs “cancer”). This is an extremely high-dimensional problem compared to TabPFN’s intended use for \(d \le 100\); also, linear models tend to be sufficient for predicting cancer status given gene expressions. Thus, this setting is far outside the domain on which we would expect TabPFN to perform well, particularly if it had been overfit to small tabular datasets. Furthermore, the 57 samples come from 5 different batches of gene microarray measurements. This adds additional difficulty to the task, because there is confounded shift between the technical batch effect and the unequal proportions of cancer status in the different batches.

      For all methods, we do not perform hyperparameter search, in order to simulate the scenario where there are too few samples to perform cross-validation without the risk of overfitting. We use the scikit-learn implementations of LR and SVC with their default hyperparameters. For TabPFN, we use the default hyperparameter of 32 ensembles; we also enable feature subsampling as is required for \(d > 100\) problems.

      Results are shown below, aggregated over 10 random 75-25 train-test splits, and evaluated via both accuracy and macro-averaged F1-score. TabPFN has a surprisingly strong showing, handily beating SVC and XGBoost, while almost matching logistic regression. This pattern holds both when we use all features and also when we use only the first 1k features.
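The protocol can be summarized with a short sketch (scikit-learn defaults; X and y stand for the bladderbatch features and labels, and TabPFN can be dropped in through the same fit/predict interface):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

def evaluate(make_model, X, y, n_splits=10):
    # Repeated random 75-25 splits, default hyperparameters, no tuning.
    accs, f1s = [], []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, random_state=seed)
        model = make_model().fit(X_tr, y_tr)
        pred = model.predict(X_te)
        accs.append(accuracy_score(y_te, pred))
        f1s.append(f1_score(y_te, pred, average="macro"))
    return np.mean(accs), np.mean(f1s)

# e.g. evaluate(LogisticRegression, X, y) or evaluate(SVC, X, y)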

      We also evaluate the different methods on a more realistic setting, where we train on 4 out of 5 batches of data and evaluate on all samples from the remaining unseen batch. Results are shown below, with scatterplot labels used to indicate the identity of the test batch. While all methods perform worse in this setting, TabPFN still almost matches LR while beating the other baselines.

      We also verify that TabPFN is not simply memorizing the class imbalance in favor of cancer. We compute confusion matrices, shown below, for each train-test split. Even though cancer is the most common class in every training split, there does not appear to be any systematic bias across the splits in favor of predicting cancer.

      Computer vision as a tabular classification problem

Finally, we compare TabPFN with other methods on two computer vision (CV) tasks. As in the previous section, we use the default hyperparameter settings for all methods. We treat MNIST and CIFAR-10 as tabular ML problems with \(28^2\) and \(3 \times 32^2\) features, respectively. We aggregate over 10 train-test splits, where the test set is the full MNIST / CIFAR-10 test set, and the training set is a random subsample of size 30, 100, 300, or 1000. In this experiment, TabPFN was competitive for smaller training set sizes, but lagged as we trained on more samples. Interestingly, while SVC performed poorly on cancer classification, it performed well for large sample sizes on the CV tasks. Meanwhile, while logistic regression (LR) performed well on cancer classification, it struggled in the current setting. It remains to be seen whether the shared behavioral characteristics of TabPFN and LR in these tasks hold more generally. If so, this could motivate future work on meta-learning TabPFN to perform robust classification with a hinge-type loss.

      Test accuracy on MNIST (left) and CIFAR-10 (right).

      Closing thoughts

Taken together, our preliminary results are suggestive of future developments in tabular PFNs. Currently, an applied ML practitioner will likely choose between training a model on their own small dataset and using the TabPFN “one size fits all” model. Our results suggest that the TabPFN model will likely perform quite well, even outside its intended domain. However, it still came second place to logistic regression on our cancer classification task, and last or second-to-last on the CV classification problems. This suggests that the future will not look like a binary choice between training a non-PFN and selecting a single state-of-the-art tabular PFN. Rather, we suspect that there will exist PFNs for specific modalities of data (e.g., gene expression) or for specific settings (e.g., robust classification) that bridge the gap between the two extremes.

      In such a future, we believe our approach to evaluating TabPFN will become increasingly essential. In the burgeoning field of large language models (LLMs), evaluation on various public benchmarks is widely considered necessary but insufficient. LLM researchers and users will also evaluate a newly-announced model by trying their favorite personal examples on the new LLM. When the LLM fails on a prompt, one modifies the prompt slightly to see whether the LLM simply expected a different prompting style. When the LLM succeeds, one tries variants to see whether its satisfactory response was in fact brittle to the prompt. By interacting with an LLM, one gets a sense for its expected prompting style and the type of outputs it generates. In particular, providing out-of-distribution (adversarial) inputs (e.g. “poem poem poem”) to an LLM tells us something useful about how it will operate on future unanticipated out-of-distribution inputs.

      By analogy, we argue that, while open tabular benchmarks are valuable resources, these should not be fully determinative for researchers and users of tabular ML methods. Benchmarks do allow us to quickly discover which methods are Pareto-dominated and can therefore be safely ignored. However, as we move into a world with multiple available PFN options, with different sorts of inductive priors, it will become increasingly useful to interact with them on simple problems to gain an intuition for whether their priors match one’s own use-case. For our analysis on 1d inputs, it is important to notice that there is not necessarily one “right answer”. Thus, evaluations of tabular ML approaches will need to be more granular than to describe TabPFN as state of the art for all of tabular ML. Instead, evaluations should aim at identifying specific tabular PFN checkpoints, based on different inductive priors and synthetic datasets, as being best suited for specific classes of problem settings.

      Furthermore, our results illuminate a key practical difference between TabPFN, which relies on in-context learning, and other neural network models for tabular ML. Skepticism around neural networks for tabular ML has been justified by problems stemming from the non-convexity of neural network training. Note that the problem (in the small dataset context) with neural net training non-convexity is not fundamentally about the fact that one may have missed a global optimum with better performance. Rather, deep learning requires babysitting during training runs and optimization of training hyperparameters which are unrelated to one’s beliefs about the nature of one’s specific problem. Thus, a modified architecture, preprocessing method, or data selection approach might be better matched for a particular dataset, but in the end perform worse due to problematic training dynamics – which one might be unable to fix without risk of overfitting. In the small dataset regime, the maximum performance (over all training hyperparameter settings) matters less than the performance on the default hyperparameter settings.

      Because the overall approach of TabPFN obviates this problem with pure in-context learning, the fundamental weaknesses of other neural network approaches do not apply. For example, our 1d experiments would not have been straightforwardly possible if we had retrained a neural network on each reconfiguration of the training data. If we had done so while keeping the training hyperparameters fixed, it would not represent how people actually use such a neural network. On the other hand, if we had plotted results for carefully optimized hyperparameters, it is not clear whether the results would be illustrative of the general inductive biases of the neural network architecture, or merely of the behavior of an optimally-trained neural network. However, the flip side of this advantage of TabPFN is that our analysis applies not so much to TabPFN-the-method, as it does to prior_diff_real_checkpoint_n_0_epoch_42.cpkt-the-checkpoint.

      Finally, we believe our evaluation helps address some of the popular skepticism around TabPFN. While our results indicate that there remains substantial room for improvement, we found no evidence that would suggest that TabPFN’s results were solely the result of overfitting a large neural network to public benchmarks. Rather, our results suggest that TabPFN learns a simple “world model” of small-n statistical learning for tabular classification. This, in itself, makes TabPFN worthy of further careful empirical study.

      For attribution in academic contexts, please cite this work as
      +        PLACEHOLDER FOR ACADEMIC ATTRIBUTION
      +  
      BibTeX citation
      +        PLACEHOLDER FOR BIBTEX
      +  
      \ No newline at end of file diff --git a/call/index.html b/call/index.html new file mode 100644 index 00000000..b4a48bb7 --- /dev/null +++ b/call/index.html @@ -0,0 +1 @@ + call for blogposts | ICLR Blogposts 2024

      Submit your blogpost on Openreview

      Call for blog posts

We invite all researchers and practitioners to submit a blog post discussing work previously published at a top-tier venue to the ICLR 2024 blog post track. The format and process for this blog post track are described below.

      Content

Write a post on a subject that has been published at a top-tier venue (ICLR, ICML, NeurIPS, AAAI, UAI, CVPR, SIGGRAPH, ECCV, ICCV, etc.) relatively recently. Past blog posts can be accessed here.

      Conflict of interest

The authors of the blog posts will have to declare their conflicts of interest (positive or negative) with the paper (and its authors) they write about. Conflicts of interest include:

      • Recent collaborators (less than 3 years)
• Current institution

Reviewers will be asked to judge if the submission is sufficiently critical and objective of the papers addressed in the blog post. Blog posts must not be used to highlight or advertise past publications of the authors or of their lab.

      Publication

      Blog post

The posts will be created and published under a unified template; see the submission instructions and the sample post hosted on the blog of this website.

      Poster

      Additionally, accepted posts will have the option to present their work as a poster during the main poster session. For more information about the main poster session (time, poster format, etc.) please refer to the ICLR homepage.

      Review

Blog posts will be peer-reviewed (double-blind) for quality and novelty of content: clarity and pedagogy of the exposition, new theoretical or practical insights, reproduction/extension of experiments, etc. The review is dual-anonymous, assuming good faith from both submitters and reviewers (see the submission instructions for more details).

      Key Dates

      • Abstract deadline: December 11th 00:00GMT, 2023 (submit to OpenReview).  

      • Submission deadline: December 17th 00:00GMT, 2023 (any modifications to your blog post, via a pull request on github).  

      • Notification of acceptance: January 30th, 2024 UPDATED: February 15th, 2024  

      • Camera-ready merge: March 15th, 2024

      Contact

      For answers to many common questions please refer to the ICLR FAQ

      Should you have other inquiries, please don’t hesitate to reach out via email at: blog.track.chairs@gmail.com

      \ No newline at end of file diff --git a/feed.xml b/feed.xml new file mode 100644 index 00000000..f2fa7821 --- /dev/null +++ b/feed.xml @@ -0,0 +1,393 @@ +Jekyll2024-04-08T03:01:46+02:00https://iclr-blogposts.github.io/2024/feed.xmlICLR Blogposts 2024Home to the 2024 ICLR Blogposts track Masked Language Model with ALiBi and CLAP head2024-05-07T00:00:00+02:002024-05-07T00:00:00+02:00https://iclr-blogposts.github.io/2024/blog/alibi-mlmAdapted and expanded from EIFY/fairseq.

Unmodified and unmasked, the attention mechanism is permutation-invariant, so transformer-based language models employ positional encoding to break the symmetry and enable sequence modeling. In their ICLR 2022 paper, Press et al. introduced Attention with Linear Biases (ALiBi) as a new approach to positional encoding, where the positional information of the tokens is encoded by applying an attention weight bias proportional to the distance between tokens:
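For a query token at position \(i\) attending to keys at positions \(1, \dots, i\), the attention scores become (our LaTeX reconstruction of the equation, which the original post renders as a figure):

\[\mathrm{softmax}\left(q_i K^\top + m \cdot [-(i-1), \dots, -2, -1, 0]\right)\]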

where \(m\) is a head-specific slope chosen to follow the geometric sequence \(\frac{1}{2^{0.5}}, \frac{1}{2^1}, \frac{1}{2^{1.5}}, \dots, \frac{1}{2^{n/2}}\) for a model with \(n\) attention heads. This approach is shown to enable input-length extrapolation, in the sense that the perplexity of the model remains stable as the inference context length exceeds the training context length. The paper, however, focuses on autoregressive decoder-only models and relies on model perplexity as the metric, and therefore leaves open the question of whether ALiBi is applicable to MLMs like BERT and RoBERTa. To help answer this question, we tested the two following changes to the RoBERTa baseline models, based on the first-party Fairseq toolkit:

      Attention with Linear Biases (ALiBi)

Since MLMs are based on encoders that attend to tokens both before and after the given position, considerations must be made regarding how to distinguish them. Press himself suggested the following 3 options for encoder-attention ALiBi:

      1. Symmetric: Keep attention weight bias proportional to the distance between tokens and rely on the context to distinguish between tokens at +N and -N position.
      2. Nonsymmetric, one-sided: Make half of the heads only attend to the tokens before and half of the heads only attend to the tokens after. Weight bias is still proportional to the distance.
      3. Nonsymmetric with different slopes: Make the slopes \(m\) different forward and backward, with either learned or fixed values.

With the observation that option 2 spends about half of the attention compute on no-ops and that option 3 can still result in bias value collisions (e.g., \(m_{bwd} = 2 m_{fwd}\) and the -1 vs. +2 positions), we implemented both option 1 and what we call “nonsymmetric with offset”: shift the linear biases ahead by 0.5 × slope, i.e., the constant bias (right matrix of the figure above) becomes

       0 -.5 -1.5 -2.5 -3.5
      +-1   0  -.5 -1.5 -2.5
      +-2  -1    0  -.5 -1.5
      +-3  -2   -1    0  -.5
      +-4  -3   -2   -1    0
      +
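A small helper reproducing this constant bias matrix (our own sketch, not the Fairseq implementation; the attention logits then get slope * bias added per head, as in standard ALiBi):

import torch

def nonsymmetric_offset_bias(seq_len: int) -> torch.Tensor:
    # -|i - j| for past positions; future positions shifted ahead by 0.5.
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (rows)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (columns)
    bias = -(i - j).abs().float()
    return torch.where(j > i, bias + 0.5, bias)

# nonsymmetric_offset_bias(5) reproduces the 5x5 matrix above.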

      Unless otherwise noted, ALiBi for the following experiments means this nonsymmetric-with-offset encoder-attention ALiBi.

      Contrastive Language Pretraining (CLAP) Head

The prediction head is one part of the LM that has received less attention, and it happens to differ between the ALiBi autoregressive decoder-only models and RoBERTa. Based on the configs and training logs, the ALiBi models use the adaptive word embedding and softmax of Baevski & Auli with weight tying, whereas the RoBERTa prediction head has an additional fully-connected layer and nonlinearity on top of weight tying. Inspired by CLIP, we decided to test what we call the Contrastive Language Pretraining (CLAP) head below, the simplest possible prediction head with weight tying for the masked tokens, plus the thermodynamic beta (inverse temperature):

      class ClapHead(nn.Module):
      +    """Head for masked language modeling."""
      +
      +    def __init__(self, initial_beta, weight):
      +        super().__init__()
      +        self.beta = nn.Parameter(torch.tensor(initial_beta))
      +        self.weight = weight
      +
      +    def forward(self, features, masked_tokens=None, normalize=True):
      +        # Only project the masked tokens while training,
      +        # saves both memory and computation
      +        if masked_tokens is not None:
      +            features = features[masked_tokens, :]
      +        w = self.weight
      +        if normalize:
      +            w = F.normalize(w, dim=-1)
      +        return self.beta * F.linear(features, w)

      Compared to the baseline RoBERTa prediction head

      class RobertaLMHead(nn.Module):
      +    """Head for masked language modeling."""
      +
      +    def __init__(self, embed_dim, output_dim, activation_fn, weight=None):
      +        super().__init__()
      +        self.dense = nn.Linear(embed_dim, embed_dim)
      +        self.activation_fn = utils.get_activation_fn(activation_fn)
      +        self.layer_norm = LayerNorm(embed_dim)
      +
      +        if weight is None:
      +            weight = nn.Linear(embed_dim, output_dim, bias=False).weight
      +        self.weight = weight
      +        self.bias = nn.Parameter(torch.zeros(output_dim))
      +
      +    def forward(self, features, masked_tokens=None, **kwargs):
      +        # Only project the masked tokens while training,
      +        # saves both memory and computation
      +        if masked_tokens is not None:
      +            features = features[masked_tokens, :]
      +
      +        x = self.dense(features)
      +        x = self.activation_fn(x)
      +        x = self.layer_norm(x)
      +        # project back to size of vocabulary with bias
      +        x = F.linear(x, self.weight) + self.bias
      +        return x

We removed the embed_dim x embed_dim fully-connected layer, the activation function (GELU), the layer norm, and the output_dim trainable bias. Just like CLIP, we added the trainable thermodynamic beta and we L2-normalize the token embeddings before feeding them to the transformer and before computing the inner products between them and the transformer output as the softmax logits, scaled by beta.
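In equation form, for transformer output \(h\) at a masked position and token embedding matrix \(W\) (our summary of the code above):

\[\mathrm{logits} = \beta \, h \hat{W}^\top, \qquad \hat{W}_v = \frac{W_v}{\lVert W_v \rVert_2},\]

i.e., beta-scaled inner products against the L2-normalized embeddings.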

      Experiments

      WikiText-103

At first we tested the changes on the WikiText-103 dataset with a GeForce RTX 3080 16 GB Laptop GPU, using the validation set MLM perplexity as the metric. We tested the baseline (learned positional encoding + RoBERTa prediction head), learned-clap (learned positional encoding + CLAP head), ALiBi (ALiBi + RoBERTa prediction head), and zero-clap (ALiBi + CLAP head), in addition to the baseline with sinusoidal positional encoding instead of learned positional encoding:

where solid lines are what’s considered the “canonical” setup and dotted lines are experiments with the following variations in setup, which turned out to be irrelevant:

      1. Whether we use attention dropout or not
      2. Whether we use symmetric ALiBi (option 1) or nonsymmetric-with-offset ALiBi above
3. Whether we use a zero vector or a separate learnable embedding for the mask embedding. (The intention was to test using a zero vector instead of a separate learnable embedding, which in combination with ALiBi results in no non-semantic information in the input embeddings. However, a bug prevented this variation from working correctly, and the end effect was merely deleting the last two words (madeupword0001 and madeupword0002) from the dictionary instead, which we don't expect to be consequential.)
      4. Whether we L2-normalize the embeddings for the CLAP head or not
      5. Whether we scale the L2-normalized embeddings by sqrt(embed_dim) (no_scale_embedding=False) or not

      As we can see, the dotted lines are almost on top of the solid lines. Notably, sinusoidal positional encoding underperforms significantly compared to learned positional encoding.

      The Pile

As the next step, we scaled our experiments to train on the Pile for one epoch. About half of the examples in the Pile have sequence length > 1024, so we set the sequence length to 2048. Even so, ~1/7 of the examples have sequence length > 2048 and had to be discarded. In the end, one epoch consists of 133082 updates, and we employ a cosine learning rate schedule while “overestimating” the number of training steps by 10%, as inspired by the Chinchilla paper. In addition to the validation MLM perplexity, we also fine-tuned the models on the GLUE benchmark. As in the original RoBERTa paper, we tested both roberta.base with 125M parameters and roberta.large with 355M parameters. These experiments were performed on 8 x A100 40GB SXM4 GPUs, where the roberta.base experiments took ~3 days and the roberta.large experiments took ~9 days. In the table below, PPL is the final validation MLM perplexity, STS-B is the best validation loss, and all the others are the best validation accuracies over 10 epochs of finetuning.

      roberta.base

                   PPL↓ CoLA MNLI MRPC QNLI QQP  RTE  SST-2 STS-B↓
      +baseline     2.94 83.6 84.2 90   91.6 91.3 73.6 92.1  0.028
      +learned-clap 2.86 81.7 84.4 86.3 90.9 91.2 72.6 92.5  0.027
      +alibi        2.93 69.2 85.1 80.9 92   91.5 63.9 93.1  0.033
      +zero-clap    2.83 70.5 84.9 75.5 90.6 91.1 54.9 89.7  0.041
      +

      *Baseline but with sinusoidal positional encoding instead of learned positional encoding failed to converge.

      roberta.large

                   PPL↓ CoLA MNLI MRPC QNLI QQP  RTE  SST-2 STS-B↓
      +baseline*    2.55 83.7 86.8 84.3 92.5 91.8 79.8 93.3  0.027
      +learned-clap 2.5  84.1 86.3 89.7 92.8 91.7 79.8 93.7  0.023
      +alibi        2.65 69.1 86.5 68.4 92.4 91.7 52.7 93.6  0.123
      +zero-clap    2.54 69.1 86.7 81.9 92.2 91.6 52.7 93.1  0.031
      +

*Loss spiked somewhere between 24000 and 24500 updates and the model failed to recover. Loosely following the practice of Section 5.1 (Training Instability) in the PaLM paper, we solved the issue by restarting training from the 20000-update checkpoint with the PyTorch random seed changed from 1 to 2.

We found that ALiBi no longer helps lower the validation MLM perplexity. Furthermore, ALiBi turned out to be harmful for several specific GLUE tasks (CoLA, MRPC, and RTE). The CLAP head on its own, however, seems to be competitive and in fact outperforms the baseline with roberta.large.

      Conclusions

This seems to be another case where models with lower perplexity do not necessarily yield higher accuracies on downstream tasks, and where architectural changes beneficial at smaller scales do not imply the same at larger scales. The CLAP head, however, is simpler than the standard prediction head for MLMs, requires minimal changes, and may be worth trying especially at larger scales.

In the broader context, MosaicBERT and LittleBird are most similar to our experiments. In the MosaicBERT paper, Portes et al. also evaluate BERT-style MLMs with symmetric (option 1) encoder-attention ALiBi on the GLUE benchmark and find performance exceeding the BERT baseline within a limited training budget. However, these MosaicBERT models were trained with a much shorter (128) sequence length and so may have avoided the sequence-length regime in which perplexity and performance on certain downstream tasks start to deteriorate. (The same can be said about the work that reports in its Table 4 the MLM perplexity of RoBERTa large models trained on an excerpt of the Pile with various positional encodings, including symmetric (option 1) encoder-attention ALiBi, with 128 sequence length.) The LittleBird architecture is designed for question answering and built with BiALiBi (Bidirectional ALiBi), a variation of option 3 (nonsymmetric with different slopes) where the model learns not only the forward and backward slopes \(m_{fwd}\) and \(m_{bwd}\), but also a special bias value for the attention weight of the global [CLS] token. Lee et al. evaluate LittleBird models on a collection of QA benchmarks for both English and Korean and report favorable performance, but leave open the question of whether the architecture works well for other NLP tasks. Notably, we also found our ALiBi models capable of matching the baseline performance on the question answering task QNLI, so the reported performance is consistent with our experiments even without appealing to the other differences in architecture or pretraining task.

Finally, what can we say about the original decoder-attention ALiBi and positional encodings in general? The original decoder-attention ALiBi has been shown to help not only perplexity, but also performance on evaluation suites consisting of a diverse set of tasks, like the EleutherAI Language Model Evaluation Harness. This discrepancy may be explained by the causal mask, which has been proven sufficient for encoding positional information in theory (one caveat is that Proof C.1 of the cited work for absolute positional encoding depends on distinguishing values of unit fractions 1/t, which eventually fails due to precision limits; for example, 1/1464 can't be distinguished from 1/1465 in float16, well within the context length of interest), if not quite matching the performance of models with additional positional encodings in practice. Perhaps we can conclude that

1. Decoder-attention positional encodings really should be considered as causal mask + additional encodings, and how the two complement each other should be taken into account.
2. Longer context lengths and certain downstream tasks are more challenging for positional encodings. One worthwhile direction may be to rank their difficulties systematically and iterate on the more challenging circumstances first for future positional encoding designs.

      Model checkpoints

      Final checkpoints for models trained on the Pile:

      roberta.base

      baseline learned-clap alibi zero-clap

      roberta.large

      baseline learned-clap alibi zero-clap

      To load them, install EIFY/fairseq following the original instructions and download the GPT-2 fairseq dictionary:

wget -O gpt2_bpe/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt

Then all of the checkpoints above except the zero-clap ones can be loaded as follows:

$ python
Python 3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from fairseq.models.roberta import RobertaModel
>>> roberta = RobertaModel.from_pretrained('/checkpoint-dir', 'learned-clap-large.pt', '/dict-dir')
(...)
>>> roberta.fill_mask('The capital of China is <mask>.', topk=3)
[('The capital of China is Beijing.', 0.7009016871452332, ' Beijing'), ('The capital of China is Shanghai.', 0.23566904664039612, ' Shanghai'), ('The capital of China is Moscow.', 0.010170688852667809, ' Moscow')]
>>>

The zero-clap ones were trained without the last two madeupword’s (this is due to the same bug that affected the WikiText-103 variation above and is its only visible effect), so you need to delete them from dict.txt before loading, i.e.:

(...)
50009 0
50256 0
madeupword0000 0
madeupword0001 0
madeupword0002 0

$ python
Python 3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from fairseq.models.roberta import RobertaModel
>>> roberta = RobertaModel.from_pretrained('/checkpoint-dir', 'zero-clap-large.pt', '/dict-dir')
(...)
>>> roberta.fill_mask('The capital of China is <mask>.', topk=3)
[('The capital of China is Beijing.', 0.7051425576210022, ' Beijing'), ('The capital of China is Shanghai.', 0.21408841013908386, ' Shanghai'), ('The capital of China is Taiwan.', 0.007823833264410496, ' Taiwan')]
>>>
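If you’d rather script the dictionary trimming shown above than edit dict.txt by hand, a small helper along these lines should work (ours, not part of the original instructions):

# Drop the trailing madeupword0001/madeupword0002 entries from dict.txt
with open('dict.txt') as f:
    lines = f.readlines()
with open('dict.txt', 'w') as f:
    f.writelines(lines[:-2])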

The rest of the original example usage should also just work. While these checkpoints have only been tested with this fork, the baseline ones should also work with the original fairseq repo with minimal changes to the state dict:

>>> import torch
>>> path = '/checkpoint-dir/baseline-large.pt'
>>> with open(path, 'rb') as f:
...   state = torch.load(f, map_location=torch.device("cpu"))
...
>>>
>>> del state['cfg']['task']['omit_mask']
(...)
>>> torch.save(state, '/checkpoint-dir/compatible.pt')
      ]]>
      Jason Chuan-Chih Chou
How to compute Hessian-vector products?
2024-05-07 | https://iclr-blogposts.github.io/2024/blog/bench-hvp

Hessian-vector products (HVPs) play a central role in the study and the use of the geometric properties of the loss functions of deep neural networks, as well as in many recent bilevel optimizers. However, computing such a quantity is often considered prohibitive by practitioners, discouraging them from using algorithms that rely on HVPs.

With this blog post, we aim to convince practitioners that with modern automatic differentiation (AD) frameworks such as JAX or PyTorch, HVPs can be evaluated efficiently. Indeed, standard AD theory predicts that the computational cost of an HVP is of the same order as the cost of computing a gradient. After a brief introduction to why HVPs are useful for optimization and ML applications and to the basics of AD, we explain in detail the AD-based methods to compute an HVP and the reason for their efficiency. In particular, we show that one can compute HVPs without explicit Hessian computation. We then compare the different methods to compute HVPs for several deep neural network architectures in terms of time and memory, for both JAX and PyTorch. Our results illustrate the complexity predicted by the theory, showing that computing an HVP is not much more expensive than computing a gradient. This opens an avenue to develop efficient second-order informed methods for neural networks.

      What are HVPs and where are they useful?

Let us first introduce the notions of Hessian and HVP. In this post, we will consider a twice-differentiable function \(f:\mathbb{R}^d\to\mathbb{R}\) that maps a vector \(x\) in \(\mathbb{R}^d\) to a real number in \(\mathbb{R}\). This typically corresponds to a function that maps the values of the parameters \(\theta\) of a neural network to the loss \(f(\theta)\). For such a function, standard AD can be used to efficiently compute the gradient of the loss \(\nabla f(\theta) = \left[ \frac{\partial f}{\partial \theta_i}(\theta)\right]_{1\le i \le d} \in \mathbb{R}^d\) using backpropagation. The Hessian matrix of \(f\) at \(\theta\) is the matrix of its second-order partial derivatives

      \[\nabla^2 f(\theta) = \left[\frac{\partial^2f}{\partial \theta_i\partial \theta_j}(\theta)\right]_{1\leq i,j\leq d}\in\mathbb{R}^{d\times d}\enspace.\]

This matrix corresponds to the derivative of the gradient and captures how the gradient changes when moving \(\theta\). To evaluate the variation of the gradient when moving \(\theta\) in the direction \(v\in\mathbb{R}^d\), one can compute the quantity \(\nabla^2 f(\theta) v\in\mathbb{R}^d\). This is the Hessian-vector product (HVP).

      Let us review some use cases of HVPs in optimization and machine learning.

      Inverse Hessian-vector products (iHVPs) in optimization

      When trying to find the minimum of the function \(f\), methods that account for the second-order information often rely on the product between the inverse Hessian and a vector to find a good update direction. For instance, Newton’s method relies on update rules of the form

      \[\theta_{k+1} = \theta_k - \eta_k[\nabla^2f(\theta_k)]^{-1}\nabla f(\theta_k)\]

      for some step-size \(\eta_k>0\).

When evaluating the term \([\nabla^2f(\theta_k)]^{-1}\nabla f(\theta_k)\), it would be very inefficient to first compute the full Hessian matrix \(\nabla^2f(\theta_k)\), then invert it, and finally multiply the result with the gradient \(\nabla f(\theta_k)\). Instead, one computes the inverse Hessian-vector product (iHVP) by solving the following linear system

      \begin{equation}\label{eq:linear_system} \nabla^2f(\theta)v = b\enspace. \end{equation}

with \(b = \nabla f(\theta_k)\). This approach is much more efficient, as it avoids computing and storing the full Hessian matrix and only applies the inverse of the matrix to the vector \(b\).

      A second use case for the iHVP in optimization is with bilevel optimization. In bilevel optimization, one wants to solve the following problem

      \begin{equation}\label{eq:bilevel_pb} \min_{x\in\mathbb{R}^d} h(x) = F(x, y^* (x))\quad\text{with}\quad y^*(x) = \arg\min_{y\in\mathbb{R}^p} G(x, y)\enspace. \end{equation}

      The gradient of the function \(h\) can be computed using the implicit function theorem, giving the following expression

\[\nabla h(x) = \nabla_x F(x, y^* (x)) - \nabla_{xy}G(x, y^*(x))[\nabla_{yy}G(x, y^*(x))]^{-1}\nabla_y F(x, y^*(x))\enspace.\]

Here, the term \(\nabla^2_{yy} G(x, y)\) is the Hessian of the function \(G\) with respect to \(y\). Thus, this quantity also requires computing an iHVP.

To compute the iHVP, there are many methods in the literature to solve \eqref{eq:linear_system}, like Neumann iterates, the conjugate gradient method, or gradient descent steps on the quadratic form \(v\mapsto \frac12\langle\nabla^2f(\theta)v, v\rangle - \langle b, v\rangle\). These methods rely on HVPs, as illustrated by the highlighted terms in the conjugate gradient method below (a JAX sketch follows the algorithm). Thus, an efficient implementation of HVPs is crucial for the overall algorithm performance.

      Conjugate gradient to solve \eqref{eq:linear_system}
      Input Initialization \(v_0\)
      Initialization $$ r_0 = \textcolor{orange}{\nabla^2f(\theta) v_0} - b,\quad p_0 = -r_0,\quad t = 0 $$ While \(r_t \neq 0\) \begin{align*} \alpha_t &=\frac{r_t^\top r_t}{p_t^\top \textcolor{orange}{\nabla^2f(\theta) p_t}} \\ v_{t+1} &=v_t + \alpha_t p_t \\ r_{t+1} &=r_t + \alpha_t\textcolor{orange}{\nabla^2f(\theta) p_t} \\ \beta_{t+1} &=\frac{r_{t+1}^\top r_{t+1}}{r_t^\top r_t} \\ p_{t+1} &=-r_{t+1} + \beta_{t+1} p_t\\ t &=t + 1 \end{align*}
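To make this concrete, here is a minimal matrix-free sketch of the iHVP solve in JAX. It relies on jax.scipy.sparse.linalg.cg and on the forward-over-reverse HVP discussed later in this post; the function name solve_ihvp is ours.

import jax
from jax.scipy.sparse.linalg import cg

def solve_ihvp(f, theta, b):
    # Matrix-free linear operator: v -> Hessian(f)(theta) @ v
    hvp = lambda v: jax.jvp(jax.grad(f), (theta,), (v,))[1]
    # Conjugate gradient touches the Hessian only through this operator
    v, _ = cg(hvp, b)
    return v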

      HVPs for the study of the loss landscape

      The study of the geometry of neural networks is an active field that aims at understanding the links between training dynamics, local geometry of the training loss and generalization. One way to study the local geometry of a neural network is to find the distribution of the eigenvalues of its Hessian matrix. Indeed, depending on the sign of the eigenvalues of the Hessian, one can for instance distinguish local minima, local maxima and saddle points. As an illustration, the following figure shows how the sign of the eigenvalues of the Hessian matrix of a function affects the shape of the function’s landscape around a stationary point.

In several papers, an approximation of the Hessian spectrum is computed using the Lanczos algorithm. This algorithm is a modification of the power method where each new iterate is taken in the orthogonal complement of the previous iterates. It outputs a factorization of the Hessian of the form \(\nabla^2 f(\theta) = VTV^\top\) where \(V=(v_0,\dots,v_{k-1})\) is orthogonal and

      \[T = \begin{pmatrix} \alpha_0& \beta_1 & 0 & \cdots & 0\\ \beta_1 & \alpha_1 & \beta_2 & \ddots & \vdots\\ 0 & \beta_2 & \alpha_2 & \ddots & 0\\ \vdots & \ddots & \ddots & \ddots & \beta_{k-1}\\ 0 & \cdots & 0 & \beta_{k-1} & \alpha_{k-1} \end{pmatrix}\enspace.\]

      Lanczos' algorithm
      Input Initial vector \(v_0\).
      Initialization $$ w'_0 = \textcolor{orange}{\nabla^2f(\theta)v_0},\quad \alpha_0 = w_0'^\top v_0,\quad w_0 = w_0' - \alpha_0 v_0 $$ For \(i = 1,\dots, k-1\):
      \begin{align*} \beta_i &= \|w_{i-1}\|\\ v_{i} &= \frac{w_{i-1}}{\beta_{i}}\\ w_i' &= \textcolor{orange}{\nabla^2f(\theta)v_i}\\ \alpha_i &= w_i'^\top v_i\\ w_i &= w_i' - \alpha_i v_i - \beta_iv_{i-1} \end{align*}

      We observe once again that the Hessian information is accessed through HVPs rather than the full Hessian matrix itself.
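To make this HVP-only access pattern explicit, here is a minimal (non-optimized) JAX sketch of the Lanczos iteration above for a flat parameter vector; the function name lanczos and this simplified interface are ours.

import jax
import jax.numpy as jnp

def lanczos(f, theta, v0, k):
    # Tridiagonalize the Hessian of f at theta, accessing it only through HVPs
    hvp = lambda u: jax.jvp(jax.grad(f), (theta,), (u,))[1]
    v = v0 / jnp.linalg.norm(v0)
    w_prime = hvp(v)
    alpha = w_prime @ v
    w = w_prime - alpha * v
    alphas, betas, v_prev = [alpha], [], v
    for _ in range(1, k):
        beta = jnp.linalg.norm(w)
        v, v_prev = w / beta, v
        w_prime = hvp(v)
        alpha = w_prime @ v
        w = w_prime - alpha * v - beta * v_prev
        alphas.append(alpha)
        betas.append(beta)
    # The eigenvalues of the tridiagonal matrix T built from (alphas, betas)
    # approximate the extreme eigenvalues of the Hessian.
    return jnp.array(alphas), jnp.array(betas)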

      A quick detour by automatic differentiation

Automatic differentiation (AD) is an important tool to compute exactly the derivatives of differentiable functions obtained as the composition of simple operations. There are two modes in AD: the forward mode, which computes Jacobian-vector products (JVPs), and the reverse mode, which computes vector-Jacobian products (VJPs). Since the gradient of a scalar function is a special case of the VJP, the reverse mode is the most frequently used in machine learning. It is typically used to compute the gradients of deep learning cost functions, where it is called backpropagation.

In what follows, we briefly present the notion of a computational graph and the two AD modes. For a more detailed explanation, we refer the reader to the excellent survey by Baydin et al.

      Computational graph

A key ingredient of AD is the computational graph associated with the code that evaluates a function. It is a directed acyclic graph that represents the succession of elementary operations required to evaluate a function.
A simple computational graph of a function \(f:\mathbb{R}^d\to\mathbb{R}^p\) typically looks like the following.

In this graph, the vertices \(z_i\in\mathbb{R}^{m_i}\) represent the intermediate states of the evaluation of \(f\). To get the vertex \(z_i\), we use the values of its parents in the graph \(z_{i-1}\), through simple transfer functions \(z_i(z_{i-1})\). The computational complexity of the function evaluation depends on the complexity of the considered graph, as one node might have more than one parent. The memory footprint of the evaluation of the function is also linked to the maximum number of parents a vertex can have in the computational graph, as their values need to be stored until all children nodes have been computed.

      Let us take an example with a multilayer linear perceptron (MLP) with 2 layers. The function \(f_x:\mathbb{R}^h\times \mathbb{R}^{h\times p}\to \mathbb{R}\) is defined for an input \(x\in\mathbb{R}^p\) by

\begin{equation}\label{eq:mlp} f_x(U, W) = \frac12\|UWx\|^2\enspace. \end{equation}

Here, the input \(\theta\) corresponds to the parameters of the network \((U, W)\) and the intermediate steps are \(z_1 = Wx\), \(z_2 = Uz_1\) and \(z_3 = \frac12 \|z_2\|^2\). A possible computational graph to get \(f_x(U, W)\) is the following

      and the associated Python code to compute \(f_x\) is

def f(U, W):
    # x is a fixed input vector, treated as a constant of the function
    z1 = W @ x
    z2 = U @ z1
    z3 = 0.5 * jnp.linalg.norm(z2)**2
    return z3

      Here, the feed-forward structure of the function makes the computational graph very simple, as each node has a single intermediate result parent.

AD uses this computational graph to compute the function’s derivatives. Using the chain rule, the Jacobian \(\frac{\partial f}{\partial \theta}(\theta)\) of \(f\) is obtained as a product of the Jacobians of the intermediate states \(z_1, \dots, z_n\). \begin{equation}\label{eq:chain_rule} \underbrace{\frac{\partial f}{\partial \theta}(\theta)}_{p\times d} = \frac{\partial z_n}{\partial \theta} =\frac{\partial z_n}{\partial z_1}\frac{\partial z_1}{\partial \theta}=\cdots = \underbrace{\frac{\partial z_n}{\partial z_{n-1}}}_{p\times m_{n-1}}\underbrace{\frac{\partial z_{n-1}}{\partial z_{n-2}}}_{m_{n-1}\times m_{n-2}}\cdots\underbrace{\frac{\partial z_1}{\partial \theta}}_{m_1\times d}\enspace. \end{equation} Depending on the order of the multiplications, one can compute the derivative of \(f\) with respect to \(\theta\) in two ways: the forward mode and the reverse mode.

      Forward mode

For a vector \(v\in\mathbb{R}^d\), the Jacobian-vector product (JVP) corresponds to the directional derivative of \(f\) in the direction \(v\). It can be computed by the forward mode of AD:

      \begin{equation}\label{eq:chain_rule_jvp} \frac{\partial f}{\partial \theta}(\theta)\times v = \frac{\partial z_n}{\partial z_{n-1}}\frac{\partial z_{n-1}}{\partial z_{n-2}}\cdots\frac{\partial z_1}{\partial \theta}v\enspace. \end{equation}

It consists in doing the multiplications in \eqref{eq:chain_rule_jvp} from right to left. It is a forward pass in the computational graph where we propagate at the same time the states \(z_i\) and the partial derivatives \(\frac{\partial z_{i+1}}{\partial z_i}\). If \(f\) is real-valued, the \(i\)th coordinate of its gradient is exactly given by the product of the Jacobian of \(f\) and the \(i\)th canonical basis vector \(e_i\), since \begin{equation} \frac{\partial f}{\partial \theta_i}(\theta) = \lim_{t\to 0}\frac{f(\theta+te_i)-f(\theta)}{t}\enspace. \end{equation} Thus, we can get the gradient by computing the \(d\) JVPs \(\left(\frac{\partial f}{\partial \theta}(\theta)\times e_i\right)_{1\leq i \leq d}\) with forward AD.
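To illustrate this coordinate-by-coordinate strategy (and why it is expensive: it costs \(d\) forward passes), here is a small sketch of ours for a real-valued \(f\) with a flat parameter vector.

import jax
import jax.numpy as jnp

def grad_via_jvps(f, theta):
    # One JVP per coordinate: d forward-mode passes instead of one backward pass
    d = theta.shape[0]
    basis = jnp.eye(d)
    return jnp.array([jax.jvp(f, (theta,), (basis[i],))[1] for i in range(d)])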

To understand properly what happens when using forward differentiation, let us go back to the linear MLP defined in \eqref{eq:mlp}. If we implement the forward differentiation ourselves to get the JVP, we obtain the following code.

def jvp(U, W, v_u, v_w):
    # Forward diff of f
    z1 = W @ x
    v_z1 = v_w @ x  # directional derivative of W -> W @ x in the direction v_w

    z2 = U @ z1
    v_z2 = U @ v_z1 + v_u @ z1  # directional derivative of (U, z1) -> z2 in the direction (v_u, v_z1)

    v_z3 = v_z2 @ z2  # directional derivative of z2 -> .5*norm(z2)**2 in the direction v_z2
    return v_z3

      In comparison with the code of the evaluation of \(f_x\), there are two more operations corresponding to the computation of the dual variables v_z1 and v_z2. In terms of memory, if we consider the computation of the JVP as coded in the previous snippet, the maximum number of parents of a vertex is four. This maximum is achieved by the vertex v_z2 which has the vertices U, v_z1, v_u and z1 as parents.

      In JAX, we get the JVP of a function \(f\) in the direction \(v\) with jax.jvp(f, (params, ), (v, ))[1].

      Reverse mode

The reverse mode is also known as backpropagation in the context of deep learning. For \(u\in\mathbb{R}^p\), it aims at computing VJPs

      \begin{equation}\label{eq:chain_rule_vjp} u^\top\frac{\partial f}{\partial \theta}(\theta) = u^\top\frac{\partial z_n}{\partial z_{n-1}}\frac{\partial z_{n-1}}{\partial z_{n-2}}\cdots\frac{\partial z_1}{\partial \theta}\enspace. \end{equation}

In reverse AD, the multiplications of \eqref{eq:chain_rule_vjp} are done from left to right. It requires one forward pass in the computational graph to compute the intermediate states \(z_i\) and then a backward pass to propagate the successive partial derivatives from left to right. Contrary to the forward mode, it has a larger memory footprint. Indeed, it requires storing the values of all the states. For instance, to compute the term \(\frac{\partial z_3}{\partial z_2}\), one needs the value of \(z_2\), which was computed early during the forward pass. If \(f\) is real-valued, \(u\) is a scalar and the VJP is the multiplication of the gradient of \(f\) by \(u\). Thus, one can get the gradient of \(f\) by using \(u=1\) and performing only one reverse differentiation. This makes this mode more efficient for computing gradients.

Let us observe what happens if we manually code the backpropagation to get the gradient of the previous function \(f_x\) defined by \(f_x(U, W) = \frac12\|UW x\|^2\).

def gradient(U, W):
    # Forward pass
    z1 = W @ x
    z2 = U @ z1
    z3 = 0.5 * jnp.linalg.norm(z2)**2

    # Reverse pass
    ## Transfer function: z3 = 0.5 * jnp.linalg.norm(z2)**2
    dz2 = z2  # derivative of z3 wrt z2

    ## Transfer function: z2 = U @ z1
    dU = jnp.outer(dz2, z1)  # derivative of z3 wrt U
    dz1 = U.T @ dz2  # derivative of z3 wrt z1

    ## Transfer function: z1 = W @ x
    dW = jnp.outer(dz1, x)  # derivative of z3 wrt W

    return dU, dW

This function returns the gradient of \(f_x\). Reading this code, we see that one needs to store all the intermediate values of the forward pass. Indeed, if we look at the case of z1, which is the first node computed, it is used four steps later for the computation of dU.

      To get the gradient in JAX, one can use jax.grad(f)(params).

      Naive computation of HVPs

      Since we are interested in computing \(\nabla^2 f(\theta)v\), the simplest way to do it is to compute the Hessian matrix and then multiply it by the vector \(v\). This can be achieved in JAX by calling jax.hessian(f)(params) @ v.

This method is quite cumbersome, making it impossible to use for deep neural networks. Indeed, storing the full Hessian matrix has \(\mathcal{O}(d^2)\) memory complexity, where \(d\) is the number of parameters of the model.

The good news is that we can compute HVPs without computing the Hessian, thanks to clever use of AD.

      HVPs without explicit Hessian computation

In 1994, Pearlmutter proposed to leverage the following observation to compute HVPs efficiently: the HVP is also the directional derivative of the gradient in the direction \(v\):

      \[\nabla^2f(\theta) v = \lim_{\epsilon\to 0} \frac1\epsilon[\nabla f(\theta+\epsilon v)-\nabla f(\theta)] = \nabla [\langle \nabla f(.), v\rangle](\theta)\enspace.\]
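This identity also suggests a quick numerical sanity check: approximating the HVP by a finite difference of gradients. The sketch below is ours and is only an approximation; the AD-based methods that follow compute the HVP exactly.

import jax

def hvp_finite_diff(f, theta, v, eps=1e-4):
    # First-order finite-difference approximation of the directional
    # derivative of the gradient of f in the direction v
    g = jax.grad(f)
    return (g(theta + eps * v) - g(theta)) / eps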

Based on this identity, AD enables computing HVPs in three ways, as described in the JAX documentation.

      Forward-over-reverse

      The forward-over-reverse mode consists in doing forward differentiation in a computational graph of the gradient of \(f\).

      Its implementation in JAX is only two lines of code.

def hvp_forward_over_reverse(f, params, v):
  return jax.jvp(jax.grad(f), (params, ), (v, ))[1]

In this case, jax.grad(f)(params) is computed by reverse AD, whose complexity is two times the complexity of evaluating \(f\). Thus, the time complexity of hvp_forward_over_reverse is roughly four times the complexity of evaluating \(f\).

      To better see what happens, let us consider again our function \(f_x\) defined by \eqref{eq:mlp}. The Python code of the forward-over-reverse HVP is the following.

def forward_over_reverse(U, W, v_U, v_W):
    # Forward through the forward pass through f
    z1 = W @ x
    v_z1 = v_W @ x

    z2 = U @ z1
    v_z2 = U @ v_z1 + v_U @ z1

    # z3 = 0.5 * jnp.linalg.norm(z2)**2
    # Forward through the backward pass through f
    z4 = z2  # dz2
    v_z4 = v_z2  # v_dz2

    z5 = jnp.outer(z4, z1)  # dU
    v_z5 = jnp.outer(v_z4, z1) + jnp.outer(z4, v_z1)  # v_dU

    z6 = U.T @ z4  # dz1
    v_z6 = U.T @ v_z4 + v_U.T @ z4  # v_dz1

    z7 = jnp.outer(z6, x)  # dW
    v_z7 = jnp.outer(v_z6, x)  # v_dW

    return v_z5, v_z7  # v_dU, v_dW

The take-home message of this part is that, after computing the gradient of \(f_x\), one can consider a computational graph of this gradient and perform forward differentiation through this new computational graph. Here, the variables z1,…, z7 are the vertices of a computational graph of the gradient of \(f_x\). A nice property is that this mode yields the gradient and the HVP at the same time. Indeed, in the previous snippet, z5 and z7 are the components of the gradient of \(f_x\), which could also be returned if needed. This feature can be useful in bilevel optimization, for instance.
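In JAX, this “gradient for free” property is directly visible in the return value of jax.jvp, which contains both the primal output (the gradient) and the tangent (the HVP); the wrapper name grad_and_hvp is ours.

def grad_and_hvp(f, params, v):
    # jax.jvp returns (primal output, tangent) = (grad f(params), Hessian @ v)
    return jax.jvp(jax.grad(f), (params,), (v,))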

      Reverse-over-reverse

      Instead of doing forward differentiation of the gradient, one can multiply the gradient by \(v\) and thus get a scalar. We can then backpropagate into this scalar product. This is the reverse-over-reverse mode.

      It can be implemented by these lines of code.

def hvp_reverse_over_reverse(f, params, v):
  return jax.grad(lambda y: jnp.vdot(jax.grad(f)(y), v))(params)

      Since the gradients are computed by backpropagation, the complexity of hvp_reverse_over_reverse is twice the complexity of jax.grad(f), which is roughly four times the complexity of the evaluation of \(f\).

Writing down the code of the reverse-over-reverse HVP for our function \(f_x\) defined by \eqref{eq:mlp} clarifies the differences between this mode and the forward-over-reverse mode. In particular, one can notice that there are more elementary operations in the reverse-over-reverse mode than in the forward-over-reverse mode. Moreover, in terms of memory footprint, reverse-over-reverse requires storing the values of the vertices of the computational graph of the gradient of \(f_x\), while forward-over-reverse only needs to store the values of the vertices of the computational graph of \(f_x\). Thus, the former is less efficient than the latter.

def reverse_over_reverse(U, W, v_u, v_w):
    # Forward through <grad(f), v>
    ## Forward through f
    z1 = W @ x
    z2 = U @ z1
    z3 = 0.5 * jnp.linalg.norm(z2)**2

    ## Reverse through f
    z4 = z2  # dz2
    z5 = jnp.outer(z4, z1)  # dU
    z6 = U.T @ z4  # dz1
    z7 = jnp.outer(z6, x)  # dW

    # Output: dot product <grad(f), v>
    z8 = jnp.sum(z5 * v_u) + jnp.sum(z7 * v_w)

    # Backward through z8 = <grad(f), v>
    ## z8 = jnp.sum(z5 * v_u) + jnp.sum(z7 * v_w)
    dz7 = v_w
    dz5 = v_u

    ## z7 = jnp.outer(z6, x)
    dz6 = dz7 @ x

    ## z6 = U.T @ z4
    dz4 = U @ dz6
    ddU = jnp.outer(z4, dz6)  # derivative of z8 wrt U

    ## z5 = jnp.outer(z4, z1)
    dz4 += dz5 @ z1
    dz1 = dz5.T @ z4

    ## z4 = z2
    dz2 = dz4

    ## z2 = U @ z1
    dz1 += U.T @ dz2
    # As U appears multiple times in the graph, we sum its contributions
    ddU += jnp.outer(dz2, z1)

    ## z1 = W @ x
    ddW = jnp.outer(dz1, x)  # derivative of z8 wrt W

    return ddU, ddW

      Reverse-over-forward

What about doing forward differentiation of \(f\) rather than reverse propagation? This is what is done in the reverse-over-forward mode. It consists in backpropagating through the computational graph of the JVP of \(f\) in the direction \(v\).

def hvp_reverse_over_forward(f, params, v):
  jvp_fun = lambda params: jax.jvp(f, (params, ), (v, ))[1]
  return jax.grad(jvp_fun)(params)

This method is more memory-efficient than the previous one. Indeed, since we backpropagate only once, the memory burden is lower than in the reverse-over-reverse fashion. In comparison with forward-over-reverse, the complexity is the same. However, one can notice that forward-over-reverse enables computing the gradient of \(f\) and the HVP at the same time, which is not the case for the reverse-over-forward mode.

      The code of the reverse-over-forward HVP for the MLP \(f_x\) defined by \eqref{eq:mlp} is the following.

def reverse_over_forward(U, W, v_U, v_W):
    # Forward diff of f to get <grad(f), v>
    z1 = W @ x
    z6 = v_W @ x  # v_z1

    z2 = U @ z1
    z5 = U @ z6 + v_U @ z1  # v_z2

    # Output <grad(f), v>
    z4 = z5 @ z2  # v_z3

    # Backward pass through <grad(f), v>
    ## z4 = z5 @ z2
    dz2 = z5
    dz5 = z2  # dv_z2

    ## z5 = U @ z6 + v_U @ z1
    dz1 = v_U.T @ dz5
    dz6 = U.T @ dz5  # dv_z1
    ddU = jnp.outer(dz5, z6)  # derivative of z4 wrt U

    ## z2 = U @ z1
    # As U and dz1 appear multiple times, we sum their contributions
    dz1 += U.T @ dz2
    ddU += jnp.outer(dz2, z1)

    ## z1 = W @ x
    ddW = jnp.outer(dz1, x)
    return ddU, ddW

      Benchmark with deep learning architectures

While these three methods compute the same outputs, the different ways of traversing the computational graph change their overall time and memory complexities. We now compare the computation of HVPs with these three methods for various deep-learning architectures. To cover a broad range of use cases, we consider a residual network (ResNet34) and a transformer-based architecture (ViT-base) for image classification, as well as a transformer for natural language processing (BERT-base). We use the Flax and PyTorch implementations of these architectures available in the transformers package provided by Hugging Face 🤗.

All computations were run on an Nvidia A100 GPU with 40 GB of memory. We used version 0.4.21 of JAX and version 2.1.1 of PyTorch.

      The code of the benchmark is available on this repo.

      Time complexity

The first comparison we make is in terms of wall-clock time between the different ways to compute HVPs, as well as the computation of a gradient by backpropagation. For each architecture, we compute the gradient of the model with respect to the parameters by backpropagation. We also compute the HVPs in forward-over-reverse, reverse-over-forward and reverse-over-reverse modes, and we measure the time taken by each computation. Specifically for the HVPs, we subtract the time taken by a gradient computation to isolate the overhead of the HVP computation. The inputs for each architecture are generated randomly. For the ResNet34 architecture, we generated a batch of images of size 224x224x3. To limit out-of-memory issues in the experiments, we generated images of size 96x96x3 for the ViT architecture. For the BERT architecture, we generated a batch of sequences of length 32.

We first use JAX with just-in-time compilation. Each computation is run 90 times. On the left of the figure, we plot the median computation time along with the 20th and 80th percentiles in black. The computations are done with a batch size of 128. We observe that, in practice, the overhead of the HVP computation over the gradient computation is between one and two times the time of a gradient computation for the three architectures. Consequently, a whole HVP computation takes between two and three times the time of a gradient computation. This is consistent with the theory. One can notice that reverse-over-reverse is slightly slower than the others in all cases. Forward-over-reverse and reverse-over-forward are, for their part, very close in terms of time.

      We also report on the right figure the computational time of each method with respect to the batch size for the ResNet34 architecture. We observe, as expected, that the computational time scales linearly with the batch size.

We ran a similar experiment with the functional API available in PyTorch, torch.func, which is similar to the one JAX has. The results we get are more mixed.

In the case of ResNet34, the scaling between the different methods is similar to the one we get with JAX. During our experiments, we also found that batch normalization made the forward computation slow and induced out-of-memory issues, so we removed the batch normalization layers from the ResNet34 architecture.

For ViT and BERT, forward-over-reverse is surprisingly slower than the reverse-over-reverse method. Moreover, the scaling between the gradient and HVP computation times differs from the one we get with JAX. Indeed, for these architectures, the HVP computations take between four and five times as long as the gradient computations, a discrepancy with what we would expect in theory. This might be because, at the time of writing this blog post, the functional API of PyTorch is still in its early stages. In particular, we could not use compilation with torch.compile because it does not work with some operators of torch.func, such as torch.func.jvp.

      Memory complexity

We also compare the memory footprint of each approach. The following figure shows the results we get with JAX jitted code. On the left, we show the result for each method and model with a batch size of 64. On the right, we show the evolution of the memory footprint of each method with the batch size for the ResNet34. Surprisingly, we observe that the memory footprint of the different methods to compute HVPs does not vary for a given model. This is counterintuitive, since we expect the reverse-over-reverse method to have a larger memory footprint due to the double backpropagation.

However, when we run the same experiment with JIT compilation disabled, the results corroborate the theory. Indeed, one can observe in the following figure that the memory footprint of the reverse-over-reverse method is larger than that of the forward-over-reverse and reverse-over-forward methods. This is because reverse-over-reverse involves two successive backward differentiations, while the other two involve only one reverse differentiation. Moreover, the footprint scales linearly with the batch size, which was not the case in the previous figure in the small batch size regime.

In light of these two results, the clever memory allocation performed during just-in-time compilation significantly reduces the memory footprint of the HVP computations.

In the following figure, we plot the results we get with the PyTorch implementation. One can observe that in all cases forward-over-reverse consumes more memory than the reverse-over-forward mode. It is almost at the same level as the reverse-over-reverse mode, which is quite unexpected.

The right plot shows that, as expected, the memory footprint for the ResNet34 architecture evolves linearly with the batch size.

      Conclusion

In this blog post, we have explored the different ways to compute HVPs from theoretical and practical perspectives. The three take-home messages to keep in mind are the following:

      • We can compute HVPs without computing Hessian matrices.

      • In practice, computing an HVP takes between twice and four times the time taken by a gradient computation and requires two to three times more memory than computing a gradient.

• The AD framework and whether or not just-in-time compilation is used affect the practical time and memory performance of HVP computations.

      ]]>
      Mathieu Dagréou
On Bayesian Model Selection: The Marginal Likelihood, Cross-Validation, and Conditional Log Marginal Likelihood
2024-05-07 | https://iclr-blogposts.github.io/2024/blog/clml

$$\require{mathtools} \DeclareMathOperator{\opExpectation}{\mathbb{E}} \newcommand{\E}[2]{\opExpectation_{#1} \left [ #2 \right ]} \newcommand{\simpleE}[1]{\opExpectation_{#1}} \newcommand{\MidSymbol}[1][]{\:#1\:} \newcommand{\given}{\MidSymbol[\vert]} \DeclareMathOperator{\opmus}{\mu^*} \newcommand{\IMof}[1]{\opmus[#1]} \DeclareMathOperator{\opInformationContent}{H} \newcommand{\ICof}[1]{\opInformationContent[#1]} \newcommand{\xICof}[1]{\opInformationContent(#1)} \DeclareMathOperator{\opEntropy}{H} \newcommand{\Hof}[1]{\opEntropy[#1]} \newcommand{\xHof}[1]{\opEntropy(#1)} \DeclareMathOperator{\opMI}{I} \newcommand{\MIof}[1]{\opMI[#1]} \DeclareMathOperator{\opTC}{TC} \newcommand{\TCof}[1]{\opTC[#1]} \newcommand{\CrossEntropy}[2]{\opEntropy(#1 \MidSymbol[\Vert] #2)} \newcommand{\iCrossEntropy}[3]{\opEntropy_{#1 \Vert #2}[#3]} \DeclareMathOperator{\opKale}{D_\mathrm{KL}} \newcommand{\Kale}[2]{\opKale(#1 \MidSymbol[\Vert] #2)} \newcommand{\iKale}[3]{\opKale_{,\, #1 \Vert #2}[#3]} \DeclareMathOperator{\opJSD}{D_\mathrm{JSD}} \newcommand{\JSD}[2]{\opJSD(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opp}{p} \newcommand{\pof}[1]{\opp(#1)} \newcommand{\hpof}[1]{\hat{\opp}(#1)} \newcommand{\pcof}[2]{\opp_{#1}(#2)} \newcommand{\hpcof}[2]{\hat\opp_{#1}(#2)} \DeclareMathOperator{\opq}{q} \newcommand{\qof}[1]{\opq(#1)} \newcommand{\hqof}[1]{\hat{\opq}(#1)} \newcommand{\qcof}[2]{\opq_{#1}(#2)} \newcommand{\varHof}[2]{\opEntropy_{#1}[#2]} \newcommand{\xvarHof}[2]{\opEntropy_{#1}(#2)} \newcommand{\varMIof}[2]{\opMI_{#1}[#2]} \newcommand{\w}{\boldsymbol{\theta}} \newcommand{\W}{\boldsymbol{\Theta}} \newcommand{\h}{\boldsymbol{\phi}} \newcommand{\hopt}{\boldsymbol{\h^\star}} \newcommand{\H}{\boldsymbol{\Phi}} \DeclareMathOperator{\opf}{f} \newcommand{\fof}[1]{\opf(#1)} \newcommand{\xset}[3]{(\x_n^{#1})_{n=#2}^{#3}} \newcommand{\xNset}{(\x_n)_{n=1}^N} \newcommand{\XNtuple}{(\X_n)_{n=1}^N} \newcommand{\xNtuple}{(\x_n)_{n=1}^N} \newcommand{\XNset}{\{\X_n\}_{n=1}^N} \newcommand{\xNset}{\{\x_n\}_{n=1}^N} \newcommand{\XNsetk}{\{\X_n\}_{n=N-k+1}^N} \newcommand{\xNsetk}{\{\x_n\}_{n=N-k+1}^N} \newcommand{\XNkset}{\{\X_n\}_{n=1}^{N-k}} \newcommand{\xNkset}{\{\x_n\}_{n=1}^{N-k}} \newcommand{\XNoset}{\{\X_n\}_{n=1}^{N-1}} \newcommand{\y}{y} \newcommand{\Y}{Y} \newcommand{\L}{\boldsymbol{L}} \newcommand{\x}{\boldsymbol{x}} \newcommand{\X}{\boldsymbol{X}} \newcommand{\oppdata}{\hat{\opp}_{\text{data}}} \newcommand{\pdata}[1]{\hpcof{\text{data}}{#1}} \newcommand{\normaldist}[1]{\mathcal{N}(#1)} \newcommand{\wstddev}{\sigma_\w} \newcommand{\noisestddev}{\sigma_\text{noise}} \newcommand{\Dataset}{\mathcal{D}} \newcommand{\Dtrain}{\Dataset_{\text{train}}} \newcommand{\Dval}{\Dataset_{\text{val}}} $$

      Introduction

      Model selection is a crucial aspect of machine learning, as it allows us to choose the most appropriate model for a given task. In the Bayesian setting, the marginal likelihood has been a popular tool for model selection and hyperparameter learning, often motivated by the principle of Occam’s razor. However, the suitability of the marginal likelihood depends on the specific context and goals of the modeling task.

      Recently, the paper “Bayesian Model Selection, the Marginal Likelihood, and Generalization” by Lotfi et al. (2022/2023), which was accepted as Outstanding Paper and Long Oral at ICML 2022, examined the importance and challenges of model selection in machine learning, focusing on the log marginal likelihood (LML) and proposing a variant, the conditional log marginal likelihood (CLML). The authors argue that while LML is a useful tool for hypothesis testing, it may not be the best metric for model selection and for predicting the generalization performance of trained models or learning hyperparameters. They introduce the CLML as a potential improvement and demonstrate its effectiveness across various settings, including density models, Fourier features, Gaussian Processes, and deep neural networks.

      In this blog post, inspired by the above paper, we (re-)derive insights that challenge the conventional focus on the marginal likelihood and related quantities for Bayesian model selection. We argue that the quantities we examine are all consequences of Occam’s razor, and thus no single quantity should be considered universally superior. Instead, the choice of model selection criterion should be guided by the context and the desired outcomes. We highlight that many recently proposed metrics for model selection, including CLML, are closely related to cross-validation and have failure cases that can be explained by considering model misspecification and prior-data conflicts. Overall, the choice between these metrics should be based on the specific requirements of the task at hand.

We begin by discussing the foundations of model selection, including the role of Occam’s razor and its relationship to maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation. We then introduce the concepts of the log marginal likelihood (LML), cross-validation, and the conditional log marginal likelihood (CLML), highlighting their connections and differences. Through a series of thought experiments and empirical observations, we explore the behavior of these model selection criteria in various scenarios, such as under model misspecification, prior-data conflict, and in different data regimes. We find that the conditional marginal cross-entropy, which is closely related to cross-validation, is often a more reliable choice when the primary objective is to select for generalization performance. On the other hand, the conditional joint marginal cross-entropy (permutation-invariant negative CLML) may be preferable when the focus is on sequential prediction and online learning. At the same time, the joint marginal information (negative LML) is rarely the right choice for model selection. We review relevant literature, including the work of Fong and Holmes (2020) on the connection between the LML and cross-validation, the training speed estimators by Lyle et al. (2020) and Ru et al. (2021), and the experiments of Lotfi et al. (2022/2023) comparing the CLML and validation loss for deep neural networks (DNNs). These studies provide valuable insights into the strengths and limitations of different model selection criteria.

      Throughout the post, we emphasize the importance of considering the context, available data, and desired outcomes when selecting the most appropriate metric for model selection and hyperparameter tuning. By questioning the primacy of the (conditional) joint marginal likelihood and encouraging critical thinking about the foundations of these quantities, we hope to foster a more nuanced understanding of Bayesian model selection.

      (Bayesian) Model Selection

      In our daily lives, we’re often faced with choices that require us to sift through competing explanations or decisions. Imagine you hear your doorbell ring. You might think it’s the delivery you’ve been waiting for, a neighbor dropping by, or perhaps you didn’t hear anything at all, and it was just your imagination. In deciding between these options, you’re likely to lean towards the simplest explanation that aligns with your expectations—say, the long-awaited delivery. This inclination towards simplicity has a formal counterpart in scientific discovery and machine learning, known as Occam’s razor:

      This concept is further illustrated using an example from chapter 28 of David MacKay’s seminal book, “Information Theory, Inference, and Learning Algorithms”, where the essence of selecting between models based on their evidence is laid out succinctly.

      Occam’s razor --- How many boxes are in the picture (figure 28.1)? In particular, how many boxes are in the vicinity of the tree? If we looked with x-ray spectacles, would we see one or two boxes behind the trunk (figure 28.2)? (Or even more?) Occam’s razor is the principle that states a preference for simple theories. ‘Accept the simplest explanation that fits the data’. Thus according to Occam’s razor, we should deduce that there is only one box behind the tree. Is this an ad hoc rule of thumb? Or is there a convincing reason for believing there is most likely one box? Perhaps your intuition likes the argument ‘well, it would be a remarkable coincidence for the two boxes to be just the same height and colour as each other’. If we wish to make artificial intelligences that interpret data correctly, we must translate this intuitive feeling into a concrete theory.
      Excerpt from page 343 in David MacKay’s "Information Theory, Inference, and Learning Algorithms.”

      But how can we express this formally using mathematics?

      In the next section, we will use information-theoretic concepts to formalize Occam’s razor and connect it to the maximum likelihood estimation (MLE) and maximum-a-posteriori (MAP) estimation approaches. This formalization highlights that Occam’s razor, as a general principle favoring simplicity, can motivate various techniques, not just Bayesian ones. Therefore, using Occam’s razor as the sole justification for Bayesian model selection may not be as compelling as it initially appears.

      However, one could argue that when Occam’s razor is properly applied within a Bayesian framework, it captures a more nuanced notion of complexity. From this perspective, the Bayesian formulation of Occam’s razor favors models that strike a balance between goodness-of-fit and model complexity, where complexity is measured by the model’s ability to compress the data. This view is consistent with the minimum description length (MDL) principle, which posits that the best model is the one that minimizes the total description length of both the model and the data given the model.

      From Philosophical Principle to Mathematical Statement

      Let’s first connect Occam’s razor to Maximum Likelihood Estimation (MLE) before diving deeper into the background and (Bayesian) model selection.

      In information theory, the information content of an event \(x\) is defined as \(-\log_2 \pof{x}\), where \(\pof{x}\) is the probability of that event occurring according to a given model. This is also called Shannon’s information content. We use the base \(2\) for logarithms and measure information in bits (binary digits), and for the rest of the post, we will drop the base of the logarithm. The information content measures the optimal encoding length in bits for the event \(x\) under the model specified by its probability distribution \(\pof{\cdot}\). In the context of probabilistic modeling, variables that cannot be directly observed are called latent variables. Occam’s razor suggests that we should prefer simpler explanations for latent variables, given the observed data.

      Consider a model with a latent variable \(z\) and observed data \(x\). The model specifies a probability distribution \(\pof{z \given x}\). According to Occam’s razor, we prefer simpler explanations, which correspond to smaller values of \(-\log \pof{z \given x}\). Using Bayes’ theorem, we can rewrite this as:

      \[\text{minimize } z \text{ in } -\log \pof{z \given x} = -\log \pof{x \given z} - \log \pof{z} + \log \pof{x}.\]

      Given that \(\pof{x}\) is independent of \(z\), we can omit it from our objective. Additionally, if we posit a uniform (or non-informative prior) for \(z\), implying that all potential values of \(z\) are equally probable before observing \(x\), then \(\pof{z}\) becomes constant and can also be dropped from our objective. This simplifies our preference to:

      \[\text{minimize } z \text{ in } -\log \pof{x \given z}.\]

      Equivalently, we can maximize \(\pof{x \given z}\), which is the likelihood of the observed data \(x\) given the latent variable \(z\). When making a decision and selecting a single value for \(z\), this leads to the maximum likelihood estimation (MLE) approach.

      In summary, the connection between Occam’s razor and MLE relies on the following assumptions:

      1. Shannon’s information content is how we measure complexity.
      2. The prior distribution for the latent variables is uniform (or uninformative).
      3. Simpler explanations, as measured by the information content, are preferred (Occam’s razor).

      Under these assumptions, the preference for simpler explanations leads to the MLE approach, where more likely values of the latent variable given the observed data are preferred.

Maximizing the likelihood is common in machine learning because we can directly optimize the likelihood function. Still, this is not easy for deep learning models, because they have a large number of parameters and the loss function is non-convex.

      Maximum-a-Posteriori Estimation

      However, the assumption of a uniform or non-informative prior for the latent variables is not always valid or desirable. In many cases, we have prior knowledge about the latent variables that can be incorporated into the model. This leads to the Maximum-A-Posteriori (MAP) Estimation as an alternative to MLE.

      In MAP estimation, \(\pof{z}\) is not constant, so we cannot drop it—we can still drop \(\pof{x}\), however—and maximize the joint distribution \(\pof{z, x}\), or equivalently:

      \[\text{minimize } z \text{ in } -\log \pof{x, z}=-\log \pof{x \given z} - \log \pof{z}.\]

Before we go further, we need to introduce notation for information-theoretic quantities and concepts that we will use throughout the post. (This next section is mostly shared with the sister post.)

      Background: Information-Theoretic Notation and Concepts

Information theory deals with the communication and quantification of information (see the excellent "Visual Information Theory" by Chris Olah for a visual introduction to information theory). In this post, we use a unified information-theoretic notation to express various quantities related to probability distributions and their relationships (it largely follows "A Practical & Unified Notation for Information-Theoretic Quantities in ML"). Here are some key concepts we will use:

      The information content of an event \(x\) is denoted as \(\Hof{x}\) and is defined as \(-\log_2 \pof{x}\), where \(\pof{x}\) is the probability of event \(x\) occurring. It represents the minimum amount of information needed to describe the occurrence of \(x\) given an underlying probability distribution. \(\Hof{x \given y}\) and \(\Hof{x, y}\) are analogously defined and denote the conditional and joint information content of random variables \(X\) and \(Y\), respectively. In machine learning, the information content is often used as a minimization objective, represented as the negative log-likelihood or cross-entropy when averaged over a dataset (see below).

      The entropy \(\Hof{X}\) of a random variable \(X\) is the expectation of its information content:

      \[\Hof{X} \triangleq \E{\pof{x}}{\Hof{x}} = \E{\pof{x}}{-\log \pof{x}}.\]

      The entropy measures the average amount of information needed to describe the random variable \(X\). It provides a measure of uncertainty or randomness associated with \(X\). We can similarly define the entropy of a conditional distribution \(\Hof{X \given Y}\) and the joint entropy \(\Hof{X, Y}\).

      We will also use the Kullback-Leibler divergence \(\Kale{\pof{X}}{\qof{X}}\) and the cross-entropy \(\CrossEntropy{\pof{X}}{\qof{X}}\):

      \[\begin{aligned} \CrossEntropy{\pof{X}}{\qof{X}} & = \E{\pof{x}}{-\log \qof{x}}\\ \Kale{\pof{X}}{\qof{X}} & = \CrossEntropy{\pof{X}}{\qof{X}} - \Hof{X} \end{aligned}\]

The cross-entropy quantifies the average number of bits needed to encode samples drawn from the true distribution \(\pof{X}\) using a code based on a different distribution \(\qof{X}\). The Kullback-Leibler divergence measures the difference between two probability distributions and captures the additional bits needed to encode samples from \(\pof{X}\) using \(\qof{X}\) compared to using the true distribution \(\pof{X}\).
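As a small numerical illustration (ours, not from the original post), these quantities are straightforward to compute for discrete distributions:

import jax.numpy as jnp

def entropy(p):
    return -jnp.sum(p * jnp.log2(p))

def cross_entropy(p, q):
    return -jnp.sum(p * jnp.log2(q))

def kl(p, q):
    return cross_entropy(p, q) - entropy(p)

p = jnp.array([0.5, 0.25, 0.25])
q = jnp.array([1 / 3, 1 / 3, 1 / 3])
print(entropy(p))           # 1.5 bits
print(cross_entropy(p, q))  # ~1.585 bits
print(kl(p, q))             # ~0.085 bits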

      Expressing Occam’s Razor in Information-Theoretic Terms

      Taking this notation into account, we can express Occam’s razor as:

      \[\text{prefer small } z \text{ for } \Hof{z \given x},\]

      where \(Z\) is the latent variable and \(X\) is the observed data. Note that \(x\) and \(z\) are individual realizations of the random variables \(X\) and \(Z\), respectively.

      The MLE and MAP objectives are accordingly:

      \[\text{minimize } z \text{ in } \Hof{x \given z} \text{ for MLE and } \Hof{x, z} \text{ for MAP.}\]

This measures the number of bits we need to encode the observed data given the latent variable for MLE, and the number of bits to encode both the observed data and the latent variable for MAP. This relates Occam’s razor to the minimum description length principle (see the Wikipedia article on Minimum Description Length for more details).

      Hyperparameter Learning and Model Selection

      In many machine learning tasks, we need to determine the best hyperparameters for a model or select the most suitable model architecture from several discrete options. The primary goal is to find the hyperparameters or model that generalizes best to new, unseen data.

      Both cases can be viewed as inferring a random variable \(\H\), which represents either the model choice as a categorical distribution or the hyperparameters as a continuous distribution. In this sense, \(\H\) can be considered as another latent variable in the model.

      For consistency, we will continue using \(\x\) to denote data points throughout this post. Although it is common to use \(\y\) for predictions and \(\x\) for side channel information, we will not require this distinction here and will stick to \(\x\) for simplicity.

      The same arguments discussed previously also apply in this context, and we can express the objective as:

      \[\text{minimize } \h \text{ in } \Hof{\x \given \h}.\]

      Model Parameters

      In addition to the hyperparameters \(\H\), we usually have model parameters \(\W\) for a given \(\h\) with a parameter distribution \(\pof{\w \given \h}\) that we need to infer based on observed data. These parameters are the learnable components of the model, such as the weights and biases in a neural network. For given \(\w\) and \(\h\), we can easily compute the likelihood \(\pof{\x \given \w, \h}\), which represents the probability of observing the data \(\x\) given the specific values of the parameters and hyperparameters. However, to make predictions or compute the marginal likelihood, we will need to consider the uncertainty in the parameter values by integrating over all possible \(\w\).

      Bayesian Model Averaging

      Bayesian Model Averaging (BMA) is a technique that integrates, or marginalizes, over the model parameters \(\W\) when making predictions. This accounts for the uncertainty in the model parameters, which is particularly useful when dealing with complex models, high-dimensional parameter spaces, and limited data. In contrast to the MLE or MAP estimate, which use a single parameter value \(\w\) for predictions, BMA provides a more robust and comprehensive approach. The probability of a new data point \(\x'\) under BMA is given by:

      \[\pof{\x' \given \x, \h} = \int \pof{\x' \given \x, \w, \h} \pof{\w \given \x, \h} \, \mathrm{d}\w,\]

      where \(\pof{\w \given \x, \h}\) is the posterior distribution of the parameters given the data, and \(\pof{\x' \given \x, \w, \h}\) is the likelihood of the new data point given the parameters, hyperparameters, and training data.

      While BMA offers benefits, it is computationally challenging, particularly when dealing with high-dimensional parameter spaces commonly encountered in deep learning models. To make BMA tractable, various approximation methods, such as Markov Chain Monte Carlo (MCMC) and Variational Inference, have been proposed.
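As a sketch (ours), given posterior samples from such an approximation, the BMA predictive reduces to a simple Monte Carlo average; log_lik and posterior_samples are hypothetical stand-ins for a model's log-likelihood function and a list of parameter samples.

import jax.numpy as jnp

def bma_predictive(log_lik, posterior_samples, x_new):
    # Monte Carlo estimate of p(x' | x, h): average the per-sample
    # likelihoods p(x' | x, w_s, h) over samples w_s ~ p(w | x, h)
    return jnp.mean(jnp.array([jnp.exp(log_lik(x_new, w)) for w in posterior_samples]))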

      Marginal Likelihood and Estimation

      Let’s now discuss the marginal likelihood and its relation to BMA. The marginal likelihood, denoted as \(\pof{\x \given \h}\), is the likelihood of the observed data given the hyperparameters, marginalized over all possible parameter values \(\W\). It is also known as the model evidence. To compute the marginal likelihood, we integrate over all possible \(\w\):

\[\pof{\x \given \h} = \int \pof{\x \given \w, \h} \pof{\w \given \h} \, \mathrm{d}\w,\]

      where \(\pof{\x \given \w, \h}\) is the likelihood of the data given the parameters and hyperparameters, and \(\pof{\w \given \h}\) is the prior distribution of the parameters given the hyperparameters.
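For illustration, here is a naive Monte Carlo sketch of the marginal information \(\Hof{\x \given \h} = -\log \pof{\x \given \h}\), sampling parameters from the prior and averaging likelihoods stably in log space (the `log_likelihood` function is a placeholder). As discussed later in this post, this estimator has high variance, since few prior samples explain the data well:

```python
import numpy as np
from scipy.special import logsumexp

def marginal_information(x, prior_samples, log_likelihood):
    """Naive Monte Carlo estimate of H[x | h] = -log p(x | h) in nats,
    using prior samples w_k ~ p(w | h) and log p(x | w, h)."""
    log_liks = np.array([log_likelihood(x, w) for w in prior_samples])
    # log (1/K) sum_k p(x | w_k, h), computed stably in log space
    return -(logsumexp(log_liks) - np.log(len(prior_samples)))
```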

      Comparing BMA to the marginal likelihood, we see that they match for individual data points. However, for multiple data points (i.e., conditioning on datasets), the marginal likelihood is more complex. “BMA” typically refers to making predictions for a single new data point, while the marginal likelihood can be considered for many points simultaneously. Apart from this difference, the two are equivalent. Let’s discuss the case of multiple data points in more detail to understand why computing the marginal likelihood on datasets is even more challenging.

      Datasets instead of Individual Data Points

So far, we have described everything as if we only had a single data point \(\x\). However, in practice, we often have a dataset \(\xNtuple = (\x_1, \x_2, \ldots, \x_N)\).

      Joint Marginal Information and Cross-Entropy

The easiest way to extend the previous definitions is to simply substitute \(\xNtuple\) for \(\x\) and assume we can compute a likelihood for the entire dataset using its joint predictive distribution:

\[\pof{\xNtuple \given \h} = \int \pof{\x_1, \x_2, \ldots, \x_N \given \w, \h} \, \pof{\w \given \h} \, \mathrm{d}\w.\]

      We can then maximize this likelihood or equivalently minimize the joint marginal information \(\Hof{\xNtuple \given \h}.\)

      If our model is exchangeable, meaning the order of the \(\x_n\) does not matter, we can equivalently take an expectation over all permutations of the data to obtain the joint marginal cross-entropy:

\[\CrossEntropy{\pdata{\X_1, \ldots, \X_N}}{\pof{\X_1, \ldots, \X_N \given \h}},\]

      where \(\pdata{\cdot}\) is an empirical data distribution that allows us to draw samples without replacement. In this case, the joint marginal information and cross-entropy are equivalent.

      With exchangeability, we can simply write \(\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNset}\) instead of using the tuple notation \(\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNtuple}\) as the order of the data points does not matter.

Conversely, if a model is not exchangeable, we can induce exchangeability by averaging over all permutations of the data points via ensembling. For example, deep learning models trained with stochastic gradient descent are generally not exchangeable, as the order and composition of the batches can impact the results. However, we can make them effectively exchangeable by training multiple models and averaging their predictions. In the limit of infinite models, the resulting ensemble will be exchangeable (though the ensemble might not necessarily perform better, as papers on training curricula have shown that batch order can be important).

      The joint marginal cross-entropy turns a potentially non-exchangeable joint information into an exchangeable one by taking an expectation.

      Marginal Information and Cross-Entropy

      Before we try to understand these joint expressions, we should consider alternative ways to extend the previous definitions.

      For instance, we could take the average of the likelihoods for individual data points:

      \[\frac{1}{N} \sum_{n=1}^N \pof{\x_n \given \h}.\]

Assuming an underlying data distribution \(\pdata{\x}\), we can also express this as an attempt to estimate:

\[\E{\pdata{\x}}{\pof{\x \given \h}} = \int \pof{\x \given \h} \, \pdata{\x} \, \mathrm{d}\x.\]

      This provides an average score for the data likelihood.

      However, from the perspective of Occam’s razor, simply taking the average likelihood is not the most principled approach. Instead, we can leverage information theory, which has been our tool of choice thus far. Recall that we prefer small values of the marginal information \(\Hof{\x \given \h}\). By taking the expectation over the data distribution, we obtain the individual marginal cross-entropy:

      \[\CrossEntropy{\pdata{\X}}{\pof{\X \given \h}} = \E{\pdata{\x}}{-\log \pof{\x \given \h}}.\]

      This cross-entropy measures the average number of bits needed to encode the data using the model’s probability distribution. As it does not involve a joint distribution, we refer to it simply as the marginal cross-entropy.

It is evident that the marginal cross-entropy and the average likelihood are not equivalent. By the convexity of the negative logarithm and Jensen’s inequality, the marginal cross-entropy is at least as large as the negative logarithm of the average likelihood:

      \[\begin{aligned} \CrossEntropy{\pdata{\X}}{\pof{\X \given \h}} &= \E{\pdata{\x}}{-\log \pof{\x \given \h}} \\ &\geq -\log \E{\pdata{\x}}{\pof{\x \given \h}} \\ &\approx -\log \frac{1}{N} \sum_{n=1}^N \pof{\x_n \given \h}. \end{aligned}\]
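A quick numerical sanity check of this inequality, using stand-in likelihood values:

```python
import numpy as np

rng = np.random.default_rng(0)
liks = rng.uniform(0.01, 1.0, size=1000)  # stand-in per-point likelihoods p(x_n | h)

marginal_cross_entropy = -np.log(liks).mean()  # E[-log p(x | h)]
neg_log_avg_likelihood = -np.log(liks.mean())  # -log E[p(x | h)]
assert marginal_cross_entropy >= neg_log_avg_likelihood  # Jensen's inequality
```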

The negative log-likelihood (NLL) is frequently used to evaluate a model’s performance after training, typically on a held-out validation set. This is equivalent to computing the cross-entropy between the empirical distribution of the validation set and the model’s predictive distribution, conditioned on the parameters learned from the training data:

      \[\CrossEntropy{\hpcof{\text{val}}{\X'}}{\pof{\X' \given \xNtuple, \h}}\]

      It is essential to distinguish this from the cross-entropy computed on the prior distribution of the model parameters before seeing any data, which is less useful for evaluating a trained model’s performance:

      \[\CrossEntropy{\hpcof{\text{val}}{\X'}}{\pof{\X' \given \h}}\]

      Only the NLL on a validation set conditioned on the training data provides an estimate of the model’s generalization ability after training. The same holds for the quantities marginalized over the model parameters.

      Marginal Cross-Entropy vs Joint Cross-Entropy

      Occam’s razor does not clearly specify which aggregate metric on \(\Hof{\x \given \h}\) we should prefer. Instead of the mean, we could use the median or a different quantile of the information content as a summary statistic to assess the model’s performance on the dataset. This might be more robust, as it is less sensitive to outliers.

      Crucially, the marginal cross-entropy and related summary statistics measure the model’s performance using the “prior” parameter distribution, not the posterior conditioned on data. However, the joint distribution captures something else, which can be seen more clearly using the chain rule:

\[\Hof{\xNset \given \h} = \sum_{n=1}^N \Hof{\x_n \given \x_1, \ldots, \x_{n-1}, \h}\]

Each term is a conditional marginal information, conditioned on the previous data points. Similarly, when we take an expectation over the data distribution, we obtain a chain of conditional marginal cross-entropies:

\[\begin{aligned} & \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNtuple} = \\ &\quad = \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_1} + \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_2 \given \X_1} \\ &\quad \quad + \ldots + \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_N \given \X_1, \X_2, \ldots, \X_{N-1}} \\ &\quad = \sum_{n=1}^N \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_n \given \X_{n-1}, \ldots, \X_1}. \end{aligned}\]

Each term in the sum is a conditional marginal cross-entropy, conditioned on the previous data points; this differs from the (unconditional) marginal cross-entropy, which we recognize as the first term.
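To make the chain rule concrete, here is a small sketch using a Beta–Bernoulli model, whose joint marginal information is available in closed form (the data and prior parameters are illustrative):

```python
import numpy as np
from scipy.special import betaln

def joint_marginal_information(xs, a=1.0, b=1.0):
    """H[x_1..n | h] = -log p(x_1..n | h) in nats for Bernoulli data with a
    Beta(a, b) prior on the success probability (closed form)."""
    n1 = sum(xs)
    return -(betaln(a + n1, b + len(xs) - n1) - betaln(a, b))

xs = [1, 0, 1, 1, 0]
# Chain rule: each conditional term is a difference of joint informations,
# and the conditional terms sum back up to the joint.
conditional_infos = [
    joint_marginal_information(xs[: n + 1]) - joint_marginal_information(xs[:n])
    for n in range(len(xs))
]
assert np.isclose(sum(conditional_infos), joint_marginal_information(xs))
```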

      The following visualization summarizes the relationship between the conditional and joint marginal cross-entropies and information. The chain rule tells us that the area under the curve of the conditional quantities equals the joint quantity.

      The relationship between conditional and joint marginal cross-entropies and information. Left: Conditional marginal cross-entropy (blue) for a multi-class classification problem. The area under the curve (orange) represents the joint marginal cross-entropy. As the dataset size increases, the conditional marginal cross-entropy decreases and converges to the best achievable loss for the given model hypothesis \(\h\). Right: Conditional marginal information (green). The area under the curve (red) represents the joint marginal information. The conditional marginal information is a noisy estimate of the conditional marginal cross-entropy, as it is computed on individual data points.

In summary, the marginal and joint cross-entropies offer different perspectives on a model’s performance (recent works by Ian Osband et al., starting with The Neural Testbed: Evaluating Joint Predictions, can help build intuition for joint predictions; a gentler introduction comparing marginal and joint predictions can be found in the arXiv note Marginal and Joint Cross-Entropies & Predictives for Online Bayesian Inference, Active Learning, and Active Sampling):

      • The marginal cross-entropy and related summary statistics assess the model’s performance using the prior parameter distribution, without considering the effect of the data on the model.
      • The joint marginal cross-entropy, expressed as a sum of conditional marginal cross-entropies, captures the model’s online learning performance as it processes the data sequentially.

      While both metrics are useful for evaluating models, the joint marginal cross-entropy provides insight into how well the model learns from the data during training. The conditional marginal cross-entropy, on the other hand, is more suitable for assessing the model’s generalization ability at a given point in time, without the influence of parameter updates.

      Intermediate Comparison

      This brings us back to the earlier question of what metric we should prefer and use for model selection. Let’s consider:

      1. The marginal cross-entropy, as in the first term, is likely not useful for model selection with deep learning models, as it is not conditioned on any data and thus cannot correlate well with the model’s performance after training.

      2. If we care about the model’s “generalization” performance after training on \(N-1\) data points without further adaptation, the marginal cross-entropy on the last data point is the more relevant quantity:

        \[\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_N \given \X_{N-1}, \ldots, \X_1}\]

        It measures the model’s performance on the last data point after having seen all previous data points, similar to a “leave-one-out” metric. Indeed, it is equivalent to leave-one-out cross-validation when we have an empirical data distribution consisting of \(N\) data points and sample without replacement.

      3. More generally, it is equivalent to cross-validation when we hold out more than one data point for evaluation from the empirical data distribution:

        \[\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X' \given \X_{N-k}, ..., \X_{1}}.\]

        This is the same expression as in (2.) but we assume there are more samples to draw from in the empirical data distribution \(\pdata{\x'}\). We call this term the conditional marginal cross-entropy and keep in mind its connection to cross-validation.

      4. On the other hand, if we care about the model’s performance as an online learner, or in the case of LLMs, as an in-context learner, the joint marginal cross-entropy becomes a more relevant metric. It measures the model’s ability to adapt and make accurate predictions as it sequentially processes new data points, conditioned on the information it has seen so far.

        In the context of online learning, the model receives data points one at a time and updates its predictions based on the cumulative knowledge gained from previous data points. The joint marginal cross-entropy captures how well the model incorporates this sequential information to make accurate predictions for future data points.

        Similarly, for in-context learning of LLMs, the model is provided with a prompt or context consisting of a sequence of data points, and it is expected to generate accurate completions or predictions based on this context. The joint marginal cross-entropy measures the model’s ability to effectively utilize the provided context to make accurate predictions for the next data point in the sequence.

      5. However, we would not want to use the unconditional joint marginal cross-entropy, but rather condition on some initial data to be closer to the actual use case of the model, which will have been (pre-)trained already. As such, we are interested in estimating a conditional joint marginal cross-entropy:

        \[\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \XNkset}.\]

        By conditioning on the previously seen data points, this metric assesses the model’s capacity to learn and adapt its predictions based on the evolving context. It provides a more fine-grained evaluation of the model’s sequential prediction performance, taking into account the specific order and dependencies within the data.

        Moreover, the conditional joint marginal cross-entropy can be used to compare different models or hyperparameter settings in terms of their online learning or in-context learning capabilities. By evaluating this metric on held-out data sequences, we can determine which model or setting is better suited for tasks that require sequential adaptation and context-dependent predictions.

      6. If we have a preferred order of the data points (or a split in the case of exchangeability), we can also consider the conditional joint marginal information:

        \[\Hof{\xNsetk \given \xNkset, \h}.\]

It is also known as the negative conditional log marginal likelihood.

      7. All these quantities are equally valid from the perspective of Occam’s razor.

      8. We have not yet discussed how to efficiently estimate these quantities, especially for deep learning models. More importantly, we have already considered that the joint marginal information (marginal likelihood), BMA, and the joint marginal cross-entropy (as an expectation over the marginal likelihood) are not easy to estimate.

This brings us to one of the main points: when we care about a model’s static generalization performance after training, the conditional marginal cross-entropy, and thus cross-validation, is the relevant quantity, not the (joint) marginal likelihood.

This is a crucial point that has not been sufficiently considered in the literature on model selection and hyperparameter learning, where the model evidence and marginal likelihood have been presented as the ultimate criteria. In practice, we rarely update a model on additional data during inference—this is changing with the advent of LLMs and strong in-context learners, but it is still not the norm.

      But why has the marginal likelihood been the preferred choice for model selection so far then?

      Different Data Regimes

      To explore when the conditional marginal cross-entropy and joint marginal cross-entropy lead to different outcomes for model selection and hypothesis testing, let’s consider a few key scenarios.

      For the discrete case, we can reduce the question to one about ranking: if we have two possible hyperparameter choices \(\h_1\) and \(\h_2\), when do we get the same ranking \(\h_1 \succ \h_2\) for both metrics?

      Model Misspecification

      First, let’s examine the case when we have a large amount of data available. Here, model misspecification, a common concern, plays a crucial role.

      As renowned statistician George Box famously stated:

      All models are wrong, but some are useful.

      George Box, Science and Statistics (1976)

      When working with real-world data, we must always assume that our models are misspecified to some degree. Models simplify complex systems and cannot capture every nuance of the data-generating process. Consequently, the goal of model selection is not to find the “true” model but rather to identify the most useful model that balances simplicity, interpretability, and predictive performance.

Without model misspecification, we would always converge to the data-generating model in the infinite data limit: the Bernstein–von Mises theorem tells us that the posterior converges to the maximum likelihood estimate (MLE), which in turn matches the data-generating model. In practice, however, we are always dealing with misspecified models, and the MLE will not converge to the true data-generating model.

      Infinite Data Limit

      Let’s return to our question of when the different quantities lead to similar rankings.

While a conditional joint marginal cross-entropy, as a sum of conditional marginal cross-entropies, is obviously larger than each individual term, dividing it by the number of samples in the conditional joint distribution yields its rate, the per-sample average, which can be compared more easily. (In this context, “rate” refers to the average amount of cross-entropy or information per (training) sample, drawing parallels to the concept of entropy rate in Shannon’s information theory. This usage is distinct from other common uses of “rate” in machine learning, such as learning rate or convergence rate.)

\[\begin{aligned} & \frac{1}{k} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \XNkset} \\ &\quad = \sum_{n=N-k+1}^N \frac{1}{k} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_n \given \X_{n-1}, ..., \X_1}. \end{aligned}\]

The Bernstein–von Mises theorem tells us that the posterior distribution of the model parameters converges to a normal distribution around the MLE as the number of data points goes to infinity. (There are likely fewer caveats to this statement than the naive interpretation of the theorem implies because we are usually not interested in converging towards some unique and identifiable parameters but rather in the predictions matching the data-generating process.) This means that the later terms in the chain rule decomposition of the joint cross-entropy converge to the same value as the amount of data we condition on grows. In the limit, we can ignore the first terms of the decomposition, and the per-sample average of the joint cross-entropy approaches the conditional cross-entropy. This matches a similar result on entropy rates in “Elements of Information Theory” by Cover & Thomas.

      Overall, we have (without formal proof):

      \[\begin{aligned} &\lim_{N \to \infty} \frac{1}{N} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNset} = \\ &\quad = \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^N \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_n \given \X_{n-1}, ..., \X_1} \\ &\quad = \lim_{N \to \infty} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X' \given \XNset}. \end{aligned}\]

      Given sufficient data (in the infinite sample limit), we see that either of these quantities will lead to the same ranking of different hyperparameters/model hypotheses. Conversely, we can expect to see meaningful differences only in low-data regimes, where the model is not yet fully adapted to the data.

      Finally, in the infinite data limit, for the conditional marginal cross-entropy, we don’t need to take an expectation over the data we condition on (as the model parameters will still have converged):

\[\begin{aligned} &\lim_{N \to \infty} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \XNkset} \\ &\quad = \lim_{N \to \infty} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \xNkset}, \end{aligned}\]

for any \(\xNkset \sim \pdata{\xNkset}\) as \(N \to \infty\). More importantly, this also holds for the joint marginal information, whose rate in the limit is the same as the rate of the joint marginal cross-entropy above (and thus also the limit of the conditional marginal cross-entropy):

      \[\begin{aligned} &\lim_{N \to \infty} \frac{1}{N} \Hof{\xNset \given \h} = \\ &\quad = \lim_{N \to \infty} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X' \given \XNset}. \end{aligned}\]

      We have previously mentioned the connection between cross-validation, leave-one-out validation, and the conditional marginal cross-entropy. This result also connects the marginal likelihood in the limit to these quantities.

Thus: given sufficient data, the (negative log) marginal likelihood, the joint marginal cross-entropy, and the conditional marginal cross-entropy (cross-validation) all lead to the same ranking of model hypotheses.

The catch is that “sufficient data” might be a very large amount of data, especially for highly expressive models like neural networks.

      Hence, we only expect these quantities to be meaningfully different in the low-data regime. So let’s focus on the low-data regime now.

      Prior-Data Conflict

      Even if different hyperparameter choices lead to the same generalization loss in the infinite data limit, they can induce different priors that affect the convergence speed and model performance in the low-data regime.

      In the low-data regime, assuming all models converge to the same validation loss given infinite data, we prefer the model that converges the fastest, i.e., with the least amount of training data. A model with a prior well-aligned with the data distribution learns efficiently and generalizes better with limited data.

      Conditional marginal cross-entropy vs. dataset size under different modeling scenarios. Left: Model misspecification - Three model hypotheses (\(\h_1\), \(\h_2\), \(\h_3\)) converge to different losses due to the model class not containing the true data-generating process. The minimum achievable loss represents the misspecification error. Right: Prior-data conflict - Three model priors (\(\h_1\), \(\h_2\), \(\h_3\)) converge to the same loss but at different speeds due to varying alignment with the data distribution. Priors with more mass near the MLE converge faster. Real-world models often face both prior-data conflict and model misspecification.

      In this scenario, the area under the conditional marginal cross-entropy or information curve (equivalent to the joint marginal cross-entropy, or joint marginal information) indicates the preferred model. The model with the lowest joint marginal information (highest log marginal likelihood) fits the available data best while having a prior enabling efficient learning and generalization.

      Anti-Correlated Model Misspecification and Prior-Data Conflict

      Finally, what happens when there are both model misspecification and a prior-data conflict in the low-data regime? If both are correlated, the ranking will be preserved, but if they are anti-correlated, the ranking might change.

      Let’s visualize this: the curves will intersect at some point, and the model with the best achievable loss in the infinite data limit might not be the best choice in the low-data regime, depending on how much data we can train on. The optimal model choice may also change based on the amount of available data.

      The conditional marginal cross-entropy is plotted for three different model hypotheses (\(\h_0\), \(\h_1\), \(\h_2\)) as a function of dataset size. The models exhibit both prior-data conflict and model misspecification. In the small data regime, \(\h_2\) has the lowest loss due to its prior aligning well with the data distribution, allowing for faster initial learning. However, as more data becomes available, the models’ asymptotic performance quickly plateaus. First, \(\h_1\) takes over, and then finally \(\h_0\), which converges to the lowest achievable loss in the infinite data limit, indicating it suffers the least from model misspecification. In contrast, \(\h_1\) and \(\h_2\) converge to higher loss values due to greater misspecification. Notably, the models’ performance ranking changes multiple times as the dataset grows, with \(\h_2\) being initially favored but ultimately having the worst infinite-data loss. Each model ranks best for the conditional joint marginal cross-entropy for some chosen range. This illustrates how the interplay between prior-data conflict and model misspecification can lead to different model selection decisions depending on the amount of available data and the metric used to measure performance.

      Here, the joint marginal cross-entropy and the joint marginal information (log marginal likelihood) might not lead to the same decision because the area under the curve at the start might be larger than what the best model can save later. This could change the ranking of the models compared to the conditional marginal cross-entropy (leave-one-out cross-validation) at the end of training, which serves as a proxy for the model’s generalization performance.

      Instead, the conditional joint marginal cross-entropy and information can shine here by conditioning “away” the beginning of the curve, thus giving us a better estimate of the conditional marginal cross-entropy (or expected information) at the point of interest.

      To formalize this, we can use the chain rule to split the joint marginal cross-entropy into two terms:

\[\begin{aligned} &\underbrace{\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNset}}_{\text{Joint Marginal Cross-Entropy}} = \\ &\quad = \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNkset} \\ &\quad \quad + \underbrace{\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \XNkset}}_{\text{Conditional Joint Marginal Cross-Entropy}}. \end{aligned}\]

Note that the per-sample averages of both terms converge to the same value in the infinite data limit—the conditional marginal cross-entropy (cross-validation loss), as discussed previously. However, the second term will converge faster because it does not include the constant \(\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNkset}\).

      We can also see both terms as approximating the conditional marginal cross-entropy (cross-validation loss) for a fixed \(N\) in the low-data regime. The per-sample average of the second term will provide a better approximation.

In summary, the consistency of the ranking will depend on the size of \(\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNkset}\) for different \(\h\) and how it compares to the conditional joint marginal cross-entropy \(\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \XNkset}\).

      This analysis highlights the importance of considering both prior-data conflict and model misspecification when selecting models in the low-data regime. The choice of performance metric and the amount of available data can significantly impact the ranking of models. The conditional joint marginal cross-entropy provides a more accurate estimate of the model’s generalization performance by conditioning away the initial part of the learning curve, which may be heavily influenced by prior-data conflict.

      Approximating the Validation Loss

      You may be wondering: why bother with the marginal likelihood or conditional joint marginal cross-entropy at all? Why not just always use leave-one-out cross-validation (i.e., the conditional marginal cross-entropy) or a simple validation loss?

      While that is a valid approach, the key question is: can we approximate the validation loss earlier in training, without fully training the model? Or can we do this more efficiently than performing inference on each element of a validation set?

      One option is to extrapolate the training loss to predict the validation loss. While potentially underexplored in this context, scaling laws have been found effective for predicting model performance.

      Alternatively, when training a model on a dataset for a single epoch—which is still surprisingly common for large language models, especially without active data sampling—the average training loss per batch provides a good approximation of the validation loss. With a cross-entropy loss, this is equivalent to estimating the conditional marginal cross-entropy.

      However, the batch size may not be large enough for a precise estimate. Averaging over the last few batches or using an exponential moving average can help, as the training losses on earlier batches were computed with older model parameters. Compared to using only the last batch’s loss, this smooths the estimate and reduces sensitivity to outliers.
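A minimal sketch of this smoothing (the decay value is illustrative):

```python
def ema_loss_estimate(batch_losses, decay=0.98):
    """Exponential moving average of per-batch training losses, usable as a
    running single-epoch estimate of the conditional marginal cross-entropy."""
    estimate = None
    for loss in batch_losses:
        estimate = loss if estimate is None else decay * estimate + (1 - decay) * loss
    return estimate
```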

In the multi-epoch setting, revisiting data points means the training loss no longer estimates the validation loss. Here, cross-validation offers a solution: train on the held-out data only in the last epoch, use the training losses on this first-visit data as a validation loss estimate, and obtain an ensemble of fully trained models without wasting data.

      In summary, while the validation loss is the gold standard, approximations based on the training loss or cross-validation can provide efficient estimates, especially in the early stages of training or with limited data.

      The Big Comparison

      In this post, we have explored various metrics for model selection and hyperparameter learning in the Bayesian context, focusing on the marginal likelihood, joint marginal cross-entropy, and conditional marginal cross-entropy. Our discussion has led to several key insights:

1. Infinite Data Limit: As the dataset size approaches infinity, the rate of the negative log marginal likelihood (equivalently, of the joint marginal information), the rate of the joint marginal cross-entropy, and the conditional marginal cross-entropy all converge to the same value when averaged over the data distribution. Given sufficient data, all these metrics will produce the same ranking of different model hypotheses or hyperparameter choices.

      2. Connection to Cross-Validation: The conditional marginal cross-entropy is equivalent to the expected cross-validation loss. Cross-validation is the gold standard for model selection in machine learning practice, where a model’s generalization performance is estimated by evaluating it on held-out validation data after training on the remaining data.

      3. Sufficient Data Requirement: The amount of data needed for the convergence of these metrics in the infinite data limit may be impractically large, especially for highly expressive models like deep neural networks. Therefore, the convergence property may not be directly relevant in many real-world scenarios.

      4. Low-Data Regimes: When data is limited, the metrics can differ significantly. The conditional marginal cross-entropy (or cross-validation loss) is often the more reliable choice for model selection targeting generalization performance, as it directly measures the model’s ability to predict unseen data after being trained on the available data.

      5. Sequential Prediction and Compression: The joint marginal cross-entropy, which corresponds to the negative log marginal likelihood, may be preferable if the focus is on a model’s overall sequential prediction performance or compression ability on the training data itself. It measures how well the model fits the entire training dataset jointly, without splitting into train and validation sets.

        Moreover, the conditional joint marginal information and cross-entropy are particularly relevant for measuring the performance of online learners and the in-context learning abilities of large language models (LLMs). These metrics capture the model’s ability to adapt and make accurate predictions based on the sequential information and evolving context after training on available data.

      6. Model Misspecification and Prior-Data Conflict: In practice, models often face a combination of model misspecification (where the true data-generating process is not contained within the model class) and prior-data conflict (where the prior distribution does not align well with the data distribution). The interplay between these factors can lead to different rankings of models depending on the amount of available data and the specific metric used for evaluation.

      While the marginal likelihood has been a popular tool for model selection and hyperparameter learning in the Bayesian community, its suitability depends on the specific context and goals. The conditional marginal cross-entropy, closely related to cross-validation, is often a more reliable choice when the primary objective is to optimize generalization performance. However, the conditional joint marginal cross-entropy (or conditional log marginal likelihood) may be preferable when the focus is on sequential prediction after training or measuring in-context learning abilities.

      Now, after having thought about all this in detail and mostly from first principles, let’s discuss the literature and how it supports or augments these considerations.

      Literature Review

      Having discussed the key concepts, we will now look at several influential papers that have shaped the previous discussion on model selection and hyperparameter tuning in the Bayesian context or have provided valuable insights into the marginal likelihood and its connections to other metrics.

      Fong and Holmes (2020): “On the marginal likelihood and cross-validation”

      Fong and Holmes (2020) explore the connection between the log marginal likelihood (joint marginal information) and cumulative leave-p-out cross-validation. Under exchangeability, they show that the joint marginal information can be rewritten as a cumulative sum of leave-p-out cross-validation terms.

      The authors define the leave-p-out cross-validation score as:

\[S_{CV}(\xNset;p) = \frac{1}{\binom{N}{p}} \sum_{V \in \binom{[N]}{p}} \frac{1}{p} \sum_{i=1}^p \Hof{\x^{V}_i \given \{\x^{\bar{V}}_k\}_{k=1}^{N-p}}\]

      where \(\binom{[N]}{p}\) denotes the set of all \(p\)-length subsets of \(\{1,...,N\}\)—the indices of the validation set—\(\x^V_i\) is the \(i\)-th validation data point, and \(\x^{\bar{V}}_k\) is the \(k\)-th training data point. This score measures the model’s performance using \(p\) validation points given the remaining data for training, equivalent to the respective conditional marginal cross-entropy.

      The cumulative leave-P-out cross-validation score is defined as:

      \[S_{CCV}(\xNset; P) = \sum_{p=1}^P S_{CV}(\xNset; p)\]

      This score focuses on the last \(P\) stages of the learning curve equally and is the same as the conditional joint marginal cross-entropy. For \(P=N\), the cumulative leave-N-out cross-validation score equals the joint marginal information:

      \[S_{CCV}(\xNset; N) = \Hof{\xNset}\]
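For intuition, this identity can be checked by brute-force enumeration on a tiny exchangeable model, reusing the Beta–Bernoulli joint information from the earlier sketch (feasible only for small \(N\); the data are illustrative):

```python
from itertools import combinations

import numpy as np
from scipy.special import betaln

def joint_info(xs, a=1.0, b=1.0):
    """-log p(x_1..n | h) for a Beta(a, b)-Bernoulli model (closed form)."""
    n1 = sum(xs)
    return -(betaln(a + n1, b + len(xs) - n1) - betaln(a, b))

def s_cv(xs, p):
    """Leave-p-out score: each held-out point is scored given only the N - p
    training points, averaged over validation points and validation sets."""
    N = len(xs)
    scores = []
    for val in combinations(range(N), p):
        train = [xs[i] for i in range(N) if i not in val]
        scores.append(
            np.mean([joint_info(train + [xs[i]]) - joint_info(train) for i in val])
        )
    return np.mean(scores)

xs = [1, 0, 1, 1, 0]
s_ccv = sum(s_cv(xs, p) for p in range(1, len(xs) + 1))
assert np.isclose(s_ccv, joint_info(xs))  # S_CCV(x; N) recovers H[x_1..N]
```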

      Comparing \(P<N\) to \(P=N\), Fong and Holmes highlight the potential sensitivity of the marginal likelihood to the choice of prior. They argue for using cumulative cross-validation following a preparatory training phase with \(P<N\) (e.g., \(10\%\) or \(50\%\)), demonstrating benefits over the full marginal likelihood for model selection, especially with vague priors or model misspecification.

      The paper also discusses the coherence of the log posterior predictive probability as a scoring rule in cross-validation and explores connections to prequential analysis and intrinsic Bayes factors.

      Fong and Holmes (2020) strongly support the ideas in this blog post, particularly the connections between marginal likelihood, cross-validation, and focusing on later learning curve stages for model selection. They establish the equivalence between the cumulative leave-p-out cross-validation score and conditional joint marginal information, aligning with our discussion of the conditional joint marginal cross-entropy as a more reliable metric compared to the full marginal likelihood.

      Lyle et al. (2020) and Ru et al. (2021): Training speed and model selection

      In “A Bayesian Perspective on Training Speed and Model Selection”, Lyle et al. (2020) establish a connection between training speed and the marginal likelihood in linear models. They propose using the sum of mini-batch training losses as a proxy for the log marginal likelihood to predict the generalization behavior of deep neural networks. This sum, referred to in later works as the training speed estimator (TSE), corresponds to the area under the learning curve. For 1-sample batches, the TSE is defined as:

      \[\text{TSE}(\xNset) = \sum_{n=1}^N \Hof{\x_n \given \w_n},\]

      where \(\Hof{\x_n \given \w_n}\) is the cross-entropy loss at training step \(n\) with model parameters \(\w_n\). Thus, an MLE estimate is used instead of conditioning on the data points \(\x_{<n}\) and using the BMA.

      The authors provide an iterative algorithm for linear models to estimate a lower bound on the LML over multiple epochs of training. This allows capturing the model’s performance as it sees more data points over the course of training, rather than being limited to a single epoch. They also discuss extending their estimator to the infinite-width limit of neural networks.

      Building upon Lyle et al. (2020), Ru et al. (2021) focus on using TSE for model selection in neural architecture search in “Speedy Performance Estimation for Neural Architecture Search”. They propose two variants of TSE: TSE-E, which focuses on the last few epochs, and TSE-EMA, which uses an exponential moving average to assign higher weights to later epochs:

      \[\begin{aligned} \text{TSE-E}(\xNset) &= \sum_{n=N-E+1}^N \Hof{\x_n \given \w_n}, \\ \text{TSE-EMA}(\xNset) &= \sum_{n=1}^N \alpha^{N-n} \Hof{\x_n \given \w_n}, \end{aligned}\]

      where \(\alpha \in (0, 1)\) is a hyperparameter controlling the decay rate.
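A small sketch of these estimators, operating on a sequence of recorded training losses \(\Hof{\x_n \given \w_n}\) (stated here per step rather than per epoch, matching the 1-sample-batch formulation above):

```python
import numpy as np

def tse(losses):
    """Training speed estimator: sum of per-step training losses."""
    return float(np.sum(losses))

def tse_e(losses, E):
    """TSE-E: sum over only the last E steps."""
    return float(np.sum(losses[-E:]))

def tse_ema(losses, alpha=0.9):
    """TSE-EMA: weight step n by alpha^(N - n), emphasizing later steps."""
    N = len(losses)
    weights = alpha ** (N - 1 - np.arange(N))  # step n = 1..N maps to index n - 1
    return float(np.sum(weights * np.asarray(losses)))
```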

      The authors hypothesize that assigning higher weights to later epochs may lead to better correlation with the true generalization performance of the final trained network, as the early epochs may be unstable and less informative.

      They demonstrate empirically that TSE-E and TSE-EMA can reliably estimate the generalization performance of neural architectures with a small training budget and remain effective for a large range of training epochs. TSE outperforms other efficient estimators, such as early stopping and learning curve extrapolation, in terms of rank correlation with the true test performance.

      The TSE estimators proposed by Ru et al. (2021) align closely with the ideas discussed in this blog post, as they prioritize the model’s performance in the later stages of learning. The empirical results presented by Ru et al. (2021) and Lyle et al. (2020) provide supporting evidence for the importance of going beyond the marginal likelihood.

      Lotfi et al. (2022/2023): “Bayesian Model Selection, the Marginal Likelihood, and Generalization”

      Lotfi et al. (2022/2023) provide a comprehensive re-evaluation of the marginal likelihood as a metric for predicting the generalization performance of trained models and learning hyperparameters. They argue that while the marginal likelihood is well-suited for prior hypothesis testing, it is only peripherally related to generalization after training. The authors identify several practical and philosophical issues in using the marginal likelihood for selecting between trained models, such as its sensitivity to the choice of prior, potential to lead to both underfitting and overfitting, and negative correlation with generalization performance in some cases.

      To address these limitations, Lotfi et al. propose the conditional marginal likelihood (CLML) as a partial remedy. The CLML is computed by conditioning on a subset of the training data, which helps to mitigate the influence of the prior and focus on the model’s performance under this posterior. It is also less sensitive to the number of parameters in the model. The authors demonstrate that the CLML is better correlated with generalization than the marginal likelihood and provides promising performance for deep kernel hyperparameter learning and neural architecture search.

      The CLML shares significant similarities with the cumulative leave-p-out cross-validation score proposed by Fong and Holmes (2020). Both approaches essentially propose the same metric, which focuses on the model’s performance in the later stages of learning and provides a more reliable indication of generalization compared to the full marginal likelihood. Lotfi et al. also critically compare their work to that of Lyle et al. (2020), but do not discuss the work of Ru et al. (2021).

      Lotfi et al. conduct an extensive empirical evaluation of the CLML across various settings, comparing it to the marginal likelihood and other baselines under different conditions, such as varying dataset sizes, model complexities, and hyperparameter settings. They demonstrate that the CLML consistently outperforms the marginal likelihood in terms of selecting the hyperparameters that lead to better generalization performance. The authors also acknowledge some limitations of their work, such as the need for further theoretical analysis of the CLML’s properties and the potential challenges in estimating the CLML for more complex models.

      The key novelty of Lotfi et al.’s work lies in their comprehensive analysis of the limitations of the marginal likelihood for model selection and hyperparameter learning, as well as their proposal of the CLML as a practical alternative that addresses these limitations.

      A Simple Toy Experiment

      To illustrate the concepts discussed in this post, we conduct a simple toy experiment using a Bayesian linear regression model. The goal is to demonstrate how the various information metrics behave under different prior settings and dataset sizes, and to show that none of the metrics are universally reliable. In particular, the joint marginal information may not be the best choice when the primary concern is static performance after training on data.

      Experimental Setup

      We generate a synthetic dataset with 64 features and 500 training and validation samples each. The true coefficients are drawn from a normal distribution with a mean of 2, and the target is the dot product between the features and the true coefficients.

      For the model, we use a Bayesian linear regression with an isotropic Gaussian prior on the weights (hyperparameter \(\wstddev\)) and independent Gaussian noise (hyperparameter \(\noisestddev\)). The model is misspecified when \(\noisestddev > 0\). We consider three different prior settings:

      • Model 1 (\(\h_1\)): \(\wstddev=0.1\), \(\noisestddev=0.8\)
      • Model 2 (\(\h_2\)): \(\wstddev=100\), \(\noisestddev=1.0\)
      • Model 3 (\(\h_3\)): \(\wstddev=1\), \(\noisestddev=1.2\)

      Thus, all three models are misspecified to varying degrees and exhibit different levels of prior-data conflict.

      We train the model on subsets of the training data of varying sizes, ranging from 1 to the full training set size, performing 5 trials with different splits. For each subset size, we compute the following metrics:

      • Joint Marginal Information (JMI)
      • Conditional Joint Marginal Information (CJMI) with half the data used for conditioning
      • Marginal Cross-Entropy (MCE) on the training set
      • Marginal Cross-Entropy (MCE) on the validation set
      • Training Speed (Approximate)
      • Joint Marginal Information Rate (JMI Rate)

The JMI is equivalent to the negative log marginal likelihood, the CJMI to the negative conditional log marginal likelihood, and the MCE corresponds to the cross-entropy loss. The Training Speed approximates an iterative algorithm by following the full data gradient. The JMI Rate is the JMI divided by the dataset size, which converges to the MCE in the infinite data limit.
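For this model, the joint marginal information has a closed form, since the targets are marginally Gaussian. The following sketch illustrates the computation for model \(\h_1\) (it mirrors the setup above but is illustrative, not the exact experiment code; the CJMI follows as a difference of two such joint terms):

```python
import numpy as np

def joint_marginal_information(X, y, w_std, noise_std):
    """-log2 p(y | X, h) in bits for Bayesian linear regression: with
    w ~ N(0, w_std^2 I) and Gaussian noise, marginally
    y ~ N(0, w_std^2 X X^T + noise_std^2 I)."""
    N = len(y)
    cov = w_std**2 * (X @ X.T) + noise_std**2 * np.eye(N)
    _, logdet = np.linalg.slogdet(cov)
    log_p = -0.5 * (N * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(cov, y))
    return -log_p / np.log(2)  # nats -> bits

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))
w_true = rng.normal(loc=2.0, size=64)
y = X @ w_true  # noiseless targets: any noise_std > 0 is misspecified

jmi = joint_marginal_information(X, y, w_std=0.1, noise_std=0.8)
# CJMI with half the data used for conditioning: H[x_>k | x_<=k, h].
cjmi = jmi - joint_marginal_information(X[:250], y[:250], w_std=0.1, noise_std=0.8)
```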

      Results

      The results of the experiment are summarized in the following plots:

      Information metrics for the three Bayesian linear regression models as a function of dataset size. The joint marginal information does not indicate the best performing model. The conditional joint marginal information (conditioned on half the dataset size, predicting on the other half) only finds the best model after 4/5 of the data are observed. Metrics are reported in bits (log base 2), five trials each.

      The plots show the behavior of the information metrics as the dataset size increases for the three different prior settings. Some key observations:

      • The marginal cross-entropy (MCE) metrics decrease as the dataset size increases, indicating improved model performance.
      • The joint marginal information (JMI) increases with more data, as it is equivalent to the area under the curve of the MCE on the training set. (As we take the average over multiple trials, its mean is actually an estimate of the joint marginal cross-entropy.)
      • The JMI rate, which is the JMI divided by the dataset size, decreases very slowly towards the same value as the MCE. This agrees with the previous discussion on the infinite data limit.
      • The training losses also decrease, while their sum, equal to the training speed estimator (TSE), increases with the dataset size.
      • The conditional joint marginal information (CJMI) with half the data used for conditioning shows a similar trend to the JMI but with lower values, as it focuses on the model’s performance on the held-back data. As we take an average over multiple trials, it is actually an estimate of the conditional joint marginal cross-entropy.

      To further analyze the model selection behavior, we computed the CJMI for different conditioning set sizes and selected the model with the lowest CJMI for each combination of dataset size and conditioning set size. The results are visualized in the following plot:

      Decision boundary for the best model amongst three (\(\phi_1\), \(\phi_2\), \(\phi_3\)) with the lowest conditional joint marginal cross-entropy/information, as a function of dataset size and held-back size. The three models \(\phi_1\), \(\phi_2\), and \(\phi_3\) correspond to different prior variances and noise levels. The white diagonal line shows where the conditional joint marginal information is computed using half the dataset size. In the region below this line, \(\phi_1\) (blue) has the lowest conditional joint marginal information, while \(\phi_2\) (orange) and \(\phi_3\) (green) are preferred for different dataset and held-back sizes.

      The plot shows which model is selected based on the lowest CJMI for different dataset sizes (x-axis) and conditioning set sizes (y-axis). The white line represents the case where half the data is used for conditioning (CJMI half in the previous plot). We observe that the model selection decision changes depending on the amount of available data and the size of the conditioning set/held-back data.

      A Narrow but Deep Dive into “Bayesian Model Selection, the Marginal Likelihood, and Generalization”

      Now that we have introduced the necessary concepts and discussed the literature, let’s take a closer look at the paper by Lotfi et al. (2022/2023).

      Use Cases and Pitfalls of the LML

      Lotfi et al. (2022/2023) present both the case for the log marginal likelihood (LML) as well as potential pitfalls when using it. They highlight the following use cases for the LML—quoted and paraphrased from the paper:

      1. Hypothesis testing: The LML provides an elegant mechanism to select between fixed prior hypotheses, even if each hypothesis is entirely consistent with observations. It automatically favors the most constrained hypothesis that fits the data, encoding a notion of Occam’s razor. The paper gives the example of the LML favoring general relativity over alternative explanations for Mercury’s orbit.

      2. Hyperparameter learning: The LML is often successfully used in practice to learn hyperparameters of the prior, finding the hyperparameters \(\h\) that maximize \(\pof{\mathcal{D} \given \h}\), where \(\mathcal{D}\) is a dataset. The paper highlights Gaussian processes as a compelling example, where the LML chooses kernel hyperparameters that make the distribution over functions likely to generate the training data, rather than simply maximizing data fit. The LML can learn many kernel parameters and be used where cross-validation would be intractable.

      3. Constraint learning: Unlike typical learning objectives like maximum likelihood, the LML is incentivized to select for constraints. It provides a consistent estimator for constraints, automatically selecting the most constrained solution that fits the data and collapsing to the true constraint value as the number of observations grows. Examples include the LML consistently estimating the true dimensionality in Bayesian PCA and automatically learning symmetries like rotation invariance.

      However, the paper argues that the LML has several pitfalls for model selection and generalization:

      1. Not aligned with generalization: The LML answers “what is the probability a prior model generated the training data?” rather than “how likely is the posterior to have generated withheld points?”. A prior that initially explains the data well can still lead to a posterior that generalizes poorly.

      2. Misaligned in model selection: The LML evaluates priors, while model selection should evaluate posteriors. Maximizing LML is not equivalent to selecting the best generalizing posterior.

      3. Can overfit: The LML can favor “simple” priors concentrated around overfit maximum likelihood solutions that generalize poorly.

      4. Underfitting bias in hyperparameter selection: The LML may not favor hyperparameters that make good parameters likely if they also make many poor parameters likely.

      Relating these points to the previous discussions:

      For hypothesis testing and hyperparameter learning (1. & 2.), the LML favors the simpler hypothesis that converges faster, implying a smaller area under the learning curve. This aligns with the discussion on prior-data conflict for similarly misspecified models.

      At the same time, the paper also states about the case of Mercury’s orbit that:

      We emphasize here we are comparing fixed prior hypotheses. We are not interested in how parameters of general relativity update based on orbital data, and then deciding whether the updated general relativity is the correct description of orbital trajectories.

This could be misconstrued as computing the marginal cross-entropy for the data under the prior, which is not what the LML is doing: it computes a joint marginal cross-entropy after all. The two questions in the first pitfall (1.) point to the joint and conditional marginal cross-entropies—the areas under the full and partial learning curves, respectively.

However, neither the LML nor the CLML aligns with static evaluation; both rather align with continued learning (pitfall 2.).

Pitfalls (3.) and (4.) relate to prior-data conflict and model misspecification when they are anti-correlated.

      Overall, all quantities can fail in the low-data regime. In the infinite data limit, model (mis-)specification dominates other factors, making the quantities less interesting.

      The “Conditional Marginal Likelihood” in Lotfi et al. (2022/2023)

      The paper introduces the conditional marginal likelihood (CLML) as a remedy for the pitfalls of the LML, matching the earlier definition of conditional joint marginal information:

      \[\Hof{\xset{}{N-P+1}{N} \given \xset{}{1}{N-P}, \h}.\]

Unlike the LML, which is invariant to data order, the CLML depends on how the data is split into a conditioning set and a validation set. To make the CLML permutation-invariant, the paper proposes averaging over different permutations, equivalent to the conditional joint marginal cross-entropy. However, this becomes computationally expensive, so the paper uses a single permutation with \(P=20\% \, N\) to ensure the posterior has sufficiently converged.

      Estimating the CLML and LML via Laplace Approximation

Computing the LML via sampling is intractable for deep neural networks. Estimating it from an uninformative prior leads to high-variance estimates, as most \(\w\) sampled from the prior will perform poorly on the data. While Monte Carlo sampling works well in high dimensions, it fails here because randomly sampling a good \(\w\) from the prior is incredibly unlikely.

While sampling from the prior to estimate the LML is intractable, we can fare better when sampling from a posterior to compute a CLML, which is the approach taken by the paper. The posterior is more concentrated around “good” \(\w\), and the paper uses a Laplace approximation (LA) to approximate it.

However, the LA only captures uncertainty around a single mode, underestimating the uncertainty before the model converges, as beautifully illustrated in the paper.

      This is especially relevant for overparameterized DNNs which have multiple diverse modes (Wilson, Izmailov, 2020; 2021, blog).

      Furthermore, when computing the CLML, the LA may similarly struggle to find meaningful \(\w\) that perform well on the held-out data when that data would meaningfully change the model, as the CLML decomposes into conditional marginal information terms that condition on these additional data sequentially.

      DNN Experiments: Validation Loss vs. CLML

The DNN experiments in Lotfi et al. (2022/2023) compare the CLML to the validation loss for DNNs on the CIFAR-10 and CIFAR-100 datasets. The results provide empirical evidence for the challenges of computing the CLML and raise the question of whether these approximations are meaningfully different from a validation loss.

The paper shows that while the CLML is better correlated with the generalization performance of the model than the LML, the validation loss is still better correlated with generalization performance than the CLML. Interestingly, the initially published DNN experiments in the first arXiv version of the paper did not actually compute the CLML but instead computed the validation loss. This was fixed in the second arXiv revision. (This bug was found by yours truly; see the appendix of this post.)

      However, given the previous discussions on the similarities between the CLML and cross-validation and difficulty of approximating the CLML meaningfully, this bug was not a major issue for the paper’s conclusions.

      Importantly, as we examine in the appendix of this post, when comparing the CLML using Monte Carlo sampling with the validation loss computed using Monte Carlo sampling for the Bayesian Model Average (BMA), the validation loss is still better correlated with the generalization performance than the CLML.

      Conclusion

      In conclusion, this blog post has challenged the conventional focus on the marginal likelihood and related quantities for Bayesian model selection as a direct consequence of Occam’s razor. It highlights the importance of considering context and goals when choosing a model selection criterion. By motivating MLE and MAP using Occam’s razor and questioning the uniqueness of the (conditional) joint marginal likelihood, we hope to encourage critical thinking about the foundations of these quantities.

However, it is important to acknowledge the limitations of our arguments and experiments. A more rigorous theoretical justification, a broader range of models and datasets, and a deeper engagement with the philosophical implications are needed to strengthen these insights. Moreover, because most of the presented methods assume a uniform model prior \(\pof{\h}\) and ignore model complexity, we have not discussed the latter in the necessary detail, even though it would be crucial to take into account from the perspective of minimum description length (MDL).

      Despite these limitations, our exploration of the connections between information-theoretic concepts and their behavior in different data regimes, along the lines of model misspecification and prior-data conflict, provides a necessary starting point for understanding recently proposed metrics.

      The toy experiment demonstrates that all discussed quantities can fail to reliably predict generalization under model misspecification and prior-data conflict, even for a basic setting using Bayesian linear regression. This emphasizes the need for caution when making claims about the superiority of any particular metric.

      Ultimately, the key takeaway is that there is no one-size-fits-all solution, and the choice of model selection criterion should be guided by a careful consideration of the specific context and goals at hand.


      Acknowledgements: We would like to thank the authors of the examined papers for their valuable contributions to the field and for inspiring this blog post. Claude-3 and GPT-4 were used to edit and improve this blog post (via cursor.sh).

      Reproducibility: The figures were created using matplotlib and seaborn in Python. The Bayesian linear regression model was implemented using numpy. The code for the toy experiment is available in this Google colab, and the code for the visualizations is available in this Google colab.


      Appendix

      Detailed Code Review of the DNN Experiments in Lotfi et al. (2022/2023)

      The logcml_ files in the repository contain the code to compute the CLML for partially trained models. However, instead of computing

\[\begin{aligned} \log p(\mathcal D_{\ge m} \mid \mathcal D_{< m}, \mathcal{M} ) &\approx \log \sum_{k=1}^K \frac{1}{K}\, p(\mathcal{D}_{\ge m} \mid w_k, \mathcal M ) \\ &= \log \sum_{k=1}^K \frac{1}{K}\, \prod_{j=m}^n p(y_j \mid x_j, w_k, \mathcal M ), \end{aligned}\]

      the code computes:

\[\begin{aligned} &\frac{1}{|\mathcal{D}_{\ge m}|}\,\sum_{j=m}^n \log p(\mathcal D_{j} \mid \mathcal D_{< m}, \mathcal{M} ) \\ &\quad \approx \frac{1}{|\mathcal{D}_{\ge m}|}\,\sum_{j=m}^n \log \sum_{k=1}^K \frac{1}{K}\, p(y_j \mid x_j, w_k, \mathcal M ), \end{aligned}\]

      which is the validation cross-entropy loss of the BMA (of the model trained with 80% of the training data).

      The high-level code that computes the CLML is:

bma_accuracy, bma_probs, all_ys = get_bma_acc(
    net, la, trainloader_test, bma_nsamples,
    hessian_structure, temp=best_temp
)
cmll = get_cmll(bma_probs, all_ys, eps=1e-4)

      get_bma_acc marginalizes over the LA samples before returning bma_probs:

[...]
for sample_params in params:
    sample_probs = []
    all_ys = []
    with torch.no_grad():
        vector_to_parameters(sample_params, net.parameters())
        net.eval()
        for x, y in loader:
            logits = net(x.cuda()).detach().cpu()
            probs = torch.nn.functional.softmax(logits, dim=-1)
            sample_probs.append(probs.detach().cpu().numpy())
            all_ys.append(y.detach().cpu().numpy())
        sample_probs = np.concatenate(sample_probs, axis=0)
        all_ys = np.concatenate(all_ys, axis=0)
        all_probs.append(sample_probs)

all_probs = np.stack(all_probs)
bma_probs = np.mean(all_probs, 0)
bma_accuracy = (np.argmax(bma_probs, axis=-1) == all_ys).mean() * 100

return bma_accuracy, bma_probs, all_ys

The important line is bma_probs = np.mean(all_probs, 0), which marginalizes over the predictions and returns the BMA prediction for each sample.

      Finally, get_cmll computes the validation loss for each sample independently (after applying a bit of label smoothing):

def get_cmll(bma_probs, all_ys, eps=1e-4):
    log_lik = 0
    eps = 1e-4  # note: overrides the eps argument
    for i, label in enumerate(all_ys):
        probs_i = bma_probs[i]
        probs_i += eps
        probs_i[np.argmax(probs_i)] -= eps * len(probs_i)
        log_lik += np.log(probs_i[label]).item()
    cmll = log_lik / len(all_ys)

    return cmll
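
For contrast, here is what an estimator of the joint CLML along the lines of the first equation above could look like: aggregate the per-sample joint log-likelihoods across the validation points before marginalizing over the posterior samples. This is only a minimal sketch, not the authors' fixed code; the array shapes and the helper name are our assumptions:

import numpy as np
from scipy.special import logsumexp

def get_cmll_joint(all_probs, all_ys):
    # Sketch only: all_probs has shape (K, N, C) with per-posterior-sample
    # predictive probabilities; all_ys has shape (N,) with integer labels.
    K, N, _ = all_probs.shape
    # Per-point log-likelihoods log p(y_j | x_j, w_k) for each sample w_k
    log_probs = np.log(all_probs[:, np.arange(N), all_ys])  # (K, N)
    # Joint log-likelihood of all points under each posterior sample
    joint = log_probs.sum(axis=1)  # (K,)
    # log (1/K) sum_k exp(joint_k): marginalize *after* taking the product
    return logsumexp(joint) - np.log(K)

The crucial difference is the order of operations: the product over data points happens inside the average over posterior samples, so the estimate can capture correlations between points instead of reducing to a per-point BMA loss.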

The DNN experiments in Section 5 and Section 6 of the first arXiv revision of the paper (v1) thus did not estimate the CLML per se but computed the BMA validation loss of a partially trained model (at 80% of the data) and found that this correlates positively with the test accuracy and test log-likelihood of the fully trained model (at 100%). This is not surprising: it is well-known that the validation loss of a model trained on 80% of the data correlates positively with the test accuracy (and generalization loss).

      Author Response from 2022

The following response sadly seems to mainly target the first draft of this post. However, it is also helpful for the final blog post and provides additional context.

      Thanks for your interest in our paper and your comments. Here are our comments about the blog as it is currently framed:

      (1) Thank you for pointing out a bug in the CLML computation for Figure 5b. We note that this bug is only relevant to a single panel of a single figure in the main text. We have re-run this experiment with the right CLML, and the results, attached here, are qualitatively the same. In summary, it was a very minor part of the paper, and even for that part it did not affect the take-away. We also attach the results of the correlation between the BMA test accuracy and the negative validation loss. You suggest in your post that the validation loss might correlate better with the BMA test accuracy than the CLML given that we use 20 samples for NAS. Our empirical results show the opposite conclusion. Additionally, we are not suggesting the CLML as a replacement to cross-validation but rather as a minor way to modify the LML for improvements in predicting generalization. Finally, we attach results for different sample sizes (20 samples vs. 100 samples) to address your comments on the sample size used to estimate the CLML. As we can see in the figure, the Spearman correlation factor is quite similar. 20 samples appears to provide a reasonable estimate of the CLML for these purposes, and is different from validation loss.

      (2) Your post currently opens by suggesting that there is something wrong with our experiments, likely either an LML approximation or a CLML issue, because we note that the LML correlates more poorly with generalization for larger datasets (where “large” is relative in the context of a specific experiment). A few points here: (i) this result is actually completely expected. The LML is in fact non-monotonic in how well it predicts generalization. For small datasets, the prior should be reasonably predictive of generalization. For intermediate datasets, the first terms in the LML decomposition have a negative effect on the correlation with generalization. For asymptotically large datasets, the first terms have a diminishing effect, and we get a consistent estimator; (ii) almost all of our experiments are exact, and we see this behaviour in the exact experiments for the Fourier model. For example, for the Fourier feature experiment in Fig 4(d), LML picks the better generalizing model for n < 50 and n > 296. For n in [50, 296] it picks the wrong model. For large neural network models, it is reasonable that the exact LML could pick the wrong model for CIFAR-sized datasets. (iii) any potential issues with the CLML are not relevant to these considerations, which are about the behaviour of the LML.

      (3) Your post currently suggests that issues with approximate inference could be responsible for our take-aways, rather than issues with the LML in general. But as we note in (2), almost all of our experiments use the exact LML and CLML: the density model, Fourier features, Gaussian processes, and deep learning exps on DKL, and there was never any bug associated with CLML computation in these experiments. The takeaways for the Laplace experiments are consistent with the exact experiments, and also expected, as above. While it’s true that the CLML can be estimated more effectively than the LML for the Laplace experiments, this is actually an advantage of the CLML that we note in the paper. The LML results also stand on their own, as we discuss above.

      (4) Your post places a lot of importance on Figure 5, as if it is the main result of the paper and our main “DNN” experiments. We stand by the results of Figure 5, but it is a relatively minor component of the paper. As we’ve mentioned most of our results are exact, including our DKL experiments, which are certainly the most substantial DNN experiments, with practically exciting results for transfer and few-shot learning. The DKL experiments are actually where we expect the CLML to be practically useful, and currently they seem to be overlooked in the post.

      (5) The blog seems to question the learning curve experiments, but these experiments in Figure 4 are exact, with no Laplace approximation, and relatively straightforward.

      (6) Your post seems to be negative about the CLML, presenting its similarity with cross-validation as a potential drawback, and implying the skepticism about the CLML should affect the interpretation of our take-aways. Two points here: (i) as above, the CLML is independent of most of our take-aways, which are about the properties of the LML; (ii) our goal with the CLML was not to introduce something starkly different from cross-validation, but to show how a very minor modification to the LML could improve alignment with generalization. Moreover, the DKL CLML results are quite promising as an efficient way to do gradient based estimation of a large number of hyperparameters.

      (7) The blog opens as if it is leading up to some fatal flaw. But as above, (i) the LML considerations are independent of the CLML, (ii) most of the experiments are exact, (iii) the trends for the exact and approximate inference procedures are the same and are naturally understandable and explainable, such as the non-monotonic trend in how well the LML correlates with generalization, and (iv) the CLML bug only affected Figure 5, panel b, and when it’s corrected the qualitative take-away is the same as before.

      We appreciate your interest and effort in reading the paper, and we think your questions will improve the clarity of the paper, which we have updated with an acknowledgement to you. Given the above considerations, we do think there would need to be substantial revisions to the blog post to accurately and fairly reflect the paper. We would appreciate being able to see the revisions before it’s posted.

      Best wishes,
      Sanae, Pavel, Greg, Micah, Andrew

      Ablation: CLML vs. BMA Validation Loss vs. (non-BMA) Validation Loss

      Let us examine the new results:

      In the three panels below, two panels show test accuracy vs. validation loss; one shows test accuracy vs. CLML. The left-most panel is the BMA test accuracy vs. (negative) BMA validation loss, the middle panel is vs. the CLML, and the right-most panel is vs. the (negative) non-BMA validation loss.

      Note that the left-most panel is from v1, which was accidentally computing the BMA validation loss, and whose axis label is adapted here from v1 for clarity. The two other plots are from v2 after fixing the bug. See commits here for fixing the CLML estimation and here for computing the non-BMA validation loss.

[Figure: BMA test accuracy vs. (from left to right) the negative BMA validation loss, the CLML, and the negative non-BMA validation loss.]

      At first glance, there might be an observer effect in the experiments for the validation loss. The BMA validation loss in v1 performs better than the CLML in v2, while the non-BMA validation loss in v2 underperforms the CLML in v2. When asked about it, the authors pushed the respective code (see link above) and explained that the updated, right-most panel computes the non-BMA validation loss, i.e., without LA samples. It seems surprising that there is such a difference between the non-BMA validation loss and BMA validation loss: the non-BMA validation loss is more than one nat worse on average than the BMA validation loss, based on visual inspection. Note that the plots here and in the paper compute the average CLML and average validation loss and are thus directly comparable.

      The authors said in their response that:

      You suggest in your post that the validation loss might correlate better with the BMA test accuracy than the CLML given that we use 20 samples for NAS. Our empirical results show the opposite conclusion.

This is only partially true. The BMA validation loss (which was accidentally computed in v1 instead of the CLML) correlates very well with the BMA test accuracy. This is not surprising given that this is the frequentist purpose of using validation sets. If validation sets were not correlating well with the test accuracy, we would not be using them in practice. 🤗 As such, this raises the question of why the non-BMA validation loss correlates negatively with the BMA test accuracy for ResNets and overall in the v2 results. Thus, only the non-BMA validation loss supports the now-opposite conclusion in v2 of the paper and in the authors’ response.

Yet what is also surprising is how well the BMA validation loss does compared to the CLML.

      Ablation: LA Sample Size

      Secondly, when we compare the reported values between BMA validation loss and CLML, we notice that the CLML is lower than the BMA validation loss by half a nat for \(\lambda=10^2\) and generally for CNNs.

However, even though the new experiments in v2 are supposed to reproduce the ones from v1, and we can assume that the same model checkpoints were used for re-evaluation (as retraining is not necessary), both the CLML and the non-BMA validation loss are off by about half a nat for the CNNs. As such, the above consideration might hold but might not provide the answer here.

Instead, we overlay the non-BMA validation loss and the CLML plots, both from v2, with a “difference blend”: it shows the absolute difference between the colors for overlapping data points (the circles 🔴 and triangles 🔺), leading to black where there is a match, a negative (green-ish) color for the CLML, and a positive (sepia) color for the validation losses. The background grids were used to align the plots, but we hid the ones from the CLML afterward; as such, the strong overlay is because the values are so close.

Surprisingly, or rather as predicted when the LA does not really do much, it turns out that the validation loss for the CNNs (🔴) almost exactly matches the estimated CLML with 20 LA samples upon visual inspection. To be more precise, either the models have already sufficiently converged, or the CLML estimate is not actually capturing the correlations between points and thus ends up being very similar to the validation loss.

This changes the interpretation of the sample ablation in the authors’ response. The ablation shows no difference between 20 and 100 LA samples, with 100 LA samples even having a slightly lower rank correlation. So it seems five times more LA samples are not sufficient to make a difference, or the Laplace posterior cannot capture the posterior as well as hoped. It would be interesting to examine this further. Kirsch et al. (2022) reported running toy experiments on MNIST with 10,000 MC Dropout samples without achieving good adaptation. The Laplace approximation is not MC Dropout, and this is speculation, but it seems in agreement. Notwithstanding the compute cost and feasibility, could posterior samples using HMC or similar more principled methods provide better estimates?

      All in all, given the above, it is fair to say that the estimate of the CLML is probably not as good as hoped, and further experiments might be needed to tease out when the CLML provides more value than the (BMA) validation loss. Note, however, that this question has not been explicitly examined in the paper. Instead, for DNNs, the paper only compares LML and CLML with distinct estimation methods.

Andreas Kirsch

Deep Equilibrium Models For Algorithmic Reasoning
2024-05-07 · https://iclr-blogposts.github.io/2024/blog/deqalg-reasoning

What is Algorithmic Reasoning?

Broadly, algorithmic reasoning studies how well neural networks can learn to execute classical computer science algorithms. In particular, to measure how well an algorithm has been learned, we look at size generalisation, i.e. if we train on inputs of size \(N\), we check how well the neural network performs on inputs of size \(2N\) or \(10N\). The idea is that neural networks often learn shortcuts that work well in-distribution but fail out-of-distribution, whereas classical computer science algorithms work no matter the input size. The purpose of this exercise is to study the generalisation of reasoning tasks, especially which tricks help to improve robustness and get the network closer to deducing logically rather than relying on statistical shortcuts.

      Why care about fixed-points?

First, let’s remember that for \(x_0\) to be a fixed point of a function \(f\), it must satisfy \(f(x_0) = x_0\). Second, we can observe that many algorithms consist of an update rule that you apply until there is no more change. The final output can easily be seen to be a fixed point! In a classical computer science algorithm, some smart person will have sat down and shown that under some conditions on the input this convergence will happen and the final answer is correct.

      An example algorithm would be the Bellman-Ford algorithm to compute the shortest-distance to a given node in a graph. Here the update rule looks like \(x_i^{(t+1)} =\min(x_i^{(t)}, \min \{x_j^{(t)} + e_{ij}\}_{j\in N(i)})\), where \(x_i^{(t)}\) is the shortest distance estimate to the source node at time \(t\), \(e_{ij}\) is the distance between nodes \(i\) and \(j\), and \(\{j\}_{j\in N(i)}\) are the neighbours of node \(i\). The algorithm says to apply this rule until there is no more change—a fixed point.
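
As a concrete illustration, here is a minimal Python sketch of this update rule run until a fixed point is reached (the graph representation is our own choice for illustration):

import math

def bellman_ford_fixed_point(n, edges, source):
    # edges: dict mapping (i, j) -> distance e_ij of an undirected graph
    x = [math.inf] * n
    x[source] = 0.0
    while True:
        x_new = list(x)
        for (i, j), e_ij in edges.items():
            # The update rule: x_i <- min(x_i, min_j {x_j + e_ij})
            x_new[i] = min(x_new[i], x[j] + e_ij)
            x_new[j] = min(x_new[j], x[i] + e_ij)
        if x_new == x:  # no more change: we have reached the fixed point
            return x
        x = x_new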

Interestingly, denotational semantics (a theoretical field of computer science) has shown that you can represent Turing-complete programming languages as mathematical functions. This is mostly quite trivial, with the exception of the while loop (which is also the key ingredient that makes a language Turing complete). Here the trick is a special mathematical operator that returns the least fixed point of a function! (If a function has no fixed point, then the corresponding while loop doesn’t terminate.) And thus we can see that fixed points are reached by all programs that terminate, and yet they aren’t used in neural networks that try to learn how to do reasoning. A missed inductive bias, perhaps?

      The details

      Task specification

The CLRS paper provides us with a benchmark dataset for algorithmic reasoning. The general structure of the data is a sequence in time of intermediate states of a given algorithm. In other words, at timestep \(t\) we have a state \(x_t\) that describes the various variables that the algorithm stores; e.g. in Bellman-Ford, \(x_t\) will contain the current estimate of the shortest path at each node of the graph. At each timestep \(t\) we then try to predict the next timestep: we do this by outputting some \(y_t\) from which we can extract \(x_{t+1}\). Note that \(y_t\) may be slightly different from \(x_{t+1}\), for instance because some state may never change by definition (e.g. the graph in Bellman-Ford), so we don’t predict it again. This is all illustrated in the next figure, where we split the state into a state at each node \(x\) and at each edge \(e\) for a given graph \(G\) as an example.

      Algorithmic Reasoning Task, diagram recreated from

      The architecture

The high-level architecture is that of an encoder-processor-decoder. The motivation is that neural networks perform well in high-dimensional spaces, but classical algorithms tend to operate on very low-dimensional variables; e.g. in Bellman-Ford the shortest distance would be a single scalar. Thus the encoder projects the state into a high-dimensional space \(z_t\), where the main computation is then done by the processor network, typically a graph neural network. The output of the processor \(z_{t+1}\) is then decoded back into the low-dimensional space by the decoder. The encoders and decoders mostly consist of linear layers, with the occasional exception, e.g. a softmax for categorical variables. The processor is a graph neural network, for which several different architectures have been explored. We either use the TripletMPNN, which adds edge message passing, or a simple MPNN with a linear message layer.

      High-level architecture employed

The processor is supposed to do the main computation of the network; in particular, the hope is that one iteration of the processor is equal to one iteration of the algorithm. In our example of Bellman-Ford, it would be one iteration of the update rule \(x_i^{(t+1)} =\min(x_i^{(t)}, \min \{x_j^{(t)} + e_{ij}\}_{j\in N(i)})\) (see also the figure below). Thus, the processor should indicate termination by no longer changing its output \(z\).
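
In code, the high-level pattern might look like the following sketch (dimensions, names, and the processor signature are our assumptions, not the exact CLRS implementation):

import torch
import torch.nn as nn

class EncodeProcessDecode(nn.Module):
    # Sketch of the encoder-processor-decoder architecture.
    def __init__(self, state_dim, hidden_dim, out_dim, processor):
        super().__init__()
        self.encoder = nn.Linear(state_dim, hidden_dim)  # low-dim state -> latent z
        self.processor = processor                       # e.g. an MPNN / TripletMPNN
        self.decoder = nn.Linear(hidden_dim, out_dim)    # latent z -> prediction y

    def forward(self, x_t, adj):
        z_t = self.encoder(x_t)
        # One processor step is hoped to match one iteration of the algorithm
        z_next = self.processor(z_t, adj)
        y_t = self.decoder(z_next)
        return y_t, z_next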

      Training

Traditionally, the training approach has been teacher forcing. In teacher forcing, we train each step of the algorithm independently by feeding the network the ground-truth \(x_t\) and computing the loss against \(y_t\) at all \(t\) simultaneously. This requires us to know the exact number of steps in the algorithm a priori. In other words, training with just teacher forcing will require us to tell the network the number of iterations it should run for at test time (which will vary depending on the input state). This is unrealistic in practice, where we would simply give our neural network the input state and ask it to run the algorithm on its own, which includes knowing when to stop the computation. While a termination network has been suggested in prior work, the issue is ignored in later papers.

Remember that neural networks are really good at learning in-distribution shortcuts. To more rigorously test whether the neural network has learned the underlying logical algorithm, we introduce a shift between the training and test distribution. If the network has learned the classical algorithm, it should be able to overcome this shift. Throughout, the CLRS algorithmic reasoning benchmark uses size generalisation, i.e. we train on examples of size 16 (i.e. the graph has 16 nodes) and at test time we use an input size of 64.

      An example algorithm: Bellman-Ford

      How can we do fixed-points in DNNs?

One approach to training neural networks that run until they reach a fixed point is deep equilibrium models (DEQs). We give a brief introduction to this approach next, based on the DEQ blogpost.

Given our input \(x\), our hidden state \(z\), and our processor \(f\), the goal is to optimise the fixed point \(z^*=f(z^*,x)\) we reach. The question is: how can we backprop through \(z^* = f(z^*,x)\)?

      In backprop, we ultimately want to compute

      \[\left(\frac{\partial z^*(.)}{\partial(.)}\right)^{\top} g\]

for some incoming gradient \(g\) from the layers after (in our case from the decoder) and \((.)\) being anything we want, but usually the weights of the network. We can show by implicit differentiation of \(z^* = f(z^*,x)\) that

      \[\left(\frac{\partial z^*(.)}{\partial(.)}\right)^{\top} g = \left(\frac{\partial f(z^*, x)}{\partial (.)}\right)^{\top}\left(I-\frac{\partial f(z^*, x)}{\partial z^*}\right)^{-\top}g\]

The difficult term to solve in the above equation is \(\left(I-\frac{\partial f(z^*, x)}{\partial z^*}\right)^{-\top}g\), which is the solution of a linear system, namely:

      \[\left(I-\frac{\partial f(z^*, x)}{\partial z^*}\right)^{\top}h = g\]

In general, we can try to solve it in two ways: using a linear system solver, like those found in torch.linalg, or computing a fixed point of

\[h = \left(\frac{\partial f(z^*, x)}{\partial z^*}\right)^{\top}h + g\]

      In the DEQ blogpost they suggest solving the above fixed point. The reason to use implicit differentiation is that backpropagating through time may easily run into exploding or vanishing gradients or error accumulation due to the number of steps needed to reach a fixed point.

We tried both: solving the linear system with torch.linalg.solve and finding the above fixed point. We converged on computing the fixed point of the equation above, as suggested by the deep equilibrium blogpost, as it is computationally faster, while the added accuracy of linear system solvers wasn’t beneficial. Note that this trade-off is heavily informed by what is readily implemented in PyTorch to run on GPU, hence the balance may shift in the future.
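
The following is a minimal PyTorch sketch of this setup, loosely following the DEQ blogpost: a naive fixed-point iteration for the forward pass, and the same iteration for the backward pass using vector-Jacobian products instead of materializing the Jacobian (the toy processor and all hyperparameters are our own choices):

import torch
import torch.nn as nn

def fp_solver(g, z0, iters=50):
    # Naive fixed-point solver: iterate z <- g(z); assumes g is a contraction
    z = z0
    for _ in range(iters):
        z = g(z)
    return z

class DEQFixedPoint(nn.Module):
    def __init__(self, f):
        super().__init__()
        self.f = f  # the "processor" f(z, x)

    def forward(self, x):
        # Forward: find z* = f(z*, x) without building an autograd tape
        with torch.no_grad():
            z_star = fp_solver(lambda z: self.f(z, x), torch.zeros_like(x))
        # One differentiable application re-attaches z* to the parameters of f
        z_star = self.f(z_star, x)
        z0 = z_star.clone().detach().requires_grad_()
        f0 = self.f(z0, x)

        def backward_hook(g):
            # Solve h = (df/dz*)^T h + g by the same fixed-point iteration
            return fp_solver(
                lambda h: torch.autograd.grad(f0, z0, h, retain_graph=True)[0] + g,
                g)

        z_star.register_hook(backward_hook)
        return z_star

# Toy usage: a contractive map thanks to tanh and down-scaled weights
lin = nn.Linear(8, 8)
with torch.no_grad():
    lin.weight.mul_(0.5)
deq = DEQFixedPoint(lambda z, x: torch.tanh(lin(z) + x))
loss = deq(torch.randn(4, 8)).pow(2).sum()
loss.backward()  # gradients flow through the implicit solve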

      Tricks we employ

      To encourage convergence we change the update function in the MPNN to be a minimum update, i.e. \(z^{(t+1)} = \min(z^{(t)}, z^{'(t+1)})\). This update rule is motivated by the problem of getting neural networks to converge to a fixed point. We discuss the effect of this in more detail after the experimental section.

Currently, gradient flows through the implicit differentiation explained above as well as back in time through standard backprop via \(z_t\). To enable more ways for the gradient to inform early steps in the algorithm, we propagate the gradient through \(y_t\) as well. For discrete \(y_t\), in other words for categorical variables in the state \(x_t\), we employ the Rao-Blackwellized straight-through Gumbel-Softmax estimator to allow gradients to flow.

Finally, we also try adding a loss on the number of steps via the penalty \(\sum_{t=0}^{T} \|z_{t+1} - z_{t}\|^2\). The penalty grows as we take more steps and stay away from the fixed point, thus hopefully encouraging faster convergence to a fixed point.
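
In code, these two tricks amount to something like the following sketch (names are ours):

import torch

def min_update(z_prev, z_candidate):
    # Minimum update: the hidden state can only decrease, which encourages
    # convergence to a fixed point
    return torch.minimum(z_prev, z_candidate)

def step_count_penalty(zs):
    # zs: list of hidden states [z_0, ..., z_T]; penalizes staying away
    # from the fixed point: sum_t ||z_{t+1} - z_t||^2
    return sum((z_next - z).pow(2).sum() for z, z_next in zip(zs[:-1], zs[1:]))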

      How well does it work?

In the table below we show the accuracy of the algorithms when tested on graphs of size 64. (What exactly is measured as the accuracy depends on each algorithm, but it is usually a pointer; e.g. in the Bellman-Ford algorithm it is a pointer to the previous node along the shortest path. For more details see the CLRS benchmark paper.)

DEQ is our approach of reaching a fixed point together with the implicit differentiation explained above. Hint propagation simply reaches a fixed point and backpropagates through time, with no implicit differentiation. Teacher forcing is used for the baselines, where the first number is for the simple MPNN architecture and the second for the more complex TripletMPNN (these numbers are taken from the paper). For Bellman-Ford and BFS we use the simple MPNN; for all others we use the TripletMPNN.

Algorithm      DEQ     Hint propagation   Teacher forcing (MPNN / TripletMPNN)
BellmanFord*   96.4%   96.7%              92% / 97%
Dijkstra       78.8%   84.4%              92% / 96%
BFS*           53.8%   57.1%              100% / 100%
DFS            5.0%    4.7%               7% / 48%
MST-Kruskal    82.3%   82.3%              71% / 90%
MST-Prim       75.2%   50.4%              71% / 90%

As we can see in the table above, the approach works very well for simpler algorithms such as Bellman-Ford, where with the simple MPNN we manage to achieve equal or better accuracy than the simple MPNN baseline and match the TripletMPNN. Interestingly, this is a parallel algorithm, i.e. all node representations run the same code, in contrast to sequential algorithms, which go through the graph node by node. We did try gating to enable the GNN to better mimic a sequential algorithm, but this didn’t help.

On the other algorithms, while we are able to learn, we cannot match the performance of teacher forcing, where the number of timesteps to run the neural network is assumed known. This additional help makes the comparison slightly unfair; however, it shows how difficult learning a fixed point is for the network, as we are not able to match the performance. We hypothesise about the reasons behind this in the next section.

      What’s the problem?

There are a few major issues that we noticed during training. The first is that the network is prone to underfitting: while we only show the test accuracy in the table above, the training error doesn’t actually reach 0. It is unclear what causes this; however, trying to solve some issues with the DEQ may solve this. So let’s delve into them.

      Convergence is a key issue

Firstly, the network will often take a large number of steps to reach a fixed point. We can see on easier algorithms like Bellman-Ford that the number of forward steps during training often reaches our set upper limit of 64 forward steps (the actual algorithm would take on average 4-5, at most 10, for this graph size). This is why we implement our architecture trick, where we update the next hidden representation only if it is smaller than the current one, i.e. \(z^{(t+1)} = \min(z^{(t)}, z^{'(t+1)})\), where \(z^{'(t+1)}\) is the output of our min aggregator in the message-passing step (alternatives such as gating and an exponential moving average update function were also tried). This helps with convergence, which enables finding a fixed point in simple cases, but fails to work reliably for more complex architectures and problems, while also introducing a different issue.

      The problem with hard constraints to achieve convergence

      Remember that during the implicit differentiation we are trying to solve

      \[h = \left(I-\frac{\partial f(z^*, x)}{\partial z^*}\right)^{-\top}g\]

i.e. in the linear system \(y = Ax\) our matrix \(A\) is equal to \(I-J\), where \(J\) is the Jacobian in the above equation. If the Jacobian is equal to the identity, then our matrix \(A=0\) and our system has no solution. In practice, \(z^{(t+1)} = \min(z^{(t)}, z^{'(t+1)})\) will reduce to \(f(z) = z\) in many dimensions of \(z\), so many rows of the Jacobian match the identity because the function effectively becomes \(f(x)=x\) in those dimensions. This leads to rows of \(A\) that are entirely zero, making the linear system ill-defined with no solution and causing the optimisation to break.

One solution is to try a soft-min, i.e. \(\mathrm{softmin}_{\tau}(a,b) = \frac{ae^{-a/\tau}+be^{-b/\tau}}{e^{-a/\tau}+e^{-b/\tau}}\). Here we gain the ability to trade off between convergence and the Jacobian being interesting. For \(\tau \ll 1\) we basically recover the min operation, and for \(\tau \gg 1\) we simply get an average, i.e. an exponential moving average. In practice, there was no trade-off for which we consistently had an interesting Jacobian while also converging sufficiently fast.

      What do we take away?

1. Training to reach a fixed point can work as a way to determine when to stop reasoning. But it gets increasingly more difficult as the underlying problem gets harder.
2. It’s unclear what inductive bias to choose in order to ensure fast enough convergence to a fixed point. There are downsides such as uninformative gradients at the fixed point.
3. Optimisation is tricky and stands in the way, in particular with implicit differentiation through the fixed point.
Sophie Xhonneux

Building Diffusion Model’s theory from ground up
2024-05-07 · https://iclr-blogposts.github.io/2024/blog/diffusion-theory-from-scratch

Introduction

      Motivation

Generative modeling has been around for decades, and a few promising model families have emerged and dominated the field for several years each in the recent past. VAEs dominated the generative modelling landscape from 2014 onwards, until GANs took off in 2015-16; Normalizing Flows (NFs) never really made it into mainstream generative modeling due to their restrictive architectural requirements. However, it is quite clear at this point that the magnitude of their impact is smaller than that of barely 2-3 years of Diffusion Models. This is mostly attributed to one of the seminal papers (by Jonathan Ho et al.), now popularly referred to as “Denoising Diffusion Probabilistic Models” or DDPM. With the exponential explosion of works following DDPM, it is very hard, or rather unnecessary, to look beyond this pivotal point.

In this article, we look back at the conceptual and theoretical ideas that were in development for a long time, even outside the field of core machine learning. We will show in later sections that some of the theoretical ‘pillars’ holding up Diffusion Models have their roots deep in statistical physics and other fields. A significant part of this theory was presented afresh in an ICLR paper (which won a best paper award). Lastly, even though the ideas presented in this article are quite theoretical, we made our best attempt to convey them with intuitive explanations, diagrams and figures, thereby expanding the potential audience. To encourage further exploration, we provide all code used to produce the figures (and experiments) of this article in this repository.

This article notes that, historically, there were two distinct roads of development that merged in order for modern diffusion models to emerge – “scalable estimation of the score” and “using the score for generative modelling”. The former is relatively short, while the latter traces its origin back to ~1900, if not earlier. This article explores these two paths independently – the latter one first, while assuming knowledge of the former. The rest of this introductory section is spent on defining the general modelling problem and the very notion of ‘score’ – the primary quantity of interest. The next section deals with how we can use the score in generative modelling, assuming access to an oracle for the true score. The last section dives solely into the problem of estimating the score in a scalable manner. It is worth mentioning that, in this article, we explain only the “sufficient and necessary” concepts needed to build the diffusion model framework, and hence the presentation may not directly resemble the typical formalism seen in most papers.

      Generative Modeling

      The problem of generative modeling, in most cases, is posed as parametric density estimation using a finite set of samples \(\{ x^{(n)} \}_{n=1}^N\) from a “true but unknown” data distribution \(q_{data}(x)\). With a suitable model family chosen as \(p_{\theta}(x)\), with unknown parameters \(\theta\), the problem boils down to maximizing the average (log-)likelihood (w.r.t \(\theta\)) of all the samples under the model

\[\theta^* = \arg\max_{\theta} \mathbb{E}_{x \sim q_{data}(x)} \left[ \log p_{\theta}(x) \right] \approx \arg\max_{\theta} \frac{1}{N} \sum_{n=1}^N \log p_{\theta}(x^{(n)})\]

It turned out, however, that defining an arbitrary parametric density \(p_{\theta}(x)\) is not as easy as it looks. One aspect of \(p_{\theta}\) is widely considered to be the evil behind this difficulty – the normalizing constant that stems from the axioms of probability

      \[p_{\theta}(x) = \frac{\tilde{p}_{\theta}(x)}{\color{purple} \int_x \tilde{p}_{\theta}(x)}\]

      Existing Frameworks

It was understood quite early on that any promising generative model family must have one property – ease of sampling, i.e. generating new data samples. Sampling was so essential to generative modeling that the model families that followed were all geared towards effective sampling, even if at the expense of other, not-so-important properties. It was also well understood that there was one common underlying principle most effective for crafting “sampling-centric” generative models – transforming simple probability densities. This formed the backbone of every single generative model family so far; be it VAEs, GANs or NFs, their generative process is a density transformation of this form

      \[x = f_{\theta}(z),\text{ where } z \sim \mathcal{N}(0, I)\]

which suggests starting with a simple density (often just a standard normal) followed by a functional transformation \(f_{\theta}\), typically a neural network with parameters \(\theta\). For VAEs, the function \(f_{\theta}\) is the decoder; for GANs, it’s the generator network; and for NFs, it’s the entire flow model. It is to be noted, however, that the way they differ is mostly in how they are trained, which may involve more parametric functions (e.g. VAE’s encoder or GAN’s discriminator) and additional machinery. This way of building generative models turned out to be an effective way of sidestepping the notorious normalizing constant.

      Diffusion is no different

Diffusion Models, at their core, follow the exact same principle, but with a slightly cleverer design. For diffusion models, the transformation \(f_{\theta}\) is rather complicated. It is a sequence of invocations of a neural function (denoted as \(s_{\theta}\)) along with some additional computation (denoted as \(g(\cdot)\))

      \begin{equation} \label{eq:diffusion_general_parametric_structure} x = g_1(g_2(g_3(\cdots z \cdots, s_{\theta}), s_{\theta}), s_{\theta}), \text{ where } z \sim \mathcal{N}(0, I) \end{equation}

This is a big difference between Diffusion Models and other generative model families. Prior generative families tried to learn the exact transformation directly via one parametric neural function \(f_{\theta}\). Diffusion Models, on the other hand, try to learn \(s_{\theta}\), a quantity very fundamental and intrinsic to any true data distribution \(q_{data}(x)\). The quantity in question has historically been called the “score”.

      The ‘Score’

The term ‘score’ is simply defined as the gradient of the log-density of a distribution, i.e. \(\nabla \log p(\cdot)\). In statistics, it is also known (but not very popularly) as the ‘informant’. One might argue that ‘score’ is rather a strange name for such a quantity. It so happens that the origin of this term can be traced (thanks to this StackOverflow answer by @ben) to a 1935 paper by Ronald Fisher, where he used the term in a very generic sense in order to “rank” some quantities. In the context of diffusion models, however, we stick to the modern definition of score. The true score of our data distribution is therefore defined as the gradient of the log of the true density of the data, w.r.t. the data variable

      \begin{equation} \label{eq:data_score_defn} \nabla_x \log q_{data}(x) \triangleq s(x) \end{equation}

      The quantity in Eq.\eqref{eq:data_score_defn} is unknown, just like the true data density \(q_{data}(x)\). It does have a meaning though: the “true score” refers to the direction of steepest increase in log-likelihood at any given point in the data space. See the gray arrows in the figure below.

Simply put, at a point \(x\), it tells us the best direction to step in (with a little step-size \(\delta\)) if we would like to see a point \(x'\) with slightly higher likelihood

\begin{equation} \label{eq:naive_score_steps} x' = x + \delta \cdot \left. \nabla_x \log q_{data}(x) \right|_{x = x} \end{equation}

      Please note that this stems just from the definition of the gradient operator \(\nabla\) in score. If you are familiar with gradient descent, you may find conceptual resemblance.

      Now, there are two burning questions here:

1. Considering we have access to the true score, is Eq.\eqref{eq:naive_score_steps} enough to define a generative process with appropriate convergence guarantees?
2. How do we actually get the true score?

The following two sections answer these questions respectively. Luckily, as we now understand, these two questions are somewhat decoupled and can be studied independently. The first section analyzes the first question, assuming we have access to the true score \(\nabla_x \log q_{data}(x)\). The second section explores how to get the true score, or rather, an approximation of it.

      Generative Modeling with Scores

As explained before, we would like to sample from the true data distribution \(q_{data}(x)\), but all we have access to (we assume) is its score \(s(x)\) as defined in Eq.\eqref{eq:data_score_defn}. One may define a naive generative process as the iterative application of Eq.\eqref{eq:naive_score_steps}. Intuitively, it is very similar to gradient descent, where we greedily climb the log-density surface to attain a local maximum. If so, we can already see a possible instance of the general structure of Diffusion’s generative process as hinted at in Eq.\eqref{eq:diffusion_general_parametric_structure}, with \(g(\cdot)\) being

\[g(z, s(\cdot)) = z + \delta \cdot s(z) = z + \delta \cdot \left. \nabla_x \log q_{data}(x) \right|_{x=z}\]

With a little reshuffling of Eq.\eqref{eq:naive_score_steps} and considering \(\delta \rightarrow 0\), one can immediately reveal the underlying ODE (ordinary differential equations, or ODEs, describe how a process evolves over time by its infinitesimal change) that describes the infinitesimal change

      \begin{equation} \label{eq:ode_with_score} dx = \nabla_x \log q_{data}(x) dt \end{equation}

BUT, please note that this is only an intuitive attempt and is entirely based on the definition of the score. It possesses absolutely no guarantee that this process converges to samples from the true data distribution. In fact, this process is greedy, i.e. it only seeks to go uphill, converging exactly at the modes (local maxima of probability density). The figure below shows the samples \(x\) subjected to the process in Eq.\eqref{eq:ode_with_score} and their density \(p_t(x)\) evolving over time. The density in red is the target density whose score (we assume we know it) is being used.

In this case, at \(t=\infty\), all samples will converge to the state with the highest likelihood (i.e. exactly at the center). This isn’t really desirable, as it doesn’t “explore” at all. Just like any other sampling algorithm, we need noise injection!

      Langevin Equation and Brownian Motion

It turned out that this problem was explored long ago in molecular dynamics by the French physicist Paul Langevin in the context of analyzing the movements of particles suspended in a fluid. He described the overall dynamics of particles, i.e. how the position of a particle changes over time \(t\) when in a potential energy field \(U(x)\)

      \begin{equation} \label{eq:original_langevin_dyn} dx = - \nabla_x U(x) dt + \sqrt{2} dB_t \end{equation}

The term \(dB_t\) is called “Brownian motion” and is effectively the source of noise – we will talk about this later in this subsection. Energy is considered “bad”, i.e. particles do not want to stay in a state with high energy. So they try to go downhill and settle in low-energy states using the gradient of the energy surface. The Langevin equation (i.e. Eq.\eqref{eq:original_langevin_dyn}) happens to provide sufficient “exploration” abilities so that the particles visit states with probability \(\propto e^{-U(x)}\). This suggests that we can treat “negative energy” as log-likelihood

      \[q_{data}(x) \propto e^{-U(x)} \implies \log q_{data}(x) = -U(x) + C \implies \nabla_x \log q_{data}(x) = - \nabla_x U(x)\]

By using the above substitution in the Langevin equation, we can move out of physics and continue with our ML perspective

      \begin{equation} \label{eq:langevin_dyn} dx = \nabla_x \log q_{data}(x) dt + \sqrt{2} dB_t \end{equation}

Note that this isn’t very different from our “intuitive” and greedy process in Eq.\eqref{eq:ode_with_score}, except for the noise term \(dB_t\) and a strange \(\sqrt{2}\). But this makes a difference! Brownian motion is an old construct from particle physics used to describe the random motion of particles in fluids/gases. It is simply Gaussian noise with infinitesimally small variance (in practice, the smaller the step you take, the smaller the noise you get)

\[dB_t \sim \mathcal{N}(0, dt) \implies dB_t = \sqrt{dt} \cdot z,\text{ where } z \sim \mathcal{N}(0, I)\]

With that, we can simulate our new Langevin equation with noise (i.e. Eq.\eqref{eq:langevin_dyn}) just like the noiseless case. You can now see that the noise keeps the process from entirely collapsing onto the mode. If you look carefully, we have added a little “tail” to each point to help visualize its movement.

      Fokker-Planck Equation

      The simulation is convincing; but it’d be even better if we can theoretically verify that the process in Eq.\eqref{eq:langevin_dyn} indeed converges to \(q_{data}(x)\). The key to this proof is figuring out \(p_t(x)\) and making sure that it stabilizes as \(t\rightarrow \infty\), i.e. \(p_{\infty}(x) = q_{data}(x)\). It turned out that a stochastic process of the form \(dx = \mu_t(x) dt + \sigma_t(x) dB_t\), acting on a random variable \(x\), induces a time-varying distribution that can be described by this ODE

      \begin{equation} \frac{\partial}{\partial t}p_t(x) = -\frac{\partial}{\partial x} \Big[ p_t(x)\mu_t(x) \Big] + \frac{1}{2} \frac{\partial^2}{\partial x^2} \Big[ p_t(x) \sigma^2_t(x) \Big] \end{equation}

This is a well-celebrated result known as the “Fokker-Planck equation”, which even predates the Langevin equation. So, the solution of this ODE is exactly what we are seeing in the above figure (middle). One can easily verify the convergence of Eq.\eqref{eq:langevin_dyn} by first observing \(\mu_t(x) = \nabla_x \log q_{data}(x), \sigma_t(x) = \sqrt{2}\) and then using \(\frac{\partial}{\partial t} p_{\infty}(x) = \frac{\partial}{\partial t} q_{data}(x) = 0\).

      \[\begin{eqnarray*} \frac{\partial}{\partial t}p_{\infty}(x) &=& -\frac{\partial}{\partial x} \Big[ p_{\infty}(x) \nabla_x \log q_{data}(x) \Big] + \frac{(\sqrt{2})^2}{2} \frac{\partial^2}{\partial x^2} \Big[ p_{\infty}(x) \Big] \\ \frac{\partial}{\partial t} q_{data}(x) &=& -\frac{\partial}{\partial x} \Big[ q_{data}(x) \nabla_x \log q_{data}(x) \Big] + \frac{(\sqrt{2})^2}{2} \frac{\partial^2}{\partial x^2} \Big[ q_{data}(x) \Big] \\ 0 \text{ (LHS)} &=& -\frac{\partial}{\partial x} \Big[ \nabla_x q_{data}(x) \Big] + \frac{\partial}{\partial x} \Big[ \nabla_x q_{data}(x) \Big] = 0\text{ (RHS)} \end{eqnarray*}\]

The LHS holds due to the fact that after a long time (i.e. \(t = \infty\)) the distribution stabilizes (it is then called a “stationary” or “equilibrium” distribution). Please also note that the proof above is for the 1-dimensional case and is included for illustrative purposes only – the general case is slightly more complicated.

So, we’re all good. Eq.\eqref{eq:langevin_dyn} is a provably correct way of sampling, given we have access to the true score. In fact, the very work (by Song et al.) that immediately precedes DDPM used exactly Eq.\eqref{eq:langevin_dyn} in its discrete form

\begin{equation} x_{t+\delta} = x_t + \delta \cdot \left. \nabla_x \log q_{data}(x) \right|_{x = x_t} + \sqrt{2\delta} \cdot z \end{equation}

      where \(\delta\) (a small constant) is used as a practical proxy for the theoretical \(dt\).
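
For concreteness, here is a minimal numpy sketch of this discretized Langevin sampler on a toy 1-D target, a two-component Gaussian mixture whose score we can write in closed form (the target and all constants are our own choices for illustration):

import numpy as np

def score(x, mus=(-2.0, 2.0), sigma=0.5):
    # Score of q(x) = 0.5 N(x; -2, sigma^2) + 0.5 N(x; 2, sigma^2):
    # a responsibility-weighted sum of the component scores (mu - x) / sigma^2
    w = np.stack([np.exp(-0.5 * ((x - m) / sigma) ** 2) for m in mus])
    w = w / w.sum(axis=0)
    return sum(wi * (m - x) / sigma**2 for wi, m in zip(w, mus))

rng = np.random.default_rng(0)
delta, n_steps = 1e-3, 5_000
x = rng.standard_normal(10_000)  # p_0 = N(0, 1); the starting point doesn't matter
for _ in range(n_steps):
    x = x + delta * score(x) + np.sqrt(2 * delta) * rng.standard_normal(x.shape)
# x now holds (approximate) samples from the bimodal target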

If you are already familiar with Diffusion Models, specifically their reverse process, you might be scratching your head. That is because the generative process in Eq.\eqref{eq:langevin_dyn} isn’t quite the same as what modern diffusion models do. We need to cross a few more hurdles before we get there.

      A probability path

More than just a proof, the Fokker-Planck ODE provides us with a key insight – gradually transforming one distribution into another is equivalent to traveling (over time) on a “path” in the space of probability distributions. Imagine a space of all possible probability distributions \(p\) (while each distribution varies in space, i.e. over \(x\), too, let’s hide that for now and imagine them to be just vectors). The Fokker-Planck ODE for Eq.\eqref{eq:langevin_dyn}, therefore, represents a specific dynamics on this probability space whose solution trajectory \(p_t\) ends at \(q_{data}\) at \(t = \infty\).

Speaking of ODEs, there is something we haven’t talked about yet – the initial distribution at \(t=0\), i.e. \(p_0\). In the simulation above, I quietly used a standard normal \(\mathcal{N}(0, I)\) as the starting distribution (you can notice this if you carefully watch the first few frames of the animation) without ever discussing it. It turns out that our Fokker-Planck ODE does not have any specific requirement for \(p_0\), i.e. it always converges to \(p_{\infty} = q_{data}\) no matter where you start. Here’s an illustration that shows two different starting distributions \(p_0\), and both of their “paths” over time, i.e. \(p_t\), in probability space ultimately converge to \(q_{data}\).

So theoretically, given the score function \(\nabla_x \log q_{data}(x)\) of a target distribution \(q_{data}(x)\), one can “travel to” it from any distribution. However, keeping in mind our need for sampling, it’s best to choose an initial distribution that is sampling-friendly. Strictly speaking, there are a couple of reasonable choices, but the diffusion model community ended up with the isotropic Gaussian (i.e. \(\mathcal{N}(0, I)\)). This is not only due to its goodwill across machine learning and statistics, but also the fact that in the context of SDEs with Brownian motions (remember, they are infinitesimal Gaussian noises), Gaussians arise quite naturally.

      Estimating the “score” is hard

So far, what we’ve talked about is just the generative process or, as the diffusion model literature calls it, the “reverse process”. But we haven’t really talked about the “forward process” yet, in case you are familiar with it. The forward process, in simple terms, is an ahead-of-time description of the “probability path” that the reverse process intends to take. But the question is, why do we need to know the path ahead of time – the reverse process seems quite spontaneous (in the sense that, given a score function, it just travels to the correct target distribution on its own), no? Sadly, this can’t be answered with theory alone.

      The problem lies in Eq.\eqref{eq:langevin_dyn} – let’s write it again with a little more verbosity

      \begin{equation} dx_t = \nabla_x \left. \log q_{data}(x) \right|_{x = x_t}\ dt + \sqrt{2} dB_t \end{equation}

Even though we wished to estimate \(\nabla_x \log q_{data}(x)\vert_{x = x_t}\) with a neural network \(s_{\theta}(x = x_t)\), this turned out to be extremely hard in practice. It was understood that one neural network is not enough to capture the richness of the score function at all values of \(x\). There were two options before us – one, make the neural network expressive enough, or two, learn the network only where it’s needed. The community settled on the second one because it was easier to solve.

So, what some of the pioneering works did is first fix a path (on probability space, like we showed above) and then learn the score only on that path. It’s all about specializing the neural network \(s_{\theta}(x_t, t)\) over \(t \in [0, \infty]\). The neural score estimator is capable of producing the right score if we provide the time \(t\), which we can, of course. We will see in the next section that, to learn the score of any distribution, we need samples from it. This begs the question: how do we get samples \(x_t\) (for all \(t\)) for training purposes? It certainly can’t be with Eq.\eqref{eq:langevin_dyn}, since it requires the score. The answer is that we need to run this process the other way – this is what Diffusion Models call the “forward process”.

      The “forward process”

Going the other way requires us to run a simulation from \(q_{data}(x)\) at \(t=0\) to \(t=\infty\), just the opposite of the animation above. Recall that we already saw how to do this. To go to any distribution at \(t=\infty\), all you need is its score and the Langevin equation. So how about we start from \(q_0 = q_{data}(x)\) this time (remember that the starting point doesn’t matter!) and run the Langevin simulation again with a known end target \(q_{\infty} = \mathcal{N}(0, I)\)?

\[\begin{eqnarray} dx &=& \nabla_x \log \mathcal{N}(x; 0, I)\, dt + \sqrt{2}\, dB_t \\ \label{eq:forward_sde} &=& -x\, dt + \sqrt{2\, dt}\ z \end{eqnarray}\]

It is interesting to note that, due to the target distribution being known in closed form, we do not see any awkward scores dangling around. The score of \(\mathcal{N}(0, I)\) is simply \(-x\) (we encourage the reader to verify this on their own as an exercise). The discretized version of Eq.\eqref{eq:forward_sde}, i.e.

      \[\begin{eqnarray*} x_{t+dt} &=& x_t - x_t \cdot dt + \sqrt{2 dt}\ z \\ &=& (1 - dt) x_t + \sqrt{2 dt}\ z \end{eqnarray*}\]

.. may resemble DDPM’s forward process (hint: compare \(dt\) with DDPM’s \(\beta_t\)).
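
A minimal numpy sketch of this discretized forward process (constants are our own choices): starting from an arbitrary data distribution, the iterates approach \(\mathcal{N}(0, I)\):

import numpy as np

rng = np.random.default_rng(0)
dt, T = 1e-3, 8.0
x = rng.uniform(-1, 1, size=100_000)  # x_0 ~ q_data (any data distribution)
for _ in range(int(T / dt)):
    x = (1 - dt) * x + np.sqrt(2 * dt) * rng.standard_normal(x.shape)
print(x.mean(), x.var())  # both approach 0 and 1, i.e. N(0, I)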

NOTE: A little subtlety here is that we only fixed the end point of the forward process, but not the exact path. It seems that running the Langevin equation in the forward direction chose one path on its own. It turns out that this is the “isotropic path”, where all dimensions of the variable \(x\) evolve in time the exact same way. Some works have recently uncovered non-isotropic diffusion, where it is indeed possible to travel on other paths. But this is outside the scope of this article.

We can simulate the above equation just like we did for the reverse process, in order to get samples \(x_t \sim q_t\). Below we show a simulation of the forward process.

While it is true that the reverse process is inherently sequential due to the arbitrary nature of the score, the forward process (in Eq.\eqref{eq:forward_sde}) is entirely known and hence can be exploited to ease the sequentiality. We can see a way out if we try to simplify (using the standard assumption of \(dt^2 = 0\)) the expression for \(x_{t+2dt}\) using \(x_{t+dt}\)

      \[\begin{eqnarray*} x_{t+2dt} &=& (1 - dt) {\color{blue} x_{t+dt}} + \sqrt{2dt}\ z_2 \\ &=& (1 - dt) {\color{blue} \left[(1 - dt) x_t + \sqrt{2 dt}\ z_1\right]} + \sqrt{2dt}\ z_2 \\ &=& (1 - 2dt) x_t + \sqrt{2dt(1-dt)^2 + 2dt}\ z_{12} \\ &=& (1 - 2 \cdot dt) x_t + \sqrt{2 \cdot 2dt}\ z_{12} \\ \implies x_{t+2dt} &\sim& \mathcal{N}((1 - 2 \cdot dt) x_t, 2 \cdot 2dt I) \end{eqnarray*}\]

The above simplification suggests that we can jump to any time \(t\), without going through the entire sequence, in order to sample \(x_t \sim q_t\). In fact, \(q_t(x_t\vert x_0)\) is Gaussian! This result opens up an interesting interpretation – generating \(x_0 \sim q(x_0 \vert x_t)\) can be interpreted as solving a “Gaussian inverse problem”, which we explore in a later section.
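
A small numpy sketch of this shortcut (our own constants; note that the first-order mean and variance derived above are only accurate while the total time \(k \cdot dt\) stays small):

import numpy as np

rng = np.random.default_rng(0)
dt, k = 1e-3, 100  # jump over k steps, i.e. to time t = k * dt = 0.1
x0 = rng.uniform(-1, 1, size=100_000)

# Sequential simulation of the forward process for k steps
x_seq = x0.copy()
for _ in range(k):
    x_seq = (1 - dt) * x_seq + np.sqrt(2 * dt) * rng.standard_normal(x0.shape)

# One-shot sample from the Gaussian q_t(x_t | x_0) derived above
x_jump = (1 - k * dt) * x0 + np.sqrt(2 * k * dt) * rng.standard_normal(x0.shape)

print(x_seq.mean(), x_jump.mean(), x_seq.var(), x_jump.var())  # closely match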

      All good for now, but there is one more thing we need to deal with.

      Finite time & the “schedule”

What we discussed so far, i.e. the forward and reverse processes, require infinite time to reach their end states. This is a direct consequence of using the Langevin equation. That, of course, is unacceptable in practice. But it so happens that there exists quite an elegant fix, which is well known in mathematics – we simply re-define what time means. We may choose a re-parameterization of time as, for example, \(t' = \mathcal{T}(t) = 1 - e^{-t} \in [0, 1]\) (note that \(t = 0 \implies t' = 0\) and \(t = \infty \implies t' = 1\), hence we converted the range \([0, \infty]\) to \([0, 1]\)). Plugging \(dt = \mathcal{T}'(t)^{-1} dt' = e^t dt'\) (one can easily see that \(t' = 1 - e^{-t} \implies dt' = e^{-t} dt \implies dt = e^t dt'\)) into the forward equation brings us even closer to DDPM’s forward process

      \[x_{t' + dt'} = (1 - {\color{blue}e^t dt'}) x_t + \sqrt{2 {\color{blue}e^t dt'}}\ z\]

This suggests that in the world where time runs from \(t' = 0 \rightarrow 1\), we need to accelerate the forward process by replacing \(dt\) with \(e^t dt'\). The quantity \(\mathcal{T}'(t)^{-1} dt' = e^t dt'\) is analogous to what diffusion models call a “schedule”. Recall that DDPM uses a small but increasing “schedule” \(\beta_t\) (\(e^t dt'\) is small because of \(dt'\), while increasing because of \(e^t\)).

Of course, our choice of the exact value of the end time (i.e. \(t' = 1\)) and the re-parameterization \(\mathcal{T}\) are somewhat arbitrary. Different choices of \(\mathcal{T}\), and consequently \(\mathcal{T}'(t)^{-1} dt'\), lead to different schedules (e.g. linear, cosine etc.).
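
As a small illustration, a discrete “schedule” induced by the re-parameterization above can be computed as follows (a sketch with our own grid, not DDPM’s actual constants):

import numpy as np

n_steps = 1000
t_prime = np.linspace(0.0, 0.999, n_steps)  # stop just short of t' = 1 (t = infinity)
t = -np.log(1.0 - t_prime)                  # invert T(t) = 1 - e^{-t}
dt_prime = t_prime[1] - t_prime[0]
beta = np.exp(t) * dt_prime                 # small but increasing, like DDPM's beta_t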

NOTE: Choosing a different schedule does not mean the process takes a different path in probability space; it simply changes its speed of movement over time towards the end state.

      Summary

To summarize: in this section, we started with the definition of ‘score’ and arrived at a stochastic process (thanks to an old result by Langevin) that, at infinite time, converges to the density associated with the score. We saw that this process is provably correct and can be interpreted as a “path” in probability space. We argued that, due to the difficulty of estimating the score everywhere along the path, we need samples at intermediate times \(t\) in order to specialize the score estimates. To do that, we had to travel backwards on the path, which can be done in closed form. We also saw how this process, even though it theoretically takes infinite time, can be shrunk down to a finite interval, opening up a design choice known as “schedules”.

      Estimating the Score

The previous section, while explaining the “sampling” part of score-based diffusion models, assumed that we have access to the true score \(\nabla_x \log q_{data}(x)\) via some oracle. That is, of course, untrue in practice. In fact, accessing the true score of an arbitrary distribution is just not possible (we only have access to the true score for distributions with a closed form, e.g. Gaussians). So the way forward, as mentioned before, is to estimate/learn it with a parametric neural network \(s_{\theta}(x)\). Recall, however, that all we have access to is samples from \(q_{data}(x)\).

If curious enough, one may question how realistic it is to estimate the score \(\nabla_x \log q_{data}(x)\) when we usually can NOT estimate the density \(q_{data}(x)\) itself. After all, the score is a quantity derived from the density! The answer becomes clear once you make the normalization constant explicit

      \[\begin{eqnarray*} \nabla_x \log q_{data}(x) &=& \nabla_x \log \frac{\tilde{q}_{data}(x)}{\int_{x} \tilde{q}_{data}(x) dx} \\ &=& \nabla_x \log \tilde{q}_{data}(x) - {\color{red}\nabla_x \log \int_{x} \tilde{q}_{data}(x) dx} \\ &=& \nabla_x \log \tilde{q}_{data}(x) \end{eqnarray*}\]

The part in red is zero because it has no dependence on \(x\). So the score very cleverly sidesteps the normalization constant. This is the reason score estimation gained momentum in the research community.

      Implicit Score Matching

The first notable attempt at this problem was by Aapo Hyvärinen back in 2005. His idea was simply to start from a loss function that, when minimized, leads to an estimator of the true score

      \begin{equation} J(\theta) = \frac{1}{2} \mathbb{E}_{x\sim q_{data}(x)}\Big[ \vert\vert s_{\theta}(x) - \nabla_x \log q_{data}(x) \vert\vert^2 \Big] \end{equation}

      It is simply an \(L_2\) loss between a parametric model and the true score, weighted by the probability of individual states (hence the expectation). But of course, it is not computable in this form as it contains the true score. Hyvärinen’s contribution was to simply show that, theoretically, the minimization problem is equivalent when the loss function is

      \begin{equation} \label{eq:impl_score_match} J_{\mathrm{I}}(\theta) = \mathbb{E}_{x\sim q_{data}(x)}\Big[ \mathrm{Tr}(\nabla_x s_{\theta}(x)) + \frac{1}{2} \vert\vert s_{\theta}(x) \vert\vert^2 \Big] \end{equation}

In the literature, this is known as “Implicit Score Matching”. The derivation is relatively simple and only involves algebraic manipulations – please see Appendix A of . The remarkable nature of this result stems from the fact that \(J_{\mathrm{I}}\) no longer contains the true score. The only dependency on \(q_{data}\) is via the expectation, which can be approximated by a sample average over our dataset.

But the key challenge with Implicit Score Matching was the \(\mathrm{Tr}(\nabla_x s_{\theta}(x))\) term, i.e. the trace of the Jacobian of the neural score model (equivalently, the Hessian of the model’s log-density), which is costly to compute. This prompted several follow-up works in the race towards scalable score matching, one of which (namely Denoising Score Matching) is used in diffusion models to this day.

For the sake of completeness, I would like to mention the work of Yang Song et al. around 2019, which proposed an engineering trick to alleviate the Hessian computation. They simply used the “Hutchinson trace estimator” (a stochastic way of computing the trace: $\mathrm{Tr}(M) = \mathbb{E}_{v\sim p_v} \left[ v^T M v \right]$, where $p_v$ can be one of many distributions, most notably $\mathcal{N}(0, I)$) to replace the \(\mathrm{Tr}(\cdot)\) in Eq.\eqref{eq:impl_score_match}, which eased the computation a bit. This approach, however, did not end up being used in practice.
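
As an illustration, the estimator is a few lines in PyTorch (a sketch under our own naming; `s` is any score network \(\mathbb{R}^D \rightarrow \mathbb{R}^D\)):

import torch

def hutchinson_trace_of_jacobian(s, x, n_probes=1):
    # Stochastic estimate of Tr(∇_x s(x)) via Hutchinson's estimator.
    # `x` must have requires_grad=True; one vector-Jacobian product per probe.
    est = 0.0
    for _ in range(n_probes):
        v = torch.randn_like(x)                   # v ~ N(0, I)
        (vjp,) = torch.autograd.grad(s(x), x, grad_outputs=v, create_graph=True)
        est = est + (vjp * v).sum(dim=-1)         # v^T (∇_x s) v
    return est / n_probes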

      Denoising Score Matching

The most valuable contribution came from Pascal Vincent in 2011, when he showed that the score matching problem has yet another equivalent objective, which was called “Denoising” score matching

      \begin{equation} \label{eq:deno_score_match} J_{\mathrm{D}}(\theta) = \mathbb{E}_{x\sim q_{data}(x), \epsilon\sim\mathcal{N}(0, I)}\left[ \frac{1}{2} \left|\left| s_{\theta}(\ \underbrace{x + \sigma\epsilon}_{\tilde{x}}\ ) - (- \frac{\epsilon}{\sigma}) \right|\right|^2 \right] \end{equation}

We deliberately wrote it in a way that exposes its widely accepted interpretation. Denoising score matching simply adds some known noise \(\sigma\epsilon\) to the datapoints \(x\) and learns (in the mean squared sense), from the “noisy” point \(\tilde{x}\), the direction of return, i.e. \((-\epsilon)\), scaled by \(\frac{1}{\sigma}\). In a way, it acts like a “de-noiser”, hence the name. It is theoretically guaranteed that \(J_{\mathrm{D}}\) leads to an unbiased estimate of the true score. Below we show a visualization of the score estimate as it learns from data.

A little algebraic manipulation of Eq.\eqref{eq:deno_score_match}, demonstrated by Ho et al. , leads to an equivalent form which turned out to be training-friendly.

      \[\begin{eqnarray} J_{\mathrm{D}}(\theta) &=& \mathbb{E}_{x\sim q_{data}(x), \epsilon\sim\mathcal{N}(0, I)}\left[ \frac{1}{2\sigma^2} \left|\left| {\color{blue} - \sigma s_{\theta}}(\tilde{x}) - \epsilon \right|\right|^2 \right] \\ &=& \mathbb{E}_{x\sim q_{data}(x), \epsilon\sim\mathcal{N}(0, I)}\left[ \frac{1}{2\sigma^2} \left|\left| {\color{blue} \epsilon}_{\theta}(\tilde{x}) - \epsilon \right|\right|^2 \right]\label{eq:deno_eps_match} \end{eqnarray}\]

We simply change the interpretation of what the network learns. In this form, the “noise estimator” network learns just the original pure Gaussian noise vector \(\epsilon\) that was added while crafting the noisy sample. So, from a noisy sample, the network \(\epsilon_{\theta}\) learns roughly a unit-variance direction that points towards the clean sample.
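
For concreteness, one training step of this epsilon-matching objective might be sketched in PyTorch as follows (names like `eps_model` are ours, not from the original):

import torch

def eps_matching_loss(eps_model, x0, sigma):
    # Sketch of Eq. (deno_eps_match): the network predicts the added noise.
    eps = torch.randn_like(x0)                # epsilon ~ N(0, I)
    x_tilde = x0 + sigma * eps                # noisy sample
    pred = eps_model(x_tilde)                 # epsilon_theta(x_tilde)
    return ((pred - eps) ** 2).sum(dim=-1).mean() / (2 * sigma ** 2)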

      There is yet another re-interpretation of Eq.\eqref{eq:deno_score_match} that leads to a slightly different perspective

      \[\begin{eqnarray} J_{\mathrm{D}}(\theta) &=& \mathbb{E}_{x\sim q_{data}(x), \epsilon\sim\mathcal{N}(0, I)}\left[ \frac{1}{2\sigma^4} \left|\left| {\color{blue}\tilde{x} + \sigma^2 s_{\theta}}(\tilde{x}) - (\underbrace{\tilde{x} - \sigma\epsilon}_{x}) \right|\right|^2 \right] \\ &=& \mathbb{E}_{x\sim q_{data}(x), \epsilon\sim\mathcal{N}(0, I)}\left[ \frac{1}{2\sigma^4} \left|\left| {\color{blue} x_{\theta}}(\tilde{x}) - x \right|\right|^2 \right]\label{eq:deno_endpoint_match} \end{eqnarray}\]

      Eq.\eqref{eq:deno_endpoint_match} shows, that instead of the noise direction towards clean sample, we can also have the clean sample directly as a learning target. This is like doing “denoising” in its true sense. We will get back to this in the next subsection.

      Probing the learning objective

      If you are still puzzled about how Eq.\eqref{eq:deno_eps_match} is related to learning the score, there is a way to probe exactly what the network is learning at an arbitrary input point \(\tilde{x}\). We note that the clean sample \(x\) and the noisy sample \(\tilde{x}\) come from a joint distribution that factorizes

\[q(x, \tilde{x}) = q(\tilde{x} \vert x) q_{data}(x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I)\, q_{data}(x).\]

      We then factorize this joint in a slightly different way, i.e.

      \[q(x, \tilde{x}) = q(x \vert \tilde{x}) q(\tilde{x})\]

where \(q(x \vert \tilde{x})\) can be thought of as the distribution of all clean samples which could’ve led to the given \(\tilde{x}\). Eq.\eqref{eq:deno_eps_match} can therefore be written as

      \[\begin{eqnarray*} J_{\mathrm{D}}(\theta) &=& \mathbb{E}_{(x, \tilde{x}) \sim q(x,\tilde{x})}\left[ \frac{1}{2\sigma^2} \left|\left| \epsilon_{\theta}(\tilde{x}) - \epsilon \right|\right|^2 \right] \\ &=& \mathbb{E}_{\tilde{x} \sim q(\tilde{x}), x \sim q(x\vert \tilde{x})}\left[ \frac{1}{2\sigma^2} \left|\left| \epsilon_{\theta}(\tilde{x}) - \frac{\tilde{x} - x}{\sigma} \right|\right|^2 \right] \\ &=& \mathbb{E}_{\tilde{x} \sim q(\tilde{x})}\left[ \frac{1}{2\sigma^2} \left|\left| \epsilon_{\theta}(\tilde{x}) - \frac{\tilde{x} - \mathbb{E}_{x \sim q(x\vert \tilde{x})}[x]}{\sigma} \right|\right|^2 \right] \\ \end{eqnarray*}\]

      In the last step, the expectation \(\mathbb{E}_{q(x\vert\tilde{x})}\left[ \cdot \right]\) was pushed inside, up until the only quantity that involves \(x\). Looking at it, you may realize that the network \(\epsilon_{\theta}\), given an input \(\tilde{x}\), learns the average noise direction that leads to the given input point \(\tilde{x}\). It also exposes the quantity \(\mathbb{E}_{x \sim q(x\vert \tilde{x})}[x]\), which is the average clean sample that led to the given \(\tilde{x}\).

      Below we visualize this process with a toy example, followed by a short explanation.

Explanation: We have 10 data points \(x\sim q_{data}(x)\) in two clusters (big red dots) and we run the learning process by generating noisy samples \(\tilde{x}\sim q(\tilde{x})\) (small red dots). Instead of learning a neural mapping over the entire space, we learn a tabular map with only three chosen input points \(\tilde{x}_1, \tilde{x}_2, \tilde{x}_3\) (blue, magenta and green cross). Every time we sample one of those three chosen input points (practically, it’s impossible to randomly sample a specific point, so we assume a little ball around each point), we note which input data point it came from (shown by connecting a dotted line of the same color) and maintain a running average (bold cross of the same color) of them, which is nothing but \(\mathbb{E}_{x \sim q(x\vert \tilde{x})}[x]\). We also show the average noise direction at each \(\tilde{x}\), i.e. \(\frac{\tilde{x} - \mathbb{E}_{x \sim q(x\vert \tilde{x})}[x]}{\sigma}\), with gray arrows. The gray arrows, as the training progresses, start to resemble the score estimate of the data.
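
The bookkeeping behind this toy demo can be sketched in a few lines of numpy (a rough reconstruction under our own assumptions, not the post’s actual code):

import numpy as np

rng = np.random.default_rng(0)
sigma, ball = 1.0, 0.3
# 10 clean points in two clusters, and 3 probe points x_tilde_1..3
data = np.concatenate([rng.normal(-2, 0.3, (5, 2)), rng.normal(2, 0.3, (5, 2))])
probes = np.array([[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0]])

sums, counts = np.zeros_like(probes), np.zeros(len(probes))
for _ in range(200_000):
    x = data[rng.integers(len(data))]              # x ~ q_data
    x_tilde = x + sigma * rng.standard_normal(2)   # x_tilde ~ q(x_tilde | x)
    hits = np.linalg.norm(probes - x_tilde, axis=1) < ball
    sums[hits] += x                                # running sums per probe
    counts[hits] += 1

post_mean = sums / np.maximum(counts, 1)[:, None]  # ≈ E[x | x_tilde = probe]
avg_noise = (probes - post_mean) / sigma           # the gray-arrow directions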

      Denoising as inverse problem

      A similar treatment, when applied on Eq.\eqref{eq:deno_endpoint_match}, yields the following

      \[\begin{eqnarray*} J_{\mathrm{D}}(\theta) &=& \mathbb{E}_{(x, \tilde{x}) \sim q(x,\tilde{x})}\left[ \frac{1}{2\sigma^4} \left|\left| {\color{blue}x_{\theta}}(\tilde{x}) - x \right|\right|^2 \right] \\ &=& \mathbb{E}_{\tilde{x} \sim q(\tilde{x})}\left[ \frac{1}{2\sigma^4} \left|\left| {\color{blue}\tilde{x} + \sigma^2 s_{\theta}}(\tilde{x}) - \mathbb{E}_{x \sim q(x\vert \tilde{x})}[x] \right|\right|^2 \right] \\ \end{eqnarray*}\]

Notice that I brought back the original form of \(x_{\theta}(\cdot)\) that involves the score. If we had the true score instead of a learned estimate, we would have

      \[\mathbb{E}_{x \sim q(x\vert \tilde{x})}[x] = \tilde{x} + \sigma^2 \nabla_{\tilde{x}} \log p(\tilde{x})\]

In the “inverse problem” and Bayesian literature, this is a celebrated result named “Tweedie’s Formula”, first published by Robbins but credited to statistician Maurice Tweedie. This theorem is applied in the context of Bayesian posterior estimation of a “true” quantity \(x\) which we only observe through a (Gaussian) noisy measurement \(\tilde{x}\). Tweedie’s formula tells us that the posterior mean of the inverse problem \(q(x\vert \tilde{x})\) can be computed without ever knowing the actual density, as long as we have access to the score at the noisy measurement.
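
As a quick numerical check (ours), consider a Gaussian prior, where the marginal score is available in closed form and Tweedie’s formula can be verified directly:

import numpy as np

# x ~ N(0, tau^2), x_tilde = x + sigma*eps  =>  x_tilde ~ N(0, tau^2 + sigma^2)
tau, sigma = 2.0, 0.5
rng = np.random.default_rng(0)
x = rng.normal(0.0, tau, 1_000_000)
x_tilde = x + rng.normal(0.0, sigma, x.shape)

score = lambda xt: -xt / (tau**2 + sigma**2)       # exact marginal score
tweedie_mean = 1.0 + sigma**2 * score(1.0)         # E[x | x_tilde = 1] via Tweedie

mask = np.abs(x_tilde - 1.0) < 0.01                # Monte-Carlo comparison
print(tweedie_mean, x[mask].mean())                # both ≈ tau^2/(tau^2+sigma^2)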

      Summary

      In this section, we explored the problem of scalable score matching. We looked at the notable attempts in the literature and learned that score can be estimated from samples only. We also looked at several interpretations of the learning objective and the connections they expose.

      Last few bits

      Incorporating time

In the last section, we expressed and explained everything in terms of one known noise level \(\sigma\) and the noisy sample \(\tilde{x}\). We did so to avoid cluttering together multiple concepts that aren’t necessary to explain each other. In a previous section, however, we learned that the score must be estimated along every timestep of the forward process. Simply augmenting Eq.\eqref{eq:deno_score_match} with an additional time variable \(t \sim \mathcal{U}[0, 1]\) is sufficient to induce the time dependency in the score matching problem

      \begin{equation} \label{eq:deno_score_match_with_time} J_{\mathrm{D}}(\theta) = \mathbb{E}_{x_0, \epsilon, t \sim \mathcal{U}[0, 1], x_t\sim q_t(x_t\vert x_0) }\left[ \frac{1}{2} \left|\left| s_{\theta}(x_t, t) - (- \frac{\epsilon}{\sigma_t}) \right|\right|^2 \right] \end{equation}

.. where \(q_t(x_t \vert x_0)\) is defined in a previous section and \(\sigma_t\) is its standard deviation.
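
Concretely, one training step of the time-conditioned objective could look as follows (a rough PyTorch sketch; for simplicity we use a variance-exploding form \(x_t = x_0 + \sigma_t \epsilon\), whereas the forward process defined earlier also shrinks the mean of \(x_0\)):

import torch

def time_dsm_loss(s_model, x0, sigma_t_fn):
    # Sketch of Eq. (deno_score_match_with_time); `sigma_t_fn` maps t to the
    # standard deviation of q_t(x_t | x_0). Both names are ours.
    t = torch.rand(x0.shape[0])                    # t ~ U[0, 1]
    sigma_t = sigma_t_fn(t).unsqueeze(-1)
    eps = torch.randn_like(x0)
    x_t = x0 + sigma_t * eps
    target = -eps / sigma_t
    return 0.5 * ((s_model(x_t, t) - target) ** 2).sum(dim=-1).mean()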

We took a different approach

We would like to highlight that, in this article, we first explored the reverse process and then showed why the forward process emerges out of necessity. Typical diffusion model papers start from a forward process specification of the form

      \[dx_t = f(t)x_t dt + g(t) {dB}_t\]

      .. and then use Anderson’s SDE reversal to explain the reverse process, which also involves the score

      \[dx_t = \left[ f(t) x_t - g(t)^2 \underbrace{\nabla_{x_t} \log q_t(x_t)}_{s_{\theta}(x_t, t)} \right] dt + g(t) dB_t\]

      We argue that our approach is more “organic” in the sense that it builds up the theory chronologically, exploring the exact path the community went through over time.

      Conclusion

In this article, we dived deep into the theoretical fundamentals of diffusion models, which are often ignored by practitioners. We started from the ‘heart’ of diffusion models, i.e. scores, and built the concepts up almost chronologically. We hope this article will serve as a conceptual guide toward understanding diffusion models from the score SDE perspective. We intentionally avoided the ‘probabilistic Markov model’ view of diffusion, since more and more works embrace the SDE formalism.

      ]]>
      Ayan Das
      Sample Blog Post2024-05-07T00:00:00+02:002024-05-07T00:00:00+02:00https://iclr-blogposts.github.io/2024/blog/distill-exampleNote: please use the table of contents as defined in the front matter rather than the traditional markdown styling.

      Equations

      This theme supports rendering beautiful math in inline and display modes using MathJax 3 engine. You just need to surround your math expression with $$, like $$ E = mc^2 $$. If you leave it inside a paragraph, it will produce an inline expression, just like \(E = mc^2\).

      To use display mode, again surround your expression with $$ and place it as a separate paragraph. Here is an example:

      \[\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)\]

      Note that MathJax 3 is a major re-write of MathJax that brought a significant improvement to the loading and rendering speed, which is now on par with KaTeX.

      Images and Figures

It’s generally a better idea to avoid linking to images hosted elsewhere - links can break and you risk losing important information in your blog post. To include images in your submission in this way, you must do something like the following:

      {% include figure.html path="assets/img/2024-05-07-distill-example/iclr.png" class="img-fluid" %}

      which results in the following image:

      To ensure that there are no namespace conflicts, you must save your asset to your unique directory /assets/img/2024-05-07-[SUBMISSION NAME] within your submission.

      Please avoid using the direct markdown method of embedding images; they may not be properly resized. Some more complex ways to load images (note the different styles of the shapes/shadows):

      A simple, elegant caption looks good between image rows, after each row, or doesn't have to be there at all.

      Interactive Figures

      Here’s how you could embed interactive figures that have been exported as HTML files. Note that we will be using plotly for this demo, but anything built off of HTML should work (no extra javascript is allowed!). All that’s required is for you to export your figure into HTML format, and make sure that the file exists in the assets/html/[SUBMISSION NAME]/ directory in this repository’s root directory. To embed it into any page, simply insert the following code anywhere into your page.

      {% include [FIGURE_NAME].html %} 

      For example, the following code can be used to generate the figure underneath it.

import pandas as pd
import plotly.express as px

df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/earthquakes-23k.csv')

fig = px.density_mapbox(
    df, lat='Latitude', lon='Longitude', z='Magnitude', radius=10,
    center=dict(lat=0, lon=180), zoom=0, mapbox_style="stamen-terrain")
fig.show()

fig.write_html('./assets/html/2024-05-07-distill-example/plotly_demo_1.html')

      And then include it with the following:

<div class="l-page">
  <iframe src="{{ 'assets/html/2024-05-07-distill-example/plotly_demo_1.html' | relative_url }}" frameborder='0' scrolling='no' height="600px" width="100%"></iframe>
</div>

      Voila!

      Citations

      Citations are then used in the article body with the <d-cite> tag. The key attribute is a reference to the id provided in the bibliography. The key attribute can take multiple ids, separated by commas.

      The citation is presented inline like this: (a number that displays more information on hover). If you have an appendix, a bibliography is automatically created and populated in it.

      Distill chose a numerical inline citation style to improve readability of citation dense articles and because many of the benefits of longer citations are obviated by displaying more information on hover. However, we consider it good style to mention author last names if you discuss something at length and it fits into the flow well — the authors are human and it’s nice for them to have the community associate them with their work.


      Footnotes

Just wrap the text you would like to show up in a footnote in a <d-footnote> tag. The number of the footnote will be automatically generated. (This will become a hoverable footnote.)


      Code Blocks

      This theme implements a built-in Jekyll feature, the use of Rouge, for syntax highlighting. It supports more than 100 languages. This example is in C++. All you have to do is wrap your code in a liquid tag:

      {% highlight c++ linenos %}
      code code code
      {% endhighlight %}

      The keyword linenos triggers display of line numbers. You can try toggling it on or off yourself below:

#include <cstring>
#include <iostream>
#include <string>
using namespace std;

int main(int argc, char const *argv[])
{
    string myString;

    cout << "input a string: ";
    getline(cin, myString);
    int length = myString.length();

    // allocate a raw char buffer (+1 for the null terminator) and copy into it
    char *charArray = new char[length + 1];
    strcpy(charArray, myString.c_str());

    for (int i = 0; i < length; ++i) {
        cout << charArray[i] << " ";
    }

    delete[] charArray;
    return 0;
}

      Diagrams

      This theme supports generating various diagrams from a text description using jekyll-diagrams plugin. Below, we generate a few examples of such diagrams using languages such as mermaid, plantuml, vega-lite, etc.

Note: different diagram-generation packages require external dependencies to be installed on your machine. Also, be mindful that, because of diagram generation, the first time you build your Jekyll website after adding new diagrams will be SLOW. For any other details, please refer to the jekyll-diagrams README.

      Note: This is not supported for local rendering!

      The diagram below was generated by the following code:

{% mermaid %}
sequenceDiagram
    participant John
    participant Alice
    Alice->>John: Hello John, how are you?
    John-->>Alice: Great!
{% endmermaid %}

      Tweets

      An example of displaying a tweet:

      An example of pulling from a timeline:

      For more details on using the plugin visit: jekyll-twitter-plugin


      Blockquotes

      We do not grow absolutely, chronologically. We grow sometimes in one dimension, and not in another, unevenly. We grow partially. We are relative. We are mature in one realm, childish in another. —Anais Nin

      Layouts

      The main text column is referred to as the body. It is the assumed layout of any direct descendants of the d-article element.

      .l-body

      For images you want to display a little larger, try .l-page:

      .l-page

      All of these have an outset variant if you want to poke out from the body text a little bit. For instance:

      .l-body-outset

      .l-page-outset

      Occasionally you’ll want to use the full browser width. For this, use .l-screen. You can also inset the element a little from the edge of the browser by using the inset variant.

      .l-screen

      .l-screen-inset

      The final layout is for marginalia, asides, and footnotes. It does not interrupt the normal flow of .l-body-sized text except on mobile screen sizes.

      .l-gutter


      Other Typography?

      Emphasis, aka italics, with asterisks (*asterisks*) or underscores (_underscores_).

      Strong emphasis, aka bold, with asterisks or underscores.

      Combined emphasis with asterisks and underscores.

      Strikethrough uses two tildes. Scratch this.

      1. First ordered list item
      2. Another item ⋅⋅* Unordered sub-list.
      3. Actual numbers don’t matter, just that it’s a number ⋅⋅1. Ordered sub-list
      4. And another item.

      ⋅⋅⋅You can have properly indented paragraphs within list items. Notice the blank line above, and the leading spaces (at least one, but we’ll use three here to also align the raw Markdown).

      ⋅⋅⋅To have a line break without a paragraph, you will need to use two trailing spaces.⋅⋅ ⋅⋅⋅Note that this line is separate, but within the same paragraph.⋅⋅ ⋅⋅⋅(This is contrary to the typical GFM line break behavior, where trailing spaces are not required.)

      • Unordered lists can use asterisks
      • Or minuses
      • Or pluses

      I’m an inline-style link

      I’m an inline-style link with title

      I’m a reference-style link

      I’m a relative reference to a repository file

      You can use numbers for reference-style link definitions

      Or leave it empty and use the link text itself.

      URLs and URLs in angle brackets will automatically get turned into links. http://www.example.com or http://www.example.com and sometimes example.com (but not on Github, for example).

      Some text to show that the reference links can follow later.

      Here’s our logo (hover to see the title text):

      Inline-style: alt text

      Reference-style: alt text

      Inline code has back-ticks around it.

var s = "JavaScript syntax highlighting";
alert(s);

s = "Python syntax highlighting"
print(s)

No language indicated, so no syntax highlighting.
But let's throw in a <b>tag</b>.

      Colons can be used to align columns.

| Tables        | Are           | Cool  |
| ------------- | :-----------: | ----: |
| col 3 is      | right-aligned | $1600 |
| col 2 is      | centered      | $12   |
| zebra stripes | are neat      | $1    |

      There must be at least 3 dashes separating each header cell. The outer pipes (|) are optional, and you don’t need to make the raw Markdown line up prettily. You can also use inline Markdown.

| Markdown | Less    | Pretty |
| -------- | ------- | ------ |
| Still    | renders | nicely |
| 1        | 2       | 3      |

      Blockquotes are very handy in email to emulate reply text. This line is part of the same quote.

      Quote break.

      This is a very long line that will still be quoted properly when it wraps. Oh boy let’s keep writing to make sure this is long enough to actually wrap for everyone. Oh, you can put Markdown into a blockquote.

      Here’s a line for us to start with.

      This line is separated from the one above by two newlines, so it will be a separate paragraph.

      This line is also a separate paragraph, but… This line is only separated by a single newline, so it’s a separate line in the same paragraph.

      ]]>
      Albert Einstein
      Sample Blog Post (HTML version)2024-05-07T00:00:00+02:002024-05-07T00:00:00+02:00https://iclr-blogposts.github.io/2024/blog/distill-example2 This is a sample blog post written in HTML (while the other sample post is written in Markdown). Authors have the choice to write in HTML or Markdown. While Markdown is easier to write, HTML gives you more control over the layout of your post. Furthermore, Markdown often interacts in unexpected ways with MathJax and other HTML widgets. If you are having trouble with Markdown, try writing in HTML instead.

      Note: please use the table of contents as defined in the front matter rather than the traditional markdown styling.

      Equations

      This theme supports rendering beautiful math in inline and display modes using MathJax 3 engine. You just need to surround your math expression with $$, like $$ E = mc^2 $$. If you leave it inside a paragraph, it will produce an inline expression, just like \(E = mc^2\).

      To use display mode, again surround your expression with $$ and place it as a separate paragraph. Here is an example: $$ \left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right) $$

      Note that MathJax 3 is a major re-write of MathJax that brought a significant improvement to the loading and rendering speed, which is now on par with KaTeX.

      Images and Figures

It’s generally a better idea to avoid linking to images hosted elsewhere - links can break and you risk losing important information in your blog post. You can display images from this repository using the following code:

      {% include figure.html path="assets/img/2024-05-07-distill-example/iclr.png" class="img-fluid" %}

      which results in the following image:

      To ensure that there are no namespace conflicts, you must save your asset to your unique directory `/assets/img/2024-05-07-[SUBMISSION NAME]` within your submission.

Please avoid using the direct HTML method of embedding images; they may not be properly resized. Below are some more complex ways to load images (note the different styles of the shapes/shadows):

      A simple, elegant caption looks good between image rows, after each row, or doesn't have to be there at all.

      Interactive Figures

      Here's how you could embed interactive figures that have been exported as HTML files. Note that we will be using plotly for this demo, but anything built off of HTML should work. All that's required is for you to export your figure into HTML format, and make sure that the file exists in the `assets/html/[SUBMISSION NAME]/` directory in this repository's root directory. To embed it into any page, simply insert the following code anywhere into your page.

      {% include [FIGURE_NAME].html %}

      For example, the following code can be used to generate the figure underneath it.

import pandas as pd
import plotly.express as px

df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/earthquakes-23k.csv')

fig = px.density_mapbox(
    df, lat='Latitude', lon='Longitude', z='Magnitude', radius=10,
    center=dict(lat=0, lon=180), zoom=0, mapbox_style="stamen-terrain")
fig.show()

fig.write_html('./assets/html/2024-05-07-distill-example/plotly_demo_1.html')
      And then include it with the following:
<div class="l-page">
  <iframe src="{{ 'assets/html/2024-05-07-distill-example/plotly_demo_1.html' | relative_url }}" frameborder='0' scrolling='no' height="600px" width="100%"></iframe>
</div>
      Voila!

      Citations

      Citations are then used in the article body with the <d-cite> tag. The key attribute is a reference to the id provided in the bibliography. The key attribute can take multiple ids, separated by commas.

      The citation is presented inline like this: (a number that displays more information on hover). If you have an appendix, a bibliography is automatically created and populated in it.

      Distill chose a numerical inline citation style to improve readability of citation dense articles and because many of the benefits of longer citations are obviated by displaying more information on hover. However, we consider it good style to mention author last names if you discuss something at length and it fits into the flow well - the authors are human and it's nice for them to have the community associate them with their work.

      Footnotes

Just wrap the text you would like to show up in a footnote in a <d-footnote> tag. The number of the footnote will be automatically generated. (This will become a hoverable footnote.)

      Code Blocks

      This theme implements a built-in Jekyll feature, the use of Rouge, for syntax highlighting. It supports more than 100 languages. This example is in C++. All you have to do is wrap your code in a liquid tag as follows:

      
{% highlight c++ linenos %}
code code code
{% endhighlight %}

The keyword `linenos` triggers display of line numbers. You can try toggling it on or off yourself below:
#include <cstring>
#include <iostream>
#include <string>
using namespace std;

int main(int argc, char const *argv[])
{
    string myString;

    cout << "input a string: ";
    getline(cin, myString);
    int length = myString.length();

    // allocate a raw char buffer (+1 for the null terminator) and copy into it
    char *charArray = new char[length + 1];
    strcpy(charArray, myString.c_str());

    for (int i = 0; i < length; ++i) {
        cout << charArray[i] << " ";
    }

    delete[] charArray;
    return 0;
}

      Diagrams

      This theme supports generating various diagrams from a text description using jekyll-diagrams plugin. Below, we generate a few examples of such diagrams using languages such as mermaid, plantuml, vega-lite, etc.

Note: different diagram-generation packages require external dependencies to be installed on your machine. Also, be mindful that, because of diagram generation, the first time you build your Jekyll website after adding new diagrams will be SLOW. For any other details, please refer to the jekyll-diagrams README.

      Note: This is not supported for local rendering!

      The diagram below was generated by the following code:

{% mermaid %}
sequenceDiagram
    participant John
    participant Alice
    Alice->>John: Hello John, how are you?
    John-->>Alice: Great!
{% endmermaid %}

      Tweets

      An example of displaying a tweet:

      An example of pulling from a timeline:

      For more details on using the plugin visit: jekyll-twitter-plugin

      Blockquotes

      We do not grow absolutely, chronologically. We grow sometimes in one dimension, and not in another, unevenly. We grow partially. We are relative. We are mature in one realm, childish in another. —Anais Nin

      Layouts

      The main text column is referred to as the body. It's the assumed layout of any direct descendants of the `d-article` element.

      .l-body

      For images you want to display a little larger, try `.l-page`:

      .l-page

      All of these have an outset variant if you want to poke out from the body text a little bit. For instance:

      .l-body-outset

      .l-page-outset

      Occasionally you'll want to use the full browser width. For this, use `.l-screen`. You can also inset the element a little from the edge of the browser by using the inset variant.

      .l-screen

      .l-screen-inset

      The final layout is for marginalia, asides, and footnotes. It does not interrupt the normal flow of `.l-body`-sized text except on mobile screen sizes.

      .l-gutter

      Other Typography?

      Emphasis, aka italics, with the <i></i> tag emphasis.

      Strong emphasis, aka bold, with <b></b> tag bold.

Strikethrough can be accomplished with the <s></s> tag. Scratch this.

1. First ordered list item
2. Another item
   • Unordered sub-list.
3. And another item.

      For code, the language can be specified in the class. For example, use language-javascript for Javascript and language-python for Python code.

var s = "JavaScript syntax highlighting";
alert(s);

s = "Python syntax highlighting"
print(s)
      No language indicated, so no syntax highlighting.

      A table can be created with the <table> element. Below is an example

      Tables Are Cool
      col 3 is right-aligned $1600
      col 2 is centered $12
      zebra stripes are neat $1

Blockquotes can be defined with the <blockquote> tag.

      ]]>
      Albert Einstein
      Double Descent Demystified2024-05-07T00:00:00+02:002024-05-07T00:00:00+02:00https://iclr-blogposts.github.io/2024/blog/double-descent-demystifiedIntroduction

      Machine learning models, while incredibly powerful, can sometimes act unpredictably. One of the most intriguing behaviors is when the test loss suddenly diverges at the interpolation threshold, a phenomenon distinctly observed in double descent .

      Figure 1. Double descent in ordinary linear regression. Three real datasets (California Housing, Diabetes, and WHO Life Expectancy) and one synthetic dataset (Student-Teacher) all exhibit double descent, with test loss spiking at the interpolation threshold. Blue is training error. Orange is test error.

      While significant theoretical work has been done to comprehend why double descent occurs, it can be difficult for a newcomer to gain a general understanding of why the test loss behaves in this manner, and under what conditions one should expect similar misbehavior. In this blog post, when we say double descent, we mean the divergence at the interpolation threshold, and not whether overparameterized models generalize (or fail to generalize).

In this work, we intuitively and quantitatively explain why the test loss diverges at the interpolation threshold, with as much generality and with mathematical machinery as simple as possible, but without sacrificing rigor. To accomplish this, we focus on the simplest supervised model - ordinary linear regression - using the most basic linear algebra primitive: the singular value decomposition. We identify three distinct interpretable factors which, when collectively present, trigger the divergence. Through practical experiments on real datasets, we confirm that the models’ test losses diverge at the interpolation threshold, and that this divergence vanishes when even one of the three factors is removed. We complement our understanding by offering a geometric picture that reveals linear models perform representation learning when overparameterized, and conclude by shedding light on recent results in nonlinear models concerning superposition.

      Double Descent in Ordinary Linear Regression

      Empirical Evidence of Double Descent in Ordinary Linear Regression

      Before studying ordinary linear regression mathematically, does our claim that it exhibits double descent hold empirically? We show that it indeed does, using one synthetic and three real datasets: World Health Organization Life Expectancy , California Housing , Diabetes ; these three real datasets were selected on the basis of being easily accessible through sklearn or Kaggle. As shown in Fig 1, all display a spike in test mean squared error at the interpolation threshold. Our simple Python code is publicly available.

      Notation and Terminology

      Consider a regression dataset of $N$ training data with features $\vec{x}_n \in \mathbb{R}^D$ and targets $y_n \in \mathbb{R}$. We sometimes use matrix-vector notation to refer to the training data:

      \[X \in \mathbb{R}^{N \times D} \quad , \quad Y \in \mathbb{R}^{N \times 1}.\]

      In ordinary linear regression, we want to learn parameters $\hat{\vec{\beta}} \in \mathbb{R}^{D}$ such that:

      \[\vec{x}_n \cdot \hat{\vec{\beta}} \approx y_n.\]

      We will study three key parameters:

      1. The number of model parameters $P$
      2. The number of training data $N$
      3. The dimensionality of the data $D$

      We say that a model is overparameterized if $N < P$ and underparameterized if $N > P$. The interpolation threshold refers to $N=P$, because when $N\leq P$, the model can perfectly interpolate the training points. Recall that in ordinary linear regression, the number of parameters $P$ equals the dimension $D$ of the covariates. Consequently, rather than thinking about changing the number of parameters $P$, we’ll instead think about changing the number of data points $N$.

      Mathematical Analysis of Ordinary Linear Regression

      To understand under what conditions and why double descent occurs at the interpolation threshold in linear regression, we’ll study the two parameterization regimes. If the regression is underparameterized, we estimate the linear relationship between covariates $\vec{x}_n$ and target $y_n$ by solving the least-squares minimization problem:

      \[\begin{align*} \hat{\vec{\beta}}_{under} \, &:= \, \arg \min_{\vec{\beta}} \frac{1}{N} \sum_n ||\vec{x}_n \cdot \vec{\beta} - y_n||_2^2\\ \, &:= \, \arg \min_{\vec{\beta}} ||X \vec{\beta} - Y ||_2^2. \end{align*}\]

      The solution is the ordinary least squares estimator based on the second moment matrix $X^T X$:

      \[\hat{\vec{\beta}}_{under} = (X^T X)^{-1} X^T Y.\]

      If the model is overparameterized, the optimization problem is ill-posed since we have fewer constraints than parameters. Consequently, we choose a different (constrained) optimization problem that asks for the minimum norm parameters that still perfectly interpolate the training data:

      \[\begin{align*} \hat{\vec{\beta}}_{over} \, &:= \, \arg \min_{\vec{\beta}} ||\vec{\beta}||_2^2\\ \text{s.t.} \quad \quad \forall \, n \in &\{1, ..., N\}, \quad \vec{x}_n \cdot \vec{\beta} = y_n. \end{align*}\]

      We choose this optimization problem because it is the one gradient descent implicitly minimizes. The solution to this optimization problem uses the Gram matrix $X X^T \in \mathbb{R}^{N \times N}$:

      \[\hat{\vec{\beta}}_{over} = X^T (X X^T)^{-1} Y.\]

      One way to see why the Gram matrix appears is via constrained optimization: define the Lagrangian $\mathcal{L}(\vec{\beta}, \vec{\lambda}) \, := \, \frac{1}{2}||\vec{\beta}||_2^2 + \vec{\lambda}^T (Y - X \vec{\beta})$ with Lagrange multipliers $\vec{\lambda} \in \mathbb{R}^N$, then differentiate with respect to the parameters and Lagrange multipliers to obtain the overparameterized solution.
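
As a sanity check, both estimators are a few lines of numpy (our own sketch; the post’s actual experiment code is linked above):

import numpy as np

def fit_ols(X, Y):
    # Least-squares fit in both regimes, matching the formulas in the text.
    N, D = X.shape
    if N > D:   # underparameterized: (X^T X)^{-1} X^T Y
        return np.linalg.solve(X.T @ X, X.T @ Y)
    else:       # overparameterized, minimum-norm: X^T (X X^T)^{-1} Y
        return X.T @ np.linalg.solve(X @ X.T, Y)

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 10)), rng.normal(size=(5, 1))
beta = fit_ols(X, Y)
print(np.allclose(X @ beta, Y))   # True: perfect interpolation when N <= P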

      After being fit, for test point $\vec{x}_{test}$, the model will make the following predictions:

      \[\hat{y}_{test, under} = \vec{x}_{test} \cdot \hat{\vec{\beta}}_{under} = \vec{x}_{test} \cdot (X^T X)^{-1} X^T Y\] \[\hat{y}_{test, over} = \vec{x}_{test} \cdot \hat{\vec{\beta}}_{over} = \vec{x}_{test} \cdot X^T (X X^T)^{-1} Y.\]

      Hidden in the above equations is an interaction between three quantities that can, when all grow extreme, create a divergence in the test loss!

      To reveal the three quantities, we’ll rewrite the regression targets by introducing a slightly more detailed notation. Unknown to us, there are some ideal linear parameters $\vec{\beta}^* \in \mathbb{R}^P = \mathbb{R}^D$ that truly minimize the test mean squared error. We can write any regression target as the inner product of the data $\vec{x}_n$ and the ideal parameters $\vec{\beta}^*$, plus an additional error term $e_n$ that is an “uncapturable” residual from the “viewpoint” of the model class

      \[y_n = \vec{x}_n \cdot \vec{\beta}^* + e_n.\]

      In matrix-vector form, we will equivalently write:

      \[Y = X \vec{\beta}^* + E,\]

      with $E \in \mathbb{R}^{N \times 1}$. To be clear, we are not imposing assumptions. Rather, we are introducing notation to express that there are (unknown) ideal linear parameters, and possibly non-zero errors $E$ that even the ideal model might be unable to capture; these errors $E$ could be random noise or could be fully deterministic patterns that this particular model class cannot capture. Using this new notation, we rewrite the model’s predictions to show how the test datum’s features $\vec{x}_{test}$, training data’s features $X$ and training data’s regression targets $Y$ interact.

      Let $y_{test}^* := \vec{x}_{test} \cdot \vec{\beta}^*$. In the underparameterized regime:

      \[\begin{align*} \hat{y}_{test,under} &= \vec{x}_{test} \cdot \hat{\vec{\beta}}_{under}\\ &=\vec{x}_{test} \cdot (X^T X)^{-1} X^T Y\\ &=\vec{x}_{test} \cdot (X^T X)^{-1} X^T (X \vec{\beta}^* + E)\\ &=\vec{x}_{test} \cdot \vec{\beta}^* + \, \vec{x}_{test} \cdot (X^T X)^{-1} X^T E\\ \hat{y}_{test,under} - y_{test}^* &= \vec{x}_{test} \cdot (X^T X)^{-1} X^T E. \end{align*}\]

      This equation is important, but opaque. To extract the intuition, replace $X$ with its singular value decomposition $X = U S V^T$. Let $R \, := \, \text{rank}(X)$ and let $\sigma_1 > \sigma_2 > … > \sigma_R > 0$ be $X$’s (non-zero) singular values. Let $S^+$ denote the Moore-Penrose inverse; in this context, this means that if a singular value $\sigma_r$ is non-zero, then in $S^+$, it becomes its reciprocal $1/\sigma_r$, but if the singular value is zero, then in $S^+$, it remains $0$. We can decompose the underparameterized prediction error along the orthogonal singular modes:

      \[\begin{align*} \hat{y}_{test, under} - y_{test}^* &= \vec{x}_{test} \cdot V S^{+} U^T E\\ &= \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E). \end{align*}\]

      This equation will be critical! The same term will appear in the overparameterized regime (plus one additional term):

      \[\begin{align*} \hat{y}_{test,over} &= \vec{x}_{test} \cdot \hat{\vec{\beta}}_{over}\\ &= \vec{x}_{test} \cdot X^T (X X^T)^{-1} Y\\ &= \vec{x}_{test} \cdot X^T (X X^T)^{-1} (X \beta^* + E)\\ \hat{y}_{test,over} - y_{test}^* &= \vec{x}_{test} \cdot (X^T (X X^T)^{-1} X - I_D) \beta^* \\ &\quad\quad + \quad \vec{x}_{test} \cdot X^T (X X^T)^{-1} E\\ &= \vec{x}_{test} \cdot (X^T (X X^T)^{-1} X - I_D) \beta^* \\ &\quad\quad + \quad \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E), \end{align*}\]

      where the last step again replaced $X$ with its SVD $X = U S V^T$. Thus, the prediction errors in the overparameterized and underparameterized regimes will be:

      \[\begin{align*} \hat{y}_{test,over} - y_{test}^* &= \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E)\\ &\quad \quad + \quad \vec{x}_{test} \cdot (X^T (X X^T)^{-1} X - I_D) \beta^*\\ \hat{y}_{test,under} - y_{test}^* &= \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E). \end{align*}\]

      The shared term in the two prediction errors causes the divergence:

      \[\begin{equation} \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E). \label{eq:variance} \end{equation}\]

      Eqn. \ref{eq:variance} is critical. It reveals that our test prediction error (and thus, our test squared error!) will depend on an interaction between 3 quantities:

      1. How much the training features vary in each direction. More formally, the inverse (non-zero) singular values of the training features $X$:

        \[\frac{1}{\sigma_r}\]
      2. How much, and in which directions, the test features vary relative to the training features. More formally: how $\vec{x}_{test}$ projects onto $X$’s right singular vectors $V$:

        \[\vec{x}_{test} \cdot \vec{v}_r\]
      3. How well the best possible model in the model class can correlate the variance in the training features with the training regression targets. More formally: how the residuals $E$ of the best possible model in the model class (i.e. insurmountable “errors” from the “perspective” of the model class) project onto $X$’s left singular vectors $U$:

        \[\vec{u}_r \cdot E\]

      We use the term “vary” when discussing $\vec{v}_r$ because $V$ can be related to the empirical (or sample) covariance matrix oftentimes studied in Principal Component Analysis. That is, if the SVD of $X$ is $U S V^T$, then $\frac{1}{N} X^T X = \frac{1}{N} V S^2 V^T$. If the training data are centered (a common preprocessing step), then this is the empirical covariance matrix and its eigenvectors $\vec{v}_1, …, \vec{v}_R$ identify the orthogonal directions of variance. We’ll return to this in Fig 6.

      Why does the test error diverge? When (1) and (3) are both present in the learning problem, the model’s parameters along this singular mode are likely incorrect. When (2) is added to the mix by a test datum $\vec{x}_{test}$ with a large projection along this mode, the model is forced to extrapolate significantly beyond what it saw in the training data, in a direction where the training data had an error-prone relationship between its predictions and the training targets, using parameters that are likely wrong. As a consequence, the test squared error explodes!
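
To make Eqn. \ref{eq:variance} concrete, the shared variance term can be evaluated directly from the SVD (a sketch, ours):

import numpy as np

def variance_term(X, E, x_test):
    # sum_r (1/sigma_r) (x_test . v_r) (u_r . E) over non-zero singular values
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    nz = S > 1e-12
    return np.sum((x_test @ Vt[nz].T) * (U[:, nz].T @ E) / S[nz])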

      Factor 1 - Low Variance in Training Features

      Figure 2. Required Factor #1: How much training features vary in each direction. The test loss diverges at the interpolation threshold only if training features $X$ contain small (non-zero) singular values. Ablation: By removing all singular values below a cutoff, the divergence at the interpolation threshold is diminished or disappears entirely. Blue is training error. Orange is test error.

The test loss will not diverge if any of the three required factors are absent. What could cause that? One possibility is that small-but-nonzero singular values do not appear in the training features. One way to accomplish this is by setting all singular values below a selected threshold to exactly 0. To test our understanding, we independently ablate all small singular values in the training features. Specifically, as we run the ordinary linear regression fitting process, and as we sweep the number of training data, we also sweep different singular value cutoffs and remove all singular values of the training features $X$ below the cutoff (Fig 2).
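
The ablation itself is simple; here is a sketch (under our assumptions about the procedure) of the singular-value cutoff:

import numpy as np

def truncate_singular_values(X, cutoff):
    # Factor 1 ablation: zero out all singular values of X below `cutoff`.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.where(S < cutoff, 0.0, S)) @ Vt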

      Factor 2 - Test Features in Training Feature Subspace

      Figure 3. Required Factor #2: How much, and in which directions, test features vary relative to training features. The test loss diverges only if the test features $\vec{x}_{test}$ have a large projection onto the training features $X$'s right singular vectors $V$. Ablation: By projecting the test features into the subspace of the leading singular modes, the divergence at the interpolation threshold is diminished or disappears entirely. Blue is training error. Orange is test error.

Double descent should not occur if the test datum does not vary in different directions than the training features. Specifically, if the test datum lies entirely in the subspace of just a few of the leading singular directions, then the divergence is unlikely to occur. To test our understanding, we force the test data features to lie in the training features’ subspace: as we run the ordinary linear regression fitting process, and as we sweep the number of training data, we project the test features \(\vec{x}_{test}\) onto the subspace spanned by the training features $X$’s singular modes (Fig 3).

      Factor 3 - Errors from Best Possible Model

      Figure 4. Required Factor #3: How well the best possible model in the model class can correlate variance in training features with training targets. The test loss diverges only if the residuals $E$ from the best possible model in the model class on the training data have a large projection onto the training features $X$'s left singular vectors $U$. Ablation: By ensuring the true relationship between features and targets is within the model class i.e. linear, the divergence at the interpolation threshold disappears. Blue is training error. Orange is test error.

      Double descent should not occur if the best possible model in the model class makes no errors on the training data. For example, if we use a linear model class on data where the true relationship is a noiseless linear relationship, then at the interpolation threshold, we will have $D=P$ data, $P=D$ parameters, our line of best fit will exactly match the true relationship, and no divergence will occur. To test our understanding, we ensure no residual errors exist in the best possible model: we first use the entire dataset to fit a linear model, then replace all target values with the predictions made by the ideal linear model. We then rerun our typical fitting process using these new labels, sweeping the number of training data (Fig 4).

      As a short aside, what could cause residual errors in the best possible model in the model class?

      1. Noise: If the data is noisy, then the best possible model in the model class will have residual errors.
      2. Model Misspecification: If the data is generated by a nonlinear model, but we use a linear model class (or vice versa), then the best possible model in the model class will have residual errors.
      3. Missing Features: Even if the data is noiseless and our model belongs to the correct model class, but we are missing covariates, then the best possible model in the model class will still have residual errors.

      Divergence at the Interpolation Threshold

      Figure 5. The training features are most likely to obtain their smallest non-zero singular value when approaching the interpolation threshold.

      Why does this divergence happen near the interpolation threshold? The answer is that the first factor (small non-zero singular values in the training features $X$) is likely to occur at the interpolation threshold (Fig 5), but why?

Suppose we’re given a single training datum \(\vec{x}_1\). So long as this datum isn’t exactly zero, that datum varies in a single direction, meaning we gain information about the variance in that direction, but the variance in all orthogonal directions is exactly 0. With the second training datum \(\vec{x}_2\), so long as this datum isn’t exactly zero, that datum varies, but now, some fraction of \(\vec{x}_2\) might have a positive projection along \(\vec{x}_1\); if this happens (and it likely will, since the two vectors are unlikely to be exactly orthogonal), the shared direction gives us more information about the variance in this shared direction, but less information about the second orthogonal direction of variation. Ergo, the training data’s smallest non-zero singular value after 2 samples is probabilistically smaller than after 1 sample. As we approach the interpolation threshold, it becomes increasingly unlikely that each additional datum has large variance in a new direction orthogonal to all previous directions (Fig 5); but as we move beyond the interpolation threshold, the variance in each covariate dimension becomes increasingly clear.

      Figure 6. Geometric intuition for why the smallest non-zero singular value reaches its lowest value near the interpolation threshold. If $1$ datum is observed, variance exists in only 1 direction. If $2$ data are observed, a second axis of variation appears, but because the two data are likely to share some component, the second axis is likely to have less variance than the first. At the interpolation threshold (here, $D=P=N=3$), because the three data are likely to share components along the first two axes, the third axis is likely to have even less variance. Beyond the interpolation threshold, additional data contribute additional variance to these three axes.
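
A quick simulation (ours) with isotropic Gaussian features illustrates the dip in the smallest non-zero singular value near \(N = D\):

import numpy as np

rng, D = np.random.default_rng(0), 20
for N in [5, 10, 15, 20, 25, 40, 80]:
    smin = np.mean([np.linalg.svd(rng.normal(size=(N, D)), compute_uv=False)[min(N, D) - 1]
                    for _ in range(200)])
    print(f"N={N:3d}  mean smallest non-zero singular value: {smin:.3f}")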

      Generalization in Overparameterized Linear Regression

You might be wondering why three of the datasets have low test squared error in the overparameterized regime (California Housing, Diabetes, Student-Teacher) but one (WHO Life Expectancy) does not. Recall that the overparameterized regime’s prediction error \(\hat{y}_{test,over} - y_{test}^*\) has another term not present in the underparameterized regime:

      \[\begin{equation} \vec{x}_{test} \cdot (X^T (X X^T)^{-1} X - I_D) \beta^*. \label{eq:bias} \end{equation}\]

To understand why this bias exists, recall that our goal is to correlate fluctuations in the covariates $\vec{x}$ with fluctuations in the targets $y$. In the overparameterized regime, there are more parameters than data; consequently, for $N$ data points in $D=P$ dimensions, the model can “see” fluctuations in at most $N$ dimensions, but has no “visibility” into the remaining $P-N$ dimensions. This causes information about the optimal linear relationship $\vec{\beta}^*$ to be lost, thereby increasing the overparameterized prediction error.

      Figure 7. Geometry of Generalization in Overparameterized Ordinary Linear Regression. The rowspace of the training features $X$ forms a subspace (here, $\mathbb{R}^1$) of the ambient space (here, $\mathbb{R}^2$). For test datum $\vec{x}_{test}$, the linear model forms an internal representation of the test datum $\hat{\vec{x}}_{test}$ by orthogonally projecting the test datum onto the rowspace via projection matrix $X^T (X X^T)^{-1} X$. The generalization error will then increase commensurate with the inner product between $\hat{\vec{x}}_{test} - \vec{x}_{test}$ and the best possible parameters for the function class $\vec{\beta}^*$. Three different possible $\vec{\beta}^*$ are shown with low (blue), medium (green) and high (red) generalization errors.

      We previously saw that away from the interpolation threshold, the variance is unlikely to affect the discrepancy between the overparameterized model’s predictions and the ideal model’s predictions, meaning most of the discrepancy must therefore emerge from the bias (Eqn. \ref{eq:bias}). This bias term yields an intuitive geometric picture (Fig 7) that also reveals a surprising fact: overparameterized linear regression does representation learning! Specifically, for test datum \(\vec{x}_{test}\), a linear model creates a representation of the test datum \(\hat{\vec{x}}_{test}\) by orthogonally projecting the test datum onto the row space of the training covariates \(X\) via the projection matrix \(X^T (X X^T)^{-1} X\):

      \[\begin{equation*} \hat{\vec{x}}_{test} := X^T (X X^T)^{-1} X \; \vec{x}_{test}. \end{equation*}\]

      Seen this way, the bias can be rewritten as the inner product between (1) the difference between its representation of the test datum and the test datum and (2) the ideal linear model’s fit parameters:

      \[\begin{equation}\label{eq:overparam_gen_bias} (\hat{\vec{x}}_{test} - \vec{x}_{test}) \cdot \vec{\beta}^*. \end{equation}\]
      Figure 8. Test Error of Overparameterized Models. Large inner product between the ideal model's parameters and the difference between the fit model's internal representations of the test data and the test data creates large test squared error for overparameterized models.

      Intuitively, an overparameterized model will generalize well if the model’s representations capture the essential information necessary for the best model in the model class to perform well (Fig. 8).
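
In code, the internal representation and the bias term of Eqn. \ref{eq:overparam_gen_bias} read as follows (a sketch, ours):

import numpy as np

def overparam_bias(X, x_test, beta_star):
    # Project x_test onto the rowspace of X, then take the inner product of
    # the representation error with the ideal parameters beta*.
    P = X.T @ np.linalg.solve(X @ X.T, X)   # projection matrix X^T (X X^T)^{-1} X
    x_hat = P @ x_test                      # the model's internal representation
    return (x_hat - x_test) @ beta_star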

      Adversarial Test Data and Adversarial Training Data

      Our key equation (Eqn. \ref{eq:variance}) also reveals why adversarial test data and adversarial training data exist (at least in linear regression) and how mechanistically they function. For convenience, we repeat the equation:

      \[\begin{equation*} \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E). \end{equation*}\]

Adversarial test examples are a well-known phenomenon in machine learning that we can see in this equation. The adversarial test features correspond to making \(\vec{x}_{test} \cdot \vec{v}_r\) large: one can drastically increase the test squared error by moving the test example in the direction of the right singular vector(s) with the smallest non-zero singular values (Fig 9).

      Figure 9. Adversarial Test Examples in Linear Regression. Adversarial examples arise by pushing $\vec{x}_{test}$ far along the trailing singular modes in the training features $X$. Blue is training error. Orange is test error.

      Less well-known are adversarial training data, akin to dataset poisoning or backdoor attacks . Adversarial training examples correspond to \(\vec{u}_r \cdot E\) being large, where one can drastically increase the test squared error by moving the training errors $E$ in the direction of the left singular vector(s) with the smallest non-zero singular value. This gives a practical way to construct adversarial training data: training features and targets whose training loss is unchanged from unaltered training data, but causes the test loss to be 1-3 orders of magnitude larger (Fig 10).

      Figure 10. Adversarial Training Dataset in Linear Regression. By manipulating the residual errors $E$ that the best possible model in the model class achieves on the training data, we construct training datasets that increase the test error of the learned model by 1-3 orders of magnitude without affecting its training error. Blue is training error. Orange is test error.
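
Mechanistically, constructing such adversarial inputs is a one-liner once we have the SVD. Here is a sketch (ours; the post’s actual attack construction may differ) for the test-time case:

import numpy as np

def adversarial_test_point(X, x_test, step=10.0):
    # Push x_test along the right singular vector with the smallest non-zero
    # singular value, inflating the 1/sigma_r term in Eqn. (variance).
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    v_trailing = Vt[S > 1e-12][-1]     # direction of least training variance
    return x_test + step * v_trailing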

      Intuition for Nonlinear Models

Although we mathematically studied ordinary linear regression, the intuition for why the test loss diverges extends to nonlinear models, such as polynomial regression and certain classes of deep neural networks. As a concrete example of how our intuition can shed light on the behavior of nonlinear models, Henighan et al. 2023 recently discovered interesting properties of shallow nonlinear autoencoders: depending on the number of training data, (1) autoencoders either store data points or features, and (2) the test loss increases sharply between these two regimes (Fig. 11).

      Figure 11. Superposition, Memorization and Double Descent in Nonlinear Shallow Autoencoders. Figure from Henighan et al. 2023 .

      Our work sheds light on the results in two ways:

1. Henighan et al. 2023 write, “It’s interesting to note that we’re observing double descent in the absence of label noise.” Our work clarifies that noise, in the sense of a random quantity, is not necessary to produce double descent. Rather, what is necessary are residual errors from the perspective of the model class ($E$, in our notation). Those errors could be entirely deterministic, arising, for example, from a nonlinear model attempting to fit a noiseless linear relationship, or from other model misspecifications.

2. Henighan et al. 2023 write, “[Our work] suggests a naive mechanistic theory of overfitting and memorization: memorization and overfitting occur when models operate on ‘data point features’ instead of ‘generalizing features’.” Our work hopefully clarifies that this dichotomy is incorrect: when overparameterized, data point features are akin to the Gram matrix $X X^T$, and when underparameterized, generalizing features are akin to the second moment matrix $X^T X$. Data point features can and very often do generalize, and there is a deep connection between the two, namely their shared spectra.

      Conclusion

In this work, we intuitively and quantitatively explained why the test loss misbehaves based on three interpretable factors, tested our understanding via ablations, connected our understanding to adversarial test examples and adversarial training datasets, and added conceptual clarity to recent discoveries in nonlinear models.

      ]]>
      Rylan Schaeffer
      Bridging the Data Processing Inequality and Function-Space Variational Inference2024-05-07T00:00:00+02:002024-05-07T00:00:00+02:00https://iclr-blogposts.github.io/2024/blog/dpi-fsvi $$\require{mathtools} \DeclareMathOperator{\opExpectation}{\mathbb{E}} \newcommand{\E}[2]{\opExpectation_{#1} \left [ #2 \right ]} \newcommand{\simpleE}[1]{\opExpectation_{#1}} \newcommand{\MidSymbol}[1][]{\:#1\:} \newcommand{\given}{\MidSymbol[\vert]} \DeclareMathOperator{\opmus}{\mu^*} \newcommand{\IMof}[1]{\opmus[#1]} \DeclareMathOperator{\opInformationContent}{H} \newcommand{\ICof}[1]{\opInformationContent[#1]} \newcommand{\xICof}[1]{\opInformationContent(#1)} \DeclareMathOperator{\opEntropy}{H} \newcommand{\Hof}[1]{\opEntropy[#1]} \newcommand{\xHof}[1]{\opEntropy(#1)} \DeclareMathOperator{\opMI}{I} \newcommand{\MIof}[1]{\opMI[#1]} \DeclareMathOperator{\opTC}{TC} \newcommand{\TCof}[1]{\opTC[#1]} \newcommand{\CrossEntropy}[2]{\opEntropy(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opKale}{D_\mathrm{KL}} \newcommand{\Kale}[2]{\opKale(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opJSD}{D_\mathrm{JSD}} \newcommand{\JSD}[2]{\opJSD(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opp}{p} \newcommand{\pof}[1]{\opp(#1)} \newcommand{\hpof}[1]{\hat{\opp}(#1)} \newcommand{\pcof}[2]{\opp_{#1}(#2)} \newcommand{\hpcof}[2]{\hat\opp_{#1}(#2)} \DeclareMathOperator{\opq}{q} \newcommand{\qof}[1]{\opq(#1)} \newcommand{\hqof}[1]{\hat{\opq}(#1)} \newcommand{\qcof}[2]{\opq_{#1}(#2)} \newcommand{\varHof}[2]{\opEntropy_{#1}[#2]} \newcommand{\xvarHof}[2]{\opEntropy_{#1}(#2)} \newcommand{\varMIof}[2]{\opMI_{#1}[#2]} \newcommand{\w}{\boldsymbol{\theta}} \newcommand{\W}{\boldsymbol{\Theta}} \DeclareMathOperator{\opf}{f} \newcommand{\fof}[1]{\opf(#1)} \newcommand{\Dany}{\mathcal{D}} \newcommand{\y}{y} \newcommand{\Y}{Y} \newcommand{\L}{\boldsymbol{L}} \newcommand{\x}{\boldsymbol{x}} \newcommand{\X}{\boldsymbol{X}} \newcommand{\pdata}[1]{\hpcof{\text{data}}{#1}} \newcommand{\normaldist}[1]{\mathcal{N}(#1)} $$

      Introduction

      In information theory, the data processing inequality (DPI) expresses a fundamental idea: processing data (stochastically) cannot increase information. The DPI provides us with a powerful intuition about what information processing systems can do and what the limitations of data processing are.

      In this blog post, we first study the DPI, developing intuition through vivid examples and detailed proofs—especially the equality case, which is arguably the best way to understand inequalities. We will consider classic forms of the DPI as well as DPIs relating probability distributions more broadly. Then, we explore the intriguing connection between DPI and function-space variational inference (FSVI), a modern Bayesian deep learning technique that focuses on the Bayesian predictive posterior rather than the parameter space. Exploring this connection is important because it can provide new insights into FSVI on a fundamental level. We apply the DPI to recover several interesting results from the literature in a simple form and build intuitions for the relationship between parameter and functional priors.

Most importantly, we consider how FSVI can measure a predictive divergence between the approximate and true posterior which is independent of parameter symmetries. (By parameter symmetries, I refer to different parameters that yield the same predictions, which is very common in over-parameterized neural networks: think of parameter symmetries like different paths leading to the same destination; they might look different but end up at the same predictions. Thanks to ChatGPT for this analogy! 🤗) Explaining this connection is one of the main goals of this article and will help you understand the relationships between DPI, FSVI, and other deep learning methods. As a concrete example and application, we relate FSVI to training with knowledge distillation and label entropy regularization: potentially more meaningful priors than the ones usually used in Bayesian neural networks. (In many papers, an isotropic Gaussian is used because of its simplicity; indeed, there are better alternatives, see Fortuin et al. (2022) and Fortuin (2022).) This connection highlights the practical relevance of the theoretical concepts discussed in this post and will hopefully inspire the reader to view Bayesian deep learning from a new point of view.

      TL;DR

      The following sections summarize the key takeaways of this blog post. If they don’t make sense, don’t worry: they will after reading this post.

      Data Processing Inequality

The data processing inequality examines how information cannot increase due to processing. In information theory, it is usually stated for a Markov chain of random variables \(X \rightarrow Y \rightarrow Z\) and their mutual information. We will look at different data processing inequalities that relate different distributions instead of different random variables. This blog post in particular looks at the DPI when formulated using Kullback-Leibler (KL) divergences between distributions. I will use “🥬 divergence” in headings to add a bit of color. 😊

Concretely, this KL DPI states that processing data stochastically can only reduce information. More formally, if we push both distributions through the same stochastic mapping \(Y = \fof{\W}\), then:

\[\Kale{\qof{\W}}{\pof{\W}} \ge \Kale{\qof{Y}}{\pof{Y}}.\]

That is, the KL divergence between \(\qof{Y}\) and \(\pof{Y}\) cannot be larger than the one between the original \(\qof{\W}\) and \(\pof{\W}\). Intuitively, the stochastic mapping \(\opf\) induces a bottleneck that reduces how well we can distinguish between \(\opp\) and \(\opq\). Finally, we have equality when \(\Kale{\qof{\W \given Y}}{\pof{\W \given Y}} = 0\).

      The paper “Understanding Variational Inference in Function-Space” by Burt et al. (2021) succinctly summarizes the DPI as follows:

      The data processing inequality states that if two random variables are transformed in this way, they cannot become easier to tell apart.

      Function-Space Variational Inference

      Generally, variational inference is a powerful technique for approximating complex Bayesian posteriors with simpler distributions. In its usual form, it optimizes an approximate, variational distribution to match the Bayesian parameter posterior as closely as possible. This way, it transforms the problem of Bayesian inference into an optimization problem.

      However, especially for deep neural networks, obtaining a good approximation of the parameter space can be difficult. One reason is the sheer size of the parameter space. Additionally, the parameterization of a neural network often contains many symmetries—different parameter configurations can lead to the same predictions of the model—that are not taken into account either.

Here, function-space variational inference (FSVI) side-steps some of these restrictions by only requiring that the variational distribution matches the Bayesian predictive posterior: whereas regular variational inference regularizes towards a parameter prior, FSVI regularizes towards a data prior. This is especially useful when the parameter prior is not very meaningful, e.g. an isotropic Gaussian prior, as often used in Bayesian neural networks.

      Background: Information-Theoretic Notation

Information theory deals with the communication of information (see the excellent "Visual Information Theory" by Chris Olah for a visual introduction to information theory). In this blog post, we use a unified information-theoretic notation to express various quantities related to probability distributions and their relationships (it largely follows "A Practical & Unified Notation for Information-Theoretic Quantities in ML"). Here are some key concepts we will use:

      The information content of an event \(x\) is denoted as \(\Hof{x}\) and is defined as \(-\log \pof{x}\). It represents the minimum amount of information needed to describe the occurrence of \(x\) given an underlying probability distribution. In machine learning, this information content is often used as a minimization objective, represented as the negative log-likelihood or cross-entropy when averaged over a dataset.

      The entropy \(\Hof{X}\) of a random variable \(X\) is the expectation of its information content:

      \[\Hof{X} \triangleq \E{\pof{x}}{\Hof{x}} = \E{\pof{x}}{-\log \pof{x}}.\]

      The entropy measures the average amount of information needed to describe the random variable \(X\). It provides a measure of uncertainty or randomness associated with \(X\). We can similarly define the entropy of a conditional distribution \(\Hof{X \given Y}\) and the joint entropy \(\Hof{X, Y}\).

      The mutual information \(\MIof{X;Y}\) between two random variables \(X\) and \(Y\) is a measure of the amount of information that one random variable contains about the other. It is defined as:

      \[\begin{aligned} \MIof{X;Y} & \triangleq \Hof{X} - \Hof{X \given Y} \\ &= \Hof{Y} - \Hof{Y \given X} \\ &= \Hof{X} + \Hof{Y} - \Hof{X, Y}. \end{aligned}\]

      We will also use the Kullback-Leibler divergence \(\Kale{\pof{X}}{\qof{X}}\) and the cross-entropy \(\CrossEntropy{\pof{X}}{\qof{X}}\):

      \[\begin{aligned} \CrossEntropy{\pof{X}}{\qof{X}} & = \E{\pof{x}}{-\log \qof{x}}\\ \Kale{\pof{X}}{\qof{X}} & = \CrossEntropy{\pof{X}}{\qof{X}} - \Hof{X} \end{aligned}\]

The cross-entropy quantifies the average number of bits needed to encode samples drawn from the true distribution \(\pof{X}\) using a code based on \(\qof{X}\). The Kullback-Leibler divergence is a measure of the difference between two probability distributions and captures the additional bits needed when encoding samples from \(\pof{X}\) with \(\qof{X}\) instead of the true distribution \(\pof{X}\).
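
To ground this notation, here is a small numpy sketch computing these quantities for two arbitrary discrete distributions:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.4, 0.4, 0.2])

info_content = -np.log(p)               # H[x] for each outcome x
entropy = np.sum(p * -np.log(p))        # H[X] = E_p[-log p(x)]
cross_entropy = np.sum(p * -np.log(q))  # H(p(X) || q(X))
kl = cross_entropy - entropy            # D_KL(p(X) || q(X))
print(info_content, entropy, cross_entropy, kl)
```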

      Now that we have covered the notation, let’s delve into the data processing inequality.

      Data Processing Inequality

      The data processing inequality (DPI) is a fundamental inequality in information theory that states the mutual information between two random variables cannot increase through processing. The original DPI is typically stated for a Markov chain of random variables \(X \rightarrow Y \rightarrow Z\) and relates the mutual information terms as follows:

      \[\MIof{X;Y} \ge \MIof{X;Z}.\]

We can view \(\rightarrow\) as a processing or transition step that maps \(X\) to \(Y\) and \(Y\) to \(Z\), where each mapping can be deterministic or stochastic. The inequality tells us that processing the random variable \(X\) to obtain \(Y\) and further processing \(Y\) to obtain \(Z\) cannot increase the mutual information between \(X\) and \(Z\) compared to the mutual information between \(X\) and \(Y\).

      The following three scenarios illustrate the data processing inequality using different mappings:

      Example: Image Processing Pipeline

      Consider an image processing pipeline with the following steps. Let:

      • \(X\) be the original image data;
      • \(Y\) be a compressed version of the image; and
      • \(Z\) be \(Y\) after adding blur and pixelation.

      In this case, \(X\) has more mutual information with \(Y\) than with \(Z\). The compression reduces information, but the image is still recognizable. However, after the additional processing of blurring and pixelating, the mutual information between \(X\) and \(Z\) is further reduced. This gives an intuitive example of how additional processing on data reduces the mutual information with the original data. Each processing step results in some loss of information.

      Example: Supervised Learning

      Consider a supervised learning pipeline with the following steps. Let

      • \(X\) be the input features;
      • \(Y\) be the intermediate representations learned by the model; and
      • \(Z\) be the model predictions.

      Here, \(X \rightarrow Y \rightarrow Z\) forms a Markov chain. The data processing inequality tells us that the mutual information between the inputs \(X\) and predictions \(Z\) cannot exceed the mutual information between the inputs \(X\) and intermediate representations \(Y\):

      \[\MIof{X; Y} \geq \MIof{X; Z}.\]

      This makes intuitive sense—the intermediate representations \(Y\) are obtained by processing the raw inputs \(X\), so they cannot contain more information about \(X\) than \(X\) itself. The predictions \(Z\) are obtained by further processing \(Y\), so additional information may be lost, reducing the mutual information with the original inputs \(X\).

      As a more concrete example, consider an image classification model. Let:

      • \(X\) be the input images;
      • \(Y\) be the activations of the convolutional layers; and
      • \(Z\) be predicted image labels.

The convolutional layers will extract features from the input images, but cannot extract more information than is present in the original images. The predicted labels are obtained by further processing these convolutional features, so they may lose some fine-grained information about the original inputs.

      Example: Autoencoders

      An autoencoder compresses the input \(X\) into a latent code \(Y\) and then tries to reconstruct the original input from the code, producing \(\hat{X}\). Let:

      • \(X\) be the input;
      • \(Y\) be the latent code; and
      • \(\hat{X}\) be the reconstruction;

      The data processing inequality tells us again:

      \[\MIof{X; Y} \geq \MIof{X; \hat{X}}.\]

The latent code \(Y\) is obtained by compressing \(X\), so it cannot contain more information about \(X\) than \(X\) itself. The reconstruction \(\hat{X}\) tries to recover \(X\) from \(Y\), but some information may be lost, reducing the mutual information with \(X\).

      Intuitively, autoencoders try to preserve as much mutual information between inputs \(X\) and reconstructions \(\hat{X}\) as possible by learning latent representations \(Y\) that compress inputs without losing too much information. The data processing inequality quantifies this information bottleneck.
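
Before proving the DPI, a quick numerical sanity check might be helpful. The following minimal numpy sketch draws a random discrete Markov chain \(X \rightarrow Y \rightarrow Z\) (all distributions are arbitrary choices) and verifies \(\MIof{X;Y} \ge \MIof{X;Z}\):

```python
import numpy as np

def mutual_information(joint):
    # I(A;B) from a joint distribution p(a, b).
    pa = joint.sum(1, keepdims=True)
    pb = joint.sum(0, keepdims=True)
    return np.sum(joint * np.log(joint / (pa * pb)))

rng = np.random.default_rng(0)
p_x = rng.dirichlet(np.ones(4))
p_y_given_x = rng.dirichlet(np.ones(3), size=4)  # rows: p(y | x)
p_z_given_y = rng.dirichlet(np.ones(2), size=3)  # rows: p(z | y)

joint_xy = p_x[:, None] * p_y_given_x
joint_xz = joint_xy @ p_z_given_y                # marginalize out y
print(mutual_information(joint_xy), ">=", mutual_information(joint_xz))
```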

      Proof of the DPI

      The proof is simple and connects the DPI to another important inequality.

      First we note that the Markov Chain implies the following factorization of the joint distribution:

      \[\pof{x, y, z} = \pof{x} \pof{y \given x} \pof{z \given y}.\]

      Using this factorization, we can express the mutual information terms:

      \[\begin{aligned} \MIof{X;Y} &= \Hof{X} - \Hof{X \given Y} \\ &\ge \Hof{X} - \Hof{X \given Z} \\ &= \MIof{X;Z}. \end{aligned}\]

      This relies on \(\Hof{X \given Y} \le \Hof{X \given Z}\). Why is this true?

      We have the following chain of inequalities:

      \[\Hof{X \given Y} = \underbrace{\MIof{X ; Z \given Y}}_{\overset{(1)}{=}0} + \Hof{X \given Y, Z} \overset{(2)}{\le} \Hof{X \given Z}.\]

      (1) follows from the Markov chain property: when \(X \rightarrow Y \rightarrow Z\), \(X\) does not depend on \(Z\) at all when conditioned on \(Y\); and (2) follows from the fact that conditioning reduces entropy, i.e. \(\Hof{A \given B} \le \Hof{A}.\)

The equality gap \(\Hof{X \given Z} - \Hof{X \given Y, Z}\) corresponds to the mutual information \(\MIof{X ; Y \given Z}\). This mutual information measures the extra information about \(X\) contained in \(Y\) that is not already conveyed by \(Z\). It is zero if and only if \(X \rightarrow Z \rightarrow Y\) forms a Markov chain, indicating that \(Z\) is a sufficient statistic for \(X\).

      Proof of (2) "Conditioning Reduces Entropy":

      We can easily show that conditioning reduces entropy by using the non-negative property of the mutual information:

      \(\begin{aligned} 0 &\le \Kale{\pof{X,Y}}{\pof{X}\pof{Y}} \\ &= \MIof{X;Y} \\ &= \Hof{X} - \Hof{X \given Y} \\ \implies \Hof{X \given Y} &\le \Hof{X}. \end{aligned}\)

      The fact that conditioning reduces entropy, \(\Hof{X} \ge \Hof{X \given Y}\), is an important property by itself and is reminiscent of the data processing inequality. The conditional entropy \(\Hof{X \given Y}\) quantifies the remaining uncertainty about \(X\) after observing \(Y\). If \(X\) and \(Y\) are independent, then \(\Hof{X} = \Hof{X \given Y}\), as knowing \(Y\) does not provide any information about \(X\). On the other hand, if \(Y\) completely determines \(X\), then \(\Hof{X \given Y} = 0\), as there is no remaining uncertainty about \(X\) once \(Y\) is known. In general, conditioning can only reduce the uncertainty about \(X\), but it does not necessarily reduce it to zero.

      Let’s move on and consider the KL data processing inequality.

      🥬 Data Processing Inequality

      A similar DPI can be expressed for different distributions \(\pof{x}\) and \(\qof{x}\) of the same random variable and the KL divergence between them. This DPI states that if we evolve two distributions using the same transition function, they cannot become less similar. The KL divergence is sometimes also referred to as “relative entropy”, so we could also call this the “relative data processing inequality”.

      This can be formalized for distributions \(\pof{x}\) and \(\qof{x}\) and a stochastic transition function \(X \overset{\fof{y \given x}}{\longrightarrow} Y\). Here, we use that such a stochastic mapping \(Y = \fof{X}\) is equivalent to having a probability (density) \(\fof{y \given x}\):

      \[\Kale{\pof{X}}{\qof{X}} \ge \Kale{\pof{Y}}{\qof{Y}},\]

      where \(\pof{y \given x} = \fof{y \given x} = \qof{y \given x}\). The marginals after the transition are \(\pof{y} = \E{\pof{x}}{\fof{y \given x}}\) and \(\qof{y} = \E{\qof{x}}{\fof{y \given x}}\), so more explicitly:

      \[\Kale{\pof{X}}{\qof{X}} \ge \Kale{\E{\pof{x}}{\fof{Y \given x}}}{\E{\qof{x}}{\fof{Y \given x}}}.\]

In their book Elements of Information Theory, Cover and Thomas describe this as “relative entropy never increases” and relate it to the second law of thermodynamics.

      Example: Comparing Image Distributions

      As an example, let:

      • \(\pof{x}\) be the true distribution of images in a dataset;
      • \(\qof{x}\) be a generative model that tries to mimic \(\pof{x}\); and
      • \(\fof{y \given x}\) be a function that thresholds images \(x\) into bilevel black and white images \(y\).

      Then \(\pof{y}\) and \(\qof{y}\) will be more difficult to distinguish after the thresholding operation than \(\pof{x}\) and \(\qof{x}\). Converting to black and white images has lost information that could help distinguish the real and generated distributions.

      This provides some intuition for why the KL divergence between distributions decreases under a shared stochastic mapping, as formalized by the KL data processing inequality. Processing through \(\fof{y \given x}\) makes the distributions harder to tell apart.
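
Numerically, the same effect can be checked with a minimal sketch (arbitrary example distributions and channel): two distributions pushed through a shared stochastic transition end up closer in KL divergence.

```python
import numpy as np

def kl(a, b):
    return np.sum(a * np.log(a / b))

p_x = np.array([0.7, 0.2, 0.1])
q_x = np.array([0.2, 0.3, 0.5])

# Shared stochastic transition f(y | x): rows are conditionals over y.
f = np.array([[0.8, 0.2],
              [0.5, 0.5],
              [0.1, 0.9]])

p_y = p_x @ f
q_y = q_x @ f
print(kl(p_x, q_x), ">=", kl(p_y, q_y))  # ≈ 0.64 >= ≈ 0.20
```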

      Counter-Example: Bayesian Inference

It might be tempting to think that this data processing inequality also applies to Bayesian inference, that is, to updating the model parameters based on new evidence. Then, we could argue that if two agents start with different prior beliefs but update based on the same evidence, their posterior beliefs will become more similar. However, this intuition is flawed: the data processing inequality does not apply to Bayesian inference.

Let’s walk through why. Let:

      • \(\pof{\w}\) be an agent’s prior belief;
      • \(\qof{\w}\) be another agent’s different prior;
      • \(\pof{\w\given x}\) be the posterior after observing data \(x\); and
      • \(\qof{\w\given x}\) be the other agent’s posterior.

      The priors \(\pof{\w}\) and \(\qof{\w}\) may have large divergence, representing very different initial beliefs. However, when conditioning on the same data \(x\), the KL divergence between \(\pof{\w \given x}\) and \(\qof{\w \given x}\) could increase or decrease—the data processing inequality does not give us any guarantee.

      This is because \(\pof{\w}\) and \(\qof{\w}\) are not evolving under the same stochastic mapping. Rather, each prior is mapped to its respective posterior via Bayes’ rule, which operates differently on \(\opp\) and \(\opq\):

      \[\begin{aligned} \pof{\w \given x} &= \frac{\pof{x \given \w}}{\pof{x}} \, \pof{\w}\\ \qof{\w \given x} &= \frac{\qof{x \given \w}}{\qof{x}} \, \qof{\w}. \end{aligned}\]

      Even assuming that both agents have the same internal model, that is they use the same likelihood \(\pof{x \given \w} = \qof{x \given \w}\), the priors \(\pof{\w}\) and \(\qof{\w}\) will still influence the posterior distributions differently because they lead to different evidence terms \(\pof{x}\) and \(\qof{x}\):

      \[\begin{aligned} \pof{x} &= \E{\pof{\w}}{\pof{x \given \w}}\\ \qof{x} &= \E{\qof{\w}}{\qof{x \given \w}}. \end{aligned}\]

      Thus, the correct intuition is that observing the same data \(x\) does not necessarily bring the posterior beliefs closer together—they depend on the interplay between their specific priors and likelihoods. The data processing inequality does not directly apply to this Bayesian updating scenario:

\[\Kale{\qof{\W}}{\pof{\W}} {\color{red}{\not\ge}} \Kale{\qof{\W \given \mathcal{D}}}{\pof{\W \given \mathcal{D}}}.\]

This counterexample highlights the importance of precisely understanding the assumptions underlying conceptual principles like the DPI. While the DPI provides insight about information dynamics in many cases, it does not universally apply, as exemplified here by Bayesian updating under different priors.

As we seem to experience a world of increasing polarization, this counterexample might also serve as a reminder that different priors can lead to diverging beliefs, even when observing the same evidence. This is a fundamental aspect of Bayesian inference and the scientific method.
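
To see the failure concretely, here is a minimal numerical sketch with contrived priors and likelihood: the two agents almost agree a priori, but evidence that rules out the hypothesis they agree on exposes their disagreement over the remaining hypotheses, and the posterior KL divergence exceeds the prior one.

```python
import numpy as np

def kl(a, b):
    return np.sum(a * np.log(a / b))

# Two priors over three hypotheses: the agents almost agree (98% on w=1).
p_prior = np.array([0.98, 0.01, 0.01])
q_prior = np.array([0.98, 0.0198, 0.0002])

# A shared likelihood p(x | w) that (nearly) rules out w=1.
likelihood = np.array([1e-6, 1.0, 1.0])

def posterior(prior):
    joint = prior * likelihood
    return joint / joint.sum()

print(kl(q_prior, p_prior))                        # ≈ 0.013 nats
print(kl(posterior(q_prior), posterior(p_prior)))  # ≈ 0.64 nats: increased!
```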

      Proofs of the 🥬 DPI

We will prove this inequality in two different ways. First, we will develop a “brute-force” proof, and then we will look at a more elegant proof that follows Cover and Thomas. Importantly, we will also consider the equality case in detail.

      Brute-force Proof

If \(\opp\) does not have support in \(\opq\), the inequality is trivially true because then \(\Kale{\pof{X}}{\qof{X}}=\infty\).

      Thus, let’s now assume that \(\opp\) has support in \(\opq\). Then, we can brute-force using the definitions, starting from the cross-entropy:

      \[\begin{aligned} \CrossEntropy{\pof{Y}}{\qof{Y}}&=\CrossEntropy{\pof{Y}}{\E{\qof{x}}{\pof{Y \given x}}}\\ &=\CrossEntropy{\pof{Y}}{\E{\qof{x}}{\frac{\pof{x \given Y}\pof{Y}}{\pof{x}}}}\\ &=\CrossEntropy{\pof{Y}}{\E{\pof{x \given Y}}{\frac{\qof{x}}{\pof{x}}}}+\CrossEntropy{\pof{Y}}{\pof{Y}}\\ &\overset{(1)}{=}\CrossEntropy{\pof{Y}}{\E{\pof{x \given Y}}{\frac{\qof{x}}{\pof{x}}}}+\xHof{\pof{Y}}\\ &\overset{(2)}{\le}\CrossEntropy{\pof{X, Y}}{\frac{\qof{X}}{\pof{X}}}+\xHof{\pof{Y}}\\ &\overset{(3)}{=}\CrossEntropy{\pof{X}}{\frac{\qof{X}}{\pof{X}}}+\xHof{\pof{Y}}\\ &\overset{(4)}{=}\Kale{\pof{X}}{\qof{X}}+\xHof{\pof{Y}}\\ \iff \Kale{\pof{Y}}{\qof{Y}}&\le\Kale{\pof{X}}{\qof{X}}, \end{aligned}\]

where we have used (1) that the cross-entropy of a distribution with itself is just the entropy, (2) that the cross-entropy is convex and we can apply Jensen’s inequality, (3) that the RHS of the cross-entropy does not depend on \(Y\) and we can trivially marginalize it out, and (4) that the definition of the Kullback-Leibler divergence is equivalent to an (unnormalized) cross-entropy over a fraction.

      This makes it difficult to extract the case for equality, however.

      Equality Case

We have only one inequality in the above proof, and it stems from applying Jensen’s inequality. Recall the equality case of Jensen’s inequality: for a strictly convex function, it holds with equality if and only if the argument is (almost surely) constant.

For (2), this is sadly slightly more complex than it might seem at first glance. Let’s unwrap the term:

      \[\CrossEntropy{\pof{Y}}{\E{\pof{x \given Y}}{\frac{\qof{x}}{\pof{x}}}} = \E{\pof{y}}{-\log \E{\pof{x \given y}}{\frac{\qof{x}}{\pof{x}}}}.\]

We take an expectation over \(\pof{y}\), so to consider equality, we need to look at (almost all) \(y\) with \(\pof{y} \not= 0\) separately. \(-\log x\) is strictly convex (and thus not linear), so we need the argument \(\frac{\qof{x}}{\pof{x}}\) to be constant over the support of \(\pof{x \given y}\) for any fixed \(y\) with \(\pof{y} \not= 0\); only then do we have equality in Jensen’s inequality.

In the following, I will limit myself to the discrete case to avoid having to deal with measure theory. (I currently don't have a good 'toolbox' to express simple ideas cleanly in measure theory; I'm working on it.) To obtain equality, for all \(y\) with \(\pof{y} \not= 0\) (i.e. we have support) and for all \(x_1, x_2\) with \(\pof{x_1 \given y}, \pof{x_2 \given y} \not= 0\), we need \(\frac{\qof{x_1}}{\pof{x_1}} = \frac{\qof{x_2}}{\pof{x_2}}\). Equivalently (for the reader: why is \(\pof{x_1} \not= 0\) then?):

      \[\begin{aligned} \frac{\qof{x_1}}{\pof{x_1}} &= \frac{\qof{x_2}}{\pof{x_2}} \\ \iff \qof{x_1} &= \frac{\qof{x_2}}{\pof{x_2}} \, \pof{x_1} \\ \end{aligned}\]

This means that \(\qof{x} = C_y \pof{x}\) piecewise for all \(x\) for which \(\pof{x \given y} \not= 0\) for some fixed \(y\) with \(\pof{y} \not= 0\). That is, if we keep \(y\) fixed, all the \(x\) for which \(\pof{x \given y} \not= 0\) share the same constant factor \(C_y\). Then, for all \(y\) with \(\pof{y} \not= 0\), we have equality, and overall equality in (2).

      If for any \(x\) there are multiple \(y\), e.g. \(y_1, y_2\) for which \(\pof{x \given y} \not= 0\), then we have \(C_{y_1} = C_{y_2}\).

As the simplest example, if this is the case for all \(y\), then all the \(C_y\) coincide, and \(C_y = 1\) follows by normalization.

As a side-note, this is a great reason why we often require full support for distributions: we can then avoid these piecewise constant factors (and the headaches they might cause).

      Simpler Elegant Proof

Cover and Thomas provide a beautifully simple proof: expand the joint divergence \(\Kale{\pof{X, Y}}{\qof{X, Y}}\) with the chain rule in both directions. On the one hand,

\[\Kale{\pof{X, Y}}{\qof{X, Y}} = \Kale{\pof{X}}{\qof{X}} + \Kale{\pof{Y \given X}}{\qof{Y \given X}} = \Kale{\pof{X}}{\qof{X}},\]

since \(\pof{y \given x} = \fof{y \given x} = \qof{y \given x}\) makes the conditional term vanish. On the other hand,

\[\Kale{\pof{X, Y}}{\qof{X, Y}} = \Kale{\pof{Y}}{\qof{Y}} + \Kale{\pof{X \given Y}}{\qof{X \given Y}} \ge \Kale{\pof{Y}}{\qof{Y}},\]

with equality if and only if \(\Kale{\pof{X \given Y}}{\qof{X \given Y}} = 0\).

      What does this mean? Whereas \(\fof{y \given x}\) is the ‘forward’ transition function, \(\pof{x \given y}\) and \(\qof{x \given y}\) are the ‘backward’ transition functions. We only have equality when the backward transition functions are equal (almost everywhere).

The statement on equality is not very informative yet, though, so we have to put in a bit more work. Again, this is written for the discrete case.

This time, we explicitly use Bayes’ rule to connect the forward and backward transition functions. First, we fix \(y\) such that \(\pof{y} \not= 0\) (i.e. \(y\) is in the support of \(\pof{y}\)) and \(\qof{y} \not= 0\). We have:

      \[\begin{aligned} \pof{x \given y} &= \qof{x \given y} \\ \overset{\text{ass. }\pof{y} \not= 0}{\iff} \frac{\fof{y \given x}\pof{x}}{\pof{y}} &= \frac{\fof{y \given x}\qof{x}}{\qof{y}} \\ \overset{\text{ass. }\fof{y \given x}\not= 0}{\iff} \frac{\pof{x}}{\pof{y}} &= \frac{\qof{x}}{\qof{y}} \\ \iff \pof{x} &= \frac{\pof{y}}{\qof{y}} \, \qof{x}. \end{aligned}\]

      For a given \(y\) with \(\pof{y} \not=0\), for the equality case, we see that for all \(x\) with \(\fof{y \given x} \not= 0\), \(\pof{x}\) and \(\qof{x}\) have to be coupled via piecewise constant factors.

As another example, if \(\fof{y \given x} \not= 0\) for all possible \(x\) (full support), then for the equality case we have \(\pof{x} = \qof{x}\).

      Compared to the previous equality case, we went a bit deeper and rewrote the conditions to consider the ratios between \(x\) and \(y\). Note we could have shown the same thing in the “brute-force” proof, too.

Altogether, we have seen that both \(x\) and \(y\) are modulated by the same constant factor between \(\pof{\cdot}\) and \(\qof{\cdot}\). Essentially, this tells us that we could split our support into unconnected sub-domains and examine each individually for the equality case.

      Overall Statement

We have the following overall statement: for \(\pof{x} \ll \qof{x}\),

\[\Kale{\pof{X}}{\qof{X}} \ge \Kale{\pof{Y}}{\qof{Y}}.\]

(\(\pof{x} \ll \qof{x}\) means that \(\pof{x} > 0\) implies \(\qof{x} > 0\), that is, \(\opp\) is absolutely continuous with respect to \(\opq\), so the KL divergence is not \(\infty\).) But more precisely, for \(\pof{x} \ll \qof{x}\), we have equality when:

\[\forall y \text{ with } \pof{y} \not= 0 \;\; \exists C_y \in \mathbb{R}_{> 0} \;\; \forall x \text{ with } \fof{y \given x} \not= 0\colon \pof{x} = C_y \, \qof{x}.\]

      Other Data Processing Inequalities

      Now, we can use these ideas to derive a few additional results and even close the circle to the original data processing inequality.

      Jensen-Shannon Divergence

      The KL divergence is not a metric: the triangle inequality does not hold, and it is not symmetric.

      However, we can symmetrize it to obtain the Jensen-Shannon divergence (JSD). The JSD is defined as the mean of the two KL divergences of the two distributions from their average. In essence, it makes the KL divergence symmetric:

      \[\begin{aligned} \fof{x} &= \frac{\pof{x} + \qof{x}}{2}\\ \JSD{\pof{x}}{\qof{x}} &= \frac{1}{2} \Kale{\pof{x}}{\fof{x}} + \frac{1}{2} \Kale{\qof{x}}{\fof{x}}. \end{aligned}\]

Similar approaches can be used to “symmetrize” other concepts, for example, matrices: \(\frac{1}{2} A + \frac{1}{2} A^T\) is symmetric by construction for any matrix \(A\).

The JSD is still not a metric, but its square root, the Jensen-Shannon distance, is: it is symmetric and satisfies the triangle inequality.

      JSD-DPI

We can also obtain a data processing inequality for the Jensen-Shannon divergence (and hence for the Jensen-Shannon distance): for a shared stochastic transition \(X \rightarrow Y\),

\[\JSD{\pof{X}}{\qof{X}} \ge \JSD{\pof{Y}}{\qof{Y}}.\]

The proof uses the KL data processing inequality:

      \[\begin{aligned} \JSD{\pof{X}}{\qof{X}} &= \frac{1}{2} \Kale{\pof{X}}{\fof{X}} + \frac{1}{2} \Kale{\qof{X}}{\fof{X}}\\ &\ge \frac{1}{2} \Kale{\pof{Y}}{\fof{Y}} + \frac{1}{2} \Kale{\qof{Y}}{\fof{Y}}\\ &= \JSD{\pof{Y}}{\qof{Y}}. \end{aligned}\]

We verify that the pushforward \(\fof{y}\) is indeed the average \(\frac{\pof{y} + \qof{y}}{2}\) of \(\pof{y}\) and \(\qof{y}\):

      \[\begin{aligned} \fof{y} &= \E{\fof{x}}{\fof{y \given x}}\\ &= \E{\frac{\pof{x}+\qof{x}}{2}}{\fof{y \given x}}\\ &= \frac{1}{2} \E{\pof{x}}{\fof{y \given x}} + \frac{1}{2} \E{\qof{x}}{\fof{y \given x}}\\ &= \frac{1}{2} \pof{y} + \frac{1}{2} \qof{y}. \end{aligned}\]

      Finally, \(\pof{x}, \qof{x} \ll \fof{x}\), and the equality condition of the KL data processing inequality gives us:

      \[\begin{aligned} &\Kale{\pof{X \given Y}}{\fof{X \given Y}} = 0 &\\ \land \quad &\Kale{\qof{X \given Y}}{\fof{X \given Y}} = 0 &\\ \iff &\pof{x \given y} = \fof{x \given y} \land \qof{x \given y} = \fof{x \given y}& \forall x,y \\ \iff &\pof{x \given y} = \qof{x \given y}& \forall x,y. \end{aligned}\]

      Mutual Information

The JSD can also be expressed as a mutual information. For

\[\begin{aligned} Z &\sim \mathrm{Bernoulli}(\tfrac{1}{2}) = \fof{Z} \\ X \given Z = 0 &\sim \pof{x}\\ X \given Z = 1 &\sim \qof{x}, \end{aligned}\]

      we have:

      \[\JSD{\pof{X}}{\qof{X}} = \MIof{X;Z}.\]

      This follows from rewriting the mutual information as a KL divergence:

      \[\begin{aligned} \MIof{X;Z} &= \Kale{\fof{X \given Z}}{\fof{X}}\\ &= \E{\fof{z}} {\Kale{\fof{X \given Z = z}}{\fof{X}}}\\ &= \frac{1}{2} \Kale{\pof{x}}{\fof{x}} + \frac{1}{2} \Kale{\qof{x}}{\fof{x}}\\ &= \JSD{\pof{X}}{\qof{X}}. \end{aligned}\]
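
A minimal numpy sketch verifies this identity for two arbitrary distributions, computing the mutual information as \(\MIof{X;Z} = \Hof{X} - \Hof{X \given Z}\):

```python
import numpy as np

def kl(a, b):
    return np.sum(a * np.log(a / b))

def entropy(a):
    return -np.sum(a * np.log(a))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.4, 0.5])
m = 0.5 * (p + q)                      # marginal of X when Z ~ Bernoulli(1/2)

jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
mi = entropy(m) - 0.5 * entropy(p) - 0.5 * entropy(q)  # H[X] - H[X | Z]
print(jsd, mi)                         # equal up to floating point error
```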

      We can generalize this to the Markov chain \(Z \rightarrow X \rightarrow Y\) with \(\fof{z, x, y} = \fof{z} \fof{x \given z} \fof{y \given x}\) for any distribution \(\fof{z}\):

      \[\begin{aligned} \MIof{X;Z} &= \Kale{\fof{X \given Z}}{\fof{X}}\\ &= \E{\fof{z}} {\Kale{\fof{X \given z}}{\fof{X}}}\\ &\overset{(1)}{\ge} \E{\fof{z}} {\Kale{\fof{Y \given z}}{\fof{Y}}}\\ &= \Kale{\fof{Y \given Z}}{\fof{Y}}\\ &= \MIof{Y;Z}, \end{aligned}\]

      where \((1)\) follows from the KL data processing inequality.

      This is just the data processing inequality we presented initially. We have gone full circle!

      The equality gap (Jensen gap) is \(\Kale{\fof{X \given Y, Z}}{\fof{X \given Y}}\), and we have equality when:

      \[\begin{aligned} \Kale{\fof{X \given Y, Z}}{\fof{X \given Y}} &= 0\\ \iff \MIof{X;Z \given Y} &= 0. \end{aligned}\]

      This is exactly when \(X\) is independent of \(Z\) given \(Y\). (\(Y\) is a sufficient statistic in that case.)

      Function-Space Variational Inference

      So far we’ve explored the foundational aspects of the data processing inequality (DPI) and its extended forms, in particular the KL data processing inequality. Through detailed derivations and intuitive examples, we’ve demonstrated how these inequalities can be applied, emphasizing their significance and limitations. Specifically, we’ve shown how the KL data processing inequality relates to the reduction in information as data is processed. The examples and counterexample have hopefully demonstrated the nuances of applying these inequalities in different contexts.

      This exploration sets the stage for diving into function-space variational inference and building up a robust understanding of it, leveraging the insights gained about the DPI and its implications in Bayesian deep learning.

      Problem Setting & Notation

In the following, we will consider a classification task with cross-entropy loss, and we will use the following random variables and distributions:

      • \(\y\) is the label,
      • \(\x\) is the input,
      • \(\qof{\y \given \x}\) is the predictive distribution we want to learn,
      • \(\pdata{\y \given \x}\) is the data distribution,
      • \(\Dany\) is the (training) dataset, and
      • \(C\) is the number of classes.

      The probabilistic model is:

      \[\pof{\y, \w \given \x} = \pof{\y \given \x, \w} \, \pof{\w}.\]

      As before, I use upper-case letters for random variables, which we take an expectation over, e.g. in the KL divergence, and lower-case letters when I’m referring to specific observations or values that could be substituted (with the exception of \(\Dany\)).

      Chain Rule of the 🥬 Divergence & DPI

      An important property of the KL divergence is the chain rule:

      \[\begin{aligned} &\Kale{\qof{\Y_n,...,\Y_1}}{\pof{\Y_n,...,\Y_1}} \\ &\quad = \sum_{i=1}^n \Kale{\qof{\Y_i \given \Y_{i-1}, ..., \Y_1}}{\pof{\Y_i \given \Y_{i-1}, ..., \Y_1}}. \end{aligned}\]

      The chain rule yields a chain inequality for the DPI as well:

      \[\begin{aligned} \Kale{\qof{\W}}{\pof{\W}} &\ge \Kale{\qof{\Y_n,...,\Y_1}}{\pof{\Y_n,...,\Y_1}}\\ &\ge \Kale{\qof{\Y_{n-1},...,\Y_1}}{\pof{\Y_{n-1},...,\Y_1}}\\ &\ge \Kale{\qof{\Y_1}}{\pof{\Y_1}}, \end{aligned}\]

      where we start from the KL DPI and then apply the chain rule.

      Deriving the Functional ELBO

      The DPI has an intriguing connection to FSVI. Let’s say we want to approximate a Bayesian posterior \(\pof{\w \given \Dany}\) with a variational distribution \(\qof{\w}\). In standard VI, we would minimize \(\Kale{\qof{\W}}{\pof{\W \given \Dany}}\) to match the variational distribution to the Bayesian posterior. Specifically:

      \[\begin{aligned} &\Kale{\qof{\W}}{\pof{\W \given \Dany}} =\\ &\quad = \underbrace{\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\W}}{\pof{\W}}}_{\text{Evidence}\ \text{Bound}} + \log \pof{\Dany} \ge 0 \\ &\iff \underbrace{-\log \pof{\Dany}}_{=\xHof{\pof{\Dany}}} \le \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\W}}{\pof{\W}}. \end{aligned}\]

This is an information-theoretic evidence (upper) bound on the information content \(-\log \pof{\Dany}\) of the data \(\Dany\), which we can minimize as an objective over the variational distribution \(\qof{\w}\) to approximate \(\pof{\w \given \Dany}\).

      In more probability-theory inspired literature, the negative of this bound is called the evidence lower bound (ELBO) and is maximized.

Both the ELBO and the information-theoretic evidence upper bound are equivalent, and we can use either objective; the information-theoretic perspective is obviously superior, though. 🙃 I’ll refer to this as the evidence bound from now on.

      In FSVI (with a caveat I detail below), we apply the DPI to the prior KL divergence term and obtain a “functional” version of the evidence bound:

      \[\begin{aligned} \Kale{\qof{\W}}{\pof{\W}} \ge \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}, \end{aligned}\]

      where \(\Y... \given \x...\) are (finite or infinite) sets of samples. That is, we do not only optimize marginal distributions but also joint distributions.

      The resulting objective:

      \[\begin{aligned} \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}} \end{aligned}\]

      is equal to the (negative) functional ELBO (fELBO) in “Functional variational Bayesian neural networks” by Sun et al. (2019)—with caveats that we discuss below.
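
To make the shape of this objective concrete, here is a rough PyTorch-style sketch for classification, with several simplifying assumptions that depart from the published methods: `model(w, x)` is a hypothetical function mapping a flat parameter vector to logits, `q_w` and `p_w` are reparameterizable torch distributions over that vector, and the functional KL term is estimated only between per-input marginal predictives at the coreset (by the chain rule, a lower bound on the joint functional KL), via simple Monte Carlo:

```python
import torch
import torch.nn.functional as F

def predictive_probs(w_dist, x, model, n_samples=16):
    # Monte Carlo estimate of the marginal predictive q(y | x) = E_q(w)[p(y | x, w)].
    probs = torch.stack([F.softmax(model(w_dist.rsample(), x), dim=-1)
                         for _ in range(n_samples)])
    return probs.mean(0)

def functional_evidence_bound(q_w, p_w, model, x_train, y_train, x_coreset):
    # Expected negative log-likelihood under q (single-sample estimate).
    nll = F.cross_entropy(model(q_w.rsample(), x_train), y_train)
    # Functional regularizer: KL between marginal predictives on the coreset.
    q_pred = predictive_probs(q_w, x_coreset, model)
    p_pred = predictive_probs(p_w, x_coreset, model)
    kl = (q_pred * (q_pred.log() - p_pred.log())).sum(-1).mean()
    return nll + kl
```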

      Choosing the “Coreset” \(\x...\)

      One important detail is the question of how to choose the \(\x...\):

      Ideally, we want to choose them such that the DPI inequality is as tight as possible.

      Given the chain inequality, it is obvious that the larger the set \(\x...\), the tighter the inequality will be. Hence, if we could choose an infinite set of points well, we might be able to get the tightest possible inequality. However, this might not be tractable, and in practice, it is often not.

Some works take a supremum over finite subsets of a certain size, essentially building a core-set as an approximation (Rudner et al., 2022a/b); others take an expectation over finite sets of input samples (Sun et al., 2019), which does not necessarily yield the tightest inequality but provides an unbiased estimate; while other works focus on finite datasets for which all points can be taken into account (Klarner et al., 2023).

      We will discuss the tightness of the inequality and the implications in the data limit below.

Focusing on the most important aspect of FSVI, we observe that the functional KL term regularizes the model’s predictions directly, rather than its parameters.

      Application to Continual Learning

      When we directly optimize the KL divergence on a finite input dataset, for example, we align \(\opq\) with the prior of \(\opp\) where it matters most: on the predictions of the observed data.

      This is of particular interest in continual learning, where the prior for the next task is chosen to be the posterior from the previous task. In this case, the functional ELBO can be used to approximate the posterior of the previous model while incorporating new data.

For two great papers that are very readable and provide further insights, see “Continual learning via sequential function-space variational inference” and “Tractable function-space variational inference in Bayesian neural networks”, both by Rudner et al. (2022).

      Comparison to FSVI in the Literature

In practice, both works by Rudner et al. (2022) linearize the logits (the final activations of the neural network before applying the softmax function in multi-class classification; not to be confused with the pre-logits, e.g. embeddings before the final linear layer), similar to a Laplace approximation, and use the DPI to show (in their notation):

      \[\mathbb{D}_{\mathrm{KL}}\left(q_{f(\cdot ; \boldsymbol{\Theta})} \| p_{f(\cdot ; \boldsymbol{\Theta})}\right) \leq \mathbb{D}_{\mathrm{KL}}\left(q_{\Theta} \| p_{\Theta}\right)\]

      which in my notation is equivalent to the first application of the DPI above:

      \[\Kale{\qof{\L...\given \x...}}{\pof{\L...\given \x...}} \le \Kale{\qof{\W}}{\pof{\W}}.\]

      They maximize the fELBO objective:

      \[\begin{aligned} \mathcal{F}\left(q_{\boldsymbol{\Theta}}\right) &=\mathbb{E}_{q_{f\left(\mathbf{x}_{\mathcal{D}} ; \boldsymbol{\Theta}\right)}}\left[\log p_{\mathbf{y} \mid f(\mathbf{X} ; \boldsymbol{\Theta})}\left(\mathbf{y}_{\mathcal{D}} \mid f\left(\mathbf{X}_{\mathcal{D}} ; \boldsymbol{\theta}\right)\right)\right]\\ &\quad -\sup _{\mathbf{X} \in \mathcal{X}_{\mathbb{N}}} \mathbb{D}_{\mathrm{KL}}\left(q_{f(\mathbf{X} ; \boldsymbol{\Theta})} \| p_{f(\mathbf{X} ; \boldsymbol{\Theta})}\right), \end{aligned}\]

      which is equivalent to minimizing the information-theoretic objective:

      \[\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\L... \given \x...}}{\pof{\L... \given \x...}},\]

      if we choose the \(\x...\) to tighten the DPI inequality as much as possible (i.e. by “finding” the supremum).

Using the inequality chain from above, we can sandwich their objective between the regular (negative) ELBO and the (negative) functional ELBO we derived above:

\[\begin{aligned} &\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\W}}{\pof{\W}} \\ &\quad \ge \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\L... \given \x...}}{\pof{\L... \given \x...}} \\ &\quad \ge \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}. \end{aligned}\]

Why are they using logits instead of probabilities? In practice, using the probabilities instead of the logits when performing the linearization is often cumbersome due to the non-linearity of the softmax function, which requires Monte-Carlo sampling of the logits to obtain an approximation of the final probabilities. Furthermore, I speculate that sampling the logits can be more benign given that we often use ReLUs in the underlying neural networks. (Don’t quote me too strongly on this, though.)

Conceptually, this explains the derivation of their ELBO objective and also relates it to the ‘purer’ and simpler functional evidence bound derived above. But this raises the question of how these inequalities differ and what the gap between them tells us. Let’s address this question next.

      The Equality Case and Equivalence Classes

      When do we have equality? That is, when do we have:

      \[\Kale{\qof{\W}}{\pof{\W}} = \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}?\]

      And what does it tell us?

As we have seen in the first part of this post, we have equality in the DPI if and only if:

      \(\Kale{\qof{\W \given \Y..., \x...}}{\pof{\W \given \Y..., \x...}}=0\).

      Given that we are trying to approximate the Bayesian posterior \(\pof{\w \given \Y..., \x...}\) using \(\qof{\w}\), this equality condition tells us that we would have to find the exact posterior for equality. Hence, it is unlikely that we will have equality in practice. From this, the next question immediately follows: what does this predictive prior term

      \[\Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}\]

provide us with?

Another way to think about the gap between the two KL divergences is that one is parameter-based and the other one is not. This points to a deeper truth about overparameterized models used in deep learning: many different parameter configurations can yield exactly the same predictions.

The functional KL divergences are not affected by such symmetries: they are parameter-free and take into account only the predictions, not the parameters of the model. The regular parameter-based KL divergence, however, is affected; depending on the prior \(\pof{\w}\), it might express differences between the parameter distributions that have no effect on the outputs.

In other words, if the prior assigns different probability to otherwise equivalent parameters, this obviously changes the parameter posterior, while the outputs are invariant to these changes as long as the overall probability assigned to a given output remains the same.

For example, the paper “Deep Ensembles: A Loss Landscape Perspective” by Fort et al. (2020) examines the similarity of the predictions of models trained from different initializations and shows that the loss landscape is multi-modal in prediction space. In the language of FSVI, this is similar to analyzing the function-space distances between different models.

      Equivalence Classes

      Unless there are other considerations, it makes sense to use priors that assign the same density to parameters that are equivalent. Hence, for a given function \(\fof{\x ; \w}\), which determines the likelihood \(\pof{\y \given \x, \w} \triangleq \pof{y \given \fof{\x ; \w}}\), we can define an equivalence relation such that \(\w \sim \w'\) if and only if \(\fof{\x; \w} = \fof{\x; \w'}\) for all \(\x\). This equivalence relation partitions the parameter space into equivalence classes:

\[[\w] \triangleq \{\w' : \fof{\x ; \w'} = \fof{\x ; \w} \quad \forall \x \}.\]

      A prior \(\pof{\w}\) induces a prior \(\hpof{[\w]}\) over the equivalence classes:

      \[\hpof{[\w]} \triangleq \sum_{\w' \in [\w]} \pof{\w'}.\]

      —or \(\int_{[\w]} \pof{\w'} \, d \w'\) for continuous \(\w\)—with the corresponding model:

      \[\begin{aligned} \hpof{\y, [\w] \given \x} &\triangleq \hpof{\y \given \x, [\w]} \, \hpof{[\w]} \\ &= \pof{\y \given \x, \w} \, \hpof{[\w]}. \end{aligned}\]

      Consistency

Importantly, the definition of the equivalence classes above is consistent with Bayesian inference: the induced posterior over equivalence classes matches the aggregated parameter posterior, \(\hpof{[\w] \given \Dany} = \sum_{\w' \in [\w]} \pof{\w' \given \Dany}\).

This is easy to show using Bayes’ rule:

      \[\begin{aligned} \hpof{[\w] \given \Dany} &= \hpof{\Dany \given [\w]} \, \hpof{[\w]} / \hpof{\Dany} \\ &= \pof{\Dany \given \w} \sum_{\w' \in [\w]} \pof{\w'} / \hpof{\Dany} \\ &= \sum_{\w' \in [\w]} \pof{\Dany \given \w'} \, \pof{\w'} / \hpof{\Dany} \\ &= \sum_{\w' \in [\w]} \pof{\w' \given \Dany} \, \pof{\Dany} / \hpof{\Dany} \\ &= \sum_{\w' \in [\w]} \pof{\w' \given \Dany}. \end{aligned}\]

      The last step follows from \(\hpof{\Dany}=\pof{\Dany}\):

\[\begin{aligned} \hpof{\Dany} &= \sum_{[\w]} \hpof{\Dany, [\w]} \\ &= \sum_{[\w]} \sum_{\w' \in [\w]} \pof{\Dany, \w'} \\ &= \sum_{\w'} \pof{\Dany, \w'} \\ &= \pof{\Dany}. \end{aligned}\]

      This also tells us that, for any \(\x\) and \(\y\):

      \(\pof{\y... \given \x...} = \hpof{\y... \given \x...}\).

      Given this consistency, we don’t have to differentiate between \(\hat\opp\) and \(\opp\) and can use \(\opp\) interchangeably. The same holds for \(\opq\).

      Equality & Symmetries

      We can view \([\w]\) as a projection from \(\w\) to its equivalence class \([\w]\). The DPI then gives us:

      \[\Kale{\qof{\W}}{\pof{\W}} \ge \Kale{\qof{[\W]}}{\pof{[\W]}}.\]

      And again: what does the gap between the two terms tell us?

      Let’s look at a few examples to get a better understanding of this.

      1. Trivial Constant Case

Let \(\fof{\x ; \w} = 0\), independent of \(\w\). Then \([\w] = [\w']\) for any \(\w\), \(\w'\).

For any approximate distribution \(\qof{\w}\), the induced \(\Kale{\qof{[\W]}}{\pof{[\W]}} = 0\), while \(\Kale{\qof{\W}}{\pof{\W}}\) can be arbitrary; all of it is superfluous divergence that has no effect on the predictions.

      2. Unused Parameter

Let \(\y \given (\w_1, \w_2) = \w_1\), deterministic in \(\w_1\) but independent of \(\w_2\). Then \([(\w_1, \w_2)] = [(\w_1, {\w'}_2)]\) for any \({\w'}_2\), and \([(\w_1,*)] \not= [({\w'}_1, *)]\) for any \(\w_1 \not= \w'_1\).

      \(\Kale{\qof{[\W]}}{\pof{[\W]}}=\Kale{\qof{\W_1}}{\pof{\W_1}}\) captures the meaningful divergence between approximate and true distribution, while \(\Kale{\qof{\W}}{\pof{\W}}\) also includes any divergence across \(\w_2\) that has no effect on the predictions.

      3. Periodic Parameter Space

      Finally, let’s assume that the predictions are periodic in some way. That is, for example \(\y = \sin \w\). We then have \([\w] = [\w + 2\pi]\).

Further, let \(\pof{\w} = \operatorname{U}(\w; [0, 2\pi N))\) for some \(N\) that determines the number of periods. Then, if we introduce another random variable \(K\) that captures which period we are in, we can (again) use the chain rule to write:

      \[\begin{aligned} \Kale{\qof{\W}}{\pof{\W}} &= \Kale{\qof{\W \given \W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \given \W \in [K\,2\pi, (K+1)\,2\pi]}} \\ &\quad + \Kale{\qof{\W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \in [K\,2\pi, (K+1)\,2\pi]}} \\ &= \Kale{\qof{[\W]}}{\pof{[\W]}} \\ &\quad + \Kale{\qof{\W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \in [K\,2\pi, (K+1)\,2\pi]}}. \end{aligned}\]

      This follows from the setup of this specific example. Finally, we have:

      \[\Kale{\qof{\W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \in [K\,2\pi, (K+1)\,2\pi]}} \le \log N.\]

      So, if \(\opq\) only had support in a single period for example, the difference between \(\Kale{\qof{\W}}{\pof{\W}}\) and \(\Kale{\qof{[\W]}}{\pof{[\W]}}\) would be \(\log N\): the redundancy.

      Predictive Prior

How does the predictive prior term fit into this? The DPI again yields the answer: the predictions are a function of the equivalence class \([\w]\), so

\[\Kale{\qof{\W}}{\pof{\W}} \ge \Kale{\qof{[\W]}}{\pof{[\W]}} \ge \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}.\]

This tells us that the predictive prior term can at best measure the KL divergence between the equivalence classes of the parameters, and not between the parameters themselves; but luckily, this is the more meaningful divergence anyway!

      For the equality cases, we observe that:

      1. we need a 1:1 mapping between parameters and equivalence classes for the first bound to be tight, and
      2. we need \(\Kale{\qof{[\W] \given \Y_n,\x_n,...,\Y_1,\x_1}}{\pof{[\W] \given \Y_n,\x_n,...,\Y_1,\x_1}} \to 0\) for \(n \to \infty\) for the second bound to be tight.

For 2., we know from the chain rule that

      \[\Kale{\qof{\Y_n,...\Y_1\given\x_n,...,\x_1}}{\pof{\Y_n,...\Y_1\given\x_n,...,\x_1}}\]

is monotonically increasing in \(n\) and bounded from above by \(\Kale{\qof{[\W]}}{\pof{[\W]}}\), so it must converge (a bounded, monotonically increasing sequence converges). So, when does it close the gap?

To give intuition that it might do so, and without attempting to prove this formally, we can appeal to the Bernstein-von Mises theorem: as the number of data points tends to infinity, the posterior distribution converges to a Gaussian centered at the maximum likelihood estimate (MLE), as long as the model parameters are identifiable, that is, the true parameters we want to learn are unique, and they have support under the prior.

For the evidence bound to be meaningful, we already know that we need the approximate distribution \(\opq\) to have support in the prior \(\opp\); otherwise, the LHS is \(\infty\). Moreover, realizing that we take an expectation over \(\qof{\Y_n, ..., \Y_1 \given \x_n, ..., \x_1}\), we can decompose the KL term for the gap as:

\[\begin{aligned} &\Kale{\qof{[\W] \given \Y_n, \x_n, ..., \Y_1, \x_1}}{\pof{[\W] \given \Y_n, \x_n, ..., \Y_1, \x_1}} \\ &\quad = \E{\qof{\y_n, ..., \y_1 \given \x_n, ..., \x_1}}{\Kale{\qof{[\W] \given \y_n, \x_n, ..., \y_1, \x_1}}{\pof{[\W] \given \y_n, \x_n, ..., \y_1, \x_1}}} \\ &\quad = \simpleE{\qof{[\w']}}{\E{\qof{\y_n, ..., \y_1 \given \x_n, ..., \x_1, [\w']}}{\Kale{\qof{[\W] \given \y_n, \x_n, ..., \y_1, \x_1}}{\pof{[\W] \given \y_n, \x_n, ..., \y_1, \x_1}}}}. \end{aligned}\]

That is, we sample \([\w'] \sim \qof{[\w']}\), then sample \(\y_n, ..., \y_1 \given \x_n, ..., \x_1\) from the corresponding \(\qof{\y_n, ..., \y_1 \given \x_n, ..., \x_1, [\w']}\), and marginalize over these. Crucially, \([\w']\) is the true parameter (equivalence class) of the data-generating process for the inner KL divergence term. We thus take an expectation over KL terms fulfilling the conditions of the Bernstein-von Mises theorem:

\[\begin{aligned} \Kale{\qof{[\W] \given \y_n, \x_n, ..., \y_1, \x_1}}{\pof{[\W] \given \y_n, \x_n, ..., \y_1, \x_1}} \to 0. \end{aligned}\]

In other words, for a given \([\w']\), in the space of equivalence classes as defined previously, the equivalence class of all MLE solutions in the data limit, \([\mathrm{MLE}]\), will be unique by definition (the model is identifiable) and will match \([\w']\). (This follows from the consistency of MLE estimators, but also from Bernstein-von Mises with a flat/uninformative prior.) As the MLE is prior-independent once there is support for it, both \(\opq\) and \(\opp\) will converge to \([\w']\) with sufficient data. Taking the expectation, this yields \(\Kale{\qof{[\W] \given \Y..., \x...}}{\pof{[\W] \given \Y..., \x...}} \to 0\) for \(n \to \infty\), and thus, we have:

      \[\begin{aligned} & \Kale{\qof{[\W]}}{\pof{[\W]}} = \\ &\quad = \sup_{n\in \mathbb{N}} \Kale{\qof{\Y_n,...,\Y_1\given\x_n,...,\x_1}}{\pof{\Y_n,...,\Y_1\given\x_n,...,\x_1}}. \end{aligned}\]

      (Again, this is not a formal proof but an intuition for why the gap might close in the data limit.)

      In my opinion, this is a great result. We have shown both that the predictive prior term converges given our assumptions and that it converges to the symmetry-free parameter-based divergence in the data limit. This is a strong argument for the predictive prior term being meaningful and not just a technical trick.

      Let’s appreciate one more thing: the predictive prior can consist of infinitely many data points and still converge to a finite value.

      Parameter Priors vs. Predictive Priors

What is the advantage of all this?

In Bayesian deep learning, we often use parameter priors that are not meaningful and that do not take parameter symmetries into account. For example, a unit Gaussian prior over the parameters of a neural network can assign different densities to parameters that yield exactly the same predictions. While such a prior can be sensible from a parameter compression perspective (e.g. see Hinton and van Camp (1993)), this does not have to be the only consideration guiding us.

      With function priors and predictive priors, we can specify more meaningful priors because we can focus on the predictions and ignore the parameters. More importantly, this connects Bayesian approaches to data augmentation and other regularization techniques as we will see next.

      Given that priors over equivalence classes are difficult to express explicitly though, using the DPI to obtain a functional ELBO can be an easier way to express and approximate them.

      Label Entropy Regularization

      All this also helps us gain a new perspective on label entropy regularization. The functional evidence bound can be lower-bounded using the chain rule by:

      \[\begin{aligned} \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}} \\ \ge \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \E{\pdata{\x}}{\Kale{\qof{\Y \given \x}}{\pof{\Y \given \x}}}, \end{aligned}\]

      where we can expand the term under the second expectation to:

      \[\Kale{\qof{\Y \given \x}}{\pof{\Y \given \x}}=\CrossEntropy{\qof{\Y \given \x}}{\pof{\Y \given \x}} - \xHof{\qof{\Y \given \x}}.\]

      Assuming that our prior yields a uniform distribution over the labels, we can drop the cross entropy term because it is constant and obtain:

      \[\E{\qof{\w}}{-\log \pof{\Dany \given \w}} - \E{\pdata{\x}}{\xHof{\qof{\Y \given \x}}}.\]

This is the same as an MLE minimization objective with an additional entropy regularization term \(-\xHof{\qof{\Y \given \x}}\), evaluated at different inputs \(\x\), which prevents the model from overfitting to the labels and collapsing onto one-hot predictions.
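To make this concrete, here is a minimal sketch of the resulting objective, assuming a PyTorch classifier model that maps inputs to logits; the names and the single regularization weight are illustrative choices, not part of the derivation above.

import torch
import torch.nn.functional as F

def entropy_regularized_loss(model, x_labeled, y_labeled, x_unlabeled, reg_weight=1.0):
    # MLE term: negative log-likelihood (cross-entropy) on the labeled data.
    nll = F.cross_entropy(model(x_labeled), y_labeled)

    # Entropy of the predictive distribution q(Y | x) at (unlabeled) inputs x.
    log_probs = F.log_softmax(model(x_unlabeled), dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()

    # Minimizing nll - entropy rewards higher predictive entropy and thus
    # discourages collapse onto one-hot predictions.
    return nll - reg_weight * entropy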

      Thus, in the simplest approximation, the DPI and functional variational inference give us a new perspective on label entropy regularization.

      Knowledge Distillation

Naturally, assuming non-uniform prior predictions, \(\E{\pdata{\x}}{\Kale{\qof{\Y \given \x}}{\pof{\Y \given \x}}}\) can be related to knowledge distillation in deep neural networks, as introduced by Hinton et al. (2015).

The main technical difference is that knowledge distillation uses the reverse KL divergence instead of the forward KL divergence, while the conceptual difference is that we are not distilling knowledge from a teacher model but from the prior, which we downweight while also training our model on the data itself. However, the connection between knowledge distillation and continual learning using informative priors is manifest.
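The difference between the two divergences is easy to state in code. The following is a hedged sketch contrasting the forward KL of the predictive prior term with a reverse, distillation-style KL; model_logits and prior_logits are illustrative stand-ins for the trained model's and the prior's (or teacher's) predictions.

import torch
import torch.nn.functional as F

def predictive_kl_terms(model_logits, prior_logits):
    log_q = F.log_softmax(model_logits, dim=-1)  # q(Y | x): the model being trained
    log_p = F.log_softmax(prior_logits, dim=-1)  # p(Y | x): prior (or teacher) predictions
    q, p = log_q.exp(), log_p.exp()

    forward_kl = (q * (log_q - log_p)).sum(dim=-1).mean()  # KL(q || p), as in this post
    reverse_kl = (p * (log_p - log_q)).sum(dim=-1).mean()  # KL(p || q), distillation-style
    return forward_kl, reverse_kl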

      Conclusion

      In this blog post, we took a deep dive into the data processing inequality (DPI) and its surprisingly far-reaching implications for modern Bayesian deep learning. By carefully examining the assumptions, equality conditions, and chain rule of the DPI, we arrived at an intuitive understanding of why function-space variational inference (FSVI) can be such a powerful tool. The DPI perspective illuminates how FSVI side-steps issues with high-dimensional parameter spaces by focusing on matching Bayesian predictive posteriors.

      Reasoning about parameter equivalence classes under the lens of the DPI, we saw how predictive KL divergences can capture meaningful differences between models while ignoring superficial discrepancies due to symmetries. This provides a fresh perspective on the advantages of predictive priors over standard parameter priors commonly used in Bayesian neural networks.

      While our treatment only scratched the surface of the full mathematical story, the intuitions we developed allowed us to re-derive key results from the literature and uncover deep connections between seemingly disparate methods like entropy regularization, continual learning, and knowledge distillation. The examples and proofs peppered throughout solidified the core concepts.

      More than a bag of technical tricks, the DPI reveals itself to be a powerful conceptual tool for reasoning about models, objectives, and algorithms. I hope this post inspires the reader to seek the fundamental principles underpinning machine learning innovations and to use those principles as a guide for future research. With a solid grasp of foundational tools like the DPI, we can all contribute to demystifying and unifying the rapidly evolving field of Bayesian deep learning.


      Acknowledgements. Many thanks to Freddie Bickford Smith for very helpful comments and feedback on this post and to Tim Rudner for additional pointers to relevant literature and feedback on the FSVI section in particular 🤗

Andreas Kirsch

Elaborating on the Value of Flow Matching for Density Estimation

Motivation

      Normalizing Flows (NF) enable the construction of complex probability distributions by transforming a simple, known distribution into a more complex one. They do so by leveraging the change of variables formula, defining a bijection from the simple distribution to the complex one.

For a long time, flows were built by chaining several differentiable and invertible transformations. However, these diffeomorphic transformations limit the flows in their complexity, as the individual transformations have to be simple. Furthermore, this leads to a trade-off between sampling speed and evaluation performance. Their continuous counterpart, Continuous Normalizing Flows (CNFs), have been held back by limitations of their simulation-based maximum likelihood training. By utilizing Flow Matching, this limitation has been overcome, and CNFs have been shown to be a powerful tool for density estimation.

In the following sections, CNFs and Flow Matching are explained. Following the explanation, the empirical results of Flow Matching are presented. Finally, the application of Flow Matching in Simulation-Based Inference is discussed, which highlights its wide applicability and consistent improvements.

      Continuous Normalizing Flows

Continuous normalizing flows are among the first applications of neural ordinary differential equations (ODEs). Instead of the traditional layers of neural networks, the flow is defined by a vector field that is integrated over time.

      \[\frac{d}{dt} x(t) = f_{\theta}(x(t), t)\]

The vector field is typically parameterized by a neural network. While traditional layer-based flow architectures need to impose special architectural restrictions to ensure invertibility, CNFs are invertible as long as the uniqueness of the solution of the ODE is guaranteed. This is, for instance, the case if the vector field is Lipschitz continuous in \(x\) and continuous in \(t\). Many common neural network architectures satisfy these conditions. Hence, the above equation defines a diffeomorphism \(\phi_t(x_0) = x_0 + \int_0^t f_{\theta}(x(s), s)\,ds\) under the discussed assumptions. The change of variables formula can be applied to compute the density of a distribution that is transformed by \(\phi_t\).
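As a small illustration of this view, the following sketch integrates a given vector field f with a fixed-step Euler scheme to approximate \(\phi_t\); in practice, adaptive ODE solvers are used instead, and all names here are illustrative.

import numpy as np

def integrate_flow(f, x0, t1=1.0, n_steps=1000):
    # Approximate phi_{t1}(x0) = x0 + int_0^{t1} f(x(s), s) ds with Euler steps.
    x = np.asarray(x0, dtype=float)
    dt = t1 / n_steps
    for i in range(n_steps):
        x = x + dt * f(x, i * dt)  # one explicit Euler step along the vector field
    return x

# Example: for f(x, t) = -x, the exact flow is phi_t(x0) = exp(-t) * x0.
print(integrate_flow(lambda x, t: -x, np.array([1.0, 2.0])))  # approx [0.368, 0.736]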

      As usual, a CNF is trained to transform a simple base distribution \(p_B\), usually a standard normal distribution, into a complex data distribution \(p_D\). For each point in time \(t\in[0,1]\) the time-dependent vector field defines a distribution \(p_t\) (probability path) and the goal is to find a vector field \(f_\theta\) such that \(p_1=p_D\). This is usually achieved by maximum likelihood training, i.e. by minimizing the negative log-likelihood of the data under the flow.

      While CNFs are very flexible, they are also computationally expensive to train naively with maximum likelihood since the flow has to be integrated over time for each sample. This is especially problematic for large datasets which are needed for the precise estimation of complex high-dimensional distributions.

      Flow Matching

The authors of the Flow Matching paper propose a new method for training CNFs that avoids the need for simulation. The key idea is to regress the vector field directly onto an implicitly defined target vector field that induces a probability path \(p_t(x)\) with \(p_0=p_{B}\) and \(p_1=p_{D}\). Moreover, the authors propose a loss function that directly regresses the time-dependent vector field against conditional vector fields defined with respect to single samples.

Unconditional ImageNet-128 samples of a CNF trained using Flow Matching with Optimal Transport probability paths. Figure from the Flow Matching paper.

Assuming that the target vector field is known, the authors propose a loss function that directly regresses the time-dependent vector field:

\[L_{\textrm{FM}}(\omega) = \mathbb{E}_{t, p_t(x)}\left[\Vert f_{\omega}(x, t) - u_t(x)\Vert^2\right],\]

where \(u_t\) is a vector field that generates \(p_t\), and the expectation with respect to \(t\) is over a uniform distribution on \([0,1]\). Unfortunately, this loss function is not directly applicable because we do not know how to define the target vector field. However, it turns out that one can define appropriate conditional target vector fields by conditioning on the outcome \(x_1\):

      \[p_t(x) = \int p_t(x|x_1)p_{D}(x_1)d x_1.\]

Using this fact, the conditional flow matching loss can be defined, which yields the same gradients as the flow matching loss:

\[L_{\textrm{CFM}}(\omega) = \mathbb{E}_{t, p_t(x|x_1), p_D(x_1)}\left[\Vert f_{\omega}(x, t) - u_t(x|x_1)\Vert^2\right].\]

      Finally, one can easily obtain an unbiased estimate for this loss if samples from \(p_D\) are available, \(p_t(x|x_1)\) can be efficiently sampled, and \(u_t(x|x_1)\) can be computed efficiently. We discuss these points in the following.

      Gaussian Conditional Probability Paths

The vector field that defines a probability path is usually not unique. This is often due to invariance properties of the distribution, e.g. rotational invariance. The authors focus on the simplest possible vector fields to avoid unnecessary computations. They choose to define conditional probability paths that maintain the shape of a Gaussian throughout the entire process. Hence, the conditional probability paths can be described by a variable transformation \(\phi_t(x \mid x_1) = \sigma_t(x_1)x + \mu_t(x_1)\). The time-dependent functions \(\sigma_t\) and \(\mu_t\) are chosen such that \(\sigma_0(x_1) = 1\) and \(\sigma_1(x_1) = \sigma_\text{min}\) (chosen sufficiently small), as well as \(\mu_0(x_1) = 0\) and \(\mu_1(x_1)=x_1\). The corresponding probability path can be written as

      \[p_t(x|x_1) = \mathcal{N}(x; \mu_t(x_1), \sigma_t(x_1)^2 I).\]

      In order to train a CNF, it is necessary to derive the corresponding conditional vector field. An important contribution of the authors is therefore the derivation of a general formula for the conditional vector field \(u_t(x|x_1)\) for a given conditional probability path \(p_t(x|x_1)\) in terms of \(\sigma_t\) and \(\mu_t\):

\[u_t(x\mid x_1) = \frac{\sigma_t'(x_1)}{\sigma_t(x_1)}(x-\mu_t(x_1)) + \mu_t'(x_1),\]

where the prime \('\) denotes the derivative with respect to time \(t\). This formula follows from differentiating the flow \(\phi_t(x_0 \mid x_1) = \sigma_t(x_1)x_0 + \mu_t(x_1)\) with respect to \(t\) and substituting \(x_0 = (x - \mu_t(x_1))/\sigma_t(x_1)\).

Compared to the diffusion path's conditional score function, the OT path's conditional vector field has constant direction in time and is arguably simpler to fit with a parametric model. Note that the blue color denotes larger magnitude while the red color denotes smaller magnitude. Figure from the Flow Matching paper.

      They show that it is possible to recover certain diffusion training objectives with this choice of conditional probability paths, e.g. the variance preserving diffusion path with noise scaling function \(\beta\) is given by:

\[\begin{align*} \phi_t(x \mid x_1) &= \sqrt{1-\alpha_{1-t}^2}\,x + \alpha_{1-t}x_1 \\ \alpha_{t} &= \exp\left(-\frac{1}{2}\int_0^t \beta(s)\, ds\right) \end{align*}\]

      Additionally, they propose a novel conditional probability path based on optimal transport, which linearly interpolates between the base and the conditional target distribution.

      \[\phi_t(x \mid x_1) = (1-(1-\sigma_{\text{min}})t)x + tx_1\]

      The authors argue that this choice leads to more natural vector fields, faster convergence and better results.
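Putting the pieces together, a single training-loss evaluation for conditional flow matching with the OT path could be sketched as follows; vector_field_net is an assumed network taking (x, t), and the batch x1 is drawn from the data distribution. This is a sketch under those assumptions, not the authors' reference implementation.

import torch

def cfm_ot_loss(vector_field_net, x1, sigma_min=1e-4):
    # x1 ~ p_D with shape (batch, dim); x0 is drawn from the standard normal base.
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], 1)                 # t ~ U[0, 1], broadcast over features
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1   # x_t = phi_t(x0 | x1) on the OT path
    ut = x1 - (1 - sigma_min) * x0                 # target: d/dt phi_t(x0 | x1)
    return ((vector_field_net(xt, t) - ut) ** 2).sum(dim=-1).mean()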

      Empirical Results

      The authors investigate the utility of Flow Matching in the context of image datasets, employing CIFAR-10 and ImageNet at different resolutions. Ablation studies are conducted to evaluate the impact of choosing between standard variance-preserving diffusion paths and optimal transport (OT) paths in Flow Matching. The authors explore how directly parameterizing the generating vector field and incorporating the Flow Matching objective enhances sample generation.

Likelihood (BPD), quality of generated samples (FID), and evaluation time (NFE) for the same model trained with different methods. Figure from the Flow Matching paper.

The findings are presented through a comprehensive evaluation using various metrics such as negative log-likelihood (NLL), Fréchet Inception Distance (FID), and the number of function evaluations (NFE). Flow Matching with OT paths consistently outperforms other methods across different resolutions.

Flow Matching, especially when using OT paths, allows us to use fewer evaluations for sampling while retaining similar numerical error (left) and sample quality (right). Results are shown for models trained on ImageNet 32×32, and numerical errors are for the midpoint scheme. Figure from the Flow Matching paper.

      The study also delves into the efficiency aspects of Flow Matching, showcasing faster convergence during training and improved sampling efficiency, particularly with OT paths.

Sample paths from the same initial noise with models trained on ImageNet 64×64. The OT path reduces noise roughly linearly, while diffusion paths visibly remove noise only towards the end of the path. Note also the differences between the generated images. Figure from the Flow Matching paper.
      Image super-resolution on the ImageNet validation set. Figure from the Flow Matching paper.

      Additionally, conditional image generation and super-resolution experiments demonstrate the versatility of Flow Matching, achieving competitive performance in comparison to state-of-the-art models. The results suggest that Flow Matching presents a promising approach for generative modeling with notable advantages in terms of model efficiency and sample quality.

      Application of Flow Matching in Simulation-Based Inference

A particularly interesting application of density estimation, i.e. Normalizing Flows, is Simulation-Based Inference (SBI). In SBI, Normalizing Flows are used to estimate the posterior distribution of model parameters given some observations. Important factors here are the sample efficiency, scalability, and expressivity of the density model. Especially for the latter two, Flow Matching has been shown to yield an improvement. This is due to the efficient transport between source and target density and the flexibility afforded by the more complex transformations allowed by continuous normalizing flows. To start out, a brief introduction to SBI shall be given, as not many might be familiar with this topic.

      Primer on Simulation-Based Inference

In many practical scenarios, the likelihood function of a model is intractable and cannot be described analytically. This might be the case where the forward model is a complex or proprietary simulation, or where it is a physical experiment. In order to still be able to perform Bayesian inference, one can resort to a class of methods called Likelihood-free Inference. One popular method in this class is SBI. The core idea is to use a prior in combination with the simulator to obtain samples from the joint distribution of the parameters and the data. Based on these samples, the posterior can either be learned directly or the likelihood can be approximated. Depending on the exact method chosen, the approximated posterior is either amortized, i.e. does not require refitting when conditioned on different data, or non-amortized.

The figure depicts the schematic flow of information for different kinds of likelihood-free methods. Modern methods in SBI are depicted in the bottom row, where the likelihood is approximated in subfigure E, the posterior is approximated in subfigure F, and the likelihood ratio in subfigure G. Figure from the referenced SBI literature.

In order to formalize the method, let \(\theta \sim \pi(\theta)\) denote the parameters of a system and their respective prior distribution. The system under evaluation and the respective observations obtained are denoted by \(x = \mathcal{M}(\theta)\). To sample from the joint distribution \(p(\theta, x)\), a parameter \(\theta_i\) is sampled from the prior, and the observation is obtained by evaluating the forward model on that parameter, \(x_i = \mathcal{M}(\theta_i)\). Following this approach, a dataset of samples from the joint distribution can be generated: \(\mathcal{X} = \{ (\theta, \mathbf{x})_i \}^N_{i=1}\). A density estimator is then fitted on the provided dataset in order to estimate the desired distribution, e.g. directly the posterior \(q_{\omega}(\theta \mid x) \approx p(\theta \mid x)\).
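In code, generating such a dataset is just a loop over prior draws and simulator calls. The following toy sketch assumes a one-dimensional Gaussian prior and a noisy quadratic stand-in for the simulator \(\mathcal{M}\); both are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def generate_sbi_dataset(sample_prior, simulator, n_samples):
    # Draw (theta_i, x_i) pairs from the joint p(theta, x) = pi(theta) p(x | theta).
    thetas = np.array([sample_prior() for _ in range(n_samples)])  # theta_i ~ pi(theta)
    xs = np.array([simulator(th) for th in thetas])                # x_i = M(theta_i)
    return thetas, xs

thetas, xs = generate_sbi_dataset(
    sample_prior=lambda: rng.normal(0.0, 1.0),
    simulator=lambda th: th**2 + rng.normal(0.0, 0.1),
    n_samples=1000,
)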

The interested reader is directed to the referenced literature for a more rigorous introduction to SBI, and to the benchmarking literature for an excellent overview comparing the performance of different SBI approaches on various tasks. For the sake of this post, a more abstract understanding is enough.

      Flow Matching for Simulation-Based Inference

The approach of using the Flow Matching formulation to fit the density network is presented by Dax et al. In the setting described by the authors, and in the aforementioned SBI context, the goal is to approximate a posterior distribution over model parameters given observations, \(p(\theta \vert x)\). To learn the posterior, the Flow Matching loss is adapted to the following:

      \[\mathcal{L}_{FMPE} = \mathbb{E}_{t \sim p(t),\theta_1 \sim p(\theta), x \sim p(x \vert \theta_1),\theta_t \sim p_t(\theta_t \mid \theta_1)} \Vert f_{\omega,x}(\theta_t, t) - u_t(\theta_t \mid \theta_1) \Vert^2\]

The important detail to note here is the adaptation that minimizes the loss w.r.t. samples drawn from the joint distribution, as described in the general section on SBI. To this end, the expectation is taken w.r.t. \(\theta_1 \sim p(\theta), x \sim p(x \vert \theta_1)\), which yields the desired samples.

Another adaptation by the authors is to exchange the uniform distribution over time for a general distribution \(t \sim p(t)\). The effects of this substitution won't be discussed in depth here. However, adapting the distribution makes intuitive sense, as training gets harder close to the target distribution. Therefore, focusing on time steps \(t\) closer to one is beneficial, as the authors have also found in their empirical studies.
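A hedged sketch of the adapted loss, reusing the OT conditional path from the previous section: the conditioning on x simply becomes an extra input to the vector field network, and the power-law time distribution below is only an illustrative choice of a \(p(t)\) that concentrates mass near \(t = 1\), not the authors' exact parameterization.

import torch

def fmpe_loss(vector_field_net, theta1, x, sigma_min=1e-4, power=2.0):
    theta0 = torch.randn_like(theta1)                   # base sample
    # t = u^(1/(1+power)) with u ~ U[0, 1] has density proportional to t^power.
    t = torch.rand(theta1.shape[0], 1) ** (1.0 / (1.0 + power))
    theta_t = (1 - (1 - sigma_min) * t) * theta0 + t * theta1
    ut = theta1 - (1 - sigma_min) * theta0              # conditional target vector field
    return ((vector_field_net(theta_t, t, x) - ut) ** 2).sum(dim=-1).mean()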

In order to provide a general comparison of the Flow Matching-based SBI approach, the CFM model is tested on the SBI benchmarking tasks. The results show either equal or better performance, underscoring the approach's ability and applicability to SBI.

The figure depicts the results of the CFM model on the SBI benchmarking tasks, as carried out by the authors. Comparing the results to those obtained by neural posterior estimation with a normalizing flow shows comparable performance on most tasks while outperforming it on some.

      Besides the general benchmarks, the authors use their proposed technique to estimate the posterior distribution of gravitational wave parameters \(p(\theta \mid x)\) where \(\theta \in \mathbb{R}^{15}, x \in \mathbb{R}^{15744}\). In order to reduce the problem’s dimensionality and increase the information density, the observations are compressed to \(128\) dimensions using an embedding network.

Following the preprocessing of the data, three density estimators are fitted and compared to each other. The first method uses a neural spline flow, which has proven itself on these kinds of problems. It is compared to neural posterior estimation using the Flow Matching approach described here. Finally, a neural posterior estimator leveraging physical symmetries is used to estimate the targeted posterior. All were trained on a simulation budget of \(5 \cdot 10^6\) samples for a total of 400 epochs.

In order to evaluate the models' performances, the obtained posteriors were compared w.r.t. their 50% credible regions as well as the Jensen-Shannon divergence between the inferred posterior and reference results. The results shown below support the advantages found in the benchmarking tasks. The Flow Matching-based approach shows good performance for all shown parameters and has a clear advantage over the classical NPE approach.

The figure shows the individual performances of a classic NPE approach using neural spline flows, the proposed Flow Matching approach, and a physics-focused NPE approach. The results are shown for the 50% credible regions on the left, as well as the Jensen-Shannon divergence between the inferred posterior and reference results on the right. The Flow Matching-based approach shows good performance for all investigated parameters and has a clear advantage over the classical NPE approach. In the pair plot on the left, the choice was made to only show the four parameters for which the classical NPE method performs the worst. While the Flow Matching approach could perform worse on other dimensions, this is not the case, as shown on the right. Figure from Dax et al.

Whilst the examples are interesting in themselves, their evaluation has shown the applicability, scalability, and flexibility of Flow Matching for density estimation. These performance improvements in different areas motivated the discussion of Flow Matching in the first place and should hopefully be clear by now.

      A Personal Note

Whilst this is a blog post, we'd like to use this last part to express our personal thoughts on this topic. SBI is a powerful method, enabling Bayesian Inference where it would not be possible otherwise. (It might be more fitting to say that Bayesian Inference is not practically feasible in many scenarios, as, in theory, it might still be possible by sampling. However, this is essentially impossible where single evaluations of the forward model are expensive or further evaluations are simply not available, as shown in the example.) Due to the natural problem setting of SBI, where problems are high-dimensional, observations are scarce, and distributions complex, density estimators capable of meeting these challenges are required. In the past, Normalizing Flows have proven themselves to meet these challenges, whilst not resolving them completely. CNFs, due to their higher flexibility, have been a desirable method to put to the test, to see whether they could improve on these results, but they were limited by the inability to train them efficiently.

Formulating the Flow Matching variant of CNFs has allowed their application to complex density estimation tasks, as for example in SBI, and they've been shown to yield the expected improvements – on standard SBI benchmarking tasks as well as on a very high-dimensional task from the field of astrophysics. Furthermore, the generalization of CFM even broadens their applicability. It will be very interesting to see what possibilities are opened by this exact formulation and, in addition, what further improvements can be obtained by transferring techniques from Diffusion Models to Normalizing Flows.

      Maternus Herold
      \ No newline at end of file diff --git a/index.html b/index.html new file mode 100644 index 00000000..59ba039a --- /dev/null +++ b/index.html @@ -0,0 +1 @@ + Redirecting…

      Redirecting…

      Click here if you are not redirected. \ No newline at end of file diff --git a/index.md b/index.md deleted file mode 100644 index 470ab2f4..00000000 --- a/index.md +++ /dev/null @@ -1,4 +0,0 @@ ---- -title: home -redirect_to: /about ---- diff --git a/news/announcement_1/index.html b/news/announcement_1/index.html index 3333a286..8a36b0d6 100644 --- a/news/announcement_1/index.html +++ b/news/announcement_1/index.html @@ -1 +1 @@ - Announcement_1 | You R. Name

      Announcement_1

      A simple inline announcement.

      \ No newline at end of file + Announcement_1 | ICLR Blogposts 2024

      Announcement_1

      A simple inline announcement.

      \ No newline at end of file diff --git a/news/announcement_2/index.html b/news/announcement_2/index.html index 5410b4c6..600cf7e0 100644 --- a/news/announcement_2/index.html +++ b/news/announcement_2/index.html @@ -1 +1 @@ - A long announcement with details | You R. Name

      A long announcement with details

      Announcements and news can be much longer than just quick inline posts. In fact, they can have all the features available for the standard blog posts. See below.


      Jean shorts raw denim Vice normcore, art party High Life PBR skateboard stumptown vinyl kitsch. Four loko meh 8-bit, tousled banh mi tilde forage Schlitz dreamcatcher twee 3 wolf moon. Chambray asymmetrical paleo salvia, sartorial umami four loko master cleanse drinking vinegar brunch. Pinterest DIY authentic Schlitz, hoodie Intelligentsia butcher trust fund brunch shabby chic Kickstarter forage flexitarian. Direct trade cold-pressed meggings stumptown plaid, pop-up taxidermy. Hoodie XOXO fingerstache scenester Echo Park. Plaid ugh Wes Anderson, freegan pug selvage fanny pack leggings pickled food truck DIY irony Banksy.

      Hipster list

      • brunch
      • fixie
      • raybans
      • messenger bag

      Hoodie Thundercats retro, tote bag 8-bit Godard craft beer gastropub. Truffaut Tumblr taxidermy, raw denim Kickstarter sartorial dreamcatcher. Quinoa chambray slow-carb salvia readymade, bicycle rights 90’s yr typewriter selfies letterpress cardigan vegan.


      Pug heirloom High Life vinyl swag, single-origin coffee four dollar toast taxidermy reprehenderit fap distillery master cleanse locavore. Est anim sapiente leggings Brooklyn ea. Thundercats locavore excepteur veniam eiusmod. Raw denim Truffaut Schlitz, migas sapiente Portland VHS twee Bushwick Marfa typewriter retro id keytar.

      We do not grow absolutely, chronologically. We grow sometimes in one dimension, and not in another, unevenly. We grow partially. We are relative. We are mature in one realm, childish in another. —Anais Nin

      Fap aliqua qui, scenester pug Echo Park polaroid irony shabby chic ex cardigan church-key Odd Future accusamus. Blog stumptown sartorial squid, gastropub duis aesthetic Truffaut vero. Pinterest tilde twee, odio mumblecore jean shorts lumbersexual.

      \ No newline at end of file + A long announcement with details | ICLR Blogposts 2024

      A long announcement with details

      Announcements and news can be much longer than just quick inline posts. In fact, they can have all the features available for the standard blog posts. See below.


      Jean shorts raw denim Vice normcore, art party High Life PBR skateboard stumptown vinyl kitsch. Four loko meh 8-bit, tousled banh mi tilde forage Schlitz dreamcatcher twee 3 wolf moon. Chambray asymmetrical paleo salvia, sartorial umami four loko master cleanse drinking vinegar brunch. Pinterest DIY authentic Schlitz, hoodie Intelligentsia butcher trust fund brunch shabby chic Kickstarter forage flexitarian. Direct trade cold-pressed meggings stumptown plaid, pop-up taxidermy. Hoodie XOXO fingerstache scenester Echo Park. Plaid ugh Wes Anderson, freegan pug selvage fanny pack leggings pickled food truck DIY irony Banksy.

      Hipster list

      • brunch
      • fixie
      • raybans
      • messenger bag

      Hoodie Thundercats retro, tote bag 8-bit Godard craft beer gastropub. Truffaut Tumblr taxidermy, raw denim Kickstarter sartorial dreamcatcher. Quinoa chambray slow-carb salvia readymade, bicycle rights 90’s yr typewriter selfies letterpress cardigan vegan.


      Pug heirloom High Life vinyl swag, single-origin coffee four dollar toast taxidermy reprehenderit fap distillery master cleanse locavore. Est anim sapiente leggings Brooklyn ea. Thundercats locavore excepteur veniam eiusmod. Raw denim Truffaut Schlitz, migas sapiente Portland VHS twee Bushwick Marfa typewriter retro id keytar.

      We do not grow absolutely, chronologically. We grow sometimes in one dimension, and not in another, unevenly. We grow partially. We are relative. We are mature in one realm, childish in another. —Anais Nin

      Fap aliqua qui, scenester pug Echo Park polaroid irony shabby chic ex cardigan church-key Odd Future accusamus. Blog stumptown sartorial squid, gastropub duis aesthetic Truffaut vero. Pinterest tilde twee, odio mumblecore jean shorts lumbersexual.

      \ No newline at end of file diff --git a/news/announcement_3/index.html b/news/announcement_3/index.html index 2e9b4295..42f3d500 100644 --- a/news/announcement_3/index.html +++ b/news/announcement_3/index.html @@ -1 +1 @@ - Announcement_3 | You R. Name

      Announcement_3

      A simple inline announcement with Markdown emoji! :sparkles: :smile:

      \ No newline at end of file + Announcement_3 | ICLR Blogposts 2024

      Announcement_3

      A simple inline announcement with Markdown emoji! :sparkles: :smile:

      \ No newline at end of file diff --git a/redirects.json b/redirects.json new file mode 100644 index 00000000..db86bb74 --- /dev/null +++ b/redirects.json @@ -0,0 +1 @@ +{"/":"https://iclr-blogposts.github.io/2024/about"} \ No newline at end of file diff --git a/reviewing/index.html b/reviewing/index.html new file mode 100644 index 00000000..e7517451 --- /dev/null +++ b/reviewing/index.html @@ -0,0 +1 @@ + reviewing | ICLR Blogposts 2024

      Reviewing Process

      Reviewers will be required to only view the live content of the blog. We ask that they act in good faith, and refrain from digging into the repository’s logs and closed Pull Requests to find any identifying information on the authors.

      Reviewers should motivate their final decision based on the following points:

      • Is there a significant added value in comparison to the cited papers?
      • Is this added value supported by accurate, convincing, and clear arguments?
      • If the blogpost does not directly relate to a paper, does it address a relevant research topic from a novel perspective?
      • In case the field Conflict Of Interest is marked as YES, the reviewers are asked to pay specific attention to how the related work mentioned in the field ICLR Papers is treated: is the blogpost too positive (self-advertisement) or too negative (unfair assessment of this related work)?

In order to access the submissions, please follow these steps:

      1. Go to the OpenReview submission page.
      2. To see the blogpost submission, go to the blogpost url specified in the field ‘Blogpost Url’.
      \ No newline at end of file diff --git a/robots.txt b/robots.txt index a450fbe2..0da66967 100644 --- a/robots.txt +++ b/robots.txt @@ -1,7 +1,4 @@ ---- -permalink: /robots.txt ---- User-agent: * Disallow: -Sitemap: {{ site.baseurl | prepend: site.url }}/sitemap.xml +Sitemap: https://iclr-blogposts.github.io/2024/sitemap.xml diff --git a/sitemap.xml b/sitemap.xml new file mode 100644 index 00000000..14107457 --- /dev/null +++ b/sitemap.xml @@ -0,0 +1 @@ + https://iclr-blogposts.github.io/2024/news/announcement_1/ 2015-10-22T21:59:00+02:00 https://iclr-blogposts.github.io/2024/news/announcement_2/ 2015-11-07T21:11:00+01:00 https://iclr-blogposts.github.io/2024/news/announcement_3/ 2016-01-15T12:59:00+01:00 https://iclr-blogposts.github.io/2024/blog/alibi-mlm/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/bench-hvp/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/clml/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/deqalg-reasoning/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/diffusion-theory-from-scratch/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/distill-example/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/distill-example2/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/double-descent-demystified/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/dpi-fsvi/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/elaborating-on-the-value-of-flow-matching-for-density-estimation/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/exploring-meta-learned-curiosity-algorithms/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/fairness-ai-two-phil-or-just-one/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/hidden-convex-relu/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/language-model-development-as-a-new-subfield/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/mode-switching/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/primacy-bias-and-why-it-helps-to-forget/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/rlhf-without-rl/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/robust-foundation-model/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/the-n-implementation-details-of-rlhf-with-ppo/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/understanding-gradient-inversion-attacks-from-the-prior-knowledge-perspective/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/understanding-icl/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/unraveling-the-impact-of-training-samples/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/update-frequency-in-mbrl/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/blog/what-exactly-has-tabpfn-learned-to-do/ 2024-05-07T00:00:00+02:00 https://iclr-blogposts.github.io/2024/about/ https://iclr-blogposts.github.io/2024/call/ https://iclr-blogposts.github.io/2024/_pages/dropdown/ https://iclr-blogposts.github.io/2024/reviewing/ https://iclr-blogposts.github.io/2024/submitting/ https://iclr-blogposts.github.io/2024/blog/tag/bayesian-neural-network/ https://iclr-blogposts.github.io/2024/blog/tag/generalization/ https://iclr-blogposts.github.io/2024/blog/tag/log-marginal-likelihood/ 
https://iclr-blogposts.github.io/2024/blog/tag/conditional-log-marginal-likelihood/ https://iclr-blogposts.github.io/2024/blog/tag/information-theory/ https://iclr-blogposts.github.io/2024/blog/tag/model-evaluation/ https://iclr-blogposts.github.io/2024/blog/tag/model-selection/ https://iclr-blogposts.github.io/2024/blog/category/data-processing-inequality/ https://iclr-blogposts.github.io/2024/blog/category/information-theory/ https://iclr-blogposts.github.io/2024/blog/category/function-space-variational-inference/ https://iclr-blogposts.github.io/2024/blog/category/parameter-equivalence-classes/ https://iclr-blogposts.github.io/2024/blog/category/entropy-regularization/ https://iclr-blogposts.github.io/2024/blog/category/label-entropy-regularization/ https://iclr-blogposts.github.io/2024/blog/2024/ https://iclr-blogposts.github.io/2024/blog/ https://iclr-blogposts.github.io/2024/blog/page/2/ https://iclr-blogposts.github.io/2024/news/announcement_1/ 2024-04-08T03:01:30+02:00 https://iclr-blogposts.github.io/2024/news/announcement_2/ 2024-04-08T03:01:30+02:00 https://iclr-blogposts.github.io/2024/news/announcement_3/ 2024-04-08T03:01:30+02:00 https://iclr-blogposts.github.io/2024/publications/ 2024-04-08T03:01:30+02:00 \ No newline at end of file diff --git a/submitting/index.html b/submitting/index.html new file mode 100644 index 00000000..21cd1293 --- /dev/null +++ b/submitting/index.html @@ -0,0 +1,103 @@ + submitting | ICLR Blogposts 2024

      A more open process

      As with the previous edition of the Blog Post track, we forgo the requirement for total anonymity. The blog posts must be anonymized for the review process, but users will submit their anonymized blog posts via a pull request to the blog track’s repository (in addition to a submission on OpenReview). The pull request will trigger an automated pipeline that will build and deploy your post onto a website dedicated to the reviewing process. Reviewers will be able to access the posts directly through a public URL (generated by the Github action), and will submit their reviews on OpenReview. Reviewers should refrain from looking at the git history for the post, which may reveal information about the authors.

      This still largely follows the Double-Blind reviewing principle; it is no less double-blind than when reviewers are asked to score papers that have previously been released to arXiv, an overwhelmingly common practice in the ML community. This approach was chosen to lower the burden on both the organizers and the authors; in 2022, many submissions had to be reworked once deployed due to a variety of reasons. By allowing the authors to render their websites to Github Pages prior to the review process, we hope to avoid this issue entirely.

However, we understand the desire for total anonymity. Authors that wish to have a fully double-blind process might consider creating new GitHub accounts without identifying information, which they will only use for this track. For an example of a past submission that used an anonymous account in this manner, you can check out the World Models blog post (Ha and Schmidhuber, 2018) and the accompanying repository.

      Template

The workflow you will use to participate in this track should be relatively familiar to you if you have used Github Pages. Specifically, our website uses the Al-Folio template. This template uses Github Pages as part of its process, but it also utilizes a separate build step using Github Actions and intermediary Docker Images.

      We recommend paying close attention to the steps presented in this guide. Small mistakes here can have very hard-to-debug consequences.

      Contents

      Quickstart

      This section provides a summary of the workflow for creating and submitting a blog post. For more details about any of these steps, please refer to the appropriate section.

      1. Fork or download our repository.

      2. Create your blog post content as detailed in the Creating a Blog Post section. In summary, to create your post, you will:
        • Create a Markdown or HTML file in the _posts/ directory with the format _posts/2024-05-07-[SUBMISSION NAME].md. If you choose to write the post in HTML, then the extension of this last file should be .html instead of .md. NOTE: HTML posts are not officially supported, use at your own risk!
        • Add any static image to assets/img/2024-05-07-[SUBMISSION NAME]/.
        • Add any interactive HTML figures to assets/html/2024-05-07-[SUBMISSION NAME]/.
        • Put your citations into a bibtex file in assets/bibliography/2024-05-07-[SUBMISSION NAME].bib.

  DO NOT touch anything else in the repository. We will utilize an automated deployment action which will filter out all submissions that modify more than the list of files that we just described above. Read the relevant section for more details. Make sure to omit any identifying information for the review process.

3. To render your website locally, you can build a docker container via $ ./bin/docker_run.sh. Alternatively, you can set up your local environment to render the website via the conventional $ bundle exec jekyll serve --future command. More information on both of these configurations can be found in the Local Serving section.

      4. To submit your website, create a pull request to the main repository. Make sure that this PR’s title is _posts/2024-05-07-[SUBMISSION NAME]. This will trigger a GitHub Action that will build your blogpost and write the host’s URL in a comment to your PR.

      5. If accepted, we will merge the accepted posts to our main repository. See the camera ready section for more details on merging in an accepted blog post.

Should you edit ANY files other than your new post inside the _posts directory, and your new folder inside the assets directory, your pull request will automatically be rejected.

      You can view an example of a successful PR here. You can view an example of a PR with erroneous files here.

      Download the Blog Repository

Download or fork our repository. You will be submitting a pull request to this repository.

      Creating a Blog Post

To create a blog post in Markdown format, you can modify the example Markdown post _posts/2024-05-07-distill-example.md and rename it to _posts/2024-05-07-[SUBMISSION NAME].md, where [SUBMISSION NAME] is the name of your submission. You can see the rendered result of the sample post.

      While most users will want to create a post in the Markdown format, it is also possible to create a post in HTML format. For this, modify instead the example _posts/2024-05-08-distill-example2.html and rename it to _posts/2024-05-07-[SUBMISSION NAME].html. (NOTE: HTML is not officially supported, use at your own risk).

      You must modify the file’s header (or ‘front-matter’) as needed.

---
layout: distill
title: [Your Blog Title]
description: [Your blog post's abstract - no math/latex or hyperlinks!]
date: 2024-05-07
future: true
htmlwidgets: true

# anonymize when submitting
authors:
  - name: Anonymous

# do not fill this in until your post is accepted and you're publishing your camera-ready post!
# authors:
#   - name: Albert Einstein
#     url: "https://en.wikipedia.org/wiki/Albert_Einstein"
#     affiliations:
#       name: IAS, Princeton
#   - name: Boris Podolsky
#     url: "https://en.wikipedia.org/wiki/Boris_Podolsky"
#     affiliations:
#       name: IAS, Princeton
#   - name: Nathan Rosen
#     url: "https://en.wikipedia.org/wiki/Nathan_Rosen"
#     affiliations:
#       name: IAS, Princeton

# must be the exact same name as your blogpost
bibliography: 2024-05-07-distill-example.bib

# Add a table of contents to your post.
#   - make sure that TOC names match the actual section names
#     for hyperlinks within the post to work correctly.
toc:
  - name: [Section 1]
  - name: [Section 2]
  # you can additionally add subentries like so
    subsections:
    - name: [Subsection 2.1]
  - name: [Section 3]
---

# ... your blog post's content ...
You must change the title, description, toc, and eventually the authors fields (ensure that the submission is anonymous for the review process).

      Read our sample blog post carefully to see how you can add image assets, and how to write using \(\LaTeX\)! Read about rendering your post locally below.

      Important: make sure your post is completely anonymized before you export and submit it!

      Before going any further, it will be useful to highlight exactly what folders and files you are going to add or modify. Even if you use one of our simpler quickstart methods, this will always be what’s happening behind the scenes.

      If you clone our repo or download a release, you will find a directory structure that looks like the following (excluding all files and directories that are not relevant to your submission):

your_blogpost_repo/
│
├── _posts
│   ├── 2024-05-07-[YOUR SUBMISSION].md         # <--- Create this markdown file; this is your blogpost
│   └── ...
├── assets
│   ├── bibliography
│   │   ├── 2024-05-07-[YOUR SUBMISSION].bib    # <--- Create this bibtex file
│   │   └── ...
│   ├── html
│   │   ├── 2024-05-07-[YOUR SUBMISSION]        # <--- Create this directory and add interactive html figures
│   │   │   └──[YOUR HTML FIGURES].html
│   │   └── ...
│   ├── img
│   │   ├── 2024-05-07-[YOUR SUBMISSION]        # <--- Create this directory and add static images here
│   │   │   └──[YOUR IMAGES].png
│   │   └── ...
│   └── ...
└── ...

      In summary, to create your post, you will:

      • Create a Markdown (or HTML) file in the _posts/ directory with the format _posts/2024-05-07-[SUBMISSION NAME].md (_posts/2024-05-07-[SUBMISSION NAME].html in the case of an HTML file).
      • Add any static image assets to assets/img/2024-05-07-[SUBMISSION NAME]/.
      • Add any interactive HTML figures to assets/html/2024-05-07-[SUBMISSION NAME]/.
      • Put your citations into a bibtex file in assets/bibliography/2024-05-07-[SUBMISSION NAME].bib.

      DO NOT touch anything else in the blog post! If you do, our automated pipeline will reject your PR and you will have to undo those changes in order for it to be accepted!

      Note that 2024-05-07-[YOUR SUBMISSION] serves as a tag to your submission, so it should be the same for all three items. For example, if you’re writing a blog post called “Deep Learning”, you’d likely want to make your tag 2024-05-07-deep-learning, and the directory structure would look like this:

your_blogpost_repo/
│
├── _posts
│   ├── 2024-05-07-deep-learning.md         # <--- Create this markdown file; this is your blogpost
│   └── ...
├── assets
│   ├── bibliography
│   │   ├── 2024-05-07-deep-learning.bib    # <--- Create this bibtex file
│   │   └── ...
│   ├── html
│   │   ├── 2024-05-07-deep-learning        # <--- Create this directory and add interactive html figures
│   │   │   └──[YOUR HTML FIGURES].html
│   │   └── ...
│   ├── img
│   │   ├── 2024-05-07-deep-learning        # <--- Create this directory and add static images here
│   │   │   └──[YOUR IMAGES].png
│   │   └── ...
│   └── ...
└── ...

      Local serving

So far we've talked about how to get the relevant repository and create a blog post conforming to our requirements. Everything you have done so far has been in Markdown, but this is not the same format as web content (typically HTML, etc.). You'll now need to build your static web site (which is done using Jekyll) and then serve it on some local webserver in order to view it properly. We will now discuss how you can serve your blog site locally, so you can visualize your work before opening a pull request on the staging website to submit it to the ICLR venue.

      Method 1: Using Docker

      To render your website locally, we follow the instructions for Local setup using Docker (Recommended on Windows), but specifically you will need to create your own docker container rather than pull it from Dockerhub (because we modified the Gemfile).

      Create and run the Docker image:

./bin/docker_run.sh

      Remove the Gemfile.lock file if prompted. This will create a docker image labeled as al-folio:latest. Don’t use dockerhub_run.sh; this may result in issues with missing jekyll dependencies.

      Method 2: Using Jekyll Manually

For users wishing not to use a Docker container, you can install Jekyll directly on your computer and build the site using Jekyll directly. This is done at your own risk, as there are many potential points of error! Follow the instructions for rendering the website via the conventional method of $ bundle exec jekyll serve --future.

      Installation

You will need to manually install Jekyll, which will vary based on your operating system. The instructions here are only for convenience - you are responsible for making sure it works on your system, and we are not liable for potential issues that occur when adding your submissions to our repo!

      Ubuntu/Debian

      1. Install Ruby

   sudo apt install ruby-full
      2. Once installed, add the following to your .bashrc or whatever terminal startup script you may use (this is important because otherwise gem may complain about needing sudo permission to install packages):

   export GEM_HOME="$HOME/.gem"
   export PATH="$HOME/.gem/bin:$PATH"
      3. Install Jekyll and Bundler:

   gem install jekyll bundler

      MacOS and Windows

Mac and Windows users can find relevant guides for installing Jekyll in the official Jekyll installation documentation.

      Manual Serving

Once you've installed Jekyll and all of the dependencies, you can now serve the webpage on your local machine for development purposes using the bundle exec jekyll serve command.

      You may first need to install any project dependencies. In your terminal, from the directory containing the Jekyll project run:

bundle install

      This will install any plugins required by the project. To serve the webpage locally, from your terminal, in the directory containing the Jekyll project run:

bundle exec jekyll serve --future --port=8080 --host=0.0.0.0

      You should see something along the lines of:

> bundle exec jekyll serve
Configuration file: /home/$USER/blog_post_repo/_config.yml
            Source: /home/$USER/blog_post_repo
       Destination: /home/$USER/blog_post_repo/_site
 Incremental build: disabled. Enable with --incremental
      Generating...
       Jekyll Feed: Generating feed for posts

        ... you may see a lot of stuff in here related to images ...

                    done in 0.426 seconds.
 Auto-regeneration: enabled for '/home/$USER/blog_post_repo'
    Server address: http://0.0.0.0:8080/2024/
  Server running... press ctrl-c to stop.

If you see this, you've successfully served your web page locally! You can access it at the server address specified, in this case http://0.0.0.0:8080/2024/ (and the blog posts should once again be viewable at the blog/ endpoint).

      Submitting your Blog Post

      To submit your blog post:

1. Anonymize your blog post. Strip all identifying information from your post, including the authors list (replace it with Anonymous).
      2. Double check that your post matches the formatting requirements, including (but not limited to):
        • Only modify files in the following locations (failure to do so will result in your PR automatically being closed!):
          • a Markdown (or HTML) file in _posts/ with the format _posts/2024-05-07-[SUBMISSION NAME].md (or .html)
          • static image assets added to assets/img/2024-05-07-[SUBMISSION NAME]/
          • interactive HTML figures added to assets/html/2024-05-07-[SUBMISSION NAME]/
          • citations in a bibtex file in assets/bibliography/2024-05-07-[SUBMISSION NAME].bib
        • Have a short 2-3 sentence abstract in the description field of your front-matter (example)
        • Have a table of contents, formatted using the toc field of your front-matter (example)
        • Your bibliography uses a .bibtex file as per the sample post
      3. Open a pull request against the main branch of the 2024 repo. Fill in the checklist provided in the PR template. The title of your pull request should be exactly the name of your markdown/html file.
        • i.e. _posts/2024-05-07-[SUBMISSION NAME].md would require a PR name 2024-05-07-[SUBMISSION NAME]
4. (TBD) Your PR will automatically trigger two pipelines: one to verify that you have not modified any other file in the repo, and another that will create a unique URL for your contributed blog post.
        • Verify that everything looks correct in the given URL.
  • If the pipelines failed, check whether it was because of improper formatting (i.e. you modified restricted files). If this is the case, fix the issues. If the issue persists, please ping one of the repo admins.
      5. Submit the name of your blog post and its URL to our OpenReview through this link.

      Note: If you wish to make updates to your submission, you should update the content in the PR that you already opened.

      Reviewing Process

      Reviewers will be required to only view the live content of the reviewing website - the website to which the Pull Requests push to. We ask that they act in good faith, and refrain from digging into the repository’s logs and closed Pull Requests to find any identifying information on the authors.

      Camera-ready

      TBD - instructions will be provided closer to the submission deadline.

      \ No newline at end of file