diff --git a/.all-contributorsrc b/.all-contributorsrc
deleted file mode 100644
index bc6a9103..00000000
--- a/.all-contributorsrc
+++ /dev/null
@@ -1,45 +0,0 @@
-{
- "files": [
- "README.md"
- ],
- "imageSize": 100,
- "commit": false,
- "contributorsPerLine": 7,
- "projectName": "al-folio",
- "projectOwner": "alshedivat",
- "repoType": "github",
- "repoHost": "https://github.com",
- "badgeTemplate": "[core_contributors]: https://img.shields.io/badge/core_contributors-<%= contributors.length %>-orange.svg 'Number of core contributors'",
- "contributorTemplate": "\">\" width=\"<%= options.imageSize %>px;\" alt=\"\"/>
<%= contributor.name %>",
- "skipCi": true,
- "contributors": [
- {
- "login": "alshedivat",
- "name": "Maruan",
- "avatar_url": "https://avatars.githubusercontent.com/u/2126561?v=4",
- "profile": "http://maruan.alshedivat.com",
- "contributions": [
- "design",
- "code"
- ]
- },
- {
- "login": "rohandebsarkar",
- "name": "Rohan Deb Sarkar",
- "avatar_url": "https://avatars.githubusercontent.com/u/50144004?v=4",
- "profile": "http://rohandebsarkar.github.io",
- "contributions": [
- "code"
- ]
- },
- {
- "login": "pourmand1376",
- "name": "Amir Pourmand",
- "avatar_url": "https://avatars.githubusercontent.com/u/32064808?v=4",
- "profile": "https://amirpourmand.ir",
- "contributions": [
- "code"
- ]
- }
- ]
-}
diff --git a/.gitattributes b/.gitattributes
deleted file mode 100644
index 24244739..00000000
--- a/.gitattributes
+++ /dev/null
@@ -1 +0,0 @@
-_config.yml merge=ours
diff --git a/.github/FUNDING.yml b/.github/FUNDING.yml
deleted file mode 100644
index c78502f4..00000000
--- a/.github/FUNDING.yml
+++ /dev/null
@@ -1,12 +0,0 @@
-# These are supported funding model platforms
-
-github: # Replace with up to 4 GitHub Sponsors-enabled usernames e.g., [user1, user2]
-patreon: # Replace with a single Patreon username
-open_collective: # Replace with a single Open Collective username
-ko_fi: alshedivat
-tidelift: # Replace with a single Tidelift platform-name/package-name e.g., npm/babel
-community_bridge: # Replace with a single Community Bridge project-name e.g., cloud-foundry
-liberapay: # Replace with a single Liberapay username
-issuehunt: # Replace with a single IssueHunt username
-otechie: # Replace with a single Otechie username
-custom: # ['https://www.buymeacoffee.com/TkFxuKo']
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
deleted file mode 100644
index 511f5851..00000000
--- a/.github/ISSUE_TEMPLATE/bug_report.md
+++ /dev/null
@@ -1,38 +0,0 @@
----
-name: Bug report
-about: Create a report to help us improve
-title: ''
-labels: bug
-assignees: ''
-
----
-
-**Acknowledge the following**
-- [ ] I carefully read and followed the [Getting Started](https://github.com/alshedivat/al-folio#getting-started) guide.
-- [ ] I read through [FAQ](https://github.com/alshedivat/al-folio#faq) and searched through the [past issues](https://github.com/alshedivat/al-folio/issues), none of which addressed my issue.
-- [ ] The issue I am raising is a potential bug in al-folio and not just a usage question. [For usage questions, please post in the [Discussions](https://github.com/alshedivat/al-folio/discussions) instead of raising an issue.]
-
-**Describe the bug**
-A clear and concise description of what the bug is.
-
-**To Reproduce**
-Steps to reproduce the behavior:
-1. Go to '...'
-2. Click on '....'
-3. Scroll down to '....'
-4. See error
-
-**Expected behavior**
-A clear and concise description of what you expected to happen.
-
-**Screenshots**
-If applicable, add screenshots to help explain your problem.
-
-**System (please complete the following information):**
- - OS: [e.g. iOS]
- - Browser (and its version) [e.g. chrome, safari]
- - Jekyll version [e.g. 3.8.7]
- - Ruby version [e.g. 2.6.5]
-
-**Additional context**
-Add any other context about the problem here.
diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md
deleted file mode 100644
index 11fc491e..00000000
--- a/.github/ISSUE_TEMPLATE/feature_request.md
+++ /dev/null
@@ -1,20 +0,0 @@
----
-name: Feature request
-about: Suggest an idea for this project
-title: ''
-labels: enhancement
-assignees: ''
-
----
-
-**Is your feature request related to a problem? Please describe.**
-A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
-
-**Describe the solution you'd like**
-A clear and concise description of what you want to happen.
-
-**Describe alternatives you've considered**
-A clear and concise description of any alternative solutions or features you've considered.
-
-**Additional context**
-Add any other context or screenshots about the feature request here.
diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
deleted file mode 100644
index 82c43a79..00000000
--- a/.github/pull_request_template.md
+++ /dev/null
@@ -1,26 +0,0 @@
-
-
-
-## OpenReview Submission Thread
-
-
-
-## Checklist before opening a PR
-
-- [ ] I am opening a pull request against the `main` branch of the `2024` repo.
-- [ ] The title of my PR is exactly the name of my markdown file
- - i.e. `_posts/2024-05-07-[SUBMISSION NAME].md` would require a PR name `2024-05-07-[SUBMISSION NAME]`
-- [ ] I have **anonymized** my post: my author's list is `Anonymous`, and there is no potential
-  content which can reveal my or my collaborators' identities.
-- [ ] My post matches the formatting requirements, including (but not limited to):
- - [ ] I have **ONLY MODIFIED** files in the following locations (failure to do so will result in
- your PR automatically being closed!):
- - a Markdown (or HTML) file in `_posts/` with the format `_posts/2024-05-07-[SUBMISSION NAME].md` (or `.html`)
- - static image assets added to `assets/img/2024-05-07-[SUBMISSION NAME]/`
- - interactive HTML figures added to `assets/html/2024-05-07-[SUBMISSION NAME]/`
- - citations in a bibtex file in `assets/bibliography/2024-05-07-[SUBMISSION NAME].bib`
- - [ ] I have a short 2-3 sentence abstract in the `description` field of my front-matter ([example](https://github.com/iclr-blogposts/2024/blob/295ab5b4c31f2c7d421a4caf41e5481cbb4ad42c/_posts/2024-05-07-distill-example.md?plain=1#L4-L6))
- - [ ] I have a table of contents, formatted using the `toc` field of my front-matter ([example](https://github.com/iclr-blogposts/2024/blob/295ab5b4c31f2c7d421a4caf41e5481cbb4ad42c/_posts/2024-05-07-distill-example.md?plain=1#L36-L47))
- - [ ] My bibliography is correctly formatted, using a `.bibtex` file as per the sample post
-
-## Any other comments
diff --git a/.github/stale.yml b/.github/stale.yml
deleted file mode 100644
index 8ec2004d..00000000
--- a/.github/stale.yml
+++ /dev/null
@@ -1,18 +0,0 @@
-# Number of days of inactivity before an issue becomes stale
-daysUntilStale: 60
-# Number of days of inactivity before a stale issue is closed
-daysUntilClose: 7
-# Issues with these labels will never be considered stale
-exemptLabels:
- - pinned
- - security
- - enhancement
-# Label to use when marking an issue as stale
-staleLabel: wontfix
-# Comment to post when marking an issue as stale. Set to `false` to disable
-markComment: >
- This issue has been automatically marked as stale because it has not had
- recent activity. It will be closed if no further activity occurs. Thank you
- for your contributions.
-# Comment to post when closing a stale issue. Set to `false` to disable
-closeComment: false
diff --git a/.github/workflows/comment-on-error.yaml b/.github/workflows/comment-on-error.yaml
deleted file mode 100644
index 35099659..00000000
--- a/.github/workflows/comment-on-error.yaml
+++ /dev/null
@@ -1,36 +0,0 @@
-name: Comment on error
-
-on:
- workflow_run:
- workflows: ["filter-files"]
- types:
- - completed
-
-jobs:
- upload:
- runs-on: ubuntu-latest
- if: >
- github.event.workflow_run.event == 'pull_request' &&
- github.event.workflow_run.conclusion == 'failure'
- steps:
- - name: Download build artifact from triggered workflow
- uses: dawidd6/action-download-artifact@v2
- with:
- run_id: ${{ github.event.workflow_run.id }}
- # name: website_out
- # path: site_out
- search_artifacts: true
- - name: Get ISSUE_NUMBER
- run: echo "ISSUE_NUMBER=$(cat website_out/pr_number.txt)" >> $GITHUB_ENV
- - name: Get filterout
- run: echo "MSG=$(cat website_out/filterout.txt)" >> $GITHUB_ENV
- - uses: actions/github-script@v6
- with:
- github-token: ${{ secrets.GITHUB_TOKEN }}
- script: |
- github.rest.issues.createComment({
- issue_number: ${{ env.ISSUE_NUMBER }},
- owner: context.repo.owner,
- repo: context.repo.repo,
- body: "⚠️ **We have detected a problem with your submission!** ⚠️\n\n${{ env.MSG }}\n\nPlease make the aforementioned changes and re-submit :)"
- })
diff --git a/.github/workflows/deploy-docker-tag.yml b/.github/workflows/deploy-docker-tag.yml
deleted file mode 100644
index 3e6b6a3a..00000000
--- a/.github/workflows/deploy-docker-tag.yml
+++ /dev/null
@@ -1,40 +0,0 @@
-name: Docker Image CI (Upload Tag)
-
-on:
- push:
- tags:
- - 'v*'
-
-jobs:
-
- build:
-
- runs-on: ubuntu-latest
-
- steps:
- - name: Checkout
- uses: actions/checkout@v2
- - name: Buildx
- uses: docker/setup-buildx-action@v1
-
- -
- name: Docker meta
- id: meta
- uses: docker/metadata-action@v4
- with:
- images: amirpourmand/al-folio
-
- - name: Login
- uses: docker/login-action@v1
- with:
- username: ${{ secrets.DOCKER_USERNAME }}
- password: ${{ secrets.DOCKER_PASSWORD }}
-
- - name: Build and push
- uses: docker/build-push-action@v3
- with:
- context: .
- push: ${{ github.event_name != 'pull_request' }}
- tags: ${{ steps.meta.outputs.tags }}
- labels: ${{ steps.meta.outputs.labels }}
-
diff --git a/.github/workflows/deploy-for-review.yml b/.github/workflows/deploy-for-review.yml
deleted file mode 100644
index f76729d1..00000000
--- a/.github/workflows/deploy-for-review.yml
+++ /dev/null
@@ -1,54 +0,0 @@
-name: Deploy post for review
-
-on:
- workflow_run:
- workflows: ["filter-files"]
- types:
- - completed
-
-jobs:
- upload:
- runs-on: ubuntu-latest
- if: >
- github.event.workflow_run.event == 'pull_request' &&
- github.event.workflow_run.conclusion == 'success'
- steps:
- - name: Download build artifact from triggered workflow
- uses: dawidd6/action-download-artifact@v2
- with:
- run_id: ${{ github.event.workflow_run.id }}
- # name: website_out
- # path: site_out
- search_artifacts: true
- - run: unzip website_out/site.zip
-      # set the SLUG environment variable to the contents of website_out/slug.txt
- - name: Get SLUG
- run: echo "SLUG=$(cat website_out/slug.txt)" >> $GITHUB_ENV
- - name: Print SLUG
- run: echo ${{env.SLUG}}
- # the post name is the slug minus the first 11 characters
- - name: Get post name
- run: echo "POST_NAME=${SLUG:11}" >> $GITHUB_ENV
- - name: Print POST_NAME
- run: echo ${{env.POST_NAME}}
- - name: Get ISSUE_NUMBER
- run: echo "ISSUE_NUMBER=$(cat website_out/pr_number.txt)" >> $GITHUB_ENV
- - name: Print ISSUE_NUMBER
- run: echo ${{env.ISSUE_NUMBER}}
- - name: Setup AWS CLI
- uses: aws-actions/configure-aws-credentials@v4
- with:
- aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
- aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
- aws-region: eu-west-1
- - run: aws s3 sync --region eu-west-1 --acl public-read _site s3://iclr-blogposts-2024/${{env.SLUG}}-${{env.ISSUE_NUMBER}}
- - uses: actions/github-script@v6
- with:
- github-token: ${{ secrets.GITHUB_TOKEN }}
- script: |
- github.rest.issues.createComment({
- issue_number: ${{ env.ISSUE_NUMBER }},
- owner: context.repo.owner,
- repo: context.repo.repo,
- body: '👋 Thanks for your submission! We have successfully built your website and we will push it shortly to the URL https://d2jud02ci9yv69.cloudfront.net/${{env.SLUG}}-${{env.ISSUE_NUMBER}}/blog/${{env.POST_NAME}}/ !'
- })
diff --git a/.github/workflows/deploy-image.yml b/.github/workflows/deploy-image.yml
deleted file mode 100644
index b747dfc1..00000000
--- a/.github/workflows/deploy-image.yml
+++ /dev/null
@@ -1,31 +0,0 @@
-name: Docker Image CI
-
-on:
- push:
- branches: [ master ]
-
-jobs:
-
- build:
-
- runs-on: ubuntu-latest
- if: github.repository_owner == 'alshedivat'
-
- steps:
- - name: Checkout
- uses: actions/checkout@v2
- - name: Buildx
- uses: docker/setup-buildx-action@v1
-
- - name: Login
- uses: docker/login-action@v1
- with:
- username: ${{ secrets.DOCKER_USERNAME }}
- password: ${{ secrets.DOCKER_PASSWORD }}
-
- - name: Build and push
- uses: docker/build-push-action@v2
- with:
- context: .
- push: true
- tags: amirpourmand/al-folio
diff --git a/.github/workflows/deploy.yml b/.github/workflows/deploy.yml
deleted file mode 100644
index cbfb6996..00000000
--- a/.github/workflows/deploy.yml
+++ /dev/null
@@ -1,39 +0,0 @@
-name: deploy
-
-on:
- push:
- branches:
- - master
- - main
- workflow_dispatch: {}
-
-jobs:
- deploy:
- runs-on: ubuntu-latest
- steps:
- - name: Checkout code
- uses: actions/checkout@v3
- - name: Setup Ruby
- uses: ruby/setup-ruby@v1
- with:
- ruby-version: '3.0.2'
- bundler-cache: true
- - name: Install deps
- run: |
- npm install -g mermaid.cli
- - name: Setup deploy options
- id: setup
- run: |
- git config --global user.name "GitHub Action"
- git config --global user.email "41898282+github-actions[bot]@users.noreply.github.com"
- if [[ ${GITHUB_REF} = refs/pull/*/merge ]]; then # pull request
- echo "SRC_BRANCH=${GITHUB_HEAD_REF}" >> $GITHUB_OUTPUT
- echo "NO_PUSH=--no-push" >> $GITHUB_OUTPUT
- elif [[ ${GITHUB_REF} = refs/heads/* ]]; then # branch, e.g. master, source etc
- echo "SRC_BRANCH=${GITHUB_REF#refs/heads/}" >> $GITHUB_OUTPUT
- fi
- echo "DEPLOY_BRANCH=gh-pages" >> $GITHUB_OUTPUT
- - name: Deploy website
- run: yes | bash bin/deploy --verbose ${{ steps.setup.outputs.NO_PUSH }}
- --src ${{ steps.setup.outputs.SRC_BRANCH }}
- --deploy ${{ steps.setup.outputs.DEPLOY_BRANCH }}
diff --git a/.github/workflows/filter-files.yml b/.github/workflows/filter-files.yml
deleted file mode 100644
index cfbc3c0b..00000000
--- a/.github/workflows/filter-files.yml
+++ /dev/null
@@ -1,113 +0,0 @@
-name: filter-files
-
-on:
- pull_request:
- branches:
- - main
-
-# hack for https://github.com/actions/cache/issues/810#issuecomment-1222550359
-#env:
-# SEGMENT_DOWNLOAD_TIMEOUT_MIN: 3
-
-jobs:
- files-changed:
- name: Detect what files changed
- # if: contains(github.event.pull_request.labels.*.name, 'submission')
- # if: ${{ github.event.label.name == 'submission' }}
- runs-on: ubuntu-20.04
- timeout-minutes: 3
- outputs:
- offendingfiles: ${{ steps.pythonfilter.outputs.offendingfiles }}
-
- steps:
- - name: Checkout code
- uses: actions/checkout@v3
- - uses: actions/setup-python@v4
- with:
- python-version: '3.10'
- - run: pip install python-slugify pyyaml
- - uses: dorny/paths-filter@v2
- id: filter
- with:
- # Enable listing of files matching each filter.
- # Paths to files will be available in `${FILTER_NAME}_files` output variable.
- # Paths will be escaped and space-delimited.
- # Output is usable as command-line argument list in Linux shell
- list-files: shell
-
- # In this example changed files will be checked by linter.
- # It doesn't make sense to lint deleted files.
- # Therefore we specify we are only interested in added or modified files.
- filters: |
- changed:
- - '**'
- - name: Check label
- run: echo ${{ github.event.label.name }}
- - name: Save title slug
- run: echo "SLUG=`slugify ${{ github.event.pull_request.title }}`" >> $GITHUB_ENV
- - name: Print slug
- run: echo ${{env.SLUG}}
- - name: Check if changed files fit our filters
- id: pythonfilter
- if: ${{ steps.filter.outputs.changed == 'true' }}
- # todo read from step below
- run: |
- FILTEROUT=$(python3 bin/filterpaths.py $SLUG ${{ steps.filter.outputs.changed_files }} | tail -1)
- echo "offendingfiles=$FILTEROUT" >> $GITHUB_OUTPUT
- mkdir site_out
- python3 bin/filterpaths.py $SLUG ${{ steps.filter.outputs.changed_files }}
- #- uses: actions/github-script@v6
- # if: always() && steps.pythonfilter.outcome == 'failure'
- # with:
- # script: |
- # github.rest.issues.createComment({
- # issue_number: context.issue.number,
- # owner: context.repo.owner,
- # repo: context.repo.repo,
- # body: "⚠️ **We have detected a problem with your submission!** ⚠️\n\n${{ steps.pythonfilter.outputs.offendingfiles }}\n\nPlease make the aforementioned changes and re-submit :)"
- # })
- - name: Setup Ruby
- if: always() && steps.pythonfilter.outcome == 'success'
- uses: ruby/setup-ruby@v1
- with:
- ruby-version: '3.0.2'
- bundler-cache: true
- - name: Install deps
- if: always() && steps.pythonfilter.outcome == 'success'
- run: |
- npm install -g mermaid.cli
- - name: Setup deploy options
- if: always() && steps.pythonfilter.outcome == 'success'
- id: setup
- run: |
- git config --global user.name "GitHub Action"
- git config --global user.email "41898282+github-actions[bot]@users.noreply.github.com"
- if [[ ${GITHUB_REF} = refs/pull/*/merge ]]; then # pull request
- echo "SRC_BRANCH=${GITHUB_HEAD_REF}" >> $GITHUB_OUTPUT
- echo "NO_PUSH=--no-push" >> $GITHUB_OUTPUT
- elif [[ ${GITHUB_REF} = refs/heads/* ]]; then # branch, e.g. master, source etc
- echo "SRC_BRANCH=${GITHUB_REF#refs/heads/}" >> $GITHUB_OUTPUT
- fi
- echo "DEPLOY_BRANCH=gh-pages" >> $GITHUB_OUTPUT
- - name: Build website
- if: always() && steps.pythonfilter.outcome == 'success'
- run: yes | bash bin/build --verbose ${{ steps.setup.outputs.NO_PUSH }}
- --src ${{ steps.setup.outputs.SRC_BRANCH }}
- --deploy ${{ steps.setup.outputs.DEPLOY_BRANCH }}
- --slug ${{env.SLUG}}-${{ github.event.number }}
- - name: Save slug
- if: always()
- run: echo ${{env.SLUG}} > site_out/slug.txt
- - name: Save PR number
- if: always()
- env:
- PR_NUMBER: ${{ github.event.number }}
- run: echo $PR_NUMBER > site_out/pr_number.txt
- - name: Save filterout
- if: always()
- run: echo "${{ steps.pythonfilter.outputs.offendingfiles }}" > site_out/filterout.txt
- - uses: actions/upload-artifact@v2
- if: always()
- with:
- name: website_out
- path: site_out
diff --git a/.nojekyll b/.nojekyll
new file mode 100644
index 00000000..e69de29b
diff --git a/404.html b/404.html
index 0da4ee0b..2659a807 100644
--- a/404.html
+++ b/404.html
@@ -1,9 +1 @@
----
-layout: page
-permalink: /404.html
-title: "Page not found"
-description: "Looks like there has been a mistake. Nothing exists here."
-redirect: true
----
-
-
-You will be redirected to the main page within 3 seconds. If not redirected, please click here.
+You will be redirected to the main page within 3 seconds. If not redirected, please click here.
- -(...) -50009 0 -50256 0 -madeupword0000 0 -- -``` -$ python -Python 3.8.10 (default, Jun 22 2022, 20:18:18) -[GCC 9.4.0] on linux -Type "help", "copyright", "credits" or "license" for more information. ->>> from fairseq.models.roberta import RobertaModel ->>> roberta = RobertaModel.from_pretrained('/checkpoint-dir', 'zero-clap-large.pt', '/dict-dir') -(...) ->>> roberta.fill_mask('The capital of China ismadeupword0001 0 -madeupword0002 0-
- Conjugate gradient to solve \eqref{eq:linear_system}
- Input Initialization \(v_0\)
-
- Initialization
- $$
- r_0 = \textcolor{orange}{\nabla^2f(\theta) v_0} - b,\quad p_0 = -r_0,\quad t = 0
- $$
-
- While \(r_t \neq 0\)
- \begin{align*}
- \alpha_t &=\frac{r_t^\top r_t}{p_t^\top \textcolor{orange}{\nabla^2f(\theta) p_t}} \\
- v_{t+1} &=v_t + \alpha_t p_t \\
- r_{t+1} &=r_t + \alpha_t\textcolor{orange}{\nabla^2f(\theta) p_t} \\
- \beta_{t+1} &=\frac{r_{t+1}^\top r_{t+1}}{r_t^\top r_t} \\
- p_{t+1} &=-r_{t+1} + \beta_{t+1} p_t\\
- t &=t + 1
- \end{align*}
-
- Lanczos' algorithm
-
- Input Initial vector \(v_0\).
- Initialization
- $$
- w'_0 = \textcolor{orange}{\nabla^2f(\theta)v_0},\quad \alpha_0 = w_0'^\top v_0,\quad w_0 = w_0' - \alpha_0 v_0
- $$
-
- For \(i = 1,\dots, k-1\):
-
- \begin{align*}
- \beta_i &= \|w_{i-1}\|\\
- v_{i} &= \frac{w_{i-1}}{\beta_{i}}\\
- w_i' &= \textcolor{orange}{\nabla^2f(\theta)v_i}\\
- \alpha_i &= w_i'^\top v_i\\
- w_i &= w_i' - \alpha_i v_i - \beta_iv_{i-1}
- \end{align*}
-
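For concreteness, here is a minimal NumPy sketch of the conjugate gradient recursion shown above (an illustration added here, not code from the original post); the same access pattern applies to Lanczos. The explicit matrix `H` stands in for the Hessian-vector products $\nabla^2 f(\theta)\,p$ highlighted in orange, which in practice are computed matrix-free:

```python
import numpy as np

def conjugate_gradient(hvp, b, v0, tol=1e-10, max_iter=100):
    """Solve (nabla^2 f) v = b using only Hessian-vector products hvp(p)."""
    v = v0.copy()
    r = hvp(v) - b              # r_0 = H v_0 - b
    p = -r                      # p_0 = -r_0
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        Hp = hvp(p)             # the only Hessian access per iteration
        alpha = (r @ r) / (p @ Hp)
        v = v + alpha * p
        r_new = r + alpha * Hp
        beta = (r_new @ r_new) / (r @ r)
        p = -r_new + beta * p
        r = r_new
    return v

# toy quadratic: the Hessian is the constant SPD matrix H
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
v = conjugate_gradient(lambda p: H @ p, b, v0=np.zeros(2))
print(v, np.allclose(H @ v, b))
```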
- All models are wrong, but some are useful. - -- -When working with real-world data, we must always assume that our models are misspecified to some degree. Models simplify complex systems and cannot capture every nuance of the data-generating process. Consequently, the goal of model selection is not to find the "true" model but rather to identify the most useful model that balances simplicity, interpretability, and predictive performance. - - - -Without model misspecification, we would always converge to the maximum likelihood estimate (MLE) that matches the data-generating model in the infinite data limit as the [Bernstein-von Mises' theorem](https://www.wikiwand.com/en/Bernstein%E2%80%93von_Mises_theorem) tells us that posteriors converge to the MLE in the limit. However, in practice, we are always dealing with misspecified models, and the MLE will not converge to the true data-generating model. - - - -### Infinite Data Limit - -Let's return to our question of when the different quantities lead to similar rankings. - -While a conditional joint marginal cross-entropy, as a sum of conditional marginal cross-entropies, is obviously larger than each individual term, if we divide the joint marginal cross-entropy by the number of samples in the conditional joint distribution, we obtain the **rate**
-Thanks for your interest in our paper and your comments. Here are our comments about the blog as it is currently framed: - -(1) Thank you for pointing out a bug in the CLML computation for Figure 5b. We note that this bug is only relevant to a single panel of a single figure in the main text. We have re-run this experiment with the right CLML, and the results, attached here, are qualitatively the same. In summary, it was a very minor part of the paper, and even for that part it did not affect the take-away. We also attach the results of the correlation between the BMA test accuracy and the negative validation loss. You suggest in your post that the validation loss might correlate better with the BMA test accuracy than the CLML given that we use 20 samples for NAS. Our empirical results show the opposite conclusion. Additionally, we are not suggesting the CLML as a replacement to cross-validation but rather as a minor way to modify the LML for improvements in predicting generalization. Finally, we attach results for different sample sizes (20 samples vs. 100 samples) to address your comments on the sample size used to estimate the CLML. As we can see in the figure, the Spearman correlation factor is quite similar. 20 samples appears to provide a reasonable estimate of the CLML for these purposes, and is different from validation loss. - -{% capture max-width %} -" style="max-width: 20em; -{% endcapture %} -{% include figure.html path="assets/img/2024-05-07-clml/rebuttal_1.png" max-width=max-width %} -{% include figure.html path="assets/img/2024-05-07-clml/rebuttal_2.png" max-width=max-width %} -{% include figure.html path="assets/img/2024-05-07-clml/rebuttal_3.png" max-width=max-width %} - -(2) Your post currently opens by suggesting that there is something wrong with our experiments, likely either an LML approximation or a CLML issue, because we note that the LML correlates more poorly with generalization for larger datasets (where “large” is relative in the context of a specific experiment). A few points here: (i) this result is actually completely expected. The LML is in fact non-monotonic in how well it predicts generalization. For small datasets, the prior should be reasonably predictive of generalization. For intermediate datasets, the first terms in the LML decomposition have a negative effect on the correlation with generalization. For asymptotically large datasets, the first terms have a diminishing effect, and we get a consistent estimator; (ii) almost all of our experiments are exact, and we see this behaviour in the exact experiments for the Fourier model. For example, for the Fourier feature experiment in Fig 4(d), LML picks the better generalizing model for n < 50 and n > 296. For n in [50, 296] it picks the wrong model. For large neural network models, it is reasonable that the exact LML could pick the wrong model for CIFAR-sized datasets. (iii) any potential issues with the CLML are not relevant to these considerations, which are about the behaviour of the LML. - -(3) Your post currently suggests that issues with approximate inference could be responsible for our take-aways, rather than issues with the LML in general. But as we note in (2), almost all of our experiments use the exact LML and CLML: the density model, Fourier features, Gaussian processes, and deep learning exps on DKL, and there was never any bug associated with CLML computation in these experiments. 
The takeaways for the Laplace experiments are consistent with the exact experiments, and also expected, as above. While it’s true that the CLML can be estimated more effectively than the LML for the Laplace experiments, this is actually an advantage of the CLML that we note in the paper. The LML results also stand on their own, as we discuss above. - -(4) Your post places a lot of importance on Figure 5, as if it is the main result of the paper and our main “DNN” experiments. We stand by the results of Figure 5, but it is a relatively minor component of the paper. As we’ve mentioned most of our results are exact, including our DKL experiments, which are certainly the most substantial DNN experiments, with practically exciting results for transfer and few-shot learning. The DKL experiments are actually where we expect the CLML to be practically useful, and currently they seem to be overlooked in the post. - -(5) The blog seems to question the learning curve experiments, but these experiments in Figure 4 are exact, with no Laplace approximation, and relatively straightforward. - -(6) Your post seems to be negative about the CLML, presenting its similarity with cross-validation as a potential drawback, and implying the skepticism about the CLML should affect the interpretation of our take-aways. Two points here: (i) as above, the CLML is independent of most of our take-aways, which are about the properties of the LML; (ii) our goal with the CLML was not to introduce something starkly different from cross-validation, but to show how a very minor modification to the LML could improve alignment with generalization. Moreover, the DKL CLML results are quite promising as an efficient way to do gradient based estimation of a large number of hyperparameters. - -(7) The blog opens as if it is leading up to some fatal flaw. But as above, (i) the LML considerations are independent of the CLML, (ii) most of the experiments are exact, (iii) the trends for the exact and approximate inference procedures are the same and are naturally understandable and explainable, such as the non-monotonic trend in how well the LML correlates with generalization, and (iv) the CLML bug only affected Figure 5, panel b, and when it’s corrected the qualitative take-away is the same as before. - -We appreciate your interest and effort in reading the paper, and we think your questions will improve the clarity of the paper, which we have updated with an acknowledgement to you. Given the above considerations, we do think there would need to be substantial revisions to the blog post to accurately and fairly reflect the paper. We would appreciate being able to see the revisions before it’s posted. - -Best wishes,\\ -Sanae, Pavel, Greg, Micah, Andrew -- -### Ablation: CLML vs. BMA Validation Loss vs. (non-BMA) Validation Loss - -Let us examine the new results: - -In the three panels below, two panels show test accuracy vs. validation loss; one shows test accuracy vs. CLML. The left-most panel is the BMA test accuracy vs. (negative) BMA validation loss, the middle panel is vs. the CLML, and the right-most panel is vs. the (negative) non-BMA validation loss. - -Note that the left-most panel is from *v1*, which was accidentally computing the BMA validation loss, and whose axis label is adapted here from *v1* for clarity. The two other plots are from *v2* after fixing the bug. 
See commits [here](https://github.com/Sanaelotfi/Bayesian_model_comparison/commit/a579aa292723dc20a6105ec8f4fff1045dd9a9fd) for fixing the CLML estimation and [here](https://github.com/Sanaelotfi/Bayesian_model_comparison/commit/3fa8ca2ecb314ee881f6c95a602ef58b9ccd3620) for computing the non-BMA validation loss. - -{% capture width %} -" style="width: 20em; -{% endcapture %} -
- We do not grow absolutely, chronologically. We grow sometimes in one dimension, and not in another, unevenly. We grow partially. We are relative. We are mature in one realm, childish in another. - —Anais Nin -- -*** - - -## Layouts - -The main text column is referred to as the body. -It is the assumed layout of any direct descendants of the `d-article` element. - -
.l-body
-.l-page
-.l-body-outset
-.l-page-outset
-.l-screen
-.l-screen-inset
-.l-gutter
-- This is a sample blog post written in HTML (while the other sample post is written in Markdown). Authors have the choice to write in HTML or Markdown. While Markdown is easier to write, HTML gives you more control over the layout of your post. Furthermore, Markdown often interacts in unexpected ways with MathJax and other HTML widgets. If you are having trouble with Markdown, try writing in HTML instead. -
- -- Note: please use the table of contents as defined in the front matter rather than the traditional markdown styling. -
- -This theme supports rendering beautiful math in inline and display modes using MathJax 3 engine.
-You just need to surround your math expression with $$
, like $$ E = mc^2 $$
.
-If you leave it inside a paragraph, it will produce an inline expression, just like \(E = mc^2\).
To use display mode, again surround your expression with $$
and place it as a separate paragraph.
-Here is an example:
-$$
-\left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right)
-$$
-
Note that MathJax 3 is a major re-write of MathJax -that brought a significant improvement to the loading and rendering speed, which is now -on par with KaTeX.
- -Its generally a better idea to avoid linking to images hosted elsewhere - links can break and you -might face losing important information in your blog post. -You can display images from this repository using the following code:
- -{% raw %}{% include figure.html path="assets/img/2024-05-07-distill-example/iclr.png" class="img-fluid" %}{% endraw %}
-
-which results in the following image:
- -{% include figure.html path="assets/img/2024-05-07-distill-example/iclr.png" class="img-fluid" %} - - -- To ensure that there are no namespace conflicts, you must save your asset to your unique directory - `/assets/img/2024-05-07-[SUBMISSION NAME]` within your submission. -
- -- Please avoid using the direct HTML method of embedding images; they may not be properly resized. - Some below complex ways to load images (note the different styles of the shapes/shadows): -
- -- Here's how you could embed interactive figures that have been exported as HTML files. - Note that we will be using plotly for this demo, but anything built off of HTML should work. - All that's required is for you to export your figure into HTML format, and make sure that the file - exists in the `assets/html/[SUBMISSION NAME]/` directory in this repository's root directory. - To embed it into any page, simply insert the following code anywhere into your page. -
- -{% raw %}{% include [FIGURE_NAME].html %}{% endraw %}
-
--For example, the following code can be used to generate the figure underneath it. -
- -import pandas as pd
-import plotly.express as px
-
-df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/earthquakes-23k.csv')
-
-fig = px.density_mapbox(
- df, lat='Latitude', lon='Longitude', z='Magnitude', radius=10,
- center=dict(lat=0, lon=180), zoom=0, mapbox_style="stamen-terrain")
-fig.show()
-
-fig.write_html('./assets/html/2024-05-07-distill-example/plotly_demo_1.html')
-
-
-And then include it with the following:
-
-{% raw %}<div class="l-page">
- <iframe src="{{ 'assets/html/2024-05-07-distill-example/plotly_demo_1.html' | relative_url }}" frameborder='0' scrolling='no' height="600px" width="100%"></iframe>
-</div>{% endraw %}
-
-
-Voila!
-
-
- Citations are then used in the article body with the <d-cite>
tag.
- The key attribute is a reference to the id provided in the bibliography.
- The key attribute can take multiple ids, separated by commas.
-
- The citation is presented inline like this:
- Distill chose a numerical inline citation style to improve readability of citation dense articles and because many of the benefits of longer citations are obviated by displaying more information on hover. - However, we consider it good style to mention author last names if you discuss something at length and it fits into the flow well - the authors are human and it's nice for them to have the community associate them with their work. -
- - -
- Just wrap the text you would like to show up in a footnote in a <d-footnote>
tag.
- The number of the footnote will be automatically generated.
- This theme implements a built-in Jekyll feature, the use of Rouge, for syntax highlighting. - It supports more than 100 languages. - This example is in C++. - All you have to do is wrap your code in a liquid tag as follows: -
- -{% raw %}
-{% highlight c++ linenos %}
code code code
{% endhighlight %}
-{% endraw %}
-
-
-The keyword `linenos` triggers display of line numbers. You can try toggling it on or off yourself below:
-
-{% highlight c++ %}
-
-#include <iostream>
-#include <string>
-
-using namespace std;
-
-int main(int argc, char const *argv[])
-{
-    string myString;
-
-    cout << "input a string: ";
-    getline(cin, myString);
-    int length = myString.length();
-
-    // copy the string into a dynamically allocated character array
-    char *charArray = new char[length];
-    myString.copy(charArray, length);
-
-    for(int i = 0; i < length; ++i){
-        cout << charArray[i] << " ";
-    }
-
-    delete[] charArray;
-    return 0;
-}
-
-{% endhighlight %}
-
-
-
-- This theme supports generating various diagrams from a text description using jekyll-diagrams plugin. - Below, we generate a few examples of such diagrams using languages such as mermaid, plantuml, vega-lite, etc. -
- Note: different diagram-generation packages require external dependencies to be installed on your machine.
- Also, be mindful that, because of diagram generation, the first time you build your Jekyll website after adding new diagrams will be SLOW.
- For any other details, please refer to the jekyll-diagrams README.
- -- Note: This is not supported for local rendering! -
- -- The diagram below was generated by the following code: -
- -{% raw %}{% mermaid %}
-sequenceDiagram
- participant John
- participant Alice
- Alice->>John: Hello John, how are you?
- John-->>Alice: Great!
-{% endmermaid %}
-{% endraw %}
-
-
-{% mermaid %}
-sequenceDiagram
-participant John
-participant Alice
-Alice->>John: Hello John, how are you?
-John-->>Alice: Great!
-{% endmermaid %}
-
-
-- An example of displaying a tweet: - {% twitter https://twitter.com/rubygems/status/518821243320287232 %} -
- -- An example of pulling from a timeline: - {% twitter https://twitter.com/jekyllrb maxwidth=500 limit=3 %} -
- -- For more details on using the plugin visit: jekyll-twitter-plugin -
- - -- We do not grow absolutely, chronologically. We grow sometimes in one dimension, and not in another, unevenly. We grow partially. We are relative. We are mature in one realm, childish in another. - —Anais Nin -- - -
.l-body
-.l-page
-.l-body-outset
-.l-page-outset
-.l-screen
-.l-screen-inset
-.l-gutter
-
- Emphasis, aka italics, with the <i></i>
tag emphasis.
-
- Strong emphasis, aka bold, with <b></b>
tag bold.
-
Strikethrough can be accomplished with the <s></s>
tag. Scratch this.
-
- For code, the language can be specified in the class. For example, use language-javascript
for Javascript and language-python
for Python code.
-
var s = "JavaScript syntax highlighting";
- alert(s);
-
-s = "Python syntax highlighting"
- print(s)
-
-No language indicated, so no syntax highlighting.
-
-
- A table can be created with the <table>
element. Below is an example
-
-| Tables        | Are           | Cool  |
-| ------------- |:-------------:| -----:|
-| col 3 is      | right-aligned | $1600 |
-| col 2 is      | centered      |   $12 |
-| zebra stripes | are neat      |    $1 |
-
Blockquotes can be defined with the >blockquote< tag.- diff --git a/_posts/2024-05-07-double-descent-demystified.md b/_posts/2024-05-07-double-descent-demystified.md deleted file mode 100644 index d78ba984..00000000 --- a/_posts/2024-05-07-double-descent-demystified.md +++ /dev/null @@ -1,736 +0,0 @@ ---- -layout: distill -title: Double Descent Demystified -description: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle -date: 2024-05-07 -future: true -htmlwidgets: true - -authors: - - name: Rylan Schaeffer - url: "https://scholar.google.com/citations?user=6tMEGz8AAAAJ&hl=en" - affiliations: - name: Stanford University - - name: Zachary Robertson - url: "https://scholar.google.com/citations?user=769PIisAAAAJ&hl=en&oi=ao" - affiliations: - name: Stanford University - - name: Akhilan Boopathy - url: "https://scholar.google.com/citations?user=21alU7EAAAAJ&hl=en" - affiliations: - name: MIT - - name: Mikail Khona - url: "https://scholar.google.com/citations?user=K5f0SYQAAAAJ&hl=en&oi=ao" - affiliations: - name: MIT - - name: Kateryna Pistunova - url: "https://scholar.google.com/citations?user=V7QY5j0AAAAJ&hl=en" - affiliations: - name: Stanford University - - name: Jason W. Rocks - url: "https://scholar.google.com/citations?user=rFHAzMUAAAAJ" - affiliations: - name: Boston University - - name: Ila R. Fiete - url: "https://scholar.google.com/citations?user=uE-CihIAAAAJ&hl=en&oi=ao" - affiliations: - name: MIT - - name: Andrey Gromov - url: "https://scholar.google.com/citations?user=D056qfMAAAAJ&hl=en&oi=ao" - affiliations: - name: UMD & Meta AI FAIR - - name: Sanmi Koyejo - url: "https://scholar.google.com/citations?user=EaaOeJwAAAAJ&hl=en&oi=ao" - affiliations: - name: Stanford University - -# must be the exact same name as your blogpost -bibliography: 2024-05-07-double-descent-demystified.bib - -# Add a table of contents to your post. -# - make sure that TOC names match the actual section names -# for hyperlinks within the post to work correctly. -# - please use this format rather than manually creating a markdown table of contents. -toc: - - name: Introduction - - name: Double Descent in Ordinary Linear Regression - subsections: - - name: Empirical Evidence - - name: Notation and Terminology - - name: Mathematical Analysis - - name: Factor 1 - Low Variance in Training Features - - name: Factor 2 - Test Features in Training Feature Subspace - - name: Factor 3 - Errors from Best Possible Model - - name: Divergence at the Interpolation Threshold - - name: Generalization in Overparameterized Linear Regression - - name: Adversarial Data - subsections: - - name: Adversarial Test Examples - - name: Adversarial Training Data - - name: Intuition for Nonlinear Models - -# Below is an example of injecting additional post-specific styles. -# This is used in the 'Layouts' section of this post. -# If you use this post as a template, delete this _styles block. -_styles: > - .fake-img { - background: #bbb; - border: 1px solid rgba(0, 0, 0, 0.1); - box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1); - margin-bottom: 12px; - } - .fake-img p { - font-family: monospace; - color: white; - text-align: left; - margin: 12px 0; - text-align: center; - font-size: 16px; - } ---- - -## Introduction - -Machine learning models, while incredibly powerful, can sometimes act unpredictably. One of the most intriguing -behaviors is when the test loss suddenly diverges at the interpolation threshold, a phenomenon -distinctly observed in **double descent**
-The data processing inequality states that if two random variables are transformed in this way, they cannot become easier to tell apart. -- -#### Function-Space Variational Inference - -Generally, *variational inference* is a powerful technique for approximating complex Bayesian posteriors with simpler distributions. In its usual form, it optimizes an approximate, *variational* distribution to match the *Bayesian **parameter** posterior* as closely as possible. This way, it transforms the problem of Bayesian inference into an optimization problem. - -However, especially for deep neural networks, obtaining a good approximation of the parameter space can be difficult. One reason is the sheer size of the parameter space. Additionally, the parameterization of a neural network often contains many symmetries---different parameter configurations can lead to the same predictions of the model---that are not taken into account either. - -Here, **Function-space variational inference (FSVI)** side-steps some of these restrictions by only requiring that the variational distribution matches the *Bayesian **predictive** posterior*: -Whereas regular variational inference regularizes towards a parameter prior, FSVI regularizes towards a data prior. This is especially useful when the parameter prior is not very meaningful, e.g. an isotropic Gaussian prior, which is often used in Bayesian neural networks. - - - -## Background: Information-Theoretic Notation - -Information theory deals with the communication of information
There exists an equivalent convex formulation to the classical non-convex ReLU two-layer network training. That sounds like great news but is it the case in practice? Let's find out together.
- -The code for _this plot_ is available and reproducible on this __[Jupyter Notebook]({{'assets/html/2024-05-07-hidden-convex-relu/hidden-convex-relu.ipynb' | relative_url}})__ (or in __[HTML]({{'assets/html/2024-05-07-hidden-convex-relu/hidden-convex-relu.html' | relative_url}})__). - -## I. Overview and Motivation - -50 years ago, two-layer networks with non-linear activations were known to be universal approximators, however, they did not catch on as they were hard to train. The recent years have been marked by deeper networks running on dedicated hardware with very large datasets. Those networks have since been at the top of the benchmark in many applications including self-driving and text generation. The pragmatic method to train such models is to run stochastic gradient descent on the non-convex optimization problem, which is concretely tuning the weights (and bias) until the model is accurate enough. The best models usually require billions of parameters and very large datasets. The training, in turn, requires millions of dollars of hardware and electricity to run gradient descent and train a single model. - -Deep learning is not without faults. Even though the test performance can surpass those of many machine learning models, it is very hard to know what the network has learned because of its black-box nature. Interpretability in neural networks is crucial for creating trustworthy AI systems, one of the biggest obstacle to AI adoption. It may also lead us to simpler models that are cheaper to run, are more robust, generalize better, and are easier to adapt to specific tasks. - -To figure out what a neural network learns, we will focus in this post on the training of a shallow ReLU network by vanilla gradient descent, using the full batch of data at each step, in a regression setting. More precisely, we will investigate how the construction of a convex equivalent to the non-convex training problem can enlighten us on how neurons evolve during the training phase, with a specific focus on the activation of the ReLU functions and their consequences. - -### Problem and notation - -Our problem of interest will be the training of a simple two-layer neural network with ReLU activation. We focus on a classical regression problem with a mean squared error loss and we add a weight decay term (whose importance will be underlined later). This leads to the following full-batch gradient method (note that we make a slight abuse of notation by denoting by $\nabla$ the output of the derivative of the parameters, obtained, for instance, by backpropagation). - -Because there are only two layers, we will integrate the biases of the neurons directly into the data by adding a dimension filled with ones. - -
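As a concrete illustration of this bias trick (an added sketch on made-up data, not the post's notebook code):

```python
import numpy as np

# hypothetical raw inputs: n = 5 points in d = 2 dimensions
X_raw = np.random.default_rng(0).normal(size=(5, 2))

# absorb the biases into the data by appending a column of ones;
# the extra feature multiplies each neuron's bias term
X = np.hstack([X_raw, np.ones((X_raw.shape[0], 1))])
print(X.shape)  # (5, 3)
```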
- Two-Layer ReLU Network Training
- Data points: $n$ inputs \(\pmb{x}_j \in \RR^d\) and labels \(y_j \in \RR\), $j=1,..,n$
- Model: $m$ neurons: First layer \(\pmb{w}_i \in \RR^d\), second layer \(\alpha_i \in \RR\), $i=1,..,m$
- Hyper-parameters: step-size \(\step > 0\), regularization \(\lambda\geq 0\)
- Loss to be minimized:
- \begin{equation}\label{eq:theloss}
- \mathcal{L}(\pmb{W}, \pmb{\alpha}) = \sum_{j=1}^n \bigg( \underbrace{\sum_{i=1}^m \max(0, \pmb{w}_i^\top \pmb{x}_j) \alpha_i}_{\text{Network's Output}} - y_j \bigg)^2 + \underbrace{\lambda \sum_{i=1}^m \| \pmb{w}_i \|^2_2 + \alpha_i^2}_{\text{Weight Decay}}
- \end{equation}
- (Full-batch) Gradient Descent:
- \begin{equation*}
- (\pmb{W}, \pmb{\alpha})_{t+1} = (\pmb{W}, \pmb{\alpha})_t - \step \nabla \mathcal{L}((\pmb{W}, \pmb{\alpha})_t)
- \end{equation*}
-
Loss landscape of a network with two parameters, one for each ReLU neuron, and two data points: $(x_1, y_1) = (-1, 1)$ and $(x_2, y_2) = (1, 2)$ are fixed. Since all labels are positive, we fix the second layer $\alpha_1, \alpha_2$ to 1 to plot the loss in 2D without a loss of generality. The black lines represent the loss for only one neuron (since the other is equal to 0). The red lines(critical points) are paths of parameters for which the loss is constant and the gradient is zero. They represent the parameters for which the neuron fits exactly one data point and is deactivated for the other and thus suffers a loss of $(y_1)^2$ for the red line on the left and $(y_2)^2$ for the other. The exact formula to compute each point of the loss landscape is: - -\begin{equation*} -\begin{split} -\mathcal{L}(w_1, w_2) =&\ \left(\max(0, x_1 w_1) + \max(0, x_1 w_2) - y_1\right)^2 \\ -+&\ \left(\max(0, x_2 w_1) + \max(0, x_2 w_2) - y_2\right)^2 -\end{split} -\end{equation*} -
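For illustration, the landscape above can be reproduced with a short NumPy sketch using the figure's toy data (this is an added example, not the post's own notebook code):

```python
import numpy as np

# the two data points from the figure: (x_1, y_1) = (-1, 1), (x_2, y_2) = (1, 2)
X = np.array([-1.0, 1.0])
Y = np.array([1.0, 2.0])

def loss(w1, w2):
    """Two-neuron ReLU network with the second layer fixed to 1, no weight decay."""
    pred = np.maximum(0.0, X * w1) + np.maximum(0.0, X * w2)
    return np.sum((pred - Y) ** 2)

# evaluate the loss on a grid of (w1, w2) values
w = np.linspace(-3.0, 3.0, 201)
W1, W2 = np.meshgrid(w, w)
Z = np.vectorize(loss)(W1, W2)
print(Z.min(), loss(-1.0, 2.0))  # (w1, w2) = (-1, 2) fits both points exactly
```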
- -To avoid the local minima, one idea is to add constraints to the parameters. The constrained problem where $w_1$ has to be positive and $w_2$ has to be negative, _is_ convex, and a simple gradient descent will find the global minima of the original unconstrained problem. In-\begin{equation}\label{eq:one_neuron_loss} -{\color{cvred}{\mathcal{L}}}(w_1) = \big(\max(0, x_1 ~ w_1) - y_1\big)^2+\big(\max(0, x_2 ~ w_1) - y_2\big)^2 -\end{equation} -
- -Because our only trainable parameter is one-dimensional, we can directly plot the entire loss landscape. - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/redloss.png" class="img-fluid" %} - -\(\color{cvred}{\mathcal{L}}\) is non-convex in a strong sense: two local minima exist and have distinct values (\((y_1)^2\) and \((y_2)^2\)). In practice, a gradient descent will never be able to switch from fitting one data point to the other (switching from positive to a negative weight $w_1$ can only be done by increasing the loss).
- -We say that the ReLU neuron can _activate_ one or more data points if the output of its ReLU is non-zero when evaluated on said data. The output of a one-neuron ReLU network is $$\color{cvblue}{\max(0, x ~ w_1)}$$, we can plot both the output and the two data points on the same graph. - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/blueoutput.png" class="img-fluid" %} - -Plot of the output of a one-neuron ReLU network with a positive weight $w_1$. The ReLU only activates the second data point (as $x_2>0$ and $w_1 > 0$) so the network can fit the second data point. However, doing so means it cannot activate $x_1$ and will incur a constant loss $(y_1)^2$. Overall, depending on the sign of $w_1$, we will have a loss consisting of a constant term for not activating one example and a quadratic term for matching the label of the activated data point. -
- -Before moving on, the important fact here is that we have a true non-convexity of the loss(the difference between two local minima $\vert (y_1)^2 - (y_2)^2 \vert$ can be made arbitrarily large), even without a single layer or regularization. Now we will explore the corresponding convex problems. - -#### Activation - -We want to find the global minima of the one-neuron ReLU network loss function\eqref{eq:one_neuron_loss}. Recall that the loss has two local minima: $(y_2)^2$ for $w_1=y_1/x_1$ and $(y_1)^2$ for $w_1=y_2/x_2$. - -Which data points are activated plays a crucial role in the loss. In the specific example above, $x_2>0$ is activated and $x_1<0$ is not. If we fix the ReLU's activation to this pattern and __replace the max operators__ with $$\czero$$ or $$\cone$$: - --\begin{equation}\label{eq:firsttry} -\min_{u_1 \in \RR} (\czero \times x_1 u_1 - y_1)^2+ (\cone \times x_2 u_1 - y_2)^2 -\end{equation} -
- -This problem is convex. A gradient descent from any initialization will converge to the optimal loss $(y_1)^2$ with the parameter $u_1 =y_2/x_2$. This parameter directly corresponds to one of the two local minima of the non-convex loss\eqref{eq:one_neuron_loss} by taking $w_1 = u_1$. - --\begin{equation*} -\min_{u_2 \in \RR} (\cone \times x_1 u_2 - y_1)^2+ (\czero \times x_2 u_2 - y_2)^2 -\end{equation*} -
- -Similarly, this convex problem's optimal solution directly corresponds to the second local minima: $(y_2)^2$ for $u_2 =-y_1/x_1$. - -All seems good. But keep in mind that we want to build an equivalent problem. If $u_2$ is positive, taking $w_1 = u_2$ does not lead to the same loss value in the original problem because a positive parameter will never activate the first data point. - -To make the issue obvious, consider this convex problem obtained by replacing the two $\max$ operators by $$\cone$$: - --\begin{equation*} -\min_{u_3 \in \RR} (\cone \times x_1 u_3 - y_1)^2+ (\cone \times x_2 u_3 - y_2)^2 -\end{equation*} -
- -While it is convex, there is no link between the ReLU parameter $w_1$, and this new problem's parameter $u_3$: it is not possible to activate both data points. This issue comes from the fact that replacing a $\max$ by $$\cone$$ only makes sense if what is inside the $\max$ is indeed positive. In other words, as long as $$x_1 ~ w_1$$ is positive we have that $$max(x_1 ~ w_1, 0) = \cone x_1 ~ w_1$$. - --\begin{equation*} -\min_{\substack{x_1 ~ u_3 \geq 0\\x_2 ~ u_3 \geq 0}} (\cone \times x_1 u_3 - y_1)^2+ (\cone \times x_2 u_3 - y_2)^2 -\end{equation*} -
- -We added the constraints corresponding to the activation, and it adequately restricts $u_3$ to be in $\{0\}$. - -As a simple reformulation of \eqref{eq:firsttry}, we vectorize (in the number of data points) the convex loss and we add the constraints: - --\begin{equation*} -\min_{\substack{\begin{bmatrix}-1 & 0 \\ 0 & 1\end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} u_1 \geq 0}} \ \ -\bigg\| \underbrace{\begin{bmatrix} \czero & 0 \\ 0 & \cone \end{bmatrix}}_{\text{diagonal activation matrix}} -\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} u_1 - \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \bigg\|_2^2 -\end{equation*} -
- -The diagonal activation matrix (named $$D_i \in \{0, 1\}^{n \times n}$$) summarize the on/off behavior of _one_ ReLU for _all_ data points. The constraints on $u_1$ are directly given by this activation matrix: - -$$\begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix} = 2 \begin{bmatrix} \czero & 0 \\ 0 & \cone \end{bmatrix}- I_2 \qquad \text{$I_2$ the identity matrix of $\RR^2$}$$ - -The other way around, we can define the activation pattern vector for a specific parameter $$u$$: $$(\mathbb{1}_{u ~ x_j \geq 0})_{j=1\dots n} \in \{0,1\}^n$$ with $n$ the number of data points. The activation matrix of $$u$$ is simply the matrix that has this vector for its diagonal. - -So we have exactly four possible activation matrices. $$D_1 = (\begin{smallmatrix} \czero & 0 \\ 0 & \czero \end{smallmatrix})$$ and $$D_2 = (\begin{smallmatrix} \cone & 0 \\ 0 & \cone \end{smallmatrix})$$ will have constraints that reduce to $w_1 = 0$, making them not interesting. The other two lead to convex problems with convex constraints. Solving them will give the parameters that correspond to the two local minima of the loss of ReLU neural network with only a single neuron\eqref{eq:one_neuron_loss}. - -For any number $n$ of 1-D data points, there are $2^n$ distinct activation matrices but only two of them will be interesting: activating all positive data points, or only activating negative data points. Only some $D_i$ are interesting in higher dimensions, but finding all of them is not obvious.
- -Replacing everything with the usual matrices ($$X=(\begin{smallmatrix}x_1 \\x_2\end{smallmatrix})$$, $$Y=(\begin{smallmatrix}y_1 \\y_2\end{smallmatrix})$$) will get us the equivalent convex problem to a one-neuron ReLU network, whose activation pattern is $D_i$: - --\begin{equation*} -\min_{\substack{u_1 \in \RR\\ (2 D_i - I_2) X u_1 \geq 0}} \ \ -\big\| D_i X u_1 - Y \big\|_2^2 -\end{equation*} -
- - -Later sections will investigate what we can say about a ReLU network with more than one neuron. - -#### Multiplicative non-convexity from the second layer - - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/vraitroisd.png" class="img-fluid" %} - --\begin{equation}\label{eq:ncvxlin} -\min_{(x, y) \in \RR^2} (x ~ y - 1)^2 -\end{equation} -
- -\eqref{eq:ncvxlin} is not convex, it has two local minima. However, they are symmetric. Simply replace the term $x ~ y$ by a new variable $z$, and use a simple mapping such as $z \rightarrow (1, z)$ to get the solution of \eqref{eq:ncvxlin} from the solution of the convex problem: $$\min_{z \in \RR} (z-1)^2$$. - -The initial problem\eqref{eq:ncvxlin} with L2 regularization is non-convex as well: - --\begin{equation*} -\min_{(x, y) \in \RR^2} (x ~ y - 1)^2 + \frac{\lambda}{2} ( \vert x \vert^2 + \vert y \vert^2) -\end{equation*} -
- -The convex reformulation with one variable is: --\begin{equation*} -\min_{z \in \RR} (z - 1)^2 + \lambda \vert z \vert -\end{equation*} -
- -We have to use a different mapping $$z \rightarrow (\sgn(z) \sqrt(\vert z \vert), \sqrt(\vert z \vert))$$. One can verify that plugging this mapping into the non-convex problem will give the same value. Therefore, you can solve the convex problem in lieu of the non-convex one. - -Back to non-linear activations, consider the non-convex problem of training a single ReLU neuron with a second layer($$\alpha_1$$) and a L2 regularization: - --\begin{equation*} -\min_{(w_1, \alpha_1) \in \RR^2} \big(\max(0, x_1 w_1) \alpha_1 - y_1\big)^2 + \frac{\lambda}{2} \left(\vert w_1 \vert^2 + \vert \alpha_1 \vert^2\right) -\end{equation*} -
- -We fix the activation to only activate $x_1$(as could be done for any activation pattern) and add the corresponding constraint as done in the previous section: - --\begin{equation}\label{eq:ncvx1} -\min_{\substack{(u_1, \alpha_1) \in \RR^2\\ -x_1 ~ u_1 \geq 0}} -\left( \cone ~ x_1 ~ u_1 ~ \alpha_1 - y_1 \right)^2 -+ \frac{\lambda}{2} (\vert u_1 \vert^2 + \vert \alpha_1 \vert^2) -\end{equation} -
- -\eqref{eq:ncvx1} is a non-convex problem because we are multiplying $w_1$ and $\alpha_1$ together (and some constant). However, this non-convexity can be ignored by considering an equivalent convex function in a very similar way to the $(x ~ y - 1)^2$ problem. - --\begin{equation}\label{eq:cvx1} -\min_{x_1 ~ z_1 \geq 0} -\left( \cone ~ x_1 ~ z_1 - y_1 \right)^2 -+ \lambda \vert z_1 \vert -\end{equation} -
- -$z_1$ takes the role of the product $w_1 ~ \alpha_1$. We can solve \eqref{eq:cvx1} to get an optimal $z_1$ and then use a mapping $$(w_1, \alpha_1) = (\sgn(z_1) ~ \sqrt{\vert z_1 \vert}, \sqrt{\vert z_1\vert})$$. However, the two problems do not have the same expressivity: $$ \max(0, x_1 ~ z_1) \alpha_1 $$ can be negative but not $$\cone ~ x_1 ~ z_1$$ because of the constraint. Let's add a second variable with the same constraint as $z_1$ that will take the role of a negative $\alpha_1$. - --\begin{equation}\label{eq:cvx2} -\min_{\substack{x_1 ~ z_1 \geq 0\\x_1 ~ v_1 \geq 0}} -\big( \cone ~ x_1 ~ (z_1 - v_1) - y_1 \big)^2 -+ \lambda (\vert z_1 \vert + \vert v_1 \vert) - -\end{equation} -
-
-The variable $$z_1$$ represents a neuron with a positive second layer and $$v_1$$ a neuron with the same activation pattern but with a negative second layer. This is a convex problem (adding a convex regularization preserves convexity) with convex constraints. At the optimum, only one of the two variables will be non-zero. We consider this mapping:
-
-\begin{align*}
-(w_1, \alpha_1) &= (\sgn(z_1) ~ \sqrt{\vert z_1 \vert}, \sqrt{\vert z_1 \vert}) & \text{ if $z_1$ is non-zero}\\
-(w_1, \alpha_1) &= (\sgn(v_1) ~ \sqrt{\vert v_1 \vert}, - \sqrt{\vert v_1 \vert}) & \text{ if $v_1$ is non-zero}
-\end{align*}
-
-
-One can verify that this mapping gives the same value when plugged into \eqref{eq:ncvx1}. Since the two problems have the same expressivity and we can map solutions back and forth without altering the loss, their global minima have the same value: the two problems are equivalent in the sense that we can solve one to get the solution of the other through a simple mapping.
-
-To summarize, here is the equivalent (with the above mapping) convex problem for a one-neuron ReLU network with regularization and a second layer, whose activation pattern is $D_i$:
-
-\begin{equation*}
-\min_{\substack{(2 D_i - I_2) X u_1 \geq 0\\
-(2 D_i - I_2) X v_1 \geq 0}} \ \
-\big\| D_i ~ X (u_1 - v_1) - Y \big\|_2^2 + \lambda (\vert u_1 \vert + \vert v_1 \vert)
-\end{equation*}
-
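As a quick numerical sanity check (a sketch of ours, not code from the paper; the data point, label and $\lambda$ values are made up), we can solve \eqref{eq:cvx2} by brute force on a grid and verify that the mapped neuron reaches the same objective value in \eqref{eq:ncvx1}:

```python
import numpy as np

# Toy values (made up for illustration): one data point, one label, one lambda.
x1, y1, lam = 1.0, 1.0, 0.1

def cvx_obj(z1, v1):
    # Objective of the convex problem (cvx2); the constraints x1*z1 >= 0 and
    # x1*v1 >= 0 are enforced below by only searching over non-negative values.
    return (x1 * (z1 - v1) - y1) ** 2 + lam * (abs(z1) + abs(v1))

def ncvx_obj(w1, a1):
    # Objective of the non-convex problem (ncvx1): one ReLU neuron + second layer.
    return (max(0.0, x1 * w1) * a1 - y1) ** 2 + 0.5 * lam * (w1 ** 2 + a1 ** 2)

grid = np.linspace(0.0, 2.0, 201)   # x1 > 0 here, so the feasible z1, v1 are >= 0
z_best, v_best = min(((z, v) for z in grid for v in grid), key=lambda p: cvx_obj(*p))

# Map the convex solution back to a (first layer, second layer) pair.
if z_best >= v_best:
    w1, a1 = np.sign(z_best) * np.sqrt(abs(z_best)), np.sqrt(abs(z_best))
else:
    w1, a1 = np.sign(v_best) * np.sqrt(abs(v_best)), -np.sqrt(abs(v_best))

print(cvx_obj(z_best, v_best), ncvx_obj(w1, a1))   # the two values match (~0.0975 here)
```

The same check works for any sign of the data point: only the feasible half-line for $z_1$ and $v_1$ changes.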
- -#### Equivalent Convex problem with two neurons - -Before moving on to the general results, we want to fit two data points, *i.e.* having both data points activated. To do so, we need at least two neurons. The usual non-convex problem is as follows (with $$X=(\begin{smallmatrix}x_1 \\x_2\end{smallmatrix})$$, $$Y=(\begin{smallmatrix}y_1 \\y_2\end{smallmatrix})$$ and $m=2$): - --\begin{equation*} - \min_{w_i, \alpha_i \in \RR, i=1 \dots m} \bigg\| \sum_{i=1}^m \max(0, X w_i) \alpha_i - y \bigg\|^2_2 + \lambda \sum_{i=1}^m w_i ^2 + \alpha_i^2. -\end{equation*} -
- -This loss is plotted (with $\lambda = 0$ and fixed second layer) in the introduction section. The convex reformulation is very similar. - --\begin{equation*} -\min_{\substack{(2 D_i - I_2) X u_i \geq 0\\ -(2 D_i - I_2) X v_i \geq 0}, i=1 \dots m} \ \ -\bigg\| \sum_{i=1}^m D_i ~ X (u_i - v_i) - Y \bigg\|_2^2 + \lambda \sum_{i=1}^m \vert u_i \vert +\vert v_i \vert -\end{equation*} -
-
-The best choice (only obvious in this 1-D data case) of activation matrices would be $$D_1 = (\begin{smallmatrix} \czero & 0 \\ 0 & \cone \end{smallmatrix})$$ and $$D_2 = (\begin{smallmatrix} \cone & 0 \\ 0 & \czero \end{smallmatrix})$$.
-
-Solving and mapping the solutions would give the optimal *global* solution to the problem of fitting two data points with a ReLU network with two neurons. More insights about why this is true are given after the general case section, and the complete proof can be found in the paper.
-
-#### General Case
-
-Let us consider a general two-layer ReLU network with an input of dimension $d$ and an output of dimension 1 (vector outputs require a similar but parallel construction). The loss becomes:
-
-\begin{equation*}
- \mathcal{L}(\pmb{W}, \pmb{\alpha}) = \bigg\| \sum_{i=1}^m \max(0, \pmb{X} \pmb{w}_i) \alpha_i - \pmb{y} \bigg\|^2_2 + \lambda \sum_{i=1}^m \| \pmb{w}_i \|^2_2 + \alpha_i^2
-\end{equation*}
-
-
-This is the same loss as presented at the beginning of the article \eqref{eq:theloss}, but written with matrices and vectors. $$\pmb{X} \in \RR^{n \times d}$$ is the data matrix and $$\pmb{y} \in \RR^n$$ are the labels. Each neuron has its first-layer parameter $$\pmb{w}_i \in \RR^d$$ and second layer $$\alpha_i \in \RR$$.
-
-By analogy with what we saw earlier, an equivalent convex problem can be found. Multiplications are replaced by scalar products in the definition of activation matrices, and thus most insights about activation hold.
-
-\begin{equation}\label{eq:thecvx}
- \min_{\pmb{U}, \pmb{V} \in \mathcal{K}} \bigg\| \sum_{i=1}^m \pmb{D}_i \pmb{X} (\pmb{u}_i - \pmb{v}_i) - \pmb{y} \bigg\|^2_2 + \lambda \sum_{i=1}^m \| \pmb{u}_i \|_2 + \| \pmb{v}_i \|_2
-\end{equation}
-
-
-$$\pmb{D}_i$$ are the activation matrices. The set of constraints $$\mathcal{K}$$ is the concatenation of the constraints of all neurons. Each constraint can be written succinctly: $$(2 \pmb{D}_i - \pmb{I}_n) \pmb{X} \pmb{u}_i \geq 0$$. If $$u_i$$ respects the constraint, its activation pattern is exactly $$D_i$$, which is crucial to retrieve the optimal solution of the non-convex loss \eqref{eq:theloss} from the solution of the convex reformulation \eqref{eq:thecvx}.
-
-A conceptually easy way to make the two problems have the same global loss is to consider a ReLU network with $$2^n$$ neurons and to formulate the convex problem using all $$2^n$$ distinct activation matrices $$D_i$$. In that case, it is easy to see that they both have the same expressivity. In the paper, it is proved that in theory only $$n$$ neurons and activation patterns are required (using Carathéodory's theorem), but the patterns are not given explicitly. The next section will give more insights on when the two problems are equivalent.
-
-From a solution of the convex problem \eqref{eq:thecvx}, the *convex neurons* $$u_i$$ can be mapped to the *non-convex neurons* $$(w_i, \alpha_i)$$ using this mapping:
-
-\begin{align*}
-(w_i, \alpha_i) &= (\frac{u_i}{\sqrt{\| u_i \|_2}}, \sqrt{\| u_i \|_2}) & \text{ if $u_i$ is non-zero}\\
-(w_i, \alpha_i) &= (\frac{v_i}{\sqrt{\| v_i \|_2}}, -\sqrt{\| v_i \|_2}) & \text{ if $v_i$ is non-zero}
-\end{align*}
-
-
-We use the same mapping as in the 1-D case, except that the direction of the neuron ($$u_i$$) is now a vector in $$\RR^d$$.
-
-This is a very simple mapping from the convex solution to the non-convex neurons. We call *convex neurons* the sets of parameters of the convex problem that correspond to a neuron in the original, non-convex problem. One can expect similar trajectories between the non-convex and convex neurons during gradient descent.
-
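To make the reformulation concrete, here is a minimal sketch (our own illustration, not code from the paper; the sizes, data and $\lambda$ are made up) that builds \eqref{eq:thecvx} with `cvxpy` for a small random dataset, using activation patterns generated from random directions, and then maps the convex neurons back to non-convex ones:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, m, lam = 20, 3, 8, 1e-3            # small, made-up sizes
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Candidate activation patterns D_i, obtained from random directions.
G = rng.standard_normal((d, m))
D = [np.diag((X @ G[:, i] > 0).astype(float)) for i in range(m)]

U = [cp.Variable(d) for _ in range(m)]   # neurons with a positive second layer
V = [cp.Variable(d) for _ in range(m)]   # neurons with a negative second layer
residual = sum(D[i] @ X @ (U[i] - V[i]) for i in range(m)) - y
reg = sum(cp.norm(U[i], 2) + cp.norm(V[i], 2) for i in range(m))
constraints = []
for i in range(m):
    A = (2 * D[i] - np.eye(n)) @ X       # the cone constraint (2 D_i - I_n) X u >= 0
    constraints += [A @ U[i] >= 0, A @ V[i] >= 0]

prob = cp.Problem(cp.Minimize(cp.sum_squares(residual) + lam * reg), constraints)
prob.solve()

# Map the convex neurons back to (first layer, second layer) pairs.
neurons = []
for u, v in zip(U, V):
    for vec, sign in [(u.value, 1.0), (v.value, -1.0)]:
        norm = np.linalg.norm(vec)
        if norm > 1e-8:
            neurons.append((vec / np.sqrt(norm), sign * np.sqrt(norm)))

out = sum(np.maximum(0, X @ w) * a for w, a in neurons)
fit_cvx = np.sum((sum(D[i] @ X @ (U[i].value - V[i].value) for i in range(m)) - y) ** 2)
print(fit_cvx, np.sum((out - y) ** 2))   # the two data-fitting terms coincide
```

With the full set of $2^n$ patterns this would return the global optimum of the non-convex loss; with a random subset, as here, it returns the best network restricted to those patterns.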
-
-Here, we fixed the number of neurons and the corresponding activations. A few questions are left unanswered: how many different activation patterns need to be considered, and how many neurons should we consider for both the convex and non-convex problems?
-
-### Specifics about equivalence
-
-Two problems are considered equivalent when their global optima can be seamlessly mapped back and forth.
-
-As seen before, there are only two *interesting* possible activation patterns in the one-dimensional case (a single neuron can either activate all the positive data points and none of the negative ones, or the opposite), but there are close to $$2^n$$ _interesting_ patterns when the data dimension is higher. An activation pattern is interesting if there exists a non-zero vector that respects the constraints and, ultimately, realizes that activation pattern.
-
-The (unique) optimal loss of the convex problem \eqref{eq:thecvx} with all possible activation patterns $$D_i$$ (for fixed data) is the best loss any non-convex network can reach. The following sections are dedicated to understanding why adding more neurons than there are activation patterns will not improve the loss.
-
-However, if we only consider a subset of all patterns, the convex problem will in general correspond to a local optimum of the non-convex network. Indeed, it is not as expressive as before. This would correspond either to a non-convex network with not enough neurons, or to one with too many neurons concentrated in the same regions.
-
-To explore this idea, we go back to one-dimensional data.
-
-#### 1-D EXAMPLE, ONE NEURON
-
-In the non-convex problem with only one neuron, there are exactly two local minima.
-
-{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/oneneuron.png" class="img-fluid" %}
-
-Plot of the output of a ReLU network with one neuron, one plot for each of the parameter's local minima. The parameter on the left can be formulated as a solution of a convex problem with one convex neuron using the activation matrix \((\begin{smallmatrix} \czero & 0 \\ 0 & \cone\end{smallmatrix})\), and \((\begin{smallmatrix} \cone & 0 \\ 0 & \czero \end{smallmatrix})\) for the output on the right.
-
-As seen in the previous section, each local minimum can be found exactly by solving the convex problem with a subset of all possible activations, one for each of the two patterns shown on the left and on the right. Here we cannot say that the convex problem (which considers only one pattern) is equivalent to the non-convex one, because the global minimum of the non-convex problem cannot be achieved in the convex problem. However, once we reach a local minimum with gradient descent on the non-convex problem, it can be described as a convex problem by considering one pattern or the other.
-
-#### 1-D EXAMPLE, TWO NEURONS
-
-{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/twoneuron.png" class="img-fluid" %}
-
-The non-convex problem initialized with two random neurons and optimized with gradient descent has three possible local minima (if there is some regularization; otherwise there is an infinite number of them). Either we initialize a neuron in each activation pattern and gradient descent reaches the global optimum (left), or the two neurons end up in the same pattern (right), activating the same data point.
- -In the case of two neurons, the convex equivalent problem is as follows: - --\begin{equation*} -\mathcal{L}(u_1, u_2)= -\bigg\| \begin{bmatrix} \czero & 0 \\ 0 & \cone \end{bmatrix} -\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} u_1 + -\begin{bmatrix} \cone & 0 \\ 0 & \czero \end{bmatrix} -\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} u_2 - \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \bigg\|_2^2 + \lambda (| u_1 | + | u_2 |) -\end{equation*} -
-
-is equivalent to the non-convex problem, i.e. solving it gives the global optimum of the non-convex objective. (The negative neurons $v_i$ are zero at the optimum and are omitted here for clarity.)
-
-#### 1-D EXAMPLE, MANY NEURONS
-
-{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/manyneurons.png" class="img-fluid" %}
-
-Plotting the positive part of many ReLU neurons. Summed up, they form a network output that perfectly fits the data.
-
-We draw one example of a typical local minimum of gradient descent in the specific case of having more neurons than existing patterns. In practice (with more data in higher dimensions), there are far fewer neurons than possible activation patterns. However, there are many situations in which neurons will end up in the same activation pattern, and in the experiment section we will see how to force such dynamics.
-
-Note that we can merge neurons that are in the same activation pattern by summing them up (even in higher dimensions), creating a new neuron and keeping both the output and the loss unchanged (although the regularization might decrease). The fact that having more than one neuron in one pattern does not decrease the loss is at the core of the proof.
-
-### Activation patterns
-
-The equivalence proof is heavily based on ReLU, specifically on the fact that a ReLU unit divides the input space into two regions: one where it outputs zero, and the other where it is the identity. If you consider a finite set of samples and a single ReLU, it will activate some samples and deactivate others: this is called an activation pattern. A diagonal matrix $$\pmb{D}_i \in \{0,1\}^{n \times n}$$ describes one activation pattern, but not all of them are possible for a given dataset. There is a finite number of such possible patterns, exponential in the dimension of the data.
-
-This section is important to understand the final animations in the experimental section and helps understand how active activation patterns evolve in the non-convex problem.
-
-#### Two-Dimensional Data
-
-In the previous part, we considered the data to be one-dimensional, which resulted in only two possible activation patterns. Let us consider two-dimensional data. To do so in the simplest way possible, we will consider regular one-dimensional data and a second dimension filled with $$1$$s. This will effectively give the neural network a _bias_ to use without modifying the formulas.
-
-We consider two data points: $$\color{cvred}{\pmb{x}_1} = (-0.2, 1)$$ and $$\color{cvred}{\pmb{x}_2} = (1, 1)$$, each associated with their label $$y_1 = 0.5$$ and $$y_2 = 1$$. We plot the output of one ReLU unit initialized at $$\pmb{w}_1 = (0.3, -0.15)$$, $$\alpha_1 = 1$$. Therefore we have
-
-\begin{align*}
-\max(0, \pmb{w}_1^\top \pmb{x}_1) &= 0 \\
-\max(0, \pmb{w}_1^\top \pmb{x}_2) &= \pmb{w}_1^\top \pmb{x}_2
-\end{align*}
-
-
-The activation pattern of $$\pmb{w}_1$$ is $$\pmb{D}_1=\left(\begin{smallmatrix} \czero & 0 \\ 0 & \cone \end{smallmatrix}\right)$$. There are only three other possible activation patterns: activating both data points with $$\pmb{D}_2=\left(\begin{smallmatrix} 1 & 0 \\ 0 & 1 \end{smallmatrix}\right)$$, activating only the first one with $$\pmb{D}_3=\left(\begin{smallmatrix} 1 & 0 \\ 0 & 0 \end{smallmatrix}\right)$$, and activating no data point with the zero matrix.
-
-One point of interest is the input for which the ReLU output is 0. This is where the output changes its slope: $$a_1 = -w_1^2/w_1^1$$ where $$w_1^i$$ is the i-th coordinate of $$\pmb{w}_1$$. Here, $$a_1 = 0.5$$. We call this the _activation point_ of the neuron $$\pmb{w}_1$$.
-
-We plot the output, $$\color{cvblue}{\max(0, (x, 1) ~ \pmb{w}_1^\top)}$$, of the network as a function of the first dimension of the data $$x^1$$ (here simply written $$x$$):
-
-{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/twodim.png" class="img-fluid" %}
-
-A neuron initialized so that it activates only one data point, i.e. its activation point is between the two samples, and its slope tells us whether it activates on the left or on the right, as is the case here.
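As a small illustration (our own sketch, not from the post), we can enumerate the patterns reachable for the two data points above by sweeping over many random directions:

```python
import numpy as np

# Two 1-D data points augmented with a constant 1 acting as a bias.
X = np.array([[-0.2, 1.0],
              [ 1.0, 1.0]])

patterns = set()
for _ in range(10_000):
    w = np.random.randn(2)                       # random neuron direction
    patterns.add(tuple((X @ w > 0).astype(int)))  # which data points it activates

print(sorted(patterns))
# With these two points we recover the 4 patterns mentioned above:
# (0, 0), (0, 1), (1, 0) and (1, 1).
```

The same sweep applied to more data points, or to higher-dimensional inputs, counts the reachable patterns for those datasets.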
- -__Illustration__. - -In the animation below, we train this network using vanilla gradient descent on the two data points $$\color{cvred}{\pmb{x}_1}$$ and $$\color{cvred}{\pmb{x}_2}$$, represented by the red crosses. We plot its $$\color{cblue}{\text{output}}$$ in blue for every possible data point (omitting the second dimension as it is always 1 in this example, playing the role of the bias), and we plot in red the label associated with the two data points. Each frame corresponds to one step of full-batch gradient descent with a small learning rate. We mark the $$\color{cgreen}{\text{activation point}}$$ of the neuron with a green triangle, pointing toward the side the neuron activates. The green triangle's height is the slope of the ReLU's output, equal to $$u_1^1 = w_1^1 \alpha_1$$, allowing us to visualize how important one neuron is for the output of the network. - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/firstgif_movie.gif" class="img-fluid" %} - -Training a single neuron network with gradient descent until it exactly fits two data points. It starts by fitting the only point it activates, \(\color{cvred}{\pmb{x}_2}\). As training progresses, the activation point represented by a green triangle shifts position. As soon as the activation point reaches \(\color{cvred}{\pmb{x}_1}\), it activates it and starts fitting both points at the same time. Its activation pattern shifts from \(\left(\begin{smallmatrix} \czero & 0 \\ 0 & \cone \end{smallmatrix}\right)\) to \(\left(\begin{smallmatrix} \cone & 0 \\ 0 & \cone \end{smallmatrix}\right)\) and stays the same until convergence.
- -Adding more neurons will not create additional activation patterns, only adding more data points will. With only two data points $$\pmb{x}_1$$ and $$\pmb{x}_2$$, we only had 4 possible patterns, with four data points we have 10 possible patterns. - -{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/annoying.png" class="img-fluid" %} - -We plot the individual output and activation points of each of the ReLU neurons associated with the ten _interesting_ activation patterns in blue. Those are the 10 (20 with negative ones) neurons that need to be considered to get the global optima using the convex equivalent. When moving the activation point \(a_i\) of a neuron between two data points, its activation pattern does not change.
- -Notice that it is not possible to only activate the data points in the middle. However, if we increase the data's dimension, this becomes possible. This is also possible with a second layer of ReLU. In higher dimensions, we cannot visualize the activation patterns as easily, but we can understand that as dimensionality increases, more patterns are possible as it is easier to separate different data points.
-
-### Extensions of the convex reformulation to other settings
-
-Batch Normalization (BN) is a key process that adjusts a batch of data to have a mean of zero and a standard deviation of one, using two trainable parameters. In the convex equivalent, we replace $$\pmb{D}_i \pmb{X}$$ with $$\pmb{U}_i$$, where $$\pmb{U}_i$$ is the first matrix in the Singular Value Decomposition (SVD) $$\pmb{D}_i \pmb{X} = \pmb{U}_i \pmb{\Sigma}_i \pmb{V}_i$$.
-
-[Table: test accuracy on popular datasets for a single-layer network.]
-
-[Table: time to solve problems from the UCI datasets with Adam on the non-convex problem and with a custom solver.]
-A convex equivalent of deeper networks exists but exacerbates existing problems. The only way to make it possible is to optimize layer by layer. This is still a work in progress and needs further improvements to be competitive.
-
-### Activation patterns are not constant in the non-convex problem
-
-Let's set aside the performance concerns and use the reformulation as a new point of view for observation. Our non-convex problem is equivalent to a convex and well-specified optimization problem with constraints. The global optima might be the same, but training the network with gradient descent almost always leads to a local minimum. Because there are too many activation patterns to consider them all, the convex problem also only finds a local minimum. However, it is not clear whether the two approaches find the same kind of local minima.
-
-Activation patterns can and will change during gradient descent in the non-convex problem. In some cases, this pattern shifting is useful because the new activation patterns may lead to a better minimizer. To verify this, we monitor the number of unique activation patterns used by the network at each step of gradient descent. If two neurons have the same activation pattern (_i.e._ they activate and deactivate the same data points), we count them as one.
-
-{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/nbactiv.png" class="img-fluid" %}
-
-Training a network with 100 random data points in 10 dimensions. The network only has 20 randomly initialized neurons and the labels depend linearly on the input. Each neuron has a unique activation pattern, as can be seen on the graph. This is expected in this setting because there are so many possible activation patterns (close to $10^{25}$).
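Here is a minimal sketch (our own; the shapes are placeholders) of how this count can be computed from the first-layer weights at any step of training:

```python
import numpy as np

def count_unique_patterns(W, X):
    """Count distinct activation patterns among neurons.

    W: (m, d) first-layer weights, X: (n, d) data matrix.
    Two neurons count as one if they activate exactly the same data points.
    """
    patterns = (X @ W.T > 0)                    # (n, m) boolean activation matrix
    return len({tuple(col) for col in patterns.T})

# Example with random data and weights (placeholder sizes).
rng = np.random.default_rng(0)
X, W = rng.standard_normal((100, 10)), rng.standard_normal((20, 10))
print(count_unique_patterns(W, X))              # typically 20: each neuron has its own pattern
```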
-
-Training a 20-neuron network with gradient descent and using its current activation patterns to solve the convex equivalent. At each step, we plot the current loss of the non-convex network and the optimal loss of the convex problem. At initialization (first point on the graph), the non-convex loss is 1; we take the current activation patterns, build the corresponding convex problem, solve it, and find an optimal loss of $0.1$. At the next step, the non-convex loss has decreased and the activation patterns have changed, so we find a different optimal loss for the convex problem. The convex optimum computed at step 0 is eventually beaten by gradient descent (at around step 175): the initial choice of activation patterns was far from optimal, and gradient descent continually improves the patterns it uses. We use cvxpy to define the problem and solve it using ECOS.
-
-
-In general, we cannot predict which patterns will be used by the neurons found by gradient descent, or which patterns are the best. Thus we cannot hope that the convex problem will give us much insight, as it requires us to know the activation patterns in advance.
-
-Training a network with 1000 neurons with large initial values using gradient descent. The output of the network is in blue, and the four data points (red crosses) represent linear data. Each green triangle represents one neuron, with its activation point horizontally and its norm vertically. The orientation of the triangle reveals on which side the neuron activates the data. At initialization, the distribution of the activation points is uniform. The movement of the activation points is minimal: only a few neurons out of the thousand change their patterns.
-
-
-Here, computing the convex optimum gives us a single neuron that fits the linear data. While the non-convex problem has converged to a very low loss, the two outputs are completely different.
-
-A side effect of the large initialization is catastrophic overfitting, i.e. the output varies wildly between data points, which will negatively impact the test loss.
-
-
-### On very small initialization
-
-At the other extreme, the small-scale setting effectively lets neurons align themselves before ever decreasing the loss. In theory, if you push the scale down enough, neurons will converge to a finite set of directions before trying to fit the objective.
-
-{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/smallscale_movie.gif" class="img-fluid" %}
-
-Training a network with 1000 neurons with very small initial values using gradient descent. The output of the network is in blue, and the four data points (red crosses) represent linear data. Each green triangle represents one neuron, with its activation point horizontally and its norm vertically. The orientation of the triangle reveals on which side the neuron activates the data. At initialization, the distribution of the activation points is uniform. However, as training progresses, most neurons that activate toward the right converge to the activation point $-1.3$. Once the norm of the neurons activating at $-1.3$ is large enough, the loss decreases and we quickly reach convergence.
-
-
-Taking a look at the loss on the same problem, we can identify the two distinct regimes: alignment and fitting (then convergence).
-
-{% include figure.html path="assets/img/2024-05-07-hidden-convex-relu/lastgif_plot.png" class="img-fluid" %}
-Plot of the loss during gradient descent in the same setting as the animation above. In the first half, only the directions of the neurons (i.e. their activation patterns) are changing; the neurons start fitting the four data points once their parameters are large enough.
-
-If you take orthogonal data and a small scale, the behavior is very predictable. Unless mentioned otherwise, all experiments were run using full-batch vanilla gradient descent. In our experiments, adding momentum or using the Adam optimizer is easier to use on top of being faster to converge; however, the behavior is much less predictable.
- -## Conclusion - -The main takeaway is that the best network for a given dataset can be found exactly by solving a convex problem. Additionally, the convex problem can describe every local minimum found by gradient descent in the non-convex setting. However, finding the global optima is impossible in practice, and approximations are still costly in precision. While there is no evident link between feature learning in the non-convex and the convex reformulation, many settings allow for a direct equivalence and the whole convex toolkit for proofs. - -The performance side of the convex reformulation will benefit from dedicated software as has been the case for gradient descent in deep networks. Only then will it offer a no-tuning alternative to costly stochastic gradient descent. In smaller settings, it already allows us to quickly find all the possible local minima that are so important in machine learning. - -Despite advancements in understanding the optimization landscape of neural networks, a significant gap persists in reconciling theory with practical challenges, notably because of early stopping. In real-world scenarios, networks often cease learning before reaching a local minimum and this has a direct impact (in large-scale initialization) but there are limited results. - -## Acknowledgements - -This work is partly funded by the ANR JCJC project ANR-21-CE23-0022-01. diff --git a/_posts/2024-05-07-language-model-development-as-a-new-subfield.md b/_posts/2024-05-07-language-model-development-as-a-new-subfield.md deleted file mode 100644 index bf4eb8ee..00000000 --- a/_posts/2024-05-07-language-model-development-as-a-new-subfield.md +++ /dev/null @@ -1,132 +0,0 @@ ---- -layout: distill -title: A New Alchemy: Language Model Development as a Subfield? -description: This blog post makes the case that the body of research on language models become sufficiently large and mature that we can start thinking about “language model development” as a new subfield. - To support this claim, we sketch out the focuses and methodologies of this new subfield. - In addition, we provide some personal reflections on what to do when your field of study gives birth to a new one. -date: 2024-05-07 -future: true -htmlwidgets: true - -authors: - - name: Colin Raffel - url: "https://colinraffel.com/" - affiliations: - name: University of Toronto, Vector Institute - -# must be the exact same name as your blogpost -# bibliography: 2024-05-07-distill-example.bib - -# Add a table of contents to your post. -# - make sure that TOC names match the actual section names -# for hyperlinks within the post to work correctly. -toc: - - name: Some history - - name: Language model development - - name: A New Alchemy ---- - -Historically, language models have served as an important component of many learning systems -- for example, to improve the transcriptions generated by a speech recognition system. -However, the impact and usage of language models has grown dramatically over the past few years. -Arguably, this growth is simply thanks to the fact that language models have gotten *better*, i.e. more accurate at predicting some text based on some context. -Since most text-based tasks can be cast as predicting a response to a request (e.g. "summarize the following article", "write me a Python function that queries Wikipedia", etc.), recent large language models (LLMs) have proven somewhat effective at performing an incredibly wide range of tasks. 
-Improvements in the language understanding and generation capabilities of LLMs have also led to their adoption in many larger systems (e.g. robots, image processing/generation, etc.), where they increasingly enable natural language to be used as an interface. -These advances have led to a huge amount of research into building and using language models. -I think this body of research has become sufficiently large and mature that we can start thinking about "language model development" as a new subfield. -The goal of this blog post is to sketch out the focuses and methodologies of the subfield of language model development as well as to provide some personal reflections on what to do when your field of study gives birth to a new one. - - -## Some history - -As a subfield, language modeling has many sibling and parent fields, including information theory, artificial intelligence, natural language processing, and machine learning. -In my biased opinion, many recent advances in language modeling have stemmed from advances in deep learning. -When thinking about fields like deep learning, I think it can be valuable to define what the assumptions and major problems of the field are. -For deep learning, I would roughly say that the assumptions are: - -1. We should end-to-end optimize everything. -1. Training a bigger model on a bigger dataset should yield improved performance, but we should also strive to develop efficient and performant model architectures. -1. If we can bake structure into our model (e.g. convolutions for images), things work better... -1. but what we really want is a system that can learn everything from data and relies on as few hard-coded assumptions as possible. -1. We care less about theoretical guarantees and more about how well something works in practice. - -Notably, the assumptions of a field are not necessarily scientifically or philosophically motivated - they can be cultural or arise from extraneous factors (e.g. the availability of GPUs). -The major problems of the field of deep learning might be: - -1. How can we design neural network architectures that work well for a given problem, or better yet, across a wide variety of problems? -1. Similarly, what objective works best? -1. How should we optimize that objective? -1. How can we ensure all of the above can be scaled up effectively? - -Arguably, one of the biggest successes of recent deep learning research is a powerful recipe for training effective models on a wide variety of problems, namely, the Transformer trained with some variant of Adam. -While the objective used can vary across problem settings, in text-based problems a simple language modeling objective works well (and, as discussed above, encapsulates pretty much any text-based task). -An important aspect of this Transformer recipe is its scalability, i.e. the ability to attain predictable gains from scaling up training compute and/or dataset size. - -## Language model development - -I think the scalability of the Transformer has ushered in a new era of research that is distinct from deep learning research. -For the first time, we can (to a significant degree) stop worrying about what model architecture to use, how to train the model, what objective to use, whether we'll continue to get returns from scaling, etc. -Instead, this new line of research primarily aims to study the development of language models in order to expand and understand their capabilities. 
-In addition, the fact that recent LLMs are reasonably competent at a huge range of tasks has led to major differences in terms of how we use LLMs (when compared to e.g. how we built and used neural networks in the context of deep learning) -For lack of a better term, I'll refer to this new (sub)field as "language model development", which might have the following assumptions: - -1. We can assume that the model architecture, optimizer, and objective are basically fixed. -1. We hope or expect that a given LLM can be induced to perform basically any task out-of-the-box without performing any additional training (i.e. updating its parameters), and in general we should avoid updating parameters to specialize a model to a given task (i.e. task-specific fine-tuning). -1. The computational cost of getting a model to perform a task is mostly irrelevant, or at least, these costs will be resolved by something else (e.g. better/more hardware). -1. If we invest more compute in training an LLM, it will [produce better results](https://arxiv.org/abs/2001.08361). - -Arguably, some of these assumptions could be considered consequences of the fact that many state-of-the-art language models are only available through black-box APIs. -The major problems of language model development are something like: - -1. How can we get the model to do what we want (i.e. "prompt engineering")? -1. How can we make the model run as efficiently as possible? -1. To the extent that we are going to update a model, how can we update it so that it is better at following instructions and less likely to generate harmful content (i.e. alignment)? -1. More broadly, if we are really hoping the model can do *anything*, how do we prevent it from doing things we don't want it to? -1. How can we integrate language models into other systems (i.e. tool use, multimodality, etc.)? - -Let me give a few additional examples of papers and techniques that I think aim to attack these problems under the aforementioned assumptions. - -- An early technique for "getting an LLM to do what we want" (goal #1) is [few-shot in-context learning (ICL)](https://arxiv.org/abs/2005.14165), where a few examples of the desired input/output behavior are provided in the model's input before the model is asked to process an unseen example. - Few-shot ICL avoids updating the model's parameters (assumption #1) and mostly ignores the fact that it significantly increases computational costs (assumption #3). - A related and more recent variant of ICL is ["chain-of-thought prompting"](https://arxiv.org/abs/2201.11903), which adds reasoning steps to the in-context examples in hopes of improving performance by inducing the model to generate similar reasoning steps before generating its prediction. - The fact that including reasoning steps further increases computational costs is, again, mostly ignored (assumption #3). -- Techniques like [FlashAttention](https://arxiv.org/abs/2205.14135) and [Speculative Decoding](https://arxiv.org/abs/2211.17192) aim to make the model run more efficiently (goal #2) without changing the model or its outputs whatsoever (assumption #1). - More broadly, techniques like the [Heavy-Hitter Oracle](https://arxiv.org/abs/2306.14048) or [quantization](https://arxiv.org/abs/2208.07339) aim to reduce memory or computational costs with minimal performance degradation. 
- The pursuit of these techniques, along with orthogonal hardware advances like NVIDIA's Transformer Engine, arguably supports the apparent disregard for increases in computational cost that arise from using a larger model (assumption #3). -- While there certainly has been some effort to improve over the Transformer architecture or the optimizer used to train LLMs (in violation of assumption #1), the vast majority of these improvements have not been widely adopted, either due to inertia (i.e., enforcement of assumption #1) or the apparent fact that [they do not always transfer across applications](https://arxiv.org/abs/2102.11972). - -Separately, a sign of the maturity of a new subfield is the development of teaching materials. -I think my friend Sasha Rush is leading the charge here, with e.g. [GPTWorld for learning prompting](https://github.com/srush/GPTWorld), [LLM training puzzles for learning about distributed training](https://github.com/srush/LLM-Training-Puzzles), and [Transformer puzzles for understanding how Transformers might work](https://github.com/srush/Transformer-Puzzles). -Another sign is the establishment of a conference on the subject, and we [have one of those now too](https://colmweb.org/). - -## A New Alchemy - -LLMs have ushered in a paradigm shift in the path toward imbuing computers with human-like capabilities. -This paradigm shift is being felt in various fields, including deep learning (where the work of designing new architectures or optimizers is increasingly less relevant), natural language processing (where we now have a recipe that works reasonably well across subproblems that previously demanded custom methodologies), and beyond. - -I started my PhD in 2012 during a similar paradigm shift from what I'd call "statistical machine learning" to deep learning. -Unlike deep learning, statistical ML prioritized theoretical guarantees (e.g. convexity of the objective function and/or convergence under certain conditions). -These guarantees arguably limited model expressivity, which arguably necessitated things like feature engineering that deep learning strove to avoid. -While deep learning by no means "solved" the problems of statistical ML (just as language model development does not "solve" deep learning), it nevertheless presented a paradigm that made dramatic progress on the target problems of statistical ML and unlocked new applications. -Such empirical successes of deep learning -- which almost entirely eschewed theoretical guarantees -- led to a great deal of hand-wringing on the part of the statistical ML crowd. - -As my research increasingly made use of deep learning, I started to find myself at the receiving end of this hand-wringing. -For example, during my first-ever oral presentation at a conference, I was presenting work that made use of convolutional neural networks. -During questions, an audience member expressed distaste at my use of "*convoluted*" neural networks and suggested that something simpler would have worked better (of course I had tried simpler models and they worked significantly worse, but let's put that aside for the moment). -This kind of despair was common at the time - people were applying deep neural networks in settings where they may or may not have been overkill, simply because it was the zeitgeist. 
-At another conference I attended during my PhD, I happened to share a hostel room with a computer vision researcher who went on a long rant about the atrocity of deep learning (sometimes I wonder what this researcher is working on now). -I think this sentiment is most elegantly laid out in [Ali Rahimi's NeurIPS 2017 test-of-time award acceptance speech](https://www.youtube.com/watch?v=x7psGHgatGM), where he argues that deep learning is like alchemy - trial-and-error that yields some effective techniques but lacks rigor. -Ali's speech had a big impact on me and others but arguably didn't really stop people from continuing to develop and apply deep learning without worrying about rigor and in settings where simpler methods would have sufficed (simply because using a big fancy neural network was sexier). - -These experiences led me to promise myself that when my field of study gave birth to another, I wouldn't dig my feet in and resist, I'd follow the tide of progress. -Now that this is (arguably) happening I'm finding it more difficult than I had anticipated. -As much as I wish it wasn't true, I cringe a little whenever I see a new LLM technique that ignores a dramatic increase in computational cost and bends over backwards to avoid updating the model's parameters, or an application of an LLM where something dramatically cheaper would suffice, or a paper studying the behaviors of an LLM as if it's a black box (or studying an LLM API, in which case it actually *is* somewhat of a black box), and on and on. -And try as I might, I can't resist trying to stem the tide -- for example, the [T-Few paper](https://arxiv.org/abs/2205.05638) aimed to convince everyone that few-shot ICL was absurdly computationally inefficient and that fine-tuning specialized models is cheaper and better. -Of course, people are still using few-shot ICL and are still avoiding task-specific fine-tuning at all costs, because that's the zeitgeist -- and I think this isn't totally wrong, because in tandem there's a huge amount of synergistic work on making LLMs more efficient and effective. -But, to be honest, it still *feels* a little wrong, and I'm not sure if I'll be able to shake that feeling. - -So, what's the best course of action [when you used to be with it, but then they changed what "it" was](https://www.youtube.com/watch?v=LV0wTtiJygY)? -I think there were many ML researchers who successfully rode the tide from statistical ML to deep learning -- they willingly embraced the new field while bringing their knowledge and sense of rigor to their deep learning research. -In other words, they used their past knowledge to provide a broader and deeper perspective that newcomers may have lacked. -An especially prominent product of this kind of research is arguably the [Variational Autoencoder (VAE)](https://arxiv.org/abs/1312.6114), which connected ideas from variational inference to the autoencoder neural network architecture. -VAEs are still an important component of state-of-the-art diffusion-based generative models. -Hopefully, those of us who were working on deep learning and NLP before the LLM era can bring a similar perspective (and avoid digging our feet in too much). 
diff --git a/_posts/2024-05-07-mode-switching.md b/_posts/2024-05-07-mode-switching.md deleted file mode 100644 index 7e1c5e08..00000000 --- a/_posts/2024-05-07-mode-switching.md +++ /dev/null @@ -1,615 +0,0 @@ ---- -layout: distill -title: Behavioral Differences in Mode-Switching Exploration for - Reinforcement Learning -description: In 2022, researchers from Google DeepMind presented an initial - study on mode-switching exploration, by which an agent separates its - exploitation and exploration actions more coarsely throughout an episode - by intermittently and significantly changing its behavior policy. We - supplement their work in this blog post by showcasing some observed - behavioral differences between mode-switching and monolithic exploration - on the Atari suite and presenting illustrative examples of its benefits. - This work aids practitioners and researchers by providing practical - guidance and eliciting future research directions in mode-switching - exploration. -date: 2024-05-07 -future: true -htmlwidgets: true - -# Anonymize when submitting -# authors: -# - name: Anonymous - -authors: - - name: Loren J Anderson - url: - affiliations: - name: USA Space Force - -# must be the exact same name as your blogpost -bibliography: 2024-05-07-mode-switching.bib - -# Add a table of contents to your post. -# - make sure that TOC names match the actual section names -# for hyperlinks within the post to work correctly. -# - please use this format rather than manually creating a markdown table of contents. -toc: - - name: 1. Introduction - subsections: - - name: Mode-Switching Distinctions - - name: Mode-Switching Basics - - name: Blog Post Motivation - - name: 2. Experiments - subsections: - - name: Concentrated Terminal States - - name: Early Exploration - - name: Concentrated Return - - name: Post-Exploration Entropy - - name: Top Exploitation Proportions - - name: 3. Conclusion - subsections: - - name: Acknowledgements - -# Below is an example of injecting additional post-specific styles. -# This is used in the 'Layouts' section of this post. -# If you use this post as a template, delete this _styles block. - ---- - -## 1. Introduction - -Imagine learning to ride a bicycle for the first time. This task -requires the investigation of numerous actions such as steering the -handlebars to change direction, shifting weight to maintain balance, and -applying pedaling power to move forward. To achieve any satisfaction, a -complex sequence of these actions must be taken for a substantial amount of -time. However, a dilemma emerges: many other tasks such as eating, sleeping, and working may result in more immediate satisfaction (e.g. lowered hunger, better rest, bigger paycheck), which may tempt the learner to favor other tasks. Furthermore, if enough satisfaction is not quickly achieved, the learner may even abandon the task of learning to ride a bicycle altogether. - -One frivolous strategy (Figure 1, Option 1) to overcome this dilemma is to -interleave a few random actions on the bicycle throughout the remaining -tasks of the day. This strategy neglects the sequential nature of bicycle -riding and will achieve satisfaction very slowly, if at all. Furthermore, -this strategy may interrupt and reduce the satisfaction of the other daily -tasks. The more intuitive strategy (Figure 1, Option 2) is to dedicate -significant portions of the day to explore the possible actions of bicycle -riding. 
The benefits of this approach include testing the sequential -relationships between actions, isolating different facets of the -task for quick mastery, and providing an explicit cutoff point to shift -focus and accomplish other daily tasks. Also -- let's face it -- who wants to wake up in the middle of the night to turn the bicycle handlebar twice -before going back to bed? - -{% include figure.html path="assets/img/2024-05-07-mode-switching/bike.png" class="img-fluid" %} - - -The above example elicits the main ideas of the paper *When Should Agents -Explore?*- DPO training significantly improves AlpacaEval and MT-Bench performance. At all sizes, - DPO training provides significant improvements in AlpacaEval, with our largest DPO-trained model - significantly outperforming GPT-3.5-turbo-0314 (89.4 vs. 95.1) and is competitive with GPT-4 ... - We also observe that DPO training provides a large boost in MT-Bench - performance for the 13B and 70B size models, with TÜLU 2+DPO 70B being the best-performing - open model compared to all other models on the MT-Bench leaderboard. -- -
-
- DPO training is stable at large scales. We find that DPO training scales without issues with 70B-size models, with DPO training still providing large benefits for open-ended generation (AlpacaEval) even at the 70B size. This suggests DPO is a promising path for training large models on human feedback without the engineering complexity required by PPO. To our knowledge, TÜLU 2+DPO 70B is the largest publicly-released DPO-trained model.
-
- DPO does not dramatically harm most other metrics. We find that DPO training does not - significantly change performance in most other metrics we measure, such as factual reasoning - (MMLU) or reasoning (BBH, GSM8k), with the exception of multilinguality (which we discuss - below). This suggests that DPO training does not significantly change model capabilities. - DPO training significantly drops multilingual capabilities. We find that DPO training significantly drops performance in TydiQA, which tests the multilingual capabilities of our model. However, - we note that both our supervised finetuning and DPO data mixes do not explicitly contain multilingual - data, and are majority English-language. As such, DPO training is likely to make multilingual outputs - further out-of-distribution, and mixing in multilingual data at instruction tuning and DPO training - stages may significantly improve these results. -- -
- DPO training increases model verbosity. As seen in Table 4, TÜLU 2+DPO models generally - output answers of longer length than those trained without DPO. This is in line with prior work - showing a bias toward verbosity from RLHF training. However, we note that our DPO-trained models appear dramatically less verbose than other openweight models, which future work will investigate. -- -## Closing Remarks - -One may find it surprising that supervised learning is able to replace RL -on a formal level. For RLHF, _new_ data is sampled from the language model, and for DPO -this is not the case. - -However, after paying closer attention to the information flow -of RLHF as described above, it may not be too surprising after all. The sampled -data is not really new - it is created using the very same model that one is trying -to optimize. The rewards for these samples are also not new, they are obtained -by fitting a reward function to the preferences, and no new human preferences are -retrieved during optimization. So from the information-flow perspective, -supervised learning and RL are indeed equivalent in this particular case. Maybe -Francois Chollet was not too extreme for suggesting to _get rid of deep RL -altogether_ in his tweet (note that it predates DPO. Personally, I don't believe in a complete futility of deep RL, but for RLHF he was on point): -{% twitter https://twitter.com/fchollet/status/1630241783111364608?s=20 %} -. - -Another surprising aspect of DPO is the question: *Why has nobody done this before?* -Hopefully after reading this blog post, you will agree that the derivation of DPO is -not particularly complicated, so why did it take almost 4 years after the introduction of RLHF? -Especially considering how tricky RLHF can be to implement. -I don't have an answer, though my intuition is that sometimes as a community we put too much -effort into following a working solution, instead of taking a step back -and searching for a simpler path. We might have witnessed a large scale instance of the -[Region-beta paradox](https://en.wikipedia.org/wiki/Region-beta_paradox). - -As a final note on community dynamics: supervised and self-supervised learning are now making more headlines -compared to reinforcement learning, and DPO might have the effect of slowing down -the complicated (but, as I believe, necessary) marriage of RL and LLMs. -I do think that planning and search should play some part of LLM training in the future, -although only for settings in which there is an actual environment from which new information -can be extracted (like tool-use or robotics). For now, however, taking the RL out of RLHF -seems like a good step forward. If DPO can be made beneficial for most LLM trainings, I believe -that one can firmly answer the opening question of this blog as: - -*Is RLHF really (online) RL? No, it is not.* diff --git a/_posts/2024-05-07-robust-foundation-model.md b/_posts/2024-05-07-robust-foundation-model.md deleted file mode 100644 index 6bf87f8e..00000000 --- a/_posts/2024-05-07-robust-foundation-model.md +++ /dev/null @@ -1,863 +0,0 @@ ---- -layout: distill -title: 'Towards Robust Foundation Models: Adversarial Contrastive Learning' -description: Foundation models pre-trained on large-scale unlabelled datasets using self-supervision can be generalizable to a wide range of downstream tasks. Existing work has shown that adversarial attacks can effectively fool any downstream models fine-tuned from a pre-trained foundation model. 
The existence of such adversarial attacks necessitates the development of robust foundation models which can yield both standard generalization and adversarial robustness to safety-critical downstream tasks. Currently, adversarial contrastive learning (ACL) is one of the most effective methods for outputting a robust foundation model. ACL incorporates contrastive learning with adversarial data to effectively output a robust representation without requiring costly annotations. In this blog, we introduced two NeurIPS 2023 publications that can enhance ACL's efficacy and efficiency, respectively. (1) This blog introduces Adversarial Invariant Regularization (AIR) which is a state-of-the-art ACL algorithm. A causal theoretical framework is built to interpret ACL, and then the AIR algorithm is derived from the causal framework to regulate and improve the ACL. (2) This blog also introduces a Robustness-aware Coreset Selection (RCS) method to speed up ACL. RCS does not require label information and searches for an informative training subset that can maintain the adversarial robustness. For the first time, RCS enables the application of ACL on the large-scale ImageNet-1K dataset. -# Your blog post's abstract. - # Please add your abstract or summary here and not in the main body of your text. - # Do not include math/latex or hyperlinks. -date: 2024-05-07 -future: true -htmlwidgets: true - -# Anonymize when submitting -# authors: -# - name: Anonymous - -authors: - - name: Jingfeng Zhang - url: https://zjfheart.github.io/ - affiliations: - name: The University of Auckland & RIKEN Center for Advanced Intelligence Project - - name: Xilie Xu - url: https://godxuxilie.github.io/ - affiliations: - name: National University of Singapore - -# must be the exact same name as your blogpost -bibliography: 2024-05-07-robust-foundation-model.bib - -# Add a table of contents to your post. -# - make sure that TOC names match the actual section names -# for hyperlinks within the post to work correctly. -# - please use this format rather than manually creating a markdown table of contents. -toc: - - name: Foundation Models - subsections: - - name: Contrastive Learning (CL) - - name: Robust Foundation Models - subsections: - - name: Adversarial Contrastive Learning (ACL) - # subsections: - # - name: Interactive Figures - - name: Enhancing ACL via Adversarial Invariant Regularization (AIR) - subsections: - - name: Causal View of ACL - - name: the Methodology of AIR - - name: Empirical Results - - name: Robust Self-Supervised Learning (RobustSSL) Benchmark - - name: Efficient ACL via Robustness-Aware Coreset Selection (RCS) - subsections: - - name: Motivation---ACL is Inefficient - - name: the Methodology of RCS - - name: Experimental Results - - -# Below is an example of injecting additional post-specific styles. -# This is used in the 'Layouts' section of this post. -# If you use this post as a template, delete this _styles block. -_styles: > - .fake-img { - background: #bbb; - border: 1px solid rgba(0, 0, 0, 0.1); - box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1); - margin-bottom: 12px; - } - .fake-img p { - font-family: monospace; - color: white; - text-align: left; - margin: 12px 0; - text-align: center; - font-size: 16px; - } ---- - - - -## Foundation Models - - -Foundation models
-Given the assumptions we made, the algorithm can find the optimum $\hat\theta$ that minimizes the empirical risk, and the Hessian at $\hat\theta$ is guaranteed to be positive definite:
-
-$$R(\theta):=\frac{1}{n}\sum L(z_i,\theta), \ \ \hat\theta=\arg\min_\theta R(\theta)$$
-
-$$H_{\hat\theta}:=\frac{1}{n}\sum \nabla _\theta^2 L(z_i,\hat\theta).$$
-
-Following the intuition above, we look at the parameter difference $\Delta_\epsilon=\hat\theta_{\epsilon, z}-\hat\theta$ obtained by upweighting one training sample $z$ by an infinitesimal amount $\epsilon$:
-
-$$\hat\theta_{\epsilon, z}=\arg\min_{\theta}\{R(\theta)+\epsilon L(z,\theta)\}$$
-
-Recall that our goal is to estimate how the learned parameters change with the sample perturbation, which we can express as $\frac{d \hat\theta_{\epsilon, z}}{d \epsilon}$. Since $\hat\theta_{\epsilon, z}$ is a minimizer of the perturbed loss, we can write its first-order optimality condition:
-
-$$0=\nabla R(\hat\theta_{\epsilon, z})+\epsilon \nabla L(z,\hat\theta_{\epsilon, z}).$$
-
-By performing a first-order Taylor expansion of this condition around $\hat\theta$, we can estimate
-
-$$0\approx \left[ \nabla R(\hat\theta)+\epsilon \nabla L(z,\hat\theta)\right] + \left[ \nabla^2 R(\hat\theta)+\epsilon \nabla^2 L(z,\hat\theta)\right]\Delta_\epsilon.$$
-
-Since $\hat\theta$ minimizes $R$ (so $\nabla R(\hat\theta)=0$) and the $o(\epsilon)$ terms can be omitted, we can solve for $\Delta_\epsilon$ as follows:
-
-$$\Delta_\epsilon\approx -\nabla^2 R(\hat\theta)^{-1} \nabla L(z,\hat\theta)\epsilon \Rightarrow \frac{d \Delta_\epsilon}{d \epsilon}\Bigg|_{\epsilon=0}=\frac{d \hat\theta_{\epsilon,z}}{d\epsilon}\Bigg|_{\epsilon=0}=-H_{\hat\theta}^{-1}\nabla_\theta L(z,\hat\theta) $$
-
-
-Therefore, $\mathcal{I}_{\text{removal,loss}}(z,z_{\text{test}}):=\frac{dL(z_\text{test},\hat\theta_{\epsilon,z})}{d\epsilon}\Bigg|_{\epsilon=0}=\frac{dL(z_\text{test},\hat\theta_{\epsilon,z})}{d\hat\theta_{\epsilon,z}}\frac{d \hat\theta_{\epsilon,z}}{d\epsilon}\Bigg|_{\epsilon=0}\approx-\nabla_\theta L(z_{\text{test}},\hat\theta)^\top H_{\hat\theta}^{-1}\nabla_\theta L(z,\hat\theta)$
-
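As an illustration (our own sketch, not from the papers), this formula can be evaluated directly for a model whose Hessian is cheap to form, such as logistic regression:

```python
import numpy as np

def grad_loss(theta, x, y):
    # Gradient of the logistic loss log(1 + exp(-y * theta^T x)) w.r.t. theta.
    return -y * x / (1.0 + np.exp(y * (x @ theta)))

def hessian_risk(theta, X, Y):
    # Hessian of the empirical risk (1/n) * sum of logistic losses.
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return (X.T * (p * (1 - p))) @ X / len(Y)

def influence(theta, X, Y, x_train, y_train, x_test, y_test):
    # I(z, z_test) ~= - grad L(z_test)^T H^{-1} grad L(z)
    H = hessian_risk(theta, X, Y)
    return -grad_loss(theta, x_test, y_test) @ np.linalg.solve(H, grad_loss(theta, x_train, y_train))

# Example usage with random placeholder data (in practice, theta is the
# empirical risk minimizer obtained by training).
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((100, 5)), np.sign(rng.standard_normal(100))
theta = np.zeros(5)
print(influence(theta, X, Y, X[0], Y[0], X[1], Y[1]))
```

For large models, forming and inverting $H_{\hat\theta}$ is intractable, which motivates the approximations discussed next.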
-Before introducing the implementation steps, Ilyas et al. first consider a binary logistic regression setting, where:
- $x_t$ is an input in $\mathbb{R}^d$;
- $y_t$ is the binary label;
-	$b_t$ is the bias term.
-
-Then the authors further parametrize the learning algorithm with $\theta$ as the model parameters:
-$$\theta^{*}(S) := arg\; \underset{\theta}{min} \sum_{(x_t, y_t)\in S} log[1 + exp(-y_t \cdot (\theta^{T}x_t + b_t))]$$
-
-Data attribution in the binary logistic regression setting can be computed using the _one-step Newton approximation_, with the following notation:
- $z$: target sample;
- $f(z;\theta) :=\theta^{T}x+b$;
- $z_t$: the $t^{th}$ training example, $z_t = (x_t, b_t, y_t)$;
- $X \in \mathbb{R}^{n \times d}$ stacking all input in one matrix $X$;
- $p_{t}^{*}:= (1 + exp(-y_t \cdot f(z_t; \theta^*)))^{-1}$
- $p_{t}^{*}$ is the predicted correct-class probability at $\theta^{*}$;
-	$R$ is a diagonal $n \times n$ matrix with $R_{tt} = p_{t}^{*}\times (1-p_{t}^{*})$
-
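For concreteness, here is a small sketch (our own; the data and the "fitted" parameters are placeholders) of how $p_t^{*}$ and $R$ are formed once $\theta^{*}$ is available:

```python
import numpy as np

# Placeholder inputs X (n x d), labels y in {-1, +1}, and "fitted" parameters
# (in practice theta_star is the minimizer of the logistic regression objective;
# we use a single shared intercept b for simplicity).
rng = np.random.default_rng(0)
n, d = 50, 5
X, y = rng.standard_normal((n, d)), np.sign(rng.standard_normal(n))
theta_star, b = rng.standard_normal(d), 0.0

f = X @ theta_star + b                     # f(z_t; theta*)
p_star = 1.0 / (1.0 + np.exp(-y * f))      # correct-class probabilities p_t*
R = np.diag(p_star * (1.0 - p_star))       # diagonal n x n matrix R

print(p_star[:5], R.shape)
```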
-Now that the logistic regression setting is in place, Ilyas et al. generalize it to non-convex models with TRAK.
-
-In this paper, the TRAK algorithm consists of five steps:
-
-1. Linearizing the model output function via Taylor approximation, which reduces the model of interest to a linear function in parameter space.
- Consider $f(z;\theta)$ as a non-convex function, then we can approximate it with its Taylor expansion centered around $\theta^{\*}$:
- $$\hat{f}(z;\theta):= f(z;\theta^{*}) + \nabla_{\theta} \; f(z;\theta^{*})^{T}(\theta - \theta^{*})$$
-	$$\theta^{*}(S) \approx arg\; \underset{\theta}{min} \sum_{z_t \in S} log[1 + exp(-y_t \cdot ( \underbrace{\nabla_{\theta} \; f(z_t;\theta^{*})^{T}}_{inputs}\;\theta + b_t))]$$
- where
- $f(z;\theta):=log(\frac{p(z;\theta)}{1 - p(z; \theta)})$
-	$b_t = f(z_t;\theta^{\*}) - \nabla_{\theta} \; f(z_t;\theta^{\*})^{T} \theta^{\*}$
-
-
-2. Reducing the dimensionality of the linearized model using random projections. To preserve the model-relevant information, Ilyas et al. project the gradient features with a random matrix, defining $\phi(z) = \mathbf{P}^{T}\nabla_{\theta} \; f(z;\theta^{*})$, where
- $\mathbf{P}\sim \mathcal{N} (0, 1)^{p \times k}$ for $k \ll p$
-
-
-3. Estimating influences by adapting the one-step Newton approximation.
-$$\tau(z, S) := \phi(z)^{T}(\Phi^{T}\Phi)^{-1}\Phi^{T}\mathbf{Q}$$
-where
-	$\mathbf{Q}:= diag(1 - p_{t}^*) = diag(\{(1 + exp(y_t \cdot f(z_t;\theta^{*})))^{-1}\})$;
-	$\mathbf{Q} \in \mathbb{R}^{n \times n}$, where each diagonal entry is a one-minus-correct-class-probability term;
-	$\Phi \in \mathbb{R}^{n \times k}$ stacks the projected gradients $\phi(z_t)$ of all training examples.
-
-4. Ensembling over $N$ independently trained models. Each model is trained on a subset of the training set, $S_i \subset S$.
-$$\tau_{N}(z, S) := (\frac{1}{N} \sum_{i=1}^{N} \mathbf{Q}_{i}) \cdot (\frac{1}{N} \sum_{i=1}^{N} \phi_{i}(z)^{T}(\Phi_{i}^{T}\Phi_{i})^{-1}\Phi_{i}^{T})$$
-
-5. Inducing sparsity via soft-thresholding.
-$$\tau_{TRAK}(z, S) := \mathfrak{S}((\frac{1}{N} \sum_{i=1}^{N} \mathbf{Q}_{i}) \cdot (\frac{1}{N} \sum_{i=1}^{N} \phi_{i}(z)^{T}(\Phi_{i}^{T}\Phi_{i})^{-1}\Phi_{i}^{T}), \hat{\lambda})$$
-where
- $\mathfrak{S}(\cdot; \lambda)$ is the soft thresholding operator;
- $\hat{\lambda}$ is the soft-thresholding parameter, selected via cross-validation.
-
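Putting the five steps together for a single model, a minimal sketch of the TRAK score (ignoring the ensembling in step 4 and the soft-thresholding in step 5, and assuming user-supplied `grad_f` and `correct_class_prob` helpers for the trained model) could look like this:

```python
import numpy as np

def trak_scores(grad_f, correct_class_prob, train_set, z_target, p_dim, k=512, seed=0):
    # grad_f(z): gradient of the model output f(z; theta*) w.r.t. the parameters, shape [p_dim].
    # correct_class_prob(z): p* = (1 + exp(-y * f(z; theta*)))^{-1} for a training example z.
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(p_dim, k))                          # step 2: random projection matrix

    # steps 1-2: projected gradients ("features") for the training set and the target example
    Phi = np.stack([P.T @ grad_f(z) for z in train_set])     # shape [n, k]
    phi_target = P.T @ grad_f(z_target)                      # shape [k]

    # step 3: one-step-Newton-style influence estimate
    Q = np.diag([1.0 - correct_class_prob(z) for z in train_set])     # shape [n, n]
    return phi_target @ np.linalg.pinv(Phi.T @ Phi) @ Phi.T @ Q       # shape [n]
```

Steps 4 and 5 would then average these scores over $N$ independently trained models and apply the soft-thresholding operator $\mathfrak{S}(\cdot;\hat{\lambda})$.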
Announcements:
The Machine Learning community is currently experiencing a reproducibility crisis and a reviewing crisis [Littman, 2021]. Because of the highly competitive and noisy reviewing process of ML conferences [Tran et al., 2020], researchers have an incentive to oversell their results, slowing down progress and diminishing the integrity of the scientific community. Moreover, with the growing number of papers published and submitted at the main ML conferences [Lin et al., 2020], it has become more challenging to keep track of the latest advances in the field.
Blog posts are becoming an increasingly popular and useful way to talk about science [Brown and Woolston, 2018]. They offer substantial value to the scientific community by providing a flexible platform to foster open, human, and transparent discussions about new insights or limitations of a scientific publication. However, because they are not as recognized as standard scientific publications, only a minority of researchers manage to maintain an active blog and get visibility for their efforts. Many are well-established researchers (Francis Bach, Ben Recht, Ferenc Huszár, Lilian Weng) or big corporations that leverage entire teams of graphic designers and writers to polish their blogs (Facebook AI, Google AI, DeepMind, OpenAI). As a result, the incentives for writing scientific blog posts are largely personal; it is unreasonable to expect a significant portion of the machine learning community to contribute to such an initiative when everyone is trying to establish themselves through publications.
Submit your blogpost on Openreview
Last year, we ran the second iteration of the Blogpost track at ICLR 2023!
It was very successful, with accepted posts presented in person at the main conference.
Our goal is to create a formal call for blog posts at ICLR to incentivize and reward researchers to review past work and summarize the outcomes, develop new intuitions, or highlight some shortcomings. A very influential initiative of this kind happened after the Second World War in France. Because of the lack of up-to-date textbooks, a collective of mathematicians writing under the pseudonym Nicolas Bourbaki [Halmos 1957] decided to start a series of textbooks about the foundations of mathematics [Bourbaki, 1939]. In the same vein, we aim to provide a new way to summarize scientific knowledge in the ML community.
Due to the large diversity of topics that can be discussed in a blog post, we decided to restrict the range of topics for this call for blog posts. We identified that the blog posts that would bring the most value to the community and the conference would be posts that distill and discuss previously published papers.
Abstract deadline: December 11th 00:00GMT, 2023 (submit to OpenReview - to be announced soon).
Submission deadline: December 17th 00:00GMT, 2023 (any modifications to your blog post, via a pull request on GitHub).
Decision Notification: January 30th, 2024 UPDATED: February 15th, 2024
Camera-ready merge: March 15th, 2024
Write a post on a subject that has been published at a top-tier venue (ICLR, ICML, NeurIPS, AAAI, UAI, CVPR, SIGGRAPH, ECCV, ICCV, etc.) relatively recently.
The authors of the blog posts will have to declare their conflicts of interest (positive or negative) with the paper (and the paper’s authors) they write about. Conflicts of interest include:
We will only ask the authors to report if they have a conflict of interest. If so, reviewers will be asked to judge if the submission is sufficiently critical and objective of the papers addressed in the blog post.
The posts will be created and published under a unified template; see the submission instructions and the sample post hosted on the blog of this website.
Additionally, accepted posts will have the option to present their work as a poster during the main poster session. For more information about the main poster session (time, poster format, etc.) please refer to the ICLR homepage.
Our goal is to avoid heavily engineered, professionally produced blog posts (such as the “100+ hours” of effort mentioned as a standard by the Distill guidelines) and instead to encourage ideas and clear writing rather than dynamic visualizations or embedded JavaScript engines. Please check our submission instructions for more details. We accept submissions in both Markdown and HTML; we believe this is a good trade-off between complexity and flexibility.
Submit your blogpost on Openreview
For any technical issues with the blog post repository (for example, blog posts not displaying correctly or issues while following the submission instructions), please open an issue in our GitHub repository.
For other inquiries, reach us via email at: blog.track.chairs@gmail.com
Eryn Brown and Chris Woolston. Why science blogging still matters. Nature, 2018.
Paul R Halmos. Nicolas Bourbaki. Scientific American, 1957.
Nicolas Bourbaki. Elements of mathematics. Éditions Hermann, 1939.
Model selection is a crucial aspect of machine learning, as it allows us to choose the most appropriate model for a given task. In the Bayesian setting, the marginal likelihood has been a popular tool for model selection and hyperparameter learning, often motivated by the principle of Occam’s razor. However, the suitability of the marginal likelihood depends on the specific context and goals of the modeling task.
Recently, the paper “Bayesian Model Selection, the Marginal Likelihood, and Generalization” by Lotfi et al. (2022/2023) revisited the marginal likelihood as a tool for model selection and hyperparameter learning and proposed the conditional log marginal likelihood (CLML) as an alternative that better tracks generalization.
In this blog post, inspired by the above paper, we (re-)derive insights that challenge the conventional focus on the marginal likelihood and related quantities for Bayesian model selection. We argue that the quantities we examine are all consequences of Occam’s razor, and thus no single quantity should be considered universally superior. Instead, the choice of model selection criterion should be guided by the context and the desired outcomes. We highlight that many recently proposed metrics for model selection, including CLML, are closely related to cross-validation and have failure cases that can be explained by considering model misspecification and prior-data conflicts. Overall, the choice between these metrics should be based on the specific requirements of the task at hand.
We begin by discussing the foundations of model selection, including the role of Occam’s razor and its relationship to maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation. We then introduce the concepts of log marginal likelihood (LML), cross-validation, and conditional log marginal likelihood (CLML), highlighting their connections and differences. Through a series of thought experiments and empirical observations, we explore the behavior of these model selection criteria in various scenarios, such as under model misspecification, prior-data conflict, and in different data regimes. We find that the conditional marginal cross-entropy, which is closely related to cross-validation, is often a more reliable choice when the primary objective is to select for generalization performance. On the other hand, the conditional joint marginal cross-entropy (permutation-invariant negative CLML) may be preferable when the focus is on sequential prediction and online learning. At the same time, the joint marginal information (negative LML) is rarely the right choice for model selection. We review relevant literature, including the work of Fong and Holmes (2020), which relates the marginal likelihood to cross-validation.
Throughout the post, we emphasize the importance of considering the context, available data, and desired outcomes when selecting the most appropriate metric for model selection and hyperparameter tuning. By questioning the primacy of the (conditional) joint marginal likelihood and encouraging critical thinking about the foundations of these quantities, we hope to foster a more nuanced understanding of Bayesian model selection.
In our daily lives, we’re often faced with choices that require us to sift through competing explanations or decisions. Imagine you hear your doorbell ring. You might think it’s the delivery you’ve been waiting for, a neighbor dropping by, or perhaps you didn’t hear anything at all, and it was just your imagination. In deciding between these options, you’re likely to lean towards the simplest explanation that aligns with your expectations—say, the long-awaited delivery. This inclination towards simplicity has a formal counterpart in scientific discovery and machine learning, known as Occam’s razor:
This concept is further illustrated using an example from chapter 28 of David MacKay’s seminal book, “Information Theory, Inference, and Learning Algorithms”, where the essence of selecting between models based on their evidence is laid out succinctly.
But how can we express this formally using mathematics?
In the next section, we will use information-theoretic concepts to formalize Occam’s razor and connect it to the maximum likelihood estimation (MLE) and maximum-a-posteriori (MAP) estimation approaches. This formalization highlights that Occam’s razor, as a general principle favoring simplicity, can motivate various techniques, not just Bayesian ones. Therefore, using Occam’s razor as the sole justification for Bayesian model selection may not be as compelling as it initially appears.
However, one could argue that when Occam’s razor is properly applied within a Bayesian framework, it captures a more nuanced notion of complexity. From this perspective, the Bayesian formulation of Occam’s razor favors models that strike a balance between goodness-of-fit and model complexity, where complexity is measured by the model’s ability to compress the data. This view is consistent with the minimum description length (MDL) principle, which posits that the best model is the one that minimizes the total description length of both the model and the data given the model.
From Philosophical Principle to Mathematical Statement
Let’s first connect Occam’s razor to Maximum Likelihood Estimation (MLE) before diving deeper into the background and (Bayesian) model selection.
In information theory, the information content of an event \(x\) is defined as \(-\log_2 \pof{x}\), where \(\pof{x}\) is the probability of that event occurring according to a given model. This is also called Shannon’s information content. We use the base \(2\) for logarithms and measure information in bits (binary digits), and for the rest of the post, we will drop the base of the logarithm. The information content measures the optimal encoding length in bits for the event \(x\) under the model specified by its probability distribution \(\pof{\cdot}\). In the context of probabilistic modeling, variables that cannot be directly observed are called latent variables. Occam’s razor suggests that we should prefer simpler explanations for latent variables, given the observed data.
Consider a model with a latent variable \(z\) and observed data \(x\). The model specifies a probability distribution \(\pof{z \given x}\). According to Occam’s razor, we prefer simpler explanations, which correspond to smaller values of \(-\log \pof{z \given x}\). Using Bayes’ theorem, we can rewrite this as:
\[\text{minimize } z \text{ in } -\log \pof{z \given x} = -\log \pof{x \given z} - \log \pof{z} + \log \pof{x}.\]Given that \(\pof{x}\) is independent of \(z\), we can omit it from our objective. Additionally, if we posit a uniform (or non-informative) prior for \(z\), implying that all potential values of \(z\) are equally probable before observing \(x\), then \(\pof{z}\) becomes constant and can also be dropped from our objective. This simplifies our preference to:
\[\text{minimize } z \text{ in } -\log \pof{x \given z}.\]Equivalently, we can maximize \(\pof{x \given z}\), which is the likelihood of the observed data \(x\) given the latent variable \(z\). When making a decision and selecting a single value for \(z\), this leads to the maximum likelihood estimation (MLE) approach.
In summary, the connection between Occam’s razor and MLE relies on the following assumptions: (1) simpler explanations correspond to smaller values of \(-\log \pof{z \given x}\); (2) \(\pof{x}\) does not depend on \(z\) and can be dropped from the objective; and (3) the prior \(\pof{z}\) is uniform (non-informative), so it can be dropped as well.
Under these assumptions, the preference for simpler explanations leads to the MLE approach, where more likely values of the latent variable given the observed data are preferred.
Maximum likelihood estimation is common in machine learning because we can directly optimize the likelihood function. Still, this is not easy for deep learning models because they have a large number of parameters and the loss function is non-convex.
However, the assumption of a uniform or non-informative prior for the latent variables is not always valid or desirable. In many cases, we have prior knowledge about the latent variables that can be incorporated into the model. This leads to the Maximum-A-Posteriori (MAP) Estimation as an alternative to MLE.
In MAP estimation, \(\pof{z}\) is not constant, so we cannot drop it—we can still drop \(\pof{x}\), however—and maximize the joint distribution \(\pof{z, x}\), or equivalently:
\[\text{minimize } z \text{ in } -\log \pof{x, z}=-\log \pof{x \given z} - \log \pof{z}.\]Before we go further, we need to introduce notation for information-theoretic quantities and concepts that we will use throughout the post.
Information theory deals with the communication and quantification of information.
The information content of an event \(x\) is denoted as \(\Hof{x}\) and is defined as \(-\log_2 \pof{x}\), where \(\pof{x}\) is the probability of event \(x\) occurring. It represents the minimum amount of information needed to describe the occurrence of \(x\) given an underlying probability distribution. \(\Hof{x \given y}\) and \(\Hof{x, y}\) are analogously defined and denote the conditional and joint information content of random variables \(X\) and \(Y\), respectively. In machine learning, the information content is often used as a minimization objective, represented as the negative log-likelihood or cross-entropy when averaged over a dataset (see below).
The entropy \(\Hof{X}\) of a random variable \(X\) is the expectation of its information content:
\[\Hof{X} \triangleq \E{\pof{x}}{\Hof{x}} = \E{\pof{x}}{-\log \pof{x}}.\]The entropy measures the average amount of information needed to describe the random variable \(X\). It provides a measure of uncertainty or randomness associated with \(X\). We can similarly define the entropy of a conditional distribution \(\Hof{X \given Y}\) and the joint entropy \(\Hof{X, Y}\).
We will also use the Kullback-Leibler divergence \(\Kale{\pof{X}}{\qof{X}}\) and the cross-entropy \(\CrossEntropy{\pof{X}}{\qof{X}}\):
\[\begin{aligned} \CrossEntropy{\pof{X}}{\qof{X}} & = \E{\pof{x}}{-\log \qof{x}}\\ \Kale{\pof{X}}{\qof{X}} & = \CrossEntropy{\pof{X}}{\qof{X}} - \Hof{X} \end{aligned}\]The cross-entropy quantifies the average number of bits needed to encode samples drawn from the true distribution \(\pof{X}\) using a different distribution \(\qof{X}\). The Kullback-Leibler divergence measures the difference between two probability distributions and captures the additional bits needed to encode samples from \(\pof{X}\) using \(\qof{X}\) compared to encoding them with the true distribution \(\pof{X}\).
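As a quick sanity check of these definitions, the following snippet (ours, not from the original post) computes the information content, entropy, cross-entropy, and KL divergence for small discrete distributions, in bits:

```python
import numpy as np

def information_content(p):
    # H[x] = -log2 p(x), in bits
    return -np.log2(p)

def entropy(p):
    # H[X] = E_p[-log2 p(x)]
    p = np.asarray(p)
    return float(np.sum(p * information_content(p)))

def cross_entropy(p, q):
    # CE(p(X) || q(X)) = E_p[-log2 q(x)]
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * information_content(q)))

def kl_divergence(p, q):
    # KL(p(X) || q(X)) = CE(p(X) || q(X)) - H[X] >= 0
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]
print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))   # 1.5, ~1.585, ~0.085
```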
Taking this notation into account, we can express Occam’s razor as:
\[\text{prefer small } z \text{ for } \Hof{z \given x},\]where \(Z\) is the latent variable and \(X\) is the observed data. Note that \(x\) and \(z\) are individual realizations of the random variables \(X\) and \(Z\), respectively.
The MLE and MAP objectives are accordingly:
\[\text{minimize } z \text{ in } \Hof{x \given z} \text{ for MLE and } \Hof{x, z} \text{ for MAP.}\]This measures the number of bits we need to encode the observed data given the latent variable for MLE and the number of bits to encode both the observed data and the latent variable for MAP. This relates Occam’s razor to the minimum description length principle.
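For a concrete toy illustration of these two objectives, consider estimating the bias \(z\) of a coin from the observation \(x\) of 7 heads in 10 flips under a Beta prior. The sketch below (ours; the prior parameters are made up) minimizes \(\Hof{x \given z}\) for MLE and \(\Hof{x, z} = \Hof{x \given z} + \Hof{z}\) for MAP on a grid:

```python
import numpy as np
from scipy.stats import binom, beta

heads, flips = 7, 10
a, b = 2.0, 5.0                                   # Beta prior, mildly favouring low head-probabilities
z_grid = np.linspace(1e-3, 1 - 1e-3, 999)

info_x_given_z = -np.log(binom.pmf(heads, flips, z_grid))          # H[x | z]
info_x_and_z = info_x_given_z - np.log(beta.pdf(z_grid, a, b))     # H[x, z] = H[x|z] + H[z]

z_mle = z_grid[np.argmin(info_x_given_z)]   # = heads / flips = 0.7
z_map = z_grid[np.argmin(info_x_and_z)]     # pulled towards the prior: (heads + a - 1) / (flips + a + b - 2) ~= 0.533
print(z_mle, z_map)
```

With a uniform prior (\(a = b = 1\)), the \(\Hof{z}\) term is constant and the two estimates coincide, matching the argument above.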
In many machine learning tasks, we need to determine the best hyperparameters for a model or select the most suitable model architecture from several discrete options. The primary goal is to find the hyperparameters or model that generalizes best to new, unseen data.
Both cases can be viewed as inferring a random variable \(\H\), which represents either the model choice as a categorical distribution or the hyperparameters as a continuous distribution. In this sense, \(\H\) can be considered as another latent variable in the model.
For consistency, we will continue using \(\x\) to denote data points throughout this post. Although it is common to use \(\y\) for predictions and \(\x\) for side channel information, we will not require this distinction here and will stick to \(\x\) for simplicity.
The same arguments discussed previously also apply in this context, and we can express the objective as:
\[\text{minimize } \h \text{ in } \Hof{\x \given \h}.\]In addition to the hyperparameters \(\H\), we usually have model parameters \(\W\) for a given \(\h\) with a parameter distribution \(\pof{\w \given \h}\) that we need to infer based on observed data. These parameters are the learnable components of the model, such as the weights and biases in a neural network. For given \(\w\) and \(\h\), we can easily compute the likelihood \(\pof{\x \given \w, \h}\), which represents the probability of observing the data \(\x\) given the specific values of the parameters and hyperparameters. However, to make predictions or compute the marginal likelihood, we will need to consider the uncertainty in the parameter values by integrating over all possible \(\w\).
Bayesian Model Averaging (BMA) is a technique that integrates, or marginalizes, over the model parameters \(\W\) when making predictions. This accounts for the uncertainty in the model parameters, which is particularly useful when dealing with complex models, high-dimensional parameter spaces, and limited data. In contrast to the MLE or MAP estimate, which use a single parameter value \(\w\) for predictions, BMA provides a more robust and comprehensive approach. The probability of a new data point \(\x'\) under BMA is given by:
\[\pof{\x' \given \x, \h} = \int \pof{\x' \given \x, \w, \h} \pof{\w \given \x, \h} \, \mathrm{d}\w,\]where \(\pof{\w \given \x, \h}\) is the posterior distribution of the parameters given the data, and \(\pof{\x' \given \x, \w, \h}\) is the likelihood of the new data point given the parameters, hyperparameters, and training data.
While BMA offers benefits, it is computationally challenging, particularly when dealing with high-dimensional parameter spaces commonly encountered in deep learning models. To make BMA tractable, various approximation methods, such as Markov Chain Monte Carlo (MCMC) and Variational Inference, have been proposed.
Let’s now discuss the marginal likelihood and its relation to BMA. The marginal likelihood, denoted as \(\pof{\x \given \h}\), is the likelihood of the observed data given the hyperparameters, marginalized over all possible parameter values \(\W\). It is also known as the model evidence. To compute the marginal likelihood, we integrate over all possible \(\w\):
\[\pof{\x \given \h} = \int \pof{\x \given \w, \h} \pof{\w \given \h} \, d\w,\]where \(\pof{\x \given \w, \h}\) is the likelihood of the data given the parameters and hyperparameters, and \(\pof{\w \given \h}\) is the prior distribution of the parameters given the hyperparameters.
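To make the integral concrete, here is a small illustrative example (ours) for a Beta–Bernoulli model, where \(\h\) indexes the prior hyperparameters and \(\w\) is the coin bias: the marginal likelihood is estimated by Monte Carlo over prior samples and compared against its closed form:

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)
x = np.array([1, 1, 0, 1, 1, 0, 1, 1])        # observed binary data (made up)
k, n = int(x.sum()), len(x)

def log_marginal_closed_form(a, b):
    # log p(x | h) for a Beta(a, b) prior on the Bernoulli parameter w
    return betaln(a + k, b + n - k) - betaln(a, b)

def log_marginal_monte_carlo(a, b, num_samples=200_000):
    # log (1/S) sum_s p(x | w_s, h), with w_s ~ p(w | h)
    w = rng.beta(a, b, size=num_samples)
    log_lik = k * np.log(w) + (n - k) * np.log1p(-w)
    return np.logaddexp.reduce(log_lik) - np.log(num_samples)

for h in [(1.0, 1.0), (5.0, 1.0), (1.0, 5.0)]:   # three candidate "models" h
    print(h, log_marginal_closed_form(*h), log_marginal_monte_carlo(*h))
```

The three hyperparameter settings play the role of competing models \(\h\); the one assigning the highest marginal likelihood to the observed data would be preferred by this criterion.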
Comparing BMA to the marginal likelihood, we see that they match for individual data points. However, for multiple data points (i.e., conditioning on datasets), the marginal likelihood is more complex. “BMA” typically refers to making predictions for a single new data point, while the marginal likelihood can be considered for many points simultaneously. Apart from this difference, the two are equivalent. Let’s discuss the case of multiple data points in more detail to understand why computing the marginal likelihood on datasets is even more challenging.
So far, we have described everything as if we only had a single data point \(x\). However, in practice, we often have a dataset \(\xNtuple = (\x_1, \x_2, \ldots, \x_N)\).
The easiest way to extend the previous definitions is to simply substitute \(\xNset\) for \(\x\) and assume we can compute a likelihood for the entire dataset using its joint predictive distribution:
\[\pof{\xNtuple \given \h} = \int \pof{\x_1, \x_2, \ldots, \x_N \given \w, \h} \, \pof{\w \given \h} \, d\w.\]We can then maximize this likelihood or equivalently minimize the joint marginal information \(\Hof{\xNtuple \given \h}.\)
If our model is exchangeable, meaning the order of the \(\x_n\) does not matter, we can equivalently take an expectation over all permutations of the data to obtain the joint marginal cross-entropy:
\[\CrossEntropy{\pdata{\X_1, \ldots, \X_N}}{\pof{\X_1, \ldots, \X_N \given \h}},\]where \(\pdata{\cdot}\) is an empirical data distribution that allows us to draw samples without replacement. In this case, the joint marginal information and cross-entropy are equivalent.
With exchangeability, we can simply write \(\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNset}\) instead of using the tuple notation \(\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNtuple}\) as the order of the data points does not matter.
Conversely, if a model is not exchangeable, we can induce exchangeability by averaging over all permutations of the data points via ensembling. For example, deep learning models trained with stochastic gradient descent are generally not exchangeable, as the order and composition of the batches can impact the results. However, we can make them effectively exchangeable by training multiple models and averaging their predictions. In the limit of infinitely many models, the resulting ensemble will be exchangeable.
The joint marginal cross-entropy turns a potentially non-exchangeable joint information into an exchangeable one by taking an expectation.
Before we try to understand these joint expressions, we should consider alternative ways to extend the previous definitions.
For instance, we could take the average of the likelihoods for individual data points:
\[\frac{1}{N} \sum_{n=1}^N \pof{\x_n \given \h}.\]Assuming an underlying data distribution \(\pdata{x}\), we can also express this as an attempt to estimate:
\[\E{\pdata{\x}}{\pof{\x \given \h}} = \int \pof{\x \given \h} \, \pdata{\x} \, d\x.\]This provides an average score for the data likelihood.
However, from the perspective of Occam’s razor, simply taking the average likelihood is not the most principled approach. Instead, we can leverage information theory, which has been our tool of choice thus far. Recall that we prefer small values of the marginal information \(\Hof{\x \given \h}\). By taking the expectation over the data distribution, we obtain the individual marginal cross-entropy:
\[\CrossEntropy{\pdata{\X}}{\pof{\X \given \h}} = \E{\pdata{\x}}{-\log \pof{\x \given \h}}.\]This cross-entropy measures the average number of bits needed to encode the data using the model’s probability distribution. As it does not involve a joint distribution, we refer to it simply as the marginal cross-entropy.
It is evident that the marginal cross-entropy and the average likelihood are not equivalent. Using the convexity of the negative logarithm and Jensen’s inequality, we see that the marginal cross-entropy is always larger than the negative logarithm of the average likelihood:
\[\begin{aligned} \CrossEntropy{\pdata{\X}}{\pof{\X \given \h}} &= \E{\pdata{\x}}{-\log \pof{\x \given \h}} \\ &\geq -\log \E{\pdata{\x}}{\pof{\x \given \h}} \\ &\approx -\log \frac{1}{N} \sum_{n=1}^N \pof{\x_n \given \h}. \end{aligned}\]The NLL is frequently used to evaluate a model’s performance after training, typically on a held-out validation set. This is equivalent to computing the cross-entropy between the empirical distribution of the validation set and the model’s predictive distribution, conditioned on the parameters learned from the training data:
\[\CrossEntropy{\hpcof{\text{val}}{\X'}}{\pof{\X' \given \xNtuple, \h}}\]It is essential to distinguish this from the cross-entropy computed on the prior distribution of the model parameters before seeing any data, which is less useful for evaluating a trained model’s performance:
\[\CrossEntropy{\hpcof{\text{val}}{\X'}}{\pof{\X' \given \h}}\]Only the NLL on a validation set conditioned on the training data provides an estimate of the model’s generalization ability after training. The same holds for the quantities marginalized over the model parameters.
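The gap in the inequality above is easy to check numerically; with some made-up per-example likelihoods, the marginal cross-entropy clearly exceeds the negative log of the average likelihood:

```python
import numpy as np

# per-example likelihoods p(x_n | h) under some model h (made-up values)
likelihoods = np.array([0.9, 0.5, 0.01, 0.7])

marginal_cross_entropy = np.mean(-np.log(likelihoods))     # estimates E[-log p(x | h)]
neg_log_avg_likelihood = -np.log(np.mean(likelihoods))     # -log E[p(x | h)]

print(marginal_cross_entropy, neg_log_avg_likelihood)      # ~1.44 >= ~0.64
```

A single poorly explained data point dominates the cross-entropy but barely moves the average likelihood, which is one reason the two summaries can rank models differently.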
Occam’s razor does not clearly specify which aggregate metric on \(\Hof{\x \given \h}\) we should prefer. Instead of the mean, we could use the median or a different quantile of the information content as a summary statistic to assess the model’s performance on the dataset. This might be more robust, as it is less sensitive to outliers.
Crucially, the marginal cross-entropy and related summary statistics measure the model’s performance using the “prior” parameter distribution, not the posterior conditioned on data. However, the joint distribution captures something else, which can be seen more clearly using the chain rule:
\[\Hof{\xNset \given \h} = \sum_{n=1}^N \Hof{\x_n \given \x_1, \ldots, \x_{n-1}, \h}\]Each term is the marginal information of one data point conditioned on the previous data points. Similarly, when we take an expectation over the data distribution, we obtain a chain of conditional marginal cross-entropies:
\[\begin{aligned} & \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNtuple} = \\ &\quad = \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_1} + \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_2 \given \X_1} \\ &\quad \quad + \ldots + \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_N \given \X_1, \X_2, \ldots, \X_{N-1}} \\ &\quad = \sum_{n=1}^N \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_n \given \X_{n-1}, \ldots, \X_1}. \end{aligned}\]Each term in the sum is a conditional marginal cross-entropy conditioned on the previous data points, which differs from the unconditional marginal cross-entropy (which appears as the first term).
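To see the chain rule at work in a toy model of our own choosing, consider a Beta-Bernoulli model, where the hypothesis \(\h\) is given by the prior parameters \(a, b\) and the posterior predictive is available in closed form. The following sketch computes the joint marginal information both via the chain rule and directly, and checks that they agree:

```python
import numpy as np
from scipy.special import betaln

def joint_information_chain_rule(xs, a=1.0, b=1.0):
    """-log p(x_1, ..., x_N | h) for a Beta(a, b)-Bernoulli model via the chain rule:
    the sum of conditional marginal information terms -log p(x_n | x_<n, h)."""
    info, heads, tails = 0.0, 0, 0
    for x in xs:
        p_one = (a + heads) / (a + b + heads + tails)  # posterior predictive p(x_n = 1 | x_<n, h)
        info += -np.log(p_one if x == 1 else 1.0 - p_one)
        heads, tails = heads + (x == 1), tails + (x == 0)
    return info

def joint_information_direct(xs, a=1.0, b=1.0):
    """Closed form: p(x_1:N | h) = B(a + heads, b + tails) / B(a, b)."""
    heads = sum(xs)
    tails = len(xs) - heads
    return -(betaln(a + heads, b + tails) - betaln(a, b))

xs = [1, 0, 1, 1, 0, 1]
assert np.isclose(joint_information_chain_rule(xs), joint_information_direct(xs))
```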
The following visualization summarizes the relationship between the conditional and joint marginal cross-entropies and information. The chain rule tells us that the area under the curve of the conditional quantities equals the joint quantity.
In summary, the marginal and joint cross-entropies offer different perspectives on a model’s performance.
While both metrics are useful for evaluating models, the joint marginal cross-entropy provides insight into how well the model learns from the data during training. The conditional marginal cross-entropy, on the other hand, is more suitable for assessing the model’s generalization ability at a given point in time, without the influence of parameter updates.
This brings us back to the earlier question of what metric we should prefer and use for model selection. Let’s consider:
The marginal cross-entropy, as in the first term, is likely not useful for model selection with deep learning models, as it is not conditioned on any data and thus cannot correlate well with the model’s performance after training.
If we care about the model’s “generalization” performance after training on \(N-1\) data points without further adaptation, the conditional marginal cross-entropy on the last data point is the more relevant quantity:
\[\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_N \given \X_{N-1}, \ldots, \X_1}\]It measures the model’s performance on the last data point after having seen all previous data points, similar to a “leave-one-out” metric. Indeed, it is equivalent to leave-one-out cross-validation when we have an empirical data distribution consisting of \(N\) data points and sample without replacement.
More generally, it is equivalent to cross-validation when we hold out more than one data point for evaluation from the empirical data distribution:
\[\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X' \given \X_{N-k}, ..., \X_{1}}.\]This is the same expression as in (2.) but we assume there are more samples to draw from in the empirical data distribution \(\pdata{\x'}\). We call this term the conditional marginal cross-entropy and keep in mind its connection to cross-validation.
On the other hand, if we care about the model’s performance as an online learner, or in the case of LLMs, as an in-context learner, the joint marginal cross-entropy becomes a more relevant metric. It measures the model’s ability to adapt and make accurate predictions as it sequentially processes new data points, conditioned on the information it has seen so far.
In the context of online learning, the model receives data points one at a time and updates its predictions based on the cumulative knowledge gained from previous data points. The joint marginal cross-entropy captures how well the model incorporates this sequential information to make accurate predictions for future data points.
Similarly, for in-context learning of LLMs, the model is provided with a prompt or context consisting of a sequence of data points, and it is expected to generate accurate completions or predictions based on this context. The joint marginal cross-entropy measures the model’s ability to effectively utilize the provided context to make accurate predictions for the next data point in the sequence.
However, we would not want to use the unconditional joint marginal cross-entropy, but rather condition on some initial data to be closer to the actual use case of the model, which will have been (pre-)trained already. As such, we are interested in estimating a conditional joint marginal cross-entropy:
\[\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \XNkset}.\]By conditioning on the previously seen data points, this metric assesses the model’s capacity to learn and adapt its predictions based on the evolving context. It provides a more fine-grained evaluation of the model’s sequential prediction performance, taking into account the specific order and dependencies within the data.
Moreover, the conditional joint marginal cross-entropy can be used to compare different models or hyperparameter settings in terms of their online learning or in-context learning capabilities. By evaluating this metric on held-out data sequences, we can determine which model or setting is better suited for tasks that require sequential adaptation and context-dependent predictions.
If we have a preferred order of the data points (or a split in the case of exchangeability), we can also consider the conditional joint marginal information:
\[\Hof{\xNsetk \given \xNkset, \h}.\]Up to sign, it is also known as the conditional joint marginal log-likelihood.
All these quantities are equally valid from the perspective of Occam’s razor.
We have not yet discussed how to efficiently estimate these quantities, especially for deep learning models. More importantly, we have already considered that the joint marginal information (marginal likelihood), BMA, and the joint marginal cross-entropy (as an expectation over the marginal likelihood) are not easy to estimate.
This brings us to one of the main points:
This is a crucial point that has not been sufficiently considered in the literature on model selection and hyperparameter learning previously, where the model evidence and marginal likelihood have been presented as the ultimate criteria. In practice, we rarely update a model on additional data during inference—this is changing with the advent of LLMs and strong in-context learners, but it is still not the norm.
But why has the marginal likelihood been the preferred choice for model selection so far then?
To explore when the conditional marginal cross-entropy and joint marginal cross-entropy lead to different outcomes for model selection and hypothesis testing, let’s consider a few key scenarios.
For the discrete case, we can reduce the question to one about ranking: if we have two possible hyperparameter choices \(\h_1\) and \(\h_2\), when do we get the same ranking \(\h_1 \succ \h_2\) for both metrics?
First, let’s examine the case when we have a large amount of data available. Here, model misspecification, a common concern, plays a crucial role.
As renowned statistician George Box famously stated:
All models are wrong, but some are useful.
When working with real-world data, we must always assume that our models are misspecified to some degree. Models simplify complex systems and cannot capture every nuance of the data-generating process. Consequently, the goal of model selection is not to find the “true” model but rather to identify the most useful model that balances simplicity, interpretability, and predictive performance.
Without model misspecification, in the infinite data limit we would always converge to the maximum likelihood estimate (MLE) that matches the data-generating model: the Bernstein-von Mises theorem tells us that posteriors concentrate around the MLE in the limit. In practice, however, we are always dealing with misspecified models, and the MLE will not converge to the true data-generating model.
Let’s return to our question of when the different quantities lead to similar rankings.
While a conditional joint marginal cross-entropy, as a sum of conditional marginal cross-entropies, is obviously larger than each individual term, dividing the joint marginal cross-entropy by the number of samples in the joint distribution yields its rate.
The Bernstein-von Mises theorem tells us that the posterior distribution of the model parameters converges to a normal distribution around the MLE as the number of data points goes to infinity.
Overall, we have (without formal proof):
\[\begin{aligned} &\lim_{N \to \infty} \frac{1}{N} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNset} = \\ &\quad = \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^N \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X_n \given \X_{n-1}, ..., \X_1} \\ &\quad = \lim_{N \to \infty} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X' \given \XNset}. \end{aligned}\]Given sufficient data (in the infinite sample limit), any of these quantities will lead to the same ranking of different hyperparameters/model hypotheses. Conversely, we can expect to see meaningful differences only in low-data regimes, where the model is not yet fully adapted to the data.
Finally, in the infinite data limit, for the conditional marginal cross-entropy, we don’t need to take an expectation over the data we condition on (as the model parameters will still have converged):
\[\begin{aligned} &\lim_{N \to \infty} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \XNkset} \\ &\quad = \lim_{N \to \infty} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk \given \xNset}, \end{aligned}\]for any \(\xNset \sim \pdata{\xNset}\) as \(N \to \infty\). More importantly, this also holds for the joint marginal information, whose rate in the limit is the same as the rate of the joint marginal cross-entropy above (and thus also joint cross-entropy):
\[\begin{aligned} &\lim_{N \to \infty} \frac{1}{N} \Hof{\xNset \given \h} = \\ &\quad = \lim_{N \to \infty} \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\X' \given \XNset}. \end{aligned}\]We have previously mentioned the connection between cross-validation, leave-one-out validation, and the conditional marginal cross-entropy. This result also connects the marginal likelihood in the limit to these quantities.
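We can check this convergence numerically with the same toy Beta-Bernoulli model (a sketch under assumed values: the data-generating distribution is a Bernoulli with a made-up parameter). The joint marginal information rate and the conditional marginal cross-entropy approach the same value as \(N\) grows:

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)
a, b = 1.0, 1.0
theta_true = 0.7  # hypothetical data-generating Bernoulli parameter

def joint_information(xs):
    """-log p(x_1:N | h) for the Beta(a, b)-Bernoulli model."""
    heads, tails = sum(xs), len(xs) - sum(xs)
    return -(betaln(a + heads, b + tails) - betaln(a, b))

for N in [10, 100, 10_000]:
    xs = (rng.random(N) < theta_true).astype(int)
    rate = joint_information(xs) / N  # joint marginal information rate
    # Conditional marginal cross-entropy H[X' | x_1:N, h] under the true distribution:
    p_one = (a + xs.sum()) / (a + b + N)  # posterior predictive
    cond_ce = -(theta_true * np.log(p_one) + (1 - theta_true) * np.log(1 - p_one))
    print(N, rate, cond_ce)  # both approach the entropy of Bernoulli(theta_true)
```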
Thus:
The catch is that “sufficient data” might be a very large amount of data, especially for highly expressive models like neural networks.
Hence, we only expect these quantities to be meaningfully different in the low-data regime. So let’s focus on the low-data regime now.
Even if different hyperparameter choices lead to the same generalization loss in the infinite data limit, they can induce different priors that affect the convergence speed and model performance in the low-data regime.
In the low-data regime, assuming all models converge to the same validation loss given infinite data, we prefer the model that converges the fastest, i.e., with the least amount of training data. A model with a prior well-aligned with the data distribution learns efficiently and generalizes better with limited data.
In this scenario, the area under the conditional marginal cross-entropy or information curve (equivalent to the joint marginal cross-entropy, or joint marginal information) indicates the preferred model. The model with the lowest joint marginal information (highest log marginal likelihood) fits the available data best while having a prior enabling efficient learning and generalization.
Finally, what happens when there are both model misspecification and a prior-data conflict in the low-data regime? If both are correlated, the ranking will be preserved, but if they are anti-correlated, the ranking might change.
Let’s visualize this: the curves will intersect at some point, and the model with the best achievable loss in the infinite data limit might not be the best choice in the low-data regime, depending on how much data we can train on. The optimal model choice may also change based on the amount of available data.
Here, the joint marginal cross-entropy and the joint marginal information (log marginal likelihood) might not lead to the same decision because the area under the curve at the start might be larger than what the best model can save later. This could change the ranking of the models compared to the conditional marginal cross-entropy (leave-one-out cross-validation) at the end of training, which serves as a proxy for the model’s generalization performance.
Instead, the conditional joint marginal cross-entropy and information can shine here by conditioning “away” the beginning of the curve, thus giving us a better estimate of the conditional marginal cross-entropy (or expected information) at the point of interest.
To formalize this, we can use the chain rule to split the joint marginal cross-entropy into two terms:
\[\begin{aligned} &\underbrace{\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNset}}_{\text{Joint Marginal Cross-Entropy}} = \\ &\quad = \iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk} \\ &\quad \quad + \underbrace{\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNset \given \XNsetk}}_{\text{Conditional Joint Marginal Cross-Entropy}}, \end{aligned}\]Note that the per-sample averages of both terms converge to the same value in the infinite data limit—the conditional marginal cross-entropy (cross-validation loss), as discussed previously. However, the second term will converge faster because it does not include the constant \(\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk}\).
We can also see both terms as approximating the conditional marginal cross-entropy (cross-validation loss) for a fixed \(N\) in the low-data regime. The per-sample average of the second term will provide a better approximation.
In summary, the consistency of the ranking will depend on the size of \(\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNsetk}\) for different \(\h\) and how it compares to the conditional joint marginal cross-entropy \(\iCrossEntropy{\oppdata}{\pof{\cdot \given \h}}{\XNset \given \XNsetk}\).
This analysis highlights the importance of considering both prior-data conflict and model misspecification when selecting models in the low-data regime. The choice of performance metric and the amount of available data can significantly impact the ranking of models. The conditional joint marginal cross-entropy provides a more accurate estimate of the model’s generalization performance by conditioning away the initial part of the learning curve, which may be heavily influenced by prior-data conflict.
You may be wondering: why bother with the marginal likelihood or conditional joint marginal cross-entropy at all? Why not just always use leave-one-out cross-validation (i.e., the conditional marginal cross-entropy) or a simple validation loss?
While that is a valid approach, the key question is: can we approximate the validation loss earlier in training, without fully training the model? Or can we do this more efficiently than performing inference on each element of a validation set?
One option is to extrapolate the training loss to predict the validation loss. While potentially underexplored in this context, scaling laws have been found effective for predicting model performance.
Alternatively, when training a model on a dataset for a single epoch—which is still surprisingly common for large language models, especially without active data sampling—the average training loss per batch provides a good approximation of the validation loss. With a cross-entropy loss, this is equivalent to estimating the conditional marginal cross-entropy.
However, the batch size may not be large enough for a precise estimate. Averaging over the last few batches or using an exponential moving average can help, as the training losses on earlier batches were computed with older model parameters. Compared to using only the last batch’s loss, this smooths the estimate and reduces sensitivity to outliers.
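As a small sketch of this idea (the decay factor alpha is an assumed hyperparameter, not something prescribed above), an exponential moving average over the per-batch training losses could look like this:

```python
def smoothed_training_loss(batch_losses, alpha=0.9):
    """Exponential moving average over per-batch training losses.

    In a single-epoch setting, each batch consists of data the model has not
    seen yet, so this smoothed training loss serves as a cheap running estimate
    of the validation loss (the conditional marginal cross-entropy).
    """
    estimate = None
    for loss in batch_losses:
        estimate = loss if estimate is None else alpha * estimate + (1 - alpha) * loss
    return estimate
```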
In the multi-epoch setting, revisiting data points multiple times prevents using the training loss as a validation loss estimate. Here, cross-validation offers a solution: train on the held-out data in the last epoch, compute the validation loss via the training losses, and obtain an ensemble of fully trained models without wasting data.
In summary, while the validation loss is the gold standard, approximations based on the training loss or cross-validation can provide efficient estimates, especially in the early stages of training or with limited data.
In this post, we have explored various metrics for model selection and hyperparameter learning in the Bayesian context, focusing on the marginal likelihood, joint marginal cross-entropy, and conditional marginal cross-entropy. Our discussion has led to several key insights:
Infinite Data Limit: As the dataset size approaches infinity, the rate of the log marginal likelihood (or equivalently, the joint marginal information), the joint marginal cross-entropy, and the conditional marginal cross-entropy converge to the same value when averaged over the data distribution. Given sufficient data, all these metrics will produce the same ranking of different model hypotheses or hyperparameter choices.
Connection to Cross-Validation: The conditional marginal cross-entropy is equivalent to the expected cross-validation loss. Cross-validation is the gold standard for model selection in machine learning practice, where a model’s generalization performance is estimated by evaluating it on held-out validation data after training on the remaining data.
Sufficient Data Requirement: The amount of data needed for the convergence of these metrics in the infinite data limit may be impractically large, especially for highly expressive models like deep neural networks. Therefore, the convergence property may not be directly relevant in many real-world scenarios.
Low-Data Regimes: When data is limited, the metrics can differ significantly. The conditional marginal cross-entropy (or cross-validation loss) is often the more reliable choice for model selection targeting generalization performance, as it directly measures the model’s ability to predict unseen data after being trained on the available data.
Sequential Prediction and Compression: The joint marginal cross-entropy, which corresponds to the negative log marginal likelihood, may be preferable if the focus is on a model’s overall sequential prediction performance or compression ability on the training data itself. It measures how well the model fits the entire training dataset jointly, without splitting into train and validation sets.
Moreover, the conditional joint marginal information and cross-entropy are particularly relevant for measuring the performance of online learners and the in-context learning abilities of large language models (LLMs). These metrics capture the model’s ability to adapt and make accurate predictions based on the sequential information and evolving context after training on available data.
Model Misspecification and Prior-Data Conflict: In practice, models often face a combination of model misspecification (where the true data-generating process is not contained within the model class) and prior-data conflict (where the prior distribution does not align well with the data distribution). The interplay between these factors can lead to different rankings of models depending on the amount of available data and the specific metric used for evaluation.
While the marginal likelihood has been a popular tool for model selection and hyperparameter learning in the Bayesian community, its suitability depends on the specific context and goals. The conditional marginal cross-entropy, closely related to cross-validation, is often a more reliable choice when the primary objective is to optimize generalization performance. However, the conditional joint marginal cross-entropy (or conditional log marginal likelihood) may be preferable when the focus is on sequential prediction after training or measuring in-context learning abilities.
Now, after having thought about all this in detail and mostly from first principles, let’s discuss the literature and how it supports or augments these considerations.
Having discussed the key concepts, we will now look at several influential papers that have shaped the previous discussion on model selection and hyperparameter tuning in the Bayesian context or have provided valuable insights into the marginal likelihood and its connections to other metrics.
Fong and Holmes (2020)
The authors define the leave-p-out cross-validation score as:
\[S_{CV}(\xNset;p) = \frac{1}{\binom{N}{p}} \sum_{V \in \binom{[N]}{p}} \frac{1}{p} \sum_{i=1}^p \Hof{\x^{V}_i \given \{\x^{\bar{V}}_k\}_{k=1}^{N-p}}\]where \(\binom{[N]}{p}\) denotes the set of all \(p\)-length subsets of \(\{1,...,N\}\)—the indices of the validation set—\(\x^V_i\) is the \(i\)-th validation data point, and \(\x^{\bar{V}}_k\) is the \(k\)-th training data point. This score measures the model’s performance using \(p\) validation points given the remaining data for training, equivalent to the respective conditional marginal cross-entropy.
The cumulative leave-P-out cross-validation score is defined as:
\[S_{CCV}(\xNset; P) = \sum_{p=1}^P S_{CV}(\xNset; p)\]This score focuses on the last \(P\) stages of the learning curve equally and is the same as the conditional joint marginal cross-entropy. For \(P=N\), the cumulative leave-N-out cross-validation score equals the joint marginal information:
\[S_{CCV}(\xNset; N) = \Hof{\xNset}\]Comparing \(P<N\) to \(P=N\), Fong and Holmes highlight the potential sensitivity of the marginal likelihood to the choice of prior. They argue for using cumulative cross-validation following a preparatory training phase with \(P<N\) (e.g., \(10\%\) or \(50\%\)), demonstrating benefits over the full marginal likelihood for model selection, especially with vague priors or model misspecification.
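As a sanity check of these definitions, here is a small sketch (again for a toy Beta-Bernoulli model of our own choosing, with arbitrary data and prior parameters) that computes \(S_{CV}\) and \(S_{CCV}\) by brute force and verifies that \(S_{CCV}(\xNset; N)\) equals the joint marginal information:

```python
from itertools import combinations
import numpy as np
from scipy.special import betaln

def info_given(x, train, a=1.0, b=1.0):
    """-log p(x | train, h) under a Beta(a, b)-Bernoulli model (posterior predictive)."""
    heads = sum(train)
    p_one = (a + heads) / (a + b + len(train))
    return -np.log(p_one if x == 1 else 1.0 - p_one)

def s_cv(xs, p, a=1.0, b=1.0):
    """Leave-p-out score S_CV(x_1:N; p): average held-out information over all
    validation subsets of size p, training on the remaining N - p points."""
    idx = range(len(xs))
    scores = []
    for val in combinations(idx, p):
        train = [xs[i] for i in idx if i not in val]
        scores.append(np.mean([info_given(xs[i], train, a, b) for i in val]))
    return np.mean(scores)

def s_ccv(xs, P, a=1.0, b=1.0):
    """Cumulative leave-P-out score: sum of S_CV over p = 1..P."""
    return sum(s_cv(xs, p, a, b) for p in range(1, P + 1))

xs = [1, 0, 1, 1, 0, 1]
# For P = N and an exchangeable model, S_CCV equals the joint marginal information.
heads, tails = sum(xs), len(xs) - sum(xs)
joint_info = -(betaln(1.0 + heads, 1.0 + tails) - betaln(1.0, 1.0))
assert np.isclose(s_ccv(xs, len(xs)), joint_info)
```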
The paper also discusses the coherence of the log posterior predictive probability as a scoring rule in cross-validation and explores connections to prequential analysis and intrinsic Bayes factors.
Fong and Holmes (2020) strongly support the ideas in this blog post, particularly the connections between marginal likelihood, cross-validation, and focusing on later learning curve stages for model selection. They establish the equivalence between the cumulative leave-p-out cross-validation score and conditional joint marginal information, aligning with our discussion of the conditional joint marginal cross-entropy as a more reliable metric compared to the full marginal likelihood.
In “A Bayesian Perspective on Training Speed and Model Selection”, Lyle et al. (2020) connect the log marginal likelihood (LML) to training speed and estimate a lower bound on it from the accumulated training losses, \(-\sum_{n=1}^N \Hof{\x_n \given \w_n}\),
where \(\Hof{\x_n \given \w_n}\) is the cross-entropy loss at training step \(n\) with model parameters \(\w_n\). Thus, an MLE estimate is used instead of conditioning on the data points \(\x_{<n}\) and using the BMA.
The authors provide an iterative algorithm for linear models to estimate a lower bound on the LML over multiple epochs of training. This allows capturing the model’s performance as it sees more data points over the course of training, rather than being limited to a single epoch. They also discuss extending their estimator to the infinite-width limit of neural networks.
Building upon Lyle et al. (2020), Ru et al. (2021) propose training speed estimators (TSE) for model selection, including a variant that only uses the training losses from the final epochs (TSE-E) and one that uses an exponentially weighted moving average over epochs (TSE-EMA),
where \(\alpha \in (0, 1)\) is a hyperparameter controlling the decay rate.
The authors hypothesize that assigning higher weights to later epochs may lead to better correlation with the true generalization performance of the final trained network, as the early epochs may be unstable and less informative.
They demonstrate empirically that TSE-E and TSE-EMA can reliably estimate the generalization performance of neural architectures with a small training budget and remain effective for a large range of training epochs. TSE outperforms other efficient estimators, such as early stopping and learning curve extrapolation, in terms of rank correlation with the true test performance.
The TSE estimators proposed by Ru et al. (2021) align closely with the ideas discussed in this blog post, as they prioritize the model’s performance in the later stages of learning. The empirical results presented by Ru et al. (2021) and Lyle et al. (2020) provide supporting evidence for the importance of going beyond the marginal likelihood.
Lotfi et al. (2022/2023)
To address these limitations, Lotfi et al. propose the conditional marginal likelihood (CLML) as a partial remedy. The CLML is computed by conditioning on a subset of the training data, which helps to mitigate the influence of the prior and focus on the model’s performance under this posterior. It is also less sensitive to the number of parameters in the model. The authors demonstrate that the CLML is better correlated with generalization than the marginal likelihood and provides promising performance for deep kernel hyperparameter learning and neural architecture search.
The CLML shares significant similarities with the cumulative leave-p-out cross-validation score proposed by Fong and Holmes (2020)
Lotfi et al. conduct an extensive empirical evaluation of the CLML across various settings, comparing it to the marginal likelihood and other baselines under different conditions, such as varying dataset sizes, model complexities, and hyperparameter settings. They demonstrate that the CLML consistently outperforms the marginal likelihood in terms of selecting the hyperparameters that lead to better generalization performance. The authors also acknowledge some limitations of their work, such as the need for further theoretical analysis of the CLML’s properties and the potential challenges in estimating the CLML for more complex models.
The key novelty of Lotfi et al.’s work lies in their comprehensive analysis of the limitations of the marginal likelihood for model selection and hyperparameter learning, as well as their proposal of the CLML as a practical alternative that addresses these limitations.
To illustrate the concepts discussed in this post, we conduct a simple toy experiment using a Bayesian linear regression model. The goal is to demonstrate how the various information metrics behave under different prior settings and dataset sizes, and to show that none of the metrics are universally reliable. In particular, the joint marginal information may not be the best choice when the primary concern is static performance after training on data.
We generate a synthetic dataset with 64 features and 500 training and validation samples each. The true coefficients are drawn from a normal distribution with a mean of 2, and the target is the dot product between the features and the true coefficients.
For the model, we use a Bayesian linear regression with an isotropic Gaussian prior on the weights (hyperparameter \(\wstddev\)) and independent Gaussian noise (hyperparameter \(\noisestddev\)). The model is misspecified when \(\noisestddev > 0\). We consider three different prior settings:
Thus, all three models are misspecified to varying degrees and exhibit different levels of prior-data conflict.
We train the model on subsets of the training data of varying sizes, ranging from 1 to the full training set size, performing 5 trials with different splits. For each subset size, we compute the following metrics:
The JMI is equivalent to the negative log marginal likelihood, the CJMI to the negative conditional log marginal likelihood, and the MCE corresponds to the cross-entropy loss. The Training Speed approximates an iterative algorithm by following the full data gradient. The JMI Rate is the JMI divided by the dataset size, which converges to the MCE in the infinite data limit.
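The full experiment code is linked below; as a minimal sketch (function and hyperparameter names are ours, not the experiment code itself), the JMI and CJMI of this Bayesian linear regression model can be computed in closed form, since the marginal over the targets is Gaussian:

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_marginal_information(X, y, w_stddev=1.0, noise_stddev=0.5):
    """-log p(y | X, h) for Bayesian linear regression with prior w ~ N(0, w_stddev^2 I)
    and Gaussian noise with standard deviation noise_stddev: y is jointly Gaussian
    with covariance w_stddev^2 X X^T + noise_stddev^2 I."""
    cov = w_stddev**2 * (X @ X.T) + noise_stddev**2 * np.eye(len(y))
    return -multivariate_normal(mean=np.zeros(len(y)), cov=cov).logpdf(y)

def conditional_joint_marginal_information(X, y, num_cond, **h):
    """H[x_>m | x_<=m, h] = H[x_1:N | h] - H[x_1:m | h], via the chain rule."""
    return (joint_marginal_information(X, y, **h)
            - joint_marginal_information(X[:num_cond], y[:num_cond], **h))

# Tiny synthetic example (shapes and values are illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ rng.normal(loc=2.0, size=4) + 0.5 * rng.normal(size=50)
print(joint_marginal_information(X, y))
print(conditional_joint_marginal_information(X, y, num_cond=25))
```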
The results of the experiment are summarized in the following plots:
The plots show the behavior of the information metrics as the dataset size increases for the three different prior settings. Some key observations:
To further analyze the model selection behavior, we computed the CJMI for different conditioning set sizes and selected the model with the lowest CJMI for each combination of dataset size and conditioning set size. The results are visualized in the following plot:
The plot shows which model is selected based on the lowest CJMI for different dataset sizes (x-axis) and conditioning set sizes (y-axis). The white line represents the case where half the data is used for conditioning (CJMI half in the previous plot). We observe that the model selection decision changes depending on the amount of available data and the size of the conditioning set/held-back data.
Now that we have introduced the necessary concepts and discussed the literature, let’s take a closer look at the paper by Lotfi et al. (2022/2023)
Lotfi et al. (2022/2023) present both the case for the log marginal likelihood (LML) as well as potential pitfalls when using it. They highlight the following use cases for the LML—quoted and paraphrased from the paper:
Hypothesis testing: The LML provides an elegant mechanism to select between fixed prior hypotheses, even if each hypothesis is entirely consistent with observations. It automatically favors the most constrained hypothesis that fits the data, encoding a notion of Occam’s razor. The paper gives the example of the LML favoring general relativity over alternative explanations for Mercury’s orbit.
Hyperparameter learning: The LML is often successfully used in practice to learn hyperparameters of the prior, finding the hyperparameters \(\h\) that maximize \(\pof{\mathcal{D} \given \h}\), where \(\mathcal{D}\) is a dataset. The paper highlights Gaussian processes as a compelling example, where the LML chooses kernel hyperparameters that make the distribution over functions likely to generate the training data, rather than simply maximizing data fit. The LML can learn many kernel parameters and be used where cross-validation would be intractable.
Constraint learning: Unlike typical learning objectives like maximum likelihood, the LML is incentivized to select for constraints. It provides a consistent estimator for constraints, automatically selecting the most constrained solution that fits the data and collapsing to the true constraint value as the number of observations grows. Examples include the LML consistently estimating the true dimensionality in Bayesian PCA and automatically learning symmetries like rotation invariance.
However, the paper argues that the LML has several pitfalls for model selection and generalization:
Not aligned with generalization: The LML answers “what is the probability a prior model generated the training data?” rather than “how likely is the posterior to have generated withheld points?”. A prior that initially explains the data well can still lead to a posterior that generalizes poorly.
Misaligned in model selection: The LML evaluates priors, while model selection should evaluate posteriors. Maximizing LML is not equivalent to selecting the best generalizing posterior.
Can overfit: The LML can favor “simple” priors concentrated around overfit maximum likelihood solutions that generalize poorly.
Underfitting bias in hyperparameter selection: The LML may not favor hyperparameters that make good parameters likely if they also make many poor parameters likely.
Relating these points to the previous discussions:
For hypothesis testing and hyperparameter learning (1. & 2.), the LML favors the simpler hypothesis that converges faster, implying a smaller area under the learning curve. This aligns with the discussion on prior-data conflict for similarly misspecified models.
At the same time, the paper also states about the case of Mercury’s orbit that:
We emphasize here we are comparing fixed prior hypotheses. We are not interested in how parameters of general relativity update based on orbital data, and then deciding whether the updated general relativity is the correct description of orbital trajectories.
This could be misconstrued as computing the marginal cross-entropy for the data under the prior, which is not what the LML is doing: it computes a joint marginal cross-entropy after all. The two questions in (4.) point to the joint and conditional marginal cross-entropies—the areas under the full and partial learning curves, respectively.
However, neither LML nor CLML align with static evaluation, but rather with continued learning (5.).
Points (6.) and (7.) relate to prior-data conflict and model misspecification when they are anti-correlated.
Overall, all quantities can fail in the low-data regime. In the infinite data limit, model (mis-)specification dominates other factors, making the quantities less interesting.
The paper introduces the conditional marginal likelihood (CLML) as a remedy for the pitfalls of the LML, matching the earlier definition of conditional joint marginal information:
\[\Hof{\xset{}{N-P+1}{N} \given \xset{}{1}{N-P}, \h}.\]Unlike the LML which is invariant to data order, the CLML depends on how the data is split into a conditioning set and validation set. To make the CLML permutation-invariant, the paper proposes averaging over different permutations, equivalent to the joint marginal cross-entropy. However, this becomes computationally expensive, so the paper uses a single permutation with \(P=20\% \, N\) to ensure the posterior has sufficiently converged.
Computing the LML via sampling is intractable for deep neural networks. Estimating it from an uninformative prior leads to high-variance estimates, as most \(\w\) sampled from the prior will perform poorly on the data. While Monte Carlo sampling works well in many settings, it fails here because randomly sampling a good \(\w\) from the prior is incredibly unlikely, as illustrated in these tweets:
How powerful is gradient descent exactly? For a small CNN on CIFAR-10 I've looked at the typical loss change due to a random step of the same length as a gradient step starting at the same weights. The gradient step is literally a 185 sigma event => ~impossible~ at random ✅ pic.twitter.com/oOQnkwCCG0
— Stanislav Fort ✨🧠🤖📈✨ (@stanislavfort) May 26, 2022
How good is a gradient?
The top histogram shows the change in loss from 1000 random weight updates with a fixed norm. The bottom compares this histogram to the change in loss from a gradient descent step with the same norm.
It's 280 standard deviations away! pic.twitter.com/iJ2SSESEJ0
— Robert Rosenbaum RobertRosenbaum@neuromatch.social (@RobertRosenba14) April 22, 2022
While sampling from the prior to estimate the LML is intractable, we can fare better when sampling from a posterior for computing a CLML, which is the approach taken by the paper for the CLML. The posterior is more concentrated around “good” \(\w\), and the paper uses a Laplace approximation to approximate it:
However, the LA only captures uncertainty around a single mode, underestimating the uncertainty before the model converges, as beautifully illustrated in the paper:
This is especially relevant for overparameterized DNNs, which have multiple diverse modes (Wilson & Izmailov, 2020).
Furthermore, when computing the CLML, the LA may similarly struggle to find meaningful \(\w\) that perform well on the held-out data when that data would meaningfully change the model, as the CLML decomposes into conditional marginal information terms that condition on these additional data sequentially.
The DNN experiments in Lotfi et al. (2022/2023) compare the CLML to the validation loss for DNNs on CIFAR-10 and CIFAR-100 datasets. The results provide empirical evidence for the challenges of computing the CLML and raise the question of whether these approximations are meaningfully different from a validation loss.
The paper shows that while the CLML is better correlated with the generalization performance of the model than the LML, the validation loss is still better correlated with the generalization performance than the CLML. Interestingly, the initially published DNN experiments in the first arXiv version of the paper did not actually compute the CLML but instead computed the validation loss. This was fixed in the second arXiv revision.
However, given the previous discussions on the similarities between the CLML and cross-validation and difficulty of approximating the CLML meaningfully, this bug was not a major issue for the paper’s conclusions.
Importantly, as we examine in the appendix of this post, when comparing the CLML using Monte Carlo sampling with the validation loss computed using Monte Carlo sampling for the Bayesian Model Average (BMA), the validation loss is still better correlated with the generalization performance than the CLML.
In conclusion, this blog post has challenged the conventional focus on the marginal likelihood and related quantities for Bayesian model selection as a direct consequence of Occam’s razor. It highlights the importance of considering context and goals when choosing a model selection criterion. By motivating MLE and MAP using Occam’s razor and questioning the uniqueness of the (conditional) joint marginal likelihood, we hope to encourage critical thinking about the foundations of these quantities.
However, it is important to acknowledge the limitations of our arguments and experiments. A more rigorous theoretical justification, a broader range of models and datasets, and a deeper engagement with philosophical implications are needed to strengthen the insights. As most of the presented methods ignore model complexity and assume a uniform model prior \(\pof{\h}\), we have not discussed it in the detail necessary, even though from the perspective of minimum description length (MDL), it would be crucial to take into account.
Despite these limitations, our exploration of the connections between information-theoretic concepts and their behavior in different data regimes, along the lines of model misspecification and prior-data conflict, provides a necessary starting point for understanding recently proposed metrics.
The toy experiment demonstrates that all discussed quantities can fail to reliably predict generalization under model misspecification and prior-data conflict, even for a basic setting using Bayesian linear regression. This emphasizes the need for caution when making claims about the superiority of any particular metric.
Ultimately, the key takeaway is that there is no one-size-fits-all solution, and the choice of model selection criterion should be guided by a careful consideration of the specific context and goals at hand.
Acknowledgements: We would like to thank the authors of the examined papers for their valuable contributions to the field and for inspiring this blog post. Claude-3 and GPT-4 were used to edit and improve this blog post (via cursor.sh).
Reproducibility: The figures were created using matplotlib and seaborn in Python. The Bayesian linear regression model was implemented using numpy. The code for the toy experiment is available in this Google colab, and the code for the visualizations is available in this Google colab.
The logcml_ files in the repository contain the code to compute the CLML for partially trained models. However, instead of computing \(\frac{1}{|\mathcal{D}_{\ge m}|}\,\log p(\mathcal D_{\ge m} \mid \mathcal D_{< m}, \mathcal M)\), the code computes:
\[\begin{aligned} &\frac{1}{|\mathcal{D}_{\ge m}|}\,\sum_{j=m}^n \log p(\mathcal D_{j} \mid \mathcal D_{< m}, \mathcal{M} ) \approx \\ &\quad =\frac{1}{|\mathcal{D}_{\ge m}|}\,\sum_{j=m}^n \log \sum_{k=1}^K \frac{1}{K}\, p(y_j \mid x_j, w_k, \mathcal M ), \end{aligned}\]which is the validation cross-entropy loss of the BMA (of the model trained with 80% of the training data).
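For contrast, a Monte Carlo estimate of the CLML itself would sum the per-point log-likelihoods within each posterior sample before averaging across samples. A hedged sketch with our own names (not the paper's code):

```python
import numpy as np
from scipy.special import logsumexp

def estimate_clml(log_liks):
    """Monte Carlo estimate of the conditional log marginal likelihood.

    log_liks[k, j] = log p(y_j | x_j, w_k, M) for posterior samples w_k drawn
    from p(w | D_<m, M) and held-out points j. Then

        CLML ~= log (1/K) sum_k exp( sum_j log p(y_j | x_j, w_k, M) ),

    which couples the held-out points within each posterior sample, unlike the
    per-point BMA average above.
    """
    per_sample_joint = log_liks.sum(axis=1)
    return logsumexp(per_sample_joint) - np.log(log_liks.shape[0])
```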
The high-level code that computes the CLML is:
```python
bma_accuracy, bma_probs, all_ys = get_bma_acc(
    net, la, trainloader_test, bma_nsamples,
    hessian_structure, temp=best_temp
)
cmll = get_cmll(bma_probs, all_ys, eps=1e-4)
```
get_bma_acc marginalizes over the LA samples before returning bma_probs:
```python
[...]
for sample_params in params:
    sample_probs = []
    all_ys = []
    with torch.no_grad():
        vector_to_parameters(sample_params, net.parameters())
        net.eval()
        for x, y in loader:
            logits = net(x.cuda()).detach().cpu()
            probs = torch.nn.functional.softmax(logits, dim=-1)
            sample_probs.append(probs.detach().cpu().numpy())
            all_ys.append(y.detach().cpu().numpy())
    sample_probs = np.concatenate(sample_probs, axis=0)
    all_ys = np.concatenate(all_ys, axis=0)
    all_probs.append(sample_probs)

all_probs = np.stack(all_probs)
bma_probs = np.mean(all_probs, 0)
bma_accuracy = (np.argmax(bma_probs, axis=-1) == all_ys).mean() * 100

return bma_accuracy, bma_probs, all_ys
```
The important line is bma_probs = np.mean(all_probs, 0), which marginalizes over the predictions and returns the BMA prediction for each sample.
Finally, get_cmll computes the validation loss for each sample independently (after applying a bit of label smoothing):
```python
def get_cmll(bma_probs, all_ys, eps=1e-4):
    log_lik = 0
    eps = 1e-4
    for i, label in enumerate(all_ys):
        probs_i = bma_probs[i]
        probs_i += eps
        probs_i[np.argmax(probs_i)] -= eps * len(probs_i)
        log_lik += np.log(probs_i[label]).item()
    cmll = log_lik/len(all_ys)

    return cmll
```
The DNN experiments in Section 5 and Section 6 of the first arXiv revision of the paper (v1) thus did not estimate the CLML per se but computed the BMA validation loss of a partially trained model (80%) and found that this correlates positively with the test accuracy and test log-likelihood of the fully trained model (at 100%). This is not surprising because it is well-known that the validation loss of a model trained on 80% of the data correlates positively with the test accuracy (and generalization loss).
The following response sadly seems to target the first draft mainly. However, it is also helpful for the final blog post and provides additional context.
Thanks for your interest in our paper and your comments. Here are our comments about the blog as it is currently framed:
(1) Thank you for pointing out a bug in the CLML computation for Figure 5b. We note that this bug is only relevant to a single panel of a single figure in the main text. We have re-run this experiment with the right CLML, and the results, attached here, are qualitatively the same. In summary, it was a very minor part of the paper, and even for that part it did not affect the take-away. We also attach the results of the correlation between the BMA test accuracy and the negative validation loss. You suggest in your post that the validation loss might correlate better with the BMA test accuracy than the CLML given that we use 20 samples for NAS. Our empirical results show the opposite conclusion. Additionally, we are not suggesting the CLML as a replacement to cross-validation but rather as a minor way to modify the LML for improvements in predicting generalization. Finally, we attach results for different sample sizes (20 samples vs. 100 samples) to address your comments on the sample size used to estimate the CLML. As we can see in the figure, the Spearman correlation factor is quite similar. 20 samples appears to provide a reasonable estimate of the CLML for these purposes, and is different from validation loss.
(2) Your post currently opens by suggesting that there is something wrong with our experiments, likely either an LML approximation or a CLML issue, because we note that the LML correlates more poorly with generalization for larger datasets (where “large” is relative in the context of a specific experiment). A few points here: (i) this result is actually completely expected. The LML is in fact non-monotonic in how well it predicts generalization. For small datasets, the prior should be reasonably predictive of generalization. For intermediate datasets, the first terms in the LML decomposition have a negative effect on the correlation with generalization. For asymptotically large datasets, the first terms have a diminishing effect, and we get a consistent estimator; (ii) almost all of our experiments are exact, and we see this behaviour in the exact experiments for the Fourier model. For example, for the Fourier feature experiment in Fig 4(d), LML picks the better generalizing model for n < 50 and n > 296. For n in [50, 296] it picks the wrong model. For large neural network models, it is reasonable that the exact LML could pick the wrong model for CIFAR-sized datasets. (iii) any potential issues with the CLML are not relevant to these considerations, which are about the behaviour of the LML.
(3) Your post currently suggests that issues with approximate inference could be responsible for our take-aways, rather than issues with the LML in general. But as we note in (2), almost all of our experiments use the exact LML and CLML: the density model, Fourier features, Gaussian processes, and deep learning exps on DKL, and there was never any bug associated with CLML computation in these experiments. The takeaways for the Laplace experiments are consistent with the exact experiments, and also expected, as above. While it’s true that the CLML can be estimated more effectively than the LML for the Laplace experiments, this is actually an advantage of the CLML that we note in the paper. The LML results also stand on their own, as we discuss above.
(4) Your post places a lot of importance on Figure 5, as if it is the main result of the paper and our main “DNN” experiments. We stand by the results of Figure 5, but it is a relatively minor component of the paper. As we’ve mentioned most of our results are exact, including our DKL experiments, which are certainly the most substantial DNN experiments, with practically exciting results for transfer and few-shot learning. The DKL experiments are actually where we expect the CLML to be practically useful, and currently they seem to be overlooked in the post.
(5) The blog seems to question the learning curve experiments, but these experiments in Figure 4 are exact, with no Laplace approximation, and relatively straightforward.
(6) Your post seems to be negative about the CLML, presenting its similarity with cross-validation as a potential drawback, and implying the skepticism about the CLML should affect the interpretation of our take-aways. Two points here: (i) as above, the CLML is independent of most of our take-aways, which are about the properties of the LML; (ii) our goal with the CLML was not to introduce something starkly different from cross-validation, but to show how a very minor modification to the LML could improve alignment with generalization. Moreover, the DKL CLML results are quite promising as an efficient way to do gradient based estimation of a large number of hyperparameters.
(7) The blog opens as if it is leading up to some fatal flaw. But as above, (i) the LML considerations are independent of the CLML, (ii) most of the experiments are exact, (iii) the trends for the exact and approximate inference procedures are the same and are naturally understandable and explainable, such as the non-monotonic trend in how well the LML correlates with generalization, and (iv) the CLML bug only affected Figure 5, panel b, and when it’s corrected the qualitative take-away is the same as before.
We appreciate your interest and effort in reading the paper, and we think your questions will improve the clarity of the paper, which we have updated with an acknowledgement to you. Given the above considerations, we do think there would need to be substantial revisions to the blog post to accurately and fairly reflect the paper. We would appreciate being able to see the revisions before it’s posted.
Best wishes,
Sanae, Pavel, Greg, Micah, Andrew
Let us examine the new results:
In the three panels below, two panels show test accuracy vs. validation loss; one shows test accuracy vs. CLML. The left-most panel is the BMA test accuracy vs. (negative) BMA validation loss, the middle panel is vs. the CLML, and the right-most panel is vs. the (negative) non-BMA validation loss.
Note that the left-most panel is from v1, which was accidentally computing the BMA validation loss, and whose axis label is adapted here from v1 for clarity. The two other plots are from v2 after fixing the bug. See commits here for fixing the CLML estimation and here for computing the non-BMA validation loss.
At first glance, there might be an observer effect in the experiments for the validation loss. The BMA validation loss in v1 performs better than the CLML in v2, while the non-BMA validation loss in v2 underperforms the CLML in v2. When asked about it, the authors pushed the respective code (see link above) and explained that the updated, right-most panel computes the non-BMA validation loss, i.e., without LA samples. It seems surprising that there is such a difference between the non-BMA validation loss and BMA validation loss: the non-BMA validation loss is more than one nat worse on average than the BMA validation loss, based on visual inspection. Note that the plots here and in the paper compute the average CLML and average validation loss and are thus directly comparable.
The authors said in their response that:
You suggest in your post that the validation loss might correlate better with the BMA test accuracy than the CLML given that we use 20 samples for NAS. Our empirical results show the opposite conclusion.
This is only partially true. The BMA validation loss (which was accidentally computed in v1 instead of the CLML) correlates very well with the BMA test accuracy. This is not surprising given that this is the frequentist purpose of using validation sets. If validation sets were not correlating well with the test accuracy, we would not be using them in practice. 🤗 As such, this raises the question why the non-BMA validation loss correlates negatively with the BMA test accuracy for ResNets and overall in the v2 results. Thus, only the non-BMA validation loss supports the now opposite conclusion in v2 of the paper and in the authors’ response.
Yet what is also surprising is how well the BMA validation loss does vs. the CLML:
Secondly, when we compare the reported values between BMA validation loss and CLML, we notice that the CLML is lower than the BMA validation loss by half a nat for \(\lambda=10^2\) and generally for CNNs.
However, even though the new experiments in v2 are supposed to reproduce the ones from v1, and we can assume that the same model checkpoints were used for re-evaluation (as retraining is not necessary), both the CLML and the non-BMA validation loss are off by about half a nat for the CNNs. As such, the above consideration might hold but might not provide the answer here.
Instead, we overlay the non-BMA validation loss and the CLML plots, both from v2, with a “difference blend”: it shows the absolute difference between the colors for overlapping data points (the circles 🔴 and triangles 🔺), leading to black where there is a match, negative (green-ish) color for CLML, and positive (sepia) color for validation losses. The background grids were used to match the plots, but we hid the ones from CLML afterward—as such, the strong overlay is because the values are so close.
Surprisingly (or rather, as predicted when the LA does not really do much), it turns out that the validation loss for the CNNs (🔴) almost exactly matches the estimated CLML with 20 LA samples, based on visual inspection. To be more precise, either the models have already sufficiently converged, or the CLML estimate is not actually capturing the correlations between points and thus ends up being very similar to the validation loss.
This changes the interpretation of the sample ablation in the authors' response. The ablation shows no difference between 20 and 100 LA samples, with 100 LA samples even having a slightly lower rank correlation. So it seems that five times more LA samples are not sufficient to make a difference, or the Laplace posterior cannot capture the true posterior as well as hoped. It would be interesting to examine this further. Kirsch et al. (2022)
All in all, given the above, it is fair to say that the estimate of the CLML is probably not as good as hoped, and further experiments might be needed to tease out when the CLML provides more value than the (BMA) validation loss. Note, however, that this question has not been explicitly examined in the paper. Instead, for DNNs, the paper only compares LML and CLML with distinct estimation methods.
Broadly, algorithmic reasoning
First, let’s remember that for \(x_0\) to be a fixed-point of a function \(f\) it must satisfy \(f(x_0) = x_0\). Secondly, we can observe that many algorithms consist of an update rule that you apply until there is no more change. The final output can easily be seen to be a fixed-point! In a classical computer science algorithm some smart person will have sat down and shown that under some conditions on the input this convergence will happen and the final answer is correct.
An example algorithm would be the Bellman-Ford algorithm to compute the shortest distance to a given node in a graph. Here the update rule looks like \(x_i^{(t+1)} =\min(x_i^{(t)}, \min \{x_j^{(t)} + e_{ij}\}_{j\in N(i)})\), where \(x_i^{(t)}\) is the shortest distance estimate to the source node at time \(t\), \(e_{ij}\) is the distance between nodes \(i\) and \(j\), and \(\{j\}_{j\in N(i)}\) are the neighbours of node \(i\). The algorithm says to apply this rule until there is no more change—a fixed point.
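As a small illustration (the adjacency-list representation and the tiny graph below are our own toy choices), iterating this update until nothing changes yields the fixed point:

```python
import math

def bellman_ford_fixed_point(neighbors, source):
    """Iterate the Bellman-Ford update x_i <- min(x_i, min_j x_j + e_ij)
    until a fixed point is reached.

    `neighbors[i]` is a list of (j, e_ij) pairs; `source` is the source node.
    """
    x = {i: math.inf for i in neighbors}
    x[source] = 0.0
    while True:
        new_x = {
            i: min([x[i]] + [x[j] + e_ij for j, e_ij in neighbors[i]])
            for i in neighbors
        }
        if new_x == x:  # no change: we have reached the fixed point
            return x
        x = new_x

# Example: a small undirected graph given as an adjacency list.
graph = {
    0: [(1, 1.0), (2, 4.0)],
    1: [(0, 1.0), (2, 2.0)],
    2: [(0, 4.0), (1, 2.0)],
}
print(bellman_ford_fixed_point(graph, source=0))  # {0: 0.0, 1: 1.0, 2: 3.0}
```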
Interestingly, denotational semantics
The CLRS paper
The high-level architecture is that of an encoder-processor-decoder. The motivation is that neural networks perform well in high-dimensional spaces but that classical algorithms tend to operate on very low-dimensional variables, e.g. in BellmanFord the shortest distance would be a single scalar. Thus the encoder projects the state into a high-dimensional space \(z_t\) where the main computation is then done by the processor network—typically a Graph Neural Network. The output of the processor \(z_{t+1}\) is then decoded back into the low-dimensional space by the decoder. The encoder and decoders mostly consist of linear layers with the occasional exception, e.g. a softmax for categorical variables. The processor will be a graph neural network, for which several different architectures have been explored, for example in
The processor is supposed to do the main computation of the network; in particular, the hope is that one iteration of the processor is equal to one iteration of the algorithm. In our example of BellmanFord, it would be one iteration of the update rule \(x_i^{(t+1)} =\min(x_i^{(t)}, \min \{x_j^{(t)} + e_{ij}\}_{j\in N(i)})\) (see also the Figure below). Thus, the processor should indicate termination by no longer changing its output \(z\).
Traditionally the training approach has been teacher-forcing. In teacher forcing we train each step of the algorithm independently by feeding the network the ground-truth \(x_t\) and computing the loss against \(y_t\) at all \(t\) simultaneously. This requires us to know the exact number of steps in the algorithm a priori. In other words, training with just teacher forcing will require us to tell the network the number of iterations it should run for at test time (which will vary depending on the input state). This is unrealistic in practice, where we would simply give our neural network the input state and ask it to run the algorithm on its own, which includes knowing when to stop the computation. While a termination network is suggested in
Remember that neural networks are really good at learning in-distribution shortcuts. To more rigorously test whether the neural network has learned the underlying logical algorithm we introduce a shift between the training and test distribution. If the network has learned the classical algorithm, it should be able to overcome this shift. Throughout the CLRS algorithmic reasoning benchmark size generalisation is used, i.e. we train on examples of size 16 (i.e. the graph has 16 nodes) and at test time we will use an input size of 64.
One approach to training neural networks that run until they reach a fixed point is deep equilibrium models
Given our input \(x\), our hidden state \(z\), and our processor \(f\), the goal is to optimise the fixed point \(z^*=f(z^*,x)\) we reach. The question is: how can we backprop through \(z^* = f(z^*,x)\)?
In backprop, we ultimately want to compute
\[\left(\frac{\partial z^*(.)}{\partial(.)}\right)^{\top} g\]for some incoming gradient \(g\) from the layers after (in our case from the decoder) and \((.)\) being anything we want, but usually the weights of the network. We can show by implicit differentation of \(z^* = f(z^*,x)\) that
\[\left(\frac{\partial z^*(.)}{\partial(.)}\right)^{\top} g = \left(\frac{\partial f(z^*, x)}{\partial (.)}\right)^{\top}\left(I-\frac{\partial f(z^*, x)}{\partial z^*}\right)^{-\top}g\]The difficult term to solve in the above equation is \(\left(I-\frac{\partial f(z^*, x)}{\partial z^*}\right)^{-\top}g\), which is the solution of a linear system, namely:
\[\left(I-\frac{\partial f(z^*, x)}{\partial z^*}\right)^{\top}h = g\]In general, we can try to solve it in two ways: use a linear system solver, such as those found in torch.linalg, or compute a fixed point of
\[h = \left(\frac{\partial f(z^*, x)}{\partial z^*}\right)^{\top}h +g\]In the DEQ blogpost
We tried both: solving the linear system with torch.linalg.solve and iterating to the fixed point above. We settled on the fixed-point iteration, as suggested by the deep equilibrium blogpost, because it is computationally faster and the added accuracy of a linear system solver wasn't beneficial. Note that this trade-off is heavily informed by what is readily implemented in PyTorch to run on GPU, so the balance may shift in the future.
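A minimal sketch of what this looks like in PyTorch (assuming a differentiable processor `f(z, x)` and an incoming gradient `g` with respect to the fixed point) is the following fixed-point iteration for \(h = J^{\top}h + g\):

```python
import torch

def deq_backward(f, z_star, x, g, n_iters=50, tol=1e-6):
    # Solve (I - J)^T h = g, with J = df/dz at the fixed point, by iterating h <- J^T h + g.
    z_star = z_star.detach().requires_grad_()
    f_val = f(z_star, x)  # build the graph once at the fixed point
    h = torch.zeros_like(g)
    for _ in range(n_iters):
        # vector-Jacobian product J^T h via autograd
        Jt_h = torch.autograd.grad(f_val, z_star, grad_outputs=h, retain_graph=True)[0]
        h_next = Jt_h + g
        if torch.norm(h_next - h) < tol:
            return h_next
        h = h_next
    return h  # approximately (I - J)^{-T} g, ready to be pushed into the parameters of f
```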
To encourage convergence we change the update function in the MPNN
Currently, gradient flows through the implicit differentiation explained above as well as back in time through standard backprop via \(z_t\). To enable more ways for the gradient to inform early steps in the algorithm, we propagate the gradient through \(y_t\) as well. For discrete \(y_t\), in other words, for categorical variables in the state \(x_t\) we employ the Rao-Blackwell straight-through gumbel softmax estimator
Finally, we also try adding a loss for the number of steps by adding the penalty \(\sum_{t=0}^{T} \|z_{t+1} - z_{t}\|^2\). The penalty will be larger as we take more steps and stay away from the fixed point, thus hopefully encouraging convergence to a fixed point more quickly.
In the table below we show the accuracy
DEQ is our approach of reaching a fixed point together with the implicit differentiation explained above. Hint propagation is simply reaching a fixed point and back propagating through time with no implicit differentiation. Teacher forcing is used for the baselines, where the first number is the simple MPNN architecture
Algorithm | DEQ | Hint propagation | Teacher forcing |
---|---|---|---|
Bellman-Ford* | 96.4% | 96.7% | 92%/97% |
Dijkstra | 78.8% | 84.4% | 92%/96% |
BFS* | 53.8% | 57.1% | 100%/100% |
DFS | 5.0% | 4.7% | 7%/48% |
MST-Kruskal | 82.3% | 82.3% | 71%/90% |
MST-Prim | 75.2% | 50.4% | 71%/90% |
As we can see in the table above, the approach works very well for simpler algorithms such as Bellman-Ford, where with a simple MPNN processor we achieve accuracy equal to or better than the teacher-forced simple MPNN and match the TripletMPNN. Interestingly, this is a parallel algorithm, i.e. all node representations run the same code, in contrast to sequential algorithms which go through the graph node by node. We did try gating to enable the GNN to better mimic a sequential algorithm, but this didn't help.
On the other algorithms, while we are able to learn, we cannot match the performance of teacher forcing, where the number of timesteps to run the neural network is assumed to be known. This additional help makes the comparison slightly unfair; nevertheless, it shows how difficult learning a fixed point is for the network, as we are not able to match the performance. We hypothesise about the reasons behind this in the next section.
There are a few major issues that we notice during training. The first is that the network is prone to underfitting: although we only show the test accuracy in the table above, the training error doesn't actually reach 0. It is unclear what causes this; however, resolving some issues with the DEQ may also resolve it. So let's delve into them.
Firstly, the network will often take a large number of steps to reach a fixed point. We can see on easier algorithms like Bellman-Ford that the number of forward steps during training often reaches our set upper limit of 64 forward steps (the actual algorithm would take on average 4-5 steps, at most 10, for this graph size). This is why we implement our architecture trick, where we update the next hidden representation only if it is smaller than the current one, i.e. \(z^{(t+1)} = \min(z^{(t)}, z^{'(t+1)})\) where \(z^{'(t+1)}\) is the output of our min aggregator in the message passing step (alternatives such as gating and an exponential moving average update function were also tried). This helps with convergence, which enables finding a fixed point in simple cases, but fails to work reliably for more complex architectures and problems, while also introducing a different issue.
Remember that during the implicit differentiation we are trying to solve
\[h = \left(I-\frac{\partial f(z^*, x)}{\partial z^*}\right)^{-\top}g\]i.e. in the linear system \(y = Ax\) our matrix \(A\) is equal to \(I-J\) where \(J\) is the Jacobian in the above equation. If the Jacobian is equal to the identity, then our matrix \(A=0\) and our system has no solution. In practice, \(z^{(t+1)} = \min(z^{(t)}, z^{'(t+1)})\) will reduce to \(f(z) = z\) in many dimensions of \(z\). This leads to many rows of the Jacobian being the identity, because the function effectively becomes \(f(x)=x\) in those dimensions, and therefore to rows of \(A\) that are entirely zero; the system is then ill-defined and has no solution, causing the optimisation to break.
One solution is to try a soft-min, i.e. \(softmin_{\tau}(a,b) = \frac{ae^{-a/\tau}+be^{-b/\tau}}{e^{-a/\tau}+e^{-b/\tau}}\). Here we get the ability to trade off between convergence and the Jacobian being interesting. For \(\tau \ll 1\) we basically recover the min operation and for \(\tau \gg 1\) we simply get an average, i.e. an exponential moving average. In practice, there was no trade-off for which we consistently obtained an interesting Jacobian while also converging sufficiently fast.
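For reference, the soft-min we describe can be written in a few lines (a sketch, applied element-wise over tensors of matching shape):

```python
import torch

def softmin(a, b, tau=0.1):
    # (a * exp(-a/tau) + b * exp(-b/tau)) / (exp(-a/tau) + exp(-b/tau));
    # tau -> 0 recovers min(a, b), large tau approaches the plain average of a and b
    w = torch.softmax(torch.stack([-a / tau, -b / tau]), dim=0)
    return w[0] * a + w[1] * b
```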
Not only has generative modeling been around for decades, but a few promising model families emerged and dominated the field for several years in the recent past. VAEs
In this article, we look back at the conceptual and theoretical ideas that were in development for a long time, even outside the field of core machine learning. We will show in later sections that some of the theoretical ‘pillars’ holding up Diffusion Models have their roots deep in statistical physics and other fields. A significant part of this theory was presented afresh in the ICLR paper
This article notes that, historically, there were two distinct roads of development that merged in order for modern diffusion models to emerge – “scalable estimation of score” and “using the score for generative modelling”. The former is relatively short, while the latter traces its origin back to ~1900, if not earlier. This article explores these two paths independently – the latter one first, while assuming knowledge of the former. The rest of this introductory section is spent on defining the general modelling problem and the very notion of ‘score’ – the primary quantity of interest. The next section deals with how we can use the score in generative modelling, assuming access to an oracle for the true score. The last section dives solely into the problem of estimating the score in a scalable manner. It is worth mentioning that, in this article, we explain only the “sufficient and necessary” concepts needed to build the diffusion model framework, and hence it may not directly resemble the typical formalism seen in most papers.
The problem of generative modeling, in most cases, is posed as parametric density estimation using a finite set of samples \(\{ x^{(n)} \}_{n=1}^N\) from a “true but unknown” data distribution \(q_{data}(x)\). With a suitable model family chosen as \(p_{\theta}(x)\), with unknown parameters \(\theta\), the problem boils down to maximizing the average (log-)likelihood (w.r.t \(\theta\)) of all the samples under the model
\[\theta^* = arg\max_{\theta} \mathbb{E}_{x \sim q_{data}(x)} \left[ \log p_{\theta}(x) \right] \approx arg\max_{\theta} \frac{1}{N} \sum_{n=1}^N \log p_{\theta}(x^{(n)})\]It turned out however, that defining an arbitrary parametric density \(p_{\theta}(x)\) is not as easy as it looks. There was one aspect of \(p_{\theta}\) that is widely considered to be the evil behind this difficulty – the normalizing constant that stems from the axiom of probability
\[p_{\theta}(x) = \frac{\tilde{p}_{\theta}(x)}{\color{purple} \int_x \tilde{p}_{\theta}(x)}\]It was understood quite early on that any promising generative model family must have one property – ease of sampling, i.e. generating new data samples. Sampling was so essential to generative modeling, that the model families that followed were all geared towards effective sampling, even if it was at the expense of other not-so-important properties. It was also well understood that there was one common underlying principle most effective for crafting “sampling-centric” generative models – transforming simple probability densities. This formed the backbone of every single generative model family so far; be it VAEs, GANs or NFs, their generative process is a density transformation of this form
\[x = f_{\theta}(z),\text{ where } z \sim \mathcal{N}(0, I)\]that suggests to start with a simple density (often just standard normal) followed by a functional transformation \(f_{\theta}\), typically a neural network with parameters \(\theta\). For VAEs, the function \(f_{\theta}\) is the decoder; for GANs, it’s the generator network and for NFs, it’s the entire flow model. It is to be noted however, that the way they differ is mostly how they are trained, which may involve more parametric functions (e.g. VAE’s encoder or GAN’s discriminator) and additional machinery. This way of building generative models turned out to be an effective way of sidestepping the notorious normalizing constant.
Diffusion Models, at its core, follow the exact same principle, but with a slightly clever design. For diffusion models, the transformation \(f_{\theta}\) is rather complicated. It is a sequence of invocations of a neural function (denoted as \(s_{\theta}\)) along with some additional computation (denoted as \(g(\cdot)\))
\begin{equation} \label{eq:diffusion_general_parametric_structure} x = g_1(g_2(g_3(\cdots z \cdots, s_{\theta}), s_{\theta}), s_{\theta}), \text{ where } z \sim \mathcal{N}(0, I) \end{equation}
This is a big difference between Diffusion Models and other generative model families. Prior generative families tried to learn the exact transformation directly via one parametric neural function \(f_{\theta}\). Diffusion Models on the other hand, try to learn \(s_{\theta}\), a quantity very fundamental and intrinsic to any true data distribution \(q_{data}(x)\). The quantity in question has historically been called the “Score”.
The term ‘Score’ is simply defined as the gradient of the log-density of a distribution, i.e. \(\nabla \log p(\cdot)\). In statistics, it is also known (but not very popular) as the ‘Informant’. One might argue that ‘Score’ is rather a strange name for such a quantity. It so happened that the origin of this term can be traced
\begin{equation} \label{eq:data_score_defn} \nabla_x \log q_{data}(x) \triangleq s(x) \end{equation}
The quantity in Eq.\eqref{eq:data_score_defn} is unknown, just like the true data density \(q_{data}(x)\). It does have a meaning though: the “true score” refers to the direction of steepest increase in log-likelihood at any given point in the data space. See the gray arrows in the figure below.
Simply put, at a point \(x\), it tells us the best direction to step in (with a small step size \(\delta\)) if we would like to see a point \(x'\) with slightly higher likelihood
\begin{equation} \label{eq:naive_score_steps} x' = x + \delta \cdot \left. \nabla_x \log q_{data}(x) \right|_{x = x} \end{equation}
Please note that this stems just from the definition of the gradient operator \(\nabla\) in score. If you are familiar with gradient descent, you may find conceptual resemblance.
Now, there are two burning questions here:
The following two sections answer these questions respectively. Luckily, as we now understand that these two questions are somewhat decoupled, that they can be studied independently. The first section analyzes the first question, assuming we have access to the true score \(\nabla_x \log q_{data}(x)\). The second section explores how to get the true score, or rather, an approximation of it.
As explained before, we would like to sample from the true data distribution \(q_{data}(x)\) but all we have access to (we assume) is its score \(s(x)\) as defined in Eq.\eqref{eq:data_score_defn}. One may define a naive generative process as the iterative application of Eq.\eqref{eq:naive_score_steps}. Intuitively, it is very similar to gradient descent, where we greedily climb the log-density surface to attain a local maxima. If so, we can already see a possible instance of the general structure of Diffusion’s generative process as hinted in Eq.\eqref{eq:diffusion_general_parametric_structure}, with \(g(\cdot)\) being
\[g(z, s(\cdot)) = z + \delta \cdot s(z) = z + \delta \cdot \left. \nabla_x \log q_{data}(x) \right|_{x=z}\]With a little reshuffling of Eq.\eqref{eq:naive_score_steps} and considering \(\delta \rightarrow 0\), one can immediately reveal the underlying ODE
\begin{equation} \label{eq:ode_with_score} dx = \nabla_x \log q_{data}(x) dt \end{equation}
BUT, please note that this is only an intuitive attempt and is entirely based on the definition of score. It possesses absolutely no guarantee that this process can converge to samples from the true data distribution. In fact, this process is greedy, i.e. it only seeks to go uphill, converging exactly at the modes
In this case, at \(t=\infty\), all samples will converge to the state with the highest likelihood (i.e. exactly at the center). This isn’t really desirable as it doesn’t “explore” at all. Just like any other sampling algorithm, we need noise injection!
It turned out that this problem was explored long ago
\begin{equation} \label{eq:original_langevin_dyn} dx = - \nabla_x U(x) dt + \sqrt{2} dB_t \end{equation}
The term \(dB_t\) is called “Brownian Motion” and is effectively the source of noise – we will talk about this later in this subsection. Energy is considered “bad”, i.e. particles do not want to stay in a state with high energy. So they try to go downhill and settle in low-energy states using the gradient of the energy surface. The Langevin equation (i.e. Eq.\eqref{eq:original_langevin_dyn}) happened to provide sufficient “exploration” abilities so that the particles visit states with probability \(\propto e^{-U(x)}\). This suggests that we can treat “negative energy” as log-likelihood
\[q_{data}(x) \propto e^{-U(x)} \implies \log q_{data}(x) = -U(x) + C \implies \nabla_x \log q_{data}(x) = - \nabla_x U(x)\]By using the above substitution in the Langevin equation, we can move out of physics and continue with our ML perspective
\begin{equation} \label{eq:langevin_dyn} dx = \nabla_x \log q_{data}(x) dt + \sqrt{2} dB_t \end{equation}
Note that this isn’t very different from our “intuitive” and greedy process in Eq.\eqref{eq:ode_with_score}, except for the noise term \(dB_t\) and a strange \(\sqrt{2}\). But this makes a difference! Brownian motion is an old construct from particle physics describing the random motion of particles in a fluid or gas. It is simply Gaussian noise with infinitesimally small variance
With that, we can simulate our new langevin equation with noise (i.e. Eq.\eqref{eq:langevin_dyn}) just like the noiseless case. You can see now that the noise is keeping the process from entirely converging into the mode. If you notice carefully, we have added a little “tail” to each point to help visualize their movement.
The simulation is convincing; but it’d be even better if we can theoretically verify that the process in Eq.\eqref{eq:langevin_dyn} indeed converges to \(q_{data}(x)\). The key to this proof is figuring out \(p_t(x)\) and making sure that it stabilizes as \(t\rightarrow \infty\), i.e. \(p_{\infty}(x) = q_{data}(x)\). It turned out that a stochastic process of the form \(dx = \mu_t(x) dt + \sigma_t(x) dB_t\), acting on a random variable \(x\), induces a time-varying distribution that can be described by this PDE
\begin{equation} \frac{\partial}{\partial t}p_t(x) = -\frac{\partial}{\partial x} \Big[ p_t(x)\mu_t(x) \Big] + \frac{1}{2} \frac{\partial^2}{\partial x^2} \Big[ p_t(x) \sigma^2_t(x) \Big] \end{equation}
This is a well celebrated result known as the “Fokker-Planck equation”, which even predates the Langevin equation. So, the solution of this PDE is exactly what we are seeing in the above figure (middle). One can easily verify the convergence of Eq.\eqref{eq:langevin_dyn} by first observing \(\mu_t(x) = \nabla_x \log q_{data}(x), \sigma_t(x) = \sqrt{2}\) and then using \(\frac{\partial}{\partial t} p_{\infty}(x) = \frac{\partial}{\partial t} q_{data}(x) = 0\).
\[\begin{eqnarray*} \frac{\partial}{\partial t}p_{\infty}(x) &=& -\frac{\partial}{\partial x} \Big[ p_{\infty}(x) \nabla_x \log q_{data}(x) \Big] + \frac{(\sqrt{2})^2}{2} \frac{\partial^2}{\partial x^2} \Big[ p_{\infty}(x) \Big] \\ \frac{\partial}{\partial t} q_{data}(x) &=& -\frac{\partial}{\partial x} \Big[ q_{data}(x) \nabla_x \log q_{data}(x) \Big] + \frac{(\sqrt{2})^2}{2} \frac{\partial^2}{\partial x^2} \Big[ q_{data}(x) \Big] \\ 0 \text{ (LHS)} &=& -\frac{\partial}{\partial x} \Big[ \nabla_x q_{data}(x) \Big] + \frac{\partial}{\partial x} \Big[ \nabla_x q_{data}(x) \Big] = 0\text{ (RHS)} \end{eqnarray*}\]The LHS holds due to the fact that after a long time (i.e. \(t = \infty\)) the distribution stabilizes
So, we’re all good. Eq.\eqref{eq:langevin_dyn} is a provable way of sampling given we have access to the true score. In fact, the very work
\begin{equation} x_{t+\delta} = x_t + \delta \cdot \nabla_x \log q_{data}(x) + \sqrt{2\delta} \cdot z \end{equation}
where \(\delta\) (a small constant) is used as a practical proxy for the theoretical \(dt\).
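As an illustration (not tied to any particular dataset), the discretized update above can be simulated directly given a callable score; with the score of a standard normal, \(s(x) = -x\), long runs settle around \(\mathcal{N}(0, I)\):

```python
import numpy as np

def langevin_sample(score, x0, delta=1e-3, n_steps=5000):
    # x_{t+delta} = x_t + delta * score(x_t) + sqrt(2 * delta) * z,  with z ~ N(0, I)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + delta * score(x) + np.sqrt(2.0 * delta) * np.random.randn(*x.shape)
    return x

# e.g. sampling from a standard normal via its score s(x) = -x
samples = np.stack([langevin_sample(lambda x: -x, np.zeros(2)) for _ in range(500)])
```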
If you are already familiar with Diffusion Models, specifically their reverse process, you might be scratching your head. That is because the generative process in Eq.\eqref{eq:langevin_dyn} isn’t quite the same as what modern diffusion models do. We need to cross a few more hurdles before we get there.
More than just a proof, the Fokker-Planck equation provides us with a key insight – i.e. gradually transforming one distribution into another is equivalent to traveling (over time) on a “path” in the space of probability distributions. Imagine a space of all possible probability distributions \(p\)
Speaking of ODEs, there is something we haven’t talked about yet – the initial distribution at \(t=0\), i.e. \(p_0\). In the simulation above, I quietly used a standard normal \(\mathcal{N}(0, I)\) as starting distribution
So theoretically, given the score function \(\nabla_x \log q_{data}(x)\) of a target distribution \(q_{data}(x)\), one can “travel to” it from any distribution. However, keeping in mind our need for sampling, it’s best to choose an initial distribution that is sampling-friendly. Strictly speaking, there are a couple of reasonable choices, but the diffusion model community ended up with the Isotropic Gaussian (i.e. \(\mathcal{N}(0, I)\)). This is not only due to its goodwill across machine learning and statistics, but also the fact that in the context of SDEs with Brownian motions
So far what we’ve talked about, is just the generative process or as diffusion model literature calls it, the “reverse process”. But we haven’t really talked about the “forward process” yet, in case you are familiar with it. The forward process, in simple terms, is an ahead-of-time description of the “probability path” that reverse process intends to take. But the question is, why do we need to know the path ahead of time – the reverse process seems quite spontaneous
The problem lies in Eq.\eqref{eq:langevin_dyn} – let’s write it again with a little more verbosity
\begin{equation} dx_t = \nabla_x \left. \log q_{data}(x) \right|_{x = x_t}\ dt + \sqrt{2} dB_t \end{equation}
Even though we wished to estimate \(\nabla_x \log q_{data}(x)\vert_{x = x_t}\) with neural network \(s_{\theta}(x = x_t)\), this turned out to be extremely hard in practice
So, what some of the pioneering works did, is first fixing a path
Going the other way requires us to run a simulation to go from \(q_{data}(x)\) at \(t=0\) to \(t=\infty\), just the opposite of the animation above. Recall that we already saw how to do this. To go to any distribution at \(t=\infty\), all you need is its score and the Langevin equation. So how about we start from \(q_0 = q_{data}(x)\) this time
It is interesting to note that due to the target distribution being known in its closed form, we do not see any awkward scores dangling around. The score of \(\mathcal{N}(0, I)\) is simply \(-x\)
.. may resemble DDPM’s
NOTE: A little subtlety here is that we only fixed the end point of the forward process, but not the exact path. It may seem that running the Langevin equation in the forward direction chose one path on its own. It turns out that this is the “isotropic path”, where all dimensions of the variable \(x\) evolve in time in the exact same way. Some works
recently uncovered non-isotropic diffusion, where it is indeed possible to travel on other paths. But this is outside the scope of this article.
We can simulate the above equation just like we did in the reverse process, in order to get samples \(x_t \sim q_t\). Below we show a simulation of the forward process
While it is true that the reverse process is inherently sequential due to the arbitrary nature of the score, the forward process (in Eq.\eqref{eq:forward_sde}) is entirely known and hence can be exploited for easing the sequentiality. We can see a way out if we try to simplify
The above simplification suggests that we can jump to any time \(t\), without going through the entire sequence, in order to sample \(x_t \sim q_t\). In fact, \(q_t(x_t\vert x_0)\) is Gaussian! This result opens up an interesting interpretation – generating \(x_0 \sim q(x_0 \vert x_t)\) can be interpreted as solving a “Gaussian inverse problem”, which we explore in a later section.
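As a concrete example (our assumption, since only the endpoint of the forward process is pinned down above): if the forward process is the Langevin equation towards \(\mathcal{N}(0, I)\), i.e. \(dx = -x\,dt + \sqrt{2}\,dB_t\), then \(q_t(x_t \vert x_0) = \mathcal{N}(x_0 e^{-t}, (1 - e^{-2t}) I)\), and jumping to any time \(t\) is a one-liner:

```python
import numpy as np

def sample_xt_given_x0(x0, t):
    # closed-form sample from q_t(x_t | x_0) for dx = -x dt + sqrt(2) dB_t:
    # x_t | x_0 ~ N(x_0 * exp(-t), (1 - exp(-2t)) * I)
    x0 = np.asarray(x0, dtype=float)
    mean = x0 * np.exp(-t)
    std = np.sqrt(1.0 - np.exp(-2.0 * t))
    return mean + std * np.random.randn(*x0.shape)
```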
All good for now, but there is one more thing we need to deal with.
What we discussed so far, i.e. the forward and reverse processes, requires infinite time to reach its end state. This is a direct consequence of using the Langevin equation. That, of course, is unacceptable in practice. But it so happened that there exists quite an elegant fix, which is well known to mathematics – we simply re-define what time means. We may choose a re-parameterization of time as, for example, \(t' = \mathcal{T}(t) = 1 - e^{-t} \in [0, 1]\)
This suggests that in the world where time runs from \(t' = 0 \rightarrow 1\), we need to speed up the forward process by replacing \(dt\) with \(e^t dt'\). The quantity \(\mathcal{T}'(t)^{-1} dt' = e^t dt'\) is analogous to what diffusion models
Of course, our choice of the exact value of end time (i.e. \(t' = 1\)) and the re-parameterization \(\mathcal{T}\) are somewhat arbitrary. Different choices of \(\mathcal{T}\), and consequently \(\mathcal{T}'(t)^{-1} dt'\) lead to different schedules (e.g. linear, cosine etc.).
NOTE: Choosing a different schedule does not mean the process takes a different path on the probability space, it simply changes its speed of movement over time towards the end state.
To summarize, in this section, we started with the definition of ‘score’ and arrived at a stochastic process (thanks to an old result by Langevin) that, at infinite time, converges to the density associated with the score. We saw that this process is provably correct and can be interpreted as a “path” on the probability space. We argued that due to the difficulty of score estimation everywhere along the path, we need samples at the intermediate time \(t\) in order to specialize the score estimates. To do that, we had to travel backwards on the path, which can be done in closed form. We also saw how this process, even though theoretically takes infinite time, can be shrunk down to a finite interval, opening up a design choice known as “schedules”.
The last chapter, while explaining the “sampling” part of score-based diffusion models, assumed that we have access to the true score \(\nabla_x \log q_{data}(x)\) via some oracle. That is, of course, untrue in practice. In fact, accessing the true score for any arbitrary distribution is just not possible
If curious enough, one may question how realistic it is to estimate the score \(\nabla_x \log q_{data}(x)\), while we can NOT usually estimate the density \(q_{data}(x)\) itself? After all, it is a quantity derived from the density! The answer becomes clear once you make the normalization constant explicit
\[\begin{eqnarray*} \nabla_x \log q_{data}(x) &=& \nabla_x \log \frac{\tilde{q}_{data}(x)}{\int_{x} \tilde{q}_{data}(x) dx} \\ &=& \nabla_x \log \tilde{q}_{data}(x) - {\color{red}\nabla_x \log \int_{x} \tilde{q}_{data}(x) dx} \\ &=& \nabla_x \log \tilde{q}_{data}(x) \end{eqnarray*}\]The part in red is zero due to not having dependence on \(x\). So, the score, very cleverly sidesteps the normalization constant. This is the reason score estimation gained momentum in the research community.
The first notable attempt at this problem was by Aapo Hyvärinen
\begin{equation} J(\theta) = \frac{1}{2} \mathbb{E}_{x\sim q_{data}(x)}\Big[ \vert\vert s_{\theta}(x) - \nabla_x \log q_{data}(x) \vert\vert^2 \Big] \end{equation}
It is simply an \(L_2\) loss between a parametric model and the true score, weighted by the probability of individual states (hence the expectation). But of course, it is not computable in this form as it contains the true score. Hyvärinen’s contribution was to simply show that, theoretically, the minimization problem is equivalent when the loss function is
\begin{equation} \label{eq:impl_score_match} J_{\mathrm{I}}(\theta) = \mathbb{E}_{x\sim q_{data}(x)}\Big[ \mathrm{Tr}(\nabla_x s_{\theta}(x)) + \frac{1}{2} \vert\vert s_{\theta}(x) \vert\vert^2 \Big] \end{equation}
In the literature, this is known as the “Implicit Score Matching”. The derivation is relatively simple and only involves algebraic manipulations – please see Appendix A of
But the key challenge with Implicit Score Matching was the \(\mathrm{Tr}(\nabla_x s_{\theta}(x))\) term, i.e. the trace of the Jacobian of the neural score model (equivalently, the Hessian of the model’s log-density), which is costly to compute. This prompted several follow-up works in the race towards scalable score matching, one of which (namely De-noising score matching) is used in Diffusion Models to this day.
For the sake of completeness, I would like to mention the work of Yang Song et al.
The most valuable contribution came from Pascal Vincent in 2011, when he showed
\begin{equation} \label{eq:deno_score_match} J_{\mathrm{D}}(\theta) = \mathbb{E}_{x\sim q_{data}(x), \epsilon\sim\mathcal{N}(0, I)}\left[ \frac{1}{2} \left|\left| s_{\theta}(\ \underbrace{x + \sigma\epsilon}_{\tilde{x}}\ ) - (- \frac{\epsilon}{\sigma}) \right|\right|^2 \right] \end{equation}
We deliberately wrote it in a way that exposes its widely accepted interpretation. Denoising score matching simply adds some known noise \(\sigma\epsilon\) to the datapoints \(x\) and learns (in the mean squared sense), from the “noisy” point \(\tilde{x}\), the direction of comeback, i.e. \((-\epsilon)\), scaled by \(\frac{1}{\sigma}\). In a way, it acts like a “de-noiser”, hence the name. It is theoretically guaranteed
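In code, the single-noise-level objective of Eq.\eqref{eq:deno_score_match} is only a few lines (a sketch, assuming a score network `score_net` that maps a batch of points to a batch of scores):

```python
import torch

def dsm_loss(score_net, x, sigma):
    # corrupt x with known noise and regress the score at the noisy point
    # onto the come-back direction -eps / sigma
    eps = torch.randn_like(x)
    x_tilde = x + sigma * eps
    target = -eps / sigma
    return 0.5 * ((score_net(x_tilde) - target) ** 2).sum(dim=-1).mean()
```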
A little algebraic manipulation of Eq.\eqref{eq:deno_score_match}, demonstrated by Ho et al.
We simply change the interpretation of what the network learns. In this form, the “noise estimator” network learns just the original pure Gaussian noise vector \(\epsilon\) that was added while crafting the noisy sample. So, from a noisy sample, the network \(\epsilon_{\theta}\) learns roughly a unit-variance direction that points towards the clean sample.
There is yet another re-interpretation of Eq.\eqref{eq:deno_score_match} that leads to a slightly different perspective
\[\begin{eqnarray} J_{\mathrm{D}}(\theta) &=& \mathbb{E}_{x\sim q_{data}(x), \epsilon\sim\mathcal{N}(0, I)}\left[ \frac{1}{2\sigma^4} \left|\left| {\color{blue}\tilde{x} + \sigma^2 s_{\theta}}(\tilde{x}) - (\underbrace{\tilde{x} - \sigma\epsilon}_{x}) \right|\right|^2 \right] \\ &=& \mathbb{E}_{x\sim q_{data}(x), \epsilon\sim\mathcal{N}(0, I)}\left[ \frac{1}{2\sigma^4} \left|\left| {\color{blue} x_{\theta}}(\tilde{x}) - x \right|\right|^2 \right]\label{eq:deno_endpoint_match} \end{eqnarray}\]Eq.\eqref{eq:deno_endpoint_match} shows, that instead of the noise direction towards clean sample, we can also have the clean sample directly as a learning target. This is like doing “denoising” in its true sense. We will get back to this in the next subsection.
If you are still puzzled about how Eq.\eqref{eq:deno_eps_match} is related to learning the score, there is a way to probe exactly what the network is learning at an arbitrary input point \(\tilde{x}\). We note that the clean sample \(x\) and the noisy sample \(\tilde{x}\) come from a joint distribution that factorizes
\[q(x, \tilde{x}) = q(\tilde{x} \vert x) q_{data}(x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I) q_{data}(x).\]We then factorize this joint in a slightly different way, i.e.
\[q(x, \tilde{x}) = q(x \vert \tilde{x}) q(\tilde{x})\]where \(q(x \vert \tilde{x})\) can be thought of as a distribution of all clean samples which could’ve led to the given \(\tilde{x}\). Eq.\eqref{eq:deno_eps_match} can therefore be written as
\[\begin{eqnarray*} J_{\mathrm{D}}(\theta) &=& \mathbb{E}_{(x, \tilde{x}) \sim q(x,\tilde{x})}\left[ \frac{1}{2\sigma^2} \left|\left| \epsilon_{\theta}(\tilde{x}) - \epsilon \right|\right|^2 \right] \\ &=& \mathbb{E}_{\tilde{x} \sim q(\tilde{x}), x \sim q(x\vert \tilde{x})}\left[ \frac{1}{2\sigma^2} \left|\left| \epsilon_{\theta}(\tilde{x}) - \frac{\tilde{x} - x}{\sigma} \right|\right|^2 \right] \\ &=& \mathbb{E}_{\tilde{x} \sim q(\tilde{x})}\left[ \frac{1}{2\sigma^2} \left|\left| \epsilon_{\theta}(\tilde{x}) - \frac{\tilde{x} - \mathbb{E}_{x \sim q(x\vert \tilde{x})}[x]}{\sigma} \right|\right|^2 \right] \\ \end{eqnarray*}\]In the last step, the expectation \(\mathbb{E}_{q(x\vert\tilde{x})}\left[ \cdot \right]\) was pushed inside, up until the only quantity that involves \(x\). Looking at it, you may realize that the network \(\epsilon_{\theta}\), given an input \(\tilde{x}\), learns the average noise direction that leads to the given input point \(\tilde{x}\). It also exposes the quantity \(\mathbb{E}_{x \sim q(x\vert \tilde{x})}[x]\), which is the average clean sample that led to the given \(\tilde{x}\).
Below we visualize this process with a toy example, followed by a short explanation.
Explanation: We have 10 data points \(x\sim q_{data}(x)\) in two clusters (big red dots) and we run the learning process by generating noisy samples \(\tilde{x}\sim q(\tilde{x})\) (small red dots). Instead of learning a neural mapping over the entire space, we learn a tabular map with only three chosen input points \(\tilde{x}_1, \tilde{x}_2, \tilde{x}_3\) (blue, magenta and green cross). Every time we sample one of those
A similar treatment, when applied on Eq.\eqref{eq:deno_endpoint_match}, yields the following
\[\begin{eqnarray*} J_{\mathrm{D}}(\theta) &=& \mathbb{E}_{(x, \tilde{x}) \sim q(x,\tilde{x})}\left[ \frac{1}{2\sigma^4} \left|\left| {\color{blue}x_{\theta}}(\tilde{x}) - x \right|\right|^2 \right] \\ &=& \mathbb{E}_{\tilde{x} \sim q(\tilde{x})}\left[ \frac{1}{2\sigma^4} \left|\left| {\color{blue}\tilde{x} + \sigma^2 s_{\theta}}(\tilde{x}) - \mathbb{E}_{x \sim q(x\vert \tilde{x})}[x] \right|\right|^2 \right] \\ \end{eqnarray*}\]Notice that I brought back the original form of \(x_{\theta}(\cdot)\) that involves the score. If we had the true score instead of a learned estimate, we would have
\[\mathbb{E}_{x \sim q(x\vert \tilde{x})}[x] = \tilde{x} + \sigma^2 \nabla_{\tilde{x}} \log p(\tilde{x})\]In “Inverse problem” and Bayesian literature, this is a very well celebrated result named “Tweedie’s Formula”, first published by Robbins
In this section, we explored the problem of scalable score matching. We looked at the notable attempts in the literature and learned that score can be estimated from samples only. We also looked at several interpretations of the learning objective and the connections they expose.
In the last section, we expressed and explained everything in terms of one known noise level \(\sigma\) and the noisy sample \(\tilde{x}\). We did so to avoid cluttering multiple concepts that aren’t necessary to explain each other. In a previous section however, we learned that the score must be estimated along every timestep of the forward process. Simply augmenting Eq.\eqref{eq:deno_score_match} with an additional time variable \(t \sim \mathcal{U}[0, 1]\) is sufficient to induce the time dependency in the score matching problem
\begin{equation} \label{eq:deno_score_match_with_time} J_{\mathrm{D}}(\theta) = \mathbb{E}_{x_0, \epsilon, t \sim \mathcal{U}[0, 1], x_t\sim q_t(x_t\vert x_0) }\left[ \frac{1}{2} \left|\left| s_{\theta}(x_t, t) - (- \frac{\epsilon}{\sigma_t}) \right|\right|^2 \right] \end{equation}
.. where \(q_t(x_t \vert x_0)\) is defined in a previous section and \(\sigma_t\) is the standard deviation of it.
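A sketch of the time-augmented objective (the helpers `mean_fn(x0, t)` and `sigma_fn(t)` are hypothetical stand-ins describing the Gaussian \(q_t(x_t \vert x_0)\) of whichever forward process and schedule is chosen):

```python
import torch

def dsm_loss_time(score_net, x0, mean_fn, sigma_fn):
    # t ~ U[0, 1]; x_t ~ q_t(x_t | x_0); regress s_theta(x_t, t) onto -eps / sigma_t
    t = torch.rand(x0.shape[0], device=x0.device)
    eps = torch.randn_like(x0)
    sigma_t = sigma_fn(t).view(-1, 1)
    x_t = mean_fn(x0, t) + sigma_t * eps
    return 0.5 * ((score_net(x_t, t) + eps / sigma_t) ** 2).sum(dim=-1).mean()
```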
We would like to highlight that, in this article, we first explored the reverse process and then showed why the forward process emerges out of necessity. Typical diffusion models papers start from a forward process specification of the form
\[dx_t = f(t)x_t dt + g(t) {dB}_t\].. and then use Anderson’s SDE reversal
We argue that our approach is more “organic” in the sense that it builds up the theory chronologically, exploring the exact path the community went through over time.
In this article, we dived deep into the theoretical fundamentals of Diffusion Models, which are often ignored by practitioners. We started from the ‘heart’ of diffusion models, i.e. scores, and built the concepts up almost chronologically. We hope this article will serve as a conceptual guide toward understanding diffusion models from the score SDE perspective. We intentionally avoid the ‘probabilistic markov model’ view of diffusion since more and more works have been seen to embrace the SDE formalism.
]]>This theme supports rendering beautiful math in inline and display modes using MathJax 3 engine. You just need to surround your math expression with $$
, like $$ E = mc^2 $$
. If you leave it inside a paragraph, it will produce an inline expression, just like \(E = mc^2\).
To use display mode, again surround your expression with $$
and place it as a separate paragraph. Here is an example:
Note that MathJax 3 is a major re-write of MathJax that brought a significant improvement to the loading and rendering speed, which is now on par with KaTeX.
It’s generally a better idea to avoid linking to images hosted elsewhere - links can break and you risk losing important information in your blog post. To include images in your submission in this way, you must do something like the following:
{% include figure.html path="assets/img/2024-05-07-distill-example/iclr.png" class="img-fluid" %}
which results in the following image:
To ensure that there are no namespace conflicts, you must save your asset to your unique directory /assets/img/2024-05-07-[SUBMISSION NAME]
within your submission.
Please avoid using the direct markdown method of embedding images; they may not be properly resized. Some more complex ways to load images (note the different styles of the shapes/shadows):
Here’s how you could embed interactive figures that have been exported as HTML files. Note that we will be using plotly for this demo, but anything built off of HTML should work (no extra javascript is allowed!). All that’s required is for you to export your figure into HTML format, and make sure that the file exists in the assets/html/[SUBMISSION NAME]/
directory in this repository’s root directory. To embed it into any page, simply insert the following code anywhere into your page.
{% include [FIGURE_NAME].html %}
For example, the following code can be used to generate the figure underneath it.
import pandas as pd
import plotly.express as px

df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/earthquakes-23k.csv')

fig = px.density_mapbox(
    df, lat='Latitude', lon='Longitude', z='Magnitude', radius=10,
    center=dict(lat=0, lon=180), zoom=0, mapbox_style="stamen-terrain")
fig.show()

fig.write_html('./assets/html/2024-05-07-distill-example/plotly_demo_1.html')
And then include it with the following:
<div class="l-page">
  <iframe src="{{ 'assets/html/2024-05-07-distill-example/plotly_demo_1.html' | relative_url }}" frameborder='0' scrolling='no' height="600px" width="100%"></iframe>
</div>
Voila!
Citations are then used in the article body with the <d-cite>
tag. The key attribute is a reference to the id provided in the bibliography. The key attribute can take multiple ids, separated by commas.
The citation is presented inline like this:
Distill chose a numerical inline citation style to improve readability of citation dense articles and because many of the benefits of longer citations are obviated by displaying more information on hover. However, we consider it good style to mention author last names if you discuss something at length and it fits into the flow well — the authors are human and it’s nice for them to have the community associate them with their work.
Just wrap the text you would like to show up in a footnote in a <d-footnote>
tag. The number of the footnote will be automatically generated.
This theme implements a built-in Jekyll feature, the use of Rouge, for syntax highlighting. It supports more than 100 languages. This example is in C++. All you have to do is wrap your code in a liquid tag:
{% highlight c++ linenos %}
code code code
{% endhighlight %}
The keyword linenos
triggers display of line numbers. You can try toggling it on or off yourself below:
This theme supports generating various diagrams from a text description using jekyll-diagrams plugin. Below, we generate a few examples of such diagrams using languages such as mermaid, plantuml, vega-lite, etc.
Note: different diagram-generation packages require external dependencies to be installed on your machine. Also, be mindful that, because of diagram generation, the first time you build your Jekyll website after adding new diagrams it will be SLOW. For any other details, please refer to the jekyll-diagrams README.
Note: This is not supported for local rendering!
The diagram below was generated by the following code:
{% mermaid %}
sequenceDiagram
    participant John
    participant Alice
    Alice->>John: Hello John, how are you?
    John-->>Alice: Great!
{% endmermaid %}
An example of displaying a tweet:
jekyll-twitter-plugin (1.0.0): A Liquid tag plugin for Jekyll that renders Tweets from Twitter API http://t.co/m4EIQPM9h4
— RubyGems (@rubygems) October 5, 2014
An example of pulling from a timeline:
For more details on using the plugin visit: jekyll-twitter-plugin
We do not grow absolutely, chronologically. We grow sometimes in one dimension, and not in another, unevenly. We grow partially. We are relative. We are mature in one realm, childish in another. —Anais Nin
The main text column is referred to as the body. It is the assumed layout of any direct descendants of the d-article
element.
.l-body
For images you want to display a little larger, try .l-page
:
.l-page
All of these have an outset variant if you want to poke out from the body text a little bit. For instance:
.l-body-outset
.l-page-outset
Occasionally you’ll want to use the full browser width. For this, use .l-screen
. You can also inset the element a little from the edge of the browser by using the inset variant.
.l-screen
.l-screen-inset
The final layout is for marginalia, asides, and footnotes. It does not interrupt the normal flow of .l-body
-sized text except on mobile screen sizes.
.l-gutter
Emphasis, aka italics, with asterisks (*asterisks*
) or underscores (_underscores_
).
Strong emphasis, aka bold, with asterisks or underscores.
Combined emphasis with asterisks and underscores.
Strikethrough uses two tildes. Scratch this.
⋅⋅⋅You can have properly indented paragraphs within list items. Notice the blank line above, and the leading spaces (at least one, but we’ll use three here to also align the raw Markdown).
⋅⋅⋅To have a line break without a paragraph, you will need to use two trailing spaces.⋅⋅ ⋅⋅⋅Note that this line is separate, but within the same paragraph.⋅⋅ ⋅⋅⋅(This is contrary to the typical GFM line break behavior, where trailing spaces are not required.)
I’m an inline-style link with title
I’m a relative reference to a repository file
You can use numbers for reference-style link definitions
Or leave it empty and use the link text itself.
URLs and URLs in angle brackets will automatically get turned into links. http://www.example.com or http://www.example.com and sometimes example.com (but not on Github, for example).
Some text to show that the reference links can follow later.
Here’s our logo (hover to see the title text):
Inline-style:
Reference-style:
Inline code
has back-ticks around
it.
var s = "JavaScript syntax highlighting";
alert(s);
s = "Python syntax highlighting"
print(s)
No language indicated, so no syntax highlighting.
But let's throw in a <b>tag</b>.
Colons can be used to align columns.
Tables | Are | Cool |
---|---|---|
col 3 is | right-aligned | $1600 |
col 2 is | centered | $12 |
zebra stripes | are neat | $1 |
There must be at least 3 dashes separating each header cell. The outer pipes (|) are optional, and you don’t need to make the raw Markdown line up prettily. You can also use inline Markdown.
Markdown | Less | Pretty |
---|---|---|
Still | renders | nicely |
1 | 2 | 3 |
Blockquotes are very handy in email to emulate reply text. This line is part of the same quote.
Quote break.
This is a very long line that will still be quoted properly when it wraps. Oh boy let’s keep writing to make sure this is long enough to actually wrap for everyone. Oh, you can put Markdown into a blockquote.
Here’s a line for us to start with.
This line is separated from the one above by two newlines, so it will be a separate paragraph.
This line is also a separate paragraph, but… This line is only separated by a single newline, so it’s a separate line in the same paragraph.
]]>Note: please use the table of contents as defined in the front matter rather than the traditional markdown styling.
This theme supports rendering beautiful math in inline and display modes using MathJax 3 engine. You just need to surround your math expression with $$
, like $$ E = mc^2 $$
. If you leave it inside a paragraph, it will produce an inline expression, just like \(E = mc^2\).
To use display mode, again surround your expression with $$
and place it as a separate paragraph. Here is an example: $$ \left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right) $$
Note that MathJax 3 is a major re-write of MathJax that brought a significant improvement to the loading and rendering speed, which is now on par with KaTeX.
It’s generally a better idea to avoid linking to images hosted elsewhere - links can break and you risk losing important information in your blog post. You can display images from this repository using the following code:
{% include figure.html path="assets/img/2024-05-07-distill-example/iclr.png" class="img-fluid" %}
which results in the following image:
To ensure that there are no namespace conflicts, you must save your asset to your unique directory `/assets/img/2024-05-07-[SUBMISSION NAME]` within your submission.
Please avoid using the direct HTML method of embedding images; they may not be properly resized. Below are some more complex ways to load images (note the different styles of the shapes/shadows):
Here's how you could embed interactive figures that have been exported as HTML files. Note that we will be using plotly for this demo, but anything built off of HTML should work. All that's required is for you to export your figure into HTML format, and make sure that the file exists in the `assets/html/[SUBMISSION NAME]/` directory in this repository's root directory. To embed it into any page, simply insert the following code anywhere into your page.
{% include [FIGURE_NAME].html %}
For example, the following code can be used to generate the figure underneath it.
import pandas as pd
import plotly.express as px

df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/earthquakes-23k.csv')

fig = px.density_mapbox(
    df, lat='Latitude', lon='Longitude', z='Magnitude', radius=10,
    center=dict(lat=0, lon=180), zoom=0, mapbox_style="stamen-terrain")
fig.show()

fig.write_html('./assets/html/2024-05-07-distill-example/plotly_demo_1.html')
And then include it with the following:

<div class="l-page">
  <iframe src="{{ 'assets/html/2024-05-07-distill-example/plotly_demo_1.html' | relative_url }}" frameborder='0' scrolling='no' height="600px" width="100%"></iframe>
</div>
Voila! Citations are then used in the article body with the <d-cite>
tag. The key attribute is a reference to the id provided in the bibliography. The key attribute can take multiple ids, separated by commas.
The citation is presented inline like this:
Distill chose a numerical inline citation style to improve readability of citation dense articles and because many of the benefits of longer citations are obviated by displaying more information on hover. However, we consider it good style to mention author last names if you discuss something at length and it fits into the flow well - the authors are human and it's nice for them to have the community associate them with their work.
Just wrap the text you would like to show up in a footnote in a <d-footnote>
tag. The number of the footnote will be automatically generated.
This theme implements a built-in Jekyll feature, the use of Rouge, for syntax highlighting. It supports more than 100 languages. This example is in C++. All you have to do is wrap your code in a liquid tag as follows:
{% highlight c++ linenos %}
code code code
{% endhighlight %}
The keyword `linenos` triggers display of line numbers. You can try toggling it on or off yourself below:

This theme supports generating various diagrams from a text description using the jekyll-diagrams plugin. Below, we generate a few examples of such diagrams using languages such as mermaid, plantuml, vega-lite, etc.
Note: different diagram-generation packages require external dependencies to be installed on your machine. Also, be mindful that, because of diagram generation, the first time you build your Jekyll website after adding new diagrams it will be SLOW. For any other details, please refer to the jekyll-diagrams README.
Note: This is not supported for local rendering!
The diagram below was generated by the following code:
{% mermaid %}
+sequenceDiagram
+ participant John
+ participant Alice
+ Alice->>John: Hello John, how are you?
+ John-->>Alice: Great!
+{% endmermaid %}
+
+
An example of displaying a tweet:
jekyll-twitter-plugin (1.0.0): A Liquid tag plugin for Jekyll that renders Tweets from Twitter API http://t.co/m4EIQPM9h4
— RubyGems (@rubygems) October 5, 2014
An example of pulling from a timeline:
For more details on using the plugin visit: jekyll-twitter-plugin
We do not grow absolutely, chronologically. We grow sometimes in one dimension, and not in another, unevenly. We grow partially. We are relative. We are mature in one realm, childish in another. —Anais Nin
.l-body
.l-page
.l-body-outset
.l-page-outset
.l-screen
.l-screen-inset
.l-gutter
Emphasis, aka italics, with the <i></i>
tag emphasis.
Strong emphasis, aka bold, with <b></b>
tag bold.
Strikethrough can be accomplished with the <s></s>
tag. Scratch this.
For code, the language can be specified in the class. For example, use language-javascript
for Javascript and language-python
for Python code.
var s = "JavaScript syntax highlighting";
alert(s);
s = "Python syntax highlighting"
print(s)
No language indicated, so no syntax highlighting.
A table can be created with the <table>
element. Below is an example
Tables | Are | Cool |
---|---|---|
col 3 is | right-aligned | $1600 |
col 2 is | centered | $12 |
zebra stripes | are neat | $1 |
Blockquotes can be defined with the <blockquote> tag.]]>
Machine learning models, while incredibly powerful, can sometimes act unpredictably. One of the most intriguing behaviors is when the test loss suddenly diverges at the interpolation threshold, a phenomenon distinctly observed in double descent
While significant theoretical work has been done to comprehend why double descent occurs, it can be difficult for a newcomer to gain a general understanding of why the test loss behaves in this manner, and under what conditions one should expect similar misbehavior. In this blog post, when we say double descent, we mean the divergence at the interpolation threshold, and not whether overparameterized models generalize (or fail to generalize).
In this work, we intuitively and quantitatively explain why the test loss diverges at the interpolation threshold, with as much generality as possible and with mathematical machinery as simple as possible, but also without sacrificing rigor. To accomplish this, we focus on the simplest supervised model - ordinary linear regression - using the most basic linear algebra primitive: the singular value decomposition. We identify three distinct interpretable factors which, when collectively present, trigger the divergence. Through practical experiments on real data sets, we confirm that the models’ test losses diverge at the interpolation threshold, and this divergence vanishes when even one of the three factors is removed. We complement our understanding by offering a geometric picture that reveals linear models perform representation learning when overparameterized, and conclude by shedding light on recent results in nonlinear models concerning superposition.
Before studying ordinary linear regression mathematically, does our claim that it exhibits double descent hold empirically? We show that it indeed does, using one synthetic and three real datasets: World Health Organization Life Expectancy
Consider a regression dataset of $N$ training data with features $\vec{x}_n \in \mathbb{R}^D$ and targets $y_n \in \mathbb{R}$. We sometimes use matrix-vector notation to refer to the training data:
\[X \in \mathbb{R}^{N \times D} \quad , \quad Y \in \mathbb{R}^{N \times 1}.\]In ordinary linear regression, we want to learn parameters $\hat{\vec{\beta}} \in \mathbb{R}^{D}$ such that:
\[\vec{x}_n \cdot \hat{\vec{\beta}} \approx y_n.\]We will study three key parameters: the number of model parameters $P$, the number of training data $N$, and the dimension of the data $D$.
We say that a model is overparameterized if $N < P$ and underparameterized if $N > P$. The interpolation threshold refers to $N=P$, because when $N\leq P$, the model can perfectly interpolate the training points. Recall that in ordinary linear regression, the number of parameters $P$ equals the dimension $D$ of the covariates. Consequently, rather than thinking about changing the number of parameters $P$, we’ll instead think about changing the number of data points $N$.
To understand under what conditions and why double descent occurs at the interpolation threshold in linear regression, we’ll study the two parameterization regimes. If the regression is underparameterized, we estimate the linear relationship between covariates $\vec{x}_n$ and target $y_n$ by solving the least-squares minimization problem:
\[\begin{align*} \hat{\vec{\beta}}_{under} \, &:= \, \arg \min_{\vec{\beta}} \frac{1}{N} \sum_n ||\vec{x}_n \cdot \vec{\beta} - y_n||_2^2\\ \, &:= \, \arg \min_{\vec{\beta}} ||X \vec{\beta} - Y ||_2^2. \end{align*}\]The solution is the ordinary least squares estimator based on the second moment matrix $X^T X$:
\[\hat{\vec{\beta}}_{under} = (X^T X)^{-1} X^T Y.\]If the model is overparameterized, the optimization problem is ill-posed since we have fewer constraints than parameters. Consequently, we choose a different (constrained) optimization problem that asks for the minimum norm parameters that still perfectly interpolate the training data:
\[\begin{align*} \hat{\vec{\beta}}_{over} \, &:= \, \arg \min_{\vec{\beta}} ||\vec{\beta}||_2^2\\ \text{s.t.} \quad \quad \forall \, n \in &\{1, ..., N\}, \quad \vec{x}_n \cdot \vec{\beta} = y_n. \end{align*}\]We choose this optimization problem because it is the one gradient descent implicitly minimizes. The solution to this optimization problem uses the Gram matrix $X X^T \in \mathbb{R}^{N \times N}$:
\[\hat{\vec{\beta}}_{over} = X^T (X X^T)^{-1} Y.\]One way to see why the Gram matrix appears is via constrained optimization: define the Lagrangian $\mathcal{L}(\vec{\beta}, \vec{\lambda}) \, := \, \frac{1}{2}||\vec{\beta}||_2^2 + \vec{\lambda}^T (Y - X \vec{\beta})$ with Lagrange multipliers $\vec{\lambda} \in \mathbb{R}^N$, then differentiate with respect to the parameters and Lagrange multipliers to obtain the overparameterized solution.
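Both closed forms are easy to reproduce numerically; here is a small sketch (assuming $X$ has full rank in the relevant dimension, so the inverses exist):

```python
import numpy as np

def fit_linear(X, Y):
    # least squares when underparameterized (N > D), minimum-norm interpolation otherwise
    N, D = X.shape
    if N > D:
        return np.linalg.solve(X.T @ X, X.T @ Y)   # (X^T X)^{-1} X^T Y
    return X.T @ np.linalg.solve(X @ X.T, Y)       # X^T (X X^T)^{-1} Y
```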
After being fit, for test point $\vec{x}_{test}$, the model will make the following predictions:
\[\hat{y}_{test, under} = \vec{x}_{test} \cdot \hat{\vec{\beta}}_{under} = \vec{x}_{test} \cdot (X^T X)^{-1} X^T Y\] \[\hat{y}_{test, over} = \vec{x}_{test} \cdot \hat{\vec{\beta}}_{over} = \vec{x}_{test} \cdot X^T (X X^T)^{-1} Y.\]Hidden in the above equations is an interaction between three quantities that can, when all grow extreme, create a divergence in the test loss!
To reveal the three quantities, we’ll rewrite the regression targets by introducing a slightly more detailed notation. Unknown to us, there are some ideal linear parameters $\vec{\beta}^* \in \mathbb{R}^P = \mathbb{R}^D$ that truly minimize the test mean squared error. We can write any regression target as the inner product of the data $\vec{x}_n$ and the ideal parameters $\vec{\beta}^*$, plus an additional error term $e_n$ that is an “uncapturable” residual from the “viewpoint” of the model class
\[y_n = \vec{x}_n \cdot \vec{\beta}^* + e_n.\]In matrix-vector form, we will equivalently write:
\[Y = X \vec{\beta}^* + E,\]with $E \in \mathbb{R}^{N \times 1}$. To be clear, we are not imposing assumptions. Rather, we are introducing notation to express that there are (unknown) ideal linear parameters, and possibly non-zero errors $E$ that even the ideal model might be unable to capture; these errors $E$ could be random noise or could be fully deterministic patterns that this particular model class cannot capture. Using this new notation, we rewrite the model’s predictions to show how the test datum’s features $\vec{x}_{test}$, training data’s features $X$ and training data’s regression targets $Y$ interact.
Let $y_{test}^* := \vec{x}_{test} \cdot \vec{\beta}^*$. In the underparameterized regime:
\[\begin{align*} \hat{y}_{test,under} &= \vec{x}_{test} \cdot \hat{\vec{\beta}}_{under}\\ &=\vec{x}_{test} \cdot (X^T X)^{-1} X^T Y\\ &=\vec{x}_{test} \cdot (X^T X)^{-1} X^T (X \vec{\beta}^* + E)\\ &=\vec{x}_{test} \cdot \vec{\beta}^* + \, \vec{x}_{test} \cdot (X^T X)^{-1} X^T E\\ \hat{y}_{test,under} - y_{test}^* &= \vec{x}_{test} \cdot (X^T X)^{-1} X^T E. \end{align*}\]This equation is important, but opaque. To extract the intuition, replace $X$ with its singular value decomposition $X = U S V^T$. Let $R \, := \, \text{rank}(X)$ and let $\sigma_1 > \sigma_2 > … > \sigma_R > 0$ be $X$’s (non-zero) singular values. Let $S^+$ denote the Moore-Penrose inverse; in this context, this means that if a singular value $\sigma_r$ is non-zero, then in $S^+$, it becomes its reciprocal $1/\sigma_r$, but if the singular value is zero, then in $S^+$, it remains $0$. We can decompose the underparameterized prediction error along the orthogonal singular modes:
\[\begin{align*} \hat{y}_{test, under} - y_{test}^* &= \vec{x}_{test} \cdot V S^{+} U^T E\\ &= \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E). \end{align*}\]This equation will be critical! The same term will appear in the overparameterized regime (plus one additional term):
\[\begin{align*} \hat{y}_{test,over} &= \vec{x}_{test} \cdot \hat{\vec{\beta}}_{over}\\ &= \vec{x}_{test} \cdot X^T (X X^T)^{-1} Y\\ &= \vec{x}_{test} \cdot X^T (X X^T)^{-1} (X \beta^* + E)\\ \hat{y}_{test,over} - y_{test}^* &= \vec{x}_{test} \cdot (X^T (X X^T)^{-1} X - I_D) \beta^* \\ &\quad\quad + \quad \vec{x}_{test} \cdot X^T (X X^T)^{-1} E\\ &= \vec{x}_{test} \cdot (X^T (X X^T)^{-1} X - I_D) \beta^* \\ &\quad\quad + \quad \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E), \end{align*}\]where the last step again replaced $X$ with its SVD $X = U S V^T$. Thus, the prediction errors in the overparameterized and underparameterized regimes will be:
\[\begin{align*} \hat{y}_{test,over} - y_{test}^* &= \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E)\\ &\quad \quad + \quad \vec{x}_{test} \cdot (X^T (X X^T)^{-1} X - I_D) \beta^*\\ \hat{y}_{test,under} - y_{test}^* &= \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E). \end{align*}\]The shared term in the two prediction errors causes the divergence:
\[\begin{equation} \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E). \label{eq:variance} \end{equation}\]Eqn. \ref{eq:variance} is critical. It reveals that our test prediction error (and thus, our test squared error!) will depend on an interaction between 3 quantities:
How much the training features vary in each direction. More formally, the inverse (non-zero) singular values of the training features $X$:
\[\frac{1}{\sigma_r}\]How much, and in which directions, the test features vary relative to the training features. More formally: how $\vec{x}_{test}$ projects onto $X$’s right singular vectors $V$:
\[\vec{x}_{test} \cdot \vec{v}_r\]How well the best possible model in the model class can correlate the variance in the training features with the training regression targets. More formally: how the residuals $E$ of the best possible model in the model class (i.e. insurmountable “errors” from the “perspective” of the model class) project onto $X$’s left singular vectors $U$:
\[\vec{u}_r \cdot E\]We use the term “vary” when discussing $\vec{v}_r$ because $V$ can be related to the empirical (or sample) covariance matrix oftentimes studied in Principal Component Analysis. That is, if the SVD of $X$ is $U S V^T$, then $\frac{1}{N} X^T X = \frac{1}{N} V S^2 V^T$. If the training data are centered (a common preprocessing step), then this is the empirical covariance matrix and its eigenvectors $\vec{v}_1, …, \vec{v}_R$ identify the orthogonal directions of variance. We’ll return to this in Fig 6.
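To make Eqn. \ref{eq:variance} concrete, here is a small numpy sketch of my own (synthetic Gaussian data, not from the original experiments) that checks that the minimum-norm overparameterized solution's prediction error equals the shared variance term plus the bias term derived above.

```python
# Numerical check of the overparameterized prediction-error decomposition.
import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 35                     # overparameterized: more features than data
beta_star = rng.normal(size=D)    # hypothetical "ideal" linear parameters
X = rng.normal(size=(N, D))
E = rng.normal(scale=0.1, size=N) # residual errors of the best-in-class model
Y = X @ beta_star + E
x_test = rng.normal(size=D)

# Minimum-norm (overparameterized) least-squares solution.
beta_over = X.T @ np.linalg.solve(X @ X.T, Y)
pred_error = x_test @ beta_over - x_test @ beta_star

# Decompose via the SVD of X: shared "variance" term plus the bias term.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
variance_term = sum((x_test @ Vt[r]) * (U[:, r] @ E) / s[r] for r in range(len(s)))
P_row = X.T @ np.linalg.solve(X @ X.T, X)          # projection onto the row space of X
bias_term = x_test @ (P_row - np.eye(D)) @ beta_star

print(np.isclose(pred_error, variance_term + bias_term))  # True
```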
Why does the test error diverge? When factors (1) and (3) are both present for some singular mode, i.e., a small non-zero singular value $\sigma_r$ paired with a large residual projection $\vec{u}_r \cdot E$, the model’s parameters along that singular mode are likely incorrect. When (2) is added to the mix by a test datum $\vec{x}_{test}$ with a large projection along the same mode, the model is forced to extrapolate significantly beyond what it saw in the training data, in a direction where the training data had an error-prone relationship with the training targets, using parameters that are likely wrong. As a consequence, the test squared error explodes!
The test loss will not diverge if any of the three required factors is absent. What could remove them? The first factor disappears if the training features contain no small-but-nonzero singular values; we can enforce this by setting all singular values below a chosen threshold to exactly 0. To test our understanding, we independently ablate the small singular values in the training features: as we run the ordinary linear regression fitting process and sweep the number of training data, we also sweep different singular value cutoffs and remove all singular values of the training features $X$ below the cutoff (Fig 2).
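A hedged sketch of this first ablation, again on made-up synthetic data (the cutoff value and the setup are illustrative, not the paper's): remove small singular values of $X$ before fitting, which should suppress the divergence near the interpolation threshold.

```python
import numpy as np

def fit_with_cutoff(X, Y, cutoff=0.0):
    """Least-squares fit via the pseudoinverse of X, with singular values
    below `cutoff` set to zero (the ablation described above)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_inv = np.where(s > cutoff, 1.0 / s, 0.0)   # ablate small singular modes
    return Vt.T @ (s_inv * (U.T @ Y))            # beta_hat = V S^+ U^T Y

rng = np.random.default_rng(1)
D = 30
beta_star = rng.normal(size=D)
x_test = rng.normal(size=D)
for N in (10, 30, 50):                           # below, at, above the threshold
    X = rng.normal(size=(N, D))
    Y = X @ beta_star + rng.normal(scale=0.5, size=N)
    for cutoff in (0.0, 1.0):
        err = (x_test @ fit_with_cutoff(X, Y, cutoff) - x_test @ beta_star) ** 2
        print(f"N={N:2d} cutoff={cutoff:.1f} test sq. error={err:.2f}")
```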
Double descent should not occur if the test datum does not vary in directions absent from the training features. Specifically, if the test datum lies entirely in the subspace spanned by a few of the leading singular directions, then the divergence is unlikely to occur. To test our understanding, we force the test data features to lie in the training features’ subspace: as we run the ordinary linear regression fitting process and sweep the number of training data, we project the test features $\vec{x}_{test}$ onto the subspace spanned by the training features $X$’s singular modes (Fig 3).
Double descent should not occur if the best possible model in the model class makes no errors on the training data. For example, if we use a linear model class on data where the true relationship is a noiseless linear relationship, then at the interpolation threshold, where the number of data equals the number of parameters ($N = P = D$), our line of best fit will exactly match the true relationship, and no divergence will occur. To test our understanding, we ensure no residual errors exist in the best possible model: we first use the entire dataset to fit a linear model, then replace all target values with the predictions made by this ideal linear model. We then rerun our typical fitting process using these new labels, sweeping the number of training data (Fig 4).
As a short aside, what could cause residual errors in the best possible model in the model class? Such errors can arise from noise in the regression targets or from model misspecification, e.g., a nonlinear true relationship or relevant features that the model does not observe.
Why does this divergence happen near the interpolation threshold? The answer is that the first factor (small non-zero singular values in the training features $X$) is likely to occur at the interpolation threshold (Fig 5), but why?
Suppose we’re given a single training datum \(\vec{x}_1\). So long as this datum isn’t exactly zero, it varies in a single direction, meaning we gain information about the variance in that direction, but the variance in all orthogonal directions is exactly 0. With the second training datum \(\vec{x}_2\), so long as this datum isn’t exactly zero, it too contributes variance, but some component of \(\vec{x}_2\) will likely lie along \(\vec{x}_1\); when this happens (and it likely will, since two vectors are unlikely to be exactly orthogonal), the shared component gives us more information about the variance in that shared direction but less information about the second orthogonal direction of variation. Ergo, the training data’s smallest non-zero singular value after 2 samples is typically smaller than after 1 sample. As we approach the interpolation threshold, the probability that each additional datum has large variance in a new direction orthogonal to all previous directions shrinks (Fig 5), but as we move beyond the interpolation threshold, the variance in each covariate dimension becomes increasingly well estimated.
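A quick simulation of my own (assuming standard Gaussian covariates, which the original figures may not use) makes the dip visible: the smallest non-zero singular value of the training features tends to be smallest near the interpolation threshold $N = D$.

```python
import numpy as np

rng = np.random.default_rng(2)
D, trials = 30, 200
for N in (5, 15, 25, 30, 35, 60):
    # Smallest non-zero singular value of an N x D Gaussian feature matrix.
    smallest = [np.linalg.svd(rng.normal(size=(N, D)), compute_uv=False).min()
                for _ in range(trials)]
    print(f"N={N:2d}  mean smallest singular value: {np.mean(smallest):.2f}")
```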
You might be wondering why three of the datasets have low test squared error in the overparameterized regime (California Housing, Diabetes, Student-Teacher) but one (WHO Life Expectancy) does not. Recall that the overparameterized regime’s prediction error \(\hat{y}_{test,over} - y_{test}^*\) contains an additional term not present in the underparameterized regime:
\[\begin{equation} \vec{x}_{test} \cdot (X^T (X X^T)^{-1} X - I_D) \beta^*. \label{eq:bias} \end{equation}\]To understand why this bias exists, recall that our goal is to correlate fluctuations in the covariates $\vec{x}$ with fluctuations in the targets $y$. In the overparameterized regime, there are more parameters than data; consequently, for $N$ data points in $D=P$ dimensions, the model can “see” fluctuations in at most $N$ dimensions, but has no “visibility” into the remaining $P-N$ dimensions. This causes information about the optimal linear relationship $\vec{\beta}^*$ to be lost, thereby increasing the overparameterized prediction error.
We previously saw that away from the interpolation threshold, the variance is unlikely to affect the discrepancy between the overparameterized model’s predictions and the ideal model’s predictions, meaning most of the discrepancy must therefore emerge from the bias (Eqn. \ref{eq:bias}). This bias term yields an intuitive geometric picture (Fig 7) that also reveals a surprising fact: overparameterized linear regression does representation learning! Specifically, for test datum \(\vec{x}_{test}\), a linear model creates a representation of the test datum \(\hat{\vec{x}}_{test}\) by orthogonally projecting the test datum onto the row space of the training covariates \(X\) via the projection matrix \(X^T (X X^T)^{-1} X\):
\[\begin{equation*} \hat{\vec{x}}_{test} := X^T (X X^T)^{-1} X \; \vec{x}_{test}. \end{equation*}\]Seen this way, the bias can be rewritten as the inner product between (1) the difference between the model’s representation of the test datum and the test datum itself and (2) the ideal linear model’s fit parameters:
\[\begin{equation}\label{eq:overparam_gen_bias} (\hat{\vec{x}}_{test} - \vec{x}_{test}) \cdot \vec{\beta}^*. \end{equation}\]Intuitively, an overparameterized model will generalize well if the model’s representations capture the essential information necessary for the best model in the model class to perform well (Fig. 8).
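A short sketch of this representation view, again on my own synthetic setup rather than the post's datasets: compute \(\hat{\vec{x}}_{test}\) by projecting onto the row space of the training features and evaluate the bias term; if the test datum already lies in that row space, the bias vanishes.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 20, 35
beta_star = rng.normal(size=D)
X = rng.normal(size=(N, D))
x_test = rng.normal(size=D)

P_row = X.T @ np.linalg.solve(X @ X.T, X)   # orthogonal projection onto the row space of X
x_hat_test = P_row @ x_test                 # the model's "representation" of x_test
bias = (x_hat_test - x_test) @ beta_star

# The bias vanishes if x_test already lies in the training features' row space:
x_in_span = X.T @ rng.normal(size=N)        # an arbitrary vector in the row space
print(bias, (P_row @ x_in_span - x_in_span) @ beta_star)  # second value ~ 0
```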
Our key equation (Eqn. \ref{eq:variance}) also reveals why adversarial test data and adversarial training data exist (at least in linear regression) and how mechanistically they function. For convenience, we repeat the equation:
\[\begin{equation*} \sum_{r=1}^R \frac{1}{\sigma_r} (\vec{x}_{test} \cdot \vec{v}_r) (\vec{u}_r \cdot E). \end{equation*}\]Adversarial test examples are a well-known phenomenon in machine learning
Less well-known are adversarial training data, akin to dataset poisoning
Although we mathematically studied ordinary linear regression, the intuition for why the test loss diverges extends to nonlinear models, such as polynomial regression and certain classes of deep neural networks
Our work sheds light on the results of Henighan et al. (2023) in two ways:
Henighan et al. 2023 write, “It’s interesting to note that we’re observing double descent in the absence of label noise.” Our work clarifies that noise, in the sense of a random quantity, is not necessary to produce double descent. Rather, what is necessary is residual errors from the perspective of the model class ($E$, in our notation). Those errors could be entirely deterministic, such as a nonlinear model attempting to fit a noiseless linear relationship, or other model misspecifications.
Henighan et al. 2023 write, “[Our work] suggests a naive mechanistic theory of overfitting and memorization: memorization and overfitting occur when models operate on ‘data point features’ instead of ‘generalizing features’.” Our work hopefully clarifies that this dichotomy is incorrect: when overparameterized, data point features are akin to the Gram matrix $X X^T$, and when underparameterized, generalizing features are akin to the second moment matrix $X^T X$. Data point features can and very often do generalize, and there is a deep connection between the two, namely their shared spectra.
In this work, we intuitively and quantitatively explained why the test loss misbehaves based on three interpretable factors, tested our understanding via ablations, connected our understanding to adversarial test examples and adversarial training datasets, and added conceptual clarity to recent discoveries in nonlinear models.
In information theory, the data processing inequality (DPI) expresses a fundamental idea: processing data (stochastically) cannot increase information. The DPI provides us with a powerful intuition about what information processing systems can do and what the limitations of data processing are.
In this blog post, we first study the DPI, developing intuition through vivid examples and detailed proofs—especially the equality case, which is arguably the best way to understand inequalities. We will consider classic forms of the DPI as well as DPIs relating probability distributions more broadly. Then, we explore the intriguing connection between DPI and function-space variational inference (FSVI), a modern Bayesian deep learning technique that focuses on the Bayesian predictive posterior rather than the parameter space. Exploring this connection is important because it can provide new insights into FSVI on a fundamental level. We apply the DPI to recover several interesting results from the literature in a simple form and build intuitions for the relationship between parameter and functional priors.
Most importantly, we consider how FSVI can measure a predictive divergence between the approximate and true posterior which is independent of parameter symmetries. (With parameter symmetries, I refer to different parameters that yield the same predictions, which is very common in over-parameterized neural networks: think of parameter symmetries like different paths leading to the same destination; they might look different but end up at the same predictions.)
The following sections summarize the key takeaways of this blog post. If they don’t make sense, don’t worry: they will after reading this post.
The data processing inequality examines how information cannot increase due to processing. In information theory, it is usually stated based on a Markov chain of random variables \(X \rightarrow Y \rightarrow Z\) and their mutual information. We will look at different data processing inequalities that relate different distributions instead of different random variables. However, this blog post in particular looks at the DPI when it is formulated using Kullback-Leibler (KL) divergences between distributions. I will use “🥬 divergence” in headings to add a bit of color. 😊
Concretely, this KL DPI states that processing data stochastically can only reduce information. More formally, for a stochastic mapping \(\opf\) that takes \(\W\) to \(Y\):

\[\Kale{\qof{\W}}{\pof{\W}} \ge \Kale{\qof{Y}}{\pof{Y}}.\]
That is, the KL divergence between \(\qof{Y}\) and \(\pof{Y}\) cannot be larger than the one between the original \(\qof{\W}\) and \(\pof{\W}\). Intuitively, the stochastic mapping \(\opf\) induces a bottleneck that reduces how well we can distinguish between \(\opp\) and \(\opq\). Finally we have equality when \(\Kale{\qof{\W \given Y}}{\pof{\W \given Y}} = 0\).
The paper “Understanding Variational Inference in Function-Space” by Burt et al. (2021)
The data processing inequality states that if two random variables are transformed in this way, they cannot become easier to tell apart.
Generally, variational inference is a powerful technique for approximating complex Bayesian posteriors with simpler distributions. In its usual form, it optimizes an approximate, variational distribution to match the Bayesian parameter posterior as closely as possible. This way, it transforms the problem of Bayesian inference into an optimization problem.
However, especially for deep neural networks, obtaining a good approximation of the parameter space can be difficult. One reason is the sheer size of the parameter space. Additionally, the parameterization of a neural network often contains many symmetries—different parameter configurations can lead to the same predictions of the model—that are not taken into account either.
Here, function-space variational inference (FSVI) side-steps some of these restrictions by only requiring that the variational distribution matches the Bayesian predictive posterior: whereas regular variational inference regularizes towards a parameter prior, FSVI regularizes towards a data prior. This is especially useful when the parameter prior is not very meaningful, e.g. an isotropic Gaussian prior, which is often used in Bayesian neural networks.
Information theory deals with the communication of information
The information content of an event \(x\) is denoted as \(\Hof{x}\) and is defined as \(-\log \pof{x}\). It represents the minimum amount of information needed to describe the occurrence of \(x\) given an underlying probability distribution. In machine learning, this information content is often used as a minimization objective, represented as the negative log-likelihood or cross-entropy when averaged over a dataset.
The entropy \(\Hof{X}\) of a random variable \(X\) is the expectation of its information content:
\[\Hof{X} \triangleq \E{\pof{x}}{\Hof{x}} = \E{\pof{x}}{-\log \pof{x}}.\]The entropy measures the average amount of information needed to describe the random variable \(X\). It provides a measure of uncertainty or randomness associated with \(X\). We can similarly define the entropy of a conditional distribution \(\Hof{X \given Y}\) and the joint entropy \(\Hof{X, Y}\).
The mutual information \(\MIof{X;Y}\) between two random variables \(X\) and \(Y\) is a measure of the amount of information that one random variable contains about the other. It is defined as:
\[\begin{aligned} \MIof{X;Y} & \triangleq \Hof{X} - \Hof{X \given Y} \\ &= \Hof{Y} - \Hof{Y \given X} \\ &= \Hof{X} + \Hof{Y} - \Hof{X, Y}. \end{aligned}\]We will also use the Kullback-Leibler divergence \(\Kale{\pof{X}}{\qof{X}}\) and the cross-entropy \(\CrossEntropy{\pof{X}}{\qof{X}}\):
\[\begin{aligned} \CrossEntropy{\pof{X}}{\qof{X}} & = \E{\pof{x}}{-\log \qof{x}}\\ \Kale{\pof{X}}{\qof{X}} & = \CrossEntropy{\pof{X}}{\qof{X}} - \Hof{X} \end{aligned}\]The cross-entropy quantifies the average number of bits needed to encode samples drawn from the true distribution \(\pof{X}\) using a different distribution \(\qof{X}\). The Kullback-Leibler divergence is a measure of the difference between two probability distributions and captures the additional bits needed to encode samples from \(\pof{X}\) using \(\qof{X}\) compared to encoding them using the true distribution \(\pof{X}\).
Now that we have covered the notation, let’s delve into the data processing inequality.
The data processing inequality (DPI) is a fundamental inequality in information theory that states the mutual information between two random variables cannot increase through processing. The original DPI is typically stated for a Markov chain of random variables \(X \rightarrow Y \rightarrow Z\) and relates the mutual information terms as follows:
\[\MIof{X;Y} \ge \MIof{X;Z}.\]We can view \(\rightarrow\) as a processing or transition step that maps \(X\) to \(Y\) and \(Y\) to \(Z\), whereas the mapping can be deterministic or stochastic. The inequality tells us that processing the random variable \(X\) to obtain \(Y\) and further processing \(Y\) to obtain \(Z\) cannot increase the mutual information between \(X\) and \(Z\) compared to the mutual information between \(X\) and \(Y\).
The following three scenarios illustrate the data processing inequality using different mappings:
Consider an image processing pipeline with the following steps. Let \(X\) be the original image, \(Y\) a compressed version of \(X\), and \(Z\) a blurred and pixelated version of \(Y\).
In this case, \(X\) has more mutual information with \(Y\) than with \(Z\). The compression reduces information, but the image is still recognizable. However, after the additional processing of blurring and pixelating, the mutual information between \(X\) and \(Z\) is further reduced. This gives an intuitive example of how additional processing on data reduces the mutual information with the original data. Each processing step results in some loss of information.
Consider a supervised learning pipeline with the following steps. Let \(X\) be the raw inputs, \(Y\) the intermediate representations computed from \(X\), and \(Z\) the model’s predictions obtained from \(Y\).
Here, \(X \rightarrow Y \rightarrow Z\) forms a Markov chain. The data processing inequality tells us that the mutual information between the inputs \(X\) and predictions \(Z\) cannot exceed the mutual information between the inputs \(X\) and intermediate representations \(Y\):
\[\MIof{X; Y} \geq \MIof{X; Z}.\]This makes intuitive sense—the intermediate representations \(Y\) are obtained by processing the raw inputs \(X\), so they cannot contain more information about \(X\) than \(X\) itself. The predictions \(Z\) are obtained by further processing \(Y\), so additional information may be lost, reducing the mutual information with the original inputs \(X\).
As a more concrete example, consider an image classification model. Let \(X\) be the input images, \(Y\) the features extracted by the convolutional layers, and \(Z\) the predicted labels.
The convolutional layers will extract features from the input images, but cannot extract more information than present in the original images. The predicted labels are obtained by further processing these convolutional features, so may lose some fine-grained information about the original inputs.
An autoencoder compresses the input \(X\) into a latent code \(Y\) and then tries to reconstruct the original input from the code, producing \(\hat{X}\). Let \(X\) be the input, \(Y\) the latent code, and \(\hat{X}\) the reconstruction.
The data processing inequality tells us again:
\[\MIof{X; Y} \geq \MIof{X; \hat{X}}.\]The latent code \(Y\) is obtained by compressing \(X\), so cannot contain more information. The reconstruction \(\hat{X}\) tries to recover \(X\) from \(Y\), but some information may be lost, reducing the mutual information with \(X\).
Intuitively, autoencoders try to preserve as much mutual information between inputs \(X\) and reconstructions \(\hat{X}\) as possible by learning latent representations \(Y\) that compress inputs without losing too much information. The data processing inequality quantifies this information bottleneck.
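Before moving to the proof, here is a tiny numerical check of my own, with made-up transition matrices, that the mutual information indeed cannot increase along a discrete Markov chain \(X \rightarrow Y \rightarrow Z\).

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) in nats from a joint probability table p(a, b)."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])).sum())

px = np.array([0.5, 0.3, 0.2])                         # p(x)
T_yx = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])  # p(y | x), rows sum to 1
T_zy = np.array([[0.7, 0.3], [0.4, 0.6]])              # p(z | y)

p_xy = px[:, None] * T_yx                              # p(x, y)
p_xz = p_xy @ T_zy                                     # p(x, z) = sum_y p(x, y) p(z | y)
print(mutual_information(p_xy) >= mutual_information(p_xz))  # True
```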
The proof is simple and connects the DPI to another important inequality.
First we note that the Markov Chain implies the following factorization of the joint distribution:
\[\pof{x, y, z} = \pof{x} \pof{y \given x} \pof{z \given y}.\]Using this factorization, we can express the mutual information terms:
\[\begin{aligned} \MIof{X;Y} &= \Hof{X} - \Hof{X \given Y} \\ &\ge \Hof{X} - \Hof{X \given Z} \\ &= \MIof{X;Z}. \end{aligned}\]This relies on \(\Hof{X \given Y} \le \Hof{X \given Z}\). Why is this true?
We have the following chain of inequalities:
\[\Hof{X \given Y} = \underbrace{\MIof{X ; Z \given Y}}_{\overset{(1)}{=}0} + \Hof{X \given Y, Z} \overset{(2)}{\le} \Hof{X \given Z}.\](1) follows from the Markov chain property: when \(X \rightarrow Y \rightarrow Z\), \(X\) does not depend on \(Z\) at all when conditioned on \(Y\); and (2) follows from the fact that conditioning reduces entropy, i.e. \(\Hof{A \given B} \le \Hof{A}.\)
The equality gap \(\Hof{X \given Z} - \Hof{X \given Y, Z}\) corresponds to the mutual information \(\MIof{X ; Y \given Z}\). This mutual information measures the extra information about \(X\) contained in \(Y\) that is not already conveyed by \(Z\). It is zero if and only if \(X \rightarrow Z \rightarrow Y\) forms a Markov chain, indicating that \(Z\) is a sufficient statistic for \(X\).
We can easily show that conditioning reduces entropy by using the non-negative property of the mutual information:
\(\begin{aligned} 0 &\le \Kale{\pof{X,Y}}{\pof{X}\pof{Y}} \\ &= \MIof{X;Y} \\ &= \Hof{X} - \Hof{X \given Y} \\ \implies \Hof{X \given Y} &\le \Hof{X}. \end{aligned}\)
The fact that conditioning reduces entropy, \(\Hof{X} \ge \Hof{X \given Y}\), is an important property by itself and is reminiscent of the data processing inequality. The conditional entropy \(\Hof{X \given Y}\) quantifies the remaining uncertainty about \(X\) after observing \(Y\). If \(X\) and \(Y\) are independent, then \(\Hof{X} = \Hof{X \given Y}\), as knowing \(Y\) does not provide any information about \(X\). On the other hand, if \(Y\) completely determines \(X\), then \(\Hof{X \given Y} = 0\), as there is no remaining uncertainty about \(X\) once \(Y\) is known. In general, conditioning can only reduce the uncertainty about \(X\), but it does not necessarily reduce it to zero.
Let’s move on and consider the KL data processing inequality.
A similar DPI can be expressed for different distributions \(\pof{x}\) and \(\qof{x}\) of the same random variable and the KL divergence between them. This DPI states that if we evolve two distributions using the same transition function, they cannot become less similar. The KL divergence is sometimes also referred to as “relative entropy”, so we could also call this the “relative data processing inequality”.
This can be formalized for distributions \(\pof{x}\) and \(\qof{x}\) and a stochastic transition function \(X \overset{\fof{y \given x}}{\longrightarrow} Y\). Here, we use that such a stochastic mapping \(Y = \fof{X}\) is equivalent to having a probability (density) \(\fof{y \given x}\):
\[\Kale{\pof{X}}{\qof{X}} \ge \Kale{\pof{Y}}{\qof{Y}},\]where \(\pof{y \given x} = \fof{y \given x} = \qof{y \given x}\). The marginals after the transition are \(\pof{y} = \E{\pof{x}}{\fof{y \given x}}\) and \(\qof{y} = \E{\qof{x}}{\fof{y \given x}}\), so more explicitly:
\[\Kale{\pof{X}}{\qof{X}} \ge \Kale{\E{\pof{x}}{\fof{Y \given x}}}{\E{\qof{x}}{\fof{Y \given x}}}.\]In their book Elements of Information Theory, Cover and Thomas describe this as “relative entropy never increases” and relate it to the second law of thermodynamics.
As an example, let \(\pof{x}\) be the distribution of real images, \(\qof{x}\) the distribution of images produced by a generative model, and let \(\fof{y \given x}\) threshold each image into a black-and-white image.
Then \(\pof{y}\) and \(\qof{y}\) will be more difficult to distinguish after the thresholding operation than \(\pof{x}\) and \(\qof{x}\). Converting to black and white images has lost information that could help distinguish the real and generated distributions.
This provides some intuition for why the KL divergence between distributions decreases under a shared stochastic mapping, as formalized by the KL data processing inequality. Processing through \(\fof{y \given x}\) makes the distributions harder to tell apart.
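Here is a hedged numerical illustration (all numbers invented) of the KL DPI for discrete distributions pushed through the same stochastic channel; it also shows the equality case for a lossless (permutation) channel.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p_x = np.array([0.7, 0.2, 0.1])
q_x = np.array([0.2, 0.3, 0.5])
f_yx = np.array([[0.6, 0.4],        # f(y | x), one row per x, rows sum to 1
                 [0.5, 0.5],
                 [0.1, 0.9]])

p_y = p_x @ f_yx                    # marginal after the channel
q_y = q_x @ f_yx
print(kl(p_x, q_x), ">=", kl(p_y, q_y))  # the first value is never smaller

# Equality case: a deterministic bijection (here a permutation) loses nothing,
# so the divergences match exactly.
perm = np.eye(3)[[2, 0, 1]]
print(np.isclose(kl(p_x, q_x), kl(p_x @ perm, q_x @ perm)))  # True
```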
It might be tempting to think that this data processing inequality also applies to Bayesian inference, that is, updating the model parameters based on new evidence. We could then argue that if two agents start with different prior beliefs but update based on the same evidence, their posterior beliefs will become more similar. However, this intuition is flawed: the data processing inequality does not apply to Bayesian inference.
Let’s walk through why. Consider:
The priors \(\pof{\w}\) and \(\qof{\w}\) may have large divergence, representing very different initial beliefs. However, when conditioning on the same data \(x\), the KL divergence between \(\pof{\w \given x}\) and \(\qof{\w \given x}\) could increase or decrease—the data processing inequality does not give us any guarantee.
This is because \(\pof{\w}\) and \(\qof{\w}\) are not evolving under the same stochastic mapping. Rather, each prior is mapped to its respective posterior via Bayes’ rule, which operates differently on \(\opp\) and \(\opq\):
\[\begin{aligned} \pof{\w \given x} &= \frac{\pof{x \given \w}}{\pof{x}} \, \pof{\w}\\ \qof{\w \given x} &= \frac{\qof{x \given \w}}{\qof{x}} \, \qof{\w}. \end{aligned}\]Even assuming that both agents have the same internal model, that is they use the same likelihood \(\pof{x \given \w} = \qof{x \given \w}\), the priors \(\pof{\w}\) and \(\qof{\w}\) will still influence the posterior distributions differently because they lead to different evidence terms \(\pof{x}\) and \(\qof{x}\):
\[\begin{aligned} \pof{x} &= \E{\pof{\w}}{\pof{x \given \w}}\\ \qof{x} &= \E{\qof{\w}}{\qof{x \given \w}}. \end{aligned}\]Thus, the correct intuition is that observing the same data \(x\) does not necessarily bring the posterior beliefs closer together—they depend on the interplay between their specific priors and likelihoods. The data processing inequality does not directly apply to this Bayesian updating scenario:
\[\Kale{\qof{\W}}{\pof{\W}} {\color{red}{\not\ge}} \Kale{\qof{\W \given \mathcal{D}}}{\pof{\W \given \mathcal{D}}}.\]This counterexample highlights the importance of precisely understanding the assumptions underlying conceptual principles like the DPI. While the DPI provides insight about information dynamics in many cases, it does not universally apply, as exemplified here by Bayesian updating under different priors. As always, it pays to check the assumptions of a principle before applying it.
As we currently also seem to experience a world of increasing polarization, this counterexample might also serve as a reminder that different priors can lead to different beliefs, even when observing the same evidence. This is a fundamental aspect of Bayesian inference and the scientific method.
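To see this failure concretely, here is a toy numpy example of my own (two made-up hypotheses for a coin's bias) in which the KL divergence between the two agents' beliefs increases after both observe the same coin flip.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

thetas = np.array([0.1, 0.9])        # two hypotheses for P(heads)
p_prior = np.array([0.99, 0.01])     # agent p strongly favors the tails-biased coin
q_prior = np.array([0.5, 0.5])       # agent q is undecided

likelihood = thetas                  # both agents observe a single "heads"
p_post = p_prior * likelihood; p_post /= p_post.sum()
q_post = q_prior * likelihood; q_post /= q_post.sum()

print(kl(q_prior, p_prior))          # ~1.61 nats
print(kl(q_post, p_post))            # ~1.92 nats -- larger after updating!
```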
We will prove this inequality in two different ways. First, we will develop a “brute-force” proof, and then we will look at a more elegant proof that follows Cover and Thomas. Importantly, we will also consider the equality case in detail.
If \(\opp\) does not have support in \(\opq\), the inequality is trivially true because then \(\Kale{\pof{X}}{\qof{X}}=\infty\).
Thus, let’s now assume that \(\opp\) has support in \(\opq\). Then, we can brute-force using the definitions, starting from the cross-entropy:
\[\begin{aligned} \CrossEntropy{\pof{Y}}{\qof{Y}}&=\CrossEntropy{\pof{Y}}{\E{\qof{x}}{\pof{Y \given x}}}\\ &=\CrossEntropy{\pof{Y}}{\E{\qof{x}}{\frac{\pof{x \given Y}\pof{Y}}{\pof{x}}}}\\ &=\CrossEntropy{\pof{Y}}{\E{\pof{x \given Y}}{\frac{\qof{x}}{\pof{x}}}}+\CrossEntropy{\pof{Y}}{\pof{Y}}\\ &\overset{(1)}{=}\CrossEntropy{\pof{Y}}{\E{\pof{x \given Y}}{\frac{\qof{x}}{\pof{x}}}}+\xHof{\pof{Y}}\\ &\overset{(2)}{\le}\CrossEntropy{\pof{X, Y}}{\frac{\qof{X}}{\pof{X}}}+\xHof{\pof{Y}}\\ &\overset{(3)}{=}\CrossEntropy{\pof{X}}{\frac{\qof{X}}{\pof{X}}}+\xHof{\pof{Y}}\\ &\overset{(4)}{=}\Kale{\pof{X}}{\qof{X}}+\xHof{\pof{Y}}\\ \iff \Kale{\pof{Y}}{\qof{Y}}&\le\Kale{\pof{X}}{\qof{X}}, \end{aligned}\]where we have used (1) that the cross-entropy of a distribution with itself is just the entropy, (2) that the cross-entropy is convex and we can apply Jensen’s inequality, (3) that the right-hand side of the cross-entropy does not depend on \(Y\) and we can trivially marginalize it out, and (4) that the definition of the Kullback-Leibler divergence is equivalent to an (unnormalized) cross-entropy over a fraction.
This makes it difficult to extract the case for equality, however.
We have only one inequality in the above proof, and it stems from applying Jensen’s inequality. Remembering the equality case for Jensen’s inequality, we recall:
For (2), this is sadly slightly more complex than it might seem on first glance. Let’s unwrap the term:
\[\CrossEntropy{\pof{Y}}{\E{\pof{x \given Y}}{\frac{\qof{x}}{\pof{x}}}} = \E{\pof{y}}{-\log \E{\pof{x \given y}}{\frac{\qof{x}}{\pof{x}}}}.\]We take an expectation over \(\pof{y}\), so to characterize equality we need to consider each \(y\) with \(\pof{y} \not= 0\) separately and look at (almost) all \(x\) with \(\pof{x \given y} \not= 0\). \(-\log x\) is strictly convex—and thus not linear—so we need \(\frac{\qof{x}}{\pof{x}}\) to be constant in \(x\) for any fixed \(y\) with \(\pof{y} \not= 0\)—only then do we have equality in Jensen’s inequality.
In the following, I will limit myself to the discrete case to avoid having to deal with measure theory
This means that \(\qof{x} = C_y \pof{x}\) piecewise for all \(x\) for which \(\pof{x \given y} \not= 0\) for some fixed \(y\) with \(\pof{y} \not= 0\). That is if we keep \(y\) fixed, all the \(x\) for which \(\pof{x \given y} \not= 0\) have the same constant factor \(C_y\). Then for all \(y\) with \(\pof{y} \not= 0\), we have equality and overall equality in (2).
If for any \(x\) there are multiple \(y\), e.g. \(y_1, y_2\) for which \(\pof{x \given y} \not= 0\), then we have \(C_{y_1} = C_{y_2}\).
As the simplest example, if this is the case for all \(y\), then \(C_y = 1\) for all \(y\), because both \(\opp\) and \(\opq\) are normalized.
As a side-note, this is a great reason why we often require full support for distributions as we then can avoid these piecewise constant factors (and the headaches they might cause).
Cover and Thomas provide a beautifully simple proof:
What does this mean? Whereas \(\fof{y \given x}\) is the ‘forward’ transition function, \(\pof{x \given y}\) and \(\qof{x \given y}\) are the ‘backward’ transition functions. We only have equality when the backward transition functions are equal (almost everywhere).
The statement on equality is not very informative yet though, so we have to put in a bit more work. Again, this is written for the discrete case.
This time we explicitly use Bayes’ rule to connect the forward and backward transition functions. First, we have to fix \(y\) such that \(\pof{y} \not= 0\) (i.e. \(y\) is in the support of \(\pof{y}\)) and then \(\qof{y} \not=0\). We have:
\[\begin{aligned} \pof{x \given y} &= \qof{x \given y} \\ \overset{\text{ass. }\pof{y} \not= 0}{\iff} \frac{\fof{y \given x}\pof{x}}{\pof{y}} &= \frac{\fof{y \given x}\qof{x}}{\qof{y}} \\ \overset{\text{ass. }\fof{y \given x}\not= 0}{\iff} \frac{\pof{x}}{\pof{y}} &= \frac{\qof{x}}{\qof{y}} \\ \iff \pof{x} &= \frac{\pof{y}}{\qof{y}} \, \qof{x}. \end{aligned}\]For a given \(y\) with \(\pof{y} \not=0\), for the equality case, we see that for all \(x\) with \(\fof{y \given x} \not= 0\), \(\pof{x}\) and \(\qof{x}\) have to be coupled via piecewise constant factors.
As another example, if \(\fof{y \given x} \not=0\) (has full support) for all possible \(x\), for the equality case we have \(\pof{x} = \qof{x}\).
Compared to the previous equality case, we went a bit deeper and rewrote the conditions to consider the ratios between \(x\) and \(y\). Note we could have shown the same thing in the “brute-force” proof, too.
Altogether, we have seen that both \(x\) and \(y\) are modulated by the same constant factor between \(\pof{\cdot}\) and \(\qof{\cdot}\). Essentially, this tells us that we could split the support into unconnected sub-domains and examine each individually for the equality case.
We have the following overall statement:
(\(\pof{x} \ll \qof{x}\) means that \(\pof{x} > 0\) implies \(\qof{x} > 0\), or equivalently that \(\qof{x} = 0\) implies \(\pof{x} = 0\), so the KL divergence is not \(\infty\).) But more precisely, for \(\pof{x} \ll \qof{x}\), we have equality when:
\[\forall y, \pof{y} \not= 0 \exists C_y \in \mathbb{R}_{> 0} \forall x, \fof{y \given x}\not=0\colon \pof{x} = C_y \, \qof{x}.\]Now, we can use these ideas to derive a few additional results and even close the circle to the original data processing inequality.
The KL divergence is not a metric: the triangle inequality does not hold, and it is not symmetric.
However, we can symmetrize it to obtain the Jensen-Shannon divergence (JSD). The JSD is defined as the mean of the two KL divergences of the two distributions from their average. In essence, it makes the KL divergence symmetric:
\[\begin{aligned} \fof{x} &= \frac{\pof{x} + \qof{x}}{2}\\ \JSD{\pof{x}}{\qof{x}} &= \frac{1}{2} \Kale{\pof{x}}{\fof{x}} + \frac{1}{2} \Kale{\qof{x}}{\fof{x}}. \end{aligned}\]Similar approaches can be used to “symmetrize” other concepts; for example matrices: \(\frac{1}{2} A + \frac{1}{2} A^T\) is also symmetric by construction for any matrix \(A\).
The JSD is still not a metric, but the square root of the Jensen-Shannon divergence is symmetric and satisfies the triangle inequality, giving us the Jensen-Shannon distance, a metric.
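A minimal numpy sketch of these definitions for discrete distributions (helper names are mine, not from the post):

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    m = 0.5 * (p + q)                       # the average distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.9, 0.1, 0.0])
q = np.array([0.1, 0.1, 0.8])
print(jsd(p, q))            # symmetric and finite even without shared support
print(np.sqrt(jsd(p, q)))   # the Jensen-Shannon distance, a proper metric
```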
We can also obtain a data processing inequality for the Jensen-Shannon divergence and the Jensen-Shannon distance:
The proof uses the KL data processing inequality:
\[\begin{aligned} \JSD{\pof{X}}{\qof{X}} &= \frac{1}{2} \Kale{\pof{X}}{\fof{X}} + \frac{1}{2} \Kale{\qof{X}}{\fof{X}}\\ &\ge \frac{1}{2} \Kale{\pof{Y}}{\fof{Y}} + \frac{1}{2} \Kale{\qof{Y}}{\fof{Y}}\\ &= \JSD{\pof{Y}}{\qof{Y}}. \end{aligned}\]We verify \(\fof{y} = \frac{\pof{y} + \qof{y}}{2}\) is the average of \(\pof{y}\) and \(\qof{y}\):
\[\begin{aligned} \fof{y} &= \E{\fof{x}}{\fof{y \given x}}\\ &= \E{\frac{\pof{x}+\qof{x}}{2}}{\fof{y \given x}}\\ &= \frac{1}{2} \E{\pof{x}}{\fof{y \given x}} + \frac{1}{2} \E{\qof{x}}{\fof{y \given x}}\\ &= \frac{1}{2} \pof{y} + \frac{1}{2} \qof{y}. \end{aligned}\]Finally, \(\pof{x}, \qof{x} \ll \fof{x}\), and the equality condition of the KL data processing inequality gives us:
\[\begin{aligned} &\Kale{\pof{X \given Y}}{\fof{X \given Y}} = 0 &\\ \land \quad &\Kale{\qof{X \given Y}}{\fof{X \given Y}} = 0 &\\ \iff &\pof{x \given y} = \fof{x \given y} \land \qof{x \given y} = \fof{x \given y}& \forall x,y \\ \iff &\pof{x \given y} = \qof{x \given y}& \forall x,y. \end{aligned}\]The JSD can also be expressed as a mutual information. For \(\begin{aligned} Z &\sim \mathrm{Bernoulli}(\frac{1}{2}) = \fof{Z} \\ X \given Z = 0 &\sim \pof{x}\\ X \given Z = 1 &\sim \qof{x}, \end{aligned}\)
we have:
\[\JSD{\pof{X}}{\qof{X}} = \MIof{X;Z}.\]This follows from rewriting the mutual information as a KL divergence:
\[\begin{aligned} \MIof{X;Z} &= \Kale{\fof{X \given Z}}{\fof{X}}\\ &= \E{\fof{z}} {\Kale{\fof{X \given Z = z}}{\fof{X}}}\\ &= \frac{1}{2} \Kale{\pof{x}}{\fof{x}} + \frac{1}{2} \Kale{\qof{x}}{\fof{x}}\\ &= \JSD{\pof{X}}{\qof{X}}. \end{aligned}\]We can generalize this to the Markov chain \(Z \rightarrow X \rightarrow Y\) with \(\fof{z, x, y} = \fof{z} \fof{x \given z} \fof{y \given x}\) for any distribution \(\fof{z}\):
\[\begin{aligned} \MIof{X;Z} &= \Kale{\fof{X \given Z}}{\fof{X}}\\ &= \E{\fof{z}} {\Kale{\fof{X \given z}}{\fof{X}}}\\ &\overset{(1)}{\ge} \E{\fof{z}} {\Kale{\fof{Y \given z}}{\fof{Y}}}\\ &= \Kale{\fof{Y \given Z}}{\fof{Y}}\\ &= \MIof{Y;Z}, \end{aligned}\]where \((1)\) follows from the KL data processing inequality.
This is just the data processing inequality we presented initially. We have gone full circle!
The equality gap (Jensen gap) is \(\Kale{\fof{X \given Y, Z}}{\fof{X \given Y}}\), and we have equality when:
\[\begin{aligned} \Kale{\fof{X \given Y, Z}}{\fof{X \given Y}} &= 0\\ \iff \MIof{X;Z \given Y} &= 0. \end{aligned}\]This is exactly when \(X\) is independent of \(Z\) given \(Y\). (\(Y\) is a sufficient statistic in that case.)
So far we’ve explored the foundational aspects of the data processing inequality (DPI) and its extended forms, in particular the KL data processing inequality. Through detailed derivations and intuitive examples, we’ve demonstrated how these inequalities can be applied, emphasizing their significance and limitations. Specifically, we’ve shown how the KL data processing inequality relates to the reduction in information as data is processed. The examples and counterexample have hopefully demonstrated the nuances of applying these inequalities in different contexts.
This exploration sets the stage for diving into function-space variational inference and building up a robust understanding of it, leveraging the insights gained about the DPI and its implications in Bayesian deep learning.
In the following, we will consider a classification task with cross-entropy loss, using random variables and distributions for the inputs \(\x\), the labels \(\y\), the model parameters \(\w\), and the observed data \(\Dany\).
The probabilistic model is:
\[\pof{\y, \w \given \x} = \pof{\y \given \x, \w} \, \pof{\w}.\]As before, I use upper-case letters for random variables, which we take an expectation over, e.g. in the KL divergence, and lower-case letters when I’m referring to specific observations or values that could be substituted (with the exception of \(\Dany\)).
An important property of the KL divergence is the chain rule:
\[\begin{aligned} &\Kale{\qof{\Y_n,...,\Y_1}}{\pof{\Y_n,...,\Y_1}} \\ &\quad = \sum_{i=1}^n \Kale{\qof{\Y_i \given \Y_{i-1}, ..., \Y_1}}{\pof{\Y_i \given \Y_{i-1}, ..., \Y_1}}. \end{aligned}\]The chain rule yields a chain inequality for the DPI as well:
\[\begin{aligned} \Kale{\qof{\W}}{\pof{\W}} &\ge \Kale{\qof{\Y_n,...,\Y_1}}{\pof{\Y_n,...,\Y_1}}\\ &\ge \Kale{\qof{\Y_{n-1},...,\Y_1}}{\pof{\Y_{n-1},...,\Y_1}}\\ &\ge \Kale{\qof{\Y_1}}{\pof{\Y_1}}, \end{aligned}\]where we start from the KL DPI and then apply the chain rule.
The DPI has an intriguing connection to FSVI. Let’s say we want to approximate a Bayesian posterior \(\pof{\w \given \Dany}\) with a variational distribution \(\qof{\w}\). In standard VI, we would minimize \(\Kale{\qof{\W}}{\pof{\W \given \Dany}}\) to match the variational distribution to the Bayesian posterior. Specifically:
\[\begin{aligned} &\Kale{\qof{\W}}{\pof{\W \given \Dany}} \\ &\quad = \underbrace{\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\W}}{\pof{\W}}}_{\text{Evidence}\ \text{Bound}} + \log \pof{\Dany} \ge 0 \\ &\iff \underbrace{-\log \pof{\Dany}}_{=\xHof{\pof{\Dany}}} \le \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\W}}{\pof{\W}}. \end{aligned}\]This is an information-theoretic evidence (upper) bound on the information content \(-\log \pof{\Dany}\) of the data \(\Dany\) under the variational distribution \(\qof{\w}\), which we can minimize as an objective to approximate \(\pof{\w \given \Dany}\) via \(\qof{\w}\).
In more probability-theory inspired literature, the negative of this bound is called the evidence lower bound (ELBO) and is maximized.
The ELBO and the information-theoretic evidence upper bound are equivalent, and we can use either objective, but the information-theoretic perspective is obviously superior. 🙃 I’ll refer to this as the evidence bound from now on.
In FSVI (with a caveat I detail below), we apply the DPI to the prior KL divergence term and obtain a “functional” version of the evidence bound:
\[\begin{aligned} \Kale{\qof{\W}}{\pof{\W}} \ge \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}, \end{aligned}\]where \(\Y... \given \x...\) are (finite or infinite) sets of samples. That is, we do not only optimize marginal distributions but also joint distributions.
The resulting objective:
\[\begin{aligned} \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}} \end{aligned}\]is equal to the (negative) functional ELBO (fELBO) in “Functional variational Bayesian neural networks” by Sun et al. (2019)
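As a heavily simplified, purely illustrative sketch of this kind of objective (my own toy setup, not the implementation of the cited papers): a tiny Bayesian logistic-regression "network" where the expected negative log-likelihood under \(\qof{\w}\) and a *marginal* predictive KL at a handful of context inputs are both estimated by Monte Carlo. Real FSVI implementations match joint predictive distributions and use linearization or other estimators; every name and number below is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

D_in, S = 2, 2000                          # input dim, Monte Carlo samples
X_train = rng.normal(size=(8, D_in)); y_train = (X_train[:, 0] > 0).astype(float)
X_ctx = rng.normal(size=(16, D_in))        # "context" inputs x... for the predictive KL

mu, log_sigma = rng.normal(size=D_in) * 0.1, np.full(D_in, -1.0)   # variational q(w)

w_q = mu + np.exp(log_sigma) * rng.normal(size=(S, D_in))   # samples from q(w)
w_p = rng.normal(size=(S, D_in))                            # samples from the prior p(w)

# Expected negative log-likelihood E_q(w)[-log p(D | w)].
probs_train = sigmoid(w_q @ X_train.T)                      # (S, N_train)
nll = -np.mean(np.sum(y_train * np.log(probs_train)
                      + (1 - y_train) * np.log(1 - probs_train), axis=1))

# Marginal predictive KL at the context points, KL(q(Y|x) || p(Y|x)).
q_pred = sigmoid(w_q @ X_ctx.T).mean(axis=0)                # q(y=1 | x)
p_pred = sigmoid(w_p @ X_ctx.T).mean(axis=0)                # p(y=1 | x)
pred_kl = np.sum(q_pred * np.log(q_pred / p_pred)
                 + (1 - q_pred) * np.log((1 - q_pred) / (1 - p_pred)))

print("functional evidence bound (MC estimate):", nll + pred_kl)
```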
One important detail is the question of how to choose the \(\x...\):
Ideally, we want to choose them such that the DPI inequality is as tight as possible.
Given the chain inequality, it is obvious that the larger the set \(\x...\), the tighter the inequality will be. Hence, if we could choose an infinite set of points well, we might be able to get the tightest possible inequality. However, this might not be tractable, and in practice, it is often not.
Some works take a supremum over finite subsets of a certain size, essentially building a core-set as an approximation (Rudner et al., 2022a
We will discuss the tightness of the inequality and the implications in the data limit below.
Focusing on the most important aspect of FSVI, we observe:
When we directly optimize the KL divergence on a finite input dataset, for example, we align \(\opq\) with the prior of \(\opp\) where it matters most: on the predictions of the observed data.
This is of particular interest in continual learning, where the prior for the next task is chosen to be the posterior from the previous task. In this case, the functional ELBO can be used to approximate the posterior of the previous model while incorporating new data.
For two great papers that are very readable and provide further insights, see “Continual learning via sequential function-space variational inference“
In practice, both works by Rudner et al. (2022) linearize the logits
which in my notation is equivalent to the first application of the DPI above:
\[\Kale{\qof{\L...\given \x...}}{\pof{\L...\given \x...}} \le \Kale{\qof{\W}}{\pof{\W}}.\]They maximize the fELBO objective:
\[\begin{aligned} \mathcal{F}\left(q_{\boldsymbol{\Theta}}\right) &=\mathbb{E}_{q_{f\left(\mathbf{x}_{\mathcal{D}} ; \boldsymbol{\Theta}\right)}}\left[\log p_{\mathbf{y} \mid f(\mathbf{X} ; \boldsymbol{\Theta})}\left(\mathbf{y}_{\mathcal{D}} \mid f\left(\mathbf{X}_{\mathcal{D}} ; \boldsymbol{\theta}\right)\right)\right]\\ &\quad -\sup _{\mathbf{X} \in \mathcal{X}_{\mathbb{N}}} \mathbb{D}_{\mathrm{KL}}\left(q_{f(\mathbf{X} ; \boldsymbol{\Theta})} \| p_{f(\mathbf{X} ; \boldsymbol{\Theta})}\right), \end{aligned}\]which is equivalent to minimizing the information-theoretic objective:
\[\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\L... \given \x...}}{\pof{\L... \given \x...}},\]if we choose the \(\x...\) to tighten the DPI inequality as much as possible (i.e. by “finding” the supremum).
Using the inequality chain from above, we can sandwich their objective between a regular (negative) ELBO and the (negative) functional ELBO we have derived above:
\[\begin{aligned} &\E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\W}}{\pof{\W}} \\ &\quad \ge \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\L... \given \x...}}{\pof{\L... \given \x...}} \\ &\quad \ge \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}. \end{aligned}\]Why are they using logits instead of probabilities? In practice, using the probabilities instead of logits when performing linearization is often cumbersome due to the non-linearity of the softmax functions, which requires Monte-Carlo sampling of the logits to obtain an approximation of the final probabilities. Furthermore, I speculate that sampling the logits can be more benign given that we often use ReLUs in the underlying neural networks. (Don’t quote me too strongly on this, though.)
Conceptually, this explains the derivation of their ELBO objective and also relates them to the ‘purer’ and simpler functional evidence bound derived above, but this raises the question of how these inequalities are different and what the gap between them tells us. Let’s address this question next.
When do we have equality? That is, when do we have:
\[\Kale{\qof{\W}}{\pof{\W}} = \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}?\]And what does it tell us?
As we have seen in the first part of this post, we have equality in the DPI if and only if:
\(\Kale{\qof{\W \given \Y..., \x...}}{\pof{\W \given \Y..., \x...}}=0\).
Given that we are trying to approximate the Bayesian posterior \(\pof{\w \given \Y..., \x...}\) using \(\qof{\w}\), this equality condition tells us that we would have to find the exact posterior for equality. Hence, it is unlikely that we will have equality in practice. From this, the next question immediately follows: what does this predictive prior term
\[\Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}\]provide us with?
Another way to think about the gap between the two KL divergences is that one is parameter-based and the other one is not. This points to a deeper truth about overparameterized models used in deep learning: many different parameter configurations can implement exactly the same predictive function.
The functional KL divergences won’t be affected by this as they are parameter-free and do not take into account the parameters of the model but only the predictions. The regular parameter-based KL divergence, however, would be affected by this—depending on the prior \(\pof{\w}\), they might express differences between the parameter distributions that have no effect on the outputs.
In other words, if the prior assigns different probability to otherwise equivalent parameters, this obviously changes the parameter posterior, while the outputs are invariant to these changes if the overall assigned probability to a given output remains the same.
For example, the paper “Deep Ensembles: A Loss Landscape Perspective” by Fort et al. (2020)
Unless there are other considerations, it makes sense to use priors that assign the same density to parameters that are equivalent. Hence, for a given function \(\fof{\x ; \w}\), which determines the likelihood \(\pof{\y \given \x, \w} \triangleq \pof{y \given \fof{\x ; \w}}\), we can define an equivalence relation such that \(\w \sim \w'\) if and only if \(\fof{\x; \w} = \fof{\x; \w'}\) for all \(\x\). This equivalence relation partitions the parameter space into equivalence classes:
\[[\w] \triangleq \{\w' : \fof{x ; \w'} = \fof{x ; \w} \quad \forall x \}.\]A prior \(\pof{\w}\) induces a prior \(\hpof{[\w]}\) over the equivalence classes:
\[\hpof{[\w]} \triangleq \sum_{\w' \in [\w]} \pof{\w'}.\]—or \(\int_{[\w]} \pof{\w'} \, d \w'\) for continuous \(\w\)—with the corresponding model:
\[\begin{aligned} \hpof{\y, [\w] \given \x} &\triangleq \hpof{\y \given \x, [\w]} \, \hpof{[\w]} \\ &= \pof{\y \given \x, \w} \, \hpof{[\w]}. \end{aligned}\]Importantly, the definition of the equivalence classes above is consistent with Bayesian inference:
This is easy to show with using Bayes’ rule:
\[\begin{aligned} \hpof{[\w] \given \Dany} &= \hpof{\Dany \given [\w]} \, \hpof{[\w]} / \hpof{\Dany} \\ &= \pof{\Dany \given \w} \sum_{\w' \in [\w]} \pof{\w'} / \hpof{\Dany} \\ &= \sum_{\w' \in [\w]} \pof{\Dany \given \w'} \, \pof{\w'} / \hpof{\Dany} \\ &= \sum_{\w' \in [\w]} \pof{\w' \given \Dany} \, \pof{\Dany} / \hpof{\Dany} \\ &= \sum_{\w' \in [\w]} \pof{\w' \given \Dany}. \end{aligned}\]The last step follows from \(\hpof{\Dany}=\pof{\Dany}\):
\[\begin{aligned} \hpof{\Dany} &= \sum_{[\w]} \hpof{\Dany, [\w]} \\ &= \sum_{[\w]} \sum_{\w' \in [\w]} \pof{\Dany, \w'} \\ &= \sum_{\w'} \pof{\Dany, \w'} \\ &= \pof{\Dany}. \end{aligned}\]This also tells us that, for any \(\x\) and \(\y\):
\(\pof{\y... \given \x...} = \hpof{\y... \given \x...}\).
Given this consistency, we don’t have to differentiate between \(\hat\opp\) and \(\opp\) and can use \(\opp\) interchangeably. The same holds for \(\opq\).
We can view \([\w]\) as a projection from \(\w\) to its equivalence class \([\w]\). The DPI then gives us:
\[\Kale{\qof{\W}}{\pof{\W}} \ge \Kale{\qof{[\W]}}{\pof{[\W]}}.\]And again: what does the gap between the two terms tell us?
Let’s look at a few examples to get a better understanding of this.
Let \(\fof{\x ; \w} = 0\) independently of \(\w\). Then \([\w] = [\w']\) for any \(\w\), \(\w'\).
For any approximate distribution \(\qof{\w}\), the induced \(\Kale{\qof{[\W]}}{\pof{[\W]}}=0\), while \(\Kale{\qof{\W}}{\pof{\W}}\) also includes superfluous divergence.
Let \(\y \given (\w_1, \w_2) = \w_1\) deterministic but independent of \(\w_2\). Then \([(\w_1, \w_2)] = [(\w_1, {\w'}_2)]\) for any \({\w'}_2\) and \([(\w_1,*)]\not=[({\w'}_1, *)]\) for any \(\w_1 \not= \w'_1\).
\(\Kale{\qof{[\W]}}{\pof{[\W]}}=\Kale{\qof{\W_1}}{\pof{\W_1}}\) captures the meaningful divergence between approximate and true distribution, while \(\Kale{\qof{\W}}{\pof{\W}}\) also includes any divergence across \(\w_2\) that has no effect on the predictions.
Finally, let’s assume that the predictions are periodic in some way. That is, for example \(\y = \sin \w\). We then have \([\w] = [\w + 2\pi]\).
Further, let \(\pof{\w} = \operatorname{U}(\w; [0,2\pi \, N))\) for some \(N\) that determines the number of periods. Then, if we introduce another random variable \(K\), that captures which period we are in, we can (again) use the chain rule to write:
\[\begin{aligned} \Kale{\qof{\W}}{\pof{\W}} &= \Kale{\qof{\W \given \W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \given \W \in [K\,2\pi, (K+1)\,2\pi]}} \\ &\quad + \Kale{\qof{\W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \in [K\,2\pi, (K+1)\,2\pi]}} \\ &= \Kale{\qof{[\W]}}{\pof{[\W]}} \\ &\quad + \Kale{\qof{\W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \in [K\,2\pi, (K+1)\,2\pi]}}. \end{aligned}\]This follows from the setup of this specific example. Finally, we have:
\[\Kale{\qof{\W \in [K\,2\pi, (K+1)\,2\pi]}}{\pof{\W \in [K\,2\pi, (K+1)\,2\pi]}} \le \log N.\]So, if \(\opq\) only had support in a single period for example, the difference between \(\Kale{\qof{\W}}{\pof{\W}}\) and \(\Kale{\qof{[\W]}}{\pof{[\W]}}\) would be \(\log N\): the redundancy.
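A quick discrete sanity check of my own (a toy discretization of this periodic example) confirms that, when \(\opq\) is supported on a single period, the parameter-space KL exceeds the equivalence-class KL by exactly \(\log N\).

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

N_periods, bins_per_period = 5, 100
p_w = np.full(N_periods * bins_per_period, 1.0 / (N_periods * bins_per_period))  # uniform prior
q_w = np.zeros_like(p_w)
q_w[:bins_per_period] = 1.0 / bins_per_period        # q lives in the first period only

# Equivalence classes identify w with w shifted by whole periods:
# marginalize over the period index.
p_eq = p_w.reshape(N_periods, bins_per_period).sum(axis=0)
q_eq = q_w.reshape(N_periods, bins_per_period).sum(axis=0)

print(kl(q_w, p_w) - kl(q_eq, p_eq), np.log(N_periods))   # both ~1.609
```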
How does the predictive prior term fit into this? The DPI again yields the answer, since the predictions are obtained by (stochastically) processing the equivalence classes:

\[\Kale{\qof{[\W]}}{\pof{[\W]}} \ge \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}}.\]
This tells us that the predictive prior term can at best measure the KL divergence between the equivalence classes of the parameters—and not between the parameters itself—but luckily, this is the more meaningful divergence anyway!
For the equality cases, we observe that:
For 2.: we know from the chain rule that
\[\Kale{\qof{\Y_n,...,\Y_1\given\x_n,...,\x_1}}{\pof{\Y_n,...,\Y_1\given\x_n,...,\x_1}}\]is monotonically increasing in \(n\); since it is bounded from above by \(\Kale{\qof{[\W]}}{\pof{[\W]}}\), it must converge
To give intuition that it might do that, and without attempting to prove this formally, we can appeal to the Bernstein-von Mises theorem. It states that, as the number of data points tends to infinity, the posterior distribution of the parameters converges to a Gaussian distribution centered at the maximum likelihood estimate (MLE), as long as the model parameters are identifiable, that is, the true parameters we want to learn are unique and have support under the prior.
For the evidence bound to be meaningful, we already know that we need support of the approximate distribution \(\opq\) in the prior \(\opp\)—otherwise, the LHS is \(\infty\). Moreover, realizing that we take an expectation over \(\qof{\Y_n ,..., \Y_1 \given \x_n ,..., \x_1}\), we can decompose the KL term for the gap as:
\[\begin{aligned} &\Kale{\qof{[\W] \given \Y_n,\x_n,...,\Y_1,\x_1}}{\pof{[\W] \given \Y_n,\x_n,...,\Y_1,\x_1}} \\ &\quad = \E{\qof{\y_n,...,\y_1\given\x_n,...,\x_1}}{\Kale{\qof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}{\pof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}} \\ &\quad = \simpleE{\qof{[\w']}}{\E{\qof{\y_n,...,\y_1\given\x_n,...,\x_1, [\w']}}{\Kale{\qof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}{\pof{[\W]\given \y_n, \x_n, ..., \y_1, \x_1}}}}. \end{aligned}\]That is, we sample a \([\w'] \sim \qof{[\w']}\) and then sample \(\y_n,...,\y_1\given\x_n,...,\x_1\) from the corresponding \(\qof{\y_n,...,\y_1\given\x_n,...,\x_1, [\w']}\) and marginalize over these. Crucially, \([\w']\) are the true parameters of the data-generating process for the inner KL divergence term. We thus take an expectation over KL terms fulfilling the conditions of the Bernstein-von Mises theorem:
\[\begin{aligned} \Kale{\qof{[\W] \given \y_n,\x_1...\y_1, \x_1}}{\pof{[\W] \given \y_n,\x_1...\y_1, \x_1}} \to 0. \end{aligned}\]In other words, for a given \([w']\), in the space of equivalence classes as defined previously, the equivalence class of all MLE solutions in the data limit, \([MLE]\), will be unique by definition—the model is identifiable—and match \([\w']\)
(Again, this is not a formal proof but an intuition for why the gap might close in the data limit.)
In my opinion, this is a great result. We have shown both that the predictive prior term converges given our assumptions and that it converges to the symmetry-free parameter-based divergence in the data limit. This is a strong argument for the predictive prior term being meaningful and not just a technical trick.
Let’s appreciate one more thing: the predictive prior term can be evaluated on infinitely many data points and still converge to a finite value.
What is the advantage of this all?
In Bayesian deep learning, we often use parameter priors that are not meaningful and that do not take parameter symmetries into account. For example, a unit Gaussian prior over the parameters of a neural network treats parameters as distinct even though different parameter settings do not necessarily induce different predictions. While this prior can be sensible from a parameter compression perspective (e.g. see Hinton and van Camp (1993)
With function priors and predictive priors, we can specify more meaningful priors because we can focus on the predictions and ignore the parameters. More importantly, this connects Bayesian approaches to data augmentation and other regularization techniques as we will see next.
Given that priors over equivalence classes are difficult to express explicitly though, using the DPI to obtain a functional ELBO can be an easier way to express and approximate them.
All this also helps us gain a new perspective on label entropy regularization. The functional evidence bound can be lower-bounded using the chain rule by:
\[\begin{aligned} \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \Kale{\qof{\Y... \given \x...}}{\pof{\Y... \given \x...}} \\ \ge \E{\qof{\w}}{-\log \pof{\Dany \given \w}} + \E{\pdata{\x}}{\Kale{\qof{\Y \given \x}}{\pof{\Y \given \x}}}, \end{aligned}\]where we can expand the term under the second expectation to:
\[\Kale{\qof{\Y \given \x}}{\pof{\Y \given \x}}=\CrossEntropy{\qof{\Y \given \x}}{\pof{\Y \given \x}} - \xHof{\qof{\Y \given \x}}.\]Assuming that our prior yields a uniform distribution over the labels, we can drop the cross entropy term because it is constant and obtain:
\[\E{\qof{\w}}{-\log \pof{\Dany \given \w}} - \E{\pdata{\x}}{\xHof{\qof{\Y \given \x}}}.\]This is the same as an MLE minimization objective with an additional entropy regularization term \(-\xHof{\qof{\Y \given \x}}\) for different \(\x\) that prevents the model from overfitting to the labels and collapsing to the one-hot encoding of the labels.
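A small numpy sketch of this simplified objective (names and the deterministic-logit simplification are mine, for illustration only): cross-entropy on the labeled data minus the average predictive entropy on additional inputs, which discourages collapsing to one-hot predictions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def objective(logits_labeled, labels, logits_unlabeled):
    probs_l = softmax(logits_labeled)
    nll = -np.mean(np.log(probs_l[np.arange(len(labels)), labels]))     # fit the labels
    probs_u = softmax(logits_unlabeled)
    entropy = -np.mean(np.sum(probs_u * np.log(probs_u + 1e-12), axis=-1))
    return nll - entropy        # minimize: low NLL, but keep predictions uncertain elsewhere

rng = np.random.default_rng(0)
print(objective(rng.normal(size=(4, 3)), np.array([0, 2, 1, 0]), rng.normal(size=(10, 3))))
```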
Thus, in the simplest approximation, the DPI and functional variational inference give us a new perspective on label entropy regularization.
Obviously, assuming non-uniform prior predictions, \(\E{\pdata{\x}}{\Kale{\qof{\Y \given \x}}{\pof{\Y \given \x}}}\) can be related to knowledge distillation in deep neural networks as introduced by Hinton et al. (2015)
The main technical difference is that knowledge distillation uses the reverse KL divergence instead of the forward KL divergence, while the conceptual difference is that we are not distilling knowledge from a teacher model but from the prior, which we downweight while also training our model on the data itself. However, the connection between knowledge distillation and continual learning using informative priors is evident.
In this blog post, we took a deep dive into the data processing inequality (DPI) and its surprisingly far-reaching implications for modern Bayesian deep learning. By carefully examining the assumptions, equality conditions, and chain rule of the DPI, we arrived at an intuitive understanding of why function-space variational inference (FSVI) can be such a powerful tool. The DPI perspective illuminates how FSVI side-steps issues with high-dimensional parameter spaces by focusing on matching Bayesian predictive posteriors.
Reasoning about parameter equivalence classes under the lens of the DPI, we saw how predictive KL divergences can capture meaningful differences between models while ignoring superficial discrepancies due to symmetries. This provides a fresh perspective on the advantages of predictive priors over standard parameter priors commonly used in Bayesian neural networks.
While our treatment only scratched the surface of the full mathematical story, the intuitions we developed allowed us to re-derive key results from the literature and uncover deep connections between seemingly disparate methods like entropy regularization, continual learning, and knowledge distillation. The examples and proofs peppered throughout solidified the core concepts.
More than a bag of technical tricks, the DPI reveals itself to be a powerful conceptual tool for reasoning about models, objectives, and algorithms. I hope this post inspires the reader to seek the fundamental principles underpinning machine learning innovations and to use those principles as a guide for future research. With a solid grasp of foundational tools like the DPI, we can all contribute to demystifying and unifying the rapidly evolving field of Bayesian deep learning.
Acknowledgements. Many thanks to Freddie Bickford Smith for very helpful comments and feedback on this post and to Tim Rudner for additional pointers to relevant literature and feedback on the FSVI section in particular 🤗
Normalizing Flows (NF) enable the construction of complex probability distributions by transforming a simple, known distribution into a more complex one. They do so by leveraging the change of variables formula, defining a bijection from the simple distribution to the complex one.
For a long time, flows were built by chaining several differentiable and invertible transformations. However, these diffeomorphic transformations limit the flows in their complexity, as the individual transformations have to be simple, and they also lead to a trade-off between sampling speed and evaluation performance.
In the following sections, CNFs and Flow Matching are explained. After that, the empirical results of Flow Matching are presented. Finally, the application of Flow Matching to Simulation-Based Inference is discussed, which highlights its wide applicability and consistent improvements.
Continuous normalizing flows (CNFs) are among the first applications of neural ordinary differential equations (ODEs): instead of a discrete chain of transformations, the flow is defined by integrating a time-dependent vector field, \(\frac{dx(t)}{dt} = f_{\theta}(x(t), t)\).
The vector field is typically parameterized by a neural network. While traditional layer-based flow architectures need to impose special architectural restrictions to ensure invertibility, CNFs are invertible as long as the uniqueness of the solution of the ODE is guaranteed. This is for instance the case if the vector field is Lipschitz continuous in \(x\) and continuous in \(t\). Many common neural network architectures satisfy these conditions. Hence, the above equation defines a diffeomorphism \(\phi_t(x_0) = x_0 + \int_0^t f_{\theta}(x(s), s)\, ds\) under the discussed assumptions. The change of variables formula can be applied to compute the density of a distribution that is transformed by \(\phi_t\).
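As a rough illustration (not the training procedure of any particular paper), the following sketch parameterizes \(f_\theta\) with a small MLP and approximates \(\phi_t\) with explicit Euler steps; in practice an adaptive ODE solver would be used, and all names and sizes here are assumptions:

```python
import torch
import torch.nn as nn

class VectorField(nn.Module):
    """Hypothetical parameterization of f_theta(x, t)."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.Tanh(), nn.Linear(hidden, dim)
        )

    def forward(self, x, t):
        # Time enters as an additional input feature.
        t = t.expand(x.shape[0], 1)
        return self.net(torch.cat([x, t], dim=-1))

def flow(vector_field, x0, n_steps=100):
    # phi_t(x0) = x0 + \int_0^t f_theta(x(s), s) ds, here with Euler steps.
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((1, 1), i * dt)
        x = x + dt * vector_field(x, t)
    return x  # approximately distributed as p_1 once f_theta is trained
```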
As usual, a CNF is trained to transform a simple base distribution \(p_B\), usually a standard normal distribution, into a complex data distribution \(p_D\). For each point in time \(t\in[0,1]\) the time-dependent vector field defines a distribution \(p_t\) (probability path) and the goal is to find a vector field \(f_\theta\) such that \(p_1=p_D\). This is usually achieved by maximum likelihood training, i.e. by minimizing the negative log-likelihood of the data under the flow.
While CNFs are very flexible, they are also computationally expensive to train naively with maximum likelihood since the flow has to be integrated over time for each sample. This is especially problematic for large datasets which are needed for the precise estimation of complex high-dimensional distributions.
The authors of the Flow Matching paper address this problem by regressing the vector field directly, so that no ODE has to be integrated during training.
Assuming that the target vector field is known, they propose a loss function that directly regresses the time-dependent vector field:
\[L_{\textrm{FM}}(\omega) = \mathbb{E}_{t, p_t(x)}(|f_{\omega}(x, t) - u_t(x)|^2),\]where \(u_t\) is a vector field that generates \(p_t\) and the expectation with respect to \(t\) is over a uniform distribution. Unfortunately, the loss function is not directly applicable because we do not know how to define the target vector field. However, it turns out that one can define appropriate conditional target vector fields when conditioning on the outcome \(x_1\):
\[p_t(x) = \int p_t(x|x_1)p_{D}(x_1)d x_1.\]Using this fact, the conditional flow matching loss can be defined; it yields the same gradients as the flow matching loss:
\[L_{\textrm{CFM}}(\omega) = \mathbb{E}_{t, p_t(x|x_1), p_D(x_1)}(|f_{\omega}(x, t) - u_t(x|x_1)|^2).\]Finally, one can easily obtain an unbiased estimate for this loss if samples from \(p_D\) are available, \(p_t(x|x_1)\) can be efficiently sampled, and \(u_t(x|x_1)\) can be computed efficiently. We discuss these points in the following.
The vector field that defines a probability path is usually not unique. This is often due to invariance properties of the distribution, e.g. rotational invariance. The authors focus on the simplest possible vector fields to avoid unnecessary computations. They choose to define conditional probability paths that maintain the shape of a Gaussian throughout the entire process. Hence, the conditional probability paths can be described by a variable transformation \(\phi_t(x \mid x_1) = \sigma_t(x_1)x + \mu_t(x_1)\). The time-dependent functions \(\sigma_t\) and \(\mu_t\) are chosen such that \(\sigma_0(x_1) = 1\) and \(\sigma_1(x_1) = \sigma_\text{min}\) (chosen sufficiently small), as well as \(\mu_0(x_1) = 0\) and \(\mu_1(x_1)=x_1\). The corresponding probability path can be written as
\[p_t(x|x_1) = \mathcal{N}(x; \mu_t(x_1), \sigma_t(x_1)^2 I).\]In order to train a CNF, it is necessary to derive the corresponding conditional vector field. An important contribution of the authors is therefore the derivation of a general formula for the conditional vector field \(u_t(x|x_1)\) for a given conditional probability path \(p_t(x|x_1)\) in terms of \(\sigma_t\) and \(\mu_t\):
\[u_t(x\mid x_1) = \frac{\sigma_t'(x_1)}{\sigma_t(x_1)}(x-\mu_t(x_1)) + \mu_t'(x_1),\]where \(\sigma_t'\) and \(\mu_t'\) denote derivatives with respect to time \(t\).
They show that it is possible to recover certain diffusion training objectives with this choice of conditional probability paths. For example, the variance-preserving diffusion path with noise scaling function \(\beta\) is given by:
\[\begin{align*} \phi_t(x \mid x_1) &= \sqrt{1-\alpha_{1-t}^2}\,x + \alpha_{1-t}x_1 \\ \alpha_{t} &= \exp\left(-\frac{1}{2}\int_0^t \beta(s)\, ds\right) \end{align*}\]Additionally, they propose a novel conditional probability path based on optimal transport, which linearly interpolates between the base and the conditional target distribution:
\[\phi_t(x \mid x_1) = (1-(1-\sigma_{\text{min}})t)x + tx_1\]The authors argue that this choice leads to more natural vector fields, faster convergence and better results.
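Putting the pieces together, here is a hedged sketch of one conditional flow matching loss evaluation with the optimal transport path; `model` can be any network mapping \((x, t)\) to a vector field value (for instance the VectorField sketch above), and the value of `sigma_min` is illustrative:

```python
import torch

def cfm_ot_loss(model, x1, sigma_min=1e-4):
    # t ~ U[0, 1], x0 ~ N(0, I)
    t = torch.rand(x1.shape[0], 1)
    x0 = torch.randn_like(x1)

    # OT conditional path: x_t = (1 - (1 - sigma_min) t) x0 + t x1
    xt = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1

    # Conditional target field for this path:
    # u_t(x | x1) = (x1 - (1 - sigma_min) x) / (1 - (1 - sigma_min) t)
    ut = (x1 - (1.0 - sigma_min) * xt) / (1.0 - (1.0 - sigma_min) * t)

    # Regress the parameterized vector field onto the conditional target.
    return ((model(xt, t) - ut) ** 2).mean()
```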
The authors investigate the utility of Flow Matching in the context of image datasets, employing CIFAR-10 and ImageNet at different resolutions. Ablation studies are conducted to evaluate the impact of choosing between standard variance-preserving diffusion paths and optimal transport (OT) paths in Flow Matching. The authors explore how directly parameterizing the generating vector field and incorporating the Flow Matching objective enhances sample generation.
The findings are presented through a comprehensive evaluation using various metrics such as negative log-likelihood (NLL), Fréchet Inception Distance (FID), and the number of function evaluations (NFE). Flow Matching with OT paths consistently outperforms other methods across different resolutions.
The study also delves into the efficiency aspects of Flow Matching, showcasing faster convergence during training and improved sampling efficiency, particularly with OT paths.
Additionally, conditional image generation and super-resolution experiments demonstrate the versatility of Flow Matching, achieving competitive performance in comparison to state-of-the-art models. The results suggest that Flow Matching presents a promising approach for generative modeling with notable advantages in terms of model efficiency and sample quality.
A particularly interesting application of density estimation, and hence of Normalizing Flows, is Simulation-Based Inference (SBI). In SBI, Normalizing Flows are used to estimate the posterior distribution of model parameters given some observations. Important factors here are the sample efficiency, scalability, and expressivity of the density model. Especially for the latter two, Flow Matching has been shown to yield an improvement. This is due to the efficient transport between source and target density and the flexibility of the more complex transformations allowed by continuous normalizing flows. To start out, a brief introduction to SBI is given, as not everyone might be familiar with this topic.
In many practical scenarios, the likelihood function of a model is intractable and cannot be described analytically. This might be the case where the forward model is a complex or proprietary simulation, or even a physical experiment.
In order to formalize the method, let \(\theta \sim \pi(\theta)\) denote the parameters of a system and their prior distribution. The system under evaluation and the observations obtained from it are denoted by \(x = \mathcal{M}(\theta)\). To sample from the joint distribution \(p(\theta, x)\), a parameter \(\theta_i\) is sampled from the prior and the observation is obtained by evaluating the forward model on that parameter, \(x_i = \mathcal{M}(\theta_i)\). Repeating this, a dataset of samples from the joint distribution can be generated, \(\mathcal{X} = \{ (\theta_i, x_i) \}^N_{i=1}\). A density estimator is then fitted on this dataset in order to estimate the desired distribution, e.g. directly the posterior \(q_{\omega}(\theta \mid x) \approx p(\theta \mid x)\).
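A minimal sketch of this data-generating loop, assuming user-provided stand-ins `prior_sample` for \(\pi(\theta)\) and `simulator` for the forward model \(\mathcal{M}\) (the toy Gaussian prior and noisy simulator below are purely illustrative):

```python
import numpy as np

def make_sbi_dataset(prior_sample, simulator, n):
    thetas, xs = [], []
    for _ in range(n):
        theta = prior_sample()   # theta_i ~ pi(theta)
        x = simulator(theta)     # x_i = M(theta_i), possibly stochastic
        thetas.append(theta)
        xs.append(x)
    # A density estimator q_omega(theta | x) is then fitted on these pairs.
    return np.stack(thetas), np.stack(xs)

# Toy usage with a standard normal prior and a noisy quadratic "simulator".
rng = np.random.default_rng(0)
theta_data, x_data = make_sbi_dataset(
    prior_sample=lambda: rng.normal(size=2),
    simulator=lambda th: th**2 + 0.1 * rng.normal(size=2),
    n=1000,
)
```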
The interested reader is referred to the literature for a more detailed treatment of SBI.
An approach that uses the Flow Matching formulation to fit the density estimator is presented by Dax et al.
The important detail to note here is that the loss is adapted to be minimized over samples drawn from the joint distribution, as described in the general section on SBI above. To do so, the expectation is taken w.r.t. \(\theta_1 \sim p(\theta), x \sim p(x \vert \theta_1)\), which yields exactly such samples.
Another adaptation by the authors is to exchange the uniform distribution over time for a general distribution \(t \sim p(t)\). We will not discuss the effects of this substitution in depth here. However, adapting the distribution makes intuitive sense, as training gets harder close to the target distribution; focusing on time steps \(t\) closer to one is therefore beneficial, as the authors have also found in their empirical studies.
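For illustration only, such a non-uniform time distribution could be realized as below; the Beta(3, 1) choice merely concentrates mass near \(t = 1\) and is an assumption, not the density used by Dax et al.:

```python
import torch

def sample_training_times(batch_size):
    # Skewed toward t = 1, where matching the target distribution is hardest.
    return torch.distributions.Beta(3.0, 1.0).sample((batch_size, 1))
```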
In order to provide a general comparison of the Flow Matching-based SBI approach, the CFM model is tested on standard SBI benchmarking tasks.
Besides the general benchmarks, the authors use their proposed technique to estimate the posterior distribution of gravitational wave parameters \(p(\theta \mid x)\) where \(\theta \in \mathbb{R}^{15}, x \in \mathbb{R}^{15744}\). In order to reduce the problem’s dimensionality and increase the information density, the observations are compressed to \(128\) dimensions using an embedding network.
Following the preprocessing of the data, three density estimators are fitted and compared to each other. The first method uses a neural spline flow, which has proven itself on these kinds of problems. It is compared to a neural posterior estimation using the Flow Matching approach described here. Finally, a neural posterior estimator leveraging physical symmetries is used to estimate the targeted posterior. All were trained on a simulation budget of \(5 \cdot 10^6\) samples for a total of 400 epochs.
In order to evaluate the models’ performances, the obtained posteriors were compared w.r.t. their 50% credible regions as well as the Jensen-Shannon divergence between the inferred posterior and reference results. The results shown below support the advantages found in the benchmarking tasks. The Flow Matching-based estimator shows good performance for all shown parameters and has a clear advantage over the classical NPE approach.
Whilst the examples are interesting in themselves, their evaluation has shown the applicability, scalability, and flexibility of Flow Matching for density estimation. These performance improvements in different areas motivated the discussion of Flow Matching in the first place and have hopefully become clear by now.
Whilst this is a blog post, we’d like to use this last part to express our personal thoughts on the topic. SBI is a powerful method, enabling Bayesian inference where it would otherwise not be possible.
Formulating the Flow Matching variant of CNFs has allowed their application to complex density estimation tasks, such as SBI, where they have been shown to yield the expected improvements, both on standard SBI benchmarking tasks and on a very high-dimensional task from the field of astrophysics. Furthermore, the generalization of CFM broadens their applicability even more. It will be very interesting to see what possibilities are opened by this exact formulation and, in addition, what further improvements can be obtained by transferring techniques from diffusion models to normalizing flows.
A simple inline announcement.
Announcements and news can be much longer than just quick inline posts. In fact, they can have all the features available for the standard blog posts. See below.
Jean shorts raw denim Vice normcore, art party High Life PBR skateboard stumptown vinyl kitsch. Four loko meh 8-bit, tousled banh mi tilde forage Schlitz dreamcatcher twee 3 wolf moon. Chambray asymmetrical paleo salvia, sartorial umami four loko master cleanse drinking vinegar brunch. Pinterest DIY authentic Schlitz, hoodie Intelligentsia butcher trust fund brunch shabby chic Kickstarter forage flexitarian. Direct trade cold-pressed meggings stumptown plaid, pop-up taxidermy. Hoodie XOXO fingerstache scenester Echo Park. Plaid ugh Wes Anderson, freegan pug selvage fanny pack leggings pickled food truck DIY irony Banksy.
Hoodie Thundercats retro, tote bag 8-bit Godard craft beer gastropub. Truffaut Tumblr taxidermy, raw denim Kickstarter sartorial dreamcatcher. Quinoa chambray slow-carb salvia readymade, bicycle rights 90’s yr typewriter selfies letterpress cardigan vegan.
Pug heirloom High Life vinyl swag, single-origin coffee four dollar toast taxidermy reprehenderit fap distillery master cleanse locavore. Est anim sapiente leggings Brooklyn ea. Thundercats locavore excepteur veniam eiusmod. Raw denim Truffaut Schlitz, migas sapiente Portland VHS twee Bushwick Marfa typewriter retro id keytar.
We do not grow absolutely, chronologically. We grow sometimes in one dimension, and not in another, unevenly. We grow partially. We are relative. We are mature in one realm, childish in another. —Anais Nin
Fap aliqua qui, scenester pug Echo Park polaroid irony shabby chic ex cardigan church-key Odd Future accusamus. Blog stumptown sartorial squid, gastropub duis aesthetic Truffaut vero. Pinterest tilde twee, odio mumblecore jean shorts lumbersexual.
A simple inline announcement with Markdown emoji! :sparkles: :smile:
Reviewers will be required to only view the live content of the blog. We ask that they act in good faith, and refrain from digging into the repository’s logs and closed Pull Requests to find any identifying information on the authors.
Reviewers should motivate their final decision based on the following points:
In order to access them, please follow these steps:
As with the previous edition of the Blog Post track, we forgo the requirement for total anonymity. The blog posts must be anonymized for the review process, but users will submit their anonymized blog posts via a pull request to the blog track’s repository (in addition to a submission on OpenReview). The pull request will trigger an automated pipeline that will build and deploy your post onto a website dedicated to the reviewing process. Reviewers will be able to access the posts directly through a public URL (generated by the Github action), and will submit their reviews on OpenReview. Reviewers should refrain from looking at the git history for the post, which may reveal information about the authors.
This still largely follows the Double-Blind reviewing principle; it is no less double-blind than when reviewers are asked to score papers that have previously been released to arXiv, an overwhelmingly common practice in the ML community. This approach was chosen to lower the burden on both the organizers and the authors; in 2022, many submissions had to be reworked once deployed due to a variety of reasons. By allowing the authors to render their websites to Github Pages prior to the review process, we hope to avoid this issue entirely.
However, we understand the desire for total anonymity. Authors who wish to have a fully double-blind process might consider creating new GitHub accounts without identifying information, which they will use only for this track. For an example of a past submission that used an anonymous account in this manner, you can check out the World Models blog post (Ha and Schmidhuber, 2018) and the accompanying repository.
The workflow you will use to participate in this track should be relatively familiar to you if you have used GitHub Pages. Specifically, our website uses the al-folio template. This template uses GitHub Pages as part of its process, but it also utilizes a separate build step using GitHub Actions and intermediary Docker images.
We recommend paying close attention to the steps presented in this guide. Small mistakes here can have very hard-to-debug consequences.
This section provides a summary of the workflow for creating and submitting a blog post. For more details about any of these steps, please refer to the appropriate section.
1. Fork or download our repository.
2. Create your blog post as a Markdown file in the _posts/ directory with the format _posts/2024-05-07-[SUBMISSION NAME].md. If you choose to write the post in HTML, then the extension of this last file should be .html instead of .md. NOTE: HTML posts are not officially supported, use at your own risk!
3. Add any static image assets to assets/img/2024-05-07-[SUBMISSION NAME]/.
4. Add any interactive HTML figures to assets/html/2024-05-07-[SUBMISSION NAME]/.
5. Put your citations into a bibtex file at assets/bibliography/2024-05-07-[SUBMISSION NAME].bib.
DO NOT touch anything else in the repository. We will utilize an automated deployment action which will filter out all submissions that modify more than the list of files just described. Read the relevant section for more details. Make sure to omit any identifying information for the review process.
To render your website locally, you can build a Docker container via $ ./bin/docker_run.sh to serve your website. Alternatively, you can set up your local environment to render the website via the conventional $ bundle exec jekyll serve --future command. More information on both of these configurations can be found in the Local Serving section.
To submit your website, create a pull request to the main repository. Make sure that this PR’s title is _posts/2024-05-07-[SUBMISSION NAME]
. This will trigger a GitHub Action that will build your blogpost and write the host’s URL in a comment to your PR.
Should you edit ANY files other than your new post inside the _posts directory and your new folder inside the assets directory, your pull request will automatically be rejected.
You can view an example of a successful PR here. You can view an example of a PR with erroneous files here.
Download or fork our repository. You will be submitting a pull request to this repository.
To create a blog post in Markdown format, you can modify the example Markdown post _posts/2024-05-07-distill-example.md
and rename it to _posts/2024-05-07-[SUBMISSION NAME].md
, where [SUBMISSION NAME]
is the name of your submission. You can see the rendered result of the sample post.
While most users will want to create a post in the Markdown format, it is also possible to create a post in HTML format. For this, modify instead the example _posts/2024-05-08-distill-example2.html
and rename it to _posts/2024-05-07-[SUBMISSION NAME].html
. (NOTE: HTML is not officially supported, use at your own risk).
You must modify the file’s header (or ‘front-matter’) as needed.
---
layout: distill
title: [Your Blog Title]
description: [Your blog post's abstract - no math/latex or hyperlinks!]
date: 2024-05-07
future: true
htmlwidgets: true

# anonymize when submitting
authors:
  - name: Anonymous

# do not fill this in until your post is accepted and you're publishing your camera-ready post!
# authors:
#   - name: Albert Einstein
#     url: "https://en.wikipedia.org/wiki/Albert_Einstein"
#     affiliations:
#       name: IAS, Princeton
#   - name: Boris Podolsky
#     url: "https://en.wikipedia.org/wiki/Boris_Podolsky"
#     affiliations:
#       name: IAS, Princeton
#   - name: Nathan Rosen
#     url: "https://en.wikipedia.org/wiki/Nathan_Rosen"
#     affiliations:
#       name: IAS, Princeton

# must be the exact same name as your blogpost
bibliography: 2024-05-07-distill-example.bib

# Add a table of contents to your post.
#   - make sure that TOC names match the actual section names
#     for hyperlinks within the post to work correctly.
toc:
  - name: [Section 1]
  - name: [Section 2]
    # you can additionally add subentries like so
    subsections:
      - name: [Subsection 2.1]
  - name: [Section 3]
---

# ... your blog post's content ...
You must change the title, description, toc, and eventually the authors fields (ensure that the submission is anonymous for the review process).
Read our sample blog post carefully to see how you can add image assets, and how to write using \(\LaTeX\)! Read about rendering your post locally below.
Important: make sure your post is completely anonymized before you export and submit it!
Before going any further, it will be useful to highlight exactly what folders and files you are going to add or modify. Even if you use one of our simpler quickstart methods, this will always be what’s happening behind the scenes.
If you clone our repo or download a release, you will find a directory structure that looks like the following (excluding all files and directories that are not relevant to your submission):
your_blogpost_repo/
│
├── _posts
│   ├── 2024-05-07-[YOUR SUBMISSION].md         # <--- Create this markdown file; this is your blogpost
│   └── ...
├── assets
│   ├── bibliography
│   │   ├── 2024-05-07-[YOUR SUBMISSION].bib    # <--- Create this bibtex file
│   │   └── ...
│   ├── html
│   │   ├── 2024-05-07-[YOUR SUBMISSION]        # <--- Create this directory and add interactive html figures
│   │   │   └── [YOUR HTML FIGURES].html
│   │   └── ...
│   ├── img
│   │   ├── 2024-05-07-[YOUR SUBMISSION]        # <--- Create this directory and add static images here
│   │   │   └── [YOUR IMAGES].png
│   │   └── ...
│   └── ...
└── ...
In summary, to create your post, you will:
1. Create a Markdown (or HTML) file in the _posts/ directory with the format _posts/2024-05-07-[SUBMISSION NAME].md (_posts/2024-05-07-[SUBMISSION NAME].html in the case of an HTML file).
2. Add any static image assets to assets/img/2024-05-07-[SUBMISSION NAME]/.
3. Add any interactive HTML figures to assets/html/2024-05-07-[SUBMISSION NAME]/.
4. Put your citations into a bibtex file at assets/bibliography/2024-05-07-[SUBMISSION NAME].bib.
DO NOT touch anything else in the blog post! If you do, our automated pipeline will reject your PR and you will have to undo those changes in order for it to be accepted!
Note that 2024-05-07-[YOUR SUBMISSION]
serves as a tag to your submission, so it should be the same for all three items. For example, if you’re writing a blog post called “Deep Learning”, you’d likely want to make your tag 2024-05-07-deep-learning
, and the directory structure would look like this:
your_blogpost_repo/
│
├── _posts
│   ├── 2024-05-07-deep-learning.md         # <--- Create this markdown file; this is your blogpost
│   └── ...
├── assets
│   ├── bibliography
│   │   ├── 2024-05-07-deep-learning.bib    # <--- Create this bibtex file
│   │   └── ...
│   ├── html
│   │   ├── 2024-05-07-deep-learning        # <--- Create this directory and add interactive html figures
│   │   │   └── [YOUR HTML FIGURES].html
│   │   └── ...
│   ├── img
│   │   ├── 2024-05-07-deep-learning        # <--- Create this directory and add static images here
│   │   │   └── [YOUR IMAGES].png
│   │   └── ...
│   └── ...
└── ...
So far we’ve talked about how to get the relevant repository and create a blog post conforming to our requirements. Everything you have done so far has been in Markdown, but this is not the same format as web content (typically HTML, etc.). You’ll now need to build your static website (which is done using Jekyll) and then serve it on a local webserver in order to view it properly. We will now discuss how you can serve your blog site locally, so you can visualize your work before you open a pull request on the staging website and submit it to the ICLR venue.
To render your website locally, we follow the instructions for Local setup using Docker (Recommended on Windows), but specifically you will need to create your own docker container rather than pull it from Dockerhub (because we modified the Gemfile).
Create and run the Docker image:
./bin/docker_run.sh
Remove the Gemfile.lock
file if prompted. This will create a docker image labeled as al-folio:latest
. Don’t use dockerhub_run.sh
; this may result in issues with missing jekyll dependencies.
For users wishing to not use a Docker container, you can install Jekyll directly to your computer and build the site using Jekyll directly. This is done at your own risk, as there are many potential points of error! Follow the instructions for rendering the website via the conventional method of $ bundle exec jekyll serve --future
You will need to manually install Jekyll which will vary based on your operating system. The instructions here are only for convenience - you are responsible for making sure it works on your system and we are not liable for potential issues that occur when adding your submissions to our repo!
Ubuntu/Debian
Install Ruby
sudo apt install ruby-full
Once installed, add the following to your .bashrc
or whatever terminal startup script you may use (this is important because otherwise gem may complain about needing sudo permission to install packages):
export GEM_HOME="$HOME/.gem"
export PATH="$HOME/.gem/bin:$PATH"
Install Jekyll and Bundler:
gem install jekyll bundler
MacOS and Windows
Mac and Windows users can find relevant guides for installing Jekyll here:
Once you’ve installed jekyll and all of the dependencies, you can now serve the webpage on your local machine for development purposes using the bundle exec jekyll serve
command.
You may first need to install any project dependencies. In your terminal, from the directory containing the Jekyll project run:
bundle install
This will install any plugins required by the project. To serve the webpage locally, from your terminal, in the directory containing the Jekyll project run:
bundle exec jekyll serve --future --port=8080 --host=0.0.0.0
You should see something along the lines of:
> bundle exec jekyll serve
Configuration file: /home/$USER/blog_post_repo/_config.yml
            Source: /home/$USER/blog_post_repo
       Destination: /home/$USER/blog_post_repo/_site
 Incremental build: disabled. Enable with --incremental
      Generating...
       Jekyll Feed: Generating feed for posts

 ... you may see a lot of stuff in here related to images ...

                    done in 0.426 seconds.
 Auto-regeneration: enabled for '/home/$USER/blog_post_repo'
    Server address: http://0.0.0.0:8080/2024/
  Server running... press ctrl-c to stop.
If you see this, you’ve successfully served your web page locally! You can access it at the server address specified, in this case http://0.0.0.0:8080/2024/
(and the blog posts should once again be viewable at the blog/
endpoint).
To submit your blog post:
- Anonymize your post: the authors field of your front-matter should contain only anonymous author names (e.g. Anonymous).
- Create your post in _posts/ with the format _posts/2024-05-07-[SUBMISSION NAME].md (or .html).
- Add any static image assets to assets/img/2024-05-07-[SUBMISSION NAME]/.
- Add any interactive HTML figures to assets/html/2024-05-07-[SUBMISSION NAME]/.
- Put your citations into assets/bibliography/2024-05-07-[SUBMISSION NAME].bib.
- Add your post’s abstract in the description field of your front-matter (example).
- Add a table of contents in the toc field of your front-matter (example).
- Add your references to the .bibtex file as per the sample post.
- Open a pull request against the main branch of the 2024 repo. Fill in the checklist provided in the PR template. The title of your pull request should be exactly the name of your markdown/html file; for example, _posts/2024-05-07-[SUBMISSION NAME].md would require the PR name 2024-05-07-[SUBMISSION NAME].
Note: If you wish to make updates to your submission, you should update the content in the PR that you already opened.
Reviewers will be required to only view the live content of the reviewing website - the website to which the pull requests push. We ask that they act in good faith, and refrain from digging into the repository’s logs and closed Pull Requests to find any identifying information on the authors.
TBD - instructions will be provided closer to the submission deadline.