Chunking with TextChunker #2

ndrean · 2024-08-15T16:47:24Z

ndrean
Aug 15, 2024
Maintainer

We consider two types of documents in the phoenix_live_view repo: Markdown files and Elixir modules (whose moduledoc is of particular interest).

GitHub serves the raw files from the endpoint "https://raw.githubusercontent.com/...". We can also get the list of files from the GitHub API at the endpoint "https://api.github.com/repos/...".

I use the source (".ex") rather than the HTML file because it may be difficult to extract the relevant data when parsing HTML.

To chunk, I tested the package TextChunker. It divides the text into smaller chunks in a hierarchical and iterative manner using a set of separators.
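To make the idea of hierarchical splitting concrete, here is a minimal sketch in plain Elixir. It is not TextChunker's implementation: the module name `SplitSketch`, the separator list, and the greedy merge step are all assumptions chosen for illustration. The text is split on the coarsest separator first; any piece that is still too long is split again with the next, finer separator, and adjacent small pieces are merged back as long as they fit.

```elixir
defmodule SplitSketch do
  # Hypothetical separator hierarchy for Markdown-like text, coarsest first.
  @separators ["\n## ", "\n\n", "\n", " "]

  # Returns a list of chunks, each at most `max_size` graphemes long.
  def split(text, max_size, separators \\ @separators) do
    cond do
      # Small enough: keep the text as a single chunk.
      String.length(text) <= max_size ->
        [text]

      # No separator left: fall back to a hard cut every `max_size` graphemes.
      separators == [] ->
        hard_split(text, max_size)

      true ->
        [sep | finer] = separators

        text
        |> split_keeping(sep)
        |> Enum.flat_map(&split(&1, max_size, finer))
        |> merge(max_size)
    end
  end

  # Splits on `sep` but keeps the separator at the start of each following
  # piece, so that joining the pieces gives back the original text.
  defp split_keeping(text, sep) do
    [first | rest] = String.split(text, sep)
    Enum.reject([first | Enum.map(rest, &(sep <> &1))], &(&1 == ""))
  end

  # Greedily merges adjacent pieces while the result still fits in `max_size`.
  defp merge(pieces, max_size) do
    pieces
    |> Enum.reduce([], fn piece, acc ->
      case acc do
        [current | done] ->
          if String.length(current) + String.length(piece) <= max_size do
            [current <> piece | done]
          else
            [piece, current | done]
          end

        [] ->
          [piece]
      end
    end)
    |> Enum.reverse()
  end

  defp hard_split(text, max_size) do
    text
    |> String.graphemes()
    |> Enum.chunk_every(max_size)
    |> Enum.map(&Enum.join/1)
  end
end
```

Calling `SplitSketch.split(text, 2000)` on a Markdown document would tend to cut at section headings first and only fall back to paragraphs, lines, or words for oversized sections. Unlike the TextChunker output shown further down, this sketch produces no overlap between chunks.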

I used [format: :markdown] for the ".md" documents and no format option for the ".html" documents.
Chunk sizes are not "small": from 600 to 2000 codepoints.

url_external_uploads = "https://raw.githubusercontent.com/phoenixframework/phoenix_live_view/main/guides/client/uploads-external.md"

# Print the length of the document, then split it into chunks
Req.get!(url_external_uploads).body
|> tap(fn body -> body |> String.length() |> IO.puts() end)
|> TextChunker.split(format: :markdown)

The result is:

71459

[
  %TextChunker.Chunk{
    start_byte: 0,
    end_byte: 689,
    text: "# External uploads\n\n> This guide ...\n"
  },
  %TextChunker.Chunk{
    start_byte: 689,
    end_byte: 1378,
    text: "\n## Chunked HTTP Uploads\n\nFor any service ... --save @mux/upchunk\n"
  },
  %TextChunker.Chunk{
    start_byte: 1378,
    end_byte: 3288,
    text: "```\n\nConfigure your uploader on `c:Phoenix.LiveView.mount/3`:\n\n    def mount(_params, _session, socket) do\n      {:ok,\n       socket\n       |> assign(:uploaded_files, [])\n       |> allow_upload(:avatar, accept: :any, max_entries: 3, external: ...}\n    })"
  },
  %TextChunker.Chunk{
    start_byte: 3150,
    end_byte: 3562,
    text: "\n\n    // notify progress events to LiveView\n    upload.on(\"progress\", (e) => {\n      if(e.detail < 100){ entry.progress(e.detail) }\n    })\n\n    ...\n  params: {_csrf_token: csrfToken}\n})\n```\n"
  },
   ....
]
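One detail worth keeping in mind when reading this output: the chunks report `start_byte` and `end_byte`, which are byte offsets, whereas `String.length/1` counts graphemes. For text with non-ASCII characters the two differ. The sketch below illustrates this with a tiny string and shows how a chunk's text could be recovered from the original body. The module name `ChunkOffsets` is hypothetical, and treating `end_byte` as exclusive is an assumption inferred from the adjacent offsets 0..689 and 689..1378 above.

```elixir
defmodule ChunkOffsets do
  # Recovers a chunk's text from the original binary.
  # Assumption: `start_byte` is inclusive and `end_byte` is exclusive.
  def slice(body, %{start_byte: start_byte, end_byte: end_byte}) do
    binary_part(body, start_byte, end_byte - start_byte)
  end
end

body = "héllo wörld"

# Graphemes and bytes differ as soon as a character needs more than one byte.
IO.inspect(String.length(body), label: "graphemes")
IO.inspect(byte_size(body), label: "bytes")

# A plain map stands in for a %TextChunker.Chunk{} here.
IO.inspect(ChunkOffsets.slice(body, %{start_byte: 0, end_byte: 6}), label: "slice")
```

The function pattern-matches only on the two offset keys, so it accepts any map or struct that carries them.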

ndrean · 2024-08-24T12:19:54Z

ndrean
Aug 24, 2024
Maintainer Author

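defmodule RAG.DataCollector do
  @moduledoc """
  Fetches the Markdown guides listed by the GitHub contents API
  and splits them into chunks of text.
  """

  # Takes a list of GitHub contents API URLs (one per directory)
  # and returns a flat list of chunk texts.
  def fetch_and_chunk_docs(urls) do
    Enum.flat_map(urls, &process_directory/1)
  end

  # The contents API returns a JSON array of entries,
  # which Req decodes into a list of maps.
  defp process_directory(url) do
    Req.get!(url).body
    |> Enum.flat_map(&extract_chunks/1)
  end

  # Only Markdown files are downloaded and chunked; other files are skipped.
  defp extract_chunks(%{"type" => "file", "name" => name, "download_url" => download_url}) do
    if String.ends_with?(name, ".md") do
      Req.get!(download_url).body
      |> TextChunker.split(format: :markdown)
      |> Enum.map(& &1.text)
    else
      []
    end
  end

  # Directories and any other entry types produce no chunks.
  defp extract_chunks(_entry), do: []
end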

guides = [
  "https://api.github.com/repos/phoenixframework/phoenix_live_view/contents/guides/server",
  "https://api.github.com/repos/phoenixframework/phoenix_live_view/contents/guides/client"
]

chunks = RAG.DataCollector.fetch_and_chunk_docs(guides)

["# Assigns and HEEx templates\n\nAll of the data in a LiveView is stored in the socket, which is a server \nside struct called `Phoenix.LiveView.Socket`. Your own data is stored\nunder the `assigns` key of said struct. The server data is never shared\nwith the client beyond what your template renders.\n\nPhoenix template language is called HEEx (HTML+EEx). EEx is Embedded \nElixir, an Elixir string template engine. Those templates\nare either files with the `.heex` extension or they are created\ndirectly in source files via the `~H` sigil. You can learn more about\nthe HEEx syntax by checking the docs for [the `~H` sigil](`Phoenix.Component.sigil_H/2`).\n\nThe `Phoenix.Component.assign/2` and `Phoenix.Component.assign/3`\nfunctions help store those values. Those values can be accessed\nin the LiveView as `socket.assigns.name` but they are accessed\ninside HEEx templates as `@name`.\n\nIn this section, we are going to cover how LiveView minimizes\nthe payload over the wire by understanding the interplay between\nassigns and templates.\n",
 "\n## Change tracking\n\nWhen you first render a `.heex` template, it will send all of the\nstatic and dynamic parts of the template to the client. Imagine the\nfollowing template:\n\n```heex\n<h1><%= expand_title(@title) %></h1>\n```\n\nIt has two static parts, `<h1>` and `</h1>` and one dynamic part\nmade of `expand_title(@title)`. Further rendering of this template\nwon't resend the static parts and it will only resend the dynamic\npart if it changes.\n\nThe tracking of changes is done via assigns. If the `@title` assign\nchanges, then LiveView will execute the dynamic parts of the template,\n`expand_title(@title)`, and send\nthe new content. If `@title` is the same, nothing is executed and\nnothing is sent.\n\nChange tracking also works when accessing map/struct fields.\nTake this template:\n\n```heex\n<div id={\"user_\#{@user.id}\"}>\n  <%= @user.name %>\n</div>\n```\n\nIf the `@user.name` changes but `@user.id` doesn't, then LiveView\nwill re-render only `@user.name` and it will not execute or resend `@user.id`\nat all.\n\nThe change tracking also works when rendering other templates as\nlong as they are also `.heex` templates:\n\n```heex\n<%= render \"child_template.html\", assigns %>\n```\n\nOr when using function components:\n\n```heex\n<.show_name name={@user.name} />\n```\n\nThe assign tracking feature also implies that you MUST avoid performing\ndirect operations in the template. For example, if you perform a database\nquery in your template:\n\n```heex\n<%= for user <- Repo.all(User) do %>\n  <%= user.name %>\n<% end %>\n```\n\nThen Phoenix will never re-render the section above, even if the number of\nusers in the database changes. Instead, you need to store the users as\nassigns in your LiveView before it renders the template:\n\n    assign(socket, :users, Repo.all(User))\n\nGenerally speaking, **data loading should never happen inside the template**,\nregardless if you are using LiveView or not. 
The difference is that LiveView\nenforces this best practice.\n",
 "\n## Pitfalls\n\nThere are some common pitfalls to keep in mind when using the `~H` sigil\nor `.heex` templates inside LiveViews.\n\n### Variables\n\nDue to the scope of variables, LiveView has to disable change tracking\nwhenever variables are used in the template, with the exception of\nvariables introduced by Elixir block constructs such as `case`,\n`for`, `if`, and others. Therefore, you **must avoid** code like\nthis in your HEEx templates:\n\n```heex\n<% some_var = @x + @y %>\n<%= some_var %>\n```\n\nInstead, use a function:\n\n```heex\n<%= sum(@x, @y) %>\n```\n\nSimilarly, **do not** define variables at the top of your `render` function\nfor LiveViews or LiveComponents. Since LiveView cannot track `sum` or `title`,\nif either value changes, both must be re-rendered by LiveView.\n\n    def render(assigns) do\n      sum = assigns.x + assigns.y\n      title = assigns.title\n\n      ~H\"\"\"\n      <h1><%= title %></h1>\n\n      <%= sum %>\n      \"\"\"\n    end\n\nInstead use the `assign/2`, `assign/3`, `assign_new/3`, and `update/3`\nfunctions to compute it. Any assign defined or updated this way will be marked as\nchanged, while other assigns like `@title` will still be tracked by LiveView.\n\n    assign(assigns, sum: assigns.x + assigns.y)\n\nThe same functions can be used inside function components too:\n\n    attr :x, :integer, required: true\n    attr :y, :integer, required: true\n    attr :title, :string, required: true\n    def sum_component(assigns) do\n      assigns = assign(assigns, sum: assigns.x + assigns.y)\n\n      ~H\"\"\"\n      <h1><%= @title %></h1>\n\n      <%= @sum %>\n      \"\"\"\n    end\n\nGenerally speaking, avoid accessing variables inside `HEEx` templates, as code that\naccess variables is always executed on every render. The exception are variables\nintroduced by Elixir's block constructs. 
For example, accessing the `post` variable\ndefined by the comprehension below works as expected:\n\n```heex\n<%= for post <- @posts do %>\n  ...\n<% end %>\n```\n",
 "\n### The `assigns` variable\n\nWhen talking about variables, it is also worth discussing the `assigns`\nspecial variable. Every time you use the `~H` sigil, you must define an\n`assigns` variable, which is also available on every `.heex` template.\nHowever, we must avoid accessing this variable directly inside templates\nand instead use `@` for accessing specific keys. This also applies to\nfunction components. Let's see some examples.\n\nSometimes you might want to pass all assigns from one function component to\nanother. For example, imagine you have a complex `card` component with \nheader, content and footer section. You might refactor your component\ninto three smaller components internally:\n\n```elixir\ndef card(assigns) do\n  ~H\"\"\"\n  <div class=\"card\">\n    <.card_header {assigns} />\n    <.card_body {assigns} />\n    <.card_footer {assigns} />\n  </div>\n  \"\"\"\nend\n\ndefp card_header(assigns) do\n  ...\nend\n\ndefp card_body(assigns) do\n  ...\nend\n\ndefp card_footer(assigns) do\n  ...\nend\n```\n\nBecause of the way function components handle attributes, the above code will\nnot perform change tracking and it will always re-render all three components\non every change.\n\nGenerally, you should avoid passing all assigns and instead be explicit about\nwhich assigns the child components need:\n\n```elixir\ndef card(assigns) do\n  ~H\"\"\"\n  <div class=\"card\">\n    <.card_header title={@title} class={@title_class} />\n    <.card_body>\n      <%= render_slot(@inner_block) %>\n    </.card_body>\n    <.card_footer on_close={@on_close} />\n  </div>\n  \"\"\"\nend\n```\n\nIf you really need to pass all assigns you should instead use the regular\nfunction call syntax. This is the only case where accessing `assigns` inside\ntemplates is acceptable:\n\n```elixir\ndef card(assigns) do\n  ~H\"\"\"\n  <div class=\"card\">\n    <%= card_header(assigns) %>\n    <%= card_body(assigns) %>\n    <%= card_footer(assigns) %>\n  </div>\n  \"\"\"\nend\n",
 "```\n\nThi
