Umsetzung Suche #18

flicksolutions · 2023-11-01T15:49:11Z

As discussed:
@pdaengeli will build an index JSON for a local MiniSearch instance:

The structure will look something like this:


[
  {
    id: 'd-1.1',
    sigla: 'd',
    verse: '1.1',
    content: 'ST zwiuel h ̉ er zen nahgebur'
  },
  {
    id: 'fr45-78.15-f',
    sigla: 'fr45',
    verse: '78.15-f',
    content: 'Schildeſ ampt worhten'
  },
  // ...and more
]
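
A minimal TypeScript sketch of one entry in this structure, assuming that `id` is simply `sigla` and `verse` joined by a hyphen, as the two examples above suggest. The names `IndexEntry` and `makeEntry` are hypothetical and not part of any existing code.

```typescript
// Shape of one entry in the proposed index (field names taken from the example above).
interface IndexEntry {
  id: string;
  sigla: string;
  verse: string;
  content: string;
}

// Build an entry; the id format "<sigla>-<verse>" is inferred from the examples.
function makeEntry(sigla: string, verse: string, content: string): IndexEntry {
  return { id: `${sigla}-${verse}`, sigla, verse, content };
}

const entry = makeEntry("fr45", "78.15-f", "Schildeſ ampt worhten");
console.log(entry.id); // "fr45-78.15-f"
```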

This describes an MVP. If possible, we can add normalization information from https://parzival.unibe.ch/parzdb/listNFs.php like so:

[
  {
    id: 'd-1.1',
    sigla: 'd',
    verse: '1.1',
    content: 'ST zwiuel h ̉ er zen nahgebur adamans',
    normalized_words: ['Adam']
  },
  {
    id: 'fr45-78.15-f',
    sigla: 'fr45',
    verse: '78.15-f',
    content: 'Schildeſ ampt worhten feirazizfeiraviz ',
    normalized_words: ['Feirefiz']
  },
  // ...and more
]
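
To illustrate why the normalized forms help, here is a small sketch in which a query matches either a word of the transcription or one of the normalized words. It is a plain linear scan for illustration only, not MiniSearch itself; `Entry` and `findEntries` are hypothetical names, and the sample contents are shortened stand-ins rather than real transcriptions.

```typescript
interface Entry {
  id: string;
  content: string;
  normalized_words?: string[]; // only present where a normal form is known
}

// Return the ids of all entries where the query equals a word of the
// transcription or one of the normalized words (case-insensitive).
function findEntries(entries: Entry[], query: string): string[] {
  const q = query.toLowerCase();
  return entries
    .filter((e) => {
      const words = e.content.split(/\s+/).concat(e.normalized_words ?? []);
      return words.some((w) => w.toLowerCase() === q);
    })
    .map((e) => e.id);
}

// Illustrative data only (shortened stand-ins for the entries above).
const sample: Entry[] = [
  { id: "d-1.1", content: "zwiuel nahgebur", normalized_words: ["Adam"] },
  { id: "fr45-78.15-f", content: "Schildeſ ampt worhten feiraviz", normalized_words: ["Feirefiz"] },
];
console.log(findEntries(sample, "feirefiz")); // ["fr45-78.15-f"]
```

A search for the modern spelling finds the verse even though the manuscript spelling differs, which is the point of carrying the normal forms in the index.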


pdaengeli · 2024-12-03T11:03:30Z

S. Abel will hand over an XML file containing IDs of regularised entries with lemma.

With this we can generate `normalized_words: ['Feirefiz']` directly from the information contained in the XML files (e.g. rego:o0001).

Also see https://parzival.unibe.ch/parzdb/listNFs.php#:~:text=Eigennamen

flicksolutions · 2024-12-13T17:34:14Z

Email from Stefan Abel

Attached you will find one PDF each with an index of names (Namenregister) for the base manuscripts (Leithandschriften) D (*D), m (*m), G (*G) and TUQ (*T); in version *T the base manuscript alternates between T, U and Q, which is why there are three indexes of names for this version.

We can have these indexes generated automatically with Tustep. All the necessary information on the individual names (p = personal names, o = place names, s = other, e.g. gemstones and planets) is contained in the file namen.xml (also attached). In it, the IDs are also linked to the normal forms of the names.

Attachments

Parzival_Namenregister_Leithss.zip
namen.zip (xml file)

As outlined in DHBern/presentation_parzival#18, there might still be a need to clarify how to handle various TEI elements (cf. build-index.xsl) and whether or not to cover the relations contained in names.xml.


pdaengeli · 2024-12-19T20:49:17Z

search-index.json is now available at https://dhbern.github.io/parzival-static-api/api/json/

It contains two content strings, content and content_all:

    {
      "sigla": "d",
      "content_all": "der mac dennoc* dennoch weſen geil.",
      "id": "d_1.07",
      "verse": "07",
      "content": "der mac dennoch weſen geil."
    },
    {
      "sigla": "d",
      "content_all": "wand an im ſint *eidiv beidiv teil.",
      "id": "d_1.08",
      "verse": "08",
      "content": "wand an im ſint beidiv teil."
    },

They are based on selective versus non-selective treatment of some of the TEI elements used for the verse encoding. Depending on the requirements of the search, we can improve them. We could perhaps omit content_all, or reduce it to the additional strings only and add those in a MiniSearch index function as needed (here that would be "add": "dennoc*" and "add": "*eidiv").
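
A rough sketch of how the additional strings could be derived on the consumer side, assuming that whitespace tokenization is good enough and that content_all always contains every token of content. The function name `additionalTokens` is hypothetical.

```typescript
// Return the whitespace-separated tokens that occur in contentAll but not in
// content, respecting multiplicity (a multiset difference).
function additionalTokens(content: string, contentAll: string): string[] {
  const remaining = new Map<string, number>();
  for (const token of content.split(/\s+/).filter((t) => t !== "")) {
    remaining.set(token, (remaining.get(token) ?? 0) + 1);
  }
  const extra: string[] = [];
  for (const token of contentAll.split(/\s+/).filter((t) => t !== "")) {
    const count = remaining.get(token) ?? 0;
    if (count > 0) {
      remaining.set(token, count - 1); // token is accounted for by content
    } else {
      extra.push(token); // token only exists in content_all
    }
  }
  return extra;
}

console.log(
  additionalTokens("der mac dennoch weſen geil.", "der mac dennoc* dennoch weſen geil.")
); // ["dennoc*"]
```

For the two verses shown above this yields `dennoc*` and `*eidiv` respectively, i.e. the strings suggested for the "add" field.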

Instead of normalized_words I used terms and only create them where they exist.

    {
      "sigla": "d",
      "terms": [ "rubin" ],
      "content_all": " verwurchet verwürchet edeln rvͦbin.",
      "id": "d_3.17",
      "verse": "17",
      "content": "verwürchet edeln rvͦbin."
    },

    {
      "sigla": "d",
      "terms": [ "Gahmuret", "Anschevin" ],
      "content_all": "Gahmvret Anſcivin.",
      "id": "d_6.26",
      "verse": "26",
      "content": "Gahmvret Anſcivin."
    },

With regard to cases where more than one regular variant is given, we should decide if it is better to atomize them or keep them as they currently are:

    {
      "sigla": "q",
      "terms": [ "Barbigœl Parbigœl" ],
      "content_all": "Er ſehe in ſint zu barbigol",
      "id": "q_646.05",
      "verse": "05",
      "content": "Er ſehe in ſint zu barbigol"
    },

    {
      "sigla": "q",
      "terms": [ "Meljanz Melyanz", "Barbigœl Parbigœl" ],
      "content_all": "Vnd roys meliantz de barbigol",
      "id": "q_665.01",
      "verse": "01",
      "content": "Vnd roys meliantz de barbigol"
    },

(Q_namen.pdf)

The resulting file is large (> 70 MB). There are some options for shrinking (switching off indentation for the serialization, removing spurious information, shortening keys, splitting into several documents), but I'll leave this for later.
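
As a small illustration of the first option (switching off indentation), the following sketch compares the length of an indented and a compact serialization of the same documents. `serializedSizes` is a hypothetical helper; it counts UTF-16 code units rather than bytes, so the numbers only indicate the relative saving.

```typescript
// Compare the serialized length of the same data with and without indentation.
function serializedSizes(docs: unknown[]): { pretty: number; compact: number } {
  return {
    pretty: JSON.stringify(docs, null, 2).length, // indented with two spaces
    compact: JSON.stringify(docs).length, // no whitespace between tokens
  };
}

// Two documents taken from the index excerpt above.
const docs = [
  { sigla: "d", content_all: "der mac dennoc* dennoch weſen geil.", id: "d_1.07", verse: "07", content: "der mac dennoch weſen geil." },
  { sigla: "d", content_all: "wand an im ſint *eidiv beidiv teil.", id: "d_1.08", verse: "08", content: "wand an im ſint beidiv teil." },
];

const sizes = serializedSizes(docs);
console.log(sizes.pretty, sizes.compact);
```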

In addition, there is also names.json with information on persons, places and stones/planets/etc. It is not needed for generating the data, but it was low-hanging fruit. Perhaps it will come in handy for the presentation at some point.

flicksolutions · 2025-01-06T13:39:52Z

Thanks, @pdaengeli! Looks absolutely doable, and 70 MB isn't even that big, considering that's before gzipping and minifying...
Reducing content_all seems favourable as well. Great work so far!

More than one regular variant seems very counterintuitive to me. Why would you want to have something like that? Isn't the point of normal forms that you only have one variant..? I guess it's for historical reasons :-) Anyways... I think we should treat them as separate terms for our use. It might hurt search performance if we keep two terms in one string (because there will never be an exact match for those). Is this easily doable?

pdaengeli · 2025-01-06T16:01:49Z

> More than one regular variant seems very counterintuitive to me. Why would you want to have something like that? Isn't the point of normal forms that you only have one variant..?

The only explanation I can come up with is that there are cases with more than one relatively established normal variant. I was also surprised to find these cases.

Splitting the strings is trivial. I just pushed DHBern/parzival-static-api@5091a2e and it should soon be:

    {
      "sigla": "q",
      "terms": [ "Barbigœl", "Parbigœl" ],
      "content_all": "Er ſehe in ſint zu barbigol",
      "id": "q_646.05",
      "verse": "05",
      "content": "Er ſehe in ſint zu barbigol"
    },

    {
      "sigla": "q",
      "terms": [ "Meljanz", "Melyanz", "Barbigœl", "Parbigœl" ],
      "content_all": "Vnd roys meliantz de barbigol",
      "id": "q_665.01",
      "verse": "01",
      "content": "Vnd roys meliantz de barbigol"
    },
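
The referenced commit is not reproduced here. Purely as an illustration of the same idea on the consumer side, a sketch that splits each multi-variant string on whitespace could look like this. `atomizeTerms` is a hypothetical name, and the approach assumes that no normal form itself contains a space.

```typescript
// Split every multi-variant string on whitespace so that each variant
// becomes its own term.
function atomizeTerms(terms: string[]): string[] {
  const result: string[] = [];
  for (const term of terms) {
    for (const variant of term.split(/\s+/)) {
      if (variant !== "") result.push(variant);
    }
  }
  return result;
}

console.log(atomizeTerms(["Meljanz Melyanz", "Barbigœl Parbigœl"]));
// ["Meljanz", "Melyanz", "Barbigœl", "Parbigœl"]
```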

flicksolutions · 2025-01-17T16:49:56Z

@pdaengeli I just realized that the Dreissiger (the number of the thirty-verse section) is missing. It can obviously be derived from the id, but if I derive the Dreissiger from the id, I can derive sigla and verse from it as well. So we have to decide whether we want to minify the JSON by dropping sigla and verse, or add a thirties property.

What do you think?
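
For reference, deriving all three values in the application could look roughly like the sketch below. It assumes that every id has the form `<sigla>_<Dreissiger>.<verse>` (as in `d_1.07` or `q_646.05`); `VerseRef` and `parseId` are hypothetical names.

```typescript
interface VerseRef {
  sigla: string;
  dreissiger: string;
  verse: string;
}

// Parse ids of the assumed form "<sigla>_<Dreissiger>.<verse>".
// Returns null if the id does not follow that pattern.
function parseId(id: string): VerseRef | null {
  const match = /^([^_]+)_(\d+)\.(.+)$/.exec(id);
  if (match === null) return null;
  return { sigla: match[1], dreissiger: match[2], verse: match[3] };
}

console.log(parseId("q_646.05"));
// { sigla: "q", dreissiger: "646", verse: "05" }
```

The trade-off is between a smaller file (values derived in the application) and a simpler consumer (values stored in the data).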

pdaengeli · 2025-01-17T17:36:15Z

Generally speaking, I prefer having knowledge in the data over logic in the application (where it is affordable).

So I went ahead and added a d key in DHBern/parzival-static-api@af3a4c8.

 "docs": [
    {
      "sigla": "d",
      "d": "1",
      "content_all": "Gagmvret Herzey",
      "id": "d_1.01-01",
      "verse": "01-01",
      "content": "Gagmvret Herzey"
    },

flicksolutions · 2025-01-20T15:12:48Z

Thanks!
Parsing d doesn't quite work for the fragments (Fr.).

pdaengeli · 2025-01-20T15:41:06Z

Must be a deficient regex in L71. I'll fix it later.


pdaengeli · 2025-01-20T20:33:57Z

The fragments look good now.

E.g.

diff --git a/dist/api/json/search-index.json b/dist/api/json/search-index.json
index 77a9997..89b371f 100644
--- a/dist/api/json/search-index.json
+++ b/dist/api/json/search-index.json
@@ -201946,7 +201946,7 @@
[…]
+    { "sigla":"fr21", "d":"328", "content_all":"", "id":"fr21_328.27", "verse":"27", "content":"" },
+    { "sigla":"fr21", "d":"328", "content_all":"", "id":"fr21_328.28", "verse":"28", "content":"" },
     {
       "sigla": "fr21",
-      "d": "fr21_328.27",
-      "content_all": "",
-      "id": "fr21_328.27",
-      "verse": "27",
-      "content": ""
-    },
-    {
-      "sigla": "fr21",
-      "d": "fr21_328.28",
-      "content_all": "",
-      "id": "fr21_328.28",
-      "verse": "28",
-      "content": ""
-    },
-    {
-      "sigla": "fr21",
-      "d": "fr21_328.29",
+      "d": "328",
       "content_all": "Feirefiz anſche",
       "id": "fr21_328.29",
       "verse": "29",
@@ -266386,7 +264272,7 @@
     },
     {
       "sigla": "fr21",
-      "d": "fr21_328.30",
+      "d": "328",
       "content_all": "Deſ tat dvrich wen pin.-",
       "id": "fr21_328.30",
       "verse": "30",
@@ -266394,7 +264280,7 @@
     },
     {
       "sigla": "fr21",
-      "d": "fr21_329.01",
+      "d": "329",
       "content_all": "Swie fremdez mære.-",
       "id": "fr21_329.01",
       "verse": "01",
@@ -266402,7 +264288,7 @@
     },

pdaengeli · 2025-01-21T10:53:09Z

Regex still not covering all:

    {
      "sigla": "ok",
      "d": "ok_1.01",
      "content_all": "ISt zwifel hertzen noch gebur",
      "id": "ok_1.01",
      "verse": "01",
      "content": "ISt zwifel hertzen noch gebur"
    },

Affected sigla: mk, nk, ok


pdaengeli · 2025-01-21T16:55:13Z

@flicksolutions, `mk`, `nk`, `ok` are good now, too.

     {
       "sigla": "ok",
-      "d": "ok_1.01",
+      "d": "1",
       "content_all": "ISt zwifel hertzen noch gebur",
       "id": "ok_1.01",
       "verse": "01",
@@ -1768071,7 +1766510,7 @@
     },

To do:

index Fassungen

flicksolutions · 2025-01-21T19:00:56Z

Also to do for @flicksolutions:

- Indicate the match by making it bold.
- Create an option for exact search; if not exact, allow a fuzziness of at least 1.
- Create an index handler so that the long s (Schaft-S, ſ) matches s, and possibly other such equivalences.
- Search over Fassungen (versions) by default, with an option for searching over Zeugen (witnesses).
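
A possible starting point for the long-s handling and the bold highlighting, as a sketch only. `normalizeTerm` and `highlight` are hypothetical names; the only mapping covered is ſ to s, and the highlighting compares whole whitespace-separated words exactly, so it ignores fuzziness and trailing punctuation. A function like `normalizeTerm` could presumably be handed to MiniSearch via its `processTerm` option; that wiring is not shown here.

```typescript
// Lowercase a term and replace the long s (ſ) with a round s.
// Further historical letterforms would have to be added as needed.
function normalizeTerm(term: string): string {
  return term.toLowerCase().replace(/ſ/g, "s");
}

// Wrap each word of the verse that matches the query (after normalization)
// in <b>…</b>. Separators are kept so the original spacing is preserved.
function highlight(content: string, query: string): string {
  const wanted = normalizeTerm(query);
  return content
    .split(/(\s+)/)
    .map((part) =>
      part.trim() !== "" && normalizeTerm(part) === wanted ? `<b>${part}</b>` : part
    )
    .join("");
}

console.log(highlight("der mac dennoch weſen geil.", "wesen"));
// "der mac dennoch <b>weſen</b> geil."
```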

flicksolutions added this to the beta milestone Nov 1, 2023

flicksolutions self-assigned this Nov 1, 2023

flicksolutions assigned pdaengeli Dec 2, 2024

pdaengeli added a commit to DHBern/parzival-static-api that referenced this issue Dec 19, 2024

data, search-index.json, names.json

7c3a894

first stab at DHBern/presentation_parzival#18

flicksolutions linked a pull request Jan 20, 2025 that will close this issue

18 Umsetzung Suche #40

Open

pdaengeli added a commit to DHBern/parzival-static-api that referenced this issue Jan 20, 2025

src, data, index, fix dreissiger for fr

1befa9f

required fix for DHBern/presentation_parzival#18

pdaengeli added a commit to DHBern/parzival-static-api that referenced this issue Jan 21, 2025

src, index, cover "ok"-like cases

8472c09

DHBern/presentation_parzival#18 TODO: run, commit and push data

pdaengeli added a commit to DHBern/parzival-static-api that referenced this issue Jan 21, 2025

data, index, fix dreissiger for mk|nk|ok

c1f461b

cf. DHBern/presentation_parzival#18
