Implementing search (Umsetzung Suche) #18

Open
flicksolutions opened this issue Nov 1, 2023 · 13 comments · May be fixed by #40

@flicksolutions
Member

flicksolutions commented Nov 1, 2023

As discussed, @pdaengeli will build a JSON index for a local MiniSearch instance.

The structure will look something like this:


[
  {
    id: 'd-1.1',
    sigla: 'd',
    verse: '1.1',
    content: 'ST zwiuel h ̉ er zen nahgebur'
  },
  {
    id: 'fr45-78.15-f',
    sigla: 'fr45',
    verse: '78.15-f',
    content: 'Schildeſ ampt worhten'
  },
  // ...and more
]

This describes an MVP. If possible, we can add normalization information from https://parzival.unibe.ch/parzdb/listNFs.php, like so:

[
  {
    id: 'd-1.1',
    sigla: 'd',
    verse: '1.1',
    content: 'ST zwiuel h ̉ er zen nahgebur adamans',
    normalized_words: ['Adam']
  },
  {
    id: 'fr45-78.15-f',
    sigla: 'fr45',
    verse: '78.15-f',
    content: 'Schildeſ ampt worhten feirazizfeiraviz ',
    normalized_words: ['Feirefiz']
  },
  // ...and more
]
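
To make the proposed record shape concrete, here is a minimal TypeScript sketch. The field names come from the example above; the `IndexEntry` interface and the `findEntries` helper are hypothetical, and the linear scan merely stands in for whatever the search library does internally.

```typescript
// Shape of one verse record in the proposed index (field names from the example above).
interface IndexEntry {
  id: string;
  sigla: string;
  verse: string;
  content: string;
  // Optional: only present once normalization data is available.
  normalized_words?: string[];
}

const entries: IndexEntry[] = [
  {
    id: "d-1.1",
    sigla: "d",
    verse: "1.1",
    content: "ST zwiuel h ̉ er zen nahgebur adamans",
    normalized_words: ["Adam"],
  },
  {
    id: "fr45-78.15-f",
    sigla: "fr45",
    verse: "78.15-f",
    content: "Schildeſ ampt worhten feirazizfeiraviz",
    normalized_words: ["Feirefiz"],
  },
];

// Hypothetical helper: naive case-insensitive lookup over the transcription
// and the normalized forms. A search library would use an inverted index instead.
function findEntries(query: string, docs: IndexEntry[]): IndexEntry[] {
  const q = query.toLowerCase();
  return docs.filter((doc) => {
    const inContent = doc.content.toLowerCase().indexOf(q) !== -1;
    const inNormalized = (doc.normalized_words || []).some(
      (word) => word.toLowerCase() === q
    );
    return inContent || inNormalized;
  });
}

// "Feirefiz" does not occur in the transcription, only in the normalized forms.
const hits = findEntries("Feirefiz", entries);
```

The point of the extra field is visible here: the second record is only found because of its normalized form, not because of its transcribed spelling.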
@flicksolutions flicksolutions added this to the beta milestone Nov 1, 2023
@flicksolutions flicksolutions self-assigned this Nov 1, 2023
@pdaengeli

pdaengeli commented Dec 3, 2024

S. Abel will hand over an XML file containing IDs of regularised entries with lemma.

With this we can generate `normalized_words: ['Feirefiz']` directly from information contained in the XML files (e.g. `rego:o0001`).

Also see https://parzival.unibe.ch/parzdb/listNFs.php#:~:text=Eigennamen

@flicksolutions
Member Author

Email from Stefan Abel

Attached you will find one PDF each containing a name index (Namenregister) for the lead manuscripts (Leithandschriften) D (*D), m (*m), G (*G) and TUQ (*T); in version (Fassung) *T the lead manuscript alternates between T, U and Q, which is why there are three name indexes for this version.

We can have these indexes generated automatically via Tustep. All the necessary information on the individual names (p = personal names, o = place names, s = other, e.g. gemstones and planets) is contained in the file namen.xml (also attached). That file also links the IDs to the normal forms of the names.

Attachments

pdaengeli added a commit to DHBern/parzival-static-api that referenced this issue Dec 19, 2024
as outlined in DHBern/presentation_parzival#18

there might still be need to clarify how to handle various TEI elements (cf. build-index.xsl) and whether or not to cover relations contained in names.xml
pdaengeli added a commit to DHBern/parzival-static-api that referenced this issue Dec 19, 2024
@pdaengeli

pdaengeli commented Dec 19, 2024

search-index.json is now available at https://dhbern.github.io/parzival-static-api/api/json/

It contains two content strings, content and content_all:

    {
      "sigla": "d",
      "content_all": "der mac dennoc* dennoch weſen geil.",
      "id": "d_1.07",
      "verse": "07",
      "content": "der mac dennoch weſen geil."
    },
    {
      "sigla": "d",
      "content_all": "wand an im ſint *eidiv beidiv teil.",
      "id": "d_1.08",
      "verse": "08",
      "content": "wand an im ſint beidiv teil."
    },

They result from selective (`content`) and non-selective (`content_all`) treatment of some of the TEI elements used for the verse encoding. Depending on the requirements of the search, we can refine them. We might omit `content_all` altogether, or reduce it to the additional strings only and add those in a MiniSearch index function as needed (here that would be "add" : "dennoc*" and "add" : "*eidiv").
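
Reducing `content_all` to the additional strings could be sketched as a simple token difference. This is only an illustration under the assumption that whitespace tokenization is good enough; `additionalTokens` is a hypothetical helper, not part of the index build.

```typescript
// Return the tokens that occur in contentAll but not in content,
// in order of appearance and without duplicates.
function additionalTokens(content: string, contentAll: string): string[] {
  const known = content.split(/\s+/).filter((token) => token.length > 0);
  const candidates = contentAll.split(/\s+/).filter((token) => token.length > 0);
  const extra: string[] = [];
  for (const token of candidates) {
    if (known.indexOf(token) === -1 && extra.indexOf(token) === -1) {
      extra.push(token);
    }
  }
  return extra;
}

const extraForVerse7 = additionalTokens(
  "der mac dennoch weſen geil.",
  "der mac dennoc* dennoch weſen geil."
);
const extraForVerse8 = additionalTokens(
  "wand an im ſint beidiv teil.",
  "wand an im ſint *eidiv beidiv teil."
);
```

For the two verses above this yields exactly the strings named in the comment, `dennoc*` and `*eidiv`.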

Instead of `normalized_words` I used `terms`, and the key is only created where such terms exist.

    {
      "sigla": "d",
      "terms": [ "rubin" ],
      "content_all": " verwurchet verwürchet edeln rvͦbin.",
      "id": "d_3.17",
      "verse": "17",
      "content": "verwürchet edeln rvͦbin."
    },

    {
      "sigla": "d",
      "terms": [ "Gahmuret", "Anschevin" ],
      "content_all": "Gahmvret Anſcivin.",
      "id": "d_6.26",
      "verse": "26",
      "content": "Gahmvret Anſcivin."
    },

With regard to cases where more than one regular variant is given, we should decide if it is better to atomize them or keep them as they currently are:

    {
      "sigla": "q",
      "terms": [ "Barbigœl Parbigœl" ],
      "content_all": "Er ſehe in ſint zu barbigol",
      "id": "q_646.05",
      "verse": "05",
      "content": "Er ſehe in ſint zu barbigol"
    },

    {
      "sigla": "q",
      "terms": [ "Meljanz Melyanz", "Barbigœl Parbigœl" ],
      "content_all": "Vnd roys meliantz de barbigol",
      "id": "q_665.01",
      "verse": "01",
      "content": "Vnd roys meliantz de barbigol"
    },

[Image: Q_namen.pdf]


The resulting file is large (> 70 MB). There are some options for shrinking it (switching off indentation in the serialization, removing spurious information, shortening keys, splitting it into several documents), but I'll leave this for later.
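
Two of the shrinking options, switching off indentation and shortening keys, can be illustrated in a few lines. This is a sketch only: the `shortKeys` mapping is invented for the example, and the real file is serialized by the build process rather than by this code.

```typescript
type Doc = { [key: string]: string };

const docs: Doc[] = [
  {
    sigla: "d",
    content_all: "der mac dennoc* dennoch weſen geil.",
    id: "d_1.07",
    verse: "07",
    content: "der mac dennoch weſen geil.",
  },
  {
    sigla: "d",
    content_all: "wand an im ſint *eidiv beidiv teil.",
    id: "d_1.08",
    verse: "08",
    content: "wand an im ſint beidiv teil.",
  },
];

// Hypothetical mapping from the current keys to one-letter keys.
const shortKeys: { [key: string]: string } = {
  sigla: "s",
  content_all: "a",
  id: "i",
  verse: "v",
  content: "c",
};

// Rename the keys of one record; unknown keys are kept unchanged.
function shorten(doc: Doc): Doc {
  const out: Doc = {};
  for (const key of Object.keys(doc)) {
    out[shortKeys[key] || key] = doc[key];
  }
  return out;
}

const pretty = JSON.stringify(docs, null, 2); // indented serialization
const compact = JSON.stringify(docs); // indentation switched off
const compactShort = JSON.stringify(docs.map(shorten)); // plus shortened keys

console.log(pretty.length, compact.length, compactShort.length);
```

Shortened keys trade readability for size, so whether they are worth it probably depends on how much gzip already recovers.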

In addition, there is also names.json with information on persons, places and stones/planets/etc. It is not needed for generating the data, but it was low-hanging fruit. Perhaps it will come in handy for the presentation at some point.

@flicksolutions
Member Author

Thanks, @pdaengeli! Looks absolutely doable, and 70 MB isn't even that big, considering that's before gzipping and minifying.
Reducing content_all seems favourable as well. Great work so far!

More than one regular variant seems very counterintuitive to me. Why would you want to have something like that? Isn't the point of normal forms that you only have one variant..? I guess it's for historical reasons :-) Anyway, I think we should treat them as separate terms for our use. Keeping two terms in one string might hurt search performance, because there will never be an exact match for the combined string. Is this easily doable?

@pdaengeli

> More than one regular variant seems very counterintuitive to me. Why would you want to have something like that? Isn't the point of normal forms that you only have one variant..?

The only explanation I can come up with is that there are cases with more than one relatively established normal variant. I was also surprised to find these cases.

Splitting the strings is trivial. I just pushed DHBern/parzival-static-api@5091a2e, and the output should soon look like this:

    {
      "sigla": "q",
      "terms": [ "Barbigœl", "Parbigœl" ],
      "content_all": "Er ſehe in ſint zu barbigol",
      "id": "q_646.05",
      "verse": "05",
      "content": "Er ſehe in ſint zu barbigol"
    },

    {
      "sigla": "q",
      "terms": [ "Meljanz", "Melyanz", "Barbigœl", "Parbigœl" ],
      "content_all": "Vnd roys meliantz de barbigol",
      "id": "q_665.01",
      "verse": "01",
      "content": "Vnd roys meliantz de barbigol"
    },
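
For illustration, the same atomization can be expressed on the consuming side as well. This is a sketch and not the change made in the referenced commit; `atomizeTerms` is a hypothetical helper, and it assumes that no normal form itself contains a space.

```typescript
// Split every multi-variant string on whitespace and drop duplicates,
// so that each normal form becomes its own term.
function atomizeTerms(terms: string[]): string[] {
  const result: string[] = [];
  for (const term of terms) {
    for (const part of term.split(/\s+/)) {
      if (part.length > 0 && result.indexOf(part) === -1) {
        result.push(part);
      }
    }
  }
  return result;
}

const atomized = atomizeTerms(["Meljanz Melyanz", "Barbigœl Parbigœl"]);
```

If multi-word normal forms do exist in namen.xml, splitting on whitespace would break them apart, so this assumption would need checking first.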

@flicksolutions
Member Author

@pdaengeli I just realized that the Dreissiger (the thirty-verse section number) is missing. It can obviously be derived from the `id`, but if I derive the Dreissiger from the `id`, I can derive `sigla` and `verse` the same way. So we need to decide whether to shrink the JSON by dropping `sigla` and `verse`, or to add a thirties property.

What do you think?

@pdaengeli

Generally speaking, I prefer having knowledge in the data over logic in the application (where it is affordable).

So I went ahead and added a d key in DHBern/parzival-static-api@af3a4c8.

 "docs": [
    {
      "sigla": "d",
      "d": "1",
      "content_all": "Gagmvret Herzey",
      "id": "d_1.01-01",
      "verse": "01-01",
      "content": "Gagmvret Herzey"
    },
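
For comparison, deriving the three values from the `id` in the application would look roughly like the sketch below. It assumes ids of the form `<sigla>_<Dreissiger>.<verse>`, as in `d_1.01-01`; the `parseId` helper and its regular expression are illustrative and are not the expression used in build-index.xsl.

```typescript
interface ParsedId {
  sigla: string;
  d: string;
  verse: string;
}

// Split an id into sigla (before the underscore), Dreissiger (digits before
// the first dot) and verse (everything after that dot).
function parseId(id: string): ParsedId | null {
  const match = /^([^_]+)_(\d+)\.(.+)$/.exec(id);
  if (match === null) {
    return null;
  }
  return { sigla: match[1], d: match[2], verse: match[3] };
}

const parsedMain = parseId("d_1.01-01");
const parsedFragment = parseId("fr21_328.29");
```

Ids that do not follow the assumed pattern yield `null` instead of a wrong value, which makes gaps in the pattern easier to spot.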

@flicksolutions
Member Author

Thanks!
Parsing `d` doesn't quite work for the fragments (Fr.):

Image

@pdaengeli

Must be a deficient regex in L71. I'll fix it later.

@flicksolutions flicksolutions linked a pull request Jan 20, 2025 that will close this issue
pdaengeli added a commit to DHBern/parzival-static-api that referenced this issue Jan 20, 2025
@pdaengeli

The fragments look good now.

E.g.

diff --git a/dist/api/json/search-index.json b/dist/api/json/search-index.json
index 77a9997..89b371f 100644
--- a/dist/api/json/search-index.json
+++ b/dist/api/json/search-index.json
@@ -201946,7 +201946,7 @@
[…]
+    { "sigla":"fr21", "d":"328", "content_all":"", "id":"fr21_328.27", "verse":"27", "content":"" },
+    { "sigla":"fr21", "d":"328", "content_all":"", "id":"fr21_328.28", "verse":"28", "content":"" },
     {
       "sigla": "fr21",
-      "d": "fr21_328.27",
-      "content_all": "",
-      "id": "fr21_328.27",
-      "verse": "27",
-      "content": ""
-    },
-    {
-      "sigla": "fr21",
-      "d": "fr21_328.28",
-      "content_all": "",
-      "id": "fr21_328.28",
-      "verse": "28",
-      "content": ""
-    },
-    {
-      "sigla": "fr21",
-      "d": "fr21_328.29",
+      "d": "328",
       "content_all": "Feirefiz anſche",
       "id": "fr21_328.29",
       "verse": "29",
@@ -266386,7 +264272,7 @@
     },
     {
       "sigla": "fr21",
-      "d": "fr21_328.30",
+      "d": "328",
       "content_all": "Deſ tat dvrich wen pin.-",
       "id": "fr21_328.30",
       "verse": "30",
@@ -266394,7 +264280,7 @@
     },
     {
       "sigla": "fr21",
-      "d": "fr21_329.01",
+      "d": "329",
       "content_all": "Swie fremdez mære.-",
       "id": "fr21_329.01",
       "verse": "01",
@@ -266402,7 +264288,7 @@
     },

@pdaengeli

pdaengeli commented Jan 21, 2025

The regex still doesn't cover all cases:

    {
      "sigla": "ok",
      "d": "ok_1.01",
      "content_all": "ISt zwifel hertzen noch gebur",
      "id": "ok_1.01",
      "verse": "01",
      "content": "ISt zwifel hertzen noch gebur"
    },

Affected sigla: mk, nk, ok

pdaengeli added a commit to DHBern/parzival-static-api that referenced this issue Jan 21, 2025
pdaengeli added a commit to DHBern/parzival-static-api that referenced this issue Jan 21, 2025
@pdaengeli

pdaengeli commented Jan 21, 2025

@flicksolutions, `mk`, `nk`, `ok` are good now, too.
     {
       "sigla": "ok",
-      "d": "ok_1.01",
+      "d": "1",
       "content_all": "ISt zwifel hertzen noch gebur",
       "id": "ok_1.01",
       "verse": "01",
@@ -1768071,7 +1766510,7 @@
     },

To do:

  • index Fassungen

@flicksolutions
Member Author

flicksolutions commented Jan 21, 2025

Also to do for @flicksolutions:

  • Indicate the match by making it bold
  • Create an option for exact search; if not exact, allow a fuzziness of at least 1
  • Create an index handler so that the long s (Schaft-S, ſ) matches s, and possibly treat other characters likewise
  • Search over Fassungen (versions) by default, with an option to search over Zeugen (witnesses)
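
Two of these items, folding the long s and marking the match in bold, could be sketched as follows. Both `foldTerm` and `highlight` are hypothetical helpers; in MiniSearch a folding function like this could presumably be supplied through the `processTerm` option, but that wiring is an assumption and is not shown here.

```typescript
// Fold a term for comparison: lowercase it and map the long s (ſ) to a plain s.
function foldTerm(term: string): string {
  return term.toLowerCase().replace(/ſ/g, "s");
}

// Wrap every space-separated token whose folded form equals the folded query
// in <b>…</b>. Punctuation attached to a token is not stripped in this sketch.
function highlight(content: string, query: string): string {
  const target = foldTerm(query);
  return content
    .split(" ")
    .map((token) => (foldTerm(token) === target ? "<b>" + token + "</b>" : token))
    .join(" ");
}

const marked = highlight("der mac dennoch weſen geil.", "wesen");
```

Because the comparison uses the folded form while the output keeps the original token, the transcription stays untouched in the result. Tokens with trailing punctuation such as `geil.` would need additional handling.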
