Search implementation #18
Comments
S. Abel will hand over an XML file containing the IDs of regularised entries together with their lemmas. With this we can generate […]
Also see https://parzival.unibe.ch/parzdb/listNFs.php#:~:text=Eigennamen
Email from Stefan Abel
Attachments
As outlined in DHBern/presentation_parzival#18, there may still be a need to clarify how to handle various TEI elements (cf. build-index.xsl) and whether or not to cover the relations contained in names.xml.
It contains two content strings:
{
"sigla": "d",
"content_all": "der mac dennoc* dennoch weſen geil.",
"id": "d_1.07",
"verse": "07",
"content": "der mac dennoch weſen geil."
},
{
"sigla": "d",
"content_all": "wand an im ſint *eidiv beidiv teil.",
"id": "d_1.08",
"verse": "08",
"content": "wand an im ſint beidiv teil."
},
They are based on selective or non-selective treatment of some of the TEI elements that are used for the verse encoding. Depending on the requirements of the search, we can improve them. Perhaps we may omit […]
Instead of […]
{
"sigla": "d",
"terms": [ "rubin" ],
"content_all": " verwurchet verwürchet edeln rvͦbin.",
"id": "d_3.17",
"verse": "17",
"content": "verwürchet edeln rvͦbin."
},
{
"sigla": "d",
"terms": [ "Gahmuret", "Anschevin" ],
"content_all": "Gahmvret Anſcivin.",
"id": "d_6.26",
"verse": "26",
"content": "Gahmvret Anſcivin."
},
With regard to cases where more than one regular variant is given, we should decide whether it is better to atomize them or keep them as they currently are:
{
"sigla": "q",
"terms": [ "Barbigœl Parbigœl" ],
"content_all": "Er ſehe in ſint zu barbigol",
"id": "q_646.05",
"verse": "05",
"content": "Er ſehe in ſint zu barbigol"
},
{
"sigla": "q",
"terms": [ "Meljanz Melyanz", "Barbigœl Parbigœl" ],
"content_all": "Vnd roys meliantz de barbigol",
"id": "q_665.01",
"verse": "01",
"content": "Vnd roys meliantz de barbigol"
},
The resulting file is large (> 70 MB). There are some options for shrinking it (switching off indentation in the serialization, removing spurious information, shortening keys, splitting into several documents), but I'll leave this for later. In addition to this, there is also […]
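Two of the shrinking options listed above (no indentation, shorter keys) can be sketched as follows. This is only an illustration on the two sample records, not the serialization actually used for the index; the short key names in `SHORT_KEYS` are hypothetical.

```python
import gzip
import json

# Two records copied from the excerpt above.
docs = [
    {"sigla": "d", "content_all": "der mac dennoc* dennoch weſen geil.",
     "id": "d_1.07", "verse": "07", "content": "der mac dennoch weſen geil."},
    {"sigla": "d", "content_all": "wand an im ſint *eidiv beidiv teil.",
     "id": "d_1.08", "verse": "08", "content": "wand an im ſint beidiv teil."},
]

# Hypothetical short key names, chosen only for this illustration.
SHORT_KEYS = {"sigla": "s", "content_all": "a", "id": "i", "verse": "v", "content": "c"}

# Indented output versus output without indentation or padding.
pretty = json.dumps(docs, indent=2, ensure_ascii=False)
compact = json.dumps(docs, separators=(",", ":"), ensure_ascii=False)

# Same data with shortened keys, also serialized compactly.
short_docs = [{SHORT_KEYS[key]: value for key, value in doc.items()} for doc in docs]
short = json.dumps(short_docs, separators=(",", ":"), ensure_ascii=False)

# Report raw and gzipped byte sizes for each variant.
for label, text in (("pretty", pretty), ("compact", compact), ("short keys", short)):
    raw = text.encode("utf-8")
    print(label, len(raw), len(gzip.compress(raw)))
```

On a two-record sample the numbers say little about the full file; the point is only that each step is a one-line change to the serialization.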
Thanks, @pdaengeli! Looks absolutely doable, and 70 MB isn't even that big, considering it's before gzipping and minifying. More than one regular variant seems very counterintuitive to me. Why would you want something like that? Isn't the point of normal forms that you only have one variant? I guess it's for historical reasons :-) Anyway, I think we should treat them as separate terms for our use. It might hurt search performance if we keep two terms in one string (because there will never be an exact match for those). Is this easily doable?
The only explanation I can come up with is that there are cases with more than one relatively established normal variant. I was also surprised to find these cases. Splitting the strings is trivial. I just pushed DHBern/parzival-static-api@5091a2e and it should soon be:
{
"sigla": "q",
"terms": [ "Barbigœl", "Parbigœl" ],
"content_all": "Er ſehe in ſint zu barbigol",
"id": "q_646.05",
"verse": "05",
"content": "Er ſehe in ſint zu barbigol"
},
{
"sigla": "q",
"terms": [ "Meljanz", "Melyanz", "Barbigœl", "Parbigœl" ],
"content_all": "Vnd roys meliantz de barbigol",
"id": "q_665.01",
"verse": "01",
"content": "Vnd roys meliantz de barbigol"
},
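For illustration, the atomization can be sketched in a few lines. This is not the change made in the commit referenced above, only a stand-alone sketch; it assumes that variants are separated by whitespace and that no normal form itself contains a space. The helper name `split_terms` is hypothetical.

```python
def split_terms(terms):
    """Split space-separated normal-form variants into individual terms."""
    atomized = []
    for entry in terms:
        for variant in entry.split():
            # Keep the first occurrence of each variant and drop duplicates.
            if variant not in atomized:
                atomized.append(variant)
    return atomized


doc = {"id": "q_665.01", "terms": ["Meljanz Melyanz", "Barbigœl Parbigœl"]}

# While two variants share one string, an exact lookup for one of them fails.
print("Parbigœl" in doc["terms"])

doc["terms"] = split_terms(doc["terms"])

# After splitting, each variant is its own term and can be matched exactly.
print(doc["terms"])
print("Parbigœl" in doc["terms"])
```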
@pdaengeli I just realized that the Dreissiger is missing. This can obviously be derived from the id, but if I'm deriving the Dreissiger from the id, I can also derive sigla and verse. So we have to decide whether we want to minify the JSON by dropping sigla and verse, or add a […]
What do you think?
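To make the trade-off concrete, here is a minimal sketch of deriving all three values from the id on the application side. It assumes that every id has the shape `<sigla>_<Dreissiger>.<verse>`, as in the ids quoted in this thread; the names `ID_PATTERN` and `parse_id` are hypothetical.

```python
import re

# Assumed id shape: <sigla>_<Dreissiger>.<verse>, e.g. "fr21_328.29" or "d_1.01-01".
# The sigla may contain digits, so it is matched as "anything up to the underscore".
ID_PATTERN = re.compile(r"^(?P<sigla>[^_]+)_(?P<d>\d+)\.(?P<verse>.+)$")


def parse_id(doc_id):
    """Return sigla, Dreissiger ("d") and verse derived from a document id."""
    match = ID_PATTERN.match(doc_id)
    if match is None:
        raise ValueError(f"unexpected id format: {doc_id!r}")
    return match.groupdict()


for doc_id in ("d_1.07", "fr21_328.29", "d_1.01-01"):
    print(doc_id, parse_id(doc_id))
```

If the index keeps these fields explicitly, the application needs no such parsing; if they are dropped, a helper of this kind has to run for every hit.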
Generally speaking, I prefer having knowledge in the data over logic in the application (where it is affordable). So I went ahead and added a […]
"docs": [
{
"sigla": "d",
"d": "1",
"content_all": "Gagmvret Herzey",
"id": "d_1.01-01",
"verse": "01-01",
"content": "Gagmvret Herzey"
},
Must be a deficient regex in […]
The fragments look good now. E.g.
diff --git a/dist/api/json/search-index.json b/dist/api/json/search-index.json
index 77a9997..89b371f 100644
--- a/dist/api/json/search-index.json
+++ b/dist/api/json/search-index.json
@@ -201946,7 +201946,7 @@
[…]
+ { "sigla":"fr21", "d":"328", "content_all":"", "id":"fr21_328.27", "verse":"27", "content":"" },
+ { "sigla":"fr21", "d":"328", "content_all":"", "id":"fr21_328.28", "verse":"28", "content":"" },
{
"sigla": "fr21",
- "d": "fr21_328.27",
- "content_all": "",
- "id": "fr21_328.27",
- "verse": "27",
- "content": ""
- },
- {
- "sigla": "fr21",
- "d": "fr21_328.28",
- "content_all": "",
- "id": "fr21_328.28",
- "verse": "28",
- "content": ""
- },
- {
- "sigla": "fr21",
- "d": "fr21_328.29",
+ "d": "328",
"content_all": "Feirefiz anſche",
"id": "fr21_328.29",
"verse": "29",
@@ -266386,7 +264272,7 @@
},
{
"sigla": "fr21",
- "d": "fr21_328.30",
+ "d": "328",
"content_all": "Deſ tat dvrich wen pin.-",
"id": "fr21_328.30",
"verse": "30",
@@ -266394,7 +264280,7 @@
},
{
"sigla": "fr21",
- "d": "fr21_329.01",
+ "d": "329",
"content_all": "Swie fremdez mære.-",
"id": "fr21_329.01",
"verse": "01",
@@ -266402,7 +264288,7 @@
},
The regex still does not cover all cases:
{
"sigla": "ok",
"d": "ok_1.01",
"content_all": "ISt zwifel hertzen noch gebur",
"id": "ok_1.01",
"verse": "01",
"content": "ISt zwifel hertzen noch gebur"
},
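A quick consistency check can list the sigla that are still affected. The sketch below assumes the id shape `<sigla>_<Dreissiger>.<verse>` and operates on an already loaded list of records; the function names are hypothetical and the sample records are reduced to the relevant keys.

```python
def dreissiger_from_id(doc_id):
    # Assumes ids look like "<sigla>_<Dreissiger>.<verse>".
    return doc_id.split("_", 1)[1].split(".", 1)[0]


def sigla_with_bad_d(docs):
    """Return the sorted sigla whose "d" value does not match the id."""
    bad = set()
    for doc in docs:
        if doc["d"] != dreissiger_from_id(doc["id"]):
            bad.add(doc["sigla"])
    return sorted(bad)


# Sample records reduced to the keys the check needs.
docs = [
    {"sigla": "ok", "d": "ok_1.01", "id": "ok_1.01"},
    {"sigla": "fr21", "d": "328", "id": "fr21_328.29"},
    {"sigla": "d", "d": "1", "id": "d_1.07"},
]

print(sigla_with_bad_d(docs))  # ['ok']
```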
DHBern/presentation_parzival#18 TODO: run, commit and push data
@flicksolutions, `mk`, `nk`, `ok` are good now, too.
{
"sigla": "ok",
- "d": "ok_1.01",
+ "d": "1",
"content_all": "ISt zwifel hertzen noch gebur",
"id": "ok_1.01",
"verse": "01",
@@ -1768071,7 +1766510,7 @@
},
To do:
Also to do for @flicksolutions
As discussed:
@pdaengeli will build an index JSON for a local MiniSearch instance:
The structure will look something like this:
This describes an MVP. If possible, we can add normalization information from https://parzival.unibe.ch/parzdb/listNFs.php like so: