TODO list for keyboard search #87

Closed
15 tasks done
mcdurdin opened this issue Jul 12, 2020 · 12 comments

@mcdurdin
Member

mcdurdin commented Jul 12, 2020

FUTURE:

TODO:

===========================================================================================================================

DONE:

  1. Additional Notes: Notes on api.keyman.com changes for langtag consumption

    • example keyboard: burushaski_girminas, khw-latn "Khowar (Latin)"

    • ខ្មែរ finds zero results but ខ្មែ finds 7...

    • Show a 'popular keyboards' list for the empty search -- this can also be the search engine jumping-off point.

    • "Show obsolete keyboards" needs an indication of the change of status ("Hide obsolete keyboards") and needs to be outdented. Also
      needs some thought with paginated results.

    • Too many pages leads to overwhelming number of page links at bottom (e.g. s:latin)

    • http://api.keyman.com.local/search?q=l gives 500

    • el_dinka appears to have non-canonical bcp47 codes -- search finds it no trouble.

    • Show list of associated languages+scripts+countries in keyboard details (and related keyboards?)

    • For in-app download links, include information on searched language code (if available), for default language install (#1456)

    • Match fields in json should be integer or float where possible, not string! (and update schema accordingly)

    • schema for match type should be restrictive to actual types used

    • Search "spa" vs "spanish" -- the weighting could be better. Similarly "ger" vs "german". (Probably needs a length-based match weight override.)

    • REFACTOR: region vs country

    • REFACTOR: code vs id vs tag

    • Pagination

    • Need to give more detail on failed links (and make it easier to find in logs, so tweak the broken link search a node wrapper)

    • Searches for keyboard ids should work

    • Phrases are not working yet (need to split into either a phrase search or separate words)

    • Searches for bcp47 tags, scripts, regions should work

      • need to highlight these on keyman.com (incl. keyboard_id)

FAIL: http://api.keyman.com.local/search/2.0?f=1&q=l:%, c:%, etc.

PHP Fatal error:  Uncaught PDOException: SQLSTATE[IMSSP]: The active result for the query contains no fields. in C:\Projects\keyman\sites\api.keyman.com\script\search\2.0\search.inc.php:236
Stack trace:
#0 C:\Projects\keyman\sites\api.keyman.com\script\search\2.0\search.inc.php(236): PDOStatement->fetchAll()
#1 C:\Projects\keyman\sites\api.keyman.com\script\search\2.0\search.inc.php(77): KeyboardSearch->GetSearchQueries()
#2 C:\Projects\keyman\sites\api.keyman.com\script\search\2.0\search.inc.php(73): KeyboardSearch->WriteSearchResults()
#3 C:\Projects\keyman\sites\api.keyman.com\script\search\2.0\search.php(50): KeyboardSearch->GetSearchMatches()
#4 {main}
  thrown in C:\Projects\keyman\sites\api.keyman.com\script\search\2.0\search.inc.php on line 236
  1. Default search should return a FLAT LIST of KEYBOARDS ONLY with highlights. e.g. 'Thai' should return keyboards with 'Thai' in the name, in a language name, or in the country associated with the language.

  2. Search results must be weighted (summed?)
    a) match of primary language name 1.0
    b) match of alternate language name 0.3
    c) match of keyboard name or id 1.0
    d) match of script name 1.0
    e) match of country name 0.5
    f) match on term in description 0.5
    g) match quality (whole word match = 1.0, down to 0.1 for further distance? as a multiplicand)
    select * from t_langtag_name inner join containstable(t_langtag_name, name, 'isabout (thai weight (1.0), "thai*" weight (0.5))') as KEY_TBL ON t_langtag_name._id = KEY_TBL.[KEY] order by [RANK] desc
    5 / 5 = 1.0
    4 / 5 = 0.8
    1 / 5 = 0.2
    NOTE: final weighting is different but ... let's see how it goes

  3. Can also specify a search:
    ?q=l:<term> search for keyboards that support a language, by name (does not check id)
    ?q=l:id:<id> search for keyboards that support a language, by bcp 47 id
    ?q=c:<term> search for keyboards that support languages within a country
    ?q=c:id:<id> search for keyboards that support languages within a country, by iso 3166 id
    ?q=s:<term> search for keyboards that support a script
    ?q=s:id:<id> search for keyboards that support a script by script id
    ?q=id:<id> search for keyboards that match the id
    ?q=legacy:<id> search for keyboards that match the legacy id, only one returned!

  4. Should be able to specify alternate names? Searches should match on NFKD with diacritics stripped.
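
A minimal sketch of the NFKD-plus-diacritic-stripping idea in item 4, written in Python purely for illustration (the search itself is PHP and SQL Server). The function name `fold_for_search` is hypothetical and nothing here is taken from the actual implementation.

```python
import unicodedata


def fold_for_search(text: str) -> str:
    """Fold a name or search term so that accented and unaccented
    spellings compare equal (hypothetical helper, not the real code)."""
    # NFKD splits precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop every character that has a non-zero canonical combining class.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() intended for caseless matching.
    return stripped.casefold()


if __name__ == "__main__":
    for sample in ("Français", "Māori", "Tiếng Việt"):
        print(sample, "->", fold_for_search(sample))
```

Both the indexed names and the incoming query would need to pass through the same fold. Blanket removal of combining marks may be too aggressive for scripts in which those marks are essential to the word, so this is a starting point rather than a complete answer.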

  • api.keyman.com#48 -- database rotation
  • Warning: sizeof(): Parameter must be an array or an object that implements Countable in C:\Projects\keyman\sites\api.keyman.com\script\search\search.php on line 202
    Warning: sizeof(): Parameter must be an array or an object that implements Countable in C:\Projects\keyman\sites\api.keyman.com\script\search\search.php on line 209

    when searching for http://api.keyman.com.local/search?q=k:thai
  1. Highlight found term in results

Process:

  1. Add langtags.json to website database so it is available for rewrite. Don't touch APIs at this stage. Deploy this.
  2. Rewrite search.php to support the details above, basing off langtags.json.
@mcdurdin mcdurdin added this to the P10S11 milestone Jul 12, 2020
@mcdurdin
Member Author

mcdurdin commented Jul 21, 2020

@mcdurdin mcdurdin modified the milestones: P10S11, P10S12 Jul 26, 2020
@darcywong00
Contributor

darcywong00 commented Jul 31, 2020

@mcdurdin
Member Author

mcdurdin commented Aug 3, 2020

  • Search no longer seems to update URL during the query process, which means that history is broken. I think the URL should be updated when the search results are returned. (FIX: fix: search history and hints keyman.com#149)

@mcdurdin
Member Author

mcdurdin commented Aug 5, 2020

@ermshiperete comments:

  • The search currently only finds languages that start with the search term. Previously it also listed languages that contained that term. Searching for "German" now shows all keyboards for the German language, but not "German, Pennsylvania" that it showed previously. Searching for "Pennsylvania Dutch" shows the expected results, but searching just for "Dutch" shows only keyboards for Dutch, but not Pennsylvania Dutch.

  • if search finds a match I can't extend the search to include a space - the space at the end gets removed. I can paste from the clipboard, then it searches with space (e.g. "german pe") and finds "EuroLatin (SIL) (German, Pennsylvania language)", but I can't type it. (FIX: fix: search history and hints keyman.com#149)

  • Searching for "Amish" (that previously showed language "German, Pennsylvania (Amish Pennsylvania German)") has 0 results.

  • It's not possible to search for language code (unless you use l:id:code)

  • Searching for "usa" shows keyboards for Usakade language, but not keyboards used in the USA.

  • Searching for "c:id:usa" has 0 results (don't they use keyboards anymore? :-) ). Ah, I see. It expects the two-letter codes from ISO 3166-1 alpha-2, not the three letter codes: "c:id:us"

  • Searching for "c:usa" has 0 results. Has to use "c:united states" (and if I'm slower to type than the search to find results then I can't type that because it strips the space)

  • Searching for "swa" shows keyboards for several languages that start with "swa". However, it doesn't show "EuroLatin (SIL)" for Swabian; Swabian keyboards appear under obsolete keyboards. Searching for "swab" shows the EuroLatin one as well.

  • The display of results when searching for localized language name is awkward: "EuroLatin (SIL)(Deutsch language)". Putting the language first would be better: "EuroLatin (SIL)(language: Deutsch)"

  • Searching for "l:id:ydd" shows results for BCP 47 tag 'yi' - which might be correct but is a bit surprising. "l:id:yi" shows same results (old search didn't find anything for l:id:yi). Searching for "l:id:yih" shows results for BCP tag 'yih'.

  • Searching for "yiddis" shows "Yiddish Pasekh". Searching for "yiddish" shows "Yiddish Pasekh (Yiddish language)". Searching for "yiddish p" shows "Yiddish Pasekh" again.

  • The old search listed languages and countries related to the search term which I find helpful.

Further comments from @ermshiperete:

Just found a bug:

Further comments from @ermshiperete:

@mcdurdin
Member Author

mcdurdin commented Aug 5, 2020

I was trying to query for a recently added sil_nko keyboard for the N’Ko language

My query l:n’ko gives a list of 7 Keyboards for languages matching ‘n’ko’ but sil_nko isn’t one of them.

The keyboard page
https://keyman-staging.com/keyboards/sil_nko

does list N’Ko (l:id:nqo) as one of the supported languages

@mcdurdin
Member Author

mcdurdin commented Aug 5, 2020

  • The search currently only finds languages that start with the search term. Previously it also listed languages that contained that term. Searching for "German" now shows all keyboards for the German language, but not "German, Pennsylvania" that it showed previously. Searching for "Pennsylvania Dutch" shows the expected results, but searching just for "Dutch" shows only keyboards for Dutch, but not Pennsylvania Dutch.

This is by design. There is only one keyboard currently listed that supports those languages: sil_euro_latin. However, because it also supports Dutch and German, searching for those terms finds the shorter matching language names first. Because we don't do a nested search now, just a keyboard search, these types of changes in results are to be expected.

  • Searching for "Amish" (that previously showed language "German, Pennsylvania (Amish Pennsylvania German)") has 0 results.

langtags.json does not list Amish Pennsylvania German as an alternate language name for Pennsylvania German. If this is a problem, it should be fixed in langtags.json.

  • It's not possible to search for language code (unless you use l:id:_code_)

This is an advanced feature and is by design. There is now a hint to help you search by language code on the search page.
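
For readers unfamiliar with the prefixed query forms listed in the issue description (`l:`, `l:id:`, `c:`, `c:id:`, `s:`, `s:id:`, `id:`, `legacy:`), here is a rough Python sketch of how such a query string might be split into a scope, an "is this an id?" flag, and a term. `ParsedQuery` and `parse_query` are hypothetical names; the real parser lives in the PHP search code and may behave differently, especially for unknown prefixes.

```python
from typing import NamedTuple


class ParsedQuery(NamedTuple):
    scope: str   # "any", "language", "country", "script", "keyboard" or "legacy"
    by_id: bool  # True when the term is an identifier rather than a name
    term: str


# Prefix letters taken from the list in the issue description.
_SCOPES = {"l": "language", "c": "country", "s": "script"}


def parse_query(q: str) -> ParsedQuery:
    q = q.strip()
    prefix, sep, rest = q.partition(":")
    if not sep:
        # No prefix at all: plain search across all fields.
        return ParsedQuery("any", False, q)
    prefix = prefix.lower()
    if prefix in _SCOPES:
        if rest.lower().startswith("id:"):
            return ParsedQuery(_SCOPES[prefix], True, rest[3:])
        return ParsedQuery(_SCOPES[prefix], False, rest)
    if prefix == "id":
        return ParsedQuery("keyboard", True, rest)
    if prefix == "legacy":
        return ParsedQuery("legacy", True, rest)
    # Unknown prefix: assumed here to fall back to a plain search.
    return ParsedQuery("any", False, q)


if __name__ == "__main__":
    for example in ("thai", "l:id:nqo", "c:united states", "id:sil_ipa"):
        print(example, "->", parse_query(example))
```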

  • Searching for usa shows keyboards for Usakade language, but not keyboards used in the USA.
  • Searching for c:id:usa has 0 results (don't they use keyboards anymore? :-) ). Ah, I see. It expects the two-letter codes from ISO 3166-1 alpha-2, not the three letter codes: "c:id:us"
  • Searching for c:usa has 0 results. Has to use c:united states (and if I'm slower to type than the search to find results then I can't type that because it strips the space)

Correct. We don't currently support synonyms or abbreviations for country names. This would be a low priority feature I think; I don't want to maintain a database of synonyms for countries and the ISO 3166-1 list does not include them. We use the ISO 3166-1 alpha-2 list, which is the most common format.

  • Searching for "swa" shows keyboards for several languages that start with "swa". However, it doesn't show "EuroLatin (SIL)" for Swabian; Swabian keyboards appear under obsolete keyboards. Searching for "swab" shows the EuroLatin one as well.

This is by design. "EuroLatin (SIL)" matches on "Swati" language rather than "Swabian", and the keyboard won't be shown twice. There are 13 different language names starting with "swa" in langtags.json and we don't want to show duplicates. Just keep typing if it hasn't found the name you are looking for 😉.
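
To illustrate the "one row per keyboard" behaviour described above, here is a small Python sketch. It assumes prefix matching on language names and uses "shortest matching name wins" as a stand-in for the real weight-based ranking. The function name and the sample language list for sil_euro_latin are illustrative only.

```python
def best_match_per_keyboard(term, keyboards):
    """Return {keyboard_id: matched language name}, one entry per keyboard.

    `keyboards` maps a keyboard id to the language names it supports.
    Shortest matching name is an assumed proxy for "highest weight".
    """
    term = term.casefold()
    results = {}
    for keyboard_id, language_names in keyboards.items():
        matches = [n for n in language_names if n.casefold().startswith(term)]
        if matches:
            # Keep a single match so the keyboard is never listed twice.
            results[keyboard_id] = min(matches, key=len)
    return results


# Illustrative subset only; the real keyboard supports many more languages.
SAMPLE = {
    "sil_euro_latin": ["German", "German, Pennsylvania", "Dutch", "Swati", "Swabian"],
}

if __name__ == "__main__":
    for query in ("swa", "swab", "german"):
        print(query, "->", best_match_per_keyboard(query, SAMPLE))
```

With this sample, "swa" reports the keyboard under Swati, while "swab" reports it under Swabian, which mirrors the "just keep typing" advice.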

  • The display of results when searching for localized language name is awkward: "EuroLatin (SIL)(Deutsch language)". Putting the language first would be better: "EuroLatin (SIL)(language: Deutsch)"

I think this is mostly personal preference 😁.

  • Searching for l:id:ydd shows results for BCP 47 tag yi - which might be correct but is a bit surprising. l:id:yi shows same results (old search didn't find anything for l:id:yi). Searching for l:id:yih shows results for BCP tag yih.

This is correct. We normalise the BCP 47 language subtag from ISO639-3 to ISO639-1 (which gives us ydd->yi). yih does not have an ISO639-1 code.
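
A hedged sketch of that normalisation step in Python. The mapping table is a three-entry excerpt made up for illustration (only ydd->yi comes from this thread); a real table would presumably be derived from langtags.json. The function name is hypothetical.

```python
# Tiny illustrative excerpt; NOT the real mapping table.
ISO639_3_TO_1 = {
    "ydd": "yi",  # mapping mentioned in this thread
    "deu": "de",
    "nld": "nl",
}


def normalise_language_subtag(tag: str) -> str:
    """Replace a three-letter language subtag with its two-letter form
    when one is known; leave the rest of the BCP 47 tag untouched."""
    subtags = tag.split("-")
    language = subtags[0].lower()
    subtags[0] = ISO639_3_TO_1.get(language, language)
    return "-".join(subtags)


if __name__ == "__main__":
    for tag in ("ydd", "yi", "yih", "deu-Latn"):
        print(tag, "->", normalise_language_subtag(tag))
```

Because both the stored tags and the query go through the same step, `l:id:ydd` and `l:id:yi` end up identical, while `yih` has no entry in the table and stays as it is.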

  • Searching for yiddis shows "Yiddish Pasekh". Searching for yiddish shows "Yiddish Pasekh (Yiddish language)". Searching for yiddish p shows "Yiddish Pasekh" again.

This is a side-effect of the precise match signal, which pushes the exact string match of Yiddish language name into a higher weight. I don't think I'll try and improve it 😄.

  • The old search listed languages and countries related to the search term which I find helpful.

I also, in some ways, prefer the nested search results... But this was the trade-off I made at the start of the design. The old search had too much complexity due to the multiple search result lists and I think that this simpler flat search result matches what most users are going to expect (as they will be familiar with the flat Google-style searches).

This has been resolved in an earlier PR.

  • When searching for ipa, why does IPATotal show up first (with 82 monthly downloads) and IPA (SIL) only second? Especially since I did the search on Linux where IPATotal is not supported. I would have expected IPA (SIL) to show up first.

Okay, so this is actually a bit of a tricky one.

For the embedded search, IPATotal would not show up. For the basic web search, we don't currently use the user's platform as a signal. The unexpected ordering here comes about because we are multiplying the match weight against the ln() of the download count (+2 for reasons).

IPATotal currently wins out because its name starts with IPA as well as having IPA in the description, giving it a basic weight of 60 vs SIL IPA of 35.

The final weights are 286.24 and 225.48 respectively. We just need to download sil_ipa another 3000 times a month and it'll sort itself out 🙈. Perhaps that indicates that ln() flattens the download count a little too much. Maybe sqrt() is a better curve, making popularity a stronger signal?

And with sqrt(), we end up with final weights of 652 and 877 approx, respectively, so SIL IPA would win. But does this hurt other searches? What are our other options?

Changing this formula will break all my tests because all the weights change so I am really not very keen 🤣... but will do if this is a good solution. Thoughts appreciated.
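
The arithmetic above can be reproduced with a few lines of Python. The formula (match weight times a curve applied to downloads + 2) and the base weights 60 and 35 come from this comment. The monthly download counts 116 and 626 are assumptions, back-derived so that the output lands on the quoted final weights; they are not real statistics.

```python
import math


def final_weight(match_weight, monthly_downloads, curve=math.log):
    """Match weight scaled by a popularity curve over (downloads + 2)."""
    return match_weight * curve(monthly_downloads + 2)


# Base weights from the comment; download counts are assumed (see lead-in).
CANDIDATES = {
    "IPATotal": (60, 116),
    "sil_ipa": (35, 626),
}

if __name__ == "__main__":
    for name, (weight, downloads) in CANDIDATES.items():
        with_ln = final_weight(weight, downloads)
        with_sqrt = final_weight(weight, downloads, curve=math.sqrt)
        print(f"{name}: ln={with_ln:.2f} sqrt={with_sqrt:.0f}")
```

Under ln() IPATotal stays ahead; under sqrt() the order flips in favour of sil_ipa, which is the trade-off being discussed.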

@mcdurdin
Member Author

mcdurdin commented Aug 6, 2020

Finish keyboard install page (aka universal link infrastructure) for:

@mcdurdin mcdurdin modified the milestones: P10S12, P10S13 Aug 9, 2020
@mcdurdin
Member Author

mcdurdin commented Aug 14, 2020

@darcywong00
Contributor

darcywong00 commented Aug 18, 2020

I think the staging site is using the BCP 47 tag und-fonipa for the sil_ipa keyboard, but the keyboard package metadata is using und-latn.

On Keyman for Android alpha, I do a keyboard search for "sil_ipa" and install the keyboard. The sil_ipa keyboard shows up with the tag und-Latn. From the app, I then do a keyboard search for l:id:und-latn and I get 0 results. (Shouldn't it have found sil_ipa?)

@mcdurdin
Member Author

Re und-fonipa and und-latn: this arises from a disconnect between the sil_ipa.keyboard_info and sil_ipa.kps language data:

sil_ipa.keyboard_info

    "languages": ["und-fonipa"],

sil_ipa.kps

      <Languages>
        <Language ID="und-Latn">und-Latn</Language>
      </Languages>

This was deliberate at the time, because we had trouble installing und-fonipa on some platforms. This will be resolved when we go to 14.0 release, so we should plan to update the SIL IPA keyboard to use und-fonipa in sil_ipa.kps as well.
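
A mismatch like this could be caught with a small check. The sketch below is Python and only looks at the two fragments quoted above: real .keyboard_info and .kps files contain much more, and the function name is hypothetical. Tags are compared case-insensitively, since BCP 47 tags are not case-sensitive.

```python
import json
import xml.etree.ElementTree as ET


def language_tag_mismatch(keyboard_info_text, kps_xml_text):
    """Return (only_in_keyboard_info, only_in_kps) as sets of lowercased tags."""
    info_tags = {t.lower() for t in json.loads(keyboard_info_text)["languages"]}
    root = ET.fromstring(kps_xml_text)
    # iter() finds <Language> elements at any depth below the root.
    kps_tags = {el.get("ID", "").lower() for el in root.iter("Language")}
    return info_tags - kps_tags, kps_tags - info_tags


# Fragments modelled on the snippets quoted in this comment.
KEYBOARD_INFO = '{"languages": ["und-fonipa"]}'
KPS = '<Languages><Language ID="und-Latn">und-Latn</Language></Languages>'

if __name__ == "__main__":
    print(language_tag_mismatch(KEYBOARD_INFO, KPS))
```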

@mcdurdin
Member Author

All remaining items extracted into separate issues, so closing this mega checklist
