Replies: 8 comments 17 replies
-
No, not really. Proper testing of the algorithm requires seeing how it does as the following are varied:
The idea is to start small, so that experiments can be run quickly.
This is brand-new research. There aren't any algos, besides those you can dream up.
-
What leads to the hope that one can, in general, distinguish between such words by pure analysis of mutual info of word pairs in text?
Can the mutual info algo actually reconstruct very small dicts?
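For reference, the word-pair "mutual info" in question is the pairwise (pointwise) MI computed from co-occurrence counts. A minimal sketch of that score, with an invented toy count table rather than the project's actual API (the exact marginals used in practice may differ, e.g. directional vs. symmetric):

```python
from math import log2

def pair_mi(counts, left, right):
    """Pointwise MI of an ordered word pair, from raw co-occurrence counts:
    MI(l, r) = log2[ p(l, r) / (p(l, *) * p(*, r)) ]."""
    total = sum(counts.values())
    p_pair = counts.get((left, right), 0) / total
    p_left = sum(c for (l, _), c in counts.items() if l == left) / total
    p_right = sum(c for (_, r), c in counts.items() if r == right) / total
    if p_pair == 0:
        return float("-inf")
    return log2(p_pair / (p_left * p_right))

# Invented counts, standing in for what corpus-wide word-pair counting produces.
counts = {("the", "cat"): 5, ("the", "dog"): 3, ("a", "cat"): 1, ("cat", "sat"): 2}
print(pair_mi(counts, "the", "cat"))
```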
-
I edited my previous comment because it was not clear.
-
There's a PDF on this.
That's not the correct "algorithm".
For example, the mutual information between a word and a disjunct.

Step 1) create word pairs
Step 2) MST parse
Step 3) extract disjuncts
Step 4) compute word-disjunct pairs
Step 5) classify word-disjunct pairs
Step 6) create LG dict
Step 7) evaluate accuracy of LG dict

That's the current algo I use. Anton screwed around with replacing step 2 and step 5 with neural nets.
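Just to pin down the data flow of those seven steps, here is a skeleton in Python. Every function name below is a hypothetical placeholder passed in as a callable (the real pipeline in opencog/learn is not Python and does not look like this); each step hides most of the actual work:

```python
def learn_grammar(corpus, count_word_pairs, mst_parse, extract_disjuncts,
                  word_disjunct_mi, classify, write_lg_dict, evaluate):
    """Skeleton of the seven-step pipeline; every stage is a callable argument."""
    pair_mi = count_word_pairs(corpus)                  # 1) create word pairs
    parses = [mst_parse(s, pair_mi) for s in corpus]    # 2) MST parse
    disjuncts = extract_disjuncts(parses)               # 3) extract disjuncts
    wd_mi = word_disjunct_mi(disjuncts)                 # 4) word-disjunct pairs
    classes = classify(wd_mi)                           # 5) classify them
    lg_dict = write_lg_dict(classes)                    # 6) create LG dict
    return lg_dict, evaluate(lg_dict)                   # 7) evaluate accuracy
```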
Anton Kolonin and crew did this very quickly, with 100% accuracy, almost immediately. Where they got stuck was with "child-directed speech", because they were unable to control the variables. Thus, they were unable to get meaningful measurements; they couldn't interpret their results. (Well, they gave up too easily; they were hoping to find the magic potion/philosopher's stone in a month or two, and gave up after discovering baking-soda+vinegar.)
There's no such "algo". That's like saying "the cosine distance algo". Everywhere you use cosine distance, you can use mutual info. Or you can use any other distance metric of your choosing. The only advantage of MI is that there are assorted formal proofs that it is the "best" for probabilities, and also there are many decades of experience with maximum-entropy principles. Which one is free to ignore, but then one needs some other conceptual foundations for what one is doing. Axel Kleidon has some inspirational writings... well, I guess so do many others.
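To illustrate the "you can swap the metric" point: given a word-by-context count matrix, cosine similarity and an MI-style score are just two different scoring functions over the same counts, and whatever clustering sits on top doesn't care which one it is handed. A toy sketch (numpy only; the counts are invented, and the particular MI variant below is only one possible choice, not necessarily the one used in the project):

```python
import numpy as np

# Rows = words, columns = contexts (e.g. disjuncts); entries = raw counts.
counts = np.array([[4., 1., 0.],
                   [3., 2., 0.],
                   [0., 1., 5.]])

def cosine_sim(m, i, j):
    """Cosine similarity between the count vectors of words i and j."""
    a, b = m[i], m[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mi_sim(m, i, j):
    """An MI-flavored alternative: PMI of words i and j under the joint
    p(i, j) = sum_c p(c) p(i|c) p(j|c), i.e. two words drawn independently
    given a shared context."""
    p = m / m.sum()
    p_ctx = p.sum(axis=0)    # p(c)
    p_word = p.sum(axis=1)   # p(w)
    joint = np.sum(p[i] * p[j] / np.where(p_ctx > 0, p_ctx, 1.0))
    return float(np.log2(joint / (p_word[i] * p_word[j])))

print(cosine_sim(counts, 0, 1), mi_sim(counts, 0, 1))  # similar usage profiles
print(cosine_sim(counts, 0, 2), mi_sim(counts, 0, 2))  # dissimilar profiles
```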
-
Oops, the PDF on word-sense disambiguation is at .... I'm having trouble finding it. Chapter 6, page 39 of https://github.com/opencog/learn/blob/master/learn-lang-diary/connector-sets-revised.pdf mentions this briefly; somewhere there is another PDF that expands this into greater detail.

It's not particularly complicated; it's part of classification. All cutting verbs have similar disjuncts and can be classified that way. All sensing verbs (hearing, touching) have similar disjuncts, and are classified together. Something like "saw" will have a pile of disjuncts that are cutting-like, and other disjuncts that are sensing-like, and the code tries to factorize this into those two classes, assigning a different "meaning" to each. It resembles sparse-matrix factorization; I can explain more when you're ready. All that the PDF said was this, plus some worked examples.

The decision to split a collection of disjuncts into two meanings or not is just a standard classification decision. I was using agglomerative clustering. So was Anton. He kept using K-means, and I kept telling him not to, because there is no way to know what K should be. Before he figured this out, he went full-on neural net, and got worse results.

... but one cannot be sloppy about such things. Boeing 747s aren't large paper airplanes made out of metal. That's not how things scale.
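To sketch what "factorize the disjuncts of 'saw' into two senses" can look like mechanically: cluster the disjuncts by their usage profile across unambiguous words, using agglomerative clustering with a distance cutoff (so no K is fixed in advance), then split the ambiguous word along the resulting clusters. All labels and counts below are invented, and scikit-learn stands in for whatever clustering the actual code uses:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy word-by-disjunct count matrix. Columns d0-d2 are meant to be
# "cutting-like" disjuncts, d3-d5 "sensing-like" ones.
words = ["cut", "slice", "hear", "touch", "saw"]
counts = np.array([
    [5, 4, 3, 0, 0, 0],   # cut
    [4, 5, 2, 0, 0, 0],   # slice
    [0, 0, 0, 6, 3, 4],   # hear
    [0, 0, 0, 2, 5, 3],   # touch
    [3, 2, 1, 4, 2, 3],   # saw  <- uses disjuncts from both groups
])

# Cluster the disjuncts (columns) by their profile across the unambiguous
# words, stopping merges at a distance threshold instead of fixing K.
profiles = counts[:4].T
clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0)
labels = clusterer.fit_predict(profiles)

# Assign "saw" one induced sense per disjunct cluster it participates in.
saw = counts[words.index("saw")]
for k in sorted(set(labels)):
    members = [d for d in range(len(labels)) if labels[d] == k and saw[d] > 0]
    print(f"saw, sense {k}: disjuncts {members}")
```

The distance threshold is what replaces K: merging simply stops once clusters are too far apart, which is the practical advantage of agglomerative clustering over K-means mentioned above.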
-
It is indeed clearly possible, if you know the set of disjuncts observed for each word, to find out that some words have more than one meaning. However, I don't have, for now, any idea how to improve/extend this algo or how to use it better. But there are other things I don't understand.
How do you MST-parse English without using any kind of ML? And how do you do MST parsing on sentences you have generated according to your own dict?
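For what "MST parse" means in this context, as I understand it: no trained model is needed at parse time; the parse is just the spanning tree over the sentence's words that maximizes the total word-pair MI, where the MI values come from corpus counts. A minimal Prim-style sketch, assuming a precomputed score function mi(a, b) (the real parser presumably also enforces constraints such as no crossing links, which this toy ignores):

```python
def mst_parse(words, mi):
    """Greedily build the maximum-MI spanning tree over the sentence."""
    linked = {0}          # start the tree at the first word
    links = []
    while len(linked) < len(words):
        # Add the highest-MI link joining the tree to a not-yet-linked word.
        best = max(
            ((i, j) for i in linked for j in range(len(words)) if j not in linked),
            key=lambda ij: mi(words[ij[0]], words[ij[1]]),
        )
        links.append(best)
        linked.add(best[1])
    return links

# Toy usage: a made-up MI table standing in for corpus-derived scores.
toy_mi = {("the", "cat"): 3.0, ("cat", "sat"): 2.5, ("the", "sat"): 0.1,
          ("sat", "mat"): 2.0, ("the", "mat"): 1.5, ("cat", "mat"): 0.2}
mi = lambda a, b: toy_mi.get((a, b), toy_mi.get((b, a), -10.0))
print(mst_parse(["the", "cat", "sat", "mat"], mi))  # [(0, 1), (1, 2), (2, 3)]
```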
-
My first step was to port the old modification to the latest LG, without any optimization. A test on the ... With the ... The generated sentences include regex ... I will now try a few of my suggested optimizations to see if I can generate long-enough sentences from the ...
-
See also discussion in #1290
-
From #785 (comment):
Can the current English dict take the role of such a dict, when sentences that have a full parse serve as such "random text"?
Such sentences can be easily "generated" in any desired quantity by filtering existing text (like stories and documents).
Is there an algo that exploits the nesting feature of languages, i.e., that many words can be replaced by phrases?