Replies: 8 comments 17 replies
-
No, not really. Proper testing of the algorithm requires seeing how it does as the following are varied:
The idea is to start small, so that experiments can be run quickly.
This is brand-new research. There aren't any algos, besides those you can dream up.
-
What leads to the hope that one can, in general, distinguish between such words by pure analysis of mutual info of word pairs in text?
Can the mutual info algo actually reconstruct very small dicts?
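For reference, the word-pair "mutual info" in question is the pairwise (pointwise) MI computed from co-occurrence counts. A minimal sketch of that score, with an invented toy count table rather than the project's actual API (the exact marginals used in practice may differ, e.g. directional vs. symmetric):

```python
from math import log2

def pair_mi(counts, left, right):
    """Pointwise MI of an ordered word pair, from raw co-occurrence counts:
    MI(l, r) = log2[ p(l, r) / (p(l, *) * p(*, r)) ]."""
    total = sum(counts.values())
    p_pair = counts.get((left, right), 0) / total
    p_left = sum(c for (l, _), c in counts.items() if l == left) / total
    p_right = sum(c for (_, r), c in counts.items() if r == right) / total
    if p_pair == 0:
        return float("-inf")
    return log2(p_pair / (p_left * p_right))

# Invented counts, standing in for what corpus-wide word-pair counting produces.
counts = {("the", "cat"): 5, ("the", "dog"): 3, ("a", "cat"): 1, ("cat", "sat"): 2}
print(pair_mi(counts, "the", "cat"))
```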
-
I edited my previous comment because it was not clear.
-
There's a PDF on this.
That's not the correct "algorithm".
For example, the mutual information between a word and a disjunct.

Step 1) create word pairs
Step 2) MST parse
Step 3) extract disjuncts
Step 4) compute word-disjunct pairs
Step 5) classify word-disjunct pairs
Step 6) create LG dict
Step 7) evaluate accuracy of LG dict

That's the current algo I use. Anton screwed around with replacing step 2 and step 5 with neural nets.
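Just to pin down the data flow of those seven steps, here is a skeleton in Python. Every function name below is a hypothetical placeholder passed in as a callable (the real pipeline in opencog/learn is not Python and does not look like this); each step hides most of the actual work:

```python
def learn_grammar(corpus, count_word_pairs, mst_parse, extract_disjuncts,
                  word_disjunct_mi, classify, write_lg_dict, evaluate):
    """Skeleton of the seven-step pipeline; every stage is a callable argument."""
    pair_mi = count_word_pairs(corpus)                  # 1) create word pairs
    parses = [mst_parse(s, pair_mi) for s in corpus]    # 2) MST parse
    disjuncts = extract_disjuncts(parses)               # 3) extract disjuncts
    wd_mi = word_disjunct_mi(disjuncts)                 # 4) word-disjunct pairs
    classes = classify(wd_mi)                           # 5) classify them
    lg_dict = write_lg_dict(classes)                    # 6) create LG dict
    return lg_dict, evaluate(lg_dict)                   # 7) evaluate accuracy
```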
Anton Kolonin and crew did this very quickly, with 100% accuracy, almost immediately. Where they got stuck was with "child-directed speech", because they were unable to control the variables. Thus, they were unable to get meaningful measurements; they couldn't interpret their results. (Well, they gave up too easily; they were hoping to find the magic potion/philosopher's stone in a month or two, and gave up after discovering baking-soda+vinegar.)
There's no such "algo". That's like saying "the cosine distance algo". Everywhere you use cosine distance, you can use mutual info. Or you can use any other distance metric of your choosing. The only advantage of MI is that there are assorted formal proofs that it is the "best" for probabilities, and also there are many decades of experience with maximum-entropy principles. Which one is free to ignore, but then one needs some other conceptual foundations for what one is doing. Axel Kleidon has some inspirational writings... well, I guess so do many others.
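To illustrate the "you can swap the metric" point: given a word-by-context count matrix, cosine similarity and an MI-style score are just two different scoring functions over the same counts, and whatever clustering sits on top doesn't care which one it is handed. A toy sketch (numpy only; the counts are invented, and the particular MI variant below is only one possible choice, not necessarily the one used in the project):

```python
import numpy as np

# Rows = words, columns = contexts (e.g. disjuncts); entries = raw counts.
counts = np.array([[4., 1., 0.],
                   [3., 2., 0.],
                   [0., 1., 5.]])

def cosine_sim(m, i, j):
    """Cosine similarity between the count vectors of words i and j."""
    a, b = m[i], m[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mi_sim(m, i, j):
    """An MI-flavored alternative: PMI of words i and j under the joint
    p(i, j) = sum_c p(c) p(i|c) p(j|c), i.e. two words drawn independently
    given a shared context."""
    p = m / m.sum()
    p_ctx = p.sum(axis=0)    # p(c)
    p_word = p.sum(axis=1)   # p(w)
    joint = np.sum(p[i] * p[j] / np.where(p_ctx > 0, p_ctx, 1.0))
    return float(np.log2(joint / (p_word[i] * p_word[j])))

print(cosine_sim(counts, 0, 1), mi_sim(counts, 0, 1))  # similar usage profiles
print(cosine_sim(counts, 0, 2), mi_sim(counts, 0, 2))  # dissimilar profiles
```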
-
Oops, the PDF on word-sense disambiguation is at .... I'm having trouble finding it. Chapter 6, page 39 of https://github.com/opencog/learn/blob/master/learn-lang-diary/connector-sets-revised.pdf mentions this briefly; somewhere there is another PDF that expands this into greater detail.

It's not particularly complicated; it's part of classification. All cutting verbs have similar disjuncts and can be classified that way. All sensing verbs (hearing, touching) have similar disjuncts, and are classified together. Something like "saw" will have a pile of disjuncts that are cutting-like, and other disjuncts that are sensing-like, and the code tries to factorize this into those two classes, assigning a different "meaning" to each. It resembles sparse-matrix factorization; I can explain more when you're ready. All that the PDF said was this, plus some worked examples.

The decision to split a collection of disjuncts into two meanings or not is just a standard classification decision. I was using agglomerative clustering. So was Anton. He kept using K-means, and I kept telling him not to, because there is no way to know what K should be. Before he figured this out, he went full-on neural net, and got worse results.

... but one cannot be sloppy about such things. Boeing 747s aren't large paper airplanes made out of metal. That's not how things scale.
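To sketch what "factorize the disjuncts of 'saw' into two senses" can look like mechanically: cluster the disjuncts by their usage profile across unambiguous words, using agglomerative clustering with a distance cutoff (so no K is fixed in advance), then split the ambiguous word along the resulting clusters. All labels and counts below are invented, and scikit-learn stands in for whatever clustering the actual code uses:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy word-by-disjunct count matrix. Columns d0-d2 are meant to be
# "cutting-like" disjuncts, d3-d5 "sensing-like" ones.
words = ["cut", "slice", "hear", "touch", "saw"]
counts = np.array([
    [5, 4, 3, 0, 0, 0],   # cut
    [4, 5, 2, 0, 0, 0],   # slice
    [0, 0, 0, 6, 3, 4],   # hear
    [0, 0, 0, 2, 5, 3],   # touch
    [3, 2, 1, 4, 2, 3],   # saw  <- uses disjuncts from both groups
])

# Cluster the disjuncts (columns) by their profile across the unambiguous
# words, stopping merges at a distance threshold instead of fixing K.
profiles = counts[:4].T
clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0)
labels = clusterer.fit_predict(profiles)

# Assign "saw" one induced sense per disjunct cluster it participates in.
saw = counts[words.index("saw")]
for k in sorted(set(labels)):
    members = [d for d in range(len(labels)) if labels[d] == k and saw[d] > 0]
    print(f"saw, sense {k}: disjuncts {members}")
```

The distance threshold is what replaces K: merging simply stops once clusters are too far apart, which is the practical advantage of agglomerative clustering over K-means mentioned above.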
-
It is indeed clearly possible, if you know the set of disjuncts observed for each word, to find out that some words have more than one meaning. However, I don't have, for now, any idea how to improve/extend this algo or how to use it better. But there are other things I don't understand.
How do you MST-parse English without using any kind of ML? And how do you do MST parsing on sentences you have generated according to your own dict?
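For what "MST parse" means in this context, as I understand it: no trained model is needed at parse time; the parse is just the spanning tree over the sentence's words that maximizes the total word-pair MI, where the MI values come from corpus counts. A minimal Prim-style sketch, assuming a precomputed score function mi(a, b) (the real parser presumably also enforces constraints such as no crossing links, which this toy ignores):

```python
def mst_parse(words, mi):
    """Greedily build the maximum-MI spanning tree over the sentence."""
    linked = {0}          # start the tree at the first word
    links = []
    while len(linked) < len(words):
        # Add the highest-MI link joining the tree to a not-yet-linked word.
        best = max(
            ((i, j) for i in linked for j in range(len(words)) if j not in linked),
            key=lambda ij: mi(words[ij[0]], words[ij[1]]),
        )
        links.append(best)
        linked.add(best[1])
    return links

# Toy usage: a made-up MI table standing in for corpus-derived scores.
toy_mi = {("the", "cat"): 3.0, ("cat", "sat"): 2.5, ("the", "sat"): 0.1,
          ("sat", "mat"): 2.0, ("the", "mat"): 1.5, ("cat", "mat"): 0.2}
mi = lambda a, b: toy_mi.get((a, b), toy_mi.get((b, a), -10.0))
print(mst_parse(["the", "cat", "sat", "mat"], mi))  # [(0, 1), (1, 2), (2, 3)]
```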
-
My first step was to port the old modification to the latest LG, without any optimization. A test on the ... With the ... The generated sentences include regex ... I will now try a few of my suggested optimizations to see if I can generate long-enough sentences from the ...
-
See also discussion in #1290
-
From #785 (comment):
Can the current English dict take the role of such a dict, when sentences that have a full parse serve as such "random text"?
Such sentences can be easily "generated" in any desired quantity by filtering existing text (like stories and documents).
Is there an algo that exploits the nesting feature of languages, i.e., that many words can be replaced by phrases?