
Improving similarity measure #3

ctlaltdefeat opened this issue Jun 16, 2021 · 11 comments

@ctlaltdefeat

Hi,

I actually did something similar a while back at https://channel-similarity.johnpyp.com/, but it didn't generate much interest, presumably due (among other things) to the lack of a visualization. The visualization is a core part of this project, is well executed, and is what makes it cool and interesting to look at.

However, I do think the similarity measure I used better captures the similarity between channel communities, and it could be implemented here without too much issue. The math behind what I did is outlined at https://channel-similarity.johnpyp.com/details, but it essentially boils down to two differences from what you currently have:

  1. The weight of a viewer should be normalized by the number of channels they appear in. The reasoning is that a user's relative weight should reflect what fraction of their viewing is dedicated to that channel.
    For example: channel A and channel B sharing a viewer who ONLY watches these two channels on the entire site should count for more than channel A and channel B happening to share Nightbot, which is present in a large chunk of channels on the site. Currently their relative weights for similarity are the same.
    (this should be fairly simple to implement, doesn't require any scraping changes, and can also be used for the realtime channel page view)

  2. The weight of a viewer should be determined not only by whether they happened to be in a channel during the collection period, but by the amount of time spent there (i.e. the number of scrapes they appeared in).
    An example of the shortcoming of the current approach: if channel A happens to host channel B during the collection period, then all those chatters appearing momentarily in channel B's chat currently contribute as much weight to the similarity as chatters who spend long stretches of time in both channels.
    (this could require scraping changes to store these values; a rough sketch of both points follows after this list)
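
To make both points concrete, here's a rough numpy sketch (not the code from my site; the channel/viewer names and counts are made up for illustration). Each viewer's raw weight in a channel is the number of scrapes they were seen in there, and it then gets divided by that viewer's total across all channels:

import numpy as np

# Hypothetical scrape tallies: counts[channel][viewer] = number of scrapes
# in which that viewer was seen in that channel (point 2).
counts = {
    "channel_a": {"alice": 30, "bob": 2, "nightbot": 30},
    "channel_b": {"alice": 28, "carol": 25, "nightbot": 30},
    "channel_c": {"carol": 5, "nightbot": 30},
}

channels = sorted(counts)
viewers = sorted({v for chan in counts.values() for v in chan})

# n x m matrix of raw weights (channels x viewers).
X = np.zeros((len(channels), len(viewers)))
for i, c in enumerate(channels):
    for j, v in enumerate(viewers):
        X[i, j] = counts[c].get(v, 0)

# Point 1: divide each viewer (column) by their total across channels, so a
# viewer spread over many channels (e.g. Nightbot) carries little weight in any one.
totals = X.sum(axis=0)
totals[totals == 0] = 1  # guard against empty columns
W = X / totals

print(W.round(2))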

@snoww
Owner

snoww commented Jun 16, 2021

Thanks for the great feedback here, and I also wanted to say that your site is very cool.
The original goal of the project was to implement what you have there, to measure actual similarity between different channels. However, I wasn't comfortable with the math behind it, so I chose the easier route: I simply check the overlap at a specific time, without taking previous occurrences into account.

For your first point, I'm not exactly sure how to do the actual calculation. Currently I take all the channel pairs and compare them individually, which is pretty inefficient. If I were to do what you described, I would need to create a set containing all the viewers captured in that time, which I suppose I could build by unioning the individual channels' viewer sets into one massive set. Then, for each viewer in the big set, I'd check which channels they belong to. I'm unsure how to proceed from there.
For the Nightbot problem, I did try to account for that by filtering out any users that have "bot" in their name.

For the second point, we would have to store each viewer and the channels they are present in over time, and again, I'm not sure how to proceed.
The shortcoming you explained shouldn't really be an issue, since I only retrieve the top live channels. When A hosts B, A no longer appears as a live channel and thus isn't updated; only B is. B would see a boost in chatter count, which might influence its overlap numbers with other channels, but it won't affect A's numbers.

@ctlaltdefeat
Author

Regarding the first point, the data structure we're dealing with is a sparse matrix (a very large number of possible viewer-channel pairs, of which only very few are actually populated), and in Python there are modules (e.g. sklearn) that perform calculations on it efficiently without operating on individual pairs. Once this matrix is created, the calculation of the channel x channel similarity scores is very short. In C#, however, I'd have no idea how to do this, so this is something for maybe further down the line, or for if I find time to integrate/port my analysis code into this project.
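
For illustration, building such a sparse matrix in Python could look roughly like this (scipy here, with made-up records; my actual code differs):

from collections import Counter
from scipy.sparse import csr_matrix

# Hypothetical scrape records: one (channel, viewer) pair per sighting.
records = [
    ("channel_a", "alice"), ("channel_a", "alice"), ("channel_a", "bob"),
    ("channel_b", "alice"), ("channel_b", "carol"), ("channel_b", "carol"),
]

pair_counts = Counter(records)
channel_idx, viewer_idx = {}, {}
rows, cols, data = [], [], []
for (channel, viewer), count in pair_counts.items():
    rows.append(channel_idx.setdefault(channel, len(channel_idx)))
    cols.append(viewer_idx.setdefault(viewer, len(viewer_idx)))
    data.append(count)

# n x m sparse matrix: X[i, j] = number of scrapes viewer j was seen in channel i.
X = csr_matrix((data, (rows, cols)), shape=(len(channel_idx), len(viewer_idx)))
print(X.toarray())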

For the second point, as far as implementation goes, everything said above still applies. You're right about the host thing, but I do think there's still an advantage over the current approach: currently, if channel A and channel B happen to do a single collab stream that month and their unique viewers filter through each other's chats, that skews the similarity score almost as much as if they had streamed together every day that month and their respective viewers had spent a lot of time in each other's channels (I say almost because obviously the viewers differ each time, so more collabs would marginally increase similarity). Intuitively, it seems to me that when people think of channel communities being similar, they place importance on the amount of time spent shared rather than mere presence at one point.

@snoww
Owner

snoww commented Jun 16, 2021

So like a matrix m x n where m are the viewers and n are the channels?

@ctlaltdefeat
Author

ctlaltdefeat commented Jun 16, 2021

Yep, except I think of it as n x m (channels x viewers).
Once you've built a matrix X like that, your current similarity score can be calculated as

shared = ((X > 0).astype(int) * (X > 0).astype(int).transpose()).A

which is an n x n matrix where the (c_1, c_2) cell contains the number of shared viewers between the channels c_1 and c_2.
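
For example, with a small toy matrix (made-up numbers):

import numpy as np
from scipy.sparse import csr_matrix

# Toy 3 x 4 matrix (3 channels, 4 viewers); entries are scrape counts.
X = csr_matrix(np.array([
    [5, 2, 0, 0],   # channel 0
    [3, 0, 7, 0],   # channel 1
    [0, 0, 1, 4],   # channel 2
]))

# Binarize, then multiply by the transpose; .A converts the sparse result to a dense array.
B = (X > 0).astype(int)
shared = (B * B.transpose()).A

print(shared)
# [[2 1 0]
#  [1 2 1]
#  [0 1 2]]
# shared[0, 1] = 1 shared viewer between channels 0 and 1;
# the diagonal is each channel's own unique-viewer count.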

@snoww
Owner

snoww commented Jun 16, 2021

How would I calculate the weight of an individual chatter?
I've constructed the matrix with a weight of 1 for being a part of a channel, and that results in the n x n matrix. Since I didn't do anything with the weights, the resulting matrix is exactly the same as what my intersection calculation produces so far.

Maybe I'm constructing the matrix incorrectly or I'm missing a step.

Here is the code I have so far.
https://gist.github.com/snoww/39063edd9bac1824f0685640944754c9

@ctlaltdefeat
Author

ctlaltdefeat commented Jun 18, 2021

Sorry for the delayed response. From my skimming of the Math.NET documentation, applying the NormalizeColumns method to your product matrix does part 1) of what I mentioned above, normalizing viewer weights.
The similarity score is then given by the cosine similarity, which, from what I can see, is easiest to calculate in Math.NET by normalizing the rows and then transposing and multiplying.
So in total you'd start with the matrix you built (and hopefully, instead of a 1 for being a part of a channel, it would eventually be the number of times/instances they were there throughout the time period), build product the way you built it, and then:

normed = NormalizeRows(NormalizeColumns(product))
sim = normed.TransposeAndMultiply(normed)

where sim(c_1,c_2) is the similarity between channels c_1 and c_2. Those values can be fed into the graph algorithms.
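
For reference, a rough Python equivalent of that pipeline (sklearn/numpy here; I'm assuming product is the n x m channels x viewers matrix of counts, an L1 norm for the columns so each viewer's weights sum to 1, and an L2 norm for the rows so the product is a cosine similarity):

import numpy as np
from sklearn.preprocessing import normalize

# Toy n x m matrix (channels x viewers) of scrape counts.
product = np.array([
    [5., 2., 0., 0.],
    [3., 0., 7., 0.],
    [0., 0., 1., 4.],
])

# Normalize each viewer (column) by their total presence across channels.
col_normed = normalize(product, norm="l1", axis=0)

# Cosine similarity between channels: L2-normalize the rows, then multiply
# by the transpose. The diagonal comes out as all 1s.
normed = normalize(col_normed, norm="l2", axis=1)
sim = normed @ normed.T

print(np.round(sim, 3))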

A slight issue with presenting these values in a table is that they are values between 0 and 1 without an intuitive, natural meaning for the average user (compared to "Overlap Chatters" etc.), so what I show in my tables as the similarity score (in addition to the number of overlap chatters) is how much more similar c_1 and c_2 are than any two random channels:

sim_score = sim / Mean(sim)

@snoww
Owner

snoww commented Jun 19, 2021

Yeah no problem, I've been busy as well. I have a couple more questions.

There's a problem with the pseudocode you gave: product is an n x n matrix and matrix is an n x m matrix (m x n after transposing), so you can't multiply them. Perhaps you meant something else?

Secondly, what do you mean by Mean(sim)? Do you mean Mean(sim[c_1]) excluding the self referenced sim[c_1][c_1]?

Also, to track presence over time, I'd need a new method of data collection to keep track of how many times a user is present across collections. Since I assign each index j to a unique viewer during a single collection, I'd have no way of knowing a given user's index across collections. Do you have any recommendations for how this could be done? Currently I'm thinking of a dictionary that stores every user, where the value of each key is the channels they are present in. This dictionary would be saved to a database (or something) during each collection, and then during the n'th collection we'd tally everything up and construct a matrix from that data. I don't think the time/space complexity will be good at all.
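
Roughly what I'm imagining, sketched in Python for concreteness (the names here are just placeholders; my actual code is C#):

from collections import defaultdict

# Running tally across collections: counts[(user, channel)] = number of
# collections in which that user was seen in that channel.
counts = defaultdict(int)

def record_collection(snapshot):
    # snapshot maps channel -> set of usernames seen in this collection.
    for channel, users in snapshot.items():
        for user in users:
            counts[(user, channel)] += 1

record_collection({"channel_a": {"alice", "bob"}, "channel_b": {"alice"}})
record_collection({"channel_a": {"alice"}, "channel_b": {"alice", "carol"}})

# When building the matrix, assign stable indices from the accumulated keys
# rather than re-assigning them per collection.
users = sorted({u for u, _ in counts})
channels = sorted({c for _, c in counts})
user_idx = {u: j for j, u in enumerate(users)}
channel_idx = {c: i for i, c in enumerate(channels)}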

@ctlaltdefeat
Author

ctlaltdefeat commented Jun 19, 2021

I fixed the matrix typo with an edit; I just meant normed transposed and multiplied by itself.
For Mean(sim), you're right that it makes slightly more sense to exclude the diagonal self-similarities (which are all 1), though in practice, when the number of channels is large, the difference is small. So you can replace Mean(sim) with (Sum(sim) - n) / (n^2 - n), where Sum(sim) is just the sum of all of the matrix entries.
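
In code that would be something like (illustrative Python, assuming sim has a diagonal of all 1s):

import numpy as np

def sim_score(sim):
    # Scale similarities by the off-diagonal mean: how much more similar a
    # pair is than two random channels.
    n = sim.shape[0]
    mean_off_diag = (sim.sum() - n) / (n**2 - n)
    return sim / mean_off_diag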

For data storage, I'd recommend looking at ClickHouse, which I've used to good effect. It has a bit of a learning curve if you don't know SQL, but it's designed precisely for use cases like this. At a basic level you could just have a table with timestamp, username, channel columns that you write to while scraping, and base your analysis on aggregate queries over that table. Its compression makes storing this much less of a big deal than you'd think, and you can keep query times down by using materialized views (which basically means precalculating) for the timeframes you want to make available, like last hour, last day, last week, last month.
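
For concreteness, a rough sketch of what the basic setup could look like from Python with the clickhouse-driver package (the table and column names are just placeholders, not anything from this project):

from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client("localhost")

# One row per (scrape timestamp, chatter, channel) observation.
client.execute("""
    CREATE TABLE IF NOT EXISTS chatters (
        ts       DateTime,
        username String,
        channel  String
    ) ENGINE = MergeTree()
    ORDER BY (channel, ts)
""")

# Example aggregate: per-channel scrape counts for each viewer over the last
# week, i.e. the entries of the channels x viewers matrix.
rows = client.execute("""
    SELECT channel, username, count() AS scrapes
    FROM chatters
    WHERE ts > now() - INTERVAL 7 DAY
    GROUP BY channel, username
""")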

@snoww
Owner

snoww commented Jun 19, 2021

Alright, using current data I generated the following sim matrix, with the sum and average values:

SparseMatrix 273x273-Single 100.00 % Filled
  0.0873135  0.00580401  0.00537844   0.0049059  0.00379891  ..    0.0021354   0.00221413
 0.00580401    0.128494  0.00526486  0.00475757  0.00373337  ..    0.0021856   0.00212003
 0.00537844  0.00526486    0.113604   0.0048281  0.00346684  ..   0.00247231   0.00211706
  0.0049059  0.00475757   0.0048281     0.13099  0.00336063  ..   0.00269341   0.00208526
 0.00379891  0.00373337  0.00346684  0.00336063   0.0552542  ..   0.00285602   0.00266259
 0.00430393  0.00418302  0.00379814  0.00436117  0.00307903  ..   0.00200103   0.00290843
 0.00351856  0.00358706  0.00259304  0.00348231  0.00274441  ..   0.00150753   0.00194472
 0.00181425  0.00299082   0.0040593  0.00262552  0.00176428  ..    0.0018431   0.00255974
         ..          ..          ..          ..          ..  ..           ..           ..
 0.00179881  0.00187846  0.00233804  0.00208829  0.00277247  ..   0.00296599    0.0022457
0.000940746  0.00113573  0.00094281  0.00102587   0.0010148  ..  0.000608587  0.000887877
  0.0021354   0.0021856  0.00247231  0.00269341  0.00285602  ..     0.171247   0.00270646
 0.00221413  0.00212003  0.00211706  0.00208526  0.00266259  ..   0.00270646     0.205326

sum: 556.4703
average: 0.003817473

Do those sum and average values look correct to you?

Also, for the database, I'm currently using PostgreSQL for the overlap data; maybe I can figure out an implementation that works for Postgres.

@ctlaltdefeat
Author

The diagonal of the sim matrix should be all 1s, so something's wrong.
I think the issue could be that I assumed normed.TransposeAndMultiply(normed) means normed * normed^T, but it might mean normed^T * normed. You want normed * normed^T, so that the normalized rows of normed are multiplied with each other.
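
As a quick sanity check of why the diagonal should be 1 (illustrative numpy/sklearn on random data):

import numpy as np
from sklearn.preprocessing import normalize

normed = normalize(np.random.rand(3, 5), norm="l2", axis=1)  # rows have unit L2 norm

row_sim = normed @ normed.T   # 3 x 3; diagonal is all 1s (each row dotted with itself)
col_sim = normed.T @ normed   # 5 x 5; diagonal is generally not 1

print(np.diag(row_sim).round(6))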

@snoww
Owner

snoww commented Jun 20, 2021

No, normed.TransposeAndMultiply(normed) does mean normed * normed^T.

I figured it out; it was because I was using the wrong norm (l1norm instead of l2norm).
This is the new sim matrix:

SparseMatrix 208x208-Single 100.00 % Filled
        1   0.151693  0.0828079   0.117629  0.0250484  ..  0.0661653   0.072738
 0.151693          1   0.104735    0.13768  0.0359561  ..  0.0807566  0.0912743
0.0828079   0.104735          1   0.082864  0.0234403  ..  0.0420662  0.0672428
 0.117629    0.13768   0.082864          1   0.031538  ..  0.0570058  0.0722876
0.0250484  0.0359561  0.0234403   0.031538          1  ..  0.0143041  0.0225125
0.0973332   0.117665  0.0699897  0.0904914  0.0229693  ..  0.0476738  0.0570507
0.0315526  0.0397958  0.0260316  0.0333337  0.0229379  ..  0.0152806  0.0284114
0.0287586  0.0377476  0.0242125  0.0381142  0.0297462  ..  0.0143755  0.0252313
       ..         ..         ..         ..         ..  ..         ..         ..
0.0593646  0.0773671  0.0485027   0.057546  0.0184519  ..  0.0235233  0.0582726
0.0725666  0.0839854  0.0497061  0.0739458  0.0218162  ..  0.0295615  0.0519063
0.0661653  0.0807566  0.0420662  0.0570058  0.0143041  ..          1  0.0312037
 0.072738  0.0912743  0.0672428  0.0722876  0.0225125  ..  0.0312037          1

sum: 3054.2017
average: 0.06610464
