Improving similarity measure #3
Thanks for the great feedback here, and also wanted to say that your site is very cool. For the first point, I'm not exactly sure how to do the actual calculation. Currently I take all the channel pairs and compare them individually, which is pretty inefficient. If I were to do what you said, I would need to create a set that contains all the viewers captured in that time, which I guess I could build by unioning each of the individual channels' viewer sets into one massive set. Then for each viewer in the big set I'd check which channels they belong to, and I'm unsure where to proceed from there. For the second point, we would have to store each viewer and the channels they are present in over time, and again I'm not sure how to proceed.
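For reference, the current pairwise approach described above amounts to something like the following minimal Python sketch (channel and viewer names are made up, and the actual project code is C#; this is only to illustrate the set-intersection idea being discussed):

```python
from itertools import combinations

# Hypothetical input: channel name -> set of usernames seen in that channel
# during the collection window.
channel_viewers = {
    "channel_a": {"alice", "bob", "nightbot"},
    "channel_b": {"bob", "carol", "nightbot"},
    "channel_c": {"carol", "dave"},
}

# Compare every channel pair individually: each shared viewer counts the same
# regardless of how many channels they sit in or how long they stayed.
overlap = {}
for (c1, v1), (c2, v2) in combinations(channel_viewers.items(), 2):
    overlap[(c1, c2)] = len(v1 & v2)

print(overlap)  # e.g. {('channel_a', 'channel_b'): 2, ...}
```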
Regarding the first point, the data structure we're dealing with is a sparse matrix (a very large number of possible viewer-channel pairs, of which only very few are actually populated), and in Python there are modules (e.g. sklearn) that can efficiently perform calculations on it without operating on individual pairs. Once this matrix is created, the calculation to find the channel × channel similarity scores is very short. However, in C# I'd have no idea how to do this, so this is something for maybe further down the line, or for if I have time to integrate/port my analysis code into here.

For the second point, everything said above still applies as far as implementation goes, and though you're right about the host thing, I do think there is still an advantage over the current approach: currently, if channel A and channel B happen to do one singular collab stream that month and their unique viewers filter through each other, that skews the similarity score almost as much as if they had streamed together every day for that month and their respective viewers were in each other's channel for a lot of the time (I say almost because there are obviously different viewers each time, so more collabs would marginally increase similarity). Intuitively, it seems to me that when people think of channel communities being similar, they place importance on the amount of time spent shared rather than just a presence at one point.
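For illustration, a minimal Python sketch of the sparse-matrix approach described above (the data is made up, and plain cosine similarity between channel columns is assumed here as a stand-in; the exact measure used on the linked details page may be different):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative only: rows = viewers, columns = channels, entry = 1 if that
# viewer was seen in that channel during the collection window.
viewers = ["alice", "bob", "carol", "nightbot"]
channels = ["channel_a", "channel_b", "channel_c"]
rows = np.array([0, 1, 1, 2, 2, 3, 3, 3])   # viewer indices
cols = np.array([0, 0, 1, 1, 2, 0, 1, 2])   # channel indices
data = np.ones(len(rows))
M = csr_matrix((data, (rows, cols)), shape=(len(viewers), len(channels)))

# Channel x channel similarity in one call, no per-pair loops.
sim = cosine_similarity(M.T)
print(sim.round(3))
```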
So like a matrix?
Yep, except I think of it as one row per viewer and one column per channel, which is an n_viewers × n_channels matrix.
How would I calculate the weight of an individual chatter? Maybe I'm constructing the matrix incorrectly or I'm missing a step. Here is the code I have so far.
Sorry for the delayed response. According to my skimming through the Math.NET documentation, there should be a method that handles this.

A slight issue with presenting these values in a table is that they are values between 0 and 1 that don't really have an intuitive, natural significance to the average user (as compared to "Overlap Chatters" etc.), so what I show in my tables as the similarity score (in addition to the number of overlap chatters) is how much more similar c_1 and c_2 are compared to any two random channels.
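One possible reading of that last sentence, as a rough numpy sketch (this is an assumption about the presentation, not necessarily the exact baseline the site uses): divide each pair's similarity by the average similarity over all distinct channel pairs.

```python
import numpy as np

# 'sim' is a channels x channels similarity matrix like the one computed above
# (values here are made up for illustration).
sim = np.array([
    [1.00, 0.40, 0.10],
    [0.40, 1.00, 0.05],
    [0.10, 0.05, 1.00],
])

# Average similarity over all distinct pairs (exclude the diagonal of 1's).
off_diag = sim[~np.eye(sim.shape[0], dtype=bool)]
baseline = off_diag.mean()

# "Channel A and channel B are X times as similar as two random channels."
relative = sim / baseline
print(relative.round(2))
```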
Yeah no problem, I've been busy as well. I have a couple more questions. First, there's a problem with the pseudocode you gave, and secondly, I'm not sure what you mean in part of it.

Also, to track presence over time I'd need a new method of data collection that keeps track of how many times a user is present during collection, given how I currently assign each index.
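One way the bookkeeping could look, as a hedged Python sketch (the function and variable names are hypothetical, and the real scraper is C#): assign each username and channel a stable index on first sight, and count how many scrapes each (viewer, channel) pair appears in instead of recording presence only once.

```python
from collections import defaultdict

viewer_index = {}
channel_index = {}
presence_counts = defaultdict(int)   # (viewer_idx, channel_idx) -> scrapes seen

def record_scrape(channel, usernames):
    # Assign a new index the first time a channel or username is seen.
    c = channel_index.setdefault(channel, len(channel_index))
    for name in usernames:
        v = viewer_index.setdefault(name, len(viewer_index))
        presence_counts[(v, c)] += 1

# Example: two scrapes of the same channel.
record_scrape("channel_a", ["alice", "bob"])
record_scrape("channel_a", ["alice"])
# presence_counts[(viewer_index["alice"], channel_index["channel_a"])] == 2
```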
I fixed the problem in the pseudocode.

For data storage, I'd recommend looking at ClickHouse, which I've used to good effect. It has a bit of a learning curve if you don't know SQL, but it's especially designed for use cases such as this. At a basic level you could just have a table with columns of timestamp, username, channel which you'd write to when scraping, and your analysis would be based on aggregate queries against this table. It has various compression techniques that make storing this not as big a deal as you'd think, and you can overcome the time queries take by using materialized views (which basically means precalculating) for the timeframes you'd want to make available, like last hour, last day, last week, and last month.
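A sketch of that schema and an aggregate query, assuming the clickhouse-driver Python package and a hypothetical table name (the project itself is C#/PostgreSQL, so this is only to show the shape of the table and queries described above):

```python
from datetime import datetime
from clickhouse_driver import Client  # assumes the clickhouse-driver package

client = Client("localhost")

# One row per (scrape timestamp, user, channel), as described above.
client.execute("""
    CREATE TABLE IF NOT EXISTS chat_presence (
        ts       DateTime,
        username String,
        channel  String
    ) ENGINE = MergeTree()
    ORDER BY (channel, ts)
""")

# During scraping: insert one row per user seen in a channel.
client.execute(
    "INSERT INTO chat_presence (ts, username, channel) VALUES",
    [(datetime.now(), "alice", "channel_a")],
)

# Analysis: how many scrapes each user appeared in per channel over the last week.
rows = client.execute("""
    SELECT channel, username, count() AS appearances
    FROM chat_presence
    WHERE ts > now() - INTERVAL 7 DAY
    GROUP BY channel, username
""")
```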
Alright, using current data I generated the following sim matrix, with the sum and average values:

Do those sum and average values look correct to you? Also, for the database, I'm currently using PostgreSQL for overlap data, so maybe I can figure out an implementation that works for Postgres.
The diagonal of the sim matrix should be all 1's, so something's wrong.
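A quick sanity check along those lines, as a small self-contained Python sketch (assuming a cosine-style measure, where each channel is perfectly similar to itself):

```python
import numpy as np

def check_sim_matrix(sim: np.ndarray) -> None:
    # With a cosine-style measure the diagonal must be all 1's; anything else
    # points to a bug in how the matrix or the normalization is being built.
    assert np.allclose(np.diag(sim), 1.0), "diagonal of sim matrix is not all 1's"
    # Pairwise similarity should also be symmetric.
    assert np.allclose(sim, sim.T), "sim matrix should be symmetric"

check_sim_matrix(np.array([[1.0, 0.4], [0.4, 1.0]]))
```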
Hi,
I actually did something similar to this a while back at https://channel-similarity.johnpyp.com/ but it didn't generate much interest, presumably due to, among other things, a lack of visualization, which is a core part of this project, makes it cool and interesting to look at, and is well executed here.
However, I do think that the similarity measure I used is better in the sense of capturing similarity between channel communities, and it could be implemented here without too much issue. Mathematically, what I did is outlined at https://channel-similarity.johnpyp.com/details but it essentially boils down to a couple of differences from what you currently have (a rough sketch combining both points follows below):
1. The weight of a viewer should be normalized according to the number of channels that they're in. The reason behind this is that we want the relative weight of that user to be determined by what share of their viewing is dedicated to that channel.
For example: channel A and channel B sharing a viewer that ONLY views these two channels on the entire site should count for more than channel A and channel B happening to share Nightbot, which is present on a large chunk of channels on the site. Currently their relative weights for similarity are the same.
(this should be fairly simple to implement, doesn't require any scraping changes, and can also be used for the realtime channel page view)
2. The weight of a viewer should be determined not only by whether they happened to be in that channel during the time period collected, but by the amount of time spent there (i.e. the number of scrapes they appeared in).
An example of the shortcoming of the current approach: if channel A happens to host channel B during the period collected, then all those chatters appearing momentarily in channel B's chat currently provide as much weight to similarity as chatters that spend long periods of time in both of these channels.
(this could require scraping changes to store these values)
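To make the two points concrete, here's a minimal Python sketch (the numbers are made up, and cosine similarity is an assumed stand-in for the exact measure on the details page): entries are the number of scrapes a viewer appeared in per channel (point 2), each viewer's row is then normalized so their total weight is 1 (point 1), and channel similarity is computed on the resulting columns.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Rows = viewers, columns = channels; each entry is the number of scrapes in
# which that viewer appeared in that channel (time spent, not just presence).
counts = np.array([
    [10,  9,  0],   # viewer who splits their time between channels 0 and 1
    [ 1,  0,  0],   # viewer seen once in channel 0 (e.g. during a host)
    [ 5,  5,  5],   # bot-like viewer present everywhere
], dtype=float)

# Normalize each viewer's row so their total weight is 1: a viewer spread
# across many channels contributes less to any single channel pair.
weights = normalize(counts, norm="l1", axis=1)

# Channel x channel similarity on the weighted columns.
sim = cosine_similarity(weights.T)
print(sim.round(3))
```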