diff --git a/src/q7/algodata-INMA2472/summary/algodata-INMA2472-summary.tex b/src/q7/algodata-INMA2472/summary/algodata-INMA2472-summary.tex
index 799606bf0..718d52cf6 100644
--- a/src/q7/algodata-INMA2472/summary/algodata-INMA2472-summary.tex
+++ b/src/q7/algodata-INMA2472/summary/algodata-INMA2472-summary.tex
@@ -1,5 +1,6 @@
\documentclass[en]{../../../eplsummary}
+\usepackage{xurl}
\renewcommand{\indent}{\hspace{\parindent}}
\setlength\parindent{0pt}
%\setcounter{tocdepth}{3}
@@ -169,7 +170,7 @@ \subsubsection{Basic}

\subsubsection{K-Core}

-\textbf{Definition}: the $k$-core of a graph $G(V,E)$ is its maximal subgraph $G’(V’,E’)$ such that all nodes of $G’$ have a degree $\geq k$.
+\textbf{Definition}: the $k$-core of a graph $G(V,E)$ is its maximal subgraph $G'(V',E')$ such that all nodes of $G'$ have a degree $\geq k$.

Scales linearly with the number of links in the network. Only applicable on undirected networks, and works best on unweighted networks. There exists a version for weighted networks, but it requires arbitrary thresholds. All nodes in the $k+1$-core are contained in the $k$-core. The largest possible core is therefore given by the maximal degree in the graph. The difference between the $K+1$-core and the $K$-core is called the $K+1$-shell.
@@ -196,7 +197,7 @@ \subsubsection{K-Core}

\subsubsection{S-Core}

-S-core decomposition works exactly the same as $k$-core, except that we replace the degree of each node by its weighted degree $s$ (the sum of the weights of its edges). If the link weights are diverse, the risk is, however, that we end up with each node in one core. A possible outcome for that is to create thresholds for defining cores, but unless there those can be linked to a concrete meaning, it’s difficult to have an objective way to separate those. Intervals are commonly used.
+S-core decomposition works exactly the same as $k$-core, except that we replace the degree of each node by its weighted degree $s$ (the sum of the weights of its edges). If the link weights are diverse, the risk is, however, that we end up with each node in its own core. A possible workaround is to create thresholds for defining cores, but unless those thresholds can be linked to a concrete meaning, it's difficult to have an objective way to separate the cores. Intervals are commonly used.

\subsection{Nodes with metadata}
@@ -302,14 +303,16 @@ \subsubsection{Linear threshold model (LTM)}

Used in modelling of diffusion of innovation/ideas and ``the spread of influence''. Type SI (stay infected).

-Given a graph $G(V,E)$: Each edge $(u, v)$ is assigned a weight $w_uv$, such that for every node u, $\sum_v(w_{vu}) \leq 1$.
+Given a graph $G(V,E)$: each edge $(u, v)$ is assigned a weight $w_{uv}$, such that for every node $u$, $\sum_v w_{vu} \leq 1$.

-Then, each node u chooses a threshold $\theta_u$ uniformly between 0 and 1. $\theta_u$ represents the weighted fraction of neighbours of u that are required to activate u. At every step, if $\sum_{\text{v in activated nodes}}(w_{vu}) \geq \theta_u$, then u becomes active.
+Then, each node $u$ chooses a threshold $\theta_u$ uniformly between 0 and 1. $\theta_u$ represents the weighted fraction of neighbours of $u$ that are required to activate $u$. At every step, if $\sum_{v \in \text{activated nodes}} w_{vu} \geq \theta_u$, then $u$ becomes active.
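+A minimal sketch of one LTM run (illustration, not from the course; the toy graph, weights, and seed set below are made up):
+\begin{lstlisting}[language=Python]
+import random
+
+# Toy influence weights w[(v, u)]: influence of v on u.
+# For every node u, the sum over v of w[(v, u)] is <= 1.
+w = {('a', 'c'): 0.5, ('b', 'c'): 0.4, ('c', 'd'): 0.8}
+nodes = {'a', 'b', 'c', 'd'}
+
+active = {'a', 'b'}                          # seed set
+theta = {u: random.random() for u in nodes}  # uniform thresholds
+
+changed = True
+while changed:
+    changed = False
+    for u in nodes - active:
+        # total weight of the already-activated neighbours of u
+        influence = sum(wt for (v, t), wt in w.items()
+                        if t == u and v in active)
+        if influence >= theta[u]:
+            active.add(u)
+            changed = True
+print(active)
+\end{lstlisting}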
\subsubsection{Independent cascade model (ICM)}

-Animation here: \url{https://github.com/RomainGrx/LINMA2472-Homeworks/blob/main/miscellaneous/animations/cascade.gif}. With a set A0 of activated nodes at the start, at every step if a node u becomes activated it makes a single attempt to try to infect its neighbours v which succeeds with probability p. (independent of past trials). The process stops when no more nodes can be infected. We can perform link percolation: random removal of links in a graph, and study the size of the connected
-components that are left.
+Animation here: \url{https://github.com/RomainGrx/LINMA2472-Homeworks/blob/main/miscellaneous/animations/cascade.gif}.
+With a set $A_0$ of activated nodes at the start, at every step if a node $u$ becomes activated it makes a single attempt to infect each of its neighbours $v$, which succeeds with probability $p$ (independent of past trials).
+The process stops when no more nodes can be infected.
+We can perform link percolation: random removal of links in a graph, and study the size of the connected components that are left.

\subsubsection{On transport networks}
\subsubsection{Influence of structure on diffusion of ideas}
@@ -346,7 +349,7 @@ \subsubsection{Greedy hill-climbing heuristic}
done
\end{lstlisting}

-The heuristic performs better than targeting high degree nodes, but comes with a computational cost. More pragmatic approaches exist to avoid computational issues like social leader detection: a node u is a social leader if it has a higher social degree than all its neighbours.
+The heuristic performs better than targeting high-degree nodes, but comes with a computational cost. More pragmatic approaches, such as social leader detection, exist to avoid computational issues: a node $u$ is a social leader if it has a higher social degree than all of its neighbours.

% =============================
\section{Embeddings}
@@ -374,7 +377,7 @@ \subsubsection{How it works}
\begin{itemize}
\item Center data $X_c=(I_n - \dfrac{1}{n}11^T)X$
\item Compute the singular value decomposition $X_c = \mathcal{U}\Sigma\mathcal{V}^T$
-\item Keep the N biggest singular values and singular vector column corresponding $\mathcal{V}_h$
+\item Keep the $N$ biggest singular values and the corresponding singular vector columns $\mathcal{V}_h$
\item Then you can obtain the project data with $y = \mathcal{V}_h^T X$
\end{itemize}
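+A short NumPy sketch of these steps (my own illustration, not course code; rows of $X$ are assumed to be samples, matching the centering formula above, so the projection reads $Y = X_c \mathcal{V}_h$):
+\begin{lstlisting}[language=Python]
+import numpy as np
+
+n, d, N = 100, 5, 2        # samples, features, kept components
+X = np.random.rand(n, d)
+
+# Center the data: X_c = (I - (1/n) 1 1^T) X, i.e. subtract column means
+Xc = X - X.mean(axis=0)
+
+# Singular value decomposition: X_c = U diag(S) V^T
+U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
+
+# Keep the right singular vectors of the N biggest singular values
+Vh = Vt[:N].T              # shape (d, N)
+
+# Project the centered data onto the first N principal axes
+Y = Xc @ Vh                # shape (n, N)
+\end{lstlisting}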
@@ -434,7 +437,7 @@ \subsubsection{Artificial neural networks}

\textbf{Loss functions}

-A loss function is used to quantify the error of network and then adjust the weights as we will see in \ref{pt:backprop}.
+A loss function is used to quantify the error of the network and then adjust the weights, as we will see in~\ref{pt:backprop}.

\vspace{1em}
@@ -673,7 +676,7 @@ \section{Privacy}
\subsection{What is privacy and why (we think) it matters in today's world}
\subsubsection{Modern privacy}

-Don’t fall in the common fallacy that privacy is about data not being collected or
+Don't fall for the common fallacy that privacy is about data not being collected or
shared. Modern (informational) privacy is about the individual having (meaningful) control over information about them, including when it's used against them. Even after the information has been disclosed, the right to erasure or right to be forgotten remains. A lot of laws worlwide are protecting privacy.
@@ -690,19 +693,19 @@ \subsubsection{Why privacy (still) matters}
\end{itemize}
\item The ``nothing to hide'' argument is flawed on many levels:
\begin{itemize}
- \item Sometimes you want to be let alone even if you’re doing nothing bad. That’s why we have doors.
+ \item Sometimes you want to be let alone even if you're doing nothing bad. That's why we have doors.
\item Privacy is essential to protect freedom and dignity against fear of shame.
\item The ``nothing to hide'' argument implicitly assumes that there are only two kinds of people: good citizens and bad citizens. But are dissidents bad? And whistle-blowers? And journalists? Bad according to who and when?
\end{itemize}
\item Chilling effect and access to information:
\begin{itemize}
- \item When people know they are being (or might be) watched, they behave differently. This is this idea behind Jeremy Bentham’s Panopticon (\textit{design that allow all prisoners of an institution to be observed by a single security guard, without the inmates being able to tell whether they are being watched.}).
+ \item When people know they are being (or might be) watched, they behave differently. This is the idea behind Jeremy Bentham's Panopticon (\textit{a design that allows all prisoners of an institution to be observed by a single security guard, without the inmates being able to tell whether they are being watched}).
\end{itemize}
\end{enumerate}

\subsection{Terminology}

\begin{itemize}
- \item \textbf{Sensitive information} = a piece of information about an individual (e.g. disease, drug use) we’re trying to protect (but is relevant for the application).
+ \item \textbf{Sensitive information} = a piece of information about an individual (e.g. disease, drug use) that we're trying to protect (but that is relevant for the application).
\item \textbf{Identifier} = A piece of information that directly identifies a person (name, address, phone number, ip address, passport number, etc)
\item \textbf{Quasi-identifier} = A piece of information that does not directly identify a person (e.g. nationality, date of birth). But multiple quasi-identifiers taken together could uniquely identify a person. A set of quasi-identifiers could be known to an attacker for a certain individual (auxiliary info).
\item \textbf{Auxiliary information} = Information known to an attacker.
@@ -832,9 +835,9 @@ \subsubsection{What is Big Data}

Hence, the standard definitions of small scale data anonymization does not work anymore. We need new definitions and metrics.

-\subsubsection{From uniqueness to unicity($\epsilon_p$)}
+\subsubsection{From uniqueness to unicity ($\epsilon_p$)}

-No quasi-identifiers or sensitive data anymore, every point is both sensitive and a point that could be known to an attacker. However, we don’t assume an attacker knows all the points, just a few (p) of them. Unicity aims at quantify the risk of re-identification in large scale behavioural datasets.
+There are no quasi-identifiers or sensitive data anymore: every point is both sensitive and a point that could be known to an attacker. However, we don't assume an attacker knows all the points, just a few ($p$) of them. Unicity aims to quantify the risk of re-identification in large-scale behavioural datasets.

\vspace{1em}
@@ -892,7 +895,7 @@ \subsubsection{Some attacks examples}

Lets consider an anonymysed dataset U and a dataset with direct identifiers V (auxiliary information).

\begin{itemize}
- \item \textbf{Matching attacks} rely on two principles. \textbf{A measure of distance}, measuring how similar two records (from U and V) are. And a \textbf{linking algorithm} to perform the descision, based on the distance metric. Thus we could link records of U with records of V. Notice that the auxiliary information might not directly match the information available in the anonymous dataset, the data can be noisy, missing, or match several people. Similarly, the person we’re searching for might not be in the dataset. The linking algorithm considers edges only as a match if it's proximity distance differs by more than a threshold from the second closest distance.
+ \item \textbf{Matching attacks} rely on two principles: \textbf{a measure of distance}, measuring how similar two records (from U and V) are, and a \textbf{linking algorithm} to make the decision, based on the distance metric. Thus we could link records of U with records of V. Notice that the auxiliary information might not directly match the information available in the anonymous dataset: the data can be noisy, missing, or match several people. Similarly, the person we're searching for might not be in the dataset. The linking algorithm only accepts a candidate pair as a match if its distance differs by more than a threshold from the second-closest distance.
\item Profiling Attacks, in this case U and V doesn't necessarily overlap timewise (data collected at different times). Firstly we extract a profile of the user in the identified dataset through a profiling distance/algorithm. Then we compare the profiles of known users to users in the anonymous dataset to identify them using a linking algorithm.
\end{itemize}
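+A toy sketch of the matching-attack logic above (illustrative only; the records, the Euclidean distance measure, and the threshold are arbitrary choices):
+\begin{lstlisting}[language=Python]
+import numpy as np
+
+# Anonymous records (U) and identified auxiliary records (V)
+U = {'u1': np.array([0.1, 0.9]), 'u2': np.array([0.8, 0.2])}
+V = {'Alice': np.array([0.12, 0.88]), 'Bob': np.array([0.5, 0.5])}
+
+THRESHOLD = 0.2  # required gap between best and second-best match
+
+for name, v_rec in V.items():
+    # measure of distance: how similar two records are (Euclidean here)
+    dists = sorted((np.linalg.norm(u_rec - v_rec), uid)
+                   for uid, u_rec in U.items())
+    (d1, best), (d2, _) = dists[0], dists[1]
+    # linking algorithm: accept only if the best candidate is clearly
+    # separated from the second-closest one
+    if d2 - d1 > THRESHOLD:
+        print(name, '-> linked to anonymous record', best)
+\end{lstlisting}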
@@ -918,7 +921,7 @@ \subsubsection{Some attack examples}

\vspace{1em}

-On such systems uniqueness attacks could still work. E.g.: Assume that you know that your classmate Bob was born on 1994-09-23. Now you ask the server: ``How many students are born on 1994-09-23 and don’t code with Notepad?''
+On such systems, uniqueness attacks could still work. E.g., assume that you know that your classmate Bob was born on 1994-09-23. Now you ask the server: ``How many students are born on 1994-09-23 and don't code with Notepad?''
If the answer is 0, then you know for sure that Bob uses Notepad (course example).

\vspace{1em}
@@ -942,7 +945,7 @@ \subsubsection{Some attack examples}
\item ``How many students at UCLouvain, not born on 1994-09-23, code with Notepad?''
\end{itemize}

-Both $Q$ and $Q’$ are likely to have answers $A > 10$ and $A' >10$. They won’t be blocked by the query set size restriction. However, if $A - A' = 0$, Bob doesn’t code with Notepad.
+Both $Q$ and $Q'$ are likely to have answers $A > 10$ and $A' > 10$. They won't be blocked by the query set size restriction. However, if $A - A' = 0$, Bob doesn't code with Notepad.

\vspace{1em}
@@ -958,11 +961,11 @@ \subsubsection{Some attack examples}

\vspace{1em}

-While the system has to allow people to ask multiple queries (as it’d kill utility otherwise), we add independent Gaussian noise with $\sigma$ = 1 to all queries. However, nothing then prevent the attacker from asking the same question several times and take the average to find the right value. These are called averaging attacks.
+While the system has to allow people to ask multiple queries (as it'd kill utility otherwise), we add independent Gaussian noise with $\sigma = 1$ to all queries. However, nothing then prevents the attacker from asking the same question several times and taking the average to find the right value. These are called averaging attacks.

\vspace{1em}

-The previous attacks only works because we can ask multiple times the same query and get different noise every time. If our noises were to be consistent, we wouldn’t learn anything by asking the same question again. Adding consistent noise prevents basic averaging attacks, attacks that send the exact same query multiple times.
+The previous attacks only work because we can ask the same query multiple times and get different noise every time. If the noise were consistent, we wouldn't learn anything by asking the same question again. Adding consistent noise prevents basic averaging attacks: attacks that send the exact same query multiple times.
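+A small sketch of an averaging attack and of a consistent-noise defence (my own toy illustration; the true count, the noise scale, and seeding the noise with a hash of the query text are assumptions, not course material):
+\begin{lstlisting}[language=Python]
+import hashlib
+import numpy as np
+
+rng = np.random.default_rng(0)
+TRUE_COUNT = 42
+
+def noisy_answer(query):
+    # fresh Gaussian noise (sigma = 1) on every call: vulnerable
+    return TRUE_COUNT + rng.normal(0, 1)
+
+def consistent_answer(query):
+    # noise seeded by the query text: same query -> same noise
+    seed = int.from_bytes(hashlib.sha256(query.encode()).digest()[:8], 'big')
+    return TRUE_COUNT + np.random.default_rng(seed).normal(0, 1)
+
+q = 'How many students code with Notepad?'
+# Averaging attack: repeating the query and averaging recovers ~42
+print(np.mean([noisy_answer(q) for _ in range(10000)]))
+# With consistent noise, the average stays at one fixed noisy value
+print(np.mean([consistent_answer(q) for _ in range(10000)]))
+\end{lstlisting}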
\vspace{1em}
@@ -985,10 +988,10 @@ \subsection{Formal guarantee for privacy}

\vspace{1em}

-To limit the amount of information we’re releasing (and thereby prevent attacks), we could:
+To limit the amount of information we're releasing (and thereby prevent attacks), we could:
\begin{itemize}
\item count the number of queries and stop answering when too many queries have been asked
- \item increase the amount of noise we’re adding with each new queries
+ \item increase the amount of noise we're adding with each new query
\end{itemize}

But how can we know when to stop answering or how much noise to add? For this, we need to quantify how much information we are releasing with every answer. This is why Differential Privacy has been developed for.
@@ -1018,9 +1021,9 @@ \subsubsection{Differential Privacy}

This is where the nicest feature from DP comes in: \textbf{composability}. Releasing the output of any two queries on a dataset protected by a $\epsilon$-DP mechanism is equivalent to releasing one query protected by a 2$\epsilon$-DP mechanism. We can then decide on an $\epsilon$, which we call the \textbf{privacy budget}, which then defines the total number of queries (any of them!) anyone can run on the dataset.

-The \textbf{global sensitivity of a function f} captures the magnitude by which a single individual’s data can change the function f in the worst case, and therefore, the uncertainty in the response that we must introduce in order to hide anyone’s participation. Adding noise according to $\text{Lap}(\Delta f / \epsilon)$ prevent averaging attack!
+The \textbf{global sensitivity of a function $f$} captures the magnitude by which a single individual's data can change the function $f$ in the worst case, and therefore the uncertainty in the response that we must introduce in order to hide anyone's participation. Adding noise according to $\text{Lap}(\Delta f / \epsilon)$ prevents averaging attacks!

-DP guarantees can be \textbf{extended to groups} of size k but at the cost of ``\textbf{multiplying the noise}'' by a factor k. So to protect a group of 10 people we’d have to go from Lap(10) to Lap(100).
+DP guarantees can be \textbf{extended to groups} of size $k$, but at the cost of ``\textbf{multiplying the noise}'' by a factor $k$. So to protect a group of 10 people we'd have to go from $\text{Lap}(10)$ to $\text{Lap}(100)$.

\textbf{Some other ideas}: