Commit 36dbaa3: [INMA2472] Small modifs
- U+2019 changed to U+0027 (Gp2mv3#876), in math mode and in text mode
- more math mode
- xurl used instead of url to remove overfull hbox
Jimvy committed Feb 14, 2021
1 parent 343dd3c commit 36dbaa3
Showing 1 changed file with 27 additions and 24 deletions.
51 changes: 27 additions & 24 deletions src/q7/algodata-INMA2472/summary/algodata-INMA2472-summary.tex
@@ -1,5 +1,6 @@
\documentclass[en]{../../../eplsummary}

\usepackage{xurl}
\renewcommand{\indent}{\hspace{\parindent}}
\setlength\parindent{0pt}
%\setcounter{tocdepth}{3}
@@ -169,7 +170,7 @@ \subsubsection{Basic}

\subsubsection{K-Core}

\textbf{Definition}: the $k$-core of a graph $G(V,E)$ is its maximal subgraph $G’(V’,E’)$ such that all nodes of $G$ have a degree $\geq k$.
\textbf{Definition}: the $k$-core of a graph $G(V,E)$ is its maximal subgraph $G'(V',E')$ such that all nodes of $G'$ have a degree $\geq k$.

Scales linearly with the number of links in the network. Only applicable to undirected networks, and works best on unweighted networks. There exists a version for weighted networks, but it requires arbitrary thresholds. All nodes in the $(k+1)$-core are contained in the $k$-core. The largest possible core number is therefore bounded by the maximal degree in the graph. The difference between the $(k+1)$-core and the $k$-core is called the $(k+1)$-shell.
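
To make the peeling procedure concrete, here is a hypothetical Python sketch (not from the course; the dict-of-sets graph representation and function name are assumptions, and this naive version is quadratic rather than linear):

\begin{lstlisting}
# Hypothetical sketch: k-core decomposition by repeatedly peeling
# the node of smallest remaining degree. graph: dict node -> set of
# neighbours (undirected). The linear-time version uses bucket sort.
def core_numbers(graph):
    degree = {u: len(nbrs) for u, nbrs in graph.items()}
    remaining = set(graph)
    core, k = {}, 0
    while remaining:
        u = min(remaining, key=degree.get)
        k = max(k, degree[u])       # core number never decreases
        core[u] = k
        remaining.remove(u)
        for v in graph[u]:
            if v in remaining:
                degree[v] -= 1
    return core   # core[u] = largest k such that u is in the k-core
\end{lstlisting}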

@@ -196,7 +197,7 @@ \subsubsection{K-Core}

\subsubsection{S-Core}

S-core decomposition works exactly the same as $k$-core, except that we replace the degree of each node by its weighted degree $s$ (the sum of the weights of its edges). If the link weights are diverse, the risk is, however, that we end up with each node in one core. A possible outcome for that is to create thresholds for defining cores, but unless there those can be linked to a concrete meaning, its difficult to have an objective way to separate those. Intervals are commonly used.
S-core decomposition works exactly the same as $k$-core, except that we replace the degree of each node by its weighted degree $s$ (the sum of the weights of its edges). If the link weights are diverse, the risk is, however, that we end up with each node in its own core. A possible workaround is to create thresholds for defining cores, but unless those can be linked to a concrete meaning, it's difficult to separate them objectively. Intervals are commonly used.
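
As a rough illustration, the peeling condition just swaps degree for weighted degree; a hedged sketch for one threshold $s$ (the symmetric \texttt{weight} dictionary is an assumption):

\begin{lstlisting}
# Hypothetical sketch: the s-core for a given threshold s, obtained
# by removing nodes whose weighted degree falls below s.
# weight is assumed symmetric: weight[(u, v)] == weight[(v, u)].
def s_core(graph, weight, s):
    nodes = set(graph)
    changed = True
    while changed:
        changed = False
        for u in list(nodes):
            su = sum(weight[(u, v)] for v in graph[u] if v in nodes)
            if su < s:              # u cannot belong to the s-core
                nodes.discard(u)
                changed = True
    return nodes   # node set of the maximal subgraph with s(u) >= s
\end{lstlisting}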

\subsection{Nodes with metadata}

@@ -302,14 +303,16 @@ \subsubsection{Linear threshold model (LTM)}

Used in modelling the diffusion of innovations/ideas and ``the spread of influence''. Type SI (stay infected).

Given a graph $G(V,E)$: Each edge $(u, v)$ is assigned a weight $w_uv$, such that for every node u, $\sum_v(w_{vu}) \leq 1$.
Given a graph $G(V,E)$, each edge $(u, v)$ is assigned a weight $w_{uv}$, such that for every node $u$, $\sum_v w_{vu} \leq 1$.

Then, each node u chooses a threshold $\theta_u$ uniformly between 0 and 1. $\theta_u$ represents the weighted fraction of neighbours of u that are required to activate u. At every step, if $\sum_{\text{v in activated nodes}}(w_{vu}) \geq \theta_u$, then u becomes active.
Then, each node $u$ chooses a threshold $\theta_u$ uniformly between 0 and 1. $\theta_u$ represents the weighted fraction of neighbours of $u$ that are required to activate $u$. At every step, if $\sum_{v \in \text{active nodes}} w_{vu} \geq \theta_u$, then $u$ becomes active.
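
A minimal simulation sketch of LTM (hypothetical names; \texttt{w[(v, u)]} is assumed to hold the weight of the edge from $v$ to $u$, and the graph is assumed undirected so \texttt{graph[u]} lists the neighbours of $u$):

\begin{lstlisting}
import random

# Hypothetical sketch of the linear threshold model (type SI).
def ltm(graph, w, seeds):
    theta = {u: random.random() for u in graph}  # uniform thresholds
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for u in graph:
            if u in active:
                continue            # SI: active nodes stay active
            influence = sum(w[(v, u)] for v in graph[u] if v in active)
            if influence >= theta[u]:
                active.add(u)
                changed = True
    return active
\end{lstlisting}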

\subsubsection{Independent cascade model (ICM)}

Animation here: \url{https://github.com/RomainGrx/LINMA2472-Homeworks/blob/main/miscellaneous/animations/cascade.gif}. With a set A0 of activated nodes at the start, at every step if a node u becomes activated it makes a single attempt to try to infect its neighbours v which succeeds with probability p. (independent of past trials). The process stops when no more nodes can be infected. We can perform link percolation: random removal of links in a graph, and study the size of the connected
components that are left.
Animation here: \url{https://github.com/RomainGrx/LINMA2472-Homeworks/blob/main/miscellaneous/animations/cascade.gif}.
With a set $A_0$ of activated nodes at the start, at every step, when a node $u$ becomes activated it makes a single attempt to infect each of its neighbours $v$, which succeeds with probability $p$ (independently of past trials).
The process stops when no more nodes can be infected.
We can perform link percolation: random removal of links in a graph, and study the size of the connected components that are left.
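
A hedged sketch of one ICM run (names are hypothetical; each newly activated node gets a single attempt per neighbour, succeeding with probability \texttt{p}):

\begin{lstlisting}
import random

# Hypothetical sketch of the independent cascade model.
def icm(graph, p, seeds):
    active = set(seeds)
    frontier = list(seeds)      # nodes still owed their single attempt
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph[u]:
                # one independent trial per edge, success prob. p
                if v not in active and random.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active               # stops when no node can be infected
\end{lstlisting}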

\subsubsection{On transport networks}
\subsubsection{Influence of structure on diffusion of ideas}
@@ -346,7 +349,7 @@ \subsubsection{Greedy hill-climbing heuristic}
done
\end{lstlisting}

The heuristic performs better than targeting high degree nodes, but comes with a computational cost. More pragmatic approaches exist to avoid computational issues like social leader detection: a node u is a social leader if it has a higher social degree than all its neighbours.
The heuristic performs better than targeting high-degree nodes, but comes with a computational cost. More pragmatic approaches, such as social leader detection, avoid these computational issues: a node $u$ is a social leader if it has a higher social degree than all its neighbours.
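
A small sketch of that detection rule, under the assumption that the social degree is just the plain node degree:

\begin{lstlisting}
# Hypothetical sketch: social leader detection. The social degree is
# assumed to be the plain degree; "higher" is read as strictly higher.
def social_leaders(graph):
    degree = {u: len(nbrs) for u, nbrs in graph.items()}
    return {u for u in graph
            if all(degree[u] > degree[v] for v in graph[u])}
\end{lstlisting}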

% =============================
\section{Embeddings}
@@ -374,7 +377,7 @@ \subsubsection{How it works}
\begin{itemize}
\item Center data $X_c=(I_n - \dfrac{1}{n}11^T)X$
\item Compute the singular value decomposition $X_c = \mathcal{U}\Sigma\mathcal{V}^T$
\item Keep the N biggest singular values and singular vector column corresponding $\mathcal{V}_h$
\item Keep the $N$ biggest singular values and the corresponding singular vector columns $\mathcal{V}_h$
\item Then you can obtain the projected data with $Y = X_c \mathcal{V}_h$ (see the sketch below)
\end{itemize}
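
These steps translate almost directly into numpy; a minimal sketch (rows of \texttt{X} are assumed to be the $n$ samples):

\begin{lstlisting}
import numpy as np

# Hypothetical sketch of PCA via the SVD, following the steps above.
def pca_project(X, N):
    Xc = X - X.mean(axis=0)    # center: (I - (1/n) 1 1^T) X
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vh = Vt[:N].T              # columns = top-N right singular vectors
    return Xc @ Vh             # projected data
\end{lstlisting}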

@@ -434,7 +437,7 @@ \subsubsection{Artificial neural networks}

\textbf{Loss functions}

A loss function is used to quantify the error of network and then adjust the weights as we will see in \ref{pt:backprop}.
A loss function is used to quantify the error of the network and then adjust the weights, as we will see in~\ref{pt:backprop}.

\vspace{1em}

@@ -673,7 +676,7 @@ \section{Privacy}
\subsection{What is privacy and why (we think) it matters in today's world}

\subsubsection{Modern privacy}
Dont fall in the common fallacy that privacy is about data not being collected or
Don't fall for the common fallacy that privacy is about data not being collected or
shared. Modern (informational) privacy is about the individual having (meaningful)
control over information about them, including when it's used against them. Even after the information has been disclosed, the right to erasure or right to be forgotten remains. A lot of laws worldwide protect privacy.

@@ -690,19 +693,19 @@ \subsubsection{Why privacy (still) matters}
\end{itemize}
\item The ``nothing to hide'' argument is flawed on many levels:
\begin{itemize}
\item Sometimes you want to be let alone even if youre doing nothing bad. Thats why we have doors.
\item Sometimes you want to be let alone even if you're doing nothing bad. That's why we have doors.
\item Privacy is essential to protect freedom and dignity against fear of shame.
\item The ``nothing to hide'' argument implicitly assumes that there are only two kinds of people: good citizens and bad citizens. But are dissidents bad? And whistle-blowers? And journalists? Bad according to whom, and when?
\end{itemize}
\item Chilling effect and access to information:
\begin{itemize}
\item When people know they are being (or might be) watched, they behave differently. This is this idea behind Jeremy Benthams Panopticon (\textit{design that allow all prisoners of an institution to be observed by a single security guard, without the inmates being able to tell whether they are being watched.}).
\item When people know they are being (or might be) watched, they behave differently. This is the idea behind Jeremy Bentham's Panopticon (\textit{a design that allows all prisoners of an institution to be observed by a single security guard, without the inmates being able to tell whether they are being watched}).
\end{itemize}
\end{enumerate}

\subsection{Terminology}
\begin{itemize}
\item \textbf{Sensitive information} = a piece of information about an individual (e.g. disease, drug use) were trying to protect (but is relevant for the application).
\item \textbf{Sensitive information} = a piece of information about an individual (e.g. disease, drug use) we're trying to protect (but is relevant for the application).
\item \textbf{Identifier} = A piece of information that directly identifies a person (name, address, phone number, IP address, passport number, etc.)
\item \textbf{Quasi-identifier} = A piece of information that does not directly identify a person (e.g. nationality, date of birth). But multiple quasi-identifiers taken together could uniquely identify a person. A set of quasi-identifiers could be known to an attacker for a certain individual (auxiliary info).
\item \textbf{Auxiliary information} = Information known to an attacker.
@@ -832,9 +835,9 @@ \subsubsection{What is Big Data}

Hence, the standard definitions of small-scale data anonymization do not work anymore. We need new definitions and metrics.

\subsubsection{From uniqueness to unicity($\epsilon_p$)}
\subsubsection{From uniqueness to unicity ($\epsilon_p$)}

No quasi-identifiers or sensitive data anymore, every point is both sensitive and a point that could be known to an attacker. However, we dont assume an attacker knows all the points, just a few (p) of them. Unicity aims at quantify the risk of re-identification in large scale behavioural datasets.
No quasi-identifiers or sensitive data anymore: every point is both sensitive and a point that could be known to an attacker. However, we don't assume an attacker knows all the points, just a few ($p$) of them. Unicity aims to quantify the risk of re-identification in large-scale behavioural datasets.
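
As an illustration only (the names and the set-of-points record format are assumptions), unicity can be estimated by sampling:

\begin{lstlisting}
import random

# Hypothetical sketch: estimate unicity, the fraction of users whom
# p randomly drawn points from their own record identify uniquely.
# records: dict user -> set of points (e.g. (antenna, hour) tuples).
def unicity(records, p, trials=1000):
    users = list(records)
    unique = 0
    for _ in range(trials):
        u = random.choice(users)
        aux = set(random.sample(list(records[u]), p))  # attacker info
        matches = [v for v in users if aux <= records[v]]
        if matches == [u]:
            unique += 1
    return unique / trials
\end{lstlisting}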

\vspace{1em}

@@ -892,7 +895,7 @@ \subsubsection{Some attacks examples}
Let's consider an anonymised dataset U and a dataset with direct identifiers V (auxiliary information).

\begin{itemize}
\item \textbf{Matching attacks} rely on two principles. \textbf{A measure of distance}, measuring how similar two records (from U and V) are. And a \textbf{linking algorithm} to perform the descision, based on the distance metric. Thus we could link records of U with records of V. Notice that the auxiliary information might not directly match the information available in the anonymous dataset, the data can be noisy, missing, or match several people. Similarly, the person were searching for might not be in the dataset. The linking algorithm considers edges only as a match if it's proximity distance differs by more than a threshold from the second closest distance.
\item \textbf{Matching attacks} rely on two principles: \textbf{a measure of distance}, measuring how similar two records (from U and V) are, and a \textbf{linking algorithm} to make the decision, based on the distance metric. Thus we could link records of U with records of V. Notice that the auxiliary information might not directly match the information available in the anonymous dataset: the data can be noisy, missing, or match several people. Similarly, the person we're searching for might not be in the dataset. The linking algorithm only considers a candidate a match if its distance differs by more than a threshold from the second-closest distance (a sketch follows after this list).
\item \textbf{Profiling attacks}: in this case U and V don't necessarily overlap timewise (data collected at different times). First, we extract a profile of the user in the identified dataset through a profiling distance/algorithm. Then we compare the profiles of known users to users in the anonymous dataset to identify them using a linking algorithm.
\end{itemize}
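
A hedged sketch of the linking step just described (the dataset layout and function names are hypothetical, and at least two candidates are assumed):

\begin{lstlisting}
# Hypothetical sketch of a matching attack: link each identified
# record in V to its closest record in the anonymised dataset U,
# accepting the link only if the closest candidate beats the
# second-closest by more than a threshold.
def matching_attack(U, V, distance, threshold):
    links = {}
    for name, v_rec in V.items():
        ranked = sorted((distance(v_rec, u_rec), uid)
                        for uid, u_rec in U.items())
        (d1, best), (d2, _) = ranked[0], ranked[1]
        if d2 - d1 > threshold:     # confident enough to link
            links[name] = best
    return links                    # name in V -> pseudonym in U
\end{lstlisting}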

@@ -918,7 +921,7 @@ \subsubsection{Some attack examples}

\vspace{1em}

On such systems uniqueness attacks could still work. E.g.: Assume that you know that your classmate Bob was born on 1994-09-23. Now you ask the server: ``How many students are born on 1994-09-23 and dont code with Notepad?''
On such systems uniqueness attacks could still work. E.g.: Assume that you know that your classmate Bob was born on 1994-09-23. Now you ask the server: ``How many students are born on 1994-09-23 and don't code with Notepad?''
If the answer is 0, then you know for sure that Bob uses Notepad (course example).

\vspace{1em}
@@ -942,7 +945,7 @@ \subsubsection{Some attack examples}
\item ``How many students at UCLouvain, not born on 1994-09-23, code with Notepad?''
\end{itemize}

Both $Q$ and $Q$ are likely to have answers $A > 10$ and $A' >10$. They wont be blocked by the query set size restriction. However, if $A - A' = 0$, Bob doesnt code with Notepad.
Both $Q$ and $Q'$ are likely to have answers $A > 10$ and $A' >10$. They won't be blocked by the query set size restriction. However, if $A - A' = 0$, Bob doesn't code with Notepad.

\vspace{1em}

Expand All @@ -958,11 +961,11 @@ \subsubsection{Some attack examples}

\vspace{1em}

While the system has to allow people to ask multiple queries (as itd kill utility otherwise), we add independent Gaussian noise with $\sigma$ = 1 to all queries. However, nothing then prevent the attacker from asking the same question several times and take the average to find the right value. These are called averaging attacks.
While the system has to allow people to ask multiple queries (as it'd kill utility otherwise), we add independent Gaussian noise with $\sigma = 1$ to all queries. However, nothing then prevents the attacker from asking the same question several times and taking the average to find the right value. These are called averaging attacks.
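
A tiny sketch of why fresh noise fails (Gaussian noise with $\sigma = 1$ as above; names are hypothetical):

\begin{lstlisting}
import random

# Hypothetical sketch of an averaging attack: fresh Gaussian noise
# on each answer averages out; the std shrinks as sigma / sqrt(k).
def noisy_answer(true_count, sigma=1.0):
    return true_count + random.gauss(0.0, sigma)

def averaging_attack(true_count, k=10000):
    return sum(noisy_answer(true_count) for _ in range(k)) / k
\end{lstlisting}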

\vspace{1em}

The previous attacks only works because we can ask multiple times the same query and get different noise every time. If our noises were to be consistent, we wouldnt learn anything by asking the same question again. Adding consistent noise prevents basic averaging attacks, attacks that send the exact same query multiple times.
The previous attacks only work because we can ask the same query multiple times and get different noise every time. If our noise were consistent, we wouldn't learn anything by asking the same question again. Adding consistent noise prevents basic averaging attacks, i.e. attacks that send the exact same query multiple times.

\vspace{1em}

@@ -985,10 +988,10 @@ \subsection{Formal guarantee for privacy}
\vspace{1em}

To limit the amount of information were releasing (and thereby prevent attacks), we could:
To limit the amount of information we're releasing (and thereby prevent attacks), we could:
\begin{itemize}
\item count the number of queries and stop answering when too many queries have been asked
\item increase the amount of noise were adding with each new queries
\item increase the amount of noise we're adding with each new query
\end{itemize}

But how can we know when to stop answering, or how much noise to add? For this, we need to quantify how much information we are releasing with every answer. This is what Differential Privacy was developed for.
@@ -1018,9 +1021,9 @@ \subsubsection{Differential Privacy}

This is where the nicest feature of DP comes in: \textbf{composability}. Releasing the output of any two queries on a dataset protected by an $\epsilon$-DP mechanism is equivalent to releasing one query protected by a $2\epsilon$-DP mechanism. We can then decide on an $\epsilon$, which we call the \textbf{privacy budget}, which then defines the total number of queries (any of them!) anyone can run on the dataset.

The \textbf{global sensitivity of a function f} captures the magnitude by which a single individuals data can change the function f in the worst case, and therefore, the uncertainty in the response that we must introduce in order to hide anyones participation. Adding noise according to $\text{Lap}(\Delta f / \epsilon)$ prevent averaging attack!
The \textbf{global sensitivity of a function $f$} captures the magnitude by which a single individual's data can change the function $f$ in the worst case, and therefore the uncertainty in the response that we must introduce in order to hide anyone's participation. Adding noise according to $\text{Lap}(\Delta f / \epsilon)$ prevents averaging attacks!
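
A minimal sketch of the Laplace mechanism (for a counting query the global sensitivity is $\Delta f = 1$; the sampling trick below is an implementation choice, not from the course):

\begin{lstlisting}
import random

# Hypothetical sketch of the Laplace mechanism: add Lap(Delta_f/eps)
# noise. The difference of two iid exponentials with rate 1/b is
# Laplace(0, b)-distributed.
def laplace_mechanism(true_value, sensitivity, epsilon):
    scale = sensitivity / epsilon   # b = Delta_f / epsilon
    noise = (random.expovariate(1 / scale)
             - random.expovariate(1 / scale))
    return true_value + noise
\end{lstlisting}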

DP guarantees can be \textbf{extended to groups} of size k but at the cost of ``\textbf{multiplying the noise}'' by a factor k. So to protect a group of 10 people wed have to go from Lap(10) to Lap(100).
DP guarantees can be \textbf{extended to groups} of size $k$, but at the cost of ``\textbf{multiplying the noise}'' by a factor $k$. So to protect a group of 10 people we'd have to go from $\text{Lap}(10)$ to $\text{Lap}(100)$.


\textbf{Some other ideas}:
