-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpaper.tex
418 lines (361 loc) · 15.9 KB
/
paper.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
% ****** Start of file aipsamp.tex ******
%
% This file is part of the AIP files in the AIP distribution for REVTeX 4.
% Version 4.1 of REVTeX, October 2009
%
% Copyright (c) 2009 American Institute of Physics.
% Use this file as a source of example code for your aip document.
% Use the file aiptemplate.tex as a template for your document.
\documentclass[%
aip,
jmp,%
amsmath,amssymb,
%preprint,%
reprint,%
%author-year,%
%author-numerical,%
]{revtex4-1}
%\extrafloats{1000}
\usepackage{morefloats}% Include figure files
\usepackage{graphicx}% Include figure files
\usepackage{grffile}
\usepackage{afterpage}
\usepackage{dcolumn}% Align table columns on decimal point
\usepackage{bm}% bold math
%\usepackage[mathlines]{lineno}% Enable numbering of text and display math
%\linenumbers\relax % Commence numbering lines
\usepackage{multirow}
\usepackage{color} % for the notes
\usepackage{etex}
\reserveinserts{58}
%\usepackage{morefloats}
\usepackage{hyperref}
\usepackage{xcolor}
\usepackage{amsmath}
\hypersetup{
colorlinks,
linkcolor={red!50!black},
citecolor={blue!50!black},
urlcolor={blue!80!black}
}
\usepackage{xr}
\externaldocument{supportingInformation}
\maxdeadcycles=1000
\usepackage{placeins}
\begin{document}
\preprint{XXXXX (preprint)}
%\title[Evolution of interaction networks]{On the evolution of interaction networks: primitive typology of vertex, prominence of measures and activity statistics}% Force line breaks with \\
%\title[Evolution of interaction networks]{On the evolution of interaction networks: a primitive typology of vertex}% Force line breaks with \\
%\title[Stability of interaction networks]{Stability in human interaction networks: sector relative sizes, prominence of topological measures and time activity statistics.}% Force line breaks with \\
%\title[Stability in human interaction networks]{Sector relative sizes and topological metrics time stability in human interaction networks}% Force line breaks with \\
\title[Divergence between samples]{A divergence between samples
derived from the Kolmogorov-Smirnov statistic: specification, measures reference and example uses}% Force line breaks with \\
\author{Renato Fabbri}%
\homepage{http://ifsc.usp.br/~fabbri/}
\email{[email protected]}
\affiliation{
S\~ao Carlos Institute of Physics, University of S\~ao Paulo (IFSC/USP),
PO Box 369, 13560-970, S\~ao Carlos, SP, Brazil %\\This line break forced with \textbackslash\textbackslash
}
\date{\today}% It is always \today, today,
% but any date may be explicitly specified
\begin{abstract}
In this article, we describe reference values for a distance metric
derived from the Kolmogorov-Smirnov statistic $D_{F,F'}$.
Each measure of the $D_{F,F'}$ is a distance between two histograms.
This distance is normalized by the number of observations in each sample
to yield the $c=D_{F,F'}\sqrt{\frac{n n'}{n+n'}}$ statistic,
which can be mapped to p-values, i.e. values for which
high enough levels of significance $\alpha$ implies the rejection of the null hypothesis.
Benchmarks for the implementation are delivered by comparing samples from known distributions.
Example application of the statistic for distinction of samples in real data enables further
insight in the robustness and power of $c$.
\end{abstract}
\pacs{05.10-a,}% PACS, the Physics and Astronomy
\keywords{Kolmogorov-Smirnov test, statistic, benchmark, distance measure, histogram}
\maketitle
\tableofcontents
\section{Introduction}\label{sec:intro}
% $D_{n,n'}$
%depicted in Figure~\ref{fig:kolms}
Be $F$ and $F'$ two empirical cumulative distributions,
where $n$ and $n'$ are the number of observations on each sample.
The two-sample Kolmogorov-Smirnov test rejects the null hypothesis
that the histograms are the outcome of the same underlying distribution
if:
\begin{equation}\label{eq:ks}
D_{F,F'} > c(\alpha)\sqrt{\frac{n+n'}{nn'}}
\end{equation}
\noindent where $D_{F,F'}=sup_x[F-F']$ as in Figure~\ref{fig:dnn}
and $c(\alpha)$ is related to the level of significance $\alpha$ by:
\begin{table}[h!]
\centering
\begin{tabular}{|l||c|c|c|c|c|c|}\hline
$\alpha$ & 0.1 & 0.05 & 0.025 & 0.01 & 0.005 & 0.001 \\\hline
$c(\alpha)$ & 1.22 & 1.36 & 1.48 & 1.63 & 1.73 & 1.95 \\\hline
\end{tabular}
\end{table}
If distributions are drawn from empirical data, $D_{F,F'}$ is given as are $n$ and $n'$.
All terms in equation~\ref{eq:ks} are positive and $c(\alpha)$ can be isolated:
\begin{equation}\label{eq:ks2}
% c(\alpha) < \frac{D_{F,F'}}{\sqrt{\frac{n+n'}{nn'}}} = c
c(\alpha) < D_{F,F'}\sqrt{\frac{nn'}{n+n'}} = c
\end{equation}
%Tables~\ref{tab:kolSub}-\ref{tab:kolPctInter} are populated with values for $c'(\alpha)$
When $c$ is high, low values of $\alpha$ favor rejecting the null hypothesis.
In fact, $c$ can be normalized to yield p-values.
\begin{figure}[!h]
\centering
\includegraphics[width=0.44\textwidth]{figs/Dnn}
\caption{The Kolmogorov-Smirnov statistic $D_{F,F'}$: the maximum difference between
two cumulative distribution functions.}
\label{fig:dnn}
\end{figure}
High values of $c$ favor rejecting the null hypothesis.
For example, if the significance level is $\alpha=0.01$,
then $c$ greater than $1.7$
implies the rejection of the null hypothesis and
suggests that $F$ and $F'$
are outcomes of different distributions.
Of core importance in this study is to regard the $c$ statistic
as a measure of distance between both distributions~\cite{kolm}.
The main contribution of the following sections is the
explicit display of reference values of $c$
from which one might derive knowledge from
measures or even from a single value of $c$.
%\subsection{Workarounds}
%The Kolmogorov-Smirnov test is robust,
%but data often fall into the cases where
%it is not guaranteed to be valid.
%In order to make the example analysis,
%the empirical density functions are built from discrete data.
%
%as we fall into two of these.
%
% z-score
% discrete
% formulação com max(f-x,x-f2)
%If the number of elements ($n$ and $n'$)
%in each sample are bellow 50.
\subsection{Philosophical and technological note}
Difference and equivalence is of central role in human cognition,
philosophy and science.
This fact is so deeply recognized that thinkers often reduce
thought to classifications, e.g. through the
mathematical concept of equivalence classes~\cite{deleuze}.
Histograms are very immediate and informative
roughly wherever there is a phenomenon of interest which can yield measurements.
This present document should enable conclusions to be drawn about
the equivalence (and difference)
of the processes underlying sets of measurements for a very
broad range of phenomena.
The following tables
also validate the mathematical framework
and the software implementation.
\subsection{Document outline}
Section~\ref{sec:simulations} exposes reference values drawn from simulations.
Section~\ref{sec:empirical} exemplifies the use of the $c$ statistic
to make sense of phenomena.
Section~\ref{sec:conc} holds final remarks with directions to
software and data.
\section{References through simulations}\label{sec:simulations}
Values of $c$ are given for simulations involving
normal, uniform, Weibull and power function distributions.
The rendering of this article is automated to ease changes in
the settings with which the results are reported.
\input{aux/preambule1}
\subsection{When the null hypothesis is true}
If the null hypothesis is true, the number
of rejections of the null hypothesis ($c>c(\alpha)$)
in $N_c$ comparisons should not exceed $\alpha N_c$.
To verify this, let $C=\{c_i\}$ be a set of $c$ measures,
and $C(\alpha)=\{c : c>c(\alpha)\}$.
Be $|C(\alpha)|$ the cardinality of $C(\alpha)$,
i.e. the number of comparisons in which the two-sample Kolmogorov-Smirnov
test rejects the null hypothesis for a given $\alpha$.
This section reports that
$|C(\alpha)|$ very rarely exceeds $\alpha N_c$,
for all probability distributions and settings.
Also important are that
$c>c(\alpha)$ in many cases
and that the tabulated $\alpha$ values
are also good estimates of the upper limit
of the frequency of such an event.
%\begin{itemize}
% \item $c'>c(\alpha)$ in many cases, and $\alpha N_c$ is a good upper limit to keep in mind.
% \item
%\end{itemize}
% Input tables
% any plot? If std is ~stable, plots of the mean of c~(\alpha) are compact and informative
\input{tables/tabNormNull_}
\input{tables/tabUniformNull_}
\input{tables/tabWeibullNull_}
\input{tables/tabPowerNull_}
\FloatBarrier
\subsection{When the null hypothesis if false}
The null hypothesis is always false for a sufficiently small
significance level $\alpha$.
In this section,
each table holds a set comparisons between two samples:
one sample is generated through a
fixed distribution while the other
sample is modified in each comparison.
The comparison is repeated $N_c$ times.
The measures on $c$ chosen to report the results are:
the mean $\mu(c)$, the standard deviation $\sigma(c)$,
the median $m(c)$,
the fraction
$\overline{C(\alpha)}=\frac{|C(\alpha)|}{N_c}$
of rejection of the null hypothesis given the significance level $\alpha$.
The null hypothesis is true in the boldface lines.
Let $D$ be the KS statistic when sample size goes to infinity.
Notably, high values of $c$ yield smaller $|D-D_{F,F'}|$,
i.e. more accurate estimates of $D$ through $D_{F,F'}$.
\input{tables/tabNormDiffDispersion_}
\input{tables/tabNormDiffMean_}
\input{tables/tabUniformDiffSpread_}
\input{tables/tabUniformDiffMean_}
\input{tables/tabWeibullDiffShape_}
\input{tables/tabPowerDiffShape_}
%\input{tables/tabNormDiff1}
%\input{tables/tabNormDiff2}
% distribuicoes normal, uniforme, weibul
% distribuicao de lei de potencia
% contemplar numeros diferentes de bins, de amostras e de distribuicoes
% enquanto mantemos mudando os parametros
\FloatBarrier
\subsection{Changing the sample sizes}
Changing the number of elements in each sample
changes the value of the $c$ statistic.
This section is dedicated to tables in which
the $c$ statistic is given for two samples of varied sizes but
with fixed underlying distributions.
If a same value $\epsilon$ is changed of the mean $\mu$ or
the standard deviation, the first yields greater
values of the $c$ statistic.
In summary, raising the sample sizes raises $c$ and gives a value
of $D_{F,F'}$ closer to $D$ (the maximum difference between the theoretical cumulative distributions).
\input{tables/tabNormalDiffSamples_}
\input{tables/tabNormalDiffSamples2_}
\input{tables/tabUniformDiffSamples_}
\input{tables/tabUniformDiffSamples2_}
\input{tables/tabWeibullDiffSamples_}
\input{tables/tabPowerDiffSamples_}
\FloatBarrier
\section{Example uses in empirical data}\label{sec:empirical}
% nltk com machado, shakespeare e biblia
% arquivo de audio
% bytes quaisquer de algum arquivo?
% dados puxados da wikipedia, gmane ou ?
This section presents immediate results
drawn from the statistic $c$ when observed
in real samples.
The sample choices are arbitrary.
\subsection{Text}
This section exemplifies the use of $c$
in the detection of similarity between texts.
Each text $X$ was divided in two halves $X1$ and $X2$.
The set of known English words were considered as were
the set of stopwords (words with reduced meaning such
as prepositions and articles).
Only the number of letters in each word was measured.
Three approaches were chosen: 1) the text was partitioned into 1000 pieces of equal number of characters, the mean of the word size of each piece is an element of the sample; 2) the text was partitioned into 1000 pieces of equal number of characters, the standard deviation of the word size is an element of the sample; 3) each word size is an element of the sample.
This last case yields a discrete probability distribution, which was approximated as a continuous variable and gave the greatest sensibility to text differences.
The overall result is the same: smaller differences between parts
of the same text.
Notice that the $c$ is often high within a same book.
The Bible was the only book where the $c$ statistic is higher between the halves
then between the whole text and the halves.
This might be due to the differences of the Old and New Testaments.
\input{tables/textsGeneral_}
\input{tables/textsDistances_}
\input{tables/textsDistances2_}
\input{tables/textsDistances2b_}
\input{tables/textsDistances3_}
\input{tables/textsDistances4_}
\input{tables/textsDistances4b_}
\FloatBarrier
\subsection{Audio}
This section presents $c$ values
drawn from audio for testing the sound system of the computer.
The PCM samples of the files were normalized to fit the interval
$[-1,1]$ to yield the samples labeled. The wavelet decomposition was performed with the Daubechies 8 Wavelet function.
The resulting values of the $c$ statistic reflect most of all the
different types of signals analysed:
PCM samples and wavelet decomposition coefficients in different leafs.
Among each type of signal, the type of sound is also reflected in the measures of $c$, with the noise having the highest values.
\input{tables/audioGeneral_}
\input{tables/audioDistances_}
\FloatBarrier
\subsection{Music}
This section presents measures of the $c$ statistic drawn
from the pitches of the notes of classical compositions.
The results reflect music history.
For example, measures of $c$ involving Palestrina
increases with the exception of Beethoven
who, indeed, used modalism.
The values of $c$ related to Bach also increases along time,
and the outcome of the comparison against Palestrina
is only exceeded when Sch\"onberg is reached,
which reflects the non-tonal discourse of both
Palestrina and Sch\"onberg.
\input{tables/musicGeneral_}
\input{tables/musicDistances_}
\FloatBarrier
\subsection{OS status}
This last example expose the statistic $c$ for samples drawn from
the operational system of my laptop.
The patterns are less neat than on last examples,
but many conclusions can still be reached.
The memory used by the most consuming processes compose
the samples which present the highest values of $c$.
Lowest values of $c$ are related to RAM usage.
Again, the type of samples are mandatory:
they might all be identified by the values of $c$
found in comparison to other samples,
with the exception of the RAM memory.
\input{tables/osGeneral_}
\input{tables/osDistances_}
\FloatBarrier
\section{Conclusions and further benchmarks}\label{sec:conc}
The $c$ statistic is robust both to determine if
the distributions underlying the samples are the same
and to quantify the difference between such probability distributions.
The benchmarks for $c$, given in Section~\ref{sec:simulations},
are very useful as references to make sense of data
e.g. as the example analyses of Section~\ref{sec:empirical}.
Notice that the calculations require no
training or clusterization usually involved in
classification routines.
The rendering of this article and all tables
are automated through Python scripts
in order to ease the scrutinization of different settings~\cite{gmanePack}.
After some exploration of the results,
this setting was settled as template for this document:
simulations used $N_c=100$ comparisons per measure with
$n=n'=1000$ elements in each sample;
all empirical density functions
derived from histograms with
$N_b=30$ equally spaced bins.
The main reason for this choice is that
the results are consistent and stable,
but are still of very modest scales.
The use of more bins
should enhance the results with respect to generality.
On the other hand, samples are often not so big,
which favors reporting the tables with a small sample size.
%The $c$ statistic is implemented in a small function within
%the Gmane Python package~\cite{gmanePack}.
%The routines for generating the tables from simulated and empirical
%data are grouped in separated files to ease further use.
\begin{acknowledgments}
Financial support was obtained from CNPq (140860/2013-4,
project 870336/1997-5), United Nations Development Program (contract: 2013/000566; project BRA/12/018) and FAPESP.
We are also grateful to developers and users of Python scientific tools.
\end{acknowledgments}
%\appendix
%\section{Software and data specifications}\label{ap:soft}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\nocite{*}
\bibliography{paper}% Produces the bibliography via BibTeX.
\end{document}