% 2018/08/19
% new features:
% custom distances
% data(galbraith), .....
\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage{microtype}
\usepackage{graphicx}
\usepackage{color}
\usepackage{hyperref}
\usepackage{alltt}
\usepackage{geometry}
\geometry{left=25mm,right=40mm,top=22mm}
\hypersetup{
pdftitle={stylo: a package for stylometric analyses},
pdfauthor={Maciej Eder, Jan Rybicki, Mike Kestemont},
pdfsubject={R package 'stylo'},
pdfkeywords={computational stylistics,} {stylometry,} {authorship attribution},
colorlinks=true, % false: boxed links; true: colored links
linkcolor=black, % color of internal links
linktoc=all, % both sections and subsections linked
citecolor=black, % color of links to bibliography
filecolor=black, % color of file links
urlcolor=black % color of external links
}
\frenchspacing
\definecolor{darkred}{rgb}{0.8,0,0}
\definecolor{darkgreen}{rgb}{0,0.8,0}
\definecolor{darkblue}{rgb}{0,0,0.8}
\definecolor{mygreen}{rgb}{.36,.37,.08} % rgb: 93 95 22
\definecolor{myblue}{rgb}{.07,.28,.4} % rgb: 18 73 102
\definecolor{myyellow}{rgb}{.75,.54,.12} % rgb: 193 138 31
\definecolor{myred}{rgb}{.52,.03,.3} % rgb: 133 9 79
\definecolor{myorange}{rgb}{.6,.26,.05} % rgb: 155 66 12
\def\underscore{\raisebox{-.8ex}{-}}
\def\margin#1{\marginpar{\textcolor{blue}{\footnotesize\tt #1}}}
%\usepackage{marginnote}
%\def\margin#1{\marginnote{\textcolor{blue}{\footnotesize\tt #1}}}
\def\code#1{{\tt #1}}
\def\textquotedbl{"}
\title{{\tt `Stylo'}: a package for stylometric analyses}
\author{Maciej Eder\\ {\footnotesize Pedagogical Univ. of Kraków}
\and Jan Rybicki\\ {\footnotesize Jagiellonian University}
\and Mike Kestemont\\ {\footnotesize University of Antwerp}
}
\begin{document}
\maketitle
\begin{center}
\includegraphics[width=0.2\linewidth]{img/csg.png}
\end{center}
\vglue\baselineskip
\hrule
\begin{abstract}
The \code{`stylo'} package (Eder et al., 2013; 2016) provides
easy-to-use implementations of various established analyses in the
field of computational stylistics, including non-traditional authorship
attribution, genre recognition, style development (``stylochronometry''),
etc. The package includes a number of exploratory methods provided
by the function \code{stylo()} (multidimensional scaling, principal
component analysis, cluster analysis, bootstrap consensus trees).
Additionally, a number of supervised machine-learning methods are available
via the function \code{classify()} (Delta, support vector machines,
naive Bayes, \textit{k}-nearest neighbors, nearest shrunken centroids). The
\code{rolling.delta()} function analyses collaborative works and
tries to determine the authorship of fragments extracted from them.
The function \code{rolling.classify()} offers a more flexible interface
to sequential classification of collaborative works.
The \code{oppose()} function performs a contrastive analysis between
two given sets of texts: among other things, it generates lists of words
significantly \emph{preferred} and \emph{avoided} by one or more authors
in comparison to the texts by another author (or a set of them).
\end{abstract}
% a tiny trick to typeset keywords as if they were an abstract
\global\long\def\abstractname{Keywords}
\begin{abstract}
\noindent stylometry, computational stylistics, authorship attribution,
cluster analysis, dendrogram, bootstrap consensus trees, PCA, MDS,
\textit{k}-NN, SVM, NSC, naive Bayes, Delta, Zeta, rolling stylometry
\end{abstract}
\bigskip
\hrule
\bigskip
\tableofcontents
\bigskip\bigskip
\section{Introduction}
Stylometric studies, in all their variety of material and method,
have two features in common: the electronic texts they study have
to be coaxed to yield numbers, and the numbers themselves have to
be processed via statistics. Sometimes, the two actions are two independent
parts of a given study. To give the simplest example, one piece of
software is used solely to compile word frequency lists; then, one
of the many commercial statistics packages takes over to extract meaning
from this mass of words, draw graphs etc.
Yet, as stylometrists have begun to produce statistical methods of
their own – to name but a few, Burrows’s Delta, Zeta and Iota (Burrows,
2002, 2007) and their modifications by other scholars (Argamon, 2008;
Craig and Kinney, 2009; Hoover, 2004a, 2004b) – commercial software,
despite its wide array of accessible methods, becomes something of
a straitjacket. This is why a number of dedicated stylometric solutions
have appeared, targeting the specific analyses frequently used in
this community. Hoover’s Delta, Zeta and Iota Excel spreadsheets are
pioneering examples of this approach (Hoover, 2004b). Constantly developed
since 2004, they have at least two major assets: they do exactly what
the stylometrist wants (with several optional procedures) and they
only require fairly standard -- although proprietary -- spreadsheet
software. This has been especially helpful for uses in specialist
workshops and classrooms: the student only needs additional (and,
often, free) software to produce word frequency lists and (s)he is
ready to go. Yet Excel imposes one important limitation: it is very
demanding from the point of view of memory usage. Moreover, the two-stage
nature of the process (a separate piece of software prepares word
lists that can be later automatically imported into the spreadsheet)
might be problematic, because it takes an experienced Visual Basic
programmer to make Excel itself extract the various frequency dictionaries
needed.
In this respect, Juola's JGAAP can directly import texts in a variety
of formats and perform a variety of authorship attribution tasks using
an impressive variety of statistical methods (Juola et al., 2008).
These can be further expanded by experienced programmers in Java.
Java is also the language of another software solution which takes
an even broader approach: Craig's Intelligent Archive is able to perform
a number of standard stylometric procedures, but it can also be used
as a corpus organizer. Once the initial work of registering texts
is done, it enables a versatile combination of individual texts and
groups of texts (Craig and Kinney, 2009).
Since contemporary stylometry uses either stand-alone dedicated programs
custom-made by stylometrists, or applies existing software, the \code{stylo}
package can be situated somewhere in-between: the powerful open-source
statistical programming environment R provides, on the one hand, the
opportunity of building statistical applications from scratch, and,
on the other, allows less advanced researchers to use ready-made scripts
and libraries (cf. \url{http://www.R-project.org}).
In our own stylometric adventure with R, one of the
aims was to build a tool (or a set of tools) that would combine sophisticated
state-of-the-art algorithms for classification and/or clustering with
a user-friendly, ``point-and-click'' interface. In particular, we wanted
to implement a number of popular multidimensional methods to be used
by scholars without advanced programming skills. It soon became evident
that once our R scripts were provided with a graphical user interface
and modest documentation, they lent themselves well to classroom use.
In our experience, this suite of tools offers an excellent way to
work around R's typically steep learning curve, without losing anything
of the power of the environment -- namely R's considerable computing
power and speed.
A crucial point in building the interface was to make sure that all
stages of a typical stylometric analysis -- from loading texts to
visualizing the results -- could be performed from within a single
function. The \code{stylo()} function, for instance, does all the
work: it processes electronic texts to create a list of all the words
used in all texts studied, with their frequencies in the individual
texts; normalizes the frequencies with \textit{z}-scores (if applicable);
selects words from the desired frequency ranges; performs additional
procedures that might improve attribution, such as Hoover's (2004a,
2004b) automatic deletion of personal pronouns and ``culling'' (automatic
removal of words too characteristic for individual texts); compares
the results for individual texts; performs a variety of multivariate
analyses; presents the similarities/distances obtained in tree diagrams;
and finally, produces a bootstrap consensus tree (a new graph that
combines many tree diagrams for a variety of parameter values). It
was our aim to develop a general platform for multi-iteration stylometric
tests; for instance, an alternative script derived from the function
\code{classify()} produced heatmaps to show the degree of Delta's
success in attribution at various intervals of the word frequency
ranking list (Rybicki and Eder, 2011).
The last stage of the interface design was, firstly, to add a GUI
(since some humanists might be allergic to the command-line mode provided
by R) and, secondly, a host of various small improvements (like saving
and loading the parameters for the most recent analysis, a wide choice
of graphic output formats, etc.). Nevertheless, advanced users could
still easily switch off the GUI and embed the functions provided by
the ``stylo'' library in their own scripts.
\section{Installation}
Make sure you are connected to the internet. Launch R. Type \code{install.packages("stylo")}
in the console. Whenever you start a new R session, type \code{library(stylo)}.
This will automatically load all the functionality provided in the
package (see below). If you are very lazy and only use R for stylometric
purposes, you can find your \code{Rprofile.site} configuration file
(in \code{R/R-<your R version here>/etc}), open it with administrator privileges
and insert the line \code{library(stylo)} there. In this case, the
``stylo'' library will be loaded at the start of each R session and you
can start invoking the particular functions right away.
\section{Functions provided}
The most important tools included in this package are distributed
over the following functions:
\begin{itemize}
\item \code{stylo()}
\item \code{classify()}
\item \code{oppose()}
% \item \code{rolling.delta()}
\item \code{rolling.classify()}
\item \code{imposters()}
\item \code{crossv()}
\end{itemize}
The next sections of this manual describe these functions together
with all the different input options they can take. If you want to
get a general overview of any of them, type \code{help(stylo)},
\code{help(classify)}, etc., and a help window will appear. More
advanced users might be interested in some other functions
provided by the library. Generally speaking, these are lower-level
functions, a great many of which are called automatically from inside
the upper-tier functions, such as \code{classify()}, \code{oppose()},
etc. This lower-level functionality can of course be used for developing
your own scripts and functions. The lower-level functions include:
\begin{itemize}
\item \code{assign.plot.colors}
\item \code{crossv}
\item \code{define.plot.area}
\item \code{delete.markup}
\item \code{delete.stop.words}
\item \code{dist.argamon}
\item \code{dist.cosine}
\item \code{dist.delta}
\item \code{dist.entropy}
\item \code{dist.eder}
\item \code{dist.simple}
\item \code{dist.wurzburg}
\item \code{draw.polygons}
\item \code{gui.classify}
\item \code{gui.oppose}
\item \code{gui.stylo}
\item \code{load.corpus.and.parse}
\item \code{load.corpus}
\item \code{make.frequency.list}
\item \code{make.ngrams}
\item \code{make.samples}
\item \code{make.table.of.frequencies}
\item \code{parse.corpus}
\item \code{parse.pos.tags}
\item \code{perform.culling}
\item \code{perform.delta}
\item \code{perform.knn}
\item \code{perform.naivebayes}
\item \code{perform.nsc}
\item \code{perform.svm}
% \item \code{print.stylo.corpus}
% \item \code{print.stylo.data}
% \item \code{print.stylo.results}
\item \code{stylo.default.settings}
\item \code{stylo.pronouns}
% \item \code{summary.stylo.corpus}
% \item \code{summary.stylo.results}
\item \code{txt.to.features}
\item \code{txt.to.words.ext}
\item \code{txt.to.words}
\item \code{zeta.chisquare}
\item \code{zeta.craig}
\item \code{zeta.eder}
\end{itemize}
In most cases, these lower-level functions provide very basic processing
functionality and they are therefore not intended to be invoked by
everyday users. Hence, they will not be discussed in this manual.
However, if you are interested in how they work and how to use them,
you can invoke the help pages for these functions: \code{help(load.corpus)},
\code{help(make.ngrams)}, etc. Help pages routinely contain some
insightful examples as to how to use the code: refer to them if you
want to understand what a particular function does. The examples can
be copy-pasted into an active R console. (Don't be afraid of the lines
`\code{\#\#~Not run}' -- they prevent R from running some automatic
checks on interactive functions; you can use these examples safely).
Apart from functions, the package \code{`stylo'} (ver. $>$0.6.1) contains
three datasets that can be used to start playing with stylometric methods
without any actual texts. The datasets are as follows:
\begin{itemize}
\item \code{novels}
\item \code{galbraith}
\item \code{lee}
\end{itemize}
The first dataset contains 9 full-size novels by Jane Austen and
the Bront\"e sisters. The second and the third sets contain computed
tables of word frequencies for 26 and 28 contemporary novels, respectively,
that for copyright-related reasons could not be made available in
their original format. A detailed description of the datasets
can be retrieved via \code{help(novels)}, \code{help(galbraith)}
and \code{help(lee)}.
\section{stylo()}
This is currently the main tool in the package. The function \code{stylo()}
is meant to enable users to automatically load and process a corpus
of electronic text files from a specified folder, and to perform a
variety of stylometric analyses from multivariate statistics to assess
and visualize stylistic similarities between input texts. This function
provides exploratory analyses; any users interested in supervised
machine-learning methods might want to skip this section and go to
\code{classify()}, below.
\code{stylo()} will typically be used to produce a most-frequent-word
(MFW) list for the entire corpus. Next, it will acquire the frequencies
of the MFWs in the individual texts to create an initial matrix of
words (rows) by individual texts (columns): each cell will contain
a single word's frequency in a single text. Subsequently, it will
normalize the frequencies and select words from the desired frequency
ranges for an analysis (this table is also saved to disk as
\code{table\underscore{}with\underscore{}frequencies.txt}),
and it will perform additional processing procedures (automatic deletion
of personal pronouns and culling, see 4.3.5 below) to produce a final
wordlist for the actual analysis (this information is saved to disk
in the current working directory as \code{wordlist.txt}). It then
compares the results for individual texts, performing e.g. distance
calculations and using various statistical procedures (cluster analysis,
multidimensional scaling, or principal components analysis). Finally,
the function will produce graphical representations of distances between
texts and it will write the resulting authorship (or similarity) candidates
to a logfile (\code{results.txt}) in the current working directory.
When the \code{consensus tree} option is selected, the script produces
virtual cluster analyses for a variety of parameters, which then produce
a final diagram that reflects a compromise between the underlying
cluster analyses.
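The core of this pipeline can be sketched in a few lines of base R.
What follows is only a minimal illustration under simplifying assumptions,
not the code that \code{stylo()} actually runs: the three toy ``texts'',
their names, and the object names (\code{texts}, \code{tokens}, \code{mfw},
\code{freqs}) are invented for the example; texts are put in rows here
because \code{scale()} and \code{dist()} expect this layout; and Classic
Delta is computed as the mean absolute difference between \textit{z}-scores.
\begin{verbatim}
# Toy corpus: three invented "texts", named according to the
# category_title convention (hypothetical data, illustration only)
texts <- c(
  A_first  = "the cat sat on the mat and the dog sat on the rug",
  A_second = "the cat and the dog sat on the mat by the door",
  B_first  = "a storm of wind and of rain came over a hill of sand"
)

# 1. Tokenize: lowercase, then split on anything but a letter
tokens <- lapply(tolower(texts), function(x) {
  words <- strsplit(x, "[^a-z]+")[[1]]
  words[words != ""]
})

# 2. Frequency list for the whole corpus; keep the 5 top words
mfw <- names(sort(table(unlist(tokens)), decreasing = TRUE))
mfw <- mfw[1:5]

# 3. Relative frequencies (in %): texts in rows, words in columns
freqs <- t(sapply(tokens, function(w) {
  counts <- table(factor(w, levels = mfw))
  100 * as.vector(counts) / length(w)
}))
colnames(freqs) <- mfw

# 4. Normalize each word (column) with z-scores
z <- scale(freqs)

# 5. Classic Delta: Manhattan distance between z-scores,
#    divided by the number of words used
delta <- dist(z, method = "manhattan") / length(mfw)

# 6. Cluster analysis; plot(tree) would draw the dendrogram
tree <- hclust(delta, method = "ward.D2")
print(round(delta, 2))
\end{verbatim}
In this toy setting the two ``A'' texts end up closer to each other than
either is to the ``B'' text, which is the kind of grouping a dendrogram
produced by \code{stylo()} is meant to reveal on real data.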
\subsection{Corpus preparation}
\begin{figure}
\centering
\includegraphics[width=0.85\linewidth]{img/stylo_corpus1.png}
\caption{Working directory containing a subdirectory \code{corpus} and some files generated by the function \code{stylo()}.}
\end{figure}
The procedure of loading corpora as described immediately below is probably the best way to start doing your first analyses. However, experienced users of R sooner or later will discover that input data structures (corpora, vectors of features, tables of frequencies) can be passed as R objects directly from, say, other functions, without any interaction with text files. Refer to section 9.1 for details.
Each project requires a separate and dedicated working folder. You
will want to give it a meaningful name (like \code{SanskritPoetry11}
rather than \code{Blah-blah}), since the name of the folder will
appear as the title in your graphs generated by the function. By default,
the results of your analyses and other useful files will be written
automatically to this folder. The actual text files for your analyses
must be placed in a subfolder in the working directory, named \code{corpus}
(Note: all file names are case sensitive!). All functions in this
tool suite expect to find at least two input texts for their analyses.
The text files need to follow the following naming syntax: \code{category\underscore{}title.txt}.
For people working in authorship attribution, the \code{category}
will capture a text's authorial signature; other users, perhaps interested
in comparing translators' styles, should name their files \code{translatorname\underscore{}title.txt}.
Likewise, if you are looking for stylistic similarities between writers
of the same gender, use \code{gender\underscore{}title.txt},
etc. It is really important to use an underscore (``\code{\underscore}'')
as a delimiter: e.g. colors on the final graphs will
also be assigned according to strings of characters up to the first
underscore in the input files' names. (For further details and examples,
type \code{help(assign.plot.colors)}.) Consider the following examples,
in which the classes are the authors' names and the authors' gender, respectively:
\medskip
\code{
\noindent
\textcolor{darkgreen}{ABronte}\underscore Agnes.xml \\
\textcolor{darkgreen}{ABronte}\underscore Tenant.xml \\
\textcolor{darkred}{Austen}\underscore Emma.xml \\
\textcolor{darkred}{Austen}\underscore Pride.xml \\
\textcolor{darkred}{Austen}\underscore Northanger.xml \\
\textcolor{myyellow}{Conrad}\underscore Nostromo.xml \\
\textcolor{myyellow}{Conrad}\underscore Lord.xml \\
\textcolor{darkblue}{Dickens}\underscore Pickwick.xml \\
\dots
}
\medskip
\code{
\noindent
\textcolor{darkred}{M}\underscore Conrad\underscore Lord\underscore Jim.txt \\
\textcolor{darkred}{M}\underscore Joyce\underscore Dubliners.txt \\
\textcolor{darkgreen}{F}\underscore Woolf\underscore Night\underscore and\underscore day.txt \\
\textcolor{darkgreen}{F}\underscore Woolf\underscore Waves.txt \\
\dots
}
\medskip
Whatever follows the first underscore (say, the short titles
of novels) can contain any other information. Be careful with
long names, however, since these might not fit in the graphs that
will be generated. The texts must either be \textit{all} in plain
text format, or \textit{all} in HTML, or \textit{all} in TEI-XML (the
latter two options have not been extensively tested so far, and should
be used carefully).
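To see what the naming convention implies, the class of each text can
be recovered from its file name with a single regular expression. The
snippet below is a sketch in base R, assuming hypothetical file names;
it is not the implementation of \code{assign.plot.colors()}, only an
illustration of ``everything up to the first underscore''.
\begin{verbatim}
# Hypothetical file names following the category_title convention
files <- c("ABronte_Agnes.txt", "ABronte_Tenant.txt",
           "Austen_Emma.txt", "Austen_Pride.txt",
           "F_Woolf_Night_and_day.txt", "M_Conrad_Lord_Jim.txt")

# The class is everything before the first underscore
classes <- sub("_.*", "", files)

# One color per class: texts of the same class share a color
class.names <- unique(classes)
palette.used <- rainbow(length(class.names))
text.colors <- palette.used[match(classes, class.names)]

print(data.frame(file = files, class = classes,
                 color = text.colors))
\end{verbatim}
Note that only the first underscore matters: in the last two file names
the class is the single letter before it, and the remaining underscores
are simply part of the label.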
A concise remark about possible encoding issues should also be added.
If the operating system you use is Linux or Mac, you just need to make
sure the texts are all in UTF-8 (aka Unicode). If your operating system
is Windows, you have two options. Firstly, you might want to save all
the texts in ANSI codepage, but you have to tread carefully if your
machine runs one charset, say, Central European (1250) and your texts
are in the Western European codepage (1252); in this respect, for
instance, French is notoriously difficult (\emph{nous sommes vraiment
désolés}). Alternatively, you can convert your texts into Unicode
(a variety of freeware converters are available on the internet),
and use an appropriate encoding option when launching the function,
say, \code{stylo()}, either by clicking the ``UTF-8'' button in the GUI
(beginners), or by passing the argument \code{encoding = "UTF-8"} directly
to the function (advanced users).
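The conversion itself can also be done from within R, using the base
function \code{iconv()}. The example below is a sketch that assumes the
source text is in the Western European (Latin-1) encoding; the file names
in the commented-out lines are hypothetical and should be adapted to your
own corpus.
\begin{verbatim}
# A sample string as stored in the Western European encoding:
# "\xe9" is the single byte for e-acute in Latin-1
legacy <- "nous sommes vraiment d\xe9sol\xe9s"

# Re-encode to UTF-8 (the source encoding is an assumption)
converted <- iconv(legacy, from = "latin1", to = "UTF-8")

# In UTF-8 each e-acute takes two bytes instead of one
bytes.before <- length(charToRaw(legacy))
bytes.after  <- length(charToRaw(converted))
print(c(before = bytes.before, after = bytes.after))

# Converting a whole file (hypothetical file names):
# txt <- readLines("corpus/Author_Title.txt",
#                  encoding = "latin1")
# writeLines(iconv(txt, from = "latin1", to = "UTF-8"),
#            "corpus/Author_Title_utf8.txt", useBytes = TRUE)
\end{verbatim}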
\subsection{Starting the function}
Start up R. At the prompt (where you see the cursor blinking), move
to your folder (the main folder you will be working in, \textit{not}
the \code{corpus} subfolder) using the command \code{setwd()}.
E.g.:
\medskip{}
\code{setwd("/Users/virgil/Documents/disputed-works-of-mine")}
\medskip{}
\noindent You can use either absolute paths (as in the above example),
or relative ones, i.e. you can navigate directly from the current
working directory. If you want to go, say, two levels up and then
descend to a folder \code{first\underscore experiment},
type:
\medskip{}
\code{setwd("../../first\underscore experiment")}
\medskip{}
You can always check your current working directory by typing \code{getwd()}.
(If you use the R app for Windows, you can set your directory by clicking
the \emph{File} menu: see Fig.~2; Mac OS users -- click the \emph{Misc}
menu on your R console). Call the function by typing \code{stylo()}
at the prompt and hitting enter. After a while, you should see a GUI
box appear on the screen. Change as many options as you need. Since
there are multiple tabs in the GUI, make sure you only click the OK
button after you've set the parameters in all the tabs. Shortly afterwards,
you will see the names of the files processed appear in the R console,
followed by other (technical) information. Depending on the size of
your corpus, this step might take a few minutes. When the process
is completed without major errors, you will typically see a diagram
on your screen; alternatively (or in addition), a graphic file (you can choose one or more
formats if you like) will be saved in your working directory (at a better resolution than the onscreen version, so use this for your publication), and you can start exploring the other \code{stylo()} output files there.
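Before calling \code{stylo()}, it may be worth checking that the working
directory has the expected layout. The following base R sketch builds a
throw-away project folder (all folder and file names are hypothetical)
and runs two simple checks: that the \code{corpus} subfolder exists and
that it contains at least two input texts.
\begin{verbatim}
# Build a throw-away project folder with the expected layout
project <- file.path(tempdir(), "first_experiment")
dir.create(file.path(project, "corpus"),
           recursive = TRUE, showWarnings = FALSE)
writeLines("It is a truth universally acknowledged",
           file.path(project, "corpus", "Austen_Pride.txt"))
writeLines("There was no possibility of taking a walk",
           file.path(project, "corpus", "CBronte_Jane.txt"))

# Move to the main folder, not to the "corpus" subfolder
old.dir <- setwd(project)

# Sanity checks before starting the analysis
corpus.ok <- dir.exists("corpus")
input.files <- list.files("corpus", pattern = "\\.txt$")
enough.texts <- length(input.files) >= 2

print(getwd())
print(input.files)

# library(stylo); stylo()   # the analysis would start here

setwd(old.dir)              # go back to where we started
\end{verbatim}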
\begin{figure}
\centering
\includegraphics[width=0.65\linewidth]{img/win_setwd.jpg}
\caption{R console on Windows: easiest way to set working directory}
\end{figure}
\subsection{Options available on GUI}
As a first step, beginners should learn how to use the graphical user
interface (GUI), which allows you to control the script's main parameters
without having to tamper with the actual code. However, if you \emph{do}
prefer to tamper with the code, you can call the function in batch mode:
\code{stylo(gui=FALSE)}. In that case, before you start, you might
want to visit the help pages via typing the command \code{help(stylo)}.
Also, you should be familiar with additional options that can, or
rather should, be passed as arguments; they are listed on the margins
of this document.
Whenever you use the GUI, each successful execution or ``run'' of the
script will generate a \code{stylo\underscore{}config.txt} file
(saved in your working folder) which you can review (for instance,
should you have forgotten the parameters you used in your last experiment).
The parameter settings specified in this file will be retrieved at
each subsequent run of the script, so that the user won't have to
\emph{re}specify their favorite settings every time. Please note that
when you hover your cursor over the labels of each of the entries
in the GUI, tool tips will appear that will help you understand the
GUI. In the following sections we will discuss each of the different
tabs in the \code{stylo()} GUI.
Whether or not you decide to use the GUI, you can pass additional
arguments from the command line. If the graphic mode is on, these ``new''
values will appear in the GUI and thus will still be modifiable.
Some examples include:
\medskip
\code{stylo(mfw.min=300, mfw.max=300, analyzed.features="c",
ngram.size=3)}
\medskip
\code{stylo(gui=FALSE, analysis.type="MDS",
write.png.file=TRUE)}
\medskip
\code{stylo(mfw.min=100, mfw.max=1000, mfw.incr=100, analysis.type="BCT")}
\medskip
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{img/stylo-gui_tab1.png}
\caption{\code{stylo()} launched in its default mode: first tab on GUI}
\end{figure}
\subsubsection{Input}
This is where you specify the format of your corpus (see 4.1 above
for more details about corpus preparation, and mind possible encoding
issues). The available choices are:\margin{corpus.format=}
\begin{itemize}
\item plain text: plain text files.\margin{"plain"}
\item xml: XML files; this option will remove all tags and TEI headers.\margin{"xml"}
\item xml (plays): XML files of plays; with this option, all tags, TEI headers,
and speakers' names between \code{<speaker>...</speaker>} tags
are removed.\margin{"xml.drama"}
\item xml (no titles): XML contents only: all tags, TEI headers, and chapter/section
(sub)titles between \code{<head>...</head>} tags are removed.\margin{"xml.notitles"}
\item html: the option will attempt to remove HTML headers, menus, links
and other tags.\margin{"html"}
\item UTF-8:\margin{encoding=}\margin{"UTF-8"}\margin{"native.enc"} if you use Linux or Mac, this option is immaterial; however, if your operating system is Windows, then you need to set it depending on whether your dataset is encoded in Unicode (then check the option), or in ANSI (then leave it unchecked).
\end{itemize}
\subsubsection{Language}
This setting makes sure that pronoun deletion (see below) works correctly.
Removing pronouns is known to improve authorship attribution in some
languages; if you decide not to remove them from your corpus, this setting
is immaterial (unless you are using English; see immediately below).\margin{corpus.lang=}
\begin{itemize}
\item English: this setting makes sure that contractions (such as “don’t”)
are \emph{not} treated as single words (thus ``don't'' is understood
as two separate items, “don” and “t”), and that compound words (such
as “topsy-turvy”) are \emph{not} treated as one word (thus “topsy-turvy”
becomes “topsy” and “turvy”).\margin{"English"}
\item English (contr.): this setting makes sure that contractions (such as “don’t”) \emph{are} treated as single words (thus ``don't'' is understood as “don\^{ }t” and counted separately), but compound words (such as “topsy-turvy”) are still \emph{not} treated as one word (thus “topsy-turvy” becomes “topsy” and “turvy”).\margin{"English.contr"}
\item English (ALL): this setting makes sure that contractions (such as
“don’t”) \emph{are} treated as single words (thus ``don't'' is understood
as “don\^{ }t” and counted separately), and that compound words (such
as “topsy-turvy”) \emph{are} treated as one word (thus “topsy-turvy”
becomes “topsy\^{ }turvy”).\margin{"English.all"}
\item Latin: this setting makes sure that “v” and “u” are treated as discrete
character signs in Latin texts.\margin{"Latin"}
\item Latin.corr: since some editions do not distinguish between “v” and
“u”, this option provides a consistent conversion of both characters
to “u” in each text.\margin{"Latin.corr"}
\item CJK: Chinese, Japanese and Korean scripts, provided that the input
data is encoded in Unicode.\margin{"CJK"}
\item Other: non-Latin scripts: Hebrew, Arabic, Cyrillic, Coptic, Greek,
Georgian, Latin phonetic, so far. Make sure your input data is in
Unicode!\margin{"Other"}
\end{itemize}
Please do note that for all other languages, apostrophes do \emph{not}
join words and compound (hyphenated) words are split. This is not
the ideal solution and will be addressed as soon as we get to it.
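The difference between the three English settings can be mimicked with a
small tokenizer written in base R. This is a simplified sketch, not the
regular expressions used inside the package: the function
\code{tokenize()} and its two arguments are invented for the example, and
only the plain ASCII apostrophe and hyphen are handled.
\begin{verbatim}
# A simplified tokenizer mimicking the three English settings
tokenize <- function(text, keep.contractions = FALSE,
                     keep.compounds = FALSE) {
  text <- tolower(text)
  # Protect word-internal apostrophes and/or hyphens by turning
  # them into "^", which is then treated as a word character
  if (keep.contractions)
    text <- gsub("(?<=[a-z])'(?=[a-z])", "^", text, perl = TRUE)
  if (keep.compounds)
    text <- gsub("(?<=[a-z])-(?=[a-z])", "^", text, perl = TRUE)
  words <- strsplit(text, "[^a-z^]+", perl = TRUE)[[1]]
  words[words != ""]
}

sample.text <- "Don't go topsy-turvy"
print(tokenize(sample.text))              # like "English"
print(tokenize(sample.text, TRUE))        # like "English.contr"
print(tokenize(sample.text, TRUE, TRUE))  # like "English.all"

# The idea behind "Latin.corr": map every "v" to "u"
latin <- gsub("v", "u", tolower("Arma virumque cano"))
print(latin)
\end{verbatim}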
\subsubsection{Features}
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{img/stylo-gui_tab2.png}
\caption{\code{stylo()} GUI, the second tab: Features, MFW Settings, Culling, Various}
\end{figure}
In many established approaches to stylometry, the (relative) frequencies
of the most frequent words (MFW) in a corpus are used as the basis
for multidimensional analyses. It has been argued, however, that other
features are also worth considering, especially word and/or character
\emph{n}-grams. The general idea behind such \emph{n}-grams is to
split a string of individual items into partially overlapping,
consecutive sequences of \textit{n} of these individual items. Given a sample
sentence “This is a simple example”, the character 2-grams (``bigrams'')
are as follows: “th”, “hi”, “is”, “s~”, “~i”, “is”, “s~”, “~a”, “a~”, “~s”,
“si”, “im”, “mp”, etc. The same sentence split into bigrams
of words reads “this is”, “is a”, “a simple”, “simple example”. It
has been heavily debated in the secondary literature whether the use
of \emph{n}-grams really increases the accuracy of stylometric tests
(Hoover, 2002, 2003, 2012; Koppel et al., 2009; Stamatatos, 2009;
Eder, 2011; Alexis et al., 2014). However, it has been shown (Eder,
2013) that character \emph{n}-grams are impressively robust when one
deals with a ``dirty'' corpus (one with a high number of misspelled
characters, or one with bad {\sc ocr}). The ideal combination of parameters
in this section is another bone of contention between scholars; in
fact, Eder and Rybicki (2013) maintain that this differs not only
from language to language but also from one collection of text to
another.
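
As a rough illustration of the \emph{n}-gram construction described above, the following base-R sketch derives word and character bigrams from the sample sentence. The helper \code{ngrams.sketch()} is a hypothetical name introduced here for illustration only; it is not the routine used internally by \code{stylo()}.
\begin{verbatim}
# Illustrative helper (hypothetical, not part of the stylo package):
# combine consecutive items into partially overlapping n-grams
ngrams.sketch = function(items, n = 2, sep = " ") {
  if (length(items) < n) return(character(0))
  starts = 1:(length(items) - n + 1)
  sapply(starts, function(i) paste(items[i:(i + n - 1)], collapse = sep))
}

sentence = tolower("This is a simple example")
words = strsplit(sentence, " ")[[1]]   # split on spaces
chars = strsplit(sentence, "")[[1]]    # split into single characters

word.bigrams = ngrams.sketch(words, n = 2)
char.bigrams = ngrams.sketch(chars, n = 2, sep = "")

print(word.bigrams)
print(head(char.bigrams, 13))
\end{verbatim}
Raising \code{n} in this sketch quickly produces sequences that occur only once, which is the data sparseness problem mentioned below.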
\begin{itemize}
\item words: words are used as the unit. Naturally, the higher the \emph{n}
you specify, the less often each of your \emph{n}-grams will recur, and this
means poor statistics (data sparseness).\margin{analyzed.features=}\margin{"w"}
\item characters: characters are used as the unit.\margin{"c"}
\item \emph{n}-gram size: this is where you can specify the value of \emph{n}
for your \emph{n}-grams.\margin{ngram.size=}\margin{<integer>} Setting this option to 1 makes
sure that individual words/chars will be used instead of higher-order
\emph{n}-grams; of course, single-letter counts do not seem like a good idea.
\item preserve case:\margin{preserve.case=}\margin{TRUE|FALSE} normally, all the words from the input texts are turned into lowercase, no matter if they are proper nouns or not -- e.g. the sentence \textit{The family of Dashwood had long been settled in Sussex} will be turned into \textit{the family of dashwood had long been settled in sussex}. In some situations, however, you might be interested in preserving the case. This option does exactly that.
\item select files manually: normally, the script performs the analysis
on \emph{all} files in your \code{corpus} subfolder. If this option
is checked, a dialogue window will appear enabling the user to make
a selection of input files from the subfolder. Obviously, you can
achieve the same results by simply removing the unwanted texts from
the \code{corpus} subfolder. Again, note that this function will
expect you to select at least two different input files.\margin{[TBD]}
\end{itemize}
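Assuming the package is installed and the working directory contains a \code{corpus} subfolder, the settings above might be passed directly to the function, as in the following sketch. The argument names are those shown in the margin notes; \code{gui = FALSE}, which skips the GUI, and the particular values are chosen here purely for illustration.
\begin{verbatim}
library(stylo)

# character 3-grams, with all letters turned into lowercase
stylo(gui = FALSE,
      analyzed.features = "c",
      ngram.size = 3,
      preserve.case = FALSE)
\end{verbatim}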
\subsubsection{MFW settings}
This is where you specify the size of the most-frequent-word list
that will be used for your analysis. Actually, the name is slightly
misleading, since you are not at the mercy of ``most frequent \emph{words}''
only. You can use most frequent word pairs (bigrams), character sequences,
etc. We keep the name ``MFW'' because... Well, we don't really remember
why we keep it; probably, there was no-one around to propose a better
solution.
\begin{itemize}
\item Minimum: this setting determines how many words (or features) from
the top of the frequency list for the entire corpus will be used in
your analysis in the first (and possibly, only) run of the function.
With a value of 100 for this parameter, your analysis will be conducted
on the 100 most frequent words (features) in the entire corpus.\margin{mfw.min=}\margin{<integer>}
\item Maximum: this setting determines how many words from the top of the
word frequency list for the entire corpus will be used in your analysis
in the last (and possibly, only) run of the function. Thus, a setting
of 1000 results in your (final) experiment being conducted on the 1000
most frequent words in the entire corpus.\margin{mfw.max=\\<integer>}
(This parameter setting is especially important when working with
the bootstrap consensus trees in \code{stylo()}, a procedure which
involves running several analyses in a row. See immediately below
under ``Increment'').
\item Increment: this setting defines the value by which the value of Minimum
will be increased at each subsequent run of your analysis until it
reaches the Maximum value. Thus, a setting of 200 (at a Minimum of
100 and a Maximum of 1000) provides for an analysis based on 100,
300, 500, 700 and 900 most frequent words.\margin{mfw.incr=}\margin{<integer>}
(As above, this parameter setting is especially important when working
with the bootstrap consensus trees in \code{stylo()}, a procedure
which involves running several analyses in a row).
\item Start at freq. rank: sometimes you might want to skip the very top
of the frequency list\margin{start.at=}\margin{<integer>}.
With this parameter, you can specify how many words from the top of
the overall frequency rank list should be skipped. Normally, however,
users will want to set this at~1.
\end{itemize}
N.B. For all statistical procedures (see 4.3.6 below) except the Consensus
Tree, it is advisable to set Minimum and Maximum to the same value
(this makes the Increment setting immaterial), unless you want to
produce a large series of cluster analysis, multidimensional scaling
or principal components analysis graphs in a row, for instance to
observe how/if the results change for various lengths of the MFW list.
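
The series of MFW values implied by the three settings can be previewed with a short base-R sketch. It mimics the behaviour described above rather than reproducing the internal code of \code{stylo()}; the variable names simply echo the corresponding arguments.
\begin{verbatim}
mfw.min = 100
mfw.max = 1000
mfw.incr = 200

# lengths of the MFW list used in consecutive runs of the analysis
mfw.series = seq(from = mfw.min, to = mfw.max, by = mfw.incr)
print(mfw.series)
\end{verbatim}
With these values the series is 100, 300, 500, 700 and 900; setting Minimum and Maximum to the same value reduces it to a single run.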
\subsubsection{Culling}
``Culling'' refers to the automatic manipulation of the wordlist (proposed
by Hoover 2004a, 2004b). The culling values specify the degree to
which words that do not appear in all the texts of your corpus will
be removed. Thus, a culling value of 20 indicates that words that
appear in at least 20\% of the texts in the corpus will be considered
in the analysis. A culling setting of 0 means that no words will be
removed; a culling setting of 100 means that only those words will
be used in the analysis that appear in \emph{all} texts of your corpus
at least once.
\begin{itemize}
\item Minimum: this setting specifies the first (and possibly, only) culling
setting in your analysis (cf. the minimum MFW setting).\margin{culling.min=}\margin{<integer>}
\item Maximum: this setting specifies the last (and possibly, only) culling
setting in your analysis (cf. the maximum MFW setting)\margin{culling.max=}\margin{<integer>}.
(This parameter setting is especially important when working with
the bootstrap consensus trees in \code{stylo()}, a procedure which
involves running several analyses in a row).
\item Increment: this defines the increment by which the value of Minimum
will be increased at each subsequent run of your analysis until it
reaches the Maximum value. Thus a setting of 20 (at a Minimum of 0
and a Maximum of 100) provides for an analysis using culling settings
of 0, 20, 40, 60, 80 and 100\margin{culling.incr=}\margin{<integer>}.
(This parameter setting is especially important when working with
the bootstrap consensus trees in \code{stylo()}, a procedure which
involves running several analyses in a row).
\item List cutoff: Usually, it is recommended to cut off the tail of the
overall wordlist\margin{mfw.list.cutoff=}\margin{<integer>};
if you do not want to cut the list and analyze vectors of thousands
of words at once, then the variable may be set to an absurdly big
number (although this can be computationally demanding for your machine).
This setting is independent of the culling procedure.
\item Delete pronouns: (this setting too is independent of the culling
procedure).\margin{delete.pronouns=}\margin{TRUE|FALSE}
If this option is checked, make sure you have selected the correct
language for your corpus (see 4.3.2 above). This will select a list
of pronouns for that language inside the
script.\margin{corpus.lang=}\margin{"English"}\margin{"Dutch"}\margin{...}
Advanced users can use this part of the tool to remove any words they
want. So far, we have pronoun lists for English, Dutch, Polish, Latin,
French, German, Spanish, Italian, and Hungarian.
\end{itemize}
N.B. As mentioned above, for all statistical procedures (see
4.3.6 below) except consensus trees, it is advisable to set Minimum
and Maximum to the same value (this makes the Increment setting immaterial),
unless you want to produce a large series of cluster analysis, multidimensional
scaling or principal components analysis graphs etc. in a row.
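
To make the culling rule concrete, the sketch below applies it to a tiny invented table of word counts. The function \code{culling.sketch()} and the numbers are hypothetical and serve illustration only; the sketch assumes that a word counts as present in a text whenever its frequency is greater than zero.
\begin{verbatim}
# invented counts: rows are texts, columns are words
freqs = matrix(c(5, 0, 2,
                 3, 1, 0,
                 4, 0, 0,
                 6, 2, 1,
                 2, 0, 3),
               nrow = 5, byrow = TRUE,
               dimnames = list(paste0("text", 1:5),
                               c("the", "whale", "she")))

# keep only the words attested in at least `culling' percent of the texts
culling.sketch = function(freqs, culling) {
  share = 100 * colSums(freqs > 0) / nrow(freqs)
  freqs[, share >= culling, drop = FALSE]
}

# culling levels for Minimum = 0, Maximum = 100, Increment = 20
culling.series = seq(from = 0, to = 100, by = 20)

print(colnames(culling.sketch(freqs, 0)))
print(colnames(culling.sketch(freqs, 50)))
print(colnames(culling.sketch(freqs, 100)))
\end{verbatim}
In this toy table ``the'' occurs in all five texts, ``she'' in three and ``whale'' in two, so a culling value of 50 drops ``whale'' and a value of 100 leaves only ``the''.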
\subsubsection{Statistics}
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{img/stylo-gui_tab3.png}
\caption{\code{stylo()} GUI, the third tab: Statistics, Distances}
\end{figure}
This is the very last moment to emphasize one important thing: the
function \code{stylo()} provides a bunch of \emph{unsupervised}
methods used in stylometry, such as principal components analysis,
multidimensional scaling or cluster analysis. The results are represented
either on a scatterplot, or a tree-like diagram (dendrogram); the
last stage of the analysis involves a human interpretation of the
generated plots.\margin{analysis.type=} The results obtained using
these techniques ``speak for themselves'', which gives a practitioner
an opportunity to notice with the naked eye any peculiarities or unexpected
behavior in the analyzed corpus. Also, given a tree-like graphical
representation of similarities between particular samples, one can
easily interpret the results in terms of finding out which group of
texts a disputable sample belongs to. On the other hand, however, these
methods cannot be \emph{validated} in terms of an automatic verification
of a given method's reliability. Thus, if you feel you'd better use
one of the \emph{machine-learning} techniques, refer to the
function \code{classify()}, below.
\begin{itemize}
\item Cluster Analysis: Performs cluster analysis and produces a dendrogram,
or a graph showing hierarchical clustering of analyzed texts.
This option makes sense
if there is only a single iteration (or just a few).\margin{"CA"}
This is achieved by setting the MFW Minimum and Maximum to equal values,
and doing the same for Culling Minimum and Maximum.
\item MDS: Multidimensional Scaling.\margin{"MDS"}
This option makes sense if there is only a single iteration (or just
a few). This is achieved by setting the MFW Minimum and Maximum to
equal values, and doing the same for Culling Minimum and Maximum.
\item PCA (cov.): Principal Component Analysis using a covariance matrix.\margin{"PCV"}
This option makes sense if there is only a single iteration (or just
a few). This is achieved by setting the MFW Minimum and Maximum to
equal values, and doing the same for Culling Minimum and Maximum.
\item PCA (corr.): Principal Component Analysis using a correlation matrix
(and this is possibly the more reliable option of the two, at least
for English).\margin{"PCR"} This option
makes sense if there is only a single iteration (or just a few). This
is achieved by setting the MFW Minimum and Maximum to equal values,
and doing the same for Culling Minimum and Maximum.
\item Consensus Tree: this option will output a statistically justified
``compromise'' between the results of a number of virtual cluster analyses for
a variety of MFW and Culling parameter values.\margin{"BCT"}
\item Consensus strength: For Consensus Tree graphs, direct linkages between
two texts are made if the same link is made in a proportion of the
underlying virtual cluster analyses. The default setting of 0.5 means
that such a linkage is made if it appears in at least 50\% of the
cluster analyses.\margin{consensus.strength=}\margin{<number>}
Legal values are 0.4--1. This setting is immaterial for any other
Statistics settings.
\end{itemize}
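As a sketch of how the options above combine with the MFW and culling settings, a bootstrap consensus tree might be requested as follows. The call assumes that the working directory contains a \code{corpus} subfolder; the argument names follow the margin notes, while \code{gui = FALSE} and the specific values are assumptions made for the sake of the example.
\begin{verbatim}
library(stylo)

# consensus tree built from cluster analyses at 100, 200, ..., 1000 MFWs
stylo(gui = FALSE,
      analysis.type = "BCT",
      mfw.min = 100, mfw.max = 1000, mfw.incr = 100,
      culling.min = 0, culling.max = 0,
      consensus.strength = 0.5)
\end{verbatim}
For a single cluster analysis one would instead set \code{analysis.type = "CA"} and give \code{mfw.min} and \code{mfw.max} the same value.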
\subsubsection{Distances}
This is where the user can choose the statistical procedure used to
analyze the distances (i.e. the similarities and differences) between
the frequency patterns of individual texts in your corpus. Although
this choice is far from trivial, some of the following measures seem
to be more suitable for linguistic purposes than others. On theoretical
grounds, Euclidean Distance and Manhattan Distance should be avoided
in stylometry based on word frequencies (unless the frequencies are
normalized; see: Delta). Canberra Distance is quite troublesome but
effective e.g. for Latin; it is very sensitive to rare vocabulary,
and thus might be a good choice for inflected languages, with sparse
frequencies (it should be combined with careful culling settings and
a limited number of MFWs taken into analysis). For English, usually
Classic Delta is a good choice: mathematically speaking (Argamon,
2008), it is simply Manhattan distance applied to normalized
(\textit{z}-scored) word frequencies. A theoretical explanation of
the measures implemented in this function is pending. The available
distance measures are as follows:\margin{distance.measure=}
\begin{itemize}
\item Euclidean Distance: basic and the most ``natural''.\margin{"dist.euclidean"}
It is an obvious choice when your variables are similarly distributed.
However, since word distributions are not similar at any rate (e.g.
compare the huge difference between the frequencies of ``the'' and ``dactyloscopy''),
this distance measure is not appropriate for testing vectors of dozens
of most frequent words. Or, to be precise, it could be used to assess
less frequent (content) words. According to Zipf's law, these words
are distributed more or less similarly in a corpus since, by being
less common than function words, they appear in the flattened sections
of a Zipf curve.
\[
\delta_{(AB)}=\sqrt{\sum_{i=1}^{n}\left(A_{i}-B_{i}\right)^{2}}
\]
%\[ \delta_{(AB)} = \sqrt{ \sum_{i=1}^{n} %
% \left\vert f_{i}(A)^2 - f_{i}(B)^2 \right\vert} \]
where: \\
$n=$ the number of MFWs (most frequent words), \\
$A,B=$ texts being compared, \\
$A_{i}=$ the frequency of a given word $i$ in the text $A$, \\
$B_{i}=$ the frequency of a given word $i$ in the text $B$. \\
\item \noindent Manhattan Distance: obvious and well documented.\margin{"dist.manhattan"}
It shares the pros and cons of Euclidean Distance.
\[
\delta_{(AB)}=\sum_{i=1}^{n}\left\vert A_{i}-B_{i}\right\vert
\]
\item Classic Delta as introduced by Burrows (2002).\margin{"dist.delta"}
Since this measure relies on \textit{z}-scores -- i.e. normalized word frequencies
-- it is dependent on the number of texts analyzed and on a balance
between these texts: if a corpus contains, say, a large number of plays
by Lope de Vega and only one play by Calderón de la Barca, the final
results might be biased.
\[
\Delta_{(AB)}=\frac{1}{n}\sum_{i=1}^{n}\left\vert \frac{A_{i}-\mu_{i}}{\sigma_{i}}-\frac{B_{i}-\mu_{i}}{\sigma_{i}}\right\vert
\]
where: \\
$n=$ the number of MFWs (most frequent words or other features), \\
$A,B=$ texts being compared, \\
$A_{i}=$ the frequency of a given feature $i$ in the text $A$, \\
$B_{i}=$ the frequency of a given feature $i$ in the text $B$, \\
$\mu_{i}=$ mean frequency of a given feature in the corpus, \\
$\sigma_{i}=$ standard deviation of frequencies of a given feature.

\noindent Argamon (2008) showed that the above formula can be simplified
algebraically:
\[
\Delta_{(AB)}=\frac{1}{n}\sum_{i=1}^{n}\left\vert \frac{A_{i}-B_{i}}{\sigma_{i}}\right\vert
\]
\item Argamon's Linear Delta, or Euclidean distance applied to normalized
(\textit{z}-scored) word frequencies (Argamon, 2008).\margin{"dist.argamon"}
The distance is sensitive to the number of texts in a corpus.
\[
\Delta_{(AB)}=\frac{1}{n}\sqrt{\sum_{i=1}^{n}\left(\frac{A_{i}-B_{i}}{\sigma_{i}}\right)^{2}}
\]
\item Eder's Delta: it is a modification of the standard Burrows's distance;\margin{"dist.eder"}
it slightly increases the weights of frequent words and rescales less
frequent ones in order to suppress the discriminative strength of some
random infrequent words. The distance was meant to be used with highly
inflected languages. It is sensitive to the number of texts in a corpus.
\[
\Delta_{(AB)}=\frac{1}{n}\sum_{i=1}^{n}\left(\left\vert \frac{A_{i}-B_{i}}{\sigma_{i}}\right\vert \times\frac{n-n_{i}+1}{n}\right)
\]
where: \\
$n_{i}=$ the position of a given feature on a frequency list (i.e. its rank).
\item Eder's Simple: a type of normalization as simple as can be (independent
of the size of the corpus), intended to counteract the implications of
Zipf's law.\margin{"dist.simple"} The normalization
used in this distance is so obvious and so widespread in the exact
sciences that naming it ``Eder's Simple Distance'' is an abuse, so to
speak.
\[
\delta_{(AB)}=\sum_{i=1}^{n}\left\vert \sqrt{A_{i}}-\sqrt{B_{i}}\right\vert
\]
\item Canberra Distance: sometimes amazingly good.\margin{"dist.canberra"}
It is very sensitive to differences in rare vocabulary usage among
authors. On the other hand, this can be a disadvantage, since sensitivity
to minute differences in word occurrences also means significant sensitivity
to noise. Last but not least, Canberra Distance is very sensitive
to the number of words (features) analyzed.
\[
\delta_{(AB)}=\sum_{i=1}^{n}\frac{\left\vert A_{i}-B_{i}\right\vert }{\left\vert A_{i}\right\vert +\left\vert B_{i}\right\vert }
\]
\item Cosine Distance: a classical measure, introduced to this package
in version 6.3.\margin{"dist.cosine"}
\[
\delta_{(AB)}=1- \frac{A\cdot B}{\|A\|\|B\|}= 1- \frac{\sum\limits _{i=1}^{n}{A_{i}B_{i}}}{\sqrt{\sum\limits _{i=1}^{n}{A_{i}^2}}\cdot\sqrt{\sum\limits _{i=1}^{n}{B_{i}^2}}}
\]
\item It is also possible to use any custom distance measure. This option is
discussed below, in Section~\ref{custom_distances}.
\end{itemize}
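The relationship between some of the measures listed above can be checked numerically. The following base-R sketch uses an invented table of relative frequencies and relies only on \code{scale()} and \code{dist()}; it is meant as an illustration of the formulas, not as a description of how \code{stylo()} computes distances internally.
\begin{verbatim}
# invented relative frequencies: rows are texts, columns are features
freqs = matrix(c(6.0, 3.0, 1.0,
                 5.0, 3.5, 2.0,
                 4.0, 2.0, 3.0),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("A", "B", "C"),
                               c("the", "of", "and")))
n = ncol(freqs)

# z-scores: subtract the column mean, divide by the column's
# standard deviation
z = scale(freqs)

manhattan = dist(freqs, method = "manhattan")
euclidean = dist(freqs, method = "euclidean")

# Classic Delta: mean absolute difference of z-scores
classic.delta = dist(z, method = "manhattan") / n

# the simplified formula, computed by hand for texts A and B
sigma = apply(freqs, 2, sd)
delta.AB = sum(abs(freqs["A", ] - freqs["B", ]) / sigma) / n

# Cosine Distance between A and B
cosine.AB = 1 - sum(freqs["A", ] * freqs["B", ]) /
  (sqrt(sum(freqs["A", ]^2)) * sqrt(sum(freqs["B", ]^2)))

print(manhattan)
print(euclidean)
print(classic.delta)
print(delta.AB)
print(cosine.AB)
\end{verbatim}
In this toy example the Manhattan and Euclidean distances between A and B are 2.5 and 1.5 respectively, and the hand-computed \code{delta.AB} agrees with the A--B entry of \code{classic.delta}, which illustrates that the means cancel out in the simplified formula.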
\subsubsection{Sampling}
When the default setting of ``No sampling'' is
checked\margin{sampling=}\margin{"no.sampling"}, each of
the texts in its entirety is treated as a single sample. The second
option, that of ``Normal sampling''\margin{"normal.sampling"},
performs the analysis on equal-sized
consecutive sections of each text, and the size is determined by the
setting immediately below. Eder (2015) suggests that even better attribution
results can be achieved with ``Random sampling''\margin{"random.sampling"},
where samples are made up of words each randomly selected from anywhere
in the text (``bag of words'');\margin{sample.size=}\margin{<integer>} here,
too, the sample size must be set below.