74/model improvements (#81)
* #76 remove 'y' from consonant sequences feature

* #77 add all Mexican state abbreviations and their source in the docstring

* #73 implement Shannon entropy method and adapt the threshold calculation to also match values below the threshold

* #73 new model with Shannon entropy and notebook sets

* #73 fix Baja California abbreviation

* #73 fix keysmash sequence test to also consider special characters

* #74 fix private methods naming convention to double underscores

* #75 and #78 add KeySmash features: repeated bigram ratio and unique char ratio

* #74 fix tests and parser private methods

* #74 add model test

* #74 update initial sets models

---------

Co-authored-by: atarchetti <[email protected]>
apmt and atarchetti authored Feb 17, 2023
1 parent e0f39b5 commit d076b89
Showing 28 changed files with 1,659 additions and 96 deletions.
98 changes: 97 additions & 1 deletion data/dicts/mexico_abbreviations.csv
@@ -8,4 +8,100 @@ BLVD,BOULEVARD
LT,LOTE
MZ,MANZANA
CDMX,Ciudad de México
DF,Distrito Federal
DF,Distrito Federal
AGU,Aguascalientes
BCN,Baja California
BCS,Baja California Sur
CAM,Campeche
CHP,Chiapas
CHH,Chihuahua
COA,Coahuila
CL,Colima
DUR,Durango
MEX,Estado de México
GTO,Guanajuato
GRO,Guerrero
HGO,Hidalgo
JAL,Jalisco
MIC,Michoacán
MOR,Morelos
NAY,Nayarit
NLE,Nuevo León
OAX,Oaxaca
PUE,Puebla
QRO,Querétaro
QR,Quintana Roo
SLP,San Luis Potosí
SIN,Sinaloa
SON,Sonora
TAB,Tabasco
TAM,Tamaulipas
TLAX,Tlaxcala
VER,Veracruz
YUC,Yucatán
ZAC,Zacatecas
CDMX,Ciudad de México
ARS,Aguascalientes
AG,Aguascalientes
B.C,Baja California
BC,Baja California
B.C.S,Baja California Sur
BCS,Baja California Sur
Camp,Campeche
CM,Campeche
Chis,Chiapas
CS,Chiapas
Chih,Chihuahua
CH,Chihuahua
Coah,Coahuila
CO,Coahuila
Col,Colima
CL,Colima
CDMX,Ciudad de México
DF,Ciudad de México
Dgo,Durango
DG,Durango
Gto,Guanajuato
GT,Guanajuato
Gro,Guerrero
GR,Guerrero
Hgo,Hidalgo
HG,Hidalgo
Jal,Jalisco
JA,Jalisco
Edomex,Mexico
MEX,Mexico
Mich,Michoacán
MI,Michoacán
Mor,Morelos
MO,Morelos
Nay,Nayarit
NA,Nayarit
N.L,Nuevo León
NL,Nuevo León
Oax,Oaxaca
OA,Oaxaca
Pue,Puebla
PU,Puebla
Qro,Querétaro
QT,Querétaro
Q.R,Quintana Roo
QR,Quintana Roo
S.L.P,San Luis Potosí
SL,San Luis Potosí
Sin,Sinaloa
SI,Sinaloa
Son,Sonora
SO,Sonora
Tab,Tabasco
TB,Tabasco
Tamps,Tamaulipas
TM,Tamaulipas
Tlax,Tlaxcala
TL,Tlaxcala
Ver,Veracruz
VE,Veracruz
Yuc,Yucatán
YU,Yucatán
Zac,Zacatecas
ZA,Zacatecas
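
Note on the rows above: some abbreviations appear more than once, and a few map to different expansions (`DF` to both `Distrito Federal` and `Ciudad de México`, `MEX` to both `Estado de México` and `Mexico`). How the parser resolves these is not shown in this diff. The sketch below is one possible way to surface such conflicts. The helper name `find_conflicts`, the upper-casing of keys, and the inline sample are assumptions for illustration, not part of the repository.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the diff above; the real file is
# data/dicts/mexico_abbreviations.csv.
SAMPLE = """\
DF,Distrito Federal
DF,Ciudad de México
MEX,Estado de México
MEX,Mexico
BCS,Baja California Sur
BCS,Baja California Sur
Q.R,Quintana Roo
"""


def find_conflicts(rows):
    """Return abbreviations that map to more than one distinct expansion.

    Hypothetical helper: keys are upper-cased before comparison, which is
    an assumption about how lookups are normalised.
    """
    expansions = defaultdict(set)
    for row in rows:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        abbreviation, expansion = row
        expansions[abbreviation.strip().upper()].add(expansion.strip())
    return {key: sorted(values) for key, values in expansions.items() if len(values) > 1}


conflicts = find_conflicts(csv.reader(io.StringIO(SAMPLE)))
for key, values in conflicts.items():
    print(key, "->", values)
```

Exact duplicates such as the two `BCS` rows are harmless in a dictionary lookup. Conflicting rows are different: whichever one is read last would silently win.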
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,2 @@
feature_ks_count_sequence_squared_vowels,feature_ks_count_sequence_squared_consonants,feature_ks_count_sequence_squared_special_characters,feature_ks_average_of_char_count_squared,feature_ks_shannon_entropy,feature_ks_repeated_bigram_ratio,feature_ks_unique_char_ratio
15.03125,30.0,30.0,30.0,4.735393717824877,1.9491525423728815,2.0
@@ -0,0 +1,2 @@
feature_ks_count_sequence_squared_vowels,feature_ks_count_sequence_squared_consonants,feature_ks_count_sequence_squared_special_characters,feature_ks_average_of_char_count_squared,feature_ks_shannon_entropy,feature_ks_repeated_bigram_ratio,feature_ks_unique_char_ratio
15.03125,30.0,30.0,30.0,4.680689288944333,1.9491525423728815,2.0
@@ -0,0 +1,2 @@
feature_ks_count_sequence_squared_vowels,feature_ks_count_sequence_squared_consonants,feature_ks_count_sequence_squared_special_characters,feature_ks_average_of_char_count_squared,feature_ks_shannon_entropy,feature_ks_repeated_bigram_ratio
15.03125,30.0,30.0,30.0,4.680689288944333,1.9491525423728815
@@ -0,0 +1,2 @@
feature_ks_count_sequence_squared_vowels,feature_ks_count_sequence_squared_consonants,feature_ks_count_sequence_squared_special_characters,feature_ks_average_of_char_count_squared,feature_ks_shannon_entropy
15.03125,30.0,30.0,30.0,4.779780045430954
@@ -0,0 +1,2 @@
feature_ks_count_sequence_squared_vowels,feature_ks_count_sequence_squared_consonants,feature_ks_count_sequence_squared_special_characters,feature_ks_average_of_char_count_squared,feature_ks_repeated_bigram_ratio,feature_ks_unique_char_ratio
15.03125,30.0,30.0,30.0,1.9491525423728815,2.0
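
For orientation, the three features introduced in this commit (`feature_ks_shannon_entropy`, `feature_ks_repeated_bigram_ratio`, `feature_ks_unique_char_ratio`) can be sketched roughly as follows. The definitions used in the repository are not visible in this diff, so the functions below are textbook-style assumptions with hypothetical names. The stored values above may use a different normalisation.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def repeated_bigram_ratio(text: str) -> float:
    """Share of bigram occurrences whose bigram appears more than once.

    Assumed definition; the repository may count repeats differently.
    """
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    if not bigrams:
        return 0.0
    counts = Counter(bigrams)
    repeated = sum(n for n in counts.values() if n > 1)
    return repeated / len(bigrams)


def unique_char_ratio(text: str) -> float:
    """Number of distinct characters divided by string length (assumed definition)."""
    if not text:
        return 0.0
    return len(set(text)) / len(text)


for sample in ("asdasdasd", "Avenida Reforma 222"):
    print(
        sample,
        round(shannon_entropy(sample), 3),
        round(repeated_bigram_ratio(sample), 3),
        round(unique_char_ratio(sample), 3),
    )
```

Under these definitions, a keysmash-like string such as `asdasdasd` tends to show a high repeated-bigram ratio and a low unique-character ratio.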