Better cleaning #8

Open
wants to merge 29 commits into
base: master
Changes from 7 commits
29 commits
c2490d1
more cleanup BP/CS/TSA/CIDEX/CEDEX
cquest Feb 14, 2018
253cb54
cleanup BP/CS/TSA/CIDEX N° + tests
cquest Feb 14, 2018
3f75ccf
do not break on queries like "12bis"
cquest Feb 14, 2018
6d59df3
print() removed
cquest Feb 14, 2018
4e5be21
cleanup phone/fax numbers
cquest Feb 14, 2018
454ca50
fold initials: F F I > F F I FFI, etc
cquest Feb 14, 2018
045c8f0
boite postale
cquest Feb 15, 2018
2261c92
WIP fold_initials "F F I" > "FFI"
cquest Feb 16, 2018
5f11ac4
fold_initials + tests
cquest Feb 16, 2018
330169a
pep8
cquest Feb 16, 2018
fd5d9ee
pep8
cquest Feb 16, 2018
3ccc267
pep8
cquest Feb 16, 2018
9b90da3
separate PR
cquest Feb 16, 2018
a3017f8
fold_initials is a yielder
cquest Feb 18, 2018
bb92a1d
fold_initials > glue_initials
cquest Feb 21, 2018
01cfa2a
glue usual words like 'MONT' 'VAL' 'LE' 'LA' 'L' in an additional token
cquest Feb 18, 2018
f6e2e0b
flod_words > glue_words
cquest Feb 21, 2018
221584a
glue_words test
cquest Feb 21, 2018
904ac5c
Merge branch 'glue_initials' into glue_words
cquest Oct 29, 2019
60acc23
more tests
cquest Oct 29, 2019
5ecc80f
test for missing leading 0 in postcode
cquest Oct 29, 2019
3b2faa7
handle abbreviations during extraction
cquest Oct 29, 2019
25c3461
remove bte/boite/case postale
cquest Oct 29, 2019
35709fe
handle "1er étage"
cquest Oct 29, 2019
aaf5f95
ss -> sous
cquest Oct 29, 2019
30e3f3b
handle StopIteration exception (empty tokens)
cquest Nov 8, 2020
2a5e3b2
avoid bad transform in TYPES_REGEX due to leading [
cquest Nov 8, 2020
1e1b9ca
added: glue_refs, to glue D 412 into D412
cquest Nov 22, 2020
3000457
glue_refs test
cquest Nov 22, 2020
14 changes: 10 additions & 4 deletions addok_france/utils.py
@@ -50,14 +50,17 @@


def clean_query(q):
    q = re.sub(r'([\d]{5})', r' \1 ', q, flags=re.IGNORECASE)
    q = re.sub(r'(^| )(boite postale|b\.?p\.?|cs|tsa|cidex) *(n(o|°|) *|)[\d]+ *', r'\1', q, flags=re.IGNORECASE)
Member

Regex is quite slow; I'm not sure we want to go that far in cleaning.
Have you looked at the performance? :)

Member Author

Regexps are not that slow (I measured 3 µs for this one), and this one is called just once to clean the query.
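
The 3 µs figure above could be checked with something like the following. This is only a sketch: the pattern is copied from the added line in the diff, but `strip_delivery_codes` and `seconds_per_call` are hypothetical helper names, the pattern is precompiled here (the PR passes the string to `re.sub` on every call), and the result will vary with the machine and the sample query.

```python
import re
import timeit

# Pattern copied from the added line in the diff. Assumption: it is
# precompiled here, whereas the PR calls re.sub with the pattern string.
BP_PATTERN = re.compile(
    r'(^| )(boite postale|b\.?p\.?|cs|tsa|cidex) *(n(o|°|) *|)[\d]+ *',
    flags=re.IGNORECASE,
)


def strip_delivery_codes(q):
    """Hypothetical helper: apply only the BP/CS/TSA/CIDEX substitution."""
    return BP_PATTERN.sub(r'\1', q)


def seconds_per_call(q, number=100_000):
    """Average wall-clock seconds for one substitution on the query q."""
    total = timeit.timeit(lambda: strip_delivery_codes(q), number=number)
    return total / number


if __name__ == "__main__":
    sample = "2 allée Jules Guesde BP 7015 31068 TOULOUSE"
    print(strip_delivery_codes(sample))
    print("{:.2f} µs per call".format(seconds_per_call(sample) * 1e6))
```

Timing the whole of `clean_query` the same way would give the per-query cost that the first comment is asking about.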

    q = re.sub(r'([\d]{2})[\d]{3}(.*)c(e|é)dex ?[\d]*', r'\1\2', q, flags=re.IGNORECASE)
    q = re.sub(r'([^\d ])([\d]{5})([^\d]|$)', r'\1 \2 ', q, flags=re.IGNORECASE)
    q = re.sub('c(e|é)dex ?[\d]*', '', q, flags=re.IGNORECASE)
    q = re.sub(r'\b(bp|cs|tsa|cidex) *[\d]*', '', q, flags=re.IGNORECASE)
    q = re.sub('\d{,2}(e|[eè]me) ([eé]tage)', '', q, flags=re.IGNORECASE)
    q = re.sub(r'((fax|t[eé]l|t[eé]l[eé]copieur)[ :,\.]*|)(\d{10}|[0-9][0-9][ -\./]\d\d[-\./ ]\d\d[-\./ ]\d\d[-\./ ]\d\d)', '', q, flags=re.IGNORECASE)
    q = re.sub(' {2,}', ' ', q, flags=re.IGNORECASE)
    q = re.sub('[ -]s/[ -]', ' sur ', q, flags=re.IGNORECASE)
    q = re.sub('[ -]s/s[ -]', ' sous ', q, flags=re.IGNORECASE)
    q = re.sub('^lieux?[ -]?dits?\\b(?=.)', '', q, flags=re.IGNORECASE)
    q = re.sub(r'(^| )(([A-Z]) ([A-Z]) (([A-Z]) )?(([A-Z]) )?(([A-Z])( |$))?)', r'\1\2\3\4\6\8\10 ', q, flags=re.IGNORECASE)
    q = q.strip()
    return q
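
The last substitution in `clean_query` is the hardest one to read, so here it is in isolation. This is a minimal sketch: the pattern and replacement are copied from the diff, while `append_glued_initials` is a hypothetical name and none of the other cleaning steps are applied. It keeps the spaced initials and appends a glued copy after them.

```python
import re

# Pattern copied from the diff. Groups 3, 4, 6, 8 and 10 capture up to five
# single letters; groups 6, 8 and 10 are optional.
INITIALS_PATTERN = re.compile(
    r'(^| )(([A-Z]) ([A-Z]) (([A-Z]) )?(([A-Z]) )?(([A-Z])( |$))?)',
    flags=re.IGNORECASE,
)


def append_glued_initials(q):
    """Hypothetical helper: keep 'F F I' and append the glued form 'FFI'.

    Relies on Python 3.5+, where a reference to a group that did not
    participate in the match is replaced by an empty string.
    """
    return INITIALS_PATTERN.sub(r'\1\2\3\4\6\8\10 ', q).strip()


if __name__ == "__main__":
    print(append_glued_initials("10 BLD DES F F I 85300 CHALLANS"))
    # -> 10 BLD DES F F I FFI 85300 CHALLANS
```

Because of `re.IGNORECASE`, `[A-Z]` also matches lowercase letters, so any run of space-separated single letters in a query is treated as initials.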

@@ -116,14 +119,17 @@ def flag_housenumber(tokens):

def fold_ordinal(s):
    """3bis => 3b."""
    if s[0].isdigit() and not s.isdigit():
    if s is not None and s !='' and s[0].isdigit() and not s.isdigit():
Member

I feel like this needs to be fixed properly beforehand. I'll have a look.

Member

I can't reproduce the issue from the shell, the pyshell, or the HTTP API.

Can you be a bit more specific about how you hit the issue here? A simple way to reproduce it from the shell or pyshell would help :)

Member

This needs either a reproducible test case (so we can understand) or removal :)
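
For reference, a reproducible case could take roughly the following shape. This is a sketch under loud assumptions: `FOLD_PATTERN` and `FOLD` below are hypothetical stand-ins (the real ones are not shown in this diff), plain `str` replaces addok's token type, and nothing in this thread establishes that addok ever passes an empty token to `fold_ordinal`. It only shows what the unguarded condition does if that happens.

```python
import re

# Hypothetical stand-ins: the real FOLD_PATTERN and FOLD are defined
# elsewhere in addok_france/utils.py and are not part of this diff.
FOLD_PATTERN = re.compile(r'^(\d+)\s*(bis|ter|quater)$', re.IGNORECASE)
FOLD = {'bis': 'b', 'ter': 't', 'quater': 'q'}


def fold_ordinal_unguarded(s):
    """Same first condition as the original line, on a plain str."""
    if s[0].isdigit() and not s.isdigit():
        try:
            number, ordinal = FOLD_PATTERN.findall(s)[0]
        except (IndexError, ValueError):
            pass
        else:
            s = '{}{}'.format(number, FOLD.get(ordinal.lower(), ordinal))
    return s


def fold_ordinal_guarded(s):
    """A truthiness check covers both None and the empty string."""
    if not s:
        return s
    return fold_ordinal_unguarded(s)


if __name__ == "__main__":
    print(fold_ordinal_guarded('3bis'))
    print(repr(fold_ordinal_guarded('')))
    try:
        fold_ordinal_unguarded('')
    except IndexError as exc:
        print('unguarded version fails on an empty string:', exc)
```

If an empty token can be shown to reach this function, a single `if not s` check would also make the bare `except:` added further down unnecessary for that case.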

        try:
            number, ordinal = FOLD_PATTERN.findall(s)[0]
        except (IndexError, ValueError):
            pass
        else:
            s = s.update('{}{}'.format(number,
            try:
                s = s.update('{}{}'.format(number,
                                           FOLD.get(ordinal.lower(), ordinal)))
            except:
                pass
    return s


32 changes: 27 additions & 5 deletions tests/test_utils.py
@@ -13,21 +13,25 @@

@pytest.mark.parametrize("input,expected", [
    ("2 allée Jules Guesde 31068 TOULOUSE CEDEX 7",
     "2 allée Jules Guesde 31068 TOULOUSE"),
     "2 allée Jules Guesde 31 TOULOUSE"),
    ("7, avenue Léon-Blum 31507 Toulouse Cedex 5",
     "7, avenue Léon-Blum 31507 Toulouse"),
     "7, avenue Léon-Blum 31 Toulouse"),
    ("159, avenue Jacques-Douzans 31604 Muret Cedex",
     "159, avenue Jacques-Douzans 31604 Muret"),
     "159, avenue Jacques-Douzans 31 Muret"),
    ("2 allée Jules Guesde BP 7015 31068 TOULOUSE",
     "2 allée Jules Guesde 31068 TOULOUSE"),
    ("2 allée Jules Guesde B.P. 7015 31068 TOULOUSE",
     "2 allée Jules Guesde 31068 TOULOUSE"),
    ("2 allée Jules Guesde B.P. N 7015 31068 TOULOUSE",
     "2 allée Jules Guesde 31068 TOULOUSE"),
    ("BP 80111 159, avenue Jacques-Douzans 31604 Muret",
     "159, avenue Jacques-Douzans 31604 Muret"),
    ("12, place de l'Hôtel-de-Ville BP 46 02150 Sissonne",
     "12, place de l'Hôtel-de-Ville 02150 Sissonne"),
    ("6, rue Winston-Churchill CS 40055 60321 Compiègne",
     "6, rue Winston-Churchill 60321 Compiègne"),
    ("BP 80111 159, avenue Jacques-Douzans 31604 Muret Cedex",
     "159, avenue Jacques-Douzans 31604 Muret"),
     "159, avenue Jacques-Douzans 31 Muret"),
    ("BP 20169 Cite administrative - 8e étage Rue Gustave-Delory 59017 Lille",
     "Cite administrative - Rue Gustave-Delory 59017 Lille"),
    ("12e étage Rue Gustave-Delory 59017 Lille",
@@ -52,9 +56,27 @@
    ("32bis Rue des Vosges93290",
     "32bis Rue des Vosges 93290"),
    ("20 avenue de Ségur TSA 30719 75334 Paris Cedex 07",
     "20 avenue de Ségur 75334 Paris"),
     "20 avenue de Ségur 75 Paris"),
    ("20 avenue de Ségur TSA No30719 75334 Paris Cedex 07",
     "20 avenue de Ségur 75 Paris"),
    ("20 avenue de Ségur TSA N 30719 75334 Paris Cedex 07",
     "20 avenue de Ségur 75 Paris"),
    ("20 rue saint germain CIDEX 304 89110 Poilly-sur-tholon",
     "20 rue saint germain 89110 Poilly-sur-tholon"),
    ("20 rue saint germain CIDEX N°304 89110 Poilly-sur-tholon",
     "20 rue saint germain 89110 Poilly-sur-tholon"),
    ("20 rue saint germain 89110 Poilly-sur-tholon 01.23.45.67.89",
     "20 rue saint germain 89110 Poilly-sur-tholon"),
    ("32bis Rue des Vosges93290 fax: 0123456789",
     "32bis Rue des Vosges 93290"),
    ("32bis Rue des Vosges 93290 tel 01 23 45 67 89",
     "32bis Rue des Vosges 93290"),
    ("32bis Rue des Vosges 93290 telecopieur. 01/23/45/67/89",
     "32bis Rue des Vosges 93290"),
    ("32bis Rue des Vosges 93290 télécopieur, 01-23-45-67-89",
     "32bis Rue des Vosges 93290"),
    ("10 BLD DES F F I 85300 CHALLANS",
     "10 BLD DES F F I FFI 85300 CHALLANS"),
])
def test_clean_query(input, expected):
    assert clean_query(input) == expected
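
Several expectations above change from a full postcode to a two-digit prefix when the address carries a CEDEX suffix (for example `31068` becomes `31`). A minimal sketch of the substitution that appears to produce this, with the pattern copied from the `utils.py` diff and applied on its own. `fold_cedex_postcode` is a hypothetical name, and the other `clean_query` steps (which run before it in the real function) are left out.

```python
import re

# Pattern copied from the utils.py diff: keep the first two digits of a
# five-digit postcode when "cedex" follows later in the query, and drop
# the cedex marker together with its optional trailing number.
CEDEX_PATTERN = re.compile(
    r'([\d]{2})[\d]{3}(.*)c(e|é)dex ?[\d]*',
    flags=re.IGNORECASE,
)


def fold_cedex_postcode(q):
    """Hypothetical helper: apply only the CEDEX substitution, then strip."""
    return CEDEX_PATTERN.sub(r'\1\2', q).strip()


if __name__ == "__main__":
    print(fold_cedex_postcode("2 allée Jules Guesde 31068 TOULOUSE CEDEX 7"))
    # -> 2 allée Jules Guesde 31 TOULOUSE
    print(fold_cedex_postcode("2 allée Jules Guesde 31068 TOULOUSE"))
    # -> 2 allée Jules Guesde 31068 TOULOUSE (no cedex, so unchanged)
```

Applied in isolation, the pattern matches the first five-digit run in the query, so an input that still contains a TSA or BP number would be folded at the wrong place. In `clean_query` the BP/CS/TSA/CIDEX substitution runs first.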