Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor url code to better align with SEC API #36

Closed

Conversation

jordan-gillard
Copy link
Collaborator

This PR updates the examples in the repository's README.md file to wrap exact phrase searches in single quotes. This provides a cross-platform way to search for exact phrases on both Unix and Windows-based machines.

Resolves: #24

@GalenReich
Copy link
Collaborator

Unfortunately this didn't work for me, the tool still behaves as if there was no quoting:

edgar-tool text_search '"Tsunami Hazards"'

makes requests to: https://efts.sec.gov/LATEST/search-index?q=Tsunami+Hazards (note the lack of url encoded quotes)

@jordan-gillard
Copy link
Collaborator Author

jordan-gillard commented Jul 9, 2024

@GalenReich Ahh my bad. I just ran through it with a debugger and figured it out in 549273e

In order to make the debugger easy to use I had to rename main.py to __main__.py and run the directory as a module. I kept those changes in this commit because I think it makes the code easier to work with locally, and has no effect for users running it as a script.

Those changes will fix the issue for now. Users will need to wrap their already quoted phrase in another single quote, like in your example (edgar-tool text_search '"Tsunami Hazards"'). But, it would be friendlier if we could allow users to just use quotes once, like this:

edgar-tool text_search "John Doe" Pharmaceuticals Chemicals

We can't because of an issue related to how fire interprets the command line arguments, and then how those arguments are parsed before the call to urlencode.

For example, if I call the CLI without using the single quote then double quote pattern, '"..."'

edgar-tool text_search "John Doe" Pharmaceuticals Chemicals

Then keywords equals ('John Doe', 'Pharmaceuticals', 'Chemicals') in SecEdgarScraperCli.text_search

EDGAR/edgar_tool/cli.py

Lines 73 to 77 in 20c5d58

@staticmethod
def text_search(
*keywords: str,
output: str = f"edgar_search_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv",
entity_id: Optional[str] = None,

Which is then joined into the single string 'John Doe Pharmaceuticals Chemicals' in EdgarTextSearcher._generate_request_args here:

# Join search keywords into a single string
keywords = " ".join(keywords)

We can fix this so that the code understands that a string with a space character, like 'John Doe' but not 'Pharmaceuticals', should be parsed as an exact phrase. I can do this in a follow-up PR.

@GalenReich GalenReich self-requested a review July 10, 2024 08:14
@GalenReich
Copy link
Collaborator

The desired behaviour is that a search for "Tsunami Warning" will give requests where q=%22Tsunami+Warning%22

At the moment there is some double escaping going on and (in Powershell) the space is doubly escaped, but the quotes are missed: q=Tsunami%2520Warning

You've given me some ideas about solving this at the join - I'll push a commit or two in a sec

@GalenReich
Copy link
Collaborator

I think that solves it - the machinery was already mostly working because of the urlencode here:

request_args = urllib.parse.urlencode(request_args)

Those changes implement your suggestion of wrapping space-containing keywords in quotes, which also solves the escaping problem. Thanks for pointing out how fire parsed the keywords - super helpful!

Let me know what you think!

@jordan-gillard
Copy link
Collaborator Author

@GalenReich haha man, it's your repo! If you think the PR is good to go then merge away 😎

I'll take a closer look at your commits tomorrow - but I trust ya! Im embarrassed that I didn't actually solve the issue and encoded the URL twice. I code like sh*t without TDD. It's my kryptonite.

@jordan-gillard jordan-gillard marked this pull request as draft July 14, 2024 00:49
@jordan-gillard
Copy link
Collaborator Author

jordan-gillard commented Jul 14, 2024

Converted to a draft while I flesh out the URL generation code. Will wrap this up in the coming days.

@jordan-gillard jordan-gillard force-pushed the clarify-escape-characters branch from 4f412fb to 164bbb2 Compare July 14, 2024 00:51
@jordan-gillard jordan-gillard force-pushed the clarify-escape-characters branch from 4063931 to e30880b Compare July 28, 2024 18:04
@jordan-gillard jordan-gillard force-pushed the clarify-escape-characters branch from a25da2e to 4d4c1ba Compare August 18, 2024 21:05
@jordan-gillard jordan-gillard changed the title Update README examples w/ cross-platform escape characters Refactor url code to better align with SEC API Aug 25, 2024
@GalenReich
Copy link
Collaborator

Hey @jordan-gillard I finally managed to close pr/40 - once again really sorry about the delay! Can this be closed as a duplicate now?

@jordan-gillard
Copy link
Collaborator Author

@GalenReich I'll give everything a deeper look later and get back to ya!

@jordan-gillard
Copy link
Collaborator Author

Closing. I'll get caught up with recent work and plan the next contribution :)

@jordan-gillard jordan-gillard deleted the clarify-escape-characters branch December 28, 2024 17:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Escape characters not well documented
2 participants