llama_hub/sec_filings/README.md
Outdated
@@ -13,78 +13,27 @@ python install -r requirements.txt
The SEC Downloader expects 5 attributes
Is it still 5 attributes?
No, it now requires 4 attributes instead of 5. The previous implementation is broken, whereas the current one pulls directly from the official page and is therefore more reliable. Users can now pull all the files for a given year; the earlier amount parameter was ambiguous.
I have made those changes in the README file.
@@ -13,78 +13,27 @@ python install -r requirements.txt
The SEC Downloader expects 5 attributes

* tickers: It is a list of valid tickers
* amount: Number of documents that you want to download
Can we keep the deleted attributes as deprecated, for backwards compat, and just not show them here?
As mentioned above, the previous implementation was broken, and the amount parameter is ambiguous. In my conversations with users, they wanted to pull the documents for a given year or a list of years, not a number of filings. Hence, the year parameter serves them better.
Ok, sounds good. In general we are trying to minimize the number of breaking changes; it's not good to switch user-facing params around because that breaks existing implementations.
If the previous implementation doesn't work at all then sure we can remove (and log a warning to the user that it no longer works). If it still does then I vote we leave in the parameter for backwards compat
Yes, understood.
In the latest commit, I have added a deprecation warning for the amount parameter. Please let me know if I need to make other changes.
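A deprecation path of the kind discussed in this thread could look roughly like the sketch below. The class name, the `year` type, and the warning text are assumptions for illustration and are not taken from the PR; only the idea of accepting `amount` and warning about it comes from the conversation.

```python
import warnings
from typing import List, Optional


class SECFilingsLoaderSketch:
    """Hypothetical stand-in for the loader; only the deprecation handling is shown."""

    def __init__(
        self,
        tickers: List[str],
        year: int,
        amount: Optional[int] = None,  # deprecated, kept so old call sites still run
    ) -> None:
        if amount is not None:
            # Warn instead of failing, so existing code keeps working.
            warnings.warn(
                "The 'amount' parameter is deprecated and ignored; "
                "filings are now selected by year.",
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )
        self.tickers = tickers
        self.year = year
```

Old call sites that still pass `amount` keep running and see a `DeprecationWarning`, while new call sites can simply omit the argument.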
from llama_index.readers.base import BaseReader
from llama_hub.sec_filings.secData import sec_main
Make sure to add this file to extra_files in library.json (see the GitHub repo loader).
The SEC filings loader already exists in library.json. I added it when I first committed the loader. Do I need to modify it again?
yeah see some other files that have the extra_files parameter
Yes, I have added this
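For readers unfamiliar with the request, the sketch below builds what such a library.json entry might look like and prints it as JSON. The key names are assumed to follow the pattern of other loaders' entries, and the loader class name, `id`, and author value are placeholders; only `extra_files` and `secData.py` are named in this thread.

```python
import json

# Assumed layout of one library.json entry. Every value is a placeholder
# except "secData.py", which is the extra module imported by the loader.
entry = {
    "SECFilingsLoader": {  # hypothetical loader class name
        "id": "sec_filings",
        "author": "<github-username>",
        "extra_files": ["secData.py"],  # helper modules fetched with the loader
    }
}

print(json.dumps(entry, indent=2))
```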
SEC_ARCHIVE_URL: Final[str] = "https://www.sec.gov/Archives/edgar/data"
SEC_SEARCH_URL: Final[str] = "http://www.sec.gov/cgi-bin/browse-edgar"
SEC_SUBMISSIONS_URL = "https://data.sec.gov/submissions"


def get_filing(
-    cik: Union[str, int], accession_number: Union[str, int], company: str, email: str
+    accession_number: Union[str, int], cik: Union[str, int], company: str, email: str
why did you switch the arg positions?
Text extraction from SEC documents is a demanding process, so I implemented multiprocessing to make it faster. In the secData.py file, I have implemented parallel processing using a partial function:
    get_filing_partial = partial(
        get_filing,
        cik=rgld_cik,
        company="Unstructured Technologies",
        email="[email protected]",
    )
    sec_extractor = SECExtractor(ticker=ticker)
For the partial function to work, the first argument needs to be the accession number (a unique identifier for each file). Hence, I switched the arguments. Is there a better way to do it?
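One possible answer to the question above, offered as a sketch rather than as what the PR does: `functools.partial` can also bind `cik` positionally, which leaves `accession_number` as the next free positional slot and keeps the original argument order. The `get_filing` body, the `fetch_all` helper, and the company and email values here are placeholders, and a thread pool stands in for the multiprocessing pool, since both call the mapped function with a single positional argument.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from typing import List, Union


def get_filing(
    cik: Union[str, int], accession_number: Union[str, int], company: str, email: str
) -> str:
    """Stand-in with the ORIGINAL argument order; the real function fetches a filing."""
    return f"{cik}/{accession_number}"


def fetch_all(cik: Union[str, int], accession_numbers: List[str]) -> List[str]:
    # Bind cik positionally, so the single argument supplied by map()
    # fills the next free positional slot, accession_number.
    get_filing_partial = partial(
        get_filing, cik, company="Example Co", email="user@example.com"
    )
    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(get_filing_partial, accession_numbers))
```

Binding `cik` by keyword instead, as in the snippet above, is what forces the reorder: the mapped positional argument would land in the first parameter and collide with the keyword, raising a `TypeError`.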
I see. I'm mostly trying to minimize the number of breaking changes, and it seems like there's no way to prevent this one.
Yes, understood.
It is not a user-facing function, so hopefully it will not break previous implementations.
this is fine - can merge as is
@jerryjliu
Description
The previous SEC Filings loader, which I developed, had some major bugs because the SEC website changed last year. In this modification, I have fixed the bugs, returned the text data in a document format compatible with llama index, and added extra metadata to the texts, such as the filing and reporting dates.
Fixes # (issue)
Type of Change
Please delete options that are not relevant.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration
Suggested Checklist:
make format; make lint to appease the lint gods