Future Road Map #154
Things we could focus on
This is just a list of major things we can do.
Where to start
I think we should focus first on fixing bugs and adding tests. The number of reported bugs is too large, some of them are really old, and the test coverage is too low. I think we should focus less on major changes until we have fixed most of the bugs. This includes adding new features, dropping Python 2 support, improving documentation, creating a stable API, etc. Since Python 2 is no longer supported as of January 2020, I think we should postpone the major changes until then. In 2019 we can do one or more releases that don't contain any breaking changes and use the old versioning system. These releases will be useful to everybody. With most of the bugs fixed, we can start making some major breaking changes from the beginning of 2020, e.g. drop Python 2 support, change the versioning system and deprecate/remove some of the old APIs. |
This week I will work on README.md and read-the-docs documentation. |
Since python 2 is dead, is there a way to merge this project with pdfminer and pdfminer3, to prevent duplicate work? |
A quote from @igormp:
|
Status update:
A lot of bugs were fixed. The CHANGELOG.md lists 15 fixes since 2019-10-20. For most of those bugs, tests were added. There are still 27 issues that are labeled as bugs (compared to 59 earlier).
We have a readthedocs now, but it does not contain a lot of examples. We should probably add more.
The pdf2txt.py and dumppdf.py are always stable. Two functions were added to the high-level API:
Since October 2019, 3 new features were added according to the CHANGELOG.md. There are 11 issues that are labelled as enhancement. Still some work to do here...
The command-line utilities are better documented and the high-level functions got better documentation, but there is still a lot of work to do on all the classes and functions.
Done!
We are not going to use semver (because PyPI would get untenably confused). There is also no progress on git lfs or on automatic releasing using Travis. On the bright side, the CHANGELOG.md is always up-to-date and code-style enforcement is used.
What to do next
I still think we should focus first on fixing bugs and adding tests. The number of reported bugs is too large, and some of them are really old. But I also noticed in the last months that some reported bugs are not actually bugs but rather questions. These issues, and also e.g. questions on Stack Overflow, indicate that it is difficult to use pdfminer(.six). So improving the documentation is key. I think we should focus less on adding new features until we have fixed most of the bugs. |
A small addition to what @pietermarsman already mentioned
I think documenting the code and improving read-the-docs is one of the most important things. We should start by explaining all the methods and classes used in the Tutorial demos. That will solve many problems people face by relating the errors and their understanding of the library. Being new to open source and this code base, I'll start contributing by helping with the issues and trying to improve the documentation. |
I would much prefer using github discussions instead of a chatroom. That way the discussions are part of the project. The wiki could also be used to capture the results of the discussion and the resulting roadmap. |
IIRC, the discussions feature wasn't a thing back then. Seeing how the gitter isn't that active, I guess that would be a good idea in order to properly organize any discussion into threads without needing to search through all of the chatroom history. |
I would like to update the current documentation of pdfminer, but whom should I tag for PR approval? It seems this repo has been dormant for months... If anybody is maintaining it, please mention them. |
Hi @vilabho, I'm the dormant maintainer with merge permissions. I've been meaning to do some work over the last months / year but haven't gotten to it. Help with proper documentation is very much appreciated. |
The chatroom doesn't actually seem to exist anymore! So searching its history is no longer an option :( But more on topic ... I have been submitting PRs to pdfplumber to properly support tagged PDFs that should really be features in Is there any possibility that bugfixes, optimizations, and documentation enhancements will be merged at any point soon, let alone new features? |
Long story short: we are looking for new maintainers.
@dhdaines I am sorry that I was not more active in the last years. Unfortunately, I cannot be as active as I was when I started as a maintainer of pdfminer.six. The current situation is much like when I took over from @goulu. For the future of pdfminer.six it would be very beneficial if we had a maintainer again with time and energy to guide this project. I'm tagging all potential candidates below. But feel free to respond here as well if you are not in the list. Right now there are 4 owners:
There are also 5 other members of the pdfminer.six organization: There are also some people that contributed more than once (all 3 commits or more):
(Have not thought of a procedure for picking a new maintainer yet). |
I'm sorry, but I don't use the project anymore nor do I have time to step in. Good luck finding a candidate though! |
Thank you for the quick reply... as the maintainer of a rather old project I totally understand! I think the underlying question is whether the project is still relevant enough and used enough to be maintained - I have to admit that I only actually use it via Since there are a variety of other options for high-level manipulation and text extraction (if you only want text...) from PDFs, I wonder if it would make sense to simply merge the two projects. |
FWIW, I use pdfminer every day; it's still the only PDF library I've encountered which attempts to account accurately for whitespace and actual page position of text elements in a consistent way. Sometimes the structured data you need is buried in a PDF, and you don't have an alternative source... I'd love to keep it going, but I don't know if I have the skills to maintain it. A long time ago (15+ years?) I worked a little with @euske on improving PDFMiner, so I'm quite fond of it. I'd put up my hand if nobody else would. |
Hi, happy to say it's still incredibly relevant. pdfminer.six is incredibly useful for PDF parsing, with a permissive license that most other libraries don't have. We still use it daily. |
Yes, exactly - from what I've seen most PDF libraries work hard to hide the hideous, horrible complexity of the PDF format from you, which is fine if you just want to dump a load of text into a large language model, not so great if you want to use layout information. This, plus pure-Python and permissive license, make pdfminer (and by extension pdfplumber) relevant in my opinion. I would be willing to help out with maintenance as well. I could also definitely contribute some improvements to documentation and performance. |
The CI is currently broken (see https://github.com/pdfminer/pdfminer.six/actions/runs/6793184760). Fixing it might be the first area of improvement, so that changes can be committed without breaking the current state of affairs. If you invite me, I might try some fixes that get the CI working again, e.g. simple things such as code formatting. |
Getting the CI working again should be something which can be ensured on a fork and then submitted as a PR. If there really is some ongoing activity on this repository, merging the CI fixes first from the maintainer side is still possible - no need to directly grant you write permissions. |
Looks like it's mainly a case of looking for obsolete Python versions on Ubuntu latest. I'll take a look right now on my fork. |
Well, it's a bit more than just Python versions, because there's an unversioned dependency on |
And now CI passes: https://github.com/pdfminer/pdfminer.six/actions/runs/6883805593?pr=921 I did this with the minimal amount of code changes, but there are things that will need to be fixed so we can actually use the latest Now the $921 question! Can someone merge this? @pietermarsman ? |
A secondary PR to also fix building with current pip/setuptools (in Python 3.12): #923 |
@WolfgangFahl I'm positive about new contributors, but hesitant to hand out permissions quickly. I see you did not contribute to issues or PRs, that's a great start for any contributor. From your profile it looks like you are an active coder, and we could definitely benefit from your knowledge when triaging issues and PRs. For PRs I prefer to have at least one review, and not commit directly to master. |
@pietermarsman thanks for looking into my offer again. It was an if sentence and the decision was not to invite me. I accepted that decision. |
Yes, I can build the setuptools integration, and also set up publishing with the new PyPI trusted publishers. It's now very easy to build Python packages with GitHub workflow automation. Alternatively, we could eliminate setuptools and adopt Python's newer standard packaging procedures using hatchling. From now on I will lend a hand and support this package, and make sure it sets a new standard for Python PDF users. |
Hi, I want to check if |
As you can see from having a look at the dependencies and the code, pdfminer.six implements its own PDF parser and API. No remote tools are required to run it, although it has some dependencies of course. |
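To make the "runs locally" point concrete, here is a minimal sketch (not taken from the project's documentation) of in-process extraction that also keeps the page position of each text element. `text_boxes` and `local_text_boxes` are hypothetical helper names chosen for illustration; the only library call used is `pdfminer.high_level.extract_pages`, and the helper assumes text-bearing layout elements expose `get_text()` and `bbox`.

```python
from typing import Iterable, Iterator, Tuple

BBox = Tuple[float, float, float, float]


def text_boxes(pages: Iterable[Iterable[object]]) -> Iterator[Tuple[int, BBox, str]]:
    """Yield (page number, bounding box, text) for every text-bearing element.

    Hypothetical helper: it is duck-typed so it can be exercised without
    pdfminer installed. With the package available, the more usual check is
    isinstance(element, pdfminer.layout.LTTextContainer).
    """
    for page_no, page in enumerate(pages, start=1):
        for element in page:
            get_text = getattr(element, "get_text", None)
            # Elements without get_text() or bbox (e.g. lines, rectangles,
            # figures) are skipped.
            if callable(get_text) and hasattr(element, "bbox"):
                yield page_no, tuple(element.bbox), get_text()


def local_text_boxes(path: str):
    """Parse a PDF file on the local machine (assumes pdfminer.six is installed)."""
    # Imported lazily so text_boxes() stays usable without the package.
    from pdfminer.high_level import extract_pages

    return list(text_boxes(extract_pages(path)))
```

Calling `local_text_boxes("example.pdf")` on a file of your own (the file name is a placeholder) reads and parses the PDF in the current Python process; each bounding box is `(x0, y0, x1, y1)` in page coordinates.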
Thank you @FriedrichFroebel for your quick response. I am thinking in terms of production: I do not want my PDFs or their data to go anywhere for processing, and I expect everything to happen within my environment. |
Hi everyone!
Since the inception of pdfminer.six, a lot of improvements have been made, and several issues have been fixed. I would first like to thank every contributor for having kept this project alive! We are all well aware of the difficulties of parsing PDF documents, and I am sure pdfminer.six has made it easier for developers to extract information from PDFs.
But, there are more issues cropping up, and a lot of PRs are pending as well. Documentation, too, is pending. With the increase in these incomplete tasks, it is time that we decide on how to take this project forward. There needs to be a road map created for the future development of this project. It is not necessary to have people completely dedicated to it, but we at least need to create some specific targets / goals, so that the project becomes more concrete and we can ensure its stability.
For starters, I have reached out to an audience through the dev.to platform so that more people can become aware of this wonderful project and start contributing.
I myself am new to open source, and I have been the admin of this project for some time. Sadly, I haven't been able to give it much time, but I am sure that if we can make a good plan for the future of this project, the quality of the project can be improved. I would LOVE to hear your thoughts on this!
Update - I have created a Gitter chatroom for having discussions regarding the project.