Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add doughnut http endpoint #230

Draft
wants to merge 46 commits into
base: main
Choose a base branch
from

Conversation

edknv
Copy link
Collaborator

@edknv edknv commented Nov 14, 2024

Description

Part of https://github.com/NVIDIA/nv-ingest-private/issues/52
Also closes https://github.com/NVIDIA/nv-ingest-private/issues/49.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@edknv edknv marked this pull request as ready for review November 19, 2024 19:01
@edknv edknv requested review from drobison00 and jdye64 November 19, 2024 21:06
@@ -62,7 +64,8 @@ def _update_metadata(row: pd.Series, cached_client: Any, deplot_client: Any, tra
# Only modify if content type is structured and subtype is 'chart' and chart_metadata exists
if ((content_metadata.get("type") != "structured") or
(content_metadata.get("subtype") != "chart") or
(chart_metadata is None)):
(chart_metadata is None) or
(chart_metadata.get("table_format") != TableFormatEnum.IMAGE)):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct. Even if the table_format is image, we can still extract the content in the chart extractor. Am I thinking about this wrong?

Copy link
Collaborator Author

@edknv edknv Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this so that the tables extracted from the Doughnut model don't go through the table/chart extraction stages. Doughnut tables will already have text (as LaTex), so they don't need to go through the table/chart extraction stages. YOLOX tables need text extraction in table/chart extraction stages, and they are tagged as IMAGE tables so they do get processed in these stages. I'm not sure if that made sense, but I needed a way to skip table/chart extraction for tables identified by Doughnut, and thought TableFormat could be useful here to distinguish between yolox (== TableFormatEnum.IMAGE) and doughnut (==TableFormatEnum.LATEX).

@@ -63,7 +67,8 @@ def _update_metadata(row: pd.Series, paddle_client: Any, paddle_version: Any, tr
# Only modify if content type is structured and subtype is 'table' and table_metadata exists
if ((content_metadata.get("type") != "structured") or
(content_metadata.get("subtype") != "table") or
(table_metadata is None)):
(table_metadata is None) or
(table_metadata.get("table_format") != TableFormatEnum.IMAGE)):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here.


def _call_image_inference_http_client(client, model_name: str, image_data: np.ndarray):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're getting a bit too specific in terms of data handling here. We probably want to standardize on a data format for each model and process it into that before we get to this function.

Small nitpick: please use base64_img(s) or base64_image(s), rather than base64_img and base64_images

Copy link
Collaborator Author

@edknv edknv Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cleaned it up a bit in c5b5d3e by basically grouping doughnut together with deplot and removing batching. Similarly to deplot, request-level batching is currently not supported in doughnut, and they have the same output format (for now). Note: The input/output format is not finalized and will likely change and actually be closer to the yolox output format so there will be more PRs to follow.

@edknv edknv requested a review from drobison00 November 20, 2024 03:48
@edknv
Copy link
Collaborator Author

edknv commented Nov 22, 2024

321052f adds support for preserving the text bounding boxes in the metadata (in the hierarchy field) and addresses issue https://github.com/NVIDIA/nv-ingest-private/issues/49. It also changes the way text blocks are concatenated to concatenate text blocks with \n\n as requested by research team.

@edknv edknv marked this pull request as draft December 6, 2024 02:45
edknv and others added 23 commits December 9, 2024 12:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants