Implement Text Extraction in PyMuPdf Fitz Layout Mode #86

MarcoWel · 2023-06-19T12:52:48Z

Thank you for this excellent MuPDF wrapper!

One feature that MuPDF does not implement natively is layout-preserving plain text extraction.

Xpdf's and Poppler's pdftotext offer a layout mode out of the box (the -layout option):
https://www.mankier.com/1/pdftotext
Other wrappers like PyMuPDF add their own implementation. Its fitz module extracts text in layout mode by default:
python -m fitz gettext input.pdf
https://pymupdf.readthedocs.io/en/latest/module.html#text-extraction

This is how the PyMuPDF fitz module does it:
https://github.com/pymupdf/PyMuPDF/blob/main/fitz/__main__.py#L577

When layout preservation is a must, there is currently no way around invoking pdftotext from the Go app or, even nastier, calling the fitz Python module from Go.

How hard would it be to add this to go-fitz as well?
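
For reference, the pdftotext workaround mentioned above could look roughly like the sketch below. It assumes the pdftotext binary from Poppler or Xpdf is installed and on PATH; the helper names pdftotextArgs and extractLayoutText are made up for illustration and are not part of go-fitz.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// pdftotextArgs builds the argument list for a layout-preserving run.
// "-layout" keeps the physical layout; "-" sends the text to stdout.
func pdftotextArgs(pdfPath string) []string {
	return []string{"-layout", pdfPath, "-"}
}

// extractLayoutText shells out to pdftotext and returns its stdout.
// Hypothetical helper: it fails if pdftotext is not installed.
func extractLayoutText(pdfPath string) (string, error) {
	out, err := exec.Command("pdftotext", pdftotextArgs(pdfPath)...).Output()
	if err != nil {
		return "", fmt.Errorf("pdftotext failed: %w", err)
	}
	return string(out), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: layouttext input.pdf")
		return
	}
	text, err := extractLayoutText(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}
```

This works, but it adds an external runtime dependency and a process spawn per document, which is exactly what a native go-fitz implementation would avoid.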


MarcoWel · 2023-06-19T22:54:54Z

I just had a closer look at how a layout-preserving func Text() could be implemented in Go.

A good starting point is checking the native C implementation for fz_new_buffer_from_stext_page:
https://github.com/ArtifexSoftware/mupdf/blob/master/source/fitz/util.c#L424

fz_buffer *
fz_new_buffer_from_stext_page(fz_context *ctx, fz_stext_page *page)
{
	fz_stext_block *block;
	fz_stext_line *line;
	fz_stext_char *ch;
	fz_buffer *buf;

	buf = fz_new_buffer(ctx, 256);
	fz_try(ctx)
	{
		for (block = page->first_block; block; block = block->next)
		{
			if (block->type == FZ_STEXT_BLOCK_TEXT)
			{
				for (line = block->u.t.first_line; line; line = line->next)
				{
					for (ch = line->first_char; ch; ch = ch->next)
						fz_append_rune(ctx, buf, ch->c);
					fz_append_byte(ctx, buf, '\n');
				}
				fz_append_byte(ctx, buf, '\n');
			}
		}
	}
	fz_catch(ctx)
	{
		fz_drop_buffer(ctx, buf);
		fz_rethrow(ctx);
	}

	return buf;
}

Now looking at the crucial structs:
https://github.com/ArtifexSoftware/mupdf/blob/master/include/mupdf/fitz/structured-text.h#L159

/**
	A text block is a list of lines of text (typically a paragraph),
	or an image.
*/
struct fz_stext_block
{
	int type;
	fz_rect bbox;
	union {
		struct { fz_stext_line *first_line, *last_line; } t;
		struct { fz_matrix transform; fz_image *image; } i;
	} u;
	fz_stext_block *prev, *next;
};

/**
	A text line is a list of characters that share a common baseline.
*/
struct fz_stext_line
{
	int wmode; /* 0 for horizontal, 1 for vertical */
	fz_point dir; /* normalized direction of baseline */
	fz_rect bbox;
	fz_stext_char *first_char, *last_char;
	fz_stext_line *prev, *next;
};

/**
	A text char is a unicode character, the style in which is
	appears, and the point at which it is positioned.
*/
struct fz_stext_char
{
	int c;
	int color; /* sRGB hex color */
	fz_point origin;
	fz_quad quad;
	float size;
	fz_font *font;
	fz_stext_char *next;
};

These are not present in the go-fitz library (yet). The auto-generated cgo structs don't do the trick, because the union u is exposed only as an opaque byte array:

type _Ctype_struct_fz_stext_block struct {
	_type	_Ctype_int
	bbox	_Ctype_struct___7
	_	[4]byte
	u	[32]byte
	prev	*_Ctype_struct_fz_stext_block
	next	*_Ctype_struct_fz_stext_block
}

type _Ctype_struct_fz_stext_line struct {
	wmode		_Ctype_int
	dir		_Ctype_struct___28
	bbox		_Ctype_struct___7
	first_char	*_Ctype_struct_fz_stext_char
	last_char	*_Ctype_struct_fz_stext_char
	prev		*_Ctype_struct_fz_stext_line
	next		*_Ctype_struct_fz_stext_line
}

type _Ctype_struct_fz_stext_char struct {
	c	_Ctype_int
	color	_Ctype_int
	origin	_Ctype_struct___28
	quad	_Ctype_struct___29
	size	_Ctype_float
	font	*_Ctype_struct_fz_font
	next	*_Ctype_struct_fz_stext_char
}

The first step would be to include proper definitions for those structs within go-fitz. Any help is appreciated!

MarcoWel · 2023-06-20T01:14:10Z

Okay, got the start right...

Structs:

type fzRect struct {
	X0, Y0 float32
	X1, Y1 float32
}

type fzPoint struct {
	X, Y float32
}

type fzQuad struct {
	Ul fzPoint
	Ur fzPoint
	Ll fzPoint
	Lr fzPoint
}

const (
	FZ_STEXT_BLOCK_TEXT  = 0
	FZ_STEXT_BLOCK_IMAGE = 1
)

type fzStextBlock struct {
	Type int32
	Bbox fzRect
	U    struct {
		T struct {
			FirstLine *fzStextLine
			LastLine  *fzStextLine
			_         [16]byte
		}
		// I struct {
		// 	Transform fzMatrix
		// 	Image     *fzImage
		// }
	}
	Prev *fzStextBlock
	Next *fzStextBlock
}

type fzStextLine struct {
	Wmode     int32
	Dir       fzPoint
	Bbox      fzRect
	FirstChar *fzStextChar
	LastChar  *fzStextChar
	Prev      *fzStextLine
	Next      *fzStextLine
}

type fzStextChar struct {
	C      int32
	Color  int32
	Origin fzPoint
	Quad   fzQuad
	Size   float32
	Font   unsafe.Pointer
	Next   *fzStextChar
}
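
Since these hand-written structs are cast directly onto C memory, their layout has to match the C side byte for byte. A minimal, cgo-free sketch for inspecting the Go-side layout is shown below; layoutReport is a hypothetical helper, and the struct definitions are copied from above. Inside the real cgo package the sizes could additionally be compared against C.sizeof_fz_stext_block and friends, and the whole approach assumes the MuPDF header version quoted in this thread.

```go
package main

import (
	"fmt"
	"unsafe"
)

type fzPoint struct{ X, Y float32 }
type fzRect struct{ X0, Y0, X1, Y1 float32 }
type fzQuad struct{ Ul, Ur, Ll, Lr fzPoint }

type fzStextChar struct {
	C      int32
	Color  int32
	Origin fzPoint
	Quad   fzQuad
	Size   float32
	Font   unsafe.Pointer
	Next   *fzStextChar
}

type fzStextLine struct {
	Wmode     int32
	Dir       fzPoint
	Bbox      fzRect
	FirstChar *fzStextChar
	LastChar  *fzStextChar
	Prev      *fzStextLine
	Next      *fzStextLine
}

type fzStextBlock struct {
	Type int32
	Bbox fzRect
	U    struct {
		T struct {
			FirstLine *fzStextLine
			LastLine  *fzStextLine
			_         [16]byte // pads T to the size of the larger image member
		}
	}
	Prev *fzStextBlock
	Next *fzStextBlock
}

// layoutReport collects the sizes and offsets that must agree with the C structs.
func layoutReport() map[string]uintptr {
	var b fzStextBlock
	var l fzStextLine
	var c fzStextChar
	return map[string]uintptr{
		"ptr.size":       unsafe.Sizeof(uintptr(0)),
		"block.size":     unsafe.Sizeof(b),
		"block.U.offset": unsafe.Offsetof(b.U),
		"block.U.size":   unsafe.Sizeof(b.U),
		"line.size":      unsafe.Sizeof(l),
		"char.size":      unsafe.Sizeof(c),
	}
}

func main() {
	r := layoutReport()
	for _, k := range []string{"ptr.size", "block.size", "block.U.offset", "block.U.size", "line.size", "char.size"} {
		fmt.Printf("%-15s %d\n", k, r[k])
	}
}
```

On a 64-bit platform the union should land at offset 24 with size 32, which lines up with the `_ [4]byte` padding followed by `u [32]byte` in the auto-generated struct.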

Now the call to fz_new_buffer_from_stext_page() in go-fitz's Text() can simply be replaced by a Go port of the original function:

func (f *Document) Text(pageNumber int) (string, error) {
	...

	// buf := C.fz_new_buffer_from_stext_page(f.ctx, text)
	// defer C.fz_drop_buffer(f.ctx, buf)
	// str := C.GoString(C.fz_string_from_buffer(f.ctx, buf))

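	// Go port of fz_new_buffer_from_stext_page. strings.Builder (import "strings")
	// avoids reallocating the whole string for every appended rune.
	var sb strings.Builder
	block := (*fzStextBlock)(unsafe.Pointer(text.first_block))
	for block != nil {
		if block.Type == FZ_STEXT_BLOCK_TEXT {
			for line := block.U.T.FirstLine; line != nil; line = line.Next {
				for ch := line.FirstChar; ch != nil; ch = ch.Next {
					sb.WriteRune(rune(ch.C))
				}
				sb.WriteByte('\n') // end of line
			}
			sb.WriteByte('\n') // blank line between blocks
		}
		block = block.Next
	}
	return sb.String(), nil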
}

We can go from here! :)
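
As a rough idea of the next step, the character origins can be used to place each rune on a fixed character grid. The sketch below is only an illustration of that idea and not PyMuPDF's actual algorithm: glyph is a simplified, hypothetical stand-in for fzStextChar, cellW and cellH are assumed cell sizes in page units, and y is assumed to grow downward.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// glyph is a hypothetical, simplified stand-in for fzStextChar:
// the rune plus the origin of its baseline.
type glyph struct {
	R    rune
	X, Y float64
}

// layoutText snaps every glyph to a (row, column) cell and pads with spaces,
// so that horizontal positions survive in the plain-text output.
func layoutText(glyphs []glyph, cellW, cellH float64) string {
	// Group glyphs by the grid row nearest to their baseline.
	rows := map[int][]glyph{}
	for _, g := range glyphs {
		row := int(g.Y/cellH + 0.5)
		rows[row] = append(rows[row], g)
	}
	keys := make([]int, 0, len(rows))
	for k := range rows {
		keys = append(keys, k)
	}
	sort.Ints(keys) // top to bottom, assuming y grows downward

	var sb strings.Builder
	for _, k := range keys {
		row := rows[k]
		sort.SliceStable(row, func(i, j int) bool { return row[i].X < row[j].X })
		col := 0
		for _, g := range row {
			// Pad up to the glyph's target column; never move backwards.
			for target := int(g.X/cellW + 0.5); col < target; col++ {
				sb.WriteByte(' ')
			}
			sb.WriteRune(g.R)
			col++
		}
		sb.WriteByte('\n')
	}
	return sb.String()
}

func main() {
	page := []glyph{
		{'A', 0, 10}, {'B', 6, 10}, {'T', 60, 10},
		{'C', 0, 22}, {'D', 60, 22},
	}
	// With 6x12 cells, 'T' and 'D' end up in the same column.
	fmt.Print(layoutText(page, 6, 12))
}
```

A real implementation would also have to derive the cell size from the fonts in use, handle overlapping glyphs and rotated or vertical lines (Wmode, Dir), and emit blank lines for vertical gaps.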

gen2brain · 2023-06-22T05:33:13Z

@MarcoWel If you or someone else manages to implement this, I am willing to merge it. I don't have the plans or the time to work on this myself.

MarcoWel · 2023-06-26T12:20:22Z

@gen2brain On it...

gen2brain added the enhancement label Sep 27, 2024
