[MDBF-341] Create downloadable pdf version of the documentation - Jira

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: N/A
Component/s: None
Labels: None

Description

There have been many requests for an offline, downloadable pdf version of the documentation. ~~MDEV-6881~~ was reported in 2014 and closed as "Won't fix" in 2019.

Requests include:
https://mariadb.com/kb/en/sql-statements-documentation/
https://mariadb.com/kb/en/dont-you-have-pdf-version-document/
https://mariadb.com/kb/en/pdf-or-search-area/
https://mariadb.com/kb/en/pdf-documentation-going-to-be-available-anytime-soon/

The intention is to run a Python script to generate the PDF from the current content on the KB.

The script will read a CSV file listing all the URLs to be included in the document. The CSV currently contains the following fields:

Url
Help category (the CSV was generated from the current server HELP contents, and the goal is to have one CSV from which both the pdf and the HELP contents are generated automatically)
Include:
- 0: exclude the page
- 1: include the contents
- 2: generate a header only (for categories)
- 3: include a link to content elsewhere in the pdf (many pages have multiple parents, but the content will only appear once)
- 4: could be used to bring in page contents for categories, without the sub-pages. Still to be implemented; option 1 may suffice
Header: the header number (the page title is automatically generated for Include 1). The intention is to remove this column in future and generate it automatically for all URLs, including the numbers
Depth: depth in the KB structure. Begins with depth 1, children are depth 2, etc. Currently maintained manually and unused; ultimately it will be generated automatically, and it is intended to replace the numbering in Header
Notes
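
As an illustration of how the fields above might be read, here is a minimal sketch using Python's standard csv module. The exact column header spellings, the enum member names, and the sample URLs are assumptions made for this example; they are not taken from the actual script or CSV.

```python
import csv
import io
from dataclasses import dataclass
from enum import IntEnum


class Include(IntEnum):
    """Include codes as described in the field list (member names are hypothetical)."""
    EXCLUDE = 0        # exclude the page
    CONTENT = 1        # include the page contents
    HEADER_ONLY = 2    # generate a header only (for categories)
    LINK = 3           # link to content that appears elsewhere in the pdf
    CATEGORY_BODY = 4  # category contents without sub-pages (not yet implemented)


@dataclass
class Row:
    url: str
    include: Include
    depth: int


def read_rows(handle):
    """Parse CSV rows, assuming header names Url, Include and Depth."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append(
            Row(
                url=record["Url"],
                include=Include(int(record["Include"])),
                depth=int(record["Depth"]),
            )
        )
    return rows


# Made-up sample data; these URLs are placeholders, not real KB pages.
SAMPLE = """Url,Help category,Include,Header,Depth,Notes
https://example.invalid/kb/category/,,2,1,1,
https://example.invalid/kb/category/page/,,1,,2,
https://example.invalid/kb/category/skipped/,,0,,2,
"""

rows = read_rows(io.StringIO(SAMPLE))
wanted = [row for row in rows if row.include is not Include.EXCLUDE]
```

Using an enum rather than bare integers means an unexpected code in the CSV raises a ValueError while the file is being read, instead of being silently ignored later.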

CSV maintenance is for now fairly manual. Over time it will become more automated.
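
One part of that automation, deriving header numbers from the Depth column, could look roughly like the sketch below. The dotted numbering style (1, 1.1, 1.2.1) and the function name are assumptions; the issue does not say what format the Header numbers use.

```python
def number_headers(depths):
    """Turn a sequence of depths (1 = top level) into dotted header numbers."""
    counters = []
    numbers = []
    for depth in depths:
        if depth < 1:
            raise ValueError("depth starts at 1")
        # Forget counters for deeper levels so their numbering restarts.
        del counters[depth:]
        # Pad with zeros if a level was skipped (e.g. depth 1 followed by depth 3).
        while len(counters) < depth:
            counters.append(0)
        counters[depth - 1] += 1
        numbers.append(".".join(str(n) for n in counters))
    return numbers


print(number_headers([1, 2, 2, 3, 1, 2]))
# ['1', '1.1', '1.2', '1.2.1', '2', '2.1']
```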

The Python script will:

loop through the csv
read each of the urls from the KB
save the raw html locally (the download step can then be disabled in the config, to avoid repeatedly downloading 1000s of pages)
process and merge the individual html files into one large html file
convert the html file into a pdf
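
The download-and-cache part of these steps can be sketched as follows. This is not the project's code: the function names, the hashed file naming scheme, and the download flag are hypothetical stand-ins for whatever the real script and its config use. The fetch function is passed in so that the example runs offline.

```python
import hashlib
import tempfile
from pathlib import Path
from urllib.request import urlopen


def fetch_from_kb(url):
    """Download one page over HTTP (not used by the offline demo below)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def cache_path(cache_dir, url):
    # Hypothetical naming scheme: hash the URL to get a filesystem-safe name.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.html"


def load_page(url, cache_dir, download=True, fetch=fetch_from_kb):
    """Return the raw html for url, refreshing the local copy if download is on."""
    path = cache_path(cache_dir, url)
    if download:
        cache_dir.mkdir(parents=True, exist_ok=True)
        path.write_text(fetch(url), encoding="utf-8")
    # Raises FileNotFoundError if downloading is off and the page was never cached.
    return path.read_text(encoding="utf-8")


# Offline demo: a fake fetcher records how often it is called.
calls = []


def fake_fetch(url):
    calls.append(url)
    return f"<html><body><h1>{url}</h1></body></html>"


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "raw"
    url = "https://example.invalid/kb/page"
    first = load_page(url, cache, download=True, fetch=fake_fetch)
    second = load_page(url, cache, download=False, fetch=fake_fetch)
```

With downloading switched off, the second call is served entirely from the saved file, which is the behaviour the config option is meant to provide.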

The pdf will include:

a header page, including the date of generation
a table of contents
an index (a longer-term goal)
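
A header page carrying the generation date could be produced as a small html fragment and placed in front of the merged content. The markup, class name, and title below are assumptions for illustration only.

```python
import datetime
import html


def build_cover_page(title, generated_on):
    """Return a small html fragment with the title and the generation date."""
    return (
        '<div class="cover">\n'
        f"  <h1>{html.escape(title)}</h1>\n"
        f"  <p>Generated on {generated_on.isoformat()}</p>\n"
        "</div>\n"
    )


cover = build_cover_page("MariaDB Documentation", datetime.date.today())
```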

The script makes use of Beautiful Soup for html processing, and wkhtmltopdf for converting to pdf, which seem to give the best results.
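
For the conversion step, a minimal sketch of calling wkhtmltopdf from Python is shown below. It uses only the basic "input file, output file" form of the command line; the function names and file names are hypothetical, and the real script may pass additional options.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(html_path, pdf_path, executable="wkhtmltopdf"):
    """Build the simplest wkhtmltopdf invocation: input html, then output pdf."""
    return [executable, str(html_path), str(pdf_path)]


def convert(html_path, pdf_path):
    """Run wkhtmltopdf, failing early if it is not installed."""
    if shutil.which("wkhtmltopdf") is None:
        raise RuntimeError("wkhtmltopdf was not found on PATH")
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(build_command(html_path, pdf_path), check=True)


command = build_command(Path("merged.html"), Path("kb.pdf"))
```

Keeping the command construction separate from the subprocess call makes it possible to inspect or log the command without wkhtmltopdf being installed.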

Output of the pdf should match the website html as closely as possible, with the following differences for now:

the contents box on individual pages at times overlapped the text, so it has been moved to the top of each page
the product macro was at times cut in two by a page break, so its formatting has been adjusted slightly

Known issues to date:

the table of contents is not currently generated
the html parsing appears to differ slightly between platforms. The website html is malformed in places, and the parsing breaks only on certain platforms
some multi-page code blocks also have their text cut at a pdf page break

Progress can be viewed at https://github.com/Icerath/mariadb-pdf/

Issue Links

relates to

MDBF-399 PDF documentation contents page numbers can be incorrect (Open)

MDBF-400 PDF documentation code blocks are cut on page breaks (Open)

MDBF-407 PDF from KB logging generates duplicates (Closed)

MDEV-28701 Update server HELP contents (Closed)

MDBF-485 Generate PDFs in languages other than English (Open)

Sub-Tasks

1.	Add page numbers to the table of contents		Closed	Dorje Gilfillan
2.	Create style for external links		Closed	Dorje Gilfillan

People

Assignee: Ian Gilfillan

Reporter: Ian Gilfillan

Votes: 1

Watchers: 6

Dates

Created: 2022-02-14 11:25

Updated: 2025-01-24 08:06

Resolved: 2022-05-02 09:43

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 33d

MariaDB Foundation Development