There have been many requests for an offline, downloadable PDF version of the documentation.
MDEV-6881 was reported in 2014 and closed as "Won't fix" in 2019.
The intention is to run a Python script to generate the PDF from the current content on the KB.
The script will read a CSV file containing a list of all the URLs to be included in the document. The CSV currently contains the following fields:
- Help category: the CSV was generated from the current server HELP contents, and the goal is to have one CSV from which both can be generated automatically
- Include: how the page is handled:
  - 0: exclude the page
  - 1: include the contents
  - 2: generate a header only (for categories)
  - 3: include a link to content elsewhere in the PDF (many pages have multiple parents, but the content will only appear once)
  - 4: could be used to bring in page contents for categories, without the sub-pages. Still to be implemented; 1 may suffice
- Header: the header number (the page title is generated automatically for include 1). The intention is to remove this column in future and generate headers, including the numbers, automatically for all URLs
- Depth: depth in the KB structure. Top-level pages are depth 1, their children are depth 2, and so on. Currently maintained manually and unused; ultimately it will be generated automatically, and it is intended to replace the numbering in Header
CSV maintenance is fairly manual for now. Over time it will become more automated.
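As an illustration of how Depth could replace the manual numbering in Header, here is a minimal sketch that turns a sequence of depths into dotted section numbers. The function name `number_headers` is hypothetical and is not taken from the actual script; it assumes pages are listed in document order and that no level is skipped.

```python
def number_headers(depths):
    """Turn a sequence of 1-based depths into dotted section numbers.

    Assumes the depths are listed in document order, as in the CSV.
    """
    counters = []  # counters[i] holds the current number at depth i + 1
    numbers = []
    for depth in depths:
        if depth < 1 or depth > len(counters) + 1:
            raise ValueError(f"depth {depth} is below 1 or skips a level")
        # Forget counters for deeper levels, then start or bump this level.
        del counters[depth:]
        if len(counters) < depth:
            counters.append(1)
        else:
            counters[depth - 1] += 1
        numbers.append(".".join(str(n) for n in counters))
    return numbers


if __name__ == "__main__":
    for number in number_headers([1, 2, 2, 1, 2, 3]):
        print(number)
```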
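A minimal sketch of how such a CSV could be read is shown below. The column names (`url`, `include`, `depth`), the `Include` enum, and the example.com URLs are assumptions made for illustration; the real file may use different headers and column order.

```python
import csv
import io
from enum import IntEnum


class Include(IntEnum):
    """Values of the include field, as described in the list above."""
    EXCLUDE = 0
    CONTENT = 1
    HEADER_ONLY = 2
    LINK = 3
    CATEGORY_CONTENT = 4  # described as still to be implemented


def read_pages(csv_file):
    """Return the rows that should appear in the PDF, in file order."""
    pages = []
    for row in csv.DictReader(csv_file):
        include = Include(int(row["include"]))
        if include is Include.EXCLUDE:
            continue  # 0 means the page is left out entirely
        pages.append(
            {"url": row["url"], "include": include, "depth": int(row["depth"])}
        )
    return pages


# Hypothetical sample data; the URLs are placeholders, not real KB pages.
SAMPLE = """\
url,include,depth
https://example.com/kb/sql-statements/,2,1
https://example.com/kb/select/,1,2
https://example.com/kb/old-page/,0,2
https://example.com/kb/select-alias/,3,2
"""

if __name__ == "__main__":
    for page in read_pages(io.StringIO(SAMPLE)):
        print(page["depth"], page["include"].name, page["url"])
```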
The Python script will:
- loop through the CSV
- read each of the URLs from the KB
- save the raw HTML locally (the download step can be disabled in the config, to avoid repeatedly downloading thousands of pages)
- process and merge the individual HTML files into one large HTML file
- convert the HTML file into a PDF
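The download-and-cache step above could look roughly like the sketch below. It is not the project's actual code: the function names, the hashed file naming scheme, and the `download_enabled` flag (standing in for the config option) are all assumptions. It uses only the standard library, and the fetcher is passed in so that the caching logic can be exercised without network access.

```python
import hashlib
import urllib.request
from pathlib import Path


def cache_path(cache_dir, url):
    """Map a URL to a stable local file name (hypothetical naming scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.html"


def download(url):
    """Fetch a page over HTTP and return its body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def get_html(url, cache_dir, download_enabled=True, fetch=download):
    """Return the raw HTML for a URL, refreshing the local copy if enabled.

    With download_enabled=False only the saved copy is read, so a missing
    file raises FileNotFoundError instead of silently hitting the network.
    """
    path = cache_path(cache_dir, url)
    if download_enabled:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(fetch(url), encoding="utf-8")
    return path.read_text(encoding="utf-8")


def load_all(urls, cache_dir, download_enabled=True, fetch=download):
    """Loop over the URLs in order and collect the raw HTML of each page."""
    return [get_html(url, cache_dir, download_enabled, fetch) for url in urls]
```

Run once with `download_enabled=True` to populate the local copies; later runs can pass `False` to work on the processing and merge steps without downloading anything.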
The PDF will include:
- a header page, including the date of generation
- a table of contents
- an index (a longer-term goal)
The script uses Beautiful Soup for HTML processing and wkhtmltopdf for converting to PDF; this combination seems to give the best results.
Output of the PDF should match the website HTML as closely as possible, with the following differences for now:
- the formatting of the individual page contents was at times overlapping with text, so the contents have been moved to the top of each page
- the product macro was at times being cut off in the middle of a page, so its formatting has been adjusted slightly
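For the conversion step, one way to drive wkhtmltopdf from Python is to call its command-line tool through `subprocess`, as in the sketch below. The function names are hypothetical and this is not necessarily how the project's script invokes it. The `cover` and `toc` arguments are wkhtmltopdf's own page objects for a cover page and a generated table of contents; whether they give acceptable results for this document is untested here.

```python
import shutil
import subprocess


def build_command(html_path, pdf_path, cover_path=None, with_toc=False):
    """Build the wkhtmltopdf argument list for one merged HTML file."""
    cmd = ["wkhtmltopdf"]
    if cover_path:
        # "cover <file>" puts that HTML page first, e.g. a header page
        # carrying the generation date.
        cmd += ["cover", cover_path]
    if with_toc:
        # "toc" asks wkhtmltopdf to build a table of contents from headings.
        cmd.append("toc")
    cmd += [html_path, pdf_path]
    return cmd


def convert(html_path, pdf_path, cover_path=None, with_toc=False):
    """Run wkhtmltopdf, failing early if the binary is not installed."""
    if shutil.which("wkhtmltopdf") is None:
        raise RuntimeError("wkhtmltopdf is not installed or not on PATH")
    subprocess.run(
        build_command(html_path, pdf_path, cover_path, with_toc), check=True
    )
```

Keeping the command construction separate from the call makes it easy to log or inspect the exact arguments when output differs between platforms.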
Known issues to date:
- The table of contents is not currently generated
- The HTML parsing appears to differ slightly between platforms. The website HTML is malformed in places, and the parsing breaks only on certain platforms
- Some multi-page code blocks also have their text cut off at a PDF page break
Progress can be viewed at https://github.com/Icerath/mariadb-pdf/