[MDBF-341] Create downloadable pdf version of the documentation Created: 2022-02-14  Updated: 2022-10-17  Resolved: 2022-05-02

Status: Closed
Project: MariaDB Foundation Development
Component/s: None
Affects Version/s: None
Fix Version/s: N/A

Type: Task Priority: Major
Reporter: Ian Gilfillan Assignee: Ian Gilfillan
Resolution: Fixed Votes: 1
Labels: None
Σ Remaining Estimate: 0d Remaining Estimate: 0d
Σ Time Spent: 33d Time Spent: 33d
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Issue Links:
Relates
relates to MDBF-399 PDF documentation contents page numbe... Open
relates to MDBF-400 PDF documentation code blocks are cut... Open
relates to MDBF-407 PDF from KB logging generates duplicates Closed
relates to MDEV-28701 Update server HELP contents Closed
relates to MDBF-485 Generate PDFs in languages other than... Open
Sub-Tasks:
Key
Summary
Type
Status
Assignee
MDBF-363 Add page numbers to the table of cont... Technical task Closed Dorje Gilfillan  
MDBF-368 Create style for external links Technical task Closed Dorje Gilfillan  

 Description   

There have been many requests for an offline, downloadable, pdf version of the documentation. MDEV-6881 was reported in 2014 and closed as "Won't fix" in 2019.

Requests include:
https://mariadb.com/kb/en/sql-statements-documentation/
https://mariadb.com/kb/en/dont-you-have-pdf-version-document/
https://mariadb.com/kb/en/pdf-or-search-area/
https://mariadb.com/kb/en/pdf-documentation-going-to-be-available-anytime-soon/

The intention is to run a Python script to generate the PDF from the current content on the KB.

The script will read a CSV file containing a list of all the URLs to be included in the document. The CSV currently contains the following fields:

  • Url
  • Help category (the CSV was generated from the current server HELP contents, and the goal is to have one csv to automatically generate both the PDF and the HELP contents)
  • Include:
    • 0: Exclude the page
    • 1: include the contents
    • 2: generate a header only (for categories)
    • 3: include a link to content elsewhere in the pdf (many pages have multiple parents, the content will only appear once)
    • 4: could be used to bring in page contents for categories, without the sub-pages. Still to be implemented; 1 may suffice
  • Header: the header number (the page title is automatically generated for include 1). The intention is to remove this column and generate it automatically for all urls, including the numbers, in future
  • Depth: Depth in the KB structure. Begins with depth 1, children are depth 2, etc. Currently manual. Ultimately this will be automatically generated. Currently unused, intended to replace the numbering in Header.
  • Notes
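
As a rough illustration of the field list above, the CSV could be parsed along these lines. This is only a sketch, not the project's actual code: the column header names (Url, Help category, Include, Header, Depth, Notes), the Include and Row names, and the read_rows helper are all assumptions made for the example.

```python
import csv
from dataclasses import dataclass
from enum import IntEnum


class Include(IntEnum):
    """Values of the Include column, as listed in the description."""
    EXCLUDE = 0        # exclude the page
    CONTENT = 1        # include the contents
    HEADER_ONLY = 2    # generate a header only (for categories)
    LINK = 3           # link to content that appears elsewhere in the pdf
    CATEGORY_ONLY = 4  # category contents without sub-pages (not yet implemented)


@dataclass
class Row:
    url: str
    help_category: str
    include: Include
    header: str
    depth: int
    notes: str


def read_rows(handle):
    """Parse the CSV, failing early on an unknown Include value.

    The column names used here are assumed, not taken from the real file.
    """
    rows = []
    # Data starts on line 2 because line 1 holds the column headers.
    for line_no, rec in enumerate(csv.DictReader(handle), start=2):
        try:
            include = Include(int(rec["Include"]))
        except ValueError as exc:
            raise ValueError(
                f"line {line_no}: bad Include value {rec['Include']!r}"
            ) from exc
        rows.append(
            Row(
                url=rec["Url"],
                help_category=rec["Help category"],
                include=include,
                header=rec["Header"],
                depth=int(rec["Depth"] or 0),  # Depth may be empty while it is manual
                notes=rec["Notes"],
            )
        )
    return rows
```

Rows with Include.EXCLUDE can then be filtered out before any pages are downloaded.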

CSV maintenance is for now fairly manual. Over time it will become more automated.

The Python script will:

  • loop through the csv
  • read each of the urls from the KB
  • save the raw html locally (to avoid having to repeatedly download 1000s of pages, the prior step can be disabled in the config)
  • process and merge the individual html files into one large html file
  • convert the html file into a pdf
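
The download, local caching and merge steps above could look roughly like the following. It is a minimal sketch using only the standard library: the function names, the cache file naming and the download flag are hypothetical stand-ins for whatever the real script and its config use, and the Beautiful Soup processing is left out.

```python
from html import escape
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def cache_path(cache_dir, url):
    """Map a KB url to a local file name (naming scheme is an assumption)."""
    slug = urlparse(url).path.strip("/").replace("/", "_") or "index"
    return Path(cache_dir) / f"{slug}.html"


def fetch_page(url, cache_dir, download=True, fetch=None):
    """Return the raw html for url.

    With download=False the saved local copy is read instead, so thousands
    of pages do not have to be fetched again on every run.
    """
    path = cache_path(cache_dir, url)
    if download:
        # fetch is injectable so the function can be exercised offline.
        fetch = fetch or (lambda u: urlopen(u).read().decode("utf-8"))
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(fetch(url), encoding="utf-8")
    return path.read_text(encoding="utf-8")


def merge_pages(pages):
    """Join (title, html_body) pairs into one large html document."""
    sections = [
        f"<section>\n<h1>{escape(title)}</h1>\n{body}\n</section>"
        for title, body in pages
    ]
    return (
        "<!DOCTYPE html>\n<html><body>\n"
        + "\n".join(sections)
        + "\n</body></html>\n"
    )
```

Keeping the download switch separate from the merge step means formatting changes can be tested against the saved html without touching the KB.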

The pdf will include

  • a header page, including the date of generation
  • a table of contents
  • a longer-term goal is to generate an index

The script makes use of Beautiful Soup for html processing, and wkhtmltopdf for converting to pdf, which seem to give the best results.
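
For the conversion step, a hedged sketch of calling wkhtmltopdf from Python is shown here. The --footer-center and --footer-font-size options and the [page] placeholder are standard wkhtmltopdf options, but whether the project uses them, the default font size, and the function names are assumptions.

```python
import shutil
import subprocess


def build_command(html_file, pdf_file, footer_font_size=8):
    """Build a wkhtmltopdf command line with a page number in the footer."""
    return [
        "wkhtmltopdf",
        "--footer-center", "[page]",                   # [page] expands to the page number
        "--footer-font-size", str(footer_font_size),  # assumed value, not the project's
        html_file,
        pdf_file,
    ]


def convert(html_file, pdf_file):
    """Run wkhtmltopdf, failing clearly if the binary is not installed."""
    if shutil.which("wkhtmltopdf") is None:
        raise RuntimeError("wkhtmltopdf is not on PATH")
    subprocess.run(build_command(html_file, pdf_file), check=True)
```

Building the argument list separately from running it makes the settings easy to inspect without having wkhtmltopdf installed.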

Output of the pdf should match the website html as closely as possible, with the following differences for now:

  • formatting of the individual page contents was at times overlapping with text, so the contents have been moved to the top of each page
  • The product macro at times was being chopped in the middle of a page, so its formatting has been adjusted slightly.

Known issues to date:

  • Contents are not currently generated
  • the html parsing appears to be slightly different on different platforms. The website html is malformed in places, and the parsing is broken only on certain platforms
  • Some multi-page code blocks also chop the text on a pdf page break

Progress can be viewed at https://github.com/Icerath/mariadb-pdf/



 Comments   
Comment by Ian Gilfillan [ 2022-03-22 ]

Current version of the PDF at https://mariadb.org/mariadbserverknowledgebase_2022_03_22/

  • Pushed headers with larger syntax blocks down to reduce floating headers
  • improved csv error checking
  • external links have a different colour (ideally they'll be differentiated in mediawiki style)
  • fixed broken external mariadb.com links
  • page numbering font is now smaller
  • general formatting fixes

Other points from previous discussions:

  • added bold to high level headings
  • added a one page chapter contents
  • Training and Tutorials section added
  • older/obsolete content is specifically excluded, and some sections, like MaxScale, Enterprise Server, Connectors, Release Notes, are not included at all
  • Undecided about whether to add Release notes/FAQs in the main document, or separate appendices.
  • categories are now not treated as anything separate, but are handled and displayed in the same way as other pages (previously only the heading was displayed, now the sub-pages are as well)
  • Scrollbars (in code blocks) don't work in PDFs. Can be fixed in some cases on the KB by avoiding overly wide blocks
  • licence information is still needed
  • product macro text is displayed slightly differently (below rather than in the middle of the box) to avoid being chopped across page breaks
  • the contents for each page are above, rather than to the right of the body. They were sometimes overlapping, and this seemed the best solution
  • there are still some spacing improvements that could be made
  • suggestion for a new page for a new chapter (or even subchapter). I think this adds a lot of needless white space. Other comments?
Comment by Ian Gilfillan [ 2022-03-31 ]

Latest version: https://mariadb.org/mariadbserverknowledgebase_2022_03_30/

  • Cover page (currently a static image)
  • Page numbers added to table of contents
  • Dots connecting header and page number in contents
  • External links now in mediawiki style
Comment by Anel Husakovic [ 2022-04-04 ]

I would suggest adding a watermark on the first page, to make it clear which stage we are at.
It is done like this in latex:
https://texblog.org/2012/02/17/watermarks-draft-review-approved-confidential/

Comment by Ian Gilfillan [ 2022-04-12 ]

Windows version generates correctly, but on Linux there are font problems, and the Linux version of wkhtmltopdf generates slightly different output. Virtual server on Windows is very slow, currently still looking for a fix.

Comment by Ian Gilfillan [ 2022-04-18 ]

Latest version: https://mariadb.org/mariadbserverknowledgebasenew_2022_04_18/

Known issues:

  • Increased font size has led to more prevalent scrollbars (which don't work in the PDF)
  • Needs to be generated on Windows due to Linux font issues
  • KB links can have multiple slugs. Only the primary link is treated as an internal link. Secondary links are implemented as external links (fix in progress by adding these secondaries to the csv)
  • Page numbers are too large again (this is a config setting)
  • See https://fi.mariadb.org/wiki/Generate_a_Pdf_from_the_Knowledge_Base for how to generate
  • Have requested a dedication from Monty
Comment by Ian Gilfillan [ 2022-04-27 ]

A first public version is up at https://mariadb.org/mariadbserverknowledgebase/

Known issues:

  • still a number of scrollbar instances chopping off text
  • a number of external links that should be internal
  • chapter page numbering can drift out of sync
  • code blocks chopped on a linebreak don't always look so good
Generated at Thu Feb 08 03:37:10 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.