  1. MariaDB Foundation Development
  2. MDBF-341

Create downloadable pdf version of the documentation

Details

    • Task
    • Status: Closed
    • Major
    • Resolution: Fixed

    Description

      There have been many requests for an offline, downloadable PDF version of the documentation. MDEV-6881 was reported in 2014 and closed as "Won't fix" in 2019.

      Requests include:
      https://mariadb.com/kb/en/sql-statements-documentation/
      https://mariadb.com/kb/en/dont-you-have-pdf-version-document/
      https://mariadb.com/kb/en/pdf-or-search-area/
      https://mariadb.com/kb/en/pdf-documentation-going-to-be-available-anytime-soon/

      The intention is to run a Python script to generate the PDF from the current content on the KB.

      The script will read a CSV file containing a list of all the URLs to be included in the document. The CSV currently contains the following fields:

      • Url
      • Help category (the CSV was generated from the current server HELP contents, and the goal is to have one CSV from which both can be generated automatically)
      • Include:
        • 0: Exclude the page
        • 1: include the contents
        • 2: generate a header only (for categories)
        • 3: include a link to content elsewhere in the pdf (many pages have multiple parents, the content will only appear once)
        • 4: could be used to bring in page contents for categories, without the sub-pages. Still to be implemented; option 1 may suffice
      • Header: the header number (the page title is automatically generated for include 1). The intention is to remove this column and generate it automatically for all urls, including the numbers, in future
      • Depth: Depth in the KB structure. Begins with depth 1, children are depth 2, etc. Currently manual. Ultimately this will be automatically generated. Currently unused, intended to replace the numbering in Header.
      • Notes
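
      As an illustration of the fields above, here is a minimal sketch of how such a CSV could be read and validated. The exact column spellings, the Include enum names, the Row class and the example.com URLs are assumptions made for this sketch; they are not taken from the actual script.

```python
import csv
import io
from dataclasses import dataclass
from enum import IntEnum


class Include(IntEnum):
    EXCLUDE = 0        # exclude the page
    CONTENT = 1        # include the page contents
    HEADER_ONLY = 2    # generate a header only (categories)
    LINK = 3           # link to content that appears elsewhere in the PDF
    CATEGORY_BODY = 4  # reserved: category contents without sub-pages


@dataclass
class Row:
    url: str
    help_category: str
    include: Include
    header: str
    depth: int
    notes: str


def read_rows(handle):
    """Parse CSV rows; raise ValueError naming the line on bad input."""
    rows = []
    reader = csv.DictReader(handle)
    for record in reader:
        try:
            rows.append(Row(
                url=record["Url"].strip(),
                help_category=record["Help category"].strip(),
                include=Include(int(record["Include"])),
                header=record["Header"].strip(),
                depth=int(record["Depth"]),
                notes=record["Notes"].strip(),
            ))
        except (KeyError, ValueError, AttributeError) as exc:
            raise ValueError(f"line {reader.line_num}: {exc}") from exc
    return rows


# Hypothetical sample data, for illustration only.
SAMPLE = """Url,Help category,Include,Header,Depth,Notes
https://example.com/kb/en/chapter/,,2,1,1,
https://example.com/kb/en/page-a/,Data Manipulation,1,1.1,2,
https://example.com/kb/en/old-page/,,0,,2,obsolete
"""

rows = read_rows(io.StringIO(SAMPLE))
wanted = [r for r in rows if r.include is not Include.EXCLUDE]
print([r.url for r in wanted])
```

      Rejecting unknown Include values at read time keeps a typo in the CSV from silently dropping a page later in the run.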

      CSV maintenance is for now fairly manual. Over time it will become more automated.
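
      One part of that automation, deriving the header numbers from the Depth column, could look roughly like the sketch below. The function name number_headers and the rule that a row may go at most one level deeper than the previous row are assumptions for illustration.

```python
def number_headers(depths):
    """Turn a sequence of depths (1 = top level) into dotted section numbers."""
    counters = []   # counters[i] is the current number at depth i + 1
    numbers = []
    for depth in depths:
        if depth < 1 or depth > len(counters) + 1:
            raise ValueError(f"depth {depth} is not reachable from the previous row")
        del counters[depth:]            # moving back up resets the deeper levels
        if depth > len(counters):
            counters.append(1)          # first child at a new level
        else:
            counters[depth - 1] += 1    # next sibling at an existing level
        numbers.append(".".join(map(str, counters)))
    return numbers


print(number_headers([1, 2, 2, 3, 1, 2]))
```

      With numbering derived this way, the Header column would only need to be kept for rows whose title cannot be taken from the page itself.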

      The Python script will:

      • loop through the csv
      • read each of the urls from the KB
      • save the raw html locally (to avoid having to repeatedly download 1000s of pages, the prior step can be disabled in the config)
      • process and merge the individual html files into one large html file
      • convert the html file into a pdf
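
      The download, cache and merge steps above can be sketched as follows. This is an illustration using only the standard library; the real script uses Beautiful Soup for the processing step, and the names cache_path, load_page and merge_pages, the hashed filenames and the download flag are assumptions.

```python
import hashlib
import html
import urllib.request
from pathlib import Path


def cache_path(cache_dir, url):
    """Map a URL to a stable local filename."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.html"


def http_get(url):
    """Download a page and return it as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def load_page(url, cache_dir, download=True, fetch=http_get):
    """Return raw HTML; with download=False only the local copy is read."""
    path = cache_path(cache_dir, url)
    if download:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(fetch(url), encoding="utf-8")
    return path.read_text(encoding="utf-8")


def merge_pages(fragments, title="Documentation"):
    """Wrap already-processed page fragments in a single HTML document."""
    body = "\n".join(f"<section>{fragment}</section>" for fragment in fragments)
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body>\n{body}\n</body></html>\n"
    )
```

      Setting download=False corresponds to disabling the download step in the config: the run then fails fast with FileNotFoundError if a page was never cached, rather than quietly producing a PDF with missing pages.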

      The pdf will include

      • a header page, including the date of generation
      • a table of contents
      • a longer-term goal is to generate an index

      The script makes use of Beautiful Soup for html processing, and wkhtmltopdf for converting to pdf, which seem to give the best results.
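
      For reference, calling wkhtmltopdf from Python could look like the sketch below. The helper names, the file names and the choice of a centred page-number footer are assumptions; only the wkhtmltopdf options --footer-center and --footer-font-size and the [page] placeholder are part of the tool itself.

```python
import shutil
import subprocess


def build_command(html_path, pdf_path, footer_font_size=8):
    """Assemble a wkhtmltopdf command line with a page-number footer."""
    return [
        "wkhtmltopdf",
        "--footer-center", "[page]",                  # [page] expands to the page number
        "--footer-font-size", str(footer_font_size),
        str(html_path),
        str(pdf_path),
    ]


def convert(html_path, pdf_path, **options):
    """Run wkhtmltopdf, failing clearly if the binary is not installed."""
    if shutil.which("wkhtmltopdf") is None:
        raise RuntimeError("wkhtmltopdf is not on PATH")
    subprocess.run(build_command(html_path, pdf_path, **options), check=True)
```

      Keeping the command construction separate from the subprocess call makes settings such as the footer font size easy to drive from a config file and to check without running the converter.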

      Output of the pdf should match the website html as closely as possible, with the following differences for now:

      • the per-page table of contents at times overlapped the body text, so it has been moved to the top of each page
      • The product macro at times was being chopped in the middle of a page, so its formatting has been adjusted slightly.

      Known issues to date:

      • Contents are not currently generated
      • the html parsing appears to be slightly different on different platforms. The website html is malformed in places, and the parsing is broken only on certain platforms
      • Some multi-page code blocks also chop the text on a pdf page break
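
      A common mitigation for content being cut at a page boundary is to inject print CSS into the merged html before conversion. The sketch below shows the idea; whether the script does this is not stated here, the selectors and the add_print_css helper are assumptions, and a block taller than one page still has to break somewhere, so this would not help with multi-page code blocks.

```python
import re

# Assumed selectors; the real KB markup may need different ones.
PRINT_CSS = """
pre, table { page-break-inside: avoid; }
h1, h2, h3 { page-break-after: avoid; }
"""


def add_print_css(html, css=PRINT_CSS):
    """Insert a <style> block before </head>, or prepend it if there is no head."""
    style = f"<style>{css}</style>"
    match = re.search(r"</head\s*>", html, flags=re.IGNORECASE)
    if match is None:
        return style + html
    return html[:match.start()] + style + html[match.start():]
```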

      Progress can be viewed at https://github.com/Icerath/mariadb-pdf/

      Attachments

        Issue Links

          Activity

            greenman Ian Gilfillan added a comment -

            Current version of the PDF at https://mariadb.org/mariadbserverknowledgebase_2022_03_22/

            • Pushed headers with larger syntax blocks down to reduce floating headers
            • improved csv error checking
            • external links have a different colour (ideally they'll be differentiated in mediawiki style)
            • fixed broken external mariadb.com links
            • page numbering font is now smaller
            • general formatting fixes

            Other points from previous discussions:

            • added bold to high level headings
            • added a one page chapter contents
            • Training and Tutorials section added
            • older/obsolete content is specifically excluded, and some sections, like MaxScale, Enterprise Server, Connectors, Release Notes, are not included at all
            • Undecided about whether to add Release notes/FAQs in the main document, or separate appendices.
            • categories are now not treated as anything separate, but are handled and displayed in the same way as other pages (previously only the heading was displayed, now the sub-pages are as well)
            • Scrollbars (in code blocks) don't work in PDFs. Can be fixed in some cases on the KB by avoiding overly wide blocks
            • licence information is still needed
            • product macro text is displayed slightly differently (below rather than in the middle of the box) to avoid being chopped across page breaks
            • the contents for each page are above, rather than to the right of the body. They were sometimes overlapping, and this seemed the best solution
            • there are still some spacing improvements that could be made
            • suggestion for a new page for a new chapter (or even subchapter). I think this adds a lot of needless white space. Other comments?
            greenman Ian Gilfillan added a comment -

            Latest version: https://mariadb.org/mariadbserverknowledgebase_2022_03_30/

            • Cover page (currently a static image)
            • Page numbers added to table of contents
            • Dots connecting header and page number in contents
            • External links now in mediawiki style

            anel Anel Husakovic added a comment -

            I would suggest adding a watermark on the first page to make clear which stage the document is at.
            This is how it is done in LaTeX:
            https://texblog.org/2012/02/17/watermarks-draft-review-approved-confidential/
            greenman Ian Gilfillan added a comment -

            The Windows version generates correctly, but on Linux there are font problems, and the Linux version of wkhtmltopdf generates slightly different output. The virtual server on Windows is very slow; still looking for a fix.

            greenman Ian Gilfillan added a comment -

            Latest version: https://mariadb.org/mariadbserverknowledgebasenew_2022_04_18/

            Known issues:

            • Increased font size has led to more prevalent scrollbars (which don't work in the PDF)
            • Needs to be generated on Windows due to Linux font issues
            • KB links can have multiple slugs. Only the primary link is treated as an internal link. Secondary links are implemented as external links (fix in progress by adding these secondaries to the csv)
            • Page numbers are too large again (this is a config setting)
            • See https://fi.mariadb.org/wiki/Generate_a_Pdf_from_the_Knowledge_Base for how to generate
            • Have requested a dedication from Monty
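
            The multiple-slug issue above amounts to mapping every secondary slug to its primary before deciding whether a link is internal. A minimal sketch of that lookup follows; the alias table, the function names, the sample slugs and the use of the slug as the in-document anchor are all assumptions, not the script's actual behaviour.

```python
from urllib.parse import urlsplit

KB_HOST = "mariadb.com"
KB_PREFIX = "/kb/en/"


def slug_of(url):
    """Return the KB slug of a URL, or None if it is not a KB page."""
    parts = urlsplit(url)
    if parts.netloc.lower() != KB_HOST or not parts.path.startswith(KB_PREFIX):
        return None
    return parts.path[len(KB_PREFIX):].strip("/") or None


def rewrite_link(url, primary_slugs, aliases):
    """Point a link at an in-document anchor when its page is in the PDF."""
    slug = slug_of(url)
    slug = aliases.get(slug, slug)      # secondary slug -> primary slug
    if slug in primary_slugs:
        return f"#{slug}"               # internal link
    return url                          # stays an external link


# Hypothetical data: one page reachable under two slugs.
primary = {"page-a"}
aliases = {"page-a-old-name": "page-a"}
print(rewrite_link("https://mariadb.com/kb/en/page-a-old-name/", primary, aliases))
```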
            greenman Ian Gilfillan added a comment -

            A first public version is up at https://mariadb.org/mariadbserverknowledgebase/

            Known issues:

            • there are still a number of scrollbar instances chopping off text
            • a number of external links should be internal
            • chapter page numbering can drift out of sync
            • code blocks split across a page break don't always look good

            People

              greenman Ian Gilfillan
              greenman Ian Gilfillan
              Votes: 1
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0d
                  Logged: 33d