MariaDB Foundation Development / MDBF-341

Create downloadable pdf version of the documentation

Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed

    Description

      There have been many requests for an offline, downloadable PDF version of the documentation. MDEV-6881 was reported in 2014 and closed as "Won't fix" in 2019.

      Requests include:
      https://mariadb.com/kb/en/sql-statements-documentation/
      https://mariadb.com/kb/en/dont-you-have-pdf-version-document/
      https://mariadb.com/kb/en/pdf-or-search-area/
      https://mariadb.com/kb/en/pdf-documentation-going-to-be-available-anytime-soon/

      The intention is to run a Python script to generate the PDF from the current content on the KB.

      The script will read a CSV file listing all the URLs to be included in the document. The CSV currently contains the following fields:

      • Url
      • Help category: the CSV was generated from the current server HELP contents, and the goal is to have one CSV from which both the HELP contents and the PDF can be generated automatically
      • Include:
        • 0: exclude the page
        • 1: include the page contents
        • 2: generate a header only (for categories)
        • 3: include a link to content elsewhere in the PDF (many pages have multiple parents, but the content will appear only once)
        • 4: could be used to bring in page contents for categories, without the sub-pages. Still to be implemented; 1 may suffice
      • Header: the header number (the page title is generated automatically for Include 1). The intention is to remove this column in future and generate it, including the numbers, automatically for all URLs
      • Depth: depth in the KB structure. Top-level pages are depth 1, their children are depth 2, and so on. Currently entered manually and unused; ultimately it will be generated automatically and is intended to replace the numbering in Header.
      • Notes
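
      The field list above can be read with Python's standard csv module. The following is a minimal sketch, not the project's actual code: it assumes the CSV header row spells the columns exactly as listed (Url, Help category, Include, Header, Depth, Notes), and the Include enum, the read_rows function and the example.org rows are hypothetical names and data used only for illustration.

```python
import csv
import io
from enum import IntEnum


class Include(IntEnum):
    """Include codes as described in the ticket (names are hypothetical)."""
    EXCLUDE = 0
    CONTENT = 1
    HEADER_ONLY = 2
    LINK = 3
    CATEGORY_CONTENT = 4  # described in the ticket as still to be implemented


def read_rows(csv_text):
    """Parse CSV text and return the rows that should appear in the PDF."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        include = Include(int(raw["Include"]))  # raises ValueError on unknown codes
        if include is Include.EXCLUDE:
            continue  # Include 0: the page is left out entirely
        rows.append({
            "url": raw["Url"],
            "include": include,
            "header": raw["Header"],
            "depth": int(raw["Depth"]),
        })
    return rows


# Hypothetical sample data; the real CSV lists KB URLs.
SAMPLE = """\
Url,Help category,Include,Header,Depth,Notes
https://example.org/kb/category/,,2,1,1,hypothetical category
https://example.org/kb/category/page-a/,Data Types,1,1.1,2,
https://example.org/kb/category/page-b/,,0,,2,excluded
"""

if __name__ == "__main__":
    for row in read_rows(SAMPLE):
        print(row["header"], row["include"].name, row["url"])
```

      Mapping the numeric codes to an enum keeps the later dispatch readable and makes an unexpected code fail early rather than being silently skipped.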

      CSV maintenance is for now fairly manual. Over time it will become more automated.
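
      One piece of that automation, generating the Header numbers from the Depth column, could look roughly like the sketch below. This is an assumption about how it might be done, not the script's implementation; number_headers is a hypothetical helper, and it assumes rows are listed in document order.

```python
def number_headers(depths):
    """Turn a sequence of depths (1 = top level) into outline numbers.

    Hypothetical helper: keeps one counter per depth level, increments the
    counter for the current depth, and discards counters for deeper levels.
    """
    counters = []
    numbers = []
    for depth in depths:
        if depth < 1:
            raise ValueError("depth must be 1 or greater")
        # Moving back up the tree: forget the counters of deeper levels.
        del counters[depth:]
        # Moving down: add counters for new levels (a skipped level stays 0).
        while len(counters) < depth:
            counters.append(0)
        counters[depth - 1] += 1
        numbers.append(".".join(str(c) for c in counters))
    return numbers


if __name__ == "__main__":
    print(number_headers([1, 2, 2, 3, 1, 2]))
```

      For the depths 1, 2, 2, 3, 1, 2 this yields 1, 1.1, 1.2, 1.2.1, 2 and 2.1.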

      The Python script will:

      • loop through the CSV
      • read each of the URLs from the KB
      • save the raw HTML locally (so that thousands of pages need not be downloaded repeatedly, the download step can be disabled in the config)
      • process and merge the individual HTML files into one large HTML file
      • convert the HTML file into a PDF
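
      The download-and-cache part of these steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: the function names, the SHA-256 file naming and the download flag (standing in for the config option) are all hypothetical, and merge_pages simply concatenates pages where the real script processes them with Beautiful Soup.

```python
import hashlib
import urllib.request
from pathlib import Path


def cache_path(cache_dir, url):
    """Map a URL to a stable local file name (hypothetical naming scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.html"


def fetch_url(url):
    """Download one page and return its HTML as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def load_html(url, cache_dir, download=True, fetch=fetch_url):
    """Return the raw HTML for a URL.

    With download=True the page is fetched and saved locally; with
    download=False the previously saved copy is read instead.
    """
    path = cache_path(cache_dir, url)
    if download:
        html = fetch(url)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(html, encoding="utf-8")
        return html
    return path.read_text(encoding="utf-8")


def merge_pages(urls, cache_dir, download=True, fetch=fetch_url):
    """Naively join the pages in order; real processing would clean each one."""
    return "\n".join(load_html(u, cache_dir, download, fetch) for u in urls)
```

      Passing the fetch function as a parameter lets the caching logic be exercised without network access, for example by supplying a stub that returns fixed HTML.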

      The PDF will include:

      • a header page, including the date of generation
      • a table of contents
      • an index (a longer-term goal)

      The script makes use of Beautiful Soup for html processing, and wkhtmltopdf for converting to pdf, which seem to give the best results.
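
      As a rough illustration of the conversion step, the sketch below builds a wkhtmltopdf command line and runs it with subprocess. It is not taken from the project: build_command and convert are hypothetical helpers, and using wkhtmltopdf's cover and toc objects for the header page and the contents is an assumption about one possible approach, not the method the script uses.

```python
import shutil
import subprocess


def build_command(merged_html, output_pdf, cover_html=None, with_toc=False,
                  binary="wkhtmltopdf"):
    """Build the wkhtmltopdf argument list (objects first, output file last)."""
    cmd = [binary]
    if cover_html:
        cmd += ["cover", cover_html]  # separate HTML file used as the first page
    if with_toc:
        cmd.append("toc")  # ask wkhtmltopdf to generate a table of contents
    cmd += [merged_html, output_pdf]
    return cmd


def convert(merged_html, output_pdf, **options):
    """Run the conversion; raises if wkhtmltopdf is missing or fails."""
    cmd = build_command(merged_html, output_pdf, **options)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} was not found on PATH")
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    print(build_command("merged.html", "out.pdf",
                        cover_html="cover.html", with_toc=True))
```

      Keeping the command construction separate from the subprocess call makes it possible to check the arguments without having wkhtmltopdf installed.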

      Output of the pdf should match the website html as closely as possible, with the following differences for now:

      • the table of contents of an individual page at times overlapped the text, so it has been moved to the top of each page
      • the product macro was at times being cut off in the middle of a page, so its formatting has been adjusted slightly

      Known issues to date:

      • The table of contents is not currently generated
      • The HTML parsing appears to differ slightly between platforms. The website HTML is malformed in places, and the parsing is broken only on certain platforms
      • In some multi-page code blocks, text is also cut off at a PDF page break

      Progress can be viewed at https://github.com/Icerath/mariadb-pdf/

          Activity

            greenman Ian Gilfillan created issue
            greenman Ian Gilfillan made changes
            Field: Original Value → New Value
            Status: Open [ 1 ] → In Progress [ 3 ]
            greenman Ian Gilfillan made changes
            Worklog Id: 95531 [ 95531 ]
            Remaining Estimate: 0d [ 0 ]
            Time Spent: 10d [ 288000 ]
            greenman Ian Gilfillan made changes
            Worklog Id: 95532 [ 95532 ]
            Remaining Estimate: 0d [ 0 ] → 30d [ 864000 ]
            Time Spent: 10d [ 288000 ] → 20d [ 576000 ]
            greenman Ian Gilfillan made changes
            Worklog Id: 96051 [ 96051 ]
            Remaining Estimate: 30d [ 864000 ] → 29d [ 835200 ]
            Time Spent: 20d [ 576000 ] → 21d [ 604800 ]
            greenman Ian Gilfillan made changes
            Worklog Id: 96347 [ 96347 ]
            Remaining Estimate: 29d [ 835200 ] → 28d [ 806400 ]
            Time Spent: 21d [ 604800 ] → 22d [ 633600 ]
            greenman Ian Gilfillan made changes
            Worklog Id: 96618 [ 96618 ]
            Remaining Estimate: 28d [ 806400 ] → 26d [ 748800 ]
            Time Spent: 22d [ 633600 ] → 24d [ 691200 ]
            greenman Ian Gilfillan made changes
            Worklog Id: 97037 [ 97037 ]
            Remaining Estimate: 26d [ 748800 ] → 24d [ 691200 ]
            Time Spent: 24d [ 691200 ] → 26d [ 748800 ]
            greenman Ian Gilfillan made changes
            Worklog Id: 97038 [ 97038 ]
            Remaining Estimate: 24d [ 691200 ] → 0.75d [ 21600 ]
            Time Spent: 26d [ 748800 ] → 26.5d [ 763200 ]
            greenman Ian Gilfillan made changes
            Worklog Id: 97039 [ 97039 ]
            Remaining Estimate: 0.75d [ 21600 ] → 6d [ 172800 ]
            Time Spent: 26.5d [ 763200 ] → 27d [ 777600 ]
            greenman Ian Gilfillan made changes
            Worklog Id: 97250 [ 97250 ]
            Remaining Estimate: 6d [ 172800 ] → 4d [ 115200 ]
            Time Spent: 27d [ 777600 ] → 29d [ 835200 ]
            greenman Ian Gilfillan made changes
            Worklog Id: 97720 [ 97720 ]
            Remaining Estimate: 4d [ 115200 ] → 0d [ 0 ]
            Time Spent: 29d [ 835200 ] → 33d [ 950400 ]
            greenman Ian Gilfillan made changes
            Resolution: Fixed [ 1 ]
            Status: In Progress [ 3 ] → Closed [ 6 ]
            julien.fritsch Julien Fritsch made changes
            Workflow: MariaDB v4 [ 163007 ] → MariaDB Foundation v1 [ 188571 ]
            cvicentiu Vicențiu Ciorbaru made changes
            Component/s: None [ 18105 ]

            People

              greenman Ian Gilfillan
              greenman Ian Gilfillan
              Votes: 1
              Watchers: 6

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0d
                  Logged: 33d