[MDEV-6125] Connect engine - cannot read XML file with default XML namespace defined Created: 2014-04-17  Updated: 2014-05-26  Resolved: 2014-05-23

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: 10.0.10
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Rasmus Johansson (Inactive) Assignee: Olivier Bertrand
Resolution: Not a Bug Votes: 0
Labels: Connect-Engine

Attachments: File 20140401_1846_Running.gpx    

 Description   

It's not possible to read an XML file through the Connect engine if the XML file has the default XML namespace specified, i.e. the root node includes the attribute xmlns="[namespace URI]". Notice that namespaces with prefix works well, i.e. xmlns:[prefix]="[namespace URI]".

I'm including an example to be able to reproduce the problem. Attached find the file 20140401_1846_Running.gpx, which has the following root node:

<gpx version="1.1" creator="runtastic - makes sports funtastic, http://www.runtastic.com" xsi:schemaLocation="http://www.topografix.com/GPX/1/1
                                http://www.topografix.com/GPX/1/1/gpx.xsd
                                http://www.garmin.com/xmlschemas/GpxExtensions/v3
                                http://www.garmin.com/xmlschemas/GpxExtensionsv3.xsd
                                http://www.garmin.com/xmlschemas/TrackPointExtension/v1
                                http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd" xmlns="http://www.topografix.com/GPX/1/1" xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xml\
ns:gpxx="http://www.garmin.com/xmlschemas/GpxExtensions/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

In the root node the default namespace is given in the xmlns -attribute:

xmlns="http://www.topografix.com/GPX/1/1"

Let's then create a Connect engine -table of the XML:

CREATE TABLE `GPXSource` (
  `lon` double(20,16) NOT NULL `field_format`='@',
  `lat` double(20,16) NOT NULL `field_format`='@',
  `ele` double(20,16) NOT NULL,
  `time` datetime date_format="YYYY-MM-DD 'T' hh:mm:ss '.000Z'"
) ENGINE=CONNECT DEFAULT CHARSET=latin1 `table_type`=XML `file_name`='/home/rasmus/nodegpxmariadb/public/uploads/20140401_1846_Running.gpx' `tabname`='trkseg';

Do a SELECT over the created table:

MariaDB [running]> SELECT * FROM GPXSource;
| lon                | lat                | ele                | time |
+--------------------+--------------------+--------------------+------+
| 0.0000000000000000 | 0.0000000000000000 | 0.0000000000000000 | NULL |
| 0.0000000000000000 | 0.0000000000000000 | 0.0000000000000000 | NULL |
+--------------------+--------------------+--------------------+------+
2 rows in set (0.01 sec)

Instead of two rows full of zeros and nulls it should have included a lot of rows without zero and null values.

Let's edit the XML file 20140401_1846_Running.gpx and remove the xmlns="http://www.topografix.com/GPX/1/1" -attribute and after that run the same SELECT over the table again. This time we get what we expected:

MariaDB [running]> SELECT * FROM GPXSource LIMIT 10;
+-----------------------+---------------------+--------------------+---------------------+
| lon                   | lat                 | ele                | time                |
+-----------------------+---------------------+--------------------+---------------------+
| -121.9822235107421900 | 37.3884925842285160 | 6.6108512878417970 | 2014-04-01 14:54:05 |
| -121.9821929931640600 | 37.3885803222656250 | 6.7878279685974120 | 2014-04-01 14:54:08 |
| -121.9821624755859400 | 37.3886299133300800 | 6.7719874382019040 | 2014-04-01 14:54:10 |
| -121.9821395874023400 | 37.3886795043945300 | 6.3355555534362790 | 2014-04-01 14:54:12 |
| -121.9821166992187500 | 37.3887557983398440 | 6.5390658378601070 | 2014-04-01 14:54:15 |
| -121.9821014404296900 | 37.3888053894043000 | 6.6053481101989750 | 2014-04-01 14:54:17 |
| -121.9820785522461000 | 37.3888893127441400 | 5.8897328376770020 | 2014-04-01 14:54:20 |
| -121.9820709228515600 | 37.3889427185058600 | 6.3196611404418945 | 2014-04-01 14:54:22 |
| -121.9820785522461000 | 37.3890190124511700 | 6.0785322189331055 | 2014-04-01 14:54:25 |
| -121.9820861816406200 | 37.3890762329101560 | 6.4796056747436520 | 2014-04-01 14:54:27 |
+-----------------------+---------------------+--------------------+---------------------+
10 rows in set (0.01 sec)



 Comments   
Comment by Olivier Bertrand [ 2014-04-18 ]

CONNECT does not handle the processing of XML files but delegates it to specialized libraries, MS DOMDOC or libxml2 on windows and libxml2 on Linux.

On Windows, with the default DOMDOC library, this table is normally handled with no error. However, when specifying libxml2, the above error occurs.

Therefore, this seems to be a libxml2 error, not a CONNECT one.

Comment by Olivier Bertrand [ 2014-04-18 ]

Looking more closely at what happens with this example, I found that the problem is that a general name space is defined and all XPATH are looking in this name space. When trying to locate the main table node, CONNECT construct an XPATH of '//trkseg' that fails to find the corresponding node.

Currently, when the table node is not found, CONNECT tries to use the ROOT node instead, which in that case, produces the wrong answer. Providing a general fix seems difficult but meawhile you can bypass this issue by specifying the TABNAME option as an XPATH that will ignore the the currently defined name space, in this example:

alter table gpxsource tabname="//*[local-name()='trkseg']";

For me, this worked.
Note that the lon and ele columns should be declared as double(21,16)

Comment by Olivier Bertrand [ 2014-04-20 ]

As a matter of facts, it did not work completely because if the row node attributes were retrieved normally, the column nodes were not found for the same reason (not in the default name space) To have a complete result this table must be created as:

 
CREATE TABLE `GPXSource` (
  `lon` double(21,16) NOT NULL field_format='@',
  `lat` double(20,16) NOT NULL field_format='@',
  `ele` double(21,16) NOT NULL field_format="*[local-name()='ele']",
  `time` datetime field_format="*[local-name()='time']" date_format="YYYY-MM-DD 'T' hh:mm:ss '.000Z'"
) ENGINE=CONNECT DEFAULT CHARSET=latin1 table_type=XML file_name='20140401_1846_Running.gpx'
tabname="//*[local-name()='trkseg']" option_list='xmlsup=libxml2';

With the field format xpath specified as above, the complete result is returned.

Comment by Rasmus Johansson (Inactive) [ 2014-05-26 ]

Olivier, the xpath tweaks worked well. I published a blog post a while ago already where I used it, https://blog.mariadb.org/crunching-xml-files-with-mariadb/

Generated at Thu Feb 08 07:09:32 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.