[MDEV-6744] CONNECT: special char for NULLs Created: 2014-09-16  Updated: 2014-10-12

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - Connect
Fix Version/s: None

Type: Task
Priority: Major
Reporter: Federico Razzoli
Assignee: Olivier Bertrand
Resolution: Unresolved
Votes: 0
Labels: None


 Description   

As far as I understand, there is no way to tell CONNECT that a special character in a file represents NULL.

Please take a look at this example. It's an open dataset from the Italian government, and the same format is used for several other datasets:
http://www.dati.salute.gov.it/imgs/C_17_dataset_7_download_itemDownload0_upFile.CSV

The column DATAFINEVALIDITA is a DD/MM/YYYY date, and '-' represents a NULL value.

If I may suggest, a new column option would be a great solution:

col_name DATE null_value = '-',
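
Until such an option exists, one possible workaround is to declare the column as a character type and convert it at query time. This is only a sketch under assumptions, not something CONNECT provides: it uses the standard NULLIF and STR_TO_DATE functions, a throwaway table standing in for the real CONNECT table, and hypothetical names (demo_null_marker, raw_end). It also assumes day/month/year values separated by slashes.

```sql
-- Stand-in for a CONNECT table whose date column is declared CHAR(10).
-- The table name demo_null_marker and the column raw_end are hypothetical.
CREATE TEMPORARY TABLE demo_null_marker (raw_end CHAR(10));
INSERT INTO demo_null_marker VALUES ('31/01/2014'), ('-');

-- NULLIF turns the '-' marker into NULL; STR_TO_DATE parses the other
-- values and passes NULL through unchanged.
SELECT raw_end,
       STR_TO_DATE(NULLIF(raw_end, '-'), '%d/%m/%Y') AS end_date
FROM demo_null_marker;
```

The first row should come back as a real date and the second as NULL. The cost is that every query (or a view over the table) has to repeat the conversion.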



 Comments   
Comment by Olivier Bertrand [ 2014-10-11 ]

This is a general issue I have already been thinking about. Currently, when a column is nullable,
CONNECT regards character columns as NULL when they contain an empty string,
and numeric columns as NULL when their value is 0. The case of DATE columns is unclear.
They are currently treated as numeric (dates are represented internally, as in C, by the number of
seconds since 01-01-1970). Unfortunately, although invalid dates are automatically set to 01-01-1970,
their value is not 0, because a time shift corresponding to the GMT setting is added or subtracted.
This is why they are not set to NULL in this case.
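
The effect of the time shift can be illustrated with the server's UNIX_TIMESTAMP function. It is used here only as an analogy and is not CONNECT's internal code. The same wall-clock date maps to a different number of seconds depending on the session time zone, so a comparison against a fixed value such as 0 can only hold for one setting. The date 1970-01-02 is used instead of 1970-01-01 to stay clear of the lower bound of the timestamp range.

```sql
-- Illustration only: UNIX_TIMESTAMP interprets its argument in the
-- session time zone, so the same literal yields different second counts.
SET time_zone = '+00:00';
SELECT UNIX_TIMESTAMP('1970-01-02 00:00:00') AS secs_utc;        -- 86400

SET time_zone = '+01:00';
SELECT UNIX_TIMESTAMP('1970-01-02 00:00:00') AS secs_utc_plus_1; -- 82800
```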

I am not yet ready to implement a general solution. Should the NULL_CHAR option be a column option
or a table option? Should the NULL representation be a single character, a string, or a numeric value?
These questions are still open.

Meanwhile, to take care of your problem, I have updated the DATE data type so that invalid dates
are regarded as NULL regardless of the GMT setting. Here is the CREATE TABLE I used:

CREATE TABLE `salute` (
  `CODICEIDENTIFICATIVOSITO` int(6) NOT NULL,
  `DENOMINAZIONESITOLOGISTICO` char(83) NOT NULL,
  `INDIRIZZO` char(97) NOT NULL,
  `PARTITAIVA` bigint(11) NOT NULL FIELD_FORMAT='Z',
  `CAP` int(5) NOT NULL,
  `CODICECOMUNEISTAT` char(6) NOT NULL,
  `DESCRIZIONECOMUNE` char(33) NOT NULL,
  `CODICEPROVINCIAISTAT` char(3) NOT NULL,
  `SIGLAPROVINCIA` char(2) NOT NULL,
  `DESCRIZIONEPROVINCIA` char(25) NOT NULL,
  `CODICEREGIONE` char(3) NOT NULL,
  `DESCRIZIONEREGIONE` char(21) NOT NULL,
  `DATAINIZIOVALIDITA` date NOT NULL DATE_FORMAT='DD/MM/YYYY',
  `DATAFINEVALIDITA` date DATE_FORMAT='DD/MM/YYYY',
  `LATITUDINE` double(17,14) NOT NULL FIELD_FORMAT='D,',
  `LONGITUDINE` double(17,14) NOT NULL FIELD_FORMAT='D,',
  `LOCALIZE` int(1) NOT NULL
) ENGINE=CONNECT DEFAULT CHARSET=latin1 `TABLE_TYPE`='csv' `FILE_NAME`='E:/Data/salute.csv' `SEP_CHAR`=';' `HEADER`=1 `QUOTED`=0;

Note the FIELD_FORMAT='D,' option, which handles the comma used as the decimal separator, and
the QUOTED=0 option, which is needed because two lines in the file have their INDIRIZZO column quoted.
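
For readers unfamiliar with decimal commas, the conversion that FIELD_FORMAT='D,' performs on input is roughly equivalent to the plain SQL below. This is an illustration, not CONNECT's implementation, and the raw value '41,55404142482020' is an assumed example of how a latitude would appear in the file.

```sql
-- Illustration only: replace the decimal comma with a point, then cast.
-- The input literal is an assumed raw field value, not copied from the file.
SELECT CAST(REPLACE('41,55404142482020', ',', '.') AS DECIMAL(17,14)) AS latitudine;
```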

The query:

SELECT DENOMINAZIONESITOLOGISTICO, DATAINIZIOVALIDITA, DATAFINEVALIDITA, LATITUDINE FROM salute LIMIT 10;

now returns:

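+----------------------------------------------------+--------------------+------------------+-------------------+
| DENOMINAZIONESITOLOGISTICO                         | DATAINIZIOVALIDITA | DATAFINEVALIDITA | LATITUDINE        |
+----------------------------------------------------+--------------------+------------------+-------------------+
| Parafarmacia S.F.A. Fasani                         | 2006-10-01         | NULL             | 41.55404142482020 |
| eurosan                                            | 2006-10-01         | NULL             | 38.18235051630770 |
| DOCPHARMA DI EPIS DOTT. LEONARDO                   | 2006-10-01         | 2014-01-31       | 45.73083169375610 |
| PARAFARMACIA NATURESTE                             | 2006-10-01         | NULL             | 45.21971325978440 |
| parafarmacia sanitaria di tamagni dott.ssa roberta | 2006-10-01         | NULL             | 46.18183947061230 |
| FARMA+ dott.Giuseppe moscariello                   | 2006-10-01         | NULL             | 40.84112547106560 |
| pharmasan                                          | 2006-10-01         | 2007-02-28       | 40.83665880000000 |
| Sole e Natura di Aricò Alessandra                  | 2006-10-01         | 2009-03-31       | 37.51717130000000 |
| ARTEMISIA                                          | 2006-11-01         | 2009-03-31       | 36.95185760000000 |
| Erboristeria D.ssa Valeria Tantardini              | 2006-11-01         | 2007-06-30       | 46.11608700000000 |
+----------------------------------------------------+--------------------+------------------+-------------------+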