Details

    Description

      As far as I understand, there is no way to tell CONNECT that a special character in a file represents NULL.

      Please, take a look at this example. It's an open dataset from Italian government, and the same format is used for several other datasets:
      http://www.dati.salute.gov.it/imgs/C_17_dataset_7_download_itemDownload0_upFile.CSV

      The column DATAFINEVALIDITA is a DDMMYYYY date, and '-' represents a NULL value.

      If I may suggest, a new column option would be a great solution:

      col_name DATE null_value = '-',

      Attachments

        Activity

          This is a general issue I have been already thinking about. What CONNECT currently does,
          when a column is nullable, is to regard character columns as NULL when represented by an empty string
          and to regard numeric columns as NULL when their value is 0. The case of DATE column is unclear.
          Currently they are treated as numeric (dates are internally represented like in C by the number of
          second since 01-01-1970) Unfortunately, even invalid dates are automatically set to 01-01-1970,
          their value is not 0 because added or substracted a time shift corresponding to their GMT setting.
          This why they are not set to NULL in this case.

          I am not yet ready to implement a general solution. Should the NULL_CHAR option be a column or
          table option, the NULL representation be a single character or a string or a numeric value?
          These questions are still open.

          Meanwhile, to take care of your problem, I have updated the DATE data type for invalid dates
          to be regarded as NULL whatever GMT setting is applied. Here is the CREATE TABLE I used:

          CREATE TABLE `salute` (
            `CODICEIDENTIFICATIVOSITO` int(6) NOT NULL,
            `DENOMINAZIONESITOLOGISTICO` char(83) NOT NULL,
            `INDIRIZZO` char(97) NOT NULL,
            `PARTITAIVA` bigint(11) NOT NULL FIELD_FORMAT='Z',
            `CAP` int(5) NOT NULL,
            `CODICECOMUNEISTAT` char(6) NOT NULL,
            `DESCRIZIONECOMUNE` char(33) NOT NULL,
            `CODICEPROVINCIAISTAT` char(3) NOT NULL,
            `SIGLAPROVINCIA` char(2) NOT NULL,
            `DESCRIZIONEPROVINCIA` char(25) NOT NULL,
            `CODICEREGIONE` char(3) NOT NULL,
            `DESCRIZIONEREGIONE` char(21) NOT NULL,
            `DATAINIZIOVALIDITA` date NOT NULL DATE_FORMAT='DD/MM/YYYY',
            `DATAFINEVALIDITA` date DATE_FORMAT='DD/MM/YYYY',
            `LATITUDINE` double(17,14) NOT NULL FIELD_FORMAT='D,',
            `LONGITUDINE` double(17,14) NOT NULL FIELD_FORMAT='D,',
            `LOCALIZE` int(1) NOT NULL
          ) ENGINE=CONNECT DEFAULT CHARSET=latin1 `TABLE_TYPE`='csv' `FILE_NAME`='E:/Data/salute.csv' `SEP_CHAR`=';' `HEADER`=1 `QUOTED`=0;

          Note the FIELD_FORMAT='D,' to take care of the comma used as decimal separator and
          the QUOTED=0 option needed because two lines in the file have their INDIRIZZO column quoted.

          The query:

          SELECT DENOMINAZIONESITOLOGISTICO, DATAINIZIOVALIDITA, DATAFINEVALIDITA, LATITUDINE FROM salute LIMIT 10;

          now returns:

          DENOMINAZIONESITOLOGISTICO DATAINIZIOVALIDITA DATAFINEVALIDITA LATITUDINE
          Parafarmacia S.F.A. Fasani 2006-10-01 NULL 41.55404142482020
          eurosan 2006-10-01 NULL 38.18235051630770
          DOCPHARMA DI EPIS DOTT. LEONARDO 2006-10-01 2014-01-31 45.73083169375610
          PARAFARMACIA NATURESTE 2006-10-01 NULL 45.21971325978440
          parafarmacia sanitaria di tamagni dott.ssa roberta 2006-10-01 NULL 46.18183947061230
          FARMA+ dott.Giuseppe moscariello 2006-10-01 NULL 40.84112547106560
          pharmasan 2006-10-01 2007-02-28 40.83665880000000
          Sole e Natura di Aricò Alessandra 2006-10-01 2009-03-31 37.51717130000000
          ARTEMISIA 2006-11-01 2009-03-31 36.95185760000000
          Erboristeria D.ssa Valeria Tantardini 2006-11-01 2007-06-30 46.11608700000000
          bertrandop Olivier Bertrand added a comment - This is a general issue I have been already thinking about. What CONNECT currently does, when a column is nullable, is to regard character columns as NULL when represented by an empty string and to regard numeric columns as NULL when their value is 0. The case of DATE column is unclear. Currently they are treated as numeric (dates are internally represented like in C by the number of second since 01-01-1970) Unfortunately, even invalid dates are automatically set to 01-01-1970, their value is not 0 because added or substracted a time shift corresponding to their GMT setting. This why they are not set to NULL in this case. I am not yet ready to implement a general solution. Should the NULL_CHAR option be a column or table option, the NULL representation be a single character or a string or a numeric value? These questions are still open. Meanwhile, to take care of your problem, I have updated the DATE data type for invalid dates to be regarded as NULL whatever GMT setting is applied. Here is the CREATE TABLE I used: CREATE TABLE `salute` ( `CODICEIDENTIFICATIVOSITO` int(6) NOT NULL, `DENOMINAZIONESITOLOGISTICO` char(83) NOT NULL, `INDIRIZZO` char(97) NOT NULL, `PARTITAIVA` bigint(11) NOT NULL FIELD_FORMAT='Z', `CAP` int(5) NOT NULL, `CODICECOMUNEISTAT` char(6) NOT NULL, `DESCRIZIONECOMUNE` char(33) NOT NULL, `CODICEPROVINCIAISTAT` char(3) NOT NULL, `SIGLAPROVINCIA` char(2) NOT NULL, `DESCRIZIONEPROVINCIA` char(25) NOT NULL, `CODICEREGIONE` char(3) NOT NULL, `DESCRIZIONEREGIONE` char(21) NOT NULL, `DATAINIZIOVALIDITA` date NOT NULL DATE_FORMAT='DD/MM/YYYY', `DATAFINEVALIDITA` date DATE_FORMAT='DD/MM/YYYY', `LATITUDINE` double(17,14) NOT NULL FIELD_FORMAT='D,', `LONGITUDINE` double(17,14) NOT NULL FIELD_FORMAT='D,', `LOCALIZE` int(1) NOT NULL ) ENGINE=CONNECT DEFAULT CHARSET=latin1 `TABLE_TYPE`='csv' `FILE_NAME`='E:/Data/salute.csv' `SEP_CHAR`=';' `HEADER`=1 `QUOTED`=0; Note the FIELD_FORMAT='D,' to take care of the comma used as decimal separator and the QUOTED=0 option needed because two lines in the file have their INDIRIZZO column quoted. The query: SELECT DENOMINAZIONESITOLOGISTICO, DATAINIZIOVALIDITA, DATAFINEVALIDITA, LATITUDINE FROM salute LIMIT 10; now returns: DENOMINAZIONESITOLOGISTICO DATAINIZIOVALIDITA DATAFINEVALIDITA LATITUDINE Parafarmacia S.F.A. Fasani 2006-10-01 NULL 41.55404142482020 eurosan 2006-10-01 NULL 38.18235051630770 DOCPHARMA DI EPIS DOTT. LEONARDO 2006-10-01 2014-01-31 45.73083169375610 PARAFARMACIA NATURESTE 2006-10-01 NULL 45.21971325978440 parafarmacia sanitaria di tamagni dott.ssa roberta 2006-10-01 NULL 46.18183947061230 FARMA+ dott.Giuseppe moscariello 2006-10-01 NULL 40.84112547106560 pharmasan 2006-10-01 2007-02-28 40.83665880000000 Sole e Natura di Aricò Alessandra 2006-10-01 2009-03-31 37.51717130000000 ARTEMISIA 2006-11-01 2009-03-31 36.95185760000000 Erboristeria D.ssa Valeria Tantardini 2006-11-01 2007-06-30 46.11608700000000

          People

            bertrandop Olivier Bertrand
            f_razzoli Federico Razzoli
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.