|
Some notes.
There's a package for imdbpy2sql:
sudo apt-get install python-imdbpy
|
sudo apt-get install python-genshi python-sqlalchemy python-sqlobject python3-mysqldb
|
This produces
/usr/share/doc/python-imdbpy/examples/imdbpy2sql.py.gz
|
Running it on the current imdb contents
mkdir imdb_data;
|
cd imdb_data;
|
wget ftp://ftp.fu-berlin.de/misc/movies/database/frozendata/*gz
|
cd ..
|
./imdbpy2sql.py -d ./imdb_data/ -umysql://root@127.0.0.1:3340/imdbload -c `pwd`/imdb_data_csv
|
will eventually fail with (does this have something to do with charsets?)
```
ERROR caught exception creating a foreign key: Cannot add or update a child row: a foreign key constraint fails (`imdbload`.`#sql-6bec_18`, CONSTRAINT `aka_title_movie_id_exists` FOREIGN KEY (`movie_id`) REFERENCES `title` (`id`))
```
|
|
Loading the original CSV data is hard, too. It has empty strings instead of NULLs. (MySQL/MariaDB's LOAD DATA INFILE need \N).
Loading into PostgreSQL as they specified doesn't work, but this works:
\copy aka_title from './aka_title.csv' with (format csv, escape '\') ;
|
\copy aka_name from './aka_name.csv' with (format csv, escape '\');
|
\copy cast_info from './cast_info.csv' with (format csv, escape '\');
|
\copy char_name from './char_name.csv' with (format csv, escape '\');
|
\copy comp_cast_type from './comp_cast_type.csv' with (format csv, escape '\');
|
\copy company_name from './company_name.csv' with (format csv, escape '\');
|
\copy company_type from './company_type.csv' with (format csv, escape '\');
|
\copy complete_cast from './complete_cast.csv' with (format csv, escape '\');
|
\copy info_type from './info_type.csv' with (format csv, escape '\');
|
\copy keyword from './keyword.csv' with (format csv, escape '\');
|
\copy kind_type from './kind_type.csv' with (format csv, escape '\');
|
\copy link_type from './link_type.csv' with (format csv, escape '\');
|
\copy movie_companies from './movie_companies.csv' with (format csv, escape '\');
|
\copy movie_info from './movie_info.csv' with (format csv, escape '\');
|
\copy movie_info_idx from './movie_info_idx.csv' with (format csv, escape '\');
|
\copy movie_keyword from './movie_keyword.csv' with (format csv, escape '\');
|
\copy movie_link from './movie_link.csv' with (format csv, escape '\');
|
\copy name from './name.csv' with (format csv, escape '\');
|
\copy person_info from './person_info.csv' with (format csv, escape '\');
|
\copy role_type from './role_type.csv' with (format csv, escape '\');
|
\copy title from './title.csv' with (format csv, escape '\');
|
|
|
Then, can use this to make CSV files readable by MariaDB:
\copy aka_title to './mysql/aka_title.csv' with (format csv, null '\N', escape '\') ;
|
\copy aka_name to './mysql/aka_name.csv' with (format csv, null '\N', escape '\');
|
\copy cast_info to './mysql/cast_info.csv' with (format csv, null '\N', escape '\');
|
\copy char_name to './mysql/char_name.csv' with (format csv, null '\N', escape '\');
|
\copy comp_cast_type to './mysql/comp_cast_type.csv' with (format csv, null '\N', escape '\');
|
\copy company_name to './mysql/company_name.csv' with (format csv, null '\N', escape '\');
|
\copy company_type to './mysql/company_type.csv' with (format csv, null '\N', escape '\');
|
\copy complete_cast to './mysql/complete_cast.csv' with (format csv, null '\N', escape '\');
|
\copy info_type to './mysql/info_type.csv' with (format csv, null '\N', escape '\');
|
\copy keyword to './mysql/keyword.csv' with (format csv, null '\N', escape '\');
|
\copy kind_type to './mysql/kind_type.csv' with (format csv, null '\N', escape '\');
|
\copy link_type to './mysql/link_type.csv' with (format csv, null '\N', escape '\');
|
\copy movie_companies to './mysql/movie_companies.csv' with (format csv, null '\N', escape '\');
|
\copy movie_info to './mysql/movie_info.csv' with (format csv, null '\N', escape '\');
|
\copy movie_info_idx to './mysql/movie_info_idx.csv' with (format csv, null '\N', escape '\');
|
\copy movie_keyword to './mysql/movie_keyword.csv' with (format csv, null '\N', escape '\');
|
\copy movie_link to './mysql/movie_link.csv' with (format csv, null '\N', escape '\');
|
\copy name to './mysql/name.csv' with (format csv, null '\N', escape '\');
|
\copy person_info to './mysql/person_info.csv' with (format csv, null '\N', escape '\');
|
\copy role_type to './mysql/role_type.csv' with (format csv, null '\N', escape '\');
|
\copy title to './mysql/title.csv' with (format csv, null '\N', escape '\');
|
... but there are errors
ERROR 1366 (22007) at line 21 in file: '../load-data.sql': Incorrect integer value: 'F624' for column `imdbload`.`title`.`imdb_id` at row 2514451
|
the line causing the error:
2522636,\Frag'ile\,\N,1,2010,\N,F624,\N,\N,\N,\N,c0b2e279bce6d3b1717e750a2591bb6d
|
changed it to:
2522636,\\Frag'ile\\,\N,1,2010,\N,F624,\N,\N,\N,\N,c0b2e279bce6d3b1717e750a2591bb6d
|
.
Another
LOAD DATA INFILE '/home/psergey/dev-git2/try-job/imdb_data/mysql/person_info.csv' INTO TABLE person_info FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"';
|
ERROR 1261 (01000): Row 2671666 doesn't contain data for all columns
|
this is
2671660,2604773,17,Daughter of Irish actor and raconteur 'Niall Toibin' (qv); \,\N
|
|
|
The first disaster was Query 16b.
Run time, seconds
|
10.4: 3492.499
|
10.4-eits 2714.008
|
MySQL-8-cur 44.630
|
PostgreSQL-12 12.391
|
SELECT
|
MIN(an.name) AS cool_actor_pseudonym,
|
MIN(t.title) AS series_named_after_char
|
FROM
|
aka_name AS an,
|
cast_info AS ci,
|
company_name AS cn,
|
keyword AS k,
|
movie_companies AS mc,
|
movie_keyword AS mk,
|
name AS n,
|
title AS t
|
WHERE
|
cn.country_code ='[us]' AND
|
k.keyword ='character-name-in-title' AND
|
an.person_id = n.id AND
|
n.id = ci.person_id AND
|
ci.movie_id = t.id AND
|
t.id = mk.movie_id AND
|
mk.keyword_id = k.id AND
|
t.id = mc.movie_id AND
|
mc.company_id = cn.id AND
|
an.person_id = ci.person_id AND
|
ci.movie_id = mc.movie_id AND
|
ci.movie_id = mk.movie_id AND
|
mc.movie_id = mk.movie_id;
|
MariaDB:
+------+-------------+-------+--------+-----------------------------------------------------+--------------------------+---------+------------------------+--------+----------+-------------+
|
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|
+------+-------------+-------+--------+-----------------------------------------------------+--------------------------+---------+------------------------+--------+----------+-------------+
|
| 1 | SIMPLE | an | ALL | person_id_aka_name | NULL | NULL | NULL | 901343 | 100.00 | |
|
| 1 | SIMPLE | n | eq_ref | PRIMARY | PRIMARY | 4 | imdbload.an.person_id | 1 | 100.00 | Using index |
|
| 1 | SIMPLE | ci | ref | movie_id_cast_info,person_id_cast_info | person_id_cast_info | 4 | imdbload.an.person_id | 8 | 100.00 | |
|
| 1 | SIMPLE | t | eq_ref | PRIMARY | PRIMARY | 4 | imdbload.ci.movie_id | 1 | 100.00 | |
|
| 1 | SIMPLE | mc | ref | company_id_movie_companies,movie_id_movie_companies | movie_id_movie_companies | 4 | imdbload.ci.movie_id | 2 | 100.00 | |
|
| 1 | SIMPLE | cn | eq_ref | PRIMARY | PRIMARY | 4 | imdbload.mc.company_id | 1 | 36.49 | Using where |
|
| 1 | SIMPLE | mk | ref | keyword_id_movie_keyword,movie_id_movie_keyword | movie_id_movie_keyword | 4 | imdbload.ci.movie_id | 9 | 100.00 | |
|
| 1 | SIMPLE | k | eq_ref | PRIMARY | PRIMARY | 4 | imdbload.mk.keyword_id | 1 | 100.00 | Using where |
|
+------+-------------+-------+--------+-----------------------------------------------------+--------------------------+---------+------------------------+--------+----------+-------------+
|
MySQL:
+----+-------------+-------+------------+--------+-----------------------------------------------------+--------------------------+---------+------------------------+--------+----------+-------------+
|
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|
+----+-------------+-------+------------+--------+-----------------------------------------------------+--------------------------+---------+------------------------+--------+----------+-------------+
|
| 1 | SIMPLE | k | NULL | ALL | PRIMARY | NULL | NULL | NULL | 127338 | 10.00 | Using where |
|
| 1 | SIMPLE | mk | NULL | ref | keyword_id_movie_keyword,movie_id_movie_keyword | keyword_id_movie_keyword | 4 | imdbload.k.id | 40 | 100.00 | NULL |
|
| 1 | SIMPLE | t | NULL | eq_ref | PRIMARY | PRIMARY | 4 | imdbload.mk.movie_id | 1 | 100.00 | NULL |
|
| 1 | SIMPLE | mc | NULL | ref | company_id_movie_companies,movie_id_movie_companies | movie_id_movie_companies | 4 | imdbload.mk.movie_id | 2 | 100.00 | NULL |
|
| 1 | SIMPLE | cn | NULL | eq_ref | PRIMARY | PRIMARY | 4 | imdbload.mc.company_id | 1 | 10.00 | Using where |
|
| 1 | SIMPLE | ci | NULL | ref | movie_id_cast_info,person_id_cast_info | movie_id_cast_info | 4 | imdbload.mk.movie_id | 18 | 100.00 | NULL |
|
| 1 | SIMPLE | n | NULL | eq_ref | PRIMARY | PRIMARY | 4 | imdbload.ci.person_id | 1 | 100.00 | Using index |
|
| 1 | SIMPLE | an | NULL | ref | person_id_aka_name | person_id_aka_name | 4 | imdbload.ci.person_id | 1 | 100.00 | NULL |
|
+----+-------------+-------+------------+--------+-----------------------------------------------------+--------------------------+---------+------------------------+--------+----------+-------------+
|
PostgreSQL:
QUERY PLAN
|
-----------------------------------------------------------------------------------------------------------------------------------------
|
Aggregate (cost=5048.47..5048.48 rows=1 width=64)
|
-> Nested Loop (cost=9.50..5034.80 rows=2733 width=33)
|
Join Filter: (n.id = an.person_id)
|
-> Nested Loop (cost=9.08..4498.65 rows=1102 width=25)
|
-> Nested Loop (cost=8.65..3996.92 rows=1102 width=21)
|
Join Filter: (t.id = ci.movie_id)
|
-> Nested Loop (cost=8.08..3873.63 rows=64 width=29)
|
-> Nested Loop (cost=7.66..3794.69 rows=177 width=33)
|
Join Filter: (t.id = mc.movie_id)
|
-> Nested Loop (cost=7.23..3774.19 rows=34 width=25)
|
-> Nested Loop (cost=6.80..3757.63 rows=34 width=4)
|
-> Seq Scan on keyword k (cost=0.00..2626.12 rows=1 width=4)
|
Filter: (keyword = 'character-name-in-title'::text)
|
-> Bitmap Heap Scan on movie_keyword mk (cost=6.80..1128.44 rows=306 width=8)
|
Recheck Cond: (keyword_id = k.id)
|
-> Bitmap Index Scan on keyword_id_movie_keyword (cost=0.00..6.73 rows=306 width=0)
|
Index Cond: (keyword_id = k.id)
|
-> Index Scan using title_pkey on title t (cost=0.43..0.49 rows=1 width=21)
|
Index Cond: (id = mk.movie_id)
|
-> Index Scan using movie_id_movie_companies on movie_companies mc (cost=0.43..0.54 rows=5 width=8)
|
Index Cond: (movie_id = mk.movie_id)
|
-> Index Scan using company_name_pkey on company_name cn (cost=0.42..0.45 rows=1 width=4)
|
Index Cond: (id = mc.company_id)
|
Filter: ((country_code)::text = '[us]'::text)
|
-> Index Scan using movie_id_cast_info on cast_info ci (cost=0.56..1.49 rows=35 width=8)
|
Index Cond: (movie_id = mk.movie_id)
|
-> Index Only Scan using name_pkey on name n (cost=0.43..0.46 rows=1 width=4)
|
Index Cond: (id = ci.person_id)
|
-> Index Scan using person_id_aka_name on aka_name an (cost=0.42..0.46 rows=2 width=20)
|
Index Cond: (person_id = ci.person_id)
|
.
|
|
k.keyword ='character-name-in-title' selectivity is 1/134K
|
cn.country_code ='[us]' selectivity is 84843/234997=0.36
|
MariaDB has correctly figured the selectivity of "cn" . It didn't figure selectivity of k, because the stock DDL defines its datatype as "text" (TODO: there's also a variant with VARCHAR(1000) instead of text. Try that?)
MySQL uses guesstimates of 10%. This looks dumb but has worked for them, table "k" is the first.
PostgreSQL was able to correctly determine K's selectivity and has put it as the first element.
|
|
The "next big thing" is Q19d
query_name PostgreSQL 10.4-varchar-eits MySQL-8-eits
|
q19d 4016 161991 40924
|
|
|
SELECT MIN(n.name) AS voicing_actress,
|
MIN(t.title) AS jap_engl_voiced_movie
|
FROM aka_name AS an,
|
char_name AS chn,
|
cast_info AS ci,
|
company_name AS cn,
|
info_type AS it,
|
movie_companies AS mc,
|
movie_info AS mi,
|
name AS n,
|
role_type AS rt,
|
title AS t
|
WHERE ci.note IN ('(voice)',
|
'(voice: Japanese version)',
|
'(voice) (uncredited)',
|
'(voice: English version)')
|
AND cn.country_code ='[us]'
|
AND it.info = 'release dates'
|
AND n.gender ='f'
|
AND rt.role ='actress'
|
AND t.production_year > 2000
|
AND t.id = mi.movie_id
|
AND t.id = mc.movie_id
|
AND t.id = ci.movie_id
|
AND mc.movie_id = ci.movie_id
|
AND mc.movie_id = mi.movie_id
|
AND mi.movie_id = ci.movie_id
|
AND cn.id = mc.company_id
|
AND it.id = mi.info_type_id
|
AND n.id = ci.person_id
|
AND rt.id = ci.role_id
|
AND n.id = an.person_id
|
AND ci.person_id = an.person_id
|
AND chn.id = ci.person_role_id;
|
MariaDB:
+------+-------------+-------+--------+-----------------------------------------------------------------------------------+--------------------------+---------+----------------------------+--------+------------+----------+------------+-------------------------------------------------+
|
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra |
|
+------+-------------+-------+--------+-----------------------------------------------------------------------------------+--------------------------+---------+----------------------------+--------+------------+----------+------------+-------------------------------------------------+
|
| 1 | SIMPLE | rt | ALL | PRIMARY | NULL | NULL | NULL | 12 | 12.00 | 8.33 | 8.33 | Using where |
|
| 1 | SIMPLE | it | ALL | PRIMARY | NULL | NULL | NULL | 113 | 113.00 | 0.88 | 0.88 | Using where; Using join buffer (flat, BNL join) |
|
| 1 | SIMPLE | mi | ref | info_type_id_movie_info,movie_id_movie_info | info_type_id_movie_info | 4 | imdbload.it.id | 208953 | 3036719.00 | 100.00 | 100.00 | |
|
| 1 | SIMPLE | t | eq_ref | PRIMARY | PRIMARY | 4 | imdbload.mi.movie_id | 1 | 1.00 | 56.16 | 59.22 | Using where |
|
| 1 | SIMPLE | mc | ref | company_id_movie_companies,movie_id_movie_companies | movie_id_movie_companies | 4 | imdbload.mi.movie_id | 2 | 3.68 | 100.00 | 100.00 | |
|
| 1 | SIMPLE | cn | eq_ref | PRIMARY | PRIMARY | 4 | imdbload.mc.company_id | 1 | 1.00 | 36.49 | 26.27 | Using where |
|
| 1 | SIMPLE | ci | ref | movie_id_cast_info,person_id_cast_info,person_role_id_cast_info,role_id_cast_info | movie_id_cast_info | 4 | imdbload.mi.movie_id | 15 | 94.91 | 100.00 | 0.48 | Using where |
|
| 1 | SIMPLE | n | eq_ref | PRIMARY | PRIMARY | 4 | imdbload.ci.person_id | 1 | 1.00 | 22.78 | 99.89 | Using where |
|
| 1 | SIMPLE | chn | eq_ref | PRIMARY | PRIMARY | 4 | imdbload.ci.person_role_id | 1 | 1.00 | 100.00 | 100.00 | Using index |
|
| 1 | SIMPLE | an | ref | person_id_aka_name | person_id_aka_name | 4 | imdbload.ci.person_id | 1 | 2.24 | 100.00 | 100.00 | Using index |
|
+------+-------------+-------+--------+-----------------------------------------------------------------------------------+--------------------------+---------+----------------------------+--------+------------+----------+------------+-------------------------------------------------+
|
PostgreSQL
Finalize Aggregate (cost=279430.42..279430.43 rows=1 width=64)
|
-> Gather (cost=279430.20..279430.41 rows=2 width=64)
|
Workers Planned: 2
|
-> Partial Aggregate (cost=278430.20..278430.21 rows=1 width=64)
|
-> Nested Loop (cost=135386.98..278424.64 rows=1112 width=32)
|
-> Nested Loop (cost=135386.55..277688.22 rows=448 width=40)
|
-> Nested Loop (cost=135386.13..276244.21 rows=1234 width=44)
|
Join Filter: (t.id = mc.movie_id)
|
-> Hash Join (cost=135385.70..275610.63 rows=237 width=52)
|
Hash Cond: (mi.info_type_id = it.id)
|
-> Nested Loop (cost=135383.27..275535.35 rows=26750 width=56)
|
-> Nested Loop (cost=135382.84..226825.94 rows=1857 width=48)
|
-> Nested Loop (cost=135382.41..221224.95 rows=3828 width=52)
|
-> Hash Join (cost=135381.98..195376.43 rows=16324 width=33)
|
Hash Cond: (t.id = ci.movie_id)
|
-> Parallel Seq Scan on title t (cost=0.00..49166.69 rows=573843 width=21)
|
Filter: (production_year > 2000)
|
-> Hash (cost=134130.92..134130.92 rows=71925 width=12)
|
-> Nested Loop (cost=0.56..134130.92 rows=71925 width=12)
|
-> Seq Scan on role_type rt (cost=0.00..1.15 rows=1 width=4)
|
Filter: ((role)::text = 'actress'::text)
|
-> Index Scan using role_id_cast_info on cast_info ci (cost=0.56..133345.14 rows=78463 width=16)
|
Index Cond: (role_id = rt.id)
|
Filter: (note = ANY ('{(voice),"(voice: Japanese version)","(voice) (uncredited)","(voice: English version)"}'::text[]))
|
-> Index Scan using name_pkey on name n (cost=0.43..1.58 rows=1 width=19)
|
Index Cond: (id = ci.person_id)
|
Filter: ((gender)::text = 'f'::text)
|
-> Index Only Scan using char_name_pkey on char_name chn (cost=0.43..1.46 rows=1 width=4)
|
Index Cond: (id = ci.person_role_id)
|
-> Index Scan using movie_id_movie_info on movie_info mi (cost=0.43..25.83 rows=40 width=8)
|
Index Cond: (movie_id = t.id)
|
-> Hash (cost=2.41..2.41 rows=1 width=4)
|
-> Seq Scan on info_type it (cost=0.00..2.41 rows=1 width=4)
|
Filter: ((info)::text = 'release dates'::text)
|
-> Index Scan using movie_id_movie_companies on movie_companies mc (cost=0.43..2.61 rows=5 width=8)
|
Index Cond: (movie_id = mi.movie_id)
|
-> Index Scan using company_name_pkey on company_name cn (cost=0.42..1.17 rows=1 width=4)
|
Index Cond: (id = mc.company_id)
|
Filter: ((country_code)::text = '[us]'::text)
|
-> Index Only Scan using person_id_aka_name on aka_name an (cost=0.42..1.62 rows=2 width=4)
|
Index Cond: (person_id = n.id)
|
MySQL
+----+-------------+-------+------------+--------+-----------------------------------------------------------------------------------+--------------------------+---------+----------------------------+------+----------+----------------------------------------------------+
|
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|
+----+-------------+-------+------------+--------+-----------------------------------------------------------------------------------+--------------------------+---------+----------------------------+------+----------+----------------------------------------------------+
|
| 1 | SIMPLE | it | NULL | ALL | PRIMARY | NULL | NULL | NULL | 113 | 0.88 | Using where |
|
| 1 | SIMPLE | rt | NULL | ALL | PRIMARY | NULL | NULL | NULL | 12 | 8.33 | Using where; Using join buffer (Block Nested Loop) |
|
| 1 | SIMPLE | ci | NULL | ref | movie_id_cast_info,person_id_cast_info,person_role_id_cast_info,role_id_cast_info | role_id_cast_info | 4 | imdbload.rt.id | 4905 | 1.99 | Using where |
|
| 1 | SIMPLE | t | NULL | eq_ref | PRIMARY | PRIMARY | 4 | imdbload.ci.movie_id | 1 | 55.15 | Using where |
|
| 1 | SIMPLE | chn | NULL | eq_ref | PRIMARY | PRIMARY | 4 | imdbload.ci.person_role_id | 1 | 100.00 | Using index |
|
| 1 | SIMPLE | n | NULL | eq_ref | PRIMARY | PRIMARY | 4 | imdbload.ci.person_id | 1 | 23.43 | Using where |
|
| 1 | SIMPLE | an | NULL | ref | person_id_aka_name | person_id_aka_name | 4 | imdbload.ci.person_id | 1 | 100.00 | Using index |
|
| 1 | SIMPLE | mc | NULL | ref | company_id_movie_companies,movie_id_movie_companies | movie_id_movie_companies | 4 | imdbload.ci.movie_id | 2 | 100.00 | NULL |
|
| 1 | SIMPLE | cn | NULL | eq_ref | PRIMARY | PRIMARY | 4 | imdbload.mc.company_id | 1 | 5.00 | Using where |
|
| 1 | SIMPLE | mi | NULL | ref | info_type_id_movie_info,movie_id_movie_info | movie_id_movie_info | 4 | imdbload.ci.movie_id | 7 | 1.43 | Using where |
|
+----+-------------+-------+------------+--------+-----------------------------------------------------------------------------------+--------------------------+---------+----------------------------+------+----------+----------------------------------------------------+
|
|
|
The issue is selectivity of ci.note. It is defined as "TEXT", so MariaDB is not aware that condition on the table ci is very selective.
MySQL-8 can compute the selectivity (1.93% vs 40% that I get if I drop the histogram).
|
|
Changing ci.note to VARCHAR(1000) fixes the above.
|
|
Ok, when
- all databases have histograms of their default size
- "text" columns with conditions are changed to VARCHAR(1000) // helping MariaDB' eits
test results vary across runs, but the total benchmark times are in these
ranges:
- PostgreSQL 12: 1.4 - 1.9 min
|
- MariaDB: 9 - 11 min
|
- MySQL-8: 17 - 18 min
|
Total benchmark time is a metric which may skew the result towards the heaviest queries. To get an idea about the whole population, here are plots of relative query times:
|