{"id":256,"date":"2011-01-20T10:35:57","date_gmt":"2011-01-20T10:35:57","guid":{"rendered":"http:\/\/10sa.com\/sql_stories\/?p=256"},"modified":"2011-05-05T13:14:22","modified_gmt":"2011-05-05T13:14:22","slug":"data-warehouse-in-mysql","status":"publish","type":"post","link":"http:\/\/10sa.com\/sql_stories\/?p=256","title":{"rendered":"Data warehouse in MySQL"},"content":{"rendered":"<p>New challange &#8211; data warehouse for the purpose of specific customer (not sales aggregation).<br \/>\nBefore I started to read know how about data warehouse I prepared my own plan. The result of basic structure was almost fantastic, I only had to borrow the names for the elements from official sources.<br \/>\nThe type of data warehouse is obviously <a href=\"http:\/\/en.wikipedia.org\/wiki\/ROLAP\">ROLAP (Relational OLAP)<\/a>.<br \/>\nIn short, the basic table containing aggregated data is <a href=\"http:\/\/en.wikipedia.org\/wiki\/Fact_table\">fact table<\/a>, dictionary tables (separately for data warehouse needs) are called <a href=\"http:\/\/en.wikipedia.org\/wiki\/Dimension_table\">dimension tables<\/a>.<br \/>\nThe typical data structure is &#8220;star&#8221; or &#8220;snow corn&#8221; &#8211; <a href=\"http:\/\/www.mysqlperformanceblog.com\/wp-content\/uploads\/2010\/06\/Star_Snowflake_dw.png\">this illustrates the difference between&#8217;em<\/a>.<\/p>\n<h5>ETL for Fact Tables<\/h5>\n<p>Due to the fact that the source and destination environment is homogenous, I decide to use triggers to register changes and save them in two tables representing old and new data (it cannot be made with one procedure for all fact tables &#8211; because it requires to execute dynamic queries, which is impossible in triggers).<br \/>\nI work with table `product_to_category`:<\/p>\n<pre lang=\"SQL\">mysql&gt; CREATE TABLE IF NOT EXISTS `product_to_category` (\r\n  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,\r\n  `product_id` int(11) NOT NULL,\r\n  `category_id` int(11) NOT NULL,\r\n  PRIMARY KEY (`product_id`,`category_id`),\r\n  UNIQUE KEY `id_unique` (`id`)\r\n) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=6 ;\r\n\r\nmysql&gt; desc product_to_category;\r\n+-------------+------------------+------+-----+---------+----------------+\r\n| Field       | Type             | Null | Key | Default | Extra          |\r\n+-------------+------------------+------+-----+---------+----------------+\r\n| id          | int(10) unsigned | NO   | UNI | NULL    | auto_increment |\r\n| product_id  | int(11)          | NO   | PRI | NULL    |                |\r\n| category_id | int(11)          | NO   | PRI | NULL    |                |\r\n+-------------+------------------+------+-----+---------+----------------+<\/pre>\n<p>Procedure used to create tables as storage for old and new data:<\/p>\n<pre lang=\"SQL\">DROP PROCEDURE IF EXISTS `OLAP_DDL_SHADOW_TABLE_MAKE`\/\/\r\n\r\nCREATE PROCEDURE `OLAP_DDL_SHADOW_TABLE_MAKE`(IN SourceTable VARCHAR(250), IN ShadowType VARCHAR(10))\r\nMAIN_BLOCK: BEGIN\r\n\r\nDECLARE _ShadowNameOld VARCHAR(255);\r\nDECLARE _ShadowNameNew VARCHAR(255);\r\n\r\nDECLARE _table VARCHAR(200);\r\nDECLARE _schema VARCHAR(32);\r\nDECLARE DestinationSchema VARCHAR(20) DEFAULT 'mms_olap';\r\n\r\nSELECT SUBSTRING_INDEX(TRIM(SourceTable), \".\",  1) INTO _schema;\r\nSELECT SUBSTRING_INDEX(TRIM(SourceTable), \".\", -1) INTO _table;\r\n\r\nIF (CHAR_LENGTH(_schema)) &lt; 4 OR (CHAR_LENGTH(_table))  &lt; 4 THEN\r\n\tSELECT \"Unknown table or schema. 
<h5>ETL for Fact Tables</h5>
<p>Since the source and destination environments are homogeneous, I decided to use triggers to register changes and save them in two tables representing old and new data (this cannot be done with one generic procedure for all fact tables, because it would require dynamic queries, which are not allowed in triggers).<br />
I work with the table `product_to_category`:</p>
<pre lang="SQL">mysql> CREATE TABLE IF NOT EXISTS `product_to_category` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `product_id` int(11) NOT NULL,
  `category_id` int(11) NOT NULL,
  PRIMARY KEY (`product_id`,`category_id`),
  UNIQUE KEY `id_unique` (`id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=6 ;

mysql> desc product_to_category;
+-------------+------------------+------+-----+---------+----------------+
| Field       | Type             | Null | Key | Default | Extra          |
+-------------+------------------+------+-----+---------+----------------+
| id          | int(10) unsigned | NO   | UNI | NULL    | auto_increment |
| product_id  | int(11)          | NO   | PRI | NULL    |                |
| category_id | int(11)          | NO   | PRI | NULL    |                |
+-------------+------------------+------+-----+---------+----------------+</pre>
<p>The procedure used to create the tables that store the old and new data:</p>
<pre lang="SQL">DROP PROCEDURE IF EXISTS `OLAP_DDL_SHADOW_TABLE_MAKE`//

CREATE PROCEDURE `OLAP_DDL_SHADOW_TABLE_MAKE`(IN SourceTable VARCHAR(250), IN ShadowType VARCHAR(10))
MAIN_BLOCK: BEGIN

DECLARE _ShadowNameOld VARCHAR(255);
DECLARE _ShadowNameNew VARCHAR(255);

DECLARE _table VARCHAR(200);
DECLARE _schema VARCHAR(32);
DECLARE DestinationSchema VARCHAR(20) DEFAULT 'mms_olap';

SELECT SUBSTRING_INDEX(TRIM(SourceTable), ".",  1) INTO _schema;
SELECT SUBSTRING_INDEX(TRIM(SourceTable), ".", -1) INTO _table;

IF (CHAR_LENGTH(_schema)) < 4 OR (CHAR_LENGTH(_table)) < 4 THEN
    SELECT "Unknown table or schema. At least 4 characters are required." FROM DUAL;
    LEAVE MAIN_BLOCK;
END IF;

IF ShadowType NOT IN ('fact', 'dimension') THEN
    SELECT "Error. Unknown shadow type (fact|dimension)" FROM DUAL;
    LEAVE MAIN_BLOCK;
END IF;

SET _ShadowNameOld = CONCAT("etl_", _table, "_old");
SET _ShadowNameNew = CONCAT("etl_", _table, "_new");

SET @DeleteQuery = CONCAT("DROP /* OLAP_DDL_SHADOW_TABLE_MAKE */ TABLE IF EXISTS `", DestinationSchema, "`.`", _ShadowNameNew, "`;");
PREPARE MyStmt FROM @DeleteQuery;
EXECUTE MyStmt;

SET @DeleteQuery = CONCAT("DROP /* OLAP_DDL_SHADOW_TABLE_MAKE */ TABLE IF EXISTS `", DestinationSchema, "`.`", _ShadowNameOld, "`;");
PREPARE MyStmt FROM @DeleteQuery;
EXECUTE MyStmt;

IF ShadowType = 'fact' THEN

    -- Fact shadow tables keep both the old and the new row versions.
    SET @CreateQuery = CONCAT("CREATE TABLE /* OLAP_DDL_SHADOW_TABLE_MAKE */ `", DestinationSchema, "`.`", _ShadowNameNew, "`
         (olap_id INT UNSIGNED NOT NULL COMMENT 'Internal OLAP Key', `olap_timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Insert timestamp',
         PRIMARY KEY(`olap_id`, `olap_timestamp`), INDEX `prsn_timestamp_idx` ( `olap_timestamp` )) PARTITION BY HASH(olap_id) PARTITIONS 4 SELECT * FROM `", _schema, "`.`", _table, "` LIMIT 0;");
    PREPARE MyStmt FROM @CreateQuery;
    EXECUTE MyStmt;

    SET @CreateQuery = CONCAT("CREATE TABLE /* OLAP_DDL_SHADOW_TABLE_MAKE */ `", DestinationSchema, "`.`", _ShadowNameOld, "`
         (olap_id INT UNSIGNED NOT NULL COMMENT 'Internal OLAP Key', `olap_timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Insert timestamp',
         PRIMARY KEY(`olap_id`, `olap_timestamp`), INDEX `prsn_timestamp_idx` ( `olap_timestamp` )) PARTITION BY HASH(olap_id) PARTITIONS 4 SELECT * FROM `", _schema, "`.`", _table, "` LIMIT 0;");
    PREPARE MyStmt FROM @CreateQuery;
    EXECUTE MyStmt;

ELSEIF ShadowType = 'dimension' THEN
    -- For dimensions we need only the new data.
    SET @CreateQuery = CONCAT("CREATE TABLE /* OLAP_DDL_SHADOW_TABLE_MAKE */ `", DestinationSchema, "`.`", _ShadowNameNew, "` SELECT * FROM `", _schema, "`.`", _table, "` LIMIT 0;");
    PREPARE MyStmt FROM @CreateQuery;
    EXECUTE MyStmt;

    SET @CreateQuery = CONCAT("ALTER TABLE /* OLAP_DDL_SHADOW_TABLE_MAKE */ `", DestinationSchema, "`.`", _ShadowNameNew, "` ADD COLUMN olap_id VARCHAR(200) NOT NULL COMMENT 'Internal OLAP Key', ADD COLUMN `olap_timestamp` DATE NOT NULL COMMENT 'Insert/Change date', ADD PRIMARY KEY(`olap_id`(200), `olap_timestamp`), ADD INDEX `prsn_timestamp_idx` ( `olap_timestamp` ) ;");
    PREPARE MyStmt FROM @CreateQuery;
    EXECUTE MyStmt;
END IF;
END MAIN_BLOCK//</pre>
<p>Now, if I want to create shadow tables for my fact table source, I issue the following statement:</p>
<pre lang="SQL">CALL OLAP_DDL_SHADOW_TABLE_MAKE('product_to_category', 'fact');</pre>
<p>and I obtain two self-describing tables: etl_product_to_category_old and etl_product_to_category_new.</p>
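<p>Based on the procedure body, the generated shadow tables for this fact source should have roughly the structure below. This is reconstructed from the procedure, not copied from a running server, so treat it as an approximation:</p>
<pre lang="SQL">-- Approximate result of OLAP_DDL_SHADOW_TABLE_MAKE('product_to_category', 'fact'),
-- reconstructed from the procedure body; the live table may differ in minor details.
CREATE TABLE `mms_olap`.`etl_product_to_category_new` (
  `olap_id`        INT UNSIGNED NOT NULL COMMENT 'Internal OLAP Key',
  `olap_timestamp` TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Insert timestamp',
  `id`             INT UNSIGNED NOT NULL,   -- columns below are copied from product_to_category
  `product_id`     INT          NOT NULL,
  `category_id`    INT          NOT NULL,
  PRIMARY KEY (`olap_id`, `olap_timestamp`),
  KEY `prsn_timestamp_idx` (`olap_timestamp`)
) PARTITION BY HASH (olap_id) PARTITIONS 4;
-- etl_product_to_category_old has the same layout.</pre>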
<p>To capture the data changes I use two very similar triggers for the UPDATE timing.<br />
BEFORE:</p>
<pre lang="SQL">DROP TRIGGER IF EXISTS `PRODUCT_TO_CATEGORY_BU`//
CREATE DEFINER = root@localhost TRIGGER PRODUCT_TO_CATEGORY_BU BEFORE UPDATE ON `product_to_category`
FOR EACH ROW
BEGIN

SET @OLAP_TIMESTAMP = CURRENT_TIMESTAMP;
-- REPLACE instead of INSERT, because the newer values are more important
REPLACE INTO /* OLAP_ShadowTableFill */ `etl_product_to_category_old` SELECT OLD.Id, @OLAP_TIMESTAMP, `product_to_category`.* FROM `product_to_category` WHERE Id = OLD.Id ;

END //</pre>
<p>AFTER:</p>
<pre lang="SQL">DROP TRIGGER IF EXISTS `PRODUCT_TO_CATEGORY_AU`//
CREATE DEFINER = root@localhost TRIGGER PRODUCT_TO_CATEGORY_AU AFTER UPDATE ON `product_to_category`
FOR EACH ROW
BEGIN

-- [update] Do not overwrite the new data with REPLACE here. The problem: when one update changes
-- the data and the next update reverts it, there would be no difference left between the new and old record.
INSERT IGNORE INTO /* OLAP_ShadowTableFill */ `etl_product_to_category_new` SELECT OLD.Id, @OLAP_TIMESTAMP /* the variable is set in the BEFORE trigger */, `product_to_category`.* FROM `product_to_category` WHERE Id = NEW.Id ;

END //</pre>
<p>How does it work? I insert three records:</p>
<pre lang="SQL">INSERT INTO `product_to_category` (`product_id`, `category_id`) VALUES (1, 1), (2, 1), (3, 2) ;</pre>
<p>Next, I update one of them:</p>
<pre lang="SQL">UPDATE `product_to_category` SET `category_id` = 2 WHERE `product_id` = 1;</pre>
<p>and now I can read the changes from the shadow tables as follows:</p>
<pre lang="SQL">SELECT `OLD`.Id, `NEW`.category_id `New category`, `OLD`.category_id `Old category`
FROM `etl_product_to_category_new` `NEW`
INNER JOIN `etl_product_to_category_old` `OLD` ON `OLD`.`olap_id`=`NEW`.`olap_id` AND `OLD`.`olap_timestamp` = `NEW`.`olap_timestamp`;</pre>
<p>As you can see, I allow at most one registered change per second, which is sufficient for me. I used the REPLACE statement to make sure that the last change is the one registered (otherwise I could have used INSERT IGNORE).</p>
<p>This is the source used to fill three (3) fact tables, according to my needs. Such data extraction also has its own name: <a href="http://en.wikipedia.org/wiki/Change_data_capture">Change Data Capture (CDC)</a>.<br />
My pattern is described in the above-mentioned link as "Triggers on tables". The only difference is that I use a common procedure to log changes into partitioned tables named "etl_*_old/new", which simultaneously serve as queue tables.<br />
The event that I wrote is tailored to my specific purpose (it includes quasi-transactions implemented on MyISAM, etc.), so there is no need to present it here. I would like to add that I use one more shadow table per fact table, which holds the relation between the last olap_id and the fact_id; thanks to this I know which record should be decremented.</p>
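<p>The aggregating event itself stays out of this post, but just to illustrate how the etl_*_old/new queue could drive a fact table, here is a stripped-down sketch. The fact table name and columns are hypothetical, and the real event additionally handles locking, the olap_id-to-fact_id mapping and the MyISAM quasi-transactions mentioned above:</p>
<pre lang="SQL">-- Purely illustrative, not the author's real event: apply one captured change
-- to a hypothetical fact table that counts products per category.
-- @change_id / @change_ts stand for the (olap_id, olap_timestamp) pair being processed.
UPDATE olap_fact_products_per_category f
JOIN   etl_product_to_category_old o ON o.category_id = f.category_id
SET    f.product_count = f.product_count - 1               -- the old category loses one product
WHERE  o.olap_id = @change_id AND o.olap_timestamp = @change_ts;

UPDATE olap_fact_products_per_category f
JOIN   etl_product_to_category_new n ON n.category_id = f.category_id
SET    f.product_count = f.product_count + 1               -- the new category gains one product
WHERE  n.olap_id = @change_id AND n.olap_timestamp = @change_ts;

-- The processed rows would then be removed from both queue tables.
DELETE FROM etl_product_to_category_old WHERE olap_id = @change_id AND olap_timestamp = @change_ts;
DELETE FROM etl_product_to_category_new WHERE olap_id = @change_id AND olap_timestamp = @change_ts;</pre>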
<h5>ETL for Dimension Tables</h5>
<p>The situation where the data of the dictionary tables also changes has its own name: <a href="http://en.wikipedia.org/wiki/Slowly_changing_dimension">Slowly Changing Dimension</a>. As the name suggests, this type of data should change slowly, so I introduced a limitation: one day as the minimum interval for change registration.<br />
I create an adequate table for `category` with the procedure OLAP_DDL_SHADOW_TABLE_MAKE:</p>
<pre lang="SQL">CALL OLAP_DDL_SHADOW_TABLE_MAKE('category', 'dimension');</pre>
<p>Now I receive the following table:</p>
<pre lang="SQL">mysql> desc etl_category_new;
+----------------+--------------+------+-----+---------+-------+
| Field          | Type         | Null | Key | Default | Extra |
+----------------+--------------+------+-----+---------+-------+
| id             | int(11)      | NO   |     | 0       |       |
| name           | varchar(200) | NO   |     | NULL    |       |
| olap_id        | varchar(200) | NO   | PRI | NULL    |       |
| olap_timestamp | date         | NO   | PRI | NULL    |       |
+----------------+--------------+------+-----+---------+-------+
4 rows in set (0.02 sec)</pre>
<p>The only difference is the data type of olap_timestamp; it stems from the fact that I am not interested in changes made more often than once a day (it is a slowly changing dimension).<br />
I add triggers for INSERT and UPDATE with the AFTER timing, e.g.:</p>
<pre lang="SQL">DROP TRIGGER IF EXISTS `OLAP_DIMENSION_CATEGORY_AI`//
CREATE DEFINER = root@localhost TRIGGER `OLAP_DIMENSION_CATEGORY_AI` AFTER INSERT ON `category`
FOR EACH ROW
BEGIN

SET @OLAP_TIMESTAMP:=CURRENT_DATE;
REPLACE INTO /* OLAP */ `etl_category_new` SELECT *, NEW.Id, @OLAP_TIMESTAMP FROM `category` WHERE id = NEW.id ;

END //</pre>
<p>The next common ETL step is to Transform and Load the data; I use the Event Scheduler for this purpose.<br />
I used type 4 of SCD and added a field which contains the number of versions.</p>
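<p>The event below writes to two tables whose definitions are not shown in this post: olap_category_as_dimension and olap_category_as_dimension_history. Judging only from the columns the event uses, they could look roughly like this; the keys are my guess, in particular the unique key on source_id that the INSERT IGNORE / REPLACE logic relies on:</p>
<pre lang="SQL">-- Guessed from the columns used by the event below; the actual DDL may differ.
CREATE TABLE olap_category_as_dimension (
  `id`         INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `source_id`  INT UNSIGNED NOT NULL,              -- id of the row in the source `category` table
  `name`       VARCHAR(200) NOT NULL,
  `valid_from` DATE NOT NULL,
  `version`    INT UNSIGNED NOT NULL DEFAULT 0,    -- SCD type 4: number of historical versions
  PRIMARY KEY (`id`),
  UNIQUE KEY `source_uq` (`source_id`)             -- lets INSERT IGNORE / REPLACE detect an existing category
);

CREATE TABLE olap_category_as_dimension_history (
  `id`         INT UNSIGNED NOT NULL,
  `source_id`  INT UNSIGNED NOT NULL,
  `name`       VARCHAR(200) NOT NULL,
  `valid_from` DATE NOT NULL,
  `valid_to`   DATE NOT NULL,
  PRIMARY KEY (`source_id`, `valid_from`)          -- duplicate-key target for ON DUPLICATE KEY UPDATE
);</pre>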
<p>This is an example of my event:</p>
<pre lang="SQL">SET NAMES utf8//
DROP EVENT IF EXISTS OLAP_DIMENSION_CATEGORY//
CREATE DEFINER=root@localhost EVENT OLAP_DIMENSION_CATEGORY
ON SCHEDULE EVERY 15 SECOND
STARTS '2011-01-17 15:20:00' DO
 BEGIN
  DECLARE LeaveProcedure BOOLEAN DEFAULT FALSE;
  DECLARE done INT DEFAULT 0;
  DECLARE EventId TINYINT UNSIGNED DEFAULT 1;
  DECLARE EventStatus TINYINT DEFAULT 0;
  DECLARE `MyTimestamp` TIMESTAMP;
  DECLARE `OlapTimestamp` DATE;

  DECLARE CategoryId INT UNSIGNED DEFAULT 0;
  DECLARE CategoryName VARCHAR(200) DEFAULT '';
  DECLARE OlapId VARCHAR(200) DEFAULT '';

  DECLARE MyQueue CURSOR FOR SELECT olap_id, olap_timestamp, name, id FROM `etl_category_new`;  /* WHERE olap_timestamp > '0000-01-01' omitted: the table is MyISAM and it would stay locked for this event */
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
  DECLARE CONTINUE HANDLER FOR SQLWARNING BEGIN END;
  DECLARE CONTINUE HANDLER FOR SQLEXCEPTION
    BEGIN
    END;

  SELECT `status`, `update_timestamp` INTO EventStatus, MyTimestamp FROM `sys_event_lock` WHERE `id` = 1;

  IF LeaveProcedure = FALSE AND ((EventStatus IS NULL OR EventStatus = 0) OR (MyTimestamp < DATE_ADD(NOW(), INTERVAL -5 MINUTE /* Max execution time */ ))) THEN
  -- START TRANSACTION;
    UPDATE `sys_event_lock` SET `status` = 1 WHERE `id` = 1;
  -- COMMIT;
  MAIN_BLOCK: BEGIN
    DECLARE EXIT HANDLER FOR SQLEXCEPTION
    BEGIN
      -- ROLLBACK;
      -- START TRANSACTION;
        UPDATE `sys_event_lock` SET error_timestamp=CURRENT_TIMESTAMP, error_messages = CONCAT_WS("|", error_messages, @LAST_COMMAND) WHERE `id` = 1;
      -- COMMIT;
    END;
      -- START TRANSACTION;
      SET done = 0;
      OPEN MyQueue;
      REPEAT FETCH MyQueue INTO OlapId, OlapTimestamp, CategoryName, CategoryId;
        IF done = 0 THEN
          SET @LAST_COMMAND := CONCAT(
            "INSERT IGNORE /* OLAP */ INTO olap_category_as_dimension (`id`, `source_id`, `name`, `valid_from`) VALUES(", CONCAT_WS(",", 'NULL', CategoryId, QUOTE(CategoryName), QUOTE(OlapTimestamp) ), ");"
            );
          PREPARE MyStmt FROM @LAST_COMMAND;
          EXECUTE MyStmt;

          -- was anything inserted?
          IF (ROW_COUNT() = 0) THEN
            -- Move or update the record in the history table
            SET @LAST_COMMAND := CONCAT("INSERT INTO olap_category_as_dimension_history (id, source_id, `name`, valid_from, valid_to) SELECT id, source_id, name, valid_from, ", QUOTE(DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)) /* valid until yesterday */, " FROM olap_category_as_dimension
              WHERE source_id= ", CategoryId, " ON DUPLICATE KEY UPDATE /* Only category attributes */ `name`=VALUES(`name`); ");
            PREPARE MyStmt FROM @LAST_COMMAND;
            EXECUTE MyStmt;
            -- Replace the record with the new values
            SET @LAST_COMMAND := CONCAT(
            "REPLACE /* OLAP */ INTO olap_category_as_dimension (`id`, `source_id`, `name`, `valid_from`, version)
              VALUES(", CONCAT_WS(",", 'NULL', CategoryId, QUOTE(CategoryName), QUOTE(OlapTimestamp)), ", (SELECT COUNT(*) FROM olap_category_as_dimension_history WHERE source_id=", CategoryId, "));"
            );
            PREPARE MyStmt FROM @LAST_COMMAND;
            EXECUTE MyStmt;
          END IF;

          SET @LAST_COMMAND := CONCAT(
            "DELETE FROM `etl_category_new` /* OLAP */ WHERE olap_id=", QUOTE(OlapId), " AND olap_timestamp= ", QUOTE(OlapTimestamp), " ;"
            );

          PREPARE MyStmt FROM @LAST_COMMAND;
          EXECUTE MyStmt;

        END IF;
      UNTIL done = 1 END REPEAT;
      CLOSE MyQueue;
      -- COMMIT;
  END MAIN_BLOCK;

  -- START TRANSACTION;
  UPDATE `sys_event_lock` SET `status` = 0, `duration` = UNIX_TIMESTAMP(CURRENT_TIMESTAMP) - UNIX_TIMESTAMP(update_timestamp) WHERE `id` = EventId;
  UPDATE `sys_event_lock` SET `duration_total` = `duration` + `duration_total` WHERE `id` = EventId;
  -- COMMIT;
  END IF;
 END//</pre>
<p>In order to fill the fact table with historical data I have to prepare a separate process. It can also be useful in the case of a disaster or for weekly data reconstruction.</p>
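<p>Such a backfill could be as simple as re-aggregating the source table directly; a minimal sketch, again using the hypothetical products-per-category fact table from the earlier example:</p>
<pre lang="SQL">-- Illustrative backfill / weekly reconstruction of a hypothetical fact table,
-- computed straight from the OLTP source instead of from the CDC queue.
-- Assumes category_id is a unique key of the fact table.
INSERT INTO olap_fact_products_per_category (category_id, product_count)
SELECT category_id, COUNT(*)
FROM   `product_to_category`
GROUP  BY category_id
ON DUPLICATE KEY UPDATE product_count = VALUES(product_count);</pre>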