{"id":2839,"date":"2021-10-23T23:52:02","date_gmt":"2021-10-24T07:52:02","guid":{"rendered":"https:\/\/wonghoi.humgar.com\/blog\/?p=2839"},"modified":"2021-11-15T22:14:27","modified_gmt":"2021-11-16T06:14:27","slug":"relational-database-concepts-heterogenous-data-tables-spreadsheets","status":"publish","type":"post","link":"https:\/\/wonghoi.humgar.com\/blog\/2021\/10\/23\/relational-database-concepts-heterogenous-data-tables-spreadsheets\/","title":{"rendered":"Data Relationships of Spreadsheets: Relational Database vs. Heterogenous Data Tables"},"content":{"rendered":"\n<p>This blog post is development in process. Will fill in the details missing details (especially pandas) later. Some of the MATLAB syntax are inaccurate in the sense that it&#8217;s just a description that is context dependent (such as column names can be cellstr, char string or linear\/logical indices).<\/p>\n\n\n\n<p>From data relationship point of view, relation database (RDMBS), heterogenous data tables (MATLAB&#8217;s dataset\/table or Python Panda&#8217;s Dataframe) are the same thing. But a proper database have to worry about concurrency issues and provide more consistency tools (ACID model). <\/p>\n\n\n\n<p>Heterogenous data tables are almost always column-oriented database (mainly for analyzing data) where MySQL and Postgres are row-store database. You can think of column-store database as Struct of Arrays (SoA) and row-store database as Array of Struct (AoS). Remember locality = performance: in general, you want to put the stuff you frequently want to access together as close to each other as possible.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<p>Mechanics:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th><strong>Concepts<\/strong><\/th><th><strong>SQL<\/strong><\/th><th><strong>MATLAB table<\/strong><\/th><th><strong>Pandas Dataframe<\/strong><\/th><\/tr><\/thead><tbody><tr><td> <\/td><td><\/td><td><\/td><td><\/td><\/tr><tr><td>tables<\/td><td>FROM<\/td><td>(work with T)<\/td><td>(work with df)<\/td><\/tr><tr><td>columns<br>variables<br>fields <\/td><td>SELECT <\/td><td><code data-enlighter-language=\"matlab\" class=\"EnlighterJSRAW\">T.(field)<\/code>  <br><code data-enlighter-language=\"raw\" class=\"EnlighterJSRAW\">T(:, cols\/varnames) <\/code><\/td><td><\/td><\/tr><tr><td>rows<br>records <\/td><td>WHERE<br><br>HAVING <\/td><td><code data-enlighter-language=\"nginx\" class=\"EnlighterJSRAW\">T( cond(T), : )<\/code><br><br><code data-enlighter-language=\"md\" class=\"EnlighterJSRAW\">T_grp( cond(T_grp), : ) <\/code><\/td><td><\/td><\/tr><tr><td>conditions<\/td><td>NOT<br>IS<br>IN<br>BETWEEN <\/td><td><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">~<\/code><br><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">==, isequal*()<\/code><br><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">ismember()<\/code><br><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">a&lt;=b &amp; b&lt;=c<\/code><\/td><td><\/td><\/tr><tr><td>Inject table to <br>another table<\/td><td>INSERT INTO t2<br>SELECT vars FROM t1<br>WHERE rows<\/td><td><code data-enlighter-language=\"matlab\" class=\"EnlighterJSRAW\">T2(end+(1:#rows), vars) = T1(rows, vars)<\/code><br>(Doable, throws warning)<\/td><td><br><\/td><\/tr><tr><td>Insert record\/row<\/td><td>INSERT INTO t (c1, c2, ..)<br>VALUES (v1, v2, ..)<\/td><td><code data-enlighter-language=\"matlab\" class=\"EnlighterJSRAW\">T=[T; {v1, v2, ...}]<\/code><br>(Cannot default for unspecified column*)<\/td><td><\/td><\/tr><tr><td>update records\/elements<\/td><td>UPDATE table <br>SET column = content <br>WHERE row_cond<\/td><td><code data-enlighter-language=\"matlab\" class=\"EnlighterJSRAW\">T.(col)(row_cond) = content<\/code><\/td><td><\/td><\/tr><tr><td>New table <br>from selection<\/td><td>SELECT vars<br>INTO t2<br>FROM t1<br>WHERE rows<\/td><td><code data-enlighter-language=\"matlab\" class=\"EnlighterJSRAW\">T2 = T1(rows, vars) <\/code><\/td><td><\/td><\/tr><tr><td>clear table<\/td><td>TRUNCATE TABLE t<\/td><td><code data-enlighter-language=\"matlab\" class=\"EnlighterJSRAW\">T( :, : )=[]<\/code><\/td><td><\/td><\/tr><tr><td>delete rows<\/td><td>DELETE FROM t WHERE cond<br>(if WHERE is not specified, it kills all rows one by one with consistency checks. Avoid it and use TRUNCATE TABLE instead)<\/td><td><code data-enlighter-language=\"matlab\" class=\"EnlighterJSRAW\">T( cond, : ) = []<\/code><\/td><td><\/td><\/tr><\/tbody><\/table><figcaption>* I developed sophisticated tools to allow partial row insertion, but it&#8217;s not something TMW supports right out of the box. This involves <a href=\"https:\/\/wonghoi.humgar.com\/blog\/2016\/08\/10\/matlab-techniques-self-identifying-by-type-methods\/\" data-type=\"post\" data-id=\"238\">overloading the default value generator<\/a> for each data type then extract the skeleton <code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">T( [], : )<\/code> to identify the data types.<\/figcaption><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>Core database concepts:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Concepts<\/strong><\/td><td><strong>SQL<\/strong><\/td><td><strong>MATLAB (table\/dataset)<\/strong><\/td><td><strong>Pandas (Dataframe)<\/strong><\/td><\/tr><tr><td>linear index<\/td><td>CREATE INDEX idx ON T (col)<\/td><td><code data-enlighter-language=\"matlab\" class=\"EnlighterJSRAW\">T.idx = (1:size(T,1))'<\/code><\/td><td><\/td><\/tr><tr><td>group index<\/td><td>CREATE UNIQUE INDEX idx ON T (cols)<\/td><td><code data-enlighter-language=\"matlab\" class=\"EnlighterJSRAW\">[~, T.idx] = sortrows(T, cols)<\/code><br>(old implementation is <code>grp2idx()<\/code>)<\/td><td><\/td><\/tr><tr><td>set operations<\/td><td>UNION<br>INTERSET<\/td><td>union()<br>intersect()<br>setdiff(), setxor()<\/td><td><\/td><\/tr><tr><td>sort<\/td><td>ORDER BY<\/td><td>sortrows()<\/td><td><\/td><\/tr><tr><td>unique<\/td><td>SELECT DISTINCT<\/td><td>unique()<\/td><td><\/td><\/tr><tr><td>reduction<br>aggregration<\/td><td>F()<\/td><td>@reductionFunctions<\/td><td><\/td><\/tr><tr><td>grouping<\/td><td>GROUP BY<\/td><td>Specifying &#8216;GroupingVariables&#8217; in varfun(), rowfun(), etc.<\/td><td><\/td><\/tr><tr><td>partitioning<\/td><td>(set partition option in <em>Table Definition<\/em>)<\/td><td><code data-enlighter-language=\"matlab\" class=\"EnlighterJSRAW\">T1=T(:, {'key', varnames_1})<\/code><br><code data-enlighter-language=\"matlab\" class=\"EnlighterJSRAW\">T2=T(:, {'key', varnames_2})<\/code><\/td><td><\/td><\/tr><tr><td>joins<\/td><td>[type] JOIN<\/td><td>*<code data-enlighter-language=\"matlab\" class=\"EnlighterJSRAW\">join(T1, T2, ...)<\/code><\/td><td>df.join(df2, &#8230;)<\/td><\/tr><tr><td><a href=\"https:\/\/www.tutorialrepublic.com\/sql-tutorial\/sql-cross-join-operation.php\" target=\"_blank\" rel=\"noreferrer noopener\">cartesian product<\/a><\/td><td>CROSS JOIN <br>(misnomer, no keys)<\/td><td><code data-enlighter-language=\"matlab\" class=\"EnlighterJSRAW\">T_cross = [repelem(T1, size(T2,1), 1), repmat(T2, [size(T1,1), 1])]<\/code><\/td><td><\/td><\/tr><\/tbody><\/table><figcaption>Function programming concepts map (linear index), filter (logical index), reduce (summary &amp; group) are heavily used with databases<\/figcaption><\/figure>\n\n\n\n<p>Formal databases has a <em>Table Definition<\/em> (<em>Column Properties<\/em>) that must be specified ahead of time and can be updated <strong>in-place<\/strong> later on (think of it as static typing). Heterogenous Data Tables can figure most of that out on the fly depending on context (think of it as dynamic typing). This impacts:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>data type (creation and conversion)<\/li><li>unspecified entries (<code>NULL<\/code>). <br>Often NaN in MATLAB native types but I extended it by <a rel=\"noreferrer noopener\" href=\"https:\/\/wonghoi.humgar.com\/blog\/2016\/08\/10\/matlab-techniques-self-identifying-by-type-methods\/\" data-type=\"post\" data-id=\"238\" target=\"_blank\">overloading relevant data types<\/a> with a isnull() function and consistently use the same interface<\/li><li>default values<\/li><li>keys (Indices)<\/li><\/ul>\n\n\n\n<p>SQL features not offered by heterogenous data tables yet:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>column name aliases (<code>AS<\/code>)<\/li><li>wildcard over names (<code>*<\/code>)<\/li><li>pattern matching (<code>LIKE<\/code>)<\/li><\/ul>\n\n\n\n<p>SQL features that are unnatural with heterogeneous data tables&#8217; syntax:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>implicitly filter a table with conditions in another table sharing the same key. <br>It&#8217;s an implied join(T, T_cond)+filter operation in MATLAB. Often used with ANY, ALL, EXISTS<\/li><\/ul>\n\n\n\n<p>Fundamentally heterogenous data types expects working with snapshots that doesn&#8217;t update often. Therefore they do not offer active checking (callbacks) as in SQL:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Invariant constraints (<code>CHECK<\/code>, <code>UNIQUE<\/code>, <code>NOT NULL<\/code>, Foreign key).<\/li><li>Auto Increment<\/li><li>Virtual (dependent) tables (<code>CREATE VIEW<\/code>)<\/li><\/ul>\n\n\n\n<p>Know these database\/spreadsheet concepts:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Tall vs wide tables<\/li><\/ul>\n\n\n\n<p>Language logistics (not related to database)<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Concepts<\/strong><\/td><td><strong>SQL<\/strong><\/td><td><strong>MATLAB (table\/dataset)<\/strong><\/td><td><strong>Pandas (Dataframe)<\/strong><\/td><\/tr><tr><td>Partial display<\/td><td><em>MySQL<\/em>: LIMIT<br><em>Oracle<\/em>: FETCH FIRST<\/td><td><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">T( 1:10, : )<\/code><\/td><td><code data-enlighter-language=\"python\" class=\"EnlighterJSRAW\">df.head()<\/code><\/td><\/tr><tr><td>Comments<\/td><td><code>--<\/code> or <code>\/* <\/code>&#8230;<code> *\/<\/code><\/td><td><code>%<\/code> or <code>%{  <\/code>&#8230;<code> %}<\/code><\/td><td><code>#<\/code>  or <code>\"\"\"<\/code> &#8230; <code>\"\"\"<\/code><\/td><\/tr><tr><td>function<\/td><td>CREATE PROCEDURE fcn<\/td><td><code data-enlighter-language=\"matlab\" class=\"EnlighterJSRAW\">function [varargout{:}]=fcn(varargin{:})<\/code><\/td><td><code data-enlighter-language=\"python\" class=\"EnlighterJSRAW\">def fcn:<\/code><\/td><\/tr><tr><td>case<\/td><td>CASE WHEN THEN ELSE END<\/td><td>switch case end<\/td><td>(no case structure, use dictionary)<\/td><\/tr><tr><td>Null if no results<\/td><td>IFNULL ( statement )<\/td><td><code><code data-enlighter-language=\"matlab\" class=\"EnlighterJSRAW\">function X=null_if_empty(T, cond)<\/code><\/code><br><code data-enlighter-language=\"generic\" class=\"EnlighterJSRAW\">  <\/code><code><code data-enlighter-language=\"matlab\" class=\"EnlighterJSRAW\">X=T( cond, : ); <\/code><\/code><br><code><code data-enlighter-language=\"raw\" class=\"EnlighterJSRAW\">  <\/code><code data-enlighter-language=\"matlab\" class=\"EnlighterJSRAW\">if( isempty(X) ) X=NaN;<\/code><\/code><\/td><td><\/td><\/tr><tr><td>Replace nulls<\/td><td>ISNULL(col, target_val)<\/td><td><code data-enlighter-language=\"matlab\" class=\"EnlighterJSRAW\">T.col(isnan(T.col)) = target_val<\/code><br><code data-enlighter-language=\"matlab\" class=\"EnlighterJSRAW\">T = standardizeMissing( T, ... )<\/code><\/td><td><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p><\/p>\n<div class=\"pvc_clear\"><\/div><p id=\"pvc_stats_2839\" class=\"pvc_stats all  \" data-element-id=\"2839\" style=\"\"><i class=\"pvc-stats-icon medium\" aria-hidden=\"true\"><svg aria-hidden=\"true\" focusable=\"false\" data-prefix=\"far\" data-icon=\"chart-bar\" role=\"img\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" viewBox=\"0 0 512 512\" class=\"svg-inline--fa fa-chart-bar fa-w-16 fa-2x\"><path fill=\"currentColor\" d=\"M396.8 352h22.4c6.4 0 12.8-6.4 12.8-12.8V108.8c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 6.4-12.8 12.8v230.4c0 6.4 6.4 12.8 12.8 12.8zm-192 0h22.4c6.4 0 12.8-6.4 12.8-12.8V140.8c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 6.4-12.8 12.8v198.4c0 6.4 6.4 12.8 12.8 12.8zm96 0h22.4c6.4 0 12.8-6.4 12.8-12.8V204.8c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 6.4-12.8 12.8v134.4c0 6.4 6.4 12.8 12.8 12.8zM496 400H48V80c0-8.84-7.16-16-16-16H16C7.16 64 0 71.16 0 80v336c0 17.67 14.33 32 32 32h464c8.84 0 16-7.16 16-16v-16c0-8.84-7.16-16-16-16zm-387.2-48h22.4c6.4 0 12.8-6.4 12.8-12.8v-70.4c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 6.4-12.8 12.8v70.4c0 6.4 6.4 12.8 12.8 12.8z\" class=\"\"><\/path><\/svg><\/i> <img loading=\"lazy\" decoding=\"async\" width=\"16\" height=\"16\" alt=\"Loading\" src=\"https:\/\/wonghoi.humgar.com\/blog\/wp-content\/plugins\/page-views-count\/ajax-loader-2x.gif\" border=0 \/><\/p><div class=\"pvc_clear\"><\/div>","protected":false},"excerpt":{"rendered":"<p>This blog post is development in process. Will fill in the details missing details (especially pandas) later. Some of the MATLAB syntax are inaccurate in the sense that it&#8217;s just a description that is context dependent (such as column names &hellip; <a href=\"https:\/\/wonghoi.humgar.com\/blog\/2021\/10\/23\/relational-database-concepts-heterogenous-data-tables-spreadsheets\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n<div class=\"pvc_clear\"><\/div>\n<p id=\"pvc_stats_2839\" class=\"pvc_stats all  \" data-element-id=\"2839\" style=\"\"><i class=\"pvc-stats-icon medium\" aria-hidden=\"true\"><svg aria-hidden=\"true\" focusable=\"false\" data-prefix=\"far\" data-icon=\"chart-bar\" role=\"img\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" viewBox=\"0 0 512 512\" class=\"svg-inline--fa fa-chart-bar fa-w-16 fa-2x\"><path fill=\"currentColor\" d=\"M396.8 352h22.4c6.4 0 12.8-6.4 12.8-12.8V108.8c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 6.4-12.8 12.8v230.4c0 6.4 6.4 12.8 12.8 12.8zm-192 0h22.4c6.4 0 12.8-6.4 12.8-12.8V140.8c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 6.4-12.8 12.8v198.4c0 6.4 6.4 12.8 12.8 12.8zm96 0h22.4c6.4 0 12.8-6.4 12.8-12.8V204.8c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 6.4-12.8 12.8v134.4c0 6.4 6.4 12.8 12.8 12.8zM496 400H48V80c0-8.84-7.16-16-16-16H16C7.16 64 0 71.16 0 80v336c0 17.67 14.33 32 32 32h464c8.84 0 16-7.16 16-16v-16c0-8.84-7.16-16-16-16zm-387.2-48h22.4c6.4 0 12.8-6.4 12.8-12.8v-70.4c0-6.4-6.4-12.8-12.8-12.8h-22.4c-6.4 0-12.8 6.4-12.8 12.8v70.4c0 6.4 6.4 12.8 12.8 12.8z\" class=\"\"><\/path><\/svg><\/i> <img loading=\"lazy\" decoding=\"async\" width=\"16\" height=\"16\" alt=\"Loading\" src=\"https:\/\/wonghoi.humgar.com\/blog\/wp-content\/plugins\/page-views-count\/ajax-loader-2x.gif\" border=0 \/><\/p>\n<div class=\"pvc_clear\"><\/div>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[59,60,10,29],"tags":[],"class_list":["post-2839","post","type-post","status-publish","format-standard","hentry","category-concepts","category-database","category-matlab","category-programming"],"_links":{"self":[{"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/posts\/2839","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/comments?post=2839"}],"version-history":[{"count":35,"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/posts\/2839\/revisions"}],"predecessor-version":[{"id":3129,"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/posts\/2839\/revisions\/3129"}],"wp:attachment":[{"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/media?parent=2839"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/categories?post=2839"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/tags?post=2839"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}