Data Relationships of Spreadsheets: Relational Database vs. Heterogeneous Data Tables

This blog post is a work in progress. I will fill in the missing details (especially pandas) later. Some of the MATLAB syntax is inexact in the sense that it is a context-dependent description (for example, column names can be a cellstr, a char string, or linear/logical indices).

From a data-relationship point of view, relational databases (RDBMS) and heterogeneous data tables (MATLAB's dataset/table or Python pandas' DataFrame) are the same thing. But a proper database also has to worry about concurrency and provide more consistency tools (the ACID model).

Heterogeneous data tables are almost always column-oriented (column-store) databases, mainly for analyzing data, whereas MySQL and Postgres are row-store databases. You can think of a column-store database as a Struct of Arrays (SoA) and a row-store database as an Array of Structs (AoS). Remember locality = performance: in general, you want to keep the data you frequently access together as close to each other as possible.
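To make the layout difference concrete, here is a tiny illustrative sketch (the record fields and names are made up):

```python
# Row store / Array of Structs (AoS): each record is stored together.
rows = [
    {"id": 1, "name": "alice", "score": 3.2},
    {"id": 2, "name": "bob",   "score": 4.1},
]

# Column store / Struct of Arrays (SoA): each column is stored together.
columns = {
    "id":    [1, 2],
    "name":  ["alice", "bob"],
    "score": [3.2, 4.1],
}

# Analytics sweeps whole columns, so the column layout keeps the values
# it touches next to each other:
mean_score = sum(columns["score"]) / len(columns["score"])

# Transactional access touches whole records, which favors the row layout:
first_record = rows[0]
```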

Mechanics:

| Concepts | SQL | MATLAB table | Pandas DataFrame |
|---|---|---|---|
| tables | FROM | (work with T) | (work with df) |
| columns / variables / fields | SELECT | T.(field) or T(:, cols/varnames) | |
| rows / records | WHERE | T( cond(T), : ) | |
| | HAVING | T_grp( cond(T_grp), : ) | |
| conditions | NOT | ~ | |
| | IS | ==, isequal*() | |
| | IN | ismember() | |
| | BETWEEN | a<=b & b<=c | |
| inject a table into another table | INSERT INTO t2 SELECT vars FROM t1 WHERE rows | T2(end+(1:#rows), vars) = T1(rows, vars) (doable, but throws a warning) | |
| insert a record/row | INSERT INTO t (c1, c2, ..) VALUES (v1, v2, ..) | T = [T; {v1, v2, ...}] (cannot default unspecified columns*) | |
| update records/elements | UPDATE table SET column = content WHERE row_cond | T.(col)(row_cond) = content | |
| new table from a selection | SELECT vars INTO t2 FROM t1 WHERE rows | T2 = T1(rows, vars) | |
| clear table | TRUNCATE TABLE t | T( :, : ) = [] | |
| delete rows | DELETE FROM t WHERE cond (without WHERE it deletes every row one by one with consistency checks; avoid that and use TRUNCATE TABLE instead) | T( cond, : ) = [] | |
* I developed sophisticated tools to allow partial row insertion, but it's not something TMW supports right out of the box. This involves overloading the default value generator for each data type, then extracting the skeleton T( [], : ) to identify the data types.
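The Pandas column above is still mostly blank; as a stopgap, here is a rough sketch of pandas equivalents for the same mechanics. The frame and column names (df, c1, c2) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"c1": [1, 2, 3], "c2": ["a", "b", "c"]})

# SELECT cols FROM t              -> column selection
sub = df[["c1", "c2"]]

# WHERE cond                      -> boolean (logical) row filtering
filtered = df[df["c1"] > 1]

# INSERT INTO t VALUES (...)      -> append a row (append() is gone in pandas 2.x, use concat)
df = pd.concat([df, pd.DataFrame([{"c1": 4, "c2": "d"}])], ignore_index=True)

# UPDATE t SET c2 = ... WHERE ... -> conditional assignment
df.loc[df["c1"] > 2, "c2"] = "z"

# SELECT vars INTO t2 FROM t1     -> new table from a selection (copy() detaches it)
df2 = df.loc[df["c1"] > 1, ["c1"]].copy()

# DELETE FROM t WHERE cond        -> keep the complement of the condition
df = df[~(df["c1"] > 2)]

# TRUNCATE TABLE t                -> drop all rows, keep the columns
df = df.iloc[0:0]
```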

Core database concepts:

| Concepts | SQL | MATLAB (table/dataset) | Pandas (DataFrame) |
|---|---|---|---|
| linear index | CREATE INDEX idx ON T (col) | T.idx = (1:size(T,1))' | |
| group index | CREATE UNIQUE INDEX idx ON T (cols) | [~, T.idx] = sortrows(T, cols) (the old implementation is grp2idx()) | |
| set operations | UNION, INTERSECT | union(), intersect(), setdiff(), setxor() | |
| sort | ORDER BY | sortrows() | |
| unique | SELECT DISTINCT | unique() | |
| reduction / aggregation | F() | @reductionFunctions | |
| grouping | GROUP BY | specify 'GroupingVariables' in varfun(), rowfun(), etc. | |
| partitioning | (set the partition option in the Table Definition) | T1 = T(:, {'key', varnames_1}); T2 = T(:, {'key', varnames_2}) | |
| joins | [type] JOIN* | join(T1, T2, ...) | df.join(df2, ...) |
| cartesian product | CROSS JOIN (misnomer: no keys) | T_cross = [repelem(T1, size(T2,1), 1), repmat(T2, [size(T1,1), 1])] | |
The functional programming concepts map (linear index), filter (logical index), and reduce (summaries & grouping) are heavily used with databases.
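Similarly, a rough pandas sketch of the core concepts above (indexing, sorting, distinct, grouping/reduction, joins, cartesian product). The left/right frames and column names are placeholders; the cross-join shorthand assumes a reasonably recent pandas.

```python
import pandas as pd

left  = pd.DataFrame({"key": [1, 2, 3], "x": [10.0, 20.0, 30.0]})
right = pd.DataFrame({"key": [2, 3, 4], "y": ["b", "c", "d"]})

# linear index      -> the row index (or an explicit column)
left = left.reset_index(drop=True)

# sort / ORDER BY
ordered = left.sort_values("x")

# unique / SELECT DISTINCT
distinct_keys = left["key"].drop_duplicates()

# grouping + reduction / GROUP BY + aggregate
per_key_sum = left.groupby("key")["x"].sum()

# joins (merge() is the general tool; df.join() matches on the index)
joined = left.merge(right, on="key", how="inner")

# cartesian product / CROSS JOIN (pandas >= 1.2)
crossed = left.merge(right, how="cross")
```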

Formal databases have a Table Definition (column properties) that must be specified ahead of time and can be updated in place later on (think of it as static typing). Heterogeneous data tables can figure most of that out on the fly depending on context (think of it as dynamic typing). This impacts the following (a pandas sketch follows the list):

  • data type (creation and conversion)
  • unspecified entries (NULL).
    Often NaN in MATLAB's native types, but I extended it by overloading the relevant data types with an isnull() function and consistently using the same interface
  • default values
  • keys (Indices)
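A small pandas sketch of the same points: dtypes are inferred on the fly but can be declared explicitly, unspecified entries become NaN/NA with isna() as the generic null test, and keys are declared by promoting columns to the index. Column names are made up.

```python
import pandas as pd

# Column dtypes are inferred on the fly ("dynamic typing")...
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# ...but can be declared or converted explicitly, closer to a Table Definition.
df["a"] = df["a"].astype("int64")

# Unspecified entries become NaN/NA (the role of NULL); isna() is the generic
# null test, playing the part of the isnull() interface described above.
df = pd.concat([df, pd.DataFrame([{"a": 3}])], ignore_index=True)  # "b" left out -> NaN
missing_b = df["b"].isna()

# Keys (indices) are declared by promoting columns to the index.
df = df.set_index("a")
```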

SQL features not offered by heterogeneous data tables yet (rough pandas stand-ins are sketched after the list):

  • column name aliases (AS)
  • wildcard over names (*)
  • pattern matching (LIKE)
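Pandas does not offer these literally either, but there are workable stand-ins. The sketch below is illustrative (column names are made up): rename() for aliases, filter() for wildcards over column names, and the str accessor for LIKE-style pattern matching.

```python
import pandas as pd

df = pd.DataFrame({"first_name": ["Ann", "Bob"], "last_name": ["Lee", "Ray"]})

# column alias (AS)        -> rename
aliased = df.rename(columns={"first_name": "fname"})

# wildcard over names (*)  -> filter column names by substring or regex
name_cols = df.filter(like="name")        # or df.filter(regex=r"_name$")

# pattern matching (LIKE)  -> vectorized string predicates on a column
starts_with_a = df[df["first_name"].str.match("A")]
contains_e    = df[df["first_name"].str.contains("e")]
```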

SQL features that are unnatural with heterogeneous data tables’ syntax:

  • implicitly filtering a table with conditions on another table sharing the same key.
    It's an implied join(T, T_cond) + filter operation in MATLAB, often used with ANY, ALL, EXISTS (see the sketch below)

Fundamentally, heterogeneous data tables expect to work with snapshots that don't update often. Therefore they do not offer active checking (callbacks) as SQL does (a sketch of doing such checks manually follows the list):

  • Invariant constraints (CHECK, UNIQUE, NOT NULL, Foreign key).
  • Auto Increment
  • Virtual (dependent) tables (CREATE VIEW)
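With a snapshot-oriented table you end up writing these checks yourself as one-off assertions; a minimal pandas sketch (column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "qty": [4, 0, 7]})

# UNIQUE / primary key  -> assert uniqueness of the key column
assert df["id"].is_unique

# NOT NULL              -> assert there are no missing entries
assert df["id"].notna().all()

# CHECK (qty >= 0)      -> assert an invariant over the rows
assert (df["qty"] >= 0).all()

# AUTO_INCREMENT        -> generate the next key yourself when appending
next_id = df["id"].max() + 1
```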

Know these database/spreadsheet concepts:

  • Tall vs wide tables (illustrated below)
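For reference, the tall vs. wide distinction in pandas terms: melt() reshapes wide to tall and pivot() goes back. The id/height/weight columns are made up.

```python
import pandas as pd

wide = pd.DataFrame({"id": [1, 2], "height": [1.7, 1.8], "weight": [60, 75]})

# wide -> tall: one (id, variable, value) triple per row
tall = wide.melt(id_vars="id", var_name="variable", value_name="value")

# tall -> wide: pivot the variable names back into columns
wide_again = tall.pivot(index="id", columns="variable", values="value").reset_index()
```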

Language logistics (not related to databases)

| Concepts | SQL | MATLAB (table/dataset) | Pandas (DataFrame) |
|---|---|---|---|
| partial display | MySQL: LIMIT; Oracle: FETCH FIRST | T( 1:10, : ) | df.head() |
| comments | -- or /* */ | % or %{ %} | # or """ """ |
| function | CREATE PROCEDURE fcn | function [varargout{:}] = fcn(varargin{:}) | def fcn(): |
| case | CASE WHEN ... THEN ... ELSE ... END | switch ... case ... end | (no case construct; use a dictionary) |
| null if no results | IFNULL( statement ) | function X = null_if_empty(T, cond); X = T( cond, : ); if isempty(X), X = NaN; end | |
| replace nulls | ISNULL(col, target_val) | T.col(isnan(T.col)) = target_val or T = standardizeMissing( T, ... ) | |
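For the last two rows of the table, a pandas sketch: fillna() covers ISNULL-style replacement, and a small helper (null_if_empty below is a hypothetical name mirroring the MATLAB one) covers the null-if-no-results case.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": [1.0, np.nan, 3.0]})

# replace nulls: ISNULL(col, target_val)
df["col"] = df["col"].fillna(0.0)

# null if no results: return NaN when a filter comes back empty
def null_if_empty(frame, cond):
    subset = frame[cond]
    return np.nan if subset.empty else subset

result = null_if_empty(df, df["col"] > 10)
```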


MATLAB: assign an index to unique rows

MATLAB's dataset/table objects' internals often involve identifying unique contents and assigning a unique (grouping) index to them, so the indices can be mapped or joined without actually going through the contents of each row.

In the old days when I was using dataset(), the first generation of table() objects before the rewrite, there was a tool called grp2idx() which assigns the same number to identical items regardless of data type. It was part of the Statistics Toolbox (which you pay extra for), and it does not work if you want to assign a unique index over multiple columns unless the ROWS are identical.

Upon inspection, grp2idx() is overrated. There are two ways to get the same result without paying for the toolbox (the pandas counterparts are sketched after this list):

  • double(categorical(X)): cast to a categorical type (technically you can use nominal/ordinal instead, but those are part of the Statistics Toolbox)
  • Use the 2nd output argument of sort() or sortrows(). I recommend sortrows() because it is overloaded for table() objects and works over multiple columns.
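For comparison, the pandas analogues of such a grouping index (not part of the MATLAB workflow above): factorize() for a single column, and groupby(...).ngroup() when the index should cover multiple columns.

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "y", "x", "x"], "b": [1, 2, 1, 2]})

# single column: factorize() gives identical items the same integer code
codes, uniques = pd.factorize(df["a"])

# multiple columns: ngroup() numbers each distinct row combination,
# i.e. one index per unique ROW, which is what grp2idx() could not do
df["grp_idx"] = df.groupby(["a", "b"], sort=True).ngroup()
```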


DBeaver connecting to MySQL in Namecheap Shared Hosting

Namecheap already provides instructions for connecting the MySQL Workbench client to its shared hosting, which involves SSH tunneling because they disallow direct MySQL connections out of security concerns.

So here are the basic logistics:

  1. SSH to your Namecheap hostname (you can use your domain name) on SSH port 21098
  2. The tunnel listens on local port 5522 and forwards it to MySQL port 3306 on the hosting server (which is "localhost" from the server's own point of view)
  3. Instead of connecting directly to the {namecheap shared hosting server}:3306, connect to localhost:3306 through the tunnel

It's a little confusing how to do this in DBeaver because the "Advanced settings" you will need are hidden by default. The naming of 'local client' (source) vs. 'remote' (destination) in the dialog box is also confusing. It's actually equivalent to

ssh -L ["Local host":]"Local port":"Remote host":"Remote port"
ssh -L [bind_address:]port:host:hostport

bind_address can be left blank. If you are paranoid and don't want other machines to use your MySQL client machine as a gateway (tunneling into your machine to ride on the tunnel you are establishing), set (i.e. bind) it to localhost. Alternatively, bind it to the client network adapter's IP if you do want to allow machines on a trusted network to use this MySQL client computer as a gateway.

For some reason (I suspect it’s IPv6), “Remote host” needs to be set to the loopback adapter 127.0.0.1 (cannot use the special hostname ‘localhost‘).

Remember that the MySQL username and password are the special database-only login credentials you created in cPanel.


Text manipulation idioms in Linux

awk: select columns
sed: stream editor (operations like select, substitute, add/delete lines, modify)
sed expressions can be separated by ";"
sed can substitute all occurrences with the 'g' modifier at the end: 's/(find)/(replace)/g'


# arg I/O
$@: unpack all input args
$*: join all inputs as ONE arg, separated by FIRST character of IFS (empty space if unspecified)

# Remember the double quotes around "$*" or "${array[*]}" usages or else IFS won't function

array[@]: entire array
${array[@]}: unpacks entire array into MULTIPLE arguments
${array[*]}: join entire array into ONE argument separated by FIRST character of IFS (defaults to an empty space if unspecified)
( IFS=$'\n'; echo "${my_array[*]}" )

${#str}: length of string
${#array[@]}: length of array
${array[@]:start:count}: select count elements, array[start] ... array[start+count-1]

${str:="my_string"}: initializes variable str with "my_string" (useful for side-effect)

${str##my_pattern}: delete the longest prefix matching my_pattern (a single # deletes the shortest match)
${str%%my_pattern}: delete the longest suffix matching my_pattern (a single % deletes the shortest match)
${str%?}: delete the last character (the pattern is the single-character wildcard "?")

$( whatever_command ): captures stdout created by running whatever_command
( $str ): tokenize into an array, splitting on IFS (set IFS to choose the delimiter)
( $( whatever_command ) ): combines the two operations above: capture stdout from command and tokenize the results

# https://unix.stackexchange.com/questions/92187/setting-ifs-for-a-single-statement
function strjoin { local IFS="$1"; shift; echo "$*"; }
