Data Relationships of Spreadsheets: Relational Database vs. Heterogenous Data Tables

This blog post is development in process. Will fill in the details missing details (especially pandas) later. Some of the MATLAB syntax are inaccurate in the sense that it’s just a description that is context dependent (such as column names can be cellstr, char string or linear/logical indices).

From data relationship point of view, relation database (RDMBS), heterogenous data tables (MATLAB’s dataset/table or Python Panda’s Dataframe) are the same thing. But a proper database have to worry about concurrency issues and provide more consistency tools (ACID model).

Heterogenous data tables are almost always column-oriented database (mainly for analyzing data) where MySQL and Postgres are row-store database. You can think of column-store database as Struct of Arrays (SoA) and row-store database as Array of Struct (AoS). Remember locality = performance: in general, you want to put the stuff you frequently want to access together as close to each other as possible.

Mechanics:

ConceptsSQLMATLAB tablePandas Dataframe
tablesFROM(work with T)(work with df)
columns
variables
fields
SELECT T.(field)
T(:, cols/varnames)
rows
records
WHERE

HAVING
T( cond(T), : )

T_grp( cond(T_grp), : )
conditionsNOT
IS
IN
BETWEEN
~
==, isequal*()
ismember()
a<=b & b<=c
Inject table to
another table
INSERT INTO t2
SELECT vars FROM t1
WHERE rows
T2(end+(1:#rows), vars) = T1(rows, vars)
(Doable, throws warning)

Insert record/rowINSERT INTO t (c1, c2, ..)
VALUES (v1, v2, ..)
T=[T; {v1, v2, ...}]
(Cannot default for unspecified column*)
update records/elementsUPDATE table
SET column = content
WHERE row_cond
T.(col)(row_cond) = content
New table
from selection
SELECT vars
INTO t2
FROM t1
WHERE rows
T2 = T1(rows, vars)
clear tableTRUNCATE TABLE tT( :, : )=[]
delete rowsDELETE FROM t WHERE cond
(if WHERE is not specified, it kills all rows one by one with consistency checks. Avoid it and use TRUNCATE TABLE instead)
T( cond, : ) = []
* I developed sophisticated tools to allow partial row insertion, but it’s not something TMW supports right out of the box. This involves overloading the default value generator for each data type then extract the skeleton T( [], : ) to identify the data types.

Core database concepts:

ConceptsSQLMATLAB (table/dataset)Pandas (Dataframe)
linear indexCREATE INDEX idx ON T (col)T.idx = (1:size(T,1))'
group indexCREATE UNIQUE INDEX idx ON T (cols)[~, T.idx] = sortrows(T, cols)
(old implementation is grp2idx())
set operationsUNION
INTERSET
union()
intersect()
setdiff(), setxor()
sortORDER BYsortrows()
uniqueSELECT DISTINCTunique()
reduction
aggregration
F()@reductionFunctions
groupingGROUP BYSpecifying ‘GroupingVariables’ in varfun(), rowfun(), etc.
partitioning(set partition option in Table Definition)T1=T(:, {'key', varnames_1})
T2=T(:, {'key', varnames_2})
joins[type] JOIN*join(T1, T2, ...)df.join(df2, …)
cartesian productCROSS JOIN
(misnomer, no keys)
T_cross = [repelem(T1, size(T2,1), 1), repmat(T2, [size(T1,1), 1])]
Function programming concepts map (linear index), filter (logical index), reduce (summary & group) are heavily used with databases

Formal databases has a Table Definition (Column Properties) that must be specified ahead of time and can be updated in-place later on (think of it as static typing). Heterogenous Data Tables can figure most of that out on the fly depending on context (think of it as dynamic typing). This impacts:

  • data type (creation and conversion)
  • unspecified entries (NULL).
    Often NaN in MATLAB native types but I extended it by overloading relevant data types with a isnull() function and consistently use the same interface
  • default values
  • keys (Indices)

SQL features not offered by heterogenous data tables yet:

  • column name aliases (AS)
  • wildcard over names (*)
  • pattern matching (LIKE)

SQL features that are unnatural with heterogeneous data tables’ syntax:

  • implicitly filter a table with conditions in another table sharing the same key.
    It’s an implied join(T, T_cond)+filter operation in MATLAB. Often used with ANY, ALL, EXISTS

Fundamentally heterogenous data types expects working with snapshots that doesn’t update often. Therefore they do not offer active checking (callbacks) as in SQL:

  • Invariant constraints (CHECK, UNIQUE, NOT NULL, Foreign key).
  • Auto Increment
  • Virtual (dependent) tables (CREATE VIEW)

Know these database/spreadsheet concepts:

  • Tall vs wide tables

Language logistics (not related to database)

ConceptsSQLMATLAB (table/dataset)Pandas (Dataframe)
Partial displayMySQL: LIMIT
Oracle: FETCH FIRST
T( 1:10, : )df.head()
Comments-- or /* */% or %{ %}# or """"""
functionCREATE PROCEDURE fcnfunction [varargout{:}]=fcn(varargin{:})def fcn:
caseCASE WHEN THEN ELSE ENDswitch case end(no case structure, use dictionary)
Null if no resultsIFNULL ( statement )function X=null_if_empty(T, cond)
X=T( cond, : );
if( isempty(X) ) X=NaN;
Replace nullsISNULL(col, target_val)T.col(isnan(T.col)) = target_val
T = standardizeMissing( T, ... )

Loading

Explain XOR tricks succinctly with two properties Change detector, toggler

Most teaching materials on XOR starts from its straight definition then a bunch of recipes for exploiting XOR. It is not convenient to remember and not intuitive to come up with new tricks on our own. Instead, I’d like to describe XOR as two intuitive operations:

  1. Change detector (The logic definition itself)
  2. Toggling light switches (Flipping twice undoes it)

The second interpretation (toggling) is actually way more powerful and easy to remember than the first interpretation (definition).


With the toggling interpretation, you don’t have to think hard to come up with the four algebraic properties:

  1. Associativity and commutativity: we only care about whether the individual light switches (bits) are flipped odd (flipped) or even (not-flipped) number of times at the end of the day. When it happens in what order does not matter.
  2. Identity: flipping NONE of the switches (a mask of all 0) leaves it alone
  3. Inverse: pick only the switches that are already turned on (using itself as mask) and flip them (xor) guarantees to turn everything off (0)

The XOR swap trick can be constructed by exploiting the fact that flipping the same switch twice (at any point) reverts it back to where it started:

x y
Start x_0 y_0
Mix y_0 with x_0 x_1 = x_0 \oplus y_0 No info is lost, still have y_0 somewhere to undo if we wish.
The incumbent y_0 is used to cancel the y_0 inside x_1 to recover x_0 \begin{array}{rcl} y_2 & = & y_0 \oplus x_1 \\ & = & y_0 \oplus (x_0 \oplus y_0)\\ & = & x_0 \end{array} Since y_0 is already saved (mixed with) x_1, we don’t have to worry about losing it as long as we get to our main goal of putting x_0 in y, which will be used as a recovery key later.
Now use the x_0 stored in y_2 to unmix x_1 to recover y_0 \begin{array}{rcl} x_3 & = & x_1 \oplus y_2 \\ & = & (x_0 \oplus y_0) \oplus x_0\\ & = & y_0 \end{array}

Note that this trick is obsolete for modern computer architecture because

  • it enforces strict dependence that kills any predictive pipeline (speedup) in modern computer architecture, and
  • will yield zero all the time if you try to swap with itself in-place (same memory location), which is wrong because you expect the swap to do nothing.

Adding a check for the degenerate case (self-swap) slows the program down even further, making it worse than using a temporary storage for swapping. Therefore there’s no good reason to use it unless you have an exotic use case.

This is mainly used as a teaching device (or homework problem) to teach that XOR-ing even instances of the same variable in the chain cancels it.


 

Loading