Data Relationships of Spreadsheets: Relational Database vs. Heterogenous Data Tables

Posted on October 23, 2021 by admin

This blog post is development in process. Will fill in the details missing details (especially pandas) later. Some of the MATLAB syntax are inaccurate in the sense that it’s just a description that is context dependent (such as column names can be cellstr, char string or linear/logical indices).

From data relationship point of view, relation database (RDMBS), heterogenous data tables (MATLAB’s dataset/table or Python Panda’s Dataframe) are the same thing. But a proper database have to worry about concurrency issues and provide more consistency tools (ACID model).

Heterogenous data tables are almost always column-oriented database (mainly for analyzing data) where MySQL and Postgres are row-store database. You can think of column-store database as Struct of Arrays (SoA) and row-store database as Array of Struct (AoS). Remember locality = performance: in general, you want to put the stuff you frequently want to access together as close to each other as possible.

Mechanics:

Concepts	SQL	MATLAB table	Pandas Dataframe

tables	FROM	(work with T)	(work with df)
columns variables fields	SELECT	`T.(field)` `T(:, cols/varnames)`
rows records	WHERE HAVING	`T( cond(T), : )` `T_grp( cond(T_grp), : )`
conditions	NOT IS IN BETWEEN	`~` `==, isequal*()` `ismember()` `a<=b & b<=c`
Inject table to another table	INSERT INTO t2 SELECT vars FROM t1 WHERE rows	`T2(end+(1:#rows), vars) = T1(rows, vars)` (Doable, throws warning)
Insert record/row	INSERT INTO t (c1, c2, ..) VALUES (v1, v2, ..)	`T=[T; {v1, v2, ...}]` (Cannot default for unspecified column*)
update records/elements	UPDATE table SET column = content WHERE row_cond	`T.(col)(row_cond) = content`
New table from selection	SELECT vars INTO t2 FROM t1 WHERE rows	`T2 = T1(rows, vars)`
clear table	TRUNCATE TABLE t	`T( :, : )=[]`
delete rows	DELETE FROM t WHERE cond (if WHERE is not specified, it kills all rows one by one with consistency checks. Avoid it and use TRUNCATE TABLE instead)	`T( cond, : ) = []`

* I developed sophisticated tools to allow partial row insertion, but it’s not something TMW supports right out of the box. This involves overloading the default value generator for each data type then extract the skeleton T( [], : ) to identify the data types.

Core database concepts:

Concepts	SQL	MATLAB (table/dataset)	Pandas (Dataframe)
linear index	CREATE INDEX idx ON T (col)	`T.idx = (1:size(T,1))'`
group index	CREATE UNIQUE INDEX idx ON T (cols)	`[~, T.idx] = sortrows(T, cols)` (old implementation is `grp2idx()`)
set operations	UNION INTERSET	union() intersect() setdiff(), setxor()
sort	ORDER BY	sortrows()
unique	SELECT DISTINCT	unique()
reduction aggregration	F()	@reductionFunctions
grouping	GROUP BY	Specifying ‘GroupingVariables’ in varfun(), rowfun(), etc.
partitioning	(set partition option in Table Definition)	`T1=T(:, {'key', varnames_1})` `T2=T(:, {'key', varnames_2})`
joins	[type] JOIN	*`join(T1, T2, ...)`	df.join(df2, …)
cartesian product	CROSS JOIN (misnomer, no keys)	`T_cross = [repelem(T1, size(T2,1), 1), repmat(T2, [size(T1,1), 1])]`

Function programming concepts map (linear index), filter (logical index), reduce (summary & group) are heavily used with databases

Formal databases has a Table Definition (Column Properties) that must be specified ahead of time and can be updated in-place later on (think of it as static typing). Heterogenous Data Tables can figure most of that out on the fly depending on context (think of it as dynamic typing). This impacts:

data type (creation and conversion)
unspecified entries (NULL).
Often NaN in MATLAB native types but I extended it by overloading relevant data types with a isnull() function and consistently use the same interface
default values
keys (Indices)

SQL features not offered by heterogenous data tables yet:

column name aliases (AS)
wildcard over names (*)
pattern matching (LIKE)

SQL features that are unnatural with heterogeneous data tables’ syntax:

implicitly filter a table with conditions in another table sharing the same key.
It’s an implied join(T, T_cond)+filter operation in MATLAB. Often used with ANY, ALL, EXISTS

Fundamentally heterogenous data types expects working with snapshots that doesn’t update often. Therefore they do not offer active checking (callbacks) as in SQL:

Invariant constraints (CHECK, UNIQUE, NOT NULL, Foreign key).
Auto Increment
Virtual (dependent) tables (CREATE VIEW)

Know these database/spreadsheet concepts:

Tall vs wide tables

Language logistics (not related to database)

Concepts	SQL	MATLAB (table/dataset)	Pandas (Dataframe)
Partial display	MySQL: LIMIT Oracle: FETCH FIRST	`T( 1:10, : )`	`df.head()`
Comments	`--` or `/` … `/`	`%` or `%{` … `%}`	`#` or `"""` … `"""`
function	CREATE PROCEDURE fcn	`function [varargout{:}]=fcn(varargin{:})`	`def fcn:`
case	CASE WHEN THEN ELSE END	switch case end	(no case structure, use dictionary)
Null if no results	IFNULL ( statement )	`function X=null_if_empty(T, cond)` `X=T( cond, : );` `if( isempty(X) ) X=NaN;`
Replace nulls	ISNULL(col, target_val)	`T.col(isnan(T.col)) = target_val` `T = standardizeMissing( T, ... )`

MATLAB assign index for unique rows

Posted on October 23, 2021 by admin

MATLAB’s dataset/table objects’ internals often involves identifying unique contents and assigning a unique (grouping) index to it so the indices can be mapped or joined without actually going through the contents of each row.

In the old days when I were using dataset(), the first generation of table() objects before the rewrite, there is a tool called grp2idx() which assigns the same number to identical items regardless of data types. It was part of Statistics Toolbox (needs to pay extra for it) and it does not work if you have multiple columns that you want to assign an unique index unless the ROWS are identical.

Upon inspection. grp2idx() is overrated. There are two ways to get it without paying for the toolbox:

double(categorical(X)): cast a categorical type (technically you can use nominal/ordinal, but it’s part of statistics toolbox)
Use the 2nd output argument for sort() or sortrows() function. I recommend sortrows() because it’s can be overloaded on table() objects and it works on multiple rows.

Anonymous Functions (MATLAB) vs Lambdas (Python) Anonymous Functions in MATLAB is closure while Lambdas in Python are not

Posted on May 6, 2019 by admin

Lambdas in Python does not play by the same rules as anonymous functions in MATLAB

MATLAB takes a snapshot of (capture) the workspace variables involved in the anonymous function AT the time the anonymous function handle is created, thus the captured values will live on afterwards (by definition a proper closure).
Lambda in Python is NOT closure! [EDIT: I’ll need to investigate the definition of closure more closely before I use the term here] The free variables involved in lambda expressions are simply read on-the-fly (aka, the last state) when the functor is executed.

It’s kind of a mixed love-and-hate situation for both. Either design choice will be confusing for some use cases. I was at first thrown off by MATLAB’s anonymous function’s full variable capture behavior, then after I get used to it, Python’s Lambda’s non-closure tripped me. Even in the official FAQ, it address the surprise that people are not getting what they expected creating lambdas in a for-loop.

To enable capture in Python, you assign the value you wanted to capture to a lambda input argument (aka, using a bound variable as an intermediary and initialize it with the free variable that needs to be captured), then use the intermediary in the expression. For example:

lambda: ser.close()      # does not capture 'ser'
lambda s=ser: s.close()  # 'ser' is captured by s.

I usually keep the usage of nested functions to the minimum, even in MATLAB, because effectively it’s kind of a compromised ‘global’ between nested levels, or a little bit like protected classes in C++. It breaks encapsulation (intentionally) for functions in your inner circle (nest).

It’s often useful for coding up GUI in MATLAB quick because you need to share access to the UI controls within the same group. For GUI that gets more complicated, I actually avoided nested functions altogether and used *appdata() to share UI object handles.

Functors of nested functions are closures in both MATLAB and Python! Only Lambdas in Python behave slightly differently.

Programmatically determining platforms on MATLAB and Python

Posted on May 6, 2019 by admin

	MATLAB	Python
OS (32-bit or 64-bit)	computer(‘arch’)	platform.machine().endswith(’64’)
Engine (32-bit or 64-bit)	mexext()	platform.architecture()[0].startswith(’64’);
OS Type (Broad)	ismac/isunix/ispc or computer()	platform.system()

Python startup management

Posted on May 3, 2019 by admin

The startup script is simply startup.m in whatever folder MATLAB start with.

Now how about Python? For plain Python (anything that you launch in command line, NOT Spyder though), you’ll need to ADD a new environment variable PYTHONSTARTUP to point to your startup script (same drill for Windows and Linux).

For Spyder, it’s Tools>Preferences>IPython console>Startup>”Run a file”:

but you don’t need that if you already have new environment variable PYTHONSTARTUP correctly setup.

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Rambling Nerd with a Plan

Hoi Wong's blog

Category Archives: MATLAB

Data Relationships of Spreadsheets: Relational Database vs. Heterogenous Data Tables

MATLAB assign index for unique rows

Anonymous Functions (MATLAB) vs Lambdas (Python) Anonymous Functions in MATLAB is closure while Lambdas in Python are not

Programmatically determining platforms on MATLAB and Python

Python startup management