The official Dart docs are sometimes too terse to give definitive answers to language questions. I discovered an alternative syntax for named lambdas here and here. In Dart,
(args) => expr
is the shorthand for
(args) {
  return expr;
}
So the lambda style (as in C++11) names the function by assigning it to a variable of a messy functor type (inferred with var in Dart and auto in C++11):
var f = (args) => expr
This can also be rewritten using a declaration that resembles a C function pointer, except that
the body expr is included (which is not allowed in C),
return must be explicit inside a curly-brace block,
the arrow notation => takes whatever expr evaluates to (which can be null for statements like print),
it’s simply the function name f in Dart instead of the pointer (*f) in C.
Since MATLAB R2015b, there’s a built-in function repelem(V, dim1, dim2, ...) that repeats each element dimX times along dimension X. If N (dim1) is a scalar, each element of V is repeated N times. If N is a vector, it has to be the same length as V, and each element of N says how many times the corresponding element of V is repeated.
The scalar case (uniform repetition) can be emulated with a Kronecker product against a vector of ones, which multiplies every element by 1 (itself):
kron(V, ones(N,1))
The kron method is conceptually neat, but it does unnecessary arithmetic (multiplying by 1). Nonetheless, this method was a reasonably fast option until TMW finally developed a built-in function that outperforms all the tricks people have accumulated over decades.
The vector case (each element is repeated a different number of times according to the vector N) is basically decoding Run-Length Encoding (RLE), aka counts to placements, for which you can download mature programs from the MATLAB File Exchange (FEX). There are a bunch of cumsum/diff/accumarray/reshape tricks, but at the end of the day they are all RLE decoding in vectorized form.
There’s a name for almost every recurring problem we can think of in MATLAB. Before jumping in and implementing your for loop, ask around and try to find the right keywords/terms to describe your problem! More than 99.9% of the time, your problem is not new!
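To make the "counts to placements" idea concrete, here is a minimal sketch in Python rather than MATLAB. The helper name repeat_elements is hypothetical (it is not a MATLAB or library function); it only mimics the scalar and vector cases described above for a flat sequence.

```python
from itertools import chain, repeat


def repeat_elements(values, counts):
    """Repeat each element of `values`.

    `counts` is either one int (uniform repetition) or a sequence with
    one count per element (run-length decoding).
    """
    if isinstance(counts, int):
        counts = [counts] * len(values)
    if len(counts) != len(values):
        raise ValueError("counts must be a scalar or match the length of values")
    # Each (value, count) pair expands into `count` copies of the value.
    return list(chain.from_iterable(repeat(v, n) for v, n in zip(values, counts)))


uniform = repeat_elements([1, 2, 3], 2)       # scalar case
decoded = repeat_elements("abc", [3, 0, 2])   # vector case, i.e. RLE decoding
```

A count of zero simply drops the element, which is the same convention RLE decoders usually follow.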
The most oddball MATLAB algorithm scenario I’ve ever come across that required original thought is the ‘Jenga Matrix’ (I coined the name), which I ran into while working at Stanford University Medical School as a research assistant for MADIT-CRT.
MATLAB’s OOP was not mature at that time, so dataset() objects hadn’t surfaced yet. The reason for the ‘Jenga Matrix’ was to create ‘sparse cells’: a sparse matrix whose non-zero entries are indices into a cell vector, so I could build a table (that’s approximately the guts of a heterogeneous data structure).
As I removed elements from the ‘sparse cell matrix’, I didn’t want holes to accumulate, so I had to periodically compact the underlying cell vector and shift the indices to reflect the positions after compacting. Normally, if you have to mess with this kind of ingenious indexing algorithm, you are working on generic abstractions/tools rather than the business logic itself.
There’s no ultimate correct way to implement something in MATLAB, but there are tons of bad ways that are strictly worse under all circumstances! Being smart with little toy (Cody) problems like array manipulation does not really show practical proficiency in MATLAB. Anybody can spend a day or two solving a genuinely new algorithm puzzle, or just ask around in the forums, if it comes up once in a blue moon. Who cares if you can do it 5 times faster if it’s just <1% of the development time?
Most of your time should be spent on using MATLAB to succinctly and intuitively describe your business logic (which requires exploring and understanding your project requirements deeply), and on hiding the boring background work behind generic abstractions (e.g. RDBMS and RLE)! People should be able to read your function and variable names and form a clear picture of what your codebase is trying to achieve, instead of stumbling over smart-ass idioms that are not immediately obvious (which should be buried in the lowest level of generic tool functions if you had to develop them in-house).
Even a mathematician who has used MATLAB for linear algebra for 40 years isn’t necessarily good at MATLAB! The real MATLAB skill is keeping up with what MATLAB has to offer for the variety of scenarios relevant to the task at hand (or knowing enough abstract concepts like functional programming, OOP, databases, etc., to be able to find the right tools quickly), which is a hell of a lot of knowledge considering MATLAB covers most common scenarios imaginable (the vast majority of MATLAB users aren’t aware of the full offerings and use MATLAB the wrong/hard way)!
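The original MATLAB code is not shown here, so the following is only a rough Python sketch of the idea as described: a mapping from (row, col) to a slot number stands in for the sparse matrix, a flat list stands in for the cell vector, and compact() closes the holes and renumbers the slots. The class and method names are made up for illustration.

```python
class SparseCells:
    """Hypothetical sketch: (row, col) -> slot number in a flat list of values."""

    def __init__(self):
        self.index = {}   # plays the role of the sparse matrix of non-zero indices
        self.cells = []   # plays the role of the underlying cell vector

    def set(self, row, col, value):
        if (row, col) in self.index:
            self.cells[self.index[(row, col)]] = value
        else:
            self.index[(row, col)] = len(self.cells)
            self.cells.append(value)

    def get(self, row, col):
        return self.cells[self.index[(row, col)]]

    def remove(self, row, col):
        slot = self.index.pop((row, col))
        self.cells[slot] = None   # leaves a hole in the cell vector

    def compact(self):
        # Keep only slots that are still referenced, in their current order,
        # then renumber the references to match the new positions.
        live = sorted(self.index.items(), key=lambda kv: kv[1])
        self.cells = [self.cells[slot] for _, slot in live]
        self.index = {key: new for new, (key, _) in enumerate(live)}
```

Removal is cheap because it only drops one reference; the cost of shifting indices is paid in bulk whenever compact() is called.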
This blog post is a work in progress. I will fill in the missing details (especially pandas) later. Some of the MATLAB syntax is inexact in the sense that it’s just a context-dependent description (for example, column names can be a cellstr, a char string, or linear/logical indices).
From a data relationship point of view, relational databases (RDBMS) and heterogeneous data tables (MATLAB’s dataset/table or Python pandas’ DataFrame) are the same thing. But a proper database has to worry about concurrency issues and provide more consistency tools (the ACID model).
Heterogeneous data tables are almost always column-oriented (mainly for analyzing data), whereas MySQL and Postgres are row-store databases. You can think of a column-store database as a Struct of Arrays (SoA) and a row-store database as an Array of Structs (AoS). Remember locality = performance: in general, you want to put the stuff you frequently access together as close to each other as possible.
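As a small illustration of the SoA/AoS distinction, here is a sketch in plain Python with made-up sample records. Python lists do not give the memory-locality benefit that real column stores get, so this only shows the two layouts and which kind of access each one favors.

```python
# Array of Structs (row store): one complete record per entry.
rows = [
    {"name": "a", "age": 30, "score": 1.5},
    {"name": "b", "age": 41, "score": 2.5},
    {"name": "c", "age": 25, "score": 4.0},
]

# Struct of Arrays (column store): one list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Column-wise analysis has to visit every record in the row store ...
mean_age_rows = sum(row["age"] for row in rows) / len(rows)
# ... but touches only one list in the column store.
mean_age_cols = sum(columns["age"]) / len(columns["age"])

# Fetching one whole record is the opposite: trivial for rows,
# a gather across every column for the column store.
second_record = {key: col[1] for key, col in columns.items()}
```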
MATLAB: T=[T; {v1, v2, ...}] (cannot default for unspecified columns*)
update records/elements
  SQL: UPDATE table SET column = content WHERE row_cond
  MATLAB: T.(col)(row_cond) = content
new table from selection
  SQL: SELECT vars INTO t2 FROM t1 WHERE rows
  MATLAB: T2 = T1(rows, vars)
clear table
  SQL: TRUNCATE TABLE t
  MATLAB: T( :, : )=[]
delete rows
  SQL: DELETE FROM t WHERE cond (if WHERE is not specified, it kills all rows one by one with consistency checks; avoid that and use TRUNCATE TABLE instead)
  MATLAB: T( cond, : ) = []
* I developed sophisticated tools to allow partial row insertion, but it’s not something TMW supports out of the box. This involves overloading the default value generator for each data type, then extracting the skeleton T( [], : ) to identify the data types.
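The MATLAB tools mentioned in the footnote are not shown, so the following Python sketch only illustrates the general idea under stated assumptions: a schema dictionary stands in for the skeleton T( [], : ), and a hypothetical DEFAULTS table stands in for the per-type default value generators. The specific defaults chosen here are illustrative, not MATLAB’s.

```python
import math

# Hypothetical default ("null") value for each column type.
DEFAULTS = {float: math.nan, str: "", int: 0}


def insert_partial_row(table, schema, **given):
    """Append one row to a column-oriented table.

    Columns not mentioned in `given` are filled with the default
    value registered for their type.
    """
    unknown = set(given) - set(schema)
    if unknown:
        raise KeyError(f"unknown columns: {sorted(unknown)}")
    for name, col_type in schema.items():
        table[name].append(given.get(name, DEFAULTS[col_type]))
    return table


# The schema plays the role of the empty skeleton that records the data types.
schema = {"id": int, "label": str, "value": float}
table = {name: [] for name in schema}

insert_partial_row(table, schema, id=1, label="x", value=2.5)
insert_partial_row(table, schema, id=2)   # label and value fall back to defaults
```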
Core database concepts:
(Each concept is listed with its SQL and MATLAB table/dataset forms; the pandas DataFrame column is still to be filled in.)
linear index
  SQL: CREATE INDEX idx ON T (col)
  MATLAB: T.idx = (1:size(T,1))'
group index
  SQL: CREATE UNIQUE INDEX idx ON T (cols)
  MATLAB: [~, T.idx] = sortrows(T, cols) (old implementation is grp2idx())
set operations
  SQL: UNION, INTERSECT
  MATLAB: union(), intersect(), setdiff(), setxor()
sort
  SQL: ORDER BY
  MATLAB: sortrows()
unique
  SQL: SELECT DISTINCT
  MATLAB: unique()
reduction/aggregation
  SQL: F()
  MATLAB: @reductionFunctions
grouping
  SQL: GROUP BY
  MATLAB: specifying ‘GroupingVariables’ in varfun(), rowfun(), etc.
Functional programming concepts such as map (linear index), filter (logical index), and reduce (summary & group) are heavily used with databases.
Formal databases have a Table Definition (Column Properties) that must be specified ahead of time and can be updated in-place later on (think of it as static typing). Heterogeneous data tables can figure most of that out on the fly depending on context (think of it as dynamic typing). This impacts:
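The correspondence between these three concepts and table indexing can be sketched in Python on a single made-up column; the variable names and data below are illustrative only.

```python
from functools import reduce

ages = [30, 41, 25, 37]
groups = ["a", "b", "a", "b"]

# map with linear indices: pick elements by position.
picked = list(map(lambda i: ages[i], [2, 0]))

# filter with a logical index: keep elements where the mask is True.
mask = [age > 28 for age in ages]
older = [age for age, keep in zip(ages, mask) if keep]

# reduce: collapse a whole column into one summary value.
total = reduce(lambda acc, x: acc + x, ages, 0)

# grouped reduce: one summary value per group (the GROUP BY idea).
by_group = {}
for g, age in zip(groups, ages):
    by_group[g] = by_group.get(g, 0) + age
```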
data type (creation and conversion)
unspecified entries (NULL). Often NaN in MATLAB native types, but I extended it by overloading the relevant data types with an isnull() function and consistently using the same interface
default values
keys (Indices)
SQL features not offered by heterogeneous data tables yet:
column name aliases (AS)
wildcard over names (*)
pattern matching (LIKE)
SQL features that are unnatural with heterogeneous data tables’ syntax:
Implicitly filtering a table with conditions in another table sharing the same key. In MATLAB, it’s an implied join(T, T_cond) followed by a filter operation. Often used with ANY, ALL, EXISTS.
Fundamentally, heterogeneous data tables expect to work with snapshots that aren’t updated often. Therefore they do not offer active checking (callbacks) as SQL does:
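What the implied join-plus-filter amounts to can be sketched in Python with two made-up column-oriented tables sharing the key customer_id. The SQL in the comment is one plausible form of such a query, not taken from the original post, and the sketch uses a membership test instead of materializing the join.

```python
orders = {"customer_id": [1, 2, 1, 3], "amount": [10, 20, 30, 40]}
customers = {"customer_id": [1, 2, 3], "country": ["US", "DE", "US"]}

# Roughly: SELECT * FROM orders
#          WHERE customer_id IN
#                (SELECT customer_id FROM customers WHERE country = 'US')

# Step 1: evaluate the condition on the other table and collect matching keys.
us_ids = {
    cid
    for cid, country in zip(customers["customer_id"], customers["country"])
    if country == "US"
}

# Step 2: turn the shared key into a logical index on the table being filtered.
keep = [cid in us_ids for cid in orders["customer_id"]]

# Step 3: apply the logical index to every column.
filtered = {name: [v for v, k in zip(col, keep) if k] for name, col in orders.items()}
```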
Invariant constraints (CHECK, UNIQUE, NOT NULL, Foreign key).
MATLAB’s dataset/table objects’ internals often involve identifying unique contents and assigning a unique (grouping) index to each, so the indices can be mapped or joined without actually going through the contents of each row.
In the old days when I was using dataset(), the first generation of table() objects before the rewrite, there was a tool called grp2idx() that assigns the same number to identical items regardless of data type. It was part of the Statistics Toolbox (which costs extra), and it does not work when you want to assign a unique index over multiple columns unless the ROWS are identical.
Upon inspection, grp2idx() is overrated. There are two ways to get the same result without paying for the toolbox:
double(categorical(X)): cast to a categorical type and then to double (technically you can use nominal/ordinal, but they are part of the Statistics Toolbox)
Use the 2nd output argument of the sort() or sortrows() function. I recommend sortrows() because it can be overloaded on table() objects and it works on multiple columns. Note that this output is the sorting permutation; if you need the group numbers themselves, the third output of unique() gives them directly.
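The grouping-index idea itself (the same number for identical items, including multi-column rows) can be sketched in a few lines of Python. The function name group_index is hypothetical, rows are represented as tuples, and the numbering is 0-based, unlike MATLAB’s.

```python
def group_index(rows):
    """Return (index, keys): identical rows get the same number.

    `keys` lists the distinct rows in sorted order, so keys[index[i]]
    reproduces rows[i].
    """
    keys = sorted(set(rows))
    lookup = {key: i for i, key in enumerate(keys)}
    return [lookup[row] for row in rows], keys


# Two "columns" per row; rows 0 and 2 are identical and share a group number.
idx, keys = group_index([("b", 2), ("a", 1), ("b", 2), ("a", 3)])
```

Once the index exists, joins and group-wise operations can work on small integers instead of comparing the row contents again.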
I’d like to write a function to selectively modify lines read from a file handle and write them back. By default, lines are read as bytes objects, which are immutable, so I converted each line to a bytearray() so it could be modified, because only a few lines meeting certain criteria need to be changed.
When I tried to refactor a similar operation into a function, I was hoping to pass the mutable bytearray() as an argument and directly modify the caller’s content as in C++, given that Python variables work LIKE reference bindings.
I know bytearray.replace() does not modify the data in place, but instead returns the modified line as a new object. Normally, I can simply do this:
line = line.replace(b'\tCLASS', b'')
and the code will work. However, it doesn’t do anything when I do the same inside a Python function that receives line as an argument (unless I return line as output). Although I am well aware that assigning to an existing Python variable means orphaning the old data and repurposing the label, this assignment behavior requires careful thought in non-idiomatic situations.
In other words, I wanted this function to have side effects on the variable ‘line’, but I wasn’t doing it right. This is a tempting mistake for people with a C/C++ background: in C/C++, it is not possible to shadow an input parameter even if we were to explicitly declare it, so the innocent assignment I did above would have to modify the object in the caller (passed by reference to the function), as if I had done it directly in the caller.
However, in Python, variables do not need to be declared (the language is dynamically typed). This opens up the possibility of unwittingly shadowing the input parameters, which is what happened here. A mutable argument can still be modified through the function, but when you assign with the ‘=’ operator, the name on the LHS is bound to a new object as a local variable, which shadows the input parameter.
This means the connection to the caller’s object is lost once the parameter is shadowed.
The correct way to do this is to use slice assignment (whose logic is very different even though the syntax looks similar) to replace the entire contents of the input variable with the output of bytearray.replace():
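The lost connection is easy to reproduce in isolation. In this minimal sketch, the function name strip_class_rebinding and the sample header are made up; the assignment inside the function is the same one shown earlier.

```python
def strip_class_rebinding(line):
    # This binds the local name `line` to the new object returned by
    # replace(); the caller's bytearray is never touched.
    line = line.replace(b"\tCLASS", b"")


header = bytearray(b"ID\tNAME\tCLASS")
strip_class_rebinding(header)
# `header` still contains the CLASS column at this point.
```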
def remove_from_header_token_CLASS(tokens, line):
    # line is expected to be a bytearray (mutable)
    try:
        column_CLASS = tokens.index(b'CLASS')
    except ValueError:
        # b'CLASS' is not among the tokens
        column_CLASS = None
    else:
        # slice assignment overwrites the caller's object in place
        line[:] = line.replace(b'\tCLASS', b'')
    return column_CLASS
Since Python clearly distinguishes parameters from other local variables, trying to apply the nonlocal keyword to a parameter (in the hope of broadening its scope) will not parse/compile.
This is actually the same behavior as in MATLAB (dynamic typing), for the same reason: variables do not have to be declared as in C/C++ (static typing). In MATLAB, if you have a handle object (which works like a reference), you can shadow the input argument by creating a local variable of the same name:
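The effect of slice assignment on the caller can be checked with a minimal sketch. The function name strip_class_in_place and the sample header are made up for illustration; alias is a second name for the same object, used to show that no new object is created.

```python
def strip_class_in_place(line):
    # Slice assignment replaces the contents of the existing bytearray,
    # so every name bound to that object sees the change.
    line[:] = line.replace(b"\tCLASS", b"")


header = bytearray(b"ID\tNAME\tCLASS")
alias = header                 # second name for the same object
strip_class_in_place(header)
# Both `header` and `alias` now hold the shortened line.
```

This only works for mutable types; passing a bytes object would raise a TypeError at the slice assignment.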
classdef DemoHandleClass < handle
    properties
        x = 3;
    end
end
function demo_shadowing()
    C = DemoHandleClass();
    f(C);
    disp(C.x)
end
function f(C)
    C = DemoHandleClass(); % Shadowing
    C.x = 14;
end
The above MATLAB program displays 14 without the shadowing line and 3 with it (C becomes a new local variable that has nothing to do with the input argument C). MATLAB users rarely run into this because the language design heavily discourages side effects: we are supposed to return the changed local variable to the caller. The only way to do side effects in MATLAB is through handles (for which you need to set up a class, which is clumsy). Technically you can write the data to external resources (e.g. a file) and read it back. But guess what? Resources are accessed through handles, so there’s no escape.
Of course, there’s a better way to do this (MATLAB’s preferred way): return the modified object to the caller, as if the input were immutable:
def remove_from_header_token_CLASS(tokens, line):
    # line DOES NOT HAVE TO BE MUTABLE
    try:
        column_CLASS = tokens.index(b'CLASS')
    except ValueError:
        # b'CLASS' is not among the tokens
        column_CLASS = None
    else:
        line = line.replace(b'\tCLASS', b'')
    return column_CLASS, line
This is what I ultimately used (so I ended up not converting the bytes lines to bytearray), given that Python’s tuple syntax makes it easy to return multiple outputs as in MATLAB. The call ended up looking like this:
column_SPL_CLASS, line = remove_from_header_token_CLASS(tokens, line)
Nonetheless, I think there’s an important lesson to be learned about doing side effects in dynamically typed languages. Maybe I’ll need this one day if I get an excuse to do something more complicated that genuinely requires side effects.
In summary, variable assignment in most dynamically typed languages shadows the input argument with a newly created local variable instead of modifying the data in the original input argument. This implies that function side effects cannot be carried out through variable assignment.
The most common implication is: do not assign (with =) to an input variable in order to modify its contents in a dynamically typed language.