Powershell notes (for MATLAB/python users)

Data Type characteristics

PowershellMATLABPython
Nearly everything is a/anObjectMatrix (APL-philosophy)Object (which are dictionaries)
Assignment behavior*Reassigned referenceCopy-on-writeReassigned reference
Monads (wrapper for heterogenous data )Array/CollectionsCellsLists

* Shallow assignment (transferring reference only) means the LHS does not have its own copy, so modifying the new reference will modify the underlying data on the RHS.

Syntax / Usage

PowershellMATLABPython
Method chainingYesMight misbehaveYes
List ComprehensionNo. Map first then filterYes
Named input argumentsNative
f -a 1 -b 22
Name-Value pairs parsed insideNative
f(a=13, b=22)
Implicit NON-NULL return valueOptional
Binary map operationNative matrix ops
*fun() does n-ary
Use numpy
list( map(operator.add, L1, L2) )
Check Type$g -is [type]is*() or isa()isinstance(val, type)
Unpacking (flattening)
monads in monads
Default
Use unary , to avoid
No
Use [{:}] to perform
No
Use *, list comp, or
list(itertools.chain(*ls))
Conditional/statement block inside container creationYes?
View Object Info with Data| Format-List -Property *
or
Format-List -InputObject
properties()
methods()
get()
List members (method and properties)’s prototypes| Get-Member

Powershell specific

  • The UNCAPTURED output value in the last line of the block is the return value! Unary side effect statements such as $x++ do not have output value. Watch out for statements that looks like it’s going nowhere at the end of the code as these are not nop/bugs, but return value. This has the same stench as fall-throughs.
  • foreach() follows the last uncaptured output value return rule above doing a 1-to-1 map from the input collection to output collection (you can assign output to foreach() as it’s also seen as a function)
  • Powershell suck at binary operations between two arrays. Just an elementwise A+B you’d be thinking in terms of loops and worry about dimensions.
  • You can put if and loop blocks inside collections list construction, like this:
@( 3, if(cond1){...; $v1}  do{...; $v2}while(cond2) )

MATLAB specific

  • When used with classes and custom matrices/arrays, chaining fields/properties/methods by indices often do not work, when they do, they often give out only the first element instead of the entire array (IIRC, there are operator methods that needs to be coordinated in the classes involved to make sure they chain correctly). In short, just don’t chain unless in very simple, scalar cases. Always output it to a variable a access the leaf.

Range & Indexing

PowershellMATLABPython
Logical IndexingYesNo. Use list comprehension/Numpy
Negative (cyclic) IndexingYesYes
end‘ of array keywordYesNo. Skip stop in slice instead
Step (skip every n items)YesYes. Both range or slice
Detect descending rangeYes
Automatic extend arrayYes
Reading array out of boundsDo nothingErrorError

Negative (cyclic) indexing along with automatic descending range, along with the lack of ‘end’ keyword is a huge pain in the rear when you want to scan from left to right like A[5:end].

Instead, you’ll have to do $A[4..($A.length-1)] because the range 4..-1 inside A[4..-1] is unrolled as 4,3,2,1,0,-1 (thus scanning from right to left and wraps around) without first consulting with the array A like the end keyword in MATLAB does so it can substitute the ends of the range with the array information before it unrolls.

I am willing to bet that this behavior does not have a sound basis other than people thinking negative indices and descending ranges alone are two good ideas without realizing that nearly nobody freaking wants to scan from right to left and wrap around!

I had the same gripes about negative indices in Python not carefully coordinating with other combinations in common use cases which cases unintuitive behavior.

Range indexing syntax

# Powershell
1..10 # No step/skip for range creation
A[1..10]  # No special treatment in array such as figuring out the 'end'

% MATLAB
A[start:(step):stop]

# Python
A[range(start,stop,step)]
# Slicing (it's not range)
A[(start):(stop):(step)] # Can skip everything 
# In Python, A=X merely reassign the label A as the alias for X.
# Modifying the reassigned A through A=X will modify underlying contents of X
# To deep-copy contents without .Clone(), assign the full slice
A[:] = X

Hasthtable / Dictionaries

% MATLAB: Use dynamic fields in struct or containers.Map()
# Python: dictionaries such as {a:1, b='x'}
# Powershell: @{a=1, b='x'}

Structs

Powershell does not have direct struct or dynamic field name struct. Instead if your object is uniform (you expect the fields not to change much), use [PSCustomObject]@{}. You can also just use simple hashtable @{}, but for some reason it doesn’t work the way I expected when put into arrays when I try to reference it by array index.

Array rules surprises

  • Array comparisons are filtering operation (not boolean array output like MATLAB). (0..9) -ge 5 gives 5 to 9, not a list of False … False, True … True. To get a boolean array, use this shortcut:
(0..9) | % {$_ -ge 5}

Map-filter combo syntax is | ? instead of Map syntax | %

  • Monad (Cells in MATLAB) are unpacked and stacked by default (in MATLAB, I had to write a lot of routines to unpack and stack cells of cells). To keep cells packed (in MATLAB lingo, it’s like ‘UniformOutput’, false in cellfun), add a comma unary operator in front of the operation that are expected to be unpacked like this:
.$_.Split('_')

Set Operations

This is one of the WTF moments of Powershell as a programming language. Convenient set operations is essential for most of the routine boring stuff that involves relational data. A lot of Powershell’s intended audience works in database like environment (like IT managers dealing with Active Directory), they have Group-Object for typical data analysis tasks, yet they make life miserable just to do basic set operations like intersection and differencing!

Powershell has a Compare-Object, but this is as unnatural and annoying to use as users are effectively rebuilding all 4 basic set-ops (intersection, union, set-diff, xor) based on any two! Not to mention you have to sift through table to get to the piece you wanted!

Basically Compare-Object out of the box

  • is a set-diff showing both directions (A\B and also B\A) at the same time. If you throw away the direction info, it’s xor.
  • if you want intersection, you’ll need to add -IncludeEqual -ExcludeDifferent
  • (WTF!) If you just specify -ExcludeDifferent, by definition there’s no output because by default Compare-Object shows you ONLY the two set-diffs and you are telling it to not show any diffs!
  • Union is specifying -IncludeEqual only. But it’d rather stack both then do a | Sort-Object - Unique

Some people might suggest doing | ? {$_ -eq $B} for intersection (or is-member). This is generally a bad idea if you have a lot of data because it’s in the O(n*n) runtime algorithm (loop-within-loop) while any properly done intersection algorithm will just sort then scan the adjacent item to check for duplicates, which gives O(n log(n)) time (typical sorting algorithm takes up most of the time).

If you noticed, it’s set operations within the outputs of Compare-Objects with the Venn diagram of -IncludeEqual -ExcludeDifferent switches! It’s doable, but totally unnecessary mindfuck that should not be repeated frequently.

In MATLAB land, I made my own overloading operators that do set operation over cellstr(), categorical and tabular objects (I went into their code and added the features and talked to TMW so they added the features later), sometimes getting into their sort and indexing logic as necessary. This shows how badly do I need set operations to come naturally.

One might not deal with it too much in low level languages like C++ (STL set doesn’t get used as much compared to other containers), but for a language made to get a lot of common things done (i.e. the language designer kind of reads the users mind), I’m surprised that the Powershell team overlooked the set operations!

Sets are very powerful abstractions that should not be made less descriptive (hard to read) by dancing around it with equivalent operations with some programming gymnastics! If these basic stuff are not built in, we are going to see a lot of people taking ugly shortcuts to avoid coding up these bread and butter functions and put it in libraries (or downloading 3rd-party libraries)!

Powershell surprises

  • Typical symbolic comparison operators do not work because ‘>’ can be misinterpreted as redirection in command prompts. Use switches like -gt (greater than) instead.
  • Redirection’s default text output uses UTF16-LE encoding (2 bytes per character). Programs assuming ASCII (1 byte per character) might not behave as intended (e.g. if you use copy command merge an ASCII/UTF8 file with UTF16-LE, you might end up with spaces in the sections that are formatted with UTF16-LE)
  • Cannot extract string matches from regex without executing a -match which returns boolean unless we use the the $matches$ spilled into variable space. Consider [regex]::Match($Text, $Pattern).Groups[1].Value
  • Methods are called with parenthesis yet functions are not called with parenthesis, just like cmd-lets! Trying to call a function with multiple input arguments with parenthesis like f(3,5) will be interpreted as calling f with ONE ARGUMENT containing an ARRAY of 3 and 5!
  • Write-Host takes everything after it literally (white spaces included, almost like echo command), with the exception of plugging in $variables! If you want anything interpreted, such as concatenation, you need to put the bracket around the whole statement!

Libraries and Modules

  • Reload module using Import-Module $moduleName -Force

Loading

GUI Paradigms (1): Redux (Flutter/React) translated to MATLAB

For GUI development, we often start with controls (or widgets) that user interact with and it’ll emit/run the callback function we registered for the event when the event happens.

Most of the time we just want to read the state/properties of certain controls, process them, and update other controls to show the result. Model-View-Controller (MVC) puts strict boundaries between interaction, data processing and display.

The most common schematic for MVC is a circle showing the cycle of Controller->Model->View, but in practice, it’s the controller that’s the brains. The view can simultaneously accept user interactions, such as a editable text box or a list. The model usually don’t directly update the view directly on its own like the idealized diagram.

MVC with User Action
From https://www.educative.io/blog/mvc-tutorial

With MVC, basically we are concentrating the control’s callbacks to the controller object instead of just letting each control’s callback interact with the data store (model) and view in an unstructured way.


When learning Flutter, I was exposed to the Redux pattern (which came from React). Because the tutorials was designed around the language features of Dart, the documentation kind of obscured the essence of the idea (why do we want to do this) as it dwelt on the framework (structure can be refactored into a package). The docs talked a lot about boundaries but wasn’t clearly why they have to be meticulously observed, which I’ll get to later.

The core inspiration in Redux/BLoC is taking advantage of the concept of ‘listening to a data object for changes’ (instead of UI controls/widget events)!

Instead of having the UI control’s callback directly change other UI control’s state (e.g. for display), we design a state vector/dictionary/struct/class that holds contents (state variables) that we care. It doesn’t have to map 1-1 to input events or 1-1 to output display controls.

When an user interaction (input) event emitted a callback, the control’s callback do whatever it needs to produce the value(s) for the relevant state variable(s) and change the state vector. The changed state vector will trigger the listener that scans for what needs to be updated to reflect the new state and change the states of the appropriate view UI controls.

This way the input UI controls’ callbacks do not have to micromanage what output UI controls to update, so it can focus on the business logic that generates the content pool that will be picked up by the view UI controls to display the results. In Redux, you are free to design your state variables to match more closely to the input data from UI controls or output/view controls’ state. I personally prefer a state vector design that is closer to the output view than input controls.


The intuition above is not the complete/exact Redux, especially with Dart/Flutter/React. We also have to to keep the state in ONE place and make the order of state changes (thus behavior) predictable!

  • Actions and reducers are separate. Every input control fires a event (action signal) and we’ll wait until the reducers (registered to the actions) to pick it up during dispatch() instead of jumping on it. This way there’s only ONE place that can change states. Leave all the side effects in the control callback where you generate the action. No side effects (like changing other controls) allowed in reducers!
  • Reducers do not update the state in place (it’s read only). Always generate a new state vector to replace the old one (for performance, we’ll replace the state vector if we verified the contents actually changed). This will make timing predictable (you are stepping through state changes one by one)

In Javascript, there isn’t really a listener actively listening state variable changes. Dispatch (which will be called every time the user interacts using control) just runs through all the listeners registered at the very end after it has dispatched all the reducers. In MATLAB, you can optionally set the state vector to be Observable and attach the change listener callback instead of explicitly calling it within dispatch.

https://gist.github.com/gaearon/ffd88b0e4f00b22c3159

Here is an example of a MATLAB class that captures the spirit of Redux. I added a 2 second delay to emulate long operations and used enableDisableFig() to avoid dealing with queuing user interactions while it’s going through a long operation.

classdef ReduxStoreDemo < handle

    % Should be made private later
    properties (SetAccess = private, SetObservable)
        state % {count}
    end
        
    methods (Static)
        % Made static so reducer cannot call dispatch and indirectly do
        % side effect or create loops
        function state = reducer(state, action)        
            % Can use str2fun(action) here or use a function map
            switch action
                case 'increment'
                    fprintf('Wait 2 secs before incrementing\n');
                    pause(2)
                    state.count = state.count + 1;
                    fprintf('Incremented\n');
            end                
        end        
    end
    
    % We keep all the side-effect generating operations (such as
    % temporarily changing states in the GUI) in dispatch() so
    % there's only ONE PLACE where state can change
    methods
        function dispatch(obj, action, src, evt)
            % Disable all figures during an interaction
            figures = findobj(groot, 'type', 'figure');
            old_fig_states = arrayfun(@(f) enableDisableFig(f, 'off'), figures);      
            src.String = 'Wait ...';
            
                new_state = ReduxStoreDemo.reducer(obj.state, action);
                
                % Don't waste cycles updating nops 
                if( ~isequal(new_state, obj.state) )
                    % MATLAB already have listeners attached.
                    % So no need to scan listeners like React Redux            
                    obj.state = new_state;
                end                
                
            % Re-enable figure obj.controls after it's done
            arrayfun(@(f, os) enableDisableFig(f, os), figures, old_fig_states);                        
            src.String = 'Increment';
        end
    end
    
    methods
        function obj = ReduxStoreDemo()
            figure();            
            obj.state.count = 0;                        
            
            h_1x  = uicontrol('style', 'text', 'String', '1x Box', ...
                              'Units', 'Normalized', ...
                              'Position', [0.1 0.3, 0.2, 0.1], ...
                              'HorizontalAlignment', 'left');                          
            addlistener(obj, 'state', ...
                        'PostSet', @(varargin) obj.update_count_1x( h_1x , varargin{:})); 
                        
            uicontrol('style', 'pushbutton', 'String', 'Increment', ...
                      'Units', 'Normalized', ...
                      'Position', [0.1 0.1, 0.15, 0.1], ...
                      'Callback', @(varargin) obj.dispatch('increment', varargin{:}));                             
            
            % Force trigger the listeners to reflect the initial state
            obj.state = obj.state;
        end
    end
    
    %% These are 'renders' registered when the uiobj.controls are created
    % Should stick to reading off the state. Do not call dispatch here
    % (just leave it for the next action to pick up the consequentials)
    methods
        % The (src, event) is useless for listeners because it's not the 
        % uicontrol handle but the state property's metainfo (access modifiers, etc)
        function update_count_1x(obj, hObj, varargin)            
            hObj.String = num2str(obj.state.count);
        end        
    end
    
end

Loading

MATLAB repeating arrays (elementwise array replication, interleaved ‘repmat’)

Since MATLAB R2015b, there’s a new feature called repelem(V, dim1, dim2, ...) which repeats each element by dimX times over dimension X. If N (dim1) is scalar, each V is uniformly repeated by N times. If N is a vector, it has to be the same length as V and each element of N says how many times the corresponding element in V is repeated.

Here are some historical ways of doing it (as mentioned in MATLAB array manipulation tips)

The scalar case (repeat uniformly) can be emulated by a Kronecker product multiplying everything with 1 (self):

kron(V, ones(N,1))
Just replace all the elements b with 1 so we are left with elements of A repeating the way we wanted

Kron method is conceptually smart but it has unnecessary arithmetic (multiply by 1). Nonetheless this method is reasonable fast until TMW finally developed a built-in function for it that outperforms all the tricks people have accumulated over decades.

The vector case (each element is repeated a different number of times according to vector N) is basically decoding Run-Length Encoding (RLE), aka counts to placements, which you can download maturely written programs on MATLAB File Exchange (FEX). There are a bunch of cumsum/diff/accumarray/reshape tricks but at the end of the day, they are RLE decoding in vectorized forms.


There’s a name for almost each recurring problem that we can think of in MALTAB. Before jumping in and implementing your for loop, ask around and try to find the right keyword/terms to describe your problem! >99.9% of the time your problem is not new!

The most odd-ball MATLAB algorithm scenario I’ve ever came across that requires original thought is the ‘Jenga Matrix‘ (I coined the name) while I was working at Stanford University Medical School as a research assistant for MADIT-CRT.

MATLAB’s OOP was not mature at that time, so dataset() objects didn’t surface. The reason for the ‘Jenga Matrix’ was to create ‘sparse cells’ which uses a sparse matrix with non-zero indices mapping to a cell vector so I can make a table (that’s approximately the guts of heterogenous data structure).

As I remove elements of the ‘sparse cell matrix’, I didn’t want holes in it to accumulate so I’ll have to periodically compact the underlying cell vector and shift the indices to reflect the indices after compacting. Normally if you have to mess with these kind of ingenious indexing algorithms, you are working on some generic abstractions/tools rather than the business logic itself.

There’s no ultimate correct way to implement something in MATLAB, but there are tons of bad ways that is strictly worse under all circumstances! Being smart with these little toy (Cody) problems like array manipulation do not really show practical proficiency in MATLAB. Anybody can spend a day or two to solve a genuinely new algorithm puzzle or just ask around in the forums if you run into it once in a blue moon. Who cares if you can do it 5 times faster if it’s just <1% of the development time?

Most of your time should be spent on using MATLAB to succinctly and intuitively describe your business logic (which requires exploring and understanding your project requirements deeply), and hide the boring background work with generic abstractions (e.g. RDBMS and RLE)! People should be able to read your function and variable names and form a clear picture of what your codebase is trying to achieve instead of stumbling over smart-ass idioms that’s not immediately obvious (which should buried in the lowest level of generic tool functions if you had to develop it in-house).

Even a mathematician in Linear Algebra using MATLAB for 40 years doesn’t mean he’s good at MATLAB! The real MATLAB skills are keeping up with MATLAB has to offer for a variety of scenarios relevant to the task at hand (or know enough abstract concepts like functional programming, OOP, database, etc, to be able to find out the right tools quickly), which is a hell lot of knowledge considering MATLAB covered most common scenario imaginable (the vast majority of MATLAB users aren’t aware of the full offerings and used MATLAB the wrong/hard way)!

Loading

Data Relationships of Spreadsheets: Relational Database vs. Heterogenous Data Tables

This blog post is development in process. Will fill in the details missing details (especially pandas) later. Some of the MATLAB syntax are inaccurate in the sense that it’s just a description that is context dependent (such as column names can be cellstr, char string or linear/logical indices).

From data relationship point of view, relation database (RDMBS), heterogenous data tables (MATLAB’s dataset/table or Python Panda’s Dataframe) are the same thing. But a proper database have to worry about concurrency issues and provide more consistency tools (ACID model).

Heterogenous data tables are almost always column-oriented database (mainly for analyzing data) where MySQL and Postgres are row-store database. You can think of column-store database as Struct of Arrays (SoA) and row-store database as Array of Struct (AoS). Remember locality = performance: in general, you want to put the stuff you frequently want to access together as close to each other as possible.

Mechanics:

ConceptsSQLMATLAB tablePandas Dataframe
tablesFROM(work with T)(work with df)
columns
variables
fields
SELECT T.(field)
T(:, cols/varnames)
rows
records
WHERE

HAVING
T( cond(T), : )

T_grp( cond(T_grp), : )
conditionsNOT
IS
IN
BETWEEN
~
==, isequal*()
ismember()
a<=b & b<=c
Inject table to
another table
INSERT INTO t2
SELECT vars FROM t1
WHERE rows
T2(end+(1:#rows), vars) = T1(rows, vars)
(Doable, throws warning)

Insert record/rowINSERT INTO t (c1, c2, ..)
VALUES (v1, v2, ..)
T=[T; {v1, v2, ...}]
(Cannot default for unspecified column*)
update records/elementsUPDATE table
SET column = content
WHERE row_cond
T.(col)(row_cond) = content
New table
from selection
SELECT vars
INTO t2
FROM t1
WHERE rows
T2 = T1(rows, vars)
clear tableTRUNCATE TABLE tT( :, : )=[]
delete rowsDELETE FROM t WHERE cond
(if WHERE is not specified, it kills all rows one by one with consistency checks. Avoid it and use TRUNCATE TABLE instead)
T( cond, : ) = []
* I developed sophisticated tools to allow partial row insertion, but it’s not something TMW supports right out of the box. This involves overloading the default value generator for each data type then extract the skeleton T( [], : ) to identify the data types.

Core database concepts:

ConceptsSQLMATLAB (table/dataset)Pandas (Dataframe)
linear indexCREATE INDEX idx ON T (col)T.idx = (1:size(T,1))'
group indexCREATE UNIQUE INDEX idx ON T (cols)[~, T.idx] = sortrows(T, cols)
(old implementation is grp2idx())
set operationsUNION
INTERSET
union()
intersect()
setdiff(), setxor()
sortORDER BYsortrows()
uniqueSELECT DISTINCTunique()
reduction
aggregration
F()@reductionFunctions
groupingGROUP BYSpecifying ‘GroupingVariables’ in varfun(), rowfun(), etc.
partitioning(set partition option in Table Definition)T1=T(:, {'key', varnames_1})
T2=T(:, {'key', varnames_2})
joins[type] JOIN*join(T1, T2, ...)df.join(df2, …)
cartesian productCROSS JOIN
(misnomer, no keys)
T_cross = [repelem(T1, size(T2,1), 1), repmat(T2, [size(T1,1), 1])]
Function programming concepts map (linear index), filter (logical index), reduce (summary & group) are heavily used with databases

Formal databases has a Table Definition (Column Properties) that must be specified ahead of time and can be updated in-place later on (think of it as static typing). Heterogenous Data Tables can figure most of that out on the fly depending on context (think of it as dynamic typing). This impacts:

  • data type (creation and conversion)
  • unspecified entries (NULL).
    Often NaN in MATLAB native types but I extended it by overloading relevant data types with a isnull() function and consistently use the same interface
  • default values
  • keys (Indices)

SQL features not offered by heterogenous data tables yet:

  • column name aliases (AS)
  • wildcard over names (*)
  • pattern matching (LIKE)

SQL features that are unnatural with heterogeneous data tables’ syntax:

  • implicitly filter a table with conditions in another table sharing the same key.
    It’s an implied join(T, T_cond)+filter operation in MATLAB. Often used with ANY, ALL, EXISTS

Fundamentally heterogenous data types expects working with snapshots that doesn’t update often. Therefore they do not offer active checking (callbacks) as in SQL:

  • Invariant constraints (CHECK, UNIQUE, NOT NULL, Foreign key).
  • Auto Increment
  • Virtual (dependent) tables (CREATE VIEW)

Know these database/spreadsheet concepts:

  • Tall vs wide tables

Language logistics (not related to database)

ConceptsSQLMATLAB (table/dataset)Pandas (Dataframe)
Partial displayMySQL: LIMIT
Oracle: FETCH FIRST
T( 1:10, : )df.head()
Comments-- or /* */% or %{ %}# or """"""
functionCREATE PROCEDURE fcnfunction [varargout{:}]=fcn(varargin{:})def fcn:
caseCASE WHEN THEN ELSE ENDswitch case end(no case structure, use dictionary)
Null if no resultsIFNULL ( statement )function X=null_if_empty(T, cond)
X=T( cond, : );
if( isempty(X) ) X=NaN;
Replace nullsISNULL(col, target_val)T.col(isnan(T.col)) = target_val
T = standardizeMissing( T, ... )

Loading

MATLAB assign index for unique rows

MATLAB’s dataset/table objects’ internals often involves identifying unique contents and assigning a unique (grouping) index to it so the indices can be mapped or joined without actually going through the contents of each row.

In the old days when I were using dataset(), the first generation of table() objects before the rewrite, there is a tool called grp2idx() which assigns the same number to identical items regardless of data types. It was part of Statistics Toolbox (needs to pay extra for it) and it does not work if you have multiple columns that you want to assign an unique index unless the ROWS are identical.

Upon inspection. grp2idx() is overrated. There are two ways to get it without paying for the toolbox:

  • double(categorical(X)): cast a categorical type (technically you can use nominal/ordinal, but it’s part of statistics toolbox)
  • Use the 2nd output argument for sort() or sortrows() function. I recommend sortrows() because it’s can be overloaded on table() objects and it works on multiple rows.

Loading