Librarize! Free variables/functions school of thought (as compared to OOP)

When programming in C++, I prefer to stick to free functions and refactor everything generic into libraries. However, that doesn’t sound like the norm out there. After seeing this video, I’m glad that I’m not the only one.

My rationale is that classes are merely a mental model built on the von Neumann architecture, which says that data (variables) and program (functions) aren’t that different after all. Combined with structs, and with a little help from the compiler treating function pointers differently (as methods), we can bundle action and data into one unit called a class (of objects).

Classes are a useful idea, but I would not over-objectify, like writing a class that holds 5 constants or framing a collection of loosely coupled or unrelated generic helper functions as a class (they should be organized into packages or namespaces). Over-objectifying reeks of cargo cult programming.

My primary approach to program design is to make it self-documenting. I prefer to present the code in a way (not just the syntax, but the code and data structure) that’s the easiest to understand without material sacrifices to performance or maintainability. I use classes when the problem statement happens to align naturally with what classes have to offer, without mental gymnastics.

My decision process goes roughly like this:

  • If a problem naturally screams data type (like matrices), which is heavy on operator overloading, I’d use classes in a heartbeat, as data types are mathematical objects.
  • Then I’ll look into whether the problem is heavy on state. In other words, if it’s necessary for one method to drop something in the mailbox for another method to pick up without the two meeting each other (through parameter-passing calls), I’ll consider classes.
  • If the problem statement screams natural interactions between objects, like chess pieces on a chessboard, I’d consider classes even if I don’t need OOP-specific features.

The last things I want to use OOP as a tool for are:

  1. Hiding sloppy generic methods that are correct only under your class’s implicit assumptions that are not spelled out, like sorting 7-digit phone numbers that are unique.
  2. Abusing data members to pretend you are not casually using globals and statics all over the place (poor encapsulation), as you would have done in a C program.

1) Free functions for generic operations

The first one is an example that calls for free functions. Instead of writing a special sort function inside the class that silently assumes the numbers are unique and 7 digits long, write a free function bitmap_sort() and put it in a sorting library if there isn’t an off-the-shelf package that does it already.
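As a rough illustration of what such a free function could look like, here is a minimal Python sketch. The name bitmap_sort() comes from the text above; the signature, the upper_bound parameter and the choice to raise on violated assumptions are my own assumptions. It uses one byte per possible value for simplicity rather than a true bit-packed map.

```python
def bitmap_sort(values, upper_bound):
    """Sort unique non-negative integers below upper_bound.

    The assumptions (range and uniqueness) are part of the contract
    and are checked instead of being silently relied on.
    """
    seen = bytearray(upper_bound)  # one flag per possible value (a byte, not a bit)
    for v in values:
        if not 0 <= v < upper_bound:
            raise ValueError(f"{v} is outside [0, {upper_bound})")
        if seen[v]:
            raise ValueError(f"{v} is duplicated; inputs must be unique")
        seen[v] = 1
    # Walking the flags in index order yields the values in sorted order
    return [v for v, flag in enumerate(seen) if flag]


phone_numbers = [5551234, 2345678, 9990000, 1000001]
sorted_numbers = bitmap_sort(phone_numbers, 10_000_000)  # 7-digit numbers
```

Spelling out the range and uniqueness requirements in the signature and the checks is the point: the caller's class no longer carries those assumptions implicitly.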

In the process of refactoring parts of your program out into generic free functions (I just made up the word ‘librarize’ to mean this), you go through the immensely useful mental process of

  • Explicitly understanding the exact assumptions that specifically apply to your problem, whether you want to take advantage of them or work around their restrictions. You can’t be sure that your code is correct, no matter how many tests you’ve written, if you aren’t even clear about which inputs your code is correct for and which unexpected inputs will break the assumptions.
  • Discovering the nomenclature for the concept you are using.
  • Knowing the nomenclature, having a very good chance of finding work already done on it, so you don’t have to reinvent the wheel … poorly.
  • If the concept hasn’t been implemented yet, contributing to code reuse that others and your future self can take advantage of.
  • Decoupling the generic operation from the business logic (the class itself), which allows you to focus on your problem statement and easily swap out the generic implementation, whether it’s for debugging or performance improvement, or hand the work over to others without spending more time explaining what you wanted than writing the code yourself.

This is much better than jumping into writing a half-assed implementation of an idea whose quirks (assumptions) you haven’t fully understood. You learn a new concept well rather than repeating similar grunt work over and over that doesn’t benefit anybody else, and you avoid having to debug the tangled mess when you run into a corner case because you didn’t understand the assumptions well enough to decouple a generic operation from the class.

Polymorphism is a lot broader than just function overloading. Virtual functions (run-time polymorphism) make sense only with inheritance, so they have to be an OOP thing. Templates (which also apply to free functions) and function overloading (which also applies to free functions) are compile-time polymorphism.

Polymorphism isn’t exclusive to OOP the way Bjarne defined it. C++ can overload free functions. You don’t need to put things into classes just because you want context (signature) dependent dispatch (aka the compiler figuring out which version of the function with the same name to call).

2) Classes are not excuses to hide unnecessary uses of global/statics

Data members in classes are a namespace-scoped version of global/static variables that can optionally be localized/bound to instances. The private/public access specifiers in C++ play the role that the static modifier played in C, where it switched a variable between global scope and file scope.

If you don’t think it’s a good habit to sprinkle global scope all over the place in C, try not to go wild using more data members than necessary either.

Data members give the illusion that you are encapsulating better and end up incentivizing less defensive programming practices. Instead of not polluting in the first place, you keep designing your data flow with the mentality of global variables and merely contain the pollution with namespace/class scopes.

For example, if you want to pass a message (say an error code) directly from one method to another and NOBODY else (no other method) is involved, you simply pass the message as an input argument.

Globals or data members are more like dropping a letter in a mailbox instead of handing it to your intended recipient, and hoping that somehow only the right person(s) will reach it and the right recipient will get it. Tons of things can go wrong with this uncontrolled approach: somebody else could intercept it, or the intended recipient might never know the message is waiting for them.

With data members, even if you marked them as private, you are polluting the namespace of your class’ scope (therefore not encapsulating properly) if there’s any method that can easily access data members that it doesn’t need.
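To make the mailbox analogy concrete, here is a small Python sketch contrasting the two styles. All names (ParserWithMailbox, last_error, parse, report) are hypothetical and only serve the illustration.

```python
class ParserWithMailbox:
    """Hands the error code over through a data member (the 'mailbox')."""

    def __init__(self):
        # Visible to every method of the class, yet only two of them need it
        self.last_error = 0

    def parse(self, text):
        self.last_error = 0 if text.isdigit() else 1
        return self.report()

    def report(self):
        # Has to trust that nobody else touched last_error in between
        return "ok" if self.last_error == 0 else f"error {self.last_error}"


def report(error_code):
    """Free function: the message arrives as an explicit input argument."""
    return "ok" if error_code == 0 else f"error {error_code}"


def parse(text):
    error_code = 0 if text.isdigit() else 1
    return report(error_code)  # handed directly to the intended recipient


mailbox_result = ParserWithMailbox().parse("12a")
direct_result = parse("12a")
```

Both versions produce the same result here; the difference is that in the second one no other function can read or overwrite the error code on its way.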

How I figured this out on my own based on my experience in MATLAB

Speaking of insidious ways to litter your program design with the globalist mentality (pun intended), data members are not the only offenders. Nested functions (not available in C++ but available in modern MATLAB and Python) are another hack that makes you FEEL less guilty about structuring your program in terms of global variables. Everything visible one level above the nested function is relatively global to the nested function. You are literally polluting the variable space of the nested function with local variables of the function one level above, which is a lot more disorganized than data members, where you at least acknowledge what you’ve signed up for.

Librarize is the approach I came up with for MATLAB development: keep a folder tree of user MATLAB classes and free functions organized under sensible names. Every time I am tempted to reinvent the wheel, I try to think of the best name for it. If a folder with the same name exists, chances are I already did something similar before and just needed a little reminder. This way I always have high quality in-house generic functions (whose use cases I can expand, with backward compatibility, as needed).

This approach works because I’m confident in my ability to naturally come up with sensible names consistently. When I did research in undergrad, the new terminology I came up with happened to coincide with wavelets before I studied wavelets, as in hindsight what I was doing was pretty much the same idea as wavelets except that it didn’t have the luxury of an orthogonal basis.

If a concept has multiple names, I often drop breadcrumbs with dummy text files suggesting the synonym or write a wrapper function with a synonymous name to call the implemented function.

C++ can simply overload free functions by signature, but not too many people know that MATLAB can overload free functions too, polymorphic ONLY BY THE FIRST ARGUMENT (it can’t do signatures because MATLAB supports variable arguments, which defeats the concept of signatures). It’s a very advanced technique I came up with that allows the same code to work for many different data types, doing generics without the templates available in C++.

I also understand that commercial development is often rushed, so not everybody can afford the mental energy to do things properly (like considering free functions first). All I’m saying is that there’s a better way than casually relying on data members more than needed, and using data members should have the same stench as using global variables: it might be the right thing to do in some cases, but most often not.
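Python has a close analogue of dispatching a free function by its first argument only: functools.singledispatch from the standard library. The sketch below is not the MATLAB technique described above, just the same idea in Python; the function name describe() and its return strings are made up for illustration.

```python
from functools import singledispatch


@singledispatch
def describe(x):
    """Fallback used when no more specific version is registered."""
    return f"generic object: {x!r}"


@describe.register
def _(x: int):
    # Selected when the FIRST argument is an int
    return f"integer {x}"


@describe.register
def _(x: list):
    # Selected when the FIRST argument is a list
    return f"list of {len(x)} items"


results = [describe(3), describe([1, 2]), describe("hi")]
```

The caller only ever sees one free function name, and no class had to be written to get type-dependent behavior.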


How missing keys are handled in Dictionary (Hashtables) in C++, Python and MATLAB

C++

In C++ (STL), the default behavior of operator[] when touching (especially reading) a missing key (whether it’s map or unordered_map, aka hashtable) is that the new key will be automatically inserted with the default value for the type (say 0 for integers).

I have an example MapWithDefault template wrapper implementation that allows you to endow the map (during construction) with a default value to be returned when the key you’re trying to read does not exist.

C++ has the at() method, but it throws an exception when the key is not found. However, enabling exceptions carries a material performance overhead in C++.

MATLAB

MATLAB’s hashtables are done with containers.Map() and, in more modern MATLAB, dictionary objects (R2022b and on), unless you want to stick to valid MATLAB variable names as keys and exploit dynamic fieldnames.

Neither containers.Map() nor dictionary has a default value mechanism for when a key is not found. It will just throw an error if the key you are trying to read does not exist. Use the iskey()/isKey() method to check whether the key exists first and decide whether to read it or spit out a default value.

Python

Natively, Python dictionaries will throw a KeyError exception if the key requested through the [] operator (aka __getitem__()) does not already exist.

Use .get(key, default) method if you want to assume a default value if the key is not found. The .get() method does not throw an exception: the default is None if not specified.

If you want C++’s default behavior, where reading a new key means inserting said key with a default, you have to explicitly import the collections module and use defaultdict. I wouldn’t recommend this, as the behavior is not intuitive and likely confusing in insidious ways.

There’s a similar approach to my MapWithDefault for Python dictionaries: subclass dict and define your own __missing__ dunder/magic method that returns a default when a key is missing, then use the parent’s (aka dict’s) constructor to do an explicit (type/class) conversion of an existing dict object into your child class object that has __missing__ implemented.

Although this approach is a little like my MapWithDefault, the __missing__ approach has more flexibility, like allowing the default value to depend on the query key string, but it comes at the expense of defining a different class, rather than a different instance, for each different default value.
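The difference between the two read-with-default behaviors is easy to see side by side. A short sketch using only the standard library (variable names are arbitrary):

```python
from collections import defaultdict

plain = {"a": 1}
value = plain.get("missing", 0)   # returns the default, inserts nothing
plain_keys = list(plain)

auto = defaultdict(int, {"a": 1})
value_auto = auto["missing"]      # returns int() == 0 AND inserts the key
auto_keys = list(auto)

try:
    plain["missing"]              # plain [] on a missing key raises KeyError
    raised = False
except KeyError:
    raised = True
```

After running this, plain still has only the key "a", while auto has silently grown a "missing" key just from being read.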
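A minimal sketch of the __missing__ approach, with a hypothetical class name DictWithDefault and a default that depends on the queried key:

```python
class DictWithDefault(dict):
    """dict whose [] reads of missing keys return a computed default."""

    def __missing__(self, key):
        # Called by dict.__getitem__ only when the key is absent.
        # Nothing is inserted unless we choose to do so here.
        return f"<no entry for {key}>"


existing = {"a": "apple"}
d = DictWithDefault(existing)   # convert an existing dict via the parent constructor

found = d["a"]
fallback = d["zebra"]
still_missing = "zebra" not in d
```

Note that the default is baked into the class body, which is the trade-off mentioned above: a different default means a different class.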

Monkey-patching instance methods is frowned upon in Python. So if you want the default value to tie to instances, the mechanism needs to be redesigned.


We use ContextManager (“with … as” statement) in Python because Python’s fundamental language design (garbage collecting objects) broke RAII

[TLDR] Python doesn’t have RAII. C++ and MATLAB allow RAII. You can have proper RAII only if destructor timing is 100% controllable by the programmer.

Python uses the Context Manager (with ... as idiom) to address the old issue of opening a resource handle (say a file or network socket) and automatically closing (freeing) it, regardless of whether the program quits abruptly or terminates gracefully after it’s done with the resource.

Unlike destructors in C++ and MATLAB, which register what to do (such as closing the resource) when the program quits or right before the resource (object) is gone, Python’s Context Manager is basically rehashing the old try-block idea by creating a rigid framework around it.

It’s not that Python doesn’t know the RAII mechanism (which is much cleaner), but Python’s fundamental language design choices painted it into a corner, so it’s stuck micro-optimizing the try-except/catch-finally approach of managing opened resources:

  • Everything is seen as an object in Python. Even integers have a ton of methods.
    MATLAB and C++ treat POD (Plain Old Data), such as integers, separately from classes.
  • Python’s garbage collector controls the timing of when the destructor of any object is called (del merely decrements the reference count).
  • MATLAB does not garbage-collect objects, so the destructor timing is guaranteed.
  • C++ has no garbage collection, so the destructor timing is guaranteed and managed by the programmer.

Python cannot easily exclude classes from garbage collection (which breaks RAII) because fundamentally everything is a class (dictionaries potentially with callables) in Python.

This is one of the reasons why I have a lot of respect for MATLAB for giving a lot of consideration to corner cases (like what ’empty’ means) in their language design decisions. Python has many excellent ideas, but not enough thought was given to how these ideas interact and what side effects they produce.


Python’s documentation says out loud exactly what it does: with ... as ... is effectively a rigidly defined try-except-finally block.
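The sketch below runs the same body twice, once through with ... as and once through a hand-written try-except-finally block that paraphrases the expansion given in the Python language reference. The Resource class and the log lists are hypothetical helpers for the demonstration.

```python
import sys


class Resource:
    """Toy resource that records when it is entered and exited."""

    def __init__(self, log):
        self.log = log

    def __enter__(self):
        self.log.append("enter")
        return self

    def __exit__(self, exc_type, exc, tb):
        self.log.append("exit")
        return False  # do not swallow exceptions


# 1) The idiom
with_log = []
with Resource(with_log) as r:
    with_log.append("body")

# 2) Roughly what the statement above expands to
manual_log = []
manager = Resource(manual_log)
enter = type(manager).__enter__
exit_ = type(manager).__exit__
value = enter(manager)
hit_except = False
try:
    r = value
    manual_log.append("body")
except:
    hit_except = True
    if not exit_(manager, *sys.exc_info()):
        raise
finally:
    if not hit_except:
        exit_(manager, None, None, None)
```

Both logs end up with the same enter, body, exit sequence.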

A Context Manager depends heavily on the resource opener function (EXPR) returning a constructed class instance that implements __enter__ and __exit__, so if you have an external C library imported into Python, like python-ft4222, you likely have to write your context manager in full when you write your wrapper.
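Here is a minimal sketch of such a wrapper. The open_device() and close_device() functions are stand-ins I made up for a handle-based C binding; they are NOT the python-ft4222 API. The Device class is the part you would write yourself.

```python
# Stand-ins for a handle-based C binding (hypothetical)
_open_handles = set()


def open_device(index):
    handle = f"dev{index}"
    _open_handles.add(handle)
    return handle


def close_device(handle):
    _open_handles.discard(handle)


class Device:
    """Wrapper that gives the raw open/close pair a context manager interface."""

    def __init__(self, index):
        self.index = index
        self.handle = None

    def __enter__(self):
        self.handle = open_device(self.index)
        return self

    def __exit__(self, exc_type, exc, tb):
        close_device(self.handle)   # runs on normal exit and on exceptions
        self.handle = None
        return False                # let exceptions propagate


with Device(0) as dev:
    was_open = dev.handle in _open_handles
all_closed = not _open_handles
```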


Typically the destructor should check if the resource is already closed first, then close it if it wasn’t already closed. Take io.IOBase as an example:

However, this is only a convenience when you are at the interpreter and can live with the destructor being called with a slight delay.

To make sure your code works reliably without timing bugs, you’ll need to explicitly close the resource somewhere other than in a destructor, rather than relying on object lifecycle timing. The destructor can act as a double guard to close it again if it hasn’t been closed, but it should not be relied on.
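A small sketch of that pattern, with a hypothetical Connection class: close() is idempotent and called explicitly, and __del__ is only a backup.

```python
class Connection:
    """Toy resource with an idempotent close() and a destructor as backup."""

    def __init__(self, log):
        self.log = log
        self.closed = False

    def close(self):
        if self.closed:          # already closed: nothing to do
            return
        self.closed = True
        self.log.append("closed")

    def __del__(self):
        self.close()             # double guard only; timing is up to the interpreter


log = []
conn = Connection(log)
try:
    log.append("working")
finally:
    conn.close()                 # the explicit close is what we actually rely on
del conn                         # if the destructor runs now or later, close() is a no-op
```

Because close() checks the flag first, it does not matter when (or whether) the destructor eventually fires.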


The with ... as construct is extremely ugly, but it’s one of the downsides of Python that cannot be worked around easily. It also makes it difficult for users to retry acquiring a resource, because one way or another retrying involves injecting the retry logic into __enter__. There isn’t much saving in typing from using with ... as over a try-except-finally block if you don’t plan to reuse the context manager and the cleanup code is a one-liner.


Pandas DataFrame in Python (1): Disadvantage of using attributes (dot notation) to access columns. Use `[]` (getitem) operator instead

There are two ways to access columns in a DataFrame. The preferred way is by square brackets (indexing into it like a dictionary). While it’s tempting to use the neater dot notation (treating columns like attributes), my recommendation is: don’t!

Python has dictionaries that handle arbitrary labels well, while it doesn’t have dynamic field names like MATLAB does. This puts DataFrame at a disadvantage in developing a dot notation syntax, while the dictionary syntax opens up a lot of possibilities that are worth giving up dot notation for. The nature of the language design makes the dot notation very half-baked in Python, and it’s better to avoid it altogether.

Reason 1: Cannot create new columns with dot notation

UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access

Reason 2: Only column names that happen to be valid Python attribute names AND do not collide with an existing DataFrame attribute or method can be accessed through dot notation.

Take the example of a DataFrame constructed from device info dictionaries created by the package pyft4222. I added a column called 'test me' to a table converted from the dictionary of device info. The table T looks like this:

I tried dir() on the table and noticed:

  • The column name "test me" did not appear anywhere, not even mangled. It has a space in between, so it’s not a valid attribute or variable name, and this column is effectively hidden from the dot notation.
  • flags is an internal attribute of DataFrame, and it was not overridden by the data column flags when called through the dot notation. This means the flags column was also shadowed in (aka hidden from) the dot notation, as there was no mangled name for it either.

Even weirder, getattr() works for columns whose names are not valid attribute names, like test me, even though the dot notation cannot access it (for lack of a dynamic field names syntax) and test me doesn’t show up in dir(). getattr(T, 'flags') still gets the DataFrame’s internal attribute flags instead of the column called flags.


Dictionary of equivalent/analogous concepts in programming languages

| Common | C | C++ | MATLAB | Python |
|---|---|---|---|---|
| Variable arguments | `<stdarg.h>`; `T f(...)`; packed in `va_list` | Very BAD! Cannot overload when signatures are uncertain. | `varargin`, `varargout`. Both packed as cells. MATLAB does not have named arguments | `*args` (simple, stored as tuples); `**kwargs` (specify input by keyword, stored as a dictionary) |
| Referencing | N/A | `operator[](_)` is for references | `subsindex`, `subsasgn`; `[_]` is for concat; `{_}` is for (un)pack | `__getitem__()`, `__setitem__()` |
| Default values | N/A | Supported | Not supported. Manage with `inputParser()` or the newer `arguments` block | Non-intuitive static data behavior. Stick to None or immutables. |
| Name-Value Argument Matching | | | Old way: `.., 'PropName', Value` and parse `varargin`. Since R2021a: `Name=Value` options in `arguments` | `Name=Value`; `**kwargs` |
| Major Dimension | Row | Row | Column | Row (Native/Numpy); Column for Pandas |
| Constness | const | const | Only in classes | N/A (Consenting adults) |
| Variable Aliasing | Pointers | References | NO! Rely on copy-on-write (no in-place functions*). Handle classes under limited circumstances | References |
| `=` assignment | Copy one element | Values: copy; References: bind | New copy (copy-on-write) | NO VALUES. Bind references only (could be to unnamed objects) |
| Chained access operators | N/A | Difficult to operator overload it right | Difficult to get it right. MATLAB had some chaining bugs with `dataset()` as well. | Chains correctly natively |
| Assignment expressions (assignment evaluates to assigned lvalue) | `=` | `=` | N/A | Named Expression `:=` |
| Version Management | | | `verLessThan()`; `isMATLABReleaseOlderThan` | venv (Virtual Environment) |
| Exponentiation | `<math.h>` `pow()` | `<cmath>` `pow()` | `^` | `**` |
| Stream (conveyor belt mechanism, saves memory) | I/O (std, file, sockets) | iterator in STL containers | MATLAB doesn’t do references. Just increment indices. | iterators (uni-directional only); `iter()`: `__iter__()`; `next()`: `__next__()` |
| Looping | `for(init; cont_cond; next)` | C-style; `for(auto running: iterable)` | `for k = array to iterate` | list-comp; `for (index, thing) in enumerate(lists)` |
Since MATLAB doesn’t do references, iterators (by extension generators) and functions that do in-place operations do not make sense (unless you bend it very hard with anti-patterns such as handles and dbstack).

Data Types

| Common | C | C++ | MATLAB | Python |
|---|---|---|---|---|
| Sets | N/A | `std::set` | Only set operations, not a set data type | `{ , , ...}` |
| Dictionaries | | `std::unordered_map` | Dynamic fieldnames (qualified varnames as keys); `containers.Map()` or `dictionary()` since R2022b | Dictionaries `{key:value}` (Native) |
| Heterogeneous containers | | | cells `{}` | lists (mutable); tuples (immutable) |
| Structured heterogeneous containers | | | `table()`; `dataset()` [Old]; Mix in classes | Pandas Dataframe |
| Array, Matrices & Tensors | | | Native `[ , ; , ]` | Numpy/PyTorch |
| Records | struct | class (members) | dynamic field (structs); properties (class); `getfield()`/`setfield()` | No structs (use dicts); attribute (class); `getattr()`/`setattr()` |
| Type deduction | N/A | `auto` | Native | Native |
| Type extraction | N/A | `decltype()` for compile time (static); `typeid()` for RTTI (runtime) | `class()` | `type()` |
Native set operations in Python are not stable, and there’s no option to use a stable algorithm like MATLAB has. Consider installing the orderly-set package.

Array Operations

| Common | MATLAB | Python |
|---|---|---|
| Repeat | `repmat()` | `[] * N`; `np.repeat()` |
| Logical Indexing | Native | List comprehension; Boolean Indexing (Numpy) |
| Equally spaced numbers | Internally `colon()`: `start:step:end`; `linspace`/`logspace` | `range(begin, past_end, step)` produces a lazy range object; `list(range())` or `tuple(range())` iterates to realize the vector |
| Equally spaced indexing | MATLAB has no generators, so produced vector only | `[start:past_end:step]` is internally `slice()`, which produces a slice object, not a range/list/tuple. Faster but not iterable |
| Shallow copy | Deep copy-on-write | Slice: `x = y[:]`; `copy.copy()` |
| Deep copy | Deep copy-on-write | `copy.deepcopy()` |
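The Python column of the table above can be checked with a few lines of standard-library code. This is only a sketch with arbitrary variable names:

```python
import copy

# range() is lazy; list() realizes the vector
r = range(2, 10, 3)
realized = list(r)

# The [start:past_end:step] syntax is a slice object under the hood
s = slice(2, 10, 3)
data = list(range(12))
picked = data[s]              # same as data[2:10:3]

try:
    iter(s)                   # a slice object is not iterable
    slice_iterable = True
except TypeError:
    slice_iterable = False

# Shallow copy shares the inner lists; deep copy does not
nested = [[1, 2], [3, 4]]
shallow = nested[:]
deep = copy.deepcopy(nested)
nested[0].append(99)
```

After the append, shallow[0] sees the new element because it refers to the same inner list, while deep[0] does not.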

Editor Syntax

| Common | C | C++ | MATLAB | Python |
|---|---|---|---|---|
| Commenting | `/* ... */`; `//` (only for newer C) | `//` (single line); `/* ... */` (block) | `%` (single line); block: `%{ ... %}` | `#` (single line); `"""` or `'''` is a docstring, which might be undesirably picked up |
| Reliable multi-line commenting (IDE) | | | Ctrl+(Shift)+R (Windows), / (Mac or Linux) | [Spyder]: Ctrl+1 (toggle), 4 (comment), 5 (uncomment) |
| Code cell (IDE) | | | `%%` | [Spyder]: `# %%` |
| Line Continuation | `\` | `\` | `...` | `\` |
| Console Precision | | | `format` | `%precision` (IPython) |
| Clear variables | | | `clear` / `clearvars` | `%reset -sf` (IPython) |
Macros only make sense in C/C++. They make code less transparent and are frowned upon in higher level programming languages. Even their use in C++ should be limited. Use inline functions whenever possible.

Python is messy about the workspace, so if you just delete

Object Oriented Programming Constructs

| Common | C++ | MATLAB | Python |
|---|---|---|---|
| Getters, Setters | No native syntax. Name mangle (prefix or suffix) yourself to manage | Define methods: `get.x`, `set.x` | Getter: `@property` `def x(self): ...`; Setter: `@x.setter` `def x(self, value): ...` |
| Deleters | Members can’t be changed on the fly | Members can’t be changed on the fly | Deleter (removing attributes dynamically by `del`) |
| Overloading (dispatch function by signature) | Overloading | Overload only by first argument | `@overload` (static type); `@singledispatch`; `@multipledispatch` |
| Initializing class variables | Initializer lists; Constructor | Constructor | Constructor |
| Constructor | `ClassName()`. Does not return (`*this` is implicit) | `obj=ClassName(...)`. MUST output the constructed object | `__init__(self, ...)`. Object to be constructed is 1st argument |
| Destructor | `~ClassName()` | `delete()` | `__del__()` |
| Special methods | Special member functions | (no name) methods that control specific behaviors | Magic/Dunder methods |
| Operator overloading | `operator` | operator methods to define | Dunder methods |
| Resource self-cleanup | RAII | `onCleanup()`: make a dummy object with the cleanup operation as its destructor, to be removed when it goes out of scope | `with` Context Managers |
| Naming for the object itself | Class: (class’s own name, by scope resolution operator `::`); Instance: `*this` | Class: (class’s own name); Instance: `obj` (or any output name defined in constructor) | Class: `cls`; Instance: `self` (recommended PEP8 names) |
Python allows adding members (attributes) on the fly with setattr(), which includes methods. MATLAB’s dynamicprops allows adding properties (data members) on the fly with addprop.

The onCleanup() idea does not work reliably in Python because Python leaves destructor timing up to the garbage collector, while MATLAB’s object destructor timing is deterministic (MATLAB specifically does not garbage collect user objects to avoid this mess. It only garbage collects PODs).

*this is implicitly passed in C++ and not spelled out in the method declaration. The self object must be the first argument in the instance method’s signature/prototype for both MATLAB and Python.

Functional Programming Constructs

| Common | C++ | MATLAB | Python |
|---|---|---|---|
| Function as variable | Functors (Function Objects): `operator()` | Function Handle | Callables (Function Objects): `__call__()` |
| Lambda Syntax | Lambda: `[capture](inputs) -> optional trailing return type {expr}` | Anonymous Function: `@(inputs) expr` | Lambda: `lambda inputs: expr` |
| Closure (Early binding): an instance of function objects | Capture `[]` only as necessary. Early binding `[=]` is capture all. | Early binding ONLY for anonymous functions (lambda). Late binding for function handles to loose or nested functions. | Late binding* by default, even for lambdas. Can capture Po through default values: `lambda x,P=Po: x+P` (we’re relying on users not to enter the captured/optional input argument) |
The concepts of early/late binding also apply to non-lambda functions. It’s about when to access (usually read) the ‘global’ or broader scope (such as in nested functions) variables that get recruited as non-input variables local to the function itself.

An instance of a function object is not a closure if there’s any parameter that’s late bound. All lambdas (anonymous functions) in MATLAB are early bound (at creation).

The more proper way (without creating an extra optional argument that’s not supposed to be used, aka defaults overridden) to convert late binding to early binding (by capturing variables) is called partial application, where you freeze the parameters (to be captured) by making them inputs to an outer layer function and return a function object (could be lambda) that uses these parameters.
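The three behaviors can be compared directly in Python. In this sketch, Po follows the naming in the table above, while make_adder() is a hypothetical name for the outer function that freezes the parameter:

```python
Po = 10

# Late binding: Po is looked up when the lambda is CALLED
late = lambda x: x + Po

# Early binding through a default value: Po is evaluated when the lambda is CREATED
default_capture = lambda x, P=Po: x + P


def make_adder(P):
    """Partial application: freeze P as an input of an outer function."""
    return lambda x: x + P


early = make_adder(Po)

Po = 100   # change the outer variable after the function objects exist

results = (late(1), default_capture(1), early(1))
```

Only the late-bound lambda sees the new value of Po. The standard library's functools.partial offers the same freezing for existing functions.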

The same trick (partial application) applies to binding (capturing) variables in simple/nested function handles in MATLAB, which do not behave the same way (early binding) as anonymous functions (lambdas).

Currying is partial application one parameter at a time, which is a tedious way to stay faithful to pure functional programming.

List comprehension is a shorthand syntax for transform/map() and copy_if/remove_if/filter() in one shot, but not accumulate/reduce(). MATLAB and C/C++ do not have listcomp, but listcomp is not specific to Python. Even PowerShell has it.

Listcomp syntax, if wrapped in round brackets like (x**x for x in range(5)), gives a generator. Wrapping in square bracket is the shortcut of casting the generator into a list, so [x**x for x in range(5)] is the same as list(x**x for x in range(5)).
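A quick check of the round-bracket versus square-bracket forms (variable names are arbitrary):

```python
gen = (x**x for x in range(5))   # generator: nothing is computed yet
first = next(gen)                # 0**0 evaluates to 1 in Python
rest = list(gen)                 # consumes the remaining items

as_list = [x**x for x in range(5)]
same = as_list == list(x**x for x in range(5))
```

The generator can only be walked once, which is why rest contains four items rather than five.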

Coroutines / Asynchronous Programming

MATLAB natively does not support coroutines.

| Common | C++20 | Python |
|---|---|---|
| Generators | Input Iterators | Functions that `yield value_to_spit_out_on_next` (implicitly return a generator/functor with iter and next) |
| Coroutines | | Functions that `value_accepted_from_outside = yield`. Send a value to the continuation by `g.send(user_input)`. async/await (native coroutines) |
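A short sketch of both Python forms from the table above. The function names countdown() and running_total() are hypothetical:

```python
def countdown(n):
    """Generator: spits out one value per next()."""
    while n > 0:
        yield n
        n -= 1


def running_total():
    """Generator-based coroutine: accepts values from outside via send()."""
    total = 0
    while True:
        received = yield total   # hands out total, then waits for send()
        total += received


counts = list(countdown(3))

g = running_total()
primed = next(g)   # run up to the first yield before sending anything
a = g.send(5)
b = g.send(7)
```

The priming next() call is required because a value can only be sent to a generator that is already paused at a yield.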

Matrix Arrays

The way Numpy requires users to specify matrices with a bracket for every row drives me nuts. Not only is there a lot of typing, the superfluous brackets reinforce C’s idea of row-major, which is horrendous to people with a proper math background who see matrices as column-major \mathbf{A}_{r,c}. PyTorch is the same.

Once you are trained in APL/MATLAB’s matrix world-view, you’ll discover going back to the world where matrices aren’t first class citizens is clumsy AF.

With Python, you lose the clutter-free readability where your MATLAB code is one step away from the matrix equations in your scientific computing work, even though a lot of the features that address frequent use patterns were implemented earlier in Python than in MATLAB.

Don’t believe those who haven’t lived and breathed MATLAB when they tell you Python is strictly superior. No, it isn’t. They just don’t know what they are missing, as they haven’t made the intellectual leap in MATLAB yet. Python is very convenient as a Swiss Army knife, but scientific computing is an afterthought in Python’s language design.

The MATLAB-like semicolon for changing rows only works with the np.matrix() type, which they plan to deprecate. For now, one can cast the matrix into an array, like np.array(np.matrix(matrix_string)).

Even numpy’s ndarray (or the matrix type to be deprecated) is CONCEPTUALLY equivalent to a matrix of cells in MATLAB. There are no native numerical matrices like in MATLAB that don’t have the overhead of unpacking arbitrary data types. You don’t want to do numerical matrices in MATLAB with cell matrices, as it’s insanely slow.

You get away without the unpacking penalty in Numpy if all the contents of the ndarray happen to have the same dtype (such as numerical), aka are known to be uniform. In other words, MATLAB’s matrices are uniform if formed by [] and heterogeneous if formed by {}, while for Python [] is context-dependent, kept track of by dtype.

| Concept | MATLAB | Numpy |
|---|---|---|
| Construction | `[8,9;6,4]` | `np.array([[8,9],[6,4]])` |
| Size by dimension | `size()` | `A.shape` |
| Concatenate within existing dimensions | `[A;B]` or `vertcat()`; `[A,B]` or `horzcat()`; `cat(dim, A, B, ...)` | `np.vstack()`; `np.hstack()`; `np.concatenate(list, dim)` |
| Concatenate expanding to 3D (expand in last dimension) | `cat(3, A, B, ...)` | `np.dstack()` (‘d’ for depth, the 3rd dimension) |
| Concatenate expanding dimensions | `cat(newdim, A, B, ...)` then `permute()` | `np.stack([A, ..], expand_at_axis)`; `np.array([A, ..])` expands at the first dimension, as the outermost bracket refers to the first dimension |
| Tiling | `repmat()` | `np.tile()` |
| Fill with same value | `repmat()` | `np.full()` |
| Fill with ones/zeros | `ones()`, `zeros()` | `np.ones()`, `np.zeros()` |
| Fill mimicking another array’s size | `repmat(x, size(B))`; `ones(size(B))`; `zeros(size(B))` | `np.full_like(B, x)`; `np.ones_like(B)`; `np.zeros_like(B)` |
| Preallocate | Any of the above (must be initialized) | `np.empty()`; `np.empty_like()` (UNINITIALIZED) |
repelem() is just repmat() with the repetition by axes vector expanded out as variable input arguments one per dimension. Using ones vector to broadcast a singleton instead of repmat() is horrendously inefficient and non-intuitive.

Heterogeneous Data Structures

Heterogeneous Data Structures are typically column major as it is a concept that derives from Structs of Arrays (SoA) and people typically expect columns to have the same data type from spreadsheets.

While Pandas offers a lot of useful features that I’ve easily implemented with wrappers in MATLAB, the indexing syntax of Pandas/Python is awkward and confusing. It’s due to the nature that matrix is a first-class citizen in MATLAB while it’s an afterthought in Python.

Python does not have the { } cell pack/unpack operator in MATLAB, so in Pandas, you select the Series object (think of it as a supercharged list with conveniences such as handling missing values and keeping track of row/column labels) then call its .values attribute.

However, Pandas is a lot more advanced than MATLAB in terms of using multiple columns as keys and has more tools to exploit multi-key row names (row names are not mandatory in MATLAB but mandatory in Pandas). In the old days I had to write my own MATLAB function with unique(.., 'rows') and exploit its index output to build unique keys under the hood.

| Concept | MATLAB | Python (Pandas Dataframe) |
|---|---|---|
| Rows | Observations (`dataset()`); Row (`table()`) | Rows; index |
| Columns | Variables | Columns |
| Select rows/columns | `T(rows, cols)` | `T.loc[r, col_name]`; `T.iloc[r,c]`. Caveats: a single index (not wrapped in a list) has its content extracted; iloc on the LHS cannot expand the table but loc can, though it can only inject 1 row; you can get the index number of a name with `T.columns.get_loc()` or `T.index.get_loc()` to use with `T.iloc[]` |
| Remove rows/columns | `T(rows, cols) = []` | `T.drop(index=rows, columns=cols)`. Optionally: `inplace=True`. `del T[rows, cols]` does NOT work |
| Extract one column | `T{:, c}` | `T[c].values` |
| Extract one entry | `T{r, c}` | `T.at[r,col_name]`; `T.iat[r,c]`. Faster than loc/iloc |
| Show first few rows | `T(1:5, :)` | `T.head()` |
| Drop duplicate rows | `unique(T, 'stable')` | `T.drop_duplicates()` |
| Ordinal | `categorical()`; `ordinal()` | `Categorical()`; `Index()` |
| Getting column names/labels | `T.Properties.VariableNames` (returns `cellstr()` only) | `T.columns` (returns `Index()` or `RangeIndex()`) |
| Getting row names/labels | `T.Properties.RowNames` | `T.index` |
| Transpose table | `rows2vars()` | `T.transpose()` |
| Move columns by name | `movevars()` since R2023a | |
| Rename columns | `renamevars()` since R2020a | `T.rename(columns={source:target})` |
| Rename rows | Modify `T.Properties.RowNames` | `T.rename(index={source:target})` |
| Use column as row indices | `T.Properties.RowNames = T.cellstr_variablename`. If multiple columns are needed, you need to combine them into one column using some user rules | `T.set_index(column_to_use)`. Dataframe allows multiple columns as row index keys |
| Reorder or partial selection | `T(rows, cols)` | `T.reindex(columns=..., index=...)`. New labels will autofill with NaN |
| Select columns | `T(:, cols)` | `T[list_of_cols]` |
| Pick column by data type | `T(:, varfun(...))` | `T.select_dtypes(include=[list of type names])` |
| Pick column by string match | `T(:, varfun(...))` | `T.filter(like=str_to_match)` |
| Blindly concatenate columns of 2 tables | `[T1, T2]`. If you defined optional rownames, they must match. You can delete them with `T.Properties.RowNames = {}` | Pandas assigns row indices (labels) by default. Mismatched row labels do not combine in the same row. Consider `reset_index()` or overwrite the row indices of one table with another, like `pd.concat([T1, T2.set_index(T1.index)], axis=1)` |
| Blindly concatenate rows of 2 tables | `[T1; T2]` | `pd.concat([T1, T2], ignore_index=True)` |
| Format export | `writetable()` | `.to_*()` |
MATLAB tables do not support ranging through column names (such as 'apple':'grapes'), yet Pandas DataFrame supports it. I think it’s fine to use it in the interpreter to poke around, but in a program this is just asking for confusing logic bugs when the columns are moved around and the programmer has a false sense of security about knowing exactly what’s where because they are using only names.

Dataframe is a little smarter than MATLAB’s table() in terms of managing column names and indices, as they are tracked with the Index() type, which is the same idea as MATLAB’s ordinal() ordered categorical type, where unique names are mapped to unique indices and it’s the indices under the hood. This is how 'apple':'grapes' can work in Python but not MATLAB.

MATLAB T.Properties.VariableNames is a little clumsy. I usually implement a consistent interface called varnames() that’d output the same cellstr() headings whether it’s struct, dataset or table objects.

MATLAB’s table() by default does not make up row names. Pandas makes up row names sequentially by default.

MATLAB’s table() requires qualified strings as variable names. Dataframe doesn’t care what labels you use as long as Index() takes them. It can get confusing because you can have the number 1 and the string ‘1’ as column headers at the same time, and they look the same when displayed in the console.
