MATLAB features/conveniences not available in Python/C++

A lot of MATLAB’s conveniences over Python (and vice versa) stem from each language’s design choices about what is a first-class citizen and what is an afterthought.

MATLAB has its roots in scientific computing, so operations (use cases) that are natural to scientists come first. Python is a great general-purpose language, but its motivation ultimately came from a computer-science point of view, so it eventually gets clumsy when it tries to do everything MATLAB does.

Matrix (a stronger form of array) takes center stage in MATLAB

In MATLAB, the matrix is a first-class citizen. Unlike other languages that start with scalars and build containers to extend them into arrays (and then into matrices), nearly everything in MATLAB is assumed to be a matrix, so a scalar is simply seen as a 1×1 matrix.

This is a huge reason why I love MATLAB over Python based on the language design.

Lists in Python are what cell containers are in MATLAB, so a list of numbers in Python is not the same as an array of doubles in MATLAB: the contents of [1, 2, 3.14] must all be the same type in MATLAB.

Non-uniform containers like cells/lists are much slower because algorithms cannot take advantage of a uniform data structure packed tightly in memory (the underlying contents sitting right next to each other), and they need extra logic to make sure different types are handled correctly.

np.array() is an afterthought, so the syntax for specifying a matrix is clumsy! The syntax is built on lists of lists (composition, like arr[r][c] in C/C++). There used to be a way to use MATLAB’s syntax of separating matrix rows with a semicolon ‘;’ via np.matrix('') through a string (which is clearly neither native nor code-transparent).

Given that np.matrix is deprecated, this option is out the window. One might think np.array would take similar syntax, but fuck no! If you type a MATLAB-style matrix string, np.array will treat it as an arbitrary (Unicode) string, which is a scalar (0-dimensional).

For my own use, I extracted the routine numpy.matrix used for parsing MATLAB-style 2D matrix definition strings into a free function. But the effort wasted getting over these drivels is part of the hidden cost of using Python over MATLAB.

import ast

''' 
mat2list() below is _convert_from_string() copied right off
https://github.com/numpy/numpy/blob/main/numpy/matrixlib/defmatrix.py
because Numpy decided to phase out np.matrix yet chose not to
transplant this important convenience feature to ndarray
'''

def mat2list(data):
    for char in '[]':
        data = data.replace(char, '')

    rows = data.split(';')
    newdata = []
    for count, row in enumerate(rows):
        trow = row.split(',')
        newrow = []
        for col in trow:
            temp = col.split()
            newrow.extend(map(ast.literal_eval, temp))
        if count == 0:
            Ncols = len(newrow)
        elif len(newrow) != Ncols:
            raise ValueError("Rows not the same size.")
        newdata.append(newrow)
    return newdata

Numpy arrays really corner users into keeping track of the dimensions by requiring at least 2 levels of brackets for matrices! No brackets = scalar, 1 level of brackets [...] = array (1D), 2 levels of brackets [ [row1], [row2], ... [rowN] ] = matrix (2D). Python earns an expletive from me every time I type in a matrix!
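A quick demonstration of the bracket-counting rules, and of what np.array actually does with a MATLAB-style string (a sketch, assuming numpy is installed):

```python
import numpy as np

# Bracket depth dictates dimensionality:
assert np.array(3).ndim == 0                  # no brackets  -> scalar (0-D)
assert np.array([1, 2, 3]).ndim == 1          # one level    -> array (1-D)
assert np.array([[1, 2], [3, 4]]).ndim == 2   # two levels   -> matrix (2-D)

# A MATLAB-style matrix string is NOT parsed the way np.matrix used to;
# it is stored whole as a 0-D Unicode scalar:
s = np.array('1 2; 3 4')
assert s.ndim == 0 and s.dtype.kind == 'U'
```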

Slices are not first class citizens in Python

Slices in Python are roughly equivalent to the colon operator in MATLAB.

However, in MATLAB the colon operator is native down to the core, so you can create a row matrix of equally spaced numbers without any surrounding context. The end keyword, a shortcut for the length of the dimension being indexed (which happens to be the last index thanks to 1-based indexing), obviously makes no sense (and is therefore invalid) for a free-form colon.

Python, on the other hand, uses slice objects for indexing. A slice object can be instantiated anywhere (free form), but building one from the colon syntax is handled exclusively inside the square-bracket [] access operator, i.e. the __getitem__ dunder method. Slice objects are simpler than range: they are not iterable, so they are useless for generating a list of numbers the way MATLAB’s colon operator does. In other words, Python reserved the colon syntax yet does not offer the convenience of generating equally spaced numbers like MATLAB does. Yuck!
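To make the asymmetry concrete, here is a sketch: a slice object can be built in free form and reused in any [] access, but unlike MATLAB’s colon it cannot generate numbers by itself:

```python
s = slice(3, 6)                    # free-form instantiation, no [] needed
data = [0, 1, 2, 3, 4, 5, 6]
assert data[s] == [3, 4, 5]        # equivalent to data[3:6]

# ...but a slice is not iterable, so it cannot produce the numbers itself:
try:
    list(s)
except TypeError:
    pass                           # 'slice' object is not iterable
else:
    raise AssertionError('slice should not be iterable')
```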

Since everything is a matrix (dimension >= 2) in MATLAB, there’s no such thing as 0 dimensions (scalar) or 1 dimension (array/vector) as there is in Numpy/Python.

Transposing a 1D array makes no sense in Python, so it’s a no-op (a 1D array is promoted to a row vector when interacting with 2D arrays/matrices), while slices make no sense for scalars.

Because of this, you don’t get to just say 3:6 in Python and get [3, 4, 5] (which in MATLAB terms is really {3, 4, 5}, because lists in Python are heterogeneous containers like cells; 3:5 in MATLAB produces a genuine matrix like those used in numpy).

You will have to cast range(3,6), which is a lazy iterable, into a list, i.e. list(range(3,6)), if the function you call doesn’t accept iterables but instead wants a fully generated list stored in memory.

This is one of the big conveniences (compact syntax) that are lost with Python.

More Operator Overloading

Transpose in Numpy is an example of CS people not being exposed to scientific computing enough to know which use case is more common:

MATLAB | Numpy                              | Meaning
a.'    | a.transpose() or a.T               | transpose of a
a'     | a.conj().transpose() or a.conj().T | conjugate transpose (Hermitian) of a

(table adapted from https://numpy.org/devdocs/user/numpy-for-matlab-users.html)

Complex numbers are often not CS people’s strong suit. Whenever we take a ‘transpose’ with a physical meaning or context attached to it, we almost always mean the Hermitian (conjugate transpose)! Most often the matrix is real anyway, so many of us get lazy and simply call it a transpose (a special case); it’s easy to overlook this if the designers/implementers don’t have a lot of firsthand experience with complex matrices in their math.

MATLAB is not cheap on symbols and overloaded operators for transposes, with the shorter version covering the most frequent use case (Hermitian). In Python you are stuck calling methods instead of typing these common scientific-computing operators as if they were equations.

At least Python could do better by implementing an a.hermitian() method and an a.H attribute. But judging from the lack of foresight, the community that developed it is likely not sophisticated enough in complex numbers to call conjugate transposes Hermitians.
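For the record, here is what the table above translates to in numpy terms, plus the 1-D no-op mentioned earlier (ndarray has no .H attribute; the deprecated np.matrix did):

```python
import numpy as np

a = np.array([[1 + 2j, 3 - 1j]])           # a 1x2 complex matrix

t = a.T                                     # plain transpose
h = a.conj().T                              # Hermitian (conjugate transpose)

assert t.shape == h.shape == (2, 1)
assert np.array_equal(h, np.array([[1 - 2j], [3 + 1j]]))

# And .T on a 1-D array is a no-op: there is no row/column distinction.
v = np.array([1, 2, 3])
assert v.T.shape == (3,)
```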

Conventions that are more natural to scientific computing than programming

Slice notation in Python puts the step as the last (3rd) parameter, which makes perfect sense through the eyes of a programmer: it’s messy to have the second parameter mean step or endpoint depending on whether there’s one colon or two. By placing the step consistently as the 3rd argument, the optional case is easier to program.

To people who think in math, it’s more intuitive to specify a slice/range in the order you’d draw the dots on a number line: you start with a starting point, then you need the step size to move to the next point, then you need to know when to stop. This is why it’s start:step:stop in MATLAB.

Python’s slice convention, start:stop_exclusive:step, reads like “let’s draw a line with a starting point and an end point, then figure out what points to put in between”. It’s mildly unpleasant to people who parse what they read on the fly (rather than buffering until the whole sentence is complete), because a 180-degree turn can appear at the end of a sentence (which happens a lot with Japanese or Reverse Polish Notation).

Be careful that the endpoints in Python’s slices and C++’s STL .end() are exclusive (open), meaning the exact endpoint is not included. 0-based indexing systems (Python and C++) love to specify “one past the last” instead of the included endpoint because it happens to align with the total count N: there are N points in [0, N-1] (note N-1 is inclusive, a closed end), which is equivalent to [0, N) (N being an open end) for integers. This half-open (open-end) convention avoids painfully typing -1 all over the place in most use cases of a 0-based indexing system.
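The half-open payoff shows in two lines: the count equals the excluded endpoint, and adjacent ranges tile together without any -1/+1 bookkeeping:

```python
N = 5
assert len(range(0, N)) == N                   # [0, N) holds exactly N points
assert list(range(0, N)) == [0, 1, 2, 3, 4]

text = 'abcdef'
k = 2
assert text[:k] + text[k:] == text             # adjacent half-open slices tile
```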

0-based indexing is convenient when doing modulo arithmetic (more common among programmers), while 1-based indexing matches our intuition of natural numbers (which start from 1, bite me. lol), so when we count to 5 there are 5 items total. My oscilloscope doesn’t call the first channel Channel 0, and I work with floats more than I work with modulo, so 1-based indexing has a slight edge for my use cases.

MATLAB auto-extends when assigning to an out-of-range index, Python doesn’t

This is one behavior I really hate Python for, with passion. Enough to keep me leaning towards MATLAB.

In MATLAB, I can simply assign the result x{3} = 4 even when the list x starts as an empty cell x = {}, and MATLAB will be smart enough to auto-extend the list. Python will give you a nasty IndexError: list assignment index out of range.

I pretty much have to preallocate my list with [None] * target_list_size. MATLAB is pretty tight-assed about not allowing syntax/behaviors that let users hurt themselves in insidious ways, yet they figured that if you expand a matrix you didn’t intend to, you’ll soon find out when the dimensions mismatch.

Note that numpy arrays have the same behavior (they refuse to auto-expand when assigned an index outside the current range).
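A minimal sketch of the difference (the MATLAB side shown only in a comment):

```python
x = []
try:
    x[2] = 4            # MATLAB's  x{3} = 4  on  x = {}  would auto-extend
except IndexError as e:
    assert 'out of range' in str(e)

# The workaround Python forces on you: preallocate, then assign.
x = [None] * 3          # target_list_size = 3
x[2] = 4
assert x == [None, None, 4]
```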

No consistent interface for concatenation in Python

In MATLAB, if you have a cell of tables C, you can vertically concatenate them simply with vertcat(C{:}), because MATLAB has a consistent interface for vertical concatenation, which is what the [;] operator calls.

Note that the cell unpack {:} in MATLAB translates to a comma-separated list; putting square brackets over the commas, as in [C{:}], is horzcat(C{:}) because it’s [,].

Python doesn’t have such a consistent interface. Lists are concatenated with the + operator, while DataFrames are concatenated with pd.concat(list_or_tuple_of_dataframes, ...), because + on DataFrames means elementwise application of + (whatever + means to the pair of elements involved).

I just had a simple use case: a list containing dataframes corresponding to tests on each channel (index), which I run one by one. They don’t need to be run in order, nor do all the tests need to be completed before I collect (vertically stack) the dataframes into one big dataframe!

Vertically stacking such a list of DataFrames is a nightmare. The developers haphazardly added a check to pandas.concat() that throws an error if everything in the list is None: “ValueError: All objects passed were None”!

If I haven’t run any tests yet, attempting to collect an aggregate table should return None instead of throwing a fucking ValueError! Checks for whether the operation is a no-op given the source data should be left to users! It’s safer to do less and let users expand on it than to nanny them and have users painfully undo your unwanted goodwill!

How empties are handled in each data type, like cell or table(), is an important part of a consistent generic interface, making sure different data types work together (cast or overload automatically) seamlessly. TMW support showed me a very detailed thought process on what to do when a row is empty (length 0) or a column is empty (length 0) in our discussion of the implementation details of dataset/table (heterogeneous data types). I just haven’t seen that thoughtfulness in Python (lists), Numpy (arrays) or Pandas (dataframes) yet.

Now, with the poorly-thought-out extra check in pd.concat(), I have to check whether the list is all None myself. I don’t jump to listcomps or maps if there’s a more intuitive way, since listcomps/maps are shorthands for writing for-loops rather than expressions of a specific concept, such as lst.count(None) == len(lst) or set(lst) == {None}.

DataFrames broke set() with TypeError: unhashable type: 'DataFrame', because DataFrame has no __hash__, because it’s mutable (it can be operated on in place through references).

Then DataFrames break list.count() with ValueError: The truth value of a DataFrame is ambiguous..., because the meaning of == changed from object comparison (which returns a simple boolean) to elementwise comparison (which returns a non-scalar structure with the same shape/frame as the DataFrame itself).

Aargh!
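What I eventually settled on is a sketch like this: filter with an identity check (which sidesteps ==, hashing, and the all-None check entirely), and only call pd.concat when something is left (ignore_index is just my preference here):

```python
import pandas as pd

tests = [None, pd.DataFrame({'channel': [2], 'reading': [0.5]}), None]

frames = [t for t in tests if t is not None]   # 'is' avoids DataFrame's ==
stacked = pd.concat(frames, ignore_index=True) if frames else None
assert stacked is not None and len(stacked) == 1

# Nothing run yet? We get None back instead of a ValueError:
empty = [t for t in [None, None] if t is not None]
assert (pd.concat(empty, ignore_index=True) if empty else None) is None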


Concepts in C++ that do not apply to Python or MATLAB

Static Data Members

In Python/MATLAB, data members are called properties; in C++ they’re called (data) members. I’ll use these names interchangeably when comparing the languages.

Python and MATLAB have the concept of static methods, but static properties (data members) don’t really exist in either language.

Python’s attributes have a split personality (class AND instance), unlike C++ where you choose between class XOR instance. That’s why I call them class variables: in C++, static variables have no split personality, you’re either classwise (static) or you’re not. In C++ (or MATLAB) the two cases can’t share the same variable name, whereas in Python a class variable can be shadowed by an instance variable.
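The shadowing in action, as a sketch (Counter and total are hypothetical names):

```python
class Counter:
    total = 0              # class variable, shared classwise

c = Counter()
assert c.total == 0        # instance lookup falls through to the class
c.total = 10               # creates an INSTANCE attribute with the same name
assert c.total == 10
assert Counter.total == 0  # the class variable is shadowed, not overwritten
```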

As for MATLAB, there’s no Static property. The only classwise property allowed is the Constant property. TMW (the creator of MATLAB) decided not to allow non-Constant classwise/static properties because of this web of rules:

  • A.B = C can mean either
    1) [Variable] write a new struct A with field B, or
    2) [Class] attempt to write to property B of the class (not an instance of) A.
  • If a class (definition) A is loaded into the current workspace, allowing case #2 might throw off users who intended to make a struct A with field B out of nowhere.
  • MATLAB gives variables higher priority than functions/classes, so case #2 has to be struck down to make things unambiguous.
  • By making A.B, a classwise access to property B, either read-only (Constant) or tied to instances (a = A(); a.B = C), MATLAB avoided the situation of A.B = C while A is a class (case #2), so A.B = C is unambiguously writing to a struct field (case #1).

I know. This is quite lame. Matrices are first-class citizens in MATLAB while classes are an afterthought that wasn’t really a thing until 2008. You win some and you lose some.

The official TMW workaround is to use Static method getters/setters (not Dependent properties, which only work for instances or when what they depend on is exclusively Constant properties) with a persistent variable holding the data that’s meant to be static. This is very convoluted and it sucks. I’d call it a weakness of MATLAB.

Constant properties in MATLAB are static const (classwise-only)

In C++, const data members can be either instance-bound or classwise (static), which means instances can be initialized with different sets of constants.

Instance-wise constant data members are possible in C++ through (member) initializer lists, which happen to be the only route for private constants; public constants can also be list-initialized (including brace initialization, in newer C++ such as C++11).

The constructor is not a first-hand initializer in C++ (one that directly stamps memory with predefined values). Members are already fully constructed by the time you get inside the constructor body (just not necessarily with the values you wanted). Therefore, in C++, constants and references (things that cannot be reseated) for instances must be initialized in the (member) initializer list that runs right before the constructor body.

Only static const members can optionally be defined directly in the class with an = sign since C++11. Before that it only worked for integral types, as it was essentially an enum under the hood in primitive compiler designs. Otherwise you declare the data member (without the value) inside class C, e.g. static const T x;, then define it (assign the value) outside the class definition, e.g. const T C::x = 42;.

In MATLAB, Constant properties are classwise-only (in fact, the only kind of static property allowed, as discussed in the first section). You simply declare the value in the class definition with an = sign, just like the quick way to write static const in C++11, because MATLAB has no concept of a (member) initializer list, the mechanism that lets you control how consts and refs are stamped out rather than remodeling them in the constructor after they are already made.

Accessing a Constant property through an object instance is necessarily a shortcut for the classwise constant in MATLAB, while in C++ it depends on whether the const is static.

In Python, everybody is a consenting adult. Scream your constants in ALL CAPS and hope everybody acts like a gentleman and doesn’t touch them. Lol

Static Native Getters/Setters

C++ has no native getters/setters invoked through a data member’s name, where the said “member” doesn’t store state but instead acts as a proxy (potentially interacting with other, state-holding data members).

Let’s say the variable with a native getter/setter in question is x.

In MATLAB, this feature is called a Dependent property: you define members under properties (Dependent) and the getter function is named function get.x(self, ...).

In Python, the @property decorator wraps the function bearing the dependent member’s name, i.e. def x(self, ...), so callers can trigger the corresponding getter/setter without the function-call round brackets ().

However, in both MATLAB and Python, dependent data members are aimed mainly at instances (created objects), not at the class!

In Python, it’s simply impossible to stack @staticmethod or @classmethod with @property. I tried, and Python just declared the function name (the dependent property name) a property object, so when I “call” it (without the round brackets, of course) it merely shows “<property at 0x...>”.
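A minimal sketch of both behaviors (Circle and area are hypothetical names; the exact stacking behavior varies a bit across Python versions):

```python
class Circle:
    def __init__(self, r):
        self._r = r

    @property
    def area(self):        # instance-bound getter: no () at the call site
        return 3.141592653589793 * self._r ** 2

assert abs(Circle(2).area - 12.566370614359172) < 1e-9

# Accessed through the class rather than an instance, all you get back
# is the property object itself ("<property at 0x...>"):
assert isinstance(Circle.area, property)
```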

The case of MATLAB is a lot weirder. It’s a web of rules based on seemingly unrelated reasons, which created a catch-22 that makes Dependent properties instance-bound with no means of accessing them classwise:

  1. MATLAB doesn’t have classwise properties that are not Constant (i.e. no static data members/properties), to avoid breaking backward compatibility (see the first section of this post for the web of rules that caused it).
  2. In C++’s lingo, MATLAB has no (mutable) static data member. The nearest thing is the (immutable) static const data member (called a Constant property in MATLAB).
  3. MATLAB does not have instance-wise constants (see above).
  4. A Dependent property in MATLAB is neither classwise nor instance-wise, because the concept itself is a method (function member) pretending to be a property (data member). More precisely, it’s an extra level of indirection that dispatches getters or setters depending on whether it’s ‘read’ or ‘written’.
  5. MATLAB not treating properties (data members) and methods (function members) the same way (regarding whether they’re class-bound or instance-bound) created a dilemma (identity crisis) for Dependent properties.
  6. Although a Dependent property is really a method in disguise, its getters/setters are not allowed to take on a Static role like other methods, because it claims to be a property (data member) and is therefore stuck following the more restrictive rules that apply to properties (i.e. no static).

Without knowing the underlying rules above, the implications are less than intuitive (they make perfect sense once you know the constraints MATLAB operates under, which make it nearly impossible to support the sensible use case where a user simply wants a shortcut for computing with classwise/Constant properties):

  1. MATLAB’s rules require any property that is not Constant to be accessed exclusively through instances.
  2. A Dependent property is not a Constant property, so it’s considered an instance-based member.
  3. This implies a Dependent data member can only be called from an instance (regardless of whether you’re calling from within the same class or from outside).
  4. Since a Dependent property is defined to be instance-bound, the instance object (self) is passed to the getters/setters as the first argument, and their definitions have to expect it, e.g. function get.x(self, ...), just like all instance-bound methods.
  5. Nothing says such getters/setters must use the self passed to them. However, if you use Constant properties, it doesn’t matter whether you go through self (the object instance) or the class name, since constants are classwise-only in MATLAB, so both syntaxes refer to the same thing.

RAII available in C++ and MATLAB but not Python

A garbage collector is relevant only when you’re allowed to make multiple aliases to the same underlying data and don’t want to meticulously track who is ultimately responsible for winding it down, and when.

In particular, a garbage collector has the latitude not to clean up an object right away when its reference count drops to zero, whereas shared_ptr (an automatic memory management technique) promptly destroys the object the moment the count touches zero. A garbage collector can thus procrastinate on the painful release (cleanup) process so the program doesn’t have to stumble through cleanups frequently.

C++ chose not to use a garbage collector (class destructors running at non-deterministic times would break RAII). Python chose to embrace one at the expense of breaking RAII. MATLAB is somewhere in between, yet the way MATLAB does it does not break RAII, though that’s not obvious.

The MATLAB language has a very unique design choice (mental model): users see and reason about variables as deep copies of stack objects, so there is no concept of aliases (let alone reseatable aliases) in the first place that would need garbage collection in the user-serviceable mental model of memory management.

Some people talk about JVM garbage collection in MATLAB, but that garbage collector only handles Java objects (which I almost never use unless it’s through Yair Altman’s code). Everything else is handled by MATLAB’s engine.

Since MATLAB’s engine manages how the underlying data is shared, along with allocating and freeing it, some people argue that it’s garbage collection.

The popular mental picture of garbage collection is reference counting WITHOUT tracking the real owner (often the original creator): when the last holder drops its link (reference) to the underlying data, the data becomes orphaned and ready to be garbage collected.

For conventional garbage collectors, the timing of object destruction (when the destructor is called) is not deterministic because the object does not die with its original creator; the ‘ownership’ is effectively transferred to the last user of the underlying data.

TMW specifically said they are not garbage collecting (at least for classes, since the article is about classes), so whatever they’re doing under the hood is not garbage collection in the conventional sense (https://www.mathworks.com/company/technical-articles/inside-matlab-objects-in-r2008a.html).

MATLAB uses copy-on-write, and Loren dropped a clue that copy-on-write is not triggered for an entire struct when only one field is changed (only the changed field is copied).

This would imply that if MATLAB does anything close to garbage collection, or manages allocation and deallocation at all, it can only be on pieces (my guess: PODs, simple native data types) that don’t involve a user-defined destructor.

Decoupling automatic memory management from classes has the advantage of preserving RAII, because user-defined destructors are called deterministically. Only after the destructors have unwound down to the simple data types (some classes contain other complex objects, so their destructors are chain-called deterministically, down to the leaves of the tree) with no more user-defined destructors attached can the automatic memory management mechanism decide how long to keep those simple chunks around (if somebody else is still using one, an extra copy need not be made).

In Python, everything is perceived as a (class) object, even the PODs. MATLAB and C++ distinguish between PODs and user-defined classes. This means that if Python chooses to do garbage collection, it has to do it to classes with user-defined destructors as well, thus breaking RAII.
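The RAII breakage is easy to demonstrate in CPython (a sketch; the cycle collector is disabled temporarily so it can’t fire early):

```python
import gc

log = []

class Resource:
    def __del__(self):
        log.append('cleaned')

gc.disable()                       # keep the cycle collector out of the way
a, b = Resource(), Resource()
a.peer, b.peer = b, a              # reference cycle: refcounts never hit zero
del a, b
assert log == []                   # destructors have NOT run - RAII is broken

gc.collect()                       # cleanup happens only when the GC decides
assert log == ['cleaned', 'cleaned']
gc.enable()
```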


Librarize! Free variables/functions school of thought (as compared to OOP)

When programming C++, I prefer to stick to free functions and refactor everything generic into libraries. However, that doesn’t seem to be the norm for now. I’m glad, after seeing this video, that I’m not the only one who prefers free functions.

This lecture explains why preferring free functions, instead of jumping to cram everything into classes, aligns with OOP doctrine, but that’s not how I came up with the idea on my own.

TLDR: My whole thesis of preferring free functions is based on

  1. there’s no reason to reinvent the wheel badly: identify generic operations and factor them out into calls to generic libraries not tied to the business logic!
  2. my observation that data members are a globalist (global variable) style of programming, sugarcoated by containing the namespace pollution within class scope!

The lecture suggests putting some of the business-logic code in free functions too, right after the class definitions. I hadn’t really thought of that, because I usually refactor aggressively enough that there’s not much left to pollute the class’s namespace.

If you refactor your code well, the top-level code should be so succinct that it pretty much reads as the business logic, free of the noise of implementation details, to the extent that non-programmers can develop a picture of what your code does without knowing the intricate mechanics of the programming language.


Background

A class is a mental model built on the Von Neumann architecture’s suggestion that data (variables) and program (functions) aren’t very different after all.

Structs provide a compact way to bundle different variables into one logical unit. They’re mostly eye candy.

Given the Von Neumann view that program (function pointers, in practice) can be treated as data (variables), a struct can bundle programs with data.

People then naturally wove the fabric of classes, calling a bundle of actions (program) and state (data) an object, mimicking our daily observations.

Making certain fields of a struct callable requires a little special treatment in the compiler. This is the most primitive form of a class.

I said state here because data members in OOP naturally encourage people to frame the program in terms of shared state rather than a tightly controlled data flow of arguments passed in function calls.

Functional programming avoids state, which makes it the polar opposite way of structuring your program and data: not even passing data around, but chaining actions on data.


Direct delivery (local variables) vs Sharing access (data members)

Shared state is what globalist programmers (pun intended) practice with non-local variables.

With local variables passed through arguments in function calls, you hand the item (say a letter) to the intended recipient (the function you call) directly. It’s point-to-point delivery: simple and predictable.

With data members or nonlocal variables, you leave your message in a dropbox (non-local scope or data-member scope) and hope for the best (that the right people pick it up and nobody messes with it in between).

People say globals are evil, but they miss the point if they think it’s merely the breadth of namespace pollution that makes them evil. It’s actually the dropbox mechanism that makes your program fragile and defenseless against domestic (namescope) violence (unwanted data mucking).

Enclosing globals in data members only downgrades potential public violence to potential domestic violence (pun intended).

Classes merely put a lock on the dropbox and give ALL the people in the same family/class (methods) the same key to the private members. Public means unlocked.

Think of data-member scope as the fridge at home. Methods are the people at home. Whoever has the house key can put food in the fridge, mess with it, or eat it.

Within the same class/family there is no finer control over which people (methods) can touch which food item (data member), as long as they’re at the same trust level.

So if you don’t refactor your class composition (not hierarchy, as hierarchy likely exposes more data members to more methods that don’t need them) so that only the intended methods can access only the data members they need, you are not encapsulating tightly.

“Only allow intended methods to access only the data members they need” is hellishly hard. It amounts to devising a complicated composition hierarchy and managing the interfaces between the pieces so that every class involved is a complete bipartite graph between data members and methods!

For example, if method R only needs (A, B, C), it should not have access to D, so D needs to be factored out into another class. If data member B is needed only by methods (P, Q), it should not be in the same family as R. Then you have to manage the interfaces between these classes. Yikes!

If you go this far to encapsulate properly, you might as well factor as many methods as you can into free functions that take their data directly as arguments. In reality we don’t go that far; we stop somewhere once the bipartite graph is dense enough that our minds can keep track of all the parties involved (which methods touch which data members) within the class.


Compromise / Solution?

There are applications, like GUI objects, where it’s just a total pain in the ass to explicitly pass data through arguments on every event (callback), so state holding (data members) makes more sense. Eventually we have to leave something in data members even after doing our best to factor free functions out of the class. GUIs are one use case where I won’t shame myself for abusing non-local scopes.

I hope C++ will eventually come up with a compile-time contract syntax (method access control) that lets users declare which methods may touch which data members (potentially with read/write granularity), such as:

private:
   bool a : method1, method2(r), method3(w)
   char b : method2(rw), method3(r)

I don’t think we need to go as far as micromanaging the other direction, where methods declare which data members they may touch, as that requires extra work to check that the two directions agree. It’s methods that can go rogue on data; data cannot go rogue on methods unless it was abused/poisoned first.


When to use classes

OOP is a useful idea, but I would not over-objectify, like writing a class that holds 5 constants, or organizing a collection of loosely coupled or unrelated generic helper functions into a class (that should be organized into packages or namespaces instead). Over-objectifying reeks of cargo-cult programming.

My primary approach to program design is self-documentation. I prefer to present the code in a way (not just the syntax, but the architecture and data structures) that’s the easiest to understand without material sacrifices in performance or maintainability. I use classes when the problem statement happens to naturally align with the features classes offer, without a lot of mental gymnastics to frame it in terms of classes.

My decision process goes roughly like this:

  • If a problem naturally screams data type (like matrices), which is heady on operator overloading, I’d use classes in a heartbeat as data types are mathematical objects.
  • Then I’ll look into whether the problem is heavy on states. In other words, if it’s necessary for one method to drop something in the mailbox for another method to pick it up without meeting each other (through parameter passing calls), I’ll consider classes.
  • If the problem statement screams natural interactions between objects, like chess pieces on a chessboard, I’d consider classes even if I don’t need OOP-specific features.

Do not abuse OOP to hide bad practices

The last thing I want to use OOP as a tool for:

  1. Hiding sloppy generic methods that are correct only under implicit assumptions (specific to the class) that are not spelled out, like sorting unique 7-digit phone numbers.
  2. Abusing data members to sugarcoat casual use of globals and statics all over the place (poor encapsulation), as if you were writing a C program.

1) Free functions for generic operations

The first one is an example that calls for free functions. Instead of writing a special sort method that bakes in the assumption that the numbers are unique 7-digit integers, a free function bitmap_sort() should be written and put in a sorting library if there isn’t any off-the-shelf package out there.

In the process of refactoring parts of your program out into generic free functions (I just made up the word ‘librarize’ to mean this), you go through the immensely useful mental process of

  • Explicitly understanding the exact assumptions that specifically apply to your problem, which you want to take advantage of, or the restrictions you need to work around. You can’t be sure your code is correct, no matter how many tests you’ve written, if you aren’t even clear about which inputs your code is correct for and which unexpected inputs will break the assumptions.
  • Discovering the nomenclature for the concept you are using.
  • Knowing the nomenclature, you have a very good chance of finding work already done on it so you don’t have to reinvent the wheel … poorly
  • If the concept hasn’t been implemented yet, you can contribute to code reuse that others and your future self can take advantage of.
  • Decoupling the generic operation from the business logic (the class itself) allows you to focus on your problem statement and easily swap out the generic implementation, whether for debugging or performance, or hand the work over to others without spending more time explaining what you wanted than writing the code yourself.

This is much better than jumping into a half-assed implementation of an idea whose quirks (assumptions) you haven’t fully understood, and hiding it as an internal method.

Decouple, decouple, decouple

By hiding the good stuff (a generic piece of code implementing a useful algorithm) inside the class, you are merely luring the people dealing with your code base to go out of their way to break your intended access controls in order to get to the juicy implementation.

Don’t be embarrassed if your first attempt at the generic code/algorithm is primitive and applies only to very narrow input conditions! Instead of hiding the ‘embarrassing’ implementation inside a class that tolerates it thanks to the assumptions of the class’s context, you can simply add a suffix to your function name to spell out the limitations.

Making the name longer discourages people from expanding its usage beyond what you intended. It also helps code reviewers catch people using your function where the restrictions denoted by the suffix clearly don’t apply.

Later on, when people develop a more generic version (likely by learning from yours), they can shorten the function name by removing the restriction suffix. If your function turns out to be more generic than you planned for, they can always write a wrapper that calls your function under the hood.

For example, during your exploration, you could start with sort_unique_phone_numbers() in your library. Later on somebody expands on it and calls it sort_unique_numbers(). Eventually some of you realize it’s bitmap_sort(). This way you don’t have to worry about colliding with the name sort(), which is way more general.
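Under the stated restrictions (unique, bounded, non-negative integers — the classic phone-number setup), a bitmap sort takes only a few lines. This is a hedged sketch; bitmap_sort_unique() is a made-up name following the suffix convention above:

```python
def bitmap_sort_unique(nums, max_value):
    """Sort unique non-negative integers <= max_value via a bitmap.

    The suffix spells out the restriction: inputs must be unique
    (duplicates would be silently collapsed). O(n + max_value) time,
    max_value bits of space.
    """
    bitmap = bytearray((max_value >> 3) + 1)   # one bit per possible value
    for n in nums:
        bitmap[n >> 3] |= 1 << (n & 7)         # mark value as present
    return [v for v in range(max_value + 1)
            if bitmap[v >> 3] & (1 << (v & 7))]

# 7-digit phone numbers all fit under max_value = 9_999_999
print(bitmap_sort_unique([8675309, 5551212, 3624], 9_999_999))
# [3624, 5551212, 8675309]
```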

If you factor the juicy implementation (of a generic idea) out of your class, others can happily use your library of free functions, leaving your internal implementation alone instead of trying questionable maneuvers to exploit it, which leads to code cruft and debugging hell.

You also learn a new concept well, rather than repeating similar gruntwork over and over that benefits nobody else, and likely having to debug the tangled mess when you run into a corner case because you didn’t understand the assumptions well enough to decouple the generic operation from the class.

Overload free functions!

Polymorphism in OOP is a lot broader than just function overloading.

Virtual functions have to be an OOP thing because they are run-time polymorphism, which makes sense only with inheritance.

Templates and function overloading (both of which also apply to free functions) are compile-time polymorphism.

Polymorphism isn’t exclusive to OOP the way Bjarne defined it. C++ can overload free functions! You don’t need to put things into classes just because you want context-(signature-)dependent dispatch (aka the compiler figuring out which version of the same-named function to call).

I frequently overload free functions in MATLAB (by the type of the first argument), so it sounds natural to me.


Program organization strategies

If you are really edgy about namespace pollution, just use a namespace (not a class) for your free functions.

Here’s an example where I put half-baked generic library functions in C_tools and free functions that exclusively deal with the said class in C_free:

#include <iostream>

namespace C_tools {
  int double_singleton(int x) { return 2*x; }
}

class C {
  public:
    C() : a(64) {}
    void update(int x) { a = x; }
    int a_doubled() { return C_tools::double_singleton(a); }
    int get_a() { return a; }
  private:
    int a;
};

namespace C_free {
  // It's a reference, so this changes the state
  void double_a(C& c) {
    c.update(c.a_doubled());
  }
}

int main() {
    C c; // Internally a is 64
    // a_doubled() returns the 2*a, giving 128
    std::cout << c.a_doubled() << std::endl;

    // Free generic helpers for others to use too
    // 2*89 = 178
    std::cout << C_tools::double_singleton(89) << std::endl;

    // Free functions acting on instances
    C_free::double_a(c);  // a=2*64=128
    std::cout << c.get_a() << std::endl;
    C_free::double_a(c);  // a=2*128=256
    std::cout << c.get_a() << std::endl;
    return 0;
}

The same namespace can be reopened in separate blocks scattered across your code, so you can order the pieces by dependency. There’s no need to forward declare just to keep the whole namespace in one block.

I intentionally didn’t overload a free function in the example above, to show that if your code is not really up to par to be generalized alongside other code with the same idea and the same name, it’s better to just use different names and not let C++ dispatch by context/signature.

In the following example,

  • I merged C_tools into C_free to demonstrate that the same namespace does not have to stay in one block (even though keeping the two namespaces separate for their different audiences is the better practice).
  • Now that C_tools is merged into C_free, I’ll take advantage of it to demonstrate free function overloading. There are now two candidates for C_free::square(): one takes an int while the other takes a reference to class C, aka C&.
#include <iostream>

namespace C_free {
  int square(int x) { return x*x; }
}

class C {
  public:
    C() : a(64) {}
    void update(int x) { a = x; }
    int a_squared() { return C_free::square(a); }
    int get_a() { return a; }
  private:
    int a;
};

namespace C_free {
  // It's a reference, so this changes the state
  void square(C& c) {
    c.update(c.a_squared());
  }
}

int main() {
    C c; // Internally a is 64
    std::cout << c.a_squared() << std::endl; // 64*64=4096

    // Free generic helpers for others to use too
    std::cout << C_free::square(89) << std::endl; // 89*89=7921

    // Free functions acting on instances
    C_free::square(c);  // c.a=64*64=4096
    std::cout << c.get_a() << std::endl;
    C_free::square(c);  // c.a=4096*4096=16777216
    std::cout << c.get_a() << std::endl;
    return 0;
}

2) Classes are not an excuse to hide unnecessary use of globals/statics

Data members in classes are a namespace-scoped version of global/static variables that can optionally be localized/bound to instances. Private/public access specifiers in C++ expanded on the global vs. file-scope distinction that was switched through the static modifier back in the C days.

If you don’t think it’s a good habit to sprinkle global scope all over the place in C, try not to go wild using more data members than necessary either.

Data members give the illusion that you are encapsulating better, and they end up incentivizing less defensive programming practices. Instead of not polluting in the first place (designing your data flow away from the mentality of global variables), you have merely contained the pollution within namespace/class scopes.

For example, if you want to pass a message (say an error code) directly from one method to another and NOBODY else (no other methods) is involved, simply pass the message as an input argument.

Globals or data members are more like dropping a letter in a shared mailbox instead of handing it to your intended recipient, hoping the right person will somehow pick it up. Tons of things can go wrong with this uncontrolled approach: somebody else could intercept it, or the intended recipient might never know the message is waiting for him.

With data members, even if you mark them private, you are polluting the namespace of your class’s scope (and therefore not encapsulating properly) if any method can easily access data members it doesn’t need.


How I figured this out on my own based on my experience in MATLAB

Speaking of insidious ways to litter your program design with the globalist mentality (pun intended), data members are not the only offenders. Nested functions (not available in C++, but available in modern MATLAB and Python) are another hack that makes you FEEL less guilty about structuring your program in terms of global variables. Everything visible one level above the nested function is effectively global to the nested function. You are literally polluting the variable space of the nested function with the local variables of the enclosing function, which is a lot more disorganized than data members, where you at least acknowledge what you’ve signed up for.

Librarizing is the approach I came up with for MATLAB development: keep a folder tree of user MATLAB classes and free functions organized under sensible names. Every time I am tempted to reinvent the wheel, I try to think of the best name for it. If a folder with the same name exists, chances are I already did something similar before and just needed a little reminder. This way I always have high-quality in-house generic functions (whose use cases I can expand with backward compatibility as needed).

This approach works because I’m confident with my ability to naturally come up with sensible names consistently. When I did research in undergrad, the new terminologies I came up with happened to coincide with wavelets before I studied wavelets, as in hindsight what I was doing was pretty much the same idea as wavelets except it doesn’t have the luxury of orthogonal basis.

If a concept has multiple names, I often drop breadcrumbs with dummy text files suggesting the synonym, or write a wrapper function with a synonymous name that calls the implemented function.

C++ can simply overload free functions by signature, but not too many people know MATLAB can overload free functions too! MATLAB’s function overloading is polymorphic (the decision on which version to dispatch) ONLY BY THE FIRST ARGUMENT.

MATLAB supports variable arguments, which defeats the concept of signatures, so be grateful that it can at least overload by the first argument. Python doesn’t natively overload even by the first argument.
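To be fair, the Python standard library does offer functools.singledispatch, which dispatches a free function on the type of its first argument — the closest analog to MATLAB’s model, though not true signature-based overloading. A minimal sketch (describe is a made-up function name):

```python
from functools import singledispatch

@singledispatch
def describe(x):
    return "something else"          # fallback for unregistered types

@describe.register
def _(x: int):
    return "an int"                  # chosen by the FIRST argument's type

@describe.register
def _(x: list):
    return "a list"

print(describe(42), describe([1, 2]), describe(3.14))
# an int a list something else
```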

It’s a very advanced technique I came up with that allows the same code to work for many different data types: doing generics without the templates available in C++.

I also understand that commercial development is often rushed, so not everybody can afford the mental energy to do things properly (like considering free functions first). All I’m saying is that there’s a better way than casually relying on data members more than needed, and using data members should carry the same stench as using global variables: it might be the right thing to do in some cases, but most often it’s not.

If you see people sticking to free functions, please consider the merits before jumping to judge them. It’s easy to tell from the application (from how ‘hard’ they try, aka bending things to fit a paradigm) whether people are doing things one way or the other religiously, or adapting to the problem they are solving.


How missing keys are handled in Dictionary (Hashtables) in C++, Python and MATLAB

C++

In C++ (STL), the default behavior when touching (especially reading) a missing key through operator[] (whether it’s map or unordered_map, aka a hashtable) is that the new key is automatically inserted with the value-initialized default for the type (say 0 for integers).

I have an example MapWithDefault template wrapper implementation that allows you to endow the map (during construction) with a default value to be returned when the key you’re trying to read does not exist.

C++ also has the at() method, but it throws an exception when the key is not found, and leaning on exceptions for control flow carries a material performance/overhead burden in C++.

MATLAB

MATLAB’s hashtables are done with containers.Map() and, in more modern MATLAB (R2022b and on), dictionary objects, unless you want to stick to valid MATLAB variable names as keys and exploit dynamic fieldnames.

Neither containers.Map() nor dictionary has a default-value mechanism for when a key is not found; both just throw an error if the key you are trying to read does not exist. Use the isKey() method to check whether the key exists first, then decide to read it or fall back to a default.

Python

Natively, Python dictionaries will throw a KeyError exception if the key requested through the [] operator (aka __getitem__()) does not already exist.

Use the .get(key, default) method if you want to fall back to a default value when the key is not found. The .get() method does not throw an exception; the default is None if not specified.
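A quick sketch of the difference between [] and .get():

```python
d = {"status": 200}

try:
    d["error_code"]              # __getitem__ on a missing key
except KeyError:
    print("KeyError raised")

print(d.get("error_code"))       # None (implicit default)
print(d.get("error_code", -1))   # -1 (explicit default)
print(d)                         # {'status': 200} -- nothing was inserted
```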

If you want C++’s default behavior, where reading a new key inserts the said key with a default, you have to explicitly import defaultdict from the collections package. I wouldn’t recommend this, as the behavior is unintuitive and likely confusing in insidious ways.
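A sketch of defaultdict mirroring C++’s operator[], including the insert-on-read surprise warned about above:

```python
from collections import defaultdict

counts = defaultdict(int)    # int() -> 0 serves as the default factory
counts["a"] += 1             # missing key: created as 0, then incremented

# The insidious part: merely READING a key inserts it
_ = counts["typo"]
print(dict(counts))          # {'a': 1, 'typo': 0}
```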

There’s a similar approach to my MapWithDefault for Python dictionaries: subclass dict and define your own __missing__ dunder/magic method that returns a default when a key is missing, then use the parent’s (aka dict’s) constructor to explicitly convert an existing dict object into your child class that has __missing__ implemented.

Although this approach is a little like my MapWithDefault, the __missing__ approach has more flexibility, like allowing the default value to depend on the queried key, but it comes at the expense of making up a different class, not instance, per default value.
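A minimal sketch of that __missing__ pattern (the class name and default value are made up for illustration):

```python
class DictWithDefault(dict):
    """dict subclass that returns a fixed default for missing keys."""
    def __missing__(self, key):
        # Called by __getitem__ when the key is absent.
        # Unlike defaultdict, the key is NOT inserted.
        return 0

d = DictWithDefault({"a": 1})   # dict's constructor converts an existing dict
print(d["a"])     # 1
print(d["b"])     # 0 (the default)
print("b" in d)   # False -- reading did not insert the key
```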

Monkey-patching instance methods is frowned upon in Python, so if you want the default value tied to instances, the mechanism needs to be redesigned (for example, by having __missing__ read a per-instance default stored in __init__).


We use ContextManager (“with … as” statement) in Python because Python’s fundamental language design (garbage collecting objects) broke RAII

[TLDR] Python doesn’t have RAII. C++ and MATLAB allow RAII. You can have proper RAII only if destructor timing is 100% controllable by the programmer.

Python uses the Context Manager (with ... as idiom) to address the old problem of opening a resource handle (say a file or a network socket) and automatically closing (freeing) it, regardless of whether the program quits abruptly or terminates gracefully after it’s done with the resource.

Unlike destructors in C++ and MATLAB, which register what to do (such as closing the resource) right before the resource (object) is gone or the program quits, Python’s Context Manager is basically rehashing the old try-block idea by creating a rigid framework around it.

It’s not that Python doesn’t know about the RAII mechanism (which is much cleaner), but Python’s fundamental language design choices drove it into a corner, so it’s stuck micro-optimizing the try-except-finally approach to managing opened resources:

  • Everything is seen as an object in Python; even integers have a ton of methods.
    MATLAB and C++ treat POD (Plain Old Data), such as integers, separately from classes.
  • Python’s garbage collector controls the timing of when the destructor of any object is called (del merely decrements the reference count).
  • MATLAB does not garbage-collect objects, so the destructor timing is guaranteed.
  • C++ has no garbage collection so the destructor timing is guaranteed and managed by the programmer.

Python cannot easily exempt classes from garbage collection (which breaks RAII) because fundamentally everything is a class instance (essentially dictionaries, potentially with callables) in Python.

This is one of the reasons why I have a lot of respect for MATLAB for giving corner cases (like what ‘empty’ means) a lot of consideration in their language design decisions. Python has many excellent ideas, but not enough thought was given to how these ideas interact and what side effects they produce.


Python’s documentation says outright what it does: with ... as ... is effectively a rigidly defined try-except-finally block:
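Roughly (simplifying the expansion documented in PEP 343), the two forms below behave the same; Resource here is a made-up stand-in:

```python
class Resource:
    """Stand-in resource; the with-statement drives __enter__/__exit__."""
    def close(self):
        print("closed")
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False          # don't swallow exceptions from the body

# 1) Context-manager form
with Resource() as r:
    print("working")

# 2) Roughly what it expands to
r = Resource()
try:
    print("working")
finally:
    r.close()                 # runs whether or not the body raised
```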

The Context Manager heavily depends on the resource opener function (EXPR) returning a constructed class instance that implements __enter__ and __exit__, so if you have an external C library imported into Python, like python-ft4222, you likely have to write your context manager in full when you write your wrapper.
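A hedged sketch of such a wrapper around a hypothetical C-backed handle (raw_open/raw_close stand in for the extension’s real API; none of these names come from python-ft4222):

```python
class DeviceHandle:
    """Supplies __enter__/__exit__ for a foreign open/close-style handle."""
    def __init__(self, raw_open, raw_close):
        self._raw_open = raw_open      # hypothetical C-extension calls
        self._raw_close = raw_close
        self._handle = None

    def __enter__(self):
        self._handle = self._raw_open()    # acquire on entry
        return self._handle

    def __exit__(self, exc_type, exc, tb):
        if self._handle is not None:
            self._raw_close(self._handle)  # always released on exit
            self._handle = None
        return False                       # propagate body exceptions

# Usage with dummy stand-ins for the C calls:
log = []
with DeviceHandle(lambda: log.append("open") or 42,
                  lambda h: log.append("close")) as h:
    log.append("use %d" % h)
print(log)   # ['open', 'use 42', 'close']
```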


Typically the destructor should first check whether the resource is already closed, then close it if it wasn’t. Take io.IOBase as an example:
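That pattern can be sketched like this (names are illustrative, not io.IOBase’s actual internals): a close() that is safe to call twice, with __del__ as only a last-resort guard:

```python
class SafeResource:
    def __init__(self):
        self.closed = False

    def close(self):
        if not self.closed:    # idempotent: a second call is a no-op
            self.closed = True
            print("released")

    def __del__(self):
        # Last-resort guard: timing is up to the garbage collector,
        # so correct code should already have called close() explicitly.
        self.close()

r = SafeResource()
r.close()   # prints "released"
r.close()   # no-op
```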

However, this is only a convenience for when you are at the interpreter and can live with the destructor being called with a slight delay.

To make sure your code works reliably without timing bugs, you’ll need to explicitly close the resource somewhere, rather than relying on a destructor or object lifecycle timing. The destructor can act as a second guard that closes it if it hasn’t been closed, but it should not be relied on.


The with ... as construct is extremely ugly, but it’s one of the downsides of Python that cannot be worked around easily. It also makes it difficult for users to retry acquiring a resource, because one way or another, retrying involves injecting the retry logic into __enter__. And the typographic savings of with ... as over a try-except-finally block aren’t much if you don’t plan to reuse the context manager and the cleanup code is a one-liner.
