Python 3 Scientific Installation To-do List

Although I am a big fan of MATLAB, it’s time for me to really try out Python so I can fairly compare the pros and cons of both languages.

The first tiny hurdle for Python is its scattered installation process for Windows. I thought Python(x,y) will give me everything in one place, but turns out the Spyder is stuck in Python 2.7. To install Python 3.7, I’ll need to do it from scratch. Here are the steps:

  1. Download official Python 3. You will need that for the “pip” package manager located in {python37}/scripts
  2. Update PIP first to avoid complaints. You can run it anywhere in command prompt
    python -m pip install --upgrade pip
    You don’t call PIP to update PIP because you an executable cannot write itself in WindowsNote that for 32-bit Python, you might run into Python37\python.exe: No module named pip, so you might want to use ensurepip to bootstrap: python -m ensurepip
  3. Now I’ll need Spyder3, a MATLAB-like IDE. Qt5 is one of the pre-req:
    pip install PyQt5
  4. And finally Spyder3
    pip install Spyder
    pip does not install icons in your start menu. So I’ll need to manually create a shortcut
    {Python37}/Scripts/spyder3.exe
    .py files are not associated with Spyder3 (normally it’ll just directly run the python script with python3). I usually manually change the association in Windows to Sypder3.
  5. PyVISA is the analog of “Instrument Control Toolbox” in MATLAB.
    pip install pyvisa
    MATLAB’s Instrument Control Toobox also cover serial ports, which is done in Python by PySerial
    pip install PySerial
  6. Numpy is included with scipy:
    pip install scipy
  7. Turns out that only NumPy and IPython is installed with SciPy, not the entire ecosystem.
    pip install pandas
    pip install matplotlib
    If you know the power of dataset/table objects in MATLAB like I do, you’ll jump for dataframes in panadas.
  8. SymPy, the analog of MATLAB’s symoblic math toolbox, needs to be installed separately
    pip install sympy
  9. IPython gives the ‘notebook’ feel in Mathematica, MathCAD and Maple, where the returned results are directly pasted in the same area where your command/syntax is. I rarely cared for it because I usually want the max visual real estate for my plots.

Update: I tried Anaconda (2019.03, Python 3.7.3 x64) which supposedly have everything in one place, but the Spyder it included crashes right out of the box. Jyupter is confusing as it relies on the web-browser to render the results. Feels patchy and doesn’t look like it adds more than the steps above. Uninstalled it without hesitation.

Update: To update the packages, tack -U switch at the end of each of the above pip install commands. Remember to follow the order of dependencies (e.g. update PyQt5 before Spyder)

Loading

MATLAB Techniques: variadic (variable input and output) arguments

I’ve seen a lot of ugly implementations from people trying to deal with variable number of input and output arguments. The most horrendous one I’ve seen so far came from MIT’s Physionet’s WFDB MATLAB Toolbox. Here’s a snippet showing how wfdbdesc.m handles variable input and output arguments:

function varargout=wfdbdesc(varargin)
% [siginfo,Fs,sigClass]=wfdbdesc(recordName)
...
%Set default pararamter values
inputs={'recordName'};
outputs={'siginfo','Fs','sigClass'};

for n=1:nargin
    if(~isempty(varargin{n}))
        eval([inputs{n} '=varargin{n};'])
    end
end
...
if(nargout>2)
    %Get signal class
    sigClass=getSignalClass(siginfo,config);
end
for n=1:nargout
    eval(['varargout{n}=' outputs{n} ';'])
end

 

The code itself reeks a very ‘smart’ beginner who didn’t RTFM. The code is so smart (shows some serious thoughts):

  • Knows to use nargout to control varargout to avoid the side effects when no output is requested
  • [Cargo cult practice]: (unnecessarily) track your variable names
  • [Cargo cult practice]: using varargin so it can be symmetric to varargout (also handled similarly). varargout might have a benefit mentioned above, but there is absolutely no benefit to use varargin over direct variable names when you are not forwarding or use inputParser().
  • [Cargo cult practice]: tries to be efficient to skip processing empty inputs. Judicially non-symmetric this time (not done to output variables)!

but yet so dumb (hell of unwise, absolutely no good reason for almost every ‘thoughtful’ act put in) at the same time. Definitely MIT: Make it Tough!

This code pattern is so wrong in many levels:

  • Unnecessarily obscuring the names by using varargin/varargout
  • Managing a list of variable names manually.
  • Loop through each item of varargin and varargout cells unnecessarily
  • Use eval() just to do simple cell assignments! Makes me cringe!

Actually, eval() is not even needed to achieve all the remaining evils above. Could have used S_in = cell2struct(varargin) and varargout=struct2cell(S_out) instead if one really wants to control the list of variable names manually!


The hurtful sins above came from not knowing a few common cell packing/unpacking idioms when dealing with varargin and varargout, which are cells by definition. Here are the few common use cases:

  1. Passing variable arguments to another function (called perfect forwarding in C++): remember C{:} unpacks to comma separated lists!
    function caller(varargin)
       callee(varargin{:});
  2. Limiting the number of outputs to what is actually requested: remember [C{:}] on left hand side (assignment) means the outputs are distributed as components of C that would have been unpacked as comma separated lists, i.e. [C{:}] = f(); means [C{1}, C{2}, C{3}, ...] = f();
    function varargout = f()
    // This will output no arguments when not requested,
    // avoiding echoing in command prompt when the call is not terminated by a semicolon
        [varargout{1:nargout}] = eig(rand(3));
  3. You can directly modify varargin and varargout by cells without de-referencing them with braces!
    function varargout = f(varargin)
    // This one is effectively deal()
        varargout = varargin(1:nargout);
    end
    
    function varargout = f(C)
    // This one unpacks each cell's content to each output arguments
        varargout = C(1:nargout);
    end

One good example combining all of the above is to achieve the no-output argument example in #2 yet neatly return the variables in the workspace directly by name.

function [a, b] = f()
// Original way to code: will return a = 4 when "f()" is called without a semicolon
    a = 4;
    b = 2;
end

function varargout = f()
// New way: will not return anything even when "f()" is called without a semicolon
    a = 4;
    b = 2;
    varargout = manage_return_arguments(nargout, a, b);
end

function C = manage_return_arguments(nargs, varargin)
    C = varargin(1:nargs);
end

I could have skipped nargs in manage_return_arguments() and use evalin(), but this will make the code nastily non-transparent. As a bonus, nargs can be fed with min(nargout, 3) instead of nargout for extra flexibility.


With the technique above, wfdbdesc.m can be simply rewritten as:

function varargout = wfdbdesc(recordName)
% varargout: siginfo, Fs, sigClass
...
varargout = manage_return_arguments(nargout, siginfo, Fs, sigClass);

Unless you are forwarding variable arguments (with technique#1 mentioned above), input arguments can be (and should be) named explicitly. Using varargin would not help you avoid padding the unused input arguments anyway, so there is absolutely no good reason to manage input variables with a flexible list. MATLAB already knows to skip unused arguments at the end as long as the code doesn’t need it. Use exist('someVariable', 'var') instead.

 

 

Loading

MATLAB Practices: Code and variable transparency. eval() is one letter away from evil() Transparency means that all references to variables must be visible in the text of the code

The general wisdom about eval() is: do not use it! At least not until you are really out of reasonable options after consulting more than 3 experts on the newsgroups, forums and support@mathworks.com (if your SMS is current)!

Abusing eval() turns it into evil()!

The elves running inside MATLAB needs to be able to track your variables to reason through your code because:

  • it helps your code run much faster (eval() cannot be pre-compiled)
  • able to use parallel computing toolbox (it has to know absolutely for sure about any shared writes)
  • mlint can warn you about potentially pitfalls through code smell.
  • it keeps you sane while debugging!

This is called ‘transparency’: MATLAB has to see what you are doing every step of the way. According to MATLAB’s parallel computing toolbox’s documentation,

Transparency means that all reference to variables must be visible in the text of the code

which I used as a subtitle of this post.


The 3 major built-in functions that breaks transparency are:

  1. eval(), evalc(): arbitrary code execution resulting in read and write access to its caller workspace.
  2. assignin(): it poofs variables in its caller’s workspace!
  3. evalin(): it breaks open the stack and read the variables in its caller’s workspace!

They should have been replaced by skillful use of dynamic field names, advanced uses of left assignment techniques, and freely passing variables as input arguments (remember MATLAB uses copy-on-write: nothing is copied if you just read it).


 

There are other frequently used native MATLAB functions (under certain usages) that breaks transparency:

  1. load(): poof variables from the file to the workspace. The best practice is to load the file as a struct, like S=load('file.mat'); , which is fully transparent. Organizing variables into structs actually reduces mental clutter (namespace pollution)!
  2. save(), who(), whos(): basically anything that takes variable names as input and act on the requested variable violates transparency because it’s has the effect of evalin(). I guess the save() function chose to use variable names instead of the actual variables as input because copy-on-write wasn’t available in early days of MATLAB. A example workaround would be:
    function save_transparent(filename, varargin)
        VN = arrayfun(@inputname, (2:nargin)', 'UniformOutput', false);     
        if( any( cellfun(@isempty, VN) ) )
            error('All variables to save must be named. Cannot be temporaries');
        end
        
        S = cell2struct(varargin(:), VN, 1);
        save(filename, '-struct', 'S');
    end
    
    function save_struct_transparent(filename, S)
        save(filename, '-struct', 'S');
    end

The good practices to avoid non-transparent load/save also reduces namespace pollution. For example, inputname() peeks into the caller to see what the variable names are, which should not be used lightly. The example above is one of the few uses that I consider justified. I’ve seen novice abusing inputname() because they were not comfortable with cells and structs yet, making it a total mindfuck to reason through.

Loading

Oversimplified: Getting rid of data in STL containers Summary of Item 9 in "Effective STL"

Unless deleting a known range of elements directly through iterators (no conditions to match), which rangeerase() method can be used directly, targeting specific key/value/predicate requires understanding of the container’s underlying data structure.

I’d like to give a summary of Item#9 in “Effective STL” by defining the names and concepts so the complicated rules can be sensibly deduced by a few basic facts.


The words ‘remove‘ and ‘erase‘ has very specific meaning for STL that are not immediately intuitive.

 Lives inTarget to matchPurpose
remove_?()<algorithm>requiredRearrange wanted elements to front
erase()containernot acceptedBlindly deleting range/position given

There is a remove() method for lists, which is an old STL naming inconsistency (they should have called it erase() like for associative containers). Treat it as a historical mistake.

The usage is easy to remember once you understand it with the right wording above:

algorithm + containercontiguouslistsassociative
remove_?(): move frontStep 1Step 1
(Use remove_?() method)
unordered*: cannot rearrange
(Use erase(key) directly)
erase(): trim tailStep 2Step 2
(Use remove_?() method)
Use after find_?()
(Use erase(key) directly)

Note that there are two steps for sequential (contiguous+lists) containers , hence the erase-remove idiom. It’s really two steps:

auto tail = remove(c.begin(), c.end(), T); 
c.erase(tail, c.end());

but they can be combined in one line since the tail is only useful at one place. Hence

c.erase( remove(c.begin(), c.end(), T), c.end() ); 

Lists provides a efficient shortcut method (see table below) since linked-lists does not need to be rearranged (just short the pointers).

one-shot methodscontiguouslistsassociative
by contentN/A: use erase-remove idiomremove(T)erase(key)
by predicateN/A: use erase-remove_if idiomremove_if(pred)N/A: Use for-loops for now
No erase_if() yet as of C++17.

Never try range-based remove_?() for associative containers. It is a data corruption trap if any attempt is made to use anything named remove on associative containers.

The trap used to be possible since <algorithms> and containers were separate, but newer C++ protects you from the trap by checking if the element you are moving is of a MoveAssignable type. Since associative containers’ keys cannot be modified (without a rescan), the elements are not move-assignable.


As for erasing through for-loops (necessary if you want to sneak in an extra step while iterating), C++11 now returns an iterator following the last erased element uniformly across all containers. This helps to preserve the running iterator that gets invalidated immediately after the erase through i=c.erase(i);


* For brevity, I twisted the term unordered here to mean that the native (implementation) data order is dependent on the data involved.

When I said ‘cannot rearrange’, I meant ‘cannot efficiently rearrange’, since there are no cheap O(1) next() or prev() traversal.

It’s a mess to simply copy one element over another (during rearrangement), leaving orphans there, and re-balance a BST or re-hash a hash-map. Nobody wants to go through such pains to remove element when there are tons of more direct ways out there.

Loading

Understanding the difference between recognized arrays and pointers 'Recognized' means sizeof(array_name) gives the underlying allocated size

array≠ pointer:

pointer only contains a memory location,
while an array already has memory allocated to hold the data.


The confusion comes from the fact that array names are always seen as pointers anywhere in C, but when an array name is referred in places that the scope happens know the allocated size, namely

  • Global arrays: everybody knows the size
  • Local arrays: only the instantiating function knows its size.

, the array name itself has a superpower that pointers lack: report the underlying allocated data size (NOT pointer size) using sizeof(array).


Definition: An array is ‘recognized‘ if the array name is used in the scope that knows the underlying data size.

Corollary: Calling the array name with sizeof() gives the underlying allocated data size.

Examples of consequences that can be derived from the definition above:

  • Heap allocations always return a pointer type, NOT an array name!
    So heap arrays are never recognized.
  • VLA in C99 are considered local stack arrays, so it’s recognized
  • x[] is just a cosmetic shorthand for *x: it doesn’t prevent any recognized array from decaying into a pointer across boundary.
  • The storage duration (static or not) does not matter. e.g.
    • Heap pointers at global level are not recognized arrays
    • Static local array still loses the recognition across function boundaries
      (unless passed carefully by data type T (&array)[N]).

Most often recognized arrays cannot be aliased without decaying into a pointer. However, we can bind a recognized array to a reference to an array, which is a completely different type. Example:

int v[]{1,2,3,4};
int (&w)[4]=v;  // w is a reference to an array of size 4

int* p = v;     // Decays v to a pointer. Size information lost.
// int &w[4]=v; // Does not compile: this means an array of 4 references.

Note that the syntax requires a bracket for reference name. Omitting it will lead the compiler to misinterpret it as an array of references, which cannot* be compiled.

This means contrary to common beliefs, you can pass a recognized array across functions through reference, but this is rarely done because of the hassle of explicitly entering the number of elements (4 for the example above) as part of the data type. This can still be done through templates/constexpr, but for such inconvenience, we’re better off using std::vector (or std::array if you want near zero overhead).

However, so far I haven’t found a way to re-recognize an array from a pointer. That means there is no way to keep a local array’s recognition across function boundaries in C since it does not have references like C++.


To summarize with a usage example: this post has described the entire logic needed to decide whether sizeof(x)/sizeof(x[0]) gives you the number of array elements, or how many times your machine pointer type is bigger than the element storage.


* references must be bound on creation. Declaring an array of references means you want to bound references in batches. There are no mechanisms to do so as of C++14.

 

Loading