Hoi Wong
  • NOTE: I'll gradually move the contents of this page to my blog. Stay tuned!

  • MATLAB as a rapid-prototyping language

    MATLAB is often misunderstood as merely a number-crunching language. Yes, it was originally designed for that purpose, but over the years, especially since 2010, the language has evolved tremendously to cover most of the important aspects of modern (rapid-prototyping) languages.

    If your problem is data-centric (or time-series driven, a.k.a. signals), you don't need all the eye candy of a GUI, and you don't need to micromanage your code to squeeze the last drop of performance out of your hardware, then MATLAB will get you there in the shortest amount of time for a reasonably sized project, provided you really know what MATLAB has to offer.

    Right now I'm open to taking on MATLAB consulting projects for scientific computation, data management and instrumentation (signal acquisition and processing). Contact me with a sketch of what you're trying to accomplish and I'll guide you through it. I read and write in math, code and schematics. :)

    I'd take on C/C++ projects to sharpen my skills too, but MATLAB is where I claim my expertise.


  • MATLAB Mentality

    One word: "vectorization!" Although the JIT compiler has improved the performance of for-loops, thinking in terms of matrices is the de facto conceptual tool for trading memory footprint for runtime performance.

    Using cellfun(), arrayfun(), bsxfun(), structfun(), varfun(), rowfun(), etc. like a pro will also help you think in terms of parallel and concurrent programming, since it naturally forces you to structure your program to be data parallel. Why? These are functional programming constructs! It's hard to keep the data flow straight in your head if you think in states, even when the data elements are independent of each other!

    You can ditch the for-loop in most cases, except:
    1) Left-hand (lvalue) assignments: a loop is faster if you just want to modify a few values in a big matrix.
    2) Plotting: you are more likely to run into timing problems.
    3) File/network access: same problem as copying 100 files simultaneously. You don't want to do that.
    4) Your data is so big that you want to limit the memory footprint (process a manageable chunk at a time).
    5) When you specifically want to prepare your code to use parfor() later.
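    As a minimal sketch of the trade-off (the computation here is made up), compare a loop against its vectorized equivalent:

    ```matlab
    % Sketch: the same element-wise computation, looped vs. vectorized
    x = randn(1e6, 1);

    % Loop version: small working set, but slower even with the JIT
    y1 = zeros(size(x));
    for k = 1:numel(x)
        y1(k) = x(k)^2 + 1;
    end

    % Vectorized version: one shot, trades memory for speed
    y2 = x.^2 + 1;

    isequal(y1, y2)   % both give the same answer
    ```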


  • Modern Data Structures in MATLAB

    Hands down, the heterogeneous data containers (dataset/table) and categorical objects! It's an array, it's a struct, it's a cell as well, oh my luck!

    With tables you'll never need to (and should not) keep track of column indices, because columns go by variable names like struct fields. You can reference a table like a cell with T{'rowName_a', 'colName_b'} as well. You can delete a variable like a struct (e.g. T.colName_b = [];) or like a cell (e.g. T(:, 'colName_b') = [];).
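    A minimal sketch of those access styles (the table contents here are made up):

    ```matlab
    % A small table with named rows and variables
    T = table([1.2; 3.4], {'on'; 'off'}, ...
              'VariableNames', {'gain', 'state'}, ...
              'RowNames', {'chan1', 'chan2'});

    T{'chan1', 'gain'}     % cell-style: index by row/variable name
    T.gain                 % struct-style: grab the whole column
    T.state = [];          % delete a variable, struct-style
    ```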

    Even better, it comes with an array of tools (all puns intended) to convert between table(), matrices, struct() and cell(). You can still use it with old-fashioned code based on rudimentary data structures (cells/matrices/structs). In fact, look at TMW's fugly array-of-structs (AoS) output:

    >> S = ver() 
    1x2 struct array with fields: 
    Name 
    Version 
    Release 
    Date


    If you try to get all the toolbox names directly, you get this:
    >> [S.Name] 
    ans = 
    MATLABSimulink


    What a frustrating mess! I don't even get to see the nice table that ver() prints with zero output arguments (nargout == 0). But see what you can do with tables:

    >> T1 = struct2table(ver())
    T1 =  
    Name                        Version Release    Date  
    ___________________________ _______ __________ _____________ 
    'MATLAB'                    '8.6'   '(R2015b)' '13-Aug-2015' 
    'Simulink'                  '8.6'   '(R2015b)' '13-Aug-2015'

    >> T1.Name 
    ans =  
    'MATLAB' 
    'Simulink'


    If a beginner complains that table() is slow, it's almost always because they are processing repeated strings each time they occur. Use the categorical/nominal/ordinal 'types' to keep only the unique names (labels); under the hood they map to the corresponding integers (levels), and the object behaves like a cellstr() on demand. You can think of it as a turbocharged enum.
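    A minimal sketch of the idea (the labels are made up): each repeated string is stored once, yet the array still reads back like a cellstr:

    ```matlab
    labels = repmat({'alpha'; 'beta'; 'alpha'}, 1000, 1);  % repeated strings
    c = categorical(labels);   % unique labels stored once + integer codes
    unique(c)                  % just the labels actually present
    cellstr(c(1:3))            % behaves like a cellstr() on demand
    ```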

    One tip for using categorical objects: keep uses of getlevels() or getlabels() to a minimum, since they are considered 'low-level': if you start removing elements, these functions will report the unused (dead) levels kept inside the object for performance reasons! The dead levels are typically NOT what you want or meant to use, so this behavior is almost a trap. Use a combination of unique() and cellstr() casts instead: they ignore dead levels. I know somebody who used getlevels() and getlabels() heavily and got burned when TMW changed the behavior in R2015b (it shouldn't happen, but it did).

    Once you've internalized the above, the juiciest part of the table() + categorical() paradigm is join() and stack()/unstack(). These are basically relational database operations. Remember all the ismember() data matching you used to do? That's the low-level implementation of join(). Upsampling rows and filling selected columns with repeated values? That's stack(). You no longer have to do painful data consistency checks to make sure you pooled the data correctly. What you have is a nifty database at your fingertips.
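    A minimal sketch of what join() buys you over manual ismember() bookkeeping (the table contents are made up):

    ```matlab
    subjects = table({'s1'; 's2'}, [25; 31], ...
                     'VariableNames', {'id', 'age'});
    trials   = table({'s1'; 's1'; 's2'}, [0.9; 0.7; 0.8], ...
                     'VariableNames', {'id', 'score'});
    % Each trial row picks up its subject's age; 'id' is the shared key
    T = join(trials, subjects)
    ```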

    You can still query a real database for the big data, but you no longer have to rely on it for the relational operations once you have the minimal set of required ingredients (subtables) on your plate. This is freedom.

    Based on my interactions with TMW, I believe even they underestimated how powerful the concept of a heterogeneous array is. Here are some off-label uses that they didn't consider back then:

    1) Make a signal table that carries the time axis 't' as a variable: we all know how annoying it is to bookkeep and regenerate the time axis every time your filter introduces a delay or you subsample. Now you can just have your functions modify the time axis (like adding an offset) instead of outputting the parameters needed to reconstruct the new time axis. You don't even have to worry about the time axis when you subsample: it's done consistently across all variables!

    I know it used to be a crime to use up precious RAM to store the time vector when you can generate it on the fly. But if you think about it, most of the time you'll need to generate a temporary time vector in memory anyway to do anything useful with it, so your worst-case memory usage is the same unless you craft your algorithm to see only a small chunk at a time. With today's memory prices, it's not worth the code maintenance hassle to keep track of and regenerate the time vector on demand.

    Update: they finally provided a timetable() class in MATLAB R2016b. Kudos to the folks at TMW!
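    A minimal sketch of this off-label use (sampling rate, delay, and variable names are made up):

    ```matlab
    fs = 1000;                                  % sampling rate in Hz
    S = table((0:999)' / fs, randn(1000, 1), ...
              'VariableNames', {'t', 'x'});
    S = S(1:4:end, :);    % subsample by 4: t and x stay in lockstep
    S.t = S.t + 0.025;    % a filter's delay becomes a simple offset on t
    ```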

    2) Keep track of your experiments (different combinations): very often we have a codebase with a few dozen parameters that we want to try out to see how it goes. In the old days, the best you could do was a config file (a big nested struct) for each experiment. But what about storing and keeping track of the results? Can they all be in one place? Now they can!

    You can first build multiple tables that uniquely define the experimental characteristics (like on/off, column combinations, test/training/validation, different datasets), join() them together into one big table of experiments, feed each row as a config object (like you used to do with a nested struct) to your experiment engine, and store the result (say, as a cell array) as a variable/column of the experiment table.
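    A minimal sketch of the workflow (the parameter names and the experiment engine runExperiment() are made up):

    ```matlab
    cutoffs = [10; 20];                 % Hz
    models  = {'svm'; 'tree'};
    [ci, mi] = ndgrid(1:numel(cutoffs), 1:numel(models));
    E = table(cutoffs(ci(:)), models(mi(:)), ...
              'VariableNames', {'cutoffHz', 'model'});

    E.result = cell(height(E), 1);      % results live next to their configs
    for k = 1:height(E)
        cfg = table2struct(E(k, :));    % one row = one config struct
        E.result{k} = runExperiment(cfg);
    end
    ```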

    3) Manage (filter out) your job list ahead of time instead of having the processing function check each job and bail out: in the old days, it was more intuitive to walk through your job list and skip a job if you found its output was already cached.

    The reason for not instinctively checking the cache for every job before processing the list is that you would have to manually shorten each input (sourceFiles, targetFiles) individually with an indicator vector like isCached, which is error prone because the sizes might not match. In fact, before table()/dataset(), you probably made up the targetFiles inside the processing function so you had one fewer variable to keep track of.

    With table(), since you can index into multiple variables simultaneously in one shot by row indices, you might as well pre-determine the targetFiles names, check whether they're already cached, then shortlist the rows (jobs) that haven't been cached yet. An example:

    T = table();
    T.sourceFiles = setdiff(cellstr(ls(pwd)), {'.', '..'});
    T.targetFiles = fullfile('C:\data', T.sourceFiles);   % pre-determined cache paths
    T.isCached = cellfun(@(x) logical(exist(x, 'file')), T.targetFiles);
    % Shorten the job list: drop rows that are already cached
    T(T.isCached, :) = [];
    % Process only what's needed (process() is the user's worker function)
    T.outputInfo = cellfun(@process, T.sourceFiles, T.targetFiles, 'UniformOutput', false);

    Nonetheless, while the new table() object solved a few very difficult challenges of the old dataset(), such as multivariate datasetfun (now doable with rowfun()), it added a lot of annoying behavioral changes (for example, you cannot horzcat() tables with overlapping columns even when they contain IDENTICAL data, and varfun() prepends the function name to its output names) and removed some useful features, such as casting with double().

    There is plenty of room for improvement, but it can all be fixed by writing a wrapper, or by copy-editing the source code of table() into something like datatable(). Don't try the latter on your own if you are not comfortable with MATLAB's OOP. Given my experience, I find it easier to track down and fix the missed corner cases than to do a non-trivial wrapper maneuver around them, but that also makes me responsible for maintaining the forked version of table(). That's a price I'm willing to pay, since this paradigm has already saved me 70% of the grunt work.

    With the new programming methodology, my top-level code reads almost like business logic in English. People can inspect my work like reading an Excel spreadsheet. I don't even need to document it meticulously, provided I pick meaningful function/variable names. Many of the concepts I showed on this page require a few conceptual leaps (like vectorization, *fun(), tables()), but they are well worth it.

    Chances are that if there's something you are unhappy about with dataset()/table(), it's just a feature not yet implemented well. If you ditch it and roll your own, you're very likely repeating the same operations these objects do under the hood, less competently. In my case, I chose to fix them myself while informing TMW so they'd correct it in future versions.


  • Object oriented programming features in MATLAB

    [PODs] They are treated like objects, and you can overload their operators as well. Unlike real classes like table(), you can just make a folder like /@cellstr and add the functions you want to overload. For classdef classes this is not possible (and should not be), because you are supposed to maintain everything in one place: the class definition.

  • Important concepts that determines MATLAB behavior

    [Functions] They are loaded by file name; whatever name you put in the function signature does not matter. This is also why you cannot mix functions and scripts in one file the way Python does.

    [Plain Old Datatypes (PODs)] Operations involving empty, NaN and Inf are very well thought out by TMW. This lets you write [varargout{1:nargout}] = ... correctly, because empty is not a corner case in TMW's eyes (well, once in a while they fail to handle it, mostly in the dataset/table/categorical source code; I reported every instance I found).


  • Tips for structuring your codebase 

    [Path management] Organize your libraries in trees using consistent naming schemes and enable them with addpath(genpath(...)). This way, when you try to whip up your own implementation of some generic problem, you will see a function name that looks suspiciously like what you're trying to do. Reuse!

    [Name conflict resolution] Use addpath(..., '-end') if you want to yield to the incumbent names.

    [Private folders] I usually don't recommend this practice unless the content inside is really specific to one scenario and unlikely to be reusable.


  • Tips to make your code work everywhere

    [32-bit vs 64-bit] strcmpi(mexext(), 'mexw64')
    [Handling different versions] verLessThan('matlab', ...)
    [Parallel workers (headless)] usejava('desktop')
    [DCOM / ActiveX / Automation Server] enableservice('AutomationServer')
    [Compiled MATLAB] isdeployed()
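    A minimal sketch of how these checks might guard platform-specific branches (the version threshold and the plotting call are just placeholders):

    ```matlab
    if verLessThan('matlab', '8.6')
        warning('Developed on R2015b; older releases are untested.');
    end
    if usejava('desktop')
        plot(1:10);      % only draw when a display is available
    end
    if isdeployed()
        % compiled code: e.g. avoid relying on the MATLAB path
    end
    ```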


  • Really really cool MATLAB uses

    [DCOM / ActiveX] Automate other applications that interact with MATLAB (most commonly Excel, Word, and other MATLAB instances). I puppet-controlled VisualDSP++ (the development environment for Analog Devices' SHARC DSP chips) through MATLAB to test the entire runtime library with a single keystroke, saving me weeks of mindless grunt work.

    [Using DCOM with compiled MATLAB] This one even the people at TMW couldn't figure out back then. I had to do a registry diff to realize the MCR didn't register the application name properly, so it couldn't be translated to a correct CLSID.

    [Serialization] Is it a pain in the butt to check whether a file has fully transferred for the purposes of an RPC (remote procedure call)? If your data chunk isn't so big that it risks a disconnection, just pack it up as raw binary integers.
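    A minimal sketch of the idea, assuming a numeric payload (the data here is made up): prepend a byte-count header so the receiver can tell a complete transfer from a truncated one:

    ```matlab
    data = [3.14, 2.72];                                  % made-up payload
    payload = typecast(data(:)', 'uint8');                % doubles -> raw bytes
    msg = [typecast(uint32(numel(payload)), 'uint8'), payload];

    % Receiver side: verify the length header before deserializing
    n = double(typecast(msg(1:4), 'uint32'));
    assert(numel(msg) == 4 + n, 'incomplete transfer');
    recovered = typecast(msg(5:end), 'double');           % back to doubles
    ```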

  • Miscellaneous MATLAB tips

    feature('GetPid'), feature('NumThreads')   useful for parallel process management
    legend('-DynamicLegend')   removes the legend entry when your plot's zoomed view doesn't contain certain lines
    graphics.cursorbar   shortcut for making a dynamic cursor