MC Jin kicks the other rapper’s butt at their own game!
Hilarious Disco music from MP4: 你老豆索K
https://www.youtube.com/watch?v=f2Zyzq7o3yM
Mr.ONE – 尹光 (廟街歌王) Rap:
MC Jin kicks the other rapper’s butt at their own game!
Hilarious Disco music from MP4: 你老豆索K
https://www.youtube.com/watch?v=f2Zyzq7o3yM
Mr.ONE – 尹光 (廟街歌王) Rap:
大埔阿婆的台語通俗親切,是學習台語的首選!
xkcd is an obvious CS geek comics. But who’d expect hardcore CS jokes in Sandra and Woo, a raccoon comic? Here you go:
I suspect TMW (The MathWorks, maker of MATLAB) hasn’t really thought about dead levels when a categorical object (I mean nominal() and ordinal() as well since they are wrapper child class of categorical()) has elements removed so that some levels doesn’t map to any elements anymore.
For performance reasons, it makes sense to keep the dead levels in because the user can repetitively add and remove the same last level by deleting and adding the same element, causing unnecessary work each time. Naturally, there’s a getlevels()/getlabels()/categories() method in nominal(),ordinal()/categorical() class so you know what raw levels are available. Turns out it’s a horrible idea to expose the raw levels when dead levels are allowed!
Unless you are dealing with the internals of categorical objects, there’s very little reason why one would care or want to know about the dead levels (it’s just a cache for performance). It’s the active levels that are currently mapped to some elements that matters when user make such queries, which is handled correctly by unique().
If there are no dead levels, getlevels() is equivalent to unique(), while categorical() and getlabels() are equivalent to unique(cellstr()), but I’m very likely to run into dead levels because I delete rows of data when I filter by certain criterion.
My first take on it would be to hide getlevels()/getlabels()/categories() from users. But over the years, I’ve grown from a conservative software point of view to accepting more liberal approach, especially after exposure to functional programming ideas. That means I’d rather have a way to know what’s going on inside (keep those functions there), but I’d like to be warned that it’s an evil feature that shouldn’t be used lightly.
Yes, I’m dissing the use of getlevels()/getlabels()/categories() like the infamous eval(). Once in a long while, it might be a legitimate neat approach. But for 99% of the time, it’s a strictly worse solution that causes a lot of damages. It’s way more unlikely that getlevels()/getlabels()/categories() will yield what you really mean with dead levels than multiple inheritance in C++ being the right approach on the first try.
If I use unique() all the time, why would I even bother to talk about getlevels()/getlabels()/categories() since I never used them? It’s because TMW didn’t warn users about the dangers in their documentation. These methods looks legit and innocent, but it’s a usage trap like returning stack pointers in C/C++ (you can technically do it, but with almost 100% certainty, you are telling the computer to do something you don’t mean to, in short: wrong).
I have two encounters that other people using the raw categorical levels that harmed me:
gidx = double(s); ... gnames = getlabels(s)'; glevels = getlevels(s)';
Apparently the author of the factory-shipped code forgot that there’s a reason why the categorical/unique() has the same function name as double/unique() and cellstr/unique(): the point of overloading is to have the same function name for the same intention! The intention of unique() should be uniformly applied across all the data types applicable. Think twice before relying on language support for type info (like type traits in C++) to switch code when you can use function overloading (MATLAB differentiates by the type of the first argument, C++ looks at the whole signature). A good architecture should lead you to the correct code logic without the need of overriding good practices.
Rants aside, grpstats() will work as intended if those lines in grp2idx() are changed to:
gidx = double(s); ... glevels = unique(s(:)); gnames = cellstr(glevels);
A higher level fix would be applying grp2idx() to the grouping variable before it was fed into grpstats():
grpstats(X, grp2idx(g), ...)
The rationale is that the underlying contents doesn’t matter for grouping variables as long as each of them uniquely stand for the group they represent! In other words, categorical() objects are seen as nothing but a bunch of integers, which can be obtained by casting it to double():
gidx = double(s); grpstats(X, gidx, ...)
This is what grp2idx() calls under the hood anyway when it sees a categorical. The grp2idx() called from grpstats() will see a bunch of integers, which will correctly apply unique() to them, thus removing all dead levels.
Of course, use grp2idx() instead of double() because it works across all data types that applies. Why future-constrain yourself when a more generic implementation is already available?
The sin committed by grpstats() over nominal() is that the variables in glevels and gnames shouldn’t get involved in the first place because they don’t matter and shouldn’t even show up in the outputs. This is what’s fundamentally wrong about it:
[group,...,ngroups] = mgrp2idx(group,rows); ... // This code assumes there are no gaps in group levels (gnum), which is not always true. for gnum = 1:ngroups groups{gnum} = find(group==gnum); end
We can either blame the for-loop for not skipping dead levels, or blame mgrp2idx (a wrapper of grp2idx) for spitting out the dead levels. It doesn’t really matter which way it is. The most important thing is that dead levels were let loose, and nobody in the developer-user chain understand the implications enough to stop the problem from propagating to the final output.
To summarize, the raw levels in categorical objects is a dirty cache including junk you do not want 99.99% of the time. Use unique() to get the meaningful unique levels instead.
In the old days (before R2013a), nominal() and ordinal() were separate parallel classes with astoundingly similar structures. That means there’s a lot of copy-paste-mod going on. TMW improved on it by consolidating the ideas into a new categorical() class, which nominal() and ordinal() derives from it.
The documentation mentioned that nominal() and ordinal() might be deprecated in the future, but I contacted their support urging them not to. It’s not for compatibility reasons: nominal() and ordinal() captures the common use cases that these two ideas do not need to be unified, and the names themselves clearly encodes the intention.
If the user want to exploit the commonalities between the two, either it’s already taken care of by the parent’s public methods, or the object can be sliced to make it happen. I looked into the source code for nominal() and ordinal(): it’s pretty much a wrapper over categorical’s methods yet the interface (input arguments) are much simpler and intuitive because we don’t have to consider all the more general cases.
Back to the titled topic. Because categorical()’s properties (members) are different from pre R2013a’s nominal() and ordinal() objects, the objects created in R2012b or before cannot be loaded correctly in newer versions. That means the backward compatibility is completely broken for nominal()/ordinal() as far as saved objects are concerned.
There’s no good incentive to solve this problem on the TMWs side because the old nominal()/ordinal() is short-lived and they always want everybody to upgrade. Since I use nominal() most of the time and the ones that really need to be saved are all nominal(), I recommend the converting (‘casting’) them to cellstr by
>> A = nominal({'a','a','b','c'}); >> A = cellstr(A) A = 'a' 'a' 'b' 'c'
Remember, nominal() is pretty much compressing a ton of cellstr into a few unique items and mapping the indices. No information is lost going back and forth between cellstr() and nominal(). It’s just a little extra computations for the conversion.
As for ordinal(), I rarely need to save it because order/level assignment is almost the very last thing in the processing chain because it changes so frequently (e.g. how would you draw the lines for six levels of fatness?), I might as well just not save it and reprocess the last step (where the code with ordinal() sits) when I need it.
Nonetheless, if you still want to save ordinals() instead of re-crunching it, this time you’ll want to save it as numerical levels by casting the ordinal() into double():
>> A = ordinal([1 2 3; 3 2 1; 2 1 3],{'low' 'medium' 'high'}, [3 1 2]) A = medium high low low high medium high medium low >> D = double(A) D = 2 3 1 1 3 2 3 2 1 >> U = unique(A) U = low medium high >> L = cellstr(U) L = 'low' 'medium' 'high' >> I = double(U) I = 1 2 3 >> A_reconstructed = ordinal(D, L, I) A_reconstructed = medium high low low high medium high medium low
You’ll save (D, L, I) from old MATLAB and load it and reconstruct it with the triplets from the new MATLAB (I’d suggest using structs to keep track of the triplets). I know it’s a hairy mess!