{"id":303,"date":"2016-10-05T23:23:48","date_gmt":"2016-10-06T07:23:48","guid":{"rendered":"http:\/\/wonghoi.humgar.com\/blog\/?p=303"},"modified":"2021-10-30T23:13:59","modified_gmt":"2021-10-31T07:13:59","slug":"matlab-gotchas-do-not-use-getlevels-getlabels-or-categories-for-categoricalnominalordinal-objects","status":"publish","type":"post","link":"https:\/\/wonghoi.humgar.com\/blog\/2016\/10\/05\/matlab-gotchas-do-not-use-getlevels-getlabels-or-categories-for-categoricalnominalordinal-objects\/","title":{"rendered":"MATLAB Gotchas: Do NOT use getlevels(), getlabels() or categories() for categorical\/nominal\/ordinal objects"},"content":{"rendered":"<p>I suspect TMW (The MathWorks, maker of MATLAB) hasn&#8217;t really thought about dead levels: when a categorical object (I mean nominal() and ordinal() as well, since they are wrapper child classes of categorical()) has elements removed, some levels no longer map to any elements.<\/p>\n<p>For performance reasons, it makes sense to keep the dead levels in, because otherwise a user who repeatedly deletes and re-adds the same element would make the object drop and recreate the same last level, causing unnecessary work each time. Naturally, there&#8217;s a getlevels()\/getlabels()\/categories() method in the nominal(),ordinal()\/categorical() classes so you know what <span style=\"text-decoration: underline;\">raw<\/span> levels are available. Turns out it&#8217;s a horrible idea to expose the raw levels when dead levels are allowed!<\/p>\n<p>Unless you are dealing with the internals of categorical objects, there&#8217;s very little reason why one would care or want to know about the dead levels (it&#8217;s just a cache for performance). 
It&#8217;s the active levels, the ones currently mapped to some elements, that matter when users make such queries, and those are handled correctly by unique().<\/p>\n<p>If there are no dead levels, getlevels() is equivalent to unique(), while categories() and getlabels() are equivalent to unique(cellstr()), but I&#8217;m very likely to run into dead levels because I delete rows of data when I filter by certain criteria.<\/p>\n<p>My first take on it would be to hide getlevels()\/getlabels()\/categories() from users. But over the years, I&#8217;ve grown from a conservative view of software to accepting a more liberal approach, especially after exposure to functional programming ideas. That means I&#8217;d rather have a way to know what&#8217;s going on inside (keep those functions there), but I&#8217;d like to be warned that it&#8217;s an evil feature that shouldn&#8217;t be used lightly.<\/p>\n<p>Yes, I&#8217;m dissing the use of getlevels()\/getlabels()\/categories() like the infamous eval(). Once in a long while, it might be a legitimate neat approach. But 99% of the time, it&#8217;s a strictly worse solution that causes a lot of damage. It&#8217;s far less likely that getlevels()\/getlabels()\/categories() will yield what you really mean when there are dead levels than that multiple inheritance in C++ is the right approach on the first try.<\/p>\n<p>If I use unique() all the time, why would I even bother to talk about getlevels()\/getlabels()\/categories() since I never used them? It&#8217;s because TMW didn&#8217;t warn users about the dangers in their documentation. 
These methods look legit and innocent, but they&#8217;re a usage trap like returning pointers to local (stack) variables in C\/C++ (you can technically do it, but with almost 100% certainty, you are telling the computer to do something you don&#8217;t mean to; in short: wrong).<\/p>\n<p>I have had two encounters where other people&#8217;s use of the raw categorical levels harmed me:<\/p>\n<ol>\n<li>One of my coworkers spoke against upgrading our MATLAB licenses (he later withdrew his opposition) because the new versions broke his old code involving nominal()\/ordinal() objects. I was perplexed because it didn&#8217;t break any of my code, even though I used more nominal() and ordinal() objects than anybody in my vicinity. On close inspection, he was using getlevels() and getlabels() all over the place instead of unique(), which works seamlessly in the new MATLAB. Remember I mentioned that the internal design\/implementation details of <a style=\"text-indent: -1.5em; font-size: 1.0625rem;\" href=\"https:\/\/wonghoi.humgar.com\/blog\/2016\/10\/05\/matlab-compatibility-nominal-and-ordinal-objects-since-r2013a-are-not-compatible-with-r2012b-and-before\/\">nominal()\/ordinal() changed in MATLAB R2013a<\/a><span style=\"text-indent: -1.5em; font-size: 1.0625rem;\">? The internal treatment of dead levels has changed. By design, the change would have been irrelevant to end-users had getlevels()\/getlabels() not exposed dead levels to them. Because of the oversight, users have written code that depends on how dead levels are internally handled!<\/span><\/li>\n<li>The default factory-shipped grpstats() is still &#8216;broken&#8217; as of R2015b! 
If you feed grpstats() with a nominal grouping variable, it will give you lines of NaN because it was programmed to spit out one row for each level (dead or alive) in the grouping variable. Since a dead level has no elements for the reduction function (@mean if not specified) to work on, it spits out multiple NaNs, as by definition NaN does not equal anything else, including NaN itself. This is traced to how grp2idx() is used internally: if the grouping variable is a cellstr() or double(), the groups are generated by using unique(), so there are no dead levels whatsoever. But if the grouping variable is a categorical, the developers thought their job was done already and just took the groups directly from the categorical object&#8217;s properties by calling getlabels() and getlevels():\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"matlab\">gidx = double(s);\n...\ngnames = getlabels(s)';\nglevels = getlevels(s)';<\/pre>\n<p>Apparently the author of the factory-shipped code forgot that there&#8217;s a reason why categorical\/unique() has the same function name as double\/unique() and cellstr\/unique(): the point of overloading is to have the same function name for the same <span style=\"text-decoration: underline;\">intention<\/span>! The <span style=\"text-decoration: underline;\">intention<\/span> of unique() should be uniformly applied across all applicable data types. Think twice before relying on language support for type info (like type traits in C++) to switch code when you can use function overloading (MATLAB dispatches on the class of the dominant argument, typically the first one; C++ looks at the whole signature). 
A good architecture should lead you to the correct code logic without the need to override good practices.<\/p>\n<p>Rants aside, grpstats() will work as intended if those lines in grp2idx() are changed to:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"matlab\">gidx = double(s);\n...\nglevels = unique(s(:));\ngnames = cellstr(glevels);\n<\/pre>\n<p>A higher-level fix would be applying grp2idx() to the grouping variable before it is fed into grpstats():<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"matlab\">grpstats(X, grp2idx(g), ...)<\/pre>\n<p>The rationale is that the underlying contents don&#8217;t matter for grouping variables as long as each of them uniquely stands for the group it represents! In other words, categorical() objects are seen as nothing but a bunch of integers, which can be obtained by casting them to double():<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">gidx = double(s);\ngrpstats(X, gidx, ...)<\/pre>\n<p>This is what grp2idx() calls under the hood anyway when it sees a categorical. The grp2idx() called from grpstats() will then see a bunch of integers and correctly apply unique() to them, thus removing all dead levels.<\/p>\n<p>Of course, use grp2idx() instead of double() because it works across all applicable data types. Why future-constrain yourself when a more generic implementation is already available?<\/p>\n<p>The sin committed by grpstats() over nominal() is that the variables glevels and gnames shouldn&#8217;t get involved in the first place because they don&#8217;t matter and shouldn&#8217;t even show up in the outputs. 
This is what&#8217;s fundamentally wrong about it:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">[group,...,ngroups] = mgrp2idx(group,rows);\n...\n% This code assumes there are no gaps in group levels (gnum), which is not always true.\nfor gnum = 1:ngroups\n    groups{gnum} = find(group==gnum);\nend<\/pre>\n<p>We can either blame the for-loop for not skipping dead levels, or blame mgrp2idx (a wrapper of grp2idx) for spitting out the dead levels. It doesn&#8217;t really matter which way it is. The most important thing is that dead levels were let loose, and nobody in the developer-user chain understood the implications well enough to stop the problem from propagating to the final output.<\/p><\/li>\n<\/ol>\n<p>To summarize, the <span style=\"text-decoration: underline;\">raw<\/span> levels in categorical objects are a dirty cache including junk you do not want 99.99% of the time. Use unique() to get the meaningful unique levels instead.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I suspect TMW (The MathWorks, maker of MATLAB) hasn&#8217;t really thought about dead levels when a categorical object (I mean nominal() and ordinal() as well since they are wrapper child classes of categorical()) has elements removed so that some levels don&#8217;t &hellip; <a href=\"https:\/\/wonghoi.humgar.com\/blog\/2016\/10\/05\/matlab-gotchas-do-not-use-getlevels-getlabels-or-categories-for-categoricalnominalordinal-objects\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[10],"tags":[],"class_list":["post-303","post","type-post","status-publish","format-standard","hentry","category-matlab"],"_links":{"self":[{"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/posts\/303","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/comments?post=303"}],"version-history":[{"count":10,"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/posts\/303\/revisions"}],"predecessor-version":[{"id":3089,"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/posts\/303\/revisions\/3089"}],"wp:attachment":[{"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/media?parent=303"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/categories?post=303"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wonghoi.humgar.com\/blog\/wp-json\/wp\/v2\/tags?post=303"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}