Do not help the compiler at the expense of readability, unless you have read the assembly emitted at the bottleneck and run benchmarks.

Compilers have gotten so smart nowadays that they can analyze our code for common patterns (or logically deduce away steps that don't have to be performed at runtime).

Matt Godbolt gave a nice presentation at CppCon 2017 named “What Has My Compiler Done for Me Lately?”. By observing the emitted assembly at different optimization levels, he showed that the compiler no longer needs to be micromanaged (through performance hacks in our code): it will emit the instructions the performance-hacked code intended whenever that is the better choice.

It means the compiler writers already know our bag of performance-hack tricks far better than we do. Their efforts spare us from premature optimization and leave us more time to find a better data structure or algorithm for the problem.

What I got from the lecture is NOT that we are free to write clumsy code and let the compiler sort it out (though it occasionally can, like collapsing a loop that computes a simple arithmetic series into a one-line closed-form solution), but that we no longer have to make difficult coding choices to accommodate performance.

The most striking facts I learned from his lecture are:

  • The compiler can emit a single CPU instruction that has no corresponding native operation in C/C++, if your hardware architecture supports it (e.g. clang can convert a whole loop that counts set bits into just ‘popcnt eax, edi‘)
  • Through Link-Time Optimization (LTO), we don’t have to pay the performance penalty for language features that turn out to be unnecessary in the current compilation (e.g. virtual dispatch is automatically dropped if the linker finds that nothing in the output needs it)

With such LTO, why not do away with the virtual specifier and make everything virtual by default (like Java)? For decades, we’ve been making up stories that some classes are not meant to be derived from (like STL containers), but the underlying motive is that we don’t want to pay for a vtable if we don’t have to.

Instead of confusing new programmers about when they should make a method virtual (plenty of rules of thumb have hardened into dogma), focus on telling them that whenever they upcast a reference/pointer to the parent anywhere in their code and invoke the destructor through that parent reference/pointer, they will pay the ‘hefty’ price of a vtable and vptr.

I don’t think anybody (or any old codebase) would be harmed by turning on virtuals by default and letting the linker decide whether they can be dropped. If it changes anything, it might turn buggy code that called the wrong destructor into correct code that runs slower and takes up more space. In terms of correctness, the change might break low-level hacks that expect objects to be a certain size (e.g. for alignment) without a vptr.

Even better, add a class specifier mandating that no use of the class’s children may invoke the vtable (and have the compiler enforce it) unless explicitly overridden (i.e. the user decides to pay for the vtable). This way the compiler can warn about performance and space issues during the migration.

The old C++ ideal was “you only pay for the language features you use (write)”, but as compilers get better, we might be able to change it to “you pay extra only for the language features that are actually used (in the final generated executable), with your permission”.


I’d also like to add Return Value Optimization (RVO) to my list of compiler advances that change the way we code. C++11 added move semantics, but I think that’s something future compilers could manage themselves. Even with an old C++ compiler like the one shipped with VisualDSP 5.0, the copy constructor was not called (yes, skipping it is legal even if the copy constructor has side effects) when I did this:

Matrix operator+(const Matrix& a, const Matrix& b)
{
  Matrix c(a.dims);
  // ... for all element i, c.raw[i] = a.raw[i]+b.raw[i]
  return c;
}
Matrix c = a + b;

Actually, the compiler at that time was not that smart about RVO. The code I originally wrote had two return branches, which defeats RVO (the spec leaves that elision optional):

Matrix operator+(Matrix a, Matrix b)
{
  Dims m = a.dims;
  if( m == b.dims ) // Both inputs must have same dimensions
  {
    Matrix c(m); // Construct matrix c with same dimension as a
    // ... for all i, c.raw[i] = a.raw[i] + b.raw[i]
    return c;
  } 
  else 
  {
    return Matrix::dummy; // A static member, which is a Matrix object
  }
}

To take advantage of RVO, I had to rework my code:

Matrix operator+(Matrix a, Matrix b)
{
  Dims m = a.dims;
  Matrix c(m); // Single named return object, so the compiler can apply (N)RVO
  if( m == b.dims ) // Both inputs must have same dimensions
  {
    // ... for all i, c.raw[i] = a.raw[i] + b.raw[i]
  } 
  else 
  {
    c = Matrix::dummy; // A static member, which is a Matrix object
  }
  return c;
}

I think it’s only a matter of time before C++ compilers can do “copy-on-write” like MATLAB does, if independent compilation is no longer mandatory!

Given my extensive experience with MATLAB, I’d say it took me a while to get used to designing my code with “copy-on-write” behavior in mind. Always start with expressive, maintainable, readable, and correct code, keeping in mind that the performance concern only arises under certain conditions (i.e. when the passed object gets modified inside the function).

If people start embracing the mentality of letting the compiler do most of the mechanical optimization, we’ll move toward a world where debugging work is gradually displaced by performance-bottleneck hunting. In my view, anything that can be done systematically by programming (like boilerplate code or idioms) can eventually be automated by better compiler/linker/IDE and language design. It’s the high-level business logic that needs a lot of software designers/engineers to translate fuzzy requirements into concrete steps.


Matt also developed a great website (http://godbolt.org/) that compiles your code on the fly and shows you the corresponding assembly. Here’s an example of how I used it to answer my question, “Should I bother with std::div() if I want both the quotient and remainder without running the division twice?”:

The website also includes a feature to share the pasted code via a URL.

As seen from the emitted assembly, the answer is NO. The compiler figures out that I’m repeating the division and performs it only once, using both the quotient (stored in eax) and the remainder (stored in edx). Trying to force a single division through std::div() requires an extra function call, which is strictly worse.

The bottom line: don’t help the compiler! Modern compilers do context-free optimizations better than we do. Use the time and energy to rethink the architecture and data structures instead!


C++ annoyances (and reliefs): operator[] in STL map-based containers

I recently watched Louis Brandy’s CppCon presentation “Curiously Recurring C++ Bugs at Facebook” on YouTube.

For bug #2, a well-known trap with STL map-based containers: operator[] will insert the requested key (associated with a default-constructed value) if it is not found.

He mentioned a few workarounds and their disadvantages:

  • use the at() method: requires exception handling
  • const-protect the container: newcomers try to defeat it, and the constness gets stripped when it’s passed on as non-const
  • ban operator[] calls: makes the code ugly

but he would like to see something neater. In bug #3, he added that a very common usage is to return a default when the key is not found. The normal approach requires returning a copy of the default (expensive if it’s large), which tempts newcomers to return a reference to a local (destroyed on return: a guaranteed bug).


Considering how much productivity a clumsy interface can drain, I think it’s worth spending a few hours of my time tackling it, since I might need STL map-based containers myself someday.

Here’s my thought process for the design choices:

  • Retain the complete STL interface to minimize changes to user code/documentation
  • Endow an STL map-based container with a default_value (a common use case), so that the new operator[] can return a reference without worrying about destroyed temporaries.
  • Give users an easy read-only access interface (make intentions clear with little typing)

The code (with detailed comments about design decisions and test cases) can be downloaded here: MapWithDefault. For the experienced, here’s the meat:

#include <unordered_map>
#include <map>

#include <utility>  // std::forward

// Legend (for extremely simple generic functions)
// ===============================================
// K: key
// V: value
// C: container
// B: base (class)
template <typename K, typename V, template <typename ... Args> class C = std::map, typename B = C<K,V> >
class MapWithDefault : private B 
{
public:
    // Make default_value mandatory. Everything else follows the requested STL container
    template<typename... Args>
    MapWithDefault(V default_value, Args&& ... args) : B(std::forward<Args>(args)...), default_value(default_value) {};

public:
    using B::operator=;
    using B::get_allocator;

    using B::at;

    using B::operator[];

    // Read-only map (const object) uses only read-only operator[]
    const V& operator[](const K& key) const
    {
        auto it = this->find(key);
        return (it==this->end()) ? default_value : it->second;
    }

    using B::begin;
    using B::cbegin;
    using B::end;
    using B::cend;
    using B::rbegin;
    using B::crbegin;
    using B::rend;
    using B::crend;

    using B::empty;
    using B::size;
    using B::max_size;

    using B::clear;
    using B::insert;
    // using B::insert_or_assign;   // C++17
    using B::emplace;
    using B::emplace_hint;
    using B::erase;
    using B::swap;

    using B::count;
    using B::find;
    using B::equal_range;
    using B::lower_bound;
    using B::upper_bound;

public:
    const               V default_value;
    const MapWithDefault& read_only = static_cast<MapWithDefault&>(*this);
};

Note that this is private inheritance (we can go without virtual destructors, since STL containers don’t have them). I have not exposed all the remaining members and methods back to public with the ‘using’ keyword yet, but you get the idea.


This is how I normally want the extended container to be used:

int main()
{
    MapWithDefault<string, int> m(17);  // Endowed with default of 17
    cout << "pull rabbit from m.read_only:  " << m.read_only["rabbit"] << endl;   // Should read 17

    // Demonstrates commonly unwanted behavior of inserting requested key when not found
    cout << "pull rabbit from m:            " << m["rabbit"] << endl; // Should read 0 because the key was inserted (not default anymore)

    // Won't compile: demonstrate that it's read only
    // m.read_only["rabbit"] = 42;

    // Demonstrate writing
    m["rabbit"] = 42;

    // Confirms written value
    cout << "pull rabbit from m.read_only:  " << m.read_only["rabbit"] << endl;   // Should read 42
    cout << "pull rabbit from m:            " << m["rabbit"] << endl;             // Should read 42

    return 0;
}

Basically, for read-only operations, always operate directly on the chained ‘m.read_only‘ reference: it makes sure the const-protected versions of the methods (including the read-only operator[]) are called.


Please let me know if it’s a bad idea or there’s some details I’ve missed!

 


Super-simplified: Programming high performance code by considering cache

  • Code/data locality (compactness, % of each cache line that gets used)

  • Predictable access patterns: pre-fetch (instruction and data) friendly. This explains branching costs, why linear traversal might beat trees at smaller scales (pointer chasing), and why bubble sort can be the fastest when the chunks fit in the cache.

  • Avoid false sharing: unnecessarily sharing a cache line with other threads/cores (due to how the data is packed) can cause the caches to invalidate each other whenever anyone writes.


Super-simplified: What is a topology

‘Super-simplified’ is my series of brief notes summarizing what I have learned so I can pick it up again in no time. That means condensing an hour of lecture into a few takeaway points.

These lectures filled a gap in my understanding of open sets from undergrad real analysis, which I had understood only under the narrow world-view of the real line.


X: Universal set

Topology ≡ open + \left\{\varnothing, X\right\}

Open ≡ preserved under unions and finite intersections.

Why is finiteness required only for intersections? Infinite intersections can squeeze open edge points into limit points, e.g. \bigcap^{\infty}_{n}(-\frac{1}{n},\frac{1}{n}) = \left\{0\right\}.

Never forget that \left\{\varnothing, X\right\} is always there, because it may have properties that the ‘meat’ open sets B don’t. E.g. take B = \mathbb{Q} \cap (0,1) (discrete on the rationals) inside the universal set X=\mathbb{R}: for any irrational point, \mathbb{R} is its only open neighborhood (however oversized that looks), because nothing smaller can be ‘synthesized*’ from B using operations that preserve openness.

* ‘synthesized’ in here means constructed from union and/or finite intersections.


[Bonus] What I learned from real line topology in real analysis 101:

  1. Normal intuitive cases
  2. The null set and the universal set are clopen
  3. Look into rationals (countably infinite) and irrationals (uncountable)
  4. Blame Cantor (sets)!

 

 


Kancho, with Love from Korea

It’s a bit childish, but I can’t let this dirty joke slip by:

Kancho is LOVE.
Kancho from the back.
Contrary to common design, the bag is supposed to be opened from the middle. Exactly what Kancho does!

Since the filling is fresh chocolate (nama-choco), the imagery fits perfectly.

254 total views, no views today

EMC PCB Layout Notes

  • Implicit RLC (potentially filters) and antennas formed by traces
  • Large ground/voltage planes serve as EMI shields and low-impedance current-return paths
  • True differential signals can be generated by current sources
  • Decouple with ferrite beads if radiation is inevitable due to geometry/placement
  • Avoid / minimize large current swings on analog plane (e.g. buffer digital signals)
  • Star ground when splitting sections: don’t let heavy digital current sink through analog ground by cascading the grounds.
  • Planes don’t really need to be split as long as the large digital currents’ preferred return paths are localized and far away from the analog section.
  • AGND/DGND refer to the grounds responsible for different sections of a mixed-signal IC. They have nothing to do with which actual ground plane to tie to (e.g. the DGND pin on an ADC chip still goes to the analog ground plane, since it carries low switching current).


Rick and Morty Quote: What people called love

From Rick and Morty Season 1, Episode 6:

Rick: Listen Morty, I hate to break it to you, but what people call “love” is just a chemical reaction that compels animals to breed. It hits hard, Morty, then it slowly fades, leaving you stranded in a failing marriage. I (Morty’s grandpa) did it. Your parents are gonna do it. Break the cycle, Morty. Rise above. Focus on science.

A follow up from Episode 9:

With the writers of Rick and Morty, you could replace the whole philosophy department in every university and move all the logicians to the math department:

A follow up from Episode 8:

Nobody exists on purpose 

 


Lepy LP-2024A+ Class T Amplifier Mod

My traditional Hi-Fi amplifier drains a lot of power and heats up my room when I’m not using it. The summer heat prompted me to look into Class-D amplifiers as they’re highly energy efficient.

I bought a Lepy LP-2024A+, a Class-T amplifier (Tripath’s improvement over Class-D amplifiers), for $22 shipped. It sounded good over a narrow range of volume: I could hear the background sound details on my ADS 200 speakers.

Unfortunately, strong bass components in certain music got distorted, a sign that the amplifier cannot deliver fast energy impulses. Simply put, I enjoyed the treble but not the bass with this amplifier.

I saw some mod reports on the older TA-2020A+ based units (like the LP-2020A+), but as of now, only one Japanese blog talks about swapping the input-stage op-amps without other changes. So I decided to do my own mod and post the results here.


First, the unit came with a dinky 2200uF capacitor for power smoothing; I upgraded it to 6800uF. I happened to have a 12uH inductor with thick wires, so I replaced the toroid inductor (the L of the LC smoothing pair) with it while I was at it.

Then I replaced all the remaining white-label capacitors with decent brands (Würth, Nichicon, Panasonic, CDE), all rated at 105 degC, sometimes with higher voltage ratings depending on which brand-name parts Newark had on sale when I ordered.

Then I upgraded the 4 output-stage inductors to Würth 7447452100, rated 10uH 4.5A. Tripath’s datasheet says 2A, so I suppose the ones populated on the board were rated less than that.

I also replaced the SMD (1206) ceramic capacitors at the output stage (very close to the speaker wire terminals) with polyester film (for the taller 0.47uF ceramic chips) and NP0/C0G (for the thinner 0.1uF ceramics) to improve linearity. I suspect this change helped reduce the listening fatigue in the treble. Now I don’t have to tone down the treble knob as aggressively.

As prompted by the Japanese blog, I ordered some LT1364s, but the improvement isn’t that big, since the NE5532 wasn’t bad in the first place:

The bass and drums are much more enjoyable after the mod, since the improved power handling reduced the distortion on bass impulses. The amplifier is still 15W (7.5W+7.5W), dictated by the TA-2024A+, but I rarely want to crank the music up louder than that anyway.

As a bonus, I took a thermal image with my Seek Thermal Imager:

The input stage ICs (LT1364) are at 118 deg F:


Update [02/06/2018]: I did some experiments with external capacitors and realized that the real problem is the crappy 13.5V@3A power supply that came with it. Yes, I tested it with a DC load and it can really do 3A, but it has a weird behavior: when a huge power draw (like from a bass drum) drops the voltage below 11.5V, the supply starts oscillating between 10V and 11.5V (never getting above that, even after I stopped the music) with a regular hissing noise of around 1Hz coming physically from the supply itself. I had to turn the unit off for it to ramp back up to 13.5V.

Then I used my HP 6033A system power supply (it can do 30A of relatively clean power with fast transient recovery), observed the rail voltage, and compared it against powering the LP-2024A+ with a big capacitor plus the original supply. It’s clear that no capacitor is big enough to cover for the flaws of the crappy supply that came with the amp.

After the power supply issue was resolved, even loud music sounds smooth, expressing the 3D acoustic image crisply through my ADS 200 speakers: I felt like a person was talking/singing right in front of me, rather than hearing loudspeaker-generated sound. It’s so crisp that I can hear each individual string pluck. Bass is deep too after I added an ADS Sub 6 subwoofer. I’ve been listening to “The Phantom of the Opera” CDs since high school and I’m still rediscovering new musical details with this amplifier + ADS speakers!

Before the mod, I hesitated to make it my main amplifier and would go back to my Denon AVR-988 for serious listening. It took me quite a while to tune the Denon to sound good, but the modified Lepy LP-2024A+ works right out of the box!


Update [08/21/2017]: Based on people’s comments (on AudioKarma) about the TA-2020A+ vs the TPA3116, I decided to give it a shot and ordered Nobsound’s 50W model.

Actually, before this LP-2024A+, I modded a Pyle TA-2020A unit and was disappointed that its clarity was totally lacking compared to the lower-powered LP-2024A+. I thought it could be Pyle’s terrible implementation of the TA-2020A+, but after hearing the TPA3116, I observed the following:

  • The TPA3116 has less distortion than the TA-2020A+ and sounds slightly tighter at all frequencies, especially the low ones
  • Both the TA-2020A+ and TPA3116 lack clarity at vocal frequencies and above. The TA-2024A+ beats them both hands down.
  • The TA-2024A+ draws 0.15A when no music is playing while the TPA3116 draws 0.03A. The readings are from my HP 6032A (60V, 50A) power supply.

The TPA3116 might have tighter bass than the TA-2024A+, but I wouldn’t trade the vocals and up for that. The TA-2024A+ might be marginally OK for N.W.A., but certainly not for parties. It’s an amp for music, not for acting cool.

 
