ChatGPT解决这个技术问题 Extra ChatGPT

Virtual functions and performance - C++

In my class design, I use abstract classes and virtual functions extensively. I had a feeling that virtual functions affects the performance. Is this true? But I think this performance difference is not noticeable and looks like I am doing premature optimization. Right?

As per my answer, I suggest closing this as duplicate of stackoverflow.com/questions/113830
If you are doing high performance computing and number crunching, do not use any virtuality in the core of calculation : it definitely kills all performances and prevent optimizations at compile-time. For initialization or finalization of the program it is not important. When working with interfaces, you can use virtuality as you wish.
quick-bench.com/q/hU7VjdB0IP7rxjYuH46xbocVBxY Try this benchmark. 10% difference in a tight loop. 20% in a single call quick-bench.com/q/Y4FvX3adXOjVp3Bh2SmbG-jVtco

C
Crashworks

Your question made me curious, so I went ahead and ran some timings on the 3GHz in-order PowerPC CPU we work with. The test I ran was to make a simple 4d vector class with get/set functions

class TestVec 
{
    float x,y,z,w; 
public:
    float GetX() { return x; }
    float SetX(float to) { return x=to; }  // and so on for the other three 
}

Then I set up three arrays each containing 1024 of these vectors (small enough to fit in L1) and ran a loop that added them to one another (A.x = B.x + C.x) 1000 times. I ran this with the functions defined as inline, virtual, and regular function calls. Here are the results:

inline: 8ms (0.65ns per call)

direct: 68ms (5.53ns per call)

virtual: 160ms (13ns per call)

So, in this case (where everything fits in cache) the virtual function calls were about 20x slower than the inline calls. But what does this really mean? Each trip through the loop caused exactly 3 * 4 * 1024 = 12,288 function calls (1024 vectors times four components times three calls per add), so these times represent 1000 * 12,288 = 12,288,000 function calls. The virtual loop took 92ms longer than the direct loop, so the additional overhead per call was 7 nanoseconds per function.

From this I conclude: yes, virtual functions are much slower than direct functions, and no, unless you're planning on calling them ten million times per second, it doesn't matter.

See also: comparison of the generated assembly.


But if if they are called multiple times, they can often be cheaper than when only called one time. See my irrelevant blog: phresnel.org/blog , the posts titled "Virtual functions considered not harmful", but of course it depends on the complexity of your codepaths
My test measures a small set of virtual functions called repeatedly. Your blog post assumes that the time-cost of code can be measured by counting operations, but that isn't always true; the major cost of a vfunc on modern processors is the pipeline bubble caused by a branch mispredict.
this would be a great benchmark for gcc LTO (Link Time Optimization); try to compile this again with lto enabled: gcc.gnu.org/wiki/LinkTimeOptimization and see what happens with the 20x factor
If a class has one virtual and one inline function, will the performance of the non-virtual method also be affected? Simply by the nature of the class being virtual?
@thomthom No, virtual/non-virtual is a per-function attribute. A function only needs to be defined via vtable if it's marked as virtual or if it's overriding a base class that has it as virtual. You'll often see classes that have a group of virtual functions for public interface, and then a lot of inline accessors and so on. (Technically, this is implementation-specific and a compiler could use virtual ponters even for functions marked 'inline', but a person who wrote such a compiler would be insane.)
G
Greg Hewgill

A good rule of thumb is:

It's not a performance problem until you can prove it.

The use of virtual functions will have a very slight effect on performance, but it's unlikely to affect the overall performance of your application. Better places to look for performance improvements are in algorithms and I/O.

An excellent article that talks about virtual functions (and more) is Member Function Pointers and the Fastest Possible C++ Delegates.


What about pure virtual functions? Do they affect performance in any way? Just wondering as it seem that they are there simply to enforce implementation.
@thomthom: Correct, there is no performance difference between pure virtual and ordinary virtual functions.
C
Chuck

When Objective-C (where all methods are virtual) is the primary language for the iPhone and freakin' Java is the main language for Android, I think it's pretty safe to use C++ virtual functions on our 3 GHz dual-core towers.


I'm not sure the iPhone is a good example of performant code: youtube.com/watch?v=Pdk2cJpSXLg
@Crashworks: The iPhone is not an example of code at all. It's an example of hardware — specifically slow hardware, which is the point I was making here. If these reputedly "slow" languages are good enough for underpowered hardware, virtual functions are not going to be a huge problem.
The iPhone runs on an ARM processor. The ARM processors used for iOS are designed for low MHz and low power usage. There is no silicon for branch prediction on the CPU and therefore no performance overhead from branch prediction misses from virtual function calls. Also the MHz for iOS hardware is low enough that a cache miss does not stall processor for 300 clock cycles while it retrieves data from RAM. Cache misses are less important at lower MHz. In short, there is no overhead from using virtual functions on iOS devices, but this is a hardware issue and does not apply to desktops CPUs.
As a long time Java programmer newly into C++, I wanna add that Java's JIT compiler and run-time optimizer has the ability to compile, predict and even inline some functions at run-time after a predefined number of loops. However I am not sure if C++ has such feature at compile and link time because it lacks runtime call pattern. Thus in C++ we might need to be slightly more careful.
@AlexSuo I'm not sure of your point? Being compiled, C++ of course cannot optimise based on what might happen at runtime, so prediction etc would have to be done by the CPU itself... but good C++ compilers (if instructed) go to great lengths to optimise functions and loops long before runtime.
M
Mark James

In very performance critical applications (like video games) a virtual function call can be too slow. With modern hardware, the biggest performance concern is the cache miss. If data isn't in the cache, it may be hundreds of cycles before it's available.

A normal function call can generate an instruction cache miss when the CPU fetches the first instruction of the new function and it's not in the cache.

A virtual function call first needs to load the vtable pointer from the object. This can result in a data cache miss. Then it loads the function pointer from the vtable which can result in another data cache miss. Then it calls the function which can result in an instruction cache miss like a non-virtual function.

In many cases, two extra cache misses are not a concern, but in a tight loop on performance critical code it can dramatically reduce performance.


Right, but any code (or vtable) that is called repeatedly from a tight loop will (of course) rarely suffer cache misses. Besides, the vtable pointer is typically in the same cache line as other data in the object that the called method will access, so often we're talking about only one extra cache miss.
@Qwertie I don't think that's necessary true. The body of the loop (if larger than L1 cache) could "retire" vtable pointer, function pointer and subsequent iteration would have to wait for L2 cache (or more) access on every iteration
B
Boojum

From page 44 of Agner Fog's "Optimizing Software in C++" manual:

The time it takes to call a virtual member function is a few clock cycles more than it takes to call a non-virtual member function, provided that the function call statement always calls the same version of the virtual function. If the version changes then you will get a misprediction penalty of 10 - 30 clock cycles. The rules for prediction and misprediction of virtual function calls is the same as for switch statements...


Thanks for this reference. Agner Fog's optimization manuals are the gold standard for optimally utilizing hardware.
Based on my recollection and a quick search - stackoverflow.com/questions/17061967/c-switch-and-jump-tables - I doubt this is always true for switch. With totally arbitrary case values, sure. But if all cases are consecutive, a compiler might be able to optimise this into a jump-table (ah, that reminds me of the good old Z80 days), which should be (for want of a better term) constant-time. Not that I recommend trying to replace vfuncs with switch, which is ludicrous. ;)
@underscore_d I think you are right that the vtable could be optimized to a jump table but what Agner's statement about rules for prediction and misprediction of virtual function calls is the same as for switch statements is also true in a sense that let's say vtable is implemented as a switch-case, then there are two possibilities: 1) it gets optimized to a jump table (as you said) if the cases are consecutive, 2) it can't be optimized to a jump table because the cases aren't consecutive, and so will get a misprediction penalty of 10 - 30 clock cycles as Anger states.
g
gbjbaanb

absolutely. It was a problem way back when computers ran at 100Mhz, as every method call required a lookup on the vtable before it was called. But today.. on a 3Ghz CPU that has 1st level cache with more memory than my first computer had? Not at all. Allocating memory from main RAM will cost you more time than if all your functions were virtual.

Its like the old, old days where people said structured programming was slow because all the code was split into functions, each function required stack allocations and a function call!

The only time I would even think of bothering to consider the performance impact of a virtual function, is if it was very heavily used and instantiated in templated code that ended up throughout everything. Even then, I wouldn't spend too much effort on it!

PS think of other 'easy to use' languages - all their methods are virtual under the covers and they don't crawl nowadays.


Well, even today avoiding function calls is important for high-perf apps. The difference is, today's compilers reliably inline small functions so we don't suffer speed penalties for writing small functions. As for virtual functions, smart CPUs can do smart branch prediction on them. The fact that old computers were slower is, I think, not really the issue--yes, they were much slower, but back then we knew that, so we gave them much smaller workloads. In 1992 if we played an MP3, we knew we might have to dedicate over half of the CPU to that task.
mp3 dates from 1995. in 92 we barely had 386, no way they could play an mp3, and 50% of cpu time assumes a good multi task OS, an idle process, and a preemptive scheduler. None of this existed on consumer market at the time. it was 100% from the moment power was ON, end of story.
C
Community

There's another performance criteria besides execution time. A Vtable takes up memory space as well, and in some cases can be avoided: ATL uses compile-time "simulated dynamic binding" with templates to get the effect of "static polymorphism", which is sort of hard to explain; you basically pass the derived class as a parameter to a base class template, so at compile time the base class "knows" what its derived class is in each instance. Won't let you store multiple different derived classes in a collection of base types (that's run-time polymorphism) but from a static sense, if you want to make a class Y that is the same as a preexisting template class X which has the hooks for this kind of overriding, you just need to override the methods you care about, and then you get the base methods of class X without having to have a vtable.

In classes with large memory footprints, the cost of a single vtable pointer is not much, but some of the ATL classes in COM are very small, and it's worth the vtable savings if the run-time polymorphism case is never going to occur.

See also this other SO question.

By the way here's a posting I found that talks about the CPU-time performance aspects.


S
Serge

Yes, you're right and if you curious about the cost of virtual function call you might find this post interesting.


The article linked does not consider very important part of virtual call, and that is possible branch misprediction.
D
Daemin

The only ever way that I can see that a virtual function will become a performance problem is if many virtual functions are called within a tight loop, and if and only if they cause a page fault or other "heavy" memory operation to occur.

Though like other people have said it's pretty much never going to be a problem for you in real life. And if you think it is, run a profiler, do some tests, and verify if this really is a problem before trying to "undesign" your code for a performance benefit.


calling anything in a tight loop is likely to keep all that code and data hot in cache...
Yes, but if that right loop is iterating through a list of objects then each object could potentially be calling a virtual function at a different address through the same function call.
E
Evgueny Sedov

When class method is not virtual, compiler usually does in-lining. In contrary, when you use pointer to some class with virtual function, the real address will be known only at runtime.

This is well illustrated by test, time difference ~700% (!):

#include <time.h>

class Direct
{
public:
    int Perform(int &ia) { return ++ia; }
};

class AbstrBase
{
public:
    virtual int Perform(int &ia)=0;
};

class Derived: public AbstrBase
{
public:
    virtual int Perform(int &ia) { return ++ia; }
};


int main(int argc, char* argv[])
{
    Direct *pdir, dir;
    pdir = &dir;

    int ia=0;
    double start = clock();
    while( pdir->Perform(ia) );
    double end = clock();
    printf( "Direct %.3f, ia=%d\n", (end-start)/CLOCKS_PER_SEC, ia );

    Derived drv;
    AbstrBase *ab = &drv;

    ia=0;
    start = clock();
    while( ab->Perform(ia) );
    end = clock();
    printf( "Virtual: %.3f, ia=%d\n", (end-start)/CLOCKS_PER_SEC, ia );

    return 0;
}

The impact of virtual function call highly depends on situation. If there are few calls and significant amount of work inside function - it could be negligible.

Or, when it is a virtual call repeatedly used many times, while doing some simple operation - it could be really big.


A virtual function call is expensive compared to ++ia. So what?
quick-bench.com/q/hU7VjdB0IP7rxjYuH46xbocVBxY Here's a benchmark which shows just 10% difference.
I
It'sPete

I've gone back and forth on this at least 20 times on my particular project. Although there can be some great gains in terms of code reuse, clarity, maintainability, and readability, on the other hand, performance hits still do exist with virtual functions.

Is the performance hit going to be noticeable on a modern laptop/desktop/tablet... probably not! However, in certain cases with embedded systems, the performance hit may be the driving factor in your code's inefficiency, especially if the virtual function is called over and over again in a loop.

Here's a some-what dated paper that anaylzes best practices for C/C++ in the embedded systems context: http://www.open-std.org/jtc1/sc22/wg21/docs/ESC_Boston_01_304_paper.pdf

To conclude: it's up to the programmer to understand the pros/cons of using a certain construct over another. Unless you're super performance driven, you probably don't care about the performance hit and should use all the neat OO stuff in C++ to help make your code as usable as possible.


佚名

In my experience, the main relevant thing is the ability to inline a function. If you have performance/optimization needs that dictate a function needs to be inlined, then you can't make the function virtual because it would prevent that. Otherwise, you probably won't notice the difference.


O
OwnageIsMagic

One thing to note is that this:

boolean contains(A element) {
    for (A current : this)
        if (element.equals(current))
            return true;
    return false;
}

may be faster than this:

boolean contains(A element) {
    for (A current : this)
        if (current.equals(element))
            return true;
    return false;
}

This is because the first method is only calling one function while the second may be calling many different functions. This applies to any virtual function in any language.

I say "may" because this depends on the compiler, the cache etc.


佚名

The performance penalty of using virtual functions can never outweight the advantages you get at the design level. Supposedly a call to a virtual function would be 25% less efficient then a direct call to a static function. This is because there is a level of indirection throught the VMT. However the time taken to make the call is normally very small compared to the time taken in the actual execution of your function so the total performance cost will be nigligable, especially with current performance of hardware. Furthermore the compiler can sometimes optimise and see that no virtual call is needed and compile it into a static call. So don't worry use virtual functions and abstract classes as much as you need.


never ever, no matter how small the target computer?
I might have agreed had you phrased that as The performance penalty of using virtual functions can sometimes be so insignificant that it is completely outweighed by the advantages you get at the design level. The key difference is saying sometimes, not never.
Considering the recent push for more cache-friendly programming, as memory speeds cannot keep up with CPU speeds, this answer is bad advice.
c
christianparpart

I always questioned myself this, especially since - quite a few years ago - I also did such a test comparing the timings of a standard member method call with a virtual one and was really angry about the results at that time, having empty virtual calls being 8 times slower than non-virtuals.

Today I had to decide whether or not to use a virtual function for allocating more memory in my buffer class, in a very performance critical app, so I googled (and found you), and in the end, did the test again.

// g++ -std=c++0x -o perf perf.cpp -lrt
#include <typeinfo>    // typeid
#include <cstdio>      // printf
#include <cstdlib>     // atoll
#include <ctime>       // clock_gettime

struct Virtual { virtual int call() { return 42; } }; 
struct Inline { inline int call() { return 42; } }; 
struct Normal { int call(); };
int Normal::call() { return 42; }

template<typename T>
void test(unsigned long long count) {
    std::printf("Timing function calls of '%s' %llu times ...\n", typeid(T).name(), count);

    timespec t0, t1;
    clock_gettime(CLOCK_REALTIME, &t0);

    T test;
    while (count--) test.call();

    clock_gettime(CLOCK_REALTIME, &t1);
    t1.tv_sec -= t0.tv_sec;
    t1.tv_nsec = t1.tv_nsec > t0.tv_nsec
        ? t1.tv_nsec - t0.tv_nsec
        : 1000000000lu - t0.tv_nsec;

    std::printf(" -- result: %d sec %ld nsec\n", t1.tv_sec, t1.tv_nsec);
}

template<typename T, typename Ua, typename... Un>
void test(unsigned long long count) {
    test<T>(count);
    test<Ua, Un...>(count);
}

int main(int argc, const char* argv[]) {
    test<Inline, Normal, Virtual>(argc == 2 ? atoll(argv[1]) : 10000000000llu);
    return 0;
}

And was really surprised that it - in fact - really does not matter at all anymore. While it makes just sense to have inlines faster than non-virtuals, and them being faster then virtuals, it often comes to the load of the computer overall, whether your cache has the necessary data or not, and whilst you might be able to optimize at cache-level, I think, that this should be done by the compiler developers more than by application devs.


I think it is quite likely that your compiler can tell that the virtual function call in your code can only call Virtual::call. In that case it can just inline it. There is also nothing preventing the compiler from inlining Normal::call even though you didn't ask it to. So I think that it is quite possible that you get the same times for the 3 operations because the compiler is generating identical code for them.