What is the best hardware concurrency for running inference on CPU?
In the Firefox AI Runtime, we can use multiple threads in the dedicated inference process to speed up execution times on CPU. The WASM/JS environment can create a SharedArrayBuffer, run multiple threads against its contents, and distribute the load across several CPU cores concurrently.
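As a rough illustration, here is a minimal sketch of that pattern in plain JS, assuming a cross-origin-isolated page (a requirement for SharedArrayBuffer); the worker script name, buffer size, and worker count are hypothetical:

const shared = new SharedArrayBuffer(1024 * 1024); // hypothetical 1 MiB buffer
const numWorkers = 4; // hypothetical concurrency level
for (let i = 0; i < numWorkers; i++) {
  // "inference-worker.js" is a hypothetical script that reads and writes
  // the shared buffer.
  const worker = new Worker("inference-worker.js");
  worker.postMessage({ buffer: shared, workerId: i });
}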
Below is the time taken, in seconds, on a MacBook M1 Pro, which has 10 physical cores, using our PDF.js image-to-text model to generate alt text, at different levels of concurrency:

[Chart: inference time in seconds at different thread counts]

So running several threads is a game-changer! But adding more and more threads will eventually slow down execution, to the point where it becomes slower than not using threads at all.
So one question we asked ourselves was: how can we determine the best number of threads?
Physical vs. logical cores

According to our most recent public data report, on desktop, 81% of our users are equipped with an Intel CPU, 14% with an AMD CPU, and the rest mostly with Apple devices.
Most modern CPUs provide more logical cores (also called "threads") than physical cores, thanks to technologies like Intel's Hyper-Threading or AMD's Simultaneous Multithreading (SMT).
For example, the Intel Core i9-10900K chip has 10 physical cores and 20 logical cores.
When you spin up threads equal to the number of logical cores, you might see performance gains, especially when tasks are I/O bound or if the CPU can effectively interleave instructions.
However, for compute-bound tasks (like heavy ML inference), having more threads than physical cores can lead to diminishing returns, or even performance drops, due to factors like thread scheduling overhead and cache contention.
Not all cores are created equal

On Apple Silicon, you don't just have a quantity of cores; you have different kinds of cores. Some are high-performance cores designed for heavy lifting, while others are efficiency cores optimized for lower power consumption.
For instance, the Apple M1 Pro has a combination of high-performance cores (8) and efficiency cores (2). The physical cores total 10, but each performance core is designed for heavy-duty tasks, while the efficiency cores typically handle less demanding background tasks.
When your machine is under load with ML tasks, it's often better to fully utilize the high-performance cores and leave some breathing room for the efficiency cores to handle background or system processes.
Similarly, Intel's processors have different kinds of cores, most notably starting with their 12th-generation "Alder Lake" architecture.
These chips feature Performance-cores (P-cores) designed for demanding, single-threaded tasks, and Efficient-cores (E-cores) aimed at background or less intensive workloads. The P-cores can leverage Intel's Hyper-Threading technology (meaning each P-core can run two logical threads), while E-cores typically run one thread each. This hybrid approach enables the CPU to optimize power usage and performance by scheduling tasks on the cores best suited for them. As with Apple Silicon, you'd typically want to maximize utilization of the higher-performance P-cores, while leaving some headroom on the E-cores for system processes.
Android is close to Apple Silicon's architecture: most devices use ARM's big.LITTLE (or DynamIQ) architecture, with two types of cores: "big" and "LITTLE".
On Qualcomm's mobile CPUs, there can be three types: "Prime", "Performance" and "Efficiency". Most recently, some phones like the Samsung Galaxy S24 (with its Exynos 2400 chip) have gained a fourth kind of core, allowing even more combinations.
To summarize, all CPU makers have cores dedicated to performance, and cores for efficiency:
- Performance: "P-core", "big", "Prime", "Performance"
- Efficiency: "E-core", "LITTLE", "Efficiency"
By combining high-efficiency and high-performance cores, Apple Silicon, Android, and Intel-based devices can strike a better balance between power consumption and raw compute power, depending on the demands of the workload.
But if you try to run all cores (performance + efficiency) at maximum capacity, you may see:
- Less optimal thread scheduling, because tasks will bounce between slower efficiency cores and faster performance cores.
- Contention for shared resources like the memory bus and cache.
- And in extreme cases, thermal throttling: if the system overheats and reaches its thermal design point, the clock speed is reduced to cool the system down.
This is why simply setting the thread count to "all cores, all the time" can be suboptimal for performance.
AMD, on the other hand, does not have efficiency cores. Some CPUs like the Ryzen 5 8000 series combine two sizes of cores (Zen 4 and Zen 4c), but the latter is not an efficiency core and can also be used to run heavy-duty tasks.
navigator.hardwareConcurrency

In a browser, there is a single, simple API you can call: navigator.hardwareConcurrency
This returns the number of logical cores available. Since it's the only API available on the web, many libraries (including the one we vendor: onnxruntime) default to using navigator.hardwareConcurrency as a baseline for concurrency.
Using that value directly is bad practice: as explained in the previous sections, it can overcommit threads, and it's not aware of the current system activity.
For that reason, ONNX Runtime's formula takes the number of logical cores divided by two, and never sets it higher than 4:
Math.min(4, Math.ceil((navigator.hardwareConcurrency || 1) / 2));
That formula works out OK in general, but it won't take advantage of all the cores on some devices. For instance, on an Apple M1 Pro, ML tasks could use a concurrency level of up to 8 instead of 4.
On the other end of the spectrum, consider a chip like Intel's i3-1220P, which we use in our CI to run tests on Windows 11 and which better reflects what our users have (see the hardware section of our Firefox Public Data Report).
It has 12 logical cores and 10 physical cores, composed of 8 efficiency cores and 2 performance cores. The ONNX formula for that chip means we would run with 4 threads, where 2 would be a better value.
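To make the mismatch concrete, here is the formula from above evaluated for both chips (the core counts come from the examples in this post):

// Apple M1 Pro: 10 logical cores.
Math.min(4, Math.ceil(10 / 2)); // 4 threads, despite 8 performance cores

// Intel i3-1220P: 12 logical cores.
Math.min(4, Math.ceil(12 / 2)); // 4 threads, despite only 2 performance cores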
navigator.hardwareConcurrency is a good starting point, but it's just a blunt instrument. It won't always yield the true "best" concurrency for a given device and a given workload.
MLUtils.getOptimalCPUConcurrency

While it's impossible to pick the best value at any given time without considering the system's activity as a whole, looking at the number of physical cores and avoiding "efficiency" cores can get us closer to a good value.
llama.cpp, for instance, looks at the number of physical cores to decide on concurrency, with a few twists:
- On any x86_64, it will return the number of performance cores
- On Android, and on other aarch64-based devices like Apple Silicon, it will return the number of performance cores for tri-layered chips.
We've implemented something very similar in a C++ API that can be used via XPIDL in our inference engine:
NS_IMETHODIMP MLUtils::GetOptimalCPUConcurrency(uint8_t* _retval) {
  ProcessInfo processInfo = {};
  if (!NS_SUCCEEDED(CollectProcessInfo(processInfo))) {
    return NS_ERROR_FAILURE;
  }

#if defined(ANDROID)
  // On Android, "big" and "medium" CPUs can be used.
  uint8_t cpuCount = processInfo.cpuPCount + processInfo.cpuMCount;
#else
#  ifdef __aarch64__
  // On aarch64 (like MacBooks) we want to avoid efficiency cores and stick
  // with "big" CPUs.
  uint8_t cpuCount = processInfo.cpuPCount;
#  else
  // On x86_64 we're always using the number of physical cores.
  uint8_t cpuCount = processInfo.cpuCores;
#  endif
#endif

  *_retval = cpuCount;
  return NS_OK;
}
This function is then straightforward to use from JS shipped within Firefox to configure concurrency when we run inference:
let mlUtils = Cc["@mozilla.org/ml-utils;1"].createInstance(Ci.nsIMLUtils);
const numThreads = mlUtils.getOptimalCPUConcurrency();
We've moved away from using navigator.hardwareConcurrency, and we're now using this new API.
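As a sketch of how such a value is typically consumed, here is how a thread count can be handed to onnxruntime-web; the model path is hypothetical, and the exact wiring in our vendored runtime differs:

import * as ort from "onnxruntime-web";

// Size the WASM thread pool before creating a session.
ort.env.wasm.numThreads = numThreads;
const session = await ort.InferenceSession.create("model.onnx"); // hypothetical path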
Conclusion

In our quest to find the optimal number of threads, we're closer to reality now, but there are other factors to consider. The system will use the CPU for other applications, so it's still possible to overload it.
Using more threads is also going to use more memory in our WASM environment, which can become a real issue. Depending on the workload, each additional thread can add up to 100MiB of physical memory usage in our runtime. We're working on reducing this overhead but on devices that don't have a lot of memory, limiting concurrency is still our best option.
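As an illustration only (these names and numbers are not our actual implementation), a guard along these lines could cap concurrency against a memory budget, assuming the ~100MiB-per-thread overhead mentioned above:

// Illustrative sketch: keep the estimated per-thread overhead within a budget.
const MIB_PER_THREAD = 100;
function capThreadsByMemory(numThreads, availableMiB) {
  const affordable = Math.floor(availableMiB / MIB_PER_THREAD);
  return Math.max(1, Math.min(numThreads, affordable));
}

capThreadsByMemory(8, 512); // 5 threads fit in a 512 MiB budget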
For our Firefox ML features, we are using a variety of hardware profiles in our performance CI to make sure that we try them on devices that are close to what our users have. The list of devices we have is going to grow in the next few months to make sure we cover the whole spectrum of CPUs. We've started collecting and aggregating metrics on a dashboard that helps us understand what can be expected when our users run our inference engine.
The hardware landscape is also evolving a lot. For example, recent Apple devices introduced AMX, a matrix-math instruction set that used to be proprietary and gave a significant boost compared to Neon; it has since been superseded by an official ARM extension called SME. Similarly, some phones are getting more core types, which could impact how we calculate the number of cores to use. Our current algorithm could change the day we leverage these new APIs and hardware in our backend.
Another aspect we have not discussed in this post is offloading our ML tasks to the GPU, or to even more specialized units like NPUs, which will be a post on its own.