Actually, the air mass is part of the effect, but a smaller part than you might think. The spring mechanism of the cone suspension and the mass of the cone play a MUCH larger role. However, the air inside the box, behind the speaker, is important, not for its mass but for its volume. A trapped volume of air is also a spring, because it can be compressed and decompressed.
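To put a rough number on the "air as a spring" idea, here is a minimal sketch using the standard small-signal result for a sealed box, k = rho * c^2 * S^2 / V. The air constants, the function name, and the cone area and box volume below are illustrative assumptions, not values from this post.

```python
# Assumed room-temperature air properties (not from the post).
AIR_DENSITY = 1.18       # kg/m^3
SPEED_OF_SOUND = 343.0   # m/s


def box_air_stiffness(cone_area_m2, box_volume_m3):
    """Spring stiffness (N/m) the trapped air presents to the cone.

    Uses k = rho * c^2 * S^2 / V, the textbook approximation for small
    cone movements in a sealed box. Hypothetical helper for illustration.
    """
    return AIR_DENSITY * SPEED_OF_SOUND ** 2 * cone_area_m2 ** 2 / box_volume_m3


if __name__ == "__main__":
    cone_area = 0.05  # m^2, assumed, roughly a 12-inch woofer
    for litres in (100, 50, 25):
        k = box_air_stiffness(cone_area, litres / 1000.0)
        print(f"{litres:>4} L box: air spring is about {k:,.0f} N/m")
```

Halving the box volume doubles the stiffness, which is why a smaller box makes the air spring matter more relative to the cone's own suspension.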
The interface between a woofer cone and the air into which it is supposed to launch an acoustic wave is very mismatched.
That is to say, the motor of the woofer (the magnet and coil) is very strong compared to the task of moving the tiny amount of air in front of the cone. It would be much better if a speaker had a huge moving surface that was driven evenly, all over, by much less force. This is why some people like electrostatic or planar dynamic speakers. But these designs are more expensive to manufacture and WAY less efficient than cone speakers.
This is also why I like LOTS of little speakers instead of just a few bigger ones.
The best solution for any loudspeaker system is a cone with a big horn on it!
The horn is an acoustic transformer. Just as an electric transformer can convert high voltage at low current into low voltage at high current, a horn can transform high pressure over a small surface area into low pressure over a large surface area. The transformer ratio comes from the area of the speaker opening compared to the area of the mouth of the horn. Unfortunately, one horn is only effective for about 2.5 to 3 octaves, so multiple horns are needed to cover the whole 10 octaves. Horn dimensions scale with acoustic wavelength as well, so true first-octave horns are bigger than some houses!
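The numbers behind that paragraph can be sketched in a few lines. This is only an illustration: the speed of sound, the 20 Hz starting point for the first octave, the function names, and the example throat and mouth diameters are all assumptions, not figures from this post.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature air


def wavelength_m(freq_hz):
    """Acoustic wavelength in metres at the given frequency."""
    return SPEED_OF_SOUND / freq_hz


def octave_bands(f_low=20.0, count=10):
    """(low, high) frequency pairs for `count` octaves starting at f_low."""
    return [(f_low * 2 ** i, f_low * 2 ** (i + 1)) for i in range(count)]


def horns_needed(total_octaves, octaves_per_horn):
    """Minimum number of horns if each covers `octaves_per_horn` octaves."""
    return math.ceil(total_octaves / octaves_per_horn)


def area_ratio(throat_diameter_m, mouth_diameter_m):
    """Mouth area divided by throat area for circular openings."""
    return (mouth_diameter_m / throat_diameter_m) ** 2


if __name__ == "__main__":
    for low, high in octave_bands():
        print(f"{low:8.0f}-{high:<8.0f} Hz  wavelength at low end: "
              f"{wavelength_m(low):7.3f} m")
    print("horns at 3 octaves each:", horns_needed(10, 3))
    print("horns at 2.5 octaves each:", horns_needed(10, 2.5))
    # Hypothetical horn: 5 cm throat opening into a 50 cm mouth.
    print("area ratio:", round(area_ratio(0.05, 0.5)))
```

At 20 Hz the wavelength comes out at about 17 m, which gives a feel for why a true first-octave horn gets so large, and either 2.5 or 3 octaves per horn works out to four horns for the full 10 octaves.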
There are some really great sounding horn speakers out there that approach 20% efficiency!
The horns I used in one of my designs put out 107 dB @ 1 W @ 1 m. Taking 120 dB as the 100% efficiency point, 107 - 120 = -13 dB. That's about 1/20, or 5%. Not too bad!
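The same arithmetic as a small sketch, for anyone who wants to try other sensitivity figures. The 120 dB reference is the one used in the calculation above and is kept as a parameter here, since it is an assumption of that calculation; the function names are made up for illustration.

```python
import math


def efficiency_from_sensitivity(spl_db, reference_db=120.0):
    """Fraction of electrical power turned into sound.

    spl_db is the sensitivity in dB @ 1 W @ 1 m. reference_db is the level
    treated as 100% efficiency (120 dB in the post; an assumption here).
    """
    return 10 ** ((spl_db - reference_db) / 10.0)


def sensitivity_from_efficiency(efficiency, reference_db=120.0):
    """Inverse of the above: sensitivity in dB for a given efficiency."""
    return reference_db + 10.0 * math.log10(efficiency)


if __name__ == "__main__":
    eff = efficiency_from_sensitivity(107.0)
    print(f"107 dB -> {eff * 100:.1f}% efficient")
    print(f"20% efficient -> {sensitivity_from_efficiency(0.20):.1f} dB")
```

With the same reference, a 20% efficient horn would need to measure about 113 dB @ 1 W @ 1 m.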
James.