Well, that depends on what you are looking for (bad pun). Caveat: humans perceive brightness on a roughly logarithmic scale (think dB), not a linear one, so doubling the lumens does not look twice as bright.
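To put a rough number on that caveat, here is a minimal Python sketch that expresses the ratio of two outputs in dB. The function name and the lumen figures are made up for illustration, and treating perception as purely logarithmic is only an approximation.

```python
import math

def output_ratio_db(lumens_a: float, lumens_b: float) -> float:
    """Ratio of two light outputs in decibels: 10 * log10(a / b)."""
    if lumens_a <= 0 or lumens_b <= 0:
        raise ValueError("outputs must be positive")
    return 10.0 * math.log10(lumens_a / lumens_b)

# Hypothetical figures, not measurements of any real light.
print(round(output_ratio_db(200, 100), 1))   # doubling the output: 3.0
print(round(output_ratio_db(1000, 100), 1))  # ten times the output: 10.0
```

On this scale a doubling is only about 3 dB, which is one way to see why a 200-lumen light does not look twice as bright as a 100-lumen one.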
CPF members who have the necessary instrumentation to test lumen output, and who also take some beamshots outside, certainly have the basis to provide quantitative as well as comparative statements. Look up some of the Selfbuilt tests - very informative, especially the comparisons with other similar lights in the list.
However, what kind of beam does the light have? Large spot? Small spot? Large spill? Small spill? What is the transition between spot and spill? And then there is the human perception of the color of the beam. Are you interested in a flooder or a thrower? Are you interested in the illumination in the center of the spot? Would you include the spill or not?
Examples: An XM-L LED in the usual relatively small light typically has a large spot with large spill - a wall of light at short ranges. The XP-G in a similarly powered light has a relatively small spot and "throws" farther. This means the illumination in the center of the spot may well be "brighter" with the XP-G than with the XM-L, even though the total output of the XM-L is clearly greater. For instance, compare the Quark series, one with XM-L and one with XP-G.
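The flood-versus-throw point can be sketched with a very crude model: assume some fraction of the total lumens lands uniformly inside a cone-shaped spot, and divide by the cone's solid angle to get the center intensity in candela. Everything below (function names, lumen figures, spot fractions, beam angles) is hypothetical and chosen only to illustrate the effect; none of it is measured data for the XM-L, the XP-G, or any Quark.

```python
import math

def spot_candela(total_lumens: float, spot_fraction: float, half_angle_deg: float) -> float:
    """Center intensity (cd), assuming the spot's lumens fill a cone uniformly."""
    if not 0 < half_angle_deg <= 90:
        raise ValueError("half angle must be in (0, 90] degrees")
    # Solid angle of a cone with the given half angle, in steradians.
    solid_angle = 2.0 * math.pi * (1.0 - math.cos(math.radians(half_angle_deg)))
    return total_lumens * spot_fraction / solid_angle

def lux_at(candela: float, distance_m: float) -> float:
    """Illuminance (lux) at a distance, by the inverse-square law."""
    return candela / distance_m ** 2

# Hypothetical "flooder": more total lumens, wide spot.
flooder_cd = spot_candela(total_lumens=500, spot_fraction=0.3, half_angle_deg=10)
# Hypothetical "thrower": fewer total lumens, tight spot.
thrower_cd = spot_candela(total_lumens=300, spot_fraction=0.4, half_angle_deg=5)

print(thrower_cd > flooder_cd)                          # True
print(lux_at(thrower_cd, 50) > lux_at(flooder_cd, 50))  # True
```

Under these assumed numbers, the light with less total output still puts more light on the center of the spot, which is the sense in which it looks "brighter" at a distance. Real beams are not uniform cones, so treat this as a back-of-the-envelope picture only.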
Since I do not have an integrating sphere, I use the ceiling bounce test to compare output levels, with no numbers. I point the lights at the same spot on a standard house ceiling, away from walls, and look at the floor immediately under the point of aim on the ceiling. I then cover both lenses with my thumb or something relatively opaque and wait a bit for my eyes to adjust. I then begin alternating between lights by removing the blockage from one light or the other. That is a poor flashaholic's "integrating room". It makes no difference what the spot or spill size is; the comparison is reasonable, albeit not perfect. But no lumen numbers. For a correlation to lumen numbers, I could use Selfbuilt's lumen test numbers as a basis, or even accept the statements from the manufacturer. However, individual samples will vary.
Short answer - it depends.