Android NDK & ARM NEON instruction set extension support

Introduction:
=============

Android NDK r3 added support for the new 'armeabi-v7a' ARM-based ABI
that allows native code to use two useful instruction set extensions:

- Thumb-2, which provides performance comparable to 32-bit ARM
  instructions with similar compactness to Thumb-1

- VFPv3, which provides hardware FPU registers and computations,
  to boost floating point performance significantly.

  More specifically, by default 'armeabi-v7a' supports only VFPv3-D16,
  which requires only 16 hardware FPU 64-bit registers.

For more information, see docs/CPU-ARCH-ABIS.html.

The ARMv7 Architecture Reference Manual also defines another optional
instruction set extension known as "ARM Advanced SIMD", nicknamed
"NEON". It provides:

- A set of interesting scalar/vector instructions and registers
  (the latter are mapped to the same chip area as the FPU ones),
  comparable to MMX/SSE/3DNow! in the x86 world.

- VFPv3-D32 as a requirement (i.e. 32 hardware FPU 64-bit registers,
  instead of the minimum of 16).

Not all ARMv7-based Android devices will support NEON, but those that
do may benefit in significant ways from the scalar/vector instructions.

The NDK can build entire modules, or even individual source files, with
NEON support. In that case, a specific compiler flag is used to enable
both the GCC ARM NEON intrinsics and VFPv3-D32. The intrinsics are
described here:

> http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html
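
As a rough illustration of what the intrinsics look like, here is a minimal
sketch of an element-wise float addition. The function name add_floats is
hypothetical. The sketch assumes the compiler defines __ARM_NEON__ (or
__ARM_NEON) when a file is built with NEON support, and it falls back to a
plain C loop otherwise so that the same file still compiles without NEON:

```c
#include <stddef.h>

#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n). */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON__) || defined(__ARM_NEON)
    /* NEON path: process four floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);      /* load 4 floats from b */
        vst1q_f32(dst + i, vaddq_f32(va, vb));  /* add and store 4 results */
    }
#endif

    /* Scalar path: handles the remaining elements, or all of them
     * when the file is built without NEON support. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Note that the compile-time check only tells you how the file was built. It
does not tell you whether the device running the code supports NEON.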


Using LOCAL_ARM_NEON:
---------------------

Define LOCAL_ARM_NEON to 'true' in your module definition, and the NDK
will build all its source files with NEON support. This can be useful if
you want to build a static or shared library that specifically contains
NEON code paths.


Using the .neon suffix:
-----------------------

When listing source files in your LOCAL_SRC_FILES variable, you now have
the option of using the .neon suffix to indicate that you want the
corresponding source(s) to be built with NEON support. For example:

        LOCAL_SRC_FILES := foo.c.neon bar.c

This builds only 'foo.c' with NEON support; 'bar.c' is built without it.

Note that the .neon suffix can be used with the .arm suffix too (used to
specify the 32-bit ARM instruction set for non-NEON instructions), but must
appear after it.

In other words, 'foo.c.arm.neon' works, but 'foo.c.neon.arm' does NOT.


Build Requirements:
-------------------

NEON support only works when targeting the 'armeabi-v7a' ABI; for any other
ABI, the NDK build scripts will complain and abort. It is therefore important
to use checks like the following in your Android.mk:

        # define a static library containing our NEON code
        ifeq ($(TARGET_ARCH_ABI),armeabi-v7a)
            include $(CLEAR_VARS)
            LOCAL_MODULE    := mylib-neon
            LOCAL_SRC_FILES := mylib-neon.c
            LOCAL_ARM_NEON  := true
            include $(BUILD_STATIC_LIBRARY)
        endif # TARGET_ARCH_ABI == armeabi-v7a


Runtime Detection:
------------------

As said previously, NOT ALL ARMv7-BASED ANDROID DEVICES WILL SUPPORT NEON!
It is thus crucial to perform runtime detection to determine whether the
NEON-capable machine code can be run on the target device.

To do that, use the 'cpufeatures' library that comes with this NDK. To learn
more about it, see docs/CPU-FEATURES.html.

You should explicitly check that android_getCpuFamily() returns
ANDROID_CPU_FAMILY_ARM, and that android_getCpuFeatures() returns a value
that has the ANDROID_CPU_ARM_FEATURE_NEON flag set, as in:

          #include <cpu-features.h>

          ...
          ...

          if (android_getCpuFamily() == ANDROID_CPU_FAMILY_ARM &&
              (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_NEON) != 0)
          {
              // use NEON-optimized routines
              ...
          }
          else
          {
              // use non-NEON fallback routines instead
              ...
          }

          ...
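
One possible way to organize this check is to perform it once and keep a
function pointer to the selected routine. The following is only a sketch
under stated assumptions: device_has_neon, sum_c, sum_neon, select_sum and
the MYLIB_HAVE_NEON macro are all hypothetical names. MYLIB_HAVE_NEON is
assumed to be defined by your own build only when the NEON code is compiled
in (i.e. for 'armeabi-v7a'). On non-Android hosts the sketch always selects
the fallback:

```c
#include <stddef.h>

#if defined(__ANDROID__)
#include <cpu-features.h>   /* from the NDK 'cpufeatures' library */
#endif

/* Hypothetical helper: non-zero if the running device supports NEON. */
int device_has_neon(void)
{
#if defined(__ANDROID__)
    return android_getCpuFamily() == ANDROID_CPU_FAMILY_ARM &&
           (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_NEON) != 0;
#else
    return 0;   /* host build: always use the fallback */
#endif
}

/* Common signature shared by both implementations. */
typedef float (*sum_fn)(const float *values, size_t count);

/* Portable fallback, safe on every device. */
float sum_c(const float *values, size_t count)
{
    float total = 0.0f;
    size_t i;
    for (i = 0; i < count; i++)
        total += values[i];
    return total;
}

#if defined(MYLIB_HAVE_NEON)
/* Assumed to be defined in a separate source file built with NEON support. */
float sum_neon(const float *values, size_t count);
#endif

/* Returns the NEON routine only if it was compiled in AND the device
 * reports NEON support; otherwise returns the portable fallback. */
sum_fn select_sum(void)
{
    if (device_has_neon()) {
#if defined(MYLIB_HAVE_NEON)
        return sum_neon;
#endif
    }
    return sum_c;
}
```

A caller would typically store the result of select_sum() once at startup and
call through the pointer afterwards. The file containing the detection code
should itself be built without NEON support, so that it can run safely on
devices that lack it.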

Sample code:
------------

Look at the source code for the "hello-neon" sample in this NDK for an example
of how to use the 'cpufeatures' library and NEON intrinsics at the same time.

The sample implements a tiny benchmark for a FIR filter loop using a C version,
and a NEON-optimized one for devices that support it.
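
For readers unfamiliar with FIR filters, the general shape of such a pair of
routines might look like the sketch below. This is NOT the source of the
"hello-neon" sample; the names fir_c and fir_neon, the 16-bit sample format
and the argument layout are assumptions made for illustration only. The NEON
variant is compiled only when the compiler defines __ARM_NEON__ (or
__ARM_NEON):

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Plain C version: out[i] = sum over k of in[i + k] * taps[k].
 * The caller must provide n_out + n_taps - 1 input samples. */
void fir_c(int32_t *out, const int16_t *in, size_t n_out,
           const int16_t *taps, size_t n_taps)
{
    size_t i, k;
    for (i = 0; i < n_out; i++) {
        int32_t acc = 0;
        for (k = 0; k < n_taps; k++)
            acc += (int32_t)in[i + k] * taps[k];
        out[i] = acc;
    }
}

#if defined(__ARM_NEON__) || defined(__ARM_NEON)
/* NEON version: same contract, four multiply-accumulates per step. */
void fir_neon(int32_t *out, const int16_t *in, size_t n_out,
              const int16_t *taps, size_t n_taps)
{
    size_t i, k;
    for (i = 0; i < n_out; i++) {
        int32x4_t vacc = vdupq_n_s32(0);   /* four 32-bit partial sums */
        int32_t acc;

        for (k = 0; k + 4 <= n_taps; k += 4) {
            int16x4_t vin = vld1_s16(in + i + k);  /* 4 input samples */
            int16x4_t vtp = vld1_s16(taps + k);    /* 4 coefficients */
            vacc = vmlal_s16(vacc, vin, vtp);      /* widening multiply-add */
        }

        /* Add the four partial sums together. */
        acc = vgetq_lane_s32(vacc, 0) + vgetq_lane_s32(vacc, 1) +
              vgetq_lane_s32(vacc, 2) + vgetq_lane_s32(vacc, 3);

        /* Handle taps left over when n_taps is not a multiple of 4. */
        for (; k < n_taps; k++)
            acc += (int32_t)in[i + k] * taps[k];

        out[i] = acc;
    }
}
#endif
```

Both routines are meant to produce identical results, which makes it easy to
check the NEON version against the C one before benchmarking them.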