1. Does this mean that, if we only use image targets and no extended tracking, in theory we don't need a native provider (ARCore / ARKit) at all, because detection and tracking are handled by Vuforia CV itself?
Correct. The Positional Tracking feature, as provided by Vuforia Fusion (through ARCore, ARKit or Vuforia VISLAM), is only used for Extended Tracking, Ground Plane and Model Targets.
2. Are there any settings to tell Vuforia it should concentrate exclusively on Image Targets and "forget" about "all the rest", e.g. model targets, smart terrain?
You can explicitly set the FusionProviderType to VUFORIA_VISION_ONLY.
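A minimal sketch of how that setting might be applied from a Unity script. It assumes a Vuforia Engine Unity release that exposes VuforiaRuntimeUtilities.SetAllowedFusionProviders (newer releases may not), and that the call has to run before Vuforia Engine initializes. The class and method names (VisionOnlyFusionSetup, RestrictFusionProvider) are hypothetical and only serve the illustration.

```csharp
using UnityEngine;
using Vuforia;

// Hypothetical helper class; only the SetAllowedFusionProviders call
// and the FusionProviderType enum come from the Vuforia Unity API.
public static class VisionOnlyFusionSetup
{
    // Unity invokes this before the first scene loads, so the provider
    // restriction is (by assumption) in place before Vuforia initializes.
    [RuntimeInitializeOnLoadMethod(RuntimeInitializeLoadType.BeforeSceneLoad)]
    private static void RestrictFusionProvider()
    {
        // Allow only Vuforia's own computer vision; no ARCore, ARKit or VISLAM.
        VuforiaRuntimeUtilities.SetAllowedFusionProviders(
            FusionProviderType.VUFORIA_VISION_ONLY);
    }
}
```

Because the class is static and hooked in through the attribute, it does not need to be attached to a GameObject in the scene.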
This will effectively disable Positional Tracking via Fusion. Beyond that, there are no APIs to disable individual features.
E.g. there is a Unity setting which is part of the Vuforia Configuration called "Optimize Quality" vs "Optimize Speed". My naive assumption was that quality means more pixels (which probably means lower FPS) and speed means fewer pixels (which means higher FPS). Does this setting have any relation to the image recognition task, or is it only related to rendering (probably the camera background image)?
The Camera Device Mode is an abstraction of the Vuforia Engine API SelectVideoMode(): https://library.vuforia.com/content/vuforia-library/en/reference/unity/classVuforia_1_1CameraDevice.html#a8cd99944db68cd1b5709326042353c19
Your assumption is generally correct. Each Camera Device Mode has several settings associated with it, including camera capture resolution, camera capture frame rate, and rendering resolution. The modes available, and the device's performance (i.e. FPS) when a given mode is set, depend on the device's capabilities.
3. Is there anything else (in the engine itself) I could take a look at to optimize the system for my use case?
The SDK is already highly optimized for your use case. I cannot think of any further optimizations at this time.
Vuforia Engine Support