We're talking about radar and ultrasonic sensors here, not accelerometers, and about feeding their output into a deep neural network. Not the same thing. In this case the sensor fusion isn't being done with a Kalman filter.
Radar and ultrasound both produce drastically less data than a simple 720p webcam. After postprocessing, their output bandwidth is closer to a 9-axis IMU's than a camera's.
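A back-of-the-envelope comparison makes the gap concrete. All the rates below are illustrative assumptions (raw 30 fps 720p RGB, a 64-target postprocessed radar track list at 20 Hz, a 100 Hz float32 IMU), not figures from any specific datasheet:

```python
# Back-of-the-envelope bandwidth comparison (illustrative rates, not from
# any specific sensor datasheet).

def bytes_per_sec(samples_per_frame, bytes_per_sample, frames_per_sec):
    return samples_per_frame * bytes_per_sample * frames_per_sec

camera = bytes_per_sec(1280 * 720 * 3, 1, 30)  # raw 720p RGB at 30 fps
radar  = bytes_per_sec(64 * 4, 4, 20)          # 64 targets, 4 float32 fields, 20 Hz
imu    = bytes_per_sec(9, 4, 100)              # 9-axis IMU, float32, 100 Hz

print(f"camera: {camera / 1e6:.1f} MB/s")  # 82.9 MB/s
print(f"radar:  {radar / 1e3:.1f} kB/s")   # 20.5 kB/s
print(f"imu:    {imu / 1e3:.1f} kB/s")     # 3.6 kB/s
```

Even with generous assumptions for the radar, the camera stream is three to four orders of magnitude larger.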
Yes, and every time you run the incoming camera data through the neural network, fusing in another data source adds extra computation, regardless of that source's bitrate. Have you worked with deep learning much?
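To see where that extra computation comes from, here's a minimal sketch of a late-fusion head: concatenating even a tiny radar feature vector onto the camera features widens every downstream dense layer, so the added cost is paid on every forward pass. All the layer sizes are made up for illustration:

```python
# Rough multiply-accumulate (MAC) count for a late-fusion dense head.
# Layer sizes are assumptions chosen for illustration only.

def dense_macs(in_dim, out_dim):
    """Multiply-accumulates for one fully connected layer."""
    return in_dim * out_dim

cam_features = 512    # flattened camera CNN output (assumed)
radar_features = 16   # postprocessed radar embedding (assumed)
head_out = 256        # fusion head width (assumed)

camera_only = dense_macs(cam_features, head_out)
fused = dense_macs(cam_features + radar_features, head_out)

print(f"camera-only head: {camera_only} MACs")
print(f"fused head:       {fused} MACs (+{fused - camera_only} per forward pass)")
```

The extra MACs scale with the fused feature width, not the raw sensor bitrate, which is consistent with both points: the cost is nonzero on every inference, but for a low-bandwidth sensor it can be a small fraction of the camera-only compute.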