The proliferation of edge video analytics applications has given rise to a new breed of streaming protocols that stream aggressively compressed videos to remote servers for compute-intensive DNN inference. One popular design paradigm of such protocols is to leverage the server-side DNN to extract useful feedback (e.g., based on a low-quality-encoded stream sent to the server) and use that feedback to inform how the camera should encode and stream the video in the future. In this server-driven approach, an ideal form of feedback should (1) be derived from minimal information from the video sensor, (2) incur minimal bandwidth usage to obtain, and (3) indicate the optimal video streaming/encoding scheme (e.g., the minimal set of frames/regions that require high encoding quality). However, our preliminary study shows that these idealized requirements are far from being met. Using object detection as an example use case, we demonstrate significant yet untapped room for improvement by considering a broader design space: how the feedback should be derived from the DNN, how often it should be extracted, and how to determine the encoding quality of the video from which the feedback is extracted.