Model benchmarking is important for evaluating land surface model performance and guiding model improvements. Emerging benchmarking approaches (e.g., functional relationship benchmarks) show promise in providing insights into both model responses to environmental forcings and simulated ecosystem processes. However, subjective choices made in the selection and application of observational benchmarks can have a large influence on inferred model skill. A systematic assessment of how functional benchmarking choices affect inferences of model skill is therefore a necessary component of robust model evaluation. Here, we use the International Land Model Benchmarking (ILAMB) tool to test how the choice of observational benchmarks influences inferred model skill across the Arctic-Boreal region, which represents a potential key tipping point in Earth’s climate system. We evaluate how the inferred skill of TRENDY v9 models varies with the choice of observation-based benchmark and with how benchmarks are applied in model evaluation. The analysis uses global datasets integrated into ILAMB as well as new regionally specific observational products from the Arctic-Boreal Vulnerability Experiment (ABoVE). We applied seven Gross Primary Production (GPP) and Ecosystem Respiration (ER) observational datasets to infer model skill and found differences of around 40% in inferred skill, with skill degrading as more regionally specific observational benchmarks were applied. These results suggest that relying on a single data product can give a false sense of model skill. We also evaluate modeled relationships between ER and air temperature, GPP, and precipitation. Results indicate that the magnitude and shape of response curves, as well as inferred model skill, are strongly affected by the choice of observational dataset and the approach used to construct the functional response benchmark.
Collectively, these results highlight the influence of benchmarking choices on model evaluation and point to the need for benchmarking guidelines when assessing inferred model skill.
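
To make the idea of a functional relationship benchmark concrete, the following is a minimal sketch, not the ILAMB implementation and not the procedure used in this study. It builds a binned response curve of ER against air temperature and scores a model curve against two different benchmark curves. All data are synthetic, and the Q10-style response, the bin edges, the function names (`response_curve`, `curve_score`, `q10_er`), and the exponential scoring form are illustrative assumptions.

```python
import numpy as np


def response_curve(x, y, edges):
    """Mean of y within each bin of x; NaN where a bin has no samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    idx = np.digitize(x, edges) - 1  # bin index of each sample
    n_bins = len(edges) - 1
    curve = np.full(n_bins, np.nan)
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            curve[b] = y[in_bin].mean()
    return curve


def curve_score(model_curve, obs_curve):
    """Hypothetical skill score in (0, 1]: exp(-relative RMSE) over shared bins."""
    ok = ~np.isnan(model_curve) & ~np.isnan(obs_curve)
    rmse = np.sqrt(np.mean((model_curve[ok] - obs_curve[ok]) ** 2))
    scale = np.sqrt(np.mean(obs_curve[ok] ** 2))
    return float(np.exp(-rmse / scale))


def q10_er(t, base, q10):
    """Synthetic Q10-style respiration response (assumed form, for illustration)."""
    return base * q10 ** (t / 10.0)


# Synthetic air temperature (degC) and ER values; not real observations.
rng = np.random.default_rng(0)
tas = rng.uniform(-20.0, 20.0, 5000)
er_model = q10_er(tas, base=1.0, q10=2.0)
er_obs_a = q10_er(tas, base=1.0, q10=2.0)  # benchmark A: same response as the model
er_obs_b = q10_er(tas, base=1.5, q10=1.6)  # benchmark B: different magnitude and shape

edges = np.linspace(-20.0, 20.0, 9)  # eight temperature bins (assumed)
model_curve = response_curve(tas, er_model, edges)
score_a = curve_score(model_curve, response_curve(tas, er_obs_a, edges))
score_b = curve_score(model_curve, response_curve(tas, er_obs_b, edges))

print("score against benchmark A:", score_a)
print("score against benchmark B:", score_b)
```

In this sketch the same model output receives a different score depending only on which synthetic benchmark it is compared against, which is the kind of sensitivity the abstract describes. Changing `edges` changes the curves as well, illustrating that the construction of the benchmark is itself a choice.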