Did you sample the reanalysis to match the sampling of HadCRUT in both the temporal and spatial domains?
A much better test would be to use a dataset such as Berkeley Earth, which is more complete in both the spatial and temporal domains.
If you don't mask the model results to match the sampling coverage of the observations, you are comparing apples and oranges.
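
To make the masking point concrete, here is a minimal sketch in Python with NumPy. Everything specific in it is an assumption for illustration: the function name `masked_global_mean`, the (time, lat, lon) array layout, NaN marking a missing observation, cos(latitude) area weights on a regular grid, and the synthetic data. It is not the actual HadCRUT file format or processing chain.

```python
import numpy as np

def masked_global_mean(field, obs, lats):
    """Area-weighted mean of `field` over cells where `obs` has data.

    field, obs: arrays of shape (time, lat, lon); NaN in `obs` = no observation.
    lats: latitudes of the cell centres in degrees, shape (lat,).
    (Hypothetical helper; layout and NaN convention are assumptions.)
    """
    # Blank out model/reanalysis cells wherever the observations are missing.
    masked = np.where(np.isnan(obs), np.nan, field)
    # cos(latitude) approximates relative cell area on a regular lat-lon grid.
    w = np.broadcast_to(np.cos(np.deg2rad(lats))[None, :, None], masked.shape)
    valid = ~np.isnan(masked)
    num = np.where(valid, masked * w, 0.0).sum(axis=(1, 2))
    den = np.where(valid, w, 0.0).sum(axis=(1, 2))
    # A time step with no valid cells at all would give 0/0 here.
    return num / den

# Synthetic example: pretend the southernmost band was never observed.
rng = np.random.default_rng(0)
lats = np.array([-75.0, -45.0, -15.0, 15.0, 45.0, 75.0])
field = rng.normal(size=(3, 6, 4))   # stand-in for model or reanalysis output
obs = field.copy()
obs[:, 0, :] = np.nan                # missing observations in one band

full = masked_global_mean(field, field, lats)   # complete coverage
masked = masked_global_mean(field, obs, lats)   # observation-like coverage
print(full - masked)                 # difference due to coverage alone
```

The difference between `full` and `masked` comes purely from coverage, which is the part of any model-versus-observation mismatch that has to be removed before the comparison means anything.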