After a bit of digging I found it: http://www.researchgate.net/publication/236011831_A_Storage-Centric_Analysis_of_MapReduce_Workloads_File_Popularity_Temporal_Locality_and_Arrival_Patterns/links/0c96051ffb6dff29fc000000
The research analysed file access patterns of two multi-petabyte Hadoop clusters at Yahoo! across several dimensions, with a focus on popularity, temporal locality and arrival patterns. The study analysed two 6-month traces, which together contain more than 940 million creates and 12 billion file open events.
The study found that files are very short-lived: 90% of the file deletions occur on files that are 22.27 minutes to 1.25 hours old.
Other interesting data:
Workloads are dominated by high file churn (a high rate of creates and deletes), which leads to 80% to 90% of files being accessed at most 10 times during a 6-month period. There is a small percentage of highly popular files: less than 3% of the files account for 34% to 39% of the accesses (opens).
Young files account for a high percentage of accesses but a small percentage of bytes stored. For example, 79% to 85% of accesses target files that are at most one day old, yet these files add up to only 1.87% to 2.21% of the bytes stored.
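If you want to check whether your own cluster behaves similarly, here is a minimal sketch of how numbers like these could be computed. It is not the method used in the paper. It assumes a hypothetical trace that you have already parsed (for example from the NameNode audit log) into (timestamp_seconds, op, path) tuples, where op is "create", "open" or "delete". All function names are made up for illustration.

```python
import math
from collections import Counter


def deletion_ages(events):
    """Return the age in seconds of each file at the moment it was deleted.

    Assumes events are (timestamp_seconds, op, path) tuples (hypothetical format).
    Deletes of files whose create is not in the trace are ignored.
    """
    created = {}
    ages = []
    for ts, op, path in sorted(events, key=lambda e: e[0]):
        if op == "create":
            created[path] = ts
        elif op == "delete" and path in created:
            ages.append(ts - created.pop(path))
    return ages


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fraction_opened_at_most(events, limit):
    """Fraction of files seen in the trace that were opened at most `limit` times."""
    opens = Counter(path for _, op, path in events if op == "open")
    files = {path for _, _, path in events}
    return sum(1 for path in files if opens[path] <= limit) / len(files)


if __name__ == "__main__":
    # Tiny made-up trace, only to show the call pattern.
    trace = [
        (0, "create", "/tmp/job1/part-0"),
        (10, "open", "/tmp/job1/part-0"),
        (60, "delete", "/tmp/job1/part-0"),
    ]
    print(percentile(deletion_ages(trace), 90))
    print(fraction_opened_at_most(trace, 10))
```

On a real multi-month trace you would stream the log rather than hold it in memory, but the definitions of "age at deletion" and "accessed at most N times" stay the same.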