"Since that first simple Tweet over eight years ago, hundreds of billions of Tweets have captured everyday human experiences and major historical events. Our search engine excelled at surfacing breaking news and events in real time, and our search index infrastructure reflected this strong emphasis on recency. But our long-standing goal has been to let people search through every Tweet ever published," Twitter said.
Twitter provided a few examples of when this expanded search capability might prove useful, such as providing comprehensive results for entire TV and sports seasons, or digging through long-lived hashtag conversations like #JapanEarthquake and #Election2012.
While on the surface this may seem like no big deal, it took quite a bit of engineering savvy to make it happen. Twitter points out that its full index is more than 100 times larger than its real-time index and grows by several billion tweets a week. The real-time index is stored entirely in RAM for fast updates, but using RAM for the full index would have been cost-prohibitive. Twitter turned to SSDs instead, though it wasn't as simple as swapping in a different storage medium.
"SSDs were still orders of magnitude slower than RAM. Switching from RAM to SSD, our Earlybird QPS capacity took a major hit. To increase serving capacity, we made multiple optimizations such as tuning kernel parameters to optimize SSD performance, packing multiple DocValues fields together to reduce SSD random access, loading frequently accessed fields directly in-process and more," Twitter explains.
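To get a feel for the field-packing optimization Twitter mentions, here is a minimal sketch of the idea: instead of storing each per-document field in its own array (one random SSD access per field), several fields are interleaved into one contiguous record per document, so a single read returns all of them. The field names and fixed-width layout below are illustrative assumptions, not Twitter's actual Earlybird schema.

```python
import struct

# Assumed example fields; Twitter's real DocValues schema is not public here.
FIELDS = ("favorites", "retweets", "lang_id")
RECORD = struct.Struct("<qqq")  # three 8-byte signed ints per document

def pack(docs):
    """Interleave all fields of each document into one contiguous blob."""
    return b"".join(RECORD.pack(*(d[f] for f in FIELDS)) for d in docs)

def lookup(blob, doc_id):
    """Fetch every field for doc_id with a single offset read,
    rather than one random access per field."""
    values = RECORD.unpack_from(blob, doc_id * RECORD.size)
    return dict(zip(FIELDS, values))

blob = pack([
    {"favorites": 10, "retweets": 2, "lang_id": 1},
    {"favorites": 0, "retweets": 7, "lang_id": 3},
])
print(lookup(blob, 1))
```

On an SSD-backed index, the win comes from turning N small random reads per document into one: fixed-width records make the offset computable without an extra lookup, at the cost of reserving space for every field in every document.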
The blog post is actually rather interesting if you're into geeky details, as there's a lot more to digest than the storage medium alone.