Yahoo Releases Staggering 13.5TB Data Cache Treasure Trove For Machine Learning Research

The digital landscape has evolved substantially over the past decade, primarily due to smart devices taking over our lives. Beyond that, machine learning has also become a major force at many top-flight companies, which see a lot of value (ultimately, revenue) in churning through big data sets. It's even helping some companies collaborate with users at large.

To give a couple of examples of what machine learning can do, Microsoft released a neat project a couple of months ago that used machine learning to detect your emotion, whether it be happiness, disgust, or anger. Microsoft even helped give computers a sense of humor, which is about as freaky as machine learning has gotten up to this point. Google has also been playing a major role in machine learning, with its RankBrain AI able to answer your most off-the-wall questions. The company even released some machine learning software, called TensorFlow, this past November.


Suffice it to say, machine learning isn't just huge; it's important. For that reason, Yahoo has decided to jump into the game by offering up a staggering 13.5TB dataset for anyone to churn through.

Yahoo notes that this dataset, which consists of 110 billion individual events, encompasses user actions on its news site between February and May of last year. The data includes the trends of each user, such as how they navigated from one part of Yahoo to another. If you want to snag the dataset, Yahoo is making it available for download. Because plain text is hugely compressible, the full 13.5TB dataset has been shrunk down to 1.5TB, making for a much quicker download for those with fat Internet pipes.
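To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above (13.5TB, 1.5TB, 110 billion events). The assumption that 1TB means 10^12 bytes, the helper names, and the 1Gbps link speed are ours, chosen purely for illustration, and not something Yahoo has stated.

```python
# Figures quoted in the article. Treating 1 TB as 10**12 bytes is an assumption.
UNCOMPRESSED_BYTES = 13.5e12
COMPRESSED_BYTES = 1.5e12
EVENTS = 110e9


def compression_ratio(raw_bytes, packed_bytes):
    """How many times smaller the compressed download is than the raw data."""
    return raw_bytes / packed_bytes


def bytes_per_event(total_bytes, events):
    """Average uncompressed size of a single logged event."""
    return total_bytes / events


def download_hours(size_bytes, megabits_per_second):
    """Ideal transfer time in hours at a sustained link speed (no overhead)."""
    bytes_per_second = megabits_per_second * 1e6 / 8
    return size_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    # 1000 Mbps is a hypothetical "fat pipe", not a figure from the article.
    link_mbps = 1000
    print("compression ratio:", compression_ratio(UNCOMPRESSED_BYTES, COMPRESSED_BYTES))
    print("avg bytes per event:", round(bytes_per_event(UNCOMPRESSED_BYTES, EVENTS)))
    print("hours to download compressed:", round(download_hours(COMPRESSED_BYTES, link_mbps), 1))
    print("hours to download uncompressed:", round(download_hours(UNCOMPRESSED_BYTES, link_mbps), 1))
```

Under those assumptions, the compressed archive is about a ninth of the original size, and each event averages a little over a hundred bytes of plain text. Real-world download times would be longer than the ideal figures, since the sketch ignores protocol overhead and server-side limits.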

If you're worried about your data being part of this dataset, you can rest assured that everything is anonymized. That's a good thing, too, as the dataset even includes things like age, sex, and generalized location data.

Machine learning is just getting started, and we're sure that companies handing out data like this could become common down the road, which would be a great thing for the sake of easier research.