The world's largest music data warehouse becomes a gold mine for artificial intelligence

Cát Tiên | 24/12/2025 09:16

Spotify's huge music data warehouse is becoming a gold mine for artificial intelligence, raising concerns about copyright and data mining.

A group of hackers has shocked the technology and music industries by announcing that it has collected and stored about 300 terabytes of data from Spotify, the world's largest music streaming platform.

The data trove includes tens of millions of audio files, album cover art, and a huge amount of metadata, and is being published through Anna's Archive, an open-source search engine for shadow libraries.

According to the published information, Anna's Archive currently holds 86 million audio files and more than 256 million track metadata records, with a total size of about 300TB.

The music metadata includes the artist, songwriter, producer, genre, duration, release date, and ISRC, the international identification code for each recording.

With 186 million ISRC codes, the platform claims to hold the world's largest public music metadata database.

The group behind Anna's Archive said its goal is to build a comprehensive "preservation archive" of music that anyone with enough storage space can copy.

According to the plan, in addition to the released metadata, the 86 million music files, which account for about 99.6% of total listens on Spotify, will be released via torrent files, ordered by popularity.

This move is especially noteworthy given the rapid development of artificial intelligence. AI companies now depend heavily on large-scale data to train models, from text and images to audio.

Such a huge music data trove could become an attractive resource for training AI models for music generation, audio analysis, or multimodal tasks, heightening the existing tension between the AI industry and copyright owners.

Spotify confirmed that it has identified and disabled the accounts involved in the illegal data scraping and has deployed additional protections.

The company said a preliminary investigation showed that a third party had collected public metadata and used illegal measures to bypass its digital rights management (DRM) system, thereby gaining access to some of the audio files.

Anna's Archive operates as a search engine, helping users access content stored elsewhere on the internet, and asserts that the platform itself does not directly host copyright-infringing content.

Previously, the platform's collection consisted mainly of books, research papers, and academic documents. The expansion into music and its metadata marks a new step, and Anna's Archive has become a regular target of takedown requests from copyright owners.

The group operating Anna's Archive believes that current music archives focus too heavily on famous artists and high-quality recordings, making it difficult to preserve the entire history of humanity's recorded music.

By prioritizing comprehensiveness and using Spotify's popularity index, they say they want to create a representative collection covering every recording ever released.

Although justified in the name of cultural preservation, this 300TB data trove still raises big questions about the line between archiving, copyright infringement, and data mining in the AI era, where the value of data is becoming increasingly sensitive and contested.

Cát Tiên