Wikimedia Germany has just announced the Wikidata Embedding project, a new database that makes it easier for AI models to access and understand Wikipedia's rich knowledge base.
The system applies vector-based semantic search, a technique that helps computers identify the meaning of words and the relationships between them, to nearly 120 million entries on Wikipedia and its sister platforms.
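To make the idea concrete, here is a minimal sketch of how a vector-based search ranks entries by meaning rather than by matching keywords. It is an illustration only, not the project's implementation: the three-dimensional vectors, the `EMBEDDINGS` table and the `semantic_search` helper are all hypothetical, and real embedding models produce vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, invented for this example.
EMBEDDINGS = {
    "nuclear physicist": [0.9, 0.8, 0.1],
    "researcher": [0.8, 0.9, 0.2],
    "football stadium": [0.1, 0.2, 0.9],
}


def semantic_search(query_vector, embeddings, top_k=2):
    """Rank entries by similarity to the query vector and keep the best top_k."""
    scored = [
        (label, cosine_similarity(query_vector, vector))
        for label, vector in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Assumed embedding of the query word "scientist".
results = semantic_search([0.85, 0.85, 0.15], EMBEDDINGS)
for label, score in results:
    print(f"{label}: {score:.3f}")
```

Entries whose vectors point in nearly the same direction as the query score close to 1, so related concepts surface even when they share no words with the query.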
The project also adds support for the Model Context Protocol (MCP), a standard that helps AI systems communicate directly with data sources.
As a result, large language models (LLMs) can query the data in natural language, improving their ability to retrieve and use accurate information from Wikipedia.
The project is being carried out by Wikimedia Germany in collaboration with Jina.AI and DataStax, a real-time training data company owned by IBM.
Previously, Wikidata supported only keyword search and SPARQL queries, which limited how well AI systems could make use of it.
The new system works well with retrieval-augmented generation (RAG) systems, which let AI models pull in external information, so that their answers can be grounded in data verified by Wikipedia editors.
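As a rough sketch of the RAG pattern described above, the snippet below shows only the final step: folding passages that have already been retrieved into a prompt for a language model. The `build_rag_prompt` helper, the passage fields and the placeholder text are assumptions made for illustration and do not reflect the project's actual interface.

```python
def build_rag_prompt(question, passages):
    """Combine retrieved passages and a question into one grounded prompt."""
    # Number each passage so the model can cite its sources.
    context_lines = [
        f"[{index}] {passage['label']}: {passage['text']}"
        for index, passage in enumerate(passages, start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer using only the numbered sources below and cite them.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\n"
    )


# Hypothetical retrieval result; the text is a placeholder, not real data.
passages = [
    {"label": "Bell Labs", "text": "placeholder description returned by the search"},
]
prompt = build_rag_prompt("Which scientists worked at Bell Labs?", passages)
print(prompt)
```

Keeping the sources numbered in the prompt makes it easier to trace each statement in the model's answer back to a verified entry.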
The data is also structured to provide semantic context. For example, a query for "scientist" returns lists of famous nuclear scientists and of researchers who worked at Bell Labs, along with translations of the word into multiple languages, licensed images, and related concepts such as "researcher" and "scholar".
The database is publicly accessible on Toolforge, and Wikidata will hold an online workshop for developers on October 9.
The project arrives as AI developers look for high-quality data sources with which to fine-tune their models.
As AI training systems grow more complex, the need for reliable data becomes even more urgent, especially since Wikipedia provides more accurate information than large data collections such as Common Crawl.
Philippe Saade, Wikidata's AI project manager, emphasized the project's independence and collaborative nature: "Powerful AI does not necessarily have to be controlled by a small group of companies. It can be open, collaborative and serve everyone."