AI lookups get easier with Wikipedia's 120-million-entry database

Cát Tiên (via TechCrunch) | 02/10/2025 06:00

The Wikidata Embedding project makes Wikipedia data easier for AI to access, improving models' ability to understand and use accurate information.

Wikimedia Germany has just announced the Wikidata Embedding project, a new database that makes it easier for AI models to access and understand Wikipedia's rich knowledge base.

The system applies vector-based semantic search, a technique that helps computers identify the meaning of words and the relationships between them, to nearly 120 million entries on Wikipedia and its sister platforms.
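To make the idea concrete, here is a minimal sketch of vector-based semantic search. The three-dimensional vectors, the entry labels and the function names are made up for illustration; a real system would use an embedding model to produce vectors with hundreds of dimensions, and nothing here reflects the project's actual implementation.

```python
import math

# Hypothetical toy embeddings: hand-picked 3-dimensional vectors, not real model output.
ENTRIES = {
    "physicist": [0.9, 0.8, 0.1],
    "chemist": [0.8, 0.9, 0.2],
    "football club": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector, entries, top_k=2):
    """Return the top_k entry labels ranked by similarity to the query vector."""
    ranked = sorted(
        entries,
        key=lambda label: cosine_similarity(query_vector, entries[label]),
        reverse=True,
    )
    return ranked[:top_k]


# Assumed embedding of the query "scientist" (illustrative values only).
scientist_query = [0.85, 0.85, 0.15]
results = search(scientist_query, ENTRIES)
print(results)
```

The query shares no keyword with "physicist" or "chemist", yet both rank above "football club" because their vectors point in nearly the same direction as the query's.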

The project also supports the Model Context Protocol (MCP), a standard that helps AI systems communicate directly with data sources.

As a result, large language models (LLMs) can query the data in natural language, improving their ability to retrieve and use accurate information from Wikipedia.

The project is being carried out by Wikimedia Germany in collaboration with Jina.AI and DataStax, a real-time training-data company owned by IBM.

Previously, Wikidata supported only keyword search and SPARQL queries, which limited how well AI systems could make use of it.
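For contrast, this is roughly what SPARQL-based access looks like. The sketch below only builds a request URL for the public Wikidata Query Service and sends nothing over the network. The identifiers P106 (occupation) and Q901 (scientist) are Wikidata's own; the function name, the constant names and the LIMIT value are illustrative choices.

```python
from urllib.parse import urlencode

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# People whose occupation (P106) is scientist (Q901), with English labels.
SCIENTIST_QUERY = """
SELECT ?person ?personLabel WHERE {
  ?person wdt:P106 wd:Q901 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""


def build_sparql_url(query, endpoint=WIKIDATA_SPARQL_ENDPOINT):
    """Encode a SPARQL query into a GET URL that asks for JSON results."""
    return endpoint + "?" + urlencode({"query": query.strip(), "format": "json"})


url = build_sparql_url(SCIENTIST_QUERY)
print(url)
```

The caller has to know the query language and the exact property and item identifiers in advance, which is the kind of barrier a natural-language interface is meant to remove.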

The new system works well with retrieval-augmented generation (RAG) systems, which let AI models pull in external information and ground their answers in data that has been verified by Wikipedia editors.
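The retrieve-then-generate pattern can be sketched in a few lines. Everything below is an assumption made for illustration: the sample documents are stand-ins rather than actual Wikidata content, the bag-of-words `embed` function is a crude lexical substitute for a real embedding model, and the final call to a language model is left out.

```python
import math
from collections import Counter

# Hypothetical stand-ins for verified entries; not actual Wikidata content.
DOCUMENTS = [
    "Marie Curie was a physicist and chemist who studied radioactivity.",
    "Bell Labs is a research company known for the transistor.",
    "The Eiffel Tower is a landmark in Paris.",
]


def embed(text):
    """Toy embedding: bag-of-words counts. A real system would call an embedding model."""
    return Counter(text.lower().replace(".", "").replace("?", "").split())


def similarity(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, documents, top_k=1):
    """Return the top_k documents most similar to the question."""
    q = embed(question)
    return sorted(documents, key=lambda d: similarity(q, embed(d)), reverse=True)[:top_k]


def build_prompt(question, documents):
    """Assemble the retrieved context and the question into one prompt for an LLM."""
    context = "\n".join("- " + d for d in retrieve(question, documents))
    return "Answer using only this context:\n" + context + "\n\nQuestion: " + question


prompt = build_prompt("Who studied radioactivity?", DOCUMENTS)
print(prompt)
```

The prompt that reaches the model contains only the retrieved passage and the question, so the answer can be traced back to a specific source entry.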

The data is also structured to provide semantic context. For example, a query for "scientist" returns lists of prominent nuclear scientists and of researchers who worked at Bell Labs, along with translations of the word into multiple languages, licensed images, and related concepts such as "lecturer" or "researcher".

The database is publicly accessible on Toolforge, and Wikidata will hold an online workshop for developers on October 9.

The project arrives as AI developers search for high-quality data sources to fine-tune their models.

As AI training systems grow more complex, the need for reliable data becomes more urgent, especially since Wikipedia provides more accurate information than large data collections such as Common Crawl.

Philippe Saade, Wikidata's AI project manager, emphasized the project's independence and collaborative nature: "Powerful AI does not have to be controlled by a small group of companies. It can be open, collaborative and serve everyone."

AGENCY OF VIETNAM GENERAL CONFEDERATION OF LABOUR