Crowdsourcing Text2Cypher dataset

Contribute to the development of a text2cypher dataset for evaluation and fine-tuning of LLMs

Tomaz Bratanic
5 min read · Jan 25, 2024

The use of Large Language Models (LLMs) for generating database queries has gained remarkable popularity and is seen as a cutting-edge development in the field. While several Text2SQL datasets are available to support this work, there is no comparable Text2Cypher dataset. My goal is to bridge this gap by starting a crowdsourcing initiative to create one. To this end, we have developed an application, connected to multiple graphs, that streamlines the data collection and integration process using a human-in-the-loop approach to generating and validating Cypher statements.

Application design. Image by the author.

The image depicts a workflow diagram for a human-in-the-loop system that validates Cypher statements. We start with a user asking a question. The question is processed by a Cypher-generating chain, which uses the graph schema information to generate a corresponding Cypher statement. That statement is then executed to retrieve the information from the graph that should answer the user's question. The generated Cypher statement and its output are evaluated by a human (the human in the loop), who acts as a judge of accuracy. The human can store upvotes or downvotes, along with possible corrections to the statements, in a database of validated Cypher statements.

Since this process is labor-intensive, we need your help to generate an exhaustive dataset of examples across various graphs.

The application is available here: https://text2cypher.vercel.app/

Provided databases

The application is connected to the demo server provided by Neo4j. There are 17 different databases that you can use to test the Cypher generation.

To make quality contributions, it is essential to familiarize yourself with the graph schema and content of the database you are querying. We have also prepared three example questions per database. You can explore every database using Neo4j Browser.

Each database has a different user and password. For example, if you want to examine the “companies” graph, you must use “companies” as both username and password. You can also connect to it via any scripting language using Neo4j drivers.

URI: neo4j+s://demo.neo4jlabs.com
username: companies
password: companies
database: companies
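As a minimal sketch, this is how the connection details above could be used from Python with the official Neo4j driver (installed with pip install neo4j). The helper names demo_settings and run_query are hypothetical. The rule that the username, password, and database name are all identical is only shown here for the "companies" graph, so treat it as an assumption for the other databases.

```python
from typing import Optional

DEMO_URI = "neo4j+s://demo.neo4jlabs.com"


def demo_settings(database: str) -> dict:
    """Build connection settings for one demo database.

    Assumption: username, password, and database name are identical,
    as in the "companies" example. Verify this for other databases.
    """
    return {
        "uri": DEMO_URI,
        "auth": (database, database),
        "database": database,
    }


def run_query(database: str, cypher: str, parameters: Optional[dict] = None) -> list:
    """Run a Cypher statement and return the rows as plain dictionaries."""
    # Imported here so demo_settings works even without the driver installed.
    from neo4j import GraphDatabase

    settings = demo_settings(database)
    with GraphDatabase.driver(settings["uri"], auth=settings["auth"]) as driver:
        records, _summary, _keys = driver.execute_query(
            cypher,
            parameters_=parameters or {},
            database_=settings["database"],
        )
        return [record.data() for record in records]
```

For example, run_query("companies", "MATCH (n) RETURN count(n) AS nodes") should return a single row containing the number of nodes in the companies graph.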

It is also recommended to examine the databases for any specific entities that you want to ask questions about, so that you can use accurate and valid values in your queries.

Contribution guidelines

The main idea of this project is to generate a high-quality training and evaluation dataset for text2cypher applications by posing good questions and assessing the generated answers.

The application takes a user’s natural language question and converts it into a Cypher query. Once the user submits their question (for instance, asking how many directors Tom Hanks has worked with), the system translates it into the appropriate Cypher query. The query is then executed against the database, and the result is returned in a structured format, showing that Tom Hanks has worked with 11 distinct directors. The interface offers the option to provide feedback on the generated Cypher statement. If you select a downvote, a modal opens that prompts you to enter a valid Cypher statement for the corresponding natural language input.
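The feedback step described above could be captured in a record like the following. This is only a sketch: the application's actual storage format is not described in this post, so the Feedback class, its field names, and the example Cypher statement (which assumes Person and Movie nodes connected by ACTED_IN and DIRECTED relationships) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feedback:
    """One human judgment on a generated Cypher statement (hypothetical format)."""

    database: str
    question: str
    generated_cypher: str
    upvote: bool
    corrected_cypher: Optional[str] = None  # supplied by the user on a downvote

    def validated_cypher(self) -> Optional[str]:
        """Return the statement a dataset could trust for this question, if any."""
        if self.upvote:
            return self.generated_cypher
        return self.corrected_cypher


# Assumed schema: (:Person)-[:ACTED_IN]->(:Movie)<-[:DIRECTED]-(:Person)
example = Feedback(
    database="movies",
    question="How many directors has Tom Hanks worked with?",
    generated_cypher=(
        "MATCH (a:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(:Movie)"
        "<-[:DIRECTED]-(d:Person) "
        "RETURN count(DISTINCT d) AS directors"
    ),
    upvote=True,
)
```

An upvoted record contributes its generated statement to the dataset, a downvoted record contributes the user's correction, and a downvote without a correction contributes nothing.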

It is essential to consider the following guidelines to ensure the best possible outcome for this crowdsourcing project.

  • Please do not ask questions unrelated to the database. The LLM will try its best to produce a Cypher statement, but the result will not be useful as an evaluation or finetuning example.
  • Avoid ambiguous questions. Try to be as descriptive as possible. For example:
  • If you mention specific values or entities in the database, please make sure that they are valid and accurate, as there is no intermediate value mapping step implemented. For example: Is any of the companies that LIN PING is officer of from Hong Kong? <- Here both LIN PING and Hong Kong are used to filter values in the database, and they have to match exactly. The LLM is not aware of the format of the value, so you need to be exact. If the value is stored in uppercase in the database, as LIN PING is, use the same casing in the prompt.
  • Try to ask questions that are graph-based rather than simple statistics. Both are fine, but we should focus on questions that require graph traversals.
  • Try to be specific about the information you want to retrieve from the database and how many values there should be. For example: Return the top 3 movie titles with the highest imdb rating! <- In this example, we have specified that we want to return three movies and only their titles.
  • Try to use a variety of databases when producing/testing examples, not just movies :)
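Because values have to match exactly, it can help to look up how a value is stored before using it in a question. The sketch below builds a case-insensitive lookup query. The function name exact_value_lookup is hypothetical, and the label Person and property name used in the usage note are assumptions, so check the schema of the database you are exploring.

```python
import re

# Cypher labels and property keys cannot be passed as query parameters,
# so they are validated before being placed into the query text.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def exact_value_lookup(label: str, prop: str, value: str, limit: int = 5):
    """Return (cypher, parameters) that list the stored spellings of `value`."""
    for name in (label, prop):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"not a simple identifier: {name!r}")
    cypher = (
        f"MATCH (n:{label}) "
        f"WHERE toLower(n.{prop}) = toLower($value) "
        f"RETURN DISTINCT n.{prop} AS stored_value "
        f"LIMIT $limit"
    )
    return cypher, {"value": value, "limit": limit}
```

Calling exact_value_lookup("Person", "name", "lin ping") returns a query and its parameters. Running them against the database, for example in Neo4j Browser or with a Neo4j driver, shows the exact spelling and casing to copy into your question.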

Summary

I really hope that we can build a good evaluation and training text2cypher dataset together, as I am eager to experiment with finetuned LLMs, schema descriptions, and more. The dataset will be shared publicly. I don’t know the exact details just yet, but I want all of us to benefit from it, so the terms will be as non-restrictive as possible. Additionally, the top 10 contributors will get Neo4j swag rewards! I am planning on running the application for a good month, and then we can reevaluate.

Here’s to the year of finetuning text2cypher LLMs! Open the application and help us make it!

Tomaz Bratanic

Data explorer. Turn everything into a graph. Author of Graph Algorithms for Data Science, published by Manning. http://mng.bz/GGVN