Vector Similarity Search¶

This notebook demonstrates usage of the vector similarity search, to help the agent select an exact query value.

In [1]:

Copied!

from pyspark_ai import SparkAI

# enable vector similarity search by specifying a vector_store_dir location
spark_ai = SparkAI(enable_cache=False, vector_store_dir="temp/", verbose=True)
spark_ai.activate()
from pyspark_ai import SparkAI

# enable vector similarity search by specifying a vector_store_dir location
spark_ai = SparkAI(enable_cache=False, vector_store_dir="temp/", verbose=True)
spark_ai.activate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/17 13:50:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

In [2]:

Copied!





# create a dataframe with inconsistent date formats
scores_df = spark_ai._spark.createDataFrame(
    [
        ("09-10-2017", "24-20", "New England Patriots"),
        ("2017-17-09", "27-23", "Green Bay Packers"),
        ("09/24/2017", "30-26", "Kansas City Chiefs"),
        ("10-01-2017", "26-24", "Los Angeles Rams"),
        ("10/08/2017", "23-10", "Baltimore Ravens"),
        ("10/15/17", "20-17", "Miami Dolphins"),
        ("10-22-2017", "33-31", "Chicago Bears"),
        ("2017-10-29", "27-24", "Philadelphia Eagles"),
        ("11-05-2017", "28-23", "Carolina Panthers"),
        ("11-12-2017", "27-24", "New Orleans Saints")
    ],
    ["date", "score", "winner"]
)
# create a dataframe with inconsistent date formats
scores_df = spark_ai._spark.createDataFrame(
    [
        ("09-10-2017", "24-20", "New England Patriots"),
        ("2017-17-09", "27-23", "Green Bay Packers"),
        ("09/24/2017", "30-26", "Kansas City Chiefs"),
        ("10-01-2017", "26-24", "Los Angeles Rams"),
        ("10/08/2017", "23-10", "Baltimore Ravens"),
        ("10/15/17", "20-17", "Miami Dolphins"),
        ("10-22-2017", "33-31", "Chicago Bears"),
        ("2017-10-29", "27-24", "Philadelphia Eagles"),
        ("11-05-2017", "28-23", "Carolina Panthers"),
        ("11-12-2017", "27-24", "New Orleans Saints")
    ],
    ["date", "score", "winner"]
)

In [3]:

Copied!

scores_df.ai.transform(desc="Which team won on October 29th?", cache=False).show()
scores_df.ai.transform(desc="Which team won on October 29th?", cache=False).show()

INFO: Creating temp view for the transform:
df.createOrReplaceTempView("spark_ai_temp_view_1345058194")


> Entering new AgentExecutor chain...
Thought: The keyword 'October 29th' is most similar to the sample values in the `date` column.
I need to filter on an exact value from the `date` column, so I will use the tool similar_value to help me choose my filter value.
Action: similar_value
Action Input: October 29th|date|spark_ai_temp_view_1345058194
Observation: 2017-10-29
Thought:The correct `date` filter should be '2017-10-29' because it is semantically closest to the keyword.
I will use the column `date` to filter the rows where its value is '2017-10-29' and then select the `winner`.
Action: query_validation
Action Input: SELECT `winner` FROM `spark_ai_temp_view_1345058194` WHERE `date` = '2017-10-29'
Observation: OK
Thought:I now know the final answer.
Final Answer: SELECT `winner` FROM `spark_ai_temp_view_1345058194` WHERE `date` = '2017-10-29'

> Finished chain.
+-------------------+
|             winner|
+-------------------+
|Philadelphia Eagles|
+-------------------+

In [4]:

Copied!

scores_df.ai.transform(desc="When was the score 23 v 28?", cache=False).show()
scores_df.ai.transform(desc="When was the score 23 v 28?", cache=False).show()

INFO: Creating temp view for the transform:
df.createOrReplaceTempView("spark_ai_temp_view_1345058194")



> Entering new AgentExecutor chain...
Thought: The keyword '23 v 28' is most similar to the sample values in the `score` column.
I need to filter on an exact value from the `score` column, so I will use the tool similar_value to help me choose my filter value.
Action: similar_value
Action Input: 23 v 28|score|spark_ai_temp_view_1345058194
Observation: 28-23
Thought:The correct `score` filter should be '28-23' because it is semantically closest to the keyword.
I will use the column `score` to filter the rows where its value is '28-23' and then select the `date` 
because the question asks for the date when the score was '28-23'.
Action: query_validation
Action Input: SELECT `date` FROM `spark_ai_temp_view_1345058194` WHERE `score` = '28-23'
Observation: OK
Thought:I now know the final answer.
Final Answer: SELECT `date` FROM `spark_ai_temp_view_1345058194` WHERE `score` = '28-23'

> Finished chain.
+----------+
|      date|
+----------+
|11-05-2017|
+----------+

In [5]:

Copied!

scores_df.ai.transform(desc="When did the Chiefs win?", cache=False).show()
scores_df.ai.transform(desc="When did the Chiefs win?", cache=False).show()

INFO: Creating temp view for the transform:
df.createOrReplaceTempView("spark_ai_temp_view_1345058194")



> Entering new AgentExecutor chain...
Thought: The column names are non-descriptive, but from the sample values I see that column `winner` contains team names
and column `date` contains dates. So, I will filter on column `winner` for 'Chiefs' and select column `date` for the date.
I will use = rather than "like" in my SQL query because I need an exact match. I'll use the similar_value tool to help
me choose an exact filter value for `winner`.
Action: similar_value
Action Input: Chiefs|winner|spark_ai_temp_view_1345058194
Observation: Kansas City Chiefs
Thought:The correct `winner` filter should be 'Kansas City Chiefs' because it is semantically closest to the keyword.
Action: query_validation
Action Input: SELECT `date` FROM `spark_ai_temp_view_1345058194` WHERE `winner` = 'Kansas City Chiefs'
Observation: OK
Thought:I now know the final answer.
Final Answer: SELECT `date` FROM `spark_ai_temp_view_1345058194` WHERE `winner` = 'Kansas City Chiefs'

> Finished chain.
+----------+
|      date|
+----------+
|09/24/2017|
+----------+

In [6]:

Copied!





# create a dataframe
countries_df = spark_ai._spark.createDataFrame(
    [
        ("USA", "21.43 trillion USD", 2.7, "331 million", 0.7),
        ("CHN", "14.34 trillion USD", 7.0, "1.4 billion", 0.4),
        ("IND", "2.87 trillion USD", 8.0, "1.3 billion", 1.2),
        ("BRA", "1.87 trillion USD", -3.5, "211 million", 0.7),
        ("CAN", "1.64 trillion USD", 0.7, "38 million", 0.8),
        ("AUS", "1.37 trillion USD", 1.0, "25 million", 1.1),
        ("RUS", "1.47 trillion USD", -2.0, "144 million", 0.04),
        ("FRA", "2.71 trillion USD", 1.3, "67 million", 0.3),
        ("GER", "3.68 trillion USD", 0.8, "83 million", 0.2),
        ("JPN", "5.08 trillion USD", 1.6, "126 million", 0.2),
        ("GBR", "2.83 trillion USD", 0.3, "67 million", 0.6),
        ("ZAF", "352 billion USD", 1.2, "60 million", 1.2)
    ],
    ["country", "gdp", "gdp growth (%)", "population", "population growth (%)"]
)
# create a dataframe
countries_df = spark_ai._spark.createDataFrame(
    [
        ("USA", "21.43 trillion USD", 2.7, "331 million", 0.7),
        ("CHN", "14.34 trillion USD", 7.0, "1.4 billion", 0.4),
        ("IND", "2.87 trillion USD", 8.0, "1.3 billion", 1.2),
        ("BRA", "1.87 trillion USD", -3.5, "211 million", 0.7),
        ("CAN", "1.64 trillion USD", 0.7, "38 million", 0.8),
        ("AUS", "1.37 trillion USD", 1.0, "25 million", 1.1),
        ("RUS", "1.47 trillion USD", -2.0, "144 million", 0.04),
        ("FRA", "2.71 trillion USD", 1.3, "67 million", 0.3),
        ("GER", "3.68 trillion USD", 0.8, "83 million", 0.2),
        ("JPN", "5.08 trillion USD", 1.6, "126 million", 0.2),
        ("GBR", "2.83 trillion USD", 0.3, "67 million", 0.6),
        ("ZAF", "352 billion USD", 1.2, "60 million", 1.2)
    ],
    ["country", "gdp", "gdp growth (%)", "population", "population growth (%)"]
)

In [7]:

Copied!

countries_df.ai.transform(desc="What is the GDP of the country that has 38,000,000 people?", cache=False).show()
countries_df.ai.transform(desc="What is the GDP of the country that has 38,000,000 people?", cache=False).show()

INFO: Creating temp view for the transform:
df.createOrReplaceTempView("spark_ai_temp_view__1575533453")



> Entering new AgentExecutor chain...
Thought: The keyword '38,000,000' is most similar to the sample values in the `population` column.
I need to filter on an exact value from the `population` column, so I will use the tool similar_value to help me choose my filter value.
Action: similar_value
Action Input: 38,000,000|population|spark_ai_temp_view__1575533453
Observation: 38 million
Thought:The correct `population` filter should be '38 million' because it is semantically closest to the keyword.
I will use the column `population` to filter the rows where its value is '38 million' and then select the `gdp` 
because the question asks for the GDP of the country.
Action: query_validation
Action Input: SELECT `gdp` FROM `spark_ai_temp_view__1575533453` WHERE `population` = '38 million'
Observation: OK
Thought:I now know the final answer.
Final Answer: SELECT `gdp` FROM `spark_ai_temp_view__1575533453` WHERE `population` = '38 million'

> Finished chain.
+-----------------+
|              gdp|
+-----------------+
|1.64 trillion USD|
+-----------------+

In [8]:

Copied!

countries_df.ai.transform(desc="What is the average GDP growth for all countries with .7 percent population growth?", cache=False).show()
countries_df.ai.transform(desc="What is the average GDP growth for all countries with .7 percent population growth?", cache=False).show()

INFO: Creating temp view for the transform:
df.createOrReplaceTempView("spark_ai_temp_view__1575533453")



> Entering new AgentExecutor chain...
Thought: The keyword '.7 percent population growth' is most similar to the sample values in the `population growth (%)` column.
I need to filter on an exact value from the `population growth (%)` column, so I will use the tool similar_value to help me choose my filter value.
Action: similar_value
Action Input: .7|population growth (%)|spark_ai_temp_view__1575533453
Observation: 0.7
Thought:The correct `population growth (%)` filter should be '0.7' because it is semantically closest to the keyword.
I will use the column `population growth (%)` to filter the rows where its value is '0.7' and then select the AVG(`gdp growth (%)`) 
because the question asks for the average GDP growth.
Action: query_validation
Action Input: SELECT AVG(`gdp growth (%)`) FROM `spark_ai_temp_view__1575533453` WHERE `population growth (%)` = 0.7
Observation: OK
Thought:I now know the final answer.
Final Answer: SELECT AVG(`gdp growth (%)`) FROM `spark_ai_temp_view__1575533453` WHERE `population growth (%)` = 0.7

> Finished chain.
+-------------------+
|avg(gdp growth (%))|
+-------------------+
|-0.3999999999999999|
+-------------------+

In [9]:

Copied!

countries_df.ai.transform(desc="Which countries had a higher population growth than Japan?", cache=False).show()
countries_df.ai.transform(desc="Which countries had a higher population growth than Japan?", cache=False).show()

INFO: Creating temp view for the transform:
df.createOrReplaceTempView("spark_ai_temp_view__1575533453")



> Entering new AgentExecutor chain...
Thought: The keyword 'Japan' is most similar to the sample values in the `country` column.
I need to filter on an exact value from the `country` column, so I will use the tool similar_value to help me choose my filter value.
Action: similar_value
Action Input: Japan|country|spark_ai_temp_view__1575533453
Observation: JPN
Thought:The correct `country` filter should be 'JPN' because it is semantically closest to the keyword.
I will use the column `country` to filter the rows where its value is 'JPN' and then select the `population growth (%)` 
because the question asks for the countries with higher population growth than Japan.
Action: query_sql_db
Action Input: "SELECT `population growth (%)` FROM spark_ai_temp_view__1575533453 WHERE `country` = 'JPN'"
Observation: [('0.2',)]
Thought:The population growth of Japan is 0.2%. Now I will write a query to find the countries with a higher population growth.
Action: query_validation
Action Input: SELECT `country` FROM `spark_ai_temp_view__1575533453` WHERE `population growth (%)` > 0.2
Observation: OK
Thought:I now know the final answer.
Final Answer: SELECT `country` FROM `spark_ai_temp_view__1575533453` WHERE `population growth (%)` > 0.2

> Finished chain.
+-------+
|country|
+-------+
|    USA|
|    CHN|
|    IND|
|    BRA|
|    CAN|
|    AUS|
|    FRA|
|    GBR|
|    ZAF|
+-------+