Basic Linear Regression
Let's create a tool that performs a least squares regression on numeric data retrieved from a Weaviate collection. For this example, we will use the default branch initialisation for the decision tree ("one_branch"), which includes tools for querying and retrieving data from a Weaviate collection. We are going to add a `BasicLinearRegression` tool that extracts data from the environment and solves the least squares problem to obtain coefficients for the regression.
Initial Setup
Let's follow the basic guidelines for creating a tool and create the skeleton of the tool below.
from elysia import Error, Tool, Result
import numpy as np
import matplotlib.pyplot as plt
class BasicLinearRegression(Tool):
def __init__(self, logger, **kwargs):
pass
async def __call__(
self,
tree_data,
inputs,
base_lm,
complex_lm,
client_manager,
**kwargs,
):
pass
async def is_tool_available(self, tree_data, base_lm, complex_lm, client_manager):
pass
The general outline for this tool is going to be as follows:
- No LLM calls within the tool; the decision agent is going to decide the inputs for the regression.
- The call of the tool is going to transform the environment into matrices to be used for the analysis.
- The tool is going to add the labelled coefficients from the regression to the environment.
- The `is_tool_available` function is going to return `True` when there is data retrieved from the `query` tool in the environment.
Tool Initialisation
Let's set the initialisation parameters to provide a comprehensive tool description and input descriptions.
def __init__(self, logger, **kwargs):
super().__init__(
name="basic_linear_regression_tool",
description="""
Use this tool to perform linear regression on objects in the environment with numeric data types.
""",
status="Running linear regression...",
inputs={
"environment_key": {
"description": (
"A single key (string) of the `environment` dictionary that will be used in the analysis. "
"Choose the most relevant key for the analysis according to the user prompt. "
"All objects under that key will be used to create the dataframe. "
),
"required": True,
"type": str,
"default": None,
},
"x_variable_fields": {
"description": (
"The independent variables for the regression. "
"Choose one or more fields within the `objects` underneath the specific `environment_key`. "
),
"required": True,
"type": list[str],
"default": None,
},
"y_variable_field": {
"description": (
"The dependent variable for the regression. "
"Choose one single field within the `objects` underneath the specific `environment_key`. "
),
"required": True,
"type": str,
"default": None,
},
},
end=False,
)
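For intuition, here is the kind of `inputs` dictionary the decision agent might construct from these descriptions (the field names here are hypothetical and depend on what is in your collection):

# Hypothetical example of inputs the decision agent could choose;
# the keys come from the tool definition, the values from your data.
inputs = {
    "environment_key": "query",    # data previously added by the query tool
    "x_variable_field": "price",   # independent variable
    "y_variable_field": "rating",  # dependent variable
}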
Tool Call
The tool call is the `__call__` method. This will be called when the tool is selected by the LLM, so this is where the actual tool execution takes place. We need to do several things here:
- Convert the environment (within the `tree_data`) to matrices
- Perform least squares regression on these matrices
- Yield relevant objects to the decision tree, so the information goes back to the tree

Let's first start with setting the tool up and collecting the inputs:
async def __call__(
self,
tree_data,
inputs,
base_lm,
complex_lm,
client_manager,
**kwargs,
):
environment = tree_data.environment.environment
environment_key = inputs["environment_key"]
x_variable_field = inputs["x_variable_field"]
y_variable_field = inputs["y_variable_field"]
Now that we have collected the relevant inputs, let's proceed with converting the environment (within the `tree_data`) to matrices:
# initialise empty matrices to store all objects
X = np.empty((0, 2))
y = np.empty((0, 1))
# iterate over all items in the environment
for inner_key in environment[environment_key]:
# convert all objects under this key to a matrix
inner_X = np.array(
[
[obj[x_variable_field]]
for environment_list in environment[environment_key][inner_key]
for obj in environment_list["objects"]
]
)
inner_X = np.hstack([np.ones((inner_X.shape[0], 1)), inner_X])
X = np.vstack([X, inner_X])
# convert all objects under this key to a matrix
inner_y = np.array(
[
[obj[y_variable_field]]
for environment_list in environment[environment_key][inner_key]
for obj in environment_list["objects"]
]
)
y = np.vstack([y, inner_y])
The `environment` is keyed via two variables: the tool that added items to the environment, and then a sub-key of names within that tool. This is why we loop over each `inner_key` within the outer `environment_key` that was decided by the LLM.
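To make this nesting concrete, here is a sketch of what the loop above assumes the environment looks like (the collection and field names are hypothetical):

# Hypothetical shape of tree_data.environment.environment after a query.
# Outer key: the tool that added the data; inner key: a name within that
# tool (e.g. the collection that was queried).
environment = {
    "query": {
        "Ecommerce": [
            {
                "objects": [
                    {"price": 10.0, "rating": 4.5},
                    {"price": 25.0, "rating": 4.7},
                ],
            },
        ],
    },
}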
We use NumPy to create the matrices, which are initialised to the correct dimensions: the independent variable matrix \(X\) should be \(n \times 2\), since there is one independent variable plus an intercept column, and the dependent variable \(y\) should be \(n \times 1\).
Now we need to perform least squares regression on these matrices to obtain the regression coefficients.
# calculate the beta hat values via least squares
beta_hat = np.linalg.inv(X.T @ X + 1e-10 * np.eye(X.shape[1])) @ X.T @ y
        beta_hat_dict = {
            # use .item() so the coefficients are plain Python floats, not NumPy arrays
            "intercept": beta_hat[0].item(),
            "slope": beta_hat[1].item(),
        }
We follow the least squares solution using linear algebra to get these coefficients:
$$
\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y
$$
where \(\lambda\) is small (e.g. `1e-10`) and \(I\) is the identity matrix matching the number of columns of \(X\) (here \(2 \times 2\)); this tiny ridge term simply guards against \(X^\top X\) being singular. The first value in \(\hat{\beta}\) will correspond to the intercept, and the second to the effect of the independent variable on the dependent variable (the slope).
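As a quick aside (not part of the tool itself), we can sanity-check this closed-form solution against NumPy's built-in least squares solver on synthetic data:

# Standalone sanity check: both estimators should recover an
# intercept near 2.0 and a slope near 0.5.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=50)
y = (2.0 + 0.5 * x + rng.normal(0, 1, size=50)).reshape(-1, 1)
X = np.hstack([np.ones((50, 1)), x.reshape(-1, 1)])

beta_closed = np.linalg.inv(X.T @ X + 1e-10 * np.eye(2)) @ X.T @ y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_closed.ravel(), beta_lstsq.ravel())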
We can plot these values with `matplotlib`, just for explanatory purposes:
# plot the data
pred_y = X @ beta_hat
fig, ax = plt.subplots()
ax.scatter(X[:, 1], y)
ax.plot(X[:, 1], pred_y, color="red")
ax.set_title(
f"Linear regression between {x_variable_field} and {y_variable_field}"
)
ax.set_xlabel(x_variable_field)
ax.set_ylabel(y_variable_field)
fig.show()
This will plot the independent variable against the dependent one, include the line of best fit, and then show the plot whenever the tool is successfully run.
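Note that `fig.show()` only displays the figure in an interactive session with a GUI backend; if you are running the tool as a script or on a server, saving the figure instead is more reliable (the filename here is arbitrary):

# alternative for non-interactive environments: write the plot to disk
fig.savefig("linear_regression.png", dpi=150, bbox_inches="tight")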
Now we need to give information back to the decision tree about the results of this analysis, so it can be used to continue the conversation and make further decisions. All tools should `yield` objects, and we use the `Result` class, which will automatically assign objects to the tree's environment.
yield Result(
objects=[
beta_hat_dict,
],
metadata={
"x_variable_field": x_variable_field,
"y_variable_field": y_variable_field,
},
            llm_message=(
                "Completed linear regression analysis where: "
                f" x variable: {x_variable_field} "
                f" y variable: {y_variable_field} "
            ),
)
There is only one object that needs to be returned: the coefficient values. The decision agent will use this information to continue the decision tree by reading the objects. We also record which variable corresponds to which by including the `x_variable_field` and `y_variable_field` in the metadata, and include an `llm_message` to display a custom confirmation that the regression was completed.

The full `__call__` method now looks like this:
async def __call__(
self,
tree_data,
inputs,
base_lm,
complex_lm,
client_manager,
**kwargs,
):
environment = tree_data.environment.environment
environment_key = inputs["environment_key"]
x_variable_field = inputs["x_variable_field"]
y_variable_field = inputs["y_variable_field"]
# initialise empty matrices to store all objects
X = np.empty((0, 2))
y = np.empty((0, 1))
# iterate over all items in the environment
for inner_key in environment[environment_key]:
# convert all objects under this key to a matrix
inner_X = np.array(
[
[obj[x_variable_field]]
for environment_list in environment[environment_key][inner_key]
for obj in environment_list["objects"]
]
)
inner_X = np.hstack([np.ones((inner_X.shape[0], 1)), inner_X])
X = np.vstack([X, inner_X])
# convert all objects under this key to a matrix
inner_y = np.array(
[
[obj[y_variable_field]]
for environment_list in environment[environment_key][inner_key]
for obj in environment_list["objects"]
]
)
y = np.vstack([y, inner_y])
# calculate the beta hat values via least squares
beta_hat = np.linalg.inv(X.T @ X + 1e-10 * np.eye(X.shape[1])) @ X.T @ y
        beta_hat_dict = {
            "intercept": beta_hat[0].item(),
            "slope": beta_hat[1].item(),
        }
pred_y = X @ beta_hat
# plot the data
fig, ax = plt.subplots()
ax.scatter(X[:, 1], y)
ax.plot(X[:, 1], pred_y, color="red")
ax.set_title(
f"Linear regression between {x_variable_field} and {y_variable_field}"
)
ax.set_xlabel(x_variable_field)
ax.set_ylabel(y_variable_field)
fig.show()
# yield the result to the decision tree
yield Result(
objects=[
beta_hat_dict,
],
metadata={
"x_variable_field": x_variable_field,
"y_variable_field": y_variable_field,
},
            llm_message=(
                "Completed linear regression analysis where: "
                f" x variable: {x_variable_field} "
                f" y variable: {y_variable_field} "
            ),
)
Controlling when the tool is available
We can also set the `is_tool_available` method to only return `True` when something relevant is in the environment. In this case, that is when information has been returned from the `query` tool.
async def is_tool_available(self, tree_data, base_lm, complex_lm, client_manager):
"""
Available when the 'query' task has been completed and it has added data to the environment.
"""
return (
"query" in tree_data.environment.environment
and len(tree_data.environment.environment["query"]) > 0
)
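For illustration, here is a minimal (hypothetical) environment state under which this check returns `True`:

# illustrative only: the tool becomes available once the query tool
# has written at least one inner key into the environment
env = {"query": {"Ecommerce": [{"objects": [{"price": 10.0, "rating": 4.5}]}]}}
print("query" in env and len(env["query"]) > 0)  # True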
We have also provided a brief description (the docstring) of when the tool will become available, which is passed to the decision agent whenever the tool is unavailable, so that it knows what tasks it needs to accomplish before it can run this tool. In this case, this should inform the LM that, to perform linear regression, it needs to retrieve the data first.
Putting it all together
Here is the full tool construction. Let's also wrap the call in a `try`/`except` block to catch any errors, using the `Error` class for self-healing errors.
class BasicLinearRegression(Tool):
def __init__(self, logger, **kwargs):
super().__init__(
name="basic_linear_regression_tool",
description="""
Use this tool to perform linear regression between two numeric variables in the environment.
""",
status="Running linear regression...",
inputs={
"environment_key": {
"description": (
"A single key (string) of the `environment` dictionary that will be used in the analysis. "
"Choose the most relevant key for the analysis according to the user prompt. "
"All objects under that key will be used to create the dataframe. "
),
"required": True,
"type": str,
"default": None,
},
"x_variable_field": {
"description": (
"The independent variable for the regression. "
"Choose one field title within the `objects` underneath the specific `environment_key`. "
),
"required": True,
"type": str,
"default": None,
},
"y_variable_field": {
"description": (
"The dependent variable for the regression. "
"Choose one single field within the `objects` underneath the specific `environment_key`. "
),
"required": True,
"type": str,
"default": None,
},
},
end=False,
)
async def __call__(
self,
tree_data,
inputs,
base_lm,
complex_lm,
client_manager,
**kwargs,
):
environment = tree_data.environment.environment
environment_key = inputs["environment_key"]
x_variable_field = inputs["x_variable_field"]
y_variable_field = inputs["y_variable_field"]
try:
# initialise empty matrices to store all objects
X = np.empty((0, 2))
y = np.empty((0, 1))
# iterate over all items in the environment
for inner_key in environment[environment_key]:
# convert all objects under this key to a matrix
inner_X = np.array(
[
[obj[x_variable_field]]
for environment_list in environment[environment_key][inner_key]
for obj in environment_list["objects"]
]
)
inner_X = np.hstack([np.ones((inner_X.shape[0], 1)), inner_X])
X = np.vstack([X, inner_X])
# convert all objects under this key to a matrix
inner_y = np.array(
[
[obj[y_variable_field]]
for environment_list in environment[environment_key][inner_key]
for obj in environment_list["objects"]
]
)
y = np.vstack([y, inner_y])
# calculate the beta hat values via least squares
beta_hat = np.linalg.inv(X.T @ X + 1e-10 * np.eye(X.shape[1])) @ X.T @ y
            beta_hat_dict = {
                "intercept": beta_hat[0].item(),
                "slope": beta_hat[1].item(),
            }
pred_y = X @ beta_hat
# plot the data
fig, ax = plt.subplots()
ax.scatter(X[:, 1], y)
ax.plot(X[:, 1], pred_y, color="red")
ax.set_title(
f"Linear regression between {x_variable_field} and {y_variable_field}"
)
ax.set_xlabel(x_variable_field)
ax.set_ylabel(y_variable_field)
fig.show()
# yield the result to the decision tree
yield Result(
objects=[
beta_hat_dict,
],
metadata={
"x_variable_field": x_variable_field,
"y_variable_field": y_variable_field,
},
                llm_message=(
                    "Completed linear regression analysis where: "
                    f" x variable: {x_variable_field} "
                    f" y variable: {y_variable_field} "
                ),
)
except Exception as e:
yield Error(str(e))
async def is_tool_available(self, tree_data, base_lm, complex_lm, client_manager):
"""
Available when the 'query' task has been completed and it has added data to the environment.
"""
return (
"query" in tree_data.environment.environment
and len(tree_data.environment.environment["query"]) > 0
)
Experimenting
Let's initialise a tree, add the tool, and test that it runs correctly. In this example, a Weaviate collection called `Ecommerce` will be used. The `Ecommerce` collection has many fields that are redundant for this example, but we want to evaluate whether the `price` variable correlates with the review rating.
from elysia import Tree

tree = Tree()
tree.add_tool(BasicLinearRegression)

response, objects = tree(
    "Perform linear regression on the relationship between the price of a product and the review rating it has",
    collection_names=["Ecommerce"],
)
INFO Tasks completed (iteration 1):
- query (Avg. 3.90 seconds)
- summarise_items (Avg. 0.00 seconds)
INFO Tasks completed (iteration 2):
- basic_linear_regression_tool (Avg. 0.01 seconds)
INFO Tasks completed (iteration 3):
- cited_summarize (Avg. 4.59 seconds)
To begin, I'm querying the "Ecommerce" collection to retrieve the product price and review rating data needed for the linear regression.

I am now retrieving all product prices and review ratings from the "Ecommerce" collection to prepare for the linear regression analysis.

I am now performing a linear regression analysis on the retrieved data, using product price as the independent variable and review rating as the dependent variable.

I have completed the linear regression analysis, revealing an intercept of approximately 4.61 and a slope of approximately 0.00044, indicating a very slight positive relationship between product price and review rating.

The linear regression analysis performed on product price and review rating reveals a very slight positive relationship between the two variables. The intercept of the regression model is approximately 4.61, and the slope is approximately 0.00044. This indicates that as the product price increases, the review rating tends to increase, but only by a very small margin.