# python-ragability

**IMPORTANT: this project is still under development and may change significantly in the near future!**
A library and corpus for checking/benchmarking LLMs with regard to properties relevant to their use in RAG systems.
## Conda

- Create a conda environment, e.g. by sourcing `conda-create.sourceme`, e.g. with bash:
  ```
  . conda-create.sourceme
  ```
- If necessary, (re-)install the requirements (the conda creation code already does this too):
  ```
  pip install -r requirements.txt
  ```
- If necessary, install/update the ragability package (the conda creation code already does this too):
  ```
  pip install -e .
  ```
- NOTE: as long as the package gets developed/changed, the re-installation steps make sure that changed CLI programs are made available.
## Usage

- Activate the conda environment:
  ```
  conda activate ragability
  ```
- After installation, the following commands are available:
  - `ragability_info`: show the versions of all relevant Python packages installed
  - `ragability_cc_wc1`: convert the wiki-contradict based corpus to ragability format
  - `ragability_query`: given an input file with facts and queries, a list of candidate LLMs and a prompt template, produce an output file that contains the LLM answers to the queries
  - `ragability_check`: given a query result file and a judge LLM, evaluate the answers received against the pre-defined answers and create an output file that contains the evaluation scores and meta-information for each example
  - `ragability_eval`: given the data created with `ragability_check`, calculate detailed performance statistics
  - `ragability_hjson_info`: show some info about the number of entries and keys present in a hjson, json, or jsonl file
  - `ragability_hjson_cat`: concatenate several hjson, json or jsonl files into one hjson or json file
  - `llms_wrapper_test`: for the configured/specified LLMs, test if they are working and returning an answer. This is useful to check a config file and test if all the API keys are correctly set and working
- All commands take the `--help` option to get usage information

LLMs currently supported: support is based on the LiteLLM backend via the `llms_wrapper` package. The supported LLMs are listed here.
## Current usage

Example usage with the converted wiki-contradict dataset:

- (optional conversion step, the current converted dataset is already part of the repo) convert the dataset tsv file to ragability format:
  ```
  ragability_cc_wc1 --input corpus/wikicontradict1/Dataset_v0.2_short.tsv --output corpus/wikicontradict1/v0d2.hjson
  ```
- create or copy and modify one of the `conf*.hjson` files to contain the LLMs and LLM configs wanted for the experiment
- run the base LLMs on the corpus; the following will also create a log file:
  ```
  ragability_query -i corpus/wikicontradict1/v0d2.hjson -o experiments-wc1/v0d2.out1.hjson --config experiments-wc1/conf-all.hjson --promptfile experiments-wc1/prompt.hjson --logfile experiments-wc1/v0d2.log1.txt --verbose
  ```
- run the checker LLM on the output of the previous step:
  ```
  ragability_check -i experiments-wc1/v0d2.out1.hjson -o experiments-wc1/v0d2.out2.hjson --config experiments-wc1/conf-ollama.hjson --promptfile experiments-wc1/prompt.hjson --logfile experiments-wc1/v0d2.log2.txt --verbose
  ```
- run the evaluation program:
  ```
  ragability_eval -i experiments-wc1/v0d2_small.out2.hjson -o experiments-wc1/v0d2_small.eval.tsv --verbose
  ```
- the generated tsv file can be loaded into a pandas dataframe or some spreadsheet app
## Files/File formats

NOTE: tools to convert between jsonl, json, yaml:

- https://github.com/spatialcurrent/go-simple-serializer
- the `hjson` command (already available from the package installation) can be used to convert between json and hjson
### Query file

- Either a json or hjson file that contains an array of dicts, a jsonl file that contains one json dict per line, or a yaml file that contains an array of dicts
- Each dict must contain the following fields (fields marked as "(output)" are written to the output file):
  - `qid`: the id of the query, should be a short reminder, e.g. `kwoks-are-vertebrates01`
  - `facts`: a string or an array of strings giving the knowledge snippets we want to query; these simulate the RAG document snippets included in a RAG query
  - `query`: the query to ask about the facts
  - `pids`: the prompt ids from the configured prompts to use; if not given, all configured prompts are used
  - `tags`: a comma-separated list of tags which identify the kind, purpose, etc. of the corresponding instance. The presence or absence of a tag can be used in the eval program for breaking down the LLM performances.
  - `response` (output): the response as received from the base LLM if there was no error
  - `error` (output): the error if there was an error
  - `llm` (output): the llm alias / name used
  - `checks`: (optional, but required if checking and evaluation should get performed later) a list of checks where each check is a dict which contains:
    - `query`: if present, a query to ask the checker LLM about the response from the base LLM. If missing, the checking process will directly analyse the base-LLM response (e.g. when the base-LLM query was a yes/no question)
    - `pid`: the prompt id of the configured prompt to use for the checking LLM. If this is missing, a default prompt is used.
    - `func`: the name of a checking function, which will be called with the response of the checking or base LLM and the additional parameters specified with `args`. Each function definition internally knows about the kind of evaluation (binary, multiclass, score). See the `checks.py` module.
    - `args`: optional additional positional parameters for the checking function, e.g. some value or values to compare against. The meaning of the parameters depends on the concrete checking function. Some checking functions do not need any `args`, in which case this field can be omitted.
    - `check_for`: optional value to insert into the checker query and all prompt strings using the variable name `${check_for}`
    - `kwargs`: optional additional keyword arguments to provide to the checker function
    - `response` (output): the response as received from the checking LLM (if no error)
    - `error` (output): the error if an error occurred during checking
    - `result` (output): the result of the checking function, either a response label that will get compared to a target label for evaluation, or a score
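As an illustration, a minimal query file in hjson could look roughly like this (the facts, tags, check query and the check function name are made-up placeholders, not actual entries from the corpus or from `checks.py`):

```hjson
[
  {
    qid: kwoks-are-vertebrates01
    facts:
    [
      "Kwoks are vertebrates."
      "Kwoks live in trees."
    ]
    query: "Are kwoks vertebrates?"
    tags: "animals,easy"
    checks:
    [
      {
        query: "Does the answer state that kwoks are vertebrates? Answer yes or no."
        # placeholder function name: use one of the functions defined in the checks.py module
        func: "check_function"
        args: ["yes"]
      }
    ]
  }
]
```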
### Prompt file

- Either a json file that contains an array of dicts, a jsonl file that contains one json dict per line, or a yaml file that contains an array of dicts
- Each dict must contain the following fields:
  - `pid`: the id of the prompt, should be a short reminder of what it does
  - at least one of `system`, `assistant`, `user`: a string to use for creating the actual final prompt for the LLM. The string can contain the placeholders `${query}` and `${facts}` in order to insert the current query and facts (if facts is an array of strings, these will get concatenated with newlines)
  - `fact`: how to format a single fact if there are several. This supports the variables `${fact}` and `${n}` (1-based fact index)
- The placeholder `${check_for}` can be used in prompts for the checking LLM to insert some value to check for in the response being checked
- The placeholder `${answer}` can be used in prompts for the checking LLM to insert the response from the base LLM to check
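For illustration, a prompt file might look roughly like the following (the prompt ids and prompt texts are invented; only the placeholder variables are taken from the description above):

```hjson
[
  {
    pid: fact-answer
    system: "You are a careful assistant. Answer only based on the facts given."
    user: "Facts:\n${facts}\n\nQuestion: ${query}"
    fact: "Fact ${n}: ${fact}"
  }
  {
    # example of a prompt intended for the checking LLM
    pid: check-yesno
    user: "Does the following answer state that ${check_for}? Answer yes or no.\n\nAnswer: ${answer}"
  }
]
```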
### Query Output file / Checker Input file
- Either a json, hjson or a jsonl file
- each dictionary contains the same fields as the input, plus the fields added (marked with ‘(output)’ above)
- NOTE: if there were transient errors during processing, the output file can be re-used as an input file and by default, only those entries which do not already have a response will get re-processed
### Checker Output file / Eval Input file
- Either a json, hjson or a jsonl file
- each dictionary contains the same fields as the input, plus the fields added (marked with ‘(output)’ above)
- NOTE: if there were transient errors during processing, the output file can be re-used as an input file and by default, only those entries which do not already have a response will get re-processed
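To illustrate, after running `ragability_query` and `ragability_check` an entry might look roughly like this (the responses, result values, llm alias and function name are invented; the exact fields present depend on the configuration):

```hjson
[
  {
    qid: kwoks-are-vertebrates01
    query: "Are kwoks vertebrates?"
    # ... all other input fields are kept unchanged ...
    llm: "openai:gpt-4o"
    response: "Yes, according to the facts provided, kwoks are vertebrates."
    checks:
    [
      {
        # placeholder function name and values
        func: "check_function"
        response: "yes"
        result: "yes"
      }
    ]
  }
]
```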
### Config file

- IMPORTANT: the config file is used by the backend library `llms_wrapper` to configure LLMs and providers; see the documentation of that library for the most recent description of what is supported in the config file for LLMs and providers: https://github.com/OFAI/python-llms-wrapper/wiki
- a json, hjson or yaml file containing a dictionary with the following keys:
  - `llms`: a list of strings or dictionaries describing the LLMs to use. A dictionary can contain the following keys:
    - `llm`: the name/id of the LLM. This should be in the form `provider:llmmodel`, where "provider" must be a known provider or something defined in the `providers` part of the config. The "llmmodel" is the provider-specific way to specify a model.
    - `api_key`: the API key to use for the model
    - `api_key_env`: the name of an environment variable containing the API key
    - `api_url`: the URL to use. In this URL the placeholders `${model}`, `${user}`, `${password}` and `${api_key}` can be used and will get replaced with the actual values
    - `user`: the user name to use for basic authentication
    - `password`: the password to use for basic authentication
    - any specification of the above fields that is present in the corresponding provider config is overridden with the value provided in the llm config
    - `cost_per_prompt_token`: configure or override the cost per prompt token
    - `cost_per_output_token`: configure or override the cost per output token
    - `max_input_tokens`: configure or override the maximum number of input/prompt tokens
    - `max_output_tokens`: configure or override the maximum number of output tokens
  - `providers`: a dict with provider names as the keys and a dict of provider settings as the values, where each dict can contain the following keys:
    - `api_key`: the API key to use for the model
    - `api_key_env`: the name of an environment variable containing the API key
    - `api_url`: the URL to use. In this URL the placeholders `${model}`, `${user}`, `${password}` and `${api_key}` can be used and will get replaced with the actual values
    - `user`: the user name to use for basic authentication
    - `password`: the password to use for basic authentication
  - `prompts`: a list of prompts, in the same format as in a separate prompts file
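A rough sketch of what a config file could look like (the model and provider names, the API URL and the environment variable name are placeholders; see the `llms_wrapper` documentation linked above for the authoritative description):

```hjson
{
  llms:
  [
    # an LLM can be given as a plain "provider:llmmodel" string ...
    "openai:gpt-4o-mini"
    # ... or as a dict with additional settings
    {
      llm: "ollama:llama3"
      # placeholder URL for a locally running server
      api_url: "http://localhost:11434"
      max_output_tokens: 1024
    }
  ]
  providers:
  {
    openai:
    {
      # placeholder environment variable holding the API key
      api_key_env: "MY_OPENAI_API_KEY"
    }
  }
  # prompts can optionally be configured here, in the same format as in a separate prompt file
  prompts:
  [
    {
      pid: fact-answer
      user: "Facts:\n${facts}\n\nQuestion: ${query}"
    }
  ]
}
```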