EmbedSLR is a compact Python framework that combines embedding‑based ranking with an automatic bibliometric audit to accelerate the screening phase in systematic literature reviews (SLR).
Install from GitHub:

```bash
pip install git+https://github.com/s-matysik/EmbedSLR.git
```
Alternatively, clone the repository and work in editable mode:
```bash
git clone https://github.com/s-matysik/EmbedSLR.git
cd EmbedSLR
pip install -e .  # editable install
```
EmbedSLR requires Python >= 3.9. Runtime dependencies: pandas, numpy, scikit-learn, sentence-transformers, openai, cohere, requests, tqdm, ipywidgets (plus google-colab for the Colab widget). Commercial providers read their keys from the environment variables `OPENAI_API_KEY`, `COHERE_API_KEY`, `JINA_API_KEY`, `NOMIC_API_KEY`.

In Google Colab, install with:

```bash
!pip install -q git+https://github.com/s-matysik/EmbedSLR.git
```
```python
from embedslr.colab_app import run
run()  # launches an interactive widget
```
The interactive wizard guides you from a raw CSV to the final outputs (`ranking.csv`, optional `topN.csv`, `biblio_report.txt`):

```bash
python -m embedslr.wizard
```

The command‑line interface is a single command (argparse‑based):
```bash
embedslr \
  -i scopus.csv \
  -q "How do CSR cues influence consumer behaviour?" \
  -p sbert \
  -o ranking.csv \
  -r biblio_report.txt \
  --json-embs
```
Flags: `-i/--input`, `-q/--query`, `-p/--provider` (`sbert|openai|cohere|jina|nomic`), `-m/--model` (optional), `-o/--out`, `-r/--report`, `--api_key` (optional), `--json-embs` (save embeddings column).
```python
from embedslr.io import read_csv, autodetect_columns, combine_title_abstract
from embedslr.embeddings import get_embeddings, list_models
from embedslr.similarity import rank_by_cosine
from embedslr.bibliometrics import full_report

df = read_csv("scopus.csv")
tcol, acol = autodetect_columns(df)
df["combined_text"] = combine_title_abstract(df, tcol, acol)

model = list_models()["sbert"][0]  # e.g. "sentence-transformers/all-MiniLM-L6-v2"
doc_vecs = get_embeddings(df["combined_text"].tolist(), provider="sbert", model=model)
qvec = get_embeddings(["your research question"], provider="sbert", model=model)[0]

ranked = rank_by_cosine(qvec, doc_vecs, df)
full_report(ranked, path="biblio_report.txt", top_n=30)  # if top_n omitted → full dataset
```
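For intuition, the ranking step can be sketched with plain numpy. This is an illustrative re-implementation under the assumption that `rank_by_cosine` sorts ascending by cosine distance and writes it to the `distance_cosine` column described below; it is not the library's code:

```python
import numpy as np
import pandas as pd

def rank_by_cosine_sketch(qvec, doc_vecs, df):
    """Illustrative sketch: sort df by cosine distance to the query vector."""
    q = np.asarray(qvec, dtype=float)
    docs = np.asarray(doc_vecs, dtype=float)
    # cosine similarity = (a . b) / (|a| * |b|); distance = 1 - similarity
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    out = df.copy()
    out["distance_cosine"] = 1.0 - sims
    return out.sort_values("distance_cosine").reset_index(drop=True)
```

Smaller distances mean the paper's title+abstract text is semantically closer to the research question, so relevant papers float to the top of the ranking.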
| Code fragment | Purpose |
|---|---|
| `_ensure_sbert_installed()` | Checks for sentence‑transformers; if missing, prompts and installs it |
| `_local_model_dir()` | Resolves the permanent path `embedslr/sbert_models/<model>` |
| `_get_or_download_local_sbert()` | Downloads the model on first run, then sets `HF_HUB_OFFLINE=1` for subsequent offline use |
| `_select_model()` | Prompts for the model only once – no duplicate prompts |
Result: from the second launch onward, EmbedSLR runs entirely offline even in closed networks.
- `sbert` : sentence-transformers (e.g. `all-MiniLM-L6-v2`)
- `openai` : `text-embedding-3-large`, `text-embedding-ada-002`
- `cohere` : `embed-english-v3.0`, `embed-english-light-v3.0`, `embed-multilingual-v3.0`, `embed-multilingual-light-v3.0`
- `nomic` : `nomic-embed-text-v1`, `nomic-embed-text-v1.5`
- `jina` : `jina-embeddings-v3`
EmbedSLR computes 10 (+1) indicators that quantify topical coherence and internal citation patterns in a publication corpus. All functions are implemented in `embedslr/bibliometrics.py`.
| Symbol | Description | Function |
|---|---|---|
| A | Average number of shared references per pair of papers | indicator_a(df) |
| A′ | Mean Jaccard (references) across all pairs | indicator_a_prime(df) |
| B | Average number of shared keywords per pair | indicator_b(df) |
| B′ | Mean Jaccard (keywords) across all pairs | indicator_b_prime(df) |
| C | Pairs with ≥1 common reference | indicator_c(df) |
| D | Unique references shared by ≥2 papers | indicator_d(df) |
| E | Total intersections of references across all pairs | indicator_e(df) |
| F | Pairs with ≥1 common keyword | indicator_f(df) |
| G | Keywords occurring in ≥2 papers | indicator_g(df) |
| H | Average number of mutually cited papers per pair | indicator_h(df) |
| I | Total unique mutually cited papers | indicator_i(df) |
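As a concrete illustration of what the pairwise Jaccard indicators (A′, B′) measure, the computation can be sketched as below. The formula is an assumption from the table's description, not the library's code, and it takes keywords (or references) already split into Python sets, one per paper:

```python
from itertools import combinations

def mean_pairwise_jaccard(sets):
    """B'-style indicator: average Jaccard similarity over all pairs of papers."""
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

For three papers with keyword sets `{csr, trust}`, `{csr, loyalty}`, `{pricing}`, only the first pair overlaps (Jaccard 1/3), so the indicator is (1/3 + 0 + 0) / 3 = 1/9: a quick sense of how topically coherent the corpus is.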
Generate the full report:

```python
from embedslr.bibliometrics import full_report
full_report(ranked_df, path="biblio_report.txt", top_n=30)
```

Or compute a single indicator:

```python
from embedslr.bibliometrics import indicator_b_prime
print("Mean Jaccard (keywords):", indicator_b_prime(ranked_df))
```
Input columns:

- **Title** and **Abstract** (auto‑detected; several common Scopus header variants supported).
- **Author Keywords** (optional; created empty if missing).
- **Parsed_References** — set/list of references. If missing but **References** exists, the wizard derives it automatically. If neither is present, reference‑based indicators will be near zero.

Output files:

- `ranking.csv` — sorted by `distance_cosine` (smaller = closer to query); includes `combined_text` and, if requested, `combined_embeddings` (JSON).
- `biblio_report.txt` — human‑readable summary of indicators A…I (optionally limited to Top‑N).
- `topN.csv` — optional top‑N slice of the ranking.

Repository layout:

```
embedslr/             # main package
├── io.py             # read_csv(), column detection, title+abstract merge
├── embeddings.py     # providers + list_models(), get_embeddings()
├── similarity.py     # cosine ranking (rank_by_cosine)
├── bibliometrics.py  # indicators A … I + full_report()
├── wizard.py         # interactive terminal assistant (offline pipeline), run()
├── cli.py            # argparse CLI: single 'embedslr' command with flags
├── colab_app.py      # Google Colab widget
├── utils.py          # chunk_iterable(), getenv_or_raise(), progress()
├── _version.py       # aux version holder
└── __init__.py       # public API exports
docs/                 # static docs (this page)
examples/             # sample data and results
pyproject.toml        # metadata, runtime deps, build backend, entry points
setup.cfg             # flake8 + packaging extras
MANIFEST.in           # non‑Python files included in sdist
LICENSE               # MIT licence
README.md             # repo front page – quick overview
.gitignore            # build artefacts, __pycache__, *.ipynb_checkpoints
```
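The automatic derivation of `Parsed_References` from a raw `References` column might look like the sketch below; the semicolon separator is an assumption based on the Scopus export format, and the function name is hypothetical:

```python
import pandas as pd

def derive_parsed_references(df, src="References", dst="Parsed_References"):
    """Split a raw reference string into a set of trimmed, non-empty entries."""
    def parse(cell):
        if not isinstance(cell, str):
            return set()          # missing / NaN cell -> empty reference set
        return {part.strip() for part in cell.split(";") if part.strip()}
    out = df.copy()
    out[dst] = out[src].map(parse)
    return out
```

Storing references as sets is what makes the pairwise intersection indicators (A, C–E) cheap to compute downstream.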
Module responsibilities:

- `io.py` — `read_csv()`, column auto‑detection (`autodetect_columns()`), `combine_title_abstract()`
- `embeddings.py` — `list_models()`, `get_embeddings()`
- `similarity.py` — `rank_by_cosine()` (ascending by `distance_cosine`)
- `bibliometrics.py` — `indicators()`, `full_report()` (`top_n` is optional; default = full dataset)
- `wizard.py` — offline SBERT pipeline (`HF_HUB_OFFLINE=1` after first download)
- `colab_app.py` — `run()` builds the ipywidgets GUI
- `utils.py` — `chunk_iterable`, `getenv_or_raise`, `progress`

To cite EmbedSLR:

```bibtex
@misc{matysik2025embedslr,
  title  = {EmbedSLR – deterministic embedding‑based screening and bibliometric validation in SLR},
  author = {Matysik, Sebastian and Wiśniewska, Joanna and Frankowski, Paweł K.},
  year   = {2025},
  url    = {https://github.com/s-matysik/EmbedSLR}
}
```
**Can EmbedSLR run offline?** Yes. Choose the `sbert` provider. The model is downloaded once from HF Hub and then works fully offline.
**How do I limit the metrics to the top‑N publications?** In the wizard, enter the desired number when prompted (🔢 Top‑N publications for metrics). In the API, pass `top_n` to `full_report()`. If you omit `top_n`, the report is computed on the full dataset.
**How do I provide API keys for commercial providers?** Set the corresponding environment variable: `OPENAI_API_KEY`, `COHERE_API_KEY`, `JINA_API_KEY`, or `NOMIC_API_KEY`. You may also pass `--api_key` in the CLI.
This page was generated automatically.