EmbedSLR is a compact Python framework that combines embedding‑based ranking with an automatic bibliometric audit to accelerate the screening phase in systematic literature reviews (SLR).
Install from GitHub:

```bash
pip install git+https://github.com/s-matysik/EmbedSLR.git
```
Alternatively, clone the repository and work in editable mode:
```bash
git clone https://github.com/s-matysik/EmbedSLR.git
cd EmbedSLR
pip install -e .  # editable install
```
EmbedSLR requires Python >= 3.9. Runtime dependencies: pandas, numpy, scikit-learn, sentence-transformers, openai, cohere, requests, tqdm, ipywidgets (plus google-colab for the Colab widget). Commercial providers read their keys from the environment variables `OPENAI_API_KEY`, `COHERE_API_KEY`, `JINA_API_KEY`, `NOMIC_API_KEY`.

In Google Colab, install with:

```bash
!pip install -q git+https://github.com/s-matysik/EmbedSLR.git
```
```python
from embedslr.colab_app import run
run()  # launches an interactive widget
```
The interactive wizard guides you from a raw CSV to the final outputs (`ranking.csv`, optional `topN.csv`, `biblio_report.txt`):

```bash
python -m embedslr.wizard
```

The command‑line interface is a single command (argparse‑based):
```bash
embedslr \
  -i scopus.csv \
  -q "How do CSR cues influence consumer behaviour?" \
  -p sbert \
  -o ranking.csv \
  -r biblio_report.txt \
  --json-embs
```
Flags: `-i/--input`, `-q/--query`, `-p/--provider` (`sbert|openai|cohere|jina|nomic`), `-m/--model` (optional), `-o/--out`, `-r/--report`, `--api_key` (optional), `--json-embs` (save embeddings column).
```python
from embedslr.io import read_csv, autodetect_columns, combine_title_abstract
from embedslr.embeddings import get_embeddings, list_models
from embedslr.similarity import rank_by_cosine
from embedslr.bibliometrics import full_report

df = read_csv("scopus.csv")
tcol, acol = autodetect_columns(df)
df["combined_text"] = combine_title_abstract(df, tcol, acol)

model = list_models()["sbert"][0]  # e.g. "sentence-transformers/all-MiniLM-L6-v2"
doc_vecs = get_embeddings(df["combined_text"].tolist(), provider="sbert", model=model)
qvec = get_embeddings(["your research question"], provider="sbert", model=model)[0]

ranked = rank_by_cosine(qvec, doc_vecs, df)
full_report(ranked, path="biblio_report.txt", top_n=30)  # if top_n omitted → full dataset
```
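For intuition, the ranking step can be sketched with plain numpy. This is an illustrative re-implementation under the assumption that `rank_by_cosine` sorts ascending by cosine distance and writes it to the `distance_cosine` column described below; it is not the library's code:

```python
import numpy as np
import pandas as pd

def rank_by_cosine_sketch(qvec, doc_vecs, df):
    """Illustrative sketch: sort df by cosine distance to the query vector."""
    q = np.asarray(qvec, dtype=float)
    docs = np.asarray(doc_vecs, dtype=float)
    # cosine similarity = (a . b) / (|a| * |b|); distance = 1 - similarity
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    out = df.copy()
    out["distance_cosine"] = 1.0 - sims
    return out.sort_values("distance_cosine").reset_index(drop=True)
```

Smaller distances mean the paper's title+abstract text is semantically closer to the research question, so relevant papers float to the top of the ranking.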
| Code fragment | Purpose |
|---|---|
| `_ensure_sbert_installed()` | Checks for sentence‑transformers; if missing, prompts and installs it |
| `_local_model_dir()` | Resolves the permanent path `embedslr/sbert_models/<model>` |
| `_get_or_download_local_sbert()` | Downloads the model on first run, then sets `HF_HUB_OFFLINE=1` for subsequent offline use |
| `_select_model()` | Prompts for the model only once – no duplicate prompts |
Result: from the second launch onward, EmbedSLR runs entirely offline even in closed networks.
- `sbert` : sentence-transformers (e.g. `all-MiniLM-L6-v2`)
- `openai` : `text-embedding-3-large`, `text-embedding-ada-002`
- `cohere` : `embed-english-v3.0`, `embed-english-light-v3.0`, `embed-multilingual-v3.0`, `embed-multilingual-light-v3.0`
- `nomic` : `nomic-embed-text-v1`, `nomic-embed-text-v1.5`
- `jina` : `jina-embeddings-v3`
EmbedSLR computes 10 (+1) indicators that quantify topical coherence and internal citation patterns in a publication corpus. All functions are implemented in `embedslr/bibliometrics.py`.
| Symbol | Description | Function |
|---|---|---|
| A | Average number of shared references per pair of papers | indicator_a(df) |
| A′ | Mean Jaccard (references) across all pairs | indicator_a_prime(df) |
| B | Average number of shared keywords per pair | indicator_b(df) |
| B′ | Mean Jaccard (keywords) across all pairs | indicator_b_prime(df) |
| C | Pairs with ≥1 common reference | indicator_c(df) |
| D | Unique references shared by ≥2 papers | indicator_d(df) |
| E | Total intersections of references across all pairs | indicator_e(df) |
| F | Pairs with ≥1 common keyword | indicator_f(df) |
| G | Keywords occurring in ≥2 papers | indicator_g(df) |
| H | Average number of mutually cited papers per pair | indicator_h(df) |
| I | Total unique mutually cited papers | indicator_i(df) |
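As a concrete illustration of what the pairwise Jaccard indicators (A′, B′) measure, the computation can be sketched as below. The formula is an assumption from the table's description, not the library's code, and it takes keywords (or references) already split into Python sets, one per paper:

```python
from itertools import combinations

def mean_pairwise_jaccard(sets):
    """B'-style indicator: average Jaccard similarity over all pairs of papers."""
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

For three papers with keyword sets `{csr, trust}`, `{csr, loyalty}`, `{pricing}`, only the first pair overlaps (Jaccard 1/3), so the indicator is (1/3 + 0 + 0) / 3 = 1/9: a quick sense of how topically coherent the corpus is.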
Generate the full report:

```python
from embedslr.bibliometrics import full_report
full_report(ranked_df, path="biblio_report.txt", top_n=30)
```

Or compute a single indicator:

```python
from embedslr.bibliometrics import indicator_b_prime
print("Mean Jaccard (keywords):", indicator_b_prime(ranked_df))
```
Input columns:

- **Title** and **Abstract** (auto‑detected; several common Scopus header variants supported).
- **Author Keywords** (optional; created empty if missing).
- **Parsed_References** — set/list of references. If missing but **References** exists, the wizard derives it automatically. If neither is present, reference‑based indicators will be near zero.

Output files:

- `ranking.csv` — sorted by `distance_cosine` (smaller = closer to query); includes `combined_text` and, if requested, `combined_embeddings` (JSON).
- `biblio_report.txt` — human‑readable summary of indicators A…I (optionally limited to Top‑N).
- `topN.csv` — optional top‑N slice of the ranking.

Repository layout:

```
embedslr/             # main package
├── io.py             # read_csv(), column detection, title+abstract merge
├── embeddings.py     # providers + list_models(), get_embeddings()
├── similarity.py     # cosine ranking (rank_by_cosine)
├── bibliometrics.py  # indicators A … I + full_report()
├── wizard.py         # interactive terminal assistant (offline pipeline), run()
├── cli.py            # argparse CLI: single 'embedslr' command with flags
├── colab_app.py      # Google Colab widget
├── utils.py          # chunk_iterable(), getenv_or_raise(), progress()
├── _version.py       # aux version holder
└── __init__.py       # public API exports
docs/                 # static docs (this page)
examples/             # sample data and results
pyproject.toml        # metadata, runtime deps, build backend, entry points
setup.cfg             # flake8 + packaging extras
MANIFEST.in           # non‑Python files included in sdist
LICENSE               # MIT licence
README.md             # repo front page – quick overview
.gitignore            # build artefacts, __pycache__, *.ipynb_checkpoints
```
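The automatic derivation of `Parsed_References` from a raw `References` column might look like the sketch below; the semicolon separator is an assumption based on the Scopus export format, and the function name is hypothetical:

```python
import pandas as pd

def derive_parsed_references(df, src="References", dst="Parsed_References"):
    """Split a raw reference string into a set of trimmed, non-empty entries."""
    def parse(cell):
        if not isinstance(cell, str):
            return set()          # missing / NaN cell -> empty reference set
        return {part.strip() for part in cell.split(";") if part.strip()}
    out = df.copy()
    out[dst] = out[src].map(parse)
    return out
```

Storing references as sets is what makes the pairwise intersection indicators (A, C–E) cheap to compute downstream.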
Module responsibilities:

- `io.py` — `read_csv()`, column auto‑detection (`autodetect_columns()`), `combine_title_abstract()`
- `embeddings.py` — `list_models()`, `get_embeddings()`
- `similarity.py` — `rank_by_cosine()` (ascending by `distance_cosine`)
- `bibliometrics.py` — `indicators()`, `full_report()` (`top_n` is optional; default = full dataset)
- `wizard.py` — offline SBERT pipeline (`HF_HUB_OFFLINE=1` after first download)
- `colab_app.py` — `run()` builds the ipywidgets GUI
- `utils.py` — `chunk_iterable`, `getenv_or_raise`, `progress`

To cite EmbedSLR:

```bibtex
@misc{matysik2025embedslr,
  title  = {EmbedSLR – deterministic embedding‑based screening and bibliometric validation in SLR},
  author = {Matysik, Sebastian and Wiśniewska, Joanna and Frankowski, Paweł K.},
  year   = {2025},
  url    = {https://github.com/s-matysik/EmbedSLR}
}
```
**Can EmbedSLR run offline?** Yes. Choose the `sbert` provider. The model is downloaded once from HF Hub and then works fully offline.
**How do I limit the metrics to the top‑N publications?** In the wizard, enter the desired number when prompted (🔢 Top‑N publications for metrics). In the API, pass `top_n` to `full_report()`. If you omit `top_n`, the report is computed on the full dataset.
**How do I provide API keys for commercial providers?** Set the corresponding environment variable: `OPENAI_API_KEY`, `COHERE_API_KEY`, `JINA_API_KEY`, or `NOMIC_API_KEY`. You may also pass `--api_key` in the CLI.
This page was generated automatically.