EmbedSLR 🚀 Deterministic publication screening & bibliometric auditing

EmbedSLR is a compact Python framework that combines embedding-based ranking with an automatic bibliometric audit to accelerate the screening phase of systematic literature reviews (SLRs).

1  Installation

1.1 From GitHub (development version)

pip install git+https://github.com/s-matysik/EmbedSLR.git

Alternatively, clone the repository and work in editable mode:

git clone https://github.com/s-matysik/EmbedSLR.git
cd EmbedSLR
pip install -e .       # editable install

1.2 Requirements & environment

2  Quick start

2.1 Google Colab GUI 🟢 (recommended for first‑time users)

!pip install -q git+https://github.com/s-matysik/EmbedSLR.git
from embedslr.colab_app import run
run()   # launches an interactive widget

2.2 Terminal Wizard ⚡ (offline‑friendly)

  1. Export your Scopus search results to CSV.
  2. Run:
    python -m embedslr.wizard
  3. Provide the CSV path, research query and choose provider/model.
  4. Receive a ZIP archive (ranking.csv, optional topN.csv, biblio_report.txt).

2.3 CLI (one‑shot)

The command‑line interface is a single command (argparse‑based):

embedslr \
  -i scopus.csv \
  -q "How do CSR cues influence consumer behaviour?" \
  -p sbert \
  -o ranking.csv \
  -r biblio_report.txt \
  --json-embs

Flags: -i/--input, -q/--query, -p/--provider (sbert|openai|cohere|jina|nomic), -m/--model (optional), -o/--out, -r/--report, --api_key (optional), --json-embs (save embeddings column).
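
For orientation, the flag set above can be expressed as an argparse parser roughly like the one below. This is a minimal sketch, not the contents of embedslr/cli.py: the function name build_parser, the help strings, the defaults and the choice of which flags are required are all assumptions made for illustration.

```python
import argparse

# Provider names as listed in the flag description above.
PROVIDERS = ["sbert", "openai", "cohere", "jina", "nomic"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(prog="embedslr")
    parser.add_argument("-i", "--input", required=True, help="Scopus CSV export")
    parser.add_argument("-q", "--query", required=True, help="research question")
    parser.add_argument("-p", "--provider", choices=PROVIDERS, default="sbert")
    parser.add_argument("-m", "--model", default=None, help="optional model name")
    parser.add_argument("-o", "--out", default="ranking.csv")
    parser.add_argument("-r", "--report", default=None)
    parser.add_argument("--api_key", default=None)
    parser.add_argument("--json-embs", action="store_true",
                        help="save the embeddings column")
    return parser


# Parse the same arguments as the shell example above.
args = build_parser().parse_args([
    "-i", "scopus.csv",
    "-q", "How do CSR cues influence consumer behaviour?",
    "-p", "sbert",
    "-o", "ranking.csv",
    "-r", "biblio_report.txt",
    "--json-embs",
])
print(args.provider, args.json_embs)
```

Note that argparse exposes --json-embs as args.json_embs and --api_key as args.api_key.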

2.4 Python API (minimal)

from embedslr.io import read_csv, autodetect_columns, combine_title_abstract
from embedslr.embeddings import get_embeddings, list_models
from embedslr.similarity import rank_by_cosine
from embedslr.bibliometrics import full_report

df = read_csv("scopus.csv")
tcol, acol = autodetect_columns(df)
df["combined_text"] = combine_title_abstract(df, tcol, acol)

model = list_models()["sbert"][0]  # e.g. "sentence-transformers/all-MiniLM-L6-v2"
doc_vecs = get_embeddings(df["combined_text"].tolist(), provider="sbert", model=model)
qvec = get_embeddings(["your research question"], provider="sbert", model=model)[0]

ranked = rank_by_cosine(qvec, doc_vecs, df)
full_report(ranked, path="biblio_report.txt", top_n=30)  # if top_n omitted → full dataset
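
To make the ranking step concrete, the sketch below shows cosine-similarity ranking in plain Python. It illustrates the general technique only and is not the code of rank_by_cosine; the helper names cosine and rank_indices and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_indices(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar, plus the scores."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy 2-dimensional "embeddings": document 0 points the same way as the query,
# document 1 is orthogonal to it, document 2 lies in between.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
order, scores = rank_indices([1.0, 0.0], docs)
print(order)
```

Because the same inputs always yield the same scores, the resulting order is reproducible between runs.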

3  Local SBERT (automatic download & 100 % offline afterwards)

Code fragment                   Purpose
_ensure_sbert_installed()       Checks for sentence-transformers; if missing, prompts and installs it
_local_model_dir()              Resolves permanent path embedslr/sbert_models/<model>
_get_or_download_local_sbert()  Downloads the model on first run, sets HF_HUB_OFFLINE=1 for subsequent offline use
_select_model()                 You enter the model only once – no duplicate prompts

Result: from the second launch onward, EmbedSLR runs entirely offline even in closed networks.
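
The download-once pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the package's private helpers: local_model_dir and get_or_download are hypothetical names, the "/" replacement in the folder name is an assumption, and the downloader is passed in as a callable so the example runs without network access.

```python
import os
import tempfile
from pathlib import Path


def local_model_dir(base_dir, model_name):
    """Map a model name to a folder under <base_dir>/sbert_models (layout assumed)."""
    return Path(base_dir) / "sbert_models" / model_name.replace("/", "__")


def get_or_download(model_name, base_dir, download):
    """Call download(model_name, target) only if the local copy is missing."""
    target = local_model_dir(base_dir, model_name)
    if not target.exists():
        download(model_name, target)
    # Ask the Hugging Face libraries not to contact the Hub from now on.
    os.environ["HF_HUB_OFFLINE"] = "1"
    return target


# Stand-in downloader; a real one might load the model with
# sentence-transformers and save it into `target`.
calls = []


def fake_download(model_name, target):
    calls.append(model_name)
    target.mkdir(parents=True)


with tempfile.TemporaryDirectory() as tmp:
    first = get_or_download("all-MiniLM-L6-v2", tmp, fake_download)
    second = get_or_download("all-MiniLM-L6-v2", tmp, fake_download)

print(len(calls))
```

The second call finds the folder created by the first one and skips the download.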

3.1 Embedding providers & models (built‑ins)

sbert  : sentence-transformers (e.g. all-MiniLM-L6-v2)
openai : text-embedding-3-large, text-embedding-ada-002
cohere : embed-english-v3.0, embed-english-light-v3.0,
         embed-multilingual-v3.0, embed-multilingual-light-v3.0
nomic  : nomic-embed-text-v1, nomic-embed-text-v1.5
jina   : jina-embeddings-v3

4  Bibliometric indicators (A … I)

EmbedSLR computes 11 indicators (the base indicators A to I plus the Jaccard variants A′ and B′) that quantify topical coherence and internal citation patterns in a publication corpus. All functions are implemented in embedslr/bibliometrics.py.

Symbol  Description                                             Function
A       Average number of shared references per pair of papers  indicator_a(df)
A′      Mean Jaccard (references) across all pairs              indicator_a_prime(df)
B       Average number of shared keywords per pair              indicator_b(df)
B′      Mean Jaccard (keywords) across all pairs                indicator_b_prime(df)
C       Pairs with ≥1 common reference                          indicator_c(df)
D       Unique references shared by ≥2 papers                   indicator_d(df)
E       Total intersections of references across all pairs      indicator_e(df)
F       Pairs with ≥1 common keyword                            indicator_f(df)
G       Keywords occurring in ≥2 papers                         indicator_g(df)
H       Average number of mutually cited papers per pair        indicator_h(df)
I       Total unique mutually cited papers                      indicator_i(df)
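
As an illustration of the pairwise definitions, the sketch below computes three quantities in the spirit of A, A′ and C on a toy corpus. It follows the descriptions in the table, not the code in embedslr/bibliometrics.py: the function names are hypothetical, and each paper is assumed to be already reduced to a set of reference identifiers.

```python
from itertools import combinations


def mean_shared(sets):
    """Average size of the intersection over all unordered pairs (cf. A and B)."""
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(len(a & b) for a, b in pairs) / len(pairs)


def mean_jaccard(sets):
    """Average |a ∩ b| / |a ∪ b| over all unordered pairs (cf. A′ and B′)."""
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        union = a | b
        total += len(a & b) / len(union) if union else 0.0
    return total / len(pairs)


def pairs_with_overlap(sets):
    """Number of pairs sharing at least one element (cf. C and F)."""
    return sum(1 for a, b in combinations(sets, 2) if a & b)


# Three toy papers; only the first two share references (r2 and r3).
refs = [{"r1", "r2", "r3"}, {"r2", "r3", "r4"}, {"r5"}]
print(mean_shared(refs), mean_jaccard(refs), pairs_with_overlap(refs))
```

With three papers there are three pairs; only one overlaps, sharing 2 of 4 distinct references, so the mean number of shared references is 2/3 and the mean Jaccard is 0.5/3.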

4.1 Full report with one call

from embedslr.bibliometrics import full_report
full_report(ranked_df, path="biblio_report.txt", top_n=30)

4.2 Single indicator example

from embedslr.bibliometrics import indicator_b_prime
print("Mean Jaccard (keywords):", indicator_b_prime(ranked_df))

4.3 Input columns & assumptions

4.4 Outputs & columns

5  Repository structure

embedslr/                 # main package
├── io.py                 # read_csv(), column detection, title+abstract merge
├── embeddings.py         # providers + list_models(), get_embeddings()
├── similarity.py         # cosine ranking (rank_by_cosine)
├── bibliometrics.py      # indicators A … I + full_report()
├── wizard.py             # interactive terminal assistant (offline pipeline), run()
├── cli.py                # argparse CLI: single 'embedslr' command with flags
├── colab_app.py          # Google Colab widget
├── utils.py              # chunk_iterable(), getenv_or_raise(), progress()
├── _version.py           # aux version holder
└── __init__.py           # public API exports
docs/                     # static docs (this page)
examples/                 # sample data and results
pyproject.toml            # metadata, runtime deps, build backend, entry points
setup.cfg                 # flake8 + packaging extras
MANIFEST.in               # non‑Python files included in sdist
LICENSE                   # MIT licence
README.md                 # repo front page – quick overview
.gitignore                # build artefacts, __pycache__, *.ipynb_checkpoints

6  File highlights

7  Citing EmbedSLR

@misc{matysik2025embedslr,
  title  = {EmbedSLR – deterministic embedding‑based screening and bibliometric validation in SLR},
  author = {Matysik, Sebastian and Wiśniewska, Joanna and Frankowski, Paweł K.},
  year   = {2025},
  url    = {https://github.com/s-matysik/EmbedSLR}
}

8  FAQ

❓ I have no API key. Can I still use EmbedSLR?

Yes. Choose the sbert provider. The model is downloaded once from the Hugging Face Hub and then works fully offline.

❓ How do I set Top‑N for the metrics?

In the wizard, enter the desired number when prompted (🔢 Top‑N publications for metrics). In the API, pass top_n to full_report(). If you omit top_n, the report is computed on the full dataset.

❓ Which environment variables are used for cloud providers?

OPENAI_API_KEY, COHERE_API_KEY, JINA_API_KEY, NOMIC_API_KEY. You may also pass --api_key in the CLI.
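
A key lookup along these lines could be sketched as below. The variable names come from the answer above; the function resolve_api_key, the rule that an explicit key wins over the environment, and the error message are assumptions for illustration rather than EmbedSLR's actual behaviour.

```python
import os

# Environment variable per cloud provider, as listed above.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "cohere": "COHERE_API_KEY",
    "jina": "JINA_API_KEY",
    "nomic": "NOMIC_API_KEY",
}


def resolve_api_key(provider, explicit_key=None, env=None):
    """Return the key to use, or None for the local sbert provider.

    Assumed precedence: explicit key (e.g. from --api_key) first,
    then the provider's environment variable.
    """
    env = os.environ if env is None else env
    if provider == "sbert":
        return None  # local model, no key needed
    if explicit_key:
        return explicit_key
    var_name = ENV_VARS[provider]
    key = env.get(var_name)
    if not key:
        raise RuntimeError(f"Set {var_name} or pass --api_key")
    return key


# Demo with a fake environment so no real credentials are involved.
fake_env = {"OPENAI_API_KEY": "sk-test"}
print(resolve_api_key("openai", env=fake_env))
```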
