Augmenting Local LLMs with NFC Tags as Contextual Pointers
NFC-RAG is an embedded system that uses standard NFC tags to add Retrieval-Augmented Generation (RAG) to a local Large Language Model. By storing compact identifiers and metadata on low-memory NFC tags (e.g. 888 bytes on NTAG216), each tag acts as a contextual pointer into an external knowledge base. The result is personalized, privacy-preserving AI without retraining the model.
The system is designed for edge deployment on consumer hardware—for example an NVIDIA RTX 3090 running Ollama or llama.cpp—and fits scenarios that need offline operation and strict data locality.
Classic RAG retrieves relevant documents from large vector stores using embeddings, which needs substantial storage and compute—often not feasible in very constrained environments. NFC tags offer a physical API: a single tap provides precise context hints that drive retrieval from a local knowledge store.
Typical use case: industrial maintenance. A technician scans an NFC tag on a machine; the system sends the tag payload to a local LLM, which retrieves only the relevant manual sections and answers troubleshooting questions—no cloud required.
Example: A tag on a robotic arm stores {"id": "robot-arm-XYZ", "role": "maintenance"}. Tapping it with a smartphone sends this to a local Qwen2.5-7B model, which retrieves arm-specific diagrams and produces step-by-step fixes. This scales to thousands of assets with minimal per-tag data.
NFC tag capacity is limited:
Payloads must stay small: UUIDs (16 bytes), flags (1–4 bytes), short strings (e.g. up to ~200 chars). A single 384-dimensional embedding at 1 byte per dimension would already use 384 bytes—at most two such vectors fit on an NTAG216, so on-tag vector search is not practical. Instead, tags act as routing keys to a backend index.
Heavy work is offloaded: tags hold <1 KB, while the RAG engine (e.g. FAISS or LanceDB) runs on-device with embeddings built offline. A base LLM like Gemma-2-9B (Q4_K_M, ~6 GB VRAM) can process retrieved chunks in under 2 seconds on an RTX 3090. Limitations include tags that may be locked read-only (writable NTAG21x tags avoid this) and a short read range (~5 cm), both of which suit asset-tagging use cases well.
Constraint example: For 10 documents with 512-dimensional embeddings (about 5 KB even at 1 byte per dimension), an NTAG216 cannot store them—but a 32-byte ID can index a 10 GB local database partitioned by asset.
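These size constraints are easy to sanity-check in a few lines of Python; the payload mirrors the earlier robot-arm example, and NDEF framing overhead is ignored for simplicity:

```python
import json

NTAG216_USER_BYTES = 888  # NTAG216 user memory

# A routing-key payload like the earlier robot-arm example:
payload = json.dumps(
    {"id": "robot-arm-XYZ", "role": "maintenance"},
    separators=(",", ":"),
).encode("utf-8")

ONE_EMBEDDING = 384  # bytes: one 384-dim vector at 1 byte per dimension

assert len(payload) < NTAG216_USER_BYTES          # routing key fits easily
assert NTAG216_USER_BYTES // ONE_EMBEDDING == 2   # barely two vectors fit
```

The routing key uses a few dozen bytes, while even heavily quantized embeddings exhaust the tag after two vectors, which is why the tag stores pointers rather than vectors.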
Three layers, all runnable locally:
- Tag layer: the reader (a phone or dedicated device) parses the tag payload (e.g. {"doc_set": "manual-robot-XYZ", "lang": "en", "style": "step-by-step"}) and forwards it to the middleware via HTTP or gRPC.
- Middleware: maps doc_set to a DB partition, embeds the user query, retrieves the top-3 chunks (~1 KB total), and injects them into the LLM prompt.
- LLM layer: the local model receives a templated prompt:

```
<system>Role: {tag.role}. Respond in {tag.lang}, style: {tag.style}.</system>
<context>{retrieved_chunks}</context>
<user>{query}</user>
```
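The templated prompt can be assembled with a small helper; the function name build_prompt and the default style value "concise" are illustrative, not part of the original design:

```python
def build_prompt(tag: dict, retrieved_chunks: list[str], query: str) -> str:
    """Fill the three-part template: system instruction, context, user query."""
    system = (f"<system>Role: {tag.get('role', 'expert')}. "
              f"Respond in {tag.get('lang', 'en')}, "
              f"style: {tag.get('style', 'concise')}.</system>")
    context = f"<context>{' '.join(retrieved_chunks)}</context>"
    user = f"<user>{query}</user>"
    return "\n".join([system, context, user])

tag = {"doc_set": "manual-robot-XYZ", "lang": "en", "style": "step-by-step"}
print(build_prompt(tag, ["Section 4.2: vibration checks."], "fix vibration"))
```

Keeping the template in one function means the middleware can change prompt shape without touching retrieval code.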
End-to-end latency is under 5 seconds on mid-range hardware.
A minimal implementation, assuming a running Ollama server with a quantized model pulled (e.g. ollama run qwen2.5-coder:7b-q4_K_M; the example below calls qwen2.5:7b):

```python
import faiss
import ollama
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

class TagPayload(BaseModel):
    doc_set: str
    role: str = "expert"
    lang: str = "en"

class NFC_RAG:
    def __init__(self):
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.index = faiss.read_index('knowledge.faiss')  # prebuilt per doc_set
        self.docstore = {}  # {id: text}, loaded from JSON/DB

    def process(self, tag: TagPayload, query: str) -> str:
        query_emb = self.encoder.encode([query])        # (1, 384) float32
        scores, idxs = self.index.search(query_emb, 3)  # top-3 neighbors
        chunks = [self.docstore[i] for i in idxs[0]]
        prompt = (f"Role: {tag.role}. Lang: {tag.lang}.\n"
                  f"Context: {' '.join(chunks)}\nQ: {query}")
        resp = ollama.generate(model='qwen2.5:7b', prompt=prompt)
        return resp['response']

# Usage:
# rag = NFC_RAG()
# print(rag.process(TagPayload(doc_set="robot-XYZ"), "fix vibration"))
```
TagPayload (Pydantic model) — Represents the data read from the NFC tag. doc_set is required and identifies which knowledge partition to use (e.g. "robot-XYZ" for a specific machine manual). role and lang default to "expert" and "en"; they are injected into the system prompt so the LLM answers in the right tone and language.
NFC_RAG.__init__ — Loads the embedding model (all-MiniLM-L6-v2, 384 dimensions, runs on CPU), reads the prebuilt FAISS index from disk (knowledge.faiss), and prepares an in-memory docstore (id → text) that you populate from your JSON or database. In production you would load docstore from the same partition as the index (e.g. keyed by doc_set).
NFC_RAG.process(tag, query) — Runs the full RAG pipeline:
- encoder.encode([query]) turns the user question (e.g. “fix vibration”) into a 384-dimensional vector.
- index.search(query_emb, 3) finds the 3 nearest document chunks in the FAISS index; idxs[0] holds their indices.
- chunks = [self.docstore[i] for i in idxs[0]] retrieves the actual text for those indices from the docstore.
- The prompt string combines role and lang with the retrieved context and the user query, so the LLM sees a clear system instruction plus the relevant manual excerpts.
- ollama.generate(...) sends the prompt to the local Ollama server (e.g. Qwen2.5 7B) and returns the model’s reply (e.g. step-by-step troubleshooting).

Prebuild the index by embedding all documents for each doc_set, saving the FAISS index and the id→text mapping; at runtime, load the partition that matches the tag’s doc_set.
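That prebuild-and-partition flow can be sketched end to end. The version below is deliberately dependency-light: a brute-force NumPy inner-product search stands in for faiss.IndexFlatIP, and a toy hashing embed() stands in for all-MiniLM-L6-v2, so the flow (chunk → embed → save partition → search) runs anywhere; swap both stand-ins for the real libraries in production:

```python
import json
import numpy as np

DIM = 384  # matches all-MiniLM-L6-v2's output width

def embed(text: str) -> np.ndarray:
    # Toy stand-in for SentenceTransformer('all-MiniLM-L6-v2').encode:
    # hash each word into a bucket, then L2-normalize.
    vec = np.zeros(DIM, dtype=np.float32)
    for word in text.lower().split():
        vec[hash(word) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def build_partition(doc_set: str, chunks: list[str]) -> None:
    # Offline prebuild: one "index" (embedding matrix) + docstore per doc_set.
    matrix = np.stack([embed(c) for c in chunks])
    np.save(f"knowledge_{doc_set}.npy", matrix)
    with open(f"docstore_{doc_set}.json", "w") as f:
        json.dump(dict(enumerate(chunks)), f)

def search(doc_set: str, query: str, k: int = 3) -> list[str]:
    # Runtime: load the partition named by the tag, score by inner product
    # (what faiss.IndexFlatIP does), return the top-k chunk texts.
    matrix = np.load(f"knowledge_{doc_set}.npy")
    with open(f"docstore_{doc_set}.json") as f:
        docstore = json.load(f)
    scores = matrix @ embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [docstore[str(i)] for i in top]
```

Usage: call build_partition("pump-model-A", chunks) once offline per doc_set, then search("pump-model-A", "Why is pressure low?") at tap time.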
Write payloads via NFC Tools app or nfcpy. Example JSON:
{"doc_set": "robot-XYZ", "role": "maintenance", "lang": "it", "v": 1}
For maximum density use TLV (Type-Length-Value) encoding: e.g. type=0x01, length=16, value=a 16-byte UUID.
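A minimal sketch of that TLV scheme (the record layout, one type byte, one length byte, then the value, follows the description above; the helper names are illustrative):

```python
import struct
import uuid

def tlv_encode(type_id: int, value: bytes) -> bytes:
    # One type byte, one length byte (so value must be <= 255 bytes), value.
    return struct.pack("BB", type_id, len(value)) + value

def tlv_decode(buf: bytes) -> list[tuple[int, bytes]]:
    records, i = [], 0
    while i < len(buf):
        type_id, length = buf[i], buf[i + 1]
        records.append((type_id, buf[i + 2 : i + 2 + length]))
        i += 2 + length
    return records

asset_id = uuid.uuid4().bytes       # 16-byte binary UUID
blob = tlv_encode(0x01, asset_id)   # type=0x01, length=16, value=UUID
assert len(blob) == 18              # 2 bytes overhead vs ~38 for a text UUID
assert tlv_decode(blob) == [(0x01, asset_id)]
```

A binary UUID in TLV takes 18 bytes, against 40+ for the same identifier as a JSON string, which matters on the smallest tags.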
Follow these steps to deploy NFC-RAG from scratch.
1. Organize documents per asset: map each doc_set (e.g. doc_set = "pump-model-A") to a list of document paths or chunks.
2. Embed the chunks with a sentence-transformer model (e.g. all-MiniLM-L6-v2). Encode all chunks for a given doc_set, add them to a FAISS index (faiss.IndexFlatIP, or IndexHNSWFlat for larger sets), and save it with faiss.write_index(index, "knowledge_<doc_set>.faiss"). Build a docstore (id → text) and persist it (e.g. JSON or SQLite) keyed by doc_set.
3. Repeat for every doc_set you plan to use. You can have one index per asset or one partitioned index; the middleware must know which file/partition to load for each doc_set.
4. Write the tags with the NFC Tools app on a phone or nfcpy on a PC with a reader. Each payload must include doc_set and optionally role, lang, style. Example: {"doc_set": "robot-XYZ", "role": "maintenance", "lang": "it", "v": 1}. Ensure the string fits within the tag’s user memory (and leave room for NDEF overhead if you use it).
5. Install the middleware dependencies: sentence-transformers, faiss-cpu (or faiss-gpu), ollama (or your LLM client), pydantic, and a web framework (e.g. FastAPI).
6. Build an API endpoint that accepts the tag payload and the user query, loads the index for the tag’s doc_set (or uses a single global index with a partitioned docstore), calls the NFC_RAG.process(tag, query) logic, and returns the LLM response.
7. Run the service (e.g. uvicorn rag:app --host 0.0.0.0 --port 8000). Ensure Ollama (or llama.cpp) is running and the chosen model is pulled (e.g. ollama run qwen2.5:7b).
8. Test end to end: scan a tag, ask a question, and verify that the right doc_set was used and the top retrieved chunks are sensible. Optionally measure latency (scan → parse → retrieve → generate) to ensure it stays under your target (e.g. <5 s).

Attach an NTAG216 to a pump, conveyor, or robotic arm. The tag stores e.g. {"doc_set": "pump-model-A", "role": "maintenance", "lang": "en", "style": "step-by-step"}. When a technician taps the tag with a phone and asks “Why is pressure low?” or “How do I replace the seal?”, the app sends the tag payload and the question to the local NFC-RAG API. The middleware loads the FAISS index for pump-model-A, embeds the query, retrieves the top-3 chunks from the pump manual (e.g. troubleshooting section, parts list), and injects them into the LLM prompt. The model (e.g. Qwen2.5-7B) answers with concrete steps: “Check valve #3; per manual p.42 this is a common failure. If the seal is worn, order part XYZ and follow section 5.2 for replacement.” No cloud and no generic chatbot—only that machine’s documentation in context.
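The tap → HTTP → answer round trip can be mocked with the standard library before wiring in the real retrieval and a web framework; the answer() stub and the JSON body shape below are assumptions, not a fixed protocol:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

def answer(tag: dict, query: str) -> str:
    # Stub: the real middleware would load the doc_set partition, retrieve
    # chunks, and call the local LLM here.
    return f"[{tag.get('doc_set')}] answer to: {query}"

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        reply = answer(body["tag"], body["query"]).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):  # keep request logging quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

req = Request(
    f"http://127.0.0.1:{server.server_port}",
    data=json.dumps({"tag": {"doc_set": "pump-model-A"},
                     "query": "Why is pressure low?"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
reply = urlopen(req).read().decode("utf-8")
print(reply)  # [pump-model-A] answer to: Why is pressure low?
server.shutdown()
```

The phone app only ever POSTs {tag, query} and reads back text, so the same client works unchanged when the stub is replaced by the FastAPI service from the deployment steps.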
Scale: A factory can deploy hundreds of tags (one per asset or per asset type). Each tag points to a doc set of 20–50 chunks; the total knowledge base can be hundreds of MB, all on a single edge server with a single GPU.
Stick a tag on a textbook chapter or a printed exercise set. Payload example: {"doc_set": "calculus-ch3", "role": "tutor", "lang": "en", "style": "educational"}. A student taps the tag and asks “Explain integrals” or “Work through example 3.2”. The system retrieves the relevant theorems, definitions, and worked examples from the chapter’s index, and the LLM produces an explanation tailored to that material—avoiding drift into other chapters or generic web content. The same setup works for language learning (e.g. doc_set: "spanish-lesson-5"), safety training (e.g. doc_set: "forklift-safety"), or certification prep.
Put a tag on the fridge or a recipe binder. Example payload: {"doc_set": "recipes-vegan", "role": "assistant", "lang": "en"}. The user asks “Suggest dinner with what I have” or “Something quick with chickpeas”. The RAG layer can pull from a small, curated recipe set (and optionally from a grocery list if you store it in the same doc set or a linked one). The LLM suggests a concrete recipe and steps. Because the knowledge base is local and fixed, answers stay on-topic and private; you are not sending grocery or eating habits to the cloud.
Use one tag per shelf or product family. For example {"doc_set": "warehouse-zone-A3", "role": "logistics"}. Staff scan the tag and ask “Where is item SKU-789?” or “Restocking procedure for this zone”. The backend retrieves zone-specific procedures, layout notes, or inventory hints and the LLM answers in one place. At scale: 1k tags, each linked to a 50-document set (e.g. 50 chunks per zone), with a total DB size of a few hundred MB—easily hosted on a single server with FAISS and a 7B model.
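The sizing claim is easy to sanity-check; the ~2 KB average chunk size is an assumption for the estimate, the other figures come from the scenarios above:

```python
# Back-of-envelope sizing for the warehouse scenario.
TAGS = 1_000             # one tag per shelf / product family
CHUNKS_PER_TAG = 50      # chunks per doc_set
DIM = 384                # all-MiniLM-L6-v2 embedding width
BYTES_PER_DIM = 4        # float32
AVG_CHUNK_BYTES = 2_000  # assumption: ~2 KB of text per chunk

embeddings_mb = TAGS * CHUNKS_PER_TAG * DIM * BYTES_PER_DIM / 1e6
text_mb = TAGS * CHUNKS_PER_TAG * AVG_CHUNK_BYTES / 1e6
print(f"embeddings ~{embeddings_mb:.0f} MB, text ~{text_mb:.0f} MB")
```

Roughly 77 MB of embeddings plus 100 MB of text: comfortably "a few hundred MB" even with indexes and metadata on top, well within a single server's capacity.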
In every case, the NFC tag is a contextual pointer: it tells the system which knowledge partition to use and how to shape the prompt (role, language, style). The actual retrieval and generation stay local, so latency stays low and data never leaves the premises.
| Aspect | NFC-RAG | Full Retraining | Cloud RAG |
|---|---|---|---|
| Privacy | Local-only | Local | Cloud exposure |
| Cost | ~$0.50/tag + free LLM | High compute | API fees |
| Latency | <5 s at edge | N/A | 200 ms+ network |
| Scalability | 1000s of tags, partitioned | Rigid | Vendor-locked |
| Update | Rewrite DB, not tags | Full retrain | Live sync |
NFC-RAG fits hybrid setups: a base LLM for fluency plus tag-triggered precision.
NFC-RAG shows that physical context (what you tap) can drive retrieval and prompt shaping for local LLMs without cloud or heavy retraining. Small NFC tags become cheap, writable “context switches” for RAG, suitable for maintenance, education, home automation, and inventory. With standard tags (NTAG216), existing embedding models, and tools like FAISS and Ollama, you can deploy offline, low-latency, privacy-preserving RAG at the edge and scale by adding more tags and partitioned indexes.
For questions about NFC-RAG or to discuss this project, please send an email.