Augmenting Local LLMs with NFC Tags as Contextual Pointers
NFC-RAG is an embedded system that uses standard NFC tags to add Retrieval-Augmented Generation (RAG) to a local Large Language Model. By storing compact identifiers and metadata on low-memory NFC tags (e.g. 888 bytes on NTAG216), each tag acts as a contextual pointer into an external knowledge base. The result is personalized, privacy-preserving AI without retraining the model.
The system is designed for edge deployment on consumer hardware—for example an NVIDIA RTX 3090 running Ollama or llama.cpp—and fits scenarios that need offline operation and strict data locality.
Classic RAG retrieves relevant documents from large vector stores using embeddings, which needs substantial storage and compute—often not feasible in very constrained environments. NFC tags offer a physical API: a single tap provides precise context hints that drive retrieval from a local knowledge store.
Typical use case: industrial maintenance. A technician scans an NFC tag on a machine; the system sends the tag payload to a local LLM, which retrieves only the relevant manual sections and answers troubleshooting questions—no cloud required.
Example: A tag on a robotic arm stores {"id": "robot-arm-XYZ", "role": "maintenance"}. Tapping it with a smartphone sends this to a local Qwen2.5-7B model, which retrieves arm-specific diagrams and produces step-by-step fixes. This scales to thousands of assets with minimal per-tag data.
NFC tag capacity is limited:
Payloads must stay small: UUIDs (16 bytes), flags (1–4 bytes), short strings (e.g. up to ~200 chars). A single 384-dimensional embedding at 1 byte per dimension would already use 384 bytes—at most two such vectors fit on an NTAG216, so on-tag vector search is not practical. Instead, tags act as routing keys to a backend index.
Heavy work is offloaded: tags hold <1 KB, while the RAG engine (e.g. FAISS or LanceDB) runs on-device with embeddings built offline. A base LLM like Gemma-2-9B (Q4_K_M, ~6 GB VRAM) can process retrieved chunks in under 2 seconds on an RTX 3090. Limitations include tags that may be locked read-only (writable NTAG21x tags avoid this) and a short read range (~5 cm), both of which suit asset-tagging use cases well.
Constraint example: For 10 documents with 512-dimensional embeddings (about 5 KB even at 1 byte per dimension), an NTAG216 cannot store them—but a 32-byte ID can index a 10 GB local database partitioned by asset.
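These size constraints are easy to sanity-check in a few lines of Python; the payload mirrors the earlier robot-arm example, and NDEF framing overhead is ignored for simplicity:

```python
import json

NTAG216_USER_BYTES = 888  # NTAG216 user memory

# A routing-key payload like the earlier robot-arm example:
payload = json.dumps(
    {"id": "robot-arm-XYZ", "role": "maintenance"},
    separators=(",", ":"),
).encode("utf-8")

ONE_EMBEDDING = 384  # bytes: one 384-dim vector at 1 byte per dimension

assert len(payload) < NTAG216_USER_BYTES          # routing key fits easily
assert NTAG216_USER_BYTES // ONE_EMBEDDING == 2   # barely two vectors fit
```

The routing key uses a few dozen bytes, while even heavily quantized embeddings exhaust the tag after two vectors, which is why the tag stores pointers rather than vectors.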
Three layers, all runnable locally:
- Tag layer: the reader (a phone or dedicated device) parses the tag payload (e.g. {"doc_set": "manual-robot-XYZ", "lang": "en", "style": "step-by-step"}) and forwards it to the middleware via HTTP or gRPC.
- Middleware: maps doc_set to a DB partition, embeds the user query, retrieves the top-3 chunks (~1 KB total), and injects them into the LLM prompt.
- LLM layer: the local model receives a templated prompt:

```
<system>Role: {tag.role}. Respond in {tag.lang}, style: {tag.style}.</system>
<context>{retrieved_chunks}</context>
<user>{query}</user>
```
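The templated prompt can be assembled with a small helper; the function name build_prompt and the default style value "concise" are illustrative, not part of the original design:

```python
def build_prompt(tag: dict, retrieved_chunks: list[str], query: str) -> str:
    """Fill the three-part template: system instruction, context, user query."""
    system = (f"<system>Role: {tag.get('role', 'expert')}. "
              f"Respond in {tag.get('lang', 'en')}, "
              f"style: {tag.get('style', 'concise')}.</system>")
    context = f"<context>{' '.join(retrieved_chunks)}</context>"
    user = f"<user>{query}</user>"
    return "\n".join([system, context, user])

tag = {"doc_set": "manual-robot-XYZ", "lang": "en", "style": "step-by-step"}
print(build_prompt(tag, ["Section 4.2: vibration checks."], "fix vibration"))
```

Keeping the template in one function means the middleware can change prompt shape without touching retrieval code.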
End-to-end latency is under 5 seconds on mid-range hardware.
A minimal implementation, assuming a running Ollama server with a quantized model pulled (e.g. ollama run qwen2.5-coder:7b-q4_K_M; the example below calls qwen2.5:7b):

```python
import faiss
import ollama
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

class TagPayload(BaseModel):
    doc_set: str
    role: str = "expert"
    lang: str = "en"

class NFC_RAG:
    def __init__(self):
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.index = faiss.read_index('knowledge.faiss')  # prebuilt per doc_set
        self.docstore = {}  # {id: text}, loaded from JSON/DB

    def process(self, tag: TagPayload, query: str) -> str:
        query_emb = self.encoder.encode([query])        # (1, 384) float32
        scores, idxs = self.index.search(query_emb, 3)  # top-3 neighbors
        chunks = [self.docstore[i] for i in idxs[0]]
        prompt = (f"Role: {tag.role}. Lang: {tag.lang}.\n"
                  f"Context: {' '.join(chunks)}\nQ: {query}")
        resp = ollama.generate(model='qwen2.5:7b', prompt=prompt)
        return resp['response']

# Usage:
# rag = NFC_RAG()
# print(rag.process(TagPayload(doc_set="robot-XYZ"), "fix vibration"))
```
TagPayload (Pydantic model) — Represents the data read from the NFC tag. doc_set is required and identifies which knowledge partition to use (e.g. "robot-XYZ" for a specific machine manual). role and lang default to "expert" and "en"; they are injected into the system prompt so the LLM answers in the right tone and language.
NFC_RAG.__init__ — Loads the embedding model (all-MiniLM-L6-v2, 384 dimensions, runs on CPU), reads the prebuilt FAISS index from disk (knowledge.faiss), and prepares an in-memory docstore (id → text) that you populate from your JSON or database. In production you would load docstore from the same partition as the index (e.g. keyed by doc_set).
NFC_RAG.process(tag, query) — Runs the full RAG pipeline:
- encoder.encode([query]) turns the user question (e.g. “fix vibration”) into a 384-dimensional vector.
- index.search(query_emb, 3) finds the 3 nearest document chunks in the FAISS index; idxs[0] holds their indices.
- chunks = [self.docstore[i] for i in idxs[0]] retrieves the actual text for those indices from the docstore.
- The prompt string combines role and lang with the retrieved context and the user query, so the LLM sees a clear system instruction plus the relevant manual excerpts.
- ollama.generate(...) sends the prompt to the local Ollama server (e.g. Qwen2.5 7B) and returns the model’s reply (e.g. step-by-step troubleshooting).

Prebuild the index by embedding all documents for each doc_set, saving the FAISS index and the id→text mapping; at runtime, load the partition that matches the tag’s doc_set.
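That prebuild-and-partition flow can be sketched end to end. The version below is deliberately dependency-light: a brute-force NumPy inner-product search stands in for faiss.IndexFlatIP, and a toy hashing embed() stands in for all-MiniLM-L6-v2, so the flow (chunk → embed → save partition → search) runs anywhere; swap both stand-ins for the real libraries in production:

```python
import json
import numpy as np

DIM = 384  # matches all-MiniLM-L6-v2's output width

def embed(text: str) -> np.ndarray:
    # Toy stand-in for SentenceTransformer('all-MiniLM-L6-v2').encode:
    # hash each word into a bucket, then L2-normalize.
    vec = np.zeros(DIM, dtype=np.float32)
    for word in text.lower().split():
        vec[hash(word) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def build_partition(doc_set: str, chunks: list[str]) -> None:
    # Offline prebuild: one "index" (embedding matrix) + docstore per doc_set.
    matrix = np.stack([embed(c) for c in chunks])
    np.save(f"knowledge_{doc_set}.npy", matrix)
    with open(f"docstore_{doc_set}.json", "w") as f:
        json.dump(dict(enumerate(chunks)), f)

def search(doc_set: str, query: str, k: int = 3) -> list[str]:
    # Runtime: load the partition named by the tag, score by inner product
    # (what faiss.IndexFlatIP does), return the top-k chunk texts.
    matrix = np.load(f"knowledge_{doc_set}.npy")
    with open(f"docstore_{doc_set}.json") as f:
        docstore = json.load(f)
    scores = matrix @ embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [docstore[str(i)] for i in top]
```

Usage: call build_partition("pump-model-A", chunks) once offline per doc_set, then search("pump-model-A", "Why is pressure low?") at tap time.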
Write payloads via NFC Tools app or nfcpy. Example JSON:
{"doc_set": "robot-XYZ", "role": "maintenance", "lang": "it", "v": 1}
For maximum density use TLV (Type-Length-Value) encoding: e.g. type=0x01, length=16, value=a 16-byte UUID.
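A minimal sketch of that TLV scheme (the record layout, one type byte, one length byte, then the value, follows the description above; the helper names are illustrative):

```python
import struct
import uuid

def tlv_encode(type_id: int, value: bytes) -> bytes:
    # One type byte, one length byte (so value must be <= 255 bytes), value.
    return struct.pack("BB", type_id, len(value)) + value

def tlv_decode(buf: bytes) -> list[tuple[int, bytes]]:
    records, i = [], 0
    while i < len(buf):
        type_id, length = buf[i], buf[i + 1]
        records.append((type_id, buf[i + 2 : i + 2 + length]))
        i += 2 + length
    return records

asset_id = uuid.uuid4().bytes       # 16-byte binary UUID
blob = tlv_encode(0x01, asset_id)   # type=0x01, length=16, value=UUID
assert len(blob) == 18              # 2 bytes overhead vs ~38 for a text UUID
assert tlv_decode(blob) == [(0x01, asset_id)]
```

A binary UUID in TLV takes 18 bytes, against 40+ for the same identifier as a JSON string, which matters on the smallest tags.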
Follow these steps to deploy NFC-RAG from scratch.
1. Organize documents per asset: map each doc_set (e.g. doc_set = "pump-model-A") to a list of document paths or chunks.
2. Embed the chunks with a sentence-transformer model (e.g. all-MiniLM-L6-v2). Encode all chunks for a given doc_set, add them to a FAISS index (faiss.IndexFlatIP, or IndexHNSWFlat for larger sets), and save it with faiss.write_index(index, "knowledge_<doc_set>.faiss"). Build a docstore (id → text) and persist it (e.g. JSON or SQLite) keyed by doc_set.
3. Repeat for every doc_set you plan to use. You can have one index per asset or one partitioned index; the middleware must know which file/partition to load for each doc_set.
4. Write the tags with the NFC Tools app on a phone or nfcpy on a PC with a reader. Each payload must include doc_set and optionally role, lang, style. Example: {"doc_set": "robot-XYZ", "role": "maintenance", "lang": "it", "v": 1}. Ensure the string fits within the tag’s user memory (and leave room for NDEF overhead if you use it).
5. Install the middleware dependencies: sentence-transformers, faiss-cpu (or faiss-gpu), ollama (or your LLM client), pydantic, and a web framework (e.g. FastAPI).
6. Build an API endpoint that accepts the tag payload and the user query, loads the index for the tag’s doc_set (or uses a single global index with a partitioned docstore), calls the NFC_RAG.process(tag, query) logic, and returns the LLM response.
7. Run the service (e.g. uvicorn rag:app --host 0.0.0.0 --port 8000). Ensure Ollama (or llama.cpp) is running and the chosen model is pulled (e.g. ollama run qwen2.5:7b).
8. Test end to end: scan a tag, ask a question, and verify that the right doc_set was used and the top retrieved chunks are sensible. Optionally measure latency (scan → parse → retrieve → generate) to ensure it stays under your target (e.g. <5 s).

Attach an NTAG216 to a pump, conveyor, or robotic arm. The tag stores e.g. {"doc_set": "pump-model-A", "role": "maintenance", "lang": "en", "style": "step-by-step"}. When a technician taps the tag with a phone and asks “Why is pressure low?” or “How do I replace the seal?”, the app sends the tag payload and the question to the local NFC-RAG API. The middleware loads the FAISS index for pump-model-A, embeds the query, retrieves the top-3 chunks from the pump manual (e.g. troubleshooting section, parts list), and injects them into the LLM prompt. The model (e.g. Qwen2.5-7B) answers with concrete steps: “Check valve #3; per manual p.42 this is a common failure. If the seal is worn, order part XYZ and follow section 5.2 for replacement.” No cloud and no generic chatbot—only that machine’s documentation in context.
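The tap → HTTP → answer round trip can be mocked with the standard library before wiring in the real retrieval and a web framework; the answer() stub and the JSON body shape below are assumptions, not a fixed protocol:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

def answer(tag: dict, query: str) -> str:
    # Stub: the real middleware would load the doc_set partition, retrieve
    # chunks, and call the local LLM here.
    return f"[{tag.get('doc_set')}] answer to: {query}"

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        reply = answer(body["tag"], body["query"]).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):  # keep request logging quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

req = Request(
    f"http://127.0.0.1:{server.server_port}",
    data=json.dumps({"tag": {"doc_set": "pump-model-A"},
                     "query": "Why is pressure low?"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
reply = urlopen(req).read().decode("utf-8")
print(reply)  # [pump-model-A] answer to: Why is pressure low?
server.shutdown()
```

The phone app only ever POSTs {tag, query} and reads back text, so the same client works unchanged when the stub is replaced by the FastAPI service from the deployment steps.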
Scale: A factory can deploy hundreds of tags (one per asset or per asset type). Each tag points to a doc set of 20–50 chunks; the total knowledge base can be hundreds of MB, all on a single edge server with a single GPU.
Stick a tag on a textbook chapter or a printed exercise set. Payload example: {"doc_set": "calculus-ch3", "role": "tutor", "lang": "en", "style": "educational"}. A student taps the tag and asks “Explain integrals” or “Work through example 3.2”. The system retrieves the relevant theorems, definitions, and worked examples from the chapter’s index, and the LLM produces an explanation tailored to that material—avoiding drift into other chapters or generic web content. The same setup works for language learning (e.g. doc_set: "spanish-lesson-5"), safety training (e.g. doc_set: "forklift-safety"), or certification prep.
Put a tag on the fridge or a recipe binder. Example payload: {"doc_set": "recipes-vegan", "role": "assistant", "lang": "en"}. The user asks “Suggest dinner with what I have” or “Something quick with chickpeas”. The RAG layer can pull from a small, curated recipe set (and optionally from a grocery list if you store it in the same doc set or a linked one). The LLM suggests a concrete recipe and steps. Because the knowledge base is local and fixed, answers stay on-topic and private; you are not sending grocery or eating habits to the cloud.
Use one tag per shelf or product family. For example {"doc_set": "warehouse-zone-A3", "role": "logistics"}. Staff scan the tag and ask “Where is item SKU-789?” or “Restocking procedure for this zone”. The backend retrieves zone-specific procedures, layout notes, or inventory hints and the LLM answers in one place. At scale: 1k tags, each linked to a 50-document set (e.g. 50 chunks per zone), with a total DB size of a few hundred MB—easily hosted on a single server with FAISS and a 7B model.
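The sizing claim is easy to sanity-check; the ~2 KB average chunk size is an assumption for the estimate, the other figures come from the scenarios above:

```python
# Back-of-envelope sizing for the warehouse scenario.
TAGS = 1_000             # one tag per shelf / product family
CHUNKS_PER_TAG = 50      # chunks per doc_set
DIM = 384                # all-MiniLM-L6-v2 embedding width
BYTES_PER_DIM = 4        # float32
AVG_CHUNK_BYTES = 2_000  # assumption: ~2 KB of text per chunk

embeddings_mb = TAGS * CHUNKS_PER_TAG * DIM * BYTES_PER_DIM / 1e6
text_mb = TAGS * CHUNKS_PER_TAG * AVG_CHUNK_BYTES / 1e6
print(f"embeddings ~{embeddings_mb:.0f} MB, text ~{text_mb:.0f} MB")
```

Roughly 77 MB of embeddings plus 100 MB of text: comfortably "a few hundred MB" even with indexes and metadata on top, well within a single server's capacity.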
In every case, the NFC tag is a contextual pointer: it tells the system which knowledge partition to use and how to shape the prompt (role, language, style). The actual retrieval and generation stay local, so latency stays low and data never leaves the premises.
| Aspect | NFC-RAG | Full Retraining | Cloud RAG |
|---|---|---|---|
| Privacy | Local-only | Local | Cloud exposure |
| Cost | ~$0.50/tag + free LLM | High compute | API fees |
| Latency | <5 s at edge | N/A | 200 ms+ network |
| Scalability | 1000s of tags, partitioned | Rigid | Vendor-locked |
| Update | Rewrite DB, not tags | Full retrain | Live sync |
NFC-RAG fits hybrid setups: a base LLM for fluency plus tag-triggered precision.
NFC-RAG shows that physical context (what you tap) can drive retrieval and prompt shaping for local LLMs without cloud or heavy retraining. Small NFC tags become cheap, writable “context switches” for RAG, suitable for maintenance, education, home automation, and inventory. With standard tags (NTAG216), existing embedding models, and tools like FAISS and Ollama, you can deploy offline, low-latency, privacy-preserving RAG at the edge and scale by adding more tags and partitioned indexes.
For questions about NFC-RAG or to discuss this project, please send an email.