Building Smarter AI with RAG: A Dev-Friendly Intro (Part 2: Let’s Build One)


Teqani Blogs

Writer at Teqani

April 30, 2025 · 5 min read

In Part 1, we explored what Retrieval-Augmented Generation (RAG) is and why it’s a game-changer for LLMs. Now, it’s time to get our hands dirty and build something cool: A simple news summarization app powered by RAG! This post guides you through building a functional RAG demo using Langchain, Ollama, FAISS, and Streamlit, using real-world data from The Kathmandu Post.

Environment Setup and Library Installation

Before diving into the code, let’s get your environment ready. This project uses a local LLM (Llama2), so we’ll need to set up a Python environment, install dependencies, and run Ollama on your machine.

Why Ollama with Llama2?

  • Free and accessible: No need to burn your OpenAI credits. This setup is completely free and runs locally.
  • Lightweight and practical: We’re using Llama2 7B, which is about 3.8GB in size. It runs well even on mid- to low-end machines.
  • Great for testing RAG: Llama2 was last trained in 2023, so it’s slightly outdated. This makes it perfect for experimenting with RAG (Retrieval-Augmented Generation), since we can feed it recent context and see how it handles it.

All in all, it’s a beginner-friendly, dev-friendly choice that keeps things simple and cost-effective.

  1. Create a Virtual Environment (recommended)

Let’s avoid polluting your system with tons of Python libraries. Make sure you have conda (via Anaconda or Miniconda) installed on your system. In your terminal:

conda create -p ./venv python=3.10

conda activate ./venv

  2. Create a requirements.txt file

Inside your project folder, create a file called requirements.txt and add the following libraries:

langchain
langchain-community
ipykernel
python-dotenv
streamlit
langchain-core
beautifulsoup4
chromadb
faiss-cpu

  3. Install all dependencies

Now, in the same terminal, run:

pip install -r requirements.txt

Make sure you have Ollama and Llama2 installed on your machine. If not, here’s a quick installation guide:

Download Ollama from the official website.

Start the Ollama service: ollama serve

  4. Pull the Llama2 Model:

ollama pull llama2
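To quickly confirm that Ollama and the model are working, you can optionally list your local models and run a one-off prompt:

ollama list
ollama run llama2 "Summarize what RAG is in one sentence"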

Now that the environment has been set up, let’s get started with creating the actual application.

Building the News Summarizer RAG App

Step 1: Load the data (from the web)

We’ll start by pulling in content from a real news site using Langchain’s built-in web loader. This loader is essentially a web scraper: it fetches a page and extracts its text content (using BeautifulSoup under the hood).

from langchain_community.document_loaders import WebBaseLoader

URL = "https://kathmandupost.com/"
loader = WebBaseLoader(URL)
docs = loader.load()

We’re fetching the HTML content and parsing it into text. This gives us raw content that we’ll later chunk, embed, and retrieve from. Langchain also provides loaders for other sources such as PDFs, JSON, XML, arXiv, and more. You can find the list of all the document loaders here.
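For example, swapping in a different source usually just means swapping the loader. Here is a minimal sketch for a local PDF (the file name is hypothetical, and this loader requires the pypdf package):

from langchain_community.document_loaders import PyPDFLoader

# Hypothetical file path; point this at any PDF you have locally.
loader = PyPDFLoader("news_archive.pdf")
docs = loader.load()  # returns one Document per page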

Step 2: Split the text into smaller chunks

Large Language Models don’t work well with massive blocks of text. That’s why we need to split the content into smaller, overlapping chunks. (Note that chunks are not the same as tokens: tokens are the tiny sub-word units a model reads, while chunks are the larger passages we embed and retrieve.)

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = text_splitter.split_documents(docs)

We’re dividing the raw text into 500-character chunks, each overlapping the previous one by 100 characters so that context isn’t lost at chunk boundaries. This helps improve answer quality later on.
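As a quick sanity check (purely optional), you can print how many chunks were produced and preview one of them:

print(f"Total chunks: {len(chunks)}")
print(chunks[0].page_content[:200])  # first 200 characters of the first chunk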

You can find more about text preprocessing in this article.

Step 3: Generate vector embeddings & store them in FAISS

We now turn each chunk of text into a vector, a numerical representation of the content that lets us find “similar” chunks later using FAISS.

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = OllamaEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)

Think of this as making our text searchable, but semantically rather than by exact keywords. So if someone asks about “Nepal’s current politics”, we don’t need the exact words; the embeddings help us find related content based on meaning.
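You can already try a raw semantic search against the vector store at this point; the query below is just an example:

# Return the 3 chunks whose embeddings are closest to the query
results = vectorstore.similarity_search("Nepal's current politics", k=3)
for doc in results:
    print(doc.page_content[:150], "\n---")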

Step 4: Create a retriever

Now that the data is embedded and stored in FAISS, we turn it into a retriever.

retriever = vectorstore.as_retriever()

When a user asks a question, this retriever will look through all our chunks and return the most relevant ones based on similarity.
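By default the retriever returns a handful of top matches. If you want finer control over how many chunks get passed to the LLM, you can optionally configure it, for example:

# Optional: retrieve only the 4 most similar chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Quick standalone test of retrieval
relevant_docs = retriever.invoke("Nepal's current politics")
print(len(relevant_docs))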

Step 5: Set up the prompt and LLM

Let’s now define the prompt that our LLM (Llama2) will use. This prompt gives the model instructions and the context it needs to generate useful responses.

from langchain_core.prompts import ChatPromptTemplate
from langchain_community.llms.ollama import Ollama

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Please respond to the question using the provided context."),
    ("user", "Question: {question}\n\nContext: {context}"),
])

llm = Ollama(model="llama2")

Here, we define roles for the LLM. The system message instructs it to act as a helpful assistant and answer whatever the user asks. We also set up a template for how the user prompt will be constructed: it contains a question and a context, where the context comes from the documents we scraped earlier from The Kathmandu Post website.
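If you’re curious about what the model actually receives, you can render the template with placeholder values (the strings below are just dummies):

messages = prompt.format_messages(
    question="What is happening in Nepal?",
    context="(the retrieved article chunks would go here)",
)
for message in messages:
    print(message.type, ":", message.content)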

Step 6: Chain everything together with Langchain

Now we glue everything together using Langchain’s RunnableMap and chain APIs.

from operator import itemgetter

from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.output_parsers import StrOutputParser
from langchain.schema.runnable import RunnableMap

output_parser = StrOutputParser()
document_chain = create_stuff_documents_chain(llm=llm, prompt=prompt)

# The retriever expects a plain query string, so we extract the "question"
# value from the input dict before handing it to the retriever.
rag_chain = (
    RunnableMap({
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
    })
    | document_chain
    | output_parser
)

How does this chain work?

  • The retriever pulls in relevant content chunks
  • The LLM uses those chunks + your question to generate a response
  • The parser turns that response into clean text

This is your RAG pipeline in action.
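Before adding any UI, you can sanity-check the chain from a script or notebook; the question below is just an example:

# The first call may be slow while Ollama loads the model into memory
answer = rag_chain.invoke({"question": "What are today's top stories in Nepal?"})
print(answer)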

Step 7: Add a simple UI with Streamlit

Let’s give users a way to interact with the system using a clean web interface.

import streamlit as st

st.title("AI News Summarizer")
input_text = st.text_input("Find out what's going on in Nepal")

if input_text:
    result = rag_chain.invoke({"question": input_text})
    st.write(result)

Streamlit is quick to set up, lets you test ideas fast, and doesn’t require any frontend skills, which makes it perfect for prototyping AI apps.
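Assuming you’ve saved all of the code above in a single file (say, app.py), start the app with:

streamlit run app.py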

That’s it! 🎉 You’ve just built a working RAG app that pulls content from a live website, chunks it up, makes it searchable, and generates responses using Llama2.

This is just the tip of the iceberg, though. There’s a lot more you can do from here: improve the UI, add sources for multi-language content, integrate document uploaders, or explore more powerful LLMs if your hardware allows.

Thanks for following along! If you found this helpful or built something cool with it, feel free to share it or reach out, I’d love to see what you create. Stay curious 🚀
