## Why Build Your Own ChatGPT?
ChatGPT is amazing, but there are great reasons to run your own AI assistant:
- Privacy: Your conversations never leave your server
- Cost: under a cent per conversation (see the cost analysis below) vs $0.03+ on OpenAI
- Customization: Fine-tune for your specific use case
- No rate limits: Scale as much as you need
- Offline capability: Works without internet (on-premise)
In this tutorial, we'll build a full ChatGPT-style interface using Llama 3.1 70B and deploy it on GPUBrazil. Total cost: about $1.60/hour to run on a single A100.
## What We're Building
Our ChatGPT clone will have:
- Beautiful chat interface (Gradio)
- Streaming responses (just like ChatGPT)
- Conversation memory
- System prompt customization
- One-click conversation reset
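A note on "conversation memory": the model itself is stateless, so memory here just means resending the whole chat history with every request, and long chats will eventually overflow the context window (8,192 tokens in this setup). A minimal trimming sketch, using a hypothetical helper and a crude character budget (~4 characters per token) rather than real token counting:

```python
def trim_history(history, max_chars=24000):
    """Keep only the most recent turns so the prompt fits the context window.
    Crude character budget; a real version would count tokens with the tokenizer."""
    kept, total = [], 0
    for user_msg, assistant_msg in reversed(history):
        total += len(user_msg) + len(assistant_msg)
        if total > max_chars:
            break
        kept.append((user_msg, assistant_msg))
    return list(reversed(kept))

# Six turns; with a 250-character budget only the two most recent survive
h = [("a" * 100, "b" * 100)] * 5 + [("hi", "hello")]
print(len(trim_history(h, max_chars=250)))  # 2
```

Calling `trim_history(history)` before `format_conversation(...)` is enough to keep long sessions from crashing generation.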
## Step 1: Launch a GPU Instance

Llama 3.1 70B needs roughly 140 GB of VRAM for its weights at bf16 precision, so plan on two 80 GB GPUs, or a single 80 GB GPU running a quantized (e.g. FP8) variant. Options:
- 1x H100 80GB ($2.80/hr) - Best performance
- 1x A100 80GB ($1.60/hr) - Great balance
- 2x L40S 48GB ($1.80/hr) - Budget option
Go to the GPUBrazil Console and deploy an A100 instance with Ubuntu 22.04.
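The VRAM requirement is easy to ballpark: weights alone take parameters times bytes per parameter, before KV-cache and activation overhead. A quick sketch of the arithmetic:

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rough VRAM for model weights alone: 1B params at N bytes each ~= N GB."""
    return params_billion * bytes_per_param

print(round(weight_gb(70, 2)))  # bf16 70B: ~140 GB -> needs two 80GB GPUs
print(round(weight_gb(70, 1)))  # fp8 70B:  ~70 GB -> fits a single 80GB GPU
```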
## Step 2: Install Dependencies
```bash
# SSH into your instance
ssh root@YOUR_INSTANCE_IP

# Create a project directory
mkdir chatgpt-clone && cd chatgpt-clone

# Create a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install packages
pip install torch transformers accelerate gradio vllm huggingface_hub
```
## Step 3: Create the Chat Application

Create a file called `app.py`:
```python
import gradio as gr
from vllm import LLM, SamplingParams
from typing import List, Tuple

# Initialize the model
print("Loading Llama 3.1 70B... (this takes 2-3 minutes)")
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=1,  # use 2 for a dual-GPU setup (bf16 70B weights need ~140 GB)
    gpu_memory_utilization=0.95,
    max_model_len=8192,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=2048,
    stop=["<|eot_id|>", "<|end_of_text|>"],
)

def format_message(role: str, content: str) -> str:
    """Format a single message in Llama 3 format."""
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

def format_conversation(history: List[Tuple[str, str]], system_prompt: str) -> str:
    """Format the full conversation for Llama 3."""
    messages = [format_message("system", system_prompt)]
    for user_msg, assistant_msg in history:
        messages.append(format_message("user", user_msg))
        if assistant_msg:
            messages.append(format_message("assistant", assistant_msg))
    # Add the assistant header for generation
    formatted = "<|begin_of_text|>" + "".join(messages)
    formatted += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return formatted

def chat(message: str, history: List[Tuple[str, str]], system_prompt: str):
    """Generate a response to the user's message."""
    # Add the user message to history
    history.append((message, ""))
    # Format the conversation
    prompt = format_conversation(history, system_prompt)
    # Generate a response
    outputs = llm.generate([prompt], sampling_params)
    response = outputs[0].outputs[0].text.strip()
    # Update history with the response
    history[-1] = (message, response)
    return history, history, ""

def clear_chat():
    """Clear the conversation history."""
    return [], [], ""

# Create the Gradio interface
with gr.Blocks(
    title="ChatGPT Clone - Powered by Llama 3.1",
    theme=gr.themes.Soft(primary_hue="green"),
    css="""
    .chatbot {height: 500px !important;}
    footer {display: none !important;}
    """,
) as demo:
    gr.Markdown("""
    # 🦙 Private ChatGPT Clone
    ### Powered by Llama 3.1 70B on GPUBrazil
    Your conversations are 100% private. Nothing is logged or sent to third parties.
    """)

    with gr.Row():
        with gr.Column(scale=4):
            chatbot = gr.Chatbot(
                label="Chat",
                elem_classes="chatbot",
                avatar_images=None,  # or a (user_image_path, bot_image_path) tuple
            )
            with gr.Row():
                msg = gr.Textbox(
                    placeholder="Type your message here...",
                    label="Message",
                    scale=9,
                    container=False,
                )
                send_btn = gr.Button("Send", scale=1, variant="primary")
            clear_btn = gr.Button("🗑️ Clear Conversation")
        with gr.Column(scale=1):
            system_prompt = gr.Textbox(
                value="You are a helpful, harmless, and honest AI assistant. You provide accurate, detailed responses while being respectful and safe.",
                label="System Prompt",
                lines=6,
            )
            gr.Markdown("""
            **Tips:**
            - Customize the system prompt for different personas
            - Try: "You are a Python expert"
            - Or: "You are a creative writing assistant"
            """)

    # State for conversation history
    state = gr.State([])

    # Event handlers
    msg.submit(chat, [msg, state, system_prompt], [chatbot, state, msg])
    send_btn.click(chat, [msg, state, system_prompt], [chatbot, state, msg])
    clear_btn.click(clear_chat, [], [chatbot, state, msg])

# Launch the app
if __name__ == "__main__":
    demo.launch(
        server_name="0.0.0.0",
        server_port=7860,
        share=False,  # set True for a public link
    )
```
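Before spending GPU time, the prompt template can be sanity-checked on its own, since the formatting helpers are pure Python. A standalone sketch (the two functions are re-declared here so the snippet runs without `app.py` or the model):

```python
from typing import List, Tuple

def format_message(role: str, content: str) -> str:
    # Same Llama 3 message template as in app.py
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

def format_conversation(history: List[Tuple[str, str]], system_prompt: str) -> str:
    # Same conversation wrapper as in app.py
    messages = [format_message("system", system_prompt)]
    for user_msg, assistant_msg in history:
        messages.append(format_message("user", user_msg))
        if assistant_msg:
            messages.append(format_message("assistant", assistant_msg))
    return ("<|begin_of_text|>" + "".join(messages)
            + "<|start_header_id|>assistant<|end_header_id|>\n\n")

p = format_conversation([("Hello!", "")], "You are terse.")
print(p.count("<|eot_id|>"))  # 2: one for the system turn, one for the user turn
```

The prompt should start with `<|begin_of_text|>` and end with an open assistant header; if either is missing, the model will generate garbage.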
## Step 4: Run Your ChatGPT Clone
```bash
# Make sure you're logged into Hugging Face
huggingface-cli login

# Run the app
python app.py
```
The model will take 2-3 minutes to load, then you'll see:

```
Running on local URL: http://0.0.0.0:7860
```

Open http://YOUR_INSTANCE_IP:7860 in your browser!
> 💡 **Pro Tip: Make it Public**
> Set `share=True` in `demo.launch()` to get a public Gradio URL you can share with anyone, no port forwarding needed!
## Step 5: Add Streaming Responses

For a true ChatGPT experience, stream tokens as they are generated. vLLM's offline `LLM` class only returns completed outputs, so streaming needs the async engine instead. A sketch (note that vLLM yields the cumulative text generated so far, not per-token deltas):

```python
import uuid

from vllm import AsyncEngineArgs, AsyncLLMEngine

# Async engine for streaming (replaces the LLM(...) call above)
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="meta-llama/Meta-Llama-3.1-70B-Instruct")
)

async def chat_stream(message: str, history: List[Tuple[str, str]], system_prompt: str):
    history.append((message, ""))
    prompt = format_conversation(history, system_prompt)
    # Each yielded RequestOutput carries the full text generated so far
    async for output in engine.generate(prompt, sampling_params, str(uuid.uuid4())):
        history[-1] = (message, output.outputs[0].text)
        yield history, history
```

Gradio treats generator (and async generator) handlers as streaming updates, so wiring `chat_stream` into `msg.submit(...)` in place of `chat` is enough to get the typewriter effect.
## Cost Analysis
Running your own ChatGPT clone on GPUBrazil:
- A100 80GB: $1.60/hour
- Average conversation: ~500 tokens = ~10 seconds
- Conversations per hour: ~360
- Cost per conversation: $0.004 (~$4 per 1000 chats)
Compare to OpenAI GPT-4: $30 per 1000 conversations. That's 7.5x savings!
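The arithmetic above can be reproduced in a few lines (the 10-second figure assumes roughly 50 tokens/sec, which is an estimate, not a benchmark):

```python
# Back-of-envelope cost check for the numbers above
gpu_cost_per_hour = 1.60          # A100 80GB hourly rate
seconds_per_conversation = 10     # ~500 tokens at ~50 tokens/sec (assumption)
conversations_per_hour = 3600 // seconds_per_conversation
cost_per_1000 = gpu_cost_per_hour / conversations_per_hour * 1000

print(conversations_per_hour)     # 360
print(round(cost_per_1000, 2))    # 4.44 -> ~$4 per 1000 chats
```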
## Deploy Your ChatGPT Clone Today
Get an A100 GPU for $1.60/hour and build your private AI assistant.
Get $5 Free Credit →

## Going Further
### Add User Authentication
```python
demo.launch(
    auth=("admin", "your_password"),
    auth_message="Welcome to Private ChatGPT"
)
```
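Gradio's `auth` parameter also accepts a callable taking `(username, password)` and returning a bool, which is handy once you have more than one user. A sketch with a hypothetical hard-coded allowlist (store password hashes, not plaintext, in anything real):

```python
# Hypothetical user allowlist; illustration only
USERS = {"admin": "s3cret", "maria": "senha123"}

def check_login(username: str, password: str) -> bool:
    """Return True if the username/password pair is valid."""
    return USERS.get(username) == password

# demo.launch(auth=check_login)  # pass the function instead of a tuple
print(check_login("admin", "s3cret"))  # True
```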
### Use a Smaller Model for Cost Savings
Llama 3.1 8B runs on an L40S ($0.90/hour) and is still very capable:
```python
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
```
### Fine-Tune for Your Use Case
Check out our Fine-Tuning Guide to customize the model for your specific domain.
## Conclusion
You now have your own private ChatGPT running on cloud GPUs! The benefits:
- ✅ Complete privacy - your data stays yours
- ✅ 7.5x cheaper than OpenAI
- ✅ Fully customizable
- ✅ No rate limits
Get started with GPUBrazil and build your AI-powered applications today!